# A LYRICAL EVOLUTION: 

### An Investigation of the Cultural Lexicon & Historical Relevance of U.S. Popular Music from 1958 - Present

---

**An NLP based Capstone Project & Final Report Created By:**

Ben Smith, Chris Teceno, Jerry Nolf & Rachel Robbins-Mayhill
Codeup   |   Innis Cohort   |   June 2022  

<img src="dataset-cover.png">

## Project Goal
This project aimed to investigate patterns in song lyrics across decades using Time Series Analysis and Natural Language Processing techniques, including Topic Modeling, Sentiment Analysis, and Term Frequency. The data came from a Kaggle data set of the Billboard Top 100 Songs from 1958 to 2021, with lyrics pulled through web scraping with the Genius.com API. We believe the lyrics of popular songs can be used for historical analysis, using exploratory methods and hypothesis testing to identify changing societal trends in relationships, technology, sexuality, and vulgarity. Furthermore, we believe we can predict the decade in which a song first appeared on the Top 100 using features and machine learning methods.

## Project Description

Songs are powerful tokens: they can soothe, validate, ignite, confront, and educate us – among other things. Like time capsules, they are captured for eternity. The slang and language used are often indicative of the times, and you can probably recall exactly when a song was made based on what is mentioned. Arguably, music is a catalyst for societal and cultural evolution like no other art form. It has been causing controversy and societal upheaval for decades, and it seems with every generation there’s a new musical trend that has the older generations shaking their heads. 

For centuries, songs have been passed down through generations, being sung as oral histories. However, with advancements of the 20th century, technology has made the world of music a much smaller place and, thanks to cheap, widely-available audio equipment, songs are now distributed on a much larger scale, having a farther-reaching impact, and a more permanent place in history. 

This project aimed to combine the record of lyrical history and technological advancements to evaluate the changes in the cultural lexicon and societal evolution over the last 50+ years. Using machine learning and natural language processing methodologies we investigated the topics prevalent in songs of the past, predicted the decade in which they were written, and conducted historical analysis through exploration to identify changing societal trends in relationships, technology, sexuality, and vulgarity.

<img src='Billboard.png' width="350" height="350" align="left"/> To do this, we acquired a [Kaggle](https://www.kaggle.com/datasets/dhruvildave/billboard-the-hot-100-songs) data set of the Billboard Top 100 Songs from its inception in 1958 to present. We then utilized the [Genius.com](https://genius.com/) API and LyricGenius Library to conduct web scraping to pull the lyrics for the specified songs which became the corpus for this project. After acquiring and preparing the corpus, our team conducted natural language processing exploration utilizing methods such as topic modeling, word clouds, and bigrams. We employed multiclass classification methods to create multiple machine learning models. The end goal was to create an NLP model that accurately predicted the decade a song first appeared on the Billboard Top 100 chart, based on the words and word combinations found in the lyrics of the song.

We chose the Billboard Hot 100 song list as a focus because it is the music industry standard record chart in the United States for song popularity, published weekly by Billboard magazine. It provides a window into popular culture at a given time by ranking the songs that were trending on sales, airplay, and now streaming for that week in the United States. It is arguably the best historical record of the impact of specific popular songs over time.

## Initial Thoughts & Hypothesis

The initial hypothesis of this project was that we could use the top songs of each decade in conjunction with topic modeling to identify unique words or topics which could be used as features to accurately predict the decade a song was on the Billboard Top 100 using machine learning. The thought behind this was that popular songs have been the historians of a unique lexicon, specific to their place in time. We believe the lyrics of popular songs could be analyzed through machine learning to identify societal trends in relationships, technology, sexuality, and vulgarity.

## Initial Questions

The focus of this project is on identifying the decade a song first appeared on the Billboard Top 100. Below are some of the initial questions this project looked to answer throughout the Data Science Pipeline.
 
##### Data-Focused Questions
- What are the most frequently occurring words?
- What are the most frequently occurring bigrams (pairs of words) in each decade?
- What decade did the song first appear in the top 100?
- What topics are most unique to each decade?
- Is there a correlation between sentiment and decade?
- How do topics, such as violence, sexual explicitness, technology references, or relationship references, change over time?
- How does foreign language usage change over time?

## Key Findings

The key findings for this presentation are available in slide format by clicking on the [Final Slide Presentation](https://www.canva.com/design/DAFCXoeG7z0/jNCtQkQFqyOTWS5Ckg8Xuw/view?utm_content=DAFCXoeG7z0&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton).


TBD.......

==========================================================================================================================================================

## I. ACQUIRE
To acquire the data for this project, we utilized a [Kaggle](https://www.kaggle.com/datasets/dhruvildave/billboard-the-hot-100-songs) data set of the Billboard Top 100 Songs from its inception in 1958 to present. 

The dataset provided:
- date song was on the Billboard Top 100 
- rank of song  
- title
- artist name
- rank of song the previous week
- rank of song at its peak week
- number of weeks song was on the Top 100  

We selected only unique artists and songs, to ensure there were no duplicates, keeping only the earliest appearance on the chart to standardize the selections in the event of multiple appearances. Following song selection with the Kaggle dataset, we then obtained an API token to utilize the [Genius.com](https://genius.com/) API and [LyricGenius Library](https://pypi.org/project/lyricsgenius/) to conduct web scraping to pull the lyrics for the specified songs which became the corpus for this project.

The acquired data can be easily accessed via a [Google Drive .csv file](https://drive.google.com/file/d/1S0dJ7-5x8NIgt1LranE3UETgl_JvukGT/view). 

### Note about imports: 
Imports for this project are added in the sections in which they are required.

In [1]:
# import for acquisition
import os
import json
import requests
import draft_prepare as prep
import draft_explore as explore
import draft_model as model


# import for data manipulation
import pandas as pd
import numpy as np
from typing import Dict, List, Optional, Union, cast

# import to ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# acquire data from .csv saved and processed using functions found in acquire.py
df = pd.read_csv('songs_0526.csv')
df.head()

| Unnamed: 0.1 | Unnamed: 0 | title | artist | date | lyrics |
|---|---|---|---|---|---|
| 0 | 2 | #1 | Nelly | 2001-10-20 | #1 LyricsUh uh uh I just gotta bring it to the... |
| 1 | 4 | #9 Dream | John Lennon | 1974-12-21 | #9 Dream Lyrics[Verse 1] So long ago Was it in... |
| 2 | 5 | #Beautiful | Mariah Carey Featuring Miguel | 2013-05-25 | #Beautiful Lyrics[Intro: Mariah Carey] Ah, ah,... |
| 3 | 6 | #SELFIE | The Chainsmokers | 2014-03-15 | #SELFIE Lyrics[Verse 1] When Jason was at the ... |
| 4 | 7 | #thatPOWER | will.i.am Featuring Justin Bieber | 2013-04-06 | #thatPOWER Lyrics[Instrumental break] [Pre-Ch... |


In [3]:
# obtain number of columns and rows for original dataframe
df.shape

(23762, 5)

### The Original DataFrame Size: 
- 23,762 rows, or documents, and 5 columns.

---- Draft Info for Acquire ----


- We started with 300k+ rows, but the data contained a lot of duplicates.
- We grouped by artist and song title and took `date.min()` to remove duplicates and keep each song's earliest appearance in the Top 100.
- With this reduced data set of unique songs, we used lyricsgenius to pull lyrics from the Genius API.
- Some string cleaning was performed to retrieve more results, because title and song formatting differed between the Kaggle data set and genius.com.

The main challenges were duplicated songs, inconsistent naming, and the Genius API sometimes returning incorrect lyrics.

We were able to test for correct lyrics with a string comparison of the song title and the first section of the returned lyrics. All properly pulled lyrics contain the song title before the actual lyrics, so `song.title in song.lyrics` gave us an accurate test. With regex, Ben was able to be more specific, along the lines of `song.title == song.title_from_lyrics` (these variable names are approximate; the exact names are in his notebook).

To add: the initial acquire took 1000+ hours to run.
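
The de-duplication step in the notes above can be illustrated with a small pandas sketch. The toy rows and the column names (`date`, `title`, `artist`) are assumptions for illustration; this is not the project's acquire code.

```python
import pandas as pd

# Toy stand-in for the weekly chart data; column names are assumed.
charts = pd.DataFrame({
    "date": pd.to_datetime([
        "1974-12-21", "1974-12-28", "2001-10-20", "2001-10-27", "2001-11-03",
    ]),
    "title": ["#9 Dream", "#9 Dream", "#1", "#1", "#1"],
    "artist": ["John Lennon", "John Lennon", "Nelly", "Nelly", "Nelly"],
})

# One row per (artist, title), keeping the earliest chart date.
unique_songs = charts.groupby(["artist", "title"], as_index=False)["date"].min()
print(unique_songs)
```

Grouping on both artist and title keeps two different songs that happen to share a title as separate rows.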

==========================================================================================================================================================

## II. PREPARE

After data acquisition, the dataframe was analyzed and cleaned to support exploration and to resolve confusing variables. The preparation of this data can be replicated using the 'get_data' function saved in the prepare.py file inside the [Lyrical Evolution](https://github.com/CBRJ-Lyrical-Metrics/song-lyrics-capstone) repository on GitHub. The function takes in the original acquired dataframe and returns it with the changes noted below.

**Steps Taken to Clean & Prepare Data:**

- Cleaning: 
    - Make all text lowercase
    - Normalize, encode, and decode to remove accented text and special characters
    - Expand abbreviated contractions
    - Lemmatize words to acquire base words
    - Remove stopwords
    - Convert date to DateTime format
    - Remove song part identifiers ('lyrics', 'verse', 'chorus', 'hook', 'embed')
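
A minimal sketch of several of the cleaning steps listed above, using only the standard library. The contraction table and stopword set are tiny hypothetical stand-ins (the project imports NLTK for stopwords), and lemmatization is left out because it needs a downloaded language resource.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"can't": "cannot", "i'm": "i am", "it's": "it is", "don't": "do not"}
STOPWORDS = {"a", "an", "the", "i", "it", "is", "am", "to", "and", "do"}
SECTION_WORDS = {"lyrics", "verse", "chorus", "hook", "embed"}


def basic_clean(text: str) -> str:
    """Lowercase, strip accents, expand contractions, drop stopwords and section words."""
    text = text.lower()
    # Normalize, then encode/decode to drop accents and other non-ASCII characters.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8")
    # Expand contractions before apostrophes are stripped with the other punctuation.
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Keep only letters, digits, and whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    words = [w for w in text.split() if w not in STOPWORDS and w not in SECTION_WORDS]
    return " ".join(words)


print(basic_clean("[Chorus] I can't forget the Café, it's déjà vu"))
# -> cannot forget cafe deja vu
```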
    
---   
- Address missing values, data errors, unnecessary data, and unclear values:
    - The dataset had no null values, therefore, there was no need to drop any observations for this reason.
    - Data Errors: The API returned lyrics that were not the expected song's lyrics. This was addressed by:
        1. Manually checking some of the lyrics to identify the output pattern.
        2. Writing code to compare the title syntax to the lyric output pattern. If the two matched after cleaning, the lyrics were accepted as correct. If they did not match, further cleaning was conducted to correct the error, or the song was dropped.
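
The title check described above can be sketched as follows. It assumes, based on the sample rows shown in the Acquire section, that returned lyrics start with the song title followed by the word "Lyrics". The helper names are hypothetical and are not the functions in the project's prepare.py.

```python
import re
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def title_from_lyrics(lyrics: str) -> Optional[str]:
    """Return the text before the first 'Lyrics' marker, if there is one."""
    match = re.match(r"(.+?)\s*Lyrics", lyrics, flags=re.DOTALL)
    return match.group(1) if match else None


def lyrics_match_title(title: str, lyrics: str) -> bool:
    """True when the title embedded in the lyrics matches the expected title."""
    found = title_from_lyrics(lyrics)
    return found is not None and normalize(found) == normalize(title)


print(lyrics_match_title("#9 Dream", "#9 Dream Lyrics[Verse 1] So long ago Was it in"))
print(lyrics_match_title("#1", "Some Other Song Lyrics[Intro] ..."))
```

Comparing normalized strings tolerates differences in case and punctuation between the two sources. A title that itself contains the word "Lyrics" would need special handling.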
---    
- Create feature engineered columns:
    - Decade 
    - Chorus Count
    - Verse Count
    - Verse/Chorus Ratio
    - Word Count
    - Unique Words per Song
    - Unique Words per Decade
    - Bigrams
    - Trigrams
    
- Apply Natural Language Processing (NLP) Methods:
    - Topic Modeling
    - Sentiment Analysis
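
A few of the engineered columns listed above (decade, word count, unique words per song, bigrams) can be sketched with pandas. The toy lyrics and column names are assumptions, and the project's own functions may represent decade or bigrams differently.

```python
import pandas as pd

# Toy, already-cleaned lyrics; column names are assumed.
songs = pd.DataFrame({
    "date": pd.to_datetime(["1974-12-21", "2013-05-25"]),
    "lyrics": ["long ago dream long ago", "beautiful ride beautiful night"],
})

# Decade: floor the year to a multiple of ten, e.g. 1974 -> 1970.
songs["decade"] = songs["date"].dt.year // 10 * 10

tokens = songs["lyrics"].str.split()
songs["word_count"] = tokens.str.len()
songs["unique_words"] = tokens.apply(lambda words: len(set(words)))
# Bigrams: each word paired with the word that follows it.
songs["bigrams"] = tokens.apply(lambda words: list(zip(words, words[1:])))

print(songs[["decade", "word_count", "unique_words"]])
```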
    
---
- Split corpus into train, validate, and test samples 

**Note on Splitting Data:**
The data was not split prior to Exploration because the explored features were not utilized in modeling, so there was no concern about data leakage.

---

### Specialized Preparation Steps
After applying data pre-processing methods to conduct basic cleaning, two specialized preparation steps were taken specific to Natural Language Processing.

### Topic Modeling

The first specialized preparation step was Topic Modeling. Specifically, Latent Dirichlet Allocation (LDA) was used to group the documents in the corpus into topics, identifying the 20 most common topics. Those 20 topics were then manually reviewed for accuracy and theme and reduced to 17 once it was determined that a few of the topics overlapped. Of those 17 topics, the ones with similar themes were combined into two related groupings to be investigated further in exploration:
- **Vice Topics** - including the three separate topics of violence, sex, and money
- **Relationship Topics** -  including the five separate topics of affection, breakups, heartache, jealousy, sex

These labels were then added to the dataframe, identifying all documents that qualified as having the designated topic as a theme, or listing the theme as 'Other' if the Vice or Relationship topics were not applied. 


### Sentiment Analysis

Second, Sentiment Analysis was conducted during preparation as a specialized approach, specific to Natural Language Processing and used in conjunction with Time Series Analysis. A sentiment score was calculated for each song, and we first examined the change in average sentiment score over time by looking at a rolling 5-year average and the average by decade. We then divided the sentiment score into 5 categories _________ and looked at what portion of the total each category took up and how that portion changed over time.
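
The rolling average and the five-way split can be sketched with pandas. The scores below are made up and assumed to lie between -1 and 1. The bin edges and category labels are hypothetical placeholders, since the report does not name its categories.

```python
import pandas as pd

# Made-up sentiment scores, one per year, assumed to lie in [-1, 1].
scores = pd.DataFrame({
    "year": [1960, 1961, 1962, 1963, 1964, 1965],
    "sentiment": [0.8, 0.6, 0.7, 0.1, -0.3, -0.7],
})

# Average per year, then a rolling 5-year mean (the first 4 years are NaN).
yearly = scores.groupby("year")["sentiment"].mean()
rolling_5yr = yearly.rolling(window=5).mean()

# Five equal-width bins; the labels are hypothetical.
labels = ["very negative", "negative", "neutral", "positive", "very positive"]
scores["category"] = pd.cut(
    scores["sentiment"],
    bins=[-1, -0.6, -0.2, 0.2, 0.6, 1],
    labels=labels,
    include_lowest=True,
)

# Share of songs falling in each category.
share = scores["category"].value_counts(normalize=True)
print(rolling_5yr)
print(share)
```

Grouping `share` by decade instead of computing it over the whole frame would show how each category's portion changes over time.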

---

## Results of Data Preparation

In [4]:
# import for prepare
import draft_prepare
import unicodedata
import re

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

from time import strftime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from collections import Counter

In [None]:
# apply the data preparation observations and tasks to clean the data using the get_data and get_topics functions found in prepare.py
df = prep.get_data()
df = prep.get_topics(df)
# view first few rows of dataframe
df.head()

Features added ******************
fitting lda ***********

In [None]:
df.shape

In [None]:
df.info()

## Prepared DataFrame Size: 
- 23,762 rows, or documents, and 23 columns.

---

### PREPARE - SPLIT  ( Adjustments will be made prior to the final)

In [None]:
# import for split
from sklearn.model_selection import train_test_split

After preparing the corpus, it was split into 3 samples (train, validate, and test) using:

- Random State: 42
- Test = 20% of the original dataset
- The remaining 80% of the dataset is divided between validate and train
    - Validate (.30*.80) = 24% of the original dataset
    - Train (.70*.80) = 56% of the original dataset
    
The split of this data can be replicated using the split_data function saved within the prepare.py file inside the [_____](_________) repository on GitHub.
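
The proportions above can be reproduced with two calls to `train_test_split`. This is a sketch under the stated percentages and random state. The function name `split_data_sketch` is hypothetical, and whether the project's `split_data` stratifies on decade is not stated, so no stratification is applied here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the prepared corpus.
df_demo = pd.DataFrame({
    "lyrics": [f"song {i}" for i in range(1000)],
    "decade": [1960 + 10 * (i % 5) for i in range(1000)],
})


def split_data_sketch(df: pd.DataFrame, seed: int = 42):
    """Split into train (56%), validate (24%), and test (20%)."""
    # First hold out 20% of all rows as the test sample.
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    # Then take 30% of the remaining 80% as validate; the rest is train.
    train, validate = train_test_split(train_validate, test_size=0.3, random_state=seed)
    return train, validate, test


train_demo, validate_demo, test_demo = split_data_sketch(df_demo)
print(train_demo.shape, validate_demo.shape, test_demo.shape)
```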

In [None]:
# split the data into train, validate, and test using the split_data function found in the prepare.py
train, validate, test = prep.split_data(df)
# obtain the number of rows and columns for the splits
train.shape, validate.shape, test.shape

==========================================================================================================================================================

## III. EXPLORE

In [None]:
# import for data visualization
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib.ticker import StrMethodFormatter
from matplotlib import style
from wordcloud import WordCloud
import draft_explore as explore

After acquiring and preparing the corpus, exploration was conducted. All univariate exploration was completed on the entire cleaned corpus in the workbook for this project. For the purpose of the final report, only the target variable will be displayed in order to reduce noise and provide focused context for the project. Following univariate exploration, the split sets (train, validate, and test samples) were utilized through modeling, and only the train set was used for bivariate and multivariate exploration to prevent data leakage.

---

### EXPLORATION QUESTIONS

All bivariate exploration was conducted on the train corpus to prevent data leakage. The initial questions and univariate exploration guided the bivariate exploration.

### QUESTION 1: How Has Sentiment Changed Over Time?

In [None]:
#Visualization
palette = [
'#1f1e1b', #(black)
'#fc9d1c', #(orange)
'#ec1c34', #(red)
'#69b138', #(green)
'#2dace4', #(blue)
'#fbdb08' #(yellow)
]

plt.figure(figsize=(12,8))
sns.barplot(data=df, y='sentiment', x='decade', ci=None, ec='black',
            palette=palette)
plt.title('Average Sentiment by Decade', fontsize=16)
plt.ylabel('Average Sentiment Score', fontsize=14)
plt.xlabel(None)
plt.xticks(fontsize=14)
plt.show()

In [None]:
# Hypothesis Testing

#### ANSWER 1: 
Sentiment was fairly steady in the 60's and 70's, followed by a gradual downward trend that becomes sharper in the 2000's and 2010's. The downward trend is due to an increase in very negative sentiment and a decrease in very positive sentiment, while mid-range sentiment stays constant.

---

### QUESTION 2: What Topics are Most Prevalent From ?

In [None]:
# Visualization
explore.topic_popularity(df)

In [None]:
# Hypothesis Testing


#### ANSWER 2: 
Breakups are by far the most popular topic in songs, followed by being lost in life, then affection, sex, and nature.

---

### QUESTION 3: How Do Relationship Topics Change Over the Decades?   

In [None]:
# visualization
explore.relationship_line(df)

Observation:
There appears to be an inverse relationship between affection and sex, with the topic of affection decreasing in prevalence over time and the topic of sex increasing.

In [None]:
# visualization
explore.touch_swarm(df)

In [None]:
# Hypothesis Testing

#### ANSWER 3: 
While most relationship topics appear constant, affection and sex have an inverse relationship. 

---

### QUESTION 4: How Do Vice Topics Change Over the Decades?   

In [None]:
explore.vice_swarm(df)

#### ANSWER 4: 
After 1990, sex became extremely popular in lyrics; then, around 2015, violence and money exploded as well.

---

### QUESTION 5: What Happened to the Love?

In [None]:
# Visualization
TBD

In [None]:
# Hypothesis Testing

#### ANSWER 5: 
'Love' went from being the most common word in the early decades to lower in the top 5, and then dropped out of the top 5, replaced by 'like'.
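
A word ranking like the one behind this answer can be sketched with `collections.Counter`, which the notebook already imports. The toy lyrics and the helper name `top_words` are assumptions for illustration only.

```python
from collections import Counter

import pandas as pd

# Toy cleaned lyrics; the real corpus has one row per song.
songs = pd.DataFrame({
    "decade": [1960, 1960, 2010, 2010],
    "lyrics": ["love love baby", "love heart", "like like money", "like love"],
})


def top_words(lyrics: pd.Series, n: int = 2):
    """Count every word across the given lyrics and return the n most common."""
    counts = Counter(" ".join(lyrics).split())
    return counts.most_common(n)


top_by_decade = {
    int(decade): top_words(group)
    for decade, group in songs.groupby("decade")["lyrics"]
}
print(top_by_decade)
```

Comparing the ranked lists decade by decade is one way to see a word such as 'love' slip down the ranking.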

---

### EXPLORATION SUMMARY





==========================================================================================================================================================

In [None]:
### SPLIT???

## IV. MODEL

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


### Focus of Model Metrics
The target variable, Decade, is categorical, so classification machine learning algorithms were fit to the training corpus and evaluated on the validate corpus. The metric used for model evaluation was accuracy, due to the multi-class classification approach. In other words, accuracy counts every correct prediction equally, regardless of decade, rather than favoring one kind of error (false positives or false negatives) over another. We therefore focused on creating the model with the highest accuracy score, comparing train to validate.

In [None]:
# get fresh data
df = pd.read_csv('songs_0526.csv', index_col=0)
# prep for model
df = prep.model_clean(df)

### Set X & y
Two different approaches were taken to prepare the data for modeling. Feature engineering was done for exploratory analysis and extended further for modeling, but this did not result in a significant improvement in model accuracy. Therefore, the data was prepared for modeling using TF-IDF vectorization, which weights each word by how often it appears in a song relative to how common it is across the entire corpus. Below is how this was performed:

In [None]:
# make vectorizer
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df["lyrics"])
y = df["decade"]

### Set Baseline

A baseline prediction was set by using the mode for decade. This gave us a baseline accuracy of 20.6%. We will evaluate the accuracy of our models in comparison to that baseline.
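
A mode baseline predicts the most common decade for every song, so its accuracy is the share of songs in that decade. The sketch below uses ten made-up labels rather than the project's data, so the printed value is not the 20.6% reported above.

```python
import pandas as pd

# Made-up decade labels for ten songs.
y_demo = pd.Series([1980, 1990, 1990, 2000, 1990, 1970, 2000, 1980, 1990, 2010])

# The baseline always predicts the most frequent decade.
most_common_decade = y_demo.mode()[0]
baseline_accuracy = (y_demo == most_common_decade).mean()

print(most_common_decade, baseline_accuracy)  # 1990 0.4
```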

### Consider Feature Engineering
First, let's look at the models with lower accuracy. These use the dataframe built with feature engineering, not including TF-IDF.
The following were adjustable:
- scale or not scale
- use only unique bigrams as features or use all numeric features

### Observation of models with feature engineering:

### Consider TF-IDF

#### The Types of Classification Models Built Were:
- Decision Tree
- Random Forest
- Logistic Regression

The models were run with many trials, adjusting parameters and algorithms to find the best performing model.  

- All Logistic Regression models appeared to be overfit based upon their high performance on train accuracy compared to the significant drop off on validate accuracy.
    - This is in part due to the use of TF-IDF which analyzes each word in the train corpus and does not remove attributes.
- In general all models outperformed baseline, which had  ___ accuracy on train and ___ accuracy on validate.
- The Logistic Regression Model that performed best had a C of 1000 and a solver of 'lbfgs', with train accuracy of 98% and validate accuracy of 61%, performing 19% better than baseline on validate. It was then applied to the unseen test data with an accuracy of 56%.
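
For illustration, the sketch below combines TF-IDF with a logistic regression using the `C=1000` and `solver='lbfgs'` settings named above. The toy lyrics, the `max_iter` value, and the use of a pipeline are assumptions. This is not the code in `model.run_logistic_reg_models`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus; real lyrics and decade labels would come from the split data.
train_lyrics = [
    "twist shout baby love",
    "love me tender baby",
    "money phone club like",
    "like club party phone",
]
train_decade = [1960, 1960, 2010, 2010]
validate_lyrics = ["baby love twist", "phone party like"]
validate_decade = [1960, 2010]

# The pipeline learns the TF-IDF vocabulary from the training lyrics only.
clf = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(C=1000, solver="lbfgs", max_iter=1000),
)
clf.fit(train_lyrics, train_decade)

# Comparing the two scores is how overfitting shows up: high on train, lower on validate.
train_accuracy = clf.score(train_lyrics, train_decade)
validate_accuracy = clf.score(validate_lyrics, validate_decade)
print(f"train: {train_accuracy:.2f}, validate: {validate_accuracy:.2f}")
```

A large `C` means weak regularization, which is consistent with the gap between train and validate accuracy noted above.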

---

### MODEL - DECISION TREE

In [None]:
results = model.run_decision_tree_models(df)
results.head(1) # show baseline

In [None]:
results.sort_values('validate_accuracy', ascending=False).head(3)

The Decision Tree model that performed the best on train & validate set had max_depth of 2, with 51% accuracy on train, and 45% accuracy on validate, so that model will be isolated below in the event it is the best performing model to be applied to the test (unseen) dataset. 

---

### Model - RANDOM FOREST

In [None]:
results2 = model.run_random_forest_models(df)
results2.sort_values('validate_accuracy', ascending=False).head(3)

The Random Forest model that performed the best on the train & validate sets had a max_depth of 100 and min_samples_leaf of 1, with 99% accuracy on train and 48% accuracy on validate, so that model will be isolated below in the event it is the best performing model to be applied to the test (unseen) dataset.

---

### Model - LOGISTIC REGRESSION

In [None]:
results3 = model.run_logistic_reg_models(df)
results3.sort_values('validate_accuracy', ascending=False).head(3)

Evaluating the model with the validate data set was done in the function above for comparison. The Logistic Regression Model that performed best had a C (inverse regularization strength) of 1000, with a train accuracy of 99% and validate accuracy of 61%, performing 19% better than baseline on unseen (validate) data.

---

### Best Performing Model Applied to Test Data (Unseen Data)

In [None]:
results3.sort_values('validate_accuracy', ascending=False).head(1)

This model is expected to perform around 56-60% accuracy in the future on data it has not seen, given no major changes in the data source, which is better than the baseline prediction.

==========================================================================================================================================================

## V. CONCLUSION

This project aimed to investigate patterns in song lyrics across decades using Time Series Analysis and Natural Language Processing techniques, including Topic Modeling, Sentiment Analysis, and Term Frequency, using a Kaggle data set of the Billboard Top 100 Songs from 1958 - 2021 and lyrics pulled from the Genius.com API. We believed the lyrics of popular songs could be used for historical analysis using exploratory methods and hypothesis testing to identify changing societal trends in relationships, technology, sexuality, and vulgarity. Furthermore, we believed we could predict the decade the song appeared on the Top 100 using features and machine learning methods.

Through exploration and modeling, we determined _____________________.

This information could be useful in various contexts:
- Anthropological and Sociological Academic Analysis
- Marketing Analysis for companies associated with the music industry

### RECOMMENDATIONS

TBD

### NEXT STEPS

TBD with more exploration

==========================================================================================================================================================