## A LYRICAL EVOLUTION: 

#### An Investigation of the Cultural Lexicon of U.S. Popular Music from 1958 - Present
---

By: Jerry Nolf, Rachel Robbins-Mayhill, Ben Smith,  & Chris Teceno    |    Codeup   |   Innis Cohort   |   June 2022  

<img src="dataset-cover.png">

**WARNING:** This project contains explicit content in the form of isolated words identified through Topic Modeling as features grouped within a topic.

The findings of this project are available in presentation format by clicking on the [Final Slide Presentation](https://www.canva.com/design/DAFCXoeG7z0/jNCtQkQFqyOTWS5Ckg8Xuw/view?utm_content=DAFCXoeG7z0&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton).

## Project Goal
This project aimed to investigate the patterns of song lyrics across decades by applying Natural Language Processing techniques, including Topic Modeling and Sentiment Analysis, to a Kaggle data set of the Billboard Top 100 Songs from 1958 - Present and lyrics pulled from the Genius.com API. We believed the lyrics of popular songs could be used for historical analysis, using exploratory methods and hypothesis testing to identify changing societal trends in relationships, sexuality, and vulgarity. Furthermore, we believed we could predict the decade a song appeared on the Top 100 using features and machine learning methods.

## Project Description

Songs are powerful tokens: they can soothe, validate, ignite, confront, and educate us – among other things. Like time capsules, they are captured for eternity. The slang and language used are often indicative of the times, and you can probably recall exactly when a song was made based on what is mentioned. Arguably, music is a catalyst for societal and cultural evolution like no other art form. It has been causing controversy and societal upheaval for decades, and it seems with every generation there’s a new musical trend that has the older generations shaking their heads. 

For centuries, songs have been passed down through generations, being sung as oral histories. However, with advancements of the 20th century, technology has made the world of music a much smaller place and, thanks to cheap, widely-available audio equipment, songs are now distributed on a much larger scale, having a farther-reaching impact, and a more permanent place in history. 

This project aimed to combine the record of lyrical history and technological advancements to evaluate the changes in the societal lexicon over the last 60+ years. Using machine learning and natural language processing methodologies we investigated the topics prevalent in songs of the past, predicted the decade in which they were written, and conducted historical analysis through exploration to identify changing societal trends in relationships, sexuality, and vulgarity.

<img src='Billboard.png' width="350" height="350" align="left"/> To do this, we acquired a [Kaggle](https://www.kaggle.com/datasets/dhruvildave/billboard-the-hot-100-songs) data set of the Billboard Top 100 Songs from its inception in 1958 to present. We then utilized the [Genius.com](https://genius.com/) API and LyricGenius Library to conduct web scraping to pull the lyrics for the specified songs which became the corpus for this project. After acquiring and preparing the corpus, our team conducted time series analysis and natural language processing exploration utilizing methods such as sentiment analysis and topic modeling. We also employed multiclass classification methods to create multiple machine learning models. The end goal was to create an NLP model that accurately predicted the decade a song first appeared on the Billboard Top 100 chart, based on the words found in the lyrics of the song.

We chose the Billboard Hot 100 song list as a focus because it is the music industry standard record chart in the United States for song popularity, published weekly by Billboard magazine. It provides a window into popular culture at a given time, by providing chart rankings of songs that were trending on sales, airplay, and now streaming for that week in the United States. It is arguably the best historical record of the impact of specific popular songs over time.

## Initial Thoughts & Hypothesis

The initial hypothesis of this project was that we could use the top songs of each decade in conjunction with topic modeling and sentiment analysis to identify lyric features that would accurately predict the decade a song was on the Billboard Top 100 using machine learning. The thought behind this was that popular songs have been the historians of a unique lexicon, specific to their place in time. We believe the lyrics of popular songs could be analyzed through machine learning to identify societal trends in relationships, sexuality, and vulgarity.

## Initial Questions

The focus of this project is on identifying the decade a song first appeared on the Billboard Top 100. Below are some of the initial questions this project looked to answer throughout the Data Science Pipeline.
 
##### Data-Focused Questions
- How does sentiment within lyrics change over time?
- Is there a correlation between sentiment and the time a song was popular?
- Is there a correlation between events in history and sentiment of lyrics?
- What topics are most prevalent across the decades?
- How do topics within lyrics change over time?
- Is there a correlation between topics and the time a song was popular?

## Key Findings

Through exploratory analysis, we discovered US popular music has undergone a major cultural shift starting in the 1990's, where: 

- overall sentiment decreased 
- lyrics became more complex 
- topics shifted towards sex, money, & violence 
- ‘love’ was replaced with ‘like’

Ultimately, our hypothesis that we could use the top songs of each decade to accurately predict the decade a song was on the Billboard Top 100 was supported, although certain decades were predicted more accurately than others. Our best-performing models were based heavily on TF-IDF, with the top-performing model being a Logistic Regression model with an F1 score that was 220% over baseline.

==========================================================================================================================================================

## I. ACQUIRE
To acquire the data for this project, we utilized a [Kaggle](https://www.kaggle.com/datasets/dhruvildave/billboard-the-hot-100-songs) data set of the entire listing of Billboard Top 100 Songs from its inception in 1958 to present. 

The dataset provided:
- date song was on the Billboard Top 100 
- rank of song  
- title
- artist name
- rank of song the previous week
- rank of song at its peak week
- number of weeks song was on the Top 100  

The original Kaggle dataset contained more than 300,000 entries. We selected only unique artist and song combinations to ensure there were no duplicates, keeping only the earliest appearance on the chart to standardize the selections in the event of multiple appearances. Following song selection with the Kaggle dataset, we obtained an API token to utilize the [Genius.com](https://genius.com/) API and [LyricGenius Library](https://pypi.org/project/lyricsgenius/) to conduct web scraping and pull the lyrics for the specified songs, which became the corpus for this project.
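The de-duplication step described above can be sketched in pandas as follows. This is a minimal illustration rather than the project's actual acquire code; the column names (`date`, `song`, `artist`) and the sample rows are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the weekly chart data; column names are assumed
chart = pd.DataFrame({
    "date": ["1958-08-11", "1958-08-04", "1960-01-04", "1960-01-11"],
    "song": ["Poor Little Fool", "Poor Little Fool", "El Paso", "El Paso"],
    "artist": ["Ricky Nelson", "Ricky Nelson", "Marty Robbins", "Marty Robbins"],
})
chart["date"] = pd.to_datetime(chart["date"])

# Sort oldest first, then keep one row per (song, artist) pair:
# the earliest chart appearance
unique_songs = (
    chart.sort_values("date")
    .drop_duplicates(subset=["song", "artist"], keep="first")
    .reset_index(drop=True)
)
print(unique_songs)
```

Sorting before dropping duplicates is what makes `keep="first"` correspond to the earliest appearance rather than to whatever row order the file happened to have.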

### Note about imports: 
Imports for this project are added in the sections in which they are required.

In [1]:
# import for acquisition
import os
import json
import requests
import final_acquire as acquire
import final_prepare as prepare
import final_explore as explore
import final_model as model

# import for data manipulation
import pandas as pd
import numpy as np
from typing import Dict, List, Optional, Union, cast

# import to ignore warnings
import warnings
warnings.filterwarnings('ignore')

SyntaxError: invalid non-printable character U+200B (final_explore.py, line 176)

In [None]:
# acquire data from the saved .csv file, processed using functions found in wrangle.py
df = pd.read_csv("songs_0526.csv")
df.head()

In [None]:
# obtain number of columns and rows for original dataframe
df.shape

#### Original Filtered DataFrame Size: 23,762 rows, or documents, and 5 columns.

==========================================================================================================================================================

## II. PREPARE

After data acquisition, the dataframe was analyzed and cleaned to facilitate functional exploration and clarify variable confusion. The preparation of this data can be replicated using the 'get_data' function saved within the prepare.py file inside the [Lyrical Evolution](https://github.com/CBRJ-Lyrical-Metrics/song-lyrics-capstone) repository on GitHub. The function takes in the original acquire dataframe and returns it with the changes noted below.

**Steps Taken to Clean & Prepare Data:**

- Cleaning: 
    - Make all text lowercase
    - Normalize, encode, and decode to remove accented text and special characters
    - Expand abbreviated contractions
    - Lemmatize words to acquire base words
    - Remove stopwords
    - Convert date to DateTime format
    - Remove song part identifiers ('lyrics' 'verse', 'chorus', 'hook', 'embed')
    
---   
- Address missing values, data errors, unnecessary data, and unclear values:
    - No null values
    - Data Errors : The API returned lyrics that were not the expected song's lyrics 
        - Manually checked a sample of songs
        - Compared the expected title with the returned title to see whether they match after cleaning
---    
- Create feature engineered columns:
    - Decade 
    - Chorus Count
    - Verse Count
    - Verse/Chorus Ratio
    - Word Count
    - Unique Words per Song
    - Unique Words per Decade
    - Bigrams
    - Trigrams
    
- Apply Natural Language Processing (NLP) Methods:
    - Sentiment Analysis
    - Topic Modeling
    
---
- Split corpus into train, validate, and test samples 
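
A few of the cleaning and feature-engineering steps listed above can be sketched as follows. This is a simplified illustration, not the project's `prepare.py`: the stopword and contraction tables are tiny hypothetical stand-ins (the import cells suggest the project relied on NLTK for stopwords), and lemmatization is left out so the example runs without downloading any NLTK data.

```python
import re
import unicodedata

# Tiny illustrative stopword list (assumption); a full list would be much longer
STOPWORDS = {"the", "a", "and", "i", "you", "is", "not", "do"}
# Song-part identifiers listed in the cleaning steps above
SONG_PARTS = {"lyrics", "verse", "chorus", "hook", "embed"}
# A few contractions only (assumption); a real mapping would be much longer
CONTRACTIONS = {"don't": "do not", "i'm": "i am", "can't": "cannot"}


def basic_clean(text: str) -> str:
    """Lowercase, strip accents and special characters, expand contractions,
    and drop stopwords and song-part identifiers."""
    text = text.lower()
    # Normalize to ASCII so accented characters are reduced to base letters
    text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("utf-8")
    )
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # Keep only lowercase letters, digits, and whitespace
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    words = [w for w in text.split() if w not in STOPWORDS | SONG_PARTS]
    return " ".join(words)


cleaned = basic_clean("[Chorus] Don't stop the Beyoncé fiesta, baby!")

# Two of the engineered features: word count and unique words per song
word_count = len(cleaned.split())
unique_words = len(set(cleaned.split()))
print(cleaned, word_count, unique_words)
```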


**Note on Splitting Data:**



**Note on Missing Value Handling:**
The missing value removal equated to removing ______ observations/documents, which was about ___   \% of the data set. It still left _______ observations, a substantial number. If given more time with the data, it is recommended to investigate other ways to impute the missing data.

### Sentiment Analysis

Natural Language Toolkit (NLTK) was used to prepare the corpus for sentiment analysis. NLTK assigns a score between -1 and +1 to each song based on whether the sum of words and phrases in the song is considered to be positive or negative.
After scores were assigned to each song based upon the lyrical content, sentiment score ranges were divided into 5 categories: very negative, somewhat negative, neutral, somewhat positive, and very positive. Each song was then labeled with the sentiment category matching its corresponding sentiment score in preparation for exploration.
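
The binning step can be sketched with `pandas.cut`. The cut points below are assumptions chosen for illustration, since the report does not state the actual ranges, and the scores are made-up values standing in for the per-song NLTK scores.

```python
import pandas as pd

# Made-up scores in the [-1, +1] range standing in for per-song sentiment
songs = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "sentiment": [-0.92, -0.30, 0.02, 0.45, 0.97],
})

# Assumed cut points; the report does not state the actual ranges
bins = [-1.0, -0.6, -0.2, 0.2, 0.6, 1.0]
labels = ["very negative", "somewhat negative", "neutral",
          "somewhat positive", "very positive"]

# include_lowest=True keeps a score of exactly -1.0 inside the first bin
songs["sentiment_category"] = pd.cut(
    songs["sentiment"], bins=bins, labels=labels, include_lowest=True
)
print(songs)
```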

### Topic Modeling

Latent Dirichlet Allocation (or LDA) spearheaded the extraction of topics within the lyrics. This unsupervised machine learning method detected word and phrase patterns. It then clustered groups of words that could best be labeled as a topic. 20 major topics were originally produced, but 3 were overlapping in tone and were therefore manually combined with others, resulting in 17 final topics to explore. These topics will be outlined in more detail through the exploration section of this report.

<img src='final_topics.png' width="900"  align="center"/>

---

## Results of Data Preparation

In [None]:
# import for prepare
import final_prepare as prepare
import unicodedata
import re

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

from time import strftime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from collections import Counter

In [None]:
# apply the data preparation steps to clean the data using the get_data function found in prepare.py
df = prepare.get_data()
# obtain the number of rows and columns for the updated/cleaned dataframe
print(df.shape)
# view the first few rows of the dataframe
df.head()

In [None]:
# obtain number of columns and rows 'cleaned' dataframe
df.shape

## Prepared DataFrame Size: 134 rows, or documents, and 13 columns.

---

### PREPARE - SPLIT

In [None]:
# import for split
from sklearn.model_selection import train_test_split

After preparing the corpus, it was split into 3 samples; train, validate, and test using:

- Random State: ______
- Test = 20% of the original dataset
- The remaining 80% of the dataset is divided between validate and train
    - Validate (.30*.80) = 24% of the original dataset
    - Train (.70*.80) = 56% of the original dataset
    
The split of this data can be replicated using the split_data function saved within the prepare.py file inside the [_____](_________) repository on GitHub.
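
The proportions above can be reproduced with two calls to `train_test_split`. This is a sketch on a toy dataframe, not the project's `split_data` function; the `random_state` of 123 is a hypothetical value, since the report leaves the actual one blank.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe with 100 rows so the percentages are easy to read
df_demo = pd.DataFrame({
    "lyrics": [f"song {i}" for i in range(100)],
    "decade": [1960 + 10 * (i % 5) for i in range(100)],
})

# First hold out 20% of the full dataset as the test sample
train_validate, test = train_test_split(df_demo, test_size=0.20, random_state=123)
# Then split the remaining 80% into 70% train and 30% validate
train, validate = train_test_split(train_validate, test_size=0.30, random_state=123)

print(train.shape, validate.shape, test.shape)
```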

In [None]:
# split the data into train, validate, and test using the split_data function found in the prepare.py
train, validate, test = prepare.split_data(df)
# obtain the number of rows and columns for the splits
train.shape, validate.shape, test.shape

==========================================================================================================================================================

## III. EXPLORE

In [None]:
# import for data visualization
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib.ticker import StrMethodFormatter
from matplotlib import style
from wordcloud import WordCloud
import final_explore as explore

After acquiring and preparing the corpus, exploration was conducted. All univariate exploration was completed on the entire cleaned corpus in the workbook for this project. For the purpose of the final report, only the target variable will be displayed in order to reduce noise and provide focused context for the project. Following univariate exploration, the split sets (train, validate, and test samples) were utilized through modeling, where only the train set was used for bivariate and multivariate exploration to prevent data leakage.

---

### UNIVARIATE EXPLORATION

In [None]:
# Keep??????

#### UNIVARIATE EXPLORATION of TARGET VARIABLE

#### OBSERVATIONS: 
- 

---

### EXPLORATION QUESTIONS

All bivariate exploration was conducted on the train corpus to prevent data leakage. The initial questions and univariate exploration guided the bivariate exploration.

#### EXPLORE QUESTIONS

#### QUESTION 1: 
How has song sentiment changed over time?


We examined the change in average sentiment score over time by looking at a rolling 5-year average and the average by decade.
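
The two aggregations can be sketched in pandas as follows. The column names (`year`, `sentiment`) and the values are assumptions for illustration; this is not the code behind `explore.sentiment_lineplot`.

```python
import pandas as pd

# Made-up per-song sentiment scores; column names are assumed
sentiment = pd.DataFrame({
    "year": [1958, 1958, 1959, 1960, 1961, 1962, 1963],
    "sentiment": [0.8, 0.6, 0.5, 0.4, 0.6, 0.3, 0.2],
})

# Average sentiment per year, then smooth with a 5-row rolling window.
# The window counts rows, so this assumes one row per consecutive year.
yearly = sentiment.groupby("year")["sentiment"].mean()
rolling_5yr = yearly.rolling(window=5).mean()

# Average sentiment per decade (e.g. 1963 -> 1960)
sentiment["decade"] = sentiment["year"] // 10 * 10
by_decade = sentiment.groupby("decade")["sentiment"].mean()

print(rolling_5yr)
print(by_decade)
```

The first four values of the rolling series are `NaN` because a full 5-year window is not yet available.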

In [None]:
explore.sentiment_lineplot(df)

#### ANSWER 1:
Sentiment was fairly steady in the 60's and 70's, followed by a gradual downward trend which becomes sharper in the 2000's and 2010's. The downward trend is due to an increase in very negative sentiment and a decrease in very positive sentiment, while mid-range sentiment stays constant.

#### Question 2:
What topics are most prevalent across the decades?


Topic modeling originally produced 20 topics, but through manual analysis, it was determined a few groupings were very similar. Combining these groupings into the same topic resulted in **17 finalized topics for our dataset**.

In [None]:
explore.topic_popularity(df)

#### Answer 2:
Breakups are by far the most popular topic in songs across all decades, followed by being or feeling lost, then affection, sex, and nature. The least popular over the years are Spanish-influenced songs, holiday songs, and songs about jealousy.

#### QUESTION 3: 
- How do relationship topics change over the decades?

In [None]:
explore.relationship_line(df)

#### Observation:
Most topics in our "relationships" group seem to be consistently present over time. However, sex and affection seem to stand out considering sex's rise and affection's decline in the 90's. 

In [None]:
explore.touch_swarm(df)

#### Answer 3:
While most relationship topics appear constant, isolating songs about sex and songs about affection shows that the two topics have an inverse relationship over time: as one rises, the other declines.

#### QUESTION 4: 
- How do vice topics change over the decades?

In [None]:
explore.vice_swarm(df)

#### ANSWER 4: 
After 1990, more explicit sexual themes became extremely popular in lyrics. Songs about violence and money followed more slowly, becoming more prominent around 2015.

#### Question 5:
How has the prevalence of the word 'like' changed over the decades?

In [None]:
explore.love_vs_like_lineplot(df)

#### ANSWER 5: 
The word love's presence severely diminished after 1990. Around 2008, the word like became more present in lyrics, and it continues to increase while love's decline continues.

#### Question 6: 
How does sentiment align with historical events?

In [None]:
explore.historical_lineplot(df)

#### Answer 6:

#### Question 7:
How did unique word count change over time?

In [None]:
explore.unique_words_lineplot(df)

#### Answer 7:

==========================================================================================================================================================

## IV. MODEL

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


### Focus of Model Metrics
The target variable, Decade, is categorical, so classification machine learning algorithms were fit to the training corpus and the models were evaluated on the validate corpus. The metric used for model evaluation was accuracy, due to the multi-class classification approach. In other words, the model was optimized to correctly identify true positives, false positives, true negatives, and false negatives, so we focused on creating the model with the highest accuracy score from train to validate.

In [None]:
# get the data
df = prepare.get_data()
# remove incomplete decades (1950, 2020)
df = df[(df.decade != 1950) & (df.decade != 2020)]

### Set X & y
As mentioned above, two different approaches were taken to prepare the data for modeling. Feature engineering was done for exploratory analysis, and further feature engineering was done for modeling. This, however, did not significantly improve the accuracy of the model. Therefore, the data was prepared for modeling using TF-IDF vectorization, which weighs how often a word appears in a song's lyrics against how common that word is across the entire corpus. Below is how this was performed:

In [None]:
# make vectorizer
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df["lyrics"])
y = df["decade"]
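
To make the TF-IDF weighting concrete, here is a small self-contained sketch on three made-up lyric snippets (the `toy_` names and the data are assumptions for illustration). A word that appears in every song is down-weighted relative to a word that appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up lyric snippets; "love" appears in all of them
toy_lyrics = ["love baby", "love money", "love night"]

toy_tfidf = TfidfVectorizer()
weights = toy_tfidf.fit_transform(toy_lyrics).toarray()
vocab = toy_tfidf.vocabulary_  # maps each word to its column index

# "love" is in all three songs and "baby" in only one, so within the
# first song "baby" receives the larger weight
first_song = weights[0]
print("love:", round(first_song[vocab["love"]], 3))
print("baby:", round(first_song[vocab["baby"]], 3))
```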

### Set Baseline
A baseline prediction was set by using the mode for decade. This gave us a baseline accuracy of 20.6%. We will evaluate the accuracy of our models in comparison to that baseline.
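
A mode-based baseline can be computed in a couple of lines. The sketch below uses a made-up series of decades, so its accuracy differs from the 20.6% reported for the real corpus.

```python
import pandas as pd

# Made-up target values; the real target is the decade column of the corpus
decades = pd.Series([1960, 1970, 1970, 1980, 1990, 1970, 2000, 2010, 1980, 1970])

# Baseline: always predict the most common decade
most_common = decades.mode()[0]
baseline_accuracy = (decades == most_common).mean()

print(most_common, baseline_accuracy)
```

Any model worth keeping should beat this number, because it can be reached without looking at the lyrics at all.
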

### Consider Feature Engineering
First, let's look at the models with the lower accuracy: those built on the dataframe that uses feature engineering without TF-IDF.
The following were adjustable:
- scale or not scale
- use only unique bigrams as features or use all numeric features

### Observation of models with feature engineering:

### Consider TF-IDF
#### The types of classification models built were:
- Decision Tree
- Random Forest
- Logistic Regression

The models were run with many trials, adjusting parameters and algorithms to find the best performing model.  

- All Logistic Regression models appeared to be overfit based upon their high performance on train accuracy compared to the significant drop off on validate accuracy.
    - This is in part due to the use of TF-IDF which analyzes each word in the train corpus and does not remove attributes.
- In general all models outperformed baseline, which had  ___ accuracy on train and ___ accuracy on validate.
- The Logistic Regression Model that performed best had a C of 1000 and solver of 'lbfgs', with train accuracy of 98% and validate accuracy of 61%, performing 19% better than baseline with validate. It was then applied to the unseen test data with an accuracy of 56%.

---
### MODEL - DECISION TREE


In [None]:
# uncomment to run the following code
# results = model.run_decision_tree_models(df)
# results.drop(columns='test_accuracy').head(1) # show baseline

In [None]:
# uncomment to run the following code
# results.drop(columns='test_accuracy').sort_values('validate_accuracy', ascending=False).head(3)

The Decision Tree model that performed the best on train & validate set had max_depth of 10, with 41% accuracy on train, and 31% accuracy on validate, so that model will be isolated below in the event it is the best performing model to be applied to the test (unseen) dataset. 

---

### Model - RANDOM FOREST

In [None]:
# uncomment to run the following code
# results2 = model.run_random_forest_models(df)
# results2.drop(columns='test_accuracy').sort_values('validate_accuracy', ascending=False).head(3)

The Random Forest model that performed the best on train & validate set had max_depth of 100 and min_samples_leaf of 2, with 93% accuracy on train, and 40% accuracy on validate, so that model will be isolated below in the event it is the best performing model to be applied to the test (unseen) dataset.

---

### Model - LOGISTIC REGRESSION

In [None]:
# uncomment to run the following code
# results3 = model.run_logistic_reg_models(df)
# results3.drop(columns='test_accuracy').sort_values('validate_accuracy', ascending=False).head(3)

Evaluating the model with the validate data set was done in the function above for comparison. The Logistic Regression Model that performed best had a C (inverse regularization strength) of 1000, with a train accuracy of 69% and validate accuracy of 45%, performing 222% better than baseline on unseen (validate) data.
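
The general shape of this model can be sketched as follows. This is not the project's `run_logistic_reg_models` function: the lyrics and labels are made up, `max_iter=1000` is an assumption added to help the solver converge, and the toy accuracies will not match the figures reported above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Made-up lyrics and decade labels for illustration only
train_lyrics = ["twist and shout baby", "surfing all summer long",
                "disco lights all night", "boogie on the dance floor",
                "money in the club tonight", "party like a rock star"]
train_decade = [1960, 1960, 1970, 1970, 2000, 2000]
validate_lyrics = ["shout baby shout", "dance floor disco", "club party tonight"]
validate_decade = [1960, 1970, 2000]

# Fit the vectorizer on train only, then reuse it on validate,
# so no vocabulary or IDF information leaks from validate
vec = TfidfVectorizer()
X_train = vec.fit_transform(train_lyrics)
X_validate = vec.transform(validate_lyrics)

# A large C means weak regularization, which makes overfitting more likely
clf = LogisticRegression(C=1000, solver="lbfgs", max_iter=1000)
clf.fit(X_train, train_decade)

train_acc = accuracy_score(train_decade, clf.predict(X_train))
validate_acc = accuracy_score(validate_decade, clf.predict(X_validate))
print(f"train accuracy: {train_acc:.2f}, validate accuracy: {validate_acc:.2f}")
```

Comparing the two accuracies side by side is how a gap like the one described above (high on train, much lower on validate) shows up as a sign of overfitting.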

---

### Best Performing Model Applied to Test Data (Unseen Data)

In [None]:
# uncomment to run the following code
# results3.sort_values('validate_accuracy', ascending=False).head(1)


==========================================================================================================================================================

## V. CONCLUSION

This project aimed to investigate the patterns of song lyrics across decades using Time Series Analysis and Natural Language Processing techniques, including Topic Modeling, Sentiment Analysis, and Term Frequency, with a Kaggle data set of the Billboard Top 100 Songs from 1958 - 2021 and lyrics pulled from the Genius.com API. We believed the lyrics of popular songs could be used for historical analysis, using exploratory methods and hypothesis testing to identify changing societal trends in relationships, technology, sexuality, and vulgarity. Furthermore, we believed we could predict the decade a song appeared on the Top 100 using features and machine learning methods.

Through exploration and modeling, we determined _____________________.

This information could be useful in various contexts:
- Anthropologic and Sociologic Academic Analysis
- Marketing Analysis for companies associated with the music industry

### RECOMMENDATIONS

### NEXT STEPS

TBD with more exploration


==========================================================================================================================================================