# A LYRICAL EVOLUTION: 

### An Investigation of the Cultural Lexicon & Historical Relevance of U.S. Popular Music from 1958 - Present

---

**An NLP based Capstone Project & Final Report Created By:**

Ben Smith, Chris Teceno, Jerry Nolf & Rachel Robbins-Mayhill
Codeup   |   Innis Cohort   |   June 2022  

<img src="dataset-cover.png">

## Project Goal
This project aimed to investigate patterns in song lyrics across decades using Natural Language Processing techniques, including Topic Modeling, Sentiment Analysis, and Term Frequency. The corpus combines a Kaggle data set of the Billboard Top 100 Songs from 1958 - 2021 with lyrics pulled from the Genius.com API. We believe the lyrics of popular songs can be used for historical analysis, applying exploratory methods and hypothesis testing to identify changing societal trends in relationships, technology, sexuality, and vulgarity. Furthermore, we believe we can predict the decade in which a song appeared on the Top 100 using engineered features and machine learning methods.

## Project Description

Songs are powerful tokens: they can soothe, validate, ignite, confront, and educate us – among other things. Like time capsules, they are captured for eternity. The slang and language used are often indicative of the times, and you can probably recall exactly when a song was made based on what is mentioned. Arguably, music is a catalyst for societal and cultural evolution like no other art form. It has been causing controversy and societal upheaval for decades, and it seems with every generation there’s a new musical trend that has the older generations shaking their heads. 

For centuries, songs have been passed down through generations, being sung as oral histories. However, with advancements of the 20th century, technology has made the world of music a much smaller place and, thanks to cheap, widely-available audio equipment, songs are now distributed on a much larger scale, having a farther-reaching impact, and a more permanent place in history. 

This project aimed to combine the record of lyrical history and technological advancements to evaluate the changes in the cultural lexicon and societal evolution over the last 50+ years. Using machine learning and natural language processing methodologies we investigated the topics prevalent in songs of the past, predicted the decade in which they were written, and conducted historical analysis through exploration to identify changing societal trends in relationships, technology, sexuality, and vulgarity.

<img src='Billboard.png' width="350" height="350" align="left"/> To do this, we acquired a [Kaggle](https://www.kaggle.com/datasets/dhruvildave/billboard-the-hot-100-songs) data set of the Billboard Top 100 Songs from its inception in 1958 to present. We then used the [Genius.com](https://genius.com/) API and the LyricsGenius library to pull the lyrics for the specified songs, which became the corpus for this project. After acquiring and preparing the corpus, our team conducted natural language processing exploration using methods such as topic modeling, word clouds, and bigrams. We employed multiclass classification methods to create multiple machine learning models. The end goal was to create an NLP model that accurately predicted the decade a song first appeared on the Billboard Top 100 chart, based on the words and word combinations found in the lyrics of the song.

We chose the Billboard Hot 100 song list as a focus because it is the music industry standard record chart in the United States for song popularity, published weekly by Billboard magazine. It provides a window into popular culture at a given time by ranking the songs that were trending on sales, airplay, and now streaming for that week in the United States. It is arguably the best historical record of the impact of specific popular songs over time.

## Initial Thoughts & Hypothesis

The initial hypothesis of this project was that we could use the top songs of each decade in conjunction with topic modeling to identify unique words or topics which could be used as features to accurately predict the decade a song was on the Billboard Top 100 using machine learning. The thought behind this was that popular songs have been the historians of a unique lexicon, specific to their place in time. We believe the lyrics of popular songs could be analyzed through machine learning to identify societal trends in relationships, technology, sexuality, and vulgarity.
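
To make the hypothesis concrete, the following is a minimal sketch of predicting a decade from words and word pairs. It is not the project's actual model: the TF-IDF features, the logistic regression classifier, and the four made-up lyric snippets are all assumptions chosen for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy lyrics and decade labels (illustrative only, not project data)
lyrics = [
    "hold my hand at the sock hop tonight",
    "write me a letter and hold my hand",
    "text me on my phone after the club",
    "call my phone and text me tonight",
]
decades = [1960, 1960, 2010, 2010]

# Unigrams and bigrams stand in for "words and word combinations"
decade_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
decade_clf.fit(lyrics, decades)

# Predict the decade for an unseen snippet
prediction = decade_clf.predict(["text me on my phone"])
print(prediction[0])
```

With the real corpus, the labels would cover every decade in the data, which makes this a multiclass classification problem.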

## Initial Questions

The focus of this project is on identifying the decade a song first appeared on the Billboard Top 100. Below are some of the initial questions this project looked to answer throughout the Data Science Pipeline.
 
##### Data-Focused Questions
- What are the most frequently occurring words?
- What are the most frequently occurring bigrams (pairs of words) in each decade?
- What decade did the song first appear in the top 100?
- What topics are most unique to each decade?
- Is there a correlation between sentiment and decade?
- How do topics, such as violence, sexual explicitness, technology references, or relationship references, change over time?
- How does foreign language usage change over time?
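
As an illustration of the bigram question above, here is a small standard-library sketch that counts word pairs per decade. The three lyric strings are invented, and the whitespace tokenization is a simplifying assumption; the project's own tokenization happens in the prepare step.

```python
from collections import Counter, defaultdict

# Invented (decade, lyrics) pairs for illustration only
songs = [
    (1960, "hold my hand hold my hand tonight"),
    (1960, "hold my heart"),
    (2010, "call my phone call my phone"),
]

# Map each decade to a Counter of its bigrams
bigrams_by_decade = defaultdict(Counter)
for decade, lyrics in songs:
    tokens = lyrics.split()
    # zip pairs each token with its successor to form bigrams
    bigrams_by_decade[decade].update(zip(tokens, tokens[1:]))

# Most frequent bigram in the 1960s toy songs
top_1960 = bigrams_by_decade[1960].most_common(1)
print(top_1960)  # [(('hold', 'my'), 3)]
```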

## Key Findings

The key findings for this presentation are available in slide format by clicking on the [Final Slide Presentation](https://www.canva.com/design/DAFCXoeG7z0/jNCtQkQFqyOTWS5Ckg8Xuw/view?utm_content=DAFCXoeG7z0&utm_campaign=designshare&utm_medium=link2&utm_source=sharebutton).


TBD.......

==========================================================================================================================================================

## I. ACQUIRE
To acquire the data for this project, we utilized a [Kaggle](https://www.kaggle.com/datasets/dhruvildave/billboard-the-hot-100-songs) data set of the Billboard Top 100 Songs from its inception in 1958 to present. 

The dataset provided:
- date song was on the Billboard Top 100 
- rank of song  
- title
- artist name
- rank of song the previous week
- rank of song at its peak week
- number of weeks song was on the Top 100  

We selected only unique artist and song combinations to ensure there were no duplicates, keeping only the earliest appearance on the chart when a song appeared multiple times. Following song selection from the Kaggle dataset, we obtained an API token to use the [Genius.com](https://genius.com/) API and the [LyricsGenius library](https://pypi.org/project/lyricsgenius/) to pull the lyrics for the specified songs, which became the corpus for this project.
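
The de-duplication step described above could look roughly like the sketch below. The column names (`date`, `song`, `artist`, `rank`) and the toy rows are assumptions for illustration; the project's own logic lives in its acquire and prepare files.

```python
import pandas as pd

# Toy weekly chart rows (hypothetical values)
charts = pd.DataFrame({
    "date": ["1964-02-08", "1964-02-01", "1983-03-05", "1983-03-12", "1990-05-05"],
    "song": ["Song A", "Song A", "Song B", "Song B", "Song A"],
    "artist": ["Artist X", "Artist X", "Artist Y", "Artist Y", "Artist Z"],
    "rank": [3, 10, 1, 2, 40],
})
charts["date"] = pd.to_datetime(charts["date"])

# Sort chronologically, then keep the first row for each artist/song pair
first_appearance = (
    charts.sort_values("date")
    .drop_duplicates(subset=["artist", "song"], keep="first")
    .reset_index(drop=True)
)
print(first_appearance)
```

Including the artist in the duplicate key keeps two different songs that share a title, such as a cover released decades later, as separate documents.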

The acquired data can be easily accessed via a [Google Drive .csv file](https://drive.google.com/file/d/1S0dJ7-5x8NIgt1LranE3UETgl_JvukGT/view). 

### Note about imports: 
Imports for this project are added in the sections in which they are required.

In [5]:
# import for acquisition
import os
import json
import requests
import draft_prepare as prep
import draft_explore as explore
import draft_model as model


# import for data manipulation
import pandas as pd
import numpy as np
from typing import Dict, List, Optional, Union, cast

# import to ignore warnings
import warnings
warnings.filterwarnings('ignore')

ModuleNotFoundError: No module named 'draft_prepare'

In [None]:
# acquire data from .csv saved and processed using functions found in acquire.py
df = pd.read_csv('../songs_0526.csv').drop(columns=['Unnamed: 0'])
# draft_prepare was imported above under the alias `prep`
df = prep.clean_df(df)
df = df.set_index('date')
df = df[(df.decade != 1950) & (df.decade != 2020)]

In [None]:
# obtain number of columns and rows for original dataframe
df.shape

### The Original DataFrame Size: ____ rows, or documents, and ____ columns.

==========================================================================================================================================================

## II. PREPARE

After data acquisition, the dataframe was analyzed and cleaned to facilitate exploration and resolve unclear variables. The preparation of this data can be replicated using the prep_data function saved within the prepare.py file inside the [Lyrical Evolution](https://github.com/CBRJ-Lyrical-Metrics/song-lyrics-capstone) repository on GitHub. The function takes in the original acquired dataframe and returns it with the changes noted below.

**Steps Taken to Clean & Prepare Data:**

- Cleaning: 
    - Make all text lowercase
    - Normalize, encode, and decode to remove accented text and special characters
    - Tokenize strings to break words and punctuation into discrete units
    - Expand abbreviated contractions
    - Lemmatize words to acquire base words
    - Remove stopwords
    - Convert date to DateTime format
    - Remove song part identifiers ('lyrics' 'verse', 'chorus', 'hook', 'embed')
    
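Several of the cleaning steps listed above can be sketched with the standard library alone. This is a simplified stand-in, not the project's `prep_data` function: the regex-and-split tokenization replaces NLTK's tokenizer, the stopword set is a tiny hypothetical sample, and contraction expansion and lemmatization are omitted.

```python
import re
import unicodedata
from typing import List

# Song part identifiers named in the cleaning list above
SECTION_WORDS = {"lyrics", "verse", "chorus", "hook", "embed"}
# Tiny illustrative stopword set (the project uses NLTK's stopword corpus)
STOPWORDS = {"the", "a", "and", "i", "you", "to", "of", "in", "it", "is"}


def basic_clean(text: str) -> List[str]:
    """Lowercase, strip accents and special characters, tokenize, and filter."""
    text = text.lower()
    # Normalize, encode, and decode to drop accented characters
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8")
    # Replace anything that is not a letter, digit, apostrophe, or whitespace
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS and t not in SECTION_WORDS]


print(basic_clean("[Chorus] Café in the RAIN!"))  # ['cafe', 'rain']
```
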
---   
- Address missing values, data errors, unnecessary data, and unclear values:
    - 
    - Drop missing values to prevent impediments in exploration and modeling: ______ documents/observations that had null values in the ______ column 
    - Drop all rows where ________
    - Total dropped documents = _______
---    
- Create feature engineered columns:
    - Decade 
    - Chorus Count
    - Verse Count
    - Verse/Chorus Ratio
    - Word Count
    - Unique Words per Song
    - Unique Words per Decade
    - Bigrams
    - Trigrams
    
- Apply Natural Language Processing (NLP) Methods:
    - Topic Modeling
    - Sentiment Analysis
    
---
- Split corpus into train, validate, and test samples 
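
A few of the feature engineered columns listed above can be derived as in the sketch below. The column names, the toy lyrics, and the assumption that raw lyrics carry bracketed section tags such as `[Chorus]` and `[Verse 1]` are all illustrative; the project's own feature code may differ.

```python
import re
import pandas as pd

# Toy songs with bracketed section tags (invented text)
songs = pd.DataFrame({
    "date": pd.to_datetime(["1964-02-01", "1999-07-17"]),
    "lyrics": [
        "[Verse 1]\nbaby come home\n[Chorus]\ncome home tonight\n[Chorus]\ncome home tonight",
        "[Verse 1]\ndance all night\n[Verse 2]\ndance till dawn\n[Chorus]\nall night long",
    ],
})

# Decade: floor the year to the nearest ten
songs["decade"] = songs["date"].dt.year // 10 * 10

# Chorus and verse counts from the section tags
songs["chorus_count"] = songs["lyrics"].str.count(r"\[chorus", flags=re.IGNORECASE)
songs["verse_count"] = songs["lyrics"].str.count(r"\[verse", flags=re.IGNORECASE)

# Word counts after removing the bracketed tags
words = (
    songs["lyrics"]
    .str.replace(r"\[[^\]]*\]", " ", regex=True)
    .str.lower()
    .str.split()
)
songs["word_count"] = words.str.len()
songs["unique_words"] = words.apply(lambda w: len(set(w)))

print(songs.drop(columns="lyrics"))
```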


**Note on Splitting Data:**



**Note on Missing Value Handling:**
The missing value removal equated to removing ______ observations/documents, which was about ___   \% of the data set. It still left _______ observations, a substantial number. If given more time with the data, it is recommended to investigate other ways to impute the missing data.

---

## Results of Data Preparation

In [None]:
# import for prepare
# alias matches the `prepare.` calls used in the cells below
import draft_prepare as prepare
import unicodedata
import re

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

from time import strftime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from collections import Counter

In [None]:
# apply the data preparation observations and tasks to clean the data using the prep_data function found in the prepare.py
df = prepare.prep_data(df)
# view first few rows of dataframe
# obtain the number of rows and columns for the updated/cleaned dataframe 
print(df.shape)
df.head()

## Prepared DataFrame Size: 134 rows, or documents, and 13 columns.

---

### PREPARE - SPLIT

In [None]:
# import for split
from sklearn.model_selection import train_test_split

After preparing the corpus, it was split into 3 samples (train, validate, and test) using:

- Random State: 42
- Test = 20% of the original dataset
- The remaining 80% of the dataset is divided between validate and train
    - Validate (.30*.80) = 24% of the original dataset
    - Train (.70*.80) = 56% of the original dataset
    
The split of this data can be replicated using the split_data function saved within the prepare.py file inside the [Lyrical Evolution](https://github.com/CBRJ-Lyrical-Metrics/song-lyrics-capstone) repository on GitHub.
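
The proportions above follow from two consecutive calls to scikit-learn's `train_test_split`. The sketch below shows the arithmetic on a toy 100-row frame; it is an assumption about how `split_data` might work, not a copy of it, and it omits any stratification the project may apply.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy corpus of 100 documents (placeholder text and labels)
corpus = pd.DataFrame({
    "lyrics": [f"song {i}" for i in range(100)],
    "decade": [1960 + 10 * (i % 6) for i in range(100)],
})

# First split: hold out 20% of the full corpus as test
train_validate, test = train_test_split(corpus, test_size=0.20, random_state=42)

# Second split: 30% of the remaining 80% becomes validate (24% overall)
train, validate = train_test_split(train_validate, test_size=0.30, random_state=42)

print(train.shape, validate.shape, test.shape)  # (56, 2) (24, 2) (20, 2)
```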

In [None]:
# split the data into train, validate, and test using the split_data function found in the prepare.py
train, validate, test = prepare.split_data(df)
# obtain the number of rows and columns for the splits
train.shape, validate.shape, test.shape

==========================================================================================================================================================

## III. EXPLORE

In [None]:
# import for data visualization
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib.ticker import StrMethodFormatter
from matplotlib import style
from wordcloud import WordCloud
import explore

In [None]:
# Set Universal Visualization Formatting

# determine figure size
plt.rc('figure', figsize=(20, 8))
# determine font size
plt.rc('font', size=15)
# determine style
plt.style.use('seaborn-deep')

After acquiring and preparing the corpus, exploration was conducted. All univariate exploration was completed on the entire cleaned corpus in the workbook for this project. For the purpose of the final report, only the target variable is displayed in order to reduce noise and provide focused context for the project. Following univariate exploration, the split sets (train, validate, and test samples) were used through modeling, and only the train set was used for bivariate and multivariate exploration to prevent data leakage.

---

### UNIVARIATE EXPLORATION

#### UNIVARIATE EXPLORATION of TARGET VARIABLE

In [None]:
# create visualization
# plot the target variable (decade) as a share of the corpus
df.decade.value_counts().sort_index().plot(kind='pie', autopct="%1.1f%%")
# remove y axis label
plt.ylabel('')
# add title
plt.title('Share of Songs by Decade Across Corpus')
plt.show()

#### OBSERVATIONS: 
- 

---

### EXPLORATION QUESTIONS

All bivariate exploration was conducted on the train corpus to prevent data leakage. The initial questions and univariate exploration guided the bivariate exploration.

#### EXPLORE QUESTIONS

### QUESTION 1: 

In [None]:
#### ANSWER 1: 

In [None]:
---

In [None]:
### QUESTION 2: 

In [None]:
#### ANSWER 2: 

In [None]:
---

In [None]:
### QUESTION 3: 

In [None]:
#### ANSWER 3: 

In [None]:
---

In [None]:
### QUESTION 4: 

In [None]:
#### ANSWER 4: 

In [None]:
---

In [None]:
### QUESTION 5: 

In [None]:
#### ANSWER 5: 

In [None]:
---

In [None]:
### EXPLORATION SUMMARY

==========================================================================================================================================================

In [None]:
### SPLIT???

## IV. MODEL

### Focus of Model Metrics

==========================================================================================================================================================

## V. CONCLUSION

==========================================================================================================================================================