# A LYRICAL EVOLUTION: 

### An Investigation of the Cultural Lexicon & Historical Relevance of U.S. Popular Music from 1958 - 2021

---

**An NLP-based Capstone Project & Final Report Created By:**

Ben Smith, Chris Teceno, Jerry Nolf & Rachel Robbins-Mayhill
Codeup   |   Innis Cohort   |   June 2022  

<img src="dataset-cover.png">

## Project Goal
This project aims to investigate the patterns of song lyrics across decades using Natural Language Processing techniques including Topic Modeling, Sentiment Analysis, and Term Frequency. 

## Project Description

Songs are powerful tokens: they can soothe, validate, ignite, confront, and educate us – among other things. Like time capsules, they are now captured for eternity. The slang and language used are often indicative of the times, and you can probably recall exactly when a song was made based on what is mentioned. There’s nothing quite like a song to provide a picture of what was going on culturally at a specified time. Arguably, music is a catalyst for societal and cultural evolution like no other artform. It has been causing controversy and societal upheaval for decades, and it seems with every generation there’s a new musical trend that has the older generations shaking their heads and clutching their jewelry. 


For centuries, songs were passed down through generations and sung like oral histories. With the technological advances of the 20th century, however, the world of music became a much smaller place: thanks to cheap, widely available audio equipment, songs could suddenly be distributed on a much larger scale, with farther-reaching impact and a more permanent place in history. This project aims to take those technological advances a step further with regard to the historical impact of song lyrics. By using machine learning and natural language processing methodologies to investigate songs of the past and categorize them by decade, songs can be analyzed by machines to provide a window into their cultural relevance and societal impact.

<img src='Billboard.png' width="350" height="350" align="left"/> The Billboard Hot 100 is the music industry standard record chart in the United States for songs, published weekly by Billboard magazine. Chart rankings are based on sales, radio play, and online streaming in the United States. It is arguably the best historical record of the impact of specific popular songs over time.

Every week, Billboard releases "The Hot 100" chart of songs that were trending on sales and airplay for that week. This project used a dataset from [Kaggle](https://www.kaggle.com/datasets/dhruvildave/billboard-the-hot-100-songs) that is a collection of all "The Hot 100" charts released since its inception in 1958.

## Initial Thoughts & Hypothesis

The initial hypothesis of this project was that the top songs of each decade, in conjunction with topic modeling, could be used to identify unique words or topics that serve as a feature or set of features for accurately predicting, with machine learning, the decade in which a song appeared on the Billboard Hot 100. The thought behind this was that popular songs have been the historians of a unique lexicon, specific to their place in time. We believe the lyrics of popular songs can be analyzed through machine learning to identify societal trends in relationships, technology, sexuality, and vulgarity.

## Initial Questions

The focus of this project is on identifying the decade in which a song first appeared on the Billboard Hot 100. Below are some of the initial questions this project looked to answer throughout the Data Science Pipeline.
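
As a rough illustration of the target described above, the sketch below derives the decade of a song's first chart appearance from a handful of made-up chart rows. The row layout, song names, and the `first_decade_by_song` helper are assumptions for illustration only; the actual dataset's columns may differ.

```python
from datetime import date

# Hypothetical chart rows: (song title, artist, chart week).
# The real dataset's column names and values may differ.
chart_rows = [
    ("Song A", "Artist 1", date(1969, 12, 27)),
    ("Song A", "Artist 1", date(1970, 1, 3)),
    ("Song B", "Artist 2", date(1984, 6, 9)),
    ("Song B", "Artist 2", date(1984, 6, 2)),
]


def first_decade_by_song(rows):
    """Map each (song, artist) pair to the decade of its earliest chart week."""
    first_seen = {}
    for song, artist, week in rows:
        key = (song, artist)
        # Keep only the earliest chart date seen for each song.
        if key not in first_seen or week < first_seen[key]:
            first_seen[key] = week
    # Floor the year to the start of its decade, e.g. 1969 -> 1960.
    return {key: (week.year // 10) * 10 for key, week in first_seen.items()}


decades = first_decade_by_song(chart_rows)
```

A song that charts across a decade boundary, like the hypothetical "Song A", is assigned to the decade of its earliest week only.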
 
##### Data-Focused Questions
- What are the most frequently occurring words?
- What are the most frequently occurring bigrams (pairs of words) by each decade?
- What decade did the song first appear in the Hot 100?
- What topics are most unique to each decade?
- Is there a correlation between sentiment and decade?
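
The word and bigram frequency questions above can be approached with simple counting. The following is a minimal sketch using only the standard library; the sample lyrics, the `songs` structure, and the `bigrams` helper are invented for illustration and are not the project's code.

```python
from collections import Counter, defaultdict

# Hypothetical (decade, cleaned lyrics) pairs, for illustration only.
songs = [
    (1960, "love me do love me do"),
    (1960, "i want to hold your hand"),
    (1980, "girls just want to have fun"),
]


def bigrams(tokens):
    """Return adjacent word pairs from a list of tokens."""
    return list(zip(tokens, tokens[1:]))


word_counts = Counter()                        # frequencies across the whole corpus
bigram_counts_by_decade = defaultdict(Counter)  # one bigram counter per decade

for decade, lyrics in songs:
    tokens = lyrics.split()
    word_counts.update(tokens)
    bigram_counts_by_decade[decade].update(bigrams(tokens))

# Most frequent words overall and most frequent bigrams for one decade.
top_words = word_counts.most_common(3)
top_1960s_bigrams = bigram_counts_by_decade[1960].most_common(2)
```

Keeping one `Counter` per decade makes it straightforward to compare which bigrams appear in one decade but not in others.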

## Key Findings

The key findings for this presentation are available in slide format by clicking on the [Final Slide Presentation](URL).

==========================================================================================================================================================

## I. ACQUIRE

### Note about imports: 
Imports for this project are added in the sections in which they are required.

In [3]:
# import for acquisition
import os
import json
import requests
import final_acquire
import final_prepare
import final_explore
import final_model
from env import github_token, github_username

# import for data manipulation
import pandas as pd
import numpy as np
from typing import Dict, List, Optional, Union, cast

# import to ignore warnings
import warnings
warnings.filterwarnings('ignore')

ModuleNotFoundError: No module named 'final_acquire'

In [None]:
# acquire data from .json saved and processed using functions found in wrangle.py
df = pd.read_json("data.json")
df.head()

In [None]:
# obtain number of columns and rows for original dataframe
df.shape

### The Original DataFrame Size: ____ rows, or documents, and ____ columns.

==========================================================================================================================================================

## II. PREPARE

After data acquisition, the table was analyzed and cleaned to facilitate exploration and resolve unclear variables. The preparation of this data can be replicated using the prep_data function saved within the prepare.py file inside the 'NLP-Project' repository on GitHub. The function takes in the original data.json dataframe and returns it with the changes noted below.

**Steps Taken to Clean & Prepare Data:**

- Basic Cleaning: 
    - Make all text lowercase
    - Normalize, encode, and decode to remove accented text and special characters
    - Tokenize strings to break words and punctuation into discrete units
    - Stem and Lemmatize words to acquire base words
    - Remove stopwords
    - Rename columns
---   
- Address missing values, data errors, unnecessary data, and unclear values:
    - Replace Jupyter Notebook values with Python after manually verifying most Jupyter Notebook entries used the Python programming language
    - Drop missing values to prevent impediments in exploration and modeling: 9 documents/observations that had null values in the language column 
    - Drop all rows where README length was 0
    - Total dropped documents = 32
---    
- Create feature engineered columns:
    - unique words
    - character count
    - word count
    - unique word count
    - most common word count (2nd, 3rd, 4th, 5th most common)
    - unique bigram count
    - count of bigrams unique to each language in train set (this is done by creating a new column for each language)
    
---
- Split corpus into train, validate, and test samples
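
The basic cleaning steps listed above can be sketched roughly as follows. This is a dependency-free approximation, not the project's prep_data function: the stopword list is a tiny stand-in for a fuller list such as NLTK's, the function names are hypothetical, and stemming and lemmatization are omitted because they require NLTK.

```python
import re
import unicodedata

# A tiny illustrative stopword list; the project presumably used a fuller
# list such as NLTK's English stopwords.
STOPWORDS = {"a", "an", "and", "the", "i", "you", "to", "of", "in", "it"}


def basic_clean(text):
    """Lowercase text, strip accents, and drop special characters."""
    text = text.lower()
    # NFKD splits accented characters into a base letter plus a combining
    # mark; encoding to ASCII with errors ignored then drops the marks.
    text = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("utf-8")
    )
    # Keep only letters, digits, apostrophes, and whitespace.
    return re.sub(r"[^a-z0-9'\s]", "", text)


def tokenize(text):
    """Split cleaned text into whitespace-delimited tokens."""
    return text.split()


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set."""
    return [token for token in tokens if token not in stopwords]


cleaned = remove_stopwords(tokenize(basic_clean("Déjà vu, I love the Café!")))
```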

**Note on Missing Value Handling:**
The missing value removal equated to removing 9 observations/documents, which was about 9\% of the data set. It still left a substantial number of observations above the minimum expectation of 100. If given more time with the data, it is recommended to investigate other ways to impute the missing data.
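
For reference, several of the feature-engineered columns named in the preparation steps (character count, word count, unique word count, most common word count, unique bigram count) could be computed along these lines. The `text_features` helper and its key names are hypothetical, and "unique" is assumed here to mean distinct words or bigrams within a single document.

```python
from collections import Counter


def text_features(cleaned_text):
    """Compute simple count-based features for one cleaned document."""
    tokens = cleaned_text.split()
    counts = Counter(tokens)
    # Distinct adjacent word pairs in this document.
    distinct_bigrams = set(zip(tokens, tokens[1:]))
    return {
        "character_count": len(cleaned_text),
        "word_count": len(tokens),
        "unique_word_count": len(counts),
        # Frequency of the single most common word (0 for empty text).
        "most_common_word_count": counts.most_common(1)[0][1] if counts else 0,
        "unique_bigram_count": len(distinct_bigrams),
    }


features = text_features("love me do love me do")
```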

---

## Results of Data Preparation

In [None]:
# import for prepare
import prepare
import unicodedata
import re

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

from time import strftime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from collections import Counter

In [None]:
# apply the data preparation steps to clean the data using the prep_data function found in prepare.py
df = prepare.prep_data(df)
# obtain the number of rows and columns for the updated/cleaned dataframe
print(df.shape)
# view first few rows of dataframe
df.head()

## Prepared DataFrame Size: 134 rows, or documents, and 13 columns.

---

### PREPARE - SPLIT

In [None]:
# import for split
from sklearn.model_selection import train_test_split

After preparing the corpus, it was split into three samples (train, validate, and test) using:

- Random State: ______
- Test = 20% of the original dataset
- The remaining 80% of the dataset is divided between validate and train
    - Validate (.30*.80) = 24% of the original dataset
    - Train (.70*.80) = 56% of the original dataset
    
The split of this data can be replicated using the split_data function saved within the prepare.py file inside the [_____](_________) repository on GitHub.
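
To make the proportions above concrete, here is a dependency-free sketch of a two-stage split over row indices. It is not the project's split_data function, which presumably relies on scikit-learn's train_test_split imported earlier; the `split_indices` name and the seed value of 42 are assumptions, since the report leaves the random state unspecified.

```python
import random


def split_indices(n_rows, test_frac=0.20, validate_frac=0.30, seed=42):
    """Shuffle row indices and slice them into train, validate, and test.

    test_frac is taken from the full dataset; validate_frac is taken from
    what remains after the test rows are removed.
    """
    indices = list(range(n_rows))
    # A seeded generator keeps the shuffle reproducible (seed is hypothetical).
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_frac)
    n_validate = round((n_rows - n_test) * validate_frac)
    test = indices[:n_test]
    validate = indices[n_test:n_test + n_validate]
    train = indices[n_test + n_validate:]
    return train, validate, test


train_idx, validate_idx, test_idx = split_indices(100)
```

With 100 rows, this yields 56 train, 24 validate, and 20 test rows, matching the 56% / 24% / 20% breakdown described above.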

In [None]:
# split the data into train, validate, and test using the split_data function found in the prepare.py
train, validate, test = prepare.split_data(df)
# obtain the number of rows and columns for the splits
train.shape, validate.shape, test.shape

==========================================================================================================================================================

## III. EXPLORE

In [None]:
# import for data visualization
import seaborn as sns
from matplotlib import pyplot as plt
from matplotlib.ticker import StrMethodFormatter
from matplotlib import style
from wordcloud import WordCloud
import explore

In [None]:
# Set Universal Visualization Formatting

# determine figure size
plt.rc('figure', figsize=(20, 8))
# determine font size
plt.rc('font', size=15)
# determine style (on matplotlib >= 3.6 this style is named 'seaborn-v0_8-deep')
plt.style.use('seaborn-deep')

After acquiring and preparing the corpus, exploration was conducted. All univariate exploration was completed on the entire cleaned corpus in the workbook for this project. For the purpose of the final report, only the target variable is displayed in order to reduce noise and provide focused context for the project. Following univariate exploration, the split sets (train, validate, and test samples) were used through modeling, with only the train set used for bivariate and multivariate exploration to prevent data leakage.

---

### UNIVARIATE EXPLORATION

#### UNIVARIATE EXPLORATION of TARGET VARIABLE

In [None]:
# create visualization
df.language.value_counts().plot(kind='pie', y='Language', autopct="%1.1f%%")
# remove y axis label
plt.ylabel(None)
# add title
plt.title('Top 4 Programming Languages Across Corpus by Percentage')
plt.show()

#### OBSERVATIONS: 
- 

---

### EXPLORATION QUESTIONS

All bivariate exploration was conducted on the train corpus to prevent data leakage. The initial questions and univariate exploration guided the bivariate exploration.

#### EXPLORE QUESTIONS

### QUESTION 1: 

==========================================================================================================================================================

## IV. MODEL

==========================================================================================================================================================

## V. CONCLUSION

==========================================================================================================================================================