# Performing analysis on lyric data combined with DistilBERT's emotion classification

## Project Overview 

The dataset used in this project is the "Music Dataset: Lyrics and Metadata from 1950 to 2019" by Moura et al. (2020), which consists of music lyrics and metadata on those lyrics spanning nearly seven decades.

For this analysis, we loaded a pretrained DistilBERT model from the Hugging Face Model Hub and fine-tuned it on the "Emotions Dataset for NLP" by Govi (2019), which contains text collected from Twitter, with each example labeled with one of six emotions: "anger", "joy", "surprise", "fear", "love", and "sadness". We then used the fine-tuned model to predict an emotion label for the lyrics of each song in the "Music Dataset: Lyrics and Metadata from 1950 to 2019".

In this notebook, we take the predictions made by DistilBERT, associate each with its corresponding entry in the original "Music Dataset: Lyrics and Metadata from 1950 to 2019", and perform a time series analysis on the combined data. The aim is to identify trends in the overall emotional content of lyrics released over this period.

We find that, as time went on, the share of songs released each year that were classified as "anger" or "fear" increased, while the share classified as "love" decreased.

## Future Work

Collecting a dataset with a broader and more nuanced set of emotion classes would be fundamental to a deeper analysis of the emotional content of lyrics. Additionally, classifying individual parts of the lyrics would capture the various emotions that a single song can express. Because music does not consist of lyrics alone, a dataset with emotion labels for other aspects of a song, such as rhythm and melody, used alongside NLP emotion classification of the lyrics, would enable a broader understanding of the emotions a particular song communicates.

It would also be of interest to examine trends in the emotions expressed in music by geographical location, and to view trends over time alongside other factors that may affect artists' emotional expression, such as economic conditions or other historical developments.


## Citations

Govi, Praveen. "Emotions Dataset for NLP." Kaggle. Accessed July 11, 2024. https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp/data.

Moura, Luan, Emanuel Fontelles, Vinicius Sampaio, and Mardônio França. "Music Dataset: Lyrics and Metadata from 1950 to 2019." Mendeley Data, V2, 2020. doi: 10.17632/3t9vbwxgr5.2.




In [1]:
from typing import Any, Dict
from copy import deepcopy
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import f_regression
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Notebook Config

In [2]:
# Set this to True if you wish to save notebook outputs, False otherwise.
overwrite_saved = False
original_dataset_path = './raw_data/Music_Dataset_Lyrics_and_Metadata_from_1950_to_2019/tcc_ceds_music.csv'
emotion_classified_dataset_path = 'lyrics_with_emotion_preds.csv'
combined_dataset_csv_path = './processed_data/tcc_ceds_music_with_emotion_classification.csv'
percent_change_over_time_csv_path = './processed_data/percent_change_over_time.csv'

# Functions

In [3]:
def fast_index_match(df1: pd.DataFrame, df2: pd.DataFrame, col: str) -> bool:
    """
    Quickly check if two DataFrames have matching values in specified columns, in the same order.

    Parameters:
    -----------
    df1 : pd.DataFrame
        The first DataFrame to compare.
    df2 : pd.DataFrame
        The second DataFrame to compare.
    col : str
        The column name to use for comparison.

    Returns:
    --------
    bool
        True if there's a perfect match between the values in the specified columns,
        in the same order, with no duplicates. False otherwise.
    """
    # Check if lengths match
    if len(df1) != len(df2):
        return False
    
    # Check for duplicates
    if df1[col].duplicated().any() or df2[col].duplicated().any():
        return False
    
    # Reset index and select only the columns we're interested in
    df1_series = df1.reset_index(drop=True)[col]
    df2_series = df2.reset_index(drop=True)[col]
    
    # Use pandas equals method for fast comparison
    return df1_series.equals(df2_series)


def normalize_counts(row: Dict[str, Any], release_date_total_counts: Dict[Any, float]) -> float:
    """
    Normalizes the count of a row by the total counts of its release date.

    Args:
        row (Dict[str, Any]): A dictionary representing a row with at least 'count' and 'release_date' keys.
        release_date_total_counts (Dict[Any, float]): A dictionary mapping release dates to their total counts.

    Returns:
        float: The normalized count as a percentage.
    """
    return row['count'] / release_date_total_counts[row['release_date']] * 100


def fit_trendline_to_emotion_df(emotion_df: pd.DataFrame) -> tuple[np.ndarray, bool]:
    """
    Fit a linear trendline to the emotion data.

    This function takes a DataFrame containing emotion data with 'release_date' and 'percentage_share' columns,
    fits a linear regression model, and returns the predicted trendline.

    Parameters:
    -----------
    emotion_df : pd.DataFrame
        A DataFrame containing at least two columns:
        - 'release_date': The dates of the data points
        - 'percentage_share': The percentage share of the emotion for each date

    Returns:
    --------
    trendline : np.ndarray
        An array of predicted values forming the trendline, with the same length as the input DataFrame.
    significant : bool
        True if the trend is statistically significant (p < 0.05), False otherwise.
        
    Raises:
    -------
    ValueError
        If the input DataFrame doesn't contain the required columns.

    Example:
    --------
    >>> df = pd.DataFrame({'release_date': [2015, 2016, 2017, 2018, 2019],  # numeric years, as in the dataset
    ...                    'percentage_share': [.1, .1, .2, .2, .4]})
    >>> trendline, sig = fit_trendline_to_emotion_df(df)
    """
    if 'release_date' not in emotion_df.columns or 'percentage_share' not in emotion_df.columns:
        raise ValueError("Input DataFrame must contain 'release_date' and 'percentage_share' columns")

    X = np.array(emotion_df['release_date']).reshape(-1, 1)
    y = emotion_df['percentage_share'].values
    model = LinearRegression()
    model.fit(X, y)

    trendline = model.predict(X)

    # F-test on the single feature; the trend counts as significant at the 5% level
    f_statistic, p_value = f_regression(X, y)
    significant = bool(p_value[0] < 0.05)

    ## Uncomment if desired.
    # print(f"F-statistic: {f_statistic[0]}")
    # print(f"P-value: {p_value[0]}")
    # print(f"R-squared: {model.score(X, y)}")
    
    return trendline, significant

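The significance flag returned by `fit_trendline_to_emotion_df` comes from scikit-learn's `f_regression`. With a single predictor, that F-test should be equivalent to testing whether the Pearson correlation between year and share is zero. The sketch below recomputes the statistic by hand under that assumption. `slope_significance` and the toy `years` values are hypothetical names and data used only for illustration, and SciPy is assumed to be installed.

```python
import numpy as np
from scipy import stats


def slope_significance(x, y, alpha=0.05):
    """Return (F statistic, p-value, significant?) for a one-feature linear fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dof = len(x) - 2                      # residual degrees of freedom
    r = np.corrcoef(x, y)[0, 1]           # Pearson correlation between x and y
    f_stat = r**2 / (1.0 - r**2) * dof    # F statistic with (1, dof) degrees of freedom
    p_value = stats.f.sf(f_stat, 1, dof)  # upper-tail probability of the F distribution
    return f_stat, p_value, bool(p_value < alpha)


# Toy data: hypothetical years, shares taken from the docstring example
years = [2015, 2016, 2017, 2018, 2019]
shares = [0.1, 0.1, 0.2, 0.2, 0.4]
f_stat, p_value, significant = slope_significance(years, shares)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}, significant = {significant}")
```

The p-value here should agree with the two-sided slope test reported by `scipy.stats.linregress`, which can serve as a quick cross-check.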

# Load Datasets

In [4]:
original_dataset = pd.read_csv(original_dataset_path)

In [5]:
original_dataset.head()

Unnamed: 0.1,Unnamed: 0,artist_name,track_name,release_date,genre,lyrics,len,dating,violence,world/life,...,sadness,feelings,danceability,loudness,acousticness,instrumentalness,valence,energy,topic,age
0,0,mukesh,mohabbat bhi jhoothi,1950,pop,hold time feel break feel untrue convince spea...,95,0.000598,0.063746,0.000598,...,0.380299,0.117175,0.357739,0.454119,0.997992,0.901822,0.339448,0.13711,sadness,1.0
1,4,frankie laine,i believe,1950,pop,believe drop rain fall grow believe darkest ni...,51,0.035537,0.096777,0.443435,...,0.001284,0.001284,0.331745,0.64754,0.954819,2e-06,0.325021,0.26324,world/life,1.0
2,6,johnnie ray,cry,1950,pop,sweetheart send letter goodbye secret feel bet...,24,0.00277,0.00277,0.00277,...,0.00277,0.225422,0.456298,0.585288,0.840361,0.0,0.351814,0.139112,music,1.0
3,10,pérez prado,patricia,1950,pop,kiss lips want stroll charm mambo chacha merin...,54,0.048249,0.001548,0.001548,...,0.225889,0.001548,0.686992,0.744404,0.083935,0.199393,0.77535,0.743736,romantic,1.0
4,12,giorgos papadopoulos,apopse eida oneiro,1950,pop,till darling till matter know till dream live ...,48,0.00135,0.00135,0.417772,...,0.0688,0.00135,0.291671,0.646489,0.975904,0.000246,0.597073,0.394375,romantic,1.0


In [6]:
emotion_classified_dataset = pd.read_csv(emotion_classified_dataset_path)

In [7]:
emotion_classified_dataset.head()

Unnamed: 0.1,Unnamed: 0,lyrics_text,predicted_emotion_label,predicted_encoded_label
0,0,hold time feel break feel untrue convince spea...,sadness,4
1,1,believe drop rain fall grow believe darkest ni...,joy,2
2,2,sweetheart send letter goodbye secret feel bet...,sadness,4
3,3,kiss lips want stroll charm mambo chacha merin...,joy,2
4,4,till darling till matter know till dream live ...,sadness,4


In [8]:
emotion_classified_dataset = emotion_classified_dataset.rename(columns={'lyrics_text': 'lyrics'})

# Merge Datasets

In [9]:
fast_index_match(original_dataset, emotion_classified_dataset, 'lyrics')

True

In [10]:
combined_dataset = original_dataset.assign(predicted_emotion_label=emotion_classified_dataset['predicted_emotion_label'])

In [11]:
combined_dataset.head()

Unnamed: 0.1,Unnamed: 0,artist_name,track_name,release_date,genre,lyrics,len,dating,violence,world/life,...,feelings,danceability,loudness,acousticness,instrumentalness,valence,energy,topic,age,predicted_emotion_label
0,0,mukesh,mohabbat bhi jhoothi,1950,pop,hold time feel break feel untrue convince spea...,95,0.000598,0.063746,0.000598,...,0.117175,0.357739,0.454119,0.997992,0.901822,0.339448,0.13711,sadness,1.0,sadness
1,4,frankie laine,i believe,1950,pop,believe drop rain fall grow believe darkest ni...,51,0.035537,0.096777,0.443435,...,0.001284,0.331745,0.64754,0.954819,2e-06,0.325021,0.26324,world/life,1.0,joy
2,6,johnnie ray,cry,1950,pop,sweetheart send letter goodbye secret feel bet...,24,0.00277,0.00277,0.00277,...,0.225422,0.456298,0.585288,0.840361,0.0,0.351814,0.139112,music,1.0,sadness
3,10,pérez prado,patricia,1950,pop,kiss lips want stroll charm mambo chacha merin...,54,0.048249,0.001548,0.001548,...,0.001548,0.686992,0.744404,0.083935,0.199393,0.77535,0.743736,romantic,1.0,joy
4,12,giorgos papadopoulos,apopse eida oneiro,1950,pop,till darling till matter know till dream live ...,48,0.00135,0.00135,0.417772,...,0.00135,0.291671,0.646489,0.975904,0.000246,0.597073,0.394375,romantic,1.0,sadness

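The merge above assigns predictions by row position, which is only safe because `fast_index_match` returned `True`. As a hedged alternative, the sketch below joins on the `lyrics` column instead, so the result does not depend on row order. The toy DataFrames `songs` and `preds` are hypothetical stand-ins, and the approach assumes the `lyrics` values are unique, which `validate="one_to_one"` enforces by raising an error otherwise.

```python
import pandas as pd

# Toy stand-ins for the original and emotion-classified datasets (hypothetical values)
songs = pd.DataFrame({
    "track_name": ["a", "b", "c"],
    "lyrics": ["la la", "do re mi", "na na"],
})
preds = pd.DataFrame({
    "lyrics": ["la la", "do re mi", "na na"],
    "predicted_emotion_label": ["joy", "sadness", "anger"],
})

# Reverse the predictions to show that the join does not rely on row order
shuffled_preds = preds.iloc[::-1].reset_index(drop=True)

# A left join keeps the row order of `songs`; validate raises MergeError on duplicate keys
merged = songs.merge(shuffled_preds, on="lyrics", how="left", validate="one_to_one")
print(merged)
```

The trade-off is speed: a key-based join on long lyric strings is slower than positional assignment, so the positional approach remains reasonable once alignment has been verified.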

## Save Merged

In [12]:
if overwrite_saved:
    combined_dataset.to_csv(combined_dataset_csv_path)

# Analyzing Emotion Trends over Time

## Get Counts of Each Emotion for Each Release Date

In [13]:
# Step 1: Create all possible combinations of release_date and predicted_emotion_label
all_combinations = pd.MultiIndex.from_product([
    combined_dataset['release_date'].unique(),
    combined_dataset['predicted_emotion_label'].unique()
], names=['release_date', 'predicted_emotion_label'])

# Step 2: Perform the groupby operation
grouped = combined_dataset.groupby(['release_date', 'predicted_emotion_label']).size()

# Step 3: Reindex with all combinations, filling missing values with 0
emotion_counts_per_year_df = grouped.reindex(all_combinations, fill_value=0).reset_index(name='count')

emotion_counts_per_year_df = emotion_counts_per_year_df.rename(columns={'predicted_emotion_label': 'emotion'})

In [14]:
emotion_counts_per_year_df.head()

Unnamed: 0,release_date,emotion,count
0,1950,sadness,13
1,1950,joy,31
2,1950,fear,2
3,1950,love,1
4,1950,anger,3

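The three steps above (enumerate every year and emotion pair, count, fill gaps with zero) can also be sketched with `pd.crosstab`, which produces zero counts for missing combinations on its own. The `toy` DataFrame below is hypothetical and only mirrors the two column names used in this notebook.

```python
import pandas as pd

# Hypothetical miniature of the combined dataset
toy = pd.DataFrame({
    "release_date": [1950, 1950, 1950, 1951, 1951],
    "predicted_emotion_label": ["joy", "joy", "sadness", "fear", "joy"],
})

# Wide table: one row per year, one column per emotion, zeros where no song matched
counts_wide = pd.crosstab(toy["release_date"], toy["predicted_emotion_label"])

# Long table with the same columns as emotion_counts_per_year_df
counts_long = (
    counts_wide.reset_index()
    .melt(id_vars="release_date", var_name="emotion", value_name="count")
    .sort_values(["release_date", "emotion"])
    .reset_index(drop=True)
)
print(counts_long)
```

One difference from the cell above: `crosstab` sorts years and emotions, whereas the `MultiIndex.from_product` approach keeps them in order of first appearance.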

## Plot Emotion Counts Over Time

In [15]:
fig = px.line(emotion_counts_per_year_df, x="release_date", y="count", color='emotion', title='Emotion Counts Over Time')
fig.show()

## Plot Percentage Share of the Total Number of Songs Released Each Year for Each Emotion Over Time 

In [16]:
release_date_total_counts = combined_dataset['release_date'].value_counts().to_dict()

In [17]:
emotion_percentages_per_year_df = deepcopy(emotion_counts_per_year_df)

In [18]:
emotion_percentages_per_year_df['count'] = emotion_counts_per_year_df.apply(normalize_counts, release_date_total_counts=release_date_total_counts, axis=1)
emotion_percentages_per_year_df = emotion_percentages_per_year_df.rename(columns={'count': 'percentage_share'})

In [19]:
emotion_percentages_per_year_df.head()

Unnamed: 0,release_date,emotion,percentage_share
0,1950,sadness,25.490196
1,1950,joy,60.784314
2,1950,fear,3.921569
3,1950,love,1.960784
4,1950,anger,5.882353

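As a worked check of the normalization, the sketch below recomputes the 1950 percentages with a vectorized `groupby(...).transform("sum")` instead of a row-wise `apply`. The first five counts come from the counts table earlier in this notebook; the "surprise" count of 1 is an inference from the printed percentages (13 / 51 ≈ 25.49%), not a value shown directly.

```python
import pandas as pd

counts = pd.DataFrame({
    "release_date": [1950] * 6,
    "emotion": ["sadness", "joy", "fear", "love", "anger", "surprise"],
    # Last value (surprise = 1) is inferred so that the yearly total is 51
    "count": [13, 31, 2, 1, 3, 1],
})

# Total number of songs per release year, broadcast back to every row
yearly_totals = counts.groupby("release_date")["count"].transform("sum")

# Share of each emotion within its release year, as a percentage
counts["percentage_share"] = counts["count"] / yearly_totals * 100
print(counts)
```

Within each year the shares sum to 100, which makes a convenient sanity check on the full dataset as well.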

### Plotting all Emotions Together 

In [20]:
fig = px.line(emotion_percentages_per_year_df, x="release_date", y="percentage_share", color='emotion', title='Percentage Share of Each Emotion Over Time')
fig.show()

### Making Subplots for Each Emotion 

In [21]:
# Deduplicate first, then sort, so the subplot order is deterministic across runs
emotion_labels = sorted(set(emotion_percentages_per_year_df['emotion']))
emotion_labels

['anger', 'fear', 'joy', 'love', 'sadness', 'surprise']

In [22]:
emotion_titles = tuple([emotion.capitalize() for emotion in emotion_labels])
emotion_trend_significance = {}

fig = make_subplots(rows=3, cols=2, subplot_titles=emotion_titles, y_title='Percentage Share of Emotion', x_title='Release Date')

for idx, emotion in enumerate(emotion_labels):
    # Select this emotion's rows with a boolean mask (keeps the original dtypes)
    emotion_df = emotion_percentages_per_year_df[emotion_percentages_per_year_df['emotion'] == emotion]
    trendline, significant = fit_trendline_to_emotion_df(emotion_df)
    emotion_trend_significance[emotion] = significant

    # Fill the 3x2 grid left to right, top to bottom (plotly rows and columns are 1-indexed)
    row_val = idx // 2 + 1
    col_val = idx % 2 + 1

    # Add the trendline with modified appearance based on significance
    trendline_color = 'black'
    trendline_dash = 'solid' if significant else 'dot'
    
    fig.add_trace(go.Scatter(x=emotion_df['release_date'], y=emotion_df['percentage_share'], mode='lines', name=emotion), row=row_val, col=col_val)
    fig.add_trace(go.Scatter(x=emotion_df['release_date'], y=trendline, mode='lines', name=f'{emotion} trend', line=dict(color=trendline_color, dash=trendline_dash)), row=row_val, col=col_val)


fig.update_layout(height=600, width=800, title_x=0.5, title_y=.95, title_text="Trends in Percentage Share of Each of Six Emotions Detected<br>in Song Lyrics Per Release Date (Grouped by Year), 1950-2019", showlegend=False)
fig.show()

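The subplots show whether each trend is significant, but not how large it is. One way to quantify it, sketched below, is to report the fitted slope in percentage points per year. The helper `trend_slope` and the synthetic series are hypothetical; on the real data the inputs would be the `release_date` and `percentage_share` columns of each `emotion_df`.

```python
import numpy as np


def trend_slope(years, shares):
    """Least-squares slope of shares against years, in percentage points per year."""
    slope, _intercept = np.polyfit(
        np.asarray(years, dtype=float),
        np.asarray(shares, dtype=float),
        deg=1,
    )
    return float(slope)


# Synthetic series: a share that rises by exactly 0.5 points each year
years = np.arange(1950, 1960)
shares = 5.0 + 0.5 * (years - 1950)

slope = trend_slope(years, shares)
print(f"{slope:.2f} percentage points per year, {slope * 10:.1f} per decade")
```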



# Save Processed Dataframe to CSV

In [None]:
if overwrite_saved:
    emotion_percentages_per_year_df.to_csv(percent_change_over_time_csv_path)