#### This notebook contains all code for preprocessing the lyrics dataset and producing the backend of a hypothetical song recommendation app that recommends songs to users based on lyrics similarity.

# Package imports

In [79]:
# Imports
import os
import regex as re
import pandas as pd
import numpy as np
import sklearn as skl
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Data loading and preprocessing
I use the "Song Lyrics Dataset" from Kaggle: https://www.kaggle.com/datasets/deepshah16/song-lyrics-dataset.
This dataset contains data on ~6000 songs from several pop artists, including full text of the lyrics from each song. The data comes in the form of CSV files (or JSON files), one for each artist. I will choose to load from CSV and combine all data into one DataFrame.

First we inspect a single one of these artist CSVs.

In [80]:
print("Files: ", os.listdir("lyrics_dataset/csv"))
ColdPlay = pd.read_csv('lyrics_dataset\csv\ColdPlay.csv')
ColdPlay.head()

Files:  ['ArianaGrande.csv', 'Beyonce.csv', 'BillieEilish.csv', 'BTS.csv', 'CardiB.csv', 'CharliePuth.csv', 'ColdPlay.csv', 'Drake.csv', 'DuaLipa.csv', 'EdSheeran.csv', 'Eminem.csv', 'JustinBieber.csv', 'KatyPerry.csv', 'Khalid.csv', 'LadyGaga.csv', 'Maroon5.csv', 'NickiMinaj.csv', 'PostMalone.csv', 'Rihanna.csv', 'SelenaGomez.csv', 'TaylorSwift.csv']


Unnamed: 0.1,Unnamed: 0,Artist,Title,Album,Year,Date,Lyric
0,0,Coldplay,The Scientist,A Rush of Blood to the Head,2002.0,2002-08-26,come up to meet you tell you i'm sorry you don...
1,1,Coldplay,Viva la Vida,Viva La Vida or Death and All His Friends,2008.0,2008-05-25,chris martin i used to rule the world seas wou...
2,2,Coldplay,Fix You,X&Y,2005.0,2005-06-06,chris martin when you try your best but you do...
3,3,Coldplay,Yellow,Parachutes,2000.0,2000-06-26,chris martin look at the stars look how they s...
4,4,Coldplay,Hymn for the Weekend,A Head Full of Dreams,2016.0,2016-01-25,beyoncé and said drink from me drink from me o...


In [81]:
del(ColdPlay)

We have several unnecessary features. For my lyric recommendation engine, I want only the lyrics data and a single ID feature, for which I can use a concatenation of Artist & Title. 

I design a loop for quickly loading the 21 CSV files and preprocessing them accordingly.

This loop is the product of additional preprocessing and exploration that has been omitted here:
* I discovered NAN values in some lyrics, which I now remove here.

* A number of songs in this dataset are remixes, demos, or other copies of other songs in this dataset. In these cases the lyrics match almost exactly, resulting in recommendations for a song containing simply several remixes, covers, etc. of that song. I want a system that recommends **different** songs, so I drop these "duplicates" dataset.
    * In order to avoid dropping songs that contain words like "live" or "mix" in the title that do not necessarily indicate a live recording or an artist mix, I drop where certain words are preceeded or followed by a parenthesis or brackets. This manages to eliminate nearly all rows I want to drop without dropping any of the rows that I want to keep. 
    * In a final product, requesting recommendations on a remix would internally just use the **original song** for calculating recommendations. This could be done by a simple mapping or find-and-replace. For example, requesting recommendations for the song "Taylor Swift - Love Story (Digital Dog Remix)" would simply calculate recommendations using the song "Taylor Swift - Love Story". 

In [82]:
# List strings to search for in song titles to determine a row should be dropped 
drop_words = ["remix", "mix)", "mix]", "(live", "[live", "live from", "recorded live", "version)", "version]", "edit)", "edit]", "edited)", "edited]", "demo)", "demo]", "the beyonce experience live", "homecoming live", "(acoustic", "[acoustic", "acoustic)", "acoustic]"]
# Escape special characters
drop_words = [re.escape(word) for word in drop_words]
# Convert to logical statement for inputting to function argument
drop_words = '|'.join(drop_words)

In [83]:
# Set path to data
data_folder = "lyrics_dataset/csv"

# Initialize a master dataframe that I will combine all lyric data into.
lyrics_df = pd.DataFrame()

# Loop through all files in folder
for file in os.listdir(data_folder):
    # Search only for .csvs, just in case
    if file.endswith(".csv"):
        # Assign name
        dataframe_name = os.path.splitext(file)[0]
        # Get path
        file_path = os.path.join(data_folder, file)
        # Load each dataframe as a global variable
        globals()[dataframe_name] = pd.read_csv(file_path)
        # Drop rows where 'Lyric' is NaN
        globals()[dataframe_name] = globals()[dataframe_name].dropna(subset=["Lyric"])
        # Drop rows where Title contains "remix"
        globals()[dataframe_name] = globals()[dataframe_name][~globals()[dataframe_name]["Title"].str.contains(drop_words, case=False, na=False)]        
        # Merge artist and song name into one variable
        globals()[dataframe_name]["Title and Artist"] = (globals()[dataframe_name]["Artist"] + " - " + globals()[dataframe_name]["Title"])
        # Drop Date, Album and Year
        globals()[dataframe_name].drop(columns=["Artist", "Title", "Date", "Album", "Year"], 
                                       inplace=True)
        # Make "Title and Artist" the first column
        cols = ["Title and Artist"] + ["Lyric"]
        globals()[dataframe_name] = globals()[dataframe_name][cols]
        # Append the current DataFrame to the lyrics_df DataFrame
        lyrics_df = pd.concat([lyrics_df, globals()[dataframe_name]],
                              ignore_index=True)
        # DEBUG PRINT to inspect the outputs of this loop
        print("************************")
        print(dataframe_name, "dataset: ")
        print(globals()[dataframe_name].head())
        print("************************")
        print("\n")

lyrics_df.head()

************************
ArianaGrande dataset: 
                         Title and Artist  \
0          Ariana Grande - ​thank u, next   
1                 Ariana Grande - 7 rings   
2         Ariana Grande - ​God is a woman   
3            Ariana Grande - Side To Side   
4  Ariana Grande - ​​no tears left to cry   

                                               Lyric  
0  thought i'd end up with sean but he wasn't a m...  
1  yeah breakfast at tiffany's and bottles of bub...  
2  you you love it how i move you you love it how...  
3  ariana grande  nicki minaj i've been here all ...  
4  right now i'm in a state of mind i wanna be in...  
************************


************************
Beyonce dataset: 
          Title and Artist                                              Lyric
0  Beyoncé - Drunk in Love  beyoncé i've been drinkin' i've been drinkin' ...
1      Beyoncé - Formation  messy mya what happened at the new wil'ins bit...
2      Beyoncé - Partition  part  yoncé   let m

Unnamed: 0,Title and Artist,Lyric
0,"Ariana Grande - ​thank u, next",thought i'd end up with sean but he wasn't a m...
1,Ariana Grande - 7 rings,yeah breakfast at tiffany's and bottles of bub...
2,Ariana Grande - ​God is a woman,you you love it how i move you you love it how...
3,Ariana Grande - Side To Side,ariana grande nicki minaj i've been here all ...
4,Ariana Grande - ​​no tears left to cry,right now i'm in a state of mind i wanna be in...


The following was some of the code originally used to devise my additional stop words & drop words. Left here for example.

In [84]:
suspect_rows = lyrics_df[lyrics_df["Title and Artist"].str.contains("mix", case=False)]["Title and Artist"]
suspect_rows

414           Beyoncé - Baby Boy [Junior’s World Mixshow]
2233                        Eminem - Kill My Pain (Mixup)
2345    Eminem - Jimmy Crack Corn (Cashis Vocal Mix (EX))
3191                      Lady Gaga - The DJ Vice Megamix
Name: Title and Artist, dtype: object

# Recommendation algorithm

First I build my stopword list. Adapted from Scikit-Learn framework. I decide to include some additional stopwords, like "remix" or "produced" as the original dataset does not seem to be perfectly clean, sometimes containing words like these before the actual lyrics. I also drop zero-width white space characters (\u200b), which appear in some places.

In [85]:
STOP_WORD_LIST = [
    "a", "about", "above", "across", "after", "afterwards", "again", "against", "all", "almost", "alone", "along", "already", "also", "although", "always", "am", "among", "amongst", "amoungst", "amount", "an", "and", "another", "any", "anyhow", "anyone", "anything", "anyway", "anywhere", "are", "around", "as", "at", "back", "be", "became", "because", "become", "becomes", "becoming", "been", "before", "beforehand", "behind", "being", "below", "beside", "besides", "between", "beyond", "bill", "both", "bottom", "but", "by", "call", "can", "cannot", "cant", "co", "con", "could", "couldnt", "cry", "de", "describe", "detail", "do", "done", "down", "due", "during", "each", "eg", "eight", "either", "eleven", "else", "elsewhere", "empty", "enough", "etc", "even", "ever", "every", "everyone", "everything", "everywhere", "except", "few", "fifteen", "fifty", "fill", "find", "fire", "first", "five", "for", "former", "formerly", "forty", "found", "four", "from", "front", "full", "further", "get", "give", "go", "had", "has", "hasnt", "have", "he", "hence", "her", "here", "hereafter", "hereby", "herein", "hereupon", "hers", "herself", "him", "himself", "his", "how", "however", "hundred", "i", "ie", "if", "in", "inc", "indeed", "interest", "into", "is", "it", "its", "itself", "keep", "last", "latter", "latterly", "least", "less", "ltd", "made", "many", "may", "me", "meanwhile", "might", "mill", "mine", "more", "moreover", "most", "mostly", "move", "much", "must", "my", "myself", "name", "namely", "neither", "never", "nevertheless", "next", "nine", "no", "nobody", "none", "noone", "nor", "not", "nothing", "now", "nowhere", "of", "off", "often", "on", "once", "one", "only", "onto", "or", "other", "others", "otherwise", "our", "ours", "ourselves", "out", "over", "own", "part", "per", "perhaps", "please", "put", "rather", "re", "same", "see", "seem", "seemed", "seeming", "seems", "serious", "several", "she", "should", "show", "side", "since", "sincere", "six", "sixty", "so", "some", "somehow", "someone", "something", "sometime", "sometimes", "somewhere", "still", "such", "system", "take", "ten", "than", "that", "the", "their", "them", "themselves", "then", "thence", "there", "thereafter", "thereby", "therefore", "therein", "thereupon", "these", "they", "thick", "thin", "third", "this", "those", "though", "three", "through", "throughout", "thru", "thus", "to", "together", "too", "top", "toward", "towards", "twelve", "twenty", "two", "un", "under", "until", "up", "upon", "us", "very", "via", "was", "we", "well", "were", "what", "whatever", "when", "whence", "whenever", "where", "whereafter", "whereas", "whereby", "wherein", "whereupon", "wherever", "whether", "which", "while", "whither", "who", "whoever", "whole", "whom", "whose", "why", "will", "with", "within", "without", "would", "yet", "you", "your", "yours", "yourself", "yourselves", "remix", "mix", "produced", "producer", "edit", "oh", "ah" "ra", "la", "\u200b"
]


Next I get only the lyrics from my dataframe, and produce a matrix of TF-IDF vectorized lyric data. I also save the "Title and Artist" data for indexing.

In [86]:
# Get lyrics column
lyrics = lyrics_df['Lyric']

# Initialize tfidf vectorizer
tfidf = TfidfVectorizer(max_features=None, 
                        stop_words=STOP_WORD_LIST, 
                        lowercase=True)

# Fit vectorizer and produce TF-IDF matrix
tfidf_matrix = tfidf.fit_transform(lyrics)

# Save song & artist column as indices for mapping
songs_and_artists = lyrics_df['Title and Artist'].tolist()
        

I use cosine similarity to measure song similarity by 

In [87]:
# Compute the cosine similarity matrix
similarity_matrix = cosine_similarity(tfidf_matrix)

In [88]:
def recommend_songs(query_song, songs_and_artists, similarity_matrix, top_n=5):
    """
    Recommend songs based on lyrics similarity.

    Parameters:
    - query_song: str, the title of the song to query
    - song_titles: list of song titles
    - similarity_matrix: precomputed cosine similarity matrix
    - top_n: int, number of recommendations to return

    Returns:
    - list of recommended songs
    """
    # Find the index of the query song
    try:
        idx = songs_and_artists.index(query_song)
    except ValueError:
        return f"Song '{query_song}' not found in the dataset."

    # Get similarity scores for the song
    similarity_scores = list(enumerate(similarity_matrix[idx]))

    # Sort by similarity score (descending) and exclude the query song
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    similarity_scores = [s for s in similarity_scores if s[0] != idx]

    # Retrieve top N recommendations
    top_songs = [(songs_and_artists[i], score) for i, score in similarity_scores[:top_n]]

    return top_songs


In [89]:
query_song = "Coldplay - The Scientist"

In [90]:
# Get recommendations
recommendations = recommend_songs(query_song, songs_and_artists, similarity_matrix, top_n=10)

# Print results
print(f"Recommendations for '{query_song}':")
for song, score in recommendations:
    print(f"- {song} (similarity: {score:.2f})")


Recommendations for 'Coldplay - The Scientist':
- Eminem - The Monster (similarity: 0.28)
- Drake - Signs (similarity: 0.21)
- Coldplay - WOTW / POTP (similarity: 0.20)
- Drake - Easy (similarity: 0.19)
- Beyoncé - Why Don’t You Love Me (similarity: 0.17)
- Rihanna - Complicated (similarity: 0.17)
- Taylor Swift - ​invisible string (the long pond studio sessions) (similarity: 0.17)
- Coldplay - Major Minus (similarity: 0.16)
- Coldplay - Major minus - single version (similarity: 0.16)
- Rihanna - Yeah, I Said It (similarity: 0.15)
