# Spotify Recommendation System

# Introduction

Here, I'm introducing my project, in which I aim to build a Spotify Recommendation System based on song lyrics. It's an exciting challenge that combines my interest in data science with machine learning.

Below is an image that visualizes the Spotify Recommendation System I have built so far.

![My Spotify Recommendation System](img.png)


In [55]:
# Import the necessary library
import pandas as pd  # I chose pandas for its powerful data manipulation capabilities

# Read the dataset
df = pd.read_csv("spotify_millsongdata.csv")
# This line loads the Spotify Million Song Dataset into a DataFrame for analysis


## Dataset Overview

Embarking on this journey, the first step is to familiarize myself with the terrain: the dataset. It's akin to understanding the pieces of a puzzle before starting to put them together. This dataset isn't just numbers and text; it's the backbone of my Spotify Recommendation System. By examining the first and last entries, assessing the dataset's shape, and scouring for any missing values, I gain valuable insights into the data's structure and quality. This foundational knowledge is critical as I move forward with cleaning and processing the data for my recommendation model.

**Data Source Acknowledgment:**
My exploration is powered by the Spotify Million Song Dataset, available on Kaggle. This comprehensive collection of song metadata and features serves as the starting point for my project, offering a deep well of information to draw from. For those interested in the depth and breadth of music data available, I highly recommend checking out this dataset. [Explore the Spotify Million Song Dataset on Kaggle](https://www.kaggle.com/datasets/notshrirang/spotify-million-song-dataset).


## Data Understanding


In [56]:
# Displaying the first few rows to get a feel for the data structure
df.head(5)

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \r\nA..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \r\nTouch me gen..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \r\nWhy I had...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


In [57]:
# Checking the last few rows to see the data's end
df.tail(5)

Unnamed: 0,artist,song,link,text
57645,Ziggy Marley,Good Old Days,/z/ziggy+marley/good+old+days_10198588.html,Irie days come on play \r\nLet the angels fly...
57646,Ziggy Marley,Hand To Mouth,/z/ziggy+marley/hand+to+mouth_20531167.html,Power to the workers \r\nMore power \r\nPowe...
57647,Zwan,Come With Me,/z/zwan/come+with+me_20148981.html,all you need \r\nis something i'll believe \...
57648,Zwan,Desire,/z/zwan/desire_20148986.html,northern star \r\nam i frightened \r\nwhere ...
57649,Zwan,Heartsong,/z/zwan/heartsong_20148991.html,come in \r\nmake yourself at home \r\ni'm a ...


In [58]:
# Understanding the size of my dataset
df.shape

(57650, 4)

In [59]:
# Identifying if there are any missing values I need to deal with
df.isnull().sum()

artist    0
song      0
link      0
text      0
dtype: int64

## Data Cleaning & Processing

Here, I'm cleaning and processing the data to prepare it for analysis. This includes dropping unnecessary columns, sampling the dataset for manageability, and cleaning the text data.

In [60]:
# Dropping the unnecessary 'link' column and sampling 5,000 songs because of my limited processing power
df = df.sample(5000).drop("link", axis=1).reset_index(drop=True)


In [61]:
# Quick check to ensure the data looks good after initial cleaning
df.head(5)

Unnamed: 0,artist,song,text
0,Ian Hunter,Rape,He searched through his love like a thief on t...
1,Morrissey,Let Me Kiss You,"There's a place in the sun, \r\nFor anyone wh..."
2,'n Sync,Kiss Me At Midnight,Kiss me at midnight \r\n5...4...3...2...1 \r...
3,Nat King Cole,Let True Love Begin,When you're young \r\nYou're afraid of the da...
4,Leann Rimes,Why Can't We,Look how the wind carries the sea \r\nAnd see...


In [62]:
# Look at the lyrics of the first song in my dataset
df["text"][0]

'He searched through his love like a thief on the run  \r\nHe searched through his face - to see the guilt water run  \r\nBut he\'s fresh out of tears and nobody has come  \r\nAnd justice has got to be done  \r\n  \r\nOh moon in the city stay open and clear  \r\nFor his vision ain\'t good and his mind\'s disappeared  \r\n"get along mother nature" they spat at your son  \r\nSo justice has got to be done  \r\n  \r\nAnd beauty is lying alone in the park  \r\nHer friend has gone bowling in the alleys so dark  \r\nWhere\'s her knight in white armor who rides a chrome ford  \r\nJustice would seem to be bored  \r\nJustice would seem to be bored  \r\n  \r\nA knife full of life penetrated the bait  \r\nWhile he thinks \'o the sister and the mother that he hates  \r\nAnd he thinks he\'ll get off \'\'cause he\'s sick, and stoned  \r\nAnd justice was made to be honed  \r\nAnd justice was made to be honed  \r\n  \r\nAnd his lawyer is smiling one hell of a smile  \r\n\'n he\'s lying all the lies - o

In [63]:
# Confirming the dataset size after my processing
df.shape

(5000, 3)

In [64]:
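# My initial cleaning approach
# df["text"].str.lower().replace(r"^a-ZA-Z-09", "") 

# Cleaning the 'text' column to ensure consistency in my analysis.
# Note: without regex=True, the first replace() only matches cells that equal the
# pattern exactly, so it changes nothing here. Only the lowercasing and the
# removal of "\n" take effect, which is why "\r" and punctuation remain in the output.
df["text"] = df["text"].str.lower().replace(r"^\w\s", "").replace(r"\n", "", regex=True)

# Approach adapted from Stack Overflow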
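For comparison, here is a minimal sketch of what a regex-based cleaner could look like if the goal were to also strip punctuation and the `\r\n` line breaks. The helper name `clean_lyrics` and the exact character class are my own assumptions, not what the cell above runs. Apostrophes are kept so that contractions such as "it's" survive.

```python
import re

def clean_lyrics(text: str) -> str:
    """Lowercase lyrics, drop punctuation (except apostrophes), collapse whitespace.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    text = text.lower()
    # Replace anything that is not a word character, whitespace or apostrophe
    text = re.sub(r"[^\w\s']", " ", text)
    # Collapse runs of whitespace (including \r and \n) into single spaces
    return re.sub(r"\s+", " ", text).strip()

sample = "Hello, World!\r\nIt's a   TEST..."
print(clean_lyrics(sample))  # hello world it's a test
```

On the DataFrame this could be applied with `df["text"].apply(clean_lyrics)`. Doing so would change the outputs shown in the rest of this notebook, since they were produced with punctuation still in place.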

In [65]:
# Final check at the tail of the dataset to confirm text cleaning
df.tail(5)

Unnamed: 0,artist,song,text
4995,Def Leppard,Ring Of Fire,"fun girl, you tempted me, a feast of sparks \..."
4996,Wanda Jackson,I've Gotta Sing,the sun's shining beautiful and everything is ...
4997,Elton John,Come And Get It,"if you want it, here it is, come and get it \..."
4998,Linda Ronstadt,It's So Easy,it's so easy to fall in love \rit's so easy t...
4999,Tracy Chapman,A Hundred Years,baby sweet baby \rwon't you please \rcome on...


## Machine Learning Model (NLP)

I'm exploring the realm of Natural Language Processing (NLP) to delve deep into the lyrics, employing techniques such as tokenization and stemming to refine the text for our machine learning model. To ensure a solid foundation for these techniques, I've leveraged insightful resources from Kaggle, particularly focusing on topic modeling with LDA as a precursor to my model development. This resource has been instrumental in shaping my approach to handling and analyzing textual data, providing a comprehensive understanding of NLP's potential within machine learning frameworks.

**Inspiration and Learning Resource:**
- For an in-depth exploration of NLP and topic modeling, I found Samuel Cortinhas' Kaggle Notebook on Topic Modelling with LDA extremely helpful. It's a must-read for anyone looking to grasp the intricacies of NLP in a machine learning context. [Explore the Kaggle Notebook](https://www.kaggle.com/code/samuelcortinhas/nlp6-topic-modelling-with-lda).



In [66]:
# Importing NLTK for text processing
import nltk
from nltk.stem.porter import PorterStemmer

# The first time, I had to download the 'punkt' tokenizer data
# nltk.download('punkt')

# Initializing the stemmer to condense words to their roots
stemmer = PorterStemmer()

In the development of the following sections, especially those involving natural language processing (NLP), I've drawn inspiration and borrowed code snippets from the official NLTK documentation and its GitHub repository. These resources have been invaluable in understanding the intricacies of text processing and analysis. 

Additionally, my approach to tokenization was significantly influenced by a comprehensive Kaggle notebook on the topic. This notebook provided a practical and in-depth look at tokenization, which is a fundamental step in NLP. You can find these resources at:

- NLTK Documentation: [NLTK GitHub](https://github.com/nltk/nltk)
- Tokenization Techniques: [Kaggle Notebook on NLP Tokenization by Samuel Cortinhas](https://www.kaggle.com/code/samuelcortinhas/nlp1-tokenization)


In [67]:
# Defining my own function to tokenize and stem the song lyrics
def token(txt):
    tokens = nltk.word_tokenize(txt)
    stems = [stemmer.stem(w) for w in tokens]
    return " ".join(stems)

In [68]:
# Testing out my tokenization and stemming on a simple sentence
token("you are beautiful, beauty")

'you are beauti , beauti'

In [69]:
# Previewing my function on all the song lyrics in the dataset
# (the result is not assigned back, so df["text"] itself stays unstemmed)
df["text"].apply(token)

0       he search through hi love like a thief on the ...
1       there 's a place in the sun , for anyon who ha...
2       kiss me at midnight 5 ... 4 ... 3 ... 2 ... 1 ...
3       when you 're young you 're afraid of the dark ...
4       look how the wind carri the sea and see how th...
                              ...                        
4995    fun girl , you tempt me , a feast of spark in ...
4996    the sun 's shine beauti and everyth is go my w...
4997    if you want it , here it is , come and get it ...
4998    it 's so easi to fall in love it 's so easi to...
4999    babi sweet babi wo n't you pleas come on back ...
Name: text, Length: 5000, dtype: object

In [70]:
# Preparing the text data for similarity analysis with TF-IDF and Cosine Similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# TF-IDF Vectorizer will help me transform the text into a meaningful vector of numbers
tfid = TfidfVectorizer(analyzer="word", stop_words="english")
tfid

In [71]:
# This is where the magic happens, transforming the 'text' column
matrix = tfid.fit_transform(df["text"])
matrix

<5000x23629 sparse matrix of type '<class 'numpy.float64'>'
	with 267093 stored elements in Compressed Sparse Row format>
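To make the TF-IDF step less of a black box, here is a rough pure-Python sketch of the weighting on a toy corpus. It assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. The function name `tfidf_vectors` and the three toy "lyrics" are made up, and tokenization is a plain whitespace split rather than `TfidfVectorizer`'s own tokenizer and stop-word handling.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, list of L2-normalized TF-IDF vectors) for whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(docs)
    # Document frequency: in how many docs each word appears
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed inverse document frequency (assumed to mirror scikit-learn's default)
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[w] * idf[w] for w in vocab]
        # Scale each vector to unit length; guard against an empty document
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocab, vectors

toy_lyrics = ["love me love", "love the night", "dance all night"]
vocab, vectors = tfidf_vectors(toy_lyrics)
for word, weight in zip(vocab, vectors[1]):
    print(f"{word:>6}: {weight:.3f}")
```

In the second toy document, "the" appears in only one document and therefore receives a larger weight than "love" or "night", which each appear in two. This is the effect that lets rarer lyric words dominate the comparison.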

In [72]:
# Calculating the similarity between songs based on their lyrics
similar = cosine_similarity(matrix)

# Quick check on how similar the first song is to the rest
similar[0]

array([1.        , 0.01685246, 0.00627201, ..., 0.0115705 , 0.00721461,
       0.01882653])

After computing the TF-IDF matrix, I use cosine similarity to measure the similarity between song lyrics. Cosine similarity is a metric used to determine how similar the documents are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space. In the context of text analysis, these vectors represent the TF-IDF scores of two documents.

The cosine similarity is advantageous because even if two similar documents are far apart by the Euclidean distance (due to the size of the document), they could still be oriented closer together. The smaller the angle, the higher the cosine similarity.

- When the angle is 0°, the cosine similarity is 1, indicating that the vectors point in the same direction.
- When the angle is 90°, the cosine similarity is 0, meaning the vectors are orthogonal (the documents share no weighted terms).
- When the angle is 180°, the cosine similarity is -1, indicating that the vectors are diametrically opposed. TF-IDF weights are never negative, so for song lyrics the scores stay between 0 and 1.
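The cases above can be checked by hand with a small sketch. The `cosine` helper and the three toy vectors are my own illustration (not the scikit-learn implementation); they show that a song and a ten-times-longer song with the same word proportions score 1, even though their Euclidean distance is large.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

short_song = [1.0, 2.0, 0.0]
long_song = [10.0, 20.0, 0.0]   # same word proportions, ten times longer
other_song = [0.0, 0.0, 3.0]    # shares no words with the others

print(round(cosine(short_song, long_song), 6))   # 1.0
print(cosine(short_song, other_song))            # 0.0
print(math.dist(short_song, long_song))          # Euclidean distance is large despite cosine 1
```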

Here's an illustration that visually encapsulates this concept:

![Cosine Similarity Explanation](img2.jpg)

From this concept, I've learned a key aspect of how algorithms interpret the 'similarity' between different pieces of text. This understanding is crucial for the next steps in my recommendation system, as it allows me to rank songs by their lyrical similarity and recommend songs that share a similar thematic presence or lyrical structure to the user.


In [74]:
# Finding a song by name in my dataset
df[df["song"] == "Come And Get It"].index[0]

4997

## Spotify Recommendation Function

I'm wrapping up with one of the most crucial parts: the recommendation function. This function will take a song name as input and return a list of recommended songs based on their lyrics' similarity. 

In [76]:
# Defining the function for my Spotify Recommendation System.
def recommender(song_name):
    # First, I find the index of the song in the DataFrame. This is like finding the song's unique address in my list.
    idx = df[df["song"] == song_name].index[0]
    # Next, I retrieve the similarity scores for the song from the cosine similarity matrix as explained above
    # and sort them from most to least similar. Despite the name, 'distance' holds (index, similarity) pairs.
    distance = sorted(list(enumerate(similar[idx])), reverse=True, key=lambda x: x[1])
    
    # I then create an empty list to store the recommended songs.
    song = []
    
    # Now I go through the top of the sorted list of (index, similarity) pairs.
    # I skip the first one because it's the song itself (with a similarity score of 1)
    # and keep the next 5 as my recommendations.
    for s_id in distance[1:6]:
        # I look up the song title by its index in the DataFrame and add it to my list.
        song.append(df.iloc[s_id[0]].song)
    
    # Finally, I return the list of recommended songs back to the user.
    return song

# And Voila! It recommends me songs when I call this function with a song name!

# Putting my recommender to the test with a sample song
recommender("Come And Get It")

['Driving Too Fast',
 'Hold Fast To The Right',
 'Fast Car',
 'Boom Boom Mancini',
 "Goin' Down Slow"]

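One possible hardening of the function above, sketched under my own assumptions: the name `recommend_safe`, the `top_n` parameter, and the choice to return an empty list for an unknown title are hypothetical and not part of the notebook. It takes the titles and the similarity matrix as arguments so it can be tried on toy data without the full DataFrame.

```python
def recommend_safe(song_name, titles, similarity, top_n=5):
    """Return up to top_n titles most similar to song_name, or [] if it is unknown.

    titles: sequence of song titles; similarity: square matrix (list of lists
    or NumPy array) where similarity[i][j] scores songs i and j.
    """
    titles = list(titles)
    if song_name not in titles:
        return []  # avoids the IndexError raised by .index[0] on an empty match
    idx = titles.index(song_name)
    # Rank every other song by its similarity to the query, highest first
    ranked = sorted(
        (i for i in range(len(titles)) if i != idx),
        key=lambda i: similarity[idx][i],
        reverse=True,
    )
    return [titles[i] for i in ranked[:top_n]]

toy_titles = ["A", "B", "C", "D"]
toy_similarity = [
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
]
print(recommend_safe("A", toy_titles, toy_similarity, top_n=2))  # ['C', 'B']
print(recommend_safe("Missing", toy_titles, toy_similarity))     # []
```

With the notebook's objects, the call would look like `recommend_safe("Come And Get It", df["song"], similar)`. Excluding the query by index, rather than dropping the first sorted entry, also covers the case where another song has identical lyrics and ties at a score of 1.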
## Serialization

Finally, I'm saving my work so I can easily use or share it later. This includes saving the similarity matrix and the processed DataFrame. This step matters for deploying my recommendation system in the Streamlit app that I am going to present.


In [77]:
# Using pickle to save my model and data
import pickle

# Saving the cosine similarity matrix and DataFrame, closing each file once it is written
with open("similarity", "wb") as f:
    pickle.dump(similar, f)
with open("df", "wb") as f:
    pickle.dump(df, f)
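To read the saved files back later (for example from the app), a round trip could look like the sketch below. The helper names `save_object` and `load_object` are hypothetical, and a small list stands in for the real similarity matrix so the example runs on its own.

```python
import os
import pickle
import tempfile

def save_object(obj, path):
    """Pickle obj to path, closing the file afterwards."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)

def load_object(path):
    """Load a pickled object; only use this on files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Round trip with a stand-in for the similarity matrix
toy_similarity = [[1.0, 0.25], [0.25, 1.0]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "similarity")
    save_object(toy_similarity, path)
    restored = load_object(path)

print(restored == toy_similarity)  # True
```

Unpickling can execute arbitrary code, so the files should only ever be loaded from a trusted source.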

## Reflection

Reflecting on this Double Diamond project so far, I've not only gained practical experience in handling and analyzing large datasets but also deepened my understanding of how machine learning can be applied in a specific domain, in this case music. As I mentioned earlier, I started by getting acquainted with the Spotify Million Song Dataset.

I learned about the various dimensions of the data: its shape, its missing values, and the information contained in it. Data wrangling also taught me that the quality of the input data matters a great deal. However, the most useful skill I developed in this phase was building a small machine learning model using Natural Language Processing (NLP). Diving into tokenization and stemming helped me see text not just as text but as data points that can be analyzed!

Understanding and implementing cosine similarity was also intriguing to me, especially the idea of measuring the "distance" between songs. All in all, since I am majoring in ICT, learning about all these technical aspects of machine learning is quite important. I have learned a lot so far, and I want to keep improving!