# Spotify Music Recommender System

### Read Dataset

In [69]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [70]:
df = pd.read_csv('spotify_millsongdata.csv')

In [71]:
df.head()

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \r\nA..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \r\nTouch me gen..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \r\nWhy I had...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


In [72]:
df.tail()

Unnamed: 0,artist,song,link,text
57645,Ziggy Marley,Good Old Days,/z/ziggy+marley/good+old+days_10198588.html,Irie days come on play \r\nLet the angels fly...
57646,Ziggy Marley,Hand To Mouth,/z/ziggy+marley/hand+to+mouth_20531167.html,Power to the workers \r\nMore power \r\nPowe...
57647,Zwan,Come With Me,/z/zwan/come+with+me_20148981.html,all you need \r\nis something i'll believe \...
57648,Zwan,Desire,/z/zwan/desire_20148986.html,northern star \r\nam i frightened \r\nwhere ...
57649,Zwan,Heartsong,/z/zwan/heartsong_20148991.html,come in \r\nmake yourself at home \r\ni'm a ...


### Dataset Details

In [73]:
df.shape

(57650, 4)

In [74]:
df.isna().sum()

artist    0
song      0
link      0
text      0
dtype: int64

In [75]:
df.duplicated().sum()

0

In [76]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57650 entries, 0 to 57649
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   artist  57650 non-null  object
 1   song    57650 non-null  object
 2   link    57650 non-null  object
 3   text    57650 non-null  object
dtypes: object(4)
memory usage: 1.8+ MB


In [77]:
df.describe()

Unnamed: 0,artist,song,link,text
count,57650,57650,57650,57650
unique,643,44824,57650,57494
top,Donna Summer,Have Yourself A Merry Little Christmas,/a/abba/ahes+my+kind+of+girl_20598417.html,I just came back from a lovely trip along the ...
freq,191,35,1,6


Sample 5,000 songs and drop the `link` column, which is not needed.

In [78]:
df = df.sample(5000).drop('link', axis =1).reset_index(drop = True)

In [79]:
df.head()

Unnamed: 0,artist,song,text
0,Kirk Franklin,Kingdom Come,I've been up I've been down been looking for s...
1,Leann Rimes,On The Side Of Angels,"I've never been so certain, \r\nI've never be..."
2,Pat Benatar,Suburban King,The alarm clock rings but you don't move \r\n...
3,Leonard Cohen,Hunter's Lullaby,Your father's gone a-hunting \r\nHe's lost in...
4,Overkill,I'm Alright,Oh my God in blood soaked silhouette \r\nOh m...


In [80]:
df['text'][0]

"I've been up I've been down been looking for some joy to come around I've been  \r\nPraying for some sunshine been looking for a love that I can call mine. See\r\nI've  \r\nCried for so long now I'm ready for the tears to be gone so I'm calling you\r\nright  \r\nNow cause you said that I can make it Lord any how  \r\n  \r\nChorus: so I'm gonna wait (on you)  \r\nCause I know that your gonna pull (me through)  \r\nI hear you telling me to (be strong)  \r\nCause deliverance is coming and (it wont be long)  \r\nThe storm will pass (away)  \r\nI believe its gonna be (a brighter day)  \r\nI can finally see (the sun)  \r\nSo I'm waiting on you til (Thy kingdom come)  \r\n  \r\nThis world can be cold and I can feel te hurting in my soul but the pain it  \r\nWont last long cause I know its only here to make me strong just take it from  \r\nMe some day we're gonna be free we been waiting Lord for a night but the Lord  \r\nSaid its gonna be alright  \r\nChorus 2x  \r\n  \r\nJill schott yall  \r

In [81]:

df.shape

(5000, 3)


## Preprocessing

### Text Cleaning

Text cleaning prepares raw text for analysis or modeling. Typical steps include converting text to lowercase so that each word has a single representation, removing special characters and punctuation, dropping stopwords (common words that carry little meaning), normalizing words through stemming or lemmatization, stripping HTML tags and URLs, and collapsing extra whitespace. The cell below lowercases the lyrics and replaces newline characters with spaces; stopwords are removed later by the vectorizer.

In [83]:
# Lowercase the lyrics and replace newlines with spaces (punctuation and '\r' are left as-is).
df['text'] = df['text'].str.lower().replace(r'\n', ' ', regex=True)
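A fuller cleaning pass, covering the punctuation and whitespace steps described above, could look like the sketch below. `clean_text` is a hypothetical helper and the sample string is made up; neither comes from the notebook, and applying this to the dataset would change the outputs shown in later cells.

```python
import re

def clean_text(text: str) -> str:
    """Hypothetical helper: lowercase, strip punctuation, collapse whitespace."""
    text = text.lower()
    # Keep letters, digits, and apostrophes; everything else becomes a space.
    text = re.sub(r"[^a-z0-9']", " ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", text).strip()

sample = "Hello, World!  \r\nIt's 2 A.M."  # made-up example string
cleaned = clean_text(sample)
print(cleaned)
```

Keeping the apostrophe is a design choice: it lets a tokenizer still recognize contractions such as "it's".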

### Tokenization

Tokenization is a fundamental process in natural language processing (NLP) that involves breaking down raw text into smaller units called tokens. These tokens can be words, sentences, or even individual characters, depending on the level of granularity needed for analysis. Tokenization is essential for various NLP tasks such as text classification, sentiment analysis, and machine translation. By dividing text into tokens, NLP systems can better understand and process language, enabling further analysis, feature extraction, and modeling on textual data.
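The three levels of granularity mentioned above can be illustrated with a simplified, regex-based sketch. This is not what the notebook uses: `nltk.word_tokenize` also splits contractions and punctuation into separate tokens, as the stemmed output further down shows. The sample sentence is made up.

```python
import re

text = "Take it easy. Touch me gently!"  # made-up example sentence

# Word-level tokens: runs of letters, digits, or apostrophes.
words = re.findall(r"[A-Za-z0-9']+", text)
# Sentence-level tokens: split on whitespace that follows ., !, or ?.
sentences = re.split(r"(?<=[.!?])\s+", text)
# Character-level tokens: every non-whitespace character.
chars = [c for c in text if not c.isspace()]

print(words)
print(sentences)
print(len(chars))
```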

In [84]:
# Stemming reduces related words such as "beauty" and "beautiful" to one stem ("beauti")

In [85]:
## The whole text is then converted into numeric vectors; this is called vectorization

In [86]:
## Similarity between songs is then measured on those vectors,

In [87]:
## not with the Euclidean distance

In [88]:
# but with the angle between each pair of vectors (cosine similarity)
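To see why the angle is preferred over the Euclidean distance, consider the sketch below. The three vectors are made-up word counts, and `cosine_sim` and `euclidean_distance` are illustrative helpers, not the scikit-learn functions used later in the notebook.

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

short_song = [1.0, 2.0, 0.0]
long_song = [10.0, 20.0, 0.0]  # same word proportions, ten times longer
other_song = [0.0, 0.0, 3.0]   # shares no words with the other two

cos_same = cosine_sim(short_song, long_song)
cos_other = cosine_sim(short_song, other_song)
dist_same = euclidean_distance(short_song, long_song)
dist_other = euclidean_distance(short_song, other_song)
print(cos_same, cos_other, dist_same, dist_other)
```

The short and long songs point in the same direction, so their cosine similarity is at its maximum even though they are far apart in Euclidean terms; the Euclidean distance actually places the unrelated song closer.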

### Content Based Filtering


Content-based filtering is a recommendation technique that suggests items similar to ones a user has already liked or interacted with. It compares the attributes of the items themselves rather than the behavior of other users, so it works well when each item comes with rich descriptive content. In this notebook the items are songs and the content is the lyrics: each song's text is turned into a TF-IDF vector, and the songs whose vectors have the highest cosine similarity to a chosen song are recommended.
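As a toy illustration of the idea, the sketch below builds a user profile from liked items and ranks the remaining items against it. The item names and the three feature dimensions are invented for the example; the notebook itself compares one song to all others rather than building a profile.

```python
import math

# Hypothetical item features: [acoustic, upbeat, instrumental]
items = {
    "song_a": [1.0, 0.0, 0.0],
    "song_b": [0.9, 0.1, 0.0],
    "song_c": [0.0, 1.0, 0.0],
    "song_d": [0.0, 0.0, 1.0],
}
liked = ["song_a"]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

# User profile: the average feature vector of the liked items.
profile = [sum(col) / len(liked) for col in zip(*(items[name] for name in liked))]

# Rank the items the user has not interacted with by similarity to the profile.
ranked = sorted(
    (name for name in items if name not in liked),
    key=lambda name: cosine(profile, items[name]),
    reverse=True,
)
print(ranked)
```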

In [89]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [90]:
df.head()

Unnamed: 0,artist,song,text
0,Kirk Franklin,Kingdom Come,i've been up i've been down been looking for s...
1,Leann Rimes,On The Side Of Angels,"i've never been so certain, \r i've never bee..."
2,Pat Benatar,Suburban King,the alarm clock rings but you don't move \r b...
3,Leonard Cohen,Hunter's Lullaby,your father's gone a-hunting \r he's lost in ...
4,Overkill,I'm Alright,oh my god in blood soaked silhouette \r oh my...


In [107]:
df.tail()

Unnamed: 0,artist,song,text
4995,Nazareth,Road Ladies,by: frank zappa as performed by nazareth \r s...
4996,Janis Joplin,Kozmic Blues,"time keeps moving on, \r friends they turn aw..."
4997,Tragically Hip,Last American Exit,you know the reasons i can't conceal you kow i...
4998,Johnny Cash,Gentle On My Mind,"well, it's knowin' that your door is always op..."
4999,Bonnie Raitt,Pleasin' Each Other,"you remind me, you remind me \r that it's a t..."


In [92]:
import nltk
from nltk.stem.porter import PorterStemmer

In [93]:
stemmer = PorterStemmer()

In [94]:
def token(txt):
    # Split the text into word tokens, stem each one, and rejoin into a single string
    tokens = nltk.word_tokenize(txt)
    stems = [stemmer.stem(w) for w in tokens]
    return " ".join(stems)

In [95]:
token("you are beautiful")

'you are beauti'

In [96]:
## Apply token() to every lyric with a lambda (the result is displayed, not assigned back to df['text'])

In [97]:
df['text'].apply(lambda x: token(x))

0       i 've been up i 've been down been look for so...
1       i 've never been so certain , i 've never been...
2       the alarm clock ring but you do n't move but t...
3       your father 's gone a-hunt he 's lost in the f...
4       oh my god in blood soak silhouett oh my god on...
                              ...                        
4995    by : frank zappa as perform by nazareth said ,...
4996    time keep move on , friend they turn away , we...
4997    you know the reason i ca n't conceal you kow i...
4998    well , it 's knowin ' that your door is alway ...
4999    you remind me , you remind me that it 's a tri...
Name: text, Length: 5000, dtype: object

### Vectorization

Vectorization converts data into numerical vectors so that machine learning algorithms can work with it. For text, each document becomes a vector of term weights. This notebook uses TF-IDF (`TfidfVectorizer`), which weights a word by how often it occurs in a song and down-weights words that occur in many songs; English stopwords are dropped via `stop_words = 'english'`.
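The weighting can be worked through by hand on a tiny corpus. The sketch below is a simplified pure-Python version that assumes the smoothed IDF form, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is how scikit-learn documents its default settings. The three documents and the helper `tfidf_vector` are made up for illustration.

```python
import math
from collections import Counter

docs = ["love me tender", "love me true", "blue suede shoes"]  # made-up corpus
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: the number of documents each word appears in.
doc_freq = Counter(w for doc in tokenized for w in set(doc))

def tfidf_vector(doc):
    """Term count times smoothed IDF, then L2-normalized."""
    counts = Counter(doc)
    raw = [
        counts[w] * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1)
        for w in vocab
    ]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]
for word, weight in zip(vocab, vectors[0]):
    print(f"{word:>7}: {weight:.3f}")
```

In the first document, "tender" receives a higher weight than "love" because "love" also appears in the second document.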

In [98]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [99]:
df['text']

0       i've been up i've been down been looking for s...
1       i've never been so certain,  \r i've never bee...
2       the alarm clock rings but you don't move  \r b...
3       your father's gone a-hunting  \r he's lost in ...
4       oh my god in blood soaked silhouette  \r oh my...
                              ...                        
4995    by: frank zappa as performed by nazareth  \r s...
4996    time keeps moving on,  \r friends they turn aw...
4997    you know the reasons i can't conceal you kow i...
4998    well, it's knowin' that your door is always op...
4999    you remind me, you remind me  \r that it's a t...
Name: text, Length: 5000, dtype: object

In [110]:
tfid = TfidfVectorizer(analyzer = 'word', stop_words = 'english')

In [111]:
matrix = tfid.fit_transform(df['text'])

In [112]:
matrix

<5000x23992 sparse matrix of type '<class 'numpy.float64'>'
	with 271247 stored elements in Compressed Sparse Row format>

In [113]:
similar = cosine_similarity(matrix)

In [114]:
similar[0]

array([1.        , 0.06055849, 0.01789536, ..., 0.07300922, 0.03580611,
       0.03418979])

In [116]:
df[df['song']=='Last American Exit']

Unnamed: 0,artist,song,text
4997,Tragically Hip,Last American Exit,you know the reasons i can't conceal you kow i...


In [117]:
df[df['song']=='Last American Exit'].index[0]

4997

### Recommender Function

In [118]:
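def recommender(song_name):
    # Row index of the first song whose title matches exactly
    idx = df[df['song'] == song_name].index[0]
    # Pair every song index with its similarity to the query, highest first
    distance = sorted(enumerate(similar[idx]), reverse=True, key=lambda x: x[1])
    song = []
    # Skip position 0 (normally the query song itself) and keep the next four
    for s_id in distance[1:5]:
        song.append(df.iloc[s_id[0]].song)
    return song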

In [119]:
recommender("Last American Exit")

['Do You Know?', 'I Know Better', 'Poker', 'I Will Be There']
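The function above raises an `IndexError` when the title is not in the sample, and the number of results is fixed at four. One possible variant is sketched below with plain lists so it runs on its own; `recommend`, its `k` parameter, and the toy titles and matrix are assumptions for illustration, not part of the notebook.

```python
def recommend(song_name, titles, similarity, k=4):
    """Hypothetical variant: return up to k titles most similar to song_name."""
    if song_name not in titles:
        return []  # unknown title: return nothing instead of raising IndexError
    idx = titles.index(song_name)
    # Rank every other song by its similarity to the query, highest first.
    ranked = sorted(
        (i for i in range(len(titles)) if i != idx),
        key=lambda i: similarity[idx][i],
        reverse=True,
    )
    return [titles[i] for i in ranked[:k]]

titles = ["A", "B", "C", "D"]  # toy data
similarity = [
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
]
top = recommend("A", titles, similarity, k=2)
missing = recommend("Z", titles, similarity)
print(top, missing)
```

Excluding the query by index, rather than by slicing off the first result, avoids dropping the wrong song when another song ties with it at similarity 1.0.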

In [120]:
import pickle

In [121]:
with open("similarity.pkl", "wb") as f:
    pickle.dump(similar, f)

In [122]:
with open("df.pkl", "wb") as f:
    pickle.dump(df, f)
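The saved files can presumably be reloaded elsewhere with `pickle.load`. The round trip is sketched below on a small stand-in matrix written to a temporary directory, so it does not depend on the notebook's `similar` array or overwrite its files.

```python
import os
import pickle
import tempfile

# Stand-in for the similarity matrix; not the notebook's data.
toy_similarity = [[1.0, 0.3], [0.3, 1.0]]

path = os.path.join(tempfile.mkdtemp(), "similarity.pkl")
with open(path, "wb") as f:
    pickle.dump(toy_similarity, f)

# Reload in a separate step, as a later session would.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == toy_similarity)
```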