# MUSIC RECOMMENDATION SYSTEM USING CONTENT-BASED FILTERING

## IMPORTING NECESSARY LIBRARIES

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pickle

In [2]:
df=pd.read_csv("spotify_millsongdata.csv")
df.head()

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \r\nA..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \r\nTouch me gen..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \r\nWhy I had...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


## DATA PREPROCESSING

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57650 entries, 0 to 57649
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   artist  57650 non-null  object
 1   song    57650 non-null  object
 2   link    57650 non-null  object
 3   text    57650 non-null  object
dtypes: object(4)
memory usage: 1.8+ MB


In [4]:
df.isna().sum()

artist    0
song      0
link      0
text      0
dtype: int64

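The dataset contains 57,650 songs, which makes computing a full pairwise similarity matrix slow and memory-hungry. We will therefore work with a random sample of 5,000 songs. We also drop the 'link' attribute, since it has no effect on music recommendation. Note that `sample` draws different rows on each run unless a `random_state` is passed, so the outputs shown below will vary between runs.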

In [5]:
df = df.sample(5000).drop('link', axis=1).reset_index(drop=True)
df.head()

Unnamed: 0,artist,song,text
0,LL Cool J,We're Gonna Make It,"I know the Lord, will make a way \r\nHe will ..."
1,Josh Groban,She's Out Of My Life,She's out of my life \r\nShe's out of my life...
2,Kirsty Maccoll,They Don't Know,(kirsty maccoll) \r\nYou've been around for s...
3,The Jam,Smithers-Jones,"Here we go again, it's Monday at last, \r\nHe..."
4,Green Day,American Idiot,Don't want to be an American idiot. \r\nDon't...


Now let's look at the 'text' attribute more closely.

In [6]:
df["text"][0]

"I know the Lord, will make a way  \r\nHe will make a way, yes he will  \r\nYes, he will  \r\n  \r\nOh, I do believe that we will make it  \r\nYeah, yeah, yeah, yeah  \r\n  \r\nUh uh  \r\nI was at rock bottom, my whole life was mo' problems  \r\nReincarnation of a slave pickin' cotton  \r\nStress beamin' down like the sun I felt rotten  \r\nTo the core, was at war, cause the enemy is plottin'  \r\nI hear him knockin', sayin' that we got him  \r\nHot like solar, he wanna burn my soul up  \r\nWorld on my shoulders but I roll back ya boulders  \r\nWords have a funny way of comin back to scold ya  \r\nWatch what come out of your mouth, you crack a molar  \r\nI tried to told ya, he he he  \r\nEnemies gunnin', true believers ain't runnin'  \r\nOr duckin', we ain't scared of nothin', ya feel me?  \r\nThey wanna test me and press my buttons, oh really?  \r\nAllow Uncle L to hip y'all to somethin', uh, uh  \r\nYeah, uh, check it out  \r\nThere's a living power, make a man out a coward  \r\nRebu

Here we can see that the text contains carriage-return and newline characters ('\r' and '\n'). We will replace each of them with a space.
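As a minimal sketch of this cleanup on a single plain string, the snippet below uses only the standard `re` module. The helper name `clean_lyrics` is hypothetical, and unlike the notebook cell it also collapses runs of whitespace into one space, which is an assumption about what a tidier result might look like rather than what the notebook does.

```python
import re

def clean_lyrics(text):
    """Hypothetical helper: turn line breaks into spaces and collapse repeated whitespace."""
    # \s+ matches any run of spaces, '\r' and '\n'; each run becomes one space.
    return re.sub(r"\s+", " ", text).strip()

sample = "I know the Lord, will make a way  \r\nHe will make a way, yes he will  \r\n"
cleaned = clean_lyrics(sample)
print(cleaned)
```

Collapsing whitespace does not change the TF-IDF features computed later, since the vectorizer splits on word boundaries anyway; it only makes the stored text easier to read.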

In [7]:
# Replace every carriage return and newline with a single space
df["text"] = df["text"].str.replace(r'[\r\n]', ' ', regex=True)
df["text"][0]

"I know the Lord, will make a way    He will make a way, yes he will    Yes, he will        Oh, I do believe that we will make it    Yeah, yeah, yeah, yeah        Uh uh    I was at rock bottom, my whole life was mo' problems    Reincarnation of a slave pickin' cotton    Stress beamin' down like the sun I felt rotten    To the core, was at war, cause the enemy is plottin'    I hear him knockin', sayin' that we got him    Hot like solar, he wanna burn my soul up    World on my shoulders but I roll back ya boulders    Words have a funny way of comin back to scold ya    Watch what come out of your mouth, you crack a molar    I tried to told ya, he he he    Enemies gunnin', true believers ain't runnin'    Or duckin', we ain't scared of nothin', ya feel me?    They wanna test me and press my buttons, oh really?    Allow Uncle L to hip y'all to somethin', uh, uh    Yeah, uh, check it out    There's a living power, make a man out a coward    Rebuild your strength like the new Trade Towers    I

## MODEL BUILDING

In [8]:
tfidvector = TfidfVectorizer(analyzer='word',stop_words='english')
matrix = tfidvector.fit_transform(df['text'])
similarity = cosine_similarity(matrix)
similarity[0]

array([1.        , 0.05552831, 0.14340649, ..., 0.02431756, 0.01847643,
       0.06708269])
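To make the cell above less of a black box, here is a rough pure-Python sketch of TF-IDF weighting followed by cosine similarity on three toy documents. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalisation that scikit-learn documents for its default settings, and it leaves out stop-word removal and scikit-learn's own tokenisation. The names `tfidf_vectors`, `cosine` and `toy_docs` are illustrative only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return L2-normalised TF-IDF vectors (term -> weight dicts) for tokenised docs."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_counts = Counter(doc)
        # Raw term count times smoothed inverse document frequency.
        weights = {
            term: count * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in term_counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

def cosine(u, v):
    # Both vectors have unit length, so the dot product equals the cosine.
    return sum(w * v.get(term, 0.0) for term, w in u.items())

toy_docs = [
    "love me tender love me true".split(),
    "love me do".split(),
    "yellow submarine".split(),
]
vecs = tfidf_vectors(toy_docs)
sim_01 = cosine(vecs[0], vecs[1])  # documents sharing "love" and "me"
sim_02 = cosine(vecs[0], vecs[2])  # documents with no words in common
print(sim_01, sim_02)
```

Documents that share words get a positive score, documents with no words in common get 0, and every document scores 1 against itself, which is why `similarity[0]` above starts with 1.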

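The 5,000 × 5,000 similarity matrix is stored as float64 by default, which takes about 200 MB. Casting it to float16 cuts that to about 50 MB, which keeps the pickle file saved later small, at the cost of some precision in the scores.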

In [9]:
similarity=similarity.astype(np.float16)
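The saving from this cast can be estimated without allocating the full matrix. The sketch below assumes a 5,000 × 5,000 matrix, as in this notebook, and also shows how much one of the scores printed earlier shifts when stored as float16; the variable names are illustrative.

```python
import numpy as np

n_songs = 5000  # size of the sample used in this notebook

# Bytes per element times number of elements, for each dtype.
bytes_float64 = np.dtype(np.float64).itemsize * n_songs * n_songs
bytes_float16 = np.dtype(np.float16).itemsize * n_songs * n_songs
print(bytes_float64 / 1e6, "MB as float64")
print(bytes_float16 / 1e6, "MB as float16")

# float16 keeps roughly three significant decimal digits.
original_score = 0.05552831
rounded_score = float(np.float16(original_score))
print(original_score, "->", rounded_score)
```

For ranking the top five neighbours this loss of precision rarely matters, though two songs with nearly identical scores may end up tied.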

In [10]:
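def recommendation(song_df):
    # Row index of the first song whose title matches (raises IndexError if absent)
    index = df[df['song'] == song_df].index[0]
    # Pair each score with its song index, then sort by score, highest first
    distances = sorted(list(enumerate(similarity[index])), reverse=True, key=lambda x: x[1])
    songs = []
    # Skip position 0 (normally the input song itself) and keep the next five
    for m_id in distances[1:6]:
        songs.append(df.iloc[m_id[0]].song)
    return songs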

In the function above, the row index of the input song is looked up and stored in `index`. `enumerate` pairs every similarity score in that row with its position, in (index, score) format, so that each score stays attached to its song index after sorting. The pairs are converted to a list and sorted in descending order of x[1], i.e. the score. The first entry is normally the input song itself (similarity 1), so the function skips it and returns the titles of the next five songs.
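The enumerate-then-sort step can be seen on a small example. The scores below are made up for illustration and stand in for one row of the similarity matrix, with song 0 as the query.

```python
# Hypothetical similarity row for song 0 (its score against itself is 1.0).
scores = [1.0, 0.12, 0.57, 0.03, 0.41]

# enumerate yields (index, score) pairs; sorting by the score keeps
# each score attached to the index of the song it belongs to.
ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Drop the first pair (the query song itself) and keep the next two indices.
top_two = [song_index for song_index, _ in ranked[1:3]]
print(ranked)
print(top_two)
```

Without `enumerate`, sorting would return only the scores, and there would be no way to tell which song each score came from.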

Now let's try our function.

In [12]:
recommendation("Smithers-Jones")

['Mr. Jones',
 "Let's Work Together",
 'Get Up',
 'Will Work For Love',
 'Clock Goes Round']

## SAVING OBJECTS AS PICKLE FILE

In [13]:
with open('similarity.pkl', 'wb') as f:
    pickle.dump(similarity, f)
with open('df.pkl', 'wb') as f:
    pickle.dump(df, f)
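For completeness, here is a small round-trip sketch showing how pickled objects can be read back later, for example by an application that serves the recommendations. It uses a throwaway dictionary and a temporary directory instead of the notebook's actual files; `payload` and `demo.pkl` are placeholders.

```python
import os
import pickle
import tempfile

# Stand-in for the objects saved by the notebook.
payload = {
    "songs": ["Song A", "Song B"],
    "similarity": [[1.0, 0.2], [0.2, 1.0]],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.pkl")
    # Write in binary mode; the with-block closes the file afterwards.
    with open(path, "wb") as f:
        pickle.dump(payload, f)
    # Read it back, again in binary mode.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == payload)
```

Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.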