# Music recommender system

Recommendation systems are among the most widely used applications of machine learning. A **recommendation engine** is an information filtering system whose aim is to predict the rating or preference a user would give to an item, e.g. a film, a product, or a song.

What types of recommender are there?

There are two main types of recommender systems:

1) Content-based filters
2) Collaborative filters
  
> Content-based filters predict what a user will like based on what that particular user has liked in the past. Collaborative filters, on the other hand, predict what a user will like based on what other, similar users have liked.
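
To make the contrast concrete, here is a minimal sketch of the collaborative idea using only Python sets. The users, songs, and the helper `collaborative_recommend` are made up for illustration; a real collaborative filter would use ratings and a more careful similarity measure.

```python
# Made-up listening history: user -> set of liked songs
likes = {
    "ana": {"Song A", "Song B", "Song C"},
    "ben": {"Song A", "Song B", "Song D"},
    "cat": {"Song E"},
}

def collaborative_recommend(user, likes):
    """Suggest songs liked by the user whose likes overlap most with `user`'s."""
    mine = likes[user]
    others = [u for u in likes if u != user]
    # Most similar user = largest number of liked songs in common
    nearest = max(others, key=lambda u: len(likes[u] & mine))
    # Recommend what that user likes and `user` has not liked yet
    return sorted(likes[nearest] - mine)

print(collaborative_recommend("ana", likes))
```

Here "ben" shares two liked songs with "ana", so the song only "ben" has liked is suggested. A content-based filter would instead look only at the songs "ana" already likes and at the songs' own features.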

### 1) Content-based filters

Recommendations made by a content-based recommender can be seen as a user-specific classification problem: the classifier learns the listener's preferences from the features of the songs.

A simple and common approach is **keyword matching**.

In short, the idea is to extract useful keywords from the descriptions of songs a user likes, search for those keywords in other song descriptions to estimate how similar they are, and recommend the most similar songs to the user.
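
As a rough illustration of keyword matching, the sketch below compares songs by the share of keywords their descriptions have in common (Jaccard similarity). The stop-word list, the descriptions, and the helper names are assumptions made up for this example.

```python
# Toy keyword matching: compare songs by the keywords their descriptions share.
# The stop-word list and the song descriptions are made up for illustration.
STOP_WORDS = {"a", "the", "of", "and", "in", "my", "is", "to"}

def keywords(description):
    """Return the set of lowercase words in `description`, minus stop words."""
    return {w for w in description.lower().split() if w not in STOP_WORDS}

def jaccard(a, b):
    """Share of keywords two sets have in common (0 = none, 1 = identical)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

liked = keywords("love and heartbreak in the summer rain")
candidates = {
    "Song A": keywords("dancing in the summer rain"),
    "Song B": keywords("a road trip to the desert"),
}

# Rank candidate songs by keyword overlap with the liked song
ranked = sorted(candidates, key=lambda name: jaccard(liked, candidates[name]), reverse=True)
print(ranked)
```

"Song A" shares two of five distinct keywords with the liked song and ranks first. Plain overlap treats every keyword as equally important, which is the limitation that term weighting addresses.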

*How is this performed?*

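In our case, because we are working with text, **Term Frequency-Inverse Document Frequency (TF-IDF)** can be used for this matching process. TF-IDF gives a word a high weight when it occurs often in one song's lyrics but rarely in the rest of the dataset.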
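
The weighting can be worked through by hand on a toy corpus. This sketch uses the textbook definition, term frequency times `log(N / document frequency)`. It is not what scikit-learn computes exactly: `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers differ. The documents and the helper `tf_idf` are made up for illustration.

```python
import math
from collections import Counter

# Three tiny "lyrics", already tokenized; made up for illustration
docs = [
    ["love", "me", "love", "you"],
    ["love", "the", "rain"],
    ["rain", "rain", "go", "away"],
]

def tf_idf(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

scores = tf_idf(docs)
print(scores[0])
```

In the first document "love" occurs twice but also appears in another document, while "me" occurs once and nowhere else. With this formula "me" ends up with the higher score, which shows how rarity across the corpus can outweigh raw frequency.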
  
We'll go through the steps for generating a **content-based** music recommender system.

### Importing libraries

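First, import pandas and load the dataset into a DataFrame.

In [85]:
import pandas as pd

# Load the lyrics dataset; the CSV is expected in the working directory
df = pd.read_csv("spotify_millsongdata.csv")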

### Displaying Data
This block displays the first 5 rows of the dataset using `df.head(5)` to give an overview of its structure and contents.


In [86]:
df.head(5)

| | artist | song | link | text |
|---|---|---|---|---|
| 0 | ABBA | Ahe's My Kind Of Girl | /a/abba/ahes+my+kind+of+girl_20598417.html | Look at her face, it's a wonderful face \r\nA... |
| 1 | ABBA | Andante, Andante | /a/abba/andante+andante_20002708.html | Take it easy with me, please \r\nTouch me gen... |
| 2 | ABBA | As Good As New | /a/abba/as+good+as+new_20003033.html | I'll never know why I had to go \r\nWhy I had... |
| 3 | ABBA | Bang | /a/abba/bang_20598415.html | Making somebody happy is a question of give an... |
| 4 | ABBA | Bang-A-Boomerang | /a/abba/bang+a+boomerang_20002668.html | Making somebody happy is a question of give an... |


### Checking for Missing Values
Here, we check for missing values in the dataset using `df.isnull().sum()` to understand the data quality.


In [87]:
df.isnull().sum()

artist    0
song      0
link      0
text      0
dtype: int64

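Because the dataset is large, we keep only the first 5000 songs. This is not a random sample: it simply takes the rows in the order they appear in the file.

In [88]:
# Keep the first 5000 songs; .copy() gives an independent DataFrame,
# which avoids a possible SettingWithCopyWarning when columns are modified later
df = df.head(5000).copy()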


### Displaying DataFrame Shape
We print the shape of the DataFrame using `df.shape` to understand its dimensions (number of rows and columns).


In [89]:
df.shape

(5000, 4)

### Text Preprocessing
This block converts the lyrics to lowercase and strips newline characters (`\n`). Punctuation is left in place, and the remaining carriage returns (`\r`) are treated as whitespace by the tokenizer later on.


In [90]:
# Lowercase the lyrics and remove newline characters
df['text'] = df['text'].str.lower().str.replace('\n', '', regex=False)
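
If punctuation should be removed as well, a character class such as `[^\w\s]` can be used. The following is a minimal sketch on a plain string using only the standard `re` module. The helper `clean_lyrics` is hypothetical and not part of the notebook, and applying it to the dataset would change the tokenizer outputs shown further down.

```python
import re

def clean_lyrics(text):
    """Lowercase `text`, drop punctuation, and collapse all whitespace runs."""
    text = text.lower()
    # [^\w\s] matches any character that is neither a word character nor whitespace
    text = re.sub(r"[^\w\s]", "", text)
    # \s+ covers spaces, \r and \n, so line breaks become single spaces
    return re.sub(r"\s+", " ", text).strip()

print(clean_lyrics("Look at her face, it's a wonderful face \r\nAnd it means"))
# look at her face its a wonderful face and it means
```

Note that removing apostrophes turns "it's" into "its", which may or may not be desirable depending on how the tokens are used afterwards.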

### Importing NLTK Libraries
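NLTK is imported and the `punkt` tokenizer data, which `nltk.word_tokenize()` relies on, is downloaded. `PorterStemmer` is imported for the stemming step.

In [91]:
import nltk
import nltk.data
from nltk.stem.porter import PorterStemmer

# Tokenizer models required by nltk.word_tokenize()
nltk.download('punkt')
# Extra search path for NLTK data; only relevant on machines where this directory exists
nltk.data.path.append("/root/nltk_data")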

[nltk_data] Downloading package punkt to C:\Users\Muskan
[nltk_data]     Computer\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Initializing Stemmer
A Porter stemmer is initialized for word stemming, which reduces words to their root form.


In [92]:
stemmer = PorterStemmer()

### Tokenization and Stemming Function
We define a function `token()` that splits text into words with NLTK's `word_tokenize()` and stems each word with the `PorterStemmer`.

In [93]:
def token(txt):
    """Tokenize `txt`, stem every token, and join the stems with spaces."""
    tokens = nltk.word_tokenize(txt)
    stems = [stemmer.stem(w) for w in tokens]
    return " ".join(stems)

### Testing Tokenization and Stemming
The `token()` function is tested on sample text to ensure it correctly tokenizes and stems words.


In [94]:
token("you are beautiful,beauty")

'you are beauti , beauti'

### Applying Tokenization and Stemming
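The `token()` function is applied to every entry of the `text` column with `apply()`. The result is only displayed here, not assigned back to `df['text']`, so the following steps work on the unstemmed lyrics; to use the stemmed text instead, assign it with `df['text'] = df['text'].apply(token)`.

In [95]:
df['text'].apply(token)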

0       look at her face , it 's a wonder face and it ...
1       take it easi with me , pleas touch me gentli l...
2       i 'll never know whi i had to go whi i had to ...
3       make somebodi happi is a question of give and ...
4       make somebodi happi is a question of give and ...
                              ...                        
4995    you wo n't take my love for tender you can put...
4996    i 've look at it everi way i can from under an...
4997    i wo n't walk with my head bow ( be on ) beyon...
4998    dress up like a dog 's dinner butter would n't...
4999    now there 's newsprint all over your face well...
Name: text, Length: 5000, dtype: object

We are going to use `TfidfVectorizer` and `cosine_similarity` from the scikit-learn package.

In [96]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

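Next, we create a TF-IDF vectorizer, which computes a TF-IDF score for every word in every song's lyrics.

Two arguments are set explicitly: `analyzer='word'` builds the features from whole words, and `stop_words='english'` drops common English words such as "the" and "and", which carry little information about a song.

In [97]:
tfidvector = TfidfVectorizer(analyzer='word', stop_words='english')
# Each row of the resulting sparse matrix is the TF-IDF vector of one song
matrix = tfidvector.fit_transform(df['text'])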

*To use this matrix for recommendations:*

First, we calculate the similarity of each lyric to every other lyric using **cosine similarity**.

Calling `cosine_similarity` with a single argument compares every row with every other row, so we just pass `matrix`. The result is a 5000 x 5000 array.

In [98]:
similer = cosine_similarity(matrix)
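
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand for three made-up TF-IDF rows. The helper `cosine` and the vectors are assumptions for illustration, and returning 0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy TF-IDF rows over a 3-word vocabulary (values made up)
song_a = [0.5, 0.0, 0.5]
song_b = [0.5, 0.5, 0.0]
song_c = [0.0, 1.0, 0.0]

print(cosine(song_a, song_a))  # a song compared with itself: 1, up to rounding
print(cosine(song_a, song_b))  # one shared word out of two each
print(cosine(song_a, song_c))  # no shared words
```

Because TF-IDF scores are never negative, the similarity between two songs always lies between 0 (no words in common) and 1 (same direction).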

Once we have the similarities, we can inspect them. Row 0 holds the similarity of the first song to every song in the dataset; its first entry is 1 because each song is identical to itself.

In [99]:
similer[0]

array([1.        , 0.00169743, 0.00936841, ..., 0.03331108, 0.03366429,
       0.08968344])

In [100]:
df.tail()

| | artist | song | link | text |
|---|---|---|---|---|
| 4995 | Elvis Costello | Love For Tender | /e/elvis+costello/love+for+tender_20047299.html | you won't take my love for tender \ryou can p... |
| 4996 | Elvis Costello | Love Went Mad | /e/elvis+costello/love+went+mad_20047460.html | i've looked at it every way i can \rfrom unde... |
| 4997 | Elvis Costello | Lover's Walk | /e/elvis+costello/lovers+walk_20047530.html | i won't walk with my head bowed \r(be on) bey... |
| 4998 | Elvis Costello | Luxembourg | /e/elvis+costello/luxembourg_20047531.html | dressed up like a dog's dinner \rbutter would... |
| 4999 | Elvis Costello | Men Called Uncle | /e/elvis+costello/men+called+uncle_20047301.html | now there's newsprint all over your face \rwe... |


### Data Integrity Checks
Two sanity checks are performed here: whether the DataFrame is empty, and whether a specific song exists in it.


In [101]:
# Check if the DataFrame is empty
if df.empty:
    print("The DataFrame is empty. Please check your data source or filtering criteria.")
else:
    # Check the number of rows in the DataFrame
    print(f"The DataFrame has {df.shape[0]} rows.")

# Check if the desired song exists in the DataFrame
song_to_check = "The Brothers Cup"
# regex=False treats the title as plain text rather than a regular expression
if df['song'].str.contains(song_to_check, regex=False).any():
    print(f"The song '{song_to_check}' exists in the DataFrame.")
else:
    print(f"The song '{song_to_check}' does not exist in the DataFrame.")

The DataFrame has 5000 rows.
The song 'The Brothers Cup' does not exist in the DataFrame.


In [102]:

def recommender(song_name):
    """Print the 4 songs whose lyrics are most similar to `song_name`."""
    if song_name not in df['song'].values:
        print(f"The song '{song_name}' does not exist in the DataFrame.")
        return

    # Row of the first song with this title (labels equal positions here)
    idx = df[df['song'] == song_name].index[0]
    # Pair every song position with its similarity to the query, most similar first
    distance = sorted(enumerate(similer[idx]), key=lambda x: x[1], reverse=True)
    # Skip the first entry (normally the query song itself) and keep the next 4
    for i, (song_idx, _) in enumerate(distance[1:5], 1):
        print(f"{i}. {df.iloc[song_idx].song}")

In [104]:
recommender("Men Called Uncle")

1. Uncle Love
2. I Hate Men
3. Married Men
4. Round And Round
