# Music recommender system

Recommendation systems are among the most widely used applications of machine learning. A **recommender** (or recommendation) **system** (or engine) is a filtering system whose aim is to predict the rating or preference a user would give to an item, e.g. a film, a product, or a song.

What types of recommender are there?

There are two main types of recommender systems: 
- Content-based filters
- Collaborative filters

> Content-based filters predict what a user likes based on what that particular user has liked in the past. Collaborative filters, on the other hand, predict what a user likes based on what other users who are similar to that user have liked.



### 1) Content-based filters

Content-based recommendation can be seen as a user-specific classification problem: a classifier learns the user's likes and dislikes from the features of each song.


The most straightforward approach is **keyword matching**.


In short, the idea is to extract meaningful keywords from the description of a song the user likes, search for those keywords in other song descriptions to estimate how similar they are, and recommend the most similar songs to the user.
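As a rough illustration of keyword matching, the sketch below scores two toy song descriptions against a liked one by keyword overlap (Jaccard similarity). The descriptions, the tiny stop-word list, and the helper names `keywords` and `jaccard` are all made up for this example. Raw overlap is used here only for simplicity; the notebook itself relies on TF-IDF.

```python
import re
from typing import Dict, Set

# Tiny hypothetical stop-word list, only for this illustration.
STOP_WORDS = {"a", "the", "of", "and", "in", "to", "is", "my", "you", "i"}

def keywords(text: str) -> Set[str]:
    """Lowercase the text, split it into words, and drop stop words."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS}

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of keywords that two descriptions have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Made-up descriptions, not taken from the dataset.
liked = "dancing in the summer rain"
candidates: Dict[str, str] = {
    "Song A": "summer rain and dancing shoes",
    "Song B": "cold winter night",
}

liked_kw = keywords(liked)
scores = {title: jaccard(liked_kw, keywords(desc)) for title, desc in candidates.items()}

# The candidate sharing the most keywords with the liked song is recommended first.
best = max(scores, key=scores.get)
print(best, scores)
```

Raw overlap treats every keyword as equally important, which is one reason to weight terms with TF-IDF instead.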

*How is this performed?*

In our case, because we are working with text and words, **Term Frequency-Inverse Document Frequency (TF-IDF)** can be used for this matching process.
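To make TF-IDF a little more concrete, here is a minimal sketch on a three-line toy corpus (invented for this example, not taken from the dataset). It assumes scikit-learn's default settings, under which the inverse document frequency of a term is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus invented for this example.
toy_corpus = [
    "love love me do",
    "all you need is love",
    "yellow submarine",
]

toy_vectorizer = TfidfVectorizer(analyzer="word", stop_words="english")
toy_tfidf = toy_vectorizer.fit_transform(toy_corpus)

toy_vocab = toy_vectorizer.vocabulary_   # term -> column index
toy_idf = toy_vectorizer.idf_            # one IDF weight per column

# "love" appears in 2 of 3 documents, "yellow" in only 1,
# so "yellow" gets the larger IDF weight.
idf_love = toy_idf[toy_vocab["love"]]
idf_yellow = toy_idf[toy_vocab["yellow"]]
print(f"idf(love)={idf_love:.4f}, idf(yellow)={idf_yellow:.4f}")

# With the default norm='l2', every row has length 1.
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)
print(row_norms)
```

Terms that occur in many songs get a low weight and rare terms a high one, so distinctive words dominate the comparison between lyrics.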
  
We'll go through the steps for generating a **content-based** music recommender system.

### Importing required libraries

First, we'll import all the required libraries.

In [1]:
import numpy as np
import pandas as pd

In [2]:
from typing import List, Dict

In [3]:
# We score words with TF-IDF,
# using TfidfVectorizer from the scikit-learn package.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Import dataset

We'll use the [following dataset](https://www.kaggle.com/mousehead/songlyrics/data#). 

This dataset contains the name, artist, and lyrics of *57650 songs in English*. The data was scraped from LyricsFreak.

In [5]:
# Raw string so that the backslashes in the Windows path are not treated as escape sequences
df = pd.read_csv(r'D:\Github_slash_mark_project\slash_mark_project\major-project\model-1\data\songdata.csv')

In [6]:
df.head(3)

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57650 entries, 0 to 57649
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   artist  57650 non-null  object
 1   song    57650 non-null  object
 2   link    57650 non-null  object
 3   text    57650 non-null  object
dtypes: object(4)
memory usage: 1.8+ MB


In [8]:
df.shape

(57650, 4)

In [9]:
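# Remove newline characters. A plain '\n' with regex=False matches the actual newline;
# r'\n' with regex disabled (the default since pandas 2.0) would look for a literal backslash followed by 'n'.
df['text'] = df['text'].str.replace('\n', '', regex=False)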

In [10]:
df

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...
...,...,...,...,...
57645,Ziggy Marley,Good Old Days,/z/ziggy+marley/good+old+days_10198588.html,Irie days come on play \nLet the angels fly l...
57646,Ziggy Marley,Hand To Mouth,/z/ziggy+marley/hand+to+mouth_20531167.html,Power to the workers \nMore power \nPower to...
57647,Zwan,Come With Me,/z/zwan/come+with+me_20148981.html,all you need \nis something i'll believe \nf...
57648,Zwan,Desire,/z/zwan/desire_20148986.html,northern star \nam i frightened \nwhere can ...


# Big data

In [11]:
# Drop the column we don't need

In [12]:
data = df.drop('link', axis=1)

In [13]:
data

Unnamed: 0,artist,song,text
0,ABBA,Ahe's My Kind Of Girl,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante","Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,Making somebody happy is a question of give an...
...,...,...,...
57645,Ziggy Marley,Good Old Days,Irie days come on play \nLet the angels fly l...
57646,Ziggy Marley,Hand To Mouth,Power to the workers \nMore power \nPower to...
57647,Zwan,Come With Me,all you need \nis something i'll believe \nf...
57648,Zwan,Desire,northern star \nam i frightened \nwhere can ...


In [14]:
# Build the TF-IDF model

In [15]:
tfidf = TfidfVectorizer(analyzer='word', stop_words='english')

In [16]:
tfidf

In [17]:
lyrics_matrix = tfidf.fit_transform(data['text'])

In [18]:
lyrics_matrix

<57650x82077 sparse matrix of type '<class 'numpy.float64'>'
	with 3095478 stored elements in Compressed Sparse Row format>

*How do we use this matrix for a recommendation?* 

We now need to calculate how similar one song's lyrics are to another's. We are going to use **cosine similarity**.

We want the cosine similarity of each item with every other item in the dataset, so we pass `lyrics_matrix` as the only argument to `cosine_similarity`.
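A small sketch of this step on toy lyrics (invented for the example) is shown below. Because `TfidfVectorizer` produces unit-length rows by default, the cosine similarity matches a plain dot product. The sketch also estimates how much memory a dense similarity matrix would need for all 57,650 songs, assuming 8-byte floats, which makes sampling a practical workaround.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy lyrics invented for this example.
toy_lyrics = [
    "sunshine river dancing",
    "river rain dancing",
    "midnight train",
]

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(toy_lyrics)

# One argument: every row is compared with every other row -> dense (3, 3) array.
toy_sim = cosine_similarity(toy_matrix)
print(np.round(toy_sim, 3))

# Rows are already L2-normalised, so the dot product gives the same values.
toy_dot = (toy_matrix @ toy_matrix.T).toarray()

# Rough size of a dense float64 similarity matrix for the full dataset.
n_songs = 57650
dense_gb = n_songs ** 2 * 8 / 1e9
print(f"Dense {n_songs} x {n_songs} matrix: about {dense_gb:.1f} GB")
```

The first two toy songs share two words and get a positive score, while the third shares none and scores 0 against both.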

# Small data

In [19]:
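# Work on a random subset of the data

In [1]:
# Assumption: the 5,000-song sample is drawn from `data` (the frame without the `link` column)
s_data = data.sample(n=5000)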

In [None]:
# Build the TF-IDF model

In [None]:
tfidf = TfidfVectorizer(analyzer='word', stop_words='english')

In [None]:
lyrics_matrix = tfidf.fit_transform(s_data['text'])

In [None]:
lyrics_matrix

<5000x23779 sparse matrix of type '<class 'numpy.float64'>'
	with 271786 stored elements in Compressed Sparse Row format>

In [None]:
cosine_similarities = cosine_similarity(lyrics_matrix) 

In [None]:
cosine_similarities

array([[1.        , 0.0112955 , 0.01217367, ..., 0.00260185, 0.02330793,
        0.01199469],
       [0.0112955 , 1.        , 0.00589492, ..., 0.00518837, 0.01187278,
        0.00418723],
       [0.01217367, 0.00589492, 1.        , ..., 0.00566479, 0.00519437,
        0.00380942],
       ...,
       [0.00260185, 0.00518837, 0.00566479, ..., 1.        , 0.01263193,
        0.        ],
       [0.02330793, 0.01187278, 0.00519437, ..., 0.01263193, 1.        ,
        0.04007434],
       [0.01199469, 0.00418723, 0.00380942, ..., 0.        , 0.04007434,
        1.        ]])

Once we have the similarities, we'll store the 50 most similar songs for each song in our dataset in a dictionary keyed by song title.

In [None]:
similarities = {}

In [None]:
s_data

Unnamed: 0,artist,song,text
4427,Wet Wet Wet,Sweet Little Mystery,"Baby baby baby, yeah \nWestern Union man \nS..."
3661,Starship,Jane,Jane you say it's all over for you and me girl...
4929,Rod Stewart,Have I Told You Lately,Have I told you lately that I love you? \nHav...
3757,Zebrahead,All For None And None For All,Sticky hands and a crooked hook in his stride ...
3244,Kiss,Lonely Is The Hunter,"My eggs in one basket, but she threw me a bone..."
...,...,...,...
4582,The Killers,Carry Me Home,Let me out \nDon't tell me everything \nStar...
3189,Pet Shop Boys,The Way It Used To Be,"I'm here, you're there \nCome closer, tonight..."
501,Rick Astley,Natures Gift,Everybody needs a woman in their life \nA mot...
3698,The Jam,Burning Sky,"Dear, \nHow are things in your little world, ..."


In [None]:
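# Reset the index of s_data so that positions match the rows of cosine_similarities
s_data = s_data.reset_index(drop=True)

for i in range(len(cosine_similarities)):
    # Sort the similarities of song i in descending order and take the indices of the top 51 songs
    similar_indices = cosine_similarities[i].argsort()[:-52:-1] 
    # Store the 50 most similar songs as (score, song, artist) tuples,
    # skipping the first entry, which is normally the song itself
    similarities[s_data['song'].iloc[i]] = [(cosine_similarities[i][x], s_data['song'][x], s_data['artist'][x]) for x in similar_indices][1:]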
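To see what the `argsort` step does, here is the same pattern on a hand-made 4×4 similarity matrix. The titles and scores are invented. Unlike the loop above, this sketch removes the song's own index explicitly instead of dropping the first element; that is a choice made for this example, and it avoids assuming the song always ranks first when two songs have identical lyrics.

```python
import numpy as np

# Hand-made symmetric similarity matrix for four hypothetical songs.
toy_scores = np.array([
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.9],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.9, 0.4, 1.0],
])
toy_titles = ["Song A", "Song B", "Song C", "Song D"]
top_n = 2  # how many neighbours to keep per song

toy_top = {}
for i, row in enumerate(toy_scores):
    # argsort sorts ascending, so reverse it to get the highest scores first.
    order = row.argsort()[::-1]
    # Remove the song itself, then keep the top_n remaining indices.
    neighbours = [j for j in order if j != i][:top_n]
    toy_top[toy_titles[i]] = [(float(row[j]), toy_titles[j]) for j in neighbours]

print(toy_top["Song A"])
```

Each dictionary entry is then a ready-made ranked list, so a recommendation is just a lookup followed by a slice.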

After that, all the magic happens. We can use these similarity scores to look up the most similar items and give a recommendation.

For that, we'll define our content-based recommender class.

In [None]:
class ContentBasedRecommender:
    def __init__(self, matrix):
        # Dictionary mapping each song title to its ranked list of similar songs
        self.matrix_similar = matrix

    def _print_message(self, song, recom_song):
        rec_items = len(recom_song)
        
        print(f'The {rec_items} recommended songs for {song} are:')
        for i in range(rec_items):
            print(f"Number {i+1}:")
            print(f"{recom_song[i][1]} by {recom_song[i][2]} with {round(recom_song[i][0], 3)} similarity score") 
            print("--------------------")
        
    def recommend(self, recommendation):
        # Get song to find recommendations for
        song = recommendation['song']
        # Get number of songs to recommend
        number_songs = recommendation['number_songs']
        # Take the `number_songs` most similar songs from the similarity dictionary
        recom_song = self.matrix_similar[song][:number_songs]
        # print each item
        self._print_message(song=song, recom_song=recom_song)

In [None]:
recommedations = ContentBasedRecommender(similarities)

Then, we are ready to pick a song from the dataset and make a recommendation.

In [None]:
recommendation = {
    "song": s_data['song'].iloc[10],
    "number_songs": 5
}

In [None]:
recommedations.recommend(recommendation)

The 5 recommended songs for More Than We Bargained For are:
Number 1:
Can't Depend On Love by Gordon Lightfoot with 0.266 similarity score
--------------------
Number 2:
Number The Brave by Wishbone Ash with 0.18 similarity score
--------------------
Number 3:
Two For The Price Of One by ABBA with 0.168 similarity score
--------------------
Number 4:
If You Go Away by Cyndi Lauper with 0.163 similarity score
--------------------
Number 5:
I Can't Keep Away From You by Loretta Lynn with 0.162 similarity score
--------------------


And we can pick another random song and recommend again:

In [None]:
recommendation2 = {
    "song": s_data['song'].iloc[120],
    "number_songs": 5 
}

In [None]:
recommedations.recommend(recommendation2)

The 5 recommended songs for All That She Wants are:
Number 1:
A!! She Wants To Do Is Dance by Don Henley with 0.444 similarity score
--------------------
Number 2:
How Do You? by Radiohead with 0.424 similarity score
--------------------
Number 3:
Face To The Highway by Tom Waits with 0.407 similarity score
--------------------
Number 4:
To Be Alive by Yes with 0.354 similarity score
--------------------
Number 5:
Rich Kid by Uriah Heep with 0.335 similarity score
--------------------


# Medium data


In [21]:
df.shape

(57650, 4)

In [22]:
m_data = df.sample(n=15000)

In [23]:
m_data = m_data.drop('link',axis=1)

In [24]:
# model

In [25]:
m_data

Unnamed: 0,artist,song,text
55863,Weezer,O Holy Night,"Oh holy night, the stars are brightly shining ..."
42865,Marillion,Chelsea Monday,"Catalogue princess, apprentice seductress \nH..."
20260,Uriah Heep,The Wizard,He was the wizard of a thousand kings \nAnd I...
53578,Tom Waits,Face To The Highway,I'm going away \nI'm going away \nI'm going ...
872,Arrogant Worms,Mounted Animal Nature Trail,"On the Mounted Animal Nature Trail, you'll be ..."
...,...,...,...
23794,Alphaville,Peace On Earth,"It's nothing serious, just a simple case of bo..."
11855,Lloyd Cole,Sweetheart,I got your letter baby the one that said \nYo...
51025,Rod Stewart,Cigarettes And Alcohol,Is it my imagination \nOr have I finally foun...
24421,Avril Lavigne,Pathetic,"I know I'm pathetic, I knew when he said it \..."


# Model

In [26]:
tfidf = TfidfVectorizer(analyzer='word', stop_words='english')

In [27]:
lyrics_matrix = tfidf.fit_transform(m_data['text'])

In [28]:
cosine_similarities = cosine_similarity(lyrics_matrix) 

In [29]:
similarities = {}

In [30]:
# Reset the index of m_data so that positions match the rows of cosine_similarities
m_data = m_data.reset_index(drop=True)

for i in range(len(cosine_similarities)):
    # Sort the similarities of song i in descending order and take the indices of the top 51 songs.
    similar_indices = cosine_similarities[i].argsort()[:-52:-1] 
    # Store the 50 most similar songs as (score, song, artist) tuples,
    # skipping the first entry, which is normally the song itself.
    similarities[m_data['song'].iloc[i]] = [(cosine_similarities[i][x], m_data['song'][x], m_data['artist'][x]) for x in similar_indices][1:]

In [31]:
class ContentBasedRecommender:
    def __init__(self, matrix):
        self.matrix_similar = matrix

    def _print_message(self, song, recom_song):
        rec_items = len(recom_song)
        
        print(f'The {rec_items} recommended songs for {song} are:')
        for i in range(rec_items):
            print(f"Number {i+1}:")
            print(f"{recom_song[i][1]} by {recom_song[i][2]} with {round(recom_song[i][0], 3)} similarity score") 
            print("--------------------")
        
    def recommend(self, recommendation):
        # Get song to find recommendations for
        song = recommendation['song']
        # Get number of songs to recommend
        number_songs = recommendation['number_songs']
        # Take the `number_songs` most similar songs from the similarity dictionary
        recom_song = self.matrix_similar[song][:number_songs]
        # print each item
        self._print_message(song=song, recom_song=recom_song)

In [32]:
recommedations = ContentBasedRecommender(similarities)

In [33]:
recommendation = {
    "song": m_data['song'].iloc[10],
    "number_songs": 4 
}

In [34]:
recommedations.recommend(recommendation)

The 4 recommended songs for Strange Boat are:
Number 1:
Strange by Wet Wet Wet with 0.595 similarity score
--------------------
Number 2:
Strange One by Marianne Faithfull with 0.485 similarity score
--------------------
Number 3:
Every Day by Roxette with 0.463 similarity score
--------------------
Number 4:
Strange by Poison with 0.455 similarity score
--------------------


In [35]:
recommendation2 = {
    "song": m_data['song'].iloc[120],
    "number_songs": 4 
}

In [36]:
recommedations.recommend(recommendation2)

The 4 recommended songs for Thug Motivation 101 are:
Number 1:
Ay,ay,i by Gloria Estefan with 0.699 similarity score
--------------------
Number 2:
No Way by Lady Gaga with 0.677 similarity score
--------------------
Number 3:
Talk Is Cheap by Miley Cyrus with 0.674 similarity score
--------------------
Number 4:
I'm Talking About You by Chuck Berry with 0.651 similarity score
--------------------


In [38]:
import pickle

In [40]:
# Save the recommender; the context manager closes the file once the dump is done
with open('content_based_recomandation.pkl', 'wb') as f:
    pickle.dump(recommedations, f)
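Loading a pickled `ContentBasedRecommender` later requires the class to be defined (or importable) in the loading process, because pickle stores only a reference to the class. One possible alternative, sketched below with made-up data and a hypothetical file name, is to pickle just the plain similarity dictionary and rebuild the recommender around it after loading.

```python
import os
import pickle
import tempfile

# Made-up similarity dictionary in the same (score, song, artist) layout.
toy_similarities = {
    "Song A": [(0.7, "Song C", "Artist C"), (0.2, "Song B", "Artist B")],
}

# Hypothetical file name inside a temporary directory, only for this example.
toy_path = os.path.join(tempfile.mkdtemp(), "toy_similarities.pkl")

# Save the dictionary; the context manager closes the file afterwards.
with open(toy_path, "wb") as f:
    pickle.dump(toy_similarities, f)

# Load it back; no custom class is needed to unpickle a plain dict.
with open(toy_path, "rb") as f:
    restored = pickle.load(f)

print(restored["Song A"][0])
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.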