# **Music Recommendation System**

In [1]:
import warnings #Used to ignore the warning given as output of the code.
warnings.filterwarnings('ignore')

import numpy as np # Basic libraries of python for numeric and dataframe computations.
import pandas as pd

import matplotlib.pyplot as plt #Basic library for data visualization.
import seaborn as sns #Slightly advanced library for data visualization

# from sklearn.metrics.pairwise import cosine_similarity #To compute the cosine similarity between two vectors.
from collections import defaultdict #A dictionary output that does not raise a key error

from sklearn.metrics import mean_squared_error # A performance metrics in sklearn.

# **Milestone 2**

Now that we have explored the data, let's apply different algorithms to build recommendation systems

**Note:** Use the shorter version of the data i.e. the data after the cutoffs as used in Milestone 1.

In [2]:
df_final = pd.read_csv('C:/MITADSC/Capstone/df_final.csv')

In [3]:
df_final.drop('Unnamed: 0', 1, inplace=True) 
df_final.head()

Unnamed: 0,user_id,song_id,play_count,title,release,artist_name,year
0,6958,447,1,Daisy And Prudence,Distillation,Erin McKeown,2000
1,6958,512,1,The Ballad of Michael Valentine,Sawdust,The Killers,2004
2,6958,549,1,I Stand Corrected (Album),Vampire Weekend,Vampire Weekend,2007
3,6958,703,1,They Might Follow You,Tiny Vipers,Tiny Vipers,2007
4,6958,719,1,Monkey Man,You Know I'm No Good,Amy Winehouse,2007


### **Popularity-Based Recommendation Systems**

Let's take the count and sum of play counts of the songs and build the popularity recommendation systems on the basis of the sum of play counts.

In [4]:
#Calculating average play_count
average_count = df_final.groupby('song_id').mean()['play_count'] #Hint: Use groupby function on the song_id column. 

#Calculating the frequency a song is played.
play_freq = df_final.groupby('song_id').sum()['play_count']#Hint: Use groupby function on the song_id column

In [5]:
#Making a dataframe with the average_count and play_freq
final_play = pd.DataFrame({'avg_count':average_count, 'play_freq':play_freq})
final_play.head()

Unnamed: 0_level_0,avg_count,play_freq
song_id,Unnamed: 1_level_1,Unnamed: 2_level_1
21,1.631387,447
22,1.464286,205
50,1.616822,173
52,1.715232,777
62,1.727273,209


Now, let's create a function to find the top n songs for a recommendation based on the average play count of song. We can also add a threshold for a minimum number of playcounts for a song to be considered for recommendation.

In [6]:
#Build the function for finding top n songs

def top_n_songs(average_count, n, play_freq):
    
    #Finding products with minimum number of interactions
    recommendations = final_play[final_play['play_freq'] > play_freq]
    
    #Sorting values w.r.t average rating 
    recommendations = recommendations.sort_values(by = 'avg_count', ascending = False)
    
    return recommendations.index[:n]

In [7]:
#Recommend top 10 songs using the function defined above with at least 100 plays
top_songs = list(top_n_songs(final_play, 10, 100))
top_songs

[7224, 6450, 8324, 9942, 8483, 5531, 657, 5653, 614, 2220]

### **User User Similarity-Based Collaborative Filtering**

To build the user-user-similarity based and subsequent models we will use the "surprise" library.

In [8]:
#Install the surprise package using pip. Uncomment and run the below code to do the same. 
#!pip install surprise 

In [9]:
# Import necessary libraries
# To compute the accuracy of models
from surprise import accuracy

# class is used to parse a file containing play_counts, data should be in structure - user; item ; play_count
from surprise.reader import Reader

# class for loading datasets
from surprise.dataset import Dataset

# for tuning model hyperparameters
from surprise.model_selection import GridSearchCV

# for splitting the data in train and test dataset
from surprise.model_selection import train_test_split

# for implementing similarity-based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic

# for implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD

# for implementing KFold cross-validation
from surprise.model_selection import KFold

#For implementing clustering-based recommendation system
from surprise import CoClustering

### Some useful functions

The below is the function to calculate precision@k and recall@k, RMSE and F1_Score@k to evaluate the model performance.

**Think About It:** Which metric should be used for this problem to compare different models?

In [10]:
#The function to calulate the RMSE, precision@k, recall@k and F_1 score. 
def precision_recall_at_k(model, k=30, threshold=1.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    
    #Making predictions on the test data
    predictions=model.test(testset)
    
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in user_ratings[:k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set Precision to 0 when n_rec_k is 0.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set Recall to 0 when n_rel is 0.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
    
    #Mean of all the predicted precisions are calculated.
    precision = round((sum(prec for prec in precisions.values()) / len(precisions)),3)
    #Mean of all the predicted recalls are calculated.
    recall = round((sum(rec for rec in recalls.values()) / len(recalls)),3)
    
    accuracy.rmse(predictions)
    print('Precision: ', precision) #Command to print the overall precision
    print('Recall: ', recall) #Command to print the overall recall
    print('F_1 score: ', round((2*precision*recall)/(precision+recall),3)) # Formula to compute the F-1 score.

**Think About It:** In the function precision_recall_at_k above the threshold value used is 1.5. How precision and recall are affected by chaning the threshold? What is the intuition behind using the threshold value 1.5? 

In [11]:
# Instantiating Reader scale with expected play count
reader = Reader(rating_scale=(0, 5)) #use (0,5)

# loading the dataset
data = Dataset.load_from_df(df_final[['user_id', 'song_id', 'play_count']], reader) #Take only "user_id","song_id", and "play_count"

# splitting the data into train and test dataset
trainset, testset = train_test_split(data, test_size=0.4, random_state=42) # Take test_size=0.4

**Think About It:** How changing the test size would change the results and outputs?

In [12]:
#Build the default user-user-similarity model
sim_options = {'name': 'cosine',
               'user_based': True}

#KNN algorithm is used to find desired similar items.
sim_user_user = KNNBasic(sim_options=sim_options, verbose=False, random_state=1) #use random_state=1 

# Train the algorithm on the trainset, and predict play_count for the testset
sim_user_user.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score with k =30.
precision_recall_at_k(sim_user_user, k=30) #Use sim_user_user model

RMSE: 1.0817
Precision:  0.401
Recall:  0.705
F_1 score:  0.511


**Observations and Insights:_________**

In [13]:
#predicting play_count for a sample user with a listened song.
sim_user_user.predict("6958", "1671", r_ui=2, verbose=True) #use user id 6958 and song_id 1671

user: 6958       item: 1671       r_ui = 2.00   est = 1.70   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


Prediction(uid='6958', iid='1671', r_ui=2, est=1.698939503494818, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})

In [14]:
#predicting play_count for a sample user with a song not-listened by the user.
sim_user_user.predict("6958","3232", verbose=True) #Use user_id 6958 and song_id 3232

user: 6958       item: 3232       r_ui = None   est = 1.70   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


Prediction(uid='6958', iid='3232', r_ui=None, est=1.698939503494818, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})

**Observations and Insights:_________**

Now, let's try to tune the model and see if we can improve the model performance.

In [15]:
# setting up parameter grid to tune the hyperparameters
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ["cosine",'pearson',"pearson_baseline"],
                              'user_based': [True], "min_support":[2,4]}
              }

# performing 3-fold cross validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=3, n_jobs=-1)

# fitting the data
gs.fit(data) #Use entire data for GridSearch

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])


1.0038432128090233
{'k': 30, 'min_k': 9, 'sim_options': {'name': 'pearson_baseline', 'user_based': True, 'min_support': 2}}


In [16]:
# Train the best model found in above gridsearch.
sim_options = {'name': 'cosine',
               'user_based': True}

# creating an instance of KNNBasic with optimal hyperparameter values
sim_user_user_optimized = KNNBasic(sim_options=sim_options, k=30, min_k=9, random_state=1, verbose=False)

# training the algorithm on the trainset
sim_user_user_optimized.fit(trainset)

# Let us compute precision@k and recall@k also with k =10.
precision_recall_at_k(sim_user_user_optimized)


RMSE: 1.0873
Precision:  0.4
Recall:  0.696
F_1 score:  0.508


**Observations and Insights:_________**

In [17]:
#Predict the play count for a user who has listened to the song. Take user_id 6958, song_id 1671 and r_ui=2
sim_user_user_optimized.predict("6958", "1671", r_ui = 2, verbose=True)

user: 6958       item: 1671       r_ui = 2.00   est = 1.70   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


Prediction(uid='6958', iid='1671', r_ui=2, est=1.698939503494818, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})

In [18]:
#Predict the play count for a song that is not listened by the user (with user_id 6958)
sim_user_user_optimized.predict("6958", "3232", verbose=True)

user: 6958       item: 3232       r_ui = None   est = 1.70   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


Prediction(uid='6958', iid='3232', r_ui=None, est=1.698939503494818, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})

**Observations and Insights:______________**

**Think About It:** Along with making predictions on listened and unknown songs can we get 5 nearest neighbors (most similar) to a certain user?

In [19]:
#Use inner id 0. 
sim_user_user_optimized.get_neighbors(0,5)

[3, 6, 8, 9, 11]

Below we will be implementing a function where the input parameters are - 

- data: a **song** dataset
- user_id: a user id **against which we want the recommendations**
- top_n: the **number of songs we want to recommend**
- algo: the algorithm we want to use **for predicting the play_count**
- The output of the function is a **set of top_n items** recommended for the given user_id based on the given algorithm

In [20]:
def get_recommendations(data, user_id, top_n, algo):
    
    # creating an empty list to store the recommended product ids
    recommendations = []
    
    # creating an user item interactions matrix 
    user_item_interactions_matrix = data.pivot(index='user_id', columns='song_id', values='play_count')
    
    # extracting those song ids which the user_id has not interacted yet
    non_interacted_products = user_item_interactions_matrix.loc[user_id][user_item_interactions_matrix.loc[user_id].isnull()].index.tolist()
    
    # looping through each of the product ids which user_id has not interacted yet
    for item_id in non_interacted_products:
        
        # predicting the ratings for those non interacted product ids by this user
        est = algo.predict(user_id, item_id).est
        
        # appending the predicted ratings
        recommendations.append((item_id, est))

    # sorting the predicted ratings in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)

    return recommendations[:top_n] # returing top n highest predicted rating products for this user

In [21]:
#Make top 5 recommendations for user_id 6958 with a similarity-based recommendation engine.
recommendations = get_recommendations(df_final, "6958", 5, sim_user_user_optimized)

ValueError: Index contains duplicate entries, cannot reshape

In [None]:
#Building the dataframe for above recommendations with columns "song_id" and "predicted_ratings"
pd.DataFrame(recommendations, columns=['song_id', 'predicted_ratings'])

**Observations and Insights:______________**

### Correcting the play_counts and Ranking the above songs

In [None]:
def ranking_songs(recommendations, final_rating):
  # sort the songs based on play counts
  ranked_songs = final_rating.loc[[items[0] for items in recommendations]].sort_values('play_freq', ascending=False)[['play_freq']].reset_index()

  # merge with the recommended songs to get predicted play_count
  ranked_songs = ranked_songs.merge(pd.DataFrame(recommendations, columns=['song_id', 'predicted_ratings']), on='song_id', how='inner')

  # rank the songs based on corrected play_counts
  ranked_songs['corrected_ratings'] = ranked_songs['predicted_ratings'] - 1 / np.sqrt(ranked_songs['play_freq'])

  # sort the songs based on corrected play_counts
  ranked_songs = ranked_songs.sort_values('corrected_ratings', ascending=False)
  
  return ranked_songs

**Think About It:** In the above function to make the correction in the predicted play_count a quantity 1/np.sqrt(n) is subtracted. What is the intuition behind it? Is it also possible to add this quantity instead of subtracting?

In [None]:
#Applying the ranking_songs function on the final_play data. 
ranking_products(recommendations, final_play)

**Observations and Insights:______________**

### Item Item Similarity-based collaborative filtering recommendation systems 

In [69]:
#Apply the item-item similarity collaborative filtering model with random_state=1 and evaluate the model performance.

sim_options = {'name': 'cosine',
               'user_based': False}

#KNN algorithm is used to find desired similar items.
sim_item_item = KNNBasic(sim_options=sim_options, random_state=1, verbose=False)

# Train the algorithm on the trainset, and predict ratings for the testset
sim_item_item.fit(trainset)

# Let us compute precision@k, recall@k, and f_1 score with k =10.
precision_recall_at_k(sim_item_item)

RMSE: 1.0320
Precision:  0.316
Recall:  0.572
F_1 score:  0.407


**Observations and Insights:______________**

In [70]:
#predicting play count for a sample user_id 6958 and song (with song_id 1671) heard by the user.
sim_item_item.predict("6958", "1671", r_ui=5, verbose=True)

user: 6958       item: 1671       r_ui = 5.00   est = 1.70   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


Prediction(uid='6958', iid='1671', r_ui=5, est=1.698939503494818, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})

In [71]:
#Predict the play count for a user that has not listened to the song (with song_id 1671)
df_final.loc[df.final['column_name'] == some_value]
sim_item_item.predict("6958", "1671", r_ui=5, verbose=True)

user: 6958       item: 1671       r_ui = 5.00   est = 1.70   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


Prediction(uid='6958', iid='1671', r_ui=5, est=1.698939503494818, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})

**Observations and Insights:______________**

In [72]:
#Apply grid search for enhancing model performance

# setting up parameter grid to tune the hyperparameters
param_grid = {'k': [10, 20, 30], 'min_k': [3, 6, 9],
              'sim_options': {'name': ["cosine",'pearson',"pearson_baseline"],
                              'user_based': [False], "min_support":[2,4]}
              }

# performing 3-fold cross validation to tune the hyperparameters
gs = GridSearchCV(KNNBasic, param_grid, measures=['rmse'], cv=3, n_jobs=-1)

# fitting the data
gs.fit(data)

# find best RMSE score
print(gs.best_score['rmse'])

# Extract the combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

0.9884260193514258
{'k': 20, 'min_k': 6, 'sim_options': {'name': 'pearson_baseline', 'user_based': False, 'min_support': 2}}


**Think About It:** How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the list of hyperparameter [here](https://surprise.readthedocs.io/en/stable/knn_inspired.html).

In [73]:
#Apply the best modle found in the grid search.
sim_options =  {'name': 'msd',
               'user_based': False}

# creating an instance of KNNBasic with optimal hyperparameter values
sim_item_item_optimized = KNNBasic(sim_options=sim_options, k=20, min_k=6 , random_state=1, verbose=False)

# training the algorithm on the trainset
sim_item_item_optimized.fit(trainset)

# Let us compute precision@k and recall@k, f1_score@k and RMSE
precision_recall_at_k(sim_item_item_optimized)

RMSE: 1.0248
Precision:  0.371
Recall:  0.562
F_1 score:  0.447


**Observations and Insights:______________**

In [74]:
#Predict the play_count by a user(user_id 6958) for the song (song_id 1671)
sim_item_item_optimized.predict("6958", "1671", verbose=True)

user: 6958       item: 1671       r_ui = None   est = 1.70   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


Prediction(uid='6958', iid='1671', r_ui=None, est=1.698939503494818, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})

In [75]:
#predicting play count for a sample user_id 6958 with song_id 3232 which is not heard by the user.
sim_item_item_optimized.predict("6958", "3232", verbose=True)

user: 6958       item: 3232       r_ui = None   est = 1.70   {'was_impossible': True, 'reason': 'User and/or item is unknown.'}


Prediction(uid='6958', iid='3232', r_ui=None, est=1.698939503494818, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'})

**Observations and Insights:______________**

In [76]:
#Find five most similar users to the user with inner id 0
sim_item_item_optimized.get_neighbors(0, k=5)

[13, 16, 18, 29, 39]

In [77]:
#Making top 5 recommendations for user_id 6958 with item_item_similarity-based recommendation engine.
recommendations = get_recommendations(df_final, "6958", 5, sim_item_item_optimized)

ValueError: Index contains duplicate entries, cannot reshape

In [None]:
#Building the dataframe for above recommendations with columns "song_id" and "predicted_play_count"
pd.DataFrame(recommendations, columns=['song_id', 'predicted_play_count'])

In [None]:
#Applying the ranking_songs function. 
ranking_products(recommendations, final_play)

**Observations and Insights:_________**

### Model Based Collaborative Filtering - Matrix Factorization

Model-based Collaborative Filtering is a **personalized recommendation system**, the recommendations are based on the past behavior of the user and it is not dependent on any additional information. We use **latent features** to find recommendations for each user.

In [78]:
# Build baseline model using svd
svd = SVD(random_state=1)

# training the algorithm on the trainset
svd.fit(trainset)

# Use the function precision_recall_at_k to compute precision@k, recall@k, F1-Score@k, and RMSE
precision_recall_at_k(svd)

RMSE: 1.0026
Precision:  0.432
Recall:  0.654
F_1 score:  0.52


In [79]:
# Making prediction for user (with user_id 6958) to song (with song_id 1671), take r_ui=2
svd.predict("6958", "1671", r_ui=2, verbose=True)

user: 6958       item: 1671       r_ui = 2.00   est = 1.70   {'was_impossible': False}


Prediction(uid='6958', iid='1671', r_ui=2, est=1.698939503494818, details={'was_impossible': False})

In [None]:
# Making prediction for user who has not listened the song (song_id 3232)
svd.predict("6958", "3232", r_ui=2, verbose=True)

#### Improving matrix factorization based recommendation system by tuning its hyperparameters

In [80]:
# set the parameter space to tune
param_grid = {'n_epochs': [10, 20, 30], 'lr_all': [0.001, 0.005, 0.01],
              'reg_all': [0.2, 0.4, 0.6]}

# performing 3-fold gridsearch cross validation
gs_ = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3, n_jobs=-1)

# fitting data
gs_.fit(data)

# best RMSE score
print(gs_.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs_.best_params['rmse'])

1.0030215721308338
{'n_epochs': 30, 'lr_all': 0.01, 'reg_all': 0.2}


**Think About It**: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the available hyperparameters [here](https://surprise.readthedocs.io/en/stable/matrix_factorization.html).

In [81]:
# Building the optimized SVD model using optimal hyperparameters
svd_optimized = SVD(n_epochs = 30, lr_all = 0.01, reg_all = 0.2, random_state=1)

# Train the algorithm on the trainset
svd_optimized = svd_optimized.fit(trainset)

# Use the function precision_recall_at_k to compute precision@k, recall@k, F1-Score@k, and RMSE
precision_recall_at_k(svd_optimized)

RMSE: 1.0085
Precision:  0.414
Recall:  0.645
F_1 score:  0.504


**Observations and Insights:_________**

In [82]:
#Using svd_algo_optimized model to recommend for userId 6958 and song_id 1671.
svd_optimized.predict("6958", "1671", r_ui=5, verbose=True)

user: 6958       item: 1671       r_ui = 5.00   est = 1.70   {'was_impossible': False}


Prediction(uid='6958', iid='1671', r_ui=5, est=1.698939503494818, details={'was_impossible': False})

In [83]:
#Using svd_algo_optimized model to recommend for userId 6958 and song_id 3232 with unknown baseline rating.
svd_optimized.predict("6958", "3232", verbose=True)

user: 6958       item: 3232       r_ui = None   est = 1.70   {'was_impossible': False}


Prediction(uid='6958', iid='3232', r_ui=None, est=1.698939503494818, details={'was_impossible': False})

**Observations and Insights:_________**

In [84]:
# Getting top 5 recommendations for user_id 6958 using "svd_optimized" algorithm.
recommendations = get_recommendations(df_final, "6958", 5, svd_optimized)

ValueError: Index contains duplicate entries, cannot reshape

In [85]:
#Ranking songs based on above recommendations
ranking_products(recommendations, final_play)

NameError: name 'ranking_products' is not defined

**Observations and Insights:_________**

### Cluster Based Recommendation System

In **clustering-based recommendation systems**, we explore the **similarities and differences** in people's tastes in songs based on how they rate different songs. We cluster similar users together and recommend songs to a user based on play_counts from other users in the same cluster.

In [86]:
# Make baseline clustering model
clust_baseline = CoClustering(random_state=1)

# training the algorithm on the trainset
clust_baseline.fit(trainset)

# Let us compute precision@k and recall@k with k =10.
precision_recall_at_k(clust_baseline, k = 10)

RMSE: 1.0428
Precision:  0.408
Recall:  0.493
F_1 score:  0.446


In [87]:
#Making prediction for user_id 6958 and song_id 1671.
clust_baseline.predict(6958, 1671, r_ui=5, verbose=True)

user: 6958       item: 1671       r_ui = 5.00   est = 1.36   {'was_impossible': False}


Prediction(uid=6958, iid=1671, r_ui=5, est=1.3627795906030837, details={'was_impossible': False})

In [88]:
#Making prediction for user (userid 6958) for a song(song_id 3232) not heard by the user.
clust_baseline.predict(6958, 3232, r_ui=5, verbose=True)

user: 6958       item: 3232       r_ui = 5.00   est = 1.55   {'was_impossible': False}


Prediction(uid=6958, iid=3232, r_ui=5, est=1.5488944215348255, details={'was_impossible': False})

#### Improving clustering-based recommendation system by tuning its hyper-parameters

In [89]:
# set the parameter space to tune
param_grid = {'n_cltr_u':[5,6,7,8], 'n_cltr_i': [5,6,7,8], 'n_epochs': [10,20,30]}

# performing 3-fold gridsearch cross validation
gs = GridSearchCV(CoClustering, param_grid, measures=['rmse'], cv=3, n_jobs=-1)

# fitting data
gs.fit(data)

# best RMSE score
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.0484356801913162
{'n_cltr_u': 5, 'n_cltr_i': 5, 'n_epochs': 30}


**Think About It**: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the available hyperparameters [here](https://surprise.readthedocs.io/en/stable/co_clustering.html).

In [None]:
# Train the tuned Coclustering algorithm
clust_tuned = CoClustering(n_cltr_u=3,n_cltr_i=3, n_epochs=40, random_state=1)

# training the algorithm on the trainset
clust_tuned.fit(trainset)

# Let us compute precision@k and recall@k with k =10.
precision_recall_at_k(clust_tuned)

**Observations and Insights:_________**

In [None]:
#Using co_clustering_optimized model to recommend for userId 6958 and song_id 1671.
clust_tuned.predict(6958, 1671, r_ui=5, verbose=True)

In [None]:
#Use Co_clustering based optimized model to recommend for userId 6958 and song_id 3232 with unknown baseline rating.
clust_tuned.predict(6958, 3232, verbose=True)

**Observations and Insights:_________**

#### Implementing the recommendation algorithm based on optimized CoClustering model

In [None]:
#Getting top 5 recommendations for user_id 6958 using "Co-clustering based optimized" algorithm.
clustering_recommendations = get_recommendations(df_final, "6958", 5, clust_tuned)

### Correcting the play_count and Ranking the above songs

In [None]:
#Ranking songs based on above recommendations
ranking_products(recommendations, final_play)

**Observations and Insights:_________**

### Content Based Recommendation Systems

**Think About It:** So far we have only used the play_count of songs to find recommendations but we have other information/features on songs as well. Can we take those song features into account?

In [22]:
df_small=df_final

In [23]:
# Concatenate the "title","release","artist_name" columns to create a different column named "text"
df_small['text']=df_small['title'].astype(str)+' '+df_small['release']+' '+df_small['artist_name']
df_small.head()

Unnamed: 0,user_id,song_id,play_count,title,release,artist_name,year,text
0,6958,447,1,Daisy And Prudence,Distillation,Erin McKeown,2000,Daisy And Prudence Distillation Erin McKeown
1,6958,512,1,The Ballad of Michael Valentine,Sawdust,The Killers,2004,The Ballad of Michael Valentine Sawdust The Ki...
2,6958,549,1,I Stand Corrected (Album),Vampire Weekend,Vampire Weekend,2007,I Stand Corrected (Album) Vampire Weekend Vamp...
3,6958,703,1,They Might Follow You,Tiny Vipers,Tiny Vipers,2007,They Might Follow You Tiny Vipers Tiny Vipers
4,6958,719,1,Monkey Man,You Know I'm No Good,Amy Winehouse,2007,Monkey Man You Know I'm No Good Amy Winehouse


In [24]:
#Select the columns 'user_id', 'song_id', 'play_count', 'title', 'text' from df_small data
col_names = ['user_id', 'song_id', 'play_count', 'title', 'text']
df_small = df_small[col_names]

#drop the duplicates from the title column
df_small.drop_duplicates(subset = ['title'], keep = 'first')


Unnamed: 0,user_id,song_id,play_count,title,text
0,6958,447,1,Daisy And Prudence,Daisy And Prudence Distillation Erin McKeown
1,6958,512,1,The Ballad of Michael Valentine,The Ballad of Michael Valentine Sawdust The Ki...
2,6958,549,1,I Stand Corrected (Album),I Stand Corrected (Album) Vampire Weekend Vamp...
3,6958,703,1,They Might Follow You,They Might Follow You Tiny Vipers Tiny Vipers
4,6958,719,1,Monkey Man,Monkey Man You Know I'm No Good Amy Winehouse
...,...,...,...,...,...
4941,40245,2161,1,The Last Song,The Last Song The All-American Rejects The All...
5000,33280,519,1,Invincible,Invincible Black Holes And Revelations Muse
5010,33280,1536,2,Paper Gangsta,Paper Gangsta The Fame Monster Lady GaGa
5698,1246,3466,1,Starlight,Starlight Starlight Muse


In [27]:
#Set the title column as the index
df_small.set_index('title')


Unnamed: 0_level_0,user_id,song_id,play_count,text
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Daisy And Prudence,6958,447,1,Daisy And Prudence Distillation Erin McKeown
The Ballad of Michael Valentine,6958,512,1,The Ballad of Michael Valentine Sawdust The Ki...
I Stand Corrected (Album),6958,549,1,I Stand Corrected (Album) Vampire Weekend Vamp...
They Might Follow You,6958,703,1,They Might Follow You Tiny Vipers Tiny Vipers
Monkey Man,6958,719,1,Monkey Man You Know I'm No Good Amy Winehouse
...,...,...,...,...
Half Of My Heart,47786,9139,1,Half Of My Heart Battle Studies John Mayer
Bitter Sweet Symphony,47786,9186,1,Bitter Sweet Symphony Bitter Sweet Symphony Th...
The Police And The Private,47786,9351,2,The Police And The Private Live It Out Metric
Just Friends,47786,9543,1,Just Friends Back To Black Amy Winehouse


In [26]:

# see the first 5 records of the df_small dataset
#df_small.head()

Unnamed: 0,user_id,song_id,play_count,title,text
0,6958,447,1,Daisy And Prudence,Daisy And Prudence Distillation Erin McKeown
1,6958,512,1,The Ballad of Michael Valentine,The Ballad of Michael Valentine Sawdust The Ki...
2,6958,549,1,I Stand Corrected (Album),I Stand Corrected (Album) Vampire Weekend Vamp...
3,6958,703,1,They Might Follow You,They Might Follow You Tiny Vipers Tiny Vipers
4,6958,719,1,Monkey Man,Monkey Man You Know I'm No Good Amy Winehouse


In [28]:
# Create the series of indices from the data
indices = pd.Series(df_small.index)
indices[:5]

0    0
1    1
2    2
3    3
4    4
dtype: int64

In [33]:
#Importing necessary packages to work with text data
import nltk
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
import re
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

[nltk_data] Downloading package punkt to C:\Users\Hank
[nltk_data]     Daily\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Hank
[nltk_data]     Daily\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Hank
[nltk_data]     Daily\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


We will create a **function to pre-process the text data:**

In [30]:
# Function to tokenize the text
def tokenize(text):
    text = re.sub(r"[^a-zA-Z]"," ",text.lower())
    tokens = word_tokenize(text)
    words = [word for word in tokens if word not in stopwords.words("english")] #Use stopwords of english
    text_lems = [WordNetLemmatizer().lemmatize(lem).strip() for lem in words]

    return text_lems

In [31]:
#Create tfidf vectorizer 
tfidf = TfidfVectorizer(tokenizer=tokenize)

# Fit_transfrom the above vectorizer on the text column and then convert the output into an array.
df_tfidf = tfidf.fit_transform(df_small['text'].values).toarray()

In [34]:
# Compute the cosine similarity for the tfidf above output
similar_songs = cosine_similarity(df_tfidf, df_tfidf)
similar_songs

MemoryError: Unable to allocate 143. GiB for an array with shape (138301, 138301) and data type float64

 Finally, let's create a function to find most similar songs to recommend for a given song

In [35]:
# function that takes in song title as input and returns the top 10 recommended songs
def recommendations(title, similar_songs):
    
    recommended_songs = []
    
    # gettin the index of the song that matches the title
    idx = indices[indices == title].index[0]

    # creating a Series with the similarity scores in descending order
    score_series = pd.Series(similar_songs[idx]).sort_values(ascending = False)

    # getting the indexes of the 10 most similar songs
    top_10_indexes = list(score_series.iloc[1:11].index)
    print(top_10_indexes)
    
    # populating the list with the titles of the best 10 matching songs
    for i in top_10_indexes:
        recommended_songs.append(list(df_small.index)[i])
        
    return recommended_songs

Recommending 10 songs similar to Learn to Fly

In [36]:
# Make the recommendation for the song with title 'Learn To Fly'
recommendations("Learn To Fly", similar_songs)

NameError: name 'similar_songs' is not defined

**Observations and Insights:_________**

## **Conclusion and Recommendations:** 

- **Refined Insights -** What are the most meaningful insights from the data relevant to the problem?

- **Comparison of various techniques and their relative performance -** How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

- **Proposal for the final solution design -** What model do you propose to be adopted? Why is this the best solution to adopt?