<a href="https://colab.research.google.com/github/AMassani/Angular.Project1/blob/master/Music_Recommendation_System_Full_Code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Music Recommendation System**

## **Problem Definition**

### **The Context:**

A **Recommender System** is like a smart assistant for your customers. It looks at the products or services a customer has bought, used, or reviewed, and the content they have liked, and uses that history to suggest what they might want next.

The same approach can be used to recommend songs to listeners on Spotify, a digital music streaming service that gives users access to millions of songs, podcasts, and other audio content from artists all over the world. It helps users discover more of what they want, which can boost engagement, sales, and customer satisfaction.


### **The objective:**

Accurately predict and recommend the top 10 songs that a specific user is most likely to listen to next.

In simple business terms, the goal is to personalize the user experience by showing each user a tailored list of songs they’re highly likely to enjoy, leading to:

- Increased user engagement (more listening time)
- Better user satisfaction
- Higher retention and loyalty

### **The key questions:**

Understand the user behaviour:
- What are their past listening habits?
- What genres, artists and songs do they prefer?

Understand the features or attributes of songs:
- What makes a song likely to be listened to?
- Is it similar to songs the user has listened to or liked before?
- Is it popular among other users?
- Does it match the user's mood, time of day or context?

Understand the data collected or required:
- Do we have user interaction data (plays, skips, likes)?
- Do we have song metadata (genre, artist, tempo)?
- Is there contextual data (time of day, device used, etc.)?

Evaluate how to measure similarity or interest:
- Between users (Collaborative filtering)
- Between songs (Content based filtering)
- Combination of both (Hybrid method)

How do we evaluate whether the system is recommending what users want to listen to?
- What measurements do we use?
- How to test this?

How do we address any data privacy concerns?
- Is any personal data collected?

How much would this cost us?
- Depending on how complex the model is, do we have enough resources to host the recommender system?
- Hiring staff to handle the different phases of data collection and building the recommender system.
- Ongoing maintenance of the system, since users' likes and dislikes can change over time.

### **The problem formulation**:

Using data science, we are trying to solve the core problem of predicting the top 10 songs that a user would like to listen to, based on the number of times the user has played each song.

We will use several data science steps to explore the data, understand user behaviour and patterns, and uncover what each user likes. Applying a machine learning model will then help predict which songs the user would like.


## **Data Dictionary**

The core data is the Taste Profile Subset released by the Echo Nest as part of the Million Song Dataset. There are two files in this dataset. The first file contains the details about the song id, titles, release, artist name, and the year of release. The second file contains the user id, song id, and the play count of users.

**song_data**

- song_id - A unique id given to every song
- title - Title of the song
- release - Name of the released album
- artist_name - Name of the artist
- year - Year of release

**count_data**

- user_id - A unique id given to the user
- song_id - A unique id given to the song
- play_count - Number of times the song was played

## **Data Source**
http://millionsongdataset.com/

### **Importing Libraries and the Dataset**

In [1]:
# Mounting the drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
# Used to ignore the warning given as output of the code
import warnings
warnings.filterwarnings('ignore')

# Basic libraries of python for numeric and dataframe computations
import numpy as np
import pandas as pd

# Import Matplotlib the Basic library for data visualization
import matplotlib.pyplot as plt
%matplotlib inline

# Import Seaborn - Slightly advanced library for data visualization
import seaborn as sns

# Import the required library to compute the cosine similarity between two vectors
from sklearn.metrics.pairwise import cosine_similarity

# Import defaultdict from collections: a dictionary that does not raise a KeyError for missing keys
from collections import defaultdict

# Import mean_squared_error: a performance metric in sklearn
from sklearn.metrics import mean_squared_error

### **Load the dataset**

In [38]:
# Importing the datasets
user_data_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/count_data.csv')
song_data_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/song_data.csv')

### **Understanding the data by viewing a few observations**

In [39]:
# Display first 10 records of user_data_df data
user_data_df.head(10)


Unnamed: 0.1,Unnamed: 0,user_id,song_id,play_count
0,0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1
1,1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2
2,2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXHDL12A81C204C0,1
3,3,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBYHAJ12A6701BF1D,1
4,4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODACBL12A8C13C273,1
5,5,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODDNQT12A6D4F5F7E,5
6,6,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODXRTY12AB0180F3B,1
7,7,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOFGUAY12AB017B0A8,1
8,8,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOFRQTD12A81C233C0,1
9,9,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOHQWYZ12A6D4FA701,1


In [40]:
# Display first 10 records of song_data_df data
song_data_df.head(10)


Unnamed: 0,song_id,title,release,artist_name,year
0,SOQMMHC12AB0180CB8,Silent Night,Monster Ballads X-Mas,Faster Pussy cat,2003
1,SOVFVAK12A8C1350D9,Tanssi vaan,Karkuteillä,Karkkiautomaatti,1995
2,SOGTUKN12AB017F4F1,No One Could Ever,Butter,Hudson Mohawke,2006
3,SOBNYVR12A8C13558C,Si Vos Querés,De Culo,Yerba Brava,2003
4,SOHSBXH12A8C13B0DF,Tangle Of Aspens,Rene Ablaze Presents Winter Sessions,Der Mystic,0
5,SOZVAPQ12A8C13B63C,"Symphony No. 1 G minor ""Sinfonie Serieuse""/All...",Berwald: Symphonies Nos. 1/2/3/4,David Montgomery,0
6,SOQVRHI12A6D4FB2D7,We Have Got Love,Strictly The Best Vol. 34,Sasha / Turbulence,0
7,SOEYRFT12AB018936C,2 Da Beat Ch'yall,Da Bomb,Kris Kross,1993
8,SOPMIYT12A6D4F851E,Goodbye,Danny Boy,Joseph Locke,0
9,SOJCFMH12A8C13B0C2,Mama_ mama can't you see ?,March to cadence with the US marines,The Sun Harbor's Chorus-Documentary Recordings,0


### **Let us check the data types and missing values of each column**

In [41]:
# Display info of user_data_df
user_data_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000000 entries, 0 to 1999999
Data columns (total 4 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   Unnamed: 0  int64 
 1   user_id     object
 2   song_id     object
 3   play_count  int64 
dtypes: int64(2), object(2)
memory usage: 61.0+ MB


In [42]:
# Display info of song_data_df
song_data_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 5 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   song_id      1000000 non-null  object
 1   title        999983 non-null   object
 2   release      999993 non-null   object
 3   artist_name  1000000 non-null  object
 4   year         1000000 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 38.1+ MB


#### **Observations and Insights:**
Observations of user data:

The data captures user listening behaviour.

- The Unnamed: 0 column appears to be a row counter of integer type. It carries no predictive information, so we should drop it.
- user_id is a unique id for the user and lets us aggregate play counts per user. It is an alphanumeric string that will need to be encoded as a numeric value.
- song_id is a unique identifier for the song that the user has played. It is also an alphanumeric string that will need to be encoded as a numeric value.
- play_count is a numeric value giving the number of times the user played the song.


Observations of song data:

The data holds the metadata of all the songs available to users.

- song_id is the unique identifier of the song. It is the join key between the two data frames, which we can use to merge them for further exploration.
- title is the title of the song.
- release is the release (album) the song appears on.
- artist_name is the name of the artist performing the song.
- year is the year in which the song was released.

The "title" and "release" columns contain null values (17 and 7 rows respectively, according to the info() output above). We may need to fill these with default values or drop rows where critical data is missing.

Key insights:

Song metadata such as "artist_name", "year", "release" and "title" can be used for content-based recommendations, e.g. recommending songs by the same artist, from the same year, or with similar titles.

To recommend songs to new users, we can use content-based filtering on the metadata, popular songs by an artist, or new releases.

After merging the two data frames, we can deepen our understanding of user behaviour, for example:
- What type of songs the users are listening to (genre via artist, release year, etc.)
- Most played songs or artists in a specific time period.






In [43]:
# Drop the column 'Unnamed: 0'
user_data_df.drop('Unnamed: 0', axis=1, inplace=True)

user_data_df.info()

# Left merge user_data_df and song_data_df on "song_id", then drop fully duplicated rows from the merged result
merged_data_df = pd.merge(user_data_df, song_data_df, on='song_id', how='left').drop_duplicates()

# Display first 10 records of merged_data_df
merged_data_df.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000000 entries, 0 to 1999999
Data columns (total 3 columns):
 #   Column      Dtype 
---  ------      ----- 
 0   user_id     object
 1   song_id     object
 2   play_count  int64 
dtypes: int64(1), object(2)
memory usage: 45.8+ MB


Unnamed: 0,user_id,song_id,play_count,title,release,artist_name,year
0,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOAKIMP12A8C130995,1,The Cove,Thicker Than Water,Jack Johnson,0
1,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBBMDR12A8C13253B,2,Entre Dos Aguas,Flamenco Para Niños,Paco De Lucia,1976
2,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBXHDL12A81C204C0,1,Stronger,Graduation,Kanye West,2007
4,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOBYHAJ12A6701BF1D,1,Constellations,In Between Dreams,Jack Johnson,2005
5,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODACBL12A8C13C273,1,Learn To Fly,There Is Nothing Left To Lose,Foo Fighters,1999
6,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODDNQT12A6D4F5F7E,5,Apuesta Por El Rock 'N' Roll,Antología Audiovisual,Héroes del Silencio,2007
7,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SODXRTY12AB0180F3B,1,Paper Gangsta,The Fame Monster,Lady GaGa,2008
8,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOFGUAY12AB017B0A8,1,Stacked Actors,There Is Nothing Left To Lose,Foo Fighters,1999
9,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOFRQTD12A81C233C0,1,Sehr kosmisch,Musik von Harmonia,Harmonia,0
10,b80344d063b5ccb3212f76538f3d9e43d87dca9e,SOHQWYZ12A6D4FA701,1,Heaven's gonna burn your eyes,Hôtel Costes 7 by Stéphane Pompougnac,Thievery Corporation feat. Emiliana Torrini,2002


In [44]:
## Name the obtained dataframe as "df"
df = merged_data_df


**Think About It:** The user_id and song_id values are encrypted strings. Can they be encoded as numeric features?

Most machine learning algorithms work with numeric data, so the string values need to be encoded as numeric features.

Encoded numeric ids can be used to build a user-item matrix and to apply matrix factorization. Numeric values are also more memory-efficient when performing calculations on large models.

In [45]:
# Apply label encoding for "user_id" and "song_id"
from sklearn.preprocessing import LabelEncoder

# Create separate encoder instances
user_encoder = LabelEncoder()
song_encoder = LabelEncoder()

df['user_id'] = user_encoder.fit_transform(df['user_id'])
df['song_id'] = song_encoder.fit_transform(df['song_id'])

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2054534 entries, 0 to 2086945
Data columns (total 7 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   user_id      int64 
 1   song_id      int64 
 2   play_count   int64 
 3   title        object
 4   release      object
 5   artist_name  object
 6   year         int64 
dtypes: int64(4), object(3)
memory usage: 125.4+ MB


**Think About It:** The data contains users who have listened to very few songs, and songs that have been played by very few users. Do we need to filter the data so that it keeps only users who have listened to a reasonable number of songs, and songs with a reasonable number of listeners?

A dataset of about 2 million rows x 7 columns is large and can require substantial computing resources to process. This can lead to long processing times and make it difficult to train and evaluate the model efficiently. To address this, we trim the dataset down to a more manageable size.

In [46]:
# Get the column containing the users
users = df.user_id

# Create a dictionary that maps users(listeners) to the number of songs that they have listened to
playing_count = dict()

for user in users:
    # If we already have the user, just add 1 to their playing count
    if user in playing_count:
        playing_count[user] += 1

    # Otherwise, set their playing count to 1
    else:
        playing_count[user] = 1

In [47]:
# We want our users to have listened to at least 90 songs
SONG_COUNT_CUTOFF = 90

# Create a list of users who need to be removed
remove_users = []

for user, num_songs in playing_count.items():

    if num_songs < SONG_COUNT_CUTOFF:
        remove_users.append(user)

df = df.loc[ ~ df.user_id.isin(remove_users)]

In [48]:
# Get the column containing the songs
songs = df.song_id

# Create a dictionary that maps songs to its number of users(listeners)
playing_count = dict()

for song in songs:
    # If we already have the song, just add 1 to their playing count
    if song in playing_count:
        playing_count[song] += 1

    # Otherwise, set their playing count to 1
    else:
        playing_count[song] = 1

In [49]:
# We want each song to have been listened to by at least 120 users to be considered
LISTENER_COUNT_CUTOFF = 120

remove_songs = []

for song, num_users in playing_count.items():
    if num_users < LISTENER_COUNT_CUTOFF:
        remove_songs.append(song)

df_final= df.loc[ ~ df.song_id.isin(remove_songs)]
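
The two dictionary loops above can also be expressed with pandas counting operations. The following is a minimal sketch on a small made-up frame; the helper name `filter_by_min_count`, the toy data, and the toy cutoffs are illustrative assumptions and are not part of the notebook.

```python
import pandas as pd


def filter_by_min_count(frame, column, cutoff):
    """Keep rows whose value in `column` occurs at least `cutoff` times (hypothetical helper)."""
    counts = frame[column].value_counts()
    keep = counts[counts >= cutoff].index
    return frame[frame[column].isin(keep)]


# Toy interaction data, made up for illustration only
toy = pd.DataFrame({
    "user_id": [0, 0, 0, 1, 1, 2],
    "song_id": [10, 11, 12, 10, 11, 10],
    "play_count": [1, 2, 1, 5, 1, 3],
})

# Same two-step order as the notebook: filter users first, then songs
toy_users = filter_by_min_count(toy, "user_id", cutoff=2)
toy_final = filter_by_min_count(toy_users, "song_id", cutoff=2)
```

On the real data, the same helper would presumably be called with `SONG_COUNT_CUTOFF` for `user_id` and `LISTENER_COUNT_CUTOFF` for `song_id`.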

Among the remaining records, those with play_count less than or equal to 5 make up almost 90% of the data. So, for building the recommendation system, let us consider only those records.

In [50]:
# Keep only records of songs with play_count less than or equal to (<=) 5
df_final = df_final[df_final.play_count<=5]

In [52]:
# Check the shape of df (after the user cutoff only); df_final holds the fully filtered data
df.shape

(467139, 7)

## **Exploratory Data Analysis**

### **Let's check the total number of unique users, songs, artists in the data**

Total number of unique user id

In [53]:
# Display total number of unique user_id
unique_users = df['user_id'].nunique()
print(f"Total number of unique users: {unique_users}")

Total number of unique users: 3338


Total number of unique song id

In [54]:
# Display total number of unique song_id
unique_songs = df['song_id'].nunique()
print(f"Total number of unique songs: {unique_songs}")


Total number of unique songs: 9999


Total number of unique artists

In [55]:
# Display total number of unique artists
unique_artists = df['artist_name'].nunique()
print(f"Total number of unique artists: {unique_artists}")


Total number of unique artists: 3378


#### **Observations and Insights:__________**


### **Let's find out about the most interacted songs and interacted users**

Most interacted songs

Most interacted users

#### **Observations and Insights:_______**


Songs released on yearly basis

In [None]:
# Find out the number of songs released in a year, use the songs_df
  # Hint: Use groupby function on the 'year' column

In [None]:
# Create a barplot with y label as "number of titles played" and x-axis as year

# Set the figure size

# Set the x label of the plot

# Set the y label of the plot

# Show the plot

#### **Observations and Insights:__________** #

**Think About It:** What other insights can be drawn using exploratory data analysis?

Now that we have explored the data, let's apply different algorithms to build recommendation systems.

**Note:** Use the shorter version of the data, i.e., the data after the cutoffs as used in Milestone 1.

## Building various models

### **Popularity-Based Recommendation Systems**

Let's compute the average play count and the play frequency of each song, and build the popularity-based recommendation system on the average play count.

In [None]:
# Calculating average play_count
       # Hint: Use groupby function on the song_id column

# Calculating the frequency a song is played
      # Hint: Use groupby function on the song_id column

In [None]:
# Making a dataframe with the average_count and play_freq

# Let us see the first five records of the final_play dataset


Now, let's create a function to find the top n songs for a recommendation based on the average play count of song. We can also add a threshold for a minimum number of playcounts for a song to be considered for recommendation.
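
As a rough sketch of the idea described above, the function below ranks songs by average play count after applying a minimum-frequency threshold. The names `final_play`, `average_count` and `play_freq` follow the comments in this notebook; the function name `top_n_songs`, the parameter `min_interactions`, and the toy data are assumptions made for illustration.

```python
import pandas as pd


def top_n_songs(final_play, n, min_interactions=100):
    """Return the ids of the n songs with the highest average play count."""
    # Keep only songs that were played at least `min_interactions` times
    eligible = final_play[final_play["play_freq"] >= min_interactions]
    # Rank by average play count, highest first
    ranked = eligible.sort_values(by="average_count", ascending=False)
    return ranked.index[:n].tolist()


# Toy data, made up for illustration only
toy = pd.DataFrame({
    "song_id": [1, 1, 1, 2, 2, 3],
    "play_count": [1, 2, 3, 5, 5, 4],
})
average_count = toy.groupby("song_id")["play_count"].mean()
play_freq = toy.groupby("song_id")["play_count"].count()
final_play = pd.DataFrame({"average_count": average_count, "play_freq": play_freq})

# Song 3 has a high average but only one play, so the threshold excludes it
top_two = top_n_songs(final_play, n=2, min_interactions=2)
```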

In [None]:
# Build the function to find top n songs

In [None]:
# Recommend top 10 songs using the function defined above

### **User User Similarity-Based Collaborative Filtering**

To build the user-user-similarity-based and subsequent models we will use the "surprise" library.

In [None]:
# Install the surprise package using pip. Uncomment and run the below code to do the same

# !pip install surprise

In [None]:
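# Import necessary libraries

# To compute the accuracy of models
from surprise import accuracy

# This class is used to parse a file containing play_counts, data should be in structure - user; item; play_count
from surprise.reader import Reader

# Class for loading datasets
from surprise.dataset import Dataset

# For tuning model hyperparameters
from surprise.model_selection import GridSearchCV

# For splitting the data in train and test dataset
from surprise.model_selection import train_test_split

# For implementing similarity-based recommendation system
from surprise.prediction_algorithms.knns import KNNBasic

# For implementing matrix factorization based recommendation system
from surprise.prediction_algorithms.matrix_factorization import SVD

# For implementing KFold cross-validation
from surprise.model_selection import KFold

# For implementing clustering-based recommendation system
from surprise import CoClustering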


### Some useful functions

Below is the function to calculate precision@k and recall@k, RMSE, and F1_Score@k to evaluate the model performance.

**Think About It:** Which metric should be used for this problem to compare different models?

In [None]:
def precision_recall_at_k(model, k=30, threshold=1.5):
    """Print RMSE and the mean precision@k, recall@k and F_1 score over all users (uses the global testset)"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)

    #Making predictions on the test data
    predictions = model.test(testset)

    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, playing_count in user_est_true.items():

        # Sort play count by estimated value
        playing_count.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in playing_count)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in playing_count[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold))
                              for (est, true_r) in playing_count[:k])

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set Precision to 0 when n_rec_k is 0.

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set Recall to 0 when n_rel is 0.

        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    # Mean of the per-user precisions
    precision = round((sum(prec for prec in precisions.values()) / len(precisions)), 3)
    # Mean of the per-user recalls
    recall = round((sum(rec for rec in recalls.values()) / len(recalls)), 3)

    accuracy.rmse(predictions)
    print('Precision: ', precision)  # Overall precision
    print('Recall: ', recall)  # Overall recall
    # F_1 score is the harmonic mean of precision and recall; report 0 when both are 0
    f1_score = round((2 * precision * recall) / (precision + recall), 3) if (precision + recall) != 0 else 0
    print('F_1 score: ', f1_score)

**Think About It:** In the function precision_recall_at_k above, the threshold value used is 1.5. How are precision and recall affected by changing the threshold? What is the intuition behind using a threshold value of 1.5?
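
To make the threshold question concrete, here is a small sketch that applies the same per-user counting logic as precision_recall_at_k to a single made-up user. The function name `precision_recall_for_user` and the toy (estimated, true) pairs are assumptions for illustration; the numbers show one example only and are not a general rule.

```python
def precision_recall_for_user(est_true_pairs, k, threshold):
    """Precision@k and recall@k for one user's (estimated, true) play_count pairs."""
    # Sort by estimated play_count, highest first
    ranked = sorted(est_true_pairs, key=lambda pair: pair[0], reverse=True)
    # Relevant items: true play_count at or above the threshold
    n_rel = sum(true_r >= threshold for _, true_r in ranked)
    # Recommended items: estimated play_count at or above the threshold, within the top k
    n_rec_k = sum(est >= threshold for est, _ in ranked[:k])
    n_rel_and_rec_k = sum(
        (true_r >= threshold) and (est >= threshold) for est, true_r in ranked[:k]
    )
    precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
    recall = n_rel_and_rec_k / n_rel if n_rel else 0
    return precision, recall


# Made-up (estimated, true) play counts for one user
pairs = [(2.4, 3), (1.8, 1), (1.6, 2), (1.2, 1), (1.1, 2)]

low = precision_recall_for_user(pairs, k=3, threshold=1.5)
high = precision_recall_for_user(pairs, k=3, threshold=2.0)
```

In this toy case, raising the threshold from 1.5 to 2.0 leaves fewer items counted as recommended, so precision rises while recall falls.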

Below we are loading the **dataset**, which is a **pandas dataframe**, into a **different format called `surprise.dataset.DatasetAutoFolds`** which is required by this library. To do this we will be **using the classes `Reader` and `Dataset`**

You will also notice that we read the dataset by providing a rating scale. However, we do not have ratings data for the songs. In this case, we use play_count as a proxy for ratings, on the assumption that the more a user listens to a song, the higher the chance that they like it.

In [None]:
# Instantiating Reader scale with expected rating scale
 #use rating scale (0, 5)

# Loading the dataset
 # Take only "user_id","song_id", and "play_count"

# Splitting the data into train and test dataset
 # Take test_size = 0.4, random_state = 42

**Think About It:** How would changing the test size change the results and outputs?

In [None]:
# Build the default user-user-similarity model


# KNN algorithm is used to find desired similar items
 # Use random_state = 1

# Train the algorithm on the trainset, and predict play_count for the testset


# Let us compute precision@k, recall@k, and f_1 score with k = 30
 # Use sim_user_user model

**Observations and Insights:_________**

In [None]:
# Predicting play_count for a sample user with a listened song
# Use any user id  and song_id

In [None]:
# Predicting play_count for a sample user with a song not-listened by the user
 #predict play_count for any sample user

**Observations and Insights:_________**

Now, let's try to tune the model and see if we can improve the model performance.

In [None]:
# Setting up parameter grid to tune the hyperparameters


# Performing 3-fold cross-validation to tune the hyperparameters

# Fitting the data
 # Use entire data for GridSearch

# Best RMSE score

# Combination of parameters that gave the best RMSE score


In [None]:
# Train the best model found in above gridsearch


**Observations and Insights:_________**

In [None]:
# Predict the play count for a user who has listened to the song. Take user_id 6958, song_id 1671 and r_ui = 2


In [None]:
# Predict the play count for a song that is not listened to by the user (with user_id 6958)


**Observations and Insights:______________**

**Think About It:** Along with making predictions on listened and unknown songs can we get 5 nearest neighbors (most similar) to a certain song?

In [None]:
# Use inner id 0


Below we will be implementing a function where the input parameters are:

- data: A **song** dataset
- user_id: A user-id **against which we want the recommendations**
- top_n: The **number of songs we want to recommend**
- algo: The algorithm we want to use **for predicting the play_count**
- The output of the function is a **set of top_n items** recommended for the given user_id based on the given algorithm
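
One possible shape for this function is sketched below. It assumes only that `algo` exposes `predict(user_id, song_id)` and that the result has an `.est` attribute, as the prediction objects returned by surprise algorithms do. The `MeanPlayCountAlgo` class and the toy data are hypothetical stand-ins so that the sketch runs without a trained model.

```python
from collections import namedtuple

import pandas as pd


def get_recommendations(data, user_id, top_n, algo):
    """Return the top_n (song_id, predicted play_count) pairs for user_id."""
    # User-item interaction matrix: rows are users, columns are songs
    interactions = data.pivot_table(index="user_id", columns="song_id", values="play_count")
    # Songs the user has not played appear as NaN in the user's row
    user_row = interactions.loc[user_id]
    unplayed = user_row[user_row.isnull()].index.tolist()

    recommendations = []
    for song_id in unplayed:
        est = algo.predict(user_id, song_id).est
        recommendations.append((song_id, est))

    # Highest predicted play_count first
    recommendations.sort(key=lambda pair: pair[1], reverse=True)
    return recommendations[:top_n]


# Hypothetical stand-in for a trained model: predicts each song's mean play_count
Prediction = namedtuple("Prediction", ["uid", "iid", "est"])


class MeanPlayCountAlgo:
    def __init__(self, data):
        self.song_means = data.groupby("song_id")["play_count"].mean()

    def predict(self, uid, iid):
        return Prediction(uid, iid, float(self.song_means[iid]))


# Toy interaction data, made up for illustration only
toy = pd.DataFrame({
    "user_id": [0, 0, 1, 1, 1, 2, 2],
    "song_id": [10, 11, 10, 12, 13, 11, 13],
    "play_count": [1, 2, 3, 5, 2, 4, 4],
})
recs = get_recommendations(toy, user_id=0, top_n=2, algo=MeanPlayCountAlgo(toy))
```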

In [None]:
def get_recommendations(data, user_id, top_n, algo):

    # Creating an empty list to store the recommended song ids


    # Creating a user-item interactions matrix


    # Extracting those song ids which the user_id has not played yet

    # Looping through each of the song ids which user_id has not interacted yet


        # Predicting the play_count for those non-played song ids by this user


        # Appending the predicted play_counts

    # Sorting the predicted play_counts in descending order


    return # Returning top n highest predicted play_count songs for this user

In [None]:
# Make top 5 recommendations for any user_id with a similarity-based recommendation engine


In [None]:
# Building the dataframe for above recommendations with columns "song_id" and "predicted_play_count"


**Observations and Insights:______________**

### Correcting the play_counts and Ranking the above songs

In [None]:
def ranking_songs(recommendations, playing_count):
  # Sort the songs based on play counts

  # Merge with the recommended songs to get predicted play_counts

  # Rank the songs based on corrected play_counts

  # Sort the songs based on corrected play_counts

  return

**Think About It:** In the above function to correct the predicted play_count a quantity 1/np.sqrt(n) is subtracted. What is the intuition behind it? Is it also possible to add this quantity instead of subtracting?
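
A minimal sketch of the correction mentioned above follows. It assumes the second argument is the `final_play` frame indexed by song_id with a `play_freq` column, and that recommendations arrive as (song_id, predicted_play_count) pairs; the column name `corrected_play_count` and the toy numbers are illustrative assumptions. One common reading of the subtraction is that it penalizes songs whose prediction rests on few plays.

```python
import numpy as np
import pandas as pd


def ranking_songs(recommendations, playing_count):
    """Rank recommended songs by predicted play_count minus 1/sqrt(play_freq)."""
    ranked = pd.DataFrame(recommendations, columns=["song_id", "predicted_play_count"])
    # Look up how often each recommended song was played
    ranked["play_freq"] = ranked["song_id"].map(playing_count["play_freq"])
    # Songs backed by few plays receive a larger penalty
    ranked["corrected_play_count"] = (
        ranked["predicted_play_count"] - 1 / np.sqrt(ranked["play_freq"])
    )
    return ranked.sort_values("corrected_play_count", ascending=False).reset_index(drop=True)


# Toy data, made up for illustration only
final_play = pd.DataFrame(
    {"average_count": [2.0, 1.5], "play_freq": [4, 100]},
    index=pd.Index([12, 13], name="song_id"),
)
ranked = ranking_songs([(12, 3.0), (13, 2.8)], final_play)
```

In this toy case, song 12 has the higher raw prediction but only 4 plays, so after the correction it ranks below song 13.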

In [None]:
# Applying the ranking_songs function on the final_play data


**Observations and Insights:______________**

### Item Item Similarity-based collaborative filtering recommendation systems

In [None]:
# Apply the item-item similarity collaborative filtering model with random_state = 1 and evaluate the model performance


**Observations and Insights:______________**

In [None]:
# Predicting play count for a sample user_id 6958 and song (with song_id 1671) listened to by the user


In [None]:
# Predict the play count for a user that has not listened to the song (with song_id 1671)

**Observations and Insights:______________**

In [None]:
# Apply grid search for enhancing model performance

# Setting up parameter grid to tune the hyperparameters


# Performing 3-fold cross-validation to tune the hyperparameters

# Fitting the data


# Find the best RMSE score

# Extract the combination of parameters that gave the best RMSE score


**Think About It:** How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the list of hyperparameters [here](https://surprise.readthedocs.io/en/stable/knn_inspired.html).

In [None]:
# Apply the best model found in the grid search


**Observations and Insights:______________**

In [None]:
# Predict the play_count by a user(user_id 6958) for the song (song_id 1671)


In [None]:
# Predicting play count for a sample user_id 6958 with song_id 3232 which is not listened to by the user


**Observations and Insights:______________**

In [None]:
# Find five most similar items to the item with inner id 0


In [None]:
# Making top 5 recommendations for any user_id  with item_item_similarity-based recommendation engine


In [None]:
# Building the dataframe for above recommendations with columns "song_id" and "predicted_play_count"


In [None]:
# Applying the ranking_songs function


**Observations and Insights:_________**

### Model Based Collaborative Filtering - Matrix Factorization

Model-based collaborative filtering is a **personalized recommendation system**: the recommendations are based on the past behavior of the user and do not depend on any additional information. We use **latent features** to find recommendations for each user.
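
To give a sense of what latent features mean here, the SVD algorithm in the surprise library, as described in its documentation, estimates the value for user $u$ and item $i$ as shown below. The symbols are not defined elsewhere in this notebook and are introduced only for illustration: $\mu$ is the global mean play_count, $b_u$ and $b_i$ are user and song biases, and $p_u$, $q_i$ are the learned latent factor vectors for the user and the song.

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T} p_u
```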

In [None]:
# Build baseline model using svd


In [None]:
# Making prediction for user (with user_id 6958) to song (with song_id 1671), take r_ui = 2


In [None]:
# Making a prediction for the user who has not listened to the song (song_id 3232)


#### Improving matrix factorization based recommendation system by tuning its hyperparameters

In [None]:
# Set the parameter space to tune


# Perform 3-fold grid-search cross-validation


# Fitting data

# Best RMSE score

# Combination of parameters that gave the best RMSE score


**Think About It**: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the available hyperparameters [here](https://surprise.readthedocs.io/en/stable/matrix_factorization.html).

In [None]:
# Building the optimized SVD model using optimal hyperparameters


**Observations and Insights:_________**

In [None]:
# Using svd_algo_optimized model to recommend for userId 6958 and song_id 1671


In [None]:
# Using svd_algo_optimized model to recommend for userId 6958 and song_id 3232 with unknown baseline play_count


**Observations and Insights:_________**

In [None]:
# Getting top 5 recommendations for user_id 6958 using "svd_optimized" algorithm


In [None]:
# Ranking songs based on above recommendations

**Observations and Insights:_________**

### Cluster Based Recommendation System

In **clustering-based recommendation systems**, we explore the **similarities and differences** in people's tastes in songs based on how they rate different songs. We cluster similar users together and recommend songs to a user based on play_counts from other users in the same cluster.

In [None]:
# Make baseline clustering model


In [None]:
# Making prediction for user_id 6958 and song_id 1671


In [None]:
# Making prediction for user (userid 6958) for a song(song_id 3232) not listened to by the user


#### Improving clustering-based recommendation system by tuning its hyper-parameters

In [None]:
# Set the parameter space to tune


# Performing 3-fold grid search cross-validation

# Fitting data

# Best RMSE score

# Combination of parameters that gave the best RMSE score


**Think About It**: How do the parameters affect the performance of the model? Can we improve the performance of the model further? Check the available hyperparameters [here](https://surprise.readthedocs.io/en/stable/co_clustering.html).

In [None]:
# Train the tuned Coclustering algorithm


**Observations and Insights:_________**

In [None]:
# Using co_clustering_optimized model to recommend for userId 6958 and song_id 1671


In [None]:
# Use Co_clustering based optimized model to recommend for userId 6958 and song_id 3232 with unknown baseline play_count


**Observations and Insights:_________**

#### Implementing the recommendation algorithm based on optimized CoClustering model

In [None]:
# Getting top 5 recommendations for user_id 6958 using "Co-clustering based optimized" algorithm


### Correcting the play_count and Ranking the above songs

In [None]:
# Ranking songs based on the above recommendations


**Observations and Insights:_________**

### Content Based Recommendation Systems

**Think About It:** So far we have only used the play_count of songs to find recommendations but we have other information/features on songs as well. Can we take those song features into account?

In [None]:
# Concatenate the "title", "release", "artist_name" columns to create a different column named "text"

In [None]:
# Select the columns 'user_id', 'song_id', 'play_count', 'title', 'text' from df_small data

# Drop the duplicates from the title column

# Set the title column as the index

# See the first 5 records of the df_small dataset


In [None]:
# Create the series of indices from the data


In [None]:
# Importing necessary packages to work with text data
import nltk

# Download punkt library


# Download stopwords library


# Download wordnet


# Import regular expression


# Import word_tokenizer


# Import WordNetLemmatizer

# Import stopwords


# Import CountVectorizer and TfidfVectorizer


We will create a **function to pre-process the text data:**

In [None]:
# Create a function to tokenize the text

In [None]:
# Create tfidf vectorizer

# Fit_transform the above vectorizer on the text column and then convert the output into an array


In [None]:
# Compute the cosine similarity for the tfidf above output


 Finally, let's create a function to find most similar songs to recommend for a given song.
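
The sketch below shows one way such a function could look, using a handful of songs taken from the merged data shown earlier. It is an illustration under assumptions: it relies on the default tokenizer of `TfidfVectorizer` rather than an nltk-based tokenizer, the `top_n` parameter is an addition to the stub signature, and the toy frame stands in for the real data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A few songs from the merged data; "text" joins title, release and artist_name
toy = pd.DataFrame({
    "title": ["Learn To Fly", "Stacked Actors", "Constellations", "The Cove"],
    "text": [
        "Learn To Fly There Is Nothing Left To Lose Foo Fighters",
        "Stacked Actors There Is Nothing Left To Lose Foo Fighters",
        "Constellations In Between Dreams Jack Johnson",
        "The Cove Thicker Than Water Jack Johnson",
    ],
}).set_index("title")

# TF-IDF features and pairwise cosine similarity between all songs
tfidf = TfidfVectorizer()
song_tfidf = tfidf.fit_transform(toy["text"]).toarray()
similar_songs = cosine_similarity(song_tfidf, song_tfidf)

# Series mapping row position to song title
indices = pd.Series(toy.index)


def recommendations(title, similar_songs, top_n=10):
    """Return up to top_n titles most similar to the given title."""
    # Row position of the song that matches the title
    idx = indices[indices == title].index[0]
    # Similarity scores against every song, highest first
    scores = pd.Series(similar_songs[idx]).sort_values(ascending=False)
    # Skip the song itself and keep the best matches
    top_idx = [i for i in scores.index if i != idx][:top_n]
    return [indices[i] for i in top_idx]


similar_to_learn = recommendations("Learn To Fly", similar_songs, top_n=2)
```

Because "Stacked Actors" shares its release and artist with "Learn To Fly", it comes out as the closest match in this toy example.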

In [None]:
# Function that takes in song title as input and returns the top 10 recommended songs
def recommendations(title, similar_songs):



    # Getting the index of the song that matches the title


    # Creating a Series with the similarity scores in descending order


    # Getting the indexes of the 10 most similar songs


    # Populating the list with the titles of the best 10 matching songs


    return

Recommending 10 songs similar to Learn to Fly

In [None]:
# Make the recommendation for the song with title 'Learn To Fly'


**Observations and Insights:_________**

## **Conclusion and Recommendations**

**1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success)**:
- How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

**2. Refined insights**:
- What are the most meaningful insights from the data relevant to the problem?

**3. Proposal for the final solution design:**
- What model do you propose to be adopted? Why is this the best solution to adopt?