# Music recommender system

Recommendation systems are among the most widely used applications of machine learning. A **recommender** (or recommendation) **system** (or engine) is a filtering system whose aim is to predict the rating or preference a user would give to an item, e.g. a film, a product or a song.

Which types of recommender systems are there?

There are two main types of recommender systems: 
- Content-based filters
- Collaborative filters
  
> Content-based filters predict what a user likes based on what that particular user has liked in the past. Collaborative filters, on the other hand, predict what a user likes based on what other users, who are similar to that particular user, have liked.

We have previously developed a content-based recommendation system. Now, we'll look into collaborative filtering. 

### 2) Collaborative filters

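Collaborative filters work with an interaction matrix, also called a rating matrix, in which each row is a user, each column is an item and each cell records how the user interacted with the item (here, a listen count). The aim of the algorithm is to learn a function that predicts whether a user will benefit from an item, meaning the user is likely to buy, listen to or watch it.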
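
As a minimal sketch of what an interaction matrix looks like, the example below builds one from a tiny, made-up listening history. The data and the variable names `toy`, `interaction` and `sparse_interaction` are assumptions for illustration only; the column names mirror the ones used later in this notebook.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Toy listening history (hypothetical data, same columns as the triplets file)
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3"],
    "song_id": ["s1", "s2", "s2", "s1", "s3"],
    "listen_count": [3, 1, 5, 2, 4],
})

# Rows are users, columns are songs, cells hold listen counts (0 = no interaction)
interaction = toy.pivot_table(index="user_id", columns="song_id",
                              values="listen_count", aggfunc="sum", fill_value=0)

# Most cells are zero in real data, so a sparse format saves memory
sparse_interaction = csr_matrix(interaction.to_numpy())

print(interaction)
print(f"Stored non-zero entries: {sparse_interaction.nnz}")
```

With real data the matrix is mostly empty, which is why a sparse representation such as `csr_matrix` is usually preferred over a dense array.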

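Collaborative filters come in two types: **user-item** filtering and **item-item** filtering. User-item filtering looks for users similar to the target user and recommends what they liked, while item-item filtering looks for items similar to the ones the user has already liked.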
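
The difference between the two types can be sketched with cosine similarity on a toy matrix: comparing rows gives user-to-user similarity, comparing columns gives item-to-item similarity. The matrix `R` and the helper `cosine_similarity_rows` are hypothetical and only meant to illustrate the idea; cosine similarity is one possible choice of similarity measure among several.

```python
import numpy as np

# Toy interaction matrix (hypothetical): 3 users (rows) x 4 songs (columns)
R = np.array([
    [5, 0, 3, 0],
    [4, 0, 0, 1],
    [0, 2, 0, 5],
], dtype=float)

def cosine_similarity_rows(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    unit = M / norms
    return unit @ unit.T

user_sim = cosine_similarity_rows(R)    # compares users: 3 x 3
item_sim = cosine_similarity_rows(R.T)  # compares songs: 4 x 4

print(np.round(user_sim, 2))
print(np.round(item_sim, 2))
```

In this toy example the first two users share a song and therefore get a positive similarity, while the first and third users have nothing in common and get a similarity of zero.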

*What algorithms do collaborative filters use to recommend new songs?* There are several machine learning algorithms that can be used in the case of collaborative filtering.

### Importing required libraries

First, we'll import all the required libraries.

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
from scipy.sparse import csr_matrix

In [None]:
import Recommenders as Recommenders

### Reading the files

We are going to use the Million Song Dataset, a freely available collection of audio features and metadata for a million contemporary popular music tracks.

Two files are of interest to us. The first contains the listening history: user ID, song ID and listen count. The second contains the song metadata: song ID, title, release, artist name and year.
We need to merge these two DataFrames, and we'll use the `song_id` column as the key.
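
The merge used in this notebook drops duplicated `song_id` rows from the metadata before joining. The sketch below, on hypothetical miniature DataFrames (`plays` and `meta` are made-up names and data), shows why that matters: a repeated key on the right side of a left merge multiplies the matching rows on the left.

```python
import pandas as pd

# Hypothetical miniature versions of the two files
plays = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "song_id": ["s1", "s1", "s2"],
    "listen_count": [2, 1, 7],
})
meta = pd.DataFrame({
    "song_id": ["s1", "s1", "s2"],  # s1 appears twice
    "title": ["Song A", "Song A", "Song B"],
})

# Without dropping duplicates, every s1 play is matched twice
naive = pd.merge(plays, meta, on="song_id", how="left")

# Keeping one metadata row per song preserves one output row per play
clean = pd.merge(plays, meta.drop_duplicates(["song_id"]), on="song_id", how="left")

print(len(plays), len(naive), len(clean))  # 3 5 3
```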

In [None]:
song_df_1 = pd.read_csv('triplets_file.csv')
song_df_1.head()

In [None]:
song_df_2 = pd.read_csv('song_data.csv')
song_df_2.head()

In [None]:
# combine both data
songs = pd.merge(song_df_1, song_df_2.drop_duplicates(['song_id']), on='song_id', how='left')
songs.head()

We'll save this dataset to a CSV file so that it is available for any other recommendation system project we want to do.

In [None]:
songs.to_csv('songs.csv', index=False)

We can read this file into a new **DataFrame** that we'll call `df_songs`.

In [None]:
df_songs = pd.read_csv('songs.csv')

## Exploring the data

As with any data science or machine learning project, we start with an exploratory data analysis (EDA). The aim of EDA is to understand our data and gain insights from it.

We'll first inspect the first rows of our `DataFrame`.

In [None]:
df_songs.head()

Then, we'll check how many observations there are in the dataset.

In [None]:
#Get total observations
print(f"There are {df_songs.shape[0]} observations in the dataset")

Normally, we would now perform some cleaning steps. Looking at the dataset, however, we can see that there are no missing values.

In [None]:
df_songs.isnull().sum()

And most of the columns contain strings.

In [None]:
df_songs.dtypes

Let's start exploring some characteristics of the dataset: 

- Unique songs:

In [None]:
#Unique songs
unique_songs = df_songs['title'].unique().shape[0]
print(f"There are {unique_songs} unique songs in the dataset")

- Unique artists:

In [None]:
#Unique artists
unique_artists = df_songs['artist_name'].unique().shape[0]
print(f"There are {unique_artists} unique artists in the dataset")

- Unique users:

In [None]:
#Unique users
unique_users = df_songs['user_id'].unique().shape[0]
print(f"There are {unique_users} unique users in the dataset")

We'll go ahead and explore the popularity of songs and artists.

### Most popular songs

How do we determine which songs are the most popular? For this task, we'll count how many times each song appears. Note that although we select the `listen_count` column, we only count the number of rows per song and ignore the value stored in each row. That value represents how many times one user listened to the same song.
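
To make the distinction concrete, here is a small sketch on made-up data (the DataFrame `toy` is hypothetical). Counting rows measures how many user-song pairs exist for a song, whereas summing `listen_count` measures total plays, and the two can rank songs differently.

```python
import pandas as pd

# Hypothetical data: Song A has three listeners, Song B has one heavy listener
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1"],
    "title": ["Song A", "Song A", "Song A", "Song B"],
    "listen_count": [1, 1, 1, 50],
})

by_title = toy.groupby("title")["listen_count"]
rows_per_song = by_title.count()  # number of user-song rows
plays_per_song = by_title.sum()   # total number of plays

print(rows_per_song.idxmax())   # Song A
print(plays_per_song.idxmax())  # Song B
```

Counting rows, as this notebook does, makes the ranking robust to a single user who plays one song many times.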

In [None]:
#count the rows per song and sort so that the most popular songs come first
ten_pop_songs = df_songs.groupby('title')['listen_count'].count().reset_index().sort_values(['listen_count', 'title'], ascending=[False, True])
ten_pop_songs['percentage'] = round(ten_pop_songs['listen_count'].div(ten_pop_songs['listen_count'].sum())*100, 2)

In [None]:
ten_pop_songs = ten_pop_songs[:10]
ten_pop_songs

In [None]:
labels = ten_pop_songs['title'].tolist()
counts = ten_pop_songs['listen_count'].tolist()

In [None]:
plt.figure()
sns.barplot(x=counts, y=labels, palette='Set3')
sns.despine(left=True, bottom=True)

### Most popular artists

For the next task, we'll count how many times each artist appears. As before, we count rows and ignore the value of `listen_count`.

In [None]:
#count the rows per artist and sort so that the most popular artists come first
ten_pop_artists = df_songs.groupby(['artist_name'])['listen_count'].count().reset_index().sort_values(['listen_count', 'artist_name'],
                                                                                                ascending=[False, True])

In [None]:
ten_pop_artists = ten_pop_artists[:10]
ten_pop_artists

In [None]:
plt.figure()
labels = ten_pop_artists['artist_name'].tolist()
counts = ten_pop_artists['listen_count'].tolist()
sns.barplot(x=counts, y=labels, palette='Set2')
sns.despine(left=True, bottom=True)

### Listen count by user

We can also get some other information from the feature `listen_count`. We will answer the following questions:

**What is the maximum number of times a user listened to the same song?**

In [None]:
listen_counts = pd.DataFrame(df_songs.groupby('listen_count').size(), columns=['count'])

In [None]:
print(f"The maximum number of times a user listened to the same song was: {listen_counts.index.max()}")

**How many times, on average, does a user listen to the same song?**

In [None]:
print(f"On average, a user listens to the same song {df_songs['listen_count'].mean():.2f} times")

We can also check the distribution of `listen_count`:

In [None]:
plt.figure(figsize=(20, 5))
sns.boxplot(x='listen_count', data=df_songs)
sns.despine()

**What are the most frequent listen counts, i.e. how many times does a user typically listen to the same song?**

In [None]:
listen_counts_temp = listen_counts[listen_counts['count'] > 50].reset_index(drop=False)

In [None]:
plt.figure(figsize=(16, 8))
sns.barplot(x='listen_count', y='count', palette='Set3', data=listen_counts_temp)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.show();

**How many songs does a user listen to on average?**

In [None]:
song_user = df_songs.groupby('user_id')['song_id'].count()

In [None]:
plt.figure(figsize=(16, 8))
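# sns.distplot is deprecated; histplot with a KDE overlay gives an equivalent view
sns.histplot(song_user.values, kde=True, stat='density', color='orange')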
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.show();

In [None]:
print(f"A user listens to an average of {np.mean(song_user)} songs")

In [None]:
print(f"A user listens to a median of {np.median(song_user)} songs, with minimum {np.min(song_user)} and maximum {np.max(song_user)} songs")

## Data Preprocessing

In [None]:
# creating new feature combining title and artist name
df_songs['song'] = df_songs['title']+' - '+df_songs['artist_name']
df_songs.head()

To keep the computation fast, we'll work with a short sample of the data.

In [None]:
# taking the first 10,000 rows for quick results
df_songs = df_songs.head(10000)

In [None]:
# number of listening records (rows) per song
song_grouped = df_songs.groupby(['song']).agg({'listen_count':'count'}).reset_index()
song_grouped.head()

In [None]:
grouped_sum = song_grouped['listen_count'].sum()
song_grouped['percentage'] = (song_grouped['listen_count'] / grouped_sum ) * 100
song_grouped.sort_values(['listen_count', 'song'], ascending=[False, True])

## Popularity Based Recommendation

In [None]:
pr = Recommenders.popularity_recommender_py()

In [None]:
pr.create(df_songs, 'user_id', 'song')

In [None]:
# display the top 10 popular songs
pr.recommend(df_songs['user_id'][5])

In [None]:
pr.recommend(df_songs['user_id'][100])
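
The code of the `Recommenders` module is not shown in this notebook, so the following is only a sketch of what a popularity-based recommender typically does, under the assumption that songs are ranked by their number of listening records. The function `popularity_recommend` and the toy data are hypothetical and are not the module's implementation.

```python
import pandas as pd

def popularity_recommend(df, user_col, item_col, top_n=10):
    """Rank items by how many rows (listening records) they have.

    Hypothetical helper for illustration, not the Recommenders implementation.
    """
    scores = (df.groupby(item_col)[user_col]
                .count()
                .reset_index(name="score")
                .sort_values(["score", item_col], ascending=[False, True]))
    scores["rank"] = range(1, len(scores) + 1)
    return scores.head(top_n).reset_index(drop=True)

# Made-up listening records
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3", "u1", "u2", "u3"],
    "song": ["A", "A", "A", "B", "B", "C"],
})

top_songs = popularity_recommend(toy, "user_id", "song", top_n=2)
print(top_songs)
```

A pure popularity ranking like this one does not depend on the user, so every user would receive the same list.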

## Item Similarity Based Recommendation

In [None]:
ir = Recommenders.item_similarity_recommender_py()
ir.create(df_songs, 'user_id', 'song')

In [None]:
user_items = ir.get_user_items(df_songs['user_id'][5])

In [None]:
# display user songs history
for user_item in user_items:
    print(user_item)

In [None]:
# give song recommendation for that user
ir.recommend(df_songs['user_id'][5])

In [None]:
# find songs similar to the given list of songs
ir.get_similar_items(['Oliver James - Fleet Foxes', 'The End - Pearl Jam'])
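
How the `Recommenders` module measures similarity is not shown here. One common choice for item-item filtering on implicit data is the Jaccard index between the sets of users who listened to each song; the sketch below assumes that measure. The helpers `jaccard` and `similar_songs` and the toy data are hypothetical, and the actual module may work differently.

```python
import pandas as pd

# Made-up listening records
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2", "u3", "u3", "u4"],
    "song": ["A", "B", "A", "B", "A", "C", "C"],
})

# Set of listeners for every song
listeners = toy.groupby("song")["user_id"].apply(set)

def jaccard(song_x, song_y):
    """Shared listeners divided by all listeners of either song."""
    x, y = listeners[song_x], listeners[song_y]
    return len(x & y) / len(x | y)

def similar_songs(song, top_n=5):
    """Other songs sorted by decreasing Jaccard similarity to `song`."""
    scores = {other: jaccard(song, other)
              for other in listeners.index if other != song}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(similar_songs("A"))
```

Here song B shares two of song A's three listeners, so it ranks above song C, which shares only one listener out of four.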