# Preparing the development of a music recommender system

In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Data Cleaning

##### users.csv

**Task**: Import the *users.csv* file.

In [23]:
df_users = pd.read_csv("users.csv", delimiter=";")

**Task**: Rename the columns according to the description in the exercise sheet into a more readable format.

In [None]:
# Rename the raw column codes to descriptive names in one call.
df_users.rename(columns={'uid': 'user_id', 'p': 'Premium', 'm1': 'Minutes1',
                         'm2': 'Minutes2', 'm3': 'Minutes3'}, inplace=True)
df_users.head(10)

**Task**: Unify the labels for the *Premium* attribute.

In [None]:
# Map the mixed labels to booleans; any label not listed here becomes NaN.
df_users['Premium'] = df_users['Premium'].map({'Yes': True, 'No': False, '1': True, '0': False})
df_users
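
A fixed dictionary only covers the exact spellings it lists, so a label such as `yes` or an integer `1` would silently turn into a missing value. The following is a minimal sketch of a more defensive normalisation on made-up labels; `normalize_premium` is a hypothetical helper and not part of the exercise sheet.

```python
import pandas as pd

def normalize_premium(labels: pd.Series) -> pd.Series:
    """Map mixed yes/no style labels to nullable booleans (hypothetical helper)."""
    mapping = {"yes": True, "no": False, "1": True, "0": False}
    # Convert every non-missing label to a trimmed, lower-case string so that
    # integer 1/0 and differently cased labels are treated alike.
    cleaned = labels.map(lambda v: str(v).strip().lower(), na_action="ignore")
    # Labels outside the mapping (and missing labels) end up as <NA>.
    return cleaned.map(mapping).astype("boolean")

# Made-up sample labels, not taken from users.csv.
sample = pd.Series(["Yes", "no", 1, "0", None], dtype=object)
premium = normalize_premium(sample)
```

Comparing `premium.isna().sum()` before and after the mapping is a quick way to spot labels that the dictionary does not cover.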

**Task**: Impute the missing values of the attribute *Minutes2* using the values of *Minutes1* and *Minutes3*.

In [None]:
df_users.info()
# Inspect how far Minutes1 and Minutes3 lie apart before averaging them for the imputation.
df_users['differenceCol'] = df_users['Minutes3'] - df_users['Minutes1']
df_users.plot(x='user_id', y='differenceCol', style='o')

In [None]:
df_users['Minutes2'] = df_users['Minutes2'].fillna((df_users['Minutes1'] + df_users['Minutes3']) / 2)
df_users
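
Filling *Minutes2* with the mean of its two neighbours amounts to a linear interpolation between *Minutes1* and *Minutes3*. A small sketch on made-up numbers (the values are illustrative, not from *users.csv*) shows that only the missing entries change:

```python
import numpy as np
import pandas as pd

# Made-up listening minutes; NaN marks a missing Minutes2 value.
toy = pd.DataFrame({
    "Minutes1": [100.0, 40.0, 10.0],
    "Minutes2": [110.0, np.nan, np.nan],
    "Minutes3": [120.0, 60.0, 30.0],
})
# Row-wise mean of the two neighbouring columns.
neighbour_mean = toy[["Minutes1", "Minutes3"]].mean(axis=1)
# fillna only touches rows where Minutes2 is missing; observed values stay as they are.
toy["Minutes2"] = toy["Minutes2"].fillna(neighbour_mean)
```

Note that `mean(axis=1)` skips missing values, so a row with only one neighbour present would be filled with that neighbour, whereas `(Minutes1 + Minutes3) / 2` would leave it missing.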

##### user_behavior.csv

**Task**: Read the *user_behavior.csv* file.

In [None]:
user_behavior = pd.read_csv("user_behavior.csv", delimiter=";")
user_behavior

**Task**: Rename the columns according to the description in the exercise sheet.

In [None]:
user_behavior.rename(columns={'ml': 'minListened', 'g': 'genre', 'f': 'liked',
                              'mod': 'reviewDate'}, inplace=True)
user_behavior


**Task:** Fix the data types of the attributes *Genre* (categorical) and *Favorite* (binary, categorical).

In [None]:
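user_behavior['genre'] = user_behavior['genre'].astype('category')
# astype('bool') is only safe if 'liked' holds 0/1 or booleans; non-empty string labels would all become True.
user_behavior['liked'] = user_behavior['liked'].astype('bool')
user_behavior.info()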
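
Whether `astype('bool')` gives the intended result depends on how the favourite flag is encoded in the file, which this notebook does not show. The sketch below uses made-up string labels to illustrate the pitfall and an explicit mapping as one possible alternative:

```python
import pandas as pd

# Made-up labels stored as Python strings; the real encoding may differ.
raw = pd.Series(["True", "False", "False"], dtype=object)

# Naive cast: every non-empty string is truthy, so "False" turns into True.
naive = raw.astype("bool")

# Explicit mapping: each label is translated to the boolean it stands for.
explicit = raw.map({"True": True, "False": False}).astype("bool")
```

If the column already contains 0/1 integers or booleans, the plain cast is fine.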

**Task:** Some genres have more songs than others. Adjust the data set such that it includes only the four largest genres and the genre "Other" that summarizes all remaining genres.

In [None]:
genreCount = user_behavior['genre'].value_counts()
genreCount.plot(kind='bar')

In [None]:
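# Keep the four genres treated as largest here; every other genre maps to NaN and is then filled with 'Other'.
user_behavior['genre'] = user_behavior['genre'].map({'Electronic': 'Electronic', 'Rock': 'Rock', 'Hip-Hop': 'Hip-Hop', 'Pop': 'Pop'}).fillna('Other')
user_behavior['genre'] = user_behavior['genre'].astype('category')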

genreCount = user_behavior['genre'].value_counts()
genreCount.plot(kind='bar')
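
Instead of typing the largest genres by hand, they can also be derived from the counts. This is a sketch on made-up genre labels; `top_n` and the variable names are illustrative choices:

```python
import pandas as pd

# Made-up genre labels, not taken from user_behavior.csv.
genres = pd.Series(
    ["Rock"] * 5 + ["Pop"] * 4 + ["Jazz"] * 3 + ["Folk"] * 2 + ["Punk", "Soul"],
    dtype=object,
)
top_n = 4
# Labels of the top_n most frequent genres.
top_genres = genres.value_counts().nlargest(top_n).index
# Keep values that are in the top set and replace all others with "Other".
collapsed = genres.where(genres.isin(top_genres), "Other").astype("category")
counts = collapsed.value_counts()
```

If the column is already categorical, converting it back to plain strings first avoids an error, because "Other" is not yet one of its categories.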

**Task:** Create new columns for the weekday, year, month, and day of each date in *ModifiedAt*.

In [None]:
user_behavior

In [None]:
user_behavior['reviewDate'] = pd.to_datetime(user_behavior['reviewDate'], format="%Y-%m-%d")
user_behavior['weekday'] = user_behavior['reviewDate'].dt.day_name()
user_behavior['Year'] = user_behavior['reviewDate'].dt.year
user_behavior['Month'] = user_behavior['reviewDate'].dt.month
user_behavior['Day'] = user_behavior['reviewDate'].dt.day

user_behavior
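
The `.dt` accessor only works once the column has a datetime type. A small sketch on made-up dates shows the extraction and, as an optional extra, how `errors="coerce"` turns an unparseable entry into `NaT` instead of raising:

```python
import pandas as pd

# Made-up date strings; the second one is deliberately malformed.
raw_dates = pd.Series(["2024-01-15", "not a date"], dtype=object)
dates = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# One column per date component; rows with NaT yield missing values.
parts = pd.DataFrame({
    "weekday": dates.dt.day_name(),
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    "Day": dates.dt.day,
})
```

Without `errors="coerce"`, the malformed entry would raise an exception, which can be the preferable behaviour when bad dates should not pass unnoticed.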

##### artists.csv

**Task**: Read the *artists.csv* file and rename the columns according to the exercise sheet.

In [None]:
artists = pd.read_csv("artists.csv", delimiter=";")

artists.rename(columns={"featured" : "Featured"}, inplace=True)
artists

**Task:** Convert the attributes *Genre* and *Featured* to categorical variables.

In [None]:
artists['genre'] = artists['genre'].astype('category')
artists['Featured'] = artists['Featured'].astype('category')
artists.info()

### Data aggregation

**Task:** Merge the *users* and *user_behavior* tables together. Create a view in which you determine how many minutes a user listens to songs on average. Additionally, what is the highest number of clicks a user had on a song?

In [None]:
user_with_behavior = pd.merge(df_users, user_behavior, on='user_id')
# Per user: mean minutes listened and the highest click count on any single song.
user_with_behavior.groupby('user_id').agg(Average=('minListened', 'mean'), MaxClicks=('num_clicks', 'max')).reset_index()
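
The same merge-then-aggregate pattern on made-up tables makes the result easier to check by hand. The column names mirror those used in this notebook, while the `validate` argument and the overall maximum are optional additions:

```python
import pandas as pd

# Made-up tables; user 3 has no listening records.
users = pd.DataFrame({"user_id": [1, 2, 3]})
behavior = pd.DataFrame({
    "user_id": [1, 1, 2],
    "minListened": [10.0, 30.0, 5.0],
    "num_clicks": [3, 7, 2],
})
# validate="one_to_many" raises if user_id is duplicated in the users table.
merged = users.merge(behavior, on="user_id", how="inner", validate="one_to_many")
# Named aggregation: one output column per (source column, function) pair.
per_user = merged.groupby("user_id").agg(
    Average=("minListened", "mean"),
    MaxClicks=("num_clicks", "max"),
).reset_index()
# Highest click count over all users and songs.
overall_max_clicks = merged["num_clicks"].max()
```

With the default inner join, users without any behaviour rows (user 3 here) drop out of the view; `how="left"` would keep them with missing values.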


**Task:** Merge the *user_behavior* and *artists* tables to determine the most clicked artist per genre (defined by the song).

In [None]:
artists_with_behavior = artists.merge(user_behavior, left_on='artist_id', right_on='artists')
artists_with_behavior


Question: Why can't we just use `artists.merge(user_behavior)`?

Answer: 
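
A likely reason, assuming the two tables share only the *genre* column by name: without `on`, `merge` joins on all columns that appear in both frames, and the artist key is called `artist_id` in one table and `artists` in the other. The sketch below reproduces this on made-up data:

```python
import pandas as pd

# Made-up tables that mimic the assumed column layout.
artists_toy = pd.DataFrame({"artist_id": [1, 2], "genre": ["Rock", "Pop"]})
behavior_toy = pd.DataFrame({
    "artists": [1, 2, 2],
    "genre": ["Rock", "Rock", "Pop"],
    "num_clicks": [4, 1, 6],
})
# No keys given: pandas joins on the shared column names, here only "genre".
default_merge = artists_toy.merge(behavior_toy)
# Explicit keys: join on the artist identifier; both "genre" columns are kept with suffixes.
keyed_merge = artists_toy.merge(behavior_toy, left_on="artist_id", right_on="artists")
```

In `default_merge`, a row can pair an artist with a song by a different artist simply because the genres match. If the frames had no column name in common at all, the call would raise an error instead.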

Which is the most clicked artist per genre of the song?

In [None]:
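# Total clicks per (song genre, artist); assumes the merge suffixed the song genre as 'genre_y'.
group = artists_with_behavior.groupby(['genre_y', 'artist_id'], observed=True)['num_clicks'].sum()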
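
One way to get from per-artist click totals to a single winner per genre is `idxmax` within each genre. The sketch uses a made-up merged table and assumes that "most clicked" means the largest sum of `num_clicks` and that the song genre ended up in a column named `genre_y`:

```python
import pandas as pd

# Made-up stand-in for the merged artists/behaviour table.
merged_toy = pd.DataFrame({
    "artist_id": [1, 1, 2, 3, 3],
    "genre_y": ["Rock", "Rock", "Rock", "Pop", "Pop"],
    "num_clicks": [4, 3, 5, 2, 6],
})
# Total clicks per (genre, artist) pair.
clicks = merged_toy.groupby(["genre_y", "artist_id"], as_index=False)["num_clicks"].sum()
# Within each genre, pick the row label with the highest total and select those rows.
top_rows = clicks.loc[clicks.groupby("genre_y")["num_clicks"].idxmax()]
```

`idxmax` returns only the first maximum, so ties between artists within a genre are not reported.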

Answer:

**Task**: Determine, for each artist, the fan who spends the most minutes listening to their music.
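
A possible approach, sketched on made-up data: sum the minutes per artist and user, then keep the user with the largest total for each artist. The column names follow this notebook (`artists`, `user_id`, `minListened`); the values are invented.

```python
import pandas as pd

# Made-up listening records; "artists" holds the artist identifier of each song.
listens = pd.DataFrame({
    "artists": [1, 1, 1, 2, 2],
    "user_id": [10, 11, 10, 10, 12],
    "minListened": [5.0, 12.0, 4.0, 3.0, 9.0],
})
# Total minutes each user spent on each artist.
minutes = listens.groupby(["artists", "user_id"], as_index=False)["minListened"].sum()
# For every artist, select the row of the user with the largest total.
top_fans = minutes.loc[minutes.groupby("artists")["minListened"].idxmax()].reset_index(drop=True)
```

On the real data, the same two steps would run on `user_behavior` directly, since it already contains both the user and the artist identifier.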