# CoderSchool Final Project Genres
## Music Recommendation System

In [None]:
import pandas as pd
import numpy as np

For now we will work with the MasterSongList data. Later on, we will see whether we can use a more detailed dataframe.

Let's start with the genre analysis. Since there are 29 genres after cleaning the data, this is a multiclass problem. Here are the classifiers that we will use, all of which support multiclass classification:
- kNN
- Logistic Regression OVR
- or Logistic Regression multinomial
- SVC OVR (default)
- or SVC OVO

We are going to try both variants for LogReg and SVC but will only keep one of the two.
For all of these models we will use a Pipeline to combine the classifier with SelectKBest (feature selection), tuned with GridSearchCV (parameter optimization).
The resulting models will then be combined using VotingClassifier.
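
To get a feel for the cost of the one-vs-rest (OVR) and one-vs-one (OVO) strategies, the short sketch below counts how many binary classifiers each one trains. It assumes the 29 genres quoted above; `n_binary_classifiers` is a hypothetical helper name, not part of scikit-learn.

```python
from math import comb

def n_binary_classifiers(n_classes, strategy):
    """Number of binary models trained by a one-vs-rest or one-vs-one scheme."""
    if strategy == "ovr":
        # one classifier per class: "this genre" vs. "all other genres"
        return n_classes
    if strategy == "ovo":
        # one classifier per unordered pair of classes
        return comb(n_classes, 2)
    raise ValueError(f"unknown strategy: {strategy!r}")

n_genres = 29  # assumption: the genre count quoted above
print(n_binary_classifiers(n_genres, "ovr"))  # 29
print(n_binary_classifiers(n_genres, "ovo"))  # 406
```

As far as the scikit-learn documentation describes it, `SVC` always trains one-vs-one models internally, and `decision_function_shape` only changes the shape of the decision function output, so that option may not change the predictions.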

# Part 1 - Data cleaning

In [None]:
full_df = pd.read_json('MasterSongList.json')
full_df.head(3)

### Clean the genres

We need to remove the list format

In [None]:
full_df2 = full_df.copy()
full_df2['genres'] = full_df2['genres'].apply(''.join)

We only want to keep the first genre

In [None]:
def split_first_genre(genre):
    if len(genre) > 0:
        return genre.split(':')[0]
    else:
        return genre

full_df2['genres'] = full_df2['genres'].apply(split_first_genre)
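
To illustrate what the two cleaning steps do, here is a small sketch on made-up values. The `'genre:subgenre'` strings are an assumed format used for illustration only; the real contents of `MasterSongList.json` may differ.

```python
def split_first_genre(genre):
    # same logic as the notebook function: keep the text before the first ':'
    return genre.split(':')[0] if genre else genre

# hypothetical raw values: each song carries a list of genre strings
raw_genres = [['rock:indie rock', 'pop:synth pop'], ['jazz'], []]

joined = [''.join(g) for g in raw_genres]       # list -> single string
first = [split_first_genre(g) for g in joined]  # keep the leading genre
print(joined)
print(first)
```

Because `''.join` glues the entries together without a separator, a first entry without a `':'` followed by a second entry would produce a merged label. Taking the first list element before splitting would avoid that, if such rows exist.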

Let's see what genres are available

In [None]:
unique_genres = full_df2['genres'].unique()
unique_genres.sort()

In [None]:
unique_genres.tolist()

### Audio Features

We now only want to keep the audio features and the genre, so let's create a new dataframe: df

In [None]:
features_headers = ['key', 'energy', 'liveliness', 'tempo', 'speechiness', 'acousticness', 'instrumentalness', 'time_signature', 'duration', 'loudness', 'valence', 'danceability', 'mode', 'time_signature_confidence', 'tempo_confidence', 'key_confidence', 'mode_confidence']
features_list = full_df2['audio_features'].tolist()
df = pd.DataFrame(features_list, columns=features_headers)
df['genres'] = full_df2['genres']
df.head()

### NaN rows

Let's remove the songs with no genres

In [None]:
df.shape

In [None]:
df['genres'] = df['genres'].replace('', np.nan)  # mark empty genres as missing
df = df.dropna(subset=['genres'])                # drop songs without a genre
df.shape

Let's have a look at the NaN rows and their distribution among the genres

In [None]:
def checknan(x):
    return np.isnan(x)

In [None]:
genres_df = ['bluegrass', 'blues & blues rock', "children's", 'christian', 'classical', 'country', 'dance', "dubstep & drum 'n' bass", 'easy listening', 'electronica', 'film scores', 'folk', 'funk', 'hawaiian ', 'indie', "int'l", 'international/world', 'jazz', 'latin', 'nature sounds', 'oldies', 'pop', 'r&b', 'rap', 'reggae & ska', 'reggaeton', 'rock', 'showtunes', 'singer-songwriter']

In [None]:
for i in genres_df:
    songs = df[df['genres'] == i]
    genres_nan = songs['speechiness'].apply(checknan)
    print(i)
    print(genres_nan.value_counts())
    print("")

The dataset is quite imbalanced. First, let's:
- drop the NaN rows for genres that have at least 1000 complete rows
- replace the NaN values by the median of the other rows of the same genre when there are fewer than 1000 complete rows
- combine some of the similar genres with a low number of rows: international and Hawaiian, etc.
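
The first two rules can be sketched on a toy dataframe as follows. This is only an illustration: the data is made up, and a threshold of 2 complete rows stands in for the 1000 used in this notebook.

```python
import numpy as np
import pandas as pd

# toy data (assumption): two genres, one feature with missing values
toy = pd.DataFrame({
    'genres': ['rock', 'rock', 'rock', 'folk', 'folk'],
    'tempo':  [120.0, np.nan, 100.0, 90.0, np.nan],
})
threshold = 2  # stands in for the 1000 used in the notebook

parts = []
for genre, songs in toy.groupby('genres'):
    n_complete = songs['tempo'].notna().sum()
    if n_complete >= threshold:
        # enough complete rows: drop the incomplete ones
        parts.append(songs.dropna())
    else:
        # too few complete rows: fill gaps with the genre's own median
        parts.append(songs.fillna(songs.median(numeric_only=True)))

cleaned = pd.concat(parts).sort_index()
print(cleaned)
```

Here `rock` keeps its two complete rows, while the missing `folk` tempo is filled with the genre median.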

In [None]:
df_bal = df.copy()

Let's group all the international songs

In [None]:
df_bal.loc[(df_bal['genres'].str.contains("hawa")), 'genres'] = 'international/world'
df_bal.loc[(df_bal['genres'] == "int'l"), 'genres'] = 'international/world'

Since I am not too sure what 'showtunes' is, let's look at a few samples

In [None]:
full_df2[full_df2['genres'] == 'showtunes'].head(3)

This can be grouped with 'film scores' as 'film/show'

In [None]:
df_bal.loc[(df_bal['genres'] == 'showtunes'), 'genres'] = 'film/show'
df_bal.loc[(df_bal['genres'] == 'film scores'), 'genres'] = 'film/show'

In [None]:
df_bal.head()

In [None]:
df_bal['genres'].value_counts()

Let's differentiate genres that have more/less than 1000 non-NaN rows

In [None]:
new_genres_df = ['bluegrass', 'blues & blues rock', "children's", 'christian', 'classical', 'country', 'dance', "dubstep & drum 'n' bass", 'easy listening', 'electronica', 'film/show', 'folk', 'funk', 'indie', 'international/world', 'jazz', 'latin', 'nature sounds', 'oldies', 'pop', 'r&b', 'rap', 'reggae & ska', 'reggaeton', 'rock', 'singer-songwriter']

In [None]:
large_genres = []
small_genres = []

for i in new_genres_df:
    songs_genre = df_bal[df_bal['genres'] == i]
    songs_genre_nan = songs_genre['speechiness'].apply(checknan)
    if len(songs_genre_nan[songs_genre_nan == False]) >= 1000:
        large_genres.append(i)
    else:
        small_genres.append(i)

print(large_genres)
print(small_genres)

Let's drop NaN on large genres

In [None]:
new_df = pd.DataFrame()

for i in large_genres:
    songs = df_bal[df_bal['genres'] == i]
    new_songs = songs.dropna(axis=0, how='any')
    new_df = pd.concat([new_df, new_songs])

Let's replace NaN by median on small genres

In [None]:
for i in small_genres:
    songs = df_bal[df_bal['genres'] == i]
    new_songs = songs.fillna(songs.median(numeric_only=True))  # skip the non-numeric 'genres' column
    new_df = pd.concat([new_df, new_songs])

Now we don't have any NaN values left. However, we can see below that the dataframe is still not well balanced

In [None]:
new_df['genres'].value_counts()

# Part 2 - Select data

### Scale features

First let's randomize the data

In [None]:
new_df = new_df.sample(frac=1, random_state=101).reset_index(drop=True)

In [None]:
X = new_df.drop('genres', axis=1)
y = new_df['genres']

Let's scale the features

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scale = scaler.fit_transform(X)
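
For reference, with its default settings `StandardScaler` standardizes each feature column j using the mean and (population) standard deviation estimated from the data passed to `fit`:

```latex
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}, \qquad
\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad
\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(x_{ij} - \mu_j\right)^2}
```

After this step every feature has mean 0 and unit variance, which matters for distance-based models such as kNN and SVC.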

### Over and undersampling

To reduce the class imbalance, we will also try a combination of over- and undersampling

In [None]:
from collections import Counter
sorted(Counter(y).items())

In [None]:
from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=0)
X_resampled, y_resampled = smote_enn.fit_resample(X_scale, y)
sorted(Counter(y_resampled).items())

We now also have 2 new data sources: X_resampled and y_resampled on which we could test our model
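
`SMOTEENN` chains SMOTE oversampling with Edited Nearest Neighbours cleaning. The sketch below shows only the interpolation idea behind SMOTE on two hand-written points. It is an illustration under simplified assumptions, not imbalanced-learn's implementation, and `synthesize` is a hypothetical helper.

```python
import random

def synthesize(sample, neighbour, rng):
    """Create a synthetic point on the segment between a minority sample
    and one of its same-class neighbours (the core SMOTE step)."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [s + lam * (n - s) for s, n in zip(sample, neighbour)]

rng = random.Random(0)
sample = [0.0, 1.0]     # made-up scaled feature vector
neighbour = [1.0, 3.0]  # made-up nearest neighbour of the same genre
new_point = synthesize(sample, neighbour, rng)
print(new_point)
```

The synthetic song always lies between two real songs of the same genre, which is why resampled data can look easier to classify than the original data.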

# Part 3: Try classifiers

For all our classifiers we will use a pipeline (classifier + SelectKBest) as well as GridSearchCV

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

Here are the classifiers we are going to try:
- kNN on X_scale and y
- kNN on X_resampled and y_resampled
- LogReg on X_scale and y
- LogReg on X_resampled and y_resampled
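
As a minimal sketch of how the pipeline and the grid search fit together, the example below runs the same pattern on synthetic data. The data from `make_classification` is only a stand-in for the song features, and the grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# synthetic stand-in for the song features (assumption: not the real data)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, n_informative=4,
                                   n_classes=3, random_state=0)

pipe = Pipeline([('feature_selection', SelectKBest()),
                 ('kneighbors', KNeighborsClassifier())])

# grid keys follow the pattern <step name>__<parameter name>
param_grid = {'feature_selection__k': [2, 4, 8],
              'kneighbors__n_neighbors': [3, 5]}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid.fit(X_toy, y_toy)
print(grid.best_params_)
print(round(grid.best_score_, 3))
```

Putting `SelectKBest` inside the pipeline means the features are re-selected on each cross-validation fold, so the selection never sees the held-out fold.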

### kNN on X_scale and y

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.1, random_state=101)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn1 = KNeighborsClassifier()
selector1 = SelectKBest()

In [None]:
steps_knn1 = [('feature_selection', selector1), ('kneighbors', knn1)]
parameters_knn1 = dict(feature_selection__k=[5,7,10,12], kneighbors__n_neighbors=[3,5,7,10])
pipeline_knn1 = Pipeline(steps_knn1)

In [None]:
grid_knn1 = GridSearchCV(pipeline_knn1, param_grid=parameters_knn1, verbose=3)
grid_knn1.fit(X_train, y_train)

In [None]:
print(grid_knn1.best_estimator_)
predictions_knn1 = grid_knn1.predict(X_test)
print(classification_report(y_test, predictions_knn1))

We can see that nature sounds has a result of 1 because there was probably no data in the test sample. Let's try this classifier again with the resampled data

### kNN on X_resampled and y_resampled

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.1, random_state=101)
knn2 = KNeighborsClassifier()
selector2 = SelectKBest()
steps_knn2 = [('feature_selection', selector2), ('kneighbors', knn2)]
parameters_knn2 = dict(feature_selection__k=[5,7,10,12], kneighbors__n_neighbors=[3,5,7,10])
pipeline_knn2 = Pipeline(steps_knn2)

In [None]:
grid_knn2 = GridSearchCV(pipeline_knn2, param_grid=parameters_knn2, verbose=3)
grid_knn2.fit(X_train, y_train)

In [None]:
print(grid_knn2.best_estimator_)
predictions_knn2 = grid_knn2.predict(X_test)
print(classification_report(y_test, predictions_knn2))

The resampled data gives much better results. We will keep only this kNN classifier, as the score of the first one is too low. However, the computation time increased significantly, so we will reduce the parameter grid later on.

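NOTE: we should be careful about overfitting with this specific model. The resampling was done before the train/test split, so synthetic test samples may closely resemble training samples and inflate the score.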

### Logistic Regression

We will also use GridSearchCV for LogReg, but with more options, as several might be interesting: the solver, the multiclass strategy, and class_weight (to balance the classes automatically)
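
As a rough illustration of what `class_weight='balanced'` does, scikit-learn documents the weight of each class as `n_samples / (n_classes * count)`. The sketch below reproduces that formula in plain Python on made-up labels; `balanced_weights` is a hypothetical helper.

```python
from collections import Counter

def balanced_weights(labels):
    """Weight each class inversely to its frequency in `labels`."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

labels = ['rock'] * 8 + ['jazz'] * 2  # made-up, imbalanced labels
weights = balanced_weights(labels)
print(weights)  # {'rock': 0.625, 'jazz': 2.5}
```

Rare genres get a larger weight, so their misclassifications cost more during training.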

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.1, random_state=101)

In [None]:
lr1 = LogisticRegression(max_iter=5000, class_weight='balanced')
selector_lr1 = SelectKBest()
steps_lr1 = [('feature_selection', selector_lr1), ('LogReg', lr1)]
parameters_lr1 = dict(feature_selection__k=[5,8,12], 
                      LogReg__solver=['newton-cg', 'sag', 'saga', 'lbfgs'],
                      LogReg__multi_class=['ovr', 'multinomial'])

pipeline_lr1 = Pipeline(steps_lr1)

In [None]:
grid_lr1 = GridSearchCV(pipeline_lr1, param_grid=parameters_lr1, verbose=3)
grid_lr1.fit(X_train, y_train)

In [None]:
print(grid_lr1.best_estimator_)
predictions_lr1 = grid_lr1.predict(X_test)
print(classification_report(y_test, predictions_lr1))

Let's try the same thing but without class_weight='balanced'

In [None]:
lr2 = LogisticRegression(max_iter=5000)
selector_lr2 = SelectKBest()
steps_lr2 = [('feature_selection', selector_lr2), ('LogReg', lr2)]
parameters_lr2 = dict(feature_selection__k=[5,8,12], 
                      LogReg__solver=['newton-cg', 'sag', 'saga', 'lbfgs'],
                      LogReg__multi_class=['ovr', 'multinomial'])

pipeline_lr2 = Pipeline(steps_lr2)

In [None]:
grid_lr2 = GridSearchCV(pipeline_lr2, param_grid=parameters_lr2, verbose=3)
grid_lr2.fit(X_train, y_train)

In [None]:
print(grid_lr2.best_estimator_)
predictions_lr2 = grid_lr2.predict(X_test)
print(classification_report(y_test, predictions_lr2))

# <font color='orange'> Also look at resampled data </font>

### SVC on X_scale and y

In [None]:
from sklearn.svm import SVC

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.1, random_state=101)

In [None]:
svc1 = SVC()
selector_svc1 = SelectKBest()
steps_svc1 = [('feature_selection', selector_svc1), ('SVC', svc1)]
parameters_svc1 = dict(feature_selection__k=[5, 8, 12],
                       SVC__C=[0.1, 1, 10],
                       SVC__gamma=[1, 0.1, 0.01, 0.001],
                       SVC__decision_function_shape=['ovo', 'ovr'])

pipeline_svc1 = Pipeline(steps_svc1)

In [None]:
grid_svc1 = GridSearchCV(pipeline_svc1, param_grid=parameters_svc1, verbose=3)
grid_svc1.fit(X_train, y_train)

In [None]:
print(grid_svc1.best_estimator_)
predictions_svc1 = grid_svc1.predict(X_test)
print(classification_report(y_test, predictions_svc1))

# <font color='orange'> Comments</font>

# <font color='orange'> Later: voting classifier </font>
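
A possible shape for that later step, sketched on synthetic data. The estimators below use default settings as placeholders; in this notebook one would presumably pass the tuned `best_estimator_` of each grid search instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# synthetic stand-in for the song features (assumption: not the real data)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, n_informative=4,
                                   n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          random_state=0)

voter = VotingClassifier(
    estimators=[('knn', KNeighborsClassifier()),
                ('logreg', LogisticRegression(max_iter=5000)),
                ('svc', SVC())],
    voting='hard')  # majority vote on the predicted labels
voter.fit(X_tr, y_tr)
print(voter.score(X_te, y_te))
```

Hard voting only needs predicted labels, so it works with `SVC()` as is; soft voting would require `SVC(probability=True)`.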