# CoderSchool Final Project Genres
## Music Recommendation System

In [1]:
import pandas as pd
import numpy as np

For now we will work with the MasterSongList data. Let's see later on if we can use a more detailed dataframe

Let's start with the genre analysis, since there are 29 genres after cleaning the data, this will be a multi class problem. Here are the classifiers that we will use, all of them are compatible with multiclass:
- kNN
- Logistic Regression OVR
- or Logistic Regression OVO
- SVC OVR (default)
- or SVC OVO

We are going to try both OVO and OVR for LogReg and SVC but will only keep one of the 2.
For all of these models we will use a Pipeline to combine the classifier with GridSearchCV (optimize parameters) and SelectKBest (optimize features)
The results will be compared using VotingClassifier

# Part 1 - Data cleaning

In [2]:
full_df = pd.read_json('MasterSongList.json')
full_df.head(3)

Unnamed: 0,_id,album,artist,audio_features,context,decades,genres,lyrics_features,moods,name,new_context,picture,recording_id,sub_context,yt_id,yt_views
0,{'$oid': '52fdfb440b9398049f3d7a8c'},Gangnam Style (강남스타일),PSY,"[11, 0.912744, 0.083704, 132.069, 0.293137, 0....",[work out],[],[pop],"[oppa, gangnam, style, gangnam, style, najeneu...","[energetic, motivational]",Gangnam Style (강남스타일),work out,http://images.musicnet.com/albums/073/463/405/...,50232.0,[working out: cardio],9bZkp7q19f0,2450112089
1,{'$oid': '52fdfb3d0b9398049f3cbc8e'},Native,OneRepublic,"[6, 0.7457039999999999, 0.11995499999999999, 1...",[energetic],[2012],[pop],"[lately, i, ve, been, i, ve, been, losing, sle...",[happy],Counting Stars,energetic,http://images.musicnet.com/albums/081/851/887/...,5839.0,[energy boost],hT_nvWreIhg,1020297206
2,{'$oid': '52fdfb420b9398049f3d3ea5'},Party Rock Anthem,LMFAO,"[5, 0.709932, 0.231455, 130.03, 0.121740999999...","[energetic, energetic, energetic, energetic]",[],[],"[party, rock, yeah, woo, let, s, go, party, ro...","[happy, celebratory, rowdy]",Party Rock Anthem,housework,http://images.musicnet.com/albums/049/414/127/...,52379.0,"[energy boost, pleasing a crowd, housework, dr...",KQ6zr6kCPj8,971128436


### Clean the genres

We need to remove the list format

In [3]:
full_df2 = full_df.copy()
full_df2['genres'] = full_df2['genres'].apply(''.join)
# full_df2['genres'].value_counts()

And only want to keep the first genre

In [4]:
def split_first_genre(genre):
    if len(genre) > 0:
        return genre.split(':')[0]
    else:
        return genre

full_df2['genres'] = full_df2['genres'].apply(split_first_genre)

Let's see what genres are available

In [5]:
unique_genres = full_df2['genres'].unique()
unique_genres.sort()

In [8]:
unique_genres.tolist()

['',
 'bluegrass',
 'blues & blues rock',
 "children's",
 'christian',
 'classical',
 'country',
 'dance',
 "dubstep & drum 'n' bass",
 'easy listening',
 'electronica',
 'film scores',
 'folk',
 'funk',
 'hawaiian ',
 'indie',
 "int'l",
 'international/world',
 'jazz',
 'latin',
 'nature sounds',
 'oldies',
 'pop',
 'r&b',
 'rap',
 'reggae & ska',
 'reggaeton',
 'rock',
 'showtunes',
 'singer-songwriter']

### Audio Features

We now only want to keep the audio features and the genre, let's create a new dataframe: df

In [9]:
features_headers = ['key', 'energy', 'liveliness', 'tempo', 'speechiness', 'acousticness', 'instrumentalness', 'time_signature', 'duration', 'loudness', 'valence', 'danceability', 'mode', 'time_signature_confidence', 'tempo_confidence', 'key_confidence', 'mode_confidence']
features_list = full_df2['audio_features'].tolist()
df = pd.DataFrame(features_list, columns=features_headers)
df['genres'] = full_df2['genres']
df.head()

Unnamed: 0,key,energy,liveliness,tempo,speechiness,acousticness,instrumentalness,time_signature,duration,loudness,valence,danceability,mode,time_signature_confidence,tempo_confidence,key_confidence,mode_confidence,genres
0,11.0,0.912744,0.083704,132.069,0.293137,0.005423,1e-06,0.0,4.0,218.30667,-3.89,0.752186,0.72692,0.552,0.541,1.0,1.0,pop
1,6.0,0.745704,0.119955,100.008,0.046255,0.02623,0.012727,1.0,4.0,235.06086,-7.687,0.351282,0.691817,0.737,0.634,0.796,1.0,pop
2,5.0,0.709932,0.231455,130.03,0.121741,0.036662,0.0,0.0,4.0,232.46104,-5.15,0.37439,0.704729,0.565,0.565,0.743,1.0,
3,3.0,0.705822,0.053292,126.009,0.126016,0.001966,0.0,0.0,4.0,194.09333,-3.898,0.592798,0.875137,0.004,0.114,1.0,0.742,dance
4,3.0,0.741757,0.072774,129.985,0.051255,0.096732,0.000474,0.0,4.0,285.42667,-5.86,0.58563,0.730711,0.271,0.324,0.822,1.0,reggaeton


### NaN rows

Let's remove the songs with no genres

In [10]:
df.shape

(36733, 18)

In [11]:
df['genres'].replace('', np.nan, inplace=True)
df.dropna(subset=['genres'], inplace=True)
df.shape

(33057, 18)

Let's have a look at the NaN rows and their distribution among the genres

In [12]:
def checknan(x):
    return np.isnan(x)

In [13]:
genres_df = ['bluegrass', 'blues & blues rock', "children's", 'christian', 'classical', 'country', 'dance', "dubstep & drum 'n' bass", 'easy listening', 'electronica', 'film scores', 'folk', 'funk', 'hawaiian ', 'indie', "int'l", 'international/world', 'jazz', 'latin', 'nature sounds', 'oldies', 'pop', 'r&b', 'rap', 'reggae & ska', 'reggaeton', 'rock', 'showtunes', 'singer-songwriter']

In [14]:
for i in genres_df:
    songs = df[df['genres'] == i]
    genres_nan = songs['speechiness'].apply(checknan)
    print(i)
    print(genres_nan.value_counts())
    print("")

bluegrass
False    213
True      89
Name: speechiness, dtype: int64

blues & blues rock
False    727
True     175
Name: speechiness, dtype: int64

children's
False    196
True      64
Name: speechiness, dtype: int64

christian
False    204
True      32
Name: speechiness, dtype: int64

classical
True     637
False    455
Name: speechiness, dtype: int64

country
False    1075
True      209
Name: speechiness, dtype: int64

dance
False    2000
True      391
Name: speechiness, dtype: int64

dubstep & drum 'n' bass
False    272
True      46
Name: speechiness, dtype: int64

easy listening
False    151
True      45
Name: speechiness, dtype: int64

electronica
False    1249
True      227
Name: speechiness, dtype: int64

film scores
False    180
True      65
Name: speechiness, dtype: int64

folk
False    402
True      85
Name: speechiness, dtype: int64

funk
False    470
True      96
Name: speechiness, dtype: int64

hawaiian 
False    12
True      8
Name: speechiness, dtype: int64

indie
False  

The dataset is quite disbalanced. First, let's:
- drop the NaN rows when count is above 1000
- replace the NaN rows values by the median of the others when under 1000
- combine some of the similar genres with low number of rows: international & hawai, etc...

In [15]:
df_bal = df.copy()

Let'd group all the international songs

In [16]:
df_bal.loc[(df_bal['genres'].str.contains("hawa")), 'genres'] = 'international/world'
df_bal.loc[(df_bal['genres'] == "int'l"), 'genres'] = 'international/world'

Since I am not too sure what 'showtunes' is, let's look at a few samples

In [17]:
full_df2[full_df2['genres'] == 'showtunes'].head(3)

Unnamed: 0,_id,album,artist,audio_features,context,decades,genres,lyrics_features,moods,name,new_context,picture,recording_id,sub_context,yt_id,yt_views
550,{'$oid': '52fdfb470b9398049f3db3da'},Dreamgirls: Music From The Motion Picture,Beyoncé,"[6, 0.863908, 0.933894, 129.015, 0.07024000000...",[happy],[],showtunes,"[listen, to, the, song, here, in, my, heart, a...",[happy],Listen,,http://images.musicnet.com/albums/009/114/795/...,78256.0,,w1rLZfAfQLM,48657477
1622,{'$oid': '52fdfb470b9398049f3db3d8'},Mamma Mia!: The Movie Soundtrack,Dominic Cooper,"[5, 0.731035, 0.319639, 133.01, 0.029075, 0.01...",[happy],[],showtunes,"[i, wasn, t, jealous, before, we, met, now, ev...",[happy],Lay All Your Love On Me,,http://images.musicnet.com/albums/015/129/627/...,50071.0,,AQKk1nkDa8U,12748652
2008,{'$oid': '52fdfb470b9398049f3db3a7'},Dreamgirls: Music From The Motion Picture,Jennifer Hudson,"[10, 0.734026, 0.16022, 121.585, 0.081468, 0.2...",[happy],[],showtunes,"[and, i, am, tellin, you, i, m, not, goin, you...",[happy],And I Am Telling You I'm Not Going,,http://images.musicnet.com/albums/009/114/795/...,78227.0,,QsiSRSgqE4E,9781891


This can be grouped with 'film_scores' as 'film/show'

In [26]:
df_bal.loc[(df_bal['genres'] == 'showtunes'), 'genres'] = 'film/show'
df_bal.loc[(df_bal['genres'] == 'film scores'), 'genres'] = 'film/show'

In [18]:
df_bal['genres'].value_counts()

rock                       7485
rap                        2931
r&b                        2721
dance                      2391
jazz                       2295
indie                      2089
electronica                1476
latin                      1312
country                    1284
singer-songwriter          1189
classical                  1092
blues & blues rock          902
pop                         901
international/world         811
reggae & ska                754
funk                        566
oldies                      548
folk                        487
dubstep & drum 'n' bass     318
bluegrass                   302
children's                  260
film scores                 245
christian                   236
easy listening              196
reggaeton                   127
showtunes                   105
nature sounds                34
Name: genres, dtype: int64

Let's differentiate genres that have more/less than 1000 non-NaN rows

In [19]:
new_genres_df = ['bluegrass', 'blues & blues rock', "children's", 'christian', 'classical', 'country', 'dance', "dubstep & drum 'n' bass", 'easy listening', 'electronica', 'film/show', 'folk', 'funk', 'indie', 'international/world', 'jazz', 'latin', 'nature sounds', 'oldies', 'pop', 'r&b', 'rap', 'reggae & ska', 'reggaeton', 'rock', 'singer-songwriter']

In [20]:
large_genres = []
small_genres = []

for i in new_genres_df:
    songs_genre = df_bal[df_bal['genres'] == i]
    songs_genre_nan = songs_genre['speechiness'].apply(checknan)
    if len(songs_genre_nan[songs_genre_nan == False]) >= 1000:
        large_genres.append(i)
    else:
        small_genres.append(i)

print(large_genres)
print(small_genres)

['country', 'dance', 'electronica', 'indie', 'jazz', 'latin', 'r&b', 'rap', 'rock', 'singer-songwriter']
['bluegrass', 'blues & blues rock', "children's", 'christian', 'classical', "dubstep & drum 'n' bass", 'easy listening', 'film/show', 'folk', 'funk', 'international/world', 'nature sounds', 'oldies', 'pop', 'reggae & ska', 'reggaeton']


Let's drop NaN on large genres

In [21]:
new_df = pd.DataFrame()

for i in large_genres:
    songs = df_bal[df_bal['genres'] == i]
    new_songs = songs.dropna(axis=0, how='any')
    new_df = pd.concat([new_df, new_songs])

Let's replace NaN by median on small genres

In [22]:
for i in small_genres:
    songs = df_bal[df_bal['genres'] == i]
    new_songs = songs.fillna(songs.median())
    new_df = pd.concat([new_df, new_songs])

Now we don't have any NaN value left. However we can see below that the dataframe is not well balanced

In [23]:
new_df['genres'].value_counts()

rock                       6435
rap                        2452
r&b                        2344
dance                      2000
jazz                       1889
indie                      1834
electronica                1249
classical                  1092
country                    1075
singer-songwriter          1034
latin                      1032
blues & blues rock          902
pop                         901
international/world         811
reggae & ska                754
funk                        566
oldies                      548
folk                        487
dubstep & drum 'n' bass     318
bluegrass                   302
children's                  260
christian                   236
easy listening              196
reggaeton                   127
nature sounds                34
Name: genres, dtype: int64

We will now drop the genres with low number of songs: under 500

In [72]:
drop_genres = ["dubstep & drum 'n' bass", "bluegrass", "children's", "christian", 'easy listening', 'reggaeton', 'nature sounds', "folk"]

In [73]:
new_df = new_df[new_df['genres'].isin(drop_genres) == False]
new_df['genres'].value_counts()

rock                   6435
rap                    2452
r&b                    2344
dance                  2000
jazz                   1889
indie                  1834
electronica            1249
classical              1092
country                1075
singer-songwriter      1034
latin                  1032
blues & blues rock      902
pop                     901
international/world     811
reggae & ska            754
funk                    566
oldies                  548
Name: genres, dtype: int64

# Part 2 - Select data

### Scale features

First let's randomize the data

In [74]:
new_df = new_df.sample(frac=1, random_state=101).reset_index(drop=True)

### Data 1

In [75]:
X = new_df.drop('genres', axis=1)
y = new_df['genres']

Let's scale the features

In [208]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scale = scaler.fit_transform(X)

In [209]:
pickle.dump(scaler, open('genre_feature_scaler.pickle', 'wb'))

### Data 2: SMOTEEN - Over and undersampling

In order to avoid imbalanced data, we will also try to use a combination of over and under sampling

In [77]:
from collections import Counter
sorted(Counter(y).items())

[('blues & blues rock', 902),
 ('classical', 1092),
 ('country', 1075),
 ('dance', 2000),
 ('electronica', 1249),
 ('funk', 566),
 ('indie', 1834),
 ('international/world', 811),
 ('jazz', 1889),
 ('latin', 1032),
 ('oldies', 548),
 ('pop', 901),
 ('r&b', 2344),
 ('rap', 2452),
 ('reggae & ska', 754),
 ('rock', 6435),
 ('singer-songwriter', 1034)]

In [78]:
from imblearn.combine import SMOTEENN
smote_enn = SMOTEENN(random_state=0)
X_smoteen, y_smoteen = smote_enn.fit_sample(X_scale, y)
sorted(Counter(y_resampled).items())

[('blues & blues rock', 6213),
 ('classical', 6316),
 ('country', 6114),
 ('dance', 5115),
 ('electronica', 5881),
 ('folk', 6337),
 ('funk', 6290),
 ('indie', 5063),
 ('international/world', 6180),
 ('jazz', 5288),
 ('latin', 6080),
 ('oldies', 6320),
 ('pop', 6058),
 ('r&b', 4248),
 ('rap', 4969),
 ('reggae & ska', 6321),
 ('rock', 822),
 ('singer-songwriter', 6038)]

We now also have 2 new data sources: X_resampled and y_resampled on which we could test our model

### Data 3: resample Rock

In [108]:
rock_songs = new_df[new_df['genres'] == 'rock']
rock_sample = rock_songs.sample(n=4000, random_state=101).index

In [109]:
rock_data = new_df.drop(index = rock_sample)
X_rock = rock_data.drop('genres', axis=1)
y_rock = rock_data['genres']
X_rock = scaler.fit_transform(X_rock)
X_rock.shape

(22918, 17)

### Data 4: SMOTE - oversampling

In [110]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_sample(X_rock, y_rock)
sorted(Counter(y_smote).items())

[('blues & blues rock', 2452),
 ('classical', 2452),
 ('country', 2452),
 ('dance', 2452),
 ('electronica', 2452),
 ('funk', 2452),
 ('indie', 2452),
 ('international/world', 2452),
 ('jazz', 2452),
 ('latin', 2452),
 ('oldies', 2452),
 ('pop', 2452),
 ('r&b', 2452),
 ('rap', 2452),
 ('reggae & ska', 2452),
 ('rock', 2452),
 ('singer-songwriter', 2452)]

# Part 3: Try classifiers

For all our classifiers we will use a pipeline (classifier + SelectKBest) as we as GridSearchCV

In [111]:
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report

Here are the classifiers we are going to try:
- kNN on X_scale and y
- kNN on X_resampled and y_resampled
- LogReg on X_scale and y
- LogReg on X_resampled and y_resampled

### kNN1 on X_scale and y

In [112]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.1, random_state=101)

In [113]:
from sklearn.neighbors import KNeighborsClassifier
knn1 = KNeighborsClassifier()
selector1 = SelectKBest()

In [185]:
steps_knn1 = [('feature_selection', selector1), ('kneighbors', knn1)]
parameters_knn1 = dict(feature_selection__k=[8,12,15], kneighbors__n_neighbors=[3,6,10])
pipeline_knn1 = Pipeline(steps_knn1)

In [186]:
grid_knn1 = GridSearchCV(pipeline_knn1, param_grid=parameters_knn1, verbose=3)
grid_knn1.fit(X_train, y_train)

Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.367529 -   0.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.6s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.364908 -   0.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.3s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.366092 -   0.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.407870 -   0.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.407380 -   0.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.413434 -   0.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.426185 -   0.7s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.427811 -   0.7s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.436733 -   0.7s
[CV]

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:   58.3s finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=10, score_func=<function f_classif at 0x10baf9bf8>)), ('kneighbors', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'feature_selection__k': [8, 12, 15], 'kneighbors__n_neighbors': [3, 6, 10]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=3)

In [187]:
print(grid_knn1.best_estimator_)
predictions_knn1 = grid_knn1.predict(X_test)
print(classification_report(y_test, predictions_knn1))

Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=12, score_func=<function f_classif at 0x10baf9bf8>)), ('kneighbors', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform'))])
                     precision    recall  f1-score   support

 blues & blues rock       0.36      0.38      0.37        91
          classical       0.68      0.88      0.77        97
            country       0.28      0.28      0.28       119
              dance       0.44      0.57      0.49       198
        electronica       0.40      0.30      0.34       125
               funk       0.36      0.30      0.33        50
              indie       0.18      0.12      0.15       161
international/world       0.67      0.28      0.40        99
               jazz       0.43      0.51      0.46       194
              latin       0.34      0.24      0.28       124
             oldies  

### kNN1bis on X_scale and y (no selectKBest)

In [184]:
X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.1, random_state=101)
knn1bis = KNeighborsClassifier(n_neighbors=10)
knn1bis.fit(X_train, y_train)
predictions_knn1bis = knn1bis.predict(X_test)
print(classification_report(y_test, predictions_knn1bis))

                     precision    recall  f1-score   support

 blues & blues rock       0.37      0.35      0.36        91
          classical       0.74      0.85      0.79        97
            country       0.27      0.29      0.28       119
              dance       0.40      0.57      0.47       198
        electronica       0.38      0.26      0.31       125
               funk       0.41      0.38      0.40        50
              indie       0.19      0.16      0.17       161
international/world       0.61      0.30      0.41        99
               jazz       0.38      0.53      0.44       194
              latin       0.36      0.24      0.29       124
             oldies       0.32      0.15      0.20        67
                pop       0.45      0.26      0.33        80
                r&b       0.28      0.29      0.29       226
                rap       0.59      0.67      0.62       237
       reggae & ska       0.87      0.25      0.39        79
               rock    

Same results as above

### kNN2 on X_smoteen and y_smoteen

In [117]:
X_train, X_test, y_train, y_test = train_test_split(X_smoteen, y_smoteen, test_size=0.1, random_state=101)
knn2 = KNeighborsClassifier()
selector2 = SelectKBest()
steps_knn2 = [('feature_selection', selector2), ('kneighbors', knn2)]
parameters_knn2 = dict(feature_selection__k=[8,12,15], kneighbors__n_neighbors=[3,6,10])
pipeline_knn2 = Pipeline(steps_knn2)

In [118]:
grid_knn2 = GridSearchCV(pipeline_knn2, param_grid=parameters_knn2, verbose=3)
grid_knn2.fit(X_train, y_train)

Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.749527 -   3.2s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.3s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.742071 -   3.5s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    6.8s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.742997 -   3.4s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.693740 -   5.1s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.685705 -   4.8s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.688934 -   4.5s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.652078 -   5.3s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.647533 -   5.3s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.646933 -   5.9s
[CV]

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  8.9min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=10, score_func=<function f_classif at 0x10baf9bf8>)), ('kneighbors', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'feature_selection__k': [8, 12, 15], 'kneighbors__n_neighbors': [3, 6, 10]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=3)

In [119]:
print(grid_knn2.best_estimator_)
predictions_knn2 = grid_knn2.predict(X_test)
print(classification_report(y_test, predictions_knn2))

Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=15, score_func=<function f_classif at 0x10baf9bf8>)), ('kneighbors', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform'))])
                     precision    recall  f1-score   support

 blues & blues rock       0.97      0.98      0.97       560
          classical       0.99      1.00      0.99       654
            country       0.95      1.00      0.97       586
              dance       0.97      0.96      0.96       486
        electronica       0.96      0.97      0.97       615
               funk       0.97      0.99      0.98       631
              indie       0.94      0.94      0.94       507
international/world       0.97      0.99      0.98       633
               jazz       0.99      0.97      0.98       498
              latin       0.97      0.99      0.98       650
             oldies   

We can notice that the resampled data gives way better results. We will keep only this kNN classifier as the first result is too low. However we can note the computation time increased significantly: so we will reduce some of the parameters later on.

NOTE: score is too high, probably an issue

Let's have a look at cross validation to check if our sample had an influence on the score

In [122]:
from sklearn.model_selection import cross_val_score
scores_smoteen = cross_val_score(grid_knn2, X_smoteen, y_smoteen, cv=5)

Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.721676 -   2.2s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.2s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.714647 -   2.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    4.8s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.719276 -   2.2s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.675884 -   2.8s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.670774 -   2.9s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.673867 -   3.0s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.645904 -   3.5s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.638622 -   3.7s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.641587 -   3.5s
[CV]

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  6.5min finished


Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.763334 -   2.9s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.9s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.734960 -   4.3s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    7.2s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.735929 -   2.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.698318 -   3.2s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.681784 -   4.3s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.683902 -   4.5s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.650560 -   5.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.645904 -   4.1s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.647090 -   3.9s
[CV]

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  6.9min finished


Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.771870 -   2.3s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.3s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.739254 -   2.5s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    4.9s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.737706 -   2.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.704896 -   3.4s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.684111 -   3.4s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.681185 -   3.0s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.653170 -   3.7s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.643657 -   3.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.648469 -   3.5s
[CV]

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  6.4min finished


Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.769833 -   2.5s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.5s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.738171 -   2.2s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    4.7s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.736834 -   2.0s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.700574 -   3.1s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.683549 -   3.3s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.680154 -   3.4s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.647727 -   3.8s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.643697 -   3.8s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.643545 -   3.3s
[CV]

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  6.5min finished


Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.768978 -   2.0s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.0s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.738652 -   2.1s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    4.1s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.738038 -   2.0s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.700249 -   3.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.680419 -   3.2s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.683084 -   3.0s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.649214 -   3.3s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.644660 -   3.9s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.643385 -   4.1s
[CV]

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  6.0min finished


In [124]:
print(scores_smoteen)
print(scores_smoteen.mean())

[0.97240051 0.96393601 0.96644367 0.97275598 0.97044968]
0.9691971700367988


### kNN3 on X_rock and y_rock (rock songs resampled)

In [125]:
X_train, X_test, y_train, y_test = train_test_split(X_rock, y_rock, test_size=0.1, random_state=101)
knn3 = KNeighborsClassifier()
selector3 = SelectKBest()
steps_knn3 = [('feature_selection', selector3), ('kneighbors', knn3)]
parameters_knn3 = dict(feature_selection__k=[8,12,15], kneighbors__n_neighbors=[3,6,10])
pipeline_knn3 = Pipeline(steps_knn3)

In [126]:
grid_knn3 = GridSearchCV(pipeline_knn3, param_grid=parameters_knn3, verbose=3)
grid_knn3.fit(X_train, y_train)

Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.335222 -   0.4s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.4s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.333721 -   0.4s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    0.9s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.339156 -   0.5s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.377361 -   0.5s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.371545 -   0.5s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.374381 -   0.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.389131 -   0.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.391621 -   0.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.391849 -   0.6s
[CV]

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:   42.5s finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=10, score_func=<function f_classif at 0x10baf9bf8>)), ('kneighbors', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'feature_selection__k': [8, 12, 15], 'kneighbors__n_neighbors': [3, 6, 10]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=3)

In [127]:
print(grid_knn3.best_estimator_)
predictions_knn3 = grid_knn3.predict(X_test)
print(classification_report(y_test, predictions_knn3))

Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=12, score_func=<function f_classif at 0x10baf9bf8>)), ('kneighbors', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=10, p=2,
           weights='uniform'))])
                     precision    recall  f1-score   support

 blues & blues rock       0.34      0.45      0.39        87
          classical       0.76      0.79      0.77        90
            country       0.23      0.33      0.27       105
              dance       0.44      0.58      0.50       202
        electronica       0.31      0.22      0.26       122
               funk       0.32      0.22      0.26        55
              indie       0.21      0.21      0.21       174
international/world       0.54      0.29      0.38        90
               jazz       0.44      0.50      0.47       202
              latin       0.32      0.28      0.30       104
             oldies  

kNN with rock samples gives even lower scores

### kNN4 on X_smote and y_smote (rock songs resampled)

In [128]:
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.1, random_state=101)
knn4 = KNeighborsClassifier()
selector4 = SelectKBest()
steps_knn4 = [('feature_selection', selector4), ('kneighbors', knn4)]
parameters_knn4 = dict(feature_selection__k=[8,12,15], kneighbors__n_neighbors=[3,6,10])
pipeline_knn4 = Pipeline(steps_knn4)

In [129]:
grid_knn4 = GridSearchCV(pipeline_knn4, param_grid=parameters_knn4, verbose=3)
grid_knn4.fit(X_train, y_train)

Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.503677 -   1.0s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.506837 -   0.9s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.9s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.499440 -   1.0s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.485691 -   1.2s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.486685 -   1.1s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.485840 -   1.2s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.473621 -   1.3s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.475330 -   1.4s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.478240 -   1.6s
[CV]

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  2.1min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=10, score_func=<function f_classif at 0x10baf9bf8>)), ('kneighbors', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'feature_selection__k': [8, 12, 15], 'kneighbors__n_neighbors': [3, 6, 10]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=3)

In [130]:
print(grid_knn4.best_estimator_)
predictions_knn4 = grid_knn4.predict(X_test)
print(classification_report(y_test, predictions_knn4))

Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=15, score_func=<function f_classif at 0x10baf9bf8>)), ('kneighbors', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform'))])
                     precision    recall  f1-score   support

 blues & blues rock       0.70      0.91      0.79       244
          classical       0.87      0.92      0.90       269
            country       0.58      0.80      0.67       251
              dance       0.50      0.59      0.54       234
        electronica       0.61      0.76      0.68       223
               funk       0.73      0.97      0.83       228
              indie       0.50      0.38      0.43       243
international/world       0.75      0.83      0.79       247
               jazz       0.64      0.50      0.56       221
              latin       0.73      0.75      0.74       259
             oldies   

This score is interesting, let's look into it

### Logistic Regression

We will also use GridSearchCV for LogReg but we will include more parameters as several things might be interesting: class_weight (to balance automatically the data). Firs let's run with the imbalanced data

In [40]:
from sklearn.linear_model import LogisticRegression

### lr1 with X_scale and y

In [41]:
X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.1, random_state=101)

In [42]:
lr1 = LogisticRegression(max_iter=10000)
selector_lr1 = SelectKBest()
steps_lr1 = [('feature_selection', selector_lr1), ('LogReg', lr1)]
parameters_lr1 = dict(feature_selection__k=[8, 12, 15], 
                      LogReg__solver=['newton-cg', 'sag', 'saga', 'lbfgs'],
                      LogReg__multi_class=['ovr', 'multinomial'])

pipeline_lr1 = Pipeline(steps_lr1)

In [43]:
grid_lr1 = GridSearchCV(pipeline_lr1, param_grid=parameters_lr1, verbose=3)
grid_lr1.fit(X_train, y_train)

Fitting 3 folds for each of 24 candidates, totalling 72 fits
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8, score=0.410428 -   1.6s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.6s remaining:    0.0s


[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8, score=0.416373 -   1.6s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    3.3s remaining:    0.0s


[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8, score=0.411564 -   1.6s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12, score=0.424648 -   1.9s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12, score=0.432308 -   1.9s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12, score=0.433232 -   1.9s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=15 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=15, score=0.433884 -   2.1s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=15 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_s

[CV]  LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=8, score=0.415338 -   0.8s
[CV] LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=12 
[CV]  LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=12, score=0.432669 -   2.4s
[CV] LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=12 
[CV]  LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=12, score=0.441430 -   3.8s
[CV] LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=12 
[CV]  LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=12, score=0.438344 -   3.5s
[CV] LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=15 
[CV]  LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=15, score=0.438260 -   2.7s
[CV] LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=15 
[CV]  LogReg__multi_class=multinomial, LogReg_

[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:  3.9min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=10, score_func=<function f_classif at 0x10baf9bf8>)), ('LogReg', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=10000, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'feature_selection__k': [8, 12, 15], 'LogReg__solver': ['newton-cg', 'sag', 'saga', 'lbfgs'], 'LogReg__multi_class': ['ovr', 'multinomial']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=3)

In [44]:
print(grid_lr1.best_estimator_)
predictions_lr1 = grid_lr1.predict(X_test)
print(classification_report(y_test, predictions_lr1))

Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=15, score_func=<function f_classif at 0x10baf9bf8>)), ('LogReg', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=10000, multi_class='multinomial',
          n_jobs=1, penalty='l2', random_state=None, solver='newton-cg',
          tol=0.0001, verbose=0, warm_start=False))])
                     precision    recall  f1-score   support

 blues & blues rock       0.55      0.42      0.48       107
          classical       0.70      0.87      0.77       121
            country       0.28      0.21      0.24       106
              dance       0.46      0.52      0.49       205
        electronica       0.37      0.23      0.29       125
               folk       0.00      0.00      0.00        44
               funk       0.22      0.08      0.11        53
              indie       0.22      0.04      0.07       179
international/world       0.00      0

Let's try the same thing ut with the balanced data

### lr2 with X_scale and y and 'balanced' option

In [45]:
lr2 = LogisticRegression(max_iter=10000, class_weight='balanced')
selector_lr2 = SelectKBest()
steps_lr2 = [('feature_selection', selector_lr2), ('LogReg', lr2)]
parameters_lr2 = dict(feature_selection__k=[8,12,15], 
                      LogReg__solver=['newton-cg', 'sag', 'saga', 'lbfgs'],
                      LogReg__multi_class=['ovr', 'multinomial'])

pipeline_lr2 = Pipeline(steps_lr2)

In [46]:
grid_lr2 = GridSearchCV(pipeline_lr2, param_grid=parameters_lr2, verbose=3)
grid_lr2.fit(X_train, y_train)

Fitting 3 folds for each of 24 candidates, totalling 72 fits
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8, score=0.376884 -   1.7s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.7s remaining:    0.0s


[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8, score=0.378421 -   1.7s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    3.5s remaining:    0.0s


[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8, score=0.376506 -   1.8s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12, score=0.398031 -   2.0s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12, score=0.407858 -   2.0s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12, score=0.393427 -   2.0s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=15 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=15, score=0.403743 -   2.2s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=15 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_s



[CV]  LogReg__multi_class=ovr, LogReg__solver=sag, feature_selection__k=8, score=0.279290 -  52.6s
[CV] LogReg__multi_class=ovr, LogReg__solver=sag, feature_selection__k=8 




[CV]  LogReg__multi_class=ovr, LogReg__solver=sag, feature_selection__k=8, score=0.379394 - 1.2min
[CV] LogReg__multi_class=ovr, LogReg__solver=sag, feature_selection__k=8 




[CV]  LogReg__multi_class=ovr, LogReg__solver=sag, feature_selection__k=8, score=0.325380 - 1.7min
[CV] LogReg__multi_class=ovr, LogReg__solver=sag, feature_selection__k=12 
[CV]  LogReg__multi_class=ovr, LogReg__solver=sag, feature_selection__k=12, score=0.398031 -   4.5s
[CV] LogReg__multi_class=ovr, LogReg__solver=sag, feature_selection__k=12 
[CV]  LogReg__multi_class=ovr, LogReg__solver=sag, feature_selection__k=12, score=0.407858 -   4.6s
[CV] LogReg__multi_class=ovr, LogReg__solver=sag, feature_selection__k=12 
[CV]  LogReg__multi_class=ovr, LogReg__solver=sag, feature_selection__k=12, score=0.393427 -   4.0s
[CV] LogReg__multi_class=ovr, LogReg__solver=sag, feature_selection__k=15 
[CV]  LogReg__multi_class=ovr, LogReg__solver=sag, feature_selection__k=15, score=0.403743 -   4.8s
[CV] LogReg__multi_class=ovr, LogReg__solver=sag, feature_selection__k=15 
[CV]  LogReg__multi_class=ovr, LogReg__solver=sag, feature_selection__k=15, score=0.407615 -   3.7s
[CV] LogReg__multi_class=o



[CV]  LogReg__multi_class=ovr, LogReg__solver=saga, feature_selection__k=8, score=0.373967 -  53.9s
[CV] LogReg__multi_class=ovr, LogReg__solver=saga, feature_selection__k=8 




[CV]  LogReg__multi_class=ovr, LogReg__solver=saga, feature_selection__k=8, score=0.378299 -  56.2s
[CV] LogReg__multi_class=ovr, LogReg__solver=saga, feature_selection__k=8 




[CV]  LogReg__multi_class=ovr, LogReg__solver=saga, feature_selection__k=8, score=0.373585 -  56.1s
[CV] LogReg__multi_class=ovr, LogReg__solver=saga, feature_selection__k=12 
[CV]  LogReg__multi_class=ovr, LogReg__solver=saga, feature_selection__k=12, score=0.398031 -   2.5s
[CV] LogReg__multi_class=ovr, LogReg__solver=saga, feature_selection__k=12 
[CV]  LogReg__multi_class=ovr, LogReg__solver=saga, feature_selection__k=12, score=0.407858 -   3.1s
[CV] LogReg__multi_class=ovr, LogReg__solver=saga, feature_selection__k=12 
[CV]  LogReg__multi_class=ovr, LogReg__solver=saga, feature_selection__k=12, score=0.393427 -   2.9s
[CV] LogReg__multi_class=ovr, LogReg__solver=saga, feature_selection__k=15 
[CV]  LogReg__multi_class=ovr, LogReg__solver=saga, feature_selection__k=15, score=0.403865 -   2.8s
[CV] LogReg__multi_class=ovr, LogReg__solver=saga, feature_selection__k=15 
[CV]  LogReg__multi_class=ovr, LogReg__solver=saga, feature_selection__k=15, score=0.407736 -   3.2s
[CV] LogReg__mu

[CV]  LogReg__multi_class=multinomial, LogReg__solver=lbfgs, feature_selection__k=8, score=0.359080 -   1.6s
[CV] LogReg__multi_class=multinomial, LogReg__solver=lbfgs, feature_selection__k=8 
[CV]  LogReg__multi_class=multinomial, LogReg__solver=lbfgs, feature_selection__k=8, score=0.364699 -   1.6s
[CV] LogReg__multi_class=multinomial, LogReg__solver=lbfgs, feature_selection__k=12 
[CV]  LogReg__multi_class=multinomial, LogReg__solver=lbfgs, feature_selection__k=12, score=0.388551 -   1.6s
[CV] LogReg__multi_class=multinomial, LogReg__solver=lbfgs, feature_selection__k=12 
[CV]  LogReg__multi_class=multinomial, LogReg__solver=lbfgs, feature_selection__k=12, score=0.390342 -   1.8s
[CV] LogReg__multi_class=multinomial, LogReg__solver=lbfgs, feature_selection__k=12 
[CV]  LogReg__multi_class=multinomial, LogReg__solver=lbfgs, feature_selection__k=12, score=0.382228 -   1.7s
[CV] LogReg__multi_class=multinomial, LogReg__solver=lbfgs, feature_selection__k=15 
[CV]  LogReg__multi_class=mu

[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:  9.4min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=10, score_func=<function f_classif at 0x10baf9bf8>)), ('LogReg', LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=10000,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'feature_selection__k': [8, 12, 15], 'LogReg__solver': ['newton-cg', 'sag', 'saga', 'lbfgs'], 'LogReg__multi_class': ['ovr', 'multinomial']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=3)

In [47]:
print(grid_lr2.best_estimator_)
predictions_lr2 = grid_lr2.predict(X_test)
print(classification_report(y_test, predictions_lr2))

Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=15, score_func=<function f_classif at 0x10baf9bf8>)), ('LogReg', LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=10000,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='saga', tol=0.0001, verbose=0, warm_start=False))])
                     precision    recall  f1-score   support

 blues & blues rock       0.51      0.39      0.44       107
          classical       0.67      0.83      0.74       121
            country       0.20      0.22      0.21       106
              dance       0.48      0.42      0.45       205
        electronica       0.29      0.36      0.32       125
               folk       0.08      0.18      0.12        44
               funk       0.16      0.43      0.23        53
              indie       0.26      0.10      0.15       179
international/world       0.38      0.38    

Score is slightly better with balanced option. Let's try the same one with the rock data

### lr3 with X_rock, y_rock and balanced option

In [131]:
X_train, X_test, y_train, y_test = train_test_split(X_rock, y_rock, test_size=0.1, random_state=101)
lr3 = LogisticRegression(max_iter=10000, class_weight='balanced')
selector_lr3 = SelectKBest()
steps_lr3 = [('feature_selection', selector_lr3), ('LogReg', lr3)]
parameters_lr3 = dict(feature_selection__k=[8,12,15], 
                      LogReg__solver=['newton-cg', 'sag', 'saga', 'lbfgs'],
                      LogReg__multi_class=['ovr', 'multinomial'])

pipeline_lr3 = Pipeline(steps_lr3)

In [132]:
grid_lr3 = GridSearchCV(pipeline_lr3, param_grid=parameters_lr3, verbose=3)
grid_lr3.fit(X_train, y_train)

Fitting 3 folds for each of 24 candidates, totalling 72 fits
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8, score=0.362395 -   1.4s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.4s remaining:    0.0s


[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8, score=0.357725 -   1.7s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    3.1s remaining:    0.0s


[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8, score=0.357496 -   1.8s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12, score=0.391165 -   2.1s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12, score=0.396858 -   2.2s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12, score=0.384716 -   1.8s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=15 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=15, score=0.394071 -   2.3s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=15 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_s

[CV]  LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=8, score=0.345997 -   0.9s
[CV] LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=12 
[CV]  LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=12, score=0.396251 -   3.9s
[CV] LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=12 
[CV]  LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=12, score=0.401367 -   1.8s
[CV] LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=12 
[CV]  LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=12, score=0.388355 -   3.2s
[CV] LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=15 
[CV]  LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=15, score=0.401482 -   4.3s
[CV] LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=15 
[CV]  LogReg__multi_class=multinomial, LogReg_

[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:  2.8min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=10, score_func=<function f_classif at 0x10baf9bf8>)), ('LogReg', LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=10000,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'feature_selection__k': [8, 12, 15], 'LogReg__solver': ['newton-cg', 'sag', 'saga', 'lbfgs'], 'LogReg__multi_class': ['ovr', 'multinomial']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=3)

In [133]:
print(grid_lr3.best_estimator_)
predictions_lr3 = grid_lr3.predict(X_test)
print(classification_report(y_test, predictions_lr3))

Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=15, score_func=<function f_classif at 0x10baf9bf8>)), ('LogReg', LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=10000,
          multi_class='multinomial', n_jobs=1, penalty='l2',
          random_state=None, solver='sag', tol=0.0001, verbose=0,
          warm_start=False))])
                     precision    recall  f1-score   support

 blues & blues rock       0.40      0.60      0.48        87
          classical       0.71      0.83      0.77        90
            country       0.19      0.19      0.19       105
              dance       0.53      0.49      0.51       202
        electronica       0.34      0.34      0.34       122
               funk       0.20      0.47      0.28        55
              indie       0.31      0.15      0.20       174
international/world       0.39      0.33      0.36        90
               jazz       0

We notice a lower score with the rock data resampled :(

### SVC1 on X_scaled and y - OVO

In [143]:
from sklearn.svm import SVC

In [144]:
X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.1, random_state=101)

In [145]:
svc1 = SVC(decision_function_shape='ovo')
selector_svc1 = SelectKBest()
steps_svc1 = [('feature_selection', selector_svc1), ('SVC', svc1)]
parameters_svc1 = dict(feature_selection__k=[8,12,15], 
                      SVC__C=[0.1,1, 10],
                      SVC__gamma=[1,0.1,0.01])

pipeline_svc1 = Pipeline(steps_svc1)

In [146]:
grid_svc1 = GridSearchCV(pipeline_svc1, param_grid=parameters_svc1, verbose=3)
grid_svc1.fit(X_train, y_train)

Fitting 3 folds for each of 27 candidates, totalling 81 fits
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.409974 -  22.1s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   22.1s remaining:    0.0s


[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.410104 -  20.9s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   43.0s remaining:    0.0s


[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.417152 -  22.5s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.304418 -  34.5s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.306587 -  33.9s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.311315 -  34.7s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.296622 -  39.3s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.295443 -  40.9s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.302516 -  38.9s
[CV] SVC

[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=8, score=0.408105 -  27.8s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.390917 -  40.4s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.407132 -  40.5s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.401785 -  39.8s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.356144 -  43.7s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.363422 -  44.1s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.364729 -  44.7s
[CV] SVC__C=10,

[Parallel(n_jobs=1)]: Done  81 out of  81 | elapsed: 31.0min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=10, score_func=<function f_classif at 0x10baf9bf8>)), ('SVC', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovo', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'feature_selection__k': [8, 12, 15], 'SVC__C': [0.1, 1, 10], 'SVC__gamma': [1, 0.1, 0.01]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=3)

In [147]:
print(grid_svc1.best_estimator_)
predictions_svc1 = grid_svc1.predict(X_test)
print(classification_report(y_test, predictions_svc1))

Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=15, score_func=<function f_classif at 0x10baf9bf8>)), ('SVC', SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovo', degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
                     precision    recall  f1-score   support

 blues & blues rock       0.53      0.36      0.43        91
          classical       0.73      0.86      0.79        97
            country       0.40      0.21      0.27       119
              dance       0.47      0.60      0.52       198
        electronica       0.55      0.37      0.44       125
               funk       0.47      0.34      0.40        50
              indie       0.20      0.03      0.05       161
international/world       0.81      0.29      0.43        99
               jazz       0.46      0.65      0.54       194
              latin       0.48    

In [216]:
pickle.dump(grid_svc1, open('genres_svc_ovo.pickle', 'wb'))

### SVC2 on X_scaled and y - OVR

In [148]:
svc2 = SVC(decision_function_shape='ovr')
selector_svc2 = SelectKBest()
steps_svc2 = [('feature_selection', selector_svc2), ('SVC', svc2)]
parameters_svc2 = dict(feature_selection__k=[8,12,15], 
                      SVC__C=[0.1,1, 10],
                      SVC__gamma=[1,0.1,0.01])

pipeline_svc2 = Pipeline(steps_svc2)

In [149]:
grid_svc2 = GridSearchCV(pipeline_svc2, param_grid=parameters_svc2, verbose=3)
grid_svc2.fit(X_train, y_train)

Fitting 3 folds for each of 27 candidates, totalling 81 fits
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.409974 -  21.1s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   21.1s remaining:    0.0s


[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.410104 -  21.0s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   42.1s remaining:    0.0s


[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.417152 -  21.1s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.304418 -  32.7s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.306587 -  33.5s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.311315 -  33.2s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.296622 -  38.1s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.295443 -  38.2s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.302516 -  43.5s
[CV] SVC

[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=8, score=0.408105 -  29.3s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.390917 -  39.0s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.407132 -  40.0s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.401785 -  40.5s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.356144 -  43.4s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.363422 -  43.5s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.364729 -  44.7s
[CV] SVC__C=10,

[Parallel(n_jobs=1)]: Done  81 out of  81 | elapsed: 30.6min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=10, score_func=<function f_classif at 0x10baf9bf8>)), ('SVC', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'feature_selection__k': [8, 12, 15], 'SVC__C': [0.1, 1, 10], 'SVC__gamma': [1, 0.1, 0.01]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=3)

In [150]:
print(grid_svc2.best_estimator_)
predictions_svc2 = grid_svc2.predict(X_test)
print(classification_report(y_test, predictions_svc2))

Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=15, score_func=<function f_classif at 0x10baf9bf8>)), ('SVC', SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
                     precision    recall  f1-score   support

 blues & blues rock       0.53      0.36      0.43        91
          classical       0.73      0.86      0.79        97
            country       0.40      0.21      0.27       119
              dance       0.47      0.60      0.52       198
        electronica       0.55      0.37      0.44       125
               funk       0.47      0.34      0.40        50
              indie       0.20      0.03      0.05       161
international/world       0.81      0.29      0.43        99
               jazz       0.46      0.65      0.54       194
              latin       0.48    

In [217]:
pickle.dump(grid_svc2, open('genres_svc_ovr.pickle', 'wb'))

Same score for both, is there an issue?

Let's try with our rock under-sampled the OVO/OVR method

### SVC3 on X_rock and y_rock - OVR

In [179]:
X_train, X_test, y_train, y_test = train_test_split(X_rock, y_rock, test_size=0.1, random_state=101)
svc3 = SVC(decision_function_shape='ovr')
selector_svc3 = SelectKBest()
steps_svc3 = [('feature_selection', selector_svc3), ('SVC', svc3)]
parameters_svc3 = dict(feature_selection__k=[8,12,15], 
                      SVC__C=[0.1,1, 10],
                      SVC__gamma=[1,0.1,0.01])

pipeline_svc3 = Pipeline(steps_svc3)

In [180]:
grid_svc3 = GridSearchCV(pipeline_svc3, param_grid=parameters_svc3, verbose=3)
grid_svc3.fit(X_train, y_train)

Fitting 3 folds for each of 27 candidates, totalling 81 fits
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.390875 -  16.5s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   16.5s remaining:    0.0s


[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.383910 -  15.9s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   32.4s remaining:    0.0s


[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.385590 -  13.9s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.239611 -  19.9s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.237998 -  19.9s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.247744 -  20.2s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.178146 -  23.3s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.174571 -  23.2s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.179913 -  22.9s
[CV] SVC

[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=8, score=0.369578 -  18.2s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.371694 -  23.7s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.378382 -  24.3s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.384862 -  26.0s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.342924 -  26.4s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.339395 -  26.4s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.352111 -  29.3s
[CV] SVC__C=10,

[Parallel(n_jobs=1)]: Done  81 out of  81 | elapsed: 562.3min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=10, score_func=<function f_classif at 0x10baf9bf8>)), ('SVC', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'feature_selection__k': [8, 12, 15], 'SVC__C': [0.1, 1, 10], 'SVC__gamma': [1, 0.1, 0.01]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=3)

In [181]:
print(grid_svc3.best_estimator_)
predictions_svc3 = grid_svc3.predict(X_test)
print(classification_report(y_test, predictions_svc3))

Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=15, score_func=<function f_classif at 0x10baf9bf8>)), ('SVC', SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])
                     precision    recall  f1-score   support

 blues & blues rock       0.63      0.45      0.52        87
          classical       0.81      0.82      0.82        90
            country       0.32      0.31      0.32       105
              dance       0.51      0.66      0.58       202
        electronica       0.49      0.33      0.39       122
               funk       0.52      0.25      0.34        55
              indie       0.29      0.24      0.26       174
international/world       0.77      0.27      0.40        90
               jazz       0.46      0.65      0.54       202
              latin       0.45    

Results are not as good as previously

### RFC with X_sampled and y

In [188]:
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.1, random_state=101)

rfc1 = RandomForestClassifier()
parameters_rfc1 = {'n_estimators':[5, 10, 100], 'min_samples_split':[2, 5, 10], 'max_features':['sqrt', 'log2', 'auto']}
grid_rfc1 = GridSearchCV(rfc1, parameters_rfc1)
grid_rfc1.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [5, 10, 100], 'min_samples_split': [2, 5, 10], 'max_features': ['sqrt', 'log2', 'auto']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [189]:
print(grid_rfc1.best_estimator_)
predictions_rfc1 = grid_rfc1.predict(X_test)
print(classification_report(y_test, predictions_rfc1))

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=5,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
                     precision    recall  f1-score   support

 blues & blues rock       0.57      0.38      0.46        91
          classical       0.80      0.86      0.83        97
            country       0.45      0.25      0.32       119
              dance       0.48      0.59      0.53       198
        electronica       0.53      0.32      0.40       125
               funk       0.54      0.30      0.38        50
              indie       0.25      0.07      0.11       161
international/world       0.97      0.28      0.44        99
               jazz       0.

Score a little bit above than previous classifiers

In [218]:
pickle.dump(grid_rfc1, open('genres_rfc1.pickle', 'wb'))

# Voting classifier

In [192]:
from sklearn.ensemble import VotingClassifier

In [193]:
X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.1, random_state=101)
voting_classifier = VotingClassifier([('knn', grid_knn4), ('logreg', grid_lr2), ('svc', grid_svc2),('rfc',grid_rfc1)])
voting_classifier.fit(X_train, y_train)

Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.367529 -   0.5s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.5s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.364908 -   0.5s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.0s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.366092 -   0.5s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.407870 -   0.7s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.407380 -   0.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.413434 -   0.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.426185 -   0.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.427811 -   0.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.436733 -   0.6s
[CV]

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:   54.2s finished


[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8, score=0.385472 -   1.1s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8 


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.1s remaining:    0.0s


[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8, score=0.376795 -   1.1s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8 


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    2.3s remaining:    0.0s


[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=8, score=0.394597 -   1.2s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12, score=0.402425 -   1.3s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12, score=0.402551 -   1.4s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=12, score=0.415541 -   1.3s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=15 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=15, score=0.402549 -   1.5s
[CV] LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_selection__k=15 
[CV]  LogReg__multi_class=ovr, LogReg__solver=newton-cg, feature_s

[CV]  LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=8, score=0.372537 -   0.6s
[CV] LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=12 
[CV]  LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=12, score=0.392402 -   1.3s
[CV] LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=12 
[CV]  LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=12, score=0.393016 -   0.6s
[CV] LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=12 
[CV]  LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=12, score=0.402528 -   1.6s
[CV] LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=15 
[CV]  LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=15, score=0.393887 -   1.6s
[CV] LogReg__multi_class=multinomial, LogReg__solver=sag, feature_selection__k=15 
[CV]  LogReg__multi_class=multinomial, LogReg_

[Parallel(n_jobs=1)]: Done  72 out of  72 | elapsed:  2.5min finished


Fitting 3 folds for each of 27 candidates, totalling 81 fits
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.409974 -  21.2s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   21.2s remaining:    0.0s


[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.410104 -  22.1s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   43.3s remaining:    0.0s


[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.417152 -  21.5s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.304418 -  32.6s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.306587 -  33.0s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.311315 -  33.2s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.296622 -  38.1s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.295443 -  38.1s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.302516 -  37.9s
[CV] SVC

[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=8, score=0.408105 -  27.4s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.390917 -  40.7s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.407132 -  40.7s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.401785 -  40.7s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.356144 -  45.3s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.363422 -  45.0s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.364729 -  48.3s
[CV] SVC__C=10,

[Parallel(n_jobs=1)]: Done  81 out of  81 | elapsed: 31.2min finished


VotingClassifier(estimators=[('knn', GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=10, score_func=<function f_classif at 0x10baf9bf8>)), ('kneighbors', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
        ...': ['sqrt', 'log2', 'auto']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

In [194]:
predictions_vc = voting_classifier.predict(X_test)
print(classification_report(y_test, predictions_vc))

                     precision    recall  f1-score   support

 blues & blues rock       0.41      0.38      0.40        91
          classical       0.70      0.88      0.78        97
            country       0.32      0.29      0.30       119
              dance       0.45      0.62      0.52       198
        electronica       0.47      0.36      0.41       125
               funk       0.36      0.32      0.34        50
              indie       0.24      0.07      0.11       161
international/world       0.77      0.30      0.43        99
               jazz       0.47      0.61      0.53       194
              latin       0.38      0.28      0.32       124
             oldies       0.50      0.16      0.25        67
                pop       0.59      0.28      0.38        80
                r&b       0.34      0.31      0.32       226
                rap       0.62      0.73      0.67       237
       reggae & ska       0.67      0.39      0.50        79
               rock    

  if diff:


In [196]:
import pickle
pickle.dump(voting_classifier, open('genres_voting_classifier.pickle', 'wb'))

Voting classifier 2

In [210]:
X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.1, random_state=101)
voting_classifier = VotingClassifier([('knn', grid_knn4), ('svc', grid_svc2),('rfc',grid_rfc1)])
voting_classifier.fit(X_train, y_train)

Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.367529 -   0.8s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.9s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.364908 -   0.5s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.4s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.366092 -   0.5s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.407870 -   0.5s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.407380 -   0.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.413434 -   0.5s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.426185 -   0.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.427811 -   0.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.436733 -   0.7s
[CV]

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  1.1min finished


[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.409974 -  28.6s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   28.6s remaining:    0.0s


[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.410104 -  23.3s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   51.9s remaining:    0.0s


[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.417152 -  24.3s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.304418 -  37.6s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.306587 -  38.4s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.311315 -  34.6s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.296622 -  42.8s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.295443 -  40.2s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.302516 -  43.6s
[CV] SVC

[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=8, score=0.408105 -  29.1s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.390917 -  42.6s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.407132 -  42.9s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.401785 -  42.8s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.356144 -  44.8s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.363422 -  45.3s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.364729 -  46.3s
[CV] SVC__C=10,

[Parallel(n_jobs=1)]: Done  81 out of  81 | elapsed: 33.3min finished


VotingClassifier(estimators=[('knn', GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=10, score_func=<function f_classif at 0x10baf9bf8>)), ('kneighbors', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
        ...': ['sqrt', 'log2', 'auto']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

In [211]:
predictions_vc = voting_classifier.predict(X_test)
print(classification_report(y_test, predictions_vc))

                     precision    recall  f1-score   support

 blues & blues rock       0.47      0.40      0.43        91
          classical       0.74      0.89      0.80        97
            country       0.34      0.30      0.32       119
              dance       0.44      0.62      0.52       198
        electronica       0.52      0.34      0.41       125
               funk       0.47      0.32      0.38        50
              indie       0.16      0.04      0.07       161
international/world       0.82      0.28      0.42        99
               jazz       0.46      0.64      0.53       194
              latin       0.42      0.26      0.32       124
             oldies       1.00      0.16      0.28        67
                pop       0.74      0.25      0.37        80
                r&b       0.33      0.32      0.32       226
                rap       0.61      0.73      0.66       237
       reggae & ska       0.68      0.33      0.44        79
               rock    

  if diff:


In [212]:
pickle.dump(voting_classifier, open('genres_voting_classifier2.pickle', 'wb'))

Voting classifier 3

In [213]:
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.1, random_state=101)
voting_classifier3 = VotingClassifier([('knn', grid_knn4), ('svc', grid_svc2),('rfc',grid_rfc1)])
voting_classifier3.fit(X_train, y_train)

Fitting 3 folds for each of 9 candidates, totalling 27 fits
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.503677 -   0.8s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.8s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.506837 -   0.9s
[CV] feature_selection__k=8, kneighbors__n_neighbors=3 ...............


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    1.7s remaining:    0.0s


[CV]  feature_selection__k=8, kneighbors__n_neighbors=3, score=0.499440 -   1.1s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.485691 -   1.2s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.486685 -   1.1s
[CV] feature_selection__k=8, kneighbors__n_neighbors=6 ...............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=6, score=0.485840 -   1.1s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.473621 -   1.6s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.475330 -   1.2s
[CV] feature_selection__k=8, kneighbors__n_neighbors=10 ..............
[CV]  feature_selection__k=8, kneighbors__n_neighbors=10, score=0.478240 -   1.4s
[CV]

[Parallel(n_jobs=1)]: Done  27 out of  27 | elapsed:  2.1min finished


[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.428297 -  43.6s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   43.6s remaining:    0.0s


[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.435986 -  45.0s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.5min remaining:    0.0s


[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.428000 -  43.9s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.331974 - 1.0min
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.331467 - 1.0min
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.337120 - 1.1min
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.194245 - 1.2min
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.195842 - 1.2min
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.209600 - 1.2min
[CV] SVC

[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=8, score=0.555360 -  47.5s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.685132 - 1.2min
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.677729 - 1.3min
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.679920 - 1.2min
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.703437 - 1.5min
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.694922 - 1.4min
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.698240 - 1.4min
[CV] SVC__C=10,

[Parallel(n_jobs=1)]: Done  81 out of  81 | elapsed: 63.2min finished


VotingClassifier(estimators=[('knn', GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=10, score_func=<function f_classif at 0x10baf9bf8>)), ('kneighbors', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
        ...': ['sqrt', 'log2', 'auto']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

In [214]:
predictions_vc3 = voting_classifier3.predict(X_test)
print(classification_report(y_test, predictions_vc3))

                     precision    recall  f1-score   support

 blues & blues rock       0.76      0.94      0.84       244
          classical       0.90      0.94      0.92       269
            country       0.66      0.86      0.74       251
              dance       0.61      0.71      0.65       234
        electronica       0.69      0.81      0.74       223
               funk       0.82      0.98      0.90       228
              indie       0.61      0.49      0.54       243
international/world       0.86      0.89      0.87       247
               jazz       0.69      0.71      0.70       221
              latin       0.83      0.80      0.82       259
             oldies       0.87      0.95      0.91       265
                pop       0.85      0.85      0.85       238
                r&b       0.61      0.29      0.40       265
                rap       0.73      0.66      0.69       252
       reggae & ska       0.97      0.93      0.95       242
               rock    

  if diff:


In [215]:
pickle.dump(voting_classifier3, open('genres_voting_classifier3.pickle', 'wb'))

Voting classifier 4

In [220]:
X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.1, random_state=101)
voting_classifier4 = VotingClassifier([('svc', grid_svc2),('rfc',grid_rfc1)])
voting_classifier4.fit(X_train, y_train)

Fitting 3 folds for each of 27 candidates, totalling 81 fits
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.409974 -  22.2s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   22.2s remaining:    0.0s


[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.410104 -  21.7s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=8 ................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   43.9s remaining:    0.0s


[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=8, score=0.417152 -  21.3s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.304418 -  35.5s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.306587 -  36.1s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=12 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=12, score=0.311315 -  37.0s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.296622 -  42.2s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.295443 -  40.6s
[CV] SVC__C=0.1, SVC__gamma=1, feature_selection__k=15 ...............
[CV]  SVC__C=0.1, SVC__gamma=1, feature_selection__k=15, score=0.302516 -  43.4s
[CV] SVC

[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=8, score=0.408105 -  26.6s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.390917 -  39.0s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.407132 -  39.1s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=12 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=12, score=0.401785 -  38.9s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.356144 -  43.4s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.363422 -  43.2s
[CV] SVC__C=10, SVC__gamma=1, feature_selection__k=15 ................
[CV]  SVC__C=10, SVC__gamma=1, feature_selection__k=15, score=0.364729 -  43.8s
[CV] SVC__C=10,

[Parallel(n_jobs=1)]: Done  81 out of  81 | elapsed: 32.1min finished


VotingClassifier(estimators=[('svc', GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('feature_selection', SelectKBest(k=10, score_func=<function f_classif at 0x10baf9bf8>)), ('SVC', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr'...': ['sqrt', 'log2', 'auto']},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0))],
         flatten_transform=None, n_jobs=1, voting='hard', weights=None)

In [221]:
predictions_vc4 = voting_classifier4.predict(X_test)
print(classification_report(y_test, predictions_vc4))

                     precision    recall  f1-score   support

 blues & blues rock       0.48      0.41      0.44        91
          classical       0.72      0.89      0.80        97
            country       0.35      0.29      0.31       119
              dance       0.44      0.66      0.53       198
        electronica       0.49      0.36      0.41       125
               funk       0.46      0.34      0.39        50
              indie       0.19      0.07      0.11       161
international/world       0.79      0.30      0.44        99
               jazz       0.45      0.65      0.53       194
              latin       0.43      0.28      0.34       124
             oldies       0.72      0.19      0.31        67
                pop       0.78      0.26      0.39        80
                r&b       0.34      0.38      0.36       226
                rap       0.62      0.70      0.66       237
       reggae & ska       0.76      0.33      0.46        79
               rock    

  if diff:


In [222]:
pickle.dump(voting_classifier4, open('genres_voting_classifier4.pickle', 'wb'))

In [151]:
def format_audio(audio_features):
    features = np.asarray(audio_features)
    features = features.reshape(1,-1)
    from sklearn.preprocessing import StandardScaler
    test_song_scaled = scaler.transform(features)
    return test_song_scaled

In [165]:
def predict_genre(scaled_song):
    genre = grid_svc2.predict(scaled_song)
    return genre

In [178]:
i = 8752
audio_test = full_df2['audio_features'][i]
genre_test = full_df2['genres'][i]
artist_test = full_df2['artist'][i]
album_test = full_df2['album'][i]
print(audio_test)
print(genre_test)
print(artist_test)
print(album_test)
results = format_audio(audio_test)
final_genre = predict_genre(results)
print(final_genre)

[0, 0.759112, 0.10051399999999999, 101.317, 0.080382, 0.037134, 0.022848, 1, 4, 172.55909, -7.774, 0.590452, 0.719954, 0.668, 0.522, 0.56, 1.0]
funk
Charles Bradley
No Time For Dreaming
['r&b']


In [205]:
def genres_model(audio_features):
    features = np.asarray(audio_features)
    features = features.reshape(1,-1)
    from sklearn.preprocessing import StandardScaler
    test_song_scaled = scaler.transform(features)
    genre = genres_cl.predict(test_song_scaled)
    return genre

In [206]:
pickle.dump(genres_model, open('genres_model_function.pickle', 'wb'))