# Data preparation

In [281]:
import pandas as pd
import numpy as np

Import the data

In [282]:
songs_df = pd.read_json('MasterSongList.json')
songs_df.head(3)

Unnamed: 0,_id,album,artist,audio_features,context,decades,genres,lyrics_features,moods,name,new_context,picture,recording_id,sub_context,yt_id,yt_views
0,{'$oid': '52fdfb440b9398049f3d7a8c'},Gangnam Style (강남스타일),PSY,"[11, 0.912744, 0.083704, 132.069, 0.293137, 0....",[work out],[],[pop],"[oppa, gangnam, style, gangnam, style, najeneu...","[energetic, motivational]",Gangnam Style (강남스타일),work out,http://images.musicnet.com/albums/073/463/405/...,50232.0,[working out: cardio],9bZkp7q19f0,2450112089
1,{'$oid': '52fdfb3d0b9398049f3cbc8e'},Native,OneRepublic,"[6, 0.7457039999999999, 0.11995499999999999, 1...",[energetic],[2012],[pop],"[lately, i, ve, been, i, ve, been, losing, sle...",[happy],Counting Stars,energetic,http://images.musicnet.com/albums/081/851/887/...,5839.0,[energy boost],hT_nvWreIhg,1020297206
2,{'$oid': '52fdfb420b9398049f3d3ea5'},Party Rock Anthem,LMFAO,"[5, 0.709932, 0.231455, 130.03, 0.121740999999...","[energetic, energetic, energetic, energetic]",[],[],"[party, rock, yeah, woo, let, s, go, party, ro...","[happy, celebratory, rowdy]",Party Rock Anthem,housework,http://images.musicnet.com/albums/049/414/127/...,52379.0,"[energy boost, pleasing a crowd, housework, dr...",KQ6zr6kCPj8,971128436


## Remove empty lyrics

We will keep only the lyrics and moods columns in a new dataframe

In [283]:
cols = ['lyrics_features', 'moods']
lyrics_df = songs_df[cols].copy()
lyrics_df.head()

Unnamed: 0,lyrics_features,moods
0,"[oppa, gangnam, style, gangnam, style, najeneu...","[energetic, motivational]"
1,"[lately, i, ve, been, i, ve, been, losing, sle...",[happy]
2,"[party, rock, yeah, woo, let, s, go, party, ro...","[happy, celebratory, rowdy]"
3,"[alagamun, lan, weh, wakun, heya, hanun, gon, ...","[happy, energetic, celebratory]"
4,"[j, lo, the, other, side, out, my, mine, it, s...",[energetic]


Let's convert both columns from lists to plain strings

In [284]:
lyrics_df['lyrics_features'] = lyrics_df['lyrics_features'].apply(' '.join)
lyrics_df['moods'] = lyrics_df['moods'].apply(', '.join)
lyrics_df.head()

Unnamed: 0,lyrics_features,moods
0,oppa gangnam style gangnam style najeneun ttas...,"energetic, motivational"
1,lately i ve been i ve been losing sleep dreami...,happy
2,party rock yeah woo let s go party rock is in ...,"happy, celebratory, rowdy"
3,alagamun lan weh wakun heya hanun gon alagamun...,"happy, energetic, celebratory"
4,j lo the other side out my mine it s a new gen...,energetic


We will now replace empty lyrics with NaN and then drop those songs

In [285]:
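# Assign the result back: inplace=True on a selected column may not update the dataframe in recent pandas
lyrics_df['lyrics_features'] = lyrics_df['lyrics_features'].replace('', np.nan)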
lyrics_df.shape

(36733, 2)

In [286]:
lyrics_df.dropna(subset=['lyrics_features'], inplace=True)
lyrics_df.shape

(20931, 2)

The last step is to reset the index of the dataframe

In [287]:
lyrics_df.reset_index(drop=True, inplace=True)
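
The three steps above (mark empty strings as missing, drop them, reset the index) can be checked on a toy frame. This is only a minimal sketch: the toy dataframe and its values are made up for illustration, and the result is assigned back instead of using inplace=True.

```python
import numpy as np
import pandas as pd

# Toy data (made up): two of the four songs have empty lyrics.
toy = pd.DataFrame({
    'lyrics_features': ['counting stars', '', 'wake up', ''],
    'moods': ['happy', 'sad', 'aggressive', 'mellow'],
})

# Mark empty strings as missing, drop those rows, then renumber the index.
toy['lyrics_features'] = toy['lyrics_features'].replace('', np.nan)
toy = toy.dropna(subset=['lyrics_features']).reset_index(drop=True)

print(toy)
```

Only the two songs with lyrics remain, and their index runs from 0 again.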

## Cleaning the lyrics

Before choosing any moods, let's clean up the lyrics (as we will need to use all of them for the Doc2Vec method later on)

1. Convert to lower case
2. Remove punctuation
3. Remove common words
4. Stem words

In [288]:
from string import punctuation
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from nltk.stem.snowball import SnowballStemmer

In [289]:
def clean_text(raw_text):
    # Create empty list to receive result
    clean_words = []
    
    # 1. Convert to lower case
    raw_text = raw_text.lower()
    
    # 2. Remove punctuation
    translator = str.maketrans('', '', punctuation)
    raw_text = raw_text.translate(translator)
    split_words = raw_text.split()
    
    # 3 & 4. Remove common words and stem words
    stemmer = SnowballStemmer('english')
    for word in split_words:
        if word not in ENGLISH_STOP_WORDS:
            stemmed_word = stemmer.stem(word)
            clean_words.append(stemmed_word)
            
    return ' '.join(clean_words)

Let's apply this function to all our lyrics

In [290]:
lyrics_df['clean_lyrics'] = lyrics_df['lyrics_features'].apply(clean_text)

Before:

In [291]:
print(lyrics_df['lyrics_features'][1][0:500])

lately i ve been i ve been losing sleep dreaming about the things that we could be but baby i ve been i ve been praying hard said no more counting dollars we ll be counting stars yeah we ll be counting stars i see this life like a swinging vine swing my heart across the line and my face is flashing signs seek it out and you shall find old but i m not that old young but i m not that bold i don t think the world is sold i m just doing what we re told i feel something so right doing the wrong thing


After:

In [292]:
print(lyrics_df['clean_lyrics'][1][0:500])

late ve ve lose sleep dream thing babi ve ve pray hard said count dollar ll count star yeah ll count star life like swing vine swing heart line face flash sign seek shall old m old young m bold don t think world sold m just do told feel right do wrong thing feel wrong do right thing couldn t lie couldn t lie couldn t lie kill make feel aliv late ve ve lose sleep dream thing babi ve ve pray hard said count dollar ll count star late ve ve lose sleep dream thing babi ve ve pray hard said count doll
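
The cleaned sample still contains fragments such as 've', 'll', 'm' and 't'. They come from contractions whose apostrophes were already missing in the source lyrics, and they evidently passed the stop-word filter. One possible refinement is to drop them in a second pass. The sketch below assumes a hypothetical EXTRA_STOP_WORDS set and a helper named drop_fragments; neither is part of the original notebook.

```python
# Hypothetical extra stop words: contraction fragments visible in the output above.
EXTRA_STOP_WORDS = {'ve', 'll', 'm', 't'}

def drop_fragments(clean_text, extra_stop_words=EXTRA_STOP_WORDS):
    """Remove leftover contraction fragments from an already cleaned lyric."""
    return ' '.join(w for w in clean_text.split() if w not in extra_stop_words)

print(drop_fragments('late ve ve lose sleep dream thing babi ve ve pray hard'))
```

Stems such as 'don' and 'couldn' would still remain; whether removing any of these tokens helps the classifiers would need to be measured.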


## Choosing moods

Let's look at all the moods available. We use the initial dataframe, as its moods are still stored as lists. A set lets us show all unique values

In [293]:
moods = songs_df['moods'].tolist()
moods_set = set(x for i in moods for x in i)
moods_set

{'aggressive',
 'angsty',
 'atmospheric',
 'campy',
 'celebratory',
 'classy',
 'cocky',
 'cold',
 'earthy',
 'energetic',
 'funky',
 'gloomy',
 'happy',
 'hypnotic',
 'introspective',
 'lush',
 'mellow',
 'motivational',
 'nocturnal',
 'raw',
 'rowdy',
 'sad',
 'seductive',
 'sexual',
 'soothing',
 'spacey',
 'sprightly',
 'sweet',
 'trashy',
 'trippy',
 'visceral',
 'warm'}

Before choosing which mood to use, let's have a look at their distribution

In [294]:
def number_moods(mood):
    count = len(lyrics_df[lyrics_df['moods'].str.contains(mood)])
    return count

for i in moods_set:
    mood_count = number_moods(i)
    print(i, ": ", mood_count)

funky :  2072
hypnotic :  286
rowdy :  1721
happy :  1757
atmospheric :  1155
motivational :  925
campy :  636
cold :  830
introspective :  1417
sweet :  814
sprightly :  1733
lush :  1326
soothing :  1374
mellow :  2856
seductive :  1419
trippy :  780
trashy :  477
energetic :  2305
sad :  1249
earthy :  873
nocturnal :  1334
angsty :  1205
cocky :  1438
visceral :  1112
spacey :  514
raw :  1301
aggressive :  1683
sexual :  505
warm :  1495
classy :  492
gloomy :  750
celebratory :  1479
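
Note that str.contains performs substring matching on the joined mood string. That appears safe here because no mood name is contained in another, but it would overcount if one were. A sketch of an alternative that counts exact labels from the list-valued column is shown below; the helper count_moods and the sample mood_lists are made up for illustration.

```python
from collections import Counter

def count_moods(mood_lists):
    """Count how many songs carry each mood, matching labels exactly."""
    counts = Counter()
    for moods in mood_lists:
        counts.update(set(moods))  # set(): count a mood at most once per song
    return counts

# Made-up sample in the same shape as songs_df['moods'].
mood_lists = [
    ['energetic', 'motivational'],
    ['happy'],
    ['happy', 'celebratory', 'rowdy'],
    [],
]
mood_counts = count_moods(mood_lists)
for mood, n in mood_counts.most_common():
    print(mood, ': ', n)
```

In the notebook this would be applied to the list-valued moods of the songs that still have lyrics.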


If we decide to choose 'aggressive' and 'mellow', are there songs that fit both moods?

In [295]:
print(lyrics_df[lyrics_df['moods'].str.contains('aggressive') & lyrics_df['moods'].str.contains('mellow')])

Empty DataFrame
Columns: [lyrics_features, moods, clean_lyrics]
Index: []


Great, no song carries both moods, so we will use these two moods for the Bag of Words and TF-IDF experiments
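
The same overlap check can be run for every pair of moods at once, which helps when looking for other non-overlapping pairs. This is a minimal sketch using only the standard library; the helper cooccurrence_counts and the sample data are assumptions, not part of the notebook.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(mood_lists):
    """Count, for each pair of moods, how many songs carry both."""
    pairs = Counter()
    for moods in mood_lists:
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(set(moods)), 2):
            pairs[(a, b)] += 1
    return pairs

# Made-up sample in the same shape as songs_df['moods'].
sample = [
    ['happy', 'energetic'],
    ['mellow'],
    ['aggressive', 'energetic'],
    ['happy', 'energetic', 'celebratory'],
]
pairs = cooccurrence_counts(sample)
print(pairs[('energetic', 'happy')])     # songs tagged with both moods
print(pairs[('aggressive', 'mellow')])   # 0 means the two moods never overlap
```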

In [296]:
mellow_aggressive_df = lyrics_df.copy()

aggressive_condition = mellow_aggressive_df['moods'].str.contains('aggressive')
mellow_condition = mellow_aggressive_df['moods'].str.contains('mellow')

In [297]:
mellow_aggressive_df = mellow_aggressive_df[aggressive_condition | mellow_condition]

To make it easier, let's make sure the mood column contains either mellow or aggressive, but no other mood

In [298]:
mellow_aggressive_df.loc[aggressive_condition, 'moods'] = 'aggressive'
mellow_aggressive_df.loc[mellow_condition, 'moods'] = 'mellow'

In [299]:
mellow_aggressive_df.head()

Unnamed: 0,lyrics_features,moods,clean_lyrics
37,1 and if you ll see andmoreagain then you will...,mellow,1 ll andmoreagain know andmoreagain eye feel h...
46,wake up wake up grab a brush and put a little ...,aggressive,wake wake grab brush littl makeup hide scar fa...
74,kiss me hard before you go summertime sadness ...,mellow,kiss hard summertim sad just want know babi be...
76,well you done done me in you bet i felt it i t...,mellow,bet felt tri chill hot melt fell right crack m...
121,now greetings to the world vice ala one big go...,aggressive,greet world vice ala big gong zilla longsid sk...


In [300]:
mellow_aggressive_df['moods'].value_counts()

mellow        2856
aggressive    1683
Name: moods, dtype: int64

Since there are more mellow songs than aggressive ones, let's balance the dataframe so that the class imbalance does not influence our results. We will randomly select 1,683 mellow songs (to match the number of aggressive ones)

In [301]:
aggressive_songs = mellow_aggressive_df[mellow_aggressive_df['moods'] == 'aggressive']
mellow_songs = mellow_aggressive_df[mellow_aggressive_df['moods'] == 'mellow']

In [302]:
mellow_sample = mellow_songs.sample(n=len(aggressive_songs), random_state=101)
mellow_sample.shape

(1683, 3)

We can now concatenate back into one dataframe

In [303]:
mellow_aggressive_sample = pd.concat([aggressive_songs, mellow_sample])
mellow_aggressive_sample.sort_index(axis='index', inplace=True)
mellow_aggressive_sample.head()

Unnamed: 0,lyrics_features,moods,clean_lyrics
37,1 and if you ll see andmoreagain then you will...,mellow,1 ll andmoreagain know andmoreagain eye feel h...
46,wake up wake up grab a brush and put a little ...,aggressive,wake wake grab brush littl makeup hide scar fa...
76,well you done done me in you bet i felt it i t...,mellow,bet felt tri chill hot melt fell right crack m...
121,now greetings to the world vice ala one big go...,aggressive,greet world vice ala big gong zilla longsid sk...
175,the secret side of me i never let you see i ke...,aggressive,secret let cage t control stay away beast ugli...


In [304]:
mellow_aggressive_sample['moods'].value_counts()

aggressive    1683
mellow        1683
Name: moods, dtype: int64
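
The balancing steps above can be wrapped in a small helper that downsamples every class to the size of the smallest one. This is a sketch under assumptions: the name balance_classes is made up, it relies on groupby(...).sample (available in pandas 1.1 and later), and it will not necessarily draw the same rows as the cells above.

```python
import pandas as pd

def balance_classes(df, label_col, random_state=101):
    """Downsample every class to the size of the smallest class."""
    n_smallest = int(df[label_col].value_counts().min())
    balanced = df.groupby(label_col).sample(n=n_smallest, random_state=random_state)
    return balanced.sort_index()

# Toy frame (made up): 5 mellow songs and 3 aggressive ones.
toy = pd.DataFrame({
    'moods': ['mellow'] * 5 + ['aggressive'] * 3,
    'clean_lyrics': list('abcdefgh'),
})
balanced_toy = balance_classes(toy, 'moods')
print(balanced_toy['moods'].value_counts())
```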

# Bag Of Words

## BOW model

Let's apply the Bag of Words model to our data

In [305]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

In [306]:
bag_of_words = count_vect.fit_transform(mellow_aggressive_sample['clean_lyrics'])
print(bag_of_words[0])

  (0, 19678)	1
  (0, 9504)	1
  (0, 12305)	3
  (0, 10388)	1
  (0, 5128)	1
  (0, 2258)	1
  (0, 13568)	1
  (0, 2839)	1
  (0, 3696)	1
  (0, 10356)	1
  (0, 10839)	2
  (0, 17841)	2
  (0, 978)	1
  (0, 19813)	1
  (0, 15555)	1
  (0, 1319)	1
  (0, 18424)	1
  (0, 7392)	1
  (0, 18954)	1
  (0, 13973)	9
  (0, 17893)	3
  (0, 1550)	3
  (0, 8112)	3
  (0, 6393)	3
  (0, 6193)	1
  (0, 9719)	2
  (0, 760)	4
  (0, 10248)	2
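
Each printed line reads (row, column) count: the row is the song, the column is the index of a word in the learned vocabulary, and the value is how many times that word occurs in the song. The sketch below reproduces the idea with the standard library only. The helper bag_of_words and the two toy documents are made up, and unlike CountVectorizer it does no tokenization beyond splitting on whitespace.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (word -> column index, list of {column index: count} per document)."""
    # Vocabulary indices follow alphabetical order of the words.
    vocabulary = sorted({word for doc in docs for word in doc.split()})
    index = {word: i for i, word in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append({index[word]: n for word, n in counts.items()})
    return index, rows

index, rows = bag_of_words(['count star count', 'lose sleep'])
print(index)
print(rows)
```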


## BOW Logistic Regression

We will look at several classifiers (Logistic Regression, SVC and Random Forests) to try to get the best score

In [307]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import RFE
from sklearn.feature_selection import SelectKBest

from sklearn.metrics import accuracy_score

In [308]:
X_bow = bag_of_words
y = mellow_aggressive_sample['moods']

X_train, X_test, y_train, y_test = train_test_split(X_bow, y, test_size=0.1, random_state=42)

In [309]:
lr_bow = LogisticRegression(max_iter=5000)
lr_bow.fit(X_train, y_train)
lr_predictions = lr_bow.predict(X_test)
print(accuracy_score(y_test, lr_predictions))

0.7151335311572701


### BOW LR - GridSearchCV

Let's try to optimize the parameters with GridSearchCV

In [310]:
param_grid = {'solver': ['newton-cg', 'sag', 'saga', 'lbfgs'], 'multi_class':['ovr', 'multinomial']}
grid = GridSearchCV(lr_bow, param_grid, verbose=1)
grid.fit(X_train, y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:  1.8min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=5000, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'solver': ['newton-cg', 'sag', 'saga', 'lbfgs'], 'multi_class': ['ovr', 'multinomial']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [311]:
grid.best_params_

{'multi_class': 'ovr', 'solver': 'sag'}

In [312]:
grid.best_estimator_

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=5000, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='sag', tol=0.0001,
          verbose=0, warm_start=False)

In [313]:
lr_predictions2 = grid.best_estimator_.predict(X_test)
print(accuracy_score(y_test, lr_predictions2))

0.7062314540059347
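
With 337 test songs, 0.7151 corresponds to 241 correct predictions and 0.7062 to 238, so the two models differ by only three songs. A rough way to judge such gaps is a normal-approximation interval around the accuracy. The helper below is an illustrative sketch (the name accuracy_interval and the 95% level are assumptions), not a substitute for cross-validation.

```python
import math

def accuracy_interval(correct, total, z=1.96):
    """Normal-approximation interval for an accuracy measured on `total` samples."""
    p = correct / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p - half_width, p + half_width

# 241 correct out of 337 is the first logistic regression score above.
low, high = accuracy_interval(241, 337)
print(round(low, 3), round(high, 3))
print(low < 238 / 337 < high)  # is the tuned model's score inside the interval?
```

Differences of one or two percentage points on this test set are therefore hard to distinguish from noise.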


### BOW LR - SelectFromModel

Let's now reduce the number of features. Given how many there are (20,445), we will start with SelectFromModel, which keeps only the features whose importance reaches a threshold and therefore chooses the number of features for us

In [314]:
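# Assumption: lr_bow2 is not defined in the cells shown; here it is taken to be the tuned model from the grid search
lr_bow2 = grid.best_estimator_
lr_bow3 = SelectFromModel(lr_bow2, prefit=True)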
X_train3 = lr_bow3.transform(X_train)

In [315]:
print('X_train shape: ', X_train.shape)
print('X_train3 shape: ', X_train3.shape)

X_train shape:  (3029, 20445)
X_train3 shape:  (3029, 5737)


We can now try with 5,737 features instead of 20,445. This should help reduce the computation time

In [316]:
X_test3 = lr_bow3.transform(X_test)
X_test3.shape

(337, 5737)

In [317]:
lr_bow4 = LogisticRegression(solver='sag', multi_class='ovr', max_iter=5000)
lr_bow4.fit(X_train3, y_train)
lr_predictions4 = lr_bow4.predict(X_test3)
print(accuracy_score(y_test, lr_predictions4))

0.712166172106825


The score is now similar to what we had originally with logistic regression and BOW; however, the computation time was reduced significantly.

### BOW LR - RFE

Let's try to optimize the features using the RFE method instead of SelectFromModel

In [318]:
#lr_bow5 = LogisticRegression(solver='sag', multi_class='ovr', max_iter=3000)
#rfe_model = RFE(lr_bow5, 5736)
#rfe_model = rfe_model.fit(X_train, y_train)
#lr_predictions5 = rfe_model.predict(X_test)
#print(accuracy_score(y_test, lr_predictions5))

This step is skipped for now because the processing time is too long

Let's now work with another classifier: Support Vector Machine

## BOW Support Vector Machine

In [319]:
svc_bow = SVC(C=1, gamma=1)
svc_bow.fit(X_train, y_train)
svc_predictions = svc_bow.predict(X_test)
print(accuracy_score(y_test, svc_predictions))

0.44807121661721067


The score of our first SVC trial is quite low (below the roughly 50% expected from guessing on balanced classes), so we need to adjust the parameters
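
A plausible reason for the low score is the scale of the raw counts. SVC uses an RBF kernel by default, exp(-gamma * squared distance), and with gamma=1 the kernel between two different count vectors is close to zero, so every song looks unlike every other. The sketch below illustrates this with two made-up count vectors; it is an illustration of the kernel formula, not a diagnosis of the actual model.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Made-up word counts for two songs over a 4-word vocabulary.
song_a = [3, 0, 1, 9]
song_b = [0, 2, 1, 4]
for gamma in (1, 0.001):
    print(gamma, rbf_kernel(song_a, song_b, gamma))
```

With gamma=1 the similarity is practically zero, while with gamma=0.001 it stays close to one, which suggests that much smaller gamma values are worth trying on count features.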

### BOW SVC - GridSearchCV

In [320]:
param_grid = {'C': [0.1,1, 10], 'gamma': [1,0.1,0.01,0.001]}
grid = GridSearchCV(svc_bow, param_grid, verbose=1)
grid.fit(X_train, y_train)

Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  3.1min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01, 0.001]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [321]:
grid.best_params_

{'C': 1, 'gamma': 0.001}

In [322]:
grid.best_estimator_

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [323]:
svc_predictions2 = grid.best_estimator_.predict(X_test)
print(accuracy_score(y_test, svc_predictions2))

0.7151335311572701


The score improved significantly, from 0.448 to 0.715. Let's now reduce the features with SelectKBest, using about the same number as SelectFromModel kept (5,736 here versus 5,737 there)

### BOW SVC - SelectKBest

In [324]:
selector = SelectKBest(k=5736)
X_bow_new = selector.fit_transform(X_bow, y)
X_bow_new.shape

(3366, 5736)

In [325]:
X_train_svc, X_test_svc, y_train_svc, y_test_svc = train_test_split(X_bow_new, y, test_size=0.1, random_state=42)

In [326]:
svc_bow3 = SVC(C=1, gamma=0.001)
svc_bow3.fit(X_train_svc, y_train_svc)
svc_predictions3 = svc_bow3.predict(X_test_svc)
print(accuracy_score(y_test_svc, svc_predictions3))

0.7418397626112759


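The score is now even higher: this is the best result so far. Keep in mind that SelectKBest was fitted on the full dataset before the split, so the test songs influenced which features were kept and the score may be slightly optimistic. Let's see if we can get a better model with RandomForests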
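
One way to keep feature selection inside the training data is to chain the vectorizer, the selector and the classifier in a scikit-learn Pipeline, so that each step is fitted on the training rows only. The sketch below uses a tiny made-up corpus and k=5 purely for illustration; the SVC parameters are the ones found by the grid search above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up training and test lyrics (already cleaned).
train_docs = [
    'slow river dream night', 'soft dream quiet night', 'quiet river soft sleep',
    'fight scream burn rage', 'burn rage break wall', 'scream break fight loud',
]
train_labels = ['mellow'] * 3 + ['aggressive'] * 3
test_docs = ['soft quiet dream', 'rage fight burn']

# Vocabulary, feature scores and classifier are all learned from train_docs only.
pipe = make_pipeline(CountVectorizer(), SelectKBest(k=5), SVC(C=1, gamma=0.001))
pipe.fit(train_docs, train_labels)
preds = pipe.predict(test_docs)
print(preds)
```

The same pipeline object can be passed to GridSearchCV, in which case the selection is refitted inside every cross-validation fold.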

## BOW RandomForests

In [327]:
rfc_bow = RandomForestClassifier(n_estimators=5, min_samples_split=2, max_features='log2')
rfc_bow.fit(X_train, y_train)
rfc_predictions = rfc_bow.predict(X_test)
print(accuracy_score(y_test, rfc_predictions))

0.6201780415430267


The score is quite low (but still higher than the first SVC trial). Let's work on parameter optimization

### BOW RFC - GridSearchCV

In [328]:
param_grid = {'n_estimators': [5, 10, 100], 'min_samples_split': [2, 3, 4, 5, 10], 'max_features': ['sqrt', 'log2', 'auto']}
grid = GridSearchCV(rfc_bow, param_grid, verbose=1)
grid.fit(X_train, y_train)

Fitting 3 folds for each of 45 candidates, totalling 135 fits


[Parallel(n_jobs=1)]: Done 135 out of 135 | elapsed:  2.7min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='log2', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [5, 10, 100], 'min_samples_split': [2, 3, 4, 5, 10], 'max_features': ['sqrt', 'log2', 'auto']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [329]:
grid.best_params_

{'max_features': 'sqrt', 'min_samples_split': 4, 'n_estimators': 100}

In [330]:
grid.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=4,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [331]:
rfc_predictions2 = grid.best_estimator_.predict(X_test)
print(accuracy_score(y_test, rfc_predictions2))

0.7477744807121661


The score increased significantly. Let's check whether it can be improved further by optimizing the number of features (with SelectFromModel)

### BOW RFC - SelectFromModel

We will run SelectFromModel with threshold=mean (default)

In [332]:
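# Assumption: rfc_bow2 is not defined in the cells shown; here it is taken to be the tuned forest from the grid search
rfc_bow2 = grid.best_estimator_
rfc_bow3 = SelectFromModel(rfc_bow2, threshold='mean', prefit=True)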
X_train_rfc = rfc_bow3.transform(X_train)
print('X_train shape: ', X_train.shape)
print('X_train_rfc shape: ', X_train_rfc.shape)

X_train shape:  (3029, 20445)
X_train_rfc shape:  (3029, 2772)


In [333]:
X_test_rfc = rfc_bow3.transform(X_test)
X_test_rfc.shape

(337, 2772)

In [334]:
rfc_bow4 = RandomForestClassifier(n_estimators=100, min_samples_split=5, max_features='sqrt')
rfc_bow4.fit(X_train_rfc, y_train)
rfc_predictions4 = rfc_bow4.predict(X_test_rfc)
print(accuracy_score(y_test, rfc_predictions4))

0.7329376854599406


The score is very similar to the previous step (it can come out slightly higher or lower from run to run, since the forest is not seeded), while the number of features was divided by about 7, which saves computation time.
Let's quickly check what the results would be with the median as a threshold

In [335]:
rfc_bow5 = SelectFromModel(rfc_bow2, threshold='median',prefit=True)
X_train_rfc2 = rfc_bow5.transform(X_train)
print('X_train shape: ', X_train.shape)
print('X_train_rfc shape: ', X_train_rfc2.shape)

X_train shape:  (3029, 20445)
X_train_rfc shape:  (3029, 20445)


In [336]:
X_test_rfc2 = rfc_bow5.transform(X_test)
X_test_rfc2.shape

(337, 20445)

In [337]:
rfc_bow6 = RandomForestClassifier(n_estimators=100, min_samples_split=5, max_features='sqrt')
rfc_bow6.fit(X_train_rfc2, y_train)
rfc_predictions6 = rfc_bow6.predict(X_test_rfc2)
print(accuracy_score(y_test, rfc_predictions6))

0.7655786350148368


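The score is again in the same range as the previous steps. However, all the features were selected, so the median is not a useful threshold here. A likely explanation is that more than half of the feature importances are zero, which makes the median zero, and SelectFromModel keeps every feature whose importance is at least the threshold.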
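
The difference between the two thresholds can be reproduced with a handful of made-up importances. The sketch below applies the "keep if importance >= threshold" rule by hand; select_features is a hypothetical helper and the importance values are invented for illustration.

```python
from statistics import mean, median

def select_features(importances, threshold):
    """Indices of features whose importance is at least the threshold."""
    return [i for i, value in enumerate(importances) if value >= threshold]

# Made-up importances: most features are never used by the forest.
importances = [0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.2, 0.3, 0.4]
kept_mean = select_features(importances, mean(importances))
kept_median = select_features(importances, median(importances))
print(len(kept_mean), len(kept_median))
```

With the mean as threshold only the strongest features survive, whereas a median of zero lets every feature through.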

Even after all these steps the score remains modest. This might be due to the initial choice of moods: aggressive vs. mellow.
Let's now compare with a different technique, TF-IDF, and check whether we can get a higher score

# TF-IDF

## TF-IDF model

Let's apply the TF-IDF model to our data

In [338]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vect = TfidfVectorizer()

In [339]:
tf_idf = tf_idf_vect.fit_transform(mellow_aggressive_sample['clean_lyrics'])
print(tf_idf[0])

  (0, 10248)	0.04513957094995233
  (0, 760)	0.36761268311079304
  (0, 9719)	0.04085118350902226
  (0, 6193)	0.027647459075229572
  (0, 6393)	0.07667902713193415
  (0, 8112)	0.09098316220926433
  (0, 1550)	0.1252334343911038
  (0, 17893)	0.2757095123330948
  (0, 13973)	0.8271285369992843
  (0, 18954)	0.024608861033863328
  (0, 7392)	0.05771471402474821
  (0, 18424)	0.03260804010014442
  (0, 1319)	0.04289938627149122
  (0, 15555)	0.05039650326626615
  (0, 19813)	0.05870664208799416
  (0, 978)	0.07331512614662358
  (0, 17841)	0.05895434301780704
  (0, 10839)	0.15648686359079803
  (0, 10356)	0.03716026178393961
  (0, 3696)	0.05680554681381161
  (0, 2839)	0.028721337725565534
  (0, 13568)	0.07236638107029339
  (0, 2258)	0.05923863491043399
  (0, 5128)	0.021673086750183135
  (0, 10388)	0.024075784291750196
  (0, 12305)	0.08192076723413363
  (0, 9504)	0.02101512397139747
  (0, 19678)	0.04296047152898321
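
The weights above replace raw counts with count times inverse document frequency, followed by a normalization of each row. A minimal standard-library sketch of that computation is shown below, assuming the default settings of TfidfVectorizer (smooth_idf=True, norm='l2', raw term counts); the helper tfidf and the two toy documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (idf per word, list of {word: weight} per document)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    idf = {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in df.items()}
    rows = []
    for tokens in tokenized:
        weights = {word: n * idf[word] for word, n in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({word: w / norm for word, w in weights.items()})  # unit length
    return idf, rows

idf, rows = tfidf(['count star count', 'lose sleep star'])
print(idf)
print(rows[0])
```

A word present in every document ('star' here) gets the minimum idf of 1, so rarer words dominate each row.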


Instead of going through all the classifiers, we will only do RandomForests and Support Vector Machine, as they were performing better. We will start with Random Forests to get the optimized number of features with SelectFromModel. We will later use this number of features with SelectKBest on SVM.

## TF-IDF RandomForests

In [340]:
X_tfidf = tf_idf
y = mellow_aggressive_sample['moods']

X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.1, random_state=42)

In [341]:
rfc_tfidf = RandomForestClassifier(n_estimators=5, min_samples_split=2, max_features='log2')
rfc_tfidf.fit(X_train, y_train)
rfc_predictions = rfc_tfidf.predict(X_test)
print(accuracy_score(y_test, rfc_predictions))

0.6706231454005934


The score is low, so the parameters need to be adjusted

### TF-IDF RF - GridSearchCV

In [342]:
param_grid = {'n_estimators': [5, 10, 100], 'min_samples_split': [2, 3, 4, 5, 10], 'max_features': ['sqrt', 'log2', 'auto']}
grid = GridSearchCV(rfc_tfidf, param_grid, verbose=1)
grid.fit(X_train, y_train)

Fitting 3 folds for each of 45 candidates, totalling 135 fits


[Parallel(n_jobs=1)]: Done 135 out of 135 | elapsed:  2.6min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='log2', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [5, 10, 100], 'min_samples_split': [2, 3, 4, 5, 10], 'max_features': ['sqrt', 'log2', 'auto']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [343]:
grid.best_params_

{'max_features': 'sqrt', 'min_samples_split': 3, 'n_estimators': 100}

In [344]:
grid.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=3,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [345]:
rfc_predictions2 = grid.best_estimator_.predict(X_test)
print(accuracy_score(y_test, rfc_predictions2))

0.744807121661721


The score has increased. Let's now adjust the number of features

### TF-IDF RFC - SelectFromModel

In [346]:
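# Assumption: rfc_tfidf2 is not defined in the cells shown; here it is taken to be the tuned forest from the grid search
rfc_tfidf2 = grid.best_estimator_
rfc_tfidf3 = SelectFromModel(rfc_tfidf2, threshold='mean', prefit=True)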
X_train_rfc = rfc_tfidf3.transform(X_train)
print('X_train shape: ', X_train.shape)
print('X_train_rfc shape: ', X_train_rfc.shape)

X_train shape:  (3029, 20445)
X_train_rfc shape:  (3029, 2498)


We can see the number of features was reduced to 2,498 in this case. We will use this number later on with SVC

In [347]:
X_test_rfc = rfc_tfidf3.transform(X_test)
X_test_rfc.shape

(337, 2498)

In [348]:
rfc_tfidf4 = RandomForestClassifier(n_estimators=100, min_samples_split=10, max_features='sqrt')
rfc_tfidf4.fit(X_train_rfc, y_train)
rfc_predictions4 = rfc_tfidf4.predict(X_test_rfc)
print(accuracy_score(y_test, rfc_predictions4))

0.7388724035608308


In this case, we have significantly reduced the number of features, but the score has decreased a little. This is a tradeoff that can be made to save computation time.

Let's now analyse the TF-IDF set with SVC

## TF-IDF Support Vector Machine

In [349]:
svc_tfidf = SVC(C=1, gamma=1)
svc_tfidf.fit(X_train, y_train)
svc_predictions = svc_tfidf.predict(X_test)
print(accuracy_score(y_test, svc_predictions))

0.7685459940652819


We notice the initial score is already quite high (actually the highest one so far). Let's see if we can improve it even more by optimizing parameters
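
One possible reason the same SVC settings behave so differently on TF-IDF is row normalization. Assuming the default norm='l2' of TfidfVectorizer, every row has unit length, so the squared distance between two rows with non-negative entries lies between 0 and 2 and the RBF kernel with gamma=1 never drops below exp(-2). The vectors below are made up for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)

def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

# Made-up non-negative weights for two songs.
a = l2_normalize([3, 0, 1, 9])
b = l2_normalize([0, 2, 1, 4])
d2 = squared_distance(a, b)
print(d2, math.exp(-1 * d2))  # squared distance and RBF kernel value for gamma = 1
```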

### TF-IDF SVC - GridSearchCV

In [350]:
param_grid = {'C': [0.1,1, 10], 'gamma': [1,0.1,0.01,0.001]}
grid = GridSearchCV(svc_tfidf, param_grid, verbose=1)
grid.fit(X_train, y_train)

Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  3.0min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01, 0.001]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [351]:
grid.best_params_

{'C': 1, 'gamma': 1}

In [352]:
grid.best_estimator_

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [353]:
svc_predictions2 = grid.best_estimator_.predict(X_test)
print(accuracy_score(y_test, svc_predictions2))

0.7685459940652819


It seems the parameters were already optimal. Let's see if we can get a similar score with fewer features, which would reduce the computation time

### TF-IDF SVC - SelectKBest

In [354]:
selector = SelectKBest(k=6000)
X_tfidf_new = selector.fit_transform(X_tfidf, y)
X_tfidf_new.shape

(3366, 6000)

In [355]:
X_train_svc, X_test_svc, y_train_svc, y_test_svc = train_test_split(X_tfidf_new, y, test_size=0.1, random_state=42)

In [356]:
svc_tfidf3 = SVC(C=1, gamma=1)
svc_tfidf3.fit(X_train_svc, y_train_svc)
svc_predictions3 = svc_tfidf3.predict(X_test_svc)
print(accuracy_score(y_test_svc, svc_predictions3))

0.8071216617210683


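Note that we did not use the 2,498 features found with SelectFromModel. Instead, we adjusted the number of features manually and obtained an even better score with 6,000 features: above 80%. This is the best score so far, although the selector was fitted before the split and k was chosen by looking at the test score, so the figure is probably optimistic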
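
Rather than adjusting k by hand, the number of features can be searched with cross-validation on the training data. The sketch below is a minimal illustration on a made-up corpus: the candidate values for k, the cv=2 setting and the documents are all assumptions chosen to keep the example tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Made-up cleaned lyrics, four per mood.
docs = [
    'soft quiet dream night', 'slow river dream sleep',
    'quiet river soft moon', 'slow night moon sleep',
    'fight scream burn rage', 'burn rage break wall',
    'scream break fight loud', 'loud wall smash rage',
]
labels = ['mellow'] * 4 + ['aggressive'] * 4

# make_pipeline names each step after its class, hence 'selectkbest__k'.
pipe = make_pipeline(TfidfVectorizer(), SelectKBest(), SVC(C=1, gamma=1))
search = GridSearchCV(pipe, {'selectkbest__k': [2, 4]}, cv=2)
search.fit(docs, labels)
print(search.best_params_, search.best_score_)
```

On the real data the grid would hold values such as 2,000, 4,000 and 6,000, and the held-out test set would be scored once at the end.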

Let's try one last thing: change the n-gram range on the best model: TF-IDF with SVC

## TF-IDF SVC with ngram range

In [357]:
tf_idf_vect_bigram = TfidfVectorizer(ngram_range=(1, 2))
tf_idf_bigram = tf_idf_vect_bigram.fit_transform(mellow_aggressive_sample['clean_lyrics'])
print(tf_idf_bigram[0])

  (0, 106202)	0.03346675971056171
  (0, 4631)	0.2725503382800047
  (0, 95445)	0.030287322489270428
  (0, 56145)	0.020497998762641627
  (0, 59678)	0.05685030942605343
  (0, 80391)	0.06745548447371151
  (0, 10922)	0.09284885009523762
  (0, 182483)	0.2044127537100035
  (0, 140355)	0.6132382611300106
  (0, 192769)	0.0182451632046755
  (0, 70182)	0.04279004929337004
  (0, 189370)	0.024175804503632447
  (0, 9349)	0.031805872804382136
  (0, 155994)	0.037364282149129485
  (0, 204622)	0.0435254709521098
  (0, 6170)	0.054356292234566825
  (0, 180829)	0.043709117967880545
  (0, 114331)	0.11602033762038265
  (0, 108829)	0.02755085007971237
  (0, 33392)	0.042115987047211964
  (0, 25061)	0.021294179098270936
  (0, 137299)	0.05365288671191963
  (0, 19726)	0.043919893751920576
  (0, 47113)	0.016068561822590914
  (0, 109392)	0.01784993678005198
  :	:
  (0, 192927)	0.05512678580860656
  (0, 70228)	0.06813758457000117
  (0, 189386)	0.060730231836949554
  (0, 9446)	0.06813758457000117
  (0, 156027)	0.0681

In [358]:
print('TF-IDF initial shape: ', tf_idf.shape)
print('TF-IDF with ngram 2 shape: ', tf_idf_bigram.shape)

TF-IDF initial shape:  (3366, 20445)
TF-IDF with ngram 2 shape:  (3366, 208924)


We chose ngram_range = (1,2), meaning bigrams are included alongside single words. This is why the number of columns increased so much. Let's now try SVC on these features
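
To make the effect of ngram_range = (1,2) concrete, here is a small standard-library sketch that lists the unigrams and bigrams of one cleaned lyric. The helper word_ngrams is made up for illustration and only splits on whitespace.

```python
def word_ngrams(text, min_n=1, max_n=2):
    """List all word n-grams of text for n between min_n and max_n."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

print(word_ngrams('late lose sleep dream'))
```

Four words yield four unigrams plus three bigrams, and since most bigrams are rare the vocabulary grows much faster than the number of words.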

In [359]:
X_tfidf_bigram = tf_idf_bigram
y = mellow_aggressive_sample['moods']

X_train, X_test, y_train, y_test = train_test_split(X_tfidf_bigram, y, test_size=0.1, random_state=42)

In [360]:
svc_tfidf_bigram = SVC(C=1, gamma=1)
svc_tfidf_bigram.fit(X_train, y_train)
svc_predictions_bigram = svc_tfidf_bigram.predict(X_test)
print(accuracy_score(y_test, svc_predictions_bigram))

0.7537091988130564


This is quite a good initial score. Let's try to get it even higher with parameter optimization (GridSearchCV). Considering the number of features, this should be quite time consuming

In [361]:
param_grid = {'C': [0.1,1, 10], 'gamma': [1,0.1,0.01,0.001]}
grid = GridSearchCV(svc_tfidf_bigram, param_grid, verbose=1)
grid.fit(X_train, y_train)

Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  5.2min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01, 0.001]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [362]:
grid.best_params_

{'C': 1, 'gamma': 1}

As predicted this step was quite time consuming (>5min) and we notice the parameters were already optimized. To reduce the calculation time, it might be interesting to reduce the number of features. Let's work on this with SelectKBest

In [363]:
selector_bigram = SelectKBest(k=6000)
X_tfidf_bigram_new = selector_bigram.fit_transform(X_tfidf_bigram, y)
X_tfidf_bigram_new.shape

(3366, 6000)

In [364]:
X_train_bigram, X_test_bigram, y_train_bigram, y_test_bigram = train_test_split(X_tfidf_bigram_new, y, test_size=0.1, random_state=42)

In [365]:
svc_tfidf_bigram2 = SVC(C=1, gamma=1)
svc_tfidf_bigram2.fit(X_train_bigram, y_train_bigram)
svc_predictions_bigram2 = svc_tfidf_bigram2.predict(X_test_bigram)
print(accuracy_score(y_test_bigram, svc_predictions_bigram2))

0.8071216617210683


Reducing to 6,000 features raises the score from 0.75 to 0.81, which is the same score as the one obtained previously using only 1-grams.
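In the cells above, the selector is fitted on the full matrix before the split, so the held-out rows also influence which features are kept. A possible alternative, sketched here on synthetic data, is to chain SelectKBest and SVC in a Pipeline so that the selection is fitted on the training rows only. Everything in this sketch is an assumption for illustration: `make_classification` stands in for the TF-IDF matrix, and `k=20` and `gamma=0.01` are arbitrary values.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF features (assumption, not the song data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=200, n_informative=10, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42
)

# Feature selection and classifier are fitted together on the training rows only
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=20)),
    ("svc", SVC(C=1, gamma=0.01)),
])
pipe.fit(X_tr, y_tr)

n_kept = int(pipe.named_steps["select"].get_support().sum())
test_accuracy = pipe.score(X_te, y_te)
print("features kept:", n_kept)
print("test accuracy:", test_accuracy)
```

The same pipeline object could also be passed to GridSearchCV, in which case the selection would be refitted inside each fold.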

# Doc2Vec

In [366]:
from gensim.models import Doc2Vec

In [367]:
%run Doc2VecHelperFunctions.ipynb

In [368]:
convert_lyrics_to_d2v(lyrics_df['clean_lyrics'])

In [369]:
model = Doc2Vec.load('./song_lyrics.d2v')

Let's see what our model looks like: which words are most related to 'good'?

In [370]:
model.wv.most_similar('good')

[('bad', 0.5932339429855347),
 ('real', 0.5345109105110168),
 ('fine', 0.5313093066215515),
 ('right', 0.4795217216014862),
 ('better', 0.46048539876937866),
 ('damn', 0.45751577615737915),
 ('nice', 0.4475826621055603),
 ('pretti', 0.4366227388381958),
 ('wish', 0.43530163168907166),
 ('best', 0.43098145723342896)]

The model seems to be well trained, as most of these words are related to 'good' (even its antonym 'bad', which tends to appear in similar contexts).
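The ranking returned by most_similar is based on the cosine similarity between word vectors. A minimal sketch with made-up 3-dimensional vectors shows how such a ranking is obtained. The `toy_vectors` values and the helper names are invented for illustration; the real vectors in this notebook have 100 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy vectors (assumption, not taken from the trained model)
toy_vectors = {
    "good": [1.0, 0.5, 0.0],
    "fine": [0.9, 0.6, 0.1],
    "bad": [0.8, 0.2, 0.5],
    "house": [0.0, 0.1, 1.0],
}


def toy_most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranking = toy_most_similar("good", toy_vectors)
for other, score in ranking:
    print(other, round(score, 3))
```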

Let's now create our feature vector

## Doc2Vec features

In [371]:
d2v_features = []
for index, row in mellow_aggressive_sample.iterrows():
    song_index = 'SONG_NUMBER_' + str(index)
    d2v_features.append(model[song_index])

In [372]:
np.asarray(d2v_features).shape

(3366, 100)

Let's split the data using train_test_split

In [373]:
X = d2v_features
y = mellow_aggressive_sample['moods']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

Now let's analyze the data with RandomForests

## D2V RandomForest

### D2V RF Model

In [374]:
rfc_d2v = RandomForestClassifier(n_estimators=5, min_samples_split=2, max_features='log2')
rfc_d2v.fit(X_train, y_train)
rfc_predictions = rfc_d2v.predict(X_test)
print(accuracy_score(y_test, rfc_predictions))

0.5934718100890207


The score is quite low. Let's try to adjust the parameters.

### D2V RF - GridSearchCV

In [375]:
param_grid = {'n_estimators': [5, 10, 100], 'min_samples_split': [2, 3, 4, 5, 10], 'max_features': ['sqrt', 'log2', 'auto']}
grid = GridSearchCV(rfc_d2v, param_grid, verbose=1)
grid.fit(X_train, y_train)

Fitting 3 folds for each of 45 candidates, totalling 135 fits


[Parallel(n_jobs=1)]: Done 135 out of 135 | elapsed:  1.1min finished


GridSearchCV(cv=None, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='log2', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=5, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'n_estimators': [5, 10, 100], 'min_samples_split': [2, 3, 4, 5, 10], 'max_features': ['sqrt', 'log2', 'auto']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [376]:
grid.best_params_

{'max_features': 'sqrt', 'min_samples_split': 3, 'n_estimators': 100}
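The "45 candidates, totalling 135 fits" reported in the log above follows directly from the size of the grid. The sketch below reproduces that arithmetic with itertools. Only the grid values come from the notebook; the variable names are illustrative.

```python
from itertools import product

# Same grid as in the cell above
param_grid = {
    "n_estimators": [5, 10, 100],
    "min_samples_split": [2, 3, 4, 5, 10],
    "max_features": ["sqrt", "log2", "auto"],
}
n_folds = 3  # matches the "Fitting 3 folds" line in the log

# One candidate per combination of parameter values
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
total_fits = len(candidates) * n_folds
print(len(candidates), "candidates,", total_fits, "fits")
```

Note that in the scikit-learn version used here, 'auto' should behave like 'sqrt' for RandomForestClassifier, so roughly a third of these candidates are likely duplicates that could be dropped to save time.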

In [377]:
grid.best_estimator_

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=3,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [378]:
# Keep a handle on the tuned forest; it is reused for feature selection below
rfc_d2v2 = grid.best_estimator_
rfc_predictions2 = rfc_d2v2.predict(X_test)
print(accuracy_score(y_test, rfc_predictions2))

0.7240356083086054


Let's check whether reducing the features makes a difference

### D2V RF - SelectFromModel

In [379]:
rfc_d2v3 = SelectFromModel(rfc_d2v2, threshold='mean', prefit=True)
X_train_d2v = rfc_d2v3.transform(X_train)
print('X_train shape: ', np.asarray(X_train).shape)
print('X_train_rfc shape: ', np.asarray(X_train_d2v).shape)

X_train shape:  (3029, 100)
X_train_rfc shape:  (3029, 32)


In [380]:
X_test_d2v = rfc_d2v3.transform(X_test)
np.asarray(X_test_d2v).shape

(337, 32)

In [386]:
rfc_d2v4 = RandomForestClassifier(n_estimators=100, min_samples_split=3, max_features='sqrt')
rfc_d2v4.fit(X_train_d2v, y_train)
rfc_predictions4 = rfc_d2v4.predict(X_test_d2v)
print(accuracy_score(y_test, rfc_predictions4))

0.6735905044510386


The score is somewhat lower than in the previous step (0.67 versus 0.72), so keeping only the 32 most important features does not help here. Let's look at other classifiers.
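With threshold='mean', SelectFromModel keeps the features whose importance is at least the mean importance over all features. The following pure-Python sketch mimics that rule on invented numbers. The `importances` values, the `rows` data, and the `apply_mask` helper are all hypothetical.

```python
# Invented feature importances for 6 features (assumption, not model output)
importances = [0.02, 0.30, 0.05, 0.25, 0.08, 0.30]

# threshold='mean' corresponds to the average importance
threshold = sum(importances) / len(importances)

# A feature is kept when its importance reaches the threshold
keep_mask = [imp >= threshold for imp in importances]


def apply_mask(rows, mask):
    """Keep only the columns flagged True in `mask`."""
    return [[value for value, keep in zip(row, mask) if keep] for row in rows]


rows = [
    [1, 2, 3, 4, 5, 6],
    [10, 20, 30, 40, 50, 60],
]
reduced = apply_mask(rows, keep_mask)
print(keep_mask)
print(reduced)
```

Because the cut-off is relative to the mean, the number of surviving features (32 out of 100 above) depends on how the importances are distributed rather than on a fixed k.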

## D2V LogisticRegression

In [382]:
lr_d2v = LogisticRegression()
lr_d2v.fit(X_train, y_train)
lr_predictions = lr_d2v.predict(X_test)
print(accuracy_score(y_test, lr_predictions))

0.7359050445103857


The score is higher than with RFC. Let's also adjust the parameters.

### D2V LR - GridSearchCV

In [383]:
param_grid = {'solver': ['newton-cg', 'sag', 'saga', 'lbfgs'], 'multi_class':['ovr', 'multinomial']}
grid = GridSearchCV(lr_d2v, param_grid, verbose=1)
grid.fit(X_train, y_train)

Fitting 3 folds for each of 8 candidates, totalling 24 fits


[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    2.8s finished


GridSearchCV(cv=None, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'solver': ['newton-cg', 'sag', 'saga', 'lbfgs'], 'multi_class': ['ovr', 'multinomial']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [384]:
grid.best_params_

{'multi_class': 'multinomial', 'solver': 'newton-cg'}

In [387]:
grid.best_estimator_

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=1, penalty='l2', random_state=None, solver='newton-cg',
          tol=0.0001, verbose=0, warm_start=False)

In [388]:
lr_predictions2 = grid.best_estimator_.predict(X_test)
print(accuracy_score(y_test, lr_predictions2))

0.7359050445103857


The score is unchanged after parameter optimization (0.7359 in both cases). Let's try a Support Vector Machine.

## D2V Support Vector Machine

In [389]:
svc_d2v = SVC(C=1, gamma=1)
svc_d2v.fit(X_train, y_train)
svc_predictions = svc_d2v.predict(X_test)
print(accuracy_score(y_test, svc_predictions))

0.45103857566765576


The score is very low with this classifier. We will now adjust the parameters.
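One plausible reason for the low score (an interpretation, not something verified in this notebook) is that gamma=1 is too large for dense 100-dimensional vectors. The RBF kernel is exp(-gamma * ||x - y||^2), so with a large gamma it collapses towards 0 for almost every pair of songs. The sketch below uses two invented vectors to show how strongly gamma changes the kernel value.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Two invented 100-dimensional vectors differing by 0.5 in every coordinate,
# so their squared distance is 100 * 0.25 = 25
x = [0.0] * 100
y = [0.5] * 100

# Same gamma values as in the grid search below
for gamma in (1, 0.1, 0.01, 0.001):
    print(gamma, rbf_kernel(x, y, gamma))
```

With gamma=1 the two points look completely unrelated to the kernel, while smaller values keep a usable notion of similarity.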

### D2V SVC - GridSearchCV

In [390]:
param_grid = {'C': [0.1,1, 10], 'gamma': [1,0.1,0.01,0.001]}
grid = GridSearchCV(svc_d2v, param_grid, verbose=1)
grid.fit(X_train, y_train)

Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:   45.3s finished


GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [0.1, 1, 10], 'gamma': [1, 0.1, 0.01, 0.001]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

In [391]:
grid.best_params_

{'C': 1, 'gamma': 0.01}

In [392]:
grid.best_estimator_

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.01, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [393]:
svc_predictions2 = grid.best_estimator_.predict(X_test)
print(accuracy_score(y_test, svc_predictions2))

0.7922848664688428


The score increased and is now higher than with the two other classifiers. We will now try reducing the number of features.

### D2V SVC - SelectKBest

In [394]:
selector = SelectKBest(k=32)
X_new = selector.fit_transform(X, y)
X_new.shape

(3366, 32)

In [395]:
X_train_d2v, X_test_d2v, y_train_d2v, y_test_d2v = train_test_split(X_new, y, test_size=0.1, random_state=42)

In [396]:
svc_d2v3 = SVC(C=1, gamma=0.01)
svc_d2v3.fit(X_train_d2v, y_train_d2v)
svc_predictions3 = svc_d2v3.predict(X_test_d2v)
print(accuracy_score(y_test_d2v, svc_predictions3))

0.7655786350148368


The score decreased slightly (from 0.79 to 0.77), but with about a third of the features (32 instead of 100).

# Conclusion

The model that achieved the best score is TF-IDF with a Support Vector Machine classifier. To increase the score further, we might try other classifiers or work with different moods.
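As a quick recap, the test accuracies printed in the bigram and Doc2Vec cells above can be ranked with a few lines of Python. The labels are informal names chosen here for readability, the values are rounded to four decimals, and results from earlier parts of the notebook are not included.

```python
# Test accuracies copied from the cell outputs above (rounded)
scores = {
    "TF-IDF bigram + SVC": 0.7537,
    "TF-IDF bigram + SelectKBest + SVC": 0.8071,
    "D2V + RandomForest (tuned)": 0.7240,
    "D2V + LogisticRegression": 0.7359,
    "D2V + SVC (tuned)": 0.7923,
    "D2V + SelectKBest + SVC": 0.7656,
}

# Sort from best to worst accuracy
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, accuracy in ranked:
    print(f"{name:<36} {accuracy:.4f}")
```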