In [1]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt

import joblib
import os
import sys

module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from utilities import utils
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split

DATA_PATH = utils.get_datapath('data')
MODEL_PATH = utils.get_datapath('model')

%load_ext autoreload
%autoreload 2

[nltk_data] Downloading package stopwords to /home/jng/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Table of contents**<a id='toc0_'></a>    
- [Model Optimization](#toc1_)    
- [Loading the Dataset](#toc2_)    
- [Preparing the Train and Test Set](#toc3_)    
- [Modelling With TF-IDF Transformed Lyrics](#toc4_)    
  - [Logistic Regression and TF-IDF](#toc4_1_)    
  - [Logistic Regression with NMF Components of the TF-IDF Vectors](#toc4_2_)    
  - [Multinomial Naive Bayes Classifier and TF-IDF](#toc4_3_)    
  - [Random Forest and TF-IDF](#toc4_4_)    
- [Modeling With OpenAI Embeddings](#toc5_)    
  - [Setting up the Embeddings](#toc5_1_)    
  - [Logistic Regression and Ada Embeddings](#toc5_2_)    
  - [Logistic Regression with PCA Components of the Ada Embeddings](#toc5_3_)    
- [Predicting Popularity of Hip Hop Songs](#toc6_)    
  - [Subsetting the Dataset for only Hip Hop](#toc6_1_)    
  - [Modeling for Hip Hop Popularity](#toc6_2_)    
    - [Logistic Regression and TF-IDF](#toc6_2_1_)    
    - [Multinomial Naive Bayes and TF-IDF](#toc6_2_2_)    
- [Conclusion](#toc7_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Model Optimization](#toc0_)

This notebook outlines the steps for building the various models on TF-IDF vectors and Ada document embeddings. We chose TF-IDF as the vectorizer because certain words may be characteristic of particular genres of music, and TF-IDF weighting should capture that information.
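
To make the intuition behind TF-IDF concrete, here is a minimal sketch on three made-up token lists (not taken from the dataset). It uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, which is the default in scikit-learn's `TfidfVectorizer`, but skips the L2 normalization that the vectorizer also applies. The helper name `tfidf` is ours, for illustration only.

```python
import math
from collections import Counter

# Toy, made-up "lyrics" (already tokenised); not taken from the dataset.
docs = [
    "love love heart night".split(),
    "love truck road beer".split(),
    "love money street flow".split(),
]

n_docs = len(docs)
# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf(doc):
    """Return unnormalised TF-IDF weights using the smoothed IDF ln((1+n)/(1+df)) + 1."""
    counts = Counter(doc)
    return {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }


weights = tfidf(docs[1])
# "love" appears in every document, "truck" only in this one.
print(round(weights["love"], 3), round(weights["truck"], 3))
```

A word shared by every document ("love") receives the minimum weight, while a word confined to one document ("truck") is weighted more heavily, which is the behaviour we hope will separate the classes.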

Specifically, we will use the following representations of the lyrics as input into the models:
- Without dimensionality reduction
    1. TF-IDF
    1. Ada document embeddings from OpenAI
- With dimensionality reduction
    1. NMF (10 components) of the TF-IDF vectors
    1. PCA (10 components) of the Ada document embeddings

The dimensionality-reduction techniques differ because of the nature of the vector representations. The TF-IDF vectors form a sparse, non-negative matrix, which suits NMF better than PCA. In contrast, the Ada embeddings form a dense matrix, so PCA can be used. 
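
One way to see why PCA is an awkward fit for sparse TF-IDF data is that PCA works on mean-centred columns. The sketch below uses a small made-up matrix (standing in for TF-IDF rows) and only the standard library; it counts how many zeros survive centring and checks whether negative values appear. It illustrates the centring step only, not PCA itself.

```python
# Toy non-negative, mostly-zero matrix standing in for TF-IDF rows (made-up values).
X = [
    [0.0, 0.7, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.9],
]


def count_zeros(matrix):
    """Count exact zeros in a matrix stored as a list of rows."""
    return sum(value == 0.0 for row in matrix for value in row)


# PCA operates on mean-centred columns, so subtract each column mean.
n_rows = len(X)
col_means = [sum(col) / n_rows for col in zip(*X)]
X_centred = [[value - mean for value, mean in zip(row, col_means)] for row in X]

zeros_before = count_zeros(X)
zeros_after = count_zeros(X_centred)
has_negative = any(value < 0 for row in X_centred for value in row)
print(zeros_before, zeros_after, has_negative)
```

Centring turns most zeros into non-zero (and partly negative) values, so the matrix is no longer sparse or non-negative. NMF, by contrast, factorizes the non-negative matrix directly.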

We also decided to use the three-class target (low, mid, and high popularity) to give a more granular prediction of Spotify popularity. 

For the modelling, we will run 5-fold cross-validation for each model.
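
As a rough illustration of what 5-fold cross-validation does, the sketch below splits sample indices into five contiguous folds, each used once for validation. It is a simplification: for classifiers, scikit-learn uses stratified folds that preserve class proportions, which this sketch ignores. The function name `k_fold_indices` and the sample count of 23 are hypothetical.

```python
def k_fold_indices(n_samples, n_splits=5):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=23, n_splits=5))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

Every sample lands in exactly one validation fold, so each model is scored on data it was not fitted on, and the five scores can be averaged.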

We will be attempting to optimize the following models:
- Logistic Regression for interpretability.
- Multinomial Naive Bayes for interpretability and speed.
- Random Forest for performance.

The model evaluation will be carried out in `notebooks/7_model_evaluation.ipynb`.

# <a id='toc2_'></a>[Loading the Dataset](#toc0_)

In [2]:
df = pd.read_csv(DATA_PATH / 'clean_lyrics_final.csv')

In [3]:
display(df.head())
display(df.shape)

Unnamed: 0,song,lyrics,release_year,title,primary_artist,views,cleaned_lyrics,language,log_scaled_views,popular,popularity_three_class,cleaned_lyrics_stem,spotify_popularity,genre,ada_embeddings,spotify_popularity_three_class
0,Kendrick-lamar-swimming-pools-drank-lyrics,\n\n[Produced by T-Minus]\n\n[Intro]\nPour up ...,2012,Swimming Pools (Drank),Kendrick-lamar,5589280.0,pour up drank head shot drank sit down drank ...,en,15.536361,1,2,pour drank head shot drank sit drank stand dr...,78.0,hip hop,"[0.011653340421617031, -0.0033766645938158035,...",2
1,Kendrick-lamar-money-trees-lyrics,\n\n[Produced by DJ Dahi]\n\n[Verse 1: Kendric...,2012,Money Trees,Kendrick-lamar,4592003.0,uh me and my niggas tryna get it ya bish ya b...,en,15.339827,1,2,uh nigga tryna get ya bish ya bish hit hous l...,81.0,hip hop,"[0.0013736916007474065, -0.00975166354328394, ...",2
2,Kendrick-lamar-xxx-lyrics,"\n\n[Intro: Bēkon & Kid Capri]\nAmerica, God b...",2017,XXX.,Kendrick-lamar,4651514.0,america god bless you if its good to you amer...,en,15.352703,1,2,america god bless good america pleas take han...,69.0,hip hop,"[-0.014546433463692665, -0.006841110065579414,...",2
3,A-ap-rocky-fuckin-problems-lyrics,"\n\n[Chorus: 2 Chainz, Drake & Both (A$AP Rock...",2012,Fuckin’ Problems,A-ap-rocky,7378309.0,i love bad bitches thats my fuckin problem an...,en,15.814055,1,2,love bad bitch that fuckin problem yeah like ...,75.0,hip hop,"[-0.018262486904859543, 0.006630411371588707, ...",2
4,Kendrick-lamar-dna-lyrics,"\n\n[Verse 1]\nI got, I got, I got, I got—\nLo...",2017,DNA.,Kendrick-lamar,5113687.0,i got i got i got i got loyalty got royalty i...,en,15.447431,1,2,got got got got loyalti got royalti insid dna...,80.0,hip hop,"[-0.023312833160161972, -0.012944793328642845,...",2


(28560, 16)

# <a id='toc3_'></a>[Preparing the Train and Test Set](#toc0_)

In [4]:
X=df[['cleaned_lyrics_stem', 'ada_embeddings']]
y=df['spotify_popularity_three_class']

# Create train and test splits and set a random state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=33)

# Separate the stemmed lyrics from the embeddings.
X_train_lyrics = X_train['cleaned_lyrics_stem']
X_train_embeddings = X_train['ada_embeddings']

X_test_lyrics = X_test['cleaned_lyrics_stem']
X_test_embeddings = X_test['ada_embeddings']


In [8]:
X_train_lyrics.shape, X_test_lyrics.shape

((22848,), (5712,))

In [9]:
X_train_embeddings.shape, X_test_embeddings.shape

((22848,), (5712,))

The split has the expected sizes (22,848 training rows and 5,712 test rows), so we can now use it for modelling.
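
Note that the embeddings have shape `(22848,)` rather than a two-dimensional shape: because the data is read from CSV, each `ada_embeddings` cell is a string such as `"[0.0116, ...]"`, not a numeric vector. Below is a minimal sketch of parsing such a string, assuming the cells are valid JSON-style lists. The helper `parse_embedding` and the three-number example value are hypothetical.

```python
import json

# Hypothetical cell value, shaped like the `ada_embeddings` strings shown by df.head()
# but shortened to three made-up numbers.
raw_embedding = "[0.0116, -0.0033, 0.0271]"


def parse_embedding(text):
    """Convert a stringified list of numbers into a Python list of floats."""
    return [float(value) for value in json.loads(text)]


vector = parse_embedding(raw_embedding)
print(type(vector).__name__, len(vector))
```

With pandas, the same helper could be applied to a whole column via `Series.apply(parse_embedding)` before stacking the vectors into a 2-D array.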

---

# <a id='toc4_'></a>[Modelling With TF-IDF Transformed Lyrics](#toc0_)

## <a id='toc4_1_'></a>[Logistic Regression and TF-IDF](#toc0_)

Here we tune the hyperparameters of a logistic regression model with L2 regularization, using TF-IDF vectors as input. Specifically, we tune `C`, the inverse of the regularization strength (smaller values mean stronger regularization). We also vary the `solver` between `newton-cg` and `saga`, as both support multi-class problems.
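
The size of the search follows directly from the grid: each combination of parameter values is one candidate, and each candidate is fitted once per fold. The sketch below enumerates the combinations with the standard library, using the same values as the grid searched in this section; it only mirrors the counting and does not call scikit-learn.

```python
from itertools import product

# Same values as the grid searched in this section.
param_grid = {
    "log_reg__C": [0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    "log_reg__solver": ["newton-cg", "saga"],
}
n_folds = 5

# Every combination of parameter values is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)  # 14 70
```

Seven values of `C` times two solvers gives 14 candidates, and 14 candidates times 5 folds gives the 70 fits reported in the search log.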

In [10]:


vectorizer = TfidfVectorizer(max_df=0.9, 
                             min_df=0.01
                             )

log_reg = LogisticRegression(penalty='l2', max_iter=500)


pipe = Pipeline(steps=
                [
                    ('tfidf', vectorizer),
                    ('log_reg', log_reg)
                ]
            )

param_grid = {
    'log_reg__C':[0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    'log_reg__solver':['newton-cg', 'saga']
}

search = GridSearchCV(pipe, param_grid, n_jobs=-1, cv=5, verbose=3, return_train_score=True)
search.fit(X_train_lyrics, y_train)
search.score(X_test_lyrics, y_test)

Fitting 5 folds for each of 14 candidates, totalling 70 fits
[CV 3/5] END log_reg__C=0.001, log_reg__solver=newton-cg;, score=(train=0.388, test=0.379) total time=   4.0s
[CV 5/5] END log_reg__C=0.001, log_reg__solver=newton-cg;, score=(train=0.390, test=0.364) total time=   4.0s
[CV 2/5] END log_reg__C=0.0001, log_reg__solver=newton-cg;, score=(train=0.368, test=0.358) total time=   4.1s
[CV 1/5] END log_reg__C=0.001, log_reg__solver=newton-cg;, score=(train=0.387, test=0.373) total time=   4.3s
[CV 5/5] END log_reg__C=0.0001, log_reg__solver=newton-cg;, score=(train=0.368, test=0.362) total time=   4.6s
[CV 1/5] END log_reg__C=0.0001, log_reg__solver=newton-cg;, score=(train=0.369, test=0.361) total time=   4.4s
[CV 2/5] END log_reg__C=0.0001, log_reg__solver=saga;, score=(train=0.367, test=0.357) total time=   4.9s
[CV 4/5] END log_reg__C=0.0001, log_reg__solver=saga;, score=(train=0.361, test=0.363) total time=   5.2s
[CV 2/5] END log_reg__C=0.001, log_reg__solver=saga;, score=(tra

0.42034313725490197

In [11]:
with open(MODEL_PATH / 'log_reg_tfidf.pkl', 'wb') as file:
    joblib.dump(search, file)


## <a id='toc4_2_'></a>[Logistic Regression with NMF Components of the TF-IDF Vectors](#toc0_)

In [12]:
from sklearn.decomposition import NMF

Next we add another step to the pipeline: an NMF decomposition of the TF-IDF vectors. We will try 5, 10, 15, and 20 components for the NMF.
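
To show what the NMF step produces, here is a small pure-Python sketch that factorizes a made-up non-negative matrix `V` into `W` (documents by components) and `H` (components by terms). It uses multiplicative updates for brevity, whereas scikit-learn's `NMF` defaults to a coordinate-descent solver, so this is an illustration of the idea rather than of the library's implementation. All function names here are ours.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh)) ** 0.5


def nmf(V, n_components, n_iter=200, seed=0, eps=1e-9):
    """Factor non-negative V (n x m) into W (n x k) and H (k x m) by multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(n_components)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(n_components)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)] for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)] for rw, ra, rb in zip(W, num, den)]
    return W, H


# Made-up non-negative "document x term" matrix with two obvious word groups.
V = [
    [1.0, 0.9, 0.0, 0.0],
    [0.8, 1.0, 0.0, 0.1],
    [0.0, 0.0, 1.0, 0.9],
    [0.1, 0.0, 0.9, 1.0],
]
W, H = nmf(V, n_components=2)
error = frobenius_error(V, W, H)
print(len(W), len(W[0]), round(error, 3))
```

Each row of `W` is the reduced representation of one document. In the pipeline, rows like these (with 5 to 20 components) are what the logistic regression receives instead of the full TF-IDF vector.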

In [13]:
nmf = NMF(n_components=5)

vectorizer = TfidfVectorizer(max_df=0.9, 
                             min_df=0.01
                             )

log_reg = LogisticRegression(penalty='l2', max_iter=500)

pipe = Pipeline(steps=
                [   
                    ('tfidf', vectorizer),
                    ('nmf', nmf),
                    ('log_reg', log_reg)
                ]
            )

param_grid = {
    'nmf__n_components':[5,10,15,20],
    'log_reg__C':[0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    'log_reg__solver':['newton-cg', 'saga']
}

search = GridSearchCV(pipe, param_grid, n_jobs=-1, cv=5, verbose=3, return_train_score=True)
search.fit(X_train_lyrics, y_train)
search.score(X_test_lyrics, y_test)

Fitting 5 folds for each of 56 candidates, totalling 280 fits
[CV 5/5] END log_reg__C=0.0001, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.339, test=0.339) total time=   5.5s




[CV 3/5] END log_reg__C=0.0001, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.339, test=0.339) total time=   5.8s
[CV 1/5] END log_reg__C=0.0001, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.339, test=0.339) total time=   5.9s
[CV 2/5] END log_reg__C=0.0001, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.339, test=0.339) total time=   6.0s
[CV 4/5] END log_reg__C=0.0001, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.339, test=0.339) total time=   6.0s




[CV 2/5] END log_reg__C=0.0001, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.339, test=0.339) total time=   7.4s




[CV 4/5] END log_reg__C=0.0001, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.339, test=0.339) total time=   8.7s
[CV 1/5] END log_reg__C=0.0001, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.339, test=0.339) total time=   8.7s
[CV 3/5] END log_reg__C=0.0001, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.339, test=0.339) total time=   9.2s
[CV 5/5] END log_reg__C=0.0001, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.339, test=0.339) total time=   8.8s
[CV 5/5] END log_reg__C=0.0001, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.339, test=0.339) total time=   8.7s
[CV 5/5] END log_reg__C=0.0001, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.339, test=0.339) total time=   9.9s
[CV 4/5] END log_reg__C=0.0001, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.339, test=0.339) total time=   9.8s
[CV 1/5] END log_reg__C=0.0001, log_reg__solver=newton-cg, nmf__n_com



[CV 3/5] END log_reg__C=0.0001, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.339, test=0.339) total time=  11.5s
[CV 1/5] END log_reg__C=0.0001, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.339, test=0.339) total time=   5.4s
[CV 3/5] END log_reg__C=0.0001, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.339, test=0.339) total time=   5.9s
[CV 2/5] END log_reg__C=0.0001, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.339, test=0.339) total time=   6.1s
[CV 4/5] END log_reg__C=0.0001, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.339, test=0.339) total time=   5.8s




[CV 5/5] END log_reg__C=0.0001, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.339, test=0.339) total time=   5.9s
[CV 4/5] END log_reg__C=0.0001, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.339, test=0.339) total time=  14.3s




[CV 2/5] END log_reg__C=0.0001, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.339, test=0.339) total time=  15.2s
[CV 1/5] END log_reg__C=0.0001, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.339, test=0.339) total time=  15.5s




[CV 1/5] END log_reg__C=0.0001, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.339, test=0.339) total time=   8.4s
[CV 4/5] END log_reg__C=0.0001, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.339, test=0.339) total time=   8.1s
[CV 2/5] END log_reg__C=0.0001, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.339, test=0.339) total time=   8.5s
[CV 3/5] END log_reg__C=0.0001, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.339, test=0.339) total time=   8.5s
[CV 5/5] END log_reg__C=0.0001, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.339, test=0.339) total time=   6.9s




[CV 5/5] END log_reg__C=0.0001, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.339, test=0.339) total time=   9.4s
[CV 1/5] END log_reg__C=0.0001, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.339, test=0.339) total time=   9.9s




[CV 4/5] END log_reg__C=0.0001, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.339, test=0.339) total time=   9.0s
[CV 1/5] END log_reg__C=0.001, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.343, test=0.340) total time=   6.4s
[CV 2/5] END log_reg__C=0.001, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.342, test=0.340) total time=   6.5s
[CV 5/5] END log_reg__C=0.001, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.341, test=0.343) total time=   5.8s
[CV 4/5] END log_reg__C=0.001, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.339, test=0.341) total time=   6.2s
[CV 2/5] END log_reg__C=0.0001, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.339, test=0.339) total time=  12.4s
[CV 3/5] END log_reg__C=0.001, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.341, test=0.342) total time=   6.8s
[CV 3/5] END log_reg__C=0.0001, log_reg__solver=saga, nmf__n_components=20;, score=(train



[CV 4/5] END log_reg__C=0.0001, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.339, test=0.339) total time=  10.1s
[CV 1/5] END log_reg__C=0.001, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.343, test=0.341) total time=   7.2s




[CV 1/5] END log_reg__C=0.0001, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.339, test=0.339) total time=  14.2s




[CV 5/5] END log_reg__C=0.0001, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.339, test=0.339) total time=  12.3s
[CV 2/5] END log_reg__C=0.001, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.341, test=0.340) total time=   9.0s
[CV 4/5] END log_reg__C=0.001, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.339, test=0.340) total time=   8.9s
[CV 3/5] END log_reg__C=0.001, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.341, test=0.342) total time=   9.1s
[CV 5/5] END log_reg__C=0.001, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.341, test=0.342) total time=   9.1s




[CV 1/5] END log_reg__C=0.001, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.344, test=0.344) total time=   9.3s
[CV 4/5] END log_reg__C=0.001, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.340, test=0.342) total time=   8.6s
[CV 1/5] END log_reg__C=0.001, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.343, test=0.340) total time=   6.4s
[CV 4/5] END log_reg__C=0.001, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.339, test=0.341) total time=   5.8s
[CV 2/5] END log_reg__C=0.001, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.343, test=0.340) total time=  11.9s
[CV 2/5] END log_reg__C=0.001, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.342, test=0.340) total time=   7.3s
[CV 3/5] END log_reg__C=0.001, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.341, test=0.342) total time=   6.6s
[CV 1/5] END log_reg__C=0.001, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.342,



[CV 4/5] END log_reg__C=0.001, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.340, test=0.341) total time=  11.6s
[CV 3/5] END log_reg__C=0.001, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.343, test=0.343) total time=  12.0s
[CV 2/5] END log_reg__C=0.001, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.342, test=0.340) total time=   7.4s
[CV 4/5] END log_reg__C=0.001, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.339, test=0.340) total time=   8.7s
[CV 5/5] END log_reg__C=0.001, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.340, test=0.341) total time=   7.5s
[CV 2/5] END log_reg__C=0.001, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.342, test=0.339) total time=  14.9s
[CV 3/5] END log_reg__C=0.001, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.341, test=0.343) total time=   9.2s
[CV 1/5] END log_reg__C=0.001, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.342, 



[CV 3/5] END log_reg__C=0.001, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.344, test=0.346) total time=   9.0s
[CV 5/5] END log_reg__C=0.001, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.341, test=0.342) total time=   8.8s
[CV 1/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.374, test=0.367) total time=   6.9s
[CV 1/5] END log_reg__C=0.001, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.343, test=0.341) total time=  12.0s
[CV 2/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.374, test=0.370) total time=   6.6s
[CV 3/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.373, test=0.374) total time=   6.3s
[CV 5/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.377, test=0.363) total time=   5.7s
[CV 4/5] END log_reg__C=0.001, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.339, tes



[CV 2/5] END log_reg__C=0.001, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.344, test=0.341) total time=  13.0s




[CV 2/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.375, test=0.377) total time=   6.9s
[CV 3/5] END log_reg__C=0.001, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.343, test=0.342) total time=  12.9s
[CV 2/5] END log_reg__C=0.001, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.343, test=0.339) total time=  13.5s
[CV 1/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.376, test=0.373) total time=   9.1s
[CV 3/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.379, test=0.383) total time=   9.0s
[CV 4/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.373, test=0.395) total time=   9.3s
[CV 5/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.379, test=0.366) total time=   8.5s




[CV 5/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.380, test=0.364) total time=   7.7s




[CV 1/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.376, test=0.372) total time=  10.0s
[CV 4/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.372, test=0.395) total time=   8.3s
[CV 1/5] END log_reg__C=0.01, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.374, test=0.367) total time=   6.4s
[CV 3/5] END log_reg__C=0.01, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.373, test=0.374) total time=   6.2s
[CV 2/5] END log_reg__C=0.01, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.374, test=0.370) total time=   6.6s
[CV 4/5] END log_reg__C=0.01, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.368, test=0.388) total time=   5.9s
[CV 2/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.378, test=0.373) total time=  11.8s




[CV 3/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.378, test=0.376) total time=  10.4s
[CV 4/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.372, test=0.393) total time=  10.4s
[CV 5/5] END log_reg__C=0.01, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.377, test=0.363) total time=   7.1s
[CV 3/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.375, test=0.374) total time=  12.4s
[CV 5/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.380, test=0.364) total time=   9.9s




[CV 1/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.379, test=0.372) total time=  12.1s
[CV 2/5] END log_reg__C=0.01, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.377, test=0.374) total time=  12.4s
[CV 2/5] END log_reg__C=0.01, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.375, test=0.377) total time=   7.3s
[CV 1/5] END log_reg__C=0.01, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.376, test=0.373) total time=   9.9s
[CV 3/5] END log_reg__C=0.01, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.379, test=0.383) total time=   9.2s
[CV 4/5] END log_reg__C=0.01, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.372, test=0.395) total time=   9.3s
[CV 5/5] END log_reg__C=0.01, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.379, test=0.366) total time=   9.0s
[CV 2/5] END log_reg__C=0.01, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.377, test=0.377) t



[CV 4/5] END log_reg__C=0.1, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.384, test=0.401) total time=   6.4s
[CV 3/5] END log_reg__C=0.1, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.383, test=0.396) total time=   7.1s
[CV 5/5] END log_reg__C=0.1, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.390, test=0.369) total time=   6.3s
[CV 5/5] END log_reg__C=0.01, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.379, test=0.366) total time=   8.7s




[CV 1/5] END log_reg__C=0.01, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.376, test=0.368) total time=  12.1s
[CV 2/5] END log_reg__C=0.01, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.377, test=0.373) total time=  12.5s
[CV 1/5] END log_reg__C=0.1, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.395, test=0.380) total time=   9.3s
[CV 2/5] END log_reg__C=0.1, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.390, test=0.391) total time=   7.8s
[CV 3/5] END log_reg__C=0.1, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.391, test=0.394) total time=   8.0s
[CV 4/5] END log_reg__C=0.01, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.373, test=0.395) total time=  13.0s
[CV 3/5] END log_reg__C=0.01, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.375, test=0.377) total time=  14.2s
[CV 4/5] END log_reg__C=0.1, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.391, test=0.



[CV 1/5] END log_reg__C=0.1, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.395, test=0.384) total time=   9.1s
[CV 4/5] END log_reg__C=0.1, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.388, test=0.404) total time=   8.2s
[CV 1/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.388, test=0.377) total time=   6.0s
[CV 5/5] END log_reg__C=0.1, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.394, test=0.377) total time=   8.3s
[CV 3/5] END log_reg__C=0.1, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.392, test=0.388) total time=  10.1s
[CV 3/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.383, test=0.396) total time=   6.0s
[CV 4/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.384, test=0.401) total time=   5.7s
[CV 5/5] END log_reg__C=0.1, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.396, test=0.37



[CV 2/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.386, test=0.385) total time=   7.1s
[CV 5/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.389, test=0.369) total time=   6.2s
[CV 2/5] END log_reg__C=0.1, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.391, test=0.389) total time=  12.7s




[CV 1/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.395, test=0.380) total time=   9.2s
[CV 4/5] END log_reg__C=0.1, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.391, test=0.406) total time=  13.4s
[CV 2/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.391, test=0.393) total time=   9.3s
[CV 3/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.394, test=0.400) total time=   9.3s




[CV 3/5] END log_reg__C=0.1, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.392, test=0.389) total time=  15.6s
[CV 4/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.392, test=0.400) total time=   9.2s
[CV 2/5] END log_reg__C=0.1, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.394, test=0.389) total time=  16.7s
[CV 1/5] END log_reg__C=0.1, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.396, test=0.385) total time=  17.4s




[CV 1/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.396, test=0.383) total time=   9.7s
[CV 5/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.393, test=0.378) total time=   8.3s
[CV 4/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.388, test=0.405) total time=   8.7s
[CV 1/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.398, test=0.387) total time=   6.6s
[CV 5/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.390, test=0.374) total time=  11.4s
[CV 1/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.396, test=0.385) total time=   8.9s
[CV 3/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.393, test=0.391) total time=  12.3s
[CV 2/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.396, test=0.399) total time=   5



[CV 4/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.393, test=0.407) total time=   5.6s
[CV 5/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.400, test=0.387) total time=   5.9s
[CV 4/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.391, test=0.401) total time=  10.3s




[CV 5/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.396, test=0.382) total time=  10.7s
[CV 3/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.396, test=0.400) total time=   6.8s




[CV 2/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.393, test=0.391) total time=  14.7s




[CV 2/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.400, test=0.392) total time=   8.0s
[CV 3/5] END log_reg__C=0.1, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.395, test=0.394) total time=  15.8s
[CV 1/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.402, test=0.392) total time=  11.3s
[CV 5/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.401, test=0.386) total time=   9.1s
[CV 4/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.397, test=0.413) total time=  10.3s




[CV 3/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.401, test=0.407) total time=  11.7s
[CV 5/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.403, test=0.392) total time=   9.3s
[CV 1/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.401, test=0.390) total time=  10.8s
[CV 1/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.398, test=0.387) total time=   6.7s




[CV 2/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.396, test=0.399) total time=   7.0s
[CV 3/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.396, test=0.400) total time=   7.1s
[CV 2/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.397, test=0.396) total time=  12.7s
[CV 3/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.399, test=0.398) total time=  12.1s
[CV 1/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.403, test=0.387) total time=  10.2s
[CV 4/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.393, test=0.407) total time=   5.9s
[CV 5/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.405, test=0.394) total time=   9.5s




[CV 5/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.401, test=0.387) total time=   6.5s




[CV 4/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.401, test=0.411) total time=  14.4s
[CV 2/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.403, test=0.398) total time=  14.2s
[CV 3/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.403, test=0.398) total time=  14.8s
[CV 4/5] END log_reg__C=1, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.401, test=0.408) total time=  14.8s




[CV 1/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.401, test=0.392) total time=  10.4s
[CV 2/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.400, test=0.392) total time=   9.1s
[CV 4/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.397, test=0.413) total time=  10.6s




[CV 5/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.401, test=0.392) total time=   8.2s




[CV 1/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.401, test=0.389) total time=   9.8s
[CV 3/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.401, test=0.406) total time=  11.6s
[CV 5/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.406, test=0.398) total time=  12.0s
[CV 1/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.399, test=0.387) total time=   7.1s
[CV 4/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.398, test=0.408) total time=  10.0s




[CV 2/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.397, test=0.400) total time=   7.6s




[CV 4/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.393, test=0.407) total time=   6.2s
[CV 5/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.405, test=0.395) total time=  10.4s
[CV 2/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.400, test=0.396) total time=  13.5s
[CV 5/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.400, test=0.386) total time=   6.1s
[CV 3/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.403, test=0.404) total time=  13.7s
[CV 3/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.397, test=0.402) total time=   7.1s
[CV 2/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.403, test=0.401) total time=  12.8s




[CV 4/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.401, test=0.414) total time=  14.0s
[CV 1/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.406, test=0.385) total time=  15.2s
[CV 3/5] END log_reg__C=1, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.403, test=0.401) total time=  15.2s




[CV 5/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.400, test=0.389) total time=   7.5s
[CV 1/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.400, test=0.393) total time=  10.4s
[CV 4/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.396, test=0.414) total time=   8.8s
[CV 2/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.400, test=0.395) total time=  10.1s
[CV 3/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.400, test=0.407) total time=  11.6s
[CV 1/5] END log_reg__C=10, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.399, test=0.386) total time=   6.8s
[CV 4/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.399, test=0.410) total time=   9.6s




[CV 2/5] END log_reg__C=10, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.397, test=0.401) total time=   6.8s
[CV 1/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.401, test=0.388) total time=  11.9s
[CV 3/5] END log_reg__C=10, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.398, test=0.402) total time=   6.5s
[CV 4/5] END log_reg__C=10, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.393, test=0.407) total time=   6.1s
[CV 5/5] END log_reg__C=10, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.400, test=0.386) total time=   6.2s
[CV 3/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.400, test=0.394) total time=  11.6s
[CV 5/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.405, test=0.403) total time=   9.5s
[CV 5/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.402, test=0.391) total time=



[CV 3/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.405, test=0.398) total time=  10.8s
[CV 4/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.404, test=0.413) total time=  11.9s




[CV 2/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.400, test=0.400) total time=  14.8s
[CV 1/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.406, test=0.389) total time=  13.2s




[CV 2/5] END log_reg__C=10, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.404, test=0.397) total time=  14.7s




[CV 1/5] END log_reg__C=10, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.401, test=0.393) total time=   9.5s
[CV 4/5] END log_reg__C=10, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.397, test=0.413) total time=   8.8s
[CV 2/5] END log_reg__C=10, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.403, test=0.399) total time=  10.4s




[CV 1/5] END log_reg__C=100, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.399, test=0.386) total time=   6.1s
[CV 3/5] END log_reg__C=10, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.400, test=0.408) total time=  12.0s
[CV 2/5] END log_reg__C=10, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.400, test=0.396) total time=   9.2s
[CV 1/5] END log_reg__C=10, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.401, test=0.388) total time=  10.2s
[CV 5/5] END log_reg__C=10, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.401, test=0.396) total time=   9.2s
[CV 4/5] END log_reg__C=10, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.399, test=0.410) total time=  10.0s
[CV 5/5] END log_reg__C=10, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.404, test=0.399) total time=  11.9s
[CV 2/5] END log_reg__C=100, log_reg__solver=newton-cg, nmf__n_components=5;, score=(train=0.397, test=0.401) total time=   7.1



[CV 3/5] END log_reg__C=10, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.404, test=0.398) total time=  12.3s
[CV 5/5] END log_reg__C=10, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.404, test=0.400) total time=  11.7s
[CV 1/5] END log_reg__C=10, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.403, test=0.387) total time=  13.5s




[CV 2/5] END log_reg__C=100, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.399, test=0.396) total time=   7.2s
[CV 1/5] END log_reg__C=100, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.400, test=0.391) total time=   9.3s
[CV 4/5] END log_reg__C=10, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.404, test=0.408) total time=  15.8s
[CV 3/5] END log_reg__C=100, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.400, test=0.406) total time=   9.8s
[CV 4/5] END log_reg__C=100, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.397, test=0.414) total time=   8.7s




[CV 3/5] END log_reg__C=100, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.403, test=0.400) total time=   8.4s
[CV 5/5] END log_reg__C=100, log_reg__solver=newton-cg, nmf__n_components=10;, score=(train=0.399, test=0.389) total time=  10.2s
[CV 4/5] END log_reg__C=100, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.402, test=0.412) total time=   8.9s
[CV 1/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.399, test=0.386) total time=   6.4s




[CV 4/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.393, test=0.406) total time=   6.0s
[CV 2/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.397, test=0.401) total time=   6.4s
[CV 5/5] END log_reg__C=100, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.405, test=0.403) total time=   9.1s
[CV 5/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.400, test=0.387) total time=   6.1s
[CV 1/5] END log_reg__C=100, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.403, test=0.387) total time=  10.4s
[CV 3/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=5;, score=(train=0.398, test=0.402) total time=   7.6s
[CV 5/5] END log_reg__C=100, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.401, test=0.395) total time=  10.9s
[CV 2/5] END log_reg__C=100, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.399, test=0.394) tot



[CV 3/5] END log_reg__C=100, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.404, test=0.401) total time=  10.7s
[CV 4/5] END log_reg__C=100, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.403, test=0.416) total time=  10.8s
[CV 2/5] END log_reg__C=100, log_reg__solver=newton-cg, nmf__n_components=20;, score=(train=0.403, test=0.395) total time=  12.0s
[CV 1/5] END log_reg__C=100, log_reg__solver=newton-cg, nmf__n_components=15;, score=(train=0.402, test=0.388) total time=  14.2s
[CV 2/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.399, test=0.396) total time=   7.0s




[CV 1/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.400, test=0.391) total time=   9.1s
[CV 4/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.397, test=0.414) total time=   8.4s
[CV 5/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.399, test=0.389) total time=   7.3s
[CV 3/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=10;, score=(train=0.400, test=0.406) total time=   9.4s
[CV 4/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.402, test=0.412) total time=   6.8s
[CV 5/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.400, test=0.394) total time=   6.7s




[CV 1/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.401, test=0.391) total time=   8.3s
[CV 5/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.405, test=0.402) total time=   7.2s
[CV 2/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.399, test=0.395) total time=  10.1s
[CV 3/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=15;, score=(train=0.401, test=0.396) total time=  10.1s




[CV 1/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.407, test=0.388) total time=   9.3s
[CV 2/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.402, test=0.396) total time=   9.1s
[CV 4/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.401, test=0.409) total time=   9.5s
[CV 3/5] END log_reg__C=100, log_reg__solver=saga, nmf__n_components=20;, score=(train=0.405, test=0.406) total time=  10.5s


0.4012605042016807

In [14]:
with open(MODEL_PATH / 'log_reg_tfidf_nmf.pkl', 'wb') as file:
    joblib.dump(search, file)


## <a id='toc4_3_'></a>[Multinomial Naive Bayes Classifier and TF-IDF](#toc0_)

We will also fit a Multinomial Naive Bayes classifier, as it has been shown to work well with TF-IDF vectors and bag-of-words representations.

We will tune two hyperparameters: `fit_prior`, which controls whether the classifier learns the class prior probabilities from the training data or assumes a uniform prior, and `alpha`, the additive smoothing parameter that prevents a term's estimated probability from becoming 0.
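
As a rough illustration of what `alpha` does, the sketch below applies additive smoothing to made-up term totals for a single class. The function name `smoothed_likelihoods` and the counts are hypothetical and not part of this notebook. With TF-IDF input the totals would be fractional weights rather than integers, but the same formula applies.

```python
def smoothed_likelihoods(term_counts, alpha):
    """Return P(term | class) using additive smoothing with strength alpha."""
    vocab_size = len(term_counts)
    total = sum(term_counts.values())
    return {
        term: (count + alpha) / (total + alpha * vocab_size)
        for term, count in term_counts.items()
    }


# Hypothetical term totals for one class; "banjo" never occurs in it.
counts = {"love": 6, "street": 3, "banjo": 0}

unsmoothed = smoothed_likelihoods(counts, alpha=0.0)
smoothed = smoothed_likelihoods(counts, alpha=1.0)

print(unsmoothed["banjo"])  # 0.0, which would zero out the whole class score
print(smoothed["banjo"])    # (0 + 1) / (9 + 1 * 3) = 1/12
```

Larger values of `alpha` pull all term probabilities towards a uniform distribution, which is why it is worth searching over a range rather than fixing it at 1.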

In [15]:
from sklearn.naive_bayes import MultinomialNB

In [16]:
vectorizer = TfidfVectorizer(max_df=0.9, 
                             min_df=0.01
                             )

mnb = MultinomialNB()


pipe = Pipeline(steps=
                [
                    ('tfidf', vectorizer),
                    ('mnb', mnb)
                ]
            )

param_grid = {
    'mnb__alpha': np.arange(0.2, 10, 0.2),
    'mnb__fit_prior':[True, False]
}

search = GridSearchCV(pipe, param_grid, n_jobs=-1, cv=5, verbose=3, return_train_score=True)
search.fit(X_train_lyrics, y_train)
search.score(X_test_lyrics, y_test)

Fitting 5 folds for each of 98 candidates, totalling 490 fits
[CV 2/5] END mnb__alpha=0.2, mnb__fit_prior=True;, score=(train=0.475, test=0.409) total time=   3.0s
[CV 4/5] END mnb__alpha=0.2, mnb__fit_prior=True;, score=(train=0.476, test=0.415) total time=   3.1s
[CV 1/5] END mnb__alpha=0.2, mnb__fit_prior=False;, score=(train=0.472, test=0.408) total time=   3.3s
[CV 2/5] END mnb__alpha=0.2, mnb__fit_prior=False;, score=(train=0.471, test=0.412) total time=   3.3s
[CV 3/5] END mnb__alpha=0.2, mnb__fit_prior=False;, score=(train=0.475, test=0.408) total time=   3.4s
[CV 4/5] END mnb__alpha=0.2, mnb__fit_prior=False;, score=(train=0.473, test=0.415) total time=   3.4s
[CV 1/5] END mnb__alpha=0.4, mnb__fit_prior=True;, score=(train=0.475, test=0.408) total time=   3.5s
[CV 1/5] END mnb__alpha=0.2, mnb__fit_prior=True;, score=(train=0.475, test=0.408) total time=   3.4s
[CV 2/5] END mnb__alpha=0.4, mnb__fit_prior=False;, score=(train=0.471, test=0.412) total time=   3.5s
[CV 5/5] END mn

0.41981792717086835

In [17]:
search.best_estimator_

In [18]:
with open(MODEL_PATH / 'naive_bayes_tf_idf.pkl', 'wb') as file:
    joblib.dump(search, file)

## <a id='toc4_4_'></a>[Random Forest and TF-IDF](#toc0_)

We will first run a parameter search over only `max_depth`, the maximum depth of each tree in the random forest.

Once we have found a suitable `max_depth`, we can look at the other parameters.
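
One reason to stage the search this way is cost. The sketch below counts the cross-validation fits (excluding the final refit) for the `max_depth` grid and for the `min_samples_split` / `min_samples_leaf` grid used later in this section, and compares them with a single joint grid. The helper `n_fits` is a hypothetical name introduced only for this illustration.

```python
from math import prod

CV_FOLDS = 5


def n_fits(param_grid, cv=CV_FOLDS):
    """Number of cross-validation fits for a single parameter grid."""
    return prod(len(values) for values in param_grid.values()) * cv


depth_grid = {"random_forest__max_depth": range(10, 26)}
leaf_split_grid = {
    "random_forest__min_samples_split": range(2, 21),
    "random_forest__min_samples_leaf": [1, 2, 5, 10],
}
joint_grid = {**depth_grid, **leaf_split_grid}

staged = n_fits(depth_grid) + n_fits(leaf_split_grid)
joint = n_fits(joint_grid)
print(staged, joint)  # 460 6080
```

The trade-off is that a staged search cannot detect interactions between `max_depth` and the other parameters, so the result is not guaranteed to match what a joint search would find.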

In [19]:
from sklearn.ensemble import RandomForestClassifier

In [20]:
vectorizer = TfidfVectorizer(max_df=0.9, 
                             min_df=0.01
                             )

random_forest = RandomForestClassifier(n_estimators=500, max_depth=10, n_jobs=-1)

pipe = Pipeline(steps=
                [
                    ('tfidf', vectorizer),
                    ('random_forest', random_forest)
                ]
            )

param_grid = {
    'random_forest__max_depth':np.arange(10,26),
}


search = GridSearchCV(pipe, param_grid, n_jobs=-1, cv=5, verbose=3, return_train_score=True)
search.fit(X_train_lyrics, y_train)
search.score(X_test_lyrics, y_test)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV 5/5] END random_forest__max_depth=10;, score=(train=0.688, test=0.397) total time=  34.9s
[CV 3/5] END random_forest__max_depth=10;, score=(train=0.693, test=0.414) total time=  35.5s
[CV 1/5] END random_forest__max_depth=10;, score=(train=0.679, test=0.403) total time=  35.4s
[CV 1/5] END random_forest__max_depth=11;, score=(train=0.735, test=0.407) total time=  37.0s
[CV 2/5] END random_forest__max_depth=10;, score=(train=0.680, test=0.414) total time=  35.9s
[CV 4/5] END random_forest__max_depth=11;, score=(train=0.731, test=0.405) total time=  36.6s
[CV 4/5] END random_forest__max_depth=10;, score=(train=0.679, test=0.408) total time=  35.4s
[CV 2/5] END random_forest__max_depth=11;, score=(train=0.738, test=0.412) total time=  37.3s
[CV 3/5] END random_forest__max_depth=11;, score=(train=0.747, test=0.413) total time=  37.6s
[CV 5/5] END random_forest__max_depth=11;, score=(train=0.745, test=0.396) total time=  37.9s



[CV 2/5] END random_forest__max_depth=21;, score=(train=0.953, test=0.415) total time= 1.5min
[CV 3/5] END random_forest__max_depth=21;, score=(train=0.954, test=0.418) total time= 1.5min
[CV 4/5] END random_forest__max_depth=21;, score=(train=0.947, test=0.415) total time= 1.5min
[CV 5/5] END random_forest__max_depth=21;, score=(train=0.953, test=0.402) total time= 1.5min
[CV 1/5] END random_forest__max_depth=22;, score=(train=0.959, test=0.409) total time= 1.5min
[CV 4/5] END random_forest__max_depth=22;, score=(train=0.955, test=0.410) total time= 1.4min
[CV 2/5] END random_forest__max_depth=22;, score=(train=0.957, test=0.412) total time= 1.6min
[CV 5/5] END random_forest__max_depth=22;, score=(train=0.958, test=0.404) total time= 1.6min
[CV 3/5] END random_forest__max_depth=22;, score=(train=0.958, test=0.417) total time= 1.7min
[CV 2/5] END random_forest__max_depth=23;, score=(train=0.964, test=0.414) total time= 1.7min
[CV 5/5] END random_forest__max_depth=23;, score=(train=0.96

0.4210434173669468

In [21]:
search.best_estimator_

We can see that at a `max_depth` of 25 we are overfitting the dataset, as the training score is much higher than the validation score. We can try increasing `min_samples_split` and `min_samples_leaf`, or lowering `max_features`, to reduce overfitting.
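
The gap between training and validation scores can be read off the verbose log directly. The sketch below parses a few of the log lines shown above and averages the scores per parameter setting. The regular expression and the `summarize` helper are assumptions written for this log format, not part of scikit-learn, and the `test` value in each log line is the validation-fold score.

```python
import re
from collections import defaultdict
from statistics import mean

LOG_LINE = re.compile(
    r"END (?P<params>.+?);, score=\(train=(?P<train>[\d.]+), "
    r"test=(?P<test>[\d.]+)\)"
)


def summarize(log_text):
    """Average train and validation scores per parameter setting."""
    scores = defaultdict(lambda: {"train": [], "test": []})
    for match in LOG_LINE.finditer(log_text):
        entry = scores[match["params"]]
        entry["train"].append(float(match["train"]))
        entry["test"].append(float(match["test"]))
    return {p: (mean(s["train"]), mean(s["test"])) for p, s in scores.items()}


# Log lines copied from the max_depth search output above.
log_text = """
[CV 5/5] END random_forest__max_depth=10;, score=(train=0.688, test=0.397) total time=  34.9s
[CV 3/5] END random_forest__max_depth=10;, score=(train=0.693, test=0.414) total time=  35.5s
[CV 1/5] END random_forest__max_depth=10;, score=(train=0.679, test=0.403) total time=  35.4s
[CV 1/5] END random_forest__max_depth=11;, score=(train=0.735, test=0.407) total time=  37.0s
[CV 2/5] END random_forest__max_depth=10;, score=(train=0.680, test=0.414) total time=  35.9s
[CV 4/5] END random_forest__max_depth=11;, score=(train=0.731, test=0.405) total time=  36.6s
[CV 4/5] END random_forest__max_depth=10;, score=(train=0.679, test=0.408) total time=  35.4s
[CV 2/5] END random_forest__max_depth=11;, score=(train=0.738, test=0.412) total time=  37.3s
[CV 3/5] END random_forest__max_depth=11;, score=(train=0.747, test=0.413) total time=  37.6s
[CV 5/5] END random_forest__max_depth=11;, score=(train=0.745, test=0.396) total time=  37.9s
"""

for params, (train, val) in summarize(log_text).items():
    print(f"{params}: train={train:.3f} val={val:.3f} gap={train - val:.3f}")
```

Even at the shallowest depths the training score already exceeds the validation score by a wide margin, and the gap grows with depth while the validation score barely moves.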

We will keep `max_depth` at 25 and try to reduce the overfitting.

Specifically, we will try the following values:
- `min_samples_split` in a range from 2 to 20
- `min_samples_leaf` values of 1, 2, 5 and 10

In [22]:
vectorizer = TfidfVectorizer(max_df=0.9, 
                             min_df=0.01
                             )

random_forest = RandomForestClassifier(n_estimators=500, max_depth=25, n_jobs=-1)

pipe = Pipeline(steps=
                [
                    ('tfidf', vectorizer),
                    ('random_forest', random_forest)
                ]
            )

param_grid = {
    'random_forest__min_samples_split':np.arange(2, 21, 1),
    'random_forest__min_samples_leaf':[1,2,5,10]
}


search = GridSearchCV(pipe, param_grid, n_jobs=-1, cv=5, verbose=3, return_train_score=True)
search.fit(X_train_lyrics, y_train)
search.score(X_test_lyrics, y_test)

Fitting 5 folds for each of 76 candidates, totalling 380 fits
[CV 2/5] END random_forest__min_samples_leaf=1, random_forest__min_samples_split=2;, score=(train=0.970, test=0.421) total time= 2.2min
[CV 1/5] END random_forest__min_samples_leaf=1, random_forest__min_samples_split=2;, score=(train=0.971, test=0.406) total time= 2.2min
[CV 1/5] END random_forest__min_samples_leaf=1, random_forest__min_samples_split=3;, score=(train=0.969, test=0.412) total time= 2.2min




[CV 4/5] END random_forest__min_samples_leaf=1, random_forest__min_samples_split=3;, score=(train=0.965, test=0.419) total time= 2.3min
[CV 4/5] END random_forest__min_samples_leaf=1, random_forest__min_samples_split=2;, score=(train=0.967, test=0.415) total time= 2.3min
[CV 2/5] END random_forest__min_samples_leaf=1, random_forest__min_samples_split=3;, score=(train=0.968, test=0.412) total time= 2.3min
[CV 3/5] END random_forest__min_samples_leaf=1, random_forest__min_samples_split=2;, score=(train=0.971, test=0.424) total time= 2.3min
[CV 3/5] END random_forest__min_samples_leaf=1, random_forest__min_samples_split=3;, score=(train=0.968, test=0.422) total time= 2.3min
[CV 3/5] END random_forest__min_samples_leaf=1, random_forest__min_samples_split=5;, score=(train=0.966, test=0.423) total time= 2.3min
[CV 2/5] END random_forest__min_samples_leaf=1, random_forest__min_samples_split=4;, score=(train=0.967, test=0.418) total time= 2.3min
[CV 1/5] END random_forest__min_samples_leaf=1, 

0.422093837535014

In [23]:
search.best_estimator_

We may have reduced the overfitting slightly, but the test accuracy of the model is essentially unchanged (0.422 versus 0.421 for the `max_depth`-only search).

In [24]:
with open(MODEL_PATH / 'random_forest_tf_idf_v1.pkl', 'wb') as file:
    joblib.dump(search, file)

---

# <a id='toc5_'></a>[Modeling With OpenAI Embeddings](#toc0_)

## <a id='toc5_1_'></a>[Setting up the Embeddings](#toc0_)

To work with the Ada embeddings, we first need to convert them from strings into arrays. This is carried out using the custom `utils.get_ada_embeddings()` function. See the `utilities` module for more information on this process.
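
The implementation of `utils.get_ada_embeddings()` is not shown in this notebook. As a minimal sketch of the general idea, and assuming each embedding is stored as a JSON-style list string such as `"[0.1, -0.2, 0.3]"`, the conversion could look like the following. The function name `parse_embeddings` and the sample data are hypothetical.

```python
import json

import numpy as np


def parse_embeddings(serialized):
    """Stack string-encoded vectors into a 2-D float array (n_rows, n_dims)."""
    return np.vstack([np.asarray(json.loads(s), dtype=float) for s in serialized])


# Two made-up 3-dimensional vectors; real Ada embeddings have 1536 dimensions.
raw = ["[0.1, -0.2, 0.3]", "[0.0, 0.5, -1.0]"]
embeddings = parse_embeddings(raw)
print(embeddings.shape)  # (2, 3)
```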

In [5]:
# Get the array of the ada embeddings. 
X_train_embeddings = utils.get_ada_embeddings(X_train_embeddings)
X_test_embeddings = utils.get_ada_embeddings(X_test_embeddings)

In [6]:
X_train_embeddings.shape, X_test_embeddings.shape

((22848, 1536), (5712, 1536))

## <a id='toc5_2_'></a>[Logistic Regression and Ada Embeddings](#toc0_)

In [27]:
# The Ada embeddings are already numeric vectors, so no TF-IDF step is needed here.
log_reg = LogisticRegression(penalty='l2', max_iter=500)


pipe = Pipeline(steps=
                [
                    ('log_reg', log_reg)
                ]
            )

param_grid = {
    'log_reg__C':[0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    'log_reg__solver':['newton-cg', 'saga']
}

search = GridSearchCV(pipe, param_grid, n_jobs=-1, cv=5, verbose=3, return_train_score=True)
search.fit(X_train_embeddings, y_train)
search.score(X_test_embeddings, y_test)

Fitting 5 folds for each of 14 candidates, totalling 70 fits
[CV 3/5] END log_reg__C=0.0001, log_reg__solver=newton-cg;, score=(train=0.352, test=0.347) total time=   9.2s
[CV 2/5] END log_reg__C=0.0001, log_reg__solver=newton-cg;, score=(train=0.349, test=0.346) total time=   9.4s
[CV 5/5] END log_reg__C=0.0001, log_reg__solver=newton-cg;, score=(train=0.352, test=0.350) total time=   9.4s
[CV 4/5] END log_reg__C=0.0001, log_reg__solver=newton-cg;, score=(train=0.343, test=0.345) total time=   9.6s
[CV 1/5] END log_reg__C=0.0001, log_reg__solver=newton-cg;, score=(train=0.353, test=0.349) total time=   9.8s
[CV 1/5] END log_reg__C=0.001, log_reg__solver=newton-cg;, score=(train=0.380, test=0.375) total time=  10.1s
[CV 4/5] END log_reg__C=0.001, log_reg__solver=newton-cg;, score=(train=0.378, test=0.387) total time=  10.3s
[CV 2/5] END log_reg__C=0.001, log_reg__solver=newton-cg;, score=(train=0.378, test=0.383) total time=  10.6s
[CV 5/5] END log_reg__C=0.001, log_reg__solver=newton-

0.42016806722689076

In [28]:
search.best_estimator_

In [29]:
# Saved under its own name so the PCA model saved later does not overwrite it.
with open(MODEL_PATH / 'log_reg_ada.pkl', 'wb') as file:
    joblib.dump(search, file)

## <a id='toc5_3_'></a>[Logistic Regression with PCA Components of the Ada Embeddings](#toc0_)

We will now try to reduce the dimensionality of the Ada embeddings with PCA and see if this improves our logistic regression model. We will also vary the number of components to see if this has any effect on model performance. From the above parameter search, we can see that a `C` of 0.0001 seems to impact the model negatively, so we will remove it from the parameter search.
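
Before searching over `n_components`, it can help to check how much variance a given number of components retains. The sketch below does this on synthetic data standing in for the embeddings; `X_demo` and its shape are assumptions for illustration only, and in this notebook the same check would be run on `X_train_embeddings`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the embeddings: 200 rows, 50 dimensions.
X_demo = rng.normal(size=(200, 50))

pca = PCA(n_components=20).fit(X_demo)
cumulative = np.cumsum(pca.explained_variance_ratio_)

for k in (10, 15, 20):
    print(f"{k} components retain {cumulative[k - 1]:.1%} of the variance")
```

If only a small share of the variance is retained at 10 to 20 components, that would suggest widening the `pca__n_components` grid.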

In [7]:
from sklearn.decomposition import PCA

In [8]:

log_reg = LogisticRegression(penalty='l2', max_iter=500)
pca = PCA(n_components=10)

pipe = Pipeline(steps=
                [
                    ('pca', pca),
                    ('log_reg', log_reg)
                ]
            )
param_grid = {
    'pca__n_components':[10, 15, 20],
    'log_reg__C':[0.001, 0.01, 0.1, 1, 10, 100],
    'log_reg__solver':['saga', 'newton-cg']
}
search = GridSearchCV(pipe, param_grid, n_jobs=-1, cv=5, verbose=3, return_train_score=True)
search.fit(X_train_embeddings, y_train)
search.score(X_test_embeddings, y_test)

Fitting 5 folds for each of 36 candidates, totalling 180 fits
[CV 1/5] END log_reg__C=0.001, log_reg__solver=newton-cg, pca__n_components=10;, score=(train=0.378, test=0.372) total time=   4.0s
[CV 1/5] END log_reg__C=0.001, log_reg__solver=saga, pca__n_components=10;, score=(train=0.378, test=0.372) total time=   4.4s
[CV 3/5] END log_reg__C=0.001, log_reg__solver=saga, pca__n_components=10;, score=(train=0.376, test=0.382) total time=   4.4s
[CV 4/5] END log_reg__C=0.001, log_reg__solver=saga, pca__n_components=10;, score=(train=0.376, test=0.387) total time=   4.6s
[CV 2/5] END log_reg__C=0.001, log_reg__solver=newton-cg, pca__n_components=10;, score=(train=0.377, test=0.384) total time=   4.5s
[CV 2/5] END log_reg__C=0.001, log_reg__solver=saga, pca__n_components=10;, score=(train=0.377, test=0.384) total time=   4.7s
[CV 3/5] END log_reg__C=0.001, log_reg__solver=saga, pca__n_components=15;, score=(train=0.376, test=0.383) total time=   4.9s
[CV 1/5] END log_reg__C=0.001, log_reg_



[CV 4/5] END log_reg__C=0.001, log_reg__solver=newton-cg, pca__n_components=20;, score=(train=0.376, test=0.387) total time=   4.3s
[CV 3/5] END log_reg__C=0.001, log_reg__solver=newton-cg, pca__n_components=20;, score=(train=0.376, test=0.383) total time=   4.5s
[CV 1/5] END log_reg__C=0.001, log_reg__solver=newton-cg, pca__n_components=20;, score=(train=0.378, test=0.373) total time=   4.6s
[CV 4/5] END log_reg__C=0.01, log_reg__solver=saga, pca__n_components=10;, score=(train=0.404, test=0.414) total time=   4.6s
[CV 5/5] END log_reg__C=0.001, log_reg__solver=newton-cg, pca__n_components=15;, score=(train=0.382, test=0.360) total time=   5.8s
[CV 2/5] END log_reg__C=0.01, log_reg__solver=saga, pca__n_components=10;, score=(train=0.405, test=0.404) total time=   5.4s
[CV 3/5] END log_reg__C=0.01, log_reg__solver=saga, pca__n_components=10;, score=(train=0.404, test=0.404) total time=   5.2s
[CV 4/5] END log_reg__C=0.001, log_reg__solver=newton-cg, pca__n_components=15;, score=(train=

0.42384453781512604

In [9]:
search.best_estimator_

In [10]:
with open(MODEL_PATH / 'log_reg_ada_pca.pkl', 'wb') as file:
    joblib.dump(search, file)

# <a id='toc6_'></a>[Predicting Popularity of Hip Hop Songs](#toc0_)

With such a large portion of our dataset belonging to hip hop, we wanted to see whether eliminating the other genres would let us better predict hip hop popularity. Restricting the data to a single genre will hopefully reduce the noise in the dataset.

## <a id='toc6_1_'></a>[Subsetting the Dataset for only Hip Hop](#toc0_)

In [3]:
hip_hop_df = df[
    df['genre'] == 'hip hop'
]

In [4]:
hip_hop_df.shape

(11785, 16)

In [5]:
hip_hop_df['spotify_popularity_three_class'].value_counts(normalize=True)

0    0.376750
1    0.329741
2    0.293509
Name: spotify_popularity_three_class, dtype: float64

We can see that we have reduced the dataset to approximately 12,000 songs (11,785 rows).
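
The class shares above also give a simple reference point: always predicting the most common class would score about 0.377 on this subset. The sketch below compares that majority-class baseline with the two test accuracies reported later in this section. The dictionary keys are labels chosen here for readability, and the baseline is computed on the full subset, which should closely match the stratified test split.

```python
# Class shares from the value_counts output above.
class_shares = {0: 0.376750, 1: 0.329741, 2: 0.293509}

# Test accuracies reported for the hip hop models in this section.
test_accuracy = {
    "log_reg_tfidf_hip_hop": 0.42002545608824776,
    "mnb_tfidf_hip_hop": 0.42469240560033944,
}

baseline = max(class_shares.values())
for name, acc in test_accuracy.items():
    print(f"{name}: accuracy={acc:.3f}, lift over baseline={acc - baseline:+.3f}")
```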

In [11]:
X=hip_hop_df[['cleaned_lyrics_stem']]
y=hip_hop_df['spotify_popularity_three_class']

# Create train and test splits and set a random state for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=33)

# Extract the stemmed lyrics as a Series for the TF-IDF vectorizer.
X_train_lyrics = X_train['cleaned_lyrics_stem']
X_test_lyrics = X_test['cleaned_lyrics_stem']


## <a id='toc6_2_'></a>[Modeling for Hip Hop Popularity](#toc0_)

### <a id='toc6_2_1_'></a>[Logistic Regression and TF-IDF](#toc0_)

In [12]:

vectorizer = TfidfVectorizer(max_df=0.9, 
                             min_df=0.01
                             )

log_reg = LogisticRegression(penalty='l2', max_iter=500)


pipe = Pipeline(steps=
                [
                    ('tfidf', vectorizer),
                    ('log_reg', log_reg)
                ]
            )

param_grid = {
    'log_reg__C':[0.0001, 0.001, 0.01, 0.1, 1, 10, 100],
    'log_reg__solver':['newton-cg', 'saga']
}

search = GridSearchCV(pipe, param_grid, n_jobs=-1, cv=5, verbose=3, return_train_score=True)
search.fit(X_train_lyrics, y_train)
search.score(X_test_lyrics, y_test)

Fitting 5 folds for each of 14 candidates, totalling 70 fits
[CV 2/5] END log_reg__C=0.0001, log_reg__solver=newton-cg;, score=(train=0.377, test=0.377) total time=   1.8s
[CV 4/5] END log_reg__C=0.0001, log_reg__solver=newton-cg;, score=(train=0.377, test=0.377) total time=   1.9s
[CV 1/5] END log_reg__C=0.0001, log_reg__solver=newton-cg;, score=(train=0.377, test=0.376) total time=   2.0s
[CV 3/5] END log_reg__C=0.0001, log_reg__solver=newton-cg;, score=(train=0.377, test=0.377) total time=   2.0s
[CV 1/5] END log_reg__C=0.001, log_reg__solver=newton-cg;, score=(train=0.377, test=0.376) total time=   2.0s
[CV 2/5] END log_reg__C=0.001, log_reg__solver=newton-cg;, score=(train=0.377, test=0.377) total time=   2.0s
[CV 4/5] END log_reg__C=0.001, log_reg__solver=newton-cg;, score=(train=0.377, test=0.377) total time=   2.0s
[CV 5/5] END log_reg__C=0.0001, log_reg__solver=newton-cg;, score=(train=0.377, test=0.377) total time=   2.2s
[CV 5/5] END log_reg__C=0.001, log_reg__solver=newton-

0.42002545608824776

In [13]:
with open(MODEL_PATH / 'log_reg_tfidf_hip_hop.pkl', 'wb') as file:
    joblib.dump(search, file)

### <a id='toc6_2_2_'></a>[Multinomial Naive Bayes and TF-IDF](#toc0_)

In [16]:
vectorizer = TfidfVectorizer(max_df=0.9, 
                             min_df=0.01
                             )

mnb = MultinomialNB()


pipe = Pipeline(steps=
                [
                    ('tfidf', vectorizer),
                    ('mnb', mnb)
                ]
            )

param_grid = {
    'mnb__alpha': np.arange(0.2, 10, 0.2),
    'mnb__fit_prior':[True, False]
}

search = GridSearchCV(pipe, param_grid, n_jobs=-1, cv=5, verbose=3, return_train_score=True)
search.fit(X_train_lyrics, y_train)
search.score(X_test_lyrics, y_test)

Fitting 5 folds for each of 98 candidates, totalling 490 fits
[CV 1/5] END mnb__alpha=0.2, mnb__fit_prior=True;, score=(train=0.566, test=0.409) total time=   1.6s
[CV 5/5] END mnb__alpha=0.2, mnb__fit_prior=True;, score=(train=0.569, test=0.416) total time=   1.6s
[CV 2/5] END mnb__alpha=0.2, mnb__fit_prior=True;, score=(train=0.560, test=0.433) total time=   1.7s
[CV 3/5] END mnb__alpha=0.2, mnb__fit_prior=True;, score=(train=0.554, test=0.433) total time=   1.7s
[CV 3/5] END mnb__alpha=0.2, mnb__fit_prior=False;, score=(train=0.575, test=0.426) total time=   1.7s
[CV 1/5] END mnb__alpha=0.4, mnb__fit_prior=True;, score=(train=0.564, test=0.408) total time=   1.7s
[CV 4/5] END mnb__alpha=0.2, mnb__fit_prior=True;, score=(train=0.569, test=0.425) total time=   1.8s
[CV 2/5] END mnb__alpha=0.4, mnb__fit_prior=True;, score=(train=0.558, test=0.436) total time=   1.8s
[CV 3/5] END mnb__alpha=0.4, mnb__fit_prior=True;, score=(train=0.554, test=0.433) total time=   1.8s
[CV 4/5] END mnb__a

0.42469240560033944

In [17]:
with open(MODEL_PATH / 'mnb_tfidf_hip_hop.pkl', 'wb') as file:
    joblib.dump(search, file)

# <a id='toc7_'></a>[Conclusion](#toc0_)

Here we have tuned a few models on the finalized dataset. We will evaluate them further in the next notebook, `7_model_evaluation`.