### Modeling using OneVsRest
---
**Goal:** Fit a multi-label classification model on the train set, then score it on the test set.

Run this notebook in Google Colab.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
import os
os.chdir("/content/gdrive/MyDrive/Colab/github/IMDbXMTC")

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)
import matplotlib.pyplot as plt
%matplotlib inline
import joblib

Import the train and test dataframes from the previous step. They are large files.

In [2]:
%%time
train = pd.read_csv('../dataset/Task3preprocessed/netflix_train_dataframe.tsv', sep='\t', index_col=0)
test = pd.read_csv('../dataset/Task3preprocessed/netflix_test_dataframe.tsv', sep='\t', index_col=0)

CPU times: total: 656 ms
Wall time: 795 ms


Put the genre column names and the non-word feature names into lists for easy use.

In [3]:
cols = list(train.columns.values)

In [4]:
genre_cols = cols[-42:]
print(len(genre_cols))
print(genre_cols)

42
['g_Independent Movies', 'g_Faith & Spirituality', 'g_Documentaries', 'g_LGBTQ Movies', 'g_International TV Shows', 'g_TV Thrillers', 'g_TV Dramas', 'g_Stand-Up Comedy & Talk Shows', 'g_Thrillers', 'g_Anime Features', 'g_Science & Nature TV', 'g_TV Horror', 'g_Movies', 'g_Korean TV Shows', 'g_Teen TV Shows', 'g_Action & Adventure', 'g_Crime TV Shows', 'g_Anime Series', 'g_Cult Movies', 'g_Docuseries', 'g_Sci-Fi & Fantasy', 'g_TV Sci-Fi & Fantasy', 'g_Dramas', 'g_Sports Movies', 'g_TV Comedies', 'g_Horror Movies', 'g_Stand-Up Comedy', 'g_British TV Shows', 'g_Music & Musicals', 'g_TV Action & Adventure', 'g_Spanish-Language TV Shows', 'g_TV Mysteries', 'g_Reality TV', 'g_TV Shows', 'g_Comedies', 'g_Romantic TV Shows', 'g_Romantic Movies', "g_Kids' TV", 'g_Classic Movies', 'g_International Movies', 'g_Classic & Cult TV', 'g_Children & Family Movies']


In [5]:
f_names = cols[:2]

Separate `X` and `y` from the train and test dataframes. We want JUST the genre columns for `y`, and everything except the genre columns and the non-word feature columns for `X`.

In [6]:
#X_train = train[train.columns[~train.columns.isin(genre_cols)]]
y_train = train[train.columns[ train.columns.isin(genre_cols)]]
X_train = train[train.columns[~train.columns.isin(genre_cols + f_names)]]

X_test = test[test.columns[~test.columns.isin(genre_cols + f_names)]]
y_test = test[test.columns[ test.columns.isin(genre_cols)]]
#X_test = test[test.columns[~test.columns.isin(genre_cols)]]
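The same split can also be written with `DataFrame.drop`. Below is a minimal sketch on a toy frame. The column names (`title_len`, `year`, `w_love`, `w_war`) are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Toy frame standing in for `train`: two leading non-word feature columns,
# two word columns, and two genre columns at the end (all names hypothetical).
toy = pd.DataFrame({
    "title_len": [3, 5, 2],
    "year": [2001, 2015, 2020],
    "w_love": [1, 0, 2],
    "w_war": [0, 3, 0],
    "g_Dramas": [1, 0, 1],
    "g_Comedies": [0, 1, 0],
})

toy_cols = list(toy.columns)
toy_genre_cols = toy_cols[-2:]  # label columns sit at the end of the frame
toy_f_names = toy_cols[:2]      # leading non-word feature columns

# y keeps only the genre columns; X drops the genres and the non-word features.
y_toy = toy[toy_genre_cols]
X_toy = toy.drop(columns=toy_genre_cols + toy_f_names)

print(list(X_toy.columns))
print(list(y_toy.columns))
```

Unlike the `isin` mask, `drop` raises a `KeyError` if a listed column is missing, which can catch a mismatch between the train and test files early.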

---

Before running a model, we need to scale our data. Both standard and min-max scaling were tested, and the standard scaler gave the better results.

In [8]:
%%time
# Scale data (Standard Scaler)
from sklearn.preprocessing import StandardScaler
my_standard_scaler = StandardScaler().fit(X_train)
X_train_s = my_standard_scaler.transform(X_train)
X_test_s = my_standard_scaler.transform(X_test)

joblib.dump(my_standard_scaler, 'models/netflix_standard_scaler.pkl')

CPU times: total: 125 ms
Wall time: 163 ms


['models/my_standard_scaler.pkl']

In [9]:
# Scale data (MinMax Scaler)
from sklearn.preprocessing import MinMaxScaler
my_minmax_scaler = MinMaxScaler().fit(X_train)
X_train_mm = my_minmax_scaler.transform(X_train)
X_test_mm = my_minmax_scaler.transform(X_test)

joblib.dump(my_minmax_scaler, 'models/netflix_scaler.pkl')

['models/my_minmax_scaler.pkl']
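Both scalers above are fit on the train set only and then applied to the test set. The sketch below uses a tiny made-up array (not the real data) to show what that means: the train columns end up with mean 0 and standard deviation 1, while the test row is shifted and scaled by the train statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numbers for illustration only.
X_tr = np.array([[0.0, 10.0],
                 [2.0, 30.0],
                 [4.0, 50.0]])
X_te = np.array([[2.0, 70.0]])

scaler = StandardScaler().fit(X_tr)  # learns mean and std from the train rows only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)      # reuses the train mean and std

print(X_tr_s.mean(axis=0))  # close to 0 for every column
print(X_tr_s.std(axis=0))   # close to 1 for every column
print(X_te_s)               # second value lies well outside the train range
```

Fitting the scaler on train and test together would leak test-set statistics into the model, so the test score would no longer be an honest estimate.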

---

### Please note
Many models were tested and pickled. The optimized model comes first. Everything after it is testing of other models, scalers, scoring methods, and hyperparameter tuning. I normally would not include all of it, but it remains for completeness.

In the end, OneVsRest with Logistic Regression (C=0.01, solver='lbfgs') on standard-scaled data was the best option.
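For a multi-label target, OneVsRest fits one independent binary classifier per label column. The sketch below illustrates this on synthetic data (random features and two made-up labels, not the Netflix data): the wrapper holds one estimator per label, and fitting a single logistic regression by hand on one label column gives the same predictions for that column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic data for illustration: each label depends on a different feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = np.column_stack([(X[:, 0] > 0).astype(int),
                     (X[:, 1] > 0).astype(int)])

ovr = OneVsRestClassifier(
    LogisticRegression(C=0.01, solver="lbfgs", max_iter=3000)
).fit(X, Y)

# One fitted binary estimator per label column.
print(len(ovr.estimators_))

# Fitting label 0 by hand with the same settings reproduces column 0.
manual = LogisticRegression(C=0.01, solver="lbfgs", max_iter=3000).fit(X, Y[:, 0])
matches = np.array_equal(manual.predict(X), ovr.predict(X)[:, 0])
print(matches)
```

Because the per-label models are independent, correlations between genres are not used by this setup.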

In [10]:
import joblib
#my_model = joblib.load('models/netflix_1vr_linear_svc_default.pkl')

In [11]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

In [27]:
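from sklearn.model_selection import cross_val_score
my_log_model = OneVsRestClassifier(LogisticRegression(random_state=123, solver='lbfgs', max_iter=3000, C=0.01, n_jobs=-1), n_jobs=-1)

# With no `scoring` argument, cross_val_score uses the estimator's own .score(),
# which for multi-label targets is subset (exact-match) accuracy.
scores = cross_val_score(my_log_model, X_train_s, y_train, cv=5)
print(scores)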

for i in range(len(scores)) :
    print(f"Fold {i+1}: {scores[i]}")
print(f"Average Score:{np.mean(scores)}")

[0.08402725 0.10370931 0.08629826 0.08856927 0.07948524]
Fold 1: 0.08402725208175625
Fold 2: 0.10370931112793338
Fold 3: 0.08629825889477669
Fold 4: 0.08856926570779712
Fold 5: 0.07948523845571537
Average Score:0.08841786525359575


In [28]:
%%time
my_log_model = OneVsRestClassifier(LogisticRegression(random_state=123, solver='lbfgs', max_iter=3000, C=0.01, n_jobs=-1), n_jobs=-1).fit(X_train_s, y_train)

CPU times: user 214 ms, sys: 49.7 ms, total: 264 ms
Wall time: 15.8 s


In [29]:
y_train_pred = my_log_model.predict(X_train_s)
y_train_proba = my_log_model.predict_proba(X_train_s)
y_test_pred = my_log_model.predict(X_test_s)
y_test_proba = my_log_model.predict_proba(X_test_s)

In [30]:
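from sklearn.metrics import accuracy_score
# For multi-label targets, accuracy_score is subset accuracy:
# a row counts as correct only if every genre label matches.
print(f'Training score: {accuracy_score(y_train, y_train_pred):0.5f}')
print(f'    Test score: {accuracy_score(y_test, y_test_pred):0.5f}')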

Training score: 0.24557
    Test score: 0.08583
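These scores look low because `accuracy_score` on a multi-label target counts a row as correct only when all labels match. The toy example below (hand-written label matrices, not model output) contrasts that subset accuracy with per-label accuracy and Hamming loss.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Hand-written toy labels: 4 samples, 3 labels.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 0, 0],
                   [1, 0, 0],
                   [0, 0, 0]])

subset_acc = accuracy_score(y_true, y_pred)      # fraction of rows matched exactly
per_label_acc = (y_true == y_pred).mean(axis=0)  # accuracy of each label column
ham = hamming_loss(y_true, y_pred)               # fraction of wrong label cells

print(subset_acc)     # 0.25: only the first row matches on every label
print(per_label_acc)  # [1.0, 0.5, 0.75]
print(ham)            # 0.25: 3 wrong cells out of 12
```

With 42 genres, a single wrong label anywhere in a row makes the whole row count as a miss, so subset accuracy can be far below the per-genre accuracies.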


In [31]:
y_pred_df = pd.DataFrame(y_test_pred, columns=genre_cols)

# Test set predictions
for g in genre_cols:
    score = accuracy_score(y_test[g], y_pred_df[g])
    print(f'{score:0.4f}  {g}')

0.8987  g_Independent Movies
0.9927  g_Faith & Spirituality
0.9314  g_Documentaries
0.9914  g_LGBTQ Movies
0.8279  g_International TV Shows
0.9923  g_TV Thrillers
0.8946  g_TV Dramas
0.9923  g_Stand-Up Comedy & Talk Shows
0.9223  g_Thrillers
0.9932  g_Anime Features
0.9905  g_Science & Nature TV
0.9891  g_TV Horror
0.9936  g_Movies
0.9837  g_Korean TV Shows
0.9927  g_Teen TV Shows
0.9005  g_Action & Adventure
0.9387  g_Crime TV Shows
0.9805  g_Anime Series
0.9936  g_Cult Movies
0.9655  g_Docuseries
0.9696  g_Sci-Fi & Fantasy
0.9927  g_TV Sci-Fi & Fantasy
0.7439  g_Dramas
0.9791  g_Sports Movies
0.9214  g_TV Comedies
0.9623  g_Horror Movies
0.9777  g_Stand-Up Comedy
0.9687  g_British TV Shows
0.9605  g_Music & Musicals
0.9782  g_TV Action & Adventure
0.9791  g_Spanish-Language TV Shows
0.9886  g_TV Mysteries
0.9709  g_Reality TV
0.9995  g_TV Shows
0.8102  g_Comedies
0.9569  g_Romantic TV Shows
0.9187  g_Romantic Movies
0.9605  g_Kids' TV
0.9868  g_Classic Movies
0.7039  g_International 
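Per-genre accuracy should be read with class balance in mind. When a genre is rare, predicting "not this genre" for every title already scores very high. The sketch below uses a made-up label with 2 positives in 100 rows (not a real genre) and compares accuracy with F1 for an all-zero predictor.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Made-up rare label: 2 positives out of 100 samples.
y_true_rare = np.zeros(100, dtype=int)
y_true_rare[:2] = 1

# A "model" that never predicts the label.
y_all_zero = np.zeros(100, dtype=int)

baseline_acc = accuracy_score(y_true_rare, y_all_zero)
baseline_f1 = f1_score(y_true_rare, y_all_zero, zero_division=0)

print(baseline_acc)  # 0.98, despite finding none of the positives
print(baseline_f1)   # 0.0
```

Comparing each genre's accuracy against its all-zero baseline, or reporting per-genre F1 alongside it, is one way to tell whether the model has learned anything for the rare genres.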

In [32]:
joblib.dump(my_log_model, 'models/netflix_logistic_model.pkl')

['models/my_logistic_model.pkl']
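To reuse the saved model later, the saved scaler has to be loaded as well and applied before predicting. The round trip below is a small sketch on synthetic data, writing to a temporary directory. The file names `demo_scaler.pkl` and `demo_model.pkl` are placeholders, not the files saved in this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with two made-up labels.
rng = np.random.default_rng(1)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
Y_demo = np.column_stack([(X_demo[:, 0] > 5).astype(int),
                          (X_demo[:, 1] > 5).astype(int)])

scaler = StandardScaler().fit(X_demo)
model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(
    scaler.transform(X_demo), Y_demo
)

with tempfile.TemporaryDirectory() as tmp:
    scaler_path = os.path.join(tmp, "demo_scaler.pkl")  # placeholder name
    model_path = os.path.join(tmp, "demo_model.pkl")    # placeholder name
    joblib.dump(scaler, scaler_path)
    joblib.dump(model, model_path)
    loaded_scaler = joblib.load(scaler_path)
    loaded_model = joblib.load(model_path)

# New rows must go through the same scaler before prediction.
X_new = rng.normal(loc=5.0, scale=2.0, size=(3, 4))
pred_reloaded = loaded_model.predict(loaded_scaler.transform(X_new))
pred_original = model.predict(scaler.transform(X_new))
same_predictions = np.array_equal(pred_reloaded, pred_original)
print(same_predictions)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that saved them, so it helps to record the version next to the `.pkl` files.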

In [22]:
# Per-genre test accuracy, keyed by genre name without the 'g_' prefix
test_acc_dict = {}
for g in genre_cols:
    score = accuracy_score(y_test[g], y_pred_df[g])
    test_acc_dict[g[2:]] = score
    print(f'{score:0.4f}  {g}')

0.8588  g_Independent Movies
0.9809  g_Faith & Spirituality
0.8638  g_Documentaries
0.9673  g_LGBTQ Movies
0.8038  g_International TV Shows
0.9850  g_TV Thrillers
0.8492  g_TV Dramas
0.9877  g_Stand-Up Comedy & Talk Shows
0.8542  g_Thrillers
0.9732  g_Anime Features
0.9764  g_Science & Nature TV
0.9768  g_TV Horror
0.9809  g_Movies
0.9410  g_Korean TV Shows
0.9832  g_Teen TV Shows
0.8669  g_Action & Adventure
0.8778  g_Crime TV Shows
0.9455  g_Anime Series
0.9723  g_Cult Movies
0.9005  g_Docuseries
0.9105  g_Sci-Fi & Fantasy
0.9759  g_TV Sci-Fi & Fantasy
0.7307  g_Dramas
0.9364  g_Sports Movies
0.8601  g_TV Comedies
0.9028  g_Horror Movies
0.9555  g_Stand-Up Comedy
0.9078  g_British TV Shows
0.9096  g_Music & Musicals
0.9346  g_TV Action & Adventure
0.9446  g_Spanish-Language TV Shows
0.9668  g_TV Mysteries
0.9364  g_Reality TV
0.9950  g_TV Shows
0.7934  g_Comedies
0.8892  g_Romantic TV Shows
0.8438  g_Romantic Movies
0.9042  g_Kids' TV
0.9523  g_Classic Movies
0.6862  g_International 