### Modeling using OneVsRest

The Multi-Label Text Classification (MLTC) task is a form of text classification in which each text instance can be assigned to multiple categories rather than just a single category.
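As a small illustration of what "multiple categories per instance" looks like in practice, the sketch below encodes a few made-up titles as a binary indicator matrix, one column per genre. The titles' genre lists are hypothetical; only `MultiLabelBinarizer` from scikit-learn is assumed.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical titles, each tagged with one or more genres
genres_per_title = [
    ["Dramas", "International Movies"],
    ["Comedies"],
    ["Dramas", "Comedies", "Romantic Movies"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres_per_title)  # shape: (n_titles, n_genres)

print(mlb.classes_)  # genre names in sorted order
print(Y)             # a row may contain several 1s: that is the multi-label case
```

Each row of `Y` plays the role of one row of the `g_*` target columns used later in this notebook.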

**Goal:** Fit a multi-label classification model on the training set, then score it on the test set.

---

This notebook is set up to run on Google Colab. To run it locally, comment out the following two cells.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
import os
os.chdir("/content/gdrive/MyDrive/Colab/github/XMTC")

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 500)
import matplotlib.pyplot as plt
%matplotlib inline
import joblib

Import the train and test dataframes produced in the previous step (3.2).

In [2]:
%%time
train = pd.read_csv('../dataset/Task3preprocessed/netflix_train_dataframe.tsv', sep='\t', index_col=0)
test = pd.read_csv('../dataset/Task3preprocessed/netflix_test_dataframe.tsv', sep='\t', index_col=0)

CPU times: total: 688 ms
Wall time: 813 ms


Put the genre and other column names into lists for easy use later.

In [3]:
cols = list(train.columns.values)

In [4]:
# Filter out categorized target columns
genre_cols = cols[-42:]
print(len(genre_cols))
print(genre_cols)

42
['g_Independent Movies', 'g_Faith & Spirituality', 'g_Documentaries', 'g_LGBTQ Movies', 'g_International TV Shows', 'g_TV Thrillers', 'g_TV Dramas', 'g_Stand-Up Comedy & Talk Shows', 'g_Thrillers', 'g_Anime Features', 'g_Science & Nature TV', 'g_TV Horror', 'g_Movies', 'g_Korean TV Shows', 'g_Teen TV Shows', 'g_Action & Adventure', 'g_Crime TV Shows', 'g_Anime Series', 'g_Cult Movies', 'g_Docuseries', 'g_Sci-Fi & Fantasy', 'g_TV Sci-Fi & Fantasy', 'g_Dramas', 'g_Sports Movies', 'g_TV Comedies', 'g_Horror Movies', 'g_Stand-Up Comedy', 'g_British TV Shows', 'g_Music & Musicals', 'g_TV Action & Adventure', 'g_Spanish-Language TV Shows', 'g_TV Mysteries', 'g_Reality TV', 'g_TV Shows', 'g_Comedies', 'g_Romantic TV Shows', 'g_Romantic Movies', "g_Kids' TV", 'g_Classic Movies', 'g_International Movies', 'g_Classic & Cult TV', 'g_Children & Family Movies']


In [5]:
# Distinguish columns not used in this task
f_names = cols[:2]

Separate `X` and `y` from the train and test dataframes. We want only the genre columns for `y`, and everything except the genre columns and the two unused columns in `f_names` for `X`.

In [6]:
y_train = train[train.columns[ train.columns.isin(genre_cols)]]
X_train = train[train.columns[~train.columns.isin(genre_cols + f_names)]]

X_test = test[test.columns[~test.columns.isin(genre_cols + f_names)]]
y_test = test[test.columns[ test.columns.isin(genre_cols)]]
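The same split can be checked on a toy frame. This is only a sketch: the column names `show_id`, `title`, `feat_1`, and `feat_2` are hypothetical stand-ins, and `drop(columns=...)` is used as an equivalent alternative to the `isin` mask above.

```python
import pandas as pd

# Hypothetical toy frame: two unused leading columns, two features, two genre flags
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["A", "B", "C"],
    "feat_1": [0.1, 0.5, 0.9],
    "feat_2": [3, 1, 2],
    "g_Dramas": [1, 0, 1],
    "g_Comedies": [0, 1, 1],
})

toy_cols = list(toy.columns)
toy_f_names = toy_cols[:2]       # leading columns not used as features
toy_genre_cols = toy_cols[-2:]   # trailing target columns

y_toy = toy[toy_genre_cols]
X_toy = toy.drop(columns=toy_genre_cols + toy_f_names)

# Features and targets should never share a column
print(X_toy.columns.tolist(), y_toy.columns.tolist())
```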

---

Before fitting a model, we need to scale the data. Both standard and min-max scaling were tested, and the standard scaler gave the better results.

In [7]:
%%time
# Scale data (Standard Scaler)
from sklearn.preprocessing import StandardScaler

# Creating a StandardScaler object and fitting it to the training data
my_standard_scaler = StandardScaler().fit(X_train)

# Transforming the data using the fitted scaler
X_train_s = my_standard_scaler.transform(X_train)
X_test_s = my_standard_scaler.transform(X_test)

# Saving the fitted scaler model to a file using joblib
joblib.dump(my_standard_scaler, 'models/netflix_standard_scaler.pkl')

CPU times: total: 250 ms
Wall time: 906 ms


['models/netflix_standard_scaler.pkl']

In [8]:
# Scale data (MinMax Scaler)
from sklearn.preprocessing import MinMaxScaler
my_minmax_scaler = MinMaxScaler().fit(X_train)
X_train_mm = my_minmax_scaler.transform(X_train)
X_test_mm = my_minmax_scaler.transform(X_test)

joblib.dump(my_minmax_scaler, 'models/netflix_scaler.pkl')

['models/netflix_scaler.pkl']
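To make the difference between the two scalers concrete, here is a minimal sketch on a single toy column. It also shows a `joblib` round trip, which is how a saved scaler could later be reloaded and applied to unseen data. The arrays and the temporary file name `demo_scaler.pkl` are made up for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_fit = np.array([[1.0], [2.0], [3.0]])  # toy training column
X_new = np.array([[4.0]])                # toy unseen value

std_scaler = StandardScaler().fit(X_fit)  # (x - mean) / std
mm_scaler = MinMaxScaler().fit(X_fit)     # (x - min) / (max - min)

std_fit = std_scaler.transform(X_fit).ravel()  # centered on 0, unit variance
mm_fit = mm_scaler.transform(X_fit).ravel()    # squeezed into [0, 1]
print(std_fit, mm_fit)

# Persist the fitted scaler and reload it, as one would before scoring new data
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_scaler.pkl")
    joblib.dump(std_scaler, path)
    reloaded = joblib.load(path)

std_new = std_scaler.transform(X_new)
reloaded_new = reloaded.transform(X_new)  # same statistics, same result
```

Note that both scalers are fitted on the training data only; the test data is transformed with the training statistics, as in the cells above.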

Many models were tested and pickled.

In the end, OneVsRest with Logistic Regression (C=0.01, solver='lbfgs') on standard-scaled data was the best option.

**Below is the final best model.**

In [9]:
import joblib

In [10]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

---

# One-vs-Rest

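In machine learning, One-vs-Rest (OvR) is a strategy for multi-class classification. Faced with a multi-class problem, OvR turns it into a combination of binary classification problems: for each class, one binary classifier is trained to distinguish that class from all the others. At prediction time, the outputs of all the binary classifiers are compared and the class with the highest probability is chosen as the final prediction. In the multi-label setting used here, the binary decisions are instead kept independently, so a sample receives every label whose classifier predicts positive.

In our work, the OvR strategy is implemented with the `OneVsRestClassifier` class from scikit-learn. Logistic Regression serves as the base model, so each class has its own binary Logistic Regression classifier. This strategy is intuitive and effective for multi-class problems because it decomposes the original problem into a series of simpler binary classification problems.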
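The decomposition can be seen directly on synthetic data. The sketch below is not the notebook's model: it uses `make_multilabel_classification` to generate a made-up dataset with 4 labels and default Logistic Regression settings apart from `max_iter`.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data: 200 samples, 10 features, 4 labels (illustrative only)
X_demo, Y_demo = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=4, random_state=0
)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, Y_demo)

# One fitted binary classifier per label column
print(len(ovr.estimators_))

# Predictions come back as an indicator matrix with the same layout as Y_demo
Y_pred = ovr.predict(X_demo)
print(Y_pred.shape)
```

With the 42 genre columns of this notebook, the fitted model would hold 42 such binary classifiers.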

---

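# Cross-validation

In machine learning, cross-validation is used to evaluate a model's performance more reliably, especially when the dataset is small. It helps ensure that the evaluation is not overly affected by the randomness of a single train/test split.

The basic idea is to split the data into a training set and a test set and to repeat this process several times with a different split each time. The most common form is k-fold cross-validation: the dataset is divided into k subsets (folds), and the model is trained and tested k times, each time holding out a different fold. Every fold therefore serves as the test set once, and the performance estimate is the average of the k evaluations.

The advantages of cross-validation are:

* Lower risk of being misled by overfitting: because the model is evaluated on several different train/test splits, a good score is more likely to reflect real patterns in the data than dependence on one particular training set.

* More reliable performance estimates: averaging over several evaluations reduces the chance that the result is an accident of one particular split.

* Better use of the data: every sample appears in both the training and the test role across the folds, which makes more efficient use of the available data.

In our work, `cross_val_score` is called with 5-fold cross-validation (k=5). The model is trained and tested five times on five different splits of the data, which returns five scores; we print these scores and their mean to evaluate the model. Because no `scoring` argument is passed, each score is the estimator's default `score`, which for multi-label targets is subset accuracy: a sample counts as correct only if all 42 labels are predicted correctly.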
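The k-fold procedure described above can be written out by hand and compared against `cross_val_score`. This is a sketch on synthetic data, assuming an unshuffled `KFold` split (which is what an integer `cv` resolves to for multi-label targets); the dataset and model settings are illustrative and not those of this notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data (illustrative only)
X_demo, Y_demo = make_multilabel_classification(
    n_samples=300, n_features=10, n_classes=4, random_state=0
)
model = OneVsRestClassifier(LogisticRegression(max_iter=1000))

# Manual 5-fold loop: fit on four folds, score on the held-out fold
manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    fold_model = clone(model).fit(X_demo[train_idx], Y_demo[train_idx])
    # .score() is subset accuracy for multi-label targets
    manual_scores.append(fold_model.score(X_demo[val_idx], Y_demo[val_idx]))

# The helper performs the same loop internally
auto_scores = cross_val_score(model, X_demo, Y_demo, cv=5)

print(np.round(manual_scores, 4))
print(np.round(auto_scores, 4))
print(f"Mean: {np.mean(auto_scores):.4f}")
```

A different metric could be requested through the `scoring` parameter of `cross_val_score` if subset accuracy is considered too strict.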

In [11]:

from sklearn.model_selection import cross_val_score

# Creating a One-vs-Rest classifier using Logistic Regression as the base model
my_log_model = OneVsRestClassifier(LogisticRegression(random_state=123, solver='lbfgs', max_iter=3000, C=0.01, n_jobs=-1), n_jobs=-1)

# Performing cross-validation and obtaining the scores
scores = cross_val_score(my_log_model, X_train_s, y_train, cv = 5)
print(scores)

# Printing individual fold scores
for i in range(len(scores)) :
    print(f"Fold {i+1}: {scores[i]}")
print(f"Average Score:{np.mean(scores)}")

[0.08402725 0.10370931 0.08629826 0.08856927 0.07948524]
Fold 1: 0.08402725208175625
Fold 2: 0.10370931112793338
Fold 3: 0.08629825889477669
Fold 4: 0.08856926570779712
Fold 5: 0.07948523845571537
Average Score:0.08841786525359575


In [12]:
%%time
# fit (data to a model)
my_log_model = OneVsRestClassifier(LogisticRegression(random_state=123, solver='lbfgs', max_iter=3000, C=0.01, n_jobs=-1), n_jobs=-1).fit(X_train_s, y_train)

CPU times: total: 203 ms
Wall time: 10.8 s


---
Predict labels and per-label probabilities for the train and test sets.

In [13]:
y_train_pred = my_log_model.predict(X_train_s)
y_train_proba = my_log_model.predict_proba(X_train_s)
y_test_pred = my_log_model.predict(X_test_s)
y_test_proba = my_log_model.predict_proba(X_test_s)
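The probability matrices returned by `predict_proba` have one column per label, which makes it possible to apply a custom decision threshold instead of the default one used by `predict`. The sketch below illustrates this on synthetic data; the threshold value 0.3 is an arbitrary assumption, not a tuned setting from this notebook.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data (illustrative only)
X_demo, Y_demo = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=4, random_state=0
)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, Y_demo)

proba = ovr.predict_proba(X_demo)   # shape (n_samples, n_labels), values in [0, 1]
default_pred = ovr.predict(X_demo)  # default decision rule of each binary classifier

# Hypothetical lower threshold: assign a label whenever its probability is at least 0.3
loose_pred = (proba >= 0.3).astype(int)

print(default_pred.sum(), loose_pred.sum())  # the lower threshold assigns at least as many labels
```

Lowering the threshold trades precision for recall, which can matter for rare genres that a 0.5 cut-off almost never predicts.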

In [14]:
from sklearn.metrics import accuracy_score
print(f'Training score: {accuracy_score(y_train, y_train_pred):0.5f}')
print(f'    Test score: {accuracy_score(y_test, y_test_pred):0.5f}')

Training score: 0.24557
    Test score: 0.08583
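These scores look low because `accuracy_score` on a multi-label target computes subset accuracy: a row is correct only if every label matches. A tiny hand-made example (the arrays below are invented for illustration) shows how this differs from label-wise measures such as Hamming loss.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Made-up ground truth and predictions: 4 samples, 3 labels
y_true_demo = np.array([[1, 0, 1],
                        [0, 1, 0],
                        [1, 1, 0],
                        [0, 0, 1]])
y_pred_demo = np.array([[1, 0, 0],
                        [0, 1, 0],
                        [1, 0, 0],
                        [0, 0, 1]])

# Subset accuracy: only rows 2 and 4 match exactly
subset_acc = accuracy_score(y_true_demo, y_pred_demo)

# Hamming loss: fraction of individual label cells that are wrong (2 of 12)
ham = hamming_loss(y_true_demo, y_pred_demo)

# Per-label accuracy: computed column by column
per_label_acc = (y_true_demo == y_pred_demo).mean(axis=0)

print(subset_acc, ham, per_label_acc)
```

Here the subset accuracy is 0.5 while the mean per-label accuracy is about 0.83, so the two kinds of score are not directly comparable.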


---
Prediction results for the test set: the accuracy is calculated for each genre and printed.

In [15]:
y_pred_df = pd.DataFrame(y_test_pred, columns=genre_cols)

# Test set predictions
for g in genre_cols:
    score = accuracy_score(y_test[g], y_pred_df[g])
    print(f'{score:0.4f}  {g}')

0.8987  g_Independent Movies
0.9927  g_Faith & Spirituality
0.9314  g_Documentaries
0.9914  g_LGBTQ Movies
0.8279  g_International TV Shows
0.9923  g_TV Thrillers
0.8946  g_TV Dramas
0.9923  g_Stand-Up Comedy & Talk Shows
0.9223  g_Thrillers
0.9932  g_Anime Features
0.9905  g_Science & Nature TV
0.9891  g_TV Horror
0.9936  g_Movies
0.9837  g_Korean TV Shows
0.9927  g_Teen TV Shows
0.9005  g_Action & Adventure
0.9387  g_Crime TV Shows
0.9805  g_Anime Series
0.9936  g_Cult Movies
0.9655  g_Docuseries
0.9696  g_Sci-Fi & Fantasy
0.9927  g_TV Sci-Fi & Fantasy
0.7439  g_Dramas
0.9791  g_Sports Movies
0.9214  g_TV Comedies
0.9623  g_Horror Movies
0.9777  g_Stand-Up Comedy
0.9687  g_British TV Shows
0.9605  g_Music & Musicals
0.9782  g_TV Action & Adventure
0.9791  g_Spanish-Language TV Shows
0.9886  g_TV Mysteries
0.9709  g_Reality TV
0.9995  g_TV Shows
0.8102  g_Comedies
0.9569  g_Romantic TV Shows
0.9187  g_Romantic Movies
0.9605  g_Kids' TV
0.9868  g_Classic Movies
0.7039  g_International 
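Per-genre accuracies close to 1.0 should be read with care, since a rare genre can reach a high accuracy without ever being predicted. The sketch below uses an invented label with 1% positives to show that an all-zero prediction already scores 0.99 accuracy, while its F1 score is 0. The data is hypothetical and says nothing about which genres in this notebook behave this way.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical rare label: 1 positive among 100 samples
y_true_rare = np.zeros(100, dtype=int)
y_true_rare[0] = 1

# A "model" that never predicts the label
y_pred_none = np.zeros(100, dtype=int)

acc = accuracy_score(y_true_rare, y_pred_none)
# zero_division=0 avoids a warning when no positive is predicted
f1 = f1_score(y_true_rare, y_pred_none, zero_division=0)

print(f"accuracy={acc:.2f}  f1={f1:.2f}")
```

Comparing each genre's accuracy with the share of negatives in that column, or reporting F1 alongside it, would give a fairer picture.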

In [16]:
joblib.dump(my_log_model, 'models/netflix_logistic_model.pkl') # save the model

['models/netflix_logistic_model.pkl']

In [17]:
test_acc_dict = {}
# Test set predictions
for g in genre_cols:
    score = accuracy_score(y_test[g], y_pred_df[g])
    test_acc_dict.update( {g[2:] : score} )
    print(f'{score:0.4f}  {g}')
