<h2>Machine Learning</h2>
<br>
<p>Model learning and predicting</p>

In [None]:
#pip install catboost

In [59]:
import numpy as np
from numpy import arange
from numpy import absolute
from numpy import mean
from numpy import std
import pandas as pd
#-------sklearn etc:
import sklearn
from sklearn import linear_model
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder  # scaling may be skipped: it matters little for plain linear regression and tree-based models
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score, make_scorer
from sklearn.linear_model import Lasso
#-------metrics:
from sklearn import metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
#-------plots:
from matplotlib import pyplot as plt
#-------Extra models:
from catboost import CatBoostRegressor

<h4>Functions:</h4>

In [2]:
def split_train_test(df):
    y = df['Mean Score']
    X = df.drop('Mean Score', axis = 1)
    return train_test_split(X, y, test_size=0.33, random_state=42)

In [3]:
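def r_adjust(y, y_pred, **kwargs):
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    # where n = number of samples scored and p = number of features.
    res = r2_score(y, y_pred)
    n, p = len(y), kwargs['X'].shape[1]
    Adj_r2 = 1 - (1 - res) * (n - 1) / (n - p - 1)
    return Adj_r2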
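
For reference, adjusted R² rescales the unexplained share of variance, (1 - R²), by (n - 1) / (n - p - 1), where n is the number of samples and p the number of features. The sketch below is a standalone illustration of that formula; the helper name `adjusted_r2` and the numbers are made up for the example and are not part of the notebook.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Illustrative numbers only: R^2 of 0.50 on 101 samples with 20 features.
example = adjusted_r2(0.50, n_samples=101, n_features=20)
print(round(example, 3))  # 0.375
```

The penalty grows with the number of features relative to the number of samples, so with thousands of rows and a few dozen columns the adjusted value stays close to the plain R².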

In [65]:
def run_cat_boost_model(model_class, df, params):
    model = model_class(verbose=False, one_hot_max_size=255)
    X_train, X_test, y_train, y_test = split_train_test(df)
    grid_search = GridSearchCV(estimator=model, param_grid=params,
                               cv=5, n_jobs=3, verbose=1,
                               scoring=make_scorer(r_adjust, greater_is_better=True, X=X_train))
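    model.fit(X_train, y_train)        # baseline: model with its default hyperparameters
    grid_search.fit(X_train, y_train)  # tuned: best combination found in the grid
    best_model = grid_search.best_estimator_
    y_predict = model.predict(X_test)
    y_predict_best = best_model.predict(X_test)
    try:
        print(f"Best hyperparams : {best_model.__dict__['_init_params']}")
    except (KeyError, AttributeError):
        print(f"Best hyperparams: {best_model}")
    # Returns (R^2 of the default model, R^2 of the grid-searched model) on the test set.
    return r2_score(y_test, y_predict), r2_score(y_test, y_predict_best)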

<h4>Main Body:</h4>

In [29]:
df = pd.read_csv('./clean_anime_dataframe')
df

Unnamed: 0,Format,Mean Score,Popularity,Favorites,Source,Action,Adventure,Comedy,Drama,Ecchi,...,Mecha,Music,Mystery,Psychological,Romance,Sci-Fi,Slice of Life,Sports,Supernatural,Thriller
0,0,66,6050,69,0,0,0,1,1,0,...,0,0,0,0,1,0,0,0,0,0
1,0,72,6307,98,0,1,1,0,1,0,...,0,0,0,0,0,1,0,0,0,0
2,1,60,6176,48,0,0,0,1,0,0,...,1,0,0,0,0,1,1,0,0,0
3,0,61,43462,656,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,1,0
4,2,76,58959,797,1,0,0,1,1,0,...,0,0,0,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9522,2,61,22,1,3,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
9523,5,62,73,2,3,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
9524,4,61,71,2,0,1,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
9525,5,66,159,2,3,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


<p>First, let's split the data into training and testing sets, separating out the value we want to predict, 'Mean Score'.</p>

In [30]:
X_train, X_test, y_train, y_test = split_train_test(df)

<p>After splitting the data, let's create a linear regression model and train it using every column of the dataframe as-is, without scaling or removing unnecessary features.</p>

In [31]:
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

LinearRegression()

In [32]:
predict_val = lr_model.predict(X_test)

In [61]:
score = r2_score(y_test, predict_val)
score

0.4770840773618533

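<p>The first model did poorly, with an R² of about 0.24, meaning it explains roughly 24% of the variance in Mean Score (R² is not a classification accuracy). The value displayed above, 0.477, appears to come from a later re-run of this cell (note its execution count, 61), after the log transform applied below. Let's try to improve it.</p>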

In [34]:
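# Popularity and Favorites span several orders of magnitude; compress them with the natural log.
# Note: np.log(0) is -inf, so this assumes both columns are strictly positive.
df['Popularity'] = np.log(df['Popularity'])
df['Favorites'] = np.log(df['Favorites'])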

In [35]:
X_train, X_test, y_train, y_test = split_train_test(df)

In [36]:
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

LinearRegression()

In [37]:
predict_val = lr_model.predict(X_test)

In [38]:
score = r2_score(y_test, predict_val)
score

0.4770840773618533

<p>As you can see, applying ln (the natural logarithm) to the columns most influential on the model's target (Mean Score) raised the model's R² from about 0.24 to about 0.47.</p>
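
Why a log transform can help a linear model so much can be illustrated on synthetic data. The sketch below is not the anime dataset: it assumes a right-skewed count feature and a target that depends on the logarithm of that feature, then compares the R² of a one-feature straight-line fit before and after the transform. The names `popularity`, `score` and `r2_of_line_fit` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, heavily right-skewed feature (a stand-in for a popularity count).
popularity = rng.lognormal(mean=8.0, sigma=2.0, size=2000)
# Assumed for the demo: the target depends on the *logarithm* of the feature.
score = 40 + 3.0 * np.log(popularity) + rng.normal(0.0, 3.0, size=2000)


def r2_of_line_fit(x, y):
    """R^2 of a one-feature least-squares line fitted to (x, y)."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return 1 - residuals.var() / y.var()


r2_raw = r2_of_line_fit(popularity, score)
r2_log = r2_of_line_fit(np.log(popularity), score)
print(f"raw feature R^2: {r2_raw:.3f}")
print(f"log feature R^2: {r2_log:.3f}")
```

On the raw scale a handful of very large values dominate the fit, so a straight line captures little of the relationship; after the transform the relationship is linear by construction.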

Although we've improved the model, I think this is as far as ordinary linear regression can go on this dataset.

Let's try a few more models.

First off, let's try a variation on linear regression: Lasso regression.

In [39]:
lass_m = Lasso(alpha = 1.0)

In [40]:
lass_m.fit(X_train, y_train)

Lasso()

In [41]:
predicted_vals = lass_m.predict(X_test)

In [42]:
score = r2_score(y_test, predicted_vals)
score

0.4324836331444315

On the first attempt, Lasso regression yielded a worse result than linear regression, but we can still tweak it a little. Unlike ordinary linear regression, Lasso has a hyperparameter, alpha (the strength of its L1 penalty), that we can tune to get better results.

In [43]:
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

In [44]:
grid = dict()
grid['alpha'] = arange(0, 1, 0.01)

In [45]:
search = GridSearchCV(lass_m, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)

In [46]:
results = search.fit(X_train, y_train)

In [47]:
print('MAE: %.3f' % results.best_score_)
print('Config: %s' % results.best_params_)

MAE: -5.530
Config: {'alpha': 0.01}


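The results from GridSearchCV show that the best hyperparameter value is alpha = 0.01, with a score of -5.53. The score is negative because scikit-learn's 'neg_mean_absolute_error' scorer reports the negated MAE so that higher is always better; the actual mean absolute error is 5.53, meaning predictions are off by about 5.5 points of Mean Score on average. Not too shabby.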
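
The sign convention of the scorer can be checked on a small synthetic problem. This is a minimal sketch, assuming made-up data and a three-value alpha grid rather than the notebook's dataset and grid; it only shows that `best_score_` is the negated MAE and how to recover the positive value.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(0.0, 0.5, size=200)

search = GridSearchCV(
    Lasso(),
    param_grid={"alpha": [0.01, 0.1, 1.0]},
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, so MAE is negated
    cv=5,
)
search.fit(X, y)

best_mae = -search.best_score_  # flip the sign to recover the actual MAE
print(f"best alpha: {search.best_params_['alpha']}, MAE: {best_mae:.3f}")
```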

In [48]:
lass_m = Lasso(alpha = 0.01)

In [49]:
lass_m.fit(X_train, y_train)

Lasso(alpha=0.01)

In [50]:
predicted_vals = lass_m.predict(X_test)

In [51]:
score = r2_score(y_test, predicted_vals)
score

0.4768784745287775

The result is more disappointing than expected: Lasso with the best alpha found gave slightly worse results than linear regression, with an R² of 0.4769 compared to 0.4771 (negligible, but disappointing still).

Although our previous models disappointed, I still want to try one last model that has shown great potential for predicting continuous variables such as our Mean Score: the CatBoostRegressor.

CatBoostRegressor is an ensemble model: it uses gradient boosting, combining many decision trees that are built one after another, each correcting the errors of the ones before it.
Its hyperparameters include the depth of each tree and the number of iterations (trees).
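
To make the role of iterations and learning rate concrete, here is a from-scratch toy version of gradient boosting for squared error, using depth-1 trees (stumps) on a single synthetic feature. It is only a sketch of the general idea and not CatBoost's actual algorithm, which adds many refinements; the function names `fit_stump` and `boost` are made up for the example.

```python
import numpy as np


def fit_stump(x, residuals):
    """Find the single threshold split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left = x <= threshold
        left_mean, right_mean = residuals[left].mean(), residuals[~left].mean()
        sse = (((residuals[left] - left_mean) ** 2).sum()
               + ((residuals[~left] - right_mean) ** 2).sum())
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(x, y, iterations, learning_rate):
    """Gradient boosting for squared error with depth-1 trees (stumps)."""
    prediction = np.full_like(y, y.mean())
    for _ in range(iterations):
        # Each new stump is fitted to what the current ensemble still gets wrong.
        threshold, left_mean, right_mean = fit_stump(x, y - prediction)
        prediction += learning_rate * np.where(x <= threshold, left_mean, right_mean)
    return prediction


rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(0, 0.2, size=200)

for n in (1, 10, 100):
    mse = np.mean((y - boost(x, y, iterations=n, learning_rate=0.1)) ** 2)
    print(f"iterations={n:3d}  training MSE={mse:.4f}")
```

The training error falls as iterations are added; on held-out data too many iterations or too large a learning rate can overfit, which is why these values are usually tuned together.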

Let's first define a grid of parameters for the CatBoostRegressor and find the best combination with GridSearchCV.

In [62]:
params = {
        'depth': [6, 8, 10],
        'learning_rate': [0.01, 0.05, 0.1],
        'iterations': [100]
    }

In [63]:
model = CatBoostRegressor

In [66]:
r2, r2_best = run_cat_boost_model(model, df, params)
print(f"model: {model}, r2: {r2}, best_r2: {r2_best}")

Fitting 5 folds for each of 9 candidates, totalling 45 fits
Best hyperparams : {'iterations': 100, 'learning_rate': 0.1, 'depth': 10, 'loss_function': 'RMSE', 'verbose': False, 'one_hot_max_size': 255}
model: <class 'catboost.core.CatBoostRegressor'>, r2: 0.5958694368878392, best_r2: 0.5851129750207251


As seen in the results, the CatBoost model yielded the best score yet, an R² of 0.596. This is not an ideal number, although it is much better than the previous models.

This result comes from the model's default arguments and not from the best combination GridSearchCV found (R² of 0.585), so on this dataset and with this parameter grid the default arguments win. One likely reason is that the grid fixes iterations at 100, while CatBoost's default is 1000.

In conclusion, out of the 3 models we trained on the dataset, the first two (Linear Regression and Lasso Regression) yielded essentially the same result, an R² of about 0.47.
The last model, CatBoostRegressor, showed better results, with an R² of 0.596.

All in all, it seems the Mean Score of an anime isn't very predictable from this data; I guess we'll have to settle for a model that explains about 60% of its variance (R² of 0.596).