# Model Selection for the Video Game Sales Predictor
In this notebook, we compare several regression models for predicting global video game sales.

<h3>Importing libraries</h3>

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

<h3>Loading the training & testing datasets</h3>

In [2]:
train_df = pd.read_csv("./data/train.csv")
# Features: every column except the target
X_train = train_df.drop(columns="Global_Sales").values
# Target: global sales as a 1-D array
y_train = train_df["Global_Sales"].values

In [3]:
X_train

array([[ 1.44447423, -0.31529833, -0.27863932, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.20827211, -0.31529833, -0.27863932, ...,  0.        ,
         0.        ,  0.        ],
       [-1.33932158,  0.4693327 ,  0.32967975, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.23061708, -0.21872836, -0.20259943, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.03420642, -0.26701334, -0.25962935, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.13124063, -0.31529833, -0.27863932, ...,  0.        ,
         0.        ,  0.        ]])

In [4]:
y_train

array([0.02, 0.04, 1.07, ..., 0.13, 0.05, 0.15])

In [5]:
test_df = pd.read_csv("./data/test.csv")
# Same feature/target split as for the training set
X_test = test_df.drop(columns="Global_Sales").values
y_test = test_df["Global_Sales"].values

In [6]:
X_test

array([[-0.84892497,  0.14340904, -0.25962935, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.76850518, -0.24287085, -0.25962935, ...,  0.        ,
         0.        ,  0.        ],
       [-1.23952671,  0.55383142, -0.2216094 , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.56159295, -0.29115584, -0.18358946, ...,  0.        ,
         0.        ,  0.        ],
       [-1.21755928,  0.23997901,  0.31066977, ...,  0.        ,
         0.        ,  0.        ],
       [-0.86440677,  0.14340904, -0.12655955, ...,  0.        ,
         0.        ,  0.        ]])

In [7]:
y_test

array([0.46, 0.07, 0.87, ..., 0.09, 0.83, 0.47])

<h3>Training models</h3>

In [8]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

models = []
models.append(RandomForestRegressor())
models.append(DecisionTreeRegressor())
models.append(LinearRegression())
models.append(SVR())

for model in models:
    model.fit(X_train, y_train)
    print("ML Model:", type(model), "is trained!")

ML Model: <class 'sklearn.ensemble._forest.RandomForestRegressor'> is trained!
ML Model: <class 'sklearn.tree._classes.DecisionTreeRegressor'> is trained!
ML Model: <class 'sklearn.linear_model._base.LinearRegression'> is trained!
ML Model: <class 'sklearn.svm._classes.SVR'> is trained!


<h3>Evaluating scores</h3>

In [10]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

for model in models:
    y_pred = model.predict(X_test)
    print("\nModel", type(model))
    print("MAE: ", mean_absolute_error(y_test, y_pred))
    print("MSE: ", mean_squared_error(y_test, y_pred))
    print("R2: ", r2_score(y_test, y_pred))




Model <class 'sklearn.ensemble._forest.RandomForestRegressor'>
MAE:  0.0026860386621666613
MSE:  0.002822729917152416
R2:  0.9986123558466602

Model <class 'sklearn.tree._classes.DecisionTreeRegressor'>
MAE:  0.003341515802393634
MSE:  0.0015926050935869923
R2:  0.9992170809069383

Model <class 'sklearn.linear_model._base.LinearRegression'>
MAE:  4145997.8341250797
MSE:  1.125820885955417e+16
R2:  -5534496094050255.0

Model <class 'sklearn.svm._classes.SVR'>
MAE:  0.10485481054290552
MSE:  0.5609652255210479
R2:  0.724231457394795


<h2>Conclusion</h2>
Random Forest and Decision Tree perform best on this problem, both reaching an R2 above 0.998 on the test set. Scores this high are suspicious and might be the result of overfitting. <br>

Decision trees in particular are susceptible to overfitting, so the SVR model, with its more moderate performance (R2 of about 0.72), should not be ruled out. Linear Regression, by contrast, produced an unusable fit (R2 of about -5.5e15).
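
One way to probe the overfitting concern is to compare training and held-out scores and to run k-fold cross-validation: a large gap between the training score and the held-out scores points to overfitting. The sketch below is only an illustration on synthetic data from `make_regression` (an assumption, standing in for `X_train`, `y_train`, `X_test` and `y_test`), so its numbers say nothing about the video game dataset.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption); in the notebook, use the loaded arrays instead
X, y = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# An unpruned tree can memorize the training set
model = DecisionTreeRegressor(random_state=0)
model.fit(X_tr, y_tr)

train_r2 = model.score(X_tr, y_tr)   # R2 on the data the tree was fitted on
test_r2 = model.score(X_te, y_te)    # R2 on held-out data

# 5-fold cross-validated R2, computed on the training portion only
cv_r2 = cross_val_score(
    DecisionTreeRegressor(random_state=0), X_tr, y_tr, cv=5, scoring="r2"
)

print(f"train R2: {train_r2:.3f}")
print(f"test R2:  {test_r2:.3f}")
print(f"CV R2:    {cv_r2.mean():.3f} +/- {cv_r2.std():.3f}")
```

If the training score is near 1 while the test and cross-validated scores are clearly lower, limiting the tree (for example with `max_depth` or `min_samples_leaf`) is a reasonable next step.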