#### Import libraries

In [2]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go

## Data gathering

In [3]:
PATH_FILES = 'predictions/'
FILE_NAMES = ['df_predictions_linearreg_hyper_tuning.csv', 'df_predictions_linearreg_standard.csv',
             'df_predictions_randomforest_hyper_tuning.csv', 'df_predictions_randomforest_standard.csv']

In [4]:
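# Unpacking follows the order of FILE_NAMES, where each hyper_tuning file
# precedes its standard counterpart; verify the _s/_h names against that order.
df_logreg_s, df_logreg_h, df_randforest_s, df_randforest_h = [
    pd.read_csv(PATH_FILES + f) for f in FILE_NAMES]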
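
Positional unpacking depends on the order of `FILE_NAMES`, so a label derived from each file name can make the pairing explicit. The sketch below is one possible approach, not the notebook's own method; the helper `short_name` and the label format (`linearreg_h`, `randomforest_s`, ...) are hypothetical names introduced for illustration.

```python
FILE_NAMES = ['df_predictions_linearreg_hyper_tuning.csv', 'df_predictions_linearreg_standard.csv',
              'df_predictions_randomforest_hyper_tuning.csv', 'df_predictions_randomforest_standard.csv']

PREFIX = 'df_predictions_'
EXTENSION = '.csv'


def short_name(file_name):
    """Build a label such as 'linearreg_h' from a prediction file name (hypothetical helper)."""
    if not (file_name.startswith(PREFIX) and file_name.endswith(EXTENSION)):
        raise ValueError(f'unexpected file name: {file_name}')
    stem = file_name[len(PREFIX):-len(EXTENSION)]   # e.g. 'linearreg_hyper_tuning'
    model, _, variant = stem.partition('_')          # split at the first underscore
    suffix = 'h' if variant == 'hyper_tuning' else 's'
    return f'{model}_{suffix}'


# Map each label to the file it came from, so the pairing no longer depends on list order.
labels = {short_name(f): f for f in FILE_NAMES}
```

With the files available, the frames could then be loaded as `{label: pd.read_csv(PATH_FILES + f) for label, f in labels.items()}` and looked up by label.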

In [5]:
df_logreg_s.head()

Unnamed: 0,Publisher,MAE,prediction1,prediction2,prediction3,prediction4,prediction5
0,Activision,5.181232,5.097123,32.08518,1.719034,87.448589,4.27466
1,Nintendo,26.468344,69.279623,68.958063,103.864933,72.466509,43.692457
2,Electronic Arts,10.990904,95.55929,70.660881,5.117121,20.001916,
3,Sony Computer Entertainment,16.526021,16.42889,28.407351,39.027552,49.244994,
4,Ubisoft,7.100618,0.0,20.287904,34.429624,0.0,


In [6]:
df_logreg_h.head()

Unnamed: 0,Publisher,MAE,prediction1,prediction2,prediction3,prediction4,prediction5
0,Activision,18.964156,-1.023599,32.571526,-1.710099,4.904506,5.453113
1,Nintendo,31.416551,74.307173,81.601085,87.109365,71.02628,35.447581
2,Electronic Arts,21.693864,108.528945,95.682671,14.199076,19.565238,
3,Sony Computer Entertainment,16.526021,16.42889,28.407351,39.027552,49.244994,
4,Ubisoft,14.074656,-2.535053,22.116794,7.17047,0.069163,


## Analysis and conclusions

#### Linear Regression

Linear regression is the classic supervised learning algorithm. Its simplicity and efficiency make it a sensible first choice when building a regression model.

* Pro: it isn't a black box. You can inspect the learned weights, try optimization strategies to improve the model, and see how each change affects those weights.
* Con: you must pay attention to correlated features. When features are highly correlated, the individual coefficients become less reliable because the model splits the weight among the correlated features.
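
The weight-splitting effect can be seen in a minimal sketch, assuming synthetic data and ordinary least squares solved with `numpy.linalg.lstsq` (which returns the minimum-norm solution when the columns are linearly dependent). The data and variable names are illustrative and not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 * x                          # the target depends on one signal with weight 3

X_single = x.reshape(-1, 1)          # one feature
X_dup = np.column_stack([x, x])      # the same feature twice (perfectly correlated)

coef_single, *_ = np.linalg.lstsq(X_single, y, rcond=None)
coef_dup, *_ = np.linalg.lstsq(X_dup, y, rcond=None)

# With the duplicated column, the weight of 3 is shared between the two copies.
print(coef_single, coef_dup)
```

The predictions are the same in both fits; only the interpretation of each individual coefficient changes.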

* Hyperparameters:
 * fit_intercept: Whether the model fits an intercept term. If the intercept turns out not to be statistically significant, you can drop it by setting this to False, which supports that statistical approach.
 * positive: Forces all coefficients to be non-negative.
 * normalize: Normalizes the features before fitting. This option was deprecated in scikit-learn 1.0 and removed in 1.2, so newer versions need a separate scaling step.
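
As a rough illustration of `fit_intercept` and `positive`, the sketch below fits scikit-learn's `LinearRegression` on synthetic data. The data, the true coefficients (2, -1, intercept 5), and the variable names are assumptions made for the example, not values from this project.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Synthetic target: one positive weight, one negative weight, and an intercept of 5.
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 5.0 + rng.normal(scale=0.1, size=200)

default_model = LinearRegression().fit(X, y)                     # intercept fitted, signs free
no_intercept = LinearRegression(fit_intercept=False).fit(X, y)   # intercept fixed at zero
positive_only = LinearRegression(positive=True).fit(X, y)        # coefficients constrained to >= 0

print(default_model.coef_, default_model.intercept_)
print(no_intercept.coef_, no_intercept.intercept_)
print(positive_only.coef_, positive_only.intercept_)
```

Comparing the three sets of coefficients shows what each constraint costs: the model without an intercept cannot absorb the constant offset, and the positive-only model cannot represent the negative weight.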

#### Random Forest

Random Forest is a versatile algorithm that can handle either kind of supervised problem, classification or regression.

* Pros: it performs well on large datasets. It also reports the importance of each variable, so you can decide which ones to keep, and averaging many trees makes it relatively resistant to overfitting.
* Cons: it's a black box. You can access the structure of each tree, but there is no direct way to adjust the model by hand.
* Hyperparameters:
    * max_features: The number of features considered when looking for the best split in each tree.
    * max_depth: The maximum depth of a tree, i.e. the longest path from the root node to a leaf.
    * min_samples_split: The minimum number of samples a node needs before it can be split. Raising it helps prevent overfitting because it reduces the number of splits in each decision tree.
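
The effect of these hyperparameters on tree size can be sketched with scikit-learn's `RandomForestRegressor`. The synthetic data, the specific values (`max_depth=5`, `min_samples_split=20`, `max_features='sqrt'`), the `random_state`, and the helper `total_nodes` are assumptions chosen for illustration, not settings used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)


def total_nodes(forest):
    """Sum the node counts of every tree in a fitted forest (hypothetical helper)."""
    return sum(tree.tree_.node_count for tree in forest.estimators_)


# Default settings: trees grow until the leaves cannot be split further.
loose = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Constrained settings: fewer candidate features, shallower trees, fewer splits.
tight = RandomForestRegressor(n_estimators=10, max_features='sqrt', max_depth=5,
                              min_samples_split=20, random_state=0).fit(X, y)

max_tree_depth = max(tree.get_depth() for tree in tight.estimators_)
print(total_nodes(loose), total_nodes(tight), max_tree_depth)
```

Smaller trees are less able to memorize noise, which is the mechanism behind the overfitting argument above; whether they also generalize better depends on the data and should be checked on held-out samples.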


#### Neural Network

Neural networks perform well on large datasets.

* Pros: high predictive performance.
* Cons: computational (and therefore monetary) cost, plus longer training and prediction times.

### MAE comparison

In [18]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=df_logreg_s['Publisher'], y=df_logreg_s['MAE'],
                    mode='lines',name='MAE_logreg_s'))
fig.add_trace(go.Scatter(x=df_logreg_h['Publisher'], y=df_logreg_h['MAE'],
                    mode='lines',name='MAE_logreg_h'))
fig.add_trace(go.Scatter(x=df_randforest_s['Publisher'], y=df_randforest_s['MAE'],
                    mode='lines', name='MAE_randforest_s'))
fig.add_trace(go.Scatter(x=df_randforest_h['Publisher'], y=df_randforest_h['MAE'],
                    mode='lines', name='MAE_randforest_h'))

fig.show()
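
For reference, the MAE plotted above is the average absolute difference between actual and predicted values. A minimal sketch of that calculation follows; the function name and the numbers are hypothetical and are not taken from the prediction files.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical sales figures, used only to show the arithmetic.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]

mae = mean_absolute_error(actual, predicted)  # (2 + 2 + 3) / 3
print(mae)
```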

In [13]:
df_randforest_h['MAE'] - df_randforest_s['MAE']

0    0.990950
1    2.420546
2   -1.175325
3   -0.061669
4   -0.476261
Name: MAE, dtype: float64

In [14]:
df_randforest_s['MAE'] - df_randforest_h['MAE']

0   -0.990950
1   -2.420546
2    1.175325
3    0.061669
4    0.476261
Name: MAE, dtype: float64
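
The per-publisher differences can also be summarized in two numbers: their mean and the count of rows where one frame has the lower error. The sketch below reuses the five values printed for `df_randforest_h['MAE'] - df_randforest_s['MAE']`; the variable names are illustrative.

```python
# Differences copied from the output of df_randforest_h['MAE'] - df_randforest_s['MAE'].
diff_h_minus_s = [0.990950, 2.420546, -1.175325, -0.061669, -0.476261]

# A negative mean favors df_randforest_h on average; a positive one favors df_randforest_s.
mean_diff = sum(diff_h_minus_s) / len(diff_h_minus_s)

# Number of publishers for which df_randforest_h has the lower MAE.
rows_where_h_is_lower = sum(d < 0 for d in diff_h_minus_s)

print(mean_diff, rows_where_h_is_lower)
```

Looking at both numbers helps when the sign differs by publisher, since a single large difference can outweigh several small ones.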

**The best performance comes from the random forest with hyperparameter tuning.**

*If only one model could be deployed to production, which would you recommend? Why?*

* After weighing several criteria, I would choose random forest:
 > One of the main factors is our cost function, the Mean Absolute Error (MAE). The plot shows the 5 models trained for each algorithm, with and without hyperparameter tuning, and the lowest error stands out for the Random Forest with hyperparameter tuning. It is also a good algorithm to take to production, since its computational complexity is low compared with higher-order algorithms such as the exhaustive-computation family.
> The final hyperparameters of the algorithm (Nintendo) are: 
    - max_depth=10
    - max_features='log2'
    - min_samples_split=5
    - n_estimators=20
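
As a sketch of how these settings map onto scikit-learn, the block below builds a `RandomForestRegressor` with the four values listed above and fits it on synthetic placeholder data. The data, the `random_state`, and the variable names are assumptions for illustration; the original training features are not shown in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder data standing in for the real training set (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(
    max_depth=10,
    max_features='log2',
    min_samples_split=5,
    n_estimators=20,
    random_state=0,   # added here for reproducibility; not part of the reported settings
)
model.fit(X, y)

# Impurity-based importance of each feature, normalized to sum to 1.
importances = model.feature_importances_
print(importances)
```

The `feature_importances_` attribute is the variable-importance output mentioned in the Random Forest section.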

To conclude this section: the methods and strategies mentioned above were applied, together with a good deal of heuristic, empirical, and mathematical work, experimenting to reach the best fit when tuning the hyperparameters.

*What would you improve in the models so that they fit the data set better? (for example, to account for correlation between observations)*

Several factors would help:

* The main and decisive one: obtain more data. The lack of it puts the neural network and ARIMA at a disadvantage, since they don't have enough data to perform well and be statistically comparable with linear regression or random forest.
* In addition, collect other kinds of variables related to video game sales, such as hours played, payment method, age, and sex, among others. This would strengthen the creation of new features as well as latent variables.
* Given the conditions above, it would then be possible to produce a more robust and comparable analysis, and to better manage the optimization of the cost function and the trade-off between overfitting and underfitting.