# Movie Wars
## ~ Episode VI – The Metrics return ~

First of all, we should set the notebook so that it outputs all results of each cell and not only the last one.

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Then we import all the Python libraries needed for this step.

In [27]:
import pandas as pd
import seaborn as sns
import pickle
from sklearn.metrics import mean_squared_error 
from sklearn.metrics import mean_absolute_error 
from sklearn.ensemble import RandomForestRegressor
from sklearn import neighbors
from sklearn.neural_network import MLPRegressor
from scipy.stats import gaussian_kde
import numpy as np
import math
import matplotlib.pyplot as plt

Next, we state where our data sources are.

In [None]:
data_folder_path = 'data\\'

predictions_file_path = data_folder_path + 'predictions.csv'

And load the data.

In [None]:
predictions = pd.read_csv(predictions_file_path, sep = ';', index_col = False)

Now, we are ready to start with the performance analysis process.

## Metrics

These are some of the most common metrics that fit the recommendation problem when it is treated as a regression on a rating.

### Mean Absolute Error (MAE)

It simply averages the absolute difference between the actual and predicted ratings. For a test dataset of *n* samples it is defined as:

\\[MAE =  (\frac{1}{n})\sum_{i=1}^{n}\left | rating_{i} - predicted_{i} \right | \\]

In [None]:
def MAE(preds, actuals):
    # Absolute difference between a prediction and the actual rating
    error_function = lambda x, y : abs(x - y)
    
    errors = map(error_function, preds, actuals)
    
    return sum(errors)/len(preds)

The mean absolute error is not only useful for comparing different models. Even for a single model it is meaningful in itself: it tells us the **expected error** of a prediction.
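
As a small worked example of this interpretation, the sketch below computes the MAE step by step in plain Python. The values in `actual` and `predicted` are made up for illustration and are not taken from our dataset.

```python
# Illustrative ratings, not taken from the real dataset
actual = [4, 3, 5, 2, 4]
predicted = [3.5, 3, 4, 3, 4.5]

# Absolute error of each individual prediction
abs_errors = [abs(a - p) for a, p in zip(actual, predicted)]

# MAE: on average, a prediction is this many rating points away from the truth
mae = sum(abs_errors) / len(abs_errors)

print(abs_errors)  # [0.5, 0, 1, 1, 0.5]
print(mae)         # 0.6
```

Here an MAE of 0.6 reads directly as "predictions are off by 0.6 rating points on average".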

### Mean Squared Error (MSE)

It is the average of the squared differences between the actual and predicted ratings. For a test dataset of *n* samples it is defined as:

\\[MSE =  (\frac{1}{n})\sum_{i=1}^{n}\left ( rating_{i} - predicted_{i} \right )^{2} \\]

In [None]:
def MSE(preds, actuals):
    # Squared difference between a prediction and the actual rating
    error_function = lambda x, y : (x - y) ** 2
    
    errors = map(error_function, preds, actuals)
    
    return sum(errors)/len(preds)

If you want to penalize larger differences between the real ratings and the predicted ones more heavily, the mean squared error is a very common solution.
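
To illustrate the effect of squaring, the following sketch compares two made-up sets of predictions that have the same MAE but a different MSE. The helper names `mae` and `mse` and all the values are hypothetical and exist only for this example.

```python
def mae(preds, actuals):
    return sum(abs(p - a) for p, a in zip(preds, actuals)) / len(preds)

def mse(preds, actuals):
    return sum((p - a) ** 2 for p, a in zip(preds, actuals)) / len(preds)

# Illustrative ratings on a 1 to 5 scale
actual = [1, 1, 1, 1]
many_small = [2, 2, 2, 2]  # four errors of 1 point each
one_large = [1, 1, 1, 5]   # a single error of 4 points

print(mae(many_small, actual), mae(one_large, actual))  # 1.0 1.0
print(mse(many_small, actual), mse(one_large, actual))  # 1.0 4.0
```

Both sets of predictions are equally good according to the MAE, but the MSE ranks the one with a single large miss as four times worse.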

### Mean Frequency Penalized Squared Error (MFPSE)

It heavily penalizes prediction errors on frequent ratings, while errors on rare ratings are left almost unchanged. For a test dataset of *n* samples it is defined as:

\\[MFPSE = (\frac{1}{n})\sum_{i=1}^{n}\left ( rating_{i} - predicted_{i} \right )^{2} ( 1 + \lambda f_{i}) \\]

where:

\\[ f_i = \frac{\text{Number of ratings with value equal to } rating_{i}}{\text{Total number of ratings}} \\]

We have to adjust the $\lambda$ parameter to control the effect of the frequency coefficient; $\lambda = \frac{1}{2}$ is a good starting point.

In [1]:
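from collections import Counter

def MFPSE(preds, actuals, penalty = 1):
    # Relative frequency of each rating value among the actual ratings
    counts = Counter(actuals)
    frequencies = {rating: count / len(actuals) for rating, count in counts.items()}
    
    # Frequency coefficient of each sample, scaled by the penalty (lambda)
    penalties = [penalty * frequencies[a] for a in actuals]
    
    # Squared error scaled by the frequency coefficient
    error_function = lambda x, y, p : (x - y) ** 2 * (1 + p)
    errors = map(error_function, preds, actuals, penalties)
    
    return sum(errors)/len(preds)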

The mean frequency penalized squared error is useful when the data points are very concentrated around a given value, **3.6** in our case. 
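
The sketch below shows how the frequency coefficient changes the cost of the same 1-point miss, assuming the squared error that the name of the metric refers to. The helper `mfpse`, its default `penalty` of 0.5 and the ratings are illustrative assumptions, not part of our pipeline.

```python
from collections import Counter

def mfpse(preds, actuals, penalty=0.5):
    # f_i: share of the actual ratings that have the same value as rating i
    counts = Counter(actuals)
    freqs = {value: count / len(actuals) for value, count in counts.items()}
    errors = [(p - a) ** 2 * (1 + penalty * freqs[a])
              for p, a in zip(preds, actuals)]
    return sum(errors) / len(errors)

# Illustrative ratings: 4 is frequent (3 out of 4), 1 is rare (1 out of 4)
actual = [4, 4, 4, 1]
miss_frequent = [3, 4, 4, 1]  # 1-point error on a frequent rating
miss_rare = [4, 4, 4, 2]      # 1-point error on a rare rating

print(mfpse(miss_frequent, actual))  # 1 * (1 + 0.5 * 0.75) / 4 = 0.34375
print(mfpse(miss_rare, actual))      # 1 * (1 + 0.5 * 0.25) / 4 = 0.28125
```

The same 1-point error costs more when it falls on the frequent rating, which is the behavior the metric is designed to have.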

### Mean Asymmetry Penalized Squared Error (MAPSE)

It takes into consideration the asymmetric character of errors. For a test dataset of *n* samples it is defined as:

\\[MAPSE = (\frac{1}{n})\sum_{i=1}^{n}\left ( rating_{i} - predicted_{i} \right )^{2} (1 + p(predicted_i, rating_i)) \\]

where:

\\[ p(x, y) = \lambda \frac{(x-y)|x-y|}{16}, \quad \lambda \in [0,1] \\]


We have to adjust the $\lambda$ parameter to control the effect of the asymmetry coefficient; $\lambda = \frac{1}{5}$ is a good starting point.
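
The denominator of 16 can be motivated as follows, assuming ratings on a 1 to 5 scale so that two ratings differ by at most 4:

\\[ |x - y| \leq 4 \;\Rightarrow\; -16 \leq (x-y)|x-y| \leq 16 \;\Rightarrow\; -\lambda \leq p(x, y) \leq \lambda \\]

Under that assumption the factor $(1 + p)$ stays within $[1 - \lambda, 1 + \lambda]$, so for $\lambda \in [0,1]$ it never becomes negative.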

In [None]:
def MAPSE(preds, actuals, penalty = 0.75):
    # Asymmetry coefficient: positive when the prediction is above the actual rating
    penalties = map(lambda x, y : penalty * 1/16 * (x - y) * abs(x - y), preds, actuals)
    
    # Squared error scaled by the asymmetry coefficient
    error_function = lambda x, y, p : (x - y) ** 2 * (1 + p)
    errors = map(error_function, preds, actuals, penalties)
    
    return sum(errors)/len(preds)

Until now, the metrics have been symmetric: if the recommender predicts a rating of 5 for a given movie and user and the user's real rating is 1, the error is the same as when the recommender predicts 1 and the user's real rating is 5.

This symmetry is not realistic: if you give a chance to a movie that looked awful and end up enjoying it, the "damage" is lower than if the system recommends you a film and you are disappointed after watching it.
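
The asymmetry can be checked on the two extreme cases described above. This is a minimal sketch that assumes a squared error, a 1 to 5 rating scale and $\lambda = \frac{1}{5}$; the helper `mapse` is a hypothetical name used only here.

```python
def mapse(preds, actuals, penalty=0.2):
    total = 0.0
    for pred, actual in zip(preds, actuals):
        diff = pred - actual
        # Positive when the model over-predicts, negative when it under-predicts
        asymmetry = penalty * diff * abs(diff) / 16
        total += diff ** 2 * (1 + asymmetry)
    return total / len(preds)

letdown = mapse([5], [1])   # predicted 5, the user rated 1
surprise = mapse([1], [5])  # predicted 1, the user rated 5

# Roughly 19.2 for the letdown and 12.8 for the pleasant surprise
print(letdown, surprise)
```

Both cases have the same squared error of 16, but the letdown is scaled up by about 20% and the pleasant surprise is scaled down by about 20%.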

#### Comparison

To sum up, the following table summarizes the main features of the metrics.

| Metric | Meaningful | Penalizes higher errors | Distribution-aware | Letdown-aware |
|--------|------------|-------------------------|--------------------|---------------|
| MAE    | ✓          | ✗                      | ✗                  | ✗             |
| MSE    | ✗          | ✓                      | ✗                  | ✗             |
| MFPSE  | ✗          | ✓                      | ✓                  | ✗             |
| MAPSE  | ✗          | ✓                      | ✗                  | ✓             |

## Model performance

Let's calculate the MAE, MSE, MFPSE and MAPSE of our models.

In [None]:
metrics = pd.DataFrame({
    'Model': [
        'Naïve', 
        'K-Nearest Neighbors', 
        'Random Forest', 
        'Artificial Neural Networks', 
        'Matrix Factorization'
    ],
    
     'MAE': [
         MAE(predictions['naive_pred'], predictions['actual']),
         MAE(predictions['knn_pred'], predictions['actual']),
         MAE(predictions['rf_pred'], predictions['actual']),
         MAE(predictions['nn_pred'], predictions['actual']),
         MAE(predictions['mf_pred'], predictions['actual'])
     ],
     
     'MSE': [
         MSE(predictions['naive_pred'], predictions['actual']),
         MSE(predictions['knn_pred'], predictions['actual']),
         MSE(predictions['rf_pred'], predictions['actual']),
         MSE(predictions['nn_pred'], predictions['actual']),
         MSE(predictions['mf_pred'], predictions['actual'])
     ], 
     
     'MFPSE': [
         MFPSE(predictions['naive_pred'], predictions['actual']),
         MFPSE(predictions['knn_pred'], predictions['actual']),
         MFPSE(predictions['rf_pred'], predictions['actual']),
         MFPSE(predictions['nn_pred'], predictions['actual']),
         MFPSE(predictions['mf_pred'], predictions['actual'])
     ], 
     
     'MAPSE':[
         MAPSE(predictions['naive_pred'], predictions['actual']),
         MAPSE(predictions['knn_pred'], predictions['actual']),
         MAPSE(predictions['rf_pred'], predictions['actual']),
         MAPSE(predictions['nn_pred'], predictions['actual']),
         MAPSE(predictions['mf_pred'], predictions['actual'])
     ] 
    })

metrics
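
The same kind of table can also be built with less repetition by looping over the models and the metric functions. The sketch below is one possible way to do it; the tiny `predictions` frame is an invented stand-in for `predictions.csv`, only two of the model columns are shown, and `models`, `metric_functions`, `mae` and `mse` are hypothetical names.

```python
import pandas as pd

# Tiny made-up stand-in for predictions.csv
predictions = pd.DataFrame({
    'actual':     [4, 3, 5, 2],
    'naive_pred': [3.5, 3.5, 3.5, 3.5],
    'mf_pred':    [4, 3, 4, 2],
})

def mae(preds, actuals):
    return (preds - actuals).abs().mean()

def mse(preds, actuals):
    return ((preds - actuals) ** 2).mean()

models = {'Naïve': 'naive_pred', 'Matrix Factorization': 'mf_pred'}
metric_functions = {'MAE': mae, 'MSE': mse}

# One row per model, one column per metric
metrics = pd.DataFrame([
    {'Model': name,
     **{metric: function(predictions[column], predictions['actual'])
        for metric, function in metric_functions.items()}}
    for name, column in models.items()
])

print(metrics)
```

Adding a model or a metric then only requires a new entry in the corresponding dictionary.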

To understand how the models behave, we use the density of the **absolute error values** to **estimate their distribution**.

In [None]:
# Points at which each density estimate is evaluated
ind = np.linspace(0, 4, 1000)

# Absolute error of a model, computed from its prediction column
abs_error = lambda column : (predictions[column] - predictions['actual']).abs()

absolute_error_density = pd.DataFrame({
        'Error': ind,
        'Naïve': gaussian_kde(abs_error('naive_pred')).evaluate(ind),
        'K-Nearest Neighbors': gaussian_kde(abs_error('knn_pred')).evaluate(ind),
        'Random Forest': gaussian_kde(abs_error('rf_pred')).evaluate(ind),
        'Artificial Neural Network': gaussian_kde(abs_error('nn_pred')).evaluate(ind),
        'Matrix Factorization': gaussian_kde(abs_error('mf_pred')).evaluate(ind)
    })

plt.title('Density of absolute error')
plt.xlabel('Absolute error')
plt.ylabel('Density')
plt.plot(absolute_error_density['Error'], absolute_error_density['Naïve'], label = "Naïve")
plt.plot(absolute_error_density['Error'], absolute_error_density['K-Nearest Neighbors'], label = "K-Nearest Neighbors")
plt.plot(absolute_error_density['Error'], absolute_error_density['Random Forest'], label = "Random Forest")
plt.plot(absolute_error_density['Error'], absolute_error_density['Artificial Neural Network'], label = "Artificial Neural Network")
plt.plot(absolute_error_density['Error'], absolute_error_density['Matrix Factorization'], label = "Matrix Factorization")
plt.legend()

It is interesting to observe that, because the distribution of the ratings is heavily centered, the naïve approximation is not bad at all, with a **MAE** below 1. Since the ratings are integers, this implies that **we don't have much room for improvement**.

The profile-based models (**KNN, Random Forest and Artificial Neural Networks**) behave better than the naïve approach, but the improvement **is not very significant**, which is to be expected since movie likes and dislikes are complex and personal.

The only model that stands out is the one based on **Matrix Factorization**, as we expected, since as a collaborative filtering method it was specifically designed to handle this kind of scenario.

In the density plots of the absolute error we see a similar situation, but we can distinguish between models that generalize well, such as Matrix Factorization and KNN, and those that don't, such as Naïve, Random Forest and Artificial Neural Networks.

### Conclusions

In order to establish a starting model, we have tried prototypes of the models shown previously (and some additional ones) on this problem and tabulated the results.

| Metrics | Naïve (mean) | Neural Networks | KNN | Decision Tree | Random Forest | Gradient Boosting | AdaBoost | Matrix Factorization |
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Training error (MAE) | 0.93 | 0.91 | 0.8 | 0.78 | 0.78 | 0.79 | 0.9 | 0.49 |
| Testing error (MAE) | 0.93 | 0.91 | 0.82 | 0.81 | 0.8 | 0.79 | 0.9 | 0.68 |
| Training time (500k) | 0 s | 4 s | 7 min | 1 min | 12 min | 1 s | 1 s | 1 s |
| Training time (26M) | 0 s | 10 s | 1 h | 10 min | 2 h | 12 h | 15 h | 1 to 5 min |
| Support new users/movies | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
| Support additional information | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
| Programming language | Any language | Python | Python | Python | Python | Python | Python | C# |
| Machine Learning library | - | sklearn | sklearn | sklearn | sklearn | sklearn | sklearn | ML.NET |
| Model size (500k) | 0 kB | 200 kB | 32 MB | 345 kB | 7 MB | 390 kB | 10 MB | 3 MB |
| Model size (26M) | 0 kB | 200 kB | 16 GB | 1 MB | 10 MB | 10 MB | 10 MB | 120 MB |


Factors to consider from the **data scientist** point of view:

- The distribution of our target feature is very centered, so dull models are capable of giving us an acceptable performance, but it's hard to improve on it significantly.
- What level of accuracy is good enough?
- We don't have the additional information available for all the users of the company. Will the model still work without it?
- Is the model explainable?

**machine learning engineer**?

**In conclusion, based on these results, we choose to further develop our movie recommender system based on the Matrix Factorization model, implemented in ML.NET**.