# Movie Wars
## ~ Episode IV – The training strikes back ~

First of all, we configure the notebook so that it displays the result of every expression in a cell, not only the last one.

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Then we import all the Python libraries needed for this step.

In [27]:
import pandas as pd
import seaborn as sns
import pickle
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn import neighbors
from sklearn.neural_network import MLPRegressor
from scipy.stats import gaussian_kde
import numpy as np
import math
import matplotlib.pyplot as plt

Next, we define a naïve model to act as a baseline: it returns the average of the training targets regardless of the input.

In [None]:
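class naive():
    def fit(self, examples, targets):
        # Store the mean of the targets; the examples themselves are ignored
        self._mean = targets.mean()

    def predict(self, examples):
        # Return the stored mean once for every example
        return [self._mean] * len(examples)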
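
As a cross-check for the hand-written baseline, scikit-learn ships a comparable estimator, `DummyRegressor` with `strategy='mean'`. The sketch below is not part of the original notebook and uses made-up toy targets (`toy_targets`, `toy_inputs` are hypothetical names) to show that it also ignores the inputs and returns the training mean.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Toy targets invented for illustration; their mean is 3.0
toy_targets = np.array([1.0, 3.0, 4.0, 5.0, 2.0])
# The inputs are ignored by the baseline, so placeholder zeros are enough
toy_inputs = np.zeros((len(toy_targets), 1))

baseline = DummyRegressor(strategy='mean')
baseline.fit(toy_inputs, toy_targets)

# Every prediction equals the mean of the training targets
baseline_predictions = baseline.predict(np.zeros((3, 1)))
print(baseline_predictions)
```

Any model that cannot beat this constant prediction is not learning anything useful from its features.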

Finally, we state where our data sources are.

In [None]:
data_folder_path = 'data\\'

ratings_training_file_path = data_folder_path + 'ratings_training_data_basic_split_clean.csv'
ratings_test_file_path = data_folder_path + 'ratings_test_data_basic_split.csv'

And load the data.

In [None]:
ratings_training_data = pd.read_csv(ratings_training_file_path, sep = ';', index_col = False)
ratings_test_data = pd.read_csv(ratings_test_file_path, sep = ';', index_col = False)

Now, we are ready to start with the prototyping process.

## Selecting the target feature

We specify that our goal is to predict the **rating** that a user gives to a movie.

In [None]:
y_train = ratings_training_data['rating']
y_test = ratings_test_data['rating']

## Training the models

We will build prototypes with various approaches to solve the recommendation problem:

- Naïve (mean rating)
- K-Nearest Neighbors
- Random Forest
- Artificial Neural Network
- Matrix Factorization

### Naïve

This model will always return the mean of all ratings.

In [11]:
naive_x_train = [None] * len(ratings_training_data)

NAIVE = naive()

NAIVE.fit(naive_x_train, y_train)

### K-Nearest Neighbors

This model looks for the **K ratings most similar to a given one** and uses them to estimate the value of the unknown rating.

In [12]:
features_used_for_KNN = ['user_age', 'user_gender', 'movie_year_mod', 'genre_affinity', 'user_movies_epoch']

knn_x_train = ratings_training_data[features_used_for_KNN]

KNN = neighbors.KNeighborsRegressor(n_neighbors = 40, metric = 'manhattan')

KNN.fit(knn_x_train, y_train)
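
To make the idea concrete, here is a minimal sketch on made-up data (the points and targets are invented, not taken from the ratings dataset). It assumes the default uniform weighting of `KNeighborsRegressor`, in which case the prediction is simply the mean target of the K nearest training points under the chosen metric.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy training points and targets, invented for illustration
toy_points = np.array([[0, 0], [1, 0], [0, 2], [5, 5], [6, 5]])
toy_targets = np.array([1.0, 2.0, 3.0, 5.0, 4.0])

knn = KNeighborsRegressor(n_neighbors=2, metric='manhattan')
knn.fit(toy_points, toy_targets)

query = np.array([[0, 1]])

# Indices of the 2 training points closest to the query (Manhattan distance)
distances, indices = knn.kneighbors(query)

# With uniform weights, the prediction is the mean target of those neighbours
manual_prediction = toy_targets[indices[0]].mean()
knn_prediction = knn.predict(query)[0]
print(manual_prediction, knn_prediction)
```

Because the distance is computed on raw feature values, features with a larger numeric range weigh more in the neighbour search.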

### Random Forest

This tree-based model can use the information in the **categorical features** in a cleaner and more versatile way.

In [13]:
features_used_for_RF = [ 'user_age', 'user_gender', 'user_occupation_categorie', 'user_movies_epoch', 'movie_year'
                        ,'genre_affinity', 'Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime'
                        ,'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical','Mystery', 'Romance'
                        , 'Sci-Fi', 'Thriller', 'War', 'Western','movie_year_mod']

rf_x_train = ratings_training_data[features_used_for_RF]

# 'squared_error' is the name scikit-learn >= 1.0 uses for the former 'mse' criterion
RF = RandomForestRegressor(max_depth=12, random_state=1, n_estimators = 20, criterion = 'squared_error')

RF.fit(rf_x_train, y_train)
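
As a side illustration of how the forest combines its trees, the sketch below fits a small forest on synthetic data (invented here, unrelated to the ratings dataset) and checks that its prediction is the average of the predictions of the individual trees in `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, invented for illustration
rng = np.random.default_rng(0)
toy_inputs = rng.uniform(size=(200, 3))
toy_targets = 1 + 4 * toy_inputs[:, 0] + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=20, max_depth=12, random_state=1)
forest.fit(toy_inputs, toy_targets)

# One row of predictions per tree, for the first five examples
per_tree = np.stack([tree.predict(toy_inputs[:5]) for tree in forest.estimators_])

# Averaging over the trees reproduces the forest's own prediction
averaged = per_tree.mean(axis=0)
forest_predictions = forest.predict(toy_inputs[:5])
print(np.allclose(averaged, forest_predictions))
```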

### Artificial Neural Network

This model uses **perceptrons** to estimate the movie ratings based on numerical input features.

In [16]:
features_used_for_NN = ['user_age', 'user_gender', 'user_occupation_categorie','user_movies_epoch'
                        ,'movie_year', 'genre_affinity', 'Action','Adventure', 'Animation', "Children's", 'Comedy'
                        , 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery'
                        , 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western', 'movie_year_mod']

nn_x_train = ratings_training_data[features_used_for_NN]

NN = MLPRegressor(hidden_layer_sizes = (20, 15, 10, 5, 3), activation = 'relu', max_iter = 1000)

NN.fit(nn_x_train, y_train)
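
To see what `hidden_layer_sizes = (20, 15, 10, 5, 3)` means in practice, the following sketch trains the same architecture on synthetic data with four input features (the data and the short `max_iter` are assumptions made to keep the example fast) and inspects the shape of each weight matrix.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPRegressor

# Synthetic data with 4 input features, invented for illustration
rng = np.random.default_rng(0)
toy_inputs = rng.uniform(size=(100, 4))
toy_targets = toy_inputs.sum(axis=1)

network = MLPRegressor(hidden_layer_sizes=(20, 15, 10, 5, 3), activation='relu',
                       max_iter=50, random_state=0)

# Only the architecture matters here, so a convergence warning is silenced
with warnings.catch_warnings():
    warnings.simplefilter('ignore', category=ConvergenceWarning)
    network.fit(toy_inputs, toy_targets)

# One weight matrix per pair of consecutive layers: input -> 20 -> 15 -> 10 -> 5 -> 3 -> output
weight_shapes = [weights.shape for weights in network.coefs_]
print(weight_shapes)
```

The first dimension of the first matrix follows the number of input features, so with the 25 features selected above it would be 25 instead of 4.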

### Matrix Factorization

This model is specially designed for recommendation systems and uses only the ratings.

It has been implemented using **ML.NET (C#)**. In this notebook we only need its predictions, so we will cover the model itself later in this workshop.
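
For intuition only, here is a minimal NumPy sketch of the general idea behind matrix factorization: each user and each movie gets a small vector of latent factors, and a rating is approximated by their dot product. This is not the ML.NET implementation; the toy ratings, the number of factors, the learning rate and the regularization value are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy (user_index, movie_index, rating) triples, invented for illustration
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_movies, n_factors = 3, 3, 2

# Latent factor vectors, one row per user and one row per movie
user_factors = rng.normal(scale=0.1, size=(n_users, n_factors))
movie_factors = rng.normal(scale=0.1, size=(n_movies, n_factors))

def rmse(user_factors, movie_factors):
    # Root mean squared error over the known ratings
    errors = [r - user_factors[u] @ movie_factors[m] for u, m, r in toy_ratings]
    return float(np.sqrt(np.mean(np.square(errors))))

initial_rmse = rmse(user_factors, movie_factors)

learning_rate, regularization = 0.02, 0.01
for _ in range(2000):
    for u, m, r in toy_ratings:
        # Error of the current dot-product estimate for this rating
        error = r - user_factors[u] @ movie_factors[m]
        old_user = user_factors[u].copy()
        # Stochastic gradient step on both factor vectors
        user_factors[u] += learning_rate * (error * movie_factors[m] - regularization * user_factors[u])
        movie_factors[m] += learning_rate * (error * old_user - regularization * movie_factors[m])

final_rmse = rmse(user_factors, movie_factors)
print(initial_rmse, final_rmse)
```

Note that the only inputs are the user index, the movie index and the rating, which is why this approach needs none of the user or movie features used by the other models.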

## Making predictions

We select the required features for each model on the test dataset.

In [None]:
naive_x_test = [None] * len(ratings_test_data)
knn_x_test = ratings_test_data[features_used_for_KNN]
rf_x_test = ratings_test_data[features_used_for_RF]
nn_x_test = ratings_test_data[features_used_for_NN]

Then we generate the predictions of each model for the test dataset.

In [None]:
naive_predictions = NAIVE.predict(naive_x_test)
knn_predictions = KNN.predict(knn_x_test)
rf_predictions = RF.predict(rf_x_test)
nn_predictions = NN.predict(nn_x_test)

And add the predictions of the matrix factorization model.

In [None]:
mldotnet_preds_test = pd.read_csv(data_folder_path + 'ML.NET_approach_preds_test.csv', header = 10, sep = ';')
mldotnet_preds_test = mldotnet_preds_test.dropna()

mf_predictions = [math.ceil(x) for x in mldotnet_preds_test.Label]

## Results

Finally, we save the results of the predictions with the different models.

In [None]:
# Create predictions dataframe with [actual, naive_pred, knn_pred, rf_pred, nn_pred, mf_pred]
# predictions = 

predictions.to_csv(data_folder_path + 'predictions.csv', sep = ';')
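
One possible way to assemble such a dataframe is sketched below. The column names follow the comment in the cell above, while the values are made-up toy numbers and the in-memory `buffer` stands in for the real output file; in the notebook, the lists would be `y_test` and the prediction variables computed earlier, which must all have the same length.

```python
import io

import pandas as pd

# Toy values invented for illustration; one entry per test rating
predictions = pd.DataFrame({
    'actual': [4, 3, 5],
    'naive_pred': [3.6, 3.6, 3.6],
    'knn_pred': [3.9, 3.2, 4.1],
    'rf_pred': [4.1, 3.0, 4.4],
    'nn_pred': [3.8, 3.3, 4.6],
    'mf_pred': [4, 3, 5],
})

# Write to an in-memory buffer instead of a file, using the same separator
buffer = io.StringIO()
predictions.to_csv(buffer, sep=';')
print(buffer.getvalue())
```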