# Movie Wars
## ~ Episode IV – The training strikes back ~

First of all, we configure the notebook so that it displays the result of every expression in a cell, not only the last one.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

Then we import all the Python libraries needed for this step.

In [2]:
import math
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor

Next, we define a naïve model to act as a baseline: it returns the mean of the training targets regardless of the input.

In [3]:
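class naive():
    def fit(self, x, y):
        # The features (x) are ignored; only the mean of the targets (y) is stored
        self.__mean = y.mean()
    
    def predict(self, examples):
        # Return the stored mean once per example
        return [self.__mean] * len(examples)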
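
For comparison, scikit-learn ships a baseline with the same behavior, `DummyRegressor(strategy="mean")`. The sketch below is not part of the original notebook; the toy arrays are made up purely for illustration and only show that every prediction equals the mean of the training targets.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Toy targets, invented for this illustration only
toy_targets = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# The mean strategy ignores the feature values, but expects one row per target
toy_features = np.zeros((len(toy_targets), 1))

baseline = DummyRegressor(strategy="mean")
baseline.fit(toy_features, toy_targets)

# One prediction per input row, each equal to the mean of toy_targets
baseline_predictions = baseline.predict(np.zeros((3, 1)))
```

The hand-written `naive` class is kept in this notebook because it makes the idea explicit in a few lines.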

Finally, we state where our data sources are.

In [4]:
data_folder_path = 'data\\'

ratings_training_file_path = data_folder_path + 'ratings_training_data_basic_split_reduced.csv'
ratings_test_file_path = data_folder_path + 'ratings_test_data_basic_split.csv'

And load the data.

In [5]:
ratings_training_data = pd.read_csv(ratings_training_file_path, sep = ';', index_col = False)
ratings_test_data = pd.read_csv(ratings_test_file_path, sep = ';', index_col = False)

Now, we are ready to start with the prototyping process.

## Selecting the target feature

We specify that our goal is to predict the **rating** that a user gives to a movie.

In [6]:
y_train = ratings_training_data['rating']
y_test = ratings_test_data['rating']

## Training the models

We will build prototypes with various approaches to solve the recommendation problem:

- Naïve (mean rating)
- K-Nearest Neighbors
- Random Forest
- Artificial Neural Network
- Matrix Factorization

### Naïve

This model will always return the mean of all training ratings.

In [7]:
naive_x_train = [None] * len(ratings_training_data)

NAIVE = naive()

NAIVE.fit(naive_x_train, y_train)

### K-Nearest Neighbors

This model looks for the **K training examples most similar to a given one** (here, by Manhattan distance over the selected features) and averages their ratings to estimate the unknown rating.

In [8]:
features_used_for_KNN = ['user_age', 'user_gender', 'movie_year', 'genre_affinity', 'user_movies_epoch']

knn_x_train = ratings_training_data[features_used_for_KNN]

KNN = KNeighborsRegressor(n_neighbors = 40, metric = 'manhattan')

KNN.fit(knn_x_train, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='manhattan',
                    metric_params=None, n_jobs=None, n_neighbors=40, p=2,
                    weights='uniform')
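
To make the idea concrete, here is a minimal NumPy sketch of what a K-nearest-neighbors regressor computes for a single query when it uses Manhattan distance and uniform weights. The function name `knn_predict` and the toy data are hypothetical, and scikit-learn's own implementation relies on optimized search structures rather than this brute-force loop over distances.

```python
import numpy as np

def knn_predict(x_train, y_train, query, k):
    """Predict a target as the mean of the k nearest training targets."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Manhattan (L1) distance from the query to every training row
    distances = np.abs(x_train - np.asarray(query, dtype=float)).sum(axis=1)
    # Indices of the k smallest distances (stable sort keeps ties in input order)
    nearest = np.argsort(distances, kind="stable")[:k]
    # Uniform weights: every selected neighbor contributes equally
    return y_train[nearest].mean()

# Toy data, invented for this illustration: [age, movie_year] -> rating
toy_x = [[20, 1990], [25, 1995], [40, 1960], [45, 1965]]
toy_y = [4.0, 5.0, 2.0, 1.0]

# The two closest rows are the first two, so the prediction is their mean rating
toy_prediction = knn_predict(toy_x, toy_y, query=[22, 1992], k=2)
```

Because the distance sums raw feature differences, features with larger numeric ranges weigh more in the neighbor search.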

### Random Forest

This tree-based model can use the information in the **categorical features** in a cleaner and more versatile way.

In [9]:
Movie_genres = ['Action','Adventure','Animation',"Children's",'Comedy','Crime','Documentary','Drama','Fantasy','Film-Noir','Horror',
 'Musical','Mystery','Romance','Sci-Fi','Thriller','War','Western']

features_used_for_RF = [ 'user_age', 'user_gender', 'user_occupation_category', 'user_movies_epoch', 'movie_year'
                        ,'genre_affinity'] + Movie_genres

rf_x_train = ratings_training_data[features_used_for_RF]

# Note: criterion 'mse' was renamed to 'squared_error' in scikit-learn 1.0 and removed in 1.2
RF = RandomForestRegressor(max_depth=12, random_state=1, n_estimators = 20, criterion = 'mse')

RF.fit(rf_x_train, y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=12,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=20,
                      n_jobs=None, oob_score=False, random_state=1, verbose=0,
                      warm_start=False)

### Artificial Neural Network

This model uses **perceptrons** to estimate the movie ratings based on numerical input features.

In [10]:
features_used_for_NN = ['user_age', 'user_gender', 'user_occupation_category','user_movies_epoch'
                        ,'movie_year', 'genre_affinity'] + Movie_genres

nn_x_train = ratings_training_data[features_used_for_NN]

NN = MLPRegressor(hidden_layer_sizes = (20, 15, 10, 5, 3), activation = 'relu', max_iter = 1000)

NN.fit(nn_x_train, y_train)

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(20, 15, 10, 5, 3), learning_rate='constant',
             learning_rate_init=0.001, max_iter=1000, momentum=0.9,
             n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
             random_state=None, shuffle=True, solver='adam', tol=0.0001,
             validation_fraction=0.1, verbose=False, warm_start=False)

### Matrix Factorization

This model is specifically designed for recommendation systems and uses only the ratings.

It has been implemented using **ML.NET (C#)**. In this notebook we only need its predictions, so we will cover the model itself later in this workshop.
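
For intuition only, the following NumPy sketch shows one common way to factorize a ratings matrix: stochastic gradient descent over (user, item, rating) triples with L2 regularization. It is not the ML.NET implementation used in this workshop. The function name `factorize`, the hyperparameters, and the toy ratings are all hypothetical.

```python
import numpy as np

def factorize(ratings, n_users, n_items, n_factors=2,
              lr=0.05, reg=0.02, epochs=500, seed=0):
    """Learn user and item factor matrices from (user, item, rating) triples."""
    rng = np.random.default_rng(seed)
    # Small random start so the factors can move away from zero
    user_factors = rng.normal(scale=0.1, size=(n_users, n_factors))
    item_factors = rng.normal(scale=0.1, size=(n_items, n_factors))
    for _ in range(epochs):
        for u, i, r in ratings:
            # Error between the known rating and the current dot-product estimate
            err = r - user_factors[u] @ item_factors[i]
            # Keep the old user vector so both updates use the same state
            old_user = user_factors[u].copy()
            user_factors[u] += lr * (err * item_factors[i] - reg * user_factors[u])
            item_factors[i] += lr * (err * old_user - reg * item_factors[i])
    return user_factors, item_factors

# Toy ratings, invented for this illustration: (user index, movie index, rating)
toy_ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
               (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

user_factors, item_factors = factorize(toy_ratings, n_users=3, n_items=3)

# Estimated rating for every (user, movie) pair, including unseen ones
predicted = user_factors @ item_factors.T
```

Only the ratings themselves are needed as input, which matches the description above: no user or movie attributes are involved.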

## Making predictions

We select the required features for each model on the test dataset.

In [11]:
naive_x_test = [None] * len(ratings_test_data)
knn_x_test = ratings_test_data[features_used_for_KNN]
rf_x_test = ratings_test_data[features_used_for_RF]
nn_x_test = ratings_test_data[features_used_for_NN]

Then we generate predictions for the test set.

In [17]:
python_predictions = pd.DataFrame()

python_predictions['actual'] = y_test
python_predictions['naive_pred'] = NAIVE.predict(naive_x_test)
python_predictions['knn_pred'] = KNN.predict(knn_x_test)
python_predictions['rf_pred'] = RF.predict(rf_x_test)
python_predictions['nn_pred'] = NN.predict(nn_x_test)

And we load the predictions of the matrix factorization model from its CSV file.

In [24]:
mf_predictions = pd.read_csv(data_folder_path + 'matrix_factorization_predictions.csv', header = 10, sep = ';')
mf_predictions = mf_predictions.dropna()

mldotnet_predictions = pd.DataFrame({
    'actual': mf_predictions.Label,
    'mf_pred': [float(x) for x in mf_predictions.Score]
})

## Results

Finally, we save the predictions of the different models.

In [25]:
python_predictions.to_csv(data_folder_path + 'python_predictions.csv', sep = ';')
mldotnet_predictions.to_csv(data_folder_path + 'mldotnet_predictions.csv', sep = ';')