# Task 

# Data

In [1]:
import numpy as np
import csv
from sklearn.preprocessing import MultiLabelBinarizer
from scipy.sparse import hstack
import pickle as pkl
from utils.tokenizer import tokenize_corpus

In [None]:
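import ast

def getNames(data):
    """Return the 'name' field of each dict in a stringified list (e.g. the `genres` column)."""
    if not data:
        return []
    # literal_eval only parses Python literals, unlike eval, which executes arbitrary code
    parsedData = ast.literal_eval(data)
    # malformed rows can parse to a non-list value (for example a bare boolean); treat them as empty
    if not isinstance(parsedData, list):
        return []
    # always return a list so every row has the same type for MultiLabelBinarizer
    return [pieceOfInfo['name'] for pieceOfInfo in parsedData]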
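
For illustration, list-valued columns such as `genres` are stored as a stringified list of dictionaries. The sketch below shows the parsing step on one such value. The sample string and the helper name `parse_names` are made up for this example (only the format follows the column), and `ast.literal_eval` is used because it evaluates Python literals only.

```python
import ast

# Hypothetical cell value in the format used by the `genres` column.
sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

def parse_names(cell):
    """Return the 'name' field of every dictionary in a stringified list."""
    if not cell:
        return []
    parsed = ast.literal_eval(cell)
    if not isinstance(parsed, list):
        return []
    return [item['name'] for item in parsed]

names = parse_names(sample)   # list of genre names
empty = parse_names('')       # empty cells give an empty list
```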

In [None]:
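with open('./data/links.csv', 'r', encoding='utf-8', newline='') as f:
    reader = csv.reader(f)
    next(reader, None)  # skip the header row
    id_to_movieId = dict()
    for line in reader:
        try:
            # map the id in column 2 (matched against `id` in movies_metadata.csv) to the movieId in column 0
            id_to_movieId[int(line[2])] = int(line[0])
        except (ValueError, IndexError):
            # rows with a missing or non-numeric id are skipped
            pass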

In [None]:
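with open('./data/movies_metadata.csv', encoding='utf-8', newline='') as csvFile:
    rows = list(csv.DictReader(csvFile))

# one row per movie, ten feature columns; object dtype because the columns mix strings, booleans and lists
dataEmbeded = np.empty((len(rows), 10), dtype=object)
US_ONLY = "[{'iso_3166_1': 'US', 'name': 'United States of America'}]"

for i, row in enumerate(rows):
    dataEmbeded[i, 0] = row['overview']
    try:
        dataEmbeded[i, 1] = id_to_movieId[int(row['id'])]
    except (ValueError, KeyError):
        # no matching movieId for this row; the entry stays None
        pass
    # csv yields strings, so `== 1` is always False; this assumes adult titles are stored as the text 'True'
    dataEmbeded[i, 2] = row['adult'] == 'True'
    dataEmbeded[i, 3] = row['budget']
    dataEmbeded[i, 4] = getNames(row['genres'])
    dataEmbeded[i, 5] = row['popularity']
    dataEmbeded[i, 6] = getNames(row['production_companies'])
    # True only when the United States is the sole production country
    dataEmbeded[i, 7] = row['production_countries'] == US_ONLY
    dataEmbeded[i, 8] = row['revenue']
    dataEmbeded[i, 9] = getNames(row['spoken_languages'])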

In [None]:
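# multi-hot encode each list-valued column; fit_transform refits the same binarizer on every call,
# so one_hot.classes_ only reflects the last column encoded
one_hot = MultiLabelBinarizer(sparse_output=True)
genres = one_hot.fit_transform(dataEmbeded[:,4])
production_companies = one_hot.fit_transform(dataEmbeded[:,6])
spoken_languages = one_hot.fit_transform(dataEmbeded[:,9])
# bag-of-words matrix built from the overviews
BoW = tokenize_corpus(dataEmbeded[:,0], stop_words = False, BoW = True)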

In [None]:
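data = hstack([BoW, genres, spoken_languages])
# written with pickle despite the .npy extension, so read it back with pkl.load, not np.load
with open('./data/data.npy', 'wb') as pickleFile:
    data = {'ids': dataEmbeded[:, 1], 'data': data}
    pkl.dump(data, pickleFile)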

# Model

## Explanation of the base models

### Collaborative Deep Learning

The first model we build on is Hao Wang's model, in which a Stacked Denoising Autoencoder (SDAE) is in charge of the item-based part. The principle of this network is as follows:
* An MLP neural network receives a vector as input and has to reproduce it as output.
* Noise is applied to the input to make the network more robust.
* The first part of the network transforms this vector until it is much smaller than the input.
* The second part of the network transforms this small vector back until it has the same size as the input. The loss is the difference between the input vector and the output vector, which pushes the network to learn a reversible transformation.
* The network can therefore be cut in half: an encoder that maps a large vector to a smaller, denser vector representing it, and a decoder that recovers the original vector from this dense vector.

This type of network is particularly useful with a bag-of-words approach, which initially produces a vector that is often very sparse and as large as the vocabulary, and therefore hard to use without dimensionality reduction.
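
The steps above can be sketched as a single forward pass in NumPy. This is a minimal illustration, not the model used here: it assumes one encoding and one decoding layer (a stacked version would chain several), masking noise, sigmoid activations, a squared-error loss and made-up sizes, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration: 8-word vocabulary, 3-dimensional code.
vocab_size, code_size = 8, 3

# Randomly initialised weights (a trained network would learn these).
W_enc = rng.normal(scale=0.1, size=(vocab_size, code_size))
b_enc = np.zeros(code_size)
W_dec = rng.normal(scale=0.1, size=(code_size, vocab_size))
b_dec = np.zeros(vocab_size)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def corrupt(x, drop_prob, rng):
    """Masking noise: zero out each input entry with probability drop_prob."""
    mask = rng.random(x.shape) >= drop_prob
    return x * mask

def encode(x):
    """Encoder half: large sparse vector -> small dense code."""
    return sigmoid(x @ W_enc + b_enc)

def decode(code):
    """Decoder half: small dense code -> vector of the input size."""
    return sigmoid(code @ W_dec + b_dec)

# One sparse bag-of-words vector (1 = word present).
x = np.array([1, 0, 0, 1, 0, 0, 1, 0], dtype=float)

noisy_x = corrupt(x, drop_prob=0.3, rng=rng)
code = encode(noisy_x)
reconstruction = decode(code)

# The loss compares the reconstruction with the clean input, not the noisy one.
loss = np.mean((reconstruction - x) ** 2)
```

After training, only `encode` would be needed to turn each item's bag of words into its dense representation.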

<img src="./images/SDAE.png" width=300px>

For the collaborative part, on the other hand, embeddings are created for the users and the items. Embeddings are widely used in other fields (notably NLP) and are particularly well suited to this application: they are dense vectors representing entities, and the more similar two entities are, the closer their embeddings will be.

The item embedding and the dense vector produced by the SDAE are then concatenated to form the full item embedding.
The user embedding and the full item embedding are then multiplied to produce the rating predictions.
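
As a rough sketch of this step, the concatenation and multiplication can be written in NumPy as below. All sizes are assumptions chosen for the example, the embeddings are random stand-ins for learned values, and `content_codes` stands in for the SDAE encoder output. Note that the user embedding has to be as long as the full item embedding for the product to be defined.

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_items = 4, 5
item_emb_size, content_size = 3, 2          # assumed sizes
full_size = item_emb_size + content_size    # user vectors must match this length

user_embeddings = rng.normal(size=(n_users, full_size))
item_embeddings = rng.normal(size=(n_items, item_emb_size))
# Stand-in for the dense vector the SDAE encoder produces for each item.
content_codes = rng.normal(size=(n_items, content_size))

# Full item embedding = collaborative embedding followed by the content code.
full_item_embeddings = np.concatenate([item_embeddings, content_codes], axis=1)

# Predicted rating matrix: one dot product per (user, item) pair.
predicted_ratings = user_embeddings @ full_item_embeddings.T

# Prediction for a single pair, here user 2 and item 3.
single = user_embeddings[2] @ full_item_embeddings[3]
```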

<img src="./images/MF.png" width=600px>

The full architecture is as follows:

<img src="./images/CDL.png" width=400px>

### Neural Collaborative Filtering

The second model builds on the first one. However, Xiangnan He et al. argue that the matrix multiplication is suboptimal and lacks the capacity to represent the non-linear relations between users, items and ratings. They therefore propose replacing the multiplication with a neural network.

<img src="./images/NCF_1.png" width=400px>

The intuition behind this is that matrix multiplication is a special case of the MLP: with the right weights (identity), a network can simply output the result of a matrix multiplication, like so:

<img src="./images/NCF_3.png" width=200px>

<img src="./images/NCF_2.png" width=400px>
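
One way to read this special case is sketched below in plain Python: a layer that multiplies the two embeddings element-wise, followed by a single output unit whose weights are all 1, with no bias and an identity activation, returns exactly the dot product. The function names and the example vectors are made up for this illustration.

```python
def dot(u, v):
    """Plain matrix-factorization score: dot product of two embeddings."""
    return sum(a * b for a, b in zip(u, v))

def identity_network(u, v):
    """Tiny network that reproduces the dot product."""
    # Layer 1: element-wise product of the two embeddings.
    hidden = [a * b for a, b in zip(u, v)]
    # Layer 2: one output unit, all weights 1, no bias, identity activation.
    weights = [1.0] * len(hidden)
    return sum(w * h for w, h in zip(weights, hidden))

user = [0.5, -1.0, 2.0]
item = [1.5, 0.25, -0.5]

mf_score = dot(user, item)
nn_score = identity_network(user, item)   # same value as mf_score
```

Letting the output weights and the activation be learned instead of fixed is what gives the network more capacity than the plain product.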

However, empirical results showed that keeping the matrix multiplication still yields better results. The model they propose is therefore the following:

<img src="./images/NCF_4.png" width=400px>
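
A minimal NumPy sketch of such a combined model follows, assuming one hidden layer in the MLP branch, ReLU and sigmoid activations, and made-up sizes. For brevity both branches share the same random embeddings here, which is a simplification made for this example and not necessarily how the original model is set up.

```python
import numpy as np

rng = np.random.default_rng(0)
emb_size, hidden_size = 4, 3   # assumed sizes

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Embeddings of one user and one item (random stand-ins for learned values).
user = rng.normal(size=emb_size)
item = rng.normal(size=emb_size)

# Branch 1: the matrix-factorization part keeps the element-wise product.
mf_vector = user * item

# Branch 2: an MLP applied to the concatenated embeddings.
W1 = rng.normal(scale=0.1, size=(2 * emb_size, hidden_size))
b1 = np.zeros(hidden_size)
mlp_vector = relu(np.concatenate([user, item]) @ W1 + b1)

# Fusion: concatenate both branches and apply one final output layer.
w_out = rng.normal(scale=0.1, size=emb_size + hidden_size)
score = sigmoid(np.concatenate([mf_vector, mlp_vector]) @ w_out)
```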

### Our model: Neural Hybrid Recommender

We kept the main ideas presented above and added a few improvements:
* Addition of regularization layers (batch normalization and dropout)
* Concatenation of the SDAE to the Neural Collaborative Filter
* Use of the Adam optimizer

Batch normalization improves convergence speed and dropout helps prevent overfitting. The Adam optimizer combines momentum with per-parameter adaptive learning rates and has been shown to speed up optimization.
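
To make the optimizer's role concrete, here is a small pure-Python sketch of the standard Adam update on a one-dimensional toy problem. The objective, the learning rate and the helper name `adam_minimize` are assumptions chosen for this example and are unrelated to the settings used for the model.

```python
import math

def adam_minimize(grad, x0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimise a scalar function given its gradient, using Adam updates."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g          # momentum: running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g      # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)             # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)   # step scaled per parameter
    return x

# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

The division by `sqrt(v_hat)` is what adapts the step size to each parameter, while `m_hat` carries the momentum.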

The model is then:

<img src="./images/NHR.png" width=400px>

# Results