# MovieLens 10M Collaborative Genre Tagging
**To do**:
  - Try out [this implementation](https://www.onceupondata.com/2019/02/10/nn-collaborative-filtering/) of baseline features.
  - Create object classes for models.
  - Implement TF 2.0 data classes.
  - Check the [Papers with Code leaderboard](https://paperswithcode.com/sota/collaborative-filtering-on-movielens-100k) for MovieLens 100K.
  - Read the [ML 100K state-of-the-art paper](https://arxiv.org/pdf/1706.02263v2.pdf) (RMSE = 0.905), which details their evaluation method.
  
[![Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/James-Leslie/deep-collaborative-filtering/blob/master/tf-movielens10m.ipynb)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import glob
import os

%matplotlib inline

## Load rating data

In [2]:
path = 'data/ml-10M100K/'  # ML-10M files

all_files = glob.glob(os.path.join(path, "ratings*.csv"))
df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)

In [3]:
df.head()

|   | userId | movieId | rating |
|---|--------|---------|--------|
| 0 | 69587 | 1005 | 2.0 |
| 1 | 47904 | 193 | 4.0 |
| 2 | 26906 | 3097 | 3.5 |
| 3 | 31241 | 559 | 1.0 |
| 4 | 69402 | 2541 | 4.0 |


In [4]:
df.shape

(10000054, 3)

In [5]:
print('Number of users:', df.userId.nunique())
print('Number of items:', df.movieId.nunique())
print("Min item rating:", df.rating.min())
print("Max item rating:", df.rating.max())
print("Mean item rating:", df.rating.mean())

Number of users: 69878
Number of items: 10677
Min item rating: 0.5
Max item rating: 5.0
Mean item rating: 3.512421932921562
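
The counts printed above also determine how sparse the user-item rating matrix is. The following is a minimal sketch of that arithmetic, using only the figures from the output above; the variable names are illustrative and do not appear elsewhere in the notebook.

```python
# Figures copied from the notebook output above
n_ratings = 10_000_054   # rows in df
n_users = 69_878
n_items = 10_677

# Fraction of all possible (user, item) pairs that actually have a rating
density = n_ratings / (n_users * n_items)
sparsity = 1 - density

print(f"Density:  {density:.4%}")
print(f"Sparsity: {sparsity:.4%}")
print(f"Mean ratings per user: {n_ratings / n_users:.1f}")
print(f"Mean ratings per item: {n_ratings / n_items:.1f}")
```

Fewer than 2% of the possible user-item pairs are observed, which is the usual motivation for learning low-dimensional latent factors rather than working with the full matrix.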


## Load movie metadata
  - Remove 10% of the movies as a holdout test set.

In [7]:
movies = pd.read_csv(path+'movies.tsv', sep='\t')
movies.head()

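|   | movieId | title | Action | Adventure | Animation | Children | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Boomerang (1992) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | Net, The (1995) | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 2 | Dumb & Dumber (1994) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | Outbreak (1995) | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 4 | 4 | Stargate (1994) | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |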


In [8]:
from sklearn.model_selection import train_test_split

In [9]:
movies, movies_holdout = train_test_split(movies, test_size=.1, random_state=42)

---
# Create baseline features
For each user $u$, calculate the average user bias: the mean difference between the user's rating of a movie and that movie's average rating $\mu_j$, taken over the $n_u$ movies the user has rated:

$$b_{u} = \dfrac{\sum_{j=1}^{n_u} (r_{uj} - \mu_j)}{n_u}$$

For each item $i$, calculate the difference between its average rating over the $n_i$ users who rated it and the global mean rating $\mu$:

$$b_{i} = \dfrac{\sum_{k=1}^{n_i} r_{ki}}{n_i} - \mu$$

Then, for each user-item interaction, calculate the combined bias as the mean of the two:

$$b_{ui} = \dfrac{b_u + b_i}{2}$$
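
The three formulas above can be sketched in pandas on a small, made-up ratings table. This is an illustration of the formulas as written here, not the code in `CGT.get_baseline`, whose docstring describes a slightly different definition (biases measured from the global mean). The `toy` dataframe and all variable names are assumptions introduced for the example.

```python
import pandas as pd

# Toy ratings, made up for illustration (not MovieLens data)
toy = pd.DataFrame({
    "userId":  [0, 0, 1, 1, 2],
    "movieId": [0, 1, 0, 1, 1],
    "rating":  [4.0, 2.0, 5.0, 3.0, 1.0],
})

mu = toy.rating.mean()                             # global mean rating
item_mean = toy.groupby("movieId").rating.mean()   # mean rating of each movie

# b_i: item mean minus global mean
b_i = item_mean - mu

# b_u: per-user average of (rating - mean rating of the rated movie)
resid = toy.rating - toy.movieId.map(item_mean)
b_u = resid.groupby(toy.userId).mean()

# b_ui: mean of the user bias and item bias for every interaction
toy["bias"] = (toy.userId.map(b_u) + toy.movieId.map(b_i)) / 2
print(toy)
```

Mapping the per-user and per-item series back onto the interaction rows keeps everything vectorised, which matters at the 10M-row scale of the real dataset.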

In [10]:
from CGT import get_baseline
?get_baseline

Signature: get_baseline(df, train_index, test_index)
Docstring:
Calculate baseline features from an explicit ratings dataset. Receives a dataframe
and returns train and test splits with added bias column and mean rating value.
User and item biases are calculated as average difference from global mean rating.
Baseline factors are only calculated from training observations, with users or
items that do not appear in train receiving the global average as default.

Args:
    df          : explicit ratings dataframe with columns userId, movieId and rating
    train_index : train index splits taken from KFold.splits()
    test_index  : test index splits taken from KFold.splits()
    
Returns:
    train, test : train/test splits of df, with added bias column
    global_mean : average rating of all training observations
File:      c:\users\jleslie\documents\deep

---
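
The train-only rule described in the `get_baseline` docstring can be sketched as follows. The function `toy_baseline` is a hypothetical stand-in written for this example, not the implementation in `CGT`. It assumes that "receiving the global average as default" means a bias of zero for users or items unseen in training, and the hand-written index arrays stand in for one fold of a K-fold split.

```python
import numpy as np
import pandas as pd

def toy_baseline(df, train_index, test_index):
    """Hypothetical stand-in for CGT.get_baseline, for illustration only."""
    train = df.iloc[train_index].copy()
    test = df.iloc[test_index].copy()

    # All statistics come from the training rows only
    global_mean = train.rating.mean()
    b_u = train.groupby("userId").rating.mean() - global_mean
    b_i = train.groupby("movieId").rating.mean() - global_mean

    for part in (train, test):
        # Assumption: unseen users/items fall back to the global mean, i.e. zero bias
        user_bias = part.userId.map(b_u).fillna(0.0)
        item_bias = part.movieId.map(b_i).fillna(0.0)
        part["bias"] = (user_bias + item_bias) / 2
    return train, test, global_mean

toy = pd.DataFrame({
    "userId":  [0, 0, 1, 1, 2],
    "movieId": [0, 1, 0, 1, 2],
    "rating":  [4.0, 2.0, 5.0, 3.0, 1.0],
})

# Hand-written positional indices standing in for one K-fold split
train_index = np.array([0, 1, 2, 3])
test_index = np.array([4])

train, test, global_mean = toy_baseline(toy, train_index, test_index)
print(test)  # user 2 and movie 2 never appear in train
```

Computing the statistics from the training rows alone keeps information about held-out ratings from leaking into the bias feature.
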
# CGT model
**To do**:
  - Can we avoid re-training the rating model on each CV fold?
  - Create a grid search function / class

In [11]:
from CGT import compile_genre_model
?compile_genre_model

Signature:
compile_genre_model(
    n_items,
    n_users,
    min_rating,
    max_rating,
    mean_rating,
    n_latent,
    n_hidden_1,
    n_hidden_2,
    activation='relu',
    dropout_1=0.2,
    dropout_2=0.2,
    random_seed=42,
)
Docstring: <no docstring>
File:      c:\users\jleslie\documents\deep-collaborative-filtering\cgt.py
Type:      function


# Classification report

In [12]:
from sklearn.metrics import classification_report, confusion_matrix, precision_score, recall_score, f1_score

### Train model on full dataset, with best hparams

In [13]:
# get baseline predictors for full dataset
train, _, _ = get_baseline(df, df.index, df.index)

# compile both models
model1, model2 = compile_genre_model(
    n_items=df.movieId.nunique(),
    n_users=df.userId.nunique(),
    min_rating=df.rating.min(),
    max_rating=df.rating.max(),
    mean_rating=df.rating.mean(),
    n_latent=200, 
    n_hidden_1=100,
    n_hidden_2=100,
    dropout_1=.15,
    dropout_2=.15
)

# train rating model
ratings = model1.fit(
    x=[train.userId.values, train.movieId.values, train.bias.values],
    y=train.rating.values,
    batch_size=2048,
    epochs=8,
    verbose=1,
    validation_split=.2
)

# train genre model (binary target: Drama)
genres = model2.fit(
    x=movies.movieId.values,
    y=movies.Drama.values,
    batch_size=128,
    epochs=5,
    validation_split=.2
)

Train on 8000043 samples, validate on 2000011 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8
Train on 7687 samples, validate on 1922 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Evaluate on test set

In [14]:
X_test = movies_holdout.movieId.values
y_test = movies_holdout.Drama.values
y_score = pd.DataFrame(model2.predict(X_test))
y_pred = y_score.round().astype('int')
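
The cell above turns scores into labels by rounding, which behaves like a threshold near 0.5. The sketch below assumes the genre model outputs a probability between 0 and 1 (the source does not confirm this) and uses made-up scores to show the one case where rounding and an explicit threshold disagree.

```python
import numpy as np

# Made-up sigmoid-style scores, for illustration only
scores = np.array([0.10, 0.49, 0.50, 0.51, 0.90])

# NumPy and pandas round halves to the nearest even value, so 0.5 becomes 0
rounded = np.round(scores).astype(int)

# An explicit threshold makes the tie-breaking rule visible and adjustable
thresholded = (scores >= 0.5).astype(int)

print(rounded)
print(thresholded)
```

A score of exactly 0.5 is rare in practice, but an explicit threshold also makes it straightforward to trade precision against recall later.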

In [15]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.70      0.67      0.68       551
           1       0.66      0.70      0.68       517

    accuracy                           0.68      1068
   macro avg       0.68      0.68      0.68      1068
weighted avg       0.68      0.68      0.68      1068



In [16]:
pd.DataFrame(confusion_matrix(y_test, y_pred))

|   | 0 | 1 |
|---|---|---|
| 0 | 367 | 184 |
| 1 | 156 | 361 |
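
The figures in the holdout classification report can be recovered from this confusion matrix. The sketch below does so for the positive class (Drama = 1), using the scikit-learn convention that rows are true classes and columns are predicted classes. The variable names are illustrative.

```python
import numpy as np

# Holdout confusion matrix copied from the output above
# rows = true class, columns = predicted class
cm = np.array([[367, 184],
               [156, 361]])

tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)   # of the movies predicted Drama, how many are Drama
recall_1 = tp / (tp + fn)      # of the Drama movies, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision_1:.2f} recall={recall_1:.2f} "
      f"f1={f1_1:.2f} accuracy={accuracy:.2f}")
```
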


In [26]:
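movies_holdout = movies_holdout.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
movies_holdout['prediction'] = y_pred.values.ravel()  # flatten (n, 1) predictions to 1-D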

In [33]:
movies_holdout[['movieId', 'title', 'Drama', 'prediction']].to_csv(path+'holdout_predictions.csv', index=False)

In [32]:
X_train = movies.movieId.values
y_train = movies.Drama.values
train_score = pd.DataFrame(model2.predict(X_train))
train_pred = train_score.round().astype('int')

In [34]:
print(classification_report(y_train, train_pred))

              precision    recall  f1-score   support

           0       0.75      0.70      0.72      4790
           1       0.72      0.76      0.74      4819

    accuracy                           0.73      9609
   macro avg       0.73      0.73      0.73      9609
weighted avg       0.73      0.73      0.73      9609



In [35]:
pd.DataFrame(confusion_matrix(y_train, train_pred))

|   | 0 | 1 |
|---|---|---|
| 0 | 3356 | 1434 |
| 1 | 1134 | 3685 |


In [37]:
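movies = movies.copy()  # work on an explicit copy to avoid SettingWithCopyWarning
movies['prediction'] = train_pred.values.ravel()  # flatten (n, 1) predictions to 1-D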

In [38]:
movies[['movieId', 'title', 'Drama', 'prediction']].to_csv(path+'train_predictions.csv', index=False)