# MINI-PROJECT RECSYS

Today we will build a recommender system (RS) using the [*ml-100k* dataset](https://grouplens.org/datasets/movielens/100k/). This dataset consists of 100,000 ratings from 1000 users on 1700 movies, and it is typically used either for comparing results in SOTA papers or for building toy RS.

1. First create a folder called *data* and put the downloaded dataset inside it, so that your data is stored in:
> *./data/ml-100k/all_files_here*.

2. The next cell sets up the whole model and pipeline configuration. You can modify any parameter you like (for example the seed, lr, batch_size or embedding dimension).

3. Also check that the seed you choose is forwarded to the random, numpy and tensorflow libraries and to the relevant os environment variables, so that the results you achieve can be reproduced.
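
Before running the loading cells, it can help to confirm that the files ended up where step 1 expects them. The following is a minimal sketch, assuming the three files used later in this notebook (`u.data`, `u.item`, `u.user`); the helper name `missing_files` is hypothetical and not part of the notebook.

```python
from pathlib import Path

# Files read later in this notebook (assumption: no other file is required)
EXPECTED_FILES = ("u.data", "u.item", "u.user")


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_files("data/ml-100k")
if missing:
    print(f"Missing from data/ml-100k: {missing}")
else:
    print("All expected files found.")
```

If the list is not empty, the loading cells further down will most likely fail with a `FileNotFoundError`.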

In [None]:
import os
import csv
import pandas as pd
import random
import numpy as np
import tensorflow as tf
import warnings
warnings.simplefilter('ignore')

DATA_DIR = 'data/ml-100k'
OUTPUT_DIR = './'

class Config:
    category_col = ['user_id','movie_id','Action','Adventure','Animation',"Childrens",'Comedy','Crime',
          'Documentary','Drama','Fantasy','Film-Noir','Horror','Musical','Mystery',
          'Romance','Sci-Fi','Thriller','War','Western', 'gender','occupation','year']
    num_col = ['age']
    target_col = ['label']
    
    epochs=5
    batch_size=128
    seed=17
    embedding_dim=8
    lr=1e-4
    

def seed_everything(seed=1234):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    os.environ['TF_CUDNN_DETERMINISTIC'] = '1'

    tf.config.threading.set_inter_op_parallelism_threads(1)
    tf.config.threading.set_intra_op_parallelism_threads(1)

In [None]:
config=Config()
seed_everything(config.seed)
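
A quick way to convince yourself that seeding works is to draw random numbers twice with the same seed and compare them. This sketch only covers `random` and `numpy` (TensorFlow is left out to keep it light), and the helper `draw` is a hypothetical name used for illustration.

```python
import random

import numpy as np


def draw(seed):
    """Re-seed both generators and return a few random draws."""
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), np.random.rand(3).tolist()


first = draw(17)
second = draw(17)
print(first == second)  # True: same seed, same draws
```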

## Load data

In this section we will load three of the files available in the ml-100k dataset:

- the interaction data (ratings)
- the user data
- the item data

In [None]:
def load_data_df():
    df = pd.read_csv(os.path.join(DATA_DIR, 'u.data'), sep='\t', header=None)
    df.columns = ['user_id', 'movie_id', 'rating', 'timestamp']
    return df

def load_item_df():
    m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url',
          'unknown','Action','Adventure','Animation',"Children's",'Comedy','Crime',
          'Documentary','Drama','Fantasy','Film-Noir','Horror','Musical','Mystery',
          'Romance','Sci-Fi','Thriller','War','Western',]
    item_df = pd.read_csv(os.path.join(DATA_DIR, 'u.item'), sep='|', encoding="iso-8859-1", names=m_cols)
    item_df = item_df.rename(columns={"Children's":'Childrens'})
    return item_df

def load_user_df():
    u_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
    user_df = pd.read_csv(os.path.join(DATA_DIR, 'u.user'), sep='|', encoding="iso-8859-1", names=u_cols)
    return user_df

In [None]:
data_df = load_data_df()
item_df = load_item_df()
user_df = load_user_df()


In [None]:
################################
# Exercise 1: 
#
# You can now visualize each of them and get familiar with the data using the dataframe.head() function.
#
###############################

data_df.head()

> #### Please check the [pandas.merge documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) for more details before starting the next exercise.
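
As a warm-up, here is a small sketch of what an inner merge does, using made-up toy frames (`ratings`, `movies`, `users` are hypothetical names, not the notebook's dataframes). Rows whose key has no match in the other frame are dropped.

```python
import pandas as pd

ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "movie_id": [10, 20, 10, 30],
    "rating": [5, 3, 4, 2],
})
movies = pd.DataFrame({"movie_id": [10, 20], "title": ["A", "B"]})
users = pd.DataFrame({"user_id": [1, 2, 3], "age": [24, 53, 23]})

# Movie 30 has no entry in `movies`, so its rating disappears after the inner merge
merged = (
    ratings
    .merge(movies, on="movie_id", how="inner")
    .merge(users, on="user_id", how="inner")
)
print(merged.shape)  # (3, 5)
```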


In [None]:
################################
# Exercise 2: 
#
# Complete the following function so that it merges the three dataframes into a single one, 
# using 'inner' mode for every merge. 
#
###############################

def merge_df(data_df, item_df, user_df):
    # ...
    return tmp

In [None]:
df = merge_df(data_df, item_df, user_df)
df.head()

In [None]:
################################
# Exercise 3: 
#
# Check with a single command that the merged dataframe contains 100,000 rows and 31 features per row. 
#
###############################

# ...

## Preprocess

In this section we will pre-process the data: we apply some common filters (such as requiring a film to have a minimum of 10 views) and some data transformations that allow the model to process numerical and categorical features.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split


def build_preprocessor(config): 
    category_col = config.category_col
    num_col = config.num_col
    
    num_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ('std', (StandardScaler())),])

    categorical_transformer = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value='NAN')),
        ('oe', (OrdinalEncoder())),
        ])
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', num_transformer, num_col),
            ('cat', categorical_transformer, category_col),
        ],
        remainder="drop")
    return preprocessor
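
To see the kind of output such a preprocessor produces, you can run the same building blocks on a tiny made-up dataframe. This is only an illustrative sketch: the frame `toy` and its columns are assumptions, not the notebook's data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

toy = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "gender": ["F", "M", "F"],
    "ignored": [1, 2, 3],  # not listed in any transformer
})

toy_pp = ColumnTransformer(
    transformers=[
        ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                          ("std", StandardScaler())]), ["age"]),
        ("cat", Pipeline([("imputer", SimpleImputer(strategy="constant", fill_value="NAN")),
                          ("oe", OrdinalEncoder())]), ["gender"]),
    ],
    remainder="drop",
)

toy_out = toy_pp.fit_transform(toy)
print(toy_out.shape)  # (3, 2): one numerical and one categorical column
print(toy_out)
```

Compare the printed array with the original `toy` frame; it should help with the explanation requested in the next exercise.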

In [None]:
################################
# Exercise 4: 
#
# By reading the code of the previous cell and the scikit-learn documentation, explain what the
# function 'build_preprocessor' does to our data. Answer in the next cell.
#
###############################

> **Answer here:**

In [None]:
################################
# Exercise 5: 
#
# Following the steps proposed below, write the code for the filters and transformations
# that need to be applied. We have split the steps into a), b), c) and d).
#
###############################
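
The pandas idioms involved in steps a) to c) can be tried first on a toy frame. The sketch below uses made-up data and a hypothetical threshold `min_views`; it shows one possible way to express each step, not necessarily the intended solution.

```python
import pandas as pd

toy = pd.DataFrame({
    "movie_id": [1, 1, 1, 2, 3, 3],
    "rating": [5, 3, 4, 2, 4, 1],
    "release_date": ["01-Jan-1995"] * 3 + ["15-Mar-1996"] + ["22-Nov-1994"] * 2,
})

min_views = 2  # hypothetical threshold for this toy example

# Count how many rows (views) each movie has, aligned with the original rows
views = toy.groupby("movie_id")["movie_id"].transform("count")
popular = toy[views >= min_views].copy()

# The year is the last four characters of strings such as '01-Jan-1995'
popular["year"] = popular["release_date"].str[-4:]

# Binary label: 1 when the rating is at least 4, otherwise 0
popular["label"] = (popular["rating"] >= 4).astype(int)
print(popular)
```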

In [None]:
category_col = config.category_col
num_col = config.num_col
target_col = config.target_col[0]

In [None]:
#######
# a) Apply the filter in order to just use movies with more than 10 views
#######

print(df.shape)

df = # ...

print(df.shape)

In [None]:
#######
# b) Extract the year from the 'release_date' column.
#######

df["year"] = # ...

In [None]:
#######
# c) Create a label column for binary classification from the 'rating' column. If rating >=4, we want the target_col
#    to be 1, and if not 0. 
#######

df['label'] = # ...
df[target_col]

In [None]:
# Build pipeline
pp = build_preprocessor(config)
pp.fit(df)

In [None]:
# Check that it transforms the data (do not do it yet)
pp.transform(df).shape

In [None]:
#######
# d) Split the data into 80% (train) and 20% (test) with the imported function 'train_test_split'. Look at the 
#    documentation and use df['movie_id'] as 'stratify' parameter and config.seed as 'random_state'.
#######
tra_df, val_df = # ...
print(tra_df.shape)
print(val_df.shape)

In [None]:
assert tra_df.movie_id.nunique() == val_df.movie_id.nunique()
assert len(set(val_df.user_id) - set(tra_df.user_id)) == 0
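
The first assertion above holds only if every movie appears in both splits, which is what the `stratify` argument is for. Here is a small sketch on made-up data (the frame `toy` is an assumption) showing how stratification keeps each movie represented in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Two movies with ten ratings each
toy = pd.DataFrame({"movie_id": [1] * 10 + [2] * 10, "rating": list(range(20))})

toy_tra, toy_val = train_test_split(
    toy, test_size=0.2, stratify=toy["movie_id"], random_state=17
)
print(toy_tra.shape, toy_val.shape)  # (16, 2) (4, 2)
print(toy_val["movie_id"].value_counts().to_dict())
```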

## Training

In this section we will build the model and train it with the data we have been preparing.

In [None]:
from tensorflow.keras.layers import Input, Embedding, Dense, Flatten, add, Activation, dot
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2 as l2_reg
from tensorflow.keras.utils import plot_model  # public API; the private tensorflow.python path may not exist in recent versions
from tensorflow.keras.callbacks import EarlyStopping

import itertools


def build_model(category_num, category_cols, num_cols, K=8, l2=0.0, l2_fm=0.0):

    # Numerical features
    num_inputs = [Input(shape=(1,), name=col,) for col in num_cols]
    # Categorical features
    cat_inputs = [Input(shape=(1,), name=col,) for col in category_cols]

    inputs = num_inputs + cat_inputs

    flatten_layers=[]
    # Numerical feature embedding
    for enc_inp, col in zip(num_inputs, num_cols):
        # num feature dense
        x = Dense(K, name = f'embed_{col}',kernel_regularizer=l2_reg(l2_fm))(enc_inp)
        flatten_layers.append(x)

    # Category feature embedding
    for enc_inp, col in zip(cat_inputs, category_cols):
        num_c = category_num[col]
        embed_c = Embedding(input_dim=num_c,
                            output_dim=K,
                            input_length=1,
                            name=f'embed_{col}',
                            embeddings_regularizer=l2_reg(l2_fm))(enc_inp)
        flatten_c = Flatten()(embed_c)
        flatten_layers.append(flatten_c)
                
    # Feature interaction term
    fm_layers = []
    for emb1,emb2 in itertools.combinations(flatten_layers, 2):
        dot_layer = dot([emb1,emb2], axes=1)
        fm_layers.append(dot_layer)        

    # Linear term
    for enc_inp,col in zip(cat_inputs, category_cols):
        # embedding
        num_c = category_num[col]
        embed_c = Embedding(input_dim=num_c,
                            output_dim=1,
                            input_length=1,
                            name=f'linear_{col}',
                            embeddings_regularizer=l2_reg(l2_fm))(enc_inp)
        flatten_c = Flatten()(embed_c)
        fm_layers.append(flatten_c)
                
    for enc_inp, col in zip(num_inputs, num_cols):
        x = Dense(1, name = f'linear_{col}',kernel_regularizer=l2_reg(l2_fm))(enc_inp)
        fm_layers.append(x)

    # Add all terms
    flatten = add(fm_layers)
    outputs = Activation('sigmoid',name='outputs')(flatten)
    
    model = Model(inputs=inputs, outputs=outputs)
    return model    
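
The model above has the shape of a factorization machine: it sums one dot product per pair of feature embeddings plus one linear term per feature, and passes the total through a sigmoid. The NumPy sketch below reproduces that computation for a single example with three features. All values are random or made up, and the names (`embeddings`, `linear_terms`) are hypothetical.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
K = 4  # embedding size (toy value)

# Pretend these are the embeddings already looked up for 3 features of one example
embeddings = [rng.normal(size=K) for _ in range(3)]
# Pretend these are the per-feature linear terms
linear_terms = [0.1, -0.2, 0.05]

# Interaction term: one dot product for every pair of features
pairwise = sum(float(np.dot(a, b)) for a, b in itertools.combinations(embeddings, 2))

logit = pairwise + sum(linear_terms)
prob = 1.0 / (1.0 + np.exp(-logit))  # sigmoid output
print(prob)

# The same pairwise sum via the usual identity: 0.5 * (||sum v||^2 - sum ||v||^2)
stacked = np.stack(embeddings)
pairwise_fast = 0.5 * (np.sum(stacked.sum(axis=0) ** 2) - np.sum(stacked ** 2))
print(np.isclose(pairwise, pairwise_fast))  # True
```

With many features, the number of pairs grows quadratically, which is why the identity in the last lines is often used in practice; the notebook's model builds the pairs explicitly.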

In [None]:
################################
# Exercise 6: 
#
# Make sure you understand the model above, then create an instance of it and compile it with 
# optimizer 'adam', loss 'binary_crossentropy' and metrics 'accuracy'.
#
###############################

category_num = {col: df[col].nunique() for col in config.category_col}

model = # instantiate the model here
# compile the model here

Now we will build an 'early stopping' callback and transform the training and validation dataframes with the **pp.transform** function we tested earlier in the notebook.

In [None]:
cb = [EarlyStopping(monitor='val_loss', min_delta=1e-4, patience=2, verbose=0,)]

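feature_num = len(config.category_col + config.num_col)
# Transform each dataframe once, then split the result into one array per feature column
tra_X = pp.transform(tra_df)
val_X = pp.transform(val_df)
tra_inputs = [tra_X[:, i] for i in range(feature_num)]
val_inputs = [val_X[:, i] for i in range(feature_num)]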

In [None]:
################################
# Exercise 7: 
#
# Complete the fit function with the necessary arguments.
#
###############################

history = model.fit(
          x= #... ,
          y= #... ,
          epochs= #... ,
          batch_size= #... ,
          validation_data= #(... , ... ),
          callbacks= # ...
         )

The next cell provides a function that plots the training and validation curves. Plot them and comment on the results:

> **Answer here**:

In [None]:
import matplotlib.pyplot as plt
def plot_history(history):
    # Plot training & validation accuracy values
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['val_accuracy'])
    plt.title('Model accuracy')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Valid'], loc='upper left')
    plt.show()

    # Plot training & validation loss values
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('Model loss')
    plt.ylabel('Loss')
    plt.xlabel('Epoch')
    plt.legend(['Train', 'Valid'], loc='upper left')
    plt.show()    

In [None]:
plot_history(history)
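
Besides looking at the curves, it can be useful to read a couple of numbers from the history when commenting on the results. The sketch below uses a made-up dictionary `toy_history` with the same keys as `history.history`; the values are invented for illustration and are not results from this model.

```python
import numpy as np

# Invented values, for illustration only
toy_history = {
    "loss": [0.69, 0.62, 0.58, 0.55],
    "val_loss": [0.68, 0.63, 0.61, 0.62],
}

# Epoch (0-based) with the lowest validation loss
best_epoch = int(np.argmin(toy_history["val_loss"]))

# Gap between validation and training loss at the last epoch;
# a growing gap is a common sign of overfitting
gap = toy_history["val_loss"][-1] - toy_history["loss"][-1]
print(best_epoch, round(gap, 2))  # 2 0.07
```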

## Check output

In this last section we propose selecting a user and qualitatively checking the predictions, to see what our system would recommend to that user. You can also try several different users if you want.

We will select the validation rows of a given user (those the model has not seen during training), compute the prediction for each film in them, and compare whether the predictions make sense given the original ratings the user gave (ground truth).
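
The procedure described above can be tried on toy data first. In this sketch the frame `toy_val` and the array `toy_scores` are made up, with `toy_scores` standing in for model outputs (one score per row, in row order).

```python
import numpy as np
import pandas as pd

toy_val = pd.DataFrame({
    "user_id": [1, 2, 1, 1],
    "title": ["A", "B", "C", "D"],
    "rating": [5, 3, 2, 4],
})
toy_scores = np.array([0.9, 0.4, 0.2, 0.7])  # hypothetical predictions

# Boolean mask selects rows by position, which also works on the NumPy array
mask = (toy_val["user_id"] == 1).to_numpy()
picked = toy_val[mask].copy()
picked["pred"] = toy_scores[mask]

ranked = picked.sort_values("pred", ascending=False)
print(ranked[["title", "rating", "pred"]])
```

Selecting by a boolean mask avoids mixing dataframe index labels with array positions, which differ once a dataframe has been shuffled or split.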

In [None]:
######################
# a) Select a user_id and select all his/her validation results
#####################
user = # ... choose a user_id
user_df = # ... rows of val_df that belong to that user

In [None]:
print(user_df.index)
print(feature_num) # feature_num = len(config.category_col + config.num_col)

In [None]:
user_X = pp.transform(user_df)  # transform the user's rows directly; the output array is indexed by position, not by label
user_inputs = [user_X[:, i] for i in range(feature_num)]

In [None]:
######################
# b) Predict the results for 'user_inputs' variable and sort their predictions in descending order.
#####################

user_df['pred'] = # ... compute predictions
user_df = # ... sort in descending order

In [None]:
# Finally, compare the user's true ratings with the predictions
user_df[['title','rating','pred']].head(50)