## **Advances in Data Mining**

Stephan van der Putten | s1528459 | stvdputtenjur@gmail.com  
Theo Baart | s2370328 | s2370328@student.leidenuniv.nl

### **Assignment 1**
This assignment is concerned with implementing formulas and models capable of predicting movie ratings for a set of users. Additionally, the accuracy of the various models is checked.

Note: all implementations are based on the assignment guidelines and the provided helper files, as well as the documentation of the functions used.

#### **Model Approach**
This notebook implements a matrix factorization approach to the prediction problem.

The following snippet handles all imports.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os.path
import sklearn
from sklearn.model_selection import KFold

### **Data Extraction and Preparation**

The `convert_data` function is used to extract the data from the raw data file and store it in a format that is more convenient for us. 

In order to do this the function uses the following parameters:
  * `path` - the (relative) location of the raw dataset (with filetype `.dat`)
  * `cols` - which columns to load from the raw dataset
  * `delim` - the delimiter used in the raw dataset
  * `dt` - the datatype used for the data in the raw dataset
    
Additionally it returns the following value:
  * `path` - the input path, returned unchanged; the converted dataset is stored at this path with filetype `.npy`

In [3]:
def convert_data(path="datasets/ratings",cols=(0,1,2),delim="::",dt="int"):
    raw = np.genfromtxt(path+'.dat', usecols=cols, delimiter=delim, dtype=dt)
    np.save(path+".npy",raw)
    # check that the saved file round-trips to the same array, element by element
    assert np.array_equal(np.load(path+'.npy'), raw)
    return path
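
The conversion and its round-trip check can be tried on a small file first. The sketch below is a minimal illustration, not part of the assignment code: the three sample rows and the temporary file location are made up for the example. It also shows why comparing the results of `.all()` is a weak check, since that only compares two booleans.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample rows in the "::"-delimited layout (UserId::MovieId::Rating).
sample = "1::1193::5\n1::661::3\n2::1357::4\n"

with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "ratings")
    with open(base + ".dat", "w") as handle:
        handle.write(sample)

    # Same calls as in convert_data, applied to the temporary file.
    raw = np.genfromtxt(base + ".dat", usecols=(0, 1, 2), delimiter="::", dtype=int)
    np.save(base + ".npy", raw)
    loaded = np.load(base + ".npy")

# Element-wise comparison: True only if shape and every value match.
round_trip_ok = np.array_equal(loaded, raw)

# Comparing .all() results passes even for arrays with different contents.
weak_check_passes = np.array([1, 2]).all() == np.array([3, 4]).all()

print(round_trip_ok, weak_check_passes)
```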

The `prep_data` function is used to load the stored data and transform it into a usable and well defined dataframe. 

In order to do this the function uses the following parameters:
  * `path` - the (relative) location of the converted dataset: if no file exists a new one is created
    
Additionally it returns the following value:
  * `df_ratings` - a dataframe containing the dataset

In [4]:
def prep_data(path='datasets/ratings'):
    filepath = path+'.npy'
    if not os.path.isfile(filepath):
        # convert the raw .dat file; the result is written to `filepath`
        convert_data(path)
    ratings = np.load(filepath)
    df_ratings = pd.DataFrame(ratings)
    colnames = ['UserId', 'MovieId', 'Rating']
    df_ratings.columns = colnames
    return df_ratings

The following snippet is responsible for running the extraction and preparation of the raw data. The data is stored in `df_ratings`.

In [5]:
df_ratings = prep_data()

### **Rating Model**

The `rating_model` function predicts a user's rating for a given movie by implementing matrix factorization with gradient descent.

In order to do this the function uses the following parameters:
  * `df` - the dataframe containing the dataset
  * `user` - the user for which a rating is requested
  * `item` - the movie for which a rating is requested 
    
Additionally it returns the following value:
  * `mean` - the predicted rating for the requested movie by the requested user

In [None]:
def rating_model(df,user,item):
    mean = 0.0 # TODO
    return mean
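
The function body is still a placeholder. As an illustration of the general technique only, the following is a minimal sketch of matrix factorization trained with stochastic gradient descent. It is not the implementation for this assignment. The names `fit_mf` and `predict_mf`, the hyperparameter defaults, the global-mean offset, the fallback for unseen users or movies, and the clipping to the range 1 to 5 are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def fit_mf(df, num_factors=2, lr=0.01, reg=0.02, epochs=500, seed=0):
    """Learn user and movie factor matrices with stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    # Map raw ids to consecutive row indices of the factor matrices.
    user_idx = {u: k for k, u in enumerate(df['UserId'].unique())}
    item_idx = {m: k for k, m in enumerate(df['MovieId'].unique())}
    U = rng.normal(scale=0.1, size=(len(user_idx), num_factors))
    M = rng.normal(scale=0.1, size=(len(item_idx), num_factors))
    mu = float(df['Rating'].mean())  # global mean used as a constant offset
    rows = [(user_idx[u], item_idx[m], r)
            for u, m, r in zip(df['UserId'], df['MovieId'], df['Rating'])]
    for _ in range(epochs):
        for k in rng.permutation(len(rows)):  # visit ratings in random order
            a, b, r = rows[k]
            err = r - (mu + U[a] @ M[b])
            u_old = U[a].copy()  # use the pre-update user vector for M's step
            U[a] += lr * (err * M[b] - reg * U[a])
            M[b] += lr * (err * u_old - reg * M[b])
    return {'U': U, 'M': M, 'mu': mu, 'user_idx': user_idx, 'item_idx': item_idx}


def predict_mf(model, user, item):
    """Predict a rating; fall back to the global mean for unseen ids."""
    a = model['user_idx'].get(user)
    b = model['item_idx'].get(item)
    if a is None or b is None:
        return model['mu']
    # Assumption: ratings live on a 1-5 scale, so predictions are clipped.
    return float(np.clip(model['mu'] + model['U'][a] @ model['M'][b], 1.0, 5.0))


# Made-up toy data in the same column layout as df_ratings.
toy = pd.DataFrame({'UserId': [1, 1, 2, 2, 3, 3],
                    'MovieId': [10, 20, 10, 30, 20, 30],
                    'Rating': [5, 3, 4, 1, 2, 5]})
model = fit_mf(toy)
print(predict_mf(model, 1, 10))
```

Fitting the factors once and reusing them for every prediction avoids retraining per `(user, item)` pair, which a per-call signature such as `rating_model(df,user,item)` would otherwise imply.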

The following snippet executes a test run of the rating function.

In [None]:
example = rating_model(df_ratings,1,1)
print(example)

### **Cross-validation**

The accuracy of the model is computed through 5-fold cross-validation.

The `run_validation` function executes `n`-fold cross-validation for a given rating function and initial data set. The data is split into `n` folds; each fold serves once as the test set while the remaining folds form the training set, and the root-mean-square error (RMSE) is computed for both. The average error over all folds gives an indication of the accuracy of the rating function.
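
The error computed per fold in the implementation below is the square root of the mean squared difference between predicted and actual ratings. Written out, with notation that is assumed here rather than taken from the assignment ($\hat{r}_{ui}$ for the predicted rating of user $u$ for movie $i$, $r_{ui}$ for the actual rating, and $N$ for the number of ratings in the fold):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{(u,i)} \left(\hat{r}_{ui} - r_{ui}\right)^{2}}
$$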

In order to do this the function uses the following parameters:
  * `df` - the dataframe containing the original dataset
  * `function` - the rating function to be tested (passed as a callable)
  * `n` - the number of folds to be generated
  * `seed` - the random seed to be used
    
Additionally it returns the following values:
  * `train_error` - the average error for this function on the training set
  * `test_error` - the average error for this function on the test set

In [6]:
def run_validation(df,function,n=5,seed=17092019):
    err_train = np.zeros(n)
    err_test = np.zeros(n)
    
    kf = KFold(n_splits=n, shuffle=True,random_state=seed)

    for fold, (train_index, test_index) in enumerate(kf.split(df)):
        df_train, df_test = df.iloc[train_index].copy(), df.iloc[test_index].copy()
        
        # predict the ratings of the training set from the training data
        df_train.loc[:,'RatingTrained'] = [function(df=df_train,user=u,item=m) for u, m in zip(df_train['UserId'],df_train['MovieId'])]
        # compute the RMSE on the training set
        df_train.loc[:,'DiffSquared'] = [(t - r)**2 for t, r in zip(df_train['RatingTrained'],df_train['Rating'])]
        err_train[fold] = np.sqrt(np.mean(df_train['DiffSquared']))
        # compute the RMSE on the test set; predictions still use only the training data
        df_test.loc[:,'DiffSquared'] = [(function(df=df_train,user=u,item=m) - r)**2 for u, m, r in zip(df_test['UserId'],df_test['MovieId'],df_test['Rating'])]
        err_test[fold] = np.sqrt(np.mean(df_test['DiffSquared']))
    # average the per-fold errors
    train_error = np.mean(err_train)
    test_error = np.mean(err_test)
    return train_error, test_error
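
To make the fold mechanics concrete, here is a small self-contained sketch of how `KFold` partitions the data and how a per-fold RMSE is averaged. The ten ratings are made up, and the global mean of the training fold is used as a stand-in predictor purely for illustration; it is not the model from this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

# Made-up ratings; each index stands for one (user, movie) row.
ratings = np.array([5, 3, 4, 1, 2, 5, 4, 3, 2, 1], dtype=float)

kf = KFold(n_splits=5, shuffle=True, random_state=17092019)
fold_rmse = []
seen_test_rows = []
for train_index, test_index in kf.split(ratings):
    # Stand-in predictor: the mean rating of the training fold.
    prediction = ratings[train_index].mean()
    fold_rmse.append(np.sqrt(np.mean((ratings[test_index] - prediction) ** 2)))
    seen_test_rows.extend(test_index.tolist())

# Every row lands in exactly one test fold.
print(sorted(seen_test_rows))
mean_rmse = float(np.mean(fold_rmse))
print(mean_rmse)
```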

The following snippet computes the mean train and test errors for the `rating_model` function for various initial parameters.

In [None]:
# TODO: add loop over the initial parameters
train, test = run_validation(df_ratings,rating_model)
print("Mean training error: " + str(train))
print("Mean test error: " + str(test))

From the results above it is evident which configuration provides the best results for the model. TODO ADD EXPLANATION?

Below is an ordered list ranking the different configurations from most accurate (lowest mean test error) to least accurate:

### **Cross-validation Results**

As can be seen by running the `run_validation` function for the various rating functions, the performance of each function differs vastly. This is expected, as each function takes a more detailed look (i.e. considers more factors) at what could influence the rating. TODO

Below is an ordered list ranking the different functions from most accurate (lowest mean test error) to least accurate:
  1. `rating_model` - test error of XX and training error of YY
  2. `rating_user_item` - test error of XX and training error of YY
  3. `rating_TODO` - test error of XX and training error of YY 
  4. `rating_TODO` - test error of XX and training error of YY
  5. `rating_global` - test error of XX and training error of YY