## **Advances in Data Mining**

Stephan van der Putten | (s1528459) | stvdputtenjur@gmail.com  
Theo Baart | s2370328 | s2370328@student.leidenuniv.nl

### **Assignment 1**
This assignment is concerned with implementing formulas and models capable of predicting movie ratings for a set of users. Additionally, the accuracy of the various models is checked. 

#### **Naive Approaches**
This specific notebook handles the implementation of various naive approaches/formulas to the prediction problem.

The following snippet handles all imports.

In [106]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os.path
import sklearn
from sklearn.model_selection import KFold

### **Data Extraction and Preparation**

The `convert_data` function is used to extract the data from the raw data file and store it in a format that is more convenient for us. 

In order to do this the function uses the following parameters:
  * `path` - the (relative) location of the raw dataset (with filetype `.dat`)
  * `cols` - which columns to load from the raw dataset
  * `delim` - the delimiter used in the raw dataset
  * `dt` - the datatype used for the data in the raw dataset
    
Additionally it returns the following value:
  * `path` - the location at which the converted dataset is stored (with filetype `.npy`)

In [104]:
def convert_data(path="datasets/ratings",cols=(0,1,2),delim="::",dt="int"):
    raw = np.genfromtxt(path+'.dat', usecols=cols, delimiter=delim, dtype=dt)
    np.save(path+".npy",raw)
    # verify that the stored file round-trips to exactly the same array
    assert np.array_equal(np.load(path+'.npy'), raw)
    return path
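
The round-trip check performed after saving can be sketched in isolation as follows. This is a minimal illustration on a hypothetical three-row array written to a temporary directory, not the actual ratings file; it also shows why comparing the results of `.all()` is a weaker test than an element-wise comparison.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the parsed ratings (UserId, MovieId, Rating).
raw = np.array([[1, 1193, 5], [1, 661, 3], [2, 1357, 5]])

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "ratings.npy")
    np.save(target, raw)
    restored = np.load(target)

# Element-wise comparison: True only if shape and every value match.
round_trip_ok = np.array_equal(restored, raw)

# Comparing `.all()` results only compares two booleans, so a changed
# value goes unnoticed as long as neither array contains a zero.
corrupted = raw.copy()
corrupted[0, 2] = 1
weak_check_passes = corrupted.all() == raw.all()
strict_check_passes = np.array_equal(corrupted, raw)
```

Here the weak check still passes for the corrupted copy, while `np.array_equal` reports the difference.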

The `prep_data` function is used to load the stored data and transform it into a usable and well defined dataframe. 

In order to do this the function uses the following parameters:
  * `path` - the (relative) location of the converted dataset
    
Additionally it returns the following value:
  * `df_ratings` - a dataframe containing the dataset

In [8]:
def prep_data(path='datasets/ratings'):
    filepath = path+'.npy'
    # convert the raw .dat file first if no converted file exists yet
    if not os.path.isfile(filepath):
        convert_data(path)
    ratings = np.load(filepath)
    df_ratings = pd.DataFrame(ratings)
    colnames = ['UserId', 'MovieId', 'Rating']
    df_ratings.columns = colnames
    return df_ratings

The following snippet is responsible for running the extraction and preparation of the raw data. The data is stored in `df_ratings`.

# prep_data appends the '.npy' extension itself, so the path is given without it
datapath = 'datasets/ratings'
df_ratings = prep_data(datapath)

### **Rating Global**

The `rating_global` function predicts the user's ratings for a certain movie by taking the mean of all the ratings in the dataset. 

In order to do this the function uses the following parameters:
  * `df` - the dataframe containing the dataset
    
Additionally it returns the following value:
  * `mean` - the predicted rating for the requested movie by the requested user

In [11]:
def rating_global(df):
    mean = df['Rating'].mean()
    return mean

The following snippet executes a test run of the rating function.

In [12]:
example = rating_global(df_ratings)
print(example)

3.581564453029317


### **Rating Item**

The `rating_item` function predicts the user's ratings for a certain movie by taking the mean of all the ratings in the dataset for that specific movie.

In order to do this the function uses the following parameters:
  * `item` - the item (movie) for which we want the rating
  * `df` - the dataframe containing the dataset
    
Additionally it returns the following value:
  * `mean` - the predicted rating for the requested movie by the requested user

In [13]:
def rating_item(item,df):
    mean = df[df['MovieId']== item].groupby('MovieId')['Rating'].mean()
    return mean[item]

The following snippet executes a test run of the rating function.

In [14]:
example = rating_item(5,df_ratings)
print(example)

3.0067567567567566
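
A movie without any ratings in `df` has no per-movie mean to return. One possible way to handle this, sketched below, is to fall back to the global mean. The helper name `rating_item_with_fallback` and the toy dataframe are assumptions made for illustration and are not part of the assignment code.

```python
import pandas as pd


def rating_item_with_fallback(item, df):
    """Mean rating of `item`, or the global mean if `item` has no ratings."""
    item_ratings = df.loc[df['MovieId'] == item, 'Rating']
    if item_ratings.empty:
        # unseen movie: fall back to the mean of all ratings
        return df['Rating'].mean()
    return item_ratings.mean()


# Hypothetical toy data using the same column names as df_ratings.
toy = pd.DataFrame({
    'UserId':  [1, 1, 2, 3],
    'MovieId': [10, 20, 10, 20],
    'Rating':  [4, 2, 5, 3],
})

known = rating_item_with_fallback(10, toy)    # mean of 4 and 5
unknown = rating_item_with_fallback(99, toy)  # mean of all four ratings
```

The same pattern would apply to users that do not occur in the training data.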


### **Rating User**

The `rating_user` function predicts the user's ratings for a certain movie by taking the mean of all the ratings in the dataset by the specific user. 

In order to do this the function uses the following parameters:
  * `user` - the user for which we want the rating
  * `df` - the dataframe containing the dataset
    
Additionally it returns the following value:
  * `mean` - the predicted rating for the requested movie by the requested user

In [15]:
def rating_user(user,df):
    mean = df[df['UserId']== user].groupby('UserId')['Rating'].mean()
    return mean[user]

The following snippet executes a test run of the rating function.

In [16]:
example = rating_user(20,df_ratings)
print(example)

4.083333333333333


### **Rating User-Item**

The `rating_user_item` function predicts the user's ratings for a certain movie by applying a linear regression to the outputs of the `rating_user` and `rating_item` functions. 

In order to do this the function uses the following parameters:
  * `user` - the user for which we want the rating
  * `item` - the item (movie) for which we want the rating
  * `df` - the dataframe containing the dataset
  * `alpha` - the weight for the `rating_user` function
  * `beta` - the weight for the `rating_item` function
  * `gamma` - the offset/modifier for the linear regression
    
Additionally it returns the following value:
  * `mean` - the predicted rating for the requested movie by the requested user    
  
Note: `alpha`, `beta` and `gamma` are estimated by the `run_linear_regression` function. 

In [88]:
def rating_user_item(user,item,df,alpha,beta,gamma):
    mean_user = rating_user(user,df)
    mean_item = rating_item(item,df)
    mean = alpha * mean_user + beta * mean_item + gamma

    return mean
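
Because the prediction is a weighted sum plus an offset, it can fall outside the range of valid ratings. The sketch below clips the result, assuming a rating scale from 1 to 5 (the scale is not stated in this notebook); the helper names and the weights are hypothetical values chosen only to show the effect.

```python
import numpy as np


def combine(mean_user, mean_item, alpha, beta, gamma):
    """Linear combination used for the user-item prediction."""
    return alpha * mean_user + beta * mean_item + gamma


def clip_rating(prediction, low=1.0, high=5.0):
    """Restrict a prediction to the assumed valid rating range."""
    return float(np.clip(prediction, low, high))


# Hypothetical weights and means, not the estimated values.
raw_prediction = combine(5.0, 5.0, alpha=0.8, beta=0.9, gamma=-2.4)
clipped = clip_rating(raw_prediction)  # 6.1 is clipped to 5.0
```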

The `generate_store_matrix` function is responsible for generating the input matrix for the `run_linear_regression` function. Each row of this matrix consists of the row index in `df`, the mean rating of the user, the mean rating of the movie and a constant term. For easier (re)use the matrix is stored in a file.

In order to do this the function uses the following parameters:
  * `df` - the dataframe containing the dataset
  * `path` - the path where the matrix should be stored

In [29]:
def generate_store_matrix(df,path='datasets/inputLM'):  
    r_user = df.groupby('UserId')['Rating'].mean()
    r_item = df.groupby('MovieId')['Rating'].mean()
    
    matrix = []
    for index, row in df.iterrows():
        matrix.append([index, r_user[row['UserId']], r_item[row['MovieId']], 1])
    np.save(path+'.npy', matrix)  
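
Iterating over every row with `iterrows` can be slow on a large dataset. As a possible alternative, the same four columns can be built without an explicit loop using `groupby(...).transform('mean')`. The function name `build_feature_matrix` and the toy dataframe are assumptions for illustration; the sketch does not store the result to disk.

```python
import numpy as np
import pandas as pd


def build_feature_matrix(df):
    """Return rows of [index, user mean, movie mean, 1] for every rating."""
    # transform('mean') broadcasts each group mean back onto its rows
    r_user = df.groupby('UserId')['Rating'].transform('mean')
    r_item = df.groupby('MovieId')['Rating'].transform('mean')
    return np.column_stack([
        df.index.to_numpy(),
        r_user.to_numpy(),
        r_item.to_numpy(),
        np.ones(len(df)),
    ])


# Hypothetical toy data using the same column names as df_ratings.
toy = pd.DataFrame({
    'UserId':  [1, 1, 2, 3],
    'MovieId': [10, 20, 10, 20],
    'Rating':  [4, 2, 5, 3],
})
features = build_feature_matrix(toy)
```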

The `run_linear_regression` function estimates the values needed for `alpha`, `beta` and `gamma`.

In order to do this the function uses the following parameters:
  * `df` - the dataframe containing the dataset
  * `path` - the location of the input matrix: if no file exists at this location one will be generated

Additionally it returns the following values:
  * `alpha` - the estimated weight for the `rating_user` function
  * `beta` - the estimated weight for the `rating_item` function
  * `gamma` - the estimated offset/modifier for the linear regression
  
Note: it is assumed that the indexes in `df` and the matrix stored at `path` correspond, i.e. the data in row 0 of the matrix is computed based on the values of the data in row 0 of `df`.

In [101]:
def run_linear_regression(df, path='datasets/inputLM'):
    filepath = path+'.npy'
    if not os.path.isfile(filepath):
        generate_store_matrix(df,path)
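    matrix = np.load(filepath)
    # column 0 holds the row index, so only the user mean, the item mean
    # and the constant term are used as regressors
    features = matrix[:, 1:]
    y = df['Rating']
    S = np.linalg.lstsq(features,y,rcond=None)
    alpha = S[0][0]
    beta = S[0][1]
    gamma = S[0][2]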
    print('Sum Squared Error: '+str(S[1]))
    return alpha, beta, gamma
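
To illustrate what the least-squares step estimates, the sketch below fits `np.linalg.lstsq` to synthetic, noise-free data generated from known weights and checks that these weights are recovered. The data and the weights are hypothetical; the design matrix has one column per regressor (user mean, item mean, constant).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical per-row user and item means on a 1 to 5 scale.
mean_user = rng.uniform(1, 5, n)
mean_item = rng.uniform(1, 5, n)

# Targets generated from known weights, without noise.
true_alpha, true_beta, true_gamma = 0.8, 0.9, -2.4
y = true_alpha * mean_user + true_beta * mean_item + true_gamma

# One column per regressor; the column of ones carries the offset gamma.
design = np.column_stack([mean_user, mean_item, np.ones(n)])
coeffs, residuals, rank, _ = np.linalg.lstsq(design, y, rcond=None)
alpha, beta, gamma = coeffs
```

The order of the returned coefficients follows the order of the columns in the design matrix.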

The following snippet executes a test run of the rating function.

In [110]:
alpha,beta,gamma = run_linear_regression(df_ratings)
print(alpha,beta,gamma)
example = rating_user_item(1,1,df_ratings,alpha,beta,gamma)
print(example)

Sum Squared Error: [838481.41893203]
0.7821285278628085 0.8757397042780259 -2.356197475012722
4.551446108280905


In [23]:
X = df_ratings[:100]

The `split_dataset` function is used to split the dataset into training and test subsets.

In order to do this the function uses the following parameters:
  * `df` - the dataframe containing the dataset to be split
  * `no` - the number of training/test sets to be generated
  * `seed` - the random seed to be used
    
Additionally it returns the following value:
  * `sets` - a nested dictionary containing the training/test sets

In [9]:
def split_dataset(df,no,seed=17092019):
    sets = {}
    np.random.seed(seed)
    sequences = [x%no for x in range(len(df))]
    np.random.shuffle(sequences)
    for n in range(no):
        subset = {}
        train_subset=np.array([x!=n for x in sequences])
        test_subset=np.array([x==n for x in sequences])
        subset['Train'] = df[train_subset]
        subset['Test'] = df[test_subset]
        sets[n] = subset
    return sets
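
The training/test sets can be used to check the accuracy of a model. A minimal sketch for the global mean is given below, assuming the root mean squared error (RMSE) as the accuracy measure. The helper names `rmse` and `evaluate_global_mean` and the toy dataframe are assumptions for illustration, and the fold assignment mirrors `split_dataset` but uses its own random generator.

```python
import numpy as np
import pandas as pd


def rmse(predicted, actual):
    """Root mean squared error between two equally long sequences."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))


def evaluate_global_mean(df, no=5, seed=17092019):
    """Return the test RMSE of the global-mean predictor for each fold."""
    rng = np.random.default_rng(seed)
    folds = np.arange(len(df)) % no   # balanced fold labels 0..no-1
    rng.shuffle(folds)                # random assignment of rows to folds
    scores = []
    for n in range(no):
        train = df[folds != n]
        test = df[folds == n]
        prediction = train['Rating'].mean()  # fitted on training rows only
        scores.append(rmse(np.full(len(test), prediction), test['Rating']))
    return scores


# Hypothetical toy data using the same column names as df_ratings.
toy = pd.DataFrame({
    'UserId':  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    'MovieId': [10, 20, 10, 30, 20, 30, 10, 20, 30, 10],
    'Rating':  [4, 2, 5, 3, 1, 5, 4, 3, 2, 4],
})
scores = evaluate_global_mean(toy, no=5)
```

Computing the mean on the training rows only keeps the test ratings out of the prediction.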

In [27]:

randomState = 21092019
kf = KFold(n_splits=5, shuffle=True,random_state=randomState)
# kf.get_n_splits(X)

print(kf)  

for train_index, test_index in kf.split(X):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X.iloc[train_index], X.iloc[test_index]

KFold(n_splits=5, random_state=21092019, shuffle=True)
TRAIN: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 18 21 22 23 24 25 26
 27 29 30 32 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 54 56
 58 59 61 63 66 67 68 69 70 71 72 73 75 77 78 80 82 83 85 86 87 88 89 90
 91 92 93 95 96 97 98 99] TEST: [17 19 20 28 31 33 52 53 55 57 60 62 64 65 74 76 79 81 84 94]
TRAIN: [ 0  1  2  3  5  7  8  9 10 11 12 13 14 16 17 18 19 20 21 22 23 24 27 28
 29 30 31 33 34 35 36 37 38 39 41 43 46 47 48 50 51 52 53 54 55 57 58 60
 61 62 63 64 65 66 67 68 69 71 72 73 74 75 76 77 78 79 81 82 84 86 87 88
 89 91 92 93 94 96 98 99] TEST: [ 4  6 15 25 26 32 40 42 44 45 49 56 59 70 80 83 85 90 95 97]
TRAIN: [ 0  1  3  4  5  6  7  8 10 11 13 14 15 17 19 20 21 22 23 24 25 26 28 29
 31 32 33 34 35 36 37 38 39 40 42 43 44 45 46 47 48 49 50 51 52 53 54 55
 56 57 58 59 60 62 63 64 65 66 70 72 74 75 76 78 79 80 81 82 83 84 85 87
 90 91 92 93 94 95 97 99] TEST: [ 2  9 12 16 18 27 30 41 61 67 68 69 71 73 77 8

array([2.        , 4.18867925, 4.15408805, 1.        ])