## **Advances in Data Mining**

Stephan van der Putten | s1528459 | stvdputtenjur@gmail.com  
Theo Baart | s2370328 | s2370328@student.leidenuniv.nl

### **Assignment 1**
This assignment is concerned with implementing formulas and models capable of predicting movie ratings for a set of users. Additionally, the accuracy of the various models is checked. 

#### **Naive Approaches**
This specific notebook handles the implementation of various naive approaches/formulas to the prediction problem.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### **Data Extraction**

The `convert_data` function is used to extract the data from the raw data file and store it in a format that is more convenient for us. 

In order to do this the function uses the following parameters:
    * `path` - the (relative) location of the raw dataset
    * `cols` - which columns to load from the raw dataset
    * `delim` - the delimiter used in the raw dataset
    * `dt` - the datatype used for the data in the raw dataset
    
Additionally it returns the following value:
    * `path` - the location at which the converted dataset is stored

In [57]:
def convert_data(path="datasets/ratings.dat",cols=(0,1,2),delim="::",dt="int"):
    raw = np.genfromtxt(path, usecols=cols, delimiter=delim, dtype=dt)
    # swap the extension for .npy; rsplit only cuts at the last dot in the path
    path = path.rsplit('.', 1)[0] + ".npy"
    np.save(path,raw)
    # check that the stored file round-trips to exactly the same array
    assert np.array_equal(np.load(path), raw)
    return path
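
The parse-and-save round trip can be tried without the real dataset. The sketch below is only an illustration: the sample lines are made up, the fourth column is an assumption about the raw file layout, and a temporary directory stands in for `datasets/`.

```python
import io
import os
import tempfile

import numpy as np

# Made-up sample lines; the fourth column is an assumption and is dropped by usecols.
sample = io.StringIO("1::1193::5::0\n1::661::3::0\n2::1193::4::0\n")
raw = np.genfromtxt(sample, usecols=(0, 1, 2), delimiter="::", dtype="int")

# Save to a temporary .npy file and reload it to compare element by element.
with tempfile.TemporaryDirectory() as tmp:
    npy_path = os.path.join(tmp, "ratings.npy")
    np.save(npy_path, raw)
    restored = np.load(npy_path)

print(raw.shape, np.array_equal(restored, raw))
```

`np.array_equal` compares shapes and every element, so it fails if even a single value differs after reloading.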

The `prep_data` function is used to load the stored data and transform it into a usable and well defined dataframe. 

In order to do this the function uses the following parameters:
    * `datapath` - the (relative) location of the converted dataset; if empty, `convert_data` is run first with its default arguments
    
Additionally it returns the following value:
    * `df_ratings` - a dataframe containing the dataset

In [62]:
def prep_data(datapath=''):
    if not datapath:
        datapath = convert_data()
    ratings = np.load(datapath)
    df_ratings = pd.DataFrame(ratings)
    colnames = ['UserId', 'MovieId', 'Rating']
    df_ratings.columns = colnames
    return df_ratings

The `split_dataset` function is used to split the dataset into training and test subsets.

In order to do this the function uses the following parameters:
    * `df` - the dataframe containing the dataset to be split
    * `no` - the number of training/test sets to be generated
    * `seed` - the random seed to be used
    
Additionally it returns the following value:
    * `sets` - a dictionary mapping each fold number to its `'Train'` and `'Test'` dataframes

In [170]:
def split_dataset(df,no,seed=17092019):
    sets = {}
    np.random.seed(seed)
    # give every row a fold label 0..no-1, then shuffle the labels
    sequences = [x%no for x in range(len(df))]
    np.random.shuffle(sequences)
    # show the first few fold labels as a sanity check
    print(sequences[:10])
    sequences = np.array(sequences)
    for n in range(no):
        # fold n is the test set; all other folds form the training set
        subset = {}
        subset['Train'] = df[sequences != n]
        subset['Test'] = df[sequences == n]
        sets[n] = subset
    return sets
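
To illustrate how such folds might be used to check accuracy, here is a minimal sketch of a cross-validated RMSE for the global-mean predictor. The names `make_folds`, `rmse` and `toy` are hypothetical, the ratings are invented, and the shuffling uses NumPy's `default_rng`, so the fold assignment will not match the one produced by `split_dataset`.

```python
import numpy as np
import pandas as pd


def make_folds(df, no, seed=17092019):
    # Hypothetical stand-in for split_dataset: balanced fold labels, shuffled.
    rng = np.random.default_rng(seed)
    labels = rng.permutation(np.arange(len(df)) % no)
    return {n: {'Train': df[labels != n], 'Test': df[labels == n]} for n in range(no)}


def rmse(predictions, targets):
    # Root mean squared error between two equally long sequences.
    diff = np.asarray(predictions, dtype=float) - np.asarray(targets, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


# Invented toy ratings, only to make the sketch runnable.
toy = pd.DataFrame({
    'UserId':  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    'MovieId': [10, 20, 10, 30, 20, 30, 10, 20, 30, 10],
    'Rating':  [5, 3, 4, 4, 2, 5, 3, 1, 4, 5],
})

folds = make_folds(toy, 5)
scores = []
for n, subset in folds.items():
    # Fit on the training part only, then score on the held-out part.
    prediction = subset['Train']['Rating'].mean()
    test_ratings = subset['Test']['Rating']
    scores.append(rmse(np.full(len(test_ratings), prediction), test_ratings))

print(np.mean(scores))
```

Computing the mean on the training part only keeps the test ratings out of the prediction, which is the point of splitting in the first place.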

In [171]:
datapath = 'datasets/ratings.npy'
df_ratings = prep_data(datapath)
split_dataset(df_ratings,5)

[1, 0, 2, 2, 0, 1, 2, 0, 4, 2]


{0: {'Train':          UserId  MovieId  Rating
  0             1     1193       5
  2             1      914       3
  3             1     3408       4
  5             1     1197       3
  6             1     1287       5
  8             1      594       4
  9             1      919       4
  11            1      938       4
  12            1     2398       4
  13            1     2918       4
  14            1     1035       5
  16            1     2687       3
  17            1     2018       4
  18            1     3105       5
  19            1     2797       4
  20            1     2321       3
  21            1      720       3
  22            1     1270       5
  23            1      527       5
  24            1     2340       3
  25            1       48       5
  26            1     1097       4
  27            1     1721       4
  28            1     1545       4
  30            1     2294       4
  31            1     3186       4
  32            1     1566       4
  33    

### **Rating Global**

The `rating_global` function predicts the user's rating for a certain movie by taking the mean of all the ratings in the dataset. 

In order to do this the function uses the following parameters:
    * `df` - the dataframe containing the dataset
    
Additionally it returns the following value:
    * `mean` - the global mean rating, which serves as the prediction for every user and movie

In [58]:
def rating_global(df):
    mean = df['Rating'].mean()
    return mean

In [107]:
example = rating_global(df_ratings)
print(example)

3.581564453029317


### **Rating Item**

The `rating_item` function predicts the user's rating for a certain movie by taking the mean of all the ratings in the dataset for that specific movie.

In order to do this the function uses the following parameters:
    * `item` - the item (movie) for which we want the rating
    * `df` - the dataframe containing the dataset
    
Additionally it returns the following value:
    * `mean` - the predicted rating for the requested movie by the requested user

In [98]:
def rating_item(item,df):
    # mean rating over all rows for this movie (NaN if the movie has no ratings)
    mean = df.loc[df['MovieId'] == item, 'Rating'].mean()
    return mean

In [106]:
example = rating_item(5,df_ratings)
print(example)

3.0067567567567566
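
When predictions are needed for a whole test set, computing all movie means once is cheaper than filtering the dataframe per call. The following is a sketch under assumptions: `predict_item_mean` is a hypothetical helper, the data is invented, and falling back to the global mean for movies missing from the training data is one possible choice rather than something the notebook prescribes.

```python
import pandas as pd


def predict_item_mean(train, test):
    # Hypothetical helper: per-movie mean from train, global mean as fallback.
    item_means = train.groupby('MovieId')['Rating'].mean()
    global_mean = train['Rating'].mean()
    # map() yields NaN for movies absent from train; fillna() applies the fallback.
    return test['MovieId'].map(item_means).fillna(global_mean)


# Invented toy data; movie 30 never appears in the training part.
train = pd.DataFrame({'UserId': [1, 2, 3, 1],
                      'MovieId': [10, 10, 20, 20],
                      'Rating': [4, 2, 5, 3]})
test = pd.DataFrame({'UserId': [2, 3], 'MovieId': [20, 30]})

item_predictions = predict_item_mean(train, test)
print(item_predictions.tolist())  # [4.0, 3.5]
```

The same pattern would apply to per-user means by grouping on `UserId` instead.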


### **Rating User**

The `rating_user` function predicts the user's rating for a certain movie by taking the mean of all the ratings in the dataset by the specific user. 

In order to do this the function uses the following parameters:
    * `user` - the user for which we want the rating
    * `df` - the dataframe containing the dataset
    
Additionally it returns the following value:
    * `mean` - the predicted rating for the requested movie by the requested user

In [104]:
def rating_user(user,df):
    # mean rating over all rows for this user (NaN if the user has no ratings)
    mean = df.loc[df['UserId'] == user, 'Rating'].mean()
    return mean

In [105]:
example = rating_user(20,df_ratings)
print(example)

4.083333333333333


### **Rating User-Item**

The `rating_user_item` function predicts the user's rating for a certain movie as a linear combination of the outputs of the `rating_user` and `rating_item` functions, with the weights obtained by linear regression. 

In order to do this the function uses the following parameters:
    * `user` - the user for which we want the rating
    * `item` - the item (movie) for which we want the rating
    * `df` - the dataframe containing the dataset
    * `alpha` - the weight for the `rating_user` function
    * `beta` - the weight for the `rating_item` function
    * `gamma` - the offset/modifier for the linear regression
    
Additionally it returns the following value:
    * `mean` - the predicted rating for the requested movie by the requested user    
  
Note: `alpha`, `beta` and `gamma` are estimated by the `run_linear_regression` function. 
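
Written out, the prediction is a weighted sum of the two means plus an offset. The symbols $\bar{r}_u$ (mean rating given by user $u$) and $\bar{r}_i$ (mean rating received by movie $i$) are notation assumed here for illustration:

$$\hat{r}_{ui} = \alpha\,\bar{r}_u + \beta\,\bar{r}_i + \gamma$$

As an illustrative calculation with made-up numbers: with all three weights set to 0.5, a user mean of 4.2 and a movie mean of 4.1 give $0.5 \cdot 4.2 + 0.5 \cdot 4.1 + 0.5 = 4.65$.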

In [108]:
def rating_user_item(user,item,df,alpha,beta,gamma):
    mean_user = rating_user(user,df)
    mean_item = rating_item(item,df)
    print(mean_user,mean_item)
    mean = alpha * mean_user + beta * mean_item + gamma
    return mean

The `run_linear_regression` function is meant to estimate the values for `alpha`, `beta` and `gamma`. The body shown here is still a placeholder that returns 0.5 for all three.

In order to do this the function uses the following parameters:
    * `df` - the dataframe containing the dataset

Additionally it returns the following values:
    * `alpha` - the estimated weight for the `rating_user` function
    * `beta` - the estimated weight for the `rating_item` function
    * `gamma` - the estimated offset/modifier for the linear regression

In [109]:
def run_linear_regression(df):
    alpha = 0.5 #replace me 
    beta = 0.5 #replace me
    gamma = 0.5 #replace me
    return alpha, beta, gamma
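
One way the placeholder could be filled in is an ordinary least-squares fit with `np.linalg.lstsq`. This is a sketch under assumptions, not the notebook's implementation: `fit_user_item_weights` and `toy` are hypothetical names, the ratings are invented, and the means are computed on the same rows that are fitted, so the resulting weights are in-sample estimates.

```python
import numpy as np
import pandas as pd


def fit_user_item_weights(df):
    # Hypothetical least-squares fit of rating ~ alpha*user_mean + beta*item_mean + gamma.
    user_mean = df.groupby('UserId')['Rating'].transform('mean')
    item_mean = df.groupby('MovieId')['Rating'].transform('mean')
    # Design matrix: one column per weight, with a column of ones for the offset.
    features = np.column_stack([user_mean, item_mean, np.ones(len(df))])
    targets = df['Rating'].to_numpy(dtype=float)
    coefficients, *_ = np.linalg.lstsq(features, targets, rcond=None)
    alpha, beta, gamma = coefficients
    return alpha, beta, gamma


# Invented toy ratings, only to make the sketch runnable.
toy = pd.DataFrame({
    'UserId':  [1, 1, 1, 2, 2, 3, 3, 4, 4, 4],
    'MovieId': [10, 20, 30, 10, 20, 20, 30, 10, 20, 30],
    'Rating':  [5, 4, 3, 4, 2, 5, 3, 2, 1, 2],
})

alpha, beta, gamma = fit_user_item_weights(toy)
print(alpha, beta, gamma)
```

Because `alpha = beta = 0` with `gamma` equal to the global mean is one of the candidate solutions, the fitted weights can never have a larger squared error on the fitting data than the global-mean predictor.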

In [111]:
alpha,beta,gamma = run_linear_regression(df_ratings)
example = rating_user_item(1,1,df_ratings,alpha,beta,gamma)
print(example)

4.188679245283019 4.146846413095811
4.6677628291894155
