## **Advances in Data Mining**

Stephan van der Putten | (s1528459) | stvdputtenjur@gmail.com  
Theo Baart | s2370328 | s2370328@student.leidenuniv.nl

### **Assignment 1**
This assignment is concerned with implementing formulas and models capable of predicting movie ratings for a set of users. Additionally, the accuracy of the various models is checked. 

#### **Naive Approaches**
This specific notebook handles the implementation of various naive approaches/formulas to the prediction problem.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### **Data Extraction**

The `convert_data` function is used to extract the data from the raw data file and store it in a format that is more convenient for us. 

In order to do this the function uses the following parameters:
    * `path` - the (relative) location of the raw dataset
    * `cols` - which columns to load from the raw dataset
    * `delim` - the delimiter used in the raw dataset
    * `dt` - the datatype used for the data in the raw dataset
    
Additionally it returns the following value:
    * `path` - the location at which the converted dataset is stored

In [2]:
def convert_data(path="datasets/ratings.dat",cols=(0,1,2),delim="::",dt="int"):
    raw = np.genfromtxt(path, usecols=cols, delimiter=delim, dtype=dt)
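    # swap the file extension for .npy (rsplit tolerates extra dots in the path)
    path = path.rsplit('.', 1)[0] + ".npy"
    np.save(path, raw)
    # verify that the saved file round-trips to the same array
    assert np.array_equal(np.load(path), raw)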
    return path
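
To make the parsing step concrete, the sketch below runs the same `np.genfromtxt` call on a few in-memory lines instead of a file. The sample rows and the `UserId::MovieId::Rating::Timestamp` layout are assumptions made for illustration; the actual `ratings.dat` may differ.

```python
import io
import numpy as np

# Hypothetical sample lines in an assumed "UserId::MovieId::Rating::Timestamp" layout.
sample = io.StringIO(
    "1::1193::5::978300760\n"
    "1::661::3::978302109\n"
    "2::1357::5::978298709\n"
)

# Same arguments as the defaults of convert_data:
# keep the first three columns and parse them as integers.
raw = np.genfromtxt(sample, usecols=(0, 1, 2), delimiter="::", dtype="int")

print(raw.shape)  # one row per rating, three columns
print(raw[0])     # first rating as [UserId, MovieId, Rating]
```

Dropping the fourth column via `usecols` keeps the stored `.npy` file limited to the three fields that `prep_data` later names.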

The `prep_data` function is used to load the stored data and transform it into a usable and well defined dataframe. 

In order to do this the function uses the following parameters:
    * `datapath` - the (relative) location of the converted dataset
    
Additionally it returns the following value:
    * `df_ratings` - a dataframe containing the dataset

In [3]:
def prep_data(datapath=''):
    if not datapath:
        datapath = convert_data()
    ratings = np.load(datapath)
    df_ratings = pd.DataFrame(ratings)
    colnames = ['UserId', 'MovieId', 'Rating']
    df_ratings.columns = colnames
    return df_ratings

The `split_dataset` function is used to split the dataset into training and test subsets.

In order to do this the function uses the following parameters:
    * `df` - the dataframe containing the dataset to be split
    * `no` - the number of training/test sets to be generated
    * `seed` - the random seed to be used
    
Additionally it returns the following value:
    * `sets` - a nested dictionary containing the training/test sets

In [4]:
def split_dataset(df,no,seed=17092019):
    sets = {}
    np.random.seed(seed)
    sequences = [x%no for x in range(len(df))]
    np.random.shuffle(sequences)
    for n in range(no):
        subset = {}
        train_subset=np.array([x!=n for x in sequences])
        test_subset=np.array([x==n for x in sequences])
        subset['Train'] = df[train_subset]
        subset['Test'] = df[test_subset]
        sets[n] = subset
    return sets
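
The fold assignment used by `split_dataset` can be illustrated on a small example. This is a minimal sketch, assuming a toy dataframe with made-up ratings and a local `RandomState` generator in place of the global seed, so it does not necessarily reproduce the same shuffle as the function above.

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for df_ratings (hypothetical values).
toy = pd.DataFrame({
    "UserId":  [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "MovieId": [10, 20, 10, 30, 20, 30, 10, 40, 20, 40],
    "Rating":  [5, 3, 4, 2, 1, 4, 3, 5, 2, 4],
})

no = 5
rng = np.random.RandomState(17092019)  # local generator (an assumption of this sketch)

# Give every row a fold label 0..no-1 in round-robin order, then shuffle the labels.
labels = np.arange(len(toy)) % no
rng.shuffle(labels)

for n in range(no):
    test = toy[labels == n]    # rows whose label equals the fold number
    train = toy[labels != n]   # all remaining rows
    print(n, len(train), len(test))
```

Because the labels are assigned round-robin before shuffling, the test subsets are disjoint, differ in size by at most one row, and together cover the whole dataset.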

In [5]:
datapath = 'datasets/ratings.npy'
df_ratings = prep_data(datapath)
split_sets = split_dataset(df_ratings,5)

### **Rating Global**

The `rating_global` function predicts the user's ratings for a certain movie by taking the mean of all the ratings in the dataset. 

In order to do this the function uses the following parameters:
    * `df` - the dataframe containing the dataset
    
Additionally it returns the following value:
    * `mean` - the global mean rating, used as the predicted rating for every user and movie

In [6]:
def rating_global(df):
    mean = df['Rating'].mean()
    return mean

In [7]:
example = rating_global(df_ratings)
print(example)

3.581564453029317
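
One way the accuracy of this predictor might be checked is the root-mean-square error (RMSE) on a held-out test set. The notebook does not state which error measure is used, so RMSE, the helper `rmse`, and the toy `train`/`test` frames below are all assumptions of this sketch.

```python
import numpy as np
import pandas as pd

def rmse(predictions, targets):
    """Root-mean-square error between two equally long sequences."""
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.sqrt(np.mean((predictions - targets) ** 2)))

# Hypothetical train/test subsets in the same column layout as df_ratings.
train = pd.DataFrame({"UserId": [1, 1, 2], "MovieId": [10, 20, 10], "Rating": [5, 3, 4]})
test = pd.DataFrame({"UserId": [2, 3], "MovieId": [20, 10], "Rating": [3, 5]})

# The global predictor assigns the training mean to every test row.
global_mean = train["Rating"].mean()
error = rmse(np.full(len(test), global_mean), test["Rating"])
print(global_mean, error)
```

Computing the mean on the training subset only keeps the test ratings out of the prediction, which is the purpose of the split produced by `split_dataset`.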


### **Rating Item**

The `rating_item` function predicts the user's ratings for a certain movie by taking the mean of all the ratings in the dataset for that specific movie.

In order to do this the function uses the following parameters:
    * `item` - the item (movie) for which we want the rating
    * `df` - the dataframe containing the dataset
    
Additionally it returns the following value:
    * `mean` - the predicted rating for the requested movie by the requested user

In [8]:
def rating_item(item,df):
    mean = df[df['MovieId']== item].groupby('MovieId')['Rating'].mean()
    return mean[item]

In [9]:
example = rating_item(5,df_ratings)
print(example)

3.0067567567567566
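
As written, `rating_item` raises a `KeyError` when the requested movie has no ratings in `df`, which can happen when predicting for a test set from a training subset. A possible way to handle this, sketched here with hypothetical toy data, is to compute all item means at once and fall back to the global mean for unseen movies. The fallback choice is an assumption, not something the notebook specifies.

```python
import pandas as pd

# Hypothetical train/test subsets; movie 99 never appears in the training data.
train = pd.DataFrame({"UserId": [1, 1, 2], "MovieId": [10, 20, 10], "Rating": [5, 3, 4]})
test = pd.DataFrame({"UserId": [2, 3], "MovieId": [20, 99], "Rating": [3, 5]})

# Mean rating per movie, indexed by MovieId.
item_means = train.groupby("MovieId")["Rating"].mean()
global_mean = train["Rating"].mean()

# map() yields NaN for movies absent from item_means; replace those with the global mean.
predictions = test["MovieId"].map(item_means).fillna(global_mean)
print(predictions.tolist())
```

The same pattern would apply to `rating_user` by grouping on `UserId` instead.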


### **Rating User**

The `rating_user` function predicts the user's ratings for a certain movie by taking the mean of all the ratings in the dataset by the specific user. 

In order to do this the function uses the following parameters:
    * `user` - the user for which we want the rating
    * `df` - the dataframe containing the dataset
    
Additionally it returns the following value:
    * `mean` - the predicted rating for the requested movie by the requested user

In [10]:
def rating_user(user,df):
    mean = df[df['UserId']== user].groupby('UserId')['Rating'].mean()
    return mean[user]

In [11]:
example = rating_user(20,df_ratings)
print(example)

4.083333333333333


### **Rating User-Item**

The `rating_user_item` function predicts the user's ratings for a certain movie by applying a linear regression to the outputs of the `rating_user` and `rating_item` functions. 

In order to do this the function uses the following parameters:
    * `user` - the user for which we want the rating
    * `item` - the item (movie) for which we want the rating
    * `df` - the dataframe containing the dataset
    * `alpha` - the weight for the `rating_user` function
    * `beta` - the weight for the `rating_item` function
    * `gamma` - the offset/modifier for the linear regression
    
Additionally it returns the following value:
    * `mean` - the predicted rating for the requested movie by the requested user    
  
Note: `alpha`, `beta` and `gamma` are estimated by the `run_linear_regression` function. 
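
Written out, the prediction described above takes the following form, where $\bar{r}_u$ denotes the output of `rating_user` and $\bar{r}_i$ the output of `rating_item` (the symbols are notation introduced here for illustration):

$$
\hat{r}_{u,i} = \alpha \, \bar{r}_u + \beta \, \bar{r}_i + \gamma
$$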

In [12]:
def rating_user_item(user,item,df,alpha,beta,gamma):
    mean_user = rating_user(user,df)
    mean_item = rating_item(item,df)
    print(mean_user,mean_item)
    mean = alpha * mean_user + beta * mean_item + gamma
    return mean

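The `run_linear_regression` function is intended to estimate the values needed for `alpha`, `beta` and `gamma`. In its current form it only builds and saves the input matrix for the regression and returns placeholder values of 0.5.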

In order to do this the function uses the following parameters:

    * `df` - the dataframe containing the dataset

Additionally it returns the following values:
    * `alpha` - the estimated weight for the `rating_user` function
    * `beta` - the estimated weight for the `rating_item` function
    * `gamma` - the estimated offset/modifier for the linear regression

In [13]:
r_user = df_ratings.groupby('UserId')['Rating'].mean()
r_item = df_ratings.groupby('MovieId')['Rating'].mean().tolist()
# np.vstack((r_user,r_item))
print(r_user)

UserId
1       4.188679
2       3.713178
3       3.901961
4       4.190476
5       3.146465
6       3.901408
7       4.322581
8       3.884892
9       3.735849
10      4.114713
11      3.277372
12      3.826087
13      3.388889
14      3.320000
15      3.323383
16      3.028571
17      4.075829
18      3.649180
19      3.572549
20      4.083333
21      2.909091
22      3.067340
23      3.315789
24      3.948529
25      3.741176
26      2.960000
27      4.171429
28      3.757009
29      3.583333
30      3.488372
          ...   
6011    3.969543
6012    3.000000
6013    4.080645
6014    3.886792
6015    3.754386
6016    3.189219
6017    3.515152
6018    3.591195
6019    3.460674
6020    4.395349
6021    3.500000
6022    3.854167
6023    3.687075
6024    4.126316
6025    3.302583
6026    3.617284
6027    4.250000
6028    3.446809
6029    3.903226
6030    3.939130
6031    3.666667
6032    4.134615
6033    3.850000
6034    4.095238
6035    2.610714
6036    3.302928
6037    3.717822
6038   

In [23]:
def run_linear_regression(df):
    r_user = df.groupby('UserId')['Rating'].mean()
    r_item = df.groupby('MovieId')['Rating'].mean()

    # one row per rating: [user mean, item mean, 1 (intercept term)]
    matrix = np.column_stack((df['UserId'].map(r_user),
                              df['MovieId'].map(r_item),
                              np.ones(len(df))))
    np.save('inputLM.npy', matrix)

    # placeholder coefficients; the regression itself is not implemented yet
    alpha = 0.5
    beta = 0.5
    gamma = 0.5
    return alpha, beta, gamma
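
The placeholder coefficients could be replaced by an ordinary least-squares fit on the saved matrix. The following is a minimal sketch of that idea, assuming `np.linalg.lstsq` as the solver and a toy dataframe with made-up ratings; it is not necessarily the estimation method intended for the assignment.

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for df_ratings (hypothetical values).
toy = pd.DataFrame({
    "UserId":  [1, 1, 2, 2, 3, 3],
    "MovieId": [10, 20, 10, 30, 20, 30],
    "Rating":  [5, 3, 4, 2, 1, 4],
})

# Per-row features: the rating user's mean, the rated movie's mean, and a constant 1.
user_mean = toy.groupby("UserId")["Rating"].transform("mean")
item_mean = toy.groupby("MovieId")["Rating"].transform("mean")
X = np.column_stack([user_mean, item_mean, np.ones(len(toy))])
y = toy["Rating"].to_numpy(dtype=float)

# Solve min ||X @ coef - y||^2; coef holds (alpha, beta, gamma).
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
alpha, beta, gamma = coef
fitted = X @ coef
print(alpha, beta, gamma)
```

Since the column of ones lets the model reproduce the global mean, the fitted predictions should have a squared error on the training data no larger than that of `rating_global`.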

In [24]:
alpha,beta,gamma = run_linear_regression(df_ratings)
example = rating_user_item(1,1,df_ratings,alpha,beta,gamma)
#print(example)

4.188679245283019 4.146846413095811


In [25]:
matrix = np.load('inputLM.npy')    

In [26]:
matrix.shape

(1000209, 3)

In [29]:
matrix[0]

array([4.18867925, 4.39072464, 1.        ])