# Assignment 1 Recommender Systems Naive Approaches

Group 27

Our task was to create recommender systems using the 1M dataset from Movielens. We implemented four naive approaches and matrix facotrization.

We implemented, in Python, several recommendation algorithms and estimate their accuracy with the Root Mean Squared Error (RMSE), and the Mean  Absolute  Error (MAE).  In addition, to  make  sure  that  your  results  are  reliable  use  5-fold  cross-validation.  The average error of these five models (measured on the 5 test sets) 
is a reliable estimate of the accuracy of the (hypothetical) final model that is trained on the whole data set. 5-fold cross-validation is described in  more detail later.

In [2]:
import numpy as np
from sklearn import linear_model
import time

## The Data

The data contains 1,000,209 anonymous ratings of approximately 3,900 movies 
made by 6,040 MovieLens users who joined MovieLens in 2000.
format is UserID::MovieID::Rating::Timestamp


In [3]:
#ratings = pd.read_csv('ratings.dat')

ratings = np.loadtxt('ratings.dat', delimiter="::")

#format is UserID::MovieID::Rating::Timestamp
users = np.unique(ratings[:,0])
items = np.unique(ratings[:,1])	

## Naive Approaches

We first use four naive approaches to predict the rating a particular user will give a movie. Using the four equations from the slides from class

Global Average: Is very simplistic approach, it takes the average over all (user, item) pairs.
\begin{equation*}
R_{global}(User, Item) = mean(all\;ratings)
\end{equation*}
Average over an Item: Takes the average over all (user, item) pairs but only for one item.
\begin{equation*}
R_{item}(User, Item) = mean(all\;ratings\;for\;Item)
\end{equation*}
Average over an User: Takes the average over all (user, item) pairs but only for one user.
\begin{equation*}
R_{user}(User, Item) = mean(all\;ratings\;for\;User)
\end{equation*}
Linear Regression: This approaches combines the user and item means to create a better prediction. It assigns different weights for both the user and the item mean to account for uneven significance between them. We will call these weight $\alpha$ and $\beta$. Then accounts for some overarching bias that we will detect using the global average and account for it in $\gamma$. Estimate the parameters $\alpha$, $\beta$, $\gamma$ with linear regression. We are trying to approximate the rating column as a linear combination of the other two columns, users and item and a constant $\gamma$

\begin{equation*}
R_{user-item}(User, Item) = \alpha * R_{user}(User, Item) + \beta * R_{item} + \gamma
\end{equation*}

We create seperate functions for each of the first three approaches and incoporated the linear regression later in the code for ease.


In [4]:
def item_avg(ratings,item):
    """The mean of all ratings of an item. This could work 
    well in the sense that if an users rate particular 
    item generally the same
    Input: The array of data with the ratings"""
    
    i_ratings = ratings[ratings[:,1] == item]
    R_item = np.mean(i_ratings[:,2])
    
    return R_item

In [5]:
def global_avg(ratings):
    
    """Mean over all ratings from all users. Not expecting 
    a accurate prediction but is the simpliest approach.
    Input: The array of data with the ratings"""
    
    R_global = np.mean(ratings[:,2])
    
    return R_global

In [6]:
def user_avg(ratings,user):
    
    """Find mean of all ratings of a user. This method works better
    for users who rate many items rather then a few.
    Input: ratings array and user number"""
    
    u_ratings = ratings[ratings[:,0] == user]
    R_user = np.mean(u_ratings[:,2])
    
    return R_user

### Five Fold Cross-Validation

First we defined memory to store the results.
We defined functions to calculate the errors on for our different approaches (mean_err()) and (global_err()).
Then we use the modulo operator to split the data into five equal folds and np.random to shuffle each sequence of data.
Four folds will make up the training data and one fold will be the test set.
The training set we use to train the algorithm and to predict new ratings for users
The test set is what we apply our trained algorithm to and find our accuracy.
We iterate through the procedure five times so that every set has been the test set one time.
The we treat the test set as new ratings provided by the users.
Then we calculate the accuracy by taking the mean over the five iterations.
We calculate the accuracy using the RMSE and the MAE.
We account for gaps in the data with a conditional where we check for a rating and if there is no rating use the global mean. 

In [12]:
def five_fold_CV(ratings):
    
    """
    Test the models with 5 fold cross validation
    Input: ratings 
    """

    np.random.seed(17) # For reproducibility

    # split data into 5 train and test folds
    folds=5

    # allocate memory for results:
    error_train_R_user_RSME = np.zeros(folds)
    error_train_R_user_MAE = np.zeros(folds)
    error_train_R_item_RSME = np.zeros(folds)
    error_train_R_item_MAE = np.zeros(folds)
    error_train_R_global_RSME = np.zeros(folds)
    error_train_R_global_MAE = np.zeros(folds)
    error_train_LR_RSME = np.zeros(folds)
    error_train_LR_MAE = np.zeros(folds)
    error_test_R_user_RSME = np.zeros(folds)
    error_test_R_user_MAE = np.zeros(folds)
    error_test_R_item_RSME = np.zeros(folds)
    error_test_R_item_MAE = np.zeros(folds)
    error_test_R_global_RSME = np.zeros(folds)
    error_test_R_global_MAE = np.zeros(folds)
    error_test_LR_RSME = np.zeros(folds) 
    error_test_LR_MAE = np.zeros(folds) 
  
    
    alpha = np.zeros(folds)
    beta = np.zeros(folds)
    gamma = np.zeros(folds)

    seqs = [x%folds for x in range(len(ratings))]
    np.random.shuffle(seqs) 
    
    
    #Define the functions to calculate the errors later
    def mean_err(test, test_predict_R, train, train_predict_R):
        
        """
        Error for the mean rating of a user and a item
        Mean Absolute Error mean(|value - expected value|)
        Inputs: Test, test predictions for item or user, Train, train predictions for user or item
        RMSE Error on the  sqrt(1/n(sum(predicted - true)^2)))
        """
        error_test_R_MAE = np.mean(np.abs(test[:,2] - test_predict_R[:,1]))
        error_train_R_MAE = np.mean(np.abs(train[:,2] - train_predict_R[:,1]))
        
        #Use np.power in order to brodacasting two arrays 
        error_test_RSME = np.sqrt(np.mean(np.power(test[:,2] - test_predict_R[:,1],2)))
        error_train_RSME = np.sqrt(np.mean(np.power(train[:,2] - train_predict_R[:,1],2)))
        
        return error_test_RSME, error_train_RSME, error_test_R_MAE, error_train_R_MAE
        
    def global_mean_err():
        
        """
        Error for the global mean rating
        Error on the  sqrt(mean(value -  expected value)^2)
        Mean Absolute Error mean(|value - expected value|)
        """
        
        error_test_R_global_RSME[fold] = np.sqrt(np.mean((test[:,2] - global_mean_train)**2))
        error_test_R_global_MAE[fold] = np.mean(np.abs(test[:,2] - global_mean_train))
        error_train_R_global_RSME[fold] = np.sqrt(np.mean((train[:,2] - global_mean_train)**2))
        error_train_R_global_MAE[fold] = np.mean(np.abs(train[:,2] - global_mean_train))
        

    def linear_regression(train):
        """
        This function performs the linear regression
        Inputs: The training data
        """
        y = train[:,2]
        
        # make the x_1,x_2 matrices
        x_1 = train_predict_R_user[:,1]
        x_2 = train_predict_R_item[:,1]
        
        #Section taken from sklearn.linear_model.LinearRegression
        #http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares
        X = np.asarray([x_1,x_2]).T
        reg = linear_model.LinearRegression()
        reg.fit(X,y)
        alpha, beta = reg.coef_
        gamma = reg.intercept_

        return alpha,beta,gamma

    start_time = time.time()
    
    #for each fold:
    for fold in range(folds):

        train_chunk = np.array([x!=fold for x in seqs])
        test_chunk = np.array([x==fold for x in seqs])
        train = ratings[train_chunk]
        test = ratings[test_chunk]

        # A 200042x2 matrix that holds all test/train users/items and their prediction ratings
        test_predict_R_user = np.array([test[:,0],np.empty(len(test[:,0]))]).T
        test_predict_R_item = np.array([test[:,1],np.empty(len(test[:,1]))]).T
        train_predict_R_user = np.array([train[:,0],np.empty(len(train[:,0]))]).T
        train_predict_R_item = np.array([train[:,1],np.empty(len(train[:,1]))]).T

        global_mean_train = global_avg(train)
        
        training_itms = np.unique(train[:,1]) 
        training_usrs = np.unique(train[:,0]) 
        
        for user in users:
            
            if user in training_usrs:
                
                # user and item mean on the train set
                single_user_mean = user_avg(train,user) 
                
                # make a matrix with test predictions for calculation of error
                test_predict_R_user[test_predict_R_user[:,0] == user, 1] = single_user_mean
                train_predict_R_user[train_predict_R_user[:,0] == user, 1] = single_user_mean
                
            else: 
                
                # use the global mean is there is no user rating
                test_predict_R_user[test_predict_R_user[:,0] == user, 1] = global_mean_train
                
        for item in items:
            
            if item in training_itms: 
                
                single_item_mean = item_avg(train,item)
                
                # make a matrix with test predictions for the error calculation 
                test_predict_R_item[test_predict_R_item[:,0] == item, 1] = single_item_mean
                train_predict_R_item[train_predict_R_item[:,0] == item, 1] = single_item_mean
                
            else: 
                # use the global mean if there is no item rating
                test_predict_R_item[test_predict_R_item[:,0] == item, 1] = global_mean_train

        # Calculate the test and train RSME and MAE for item and user means by using mean_err() 
        error_test_R_user_RSME[fold], error_test_R_user_MAE[fold], error_train_R_user_RSME[fold], error_train_R_user_MAE[fold] = \
        mean_err(test, test_predict_R_user, train, train_predict_R_user)
        
        error_test_R_item_RSME[fold], error_test_R_item_MAE[fold], error_train_R_item_RSME[fold], error_train_R_item_MAE[fold] = \
        mean_err(test, test_predict_R_item, train, train_predict_R_item)
        
        # Calculate the errors RSME and MAE using global_mean_err()
        global_mean_err()

        #The Linear Regression
        alpha[fold], beta[fold], gamma[fold] = linear_regression(train)
        LR_predict_train = alpha[fold] * train_predict_R_user[:,1] + beta[fold] * train_predict_R_item[:,1] + gamma[fold]
        LR_predict_test = alpha[fold] * test_predict_R_user[:,1] + beta[fold] * test_predict_R_item[:,1] + gamma[fold]
        
        #We are trying to minimize the RMSE
        error_train_LR_RSME[fold] = np.sqrt(np.mean((train[:,2] - LR_predict_train)**2))
        error_test_LR_RSME[fold] = np.sqrt(np.mean((test[:,2] - LR_predict_test )**2))

        error_train_LR_MAE[fold] = np.mean(np.abs((train[:,2] - LR_predict_train)**2))
        error_test_LR_MAE[fold] = np.mean(np.abs((test[:,2] - LR_predict_test )**2))
    
    #Results
    print("Total runtime:  %s seconds ---" % (time.time() - start_time))
    
    print('Train results:')
    print('User mean RMSE: %s'%np.mean(error_train_R_user_RSME))
    print('User mean MAE: %s'%np.mean(error_train_R_user_MAE))
    print('Item mean RMSE: %s'%np.mean(error_train_R_item_RSME))
    print('Item mean MAE: %s'%np.mean(error_train_R_item_MAE))
    print('Global mean RMSE: %s'%np.mean(error_train_R_global_RSME))
    print('Global mean MAE: %s'%np.mean(error_train_R_global_MAE))
    print('\n')


    print('Test results:')
    print('User mean RMSE: %s'%np.mean(error_test_R_user_RSME))
    print('User mean MAE: %s'%np.mean(error_test_R_user_MAE))
    print('Item mean RMSE: %s'%np.mean(error_test_R_item_RSME))
    print('Item mean MAE: %s'%np.mean(error_test_R_item_MAE))
    print('Global mean RMSE: %s'%np.mean(error_test_R_global_RSME))
    print('Global mean MAE: %s'%np.mean(error_test_R_global_MAE))
    print('\n')

    print('Linear Regression results:')
    print('alpha: {}'.format(np.mean(alpha)))
    print('beta: {}'.format(np.mean(beta)))
    print('gamma: {}'.format(np.mean(gamma)))
    print('Train set:')
    print('RMSE: %s'%np.mean(error_train_LR_RSME))
    print('MAE: %s'%np.mean(error_train_LR_MAE))
    print('Test set:')
    print('RMSE: %s'%np.mean(error_test_LR_RSME))
    print('MAE: %s'%np.mean(error_test_LR_MAE))
    

five_fold_CV(ratings)

Total runtime:  271.8847801685333 seconds ---
Train results:
User mean RMSE: 0.82896757447
User mean MAE: 0.822737078586
Item mean RMSE: 0.78240409533
Item mean MAE: 0.778338232748
Global mean RMSE: 1.11710112744
Global mean MAE: 0.933860752823


Test results:
User mean RMSE: 1.03549268243
User mean MAE: 1.02767187417
Item mean RMSE: 0.979458646244
Item mean MAE: 0.974217433147
Global mean RMSE: 1.11710198981
Global mean MAE: 0.93386130485


Linear Regression results:
alpha: 0.7818509582241939
beta: 0.8748595533670936
gamma: -2.3520510519663214
Train set:
RMSE: 0.91462036644
MAE: 0.836530516689
Test set:
RMSE: 0.924407363516
MAE: 0.854531144446


# Results

The Results were as follows
Total runtime:  313.4345362186432 seconds or about 5 minutes
Train results:
User mean RMSE:...... 0.82896757447
User mean MAE: ...... 0.822737078586
Item mean RMSE:...... 0.78240409533
Item mean MAE: ...... 0.778338232748
Global mean RMSE:.... 1.11710112744
Global mean MAE:..... 0.933860752823


Test results:
User mean RMSE:...... 1.03549268243
User mean MAE:....... 1.02767187417
Item mean RMSE:...... 0.979458646244
Item mean MAE:....... 0.974217433147
Global mean RMSE:.... 1.11710198981
Global mean MAE:..... 0.93386130485

Linear Regression results:
alpha:.............. 0.7818509582241939
beta:............... 0.8748595533670936
gamma:.............. -2.3520510519663214

Train set:
RMSE:............... 0.91462036644
MAE:................ 0.836530516689
Test set:
RMSE:............... 0.924407363516
MAE:................ 0.854531144446

The RMSE is highest for the simplest approach, using the Global Mean and, following a decreasing trend, is lowest for the linear regression approach. This makes sense becuase the linear regression model makes an effort to minimize when fitting the data so that results in a lower RSME and slightly higer MAE becuase the model was not designed to minimize the overall amount of error which is what the MAE measures. It is also pretty clear that the User mean is a much worse way model then the item mean so the user bias is less significant. Since the MAE is lower then the RSME in every case there are probably abnormally large errors for a few data pairs that get minimized when finding the average error. 

We expected the test error to be higher then the training error due to the fact that the test set has data that the algorithm has not encountered yet. So this is not unusual

Our results match those of http://mymedialite.net/examples/datasets.html 

# Computational Costs
In terms of memory and time required for this approach it can be broken down into the four approaches

User Mean: The time scales with the number of ratings O(N) with N being the number of ratings. The memory required is O(1) for a single calculation for one user so the total memory O(U) with U being the number of users.

Item Mean: It is similar to the user mean in that for the total time it is O(N) and for the memory required it is O(1) for each calculation but total is O(I) where I is the number of items or in our case movies.

Global Mean: The memory required is O(1) for the single calculatoin. The time is again O(N).

Linear Regression: 
This requires O(R) memory and O(R) time
