# AiDM assignment 1
## Recommender system --- 4 naive approaches


### Read data
We load the ratings data into a matrix which has 4 columns. The first column gives the user id. The second column gives the movie id. The third column gives the rating which can only be a integer from 1 to 5. The fourth column gives the timestamp in a unit of second.

In [64]:
import numpy as np
data_origin = np.genfromtxt('ml-1m/ratings.dat', delimiter= '::')

## First approach
We define a function to calculate the mean of all ratings as a global rating.

In [65]:
def mean_global(data):
    R_global = np.mean(data[:,2])
    return R_global

## Second approach
We define a funtion to calculate the mean rating of every user.

In [66]:
def mean_user(data):
    user_list = np.unique(data[:,0]) # the user IDs will be automatically sorted with np.unique
    R_user = np.zeros((len(user_list),2))
    for i in range(len(user_list)):
        # The rating data will be selected in order to calculte the average based on True/ Flase markers.
        marker = data[:,0] == user_list[i]
        R_user[i,0] = user_list[i]
        R_user[i,1] = np.mean(data[marker][:,2])
    return R_user # This array contains 2 columns which are user ID and mean ratings, repectively.

## Third approach
We define a function to calculate the mean rating of every movie.

In [67]:
def mean_movie(data):
    movie_list = np.unique(data[:,1]) # the movie IDs will be automatically sorted with np.unique
    R_movie = np.zeros((len(movie_list),2))
    for i in range(len(movie_list)):
        # The rating data will be selected in order to calculte the average based on True/ Flase markers.
        marker = data[:,1] == movie_list[i]  
        R_movie[i,0] = movie_list[i]
        R_movie[i,1] = np.mean(data[marker][:,2])
    return R_movie # This array contains 2 columns which are movie ID and mean ratings, respectively.

## Fourth approach
We define a function to calculate the estimated rating as a combination of mean of users and mean of movies

In [68]:
def ls(data):
    INPUT=np.ones((len(data),3))
    R_user = mean_user(data) # call the function mean_user
    R_movie = mean_movie(data) # call the function mean_movie
    for i in range(len(data)):
        INPUT[i,0] = R_user[np.where(R_user[:,0]==data[i,0])[0][0],1]
        INPUT[i,1] = R_movie[np.where(R_movie[:,0]==data[i,1])[0][0],1]
    ls_para = np.linalg.lstsq(INPUT, data[:,2],rcond=None)
    # This tuple mainly will provide us 2 informations, one is the coeffients and the other is the sum squared error.
    return R_user, R_movie, ls_para

## Apply four approaches to the entire data
We apply the approaches to the entire data and have the first look of the averages and the results of least square method.

In [69]:
avg_global_origin = mean_global(data_origin)
ls_result_origin = ls(data_origin)
print('The global average of ratings is ', avg_global_origin)
print('\n The average rating for each user with the user ID in the first column is \n', ls_result_origin[0])
print('\n The average rating for each movie with the movie ID in the first column is \n',ls_result_origin[1])
print('\n The alpha, beta, gamma values given by least square algorithm are \n', ls_result_origin[2][1])
print('Besides, the sum square error of least square method is ', ls_result_origin[2][1][0])

The global average of ratings is  3.581564453029317

 The average rating for each user with the user ID in the first column is 
 [[1.00000000e+00 4.18867925e+00]
 [2.00000000e+00 3.71317829e+00]
 [3.00000000e+00 3.90196078e+00]
 ...
 [6.03800000e+03 3.80000000e+00]
 [6.03900000e+03 3.87804878e+00]
 [6.04000000e+03 3.57771261e+00]]

 The average rating for each movie with the movie ID in the first column is 
 [[1.00000000e+00 4.14684641e+00]
 [2.00000000e+00 3.20114123e+00]
 [3.00000000e+00 3.01673640e+00]
 ...
 [3.95000000e+03 3.66666667e+00]
 [3.95100000e+03 3.90000000e+00]
 [3.95200000e+03 3.78092784e+00]]

 The alpha, beta, gamma values given by least square algorithm are 
 [838481.41893202]
Besides, the sum square error of least square method is  838481.4189320239


## N-fold Cross Validation
We split your data into N parts and then develop N models on all combinations of (N-1) parts. At the end, we test our N models on test set and take the average of error over the number of test tests.

In [70]:
def fold_4naive(data_all, N):
    np.random.shuffle(data_all) # Note this only shuffles row order, individual user_id/movie_id/rate entries are the same
    errors = np.zeros(4) # prepare a numpy array to return 4 errors made by 4 naive approaches 
    # The order will be: 'global', 'user', 'movie', 'least square method'
    for i in range(N):
        # drag out 1/N part of the entire data as a test set
        data_test = data_all[int(i/N*len(data_all)):int(((i+1)/N)*len(data_all))]
        # delete the test set from entire data and it gives us a training set as remaining.
        data_folder = np.delete(data_all,np.s_[int(i/N*len(data_all)):int(((i+1)/N)*len(data_all))],0)
        R_global = mean_global(data_folder)
        # The final error provided would be sum(real rating-estimated rating)**2/ (number of ratings)
        # and then take the average over the number of test sets
        errors[0] += sum((data_test[:,2]-R_global)**2)/len(data_test)/N
        '''
        Then we do the other 3 naive approaches
        '''
        # first of all, we input the training folders and return the models of 3 approaches.
        ls_result = ls(data_folder)
        R_user = ls_result[0]
        R_movie = ls_result[1]
        ls_para = ls_result[2]
        # prepare a numpy zeros array which contains the estimated ratings based on users and movies
        # in the first and second column, respectively
        rating_est = np.zeros((len(data_test),3))
        for j in range(len(data_test)):
            # In case any user in test set has not showed in the training set, 
            # we would like to use global average as an estimated rating.
            try:
                rating_est[j,0] = R_user[int(np.where(R_user[:,0]== data_test[j,0])[0][0]),1]
            except:
                rating_est[j,0] = R_global
            # In case any movie in test set has not showed in the training set, 
            # we would like to use global average as an estimated rating.
            try:
                rating_est[j,1] = R_movie[int(np.where(R_movie[:,0]== data_test[j,1])[0][0]),1]
            except:
                rating_est[j,1] = R_global
        rating_est[:,2] = ls_para[0][0]*rating_est[:,0] + ls_para[0][1]*rating_est[:,1] + ls_para[0][2]
        errors[1] += sum((data_test[:,2]-rating_est[:,0])**2)/len(data_test)/N
        errors[2] += sum((data_test[:,2]-rating_est[:,1])**2)/len(data_test)/N
        errors[3] += sum((data_test[:,2]-rating_est[:,2])**2)/len(data_test)/N
    return errors

### Result
In this part, we calculate the squared error per rating, which is to divide the sum squared error by the length of the test set and then take the average over the number of test sets.

In [71]:
result = fold_4naive(data_origin,5)
print('error for global average method is ', result[0])
print('error for user average method is ', result[1])
print('error for movie average method is ', result[2])
print('error for least square method is', result[3])

error for global average method is  1.2479161841688666
error for user average method is  1.0723646943791647
error for movie average method is  0.9589793230954573
error for least square method is 0.8543680526172086


## Conclusion
In this report, first of all, we managed the entire data set to show the global average rating and average rating of users and so on.
Secondly, we applied N fold cross validation to each naive approach on the data set with N = 5. 
As we can see from the results, the global method's preformence is the worst and the combined linear regression method provides the best estimation among 4 naive approaches.