# Naive approaches
 *Authors : Pantelis Matsakidis & Asteris Barakopoulos*
 
 *Group 21*

## 1. Data and general goal
The goal of this assignment is to implement and compare the performance of different recommender system methods. More specifically, the interest lies in estimating the rating that a user would give to a specific movie. The data consists of 1,000,209 anonymous ratings from users who joined MovieLens in 2000. Each record has three columns: the user id, the movie id and the rating, which is a whole number on a 5-star scale. There are 6,040 users and 3,952 movies in total, and each user has at least 20 ratings. In this notebook, four approaches are implemented on this data:

- $R_{global}(User,Item)=mean(\text{all ratings})$
- $R_{item}(User,Item)=mean(\text{all ratings for item})$
- $R_{user}(User,Item)=mean(\text{all ratings for user})$
- $R_{user-item}(User,Item)= \alpha*R_{user}(User,Item) + \beta*R_{item}(User,Item) + \gamma$

Here $R(User,Item)$ stands for the estimated rating that an active user would give to a certain item. To assess and compare these approaches, 5-fold cross-validation is used to obtain a more honest estimate of their training and test errors. Every approach is evaluated with the root mean squared error (RMSE), given by:

$$RMSE=\sqrt{\frac{1}{n}\displaystyle\sum_{i=1}^{n} (\text{predicted}_i - \text{true}_i)^2}$$

These approaches are called naive because of their simplicity. The first three in particular use information from one source only, either the users or the movies (items). So, even though they work to a certain extent, they have some important limitations, which are discussed in the following sections.
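
As a minimal sketch of the evaluation setup described above, the snippet below computes the RMSE on a few made-up ratings and assigns rows to five folds in the same spirit as the code cells further down. The helper names `rmse` and `fold_labels` and the toy numbers are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def rmse(predicted, true):
    """Root mean squared error between two equally long sequences."""
    predicted = np.asarray(predicted, dtype=float)
    true = np.asarray(true, dtype=float)
    return np.sqrt(np.mean((predicted - true) ** 2))

def fold_labels(n_rows, n_folds=5, seed=17):
    """Give every row a fold label in 0..n_folds-1, with folds of (almost) equal size."""
    labels = np.arange(n_rows) % n_folds
    rng = np.random.RandomState(seed)
    rng.shuffle(labels)
    return labels

# Toy ratings (hypothetical, not taken from MovieLens)
true = np.array([5, 3, 4, 1])
predicted = np.array([4, 3, 5, 3])
print(rmse(predicted, true))  # sqrt((1 + 0 + 1 + 4) / 4) = sqrt(1.5)

# Each fold is used once as the test set; the remaining rows form the training set
labels = fold_labels(10)
for fold in range(5):
    test_mask = labels == fold
    train_mask = ~test_mask
    print(fold, train_mask.sum(), test_mask.sum())
```

With 10 rows and 5 folds, every fold holds 2 test rows and 8 training rows, so each rating is used for testing exactly once.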

## 2. Implementation
### 2.1 $R_{global}(User,Item)$

We start with the most trivial of the four approaches. It estimates the rating that a user would give to a movie as the mean of all ratings available in the data. So, no matter which user or movie the estimate is made for, the result is always the same mean rating. This is of course very limited, as no differentiation between users or movies takes place. Recommendations based on this approach are effectively no better than random, since the estimated rating is the same for every movie. The results obtained from the 5-fold cross-validation are presented in the following table:

| Training error | Test error   |
|------|------|
|   1.117  | 1.117|
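
One way to read these numbers: since the prediction is a single constant equal to the training mean, the training RMSE is exactly the (population) standard deviation of the training ratings. The sketch below checks this on a handful of made-up ratings; `ratings_toy` is a hypothetical array used only for illustration.

```python
import numpy as np

# Hypothetical ratings, used only to illustrate the identity
ratings_toy = np.array([1, 2, 3, 4, 5, 5, 4, 4], dtype=float)

# Global-mean model: one constant prediction for every (user, movie) pair
r_global = ratings_toy.mean()

# RMSE of the constant prediction on the data it was computed from
rmse_train = np.sqrt(np.mean((ratings_toy - r_global) ** 2))

# np.std uses ddof=0 by default, i.e. the population standard deviation
print(r_global, rmse_train, np.std(ratings_toy))
```

This would be consistent with the training and test errors being almost identical here: both essentially measure the spread of the ratings around the overall mean.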

In [None]:

###Modules
import numpy as np 
import matplotlib.pyplot as plt

###Read data
ratings = np.genfromtxt("ratings.dat",names = None,dtype = None,delimiter = "::")



###-----------------------------------Rglobal - mean of all ratings---------------------------------------------------###
nfolds = 5
train_error = np.zeros(nfolds)
test_error = np.zeros(nfolds)
np.random.seed(17)
seqs=[x%nfolds for x in range(len(ratings))]
np.random.shuffle(seqs)

for fold in range(nfolds):
    train_sel=np.array([x!=fold for x in seqs])
    test_sel=np.array([x==fold for x in seqs])
    train=ratings[train_sel]
    test=ratings[test_sel]
    #calculate model parameters: mean rating over the training set:
    Rglobal=np.mean(train[:,2])
    #apply the model to the train set:
    train_error[fold]=np.sqrt(np.mean((train[:,2]-Rglobal)**2))
    #apply the model to the test set:
    test_error[fold]=np.sqrt(np.mean((test[:,2]-Rglobal)**2)) 
    #print errors:
    print("Fold " + str(fold) + ": RMSE_train=" + str(train_error[fold]) + "; RMSE_test=" + str(test_error[fold]))

#print the final conclusion:
print("\n")
print("Mean error on TRAIN: " + str(np.mean(train_error)))
print("Mean error on  TEST: " + str(np.mean(test_error)))

### 2.2 $R_{item}(User,Item)$


This approach performs clearly better than the previous one, but it still has some serious limitations. Here, we focus on the movie ratings: the estimated rating that a user would give to a movie is equal to the mean of all ratings for this specific movie. Even though it performs better than the global mean rating, it ends up recommending the movies that are highly rated by most people. This may seem reasonable at first, but the problem is that it doesn't take into account the user's personal taste, only what most people find good. Its performance also depends on the data. For example, if a movie has been rated only by a small number of users, all of whom rated it very high or very low, then the estimates that other users get for this movie will not be very reliable. Its recommendations can only be trusted for movies that have been rated many times and by a variety of users. During cross-validation, there were some cases where a movie id was present in the test set but not in the training set. For those movies, the rating was estimated with the previous approach, $R_{global}(User,Item)$. Finally, the error on the training and test set for this approach can be seen in the table below.

| Training error | Test error   |
|------|------|
|   0.974  | 0.979|
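
The per-movie mean with a global-mean fallback can also be computed without an explicit loop over movie ids. The following is a minimal sketch under the assumption that ids are integers in the range 1..`n_ids`; the helper `group_means` and the toy training set are hypothetical and not part of the original notebook.

```python
import numpy as np

def group_means(ids, values, n_ids, fallback):
    """Mean of `values` per id, assuming every id lies in 1..n_ids.

    Ids without any value get `fallback`. The result has length n_ids + 1
    so it can be indexed directly by id (position 0 is unused).
    """
    ids = np.asarray(ids, dtype=int)
    sums = np.bincount(ids, weights=values, minlength=n_ids + 1)
    counts = np.bincount(ids, minlength=n_ids + 1)
    means = np.full(n_ids + 1, fallback, dtype=float)
    seen = counts > 0
    means[seen] = sums[seen] / counts[seen]
    return means

# Toy training set (hypothetical): columns are user id, movie id, rating
train_toy = np.array([[1, 1, 5], [2, 1, 3], [1, 2, 4], [3, 2, 2], [2, 4, 1]])
r_global = train_toy[:, 2].mean()  # 3.0
r_item = group_means(train_toy[:, 1], train_toy[:, 2], n_ids=4, fallback=r_global)

# Movie 3 never appears in the training set, so it falls back to the global mean
test_movie_ids = np.array([1, 3, 4])
print(r_item[test_movie_ids])  # [4. 3. 1.]
```

Passing the user id column instead of the movie id column would give per-user means in the same way.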

In [None]:
###-----------------------Ritem - mean of all ratings for a specific item----------------------------------------------------------------###
train_error2 = np.zeros(nfolds)
test_error2 = np.zeros(nfolds)
np.random.seed(17)
seqs=[x%nfolds for x in range(len(ratings))]
np.random.shuffle(seqs)

for fold in range(nfolds):
    train_sel=np.array([x!=fold for x in seqs])
    test_sel=np.array([x==fold for x in seqs])
    train=ratings[train_sel]
    test=ratings[test_sel]
    Rglobal=np.mean(train[:,2])
    #Make a list which has the mean rating for every movie in the training set
    Ritem = list()
    for i in range(1,3953):
        if i in train[:,1]:
            my_table = train[train[:,1] == i]
            Ritem.append(my_table[:,2].mean())
        else:
            #If a movie id is not in the training set we replace its estimate with Rglobal
            Ritem.append(Rglobal)

    Ritem_train = list()
    for i in range(train.shape[0]):
        Ritem_train.append(Ritem[int(train[i,1])-1])

    Ritem_test = list()
    for i in range(test.shape[0]):
        Ritem_test.append(Ritem[int(test[i,1])-1])

    Ritem_train = np.array(Ritem_train)
    Ritem_test = np.array(Ritem_test)
    #apply the model to the train set:
    train_error2[fold]=np.sqrt(np.mean((train[:,2]-Ritem_train)**2))
    #apply the model to the test set:
    test_error2[fold]=np.sqrt(np.mean((test[:,2]-Ritem_test)**2)) 
    #print errors:
    print("Fold " + str(fold) + ": RMSE_train=" + str(train_error2[fold]) + "; RMSE_test=" + str(test_error2[fold]))
    # Saving the Ritem vectors for the training and test sets at each fold,
    # to use them later at the linear regression approach.
    if fold == 0:
        Ritrain0 = Ritem_train
        Ritest0 = Ritem_test
    elif fold == 1:
        Ritrain1 = Ritem_train
        Ritest1 = Ritem_test
    elif fold == 2:
        Ritrain2 = Ritem_train
        Ritest2 = Ritem_test
    elif fold == 3:
        Ritrain3 = Ritem_train
        Ritest3 = Ritem_test
    elif fold == 4:
        Ritrain4 = Ritem_train
        Ritest4 = Ritem_test

#print the final conclusion:
print("\n")
print("Mean error on TRAIN: " + str(np.mean(train_error2)))
print("Mean error on  TEST: " + str(np.mean(test_error2)))



### 2.3 $R_{user}(User,Item)$

In this approach, the estimated rating that a user would give to a movie is equal to the mean of all ratings given by that specific user. Initially, one may think that this approach would perform better than the previous one, but it is actually slightly worse. This is mainly because the general quality of a movie, as reflected in its overall rating, is not taken into account at all. Therefore, if for example a user has a high mean rating, then every movie is predicted to get a high rating from this user, regardless of its genre or its overall rating. As in the previous approach, for user ids that were present in the test set but not in the training set, the rating was estimated using $R_{global}(User,Item)$. The results can be seen below.

| Training error | Test error   |
|------|------|
|   1.027  | 1.035|

In [None]:
###-----------------------Ruser - mean of all ratings for specific user-----------------------------###
train_error3 = np.zeros(nfolds)
test_error3 = np.zeros(nfolds)
np.random.seed(17)
seqs=[x%nfolds for x in range(len(ratings))]
np.random.shuffle(seqs)

for fold in range(nfolds):
    train_sel=np.array([x!=fold for x in seqs])
    test_sel=np.array([x==fold for x in seqs])
    train=ratings[train_sel]
    test=ratings[test_sel]
    Rglobal=np.mean(train[:,2])
    #Make a list which has the mean rating for every user in the training set
    Ruser = list()
    for i in range(1,6041):
        if i in train[:,0]:
            my_table = train[train[:,0] == i]
            Ruser.append(my_table[:,2].mean()) 
        else:
            #If a user id is not in the training set we replace its estimate with Rglobal
            Ruser.append(Rglobal)

    Ruser_train = list()
    for i in range(train.shape[0]):
        Ruser_train.append(Ruser[int(train[i,0])-1])

    

    Ruser_test = list()
    for i in range(test.shape[0]):
        Ruser_test.append(Ruser[int(test[i,0])-1])

    

    Ruser_train = np.array(Ruser_train)
    Ruser_test = np.array(Ruser_test)
    
    
    #apply the model to the train set:
    train_error3[fold]=np.sqrt(np.mean((train[:,2]-Ruser_train)**2))
    #apply the model to the test set:
    test_error3[fold]=np.sqrt(np.mean((test[:,2]-Ruser_test)**2)) 
    #print errors:
    print("Fold " + str(fold) + ": RMSE_train=" + str(train_error3[fold]) + "; RMSE_test=" + str(test_error3[fold]))
    # Saving the Ruser vectors for the training and test sets at each fold,
    # to use them later at the linear regression approach.
    if fold == 0:
        Rutrain0 = Ruser_train
        Rutest0 = Ruser_test
    elif fold == 1:
        Rutrain1 = Ruser_train
        Rutest1 = Ruser_test
    elif fold == 2:
        Rutrain2 = Ruser_train
        Rutest2 = Ruser_test
    elif fold == 3:
        Rutrain3 = Ruser_train
        Rutest3 = Ruser_test
    elif fold == 4:
        Rutrain4 = Ruser_train
        Rutest4 = Ruser_test


#print the final conclusion:
print("\n")
print("Mean error on TRAIN: " + str(np.mean(train_error3)))
print("Mean error on  TEST: " + str(np.mean(test_error3)))

### 2.4 $R_{user-item}(User,Item)$

For the final approach in this part, a linear model is fitted using two variables: $R_{user}(User,Item)$ and $R_{item}(User,Item)$, which were computed in the sections above. This model is expected to perform better than each of the previous methods individually, because its prediction $R_{user-item}(User,Item)$ takes into account information about both users and movies. The parameters were estimated using ordinary least squares, giving the following linear equation: $R_{user-item}(User,Item)= 0.781*R_{user}(User,Item) + 0.874*R_{item}(User,Item) - 2.348$. The table below shows the performance of all four naive approaches that have been implemented so far.

|                   | $R_{global}$| $R_{user}$| $R_{item}$| $R_{user-item}$|
|-------------------|-------------|-----------|-----------|----------------|
|   Training error  | 1.117       | 1.027     | 0.974     | 0.914          |
|   Test error      | 1.117       | 1.035     | 0.979     | 0.924          |
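
To illustrate how $\alpha$, $\beta$ and $\gamma$ can be obtained with ordinary least squares, here is a minimal sketch on synthetic data. The feature values and the "true" coefficients below are made up so that the fit can be checked; they are not the MovieLens values reported above.

```python
import numpy as np

# Hypothetical per-rating features: the user mean and the item mean for six ratings
r_user = np.array([3.0, 4.0, 2.0, 5.0, 3.5, 4.5])
r_item = np.array([4.0, 3.0, 2.5, 4.5, 3.0, 2.0])

# Synthetic targets built from known coefficients (assumed values, noise-free)
alpha_true, beta_true, gamma_true = 0.8, 0.9, -2.4
y = alpha_true * r_user + beta_true * r_item + gamma_true

# Design matrix: one column per feature plus a column of ones for the intercept
A = np.column_stack([r_user, r_item, np.ones_like(r_user)])

# Least-squares solution of A @ [alpha, beta, gamma] ~= y
(alpha, beta, gamma), *_ = np.linalg.lstsq(A, y, rcond=None)
predictions = A @ np.array([alpha, beta, gamma])
print(alpha, beta, gamma)
```

Because the synthetic targets contain no noise, the fit recovers the assumed coefficients up to floating-point error. On real ratings the residuals are non-zero, and the coefficients are re-estimated on the training part of every fold.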

In [None]:
train_error4 = np.zeros(nfolds)
test_error4 = np.zeros(nfolds)
np.random.seed(17)
seqs=[x%nfolds for x in range(len(ratings))]
np.random.shuffle(seqs)

for fold in range(nfolds):
    train_sel=np.array([x!=fold for x in seqs])
    test_sel=np.array([x==fold for x in seqs])
    train=ratings[train_sel]
    test=ratings[test_sel]
    # Ruser and Ritem vectors saved for this fold in sections 2.3 and 2.2
    Rutrain = [Rutrain0, Rutrain1, Rutrain2, Rutrain3, Rutrain4][fold]
    Rutest = [Rutest0, Rutest1, Rutest2, Rutest3, Rutest4][fold]
    Ritrain = [Ritrain0, Ritrain1, Ritrain2, Ritrain3, Ritrain4][fold]
    Ritest = [Ritest0, Ritest1, Ritest2, Ritest3, Ritest4][fold]
    # Design matrix for the training set: columns are Ruser, Ritem and a constant 1 (intercept)
    X1 = np.vstack([Rutrain,Ritrain]).T
    Y1 = train[:,2]
    A1 = np.hstack([X1,np.ones((X1.shape[0],1))])
    # Ordinary least squares estimate of (alpha, beta, gamma) on the training set
    model_parameters, residuals, rank, singular_values = np.linalg.lstsq(A1, Y1,rcond=None)
    model_parameters = np.array(model_parameters)
    model_parameters.shape = (1,3)
    # Design matrix for the test set, built the same way
    X2 = np.vstack([Rutest,Ritest]).T
    Y2 = test[:,2]
    A2 = np.hstack([X2,np.ones((X2.shape[0],1))])
    #apply the model to the train set:
    train_error4[fold]=np.sqrt(((np.dot(model_parameters,A1.T) - Y1) ** 2).mean())
    #apply the model to the test set:
    test_error4[fold]=np.sqrt(((np.dot(model_parameters,A2.T) - Y2) ** 2).mean())
    #print errors:
    print("Fold " + str(fold) + ": RMSE_train=" + str(train_error4[fold]) + "; RMSE_test=" + str(test_error4[fold]))


print("Mean error on TRAIN: " + str(np.mean(train_error4)))
print("Mean error on  TEST: " + str(np.mean(test_error4)))
