### Caroline Liongosari 

# Netflix Recommender System

This project contains the code for a Netflix recommender system built on several similarity algorithms: cosine similarity, Pearson correlation, Pearson correlation with inverse user frequency (IUF) and case amplification adjustments, and a custom-made algorithm/hack.

In [1]:
import numpy as np
import pandas as pd
#from scipy import linalg

## Step 1: Exploratory data analysis on training data

In [2]:
df = pd.read_csv('train.txt', sep="\t", header=None)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
0,5,3,0,3,3,5,0,1,5,3,...,0,0,0,0,0,0,0,0,0,0
1,4,0,0,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,3,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,4,0,0,0,0,0,2,0,4,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,5,0,0,5,5,5,4,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,3,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,0,0,0,0,0,5,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9,4,0,0,4,0,0,4,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The given training dataset is a text file in 200 by 1000 format with rows being the users and columns being the movies. 

I first checked whether any rows contain only zeros (i.e., users who didn't rate any movies) or any columns contain only zeros (i.e., movies that have no ratings).

`df_filtered_for_zeros` filters out any rows or columns with all zeros. 
Since the resulting dataframe is 200x994, we know that every user has rated at least one movie and that 6 movies have no ratings, which is good to know when adding failsafes to the algorithms.



In [3]:
df_filtered_for_zeros =df.loc[(df.sum(axis=1) != 0), (df.sum(axis=0) != 0)]
df_filtered_for_zeros.shape

(200, 994)

I also wanted to find the number of movies each user has rated (just to see the variation):

In [4]:
#find how many movies each user rated
df.astype(bool).sum(axis=1)

0      235
1       53
2       49
3       19
4      151
5      194
6      353
7       50
8       19
9      153
10     153
11      44
12     531
13      84
14      92
15     126
16      23
17     238
18      18
19      41
20     156
21     112
22     130
23      57
24      61
25      81
26      22
27      66
28      28
29      33
      ... 
170     24
171     23
172     35
173    134
174     32
175     52
176     96
177    218
178     32
179     53
180    239
181     26
182     42
183    203
184     38
185     71
186     46
187     98
188    138
189     47
190     26
191     22
192    103
193    252
194     73
195     33
196    105
197    148
198     34
199    186
Length: 200, dtype: int64

From this data, I wanted to see the largest and smallest number of movies any single user had rated, to get the range:

In [5]:
print("The user who rated the most number of movies rated this many movies: " + str(max(df.astype(bool).sum(axis=1))))
print("The user who rated the least number of movies rated this many movies: " + str(min(df.astype(bool).sum(axis=1))))

The user who rated the most number of movies rated this many movies: 531
The user who rated the least number of movies rated this many movies: 13


I did the same for the number of ratings each movie has. This is also helpful for checking that my IUF is implemented correctly, since the IUF of a movie is log10(number of users / **number of ratings for that movie**), where the number of users is 200.

In [6]:
#find how many ratings each movie has 
df.astype(bool).sum(axis=0)

0      79
1      17
2      15
3      27
4      12
5       9
6      79
7      35
8      52
9      13
10     42
11     45
12     44
13     32
14     56
15      5
16     13
17      5
18     11
19     13
20     19
21     45
22     29
23     32
24     67
25     10
26      9
27     44
28     17
29      6
       ..
970     6
971     6
972     1
973     8
974     8
975     4
976    10
977     5
978     7
979     3
980     2
981     6
982     4
983    12
984     4
985     5
986     0
987    13
988     4
989     6
990     2
991     1
992    11
993     1
994     9
995     3
996     4
997     1
998     3
999     3
Length: 1000, dtype: int64
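As a quick spot check of the IUF values, here is a minimal sketch that computes log10(200 / count) for a few of the counts listed above. The helper name `iuf` is hypothetical, and the fallback of 1 for unrated movies is assumed to match the failsafe used in the IUF code later in this notebook.

```python
import numpy as np

def iuf(num_users, num_ratings):
    """Inverse user frequency: log10(num_users / num_ratings).

    Returns 1.0 for a movie with no ratings (assumed fallback, so the
    division by zero is avoided).
    """
    if num_ratings == 0:
        return 1.0
    return float(np.log10(num_users / num_ratings))

# Counts taken from the output above:
# movie 0 has 79 ratings, movie 972 has 1 rating, movie 986 has 0 ratings.
for movie_id, count in [(0, 79), (972, 1), (986, 0)]:
    print(movie_id, count, round(iuf(200, count), 4))
```

Rarely rated movies get a larger IUF than popular ones, so a value that shrinks as the count grows is a sign the calculation is wired up correctly.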

In [7]:
print ("Movie with the most number of ratings has this many ratings: " + str(max(df.astype(bool).sum(axis=0))))
print ("Movie with the least number of ratings has this many ratings: " + str(min(df.astype(bool).sum(axis=0))))

Movie with the most number of ratings has this many ratings: 102
Movie with the least number of ratings has this many ratings: 0


In [8]:
s=df.astype(bool).sum(axis=0)
s=s[s<=10]
print( "Number of movies with 10 or less ratings: " + str(len(s)))   

Number of movies with 10 or less ratings: 500


I also checked to see if any of the users only gave a particular value for all of their ratings (ex: if a user rated all movies they watched as 3's). I checked this by seeing if a row in the original training data only had 2 values (0 and the rating). 

In [9]:
dft = df.transpose()
dft

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,190,191,192,193,194,195,196,197,198,199
0,5,4,0,0,4,4,0,0,0,4,...,0,0,4,4,0,0,0,4,1,5
1,3,0,0,0,3,0,0,0,0,0,...,0,0,3,0,0,0,3,0,0,4
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,0,0,0,0,0,5,0,0,4,...,0,0,0,4,0,0,3,3,0,0
4,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,5,0,0,0,0,0,0,0,5,0,...,0,0,0,0,0,0,0,2,0,0
6,0,0,0,0,0,2,5,3,0,4,...,0,4,0,3,0,0,0,4,4,4
7,1,0,0,0,0,0,5,0,0,0,...,0,0,0,3,0,5,0,0,0,4
8,5,0,0,0,0,4,5,0,0,0,...,0,0,0,4,0,0,0,0,5,4
9,3,2,0,0,0,0,4,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
dft = df.transpose()
for num in range(200):
    if len(dft[num].value_counts()) <=2:
        print("found!")

Since "found!" was never printed, we know that no user in the training data gave the same rating to every movie they rated. However, users in the test data might have.
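The same check can be sketched without transposing, by looking at the distinct non-zero values in each row. This is only an illustrative alternative: the function name `users_with_single_rating_value` and the toy matrix are made up for the example, and rows with no ratings at all are deliberately not flagged.

```python
import numpy as np

def users_with_single_rating_value(ratings):
    """Return the row indices whose non-zero entries all share one value."""
    ratings = np.asarray(ratings)
    flagged = []
    for user_index, row in enumerate(ratings):
        distinct = np.unique(row[row != 0])  # distinct ratings, ignoring "unrated" zeros
        if distinct.size == 1:
            flagged.append(user_index)
    return flagged

# Toy data: user 0 only ever gives 3s, user 1 uses several values, user 2 rated nothing.
toy = [[3, 0, 3, 3],
       [1, 5, 0, 2],
       [0, 0, 0, 0]]
print(users_with_single_rating_value(toy))
```

Such users matter later because subtracting their average rating leaves a vector of zeros, which makes the Pearson denominator zero.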

## Step 2: Implement the similarity algorithms

**Cosine similarity functions:**

In [11]:
#returns the cosine similarity between two arrays a1 and a2
def cosine_similarity(a1, a2):
    # first take out indices where there's a 0 in either arrays
    new_a1 = []
    new_a2 = []
    for index, e_a1 in enumerate(a1):
        e_a2 = a2[index]
        if e_a1 >0 and e_a2 > 0:
            new_a1.append(e_a1)
            new_a2.append(e_a2)

    numerator = np.dot(new_a1, new_a2)  # compute dot product
    
    #norm_a1 = linalg.norm(new_a1) these don't work for some reason...
    #norm_a2 = linalg.norm(new_a2)
    norm_a1 = np.sqrt(np.sum(np.square(new_a1)))  # norms
    norm_a2 =np.sqrt(np.sum(np.square(new_a2)))
    denominator = norm_a1*norm_a2

    if denominator == 0: #if denominator = 0, just return cosine similarity of 0
        return 0

    cos_sim = numerator/denominator  # calculate cosine similarity
    
    if cos_sim > 1:
        cos_sim = 1
    if cos_sim < 0:
        cos_sim = 0

    return cos_sim
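For comparison, here is a sketch of the same co-rated cosine similarity written with NumPy boolean masks instead of an explicit loop. The name `cosine_similarity_masked` is hypothetical, and the sketch assumes that 0 means "not rated", as in the function above.

```python
import numpy as np

def cosine_similarity_masked(a1, a2):
    """Cosine similarity over the positions where both arrays are > 0."""
    a1 = np.asarray(a1, dtype=float)
    a2 = np.asarray(a2, dtype=float)
    both = (a1 > 0) & (a2 > 0)          # positions rated in both arrays
    x, y = a1[both], a2[both]
    denominator = np.linalg.norm(x) * np.linalg.norm(y)
    if denominator == 0:                # nothing co-rated: fall back to 0
        return 0.0
    # clip guards against tiny floating point overshoot above 1
    return float(np.clip(np.dot(x, y) / denominator, 0.0, 1.0))

print(cosine_similarity_masked([1, 2, 3], [3, 2, 1]))
print(cosine_similarity_masked([5, 0, 3], [4, 2, 0]))
```

The second call has only one co-rated position, so the similarity comes out as 1 regardless of the two ratings. That is a known weakness of cosine similarity on sparse overlaps and worth keeping in mind when reading the weights.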

In [12]:
#this implementation of cosine similarity doesn't filter out the indices that contain 0s 
#for implementation in pearson correlation
def simple_cos_similarity(a1,a2):
    numerator = np.dot(a1, a2)  # compute dot product
    norm_a1 = np.sqrt(np.sum(np.square(a1)))  # norms
    norm_a2 = np.sqrt(np.sum(np.square(a2)))
    denominator = norm_a1*norm_a2

    #if denominator equals zero, means that the current user from the test data rated all movies they watched the same
    #since we know from EDA that none of the users of the training data did so
    #just return correlation of 0.5 - right in the middle (not too similar or different)
    # but this is probably not the best way to handle this 
    if denominator == 0:
        return 0.5 

    cos_sim = numerator/denominator  # calculate cosine similarity
    
    if cos_sim > 1:
        cos_sim = 1
        
    #I realized later on this should have been 
    #if cos_sim<-1
        #cos_sim =-1
    #since this is inside pearson_correlation
    #but using 0 actually made results better (??)
    if cos_sim <0:
        cos_sim = 0

    return cos_sim

**Item-Based Collaborative: Adjusted Cosine Similarity**

<img src="adjusted-cosine-formula.png">

Understanding the formula: 

**Numerator:**
For each user, get the dot product of (the user's rating on movie i minus the user's overall average rating) and (the user's rating on movie j minus the user's overall average rating). Then add them all together to get the numerator. 

**Denominator:**
For each user, get the square of the user's rating on movie i minus the user's overall average rating. Add them all up and take the square root of it. For each user, get the square of the user's rating on movie j minus the user's overall average rating. Add them all up and take the square root of it. Take the above two results and multiply them together to get the denominator.

**Basically:**  take the cosine similarity between the arrays: (a row of ratings in the transposed array minus the list of average ratings) and (the current movie ratings minus the list of average movie ratings) 
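
Since the formula image is not reproduced in this text, here is the standard adjusted cosine similarity that the description above corresponds to. The symbols are assumed notation: $R_{u,i}$ is user $u$'s rating of movie $i$, $\bar{R}_u$ is user $u$'s average rating, and $U$ is the set of users being summed over.

$$
\mathrm{sim}(i, j) = \frac{\sum_{u \in U} (R_{u,i} - \bar{R}_u)(R_{u,j} - \bar{R}_u)}{\sqrt{\sum_{u \in U} (R_{u,i} - \bar{R}_u)^2}\;\sqrt{\sum_{u \in U} (R_{u,j} - \bar{R}_u)^2}}
$$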


In [13]:
#adjusted cosine similarity is to be used for item-based collaborative filtering
#a1 and a2 are movie arrays containing user ratings - to coincide with the formula:
#a1 = all ratings on movie i, a2 = all ratings on movie j
#avg_ratings is an array of length 200 containing the users' average ratings
#ex: avg_ratings[0] contains userid1's average over all of his/her ratings
#note: cosine_similarity only keeps positions where both values are > 0, so after the
#averages are subtracted, only users who rated both movies above their own average contribute

def adjusted_cos_similarity_adj(a1, a2, avg_ratings):
    
    a1_min_avg_rating = np.subtract(a1, avg_ratings)
    a2_min_avg_rating = np.subtract(a2, avg_ratings) 
    
    return cosine_similarity(a1_min_avg_rating, a2_min_avg_rating)

**Pearson correlation:**

In [14]:
#returns the pearson correlation between two arrays a1 and a2
#note that pearson_correlation ([3,2,1],[1,2,3]) = -.99999 rather than -1, 
#may possibly lead to calculations being a little off? (doubt it)
def pearson_correlation(a1,a2):
    # first take out indices where there's a 0 in either arrays
    new_a1 = []
    new_a2 = []
    for index, e_a1 in enumerate(a1):
        e_a2 = a2[index]
        if e_a1 > 0 and e_a2 > 0:
            new_a1.append(e_a1)
            new_a2.append(e_a2)
    
    avg_rating_a1 = np.mean(new_a1)
    avg_rating_a2 = np.mean(new_a2)
    
    a1_sub_avg = new_a1 - avg_rating_a1
    a2_sub_avg = new_a2 - avg_rating_a2
    
    return simple_cos_similarity(a1_sub_avg, a2_sub_avg)
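As a sanity check on the idea behind `pearson_correlation`, the sketch below computes the plain Pearson correlation on co-rated positions with `np.corrcoef`. It is not a drop-in replacement: the helper name `pearson_on_corated` is hypothetical, it does not clamp negative values to 0, and it returns `None` instead of 0.5 when the correlation is undefined.

```python
import numpy as np

def pearson_on_corated(a1, a2):
    """Unclamped Pearson correlation over positions where both arrays are > 0."""
    a1 = np.asarray(a1, dtype=float)
    a2 = np.asarray(a2, dtype=float)
    both = (a1 > 0) & (a2 > 0)
    x, y = a1[both], a2[both]
    # Undefined with fewer than two co-rated items or when either side is constant.
    if x.size < 2 or np.std(x) == 0 or np.std(y) == 0:
        return None
    return float(np.corrcoef(x, y)[0, 1])

print(pearson_on_corated([3, 2, 1], [1, 2, 3]))   # very close to -1
print(pearson_on_corated([3, 3, 3], [1, 2, 3]))   # None: first user is constant
```

Comparing a few pairs against this reference makes it easy to see where the clamping and the 0.5 fallback in `simple_cos_similarity` change the weights.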

## Step 3: Implement the functions that use the similarity algorithms to make predictions

**Prediction using Cosine Similarity with k nearest neighbors:**

In my runs, the predictions got better as k got larger. So, I decided to stick with using all the possible neighbors for cosine similarity and the other algorithms.

In [15]:
def cosine_scores_knn(training_data,  movieids_to_rate, current_user_ratings, current_user_id):
    
    # weights_list is a 1 by 200 array
    # weights_list[0] contains the cosine similarity between current user and user 1,
    # weights_list[1] contains the cosine similarity between current user and user 2 and so on
    weights_list = [cosine_similarity(current_user_ratings, other_user) for other_user in training_data]
    
    
    wlist_indices_sorted = np.argsort(weights_list)
    wlist_indices_sorted = list((reversed(wlist_indices_sorted)))
    #wlist_indices_sorted stores the indices of the users sorted in decreasing cosine similarity
        #which are the indices for the rows in the training dataset
    
    #print("wlist_indices_sorted: " + str(wlist_indices_sorted))
    ratings_list = []  # ratings_list contains the calculated ratings based on cosine similarity 

    # for each movie needed rating, predict the rating
    for m in movieids_to_rate:
        numerator = 0  # numerator is the sum of the weights x score
        denominator = 0  # denominator is the sum of the weights
        
        #filter wlist_indices_sorted to take out users who didn't see movie m
        wlist_indices_sorted_filtered = [userid for userid in wlist_indices_sorted if training_data[userid][m]>0]
        #print("length of wlist_indices_sorted_filtered: " + str(wlist_indices_sorted_filtered))
       
        k=50
        if  len(wlist_indices_sorted_filtered)<k:
            for index in range(len(wlist_indices_sorted_filtered)):
                current_user_rating = training_data[wlist_indices_sorted_filtered[index]][m]
                
                if current_user_rating !=0:
                    numerator += (weights_list[wlist_indices_sorted_filtered[index]]*current_user_rating)
                    denominator += weights_list[wlist_indices_sorted_filtered[index]]
            
            if denominator ==0:
                rating =3
                ratings_list.append(rating)
                continue
            
            rating = int(np.rint(numerator/denominator))
            if rating > 5:
                rating = 5
            elif rating < 1:
                rating = 1

            ratings_list.append(rating)
            continue
            
        for index in range(k):
            current_user_rating = training_data[wlist_indices_sorted_filtered[index]][m] 
            #print("current_user_rating: " + str(current_user_rating))
            if current_user_rating !=0: #current user rating
                
                #numerator += weight*rating
                numerator+= (weights_list[wlist_indices_sorted_filtered[index]]*current_user_rating)
                #denominator +=  weight
                denominator += weights_list[wlist_indices_sorted_filtered[index]]

        # compute the prediction once per movie, after all k neighbors have been accumulated
        if denominator == 0:
            rating = 3
            #rating = int(np.rint(current_user_rating_avg)) <- turns out using a failsafe of 3 gives
                                                        #slightly more accurate results
            ratings_list.append(rating)
            continue

        rating = int(np.rint(numerator/denominator))
        if rating > 5:
            rating = 5
        if rating < 1:
            rating = 1

        ratings_list.append(rating)

    return ratings_list
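
The core of the k-nearest-neighbors step (keep only users who rated the movie, take the k most similar, then form a weighted average) can be sketched for a single movie as follows. The function name `knn_weighted_rating` and the toy numbers are assumptions for illustration; the default of 3 mirrors the failsafe used above.

```python
import numpy as np

def knn_weighted_rating(weights, neighbor_ratings, k, default=3):
    """Predict one rating from the k most similar users who rated the movie."""
    weights = np.asarray(weights, dtype=float)
    neighbor_ratings = np.asarray(neighbor_ratings, dtype=float)
    rated = np.flatnonzero(neighbor_ratings > 0)        # users who rated the movie
    order = np.argsort(weights[rated])[::-1][:k]        # up to k largest weights
    top = rated[order]
    denominator = weights[top].sum()
    if denominator == 0:                                # no usable neighbor
        return default
    rating = np.rint(np.dot(weights[top], neighbor_ratings[top]) / denominator)
    return int(np.clip(rating, 1, 5))

toy_weights = [0.9, 0.8, 0.1, 0.5]
toy_ratings = [5, 0, 1, 4]      # the second user did not rate this movie
print(knn_weighted_rating(toy_weights, toy_ratings, k=2))
print(knn_weighted_rating(toy_weights, toy_ratings, k=3))
```

With k=2 only the users with weights 0.9 and 0.5 are used; with k=3 the weakly similar user who gave a 1 is also included and pulls the prediction down.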

**Prediction using cosine similarity for all possible users**

In [16]:
def cosine_scores(training_data, movieids_to_rate, current_user_ratings, current_user_id):
    
    # weights_list is a 1 by 200 array
    # weights_list[0] contains the cosine similarity between current user and user 1,
    # weights_list[1] contains the cosine similarity between current user and user 2 and so on
    weights_list = [cosine_similarity(current_user_ratings, other_user) for other_user in training_data]
    #print("weights_list: " + str(weights_list))
    #current_user_ratings_wo_zeros = [rating for rating in current_user_ratings if rating >!=]
    #current_user_ratings_wo_zeros = np.trim_zeros(current_user_ratings)
    #current_user_rating_avg = np.mean(current_user_ratings_wo_zeros)
    
    ratings_list = []  # ratings_list contains the calculated ratings based on cosine similarity

    # for each movie needed rating, predict the rating
    for m in movieids_to_rate:
        numerator = 0  # numerator is the sum of the weights x score
        denominator = 0  # denominator is the sum of the weights

        for w, row in zip(weights_list, training_data):
            current_row_user_rating = row[m]
            if current_row_user_rating != 0:
                denominator += w
                numerator += (w*current_row_user_rating)

        if denominator == 0:
            rating = 3
            #rating = int(np.rint(current_user_rating_avg)) <- turns out using a failsafe of 3 gives
                                                            #slightly more accurate results in this case
            ratings_list.append(rating)
            continue

        rating = int(np.rint(numerator/denominator))
        if rating > 5:
            rating = 5
        if rating < 1:
            rating = 1


        ratings_list.append(rating)

    return ratings_list


**Prediction using Pearson correlation (only)**

In [17]:
#basic algorithm for predicting using pearson correlation (no modifications)
def pearson_scores(training_data, movieids_to_rate, current_user_ratings, current_user_id):

    current_user_ratings_wo_zeros = [rating for rating in current_user_ratings if rating>0]
    #np.trim_zeros(current_user_ratings) is not used here because it only strips
    #leading and trailing zeros, not the zeros in the middle of the array
    current_user_rating_avg = np.mean(current_user_ratings_wo_zeros)
    
    weights_list = [pearson_correlation(current_user_ratings, other_user) for other_user in training_data]
    
    ratings_list = []
    
    
    avg_rating_all_users = [np.mean([rating for rating in other_user if rating >0]) for other_user in training_data]
    
    for m in movieids_to_rate:
        numerator = 0
        denominator = 0
        
        for w, row, current_user_avg_r in zip(weights_list, training_data, avg_rating_all_users):
            
            current_row_user_rating = row[m]
            
            if current_row_user_rating !=0:
                denominator += np.absolute(w)
                numerator += (w*(current_row_user_rating - current_user_avg_r))
        
        if denominator == 0:
            rating = int(np.rint(current_user_rating_avg))
            #rating = 3 (tried using a default of 3 like cosine
                        #but using the avg rating gives slightly more accurate results)
            ratings_list.append(rating)
            continue
        
        rating = int(np.rint(current_user_rating_avg + (numerator/denominator)))
        if rating > 5:
            rating = 5
        if rating < 1:
            rating = 1
        
        ratings_list.append(rating)
    
    return ratings_list
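
Written as a formula, the prediction computed by `pearson_scores` is shown below. The symbols are assumed notation: $a$ is the active (test) user, $u$ ranges over the training users who rated movie $m$, $w_{a,u}$ is the Pearson weight, and $\bar{r}$ denotes a user's average over their non-zero ratings.

$$
p_{a,m} = \bar{r}_a + \frac{\sum_{u} w_{a,u}\,(r_{u,m} - \bar{r}_u)}{\sum_{u} \lvert w_{a,u} \rvert}
$$

The result is rounded to the nearest integer and clamped to the range 1 to 5. When the denominator is zero, the prediction falls back to $\bar{r}_a$.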
        

**Adjustment to Prediction with Pearson Modification to accept IUF-adjusted training data and perform case amplification if `c_amp = True`**

In [18]:
#extension of basic Pearson Score prediction above but accepting an IUF-modified version of the training dataset
def pearson_scores_IUF(training_data, training_data_IUF, movieids_to_rate, current_user_ratings, current_user_id, c_amp):
    weights_list = [pearson_correlation(current_user_ratings, other_user) for other_user in training_data_IUF]
    
    #if c_amp is true, do case amplification modification by calculating w*|w|^(rho-1) where rho =2.5
    if c_amp==True:
        rho=2.5
        weights_list = [w * (np.absolute(w)**(rho-1)) for w in weights_list]
        
    ratings_list = []
    
    current_user_ratings_wo_zeros = [rating for rating in current_user_ratings if rating >0]
    current_user_rating_avg = np.mean(current_user_ratings_wo_zeros)
    
    avg_rating_all_users = [np.mean([rating for rating in other_user if rating >0]) for other_user in training_data]
    #np.trim_zeros(other_user) would not work here: it only strips leading and trailing zeros
    
    for m in movieids_to_rate:
        numerator = 0
        denominator = 0
        
        for w, row, current_user_avg_r in zip(weights_list, training_data, avg_rating_all_users):
            
            current_row_user_rating = row[m]
            
            if current_row_user_rating !=0:
                denominator += np.absolute(w)
                numerator += (w*(current_row_user_rating - current_user_avg_r))
        
        if denominator == 0:
            rating = int(np.rint(current_user_rating_avg))
            #rating = 3 (tried using a default of 3 like cosine
                        #but using the avg rating gives slightly more accurate results)
            ratings_list.append(rating)
            continue
        
        rating = int(np.rint(current_user_rating_avg + (numerator/denominator)))
        if rating > 5:
            rating = 5
        if rating < 1:
            rating = 1
        
        ratings_list.append(rating)
    
    return ratings_list

**Pearson correlation prediction using IUF modification**

In [19]:
def pearson_iuf_scores(training_data, movieids_to_rate, current_user_ratings, current_user_id):

    #Step 1: multiply all the original ratings in the training data by the IUF and put them into 
    #the training_data_IUF 2d array

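    #use a float array so the IUF-scaled ratings are not truncated if training_data holds integers
    training_data_IUF = np.array(training_data, dtype=float)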
    
    m = 200 # the number of users
    for col in range(1000): #for each column (movie)
        
        mj = (training_data_IUF!=0).sum(0)[col] #mj = # of people who rated movie col
        
        #cross checking iuf is calculated correctly...
        #print ("np.log10(" + str(m) + "/" +str(mj) +")")
        
        #would have checked if m/mj = 0 but that's not possible since m is always 200
        if mj!=0:
            iuf = np.log10(m/mj)
        else: 
            iuf = 1
          
        for row in training_data_IUF: #go through each user and adjust their ratings 
            row[col] *= iuf
            #print("row[col]:" + str(row[col]))
            
    #Step 2: Calculate the ratings using pearson with the IUF training dataset 
    ratings = pearson_scores_IUF(training_data, training_data_IUF, movieids_to_rate, current_user_ratings, current_user_id,False)
    
    return ratings
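
The IUF scaling in Step 1 can also be sketched as a single vectorized operation. This is an illustrative alternative under stated assumptions: the name `apply_iuf` is hypothetical, 0 is taken to mean "not rated", and unrated movies get an IUF of 1 as in the loop above.

```python
import numpy as np

def apply_iuf(ratings):
    """Multiply every column (movie) by log10(num_users / num_ratings_for_movie)."""
    ratings = np.asarray(ratings, dtype=float)   # float so scaled values are not truncated
    num_users = ratings.shape[0]
    counts = (ratings != 0).sum(axis=0)          # number of users who rated each movie
    safe_counts = np.where(counts > 0, counts, 1)  # avoid dividing by zero
    iuf = np.where(counts > 0, np.log10(num_users / safe_counts), 1.0)
    return ratings * iuf                         # broadcasts the IUF across rows

toy = [[5, 0, 0],
       [3, 4, 0]]
print(apply_iuf(toy))
```

In the toy matrix the first movie is rated by every user, so its IUF is log10(1) = 0 and the whole column becomes 0. In the training data the most-rated movie has 102 of 200 ratings, so this edge case does not occur there.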

**Prediction using Pearson Correlation using case amplifications on the weights**

In [20]:
#Case amplification adjusts the weights by amplifying high weights and lowering low weights
def pearson_caseamp_scores(training_data,  movieids_to_rate, current_user_ratings, current_user_id):
    
    #get the original list of weights
    weights_list = [pearson_correlation(current_user_ratings, other_user) for other_user in training_data]

    #case amplification: multiply each weight by |w|^(rho-1) where rho is 2.5
    rho=2.5
    weights_list = [w * (np.absolute(w)**(rho-1)) for w in weights_list]
    
    #make predictions using the new weights
    ratings_list =[]
    current_user_ratings_wo_zeros = [rating for rating in current_user_ratings if rating >0]
    current_user_rating_avg = np.mean(current_user_ratings_wo_zeros)
    
    avg_rating_all_users = [np.mean([rating for rating in other_user if rating >0]) for other_user in training_data]
    
    for m in movieids_to_rate:
        numerator = 0
        denominator = 0
        
        for w, row, current_user_avg_r in zip(weights_list, training_data, avg_rating_all_users):
            
            current_row_user_rating = row[m]
            
            if current_row_user_rating !=0:
                denominator += np.absolute(w)
                numerator += (w*(current_row_user_rating - current_user_avg_r))
        
        if denominator == 0:
            rating = int(np.rint(current_user_rating_avg))
            #rating = 3 (tried using a default of 3 like cosine
                        #but using the avg rating gives slightly more accurate results)
            ratings_list.append(rating)
            continue
        
        rating = int(np.rint(current_user_rating_avg + (numerator/denominator)))
        if rating > 5:
            rating = 5
        if rating < 1:
            rating = 1
        
        ratings_list.append(rating)
    
    return ratings_list
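
To see what case amplification does to individual weights, here is a small worked sketch using the same expression, w * |w|^(rho - 1) with rho = 2.5. The function name `amplify` and the sample weights are made up for the example.

```python
import numpy as np

def amplify(weights, rho=2.5):
    """Case amplification: w * |w|**(rho - 1), which keeps the sign of w."""
    weights = np.asarray(weights, dtype=float)
    return weights * np.abs(weights) ** (rho - 1)

# Roughly [0.768, 0.177, 0.003, -0.177]
print(amplify([0.9, 0.5, 0.1, -0.5]))
```

Since every |w| is at most 1, all weights shrink in absolute terms, but small weights shrink much more than large ones. The ratio between the 0.9 and 0.1 weights grows from 9 to about 243, so highly similar users dominate the prediction.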
          

**Prediction using Pearson Correlation with both IUF and Case Amplification adjustments**

In [21]:
#Do both IUF and CaseAmp modifications for Pearson
def pearson_iuf_caseamp_scores(training_data, movieids_to_rate, current_user_ratings, current_user_id):
    
    #Step 1: multiply all the original ratings in the training data by the IUF
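    #use a float array so the IUF-scaled ratings are not truncated if training_data holds integers
    training_data_IUF = np.array(training_data, dtype=float)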
    
    m = 200 # the number of users
    for col in range(1000): #for each column (movie)
        
        mj = (training_data_IUF!=0).sum(0)[col] #mj = # of people who rated movie col
        
        #cross checking of iuf is calculated correctly...
        #print ("np.log10(" + str(m) + "/" +str(mj) +")")
        
        #would have checked if m/mj = 0 but that's not possible since m is always 200
        if mj!=0:
            iuf = np.log10(m/mj)
        else: 
            iuf = 1
          
        for row in training_data_IUF: #go through each user and adjust their ratings 
            row[col] *= iuf
            #print("row[col]:" + str(row[col]))
            
    #Step 2: Calculate the ratings using pearson with the adjusted training dataset 
    ratings = pearson_scores_IUF(training_data, training_data_IUF, movieids_to_rate, current_user_ratings, current_user_id,True)
    
    return ratings

**Item-based Collaborative Filtering with Adjusted Cosine Similarity**

In [22]:
#item based collaborative filtering algorithm with adjusted cosine similarity
def item_adj_cosine_sim_adj(training_data, movieids_to_rate, current_user_ratings, current_user_id):
    
    ratings_list = [] 
    
    #transpose_td is the transposed 2d array of training_data: rows are the movies, columns are the users 
    transpose_td = np.array(training_data).T.tolist()
    
    #movies_user_has_rated gives the list of movieids the current user had rated
    movies_user_has_rated = [index for index, element in enumerate(current_user_ratings) if element>0]
    
    #current_user_ratings_wo_zeros = [rating for rating in current_user_ratings if rating >0]
    #current_user_rating_avg = np.mean(current_user_ratings_wo_zeros)
    
    #td_wo_zeros is the original training data but with the zeros taken out 
    td_wo_zeros = [[rating for rating  in row if rating!=0] for row in training_data]
    
    #get the average rating for each user
    #none of them should be zero since each user made at least 13 ratings (from EDA)
    avg_ratings = [np.mean(row) for row in td_wo_zeros]
    
    #go through the movies that needs rating - m is the current movie
    for m in movieids_to_rate:
        current_movie_ratings = transpose_td[m] #get the current movie's list of ratings
        
        #calculate the weights between the current movie and all the others in the training data using adjusted cosine similarity
        #weights_list[0] contains the similarity between current movie and movieid1, etc.
        weights_list = [adjusted_cos_similarity_adj(current_movie_ratings, transpose_td[mo], avg_ratings) for mo in movies_user_has_rated]

        numerator =0
        denominator = 0

        for w, mov in zip(weights_list, movies_user_has_rated):
            current_user_rating = current_user_ratings[mov]
            denominator += w
            numerator += (w*current_user_rating)
            
        if denominator ==0:
            rating = 3 #using failsafe of 3 gives slightly more accurate results
            #rating = int(np.rint(current_user_rating_avg))
            
        else:
            rating = numerator/denominator

            
        rating = int(np.rint(rating))
        if rating >5:
            rating = 5
        if rating <1:
            rating = 1
                
        ratings_list.append(rating)
        
    
    return ratings_list

**Custom-made Algorithms**

Take the average of the ratings predicted by cosine, Pearson, and item-based as the final predicted ratings! `MAE: 0.758783450993269`



In [23]:
def custom_scores(training_data, movieids_to_rate, current_user_ratings, current_user_id):
    #use the average ratings of cosine similarity, pearson and item based
    ratings_cosine = cosine_scores(training_data, movieids_to_rate, current_user_ratings, current_user_id)
    ratings_pearson = pearson_scores(training_data, movieids_to_rate, current_user_ratings, current_user_id)
    ratings_items = item_adj_cosine_sim_adj(training_data, movieids_to_rate, current_user_ratings, current_user_id)
    
    ratings_list = (np.add(np.add(ratings_cosine, ratings_pearson), ratings_items)/3)
    ratings_list= [int(np.rint(x)) for x in ratings_list]
    
    return ratings_list
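
The averaging step generalizes to any number of prediction lists. The sketch below is an assumed helper (`average_predictions` is not part of the notebook) that stacks the lists, averages them per movie, and rounds with `np.rint`.

```python
import numpy as np

def average_predictions(*prediction_lists):
    """Average several equal-length lists of predicted ratings, per position."""
    stacked = np.array(prediction_lists, dtype=float)   # shape: (num_lists, num_movies)
    mean = stacked.mean(axis=0)
    return [int(r) for r in np.clip(np.rint(mean), 1, 5)]

print(average_predictions([4, 2, 5], [5, 3, 5], [3, 3, 4]))   # [4, 3, 5]
print(average_predictions([2, 3], [3, 4]))                    # [2, 4]
```

The second call shows a detail that matters when averaging an even number of lists: `np.rint` rounds exact halves to the nearest even integer, so 2.5 becomes 2 while 3.5 becomes 4.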
    

The other custom algorithms below are different combinations of the algorithms implemented above, though the one above did the best.

In [24]:
def custom_scores2(training_data, movieids_to_rate, current_user_ratings, current_user_id):
    #use the average ratings of cosine similarity and pearson 
    ratings_cosine = cosine_scores(training_data, movieids_to_rate, current_user_ratings, current_user_id)
    ratings_pearson = pearson_scores(training_data, movieids_to_rate, current_user_ratings, current_user_id)
    
    ratings_list = (np.add(ratings_cosine, ratings_pearson)/2)
    ratings_list= [int(np.rint(x)) for x in ratings_list]
    
    return ratings_list

In [25]:
def custom_scores4(training_data, movieids_to_rate, current_user_ratings, current_user_id):
    ratings_cosine = cosine_scores(training_data, movieids_to_rate, current_user_ratings, current_user_id)
    ratings_pearson = pearson_scores(training_data, movieids_to_rate, current_user_ratings, current_user_id)
    ratings_items = item_adj_cosine_sim_adj(training_data, movieids_to_rate, current_user_ratings, current_user_id)
    ratings_case_iuf= pearson_iuf_caseamp_scores(training_data, movieids_to_rate, current_user_ratings, current_user_id)
    
    ratings_list = (np.add(np.add(np.add(ratings_cosine, ratings_pearson), ratings_items), ratings_case_iuf)/4)
    ratings_list= [int(np.rint(x)) for x in ratings_list]
    return ratings_list

In [26]:
def custom_scores5(training_data, movieids_to_rate, current_user_ratings, current_user_id):
    ratings_cosine = cosine_scores(training_data, movieids_to_rate, current_user_ratings, current_user_id)
    ratings_pearson = pearson_scores(training_data, movieids_to_rate, current_user_ratings, current_user_id)
    ratings_items = item_adj_cosine_sim_adj(training_data, movieids_to_rate, current_user_ratings, current_user_id)
    ratings_avg = custom_scores(training_data, movieids_to_rate, current_user_ratings, current_user_id)
    
    ratings_list = (np.add(ratings_avg, np.add(ratings_items, np.add(ratings_cosine, ratings_pearson)))/4)
    ratings_list= [int(np.rint(x)) for x in ratings_list]
    return ratings_list

Extra: just use each user's average rating as all of their predictions. This did surprisingly well, with an MAE of 0.82.

In [27]:
def avg_scores(training_data, movieids_to_rate, current_user_ratings, current_user_id):
    
    #print("Current user ratings: " + str(current_user_ratings))
    #np.trim_zeros(current_user_ratings) only strips leading and trailing zeros, so it is not used here
    current_user_ratings_wo_zeros = [rating for rating in current_user_ratings if rating >0]
    #print("Current user ratings wo zeros: " + str(current_user_ratings_wo_zeros))
    current_user_rating_avg = int(np.rint(np.mean(current_user_ratings_wo_zeros)))
    
    
    #shouldn't be possible to go below 1 or above 5, but just in case:
    if current_user_rating_avg>5:
        current_user_rating_avg=5
    if current_user_rating_avg<1:
        current_user_rating_avg=1
    
    ratings_list = []
    
    for m in movieids_to_rate:
        ratings_list.append(current_user_rating_avg)
    
    return ratings_list
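
The MAE figures quoted in this notebook are mean absolute errors between predicted and true ratings. How they were actually evaluated is not shown here, so the following is only a minimal sketch of the metric itself, with a hypothetical helper name and made-up numbers.

```python
import numpy as np

def mean_absolute_error(predicted, actual):
    """Average of |predicted - actual| over all rated (user, movie) pairs."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(predicted - actual)))

# Errors are 1, 0 and 2, so the MAE is (1 + 0 + 2) / 3 = 1.0
print(mean_absolute_error([4, 3, 5], [5, 3, 3]))
```

An MAE of 0.82 therefore means the predictions are off by a little under one star on average.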

## Step 4: Get the predicted ratings and store them into text files 

In [28]:
def process_test_dataset(t_data, test_data):
    print("called process_test_dataset")
    
    # Phase 1: put the test data into 2d array format

    # read the whole file and close it again once the rows are parsed
    with open(test_data, 'r') as test_file:
        test_ds = [[int(x) for x in row.split()] for row in test_file]
    

    # Phase 2: predict the ratings!
    
    # movie_ids_to_rate contain the movies that need to be rated for a particular user (got from test data)
    movie_ids_to_rate = []

    # current_user_ratings an array indexed by movieID-1 containing the ratings for a particular user
    # think of it as an extra row in the training dataset for the current user
    current_user_ratings = [0]*1000
    
    # subtract 1 from the current userid for indexing 
    current_userid = test_ds[0][0]-1

    # pred_ratings will contain the predicted ratings in form of [[u1,m1,r1], [u1, m2, r2]...] that will be 
    # printed in the final text file via main()
    pred_ratings = []

    for u, m, r in test_ds:
        # need to subtract 1 for indexing like for current_userid
        u -= 1
        m -= 1

        # when there's a new user, get predicted ratings for previous user and
        # empty out current_user_ratings and movie_ids_to_rate for next user
        if u != current_userid:
  
            #TO CHANGE ALGORITHM: need to change algorithm HERE 

            ratings = cosine_scores(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
            #ratings = pearson_scores(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
            #ratings = pearson_iuf_scores(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
            #ratings = pearson_caseamp_scores(t_data,movie_ids_to_rate, current_user_ratings, current_userid)
            #ratings = pearson_iuf_caseamp_scores(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
            #ratings = item_adj_cosine_sim(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
            #ratings = custom_scores(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
            #ratings = cosine_scores_knn(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
            #ratings = item_adj_cosine_sim_adj(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
            #ratings = custom_scores2(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
            #ratings = custom_scores3(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
            #ratings = custom_scores4(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
            #ratings = custom_scores5(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
            
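            # use distinct loop names here: reusing m and r would overwrite the outer loop's
            # current row, so the new user's first test row would be recorded incorrectly below
            for rated_movie, pred in zip(movie_ids_to_rate, ratings):
                
                #add back the proper userid and movieid numbers (no more indexing)
                current_userid_add_back = current_userid+1
                movie_add_back = rated_movie +1
                pred_ratings.append([current_userid_add_back, movie_add_back, pred])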
            
            current_userid = u  # reassign current userid
            current_user_ratings = [0]*1000 # empty the ratings and movieids to rate
            movie_ids_to_rate = []

        # if rating is not 0, insert movie into the current_user_ratings array in the proper index
        # like an extra row in the training data
        if r != 0:
            current_user_ratings[m] = r

        # if the rating is 0, found a movie that needs a predicted rating, add to list of movies to rate
        else:
            movie_ids_to_rate.append(m)

    # get ratings for the last user
    
    #TO CHANGE ALGORITHM: need to change algorithm HERE TOO, so the last user is scored the same way as the others
    
    ratings = cosine_scores(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
    #ratings = pearson_scores(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
    #ratings = pearson_iuf_scores(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
    #ratings = pearson_caseamp_scores(t_data,movie_ids_to_rate, current_user_ratings, current_userid)
    #ratings = pearson_iuf_caseamp_scores(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
    #ratings = item_adj_cosine_sim(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
    #ratings = custom_scores(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
    #ratings = cosine_scores_knn(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
    #ratings = item_adj_cosine_sim_adj(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
    #ratings = custom_scores2(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
    #ratings = custom_scores3(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
    #ratings = custom_scores4(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
    #ratings = custom_scores5(t_data, movie_ids_to_rate, current_user_ratings, current_userid)
    
    #Phase 3: Format the last user's predicted ratings into the format for the result files
    for m, r in zip(movie_ids_to_rate, ratings):
        
        #add back the proper userid and movieid numbers (no more indexing) and then add to final list of results
        current_userid_add_back = current_userid+1
        movie_add_back = m +1
        pred_ratings.append([current_userid_add_back, movie_add_back, r])

    return pred_ratings
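
Because the scoring call appears twice in `process_test_dataset` (once inside the loop and once for the last user), switching algorithms means editing two places. One possible alternative, sketched below under stated assumptions, is to pass the scorer in as an argument and group the test rows by user with `itertools.groupby`. The names `predict_all`, `average_scores`, and `toy_rows` are hypothetical and not part of the notebook; the sketch also assumes each user's rows are stored contiguously in the test file.

```python
from itertools import groupby

def predict_all(t_data, test_rows, score_fn, n_movies=1000):
    """Group test rows by user and score each user's unrated movies.

    test_rows: [userid, movieid, rating] rows with 1-based ids, with all rows
               for one user stored next to each other (an assumption).
    score_fn:  any function called as
               score_fn(t_data, movie_ids_to_rate, current_user_ratings, userid)
    """
    pred_ratings = []
    for user_id, rows in groupby(test_rows, key=lambda row: row[0]):
        current_user_ratings = [0] * n_movies
        movie_ids_to_rate = []
        for _, movie_id, rating in rows:
            if rating != 0:
                # known rating: store it at the 0-based movie index
                current_user_ratings[movie_id - 1] = rating
            else:
                # rating of 0 marks a movie that needs a prediction
                movie_ids_to_rate.append(movie_id - 1)
        ratings = score_fn(t_data, movie_ids_to_rate, current_user_ratings, user_id - 1)
        for movie_idx, pred in zip(movie_ids_to_rate, ratings):
            # convert back to 1-based ids for the result file
            pred_ratings.append([user_id, movie_idx + 1, pred])
    return pred_ratings


# hypothetical stand-in scorer so the sketch runs on its own:
# every unrated movie gets the user's rounded average (3 if nothing is rated, an assumption)
def average_scores(t_data, movie_ids_to_rate, current_user_ratings, userid):
    rated = [r for r in current_user_ratings if r > 0]
    avg = int(round(sum(rated) / len(rated))) if rated else 3
    return [avg] * len(movie_ids_to_rate)


# made-up test rows for two users
toy_rows = [[201, 1, 4], [201, 2, 5], [201, 3, 5], [201, 7, 0],
            [202, 3, 2], [202, 9, 0], [202, 10, 0]]
result = predict_all([], toy_rows, average_scores, n_movies=10)
print(result)
```

With this shape, the last user needs no special handling, and changing the algorithm means passing a different function such as `cosine_scores` or `pearson_scores`.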


## Step 5: Call the main function that does everything above
### Also process the initial training dataset and put it into 2D array format

In [29]:
def main():
    #process training dataset 
    #t_data is the 2d array of the training dataset (one row per user, one column per movie)
    with open('train.txt', 'r') as train_file:
        t_data = [[int(x) for x in row.split()] for row in train_file]

    #get the 3 text files for the results
    
    ans5 = process_test_dataset(t_data, 'test5.txt')
    np.savetxt('cosineresult5.txt', ans5, fmt = '%d %d %d')
    #np.savetxt('cosineknnresult5.txt', ans5, fmt = '%d %d %d')
    #np.savetxt('pearsonresult5.txt', ans5, fmt = '%d %d %d')
    #np.savetxt('pearsoniufresult5.txt', ans5, fmt = '%d %d %d')
    #np.savetxt('pearsoncaseampresult5.txt', ans5, fmt = '%d %d %d')
    #np.savetxt('pearsoncaseampiufresult5.txt', ans5, fmt = '%d %d %d')
    #np.savetxt('itembasedresult5.txt', ans5, fmt = '%d %d %d')
    #np.savetxt('customresult5.txt', ans5, fmt = '%d %d %d')
    #np.savetxt('custom2result5.txt', ans5, fmt= '%d %d %d')
    #np.savetxt('custom3result5.txt', ans5, fmt= '%d %d %d')
    #np.savetxt('custom4result5.txt', ans5, fmt= '%d %d %d')
    #np.savetxt('custom5result5.txt', ans5, fmt= '%d %d %d')
    

    ans10 = process_test_dataset(t_data, 'test10.txt')
    np.savetxt('cosineresult10.txt', ans10, fmt = '%d %d %d')
    #np.savetxt('cosineknnresult10.txt', ans10, fmt = '%d %d %d')
    #np.savetxt('pearsonresult10.txt', ans10, fmt = '%d %d %d')
    #np.savetxt('pearsoniufresult10.txt', ans10, fmt = '%d %d %d')
    #np.savetxt('pearsoncaseampresult10.txt', ans10, fmt = '%d %d %d')
    #np.savetxt('pearsoncaseampiufresult10.txt', ans10, fmt = '%d %d %d')
    #np.savetxt('itembasedresult10.txt', ans10, fmt = '%d %d %d')
    #np.savetxt('customresult10.txt', ans10, fmt = '%d %d %d')
    #np.savetxt('custom2result10.txt', ans10, fmt= '%d %d %d')
    #np.savetxt('custom3result10.txt', ans10, fmt= '%d %d %d')
    #np.savetxt('custom4result10.txt', ans10, fmt= '%d %d %d')
    #np.savetxt('custom5result10.txt', ans10, fmt= '%d %d %d')

    ans20 = process_test_dataset(t_data, 'test20.txt')
    np.savetxt('cosineresult20.txt', ans20, fmt = '%d %d %d')
    #np.savetxt('cosineknnresult20.txt', ans20, fmt = '%d %d %d')
    #np.savetxt('pearsonresult20.txt', ans20, fmt = '%d %d %d')
    #np.savetxt('pearsoniufresult20.txt', ans20, fmt = '%d %d %d')
    #np.savetxt('pearsoncaseampresult20.txt', ans20, fmt = '%d %d %d')
    #np.savetxt('pearsoncaseampiufresult20.txt', ans20, fmt = '%d %d %d')
    #np.savetxt('itembasedresult20.txt', ans20, fmt = '%d %d %d')
    #np.savetxt('customresult20.txt', ans20, fmt = '%d %d %d')
    #np.savetxt('custom2result20.txt', ans20, fmt= '%d %d %d')
    #np.savetxt('custom3result20.txt', ans20, fmt= '%d %d %d')
    #np.savetxt('custom4result20.txt', ans20, fmt= '%d %d %d')
    #np.savetxt('custom5result20.txt', ans20, fmt= '%d %d %d')
        
main() 

called process_test_dataset
called process_test_dataset
called process_test_dataset
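
The three blocks in `main()` differ only in the test-set size and the output file name, so they could be collapsed into a loop. The sketch below is one way to do that; `write_results` and `fake_predict` are hypothetical names, and `fake_predict` only stands in for `process_test_dataset` so the example runs without the data files.

```python
import os
import tempfile
import numpy as np

def write_results(predict, algo_name, sizes=(5, 10, 20), out_dir="."):
    """Run predict on each test file and save '<algo_name>result<N>.txt'."""
    written = []
    for n in sizes:
        preds = predict(f"test{n}.txt")
        path = os.path.join(out_dir, f"{algo_name}result{n}.txt")
        # same row format as the notebook: "userid movieid rating"
        np.savetxt(path, preds, fmt="%d %d %d")
        written.append(path)
    return written

# hypothetical stand-in for process_test_dataset, returning made-up predictions
def fake_predict(test_file):
    return [[201, 7, 4], [201, 9, 3]]

out_dir = tempfile.mkdtemp()
paths = write_results(fake_predict, "cosine", out_dir=out_dir)
names = [os.path.basename(p) for p in paths]
print(names)
```

In the notebook this might be called as `write_results(lambda f: process_test_dataset(t_data, f), 'cosine')`, so the algorithm name is written once instead of in three `np.savetxt` lines.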
