# User-User Collaborative Filtering

In this project I use user-user collaborative filtering to recommend movies.

Steps:
    1. Import the data and do the initial preprocessing.
    2. From the extracted data, build a (user, movie) rating dataframe, data_Rating.
    3. Check the sparsity of the rating matrix.
    4. train_test_split: split the data into training and test sets by removing 10 ratings per user from the training
       set and placing them in the test set.
    5. Create a user-similarity or item-similarity matrix using centered cosine similarity, also known as Pearson correlation.
       (Note: filling the missing entries of the rating matrix with 0 is not a good idea, because a 0 reads as a very low
       rating for the movie; centered cosine normalizes the ratings so that a missing entry sits at the user's average.)
    6. With the similarity matrix in hand, predict the ratings that were held out of the training data.
       These predictions can then be compared with the test data to validate the quality of the
       recommender model.
    7. For user-based collaborative filtering, predict user u's rating for item i as the weighted sum
       of all other users' ratings for item i, where each weight is the cosine similarity between that user and the
       input user u.
    8. Use scikit-learn's mean squared error function as the validation metric.


    Model Improvement:
        1. Top-k collaborative filtering: try to improve the prediction MSE by considering only the top k most similar users.

        2. Bias-subtracted collaborative filtering: certain users may tend to always give high or low ratings to all movies. One could imagine that the relative difference in the ratings these users give is more important than the absolute rating values.

        Try subtracting each user's average rating when summing over similar users' ratings, and
        then add that average back in at the end.
        
        

In [1]:
import numpy as np
import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection is its replacement.
from sklearn.model_selection import train_test_split



In [2]:
data_Rating = pd.read_csv("Data/ratings.csv")

In [3]:
data_Rating.head(5)

,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [4]:
n_users = data_Rating.userId.unique().shape[0]
n_items = data_Rating.movieId.unique().shape[0]
print (str(n_users) + ' users')
print (str(n_items) + ' items')

671 users
9066 items


### We can easily map user/item IDs to user/item indices

In [5]:
ratings =  data_Rating.pivot(index='movieId',columns='userId',values='rating')    

In [6]:
ratings.isnull().values.any()

True

In [7]:
ratings.shape

(9066, 671)

In [8]:
ratings[ratings.isnull()] = 0

In [9]:
ratings.head(5)

userId,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
movieId
1,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,4.0,0.0,...,0.0,4.0,3.5,0.0,0.0,0.0,0.0,0.0,4.0,5.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() returns the same array.
ratings = ratings.to_numpy()



In [11]:
ratings.shape

(9066, 671)

In [12]:
ratings =  ratings.T

In [13]:
ratings.shape

(671, 9066)
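
Once the pivot table is converted to a NumPy array, row and column positions no longer carry the original `userId` and `movieId` labels. A minimal sketch of keeping lookup tables in both directions follows. The `toy` frame and the names `user_index` and `movie_index` are hypothetical and only reuse the column names of `ratings.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same columns as ratings.csv.
toy = pd.DataFrame({
    "userId":  [1, 1, 7, 7, 9],
    "movieId": [31, 1029, 31, 50, 1029],
    "rating":  [2.5, 3.0, 4.0, 5.0, 1.0],
})

# Users as rows, movies as columns; unrated cells become 0.
pivot = toy.pivot(index="userId", columns="movieId", values="rating").fillna(0)
matrix = pivot.to_numpy()

# Position -> ID lookups come straight from the pivot's labels.
user_ids = pivot.index.to_numpy()
movie_ids = pivot.columns.to_numpy()

# ID -> position lookups for going the other way.
user_index = {int(uid): row for row, uid in enumerate(user_ids)}
movie_index = {int(mid): col for col, mid in enumerate(movie_ids)}

# Rating that user 7 gave to movie 50.
print(matrix[user_index[7], movie_index[50]])
```

Keeping `user_ids` and `movie_ids` around makes it possible to translate a predicted column index back into a `movieId` when producing recommendations.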

In [14]:
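# Percentage of user-item cells that actually hold a rating (strictly speaking, the density of the matrix).
sparsity = float(len(ratings.nonzero()[0]))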
sparsity /= (ratings.shape[0] * ratings.shape[1])
sparsity *= 100
print ('Sparsity: {:4.2f} %'.format(sparsity))

Sparsity: 1.64 %


In [15]:
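def train_test_split(ratings):
    """Move 10 randomly chosen ratings per user from the training set into the test set.

    Assumes every user has at least 10 ratings; np.random.choice raises a ValueError otherwise.
    """
    test = np.zeros(ratings.shape)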
    train = ratings.copy()
    for user in range(ratings.shape[0]):
        test_ratings = np.random.choice(ratings[user, :].nonzero()[0], 
                                        size=10, 
                                        replace=False)
        train[user, test_ratings] = 0.
        test[user, test_ratings] = ratings[user, test_ratings]
        
    # Test and training are truly disjoint
    assert(np.all((train * test) == 0)) 
    return train, test

In [16]:
train, test = train_test_split(ratings)

In [17]:
train.shape

(671, 9066)

In [18]:
test.shape

(671, 9066)
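
The split above draws from NumPy's global random state, so each run holds out different ratings. A minimal sketch of a reproducible variant follows. The function name `split_with_seed` and the parameters `n_holdout` and `seed` are assumptions for illustration, and skipping users with too few ratings is a design choice, not something the original split does.

```python
import numpy as np

def split_with_seed(ratings, n_holdout=10, seed=0):
    """Hold out n_holdout rated items per user; users with too few ratings are left intact."""
    rng = np.random.default_rng(seed)
    train = ratings.copy()
    test = np.zeros(ratings.shape)
    for user in range(ratings.shape[0]):
        rated = ratings[user, :].nonzero()[0]
        if len(rated) <= n_holdout:
            continue  # keep at least one training rating for this user
        held_out = rng.choice(rated, size=n_holdout, replace=False)
        train[user, held_out] = 0.
        test[user, held_out] = ratings[user, held_out]
    # Train and test must not share any rated cell.
    assert np.all((train * test) == 0)
    return train, test

# Tiny demo matrix: 3 users x 6 items, 0 means "not rated".
demo = np.array([[5., 3., 0., 1., 4., 2.],
                 [4., 0., 0., 1., 0., 0.],
                 [1., 1., 5., 4., 3., 2.]])
demo_train, demo_test = split_with_seed(demo, n_holdout=2, seed=42)
```

Fixing the seed makes MSE values comparable across runs, which matters when tuning k later on.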

In [19]:
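def slow_similarity(ratings, kind='user'):
    """Pairwise cosine similarity between users (rows) or items (columns).

    This is plain (uncentered) cosine on the zero-filled rating matrix.
    """
    if kind == 'item':
        # Work on the transpose so that rows are always the entities being compared.
        ratings = ratings.T
    elif kind != 'user':
        raise ValueError("kind must be 'user' or 'item'")
    n, m = ratings.shape
    sim = np.zeros((n, n))
    for u in range(n):
        for uprime in range(n):
            rui_sqrd = 0.
            ruprimei_sqrd = 0.
            for i in range(m):
                # Accumulate the dot product (+=, not =) and both squared norms.
                sim[u, uprime] += ratings[u, i] * ratings[uprime, i]
                rui_sqrd += ratings[u, i] ** 2
                ruprimei_sqrd += ratings[uprime, i] ** 2
            # Cosine divides by the product of the two norms, i.e. the square root
            # of the product of the squared norms. Empty rows keep a similarity of 0.
            denom = np.sqrt(rui_sqrd * ruprimei_sqrd)
            if denom > 0:
                sim[u, uprime] /= denom

        if (u % 10) == 0:
            print(u)
    return sim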


In [20]:
user_Similarity_Matrix = slow_similarity(train,kind='user')

0
10
20
...
660
670


In [22]:
np.savetxt("User_Similarity.csv", user_Similarity_Matrix, delimiter=",")

In [23]:
user_Similarity_Matrix.shape

(671, 671)
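
The triple loop is easy to follow but slow, because it visits every (user, user, item) combination in pure Python. As a sketch, the same cosine similarity can be computed with a single matrix product. The name `fast_similarity` and the `epsilon` guard against division by zero are assumptions for illustration.

```python
import numpy as np

def fast_similarity(ratings, kind='user', epsilon=1e-9):
    """Cosine similarity between rows ('user') or columns ('item') of a zero-filled rating matrix."""
    if kind == 'user':
        sim = ratings.dot(ratings.T) + epsilon
    elif kind == 'item':
        sim = ratings.T.dot(ratings) + epsilon
    else:
        raise ValueError("kind must be 'user' or 'item'")
    # The diagonal holds each squared norm, so its square root is the norm.
    norms = np.array([np.sqrt(np.diagonal(sim))])
    # Divide every entry (a, b) by norm[a] * norm[b].
    return sim / norms / norms.T

# Users 0 and 1 rate identically; user 2 shares no rated item with them.
demo = np.array([[5., 0., 3.],
                 [5., 0., 3.],
                 [0., 2., 0.]])
demo_sim = fast_similarity(demo)
```

On the notebook's data the call would be `fast_similarity(train, kind='user')`, which should give a `(671, 671)` matrix like the loop version.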

In [None]:
#item_Similarity_Matrix = slow_similarity(ratings,kind='item')

In [None]:
#item_Similarity_Matrix.shape
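
The steps at the top mention centered cosine (Pearson) similarity, which avoids treating a missing rating as a rating of 0. A minimal sketch of one common way to do this: subtract each user's mean over the movies they actually rated, leave unrated cells at 0, and take the cosine of the centered rows. The function name `centered_cosine_similarity` and the `epsilon` guard are assumptions for illustration.

```python
import numpy as np

def centered_cosine_similarity(ratings, epsilon=1e-9):
    """User-user centered cosine: subtract each user's mean over rated items, then take cosine."""
    rated = ratings != 0
    counts = np.maximum(rated.sum(axis=1), 1)   # avoid dividing by zero for empty users
    user_mean = ratings.sum(axis=1) / counts
    # Center only the observed ratings; unrated cells stay at 0, i.e. at the user's average.
    centered = np.where(rated, ratings - user_mean[:, np.newaxis], 0.)
    sim = centered.dot(centered.T)
    norms = np.sqrt(np.diagonal(sim)) + epsilon
    return sim / norms[:, np.newaxis] / norms[np.newaxis, :]

# Users 0 and 1 have similar tastes; user 2 likes what they dislike.
demo = np.array([[5., 4., 0., 1.],
                 [4., 5., 0., 2.],
                 [1., 0., 5., 4.]])
demo_sim = centered_cosine_similarity(demo)
```

Unlike plain cosine on zero-filled data, which is never negative for non-negative ratings, the centered version can return negative similarities for users with opposite tastes.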

With the similarity matrix in hand, we can now predict the ratings that were held out of the training data.
We can then compare these predictions with the test data to validate the quality of
the recommender model.

For user-based collaborative filtering, we predict user u's rating for item i as the weighted sum of all
other users' ratings for item i, where each weight is the cosine similarity between that user and the input user u.
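
Written as a formula, the normalized weighted sum reads as follows. The notation is an assumption introduced here: $\hat{r}_{ui}$ is the predicted rating, $r_{u'i}$ is the rating user $u'$ gave item $i$, and $\mathrm{sim}(u, u')$ is the similarity between the two users. Dividing by the sum of absolute similarities keeps the prediction on the same scale as the ratings.

```latex
\hat{r}_{ui} = \frac{\sum_{u'} \mathrm{sim}(u, u') \, r_{u'i}}{\sum_{u'} \lvert \mathrm{sim}(u, u') \rvert}
```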

In [36]:
def predict_slow_simple(ratings, similarity, kind='user'):
    pred = np.zeros(ratings.shape)
    if kind == 'user':
        for i in range(ratings.shape[0]):
            for j in range(ratings.shape[1]):
                pred[i, j] = similarity[i, :].dot(ratings[:, j])/np.sum(np.abs(similarity[i, :]))
            if((i%50) == 0):
                print (i)            
        return pred
    elif kind == 'item':
        for i in range(ratings.shape[0]):
            for j in range(ratings.shape[1]):
                pred[i, j] = similarity[j, :].dot(ratings[i, :].T)\
                             /np.sum(np.abs(similarity[j, :]))

    return pred
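
The nested loops above compute one dot product per (user, item) cell. As a sketch, the same prediction can be written as one matrix product followed by a row-wise normalization. The name `predict_fast_simple` is an assumption for illustration.

```python
import numpy as np

def predict_fast_simple(ratings, similarity, kind='user'):
    """Similarity-weighted average of ratings, computed with one matrix product."""
    if kind == 'user':
        # Row u of `similarity` weights every user's ratings; normalize by the row's |sim| sum.
        return similarity.dot(ratings) / np.abs(similarity).sum(axis=1)[:, np.newaxis]
    elif kind == 'item':
        # Row j of `similarity` weights every item's ratings for a given user.
        return ratings.dot(similarity.T) / np.abs(similarity).sum(axis=1)[np.newaxis, :]
    raise ValueError("kind must be 'user' or 'item'")

# Small random demo: 4 users x 5 items with a symmetric, strictly positive similarity.
rng = np.random.default_rng(0)
demo_ratings = rng.integers(0, 6, size=(4, 5)).astype(float)
demo_sim = rng.random((4, 4)) + 0.1
demo_sim = (demo_sim + demo_sim.T) / 2
demo_pred = predict_fast_simple(demo_ratings, demo_sim)
```

A row of `similarity` that sums to zero still produces NaN here, just as in the loop version, so the NaN checks that follow remain relevant.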

In [25]:
train.shape

(671, 9066)

In [34]:
# True if the training matrix contains any NaN.
np.isnan(train).any()

False

In [35]:
# True if the similarity matrix contains any NaN.
np.isnan(user_Similarity_Matrix).any()

False

In [37]:
predict = predict_slow_simple(train, user_Similarity_Matrix, kind='user')

  


0
50
100
150
200
250
300
350
400
450
500
550
600
650


In [39]:
# True if any prediction is NaN (this happens when a user's similarities sum to zero).
np.isnan(predict).any()

True

In [43]:
# Note: this is an alias, not a copy, so `predict` itself is modified in place below.
nanreplaced_Predict = predict

In [44]:
nanreplaced_Predict[np.isnan(nanreplaced_Predict) == True] = 0

In [45]:
np.isnan(nanreplaced_Predict).any()

False

In [40]:
from sklearn.metrics import mean_squared_error

def get_mse(pred, actual):
    # Compare only the entries where a test rating exists (nonzero in `actual`).
    pred = pred[actual.nonzero()].flatten()
    actual = actual[actual.nonzero()].flatten()
    return mean_squared_error(pred, actual)

In [46]:
print ('User-based CF MSE: ' + str(get_mse(predict, test)))

User-based CF MSE: 14.525745156482861


## Method to improve the existing model

Let's see the result when only the top k most similar users are considered.

In [47]:
def predict_topk(ratings, similarity, kind='user', k=40):
    pred = np.zeros(ratings.shape)
    if kind == 'user':
        for i in range(ratings.shape[0]):
            # Indices of the k users most similar to user i. Use a plain index array:
            # wrapping it in a list is no longer accepted as an index by recent NumPy.
            top_k_users = np.argsort(similarity[:, i])[:-k-1:-1]
            for j in range(ratings.shape[1]):
                pred[i, j] = similarity[i, :][top_k_users].dot(ratings[:, j][top_k_users])
                pred[i, j] /= np.sum(np.abs(similarity[i, :][top_k_users]))
    if kind == 'item':
        for j in range(ratings.shape[1]):
            # Indices of the k items most similar to item j.
            top_k_items = np.argsort(similarity[:, j])[:-k-1:-1]
            for i in range(ratings.shape[0]):
                pred[i, j] = similarity[j, :][top_k_items].dot(ratings[i, :][top_k_items].T)
                pred[i, j] /= np.sum(np.abs(similarity[j, :][top_k_items]))

    return pred

## k is a hyperparameter; the best value of k is the one that gives the lowest MSE
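
A minimal sketch of such a search: compute the test MSE for several candidate values of k and keep the smallest. The helper names `predict_topk_users`, `masked_mse` and `sweep_k` are assumptions for illustration, and the random demo data stands in for the notebook's `train`, `test` and similarity matrix.

```python
import numpy as np

def predict_topk_users(ratings, similarity, k=40):
    """User-based prediction that keeps only each user's k largest similarities."""
    k = min(k, similarity.shape[0])
    pruned = np.zeros_like(similarity)
    for u in range(similarity.shape[0]):
        top = np.argsort(similarity[u, :])[:-k-1:-1]   # indices of the k largest entries
        pruned[u, top] = similarity[u, top]
    denom = np.abs(pruned).sum(axis=1)[:, np.newaxis]
    denom[denom == 0] = 1.                              # rows with no neighbours predict 0
    return pruned.dot(ratings) / denom

def masked_mse(pred, actual):
    """MSE over the entries where `actual` holds a rating (nonzero)."""
    mask = actual != 0
    return np.mean((pred[mask] - actual[mask]) ** 2)

def sweep_k(train, test, similarity, k_values):
    """Return {k: test MSE} for each candidate k."""
    return {k: masked_mse(predict_topk_users(train, similarity, k), test)
            for k in k_values}

# Random demo data: 6 users x 8 items.
rng = np.random.default_rng(0)
demo_train = rng.integers(0, 6, size=(6, 8)).astype(float)
demo_test = rng.integers(0, 6, size=(6, 8)).astype(float)
demo_sim = rng.random((6, 6))
demo_sim = (demo_sim + demo_sim.T) / 2
results = sweep_k(demo_train, demo_test, demo_sim, k_values=[1, 3, 6])
best_k = min(results, key=results.get)
```

On the notebook's data the call could look like `sweep_k(train, test, user_Similarity_Matrix, [5, 15, 30, 50, 100])`. Choosing k on the test set gives an optimistic estimate, so a separate validation split is the safer choice when the final number matters.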

In [56]:
pred_TOP_k = predict_topk(train, user_Similarity_Matrix, kind='user', k=20)



In [57]:
np.isnan(pred_TOP_k).any()

True

In [58]:
pred_TOP_k[np.isnan(pred_TOP_k) == True] = 0

In [59]:
print ('Top-k User-based CF MSE: ' + str(get_mse(pred_TOP_k, test)))

Top-k User-based CF MSE: 14.525745156482861


### As with user-user collaborative filtering, item-item collaborative filtering can also be implemented to find similar items and recommend them to the user

### To further improve the model:
- Bias-subtracted Collaborative Filtering

We will try removing biases associated with either the user or the item. The idea here is that certain users may tend to always give high or low ratings to all movies. One could imagine that the relative difference in the ratings that these users give is more important than the absolute rating values.

Let us try subtracting each user's average rating when summing over similar users' ratings,
and then add that average back in at the end.
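
As a formula, the bias-subtracted prediction for user $u$ and item $i$ can be sketched as below. The notation is an assumption introduced here: $\bar{r}_u$ is user $u$'s average rating, $r_{u'i}$ is the rating user $u'$ gave item $i$, and $\mathrm{sim}(u, u')$ is the similarity between the two users.

```latex
\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{u'} \mathrm{sim}(u, u') \, \left( r_{u'i} - \bar{r}_{u'} \right)}{\sum_{u'} \lvert \mathrm{sim}(u, u') \rvert}
```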

In [2]:
def predict_nobias(ratings, similarity, kind='user'):
    if kind == 'user':
        # Per-user mean over all columns (unrated zeros included), used as that user's bias.
        user_bias = ratings.mean(axis=1)
        ratings = (ratings - user_bias[:, np.newaxis]).copy()
        pred = similarity.dot(ratings) / np.array([np.abs(similarity).sum(axis=1)]).T
        pred += user_bias[:, np.newaxis]
    elif kind == 'item':
        item_bias = ratings.mean(axis=0)
        ratings = (ratings - item_bias[np.newaxis, :]).copy()
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])
        pred += item_bias[np.newaxis, :]
        
    return pred

In [None]:
user_pred = predict_nobias(train, user_Similarity_Matrix, kind='user')
print ('Bias-subtracted User-based CF MSE: ' + str(get_mse(user_pred, test)))

### Top-k and bias subtraction can be combined to check whether the MSE improves further
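
A minimal sketch of that combination for the user-based case: center the ratings by each user's mean, take the similarity-weighted average over the k most similar users only, then add the mean back. The function name `predict_topk_nobias` is an assumption for illustration, and the user mean is taken over all columns (zeros included) so that it matches `predict_nobias`.

```python
import numpy as np

def predict_topk_nobias(ratings, similarity, k=40):
    """User-based top-k prediction on mean-centered ratings, with the user mean added back."""
    # Mean over all columns, zeros included, to mirror predict_nobias.
    user_bias = ratings.mean(axis=1)
    centered = ratings - user_bias[:, np.newaxis]
    k = min(k, similarity.shape[0])
    pred = np.zeros(ratings.shape)
    for u in range(ratings.shape[0]):
        top = np.argsort(similarity[u, :])[:-k-1:-1]   # k most similar users
        weights = similarity[u, top]
        denom = np.abs(weights).sum()
        if denom > 0:                                   # otherwise fall back to the user's mean
            pred[u, :] = weights.dot(centered[top, :]) / denom
    return pred + user_bias[:, np.newaxis]

# Random demo data: 5 users x 7 items with a symmetric similarity matrix.
rng = np.random.default_rng(1)
demo_ratings = rng.integers(0, 6, size=(5, 7)).astype(float)
demo_sim = rng.random((5, 5))
demo_sim = (demo_sim + demo_sim.T) / 2
demo_pred = predict_topk_nobias(demo_ratings, demo_sim, k=3)
```

On the notebook's data, the result of `predict_topk_nobias(train, user_Similarity_Matrix, k=20)` could be passed to `get_mse` together with `test`, in the same way as the other predictions.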