# Recommendation system

In this notebook, I will build a recommender system.

I will first build user-user and item-item based collaborative filtering, and later try model-based collaborative filtering.

First, import the libraries.

In [212]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.metrics import mean_squared_error
import math

Reading the dataset and setting the column names. The test-set line is commented out, so only one file is loaded.

In [213]:
r_cols = ['userId', 'questionId', 'rating', 'department']
# read_excel does not take the sep/encoding arguments used by read_csv, so they are dropped here
ratings = pd.read_excel('C:/Users/Daniel Eje/Downloads/book.xlsx', names=r_cols)
#ratings_test = pd.read_csv('C:/Users/User/Downloads/ua.test', sep='\t', names=r_cols, encoding='latin-1')

In [249]:
ratings.head()

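|   | userId | questionId | rating | department |
|---|--------|------------|--------|------------|
| 0 | 1 | 1 | 5 |  |
| 1 | 1 | 2 | 5 |  |
| 2 | 1 | 3 | 0 |  |
| 3 | 1 | 4 | 5 |  |
| 4 | 1 | 5 | 0 |  |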


The column `userId` contains user IDs starting from 1, the column `questionId` contains question IDs starting from 1, and the `rating` column contains the corresponding ratings. Let us see how many unique users and unique questions there are. The code below takes the largest ID as the count, which assumes the IDs are consecutive.

In [250]:
n_users = ratings['userId'].unique().max()
n_questions = ratings['questionId'].unique().max()
n_users,n_questions

(100, 16)

There are 100 users and 16 questions in the training set.

In [251]:
#n_users_test = ratings_test['userId'].unique().max()
#n_items_test = ratings_test['questionId'].unique().max()
#n_users_test,n_items_test

### Creating the user-item matrix

Now let us create the user-item matrix, `train_matrix`, which has one row per unique user and one column per unique question. Each cell holds the rating given to a user's response to a question. If a user's response has not been rated, the cell is filled with 0. The corresponding `test_matrix` code is commented out because no test set is loaded.

In [252]:
train_matrix = np.zeros((n_users, n_questions))
for line in ratings.itertuples():
    train_matrix[line[1]-1,line[2]-1] = line[3]   
#test_matrix = np.zeros((n_users_test, n_items_test))
#for line in ratings_test.itertuples():
#    test_matrix[line[1]-1,line[2]-1] = line[3]
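The same kind of matrix can also be built without an explicit loop. The following is a minimal sketch on hypothetical toy data (the `toy` frame and `toy_matrix` name are illustrative, not part of the notebook), using `pivot_table` with the notebook's column names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same column names as the notebook.
toy = pd.DataFrame({
    'userId':     [1, 1, 2, 3],
    'questionId': [1, 2, 2, 3],
    'rating':     [5, 3, 4, 2],
})

# Rows are users, columns are questions; user/question pairs with no rating become 0.
toy_matrix = (
    toy.pivot_table(index='userId', columns='questionId',
                    values='rating', fill_value=0)
       .to_numpy(dtype=float)
)
print(toy_matrix)
```

Unlike the loop above, `pivot_table` only creates rows and columns for IDs that actually occur, so the two approaches agree only when the IDs are consecutive.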

### Trying user-user based collaborative filtering

The first approach we try is user-user based collaborative filtering. In this method, we first create a similarity matrix which specifies the similarity between two users based on the ratings they have been given for different questions. We use the cosine metric, which computes the normalised dot product between the two users' rating vectors. Note that `pairwise_distances` with `metric='cosine'` returns the cosine distance (1 minus the cosine similarity), so the diagonal of the output below is 0 and smaller values indicate more similar users.

In [253]:
user_similarity = pairwise_distances(train_matrix, metric='cosine')
print('shape: ',user_similarity.shape)
user_similarity

shape:  (100, 100)


array([[0.        , 0.11220648, 0.04396246, ..., 0.06179228, 0.14133588,
        0.15326384],
       [0.11220648, 0.        , 0.0844352 , ..., 0.06280791, 0.22507911,
        0.20006567],
       [0.04396246, 0.0844352 , 0.        , ..., 0.08003917, 0.13492629,
        0.1153486 ],
       ...,
       [0.06179228, 0.06280791, 0.08003917, ..., 0.        , 0.15924232,
        0.19011317],
       [0.14133588, 0.22507911, 0.13492629, ..., 0.15924232, 0.        ,
        0.07722164],
       [0.15326384, 0.20006567, 0.1153486 , ..., 0.19011317, 0.07722164,
        0.        ]])
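To make the distance/similarity distinction concrete, here is a small sketch on a hypothetical 3 x 3 matrix (`toy`, `dist` and `sim` are illustrative names). It checks that one minus the cosine distance matches scikit-learn's `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances, cosine_similarity

# Hypothetical ratings: users 0 and 1 are identical, user 2 is orthogonal to both.
toy = np.array([[5.0, 0.0, 5.0],
                [5.0, 0.0, 5.0],
                [0.0, 5.0, 0.0]])

dist = pairwise_distances(toy, metric='cosine')  # cosine distance, 0 on the diagonal
sim = 1.0 - dist                                 # cosine similarity, 1 on the diagonal

print(np.allclose(sim, cosine_similarity(toy)))
```

If a true similarity matrix is wanted for the prediction functions, converting with `1 - distance` in this way is one option.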

The similarity matrix has shape 100 x 100 as expected, with each cell corresponding to a pair of users. Now we will write a prediction function which predicts the values in the user-item (question) matrix. We will only consider the top n users which are similar to a user to make predictions for that user. In the formula we normalise the ratings by subtracting each user's mean rating from every rating given to that user's questions.

\begin{equation*}
\hat{x}_{k,m} = \bar{x}_{k} + \frac{\sum\limits_{u_a} sim_u(u_k, u_a) (x_{a,m} - \bar{x}_{a})}{\sum\limits_{u_a}|sim_u(u_k, u_a)|}
\end{equation*}
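As a worked instance of this formula, the sketch below predicts one rating for a target user from two neighbours. All numbers, including the similarity weights in `sim_k`, are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical ratings: row 0 is the target user k, rows 1 and 2 are neighbours.
toy_ratings = np.array([[4.0, 2.0],
                        [5.0, 3.0],
                        [1.0, 5.0]])
sim_k = np.array([0.9, 0.1])          # assumed sim_u(u_k, u_a) for neighbours 1 and 2

means = toy_ratings.mean(axis=1)      # per-user mean ratings: [3, 4, 3]
k, m = 0, 1                           # predict user 0's rating for question index 1
neighbours = [1, 2]

# Mean-centred neighbour ratings for question m: [3 - 4, 5 - 3] = [-1, 2]
deviations = toy_ratings[neighbours, m] - means[neighbours]

# Weighted deviation divided by the total absolute weight, plus the target user's mean.
pred = means[k] + sim_k.dot(deviations) / np.abs(sim_k).sum()
print(round(pred, 2))  # 2.3
```

The highly weighted neighbour rated the question below their own average, so the prediction lands below user k's mean of 3.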

In [254]:
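def predict_user_user(train_matrix, user_similarity, n_similar=100):
    # For each user, indices of the n_similar largest entries in their row, in descending order.
    # Caveat: pairwise_distances(..., metric='cosine') returns distances, so the largest
    # entries are the least similar users; with n_similar=100 and 100 users, all are selected anyway.
    similar_n = user_similarity.argsort()[:,-n_similar:][:,::-1]
    pred = np.zeros((n_users,n_questions))
    for i,users in enumerate(similar_n):
        similar_users_indexes = users
        similarity_n = user_similarity[i,similar_users_indexes]
        matrix_n = train_matrix[similar_users_indexes,:]
        # Weighted sum of the neighbours' mean-centred ratings, divided by the total weight.
        rated_items = similarity_n[:,np.newaxis].T.dot(matrix_n - matrix_n.mean(axis=1)[:,np.newaxis])/ similarity_n.sum()
        pred[i,:]  = rated_items
    return pred

def predict_users(user_similarity):
    # For each user, collect the 1-based IDs of the 10 users with the largest entries in their row.
    sim = []
    for i in range(user_similarity.shape[0]):
        top_10_users = []
        arr = user_similarity[i]
        sorted_arr = np.sort(arr)[::-1]
        for key in range(0,10):
            search_key = sorted_arr[key]
            # np.where finds the first index holding this value, so tied values repeat the same user.
            result = np.where(arr == search_key)
            top_user = result[0][0] + 1
            top_10_users.append(top_user)
        sim.append((i+1,top_10_users))
    return sim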

`predict_user_user` returns the weighted, mean-centred part of the prediction; adding each user's average rating gives the final predicted ratings. Here `n_similar=100`, so with 100 users every user contributes to each prediction.
`predict_users` is intended to find the best-matching users for a particular user. It returns the 10 users with the largest entries in that user's row of `user_similarity`, which ranks by similarity only if the matrix holds similarities rather than distances.

In [255]:
predictions = predict_user_user(train_matrix,user_similarity, 100) + train_matrix.mean(axis=1)[:, np.newaxis]
print('predictions shape ',predictions.shape)
predictions

predictions shape  (100, 16)


array([[5.35754813, 4.97921168, 3.94432139, ..., 3.90865566, 3.85359681,
        3.74285658],
       [4.82382006, 4.437079  , 3.30464807, ..., 3.39134493, 3.39008652,
        3.24903886],
       [5.16217343, 4.74413551, 3.7426921 , ..., 3.8205868 , 3.71265995,
        3.58041481],
       ...,
       [4.90464328, 4.48148849, 3.4564079 , ..., 3.43203774, 3.45491168,
        3.33449098],
       [5.5297413 , 5.2031479 , 3.27524773, ..., 4.26968891, 3.8846415 ,
        4.02905668],
       [5.35028546, 5.02346966, 3.04844649, ..., 4.07565674, 3.82050319,
        3.83036157]])

In [256]:
#predicted_ratings = predictions[test_matrix.nonzero()]
#test_truth = test_matrix[test_matrix.nonzero()]

In [257]:
#math.sqrt(mean_squared_error(predicted_ratings,test_truth))
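The commented-out cells above compute the root mean squared error (RMSE) over the non-zero entries of a test matrix. Since no test set is loaded in this notebook, here is a minimal sketch of that evaluation step on hypothetical arrays (`toy_test` and `toy_pred` are illustrative, not notebook variables).

```python
import math
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical held-out ratings (0 means "not in the test set") and model predictions.
toy_test = np.array([[0.0, 4.0],
                     [5.0, 0.0]])
toy_pred = np.array([[3.0, 3.0],
                     [4.0, 2.0]])

# Only compare cells that have a true test rating.
mask = toy_test.nonzero()
rmse = math.sqrt(mean_squared_error(toy_test[mask], toy_pred[mask]))
print(rmse)  # 1.0
```

Both compared cells are off by exactly 1, so the RMSE is 1.0. Note that masking on non-zero values would also skip any genuine rating of 0.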

### Trying question-question based collaborative filtering

Now I will try item-item based collaborative filtering. This method finds the similarity between items instead of users, using the same cosine metric as the previous method. Using the similarity between items and the user's ratings for similar items, we find the predicted ratings for unrated items. Let us make the item similarity matrix. As in the user-user case, `pairwise_distances` returns cosine distances, so the diagonal is 0.

In [258]:
item_similarity = pairwise_distances(train_matrix.T, metric = 'cosine')
item_similarity.shape

(16, 16)

In [259]:
item_similarity

array([[0.        , 0.02532057, 0.21259921, 0.09560514, 0.21442982,
        0.28912294, 0.03698586, 0.25193276, 0.06137651, 0.10265558,
        0.        , 0.16356582, 0.        , 0.04198024, 0.05565788,
        0.02620876],
       [0.02532057, 0.        , 0.25729353, 0.12365987, 0.22202895,
        0.28793857, 0.05828233, 0.2462537 , 0.08165383, 0.12835235,
        0.02532057, 0.19513871, 0.02532057, 0.07070561, 0.09237612,
        0.05156929],
       [0.21259921, 0.25729353, 0.        , 0.30816368, 0.39523168,
        0.46087838, 0.26108654, 0.4245239 , 0.32277567, 0.31535623,
        0.21259921, 0.31711056, 0.21259921, 0.23522575, 0.24525823,
        0.25797033],
       [0.09560514, 0.12365987, 0.30816368, 0.        , 0.34951384,
        0.43604073, 0.12857747, 0.39022886, 0.18845351, 0.2240498 ,
        0.09560514, 0.20454349, 0.09560514, 0.15245731, 0.12452867,
        0.12808538],
       [0.21442982, 0.22202895, 0.39523168, 0.34951384, 0.        ,
        0.61884106, 0.29487453, 

The similarity matrix has a shape of 16 x 16 as expected, with each cell corresponding to the similarity between two questions. Now we will write a prediction function which predicts the values in the user-question matrix. We will only consider the top n items which are similar to an item to make predictions. In this formula we don't need to normalise the ratings, as we are using questions rather than users to make predictions.

\begin{equation*}
\hat{x}_{k,m} = \frac{\sum\limits_{i_b} sim_i(i_m, i_b) (x_{k,b}) }{\sum\limits_{i_b}|sim_i(i_m, i_b)|}
\end{equation*}
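As a worked instance of the item-item formula, the sketch below predicts one user's rating for a target question as a similarity-weighted average of that user's ratings. The ratings and weights are hypothetical.

```python
import numpy as np

# Hypothetical ratings of user k for three questions b.
x_k = np.array([5.0, 3.0, 0.0])
# Assumed similarities sim_i(i_m, i_b) between the target question m and each question b.
sim_m = np.array([1.0, 0.5, 0.5])

# Numerator: 5*1.0 + 3*0.5 + 0*0.5 = 6.5; denominator: 1.0 + 0.5 + 0.5 = 2.0
pred = x_k.dot(sim_m) / np.abs(sim_m).sum()
print(pred)  # 3.25
```

Because unrated cells are stored as 0, they contribute nothing to the numerator but their weight still counts in the denominator, which pulls predictions down.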

In [266]:
def predict_item_item(train_matrix, item_similarity, n_similar=100):
    # Indices of the n_similar largest entries per row, in descending order
    # (the slice keeps every question when n_similar exceeds the number of questions).
    similar_n = item_similarity.argsort()[:,-n_similar:][:,::-1]
    print('similar_n shape: ', similar_n.shape)
    pred = np.zeros((n_users,n_questions))

    for i,items in enumerate(similar_n):
        similar_items_indexes = items
        similarity_n = item_similarity[i,similar_items_indexes]
        matrix_n = train_matrix[:,similar_items_indexes]
        # Weighted average of each user's ratings over the selected questions.
        rated_items = matrix_n.dot(similarity_n)/similarity_n.sum()
        pred[:,i]  = rated_items
    return pred

We will use this function to find the predicted ratings. Here we pass `n_similar=100`; since there are only 16 questions, all of them are used to predict each question's ratings.

In [267]:
predictions = predict_item_item(train_matrix,item_similarity,100)
print('predictions shape ',predictions.shape)
predictions

similar_n shape:  (16, 16)
predictions shape  (100, 16)


array([[3.14546657, 3.20310937, 4.01392796, ..., 3.04814972, 3.44896125,
        3.06865067],
       [2.4857618 , 2.52293562, 3.46902525, ..., 2.33114597, 2.9635371 ,
        2.44038797],
       [3.38059177, 3.31889216, 3.94080769, ..., 3.30634623, 3.59761161,
        3.27619337],
       ...,
       [2.82965893, 2.80832809, 3.59465896, ..., 2.64067519, 3.15232378,
        2.78518891],
       [3.93818538, 3.93939673, 3.93904175, ..., 3.97776336, 3.84993713,
        3.92331891],
       [3.67029504, 3.6775919 , 3.6853785 , ..., 3.69383585, 3.7384012 ,
        3.63263344]])

To find the error in our model, we would consider only those ratings which are not zero in the test matrix. The cell below is commented out because no test set is loaded.

In [227]:
#predicted_ratings = predictions[test_matrix.nonzero()]
#test_truth = test_matrix[test_matrix.nonzero()]
#math.sqrt(mean_squared_error(predicted_ratings,test_truth))

### Getting similar users recommendations for a user

In this part we get recommendations for a user based on the highest similarity with other users. Let us get predictions for the user with user ID 40. I am using the similarity matrix from the user-user collaborative filtering model for this.

In [228]:
user_id = 40
users_prediction = predict_users(user_similarity)


We store a list of the top 10 similar users for each user.

In [260]:
for i in users_prediction:
    # Match on user_id rather than a hard-coded 40 so the label and the lookup agree
    if i[0] == user_id:
        for users in i[1]:
            print ('Recommended For user '+str(user_id)+' is user  '+str(users))

Recommended For user 33 is user  61
Recommended For user 33 is user  97
Recommended For user 33 is user  34
Recommended For user 33 is user  71
Recommended For user 33 is user  88
Recommended For user 33 is user  53
Recommended For user 33 is user  96
Recommended For user 33 is user  78
Recommended For user 33 is user  80
Recommended For user 33 is user  35


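This prints the top 10 user recommendations. The label in the output above shows user 33 rather than 40, so the cell appears to have last been run with a different `user_id` value.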
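Because `pairwise_distances` returns distances, the closest users are those with the smallest values. The following is a minimal sketch of a nearest-neighbour lookup under that assumption; `top_k_neighbours` and `toy_dist` are hypothetical names, and this is an alternative to `predict_users`, not the notebook's own method.

```python
import numpy as np

def top_k_neighbours(distance_matrix, user_index, k=10):
    """Return 0-based indices of the k users closest to user_index."""
    # Sort ascending: the smallest distance means the most similar user.
    order = np.argsort(distance_matrix[user_index], kind='stable')
    # Drop the user itself, whose distance to itself is 0.
    order = order[order != user_index]
    return order[:k]

# Hypothetical symmetric distance matrix for four users.
toy_dist = np.array([[0.0, 0.2, 0.7, 0.1],
                     [0.2, 0.0, 0.5, 0.4],
                     [0.7, 0.5, 0.0, 0.9],
                     [0.1, 0.4, 0.9, 0.0]])

# Add 1 to convert 0-based indices to the notebook's 1-based user IDs.
print(top_k_neighbours(toy_dist, 0, k=2) + 1)  # [4 2]
```

Working with `argsort` indices directly also avoids looking values up with `np.where`, which returns the same user repeatedly when distances are tied.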

### Getting question recommendations for a user

In the next part we get recommendations for a user based on the highest predicted ratings for that user. Let us get predictions for the user with user ID 29. I am using the predictions from the item-item collaborative filtering model for this.

In [264]:
user_id = 29
user_ratings = predictions[user_id-1,:]

We extract the indices of the questions for which no rating has been assigned (i.e. the value is 0) and get their predicted ratings.

In [244]:
train_unkown_indices = np.where(train_matrix[user_id-1,:] == 0)[0]
train_unkown_indices

array([ 2,  3, 11], dtype=int32)

In [261]:
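# user_recommendations is not defined in any cell shown; the next line is a reconstruction,
# assuming it holds the predicted ratings of the unrated questions.
user_recommendations = user_ratings[train_unkown_indices]
user_recommendations.shape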

(3,)

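We go on and print the top 3 recommendations. The values printed are 1-based positions within `user_recommendations`, which has three entries, not question IDs.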

In [265]:
print('\nRecommendations for user {} are  : \n'.format(user_id))
for question_id in user_recommendations.argsort()[-5:][: : -1]:
    print(question_id +1)


Recommendations for user 29 are  : 

3
1
2
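Calling `argsort` on a three-element array returns positions 0 to 2, so the printed values refer to positions among the unrated questions. A minimal sketch of mapping those positions back to question IDs follows, on hypothetical data and assuming the candidate scores are the predicted ratings at the unrated indices (all names here are illustrative).

```python
import numpy as np

# Hypothetical predicted ratings and observed ratings for one user over five questions.
predicted_row = np.array([3.1, 3.2, 4.0, 3.6, 2.9])
train_row = np.array([5.0, 4.0, 0.0, 0.0, 3.0])

# 0-based indices of questions with no rating: [2, 3]
unknown = np.where(train_row == 0)[0]
candidate_scores = predicted_row[unknown]

# Rank the candidates by predicted rating, then index back into `unknown`
# so the result refers to question indices rather than positions.
ranked = unknown[np.argsort(candidate_scores)[::-1]]
print(ranked + 1)  # 1-based question IDs: [3 4]
```

Under that assumption, the same indexing applied to the notebook's `train_unkown_indices` would turn the printed positions into question IDs.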
