# Exercise - Collaborative filtering

Implement user-based collaborative filtering to find recommendations for a new user.

In [1]:
import scipy.stats
import numpy

The code below reads the file "movies.csv", which holds ratings as (userId, itemId, rating) triples, and processes it. As a result, the variable ratings holds a matrix with users as rows and items as columns. Each cell holds a rating, or 0 if the user has not rated the item.

In [2]:
def readRatings(path):
    # Parse every non-empty line into [userId, itemId, rating]; the file is closed on exit.
    with open(path, "r") as file:
        lines = file.read().split("\n")
    return [[int(x) for x in line.split(",")] for line in lines if line != ""]

def processRatings(path):
    ratings = readRatings(path)
    maxUser = max(item[0] for item in ratings)
    maxItem = max(item[1] for item in ratings)
    # Rows are users, columns are items; IDs start at 1, matrix indices at 0.
    ratMatrix = numpy.zeros((maxUser, maxItem))
    for rat in ratings:
        ratMatrix[rat[0]-1, rat[1]-1] = rat[2]
    return ratMatrix
ratings = processRatings("movies.csv")
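
To check the matrix layout without needing "movies.csv", the same mapping can be tried on a few in-memory triples. This is only a small sketch: the `triples` data and the variable names are made up for illustration.

```python
import numpy

# Hypothetical (userId, itemId, rating) triples; IDs start at 1 as in the input file.
triples = [(1, 1, 5), (1, 3, 2), (2, 2, 4), (3, 1, 1), (3, 3, 3)]

max_user = max(t[0] for t in triples)
max_item = max(t[1] for t in triples)

# Rows are users, columns are items; 0 means "not rated".
matrix = numpy.zeros((max_user, max_item))
for user_id, item_id, rating in triples:
    matrix[user_id - 1, item_id - 1] = rating

print(matrix)
```

User 1's rating of item 3 ends up in `matrix[0, 2]`, which is the ID - 1 shift used throughout the exercise.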

**TODO** implement similarity for a pair of vectors (user ratings). Use the Pearson correlation (scipy.stats.pearsonr).

In [5]:
def similarity(item1, item2):
    return scipy.stats.pearsonr(item1, item2)[0]

In [6]:
similarity(ratings[0],ratings[1])

0.10632192973557714
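
To make clear what the correlation measures, the sketch below computes the Pearson coefficient by hand and compares it with numpy.corrcoef, which should agree with the first element returned by scipy.stats.pearsonr. The function name `pearson_by_hand` and the two short vectors are assumptions for illustration. Note that with the matrix layout used here, unrated items enter the correlation as zeros.

```python
import numpy

def pearson_by_hand(x, y):
    # Center both vectors, then divide the covariance term by the product of the norms.
    x = numpy.asarray(x, dtype=float)
    y = numpy.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / numpy.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Hypothetical rating rows of two users (0 = not rated).
a = [5, 3, 0, 1]
b = [4, 0, 0, 1]

by_hand = pearson_by_hand(a, b)
reference = numpy.corrcoef(a, b)[0, 1]
print(by_hand, reference)
```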

**TODO** implement the weighted average. The ratingsCol parameter contains a column of the ratings matrix (the ratings of all users for one movie). The weights parameter is the array of similarities of users to the current user (non-zero for the k nearest neighbors, zero for the others).

In [21]:
def weightedMean(ratingsCol, weights):
    my_sum = 0
    for r, w in zip(ratingsCol, weights):
        my_sum += r*w
    return my_sum/sum(weights)

weightedMean([1,2,3], [5,2,4])

1.9090909090909092
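
A vectorized variant is sketched below, assuming NumPy arrays as input. The name `weighted_mean_np` and the choice to return 0.0 when all weights are zero are assumptions, not part of the exercise; the guard only avoids a division by zero.

```python
import numpy

def weighted_mean_np(ratings_col, weights):
    ratings_col = numpy.asarray(ratings_col, dtype=float)
    weights = numpy.asarray(weights, dtype=float)
    total = weights.sum()
    if total == 0:
        # Assumed fallback: no neighbor carries any weight.
        return 0.0
    # Sum of rating * weight, divided by the sum of the weights.
    return float(numpy.dot(ratings_col, weights) / total)

result = weighted_mean_np([1, 2, 3], [5, 2, 4])
print(result)
```

For the inputs above this is (1*5 + 2*2 + 3*4) / (5 + 2 + 4) = 21/11, the same value the loop version returns.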

**TODO** implement user-based collaborative filtering. Use the following steps:
    1. find the similarities of all users to the given user (parameter userId). Remember not to take this user himself into consideration; the easiest way is to set the similarity of the user to himself to -1.
    2. sort the similarities in descending order
    3. find the weights vector: the similarity for the k nearest users, 0 for the others
    4. find predicted ratings for all items that were not already rated by this user
        4.1. call the weightedMean method for all columns with zeros for the given user, using the weights vector computed in step 3
        4.2. sort the predicted values in descending order
    5. return the results as a list of tuples (itemId, predicted rating), sorted in descending order
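
The steps above can be sketched as follows. This is one possible reading of the task, not a reference solution: the names `pearson`, `recommend_user_based` and `toy` are hypothetical, numpy.corrcoef stands in for scipy.stats.pearsonr so that the block depends on NumPy only, and the guards for a small user count and a zero weight sum are assumptions. The user is identified by matrix row index, and negative similarities among the k nearest users are not treated specially.

```python
import numpy

def pearson(u, v):
    # Stand-in for similarity(): off-diagonal entry of the 2x2 correlation matrix.
    return float(numpy.corrcoef(u, v)[0, 1])

def recommend_user_based(user_index, ratings_matrix, k=10):
    n_users = ratings_matrix.shape[0]
    k = min(k, n_users - 1)  # assumed guard: never pick the user himself

    # Step 1: similarity of every user to the given user, -1 for the user himself.
    sims = numpy.array([
        -1.0 if other == user_index
        else pearson(ratings_matrix[user_index], ratings_matrix[other])
        for other in range(n_users)
    ])

    # Steps 2 and 3: indices of the k most similar users, then the weights vector.
    nearest = numpy.argsort(sims)[::-1][:k]
    weights = numpy.zeros(n_users)
    weights[nearest] = sims[nearest]
    total = weights.sum()
    if total == 0:
        return []  # assumed fallback: no usable neighbors

    # Step 4: weighted mean of each column the user has not rated yet.
    unrated = numpy.where(ratings_matrix[user_index] == 0)[0]
    predictions = [
        (int(item), float(numpy.dot(ratings_matrix[:, item], weights) / total))
        for item in unrated
    ]

    # Step 5: list of (item index, predicted rating), highest prediction first.
    return sorted(predictions, key=lambda pair: pair[1], reverse=True)

# Hypothetical 4 users x 4 items matrix (0 = not rated).
toy = numpy.array([
    [5, 4, 0, 1],
    [5, 5, 3, 1],
    [1, 0, 5, 4],
    [4, 4, 4, 0],
], dtype=float)

recs = recommend_user_based(0, toy, k=2)
print(recs)
```

In the toy matrix, user 0 has only item index 2 unrated, so the result holds a single tuple whose prediction is a blend of the ratings given by the two most similar users.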

In [66]:
k=10 #number of closest users used for recommendation

def findRecommendationsUserBased(userId, ratingsMatrix):
    
    # Step 1: similarity of every user to the given user
    
    sim_array = [None] * len(ratingsMatrix)
    
    for other in range(len(ratingsMatrix)):
        if userId == other:
            sim_array[other] = -1  # exclude the user himself
        else:
            sim_array[other] = similarity(ratingsMatrix[userId], ratingsMatrix[other])
    
    # Step 2: indices that sort sim_array
    
    sort_index = numpy.argsort(sim_array) # ascending order, so the most similar users come last
    
    # Step 3: weights vector (similarity for the k nearest users, 0 for the others)
    
    vect_array = [0] * len(ratingsMatrix)
    
    num = sort_index[-k:]  # indices of the k most similar users
    
    for i in range(len(num)):
        main_index = num[i]
        vect_array[main_index] = sim_array[main_index]
    
    # Steps 4 and 5 are not implemented here; the similarities are returned for inspection
    return sim_array

findRecommendationsUserBased(1, ratings)

767
0.42616639399820166
833
0.43167424174210867
14
0.4320194584754063
412
0.43946907159911736
734
0.4398446497517806
103
0.44341587921139586
130
0.46477073542892466
459
0.48591290907267914
930
0.4951656011926353
700
0.570307281229846
[767 833  14 412 734 103 130 459 930 700]


[0.10632192973557714,
 -1,
 0.08268016854282502,
 0.16032261105387996,
 0.020217807303362843,
 0.197844256436931,
 0.022886096225234645,
 0.07277206553778975,
 0.14371621924825856,
 0.10686125399104798,
 0.07576560040090953,
 0.09246463071510258,
 0.1474739142841934,
 0.19199290692165605,
 0.4320194584754063,
 0.07781811891083473,
 0.20631285114918707,
 0.1173093569021231,
 0.08360012520997935,
 0.03210429372861286,
 0.1599733613634574,
 0.018687906613846526,
 0.09872168340005359,
 0.14714026107345157,
 0.11376339854849801,
 0.4203187476344612,
 0.16730563135396045,
 0.06619094981467333,
 0.15362284227774925,
 0.20748423698399432,
 0.0571220979687225,
 0.1936531556558395,
 0.1489230151056949,
 0.1978144729094955,
 0.055391376779900164,
 0.08986545821963995,
 0.0485156400418891,
 0.013602981918010678,
 0.2538336519409621,
 0.23239964012206452,
 0.09718034985484886,
 0.061978202918571285,
 0.2625853537920454,
 0.06992472841720597,
 0.3032396181272214,
 0.21467643934749298,
 0.31333088147

In [58]:
arr = [0, 1, 2, 3, 4, 5, 6, 7,8]

print(arr[-1:])

[8]
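
The same negative slicing is what selects the nearest neighbors after numpy.argsort. A short sketch with made-up `scores` (the variable names are assumptions):

```python
import numpy

# Hypothetical similarities; index 1 plays the role of the user himself.
scores = numpy.array([0.1, -1.0, 0.7, 0.3, 0.5])

order = numpy.argsort(scores)   # indices from smallest to largest score
top2 = order[-2:]               # the two largest scores, still in ascending order
top2_desc = order[-2:][::-1]    # reversed, so the largest score comes first

print(order, top2, top2_desc)
```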


The following code fragment prints 10 recommended movies for each of the first 5 users. Note that the printed user and movie IDs correspond to the ones from the input file, not to the matrix indices: matrix row/column index = user/movie ID - 1.

In [None]:
usersCount = ratings.shape[0]
for user in range(5):
    recommendations = findRecommendationsUserBased(user, ratings)
    for i in range(10):
        print("User: " + str(user + 1) + ", Item: " + str(recommendations[i][0] + 1) + ", predicted rating: " + str(round(recommendations[i][1], 2)))
    print("")