<a href="https://colab.research.google.com/github/Foutse/Recommendation_systems/blob/master/Recommendation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### ↨ Collaborative Filtering using N Nearest Neighbors

Collaborative filtering is a widely used recommendation technique whose two main families are **memory-based** and **model-based** methods. Here we tackle the *memory-based* approach, specifically **User-Based Collaborative Filtering** (people with similar characteristics share similar taste), inspired by this work: https://medium.com/sfu-cspmp/recommendation-systems-user-based-collaborative-filtering-using-n-nearest-neighbors-bf7361dc24e0

#### User-User Collaborative Filtering


The method identifies users who are similar to the queried user and estimates the desired rating as the weighted average of the ratings of these similar users.
![Alternative text…](https://miro.medium.com/max/2728/1*x8gTiprhLs7zflmEn1UjAQ.png)

In [None]:
import pandas as pd
movies = pd.read_csv("/movies.csv",encoding="Latin1")
Ratings = pd.read_csv("/ratings.csv")
Tags = pd.read_csv("/tags.csv",encoding="Latin1")

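We normalize each user's ratings by subtracting that user's mean rating, so that users who rate generously and users who rate harshly become comparable. The mean-centered ratings are used later to calculate the final score.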
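
To make the normalization concrete, here is a minimal plain-Python sketch on made-up ratings. The user names, movie ids and values are hypothetical and do not come from the dataset, and `mean_center` is a helper introduced only for this illustration. It performs the same subtraction as the pandas cell that follows.

```python
from statistics import mean

# Hypothetical toy ratings: user -> {movie: rating}
ratings = {
    "alice": {"m1": 5.0, "m2": 3.0, "m3": 4.0},
    "bob": {"m1": 2.0, "m2": 1.0},
}

def mean_center(user_ratings):
    """Return each rating minus the user's own mean rating."""
    avg = mean(user_ratings.values())
    return {movie: r - avg for movie, r in user_ratings.items()}

centered = {user: mean_center(r) for user, r in ratings.items()}
print(centered["alice"])  # {'m1': 1.0, 'm2': -1.0, 'm3': 0.0}
print(centered["bob"])    # {'m1': 0.5, 'm2': -0.5}
```

After centering, a positive value means "better than this user's usual rating" and a negative value means "worse", regardless of how strict the user is overall.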

In [None]:
Mean = Ratings.groupby(by="userId",as_index=False)['rating'].mean()
Rating_avg = pd.merge(Ratings,Mean,on='userId')
Rating_avg['adg_rating']=Rating_avg['rating_x']-Rating_avg['rating_y']
Rating_avg.head()

,userId,movieId,rating_x,timestamp,rating_y,adg_rating
0,12882,1,4.0,1147195252,4.061321,-0.061321
1,12882,32,3.5,1147195307,4.061321,-0.561321
2,12882,47,5.0,1147195343,4.061321,0.938679
3,12882,50,5.0,1147185499,4.061321,0.938679
4,12882,110,4.5,1147195239,4.061321,0.438679


### ♦ Cosine Similarity

We need to find users with similar tastes, that is, users with similar likes and dislikes. This is done using cosine similarity, which is usually calculated over the movies that both users have rated in the past. *sklearn* provides a *cosine_similarity* function.
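
As a rough illustration of what cosine similarity measures, the sketch below computes it by hand for small, hypothetical mean-centered vectors. The function `cosine` and the three vectors are assumptions made for this example, and returning 0.0 for an all-zero vector is a choice of this sketch. It is not the sklearn implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption of this sketch: no information, no similarity
    return dot / (norm_u * norm_v)

# Hypothetical mean-centered rating vectors over the same three movies
user_a = [1.0, -1.0, 0.0]
user_b = [2.0, -2.0, 0.0]   # same taste as user_a, stronger opinions
user_c = [-1.0, 1.0, 0.0]   # opposite taste to user_a

print(cosine(user_a, user_b))  # close to 1: same direction
print(cosine(user_a, user_c))  # close to -1: opposite direction
```

Only the direction of the vectors matters, not their length, which is why `user_a` and `user_b` come out as nearly identical.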

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
final=pd.pivot_table(Rating_avg,values='adg_rating',index='userId',columns='movieId')

In [None]:

check = pd.pivot_table(Rating_avg,values='rating_x',index='userId',columns='movieId')
check.head()

userId \ movieId,1,10,100,1003,1004,1005,1006,1007,1009,101,...,97913,97921,97938,986,98809,991,99114,994,996,999
316,2.5,2.5,,,,,,,,,...,,,,,,,,,,
320,,,,,,,,,,,...,,,,,,,,,,
359,5.0,4.0,,3.0,,,,,,5.0,...,,,,,,,,4.0,,
370,4.5,,,,,,,,,,...,3.5,4.0,,,4.0,,4.5,,4.5,
910,5.0,,3.5,,,,,,,,...,,4.0,,4.0,3.0,,,,3.5,


In [None]:
final.head()

userId \ movieId,1,2,3,4,5,6,7,9,10,11,...,106487,106489,106782,106920,109374,109487,111362,111759,112556,112852
316,-0.829457,,,,,,-1.329457,,-0.829457,,...,,,,,,,,,,
320,,,,,,,,,,,...,,,,,,,,,,
359,1.314526,,,,,1.314526,,,0.314526,0.314526,...,,,,,,,,,,
370,0.705596,0.205596,,,,1.205596,,,,,...,-1.294404,-0.794404,0.705596,0.205596,,,-0.794404,0.705596,-0.294404,-0.794404
910,1.10192,0.10192,-0.39808,,-0.39808,-0.39808,,,,0.10192,...,,,-0.39808,,,,,0.60192,,


We can observe that we have here a sparse matrix (many NaN values, since no user has seen every movie). Methods like **matrix factorization** are used to deal with this sparsity. Here we simply replace the NaN values. This can be handled in various ways depending on the goal we want to achieve. For our case, we could apply either of the following:
- Use the user average over the row.
- Use the movie average over the column.

#### Replacing NaN by Movie Average

In [None]:
final_movie = final.fillna(final.mean(axis=0))

In [None]:
final_movie.head()

userId \ movieId,1,2,3,4,5,6,7,9,10,11,...,106487,106489,106782,106920,109374,109487,111362,111759,112556,112852
316,-0.829457,-0.436518,-0.468109,-0.770223,-0.615331,0.320415,-1.329457,-0.690175,-0.829457,-0.094277,...,0.105075,0.006629,0.262314,0.23735,0.429868,0.306567,0.22511,0.234458,0.362468,0.349157
320,0.20022,-0.436518,-0.468109,-0.770223,-0.615331,0.320415,-0.203889,-0.690175,-0.150642,-0.094277,...,0.105075,0.006629,0.262314,0.23735,0.429868,0.306567,0.22511,0.234458,0.362468,0.349157
359,1.314526,-0.436518,-0.468109,-0.770223,-0.615331,1.314526,-0.203889,-0.690175,0.314526,0.314526,...,0.105075,0.006629,0.262314,0.23735,0.429868,0.306567,0.22511,0.234458,0.362468,0.349157
370,0.705596,0.205596,-0.468109,-0.770223,-0.615331,1.205596,-0.203889,-0.690175,-0.150642,-0.094277,...,-1.294404,-0.794404,0.705596,0.205596,0.429868,0.306567,-0.794404,0.705596,-0.294404,-0.794404
910,1.10192,0.10192,-0.39808,-0.770223,-0.39808,-0.39808,-0.203889,-0.690175,-0.150642,0.10192,...,0.105075,0.006629,-0.39808,0.23735,0.429868,0.306567,0.22511,0.60192,0.362468,0.349157


**We also fill the NaN values with each user's average, then calculate the similarity between the users for both versions**

In [None]:
# Replacing NaN by user Average
final_user = final.apply(lambda row: row.fillna(row.mean()), axis=1)

In [None]:
final_user.head()

userId \ movieId,1,2,3,4,5,6,7,9,10,11,...,106487,106489,106782,106920,109374,109487,111362,111759,112556,112852
316,-0.8294574,1.893404e-16,1.893404e-16,1.893404e-16,1.893404e-16,1.893404e-16,-1.329457,1.893404e-16,-0.8294574,1.893404e-16,...,1.893404e-16,1.893404e-16,1.893404e-16,1.893404e-16,1.893404e-16,1.893404e-16,1.893404e-16,1.893404e-16,1.893404e-16,1.893404e-16
320,4.297638e-17,4.297638e-17,4.297638e-17,4.297638e-17,4.297638e-17,4.297638e-17,4.297638e-17,4.297638e-17,4.297638e-17,4.297638e-17,...,4.297638e-17,4.297638e-17,4.297638e-17,4.297638e-17,4.297638e-17,4.297638e-17,4.297638e-17,4.297638e-17,4.297638e-17,4.297638e-17
359,1.314526,-1.135546e-16,-1.135546e-16,-1.135546e-16,-1.135546e-16,1.314526,-1.135546e-16,-1.135546e-16,0.3145258,0.3145258,...,-1.135546e-16,-1.135546e-16,-1.135546e-16,-1.135546e-16,-1.135546e-16,-1.135546e-16,-1.135546e-16,-1.135546e-16,-1.135546e-16,-1.135546e-16
370,0.7055961,0.2055961,1.958963e-15,1.958963e-15,1.958963e-15,1.205596,1.958963e-15,1.958963e-15,1.958963e-15,1.958963e-15,...,-1.294404,-0.7944039,0.7055961,0.2055961,1.958963e-15,1.958963e-15,-0.7944039,0.7055961,-0.2944039,-0.7944039
910,1.10192,0.1019202,-0.3980798,6.795811e-16,-0.3980798,-0.3980798,6.795811e-16,6.795811e-16,6.795811e-16,0.1019202,...,6.795811e-16,6.795811e-16,-0.3980798,6.795811e-16,6.795811e-16,6.795811e-16,6.795811e-16,0.6019202,6.795811e-16,6.795811e-16


In [None]:
# user similarity on replacing NaN by user avg
import numpy as np
b = cosine_similarity(final_user)
np.fill_diagonal(b, 0)  # zero the self-similarity so a user is never their own neighbour
similarity_with_user = pd.DataFrame(b,index=final_user.index)
similarity_with_user.columns=final_user.index
similarity_with_user.head()

userId,316,320,359,370,910,975,1015,1387,1447,1588,...,137118,137209,137227,137446,137559,137609,137805,138072,138176,138200
316,0.0,0.060063,0.072075,0.043266,0.039305,0.045616,0.035341,0.038068,-0.01248514,0.050183,...,0.052632,0.104864,0.011358,0.029674,0.092552,0.017876,0.051371,0.077377,0.026924,-0.022727
320,0.060063,0.0,0.063054,0.027315,0.006811,0.07562,0.01191,0.042509,1.208166e-31,0.067389,...,0.115325,0.06513,0.071996,0.097554,0.064769,-0.006251,0.077256,0.098845,0.038752,0.056639
359,0.072075,0.063054,0.0,0.135836,0.076131,0.036757,0.046418,0.066544,0.04287659,0.109726,...,0.120191,0.020672,0.032166,0.039599,0.108502,0.026371,0.075492,0.102698,0.099307,0.003147
370,0.043266,0.027315,0.135836,0.0,0.108404,0.071655,0.070893,-0.003139,0.05223516,0.090241,...,0.091218,0.049594,0.004344,0.040692,0.110434,0.019767,-0.001364,0.052187,0.050997,0.00995
910,0.039305,0.006811,0.076131,0.108404,0.0,0.021814,0.027339,-0.032211,-0.006301121,-0.007491,...,0.039464,-0.01762,0.020058,-0.004581,0.040866,-0.001438,-0.026082,0.073272,-0.012058,0.00761


In [None]:
# user similarity on replacing NAN by item(movie) avg
import numpy as np
cosine = cosine_similarity(final_movie)
np.fill_diagonal(cosine, 0 )
similarity_with_movie =pd.DataFrame(cosine,index=final_movie.index)
similarity_with_movie.columns=final_movie.index
similarity_with_movie.head()

userId,316,320,359,370,910,975,1015,1387,1447,1588,...,137118,137209,137227,137446,137559,137609,137805,138072,138176,138200
316,0.0,0.921169,0.665659,0.673486,0.694247,0.894969,0.80578,0.851492,0.945224,0.705491,...,0.827564,0.895641,0.87929,0.916856,0.912146,0.922262,0.587738,0.671783,0.949138,0.74022
320,0.921169,0.0,0.687225,0.691158,0.699527,0.91602,0.816931,0.874283,0.970234,0.724147,...,0.861798,0.909376,0.907009,0.938964,0.929049,0.943265,0.612746,0.695382,0.973853,0.768459
359,0.665659,0.687225,0.0,0.534369,0.523475,0.655225,0.602806,0.629143,0.705042,0.542504,...,0.62182,0.65432,0.655839,0.679696,0.6839,0.686193,0.418283,0.489595,0.70737,0.534065
370,0.673486,0.691158,0.534369,0.0,0.54756,0.67181,0.618456,0.628825,0.712683,0.548592,...,0.636688,0.673489,0.651209,0.688647,0.689265,0.692595,0.405881,0.497332,0.714011,0.546637
910,0.694247,0.699527,0.523475,0.54756,0.0,0.680701,0.621463,0.634921,0.723574,0.528281,...,0.638257,0.668887,0.677377,0.701964,0.701245,0.705041,0.408456,0.509008,0.725896,0.554105


We now check whether what we calculated really makes sense!

In [None]:
def get_user_similar_movies(user1, user2):
    # Movies rated by both users, joined with the movie titles
    common_movies = Rating_avg[Rating_avg.userId == user1].merge(
        Rating_avg[Rating_avg.userId == user2],
        on="movieId",
        how="inner")
    return common_movies.merge(movies, on='movieId')

Let us try with users 370 and 86309 and check whether they are similar.

In [None]:
a = get_user_similar_movies(370,86309)
a = a.loc[ : , ['rating_x_x','rating_x_y','title']]
a.head()

,rating_x_x,rating_x_y,title
0,5.0,5.0,"Matrix, The (1999)"
1,5.0,4.5,"Lord of the Rings: The Fellowship of the Ring,..."
2,5.0,4.0,"Lord of the Rings: The Two Towers, The (2002)"
3,4.5,4.0,"Lord of the Rings: The Return of the King, The..."
4,1.5,1.0,Serenity (2005)


We can observe that the similarity we generated is plausible, since users **370** and **86309** give almost the same ratings to the movies they have both seen.

#### ♣ Neighborhood for User (K)

Using every other user to predict a rating for a given user would be wasteful, so we restrict ourselves to a limited number of users. We introduce the notion of a neighborhood: for each user we keep only the k most similar users and use only those in the prediction. We take k as 20 for the user-average similarity and 30 for the movie-average similarity. We thus build a function **find_n_neighbours**, which takes the similarity matrix and the value of n as input and returns the n nearest neighbors of every user.
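
The selection step can be sketched for a single user with the standard library. The similarity values and user ids below are hypothetical, and `top_n_neighbours` is a helper named only for this example; the notebook's own function works on the whole similarity DataFrame instead.

```python
import heapq

# Hypothetical similarity scores of one user against every other user
similarities = {"u2": 0.91, "u3": 0.12, "u4": 0.67, "u5": -0.05, "u6": 0.80}

def top_n_neighbours(sims, n):
    """Return the n user ids with the highest similarity, best first."""
    return heapq.nlargest(n, sims, key=sims.get)

print(top_n_neighbours(similarities, 3))  # ['u2', 'u6', 'u4']
```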

In [None]:
def find_n_neighbours(df,n):
    # For each user (row), keep the ids of the n most similar users, best first
    df = df.apply(lambda x: pd.Series(x.sort_values(ascending=False)
           .iloc[:n].index,
          index=['top{}'.format(i) for i in range(1, n+1)]), axis=1)
    return df

In [None]:
# top 20 neighbours for each user
sim_user_20_u = find_n_neighbours(similarity_with_user,20)
sim_user_20_u.head()

userId,top1,top2,top3,top4,top5,top6,top7,top8,top9,top10,top11,top12,top13,top14,top15,top16,top17,top18,top19,top20
316,113673,117918,9050,12882,38187,102668,98880,43829,13215,78501,6988,5611,131835,86783,98781,94883,61305,59269,117861,128236
320,12288,113673,28159,79846,134627,112948,120729,97163,2945,4931,44400,61305,82880,21860,100540,12569,88608,124849,69256,59269
359,102118,96482,102532,50898,2702,60016,23428,120782,57937,42096,38159,32780,65670,124078,11343,46645,79531,35246,134181,128224
370,46645,42245,40768,23428,123707,60016,45120,113645,97195,102118,58265,113540,102532,120782,17039,117007,101137,57937,27365,41244
910,87042,131620,67352,40768,31321,48821,26222,63295,5611,370,79531,84752,10164,17022,60016,133811,12271,88394,105455,35522


In [None]:
# top 30 neighbours for each user
sim_user_30_m = find_n_neighbours(similarity_with_movie,30)
sim_user_30_m.head()

userId,top1,top2,top3,top4,top5,top6,top7,top8,top9,top10,...,top21,top22,top23,top24,top25,top26,top27,top28,top29,top30
316,138176,100240,96936,51460,88932,1447,104732,125012,5268,121403,...,121987,72633,21401,114335,22338,118304,124981,93203,81435,94333
320,138176,96936,121403,1447,51460,125012,88932,42944,5268,104529,...,121987,102549,118304,86309,94333,124981,93203,80585,136037,22338
359,138176,1447,5268,96936,100240,21401,88932,13927,104732,72633,...,12930,121987,114335,125012,51460,118304,57474,27142,80585,22338
370,86309,44194,138176,24802,129869,96936,1447,104529,94333,88932,...,124981,27142,102549,120308,54643,42944,80585,13927,21401,136037
910,96936,107991,138176,27142,51460,125012,88932,100240,72633,129869,...,36624,51255,94333,42944,121403,80585,61755,124981,88455,78908


This greatly reduces the number of unnecessary computations, since only the neighbors contribute to each prediction. We are now ready to calculate the score for an item.

### Generating the final score S(u,i)
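
The score computed by the function in the next cell can be read as the formula below. This is a sketch inferred from the code rather than quoted from a reference, and the symbols $\bar{r}_u$, $N(u)$, $\mathrm{sim}(u,v)$ and $\tilde{r}_{v,i}$ are notation introduced for this note.

$$
S(u,i) = \bar{r}_u + \frac{\sum_{v \in N(u)} \mathrm{sim}(u,v)\,\tilde{r}_{v,i}}{\sum_{v \in N(u)} \mathrm{sim}(u,v)}
$$

Here $\bar{r}_u$ is the mean rating of user $u$ (from `Mean`), $N(u)$ is the set of nearest neighbors of $u$ (from `sim_user_30_m`), $\mathrm{sim}(u,v)$ is the cosine similarity (from `similarity_with_movie`), and $\tilde{r}_{v,i}$ is the mean-centered rating of neighbor $v$ for movie $i$ (from `final_movie`, with missing values filled by the movie average). The denominator follows the code and sums the similarities directly; some formulations sum their absolute values instead.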

In [None]:
def User_item_score(user,item):
    # ids of the 30 nearest neighbours of `user`
    a = sim_user_30_m[sim_user_30_m.index==user].values
    b = a.squeeze().tolist()
    # mean-centered ratings of `item`, restricted to those neighbours
    c = final_movie.loc[:,item]
    d = c[c.index.isin(b)]
    f = d[d.notnull()]
    avg_user = Mean.loc[Mean['userId'] == user,'rating'].values[0]
    # similarity between `user` and each of those neighbours
    index = f.index.values.squeeze().tolist()
    corr = similarity_with_movie.loc[user,index]
    fin = pd.concat([f, corr], axis=1)
    fin.columns = ['adg_score','correlation']
    # similarity-weighted average of the neighbours' centered ratings
    fin['score']=fin.apply(lambda x:x['adg_score'] * x['correlation'],axis=1)
    nume = fin['score'].sum()
    deno = fin['correlation'].sum()
    final_score = avg_user + (nume/deno)
    return final_score

Let us predict the score for a movie that the given user has not seen.

In [None]:
score = User_item_score(320,7371)
print(score)

4.255766437391595


The system predicted a score of about 4.26, which is a high rating. This suggests that user 320 could like the movie with id 7371.
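
To see the arithmetic behind such a prediction, here is a small worked example in plain Python. The numbers are hypothetical and chosen to be easy to check by hand, and `predict_score` is a helper introduced only for this sketch. Falling back to the user's mean when the similarities sum to zero is an assumption that the notebook's function does not make.

```python
def predict_score(user_mean, neighbours):
    """Weighted-average prediction.

    neighbours is a list of (similarity, mean_centered_rating) pairs.
    """
    numerator = sum(sim * rating for sim, rating in neighbours)
    denominator = sum(sim for sim, _ in neighbours)
    if denominator == 0:
        return user_mean  # assumption: fall back to the user's own mean
    return user_mean + numerator / denominator

# Hypothetical inputs: the user's mean rating is 3.5, and three neighbours
# rated the movie above, at, and below their own means.
neighbours = [(0.5, 1.0), (0.25, 0.0), (0.25, -1.0)]
print(predict_score(3.5, neighbours))  # 3.75
```

The most similar neighbour liked the movie more than usual, so the prediction ends up above the user's own average.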

- We shall now try to predict the top 5 movies that a given user may like.
- Here we are only interested in calculating the scores for the items that the user's neighbors have rated and the user has not.
- This reduces the computation from 2500 to N, where N is the number of movies rated by the neighborhood, which is much smaller than 2500.
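
The candidate selection described in the list above boils down to a set difference. A minimal sketch with hypothetical viewing histories follows; the ids, the `seen` and `neighbours_of` dictionaries, and `candidate_movies` are illustrative names and not part of the notebook.

```python
# Hypothetical viewing history: user id -> set of movie ids
seen = {
    320: {1, 2},
    101: {1, 3, 4},
    102: {2, 4, 5},
    103: {6},  # not a neighbour of user 320, so never considered
}
neighbours_of = {320: [101, 102]}

def candidate_movies(user):
    """Movies rated by the user's neighbours that the user has not rated."""
    seen_by_neighbours = set()
    for neighbour in neighbours_of[user]:
        seen_by_neighbours |= seen[neighbour]
    return seen_by_neighbours - seen[user]

print(sorted(candidate_movies(320)))  # [3, 4, 5]
```

Movie 6 is never scored because no neighbour of user 320 has rated it, which is where the saving comes from.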

We then build a new function **User_item_score1** that returns the titles of the top 5 recommendations for a user.

In [None]:
Rating_avg = Rating_avg.astype({"movieId": str})
Movie_user = Rating_avg.groupby(by = 'userId')['movieId'].apply(lambda x:','.join(x))

In [None]:
def User_item_score1(user):
    # Movies the user has already rated
    Movie_seen_by_user = check.columns[check[check.index==user].notna().any()].tolist()
    # ids of the 30 nearest neighbours of `user`
    a = sim_user_30_m[sim_user_30_m.index==user].values
    b = a.squeeze().tolist()
    # Movies rated by those neighbours (stored as comma-separated strings)
    d = Movie_user[Movie_user.index.isin(b)]
    l = ','.join(d.values)
    Movie_seen_by_similar_users = l.split(',')
    # Candidates: rated by at least one neighbour but not yet by the user
    Movies_under_consideration = list(set(Movie_seen_by_similar_users)-set(list(map(str, Movie_seen_by_user))))
    Movies_under_consideration = list(map(int, Movies_under_consideration))
    # Score every candidate with the same weighted average as User_item_score
    score = [User_item_score(user, item) for item in Movies_under_consideration]
    data = pd.DataFrame({'movieId':Movies_under_consideration,'score':score})
    top_5_recommendation = data.sort_values(by='score',ascending=False).head(5)
    Movie_Name = top_5_recommendation.merge(movies, how='inner', on='movieId')
    Movie_Names = Movie_Name.title.values.tolist()
    return Movie_Names

In [None]:
user = int(input("Enter the user id to whom you want to recommend : "))
predicted_movies = User_item_score1(user)
print(" ")
print("The Recommendations for User Id :" + str(user))
print("   ")
for i in predicted_movies:
    print(i)

Enter the user id to whom you want to recommend : 320
 
The Recommendations for User Id :320
   
Godfather, The (1972)
Shawshank Redemption, The (1994)
Godfather: Part II, The (1974)
Band of Brothers (2001)
Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)


We did it!!!!! ;) Thank you **Ashay Pathak**

![Alternative text…](https://media1.giphy.com/media/4QFzDdeLmo19MTm9AF/source.gif)