This mini project demonstrates collaborative filtering on the MovieLens dataset to recommend movies to users. The MovieLens 
ratings dataset lists the ratings given by a set of users to a set of movies. Our goal is to predict ratings for 
movies a user has not yet watched. The movies with the highest predicted ratings can then be recommended to the user.

The file ratings.csv in the dataset contains the ratings given by users. Each line in this file represents one rating given by one user to one movie. Ratings are on a 5-star scale in half-star increments (0.5 to 5.0).
The file has the following columns:
    1. userId 
    2. movieId 
    3. rating 
    4. timestamp




In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
from sklearn.model_selection import train_test_split

In [31]:
ratingsdf=pd.read_csv("C:/Users/dataset/ratings.csv")

In [32]:
ratingsdf.head(10)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
5,1,70,3.0,964982400
6,1,101,5.0,964980868
7,1,110,4.0,964982176
8,1,151,5.0,964984041
9,1,157,5.0,964984100


In [35]:
ratingsdf.drop( 'timestamp', axis = 1, inplace = True )

The number of unique users in the dataset

In [37]:
len( ratingsdf.userId.unique() )

610

The total number of unique movies rated

In [39]:
len( ratingsdf.movieId.unique() ) 

9724

In [42]:
user_movies_df =  ratingsdf.pivot(  index='userId', columns='movieId', values = "rating").reset_index(drop=True) 
user_movies_df.index = ratingsdf.userId.unique() 

In [43]:
user_movies_df.iloc[0:5, 0:15]
 

movieId,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
1,4.0,,4.0,,,4.0,,,,,,,,,
2,,,,,,,,,,,,,,,
3,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,
5,4.0,,,,,,,,,,,,,,


In [45]:
user_movies_df.fillna( 0, inplace = True  ) 
user_movies_df.iloc[0:5, 0:10]
 

movieId,1,2,3,4,5,6,7,8,9,10
1,4.0,0.0,4.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
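
The reshaping step above can be illustrated on a toy table. This is a minimal sketch with made-up ratings (the names `toy_ratings`, `toy_matrix` and `toy_filled` are hypothetical and not part of the notebook). It shows that `pivot` leaves unrated user/movie pairs as NaN and that `fillna(0)` then treats "not rated" as a rating of 0, which is a simplification worth keeping in mind when reading the similarities later.

```python
import pandas as pd

# Toy long-format ratings (hypothetical values, not taken from MovieLens).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 20, 10, 30],
    "rating":  [4.0, 5.0, 3.0, 2.0, 4.5],
})

# One row per user, one column per movie; unrated pairs become NaN.
toy_matrix = toy_ratings.pivot(index="userId", columns="movieId", values="rating")

# Replace NaN with 0 so that distance functions can consume the matrix.
toy_filled = toy_matrix.fillna(0)
print(toy_filled)
```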


Calculating Cosine Similarity between Users

Each row in user_movies_df represents a user, so the similarity between two rows is the similarity between the corresponding
users. sklearn.metrics.pairwise_distances computes the distance between all pairs of rows, and its metric parameter selects
the distance measure. We use the cosine metric. Because pairwise_distances() returns the cosine distance, which is 1 minus the
cosine similarity, we subtract the result from 1 to obtain similarities. Since ratings are non-negative, the similarity lies
between 0 and 1: values close to 1 mean the users are very similar, and values close to 0 mean they are very dissimilar.
The following code calculates the similarity:

In [46]:
from sklearn.metrics import pairwise_distances 
from scipy.spatial.distance import cosine, correlation 

In [47]:
user_sim = 1 - pairwise_distances( user_movies_df.values,metric="cosine" )  
# Store the results in a DataFrame
user_sim_df = pd.DataFrame( user_sim ) 
# Set the index and column names to the user ids (1 to 610)
user_sim_df.index = ratingsdf.userId.unique()  
user_sim_df.columns = ratingsdf.userId.unique() 

In [48]:
user_sim_df.iloc[0:5, 0:5]
 

Unnamed: 0,1,2,3,4,5
1,1.0,0.027283,0.05972,0.194395,0.12908
2,0.027283,1.0,0.0,0.003726,0.016614
3,0.05972,0.0,1.0,0.002251,0.00502
4,0.194395,0.003726,0.002251,1.0,0.128659
5,0.12908,0.016614,0.00502,0.128659,1.0


In [49]:
user_sim_df.shape 

(610, 610)
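
To make the `1 - pairwise_distances(...)` step concrete, the sketch below compares it with the textbook cosine formula on two made-up rating vectors. The array `toy` and the helper `cosine_similarity_manual` are hypothetical names introduced only for this illustration.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Two hypothetical users' ratings over four movies (0 = not rated).
toy = np.array([
    [4.0, 0.0, 5.0, 0.0],
    [5.0, 0.0, 4.0, 1.0],
])

def cosine_similarity_manual(a, b):
    """Dot product divided by the product of the vector norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

manual = cosine_similarity_manual(toy[0], toy[1])

# pairwise_distances returns cosine *distance*, so subtract from 1.
from_sklearn = 1 - pairwise_distances(toy, metric="cosine")

print(manual, from_sklearn[0, 1])
```

Both values should agree up to floating-point error, and the diagonal of `from_sklearn` is 1 because each vector is identical to itself.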

The diagonal of the matrix holds the similarity of each user with himself or herself, which is always 1 because every user is
identical to himself or herself. But we need the algorithm to find other users who are similar to a specific user, so we set the 
diagonal values to 0.0.

In [50]:
np.fill_diagonal( user_sim, 0 )  
user_sim_df.iloc[0:5, 0:5]
 

Unnamed: 0,1,2,3,4,5
1,0.0,0.027283,0.05972,0.194395,0.12908
2,0.027283,0.0,0.0,0.003726,0.016614
3,0.05972,0.0,0.0,0.002251,0.00502
4,0.194395,0.003726,0.002251,0.0,0.128659
5,0.12908,0.016614,0.00502,0.128659,0.0


Finding the most similar user for each user

In [53]:
user_sim_df.idxmax(axis=1)[0:5] 

1    266
2    366
3    313
4    391
5    470
dtype: int64
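
The effect of zeroing the diagonal before calling `idxmax` can be seen on a small, made-up similarity matrix. This is a sketch only: `toy_sim` and its values are hypothetical. Here the diagonal is filled on the NumPy array before it is wrapped in a DataFrame, so the example does not depend on whether the DataFrame shares memory with the array.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x3 similarity matrix for users 1, 2 and 3.
toy_sim = np.array([
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.4],
    [0.7, 0.4, 1.0],
])

# Without this step, every user's best match would be himself or herself.
np.fill_diagonal(toy_sim, 0)

toy_sim_df = pd.DataFrame(toy_sim, index=[1, 2, 3], columns=[1, 2, 3])

# For each row (user), the column label (other user) with the largest similarity.
nearest = toy_sim_df.idxmax(axis=1)
print(nearest)
```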

In [69]:
user_sim_df.iloc[1:5, 365:392]

Unnamed: 0,366,367,368,369,370,371,372,373,374,375,...,383,384,385,386,387,388,389,390,391,392
2,0.300074,0.031699,0.008637,0.016431,0.034816,0.0,0.012827,0.019976,0.025745,0.069168,...,0.0,0.0,0.013884,0.040741,0.035229,0.0,0.0,0.01452,0.019561,0.0
3,0.004494,0.008275,0.057148,0.015957,0.0,0.0,0.015347,0.003621,0.0,0.0,...,0.0,0.0,0.013088,0.0,0.02508,0.0,0.0,0.02562,0.010147,0.0
4,0.007239,0.14317,0.148244,0.148745,0.154995,0.031229,0.175466,0.078816,0.044272,0.082331,...,0.036184,0.060981,0.198938,0.088377,0.217468,0.021851,0.030786,0.063068,0.317541,0.10262
5,0.050216,0.066044,0.116216,0.00917,0.144276,0.016383,0.273186,0.415042,0.331518,0.087494,...,0.021046,0.07666,0.27481,0.341885,0.078773,0.0,0.032686,0.062746,0.147868,0.021012


The output shows that the user most similar to userid 2 is userid 366, with a cosine similarity of 0.300074. But why is user 366 most 
similar to user 2? This can be explained intuitively if we can verify that the two users have watched several movies in common 
and rated them similarly. For this, we need to read the movies dataset, which contains the movie id along with the movie title.


Movie titles are entered manually or imported from https://www.themoviedb.org/ and include the year of release in parentheses. 
Errors and inconsistencies may exist in these titles. The movies file can be loaded using the following code:

In [59]:
movies_df=pd.read_csv(r"C:\Users\dataset\movies.csv") 

In [60]:
movies_df[0:5]
 

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [62]:
movies_df.drop( 'genres', axis = 1, inplace = True )

In [67]:
def  get_user_similar_movies( user1, user2 ):
    # Inner join between movies watched between two users will give the common movies watched.
    common_movies = ratingsdf[ratingsdf.userId == user1].merge( ratingsdf[ratingsdf.userId == user2], on = "movieId", how = "inner" )
    return common_movies.merge( movies_df, on = 'movieId' ) 


In [70]:
common_movies = get_user_similar_movies( 1, 266 )
common_movies[ (common_movies.rating_x >= 4.0) & ((common_movies.rating_y >= 4.0))]
 

Unnamed: 0,userId_x,movieId,rating_x,userId_y,rating_y,title
1,1,6,4.0,266,4.0,Heat (1995)
2,1,50,5.0,266,4.0,"Usual Suspects, The (1995)"
3,1,110,4.0,266,5.0,Braveheart (1995)
5,1,235,4.0,266,4.0,Ed Wood (1994)
6,1,260,5.0,266,4.0,Star Wars: Episode IV - A New Hope (1977)
9,1,356,4.0,266,4.0,Forrest Gump (1994)
12,1,457,5.0,266,4.0,"Fugitive, The (1993)"
13,1,480,4.0,266,4.0,Jurassic Park (1993)
14,1,592,4.0,266,4.0,Batman (1989)
15,1,608,5.0,266,5.0,Fargo (1996)


In [71]:
common_movies = get_user_similar_movies( 2, 332 ) 
common_movies


Unnamed: 0,userId_x,movieId,rating_x,userId_y,rating_y,title
0,2,318,3.0,332,4.5,"Shawshank Redemption, The (1994)"
1,2,3578,4.0,332,3.5,Gladiator (2000)
2,2,6874,4.0,332,4.0,Kill Bill: Vol. 1 (2003)
3,2,58559,4.5,332,4.0,"Dark Knight, The (2008)"
4,2,68157,4.5,332,3.5,Inglourious Basterds (2009)
5,2,79132,4.0,332,4.0,Inception (2010)
6,2,91658,2.5,332,1.0,"Girl with the Dragon Tattoo, The (2011)"
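
The notebook stops at finding similar users and does not show how a rating is predicted. One common way to do this, assumed here and not taken from the notebook, is a similarity-weighted average of the ratings that other users gave to the movie. The sketch below uses made-up data; `predict_rating`, `ratings` and `sim_to_target` are hypothetical names.

```python
import numpy as np

# Hypothetical data: rows are users, columns are movies, 0 = not rated.
ratings = np.array([
    [5.0, 0.0, 3.0],   # target user (row 0) has not rated movie 1
    [4.0, 2.0, 3.0],
    [5.0, 4.0, 0.0],
])

# Similarity of the target user to every user; the user's own entry is 0.
sim_to_target = np.array([0.0, 0.9, 0.6])

def predict_rating(ratings, sims, movie_idx):
    """Similarity-weighted mean over the users who actually rated the movie."""
    rated = ratings[:, movie_idx] > 0
    weights = sims[rated]
    if weights.sum() == 0:
        return float("nan")  # no similar user has rated this movie
    return float(np.dot(weights, ratings[rated, movie_idx]) / weights.sum())

pred = predict_rating(ratings, sim_to_target, movie_idx=1)
print(round(pred, 2))
```

In practice the average is often restricted to the top-k most similar users, and ratings are often mean-centred per user first. Both refinements are omitted here to keep the sketch short.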


Finding user similarity does not work for new users. We need to wait until the new user buys a few items and rates them. 
Only then can users with similar preferences be found and recommendations be made. This is called the cold start 
problem in recommender systems. One way to address it is item-based similarity. Item-based similarity rests on the 
notion that if two items have been bought by many users and rated similarly, then there must be some inherent relationship
between the two items. In other words, a user who buys one of those two items in the future will most likely buy the other one as well.


 Item-Based Similarity 
 
If two movies, A and B, have been watched by many of the same users and rated similarly, then the two movies are likely to 
appeal to similar tastes. In other words, a user who watches movie A is likely to watch movie B as well, and vice versa.
The code below uses the correlation metric instead of cosine, so the similarities can range from -1 to 1.

In [94]:
rating_mat = ratingsdf.pivot( index='movieId', columns='userId', values = "rating").reset_index(drop = True) 
rating_mat.fillna(0, inplace = True)
movie_sim = 1 - pairwise_distances( rating_mat.values, metric="correlation")
np.fill_diagonal(movie_sim , 0 )  
movie_sim_df = pd.DataFrame( movie_sim ) 

In [93]:
movie_sim_df.iloc[0:5, 0:5]
 

Unnamed: 0,0,1,2,3,4
0,0.0,0.231327,0.173213,-0.028917,0.192474
1,0.231327,0.0,0.191945,0.071269,0.200526
2,0.173213,0.191945,0.0,0.067143,0.370171
3,-0.028917,0.071269,0.067143,0.0,0.16791
4,0.192474,0.200526,0.370171,0.16791,0.0


In [74]:
movie_sim_df.shape 

(9724, 9724)
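
With `metric="correlation"`, `1 - pairwise_distances(...)` is the Pearson correlation between rows, which explains the negative values in the output above. A minimal check on made-up data (the array `toy_movies` is hypothetical) compares it with `np.corrcoef`:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Hypothetical movie-by-user rating matrix (rows are movies, 0 = not rated).
toy_movies = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
])

# Correlation distance is 1 - Pearson correlation, so subtract from 1.
movie_corr = 1 - pairwise_distances(toy_movies, metric="correlation")

# np.corrcoef treats each row as a variable and returns the same matrix.
expected = np.corrcoef(toy_movies)
print(np.round(movie_corr, 3))
```

Movies 0 and 2 are rated in opposite ways by the same users, so their correlation is negative, while movies 0 and 1 are positively correlated.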

In [95]:
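def get_similar_movies( movieid, topN = 5 ):
    # Row position of the movie in movies_df. NOTE: it is reused as the row position in
    # movie_sim_df, which is only valid if movies_df lists exactly the rated movies, in movieId order.
    movieidx = movies_df[movies_df.movieId == movieid].index[0]
    # Side effect: adds (or overwrites) a 'similarity' column on movies_df
    movies_df['similarity'] = movie_sim_df.iloc[movieidx]
    top_n = movies_df.sort_values( ["similarity"], ascending = False )[0:topN]
    return top_n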

In [96]:
movies_df[movies_df.movieId == 615]


Unnamed: 0,movieId,title,similarity
526,615,Bread and Chocolate (Pane e cioccolata) (1973),-0.041059


In [102]:
get_similar_movies(1089) 

Unnamed: 0,movieId,title,similarity
908,1207,To Kill a Mockingbird (1962),0.491685
922,1221,"Godfather: Part II, The (1974)",0.485977
964,1265,Groundhog Day (1993),0.456386
827,1088,Dirty Dancing (1987),0.447339
1644,2193,Willow (1988),0.430249
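
The lookup above relies on row positions in movies_df matching row positions in the similarity matrix. If the movies file contains movies that nobody has rated, the two no longer line up. A possible alternative, sketched here on made-up data and not taken from the notebook, keeps movieId as the label on both axes so that no positional assumption is needed. All names (`toy_ratings`, `toy_movies`, `sim_by_id`, `similar_movies_by_id`) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 30, 10, 30, 10, 30, 40],
    "rating":  [5.0, 4.0, 4.0, 5.0, 1.0, 2.0, 5.0],
})
# movieId 20 has no ratings, so it never appears in the rating matrix.
toy_movies = pd.DataFrame({
    "movieId": [10, 20, 30, 40],
    "title":   ["A", "B", "C", "D"],
})

# Keep movieId as the index instead of resetting it.
mat = toy_ratings.pivot(index="movieId", columns="userId", values="rating").fillna(0)
sim = 1 - pairwise_distances(mat.values, metric="correlation")
np.fill_diagonal(sim, 0)
sim_by_id = pd.DataFrame(sim, index=mat.index, columns=mat.index)

def similar_movies_by_id(movie_id, top_n=2):
    """Look up similarities by movieId label and join titles on movieId."""
    scores = sim_by_id.loc[movie_id].rename("similarity").reset_index()
    scores = scores[scores.movieId != movie_id]  # drop the movie itself
    ranked = scores.merge(toy_movies, on="movieId").sort_values(
        "similarity", ascending=False
    )
    return ranked.head(top_n)

result = similar_movies_by_id(10)
print(result)
```

Because the join is on movieId, the unrated movie "B" simply does not appear in the result, and it does not shift the other rows.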
