# Collaborative Filtering Movie Recommendation System

An item-based recommendation system compares items with one another and recommends items similar to those a user has already rated. When the user rates an item favorably, the system looks up that item's nearest neighbors, i.e. the items most similar to it, and recommends them.

Item similarity can be measured with Euclidean distance, Pearson correlation, or cosine similarity (angular distance), among other metrics.
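To make the three measures concrete, here is a minimal sketch on two made-up rating vectors. The arrays `movie_a` and `movie_b` are hypothetical and are not taken from the dataset used in this notebook.

```python
import numpy as np

# Hypothetical ratings given by the same five users to two movies.
movie_a = np.array([5.0, 4.0, 1.0, 2.0, 4.5])
movie_b = np.array([4.5, 4.0, 1.5, 1.0, 5.0])

# Euclidean distance: a smaller value means the movies are rated more alike.
euclidean = np.linalg.norm(movie_a - movie_b)

# Cosine similarity: cosine of the angle between the two rating vectors.
cosine = movie_a @ movie_b / (np.linalg.norm(movie_a) * np.linalg.norm(movie_b))

# Pearson correlation: equivalent to cosine similarity of the mean-centered vectors.
pearson = np.corrcoef(movie_a, movie_b)[0, 1]

print(f"euclidean={euclidean:.3f}, cosine={cosine:.3f}, pearson={pearson:.3f}")
```

Note that Euclidean distance is a distance (lower is more similar), while cosine similarity and Pearson correlation are similarities (higher is more similar).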

In this project, the similarity will be determined based on Pearson correlation.
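For reference, the Pearson correlation between two movies $i$ and $j$ can be written as follows. The notation is introduced here for illustration only: $r_{u,i}$ is the rating user $u$ gave movie $i$, and $\bar{r}_i$ is the mean of movie $i$'s rating column.

$$
\mathrm{sim}(i, j) = \frac{\sum_{u} (r_{u,i} - \bar{r}_i)(r_{u,j} - \bar{r}_j)}{\sqrt{\sum_{u} (r_{u,i} - \bar{r}_i)^2}\,\sqrt{\sum_{u} (r_{u,j} - \bar{r}_j)^2}}
$$

The value lies between -1 and 1. Which users the sums run over depends on how missing ratings are handled.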

In [2]:
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

In [5]:
movies_df = pd.read_csv('D:/DATA science/Collaborative Filtering Dataset/dataset/movies.csv')
ratings_df = pd.read_csv('D:/DATA science/Collaborative Filtering Dataset/dataset/ratings.csv')

In [9]:
# Join each rating to its movie metadata on the shared movieId column
movie_ratings_df = pd.merge(movies_df, ratings_df, on='movieId')
movie_ratings_df.head()

,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,1510577970
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,1305696483


In [11]:
#For collaborative filtering, we will only need the users and their ratings for the movies. Drop the other columns.
movie_ratings_df.drop(['genres','timestamp'],axis=1,inplace=True)
movie_ratings_df.head()

,movieId,title,userId,rating
0,1,Toy Story (1995),1,4.0
1,1,Toy Story (1995),5,4.0
2,1,Toy Story (1995),7,4.5
3,1,Toy Story (1995),15,2.5
4,1,Toy Story (1995),17,4.5


In [21]:
# Since we are building an item-based recommender, restructure the dataframe so that the rows are user IDs,
# the columns are movie titles, and the values are ratings.

user_ratings = movie_ratings_df.pivot_table(index='userId',columns='title',values='rating')
user_ratings.head()

userId,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,
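The reshaping step can be seen more easily on a tiny example. The sketch below uses a hypothetical `toy_ratings` frame, not the real dataset, to show how `pivot_table` turns one row per (user, movie) pair into a user-by-movie matrix, with `NaN` wherever a user has not rated a movie.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair.
toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "title": ["Movie A", "Movie B", "Movie A", "Movie B", "Movie C"],
    "rating": [4.0, 3.0, 5.0, 2.5, 4.5],
})

# Rows become users, columns become titles; unrated pairs are left as NaN.
toy_matrix = toy_ratings.pivot_table(index="userId", columns="title", values="rating")
print(toy_matrix)
```

If the same user had several ratings for one title, `pivot_table` would average them, since its default aggregation is the mean.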


In [22]:
print("Number of Users : {}, Number of Movies : {}".format(user_ratings.shape[0],user_ratings.shape[1]))

Number of Users : 610, Number of Movies : 9719


Let's reduce the number of movies for the sake of this project and keep only the movies that have at least 15 ratings.

In [24]:
user_ratings.dropna(thresh=15,axis=1,inplace=True)
print("Number of Users : {}, Number of Movies : {}".format(user_ratings.shape[0],user_ratings.shape[1]))

Number of Users : 610, Number of Movies : 1650
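The `thresh` argument of `dropna` counts non-missing values, so with `axis=1` a column survives only if it has at least `thresh` ratings. A small sketch on a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical user-by-movie matrix with missing ratings.
toy = pd.DataFrame({
    "Movie A": [4.0, 5.0, 3.0],        # 3 ratings
    "Movie B": [np.nan, 2.0, np.nan],  # 1 rating
    "Movie C": [1.0, np.nan, 4.5],     # 2 ratings
})

# Keep only the columns (movies) with at least 2 non-NaN ratings.
kept = toy.dropna(thresh=2, axis=1)

# The same filter written explicitly with a per-column count of ratings.
kept_explicit = toy.loc[:, toy.count() >= 2]

print(list(kept.columns))
```

The explicit form makes the "at least N ratings" rule visible in the code, which may be easier to read than `thresh`.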


In [35]:
# Replace the NaNs with 0. Note that the correlation will then treat an unrated movie as a rating of 0.
user_ratings = user_ratings.fillna(0)
user_ratings.head()

userId,"'burbs, The (1989)",(500) Days of Summer (2009),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),12 Angry Men (1957),12 Years a Slave (2013),127 Hours (2010),13 Going on 30 (2004),...,Young Guns (1988),Zack and Miri Make a Porno (2008),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zootopia (2016),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Similarity of Items (Movies)

In [36]:
similarItems_df = user_ratings.corr(method = 'pearson')

In [37]:
similarItems_df.head()

title,"'burbs, The (1989)",(500) Days of Summer (2009),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),12 Angry Men (1957),12 Years a Slave (2013),127 Hours (2010),13 Going on 30 (2004),...,Young Guns (1988),Zack and Miri Make a Porno (2008),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zootopia (2016),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
"'burbs, The (1989)",1.0,0.063117,0.143482,0.011998,0.087931,0.224052,0.034223,0.009277,0.008331,0.0497,...,0.248535,0.017477,0.134701,0.153158,0.101301,0.049897,0.003233,0.187953,0.062174,0.353194
(500) Days of Summer (2009),0.063117,1.0,0.273989,0.19396,0.148903,0.142141,0.159756,0.135486,0.200135,0.297152,...,0.073476,0.374515,0.068407,0.414585,0.355723,0.252226,0.216007,0.053614,0.241092,0.125905
10 Things I Hate About You (1999),0.143482,0.273989,1.0,0.24467,0.223481,0.211473,0.011784,0.091964,0.043383,0.321071,...,0.152333,0.243118,0.13246,0.091853,0.158637,0.281934,0.050031,0.121029,0.130813,0.110612
"10,000 BC (2008)",0.011998,0.19396,0.24467,1.0,0.234459,0.119132,0.059187,-0.025882,0.089328,0.167098,...,0.065201,0.260261,0.094913,0.184521,0.242299,0.240231,0.094773,0.088045,0.203002,0.083518
101 Dalmatians (1996),0.087931,0.148903,0.223481,0.234459,1.0,0.285112,0.119843,0.072399,0.029967,0.188467,...,0.033582,0.114968,0.096294,0.067134,0.113224,0.184324,0.054024,0.047804,0.156932,0.078734
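Because the missing ratings were filled with 0 before calling `corr`, every unrated movie counts as a rating of 0 in the correlations above. As a point of comparison only, and not the approach taken in this notebook, the sketch below contrasts that with correlating over co-rating users, using the `min_periods` argument of `DataFrame.corr`. The `toy` frame is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings of two movies by six users; NaN means "not rated".
toy = pd.DataFrame({
    "Movie A": [5.0, 4.0, np.nan, 1.0, np.nan, 2.0],
    "Movie B": [4.0, 5.0, np.nan, 2.0, 3.0, np.nan],
})

# As in the notebook: unrated entries become 0 before correlating.
zero_filled = toy.fillna(0).corr(method="pearson")

# Alternative: use only users who rated both movies, and require
# at least 3 such users for a correlation to be reported.
pairwise = toy.corr(method="pearson", min_periods=3)

print(zero_filled.loc["Movie A", "Movie B"])
print(pairwise.loc["Movie A", "Movie B"])
```

The two numbers differ because the zero-filled version also rewards or penalizes movies for being unrated by the same users. Which behavior is preferable depends on the data and is a design choice.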


In [55]:
# Function to get similarity scores for one rated movie
def get_similar(movie, rating):
    # Subtract 2.5 (roughly the middle of the rating scale, used here in place of the true mean)
    # to center the rating around 0. A high rating then boosts similar movies,
    # while a low rating gives them negative scores.
    sim_score = similarItems_df[movie] * (rating - 2.5)
    sim_score = sim_score.sort_values(ascending=False)
    return sim_score
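To see why the rating is centered, here is a small sketch of the same idea on a hypothetical 3x3 similarity matrix. The names `toy_sim`, `toy_get_similar`, and `midpoint` are made up for this illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical item-item similarity matrix.
titles = ["Movie A", "Movie B", "Movie C"]
toy_sim = pd.DataFrame(
    [[1.0, 0.8, -0.2],
     [0.8, 1.0, 0.1],
     [-0.2, 0.1, 1.0]],
    index=titles, columns=titles,
)

def toy_get_similar(movie, rating, sim=toy_sim, midpoint=2.5):
    """Scale one movie's similarity column by the centered rating."""
    return (sim[movie] * (rating - midpoint)).sort_values(ascending=False)

liked = toy_get_similar("Movie A", 5)     # positive weight: similar movies rise
disliked = toy_get_similar("Movie A", 1)  # negative weight: similar movies sink

print(liked)
print(disliked)
```

With a rating of 5, the highly similar "Movie B" gets a large positive score. With a rating of 1, the sign flips, so "Movie B" is pushed down and the dissimilar "Movie C" ends up on top.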

### Creating a User to Get Recommendations Based on Their Ratings

In [70]:
Chinmay = [("Young Guns (1988)",4),("12 Angry Men (1957)",1),("'burbs, The (1989)",1),("Zootopia (2016)",5)]

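# DataFrame.append was removed in pandas 2.0, so collect one score Series per rated movie
# and build the frame in a single step (one row per rated movie).
score_rows = [get_similar(movie, rating) for movie, rating in Chinmay]
similar_movies = pd.DataFrame(score_rows).reset_index(drop=True)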
    
similar_movies.head()

,"'burbs, The (1989)",(500) Days of Summer (2009),10 Things I Hate About You (1999),"10,000 BC (2008)",101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),12 Angry Men (1957),12 Years a Slave (2013),127 Hours (2010),13 Going on 30 (2004),...,Young Guns (1988),Zack and Miri Make a Porno (2008),Zero Effect (1998),Zodiac (2007),Zombieland (2009),Zoolander (2001),Zootopia (2016),eXistenZ (1999),xXx (2002),¡Three Amigos! (1986)
0,0.372803,0.110215,0.228499,0.097802,0.050373,0.214509,0.209483,-0.013622,0.052568,0.108408,...,1.5,0.22084,0.376937,0.303925,0.178494,0.129839,0.032806,0.133529,0.143013,0.56314
1,-0.051335,-0.239633,-0.017676,-0.08878,-0.179765,-0.201055,-1.5,-0.199469,-0.088294,0.041508,...,-0.209483,-0.156777,-0.119857,-0.362153,-0.216978,-0.183161,-0.085113,0.002562,-0.111459,-0.154116
2,-1.5,-0.094675,-0.215222,-0.017996,-0.131896,-0.336079,-0.051335,-0.013915,-0.012496,-0.07455,...,-0.372803,-0.026215,-0.202052,-0.229737,-0.151951,-0.074846,-0.004849,-0.28193,-0.093261,-0.529791
3,0.008082,0.540017,0.125079,0.236933,0.135061,0.193986,0.141854,0.158314,0.564368,0.174005,...,0.054677,0.715534,0.028545,0.535963,0.746261,0.270368,2.5,0.117213,0.501904,0.051488
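Each row above holds the scores contributed by one of the user's rated movies. One way to combine such rows into a ranking is to sum each column and, optionally, drop the titles the user has already rated. The sketch below does this on hypothetical data. The names `toy_scores` and `already_rated` are made up, and dropping rated titles is an assumption about what a final list should contain, not a step taken in this notebook.

```python
import pandas as pd

# Hypothetical score rows: one row per movie the user rated.
toy_scores = pd.DataFrame(
    [[2.5, 2.0, -0.5, 1.0],
     [-0.3, -1.2, -1.5, 0.6]],
    columns=["Movie A", "Movie B", "Movie C", "Movie D"],
)

# Titles the (hypothetical) user has already rated.
already_rated = ["Movie A", "Movie C"]

# Sum down each column to get one total score per movie.
totals = toy_scores.sum()

# Remove already-rated titles, then rank the rest from best to worst.
recommendations = totals.drop(labels=already_rated).sort_values(ascending=False)
print(recommendations)
```

Without the `drop`, the movies the user rated highest tend to head the list, because each movie has a similarity of 1.0 with itself.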


### Final Recommendation List

In [71]:
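# To get the final recommendations, sum each column (i.e. over the rows) to get a total similarity score per movie based on the user's ratings.
# Movies the user has already rated (e.g. Zootopia) still appear in this list.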
similar_movies.sum().sort_values(ascending=False)

Zootopia (2016)                                         2.442845
Deadpool (2016)                                         1.373434
Star Wars: Episode VII - The Force Awakens (2015)       1.312780
Big Hero 6 (2014)                                       1.267366
Inside Out (2015)                                       1.249747
Doctor Strange (2016)                                   1.225116
Mad Max: Fury Road (2015)                               1.224776
Kingsman: The Secret Service (2015)                     1.222286
Guardians of the Galaxy (2014)                          1.189335
The Lego Movie (2014)                                   1.092618
How to Train Your Dragon (2010)                         1.033517
Brave (2012)                                            1.013548
Logan (2017)                                            0.982437
Other Guys, The (2010)                                  0.981998
Ant-Man (2015)                                          0.975462
Young Guns (1988)        