Movie Recommendation System

Collaborative filtering recommendation systems with MovieLens 20m Datasets
This dataset was constructed to support participants in the Netflix Prize

Recommandation system passed to users suggestions on what they might like according to their preferences
is a machine learning application that offers.

In [1]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

movie = pd.read_csv('../input/movielense20m/movie.csv')
rating = pd.read_csv('../input/movielense20m/rating.csv')
df = movie.merge(rating, how="left", on="movieId")

In [2]:
# Data Preparation
# we specify the parantheses so we don't conflict with movies that have years in their titles
df['year_movie'] = df.title.str.extract('(\(\d\d\d\d\))', expand=False)
#Removing the parentheses
df['year_movie'] = df.year_movie.str.extract('(\d\d\d\d)', expand=False)

#Removing the years from the 'title' column
df['title'] = df.title.str.replace('(\(\d\d\d\d\))', '',regex=True)
# To get rid of all the ending white space characters that might have appeared 
df['title'] = df['title'].apply(lambda x: x.strip())

a = pd.DataFrame(df["title"].value_counts())
rare_movies = a[a["title"] <= 1000].index
common_movies = df[~df["title"].isin(rare_movies)]
user_movie_df = common_movies.pivot_table(index=["userId"], columns=["title"], values="rating")

In [3]:
# genres
# Every genre is separated by a | so we simply have to call the split function on |
df["genre"] = df["genres"].apply(lambda x: x.split("|")[0])
df.drop("genres", inplace=True, axis=1)
df.head()

Unnamed: 0,movieId,title,userId,rating,timestamp,year_movie,genre
0,1,Toy Story,3.0,4.0,1999-12-11 13:36:47,1995,Adventure
1,1,Toy Story,6.0,5.0,1997-03-13 17:50:52,1995,Adventure
2,1,Toy Story,8.0,4.0,1996-06-05 13:37:51,1995,Adventure
3,1,Toy Story,10.0,4.0,1999-11-25 02:44:47,1995,Adventure
4,1,Toy Story,11.0,4.5,2009-01-02 01:13:41,1995,Adventure


In [4]:
# editting the date format 
df["timestamp"] = pd.to_datetime(df["timestamp"], format='%Y-%m-%d')
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month
df["day"] = df["timestamp"].dt.day
df.head()


Unnamed: 0,movieId,title,userId,rating,timestamp,year_movie,genre,year,month,day
0,1,Toy Story,3.0,4.0,1999-12-11 13:36:47,1995,Adventure,1999.0,12.0,11.0
1,1,Toy Story,6.0,5.0,1997-03-13 17:50:52,1995,Adventure,1997.0,3.0,13.0
2,1,Toy Story,8.0,4.0,1996-06-05 13:37:51,1995,Adventure,1996.0,6.0,5.0
3,1,Toy Story,10.0,4.0,1999-11-25 02:44:47,1995,Adventure,1999.0,11.0,25.0
4,1,Toy Story,11.0,4.5,2009-01-02 01:13:41,1995,Adventure,2009.0,1.0,2.0


Collaborative Filtering

* This filtering method is usually based on collecting and analyzing information on user’s behaviors, 
their activities or preferences and predicting what they will like based on the similarity with other
users.

User-User Collaborative Filtering: Here, we try to search for lookalike customers and offer products 
based on what his/her lookalike has chosen. This algorithm is very effective but takes 
a lot of time and resources.


In [5]:
# choosing random a user
user_id = int(pd.Series(user_movie_df.index).sample(1).values)
user_id

116684

In [6]:
# Determining the movies watched by the recommended user
user_df = user_movie_df[user_movie_df.index == user_id]

movies_watched = user_df.columns[user_df.notna().any()].tolist()
len(movies_watched)
# the number of movies the user has watched 

128

In [7]:
# Accessing data and Ids of other users watching the same movies 
movies_watched_df = user_movie_df[movies_watched]
movies_watched_df.head()
movies_watched_df.shape
# There are 138493 users who watched at least one of the movies the user watched.

(138493, 128)

In [8]:
# each user watched how many of these 455 movies
user_movie_count = movies_watched_df.T.notnull().sum()
user_movie_count.head()

userId
1.0    33
2.0    12
3.0    35
4.0     9
5.0    18
dtype: int64

In [9]:
user_movie_count = user_movie_count.reset_index()
user_movie_count.columns = ["userId", "movie_count"]
m_count = movies_watched_df.shape[1]   # 455

# choosen a ratio of 0.60. User ids who watched at least percent of 60 these movies 
users_same_movies=user_movie_count[user_movie_count["movie_count"]/m_count > 0.6].sort_values("movie_count", ascending=False)
users_same_movies.nunique()

userId         2257
movie_count      49
dtype: int64

In [10]:
users_same_movies.head(10)

Unnamed: 0,userId,movie_count
116683,116684.0,128
83089,83090.0,128
118204,118205.0,127
8404,8405.0,126
92010,92011.0,123
118753,118754.0,122
80885,80886.0,122
57734,57735.0,122
76629,76630.0,121
107325,107326.0,121


Determining the most similar users to the user to be recommended 

In [11]:
# We bring together users who watch the same movies with the user. (not overlook the above ratio)
final_df = pd.concat([movies_watched_df[movies_watched_df.index.isin(users_same_movies.index)],
                      user_df[movies_watched]])

final_df.head()

title,12 Angry Men,Air Force One,Alien,Aliens,Assassins,Back to the Future,Bambi,Basic Instinct,Ben-Hur,Blade Runner,Bloodsport,"Blues Brothers, The",Braveheart,Casablanca,Casino,Clear and Present Danger,Con Air,Crocodile Dundee,Die Hard,Die Hard 2,Die Hard: With a Vengeance,Dumbo,Dune,E.T. the Extra-Terrestrial,Elizabeth,Enemy of the State,Entrapment,Escape to Witch Mountain,Face/Off,Fantasia,Farewell My Concubine (Ba wang bie ji),Fargo,"Femme Nikita, La (Nikita)",Ferris Bueller's Day Off,"Firm, The",First Blood (Rambo: First Blood),For Your Eyes Only,Forrest Gump,Four Weddings and a Funeral,French Kiss,"Fugitive, The",Glory,"Godfather, The","Godfather: Part II, The","Godfather: Part III, The","Good, the Bad and the Ugly, The (Buono, il brutto, il cattivo, Il)",Goodfellas,Grease,"Hand That Rocks the Cradle, The",Heartbreak Ridge,High Noon,Highlander,Holiday Inn,"Hunt for Red October, The",I.Q.,Indiana Jones and the Last Crusade,Indiana Jones and the Temple of Doom,It's a Wonderful Life,"Joy Luck Club, The",Jumanji,Kazaam,"Killer, The (Die xue shuang xiong)","Killing Fields, The","King and I, The",Legend,Legends of the Fall,Lethal Weapon,Lethal Weapon 2,"Little Mermaid, The",Léon: The Professional (a.k.a. The Professional) (Léon),"Mask, The","Matrix, The",Meet Me in St. Louis,Moonraker,My Life as a Dog (Mitt liv som hund),"Negotiator, The",Out of Sight,Patriot Games,"Patriot, The","Peacemaker, The","Postman, The (Postino, Il)",Predator,Pulp Fiction,Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark),Ransom,Reservoir Dogs,"Road Warrior, The (Mad Max 2)",Rob Roy,RoboCop,"Rock, The",Roman Holiday,Romeo Must Die,"Running Man, The",Rush Hour,Seven (a.k.a. Se7en),"Siege, The","Silence of the Lambs, The",Singin' in the Rain,"Sixth Sense, The",Sleepless in Seattle,Sleepy Hollow,Species,Speed,Star Wars: Episode I - The Phantom Menace,Star Wars: Episode IV - A New Hope,Star Wars: Episode V - The Empire Strikes Back,Star Wars: Episode VI - Return of the Jedi,Strictly Ballroom,Superman II,Terminator 2: Judgment Day,"Terminator, The","Thomas Crown Affair, The",Three Colors: Blue (Trois couleurs: Bleu),Three Colors: Red (Trois couleurs: Rouge),Titanic,To Kill a Mockingbird,Total Recall,Toy Story,Toys,Tron,True Lies,Twelve Monkeys (a.k.a. 12 Monkeys),"Usual Suspects, The",When Harry Met Sally...,While You Were Sleeping,White Christmas,Willow,Witness
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1,Unnamed: 82_level_1,Unnamed: 83_level_1,Unnamed: 84_level_1,Unnamed: 85_level_1,Unnamed: 86_level_1,Unnamed: 87_level_1,Unnamed: 88_level_1,Unnamed: 89_level_1,Unnamed: 90_level_1,Unnamed: 91_level_1,Unnamed: 92_level_1,Unnamed: 93_level_1,Unnamed: 94_level_1,Unnamed: 95_level_1,Unnamed: 96_level_1,Unnamed: 97_level_1,Unnamed: 98_level_1,Unnamed: 99_level_1,Unnamed: 100_level_1,Unnamed: 101_level_1,Unnamed: 102_level_1,Unnamed: 103_level_1,Unnamed: 104_level_1,Unnamed: 105_level_1,Unnamed: 106_level_1,Unnamed: 107_level_1,Unnamed: 108_level_1,Unnamed: 109_level_1,Unnamed: 110_level_1,Unnamed: 111_level_1,Unnamed: 112_level_1,Unnamed: 113_level_1,Unnamed: 114_level_1,Unnamed: 115_level_1,Unnamed: 116_level_1,Unnamed: 117_level_1,Unnamed: 118_level_1,Unnamed: 119_level_1,Unnamed: 120_level_1,Unnamed: 121_level_1,Unnamed: 122_level_1,Unnamed: 123_level_1,Unnamed: 124_level_1,Unnamed: 125_level_1,Unnamed: 126_level_1,Unnamed: 127_level_1,Unnamed: 128_level_1
53.0,5.0,,,,,,,,2.0,,,,,3.0,,,,,,,,,,5.0,,,,,,,5.0,5.0,,,,,,,,,,,4.0,4.0,,,,,,,,,,,,,,,,,,,5.0,,,,,,,,,,,,5.0,,,,,,5.0,,3.0,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,5.0,,4.0,,,,4.0,5.0,,,,,
90.0,,,3.5,,,,,,,,,,3.0,,,,3.5,,,,,,,,,,,,,,,3.5,,3.0,,,,5.0,,,,,,,,,,,,,,,,,,4.0,,,,,,,,,,,,,3.0,,3.5,2.0,,,,,,,4.5,,,,3.5,,,,,,,3.5,,,,,4.5,,3.5,,4.0,,,,2.5,3.0,3.5,3.5,4.0,,,3.0,,,,,2.5,4.0,,3.5,,,4.5,,,,,,,
115.0,,,,,,3.0,,,,,,,,,,,,,,,2.5,,,,,,,,,,,,,,,,,,,,,,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4.0,,,,,,,,,,,5.0,4.0,,4.5,,,,,,,,,4.5,,,,3.5,,,,1.5,,,,,,,,,,,,,,,1.5,,,,,,,,,,
155.0,,,4.0,4.0,,4.0,,,,,,,,2.5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,,,,,,,,,,,,3.5,,,,,,,,,,,,,,,,5.0,,,,,,,,,,,5.0,3.5,,,,,,,,,,,,,,,3.5,,,,,,4.5,4.5,4.0,,,,,,,,,,,2.5,,,,4.5,4.0,,,,,
207.0,,2.0,4.0,3.0,,,,,,,,4.0,,,,,2.0,,3.0,,,,,,,,,,2.0,,,,,,,,,,,,,,5.0,5.0,,,,,,,,,,3.0,,,,,,,,,,,,,3.0,,,,,,,,,,4.0,,,,,,,4.0,,,,,,3.0,,,,,,,,,,,,,3.0,,4.0,3.0,3.0,,,3.0,4.0,,,,3.0,,,,,,1.0,,,,,,,


In [12]:
corr_df = final_df.T.corr().unstack().sort_values().drop_duplicates()
corr_df = pd.DataFrame(corr_df, columns=["corr"])
corr_df.index.names = ['user_id_1', 'user_id_2']
corr_df = corr_df.reset_index()
corr_df.head()

Unnamed: 0,user_id_1,user_id_2,corr
0,50745.0,47985.0,-1.0
1,37316.0,13848.0,-1.0
2,55236.0,4149.0,-1.0
3,66532.0,88819.0,-1.0
4,48643.0,125584.0,-1.0


We are interested in users who have a high correlation with the user_id.
Could analysis it by entering a value

In [13]:
top_users = corr_df[(corr_df["user_id_1"] == user_id) & (corr_df["corr"] >= 0.60)][
    ["user_id_2", "corr"]].reset_index(drop=True)

top_users = top_users.sort_values(by='corr', ascending=False)

# Users who have more than 0.60 corr with the user
top_users.rename(columns={"user_id_2": "userId"}, inplace=True) 
top_users.T

Unnamed: 0,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
userId,135424.0,126394.0,15836.0,94627.0,125895.0,137163.0,104062.0,37293.0,100253.0,26066.0,25959.0,6920.0,44295.0,85653.0,80390.0,113072.0
corr,0.908108,0.783092,0.768854,0.766419,0.678287,0.677631,0.676209,0.663947,0.641026,0.629059,0.608175,0.60746,0.605206,0.604254,0.603055,0.602512


In [14]:
# Calculation weighted rating
rating = pd.read_csv('../input/movielense20m/rating.csv')
top_users_ratings = top_users.merge(rating[["userId", "movieId", "rating"]], how='inner')
top_users_ratings['weighted_rating'] = top_users_ratings['corr'] * top_users_ratings['rating']
top_users_ratings.head()

Unnamed: 0,userId,corr,movieId,rating,weighted_rating
0,135424.0,0.908108,1,4.0,3.632433
1,135424.0,0.908108,3,3.0,2.724325
2,135424.0,0.908108,5,3.0,2.724325
3,135424.0,0.908108,14,2.0,1.816217
4,135424.0,0.908108,25,2.0,1.816217


In [15]:
# Calculating the weighted average recommendation score and keeping the top ten films
# Unique rating for movieId 
temp = top_users_ratings.groupby('movieId').sum()[['corr', 'weighted_rating']]
temp.columns = ['sum_corr', 'sum_weighted_rating']
temp.head()

Unnamed: 0_level_0,sum_corr,sum_weighted_rating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,2.18945,7.476458
2,1.232115,4.325403
3,0.908108,2.724325
5,1.537168,5.869621
10,1.885596,5.656789


In [16]:
# Calculating weighted average recommendation score and keeping top five films based on scores
recommendation_df = pd.DataFrame()
recommendation_df['weighted_average_recommendation_score'] = temp['sum_weighted_rating'] / temp['sum_corr']
recommendation_df['movieId'] = temp.index
recommendation_df = recommendation_df.sort_values(by='weighted_average_recommendation_score', ascending=False)
recommendation_df.head(7)

Unnamed: 0_level_0,weighted_average_recommendation_score,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
109374,5.0,109374
2948,5.0,2948
2692,5.0,2692
1117,5.0,1117
425,5.0,425
1148,5.0,1148
2745,5.0,2745


In [17]:
# movies that the user may like
movie = pd.read_csv('../input/movielense20m/movie.csv')
movie.loc[movie['movieId'].isin(recommendation_df.head(10)['movieId'])]

Unnamed: 0,movieId,title,genres
421,425,Blue Sky (1994),Drama|Romance
1094,1117,"Eighth Day, The (Huitième jour, Le) (1996)",Drama
1125,1148,Wallace & Gromit: The Wrong Trousers (1993),Animation|Children|Comedy|Crime
1149,1172,Cinema Paradiso (Nuovo cinema Paradiso) (1989),Drama
2606,2692,Run Lola Run (Lola rennt) (1998),Action|Crime
2659,2745,"Mission, The (1986)",Drama
2705,2791,Airplane! (1980),Comedy
2711,2797,Big (1988),Comedy|Drama|Fantasy|Romance
2862,2948,From Russia with Love (1963),Action|Adventure|Thriller
22880,109374,"Grand Budapest Hotel, The (2014)",Comedy|Drama


In [18]:
user_df = df.loc[df["userId"]==user_id]
user_df.head()

Unnamed: 0,movieId,title,userId,rating,timestamp,year_movie,genre,year,month,day
41940,1,Toy Story,116684.0,3.0,2000-06-20 00:57:44,1995,Adventure,2000.0,6.0,20.0
68447,2,Jumanji,116684.0,3.0,2000-06-21 00:02:30,1995,Adventure,2000.0,6.0,21.0
217895,16,Casino,116684.0,3.0,2000-06-21 00:17:31,1995,Crime,2000.0,6.0,21.0
309110,23,Assassins,116684.0,3.0,2000-06-21 00:19:03,1995,Action,2000.0,6.0,21.0
404010,32,Twelve Monkeys (a.k.a. 12 Monkeys),116684.0,3.0,2000-06-20 00:45:26,1995,Mystery,2000.0,6.0,20.0


In [19]:
user_df.groupby("title").agg({"rating": "max", "timestamp": "max"}).sort_values("timestamp", ascending=False)

Unnamed: 0_level_0,rating,timestamp
title,Unnamed: 1_level_1,Unnamed: 2_level_1
Species,3.0,2000-06-21 00:40:02
Braveheart,5.0,2000-06-21 00:38:17
Blade Runner,5.0,2000-06-21 00:38:00
Superman II,4.0,2000-06-21 00:32:25
First Blood (Rambo: First Blood),4.0,2000-06-21 00:32:08
"Running Man, The",4.0,2000-06-21 00:32:08
Heartbreak Ridge,3.0,2000-06-21 00:32:08
Indiana Jones and the Temple of Doom,4.0,2000-06-21 00:31:36
Bloodsport,4.0,2000-06-21 00:31:36
For Your Eyes Only,4.0,2000-06-21 00:31:13
