> Business Case:

This project intends to recommend 10 movies to a user whos ID is provided, using collaborative filtering methods.
- Item-based and User-based Recommemder Methods.
<hr><br>

> Data Description:

The dataset was provided by [MovieLens](https://grouplens.org/datasets/movielens/), a movie recommendation service. It contains the rating scores for these movies along with the movies.

It contains 2,000,0263 ratings across 27,278 movies. This data was provided by 138,493 users from January 09, 1995 to March 31, 2015.

Users are randomly selected. It is known that all selected users voted for at least 20 movies.

> #### Performing Data Preparation

In [15]:
import pandas as pd
pd.set_option("display.max_columns", 20)
pd.pandas.set_option('display.width', 300)

In [16]:
movie = pd.read_csv("movie.csv")
movie.shape

(27278, 3)

In [17]:
movie.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [18]:
rating = pd.read_csv("rating.csv")
rating.shape

(508476, 4)

In [19]:
rating.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40


Concatenating the 2 datasets

In [25]:
df = movie.merge(rating, how="left", on="movieId")
df.shape

(523440, 6)

In [26]:
df.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.0,4.0,1999-12-11 13:36:47
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,6.0,5.0,1997-03-13 17:50:52
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8.0,4.0,1996-06-05 13:37:51
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10.0,4.0,1999-11-25 02:44:47
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11.0,4.5,2009-01-02 01:13:41


In [27]:
df["title"].nunique()

27262

In [28]:
df["title"].value_counts().head()

Pulp Fiction (1994)                 1716
Forrest Gump (1994)                 1696
Silence of the Lambs, The (1991)    1578
Shawshank Redemption, The (1994)    1568
Jurassic Park (1993)                1503
Name: title, dtype: int64

In [29]:
comment_counts = pd.DataFrame(df["title"].value_counts())
rare_movies = comment_counts[comment_counts["title"] <= 1000].index
common_movies = df[~df["title"].isin(rare_movies)] 
common_movies.shape

(37289, 6)

In [30]:
common_movies.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.0,4.0,1999-12-11 13:36:47
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,6.0,5.0,1997-03-13 17:50:52
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,8.0,4.0,1996-06-05 13:37:51
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,10.0,4.0,1999-11-25 02:44:47
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,11.0,4.5,2009-01-02 01:13:41


In [35]:
common_movies["title"].nunique()

30

In [36]:
user_movie_df = common_movies.pivot_table(index=["userId"], columns=["title"], values="rating")
user_movie_df.shape

(3232, 30)

In [37]:
user_movie_df.head()

title,Aladdin (1992),American Beauty (1999),Apollo 13 (1995),Back to the Future (1985),Batman (1989),Braveheart (1995),Dances with Wolves (1990),Fargo (1996),Fight Club (1999),Forrest Gump (1994),...,"Silence of the Lambs, The (1991)",Speed (1994),Star Wars: Episode IV - A New Hope (1977),Star Wars: Episode V - The Empire Strikes Back (1980),Star Wars: Episode VI - Return of the Jedi (1983),Terminator 2: Judgment Day (1991),Toy Story (1995),True Lies (1994),Twelve Monkeys (a.k.a. 12 Monkeys) (1995),"Usual Suspects, The (1995)"
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1.0,,,,,,,,,4.0,,...,3.5,,4.0,4.5,,3.5,,,3.5,3.5
2.0,,3.0,,5.0,,4.0,,,,,...,,,5.0,5.0,5.0,5.0,,,,
3.0,,,,5.0,,,,,,,...,5.0,,5.0,5.0,5.0,4.0,4.0,,4.0,5.0
4.0,,,,,,,,,,4.0,...,,4.0,,,,4.0,,3.0,1.0,
5.0,5.0,,5.0,,,4.0,5.0,3.0,,,...,3.0,5.0,5.0,5.0,5.0,5.0,,5.0,,


In [38]:
user_movie_df.columns
len(user_movie_df.columns)

30

In [39]:
common_movies["title"].nunique()

30

> #### Determining the movies watched by the user to be recommended.

Selecting a random user

In [40]:
random_user = int(pd.Series(user_movie_df.index).sample(1, random_state=45).values)
random_user_df = user_movie_df[user_movie_df.index == random_user]
random_user_df.shape

(1, 30)

In [41]:
#Let's look at the movies watched by the user we have chosen:

movies_watched = random_user_df.columns[random_user_df.notna().any()].to_list()

In [43]:
len(movies_watched)

27

In [44]:
user_movie_df.loc[user_movie_df.index == random_user, user_movie_df.columns =="Jurassic Park (1993)"]

title,Jurassic Park (1993)
userId,Unnamed: 1_level_1
945.0,


> #### Access data and IDs of other users watching the same movies.

In [45]:
pd.set_option("display.max_columns", 5)
movies_watched_df = user_movie_df[movies_watched]
movies_watched_df.shape

(3232, 27)

In [46]:
movies_watched_df.head()

title,Aladdin (1992),American Beauty (1999),...,Twelve Monkeys (a.k.a. 12 Monkeys) (1995),"Usual Suspects, The (1995)"
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1.0,,,...,3.5,3.5
2.0,,3.0,...,,
3.0,,,...,4.0,5.0
4.0,,,...,1.0,
5.0,5.0,,...,,


In [47]:
user_movie_count = movies_watched_df.T.notnull().sum()

In [48]:
user_movie_count = user_movie_count.reset_index()
user_movie_count.columns = ["userId", "movie_count"]
user_movie_count.shape

(3232, 2)

In [49]:
user_movie_count.head()

Unnamed: 0,userId,movie_count
0,1.0,11
1,2.0,7
2,3.0,13
3,4.0,5
4,5.0,16


In [50]:
user_movie_count[user_movie_count["movie_count"] > 20].sort_values("movie_count", ascending=False)
user_movie_count[user_movie_count["movie_count"] == len(movies_watched)].count()

userId         37
movie_count    37
dtype: int64

> #### Identify the users who are most similar to the user to be suggested.

In [51]:
perc = len(movies_watched) * 60 / 100
perc

16.2

In [52]:
#person ids who watched 60% of the same movie as the user
users_same_movies = user_movie_count[user_movie_count["movie_count"] > perc]["userId"]
users_same_movies.count()

654

In [55]:
final_df = pd.concat([movies_watched_df[movies_watched_df.index.isin(users_same_movies.index)],random_user_df[movies_watched]])
final_df.shape

(617, 27)

In [56]:
final_df.T.corr()


userId,10.0,23.0,...,3231.0,945.0
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
10.0,1.000000,1.000000,...,,0.211897
23.0,1.000000,1.000000,...,,0.533002
24.0,,-0.199205,...,-0.745356,0.065348
28.0,,0.327327,...,,0.659321
31.0,-1.000000,-1.000000,...,,-0.545325
...,...,...,...,...,...
3218.0,0.301511,-0.096523,...,-0.645497,0.035939
3226.0,,-0.389167,...,1.000000,0.017188
3228.0,-0.333333,-0.349215,...,-0.577350,0.128921
3231.0,,,...,1.000000,0.963343


In [57]:
corr_df = final_df.T.corr().unstack().sort_values().drop_duplicates()
corr_df = pd.DataFrame(corr_df, columns=["corr"])
corr_df.index.names = ["user_id_1", "user_id_2"]
corr_df = corr_df.reset_index()

In [59]:
# Let's look at those that have a 65% correlation with the user:
top_users = corr_df[(corr_df["user_id_1"] == random_user) & (corr_df["corr"] >= 0.65)][["user_id_2", "corr"]].reset_index(drop=True)

# Let's take a look at the ones with the least correlation with #user:
top_users = top_users.sort_values(by="corr", ascending=False)

top_users.rename(columns={"user_id_2":"userId"}, inplace=True)
top_users.shape

(30, 2)

In [60]:
top_users.head()

Unnamed: 0,userId,corr
29,2101.0,0.958423
28,1551.0,0.941833
27,2876.0,0.869048
26,2913.0,0.857953
25,431.0,0.853913


In [62]:
# rating = pd.read_csv("rating.csv")
top_users_ratings = top_users.merge(rating[["userId", "movieId", "rating"]], how="inner")
top_users_ratings = top_users_ratings[top_users_ratings["userId"] != random_user]
top_users_ratings["userId"].unique()

array([2101., 1551., 2876., 2913.,  431., 1258., 2052., 2897.,   66.,
       3158.,  675., 1204., 2030.,  432., 2778., 1396.,  427., 2322.,
       2784., 2390., 2985., 2524., 1626.,  853., 2114., 2891.,  869.,
       3136.,   28., 3128.])

> #### Calculating Weighted Average Recommendation Score and keeping the first 5 movies.

In [63]:
top_users_ratings["weighted_rating"] = top_users_ratings["corr"] * top_users_ratings["rating"]
top_users_ratings.head()

Unnamed: 0,userId,corr,movieId,rating,weighted_rating
0,2101.0,0.958423,1,3.0,2.87527
1,2101.0,0.958423,6,4.0,3.833693
2,2101.0,0.958423,25,2.0,1.916846
3,2101.0,0.958423,32,5.0,4.792116
4,2101.0,0.958423,65,3.0,2.87527


In [64]:
top_users_ratings.groupby('movieId').agg({"weighted_rating": "mean"}) #singularization by movies

Unnamed: 0_level_0,weighted_rating
movieId,Unnamed: 1_level_1
1,3.031767
2,2.384976
3,2.232081
5,2.046955
6,3.291383
...,...
103984,3.296414
106920,3.767330
109374,3.767330
111759,2.825498


In [65]:
recommendation_df = top_users_ratings.groupby('movieId').agg({"weighted_rating": "mean"})
recommendation_df = recommendation_df.reset_index()
recommendation_df.head()

Unnamed: 0,movieId,weighted_rating
0,1,3.031767
1,2,2.384976
2,3,2.232081
3,5,2.046955
4,6,3.291383


In [66]:
movies_to_be_recommend = recommendation_df[recommendation_df["weighted_rating"] > 3.5].sort_values("weighted_rating", ascending=False)

> #### Making an Item-based suggestion based on the name of the movie that the user has watched with the highest score.

> 5 recommendations user-based. <br>
5 suggestions item-based. <br>
Make 10 suggestions.

In [67]:
movies_to_be_recommend.merge(movie[["movieId", "title"]])["title"].head()

0                Kingpin (1996)
1       Escape from L.A. (1996)
2    Singin' in the Rain (1952)
3         Producers, The (1968)
4            Toy Story 3 (2010)
Name: title, dtype: object

In [None]:
'''
user = 28941
movie_id = rating[(rating["userId"] == user) & (rating["rating"] == 5.0)].sort_values(by="timestamp", ascending = False)["movieId"][0:1].values[0]
'''

In [None]:
'''
movie_name = movie[movie["movieId"] == movie_id]["title"].values[0]
movie_name = user_movie_df[movie_name]
movies_from_item_based = user_movie_df.corrwith(movie_name).sort_values(ascending=False)
movies_from_item_based[1:6].index
'''