# Content Based Recommendation

### Recommendations are developed based on the similarities in product content. For instance, if a user reads a book of a certain category, a similar book is recommended based on its content. By representing texts mathematically, we can capture some key words in the product descriptions using text vectors through the Count Vector and TF-IDF methods, and identify other product descriptions that are similar.

In [23]:
#importing libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# TF-IDF Matrix building 

In [24]:
movies = pd.read_csv(r'C:\Users\DELL\Downloads\The Movies Dataset\movies_metadata.csv',
                    usecols=["id","overview","title","vote_average","vote_count","release_date"],low_memory=False)

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


In [25]:
movies = movies.reset_index(drop=True)
movies = movies.dropna()
movies = movies.drop_duplicates()
movies = movies.rename(columns={"id":"movieId"})
movies["movieId"] = movies["movieId"].astype("int64")

In [26]:
#We will focus only on the overview from the dataset:
movies["overview"].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [27]:
#We are using the TF-IDF method and setting up the model:
tfidf = TfidfVectorizer(stop_words="english", min_df = 4)
#We removed commonly used words such as 'and', 'the', 'on', 'in', as they do not carry significant values.

In [28]:
#We replaced NaN values with blanks as NaNs can cause issues in calculations:
movies['overview'] = movies['overview'].fillna('')

In [29]:
#After fitting, we transform the data:
tfidf_matrix = tfidf.fit_transform(movies['overview'])

In [30]:
#There are 45,466 movie reviews and 75,827 words:
tfidf_matrix.shape

(44407, 23834)

# Cosine Similarity Matrix 

In [31]:
#This is the part where we find which movies are similar to each other, mathematically speaking, using text vectors.
cosine_sim = cosine_similarity(tfidf_matrix,
                               tfidf_matrix)
#The cosine_sim.shape gives us the similarities between documents.
cosine_sim.shape

(44407, 44407)

# Making Recommendations Based on Similarities 

In [33]:
#To evaluate the calculated scores, we retrieve the names.
indices = pd.Series(movies.index, index=movies['title'])

In [34]:
#There are multiple instances of movies.
indices.index.value_counts()

title
Cinderella              11
Hamlet                   9
Beauty and the Beast     8
Alice in Wonderland      8
Les Misérables           8
                        ..
No Greater Love          1
A Woman in Berlin        1
Talhotblond              1
Tortilla Flat            1
Queerama                 1
Name: count, Length: 41303, dtype: int64

In [35]:
#We keep one of the duplicate movies and delete the others. We take the last one for freshness.
indices = indices[~indices.index.duplicated(keep='last')]

In [36]:
#We note the index of the movie "Sherlock Holmes".
movie_index = indices["Sherlock Holmes"]

In [37]:
#Accessing cosine_sim with the index of Sherlock Holmes.
cosine_sim[movie_index]

array([0.00630183, 0.00923754, 0.        , ..., 0.        , 0.01089884,
       0.        ])

In [38]:
#We create a dataframe called similarity_scores and retrieve the similar ones, evaluating them as scores.
similarity_scores = pd.DataFrame(cosine_sim[movie_index],columns=["score"])

In [40]:
#Fetching the top 10 movies with the highest scores. The first observation includes the movie itself, so we use 1 to 11.
movie_indices = similarity_scores.sort_values("score", ascending=False)[1:11].index

In [41]:
#Retrieving the titles of the movies with index information.
movies['title'].iloc[movie_indices]

35745    The Dog of Flanders
16735    The Heart Elsewhere
31594         We Can Do That
30608      Drama of Jealousy
25451             Marvellous
44609             The Mitten
21348      Darling Companion
12104        The Dog Problem
33454       The Empty Canvas
42413         Death by Death
Name: title, dtype: object

In [42]:
def content_based_recommender(title, cosine_sim, dataframe):
    # making index
    indices = pd.Series(dataframe.index, index=dataframe['title'])
    indices = indices[~indices.index.duplicated(keep='last')]
    # catch title's index
    movie_index = indices[title]
    # calculating similarty score to target
    similarity_scores = pd.DataFrame(cosine_sim[movie_index], columns=["score"])
    # bring 10 movie
    movie_indices = similarity_scores.sort_values("score", ascending=False)[1:11].index
    return dataframe['title'].iloc[movie_indices]

In [43]:
content_based_recommender("The Matrix", cosine_sim, movies)

27610                So Sweet, So Dead
3534                             Lured
21                             Copycat
2069                            Frenzy
20626       The Wandering Soul Murders
7583             The Stendhal Syndrome
28141               Mark Strikes Again
23816    Tables Turned on the Gardener
26944            Whistling in Brooklyn
28203                Kommissarie Späck
Name: title, dtype: object

# Item-Based Collaborative Filtering

### Recommendations are made based on item similarity.
Example: Lets think viewer likes a particular movie. Based on the similarity in the structure of liked and disliked movies, another movie is recommended.

In [45]:
rating = pd.read_csv(r"C:\Users\DELL\Downloads\The Movies Dataset\ratings.csv")

df = pd.merge(movies,rating, how="inner", on="movieId")

In [46]:
df.head()

Unnamed: 0,movieId,overview,release_date,title,vote_average,vote_count,userId,rating,timestamp
0,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,23,3.5,1148721092
1,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,102,4.0,956598942
2,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,232,2.0,955092697
3,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,242,5.0,956688825
4,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,263,3.0,1117846575


In [48]:
#For instance, a user has rated one movie, but has not rated numerous others, leading to a large number of cells representing the missing data, 
#resulting in performance issues. Various reductions must be made, such as excluding movies with fewer than 1000 ratings.

#The resulting 'user_movie_df' dataframe will have users as the index, movie titles as the columns, and the corresponding ratings as the values
user_movie_df = df.groupby(["userId","movieId"])["rating"].mean().unstack().notnull()

In [49]:
user_movie_df.head()

movieId,2,3,5,6,11,12,13,14,15,16,...,132961,133365,134158,134569,134881,140174,142507,148652,158238,160718
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
5,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [50]:
#taking sample randomly
sample_movie = user_movie_df.sample(1,random_state=45).index[0]

sample_movie

196

In [51]:
#filtering our sample's movies on user_movie_df
filtered = user_movie_df[sample_movie]

In [52]:
#dropping the movies that our sample watched
user_movie_df_wo = user_movie_df.drop(sample_movie,axis=1)

In [53]:
#checking correlation
movies_similarity = user_movie_df_wo.corrwith(filtered)

In [54]:
movies_similarity.sort_values(ascending=False).head(20)

movieId
160    0.388101
172    0.359755
435    0.346429
173    0.342352
316    0.331515
253    0.331465
22     0.326370
317    0.324579
165    0.324017
592    0.321798
198    0.321120
145    0.321120
587    0.315399
329    0.313288
292    0.307931
204    0.296570
153    0.295394
344    0.292703
379    0.289515
426    0.287628
dtype: float64

In [55]:
#similar movies
movies_similarity = movies_similarity.sort_values(ascending=False).reset_index()
movies_similarity.columns = ["movieId","movies_similarity"]
movies_similarity.head()

Unnamed: 0,movieId,movies_similarity
0,160,0.388101
1,172,0.359755
2,435,0.346429
3,173,0.342352
4,316,0.331515


In [58]:
#filtering the movies
filtered_movies = df[df['movieId'].isin([160, 172, 435, 173, 316])]

In [59]:
filtered_movies['title'].value_counts()

title
Grill Point                            145
20,000 Leagues Under the Sea            70
The Arrival of a Train at La Ciotat     63
The Day After Tomorrow                  55
Star Trek V: The Final Frontier         48
Name: count, dtype: int64

# User-Based Collaborative Filtering

### Recommendations are made based on the similarities between users.

Example: Access is provided to the movies watched by a user, and then to the movies watched by other users who have watched the same movies. By examining the correlation of the movies watched by other users but not by our initial user, the highest correlated movie is recommended.

In [60]:
df.head()

Unnamed: 0,movieId,overview,release_date,title,vote_average,vote_count,userId,rating,timestamp
0,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,23,3.5,1148721092
1,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,102,4.0,956598942
2,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,232,2.0,955092697
3,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,242,5.0,956688825
4,949,"Obsessive master thief, Neil McCauley leads a ...",1995-12-15,Heat,7.7,1886.0,263,3.0,1117846575


In [62]:
#we have 44,823 rating
df.shape

(44823, 9)

In [63]:
#we have 2772 movie
df["title"].nunique()

2772

In [65]:
# number of the comments for each movie
comments = df["title"].value_counts()
comments

title
Terminator 3: Rise of the Machines    324
The Million Dollar Hotel              311
Solaris                               305
The 39 Steps                          291
Monsoon Wedding                       274
                                     ... 
Things to Come                          1
Portrait in Black                       1
Les Visiteurs du Soir                   1
The Warped Ones                         1
The One-Man Band                        1
Name: count, Length: 2772, dtype: int64

In [66]:
# taking movies which has lower than 5 
rare_movies = comments[comments < 5].index

rare_movies

Index(['Hotel Rwanda', 'Double Trouble', 'The African Queen', 'Che: Part Two',
       'Ronja Robbersdaughter', 'The Story of a Cheat', 'Jinxed!',
       'Cruel Intentions 3', 'Enigma', 'Jekyll and Hyde ... Together Again',
       ...
       'The Model Couple', 'Arthur and the Revenge of Maltazard',
       'Blue Like Jazz', 'Kismet', 'Erotic Nights of the Living Dead',
       'Things to Come', 'Portrait in Black', 'Les Visiteurs du Soir',
       'The Warped Ones', 'The One-Man Band'],
      dtype='object', name='title', length=1429)

In [68]:
#removing movies which is lower than 5
clean_df = df[~df["title"].isin(rare_movies)]

In [70]:
#building our user_tittle_df
user_title_df = clean_df.groupby(["userId","title"])["rating"].mean().unstack().notnull()

In [71]:
user_title_df.head()

title,10 Items or Less,10 Things I Hate About You,15 Minutes,1984,2 Days in Paris,"20,000 Leagues Under the Sea",2001: A Space Odyssey,24 Hour Party People,25th Hour,28 Days Later,...,Young Adam,Young Frankenstein,Young and Innocent,Z,Zatoichi,Zazie dans le métro,Zodiac,eXistenZ,xXx,À nos amours
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
4,False,False,False,False,False,True,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False


In [72]:
#taking sample randomly
lucky_guy = user_title_df.sample(1,random_state=45).index[0]

In [73]:
#our lucky_guy's movies that he watch
random_user_df = user_title_df[user_title_df.index == lucky_guy]

In [74]:
#the ones which our guy voted
movies_watched = random_user_df.dropna(axis=1).columns.tolist()

In [75]:
movies_watched_df = user_title_df[movies_watched]

In [76]:
#number of the movies that other users watched same with our luckyguy
user_movie_count = movies_watched_df.notnull().sum(axis=1)

In [77]:
# users who watched same movies %60 with our guy 
users_same_movies = user_movie_count[user_movie_count > (movies_watched_df.shape[1] * 60 ) / 100].index

users_same_movies

Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
       ...
       662, 663, 664, 665, 666, 667, 668, 669, 670, 671],
      dtype='int64', name='userId', length=671)

In [78]:
#filtering our df
filted_df = movies_watched_df[movies_watched_df.index.isin(users_same_movies)]

filted_df

title,10 Items or Less,10 Things I Hate About You,15 Minutes,1984,2 Days in Paris,"20,000 Leagues Under the Sea",2001: A Space Odyssey,24 Hour Party People,25th Hour,28 Days Later,...,Young Adam,Young Frankenstein,Young and Innocent,Z,Zatoichi,Zazie dans le métro,Zodiac,eXistenZ,xXx,À nos amours
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
4,False,False,False,False,False,True,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
667,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
668,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
669,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
670,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [79]:
#correlation beetwen users
corr_df = filted_df.T.corr().unstack().drop_duplicates() 

In [80]:
# users who is correlated with our lucky guy 
corr_df[lucky_guy].sort_values(ascending=False)

userId
541    0.246396
476    0.146818
525    0.132043
592    0.110805
463    0.104639
         ...   
391   -0.037857
559   -0.040912
287   -0.041207
396   -0.047080
434   -0.049696
Length: 359, dtype: float64

In [83]:
# correlation higher than 0.10
top_users = pd.DataFrame(corr_df[lucky_guy][corr_df[lucky_guy] > 0.10], columns=["corr"])

top_users

Unnamed: 0_level_0,corr
userId,Unnamed: 1_level_1
294,0.102905
463,0.104639
476,0.146818
525,0.132043
541,0.246396
575,0.101738
592,0.110805


In [84]:
top_users_ratings = pd.merge(top_users, rating[["userId", "movieId", "rating"]], how='inner', on="userId")

top_users_ratings

Unnamed: 0,userId,corr,movieId,rating
0,294,0.102905,1,4.0
1,294,0.102905,5,3.5
2,294,0.102905,7,5.0
3,294,0.102905,10,3.5
4,294,0.102905,11,3.0
...,...,...,...,...
2318,592,0.110805,4299,4.0
2319,592,0.110805,4340,3.0
2320,592,0.110805,4344,5.0
2321,592,0.110805,4369,5.0


In [85]:
top_users_ratings['weighted_rating'] = top_users_ratings['corr'] * top_users_ratings['rating']

In [86]:
recommendation_df = top_users_ratings.pivot_table(values="weighted_rating", index="movieId", aggfunc="mean")

recommendation_df

Unnamed: 0_level_0,weighted_rating
movieId,Unnamed: 1_level_1
1,0.323708
2,0.335518
5,0.360166
6,0.528171
7,0.446676
...,...
42738,0.411618
44613,0.360166
45028,0.411618
45499,0.411618


In [87]:
movies_to_be_recommend = recommendation_df[recommendation_df["weighted_rating"] > 0.7].sort_values(by="weighted_rating", ascending=False).head(10)

In [88]:
movies["title"][movies["movieId"].isin(movies_to_be_recommend.index)]

3786                  Dancer in the Dark
7234                    Dawn of the Dead
9399              A Very Long Engagement
9805             Elevator to the Gallows
11922                     License to Wed
21863    Frankenstein Conquers the World
Name: title, dtype: object