# Part 2: Content-based Filtering Recommender System

## Section A: Introduction

▪ In this practical session, we learn how to build a basic content-based recommender system using The Movies Dataset, which is publicly available on Kaggle.

▪ To achieve this, we will compute pairwise cosine similarity scores for all movies based on their plot descriptions and recommend the movies with the highest similarity scores.

\>>> **(Full dataset can be downloaded here)** https://www.kaggle.com/rounakbanik/the-movies-dataset?select=movies_metadata.csv

\>>> **(The reference of this practical)** https://www.datacamp.com/community/tutorials/recommender-systems-python

### Content-based Filtering Recommender Systems

▪ Content-based recommender systems recommend items whose attributes (here, genres or plot descriptions) are similar to those of an item the user already likes.

<img src="content.png" width="350">

## Section B: Data Exploration

### Loading Dataset into Dataframe

In [1]:
import pandas as pd

movies_data = pd.read_csv('movies_metadata.csv', low_memory=False)

In [2]:
movies_data.shape

(45466, 24)

### Retrieving All Columns' Names

In [3]:
movies_data.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

### Identifying the Best Indicator of Similarity (Part 1) 

▪ Assumption: **If two movies fall under the same category, then they might be similar to a certain extent**.

In [4]:
movies_data.genres

0        [{'id': 16, 'name': 'Animation'}, {'id': 35, '...
1        [{'id': 12, 'name': 'Adventure'}, {'id': 14, '...
2        [{'id': 10749, 'name': 'Romance'}, {'id': 35, ...
3        [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...
4                           [{'id': 35, 'name': 'Comedy'}]
                               ...                        
45461    [{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...
45462                        [{'id': 18, 'name': 'Drama'}]
45463    [{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...
45464                                                   []
45465                                                   []
Name: genres, Length: 45466, dtype: object

### Understanding the Content of Genres

In [5]:
# Each movie can be categorized under more than one genre
movies_data.genres[0]

"[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

In [6]:
# genres is stored as a string, not as a list of dictionaries
print(type(movies_data.genres[0]))

<class 'str'>
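
▪ Because each genres value is a string, the genre names cannot be read from it directly. A minimal sketch of one way to parse it, assuming the string is a valid Python literal as in the output above (the variable names `raw_genres`, `parsed`, and `genre_names` are illustrative and not part of the original notebook):

```python
import ast

# A genres value as it appears in the dataset: a string, not a list
raw_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]"

# ast.literal_eval safely evaluates Python literals (lists, dicts, strings, numbers)
parsed = ast.literal_eval(raw_genres)

# Keep only the genre names
genre_names = [genre['name'] for genre in parsed]
print(genre_names)  # ['Animation', 'Comedy', 'Family']
```

Parsing is optional for the substring search used in this section, but it would be needed for any per-genre comparison.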


In [7]:
size = len(movies_data.genres)
print(size)

45466


<span style = "color:red">
    
**Exercise \#1: Explain what the following code does.** 

</span>

In [8]:
import re

df_1 = [movies_data.genres[index] for index in range(size) if re.search('Science Fiction', movies_data.genres[index])]
df_1

["[{'id': 18, 'name': 'Drama'}, {'id': 14, 'name': 'Fantasy'}, {'id': 878, 'name': 'Science Fiction'}, {'id': 53, 'name': 'Thriller'}]",
 "[{'id': 14, 'name': 'Fantasy'}, {'id': 878, 'name': 'Science Fiction'}, {'id': 12, 'name': 'Adventure'}]",
 "[{'id': 878, 'name': 'Science Fiction'}, {'id': 53, 'name': 'Thriller'}, {'id': 9648, 'name': 'Mystery'}]",
 "[{'id': 28, 'name': 'Action'}, {'id': 878, 'name': 'Science Fiction'}]",
 "[{'id': 27, 'name': 'Horror'}, {'id': 878, 'name': 'Science Fiction'}]",
 "[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 18, 'name': 'Drama'}, {'id': 878, 'name': 'Science Fiction'}, {'id': 53, 'name': 'Thriller'}]",
 "[{'id': 18, 'name': 'Drama'}, {'id': 9648, 'name': 'Mystery'}, {'id': 878, 'name': 'Science Fiction'}, {'id': 53, 'name': 'Thriller'}]",
 "[{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 18, 'name': 'Drama'}, {'id': 9648, 'name': 'Mystery'}, {'id': 878, 'name': 'Science Fiction'}, {'id': 53, 'name':

In [9]:
print(len(df_1))

3049


### DataFrame Slicing using str.contains()

▪ The loc property is used to access a group of rows and columns by label(s) or by a Boolean mask. Here it is given the Boolean mask produced by str.contains().

In [10]:
df_2 = movies_data.loc[movies_data['genres'].str.contains('Science Fiction')]
print(len(df_2))

3049
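
▪ The same slicing pattern can be tried on a small frame. This is only a sketch on made-up rows (the `toy` DataFrame and its titles are hypothetical); `regex=False` treats the pattern as a plain substring, and `na=False` is a guard that marks missing values as non-matches:

```python
import pandas as pd

# Hypothetical three-row frame that mimics the genres column
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genres': ["[{'id': 878, 'name': 'Science Fiction'}]",
               "[{'id': 35, 'name': 'Comedy'}]",
               None],
})

# Boolean mask: True where the genres string contains the substring
mask = toy['genres'].str.contains('Science Fiction', regex=False, na=False)

# loc keeps only the rows where the mask is True
sci_fi = toy.loc[mask]
print(mask.tolist())
print(sci_fi['title'].tolist())
```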


### Recommending Movies Based on Genres

In [11]:
df_3 = df_2[['original_title', 'release_date', 'genres']]
df_3.head()

| | original_title | release_date | genres |
|---|---|---|---|
| 23 | Powder | 1995-10-27 | [{'id': 18, 'name': 'Drama'}, {'id': 14, 'name... |
| 28 | La Cité des Enfants Perdus | 1995-05-16 | [{'id': 14, 'name': 'Fantasy'}, {'id': 878, 'n... |
| 31 | Twelve Monkeys | 1995-12-29 | [{'id': 878, 'name': 'Science Fiction'}, {'id'... |
| 65 | Lawnmower Man 2: Beyond Cyberspace | 1996-01-12 | [{'id': 28, 'name': 'Action'}, {'id': 878, 'na... |
| 75 | Screamers | 1995-09-08 | [{'id': 27, 'name': 'Horror'}, {'id': 878, 'na... |


<span style = "color:red">
    
**Exercise \#2: Is _genre_ a good indicator of similarity? Are _Powder_ and _Screamers_ similar movies?** 

</span>

In [12]:
movies_data.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

### Identifying the Best Indicator of Similarity (Part 2)

▪ Assumption: **If two movies have similar plots, then they might be similar to a certain extent**.

In [13]:
movies_data['overview'].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [15]:
movies_data['overview'][0]

"Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences."

In [14]:
movies_data['overview'][1]

"When siblings Judy and Peter discover an enchanted board game that opens the door to a magical world, they unwittingly invite Alan -- an adult who's been trapped inside the game for 26 years -- into their living room. Alan's only hope for freedom is to finish the game, which proves risky as all three find themselves running from giant rhinoceroses, evil monkeys and other terrifying creatures."

## Section C: Feature Extraction

### TF-IDF Vectorizer

▪ Scikit-learn's built-in TfidfVectorizer class is used to produce the TF-IDF matrix:

\>>> Import TfidfVectorizer from scikit-learn.

\>>> Replace not-a-number values with a blank string.

\>>> Remove stop words like 'the', 'an', etc. since they do not give any useful information about the topic.

\>>> Finally, construct the TF-IDF matrix on the data.

In [16]:
import pandas as pd

for i in range(len(movies_data['overview'])):
    if pd.isna(movies_data['overview'][i]):
        print(movies_data['overview'][i])

nan
nan
nan
[... output continues: one nan line per movie with a missing overview ...]


In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with an empty string
movies_data['overview'] = movies_data['overview'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(movies_data['overview'])

<span style = "color:red">
    
**Exercise \#3: Explain what the two numbers printed when the shape property of _tfidf_matrix_ is accessed represent.** 

</span>

In [18]:
# Check the shape of tfidf_matrix
tfidf_matrix.shape

(45466, 75827)

### Useless Features vs. Useful Features? Data Preprocessing?

In [19]:
tfidf.get_feature_names_out()

array(['00', '000', '000km', ..., '첫사랑', 'ﬁrst', 'ﬁve'], dtype=object)

In [20]:
tfidf.get_feature_names_out()[0:500]  # Mostly numeric tokens: candidates for number and punctuation removal

array(['00', '000', '000km', '000th', '001', '006', '007', '008', '009',
       '0093', '01', '0123', '02', '03', '04', '042', '05', '05pm', '06',
       '07', '077', '07am', '08', '088', '09', '10', '100', '1000',
       '10000', '1000s', '1000th', '1001', '100th', '101', '101st', '103',
       '103rd', '104', '105', '1066', '108', '1080s', '108th', '109',
       '10b', '10crores', '10mn', '10th', '10x', '11', '110', '1100',
       '111', '112', '1138', '114', '115', '117', '117a', '118', '1183',
       '119', '11s', '11th', '12', '120', '1200', '1200s', '1206', '1215',
       '1218', '1227', '125', '1250', '125th', '1263', '129', '12th',
       '13', '130', '1300', '1300s', '1302', '1303', '133', '134', '1344',
       '1348', '1349', '138', '13anos', '13b', '13s', '13th', '14', '140',
       '1400', '1408', '1413', '142', '1429', '143', '144', '145', '1458',
       '146', '1463', '1466', '1472', '1475', '148', '1482', '1483',
       '1492', '14pm', '14th', '15', '150', '1500', '15000

In [21]:
tfidf.get_feature_names_out()[900:1000]

array(['97', '976', '977', '98', '99', '999', '9mm', '9pm', '9th', '9to5',
       '_stakeout_', '_wizard', 'a2', 'a300', 'aa', 'aaa', 'aaaron',
       'aab', 'aachan', 'aachi', 'aackerlund', 'aadhavan', 'aadhi',
       'aadland', 'aaicha', 'aakash', 'aake', 'aalavandhan', 'aalst',
       'aaltonen', 'aames', 'aamir', 'aan', 'aanandam', 'aang',
       'aangehouden', 'aanmodderfakker', 'aapeli', 'aarakshan', 'aarchi',
       'aarne', 'aarno', 'aaron', 'aarons', 'aarp', 'aart', 'aarti',
       'aarya', 'aarón', 'aasen', 'aashirvad', 'aashish', 'aasia', 'aati',
       'aatish', 'ab', 'aba', 'ababa', 'abacco', 'aback', 'abacus',
       'abaddon', 'abagnale', 'abahachi', 'abalaba', 'aballay', 'abalone',
       'abandon', 'abandond', 'abandoned', 'abandoning', 'abandonment',
       'abandons', 'abar', 'abard', 'abargil', 'abas', 'abatcha',
       'abates', 'abati', 'abba', 'abbaji', 'abbas', 'abbaseya', 'abbasi',
       'abbass', 'abbe', 'abberline', 'abbes', 'abbess', 'abbey', 'abbia',
     

## Section D: Similarity Computation

▪ With the matrix, **cosine similarity** can be used to calculate a numeric quantity that denotes the similarity between two movies.

▪ The syntax is **cosine_similarity(X, Y=None, dense_output=True)**

\>>> X (either an ndarray or a sparse matrix) is the input data.

\>>> Y (either an ndarray or a sparse matrix) is an optional second set of samples. If None, the output will be the pairwise similarities between all samples in X.
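
▪ For two vectors u and v, cosine similarity is the dot product divided by the product of their lengths: u·v / (‖u‖ ‖v‖). A minimal sketch that checks a hand computation against scikit-learn on made-up vectors (the array `X` and the helper `cosine` are illustrative only):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Three hypothetical sample vectors, one per row
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 2.0]])

def cosine(u, v):
    """Cosine similarity of two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rows 0 and 1 share one non-zero component: 1 / (sqrt(2) * sqrt(2)), about 0.5
manual = cosine(X[0], X[1])

# With Y omitted, every row of X is compared with every other row of X
pairwise = cosine_similarity(X)

print(round(manual, 4))
print(pairwise.round(4))
```

The result is a square, symmetric matrix with ones on the diagonal, since every vector is identical to itself.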

In [22]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute the Cosine Similarity in terms of pairwise similarities
cosine_sim_1 = cosine_similarity(tfidf_matrix, tfidf_matrix)

<span style = "color:red">
    
**Exercise \#4: Explain what the two numbers printed when the shape property of _cosine_sim_1_ is accessed represent.** 

</span>

In [23]:
cosine_sim_1.shape

(45466, 45466)

### Visualizing cosine_sim_1

In [24]:
# Print the first 6 rows and 6 columns
for i in range(6):
    print(cosine_sim_1[i][:6])

[1.         0.01504121 0.         0.         0.         0.        ]
[0.01504121 1.         0.04681953 0.         0.         0.05018805]
[0.         0.04681953 1.         0.         0.02509444 0.        ]
[0.         0.         0.         1.         0.         0.00720276]
[0.         0.         0.02509444 0.         1.         0.        ]
[0.         0.05018805 0.         0.00720276 0.         1.        ]


In [25]:
print(type(cosine_sim_1))

<class 'numpy.ndarray'>


In [26]:
df = pd.DataFrame(cosine_sim_1[0:6, 0:6])
df.columns = ['Movie_1', 'Movie_2', 'Movie_3', 'Movie_4', 'Movie_5', 'Movie_6']
df.index = ['Movie_1', 'Movie_2', 'Movie_3', 'Movie_4', 'Movie_5', 'Movie_6']
df.round(4)

| | Movie_1 | Movie_2 | Movie_3 | Movie_4 | Movie_5 | Movie_6 |
|---|---|---|---|---|---|---|
| Movie_1 | 1.0 | 0.015 | 0.0 | 0.0 | 0.0 | 0.0 |
| Movie_2 | 0.015 | 1.0 | 0.0468 | 0.0 | 0.0 | 0.0502 |
| Movie_3 | 0.0 | 0.0468 | 1.0 | 0.0 | 0.0251 | 0.0 |
| Movie_4 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0072 |
| Movie_5 | 0.0 | 0.0 | 0.0251 | 0.0 | 1.0 | 0.0 |
| Movie_6 | 0.0 | 0.0502 | 0.0 | 0.0072 | 0.0 | 1.0 |


### linear_kernel()

▪ TfidfVectorizer L2-normalizes each row by default (norm='l2'), so every non-empty TF-IDF vector has unit length and the dot product of two vectors equals their cosine similarity.

▪ Therefore, we can use sklearn's **linear_kernel()** instead of **cosine_similarity()**: it skips the normalization step and is faster.
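
▪ This equivalence can be checked on a small corpus. The sketch below uses three made-up sentences (`docs` is hypothetical) and relies on TfidfVectorizer's default L2 normalization:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical plot snippets
docs = ['toys come alive in a bedroom',
        'a magical board game comes alive',
        'two old neighbours feud']

m = TfidfVectorizer(stop_words='english').fit_transform(docs)

# Each row already has Euclidean length 1
row_norms = np.sqrt(m.multiply(m).sum(axis=1))
print(np.allclose(row_norms, 1.0))

# So the plain dot product matches the cosine similarity
same = np.allclose(linear_kernel(m, m), cosine_similarity(m, m))
print(same)
```

If the vectorizer were created with a different norm setting, the two functions would no longer agree and cosine_similarity() would be the one to use.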

In [27]:
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim_2 = linear_kernel(tfidf_matrix, tfidf_matrix)

In [28]:
cosine_sim_2.shape

(45466, 45466)

In [29]:
print(type(cosine_sim_2))

<class 'numpy.ndarray'>


In [30]:
# Print the first 6 rows and 6 columns
for i in range(6):
    print(cosine_sim_2[i][:6])

[1.         0.01504121 0.         0.         0.         0.        ]
[0.01504121 1.         0.04681953 0.         0.         0.05018805]
[0.         0.04681953 1.         0.         0.02509444 0.        ]
[0.         0.         0.         1.         0.         0.00720276]
[0.         0.         0.02509444 0.         1.         0.        ]
[0.         0.05018805 0.         0.00720276 0.         1.        ]


### cosine_similarity() vs. linear_kernel()

https://campus.datacamp.com/courses/feature-engineering-for-nlp-in-python/tf-idf-and-similarity-scores?ex=9

In [None]:
import time
from sklearn.metrics.pairwise import linear_kernel

# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim_lk = linear_kernel(tfidf_matrix, tfidf_matrix)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))

In [None]:
import time
from sklearn.metrics.pairwise import cosine_similarity

# Record start time
start = time.time()

# Compute cosine similarity matrix
cosine_sim_cs = cosine_similarity(tfidf_matrix, tfidf_matrix)

# Print time taken
print("Time taken: %s seconds" %(time.time() - start))

## Section E: Recommending Movies

▪ Create a reverse mapping of movie titles and DataFrame indices. 

In [31]:
tempo = movies_data[['title']]
tempo

| | title |
|---|---|
| 0 | Toy Story |
| 1 | Jumanji |
| 2 | Grumpier Old Men |
| 3 | Waiting to Exhale |
| 4 | Father of the Bride Part II |
| ... | ... |
| 45461 | Subdue |
| 45462 | Century of Birthing |
| 45463 | Betrayal |
| 45464 | Satan Triumphant |


In [32]:
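# Create a pandas Series that maps each title (the index) to its DataFrame row index (the value)
# Note: drop_duplicates() compares the values (row indices), which are all unique, so duplicate titles are kept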
indices = pd.Series(movies_data.index, index = movies_data['title']).drop_duplicates()

# Check the first 10 indices
indices[:10]

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
Heat                           5
Sabrina                        6
Tom and Huck                   7
Sudden Death                   8
GoldenEye                      9
dtype: int64

In [33]:
indices.shape

(45466,)
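
▪ The shape above equals the number of rows in the dataset, which shows that no entries were dropped. A small sketch of why, and of one possible way to keep only the first entry per title, on made-up titles (`titles`, `lookup`, and `first_only` are illustrative names, and this is not the notebook's own approach):

```python
import pandas as pd

# Hypothetical titles; 'Heat' appears twice
titles = pd.Series(['Heat', 'Sabrina', 'Heat'], name='title')

# Same construction as in the cell above: titles as index, row positions as values
lookup = pd.Series(titles.index, index=titles)

# drop_duplicates() looks at the values 0, 1, 2, so nothing is removed
unchanged = lookup.drop_duplicates()
print(len(unchanged))

# Filtering on duplicated index labels keeps the first row per title
first_only = lookup[~lookup.index.duplicated(keep='first')]
print(len(first_only))

# A duplicated title returns several rows; a unique one returns a single integer
print(len(lookup['Heat']))
print(first_only['Heat'])
```

With duplicate titles left in place, looking up such a title yields a Series of row indices rather than one integer, which matters for any function that expects a single index.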

### enumerate()

▪ The built-in **enumerate()** function adds a counter to an iterable and returns an enumerate object. 

▪ This enumerate object can be used directly in for loops or converted into a list of tuples using list().

https://www.geeksforgeeks.org/enumerate-in-python/

In [34]:
# Python program to illustrate enumerate function
list_1 = ["eat", "sleep", "repeat"]
  
print(list(enumerate(list_1)))

[(0, 'eat'), (1, 'sleep'), (2, 'repeat')]


### get_recommendations()

▪ To build a content-based recommender, we need to define a function that takes a movie title as input and outputs a list of the 10 most similar movies.

▪ The steps are as follows:

\>>> Get the index of the movie given its title.

\>>> Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is its position, and the second is the similarity score.

\>>> Sort the aforementioned list of tuples based on the similarity scores; that is, the second element.

\>>> Get the top 10 elements of this list. Ignore the first element as it refers to self (the movie most similar to a particular movie is the movie itself).

\>>> Return the titles corresponding to the indices of the top elements.

In [35]:
def get_recommendations(title, cosine_sim=cosine_sim_1):

    # Get the index of the movie that matches the title
    # Title: The Dark Knight Rises, index: 18252
    index = indices[title]
    # print(index) 
    
    # Get the pairwise similarity scores of all 45466 movies with the selected movie: 'The Dark Knight Rises'
    sim_scores = list(enumerate(cosine_sim[index]))
    # print(sim_scores)
    
    # Sort the movies based on the similarity scores in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # print(sim_scores)
    
    # Get the scores of the top 10 most similar movies 
    sim_scores = sim_scores[1:11]
    # print(sim_scores)

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    # print(movie_indices)
    
    # Return the top 10 most similar movies
    return movies_data['title'].iloc[movie_indices]

In [36]:
get_recommendations('The Dark Knight Rises')

12481                                      The Dark Knight
150                                         Batman Forever
1328                                        Batman Returns
15511                           Batman: Under the Red Hood
585                                                 Batman
21194    Batman Unmasked: The Psychology of the Dark Kn...
9230                    Batman Beyond: Return of the Joker
18035                                     Batman: Year One
19792              Batman: The Dark Knight Returns, Part 1
3095                          Batman: Mask of the Phantasm
Name: title, dtype: object
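
▪ The same ranking can also be sketched with NumPy's argsort instead of sorting a Python list of tuples. This is an alternative illustration on a made-up 4×4 similarity matrix, not the function used above; `toy_titles`, `toy_sim`, and `top_k_similar` are hypothetical names:

```python
import numpy as np
import pandas as pd

# Hypothetical titles and a symmetric similarity matrix
toy_titles = pd.Series(['A', 'B', 'C', 'D'])
toy_sim = np.array([[1.0, 0.2, 0.9, 0.0],
                    [0.2, 1.0, 0.1, 0.5],
                    [0.9, 0.1, 1.0, 0.3],
                    [0.0, 0.5, 0.3, 1.0]])

def top_k_similar(index, sim, titles, k=2):
    """Return the k titles most similar to the movie at row `index`."""
    # Positions sorted by similarity, highest first
    order = np.argsort(sim[index])[::-1]
    # Exclude the movie itself by position instead of assuming it sorts first
    order = order[order != index][:k]
    return titles.iloc[order]

result = top_k_similar(0, toy_sim, toy_titles)
print(result.tolist())  # ['C', 'B']
```

Removing the query movie by its position avoids dropping the wrong entry when another movie has a similarity of exactly 1.0.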

<span style = "color:red">
    
**Exercise \#5: Build a recommender system based on 4 types of metadata: 3 top actors, the director, related genres, and the movie plot keywords. [Reference: https://www.datacamp.com/community/tutorials/recommender-systems-python]** 

</span>

<span style = "color:red">
    
**Exercise \#6: Build a recommender that would take the 30 most similar movies, calculate the weighted ratings (using the IMDB formula from above), sort movies based on this rating, and return the top 10 movies.** 

</span>