# **Content Based Recommendation System (Movies)**

## Objectives

*   Create a content-based recommendation system.

Recommendation systems are a collection of algorithms used to recommend items to users based on information taken from the user. These systems have become ubiquitous and are commonly seen in online stores, movie databases and job finders. In this notebook, we will explore content-based recommendation systems and implement a simple version of one using Python and the pandas library.

# Load dependencies

In [1]:
import pandas as pd
import numpy as np

# Preprocessing

In [20]:
# read the movies data
movie_df = pd.read_csv('assets/data/movies.csv', usecols=['movieId', 'title', 'genres'])
display(movie_df.shape)
movie_df.head()

(34208, 3)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [21]:
# read the movies rating data
rating_df = pd.read_csv('assets/data/ratings.csv', usecols=['userId', 'movieId', 'rating', 'timestamp'])
display(rating_df.shape)
rating_df.head()

(22884377, 4)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


In [4]:
# extract the four-digit year from the movie title, e.g. "(1995)" -> "1995"
movie_df['year'] = movie_df['title'].str.extract(r'(\(\d\d\d\d\))', expand=False).apply(lambda x: str(x)[-5:-1])

# remove the "(YYYY)" suffix from the title (regex=True is required in recent pandas) and trim whitespace
movie_df['title'] = movie_df['title'].str.replace(r'(\(\d\d\d\d\))', '', regex=True)
movie_df['title'] = movie_df['title'].str.strip()
movie_df.head()


Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995


Let's split the genres string into a list of genres.

In [5]:
movie_df['genres'] = movie_df['genres'].str.split('|')
movie_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995


#### Checking for missing values in the dataset

In [6]:
movie_df.isna().sum()

movieId    0
title      0
genres     0
year       0
dtype: int64

`isna()` reports no missing values, so we can move on.
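
One caveat is worth checking before relying on that count. The year extraction above passes each match through `str(x)[-5:-1]`, so a title without a `(YYYY)` suffix would end up as the string `'na'` rather than as a missing value, and `isna()` would not see it. The sketch below reproduces this on two made-up titles (the sample data is hypothetical, not taken from the dataset) and shows a stricter check based on `str.fullmatch`:

```python
import pandas as pd

# Two made-up titles; the second one has no "(YYYY)" suffix.
titles = pd.Series(['Toy Story (1995)', 'Untitled Project'])

# Same extraction as in the notebook: a NaN match becomes the string 'na'.
year = titles.str.extract(r'(\(\d\d\d\d\))', expand=False).apply(lambda x: str(x)[-5:-1])

# isna() only sees strings here, so it reports nothing missing.
missing_by_isna = int(year.isna().sum())

# Stricter check: count entries that are not exactly four digits.
invalid_years = int((~year.str.fullmatch(r'\d{4}')).sum())

print(year.tolist(), missing_by_isna, invalid_years)
```

Running the same `fullmatch` check on `movie_df['year']` would tell us whether any real titles are affected.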

In [7]:
movies_ = movie_df.copy()

# create one indicator column per genre for each movie, i.e. dummy (one-hot) variables
for index, row in movie_df.iterrows():
    for genre in row['genres']:
        movies_.at[index, genre] = 1
movies_.head()

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,,...,,,,,,,,,,
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,,1.0,,1.0,,...,,,,,,,,,,
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,,,,1.0,,1.0,...,,,,,,,,,,
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,,,,1.0,,1.0,...,,,,,,,,,,
4,5,Father of the Bride Part II,[Comedy],1995,,,,1.0,,,...,,,,,,,,,,


In [8]:
# fill NaN with 0
movies_.fillna(0, inplace=True)
movies_.head()

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
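
The row-by-row loop plus `fillna` works, but it can be slow on a large table. As a possible alternative, the sketch below uses `Series.str.get_dummies`, assuming the `genres` column still holds the raw pipe-delimited strings (that is, before the split into lists). The three-row `sample` frame is made up for illustration.

```python
import pandas as pd

# Made-up sample with genres in the raw "A|B|C" form.
sample = pd.DataFrame({
    'movieId': [1, 2, 3],
    'genres': ['Adventure|Animation|Children', 'Adventure|Children', 'Comedy|Romance'],
})

# One indicator column per genre; absent genres are already 0, so no fillna is needed.
genre_dummies = sample['genres'].str.get_dummies(sep='|')

# Attach the indicator columns to the original frame.
encoded = pd.concat([sample, genre_dummies], axis=1)
print(encoded)
```

Note that `get_dummies` orders the genre columns alphabetically, whereas the loop adds them in order of first appearance.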


Now the movie dataframe has everything we need.

Let's now take a look at the rating dataframe.

In [9]:
display(rating_df.shape)
rating_df.head()

(22884377, 4)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,169,2.5,1204927694
1,1,2471,3.0,1204927438
2,1,48516,5.0,1204927435
3,2,2571,3.5,1436165433
4,2,109487,4.0,1436165496


Wow, there are far more rows here than in the movie data!

That's no surprise: this is the rating dataset.

It stores ratings from many users for the same movies. Let's see it!

In [10]:
# count the number of ratings per movie
rating_df['movieId'].value_counts()

356       81296
296       79091
318       77887
593       76271
480       69545
          ...  
142478        1
142476        1
134501        1
150892        1
117989        1
Name: movieId, Length: 33670, dtype: int64
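
If we also want to know how well each movie was rated, not only how often, a grouped aggregation gives both in one step. This is a small sketch on made-up ratings; the column names `n_ratings` and `mean_rating` are chosen here for illustration.

```python
import pandas as pd

# Made-up ratings in the same layout as rating_df (without timestamp).
ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3],
    'movieId': [10, 20, 10, 30, 10],
    'rating':  [4.0, 3.0, 5.0, 2.5, 3.0],
})

# Number of ratings and mean rating per movie, most-rated first.
movie_stats = (
    ratings.groupby('movieId')['rating']
    .agg(n_ratings='count', mean_rating='mean')
    .sort_values('n_ratings', ascending=False)
)
print(movie_stats)
```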

This recommendation system doesn't need the timestamp, so let's drop the `timestamp` column.

In [11]:
rating_df = rating_df.drop(columns='timestamp')
rating_df.head()

Unnamed: 0,userId,movieId,rating
0,1,169,2.5
1,1,2471,3.0
2,1,48516,5.0
3,2,2571,3.5
4,2,109487,4.0


## Let's start with building our Content-Based recommendation system

#### How will the machine know the user's interests at the start?

At first, the machine needs a few movies that the user likes, along with their ratings.
It can then help the user find more movies like them.

So let's initialize some user input for testing the recommendation system.

In [12]:
user_input = [
    {'title':'Conjuring, The', 'rating':5},
    {'title':'Avengers, The', 'rating':4.1},
    {'title':'Avengers: Age of Ultron', 'rating':4.2},
    {'title':"Inception", 'rating':4.8},
    {'title':'Harry Potter and the Order of the Phoenix', 'rating':4.5}
]
                
input_movies = pd.DataFrame(user_input)
input_movies

Unnamed: 0,title,rating
0,"Conjuring, The",5.0
1,"Avengers, The",4.1
2,Avengers: Age of Ultron,4.2
3,Inception,4.8
4,Harry Potter and the Order of the Phoenix,4.5


We have the movie titles, but we don't know much else about the movies: we don't have their IDs, and we don't know whether they exist in our dataset. Let's look them up in the movies list.

In [13]:
# look up the input titles in the movies list
get_id = movies_[movies_['title'].isin(input_movies['title'].tolist())]

# merge the movie data into the user input data (joined on the shared 'title' column)
input_movies = pd.merge(input_movies, get_id, on='title')

input_movies

Unnamed: 0,title,rating,movieId,genres,year,Adventure,Animation,Children,Comedy,Fantasy,...,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,"Conjuring, The",5.0,103688,"[Horror, Thriller]",2013,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,"Avengers, The",4.1,2153,"[Action, Adventure]",1998,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Avengers, The",4.1,89745,"[Action, Adventure, Sci-Fi, IMAX]",2012,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Avengers: Age of Ultron,4.2,122892,"[Action, Adventure, Sci-Fi]",2015,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Inception,4.8,79132,"[Action, Crime, Drama, Mystery, Sci-Fi, Thrill...",2010,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
5,Harry Potter and the Order of the Phoenix,4.5,54001,"[Adventure, Drama, Fantasy, IMAX]",2007,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
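
Notice that "Avengers, The" matched two different films (1998 and 2012), so the user's single rating of 4.1 now appears twice. One way to avoid this, sketched below on a made-up catalogue, is to let the user input carry the release year as well and merge on both columns. Adding `how='left'` with `indicator=True` also reveals titles that are not in the catalogue at all. The `year` field in the input and the "Not In Catalogue" entry are assumptions added for illustration.

```python
import pandas as pd

# Made-up catalogue mirroring the duplicate title seen above.
catalogue = pd.DataFrame({
    'movieId': [2153, 89745, 79132],
    'title': ['Avengers, The', 'Avengers, The', 'Inception'],
    'year': ['1998', '2012', '2010'],
})

# Hypothetical user input that also carries the release year.
user_input = pd.DataFrame({
    'title': ['Avengers, The', 'Inception', 'Not In Catalogue'],
    'year': ['2012', '2010', '2020'],
    'rating': [4.1, 4.8, 3.0],
})

# Left merge on title and year; the _merge column records where each row was found.
merged = user_input.merge(catalogue, on=['title', 'year'], how='left', indicator=True)

matched = merged[merged['_merge'] == 'both'].drop(columns='_merge')
unmatched_titles = merged.loc[merged['_merge'] == 'left_only', 'title'].tolist()

print(matched)
print(unmatched_titles)
```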


In [14]:
user_movie = input_movies.drop(['title', 'rating' ,'movieId', 'genres', 'year'], axis=1)
user_movie

Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
5,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
# calculating the user's genre weights: rating-weighted sum of the genre indicators
interest_list = user_movie.transpose().dot(input_movies['rating'])

interest_list

Adventure             16.9
Animation              0.0
Children               0.0
Comedy                 0.0
Fantasy                4.5
Romance                0.0
Drama                  9.3
Action                17.2
Crime                  4.8
Thriller               9.8
Horror                 5.0
Mystery                4.8
Sci-Fi                13.1
IMAX                  13.4
Documentary            0.0
War                    0.0
Musical                0.0
Western                0.0
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64
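
To see where these numbers come from, the sketch below recomputes two of the weights by hand. The ratings and indicator values are copied from the six rows of the tables above; only the variable names are new.

```python
import numpy as np

# Ratings of the six matched rows, in the order shown above.
ratings = np.array([5.0, 4.1, 4.1, 4.2, 4.8, 4.5])

# Indicator columns for two genres, copied from the user_movie table.
adventure = np.array([0, 1, 1, 1, 0, 1])
action = np.array([0, 1, 1, 1, 1, 0])

# Each weight is the sum of the ratings of the movies that carry the genre.
adventure_weight = float(adventure @ ratings)  # 4.1 + 4.1 + 4.2 + 4.5
action_weight = float(action @ ratings)        # 4.1 + 4.1 + 4.2 + 4.8

print(round(adventure_weight, 1), round(action_weight, 1))
```

Because "Avengers, The" appears in two rows, its rating of 4.1 is counted twice in both weights.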

##### Getting the genres from the movie dataframe

In [16]:
# creating genre table
# setting movie_id as index
genre_table = movies_.set_index(movies_['movieId'])

# drop every column that is not a genre indicator
genre_table = genre_table.drop(['title', 'movieId', 'genres', 'year'], axis=1)
display(genre_table.shape)
genre_table.head()

(34208, 20)

movieId,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
# score every movie: sum of the user's weights for its genres, divided by the total weight
recommendation_df = ((genre_table * interest_list).sum(axis=1)) / (interest_list.sum())

# sorting our recommendations in descending order
recommendation_df = recommendation_df.sort_values(ascending=False)

recommendation_df.head()

movieId
1    0.216599
2    0.216599
3    0.000000
4    0.094130
5    0.000000
dtype: float64
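
The score of a movie is the share of the user's total genre weight that falls on that movie's genres. The sketch below reproduces the scores of movieId 1 and 4 from the non-zero weights listed earlier. Genres with zero weight (for example Animation, Children and Comedy) are left out because they contribute nothing.

```python
import pandas as pd

# Non-zero genre weights from the user's profile above.
interest = pd.Series({
    'Adventure': 16.9, 'Fantasy': 4.5, 'Drama': 9.3, 'Action': 17.2, 'Crime': 4.8,
    'Thriller': 9.8, 'Horror': 5.0, 'Mystery': 4.8, 'Sci-Fi': 13.1, 'IMAX': 13.4,
})

# Genre indicators for movieId 1 (Toy Story) and 4 (Waiting to Exhale),
# restricted to the genres that carry weight.
genres = pd.DataFrame(0.0, index=pd.Index([1, 4], name='movieId'), columns=interest.index)
genres.loc[1, ['Adventure', 'Fantasy']] = 1.0
genres.loc[4, 'Drama'] = 1.0

# Weighted genre match, normalised by the total weight (98.8).
scores = (genres * interest).sum(axis=1) / interest.sum()
print(scores.round(6))
```

Movie 1 scores (16.9 + 4.5) / 98.8 and movie 4 scores 9.3 / 98.8, matching the values shown above.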

## Predicting recommendation

In [19]:
# retrieving the top 20 recommendations
recommended = movie_df[movie_df['movieId'].isin(recommendation_df.head(20).index)]
recommended

Unnamed: 0,movieId,title,genres,year
6261,6365,"Matrix Reloaded, The","[Action, Adventure, Sci-Fi, Thriller, IMAX]",2003
6823,6934,"Matrix Revolutions, The","[Action, Adventure, Sci-Fi, Thriller, IMAX]",2003
7763,8361,"Day After Tomorrow, The","[Action, Adventure, Drama, Sci-Fi, Thriller]",2004
9403,27618,"Sound of Thunder, A","[Action, Adventure, Drama, Sci-Fi, Thriller]",2005
10382,36509,"Cave, The","[Action, Adventure, Horror, Mystery, Sci-Fi, T...",2005
11410,48774,Children of Men,"[Action, Adventure, Drama, Sci-Fi, Thriller]",2006
11838,52722,Spider-Man 3,"[Action, Adventure, Sci-Fi, Thriller, IMAX]",2007
12464,58025,Jumper,"[Action, Adventure, Drama, Sci-Fi, Thriller]",2008
12873,60684,Watchmen,"[Action, Drama, Mystery, Sci-Fi, Thriller, IMAX]",2009
14397,71999,Aelita: The Queen of Mars (Aelita),"[Action, Adventure, Drama, Fantasy, Romance, S...",1924


# Conclusion

This notebook builds a simple content-based recommender: it turns a user's ratings into a genre-weight profile and ranks movies by how closely their genres match that profile.

<h3>Author</h3>
<h4>Akash Sharma</h4>
<div style="float:left">
  <a href="https://www.linkedin.com/in/akash-sharma-01775b14a">
    <img src="https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white" alt="LinkedIn">
  </a>
  <a href="https://discord.com/users/366283102462541865">
    <img src="https://img.shields.io/badge/Discord-7289DA?style=for-the-badge&logo=discord&logoColor=white" alt="Discord">
  </a>
  <a href="https://github.com/CosmiX-6">
    <img src="https://img.shields.io/badge/GitHub-100000?style=for-the-badge&logo=github&logoColor=white" alt="GitHub">
  </a>
</div>