# Content-Based Recommender System

In [30]:
# Let's import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
%matplotlib inline
sns.set()

### Data Collection


Let's read the raw IMDb movies dataset from a CSV file into a pandas dataframe.

In [31]:
# Read the csv file and store into pandas dataframe
df_imdb_raw=pd.read_csv('imdb_movies.csv')

# let's see the first 5 rows of dataframe
df_imdb_raw.head()

Unnamed: 0,Imdb_Id,Year,Movie_Id,Title,Genres,Rating,T_Id
0,218817,2001,4052,Antitrust,"Crime, Drama, Thriller",3.1,9989.0
1,238948,2001,4053,Double Take,"Action, Comedy",2.7,18828.0
2,206275,2001,4054,Save the Last Dance,"Drama, Romance",3.2,9816.0
3,237572,2001,4056,"Pledge, The","Crime, Drama, Mystery, Thriller",3.3,5955.0
4,186589,2001,4068,Sugar & Spice,Comedy,2.8,16723.0


### Data Cleansing

In [32]:
# describe the dataframe that contains IMDB movies raw data 
df_imdb_raw.describe().round(2)

Unnamed: 0,Imdb_Id,Year,Movie_Id,Rating,T_Id
count,16398.0,16398.0,16398.0,16398.0,16255.0
mean,1430633.21,2042.0,94229.36,3.09,101997.84
std,1046857.11,513.5,40839.59,0.74,100698.61
min,35423.0,2001.0,4052.0,0.5,12.0
25%,448131.75,2006.0,68934.25,2.8,21066.0
50%,1251546.5,2009.0,101380.0,3.2,59013.0
75%,2031562.25,2012.0,129246.75,3.5,161915.5
max,5290524.0,9999.0,151709.0,5.0,378477.0



We can see that the count for T_Id doesn't match the other features, which indicates that it contains some null values. This is not a concern, because T_Id is irrelevant to our model and will be dropped shortly. Still, let's check the null counts.

In [33]:
df_imdb_raw.isna().sum().to_frame().T

Unnamed: 0,Imdb_Id,Year,Movie_Id,Title,Genres,Rating,T_Id
0,0,0,0,0,0,0,143


In [34]:
# Let's create a function to correct the title in some rows, e.g. from "Pledge, The" to "The Pledge"
def title_correction(movie_name):
    suffix=', The'

    # Move a trailing ", The" to the front (safe for titles with several commas); else no change
    if movie_name.endswith(suffix):
        t_name='The '+movie_name[:-len(suffix)].strip()
    else:
        t_name=movie_name

    # return the movie title
    return t_name


Let's apply this user-defined function to correct the titles where applicable.

In [35]:
# copy the original dataframe
df_imdb_adj=df_imdb_raw.copy()

# apply the function on title column and display the first five rows
df_imdb_adj['Movie Title']=df_imdb_raw['Title'].apply(title_correction)
df_imdb_adj.head()

Unnamed: 0,Imdb_Id,Year,Movie_Id,Title,Genres,Rating,T_Id,Movie Title
0,218817,2001,4052,Antitrust,"Crime, Drama, Thriller",3.1,9989.0,Antitrust
1,238948,2001,4053,Double Take,"Action, Comedy",2.7,18828.0,Double Take
2,206275,2001,4054,Save the Last Dance,"Drama, Romance",3.2,9816.0,Save the Last Dance
3,237572,2001,4056,"Pledge, The","Crime, Drama, Mystery, Thriller",3.3,5955.0,The Pledge
4,186589,2001,4068,Sugar & Spice,Comedy,2.8,16723.0,Sugar & Spice
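
The same title fix can probably also be expressed as a single vectorized call instead of a row-by-row `apply`. The snippet below is a minimal sketch, assuming only the trailing ", The" pattern matters; the `titles` series is a small illustrative sample built from rows shown in this notebook, not the full dataset.

```python
import pandas as pd

# Sample titles taken from rows shown in this notebook (illustrative only)
titles = pd.Series(["Antitrust", "Pledge, The", "Matrix Reloaded, The", "Sugar & Spice"])

# Move a trailing ", The" to the front; all other titles pass through unchanged
fixed_titles = titles.str.replace(r"^(.*), The$", r"The \1", regex=True)
print(fixed_titles.tolist())
# ['Antitrust', 'The Pledge', 'The Matrix Reloaded', 'Sugar & Spice']
```

Because the pattern is anchored at the end of the string, titles that contain extra commas before the final ", The" keep those commas intact.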



Now we need to drop the columns that the model does not require, i.e. system-generated identifiers such as Imdb_Id and T_Id, along with the original Title column, which Movie Title replaces.

In [36]:
df_imdb_transform=df_imdb_adj.drop(['Imdb_Id','T_Id','Title'], axis=1)
df_imdb_transform.head()

Unnamed: 0,Year,Movie_Id,Genres,Rating,Movie Title
0,2001,4052,"Crime, Drama, Thriller",3.1,Antitrust
1,2001,4053,"Action, Comedy",2.7,Double Take
2,2001,4054,"Drama, Romance",3.2,Save the Last Dance
3,2001,4056,"Crime, Drama, Mystery, Thriller",3.3,The Pledge
4,2001,4068,Comedy,2.8,Sugar & Spice



Since multiple genres are concatenated with a comma delimiter in the dataframe, let's look at the distinct genres across the whole dataset.

In [37]:
genres_arr=df_imdb_transform['Genres'].unique()

uniq_genres=set()
for mv_gen in genres_arr:
    for val in mv_gen.split(', '):
        uniq_genres.add(val)
    
uniq_genres

{'(no genres listed)',
 'Action',
 'Adventure',
 'Animation',
 'Children',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Fantasy',
 'Film-Noir',
 'Horror',
 'IMAX',
 'Musical',
 'Mystery',
 'Romance',
 'Sci-Fi',
 'Thriller',
 'War',
 'Western'}
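
The nested loop above can also be written with pandas string methods. This is a minimal sketch on a hypothetical four-row `genres` series (values copied from rows shown earlier), assuming the same ", " separator as the notebook.

```python
import pandas as pd

# Illustrative genre strings copied from rows shown earlier in this notebook
genres = pd.Series(["Crime, Drama, Thriller", "Action, Comedy", "Drama, Romance", "Comedy"])

# Split each string into a list, flatten to one genre per row, then deduplicate
unique_genres = sorted(genres.str.split(", ").explode().unique())
print(unique_genres)
# ['Action', 'Comedy', 'Crime', 'Drama', 'Romance', 'Thriller']
```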


We can see that some movies have no genres defined, so we will drop those movies from the dataset.

In [38]:
row_ids=df_imdb_transform[df_imdb_transform['Genres']=='(no genres listed)'].index

df_imdb_transform.drop(row_ids, inplace=True)
df_imdb_transform.head()

Unnamed: 0,Year,Movie_Id,Genres,Rating,Movie Title
0,2001,4052,"Crime, Drama, Thriller",3.1,Antitrust
1,2001,4053,"Action, Comedy",2.7,Double Take
2,2001,4054,"Drama, Romance",3.2,Save the Last Dance
3,2001,4056,"Crime, Drama, Mystery, Thriller",3.3,The Pledge
4,2001,4068,Comedy,2.8,Sugar & Spice



To calculate the weighted genre matrix in the next section, as part of model development, we cannot work with the concatenated genres column. So, we need to split each genre into its own column.

In [39]:
# split the genres on the comma delimiter and flag each genre in its own column
for idx, row in df_imdb_transform.iterrows():
    for genre in row['Genres'].split(', '):
        df_imdb_transform.at[idx, genre]=1

In [40]:
df_imdb_transform.head()

Unnamed: 0,Year,Movie_Id,Genres,Rating,Movie Title,Crime,Drama,Thriller,Action,Comedy,...,Animation,Children,War,Adventure,Sci-Fi,Documentary,Musical,IMAX,Western,Film-Noir
0,2001,4052,"Crime, Drama, Thriller",3.1,Antitrust,1.0,1.0,1.0,,,...,,,,,,,,,,
1,2001,4053,"Action, Comedy",2.7,Double Take,,,,1.0,1.0,...,,,,,,,,,,
2,2001,4054,"Drama, Romance",3.2,Save the Last Dance,,1.0,,,,...,,,,,,,,,,
3,2001,4056,"Crime, Drama, Mystery, Thriller",3.3,The Pledge,1.0,1.0,1.0,,,...,,,,,,,,,,
4,2001,4068,Comedy,2.8,Sugar & Spice,,,,,1.0,...,,,,,,,,,,



We can see NaN values wherever a genre does not apply to a movie. So, let's replace these values with 0.

In [41]:
df_imdb_transform.fillna(0, inplace=True)
df_imdb_transform.head()

Unnamed: 0,Year,Movie_Id,Genres,Rating,Movie Title,Crime,Drama,Thriller,Action,Comedy,...,Animation,Children,War,Adventure,Sci-Fi,Documentary,Musical,IMAX,Western,Film-Noir
0,2001,4052,"Crime, Drama, Thriller",3.1,Antitrust,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2001,4053,"Action, Comedy",2.7,Double Take,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2001,4054,"Drama, Romance",3.2,Save the Last Dance,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2001,4056,"Crime, Drama, Mystery, Thriller",3.3,The Pledge,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2001,4068,Comedy,2.8,Sugar & Spice,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
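
The `iterrows` loop and the `fillna(0)` step above can likely be combined into one vectorized call with `Series.str.get_dummies`. The sketch below uses a hypothetical three-row `sample` frame rather than the real dataset. Note two differences from the notebook's result: the flags come back as integers rather than floats, and the genre columns are sorted alphabetically.

```python
import pandas as pd

# Small illustrative frame; Movie_Id and Genres values are copied from rows shown above
sample = pd.DataFrame({
    "Movie_Id": [4052, 4053, 4068],
    "Genres": ["Crime, Drama, Thriller", "Action, Comedy", "Comedy"],
})

# One 0/1 column per genre; genres that do not apply are already 0, so no NaN fill is needed
genre_flags = sample["Genres"].str.get_dummies(sep=", ")
encoded = pd.concat([sample, genre_flags], axis=1)
print(encoded)
```

On a dataset of this size the difference is mostly one of readability, but the vectorized form also avoids growing the dataframe column by column inside a loop.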



Great! We have now cleansed and transformed the raw data into the required format. Let's move on and create a sample user input dataset.

In [151]:
my_movies=df_imdb_raw[(df_imdb_raw['Title'].str.contains('Matrix')) | (df_imdb_raw['Title'].str.contains('Spider-Man'))]
my_movies.sort_values(by='Year')

Unnamed: 0,Imdb_Id,Year,Movie_Id,Title,Genres,Rating,T_Id
300,145487,2002,5349,Spider-Man,"Action, Adventure, Sci-Fi, Thriller",3.5,557.0
4831,499665,2002,76709,Spider-Man: The Ultimate Villain Showdown,Animation,3.4,49131.0
670,234215,2003,6365,"Matrix Reloaded, The","Action, Adventure, Sci-Fi, Thriller, IMAX",3.3,604.0
877,242653,2003,6934,"Matrix Revolutions, The","Action, Adventure, Sci-Fi, Thriller, IMAX",3.2,605.0
1138,316654,2004,8636,Spider-Man 2,"Action, Adventure, Sci-Fi, IMAX",3.5,558.0
12820,439783,2004,132490,Return to Source: The Philosophy of The Matrix,Documentary,4.1,174615.0
2779,413300,2007,52722,Spider-Man 3,"Action, Adventure, Sci-Fi, Thriller, IMAX",3.0,559.0
7213,948470,2012,95510,"Amazing Spider-Man, The","Action, Adventure, Sci-Fi, IMAX",3.5,1930.0
9878,1872181,2014,110553,The Amazing Spider-Man 2,"Action, Sci-Fi, IMAX",3.2,102382.0


### User Profile

In [152]:
# user data with movie names, genres and the user's ratings for those movies
user_data={
    'Movie Name':['The Matrix Reloaded', 'Iron Man', 'Spider-Man'],
    'Genres':['Action, Adventure, Sci-Fi, Thriller, IMAX', 'Action, Adventure, Sci-Fi', 'Action, Adventure, Sci-Fi, Thriller'],
    'Ratings':['3.5', '4.5', '4.8']
}


Let's convert the user data dictionary into a pandas dataframe.

In [153]:
df_user_movies=pd.DataFrame(user_data)
df_user_movies.head()

Unnamed: 0,Movie Name,Genres,Ratings
0,The Matrix Reloaded,"Action, Adventure, Sci-Fi, Thriller, IMAX",3.5
1,Iron Man,"Action, Adventure, Sci-Fi",4.5
2,Spider-Man,"Action, Adventure, Sci-Fi, Thriller",4.8



We need a movie ID key column to identify these movies in the original dataset, so we will use a pandas merge to add it.

In [154]:
df_imdb_movieids=df_imdb_transform[['Movie_Id','Movie Title']]

# Join the dataframes
df_user_movies=pd.merge(df_user_movies, df_imdb_movieids, how='left', left_on='Movie Name', right_on='Movie Title')

# Drop the duplicate column, i.e. movie title
df_user_movies.drop('Movie Title', axis=1, inplace=True)
df_user_movies[['Movie_Id', 'Movie Name', 'Genres', 'Ratings']]

Unnamed: 0,Movie_Id,Movie Name,Genres,Ratings
0,6365,The Matrix Reloaded,"Action, Adventure, Sci-Fi, Thriller, IMAX",3.5
1,59315,Iron Man,"Action, Adventure, Sci-Fi",4.5
2,5349,Spider-Man,"Action, Adventure, Sci-Fi, Thriller",4.8
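
A left merge silently leaves `Movie_Id` empty for any user title that has no exact match in the dataset. One way to surface such rows is the `indicator` option of `pd.merge`. The sketch below is illustrative: the `user` and `catalogue` frames are hypothetical stand-ins for `df_user_movies` and `df_imdb_movieids`, and the third user title is deliberately made up so that it fails to match.

```python
import pandas as pd

# Hypothetical stand-ins for df_user_movies and df_imdb_movieids
user = pd.DataFrame({"Movie Name": ["The Matrix Reloaded", "Iron Man", "A Title Not In The Catalogue"]})
catalogue = pd.DataFrame({
    "Movie_Id": [6365, 59315, 5349],
    "Movie Title": ["The Matrix Reloaded", "Iron Man", "Spider-Man"],
})

# indicator=True adds a "_merge" column that marks rows found only on the left side
merged = pd.merge(user, catalogue, how="left",
                  left_on="Movie Name", right_on="Movie Title", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "Movie Name"].tolist()
print(unmatched)
# ['A Title Not In The Catalogue']
```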



Now, for the user's movies, we need to extract the genre columns from the dataframe that we created in the data cleansing section.

In [155]:
df_user_movies_imdb=df_imdb_transform[df_imdb_transform['Movie Title'].isin(df_user_movies['Movie Name'].tolist())]
df_user_movies_imdb

Unnamed: 0,Year,Movie_Id,Genres,Rating,Movie Title,Crime,Drama,Thriller,Action,Comedy,...,Animation,Children,War,Adventure,Sci-Fi,Documentary,Musical,IMAX,Western,Film-Noir
300,2002,5349,"Action, Adventure, Sci-Fi, Thriller",3.5,Spider-Man,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
670,2003,6365,"Action, Adventure, Sci-Fi, Thriller, IMAX",3.3,The Matrix Reloaded,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
3373,2008,59315,"Action, Adventure, Sci-Fi",3.9,Iron Man,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0


In [156]:
# drop the columns that we do not need
df_user_genres=df_user_movies_imdb.drop(['Year', 'Movie_Id', 'Genres', 'Rating', 'Movie Title'], axis=1)
df_user_genres

Unnamed: 0,Crime,Drama,Thriller,Action,Comedy,Romance,Mystery,Horror,Fantasy,Animation,Children,War,Adventure,Sci-Fi,Documentary,Musical,IMAX,Western,Film-Noir
300,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
670,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
3373,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0



Let's create an array of the user's movie ratings.

In [157]:
usr_ratings=df_user_movies[['Ratings']].astype('float').values.T
usr_ratings

array([[3.5, 4.5, 4.8]])


We also need an array of the genres of the user's movies.

In [158]:
usr_genres=df_user_genres.values
usr_genres

array([[0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0.,
        0., 0., 0.],
       [0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0.,
        1., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0.,
        0., 0., 0.]])


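OK. Now we create the user weighted genre matrix by taking the dot product of the user ratings matrix with the user genres matrix. The dot product pairs ratings with genre rows by position, so both must list the movies in the same order. Here they do not: the ratings follow the order in user_data (The Matrix Reloaded, Iron Man, Spider-Man), while the genre rows follow the dataset order (Spider-Man, The Matrix Reloaded, Iron Man), so the weights below pair each rating with another movie's genres.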

In [159]:
print ("Shape of input matrices:{}, {}\n".format(usr_ratings.shape, usr_genres.shape))
usr_weighted_genres=np.dot(usr_ratings, usr_genres)
usr_weighted_genres

Shape of input matrices:(1, 3), (3, 19)



array([[ 0. ,  0. ,  8. , 12.8,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,
         0. , 12.8, 12.8,  0. ,  0. ,  4.5,  0. ,  0. ]])
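
Since the dot product matches ratings and genre rows purely by position, it may help to align the two on `Movie_Id` first. The following is a minimal sketch, not the notebook's own method: `genres_by_id` and `ratings_by_id` are hypothetical names, and only the five genres that are non-zero for these movies are included. With aligned ratings, Thriller comes out as 4.8 + 3.5 = 8.3 and IMAX as 3.5, instead of the 8.0 and 4.5 shown above.

```python
import numpy as np
import pandas as pd

genre_cols = ["Thriller", "Action", "Adventure", "Sci-Fi", "IMAX"]

# Genre rows in dataset order, indexed by Movie_Id (values from the outputs above)
genres_by_id = pd.DataFrame(
    [[1, 1, 1, 1, 0],   # 5349  Spider-Man
     [1, 1, 1, 1, 1],   # 6365  The Matrix Reloaded
     [0, 1, 1, 1, 0]],  # 59315 Iron Man
    index=[5349, 6365, 59315], columns=genre_cols, dtype=float)

# Ratings in user_data order, indexed by Movie_Id
ratings_by_id = pd.Series([3.5, 4.5, 4.8], index=[6365, 59315, 5349])

# Reorder the ratings to follow the genre rows before multiplying
aligned_ratings = ratings_by_id.loc[genres_by_id.index]
weighted = aligned_ratings.to_numpy() @ genres_by_id.to_numpy()
print(pd.Series(weighted, index=genre_cols).round(1))
```

Indexing both objects by `Movie_Id` makes the pairing explicit, so a change in row order in either frame no longer changes the result.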


The values above are the user's weighted genres. Now we need to normalize them so that they sum to 1.

In [160]:
usr_weighted_genres_norm=usr_weighted_genres/np.sum(usr_weighted_genres, axis=1)
usr_weighted_genres_norm=np.round(usr_weighted_genres_norm, 2)  # np.round_ was removed in NumPy 2.0
usr_weighted_genres_norm

array([[0.  , 0.  , 0.16, 0.25, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  ,
        0.  , 0.25, 0.25, 0.  , 0.  , 0.09, 0.  , 0.  ]])
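
The arithmetic behind these numbers can be checked by hand: the weighted row sums to 8 + 3 × 12.8 + 4.5 = 50.9, so Thriller becomes 8 / 50.9 ≈ 0.16, each of Action, Adventure and Sci-Fi becomes 12.8 / 50.9 ≈ 0.25, and IMAX becomes 4.5 / 50.9 ≈ 0.09. The sketch below reproduces that using only the non-zero entries; passing `keepdims=True` is an assumption on my part that would keep the division correct if more user rows were added later.

```python
import numpy as np

# Non-zero entries of the weighted genre row shown above:
# Thriller, Action, Adventure, Sci-Fi, IMAX
weighted = np.array([[8.0, 12.8, 12.8, 12.8, 4.5]])

# keepdims=True keeps the row sums as a column vector, so the
# division broadcasts per row for any number of user rows
row_sums = weighted.sum(axis=1, keepdims=True)
normalized = np.round(weighted / row_sums, 2)
print(row_sums, normalized)
```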


Excellent! We now have the user profile data required to proceed. Let's look at the values by converting them into a dataframe.

In [161]:
df_user_profile=pd.DataFrame(usr_weighted_genres_norm.T, index=df_user_genres.columns, columns=['Weighted_ratings'])
df_user_profile.sort_values(by='Weighted_ratings', ascending=False)

Unnamed: 0,Weighted_ratings
Action,0.25
Sci-Fi,0.25
Adventure,0.25
Thriller,0.16
IMAX,0.09
Crime,0.0
War,0.0
Western,0.0
Musical,0.0
Documentary,0.0


### Recommendation Table

In [162]:
# we need only the transformed genre columns; everything else can be dropped
df_imdb_genres=df_imdb_transform.drop(['Year', 'Genres', 'Rating', 'Movie Title'], axis=1)
df_imdb_genres.set_index('Movie_Id', inplace=True)
df_imdb_genres.head()

Movie_Id,Crime,Drama,Thriller,Action,Comedy,Romance,Mystery,Horror,Fantasy,Animation,Children,War,Adventure,Sci-Fi,Documentary,Musical,IMAX,Western,Film-Noir
4052,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4053,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4054,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4056,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4068,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [163]:
# cleansed imdb dataframe dimensions
df_imdb_genres.shape

(15862, 19)


To recommend movies based on the user profile, we need a weighted genre score for every movie in the dataset. So, let's multiply the user profile by the IMDb genre matrix and sum the result per movie.

In [164]:
imdb_genres_agg= (df_user_profile.values.T * df_imdb_genres).sum(axis=1)

# consider the top 20 rows
df_recomm_table=imdb_genres_agg.sort_values(ascending=False).head(20)

In [165]:
# display the recommendation scores, indexed by movie ID
df_recomm_table.round(1)

Movie_Id
52722     1.0
101076    1.0
77561     1.0
6365      1.0
6934      1.0
5264      0.9
70336     0.9
66639     0.9
8361      0.9
96200     0.9
61350     0.9
27618     0.9
91500     0.9
133759    0.9
68791     0.9
4638      0.9
122882    0.9
116758    0.9
141385    0.9
4781      0.9
dtype: float64
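
Each score above is the sum of the profile weights for the genres a movie carries. The sketch below illustrates this on three movies from the dataset, using the rounded profile weights shown earlier; `profile` and `movies` are hypothetical names, and `DataFrame.dot` is used here in place of the broadcast-and-sum in the cell above, which should give the same per-movie totals.

```python
import pandas as pd

genre_cols = ["Crime", "Drama", "Thriller", "Action", "Adventure", "Sci-Fi", "IMAX"]

# Rounded user profile weights from the table above (other genres are 0)
profile = pd.Series([0.0, 0.0, 0.16, 0.25, 0.25, 0.25, 0.09], index=genre_cols)

# Genre flags for three movies that appear in this notebook
movies = pd.DataFrame(
    [[1, 1, 1, 0, 0, 0, 0],   # 4052 Antitrust
     [0, 0, 1, 1, 1, 1, 1],   # 6934 The Matrix Revolutions
     [0, 0, 0, 1, 1, 1, 1]],  # 8636 Spider-Man 2
    index=pd.Index([4052, 6934, 8636], name="Movie_Id"), columns=genre_cols)

# Matrix-vector product: one score per movie, highest first
scores = movies.dot(profile).sort_values(ascending=False)
print(scores.round(2))
```

A movie that carries all five of the user's preferred genres, such as The Matrix Revolutions, reaches the maximum score of 1.0, while Antitrust scores only the Thriller weight of 0.16.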


Now we need to extract those top movies by their key, i.e. movie ID, while excluding the movies that the user has already watched.

In [166]:
imdb_filter=df_imdb_raw['Movie_Id'].isin(df_recomm_table.index) 
exc_usr_movies=~df_imdb_raw['Movie_Id'].isin(df_user_movies['Movie_Id'])
df_recom_movies=df_imdb_raw[imdb_filter & exc_usr_movies].sort_values(by=['Rating'], ascending=False)

### Final result: Recommended Movies

In [167]:
df_recom_movies[['Title', 'Genres', 'Rating']].head(10)

Unnamed: 0,Title,Genres,Rating
11596,Mad Max: Fury Road,"Action, Adventure, Sci-Fi, Thriller",3.9
4900,Iron Man 2,"Action, Adventure, Sci-Fi, Thriller, IMAX",3.6
6650,"Hunger Games, The","Action, Adventure, Drama, Sci-Fi, Thriller",3.6
4086,Terminator Salvation,"Action, Adventure, Sci-Fi, Thriller",3.2
14542,Humanity's End,"Action, Adventure, Drama, Sci-Fi, Thriller",3.2
877,"Matrix Revolutions, The","Action, Adventure, Sci-Fi, Thriller, IMAX",3.2
2779,Spider-Man 3,"Action, Adventure, Sci-Fi, Thriller, IMAX",3.0
1088,"Day After Tomorrow, The","Action, Adventure, Drama, Sci-Fi, Thriller",3.0
3605,Babylon A.D.,"Action, Adventure, Sci-Fi, Thriller",2.9
13072,Eyeborgs,"Action, Adventure, Sci-Fi, Thriller",2.9
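
Because the notebook takes the top 20 scores first and removes watched movies afterwards, the final list can end up with fewer than 20 candidates (movie 6365, which the user has already rated, is among the top 20 above). A possible variation is to exclude watched movies before taking the top N. The sketch below is illustrative only: `scores` holds a handful of scores computed from the rounded profile, and `watched_ids` stands in for `df_user_movies['Movie_Id']`.

```python
import pandas as pd

# Illustrative scores keyed by Movie_Id (computed from the rounded profile weights)
scores = pd.Series({52722: 1.0, 6365: 1.0, 6934: 1.0, 5349: 0.91, 8636: 0.84}, name="score")

# Movies the user has already rated; 59315 is absent from this small sample
watched_ids = [6365, 59315, 5349]

# Remove watched movies first, then take the top N of what remains
candidates = scores.drop(index=watched_ids, errors="ignore")
top_n = candidates.nlargest(3)
print(top_n)
```

With this ordering the result always contains N movies, as long as at least N unwatched movies exist.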


### Advantages and Disadvantages of Content-Based Filtering

##### Advantages
* Learns user's preferences
* Highly personalized for the user

##### Disadvantages
* Doesn't take into account what others think of the item, so low-quality recommendations can occur
* Extracting data is not always intuitive
* Determining what characteristics of the item the user dislikes or likes is not always obvious