#  Content Based Recommender System

![](https://images.unsplash.com/photo-1598899134739-24c46f58b8c0?ixid=MXwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHw%3D&ixlib=rb-1.2.1&auto=format&fit=crop&w=1476&q=80)

Recommendations play a major role in social media, advertising and many other applications. We will start the series with a simple content-based algorithm. You can use it to find similar movies, which power features such as 'more like this', 'because you watched' and perhaps 'recommended for you'.

### What is Content-Based Recommendation?
A content-based recommender extracts movie metadata such as genres, cast and overview, converts it into features, and then finds similar movies based on those features.
A content-based recommender can produce bad results if you do not use the metadata wisely: I once got Ice Age as a recommendation because I had watched Titanic! So first decide on the specific criteria the recommender will use. Our criterion is to recommend movies that share features such as director, cast, genres, keywords and release date. The director, genres and cast will have a higher weight than the other features.
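
To make the idea concrete, here is a minimal sketch with made-up movies and metadata. The titles, the tokens and the `jaccard` helper are all hypothetical, and Jaccard overlap stands in for the cosine similarity used later in this notebook. Each movie becomes a set of metadata tokens, and the most similar movie is the one that shares the largest fraction of tokens.

```python
# Hypothetical catalogue: each movie is described by a set of metadata tokens.
movies = {
    "Movie A": {"romance", "drama", "director_x", "actor_1", "actor_2"},
    "Movie B": {"romance", "drama", "director_y", "actor_1"},
    "Movie C": {"animation", "comedy", "director_z", "actor_3"},
}

def jaccard(a, b):
    """Fraction of tokens two movies share (0 = nothing, 1 = identical)."""
    return len(a & b) / len(a | b)

def most_similar(title):
    """Return the other titles ordered from most to least similar."""
    target = movies[title]
    others = [t for t in movies if t != title]
    return sorted(others, key=lambda t: jaccard(target, movies[t]), reverse=True)

print(most_similar("Movie A"))  # ['Movie B', 'Movie C']
```

Movie B shares three of six distinct tokens with Movie A, while Movie C shares none, so an animated comedy never ranks first for a romantic drama as long as the metadata is chosen sensibly.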

We will start with importing, cleaning and preparing the data.


In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity


In [2]:

credits=pd.read_csv('../input/the-movies-dataset/credits.csv')
movies=pd.read_csv('../input/the-movies-dataset/movies_metadata.csv')
keywords=pd.read_csv('../input/the-movies-dataset/keywords.csv')





In [3]:
credits['id']=credits['id'].astype('str')
keywords['id']=keywords['id'].astype('str')

Merge the three DataFrames into one.

In [4]:
training_df=(credits.merge(movies,on='id')).merge(keywords,on='id')
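
The cast to `str` above matters because `merge` needs the key columns on both sides to share a dtype. The sketch below uses tiny made-up frames and assumes, based on that cast, that `movies_metadata.csv` loads its `id` column as strings while the other two files load it as integers.

```python
import pandas as pd

# Tiny stand-ins for the three CSV files (all values are made up).
credits = pd.DataFrame({"id": [1, 2], "cast": ["cast of A", "cast of B"]})
keywords = pd.DataFrame({"id": [1, 2], "keywords": ["kw of A", "kw of B"]})
movies = pd.DataFrame({"id": ["1", "2", "3"], "title": ["A", "B", "C"]})

# Align the key dtype before merging: integer ids become strings.
credits["id"] = credits["id"].astype("str")
keywords["id"] = keywords["id"].astype("str")

# Inner joins (the default) keep only ids present in all three frames.
merged = credits.merge(movies, on="id").merge(keywords, on="id")
print(merged[["id", "title"]])
```

Movie "3" has no credits or keywords, so the default inner join drops it from the result.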

The dataset is large, so we will take only the first 10,000 movies for testing; using the whole dataset would make the algorithm slow. You can change this number as you like, but a dataset that is too small makes it hard for the algorithm to work well, so make sure to slice enough data.

In [5]:
training_df=training_df[:10000]

We will not use every feature (column) of the dataset in the algorithm. For example, revenue, video, homepage and many other features are irrelevant, and including such data can sometimes reduce the algorithm's accuracy. Other columns, such as cast, crew and genres, need some cleaning so that we extract only the names from them.

In [6]:
training_df.head(5)

Unnamed: 0,cast,crew,id,adult,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,keywords
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,tt0114709,en,...,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,tt0113497,en,...,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,tt0113228,en,...,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,tt0114885,en,...,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,tt0113041,en,...,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


We will only work with released movies. It does not make sense to recommend rumoured or post-production movies.

In [7]:
training_df['status'].value_counts()

Released           9953
Rumored              38
Post Production       1
Name: status, dtype: int64

In [8]:
training_df=training_df[training_df['status']=='Released']
training_df = training_df.drop('status', axis=1)  # reassign instead of inplace to avoid SettingWithCopyWarning on the filtered frame

I will use only 'title', 'cast', 'crew', 'genres', 'keywords', 'original_language', 'popularity' and 'release_date'. We will not rely on the overview or tagline: these features could confuse the algorithm because they may be written figuratively, in a way the computer cannot understand.

In [9]:
training_df=training_df[['title','cast','crew','genres','keywords','original_language',
                         'popularity','release_date']]


In [10]:
training_df.head(3)

Unnamed: 0,title,cast,crew,genres,keywords,original_language,popularity,release_date
0,Toy Story,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...",en,21.946943,1995-10-30
1,Jumanji,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10090, 'name': 'board game'}, {'id': 1...",en,17.015539,1995-12-15
2,Grumpier Old Men,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392...",en,11.7129,1995-12-22


We will use the clean_data function to strip accents by reducing the text to ASCII, replace any run of special characters or spaces with an underscore so the algorithm counts a multi-word name as one word, and convert the result to lowercase.

In [11]:
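import unicodedata
import re

def clean_data(value):
    # Strip accents: decompose characters, then drop everything that is not ASCII
    value = unicodedata.normalize('NFD', value).encode('ascii', 'ignore').decode('ascii')
    # Collapse every run of non-alphanumeric characters into one underscore, then lowercase
    x = re.sub(r"[^a-zA-Z0-9]+", '_', value.strip()).lower()
    # A value made up only of special characters becomes a blank token
    return x if x != '_' else " "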
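
A few sample calls show what the cleaning does. The function is repeated here only so the snippet runs on its own; the inputs are illustrative strings, not rows taken from the dataset.

```python
import re
import unicodedata

def clean_data(value):
    # Same steps as the cell above: ASCII-fold, underscore-join, lowercase.
    value = unicodedata.normalize('NFD', value).encode('ascii', 'ignore').decode('ascii')
    x = re.sub(r"[^a-zA-Z0-9]+", '_', value.strip()).lower()
    return x if x != '_' else " "

print(clean_data("Tom Hanks"))    # tom_hanks
print(clean_data("Ann-Margret"))  # ann_margret
print(clean_data("Amélie"))       # amelie
```

Because the underscore-joined name is a single token, "tom_hanks" can only match another "tom_hanks", never an unrelated "Tom".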
    

In [12]:
training_df.isnull().sum()

title                0
cast                 0
crew                 0
genres               0
keywords             0
original_language    0
popularity           0
release_date         4
dtype: int64

In [13]:
training_df['original_language']=training_df['original_language'].fillna('').astype('str')


Using the day and month of a movie's release date does not make sense. We will convert the release date to a release decade, so the algorithm is able to recommend movies from the same decade.

In [14]:

training_df['release_date'] = pd.to_datetime(training_df['release_date'], errors='coerce')


In [15]:
training_df['release_date'].head(3)

0   1995-10-30
1   1995-12-15
2   1995-12-22
Name: release_date, dtype: datetime64[ns]

In [16]:
import math
training_df['release_date']=training_df['release_date'].dt.year.fillna(0).astype('int')
training_df['release_date']=training_df['release_date'].apply(lambda x: str(math.floor((int(x))/ 10) * 10))


In [17]:

training_df['release_date'].head(3)

0    1990
1    1990
2    1990
Name: release_date, dtype: object
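
The same decade bucketing can be sketched in plain Python, without pandas. The helper name `release_decade` is hypothetical; missing or unparseable dates fall back to year 0, mirroring `fillna(0)` in the cell above.

```python
from datetime import datetime

def release_decade(date_text):
    """Map 'YYYY-MM-DD' to its decade as a string; bad or missing dates map to '0'."""
    try:
        year = datetime.strptime(date_text, "%Y-%m-%d").year
    except (TypeError, ValueError):
        year = 0  # same fallback as fillna(0) for missing release dates
    # Integer division drops the last digit of the year: 1995 -> 199 -> 1990
    return str(year // 10 * 10)

print(release_decade("1995-10-30"))  # 1990
print(release_decade("2003-05-01"))  # 2000
print(release_decade(None))          # 0
```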

In [18]:
import ast
training_df['cast']=training_df['cast'].apply(lambda s: list(ast.literal_eval(s)))

We will use only the first three actors listed for each movie. Some movies have only one leading actor, others have many, so we will stick with three. As mentioned earlier, we will extract only the names of the cast.

In [19]:
training_df['cast'] = training_df['cast'].map(lambda x: x[:3])  # slicing is safe for lists shorter than three
training_df['cast'] = training_df['cast'].apply(lambda cast: [clean_data(actor['name']) for actor in cast])
training_df.head(3)

Unnamed: 0,title,cast,crew,genres,keywords,original_language,popularity,release_date
0,Toy Story,"[tom_hanks, tim_allen, don_rickles]","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...",en,21.946943,1990
1,Jumanji,"[robin_williams, jonathan_hyde, kirsten_dunst]","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10090, 'name': 'board game'}, {'id': 1...",en,17.015539,1990
2,Grumpier Old Men,"[walter_matthau, jack_lemmon, ann_margret]","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392...",en,11.7129,1990
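
As a small standalone illustration of the two cells above, the sketch below parses one stringified cast list and keeps the first three names. The record is made up and much shorter than the real ones, which carry more fields per actor.

```python
import ast

# A shortened, made-up value in the same shape as the raw 'cast' column.
raw_cast = (
    "[{'cast_id': 1, 'character': 'Lead', 'name': 'Actor One'}, "
    "{'cast_id': 2, 'character': 'Friend', 'name': 'Actor Two'}, "
    "{'cast_id': 3, 'character': 'Rival', 'name': 'Actor Three'}, "
    "{'cast_id': 4, 'character': 'Extra', 'name': 'Actor Four'}]"
)

cast = ast.literal_eval(raw_cast)  # string -> list of dicts
top_three = [member['name'] for member in cast[:3]]
print(top_three)  # ['Actor One', 'Actor Two', 'Actor Three']
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a CSV file.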


In [20]:

training_df['clean_title']=training_df['title'].apply((lambda title: clean_data(title)))


We will use only the director's name from the crew.

In [21]:
def get_director(crew):
    # Return the cleaned name of the first crew member credited as Director
    for member in crew:
        if member['job'] == "Director":
            return clean_data(member['name'])
    return np.nan
    
training_df['crew']=training_df['crew'].apply(lambda s: list(literal_eval(s)))
training_df['crew']=training_df['crew'].apply(get_director)
training_df['crew']=training_df['crew'].fillna('')
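
The lookup can also be written as a single expression. This is only a sketch: `first_director` is a hypothetical helper, the crew records are made up, and it returns an empty string directly instead of going through `np.nan` and `fillna('')`.

```python
def first_director(crew):
    """Name of the first crew member whose job is 'Director', or '' if there is none."""
    return next((m['name'] for m in crew if m.get('job') == 'Director'), '')

crew = [
    {'job': 'Producer', 'name': 'Person A'},
    {'job': 'Director', 'name': 'Person B'},
    {'job': 'Director', 'name': 'Person C'},
]
print(first_director(crew))      # Person B
print(repr(first_director([])))  # ''
```

Like `get_director`, this keeps only the first director when a movie credits several.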

In [22]:
training_df['genres']=training_df['genres'].apply(lambda s: list(ast.literal_eval(s)))
training_df['genres']=training_df['genres'].apply((lambda genres : [ clean_data(genre['name'])  for genre in genres]))


We will keep only keywords that appear more than two times and fewer than 30 times; keywords outside that range are either too rare or too common to be helpful.

In [23]:
training_df['keywords']=training_df['keywords'].apply(lambda s: list(ast.literal_eval(s)))
training_df['keywords']=training_df['keywords'].apply((lambda keywords : [ keyword['name']  for keyword in keywords]))

keywords = training_df.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
keywords_count = keywords.value_counts()






In [24]:
valid_keywords=keywords_count[(keywords_count > 2) & (keywords_count<30) ]
training_df['keywords'] = training_df['keywords'].apply(
    lambda row: [clean_data(val)  for val in row if val in valid_keywords]
)
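
The frequency filter can be illustrated on a toy example. The keyword lists below are made up, and the thresholds are lowered to `2 < count < 4` so that the effect is visible on four movies; `keep_informative` is a hypothetical helper.

```python
from collections import Counter

# Made-up keyword lists, one per movie.
movie_keywords = [
    ["toy", "friendship", "rivalry"],
    ["toy", "friendship"],
    ["toy", "friendship", "unique_plot_point"],
    ["friendship"],
]

# How many times each keyword occurs across all movies.
counts = Counter(kw for kws in movie_keywords for kw in kws)

def keep_informative(kws, low=2, high=4):
    """Keep keywords with low < count < high, mirroring the (> 2) & (< 30) filter."""
    return [kw for kw in kws if low < counts[kw] < high]

filtered = [keep_informative(kws) for kws in movie_keywords]
print(filtered)  # [['toy'], ['toy'], ['toy'], []]
```

"rivalry" and "unique_plot_point" are too rare to link movies together, and "friendship" appears in every movie, so it cannot tell them apart.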

In [25]:
training_df.head(5)

Unnamed: 0,title,cast,crew,genres,keywords,original_language,popularity,release_date,clean_title
0,Toy Story,"[tom_hanks, tim_allen, don_rickles]",john_lasseter,"[animation, comedy, family]","[toy, boy, rivalry, boy_next_door, toy_comes_t...",en,21.946943,1990,toy_story
1,Jumanji,"[robin_williams, jonathan_hyde, kirsten_dunst]",joe_johnston,"[adventure, fantasy, family]","[board_game, disappearance, based_on_children_...",en,17.015539,1990,jumanji
2,Grumpier Old Men,"[walter_matthau, jack_lemmon, ann_margret]",howard_deutch,"[romance, comedy]",[fishing],en,11.7129,1990,grumpier_old_men
3,Waiting to Exhale,"[whitney_houston, angela_bassett, loretta_devine]",forest_whitaker,"[comedy, drama, romance]","[interracial_relationship, single_mother]",en,3.859495,1990,waiting_to_exhale
4,Father of the Bride Part II,"[steve_martin, diane_keaton, martin_short]",charles_shyer,[comedy],"[midlife_crisis, confidence, aging, gynecologist]",en,8.387519,1990,father_of_the_bride_part_ii


Now we will concatenate the features into one column, bow (which stands for bag of words), so we can feed it to the algorithm.

In [26]:
training_df['bow']=training_df['cast']+training_df['keywords'] + training_df['genres']
training_df['bow']=training_df['bow'].apply(lambda x: ' '.join(x))+" "+training_df['crew']+" "+training_df['clean_title']




In [27]:
training_df["bow"].head(3)

0    tom_hanks tim_allen don_rickles toy boy rivalr...
1    robin_williams jonathan_hyde kirsten_dunst boa...
2    walter_matthau jack_lemmon ann_margret fishing...
Name: bow, dtype: object
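
For a single movie, the concatenation looks like the sketch below. The values follow the Toy Story row shown earlier, with the keyword list shortened to its first three entries.

```python
movie = {
    "cast": ["tom_hanks", "tim_allen", "don_rickles"],
    "keywords": ["toy", "boy", "rivalry"],  # shortened for the example
    "genres": ["animation", "comedy", "family"],
    "crew": "john_lasseter",
    "clean_title": "toy_story",
}

# The list-valued features are joined first, then the two string features are appended.
tokens = movie["cast"] + movie["keywords"] + movie["genres"]
bow = " ".join(tokens) + " " + movie["crew"] + " " + movie["clean_title"]
print(bow)
```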

We will use CountVectorizer to count the words in each movie: each movie is represented by a row, and the words of all movies are the columns. Then we will use the cosine similarity function to find similar movies. Before that, we will scale the counts of the genres, cast and directors so they have a higher weight than other words in the similarity function.

Some rows were dropped during cleaning, so we need to reset the row index.

In [28]:
training_df=training_df.reset_index(drop=True)


In [29]:
directors = training_df.apply(lambda x: pd.Series(x['crew']),axis=1).stack().reset_index(level=1, drop=True)
cast = training_df.apply(lambda x: pd.Series(x['cast']),axis=1).stack().reset_index(level=1, drop=True)
genres = training_df.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)


  


In [30]:

directors=list(filter(None, directors))
cast=list(filter(None, cast))
genres=list(filter(None, genres))



The cast will have (5 * original count), directors (3 * original count) and genres (4 * original count). An actor's name mostly appears once in the bow of a movie, so its count will be 1. Giving the actors the most weight is a business decision; you could argue that the genres or the director should have the highest weight instead. Try changing these weights and check how the recommendations change accordingly.

In [31]:
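# One row per movie, one column per distinct token in the bag of words
# (min_df=1 is the default; recent scikit-learn versions reject an integer min_df of 0)
vectorizer = CountVectorizer(analyzer='word', min_df=1, strip_accents='ascii')
train_array = vectorizer.fit_transform(training_df['bow'])

# Later updates win: a name that is both an actor and a director ends up with weight 3
words_weights = dict.fromkeys(cast, 5)
words_weights.update(dict.fromkeys(directors, 3))
words_weights.update(dict.fromkeys(genres, 4))

# get_feature_names() was removed in scikit-learn 1.2; list() keeps .index() available
feature_names = list(vectorizer.get_feature_names_out())
weights = np.ones(len(feature_names))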




In [32]:
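for key, value in words_weights.items():
    # vocabulary_ maps each token to its column; skip tokens the vectorizer dropped
    # (for example blank or single-character tokens)
    x = vectorizer.vocabulary_.get(str(key))
    if x is not None:
        weights[x] = value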

In [33]:
train_array=train_array.toarray()
train_array=train_array*weights
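
The effect of multiplying the count matrix by the weight vector can be seen on a toy example in plain Python. Everything here is made up for illustration: three tiny bags of words, and weights of 5 for actors and 4 for genres with 1 for any other token.

```python
from collections import Counter

docs = ["actor_a drama heist", "actor_a comedy", "actor_b drama heist"]

# Vocabulary: one column per distinct token, in sorted order.
vocab = sorted({tok for doc in docs for tok in doc.split()})

# Count matrix: one row per movie, one count per vocabulary token.
counts = [[Counter(doc.split())[tok] for tok in vocab] for doc in docs]

# Hypothetical weights: actors 5, genres 4, everything else 1.
token_weights = {"actor_a": 5, "actor_b": 5, "drama": 4, "comedy": 4}
weights = [token_weights.get(tok, 1) for tok in vocab]

# Scale every column by its weight (what train_array * weights does via broadcasting).
weighted = [[c * w for c, w in zip(row, weights)] for row in counts]

print(vocab)     # ['actor_a', 'actor_b', 'comedy', 'drama', 'heist']
print(weighted)  # [[5, 0, 0, 4, 1], [5, 0, 4, 0, 0], [0, 5, 0, 4, 1]]
```

After weighting, sharing an actor moves two rows closer together than sharing an ordinary keyword such as "heist".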



### Cosine Similarity
We will use cosine similarity to measure the similarity between movies based on the word counts of each movie. You could use other algorithms or create your own.

![](https://wikimedia.org/api/rest_v1/media/math/render/svg/1d94e5903f7936d3c131e040ef2c51b473dd071d)
Example: take two movies M1 and M2, and let cosine_similarity(M1, M2) be the similarity between them according to their (bow) columns:

cosine_similarity(M1, M2) = 1 if the two movies have the same (or proportional) word counts, for example if M1 = M2.

cosine_similarity(M1, M2) = 0 if they have no words in common.

You can read more about cosine similarity [here](https://en.wikipedia.org/wiki/Cosine_similarity) and [here](https://scikit-learn.org/stable/modules/metrics.html#cosine-similaritycosine_similarity).
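
The two boundary cases can be checked with a small plain-Python version of the formula. This is a sketch for illustration only: the vectors are made up, and returning 0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used in this sketch for an empty bag of words
    return dot / (norm_u * norm_v)

m1 = [5, 0, 4, 1]
m2 = [5, 0, 4, 1]  # same counts as m1
m3 = [0, 3, 0, 0]  # no token in common with m1
m4 = [5, 5, 0, 0]  # shares only the first token with m1

print(cosine(m1, m2))            # 1.0, up to floating-point rounding
print(cosine(m1, m3))            # 0.0
print(round(cosine(m1, m4), 3))  # 0.546
```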

In [34]:
cosine_sim = cosine_similarity(train_array, train_array)

In [35]:
indices = pd.Series(training_df.index, index=training_df['title'])    

The main idea of this recommendation algorithm is to find the most similar movies based on the features we determined earlier (cast, director, genres, etc.). Re-ranking by popularity would narrow the list to the most popular items and reduce the risk of showing items that suit particular (and maybe unpopular) tastes, although a user who likes unusual items more than popular ones would still see the unusual ones bubble up in the list. Our algorithm does not re-rank the movies; it relies on the similarity score alone to determine the rank.

In [36]:
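def get_recommendations(title):
    # Boolean indexing always returns a Series, even when the title is unique
    idx = indices[indices.index == title].iloc[0]
    # Pair every movie index with its similarity to the requested movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip position 0 (normally the movie itself) and keep the next 30
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return training_df.iloc[movie_indices]['title']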

In [37]:
get_recommendations('Titanic').head(10)

3180                      The Beach
331     What's Eating Gilbert Grape
5208                         Enigma
1043                           Jude
1046                 Romeo + Juliet
198                   Total Eclipse
6092         The Life of David Gale
5600              Absence of Malice
4866                           Iris
2484                  Hideous Kinky
Name: title, dtype: object
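
`get_recommendations` drops the first entry of the sorted list on the assumption that it is the movie itself. A minimal sketch of an alternative, using a hypothetical helper `top_n_similar` and a made-up similarity row, filters by index instead, which also behaves predictably when another movie ties with a score of 1.0.

```python
def top_n_similar(sim_row, self_idx, n=3):
    """Indices of the n largest scores in sim_row, excluding the movie itself."""
    candidates = [(i, s) for i, s in enumerate(sim_row) if i != self_idx]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in candidates[:n]]

# Made-up similarity row for movie 0; movie 3 has an identical bag of words (score 1.0).
row = [1.0, 0.2, 0.7, 1.0, 0.4]
print(top_n_similar(row, self_idx=0))  # [3, 2, 4]
```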

This kind of recommender system is called a white-box recommender because you can explain why you are getting these recommendations. For example, in the Titanic results most of the movies share leading actors with Titanic. A recommender system is called a black-box recommender, collaborative filtering for example, if it is hard to explain why we are getting its results. The more your system needs to explain, the simpler the algorithm has to be; the better the quality of the recommendations, the more complex the algorithm and the harder it is to show explanations. This problem is known as the [model accuracy-model interpretation trade-off](https://machinelearningmastery.com/model-prediction-versus-interpretation-in-machine-learning/).

There is always room for improvement. The algorithm we used has many disadvantages. For example, when we increase the size of the data a little, the movies/words matrix train_array grows significantly. In the next notebook, we will try other recommendation algorithms. Stay tuned.
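
To put rough numbers on that growth, consider the similarity matrix as well: it has one entry per pair of movies, so doubling the number of movies quadruples its size. The sketch below assumes dense storage with 8 bytes per value (NumPy's default float64); the helper name is hypothetical.

```python
def dense_matrix_bytes(rows, cols, bytes_per_value=8):
    """Memory needed for a dense rows x cols matrix of 8-byte values."""
    return rows * cols * bytes_per_value

n_movies = 10_000
print(dense_matrix_bytes(n_movies, n_movies) / 1e9)          # 0.8 (GB)
print(dense_matrix_bytes(2 * n_movies, 2 * n_movies) / 1e9)  # 3.2 (GB)
```

The dense movies/words matrix grows in the same way, with the vocabulary size as the number of columns, which is one reason to keep the sparse matrix that CountVectorizer returns for larger datasets.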

References:
1. https://www.kaggle.com/rounakbanik/movie-recommender-systems
2. https://www.goodreads.com/book/show/28510003-practical-recommender-systems