 [datacamp_link](https://www.datacamp.com/community/tutorials/recommender-systems-python)

# Content based Recommendation 

## Idea 1: Overview/Description

#### Limitations:

* If a user watches a movie, they may be more interested in another movie with the same lead actor or director than in one with a similar description. A recommender based only on the overview does not capture this.

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
data_path = '/Users/jeremy/data/movie_datasets/'

In [3]:
metadata = pd.read_csv(data_path + 'movies_metadata.csv', low_memory=False)
metadata.head(2)

|   | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | release_date | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | 1995-10-30 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released |  | Toy Story | False | 7.7 | 5415.0 |
| 1 | False |  | 65000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... |  | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | 1995-12-15 | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 |


## Data Preprocessing / Feature engineering

In [4]:
metadata['overview'] = metadata['overview'].fillna('')
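# Keep only rows whose 'adult' flag is 'True' or 'False'; copy so that the
# assignment below does not trigger pandas' SettingWithCopyWarning.
metadata = metadata[metadata.adult.isin(['True','False'])].copy()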
metadata['id'] = metadata['id'].astype('int')

In [5]:
metadata['overview'].head(2)

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
Name: overview, dtype: object

### Text data (NLP)

This is an NLP problem. We need to turn the `overview` text into numeric features, since cosine similarity cannot be computed on raw strings.

To do this, we compute a word vector for each overview (document).

Specifically, we compute `Term Frequency-Inverse Document Frequency` (TF-IDF) vectors for each document. This produces a matrix in which each row represents a movie and each column represents a word in the overview vocabulary (all the words that appear in at least one document).

In essence, the TF-IDF score is the frequency of a word in a document, down-weighted by the number of documents in which that word occurs. This reduces the importance of words that occur frequently across plot overviews and, therefore, their influence on the final similarity score.
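
To make the down-weighting concrete, here is a minimal pure-Python sketch of a smoothed TF-IDF on a three-document toy corpus. The corpus, the whitespace tokenizer, and the helper name `tfidf_weights` are illustrative assumptions. The formula, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, is meant to mirror scikit-learn's documented defaults, but this is not the library's implementation.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration): "toy" occurs in every document.
docs = [
    "toy cowboy friends",
    "toy board game",
    "toy space ranger",
]

def tfidf_weights(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

weights = tfidf_weights(docs)
for term, weight in sorted(weights[0].items()):
    print(f"{term:8s} {weight:.3f}")
```

In the first document, `toy` ends up with a smaller weight than `cowboy` or `friends`, because it appears in all three documents.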


We will proceed as follows:

* import `TfidfVectorizer` from sklearn
* remove stop words such as 'the', 'an', etc. (these carry little information)
* replace NaN overviews with `''` (done in the preprocessing step above)
* compute the TF-IDF matrix

In [6]:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(metadata['overview'])
tfidf_matrix.shape

(45463, 75827)

## Cosine Similarities

In [7]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim.shape

(45463, 45463)
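
`linear_kernel` returns plain dot products. That is enough here because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), and the dot product of two unit-length vectors equals their cosine similarity. The sketch below checks that identity on two made-up vectors; the helper names are hypothetical.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(u):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

def cosine(u, v):
    """Cosine similarity computed directly from the definition."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Two made-up, unnormalized term-weight vectors.
a = [1.0, 2.0, 0.0, 3.0]
b = [0.0, 1.0, 4.0, 1.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# The dot product of the unit vectors matches the cosine of the originals.
print(dot(a_unit, b_unit), cosine(a, b))
```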

## TODO:

* Write a function that takes a movie title and returns the top n most similar movies:


1. Get the index of the movie from its title.
2. Get the list of cosine similarity scores for that movie.
3. Convert it into a list of tuples, where element 0 is the position and element 1 is the score.
4. Sort by score and take the top n elements (skipping the first one, which is the movie itself).
5. Return the titles of these top n movies.
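
Steps 2 to 4 can be tried out on a toy score list before touching the real matrix. This sketch uses made-up scores and `heapq.nlargest` from the standard library, which avoids sorting every score when only the top n are needed. It is an alternative illustration, not the function used in this notebook.

```python
import heapq

# Made-up similarity scores for six movies; position 2 is the query movie
# itself, so its similarity is 1.0.
scores = [0.10, 0.45, 1.00, 0.30, 0.00, 0.62]
query_idx = 2
top_n = 3

# Steps 2-3: pair every position with its score, leaving out the query.
candidates = [(i, s) for i, s in enumerate(scores) if i != query_idx]

# Step 4: keep only the top_n highest-scoring pairs.
top = heapq.nlargest(top_n, candidates, key=lambda pair: pair[1])

top_positions = [i for i, _ in top]
print(top_positions)  # [5, 1, 3]
```

Excluding the query by position, rather than dropping the first sorted element, also behaves sensibly when another movie has an identical description and therefore ties at a score of 1.0.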

In [8]:
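# Map title -> row position (cosine_sim is positional; index labels have gaps after filtering)
indices = pd.Series(np.arange(len(metadata)), index=metadata['title'])
indices = indices[~indices.index.duplicated(keep='first')]  # keep the first entry per title
indices[:2]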

title
Toy Story    0
Jumanji      1
dtype: int64

In [9]:
indices['Toy Story']

0

In [10]:
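def get_cb_recommendations(title: str, metadata: pd.DataFrame = metadata, top_n: int = 10, cosine_sim = cosine_sim, indices_=indices):
    """Return the titles of the top_n movies most similar to `title`."""
    # Row position of the requested movie
    idx = indices_[title]

    # (position, score) pairs, sorted by descending similarity
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry (the movie itself) and keep the next top_n
    sim_scores = sim_scores[1:top_n + 1]
    movie_indices = [i[0] for i in sim_scores]

    return metadata['title'].iloc[movie_indices]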

In [11]:
get_cb_recommendations('The Dark Knight Rises')

12481                                      The Dark Knight
150                                         Batman Forever
1328                                        Batman Returns
15511                           Batman: Under the Red Hood
585                                                 Batman
21194    Batman Unmasked: The Psychology of the Dark Kn...
9230                    Batman Beyond: Return of the Joker
18035                                     Batman: Year One
19792              Batman: The Dark Knight Returns, Part 1
3095                          Batman: Mask of the Phantasm
Name: title, dtype: object

In [12]:
get_cb_recommendations('The Godfather')

1178               The Godfather: Part II
44030    The Godfather Trilogy: 1972-1990
1914              The Godfather: Part III
23126                          Blood Ties
11297                    Household Saints
34717                   Start Liquidation
10821                            Election
38030            A Mother Should Be Loved
17729                   Short Sharp Shock
26293                  Beck 28 - Familjen
Name: title, dtype: object

## Idea 2: Cast, Crew, Genres, Keywords based recommender

## Import Libraries

In [13]:
from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [14]:
credits = pd.read_csv(data_path + 'credits.csv')
keywords = pd.read_csv(data_path + 'keywords.csv')

In [15]:
keywords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        46419 non-null  int64 
 1   keywords  46419 non-null  object
dtypes: int64(1), object(1)
memory usage: 725.4+ KB


In [16]:
metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')
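
One thing to check after merging: `merge` emits one output row per matching pair, so if an `id` appears more than once in `credits` or `keywords`, the corresponding movie is duplicated in the result. The duplicated `Manuscripts Don't Burn` rows in the results further down suggest that this happens in this dataset. A minimal sketch on made-up frames, assuming that keeping the first row per `id` is acceptable:

```python
import pandas as pd

# Made-up stand-ins for the movie and credits tables.
movies = pd.DataFrame({"id": [1, 2], "title": ["A", "B"]})
credits_toy = pd.DataFrame({
    "id": [1, 2, 2],          # id 2 is listed twice
    "cast": ["x", "y", "y"],
})

# Merging as-is duplicates movie 2; de-duplicating the right side first does not.
naive = movies.merge(credits_toy, on="id")
deduped = movies.merge(credits_toy.drop_duplicates(subset="id"), on="id")

print(len(naive), len(deduped))  # 3 2
```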

## Data Preprocessing / Feature engineering

* Parse the stringified features into their corresponding Python objects.
* Define a function that extracts the director from the `crew` feature.
* Define a function that builds a clean list of names from the list-of-dictionaries features.
* Convert the cleaned features to lower case and strip the spaces between words. Removing the spaces matters: otherwise the vectorizer would treat the "Johnny" in "Johnny Depp" and in "Johnny Galecki" as the same token.
* Concatenate the cleaned features into a single string feature, the "metadata soup".
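
The effect of removing spaces can be seen with a small tokenizer sketch. The regular expression below approximates `CountVectorizer`'s default token pattern (runs of two or more word characters); the `tokens` helper is hypothetical.

```python
import re

# Approximation of CountVectorizer's default token pattern: runs of two or
# more word characters.
TOKEN = re.compile(r"\b\w\w+\b")

def tokens(text):
    """Lower-case the text and return its set of tokens."""
    return set(TOKEN.findall(text.lower()))

raw_a, raw_b = "Johnny Depp", "Johnny Galecki"

# With spaces kept, both names contribute the shared token "johnny".
print(tokens(raw_a) & tokens(raw_b))      # {'johnny'}

# With spaces removed, each name becomes a single distinct token.
clean_a = raw_a.replace(" ", "")
clean_b = raw_b.replace(" ", "")
print(tokens(clean_a) & tokens(clean_b))  # set()
```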


In [17]:
features = ['cast', 'crew', 'keywords', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)
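
`literal_eval` only accepts strings (or AST nodes), so a missing value such as `NaN` in any of these columns would raise an error. If that turns out to be a concern for your copy of the data, a defensive wrapper along these lines could be applied instead. `safe_literal_eval` is a hypothetical helper, not part of the original notebook, and falling back to an empty list is an assumption about what is useful downstream.

```python
from ast import literal_eval

def safe_literal_eval(value):
    """Parse a stringified Python literal; return [] if that is not possible."""
    if not isinstance(value, str):
        return []
    try:
        return literal_eval(value)
    except (ValueError, SyntaxError):
        return []

parsed = safe_literal_eval("[{'id': 16, 'name': 'Animation'}]")
print(parsed[0]["name"])                  # Animation
print(safe_literal_eval(float("nan")))    # []
print(safe_literal_eval("not a list ["))  # []
```

It would be used the same way as above, for example `metadata[feature].apply(safe_literal_eval)`.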

In [18]:
def get_director(x: list) -> str:
    for i in x:
        if i.get('job') == 'Director':
            return i.get('name')
    return np.nan


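def get_clean_list(x: list, top_n: int = 3) -> list:
    """Return up to top_n names from a list of dicts (all names if top_n is falsy)."""
    if isinstance(x, list):
        names = [i['name'] for i in x]

        # Logical `and`, not bitwise `&`: e.g. `2 & True` evaluates to 0
        if top_n and len(names) > top_n:
            names = names[:top_n]

        return names
    return []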

In [19]:
# Extract director feature and process cast, genres and keywords features

metadata['director'] = metadata['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(get_clean_list)

In [20]:
metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

|   | title | cast | director | keywords | genres |
|---|---|---|---|---|---|
| 0 | Toy Story | [Tom Hanks, Tim Allen, Don Rickles] | John Lasseter | [jealousy, toy, boy] | [Animation, Comedy, Family] |
| 1 | Jumanji | [Robin Williams, Jonathan Hyde, Kirsten Dunst] | Joe Johnston | [board game, disappearance, based on children'... | [Adventure, Fantasy, Family] |
| 2 | Grumpier Old Men | [Walter Matthau, Jack Lemmon, Ann-Margret] | Howard Deutch | [fishing, best friend, duringcreditsstinger] | [Romance, Comedy] |


In [21]:
def clean_string_features(x: list):
    
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        
        else:
            return ''

In [22]:
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(clean_string_features)

In [23]:
metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

|   | title | cast | director | keywords | genres |
|---|---|---|---|---|---|
| 0 | Toy Story | [tomhanks, timallen, donrickles] | johnlasseter | [jealousy, toy, boy] | [animation, comedy, family] |
| 1 | Jumanji | [robinwilliams, jonathanhyde, kirstendunst] | joejohnston | [boardgame, disappearance, basedonchildren'sbook] | [adventure, fantasy, family] |
| 2 | Grumpier Old Men | [waltermatthau, jacklemmon, ann-margret] | howarddeutch | [fishing, bestfriend, duringcreditsstinger] | [romance, comedy] |


In [24]:
def create_metadata_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + ' '.join(x['director']) + ' ' + ' '.join(x['genres']) + ' '
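
A detail worth checking in `create_metadata_soup`: after the cleaning step, `director` is a plain string rather than a list, and `' '.join` on a string inserts a space between every character. Since `CountVectorizer`'s default tokenizer ignores single-character tokens, the director would then contribute nothing to the count matrix. The sketch below reproduces the effect on a made-up row and shows one possible variant. `create_metadata_soup_v2` is a hypothetical name, and the outputs shown in this notebook come from the original function.

```python
row = {
    "keywords": ["jealousy", "toy", "boy"],
    "cast": ["tomhanks", "timallen", "donrickles"],
    "director": "johnlasseter",          # a string, not a list
    "genres": ["animation", "comedy", "family"],
}

# Joining a string iterates over its characters.
print(" ".join(row["director"]))  # j o h n l a s s e t e r

def create_metadata_soup_v2(x):
    """Variant that inserts the director string as a single token."""
    parts = x["keywords"] + x["cast"] + [x["director"]] + x["genres"]
    # Skip empty strings, e.g. a missing director cleaned to ''.
    return " ".join(p for p in parts if p)

print(create_metadata_soup_v2(row))
```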


In [25]:
metadata['soup'] = metadata.apply(create_metadata_soup, axis=1)

In [26]:
metadata[['soup','title', 'cast', 'director', 'keywords', 'genres']].head(3)

|   | soup | title | cast | director | keywords | genres |
|---|---|---|---|---|---|---|
| 0 | jealousy toy boy tomhanks timallen donrickles ... | Toy Story | [tomhanks, timallen, donrickles] | johnlasseter | [jealousy, toy, boy] | [animation, comedy, family] |
| 1 | boardgame disappearance basedonchildren'sbook ... | Jumanji | [robinwilliams, jonathanhyde, kirstendunst] | joejohnston | [boardgame, disappearance, basedonchildren'sbook] | [adventure, fantasy, family] |
| 2 | fishing bestfriend duringcreditsstinger walter... | Grumpier Old Men | [waltermatthau, jacklemmon, ann-margret] | howarddeutch | [fishing, bestfriend, duringcreditsstinger] | [romance, comedy] |


### Note

The next step is very close to what we did before for the overview text. The main difference is that here we use `CountVectorizer` instead of `TfidfVectorizer`. The reason is that we do not want to down-weight an actor or director just because they have acted in or directed relatively many movies; that would make little intuitive sense in this context.

The key difference between the two is the inverse document frequency (IDF) component, which is only present in TF-IDF.
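
The effect of the IDF term can be seen on a toy example. Below, one invented actor token appears in three of four soups and another in just one. A count-based weight is 1 per occurrence for both, whereas a smoothed IDF (ln((1 + n) / (1 + df)) + 1, used here purely for illustration) gives the prolific actor a smaller weight than the rare one. All names and data are made up.

```python
import math
from collections import Counter

# Invented soups: "prolificactor" appears in three of the four movies.
soups = [
    "prolificactor directora drama",
    "prolificactor directorb crime",
    "prolificactor rareactor action",
    "otheractor directorc comedy",
]

tokenized = [s.split() for s in soups]
n_docs = len(tokenized)
# Document frequency: number of soups containing each token.
df = Counter(tok for toks in tokenized for tok in set(toks))

# Smoothed inverse document frequency per token.
idf = {tok: math.log((1 + n_docs) / (1 + d)) + 1 for tok, d in df.items()}

for token in ("prolificactor", "rareactor"):
    # A count-based weight is 1 per occurrence, regardless of df.
    print(f"{token:14s} count weight = 1, idf weight = {idf[token]:.3f}")
```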

In [27]:
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata['soup'])
count_matrix.shape

(46628, 58204)

In [28]:
cosine_count_sim = cosine_similarity(count_matrix, count_matrix)
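
A dense 46628 × 46628 similarity matrix holds about 2.17 billion entries, roughly 17 GB as 64-bit floats, which may not fit in memory. One alternative, sketched here on a made-up corpus, is to compute only the row for the queried movie at lookup time. `top_n_similar` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up soups standing in for metadata['soup'].
toy_soups = [
    "dccomics christianbale christophernolan action crime",
    "dccomics christianbale christophernolan action drama",
    "toy tomhanks johnlasseter animation comedy",
    "christianbale drama thriller",
]

toy_matrix = CountVectorizer().fit_transform(toy_soups)

def top_n_similar(matrix, row_idx, top_n=2):
    """Return the positions of the top_n rows most similar to row_idx."""
    # Similarity of one row against all rows: shape (1, n_rows), flattened.
    sims = cosine_similarity(matrix[row_idx:row_idx + 1], matrix).ravel()
    sims[row_idx] = -np.inf                   # exclude the query itself
    order = np.argsort(-sims, kind="stable")  # descending similarity
    return order[:top_n].tolist()

print(top_n_similar(toy_matrix, 0))  # [1, 3]
```

This trades a one-off precomputation for a small amount of work per query, and only ever materializes one row of similarities at a time.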

In [32]:
# Reset index of your main DataFrame and construct reverse mapping as before
metadata = metadata.reset_index()
indices = pd.Series(metadata.index, index=metadata['title'])

In [33]:
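# Same logic as get_cb_recommendations, redefined so that the default arguments
# bind to the merged `metadata`, the new `indices`, and the count-based similarities.
def get_cb_recommendations1(title: str, metadata: pd.DataFrame = metadata, top_n: int = 10, cosine_sim = cosine_count_sim, indices_=indices):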
    idx = indices_[title]
    
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    sim_scores = sim_scores[1:top_n + 1]
    movie_indices = [i[0] for i in sim_scores]
    
    return metadata['title'].iloc[movie_indices]

In [37]:
bat = get_cb_recommendations1('The Dark Knight Rises', cosine_sim=cosine_count_sim, indices_=indices)
bat

12589           The Dark Knight
10210             Batman Begins
9311                     Shiner
9874            Amongst Friends
7772                   Mitchell
35802    Manuscripts Don't Burn
35803    Manuscripts Don't Burn
41063                      Sara
516           Romeo Is Bleeding
24090                 Quicksand
Name: title, dtype: object

In [36]:
get_cb_recommendations1('The Godfather', cosine_sim=cosine_count_sim, indices_=indices)

35802            Manuscripts Don't Burn
35803            Manuscripts Don't Burn
1934            The Godfather: Part III
8001     The Night of the Following Day
18261                 The Son of No One
28683            In the Name of the Law
39193                 The Good Neighbor
7772                           Mitchell
18940                         Last Exit
34488                              Rege
Name: title, dtype: object

In [41]:
metadata[metadata.title.isin(bat.values.tolist())]

|   | level_0 | index | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | ... | tagline | title | video | vote_average | vote_count | cast | crew | keywords | director | soup |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 516 | 516 | 516 | False |  | 11500000 | [action, crime, drama] |  | 2088 | tt0107983 | en | ... | The story of a cop who wanted it bad and got i... | Romeo Is Bleeding | False | 5.7 | 36.0 | [garyoldman, lenaolin, annabellasciorra] | [{'credit_id': '52fe4334c3a36847f80422d1', 'de... | [policeoperation, sexaddiction, police] | petermedak | policeoperation sexaddiction police garyoldman... |
| 7772 | 7772 | 7772 | False |  | 0 | [crime, drama, action] |  | 32303 | tt0073396 | en | ... | Brute Force With a Badge | Mitchell | False | 4.4 | 15.0 | [joedonbaker, martinbalsam, johnsaxon] | [{'credit_id': '52fe44cb9251416c9101d24b', 'de... | [drama, crime] | andrewv.mclaglen | drama crime joedonbaker martinbalsam johnsaxon... |
| 9311 | 9311 | 9311 | False |  | 0 | [drama, action, crime] |  | 28943 | tt0232632 | en | ... |  | Shiner | False | 5.1 | 10.0 | [michaelcaine, martinlandau, andyserkis] | [{'credit_id': '52fe45bdc3a368484e06c373', 'de... | [] | johnirvin | michaelcaine martinlandau andyserkis j o h n ... |
| 9874 | 9874 | 9874 | False |  | 0 | [crime, drama, action] |  | 77041 | tt0106264 | en | ... | When Crime is a Way of Life, You Never Know if... | Amongst Friends | False | 4.3 | 3.0 | [mirasorvino] | [{'credit_id': '52fe4958c3a368484e126cff', 'de... | [] | robweiss | mirasorvino r o b w e i s s crime drama action |
| 10210 | 10210 | 10210 | False | {'id': 263, 'name': 'The Dark Knight Collectio... | 150000000 | [action, crime, drama] | http://www2.warnerbros.com/batmanbegins/index.... | 272 | tt0372784 | en | ... | Evil fears the knight. | Batman Begins | False | 7.5 | 7511.0 | [christianbale, michaelcaine, liamneeson] | [{'credit_id': '52fe4230c3a36847f800ac6d', 'de... | [himalaya, martialarts, dccomics] | christophernolan | himalaya martialarts dccomics christianbale mi... |
| 12589 | 12589 | 12589 | False | {'id': 263, 'name': 'The Dark Knight Collectio... | 185000000 | [drama, action, crime] | http://thedarkknight.warnerbros.com/dvdsite/ | 155 | tt0468569 | en | ... | Why So Serious? | The Dark Knight | False | 8.3 | 12269.0 | [christianbale, michaelcaine, heathledger] | [{'credit_id': '55a0eb4a925141296b0010f8', 'de... | [dccomics, crimefighter, secretidentity] | christophernolan | dccomics crimefighter secretidentity christian... |
| 24090 | 24090 | 24090 | False |  | 0 | [action, crime, drama] |  | 47517 | tt0271136 | en | ... | He's running for his life and running out of t... | Quicksand | False | 5.9 | 10.0 | [michaelkeaton, michaelcaine, judithgodrèche] | [{'credit_id': '52fe4737c3a36847f81297df', 'de... | [massage, limousine, executioner] | johnmackenzie | massage limousine executioner michaelkeaton mi... |
| 28914 | 28914 | 28914 | False |  | 0 | [action, crime, drama] |  | 72003 | tt2258647 | en | ... |  | The Dark Knight | False | 6.3 | 2.0 | [kylewalsh, aaronfarb, debralopez] | [{'credit_id': '52fe4852c3a368484e0f2eed', 'de... | [] | drewmaxwell | kylewalsh aaronfarb debralopez d r e w m a x ... |
| 35802 | 35802 | 35802 | False |  | 0 | [crime, drama] |  | 191731 | tt2912144 | fa | ... |  | Manuscripts Don't Burn | False | 6.1 | 8.0 | [] | [{'credit_id': '52fe4c909251416c910f8ea5', 'de... | [] | mohammadrasoulof | m o h a m m a d r a s o u l o f crime drama |
| 35803 | 35803 | 35803 | False |  | 0 | [crime, drama] |  | 191731 | tt2912144 | fa | ... |  | Manuscripts Don't Burn | False | 6.1 | 8.0 | [] | [{'credit_id': '52fe4c909251416c910f8ea5', 'de... | [] | mohammadrasoulof | m o h a m m a d r a s o u l o f crime drama |
