# Content-based Recommendation

## Idea 1: Overview/Description

#### Limitations:

* If a user watches a movie, they may want another movie with the same lead actor or director rather than one with a similar description. This method does not handle that case.

## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
data_path = '/Users/jeremy/data/movie_datasets/'

In [3]:
metadata = pd.read_csv(data_path + 'movies_metadata.csv', low_memory=False)
metadata.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0


## Data Preprocessing / Feature engineering

In [4]:
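# Missing overviews become empty strings so the vectorizer can process them
metadata['overview'] = metadata['overview'].fillna('')
# Keep only rows whose `adult` value is 'True' or 'False'; .copy() avoids
# modifying a filtered view in the next line
metadata = metadata[metadata.adult.isin(['True','False'])].copy()
metadata['id'] = metadata['id'].astype('int')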

In [5]:
metadata['overview'].head(2)

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
Name: overview, dtype: object

### Text data (NLP)

This is an NLP problem. To compute cosine similarity between movies, the raw `overview` strings must first be converted into numeric feature vectors.

To do this, we compute a word vector for each overview (each overview is one document).

Specifically, we compute a `Term Frequency-Inverse Document Frequency` (TF-IDF) vector for each document. This produces a matrix in which each row represents a movie and each column represents a word in the overview vocabulary (all the words that appear in at least one document).

In its essence, the TF-IDF score is the frequency of a word occurring in a document, down-weighted by the number of documents in which it occurs. This is done to reduce the importance of words that frequently occur in plot overviews and, therefore, their significance in computing the final similarity score.
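
To make the down-weighting concrete, here is a minimal sketch on a made-up three-document corpus (the sentences are illustrative and not taken from the dataset). It assumes the default `TfidfVectorizer` settings and only compares the learned IDF weights of a word that appears everywhere with one that appears once.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "toys" appears in every document, "dinosaur" in only one
docs = [
    "toys meet a dinosaur",
    "toys go on a trip",
    "toys get lost",
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)

vocab = vectorizer.vocabulary_   # maps each word to its column index
idf = vectorizer.idf_            # one IDF weight per column

# Rows are documents, columns are vocabulary words
print(matrix.shape)
print("idf(toys)     =", round(idf[vocab["toys"]], 3))
print("idf(dinosaur) =", round(idf[vocab["dinosaur"]], 3))
```

The word that occurs in every document receives the smaller IDF weight, so it contributes less to any similarity score built from these vectors.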


We will proceed as follows:

* import `TfidfVectorizer` from sklearn
* remove stop words such as 'the' and 'an' (these carry no useful information)
* replace NaN overviews with '' (done above)
* compute the TF-IDF matrix

In [6]:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(metadata['overview'])
tfidf_matrix.shape

(45463, 75827)

## Cosine Similarities

In [None]:
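# TfidfVectorizer L2-normalises each row by default, so the dot product
# computed by linear_kernel equals the cosine similarity.
# Note: the result is a dense (n_movies x n_movies) array, which needs a lot of memory.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)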
cosine_sim.shape
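
As a quick sanity check on why `linear_kernel` can stand in for cosine similarity here, the sketch below compares the two on a tiny made-up corpus. It assumes the default `norm='l2'` setting of `TfidfVectorizer`; with a different normalisation the two results would generally differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical overviews, for illustration only
docs = [
    "a cowboy doll and a space ranger",
    "a space ranger saves the galaxy",
    "two siblings find a board game",
]
toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)

dot = linear_kernel(toy_matrix, toy_matrix)      # plain dot products
cos = cosine_similarity(toy_matrix, toy_matrix)  # normalises, then dot products

# Rows are already unit length, so both matrices should agree
same = np.allclose(dot, cos)
print(same)
```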

## TODO:

* write a function that takes a movie title and returns a list of top n most similar movies


1. get the index of the specified movie given its title
2. get the list of cosine similarity scores for that movie
3. convert it into a list of tuples, where element 0 is the movie's position and element 1 is its similarity score
4. get top n elements of this list (ignoring the first element which will be itself)
5. return titles of these top n movies

In [None]:
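# Map each title to its row position in metadata (and therefore in cosine_sim).
# The name must be `indices`, because get_cb_recommendations looks it up globally.
indices = pd.Series(np.arange(len(metadata)), index=metadata['title'])
# Keep the first position when a title appears more than once
indices = indices[~indices.index.duplicated(keep='first')]
indices[:2]

In [None]:
indices['Toy Story']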

In [None]:
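def get_cb_recommendations(title: str, metadata: pd.DataFrame = metadata, top_n: int = 10, cosine_sim = cosine_sim):
    """Return the titles of the top_n movies most similar to `title`."""
    # Row position of the requested movie (relies on the global `indices` mapping)
    idx = indices[title]

    # Pair every movie position with its similarity to the requested movie, highest first
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry, which is the movie itself
    sim_scores = sim_scores[1:top_n + 1]
    movie_indices = [i[0] for i in sim_scores]

    return metadata['title'].iloc[movie_indices]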

In [None]:
get_cb_recommendations('The Dark Knight Rises')

In [None]:
get_cb_recommendations('The Godfather')
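
The full similarity matrix is convenient but large. As an alternative sketch (not the approach used in this notebook), the similarity of a single movie against all others can be computed on demand from the sparse matrix. The helper name `top_n_similar` and the toy corpus are hypothetical and serve only as an illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel


def top_n_similar(matrix, idx: int, top_n: int = 10) -> np.ndarray:
    """Return the row positions of the top_n rows most similar to row idx."""
    # Similarity of one row against every row: shape (1, n_rows), flattened
    scores = linear_kernel(matrix[idx], matrix).ravel()
    # Exclude the query row itself
    scores[idx] = -np.inf
    top_n = min(top_n, len(scores) - 1)
    # Sort positions by descending score
    order = np.argsort(-scores, kind="stable")
    return order[:top_n]


# Hypothetical overviews: documents 0 and 1 share words, document 2 does not
docs = ["space ranger toy", "space ranger galaxy", "board game jungle"]
toy = TfidfVectorizer().fit_transform(docs)

print(top_n_similar(toy, 0, top_n=1))
```

This trades one large up-front computation for a small one per query, which may be preferable when memory is the constraint.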

## Import Libraries

In [None]:
from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## Idea 2: Cast, Crew, Genres, Keywords based recommender

In [None]:
credits = pd.read_csv(data_path + 'credits.csv')
keywords = pd.read_csv(data_path + 'keywords.csv')

In [None]:
keywords.info()

In [None]:
metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')

## Data Preprocessing / Feature engineering

* parse the stringified features into their corresponding Python objects
* define a function that extracts the director from the crew feature
* define a function that creates a clean list of names from the list-of-dictionaries features
* convert the new clean string features to lower case and remove the spaces between words. Removing spaces is an important preprocessing step: it stops the vectorizer from counting the Johnny of "Johnny Depp" and "Johnny Galecki" as the same token
* concatenate the new clean features into a single string feature, the "metadata soup"


In [None]:
features = ['cast', 'crew', 'keywords', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)
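
To illustrate what the parsing step does to a single cell, here is a small sketch. The string below is a made-up value in the same shape as the `genres` column (the second entry is hypothetical), not a row copied from the dataset.

```python
from ast import literal_eval

# Hypothetical cell value: a Python literal stored as text
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 1, 'name': 'GenreA'}]"

# literal_eval safely turns the text into real Python objects
parsed = literal_eval(raw)
names = [entry["name"] for entry in parsed]

print(type(raw).__name__, "->", type(parsed).__name__)
print(names)
```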

In [None]:
def get_director(x: list) -> str:
    for i in x:
        if i.get('job') == 'Director':
            return i.get('name')
    return np.nan


def get_clean_list(x: list, top_n: int = 3) -> list:
    
    if isinstance(x, list):
        names = [i['name'] for i in x]
        
        # Use logical `and`: bitwise `&` between an int and a bool is wrong for even top_n
        if top_n and len(names) > top_n:
            names = names[:top_n]
        
        return names
    return []

In [None]:
# Extract director feature and process cast, genres and keywords features

metadata['director'] = metadata['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(get_clean_list)

In [None]:
metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(3)

In [None]:
def clean_string_features(x: list):
    
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        
        else:
            return ''

In [None]:
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(clean_string_features)

In [None]:
metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(3)
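
The effect of removing spaces can be checked directly on the two names used as an example in the list of preprocessing steps. This is a minimal sketch with default `CountVectorizer` settings; it only inspects the learned vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

with_spaces = ["Johnny Depp", "Johnny Galecki"]
without_spaces = ["johnnydepp", "johnnygalecki"]

# With spaces, the shared first name becomes a token common to both people
tokens_with = sorted(CountVectorizer().fit(with_spaces).vocabulary_)
# Without spaces, each person is a single distinct token
tokens_without = sorted(CountVectorizer().fit(without_spaces).vocabulary_)

print(tokens_with)     # ['depp', 'galecki', 'johnny']
print(tokens_without)  # ['johnnydepp', 'johnnygalecki']
```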

In [None]:
def create_metadata_soup(x):
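    # `director` is already a single string, so it is appended directly;
    # ' '.join() on a string would put a space between every character
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])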


In [None]:
metadata['soup'] = metadata.apply(create_metadata_soup, axis=1)

In [None]:
metadata[['soup','title', 'cast', 'director', 'keywords', 'genres']].head(3)

### Note

The next step is very close to what we did before for the overview text. The main difference is that here we use `CountVectorizer` instead of `TfidfVectorizer`. The reason is that we do not want to down-weight an actor or director just because they have appeared in or directed relatively many movies; doing so makes little intuitive sense in this context.

The main difference between `CountVectorizer` and `TF-IDF` is the inverse document frequency (IDF) component which is only present in `TF-IDF`.
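
The difference can be seen on a toy "soup". In the sketch below, the token names `prolificactor` and `rareactor` are invented for illustration: the first appears in every movie, the second in only one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical soups, one per movie
soups = [
    "prolificactor rareactor drama",
    "prolificactor comedy",
    "prolificactor thriller",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(soups).toarray()

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(soups).toarray()

c_col = count_vec.vocabulary_
t_col = tfidf_vec.vocabulary_

# Raw counts treat both actors equally in the first movie
print(counts[0, c_col["prolificactor"]], counts[0, c_col["rareactor"]])
# TF-IDF gives the actor who appears everywhere a smaller weight
print(weights[0, t_col["prolificactor"]], weights[0, t_col["rareactor"]])
```

With raw counts, a shared prolific actor contributes as much to the similarity as any other shared token, which is the behaviour wanted here.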

In [None]:
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata['soup'])
count_matrix.shape

In [None]:
cosine_count_sim = cosine_similarity(count_matrix, count_matrix)

In [None]:
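# Reset index of your main DataFrame and construct reverse mapping as before
metadata = metadata.reset_index()
indices = pd.Series(metadata.index, index=metadata['title'])
# Keep the first position when a title appears more than once
indices = indices[~indices.index.duplicated(keep='first')]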

In [None]:
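# Pass the merged metadata explicitly: the function's default argument was
# bound to the pre-merge DataFrame when the function was defined
get_cb_recommendations('The Dark Knight Rises', metadata=metadata, cosine_sim=cosine_count_sim)

In [None]:
get_cb_recommendations('The Godfather', metadata=metadata, cosine_sim=cosine_count_sim)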