# Content-based Recommendation

## Idea 1: Overview/Description

#### Limitations:

* If a user watches a movie, they might want to watch another movie with the same lead actor or director rather than one with a similar description. This method does not handle that case.

## Import Libraries

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
data_path = '/Users/jeremy/data/movie_datasets/'

In [3]:
metadata = pd.read_csv(data_path + 'movies_metadata.csv', low_memory=False)
metadata.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0


# Data Preprocessing

In [36]:
metadata['overview'] = metadata['overview'].fillna('')
# Take an explicit copy of the filtered frame so the assignment below
# does not trigger pandas' SettingWithCopyWarning
metadata = metadata[metadata.adult.isin(['True','False'])].copy()
metadata['id'] = metadata['id'].astype('int')


In [5]:
metadata['overview'].head(2)

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
Name: overview, dtype: object

# Feature Engineering - text data (NLP)

This is an NLP problem. We need to extract features from the `overview` column in a numeric format, not as raw strings, so that we can compute the cosine similarity between movies.

To do this, we compute a word vector for each overview (document).

`Term Frequency-Inverse Document Frequency` (TF-IDF) vectors need to be computed for each document. This creates a matrix where each column represents a word in the overview vocabulary (all the words that appear in at least one document) and each row represents a movie.

In essence, the TF-IDF score is the frequency of a word in a document, down-weighted by the number of documents in which that word occurs. This reduces the importance of words that occur frequently across plot overviews and, therefore, their influence on the final similarity score.
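
As a small illustration of the matrix layout and the down-weighting described above, here is a sketch on a toy corpus. The three overviews and the `toy_*` names are made up for this example and are not part of the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy overviews, not taken from movies_metadata.csv
toy_overviews = [
    "a cowboy toy and a space ranger toy",
    "a space crew fights an alien",
    "a cowboy rides into town",
]

toy_tfidf = TfidfVectorizer(stop_words='english')
toy_matrix = toy_tfidf.fit_transform(toy_overviews)

# One row per document (movie), one column per vocabulary word
print(toy_matrix.shape)

# A word that appears in more documents receives a lower IDF weight:
# 'space' occurs in two overviews, 'alien' in only one
vocab = toy_tfidf.vocabulary_
idf = toy_tfidf.idf_
print(idf[vocab['space']], idf[vocab['alien']])
```

The same two properties (rows are movies, frequent words are down-weighted) carry over to the full `overview` column.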


The method we will follow is:

* import `TfidfVectorizer` from sklearn
* remove stop words such as 'the' and 'an' (these do not provide any information)
* replace NaN with ''
* compute the TF-IDF matrix

In [8]:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(metadata['overview'])
tfidf_matrix.shape

(45466, 75827)

# Cosine Similarities

In [11]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim.shape
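
`linear_kernel` computes plain dot products. That is enough here because `TfidfVectorizer` L2-normalises each row by default (`norm='l2'`), so the dot product of two rows equals their cosine similarity. A quick check on a toy corpus (the documents below are made up for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy overviews, not taken from the dataset
docs = [
    "a cowboy toy and a space ranger toy",
    "a space crew fights an alien",
    "a cowboy rides into town",
]
X = TfidfVectorizer(stop_words='english').fit_transform(docs)

# With the default norm='l2', every non-empty row has unit length
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
print(np.allclose(row_norms, 1.0))

# So dot products and cosine similarities coincide
print(np.allclose(linear_kernel(X, X), cosine_similarity(X, X)))
```

If the vectorizer were created with `norm=None`, this shortcut would no longer hold and `cosine_similarity` would be the safer choice.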

## TODO:

* write a function that takes a movie title and returns the top n most similar movies


1. get the index of the specified movie from its title
2. get the list of cosine similarity scores for that movie
3. convert it into a list of tuples, where element 0 is the movie's position and element 1 is its score
4. sort the list by score in descending order and take the top n elements (ignoring the first element, which is the movie itself)
5. return the titles of these top n movies

In [17]:
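# Keep the first row for each title; drop_duplicates() would compare the values (row labels), which are already unique
indices_ = pd.Series(metadata.index, index=metadata['title'])
indices_ = indices_[~indices_.index.duplicated(keep='first')]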
indices_[:2]

title
Toy Story    0
Jumanji      1
dtype: int64

In [18]:
indices_['Toy Story']

0

In [20]:
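def get_cb_recommendations(title: str, metadata: pd.DataFrame = metadata, top_n: int = 10, cosine_sim = cosine_sim):
    # Map each title to its row position (rows of metadata are assumed to line up with rows of cosine_sim)
    indices = pd.Series(np.arange(len(metadata)), index=metadata['title'])
    # Keep only the first row for a duplicated title so that idx is a single integer
    indices = indices[~indices.index.duplicated(keep='first')]
    idx = indices[title]

    # Pair every movie position with its similarity to the query, highest first
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry (the query movie itself) and keep the next top_n
    sim_scores = sim_scores[1:top_n + 1]
    movie_indices = [i[0] for i in sim_scores]

    return metadata['title'].iloc[movie_indices]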

In [21]:
get_cb_recommendations('The Dark Knight Rises')

12481                                      The Dark Knight
150                                         Batman Forever
1328                                        Batman Returns
15511                           Batman: Under the Red Hood
585                                                 Batman
21194    Batman Unmasked: The Psychology of the Dark Kn...
9230                    Batman Beyond: Return of the Joker
18035                                     Batman: Year One
19792              Batman: The Dark Knight Returns, Part 1
3095                          Batman: Mask of the Phantasm
Name: title, dtype: object

In [22]:
get_cb_recommendations('The Godfather')

1178               The Godfather: Part II
44030    The Godfather Trilogy: 1972-1990
1914              The Godfather: Part III
23126                          Blood Ties
11297                    Household Saints
34717                   Start Liquidation
10821                            Election
38030            A Mother Should Be Loved
17729                   Short Sharp Shock
26293                  Beck 28 - Familjen
Name: title, dtype: object
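
A note on memory: a dense similarity matrix for 45466 movies holds about 2.07 billion entries, roughly 16.5 GB as float64. One possible alternative, sketched below, is to score a single query row against the TF-IDF matrix on demand. The helper name `recommend_on_demand` and the toy data are hypothetical and only illustrate the idea; this is not the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def recommend_on_demand(title, titles, tfidf_matrix, top_n=10):
    """Hypothetical helper: compare one movie against all others without an n x n matrix."""
    titles = pd.Series(titles).reset_index(drop=True)
    matches = np.flatnonzero(titles.to_numpy() == title)
    if len(matches) == 0:
        raise KeyError(title)
    pos = matches[0]  # first row with this title

    # A single 1 x n_movies row of similarities
    scores = linear_kernel(tfidf_matrix[pos], tfidf_matrix).ravel()
    scores[pos] = -np.inf  # exclude the query movie itself

    top = np.argsort(-scores, kind='stable')[:top_n]
    return titles.iloc[top]

# Toy data, made up for illustration
toy_titles = ['Space Cowboys', 'Alien Crew', 'Cowboy Town']
toy_overviews = [
    "a cowboy toy and a space ranger toy",
    "a space crew fights an alien",
    "a cowboy rides into town",
]
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(toy_overviews)
result = recommend_on_demand('Space Cowboys', toy_titles, toy_matrix, top_n=2)
print(result)
```

With the real data, the call would presumably look like `recommend_on_demand('The Godfather', metadata['title'], tfidf_matrix)`, assuming `tfidf_matrix` was fitted on the same rows of `metadata` in the same order.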

## Import Libraries

In [37]:
from ast import literal_eval

## Idea 2: Cast, Crew, Genres, Keywords based recommender

In [23]:
credits = pd.read_csv(data_path + 'credits.csv')
keywords = pd.read_csv(data_path + 'keywords.csv')

In [35]:
keywords.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        46419 non-null  int64 
 1   keywords  46419 non-null  object
dtypes: int64(1), object(1)
memory usage: 725.4+ KB
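
The `keywords` column has dtype `object`. Assuming it stores stringified lists of dicts in the same style as the `genres` column shown earlier (an assumption, since the raw values are not printed here), `literal_eval` can turn each string back into Python objects. A minimal sketch on made-up sample rows, with a hypothetical helper `extract_names`:

```python
from ast import literal_eval
import pandas as pd

# Hypothetical sample rows; the ids and keyword names are invented for illustration
sample_keywords = pd.DataFrame({
    'id': [1, 2],
    'keywords': [
        "[{'id': 10, 'name': 'toy'}, {'id': 11, 'name': 'friendship'}]",
        "[]",
    ],
})

def extract_names(raw):
    """Parse a stringified list of dicts and return the 'name' values."""
    parsed = literal_eval(raw) if isinstance(raw, str) else raw
    if not isinstance(parsed, list):
        return []
    return [item['name'] for item in parsed
            if isinstance(item, dict) and 'name' in item]

sample_keywords['keyword_names'] = sample_keywords['keywords'].apply(extract_names)
print(sample_keywords[['id', 'keyword_names']])
```

`literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for parsing strings read from a CSV file.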
