# Recommendation Engine

I am building a simple recommendation engine using Kaggle's [movie dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset). The recommender suggests movies based only on the similarity of their **cast**, **crew**, **keywords** and **genres**.

## Objectives
1. Loading and cleaning the dataset
2. Extracting English-language movies only
3. Creating the necessary functions to extract similar characteristics
4. Using scikit-learn's **CountVectorizer** to turn each movie's features into token counts
5. Using scikit-learn's **cosine similarity** function to compute similarities between movies
6. Getting Recommendations

In [1]:
import matplotlib as mpl
mpl.use('TkAgg')
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
np.random.seed(42)

#### Loading and cleaning dataset

In [2]:
metadata = pd.read_csv('movies_metadata.csv', low_memory=False)
metadata.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0


In [3]:
metadata.shape

(45466, 24)

In [4]:
metadata['original_language'].value_counts().head()

en    32269
fr     2438
it     1529
ja     1350
de     1080
Name: original_language, dtype: int64

In [5]:
# Parse release dates; unparseable values become NaT
metadata['release_date'] = pd.to_datetime(metadata['release_date'], errors='coerce')
metadata['year'] = metadata['release_date'].dt.year
metadata.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count,year
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,1995.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,1995.0


In [6]:
metadata['year'].isnull().sum()

90

In [7]:
# Forward-fill missing years from the previous row, then cast to int
metadata['year'] = metadata['year'].ffill()
metadata['year'] = metadata['year'].astype(int)

In [8]:
metadata['year'].isnull().sum()

0
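
The cells above coerce unparseable dates to `NaT` and then forward-fill the missing years. Here is a minimal sketch of each step on a made-up Series (the values are illustrative, not taken from the dataset). Forward filling copies the year of the preceding row, which is an arbitrary choice unless the rows are ordered by date. Dropping those rows would be a reasonable alternative.

```python
import pandas as pd

# Hypothetical release dates, including one malformed and one missing value
dates = pd.Series(["1995-10-30", "not a date", None, "2010-05-01"])

# errors="coerce" turns anything unparseable into NaT instead of raising
parsed = pd.to_datetime(dates, errors="coerce")
years = parsed.dt.year

# Count the rows without a usable year
n_missing = int(years.isnull().sum())

# Forward-fill borrows the year from the previous row, then cast to int
filled = years.ffill().astype(int)

print(n_missing)
print(filled.tolist())
```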

#### Extracting English-language movies
We keep only English-language movies released after 2004.

In [9]:
# Filter in one step and copy, so later column assignments do not act on a view
metadata = metadata.query('original_language == "en" and year > 2004').copy()

In [10]:
metadata.shape

(13605, 25)

In [11]:
# Load keywords and credits
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

# Rows with malformed IDs are not dropped explicitly; the astype('int') calls below raise a ValueError if any remain.

# Convert IDs to int. Required for merging
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe
metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')
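
An inner merge on `id` produces one output row per matching pair, so an `id` that appears more than once in `credits` or `keywords` yields repeated movie rows. This is one possible explanation for repeated titles in the merged table, though it has not been verified against the dataset. The sketch below uses small hypothetical frames (`movies` and `credits_toy` are made-up names) and shows one way to avoid it with `drop_duplicates`.

```python
import pandas as pd

# Hypothetical data: id 2 appears twice in the credits table
movies = pd.DataFrame({"id": [1, 2], "title": ["A", "B"]})
credits_toy = pd.DataFrame({"id": [1, 2, 2], "cast": ["[]", "[]", "[]"]})

# A plain inner merge repeats movie "B" once per matching credits row
merged = movies.merge(credits_toy, on="id")

# Dropping duplicate ids first keeps one row per movie
deduped = movies.merge(credits_toy.drop_duplicates(subset="id"), on="id")

print(len(merged), len(deduped))
```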

In [12]:
metadata.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,status,tagline,title,video,vote_average,vote_count,year,cast,crew,keywords
0,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,79782,tt1684935,en,Wenecja,An atmospheric coming-of-age story featuring a...,...,Released,,Venice,False,7.5,4.0,2010,"[{'cast_id': 1005, 'character': 'Marek', 'cred...","[{'credit_id': '52fe49e5c3a368484e145fb7', 'de...",[]
1,False,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 27, 'nam...",,141210,tt2250194,en,The Sleepover,"The town of Derry has a secret, but no one tol...",...,Released,,The Sleepover,False,8.0,1.0,2013,"[{'cast_id': 2, 'character': 'Rachel', 'credit...","[{'credit_id': '52fe4aaf9251416c750ea6f1', 'de...",[]


In [13]:
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)
    
metadata.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,status,tagline,title,video,vote_average,vote_count,year,cast,crew,keywords
0,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10749, 'n...",,79782,tt1684935,en,Wenecja,An atmospheric coming-of-age story featuring a...,...,Released,,Venice,False,7.5,4.0,2010,"[{'cast_id': 1005, 'character': 'Marek', 'cred...","[{'credit_id': '52fe49e5c3a368484e145fb7', 'de...",[]
1,False,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 27, 'nam...",,141210,tt2250194,en,The Sleepover,"The town of Derry has a secret, but no one tol...",...,Released,,The Sleepover,False,8.0,1.0,2013,"[{'cast_id': 2, 'character': 'Rachel', 'credit...","[{'credit_id': '52fe4aaf9251416c750ea6f1', 'de...",[]
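
The table looks unchanged after this step because only the type of each cell has changed: the columns held strings that look like Python lists, and `literal_eval` turns them into real lists of dicts. Here is a minimal sketch on a hand-written string. The `safe_parse` helper is a hypothetical addition for malformed cells, not something the notebook uses.

```python
from ast import literal_eval

# A stringified list of dicts, in the same shape as the 'genres' column
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

parsed = literal_eval(raw)
names = [genre["name"] for genre in parsed]


def safe_parse(value):
    """Hypothetical helper: return an empty list if the cell is not a valid literal."""
    try:
        return literal_eval(value)
    except (ValueError, SyntaxError):
        return []


print(type(parsed).__name__, names)
print(safe_parse("[1, 2"))
```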


#### Creating necessary functions to extract similar keywords
We extract the director from the crew, and up to three names each from cast, keywords and genres.

In [14]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [15]:
def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        #Check if more than 3 elements exist. If yes, return only first three. If no, return entire list.
        if len(names) > 3:
            names = names[:3]
        return names

    #Return empty list in case of missing/malformed data
    return []
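
To see what the two helpers return, here is a small sketch that restates them compactly and runs them on hypothetical data (the names `Dana Director`, `Actor One` and so on are made up).

```python
import numpy as np


def get_director(crew):
    # Return the first crew member whose job is 'Director', else NaN
    for member in crew:
        if member['job'] == 'Director':
            return member['name']
    return np.nan


def get_list(items):
    # Return at most the first three names; empty list for missing data
    if isinstance(items, list):
        return [item['name'] for item in items][:3]
    return []


crew = [{'job': 'Producer', 'name': 'Pat Producer'},
        {'job': 'Director', 'name': 'Dana Director'}]
cast = [{'name': n} for n in ['Actor One', 'Actor Two', 'Actor Three', 'Actor Four']]

director = get_director(crew)
top_cast = get_list(cast)
missing = get_list(np.nan)

print(director, top_cast, missing)
```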

In [16]:
# Define new director, cast, genres and keywords features that are in a suitable form.
metadata['director'] = metadata['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(get_list)

In [17]:
metadata[['title', 'cast', 'director', 'keywords', 'genres']].head(5)

Unnamed: 0,title,cast,director,keywords,genres
0,Venice,"[Marcin Walewski, Magdalena Cielecka, Mariusz ...",Jan Jakub Kolski,[],"[Drama, Romance]"
1,The Sleepover,"[Josh Feldman, Gus Kamp, Carolyn Jania]",Chris Cullari,[],"[Comedy, Horror]"
2,The Farmer's Wife,"[James Cartwright, Geraldine James, Alex Kelly]",Francis Lee,[short],[Drama]
3,A Place at the Table,"[Jeff Bridges, Tom Colicchio, Mariana Chilton]",Kristi Jacobson,[woman director],[Documentary]
4,A Place at the Table,"[Jeff Bridges, Tom Colicchio, Mariana Chilton]",Kristi Jacobson,[woman director],[Documentary]


In [18]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        #Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [19]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(clean_data)
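
Stripping spaces matters because the vectorizer later splits text into words. Without it, two different people who share a first name would share a token. The sketch below illustrates this with a hypothetical `squash` helper that mirrors the string branch of `clean_data`.

```python
def squash(name):
    # Lower-case and remove spaces so a full name becomes a single token
    return name.lower().replace(" ", "")


# With spaces kept, the two names share the token 'christopher'
with_spaces = set("Christopher Nolan".lower().split()) & set("Christopher Lee".lower().split())

# After squashing, each name is one distinct token and nothing overlaps
squashed = {squash("Christopher Nolan")} & {squash("Christopher Lee")}

print(with_spaces, squashed)
```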

Next, we combine each movie's keywords, cast, director and genres into a single space-separated string (the "soup") stored in one column.

In [20]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

# Create a new soup feature
metadata['soup'] = metadata.apply(create_soup, axis=1)
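
As a quick check of what a soup looks like, the sketch below applies the same joining logic to two hypothetical rows written as plain dicts (all values are made up). Empty fields leave extra spaces in the string, which is harmless because the text is later split into words.

```python
def create_soup(row):
    # Same joining logic as above, applied to a dict instead of a DataFrame row
    return (' '.join(row['keywords']) + ' ' + ' '.join(row['cast']) + ' '
            + row['director'] + ' ' + ' '.join(row['genres']))


full_row = {'keywords': ['hitman', 'revenge'], 'cast': ['actorone'],
            'director': 'danadirector', 'genres': ['action', 'thriller']}
sparse_row = {'keywords': [], 'cast': ['actortwo'], 'director': '', 'genres': ['drama']}

full_soup = create_soup(full_row)
sparse_soup = create_soup(sparse_row)

print(repr(full_soup))
print(repr(sparse_soup))
```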

#### Scikit-learn
We use `CountVectorizer` to turn each soup string into a vector of token counts. To learn more about it, read the [docs](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html).

In [21]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata['soup'])
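
To make the count matrix concrete, here is a small sketch on three hypothetical soups. Each column corresponds to one token in the learned vocabulary and each row to one movie.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up soups; the first two share 'janedoe' and 'action'
soups = ["revenge janedoe action", "janedoe action thriller", "animation family"]

vec = CountVectorizer(stop_words="english")
matrix = vec.fit_transform(soups)

# Order the vocabulary by column index so it lines up with the matrix columns
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
dense = matrix.toarray().tolist()

print(vocab)
for row in dense:
    print(row)
```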

We then use scikit-learn's `cosine_similarity` function to score how similar each pair of movies is, based on their token counts. Read the [docs](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html).

In [22]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)
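
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so for count vectors it ranges from 0 (no shared tokens) to 1 (same direction). The sketch below computes it by hand with NumPy on three hypothetical count vectors. It illustrates the formula and does not replace the scikit-learn call.

```python
import numpy as np


def cosine(u, v):
    # dot(u, v) / (|u| * |v|)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Made-up token-count vectors: a and b share two tokens, a and c share none
a = np.array([1, 0, 0, 1, 1, 0])
b = np.array([1, 0, 0, 1, 0, 1])
c = np.array([0, 1, 1, 0, 0, 0])

sim_ab = cosine(a, b)  # 2 shared tokens / (sqrt(3) * sqrt(3))
sim_ac = cosine(a, c)  # no shared tokens

print(sim_ab, sim_ac)
```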

In [23]:
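# Reset the index so row positions match the similarity matrix, then map each title to its row
metadata = metadata.reset_index()
indices = pd.Series(metadata.index, index=metadata['title'])
# Keep the first row for each title so that a lookup returns a single position
indices = indices[~indices.index.duplicated(keep='first')]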
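
Because the merged table can contain the same title more than once (see the repeated rows in the output further up), it helps to know how a title lookup behaves with duplicate labels. This sketch on a hypothetical three-row Series shows that a repeated label returns a Series rather than a single position, and shows one way to keep only the first occurrence.

```python
import pandas as pd

# Hypothetical titles; "B" appears twice
titles = pd.Series(["A", "B", "B"])
indices = pd.Series(titles.index, index=titles)

# A unique label gives one position; a repeated label gives a Series
single = indices["A"]
repeated = indices["B"]

# Keep the first occurrence of each title
unique = indices[~indices.index.duplicated(keep="first")]

print(int(single), len(repeated), int(unique["B"]))
```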

In [55]:
# Function that takes in movie title as input and outputs most similar movies
def recommend_me(title, cosine_sim2=cosine_sim2):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim2[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 10 highest-scoring movies, skipping the first entry (expected to be the movie itself)
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    
    # Return the top 10 most similar movies
    return metadata['title'].iloc[movie_indices]
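
Slicing from position 1 assumes the queried movie sorts first. If another row has an identical soup (similarity 1.0) and a lower index, the stable sort places that row first and the queried movie can end up in its own recommendations. A minimal alternative sketch, using a hypothetical helper `top_k_similar` on a made-up similarity row, excludes the query by index instead.

```python
import numpy as np


def top_k_similar(sim_row, idx, k=10):
    """Hypothetical helper: positions of the k highest scores, excluding idx itself."""
    # Sort descending; a stable sort keeps lower positions first among ties
    order = np.argsort(-sim_row, kind="stable")
    # Remove the queried row by position rather than assuming it sorts first
    order = order[order != idx]
    return order[:k].tolist()


# Made-up similarity row: rows 1 and 3 are exact duplicates of each other
sim_row = np.array([0.2, 1.0, 0.9, 1.0, 0.0])

top_for_1 = top_k_similar(sim_row, idx=1, k=3)
top_for_3 = top_k_similar(sim_row, idx=3, k=2)

print(top_for_1, top_for_3)
```

The returned positions could then be passed to `metadata['title'].iloc[...]` in the same way as `movie_indices` above.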

#### Getting Recommendations

In [56]:
recommend_me('John Wick', cosine_sim2)

8107             Evil Behind You
5633                        Rage
12757       John Wick: Chapter 2
2266              Angel of Death
8843              Kung Fu Killer
4653     Interview with a Hitman
6657                     Taken 3
13413            Lady Bloodfight
749                 The Contract
3384                     Bunraku
Name: title, dtype: object

In [57]:
recommend_me('The Dark Knight Rises', cosine_sim2)

1146             The Dark Knight
96                 Batman Begins
562                 The Prestige
12995      Lure: Teen Fight Club
9557                        Rege
13544       Payback: Straight Up
1802            Streets of Blood
1847                    The Line
3741     House of the Rising Sun
4504                      Loaded
Name: title, dtype: object

In [63]:
recommend_me('Watchmen', cosine_sim2)

10863                                       The Flying Man
12677    Batman Beyond Darwyn Cooke's Batman 75th Anniv...
1146                                       The Dark Knight
1418                                             Eagle Eye
1901                           Green Lantern: First Flight
2716                                          TRON: Legacy
2855                                      I Am Number Four
3736                                    Superman: Doomsday
4669                                          Man of Steel
6918                                        Justice League
Name: title, dtype: object

In [32]:
recommend_me('The Lego Movie', cosine_sim2)

311                                Curious George
2960                            Idiots and Angels
9580                     Jinxy Jenkins, Lucky Lou
13005                                   Happy End
5647     Alpha and Omega 2: A Howl-iday Adventure
5988                     Care Bears To the Rescue
9255                                   Sly Cooper
10101                Megamind: The Button Of Doom
10102                Megamind: The Button Of Doom
13401        Scrat's Continental Crack-Up: Part 2
Name: title, dtype: object

In [48]:
recommend_me('Thor', cosine_sim2)

15                   A Sound of Thunder
20                 The Girl from Monday
22                       In Enemy Hands
23    Charlie and the Chocolate Factory
25                        Underclassman
27                              Elektra
31                    Are We There Yet?
32                    Alone in the Dark
34                   Aliens of the Deep
41                          Constantine
Name: title, dtype: object

In [34]:
recommend_me('Interstellar', cosine_sim2)

9171                       Infinite
7540               Midnight Special
7956                    The Martian
3036                    Interkosmos
10301                The Perfect 46
10302                The Perfect 46
9673     Doctor Who: Last Christmas
456                End of the Spear
1685                           Moon
2984               The Tree of Life
Name: title, dtype: object

In [35]:
recommend_me('The Prestige', cosine_sim2)

6433     The Ghost and the Whale
10747       404: Error Not Found
3082                 Dreamkiller
5273                       Apnea
11240               Dustbin Baby
96                 Batman Begins
380                 Irresistible
1146             The Dark Knight
3407       The Dark Knight Rises
8393           A Sister's Secret
Name: title, dtype: object

In [65]:
recommend_me('The Martian', cosine_sim2)

9171                       Infinite
3577                    John Carter
5474                   Interstellar
3036                    Interkosmos
10301                The Perfect 46
10302                The Perfect 46
9673     Doctor Who: Last Christmas
12627               A Space Program
456                End of the Spear
3773                    On the Road
Name: title, dtype: object