<a href="http://colab.research.google.com/github/dipanjanS/nlp_workshop_odsc19/blob/master/Module05%20-%20NLP%20Applications/Project01%20-%20Text%20Content%20Recommenders.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Recommendations with Document Similarity

Recommender systems are among the most popular and widely adopted applications of machine learning. They are typically used to recommend entities to users, and these entities can be anything: products, movies, services and so on.

Popular examples of recommendations include:
- Amazon suggesting products on its website
- Amazon Prime, Netflix, Hotstar recommending movies/shows
- YouTube recommending videos to watch

Typically recommender systems can be implemented in three ways:

- Simple Rule-based Recommenders: Typically based on specific global metrics and thresholds such as movie popularity, global ratings etc.
- Content-based Recommenders: These recommend entities similar to a specific entity of interest. Content metadata such as movie descriptions, genre, cast, director and so on can be used here.
- Collaborative filtering Recommenders: Here we don't need metadata; instead we try to predict recommendations and ratings from the past ratings that different users gave to specific items.

We will be building a movie recommendation system here where, based on data/metadata pertaining to different movies, we try to recommend similar movies of interest!

![](https://i.imgur.com/c7Go7d3.png)

Since our focus is not really recommendation engines but NLP, we will leverage the text-based metadata for each movie to recommend similar movies based on specific movies of interest. This falls under content-based recommenders.

# Install Dependencies

In [29]:
!pip install textsearch
!pip install contractions
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Load and View Data

In [43]:
import pandas as pd

df = pd.read_csv('https://github.com/dipanjanS/nlp_workshop_dhs18/raw/master/Unit%2010%20-%20Project%208%20-%20Movie%20Recommendations%20with%20Document%20Similarity/tmdb_5000_movies.csv.gz', compression='gzip')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
budget                  4803 non-null int64
genres                  4803 non-null object
homepage                1712 non-null object
id                      4803 non-null int64
keywords                4803 non-null object
original_language       4803 non-null object
original_title          4803 non-null object
overview                4800 non-null object
popularity              4803 non-null float64
production_companies    4803 non-null object
production_countries    4803 non-null object
release_date            4802 non-null object
revenue                 4803 non-null int64
runtime                 4801 non-null float64
spoken_languages        4803 non-null object
status                  4803 non-null object
tagline                 3959 non-null object
title                   4803 non-null object
vote_average            4803 non-null float64
vote_count              4803 non-null 

In [44]:
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


In [45]:
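df = df[['title', 'tagline', 'overview', 'popularity']].copy()
# taglines are often missing; use an empty string so the concatenation below still works
df['tagline'] = df['tagline'].fillna('')
df['description'] = df['tagline'].map(str) + ' ' + df['overview']
# drop the few rows with a missing overview
df.dropna(inplace=True)
df = df.sort_values(by=['popularity'], ascending=False)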
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 546 to 4553
Data columns (total 5 columns):
title          4800 non-null object
tagline        4800 non-null object
overview       4800 non-null object
popularity     4800 non-null float64
description    4800 non-null object
dtypes: float64(1), object(4)
memory usage: 225.0+ KB


In [46]:
df.head()

Unnamed: 0,title,tagline,overview,popularity,description
546,Minions,"Before Gru, they had a history of bad bosses","Minions Stuart, Kevin and Bob are recruited by...",875.581305,"Before Gru, they had a history of bad bosses M..."
95,Interstellar,Mankind was born on Earth. It was never meant ...,Interstellar chronicles the adventures of a gr...,724.247784,Mankind was born on Earth. It was never meant ...
788,Deadpool,Witness the beginning of a happy ending,Deadpool tells the origin story of former Spec...,514.569956,Witness the beginning of a happy ending Deadpo...
94,Guardians of the Galaxy,All heroes start somewhere.,"Light years from Earth, 26 years after being a...",481.098624,All heroes start somewhere. Light years from E...
127,Mad Max: Fury Road,What a Lovely Day.,An apocalyptic story set in the furthest reach...,434.278564,What a Lovely Day. An apocalyptic story set in...


# Build a Movie Recommender System

Here you will build your own movie recommender system. We will use the following pipeline:
- Text pre-processing
- Feature Engineering
- Document Similarity Computation
- Find top similar movies
- Build a movie recommendation function


## Document Similarity

Recommendations are about understanding the underlying features which make us favour one choice over another. Similarity between items (in this case movies) is one way to understand why we choose one movie over another. There are different ways to calculate similarity between two items. One of the most widely used measures is __cosine similarity__, which we have already used in the previous unit.

### Cosine Similarity

Cosine Similarity is used to calculate a numeric score to denote the similarity between two text documents. Mathematically, it is defined as follows:

$$ \text{cosine}(x, y) = \frac{x \cdot y^\intercal}{\|x\| \, \|y\|} $$
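To make the formula concrete, here is a minimal sketch that computes the score by hand with NumPy. The helper name `cosine` and the toy vectors are our own illustration and are not used elsewhere in this notebook.

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity between two 1-D vectors, following the formula above."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

x = np.array([1.0, 2.0, 0.0])
y = np.array([2.0, 4.0, 0.0])   # same direction as x, different length
z = np.array([0.0, 0.0, 3.0])   # shares no non-zero dimension with x

sim_xy = cosine(x, y)   # parallel vectors score (almost exactly) 1
sim_xz = cosine(x, z)   # orthogonal vectors score 0
```

Because the score depends only on the angle between the vectors, a long and a short description that use the same words in the same proportions still come out as highly similar.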

In [55]:
import nltk
import re
import numpy as np
import contractions

stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
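    # remove special characters; flags must be passed by keyword because
    # the fourth positional argument of re.sub is `count`, not `flags`
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I|re.A)
    # lower case and strip surrounding whitespace
    doc = doc.lower()
    doc = doc.strip()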
    doc = contractions.fix(doc)
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    #filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(list(df['description']))
len(norm_corpus)

4800

## Extract TF-IDF Features

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
tfidf_matrix = tf.fit_transform(norm_corpus)
tfidf_matrix.shape

(4800, 20468)
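To see what `ngram_range=(1, 2)` and `min_df=2` do, here is a small sketch on a made-up three-document corpus (the corpus and the `toy_` names are assumptions for illustration only). Only unigrams and bigrams that occur in at least two documents survive as features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    'space crew travels wormhole',
    'space crew fights aliens',
    'pirates sail caribbean sea',
]

# Same settings as above: unigrams + bigrams, keep terms seen in >= 2 documents
toy_tf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
toy_matrix = toy_tf.fit_transform(toy_corpus)

# Only 'crew', 'space' and the bigram 'space crew' appear in two documents
toy_vocab = sorted(toy_tf.vocabulary_)
toy_dense = toy_matrix.toarray()   # shape: (3 documents, 3 features)
```

Note that the third toy document shares no surviving term with the others, so its row is all zeros. A document like that has a cosine similarity of 0 with every other document.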

## Compute Pairwise Document Similarity

In [57]:
from sklearn.metrics.pairwise import cosine_similarity

doc_sim = cosine_similarity(tfidf_matrix)
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,4760,4761,4762,4763,4764,4765,4766,4767,4768,4769,4770,4771,4772,4773,4774,4775,4776,4777,4778,4779,4780,4781,4782,4783,4784,4785,4786,4787,4788,4789,4790,4791,4792,4793,4794,4795,4796,4797,4798,4799
0,1.0,0.0,0.0,0.0,0.00607,0.008067,0.0,0.0,0.0,0.0,0.025531,0.008554,0.018111,0.0,0.0,0.0,0.0,0.007439,0.010454,0.0,0.0,0.00819,0.008365,0.010035,0.0,0.0,0.050976,0.006502,0.0,0.010728,0.0,0.006908,0.0,0.167573,0.0,0.0,0.0,0.0,0.009191,0.053475,...,0.0,0.009711,0.006508,0.0,0.0,0.0,0.0,0.028409,0.0,0.0,0.0,0.00887,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033246,0.0,0.0,0.0,0.0,0.0,0.034092,0.018754,0.0,0.037924,0.0,0.0,0.0,0.0,0.0,0.0,0.009646
1,0.0,1.0,0.0,0.017839,0.007967,0.0,0.0,0.012501,0.0,0.01484,0.0,0.0,0.0,0.0,0.012814,0.0,0.0,0.024144,0.0,0.0,0.0,0.0,0.0,0.0,0.008101,0.0,0.0,0.016898,0.0,0.017789,0.0,0.008885,0.009432,0.0,0.0,0.014947,0.0,0.0,0.0,0.022738,...,0.0,0.0,0.019783,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.011407,0.011409,0.0,0.011632,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.015329,0.0,0.008367,0.0,0.0,0.0,0.021596,0.0,0.0,0.0,0.0,0.017561,0.0,0.019152,0.0,0.0,0.0,0.0,0.007963
2,0.0,0.0,1.0,0.0,0.017176,0.0,0.0,0.0,0.0,0.024326,0.005471,0.018038,0.0,0.0,0.0,0.005099,0.004985,0.004705,0.0,0.004843,0.0,0.017026,0.0,0.003672,0.003984,0.030998,0.006959,0.0,0.0,0.006117,0.0,0.019548,0.020365,0.009213,0.028467,0.010515,0.004198,0.006311,0.015154,0.012698,...,0.020597,0.004723,0.0,0.016673,0.0,0.0,0.0,0.0,0.0,0.006625,0.0,0.0,0.006972,0.0,0.010574,0.0,0.008222,0.008604,0.012782,0.015353,0.006259,0.0,0.0,0.0,0.0,0.010555,0.0,0.0,0.0,0.0,0.0,0.006903,0.005023,0.0,0.012893,0.0,0.025975,0.0,0.027126,0.00934
3,0.0,0.017839,0.0,1.0,0.0,0.022414,0.0,0.0,0.0,0.037207,0.0,0.0,0.0,0.0,0.027958,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.012232,0.017893,0.043639,0.0,0.0,0.0,0.0,0.023127,0.0,0.0,0.0,0.0,0.0,0.009324,0.022995,0.0,0.0,0.044476,...,0.0,0.0,0.014029,0.030561,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.056761,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.018486,0.0,0.0,0.0,0.060846,0.025035,0.0,0.036237,0.030516,0.022605,0.0,0.0,0.0
4,0.00607,0.007967,0.017176,0.0,1.0,0.004672,0.0,0.064572,0.0,0.0,0.0,0.036611,0.023879,0.014421,0.012316,0.019834,0.00817,0.004308,0.025459,0.0,0.008907,0.023736,0.011341,0.003009,0.0,0.026928,0.0,0.010026,0.0,0.019557,0.0,0.028521,0.016018,0.011311,0.016912,0.022974,0.054142,0.013767,0.027397,0.005416,...,0.005088,0.015097,0.030759,0.013663,0.0,0.012624,0.0,0.0,0.0,0.0,0.027018,0.012025,0.005714,0.0,0.01723,0.0,0.022327,0.007051,0.038014,0.011783,0.018612,0.021468,0.0,0.003767,0.0,0.00865,0.009499,0.0,0.012749,0.0,0.022056,0.019659,0.03685,0.0,0.015824,0.0,0.076022,0.004515,0.043469,0.011464


## Get List of Movie Titles

In [58]:
movies_list = df['title'].values
movies_list, movies_list.shape

(array(['Minions', 'Interstellar', 'Deadpool', ..., 'Penitentiary',
        'Alien Zone', 'America Is Still the Place'], dtype=object), (4800,))

## Find Top Similar Movies for a Sample Movie

Let's take __Minions__, the most popular movie in the dataframe above, and try to find the most similar movies which can be recommended.

#### Find movie ID

In [59]:
movie_idx = np.where(movies_list == 'Minions')[0][0]
movie_idx

0

#### Get movie similarities

In [60]:
movie_similarities = doc_sim_df.iloc[movie_idx].values
movie_similarities

array([1.        , 0.        , 0.        , ..., 0.        , 0.        ,
       0.00964634])

#### Get top 5 similar movie IDs

In [61]:
similar_movie_idxs = np.argsort(-movie_similarities)[1:6]
similar_movie_idxs

array([ 33,  60, 737, 490, 298])

#### Get top 5 similar movies

In [62]:
similar_movies = movies_list[similar_movie_idxs]
similar_movies

array(['Despicable Me 2', 'Despicable Me',
       'Teenage Mutant Ninja Turtles: Out of the Shadows', 'Superman',
       'Rise of the Guardians'], dtype=object)
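The ranking step above relies on two small tricks: negating the scores so that `np.argsort` sorts in descending order, and slicing from position 1 to skip the movie itself. A toy sketch with made-up titles and scores:

```python
import numpy as np

titles = np.array(['A', 'B', 'C', 'D'])
# Similarity of movie 'A' (index 0) to every movie, itself included
sims = np.array([1.0, 0.2, 0.7, 0.4])

ranked = np.argsort(-sims)   # indices ordered from most to least similar
top_2 = ranked[1:3]          # skip position 0 (the movie itself), keep the next 2
top_titles = titles[top_2]
```

The slice assumes the query movie ranks first, which holds as long as its self-similarity of 1.0 is the strict maximum. Exact duplicate descriptions or an all-zero feature vector could break that assumption.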

### Build a movie recommender function to recommend top 5 similar movies for any movie 

The movie title, movie title list and document similarity matrix dataframe will be given as inputs to the function

In [0]:
def movie_recommender(movie_title, movies=movies_list, doc_sims=doc_sim_df):
    # find movie id
    movie_idx = np.where(movies == movie_title)[0][0]
    # get movie similarities
    movie_similarities = doc_sims.iloc[movie_idx].values
    # get top 5 similar movie IDs
    similar_movie_idxs = np.argsort(-movie_similarities)[1:6]
    # get top 5 movies
    similar_movies = movies[similar_movie_idxs]
    # return the top 5 movies
    return similar_movies
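As a possible refinement, the sketch below shows a more defensive variant of the function. The name `recommend_similar`, the `top_n` parameter and the toy data are our own assumptions, not part of the notebook. It raises a clear error for unknown titles and removes the query movie by index instead of by rank position.

```python
import numpy as np

def recommend_similar(movie_title, movies, doc_sims, top_n=5):
    """Hypothetical variant of movie_recommender with explicit checks."""
    matches = np.where(movies == movie_title)[0]
    if len(matches) == 0:
        raise KeyError(f'Unknown movie title: {movie_title!r}')
    movie_idx = matches[0]
    # np.asarray accepts both a NumPy array and a pandas DataFrame
    sims = np.asarray(doc_sims)[movie_idx]
    ranked = np.argsort(-sims)
    # drop the query movie by index rather than assuming it ranks first
    ranked = ranked[ranked != movie_idx]
    return movies[ranked[:top_n]]

toy_movies = np.array(['A', 'B', 'C'])
toy_sims = np.array([[1.0, 0.1, 0.8],
                     [0.1, 1.0, 0.3],
                     [0.8, 0.3, 1.0]])
result = recommend_similar('A', toy_movies, toy_sims, top_n=2)
```

The rest of this notebook keeps using the simpler `movie_recommender` defined above.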

# Get popular Movie Recommendations

In [0]:
popular_movies = ['Minions', 'Interstellar', 'Deadpool', 'Jurassic World', 'Pirates of the Caribbean: The Curse of the Black Pearl',
              'Dawn of the Planet of the Apes', 'The Hunger Games: Mockingjay - Part 1', 'Terminator Genisys', 
              'Captain America: Civil War', 'The Dark Knight', 'The Martian', 'Batman v Superman: Dawn of Justice', 
              'Pulp Fiction', 'The Godfather', 'The Shawshank Redemption', 'The Lord of the Rings: The Fellowship of the Ring',  
              'Harry Potter and the Chamber of Secrets', 'Star Wars', 'The Hobbit: The Battle of the Five Armies',
              'Iron Man']

In [80]:
for movie in popular_movies:
    print('Movie:', movie)
    print('Top 5 recommended Movies:', movie_recommender(movie_title=movie, movies=movies_list, doc_sims=doc_sim_df))
    print()

Movie: Minions
Top 5 recommended Movies: ['Despicable Me 2' 'Despicable Me'
 'Teenage Mutant Ninja Turtles: Out of the Shadows' 'Superman'
 'Rise of the Guardians']

Movie: Interstellar
Top 5 recommended Movies: ['Gattaca' 'Space Pirate Captain Harlock' 'Space Cowboys'
 'Starship Troopers' 'Final Destination 2']

Movie: Deadpool
Top 5 recommended Movies: ['Silent Trigger' 'Underworld: Evolution' 'Bronson' 'Shaft' 'Don Jon']

Movie: Jurassic World
Top 5 recommended Movies: ['Jurassic Park' 'The Lost World: Jurassic Park'
 "National Lampoon's Vacation" 'The Nut Job' 'Vacation']

Movie: Pirates of the Caribbean: The Curse of the Black Pearl
Top 5 recommended Movies: ["Pirates of the Caribbean: Dead Man's Chest"
 'Pirates of the Caribbean: On Stranger Tides' 'The Pirate'
 'The Pirates! In an Adventure with Scientists!' 'Joyful Noise']

Movie: Dawn of the Planet of the Apes
Top 5 recommended Movies: ['Battle for the Planet of the Apes' 'Groove' 'The Other End of the Line'
 'Chicago Overcoat

# Movie Recommendation with Embeddings

We used count-based, normalized features in the previous section. Can we use word embeddings and then compute movie similarity? We definitely can! Here we will use the FastText model and train it on our corpus.

The FastText model was first introduced by Facebook in 2016 as an extension of, and supposed improvement on, the vanilla Word2Vec model. It is based on the paper ‘Enriching Word Vectors with Subword Information’ by Bojanowski et al. (Mikolov is a co-author), which is an excellent read to gain an in-depth understanding of how this model works. Overall, FastText is a framework for learning word representations and also performing robust, fast and accurate text classification. The framework is open-sourced by Facebook on GitHub and claims to offer the following:
- Recent state-of-the-art English word vectors.
- Word vectors for 157 languages trained on Wikipedia and Crawl.
- Models for language identification and various supervised tasks.

Though I haven’t implemented this model from scratch, the following is what I learnt from the research paper about how the model works. In general, predictive models like Word2Vec consider each word as a distinct entity (e.g. `where`) and generate a dense embedding for the word. However, this is a serious limitation for languages with massive vocabularies and many rare words which may not occur often in different corpora. The Word2Vec model ignores the morphological structure of each word and treats a word as a single entity. The FastText model instead considers each word as a bag of character n-grams. This is called a subword model in the paper.

We add special boundary symbols `<` and `>` at the beginning and end of words. This enables us to distinguish prefixes and suffixes from other character sequences. We also include the word w itself in the set of its n-grams, to learn a representation for each word (in addition to its character n-grams). Taking the word `where` and n=3 (tri-grams) as an example, it will be represented by the character n-grams `<wh, whe, her, ere, re>` and the special sequence `<where>` representing the whole word. Note that the sequence `<her>`, corresponding to the word `her`, is different from the tri-gram `her` from the word `where`.
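The n-gram extraction described above can be sketched in a few lines of plain Python. The helper `char_ngrams` is our own illustration of the idea and not FastText's actual code. The paper uses a range of n-gram lengths (3 to 6) and hashes the n-grams into buckets, both of which this sketch skips.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of `word` plus the whole-word token."""
    wrapped = f'<{word}>'   # add the boundary symbols
    grams = [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]
    return grams + [wrapped]

where_grams = char_ngrams('where')
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
her_grams = char_ngrams('her')
# ['<he', 'her', 'er>', '<her>']
```

The tri-gram `her` shows up for both words, while the whole-word token `<her>` only belongs to the word `her`, which is exactly the distinction the boundary symbols are meant to preserve.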

Here we leverage `gensim` to build our embeddings.

In [0]:
from gensim.models import FastText

tokenized_docs = [doc.split() for doc in norm_corpus]
# argument names follow gensim 3.x; gensim 4+ renames size -> vector_size and iter -> epochs
ft_model = FastText(tokenized_docs, size=300, window=30, min_count=2, workers=4, sg=1, iter=50)

# Generate document level embeddings

Word embedding models give us an embedding for each word, so how can we use them for downstream ML/DL tasks? One way is to flatten them or use sequential models. A simpler approach is to average the embeddings of all words in a document to generate a fixed-length, document-level embedding.
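Before applying this to the FastText model, here is a toy sketch of the averaging idea with hand-made 2-dimensional vectors. The dictionary `toy_vectors` and the helper `average_vectors` are assumptions for illustration only.

```python
import numpy as np

# Made-up 2-dimensional "word embeddings"
toy_vectors = {
    'space': np.array([1.0, 0.0]),
    'crew': np.array([0.0, 1.0]),
    'ship': np.array([1.0, 1.0]),
}

def average_vectors(tokens, vectors, dim):
    """Average the vectors of known tokens; all zeros if none is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

doc_a = average_vectors(['space', 'crew'], toy_vectors, dim=2)   # mean of two vectors
doc_b = average_vectors(['unknown'], toy_vectors, dim=2)         # no known token
```

Every document ends up with a vector of the same length regardless of how many words it contains, at the cost of losing word order.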

In [0]:
def averaged_word2vec_vectorizer(corpus, model, num_features):
    # gensim 3.x attribute; gensim 4+ renames it to model.wv.index_to_key
    vocabulary = set(model.wv.index2word)
    
    def average_word_vectors(words, model, vocabulary, num_features):
        feature_vector = np.zeros((num_features,), dtype="float64")
        nwords = 0.
        
        for word in words:
            if word in vocabulary: 
                nwords = nwords + 1.
                feature_vector = np.add(feature_vector, model.wv[word])
        if nwords:
            feature_vector = np.divide(feature_vector, nwords)

        return feature_vector

    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)

In [85]:
doc_vecs_ft = averaged_word2vec_vectorizer(tokenized_docs, ft_model, 300)
doc_vecs_ft.shape

(4800, 300)

# Get Movie Recommendations

We will leverage cosine similarity again to generate recommendations

In [86]:
doc_sim = cosine_similarity(doc_vecs_ft)
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,4760,4761,4762,4763,4764,4765,4766,4767,4768,4769,4770,4771,4772,4773,4774,4775,4776,4777,4778,4779,4780,4781,4782,4783,4784,4785,4786,4787,4788,4789,4790,4791,4792,4793,4794,4795,4796,4797,4798,4799
0,1.0,0.492664,0.496308,0.510912,0.535769,0.50032,0.486901,0.4508,0.403828,0.481371,0.516654,0.556541,0.515398,0.511148,0.480202,0.494923,0.499225,0.556211,0.480949,0.479091,0.479232,0.553309,0.537978,0.536106,0.556041,0.501698,0.491527,0.524758,0.512683,0.552796,0.49575,0.532899,0.503712,0.668022,0.486572,0.539664,0.579454,0.452427,0.489803,0.569999,...,0.539699,0.580656,0.48633,0.503779,0.0,0.520268,0.431107,0.528245,0.401787,0.491489,0.490939,0.579464,0.454055,0.521494,0.4519,0.464359,0.480017,0.433644,0.467632,0.486388,0.538019,0.470559,0.343202,0.567069,0.505675,0.494155,0.518892,0.514991,0.496923,0.537625,0.55784,0.454287,0.579657,0.473208,0.513137,0.475539,0.532357,0.474379,0.518188,0.551425
1,0.492664,1.0,0.563403,0.545177,0.603489,0.517669,0.505453,0.584717,0.489141,0.527009,0.575785,0.579353,0.513444,0.493603,0.653348,0.457902,0.53379,0.570331,0.604212,0.515493,0.481409,0.581974,0.5561,0.514275,0.575213,0.494131,0.497962,0.573492,0.594617,0.615781,0.546663,0.569933,0.53318,0.466352,0.505694,0.589641,0.59883,0.463351,0.523948,0.60093,...,0.540858,0.582376,0.641514,0.541945,0.0,0.56051,0.490818,0.561715,0.455354,0.53518,0.646578,0.576697,0.54499,0.559209,0.511443,0.455557,0.514441,0.466614,0.456843,0.5234,0.555628,0.55638,0.285916,0.528488,0.503273,0.489853,0.549668,0.576395,0.451794,0.474446,0.5605,0.485316,0.552406,0.473985,0.572856,0.504204,0.491027,0.494966,0.523733,0.573146
2,0.496308,0.563403,1.0,0.578705,0.587129,0.470731,0.52742,0.515955,0.53037,0.581713,0.575859,0.554697,0.518732,0.511006,0.560594,0.490921,0.569978,0.550745,0.514001,0.514964,0.489199,0.542026,0.559439,0.499355,0.56213,0.569028,0.469616,0.526761,0.530578,0.566531,0.567329,0.557099,0.544709,0.50486,0.561595,0.566932,0.53337,0.496736,0.572349,0.576454,...,0.575831,0.535931,0.534528,0.501722,0.0,0.520267,0.49343,0.571536,0.456092,0.552054,0.523101,0.577292,0.53482,0.537752,0.508964,0.444129,0.496698,0.511398,0.472164,0.47246,0.527869,0.528036,0.30663,0.53577,0.463161,0.466277,0.526613,0.540793,0.543037,0.538885,0.532356,0.52662,0.558509,0.463098,0.559336,0.536197,0.541042,0.525498,0.542619,0.545562
3,0.510912,0.545177,0.578705,1.0,0.59006,0.511298,0.481921,0.511408,0.502355,0.53624,0.558474,0.557189,0.502109,0.495526,0.578293,0.544688,0.553659,0.573777,0.515629,0.485216,0.487369,0.557719,0.56487,0.551208,0.587898,0.529047,0.432864,0.518734,0.492371,0.604807,0.511024,0.519564,0.524674,0.488336,0.494692,0.59596,0.604437,0.476813,0.526082,0.57528,...,0.507219,0.501528,0.526166,0.586476,0.0,0.498878,0.406795,0.546657,0.413804,0.511191,0.510685,0.542662,0.45627,0.522375,0.541161,0.450507,0.524743,0.498713,0.474976,0.555518,0.577508,0.497237,0.347284,0.557203,0.526138,0.489007,0.557042,0.603214,0.483213,0.517395,0.532141,0.579059,0.592812,0.482258,0.547086,0.547431,0.567795,0.507392,0.502579,0.544799
4,0.535769,0.603489,0.587129,0.59006,1.0,0.56468,0.569394,0.668452,0.559399,0.551341,0.595495,0.624358,0.54082,0.578923,0.66377,0.56536,0.650237,0.627029,0.643336,0.576168,0.553061,0.602179,0.648406,0.613924,0.585841,0.527725,0.493498,0.618335,0.607178,0.673672,0.564068,0.625221,0.639371,0.555308,0.623144,0.60444,0.644364,0.536808,0.646316,0.608405,...,0.588222,0.570244,0.647099,0.573426,0.0,0.584451,0.534608,0.597488,0.437698,0.543725,0.628278,0.620133,0.58103,0.610563,0.598553,0.538588,0.581358,0.558789,0.581866,0.610934,0.593388,0.577401,0.352316,0.60434,0.580486,0.548577,0.627801,0.571301,0.592933,0.550137,0.632155,0.613195,0.642776,0.443999,0.616751,0.536856,0.642225,0.572681,0.649322,0.628217


In [87]:
for movie in popular_movies:
    print('Movie:', movie)
    print('Top 5 recommended Movies:', movie_recommender(movie_title=movie, movies=movies_list, doc_sims=doc_sim_df))
    print()

Movie: Minions
Top 5 recommended Movies: ['Despicable Me' 'Time Bandits'
 'Rise of the Entrepreneur: The Search for a Better Way'
 'Austin Powers: The Spy Who Shagged Me' 'Despicable Me 2']

Movie: Interstellar
Top 5 recommended Movies: ['Gattaca' 'Prometheus' 'The Cave'
 'Sea Rex 3D: Journey to a Prehistoric World' 'Space Cowboys']

Movie: Deadpool
Top 5 recommended Movies: ['Fantastic Four' 'Banshee Chapter' 'Spider-Man 3' 'Enough' 'Spider-Man 2']

Movie: Jurassic World
Top 5 recommended Movies: ['Jurassic Park' 'Jurassic Park III' 'The Lost World: Jurassic Park'
 "National Lampoon's Vacation" 'Walking With Dinosaurs']

Movie: Pirates of the Caribbean: The Curse of the Black Pearl
Top 5 recommended Movies: ['Pirates of the Caribbean: On Stranger Tides'
 'The Pirates! In an Adventure with Scientists!'
 "Pirates of the Caribbean: Dead Man's Chest"
 'American Ninja 2: The Confrontation' 'In the Name of the King III']

Movie: Dawn of the Planet of the Apes
Top 5 recommended Movies: ['Bat

# Transfer Learning with ELMO pre-trained embeddings

ELMo is a novel way to represent documents/words as embeddings with more contextual information. These embeddings have helped achieve state-of-the-art (SOTA) results on several NLP tasks.

![](https://i.imgur.com/pqHFUeE.gif)

ELMo word vectors are computed on top of a two-layer bidirectional language model (biLM). The model has two layers stacked together (e.g. sequential models like LSTMs). Each layer makes both a forward and a backward pass and concatenates (or averages/sums) the resulting states to better capture contextual information, just like the bi-directional LSTMs/GRUs we will be using later.

- The architecture above uses a character-level convolutional neural network (CNN) to turn the words of a text string into raw word vectors
- These raw word vectors act as inputs to the first layer of the model
- The forward pass contains information about a certain word and the context (other words) before that word
- The backward pass contains information about the word and the context after it
- This pair of information, from the forward and backward pass, forms the intermediate word vectors
- These intermediate word vectors are fed into the next layer of biLM
- The final representation (ELMo) is the weighted sum of the raw word vectors and the 2 intermediate word vectors
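The last step, the weighted sum, can be sketched with NumPy. In the ELMo paper the layer weights are softmax-normalized and multiplied by a scalar scale. The random vectors, the zero weights and the variable names below are placeholders of our own, not values from the pre-trained model.

```python
import numpy as np

def softmax(w):
    """Normalize raw weights so they are positive and sum to 1."""
    e = np.exp(w - np.max(w))
    return e / e.sum()

# Three representations of one token: raw word vector + 2 intermediate vectors
rng = np.random.default_rng(0)
layers = rng.normal(size=(3, 1024))

raw_weights = np.array([0.0, 0.0, 0.0])   # placeholder; equal weights after softmax
gamma = 1.0                                # placeholder scalar scale
s = softmax(raw_weights)

# Weighted sum over the 3 layers gives one 1024-dimensional ELMo vector
elmo_vector = gamma * np.tensordot(s, layers, axes=1)
```

With equal weights the result is simply the average of the three layers. Different weights let a downstream task emphasise lower (more syntactic) or higher (more contextual) layers.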

ELMo helps capture subword-level information, just like FastText, and also the different meanings of the same word depending on its usage, e.g. 'the _bank_ gives good interest' vs. 'i'm at the river _bank_'.

# Using ELMo as a feature extractor

Here we will leverage the pre-trained ELMo model as a feature extractor and extract a document-level, 1024-dimensional embedding for each of our movies.

In [0]:
import tensorflow_hub as hub
import tensorflow as tf

# hub.Module is the TF 1.x Hub API, matching the tf.Session code used below
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)

In [90]:
sample_desc = df.iloc[0]['description']
sample_desc

'Before Gru, they had a history of bad bosses Minions Stuart, Kevin and Bob are recruited by Scarlet Overkill, a super-villain who, alongside her inventor husband Herb, hatches a plot to take over the world.'

In [91]:
embeddings = elmo([sample_desc], signature="default", as_dict=True)["elmo"]
embeddings.shape

TensorShape([Dimension(1), Dimension(35), Dimension(1024)])

In [0]:
with tf.Session() as sess:
    embeddings = elmo([sample_desc, 'hello how are you'], signature="default", as_dict=True)["elmo"]
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    a = (sess.run(tf.reduce_mean(embeddings,1)))

In [0]:
def get_elmo_embeddings(docs, batch_size=32):
  elmo_embeddings = []
  i = 0
  total_docs = len(docs)
  with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    while i < total_docs:
      i_new = i + batch_size
      if i_new < len(docs):
        embeddings = elmo(docs[i:i_new], signature="default", as_dict=True)["elmo"]
      else:
        embeddings = elmo(docs[i:], signature="default", as_dict=True)["elmo"]
      i = i_new
      elmo_embeddings.append(sess.run(tf.reduce_mean(embeddings,1)))
    return np.concatenate(elmo_embeddings, axis=0)
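The function above combines two ideas: slicing the corpus into batches, and averaging the per-token vectors of each document along the token axis. The NumPy-only sketch below illustrates both with stand-in data. `iter_batches` and `fake_token_embeddings` are hypothetical names, and no ELMo model is involved.

```python
import numpy as np

def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`; the last one may be shorter."""
    for start in range(0, len(items), batch_size):
        # slicing past the end of a list is safe in Python
        yield items[start:start + batch_size]

docs = [f'doc {i}' for i in range(10)]
batch_sizes = [len(batch) for batch in iter_batches(docs, batch_size=4)]

# Stand-in for one batch of token embeddings: (batch, tokens, features)
fake_token_embeddings = np.ones((4, 7, 1024))
# Averaging over axis 1 (tokens) leaves one vector per document
doc_embeddings = fake_token_embeddings.mean(axis=1)
```

Since slicing handles the final, shorter batch on its own, the explicit `if`/`else` in the function above is a matter of style rather than a necessity.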

In [138]:
elmo_embeddings = get_elmo_embeddings(norm_corpus, batch_size=128)
elmo_embeddings.shape

(4800, 1024)

# Compute Movie Similarity

In [139]:
doc_sim = cosine_similarity(elmo_embeddings)
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,4760,4761,4762,4763,4764,4765,4766,4767,4768,4769,4770,4771,4772,4773,4774,4775,4776,4777,4778,4779,4780,4781,4782,4783,4784,4785,4786,4787,4788,4789,4790,4791,4792,4793,4794,4795,4796,4797,4798,4799
0,1.0,0.673902,0.751573,0.863714,0.689305,0.743295,0.787667,0.832816,0.768811,0.798508,0.765984,0.738887,0.78563,0.764878,0.674737,0.744505,0.757085,0.733013,0.739036,0.763905,0.846082,0.812291,0.711172,0.811894,0.737671,0.813584,0.752075,0.699435,0.78052,0.818098,0.691326,0.687013,0.810482,0.841995,0.742114,0.786964,0.729926,0.873464,0.7903,0.76193,...,0.760047,0.749157,0.602688,0.75696,0.591256,0.58201,0.833129,0.782325,0.664777,0.775956,0.666539,0.773167,0.79513,0.66397,0.758077,0.756637,0.694812,0.728161,0.784195,0.800416,0.718451,0.816857,0.671299,0.725745,0.757803,0.79394,0.578746,0.622107,0.835142,0.737978,0.734108,0.867359,0.749411,0.796369,0.761055,0.787904,0.628688,0.72034,0.782287,0.690373
1,0.673902,1.0,0.635209,0.745993,0.662533,0.647107,0.63641,0.726282,0.644014,0.639146,0.697943,0.696398,0.6512,0.619105,0.79788,0.676857,0.636756,0.652537,0.740867,0.625048,0.651642,0.713823,0.672913,0.643272,0.661184,0.661125,0.60785,0.609703,0.699224,0.672637,0.707378,0.702402,0.690226,0.62707,0.564022,0.665528,0.700463,0.639506,0.742246,0.80413,...,0.652133,0.5482,0.747544,0.546157,0.407826,0.547783,0.650193,0.731211,0.509724,0.565217,0.712193,0.674869,0.622215,0.561811,0.636012,0.555241,0.598729,0.62715,0.593192,0.582941,0.538249,0.627147,0.470351,0.636928,0.591095,0.597366,0.497236,0.565199,0.58767,0.520422,0.631614,0.66001,0.591151,0.562878,0.644372,0.650457,0.471403,0.593951,0.613347,0.577721
2,0.751573,0.635209,1.0,0.776931,0.812324,0.626358,0.80568,0.752336,0.636632,0.727761,0.827235,0.755048,0.666357,0.866802,0.750277,0.800741,0.838277,0.796359,0.696132,0.819026,0.670245,0.802342,0.800858,0.815696,0.807615,0.710225,0.641008,0.797783,0.644627,0.74403,0.741489,0.702317,0.80646,0.644382,0.846387,0.839317,0.829216,0.741464,0.73014,0.747999,...,0.737073,0.752865,0.696997,0.668742,0.262846,0.623608,0.700227,0.739211,0.431054,0.679432,0.676116,0.718834,0.651115,0.72643,0.610499,0.698294,0.629757,0.679335,0.596131,0.690085,0.783374,0.670873,0.354831,0.772638,0.717234,0.61162,0.769312,0.628364,0.679596,0.753085,0.811146,0.73643,0.779702,0.617341,0.7677,0.69556,0.753344,0.783749,0.799898,0.729796
3,0.863714,0.745993,0.776931,1.0,0.722593,0.751869,0.794865,0.83933,0.803945,0.777118,0.78555,0.740758,0.77213,0.774399,0.749708,0.778072,0.771104,0.735786,0.809802,0.75136,0.818086,0.798171,0.748329,0.800451,0.72923,0.842323,0.75846,0.701381,0.805943,0.847231,0.729932,0.704609,0.822478,0.823391,0.731352,0.811611,0.748316,0.848465,0.81523,0.785894,...,0.782355,0.679349,0.657832,0.759113,0.594817,0.544871,0.803135,0.830061,0.676949,0.738936,0.687116,0.774896,0.793214,0.631826,0.773584,0.715254,0.726705,0.74634,0.778108,0.776918,0.668761,0.809238,0.668043,0.70998,0.756156,0.793967,0.607917,0.637557,0.791494,0.684709,0.726178,0.844632,0.719388,0.753712,0.787019,0.770069,0.640735,0.746616,0.788071,0.67362
4,0.689305,0.662533,0.812324,0.722593,1.0,0.600786,0.775307,0.735864,0.54723,0.601958,0.788993,0.710288,0.578968,0.800858,0.751683,0.713725,0.860757,0.78765,0.639214,0.731263,0.592789,0.756226,0.759656,0.759283,0.801298,0.635432,0.498798,0.802448,0.561195,0.678202,0.691015,0.735656,0.800781,0.526903,0.740167,0.817801,0.845239,0.659591,0.690958,0.764449,...,0.775696,0.758959,0.750301,0.667466,0.149284,0.629677,0.653324,0.699159,0.317762,0.656449,0.71639,0.690597,0.595044,0.760471,0.558444,0.730936,0.610142,0.603166,0.515101,0.639371,0.810822,0.632039,0.239777,0.806933,0.707522,0.536813,0.782715,0.708684,0.638223,0.751427,0.777682,0.748921,0.799325,0.54421,0.752086,0.62736,0.775034,0.757307,0.787797,0.740129


In [140]:
for movie in popular_movies:
    print('Movie:', movie)
    print('Top 5 recommended Movies:', movie_recommender(movie_title=movie, movies=movies_list, doc_sims=doc_sim_df))
    print()

Movie: Minions
Top 5 recommended Movies: ['Shooting Fish' 'Alice Through the Looking Glass'
 'My Super Ex-Girlfriend' 'Hocus Pocus' 'Ernest & Celestine']

Movie: Interstellar
Top 5 recommended Movies: ['The Ice Pirates' 'Transformers: Age of Extinction'
 '2001: A Space Odyssey' 'Prometheus' 'Sphere']

Movie: Deadpool
Top 5 recommended Movies: ['The Dark Knight Rises' 'Enough' 'Secret Window'
 'The Silence of the Lambs' 'Locker 13']

Movie: Jurassic World
Top 5 recommended Movies: ['Marmaduke' 'Super Mario Bros.' 'The Shining' 'How to Be Single' 'Cheri']

Movie: Pirates of the Caribbean: The Curse of the Black Pearl
Top 5 recommended Movies: ['Pirates of the Caribbean: On Stranger Tides' 'Moby Dick' 'Sharknado'
 'Dear John' 'Jaws']

Movie: Dawn of the Planet of the Apes
Top 5 recommended Movies: ['Resident Evil: Extinction' 'Hoot' 'The 13th Warrior' 'The Night Visitor'
 'Star Trek: Generations']

Movie: The Hunger Games: Mockingjay - Part 1
Top 5 recommended Movies: ['Invasion U.S.A.' '