# Movie Recommendation System using Deep Learning

Based on the structured data extracted [here](https://github.com/DOsinga/deep_learning_cookbook/blob/master/04.1%20Collect%20movie%20data%20from%20Wikipedia.ipynb), we'll train a network that learns to predict a movie based on the outgoing links on the corresponding Wikipedia page. This creates embeddings for the movies. This in turn lets us recommend movies based on other movies - similar movies are next to each other in the embedding space.

Our steps:
1. Load in data, understand dataset and clean
2. Prepare data for supervised machine learning task
3. Build the entity embedding neural network & Train Model
4. Extract embeddings and find most similar movies and wikilinks

# 1. Load in data, understand dataset and clean

In [57]:
# Download the zipped dataset and extract it into the working directory
from urllib.request import urlopen
from zipfile import ZipFile

zipurl = 'https://github.com/rajeevratan84/datascienceforbusiness/blob/master/wp_movies_10k.ndjson.zip?raw=true'

# Download the archive and write its bytes to a temporary file
zipresp = urlopen(zipurl)
with open("/tmp/tempfile.zip", "wb") as tempzip:
    tempzip.write(zipresp.read())

# Re-open the downloaded file as a zip archive and extract its contents;
# the empty path means the files land in the current working directory
with ZipFile("/tmp/tempfile.zip") as zf:
    zf.extractall(path='')

We'll be using Keras to learn and store our embeddings.

In [58]:
import json
from collections import Counter
from keras.models import Model
from keras.layers import Dot, Embedding, Input, Reshape
from sklearn.linear_model import LinearRegression
import numpy as np
import random
from sklearn import svm

#### Load our dataset

Note that it's stored as an NDJSON (newline-delimited JSON) file, with one JSON document per line, so we parse each line separately with the `json` package.

In [59]:
with open('wp_movies_10k.ndjson') as fin:
    movies = [json.loads(l) for l in fin]
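
To make the file format concrete, here is a minimal sketch of NDJSON parsing on an in-memory string. The two records and the helper name `load_ndjson` are made up for illustration; they only mimic the `[title, infobox, links]` shape of the real dataset.

```python
import io
import json

# Two toy records in the same [title, infobox, links] shape as the dataset.
ndjson_text = (
    '["Movie A", {"director": "Someone"}, ["Link 1", "Link 2"]]\n'
    '["Movie B", {"director": "Someone Else"}, ["Link 2"]]\n'
)

def load_ndjson(lines):
    """Parse one JSON document per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]

toy_movies = load_ndjson(io.StringIO(ndjson_text))
print(toy_movies[0][0])  # title of the first record
print(toy_movies[1][2])  # outgoing links of the second record
```

Each parsed record is a plain Python list, which is why `movies[0][0]`, `movies[0][1]` and `movies[0][2]` give the title, the infobox dictionary and the list of links.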

## Examining our dataset

In [60]:
type(movies)

list

In [61]:
movies[0]

['Deadpool (film)',
 {'Software Used': 'Adobe Premier Pro',
  'alt': "Official poster shows the titular hero Deadpool standing in front of the viewers, with hugging his hands, and donning his traditional black and red suit and mask, and the film's name, credits and billing below him.",
  'budget': '$58 million',
  'caption': 'Theatrical release poster',
  'cinematography': 'Ken Seng',
  'country': 'United States',
  'director': 'Tim Miller',
  'distributor': '20th Century Fox',
  'editing': 'Julian Clarke',
  'gross': '$783.1 million',
  'image': 'Deadpool poster.jpg',
  'language': 'English',
  'music': 'Tom Holkenborg',
  'name': 'Deadpool',
  'runtime': '108 minutes'},
 ['Tim Miller (director)',
  'Simon Kinberg',
  'Ryan Reynolds',
  'Lauren Shuler Donner',
  'Rhett Reese',
  'Paul Wernick',
  'Deadpool',
  'Fabian Nicieza',
  'Rob Liefeld',
  'Morena Baccarin',
  'Ed Skrein',
  'T.J. Miller',
  'Gina Carano',
  'Leslie Uggams',
  'Brianna Hildebrand',
  'Stefan Kapičić',
  'Junkie

In [62]:
movies[0][0]

'Deadpool (film)'

In [63]:
movies[0][1]

{'Software Used': 'Adobe Premier Pro',
 'alt': "Official poster shows the titular hero Deadpool standing in front of the viewers, with hugging his hands, and donning his traditional black and red suit and mask, and the film's name, credits and billing below him.",
 'budget': '$58 million',
 'caption': 'Theatrical release poster',
 'cinematography': 'Ken Seng',
 'country': 'United States',
 'director': 'Tim Miller',
 'distributor': '20th Century Fox',
 'editing': 'Julian Clarke',
 'gross': '$783.1 million',
 'image': 'Deadpool poster.jpg',
 'language': 'English',
 'music': 'Tom Holkenborg',
 'name': 'Deadpool',
 'runtime': '108 minutes'}

In [64]:
movies[0][2]

['Tim Miller (director)',
 'Simon Kinberg',
 'Ryan Reynolds',
 'Lauren Shuler Donner',
 'Rhett Reese',
 'Paul Wernick',
 'Deadpool',
 'Fabian Nicieza',
 'Rob Liefeld',
 'Morena Baccarin',
 'Ed Skrein',
 'T.J. Miller',
 'Gina Carano',
 'Leslie Uggams',
 'Brianna Hildebrand',
 'Stefan Kapičić',
 'Junkie XL',
 'Julian Clarke',
 'Marvel Entertainment',
 'Kinberg Genre',
 'Lauren Shuler Donner',
 'TSG Entertainment',
 '20th Century Fox',
 'Le Grand Rex',
 'Variety (magazine)',
 'Box Office Mojo',
 'superhero film',
 'Tim Miller (director)',
 'Rhett Reese',
 'Paul Wernick',
 'Marvel Comics',
 'Deadpool',
 'X-Men (film series)',
 'Ryan Reynolds',
 'Morena Baccarin',
 'Ed Skrein',
 'T.J. Miller',
 'Gina Carano',
 'Leslie Uggams',
 'Brianna Hildebrand',
 'Stefan Kapičić',
 'antihero',
 'New Line Cinema',
 '20th Century Fox',
 'X-Men Origins: Wolverine',
 'principal photography',
 'Vancouver',
 'IMAX',
 'Digital Light Processing',
 'D-Box Technologies',
 'List of accolades received by Deadpool (

### Show the most common link counts

In [65]:
link_counts = Counter()

for movie in movies:
    link_counts.update(movie[2])

link_counts.most_common(10)

[('Rotten Tomatoes', 9393),
 ('Category:English-language films', 5882),
 ('Category:American films', 5867),
 ('Variety (magazine)', 5450),
 ('Metacritic', 5112),
 ('Box Office Mojo', 4186),
 ('The New York Times', 3818),
 ('The Hollywood Reporter', 3553),
 ('Roger Ebert', 2707),
 ('Los Angeles Times', 2454)]

# 2. Prepare data for supervised machine learning task
### Map Links to Integers

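First we want to create a mapping of links to integers. When we feed links into the embedding neural network, we will have to represent them as numbers, and this mapping will let us keep track of which number stands for which link. The reverse mapping, from integers back to links, is the `top_links` list itself, because each link's integer is simply its position in that list. We keep only links that occur at least 3 times in the dataset.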
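
As a small illustration of the two directions of such a mapping, here is a sketch on a toy list. The names `toy_links`, `toy_link_to_idx` and `toy_idx_to_link` are made up for this example and are not part of the notebook.

```python
# Toy list of links; the position in the list is the integer ID.
toy_links = ['Rotten Tomatoes', 'Metacritic', 'Roger Ebert']

# Forward mapping: link -> integer
toy_link_to_idx = {link: idx for idx, link in enumerate(toy_links)}

# Reverse mapping: integer -> link (equivalent to indexing toy_links directly)
toy_idx_to_link = {idx: link for link, idx in toy_link_to_idx.items()}

idx = toy_link_to_idx['Metacritic']
print(idx, toy_idx_to_link[idx], toy_links[idx])
```

Because the forward dictionary is built with `enumerate`, indexing the original list gives the same answer as an explicit reverse dictionary.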

In [66]:
top_links = [link for link, c in link_counts.items() if c >= 3]
print(len(top_links))
top_links[:20]

66913


['Tim Miller (director)',
 'Simon Kinberg',
 'Ryan Reynolds',
 'Lauren Shuler Donner',
 'Rhett Reese',
 'Paul Wernick',
 'Deadpool',
 'Morena Baccarin',
 'Ed Skrein',
 'T.J. Miller',
 'Gina Carano',
 'Leslie Uggams',
 'Brianna Hildebrand',
 'Stefan Kapičić',
 'Junkie XL',
 'Julian Clarke',
 'Marvel Entertainment',
 'Kinberg Genre',
 'TSG Entertainment',
 '20th Century Fox']

In [67]:
# Map link to ID number
link_to_idx = {link: idx for idx, link in enumerate(top_links)}

# Map movie to ID Number
movie_to_idx = {movie[0]: idx for idx, movie in enumerate(movies)}
movie_to_idx

{'Deadpool (film)': 0,
 'The Revenant (2015 film)': 1,
 'Suicide Squad (film)': 2,
 'Spectre (2015 film)': 3,
 'Rebel Without a Cause': 4,
 'Warcraft (film)': 5,
 'The Martian (film)': 6,
 'List of Marvel Cinematic Universe films': 7,
 'X-Men (film series)': 8,
 'The Hateful Eight': 9,
 'The Jungle Book (2016 film)': 10,
 'The Big Short (film)': 11,
 '10 Cloverfield Lane': 12,
 'Spotlight (film)': 13,
 'Room (2015 film)': 14,
 'Creed (film)': 15,
 'DC Universe Animated Original Movies': 16,
 'Star Trek Beyond': 17,
 'Star Wars (film)': 18,
 'Interstellar (film)': 19,
 'Ant-Man (film)': 20,
 'Everest (2015 film)': 21,
 'Jurassic World': 22,
 'Joy (film)': 23,
 'Gods of Egypt (film)': 24,
 'Star Wars sequel trilogy': 25,
 'The Conjuring 2': 26,
 'The Danish Girl (film)': 27,
 'Sicario (2015 film)': 28,
 'Rogue One': 29,
 'Finding Dory': 30,
 'Black Mass (film)': 31,
 'Blade Runner': 32,
 'Harry Potter (film series)': 33,
 'Doctor Strange (film)': 34,
 'Titanic (1997 film)': 35,
 'Furious

### Build a Training Set

In order for any machine learning model to learn, it needs a training set. We are going to treat this as a supervised learning problem: given a pair (movie, link), we want the neural network to learn to predict whether this is a legitimate pair - present in the data - or not.

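To create a training set, for each movie, we'll iterate through the wikilinks on the movie page and record each link together with the movie as a tuple of integer indices. The final `pairs` list will consist of a (link, movie) tuple for every occurrence of a retained link on the 10,000 movie pages in our dataset. We also keep a set of the pairs so that we can quickly check whether a given pair is real.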
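
The pair construction can be sketched on toy data as follows. The movies, links and `toy_*` names are invented for illustration; `'Rare link'` stands in for a link that was filtered out of the vocabulary.

```python
# Toy records in the [title, infobox, links] shape of the dataset.
toy_movies = [
    ['Movie A', {}, ['Link 1', 'Link 2', 'Rare link']],
    ['Movie B', {}, ['Link 2', 'Link 2']],
]

# 'Rare link' is deliberately missing, as if it fell below the count threshold.
toy_link_to_idx = {'Link 1': 0, 'Link 2': 1}
toy_movie_to_idx = {movie[0]: idx for idx, movie in enumerate(toy_movies)}

toy_pairs = [
    (toy_link_to_idx[link], toy_movie_to_idx[movie[0]])
    for movie in toy_movies
    for link in movie[2]
    if link in toy_link_to_idx
]
toy_pairs_set = set(toy_pairs)

print(toy_pairs)           # repeated links on a page appear more than once
print(len(toy_pairs_set))  # the set keeps each distinct pair once
```

Note that the list keeps duplicates when a page links to the same target several times, while the set does not, so the two can have different sizes.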

In [68]:
# Create an empty list to hold the (link, movie) index pairs
pairs = []

for movie in movies:
    pairs.extend((link_to_idx[link], movie_to_idx[movie[0]]) for link in movie[2] if link in link_to_idx)

pairs_set = set(pairs)
len(pairs), len(top_links), len(movie_to_idx)

(949544, 66913, 10000)

In [69]:
pairs_set

{(11566, 6844),
 (2454, 7796),
 (8054, 5634),
 (64836, 8586),
 (203, 2780),
 (21670, 416),
 (2926, 6061),
 (4849, 255),
 (35937, 8392),
 (23746, 1366),
 (9214, 222),
 (18849, 2311),
 (17058, 6885),
 (12663, 845),
 (203, 889),
 (13688, 7765),
 (19206, 2373),
 (178, 2893),
 (11820, 8449),
 (19, 4787),
 (5480, 182),
 (42710, 5809),
 (57200, 7656),
 (965, 7725),
 (57651, 6318),
 (28505, 615),
 (6618, 543),
 (6691, 1557),
 (22, 1106),
 (1203, 8556),
 (2985, 3076),
 (20319, 1044),
 (33113, 1909),
 (43079, 8950),
 (321, 5887),
 (61712, 5361),
 (15937, 216),
 (1931, 369),
 (203, 5288),
 (42418, 1528),
 (43991, 3682),
 (33070, 8532),
 (2988, 836),
 (402, 211),
 (36537, 1057),
 (23548, 3035),
 (13348, 3766),
 (56631, 3692),
 (433, 5722),
 (22444, 5956),
 (3239, 1213),
 (7609, 5065),
 (700, 4275),
 (22, 4515),
 (27705, 3101),
 (14514, 8128),
 (2966, 2139),
 (15664, 578),
 (27551, 570),
 (943, 65),
 (5386, 3183),
 (19893, 2249),
 (22, 8142),
 (22873, 7291),
 (47405, 8829),
 (16570, 2666),
 (22698,

# 3. Build the Entity embedding Neural Network & Train Model

With our dataset and a supervised machine learning task, we're almost there. The next step is the most technically complicated but thankfully fairly simple with Keras. We are going to construct the neural network that learns the entity embeddings. 

The input to this network is a (movie, link) pair (either positive or negative) as integers, and the output will be a prediction of whether or not the link was present in the movie article. However, we're not actually interested in the prediction except as the device used to train the network by comparison to the label. What we are after is at the heart of the network: **the embedding layers**, one for the movie and one for the link, each of which maps the input entity to a 50-dimensional vector. The layers of our network are as follows:

1. Input: parallel inputs for the movie and link
2. Embedding: parallel embeddings for the movie and link
3. Dot: computes the dot product between the embeddings to merge them together
4. Reshape: utility layer needed to correct the shape of the dot product
5. [Optional] Dense: fully connected layer with sigmoid activation to generate output for classification

After converting the inputs to embeddings, we need a way to combine the two embeddings into a single number. For this we can use the dot product, which multiplies the vectors element-wise and then sums the result to a single number. This raw number (after reshaping) is then the output of the model for the case of regression. In regression, our labels are either -1 or 1, so the model's loss function is mean squared error, which minimizes the distance between the prediction and the label. Using the dot product with normalization means that the `Dot` layer computes the cosine similarity between the embedding for the movie and the embedding for the link. Combining the embeddings this way pushes the network to learn similar embeddings for movies that link to similar pages.
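
The relationship between the normalized dot product and cosine similarity can be checked with a short NumPy sketch. The vectors and the helper name `cosine_similarity` are made up for illustration; this is not the code Keras runs internally.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of two vectors after scaling each to unit length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # same direction as a, twice the length
c = np.array([-1.0, -2.0, -3.0])  # opposite direction to a

print(cosine_similarity(a, b))  # close to +1
print(cosine_similarity(a, c))  # close to -1
```

The result always lies between -1 and 1 regardless of vector length, which is why labels of -1 and 1 are a natural fit for the regression setup.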

### Classification vs Regression

For classification, we add an extra fully connected `Dense` layer with a `sigmoid` activation to squash the outputs between 0 and 1 because the labels are either 0 or 1. The loss function for classification is `binary_crossentropy` which measures the [error of the neural network predictions in a binary classification problem](https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html), and is a measure of the similarity between two distributions. We can train with either classification or regression, and in practice, I found that both approaches produced similar embeddings. I'm not sure about the technical merits of these methods, and I'd be interested to hear if one is better than the other. 
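
To illustrate the classification setup without building a network, here is a NumPy sketch of the label conversion, the sigmoid squashing and the binary cross-entropy loss. The raw scores are hypothetical numbers, and the helper functions are written for this example rather than taken from Keras.

```python
import numpy as np

def sigmoid(x):
    """Squash real-valued scores into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1.0 - y_true) * np.log(1.0 - y_prob)))

# Regression labels are -1 / 1; classification labels are 0 / 1.
regression_labels = np.array([1.0, -1.0, -1.0, 1.0])
classification_labels = (regression_labels + 1.0) / 2.0

raw_scores = np.array([2.0, -1.5, 0.5, 3.0])  # hypothetical pre-activation outputs
probs = sigmoid(raw_scores)
loss = binary_crossentropy(classification_labels, probs)

print(classification_labels)
print(loss)
```

The third score is on the wrong side of zero for its label, so it contributes the largest share of the loss.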

The optimizer - the algorithm used to update the parameters (also called weights) of the neural network after calculating the gradients through backpropagation - is from the Adam family in both cases ([Adam is a modification to Stochastic Gradient Descent](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)); the code below passes `optimizer='nadam'`, the Nesterov-momentum variant of Adam. We use the default parameters for this optimizer. The nice thing about modern neural network frameworks is that we don't have to worry about backpropagation or updating the model parameters because that is done for us. It's nice to have an idea of what is occurring behind the scenes, but it's not strictly necessary in order to use a neural network effectively.
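
For readers curious about what happens behind the scenes, here is a rough sketch of a single update step of plain Adam in NumPy, applied to the toy function f(x) = x². It is an illustration only: the function name `adam_step` is made up, the hyperparameter values are commonly used defaults assumed for this example, and the Nesterov variant used in the notebook differs in detail.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One plain Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad        # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Minimize f(x) = x^2, whose gradient is 2x, starting from x = 1.
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    grad = 2 * x
    x, m, v = adam_step(x, grad, m, v, t)

print(x)  # has moved from 1.0 towards the minimum at 0
```

Because the step is scaled by the running gradient statistics, each update here has a size of roughly `lr`, independent of the raw gradient magnitude.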

In [70]:
def movie_embedding_model(embedding_size=50):
    link = Input(name='link', shape=(1,))
    movie = Input(name='movie', shape=(1,))
    link_embedding = Embedding(name='link_embedding', 
                               input_dim=len(top_links), 
                               output_dim=embedding_size)(link)
    movie_embedding = Embedding(name='movie_embedding', 
                                input_dim=len(movie_to_idx), 
                                output_dim=embedding_size)(movie)
    dot = Dot(name='dot_product', normalize=True, axes=2)([link_embedding, movie_embedding])
    merged = Reshape((1,))(dot)
    model = Model(inputs=[link, movie], outputs=[merged])
    model.compile(optimizer='nadam', loss='mse')
    return model

model = movie_embedding_model()
model.summary()

Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
link (InputLayer)               [(None, 1)]          0                                            
__________________________________________________________________________________________________
movie (InputLayer)              [(None, 1)]          0                                            
__________________________________________________________________________________________________
link_embedding (Embedding)      (None, 1, 50)        3345650     link[0][0]                       
__________________________________________________________________________________________________
movie_embedding (Embedding)     (None, 1, 50)        500000      movie[0][0]                      
_______________________________________________________________________________________

In [71]:
random.seed(5)

def batchifier(pairs, positive_samples=50, negative_ratio=10):
    # Each batch holds the positives plus negative_ratio negatives per positive
    batch_size = positive_samples * (1 + negative_ratio)
    batch = np.zeros((batch_size, 3))
    while True:
        # Positive examples: real (link, movie) pairs, labeled 1
        for idx, (link_id, movie_id) in enumerate(random.sample(pairs, positive_samples)):
            batch[idx, :] = (link_id, movie_id, 1)
        # Negative examples: random pairs that do not occur in the data, labeled -1
        idx = positive_samples
        while idx < batch_size:
            movie_id = random.randrange(len(movie_to_idx))
            link_id = random.randrange(len(top_links))
            if (link_id, movie_id) not in pairs_set:
                batch[idx, :] = (link_id, movie_id, -1)
                idx += 1
        # Shuffle so positives and negatives are mixed within the batch
        np.random.shuffle(batch)
        yield {'link': batch[:, 0], 'movie': batch[:, 1]}, batch[:, 2]

next(batchifier(pairs, positive_samples=3, negative_ratio=2))

({'link': array([32643., 48731.,  3801.,  1313., 13365., 32318., 20558., 22418.,
         31254.]),
  'movie': array([7628., 1854., 5874., 7236., 6238., 7685.,  849., 1529., 5530.])},
 array([-1., -1., -1.,  1., -1., -1., -1.,  1.,  1.]))

## Train Model

We have the training data - in a generator - and a model. The next step is to train the model to learn the entity embeddings. During this process, the model will update the embeddings (change the model parameters) to accomplish the task of predicting whether a certain link is on a movie page or not. The resulting embeddings can then be used as a representation of movies and links.

There are a few parameters to adjust for training. The batch size should generally be as large as possible given the memory constraints of your machine. The negative ratio can be adjusted based on results. The number of steps per epoch is chosen so that, in each epoch, the model sees about as many positive examples as there are pairs. This is repeated for 5 epochs (you can do more, but with diminishing returns).
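
The arithmetic behind these settings can be worked through directly. The sketch below uses the number of pairs reported earlier (949544) together with 512 positive samples per batch and a negative ratio of 10; the variable names are chosen for this example.

```python
n_pairs = 949544                  # len(pairs) reported above
positive_samples_per_batch = 512
negative_ratio = 10

# Every positive is accompanied by negative_ratio negatives.
batch_size = positive_samples_per_batch * (1 + negative_ratio)

# Enough steps for the positives drawn in one epoch to roughly equal n_pairs.
steps_per_epoch = n_pairs // positive_samples_per_batch
positives_per_epoch = steps_per_epoch * positive_samples_per_batch

print(batch_size)           # 5632
print(steps_per_epoch)      # 1854
print(positives_per_epoch)  # 949248
```

Since the positives are drawn at random in each batch, an epoch does not visit every pair exactly once; it only matches the dataset size in number of positive draws.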

In [72]:
positive_samples_per_batch = 512

model.fit(
    batchifier(pairs, positive_samples=positive_samples_per_batch, negative_ratio=10),
    epochs=5,
    steps_per_epoch=len(pairs) // positive_samples_per_batch,
    verbose=1
)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f517e147240>

### Save our model

In [77]:
model.save('movie_embeddings.h5')

# 4. Extract Embeddings and Find Similar Movies

We've trained the model - great - but where is the movie recommendation system? Once we extract the embeddings from the trained model, we can use them to recommend movies that our model has learned are most similar to a given movie.


### Function to Find Most Similar Entities

The function below takes a movie title and the normalized movie embeddings, and prints the 10 movies most similar to the query. It does this by computing the dot product between the query's embedding and every movie embedding. Because we normalized the embeddings, the dot product represents the [cosine similarity](http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/) between two vectors. This is a measure of similarity that, in contrast to the Euclidean distance, does not depend on the magnitude of the vectors. (The Euclidean distance would be another valid measure of similarity for comparing the embeddings.)

Once we have the dot products, we can sort the results to find the closest entities in the embedding space. With cosine similarity, higher numbers indicate entities that are closer together, with -1 the furthest apart and +1 closest together.
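
The lookup can be sketched on a small random matrix standing in for trained embeddings. The data, the function name `most_similar` and the parameter `n` are made up for this example. The last lines also check a standard identity: for unit-length vectors, the squared Euclidean distance equals 2 minus twice the cosine similarity, so both measures rank neighbors in the same order.

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 4))  # toy stand-in for trained embeddings

# Scale every row to unit length so dot products are cosine similarities.
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def most_similar(query_idx, normalized, n=3):
    """Return (index, similarity) for the n rows closest to the query row."""
    sims = normalized @ normalized[query_idx]
    order = np.argsort(sims)[::-1][:n]  # highest similarity first
    return [(int(i), float(sims[i])) for i in order]

top = most_similar(0, normalized)
print(top)  # the query itself comes first, with similarity close to 1

# Relation between Euclidean distance and cosine similarity for unit vectors.
a, b = normalized[0], normalized[1]
sq_dist = float(np.sum((a - b) ** 2))
cos = float(np.dot(a, b))
print(sq_dist, 2 - 2 * cos)  # the two values agree up to rounding
```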

In [79]:
# Get the movie embedding layer
movie = model.get_layer('movie_embedding')

# Get its weight matrix (one row per movie)
movie_weights = movie.get_weights()[0]

# Compute the L2 norm (length) of each movie's embedding vector
movie_lengths = np.linalg.norm(movie_weights, axis=1)

# Divide each embedding by its length to get the normalized movie embeddings
normalized_movies = (movie_weights.T / movie_lengths).T

def similar_movies(movie, normalized_movies):
    
    # This represents the cosine similarity of the two vectors
    dists = np.dot(normalized_movies, normalized_movies[movie_to_idx[movie]])

    # Sort dot product results to get the most similar movies
    closest = np.argsort(dists)[-10:]
    for c in reversed(closest):
        print(c, movies[c][0], dists[c])

similar_movies("Interstellar (film)", normalized_movies)

19 Interstellar (film) 1.0
85 Inception 0.980147
101 Prometheus (2012 film) 0.97491866
181 Pacific Rim (film) 0.9684306
6 The Martian (film) 0.9656895
784 Spider-Man 2 0.9624138
182 The Amazing Spider-Man 2 0.9617946
29 Rogue One 0.9567835
37 Avatar (2009 film) 0.95600116
200 The Incredible Hulk (film) 0.9559895


In [81]:
similar_movies("Titanic (1997 film)", normalized_movies)

35 Titanic (1997 film) 0.99999994
84 Saving Private Ryan 0.9431554
303 Raiders of the Lost Ark 0.91884685
155 Gladiator (2000 film) 0.91162646
531 Indiana Jones and the Kingdom of the Crystal Skull 0.9116
85 Inception 0.9021463
260 Braveheart 0.90205705
245 Gravity (film) 0.8957691
692 Indiana Jones and the Last Crusade 0.89550674
449 The Curious Case of Benjamin Button (film) 0.8944834


### Wikilink Embeddings

We also have the embeddings of Wikipedia links (which are themselves Wikipedia pages). We can take a similar approach to extract these and find the most similar to a query page.

We extract and normalize the weights of the `link_embedding` layer in the same way as we did for the movies.

In [76]:
link = model.get_layer('link_embedding')
link_weights = link.get_weights()[0]
link_lengths = np.linalg.norm(link_weights, axis=1)
normalized_links = (link_weights.T / link_lengths).T

def similar_links(link):
    dists = np.dot(normalized_links, normalized_links[link_to_idx[link]])
    closest = np.argsort(dists)[-10:]
    for c in reversed(closest):
        print(c, top_links[c], dists[c])

similar_links('George Lucas')

127 George Lucas 0.9999999
3176 Star Wars (film) 0.9565747
3696 Raiders of the Lost Ark 0.9506459
2778 Lucasfilm 0.9395998
8301 Academy Award for Visual Effects 0.9351768
976 Hugo Award for Best Dramatic Presentation 0.9221633
3082 Category:Films that won the Best Visual Effects Academy Award 0.91937685
2884 London Symphony Orchestra 0.916252
2919 Close Encounters of the Third Kind 0.91387147
2984 Saturn Award for Best Science Fiction Film 0.912349
