## Content Based Movie Recommendation System

This notebook includes a Movie Recommendation system based on the Movie Summary Corpus Dataset.

#### Imports 

In [1]:
import os

#### Load the dataset

In [2]:
movie_summary_path = './../../Datasets/Movie_summary'

In [3]:
!ls ./../../Datasets/Movie_summary

README.MD              movie.metadata.tsv     tvtropes.clusters.txt
README.txt             name.clusters.txt
character.metadata.tsv plot_summaries.txt


In [4]:
movies = sc.textFile(os.path.join(movie_summary_path, 'movie.metadata.tsv'))

In [5]:
print(f"There are {movies.count()} in the Movie Summary dataset")

There are 81741 in the Movie Summary dataset


In [6]:
movies.take(5)

['975900\t/m/03vyhn\tGhosts of Mars\t2001-08-24\t14010832\t98.0\t{"/m/02h40lc": "English Language"}\t{"/m/09c7w0": "United States of America"}\t{"/m/01jfsb": "Thriller", "/m/06n90": "Science Fiction", "/m/03npn": "Horror", "/m/03k9fj": "Adventure", "/m/0fdjb": "Supernatural", "/m/02kdv5l": "Action", "/m/09zvmj": "Space western"}',
 '3196793\t/m/08yl5d\tGetting Away with Murder: The JonBenét Ramsey Mystery\t2000-02-16\t\t95.0\t{"/m/02h40lc": "English Language"}\t{"/m/09c7w0": "United States of America"}\t{"/m/02n4kr": "Mystery", "/m/03bxz7": "Biographical film", "/m/07s9rl0": "Drama", "/m/0hj3n01": "Crime Drama"}',
 '28463795\t/m/0crgdbh\tBrun bitter\t1988\t\t83.0\t{"/m/05f_3": "Norwegian Language"}\t{"/m/05b4w": "Norway"}\t{"/m/0lsxr": "Crime Fiction", "/m/07s9rl0": "Drama"}',
 '9363483\t/m/0285_cd\tWhite Of The Eye\t1987\t\t110.0\t{"/m/02h40lc": "English Language"}\t{"/m/07ssc": "United Kingdom"}\t{"/m/01jfsb": "Thriller", "/m/0glj9q": "Erotic thriller", "/m/09blyk": "Psychological th

In [7]:
summaries = sc.textFile(os.path.join(movie_summary_path, 'plot_summaries.txt'))

In [8]:
print(f"There are {summaries.count()} summaries avaiable in the Movie Summary dataset")

There are 42306 summaries avaiable in the Movie Summary dataset


In [9]:
summaries.take(5)

["23890098\tShlykov, a hard-working taxi driver and Lyosha, a saxophonist, develop a bizarre love-hate relationship, and despite their prejudices, realize they aren't so different after all.",
 '31186339\tThe nation of Panem consists of a wealthy Capitol and twelve poorer districts. As punishment for a past rebellion, each district must provide a boy and girl  between the ages of 12 and 18 selected by lottery  for the annual Hunger Games. The tributes must fight to the death in an arena; the sole survivor is rewarded with fame and wealth. In her first Reaping, 12-year-old Primrose Everdeen is chosen from District 12. Her older sister Katniss volunteers to take her place. Peeta Mellark, a baker\'s son who once gave Katniss bread when she was starving, is the other District 12 tribute. Katniss and Peeta are taken to the Capitol, accompanied by their frequently drunk mentor, past victor Haymitch Abernathy. He warns them about the "Career" tributes who train intensively at special academie

#### Spark transformations

Extract movie ids and titles

In [10]:
movie_ids_and_titles = movies.map(lambda elem: elem.split("\t")).map(lambda movie: (movie[0], movie[2]))

In [11]:
movie_ids_and_titles.take(5)

[('975900', 'Ghosts of Mars'),
 ('3196793', 'Getting Away with Murder: The JonBenét Ramsey Mystery'),
 ('28463795', 'Brun bitter'),
 ('9363483', 'White Of The Eye'),
 ('261236', 'A Woman in Flames')]

Extract the movie ids and summaries

In [12]:
ids_and_summaries = summaries.map(lambda elem: elem.split('\t')).map(lambda summary: (summary[0], summary[1]))

In [13]:
ids_and_summaries.take(5)

[('23890098',
  "Shlykov, a hard-working taxi driver and Lyosha, a saxophonist, develop a bizarre love-hate relationship, and despite their prejudices, realize they aren't so different after all."),
 ('31186339',
  'The nation of Panem consists of a wealthy Capitol and twelve poorer districts. As punishment for a past rebellion, each district must provide a boy and girl  between the ages of 12 and 18 selected by lottery  for the annual Hunger Games. The tributes must fight to the death in an arena; the sole survivor is rewarded with fame and wealth. In her first Reaping, 12-year-old Primrose Everdeen is chosen from District 12. Her older sister Katniss volunteers to take her place. Peeta Mellark, a baker\'s son who once gave Katniss bread when she was starving, is the other District 12 tribute. Katniss and Peeta are taken to the Capitol, accompanied by their frequently drunk mentor, past victor Haymitch Abernathy. He warns them about the "Career" tributes who train intensively at speci

Get only the ids of summaries

In [14]:
ids_of_summaries = ids_and_summaries.map(lambda elem: elem[0]).collect()

In [15]:
ids_of_summaries[:5]

['23890098', '31186339', '20663735', '2231378', '595909']

Keep only the movies for which you have summaries

In [16]:
kept_movies = movie_ids_and_titles.filter(lambda elem: elem[0] in ids_of_summaries)

In [17]:
kept_movies.count()

42207

We found that some summaries where not available, but we can proceed with the 42207 movies.

In [18]:
kept_movies.take(5)

[('975900', 'Ghosts of Mars'),
 ('9363483', 'White Of The Eye'),
 ('261236', 'A Woman in Flames'),
 ('18998739', "The Sorcerer's Apprentice"),
 ('6631279', 'Little city')]

Now we need to join the movie titles with their summaries

In [19]:
joint_rdd = kept_movies.join(ids_and_summaries).cache()

In [20]:
joint_rdd.count()

42207

In [21]:
joint_rdd.take(5)

[('156558',
  ('Baby Boy',
   'A young 20-year-old named Jody  lives with his mother Juanita ,{{amg movie}} in South Central Los Angeles. He spends most of his time with his unemployed best friend P , and does not seem interested in becoming a responsible adult. However, he is forced to mature as a result of an ex-con named Melvin , who moves into their home. Another factor is his children - a son with his girlfriend Yvette  and a daughter with a girl named Peanut, who also lives with her mother. At the beginning of the movie Yvette has an abortion that Jody forced her to have. Yvette constantly asks Jody if he will ever come live with her and their son, but Jody avoids the subject and comes and goes as he pleases. Jody also continues seeing and having sex with other women, including Peanut. This becomes an issue between him and Yvette as well, especially since Yvette and Peanut do not get along. When she discovers his cheating they get in a heated argument which results to Jody slappi

## Using a doc2vec model on the available data

Based on the idea found on this [this link](http://sujitpal.blogspot.com/2016/04/predicting-movie-tags-from-plots-using.html) we wish to use a document_2_vector model so that we can vectorize the content of the summaries to vectors, and that we can use them in order to find movies similar to the interest of a user based on the cosine similarity of the extracted documents.

In [53]:
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Doc2Vec
from random import shuffle
from sklearn.model_selection import train_test_split
import nltk
import numpy as np

In [48]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/alexandros.ferles/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [23]:
model = Doc2Vec(dm=0, vector_size=100, negative=5, hs=0, min_count=2)

In [45]:
only_summaries = joint_rdd.map(lambda elem: elem[1][1]).collect()

In [61]:
sentences = [TaggedDocument(doc, [i]) for i, doc in enumerate(only_summaries)]

In [74]:
model = Doc2Vec(documents, vector_size=100, negative=5, hs=0, min_count=2)

Train the doc2Vec model

In [82]:
from tqdm import tqdm

In [83]:
alpha = 0.025
min_alpha = 0.001
num_epochs = 20
alpha_delta = (alpha - min_alpha) / num_epochs

for epoch in tqdm(range(num_epochs)):
    shuffle(sentences)
    model.alpha = alpha
    model.min_alpha = alpha
    model.train(sentences, total_examples=model.corpus_count, epochs=1)
    alpha -= alpha_delta

100%|██████████| 20/20 [04:23<00:00, 12.98s/it]


In [84]:
We can see that the trained model

<gensim.models.doc2vec.Doc2Vec at 0x11ed9a780>

In [86]:
model.infer_vector(only_summaries[0])

array([ 3.35745700e-02,  6.84581045e-03, -8.33332837e-02,  5.95750324e-02,
        2.94893458e-02, -4.44047116e-02,  9.32905525e-02, -6.17083684e-02,
        9.08117965e-02, -2.30221674e-02, -4.95272763e-02,  2.69503519e-03,
        2.32944768e-02, -4.41004559e-02, -1.17280625e-01, -3.69116594e-03,
        6.42675674e-03, -2.95317061e-02,  6.27793670e-02, -9.50667355e-03,
        1.71459317e-01, -9.37651172e-02,  4.02137451e-02, -2.54107174e-03,
        1.86351165e-01, -8.39546844e-02,  1.42284231e-02, -2.95529608e-02,
       -4.00071740e-02,  5.61882891e-02, -5.88813685e-02,  3.18345875e-02,
       -4.92487708e-03,  9.68242437e-02, -6.64590523e-02, -1.77101403e-01,
       -3.58027667e-02,  6.84229238e-03, -3.82050276e-02, -1.06377356e-01,
        5.75787723e-02, -4.21793684e-02,  1.21596217e-01,  4.07193415e-02,
       -5.78942103e-03, -2.61177234e-02, -1.15547791e-01,  9.75271910e-02,
        2.47938503e-02, -4.07442264e-02, -3.79833542e-02, -2.93961391e-02,
        2.84741689e-02, -