### Read data
The data can be downloaded from Kaggle: [Wikipedia Movie Plots](https://www.kaggle.com/jrobischon/wikipedia-movie-plots).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('weights/wiki_movie_plots_deduped.csv')
df['Title'] = df['Title'] + ' - ' + df['Release Year'].astype(str) + ' - ' + df['Director']
df = df[['Title', 'Plot']]
df.head()

| | Title | Plot |
|---|---|---|
| 0 | Kansas Saloon Smashers - 1901 - Unknown | A bartender is working at a saloon, serving dr... |
| 1 | Love by the Light of the Moon - 1901 - Unknown | The moon, painted with a smiling face hangs ov... |
| 2 | The Martyred Presidents - 1901 - Unknown | The film, just over a minute long, is composed... |
| 3 | Terrible Teddy, the Grizzly King - 1901 - Unknown | Lasting just 61 seconds and consisting of two ... |
| 4 | Jack and the Beanstalk - 1902 - George S. Flem... | The earliest known adaptation of the classic f... |


### Tokenize data

In [3]:
import gensim
import nltk
from nltk.corpus import stopwords

In [4]:
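# Build the stop-word set once; calling stopwords.words() for every token is slow.
STOP_WORDS = set(stopwords.words('english'))


def transform2tok(text):
    """Return the plot as a space-joined string of tokens, minus stop words."""
    tokens = []
    for sent in nltk.sent_tokenize(text, language='english'):
        for word in nltk.word_tokenize(sent, language='english'):
            # Skip single-character tokens (mostly punctuation).
            if len(word) < 2:
                continue
            if word in STOP_WORDS:
                continue
            tokens.append(word)
    return ' '.join(tokens)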
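
The stop-word check above is case-sensitive, which is why the tokenized plots shown further down still begin with words such as "The". The sketch below illustrates the effect on one sentence from the data. It is not the notebook's tokenizer: the tiny `STOPWORDS` set, the regex split, and the `lowercase` switch are assumptions made so that the example runs without NLTK data.

```python
import re

# Tiny illustrative stop-word set (an assumption); the notebook uses NLTK's English list.
STOPWORDS = {"the", "a", "is", "at", "with", "over", "in"}


def filter_tokens(text, lowercase=False):
    """Drop one-character tokens and stop words, optionally lowercasing first."""
    kept = []
    for word in re.findall(r"[A-Za-z0-9']+", text):
        candidate = word.lower() if lowercase else word
        if len(candidate) < 2 or candidate in STOPWORDS:
            continue
        kept.append(candidate)
    return " ".join(kept)


sample = "The moon, painted with a smiling face hangs over a park"
case_sensitive = filter_tokens(sample)
case_insensitive = filter_tokens(sample, lowercase=True)
print(case_sensitive)    # capitalized "The" survives the filter
print(case_insensitive)  # lowercasing first removes it
```

Whether to lowercase is a design choice: it shrinks the vocabulary, but it also merges proper nouns with common words.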

In [5]:
import multiprocessing as mp
from multiprocessing import Pool

In [6]:
def _df_tokenize(df):
    return df.apply(lambda x: transform2tok(x['Plot']), axis=1)


def df_apply_cpu(df, func, n_cpu):
    pool = Pool(n_cpu)
    
    dfs = [pool.apply_async(func, (d, )) for d in np.array_split(df, n_cpu)]
    dfs = [d.get() for d in dfs]
    
    pool.close()
    pool.join()
    
    data = pd.concat(dfs)
    
    return data

In [7]:
movies_data = df_apply_cpu(df, _df_tokenize, mp.cpu_count())

In [8]:
movies = pd.concat([df.Title, movies_data], axis=1)
movies.columns = ['Title', 'Plot']
movies.head()

| | Title | Plot |
|---|---|---|
| 0 | Kansas Saloon Smashers - 1901 - Unknown | bartender working saloon serving drinks custom... |
| 1 | Love by the Light of the Moon - 1901 - Unknown | The moon painted smiling face hangs park night... |
| 2 | The Martyred Presidents - 1901 - Unknown | The film minute long composed two shots In fir... |
| 3 | Terrible Teddy, the Grizzly King - 1901 - Unknown | Lasting 61 seconds consisting two shots first ... |
| 4 | Jack and the Beanstalk - 1902 - George S. Flem... | The earliest known adaptation classic fairytal... |


In [36]:
# movies.to_csv('movies_tokens.csv', header=True, index=False)
movies = pd.read_csv('movies_tokens.csv')

### Doc2Vec

In [37]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

taged_movies = movies.apply(
    lambda r: TaggedDocument(words=r['Plot'].split(), tags=[r.Title]), axis=1)

In [38]:
%%time
trainsent = taged_movies.values

# Train a distributed-memory (dm=1) Doc2Vec model with 300-dimensional vectors
doc2vec_model = Doc2Vec(trainsent, workers=4, vector_size=300, epochs=20, dm=1)

CPU times: user 13min 28s, sys: 10.5 s, total: 13min 39s
Wall time: 4min 31s


In [39]:
doc2vec_model.save('weights/doc2vec_movie_long')
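
The `plot_to_movie` module used in the example below is not shown in this notebook. A lookup of this kind typically infers a vector for the query plot and ranks the stored document vectors by cosine similarity. The sketch below shows only that ranking step in plain Python; the function names, the movie tags, and the three-dimensional vectors are made up for illustration and do not come from the trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec, doc_vecs, n_matches=2):
    """Return the n_matches (tag, score) pairs closest to query_vec."""
    scored = [(tag, cosine(query_vec, vec)) for tag, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n_matches]


# Hypothetical document vectors keyed by tag (the notebook uses 300 dimensions).
doc_vecs = {
    "Movie A": [1.0, 0.0, 0.0],
    "Movie B": [0.9, 0.1, 0.0],
    "Movie C": [0.0, 1.0, 0.0],
}
query_vec = [1.0, 0.05, 0.0]  # stands in for a vector inferred from a query plot

top = rank_by_similarity(query_vec, doc_vecs)
for tag, score in top:
    print(tag, round(score, 4))
```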

### Example

In [1]:
text = ["A group of reporters are trying to decipher the last word ever spoken by Charles Foster Kane, the millionaire newspaper tycoon: \"Rosebud.\" The film begins with a news reel detailing Kane's life for the masses, and then from there, we are shown flashbacks from Kane's life. As the reporters investigate further, the viewers see a display of a fascinating man's rise to fame, and how he eventually fell off the top of the world.",
        "During World War II, a brave, patriotic American Soldier undergoes experiments to become a new supersoldier, \"Captain America.\" Racing to Germany to sabotage the rockets of Nazi baddie \"Red Skull\", Captain America winds up frozen until the 1990s. He reawakens to find that the Red Skull has changed identities and is now planning to kidnap the President of the United States.",
        "Tony Stark. Genius, billionaire, playboy, philanthropist. Son of legendary inventor and weapons contractor Howard Stark. When Tony Stark is assigned to give a weapons presentation to an Iraqi unit led by Lt. Col. James Rhodes, he's given a ride on enemy lines. That ride ends badly when Stark's Humvee that he's riding in is attacked by enemy combatants. He survives - barely - with a chest full of shrapnel and a car battery attached to his heart. In order to survive he comes up with a way to miniaturize the battery and figures out that the battery can power something else. Thus Iron Man is born. He uses the primitive device to escape from the cave in Iraq. Once back home, he then begins work on perfecting the Iron Man suit. But the man who was put in charge of Stark Industries has plans of his own to take over Tony's technology for other matters.",
        "The lead character, called 'The Bride,' was a member of the Deadly Viper Assassination Squad, led by her lover 'Bill.' Upon realizing she was pregnant with Bill's child, 'The Bride' decided to escape her life as a killer. She fled to Texas, met a young man, who, on the day of their wedding rehearsal was gunned down by an angry and jealous Bill (with the assistance of the Deadly Viper Assassination Squad). Four years later, 'The Bride' wakes from a coma, and discovers her baby is gone. She, then, decides to seek revenge upon the five people who destroyed her life and killed her baby. The saga of Kill Bill Volume I begins.",
        "Detroit - in the future - is crime-ridden and run by a massive company. The company has developed a huge crime-fighting robot, which unfortunately develops a rather dangerous glitch. The company sees a way to get back in favor with the public when policeman Alex Murphy is killed by a street gang. Murphy's body is reconstructed within a steel shell and called RoboCop. RoboCop is very successful against criminals and becomes a target of supervillian Boddicker.",
        ]

In [3]:
import plot_to_movie as ptm

sim_mov = ptm.plot2movie(text[3], n_matches=20, long=True)
sim_mov.head(100)


| | Title | Year | Director | Plot |
|---|---|---|---|---|
| 0 | Kill Bill Volume 1 | 2003 | Quentin Tarantino | A woman in a wedding dress, the Bride, lies wo... |
| 1 | Kill Bill Volume 2 | 2004 | Quentin Tarantino | A woman in a wedding dress, the Bride, lies wo... |
| 2 | Yashoda Krishna | 1976 | Mahija Prakasa Rao | The film picturised the some events in the lif... |
| 3 | Yashoda Krishna | 1975 | C. S. Rao | The film picturised the some events in the lif... |
| 4 | Ninaivugal | 1984 | M. Vellaisamy | Story a women how facing problem in her life i... |
| 5 | Mandya to Mumbai | 2016 | Rajashekar Vardhik Joseph | The movie begins in Mandya where the Hero lead... |
| 6 | Death Valley | 1946 | Lew Landers | A dance hall girl is murdered and her killer f... |
| 7 | Born to Kill | 1996 | Jang Hyun-su | The life of a professional killer becomes comp... |
| 8 | Yashoda Krishna | 1975 | CSR Rao | The film picturised the some events in the lif... |
| 9 | Khauff | 2000 | Sanjay Gupta | Neha witnesses the Mafia slaying of a police o... |


### Experimenting with data
Remove rare words from the plots (e.g. names and other keywords that could give away a movie's title).

In [18]:
from collections import Counter
dict_words = Counter()

In [19]:
for i, row in movies.iterrows():
    dict_words.update(row['Plot'].split())

In [20]:
tokens = {w for w, c in dict_words.items() if c < 3}

In [21]:
import pickle

with open('weights/dict_not_used_words.pkl', 'wb') as fopen:
    pickle.dump(tokens, fopen)
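
None of the cells shown here applies the saved set of rare words to the plots; it is presumably consumed elsewhere. As a minimal sketch of how such a filter could work, the example below counts words over a toy corpus, collects those seen fewer than 3 times (the same threshold as above), and strips them from each plot. The sample plots and the `drop_rare` helper are hypothetical.

```python
from collections import Counter

# Toy corpus standing in for the tokenized 'Plot' column (made-up sentences).
plots = [
    "detective finds killer in london",
    "detective chases killer across london",
    "rosebud detective killer london",
]

# Count every word across all plots.
counts = Counter()
for plot in plots:
    counts.update(plot.split())

# Words seen fewer than 3 times are treated as rare.
rare_words = {w for w, c in counts.items() if c < 3}


def drop_rare(text, rare):
    """Remove every word that belongs to the rare set, keeping word order."""
    return " ".join(w for w in text.split() if w not in rare)


cleaned = [drop_rare(p, rare_words) for p in plots]
print(sorted(rare_words))
print(cleaned)
```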

In [22]:
def tok_min(text):
    # Keep the first occurrence of each word, preserving the original order.
    return list(dict.fromkeys(text.split()))

In [23]:
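def tokenizer(text):
    """Return the unique tokens of a plot, in order, minus stop words."""
    stop_words = set(stopwords.words('english'))  # build once per call, not per word
    tokens = []
    seen = set()
    for sent in nltk.sent_tokenize(text, language='english'):
        for word in nltk.word_tokenize(sent, language='english'):
            if len(word) <= 2:
                continue
            if word in stop_words or word in seen:
                continue
            seen.add(word)
            tokens.append(word)
    return tokens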

In [25]:
movies['Plot'].apply(lambda x: len(x)).mean()

1522.7868199277648

In [26]:
# Keep only movies whose tokenized plot is shorter than 500 characters:
movies = movies[movies['Plot'].apply(lambda x: len(x)) < 500]

### Doc2Vec

In [27]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

plot2movie = movies.apply(
    lambda r: TaggedDocument(words=tok_min(r['Plot']), tags=[r.Title]), axis=1)

In [28]:
%%time
trainsent = plot2movie.values

# Train a distributed-memory (dm=1) Doc2Vec model with 300-dimensional vectors
doc2vec_model = Doc2Vec(trainsent, workers=4, vector_size=300, epochs=20, dm=1)

CPU times: user 30.9 s, sys: 2.04 s, total: 32.9 s
Wall time: 16.1 s


In [33]:
doc2vec_model.save('weights/doc2vec_movie_short')

In [34]:
text = ["A group of reporters are trying to decipher the last word ever spoken by Charles Foster Kane, the millionaire newspaper tycoon: \"Rosebud.\" The film begins with a news reel detailing Kane's life for the masses, and then from there, we are shown flashbacks from Kane's life. As the reporters investigate further, the viewers see a display of a fascinating man's rise to fame, and how he eventually fell off the top of the world.",
        "During World War II, a brave, patriotic American Soldier undergoes experiments to become a new supersoldier, \"Captain America.\" Racing to Germany to sabotage the rockets of Nazi baddie \"Red Skull\", Captain America winds up frozen until the 1990s. He reawakens to find that the Red Skull has changed identities and is now planning to kidnap the President of the United States.",
        "Tony Stark. Genius, billionaire, playboy, philanthropist. Son of legendary inventor and weapons contractor Howard Stark. When Tony Stark is assigned to give a weapons presentation to an Iraqi unit led by Lt. Col. James Rhodes, he's given a ride on enemy lines. That ride ends badly when Stark's Humvee that he's riding in is attacked by enemy combatants. He survives - barely - with a chest full of shrapnel and a car battery attached to his heart. In order to survive he comes up with a way to miniaturize the battery and figures out that the battery can power something else. Thus Iron Man is born. He uses the primitive device to escape from the cave in Iraq. Once back home, he then begins work on perfecting the Iron Man suit. But the man who was put in charge of Stark Industries has plans of his own to take over Tony's technology for other matters.",
        "The lead character, called 'The Bride,' was a member of the Deadly Viper Assassination Squad, led by her lover 'Bill.' Upon realizing she was pregnant with Bill's child, 'The Bride' decided to escape her life as a killer. She fled to Texas, met a young man, who, on the day of their wedding rehearsal was gunned down by an angry and jealous Bill (with the assistance of the Deadly Viper Assassination Squad). Four years later, 'The Bride' wakes from a coma, and discovers her baby is gone. She, then, decides to seek revenge upon the five people who destroyed her life and killed her baby. The saga of Kill Bill Volume I begins.",
        "Detroit - in the future - is crime-ridden and run by a massive company. The company has developed a huge crime-fighting robot, which unfortunately develops a rather dangerous glitch. The company sees a way to get back in favor with the public when policeman Alex Murphy is killed by a street gang. Murphy's body is reconstructed within a steel shell and called RoboCop. RoboCop is very successful against criminals and becomes a target of supervillian Boddicker.",
        'An English woman asks for help from a visiting American detective to London to help her find out who has killed her brother.'
       ]

In [3]:
import plot_to_movie as ptm

sim_mov = ptm.plot2movie(text[3], n_matches=20, long=False)
sim_mov.head(100)

| | Title | Year | Director | Plot |
|---|---|---|---|---|
| 0 | Walking the Edge | 1983 | Norbert Meisel | In the opening scene, a criminal gang led by B... |
| 1 | The Man in Black | 1949 | Francis Searle | After the death of her yogi father during a fr... |
| 2 | Carbine Williams | 1952 | Richard Thorpe | The film follows the life of David Marshall Wi... |
| 3 | Sweetwater | 2013 | Logan Miller | In the late 1800s, a beautiful ex-prostitute (... |
| 4 | Charlie Chan in Reno | 1939 | Norman Foster | Mary Whitman has arrived in Reno to obtain a d... |
| 5 | Christmas Holiday | 1944 | Robert Siodmak | On Christmas Eve in New Orleans, U.S. Army off... |
| 6 | Adil-E-Jahangir | 1955 | G. P. Sippy | Emperor Jehangir is known all over India for h... |
| 7 | Honours Easy | 1935 | Herbert Brenon | Unhinged art dealer William Barton (Ivan Samso... |
| 8 | An Unsuitable Job for a Woman | 1982 | Chris Petit | After finding her former boss, a private detec... |
| 9 | Thunder Island | 1963 | Jack Leewood | Contract killer Billy Poole is hired to assass... |
