# Movie Summaries Comparison

**Team AKVC** : D.j. Johnson, Phineas Pham, Hannah Zhang, Yoorae Kim

## Introduction and Ethical concern



1. Our data for this project is based on the CMU Movie Summary Corpus. The original dataset was created alongside the 2013 "Learning Latent Personas of Film Characters." _Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics_ (Volume 1: Long Papers): 352–361, by David Bamman, Brendan O’Connor, and Noah A. Smith. That data includes 42,306 movie plot summaries and aligned metadata such as box office revenue, genre, release date, runtime, and language. Our dataset we will work with is a smaller sample of this dataset for us to work with. hWe want to be able to create a set of functions that can calculate the Word Mover's Distance between our two documents. We also want it to print a summary of all the moves that were made in order to derive the distance. We will use our functions to answer several distance comparison questions such as finding movies that have the most in comparison with one another. We want to be able to print data frames of specific films and their WMD scores, and have them ordered by their WMD scores in comparison to another specific movie.


2. The stakeholders include Dr. Lavin, the CMU Movie Summary Corpus, the authors of "Learning Latent Personas of Film Characters", and anyone involved in the creation of these movies or the audience. Our ethical concern for our project was that we did not want to create something that would stray people away from some of these movies. For example, if someone had seen Iron Man and did not like it, we do not want them to not go see the movies that we have as being similar to Iron Man. That would lead to less movie sales for certain producers and we do not want the credibility behind it. But, we hope our viewers are open minded and come into viewing our results with the mind set that no movie is exactly alike and does not include any information about the movies' quality.

## Data Exploration

Our repository includes a metadata file called "movieSummariesSampleMeta.csv" and "plotSummariesSample.txt", which has the structure "wiki_id\tsummary\n", meaning that each line of the file has a wiki_id that matches a wiki_id in "movieSummariesSampleMeta.csv", then a tab character, then a summary of a few hundred words, then a newline character.  We will dive into the summaries of the movies and look to perform specific comparisons between the summaries and a specific movie's summary. We will have to look mainly at the movie titles, the movie summaries, and the movie's release date. These categories will allow us to look at the word comparisons in the data more easily without having to take in all the variables. The first problem is that there is a lot of data in the dataset which makes it difficult to work efficiently. We initially struggled to speed up the process which led to over 30 minute runtimes. After getting past that difficulty, we struggled with checking our correctness due to our lack of seeing all of the movies. It was difficult to identify the correct movie index with the correct summaries. Also, rather than building a huge matrix with everything together, we decided we would be better off using several individual loops to compare the specific documents we picked out. 

In [1]:
#import libraries
import pandas as pd
from pyemd import emd
import numpy as np
from sympy import Q

In [2]:
# Load the movie post 1990
Moviedf = pd.read_csv("movieSummariesSample.csv",index_col=0)
Moviedf

Unnamed: 0,wiki_id,freebase_id,movie_name,release_date,box_office_revenue,runtime,has_summary,thriller,romance_film
0,975900,/m/03vyhn,Ghosts of Mars,2001-08-24,14010832.0,98.0,1,1,0
1,27556929,/m/04j0jtp,Deadly Voyage,1996-01-01,,90.0,1,1,0
2,3550323,/m/09kzfd,Things to Do in Denver When You're Dead,1995-10-01,529677.0,115.0,1,1,0
3,27463222,/m/0c037x9,Vanishing on 7th Street,2010-09-12,,92.0,1,1,0
4,1940449,/m/067p6m,RoboCop 3,1993-11-05,10600000.0,105.0,1,1,0
...,...,...,...,...,...,...,...,...,...
4684,2450370,/m/07dzb3,Must Love Dogs,2005-07-21,58405313.0,98.0,1,0,1
4685,35321421,/m/0c054l7,Magic Flute Diaries,2008-02-14,,104.0,1,0,1
4686,22427855,/m/05zkcsk,Adam,2009-01-20,2549605.0,99.0,1,0,1
4687,11823946,/m/02rtqvb,Twelfth Night: Or What You Will,1996-10-25,588621.0,134.0,1,0,1


In [3]:
# Read txt file into list. (post 1990)
file = open('plotSummariesSample.txt', 'r', encoding = 'utf-8')
txtlist = file.readlines()
txtlist = [line.rstrip() for line in txtlist]
len(txtlist)


4689

In [4]:
# Load dataframe before 1980
df1980 = pd.read_csv("movieSummariesSample1980s.csv")
len(df1980)
df1980 
# fix with index later

Unnamed: 0.1,Unnamed: 0,wiki_id,freebase_id,movie_name,release_date,box_office_revenue,runtime,has_summary,thriller,romance_film
0,0,9363483,/m/0285_cd,White Of The Eye,1987-01-01,,110.0,1,1,0
1,1,4951456,/m/0cws46,Kinjite: Forbidden Subjects,1989-01-01,3416846.0,97.0,1,1,0
2,2,2154704,/m/06qv1c,Choke Canyon,1986-05-02,,94.0,1,1,0
3,3,13919299,/m/03cn6hr,The Devil’s Gift,1984-01-01,,90.0,1,1,0
4,4,7235116,/m/0kv17w,Someone to Watch Over Me,1987-10-09,10278549.0,106.0,1,1,0
...,...,...,...,...,...,...,...,...,...,...
789,789,31153,/m/07nn0,The Princess Bride,1987-09-18,30857814.0,98.0,1,0,1
790,790,21859735,/m/05pc7v8,High Season,1988-03-25,,94.0,1,0,1
791,791,18118679,/m/04y663q,Torch Song Trilogy,1988-12-14,4865997.0,120.0,1,0,1
792,792,54540,/m/0f7hw,Coming to America,1988-06-29,288752301.0,117.0,1,0,1


In [5]:
# All summaries of 1980 movie
f = open("movieSummariesSample1980s.txt","r", encoding = 'utf-8')
txt1980 = f.readlines()
txt1980 = [line.rstrip() for line in txt1980]
len(txt1980)

794

In [6]:
%%time
# Download the pre-trained model, model with ID 5 had been picked. 
# from link: http://vectors.nlpl.eu/repository/
import gensim
# Load the folder and all the files. 
model_location = "5/model.txt"
word_vectors = gensim.models.KeyedVectors.load_word2vec_format(model_location, binary=False)

CPU times: user 31.9 s, sys: 426 ms, total: 32.3 s
Wall time: 32.8 s


In [7]:
# Load necessary packages
from nltk import word_tokenize
# Import and download stopwords from NLTK.
from nltk.corpus import stopwords
from nltk import download
download('stopwords')  # Download stopwords list.
download('punkt')
# store stopwords.
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/zhangziyue/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/zhangziyue/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Code & Results

### Objectives:

- A data frame of fifty 1980s films and WMD scores, ordered by their WMD score when compared to the movie _Iron Man_
- A data frame of fifty 1980s films and WMD scores, ordered by their WMD score when compared to the movie _Sleepless in Seattle_
- A data frame of fifty 1980s films and WMD scores, ordered by their WMD score when compared to the movie _Avatar_
- A data frame of fifty post-1990 films and WMD scores, ordered by their WMD score when compared to the movie _The Karate Kid_
- A data frame summarizing the Word Mover's flow pattern for _Iron Man_, _Sleepless in Seattle_, _Avatar_, or _The Karate Kid_ and that chosen movie's top matching film. 

In [8]:
words = list(word_vectors.index_to_key)
word_store = {i: True for i in words}

In [9]:
Moviedf.loc[Moviedf['movie_name'] == "Iron Man"]

Unnamed: 0,wiki_id,freebase_id,movie_name,release_date,box_office_revenue,runtime,has_summary,thriller,romance_film
1584,5676692,/m/0dzlbx,Iron Man,2008-04-14,585174222.0,126.0,1,1,0


In [10]:
Moviedf.loc[Moviedf['movie_name'] == "Sleepless in Seattle"]

Unnamed: 0,wiki_id,freebase_id,movie_name,release_date,box_office_revenue,runtime,has_summary,thriller,romance_film
2818,226198,/m/01gzln,Sleepless in Seattle,1993-06-25,227799884.0,105.0,1,0,1


In [11]:
Moviedf.loc[Moviedf['movie_name'] == "Avatar"]
# We want the one is 2009, index = 789

Unnamed: 0,wiki_id,freebase_id,movie_name,release_date,box_office_revenue,runtime,has_summary,thriller,romance_film
789,4273140,/m/0bth54,Avatar,2009-12-10,2782275000.0,178.0,1,1,0
1760,15945267,/m/03qhwlm,Avatar,2004-01-01,,90.0,1,1,0


In [12]:
df1980.loc[df1980["movie_name"] == "The Karate Kid"]

Unnamed: 0.1,Unnamed: 0,wiki_id,freebase_id,movie_name,release_date,box_office_revenue,runtime,has_summary,thriller,romance_film
704,704,91133,/m/0mmbb,The Karate Kid,1984-06-22,90815558.0,127.0,1,0,1


In [13]:
# The movie summary about The Karate Kid
TKKtext = txt1980[704]
# The movie summary about Iron Man
ironman = txtlist[1584]

In [14]:
def preprocess(doc):
    doc = doc.lower()  # Lower the text.
    doc = word_tokenize(doc)  # Split into words.
    doc = [w for w in doc if not w in stop_words]  # Remove stopwords.
    doc = [w for w in doc if w.isalpha()]  # Remove numbers and punctuation.
    return doc

In [15]:
#creat list for iron man words
ironmanlist = preprocess(ironman)

#filter words in Iron Man but not in word_vector model
in_modelIM = []
for i in ironmanlist:
    try: 
        w = word_store[i]
        in_modelIM.append(i)
    except:
        pass

#words in ironmanlist and after check with word_vector model
#print(len(ironmanlist), len(in_modelIM))    #440 316

In [16]:
#creat list for Sleepless in Seattle words
sleepseattle = txtlist[2818]
sleepseattlelist = preprocess(sleepseattle)
in_modelSS = []
for i in sleepseattlelist:
    try:
        w = word_store[i]
        in_modelSS.append(i)
    except:
        pass

#print(len(sleepseattlelist), len(in_modelSS))  #307 216

In [17]:
#creat list for Avatar words
avatar = txtlist[789]
avatarlist = preprocess(avatar)

in_modelA = []
for i in avatarlist:
    try:
        w = word_store[i]
        in_modelA.append(i)
    except:
        pass

#print(len(avatarlist), len(in_modelA)) #258 167

In [18]:
#matrix() will get a list of words, and output a matrix of euclidean distances between words
def matrix(wordlist):
    newmatrix = np.zeros([len(wordlist), len(wordlist)])
    for e, i in enumerate(wordlist):
        for f, j in enumerate(wordlist):
            euclid_distance = np.sqrt(np.sum((word_vectors[i] - word_vectors[j])**2))
            newmatrix[e, f] = euclid_distance
            newmatrix[f, e] = euclid_distance
    return newmatrix

In [19]:
# ALL summaries of top 50 movies in the 80s
top50_df1980 = df1980.sort_values('box_office_revenue', ascending = False).head(50)
top50_df1980.rename(columns = {'Unnamed: 0': 'index'}, inplace = True)

allsummary1980 = []
for index in top50_df1980["index"]:
    allsummary1980.append(txt1980[index])

assert len(allsummary1980) == 50


In [20]:
# ALL summaries of top 50 movies post 90s
top50_post1990 = Moviedf.sort_values('box_office_revenue', ascending = False).head(50)
top50_post1990.rename(columns = {'Unnamed: 0': 'index'}, inplace = True)

summary1990 = []
for index in top50_post1990.index:
    summary1990.append(txtlist[index])

assert len(summary1990) == 50


In [21]:
#create list of distances of a movie vs top 50 of 80s movies 
def movie_top50_matrices(movie_preprocessed_list):
    emdscore =[]
    for txt in allsummary1980:
        wholelist = []
        T50txttoken = preprocess(txt)
        wholelist.extend(T50txttoken)
        wholelist.extend(movie_preprocessed_list)
        
        #movie_preprocessed_list is all the words in order in unique words
        uniquewords = set(wholelist)
        in_model = []
        for uniqueword in uniquewords:
            try: 
                w = word_store[uniqueword]
                in_model.append(uniqueword)
            except:
                pass

        # T50txttoken contain repeated words, unique words are the rows and column length. 
        array1 = np.zeros(len(in_model))
        array2IM = np.zeros(len(in_model))

        weight_doc1 = 0
        weight_doc2 = 0
        for i in in_model:
            if i in T50txttoken:
                weight_doc1 += 1 
        for i in in_model:
            if i in movie_preprocessed_list:
                weight_doc2 += 1

        for e, word in enumerate(in_model):
            if word in T50txttoken:
                array1[e] = 1/weight_doc1
            else: 
                array1[e] = 0
            if word in movie_preprocessed_list:
                array2IM[e] = 1/weight_doc2
            else: 
                array2IM[e] = 0
            
        finmatrix = matrix(in_model)
        emdscore.append(emd(array1, array2IM, finmatrix))
    return emdscore

#### Code Construction Explanation

1.In order to find the most similar movie to a particular movie. We need to do a series of steps.

2.We want to get the partial dataframe for that 50 movies, which have been sorted by the box office revenue in descending order. 

3.Two dataframe had been built.

4.Then we use a for loop through the index of all the movie summary and get the list of the movie summary for this top 50, called summary. 

5.Identify the movie summary for Iron man and tokenize it into words, save a copy of all words called TokenIronman. 
Start to build our large for loop, we want to loop through a summary, which is 50 loops. 

1. We tokenize all selected summary, saved a copy called Tokensum

2. Extend Tokensum with the TokenIronman. 

3. Delete the replications in the extended list called uniquelist

4. Build two one-dimension numpy array with the length n = len(uniquelist)

 1. For the first array, we want to loop through all the words and find its frequency and divide by the length of the document to calculate the relative frequency in the looped summary.
 2. For the second array, we do the same thing but about Iron man or any other selected movie from another dataframe. 
5. Build the matrix to calculate the euclidean distance for uniquelist
 1. it should be a two dimensional np.array with the shape len(uniquelist) * len(uniquelist)
 2. Then we want to find the embedding array using our pre-trained model for each word pair. 
 3. Then use the euclidean formula to find the distance and build into the matrix. 
5. Use the emd function to find the emd distance, using two arrays mentioned above and the matrix mentioned above. And append the results to a emdlist.
6. Build the data frame using the emdlist and the movie name and sorted emd values by descending order. The top one movie is the most similar to the selected movie, in this case, Iron man. 


**1. _Iron Man_ vs fifty 1980s films**

In [22]:
%%time
#df of Iron Man to top 50 movies
emd_IM_50 = movie_top50_matrices(in_modelIM)

resultIM = pd.DataFrame()
resultIM["Movie Name"] = top50_df1980["movie_name"]
resultIM["Similarity_IronMan"] = emd_IM_50
resultIM.sort_values(by = "Similarity_IronMan", ascending = True)

CPU times: user 51.7 s, sys: 128 ms, total: 51.8 s
Wall time: 51.8 s


Unnamed: 0,Movie Name,Similarity_IronMan
71,Lethal Weapon,0.925994
592,"The Karate Kid, Part II",0.953002
301,Aliens,0.954154
145,Black Rain,0.95785
133,Rambo III,0.960384
503,The Jewel of the Nile,0.962419
349,A View to a Kill,0.963127
343,Lethal Weapon 2,0.963419
619,Look Who's Talking,0.965863
23,For Your Eyes Only,0.967499


**2. _Sleepless in Seattle_ vs fifty 1980s films**

In [23]:
%%time
# Similarity Comparison Sleepless in Settle vs top 50
emd_SS_50 = movie_top50_matrices(in_modelSS)

resultSS = pd.DataFrame()
resultSS["Movie Name"] = top50_df1980["movie_name"]
resultSS["Similarity_SleepSettle"] = emd_SS_50
resultSS.sort_values(by = "Similarity_SleepSettle", ascending = True, inplace=True)
resultSS
# When Harry Met Sally... is the most similar movie to Sleepless in Settle

CPU times: user 35.1 s, sys: 66.6 ms, total: 35.2 s
Wall time: 35.2 s


Unnamed: 0,Movie Name,Similarity_SleepSettle
612,When Harry Met Sally...,0.895924
619,Look Who's Talking,0.940013
668,Cocktail,0.946107
454,Jaws 3-D,0.947803
751,Terms of Endearment,0.953142
713,Big,0.955397
753,Moonstruck,0.956471
40,Sea of Love,0.957332
614,An American Tail,0.95834
664,Romancing the Stone,0.959905


**3. _Avatar_ vs fifty 1980s films**

In [24]:
%%time
#df of Avatar to top 50 movies
emd_A_50 = movie_top50_matrices(in_modelA)

resultA = pd.DataFrame()
resultA["Movie Name"] = top50_df1980["movie_name"]
resultA["Similarity_Avatar"] = emd_A_50
resultA.sort_values(by = "Similarity_Avatar", ascending = True)

# Aliens are the most similar one to Avatar

CPU times: user 35 s, sys: 118 ms, total: 35.1 s
Wall time: 35.1 s


Unnamed: 0,Movie Name,Similarity_Avatar
301,Aliens,1.007274
614,An American Tail,1.024489
521,The Little Mermaid,1.026337
352,Das Boot,1.026348
133,Rambo III,1.028703
454,Jaws 3-D,1.033724
229,The Abyss,1.036063
298,Cobra,1.04428
290,Star Trek II: The Wrath of Khan,1.045057
592,"The Karate Kid, Part II",1.050487


**4. _The Karate Kid_ vs fifty post-1990 films**

In [25]:
#create distance matrices of a movie vs top 50 of post 90s movies 

def movie1990_top50_matrices(movie_preprocessed_list):
    emdscore =[]
    for txt in summary1990:
        wholelist = []
        T50txttoken = preprocess(txt)
        wholelist.extend(T50txttoken)
        wholelist.extend(movie_preprocessed_list)
        
        #movie_preprocessed_list is all the words in order in unique words
        uniquewords = set(wholelist)
        in_model = []
        for uniqueword in uniquewords:
            try: 
                w = word_store[uniqueword]
                in_model.append(uniqueword)
            except:
                pass

        # T50txttoken contain repeated words, unique words are the rows and column length. 
        array1 = np.zeros(len(in_model))
        array2IM = np.zeros(len(in_model))

        weight_doc1 = 0
        weight_doc2 = 0
        for i in in_model:
            if i in T50txttoken:
                weight_doc1 += 1 
        for i in in_model:
            if i in movie_preprocessed_list:
                weight_doc2 += 1

        for e, word in enumerate(in_model):
            if word in T50txttoken:
                array1[e] = 1/weight_doc1
            else: 
                array1[e] = 0
            if word in movie_preprocessed_list:
                array2IM[e] = 1/weight_doc2
            else: 
                array2IM[e] = 0
        finmatrix = matrix(in_model)
        emdscore.append(emd(array1, array2IM, finmatrix))
    return emdscore

In [26]:
%%time
TKKemdscore = movie1990_top50_matrices(TKKtext)

CPU times: user 17.3 s, sys: 90 ms, total: 17.4 s
Wall time: 17.3 s


In [27]:
# Comparison the top 50 movie from 1980 to The Karate Kid
resultTKK = pd.DataFrame()
resultTKK["Movie Name"] = top50_post1990["movie_name"]
resultTKK["Similarity_TKK"] = TKKemdscore
resultTKK.sort_values(by = "Similarity_TKK", ascending = True)
# The most similar one is Troy

Unnamed: 0,Movie Name,Similarity_TKK
2897,Troy,0.842615
1030,The Sixth Sense,0.850523
2170,Fast Five,0.890743
4565,Tangled,0.894844
607,X-Men: The Last Stand,0.903053
2031,The Lost World: Jurassic Park,0.91159
3676,Aladdin,0.919221
1669,Transformers,0.919339
4321,Forrest Gump,0.919429
1597,Terminator 2: Judgment Day,0.922102


**5. _Word Mover's flow pattern_ for _Sleepless in Seattle_ and top 1 matching film**

In [28]:
from pyemd import emd_with_flow
# We compare the Sleepless in Seattle with the top 1 movie summary in the 80s

In [29]:
#Conduct the Word Mover's flow
wholelist = []
T50txttoken = preprocess(txt1980[612])  #When Harry Met Sally index 
wholelist.extend(T50txttoken)
wholelist.extend(in_modelSS)

#in_model_flow has all unique words in Sleepless in Seattle summary
uniquewords = set(wholelist)
in_model_flow = []
for uniqueword in uniquewords:
    try: 
        w = word_store[uniqueword]
        in_model_flow.append(uniqueword)
    except:
        pass

# create vectors for T50txttoken and in_modelSS
array1 = np.zeros(len(in_model_flow))
array2IM = np.zeros(len(in_model_flow))
for e, word in enumerate(in_model_flow):
    array1[e] = T50txttoken.count(word)/len(T50txttoken)
    array2IM[e] = in_modelSS.count(word)/len(in_modelSS)

#create matrix of distances between words in in_model_flow
matrixIM_flow = matrix(in_model_flow)

sim, flow = emd_with_flow(array1, array2IM, matrixIM_flow)

def explain_movers_flow(flow_matrix, vocab, doc1, doc2):
    moves = []
    for e in range(len(vocab)):
        src = vocab[e]
        for f in range(len(vocab)):
            target = in_model_flow[f]
            move = flow_matrix[e][f]
            if move > 0.0:
                src_total = doc1[e]
                moves.append([move, src, target, src_total])
    return moves

result = explain_movers_flow(flow, in_model_flow, array1, array2IM)

df_result = pd.DataFrame.from_records(result, columns=['amount_moved', 'moved_from', 'moved_to', 'total_to_move'])
df_result.sort_values(by=['moved_from', 'amount_moved'], ascending=[True, False], inplace=True)

In [30]:
df_result.head(20)

Unnamed: 0,amount_moved,moved_from,moved_to,total_to_move
60,0.003268,airport,airport,0.003268
207,0.003268,albright,hand,0.003268
230,0.004357,alone,let,0.006536
231,0.002179,alone,together,0.006536
73,0.002995,always,remember,0.003268
72,0.000273,always,let,0.003268
18,0.001362,angered,convince,0.003268
19,0.001362,angered,distraught,0.003268
20,0.000544,angered,stunned,0.003268
96,0.001906,apartment,skyscraper,0.003268


In [31]:
df_result.tail(20)

Unnamed: 0,amount_moved,moved_from,moved_to,total_to_move
225,0.003268,view,observation,0.003268
177,0.003268,walk,backpack,0.003268
171,0.007353,way,look,0.009804
172,0.00218,way,help,0.009804
173,0.000271,way,go,0.009804
31,0.005991,wedding,wife,0.006536
30,0.000274,wedding,engagement,0.006536
29,0.000271,wedding,valentine,0.006536
67,0.00463,without,without,0.006536
66,0.001906,without,allow,0.006536


## Interpretation


#### Analyze the results for _Iron Man_, _Sleepless in Seattle_, _Avatar_, and _The Karate Kid_ 

###### A data frame of fifty 1980s films and WMD scores, with movie Iron Man

Given the fact that the smaller score indicates that documents are similar to each other, Lethal Weapon (1987) is the most similar movie to Iron Man (2008) among top 50 movies in the 1980s with the similarity score of 0.925994. 

Movie Iron Man is about a guy who builds his own weapon/suit to defeat his enemies, saves people as a hero, and grows his technology industry. In Lethal Weapon, two detectives fight against the Shadow Company and try to get revenge. 

Our assumption is that the reason why the model picked Lethal Weapon as the most similar movie is that both of the movies contain the content of weapons, and talk about secret agents/companies. Both movies fall into the thriller category.

###### A data frame of fifty 1980s films and WMD scores, with Sleepless in Seattle

Given the fact that the smaller score indicates that documents are similar to each other, When Harry Met Sally… (1989) is the most similar movie to Sleepless in Seattle (1993) among top 50 movies in the post 1990s with the similarity score of 0.895924. 

Movie Sleepless in Seattle (1993) is about a man and woman that has no common relationship with each other, meets through several unexpected events and finally falls in love with each other. In When Harry Met Sally… (1989), a man and woman who stays friends for 12 years finally falls in love and gets married.

Our assumption is that the reason why these two movies turned out to be most similar to each other is because both movies are about man and woman, and the relationship between each other plays a big role in these movies. Also, the sub-story of both movies takes place in New York, so this is the part we think that it is common in between movies. Both movies fall into the romance category.

###### A data frame of fifty 1980s films and WMD scores, with movie Avatar

Given the fact that the smaller score indicates that documents are similar to each other, Aliens (1986)  is the most similar movie to Avatar (2009) among all movies in the 1980s with the similarity score of 1.139954. 

Looking at the summary of Avatar (2009), it has the storyline of a mining colony threatening the local tribe (who are not humans) using genetically engineered technology. The movie Aliens (1986) is about fighting and colonization of space creatures.

Our assumption on this is because both of these movies are about colonization and war between two groups of creatures, and also both include the non-human creatures in the movie plot in general. The common fact about these two movies is that both were directed by James Cameron. Both Movies fall into the thriller category.

###### A data frame of fifty post-1990 films and WMD scores, with The Karate Kid

Given the fact that the smaller score indicates that documents are similar to each other, Troy (2004) is the most similar movie to The Karate Kid (1984) among all movies in the post 1990 with the similarity score of 0.842615. In the data frame containing movie information, both Troy (2004) and The Karate Kid (1984) is marked as a romance movie. 

Looking at the Karate Kid summary, it is the movie about a Karate coach helping the kid to become stronger so that he won’t be bullied by the people around him. Troy is a historical movie about Greece and it contains the storyline of revenge and mentorship as well, which has the similar plot with Karate Kid. We are assuming that this could be the reason how the model decided to pick Troy as the most similar movie summary.

#### Analyze the results of the Word Mover's flow pattern

The Word Mover's flow shows how the words move from one document to words of another document. This gives us a choice to see if the decisions made by our model words are right or not. Looking at the above result, I noticed that figuring out the computer's reasoning for the comparisons was sometimes tricky. For example, assistant and editor do not have the same meaning, but because they are both job roles, they are close to each other in embedding space. Whether it was right away or after a few glances, we are able to figure out the correlations for most of them. But, we can see that there are some that may not be very accurate such as "awkward" and "annie". In the end, we are happy to be able to see the comparisons for most of them clearly. 

#### Kinds of results that Word Mover's Distance (WMD) produce

WMD calculates the distances that words have to move from one document to another document to get to the word with close meaning. Thus, it will give us information about the similarity between two documents, as the more similar the two documents are, the less moving words have to do to reach the similar words in other documents. There are three parameters that need to be put into the model: an array of words weighted for the first document and another array for the second document, and a matrix that contains the euclidean distances for each pair of words in the vocabulary lists. What makes WMD good is that it not only uses embeddings to calculate the similarity between words, it also put information on word travel distances.

#### Choice of word embeddings model

We choose the fifth model (ID = 5) form the website with following characteristics:
- It should use English
- Vector Size = 300
- Window Size = 5
- From English Wikipedia Dump of February 2017
- Corpus size 273992
- Type is Gensim Continuous Skipgram
- Lemmatization

Vector size is the array that we assigned to each word in the embedding process. 300 is a reasonable number. If we pick a number which is too large, we need a long training time. However, if the array is too short, the difference between words can not be captured accurately. 

Window size is the number of words near that word that we want to take into account. A too small window size may lead to only capture the word meaning, but not the meaning of the whole sentence. However, a too large window size will blur the meaning of the sentence as some sentence may be short. Thus, 5 is a reasonable number for window size. 

We want to use a word list that we are familar with and sounds normal：like from wikipedia. And the size of the corpus is expected to be reasonably large, but not too large that hard for us to analyze and process. 

We need to use gensim/global vectors, etc those kinds that are compatible with gensim and MWD models. 

Lemmatization is about the different format of the word, for example the diffferent tenses: drive and drove. So we choose Lemmatization equals to True is time that we regradless of tense and only keep the core meanings of the words. 

## Conclusion


1. We used the EMD model to compare the specific documents with loops by their words to first get our array. We would take one document -such as Iron Man- and loop through the other documents we wanted to compare it to. We then built a dataframe with the top 50 words closest to our summary, Iron Man. We noticed that the smaller the numbers were with the pairs, the closer they were in comparison.

2. We used the emd_with_flow to compare Sleepless In Seattle with the 1980s top 50 movies. We then calculated the distance using the word2word model and built a dataframe with the top 50 connected words.

3. Lastly we wanted to describe the advantages of using the WMD model. First, we could compare the documents' meanings as opposed to solely just the words. This was beneficial because we did not have to rely on whether the summaries had a lot of the same words. This also made our results more beneficial for our readers because we want them to see the movies in comparison based on the meaning behind the words and story, not just based on the same words used. The WMD metric leads to such low error rates that we wanted to be accurate in our analysis. A possible limitation would be that we could have struggled if we needed to use word ordering frequently. Instead of using a big matrix to maybe speed up the initial process, we think that using several for loops may lead to a better understanding behind why we get the results we do. 