<h1><center>Appendix - Text Mining - Replicating Amazon X-Ray using Machine Learning</center></h1>
<img src="harvardlogo.jpg" width="100" height="100">
<p><b><center>Harvard University</center></b></p>
<p><b><center>CSCI E-81 Machine Learning & Data Mining</center></b></p>
<p><b><center>Fall 2016</center></b></p>
<p><b><center>Team: Nirmal Labh, Anmol Joshi</center></b></p>
<p><b><center>Due Date: Monday, December 17th, 2016 at 11:59pm</center></b></p>
<p><b><center>Submission Date: Saturday, December 17th, 2016</center></b></p>

<img src="trivia.jpg">

**We experimented with some future work before submission, using one of my favorite movies, Swingers, starring Jon Favreau and Vince Vaughn.**

**IMDb has very strict policies on web scraping, and the existing IMDb APIs do not support downloading trivia, references, or goofs. We found one API that does, but it is currently not working and is due to be updated at its next blockpoint. So for this project I went to IMDb and copied the text into a plain text file; we will work out a better way to extract that information in the future.**

In [1]:
%%capture
import pysrt
import pandas as pd
from bs4 import BeautifulSoup
import urllib
import gensim
import nltk.data
import re
from nltk.corpus import stopwords
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import matplotlib.font_manager
from matplotlib.pyplot import cm

**IMDb lists both references and trivia. For a reference, IMDb sometimes lists only the movie name, which may refer to how a scene is directed or to the music in that scene. So for this project, we only look for places where a character talks about a certain movie or quotes it, that is, references that show up purely in the text of the movie.**

**In the future, we would like to automate this model so that it extracts information from the pages of different movies and classifies the text accurately.**

**The issue with this project is that there is no training and testing set. That is why I used a movie I know well, so that I am able to confirm all the references.**

**We did not attempt trivia, given the time constraint. We imagine it would work by parsing the movie script to get scene descriptions, which would help find which trivia item applies to which point in the movie.**

**We load the downloaded subtitles and display them with the start and end time (in seconds) at which each subtitle is on screen.**

In [2]:
subs = pysrt.open('Swingers (1996).DVDRip.DivX.English.srt', encoding='iso-8859-1')

n = len(subs)
subs_start_time = []
subs_end_time = []
subs_txt = []

for sub in subs:
    # Convert the start and end timestamps to whole seconds from the start of the movie
    subs_start_time.append(3600*sub.start.hours + 60*sub.start.minutes + sub.start.seconds)
    subs_end_time.append(3600*sub.end.hours + 60*sub.end.minutes + sub.end.seconds)
    # Join multi-line subtitles into a single line
    subs_txt.append(sub.text.replace("\n", " "))
    
d = {'start_time': subs_start_time, 'end_time': subs_end_time, 'subtitles': subs_txt}
df = pd.DataFrame(data=d)
df.head()

| | end_time | start_time | subtitles |
|---|---|---|---|
| 0 | 14 | 11 | No way. |
| 1 | 41 | 33 | You're nobody 'til somebody loves you |
| 2 | 46 | 41 | You're nobody 'til somebody cares |
| 3 | 54 | 48 | You may be king You may possess the world |
| 4 | 56 | 54 | And its gold |
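
The start and end times are whole seconds from the start of the movie. As a small convenience, here is a sketch of a helper that turns such a value back into a clock-style timestamp. The name `format_timestamp` is our own and is not part of the notebook or of pysrt.

```python
def format_timestamp(total_seconds):
    """Format a whole number of seconds as HH:MM:SS."""
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return "{:02d}:{:02d}:{:02d}".format(hours, minutes, seconds)

print(format_timestamp(11))    # 00:00:11
print(format_timestamp(1226))  # 00:20:26
```

This makes it easier to jump to the matching point in the movie when checking a predicted subtitle by hand.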


There was no publicly available API for getting trivia, references, and goofs from IMDb. Given IMDb's web scraping policy, we avoided scraping the site directly and instead saved the trivia to a text file. We can look into scraping in the future. 

We now read the saved trivia file mentioned earlier. 

In [3]:
f = open('swingers_trivia.txt')
trivia = []

In [4]:
for tr in f.readlines():
    trivia.append(tr)

We use word2vec for this problem. 

word2vec is an algorithm created at Google. It uses a shallow two-layer neural network to assign vector representations to words based on their linguistic context. Later in this notebook we look at these word embeddings and at how context groups words together.

word2vec works well here because it converts text to a vector representation of a size of our choosing, and it can be trained on text of our choosing. It is widely used as what is essentially dimensionality reduction for text: the lower-dimensional space groups similar words together, which improves modeling for both supervised and unsupervised learning methods.

We start by converting the trivia and subtitles to lists of words, as for a bag-of-words model. This step usually removes stop words (for example: the, a, an, by), which are mainly articles, prepositions, and pronouns. Here, however, we choose not to remove stop words, because some subtitles are just one word. 

A bag-of-words model is then built by creating a vocabulary of words and a feature vector for each sentence.

In [5]:
def to_wordlist(text, remove_stopwords):
    
    text = re.sub("[^a-zA-Z]", " ", text)
    
    words = text.lower().split()
    
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    
    return words

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

def to_sentences(text, remove_stopwords):
    
    raw_sentences = tokenizer.tokenize(text.strip())
    
    sentences = []
    
    for raw_sentence in raw_sentences:
        
        if len(raw_sentence) > 0:
            
            sentences.append(to_wordlist(raw_sentence, remove_stopwords))
    
    return sentences
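
To illustrate why stop words are kept, here is a toy version of the word-list step. It is a sketch only: `TOY_STOPWORDS` is a hypothetical mini list made up for this example (the notebook uses NLTK's English list), and `toy_wordlist` is our own name.

```python
import re

# Hypothetical mini stop-word list, for illustration only
TOY_STOPWORDS = {"no", "and", "its", "you", "may", "be", "the"}

def toy_wordlist(text, remove_stopwords):
    """Keep letters only, lowercase, split on whitespace, optionally drop stop words."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    if remove_stopwords:
        words = [w for w in words if w not in TOY_STOPWORDS]
    return words

print(toy_wordlist("No way.", remove_stopwords=False))  # ['no', 'way']
print(toy_wordlist("No way.", remove_stopwords=True))   # ['way']
print(toy_wordlist("No.", remove_stopwords=True))       # []
```

A very short subtitle can lose all of its words once stop words are removed, which leaves nothing to vectorize.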

For our analysis, we have to create an averaged feature vector from the words of each string. We do this with two functions, makeFeatureVec and getAvgFeatureVecs.
makeFeatureVec takes a list of words and converts each word to its vector representation using the word2vec model. The word vectors are added together and then averaged by dividing by the number of words found in the model's vocabulary. This way each text results in a vector with 300 features that is the average of its word vectors.
getAvgFeatureVecs creates these vectors for every text in the dataset using the makeFeatureVec function.

In [6]:
def makeFeatureVec(words, model, num_features):
    # Running sum of the word vectors for this text
    featureVec = np.zeros((num_features,),dtype="float32")
    nwords = 0.
    # Set of vocabulary words, for fast membership checks
    index2word_set = set(model.index2word)
    for word in words:
        if word in index2word_set:
            nwords = nwords + 1.
            featureVec = np.add(featureVec,model[word])
    # If no word is in the vocabulary, nwords is 0 and the result is a nan vector
    featureVec = np.divide(featureVec,nwords)
    return featureVec

def getAvgFeatureVecs(text, model, num_features):
    essayFeatureVecs = np.zeros((len(text),num_features),dtype="float32")
    # enumerate yields an integer row index (a float counter is not a valid array index)
    for counter, tr in enumerate(text):
        essayFeatureVecs[counter] = makeFeatureVec(tr, model, num_features)
    return essayFeatureVecs
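
To make the averaging concrete, here is a toy run with made-up 3-dimensional embeddings. Everything in it is an assumption for illustration: `toy_model` stands in for the 300-dimensional Google News model, and `average_vector` is our own simplified function that returns zeros directly when no word is known, rather than producing nan as the notebook's function does.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, for illustration only
toy_model = {
    "bad": np.array([1.0, 0.0, 2.0], dtype="float32"),
    "man": np.array([3.0, 2.0, 0.0], dtype="float32"),
}

def average_vector(words, embeddings, num_features):
    """Average the vectors of in-vocabulary words; return zeros if none are found."""
    known = [embeddings[w] for w in words if w in embeddings]
    if not known:
        return np.zeros(num_features, dtype="float32")
    return np.mean(known, axis=0)

# "zzz" is not in the toy vocabulary, so only "bad" and "man" are averaged
vec = average_vector(["bad", "man", "zzz"], toy_model, num_features=3)
print(vec)  # components 2, 1, 1
```

Out-of-vocabulary words are skipped, so the average is taken only over the words the model knows.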

In [7]:
sentences_tr = []

print ("Parsing sentences from trivia set")
for text in trivia:    
    sentences_tr += to_sentences(text, remove_stopwords = False)

print ("Complete")

Parsing sentences from trivia set
Complete


We use Google's word2vec model, pre-trained on Google News, to vectorize our text. We found this to be more useful and general than training a model on the text of each movie, as that would require constant retraining. Here is a link to the model: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit

In [8]:
# Load Google's pre-trained Word2Vec model.
# Note: in newer gensim releases this loader is gensim.models.KeyedVectors.load_word2vec_format.
model = gensim.models.Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

I tried running the block of code below, but it would hang my computer. So the image below is copied from a project for my other class, CS109A Introduction to Data Science. Here is the project page: https://projectaes.github.io/

Below is an example of word embeddings from Google's model. 
It is important to examine the word embeddings and see how words cluster together under Google's word2vec model. To visualize this, we apply Principal Component Analysis to reduce the word vectors to 2 components and plot the transformed vectors in 2D. We examine the top 25 most similar words to the word "time".

In [None]:
def viswordembedding(word_to_use, topn_):
    okay = model.most_similar(word_to_use, topn = topn_)
    words_to_show = [word_to_use]
    words_vec = np.zeros((topn_+1, 300))
    words_vec[0,:] = model[word_to_use]

    for ii in range(0, len(okay)):
        words_to_show.append(okay[ii][0])    
        words_vec[ii+1,:] = model[okay[ii][0]]

    pca = PCA(n_components=2)
    pca.fit(words_vec)
    X = pca.transform(words_vec)
    xs = X[:, 0]
    ys = X[:, 1]

    plt.figure(figsize=(8,6))
    plt.scatter(xs, ys, marker = 'o')
    plt.xlabel('PCA1')
    plt.ylabel('PCA2')
    plt.title('Word Embedding Nearest Neighbors for ' + word_to_use)
    for i, w in enumerate(words_to_show):
        plt.annotate(
            w,
            xy = (xs[i], ys[i]), xytext = (3, 3),
            textcoords = 'offset points', ha = 'left', va = 'top')
    plt.savefig('wordembedding', bbox_inches = "tight")

<img src="wordembedding.png">

Above we examine the 25 nearest neighbors of the word "time". The units of time (hours, days, weeks, years) cluster together on the left of the plot. This is all determined by the context of the words!
A few other clusters are activities such as exercising, enjoying, and outdoors, in the lower right corner of the graph.
Most interesting is the topmost cluster: the words patience and patient cluster together as nearest neighbors of time!

Now we create feature vectors of all the subtitles and all of the trivia. We'll display it to show you how this text is being represented.

In [9]:
print ("Creating average feature vecs for Trivia")
clean_trivia = []
for text in trivia:
    clean_trivia.append(to_wordlist(text, remove_stopwords= False ))
triviaDataVecs = getAvgFeatureVecs(clean_trivia, model, num_features = 300)

Creating average feature vecs for Trivia




In [10]:
triviaDataVecs

array([[-0.00483032,  0.01160645, -0.00051758, ..., -0.06493653,
        -0.01777466,  0.04527344],
       [-0.00179499,  0.03970064,  0.03865051, ..., -0.05148177,
        -0.07397635,  0.03995029],
       [ 0.1282552 ,  0.02202352, -0.15690105, ..., -0.0061849 ,
        -0.08959961, -0.02742513],
       ..., 
       [ 0.07275391, -0.05419922, -0.375     , ..., -0.46679688,
        -0.00466919,  0.06640625],
       [-0.10681152, -0.14709473, -0.02877808, ...,  0.23095703,
         0.14794922,  0.11889648],
       [ 0.10205078, -0.32617188,  0.06103516, ...,  0.05615234,
        -0.02233887, -0.27148438]], dtype=float32)

In [11]:
sentences_tr = []

print ("Parsing sentences from subtitles set")
for text in df['subtitles']:    
    sentences_tr += to_sentences(text, remove_stopwords = False)

print ("Complete")

Parsing sentences from subtitles set
Complete


In [12]:
clean_subs = []
for text in df['subtitles']:
    clean_subs.append(to_wordlist(text, remove_stopwords= False ))
subsDataVecs = getAvgFeatureVecs(clean_subs, model, num_features = 300)



In [13]:
subsDataVecs

array([[ 0.11791992, -0.03979492,  0.02087402, ..., -0.03656006,
        -0.10400391, -0.08630371],
       [ 0.15712193, -0.02580915,  0.04296875, ..., -0.19461496,
        -0.09286063, -0.04375785],
       [ 0.10078939, -0.02762858,  0.03930664, ..., -0.18058269,
        -0.07623291, -0.06974157],
       ..., 
       [ 0.02572632,  0.00257874,  0.10153198, ..., -0.03164291,
        -0.01159668,  0.04003763],
       [-0.13081868,  0.07226562, -0.01961263, ...,  0.04561361,
        -0.07670084, -0.06654867],
       [-0.03466797,  0.10928345,  0.02838135, ...,  0.01190567,
         0.11981201, -0.07543945]], dtype=float32)

We got some true-divide warnings earlier: when none of the words in a text are in the model's vocabulary, the average divides by zero, which results in a nan feature vector for that reference or subtitle. So we fill these vectors with zeros to prevent disruptions in later analysis.

In [14]:
subsDataVecs = np.nan_to_num(subsDataVecs)
triviaDataVecs = np.nan_to_num(triviaDataVecs)
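
As a minimal illustration of this step, the snippet below reproduces the situation on a 3-element vector: a sum of zero word vectors divided by a word count of zero. The variable names are our own, and the warning is silenced here only to keep the output clean.

```python
import numpy as np

total = np.zeros(3, dtype="float32")   # no in-vocabulary words were added
with np.errstate(invalid="ignore"):
    averaged = np.divide(total, 0.0)   # 0/0 gives nan in every component

cleaned = np.nan_to_num(averaged)      # nan is replaced by 0.0
print(np.isnan(averaged).all(), cleaned)
```

After the replacement, such a text is represented by an all-zero vector, which has to be handled separately when computing similarities.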

Next we use cosine similarity to find out which subtitle each reference applies to. Cosine similarity is the normalized dot product of two vectors. So for each reference, we go through the entire subtitle set and compute its similarity to every subtitle. The subtitle with the maximum similarity should be the one the reference applies to.
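
For two vectors a and b, cosine similarity is a·b / (‖a‖ ‖b‖). The sketch below shows the idea on small made-up 2-dimensional vectors. The helper names `cosine_sim` and `best_match` are ours, and returning 0.0 for a zero-length vector is an assumption of this sketch.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    return float(np.dot(a, b) / denom)

def best_match(query, candidates):
    """Index of the candidate row most similar to the query vector."""
    scores = [cosine_sim(query, row) for row in candidates]
    return int(np.argmax(scores))

# Made-up vectors: the second row points in the same direction as the query
candidates = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
print(best_match(np.array([2.0, 2.0]), candidates))  # 1
```

Because the measure depends only on direction, a long reference and a short subtitle can still score highly if their averaged vectors point the same way.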

In [31]:
def trivia_to_subtitles(set_no):
    
    # Averaged word2vec vector of the selected trivia entry
    trivia_ = triviaDataVecs[set_no]
    cosine_similarity = np.zeros(len(df['subtitles']))

    for ii in range(0, len(df['subtitles'])):
        # An all-zero subtitle vector (no in-vocabulary words) keeps a similarity of 0
        if np.any(subsDataVecs[ii]):
            cosine_similarity[ii] = np.dot(trivia_, subsDataVecs[ii])/(np.linalg.norm(subsDataVecs[ii])* np.linalg.norm(trivia_))
    
    # Row indices of the subtitle(s) with the highest similarity
    predicted = np.argwhere(cosine_similarity == np.amax(cosine_similarity))
    predicted = predicted.flatten().tolist()
    
    print (trivia[set_no])    
    pd.options.display.max_colwidth = 100
    print (str(df['subtitles'][predicted]))
    print ("Start time: ", df["start_time"][predicted])
    print ("End time: ", df['end_time'][predicted])
    print ('------------------------------------------------------------------------------------------------------')

In [32]:
for ii in range(len(trivia)):
    trivia_to_subtitles(ii)

The Wizard of Oz (1939) Lisa works as "a Dorothy" at the MGM Grand in Las Vegas and is therefore in a Judy Garland costume when she meets Mike and Trent.

330    - Oh. Uh, Lisa works at the MGM Grand. - I'm a Dorothy.
Name: subtitles, dtype: object
Start time:  330    1226
Name: start_time, dtype: int64
End time:  330    1229
Name: end_time, dtype: int64
------------------------------------------------------------------------------------------------------
You Bet Your Life (1950) (TV Series) Lorraine observes that Mike's business card has the duck logo from "You Bet Your Life".

1232    Yeah, it's the logo from You Bet Your Life.
Name: subtitles, dtype: object
Start time:  1232    5034
Name: start_time, dtype: int64
End time:  1232    5037
Name: end_time, dtype: int64
------------------------------------------------------------------------------------------------------
The Odd Couple (1968)

550    One, two, three...
Name: subtitles, dtype: object
Start time:  550    2048
Name: start_t



Tapeheads (1988)

Series([], Name: subtitles, dtype: object)
Start time:  Series([], Name: start_time, dtype: int64)
End time:  Series([], Name: end_time, dtype: int64)
------------------------------------------------------------------------------------------------------
Things Change (1988) When Trent refers to Mike as "The guy behind the guy, behind the guy", he's quoting the scene in Things Change where Jerry is making up a story explaining who Gino is.

151    This is the guy behind the guy behind the guy.
Name: subtitles, dtype: object
Start time:  151    618
Name: start_time, dtype: int64
End time:  151    622
Name: end_time, dtype: int64
------------------------------------------------------------------------------------------------------
Rain Man (1988)

876    Bad man.
Name: subtitles, dtype: object
Start time:  876    3256
Name: start_time, dtype: int64
End time:  876    3258
Name: end_time, dtype: int64
------------------------------------------------------------------------

From the above results, we see that some references don't match up. This happens because IMDb lists just the movie name when the reference could apply to a song playing in the background, an action, or the style of directing or editing in that scene. The empty result for Tapeheads is consistent with a reference whose words are all missing from the model's vocabulary: its vector is all zeros, every similarity becomes nan, and no subtitle matches the maximum.

But for references that come with explanations, or where the movie is mentioned by name, our prediction actually does very well. 

For future work, we'd like to produce cleaner data and experiment with TF-IDF and a count vectorizer to hopefully yield better results.
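
As a rough idea of what a TF-IDF baseline could look like, here is a minimal sketch using scikit-learn's `TfidfVectorizer` and `cosine_similarity`. It was not run as part of this project. The three subtitles and the reference are taken from the output above, and fitting the vectorizer on the subtitles only is an assumption of the sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

subtitles = [
    "No way.",
    "Yeah, it's the logo from You Bet Your Life.",
    "This is the guy behind the guy behind the guy.",
]
reference = ("Lorraine observes that Mike's business card has the duck logo "
             "from \"You Bet Your Life\".")

# Learn the vocabulary and IDF weights from the subtitles, then project the reference
vectorizer = TfidfVectorizer()
subtitle_matrix = vectorizer.fit_transform(subtitles)
reference_vec = vectorizer.transform([reference])

# One similarity score per subtitle; pick the highest
scores = cosine_similarity(reference_vec, subtitle_matrix).ravel()
best = int(scores.argmax())
print(subtitles[best])
```

Unlike the averaged word2vec vectors, this relies on exact word overlap, so it would likely help most for direct quotes and explicitly named titles.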