# Word2vec transformation of our TMDB and IMDB movie plots

This notebook applies a word2vec transformation to our TMDB and IMDB movie plots. Further design considerations and analysis are below.

In [2]:
#import libraries
import pandas as pd
from ast import literal_eval
from gensim import models
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
import numpy as np

In [4]:
#read in dataset
movies = pd.read_csv('data/movies.csv', encoding='utf8',
                     converters={'tmdb_genres':literal_eval, 'imdb_genres':literal_eval})
movies.head()

Unnamed: 0,tmdb_id,imdb_id,tmdb_genres,imdb_genres,binary_tmdb,binary_imdb,tmdb_plot,imdb_plot,popularity,release_date,title,vote_average,vote_count,tmdb_clean_plot,imdb_clean_plot,tmdb_w2v_plot,imdb_w2v_plot,tmdb_bow_plot,imdb_bow_plot
0,278,tt0111161,"[18, 80]","[80, 18]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...",Framed in the 1940s for the double murder of h...,Chronicles the experiences of a formerly succe...,28.527767,1994-09-23,The Shawshank Redemption,8.5,9773,"['framed', '1940s', 'double', 'murder', 'wife'...","['chronicles', 'experiences', 'formerly', 'suc...","[0.014165705069899559, 0.035729147493839264, 0...","[0.004663567990064621, 0.09018586575984955, -0...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,238,tt0068646,"[18, 80]","[80, 18]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","Spanning the years 1945 to 1955, a chronicle o...",When the aging head of a famous crime family d...,36.965452,1972-03-14,The Godfather,8.5,7394,"['spanning', 'years', '1945', '1955', 'chronic...","['aging', 'head', 'famous', 'crime', 'family',...","[-0.016820836812257767, 0.05966977775096893, -...","[-0.013326308690011501, 0.08134819567203522, 0...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,424,tt0108052,"[18, 36, 10752]","[18, 36]","[0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...",The true story of how businessman Oskar Schind...,Oskar Schindler is a vainglorious and greedy G...,19.945455,1993-11-29,Schindler's List,8.4,5518,"['true', 'story', 'businessman', 'oskar', 'sch...","['oskar', 'schindler', 'vainglorious', 'greedy...","[0.0758906751871109, 0.02254812978208065, 0.06...","[0.05338115245103836, 0.10281133651733398, 0.0...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,240,tt0071562,"[18, 80]","[80, 18]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...",In the continuing saga of the Corleone crime f...,The continuing saga of the Corleone crime fami...,30.191804,1974-12-20,The Godfather: Part II,8.4,4249,"['continuing', 'saga', 'corleone', 'crime', 'f...","['continuing', 'saga', 'corleone', 'crime', 'f...","[-0.05790800228714943, 0.07111673057079315, -0...","[-0.05151921883225441, 0.07896284759044647, -0...","[0.0, 0.0, 0.34648671489524274, 0.0, 0.0, 0.0,...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.324..."
4,452522,tt0278784,"[18, 9648]","[80, 18, 9648, 53]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, ...",Standalone version of the series pilot with an...,"When beautiful, young Laura Palmer is found br...",5.969249,1989-12-31,Twin Peaks,8.4,123,"['standalone', 'version', 'series', 'pilot', '...","['beautiful', 'young', 'laura', 'palmer', 'fou...","[-0.05888228118419647, -0.05345569923520088, -...","[0.0026558770332485437, 0.10140948742628098, 0...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


---
#### Clean plots

In [5]:
#define tokenizer
tokenizer = RegexpTokenizer(r'\w+')
#set stop words list
english_stop = get_stop_words('en')
len(english_stop)

174

In [6]:
#function to clean plots
def clean_plot(plot):
    plot = plot.lower()
    plot = tokenizer.tokenize(plot)
    plot = [word for word in plot if word not in english_stop]
    return plot
    

In [7]:
#check first movie's clean plot
print(movies.tmdb_plot[0])
print(clean_plot(movies.tmdb_plot[0]))

Framed in the 1940s for the double murder of his wife and her lover, upstanding banker Andy Dufresne begins a new life at the Shawshank prison, where he puts his accounting skills to work for an amoral warden. During his long stretch in prison, Dufresne comes to be admired by the other inmates -- including an older prisoner named Red -- for his integrity and unquenchable sense of hope
['framed', '1940s', 'double', 'murder', 'wife', 'lover', 'upstanding', 'banker', 'andy', 'dufresne', 'begins', 'new', 'life', 'shawshank', 'prison', 'puts', 'accounting', 'skills', 'work', 'amoral', 'warden', 'long', 'stretch', 'prison', 'dufresne', 'comes', 'admired', 'inmates', 'including', 'older', 'prisoner', 'named', 'red', 'integrity', 'unquenchable', 'sense', 'hope']


In [8]:
#apply to movies df for both imdb and tmdb
movies['tmdb_clean_plot'] = movies['tmdb_plot'].apply(clean_plot)
movies['imdb_clean_plot'] = movies['imdb_plot'].apply(clean_plot)

In [9]:
#check some outputs
movies.tmdb_clean_plot[1:5]

1    [spanning, years, 1945, 1955, chronicle, ficti...
2    [true, story, businessman, oskar, schindler, s...
3    [continuing, saga, corleone, crime, family, yo...
4    [standalone, version, series, pilot, alternate...
Name: tmdb_clean_plot, dtype: object

---
#### Apply word2vec Transformation of Plots

For this transformation we use the Google News pretrained word2vec model, which was trained on a corpus of roughly 100 billion words. It contains 3 million words and phrases, each represented by a 300-dimension vector. We make a few design choices in how we transform our movie plots:

- If a word in the cleaned plot (lowercased, tokenized, with punctuation and stop words removed) is in the word2vec model, add its vector to a running list for that plot.
- If a word is not in the word2vec model, skip it. We print a sample of these skipped words below.
- After collecting all the word vectors for a particular plot, take the column-wise mean and store the plot as a 300-dimension average vector of its words.

The assumption we make by taking the mean of each plot is that the resulting 300-dimension vector will point in the direction of one or more genres. 
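
The averaging step can be illustrated on a toy scale. The sketch below is not the notebook's code: `toy_vectors` and `average_plot_vector` are hypothetical names, and the 3-dimension values are made up to stand in for the 300-dimension Google News vectors.

```python
import numpy as np

# Hypothetical 3-dimension embeddings standing in for the 300-dimension
# Google News vectors; the values are made up for illustration only.
toy_vectors = {
    'prison': np.array([1.0, 0.0, 2.0]),
    'banker': np.array([3.0, 2.0, 0.0]),
}

def average_plot_vector(tokens, vectors):
    """Average the vectors of the tokens found in `vectors`, skipping the rest."""
    found = [vectors[t] for t in tokens if t in vectors]
    skipped = [t for t in tokens if t not in vectors]
    # Column-wise mean: one value per embedding dimension
    return np.mean(found, axis=0), skipped

plot_vec, skipped = average_plot_vector(['prison', 'banker', 'dufresne'], toy_vectors)
print(plot_vec)  # [2. 1. 1.]
print(skipped)   # ['dufresne']
```

Every word contributes equally to the mean, so a plot dominated by vocabulary typical of one genre should be pulled toward that region of the embedding space.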

In [10]:
#Load the pretrained google news word2vec model
model = models.KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)

In [11]:
#collect words not in the Google News w2v model
not_w2v = []

#word2vec function
def apply_words2Vec(cleaned_plot):
    """
    apply_words2Vec()
    -applies the following transformations to the cleaned plot of a movie:
        1) skips words that are not in google's model
        2) looks up the vector representation of each remaining word
        3) converts the resulting 2d array into a 1d array via np.mean()
    -also keeps track of all words not found in google's model

    -inputs: cleaned_plot (list of word tokens)

    -outputs: vector representation of plot (300-dimension numpy array)
    """
    vecs = []
    for word in cleaned_plot:
        #add word vector to list if it is in the google model
        try:
            #indexing the model raises KeyError for out-of-vocabulary words
            vecs.append(model[word])
        except KeyError:
            #if the word is not in the w2v model, add it to
            #our list of skipped words
            not_w2v.append(word)

    #column-wise mean; this is NaN if no word in the plot was found
    return np.mean(vecs, axis=0)

In [12]:
#apply transformation to both sets of plots and add columns to df
movies['tmdb_w2v_plot'] = movies['tmdb_clean_plot'].apply(apply_words2Vec)
movies['imdb_w2v_plot'] = movies['imdb_clean_plot'].apply(apply_words2Vec)

In [13]:
#check shapes of first movie vectors to confirm 300 dimensions
movies.loc[0,'tmdb_w2v_plot'].shape, movies.loc[0,'imdb_w2v_plot'].shape

((300,), (300,))

Each TMDB and IMDB plot has been transformed into a 300-dimension representation of that plot. We will use these features as predictors in our multi-label classification modeling.
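
As a rough sketch of how these per-movie vectors could feed a classifier, they can be stacked into a single feature matrix. The frame `toy` below is a hypothetical stand-in for `movies`, with 4-dimension vectors and 3 genre indicators; treating `binary_tmdb` as the label column is an assumption about the later modeling step, not something this notebook confirms.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the movies DataFrame: two movies, 4-dimension
# vectors instead of 300, and 3 genre indicators. All values are made up.
toy = pd.DataFrame({
    'tmdb_w2v_plot': [np.array([0.1, 0.2, 0.3, 0.4]),
                      np.array([0.5, 0.6, 0.7, 0.8])],
    'binary_tmdb': [[0, 1, 0], [1, 1, 0]],
})

# Stack the per-movie vectors into an (n_movies, n_dimensions) feature matrix
X = np.vstack(toy['tmdb_w2v_plot'].tolist())
# Stack the binary genre indicators into an (n_movies, n_genres) label matrix
Y = np.array(toy['binary_tmdb'].tolist())

print(X.shape, Y.shape)  # (2, 4) (2, 3)
```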

In [14]:
#print some of our skipped words
print(len(not_w2v))
np.random.seed(1212)
print(np.random.choice(not_w2v, 50, replace=False))

5416
['valerie' 'addie' 'arroway' '1988' 'waititi' 'rocco' 'matlock' '1920s'
 'needham' 'gertrude' 'lili' 'rapunzel' '500' 'josey' '1996' 'sicily'
 '26th' 'jessup' 'marston' 'napaloni' 'dealan' 'katarina' 'isabelle'
 'urskeks' 'pistone' 'napua' 'sheryl' 'saroo' 'trinh' 'millman'
 'burpelson' '1942' 'discreteness' 'verona' 'lucile' 'cecelia' 'lestat'
 'maisie' 'onllwyn' 'cranley' '1999' 'kyla' 'renton' '1964' '1984'
 'moonscar' 'kristoff' 'trask' 'kirkeby' 'malone']


Our cleaned TMDB and IMDB plots contain 5416 word occurrences that were skipped when applying the word2vec transformation (a word missing from the model is counted each time it appears). As seen above, a random sample of 50 of them consists mostly of years, numbers, and proper nouns. This is not surprising, and we suspect it will have little impact on the resulting word2vec representations of our movie plots.
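
Because `not_w2v` is appended to on every miss, a word that recurs across plots is counted each time. A minimal sketch of separating occurrences from distinct words follows; `not_w2v_example` is a hypothetical stand-in for the real list.

```python
from collections import Counter

# Hypothetical stand-in for not_w2v; the real list is filled by apply_words2Vec
not_w2v_example = ['dufresne', '1940s', 'dufresne', 'corleone', '1940s', 'dufresne']

counts = Counter(not_w2v_example)
print(len(not_w2v_example))   # 6 skipped occurrences
print(len(counts))            # 3 distinct skipped words
print(counts.most_common(1))  # [('dufresne', 3)]
```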

In [15]:
#check that each plot has a corresponding word2vec representation
for source in ['tmdb', 'imdb']:
    #np.shape gives () for the NaN returned by an empty plot, where len() would raise
    missing = sum(np.shape(plot) != (300,) for plot in movies[source + '_w2v_plot'])
    if missing:
        print("AH! %d %s movies have no word2vec representation" % (missing, source.upper()))
    else:
        print('All %s movies have a word2vec representation.' % source.upper())

All TMDB movies have a word2vec representation.
All IMDB movies have a word2vec representation.


#### Update and save DataFrame for modeling and analysis

In [16]:
#convert to lists for csv writing and reading ease
movies['tmdb_w2v_plot'] = movies['tmdb_w2v_plot'].apply(lambda x: x.tolist())
movies['imdb_w2v_plot'] = movies['imdb_w2v_plot'].apply(lambda x: x.tolist())

movies.to_csv('data/movies.csv', encoding="utf-8", index=False)
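
Storing the vectors as plain lists means they can be parsed back with `literal_eval` when the file is read again, in the same way the genre columns are read at the top of this notebook. The round trip below is a sketch on a hypothetical two-row frame written to an in-memory buffer rather than to `data/movies.csv`.

```python
import io
from ast import literal_eval

import pandas as pd

# Hypothetical two-row frame with short vectors standing in for the 300-dimension ones
toy = pd.DataFrame({'title': ['A', 'B'],
                    'tmdb_w2v_plot': [[0.1, -0.2], [0.3, 0.4]]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Without a converter the column would come back as strings such as '[0.1, -0.2]'
restored = pd.read_csv(buffer, converters={'tmdb_w2v_plot': literal_eval})
print(restored.loc[0, 'tmdb_w2v_plot'])  # [0.1, -0.2]
```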

### We have now transformed and saved each plot as a 300-dimension word2vec representation.