# Word2vec transformation of our TMDB and IMDB movie plots

This notebook applies a word2vec transformation to our TMDB and IMDB movie plots. Further design considerations and analysis are below.

In [1]:
#import libraries
import pandas as pd
from ast import literal_eval
from gensim import models
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
import numpy as np

In [2]:
#read in dataset
movies = pd.read_csv('data/movies.csv', encoding='utf8', 
                     converters={'tmdb_genres':literal_eval, 'imdb_genres':literal_eval})
movies.head()

Unnamed: 0,tmdb_id,imdb_id,tmdb_genres,imdb_genres,binary_tmdb,binary_imdb,tmdb_plot,imdb_plot,popularity,release_date,...,vote_average,vote_count,tmdb_clean_plot,imdb_clean_plot,tmdb_w2v_plot,imdb_w2v_plot,tmdb_bow_plot,imdb_bow_plot,combined_plots,combined_bow_plots
0,278,tt0111161,"[18, 80]","[80, 18]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...",Framed in the 1940s for the double murder of h...,Chronicles the experiences of a formerly succe...,28.527767,1994-09-23,...,8.5,9773,"['framed', '1940s', 'double', 'murder', 'wife'...","['chronicles', 'experiences', 'formerly', 'suc...","[0.014165705069899559, 0.035729147493839264, 0...","[0.004663567990064621, 0.09018586575984955, -0...","(0, 700)\t0.1914190824267342\n (0, 1141)\t0...","(0, 398)\t0.22753905256972778\n (0, 759)\t0...",Framed in the 1940s for the double murder of h...,"(0, 1092)\t0.15089615016031976\n (0, 811)\t..."
1,238,tt0068646,"[18, 80]","[80, 18]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...",Spanning the years 1945 to 1955 a chronicle o...,When the aging head of a famous crime family d...,36.965452,1972-03-14,...,8.5,7394,"['spanning', 'years', '1945', '1955', 'chronic...","['aging', 'head', 'famous', 'crime', 'family',...","[-0.016820836812257767, 0.05966977775096893, -...","[-0.013326308690011501, 0.08134819567203522, 0...","(0, 610)\t0.12626163649598618\n (0, 1165)\t...","(0, 515)\t0.17259715509464205\n (0, 938)\t0...",Spanning the years 1945 to 1955 a chronicle o...,"(0, 1773)\t0.10485484905546055\n (0, 287)\t..."
2,424,tt0108052,"[18, 36, 10752]","[18, 36]","[0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...",The true story of how businessman Oskar Schind...,Oskar Schindler is a vainglorious and greedy G...,19.945455,1993-11-29,...,8.4,5518,"['true', 'story', 'businessman', 'oskar', 'sch...","['oskar', 'schindler', 'vainglorious', 'greedy...","[0.0758906751871109, 0.02254812978208065, 0.06...","[0.05338115245103836, 0.10281133651733398, 0.0...","(0, 1079)\t0.2907037443234484\n (0, 990)\t0...","(0, 916)\t0.40896979889639457\n (0, 317)\t0...",The true story of how businessman Oskar Schind...,"(0, 2911)\t0.09695795170181548\n (0, 2774)\..."
3,240,tt0071562,"[18, 80]","[80, 18]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...",In the continuing saga of the Corleone crime f...,The continuing saga of the Corleone crime fami...,30.191804,1974-12-20,...,8.4,4249,"['continuing', 'saga', 'corleone', 'crime', 'f...","['continuing', 'saga', 'corleone', 'crime', 'f...","[-0.05790800228714943, 0.07111673057079315, -0...","[-0.05151921883225441, 0.07896284759044647, -0...","(0, 720)\t0.19462813339213522\n (0, 243)\t0...","(0, 515)\t0.21968270215051702\n (0, 1494)\t...",In the continuing saga of the Corleone crime f...,"(0, 1821)\t0.12839540573874353\n (0, 649)\t..."
4,452522,tt0278784,"[18, 9648]","[80, 18, 9648, 53]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, ...",Standalone version of the series pilot with an...,When beautiful young Laura Palmer is found br...,5.969249,1989-12-31,...,8.4,123,"['standalone', 'version', 'series', 'pilot', '...","['beautiful', 'young', 'laura', 'palmer', 'fou...","[-0.05888228118419647, -0.05345569923520088, -...","[0.0026558770332485437, 0.10140948742628098, 0...","(0, 1110)\t0.5243217302828199\n (0, 925)\t0...","(0, 875)\t0.15459130001922888\n (0, 2438)\t...",Standalone version of the series pilot with an...,"(0, 1088)\t0.12464518166470029\n (0, 2380)\..."


---
### Make new feature of combined TMDB and IMDB plots
We create a new column that concatenates the TMDB and IMDB plots for each movie. The combined text may offer a richer representation of each plot and improve precision and recall in the forthcoming modeling.

In [3]:
movies['combined_plots'] = movies['tmdb_plot'] + ' ' + movies['imdb_plot']

---
#### Clean plots

In [4]:
#define tokenizer
tokenizer = RegexpTokenizer(r'\w+')
#set stop words list
english_stop = get_stop_words('en')
len(english_stop)

174

In [5]:
#function to clean plots
def clean_plot(plot):
    plot = plot.lower()
    plot = tokenizer.tokenize(plot)
    plot = [word for word in plot if word not in english_stop]
    return plot
    

In [6]:
#check first movie's clean plot
print(movies.tmdb_plot[0])
print(clean_plot(movies.tmdb_plot[0]))

Framed in the 1940s for the double murder of his wife and her lover  upstanding banker Andy Dufresne begins a new life at the Shawshank prison  where he puts his accounting skills to work for an amoral warden  During his long stretch in prison  Dufresne comes to be admired by the other inmates    including an older prisoner named Red    for his integrity and unquenchable sense of hope
['framed', '1940s', 'double', 'murder', 'wife', 'lover', 'upstanding', 'banker', 'andy', 'dufresne', 'begins', 'new', 'life', 'shawshank', 'prison', 'puts', 'accounting', 'skills', 'work', 'amoral', 'warden', 'long', 'stretch', 'prison', 'dufresne', 'comes', 'admired', 'inmates', 'including', 'older', 'prisoner', 'named', 'red', 'integrity', 'unquenchable', 'sense', 'hope']
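
To see how this cleaning step treats contractions and punctuation, here is a minimal standard-library sketch. It assumes `re.findall(r'\w+', ...)` as a stand-in for `RegexpTokenizer(r'\w+')`, which should behave the same for this pattern, and it uses a tiny made-up stop list (`toy_stop`) rather than the 174-word list from `stop_words`. The function name `clean_plot_sketch` is also hypothetical.

```python
import re

# Hypothetical three-word stop list, standing in for get_stop_words('en').
toy_stop = {'the', 'of', 'and'}

def clean_plot_sketch(plot, stop_words=toy_stop):
    """Lowercase, split on runs of word characters, drop stop words."""
    tokens = re.findall(r'\w+', plot.lower())
    return [tok for tok in tokens if tok not in stop_words]

cleaned = clean_plot_sketch("Andy's hope isn't lost: the warden fears him.")
print(cleaned)
# ['andy', 's', 'hope', 'isn', 't', 'lost', 'warden', 'fears', 'him']
```

Note that `\w+` splits on apostrophes, so fragments such as `s` and `t` survive unless the stop list happens to contain them.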


In [7]:
#apply to movies df for tmdb, imdb and combined plots
movies['tmdb_clean_plot'] = movies['tmdb_plot'].apply(clean_plot)
movies['imdb_clean_plot'] = movies['imdb_plot'].apply(clean_plot)
movies['combined_clean_plot'] = movies['combined_plots'].apply(clean_plot)

In [8]:
#check some outputs
movies.tmdb_clean_plot[1:5]

1    [spanning, years, 1945, 1955, chronicle, ficti...
2    [true, story, businessman, oskar, schindler, s...
3    [continuing, saga, corleone, crime, family, yo...
4    [standalone, version, series, pilot, alternate...
Name: tmdb_clean_plot, dtype: object

---
#### Apply word2vec Transformation of Plots

For this transformation we use the Google News pretrained word2vec model, which was trained on roughly 100 billion words of Google News text. It contains vectors for 3 million words and phrases, each represented by a 300-dimension vector. We make a few design choices in how we transform our movie plots:

- If a word in the cleaned plot (lowercased, tokenized, punctuation and stop words dropped) is in the word2vec model, append its vector to a running list for that plot.
- If a word is not in the word2vec model, skip it. We print a sample of these skipped words below.
- We then build two representations of each plot: a matrix in which each row is the 300-dimension vector of one word in the plot, and the column-wise mean of that matrix, which reduces each plot to a single 300-dimension vector.

The assumption we make by taking the mean of each plot is that the resulting 300-dimension vector will point in the direction of one or more genres, but we acknowledge that it will most likely reduce some of the semantic value of certain words.
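
As a rough illustration of the two representations, the sketch below uses a hypothetical three-word vocabulary with 4-dimension vectors. The names `toy_vectors` and `embed` are made up for this example and stand in for the 300-dimension Google News model; this is the same idea at toy scale, not the notebook's actual pipeline.

```python
import numpy as np

# Hypothetical toy embedding table standing in for the pretrained model.
toy_vectors = {
    'prison': np.array([1.0, 0.0, 2.0, 0.0]),
    'banker': np.array([0.0, 1.0, 0.0, 2.0]),
    'hope':   np.array([2.0, 2.0, 1.0, 1.0]),
}

def embed(tokens, vectors):
    """Return (matrix, mean vector) for the tokens found in `vectors`."""
    found = [vectors[tok] for tok in tokens if tok in vectors]
    matrix = np.stack(found)             # one row per in-vocabulary word
    return matrix, matrix.mean(axis=0)   # column-wise mean of those rows

# 'dufresne' is not in the toy vocabulary, so it is skipped.
matrix, mean_vec = embed(['banker', 'dufresne', 'prison', 'hope'], toy_vectors)
print(matrix.shape)  # (3, 4)
print(mean_vec)      # [1. 1. 1. 1.]
```

The mean vector has the same length no matter how many words the plot contains, which is what makes it convenient as a fixed-size predictor; the matrix keeps per-word information but its row count varies by plot.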


In [9]:
#Load the pretrained google news word2vec model
model = models.KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)

In [10]:
#collect words not in the Google News w2v model
not_w2v = []

#word2vec function
def apply_words2Vec(cleaned_plot, mean=False):
    
    """
    apply_words2Vec()
    -applies the following transformations to the cleaned plot of a movie:
        1) skips words that are not in google's model
        2) looks up the 300-dimension vector of each remaining word
        3) outputs an (n_words, 300) matrix for the plot
        If mean = True
        4) reduces the matrix to a single 300-dimension vector via
           np.mean(axis=0) and outputs that vector instead.
    -also appends every word not found in google's model to not_w2v
    
    -inputs: cleaned_plot (list of str tokens), mean (bool)
    
    -outputs: vector representation of plot (np.ndarray)
    
    """
    vecs=[]
    for word in cleaned_plot:
        #add word vector to list if it is in the google model
        #(model[word] is used because word_vec is deprecated in gensim 4)
        try:
            vecs.append(model[word])
        except KeyError:
            #if the word is not in the w2v model, add it to
            #our list of skipped words
            not_w2v.append(word)
    
    #take the column-wise mean of vecs to reduce the 2d array to a 1d array
    if mean:
        return np.mean(vecs, axis=0)
    #return matrix of w2v arrays where each row is a word in the plot
    return np.stack(vecs)
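
One edge case worth keeping in mind: if none of a plot's words are in the model, the list of vectors is empty, so `np.stack` raises a `ValueError` and `np.mean` returns `nan`. The sketch below shows one possible guard, a zero-vector fallback. The helper name `safe_mean` and the choice of fallback are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np

def safe_mean(vecs, dim=300):
    """Column-wise mean of word vectors, or a zero vector if none were found."""
    if len(vecs) == 0:
        # Assumed fallback: an all-zero vector of the expected dimension.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vecs, axis=0)

empty_result = safe_mean([])
two_word_result = safe_mean([np.ones(300), 3 * np.ones(300)])
print(empty_result.shape)    # (300,)
print(two_word_result[:3])   # [2. 2. 2.]
```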

In [11]:
#apply transformation to three sets of plots and add columns to df

#columns with mean w2v
movies['tmdb_w2v_plot_mean'] = movies['tmdb_clean_plot'].apply(lambda x: apply_words2Vec(x, mean=True))
movies['imdb_w2v_plot_mean'] = movies['imdb_clean_plot'].apply(lambda x: apply_words2Vec(x, mean=True))
movies['combined_w2v_plot_mean'] = movies['combined_clean_plot'].apply(lambda x: apply_words2Vec(x, mean=True))

#columns with w2v matrix for each plot
movies['tmdb_w2v_plot_matrix'] = movies['tmdb_clean_plot'].apply(lambda x: apply_words2Vec(x, mean=False))
movies['imdb_w2v_plot_matrix'] = movies['imdb_clean_plot'].apply(lambda x: apply_words2Vec(x, mean=False))
movies['combined_w2v_plot_matrix'] = movies['combined_clean_plot'].apply(lambda x: apply_words2Vec(x, mean=False))

In [12]:
#check shapes of first movie vectors to confirm nd_array and 300-dimensions
print('Mean vector representations:')
print(movies.loc[0,'tmdb_w2v_plot_mean'].shape, 
      movies.loc[0,'imdb_w2v_plot_mean'].shape, 
      movies.loc[0, 'combined_w2v_plot_mean'].shape)

print('Matrix representations:')
print(movies.loc[0,'tmdb_w2v_plot_matrix'].shape, 
      movies.loc[0,'imdb_w2v_plot_matrix'].shape,
      movies.loc[0,'combined_w2v_plot_matrix'].shape)

Mean vector representations:
(300,) (300,) (300,)
Matrix representations:
(33, 300) (38, 300) (71, 300)


Each TMDB and IMDB plot has been transformed into a 300-dimension mean vector, as well as a matrix where each row represents a word in the plot. We will use these features as predictors in our multi-label classification modeling.

In [17]:
#print some of our skipped words
print(len(not_w2v))
np.random.seed(112)
print(np.random.choice(not_w2v, 50, replace=False))

21664
['dunkirk' 'cambodian' 'kowalski' 'beale' 'matlock' 'cinema_fan' 'ronan'
 'katniss' 'leia' 'halley' 'simon_hrdng' 'willem' 'mimmo' 'barnum'
 'bennet' 'elsa' 'trinh' 'tyrone' 'wigand' 'dufresne' '1879' '22' 'huggo'
 '000' 'claudio' 'spurlin' '15' 'jwelch5742' 'krueger' '500' 'bergman'
 'cellini' 'kyla' 'andrewhodkinson' '1976' 'yvette' 'hiddleston' 'salieri'
 'huggo' '1980s' 'turturro' 'cambodia' 'fleetwood' '1900' 'redondo'
 'peatty' 'film_fan' 'bonneville' 'gooper' '000']


The word2vec transformation skipped 21664 tokens from our cleaned TMDB and IMDB plots. This count includes repeats: a word is logged every time it is skipped, and each plot is processed once for the mean and once for the matrix representation, in the TMDB, IMDB, and combined columns. As the random sample of 50 above shows, the skipped words are mostly years, numbers, proper nouns, and what appear to be usernames. This is not surprising, and we suspect it will not have a large impact on the resulting word2vec representations of our movie plots.
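
Because the skipped-word list contains repeats, the number of distinct out-of-vocabulary words is smaller than its length. A minimal sketch of how one might separate the two, using a short made-up list (`skipped`) in place of `not_w2v`:

```python
from collections import Counter

# Hypothetical sample standing in for the full not_w2v list.
skipped = ['huggo', '000', 'dufresne', 'huggo', '1976', '000', 'huggo']

counts = Counter(skipped)
print(len(skipped))           # 7 skipped tokens in total
print(len(counts))            # 4 distinct skipped words
print(counts.most_common(2))  # [('huggo', 3), ('000', 2)]
```

Running the same two lines on `not_w2v` would show how much of the total comes from a few frequently repeated tokens.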

In [18]:
#check that each plot has a corresponding word2vec representation
#if both TMDB and IMDB have mean representations, we can assume
#the combined and matrix representations are also filled
for source in ['tmdb', 'imdb']:
    #count plots whose mean vector is not a 300-dimension array
    n_missing = sum(np.shape(vec) != (300,) for vec in movies[source + '_w2v_plot_mean'])
    if n_missing > 0:
        print('AH! %d %s movies have no word2vec representation' % (n_missing, source.upper()))
    else:
        print('All %s movies have a word2vec representation.' % source.upper())

All TMDB movies have a word2vec representation.
All IMDB movies have a word2vec representation.


**Save w2v data as NumPy arrays for further modeling and analysis**

In [20]:
#Series.to_numpy() replaces as_matrix(), which was removed in pandas 1.0
#each saved array has dtype=object, with one numpy array per movie

#w2v mean vectors
np.save('data/tmdb_w2v_mean.npy', movies['tmdb_w2v_plot_mean'].to_numpy())
np.save('data/imdb_w2v_mean.npy', movies['imdb_w2v_plot_mean'].to_numpy())
np.save('data/combined_w2v_mean.npy', movies['combined_w2v_plot_mean'].to_numpy())

#w2v matrices
np.save('data/tmdb_w2v_matrix.npy', movies['tmdb_w2v_plot_matrix'].to_numpy())
np.save('data/imdb_w2v_matrix.npy', movies['imdb_w2v_plot_matrix'].to_numpy())
np.save('data/combined_w2v_matrix.npy', movies['combined_w2v_plot_matrix'].to_numpy())
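
A note on reading these files back: each column holds one NumPy array per row, so the saved arrays have `dtype=object` and are pickled by `np.save`, and recent NumPy versions require `allow_pickle=True` in `np.load` to read them. The sketch below uses toy data and a temporary directory (all names are hypothetical) to show this, and also shows that equal-length mean vectors could instead be stacked into a plain 2-D float array that loads without pickling.

```python
import os
import tempfile
import numpy as np

# Toy stand-ins: 3 "plots" with 2, 4 and 1 words, using 5-dimension vectors.
matrices = [np.ones((n_words, 5)) for n_words in (2, 4, 1)]
means = [m.mean(axis=0) for m in matrices]

# Matrices have different row counts, so they go in an object array.
ragged = np.empty(len(matrices), dtype=object)
for i, m in enumerate(matrices):
    ragged[i] = m

# Mean vectors all share one shape, so they stack into a float array.
stacked = np.stack(means)

with tempfile.TemporaryDirectory() as tmp:
    matrix_path = os.path.join(tmp, 'matrix.npy')
    mean_path = os.path.join(tmp, 'mean.npy')
    np.save(matrix_path, ragged)
    np.save(mean_path, stacked)
    loaded_ragged = np.load(matrix_path, allow_pickle=True)  # object array
    loaded_mean = np.load(mean_path)                         # plain floats

print(loaded_mean.shape)       # (3, 5)
print(loaded_ragged[1].shape)  # (4, 5)
```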

#### Update and save DataFrame for future analysis

In [24]:
movies.to_csv('data/movies.csv', encoding="utf-8", index=False)

### We have now transformed and saved each plot as a 300-dimension mean vector and as an n_words x 300 word2vec matrix.