# Bag-of-words transformations of our TMDB and IMDB movie plots

This notebook applies a bag-of-words transformation with TFIDF weighting to our TMDB and IMDB movie plots. Further design considerations and analysis are below.

In [1]:
#import libraries
import pandas as pd
from ast import literal_eval
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np


In [2]:
#columns in 'movies.csv' that must be parsed with literal_eval so that
#they are read in as Python lists (vectors) rather than as strings
converterList = ['imdb_genres', 'tmdb_genres', 'binary_tmdb', 'binary_imdb',
               'imdb_w2v_plot', 'tmdb_w2v_plot']

converterDict = {column: literal_eval for column in converterList}

movies = pd.read_csv('data/movies.csv', encoding='utf-8',
                     converters=converterDict)

movies.head(5)

Unnamed: 0,tmdb_id,imdb_id,tmdb_genres,imdb_genres,binary_tmdb,binary_imdb,tmdb_plot,imdb_plot,popularity,release_date,...,vote_average,vote_count,tmdb_clean_plot,imdb_clean_plot,tmdb_w2v_plot,imdb_w2v_plot,tmdb_bow_plot,imdb_bow_plot,combined_plots,combined_bow_plots
0,278,tt0111161,"[18, 80]","[80, 18]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...",Framed in the 1940s for the double murder of h...,Chronicles the experiences of a formerly succe...,28.527767,1994-09-23,...,8.5,9773,"['framed', '1940s', 'double', 'murder', 'wife'...","['chronicles', 'experiences', 'formerly', 'suc...","[0.014165705069899559, 0.035729147493839264, 0...","[0.004663567990064621, 0.09018586575984955, -0...","(0, 700)\t0.1914190824267342\n (0, 1141)\t0...","(0, 398)\t0.22753905256972778\n (0, 759)\t0...",Framed in the 1940s for the double murder of h...,"(0, 1092)\t0.15089615016031976\n (0, 811)\t..."
1,238,tt0068646,"[18, 80]","[80, 18]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...",Spanning the years 1945 to 1955 a chronicle o...,When the aging head of a famous crime family d...,36.965452,1972-03-14,...,8.5,7394,"['spanning', 'years', '1945', '1955', 'chronic...","['aging', 'head', 'famous', 'crime', 'family',...","[-0.016820836812257767, 0.05966977775096893, -...","[-0.013326308690011501, 0.08134819567203522, 0...","(0, 610)\t0.12626163649598618\n (0, 1165)\t...","(0, 515)\t0.17259715509464205\n (0, 938)\t0...",Spanning the years 1945 to 1955 a chronicle o...,"(0, 1773)\t0.10485484905546055\n (0, 287)\t..."
2,424,tt0108052,"[18, 36, 10752]","[18, 36]","[0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...",The true story of how businessman Oskar Schind...,Oskar Schindler is a vainglorious and greedy G...,19.945455,1993-11-29,...,8.4,5518,"['true', 'story', 'businessman', 'oskar', 'sch...","['oskar', 'schindler', 'vainglorious', 'greedy...","[0.0758906751871109, 0.02254812978208065, 0.06...","[0.05338115245103836, 0.10281133651733398, 0.0...","(0, 1079)\t0.2907037443234484\n (0, 990)\t0...","(0, 916)\t0.40896979889639457\n (0, 317)\t0...",The true story of how businessman Oskar Schind...,"(0, 2911)\t0.09695795170181548\n (0, 2774)\..."
3,240,tt0071562,"[18, 80]","[80, 18]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...",In the continuing saga of the Corleone crime f...,The continuing saga of the Corleone crime fami...,30.191804,1974-12-20,...,8.4,4249,"['continuing', 'saga', 'corleone', 'crime', 'f...","['continuing', 'saga', 'corleone', 'crime', 'f...","[-0.05790800228714943, 0.07111673057079315, -0...","[-0.05151921883225441, 0.07896284759044647, -0...","(0, 720)\t0.19462813339213522\n (0, 243)\t0...","(0, 515)\t0.21968270215051702\n (0, 1494)\t...",In the continuing saga of the Corleone crime f...,"(0, 1821)\t0.12839540573874353\n (0, 649)\t..."
4,452522,tt0278784,"[18, 9648]","[80, 18, 9648, 53]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, ...",Standalone version of the series pilot with an...,When beautiful young Laura Palmer is found br...,5.969249,1989-12-31,...,8.4,123,"['standalone', 'version', 'series', 'pilot', '...","['beautiful', 'young', 'laura', 'palmer', 'fou...","[-0.05888228118419647, -0.05345569923520088, -...","[0.0026558770332485437, 0.10140948742628098, 0...","(0, 1110)\t0.5243217302828199\n (0, 925)\t0...","(0, 875)\t0.15459130001922888\n (0, 2438)\t...",Standalone version of the series pilot with an...,"(0, 1088)\t0.12464518166470029\n (0, 2380)\..."


---
### Make new feature of combined TMDB and IMDB plots
We create a new column that concatenates the TMDB and IMDB plots for each movie. The combined text may offer a richer representation of each plot and improve precision and recall in the modeling that follows.

In [3]:
movies['combined_plots'] = movies['tmdb_plot'] + ' ' + movies['imdb_plot']

---
### Apply Count Vectorizer and TFIDF to Plots

Sklearn's TfidfVectorizer applies both a count vectorizer and a Term Frequency Inverse Document Frequency (TFIDF) transformation to our plots. The count vectorizer converts our collection of plots into a matrix of word counts. We set max_df=0.95 and min_df=0.005 to drop words that appear in more than 95% or fewer than 0.5% of the plots in our corpus, and we also remove English stop words in case those thresholds miss any. The TFIDF transformation then weights each count by how rare the word is across the corpus and normalizes each row, giving one TFIDF matrix per feature: TMDB, IMDB, and combined plots.
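
To make the rarity weighting concrete, here is a minimal pure-Python sketch of the scoring that TfidfVectorizer applies with its default settings: raw term counts, a smoothed inverse document frequency, then L2 normalization of each row. The helper name `tfidf_rows` and the toy token lists are illustrative assumptions, not part of our pipeline.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return (vocab, rows); each row is an L2-normalized TFIDF vector."""
    n_docs = len(tokenized_docs)
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    # document frequency: number of documents that contain each term
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    # smoothed idf, as in TfidfVectorizer's default smooth_idf=True
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# hypothetical, already-tokenized toy plots
toy_plots = [
    ["crime", "family", "saga"],
    ["crime", "murder", "prison"],
    ["war", "family", "story"],
]
vocab, rows = tfidf_rows(toy_plots)
```

In the first toy plot, "saga" appears in only one document while "crime" appears in two, so "saga" receives the larger weight even though both occur once in that plot.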

In [4]:
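#ignore terms in more than 95% or fewer than 0.5% of plots, plus English stop words
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=0.005,
                                   stop_words=get_stop_words('en'))

#the vectorizer is refit on each column, so every matrix has its own vocabulary
tmdb_bow = tfidf_vectorizer.fit_transform(movies['tmdb_plot'])
imdb_bow = tfidf_vectorizer.fit_transform(movies['imdb_plot'])
combined_bow = tfidf_vectorizer.fit_transform(movies['combined_plots'])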

#check shape of resulting matrices
tmdb_bow.shape, imdb_bow.shape, combined_bow.shape

((1000, 1171), (1000, 2442), (1000, 3025))

As seen above, each TMDB plot is now represented by an array of length 1171, each IMDB plot by an array of length 2442, and each combined plot by an array of length 3025. Each entry in an array is the TFIDF score of one vocabulary term in that plot. Most entries are zero, since a plot averages only about 50 words for TMDB and 100 words for IMDB.
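
Because the cell above refits a single vectorizer on each column, the three matrices have different vocabularies, and column j in one matrix does not refer to the same word as column j in another. If the term-to-column mapping is needed later, one option is to keep a separate fitted vectorizer per column. The sketch below illustrates this on toy data and also measures how sparse a matrix is; the helper `fit_bow`, the toy plots, and sklearn's built-in stop word list are assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy stand-ins for the plot columns (hypothetical, not from movies.csv)
toy_tmdb = ["a banker is framed for murder", "a crime family saga"]
toy_imdb = ["a banker goes to prison", "the aging head of a crime family"]


def fit_bow(texts):
    """Fit a fresh vectorizer; return it together with its TFIDF matrix."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(texts)
    return vectorizer, matrix


tmdb_vec, toy_tmdb_bow = fit_bow(toy_tmdb)
imdb_vec, toy_imdb_bow = fit_bow(toy_imdb)

# vocabulary_ maps each term to its column index in the matching matrix
tmdb_vocab = tmdb_vec.vocabulary_
imdb_vocab = imdb_vec.vocabulary_

# fraction of non-zero entries in the sparse TFIDF matrix
n_rows, n_cols = toy_tmdb_bow.shape
density = toy_tmdb_bow.nnz / (n_rows * n_cols)
```

On the real plots the density should be far lower than on this toy data, since each plot uses only a small share of the vocabulary.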

---

### Add Bag-of-Words Matrices to Movies DF and save

We store each row of the TFIDF matrices in a list so that pandas can save it as a new column of the movies DataFrame, ready to import in the modeling and analysis notebook. We also save the full matrices as numpy arrays for easier importing.

In [5]:
#each plot is a row vector of the TFIDF sparse matrices
movies['tmdb_bow_plot'] = [plot for plot in tmdb_bow]
movies['imdb_bow_plot'] = [plot for plot in imdb_bow]
movies['combined_bow_plots'] = [plot for plot in combined_bow]

#resave movies csv
movies.to_csv('data/movies.csv', encoding="utf-8", index=False)

#save TFIDF matrices as numpy arrays
np.save('data/tmdb_bow.npy',tmdb_bow.toarray())
np.save('data/imdb_bow.npy',imdb_bow.toarray())
np.save('data/combined_bow.npy', combined_bow.toarray())
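
The bag-of-words columns written to the CSV come back as strings (see the `tmdb_bow_plot` values in the `movies.head(5)` output above), so the `.npy` files are the more convenient way to reload the matrices. The sketch below round-trips a toy matrix through `np.save`/`np.load`, and also through `scipy.sparse.save_npz`/`load_npz`, an alternative that avoids converting to a dense array. The toy matrix and temporary paths are assumptions for illustration only.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# toy stand-in for one of the TFIDF matrices (hypothetical values)
toy_bow = sparse.csr_matrix(np.array([[0.0, 0.6, 0.8],
                                      [1.0, 0.0, 0.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    dense_path = os.path.join(tmp_dir, "toy_bow.npy")
    sparse_path = os.path.join(tmp_dir, "toy_bow.npz")

    # dense round trip, mirroring the np.save calls in the cell above
    np.save(dense_path, toy_bow.toarray())
    dense_loaded = np.load(dense_path)

    # sparse round trip, which stores only the non-zero entries
    sparse.save_npz(sparse_path, toy_bow)
    sparse_loaded = sparse.load_npz(sparse_path)
```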

### We have now transformed and saved each plot as a bag-of-words TFIDF representation.