# Report Miscellaneous

The 'movies.csv' file read in below has already been preprocessed as follows:
- Genres for each film from both TMDB and IMDB have been binarized using sklearn's MultiLabelBinarizer.
- Both sets of plots have been cleaned, tokenized, and stripped of stop words.
- Both sets of plots have had a 300-dimension word2vec transformation applied using the Google News word2vec model.
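
The genre binarization described in the list above can be sketched as follows. This is a minimal illustration, not the preprocessing notebook's actual code: the sample ID lists are copied from films shown later in this report, and fitting on just those two lists is an assumption made to keep the example small.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Sample TMDB genre-ID lists (The Godfather and Schindler's List in this report)
sample_genres = [[18, 80], [18, 36, 10752]]

# ASSUMPTION: fitting on the sample alone is for illustration only; the real
# binarizer was presumably fit on the genres of the whole dataset.
mlb = MultiLabelBinarizer()
binary = mlb.fit_transform(sample_genres)

print(mlb.classes_)  # sorted genre IDs, one per output column
print(binary)        # one row per film, 1 where the film has that genre
```

Because a film can carry several genres at once, the result is a multi-label indicator matrix, which is the target format that `OneVsRestClassifier` accepts.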
TODO: add the bag-of-words feature to the list above and explain it, once that notebook is written.
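
The plot cleaning mentioned in the list above might look roughly like the sketch below. The regular expression, the tiny stop-word set, and the `clean_plot` helper are all hypothetical stand-ins; the source does not say which tokenizer or stop-word list was used.

```python
import re

# Hypothetical, deliberately tiny stop-word set; the real list is not given here
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

def clean_plot(text):
    """Lowercase a plot, split it into alphanumeric tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(clean_plot("Framed in the 1940s for the double murder"))
```

For this fragment the sketch yields the same leading tokens that appear in the `combined_clean_plot` column for the first film in the table below.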

In [36]:
#import libraries
import pandas as pd
import requests
from ast import literal_eval
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

In [57]:
# Columns in 'movies.csv' that hold list-valued data; read them with literal_eval
# so they load as Python lists rather than strings
converterList = ['imdb_genres', 'tmdb_genres', 'binary_tmdb', 'binary_imdb',
               'imdb_w2v_plot', 'tmdb_w2v_plot']

converterDict = {column: literal_eval for column in converterList}

movies = pd.read_csv('data/movies.csv', encoding='utf-8',
                     converters=converterDict)

movies.head(21)

Unnamed: 0,tmdb_id,imdb_id,tmdb_genres,imdb_genres,binary_tmdb,binary_imdb,tmdb_plot,imdb_plot,popularity,release_date,...,imdb_bow_plot,combined_plots,combined_bow_plots,combined_clean_plot,tmdb_w2v_plot_mean,imdb_w2v_plot_mean,combined_w2v_plot_mean,tmdb_w2v_plot_matrix,imdb_w2v_plot_matrix,combined_w2v_plot_matrix
0,278,tt0111161,"[18, 80]","[80, 18]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...",Framed in the 1940s for the double murder of h...,Chronicles the experiences of a formerly succe...,28.527767,1994-09-23,...,"(0, 398)\t0.22753905256972778\n (0, 759)\t0...",Framed in the 1940s for the double murder of h...,"(0, 1092)\t0.15089615016031976\n (0, 811)\t...","['framed', '1940s', 'double', 'murder', 'wife'...",[ 1.41657051e-02 3.57291475e-02 3.55668515e-...,[ 4.66356799e-03 9.01858658e-02 -1.24760680e-...,[ 0.00908005 0.064875 0.00985374 0.060550...,[[-0.08300781 0.25390625 0.07128906 ... -0.1...,[[ 0.0201416 0.11474609 -0.35742188 ... -0.0...,[[-0.08300781 0.25390625 0.07128906 ... -0.1...
1,238,tt0068646,"[18, 80]","[80, 18]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...",Spanning the years 1945 to 1955 a chronicle o...,When the aging head of a famous crime family d...,36.965452,1972-03-14,...,"(0, 515)\t0.17259715509464205\n (0, 938)\t0...",Spanning the years 1945 to 1955 a chronicle o...,"(0, 1773)\t0.10485484905546055\n (0, 287)\t...","['spanning', 'years', '1945', '1955', 'chronic...",[-0.01682084 0.05966978 -0.00681898 0.042978...,[-0.01332631 0.0813482 0.03576481 0.067564...,[-1.48730669e-02 7.17528313e-02 1.69162434e-...,[[ 0.05175781 0.02502441 -0.12255859 ... 0.0...,[[-0.07470703 0.49804688 -0.07373047 ... 0.2...,[[ 0.05175781 0.02502441 -0.12255859 ... 0.0...
2,424,tt0108052,"[18, 36, 10752]","[18, 36]","[0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, ...",The true story of how businessman Oskar Schind...,Oskar Schindler is a vainglorious and greedy G...,19.945455,1993-11-29,...,"(0, 916)\t0.40896979889639457\n (0, 317)\t0...",The true story of how businessman Oskar Schind...,"(0, 2911)\t0.09695795170181548\n (0, 2774)\...","['true', 'story', 'businessman', 'oskar', 'sch...",[ 0.07589068 0.02254813 0.0643049 0.117789...,[ 0.05338115 0.10281134 0.01086032 0.044059...,[ 6.03841133e-02 7.78405592e-02 2.74875220e-...,[[ 1.27929688e-01 4.78515625e-02 1.06933594e...,[[ 0.06542969 0.06054688 0.00114441 ... -0.0...,[[ 0.12792969 0.04785156 0.10693359 ... 0.0...
3,240,tt0071562,"[18, 80]","[80, 18]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...",In the continuing saga of the Corleone crime f...,The continuing saga of the Corleone crime fami...,30.191804,1974-12-20,...,"(0, 515)\t0.21968270215051702\n (0, 1494)\t...",In the continuing saga of the Corleone crime f...,"(0, 1821)\t0.12839540573874353\n (0, 649)\t...","['continuing', 'saga', 'corleone', 'crime', 'f...",[-5.79080023e-02 7.11167306e-02 -6.58677071e-...,[-5.15192188e-02 7.89628476e-02 -4.06892076e-...,[-5.43040745e-02 7.55427405e-02 -5.16644493e-...,[[-0.0324707 0.21679688 -0.1484375 ... 0.2...,[[-0.0324707 0.21679688 -0.1484375 ... 0.2...,[[-0.0324707 0.21679688 -0.1484375 ... 0.2...
4,452522,tt0278784,"[18, 9648]","[80, 18, 9648, 53]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, ...",Standalone version of the series pilot with an...,When beautiful young Laura Palmer is found br...,5.969249,1989-12-31,...,"(0, 875)\t0.15459130001922888\n (0, 2438)\t...",Standalone version of the series pilot with an...,"(0, 1088)\t0.12464518166470029\n (0, 2380)\...","['standalone', 'version', 'series', 'pilot', '...",[-0.05888228 -0.0534557 -0.06566273 0.057356...,[ 2.65587703e-03 1.01409487e-01 1.30335495e-...,[-1.17466701e-02 6.51644468e-02 -1.52680418e-...,[[ 0.01477051 -0.33203125 -0.37109375 ... -0.0...,[[-0.01831055 0.05566406 -0.01153564 ... -0.3...,[[ 0.01477051 -0.33203125 -0.37109375 ... -0.0...
5,244786,tt2582802,[18],"[18, 10402]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ...",Under the direction of a ruthless instructor ...,A young and talented drummer attending a prest...,29.936676,2014-10-10,...,"(0, 1538)\t0.09838409046604597\n (0, 2438)\...",Under the direction of a ruthless instructor ...,"(0, 287)\t0.10942027714973482\n (0, 1877)\t...","['direction', 'ruthless', 'instructor', 'talen...",[ 3.56038399e-02 6.33951798e-02 5.07577248e-...,[ 0.01593236 0.0298778 0.04798644 0.080450...,[ 0.0218338 0.03993301 0.04881782 0.065518...,[[ 0.0378418 -0.15820312 -0.01782227 ... 0.2...,[[ 0.09472656 0.328125 -0.04858398 ... -0.0...,[[ 0.0378418 -0.15820312 -0.01782227 ... 0.2...
6,333339,tt1677720,"[12, 878, 28]","[28, 12, 878]","[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...","[1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...",When the creator of a popular video game syste...,In the year 2045 the real world is a harsh pl...,81.290391,2018-03-28,...,"(0, 811)\t0.12468076692567583\n (0, 1873)\t...",When the creator of a popular video game syste...,"(0, 2998)\t0.06334582274132743\n (0, 1012)\...","['creator', 'popular', 'video', 'game', 'syste...",[ 1.00656681e-01 1.29921651e-02 -2.78098360e-...,[ 0.05639736 0.04864527 0.01532517 0.062068...,[ 6.24078810e-02 4.38034907e-02 9.46733076e-...,[[ 0.24414062 -0.33789062 0.0625 ... 0.0...,[[ 0.06176758 0.2578125 0.00367737 ... 0.1...,[[ 0.24414062 -0.33789062 0.0625 ... 0.0...
7,680,tt0110912,"[53, 80]","[80, 18]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...",A burger loving hit man his philosophical par...,Jules Winnfield Samuel L Jackson and Vincen...,39.232028,1994-09-10,...,"(0, 79)\t0.14586323660066358\n (0, 1928)\t0...",A burger loving hit man his philosophical par...,"(0, 2956)\t0.07481577202175209\n (0, 649)\t...","['burger', 'loving', 'hit', 'man', 'philosophi...",[ 1.35401972e-02 4.95876744e-02 -6.97609223e-...,[ 1.41215771e-02 3.48640308e-02 -1.52587891e-...,[ 1.39253614e-02 3.98332588e-02 -2.36454001e-...,[[-0.26757812 0.05224609 0.00915527 ... 0.0...,[[-0.01611328 -0.06298828 -0.01116943 ... -0.0...,[[-0.26757812 0.05224609 0.00915527 ... 0.0...
8,282848,tt2986512,"[18, 878]","[12, 18, 10751, 9648, 878]","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...","[1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, ...",Orbiting a quiet backwater planet the massed ...,Orbiting a quiet backwater planet the massed ...,5.976601,2013-12-25,...,"(0, 790)\t0.1193448000922622\n (0, 2191)\t0...",Orbiting a quiet backwater planet the massed ...,"(0, 985)\t0.11128264404294251\n (0, 2709)\t...","['orbiting', 'quiet', 'backwater', 'planet', '...",[ 2.22598799e-02 6.96842000e-02 5.35027012e-...,[ 2.67578121e-02 7.11251423e-02 4.78166863e-...,[ 0.02454144 0.07041511 0.05061849 0.060271...,[[-0.21484375 0.13671875 0.0168457 ... 0.1...,[[-0.21484375 0.13671875 0.0168457 ... 0.1...,[[-0.21484375 0.13671875 0.0168457 ... 0.1...
9,550,tt0137523,[18],[18],"[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",A ticking time bomb insomniac and a slippery s...,A nameless first person narrator Edward Norto...,42.100189,1999-10-15,...,"(0, 1274)\t0.07085365163231709\n (0, 2047)\...",A ticking time bomb insomniac and a slippery s...,"(0, 1821)\t0.06246452495107521\n (0, 1567)\...","['ticking', 'time', 'bomb', 'insomniac', 'slip...",[ 6.64859861e-02 5.38930595e-02 -1.41709847e-...,[ 0.03070623 0.02010276 -0.00186943 0.089182...,[ 4.21410017e-02 3.09017226e-02 -5.80085674e-...,[[-0.14355469 0.12109375 -0.06225586 ... -0.0...,[[ 0.26953125 0.17773438 0.13769531 ... 0.0...,[[-0.14355469 0.12109375 -0.06225586 ... -0.0...
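
To see why the converters are needed: a CSV stores every cell as text, so a list-valued column comes back as a string unless it is parsed. A small sketch, using a cell value copied from the table above (the variable names are illustrative only):

```python
from ast import literal_eval

raw_cell = "[18, 36, 10752]"      # what the CSV reader sees: plain text

parsed = literal_eval(raw_cell)   # safely evaluates the Python literal

print(type(raw_cell).__name__, len(raw_cell))   # a str, counted in characters
print(type(parsed).__name__, len(parsed))       # a list, counted in genres
```

`literal_eval` only accepts Python literals, so unlike `eval` it will not execute arbitrary code found in the file. One caveat: cells that hold space-separated NumPy array text (no commas between values) are not valid Python literals and would need a different parser.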


In [39]:
# Read the TMDB API key; strip the trailing newline so it does not end up in the URL
with open('key.txt', 'r') as f:
    key = f.read().strip()

url = "https://api.themoviedb.org/3/genre/movie/list"
params = {'api_key': key, 'language': 'en-US'}
response = requests.get(url, params=params).json()

# Build lookup tables in both directions between TMDB genre IDs and names
id_to_genre = {g['id']: g['name'] for g in response['genres']}
genre_to_id = {g['name']: g['id'] for g in response['genres']}

id_to_genre

{12: 'Adventure',
 14: 'Fantasy',
 16: 'Animation',
 18: 'Drama',
 27: 'Horror',
 28: 'Action',
 35: 'Comedy',
 36: 'History',
 37: 'Western',
 53: 'Thriller',
 80: 'Crime',
 99: 'Documentary',
 878: 'Science Fiction',
 9648: 'Mystery',
 10402: 'Music',
 10749: 'Romance',
 10751: 'Family',
 10752: 'War',
 10770: 'TV Movie'}

In [72]:
for i in [1, 2, 12, 15, 85]:
    print('Title: ', movies.title[i])
    print('String Genres: ', [id_to_genre[genre] for genre in movies.tmdb_genres[i]])
    print('TMDB Genres: ', movies.tmdb_genres[i])
    print('Binarized Genres: ', movies.binary_tmdb[i])
    print()

Title:  The Godfather
String Genres:  ['Drama', 'Crime']
TMDB Genres:  [18, 80]
Binarized Genres:  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

Title:  Schindler's List
String Genres:  ['Drama', 'History', 'War']
TMDB Genres:  [18, 36, 10752]
Binarized Genres:  [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

Title:  Psycho
String Genres:  ['Drama', 'Horror', 'Thriller']
TMDB Genres:  [18, 27, 53]
Binarized Genres:  [0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

Title:  The Dark Knight
String Genres:  ['Drama', 'Action', 'Crime', 'Thriller']
TMDB Genres:  [18, 28, 80, 53]
Binarized Genres:  [0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0]

Title:  Sing Street
String Genres:  ['Comedy', 'Romance', 'Drama', 'Music']
TMDB Genres:  [35, 10749, 18, 10402]
Binarized Genres:  [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
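
The binarized vectors can also be mapped back to genre names. The sketch below assumes that the 18 columns follow the sorted TMDB genre IDs with Documentary (99) left out; that ordering is inferred from the five examples printed above and is not stated anywhere in the source, so treat it as a guess. `decode_genres` is a hypothetical helper.

```python
# Copy of the id_to_genre mapping printed earlier in this report
id_to_genre = {
    12: 'Adventure', 14: 'Fantasy', 16: 'Animation', 18: 'Drama',
    27: 'Horror', 28: 'Action', 35: 'Comedy', 36: 'History',
    37: 'Western', 53: 'Thriller', 80: 'Crime', 99: 'Documentary',
    878: 'Science Fiction', 9648: 'Mystery', 10402: 'Music',
    10749: 'Romance', 10751: 'Family', 10752: 'War', 10770: 'TV Movie',
}

# ASSUMPTION: columns are the sorted genre IDs without Documentary (99),
# inferred from the printed vectors rather than confirmed by the source.
columns = sorted(gid for gid in id_to_genre if gid != 99)

def decode_genres(binary_vector):
    """Return the genre names whose column is set to 1."""
    return [id_to_genre[gid] for gid, flag in zip(columns, binary_vector) if flag]

godfather = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(decode_genres(godfather))
```

The names come back in column order, which may differ from the order in which TMDB lists them for a given film.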



In [54]:
for i in range(200):
    print(i, movies.tmdb_genres[i])

0 [18, 80]
1 [18, 80]
2 [18, 36, 10752]
3 [18, 80]
4 [18, 9648]
5 [18]
6 [12, 878, 28]
7 [53, 80]
8 [18, 878]
9 [18]
10 [18, 80]
11 [14, 18, 80]
12 [18, 27, 53]
13 [18]
14 [35, 18, 10749]
15 [18, 28, 80, 53]
16 [18]
17 [10751, 16, 10749, 35]
18 [16, 18]
19 [18, 9648]
20 [18, 9648, 878, 53, 10770]
21 [18, 80]
22 [18]
23 [878, 12]
24 [10749, 18]
25 [12, 28, 878]
26 [12, 14, 28]
27 [18]
28 [35]
29 [18]
30 [18, 9648, 53]
31 [35, 18, 10749]
32 [18, 9648, 878, 53]
33 [10751, 16]
34 [18]
35 [80, 9648, 53]
36 [18, 35]
37 [18, 9648, 80]
38 [18, 10752]
39 [80, 18]
40 [18, 10751]
41 [28, 53, 878, 9648, 12]
42 [80, 18, 53]
43 [12, 14, 28]
44 [12, 18, 878]
45 [18, 80, 53]
46 [18, 10749]
47 [12, 14, 28]
48 [12, 28, 878]
49 [9648, 53]
50 [27, 53]
51 [18, 10752]
52 [18, 53]
53 [80, 53]
54 [12, 35, 878, 10751]
55 [28, 35, 9648]
56 [9648, 10749, 53]
57 [80, 18]
58 [18, 10752]
59 [18, 10751, 14]
60 [9648, 18]
61 [35, 18, 10749]
62 [18]
63 [27, 10402]
64 [10751, 16, 14]
65 [878, 18]
66 [10751, 16, 18]
...
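
Instead of scanning the printed lists by eye, the genre frequencies can be tallied directly. This is a minimal sketch using only the first ten lists from the output above; the name mapping is a partial copy of `id_to_genre` covering just the IDs that occur in this sample.

```python
from collections import Counter

# First ten TMDB genre lists, copied from the printed output
first_ten = [
    [18, 80], [18, 80], [18, 36, 10752], [18, 80], [18, 9648],
    [18], [12, 878, 28], [53, 80], [18, 878], [18],
]

# Partial copy of id_to_genre, limited to the IDs present in the sample
id_to_genre = {
    12: 'Adventure', 18: 'Drama', 28: 'Action', 36: 'History',
    53: 'Thriller', 80: 'Crime', 878: 'Science Fiction',
    9648: 'Mystery', 10752: 'War',
}

# Count how many films in the sample carry each genre
genre_counts = Counter(id_to_genre[gid] for genres in first_ten for gid in genres)

for name, count in genre_counts.most_common(3):
    print(name, count)
```

On the full DataFrame the same tally could be run over `movies.tmdb_genres` with the complete mapping.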

In [65]:
movies.tmdb_genres[2]

[18, 36, 10752]