# Loading and Cleaning our Movie Data
We'll start by downloading our list of movies and their descriptions, which were originally pulled from [TMDB](https://www.themoviedb.org/?language=en-US). Next we'll clean the data and download [word2vec](https://en.wikipedia.org/wiki/Word2vec). Finally, we'll do some feature engineering to tokenize the descriptions.

We've split these steps into 6 separate scripts and linked them all in one notebook for simplicity.



In [3]:
import warnings
warnings.filterwarnings('ignore')
import boto3
import io
import pandas as pd
import pickle  # used below to load and save the .pkl files
from src.utils.initialize import *
import pprint

### Downloading the list of Movies from S3

In [4]:
# The credentials for this S3 bucket are stored in our project environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY).
# Note that in reality these would likely either be in our user environment variables, or we would use AWS credential propagation to pull them in automatically from your SSO.
client = boto3.client('s3')

# Download no_duplicate_movies.pkl from se-demo-bucket and write it locally. This is our pull of movie data with some deduplication applied.
# download_file returns None, so there is nothing to assign.
client.download_file('se-demo-bucket', 'movie-demo/data/interim/no_duplicate_movies.pkl', 'data/S3/interim/no_duplicate_movies.pkl')

### Extract all the Movie descriptions

In [5]:
# load no_duplicate_movies
with open('data/S3/interim/no_duplicate_movies.pkl','rb') as f:
    no_duplicate_movies=pickle.load(f)

In [6]:
# Build the list of movies that have a non-empty overview
movies_with_overviews = []
for movie in no_duplicate_movies:
    # skip entries whose overview is empty
    if len(movie['overview']) == 0:
        continue
    movies_with_overviews.append(movie)

In [7]:
len(movies_with_overviews)

1689

In [8]:
with open('data/S3/interim/movies_with_overviews.pkl','wb') as f:
    pickle.dump(movies_with_overviews,f)

In [9]:
print('\tHere is the first entry in movies_with_overviews:')
pprint.pprint(movies_with_overviews[0], indent=4)

	Here is the first entry in movies_with_overviews:
{   'adult': False,
    'backdrop_path': '/6O0lsCK90jZCyPvyYSt8Szzlnd6.jpg',
    'genre_ids': [18, 36, 10752],
    'id': 324786,
    'original_language': 'en',
    'original_title': 'Hacksaw Ridge',
    'overview': 'WWII American Army Medic Desmond T. Doss, who served during '
                'the Battle of Okinawa, refuses to kill people and becomes the '
                'first Conscientious Objector in American history to receive '
                'the Congressional Medal of Honor.',
    'popularity': 89.492,
    'poster_path': '/jhWbYeUNOA5zAb6ufK6pXQFXqTX.jpg',
    'release_date': '2016-10-07',
    'title': 'Hacksaw Ridge',
    'video': False,
    'vote_average': 8.1,
    'vote_count': 9353}


### Cleaning and de-duping the data

In [10]:
run src/data/cleaning_data.py

Loaded the list of movies that have overviews from data/interim/movies_with_overviews.pkl.

Extracting the genres and movie ids in prep for binarizination...
Binarizing the list of genres to create the target variable Y.
Done! Y created. Shape of Y is 
(1689, 19)


Creating a mapping from the genre ids to the genre names...
Mapping from genre id to genre name is saved in the Genre_ID_to_name dictionary:
{   12: 'Adventure',
    14: 'Fantasy',
    16: 'Animation',
    18: 'Drama',
    27: 'Horror',
    28: 'Action',
    35: 'Comedy',
    36: 'History',
    37: 'Western',
    53: 'Thriller',
    80: 'Crime',
    99: 'Documentary',
    878: 'Science Fiction',
    9648: 'Mystery',
    10402: 'Music',
    10749: 'Romance',
    10751: 'Family',
    10752: 'War',
    10770: 'TV Movie'}


Saved the mapping from genre id to genre name as data/processed/Genredict.pkl.
Saved the target variable Y to data/processed/Y.pkl.

	Here are the first few lines of Y:
	[[0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 
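
To make the binarization step concrete, here is a minimal sketch in plain Python of how a list of genre ids turns into one row of Y. It assumes the 19 columns follow the genre ids in ascending order, which is consistent with the rows printed in this notebook. `GENRE_IDS` and `binarize_genres` are illustrative names, not part of `cleaning_data.py`, whose actual implementation is not shown here.

```python
# Genre ids copied from the Genre_ID_to_name mapping printed above.
GENRE_IDS = [12, 14, 16, 18, 27, 28, 35, 36, 37, 53, 80, 99,
             878, 9648, 10402, 10749, 10751, 10752, 10770]


def binarize_genres(genre_ids, classes=GENRE_IDS):
    """Return a 0/1 list with one slot per known genre, in ascending id order."""
    present = set(genre_ids)
    return [1 if genre in present else 0 for genre in sorted(classes)]


# Hacksaw Ridge has genre_ids [18, 36, 10752] (Drama, History, War).
hacksaw_row = binarize_genres([18, 36, 10752])
print(hacksaw_row)
```

Each movie becomes one such row, so stacking the rows for all 1689 movies gives a matrix with the (1689, 19) shape reported above.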

### Feature engineering

In [11]:
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
import random
import sys
import json

In [13]:
# Note: the Spark config is already embedded in the Domino Compute Environment, for seamless use.

# Create the Spark Context
sc=SparkContext.getOrCreate()

In [15]:
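# Monte Carlo estimate of pi, computed on Spark
if len(sys.argv) == 2:
    # note: must be a positive integer (the number of random points to sample)
    NUM_SAMPLES = int(sys.argv[1])
else:
    NUM_SAMPLES = 50000000

def inside(p):
    # p is the sample index and is unused; draw a random point in the unit square
    x, y = random.random(), random.random()
    # True when the point falls inside the quarter circle of radius 1
    return x*x + y*y < 1

# The fraction of points inside the quarter circle approximates pi/4
count = sc.parallelize(range(0, NUM_SAMPLES)).filter(inside).count()
output = float(4.0 * count / float(NUM_SAMPLES))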

In [16]:
run src/features/feature_eng.py

Loaded the target variable from to data/processed/Y.pkl.

Loaded the list of de-duped movies with overviews from data/interim/movies_with_overviews.pkl.
Loaded the mapping from genre id to genre name from data/processed/Genredict.pkl.
Removed punctuation from the overviews.
Vectorized the text of the overviews using the CountVectorizer from scikit-learn. This is basically the bag of words model.
	Shape of X with count vectorizer:
	(1689, 1232)
	Saved X to data/processed/X.pkl and the vectorizer as models/count_vectorizer.pkl.
	Here are the first row of X (remember that it is a sparse matrix):
	   (0, 1224)	1
  (0, 43)	2
  (0, 60)	1
  (0, 1192)	1
  (0, 313)	1
  (0, 1062)	3
  (0, 89)	1
  (0, 745)	2
  (0, 1087)	2
  (0, 585)	1
  (0, 788)	1
  (0, 47)	1
  (0, 96)	1
  (0, 405)	1
  (0, 539)	1
  (0, 507)	1
Vectorized the text of the overviews using the TfidfVectorizer from scikit-learn.
	Shape of X with TF-IDF vectorizer:
	(1689, 1232)
	Saved X_tfidf to data/processed/X_tfidf.pkl and the vector
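
As a rough illustration of the bag-of-words idea mentioned in the output above, the sketch below builds a small count matrix by hand using only the standard library. It is not the code in `feature_eng.py`, which uses scikit-learn's CountVectorizer with settings that are not shown here. The two overviews are made up, and `tokenize` and `bag_of_words` are hypothetical helper names.

```python
import string
from collections import Counter


def tokenize(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()


def bag_of_words(documents):
    """Return (vocabulary, rows), where rows[i][j] counts vocabulary[j] in document i."""
    tokenized = [tokenize(doc) for doc in documents]
    # One column per distinct token, in alphabetical order.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Two made-up overviews, for illustration only.
overviews = ["A medic refuses to kill.", "The medic saves the wounded."]
vocabulary, X_toy = bag_of_words(overviews)
print(vocabulary)
print(X_toy)
```

Most entries in a real count matrix are zero, which is why X above is stored as a sparse matrix and printed as (row, column) count pairs.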

### Downloading [Word2Vec](https://en.wikipedia.org/wiki/Word2vec)

In [17]:
!sh src/models/get_word2vec.sh

Downloading the SLIMMED word2vec model...
--2021-03-11 16:06:48--  https://github.com/eyaler/word2vec-slim/raw/master/GoogleNews-vectors-negative300-SLIM.bin.gz
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://media.githubusercontent.com/media/eyaler/word2vec-slim/master/GoogleNews-vectors-negative300-SLIM.bin.gz [following]
--2021-03-11 16:06:48--  https://media.githubusercontent.com/media/eyaler/word2vec-slim/master/GoogleNews-vectors-negative300-SLIM.bin.gz
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 276467217 (264M) [application/octet-stream]
Saving to: ‘GoogleNews-vectors-negative300-SLIM.bin.gz’


20

### Feature engineering using [Word2Vec](https://en.wikipedia.org/wiki/Word2vec)

In [18]:
run src/features/word2vec_features.py

Loaded the list of de-duped movies with overviews from data/interim/movies_with_overviews.pkl.
Loaded the GoogleNews Slimmed Word2Vec model.
Tokenized all overviews.
Removed stopwords.
Calculated the mean word2vec vector for each overview.
Created a multi-label binarizer for genres.
Transformed the target variable for each movie using the multi-label binarizer to an array or arrays.
	For a movie with genre ids [36, 53, 10752], we create Y for the movie as [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0].
Saved the mean word2vec vector for each overview (X) and the binarized target (Y) as textual_features=(X,Y) into data/processed/textual_features.pkl.
Saved the multi-label binarizer so we can do the inverse transform later as models/mlb.pkl.
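
The "mean word2vec vector" step can be sketched as follows. This is a toy version under stated assumptions: the 3-dimensional vectors and the stopword set are invented for illustration (the GoogleNews model uses 300 dimensions), `mean_vector` is a hypothetical helper rather than a function from `word2vec_features.py`, and skipping words that have no vector is one plausible choice, not necessarily what the script does.

```python
# Hypothetical 3-dimensional embeddings; the real model has 300 dimensions.
TOY_VECTORS = {
    "medic": [1.0, 0.0, 2.0],
    "battle": [3.0, 2.0, 0.0],
    "honor": [2.0, 4.0, 1.0],
}
# Hypothetical stopword list, for illustration only.
TOY_STOPWORDS = {"the", "of", "a", "in"}


def mean_vector(tokens, vectors=TOY_VECTORS, stopwords=TOY_STOPWORDS):
    """Average the vectors of all non-stopword tokens that have an embedding."""
    kept = [vectors[t] for t in tokens if t not in stopwords and t in vectors]
    if not kept:
        return None  # no usable words in this overview
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


tokens = "the medic in the battle of honor".split()
overview_vector = mean_vector(tokens)
print(overview_vector)
```

Averaging gives every overview a fixed-length feature vector regardless of how many words it contains, at the cost of discarding word order.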
