# Loading and Cleaning our Movie Data
We'll first download our list of movies and their descriptions from [TMDB](https://www.themoviedb.org/?language=en-US). Then we'll clean the data, download a [word2vec](https://en.wikipedia.org/wiki/Word2vec) model, and finally do some feature engineering to tokenize and vectorize the descriptions.

We've split these steps into 6 separate scripts and linked them all in one notebook for simplicity.



### Downloading the list of movies

In [9]:
run src/data/movie_list.py  

Pulling movie list of popular movies, Please wait...
	While you wait, here are some sampling of the movies that are being pulled...
		Ashfall
		Legionnaire's Trail
		Naruto Shippuden the Movie
		Terminator: Dark Fate
		The Courier
		Baba Yaga: Terror of the Dark Forest
		Goblin Slayer: Goblin's Crown
		Over the Moon
		Spider-Man: Homecoming
	10/51 done
	******* Waiting a few seconds to stay within rate limits of TMDB... *******)
		Antebellum
		Shrek
		Danger Close
		Vivarium
		Logan
		The Right One
		Attack on Titan
		The Driver
		Teenage Mutant Ninja Turtles
		Kingsman: The Secret Service
	20/51 done
	******* Waiting a few seconds to stay within rate limits of TMDB... *******)
		La sabiduría
		Better Days
		King Kong
		Batman v Superman: Dawn of Justice
		Incredibles 2
		The Equalizer 2
		Overcomer
		The Half of It
		Big Time Adolescence
		Golden Job
	30/51 done
	******* Waiting a few seconds to stay within rate limits of TMDB... *******)
		The Amazing Spider-Man 2
		Fifty Shades of B
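
The log above shows the pull pausing after every 10 of 51 requests to respect TMDB's rate limits. The script itself is not shown here, so the following is only a minimal sketch of that batching pattern. The names `pull_pages` and `fake_fetch_page`, the batch size, and the pause length are all assumptions for illustration; a real `fetch_page` would call the TMDB API.

```python
import time
from typing import Callable, Dict, List


def pull_pages(
    fetch_page: Callable[[int], List[Dict]],
    n_pages: int,
    batch_size: int = 10,
    pause_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict]:
    """Fetch pages 1..n_pages, pausing after each full batch."""
    movies: List[Dict] = []
    for page in range(1, n_pages + 1):
        movies.extend(fetch_page(page))
        # Pause after every full batch, but not after the final page.
        if page % batch_size == 0 and page < n_pages:
            print(f"\t{page}/{n_pages} done")
            sleep(pause_seconds)
    return movies


def fake_fetch_page(page: int) -> List[Dict]:
    """Hypothetical stand-in for a real TMDB request (one movie per page)."""
    return [{"id": page, "title": f"Movie {page}"}]


# Record the pauses instead of actually sleeping, so the demo runs instantly.
pauses: List[float] = []
movies = pull_pages(fake_fetch_page, n_pages=51, sleep=pauses.append)
print(len(movies), "movies pulled with", len(pauses), "pauses")
```

Passing `sleep` in as a parameter keeps the pause logic testable without waiting on a real clock.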

### Downloading the movies' descriptions

In [10]:
run src/data/overviews.py

Loading the list of de-duped movies from data/interim/no_duplicate_movies.pkl...
Loaded the list of de-duped movies from data/interim/no_duplicate_movies.pkl.

Creating a dataset where each movie must have an associated overview...
Done! Created a dataset where each movie must have an associated overview.

Saving the list of movies that have overviews (movies_with_overviews) as data/interim/movies_with_overviews.pkl....
	Here are the first entry in movies_with_overviews:
{   'adult': False,
    'backdrop_path': '/6O0lsCK90jZCyPvyYSt8Szzlnd6.jpg',
    'genre_ids': [18, 36, 10752],
    'id': 324786,
    'original_language': 'en',
    'original_title': 'Hacksaw Ridge',
    'overview': 'WWII American Army Medic Desmond T. Doss, who served during '
                'the Battle of Okinawa, refuses to kill people and becomes the '
                'first Conscientious Objector in American history to receive '
                'the Congressional Medal of Honor.',
    'popularity': 89.492,
    'po
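
The step above starts from a de-duplicated list and keeps only movies that have an overview. As a rough sketch of what those two filters might look like, assuming each movie is a dict with `id` and `overview` keys as in the printed entry (the helper names `dedupe_by_id` and `keep_with_overviews` are hypothetical, not taken from the scripts):

```python
from typing import Dict, List


def dedupe_by_id(movies: List[Dict]) -> List[Dict]:
    """Keep the first occurrence of each movie id, preserving order."""
    seen = set()
    unique = []
    for movie in movies:
        if movie["id"] not in seen:
            seen.add(movie["id"])
            unique.append(movie)
    return unique


def keep_with_overviews(movies: List[Dict]) -> List[Dict]:
    """Drop movies whose overview is missing, None, or only whitespace."""
    return [m for m in movies if (m.get("overview") or "").strip()]


# Toy records for illustration; only the first one mirrors the output above.
raw = [
    {"id": 324786, "original_title": "Hacksaw Ridge", "overview": "WWII American Army Medic..."},
    {"id": 324786, "original_title": "Hacksaw Ridge", "overview": "WWII American Army Medic..."},
    {"id": 1, "original_title": "No Overview", "overview": ""},
    {"id": 2, "original_title": "Missing Overview"},
]

movies_with_overviews = keep_with_overviews(dedupe_by_id(raw))
print([m["original_title"] for m in movies_with_overviews])
```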

### Cleaning and de-duping the data

In [11]:
run src/data/cleaning_data.py

Loaded the list of movies that have overviews from data/interim/movies_with_overviews.pkl.

Extracting the genres and movie ids in prep for binarizination...
Binarizing the list of genres to create the target variable Y.
Done! Y created. Shape of Y is 
(1690, 19)


Creating a mapping from the genre ids to the genre names...
Mapping from genre id to genre name is saved in the Genre_ID_to_name dictionary:
{   12: 'Adventure',
    14: 'Fantasy',
    16: 'Animation',
    18: 'Drama',
    27: 'Horror',
    28: 'Action',
    35: 'Comedy',
    36: 'History',
    37: 'Western',
    53: 'Thriller',
    80: 'Crime',
    99: 'Documentary',
    878: 'Science Fiction',
    9648: 'Mystery',
    10402: 'Music',
    10749: 'Romance',
    10751: 'Family',
    10752: 'War',
    10770: 'TV Movie'}


Saved the mapping from genre id to genre name as data/processed/Genredict.pkl.
Saved the target variable Y to data/processed/Y.pkl.

	Here are the first few lines of Y:
	[[0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 
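
To make the shape of `Y` concrete: each row has one column per genre (19 here), with a 1 wherever the movie has that genre. The sketch below reproduces that encoding in plain Python, assuming the columns follow the sorted genre ids as in the mapping printed above. The function names are illustrative; the actual script presumably uses a library binarizer instead.

```python
from typing import Iterable, List

GENRE_ID_TO_NAME = {
    12: "Adventure", 14: "Fantasy", 16: "Animation", 18: "Drama",
    27: "Horror", 28: "Action", 35: "Comedy", 36: "History",
    37: "Western", 53: "Thriller", 80: "Crime", 99: "Documentary",
    878: "Science Fiction", 9648: "Mystery", 10402: "Music",
    10749: "Romance", 10751: "Family", 10752: "War", 10770: "TV Movie",
}

# Assumption: one column per genre id, in ascending id order.
CLASSES = sorted(GENRE_ID_TO_NAME)
COLUMN_OF = {genre_id: col for col, genre_id in enumerate(CLASSES)}


def binarize(genre_ids: Iterable[int]) -> List[int]:
    """Turn a list of genre ids into a multi-hot row of 0s and 1s."""
    row = [0] * len(CLASSES)
    for genre_id in genre_ids:
        row[COLUMN_OF[genre_id]] = 1
    return row


def genre_names(row: List[int]) -> List[str]:
    """Inverse transform: map a multi-hot row back to genre names."""
    return [GENRE_ID_TO_NAME[CLASSES[col]] for col, flag in enumerate(row) if flag]


# Genre ids of the first movie shown earlier (Hacksaw Ridge).
first_row = binarize([18, 36, 10752])
print(first_row)
print(genre_names(first_row))
```

Under this assumption, the ids `[18, 36, 10752]` set columns 3, 7, and 17, which agrees with the visible part of the first row of `Y` above.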

### Feature engineering

In [12]:
run src/features/feature_eng.py

Loaded the target variable from to data/processed/Y.pkl.

Loaded the list of de-duped movies with overviews from data/interim/movies_with_overviews.pkl.
Loaded the mapping from genre id to genre name from data/processed/Genredict.pkl.
Removed punctuation from the overviews.
Vectorized the text of the overviews using the CountVectorizer from scikit-learn. This is basically the bag of words model.
	Shape of X with count vectorizer:
	(1690, 1213)
	Saved X to data/processed/X.pkl and the vectorizer as models/count_vectorizer.pkl.
	Here are the first row of X (remember that it is a sparse matrix):
	   (0, 1204)	1
  (0, 41)	2
  (0, 58)	1
  (0, 1175)	1
  (0, 299)	1
  (0, 1044)	3
  (0, 84)	1
  (0, 733)	2
  (0, 1071)	2
  (0, 572)	1
  (0, 774)	1
  (0, 45)	1
  (0, 91)	1
  (0, 393)	1
  (0, 528)	1
  (0, 497)	1
Vectorized the text of the overviews using the TfidfVectorizer from scikit-learn.
	Shape of X with TF-IDF vectorizer:
	(1690, 1213)
	Saved X_tfidf to data/processed/X_tfidf.pkl and the vector
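
For readers new to the bag-of-words model, here is a small dependency-free sketch of the idea: strip punctuation, build a vocabulary, and count how often each vocabulary word occurs in a document. It is a deliberate simplification and not scikit-learn's `CountVectorizer`, whose tokenization and options differ; the function names and toy sentences are made up for illustration.

```python
import string
from collections import Counter
from typing import Dict, List


def remove_punctuation(text: str) -> str:
    """Delete all ASCII punctuation characters from the text."""
    return text.translate(str.maketrans("", "", string.punctuation))


def build_vocabulary(docs: List[str]) -> Dict[str, int]:
    """Map each distinct lowercase word to a column index (alphabetical order)."""
    words = sorted({word for doc in docs for word in doc.lower().split()})
    return {word: col for col, word in enumerate(words)}


def count_vector(doc: str, vocab: Dict[str, int]) -> Dict[int, int]:
    """Sparse row: column index -> count, storing only non-zero entries."""
    counts = Counter(doc.lower().split())
    return {vocab[word]: n for word, n in counts.items() if word in vocab}


overviews = ["A medic refuses to kill.", "A spy, a medic, and a war."]
cleaned = [remove_punctuation(text) for text in overviews]
vocab = build_vocabulary(cleaned)
rows = [count_vector(doc, vocab) for doc in cleaned]

print(vocab)
for i, row in enumerate(rows):
    for col, n in sorted(row.items()):
        print(f"  ({i}, {col})\t{n}")
```

Storing only the non-zero counts is the same idea as the sparse matrix printed above, where each `(row, column)` pair is followed by its count.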

### Downloading [Word2Vec](https://en.wikipedia.org/wiki/Word2vec)

In [13]:
!sh src/models/get_word2vec.sh

Downloading the SLIMMED word2vec model...
--2021-03-11 20:08:43--  https://github.com/eyaler/word2vec-slim/raw/master/GoogleNews-vectors-negative300-SLIM.bin.gz
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://media.githubusercontent.com/media/eyaler/word2vec-slim/master/GoogleNews-vectors-negative300-SLIM.bin.gz [following]
--2021-03-11 20:08:43--  https://media.githubusercontent.com/media/eyaler/word2vec-slim/master/GoogleNews-vectors-negative300-SLIM.bin.gz
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 276467217 (264M) [application/octet-stream]
Saving to: ‘GoogleNews-vectors-negative300-SLIM.bin.gz’


20

### Feature engineering using [Word2Vec](https://en.wikipedia.org/wiki/Word2vec)

In [14]:
run src/features/word2vec_features.py

Loaded the list of de-duped movies with overviews from data/interim/movies_with_overviews.pkl.
Loaded the GoogleNews Slimmed Word2Vec model.
Tokenized all overviews.
Removed stopwords.
Calculated the mean word2vec vector for each overview.
Created a multi-label binarizer for genres.
Transformed the target variable for each movie using the multi-label binarizer to an array or arrays.
	For a movie with genre ids [36, 53, 10752], we create Y for the movie as [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0].
Saved the mean word2vec vector for each overview (X) and the binarized target (Y) as textual_features=(X,Y) into data/processed/textual_features.pkl.
Saved the multi-label binarizer so we can do the inverse transform later as models/mlb.pkl.
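
The key step above is turning each overview into a single vector by averaging the word2vec vectors of its words. The sketch below shows that averaging with a tiny made-up embedding table in place of the real 300-dimensional GoogleNews model, and plain Python lists in place of arrays. The stopword set, the 3-dimensional vectors, and the choice to return zeros when no word is known are all assumptions for illustration.

```python
from typing import Dict, List

# Toy stand-ins: the real model has far more words and 300 dimensions.
EMBEDDINGS: Dict[str, List[float]] = {
    "medic": [1.0, 0.0, 2.0],
    "war": [3.0, 2.0, 0.0],
}
STOPWORDS = {"a", "the", "in", "of", "and"}
DIM = 3


def tokenize(text: str) -> List[str]:
    """Lowercase and split on whitespace (a simplified tokenizer)."""
    return text.lower().split()


def mean_vector(tokens: List[str]) -> List[float]:
    """Average the vectors of all tokens found in the embedding table."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        # Assumption: fall back to a zero vector when no token is known.
        return [0.0] * DIM
    return [sum(component) / len(vectors) for component in zip(*vectors)]


overview = "A medic in the war"
tokens = [t for t in tokenize(overview) if t not in STOPWORDS]
features = mean_vector(tokens)
print(tokens, features)
```

Averaging discards word order, but it gives every overview a fixed-length feature vector regardless of how long the text is.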
