# Loading and Cleaning our Movie Data
We'll first download our list of movies and their descriptions from [TMDB](https://www.themoviedb.org/?language=en-US), then clean the data. After that we download [word2vec](https://en.wikipedia.org/wiki/Word2vec) and, finally, do some feature engineering to tokenize the descriptions.

We've split these steps into 6 separate scripts and linked them all in one notebook for simplicity.



### Downloading the list of Movies

In [None]:
run src/data/movie_list.py  

Pulling movie list of popular movies, Please wait...
	While you wait, here are some sampling of the movies that are being pulled...
		Ashfall
		Legionnaire's Trail
		Naruto Shippuden the Movie
		Terminator: Dark Fate
		The Courier
		Baba Yaga: Terror of the Dark Forest
		Goblin Slayer: Goblin's Crown
		Over the Moon
		Spider-Man: Homecoming
	10/51 done
	******* Waiting a few seconds to stay within rate limits of TMDB... *******)
		Antebellum
		Shrek
		Danger Close
		Vivarium
		Logan
		The Right One
		Attack on Titan
		The Driver
		Teenage Mutant Ninja Turtles
		Kingsman: The Secret Service
	20/51 done
	******* Waiting a few seconds to stay within rate limits of TMDB... *******)
		La sabiduría
		Better Days
		King Kong
		Batman v Superman: Dawn of Justice
		Incredibles 2
		The Equalizer 2
		Overcomer
		The Half of It
		Big Time Adolescence
		Golden Job
	30/51 done
	******* Waiting a few seconds to stay within rate limits of TMDB... *******)
		The Amazing Spider-Man 2
		Fifty Shades of B
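
The log above reports progress every ten items and then pauses to stay within TMDB's rate limits. Below is a minimal sketch of that batching pattern. It is not the contents of `src/data/movie_list.py`: the names `fetch_in_batches` and `fetch_one`, the batch size, and the pause length are all assumptions made for illustration.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def fetch_in_batches(
    items: Iterable[T],
    fetch_one: Callable[[T], R],      # hypothetical: one API call per item
    batch_size: int = 10,             # assumed from the "10/51 done" log lines
    pause_seconds: float = 3.0,       # assumed value; the log only says "a few seconds"
    sleep: Callable[[float], None] = time.sleep,
) -> List[R]:
    """Call fetch_one on every item, pausing after each full batch."""
    items = list(items)
    results: List[R] = []
    for count, item in enumerate(items, start=1):
        results.append(fetch_one(item))
        # Pause only between batches, not after the final item.
        if count % batch_size == 0 and count < len(items):
            print(f"\t{count}/{len(items)} done")
            sleep(pause_seconds)
    return results


if __name__ == "__main__":
    # Stand-in for a real request: just echo the item back.
    titles = fetch_in_batches(["Shrek", "Logan"], lambda t: t, pause_seconds=0)
    print(titles)
```

Passing `sleep` as a parameter keeps the pause easy to stub out when trying the function without waiting.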

### Downloading the movie descriptions

In [None]:
run src/data/overviews.py

### Cleaning and de-duping the data

In [None]:
run src/data/cleaning_data.py
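
As a rough illustration of what cleaning and de-duping could look like, the sketch below keeps the first occurrence of each movie and drops entries with no description. It is a guess at the general idea, not the logic of `src/data/cleaning_data.py`: the function name `dedupe_movies` and the assumption that each movie is a dict with `id` and `overview` keys are hypothetical.

```python
from typing import Any, Dict, List


def dedupe_movies(movies: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first movie seen for each id, skipping empty overviews."""
    seen_ids = set()
    cleaned = []
    for movie in movies:
        # Treat a missing, None, or whitespace-only overview as empty.
        overview = (movie.get("overview") or "").strip()
        if not overview or movie["id"] in seen_ids:
            continue
        seen_ids.add(movie["id"])
        cleaned.append(movie)
    return cleaned


if __name__ == "__main__":
    sample = [
        {"id": 1, "overview": "An ogre rescues a princess."},
        {"id": 1, "overview": "An ogre rescues a princess."},  # duplicate id
        {"id": 2, "overview": ""},                             # no description
    ]
    print(dedupe_movies(sample))
```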

### Feature engineering

In [None]:
run src/features/feature_eng.py

### Downloading [Word2Vec](https://en.wikipedia.org/wiki/Word2vec)

In [7]:
!sh src/models/get_word2vec.sh

Downloading the SLIMMED word2vec model...
--2021-03-11 18:26:31--  https://github.com/eyaler/word2vec-slim/raw/master/GoogleNews-vectors-negative300-SLIM.bin.gz
Resolving github.com (github.com)... 192.30.255.112
Connecting to github.com (github.com)|192.30.255.112|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://media.githubusercontent.com/media/eyaler/word2vec-slim/master/GoogleNews-vectors-negative300-SLIM.bin.gz [following]
--2021-03-11 18:26:31--  https://media.githubusercontent.com/media/eyaler/word2vec-slim/master/GoogleNews-vectors-negative300-SLIM.bin.gz
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 276467217 (264M) [application/octet-stream]
Saving to: ‘GoogleNews-vectors-negative300-SLIM.bin.gz’



### Feature engineering using [Word2Vec](https://en.wikipedia.org/wiki/Word2vec)

In [8]:
run src/features/word2vec_features.py

Loaded the list of de-duped movies with overviews from data/interim/movies_with_overviews.pkl.
Loaded the GoogleNews Slimmed Word2Vec model.
Tokenized all overviews.
Removed stopwords.
Calculated the mean word2vec vector for each overview.
Created a multi-label binarizer for genres.
Transformed the target variable for each movie using the multi-label binarizer to an array or arrays.
	For a movie with genre ids [36, 53, 10752], we create Y for the movie as [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0].
Saved the mean word2vec vector for each overview (X) and the binarized target (Y) as textual_features=(X,Y) into data/processed/textual_features.pkl.
Saved the multi-label binarizer so we can do the inverse transform later as models/mlb.pkl.
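
The steps logged above (tokenize, remove stopwords, average the word vectors, binarize the genres) can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `src/features/word2vec_features.py`: the helper names, the tiny stopword set, and the two-dimensional toy vectors are made up, and the class list is assumed to be the 19 TMDB movie genre ids in sorted order. Under that assumption, the sketch reproduces the Y vector shown in the log for genre ids `[36, 53, 10752]`.

```python
import re
from typing import Dict, List, Sequence

# Assumption: the 19 TMDB movie genre ids, sorted ascending.
GENRE_IDS = sorted([12, 14, 16, 18, 27, 28, 35, 36, 37, 53, 80, 99,
                    878, 9648, 10402, 10749, 10751, 10752, 10770])

# Toy stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def mean_vector(tokens: Sequence[str],
                vectors: Dict[str, List[float]],
                dim: int) -> List[float]:
    """Average the vectors of tokens found in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim  # no known words: fall back to a zero vector
    return [sum(column) / len(known) for column in zip(*known)]


def binarize(genre_ids: Sequence[int], classes: Sequence[int]) -> List[int]:
    """Multi-hot encode genre_ids against the ordered class list."""
    present = set(genre_ids)
    return [1 if c in present else 0 for c in classes]


if __name__ == "__main__":
    toy_vectors = {"soldier": [1.0, 0.0], "spy": [0.0, 1.0]}  # stand-in for word2vec
    tokens = [t for t in tokenize("The soldier and the spy") if t not in STOPWORDS]
    x = mean_vector(tokens, toy_vectors, dim=2)
    y = binarize([36, 53, 10752], GENRE_IDS)
    print(x)
    print(y)
```

With the real model, each vector would have 300 dimensions (as the `negative300` file name suggests) rather than two.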
