CS109B - Final project: Predicting movie genres
Spring 2017, Harvard
- Angela Ambroz
- Keun-Hwi Lee
- Johanna Ramos
- Pranav Sidhwani
Guide to our project
The main update is in how we pre-processed and cleaned the data. We faced some limitations due to the memory (not storage) constraints of our AWS instance, as well as setbacks in the multi-label vs. multi-class models.
Fetch-and-sample-data.ipynb- This notebook scraped the data from TMDB. We ran it on an EC2 instance over several days.
flatten-data.ipynb- This notebook flattened the data. Features such as cast and crew were lists; we turned each item in a list into its own one-hot encoded feature on the data set.
create-test-train-data.ipynb- This notebook created the test data set, and also downloaded and pre-processed the images for the training and test sets.
aa_ML5_naive_bayes.ipynb- Used in Milestone 3, updated to use the latest train and test data.
jr_SVM_mutilabel_v0.ipynb- Used in Milestone 3, updated to use the latest train and test data.
milestone-5-cnn-and-pretrain-cnn-models.ipynb- Used in Milestone 4, with some tweaks added. Updated to use the latest train and test data.
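As an illustration of the flattening step described for flatten-data.ipynb, here is a minimal sketch in plain Python. It is not the notebook's actual code: the function name `flatten_list_feature`, the record layout, and the `cast_<name>` column naming are all assumptions made for this example.

```python
def flatten_list_feature(records, field):
    """Replace a list-valued field with one 0/1 column per distinct item.

    Hypothetical helper for illustration; the notebook may do this differently.
    """
    # Collect every distinct item that appears in the list field.
    vocabulary = sorted(
        {item for record in records for item in record.get(field, [])}
    )
    flattened = []
    for record in records:
        present = set(record.get(field, []))
        # Copy every other field unchanged.
        row = {key: value for key, value in record.items() if key != field}
        # Add one indicator column per item in the vocabulary.
        for item in vocabulary:
            row[f"{field}_{item}"] = 1 if item in present else 0
        flattened.append(row)
    return flattened


# Made-up example records, not real TMDB data.
movies = [
    {"title": "Movie A", "cast": ["Actor X", "Actor Y"]},
    {"title": "Movie B", "cast": ["Actor Y"]},
]
flat = flatten_list_feature(movies, "cast")
```

Each distinct cast or crew member becomes its own column, so this kind of encoding can produce a very wide table on a large data set.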
Note: We did not use TensorBoard in the final write-up, but the code can be found here.
Also, the data directory is git-ignored. The notebooks reference this directory for data, but its contents were not checked into source control.
Pre-processed data for CNN
JSON format of the data, cleaned
Whole data set, used to create both test and train
Original, uncleaned train set
This project template was based on the cookiecutter data science project template. #cookiecutterdatascience