Kyle Honegger, Mark Hill, Joseph Reilly
CS109b final project, Spring 2017
This notebook contains our preliminary trials of working with the movie databases, as well as some EDA where we investigate the frequency of the most common movie pairs, text data from reviews and similar movies, and average profit by genre.
Milestone2.ipynb contains the code used to download all of the data we used for training our models as well as formatting it into a csv file. csv_processing.ipynb contains code used to clean and preprocess the data, address any noticed idosyncracies, create the dummy variables for the categorical features, collapse the genres, and split the features from the genres.
This notebook contains the traditional machine learning methods we tested on this data. We split the data into train and test sets, one set drops missing values, one imputes missing values, and the others add subsequently more features to the data so we can see how performance changes as more and more features are added. We test a Naive Bayes classifier, Logistic regression, and a random forest classifier on the original data and data reduced via PCA to see if reducing the dimensions improves or worsens performance.
This notebook contains a variation on the VGG16, a pre-trained covnet included in Keras. Weights were pre-trained on ImageNet, the input was specified to the same 300x185x3 size we used in our original model, and the same binary cross-entropy loss function is used to train and evaluate performance. Fully connected layers are added at the end to specify our 7 desired output labels.
This notebook is an extension of Milestones 3 & 4, adding topic modeling variables, improving the data processing pipeline, and extending the deep learning model.
Specifically, these issues were addressed:
-
The test set must be set aside to start with, so subsequently generated training sets don't dip into it
-
Implement training by "maxi-batches":
- set aside test
- load one big train batch (~5k)
- do one epoch of training
-
Manual image pre-processing - use scipy to load images in with uniform shape and formatting (and downsample, if desired)
-
Use Precision and Recall custom fcns for metrics
-
Save Keras model and associated metadata automatically
-
Log results for TensorBoard viz
-
Functionalize calls for model building and fitting, so we can sweep configs
Available at https://youtu.be/T4w-bONg7JE