In [None]:
!pip install kaggle

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import json

# Career Exploration Final Project: TMDB Box Office Prediction

### Table Of Contents

* [1. Exploratory Data Analysis](#eda)
* [2. Feature Engineering and Data Cleaning](#feature-engineering)
* [3. Modeling](#modeling)
    * [3.1 Validation and Evaluation](#validation)
    * [3.2 Linear Regression](#linear-regression)
    * [3.3 Regularized Regression](#reg)
    * [3.4 Random Forest](#random-forest)
    * [3.5 Neural Network](#nn)
    * [3.6 XGBoost](#xgb)


### Hosted and maintained by the [Students Association of Applied Statistics (SAAS)](https://saas.berkeley.edu). Authored by [Ajay Raj](mailto:araj@berkeley.edu).

For your final project in Career Exploration, you will be participating in a **Kaggle competition**, a data science and machine learning competition where you use *real* data and develop models to solve *real* problems.

## Description

The problem: given data about a movie (runtime, budget, cast, crew), predict the **overall worldwide box office revenue** it will make.

You'll be competing in [this Kaggle competition](https://www.kaggle.com/c/tmdb-box-office-prediction). Note that this competition has already ended, so you won't be competing against other Kagglers; instead, you'll be competing against your fellow CXers on a private leaderboard. For information on where the training data came from and how your predictions are evaluated (turned into a score), check out the Kaggle competition link.

**Note:** There is not much guidance provided in this project (on purpose). You'll spend a lot of time going through [previous lectures](https://github.com/SUSA-org/Spring-2019-Career-Exploration/blob/master/CX-Final-Project/CX-Final-Project-Starter.ipynb) to adapt the code provided there to this dataset, and reading the documentation that's been linked in most of the problems. We are pushing you, fledgling data scientists, out of the nest and letting you spread your wings and fly.

## Setup

1. Create a Kaggle account at kaggle.com
2. Go to the [Kaggle competition page](https://www.kaggle.com/c/tmdb-box-office-prediction), click "Late Submission", and register for the competition
3. Go to the 'Account' tab of your user profile (https://www.kaggle.com/YOUR-USERNAME/account) and select 'Create API Token'
4. Download the `kaggle.json` file, which contains a dictionary with your Kaggle credentials
5. Put that dictionary in the `KAGGLE_USER_DATA` variable
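
One way to carry out steps 4 and 5 without pasting the key by hand is to read the downloaded file with the `json` module imported above. This is only a minimal sketch: the helper name `load_kaggle_credentials` and the default path `kaggle.json` are assumptions, so point it at wherever you saved the file.

```python
import json

def load_kaggle_credentials(path="kaggle.json"):
    """Read the API token file and return its contents as a dict.

    `path` is a hypothetical location; use wherever you saved kaggle.json.
    """
    with open(path) as f:
        credentials = json.load(f)
    # The token file is expected to hold these two fields
    missing = {"username", "key"} - set(credentials)
    if missing:
        raise ValueError("kaggle.json is missing fields: {}".format(sorted(missing)))
    return credentials
```

With the file in the notebook's working directory, `KAGGLE_USER_DATA = load_kaggle_credentials()` would then fill in the variable below.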

In [None]:
TEAM_NAME = None  # replace this with your team name

In [None]:
KAGGLE_USER_DATA = None  # replace this; it looks like {"username":"ajaynraj","key":"<REDACTED>"}

## Data Loading

In [None]:
train = pd.read_csv('data/train.csv')

In [None]:
test = pd.read_csv('data/test.csv')

In [None]:
X_train, y_train = train.drop('revenue', axis=1), train['revenue']
X_test = test

When we do EDA and feature engineering on a dataset, we often examine the training points and the test points together. That way, when you do complex feature engineering and data cleaning, you don't need to do it twice or worry about your transformations not applying to the test set.
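
As a rough illustration of this pattern, the sketch below stacks two tiny made-up frames, applies one cleaning step to the combined frame, and slices it back apart by position. The `toy_*` names and values are invented for the example and are not part of the competition data.

```python
import pandas as pd

# Toy stand-ins for the real train/test frames (values are made up)
toy_train = pd.DataFrame({"budget": [10.0, None, 30.0], "revenue": [5, 6, 7]})
toy_test = pd.DataFrame({"budget": [40.0, None]})

n_train = toy_train.shape[0]
combined = pd.concat((toy_train.drop("revenue", axis=1), toy_test), axis=0)

# One transformation, applied once to both sets
combined["budget"] = combined["budget"].fillna(combined["budget"].median())

# Positional slicing recovers the two sets in their original order
toy_X_train = combined.iloc[:n_train]
toy_X_test = combined.iloc[n_train:]
print(toy_X_train.shape, toy_X_test.shape)
```

Keep in mind that a statistic such as the median above is computed over the test rows as well, which is a judgment call you may or may not want to make for your own features.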

In [None]:
df = pd.concat((X_train, X_test), axis=0)

<span id="eda"></span>

## 1. Exploratory Data Analysis

Provide two plots that demonstrate interesting aspects of the dataset, and especially certain features' influence on the target variable, revenue.

Since you won't be "submitting" this notebook anywhere, this part of the project is technically optional, but it is a **crucial** part of the data science process, so we *highly* recommend you do this, because it will inform how you complete the next parts of the project.

In [None]:
# space for sick scatter plots and vivacious violin plots

## 2. Feature Engineering and Data Cleaning

Transform your data into a cleaned DataFrame with the features you believe will be the most helpful towards creating a model for the revenue from the film.

In order to use the models below, you will need to make every feature **numerical**, not categorical, so you need to make sure that your output DataFrame only has numbers in it (and no NaNs!).
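
A quick way to confirm both requirements before modeling is a small check like the one below. It is a sketch, and `check_model_ready` is a hypothetical helper name rather than something provided with the project.

```python
import pandas as pd

def check_model_ready(frame):
    """Raise ValueError if any column is non-numeric or contains NaNs."""
    non_numeric = [c for c in frame.columns
                   if not pd.api.types.is_numeric_dtype(frame[c])]
    if non_numeric:
        raise ValueError("Non-numeric columns: {}".format(non_numeric))
    nan_columns = frame.columns[frame.isna().any()].tolist()
    if nan_columns:
        raise ValueError("Columns with NaNs: {}".format(nan_columns))
    return True
```

Calling it on the output of your feature engineering function should fail loudly if a string column or a missing value slipped through.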

Some of the columns have data that is a little funky, so here are the libraries I imported and a few functions that I used. Feel free to use them or not!

In [None]:
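import ast
from collections import defaultdict

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def empty_listify(lst):
    """Parse a stringified Python list; missing values become an empty list."""
    # ast.literal_eval only evaluates literals, so it is safer than eval here
    return [] if pd.isnull(lst) else ast.literal_eval(lst)

def pcaify(one_hot, column_prefix, num_pca_columns):
    """Reduce a one-hot matrix to num_pca_columns PCA features named <prefix>_<i>."""
    pca = PCA(n_components=num_pca_columns)
    features = pca.fit_transform(one_hot)

    columns = ['{0}_{1}'.format(column_prefix, i) for i in range(features.shape[1])]
    return pd.DataFrame(data=features, columns=columns)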
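
To show the kind of column these helpers are aimed at, here is a small sketch that parses stringified lists of dictionaries and turns them into indicator columns. The sample strings are made up, the helper `names_from_cell` is hypothetical, and missing values are represented as `None` here (in a DataFrame they would typically be NaN).

```python
import ast
from collections import Counter

# Made-up cell values in the stringified-list style this sketch assumes
raw_genres = [
    "[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'name': 'Drama'}]",
    "[{'id': 18, 'name': 'Drama'}]",
    None,  # missing value
]

def names_from_cell(cell):
    """Return the 'name' of every dict in a stringified list, or [] if missing."""
    if cell is None:
        return []
    return [item["name"] for item in ast.literal_eval(cell)]

parsed = [names_from_cell(cell) for cell in raw_genres]
genre_counts = Counter(name for names in parsed for name in names)

# One indicator column per genre, in a fixed (sorted) order
vocabulary = sorted(genre_counts)
one_hot = [[int(name in names) for name in vocabulary] for names in parsed]
print(vocabulary, one_hot)
```

A matrix like `one_hot` can get very wide for columns such as cast or crew, which is the situation where reducing it with PCA becomes useful.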

In [None]:
def feature_engineering(df):
    # change this with your own feature engineering!
    df = df.loc[:, ["budget", "popularity", "runtime"]]
    df = df.fillna(0)
    return df 

In [None]:
X = feature_engineering(df)

In [None]:
# Splitting up our cleaned df back into training and test
X_train = X[:train.shape[0]]
y_train = y_train
X_test = X[train.shape[0]:]

<span id="modeling"/>

## 3. Modeling

For each of the models we try, make sure you also run the [Prediction](#prediction) cells at the bottom, so you can submit your predictions to the competition! This is how we'll be making sure you're keeping up with the project.

<span id="validation"/>

### 3.1 Validation and Evaluation

Our Kaggle competition (read more [here](https://www.kaggle.com/c/tmdb-box-office-prediction/overview/evaluation)) uses Root Mean Squared Logarithmic Error (RMSLE). In mathematical notation, it is:

$$\text{RMSLE}(\hat{y}, y) = \sqrt{\frac{1}{n} \sum_{i = 1}^n \left(\log(\hat{y}_i + 1) - \log(y_i + 1)\right)^2}$$
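
To make the metric concrete, the sketch below computes it by hand in plain Python. It assumes the log(1 + x) form that scikit-learn's `mean_squared_log_error` uses, and the function name `rmsle` is just for illustration.

```python
import math

def rmsle(y_pred, y_true):
    """Root mean squared logarithmic error, using log(1 + x) on both inputs."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for p, t in zip(y_pred, y_true)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

print(rmsle([1, 2, 3, 4], [5, 6, 7, 8]))
```

Because the differences are taken between logarithms, predicting 2 million for a 1 million movie costs about as much as predicting 200 million for a 100 million one, so large films do not dominate the score.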

#### Evaluation

Complete the function below.

In [None]:
from sklearn.metrics import mean_squared_log_error

def evaluate(y_pred, y_true):
    """Returns the RMSLE(y_pred, y_true)"""
    return np.sqrt(mean_squared_log_error(y_true, y_pred))

In [None]:
# Tests the previous function

# If this fails, it will throw an error
assert np.allclose(evaluate(np.array([1, 2, 3, 4]), np.array([5, 6, 7, 8])), 0.8292781201720374)

#### Validation

Use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to split up your training data into a training set and a validation set. The size of the validation set should be 20% of the full training data.
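
As a rough illustration on synthetic data, the `test_size` argument controls the fraction held out. The arrays and the `random_state` value below are made up for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features
rng = np.random.RandomState(0)
toy_X = rng.rand(100, 3)
toy_y = rng.rand(100)

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable
X_fit, X_val, y_fit, y_val = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0
)
print(X_fit.shape, X_val.shape)
```

If `test_size` is left out, scikit-learn holds out 25% of the rows by default.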

In [None]:
from sklearn.model_selection import train_test_split

train_X, valid_X, train_y, valid_y = train_test_split(X_train, y_train, test_size=0.2)

<span id="linear-regression"/>

### 3.2 Linear Regression

Fit a linear regression model to your data and report your RMSLE.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# instantiating linear regression object (model)
lm = LinearRegression()

# fitting model on training sets
lm.fit(train_X, train_y)

# using model to predict on validation set
y_valid_pred = lm.predict(valid_X)

# IMPORTANT: This is a "dumb" model that predicts negative values for some movie revenues.
# However, RMSLE takes logarithms, so we cannot have negative predictions.
# Ideally, you would create a better model that doesn't predict negative revenues.
y_valid_pred[y_valid_pred < 0] = 0

# evaluating prediction on validation set
evaluate(y_valid_pred, valid_y)
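
One common way to avoid negative predictions, sketched below on synthetic data, is to fit the model on log(1 + revenue) and transform the predictions back. This is only one option and not necessarily the approach the project intends; all `toy_*` names and values are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: the target grows multiplicatively with a single feature
rng = np.random.RandomState(0)
toy_X = rng.uniform(0, 10, size=(200, 1))
toy_y = np.exp(0.5 * toy_X[:, 0] + rng.normal(0, 0.1, size=200))

# Fit on log(1 + y) so the model works on the same scale RMSLE measures
log_model = LinearRegression()
log_model.fit(toy_X, np.log1p(toy_y))

# expm1 undoes log1p; clipping guards against values slightly below zero
toy_pred = np.clip(np.expm1(log_model.predict(toy_X)), 0, None)
print(toy_pred.min())
```

Training on the log scale also means the least-squares objective lines up more closely with the competition metric.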

<span id="reg" />

### 3.3 Regularized Regression

Fit a [LASSO regression model](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html) to your data with $\lambda = 1$

In [None]:
from sklearn.linear_model import Lasso

In [None]:
# YOUR CODE HERE

#### 3.3.1 Hyperparameter Tuning

Perform [5-fold cross-validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) to tune the parameter $\lambda$, which is called **alpha** when you pass it into Lasso. Find the best value of $\lambda \in \{0.001, 0.005, 0.01, 0.05, 0.1\}$ and report the **RMSLE** on the validation set when you use this value.
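
The general shape of such a search is sketched below on synthetic data, so it is not a drop-in answer for the cell that follows. The data, the shortened alpha list, and the choice to fit on log(1 + y) are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
toy_X = rng.rand(120, 4)
toy_y = np.exp(toy_X @ np.array([2.0, 0.0, 1.0, 0.0]) + rng.normal(0, 0.1, 120))

toy_alphas = [1e-3, 1e-2, 1e-1]
toy_kf = KFold(n_splits=5, shuffle=True, random_state=0)
mean_rmsle = []

for alpha in toy_alphas:
    fold_scores = []
    for fit_idx, val_idx in toy_kf.split(toy_X):
        model = Lasso(alpha=alpha)
        model.fit(toy_X[fit_idx], np.log1p(toy_y[fit_idx]))
        pred = np.clip(np.expm1(model.predict(toy_X[val_idx])), 0, None)
        fold_scores.append(np.sqrt(mean_squared_log_error(toy_y[val_idx], pred)))
    mean_rmsle.append(np.mean(fold_scores))

# RMSLE is an error, so the best alpha is the one with the LOWEST mean score
toy_best_alpha = toy_alphas[int(np.argmin(mean_rmsle))]
print(toy_best_alpha)
```

The index arrays from `split` work directly on NumPy arrays; with a pandas DataFrame you would select rows with `.iloc[fit_idx]` instead.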

In [None]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)

alphas = [1e-3, 5e-3, 1e-2, 5e-2, 0.1]

cv_scores = np.zeros(len(alphas))

for alphai, alpha in enumerate(alphas):
    print('Training alpha =', alpha, end='\t')
    scores = np.zeros(5)
    for i, (train_index, test_index) in enumerate(kf.split(X_train)):
        # YOUR CODE HERE
    cv_scores[alphai] = scores.mean()
    print('RMSLE = ', cv_scores[alphai])

In [None]:
# RMSLE is an error, so the best alpha is the one with the lowest score
best_alpha = alphas[np.argmin(cv_scores)]
best_alpha

In [None]:
model = Lasso(alpha=best_alpha)
model.fit(train_X, np.log(train_y))
training_accuracy = # YOUR CODE HERE
validation_accuracy = # YOUR CODE HERE

print('Training accuracy', training_accuracy)
print('Validation accuracy', validation_accuracy)

<span id="random-forest"/>

### 3.4 Random Forest

Fit a random forest model to your data and report your RMSLE.

**NOTE:** If you're finding that your model is performing worse than your linear regression, make sure you tune the parameters of the RandomForestRegressor!

Try to understand what the parameters mean by looking at the Decision Trees lecture.
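
For orientation, here is a sketch of a few parameters that are commonly tuned, fitted on synthetic data. The specific values are arbitrary starting points and not recommendations for this dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
toy_X = rng.rand(200, 3)
toy_y = 10 * toy_X[:, 0] + rng.normal(0, 0.5, 200)

forest = RandomForestRegressor(
    n_estimators=100,     # number of trees averaged together
    max_depth=5,          # cap on how deep each tree may grow
    min_samples_leaf=5,   # each leaf must hold at least this many rows
    random_state=0,       # makes the fit repeatable
)
forest.fit(toy_X[:150], toy_y[:150])
toy_forest_pred = forest.predict(toy_X[150:])
print(toy_forest_pred.shape)
```

Shallower trees and larger leaves generally reduce overfitting at the cost of flexibility, so it is worth comparing a few settings on your validation set.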

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
# YOUR CODE HERE

<span id="nn" />

### 3.5 Neural Network

Train a neural network on the data. Report your RMSLE.

**NOTE**: You will probably run into issues running this on DataHub! I would recommend downloading Anaconda and running the notebook locally. Ask us on Slack if you need help on this!
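
The project does not say which library to use here. As one lightweight possibility, the sketch below uses scikit-learn's `MLPRegressor` on synthetic data; the course may have had a different framework in mind, and the layer sizes and iteration count are arbitrary.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
toy_X = rng.rand(300, 3) * 100
toy_y = np.exp(0.03 * toy_X[:, 0] + rng.normal(0, 0.1, 300))

# Scaling inputs first usually helps gradient-based training
net = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0),
)
net.fit(toy_X[:240], np.log1p(toy_y[:240]))

# Undo the log transform and keep predictions non-negative for RMSLE
toy_net_pred = np.clip(np.expm1(net.predict(toy_X[240:])), 0, None)
print(toy_net_pred.shape)
```

Putting the scaler inside the pipeline means the same scaling learned on the training rows is reused automatically at prediction time.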

In [None]:
# YOUR CODE HERE

<span id="xgb" />

### 3.6 XGBoost (Stretch)

Now that we've tried many different types of models, it's time to bring out the big guns.

Below are hyperparameters for an XGBoost model: tinker around with these to achieve the best validation score (below). Learn about what some of the hyperparameters mean [here](https://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.train).

**NOTE**: You will probably run into issues running this on DataHub! I would recommend downloading Anaconda and running the notebook locally. Ask us on Slack if you need help on this!

In [None]:
from xgboost import train

In [None]:
params = {
    'eta': # YOUR CODE HERE
    'max_depth': # YOUR CODE HERE
    'subsample': # YOUR CODE HERE
    'colsample_bytree': # YOUR CODE HERE
    'silent': # YOUR CODE HERE
}

In [None]:
from xgb import run_xgb
xgb_preds = run_xgb(...) # change this

## Prediction

In [None]:
PATH_TO_SUBMISSION = 'submission.csv'

In [None]:
# You might have to change this to be the predictions from your model on the test set
preds = lm.predict(X_test)

In [None]:
out = pd.DataFrame(data={'id': test['id'], 'revenue': preds}).set_index('id')

In [None]:
assert out.shape[0] == test.shape[0]
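
The row-count assertion above catches one kind of mistake. A few more checks can be worth running before writing the file, as in this sketch on a made-up frame; `check_submission` is a hypothetical helper, not part of the project code.

```python
import pandas as pd

def check_submission(frame, expected_rows):
    """Basic sanity checks on a frame indexed by id with a 'revenue' column."""
    assert frame.shape[0] == expected_rows, "wrong number of rows"
    assert frame.index.is_unique, "duplicate ids"
    assert frame["revenue"].notna().all(), "missing predictions"
    assert (frame["revenue"] >= 0).all(), "negative predictions break RMSLE"
    return True

toy_out = pd.DataFrame({"id": [1, 2, 3], "revenue": [1e6, -5.0, 2e5]}).set_index("id")
# Clip negatives to zero, the same fix applied to the linear model in section 3.2
toy_out["revenue"] = toy_out["revenue"].clip(lower=0)
print(check_submission(toy_out, expected_rows=3))
```

This matters most for models such as plain linear regression, whose test-set predictions can dip below zero.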

In [None]:
out.to_csv(PATH_TO_SUBMISSION)

## Submission

In [None]:
from submit import submit_to_leaderboard

In [None]:
success = submit_to_leaderboard(
    KAGGLE_USER_DATA, 
    TEAM_NAME, 
    path_to_submission=PATH_TO_SUBMISSION, 
    submit_to_kaggle=True
)