# SUSA CX Kaggle Capstone Project
## Part 4: Deep Learning in Keras and Submitting to Kaggle

### Table Of Contents
* [Introduction](#section1)
* [Deep Learning](#section2)
* [Final Kaggle Evaluation](#section3)   
* [Conclusion](#conclusion)
* [Additional Reading](#reading)


### Hosted and maintained by the [Statistics Undergraduate Students Association (SUSA)](https://susa.berkeley.edu). Originally authored by [Patrick Chao](mailto:prc@berkeley.edu) & [Arun Ramamurthy](mailto:contact@arun.run).

<a id='section1'></a>
# SUSA CX Kaggle Capstone Project

Woohoo! You've made it to the end of the CX Kaggle Capstone Project! Congratulations on all of your hard work so far. We hope you've enjoyed this opportunity to learn new modeling techniques, some underlying mathematics, and even make new friends within CX. At this point, we've covered the entirety of the Data Science Workflow, linear regression, feature engineering, PCA, shrinkage, hyperparameter tuning, decision trees and even ensemble models. This week, we're going to finish off this whirlwind tour with a revisit to our old friend, Deep Learning. While the MNIST digit dataset was really interesting to look at as a cool toy example of the powers of DL, this time you're going to apply neural networks to your housing dataset for some hands-on practice using Keras. 

> ### CX Kaggle Competition & Final Kaggle Evaluation
After you get some practice with deep learning, this week we will be asking you and your team to select and finalize your best model, giving you the codespace to write up your finalized model and evaluate it by officially submitting your results to Kaggle. The winners of this friendly collab-etition will be honored at the SUSA Banquet next Friday, including prizes for the winning team! We also want to encourage and facilitate discussion between teams on why different models performed differently, and give you a chance to chat with other teams about their own experiences with the CX Kaggle Capstone. 

## Logistics

Most of the logistics are the same as last week, but we are repeating them here for your convenience. Please let us know if you or your teammates are feeling nervous about the pace of this project - remember that we are not grading you on your project, and we really try to make the notebooks relatively easy and fast to code through. If for any reason you are feeling overwhelmed or frustrated, please DM us or talk to us in person. We want all of you to have a productive, healthy, and fun time learning data science! If you have any suggestions or recommendations on how to improve, please do not hesitate to reach out!


### Mandatory Office Hours

Because this is such a large project, you and your team will surely have to work on it outside of meetings. To encourage you to seek help on this project, we are making it **mandatory** for you and your group to attend **two (2)** SUSA Office Hours over the next 4 weeks. This will allow questions to be answered outside of the regular meetings and will help promote collaboration with more experienced SUSA members.

The schedule of SUSA office hours is below:
https://susa.berkeley.edu/calendar#officehours-table

We understand that most of you will end up going to Arun or Patrick's office hours, but we highly encourage you to go to other people's office hours as well. There are many qualified SUSA mentors who can help and this could be an opportunity for you to meet them.

To begin we will import all the necessary libraries.

In [77]:
# Import statements
from sklearn import tree # There are lots of other models from this module you can try!
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge, Lasso, LinearRegression
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.metrics import mean_squared_error, make_scorer
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn
from IPython.display import Image  

import tensorflow as tf
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Conv2D, MaxPool2D, Flatten
# These layers can be imported from keras.layers directly; the older submodule paths
# (keras.layers.advanced_activations, keras.layers.normalization) vary between Keras versions.
from keras.layers import LeakyReLU, PReLU, BatchNormalization, Activation
from keras.optimizers import SGD, Adam
from keras import backend as K


sqrt=np.sqrt

In [5]:
def get_features(data, col_list, y_name):
    """
    Return the selected feature columns and the target column as numpy arrays.
    Outputs X, an n x k numpy matrix, and Y, a length-n numpy vector.
    This is not a smart function: it drops rows with NA values and shuffles the rows,
    but it does no other validation, so it might break.
    
    data(DataFrame): e.g. train, clean
    col_list(list): list of columns to extract data from
    y_name(string): name of the column you want to treat as the y column
    
    Returns one np.array of shape (n, len(col_list)) and one of shape (n,),
    where n is the number of rows left after dropping NA values.
    """
    
    # keep track of numpy values
    feature_matrix = data[col_list + [y_name]].dropna().values
    np.random.shuffle(feature_matrix)
    return feature_matrix[:, :-1], feature_matrix[:, -1]

def get_loss(model, X,Y_true):
    """Returns square root of L2 loss (RMSE) from a model, X value input, and true y values
    
    model(Model object): model we use to predict values
    X: numpy matrix of x values
    Y_true: numpy matrix of true y values
    """
    Y_hat = model.predict(X)
    return get_RMSE(Y_hat,Y_true)

def get_RMSE(Y_hat,Y_true):
    """Returns square root of L2 loss (RMSE) between Y_hat and true values
    
    Y_hat: numpy matrix of predicted y values
    Y_true: numpy matrix of true y values
    """
    return np.sqrt(np.mean((Y_true-Y_hat)**2))

def get_train_and_val(X, Y, split=0.7):
    """Given the X and Y data, return the training and validation sets based on the split variable
    
    X: numpy matrix of x values
    Y: numpy matrix of y values
    split: value between 0 and 1 for the fraction of rows used for training
    """
    
    # Flatten Y to a 1-D vector so it can be sliced alongside the rows of X
    Y = Y.reshape(Y.shape[0],)

    train_index, _ = get_train_val_indices(X, Y, split)

    x_train = X[:train_index, :]
    y_train = Y[:train_index]

    x_val = X[train_index:, :]
    y_val = Y[train_index:]
    return (x_train, y_train), (x_val, y_val)

def get_train_val_indices(X,Y=None,split=0.7):
    train_index = int(X.shape[0] * split)
    test_index =X.shape[0]-1
    return train_index,test_index

def select_columns_except(dframe, non_examples):
    """Returns all numeric columns in dframe except those in non_examples."""
    all_cols = dframe.select_dtypes(include=[np.number]).columns.tolist()
    return [col for col in all_cols if col not in non_examples]
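
To make the helpers above more concrete, here is a minimal, self-contained sketch of the same two ideas: a 70/30 train/validation split by row position, and RMSE as the square root of the mean squared error. The names `rmse` and `split_train_val` and the toy arrays are made up for illustration; they are not part of the notebook's helpers or the housing data.

```python
import numpy as np

def rmse(y_hat, y_true):
    """Root mean squared error between predictions and true values."""
    y_hat = np.asarray(y_hat, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return np.sqrt(np.mean((y_true - y_hat) ** 2))

def split_train_val(X, Y, split=0.7):
    """Put the first `split` fraction of rows in training, the rest in validation."""
    cut = int(X.shape[0] * split)
    return (X[:cut], Y[:cut]), (X[cut:], Y[cut:])

# Toy data: 10 rows, 2 features (made-up numbers, not the housing data).
X_toy = np.arange(20).reshape(10, 2)
Y_toy = np.arange(10, dtype=float)

(x_tr, y_tr), (x_va, y_va) = split_train_val(X_toy, Y_toy)
print(x_tr.shape, x_va.shape)

# Two predictions that are off by 1 and by 7:
# squared errors are 1 and 49, their mean is 25, and the square root is 5.
toy_rmse = rmse([10.0, 20.0], [11.0, 27.0])
print(toy_rmse)
```

Because RMSE squares each error before averaging, the single large miss (7) dominates the score, which is why a few badly mispredicted houses can noticeably raise your validation loss.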

# Data Loading

First, we need to load the data. These datasets have already been cleaned and are provided for you.

In [112]:
train = pd.read_csv('DATA/house-prices/train_cleaned.csv')
test = pd.read_csv('DATA/house-prices/test_cleaned.csv')
train = train.drop('Unnamed: 0',axis=1)
test = test.drop('Unnamed: 0',axis=1)
train.head()

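| Unnamed: 0.1 | Unnamed: 0 | 1stFlrSF | 2ndFlrSF | BedroomAbvGr | BsmtCond | BsmtExposure | BsmtFinSF1 | BsmtFinSF2 | BsmtFinType1 | BsmtFinType2 | ... | ExteriorWd Shng | ExteriorImStucc | ExteriorWdShing | ExteriorMetalSd | ExteriorCmentBd | ExteriorCemntBd | ExteriorPlywood | ExteriorBrkFace | ExteriorWd Sdng | ExteriorBrk Cmn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 856 | 854 | 3 | 3 | 1 | 706.0 | 0.0 | 6 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1262 | 0 | 3 | 3 | 4 | 978.0 | 0.0 | 5 | 1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 920 | 866 | 3 | 3 | 2 | 486.0 | 0.0 | 6 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 961 | 756 | 3 | 4 | 1 | 216.0 | 0.0 | 5 | 1 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 4 | 1145 | 1053 | 4 | 3 | 3 | 655.0 | 0.0 | 6 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |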


In [140]:
feature_cols = select_columns_except(train, ['Id','SalePrice'])

X, Y = get_features(train, feature_cols, 'SalePrice')
(x_train,y_train),(x_val,y_val) = get_train_and_val(X,Y)

x_test = test.loc[:, test.columns != 'Id'].values
test_ids = test['Id'].values

In [141]:
randomForest = RandomForestRegressor(max_depth=14,min_samples_leaf=1,min_samples_split=2,
                                     n_estimators=40,random_state=0,bootstrap=True)
randomForest = randomForest.fit(x_train, y_train)
loss = get_loss(randomForest, x_val,y_val)
print("Root Mean Squared Error loss of our model: {:.2f}".format(loss))


Root Mean Squared Error loss of our model: 33132.18


In [142]:
def model_prediction(model,x_test=x_test):
    prediction = model.predict(x_test)
    return prediction.reshape(prediction.shape[0],)

In [152]:
model = Sequential()
model.add(Dense(256, input_shape=(x_train.shape[1],)))
model.add(BatchNormalization())
model.add(Activation('linear'))
model.add(LeakyReLU(alpha=0.1))
model.add(Dropout(0.2))
model.add(Dense(256))
model.add(BatchNormalization())
model.add(Activation('linear'))
model.add(LeakyReLU(alpha=0.1))
model.add(Dropout(0.5))
model.add(Dense(256))
model.add(BatchNormalization())
model.add(Activation('linear'))
model.add(LeakyReLU(alpha=0.1))
model.add(Dropout(0.5))
# model.add(Dense(1024))
# model.add(BatchNormalization())
# model.add(Activation('relu'))
# model.add(Dropout(0.4))
# model.add(Dense(512))
# model.add(BatchNormalization())
# model.add(Activation('relu'))
# model.add(Dropout(0.4))
# model.add(Dense(256))
#model.add(Activation('leakyrelu'))
model.add(Dropout(0.5))
model.add(Dense(64))
model.add(Activation('linear'))
model.add(LeakyReLU(alpha=0.1))
model.add(Dropout(0.5))
model.add(Dense(32))
model.add(Activation('linear'))
model.add(LeakyReLU(alpha=0.1))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('relu'))

In [154]:
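def root_mean_squared_error(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_pred - y_true), axis=-1))

model.compile(optimizer=Adam(), loss=root_mean_squared_error,
              metrics=[root_mean_squared_error])
batch_size = 50
epochs = 198
# NOTE: learning_rate is defined here but never passed to Adam(), so training
# uses Adam's default rate. Pass it as Adam(learning_rate) if you want to use it.
learning_rate = 0.0003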

In [155]:
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_val, y_val))
score = model.evaluate(x_val, y_val, verbose=0)
print('Test loss:', score[0])

Train on 1021 samples, validate on 439 samples
Epoch 1/198
...
Epoch 198/198
Test loss: 22857.9350307873


In [156]:
model_prediction(model)

array([122516.77, 168665.52, 202756.78, ..., 223741.45, 127502.63,
       203054.25], dtype=float32)

In [157]:
my_submission = pd.DataFrame({'Id': test_ids, 'SalePrice': model_prediction(model)})
# You can use any filename; we chose submission2.csv here
my_submission.to_csv('submission2.csv', index=False)

<a id='section2'></a>
# Deep Learning

From Kaggle3:  
>We may imagine hyperparameters as a bunch of individual knobs we may turn. Imagine that you are visiting a friend and staying at her place. However, you did not realize that she is actually an alien and her house is filled with very strange objects. When you head to bed, you attempt to use her shower, but see that her shower has a dozen knobs that control the temperature of the water coming out! You only have a single output to work off of, but many different knobs or *parameters* to adjust. If the water is too hot, you can turn random knobs until it becomes cold, and learn a bit about your environment. You may determine that some knobs are more or less sensitive, just like hyperparameters. Each knob in the shower is equivalent to a hyperparameter we can tune in a model.
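
As a small illustration of turning one knob systematically, the sketch below runs a grid search over the Ridge penalty `alpha` with scikit-learn's `GridSearchCV`. The data are synthetic and the grid values are arbitrary choices for the example; neither comes from the housing dataset, so treat this as a pattern to adapt rather than a recommended setting.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption: stands in for your real features).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
true_coefs = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.5, size=200)

# One knob (alpha), five settings to try. Each setting is scored with
# 5-fold cross-validation on negative mean squared error.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(Ridge(), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["alpha"]
best_rmse = np.sqrt(-search.best_score_)  # flip the sign, then take the root
print("Best alpha:", best_alpha)
print("Cross-validated RMSE: {:.3f}".format(best_rmse))
```

A neural network has many more knobs (layer sizes, dropout rates, learning rate, batch size, epochs), so an exhaustive grid gets expensive quickly; in practice you would usually vary one or two at a time and watch the validation loss.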

In [None]:
num_classes = 10  # assumed: the ten MNIST digit classes, matching the 784-pixel input below
model = Sequential()
model.add(Dense(30, activation='relu', input_shape=(784,)))
model.add(Dense(num_classes, activation='softmax'))

<a id='section3'></a>
# Final Kaggle Evaluation

Congrats on finishing the last of the models we planned on teaching you about during the CX Kaggle Capstone! 

You have now covered five distinct models and several related techniques to add to your data science bag-of-tricks: 
- Linear Models
    - Multivariate Linear Regression
    - Polynomial Regression
    - Shrinkage / Biased Regression / Regularization (i.e. Ridge, LASSO)
- Decision Trees
    - Random Forests
- Deep Learning
    - Sequential Neural Networks
- Auxiliary Techniques 
    - The Data Science Workflow
    - Data Cleaning
    - Interpreting EDA Graphs
    - Feature Engineering
    - Principal Component Analysis
    - Hyperparameter Tuning (i.e. grid search)
    - Ensemble Learning (i.e. bagging, boosting) 
    
Wow, that's a lot! We are really proud of you all for exploring these techniques, which are covered in some of Berkeley's toughest machine learning and statistics classes. As always, if you want to learn more about any of these topics, or are hungry to learn about even more techniques, feel free to reach out to any one of the SUSA Mentors.

With the help of the above listing and your own team's preferences, choose a model and a couple of techniques to implement for your final model. We will provide you with a preamble and some space to construct and train your model, as well as a helper function to turn your output into an official Kaggle submission file.  

In [6]:
################
### PREAMBLE ###
################
train = None
test = None 

####################
### MODEL DESIGN ###
####################
model = None 
# ^^ REPLACE THIS LINE ^^

################
### TRAINING ###
################

###############
### TESTING ###
###############
test_predictions = None
# ^^ REPLACE THIS LINE ^^

##################
### SUBMISSION ###
##################
def generate_kaggle_submission(predictions, test = test):
    '''
    This function accepts your 1459-dimensional vector of predicted SalePrice values for the test dataset, 
    and writes a CSV named kaggle_submission.csv containing your vector in a form suitable for 
    submission onto the Kaggle leaderboard.
    '''
    print("hello")
generate_kaggle_submission(test_predictions)

hello


As you might have noticed in the code block above, we had to write a simple CSV file containing row IDs and predicted values for the 1459 houses in the test dataset. This submission file is your ticket to the Kaggle leaderboard.
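
For reference, here is one way such a submission writer could look. This is a sketch, not the official helper: the function name `write_submission`, its parameters, and the demo IDs and prices are all assumptions made for the example, and the column names `Id` and `SalePrice` follow the submission cell used earlier in this notebook.

```python
import numpy as np
import pandas as pd

def write_submission(predictions, ids, path="kaggle_submission.csv"):
    """Write an Id/SalePrice CSV and return the DataFrame that was written."""
    # Flatten (n, 1) model outputs to a plain length-n vector.
    predictions = np.asarray(predictions).reshape(-1)
    ids = np.asarray(ids).reshape(-1)
    if len(predictions) != len(ids):
        raise ValueError("Need exactly one prediction per test Id")
    submission = pd.DataFrame({"Id": ids, "SalePrice": predictions})
    submission.to_csv(path, index=False)  # no index column in the output file
    return submission

# Made-up IDs and prices, shaped like Keras output (one column per row).
demo_ids = np.array([1461, 1462, 1463])
demo_preds = np.array([[120000.0], [165000.5], [203000.0]])
demo = write_submission(demo_preds, demo_ids, path="demo_submission.csv")
print(demo)
```

With your real model you would pass `test['Id'].values` as `ids` and your model's predictions on the test features as `predictions`. The length check catches the common mistake of predicting on the validation split instead of the full test set.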

Take a look at your `kaggle_submission.csv` file. When you and your team are ready, follow these instructions to upload your predictions to Kaggle and receive an official Kaggle score:

> TODO

<a id='conclusion'></a>
# Conclusion

This brings us to the end of the CX Kaggle Capstone Project, as well as the Spring 2018 semester of SUSA Career Exploration. Congratulations on graduating from the SUSA Career Exploration committee! It's been a wonderful experience teaching you all, and we hope you got as much out of CX as we did this semester. This semester brought several new pilot programs to CX, such as the crash courses, workbooks, a revamped curriculum, and the CX Kaggle Capstone Project. You all have been great sources of feedback, and we want to make next semester's CX curriculum even better for the new generation of CX! 

We're going to ask you for feedback one last time, to give us insight into how we can improve the CX Kaggle Capstone experience for future CX members. Please fill out [this feedback form] and let us know how we could have done better. Thank you again for a wonderful semester, and we will see you again in the Fall!



As always, please email [Arun Ramamurthy](mailto:contact@arun.run), [Patrick Chao](mailto:prc@berkeley.edu), or [Noah Gundotra](mailto:noah.gundotra@berkeley.edu) with any questions or concerns whatsoever. Have a great summer, and we hope to see you as a returning member in the Fall! Go SUSA!!!

With geom_love,
Lucas, Arun, Patrick, Noah, and the rest of the SUSA Board

<a id='reading'></a>
# Additional Reading

TODO