# Week 6 Homework

This week's homework will help you build up your deep learning skills and apply them to the healthcare space. Specifically, we'll be predicting whether a breast mass is benign or malignant. I got this data [here](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data), in case you're interested in exploring it on your own!

## Setting up the Environment

Do a `git pull` from the `unit3` GitHub repository, and install the dependencies listed in `requirements.txt` [(instructions here, just in case)](http://web.stanford.edu/class/cs21si/setup.html). We have some new dependencies this time, so don't skip this step!

Run any code below by highlighting it and hitting `Shift + Enter`. Import the libraries below.

In [None]:
from __future__ import print_function
import keras
from keras.models import Sequential, Model
from keras.layers import Input, Dense, Dropout, Activation, BatchNormalization
from keras import regularizers
from keras import backend as K
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import random
from sklearn.preprocessing import StandardScaler

# fix random seed for reproducibility
np.random.seed(1337)

## Read in Data

Each sample's features were computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

In [None]:
def get_dataset(path):
    dataset = pd.read_csv(path)
    np.random.seed(42)
    dataset = dataset.reindex(np.random.permutation(dataset.index))
    return dataset

dataset = get_dataset('resources/cancer-dataset.csv')
dataset.head()

## Separate Data and Labels

Now, we'll use functions to get the data and the labels (i.e. the $X$ and the $y$). We want to make sure these return NumPy arrays so they can be passed into Keras models. Benign samples will be assigned the label 0, and malignant samples will be assigned the label 1.

In [None]:
def get_data(dataset):
    # drop the first, second, and last columns, keeping only the numeric features
    # (.values replaces as_matrix(), which was removed in pandas 1.0)
    data = np.array(dataset.values[:, 2:-1], dtype=np.float64)
    return data

def get_labels(dataset):
    # malignant -> 1, benign -> 0
    diagnoses = dataset['diagnosis'].map({'M': 1, 'B': 0})
    return np.array(diagnoses.values, dtype=np.uint8)

data, labels = get_data(dataset), get_labels(dataset)

print("Number of patient samples: ", data.shape[0])
print("Number of features per patient: ", data.shape[1])

## Partition into Train, Validation, and Test Sets

In [None]:
def split_data(data, labels, split):
    train_ratio, val_ratio, test_ratio = split
    num_examples = labels.shape[0]
    train_bound = int(train_ratio*num_examples)
    val_bound = train_bound + int(val_ratio*num_examples)
    
    train = {'data': data[:train_bound], 'labels': labels[:train_bound]}
    val = {'data': data[train_bound:val_bound], 'labels': labels[train_bound:val_bound]}
    test = {'data': data[val_bound:], 'labels': labels[val_bound:]}
    
    return train, val, test
    
train, val, test = split_data(data, labels, (.7, .2, .1))
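
The boundary arithmetic in `split_data` can be checked on a toy array. This is a minimal sketch, assuming 50 made-up examples; the helper name `split_bounds` is hypothetical and not part of the homework code.

```python
import numpy as np

def split_bounds(num_examples, split):
    """Return the (train_bound, val_bound) indices used to slice the data."""
    train_ratio, val_ratio, _ = split
    train_bound = int(train_ratio * num_examples)
    val_bound = train_bound + int(val_ratio * num_examples)
    return train_bound, val_bound

# Toy stand-in for the real data: 50 "patients" identified by index.
toy = np.arange(50)
train_bound, val_bound = split_bounds(len(toy), (.7, .2, .1))

toy_train = toy[:train_bound]
toy_val = toy[train_bound:val_bound]
toy_test = toy[val_bound:]

# Every example lands in exactly one partition.
sizes = (len(toy_train), len(toy_val), len(toy_test))
print(sizes)
```

Because `int()` truncates, any examples left over after computing the train and validation sizes end up in the test partition, which is simply everything after `val_bound`.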

## Feature Normalization

We want to scale our data so that each feature has mean 0 and variance 1. Putting all features on a comparable scale makes training our neural network more stable, which in turn makes it possible to train more sophisticated networks and get better results. This is related to batch normalization: batch normalization performs a similar operation, just on the inputs to a layer rather than the inputs to the model. Note that the scaler is fit on the training set only, and the same transformation is then applied to the validation and test sets.
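
The arithmetic behind this scaling can be sketched in plain NumPy. This is an illustrative example on made-up data (the arrays `toy_train` and `toy_val` are assumptions, not the homework data): subtract the per-feature training mean and divide by the per-feature training standard deviation.

```python
import numpy as np

rng = np.random.RandomState(0)
toy_train = rng.normal(loc=10.0, scale=3.0, size=(100, 4))  # made-up training features
toy_val = rng.normal(loc=10.0, scale=3.0, size=(20, 4))     # made-up validation features

# Statistics come from the training set only.
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

scaled_train = (toy_train - mean) / std
scaled_val = (toy_val - mean) / std  # reuse the training statistics

print(scaled_train.mean(axis=0))  # close to 0 for every feature
print(scaled_train.std(axis=0))   # close to 1 for every feature
```

The validation features will generally not come out with mean exactly 0 and variance exactly 1, since they are scaled with the training statistics. That is intended: the model should see validation and test data transformed the same way as the data it was trained on.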

In [None]:
scaler = StandardScaler()
scaler.fit(train['data'])
train['data'] = scaler.transform(train['data'])
val['data'] = scaler.transform(val['data'])
test['data'] = scaler.transform(test['data'])

## Jupyter Exercise 1: Create your Own Classifier

**Your task:** The training wheels are coming off! Here, we want a 4-layer fully-connected neural network that can be used for binary classification. When choosing layer sizes, a helpful rule of thumb is to give each layer about half as many units as the previous one. Good luck!
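
As a rough illustration of the halving rule of thumb, the sketch below computes a list of layer widths. The helper `halving_layer_sizes` and the starting width of 64 are hypothetical choices, and it assumes "4-layer" means three hidden layers plus an output layer. It also assumes the output layer has `num_classes` units, since the compile step below uses `sparse_categorical_crossentropy` with integer labels.

```python
def halving_layer_sizes(first_hidden, num_hidden, num_outputs):
    """Return layer widths: hidden layers that halve each time, then the output layer."""
    sizes = []
    width = first_hidden
    for _ in range(num_hidden):
        # never let a hidden layer become narrower than the output layer
        sizes.append(max(width, num_outputs))
        width //= 2
    sizes.append(num_outputs)
    return sizes

layer_sizes = halving_layer_sizes(first_hidden=64, num_hidden=3, num_outputs=2)
print(layer_sizes)  # [64, 32, 16, 2]
```

These widths are only a starting point; Exercise 2 is the place to experiment with other sizes.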

In [None]:
batch_size = 256
epochs = 25
num_classes = 2
num_features = data.shape[1]

def nn_classifier(learning_rate=0.005):
    model = Sequential()
    
    # YOUR CODE HERE:

    # END CODE
    
    # compile model
    sgd = keras.optimizers.SGD(lr = learning_rate)
    model.compile(loss='sparse_categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
    return model

model = nn_classifier()
model.summary()

The output above summarizes the structure of our model. Now let's train and evaluate it! We want to train the model for several epochs and then print its accuracy on the validation set.
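
For intuition about the two numbers that evaluation reports under the compile settings above (loss and accuracy), here is a minimal NumPy sketch. The probabilities and labels are made up, and the function names are illustrative. Keras's own implementation differs in details such as clipping probabilities for numerical safety.

```python
import numpy as np

def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability that the model assigns to the true class."""
    true_class_probs = probs[np.arange(len(labels)), labels]
    return float(np.mean(-np.log(true_class_probs)))

def accuracy(probs, labels):
    """Fraction of examples whose most probable class matches the label."""
    return float(np.mean(np.argmax(probs, axis=1) == labels))

# Made-up softmax outputs for four patients: columns are P(benign), P(malignant).
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4],
                  [0.3, 0.7]])
labels = np.array([0, 1, 1, 1], dtype=np.uint8)  # integer labels, as in get_labels

toy_loss = sparse_categorical_crossentropy(probs, labels)
toy_accuracy = accuracy(probs, labels)
print(toy_loss, toy_accuracy)  # accuracy is 0.75: three of four predictions match
```

The third patient is misclassified, which lowers the accuracy and also contributes the largest term to the loss.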

In [None]:
def eval(model, train, val, num_epochs):
    # fit the model
    model.fit(train['data'], train['labels'], 
              epochs = num_epochs, 
              batch_size = 16, 
              verbose = 2,
              shuffle = False)
    
    # evaluate the model
    scores = model.evaluate(val['data'], val['labels'], batch_size=16, verbose=0)
    
    return scores

loss, accuracy = eval(model, train, val, epochs)
print("Validation Set Accuracy: ", accuracy)

## Jupyter Exercise 2: Hyperparameter Tuning

**Your task**: Tuning the hyperparameters and developing intuition for how they affect the final performance is a large part of using neural networks, so we want you to get a lot of practice. Below, you should experiment with different values of the various hyperparameters, including learning rate, regularization strength, and dropout strength. Your goal in this exercise is to get as good a result on the breast cancer dataset as you can with a fully-connected deep neural network. Feel free to change the model you defined in Exercise 1 as well.

There is no starter code here, so feel free to perform a hyperparameter sweep as you wish. You will find the code for *tune_hyperparams* from the Week 5 homework useful as a start, but note that *nn_classifier* as it is right now only takes an optional learning rate and not the dropout and regularization hyperparameters from before, so you will have to add any other hyperparameters yourself if you want to play with them. Also note that the *eval* function needs to be passed *model*, *train*, *val*, and *epochs*.

**Aim for at least 97% accuracy on the validation set with your best model.**
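
One possible shape for such a sweep is a plain grid search that keeps the best `(model, score, hyperparameters)` triple, matching the tuple that `tune_hyperparams` returns below. This is only a sketch: `sweep` and `fake_train_and_score` are hypothetical names, and the fake scoring function stands in for building a fresh model and calling *eval* so that the example runs without Keras.

```python
import itertools

def sweep(train_and_score, grid):
    """Try every combination in `grid` and keep the best (model, score, params)."""
    best = (None, None, None)
    names = sorted(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model, score = train_and_score(**params)
        if best[1] is None or score > best[1]:
            best = (model, score, params)
    return best

# Hypothetical stand-in for "build a model, train it, return validation accuracy".
# The score is made up so that it peaks at learning_rate=0.01, dropout=0.2.
def fake_train_and_score(learning_rate, dropout):
    score = 1.0 - abs(learning_rate - 0.01) - abs(dropout - 0.2)
    return "model(lr=%g, dropout=%g)" % (learning_rate, dropout), score

grid = {'learning_rate': [0.001, 0.01, 0.1], 'dropout': [0.0, 0.2, 0.5]}
best = sweep(fake_train_and_score, grid)
print(best[2])
```

In a real sweep, the scoring function would create a new model for every combination, so that no configuration starts from weights trained under another, and would use the validation accuracy as the score.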

In [None]:
def tune_hyperparams():
    best_model = (None, None, None)
    ### YOUR CODE HERE

                
    ### END CODE
            
    return best_model
        
best_model = tune_hyperparams()
print("\n\nBest Model Performance on Validation: ", best_model[1])
print("Hyperparameters of Best Model: ", best_model[2])

## Testing Model on Unseen Data

Try playing around with hyperparameters like the learning rate, size of the hidden layers, number of epochs, etc. until you get a model that you are satisfied with! Use validation accuracy to compare performance across different model configurations. Once you're done configuring, evaluate your best model on the held-out test set, which it has never seen, to get a good idea of how it will perform on unseen data:

In [None]:
test_score = best_model[0].evaluate(test['data'], test['labels'], batch_size=128)
print("Test accuracy: %.2f" % (test_score[1]))