In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from scipy.stats import norm

### Part 1: Introduction to the deep learning library Keras

#### Goals:
- A) A high-level introduction to the Keras library for supervised learning tasks.
- B) On a toy example (the iris dataset), use a neural network to predict the type of flower (classification), then to solve a regression task.

A) Importing the Keras library
- Keras is a high-level deep learning library that is considered to be user-friendly, modular and extensible.
- Keras can be used to build different types of deep neural networks, which you will see throughout this week: multilayer perceptrons (today), convolutional layers (Friday), recurrent layers (tomorrow), etc.
- Implementing a neural network with Keras requires only a few lines of code. The training part (estimation of the parameters) is also handled for you.

In [0]:
import keras
from keras.models import Sequential, load_model      
from keras.layers import Dense       # 'Dense' layer is a fully-connected layer in Keras, i.e. the layer used in an MLP 
from keras.utils import np_utils     # For the one-hot encoding, see below. 
from keras import optimizers

#### B.1) First supervised learning task
- We start with a toy example: the Iris dataset.

Let's first load the iris dataset
- The iris dataset is really simple. It consists of 150 flowers with 4 features (sepal length, sepal width, petal length and petal width, in that column order) and three species of flowers (setosa, versicolor and virginica).
- We will try to predict the type of flower based on the 4 features.

In [0]:
from sklearn.datasets import load_iris
iris_data = load_iris() 
features = iris_data['data']
targets = iris_data['target']

In [0]:
print("examples of features:")
print (features[0:10,:])
print("species")
print(targets)

For a multi-class classification task, it's best practice to encode the targets with a 'one-hot' encoding.
- A one-hot encoding is simply a vector of zeros everywhere except for a '1' at the position of the class.
- Ex: for the iris dataset, we have three classes. Class '0' will be encoded as [1,0,0], class '1' as [0,1,0] and class '2' as [0,0,1].
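
To make the encoding concrete, here is a minimal NumPy sketch of the same idea. The helper name `one_hot` is ours and only illustrative; it is not a Keras function.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) array with a single 1.0 per row."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is exactly the one-hot vector of class i
    return np.eye(num_classes)[labels]

encoded = one_hot([0, 2, 1], num_classes=3)
print(encoded)

# argmax along the class axis recovers the original integer labels
decoded = np.argmax(encoded, axis=1)
print(decoded)
```

The `argmax` step is the same operation used later in the notebook to go from a one-hot target back to a class index.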

One-hot encoding is done in the following box

In [0]:
dummy_y = np_utils.to_categorical(targets)
print(dummy_y)

Let's split our dataset into a Train|Test set of 70%/30%. 
- This is done with the function 'train_test_split'.

In [0]:
x_train, x_test, y_train, y_test = train_test_split(features, dummy_y, test_size=0.3, random_state=0)

Let's build, compile and train our first neural network model to predict the type of flower.
- A single hidden layer of 5 neurons with the 'tanh' activation function

In [0]:
model = Sequential()

# Hidden layer: 5 neurons and 'tanh' activation function (input_dim is the number of input features)
model.add(Dense(units=5, activation='tanh', input_dim=x_train.shape[-1]))

# Output layer: need to use softmax as we have a classification problem
model.add(Dense(3, activation='softmax'))

# Compile the model by specifying the loss function (cross entropy), the optimizer (Adam) and optional metric to output ('accuracy')
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=50, batch_size=10)

Let's evaluate the performance of the resulting neural network on the Train|Test sets
- In multi-class classification, the Keras function 'predict' takes a set of features as input and outputs the probability of each class.
- i.e. model.predict(x_train) outputs a discrete probability distribution over the 3 types of flowers for each example in our training set.
- The predicted flower is the class with the highest probability (the mode).
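
As a rough illustration of what the softmax output layer and the "highest probability" rule do, here is a small NumPy sketch. The raw scores below are made-up numbers, not outputs of the trained model, and the function `softmax` is written by hand for the example.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

# Hypothetical raw scores for 2 flowers and 3 classes (made-up numbers)
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])

probs = softmax(logits)                      # each row is a probability distribution
predicted_class = np.argmax(probs, axis=1)   # mode of each distribution

print(probs.sum(axis=1))   # each row sums to 1
print(predicted_class)
```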

In [0]:
y_pred_train = model.predict(x_train)

Let's analyze the predicted discrete probability distribution for the first 5 examples in the train set

In [0]:
print("Probability distribution for the first 5 examples of the Train set:")
print(y_pred_train[0:5])

and the predicted class for the same five examples vs the true label (true type of flower). 

In [0]:
print("Predicted flower for the first 5 examples (goes from 0 to 2):")
print(np.argmax(y_pred_train[0:5],axis=1))
print("True type of flowers for the first 5 examples")
print(np.argmax(y_train[0:5],axis=1))

Let's compute the accuracy on the train and test set.

In [0]:
print("Train set accuracy:")
print(np.sum(np.argmax(y_pred_train, axis=1)==np.argmax(y_train, axis=1))/x_train.shape[0])
print("Number of errors on the Train set out of %d examples:" %(x_train.shape[0]))
print(x_train.shape[0] - np.sum(np.argmax(y_pred_train, axis=1)==np.argmax(y_train, axis=1)))

print("Test set accuracy:")
y_pred_test = model.predict(x_test)
print(np.sum(np.argmax(y_pred_test, axis=1)==np.argmax(y_test, axis=1))/x_test.shape[0])
print("Number of errors on the Test set out of %d examples:" %(x_test.shape[0]))
print(x_test.shape[0] - np.sum(np.argmax(y_pred_test, axis=1)==np.argmax(y_test, axis=1)))

#### B.2) Predict the sepal width based on the other features

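Still with the 'iris' dataset, let's predict the value of the 'sepal width' based on the three other features. In scikit-learn's feature order (sepal length, sepal width, petal length, petal width), the sepal width is column index 1.
- This is a regression task, as the target is a real value.

In [0]:
sepal_width_col = 1   # column order: sepal length, sepal width, petal length, petal width
features_without_sepal_width = np.delete(features, sepal_width_col, axis=1)
sepal_width = features[:, sepal_width_col]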

Again, let's split our dataset into a Train|Test set of 70%/30%. 

In [0]:
x_train, x_test, y_train, y_test = train_test_split(features_without_sepal_width, sepal_width, test_size=0.3, random_state=0)

Let's compile our neural network. Notice two important differences:
- 1) Output layer: must be of size 1 and the activation is no longer 'softmax', it's a 'linear' activation function.
- 2) Loss function: mean-square error (MSE) instead of cross-entropy. MSE is the default loss function to use in most regression tasks.

In [0]:
model = Sequential()

# Hidden layer: 5 neurons and 'tanh' activation function (input_dim is the number of input features)
model.add(Dense(units=5, activation='tanh', input_dim=x_train.shape[-1]))

# Output layer: default activation function is 'linear'
model.add(Dense(1, activation='linear'))

# Compile the model by specifying the loss function (MSE), the optimizer (Adam).
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x_train,y_train,epochs=50,batch_size=10)

Let's evaluate the performance on the Train|Test set from the resulting neural network.
- In the context of a regression task, the function 'predict' of Keras takes as input a set of features, and outputs a scalar that represents the prediction of the target (sepal width)

In [0]:
y_pred_train = model.predict(x_train)
print("Predicted sepal width for the first 5 examples")
print(y_pred_train[0:5])
print("True sepal width for the first 5 examples")
print(y_train[0:5])

Let's evaluate the total performance on the Train and Test set of our model.

In [0]:
print("Mean-square error obtained on the Train set:")
print(np.average((y_pred_train[:,0]-y_train)**2))

print("Mean-square error obtained on the Test set:")
y_pred_test = model.predict(x_test)
print(np.average((y_pred_test[:,0]-y_test)**2))
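
The `[:,0]` indexing in the cell above matters: 'predict' returns an array of shape (n, 1) while the targets have shape (n,), and subtracting the two directly would broadcast to an (n, n) matrix. The sketch below shows the pitfall on hypothetical values standing in for the predictions and targets.

```python
import numpy as np

# Hypothetical values standing in for model.predict output and the targets
y_pred = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1), like the Keras output
y_true = np.array([1.0, 2.0, 5.0])         # shape (3,)

# Pitfall: (3, 1) - (3,) broadcasts to a (3, 3) matrix of all pairwise differences
wrong = np.mean((y_pred - y_true) ** 2)

# Selecting column 0 (or calling .flatten()) aligns both shapes to (3,)
right = np.mean((y_pred[:, 0] - y_true) ** 2)

print((y_pred - y_true).shape)
print(wrong, right)
```

No error is raised in the first case, so the wrong value can easily go unnoticed.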

### Part 2) Option pricing with neural networks
In this second part, we will try to price options with neural networks under the Black-Scholes model. Many papers can be found doing a similar task, see for example https://srdas.github.io/Papers/BlackScholesNN.pdf 

### Sections:
- 2.1) Simulation of the dataset
- 2.2) Normalization of the features
- 2.3) Optimization

### 2.1) Simulation of the dataset

In [0]:
# Function to compute the true price of a European call option under the Black-Scholes model
# - S    : price of the underlying
# - K    : strike price
# - T    : time-to-maturity
# - q    : continuous dividend yield
# - sigma: volatility parameter
# - r    : continuous risk-free rate
def BlackScholes_price(S, K, T, q, sigma, r):
    d1 = (np.log(S / K) + (r - q + 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * np.exp(-q * T) * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)
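
Before generating a large dataset, it can be worth sanity-checking the pricing formula. The sketch below re-implements the same formula for scalar inputs using only the standard library (so it runs without SciPy); the names `norm_cdf` and `bs_call` are ours, and the input values are illustrative.

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Standard normal CDF written with the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, T, q, sigma, r):
    """Same formula as BlackScholes_price, for scalar inputs only."""
    d1 = (log(S / K) + (r - q + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * exp(-q * T) * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

# At-the-money example with illustrative inputs
S, K, T, q, sigma, r = 100.0, 100.0, 1.0, 0.0, 0.2, 0.05
price = bs_call(S, K, T, q, sigma, r)
print(round(price, 4))   # approximately 10.45

# No-arbitrage bounds: discounted intrinsic value <= call price <= discounted spot
lower = max(S * exp(-q * T) - K * exp(-r * T), 0.0)
upper = S * exp(-q * T)
print(lower <= price <= upper)
```

Note that the formula divides by sigma * sqrt(T), so values of T or sigma extremely close to 0 lead to very large values of d1 and d2.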

In [0]:
# Function to generate the dataset of pairs (X,Y) where X = [S,K,T,r,sigma,q] and Y = call price.
# - n: number of datapoints
def generate_options(n):
    data = {}
    data['S'] = np.random.uniform(1, 3000, n)               # Uniformly sampled from [1,3000]
    data['K'] = data['S'] * np.random.uniform(0.7, 1.3, n)  # Strike price should be around the stock price
    data['T'] = np.random.uniform(0, 2, n)                  # Number of years
    data['r'] = np.random.uniform(0, 0.2, n)                # Continuous risk-free rate
    data['sigma'] = np.random.uniform(0, 1, n)              # volatility 
    data['q'] = np.random.uniform(0, 0.2, n)                # continuous dividend yield
    data['P'] = BlackScholes_price(**data)                  # Option price (dict keys match the argument names)

    return pd.DataFrame.from_dict(data)

In this part, we simulate a dataset of pairs (X,Y) where X = [S,K,T,r,sigma,q] are the features needed under the Black-Scholes model (BSM) to compute the price (Y) of a call option.

In [0]:
# Generate a dataset of call options from the BSM
Datasettraining = generate_options(int(1 * 1E5))                        # Generate 100K call options from BSM
df = Datasettraining.copy(deep=True)                                    # Copy of the training set
df.head(10)

In the next box, the simulated dataset is split into train|valid|test sets with proportions 0.6|0.2|0.2
- The train set will be used to fit the parameters of the neural network;
- The valid set will be used for hyperparameter tuning;
- The test set will be used to evaluate the out-of-sample performance.

In [0]:
df = df.sample(frac=1).reset_index(drop=True)  # Shuffle the dataset

# 1) Split into Train|Valid|Test set
# Proportions: 0.6|0.2|0.2
n = len(df)
prop_train = 0.8*0.75
prop_valid = 0.8*0.25
prop_test = 0.2
n_train = int(prop_train*n)
n_valid = int(prop_valid*n)
n_test = int(prop_test*n)

# Split
train = df[0:n_train]
valid = df[n_train:n_train+n_valid]
test = df[n_train+n_valid:]
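
The same split can also be expressed on shuffled indices, which makes it easy to check that the three sets are disjoint and cover the whole dataset. This is only a sketch: the helper `split_indices` and its seed are our own choices, not part of the notebook.

```python
import numpy as np

def split_indices(n, prop_train=0.6, prop_valid=0.2, seed=0):
    """Shuffle the indices 0..n-1 and cut them into train/valid/test parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(round(prop_train * n))
    n_valid = int(round(prop_valid * n))
    # Whatever remains after train and valid goes to the test set
    return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]

train_idx, valid_idx, test_idx = split_indices(1000)
print(len(train_idx), len(valid_idx), len(test_idx))

# Every index appears exactly once across the three sets
all_idx = np.concatenate([train_idx, valid_idx, test_idx])
print(np.array_equal(np.sort(all_idx), np.arange(1000)))
```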

### 2.2) Normalization of the features
- The features ['S', 'K', 'r', 'T', 'sigma', 'q'] are all continuous random variables (i.e. they take real values, here all non-negative).
- One approach to normalize the features is Z-normalization, i.e. standardize each feature by subtracting its mean and dividing by its standard deviation (std). 
- Note: it's good practice to estimate the mean and std of each feature with the training set only: 
    - i.e. we normalize each feature of the train|valid|test sets with the mean and std computed strictly on the train set.
    - This keeps information from the valid and test sets out of the training pipeline, so the errors measured on those sets remain honest estimates of the out-of-sample performance.
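
Here is a small self-contained sketch of the procedure on synthetic data (two hypothetical features on very different scales, not the option dataset). After normalization, the train columns have mean 0 and std 1 by construction, while the test columns are only approximately standardized because they reuse the train statistics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw features on very different scales (e.g. a price and a rate)
raw_train = rng.uniform([1.0, 0.0], [3000.0, 0.2], size=(1000, 2))
raw_test = rng.uniform([1.0, 0.0], [3000.0, 0.2], size=(200, 2))

# Statistics are estimated on the TRAIN set only
mean = raw_train.mean(axis=0)
std = raw_train.std(axis=0)

z_train = (raw_train - mean) / std
z_test = (raw_test - mean) / std   # reuse the train statistics

print(z_train.mean(axis=0), z_train.std(axis=0))   # 0 and 1 up to rounding
print(z_test.mean(axis=0), z_test.std(axis=0))     # close to 0 and 1, not exact
```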

In [0]:
# Normalize features
# 1) Features to normalize with Z-normalization
X_train = train[['S', 'K', 'r', 'T', 'sigma', 'q']].values
X_valid = valid[['S', 'K', 'r', 'T', 'sigma', 'q']].values
X_test = test[['S', 'K', 'r', 'T', 'sigma', 'q']].values

# 2) Z-normalization step
# 2.1) Compute the mean/std on the TRAIN set only
train_mean = np.mean(X_train, axis=0)
train_std  = np.std(X_train, axis=0)

# 2.2) Z-normalize the train|valid|test with the mean/std of the train set
xtrain = (X_train - train_mean)/train_std
xvalid = (X_valid - train_mean)/train_std
xtest  = (X_test - train_mean)/train_std

# 2.3) Store the targets (i.e. option prices)
ytrain = train['P'].values                                   
yvalid = valid['P'].values 
ytest  = test['P'].values 

### 2.3) Optimization (without hyperparameter tuning)

In this first part, we train a neural network to price call options without hyperparameter tuning.
- Total number of epochs     : 30
- Batch size                 : 64
- Number of hidden layers    : 2
- Number of neurons per layer: 32
- Activation function        : ReLU
- Loss                       : Mean-squared error
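
To get a feel for the size of this network, the number of trainable parameters can be computed by hand: each fully-connected layer has one weight per (input, unit) pair plus one bias per unit. The helper below is a hypothetical sketch; its result should agree with the total reported by model.summary() for the architecture listed above.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully-connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weight matrix plus one bias per unit
    return total

# 6 input features, two hidden layers of 32 units, 1 output
n_params_pricing = dense_param_count([6, 32, 32, 1])
print(n_params_pricing)   # 224 + 1056 + 33 = 1313

# The first iris classifier: 4 features, 5 hidden units, 3 classes
n_params_iris = dense_param_count([4, 5, 3])
print(n_params_iris)      # 25 + 18 = 43
```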

In [0]:
nb_epoch = 30     # total number of epochs
batch_size = 64   # mini-batch size
nbs_neurons = 32  # number of neurons per hidden layer

# Build the model - 2 hidden layers with ReLU activations
model = Sequential()
model.add(Dense(units = nbs_neurons, activation = 'relu', input_dim = xtrain.shape[1]))
model.add(Dense(units = nbs_neurons, activation = 'relu'))

# Output layer (a ReLU output keeps the predicted price non-negative)
model.add(Dense(units=1, activation = 'relu'))
model.compile(loss='mse', optimizer='adam')

# Print the structure of the neural network
model.summary()

# Train the neural network
model.fit(xtrain, ytrain, epochs=nb_epoch, batch_size=batch_size, verbose = 1)

It's always good practice to save the resulting trained neural network. Once saved, we can always reload the trained neural network for future use without having to retrain it. 

In [0]:
# Save the resulting trained neural network
model.save("model_no_tuning.h5")

# Example of how to load the trained model
model_no_tuning = load_model("model_no_tuning.h5")
model_no_tuning.summary()  

Now that the model is trained, we can print out the resulting performance in terms of the MSE on the train and test sets.

In [0]:
# 1) Compute the predicted call price on the train and test set
y_pred_train = model_no_tuning.predict(xtrain).flatten()  # train set predictions
y_pred_test  = model_no_tuning.predict(xtest).flatten()   # test set predictions

# 2) Some examples of the predictions on the test set
print("First 5 predictions:")
print(y_pred_test[0:5])
print("First 5 real values:")
print(ytest[0:5])

# 3) Compute the MSE on the train and test sets
MSE_train = np.mean(np.square(y_pred_train - ytrain))
print("With no hyperparameters tuning, the MSE on the train set is: %.5f" %(MSE_train))

MSE_test = np.mean(np.square(y_pred_test - ytest))
print("With no hyperparameters tuning, the MSE on the test set is: %.5f" %(MSE_test))

Since the MSE values on the train and test sets are relatively close, there's no indication that the neural network overfit the training set. This is not surprising since we didn't train for long (only 30 epochs).  

For the current problem of option pricing, it's also interesting to do a scatter plot of the predicted values vs the real values. The scatter plot is a visual way to quickly assess the prediction performance of the neural network. The best case scenario would be for the points to all fall on the diagonal line.

In [0]:
# 1) Scatter plot of the train set
plt.scatter(ytrain, y_pred_train, s=1.5)
plt.title('No hyperparameters tuning - train set')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

# 2) Scatter plot of the test set
plt.scatter(ytest, y_pred_test, s=1.5)
plt.title('No hyperparameters tuning - test set')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

As we can see, we are making a lot of errors, especially for lower call prices. 

One factor that could explain this result is that the mean-squared error penalizes errors quadratically, and large absolute errors will initially (i.e. at the beginning of the training) happen most often for options with high prices. Thus, in order to minimize the MSE, the neural network has to focus more on the pricing of the expensive call options. 
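
A tiny numerical example makes this argument more tangible. The two prices below are hypothetical; both are mispriced by the same 10% relative error, yet almost all of the squared error comes from the expensive option.

```python
# Two hypothetical options mispriced by the same 10% relative error
true_prices = [2.0, 800.0]
pred_prices = [p * 1.10 for p in true_prices]

squared_errors = [(pred - true) ** 2 for pred, true in zip(pred_prices, true_prices)]
relative_errors = [abs(pred - true) / true for pred, true in zip(pred_prices, true_prices)]

print(squared_errors)
print(relative_errors)

# Share of the total squared error that comes from the expensive option
share_expensive = squared_errors[1] / sum(squared_errors)
print(share_expensive)
```

With an MSE loss, reducing the error on the cheap option barely changes the objective, which is consistent with the poor fit observed for low call prices.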





### Part 3) Optimization (with hyperparameter tuning)

### A) Grid search

As a first example of hyperparameter tuning, we will use a well-known method called 'grid search'. 
- This method consists of testing every combination of hyperparameters on a grid of possible values for each hyperparameter. In this case, we will test the following grid:
    - {2,3} layers
    - {100,120} neurons/layer
    - {128,256} batch_size
- So, we test a total of 8 combinations of hyperparameters. 

The resulting trained model will be the one that minimizes the error on the validation set out of all of the combinations tested. The latter model will be saved as 'best_model_grid_search.h5'. 
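
The grid itself can be enumerated with `itertools.product`, which yields the same combinations as three nested loops. This sketch only lists the combinations and does not train anything; the variable names are ours.

```python
from itertools import product

layers_grid = [2, 3]
neurons_grid = [100, 120]
batch_grid = [128, 256]

# product() yields every combination, in the same order as three nested loops
grid = list(product(layers_grid, neurons_grid, batch_grid))

print(len(grid))
for n_layers, n_neurons, n_batch in grid:
    print(n_layers, n_neurons, n_batch)
```

The number of combinations is the product of the grid sizes (2 x 2 x 2 here), so it grows quickly as more hyperparameters or values are added.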

In [0]:
nb_epoch = 20   # total number of epochs

# Grid search
nbs_layers_range  = np.array([2,3])       # 2 or 3 hidden layers
nbs_neurons_range = np.array([100,120])   # for each hidden layer, either 100 or 120 hidden neurons
batch_size_range  = np.array([128,256])   # batch size: either {128,256}

# During the optimization, we keep track of the best validation set error obtained
valid_loss_best  = 999999   

In [0]:
# Loop over the different combination on the grid
for i in range(len(nbs_layers_range)):
    nbs_layer = nbs_layers_range[i]
    for j in range(len(nbs_neurons_range)):
        nbs_neurons = nbs_neurons_range[j]
        for k in range(len(batch_size_range)):
            batch_size = batch_size_range[k]
            
            print("Current set of HP: %d hidden layers, %d number of neurons, %d batch size" %(nbs_layer, nbs_neurons, batch_size))
            
            # Compile the model:
            model = Sequential()
            model.add(Dense(units = nbs_neurons, activation = 'relu', input_dim = xtrain.shape[1]))
            model.add(Dense(units = nbs_neurons, activation = 'relu'))
            
            # Check if we have to add a third hidden layer on top of the first two.
            if(nbs_layer == 3):
                model.add(Dense(units = nbs_neurons, activation = 'relu'))
            
            # Output layer
            model.add(Dense(units=1, activation = 'relu'))
            model.compile(loss='mse', optimizer='adam')
            
            for h in range(nb_epoch):
                model.fit(xtrain, ytrain, epochs=1, batch_size=batch_size, verbose = 1)
            
                # Evaluate the validation loss of the trained model
                val_loss = model.evaluate(xvalid, yvalid)
               
                # If it's the best model so far:
                if (val_loss < valid_loss_best):
                    valid_loss_best = val_loss
                    model.save("best_model_grid_search.h5")

# load and print out the best model found with grid search
model_best_grid_search = load_model("best_model_grid_search.h5")
model_best_grid_search.summary()  

# Evaluate performance on train set
y_pred_train = model_best_grid_search.predict(xtrain).flatten()
MSE_train = np.mean(np.square(y_pred_train - ytrain))
print("With grid search, the MSE on the train set is: %.5f" %(MSE_train))

# Plot predicted prices vs real prices
plt.scatter(ytrain, y_pred_train, s=1.5)
plt.title('Grid search - Train set')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

# Evaluate performance on test set
y_pred_test = model_best_grid_search.predict(xtest).flatten()
MSE_test = np.mean(np.square(y_pred_test - ytest))
print("With grid search, the MSE on the test set is: %.5f" %(MSE_test))

# Plot predicted prices vs real prices
plt.scatter(ytest, y_pred_test, s=1.5)
plt.title('Grid search - Test set')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

### B) Random search algorithm for model tuning
- Define a space of possible HPs from which we want to find an optimal set.
- Random search: 
    - For a fixed number of trials, randomly sample a set of HPs from the defined space;
    - Choose the set of HPs which minimizes the MSE on the valid set.
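
The two steps above can be sketched without any neural network by replacing "train the model and measure the validation MSE" with a toy function. Everything here is illustrative: `toy_validation_loss` is a hypothetical stand-in, and the bounds are simply the ones used in this notebook.

```python
import random

def toy_validation_loss(nbs_layers, nbs_neurons, learning_rate):
    """Hypothetical stand-in for 'train the network, return the validation MSE'."""
    return (nbs_layers - 3) ** 2 + ((nbs_neurons - 250) / 100) ** 2 + (learning_rate - 0.003) ** 2

rng = random.Random(0)
best_loss, best_hp = float("inf"), None
trials = []

for _ in range(5):
    # Sample one set of hyperparameters from the search space
    hp = {
        "nbs_layers": rng.randint(2, 4),             # integer in {2, 3, 4}
        "nbs_neurons": rng.randint(100, 400),        # integer in {100, ..., 400}
        "learning_rate": rng.uniform(0.0001, 0.01),  # real value in [0.0001, 0.01]
    }
    loss = toy_validation_loss(**hp)
    trials.append((loss, hp))

    # Keep the set of hyperparameters with the lowest validation loss so far
    if loss < best_loss:
        best_loss, best_hp = loss, hp

print(best_hp, best_loss)
```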

In [0]:
batch_size = 128   # Fix the batch size in this case

# Search space - define the boundaries for each HP
nbs_layers_range = np.array([2,4])             # number of layers within {2,3,4} 
nbs_neurons_range = np.array([100,400])        # number of neurons within {100,101,102,....,400}
lr_range = np.array([0.0001,0.01])             # learning rate within [0.0001, 0.01]

# Statistics to keep track of during the optimization
valid_loss_best = 999999

# Total number of iterations to be done in the random search
nbs_iteration = 5

Random search implementation with 5 iterations:

In [0]:
for i in range(nbs_iteration):
    
    # 1) At the beginning of each iteration, randomly sample a set of HP's
    nbs_layers = np.random.randint(low=nbs_layers_range[0], high=nbs_layers_range[1]+1)
    nbs_neurons = np.random.randint(low=nbs_neurons_range[0], high=nbs_neurons_range[1]+1)
    learning_rate = np.random.uniform(low=lr_range[0], high=lr_range[1])
    print("Iteration %d --- Nbs layers: %d, Nbs neurons: %d, learning rate: %.4f" % (i+1, nbs_layers, nbs_neurons, learning_rate))
    
    # 2) Compile the model
    model = Sequential()
    model.add(Dense(units = nbs_neurons, activation = 'relu', input_dim = xtrain.shape[1]))
    model.add(Dense(units = nbs_neurons, activation = 'relu'))
            
    # Check if we have to add a third hidden layer on top of the first two.
    if(nbs_layers == 3):
        model.add(Dense(units = nbs_neurons, activation = 'relu'))
    
    # Check if we have to add a third and fourth hidden layers on top of the first two.
    if(nbs_layers ==4):
        model.add(Dense(units = nbs_neurons, activation = 'relu'))
        model.add(Dense(units = nbs_neurons, activation = 'relu'))
            
    # Output layer
    model.add(Dense(units=1, activation = 'relu'))
    
    adam = optimizers.Adam(lr=learning_rate)   # watch out - we also sample the learning rate! (newer Keras versions name this argument learning_rate)
    model.compile(loss='mse', optimizer=adam)
    
    for h in range(nb_epoch):
        model.fit(xtrain, ytrain, epochs=1, batch_size=batch_size, verbose = 1)
            
        # Evaluate the validation loss of the trained model
        val_loss = model.evaluate(xvalid, yvalid)
               
        # If it's the best model so far:
        if (val_loss < valid_loss_best):
            valid_loss_best = val_loss
            model.save("best_model_random_search.h5")
    
# load and print out the best model found with random search
model_best_random_search = load_model("best_model_random_search.h5")
model_best_random_search.summary()  

# Evaluate performance on train set
y_pred_train = model_best_random_search.predict(xtrain).flatten()
MSE_train = np.mean(np.square(y_pred_train - ytrain))
print("With random search, the MSE on the train set is: %.5f" %(MSE_train))

# Plot predicted prices vs real prices
plt.scatter(ytrain, y_pred_train, s=1.5)
plt.title('Random search - Train set')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

# Evaluate performance on test set
y_pred_test = model_best_random_search.predict(xtest).flatten()
MSE_test = np.mean(np.square(y_pred_test - ytest))
print("With random search, the MSE on the test set is: %.5f" %(MSE_test))

# Plot predicted prices vs real prices
plt.scatter(ytest, y_pred_test, s=1.5)
plt.title('Random search - Test set')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

Let's summarize the results of the three approaches (no HP tuning, grid search and random search)

In [0]:
# 1) No HP tuning
y_pred_no_tune = model_no_tuning.predict(xtest).flatten()
MSE_test = np.mean(np.square(y_pred_no_tune - ytest))
print("With no tuning, the MSE on the test set is: %.5f" %(MSE_test))

# 2) Grid search
y_pred_grid = model_best_grid_search.predict(xtest).flatten()
MSE_test = np.mean(np.square(y_pred_grid - ytest))
print("With grid search, the MSE on the test set is: %.5f" %(MSE_test))

# 3) Random search 
y_pred_random = model_best_random_search.predict(xtest).flatten()
MSE_test = np.mean(np.square(y_pred_random - ytest))
print("With random search, the MSE on the test set is: %.5f" %(MSE_test))

# 4) Plot scatter of each
plt.scatter(ytest, y_pred_no_tune, s=1.5)
plt.title('No hyperparameters tuning')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

plt.scatter(ytest, y_pred_grid, s=1.5)
plt.title('Grid search')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

plt.scatter(ytest, y_pred_random, s=1.5)
plt.title('Random search')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()

Based on our results, we can make the following qualitative observations:
- 1) Hyperparameter tuning can significantly improve the performance of our model. Indeed, both grid search and random search improved upon the MSE obtained with no hyperparameter tuning.
- 2) Prior to the optimization, it's not clear which of grid search or random search will provide the best solution.
- 3) The key takeaway: for the same computational budget, random search tends to provide a better solution than grid search, i.e. there's a higher probability that random search will find a better set of hyperparameters than grid search. 
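
One common intuition for the last point: with the same number of trials, a grid only ever tries a handful of distinct values of each hyperparameter, while random search tries a new value of every hyperparameter at each trial. The sketch below counts distinct learning rates for a budget of 9 trials; the grids and bounds are illustrative, and no model is trained.

```python
import random
from itertools import product

budget = 9
rng = random.Random(0)

# Grid: 3 learning rates x 3 layer widths = 9 trials
grid_trials = list(product([0.0001, 0.001, 0.01], [100, 250, 400]))

# Random: 9 trials, each drawing both hyperparameters independently
random_trials = [(rng.uniform(0.0001, 0.01), rng.randint(100, 400)) for _ in range(budget)]

distinct_lr_grid = len({lr for lr, _ in grid_trials})
distinct_lr_random = len({lr for lr, _ in random_trials})
print(distinct_lr_grid, distinct_lr_random)
```

If the learning rate happens to matter much more than the width, the grid has effectively explored 3 settings while random search has explored 9.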


### Part 4) Other popular approaches and 'homework'

In this last part, we will implement two popular regularization methods to potentially improve our results: dropout and batch normalization. In the next few boxes, we give simple examples of how to use dropout and batch normalization with Keras.

Afterwards, given the time left, you will implement an optimization approach of your choice (grid search, random search, etc.) using dropout and/or batch normalization to see if you can improve upon the results obtained so far.

### A) Dropout
- Dropout is typically added after the activation function (i.e. after the hidden layer).
- The argument of the Dropout layer is the fraction of units to drop (a value in (0,1)).
    - What fraction of the units should we drop? This can be treated as an additional hyperparameter.
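
To see what the layer does to the activations, here is a NumPy sketch of "inverted" dropout, the variant in which kept units are rescaled during training and the layer does nothing at inference. The function `dropout` and the activations below are our own illustration, not the Keras implementation.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    if not training:
        return activations            # inference: the layer is the identity
    keep_mask = rng.random(activations.shape) >= rate
    # Dividing by (1 - rate) keeps the expected activation unchanged
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
hidden = np.ones((4, 1000))           # hypothetical hidden-layer activations
dropped = dropout(hidden, rate=0.5, training=True, rng=rng)

print(np.unique(dropped))             # only 0.0 and 2.0 appear
print(dropped.mean())                 # close to 1.0 on average
print(np.array_equal(dropout(hidden, 0.5, False, rng), hidden))
```

Because dropout is only active during training, the training loss printed by 'fit' is computed with units dropped, which is worth keeping in mind when comparing it to the loss of a model without dropout.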

In [0]:
from keras.layers import Dropout  # Import the Dropout layer from Keras

nb_epoch = 30      # total number of epochs
batch_size = 128   # mini-batch size
nbs_neurons = 150  # number of neurons per layer

# Compile the model  - 2 hidden layers of Relu
model_dropout_ex = Sequential()

# First hidden layer
model_dropout_ex.add(Dense(units = nbs_neurons, activation = 'relu', input_dim = xtrain.shape[1]))
model_dropout_ex.add(Dropout(0.5))   # we drop 50% of the units

# Second hidden layer
model_dropout_ex.add(Dense(units = nbs_neurons, activation = 'relu'))
model_dropout_ex.add(Dropout(0.5))   # we drop 50% of the units

# Output layer 
model_dropout_ex.add(Dense(units=1, activation = 'relu'))
model_dropout_ex.compile(loss='mse', optimizer='adam')

# Print the structure of the neural network
model_dropout_ex.summary()

# Train the model
model_dropout_ex.fit(xtrain, ytrain, epochs=nb_epoch, batch_size=batch_size, verbose = 1)

What do you observe? Did dropout help? 

### B) Batch normalization

In [0]:
from keras.layers import BatchNormalization

model_BN_ex = Sequential()

# First hidden layer
model_BN_ex.add(Dense(units = nbs_neurons, activation = 'relu', input_dim = xtrain.shape[1]))
model_BN_ex.add(BatchNormalization())  # Batchnormalization is applied on the first hidden layer

# Second hidden layer
model_BN_ex.add(Dense(units = nbs_neurons, activation = 'relu'))
model_BN_ex.add(BatchNormalization())  # Batchnormalization is applied on the second hidden layer

# Output layer 
model_BN_ex.add(Dense(units=1, activation = 'relu'))
model_BN_ex.compile(loss='mse', optimizer='adam')

# Print the structure of the neural network
model_BN_ex.summary()

# Train the model
model_BN_ex.fit(xtrain, ytrain, epochs=nb_epoch, batch_size=batch_size, verbose = 1)
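
For reference, the training-time computation of a batch normalization layer can be sketched in a few lines of NumPy: each feature is standardized over the mini-batch, then rescaled by a learned `gamma` and shifted by a learned `beta`. This is a simplified illustration; the value of `eps` is an assumption, and at inference the layer uses running averages of the mean and variance instead of the batch statistics.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Normalize each feature over the batch, then rescale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # eps avoids division by zero
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(128, 4))   # hypothetical activations
out = batch_norm_train(batch, gamma=np.ones(4), beta=np.zeros(4))

print(out.mean(axis=0))   # approximately 0 for every feature
print(out.std(axis=0))    # approximately 1 for every feature
```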

Again, what do you observe?

It seems that neither batch normalization nor dropout is useful in this case. You will see other examples where dropout and batch normalization help to train your model.

### C) Your implementation

Here, try anything you've learned so far to beat the best results you've obtained! Some suggestions:
- 1) Train for more epochs (this will likely help you get better results);
- 2) Can dropout and/or batchnorm help when used with random search/grid search?

In [0]:
# Write your implementation here