# Problem 1: Other activation functions (20%)
### The leaky ReLU is defined as $\max(0.1x, x)$. 
 - What is its derivative? (Please express in "easy" format)
 - Is it suitable for back propagation?
 
### $\tanh$ is defined as $\frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. 
 - What is its derivative? (Please express in "easy" format)
 - Is it suitable for back propagation?
 - How is it different from the sigmoid activation?

Leaky ReLU' = 1    for x > 0
              0.1  for x < 0
(The derivative is undefined at x = 0; in practice either value is used there.)
        
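Leaky ReLU is one attempt to fix the “dying ReLU” problem. Instead of being zero when x<0, a leaky ReLU has a small nonzero slope there (0.1 in the definition above), so the gradient never vanishes entirely and the function is suitable for back propagation. Some people have reported success with this form of activation function, but the results are not always consistent. 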
(Source: “CS231n Convolutional Neural Networks for Visual Recognition”)
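
As a quick sanity check of the derivative for the definition $\max(0.1x, x)$, the sketch below compares a piecewise formula against a central finite difference. The helper names `leaky_relu` and `leaky_relu_grad` are illustrative, not part of the assignment, and the value returned at exactly x = 0 is a convention.

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Leaky ReLU as defined above: max(alpha * x, x)."""
    return np.maximum(alpha * x, x)

def leaky_relu_grad(x, alpha=0.1):
    """Slope 1 for x > 0 and alpha otherwise (alpha at x = 0 is a convention)."""
    return np.where(x > 0, 1.0, alpha)

# Compare against a central finite difference at points away from x = 0.
xs = np.array([-2.0, -0.5, 0.5, 2.0])
eps = 1e-6
numeric = (leaky_relu(xs + eps) - leaky_relu(xs - eps)) / (2 * eps)
max_err = np.max(np.abs(numeric - leaky_relu_grad(xs)))
print(leaky_relu_grad(xs), max_err)
```

The slope on the negative side is positive (0.1), which is what keeps a nonzero gradient flowing for negative inputs.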


$\tanh'(x) = 1 - \tanh^2(x)$

tanh is suitable for back propagation: it is differentiable everywhere, and its derivative is cheap to compute during training because it can be written in terms of the output alone ($1 - \tanh^2(x)$). It does still saturate for large $|x|$, where the derivative approaches 0. On average, it is more likely to create output values that are close to 0, which is beneficial when forward propagating to subsequent layers. 

Unlike the sigmoid, whose outputs lie in (0, 1), tanh is centered around zero with outputs in (-1, 1). Its derivative is also larger: at most 1, versus at most 0.25 for the sigmoid. Larger derivatives lead to greater updates to the weights and often to faster convergence toward a minimum of the cost function.
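
The relationship between the two activations can be checked numerically. The sketch below, with illustrative helper names, verifies the identity $\tanh(x) = 2\sigma(2x) - 1$ and compares the largest derivative of each function on a grid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh_grad(x):
    # Derivative expressed through the output: 1 - tanh(x)^2
    return 1.0 - np.tanh(x) ** 2

def sigmoid_grad(x):
    # Derivative expressed through the output: s * (1 - s)
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.linspace(-4.0, 4.0, 81)  # grid that includes (approximately) x = 0

# tanh is a rescaled, shifted sigmoid: tanh(x) = 2 * sigmoid(2x) - 1
identity_gap = np.max(np.abs(np.tanh(xs) - (2.0 * sigmoid(2.0 * xs) - 1.0)))

max_tanh_grad = tanh_grad(xs).max()        # attained at x = 0
max_sigmoid_grad = sigmoid_grad(xs).max()  # attained at x = 0
print(identity_gap, max_tanh_grad, max_sigmoid_grad)
```

Both maxima occur at x = 0, where the tanh gradient is four times the sigmoid gradient.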

# Problem 2: Linear regression in Keras (40%)

#### We'd like to use keras to perform linear regression and compare it to another tool (scikit-learn)
#### We'll compare OLS, ridge ($L2$ regularization) and LASSO ($L1$ regularization) using both keras and scikit-learn


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%pylab inline

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# many of these imports to be removed
from keras.models import Model, Input
from keras.layers import Dense, Softmax, Dropout
from keras.regularizers import l1_l2
from keras.optimizers import RMSprop
import keras.backend as K

Populating the interactive namespace from numpy and matplotlib


Using TensorFlow backend.


In [2]:
# Generate some data
np.random.seed(1024)
num_observations = 1024
coefs = np.array([-1.2, 5, 0, .22, 2, 0, 4])  # notice, there are zeros!
noise_amplitude = .05

num_variables = coefs.shape[0]

x = np.random.rand(num_observations, num_variables)
y = np.dot(x, coefs) + noise_amplitude * np.random.rand(num_observations)

cutoff = int(.8 * num_observations)
x_train, x_test = x[:cutoff], x[cutoff:]
y_train, y_test = y[:cutoff], y[cutoff:]

In [3]:
x_train.shape, y_train.shape

((819, 7), (819,))

In [4]:
reg = LinearRegression().fit(x_train, y_train)
lin_reg_predictions = reg.predict(x_test)
mean_squared_error(y_test, lin_reg_predictions)

0.00020867822075987705

In [5]:
lin_reg_coefs = reg.coef_
pd.Series(lin_reg_coefs, name='fit coefficients').to_frame().join(pd.Series(coefs, name='real coefficients')) 

| | fit coefficients | real coefficients |
|---|---|---|
| 0 | -1.200971 | -1.2 |
| 1 | 4.999581 | 5.0 |
| 2 | -0.00182 | 0.0 |
| 3 | 0.217426 | 0.22 |
| 4 | 1.999645 | 2.0 |
| 5 | -0.000385 | 0.0 |
| 6 | 4.000916 | 4.0 |


## Keras

In [6]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

import tensorflow
import keras
from keras import backend as K
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import activations

def plot_model_in_notebook(model):
    return SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format='svg'))


In [7]:
x_train_reshape = x_train.reshape(x_train.shape[0], 7, 1)
x_test_reshape = x_test.reshape(x_test.shape[0], 7, 1)

In [8]:
# add model definition here
# don't forget to compile your model

input_shape = (7,1)
num_classes = 1

model = Sequential()
model.add(Flatten(input_shape=input_shape, name='flatten'))
model.add(Dense(num_classes, activation='linear', name='dense_linear'))

model.compile(loss='mean_squared_error',
              optimizer='sgd',
)
model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 7)                 0         
_________________________________________________________________
dense_linear (Dense)         (None, 1)                 8         
Total params: 8
Trainable params: 8
Non-trainable params: 0
_________________________________________________________________


In [9]:
batch_size = 512
epochs = 5000

history = model.fit(x_train_reshape, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=0,
                    validation_data=(x_test_reshape, y_test))

In [10]:
keras_lin_reg = model.evaluate(x_test_reshape, y_test, verbose=0)
keras_lin_reg

0.0002113248881026421

In [30]:
keras_lin_coefs = [ model.get_weights()[0][i][0] for i in range(7) ]
pd.Series(keras_lin_coefs, name='keras coefficients').to_frame().join(pd.Series(coefs, name='real coefficients'))

| | keras coefficients | real coefficients |
|---|---|---|
| 0 | -1.202311 | -1.2 |
| 1 | 4.998437 | 5.0 |
| 2 | -0.003313 | 0.0 |
| 3 | 0.216067 | 0.22 |
| 4 | 1.998432 | 2.0 |
| 5 | -0.001766 | 0.0 |
| 6 | 3.999674 | 4.0 |


### How many parameters does the model have? 
### Explicitly show the calculation, explain it, and verify that it agrees with `model.count_params()`

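#### The single Dense unit computes $w \cdot x + b$: 7 weights (one per feature) + 1 bias (the constant term) = 8 parameters. The Flatten layer has none.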

In [11]:
model.count_params()

8

In [12]:
# find the coefficients
keras_ols_coefs = [ model.get_weights()[0][i][0] for i in range(7) ]

pd.Series(keras_ols_coefs, name='keras ols coefficients').to_frame().join(pd.Series(coefs, name='real coefficients'))

| | keras ols coefficients | real coefficients |
|---|---|---|
| 0 | -1.202311 | -1.2 |
| 1 | 4.998437 | 5.0 |
| 2 | -0.003313 | 0.0 |
| 3 | 0.216067 | 0.22 |
| 4 | 1.998432 | 2.0 |
| 5 | -0.001766 | 0.0 |
| 6 | 3.999674 | 4.0 |


## Now we will add some regularization

In [13]:
from keras.regularizers import l1_l2
regularizer = l1_l2(l1=0, l2=.1)

keras_ridge_model = Sequential()
keras_ridge_model.add(Flatten(input_shape=input_shape, name='flatten'))
keras_ridge_model.add(Dense(num_classes, activation='linear', name='dense_linear_ridge', kernel_regularizer=regularizer))

keras_ridge_model.compile(loss='mean_squared_error',
              optimizer='sgd',
)
keras_ridge_model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 7)                 0         
_________________________________________________________________
dense_linear_ridge (Dense)   (None, 1)                 8         
Total params: 8
Trainable params: 8
Non-trainable params: 0
_________________________________________________________________


In [14]:
batch_size = 512
epochs = 5000

history = keras_ridge_model.fit(x_train_reshape, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=0,
                    validation_data=(x_test_reshape, y_test))

In [40]:
keras_ridge_mse = keras_ridge_model.evaluate(x_test_reshape, y_test, verbose=0)
keras_ridge_mse

2.079639318512707

In [15]:
keras_ridge_coefs = [ keras_ridge_model.get_weights()[0][i][0] for i in range(7) ]
pd.Series(keras_ridge_coefs, name='keras ridge coefficients').to_frame().join(pd.Series(coefs, name='real coefficients'))

| | keras ridge coefficients | real coefficients |
|---|---|---|
| 0 | -0.576151 | -1.2 |
| 1 | 2.212345 | 5.0 |
| 2 | 0.011865 | 0.0 |
| 3 | 0.155825 | 0.22 |
| 4 | 0.922188 | 2.0 |
| 5 | 0.058095 | 0.0 |
| 6 | 1.862813 | 4.0 |


In [17]:
# ridge regression in sklearn
from sklearn.linear_model import Ridge

clf = Ridge(alpha=.1)
clf.fit(x_train, y_train) 
sklearn_ridge_coefs = clf.coef_
pd.Series(sklearn_ridge_coefs, name='ridge coefficients').to_frame().join(pd.Series(coefs, name='real coefficients'))

| | ridge coefficients | real coefficients |
|---|---|---|
| 0 | -1.199363 | -1.2 |
| 1 | 4.9919 | 5.0 |
| 2 | -0.001726 | 0.0 |
| 3 | 0.217458 | 0.22 |
| 4 | 1.996803 | 2.0 |
| 5 | -3.8e-05 | 0.0 |
| 6 | 3.995302 | 4.0 |


In [19]:
# compare coefficients from various methods
pd.concat([
    pd.Series(sklearn_ridge_coefs, name='ridge coefs'),
    pd.Series(keras_ridge_coefs, name='keras L2 coefs'),
    pd.Series(coefs, name='real coefs')
], axis=1)

| | ridge coefs | keras L2 coefs | real coefs |
|---|---|---|---|
| 0 | -1.199363 | -0.576151 | -1.2 |
| 1 | 4.9919 | 2.212345 | 5.0 |
| 2 | -0.001726 | 0.011865 | 0.0 |
| 3 | 0.217458 | 0.155825 | 0.22 |
| 4 | 1.996803 | 0.922188 | 2.0 |
| 5 | -3.8e-05 | 0.058095 | 0.0 |
| 6 | 3.995302 | 1.862813 | 4.0 |
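
One likely reason the Keras L2 coefficients are shrunk so much more than the scikit-learn ones is a difference in how the penalty is scaled, not a difference in method. scikit-learn's `Ridge` adds `alpha * sum(w**2)` to the summed squared error, while the Keras model adds `l2 * sum(w**2)` to the mean squared error, so `l2=0.1` would correspond to roughly `alpha = 0.1 * n` (about 82 for 819 training rows). The sketch below illustrates this with a closed-form ridge solver in NumPy on synthetic data generated like the data above. The function `ridge_closed_form` and the random seed are illustrative assumptions, not part of the original notebook, so the numbers will not match the tables exactly.

```python
import numpy as np

def ridge_closed_form(x, y, alpha):
    """Minimise sum((y - x @ w - b)**2) + alpha * sum(w**2); b is not penalised."""
    x_mean, y_mean = x.mean(axis=0), y.mean()
    xc, yc = x - x_mean, y - y_mean  # centring removes the intercept from the problem
    w = np.linalg.solve(xc.T @ xc + alpha * np.eye(x.shape[1]), xc.T @ yc)
    return w, y_mean - x_mean @ w

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
true_coefs = np.array([-1.2, 5, 0, .22, 2, 0, 4])
n = 819
x = rng.random((n, true_coefs.size))
y = x @ true_coefs + 0.05 * rng.random(n)

# Penalty 0.1 measured against the summed error (scikit-learn's convention).
w_sum_scale, _ = ridge_closed_form(x, y, alpha=0.1)
# The same 0.1 measured against the mean error, i.e. alpha = 0.1 * n.
w_mean_scale, _ = ridge_closed_form(x, y, alpha=0.1 * n)

print(np.round(w_sum_scale, 3))
print(np.round(w_mean_scale, 3))
```

Under this reading, a fairer comparison would use `l2=0.1 / n` in Keras or `alpha=0.1 * n` in scikit-learn.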


## In fact, given the zero coefficients, LASSO might have been a better model. 
## LASSO uses $L_{1}$ regularization, which tends to produce sparse coefficients (some are driven exactly to zero).

In [20]:
from sklearn.linear_model import Lasso

clf_Lasso = Lasso(alpha=0.1)
clf_Lasso.fit(x_train, y_train)

sklearn_lasso_coefs = clf_Lasso.coef_
pd.Series(sklearn_lasso_coefs, name='lasso coefficients').to_frame().join(pd.Series(coefs, name='real coefficients'))

| | lasso coefficients | real coefficients |
|---|---|---|
| 0 | -0.066804 | -1.2 |
| 1 | 3.718214 | 5.0 |
| 2 | 0.0 | 0.0 |
| 3 | 0.0 | 0.22 |
| 4 | 0.778287 | 2.0 |
| 5 | 0.0 | 0.0 |
| 6 | 2.932104 | 4.0 |


In [33]:
from keras.regularizers import l1_l2
regularizer_lasso = l1_l2(l1=0.1, l2=0)

keras_lasso_model = Sequential()
keras_lasso_model.add(Flatten(input_shape=input_shape, name='flatten'))
keras_lasso_model.add(Dense(num_classes, activation='linear', name='dense_linear_lasso', kernel_regularizer=regularizer_lasso))

keras_lasso_model.compile(loss='mean_squared_error',
              optimizer='sgd',
)
keras_lasso_model.summary()


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 7)                 0         
_________________________________________________________________
dense_linear_lasso (Dense)   (None, 1)                 8         
Total params: 8
Trainable params: 8
Non-trainable params: 0
_________________________________________________________________


In [34]:
batch_size = 512
epochs = 5000

history = keras_lasso_model.fit(x_train_reshape, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=0,
                    validation_data=(x_test_reshape, y_test))

In [39]:
keras_lasso_mse = keras_lasso_model.evaluate(x_test_reshape, y_test, verbose=0)
keras_lasso_mse

1.1041422087971757

In [35]:
keras_lasso_coefs = [ keras_lasso_model.get_weights()[0][i][0] for i in range(7) ]

In [36]:
# compare all the coefficients
pd.concat([
    pd.Series(sklearn_ridge_coefs, name='ridge coefs'),
    pd.Series(keras_ridge_coefs, name='keras L2 coefs'),
    pd.Series(sklearn_lasso_coefs, name='lasso coefs'),
    pd.Series(keras_lasso_coefs, name='keras L1 coefs'),
    pd.Series(lin_reg_coefs, name='ols coefs'),
    pd.Series(keras_lin_coefs, name='keras coefs'),
    pd.Series(coefs, name='real coefs'),
], axis=1)

| | ridge coefs | keras L2 coefs | lasso coefs | keras L1 coefs | ols coefs | keras coefs | real coefs |
|---|---|---|---|---|---|---|---|
| 0 | -1.199363 | -0.576151 | -0.066804 | -0.635886 | -1.200971 | -1.202311 | -1.2 |
| 1 | 4.9919 | 2.212345 | 3.718214 | 4.35903 | 4.999581 | 4.998437 | 5.0 |
| 2 | -0.001726 | 0.011865 | 0.0 | 0.00061 | -0.00182 | -0.003313 | 0.0 |
| 3 | 0.217458 | 0.155825 | 0.0 | 0.000135 | 0.217426 | 0.216067 | 0.22 |
| 4 | 1.996803 | 0.922188 | 0.778287 | 1.393212 | 1.999645 | 1.998432 | 2.0 |
| 5 | -3.8e-05 | 0.058095 | 0.0 | -4.9e-05 | -0.000385 | -0.001766 | 0.0 |
| 6 | 3.995302 | 1.862813 | 2.932104 | 3.470389 | 4.000916 | 3.999674 | 4.0 |


To find the optimal regularization parameters, we can compute the error on held-out data with Keras and scikit-learn over a grid of values $(\lambda_1, \lambda_2) \in [0,1] \times [0,1]$ and keep the pair that minimizes it.
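
A grid search of this kind could look like the sketch below. To stay self-contained it does not call Keras or scikit-learn. Instead it uses a small proximal-gradient solver, `fit_elastic_net`, which is a hypothetical stand-in written for this example. The grid values, the seed, and the synthetic data are also assumptions. The penalties are scaled against the mean squared error.

```python
import numpy as np

def fit_elastic_net(x, y, l1, l2, n_iter=500):
    """Proximal gradient on mean((y - x@w - b)**2) + l1*sum|w| + l2*sum(w**2)."""
    n = x.shape[0]
    x_mean, y_mean = x.mean(axis=0), y.mean()
    xc, yc = x - x_mean, y - y_mean  # centring handles the unpenalised intercept
    # Step size from the Lipschitz constant of the smooth part of the objective.
    step = 1.0 / (2.0 * np.linalg.norm(xc, 2) ** 2 / n + 2.0 * l2)
    w = np.zeros(x.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * xc.T @ (xc @ w - yc) / n + 2.0 * l2 * w
        w = w - step * grad
        w = np.sign(w) * np.maximum(np.abs(w) - step * l1, 0.0)  # soft threshold (L1 prox)
    return w, y_mean - x_mean @ w

rng = np.random.default_rng(0)  # assumed seed
true_coefs = np.array([-1.2, 5, 0, .22, 2, 0, 4])
x = rng.random((1024, true_coefs.size))
y = x @ true_coefs + 0.05 * rng.random(1024)
x_tr, x_val, y_tr, y_val = x[:819], x[819:], y[:819], y[819:]

grid = [0.0, 0.001, 0.01, 0.1, 1.0]  # assumed candidate values in [0, 1]
results = {}
for l1 in grid:
    for l2 in grid:
        w, b = fit_elastic_net(x_tr, y_tr, l1, l2)
        results[(l1, l2)] = np.mean((x_val @ w + b - y_val) ** 2)

best = min(results, key=results.get)
print(best, results[best])
```

Selecting on a validation split rather than the test set keeps the final test error an honest estimate.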

# Problem 3: Keras for harder mnist problems (40%)
#### The deep net from lecture has a hard time distinguishing between 9 and 4.
#### We will build an algorithm to do this binary classification task 

In [1]:
# safe to restart here

In [1]:
import numpy as np
import pandas as pd
%pylab inline

# many of these to be removed
from keras.datasets import mnist
from keras.models import Model, Input
from keras.layers import Dense, Softmax, Dropout
from keras.regularizers import l1_l2
from keras.optimizers import RMSprop
import keras.backend as K

Populating the interactive namespace from numpy and matplotlib


Using TensorFlow backend.


In [70]:
from keras.utils import to_categorical

def preprocess_training_data(data):
    data = data.reshape(data.shape[0], data.shape[1] * data.shape[2])
    data = data.astype('float32') / 255
    return data

def preprocess_targets(target, num_classes):
    return to_categorical(target, num_classes)


def subset_to_9_and_4(x, y):  # this is a new function
    mask = (y == 9) | (y == 4)
    new_x = x[mask]
    new_y = (y[mask] == 4).astype('int64')
    return new_x, new_y

(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = preprocess_training_data(x_train)
x_test = preprocess_training_data(x_test)

num_classes = np.unique(y_train).shape[0]

y_train_ohe = preprocess_targets(y_train, num_classes)
y_test_ohe = preprocess_targets(y_test, num_classes)

train_frac = 0.8
cutoff = int(x_train.shape[0] * train_frac)
x_train, x_val = x_train[:cutoff], x_train[cutoff:]
y_train, y_val = y_train[:cutoff], y_train[cutoff:]
y_train_ohe, y_val_ohe = y_train_ohe[:cutoff], y_train_ohe[cutoff:]


# Keep only the one-hot label rows for 4s and 9s, matching the x subsets taken below.
y_train_ohe = y_train_ohe[(y_train_ohe[:, 4] == 1) | (y_train_ohe[:, 9] == 1)]
y_val_ohe = y_val_ohe[(y_val_ohe[:, 4] == 1) | (y_val_ohe[:, 9] == 1)]
x_train, y_train = subset_to_9_and_4(x_train, y_train)
x_val, y_val = subset_to_9_and_4(x_val, y_val)
x_test, y_test = subset_to_9_and_4(x_test, y_test)

print(y_train.shape)

(9457,)


In [63]:
# first try logistic regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

lr = LogisticRegression()
lr.fit(x_train, y_train)

sklearn_lr_predictions = lr.predict(x_test)
accuracy_score(y_test, sklearn_lr_predictions)

0.9728779507785033

In [4]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot

def plot_model_in_notebook(model):
    return SVG(model_to_dot(model, show_shapes=True).create(prog='dot', format='svg'))


In [85]:
K.clear_session()
num_hidden_units = 256
# num_classes =1
# define model
digit_input = Input(shape=(x_train.shape[1],), name='digit_input')
x = Dense(num_hidden_units, activation='relu', name='dense_0')(digit_input)
x = Dropout(.2, name='dropout_0')(x)
output = Dense(num_classes, activation='softmax')(x)
model = Model(digit_input, output)
model.compile(optimizer=RMSprop(lr=2e-3, decay=1e-5), loss='binary_crossentropy', metrics=['accuracy'])
#NB: you probably want BINARY cross entropy i.e. 'binary_crossentropy' for the loss function
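
For reference, binary cross-entropy on a single predicted probability can be sketched in NumPy as below. This illustrates the formula only, not the Keras implementation. The function name and the clipping constant `eps` are assumptions.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)), with p clipped away from 0 and 1."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
confident = binary_cross_entropy(y_true, np.array([0.9, 0.1, 0.8, 0.2]))
uninformed = binary_cross_entropy(y_true, np.full(4, 0.5))  # equals log(2)
print(confident, uninformed)
```

With this loss, a natural alternative to the 10-way softmax output would be a single sigmoid unit trained directly on the 0/1 labels in `y_train`.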

In [47]:
plot_model_in_notebook(model)

OSError: `pydot` failed to call GraphViz.Please install GraphViz (https://www.graphviz.org/) and ensure that its executables are in the $PATH.

In [5]:
# how many params does the model have? 
model.count_params()

201217
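
The count can be reproduced by hand: each Dense layer has `inputs * units` weights plus one bias per unit. The sketch below uses an illustrative helper, `dense_params`. The reported 201217 matches a network ending in a single output unit, whereas the 10-unit softmax output defined in In [85] would give 203530. This suggests the count was produced by an earlier version of the model (the cell number In [5] precedes In [85]), though that is an inference from the numbers rather than something the notebook states.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_pixels = 28 * 28                             # flattened MNIST image
hidden = dense_params(n_pixels, 256)           # dense_0; Dropout adds no parameters
one_output = hidden + dense_params(256, 1)     # single-unit output layer
ten_outputs = hidden + dense_params(256, 10)   # 10-unit softmax output layer
print(hidden, one_output, ten_outputs)
```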

In [87]:
# Add code here
# model.fit(...
model.fit(x_train, y_train_ohe, batch_size=128, validation_data=(x_val, y_val_ohe), epochs=64, shuffle=True, verbose=0)
# argmax gives the predicted digit (0-9); map it to the binary label used in y_test (1 for a 4, 0 otherwise)
keras_predictions = (np.argmax(model.predict(x_test), axis=1) == 4).astype('int64')

In [90]:
from sklearn.metrics import f1_score, accuracy_score
accuracy_score(y_test, keras_predictions)

0.9924660974384731

In [None]:
# DONE! Congrats!