# Workshop: Deep Learning 3

Outline
1. Regularization
2. Gradient Descent
3. Hand-Written Digits with Convolutional Neural Networks 
4. Advanced Image Classification with Convolutional Neural Networks 


Source: Deep Learning With Python, Part 1 - Chapter 4

## 1. Regularization

To prevent a model from learning misleading or irrelevant patterns found in the
training data, the best solution is to get more training data. However, this is often out of our control.

The alternative, as you should know by now, is regularization.

### 1.1. Reducing the network’s size

The simplest way to prevent overfitting is to reduce the size of the model: the number
of learnable parameters in the model (which is determined by the number of layers
and the number of units per layer).

Put differently: a network with more parameters has more capacity to memorize the training data, which makes overfitting easier.
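To make "number of learnable parameters" concrete, here is a minimal sketch that counts the weights and biases of a stack of fully connected layers. The helper names `dense_params` and `mlp_params` are made up for this illustration; the layer widths are the ones used for the three models in this section.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected (Dense) layer."""
    return n_inputs * n_units + n_units


def mlp_params(input_dim, layer_units):
    """Total learnable parameters of a stack of Dense layers."""
    total, n_in = 0, input_dim
    for n_units in layer_units:
        total += dense_params(n_in, n_units)
        n_in = n_units
    return total


# 10,000 inputs, two hidden layers of equal width, one sigmoid output unit
for name, hidden in [("simpler", 8), ("original", 16), ("bigger", 512)]:
    print(name, mlp_params(10000, [hidden, hidden, 1]))
```

The bigger model has roughly 5.4 million parameters compared to about 160 thousand for the original one, so it has far more room to memorize individual training samples.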

In [1]:
# Unfortunately, there is no closed form solution which gives us the best network size...
# So, we need to try out different models (or use grid search)

In [30]:
# Original  Model 
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

In [31]:
# Simpler Model 
from keras import models
from keras import layers
model = models.Sequential()
model.add(layers.Dense(8, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(8, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

In [32]:
# Bigger Model 
model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

In [5]:
#### You need to load data, compile the network and then train it (with validation/hold out set)
#### Then you plot the validation loss for all these combinations

<img src="res/img1.png"></img>

<img src="res/img2.png"></img>

In [6]:
# This shows us that the bigger model starts to overfit immediately..

Instead of manually searching for the best model architecture (i.e., hyperparameters) you can use a method called grid-search. However, we will not cover this in this lecture - but you can find a tutorial here:

https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/


Basically, the author combines Keras with scikit-learn's grid search module.
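The idea behind grid search can be sketched without any library: try every combination of hyperparameter values and keep the best one. Everything in the sketch below is hypothetical, in particular the search space and the `evaluate` function, which is a toy stand-in for "build the model, train it, and return the validation loss". It is not the approach from the linked tutorial.

```python
from itertools import product

# Hypothetical search space; names and values are illustrative only.
param_grid = {
    "units": [8, 16, 512],
    "dropout": [0.0, 0.5],
}


def evaluate(units, dropout):
    """Stand-in for 'build, train, and return validation loss'.

    A real implementation would train a Keras model here; this toy
    formula only exists so that the sketch runs without any data.
    """
    return abs(units - 16) / 16 + abs(dropout - 0.5)


def grid_search(grid, score_fn):
    """Try every combination and return the one with the lowest score."""
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


best_params, best_score = grid_search(param_grid, evaluate)
print(best_params, best_score)
```

Note that the number of combinations grows multiplicatively with every added hyperparameter, which is why grid search gets expensive quickly for neural networks.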

### 1.2. Adding weight regularization

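1. L1 regularization: the added cost is proportional to the absolute value of the weights.
2. L2 regularization: the added cost is proportional to the square of the weights (also called weight decay).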
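As a rough illustration of what these penalties look like numerically, the sketch below computes both terms for a small, made-up weight matrix. The values of `W`, `alpha`, and `data_loss` are assumptions chosen only for the example.

```python
import numpy as np


def l1_penalty(weights, alpha):
    """alpha times the sum of absolute weight values."""
    return alpha * np.sum(np.abs(weights))


def l2_penalty(weights, alpha):
    """alpha times the sum of squared weight values."""
    return alpha * np.sum(np.square(weights))


# Hypothetical weight matrix of a small layer
W = np.array([[0.5, -1.0], [2.0, 0.0]])
alpha = 0.001

data_loss = 0.30  # placeholder value for the unregularized loss
print("L1 total loss:", data_loss + l1_penalty(W, alpha))
print("L2 total loss:", data_loss + l2_penalty(W, alpha))
```

The penalty is added to the loss only during training, so large weights become more expensive and the optimizer is pushed towards smaller ones.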

#### 1.2.1 Adding L2 Regularization to the model

In [7]:
from keras import regularizers
model = models.Sequential()

# kernel_regularizer=regularizers.l2(0.001) adds 0.001 * sum(weights^2) of this layer to the loss
# you could use also: regularizers.l1(0.001) for L1 regularization
# Documentation: https://keras.io/api/layers/regularizers/
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001), activation='relu', input_shape=(10000,)))

model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001), activation='relu'))

model.add(layers.Dense(1, activation='sigmoid'))

<img src="res/img3.png"></img>

### 1.3. Adding Dropout

Idea: Randomly drop out a number of (activation) nodes during training. 
    
**Assume**: [0.2, 0.5, 1.3, 0.8, 1.1] is the output of a layer (after activation function).

Dropout randomly sets some of these activations (not the weights) to 0. For example: [0, 0.5, 1.3, 0, 1.1].

The *dropout rate* is the fraction of features that are zeroed out (usually between 0.2 and 0.5).
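A minimal NumPy sketch of this idea is shown below. It uses the "inverted dropout" variant, where the surviving activations are scaled up by 1 / (1 - rate) during training so that nothing has to change at test time; the Keras `Dropout` layer documents the same scaling. The function name `dropout` and the seed are assumptions for this example.

```python
import numpy as np


def dropout(activations, rate, rng):
    """Inverted dropout: zero a random subset, rescale the survivors."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate), keep_mask


rng = np.random.default_rng(seed=0)
layer_output = np.array([0.2, 0.5, 1.3, 0.8, 1.1])

dropped, mask = dropout(layer_output, rate=0.5, rng=rng)
print(mask)     # True where the activation was kept
print(dropped)  # kept activations are doubled, the others are 0
```

Because the mask is drawn independently for each element, the number of zeroed activations varies from call to call; the rate is only the expected fraction.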

In [8]:
# Example Code 
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))

# Pass the dropout rate as the first argument
model.add(layers.Dropout(0.5))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))

# Compile..
# Fit..
# Evaluate...
# Doc: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout

<img src="res/img4.png"></img>

### To recap, these are the most common ways to prevent overfitting in neural networks:
1. Get more training data.
2. Reduce the capacity of the network.
3. Add weight regularization.
4. Add dropout.
5. Data Augmentation (for image classification tasks)

## 2. Gradient Descent

In [10]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets 
from sklearn.metrics import mean_squared_error

housing_data = datasets.fetch_california_housing()

features = pd.DataFrame(housing_data.data, columns=housing_data.feature_names)
target = pd.DataFrame(housing_data.target, columns=['Target'])

df = features.join(target)

X = df.MedInc
Y = df.Target

In [11]:
# Model: y = m*x + b
# Loss function: MSE = 1/N * sum((y_i - (m*x_i + b))^2)

def gradient_descent(X, y, lr=0.05, iterations=10):
    
    '''
    Gradient Descent for a single feature
    '''
    
    m, b = 0.2, 0.2 # fixed initial parameters
    log, mse = [], [] # lists to store learning process
    N = len(X) # number of samples
    
    # MSE = 1/N * SUM((y_i - (m*x_i + b))^2)
    # MSE' w.r.t. m => 1/N * SUM(-2*x_i*(y_i - (m*x_i + b)))
    # MSE' w.r.t. b => 1/N * SUM(-2*(y_i - (m*x_i + b)))

    for _ in range(iterations):
                
        f = y - (m*X + b)
    
        # Updating m and b 
        m -= lr * (-2 * X.dot(f).sum() / N) 
        b -= lr * (-2 * f.sum() / N)
        
        log.append((m, b))
        mse.append(mean_squared_error(y, (m*X + b)))        
    
    return m, b, log, mse
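
For reference, the update lines in `gradient_descent` can be read as the following derivation. The symbol $\eta$ for the learning rate (`lr` in the code) is a notation chosen here; the residual $y_i - (m x_i + b)$ corresponds to `f`.

```latex
\begin{aligned}
\mathrm{MSE}(m, b) &= \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - (m x_i + b)\bigr)^2 \\
\frac{\partial\,\mathrm{MSE}}{\partial m} &= -\frac{2}{N}\sum_{i=1}^{N} x_i \bigl(y_i - (m x_i + b)\bigr) \\
\frac{\partial\,\mathrm{MSE}}{\partial b} &= -\frac{2}{N}\sum_{i=1}^{N} \bigl(y_i - (m x_i + b)\bigr) \\
m &\leftarrow m - \eta\,\frac{\partial\,\mathrm{MSE}}{\partial m}, \qquad
b \leftarrow b - \eta\,\frac{\partial\,\mathrm{MSE}}{\partial b}
\end{aligned}
```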


In [12]:
m, b, log, mse = gradient_descent(X, Y, lr=0.01, iterations=1000)

In [13]:
(m, b)

(0.41893244701097204, 0.44612945637258383)

In [14]:
# Analytical solution (for comparison with the gradient descent result)
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(features["MedInc"].to_numpy().reshape(-1, 1), Y)

In [15]:
(reg.coef_, reg.intercept_)

(array([0.41793849]), 0.4508557670326776)
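
For a single feature, the least-squares line also has a simple closed form: the slope is the covariance of x and y divided by the variance of x. The sketch below applies it to synthetic data (not the housing data) so that the true parameters are known; the helper name `fit_line` and the data-generating values are assumptions for this example.

```python
import numpy as np


def fit_line(x, y):
    """Closed-form least-squares slope and intercept for one feature."""
    x_mean, y_mean = x.mean(), y.mean()
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return slope, intercept


# Synthetic data: a noisy line y = 0.4 x + 0.5
rng = np.random.default_rng(seed=42)
x = rng.uniform(0.0, 10.0, size=500)
y = 0.4 * x + 0.5 + rng.normal(0.0, 0.1, size=500)

slope, intercept = fit_line(x, y)
print(slope, intercept)

# np.polyfit solves the same least-squares problem and should agree
print(np.polyfit(x, y, deg=1))
```

Gradient descent approaches this solution iteratively, which is why the values of `(m, b)` above are close to, but not identical with, the `LinearRegression` coefficients.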

##### Stochastic Gradient Descent

In [16]:
def stochastic_gradient_descent(X, y, lr=0.05, iterations=10, batch_size=10):
        
    '''
    Stochastic Gradient Descent for a single feature
    '''
    
    m, b = 0.5, 0.5 # initial parameters
    log, mse = [], [] # lists to store learning process
    
    for _ in range(iterations):
        
        indexes = np.random.randint(0, len(X), batch_size) # randomly sample batch_size indices (with replacement) from the training set
        
        Xs = np.take(X, indexes)
        ys = np.take(y, indexes)
        N = len(Xs)
        
        f = ys - (m*Xs + b)
        
        # Updating parameters m and b
        m -= lr * (-2 * Xs.dot(f).sum() / N)
        b -= lr * (-2 * f.sum() / N)
        
        log.append((m, b))
        mse.append(mean_squared_error(y, m*X+b))        
    
    return m, b, log, mse

In [17]:
m, b, log, mse = stochastic_gradient_descent(X, Y, lr=0.01, iterations=100, batch_size = 100)

In [18]:
(m,b)

(0.41078958000903365, 0.4757463276058315)
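
A common variant, sketched below, shuffles the data once per epoch and then walks through it in batches, so every sample is used exactly once per epoch instead of being drawn with replacement. This is an alternative to the function above, not a description of it; the name `minibatch_sgd`, the hyperparameters, and the synthetic data are assumptions for this example.

```python
import numpy as np


def minibatch_sgd(x, y, lr=0.1, epochs=200, batch_size=32, seed=0):
    """Mini-batch gradient descent that visits every sample once per epoch."""
    rng = np.random.default_rng(seed)
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle: sampling without replacement
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = y[idx] - (m * x[idx] + b)
            # Same MSE gradients as in gradient_descent, averaged over the batch
            m -= lr * (-2.0 * np.dot(x[idx], residual) / len(idx))
            b -= lr * (-2.0 * residual.sum() / len(idx))
    return m, b


# Noise-free synthetic data, so the true parameters are known: y = 2x + 1
x = np.linspace(0.0, 1.0, 256)
y = 2.0 * x + 1.0

m, b = minibatch_sgd(x, y)
print(m, b)
```

This is also how `batch_size` and `epochs` behave in Keras' `model.fit`: one epoch is one full pass over the training data.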

## 3. Using CNNs to Classify Hand-written Digits on the MNIST Dataset

<img src="res/img5.png"></img>

In [19]:
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Conv2D, MaxPool2D
# Note: newer Keras releases drop np_utils; there, use keras.utils.to_categorical instead
from keras.utils import np_utils

In [20]:
# Load Data
(X_train, y_train), (X_test, y_test) = mnist.load_data()


In [21]:
# Shape of data
print("X_train shape", X_train.shape)
print("y_train shape", y_train.shape)
print("X_test shape", X_test.shape)
print("y_test shape", y_test.shape)

X_train shape (60000, 28, 28)
y_train shape (60000,)
X_test shape (10000, 28, 28)
y_test shape (10000,)


In [22]:
# Flatten each 28x28 image into a 1D vector of 784 pixels
X_train = X_train.reshape(60000, 784)
X_test = X_test.reshape(10000, 784)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

In [23]:
# normalizing the data to help with the training
X_train /= 255
X_test /= 255

In [24]:
# To Categorical (One-Hot Encoding)
n_classes = 10
print("Shape before one-hot encoding: ", y_train.shape)
Y_train = np_utils.to_categorical(y_train, n_classes)
Y_test = np_utils.to_categorical(y_test, n_classes)
print("Shape after one-hot encoding: ", Y_train.shape)

Shape before one-hot encoding:  (60000,)
Shape after one-hot encoding:  (60000, 10)
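
What one-hot encoding does to the labels can be sketched in plain NumPy. The helper `one_hot` below is an illustration of the idea, not the Keras implementation, and the labels are made up.

```python
import numpy as np


def one_hot(labels, n_classes):
    """Turn integer class labels into one-hot rows."""
    labels = np.asarray(labels).reshape(-1)  # accepts shape (N,) or (N, 1)
    encoded = np.zeros((labels.size, n_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded


labels = np.array([5, 0, 4])   # made-up digit labels
encoded = one_hot(labels, n_classes=10)
print(encoded.shape)           # (3, 10)
print(encoded.argmax(axis=1))  # recovers [5 0 4]
```

Each row contains a single 1 at the index of the class, which is the target format that `categorical_crossentropy` expects.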


In [25]:
# Let's build again a very boring neural network
model = Sequential()
# hidden layer
model.add(Dense(100, input_shape=(784,), activation='relu'))
# output layer
model.add(Dense(10, activation='softmax'))

In [26]:
# looking at the model summary
model.summary()
# Compile
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')
# Train (caution: the test set is used as validation data here only for brevity; in practice, hold out a separate validation set)
model.fit(X_train, Y_train, batch_size=128, epochs=10, validation_data=(X_test, Y_test))

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_15 (Dense)            (None, 100)               78500     
                                                                 
 dense_16 (Dense)            (None, 10)                1010      
                                                                 
Total params: 79,510
Trainable params: 79,510
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff55f759af0>

In [27]:
# new imports needed
from keras.layers import  Conv2D, MaxPool2D, Flatten

# And now with a convolutional neural network
# Doc: https://keras.io/api/layers/convolution_layers/

# Load the data again
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Don't flatten: keep the 2D grid structure (plus one channel axis)
X_train = X_train.reshape(X_train.shape[0], 28, 28, 1)
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# normalize
X_train /= 255
X_test /= 255

# Sequential Model
model = Sequential()
# Convolutional layer

# 2D convolutional layer
# filters: number of kernels (one feature map per kernel)
# kernel_size: (3, 3) pixel filter
# strides: (1, 1) => move one pixel to the right; at the end of a row, move one pixel down
# padding: "valid" => no padding => the feature map shrinks (28x28 -> 26x26)
model.add(Conv2D(filters=25, kernel_size=(3,3), strides=(1,1), padding='valid', activation='relu', input_shape=(28,28,1)))


# Note: pool_size=(1,1) leaves the feature map unchanged; use (2,2) to actually downsample
model.add(MaxPool2D(pool_size=(1,1)))
# flatten output such that the "densely" connected network can be attached
model.add(Flatten())

# hidden layer
model.add(Dense(100, activation='relu'))

# output layer
model.add(Dense(10, activation='softmax'))

# compiling the sequential model
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')

# training the model for 10 epochs
model.fit(X_train, Y_train, batch_size=128, epochs=10, validation_data=(X_test, Y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7ff5885b5d60>
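
To see why "valid" padding shrinks the feature map, the sketch below implements a naive single-channel convolution (strictly speaking a cross-correlation, which is what deep learning frameworks compute) together with the usual output-size formula. The function names, the all-ones image, and the averaging kernel are assumptions for this example; a real layer learns its kernels.

```python
import numpy as np


def conv_output_size(size, kernel, stride=1, padding="valid"):
    """Spatial output size of a convolution along one dimension."""
    if padding == "same":
        return -(-size // stride)  # ceil(size / stride)
    return (size - kernel) // stride + 1


def conv2d_valid(image, kernel):
    """Naive single-channel 2D convolution, stride 1, no padding."""
    kh, kw = kernel.shape
    out_h = conv_output_size(image.shape[0], kh)
    out_w = conv_output_size(image.shape[1], kw)
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


image = np.ones((28, 28))            # stand-in for one MNIST digit
kernel = np.ones((3, 3)) / 9.0       # hypothetical averaging filter
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)             # (26, 26)

# Parameters of the Conv2D layer above: 25 filters of 3x3x1 plus 25 biases
print(3 * 3 * 1 * 25 + 25)           # 250
```

The convolutional layer itself is tiny (250 parameters); most of the parameters of the model above sit in the first `Dense` layer after `Flatten`.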

In [None]:
# More on Classification with CNNs

## 4. Advanced Image Classification with Deep Convolutional Neural Networks

<img src="res/img6.png"></img>

In [28]:
# Imports
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Dense, Dropout, Conv2D, MaxPool2D, Flatten
from keras.utils import np_utils

# Load Data
(X_train, y_train), (X_test, y_test) = cifar10.load_data()

# Keep Grid Structure with 32x32 pixels (times 3; due to color channels)
X_train = X_train.reshape(X_train.shape[0], 32, 32, 3)
X_test = X_test.reshape(X_test.shape[0], 32, 32, 3)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Normalize
X_train /= 255
X_test /= 255

# One-Hot Encoding
n_classes = 10
print("Shape before one-hot encoding: ", y_train.shape)
Y_train = np_utils.to_categorical(y_train, n_classes)
Y_test = np_utils.to_categorical(y_test, n_classes)
print("Shape after one-hot encoding: ", Y_train.shape)

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
Shape before one-hot encoding:  (50000, 1)
Shape after one-hot encoding:  (50000, 10)


In [29]:

# Create Model Object
model = Sequential()

# Add Conv. Layer
model.add(Conv2D(50, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu', input_shape=(32, 32, 3)))

## What happens here?

# Stack 2. Conv. Layer
model.add(Conv2D(75, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu'))

model.add(MaxPool2D(pool_size=(2,2)))

model.add(Dropout(0.25))

# Stack 3. Conv. Layer
model.add(Conv2D(125, kernel_size=(3,3), strides=(1,1), padding='same', activation='relu'))
model.add(MaxPool2D(pool_size=(2,2)))
model.add(Dropout(0.25))

# Flatten Output of Conv. Part such that we can add a densely connected network
model.add(Flatten()) 

# Add Hidden Layer and Dropout Reg.
model.add(Dense(500, activation='relu'))
model.add(Dropout(0.4))
model.add(Dense(250, activation='relu'))
model.add(Dropout(0.3))

# Output Layer
model.add(Dense(10, activation='softmax'))

# Compile
model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer='adam')

# Train
model.fit(X_train, Y_train, batch_size=128, epochs=2, validation_data=(X_test, Y_test))

Epoch 1/2

KeyboardInterrupt: 
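
The effect of the `MaxPool2D(pool_size=(2,2))` layers in the model above can be sketched in NumPy: each non-overlapping 2x2 block is replaced by its maximum, which halves height and width. The helper `max_pool_2x2` is an illustration for a single 2D feature map, not the Keras implementation.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a (height, width) array."""
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    trimmed = feature_map[:h2 * 2, :w2 * 2]   # drop an odd last row/column
    blocks = trimmed.reshape(h2, 2, w2, 2)
    return blocks.max(axis=(1, 3))


x = np.arange(16).reshape(4, 4)
print(max_pool_2x2(x))
# [[ 5  7]
#  [13 15]]

# Shape trace for the CIFAR-10 model above ('same' convolutions keep 32x32):
height = width = 32
channels = 125                     # filters of the last Conv2D layer
for _ in range(2):                 # two MaxPool2D(pool_size=(2, 2)) layers
    height, width = height // 2, width // 2
print(height, width, channels, height * width * channels)  # 8 8 125 8000
```

So `Flatten` hands a vector of 8,000 values to the first `Dense(500)` layer; without the two pooling layers it would be 32 * 32 * 125 = 128,000 values.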