# Introduction to Convolutional Neural Networks

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("talk")
plt.rcParams["figure.figsize"] = [9.708,6]
import warnings
warnings.filterwarnings('ignore')
#this is our new one
import tensorflow as tf
tf.random.set_seed(0)
np.random.seed(0)
# !pip install tensorflow
from tensorflow.keras.datasets import fashion_mnist

### Improving Computer Vision Accuracy using Convolutions

In the previous lessons you saw how to do fashion recognition using a Deep Neural Network (DNN) containing three layers -- the input layer (in the shape of the data), the output layer (in the shape of the desired output) and a hidden layer. You experimented with the impact of different hidden-layer sizes, number of training epochs, etc. on the final accuracy.

For convenience, here's the entire code again. Run it and take a note of the test accuracy that is printed out at the end. 

Load the data

In [None]:
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
num_classes = 10
input_shape = (X_train.shape[1],X_train.shape[2])
#normalize the data between 0-1
X_train = X_train.astype('float32') / 255
X_test  = X_test.astype( 'float32') / 255
#Reshape To Match The Keras's Expectations
X_train = X_train.reshape(X_train.shape[0], 1, input_shape[0], input_shape[1])
X_test  = X_test.reshape( X_test.shape[0],  1, input_shape[0], input_shape[1])
#one hot encoding
Y_train = tf.keras.utils.to_categorical(y_train, num_classes)
Y_test  = tf.keras.utils.to_categorical(y_test,  num_classes)
#==============
print(X_train.shape[0], 'train samples')
print(X_test.shape[0],  'test samples')
print(X_train.shape)
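
The one-hot encoding step in the cell above can be illustrated with plain NumPy. This is a minimal sketch of the idea, not the `to_categorical` implementation; the helper name `one_hot` and the three example labels are made up for illustration.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return an (n_samples, n_classes) array with a single 1 per row."""
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(n_classes, dtype="float32")[labels]

labels = np.array([9, 0, 3])        # e.g. 'Ankle boot', 'T-shirt/top', 'Dress'
encoded = one_hot(labels, 10)
recovered = encoded.argmax(axis=1)  # argmax undoes the encoding

print(encoded)
print(recovered)
```

The `argmax` call is the same trick used later in the notebook to turn predicted probabilities back into class labels.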


### Reduce size of data
* Note we are doing this because we are interested in how architecture affects the accuracy. 
* To speed up training and testing for your hw, we are decreasing the size of the training data by a factor of 4.  
* You'll certainly get better results using the whole dataset, but we don't want to spend all our time **training** the **best** model; we'd rather learn how to use CNNs.

In [None]:
mm= 4
X_train=X_train[::mm]
Y_train=Y_train[::mm]

## NN Model architecture

We will start with the following model, but will explore how some of the choices here affect our outcome.  

Layers for our Network.

* **Input layer** - size 784 
    * flatten the input image (28x28).
* **1 Hidden layer** - with size 500
    * Dense (fully connected) network from the input layer to this 500-neuron hidden layer.
* **Dropout** - 0.5
    * randomly sets 50% of the input units to 0 at each step during training time, which helps prevent overfitting. 
* **Output layer** - size 10
    * Dense layer (fully connected back to the 500-neuron hidden layer). The 10 is the number of classes.  Given an input image, our network should **light** up the corresponding neuron of our target.
* **Softmax activation** - convert our output into a probability for each class.
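
To make the last bullet concrete, here is a small NumPy sketch of what a softmax does to the 10 raw outputs of the final dense layer. The `logits` values are invented for illustration, and the helper is a textbook formulation rather than the Keras implementation.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1 along the last axis."""
    # Subtracting the row maximum does not change the result but avoids overflow.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# One made-up row of 10 raw outputs, one per class.
logits = np.array([[2.0, 1.0, 0.1, -1.0, 0.0, 0.5, 3.0, -2.0, 0.3, 1.5]])
probs = softmax(logits)
predicted = int(np.argmax(probs, axis=-1)[0])

print(probs.round(3))
print("predicted class index:", predicted)
```

The neuron that "lights up" is simply the one with the largest probability, which is why `argmax` recovers the predicted class.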


In [None]:
epochs     = 10
batch_size = 200

In [None]:
tf.random.set_seed(0)                             # set our initial seed
modelA = tf.keras.models.Sequential([             # model type
  tf.keras.layers.Flatten(input_shape=X_train[1].shape),  # input layer
  tf.keras.layers.Dense(500, activation='relu'),   # hidden layer
  tf.keras.layers.Dropout(0.5),                    # Dropout helps reduce overfitting 
  tf.keras.layers.Dense(10),                      # output to each class, could just stop here
  tf.keras.layers.Softmax()                       # convert to probability
])
sgd = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9, nesterov=True, name='SGD')
modelA.compile(optimizer=sgd,
              loss='categorical_crossentropy',    #need to define our loss function
              metrics=['accuracy'])
tstart   = tf.timestamp()
historyA = modelA.fit(X_train, Y_train, verbose=1,
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_split = 0.2) 
total_time = tf.timestamp() - tstart
print("total time %3.3f seconds"%total_time)

In [None]:
#we will use this a lot, so lets make a function
def printAccuracy(history,results_test):
    print("train loss %.5f \t train acc: %.5f"%(history.history['loss'][-1],history.history['accuracy'][-1]))
    print("valid loss %.5f \t valid acc: %.5f"%(history.history['val_loss'][-1],history.history['val_accuracy'][-1]))
    print("test loss  %.5f \t test acc:  %.5f"%(results_test[0],results_test[1]))
#we will do this a lot, so lets make a function for this
def plot_result(history,results_test):
    # Get training and validation histories
    training_acc = history.history['accuracy']
    val_acc      = history.history['val_accuracy']
    # Create count of the number of epochs
    epoch_count = range(1, len(training_acc) + 1)
    # Visualize accuracy history
    plt.plot(epoch_count, training_acc, 'b-o',label='Training')
    plt.plot(epoch_count, val_acc, 'r--',label='Validation')
    plt.plot(epoch_count, results_test[1]*np.ones(len(epoch_count)),'k--',label='Test')
    plt.legend()
    plt.title("Training and validation accuracy")
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
 

## Model accuracy and loss

In [None]:
#===    
results_test = modelA.evaluate(X_test, Y_test, batch_size=128,verbose=0)    
printAccuracy(historyA,results_test)

## Plot training and test accuracy per epoch

In [None]:
plot_result(historyA,results_test)   
plt.title("Fashion MNIST 1-hidden layer, in %3.2f s"%(total_time)) #overwrite the title
plt.show()

In [None]:
modelA.summary()
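
As a cross-check on the summary, the parameter counts of the dense model can be worked out by hand. The sketch below assumes the layer sizes used in the code cell above (784 flattened inputs, `Dense(500)`, `Dense(10)`); `dense_params` is a hypothetical helper, and `Flatten`, `Dropout` and `Softmax` contribute no trainable parameters.

```python
def dense_params(n_in, n_out):
    """A dense layer has one weight per input-output pair plus one bias per output."""
    return n_in * n_out + n_out

n_inputs = 28 * 28                      # Flatten turns each image into 784 values
hidden = dense_params(n_inputs, 500)    # Dense(500)
output = dense_params(500, 10)          # Dense(10)
total = hidden + output

print("hidden layer:", hidden)
print("output layer:", output)
print("total:", total)
```

Almost all of the parameters sit in the first dense layer, because every one of the 784 pixels connects to every hidden neuron.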

Your accuracy is probably about 87% on training and 86% on test...not bad... 
The best we did last class was about **87%**.

But it's overfitting a bit and not generalizing as well to the test data.

But how do you make that even better? 

One way is to use something called Convolutions. We're not going to go into details on Convolutions here, but instead refer you to the lecture notes.  
* The idea of CNNs is that they narrow down the content of the image to focus on specific, distinct details.

If you've ever done image processing using a filter (like this: https://en.wikipedia.org/wiki/Kernel_(image_processing)) then convolutions will look very familiar.

In short, you take an array (usually 3x3 or 5x5) and pass it over the image. By changing the underlying pixels based on the formula within that matrix, you can do things like edge detection. So, for example, if you look at the above link, you'll see a 3x3 that is defined for edge detection where the middle cell is 8, and all of its neighbors are -1. In this case, for each pixel, you would multiply its value by 8, then subtract the value of each neighbor. Do this for every pixel, and you'll end up with a new image that has the edges enhanced.
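
The edge-detection example above can be tried on a toy image with a few lines of NumPy. This is a sketch under simple assumptions: no padding, stride 1, and a made-up 6x6 image with a dark left half and a bright right half. The helper name `filter2d_valid` is hypothetical. Strictly speaking it computes a cross-correlation (the kernel is not flipped), which is also what deep learning "convolution" layers compute; for this symmetric kernel the two are identical.

```python
import numpy as np

def filter2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding) and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=float)
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Middle cell 8, all neighbors -1, as described in the text.
edge_kernel = np.array([[-1, -1, -1],
                        [-1,  8, -1],
                        [-1, -1, -1]])

# Toy 6x6 image: dark left half, bright right half (one vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

edges = filter2d_valid(image, edge_kernel)
print(edges)
```

Flat regions give 0 and only the columns next to the edge give non-zero values, which is the "edges enhanced" effect. Note also that the 6x6 input shrinks to 4x4 because no padding is used.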

This is perfect for computer vision, because it's often features highlighted like this that distinguish one item from another, and the amount of information needed is then much less...because you'll just train on the highlighted features.

That's the concept of Convolutional Neural Networks. Add some layers to do convolution before you have the dense layers, and then the information going to the dense layers is more focussed, and possibly more accurate.

## CNN: building our first model

Next is to define your model. Now instead of the input layer at the top, you're going to add a Convolution. The parameters are:

1. The number of convolutions (filters) you want to generate. Purely arbitrary, but good to start with something on the order of 32
2. The size of the Convolution, in this case a 3x3 grid
3. The activation function to use -- in this case we'll use relu, which you might recall is the equivalent of returning x when x>0, else returning 0
4. In the first layer, the shape of the input data.
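
The first two parameters determine both the output shape and the number of trainable weights of the layer. The sketch below works this out for 32 filters of size 3x3 on a 28x28x1 input, assuming no padding and stride 1; `conv2d_valid` is a hypothetical helper, not a Keras function.

```python
def conv2d_valid(height, width, channels_in, filters, kernel=3):
    """Output shape and parameter count of a stride-1 convolution without padding."""
    out_shape = (height - kernel + 1, width - kernel + 1, filters)
    # Each filter has kernel*kernel*channels_in weights plus one bias.
    params = kernel * kernel * channels_in * filters + filters
    return out_shape, params

out_shape, params = conv2d_valid(28, 28, 1, filters=32, kernel=3)
print("output shape:", out_shape)
print("parameters:", params)
```

A convolutional layer is cheap in parameters compared with a dense layer, because the same small kernel is reused at every position in the image.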


### Layers for our Network for CNN model

* **Input layer** - size (28x28).
* **1 Hidden convolutional layer** - with 32 filters, each of size (3x3)
* **Max Pooling** 
* **Dropout** - 0.5
* **1 Hidden Dense (fully connected) layer** - with size 200
* **Output layer** - with **Softmax activation** - convert our output into a probability for each class.


You'll follow the Convolution with a MaxPooling layer, which is designed to compress the image while maintaining the content of the features that were highlighted by the convolution. By specifying (2,2) for the MaxPooling, the effect is to quarter the size of the image. Without going into too much detail here, the idea is that it takes a 2x2 array of pixels and picks the biggest one, thus turning 4 pixels into 1. It repeats this across the image, and in so doing halves the number of horizontal pixels and halves the number of vertical pixels, effectively reducing the image to 25% of its original size.

You can call model.summary() to see the size and shape of the network, and you'll notice that after every MaxPooling layer, the image size is reduced in this way. 
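
The 2x2 max pooling described above can be sketched in NumPy on a single-channel image. This assumes non-overlapping 2x2 windows (stride 2) and drops any leftover odd row or column; `max_pool_2x2` is a hypothetical helper used only for illustration.

```python
import numpy as np

def max_pool_2x2(image):
    """Keep the largest value in each non-overlapping 2x2 block."""
    h2, w2 = image.shape[0] // 2, image.shape[1] // 2
    # Split rows and columns into (block index, position inside block).
    blocks = image[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2)
    return blocks.max(axis=(1, 3))

image = np.arange(16).reshape(4, 4)   # toy 4x4 "image" with values 0..15
pooled = max_pool_2x2(image)

print(image)
print(pooled)
```

The 4x4 input becomes 2x2: half the rows, half the columns, and a quarter of the pixels.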


```
model = tf.keras.models.Sequential([
  tf.keras.layers.Conv2D(32, (3,3), activation='relu', input_shape=(28, 28, 1)),
  tf.keras.layers.MaxPooling2D(2, 2),
```

Now flatten the output. After this you'll just have the same DNN structure as the non-convolutional version.

```
  tf.keras.layers.Flatten(),
```



A dense layer with 128 neurons and an output layer with 10 neurons, the same structure as in the pre-convolution example:



```
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10, activation='softmax')
])
```



Now let's compile the model, call the fit method to do the training, and evaluate the loss and accuracy from the test set.



Now, TensorFlow prefers to have the channels (red,green,blue) last in the ordering of the image. This can be a source of endless headaches trying to get everything the right dimensions in your network.  

Your input data should have the shape
* **(samples,img_rows,img_cols,img_channels)**

However, we have grayscale images, so our **img_channels=1**
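
Adding the single grayscale channel is just a matter of inserting a length-1 axis. The sketch below uses a small array of zeros as a stand-in for a batch of images and shows both orderings; the variable names are illustrative.

```python
import numpy as np

images = np.zeros((5, 28, 28), dtype="float32")   # stand-in for 5 grayscale images

channels_last = images[..., np.newaxis]           # (samples, rows, cols, channels)
channels_first = images[:, np.newaxis, :, :]      # (samples, channels, rows, cols)

print(channels_last.shape)
print(channels_first.shape)
```

The `reshape` calls in the data-loading cells produce the same shapes. Only the position of the length-1 axis differs between the dense and CNN versions.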

In [None]:
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
num_classes = 10
input_shape = (X_train.shape[1],X_train.shape[2])
#normalize the data between 0-1
X_train = X_train.astype('float32') / 255
X_test  = X_test.astype( 'float32') / 255
#Reshape To Match The tf.keras's Expectations for CNNs
X_train = X_train.reshape(X_train.shape[0], input_shape[0], input_shape[1],1)
X_test  = X_test.reshape( X_test.shape[0],  input_shape[0], input_shape[1],1)
#one hot encoding
Y_train = tf.keras.utils.to_categorical(y_train, num_classes)
Y_test  = tf.keras.utils.to_categorical(y_test,  num_classes)
#==============
print(X_train.shape[0], 'train samples')
print(X_test.shape[0],  'test samples')
print(X_train.shape)


### Reduce size of data
* Note we are doing this because we are interested in how architecture affects the accuracy. 
* To speed up training and testing for your hw, we are decreasing the size of the training data by a factor of 4.  
* You'll certainly get better results using the whole dataset, but we don't want to spend all our time **training** the **best** model; we'd rather learn how to use CNNs.

In [None]:
mm= 4
X_train=X_train[::mm]
Y_train=Y_train[::mm]
print(X_train.shape)

## Build your first Convolutional Neural Network

In [None]:
epochs    = 10
batch_size= 200

In [None]:
tf.random.set_seed(0)                             # set our initial seed
#===
modelB = tf.keras.models.Sequential([             # model type
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu',  
                           padding='valid', input_shape=(28,28,1)),#, data_format='channels_first'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2, padding='valid'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),                    # Dropout helps reduce overfitting 
    tf.keras.layers.Dense(200, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
#===
sgd = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9, nesterov=True, name='SGD')
modelB.compile(optimizer=sgd,
              loss='categorical_crossentropy',    #need to define our loss function
              metrics=['accuracy'])
#===
tstart   = tf.timestamp()
historyB = modelB.fit(X_train, Y_train, verbose=1,
                    epochs=epochs,
                    batch_size=batch_size,
                    validation_split = 0.2) 
total_time = tf.timestamp() - tstart
print("total time %3.3f seconds"%total_time)

## Model accuracy and loss

In [None]:
#===    
results_test = modelB.evaluate(X_test, Y_test, batch_size=128,verbose=0)    
printAccuracy(historyB,results_test)

With the dense network, our best test accuracy was about 87.7%, for the 2-layer network.  

Here, we do quite a bit better at **88.8%**, while training on only **1/4** of the data!

## Plot training and test accuracy per epoch

In [None]:
plot_result(historyB,results_test)   
plt.title("Fashion MNIST 1-hidden layer CNN, in %3.2f s"%(total_time))
plt.show()

Training accuracy could be lower here, because the training values are recorded with **Dropout** on, whereas when the network is run on the validation or **Test** data, **Dropout** is off.

We are also overfitting a bit, and could probably turn up the dropout or decrease the number of neurons in the fully connected layer.
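
The difference between Dropout during training and during evaluation can be sketched in NumPy. This is a minimal "inverted dropout" illustration, not the Keras source: the `dropout` helper, the `training` flag and the all-ones activations are assumptions made for the example.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Inverted dropout: active only when training=True, identity otherwise."""
    if not training:
        return x
    keep_mask = rng.random(x.shape) >= rate   # True where a unit is kept
    # Rescale the kept units so the expected activation stays the same.
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.ones((1000, 200))            # stand-in layer output

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
test_out = dropout(activations, rate=0.5, training=False, rng=rng)
frac_zero = float(np.mean(train_out == 0))

print("fraction zeroed while training:", round(frac_zero, 3))
print("mean while training:", round(float(train_out.mean()), 3))
print("unchanged at evaluation:", bool(np.array_equal(test_out, activations)))
```

During training roughly half the units are silenced on each step, so the network is effectively handicapped when its training accuracy is recorded. At evaluation time every unit is used.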

In [None]:
modelB.summary()

Predict classes on the test set.

In [None]:
### y_hat = modelB.predict_classes(X_test) ### this is deprecated
y_hat = np.argmax(modelB.predict(X_test), axis=-1) #working version
pd.crosstab(y_hat, y_test)
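
A table like the crosstab above can be summarized into per-class and overall accuracy. The sketch below uses NumPy only and a tiny made-up set of labels (3 classes, 7 samples), not the notebook's predictions; rows are predicted classes and columns are true classes, matching the `pd.crosstab(y_hat, y_test)` orientation.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count samples for each (predicted, true) pair: rows predicted, columns true."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_pred, y_true), 1)
    return cm

# Made-up labels for illustration only.
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2])

cm = confusion_matrix(y_true, y_pred, 3)
per_class_recall = np.diag(cm) / cm.sum(axis=0)   # correct / number of true samples per class
overall = np.trace(cm) / cm.sum()                 # fraction of samples on the diagonal

print(cm)
print(per_class_recall.round(3))
print(round(float(overall), 3))
```

Off-diagonal entries show which classes get confused with each other, which is often more informative than a single accuracy number.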

## 2 convolutional layers

Stacking two convolutional layers (each followed by max pooling) tends to produce very predictive models. Here, we also follow the convolution layers with a dense hidden layer. Note that this model takes **significantly** longer to train than the dense models. As such, we again train on only every fourth training sample. Using all of them should yield a higher classification rate on the entire test set.

In [None]:
epochs      = 10
batch_size  = 200

In [None]:
tf.random.set_seed(0)                             # set our initial seed
#=======
modelC = tf.keras.models.Sequential([             # model type
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu',  
                           padding='valid', input_shape=(28,28,1)),#, data_format='channels_first'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2, padding='valid'),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu',  
                           padding='valid'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2, padding='valid'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),                    # Dropout helps reduce overfitting 
    tf.keras.layers.Dense(200, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
#=======
sgd = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9, nesterov=True, name='SGD')
modelC.compile(optimizer=sgd,
              loss='categorical_crossentropy',    #need to define our loss function
              metrics=['accuracy'])
#=======
tstart   = tf.timestamp()
historyC = modelC.fit(X_train, Y_train, verbose=1,
                      epochs=epochs,
                      batch_size=batch_size,
                      validation_split = 0.2) 
                      
total_time = tf.timestamp() - tstart
print("total time %3.3f seconds"%total_time)

## Model accuracy and loss

In [None]:
#===    
results_test = modelC.evaluate(X_test, Y_test, batch_size=128,verbose=0)    
printAccuracy(historyC,results_test)

## Plot training and test accuracy per epoch

In [None]:
plot_result(historyC,results_test)   
plt.title("Fashion MNIST 2-hidden layer CNN, in %3.2f s"%(total_time))
plt.show()

In [None]:
modelC.summary()
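
The shapes and parameter counts in the summary can be traced by hand. The sketch below assumes the architecture used in the cells above (3x3 convolutions with 32 filters and no padding, 2x2 pooling, `Dense(200)`, `Dense(10)`) and compares one convolution/pooling block with two; `trace_cnn` is a hypothetical helper.

```python
def trace_cnn(n_conv_blocks, size=28, channels=1, filters=32, kernel=3,
              pool=2, dense_units=200, n_classes=10):
    """Return (number of flattened features, total trainable parameters)."""
    total = 0
    for _ in range(n_conv_blocks):
        total += kernel * kernel * channels * filters + filters  # Conv2D weights + biases
        size = size - kernel + 1      # a convolution without padding trims the border
        size = size // pool           # pooling without padding drops a leftover row/column
        channels = filters
    flat = size * size * channels                     # Flatten
    total += flat * dense_units + dense_units         # Dense(200)
    total += dense_units * n_classes + n_classes      # Dense(10)
    return flat, total

flat_b, params_b = trace_cnn(1)   # one conv/pool block
flat_c, params_c = trace_cnn(2)   # two conv/pool blocks

print("1 block :", flat_b, "flattened features,", params_b, "parameters")
print("2 blocks:", flat_c, "flattened features,", params_c, "parameters")
```

The second convolution/pooling block shrinks the image before it reaches the dense layer, so the deeper model ends up with far fewer parameters overall even though it has an extra layer.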

# Visualizing the Convolutions and Pooling

This code will show us the convolutions graphically: it plots the 32 learned 3x3 filters of the first convolutional layer of `modelB`. Printing the first 100 labels in the test set with `print(y_test[:100])` shows that the ones at index 0, index 23 and index 28 all have the same value (9). They're all shoes. If you run the convolution on each of them, you'll begin to see common features emerge. Now, when the dense layers are training on that data, they're working with a lot less, and perhaps finding a commonality between shoes based on this convolution/pooling combination.

In [None]:
W1 = modelB.layers[0].get_weights()[0]
nW = W1.shape[3]
plt.figure(figsize=(10, 10), frameon=False)
for i in range(nW):
    plt.subplot(4, 8, i + 1)
    im = W1[:,:,:,i].reshape((3,3))
    plt.axis("off")
    plt.imshow(im, cmap='gray',interpolation='nearest')
plt.show()
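
To go from the filters to what they produce, each filter can be applied to an image to obtain one feature map per filter. The sketch below is NumPy only and uses random stand-ins: `weights` has the same shape as `W1` above, `(3, 3, 1, 32)`, but random values, and `image` is random noise rather than a Fashion MNIST sample. In the notebook, the trained kernel and bias would come from `modelB.layers[0].get_weights()`. The helper `feature_maps` is hypothetical.

```python
import numpy as np

def feature_maps(image, weights, biases):
    """Apply a (kh, kw, 1, n) filter bank to one (H, W) image without padding, then ReLU."""
    kh, kw, _, n = weights.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    maps = np.zeros((out_h, out_w, n))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r:r + kh, c:c + kw]
            # Sum of elementwise products between the patch and every filter at once.
            maps[r, c, :] = np.tensordot(patch, weights[:, :, 0, :], axes=([0, 1], [0, 1]))
    return np.maximum(maps + biases, 0.0)   # ReLU

rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 3, 1, 32))   # random stand-in for trained filters
biases = np.zeros(32)                      # stand-in biases
image = rng.random((28, 28))               # random stand-in for one test image

maps = feature_maps(image, weights, biases)
print(maps.shape)
```

Each slice `maps[:, :, i]` could be displayed with `plt.imshow` in the same 4x8 grid used above for the filters.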