# Artificial Intelligence
## School of Mechanical Engineering-Tehran University

## 2022




# Dear students

## G'day

## This is your work for next week. This notebook provides all the necessary explanations and code for building Convolutional Neural Networks in TensorFlow using a modified version of the LeNet-5 architecture and the AlexNet architecture. The MNIST dataset is used for both CNN models. I kindly ask you to study this notebook carefully before the tutorial session. This is a practical course, and I want you to build and deploy machine learning and deep learning models during the semester. I suggest using Google Colab and running the models on a GPU. This course will be valuable for you. I am here to assist you. You will make it.

## If further elucidation is warranted, please drop me a line.
## Cheers,
## Affiliated Research Professor Mohammad Khoshnevisan




# Convolutional Neural Network in TensorFlow using the modified version of LeNet-5 with MNIST dataset

In [1]:
from IPython.display import HTML
from base64 import b64encode

In [2]:
html1 = '<img src="SE6.jpg" width="1000" height="1000" align="center"/>'
HTML(html1)

# Textbook: Deep Learning with TensorFlow, Keras, and PyTorch, Dr Jon Krohn-February 2020

# The convolution operation-Animation
## Source: https://learning.oreilly.com/videos/deep-learning-with/9780136617617/9780136617617-DLTK_01_04_01/

In [3]:
from IPython.display import HTML
from base64 import b64encode

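def play(filename):
    """Embed a local MP4 file in the notebook as a base64-encoded HTML5 video."""
    with open(filename, 'rb') as f:  # the context manager closes the file after reading
        video = f.read()
    src = 'data:video/mp4;base64,' + b64encode(video).decode()
    html = '<video width=500 controls autoplay loop><source src="%s" type="video/mp4"></video>' % src
    return HTML(html)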

play('Convolution Operation.mp4')

# Preliminary

## If the learning rate is too large, the gradient descent algorithm will act erratically, as it jumps right over the parameters associated with minimal cost. If we have a very large quantity of training data, ordinary gradient descent would not work at all, because it would not be possible to fit all the data into the memory (RAM) of our machine. Memory is not the only potential snag; compute power could cause us headaches too. A relatively large dataset might squeeze into memory, but if we tried to train a neural network containing millions of parameters on all of those data at once, plain vanilla gradient descent would be highly inefficient because of the computational complexity of the associated high-volume, high-dimensional calculations. The solution to these memory and compute limitations is the stochastic variant of gradient descent. With this variation we split our training data into mini-batches, small subsets of the full training dataset, to render gradient descent both manageable and productive.

In [4]:
html1 = '<img src="Z2.jpg" width="1000" height="1000" align="center"/>'
HTML(html1)

In [5]:
html1 = '<img src="Z1.jpg" width="1000" height="1000" align="center"/>'
HTML(html1)

In [6]:
html1 = '<img src="Z3.jpg" width="1000" height="1000" align="center"/>'
HTML(html1)
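
## The mini-batch idea described in this section can be sketched in plain Python. This is only an illustration, not what Keras does internally: the helper name `iterate_minibatches` and the seed are hypothetical, while the dataset size (60000) and batch size (128) are the values used later in this notebook.

```python
import math
import random

def iterate_minibatches(n_examples, batch_size, seed=0):
    """Yield lists of example indices, one list per mini-batch."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # a new random order for this epoch
    for start in range(0, n_examples, batch_size):
        yield indices[start:start + batch_size]

n_examples, batch_size = 60000, 128  # MNIST training set size and the batch size used later
batches = list(iterate_minibatches(n_examples, batch_size))
steps_per_epoch = math.ceil(n_examples / batch_size)

print(len(batches), steps_per_epoch)      # number of weight updates in one epoch
print(len(batches[0]), len(batches[-1]))  # a full batch and the smaller final batch
```

## Each mini-batch is small enough to fit in memory, and the weights are updated once per mini-batch rather than once per pass over the whole dataset.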

# Dropout
## Deep learning practitioners tend to use a specific regularization technique called dropout, which was developed by Geoff Hinton and his colleagues at the University of Toronto and made famous by its incorporation in their benchmark-smashing AlexNet architecture. Their intuitive yet powerful concept for preventing overfitting is captured in the accompanying figure. In a nutshell, dropout simply pretends that a randomly selected proportion of the neurons in each layer do not exist during each round of training. To illustrate this, three rounds of training are shown in the figure. For each round, we remove a specified proportion of neurons from each layer by random selection. For the first hidden layer of the network, we have configured it to drop out 33%, or one third, of the neurons. For the second hidden layer, we have configured it to drop out 50% of the neurons. Let's cover the rounds of training one by one. In the leftmost panel, the second neuron of the first hidden layer and the first neuron of the second hidden layer are randomly dropped out. In the middle panel, it is the first neuron of the first hidden layer and the second neuron of the second hidden layer that are selected for dropout. There is no memory of which neurons were dropped out in previous training rounds, so each round draws a fresh random selection. Dropout is an effective regularization technique because it prevents any single neuron from becoming excessively influential in the network.

In [7]:
html1 = '<img src="Z4.jpg" width="1000" height="1000" align="center"/>'
HTML(html1)
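
## A minimal sketch of dropout on a single layer, in plain Python. It assumes the "inverted dropout" variant, in which each activation is dropped independently with probability `rate` and the survivors are scaled by 1 / (1 - rate), which is how I understand the Keras Dropout layer to behave during training. The seed and the six-neuron layer are made up for illustration.

```python
import random

def dropout(activations, rate, rng):
    """Inverted dropout: zero each activation with probability `rate`
    and scale the survivors by 1 / (1 - rate)."""
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(42)           # hypothetical seed, for reproducibility only
layer_output = [1.0] * 6          # pretend activations of a six-neuron layer

for round_number in range(1, 4):  # three training rounds, each with a fresh random mask
    print(round_number, dropout(layer_output, rate=0.5, rng=rng))
```

## Because every neuron is dropped independently, the number of zeros varies from round to round; only on average does it match the configured rate. At prediction time no neurons are dropped.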

# The convolution operation

In [8]:
html1 = '<img src="Z5.jpg" width="1000" height="1000" align="center"/>'
HTML(html1)

# LeNet-5 Architecture
## Source: https://learning.oreilly.com/library/view/hands-on-java-deep/9781789613964/651a3da9-2ad7-48f8-b062-eaed8ae0fd4d.xhtml

In [9]:
html1 = '<img src="LeNet 5.png" width="1000" height="1000" align="center"/>'
HTML(html1)

# 1. Background information
## In this notebook, we will use a modified version of the machine vision architecture called LeNet-5. We will use TensorFlow to construct an MNIST-classifying network inspired by this landmark architecture. However, we will afford Yann LeCun and his colleagues' 1998 model some modern twists. Because computation is much cheaper today than it was in 1998, we will opt to use more kernels in the two convolutional layers of the architecture. More specifically, we will include 32 filters in the first convolutional layer and 64 filters in the second, whereas the original LeNet-5 had only 6 and 16. Also thanks to cheap computing, we will subsample activations only once with a max-pooling layer, whereas LeNet-5 did it twice. Remember that a max-pooling layer reduces computational complexity, so we do not need it quite as much as they did. On top of that, we will leverage innovations such as ReLU activations and dropout, which had not yet been invented at the time of LeNet-5.

# 2. Using a two-dimensional convolutional filter

## We are going to use a few new layers. Specifically, we will use the convolution operation through the Keras Conv2D layer, so called because it applies a two-dimensional filter that slides over the image. Here we have a two-dimensional image and a convolutional filter with a size of three by three pixels. We are going to use a stride of one and, in this example, no padding. There are also three-dimensional convolutional filters, but they are not meant for the three color channels. Our images are black and white, so we have only one color channel. You might think a three-dimensional convolutional filter is needed for color images (red, green, and blue), but you still use a two-dimensional filter there, because the filter still convolves over just the two spatial dimensions, even though it does so across the three color slices at once. A three-dimensional convolutional filter, by contrast, moves and convolves in three directions. Thus, it is useful for three-dimensional medical imaging scans or for processing video data.
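
## To make the sliding-filter idea concrete, here is a small sketch of a two-dimensional convolution with a three-by-three filter, a stride of one, and no padding. It is plain Python for illustration only, it assumes a square image and a square kernel, and the toy image and kernel values are invented. Like deep learning libraries, it does not flip the kernel.

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` with no padding and return the output map."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    output = []
    for i in range(0, out_size * stride, stride):
        row = []
        for j in range(0, out_size * stride, stride):
            # Weighted sum of the k-by-k patch whose top-left corner is (i, j).
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(k) for b in range(k)))
        output.append(row)
    return output

# Toy 5x5 monochrome image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# 3x3 kernel that responds to left-to-right increases in brightness.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

feature_map = conv2d_valid(image, kernel)
print(feature_map)        # [[3, 3, 0], [3, 3, 0], [3, 3, 0]]
print((28 - 3) // 1 + 1)  # output width for a 28-pixel-wide MNIST image: 26
```

## The output is large where the patch contains the edge and zero where the patch is uniform, and the 5x5 input shrinks to 3x3 because no padding is used.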

# 3. Using the conv2D filter from Keras. 
## We are also going to use a max-pooling layer, and for the same reasons it will be two-dimensional: MaxPooling2D slides over the activation map and reduces its spatial dimensions. We are also going to use a Flatten layer. This layer takes two-, three-, or higher-dimensional activation maps and flattens them down to a one-dimensional array. We need to do that to pass them into dense layers. Remember, we start with convolutional layers and then switch to dense layers for the final few layers of the architecture. Because the convolutional and max-pooling layers have multi-dimensional outputs, we first need to flatten those outputs into a single one-dimensional vector per image before passing them into a dense layer, which here expects one-dimensional inputs.
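
## The two operations described here can be sketched on a tiny activation map. This is plain Python for illustration, with hypothetical helper names `max_pool2d` and `flatten` and made-up values; the Keras layers do the same thing on whole batches of multi-channel maps.

```python
def max_pool2d(feature_map, pool=2):
    """Non-overlapping max-pooling: the stride equals the pool size."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[i + a][j + b] for a in range(pool) for b in range(pool))
             for j in range(0, cols - pool + 1, pool)]
            for i in range(0, rows - pool + 1, pool)]

def flatten(feature_map):
    """Concatenate the rows into a single one-dimensional list."""
    return [value for row in feature_map for value in row]

activation_map = [[1, 3, 2, 0],
                  [4, 2, 1, 1],
                  [0, 1, 5, 6],
                  [2, 2, 7, 8]]

pooled = max_pool2d(activation_map)
print(pooled)           # [[4, 2], [2, 8]]
print(flatten(pooled))  # [4, 2, 2, 8]
```

## Pooling keeps only the largest value in each two-by-two block, so the 4x4 map becomes 2x2, and flattening then lays those four values out in one row.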

# 4. Load dependencies

## Note: Before, we flattened our input images from 28 by 28 pixels to a 784-element vector. A convolutional layer needs the two-dimensional layout, so we will change the shape from 784 back to 28 by 28. We will also specify a fourth dimension, the number of color channels. One indicates a black and white (monochromatic) image. If you had full-color images with red, green, and blue channels, you would switch this to three. We do this for both the training data and the validation data. We are still going to divide by 255 to scale our pixel values down to the range of zero to one, and we are going to preprocess the label data in the same way as before, into the one-hot categorical format.

In [1]:
# importing dependencies
import tensorflow
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.layers import Flatten, Conv2D, MaxPooling2D 

# 5. Load data

In [3]:
(X_train, y_train), (X_valid, y_valid) = mnist.load_data()  # loading the mnist dataset

y_valid

array([7, 2, 1, ..., 4, 5, 6], dtype=uint8)

# 6. Preprocess data

In [12]:
# reshape each image to (28, 28, 1): height, width, and one color channel
X_train = X_train.reshape(60000, 28, 28, 1).astype('float32')
X_valid = X_valid.reshape(10000, 28, 28, 1).astype('float32')

In [13]:
# normalizing the data
X_train /= 255
X_valid /= 255

In [14]:
# number of classes; one-hot encode the labels
n_classes = 10
y_train = to_categorical(y_train, n_classes)
y_valid = to_categorical(y_valid, n_classes)
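
## What the label and pixel preprocessing does can be shown on a few values. This is a plain-Python sketch with a hypothetical `one_hot` helper, not the `to_categorical` implementation; the labels 7, 2, 1 are the first three validation labels printed earlier.

```python
def one_hot(label, n_classes=10):
    """Return a list with 1.0 at position `label` and 0.0 elsewhere."""
    return [1.0 if index == label else 0.0 for index in range(n_classes)]

labels = [7, 2, 1]  # the first three validation labels shown earlier
for label in labels:
    print(label, one_hot(label))

# Pixel scaling: raw intensities 0..255 become floats in the range 0..1.
raw_pixels = [0, 128, 255]
print([pixel / 255 for pixel in raw_pixels])
```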

# 7. Design our convolutional neural network architecture

## We will now use a convolutional layer as our first hidden layer, with the ReLU activation function. We need to specify our input shape here, which is 28 by 28 by 1. The second hidden layer is a convolutional layer again. The first convolutional layer will be able to identify 32 unique patterns of simple straight-line orientations. In the second convolutional layer, we will compute 64 nonlinear recombinations of those 32 input features, so we will be able to handle curves, corners, and modestly complex spatial representations. We will use a kernel size of three by three in both the first and the second hidden layer. Kernel and filter are synonyms, so this is the three-by-three filter that slides over the image. Next we specify the pool size (two by two). When you specify a max-pooling layer with a pool size of two by two, it will by default have a stride of two by two as well. To help our model generalize from the training data to the validation data, we are going to apply dropout after the second hidden layer. Typically, dropout is NOT necessary for the first layer because it learns simple, low-level features. After applying that dropout, we will use the Flatten layer, which turns the multi-dimensional output of the preceding layers into a one-dimensional array that we can pass into a dense layer. We use 128 neurons in the third hidden layer; 256 might perform better, and 64 might perform the same. You can certainly play around with this hyperparameter and see whether it makes a difference. Here, too, we use ReLU activations. We will apply more dropout (0.5) than we did after the second hidden layer (0.25). Applying more dropout in later layers is reasonably common.
Indeed, deeper layers could be memorizing complex features from the training data that are not relevant to data the model has not seen before. Finally, for our output layer, we have 10 softmax outputs, one per digit class.
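
## The effect of each layer on the shape of the data can be worked out by hand. The sketch below is plain arithmetic, assuming no padding and the kernel, pool, and filter sizes described in this section; the helper name `conv_output` is hypothetical.

```python
def conv_output(size, kernel, stride=1):
    """Output width/height of an unpadded ('valid') convolution or pooling."""
    return (size - kernel) // stride + 1

size = 28                                     # MNIST images are 28x28
size = conv_output(size, kernel=3)            # first Conv2D, 3x3 kernel, stride 1
after_conv1 = (size, size, 32)
size = conv_output(size, kernel=3)            # second Conv2D, 3x3 kernel, stride 1
after_conv2 = (size, size, 64)
size = conv_output(size, kernel=2, stride=2)  # MaxPooling2D, 2x2 pool, stride 2
after_pool = (size, size, 64)
flattened = size * size * 64                  # length of the vector fed to the dense layer

print(after_conv1, after_conv2, after_pool, flattened)
```

## Dropout does not change the shape, so the dense layer receives a vector of 9216 values per image.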

# Question: I would like you to study dropout and understand why, in most cases, it is essential to use it in our CNN model.


In [15]:
model = Sequential() # initializing the model

model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28, 28, 1))) # first conv layer: 32 filters, ReLU

model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))                          # second conv layer: 64 filters, ReLU
model.add(MaxPooling2D(pool_size=(2, 2)))                                             # 2x2 max-pooling (stride defaults to 2x2)
model.add(Dropout(0.25))                                                              # drop 25% of activations during training
model.add(Flatten())                                                                  # flatten to a 1D vector for the dense layers

model.add(Dense(128, activation='relu'))                                              # dense hidden layer: 128 neurons, ReLU
model.add(Dropout(0.5))                                                               # drop 50% of activations during training

model.add(Dense(n_classes, activation='softmax'))                                     # output layer: one softmax probability per class

In [16]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 26, 26, 32)        320       
                                                                 
 conv2d_1 (Conv2D)           (None, 24, 24, 64)        18496     
                                                                 
 max_pooling2d (MaxPooling2D  (None, 12, 12, 64)       0         
 )                                                               
                                                                 
 dropout (Dropout)           (None, 12, 12, 64)        0         
                                                                 
 flatten (Flatten)           (None, 9216)              0         
                                                                 
 dense (Dense)               (None, 128)               1179776   
                                                        

# 8. Model summary
## Let's have a quick look at the model summary. A couple of things are worth mentioning about these Conv2D layers. We configured the kernel size, but we did not go into specifics on the stride length or the padding. The default for a Conv2D layer in the Keras TensorFlow API is a stride of one, and the default padding is 'valid', which means no padding at all. That is why each three-by-three convolution trims two pixels from the width and the height (28 to 26 to 24). Cumulatively, we have gone from a few tens of thousands of parameters in our earlier network to about 1.2 million parameters by adding these convolutional layers. Interestingly, the vast majority of them, 98.3 percent, are associated with the dense hidden layer. So although the convolutional layers involve a relatively intensive mathematical operation, they do not have all that many parameters. They are efficient in terms of weights.
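
## The parameter counts in the summary, and the 98.3 percent figure, can be checked by hand. The sketch below is plain arithmetic with hypothetical helper names. The summary printed above is cut off before the output layer, so the `dense_1` count is computed here from the layer sizes (128 inputs, 10 outputs) rather than read from the printout.

```python
def conv2d_params(kernel, in_channels, filters):
    """Each filter has kernel*kernel*in_channels weights plus one bias."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(inputs, units):
    """Each unit has one weight per input plus one bias."""
    return (inputs + 1) * units

params = {
    "conv2d": conv2d_params(3, 1, 32),
    "conv2d_1": conv2d_params(3, 32, 64),
    "dense": dense_params(12 * 12 * 64, 128),
    "dense_1": dense_params(128, 10),  # computed, not shown in the truncated summary
}
total = sum(params.values())
dense_share = round(100 * params["dense"] / total, 1)

print(params)
print(total, dense_share)  # total parameters and the dense hidden layer's share in percent
```

## The two convolutional layers together hold fewer than 19,000 weights because the same small filter is reused at every image position, whereas the dense layer needs one weight for every pair of input and neuron.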

# 9. Model compile 
## To compile the model, we call the compile method. We will use categorical cross-entropy loss. We can use Adam, Nadam, SGD, etc., as our optimizer; the cell below uses Nadam. In the literature, Adam is often reported to outperform many of the other optimizers. Note that if you want easy access to a GPU for this type of work, you can use Google Colab. GPUs are particularly efficient when we are using convolutional layers.
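
## As a rough illustration of what the cross-entropy loss measures, here is a plain-Python sketch of softmax followed by categorical cross-entropy for a single example. The three-class setup, the scores, and the small `eps` constant are made up for brevity; Keras computes the same quantity averaged over a batch.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to one."""
    shifted = [z - max(logits) for z in logits]  # subtract the max for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))

y_true = [0.0, 1.0, 0.0]              # one-hot label for class 1
confident = softmax([0.0, 5.0, 0.0])  # most probability on the correct class
unsure = softmax([1.0, 1.0, 1.0])     # uniform guess over the three classes

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident)
print(loss_unsure)  # close to ln(3) for a uniform guess over three classes
```

## The loss is small when the network assigns high probability to the correct class and grows as that probability shrinks, which is the signal the optimizer uses to adjust the weights.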

In [17]:
model.compile(loss='categorical_crossentropy', optimizer='nadam', metrics=['accuracy'])

# 10. Train our model
## Note 1: You can change the number of epochs.
## Note 2: I suggest using Google Colab and running the models on a GPU.

In [18]:
model.fit(X_train, y_train, batch_size=128, epochs=10, verbose=1, validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1c490927940>

# AlexNet Architecture
## Source: https://neurohive.io/en/popular-networks/alexnet-imagenet-classification-with-deep-convolutional-neural-networks/

In [19]:
html1 = '<img src="AlexNet-1.png" width="1000" height="1000" align="center"/>'
HTML(html1)

# AlexNet in TensorFlow

# 1. Load dependencies

In [20]:
import tensorflow
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras.layers import BatchNormalization

# 2. Load data

In [21]:
(X_train, y_train), (X_valid, y_valid) = mnist.load_data()

# 3. Preprocess data

In [22]:
X_train = X_train.reshape(60000, 28, 28, 1).astype('float32')
X_valid = X_valid.reshape(10000, 28, 28, 1).astype('float32')

In [23]:
X_train /= 255
X_valid /= 255

In [24]:
n_classes = 10
y_train = to_categorical(y_train, n_classes)
y_valid = to_categorical(y_valid, n_classes)

# 4. Design our convolutional neural network architecture

## Question: Why should we use batch normalization? You should study it. 
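
## As a starting point for this question, here is a plain-Python sketch of what batch normalization does to one feature during training: it standardizes the values seen in the current mini-batch and then applies a learnable scale (`gamma`) and shift (`beta`). The batch values are invented and the `eps` constant is an assumed small number to avoid division by zero. At prediction time, Keras uses moving averages of the mean and variance instead of the batch statistics.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one feature across a mini-batch, then scale and shift."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(variance + eps) + beta for v in values]

batch = [2.0, 4.0, 6.0, 8.0]  # one activation observed for four examples
normalized = batch_norm(batch)
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)

print(normalized)
print(norm_mean, norm_var)  # mean near zero, variance near one
```

## Keeping each layer's inputs on a similar scale tends to make training more stable. Each channel carries four values (gamma, beta, moving mean, moving variance), which is why a BatchNormalization layer after 96 filters reports 4 x 96 = 384 parameters.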

In [25]:
model = Sequential()

# first conv block: 96 filters of size 11x11, overlapping 3x3 max-pooling, batch normalization
model.add(Conv2D(96, kernel_size=(11, 11), strides=(1, 1), activation='relu', input_shape=(28, 28, 1)))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(1, 1)))
model.add(BatchNormalization())

# second conv block: 256 filters of size 5x5
model.add(Conv2D(256, kernel_size=(5, 5), activation='relu'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(1, 1)))
model.add(BatchNormalization())

# third conv block: three stacked 3x3 conv layers before pooling
model.add(Conv2D(256, kernel_size=(3, 3), activation='relu'))
model.add(Conv2D(384, kernel_size=(3, 3), activation='relu'))
model.add(Conv2D(384, kernel_size=(3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(1, 1)))
model.add(BatchNormalization())

# dense block: two 4096-neuron tanh layers, each followed by 50% dropout
model.add(Flatten())
model.add(Dense(4096, activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(4096, activation='tanh'))
model.add(Dropout(0.5))

# output layer: one softmax probability per digit class
model.add(Dense(10, activation='softmax'))

In [26]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_2 (Conv2D)           (None, 18, 18, 96)        11712     
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 16, 16, 96)       0         
 2D)                                                             
                                                                 
 batch_normalization (BatchN  (None, 16, 16, 96)       384       
 ormalization)                                                   
                                                                 
 conv2d_3 (Conv2D)           (None, 12, 12, 256)       614656    
                                                                 
 max_pooling2d_2 (MaxPooling  (None, 10, 10, 256)      0         
 2D)                                                             
                                                      

# 5. Model summary
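
## The summary printed above is cut off after the second pooling layer, so the remaining shapes can be worked out by hand instead. The sketch below is plain arithmetic under the settings in the model cell (no padding, stride 1 everywhere); the helper name `valid_output` is hypothetical. BatchNormalization layers are left out because they do not change the shape.

```python
def valid_output(size, kernel, stride=1):
    """Output width/height of an unpadded convolution or pooling layer."""
    return (size - kernel) // stride + 1

# (layer name, kernel size, filters or None for pooling), all with stride 1
layers = [
    ("conv 11x11", 11, 96), ("pool 3x3", 3, None),
    ("conv 5x5", 5, 256), ("pool 3x3", 3, None),
    ("conv 3x3", 3, 256), ("conv 3x3", 3, 384), ("conv 3x3", 3, 384),
    ("pool 3x3", 3, None),
]

size, channels = 28, 1
shapes = []
conv_params = []
for name, kernel, filters in layers:
    if filters is not None:
        # weights per filter (kernel*kernel*input channels) plus one bias
        conv_params.append((kernel * kernel * channels + 1) * filters)
        channels = filters
    size = valid_output(size, kernel)
    shapes.append((size, size, channels))
    print(name, shapes[-1])

flattened = size * size * channels
print(flattened, conv_params[:2])  # flattened length and the first two conv parameter counts
```

## The first four shapes and the first two parameter counts agree with the visible part of the summary, and the final activation map is only 2x2x384 because the 28x28 MNIST images are much smaller than the images AlexNet was designed for.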

# 6. Model compile

In [27]:
model.compile(loss='categorical_crossentropy', optimizer='nadam', metrics=['accuracy'])

# 7. Train our model
## Note 1: You can change the number of epochs.
## Note 2: I suggest using Google Colab and running the models on a GPU.

In [None]:
model.fit(X_train, y_train, batch_size=128, epochs=10, verbose=1, validation_data=(X_valid, y_valid))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10