# CNN Architecture

## TOPIC: Understanding Pooling and Padding in CNN

### 1. Describe the purpose and benifits of pooling in CNN

Purpose: 
 - The primary purpose of pooling is to help manage the computational complexity of CNNs, increase translation invariance, and improve the network's ability to learn hierarchical features.

Benifits:
 - Efficient dimensionality reduction.
 - Translation invariance for enhanced robustness.
 - Feature generalization for better generalization to new data.
 - Extraction of dominant features for improved noise resistance.

***

### 2. Explain the difference between "Min pooling" and "Max pooling".


Max Pooling:
 - In max pooling, a local region (usually a small grid) of the input feature map is divided into cells. The operation selects the maximum value within each cell and uses that value as the representative feature for that region. The key characteristic of max pooling is that it emphasizes the most dominant or significant feature present in that local region. This helps the network capture the most important information and enhance its sensitivity to distinctive features. Max pooling is particularly effective at retaining high-contrast features and edges.

Min Pooling:
 - In min pooling, the operation is similar to max pooling, but it selects the minimum value within each local region's cell as the representative feature. Min pooling is less commonly used than max pooling and might not be as effective in retaining important features, especially in cases where the maximum values are more indicative of feature presence. Min pooling tends to emphasize low-intensity features, which might not be as useful for tasks like object recognition or edge detection.

***

### 3. Discuss the concept of padding in CNN.


Padding is a technique used in Convolutional Neural Networks (CNNs) to control the spatial dimensions of feature maps as they are processed through convolutional layers. It involves adding extra pixels or values around the edges of an input image or feature map before applying convolutional operations. The main purpose of padding is to preserve spatial information, mitigate the reduction in feature map dimensions, and ensure that the output dimensions match the desired dimensions.

***

### 4. Compare and contrast zero-padding and valid-padding in terms of their effects on the  output feature map size.



Zero-padding:
 - Effect on Output Size: Zero-padding increases the size of the output feature map compared to the input size.
 - Purpose: The primary purpose of zero-padding is to maintain the spatial dimensions of the input and output feature maps. It helps to preserve information at the edges of the input and prevents the reduction of feature map dimensions caused by convolutions.
 - Use Cases: Zero-padding is often used when you want to maintain spatial information, ensure that the output dimensions match the input dimensions, and when you need to process features near the edges of the input.

Valid-padding:
 - Effect on Output Size: Valid-padding reduces the size of the output feature map compared to the input size.
 - Purpose: The purpose of valid-padding is to perform dimensionality reduction by allowing the convolutional operation to be applied only within the central regions of the input. It helps to extract key features while reducing computational complexity and output dimensions.
 - Use Cases: Valid-padding is useful when you want to extract features from an input while reducing its spatial dimensions, which is often desirable in tasks like feature extraction and dimensionality reduction.

Comparison:

- Output Size: Zero-padding increases the output size, while valid-padding reduces it.
- Spatial Information: Zero-padding preserves spatial information at the edges, while valid-padding can result in the loss of information at the edges.
- Feature Extraction: Zero-padding is suitable for maintaining edge features and preserving spatial relationships between pixels. Valid-padding is effective for reducing dimensions and focusing on central features.
- Receptive Field: Zero-padding ensures that more context is considered by each neuron during convolution due to the larger input size. Valid-padding results in a smaller receptive field for each neuron.
- Computation: Zero-padding leads to increased computational complexity compared to valid-padding due to the larger input size.
- Applications: Zero-padding is often used for tasks like object recognition, image segmentation, and preserving spatial details. Valid-padding is used for tasks like downsampling, dimensionality reduction, and feature extraction.

***

***


## TOPIC: Exploring LeNet


### 1. Provide a brief overview of LetNet-5 Architecture.

-  LeNet-5 was designed primarily for handwritten digit recognition, particularly for recognizing characters in postal addresses.

- The LeNet-5 architecture played a crucial role in demonstrating the effectiveness of CNNs for image recognition tasks. It showcased the benefits of convolutional layers for feature extraction and the use of pooling layers for downsampling and dimensionality reduction. 

- Although LeNet-5 was initially developed for handwritten digit recognition, its principles and structure have influenced the design of subsequent CNN architectures used in a wide range of computer vision applications, such as object recognition, image classification, and more.


***


### 2. Describe the key concept of LetNet-5 and their respective purpose.

1. Convolutional Layers:
 - Purpose: Convolutional layers are responsible for detecting local features in the input image. They use small filters (also known as kernels) to slide over the input and perform convolution operations, capturing patterns like edges, corners, and textures.
 - Impact: These layers enable the network to learn hierarchical features by detecting basic edges and textures in the earlier layers and more complex combinations of features in deeper layers.

2. Pooling (Subsampling) Layers:
 - Purpose: Pooling layers help downsample the feature maps produced by the convolutional layers. This reduces the spatial dimensions while retaining the most salient information. In LeNet-5, max pooling is used, where the maximum value within a small region is selected as the representative value for that region.
 - Impact: Pooling helps create translation invariance and reduces the sensitivity of the network to small changes in the input's position. It also reduces the computational load and the number of parameters in subsequent layers.

3. Activation Functions:
 - Purpose: Activation functions introduce non-linearity to the network, enabling it to learn complex mappings from inputs to outputs. In LeNet-5, the sigmoid function is used as the activation function.
 - Impact: Activation functions allow the network to model more complex relationships between features and contribute to the network's ability to capture nonlinear patterns in data.

4. Fully Connected Layers:
 - Purpose: Fully connected layers receive the features extracted by the convolutional and pooling layers and combine them to make final predictions. These layers are similar to the dense layers in traditional neural networks.
 - Impact: Fully connected layers enable the network to learn class-specific features by considering global context from the features extracted in earlier layers.

5. Softmax Activation Function (Output Layer):
 - Purpose: The softmax activation function is applied in the output layer to convert the network's raw scores (logits) into class probabilities. It produces a normalized probability distribution over the possible classes.
 - Impact: The softmax function provides a way to interpret the network's output probabilistically, aiding in making class predictions.

6. Gradient-Based Optimization:
 - Purpose: LeNet-5 introduced the concept of using gradient-based optimization algorithms (backpropagation) to train the network's weights. This enables the network to iteratively adjust its parameters to minimize the prediction error.
 - Impact: Gradient-based optimization is a fundamental concept in training deep neural networks. It allows the network to learn meaningful features and make accurate predictions.

***


### 3. Discuss the advantages and limitations of LetNet-5 in the context of image classification tasks.

Advantages:
- Hierarchical Feature Learning: LeNet-5 demonstrated the power of learning hierarchical features through its convolutional layers. The architecture's ability to detect basic features in earlier layers and more complex patterns in deeper layers is crucial for image classification tasks.
- Translation Invariance: The use of max pooling in LeNet-5 introduces translation invariance, making the network more robust to slight variations in object positions within the input images. This is beneficial for recognizing objects regardless of their positions.
- Reduced Parameter Count: LeNet-5's architecture, particularly the pooling layers, leads to a reduction in the number of parameters compared to fully connected architectures. This helps prevent overfitting, especially when working with limited training data.
- Localized Feature Detection: Convolutional layers in LeNet-5 are designed to detect localized features in the input image. This makes the network capable of identifying specific patterns in different areas of the image, which is crucial for accurate image classification.
- Efficient Learning: The small filter sizes used in the convolutional layers of LeNet-5 allow the network to learn local features efficiently, reducing the need for large receptive fields and heavy computation.


Disadvantages:
- Limited Depth: LeNet-5 has a relatively shallow architecture compared to modern CNNs. Deeper architectures have shown the potential to learn more complex and abstract features, leading to better performance in challenging image classification tasks.
- Small Input Size: LeNet-5 was designed to handle 32x32 grayscale images. While it was suitable for its time, modern datasets and applications often require processing larger and more high-resolution images.
- Lack of Advanced Activation Functions: LeNet-5 uses the sigmoid activation function, which suffers from the vanishing gradient problem. More advanced activation functions like ReLU (Rectified Linear Unit) have been found to improve training stability and convergence.
- No Batch Normalization or Regularization: LeNet-5 does not incorporate modern regularization techniques like dropout or batch normalization, which are effective for preventing overfitting and improving generalization. 
- Domain-Specific Design: LeNet-5 was primarily designed for handwritten digit recognition and may not perform optimally on more complex and diverse image classification tasks found in modern datasets like ImageNet.
- Inadequate Capacity for Complex Tasks: While effective for digit recognition, LeNet-5's architecture may lack the capacity to handle the intricacies and variations present in more challenging image classification tasks.

***


### 4. Implement LetNet-5 using a deep learning framework of your choice (eg., Tensoreflow, PyTorch) and train on public available dataset (e.g, MNIST). Evaluate its' performance and provide insights.


In [22]:
import tensorflow as tf
from tensorflow import keras
from keras.layers import Conv2D, MaxPooling2D,AveragePooling2D
from keras.layers import Dense, Flatten
from keras.models import Sequential


In [23]:
# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Normalize pixel values between 0 and 1
x_train = x_train / 255.0
y_train = y_train / 255.0


In [24]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(60000, 28, 28)
(10000, 28, 28)
(60000,)
(10000,)


In [29]:
## Re-shape x_train and x_test
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)

# Convert the target labels to one-hot encoded vectors
y_train = keras.utils.to_categorical(y_train, num_classes=10)
y_test = keras.utils.to_categorical(y_test, num_classes=10)

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(60000, 28, 28, 1)
(10000, 28, 28, 1)
(60000, 10)
(10000, 10)


In [30]:
# Building the Model Architecture

model = Sequential()

model.add(Conv2D(6, kernel_size = (5,5), padding = 'valid', activation='relu', input_shape = (28, 28, 1)))
model.add(AveragePooling2D(pool_size= (2,2), strides = 2, padding = 'valid'))

model.add(Conv2D(16, kernel_size = (5,5), padding = 'valid', activation='relu'))
model.add(AveragePooling2D(pool_size= (2,2), strides = 2, padding = 'valid'))

model.add(Flatten())

model.add(Dense(120, activation='relu'))
model.add(Dense(84, activation='relu'))
model.add(Dense(10, activation='softmax'))

model.summary()


Model: "sequential_7"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_10 (Conv2D)          (None, 24, 24, 6)         156       
                                                                 
 average_pooling2d_4 (Averag  (None, 12, 12, 6)        0         
 ePooling2D)                                                     
                                                                 
 conv2d_11 (Conv2D)          (None, 8, 8, 16)          2416      
                                                                 
 average_pooling2d_5 (Averag  (None, 4, 4, 16)         0         
 ePooling2D)                                                     
                                                                 
 flatten_2 (Flatten)         (None, 256)               0         
                                                                 
 dense_6 (Dense)             (None, 120)              

In [31]:
# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, batch_size=128, epochs=10, validation_split=0.1)

# Evaluate the model
loss, accuracy = model.evaluate(x_test, y_test)
print(f'Test loss: {loss:.4f}, Test accuracy: {accuracy:.4f}')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test loss: 8703.9707, Test accuracy: 0.0980


***

***


## TOPIC: Analyzing AlexNet


### 1. Present an overview of AlexNet Architecture.

AlexNet has 8 layers, with 5 convolutional layers and 3 fully connected layers. The convolutional layers use rectified linear units (ReLU) as their activation function, which helped to improve the training speed and performance of the network. The fully connected layers use a softmax activation function to output a probability distribution.
<br>

Here is a more detailed overview of the AlexNet architecture with ILSVRC DataSet:
- Input layer: The input layer is a 227x227x3 image.
- Convolutional layer 1: This layer has 96 filters of size 11x11 with stride 4. The output of this layer is a 55x55x96 feature map.
- Max pooling layer 1: This layer uses a 2x2 max pooling window with stride 2. The output of this layer is a 27x27x96 feature map.
- Convolutional layer 2: This layer has 256 filters of size 5x5 with stride 1. The output of this layer is a 27x27x256 feature map.
- Max pooling layer 2: This layer uses a 2x2 max pooling window with stride 2. The output of this layer is a 13x13x256 feature map.
- Convolutional layer 3: This layer has 384 filters of size 3x3 with stride 1. The output of this layer is a 13x13x384 feature map.
- Convolutional layer 4: This layer has 384 filters of size 3x3 with stride 1. The output of this layer is a 13x13x384 feature map.
- Convolutional layer 5: This layer has 256 filters of size 3x3 with stride 1. The output of this layer is a 13x13x256 feature map.
- Fully connected layer 1: This layer has 4096 neurons.
- Fully connected layer 2: This layer has 4096 neurons.
- Output layer: This layer has 1000 neurons, one for each class in the ILSVRC dataset.

***


### 2. Explain the architectural inovations and introduced in AlexNet that contributed to it's breakthrough performance.

- The use of rectified linear units (ReLU): ReLU is a non-linear activation function that helps to prevent the vanishing gradient problem, which can occur in deep neural networks. ReLU is much faster to compute than sigmoid or tanh activation functions, which also contributed to AlexNet's faster training speed.
- The use of overlapping max pooling: Overlapping max pooling is a technique that helps to reduce the number of parameters in a CNN without sacrificing accuracy. AlexNet used overlapping max pooling windows with a stride of 2, which allowed it to learn more features from the input image while still keeping the number of parameters manageable.
- The use of dropout: Dropout is a regularization technique that helps to prevent overfitting. AlexNet used dropout with a probability of 0.5, which meant that half of the neurons in each layer were randomly dropped out during training. This helped to prevent the network from learning too much from the training data and overfitting to it.
- The use of data augmentation: Data augmentation is a technique that artificially increases the size of the training dataset by creating new training examples from the existing ones. AlexNet used data augmentation by randomly flipping, rotating, and translating the images in the training dataset. This helped to improve the generalization performance of the network.

***

 
### 3. Discuss the role of convolutional layers, pooling layers and fully connected layers in AlexNet.

- Convolutional layers are used to extract features from the input image. They do this by applying a filter to the image, which slides across the image and produces a new feature map. The filter is a small matrix of weights, and it is used to detect specific features in the image. For example, a filter might be used to detect edges, corners, or textures.
- Pooling layers are used to reduce the size of the feature maps produced by the convolutional layers. This helps to reduce the number of parameters in the network, which makes it faster to train and easier to prevent overfitting. Pooling layers typically use a max pooling operation, which takes the maximum value from each region of the feature map.
- Fully connected layers are used to classify the input image. They do this by taking the output of the pooling layers and connecting it to a large number of neurons. Each neuron in the fully connected layer represents a different class, and the output of the neuron is the probability that the input image belongs to that class.

The convolutional layers, pooling layers, and fully connected layers work together to extract features from the input image and classify it into one of 1000 classes. The convolutional layers extract low-level features, such as edges and corners. The pooling layers reduce the size of the feature maps produced by the convolutional layers, while preserving the most important features. The fully connected layers then classify the input image by taking the output of the pooling layers and connecting it to a large number of neurons, each of which represents a different class.

***


### 4. Implement an AlexNet using DL framework of your choice and evaluate it's performance on a dataset of your choice.

In [64]:
import tensorflow as tf

def alexnet(input_shape=(32, 32, 3)):
    model1 = tf.keras.models.Sequential([
      tf.keras.layers.Conv2D(64, (3, 3), strides=(1, 1), padding='same', activation='relu', input_shape=input_shape),
      tf.keras.layers.MaxPooling2D((2, 2)),
      
      tf.keras.layers.Conv2D(192, (3, 3), padding='same', activation='relu'),
      tf.keras.layers.MaxPooling2D((2, 2)),
      
      tf.keras.layers.Conv2D(384, (3, 3), padding='same', activation='relu'),
      tf.keras.layers.Conv2D(256, (3, 3), padding='same', activation='relu'),
      tf.keras.layers.Conv2D(256, (3, 3), padding='same', activation='relu'),
      tf.keras.layers.MaxPooling2D((2, 2)),
      
      tf.keras.layers.Flatten(),
      tf.keras.layers.Dense(4096, activation='relu'),
      tf.keras.layers.Dropout(0.5),
      tf.keras.layers.Dense(4096, activation='relu'),
      tf.keras.layers.Dropout(0.5),
      tf.keras.layers.Dense(10, activation='softmax')
    ])

    return model1


In [69]:
model1 = alexnet()
model1.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model1.summary()

Model: "sequential_22"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d_76 (Conv2D)          (None, 32, 32, 64)        1792      
                                                                 
 max_pooling2d_42 (MaxPoolin  (None, 16, 16, 64)       0         
 g2D)                                                            
                                                                 
 conv2d_77 (Conv2D)          (None, 16, 16, 192)       110784    
                                                                 
 max_pooling2d_43 (MaxPoolin  (None, 8, 8, 192)        0         
 g2D)                                                            
                                                                 
 conv2d_78 (Conv2D)          (None, 8, 8, 384)         663936    
                                                                 
 conv2d_79 (Conv2D)          (None, 8, 8, 256)       

In [70]:
# Load the CIFAR-10 dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

# Convert the labels to one-hot encoded vectors
y_train = tf.keras.utils.to_categorical(y_train)
y_test = tf.keras.utils.to_categorical(y_test)

In [71]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(50000, 32, 32, 3)
(10000, 32, 32, 3)
(50000, 10)
(10000, 10)


In [75]:
# Train the model
model1.fit(x_train, y_train, epochs=3, batch_size=128, validation_split=0.1)

# Evaluate the model on the test set
model1.evaluate(x_test, y_test)

Train on 45000 samples, validate on 5000 samples
Epoch 1/3

  updates = self.state_updates


Epoch 2/3
Epoch 3/3


[1.0896483503341674, 0.613]