In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Assignment 9

#### 1. What are the advantages of a CNN for image classification over a fully connected DNN?

Convolutional Neural Networks (CNNs) have several advantages over fully connected Deep Neural Networks (DNNs) for image classification tasks:

1. **Spatial Hierarchical Structure:** CNNs are designed to exploit the spatial structure of images. They preserve the spatial relationships between pixels by using convolutional layers, which scan the input with small receptive fields and learn local patterns. This spatial hierarchical structure enables CNNs to capture spatial dependencies and effectively model the visual patterns present in images.

2. **Translation Invariance:** CNNs can recognize a pattern regardless of where it appears in the image. Because each convolutional layer applies the same shared weights at every spatial location, a pattern learned in one place is detected everywhere (strictly speaking, convolution is translation-equivariant: shifting the input shifts the feature map). Pooling and deeper layers then turn this into approximate invariance to small translations, which is crucial for image classification tasks where the position of objects or features may vary.

3. **Parameter Sharing:** CNNs leverage parameter sharing to reduce the number of trainable parameters. Instead of having separate weights for each pixel or neuron, CNNs use shared weights within each convolutional layer. This parameter sharing scheme significantly reduces the model's complexity, making it more efficient and scalable, especially for large-scale image datasets.

4. **Local Receptive Fields:** CNNs use small receptive fields to capture local patterns in images. These local receptive fields enable CNNs to learn low-level features such as edges, corners, and textures, which are building blocks for higher-level representations. By progressively combining these local features, CNNs can learn more complex and abstract features, leading to better discrimination and understanding of the image content.

5. **Pooling Layers:** CNNs incorporate pooling layers to downsample the spatial dimensions of the feature maps. Pooling operations, such as max pooling or average pooling, help to reduce the spatial resolution while retaining the most salient features. Pooling enhances the network's robustness to small spatial variations, reduces computational requirements, and introduces a degree of translation invariance.

6. **Less Sensitivity to Input Variations:** CNNs are less sensitive than fully connected networks to small variations in the input images, such as shifts in position and, to a lesser degree, changes in scale or orientation. Due to the hierarchical nature of CNNs, higher-level features capture more abstract information and depend less on exact pixel positions. Robustness to larger changes in scale or rotation is not built in and usually has to come from data augmentation.
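
To make the parameter-sharing point concrete, the sketch below compares the parameter count of one fully connected layer with that of one convolutional layer on the same input. The 100 x 100 RGB input, the 1,000-unit dense layer, and the 64-filter 3 x 3 convolution are illustrative assumptions, not figures from the assignment.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit in a fully connected layer."""
    return (n_inputs + 1) * n_units


def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus one bias per filter; independent of the image size."""
    return (kernel_h * kernel_w * in_channels + 1) * out_channels


# Hypothetical 100 x 100 RGB input
height, width, channels = 100, 100, 3
dense_count = dense_params(height * width * channels, 1000)
conv_count = conv_params(3, 3, channels, 64)

print(f"Dense layer, 1000 units:    {dense_count:,} parameters")
print(f"Conv layer, 64 3x3 filters: {conv_count:,} parameters")
```

Under these assumptions the dense layer needs about 30 million parameters while the convolutional layer needs fewer than two thousand, and the convolutional count would not change if the image were larger.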

#### 2. Consider a CNN with three convolutional layers, each of which has 3 x 3 kernels, a stride of two, and SAME padding. The bottom layer generates 100 feature maps, the middle layer 200, and the top layer 400. RGB images with a size of 200 x 300 pixels are used as input. How many parameters does the CNN have in total? How much RAM would this network need when making a single instance prediction if we're using 32-bit floats? What if you were to train on a batch of 50 images?

To count the parameters, note that the architecture described contains only convolutional layers. Each 3 x 3 kernel spans every input channel, and each output feature map has one bias term, so a layer has (3 * 3 * input channels + 1) * output feature maps parameters.

1. **Convolutional Layers:**
   - First Convolutional Layer:
     - Kernel size: 3 x 3
     - Number of input channels: 3 (RGB images)
     - Number of output feature maps: 100
     - Total parameters in the first layer: (3 * 3 * 3 + 1) * 100 = 2,800

   - Second Convolutional Layer:
     - Kernel size: 3 x 3
     - Number of input channels: 100
     - Number of output feature maps: 200
     - Total parameters in the second layer: (3 * 3 * 100 + 1) * 200 = 180,200

   - Third Convolutional Layer:
     - Kernel size: 3 x 3
     - Number of input channels: 200
     - Number of output feature maps: 400
     - Total parameters in the third layer: (3 * 3 * 200 + 1) * 400 = 720,400

2. **Fully Connected Layers (if any):**
   - The question does not mention any fully connected layers, so none are counted here. If there were any, their parameter count would depend on the size of each layer.

Therefore, the total number of parameters in the CNN is the sum of the parameters in each layer:
2,800 + 180,200 + 720,400 = 903,400 parameters.

To estimate the RAM required, we need both the parameters and the feature maps that each layer produces. With a stride of 2 and SAME padding, each layer halves the height and width (rounding up). Using 32-bit floats (4 bytes per value) and 1 MB = 1,000,000 bytes:
- Parameters: 903,400 * 4 bytes = 3,613,600 bytes, or about 3.6 MB
- Input image: 200 * 300 * 3 * 4 bytes = 720,000 bytes, or 0.72 MB
- First layer output: 100 * 150 * 100 * 4 bytes = 6,000,000 bytes
- Second layer output: 50 * 75 * 200 * 4 bytes = 3,000,000 bytes
- Third layer output: 25 * 38 * 400 * 4 bytes = 1,520,000 bytes

1. **Single Instance Prediction:**
   - At inference time, a layer's input can be released as soon as its output has been computed, so only two consecutive layers need to be in memory at once.
   - Peak feature-map memory: 6,000,000 + 3,000,000 = 9,000,000 bytes (the first and second layer outputs together)
   - RAM required for the parameters: 3,613,600 bytes
   - Total RAM required: 9,000,000 + 3,613,600 = 12,613,600 bytes, or about 12.6 MB

2. **Training on a Batch of 50 Images:**
   - During training, every layer's output must be kept for the backward pass.
   - Feature maps for a single image: 6,000,000 + 3,000,000 + 1,520,000 = 10,520,000 bytes
   - Feature maps for the batch: 10,520,000 * 50 = 526,000,000 bytes, or 526 MB
   - Input images for the batch: 720,000 * 50 = 36,000,000 bytes, or 36 MB
   - RAM required for the parameters: about 3.6 MB
   - Total RAM required: 526 MB + 36 MB + 3.6 MB = about 565.6 MB

Therefore, a single instance prediction needs at least about 12.6 MB of RAM if memory is released as early as possible, and training on a batch of 50 images needs at least about 566 MB. The training figure is an optimistic lower bound, because it leaves out the memory for gradients and optimizer state.
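
The arithmetic for this architecture can be checked with a short script. It is a sketch under stated assumptions: one bias per feature map, SAME padding with output size ceil(input / stride), 4 bytes per value, inference that keeps only two consecutive layers in memory, and training that keeps every activation. Gradients and optimizer state are deliberately left out, and the variable names are illustrative.

```python
import math

BYTES_PER_FLOAT = 4          # 32-bit floats
KERNEL, STRIDE = 3, 2
feature_maps = [100, 200, 400]

height, width, in_channels = 200, 300, 3
input_bytes = height * width * in_channels * BYTES_PER_FLOAT

total_params = 0
activation_bytes = []
for out_channels in feature_maps:
    # (kernel weights across all input channels + 1 bias) per output map
    total_params += (KERNEL * KERNEL * in_channels + 1) * out_channels
    # SAME padding with stride 2 halves each dimension, rounding up
    height = math.ceil(height / STRIDE)
    width = math.ceil(width / STRIDE)
    activation_bytes.append(height * width * out_channels * BYTES_PER_FLOAT)
    in_channels = out_channels

param_bytes = total_params * BYTES_PER_FLOAT

# Inference: a layer's input is freed once its output exists, so the peak
# is the largest pair of consecutive tensors (input counts as the first).
tensors = [input_bytes] + activation_bytes
inference_bytes = param_bytes + max(a + b for a, b in zip(tensors, tensors[1:]))

# Training: inputs and all activations are kept for the backward pass.
batch_size = 50
training_bytes = param_bytes + batch_size * (input_bytes + sum(activation_bytes))

print(f"Parameters:        {total_params:,}")
print(f"Inference (1 img): {inference_bytes / 1e6:.1f} MB")
print(f"Training (50 img): {training_bytes / 1e6:.1f} MB")
```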

#### 3. What are five things you might do to fix the problem if your GPU runs out of memory while training a CNN?

If your GPU runs out of memory while training a CNN, here are five things you can do to address the issue:

1. **Reduce Batch Size**: Decrease the batch size used during training. Batch size determines the number of samples processed in each iteration, and reducing it can help reduce the memory usage. However, a smaller batch size may also affect the convergence and overall performance of the model.

2. **Resize or Crop Images**: If the input images are large, resizing them to a smaller resolution or cropping them can help reduce the memory requirements. This can be done while preprocessing the data before feeding it into the network.

3. **Use Mixed Precision Training**: Utilize mixed precision training techniques, such as using half-precision (float16) instead of single-precision (float32) for storing activations and gradients. This can reduce the memory footprint of the model without significant loss in training quality.

4. **Reduce Model Complexity**: Simplify the architecture of the CNN by reducing the number of layers, decreasing the number of filters, or employing other techniques to reduce the model's complexity. This can help reduce the memory usage and make the model more manageable on the available GPU memory.

5. **Utilize Model Parallelism**: If your machine has multiple GPUs, or you have access to a distributed training framework, you can split the model across several devices. This approach, known as model parallelism, lets you train models that do not fit in a single GPU's memory by placing different parts of the model on different devices.
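
The effect of the first three remedies can be estimated with rough arithmetic. The sketch below assumes a single hypothetical activation tensor (64 feature maps at 224 x 224) and ignores everything else that lives on the GPU, so the numbers indicate how memory scales rather than what a real training run would use.

```python
def activation_bytes(batch_size, height, width, channels, bytes_per_value):
    """Memory for one batch of feature maps from a single layer."""
    return batch_size * height * width * channels * bytes_per_value


# Hypothetical baseline: batch of 64, 224 x 224 maps, 64 channels, float32
baseline = activation_bytes(64, 224, 224, 64, 4)

smaller_batch = activation_bytes(32, 224, 224, 64, 4)    # remedy 1
smaller_images = activation_bytes(64, 112, 112, 64, 4)   # remedy 2
half_precision = activation_bytes(64, 224, 224, 64, 2)   # remedy 3

for name, value in [("baseline", baseline),
                    ("batch size halved", smaller_batch),
                    ("image side halved", smaller_images),
                    ("float16 activations", half_precision)]:
    print(f"{name:>20}: {value / 1e6:7.1f} MB")
```

Halving the batch size or the precision halves this tensor, while halving the image side cuts it to a quarter because both height and width shrink.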

#### 4. Why would you use a max pooling layer instead of a convolutional layer with the same stride?

1. **Dimensionality reduction**: Max pooling reduces the spatial dimensions (width and height) of the input feature maps while retaining the most prominent features. It achieves this by selecting the maximum value within each pooling region. This lowers the computational cost and the number of parameters in the layers that follow, and the pooling layer itself adds no parameters at all, unlike a strided convolutional layer.

2. **Translation invariance**: Max pooling provides a form of translation invariance by selecting the maximum activation within each pooling region. This means that even if the position of a feature slightly shifts within the pooling region, the maximum value will still be selected, preserving the essential information about the presence of that feature. This property helps the model to be more robust to small spatial variations and increases its ability to generalize.

3. **Feature selection**: Max pooling acts as a form of feature selection by emphasizing the most important features while discarding less significant ones. By selecting the maximum activation within each pooling region, max pooling focuses on capturing the most salient features in the input feature maps.

4. **Reduction of spatial overfitting**: Max pooling can help reduce the risk of overfitting by discarding unnecessary spatial details. It reduces the spatial resolution of the feature maps, making them less prone to overfitting on specific spatial patterns in the training data.

5. **Computationally efficient**: Max pooling is computationally efficient compared to convolutional layers with the same stride. It requires fewer computations since it only selects the maximum value within each pooling region without performing any weight multiplication or convolution operation.
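
A small pure-Python sketch can illustrate points 2 and 5: max pooling has no weights, and moving a feature by one pixel inside a pooling window leaves the output unchanged. The 4 x 4 feature maps and the helper name `max_pool_2x2` are made up for illustration.

```python
def max_pool_2x2(feature_map):
    """2 x 2 max pooling with stride 2 on a list-of-lists feature map."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [max(feature_map[r][c], feature_map[r][c + 1],
             feature_map[r + 1][c], feature_map[r + 1][c + 1])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]


original = [
    [0, 0, 0, 0],
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
]
# Same pattern, with the 9 moved by one pixel inside its pooling window
shifted = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
]

pooled_original = max_pool_2x2(original)
pooled_shifted = max_pool_2x2(shifted)
print(pooled_original)
print(pooled_shifted)
```

Both inputs pool to the same 2 x 2 map, and the function contains nothing to learn. A shift that crosses a window boundary would change the output, which is why the invariance holds only for small displacements.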

#### 5. When would a local response normalization layer be useful?

1. **Normalization of Local Neighborhood**: LRN normalizes the response of neurons within a local neighborhood. It helps to enhance the contrast between responses by normalizing the activations relative to their neighboring activations. This can be beneficial when you want to emphasize local contrast and highlight salient features in an image.

2. **Lateral Inhibition**: LRN incorporates a form of lateral inhibition, where it suppresses the responses of neurons that are relatively weak compared to their neighbors. This inhibition mechanism can help create sparse representations and enhance the selectivity of neurons, promoting stronger responses for more pronounced features.

3. **Normalization Across Channels**: LRN operates across channels, considering activations from different feature maps. By normalizing across channels, LRN encourages competition between different feature channels and reduces the dominance of highly activated channels, promoting diversity in the learned features.

4. **Robustness to Brightness Variations**: LRN can provide some degree of robustness to brightness variations in the input data. By normalizing responses within a local neighborhood, LRN can help mitigate the impact of overall intensity changes in the image, making the network more invariant to such variations.
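
The cross-channel normalization described in points 2 and 3 can be sketched in a few lines. The function below follows the formula from the AlexNet paper, b_i = a_i / (k + alpha * sum of a_j^2 over neighbouring channels)^beta, applied to the channel activations at a single pixel. The default hyperparameters are the AlexNet values (k = 2, window of 5 channels, alpha = 1e-4, beta = 0.75); the function name, the example activations, and the larger alpha used in the demonstration are assumptions chosen to make the effect visible. Framework implementations may scale alpha differently.

```python
def local_response_norm(activations, depth_radius=2, bias=2.0,
                        alpha=1e-4, beta=0.75):
    """Normalize each channel by the squared activity of nearby channels."""
    n_channels = len(activations)
    normalized = []
    for i, a_i in enumerate(activations):
        low = max(0, i - depth_radius)
        high = min(n_channels - 1, i + depth_radius)
        sum_sq = sum(activations[j] ** 2 for j in range(low, high + 1))
        normalized.append(a_i / (bias + alpha * sum_sq) ** beta)
    return normalized


# Hypothetical activations of six feature maps at one pixel:
# channel 1 fires strongly, all the others respond weakly.
channels = [1.0, 50.0, 1.0, 1.0, 1.0, 1.0]
out = local_response_norm(channels, alpha=1e-2)

for index, (before, after) in enumerate(zip(channels, out)):
    print(f"channel {index}: {before:5.1f} -> {after:.4f}")
```

Channel 0 sits next to the strongly activated channel and is suppressed more than channel 5, which has equal activation but only weak neighbours. This is the lateral inhibition effect.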

#### 6. In comparison to LeNet-5, what are the main innovations in AlexNet? What about GoogLeNet and ResNet's core innovations?

The main innovations in AlexNet compared to LeNet-5 are as follows:

1. **Deeper architecture**: AlexNet introduced a much deeper architecture compared to LeNet-5, consisting of eight layers, including five convolutional layers and three fully connected layers. This deeper architecture allowed for more complex feature extraction and representation learning.

2. **ReLU activation**: AlexNet replaced the saturating tanh-style activation function used in LeNet-5 with the Rectified Linear Unit (ReLU) activation function. ReLU helps alleviate the vanishing gradient problem and accelerates the convergence of the network during training.

3. **Data augmentation**: AlexNet employed data augmentation techniques such as image translations, horizontal reflections, and random cropping during training. This technique helped to increase the size of the training set and improve the network's ability to generalize.

4. **Dropout regularization**: AlexNet applied dropout regularization in its fully connected layers, a technique where randomly selected neurons are ignored during training. Dropout helps prevent overfitting by reducing co-adaptation among neurons and promoting better generalization.

The core innovations of GoogLeNet are as follows:

1. **Inception module**: GoogLeNet introduced the Inception module, which utilizes parallel convolutional filters of different sizes (1x1, 3x3, 5x5) to capture features at multiple scales. The outputs of these parallel filters are then concatenated to form the module's output. This architecture allows for efficient and effective feature extraction at various levels of abstraction.

2. **Network depth**: GoogLeNet demonstrated the effectiveness of significantly increasing the network's depth without a proportional increase in computational complexity by using the Inception module. It showed that network depth plays a crucial role in improving performance.

The core innovation of ResNet is:

1. **Residual connections**: ResNet introduced residual connections, also known as skip connections, that bypass one or more layers in the network. This architecture enables the network to learn residual mappings, making it easier to optimize and train very deep networks. Residual connections help address the vanishing gradient problem and enable the network to learn more meaningful and complex representations.

#### 7. On MNIST, build your own CNN and strive to achieve the best possible accuracy.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader

# Define the CNN architecture
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1)
        self.relu1 = nn.ReLU()
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.relu2 = nn.ReLU()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(64 * 7 * 7, 128)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.pool(x)
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.pool(x)
        x = x.view(-1, 64 * 7 * 7)
        x = self.fc1(x)
        x = self.relu3(x)
        x = self.fc2(x)
        return x

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load MNIST dataset
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))
])
train_dataset = MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = MNIST(root='./data', train=False, download=True, transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Create the CNN model
model = CNN().to(device)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
epochs = 10
for epoch in range(epochs):
    running_loss = 0.0
    for images, labels in train_loader:
        images = images.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()

        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f"Epoch {epoch+1}/{epochs} - Loss: {running_loss/len(train_loader)}")

# Evaluation
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)

        outputs = model(images)
        # Predicted class = index of the largest logit for each sample
        predicted = outputs.argmax(dim=1)

        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = correct / total
print(f"Test Accuracy: {accuracy*100}%")


Epoch 1/10 - Loss: 0.15250781043095074
Epoch 2/10 - Loss: 0.044014206524121205
Epoch 3/10 - Loss: 0.030127926049340736
Epoch 4/10 - Loss: 0.024183074685952266
Epoch 5/10 - Loss: 0.01692957489038949
Epoch 6/10 - Loss: 0.014904003498827721
Epoch 7/10 - Loss: 0.01009060187567396
Epoch 8/10 - Loss: 0.008688662722807784
Epoch 9/10 - Loss: 0.008405397762664564
Epoch 10/10 - Loss: 0.00833569952428729
Test Accuracy: 98.98%


#### 8. Using Inception v3 to classify large images.
a. Download images of various animals. Load them in Python using, for example, the matplotlib.image.imread() function (the older scipy.misc.imread() function has been removed from recent SciPy releases). Resize and/or crop them to 299 x 299 pixels, and make sure they only have three channels (RGB) and no transparency. The photos used to train the Inception model were preprocessed to have values ranging from -1.0 to 1.0, so make sure yours do as well.


In [4]:
import matplotlib.image as mpimg
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.inception_v3 import preprocess_input

# Define the image paths
image_paths = ['/content/drive/MyDrive/iNeuron Assignments/animals/c1.jpg', '/content/drive/MyDrive/iNeuron Assignments/animals/d2.jpg']

# Create an empty list to store the preprocessed images
preprocessed_images = []

# Load and preprocess each image
for image_path in image_paths:
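    # Load the image using matplotlib (JPEG files load as uint8 arrays in [0, 255])
    img = mpimg.imread(image_path)

    # Keep only the RGB channels, dropping an alpha channel if one is present
    # (assumes a colour image of shape height x width x channels)
    img = img[..., :3]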
    
    # Resize the image to 299x299 pixels
    img = tf.image.resize(img, (299, 299))
    
    # Convert the image to a numpy array
    img = np.array(img)
    
    # Preprocess the image to have values ranging from -1.0 to 1.0
    img = preprocess_input(img)
    
    # Append the preprocessed image to the list
    preprocessed_images.append(img)

preprocessed_images = np.array(preprocessed_images)

model = tf.keras.applications.InceptionV3(weights='imagenet')

# Make predictions on the preprocessed images
predictions = model.predict(preprocessed_images)

# Get the top predicted classes and their corresponding probabilities
top_predictions = tf.keras.applications.imagenet_utils.decode_predictions(predictions, top=3)

# Print the top predictions for each image
for i, image_path in enumerate(image_paths):
    print("Image:", image_path)
    for pred in top_predictions[i]:
        print("Class:", pred[1])
        print("Probability:", pred[2])



Image: /content/drive/MyDrive/iNeuron Assignments/animals/c1.jpg
Class: tiger_cat
Probability: 0.45521465
Class: tabby
Probability: 0.39312863
Class: Egyptian_cat
Probability: 0.0993125
Image: /content/drive/MyDrive/iNeuron Assignments/animals/d2.jpg
Class: pug
Probability: 0.8494927
Class: Brabancon_griffon
Probability: 0.011860026
Class: bull_mastiff
Probability: 0.0037340107


#### 9. Large-scale image recognition using transfer learning.
a. Make a training set of at least 100 images for each class. You might, for example, classify your own photos based on their location (beach, mountain, city, etc.) or use an existing dataset, such as the flowers dataset or MIT's places dataset (requires registration, and it is huge).

b. Create a preprocessing phase that resizes and crops the image to 299 x 299 pixels while also adding some randomness for data augmentation.

c. Using the pretrained Inception v3 model, freeze all layers up to the bottleneck layer (the last layer before the output layer) and replace the output layer with the appropriate number of outputs for your new classification task (e.g., the flowers dataset has five mutually exclusive classes, so the output layer must have five neurons and use the softmax activation function).

d. Separate the data into two sets: a training and a test set. The training set is used to train the model, and the test set is used to evaluate it.


To perform large-scale image recognition using transfer learning, you can follow these steps:

a. Building the Training Set:
   - Collect or obtain a training dataset with at least 100 images for each class you want to classify.
   - You can capture your own photos or use existing datasets like the flowers dataset or MIT's places dataset.

b. Preprocessing and Data Augmentation:
   - Preprocess the images by resizing them to a consistent size, such as 299 x 299 pixels.
   - Apply data augmentation techniques to add randomness and increase the diversity of the training data.
   - Data augmentation techniques can include random cropping, flipping, rotation, zooming, and color jittering.

c. Modifying the Inception v3 Model:
   - Load the pretrained Inception v3 model.
   - Freeze all the layers up to the bottleneck layer (the last layer before the output layer).
   - Replace the output layer with one that has as many neurons as your task has classes, and use the softmax activation function for multi-class classification.

d. Splitting the Data:
   - Split the dataset into two sets: a training set and a test set.
   - The training set is used to train the model, while the test set is used to evaluate its performance.
   - Ensure that the data is split randomly and that each class is represented in both the training and test sets.
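
Steps b to d can be sketched as follows. This is a minimal sketch rather than a tested pipeline: it assumes TensorFlow 2.6 or later (for the `RandomFlip`, `RandomZoom`, and `Rescaling` layers), images already resized to 299 x 299 with pixel values in [0, 255], and internet access to download the ImageNet weights. The helper names `stratified_split` and `build_transfer_model` are hypothetical, and the augmentation settings are arbitrary starting points.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Step d: return (train_idx, test_idx) with every class in both sets.

    Assumes each class has enough images (the task asks for at least 100).
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, int(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def build_transfer_model(num_classes, weights="imagenet"):
    """Steps b and c: augmentation, frozen Inception v3, new softmax layer."""
    import tensorflow as tf  # imported lazily; only the model needs it

    # include_top=False drops the original 1000-class output layer;
    # pooling="avg" exposes the 2048-dimensional bottleneck features.
    base = tf.keras.applications.InceptionV3(
        weights=weights, include_top=False, pooling="avg",
        input_shape=(299, 299, 3))
    base.trainable = False  # freeze all pretrained layers

    inputs = tf.keras.Input(shape=(299, 299, 3))
    # Random augmentation, active only while training
    x = tf.keras.layers.RandomFlip("horizontal")(inputs)
    x = tf.keras.layers.RandomZoom(0.1)(x)
    # Inception v3 expects inputs scaled to [-1, 1]
    x = tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1.0)(x)
    x = base(x, training=False)  # keep batch-norm statistics fixed
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)


# Example with made-up labels: 10 daisies and 5 roses
example_labels = ["daisy"] * 10 + ["rose"] * 5
train_idx, test_idx = stratified_split(example_labels, test_fraction=0.2)
print(len(train_idx), "training images,", len(test_idx), "test images")
```

A model for the five-class flowers dataset would then be created with `build_transfer_model(5)` and compiled with a sparse categorical cross-entropy loss before training only the new output layer.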