# Using Convolutional Neural Networks

In this last chapter, we learn how to make neural networks work well in practice, using concepts like regularization, batch-normalization and transfer learning.

# (1) The sequential module

## AlexNet - declaring the modules

```
class AlexNet(nn.Module):
    def __init__(self, num_classes=1000):
        super(AlexNet, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2)
        self.conv2 = nn.Conv2d(64, 192, kernel_size=5, padding=2)
        self.conv3 = nn.Conv2d(192, 384, kernel_size=3, padding=1)
        self.conv4 = nn.Conv2d(384, 256, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(256, 256, kernel_size=3, padding=1)
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.fc1 = nn.Linear(256 * 6 * 6, 4096)
        self.fc2 = nn.Linear(4096, 4096)
        self.fc3 = nn.Linear(4096, num_classes)
```

## AlexNet - forward() method

```
def forward(self, x):
    x = self.relu(self.conv1(x))
    x = self.maxpool(x)
    x = self.relu(self.conv2(x))
    x = self.maxpool(x)
    x = self.relu(self.conv3(x))
    x = self.relu(self.conv4(x))
    x = self.relu(self.conv5(x))
    x = self.maxpool(x)
    x = self.avgpool(x)
    x = x.view(x.size(0), 256 * 6 * 6)
    x = self.relu(self.fc1(x))
    x = self.relu(self.fc2(x))
    return self.fc3(x)
```

## The sequential module - declaring the modules

```
class AlexNet(nn.Module):
    def __init__(self, num_classes=1000):
        super(AlexNet, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(64, 192, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(192, 384, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2), )
        self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
        self.classifier = nn.Sequential(
            nn.Dropout(),
            nn.Linear(256 * 6 * 6, 4096),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 4096),
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes), )
```

## The sequential module - forward() method

```
def forward(self, x):
    x = self.features(x)
    x = self.avgpool(x)
    x = x.view(x.size(0), 256 * 6 * 6)
    x = self.classifier(x)
    return x
```
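
As a quick check of what `nn.Sequential` does, the sketch below uses a toy three-module stack rather than the AlexNet above. The layer sizes and the names `conv`, `relu`, `pool` and `stack` are illustrative assumptions. It suggests that calling the container gives the same result as calling its modules one after another.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative layers; the sizes are arbitrary and not taken from AlexNet.
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
relu = nn.ReLU()
pool = nn.MaxPool2d(kernel_size=2, stride=2)

# nn.Sequential calls its modules in the order they are passed in.
stack = nn.Sequential(conv, relu, pool)

x = torch.randn(4, 3, 32, 32)          # batch of 4 RGB images, 32x32
manual = pool(relu(conv(x)))           # modules applied by hand
chained = stack(x)                     # same modules applied by the container

print(chained.shape)                   # torch.Size([4, 8, 16, 16])
print(torch.equal(manual, chained))    # True
```

The container holds the same module objects, so there is only one set of weights; it changes how the code is organized, not what is computed.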

# Exercise I: Sequential module - init method

Having learned about the sequential module, now is the time to see how you can convert a neural network that doesn't use sequential modules to one that uses them. We are giving the code to build the network in the usual way, and you are going to write the code for the same network using sequential modules.

```
class Net(nn.Module):
    def __init__(self, num_classes):
        super(Net, self).__init__()

        self.conv1 = nn.Conv2d(in_channels=1, out_channels=5, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(in_channels=5, out_channels=10, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(in_channels=10, out_channels=20, kernel_size=3, padding=1)
        self.conv4 = nn.Conv2d(in_channels=20, out_channels=40, kernel_size=3, padding=1)

        self.relu = nn.ReLU()

        self.pool = nn.MaxPool2d(2, 2)

        self.fc1 = nn.Linear(7 * 7 * 40, 1024)
        self.fc2 = nn.Linear(1024, 2048)
        self.fc3 = nn.Linear(2048, 10) 
```

We want the pooling layer to be used after the second and fourth convolutional layers, while the ReLU nonlinearity needs to be used after each layer except the last (fully-connected) layer. For the number of filters (kernels), stride, padding, number of channels and number of units, use the same numbers as above.

### Instructions

- Declare all the layers needed for feature extraction in the `self.features`.
- Declare the three linear layers in `self.classifier`.

```
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        # Declare all the layers for feature extraction
        self.features = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=5, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=5, out_channels=10, kernel_size=3, padding=1),
            nn.MaxPool2d(2, 2), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=10, out_channels=20, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=20, out_channels=40, kernel_size=3, padding=1),
            nn.MaxPool2d(2, 2), nn.ReLU(inplace=True))

        # Declare all the layers for classification
        self.classifier = nn.Sequential(
            nn.Linear(7 * 7 * 40, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, 2048),
            nn.ReLU(inplace=True),
            nn.Linear(2048, 10))
```

# Exercise II: Sequential module - forward() method

Now that you have defined all the modules the network needs, it is time to apply them in the `forward()` method. For context, here is the code for the `forward()` method as it would look if the network were written in the usual way.

```
class Net(nn.Module):
    def __init__(self, num_classes):
        super(Net, self).__init__()

        self.conv1 = nn.Conv2d(in_channels=1, out_channels=5, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(in_channels=5, out_channels=10, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(in_channels=10, out_channels=20, kernel_size=3, padding=1)
        self.conv4 = nn.Conv2d(in_channels=20, out_channels=40, kernel_size=3, padding=1)

        self.relu = nn.ReLU()

        self.pool = nn.MaxPool2d(2, 2)

        self.fc1 = nn.Linear(7 * 7 * 40, 1024)
        self.fc2 = nn.Linear(1024, 2048)
        self.fc3 = nn.Linear(2048, 10) 

    def forward(self, x):
        x = self.relu(self.conv1(x))
        x = self.relu(self.pool(self.conv2(x)))
        x = self.relu(self.conv3(x))
        x = self.relu(self.pool(self.conv4(x)))
        x = x.view(-1, 7 * 7 * 40)
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        x = self.fc3(x)
        return x
```

Note: for evaluation purposes, the entire code of the class needs to be in the script. We are using the `__init__` method as you coded it in the previous exercise, while you are going to code the `forward()` method here.

### Instructions

- Extract the features from the images.
- Flatten the three non-batch dimensions of the feature maps (channels, height, width) into one using the `view()` method.
- Classify images based on the extracted features.
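
To see where the `7 * 7 * 40` in `view()` comes from, here is a minimal sketch assuming MNIST-sized input (1 channel, 28x28). Random tensors stand in for real images and feature maps, and the batch size of 16 is an arbitrary choice.

```python
import torch
import torch.nn as nn

# Random stand-in for a batch of 16 MNIST-sized images (1 channel, 28x28).
x = torch.randn(16, 1, 28, 28)

pool = nn.MaxPool2d(2, 2)
# Each 2x2 pooling halves height and width: 28 -> 14 -> 7.
# (The 3x3 convolutions with padding=1 leave height and width unchanged.)
print(pool(x).shape)            # torch.Size([16, 1, 14, 14])
print(pool(pool(x)).shape)      # torch.Size([16, 1, 7, 7])

# Stand-in for the feature maps after the fourth convolution: 40 channels of 7x7.
feature_maps = torch.randn(16, 40, 7, 7)

# Flatten channels, height and width into one dimension; keep the batch dimension.
flat = feature_maps.view(-1, 7 * 7 * 40)
print(flat.shape)               # torch.Size([16, 1960])
```

The `-1` lets `view()` infer the batch size, so the same line works for any number of images.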


```
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        # Declare all the layers for feature extraction
        self.features = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=5, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=5, out_channels=10, kernel_size=3, padding=1),
            nn.MaxPool2d(2, 2), nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=10, out_channels=20, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels=20, out_channels=40, kernel_size=3, padding=1),
            nn.MaxPool2d(2, 2), nn.ReLU(inplace=True))

        # Declare all the layers for classification
        self.classifier = nn.Sequential(
            nn.Linear(7 * 7 * 40, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, 2048),
            nn.ReLU(inplace=True),
            nn.Linear(2048, 10))

    def forward(self, x):

        # Apply the feature extractor to the input
        # (self.features already ends with a ReLU, so no extra activation is needed)
        x = self.features(x)

        # Flatten channels, height and width into one dimension
        x = x.view(-1, 7 * 7 * 40)

        # Classify the images
        x = self.classifier(x)
        return x
```

# (2) The problem of overfitting

## Overfitting

<img src="image/Screenshot 2021-01-28 020321.png">

## Detecting overfitting

<img src="image/Screenshot 2021-01-28 020426.png">

<img src="image/Screenshot 2021-01-28 020451.png">

## Overfitting in the testing set

- Training set
- Validation set
- Testing set

## Validation set

- Training set: train the model
- Validation set: select the model
- Testing set: test the model

## Using validation sets in PyTorch

```
import numpy as np
import torch
from torchvision import datasets, transforms

# Shuffle the 50,000 CIFAR-10 training indices once, then split them
indices = np.arange(50000)
np.random.shuffle(indices)

# Training loader: first 45,000 shuffled indices
train_loader = torch.utils.data.DataLoader(
    datasets.CIFAR10(root='./data', train=True, download=True,
        transform=transforms.Compose([transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])),
    batch_size=1, shuffle=False,
    sampler=torch.utils.data.SubsetRandomSampler(indices[:45000]))

# Validation loader: remaining 5,000 indices of the same training split
val_loader = torch.utils.data.DataLoader(
    datasets.CIFAR10(root='./data', train=True, download=True,
        transform=transforms.Compose([transforms.ToTensor(),
        transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])),
    batch_size=1, shuffle=False,
    sampler=torch.utils.data.SubsetRandomSampler(indices[45000:50000]))
```
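
The same splitting idea can be tried without downloading anything. The sketch below uses a toy `TensorDataset` of 100 samples (an assumption made purely for illustration, as are the 90/10 split and the batch size) and checks that the two samplers never draw the same sample.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset

# Toy dataset of 100 samples; each sample stores its own index as the label.
dataset = TensorDataset(torch.randn(100, 3), torch.arange(100))

indices = np.arange(100)
np.random.shuffle(indices)

# First 90 shuffled indices for training, remaining 10 for validation.
train_loader = DataLoader(dataset, batch_size=10,
                          sampler=SubsetRandomSampler(indices[:90].tolist()))
val_loader = DataLoader(dataset, batch_size=10,
                        sampler=SubsetRandomSampler(indices[90:].tolist()))

# Collect the sample indices each loader actually yields.
train_seen = {int(i) for _, labels in train_loader for i in labels}
val_seen = {int(i) for _, labels in val_loader for i in labels}

print(len(train_seen), len(val_seen))   # 90 10
print(train_seen.isdisjoint(val_seen))  # True
```

Because both loaders index into the same underlying dataset, the split costs no extra memory; only the lists of indices differ.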

# Exercise III: Validation set

You saw the need for a validation set in the previous video. The problem is that datasets are typically not split into training, validation and testing sets. It is your job as a data scientist to make that split. The easiest (and most common) way is to split the dataset at random. In PyTorch, this can be done with a `SubsetRandomSampler` object. You are going to split the training part of the `MNIST` dataset into training and validation sets. After randomly shuffling the indices, use the first `55000` points for training and the remaining `5000` points for validation.

### Instructions

- Use `numpy.arange()` to create an array containing numbers [0, 59999] and then randomly shuffle the array.
- In the `train_loader`, use `SubsetRandomSampler()` with the first `55k` points for training.
- In the `val_loader` use the remaining `5k` points for validation.


```
import numpy as np
import torch
from torchvision import datasets, transforms

# Shuffle the indices
indices = np.arange(60000)
np.random.shuffle(indices)

# Build the train loader
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('mnist', download=True, train=True,
            transform=transforms.Compose([transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))])),
    batch_size=64, shuffle=False,
    sampler=torch.utils.data.SubsetRandomSampler(indices[:55000]))

# Build the validation loader
val_loader = torch.utils.data.DataLoader(
    datasets.MNIST('mnist', download=True, train=True,
            transform=transforms.Compose([transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))])),
    batch_size=64, shuffle=False,
    sampler=torch.utils.data.SubsetRandomSampler(indices[55000:60000]))
```

# Exercise IV: Detecting overfitting

Overfitting is arguably the biggest problem in machine learning and data science, and being able to detect it will make you a much better data scientist. While reaching a high (or even perfect) accuracy on training sets is quite easy when you use neural networks, reaching a high accuracy on validation and testing sets is a very different thing.

Let's see if you can now detect overfitting. Among the accuracy scores below, which network presents the biggest overfitting problem?

### Possible Answers

- The accuracy in the training set is 90%, the accuracy in the validation set is 88%.
- The accuracy in the training set is 90%, the accuracy in the testing set is 70%.
- The accuracy in the training set is 90%, the accuracy in the validation set is 70%. (T)
- The accuracy in the validation set is 85%, the accuracy in the testing set is 82%.

# (3) Regularization techniques

## L2-regularization

$$C = -\frac{1}{n}\sum_{x}\sum_{j}\left[y_j \ln a_j^L + (1 - y_j)\ln\left(1 - a_j^L\right)\right] + \frac{\lambda}{2n}\sum_{w} w^2$$

```
optimizer = optim.Adam(net.parameters(), lr=3e-4, weight_decay=0.0001)
```
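
To make the link between `weight_decay` and the penalty term concrete, here is a minimal sketch. It assumes plain SGD instead of Adam so that a single update is easy to verify, and a tiny linear model with made-up data. It compares one step with `weight_decay=wd` against one step where `(wd / 2) * sum(w^2)` is added to the loss by hand. In terms of the cost above, `wd` plays the role of $\lambda / n$. Note that `weight_decay` applies to every parameter passed to the optimizer, biases included, so the hand-written penalty sums over all parameters too.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
wd = 0.01                                   # illustrative L2 strength
x, y = torch.randn(8, 4), torch.randn(8, 1)

# Two identical copies of a tiny linear model (sizes are arbitrary).
net_a = nn.Linear(4, 1)
net_b = nn.Linear(4, 1)
net_b.load_state_dict(net_a.state_dict())

# (a) Let the optimizer apply the penalty through weight_decay.
opt_a = optim.SGD(net_a.parameters(), lr=0.1, weight_decay=wd)
opt_a.zero_grad()
nn.functional.mse_loss(net_a(x), y).backward()
opt_a.step()

# (b) Add (wd / 2) * sum(w^2) to the loss by hand, with no weight_decay.
opt_b = optim.SGD(net_b.parameters(), lr=0.1)
opt_b.zero_grad()
penalty = sum((p ** 2).sum() for p in net_b.parameters())
(nn.functional.mse_loss(net_b(x), y) + 0.5 * wd * penalty).backward()
opt_b.step()

# Both routes should land on the same parameters after one step.
same = all(torch.allclose(pa, pb, atol=1e-6)
           for pa, pb in zip(net_a.parameters(), net_b.parameters()))
print(same)   # True
```

With `optim.Adam`, the `wd * w` term is added to the gradient in the same way, but it then passes through Adam's adaptive scaling, so the resulting step is not simply the SGD step shown here.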

## Dropout

<img src="image/Screenshot 2021-01-28 031145.png">

## Dropout in AlexNet - PyTorch code

```
self.classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(256 * 6 * 6, 4096),
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096),
    nn.ReLU(inplace=True),
    nn.Linear(4096, num_classes),
)
```

## Batch-normalization

<img src="image/Screenshot 2021-01-28 031513.png">

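A minimal sketch of what batch-normalization does to activations in training mode, assuming a random batch of 10-channel feature maps that has been deliberately shifted and scaled. The shapes and the offset/scale values are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Batch of 32 feature maps with 10 channels, deliberately shifted and scaled.
x = 5.0 + 3.0 * torch.randn(32, 10, 14, 14)

bn = nn.BatchNorm2d(10)   # one mean/variance pair per channel
bn.train()                # training mode: normalize with statistics of the current batch

out = bn(x)

# Statistics are taken over batch, height and width, separately for each channel.
channel_mean = out.mean(dim=(0, 2, 3))
centered = out - channel_mean.view(1, -1, 1, 1)
channel_std = centered.pow(2).mean(dim=(0, 2, 3)).sqrt()

print(channel_mean)   # all close to 0
print(channel_std)    # all close to 1
```

After normalization, the layer applies a learnable per-channel scale and shift, which start at 1 and 0, so a freshly created layer returns the normalized values unchanged.
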
## Early-stopping

<img src="image/Screenshot 2021-01-28 031601.png">

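One common way to implement early stopping is a patience counter on the validation loss. The sketch below is a plain-Python illustration under that assumption. The function name `early_stopping_epoch`, the `patience` value and the loss values are all made up for the example.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran through all epochs


# Made-up validation losses: improving until epoch 3, then getting worse.
val_losses = [0.90, 0.70, 0.60, 0.55, 0.58, 0.61, 0.65, 0.70]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)  # 6
```

In a real training loop one would typically also save the model weights whenever the validation loss improves, and restore that checkpoint after stopping.
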
## Hyperparameters

Question: How do we choose all these hyperparameters (L2-regularization strength, dropout probability, optimizer (Adam vs. gradient descent, etc.), batch-norm momentum and epsilon, number of epochs for early stopping, etc.)?

Answer: Train many networks with different hyperparameters (typically sampled at random) and evaluate them on the validation set. Then take the network that performs best on the validation set and evaluate it on the testing set to estimate its accuracy on new data.
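
The random-search recipe can be sketched in a few lines. Everything here is an illustrative assumption: the sampling ranges, the function names, and above all `validation_accuracy`, which is a hypothetical stand-in that returns a toy score instead of training a network.

```python
import random

random.seed(0)

def sample_hyperparameters():
    """Draw one random configuration (ranges are illustrative assumptions)."""
    return {
        "lr": 10 ** random.uniform(-5, -2),            # log-uniform learning rate
        "weight_decay": 10 ** random.uniform(-6, -2),  # log-uniform L2 strength
        "dropout_p": random.uniform(0.2, 0.7),
    }

def validation_accuracy(config):
    """Hypothetical stand-in: a real version would train a network with
    `config` and return its accuracy on the validation set."""
    return 1.0 - abs(config["dropout_p"] - 0.5) - 10 * abs(config["lr"] - 3e-4)

# Try several random configurations and keep the best one on validation.
trials = [sample_hyperparameters() for _ in range(20)]
best = max(trials, key=validation_accuracy)
print(best)
```

Learning rate and weight decay are sampled on a log scale here because useful values for them tend to span several orders of magnitude.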

## Eval() mode

```
# Sets the net in train mode
model.train()

# Sets the net in evaluation mode
model.eval()
```
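
The mode matters because some layers behave differently in training and evaluation. A minimal sketch with a single `nn.Dropout` layer, where the all-ones input and its size are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 1000)

# Training mode: each element is zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p) = 2.
drop.train()
train_out = drop(x)
print(train_out.unique())          # tensor([0., 2.])

# Evaluation mode: dropout is a no-op, the input passes through unchanged.
drop.eval()
eval_out = drop(x)
print(torch.equal(eval_out, x))    # True
```

Batch-normalization layers switch behavior in the same way: they use batch statistics in train mode and their stored running statistics in evaluation mode.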

# Exercise V: L2-regularization

You are going to implement each of the regularization techniques explained in the previous video. Doing so will also refresh important concepts studied throughout the course. You will start with L2-regularization, the most important regularization technique in machine learning. As you saw in the video, L2-regularization penalizes large weights, which pushes the network toward small weights.

### Instructions

- Instantiate an object called `model` from class `Net()`, which is available in your workspace (consider it as a blackbox).
- Instantiate the cross-entropy loss.
- Instantiate the `Adam` optimizer with learning rate `3e-4` and L2-regularization parameter (`weight_decay`) `0.001`.


```
# Instantiate the network
model = Net()

# Instantiate the cross-entropy loss
criterion = nn.CrossEntropyLoss()

# Instantiate the Adam optimizer
optimizer = optim.Adam(model.parameters(), lr=3e-4, weight_decay=0.001)
```

# Exercise VI: Dropout

You saw that dropout is an effective technique to avoid overfitting. Typically, dropout is applied in fully-connected neural networks, or in the fully-connected layers of a convolutional neural network. You are now going to implement dropout and use it on a small fully-connected neural network.

For the first hidden layer use `200` units, for the second hidden layer use `500` units, and for the output layer use `10` units (one for each class). For the activation function, use ReLU. Use `nn.Dropout()` with strength `0.5` between the first and second hidden layers. Use the sequential module, with the order being: `fully-connected`, `activation`, `dropout`, `fully-connected`, `activation`, `fully-connected`.

### Instructions

- Implement the `__init__` method, based on the description of the network in the context.

```
class Net(nn.Module):
    def __init__(self):
        # nn.Module must be initialized before any submodule is assigned
        super(Net, self).__init__()

        # Define all the parameters of the net
        self.classifier = nn.Sequential(
            nn.Linear(28*28, 200),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(200, 500),
            nn.ReLU(inplace=True),
            nn.Linear(500, 10))

    def forward(self, x):

        # Do the forward pass (x is expected to be flattened to 28*28 features)
        return self.classifier(x)
```

# Exercise VII: Batch-normalization

Dropout is used to regularize fully-connected layers. Batch-normalization is used to make the training of convolutional neural networks more efficient, while at the same time having regularization effects. You are going to implement the `__init__` method of a small convolutional neural network, with batch-normalization. The feature extraction part of the CNN will contain the following modules (in order): `convolution`, `max-pool`, `activation`, `batch-norm`, `convolution`, `max-pool`, `relu`, `batch-norm`.

The first convolutional layer will contain 10 output channels, while the second will contain 20 output channels. As always, we are going to use the MNIST dataset, with images of shape (28, 28) in grayscale format (1 channel). In all cases, the size of the `filter` should be `3`, the `stride` should be `1` and the `padding` should be `1`.

### Instructions

- Implement the feature extraction part of the network, using the description in the context.
- Implement the fully-connected (classifier) part of the network.


```
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        # Implement the sequential module for feature extraction
        self.features = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=10, kernel_size=3, stride=1, padding=1),
            nn.MaxPool2d(2, 2), nn.ReLU(inplace=True), nn.BatchNorm2d(10),
            nn.Conv2d(in_channels=10, out_channels=20, kernel_size=3, stride=1, padding=1),
            nn.MaxPool2d(2, 2), nn.ReLU(inplace=True), nn.BatchNorm2d(20))

        # Implement the fully connected layer for classification
        self.fc = nn.Linear(in_features=7*7*20, out_features=10)
```