<a href="https://colab.research.google.com/github/jeffheaton/app_deep_learning/blob/main/t81_558_class_05_2_cnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
**Module 5: Convolutional Neural Networks (CNN) for Computer Vision**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 5 Material

- Part 5.1: Image Processing in Python [[Video]](https://www.youtube.com/watch?v=Sob7VDb4xh8&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_05_1_python_images.ipynb)
- **Part 5.2: Using Convolutional Neural Networks** [[Video]](https://www.youtube.com/watch?v=jL0_lOpEwSk&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_05_2_cnn.ipynb)
- Part 5.3: Using Pretrained Neural Networks [[Video]](https://www.youtube.com/watch?v=W2T-dfiHYSo&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_05_3_vision_transfer.ipynb)
- Part 5.4: Looking at Generators and Image Augmentation [[Video]](https://www.youtube.com/watch?v=20JoEmQb810&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_05_4_generators.ipynb)
- Part 5.5: Recognizing Multiple Images with YOLOv5 [[Video]](https://www.youtube.com/watch?v=7Uu1n9Tp0Mk&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](t81_558_class_05_5_yolo.ipynb)

# Google CoLab Instructions

The following code detects whether the notebook is running in Google CoLab, defines a few helpers used later in this part, and selects the PyTorch device.

In [1]:
try:
    import google.colab
    COLAB = True
    print("Note: using Google CoLab")
except ImportError:
    print("Note: not using Google CoLab")
    COLAB = False

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return f"{h}:{m:>02}:{s:>05.2f}"

# Early stopping (see module 3.4)
import copy

class EarlyStopping:
    def __init__(self, patience=5, min_delta=0, restore_best_weights=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best_weights = restore_best_weights
        self.best_model = None
        self.best_loss = None
        self.counter = 0
        self.status = ""

    def __call__(self, model, val_loss):
        if self.best_loss is None:
            self.best_loss = val_loss
            self.best_model = copy.deepcopy(model.state_dict())
        elif self.best_loss - val_loss >= self.min_delta:
            self.best_model = copy.deepcopy(model.state_dict())
            self.best_loss = val_loss
            self.counter = 0
            self.status = f"Improvement found, counter reset to {self.counter}"
        else:
            self.counter += 1
            self.status = f"No improvement in the last {self.counter} epochs"
            if self.counter >= self.patience:
                self.status = f"Early stopping triggered after {self.counter} epochs."
                if self.restore_best_weights:
                    model.load_state_dict(self.best_model)
                return True
        return False


# Make use of a GPU or MPS (Apple) if one is available.  (see module 3.2)
import torch

device = (
    "mps"
    if torch.backends.mps.is_available()
    else "cuda"
    if torch.cuda.is_available()
    else "cpu"
)
print(f"Using device: {device}")

Note: not using Google CoLab
Using device: mps


# Part 5.2: Using Convolutional Neural Networks

This module will focus on computer vision. There are some important differences and similarities with previous neural networks.

* We will usually use classification, though regression is still an option.
* The input to the neural network is now 3D (height, width, color).
* Data are not transformed the way tabular data are; there are no z-scores or dummy variables, though pixel values are typically scaled or normalized.
* Processing time is much longer.
* We now have different layer types: dense layers (just like before), convolution layers, and max-pooling layers.
* Data will no longer arrive as CSV files. The torchvision package provides some utilities for going directly from the image to the input for a neural network.

## Convolutional Neural Networks (CNNs)

The convolutional neural network (CNN) is a neural network technology that has profoundly impacted the area of computer vision (CV). Fukushima (1980) [[Cite:fukushima1980neocognitron]](https://www.rctn.org/bruno/public/papers/Fukushima1980.pdf) introduced the original concept of a convolutional neural network, and LeCun, Bottou, Bengio & Haffner (1998) [[Cite:lecun1995convolutional]](http://yann.lecun.com/exdb/publis/pdf/lecun-bengio-95a.pdf) greatly improved this work. From this research, Yann LeCun introduced the famous LeNet-5 neural network architecture. This chapter follows the LeNet-5 style of convolutional neural network.

Although computer vision primarily uses CNNs, this technology has some applications outside of the field. If you want to utilize CNNs on non-visual data, you must find a way to encode your data to mimic the properties of visual data.

For a CNN, the order of the input array elements is crucial to training. In contrast, most neural networks that are not CNNs treat their input data as a long vector of values, and the order in which you arrange the incoming features in this vector is irrelevant, as long as you do not change that order after you have trained the network.

The CNN arranges the inputs into a grid. This arrangement works well with images because pixels in close proximity to each other are important to each other. The order of pixels in an image is significant. The human body is a relevant example of this type of order. For the design of the face, we are accustomed to eyes being near each other.

This advance in CNNs is due to years of research on biological eyes. In other words, CNNs utilize overlapping fields of input to simulate features of biological eyes. Until this breakthrough, AI had been unable to reproduce the capabilities of biological vision.
Scale, rotation, and noise have presented challenges for AI computer vision research. You can observe the complexity of biological eyes in the example that follows. A friend raises a sheet of paper with a large number written on it. As your friend moves nearer to you, the number is still identifiable. In the same way, you can still identify the number when your friend rotates the paper. Lastly, your friend creates noise by drawing lines on the page, but you can still identify the number. These examples demonstrate the high function of the biological eye and help you better understand the research breakthrough of CNNs. That is, this neural network can process scale, rotation, and noise in the field of computer vision. You can see this network structure in Figure 5.LENET.

**Figure 5.LENET: A LeNET-5 Network (LeCun, 1998)**
![A LeNET-5 Network](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/class_8_lenet5.png "A LeNET-5 Network")

So far, we have only seen one layer type (dense layers). By the end of this book we will have seen:

* **Dense Layers** - Fully connected layers.  
* **Convolution Layers** - Used to scan across images. 
* **Max Pooling Layers** - Used to downsample images. 
* **Dropout Layers** - Used to add regularization. 
* **LSTM and Transformer Layers** - Used for time series data.

## Convolution Layers

The first layer that we will examine is the convolutional layer. We will begin by looking at the hyper-parameters that you must specify for a convolutional layer in most neural network frameworks that support the CNN:

* Number of filters
* Filter Size
* Stride
* Padding
* Activation Function/Non-Linearity

The primary purpose of a convolutional layer is to detect features such as edges, lines, blobs of color, and other visual elements. The filters can detect these features. The more filters we give to a convolutional layer, the more features it can see.

A filter is a square-shaped object that scans over the image. A grid can represent the individual pixels of an image. You can think of the convolutional filter as a smaller grid that sweeps left to right over each image row. There is also a hyperparameter that specifies both the width and height of the square-shaped filter. The following figure shows this configuration in which you see the six convolutional filters sweeping over the image grid:

A convolutional layer has weights between it and the previous layer or image grid. Each cell of each convolutional filter is a weight. Therefore, for a single-channel input and not counting bias terms, the number of weights between a convolutional layer and its predecessor layer or image field is the following:

```
[FilterSize] * [FilterSize] * [# of Filters]
```

For example, if the filter size were 5 (5x5) for 10 filters, there would be 250 weights.
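
As a quick check of the arithmetic above, the following sketch computes the weight count for a layer. The helper name `conv_weight_count` and its `in_channels` parameter are illustrative assumptions and not part of the original notebook. The formula in the text corresponds to a single input channel with bias terms ignored; with more input channels, each filter needs one grid of weights per channel.

```python
def conv_weight_count(filter_size, num_filters, in_channels=1):
    """Weights in a convolutional layer with square filters (bias terms excluded)."""
    return filter_size * filter_size * in_channels * num_filters


# The example from the text: ten 5x5 filters over a single-channel grid.
print(conv_weight_count(5, 10))  # 250

# The first layer of the regression model later in this part:
# 64 filters of size 3x3 over a 3-channel (RGB) image.
print(conv_weight_count(3, 64, in_channels=3))  # 1728
```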

You need to understand how the convolutional filters sweep across the previous layer's output or image grid. Figure 5.CNN illustrates the sweep:

**Figure 5.CNN: Convolutional Neural Network**
![Convolutional Neural Network](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/class_8_cnn_grid.png "Convolutional Neural Network")

The above figure shows a convolutional filter with a size of 4 and a padding size of 1. The padding size is responsible for the border of zeros in the area that the filter sweeps. Even though the image is 8x7, the extra padding provides a virtual image size of 9x8 for the filter to sweep across. The stride specifies the number of cells the convolutional filter advances between the positions where it stops. The convolutional filters move to the right, advancing by the number of cells specified in the stride. Once you reach the far right, the convolutional filter moves back to the far left; then, it moves down by the stride amount and continues to the right again.

Some constraints exist concerning the size of the stride. The stride cannot be 0; the convolutional filter would never move if you set the stride to 0. Furthermore, neither the stride nor the convolutional filter size can be larger than the previous grid. There are additional constraints on the stride (*s*), padding (*p*), and the filter width (*f*) for an image of width (*w*). Specifically, the convolutional filter must be able to start at the far left or top border, move a certain number of strides, and land on the far right or bottom border. The following equation shows the number of steps a convolutional operator must take to cross the image:

$$ steps = \frac{w - f + 2p}{s}+1 $$

The number of steps must be an integer. In other words, it cannot have decimal places. You can adjust the padding (*p*) so that this equation yields an integer value.
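
The step equation can be sketched in a few lines of Python. The function name `conv_steps` is a hypothetical helper introduced only for illustration; it raises an error when the combination of width, filter size, padding, and stride does not produce an integer number of steps.

```python
def conv_steps(w, f, p, s):
    """Number of positions a filter of width f visits across width w,
    with padding p and stride s. Raises if the result is not an integer."""
    span = w - f + 2 * p
    if span < 0 or span % s != 0:
        raise ValueError(
            f"w={w}, f={f}, p={p}, s={s} does not give an integer step count"
        )
    return span // s + 1


print(conv_steps(w=256, f=3, p=0, s=1))  # 254
print(conv_steps(w=7, f=3, p=1, s=2))    # 4
```

Note that PyTorch does not raise an error in the uneven case; it rounds the division down, so the filter simply never reaches the last few cells.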

## Max Pooling Layers

Max-pool layers downsample a 3D box to a new one with smaller dimensions. Typically, you place a max-pool layer immediately following a convolutional layer. The LeNet-5 figure shows a max-pool layer immediately after layers C1 and C3. These max-pool layers progressively decrease the size of the 3D boxes passing through them. This technique can help avoid overfitting (Krizhevsky, Sutskever & Hinton, 2012).

A pooling layer has the following hyper-parameters:

* Spatial Extent (*f*)
* Stride (*s*)

Unlike convolutional layers, max-pool layers do not use padding. Additionally, max-pool layers have no weights, so training does not affect them. These layers downsample their 3D box input. The 3D box output by a max-pool layer will have a width equal to this equation:

$$ w_2 = \frac{w_1 - f}{s} + 1 $$

The height of the 3D box produced by the max-pool layer is calculated similarly with this equation:

$$ h_2 = \frac{h_1 - f}{s} + 1 $$

The depth of the 3D box produced by the max-pool layer is equal to the depth of the 3D box received as input. The most common setting for the hyper-parameters of a max-pool layer is f=2 and s=2. The spatial extent (f) specifies that boxes of 2x2 will be scaled down to single pixels. Of these four pixels, the pixel with the maximum value will represent the 2x2 block in the new grid. Because each block of four pixels is replaced by a single pixel, 75% of the pixel information is lost. The following figure shows this transformation as a 6x6 grid becomes a 3x3:

**Figure 5.MAXPOOL: Max Pooling Layer**
![Max Pooling Layer](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/class_8_conv_maxpool.png "Max Pooling Layer")

Of course, the above diagram shows each pixel as a single number. A grayscale image would have this characteristic. For an RGB image, frameworks such as PyTorch apply max pooling to each color channel independently, so each channel keeps its own maximum.
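
The 6x6 to 3x3 reduction can be reproduced with PyTorch's `nn.MaxPool2d`. This is a minimal sketch; the grid values are made up for illustration and are not the values shown in the figure.

```python
import torch
import torch.nn as nn

# A hypothetical 6x6 single-channel grid, shaped (batch, channels, height, width).
grid = torch.arange(36, dtype=torch.float32).reshape(1, 1, 6, 6)

pool = nn.MaxPool2d(kernel_size=2, stride=2)  # f=2, s=2
pooled = pool(grid)

print(pooled.shape)  # torch.Size([1, 1, 3, 3])
print(pooled[0, 0])  # each value is the maximum of one 2x2 block
```

Because the values increase left to right and top to bottom, each output cell is the bottom-right value of its 2x2 block, which makes the result easy to check by hand.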

## Regression Convolutional Neural Networks

We will now look at two examples, one for regression and another for classification. For supervised computer vision, your dataset will need some labels. For classification, this label usually specifies the subject of the image. For regression, this "label" is some numeric quantity the image should produce, such as a count. We will look at two different means of providing this label.

The first example will show how to handle regression with convolutional neural networks. We will provide an image and expect the neural network to count items in that image. We will use a [dataset](https://www.kaggle.com/jeffheaton/count-the-paperclips) that I created that contains a random number of paperclips. Figure 5.CLIPS shows a sample from this dataset.


**Figure 5.CLIPS: Count the Paperclips**

![Count the Paperclips](https://data.heatonresearch.com/images/wustl/class/clips-25009.jpg "Count the Paperclips")

As you can see, each of the images contains a random number of randomly placed and sized paperclips.

The following code will download this dataset for you.


In [2]:
import os

URL = "https://github.com/jeffheaton/data-mirror/releases/"
DOWNLOAD_SOURCE = URL+"download/v1/paperclips.zip"
DOWNLOAD_NAME = DOWNLOAD_SOURCE[DOWNLOAD_SOURCE.rfind('/')+1:]

if COLAB:
  PATH = "/content"
else:
  # I used this locally on my machine; you will likely need a different path
  PATH = "/Users/jeff/temp"

EXTRACT_TARGET = os.path.join(PATH,"clips")
SOURCE = os.path.join(EXTRACT_TARGET, "paperclips")

Next, we download the images. This part depends on the origin of your images. The following code downloads images from a URL, where a ZIP file contains the images. The code unzips the ZIP file.

In [3]:
# HIDE OUTPUT
!wget -O {os.path.join(PATH,DOWNLOAD_NAME)} {DOWNLOAD_SOURCE}
!mkdir -p {SOURCE}
!mkdir -p {EXTRACT_TARGET}
!unzip -o -j -d {SOURCE} {os.path.join(PATH, DOWNLOAD_NAME)} >/dev/null


The labels are contained in a CSV file named **train.csv** for regression. This file has just two columns, **id** and **clip_count**. The ID specifies the filename; for example, row id 1 corresponds to the file **clips-1.jpg**. The following code loads the labels for the training set and creates a new column, named **filename**, that contains the filename of each image, based on the **id** column.

In [4]:
import pandas as pd

df = pd.read_csv(
    os.path.join(SOURCE,"train.csv"), 
    na_values=['NA', '?'])

df['filename']="clips-"+df["id"].astype(str)+".jpg"

This results in the following dataframe.

In [5]:
df

| | id | clip_count | filename |
|---|---|---|---|
| 0 | 30001 | 11 | clips-30001.jpg |
| 1 | 30002 | 2 | clips-30002.jpg |
| 2 | 30003 | 26 | clips-30003.jpg |
| 3 | 30004 | 41 | clips-30004.jpg |
| 4 | 30005 | 49 | clips-30005.jpg |
| ... | ... | ... | ... |
| 19995 | 49996 | 35 | clips-49996.jpg |
| 19996 | 49997 | 54 | clips-49997.jpg |
| 19997 | 49998 | 72 | clips-49998.jpg |
| 19998 | 49999 | 24 | clips-49999.jpg |


We separate the data into training and validation sets (for early stopping).

In [6]:
TRAIN_PCT = 0.9
TRAIN_CUT = int(len(df) * TRAIN_PCT)

df_train = df[0:TRAIN_CUT]
df_validate = df[TRAIN_CUT:]

print(f"Training size: {len(df_train)}")
print(f"Validate size: {len(df_validate)}")

Training size: 18000
Validate size: 2000


We are now ready to create a custom class named **ClipCountDataset** that extends the PyTorch **Dataset** class. A technique called augmentation creates additional training data by manipulating the source material, for example by flipping the images vertically and horizontally, and it can produce considerably stronger neural networks. The regression example here only resizes and normalizes the images; the classification example later in this part adds random flips, rotations, and crops. Module 5.4 will go deeper into the transformations you can perform.

This class will load the labels from a Pandas dataframe connected to our **train.csv** file. When we demonstrate classification, we will use the torchvision **ImageFolder** class, which loads the labels from the directory structure rather than a CSV.

In [7]:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from PIL import Image
import pandas as pd
import os
import tqdm
import torch.nn.functional as F
from torch.optim.lr_scheduler import ReduceLROnPlateau
from sklearn.model_selection import train_test_split

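# Note: this reloads the full train.csv, replacing the 90% split created above,
# so the rows in df_validate are also part of the training data.
df_train = pd.read_csv(os.path.join(PATH, "clips/paperclips/train.csv"))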
df_test = pd.read_csv(
    os.path.join(PATH, "clips/paperclips/test.csv"), na_values=["NA", "?"]
)
df_test["filename"] = "clips-" + df_test["id"].astype(str) + ".jpg"
df_train["clip_count"] = df_train["clip_count"].astype("float32")


class ClipCountDataset(Dataset):
    def __init__(self, dataframe, root_dir, transform=None):
        self.data = dataframe
        self.root_dir = root_dir
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        img_name = os.path.join(
            self.root_dir, "clips-" + str(self.data.iloc[idx, 0]) + ".jpg"
        )
        image = Image.open(img_name)
        clip_count = self.data.iloc[idx, 1]
        sample = {"image": image, "clip_count": clip_count}
        if self.transform:
            sample["image"] = self.transform(sample["image"])
        return sample

Next, we define a transformation chain that includes the changes we wish to make to the images. As you can see, we resize the images to 256x256 and normalize the RGB color components using a per-channel mean and standard deviation.

In [8]:
data_transform = transforms.Compose([
        transforms.Resize((256, 256)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])

train_dataset = ClipCountDataset(df_train, SOURCE, transform=data_transform)
val_dataset = ClipCountDataset(df_validate, SOURCE, transform=data_transform)
test_dataset = ClipCountDataset(df_test, SOURCE, transform=data_transform)

train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=32, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=1, shuffle=False)

The constants you see in transforms.Normalize are specific to the ImageNet dataset and the model training process. They are used to normalize the image tensor before passing it to a model trained on ImageNet.

Here's a breakdown of what they are:

* mean=[0.485, 0.456, 0.406]: This is the mean of the RGB channels of the ImageNet dataset. When models are trained on ImageNet, images are typically zero-centered by subtracting the mean value of each channel from the respective channel in the image. This helps the model converge faster during training.

* std=[0.229, 0.224, 0.225]: This is the standard deviation of the RGB channels of the ImageNet dataset. After zero-centering, each channel is typically divided by its standard deviation. This normalization process makes the values of each channel have a zero mean and a standard deviation of 1.

The normalization values (both mean and standard deviation) for ImageNet were computed from the dataset and are widely used in the deep learning community for models pre-trained on ImageNet.

When you apply this normalization to your input image, you are effectively putting it on the same scale as the images the model was trained on. This ensures that the model will process the image in a manner consistent with its training.

If you were working with a different model trained on a different dataset, you would use different mean and standard deviation values specific to that dataset.
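
To make the arithmetic concrete, here is a small plain-Python sketch of what the normalization does to a single pixel. The helper `normalize_pixel` is hypothetical and written only for illustration; it assumes the pixel values have already been scaled to the 0 to 1 range, as `transforms.ToTensor` does for ordinary 8-bit images.

```python
# ImageNet channel statistics, as passed to transforms.Normalize above.
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Apply (value - mean) / std to one RGB pixel with values in [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, mean, std)]


# A hypothetical mid-gray pixel.
print(normalize_pixel([0.5, 0.5, 0.5]))

# A pixel exactly at the ImageNet mean maps to zero in every channel.
print(normalize_pixel(mean))  # [0.0, 0.0, 0.0]
```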

We can now train the neural network. The code to build and train the neural network is not that different from the previous modules. We will use the PyTorch **Sequential** class to provide layers to the neural network. We now have several new layer types that we did not previously see.

* **Conv2d** - The convolution layers.
* **MaxPool2d** - The max-pooling layers.
* **Flatten** - Flatten the 2D (and higher) tensors to allow a Linear (dense) layer to process them.
* **Linear** - Dense layers, the same as demonstrated previously. Dense layers often form the final output layers of the neural network.

The training code is very similar to what we used previously. This code is for regression, so the final layer is linear with no activation function, and we use **MSELoss** (mean squared error) as the loss function. The **DataLoader** provides both the *x* and *y* values for each batch.

In [9]:
model = nn.Sequential(
    nn.Conv2d(3, 64, 3),  # 3 input channels, 64 output channels, 3x3 kernel
    nn.ReLU(),
    nn.MaxPool2d(2, 2),  # 2x2 pooling kernel with stride 2
    nn.Conv2d(64, 64, 3), # 64 input channels, 64 output channels, 3x3 kernel
    nn.ReLU(),
    nn.MaxPool2d(2, 2),  # 2x2 pooling kernel with stride 2
    nn.Flatten(),       # Flattening the tensor for the fully connected layers
    nn.Linear(64 * 62 * 62, 512), # 64 * 62 * 62 input features, 512 output features
    nn.ReLU(),
    nn.Linear(512, 1)    # 512 input features, 1 output feature
)
model = torch.compile(model,backend="aot_eager").to(device)


criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())
scheduler = ReduceLROnPlateau(optimizer, 'min')

EPOCHS = 1

print("Training")
for epoch in range(EPOCHS):
    running_loss = 0.0
    # Iterate lazily so the whole dataset is not held in memory at once.
    steps = enumerate(train_dataloader, 0)
    for i, data in tqdm.tqdm(steps, total=len(train_dataloader)):
        inputs, labels = data['image'].to(device).float(), data['clip_count'].to(device).float()
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs.view(-1), labels.view(-1))
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    scheduler.step(running_loss)
    print(f"Epoch {epoch}/{EPOCHS}, loss: {loss.item()}")

print('Finished Training')

Training


100%|██████████| 625/625 [06:28<00:00,  1.61it/s]

Epoch 0/1, loss: 13.497312545776367
Finished Training





This code will run very slowly if you do not use a GPU. In the run shown above, one epoch took about six and a half minutes on an Apple MPS device; your timing will vary with hardware.
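
The `64 * 62 * 62` input size of the first **Linear** layer follows from applying the convolution and pooling size formulas to each layer in turn. The sketch below traces the width through the regression model; the helper `out_size` is a hypothetical name used only for this illustration, and it rounds down the way PyTorch does.

```python
def out_size(w, f, s=1, p=0):
    """Output width of a convolution or pooling layer, rounding down."""
    return (w - f + 2 * p) // s + 1


size = 256                        # images are resized to 256x256
size = out_size(size, f=3)        # Conv2d(3, 64, 3)   -> 254
size = out_size(size, f=2, s=2)   # MaxPool2d(2, 2)    -> 127
size = out_size(size, f=3)        # Conv2d(64, 64, 3)  -> 125
size = out_size(size, f=2, s=2)   # MaxPool2d(2, 2)    -> 62

print(size)              # 62
print(64 * size * size)  # 246016 inputs to the first Linear layer
```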

## Score Regression Image Data

Scoring/predicting with a **DataLoader** is a bit different from training. We do not want augmented images, and we do not wish to have the dataset shuffled. For scoring, we want a prediction for each input. The test DataLoader defined earlier uses a batch size of 1, which keeps GPU memory use low even if the prediction set is large. You can increase this value for better performance, provided you collect every prediction in each batch. We can now generate a CSV file to hold the predictions.

In [10]:
# Testing
predictions = []
with torch.no_grad():
    for data in tqdm.tqdm(test_dataloader):
        images = data['image'].to(device)  # already a tensor; no need to re-wrap it
        outputs = model(images)
        predictions.append(outputs.item())

df_submit = pd.DataFrame({'id': df_test['id'], 'clip_count': predictions})
df_submit.to_csv("submit.csv", index=False)

100%|██████████| 5000/5000 [03:25<00:00, 24.29it/s]


## Classification Neural Networks

Just like earlier in this module, we will load data. However, this time we will use a dataset of images of three different types of the iris flower. This zip file contains three different directories that specify each image's label. The directories are named the same as the labels:

* iris-setosa
* iris-versicolour
* iris-virginica

This dataset contains images of actual flowers, such as Figure 5.IRIS.

**Figure 5.IRIS: Iris Flower**

![Iris Flower](https://s3.amazonaws.com/data.heatonresearch.com/images/wustl/class/iris-0c826b6f4648edf507e0cafdab53712bb6fd1f04dab453cee8db774a728dd640.jpg "Iris Flower")

We will begin by loading a local copy of the dataset, like we did earlier for the paperclips.


In [11]:
import os

URL = "https://github.com/jeffheaton/data-mirror/releases"
DOWNLOAD_SOURCE = URL+"/download/v1/iris-image.zip"
DOWNLOAD_NAME = DOWNLOAD_SOURCE[DOWNLOAD_SOURCE.rfind('/')+1:]

if COLAB:
  PATH = "/content"
  EXTRACT_TARGET = os.path.join(PATH,"iris")
  SOURCE = EXTRACT_TARGET # In this case it's the same, no subfolder
else:
  # I used this locally on my machine; you may need a different path
  PATH = "/Users/jeff/temp"
  EXTRACT_TARGET = os.path.join(PATH,"iris")
  SOURCE = EXTRACT_TARGET # In this case it's the same, no subfolder

Just as before, we unzip the images.

In [12]:
# HIDE OUTPUT
!wget -O {os.path.join(PATH,DOWNLOAD_NAME)} {DOWNLOAD_SOURCE}
!mkdir -p {SOURCE}
!mkdir -p {EXTRACT_TARGET}
!unzip -o -d {EXTRACT_TARGET} {os.path.join(PATH, DOWNLOAD_NAME)} >/dev/null


You can see these folders with the following command.

In [13]:
!ls {EXTRACT_TARGET}

iris-setosa      iris-versicolour iris-virginica


We set up a pipeline similar to the one before. We will transform the images; however, this time we also randomly flip, rotate, and crop the images to provide additional augmented data.

In [14]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score
import numpy as np

# Data augmentation and normalization for training
train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(360),
    transforms.RandomResizedCrop(256, scale=(0.5, 1.0)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

# Just normalization for validation
validation_transforms = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])

train_dataset = ImageFolder(root=SOURCE, transform=train_transforms)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

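# Note: this reads the same folder as the training set, so it is not a
# held-out split; only the transforms differ.
validation_dataset = ImageFolder(root=SOURCE, transform=validation_transforms)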
validation_loader = DataLoader(validation_dataset, batch_size=32, shuffle=False)

The neural network is similar for classification. However, the number of outputs is equal to the number of classes, or iris types, which is 3.

In [15]:
class_count = len(train_dataset.classes)

# Define the CNN architecture
model = nn.Sequential(
            # Features
            nn.Conv2d(3, 16, 3),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(16, 32, 3),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(32, 64, 3),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 64, 3),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 64, 3),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),

            # Classifier
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.Linear(64 * 6 * 6, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, class_count)).to(device)

optimizer = optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()
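
One way to confirm the `64 * 6 * 6` input size of the first **Linear** layer is to pass a dummy image through the convolution and pooling layers and inspect the flattened shape. The sketch below rebuilds only the feature layers of the model above; the variable names `features`, `dummy`, and `flat` are illustrative, and the dropout layers are omitted because they do not change tensor shapes.

```python
import torch
import torch.nn as nn

# Convolution and pooling layers matching the model above (dropout omitted).
features = nn.Sequential(
    nn.Conv2d(3, 16, 3), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(16, 32, 3), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(32, 64, 3), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(64, 64, 3), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Conv2d(64, 64, 3), nn.ReLU(), nn.MaxPool2d(2, 2),
    nn.Flatten(),
)

with torch.no_grad():
    dummy = torch.zeros(1, 3, 256, 256)  # one blank 256x256 RGB image
    flat = features(dummy)

print(flat.shape)  # torch.Size([1, 2304]), and 2304 = 64 * 6 * 6
```

If you change the input resolution or add layers, rerunning a probe like this is usually quicker than recomputing the size by hand.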



We will create a function named **validate** that will loop over the provided validation data and apply the loss function to calculate a score. We simply take the average of the per-batch losses to compute the overall score.

In [16]:
def validate(model, loader):
    loss_num = loss_denom = 0.0
    model.eval()
    with torch.no_grad():
        for inputs, labels in loader:
            outputs = model(inputs.to(device))
            loss_num += loss_fn(outputs, labels.to(device)).item()  # accumulate a Python float
            loss_denom += 1

    return loss_num / loss_denom


Training works similarly to regression; we loop over the training batches and update the weights.

In [17]:
# Not used below: the DataLoaders above were already created with batch_size=32
BATCH_SIZE = 16

es = EarlyStopping()

epoch = 0
done = False
while epoch < 1000 and not done:
    epoch += 1
    steps = list(enumerate(train_loader))
    pbar = tqdm.tqdm(steps)
    model.train()
    for i, (x_batch, y_batch) in pbar:
        y_batch_pred = model(x_batch.to(device))
        loss = loss_fn(y_batch_pred, y_batch.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        loss, current = loss.item(), (i + 1) * len(x_batch)
        if i == len(steps) - 1:
            model.eval()
            vloss = validate(model, validation_loader)
            if es(model, vloss):
                done = True
            pbar.set_description(
                f"Epoch: {epoch}, tloss: {loss}, vloss: {vloss:>7f}, {es.status}"
            )
        else:
            pbar.set_description(f"Epoch: {epoch}, tloss {loss:}")

Epoch: 1, tloss: 0.8976501226425171, vloss: 0.991944, : 100%|██████████| 14/14 [00:04<00:00,  3.34it/s]
Epoch: 2, tloss: 0.6614288091659546, vloss: 0.953182, Improvement found, counter reset to 0: 100%|██████████| 14/14 [00:01<00:00,  7.57it/s]
Epoch: 3, tloss: 0.6952608823776245, vloss: 0.956685, No improvement in the last 1 epochs: 100%|██████████| 14/14 [00:01<00:00,  7.84it/s]
Epoch: 4, tloss: 0.5599308609962463, vloss: 0.914633, Improvement found, counter reset to 0: 100%|██████████| 14/14 [00:01<00:00,  7.79it/s]
Epoch: 5, tloss: 1.548996925354004, vloss: 0.960487, No improvement in the last 1 epochs: 100%|██████████| 14/14 [00:01<00:00,  7.69it/s]
Epoch: 6, tloss: 0.9194130897521973, vloss: 0.917065, No improvement in the last 2 epochs: 100%|██████████| 14/14 [00:01<00:00,  7.64it/s]
Epoch: 7, tloss: 1.0552207231521606, vloss: 0.952847, No improvement in the last 3 epochs: 100%|██████████| 14/14 [00:01<00:00,  7.53it/s]
Epoch: 8, tloss: 0.6120624542236328, vloss: 0.950160, No im

The iris image dataset is not easy to predict; it turns out that a tabular dataset of measurements is more manageable. However, we can achieve an accuracy of about 64%.

In [18]:
# Validation and accuracy calculation
model.eval()
preds = []
targets = []
with torch.no_grad():
    for inputs, labels in validation_loader:
        outputs = model(inputs.to(device))
        _, predictions = torch.max(outputs, 1)
        preds.extend(predictions.cpu().numpy())
        targets.extend(labels.cpu().numpy())

correct = accuracy_score(targets, preds)
print(f"Accuracy: {correct}")

Accuracy: 0.6389548693586699
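
The accuracy code relies on `torch.max(outputs, 1)`, which returns both the largest value in each row and the index where it occurs; the index is the predicted class. The following minimal sketch uses made-up logits and labels purely for illustration.

```python
import torch

# Hypothetical logits for a batch of two images and three classes.
outputs = torch.tensor([[2.0, 0.5, -1.0],
                        [0.1, 0.2, 3.0]])

values, predictions = torch.max(outputs, 1)
print(predictions.tolist())  # [0, 2]

# Compare against hypothetical true labels to get the fraction correct.
labels = torch.tensor([0, 1])
accuracy = (predictions == labels).float().mean().item()
print(accuracy)  # 0.5
```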



# Other Resources

* [ImageNet: Large Scale Visual Recognition Challenge 2014](http://image-net.org/challenges/LSVRC/2014/index)
* [Andrej Karpathy](http://cs.stanford.edu/people/karpathy/) - PhD student/instructor at Stanford.
* [CS231n Convolutional Neural Networks for Visual Recognition](http://cs231n.stanford.edu/) - Stanford course on computer vision/CNN's.
* [CS231n - GitHub](http://cs231n.github.io/)
* [ConvNetJS](http://cs.stanford.edu/people/karpathy/convnetjs/) - JavaScript library for deep learning.