<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/master/Class_06_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

**Module 6: Convolutional Neural Networks (CNN) for Computer Vision**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)

### Module 6 Material

* Part 6.1: Image Processing in Python
* **Part 6.2: Using Convolutional Neural Networks** 
* Part 6.3: Using Pretrained Neural Networks with Keras 
* Part 6.4: Looking at Keras Generators and Image Au


## **WARNING--WARNING--WARNING**

`Class_06_2` is the first lesson in our course in which we analyze **large** image datasets using a new type of neural network called a Convolutional Neural Network (CNN). 

Training a CNN on a large image dataset requires a considerable amount of computational power. If your laptop does not have GPU support for Tensorflow, you will most likely want to run this lesson on Google COLAB. Otherwise, it might take many hours (days?) to run the examples in this lesson. 

## Using Google COLAB

In order to run this lesson on Google COLAB, you will already need to have a [Google Drive account](https://support.google.com/a/users/answer/13022292?hl=en).

When you open this lesson in COLAB, you must immediately change the `RUNTIME` environment to take advantage of GPU **_before_** you start your lesson. While you can change your `RUNTIME` environment after you start, you will have to restart from the beginning. 

When you open up this lesson in COLAB, you should select `RUNTIME -> Change runtime type` 

![___](https://biologicslab.co/BIO1173/images/class_06/class_06_2_COLAB1.png)



This will open the following popup menu:

![___](https://biologicslab.co/BIO1173/images/class_06/class_06_2_COLAB2.png)


Select `T4 GPU` and then `Save`. 

Make sure to save a copy of your lesson to your Google Drive. And don't forget, if you stop working on your lesson, COLAB will "kick you off" after some period of inactivity.

So if you decide you want to run this lesson on Google COLAB, click on this button:

<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/master/Class_06_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Google CoLab Instructions

The following code checks whether you are running on Google CoLab. If you are, it maps your GDrive to `/content/drive` and selects TensorFlow 2.x.

In [2]:
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    COLAB = True
    print("Note: using Google CoLab")
    %tensorflow_version 2.x
except:
    print("Note: not using Google CoLab")
    COLAB = False

Note: not using Google CoLab


### Lesson Setup

Run the next code cell to load necessary packages

In [3]:
# You MUST run this code cell first
import os
import shutil

import numpy as np
import pandas as pd
import tensorflow as tf

path = '/'
memory = shutil.disk_usage(path)
LESSON_DIRECTORY = os.getcwd()
print("Your LESSON_DIRECTORY is: " + LESSON_DIRECTORY)
print("Disk", memory)
print("Tensorflow version =", (tf.__version__))
print("Available GPU acceleration =", tf.test.gpu_device_name())

Your LESSON_DIRECTORY is: C:\Users\David\BIO1173\Class_06_2
Disk usage(total=4000108531712, used=993676402688, free=3006432129024)
Tensorflow version = 2.10.0
Available GPU acceleration = /device:GPU:0


### Install new packages

You will need to install the next two packages for this lesson if you are NOT using COLAB. 

You only need to install a package once, so you can comment out these cells after you have run them.

In [9]:
if not COLAB:
    !pip install wget



In [10]:
if not COLAB:
    !pip install patool



### Which Operating System?

In this lesson we will need to have your JupyterLab notebook create new file folders on your computer or laptop to store large image datasets. Image datasets can be **HUGE** and they can consume a large part of your available disk space, especially on a laptop. To help avoid problems, we will put all of your large datasets in a special **_temporary_** folder called `/temp`. That way, you only need to go to **one place** to delete them when you need more disk space. 

One complication is that some students in this course will be using Google COLAB, some will be using Windows and some will be using MacOS. Each of these **_operating systems_** (`os`) handles files and folders somewhat differently. For example, my LESSON_DIRECTORY on my Windows machine is at:
~~~text
Your LESSON_DIRECTORY is: C:\Users\David\BIO1173\Class_06_2
~~~
while on my MacBook it's:
~~~
Your LESSON_DIRECTORY is: /home/david/BIO1173/Class_06_2
~~~
You should take special note of the **_opposite_ direction** of the slashes between folders and files. If the slashes are pointing in the wrong direction for your machine, the lesson won't work!

The code in the cell below uses the following system command `os.name` to see if you are running Windows:
~~~text
WINDOWS = False
if os.name == "nt":
    WINDOWS = True
~~~
If `os.name` is equal to the letters `nt`, you are running a version of Microsoft Windows. The word `WINDOWS`, all in caps, is what this course calls an ENVIRONMENTAL VARIABLE (strictly speaking, it is an ordinary Python variable used as a flag, not an operating-system environment variable). If it is set to `True`, it tells Python that you are running Windows. If it is set to `False`, Python knows that you are running MacOS or working on COLAB. 

But why `nt`? Microsoft introduced Windows NT on July 27, 1993. The initials `NT` stood for _New Technology_. The NT technology lives on today in Microsoft Windows 10 and 11, and in the value of `os.name`.

In [6]:
# Detect Operating System

# Start with WINDOWS flag set to False
WINDOWS = False

if os.name == "nt":
    WINDOWS = True
    print("\nNote: Jupyterlab is running on a WINDOWS computer")
else:
    print("\nNote: Jupyterlab is not running on a WINDOWS computer")


Note: Jupyterlab is running on a WINDOWS computer
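
The slash difference described above can be seen without switching machines. The following is a minimal sketch, assuming the example folder names used in this lesson; it relies on the standard-library classes `PureWindowsPath` and `PurePosixPath`, which only manipulate text and therefore behave the same on any operating system.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Folder names taken from the examples in this lesson (illustrative only)
parts = ["BIO1173", "Class_06_2"]

# Build the same path using Windows rules and using MacOS/Linux rules
windows_path = PureWindowsPath("C:/Users/David").joinpath(*parts)
posix_path = PurePosixPath("/home/david").joinpath(*parts)

print(windows_path)  # folders separated by backslashes
print(posix_path)    # folders separated by forward slashes
```

Letting Python assemble the path, instead of typing the slashes by hand, is one way to avoid the wrong-direction problem.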


### Create a `/temp` folder

Neural networks require a lot of file storage space. If you are working on Google COLAB, this shouldn't be too much of an issue. On the other hand, if you are working on your laptop computer, and your hard drive is nearly full, this could be a problem. 

As part of this lesson, a new folder will be created called `/temp`. This folder will be used to store large image datasets for the remainder of this course. Temporary folders, like `/temp`, are often used in computer programming to store **_temporary files_**. In general, you should be able to delete **_all_** the folder(s) and file(s) in a temporary folder without causing any problems. 

Exactly where your `/temp` folder will be located in your filesystem is somewhat hard to predict. If you have been following the instructions given to you at the start of this course, `/temp` should be created in your course folder `/BIO1173`. On my Windows machine the location of my temporary folder is:
~~~text
 C:\Users\David\BIO1173\temp
~~~
On my MacBook the folder is located at:
~~~text
 /home/users/david/BIO1173/temp
~~~
When you run the code cell below, it will create your temporary folder **_one level above_** the directory you were in when you started this lesson (i.e. in the parent folder of LESSON_DIRECTORY). It will also print out a log of _diagnostic_ info that can be used later to help troubleshoot any potential problems.

In [7]:
# System commands to create a temporary folder called /temp

# Change to LESSON_DIRECTORY
os.chdir(LESSON_DIRECTORY)

# Different commands for different Operating Systems 
if COLAB:
    print("Note: Using COLAB, no /temp folder is needed")
elif WINDOWS: # Windows machine
    os.chdir("..\\")  # move up one level in directory 
    BASE_DIR = os.getcwd() # directory for temp folder
    print("Your temporary folder will be in: " + BASE_DIR)
    try:
        os.mkdir(".\\temp") 
        print(f"Making \\temp in {BASE_DIR}")
    except FileExistsError:
        print(f"Note: \\temp already present in {BASE_DIR}")
    # Change back to LESSON_DIRECTORY
    os.chdir(LESSON_DIRECTORY)
else: # MacOS machine
    os.chdir("../") # move up one level in directory 
    BASE_DIR = os.getcwd()
    print("Your temporary folder will be in: " + BASE_DIR)
    try:
        os.mkdir("./temp")
        print(f"Making /temp in {BASE_DIR}")
    except FileExistsError:
        print(f"Note: /temp already present in {BASE_DIR}")
    # Change back to LESSON_DIRECTORY
    os.chdir(LESSON_DIRECTORY)
    
#print("Your current working directory is : " + os.getcwd())

Your temporary folder will be in: C:\Users\David\BIO1173
Note: \temp already present in C:\Users\David\BIO1173


Here is the output the first time I ran the cell above, before a temporary folder had been created:
~~~text
Your temporary folder will be in: C:\Users\David\BIO1173
Making \temp in C:\Users\David\BIO1173
~~~
Here is the output when you run the cell after the temporary folder has already been created:
~~~text
Your temporary folder will be in: C:\Users\David\BIO1173
Note: \temp already present in C:\Users\David\BIO1173
~~~
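
The same folder-creation step can also be sketched without separate Windows and MacOS branches. This is only an illustration, not the lesson's own cell: the helper name `make_temp_folder` is hypothetical, and a throwaway directory stands in for your course folder so that nothing is written to your real `/BIO1173` folder.

```python
import os
import tempfile

def make_temp_folder(base_dir):
    """Create <base_dir>/temp if it is missing and return its full path."""
    temp_dir = os.path.join(base_dir, "temp")  # os.path.join picks the right slash
    os.makedirs(temp_dir, exist_ok=True)       # no error if the folder already exists
    return temp_dir

# Demonstration in a throwaway directory (a stand-in for the course folder)
with tempfile.TemporaryDirectory() as demo_base:
    first = make_temp_folder(demo_base)
    second = make_temp_folder(demo_base)       # running it again is harmless
    folder_exists = os.path.isdir(first)
    print(first == second, folder_exists)
```

Because `exist_ok=True` makes the call safe to repeat, no `try`/`except` block is needed to handle the "already present" case.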


### Define functions

The cell below creates the function(s) needed for this lesson.

In [8]:
# Simple function to format elapsed time as h:mm:ss.ss
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)
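
A typical way to use `hms_string()` is to record the time before and after a long-running step. The sketch below repeats the helper so that it runs on its own; the summation is just a stand-in for a training run, and the variable names are illustrative.

```python
import time

# Copy of the lesson's helper, repeated so this snippet is self-contained
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

start_time = time.time()
total = sum(range(1_000_000))          # stand-in for a long training run
elapsed = time.time() - start_time
print("Elapsed time:", hms_string(elapsed))

# A fixed value makes the format easy to inspect: 1 hour, 1 minute, 1.5 seconds
example = hms_string(3661.5)
print(example)
```
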

# Part 6.2: Keras Neural Networks for Digits and Medical MNIST

This module will focus on computer vision. There are some important differences and similarities with previous neural networks.

* We will usually use classification, though regression is still an option.
* The input to the neural network is now 3D (height, width, _and_ color)
* Data are not transformed; no more Z-scores or dummy variables.
* Processing time is **_much_** longer.
* We now have different layer types. Besides dense layers, we now have convolution layers, and max-pooling layers.
* Data will no longer arrive as tabular data stored in CSV files, but as hundreds or even thousands of **_images_**. We will take advantage of TensorFlow utilities for "flowing" images from folders directly to the input for a neural network.


## Common Computer Vision Data Sets

There are many data sets for computer vision. Two of the most popular classic datasets are the MNIST digits data set and the CIFAR image data sets. We will not use either of these datasets in this lesson, but it is important to be familiar with them since neural network texts often refer to them.

The [MNIST Digits Data Set](http://yann.lecun.com/exdb/mnist/) is very popular in the neural network research community. You can see a sample of it in Figure 6.MNIST.

**Figure 6.MNIST: MNIST Data Set**
![MNIST Data Set](https://biologicslab.co/BIO1173/images/class_8_mnist.png "MNIST Data Set")

The MNIST Digit Data Set is a large database of handwritten digits that is commonly used for training various image processing systems. It was created by Yann LeCun, Corinna Cortes, and Christopher Burges as a benchmark for evaluating machine learning algorithms in the field of computer vision. The dataset was first released in 1998 and consists of 60,000 training images and 10,000 testing images of handwritten digits from 0 to 9.

The MNIST dataset has been widely used in the research community to develop and test classification algorithms, particularly in the field of deep learning. It has become a standard benchmark for evaluating the performance of machine learning models on image recognition tasks. Despite its simplicity, the MNIST dataset remains popular due to its ease of use and ability to quickly assess the effectiveness of new algorithms.

Over the years, the MNIST dataset has been used in numerous research studies and competitions, leading to the development of more advanced techniques in computer vision. It continues to be a valuable resource for researchers and practitioners in the field of machine learning.

[MedMNIST Data Sets](https://medmnist.com/) are a collection of 18 standardized biomedical datasets produced by a consortium of researchers at Harvard University and collaborators in Germany and China. The image sets cover a variety of medical tissues and cell types, including Chest X-Rays, Colon Pathology, Breast Ultrasound, Blood Cytology and Abdominal CT scans. The `RetinaMNIST` dataset has 1,600 fundus camera samples (1,080 training, 120 validation, 400 test).

**Figure 6.MedMNIST: RetinaMNIST Data Set**
![RetinaMNIST](https://biologicslab.co/BIO1173/images/class_06/RetinaMNIST.jpg "RetinaMNIST")

The [CIFAR-10 and CIFAR-100](https://www.cs.toronto.edu/~kriz/cifar.html) datasets are also frequently used by the neural network research community.

**Figure 6.CIFAR: CIFAR Data Set**
![CIFAR Data Set](https://biologicslab.co/BIO1173/images/class_8_cifar.png "CIFAR Data Set")

The CIFAR-10 data set contains low-resolution images that are divided into 10 classes. The CIFAR-100 data set contains 100 classes in a hierarchy. 

## Convolutional Neural Networks (CNNs)

The convolutional neural network (CNN) is a neural network technology that has profoundly impacted the area of computer vision (CV). Fukushima (1980) [[Cite:fukushima1980neocognitron]](https://www.rctn.org/bruno/public/papers/Fukushima1980.pdf) introduced the original concept of a convolutional neural network, and LeCun, Bottou, Bengio & Haffner (1998) [[Cite:lecun1995convolutional]](http://yann.lecun.com/exdb/publis/pdf/lecun-bengio-95a.pdf) greatly improved this work. From this research, Yann LeCun introduced the famous LeNet-5 neural network architecture. This chapter follows the **_LeNet-5 style_** of convolutional neural network. Although computer vision primarily uses CNNs, this technology has some applications outside of the field. You need to realize that if you want to utilize CNNs on non-visual data, you must find a way to encode your data to mimic the properties of visual data.  

For a CNN, the order of the input array elements is _crucial_ to the training. In contrast, most neural networks that are not CNNs treat their input data as a long vector of values. The order in which you arrange the incoming features in this vector is irrelevant, as long as it stays fixed: you **can't** change the order of the data in these vectors once your network has been trained. 

On the other hand, the CNN network arranges the inputs into a **_grid_**. This arrangement works well with images because the pixels in closer proximity to each other are important to each other. The order of pixels in an image is significant. The human body is a relevant example of this type of order. For the design of the face, we are accustomed to eyes being near to each other. 

This advance in CNNs is due to years of research on biological eyes. In other words, CNNs utilize overlapping fields of input to simulate features of biological eyes. Until this breakthrough, AI had been unable to reproduce the capabilities of biological vision.

Scale, rotation, and noise have presented challenges for AI computer vision research. You can observe the complexity of biological eyes in the example that follows. 

A friend raises a sheet of paper with a large number written on it. As your friend moves nearer to you, the number is still identifiable. In the same way, you can still identify the number when your friend rotates the paper. Lastly, your friend creates noise by drawing lines on the page, but you can still identify the number. 

As you can see, these examples demonstrate the high function of the biological eye and allow you to understand better the research breakthrough of CNNs. That is, this neural network can process scale, rotation, and noise in the field of computer vision. You can see this network structure in Figure 6.LENET.

**Figure 6.LENET: A LeNET-5 Network (LeCun, 1998)**
![A LeNET-5 Network](https://biologicslab.co/BIO1173/images/class_8_lenet5.png "A LeNET-5 Network")

So far, we have only seen one layer type (dense layers). By the end of this course you will also know about:
  
* **Convolution Layers** - Used to scan across images. 
* **Max Pooling Layers** - Used to downsample images. 
* **Dropout Layers** - Used to add regularization. 
* **LSTM and Transformer Layers** - Used for time series data.


## Convolution Layers

The first layer that we will examine is the convolutional layer. We will begin by looking at the hyper-parameters that you must specify for a convolutional layer in most neural network frameworks that support the CNN:

* Number of filters
* Filter Size
* Stride
* Padding
* Activation Function/Non-Linearity

The primary purpose of a convolutional layer is to detect features such as edges, lines, blobs of color, and other visual elements. The filters can detect these features. The more filters we give to a convolutional layer, the more features it can see.

A filter is a square-shaped object that scans over the image. A grid can represent the individual pixels of an image. You can think of the convolutional filter as a smaller grid that sweeps left to right over each image row. There is also a hyperparameter that specifies both the width and height of the square-shaped filter. The following figure shows this configuration in which you see the six convolutional filters sweeping over the image grid:

A convolutional layer has weights between it and the previous layer or image grid. Each pixel on each convolutional layer is a weight. Therefore, the number of weights between a convolutional layer and its predecessor layer or image field is the following:

```
[FilterSize] * [FilterSize] * [# of Filters]
```

For example, if the filter size were 5 (5x5) for 10 filters, there would be 250 weights.
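
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: `conv_weight_count` follows the formula in the text exactly, while `conv_param_count` is a hypothetical helper that adds two things the simple formula leaves out and that frameworks such as Keras typically also count, namely one kernel slice per input channel and one bias per filter.

```python
def conv_weight_count(filter_size, num_filters):
    """Weights per the formula above: FilterSize * FilterSize * # of Filters."""
    return filter_size * filter_size * num_filters

def conv_param_count(filter_size, num_filters, in_channels=1, use_bias=True):
    """Assumed fuller count: kernel slices for every input channel plus biases."""
    weights = filter_size * filter_size * in_channels * num_filters
    return weights + (num_filters if use_bias else 0)

simple = conv_weight_count(5, 10)                 # the 5x5, 10-filter example
fuller = conv_param_count(5, 10, in_channels=3)   # same filters on an RGB input
print(simple, fuller)
```

The two numbers differ because the simple formula treats the input as a single-channel grid with no bias terms.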

You need to understand how the convolutional filters sweep across the previous layer's output or image grid. Figure 6.CNN illustrates the sweep:

**Figure 6.CNN: Convolutional Neural Network**
![Convolutional Neural Network](https://biologicslab.co/BIO1173/images/class_8_cnn_grid.png "Convolutional Neural Network")

The above figure shows a convolutional filter with a size of 4 and a padding size of 1. The **_padding size_** is responsible for the border of zeros in the area that the filter sweeps. Even though the image is 8x7, the extra padding provides a virtual image size of 9x8 for the filter to sweep across. The **_stride_** specifies the number of positions at which the convolutional filters will stop. The convolutional filters move to the right, advancing by the number of cells specified in the stride. Once a filter reaches the far right, it moves back to the far left; then, it moves down by the stride amount and continues to the right again.

Some constraints exist concerning the size of the stride. The stride cannot be `0`. The convolutional filter would never move if you set the stride to `0`. Furthermore, neither the stride nor the convolutional filter size can be larger than the previous grid. There are additional constraints on the stride (*s*), padding (*p*), and the filter width (*f*) for an image of width (*w*). Specifically, the convolutional filter must be able to start at the far left or top border, move a certain number of strides, and land on the far right or bottom border. The following equation shows the number of steps a convolutional operator must take to cross the image:

$$ steps = \frac{w - f + 2p}{s}+1 $$

The number of steps must be an integer. In other words, it cannot have decimal places. You can adjust the padding (*p*) so that this equation yields an integer value.
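
The step equation and its integer constraint can be sketched as a small function. The name `conv_steps` and the example sizes are illustrative assumptions, not part of any library; the function returns `None` when the filter cannot land exactly on the far border.

```python
def conv_steps(w, f, p, s):
    """Steps to cross an image of width w with filter f, padding p, stride s.

    Returns None when (w - f + 2p) is not evenly divisible by s.
    """
    if s <= 0:
        raise ValueError("stride must be at least 1")
    span = w - f + 2 * p
    if span < 0 or span % s != 0:
        return None
    return span // s + 1

print(conv_steps(10, 4, 0, 3))   # the filter fits evenly
print(conv_steps(10, 3, 0, 3))   # does not fit: (10 - 3) is not divisible by 3
print(conv_steps(10, 3, 1, 3))   # adding padding of 1 makes it fit
```

The last two calls show how changing only the padding can turn an invalid configuration into a valid one.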


## Max Pooling Layers

Max-pool layers downsample a 3D box to a new one with smaller dimensions. Typically, you place a max-pool layer immediately after a convolutional layer. The LeNet-5 figure shows a max-pool layer immediately after layers C1 and C3. These max-pool layers progressively decrease the dimensions of the 3D boxes passing through them. This technique can help avoid overfitting (Krizhevsky, Sutskever & Hinton, 2012).

A pooling layer has the following hyper-parameters:

* Spatial Extent (*f*)
* Stride (*s*)

Unlike convolutional layers, max-pool layers typically do not use padding. Additionally, max-pool layers have no weights, so training does not affect them. These layers downsample their 3D box input. The 3D box output by a max-pool layer will have a width equal to this equation:

$$ w_2 = \frac{w_1 - f}{s} + 1 $$

The height of the 3D box produced by the max-pool layer is calculated similarly with this equation:

$$ h_2 = \frac{h_1 - f}{s} + 1 $$

The depth of the 3D box produced by the max-pool layer is equal to the depth of the 3D box received as input. The most common setting for the hyper-parameters of a max-pool layer is f=2 and s=2. The spatial extent (f) specifies that boxes of 2x2 will be scaled down to single pixels. Of these four pixels, the pixel with the maximum value will represent the 2x2 block in the new grid. Because squares of four pixels are replaced with a single pixel, 75% of the pixel information is lost. The following figure shows this transformation as a 6x6 grid becomes a 3x3:

**Figure 6.MAXPOOL: Max Pooling Layer**
![Max Pooling Layer](https://biologicslab.co/BIO1173/images/class_8_conv_maxpool.png "Max Pooling Layer")

Of course, the above diagram shows each pixel as a single number. A grayscale image would have this characteristic. For an RGB image, each pixel has three numbers; max-pool layers in frameworks such as Keras pool each color channel separately, which is why the depth of the output equals the depth of the input.
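
The pooling equations for the output width and height can be tried out on a single-channel grid in plain Python. This is a minimal sketch rather than how a deep learning framework implements pooling; the function name `max_pool_2d` and the grid values are made up for illustration.

```python
def max_pool_2d(grid, f=2, s=2):
    """Max-pool a 2D list of numbers with spatial extent f and stride s."""
    h1, w1 = len(grid), len(grid[0])
    h2 = (h1 - f) // s + 1          # output height, as in the equation for h_2
    w2 = (w1 - f) // s + 1          # output width, as in the equation for w_2
    pooled = []
    for i in range(h2):
        row = []
        for j in range(w2):
            # Collect the f x f window whose top-left corner is (i*s, j*s)
            window = [grid[i * s + di][j * s + dj]
                      for di in range(f) for dj in range(f)]
            row.append(max(window))
        pooled.append(row)
    return pooled

# A 6x6 grid filled with the numbers 0..35 (illustrative values only)
grid = [[r * 6 + c for c in range(6)] for r in range(6)]
pooled = max_pool_2d(grid)
for row in pooled:
    print(row)
```

With f=2 and s=2 the 6x6 grid shrinks to 3x3, so only 9 of the original 36 values survive.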

----------------------------------------
## Supervised _vs_ Unsupervised Machine Learning

In **_supervised machine learning_**, the algorithm is trained on a _labeled_ dataset, where each training example is paired with the correct output. The goal is to learn a mapping from input features to the corresponding output labels. During training, the algorithm adjusts its parameters to minimize the difference between the predicted output and the true label. Once the model is trained, it can make predictions on new, unseen data by applying the learned mapping. Common supervised learning tasks include classification and regression.

On the other hand, **_unsupervised machine learning_** involves training the algorithm on an _unlabeled_ dataset, where the algorithm must find patterns or relationships in the data without explicit guidance. The goal of unsupervised learning is to discover hidden structures or clusters in the data. This type of learning is often used for tasks such as clustering, anomaly detection, and dimensionality reduction. Unlike supervised learning, there are no explicit output labels to guide the learning process in unsupervised learning.

---------------------------------------------

## **Regression Convolutional Neural Networks**

We will now look at two examples, one for regression and another for classification. For **_supervised_** computer vision, your dataset will need some labels. For classification, this label usually specifies _what_ the image is a picture of, e.g., dog, cat, carcinoma, etc. For regression, this "label" is some _numeric_ quantity the image should produce, such as a count, e.g. how many white blood cells. We will look at two different means of providing this label.

The first example will show how to handle regression with convolutional neural networks. We will provide an image and expect the neural network to count items in that image. We will use a [dataset](https://www.kaggle.com/jeffheaton/count-the-paperclips) created by [Jeff Heaton](https://www.heatonresearch.com/) that contains a random number of paperclips. 

Here are three sample images from the 25,000 images in the paperclips dataset:

![Paperclips Sample](https://biologicslab.co/BIO1173/images/class_06/class_06_2_paperclips.png "Paperclips Sample")

Our goal will be to create a convolutional neural network (CNN) that can _count_ the number of paperclips in an image. To put this in a more ecological or biomedical context, a similar neural network could also be trained to count the number of giant Saguaro cacti (_Carnegiea gigantea_) in an image of the Sonoran Desert, or the number of leucocytes in a blood smear from a patient with symptoms of AML (Acute myeloid leukemia).  


### Setup ENVIRONMENTAL VARIABLES

The code in the cell below creates the ENVIRONMENTAL VARIABLES that are needed to download a zip file from the course HTTPS server, save it to your `/temp` folder and then extract (unzip) it into a new folder called `/clips`. As above, different ENVIRONMENTAL VARIABLES are needed for the different Operating Systems.  

In [None]:
# Set ENVIRONMENTAL VARIABLES

URL = "https://biologicslab.co/BIO1173/data/"
DOWNLOAD_SOURCE = URL+"paperclips.zip"
DOWNLOAD_FILE = DOWNLOAD_SOURCE[DOWNLOAD_SOURCE.rfind('/')+1:]
print("DOWNLOAD_SOURCE=",DOWNLOAD_SOURCE)
print("DOWNLOAD_FILE=",DOWNLOAD_FILE)

if COLAB:
    PATH = "/content"
    EXTRACT_TARGET = os.path.join(PATH,"clips")
    SOURCE = os.path.join(EXTRACT_TARGET, "paperclips")
    print("Note: Using COLAB")
elif WINDOWS:
    PATH=LESSON_DIRECTORY
    print("PATH=",PATH)
    EXTRACT_FOLDER_IN=BASE_DIR+"\\temp\\"
    print("EXTRACT_FOLDER_IN=",EXTRACT_FOLDER_IN)
    EXTRACT_FOLDER_OUT=EXTRACT_FOLDER_IN+"clips\\"
    print("EXTRACT_FOLDER_OUT=",EXTRACT_FOLDER_OUT)
    SOURCE = os.path.join(EXTRACT_FOLDER_IN, "paperclips.zip")
    print("Note: Using WINDOWS")
    print("SOURCE=",SOURCE)
    print("DOWNLOAD_FILE=",DOWNLOAD_FILE)
else:
    PATH=LESSON_DIRECTORY
    print("PATH=",PATH)
    EXTRACT_FOLDER_IN=BASE_DIR+"/temp/"
    print("EXTRACT_FOLDER_IN=",EXTRACT_FOLDER_IN)
    EXTRACT_FOLDER_OUT=EXTRACT_FOLDER_IN+"clips/"
    print("EXTRACT_FOLDER_OUT=",EXTRACT_FOLDER_OUT)
    SOURCE = os.path.join(EXTRACT_FOLDER_IN, "paperclips.zip")
    print("Note: Using OTHER (MacOS?)")
    print("SOURCE=",SOURCE)
    print("DOWNLOAD_FILE=",DOWNLOAD_FILE)

This is the output from the cell when run on a Windows machine:
~~~text
DOWNLOAD_SOURCE= https://biologicslab.co/BIO1173/data/paperclips.zip
DOWNLOAD_FILE= paperclips.zip
PATH= C:\Users\David\BIO1173\Class_06_2
EXTRACT_FOLDER_IN= C:\Users\David\BIO1173\temp\
EXTRACT_FOLDER_OUT= C:\Users\David\BIO1173\temp\clips\
Note: Using WINDOWS
SOURCE= C:\Users\David\BIO1173\temp\paperclips.zip
DOWNLOAD_FILE= paperclips.zip
~~~
These values are printed out so that you can troubleshoot any path problems.  
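
For comparison, the file name and folder paths can also be derived with standard-library helpers instead of string slicing and hand-typed slashes. This is a sketch under stated assumptions: `base_dir` is a hypothetical stand-in for `BASE_DIR`, and the lowercase variable names are illustrative rather than the ones used by the lesson's cells.

```python
import os
import posixpath
from urllib.parse import urlparse

DOWNLOAD_SOURCE = "https://biologicslab.co/BIO1173/data/paperclips.zip"

# URLs always use forward slashes, so posixpath is the right tool for them
download_file = posixpath.basename(urlparse(DOWNLOAD_SOURCE).path)

# Hypothetical base folder; in the lesson this role is played by BASE_DIR
base_dir = "BIO1173"
extract_folder_in = os.path.join(base_dir, "temp")
extract_folder_out = os.path.join(extract_folder_in, "clips")
source = os.path.join(extract_folder_in, download_file)

print(download_file)
print(source)   # separators match whatever machine runs this code
```

Because `os.path.join` inserts the correct separator for the current machine, the same lines would serve both the Windows and the MacOS branch.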

### Download and Extract the image data

Now that the ENVIRONMENTAL VARIABLES have been defined, we can go ahead and download the datafile. The code in the cell below uses a program called `wget` to download the datafile `paperclips.zip` from the course HTTPS server, `https://biologicslab.co/BIO1173/data`. The code chunk for downloading the zip file on a WINDOWS machine is:
~~~text
!python -m wget {DOWNLOAD_SOURCE} -o {SOURCE}
~~~
The code places the zip file in the `/temp` folder and creates a new folder inside of `/temp` called `/clips`. 

In the next step, the Python package `patool` (imported as `patoolib`) extracts the zip file:
~~~text
patoolib.extract_archive(SOURCE,outdir=EXTRACT_FOLDER_OUT)
~~~


In [1]:
# Download and extract the image data

if COLAB:
    # Download the zip file and unzip it (without subfolders) into SOURCE
    !wget -O {os.path.join(PATH,DOWNLOAD_FILE)} {DOWNLOAD_SOURCE}
    !mkdir -p {SOURCE}
    !mkdir -p {EXTRACT_TARGET}
    !unzip -o -j -d {SOURCE} {os.path.join(PATH, DOWNLOAD_FILE)} >/dev/null

elif WINDOWS:
    import patoolib
    import wget
    DATA_FOLDER=EXTRACT_FOLDER_OUT+"\\paperclips\\"
    if os.path.isfile(SOURCE):
        print("Note: 'paperclips.zip' already downloaded.", SOURCE)
    else:
        !python -m wget {DOWNLOAD_SOURCE} -o {SOURCE}
    try:
        patoolib.extract_archive(SOURCE,outdir=EXTRACT_FOLDER_OUT)
    except:
        print("Note: File already extracted")
else: # MacOS
    import patoolib
    import wget
    DATA_FOLDER=EXTRACT_FOLDER_OUT+"/paperclips/"
    if os.path.isfile(SOURCE):
        print("Note: 'paperclips.zip' already downloaded.", SOURCE)
    else:
        !wget -v {DOWNLOAD_SOURCE} --output-document={SOURCE}
        try:
            patoolib.extract_archive(SOURCE,outdir=EXTRACT_FOLDER_OUT)
        except:
            print("Note: File already extracted")


On my Windows machine, I received the following output when I first ran the code cell above:
~~~text
Saved under C:\Users\David\BIO1173\temp\paperclips.zip
patool: Extracting C:\Users\David\BIO1173\temp\paperclips.zip ...
patool: ... creating output directory `C:\Users\David\BIO1173\temp\clips\'.
patool: running "C:\Program Files\7-Zip\7z.EXE" x -oC:\Users\David\BIO1173\temp\clips\ -- C:\Users\David\BIO1173\temp\paperclips.zip
patool:     with creationflags=134217728
patool: ... C:\Users\David\BIO1173\temp\paperclips.zip extracted to `C:\Users\David\BIO1173\temp\clips\'.
~~~

If you try to run the code cell above more than once, it _should_ detect that you have already run it and not try to run it a second time:
~~~text
Note: 'paperclips.zip' already downloaded. C:\Users\David\BIO1173\temp\paperclips.zip
patool: Extracting C:\Users\David\BIO1173\temp\paperclips.zip ...
patool: running "C:\Program Files\7-Zip\7z.EXE" x -oC:\Users\David\BIO1173\temp\clips\ -- C:\Users\David\BIO1173\temp\paperclips.zip
patool:     with creationflags=134217728
Note: File already extracted
~~~

However, your mileage may vary...
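
The download-then-extract logic can also be sketched with only the Python standard library, which avoids the extra `wget` and `patool` installs. This is an alternative illustration, not the method used in this lesson: the function name `fetch_and_extract` is hypothetical, and the demonstration builds a tiny zip file locally (with a made-up label row) so that no network access is needed and the placeholder URL is never contacted.

```python
import os
import tempfile
import urllib.request
import zipfile

def fetch_and_extract(url, zip_path, out_dir):
    """Download url to zip_path unless it already exists, then unzip into out_dir."""
    if os.path.isfile(zip_path):
        print("Note: archive already downloaded:", zip_path)
    else:
        folder = os.path.dirname(zip_path)
        if folder:
            os.makedirs(folder, exist_ok=True)
        urllib.request.urlretrieve(url, zip_path)
    os.makedirs(out_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)      # files from an earlier run are overwritten
        return archive.namelist()

# Offline demonstration with a locally built archive (contents are made up)
with tempfile.TemporaryDirectory() as demo_dir:
    zip_path = os.path.join(demo_dir, "demo.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("paperclips/train.csv", "id,clip_count\n1,2\n")
    names = fetch_and_extract("https://example.invalid/demo.zip",
                              zip_path, os.path.join(demo_dir, "clips"))
    extracted_ok = os.path.isfile(
        os.path.join(demo_dir, "clips", "paperclips", "train.csv"))
    print(names, extracted_ok)
```

Since extraction simply overwrites existing files, running the function a second time is safe and needs no special "already extracted" handling.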

### Supervised Machine Learning

As mentioned previously, image analysis with a CNN is an example of _supervised_ learning. In this type of learning, the CNN always has to know the **_correct answer_** to the problem it is trying to solve. Why? How else would the CNN know whether the predictions it makes at end of each epoch are correct or incorrect? After all, if the CNN's predictions are 100% accurate, it means its synaptic weights are perfect and no more training is necessay. On the other hand, if the predictions are less than perfect, the CNNs still needs adjust its connection weights to improve its accuracy.  

For this particular example, our CNN will need to know the **_actual_** number of paperclips shown in each image in the training set. These numbers, i.e. the number of paperclips, are called the **_labels_**. Image datasets for CNNs always come with labels and the Paperclip image set is not exception. When the `paperclips.zip` file was extracted, it automatically created two folders inside of the `/clips` folder. One folder, called `/paperclips`, contains 25,000 images of paperclips (see picture above). The other folder is called `/_MACOSX`. (Jeff Heaton apparently created his `paperclips.zip` file on an Apple computer). Inside the `/_MACOSX` folder is another folder called `/paperclips` and finally, inside this folder, are two CSV files called `_test.csv` and `_train.csv` that contain the labels (number of paperclips). 

The labels are contained in a CSV file named **train.csv** for regression. This file has just two columns, **id** and **clip_count**. The **id** specifies the filename; for example, row id 1 corresponds to the file **clips-1.jpg**. 

The following code loads the labels for the training set and creates a new column, named **filename**, that contains the filename of each image, based on the **id** column.

In [None]:
# Load the labels for the training set

import os
import pandas as pd

# On COLAB the CSV file is in SOURCE; on other systems it is in DATA_FOLDER
labels_folder = SOURCE if COLAB else DATA_FOLDER

df = pd.read_csv(
    os.path.join(labels_folder, "train.csv"),
    na_values=['NA', '?'])

# Build each image filename from its id, e.g. id 1 -> clips-1.jpg
df['filename'] = "clips-" + df["id"].astype(str) + ".jpg"

This results in the following dataframe.

In [None]:
df

If your code is correct you should see the following table:

![___](https://biologicslab.co/BIO1173/images/class_06/class_06_2_df.png)



### Create training and validation sets

The next step is to separate the data (paperclip images) into a **_training_** set and a **_validation_** set. We will use the validation set for early stopping.

In [None]:
# Split images into training and validation sets

TRAIN_PCT = 0.9
TRAIN_CUT = int(len(df) * TRAIN_PCT)

df_train = df[0:TRAIN_CUT]
df_validate = df[TRAIN_CUT:]

print(f"Training size: {len(df_train)}")
print(f"Validate size: {len(df_validate)}")

If your code is correct you should see the following output:
~~~text
Training size: 18000
Validate size: 2000
~~~
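
The slice above takes the first 90% of the rows in file order. If the rows of a label file happened to be sorted, a shuffled split might be safer. Here is a minimal sketch of that variant on a toy dataframe with made-up values (`df_demo` and the other `_demo` names are hypothetical and are not used elsewhere in this lesson):

```python
import pandas as pd

# Toy stand-in for the labels dataframe (values are made up)
df_demo = pd.DataFrame({
    "id": range(1, 11),
    "clip_count": [3, 7, 1, 9, 4, 2, 8, 6, 5, 0],
})

TRAIN_PCT = 0.9

# Shuffle the rows reproducibly, then cut at 90%
df_shuffled_demo = df_demo.sample(frac=1.0, random_state=42).reset_index(drop=True)
train_cut_demo = int(len(df_shuffled_demo) * TRAIN_PCT)

df_train_demo = df_shuffled_demo[:train_cut_demo]
df_validate_demo = df_shuffled_demo[train_cut_demo:]

print(f"Training size: {len(df_train_demo)}")
print(f"Validate size: {len(df_validate_demo)}")
```

Fixing `random_state` makes the shuffle repeatable, so the same images land in the validation set every time the cell is run.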

### Create **_ImageDataGenerator_** objects

CNNs require a large amount of diverse data to learn features that generalize well to new, unseen images. By training on a large dataset, the model can learn a wide range of patterns and variations in the data, leading to better generalization performance. A CNN has a very large number of trainable parameters (weights), and large datasets provide enough training examples for the model to adjust these parameters effectively and learn complex patterns in the images.

One clever way to increase the size of an image dataset, **_without_** downloading additional image files, is to use an **_ImageDataGenerator object_**. An ImageDataGenerator is basically a computer algorithm that creates additional training data by **_manipulating_** the source material. For example, the generator below flips the paperclip images both vertically and horizontally. After all, you could recognize a picture of your grandmother even if it was flipped upside down, couldn't you?

Keras will train the neural network on both the original images and the flipped images. This augmentation increases the size of the training data considerably. Module 6.4 goes deeper into the transformations you can perform. For example, you can also specify a target size to resize the images automatically.
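
To make the idea of flipping concrete, here is a small NumPy sketch. It is only an illustration of what a horizontal or vertical flip does to the pixel grid, not the code Keras runs internally; the tiny array `image_demo` is a made-up stand-in for a real photo.

```python
import numpy as np

# A tiny 2x3 single-channel "image" standing in for a photo
image_demo = np.array([[1, 2, 3],
                       [4, 5, 6]])

flipped_lr = np.fliplr(image_demo)              # mirror left-to-right
flipped_ud = np.flipud(image_demo)              # mirror top-to-bottom
flipped_both = np.flipud(np.fliplr(image_demo)) # both flips combined

print(flipped_lr)
print(flipped_ud)
print(flipped_both)
```

Each flipped array contains exactly the same pixel values as the original, just rearranged, which is why the label (the number of paperclips) stays valid for the flipped image.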

The function **flow_from_dataframe** loads the labels from a Pandas dataframe connected to our **train.csv** file. When we demonstrate classification, we will use **flow_from_directory**, which loads the labels from the directory structure rather than a CSV.

In [None]:
# This might be needed in COLAB

if COLAB:
    !pip install keras_preprocessing 

In [None]:
import tensorflow as tf
import keras_preprocessing
from keras_preprocessing import image
from keras_preprocessing.image import ImageDataGenerator

training_datagen = ImageDataGenerator(
  rescale = 1./255,
  horizontal_flip=True,
  vertical_flip=True,
  fill_mode='nearest')

train_generator = training_datagen.flow_from_dataframe(
        dataframe=df_train,
        directory=DATA_FOLDER,
        x_col="filename",
        y_col="clip_count",
        target_size=(256, 256),
        batch_size=32,
        class_mode='other')

validation_datagen = ImageDataGenerator(rescale = 1./255)

val_generator = validation_datagen.flow_from_dataframe(
        dataframe=df_validate,
        directory=DATA_FOLDER,
        x_col="filename",
        y_col="clip_count",
        target_size=(256, 256),
        class_mode='other')

If your code is correct you should see the following output:
~~~text
Found 18000 validated image filenames.
Found 2000 validated image filenames.
~~~

We can now train the neural network. The code to build and train the neural network is not that different from the code in the previous modules. We will use the Keras Sequential class to provide layers to the neural network. The model uses the following layer types, most of which we have not seen before:

* **Conv2D** - The convolution layers.
* **MaxPooling2D** - The max-pooling layers.
* **Flatten** - Flatten the 2D (and higher) tensors to allow a Dense layer to process.
* **Dense** - Dense layers, the same as demonstrated previously. Dense layers often form the final output layers of the neural network.

The training code is very similar to what we used previously. This code is for regression, so the final layer uses a linear activation and the loss function is mean_squared_error. The generator provides both the *x* and *y* matrices that we previously supplied ourselves.
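
It can help to see how the image shrinks as it passes through the layers. The sketch below works out the sizes for the regression network in this section (256x256 input, two 3x3 convolutions with 64 filters, each followed by 2x2 max pooling, then a 512-neuron Dense layer). It assumes the Keras defaults of no padding (`padding='valid'`) and a stride of 1 for `Conv2D`; the helper names `conv_out` and `pool_out` are made up for this illustration.

```python
def conv_out(size, kernel=3):
    """Width/height after a 'valid' convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2):
    """Width/height after max pooling with a pool x pool window."""
    return size // pool

size = 256
size = pool_out(conv_out(size))   # first convolution + pooling: 254, then 127
size = pool_out(conv_out(size))   # second convolution + pooling: 125, then 62

filters = 64
flat_units = size * size * filters        # values produced by Flatten
dense_params = flat_units * 512 + 512     # weights + biases of the 512-neuron layer

print(f"Feature map size: {size}x{size}")
print(f"Flattened units:  {flat_units}")
print(f"Dense parameters: {dense_params}")
```

Nearly all of the trainable parameters sit in the first Dense layer, which is one reason this model trains slowly. You can compare these numbers with the output of `model.summary()`.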

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
import time

# Set variables
EPOCHS=5  # Use 25 if possible
MODEL_STEPS= 250

# Build model
model = tf.keras.models.Sequential([
    # Note the input shape is the desired size of the image: 256x256
    # pixels with 3 color channels (RGB).
    
    # This is the first convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu', 
        input_shape=(256, 256, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    
    # The second convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Flatten(),
    
    # 512 neuron hidden layer
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(1, activation='linear')
])

# Print model summary
model.summary()

# Set model steps
epoch_steps = MODEL_STEPS # needed for 2.2
validation_steps = len(df_validate)
model.compile(loss = 'mean_squared_error', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, 
        patience=5, verbose=1, mode='auto',
        restore_best_weights=True)

# Record start
start_time = time.time()

# Train model
print(f"Starting training for {EPOCHS} epochs...")
history = model.fit(train_generator,  
  verbose = 1, 
  validation_data=val_generator, callbacks=[monitor], epochs=EPOCHS)

# Print elapsed time
elapsed_time = time.time() - start_time
print("Elapsed time: {}".format(hms_string(elapsed_time)))

This code will run very slowly, even if you use a GPU. The code takes approximately 13 minutes to run 25 epochs with a GPU. To speed up this lesson, the number of epochs has been reduced to only 5. 

If you don't have GPU acceleration on your laptop, you can run this lesson on Google COLAB. Make sure, before you start running any cells, to change the RUNTIME option to  



## Score Regression Image Data

Scoring/predicting from a generator is a bit different than training. We do not want augmented images, and we do not wish to have the dataset shuffled. For scoring, we want a prediction for each input. We construct the generator as follows:

* shuffle=False
* batch_size=1
* class_mode=None

We use a **batch_size** of 1 to guarantee that we do not run out of GPU memory if our prediction set is large. You can increase this value for better performance. The **class_mode** is None because there is no *y*, or label. After all, we are predicting.

In [None]:
# Load the test set and build a generator for scoring

df_test = pd.read_csv(
    os.path.join(DATA_FOLDER,"test.csv"), 
    na_values=['NA', '?'])

df_test['filename']="clips-"+df_test["id"].astype(str)+".jpg"

test_datagen = ImageDataGenerator(rescale = 1./255)

test_generator = test_datagen.flow_from_dataframe(
        dataframe=df_test,
        directory=DATA_FOLDER,
        x_col="filename",
        batch_size=1,
        shuffle=False,
        target_size=(256, 256),
        class_mode=None)

If your code is correct you should see the following output:
~~~text
Found 5000 validated image filenames.
~~~

We need to reset the generator to ensure we are always at the beginning.

In [None]:
test_generator.reset()
pred = model.predict(test_generator,steps=len(df_test))

If your code is correct you should see the following output:
~~~text
5000/5000 [==============================] - 26s 5ms/step
~~~

We can now generate a CSV file to hold the predictions.

In [None]:
df_submit = pd.DataFrame({'id':df_test['id'],'clip_count':pred.flatten()})
df_submit.to_csv(os.path.join(PATH,"submit.csv"),index=False)

In [None]:
df_submit

If your code is correct you should see the following table:

![___](https://biologicslab.co/BIO1173/images/class_06/class_06_2_df2.png)
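
The call to `pred.flatten()` is needed because `model.predict` returns one column per output neuron, so a single-output model gives an array of shape `(N, 1)` rather than a plain list of `N` numbers. The sketch below shows this with made-up predictions and ids (all `_demo` names and values are hypothetical). It also shows one optional extra step, rounding to whole paperclips, which is not part of the lesson's code.

```python
import numpy as np
import pandas as pd

# Made-up predictions with the same shape as model.predict output: (N, 1)
pred_demo = np.array([[2.9], [17.2], [40.6]])
ids_demo = [25001, 25002, 25003]   # hypothetical image ids

# flatten() turns the (3, 1) array into a 1-D array of length 3
flat_demo = pred_demo.flatten()

df_submit_demo = pd.DataFrame({"id": ids_demo, "clip_count": flat_demo})

# Optional: a count must be a whole number, so round the regression output
df_submit_demo["clip_count_rounded"] = df_submit_demo["clip_count"].round().astype(int)

print(df_submit_demo)
```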



## **Classification Neural Networks**

In this example, we will build a convolutional neural network (CNN) to analyze images of 3 different species of Iris flowers:
* iris-setosa
* iris-versicolour
* iris-virginica

We have used the famous Iris flower dataset before. In that case, however, we used numeric data--namely, the physical dimensions of the flower's sepals and petals--presented in a tabular form (i.e. a DataFrame). In this lesson we will use hundreds of photographic images of Iris flowers. The picture below shows a random example of an image from each species. 

The type of images used for the training are shown here:
![___](https://biologicslab.co/BIO1173/images/class_06/iris_species.png)

An important challenge in working with CNNs is managing the large number of images in the dataset. Learning how to load and efficiently process these images is just as important as building the neural networks to analyze them.


### **Setting up Environmental Variables**

The code in the section below sets up **_ENVIRONMENTAL VARIABLES_**. 

In [None]:
# 
import os
PATH=True

URL = "https://biologicslab.co/BIO1173/data/"
DOWNLOAD_SOURCE = URL+"iris-image.zip"  # URL already ends with "/"
DOWNLOAD_FILE = DOWNLOAD_SOURCE[DOWNLOAD_SOURCE.rfind('/')+1:]
print("DOWNLOAD_SOURCE=",DOWNLOAD_SOURCE)
print("DOWNLOAD_FILE=",DOWNLOAD_FILE)

if COLAB:
    PATH = "/content"
    EXTRACT_TARGET = os.path.join(PATH,"iris-images")
    SOURCE = os.path.join(EXTRACT_TARGET, "iris-images")
    print("Note: Using COLAB")
elif WINDOWS:
    PATH=LESSON_DIRECTORY
    print("PATH=",PATH)
    EXTRACT_FOLDER_IN=BASE_DIR+"\\temp\\"
    print("EXTRACT_FOLDER_IN=",EXTRACT_FOLDER_IN)
    EXTRACT_FOLDER_OUT=EXTRACT_FOLDER_IN+"iris-images\\"
    print("EXTRACT_FOLDER_OUT=",EXTRACT_FOLDER_OUT)
    SOURCE = os.path.join(EXTRACT_FOLDER_IN, "iris-images.zip")
    print("Note: Using WINDOWS")
    print("SOURCE=",SOURCE)
    print("DOWNLOAD_FILE=",DOWNLOAD_FILE)
else:
    PATH=LESSON_DIRECTORY
    print("PATH=",PATH)
    EXTRACT_FOLDER_IN=BASE_DIR+"/temp/"
    print("EXTRACT_FOLDER_IN=",EXTRACT_FOLDER_IN)
    EXTRACT_FOLDER_OUT=EXTRACT_FOLDER_IN+"iris-images/"
    print("EXTRACT_FOLDER_OUT=",EXTRACT_FOLDER_OUT)
    SOURCE = os.path.join(EXTRACT_FOLDER_IN, "iris-images.zip")
    print("Note: Using OTHER (MacOS?)")
    print("SOURCE=",SOURCE)
    print("DOWNLOAD_FILE=",DOWNLOAD_FILE)

Just as before, we unzip the images.

In [None]:
# Download and extract the image data

if COLAB:
    print("PATH=",PATH, "DOWNLOAD_FILE=",DOWNLOAD_FILE, "DOWNLOAD_SOURCE=",DOWNLOAD_SOURCE)
    !wget -O {os.path.join(PATH,DOWNLOAD_FILE)} {DOWNLOAD_SOURCE}
    !mkdir -p {SOURCE}
    !mkdir -p {TARGET}
    !mkdir -p {EXTRACT_TARGET}
    !unzip -o -j -d {SOURCE} {os.path.join(PATH, DOWNLOAD_FILE)} >/dev/null

elif WINDOWS:
    import patoolib
    import wget
    if os.path.isfile(SOURCE):
        print("SOURCE already exists.", SOURCE)
    else:
        !python -m wget {DOWNLOAD_SOURCE} -o {SOURCE}
        print("In WINDOWS")
    try:
        patoolib.extract_archive(SOURCE,outdir=EXTRACT_FOLDER_OUT)
    except:
        print("Note: File already extracted")
        
    DATA_FOLDER=EXTRACT_FOLDER_OUT          # +"\\iris\\"

else:
    import patoolib
    import wget
    DATA_FOLDER=EXTRACT_FOLDER_OUT          #+"/iris/"
    if os.path.isfile(SOURCE):
        print("SOURCE already exists.", SOURCE)
    else:
        print("EXTRACT_FOLDER_IN=",EXTRACT_FOLDER_IN,"DOWNLOAD_SOURCE=",DOWNLOAD_SOURCE)
        !wget -v {DOWNLOAD_SOURCE} --output-document={SOURCE}
        try:
            patoolib.extract_archive(SOURCE,outdir=EXTRACT_FOLDER_OUT)
        except:
            print("Note: File already extracted")
    

You can see these folders with the following command.
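
One way to inspect the folders from Python is sketched below. This is an illustration only: the helper `count_images_per_class` is a made-up name, and the demo runs on a throwaway directory with empty placeholder files rather than on the real dataset. To look at the real images you could call the helper on `DATA_FOLDER` instead, assuming the archive was extracted there with one subfolder per species.

```python
import os
import tempfile

def count_images_per_class(folder):
    """Return {subfolder name: number of files} for each subfolder of `folder`."""
    counts = {}
    for name in sorted(os.listdir(folder)):
        subfolder = os.path.join(folder, name)
        if os.path.isdir(subfolder):
            files = [f for f in os.listdir(subfolder)
                     if os.path.isfile(os.path.join(subfolder, f))]
            counts[name] = len(files)
    return counts

# Demo on a temporary directory with empty placeholder files
with tempfile.TemporaryDirectory() as demo_root:
    for species, n_files in [("iris-setosa", 2), ("iris-virginica", 1)]:
        os.makedirs(os.path.join(demo_root, species))
        for i in range(n_files):
            open(os.path.join(demo_root, species, f"img{i}.jpg"), "w").close()
    demo_counts = count_images_per_class(demo_root)

print(demo_counts)
```

Each subfolder name becomes a class label when the images are loaded from the directory structure.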

We set up the generator, similar to before.  This time we use flow_from_directory to get the labels from the directory structure.

In [None]:
import tensorflow as tf
import keras_preprocessing
from keras_preprocessing import image
from keras_preprocessing.image import ImageDataGenerator

training_datagen = ImageDataGenerator(
  rescale = 1./255,
  horizontal_flip=True,
  vertical_flip=True,
  width_shift_range=[-200,200],
  rotation_range=360,

  fill_mode='nearest')

train_generator = training_datagen.flow_from_directory(
    directory=DATA_FOLDER, target_size=(256, 256), 
    class_mode='categorical', batch_size=32, shuffle=True)

validation_datagen = ImageDataGenerator(rescale = 1./255)

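# shuffle=False keeps the images in a fixed order, so the predictions made
# later line up with validation_generator.classes
validation_generator = validation_datagen.flow_from_directory(
    directory=DATA_FOLDER, target_size=(256, 256), 
    class_mode='categorical', batch_size=32, shuffle=False)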


If your code is correct you should see the following output:
~~~text
Found 421 images belonging to 3 classes.
Found 421 images belonging to 3 classes.
~~~

Training the neural network with classification is similar to regression. 
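
The main differences are the output layer and the loss function: classification uses one output neuron per class with a softmax activation, and the `categorical_crossentropy` loss. The NumPy sketch below shows the idea for a single image. It is a simplified illustration, not the Keras implementation (which also handles batches and numerical clipping), and the scores and label are made up.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

def categorical_crossentropy(one_hot_label, probs):
    """Loss for one example: minus the log of the probability of the true class."""
    return -np.sum(one_hot_label * np.log(probs))

logits_demo = np.array([2.0, 1.0, 0.1])   # made-up raw scores for 3 species
probs_demo = softmax(logits_demo)
label_demo = np.array([1.0, 0.0, 0.0])    # the true class is the first species
loss_demo = categorical_crossentropy(label_demo, probs_demo)

print("Probabilities:", probs_demo)
print("Loss:", loss_demo)
```

The loss is small when the network assigns a high probability to the correct species and grows as that probability approaches zero.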

In [None]:
from tensorflow.keras.callbacks import EarlyStopping
import time

# Set variables
EPOCHS=5  # Use 50 if possible
STEPS_PER_EPOCH =10

class_count = len(train_generator.class_indices)

# Record start time
start_time = time.time()

# Build the model
model = tf.keras.models.Sequential([
    
    # Note the input shape is the desired size of the image:
    # 256x256 pixels with 3 color channels (RGB)
    
    # This is the first convolution
    tf.keras.layers.Conv2D(16, (3,3), activation='relu', 
        input_shape=(256, 256, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    
    # The second convolution
    tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.MaxPooling2D(2,2),
    
    # The third convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.MaxPooling2D(2,2),
    
    # The fourth convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    
    # The fifth convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    
    # Flatten the results to feed into a DNN
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),
    
    # 512 neuron hidden layer
    tf.keras.layers.Dense(512, activation='relu'),
    
    # One output neuron per class; softmax turns the outputs into
    # probabilities between 0 and 1 that sum to 1
    tf.keras.layers.Dense(class_count, activation='softmax')
])

# Print model summary
model.summary()

# Compile model
model.compile(loss = 'categorical_crossentropy', optimizer='adam')

# Train model
print(f"Starting training for {EPOCHS} epochs...") 
model.fit(train_generator, epochs=EPOCHS, steps_per_epoch=STEPS_PER_EPOCH, 
                    verbose = 1)

# Print elapsed time
elapsed_time = time.time() - start_time
print("Elapsed time: {}".format(hms_string(elapsed_time)))

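The iris image dataset is not easy to predict; it turns out that a tabular dataset of measurements is more manageable. However, we can achieve an accuracy of about 63%. Keep in mind that the training and validation generators both read the same folder, so this score is measured on images the network has already seen.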

In [None]:
from sklearn.metrics import accuracy_score
import numpy as np

validation_generator.reset()
pred = model.predict(validation_generator)

predict_classes = np.argmax(pred,axis=1)
expected_classes = validation_generator.classes

correct = accuracy_score(expected_classes,predict_classes)
print(f"Accuracy: {correct}")

If your code is correct you should see the following output:
~~~text
14/14 [==============================] - 1s 53ms/step
Accuracy: 0.6389548693586699
~~~
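
The scoring step above can be traced by hand on a small example. The sketch below uses made-up softmax outputs for four images and three classes (all `_demo` names and values are hypothetical) and computes the accuracy with plain NumPy, which gives the same result as `accuracy_score` for this kind of data.

```python
import numpy as np

# Made-up softmax outputs: one row per image, one column per class
pred_demo = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
])

# Made-up true class index for each image
expected_demo = np.array([0, 2, 1, 1])

# argmax picks the column (class) with the highest probability in each row
predicted_demo = np.argmax(pred_demo, axis=1)

# Accuracy is the fraction of images where prediction and label agree
accuracy_demo = np.mean(predicted_demo == expected_demo)

print("Predicted classes:", predicted_demo)
print(f"Accuracy: {accuracy_demo}")
```

This comparison only makes sense when the rows of the predictions are in the same order as the labels, which is why the generator used for scoring should not shuffle its images.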


# Other Resources

* [Imagenet:Large Scale Visual Recognition Challenge 2014](http://image-net.org/challenges/LSVRC/2014/index)
* [Andrej Karpathy](http://cs.stanford.edu/people/karpathy/) - PhD student/instructor at Stanford.
* [CS231n Convolutional Neural Networks for Visual Recognition](http://cs231n.stanford.edu/) - Stanford course on computer vision/CNNs.
* [CS231n - GitHub](http://cs231n.github.io/)
* [ConvNetJS](http://cs.stanford.edu/people/karpathy/convnetjs/) - JavaScript library for deep learning.

Now we can zip the preprocessed files and store them somewhere.