<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/main/Assigment_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

**Assignment 1: Convolutional Neural Networks (CNN) for Computer Vision**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)



# **The Purpose of Assignments**

In this course, **_Assignments_** are designed to help me (and you) assess your ability to transfer knowledge gained from completing class coding exercises to solving more realistic problems.

Assignments play a pivotal role in reinforcing your learning, as they require you to apply theoretical concepts to practical scenarios. This helps solidify your understanding and enhances your problem-solving skills. By tackling these assignments independently, you develop critical thinking and the ability to synthesize information from various sources. Moreover, assignments encourage you to explore topics more deeply, fostering intellectual curiosity and promoting a deeper engagement with the subject matter. Ultimately, these assignments are not just a measure of your learning, but a means to equip you with the skills needed for real-world applications and future challenges.

### Google CoLab Instructions

The following code checks whether you are running in Google CoLab and, if so, authenticates your account. Running it will also map your GDrive to ```/content/drive```.

In [None]:
# YOU MUST RUN THIS CELL FIRST

try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: using Google CoLab")
    #%tensorflow_version 2.x
    #print(f"Tensorflow version: {tf.__version__}")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except Exception:
    print("Note: not using Google CoLab")
    COLAB = False

### Define functions

The cell below creates several functions that are needed for this assignment. If you don't run this cell, you will receive errors later when you try to run some cells.

In [None]:
# Create functions for this lesson

import psutil
import os

def list_files():
    files = os.listdir('.')
    print(f"Current files: {files}")

# Simple function to print out elapsed time
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)


list_files()
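
As a quick illustration of how `hms_string` might be used, the sketch below times a short placeholder task and also formats a fixed duration. The `time.sleep` call is only a stand-in for real work such as model training, and the function body is repeated from the cell above so the snippet runs on its own.

```python
import time

# Same definition as in the cell above, repeated so this snippet is self-contained
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

# Time a placeholder task (stand-in for model training)
start_time = time.time()
time.sleep(0.1)
elapsed = time.time() - start_time
print(f"Elapsed time: {hms_string(elapsed)}")

# Format a fixed duration: 3725.5 seconds is 1 hour, 2 minutes, 5.5 seconds
formatted = hms_string(3725.5)
print(formatted)  # 1:02:05.50
```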

## **Record your GPU/TPU Runtime**

Since you will be training a fairly large CNN for this assignment, you will probably want to use a GPU or TPU runtime environment.

To assist your Instructor in grading your work, you need to record what hardware you will be using for your assignment.

Record your current runtime environment by entering the appropriate value in the `my_GPU_dict` below. For example, if you have changed your runtime to the `A100 GPU`, you would enter the number `2` in the empty square brackets in the next code cell.

In [None]:
# Record your current Runtime GPU/TPU


# List of Current GPU/TPUs
my_GPU_dict = {
    1: 'CPU',
    2: 'A100 GPU',
    3: 'L4 GPU',
    4: 'T4 GPU',
    5: 'TPU v2-8'
}

# Enter the correct key number in the square brackets [ ]
my_GPU = my_GPU_dict[ ]

# Print selection
print(f"My current runtime GPU/TPU is: {my_GPU}")

If your code is correct, you should see something like the following:
~~~text
My current runtime GPU/TPU is: A100 GPU
~~~

If you get an error, it probably means that you didn't enter a value in the square brackets.

# **Assignment 1: Keras Neural Networks for Medical MNIST**

**Assignment 1** is designed to assess your ability to write the Python/TensorFlow/Keras code necessary to classify image data in a MedMNIST dataset. In creating this assignment, it was assumed that you have already completed **Class_06_1**. The same series of steps used to solve the **Exercise** in Class_06_1 has been provided in this assignment to help guide your coding.

While most of the code in Class_06_1 should be reusable here without modification, the MedMNIST datafiles vary in scale and content. It will be up to you to make any code changes necessary to make your CNN model work.

As always, ask your Instructor for help if you encounter an error that you can't figure out.


## **MedMNIST Datasets**

[**MedMNIST**](https://medmnist.com) offers a collection of 12 pre-processed 2D datasets designed for various biomedical image classification tasks. These datasets cover primary data modalities such as **X-Ray**, **OCT (Optical Coherence Tomography)**, **Ultrasound**, **CT (Computed Tomography)**, and **Electron Microscope** images.

The datasets are diverse, ranging from binary/multi-class classification to ordinal regression and multi-label tasks. They also vary in scale, with data sizes ranging from 100 to 100,000 images.

Here's a list of the 12 datasets offered by MedMNIST, the number of classes they contain and their data modality:

| Dataset Name       | Classes                          | Data Modality            |
|--------------------|----------------------------------|--------------------------|
| DermaMNIST         | 7 (skin conditions)              | Dermatoscope             |
| OCTMNIST           | 4 (retinal conditions)           | Retina OCT               |
| PneumoniaMNIST     | 2 (normal, pneumonia)            | Chest X-Ray              |
| RetinaMNIST        | 5 (retinal diseases)             | Fundus Camera            |
| ChestMNIST         | 14 (disease labels, multi-label) | Chest X-Ray              |
| BreastMNIST        | 2 (benign, malignant)            | Breast Ultrasound        |
| PathMNIST          | 9 (histopathological conditions) | Colon Pathology          |
| BloodMNIST         | 8 (blood cell types)             | Blood Cell Microscope    |
| TissueMNIST        | 8 (tissue types)                 | Kidney Cortex Microscope |
| OrganMNIST - A     | 11 (organs - axial view)         | Abdominal CT             |
| OrganMNIST - C     | 11 (organs - coronal view)       | Abdominal CT             |
| OrganMNIST - S     | 11 (organs - sagittal view)      | Abdominal CT             |

# **Dataset You Must Analyze**

The last digit in your myUTSA ID (e.g. `abc123`) will determine which MedMNIST dataset you are to analyze for **Assignment 1**.

You are not free to choose the dataset. If you analyze the wrong dataset, _30 points_ will be deducted from your score. If you are uncertain which dataset you should be working on, contact your Instructor for help.


| Last Digit in myUTSA ID | MedMNIST Dataset to Analyze |
|-------------------------|-----------------------------|
| 0                       | breastmnist_224.npz         |
| 1                       | dermamnist_64.npz           |
| 2                       | octmnist.npz                |
| 3                       | organamnist.npz             |
| 4                       | organcmnist.npz             |
| 5                       | organsmnist_64.npz          |
| 6                       | pathmnist_224.npz           |
| 7                       | pneumoniamnist_64.npz       |
| 8                       | retinamnist_128.npz         |
| 9                       | tissuemnist.npz             |
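
If you want to double-check your assignment, the lookup in the table above can be sketched as a small dictionary. The function name `dataset_for_id` and the constant `DATASET_BY_DIGIT` are made up for this illustration and are not part of Class_06_1.

```python
# Mapping copied from the table above: last digit of myUTSA ID -> dataset file
DATASET_BY_DIGIT = {
    "0": "breastmnist_224.npz",
    "1": "dermamnist_64.npz",
    "2": "octmnist.npz",
    "3": "organamnist.npz",
    "4": "organcmnist.npz",
    "5": "organsmnist_64.npz",
    "6": "pathmnist_224.npz",
    "7": "pneumoniamnist_64.npz",
    "8": "retinamnist_128.npz",
    "9": "tissuemnist.npz",
}

def dataset_for_id(utsa_id):
    """Return the assigned dataset filename for an ID such as 'abc123'."""
    last_char = utsa_id.strip()[-1]
    if last_char not in DATASET_BY_DIGIT:
        raise ValueError(f"ID must end in a digit, got {utsa_id!r}")
    return DATASET_BY_DIGIT[last_char]

assigned = dataset_for_id("abc123")
print(assigned)  # organamnist.npz
```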



### **Step - 1: Set Up Environmental Variables**

In the cell below, create environmental variables so you can download your specific MedMNIST dataset that has been assigned to you in the cell above. Assuming that you are using the code provided in Class_06_1 as a template you will only need to make changes to the `DOWNLOAD_SOURCE` and the `EXTRACT_TARGET`.

For example, if the last digit of your myUTSA ID was `6`, your assigned MedMNIST dataset is `pathmnist_224.npz`. To change the `DOWNLOAD_SOURCE`, simply paste in the filename as shown in the following code chunk:

~~~text
DOWNLOAD_SOURCE = URL+"/pathmnist_224.npz"
~~~
Make sure not to leave any spaces between the `/` and the filename.

Use the same filename for `EXTRACT_TARGET` but make sure to remove the `.npz` extension (the dot and the 3 letters).

So your `EXTRACT_TARGET` would be:
~~~text
EXTRACT_TARGET = os.path.join(PATH,"pathmnist_224")
~~~
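
One way to keep the two variables consistent is to derive both from a single filename. The sketch below is only an illustration: the `URL` and `PATH` values are hypothetical placeholders (use the ones from Class_06_1), and the `FILENAME` variable is an added convenience that does not appear in the class code.

```python
import os

# Hypothetical placeholder values; replace with the ones used in Class_06_1
URL = "https://example.com/medmnist"
PATH = "/content"

# The dataset assigned to a myUTSA ID ending in 6
FILENAME = "pathmnist_224.npz"

# No space between the "/" and the filename
DOWNLOAD_SOURCE = URL + "/" + FILENAME

# splitext separates "pathmnist_224" from ".npz"
stem, extension = os.path.splitext(FILENAME)
EXTRACT_TARGET = os.path.join(PATH, stem)

print(DOWNLOAD_SOURCE)
print(EXTRACT_TARGET)
```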


In [None]:
# Step - 1: Setup Environmental Variables




### **Step - 2: Download and Extract Data**

If your code in Step 1 is correct, you should be ready to download and extract your dataset.

In the cell below, write the code to download your datafile, make the appropriate file folders and then extract (unzip) your datafile into the file folders you created.

**Please Note:** There are considerable differences in the size of these MedMNIST datasets. The larger ones (e.g. `pathmnist_224.npz`) are more than 12GB in size and will require many minutes to download to Colab and extract. Smaller datasets will download and extract fairly quickly. As long as the "little wheel" at the top left of the code cell keeps spinning, your code is working correctly. Just be patient.

In [None]:
# Insert your code for Step 2 here


### **Step - 3: Load and Shuffle Images and Labels into Numpy arrays**

In the cell below, write the Python code to read ("unpack") and shuffle the image and label data into Numpy arrays. In total, you should create 6 numpy arrays: `train_X`, `train_Y`, `test_X`, `test_Y`, `val_X` and `val_Y`. The `X` arrays will hold the medical images while the `Y` arrays will have their corresponding labels.

Make sure to print out the `shape` of each numpy array.   
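
When shuffling, the images and their labels must be reordered with the same permutation, otherwise each image ends up paired with the wrong label. The sketch below shows the idea on tiny synthetic arrays; the array names and sizes are made up for illustration, and Class_06_1 may shuffle in a different way.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic stand-ins: six 2x2 "images", each filled with its own label value
images = np.stack([np.full((2, 2), i) for i in range(6)])
labels = np.arange(6).reshape(-1, 1)

# Draw one permutation and apply it to BOTH arrays
permutation = rng.permutation(len(images))
shuffled_X = images[permutation]
shuffled_Y = labels[permutation]

print(f"Images shape: {shuffled_X.shape}")
print(f"Labels shape: {shuffled_Y.shape}")
```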

In [None]:
# Insert your code for Step 3 here



Take a good look at your output. Make a note of the `shape` value for the array called `train_X`.

The `shape` should have either 3 or 4 numbers. The first number is the number of images in your particular dataset. The next 2 numbers are the dimensions (in pixels) of the image. The last number, if present, specifies the number of color channels. The number `3` means a color image (RGB). If your image files don't have a 4th number, you can add a single color channel (monochrome) in the next step.
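
The reading of a `shape` described above can be sketched as a small helper. The function name `describe_shape` and the example shapes are hypothetical and only meant to illustrate the 3-number and 4-number cases.

```python
def describe_shape(shape):
    """Interpret an image array shape with 3 or 4 numbers."""
    if len(shape) == 3:
        n_images, height, width = shape
        channels = None  # no color channel present yet
    elif len(shape) == 4:
        n_images, height, width, channels = shape
    else:
        raise ValueError(f"Expected 3 or 4 numbers, got {len(shape)}")
    return {
        "images": n_images,
        "height": height,
        "width": width,
        "channels": channels,
    }

# Hypothetical shapes: a monochrome set without a channel axis, and an RGB set
gray_info = describe_shape((1000, 28, 28))
rgb_info = describe_shape((1000, 64, 64, 3))
print(gray_info)
print(rgb_info)
```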

### **Step 4 - Add Color Channel and Resize Images**

In the cell below, write the code to add a color channel (if needed) and to resize your images if their pixel size is less than 64 x 64. Make these changes to all 3 of your image arrays (`train_X`, `test_X` and `val_X`). Print out the `shape` of your image arrays.
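
As a rough sketch of what this step involves, the example below adds a channel axis with NumPy and enlarges small images by repeating pixels. This is an assumption-laden illustration, not the Class_06_1 code: the helper names are made up, and the nearest-neighbor upscaling by an integer factor is a simple stand-in that produces 84 x 84 images from 28 x 28 input rather than exactly 64 x 64.

```python
import numpy as np

def add_channel_if_missing(images):
    """Append a single (monochrome) channel axis to a 3-number array."""
    if images.ndim == 3:
        return np.expand_dims(images, axis=-1)
    return images

def upscale_nearest(images, factor):
    """Enlarge height and width by an integer factor by repeating pixels."""
    return images.repeat(factor, axis=1).repeat(factor, axis=2)

# Synthetic stand-in for a small monochrome image array
demo_X = np.zeros((5, 28, 28), dtype=np.uint8)

demo_X = add_channel_if_missing(demo_X)
if demo_X.shape[1] < 64:
    demo_X = upscale_nearest(demo_X, 3)

print(f"Shape after Step 4: {demo_X.shape}")
```

The same two calls would need to be applied to the test and validation arrays so that all three image arrays end up with matching dimensions.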

### **Step 5 - Check Available Memory**

In the cell below, write the code to check the available memory and print out its values (`Total Memory`, `Available Memory` and `Used Memory`).

In [None]:
# Insert your code for Step 5 here



### **Step - 6: Augment Training Image Set**

In the cell below, write the code to augment your set of training images (`train_X`) if sufficient free memory is available. Print out the `shape` of `train_X`.
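
To illustrate the general idea of augmentation, the sketch below doubles a tiny synthetic training set by appending horizontally flipped copies. Horizontal flipping is an assumption chosen for simplicity and may not be the augmentation used in Class_06_1. Note that the labels have to be extended as well so they stay aligned with the images.

```python
import numpy as np

# Synthetic stand-ins: two 4x4 single-channel images with labels 0 and 1
demo_X = np.arange(2 * 4 * 4 * 1).reshape(2, 4, 4, 1)
demo_Y = np.array([[0], [1]])

# Flip each image left-to-right by reversing the width axis
flipped_X = demo_X[:, :, ::-1, :]

# Append the flipped copies and repeat the labels in the same order
aug_X = np.concatenate([demo_X, flipped_X], axis=0)
aug_Y = np.concatenate([demo_Y, demo_Y], axis=0)

print(f"Augmented images shape: {aug_X.shape}")
print(f"Augmented labels shape: {aug_Y.shape}")
```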

In [None]:
# Insert your code for Step 6 here


### **Step - 7: One-Hot Encode Labels**

In the cell below, use the Keras function `to_categorical()` to One-Hot Encode the label information for your training, testing and validation images. Make sure to print out the `shape` of each array after One-Hot Encoding. The last number in each shape should match the number of classes in your dataset. Keep in mind that some MedMNIST datasets have only 2 classes, while other datasets have more than 10.

In [None]:
# Insert your code for Step - 7 here



##### **WARNING:** If the shapes of your label arrays show 3 numbers instead of 2, as in this example:
~~~text
The label data contains 2 classes
Train Labels Shape (train_Y): (496398, 8, 2)
Test Labels Shape (test_Y): (47280, 8, 2)
Validation Labels Shape (val_Y): (23640, 8, 2)
~~~
**Do not proceed!**

The presence of a third number indicates that the label data had already been One-Hot encoded when the encoding was applied again (for example, because the Step 7 cell was run twice). If you try to use this data, your model will crash during training. You need to go back to Step 1 and start over.
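
The sketch below shows how a third number can appear. The `one_hot` function is a NumPy stand-in written for this illustration; it is meant to mimic how `to_categorical()` treats shapes, which is an assumption on my part, so check the Keras documentation for the exact behavior.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """NumPy stand-in for one-hot encoding (not the Keras implementation)."""
    labels = np.asarray(labels, dtype="int64")
    out_shape = labels.shape
    # A trailing axis of length 1, as in shape (N, 1), is dropped
    if len(out_shape) > 1 and out_shape[-1] == 1:
        out_shape = out_shape[:-1]
    flat = labels.reshape(-1)
    if num_classes is None:
        num_classes = int(flat.max()) + 1
    encoded = np.eye(num_classes, dtype="float32")[flat]
    return encoded.reshape(out_shape + (num_classes,))

# Synthetic labels for a dataset with 8 classes
raw_Y = np.array([[0], [3], [7], [2]])

once = one_hot(raw_Y, num_classes=8)
print(f"Encoded once:  {once.shape}")   # (4, 8)

# Encoding the already-encoded array treats its 0/1 values as 2 classes
twice = one_hot(once)
print(f"Encoded twice: {twice.shape}")  # (4, 8, 2)
```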

### **Step - 8: Create and Compile CNN neural network model**

In the cell below, build a convolutional neural network (CNN) model to classify the images in your particular dataset into different classes.

Use the code for the CNN models shown in Step 5 of Class_06_1 as a template.

### **Setting the `input_dim`**

In a CNN model, `input_dim` refers to the dimensions of the input data that the model will process. It includes the height, width, and number of channels (color depth) of the images. For example, if you’re working with RGB images of size 64x64 pixels, the `input_dim` would be (64, 64, 3). This ensures that the model architecture matches the shape of your data. It’s the initial layer's responsibility to match this input shape, setting the stage for the entire convolutional process.

It is up to you to set the correct values for the `input_dim`. These values were printed out in **Step 4**.
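
One way to avoid typing the wrong values by hand is to read them from the image array itself, since everything after the first number of the `shape` is the height, width and channel count. The array below is a synthetic stand-in for your own `train_X`.

```python
import numpy as np

# Synthetic stand-in for train_X: 10 RGB images of 64 x 64 pixels
demo_train_X = np.zeros((10, 64, 64, 3), dtype=np.float32)

# Drop the first number (the image count) to get (height, width, channels)
input_dim = demo_train_X.shape[1:]
print(f"input_dim: {input_dim}")  # input_dim: (64, 64, 3)
```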

**IMPORTANT WARNING:**

Probably the most common error students encounter in creating a CNN model is putting in the wrong input dimensions (`input_dim`). If your model crashes almost immediately after training starts, it is probably due to the wrong values for your `input_dim` variable.

### **Setting the `learning_rate`**

Choosing the optimal learning rate for a CNN involves some experimentation and fine-tuning. You can start with a learning rate of 0.0001. If training proceeds smoothly, that's great. However, if you encounter a problem, you could increase or decrease the learning rate to see if that resolves the issue.


In [None]:
# Insert your code for Step - 8 here


### **Step - 9: Train the Neural Network**

In the cell below, write the Python code to train the neural network that you constructed and compiled above.

Set your number of EPOCHS to 100. When it comes to BATCH_SIZE, you can set it to either `64` or `128`.

### **Setting the `BATCH_SIZE`**

Batch size in a CNN (or any neural network) refers to the number of training examples used in one forward and backward pass through the network.

##### **Why It Matters:**
* **Memory Management:** Larger batch sizes can make better use of your GPU's memory, leading to faster training times. However, they require more memory.
* **Gradient Estimates:** Smaller batch sizes provide noisier estimates of the gradient, which can help the model escape local minima and potentially improve generalization.
* **Training Speed:** Larger batches tend to converge faster in terms of iterations but might take longer in terms of wall-clock time due to the increased computations.

##### **Choosing Batch Size:**
* Common choices are powers of 2 (e.g., 32, 64, 128), as they align well with memory allocation in hardware.

* Experiment with different sizes to see what works best for your specific dataset and model architecture.

The sweet spot for batch size often depends on your specific hardware and dataset, so some experimentation is usually required.

### **Setting the `steps_per_epoch`**

**steps_per_epoch** defines how many batches of data the model will train on in one epoch. To cover the whole training set once per epoch, set it to the number of images in `train_X` divided by the batch size. In Class_06_1, the `STEPS_PER_EPOCH` was set automatically using the following code chunk:

~~~text
STEPS_PER_EPOCH = len(train_X) // BATCH_SIZE
~~~
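
As a worked example of this arithmetic, assume a hypothetical training set of 10,000 images and a batch size of 64. Because `//` is floor division, any images left over after the last full batch are not counted as a step.

```python
import math

NUM_TRAIN_IMAGES = 10_000  # hypothetical value; in practice use len(train_X)
BATCH_SIZE = 64

# Floor division: number of full batches per epoch
STEPS_PER_EPOCH = NUM_TRAIN_IMAGES // BATCH_SIZE

# Images that do not fit into a full batch
leftover_images = NUM_TRAIN_IMAGES % BATCH_SIZE

# For comparison: the step count if the final partial batch were included
steps_including_partial = math.ceil(NUM_TRAIN_IMAGES / BATCH_SIZE)

print(STEPS_PER_EPOCH)          # 156
print(leftover_images)          # 16
print(steps_including_partial)  # 157
```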



In [None]:
# Insert your code for Step - 9 here



## **Evaluating Model's Training**

Now that we have trained our model, let's look at how it changed during its training.

### **Step 10: Plot `accuracy` and `val_accuracy`**

In the cell below, write the code to plot the `accuracy` and the `val_accuracy` of your model during each epoch in the training cycle.

In [None]:
# Insert your code for Step 10 here



### **Step 11: Compute Accuracy Score with Validation Data**

In the cell below, write the code to compute the accuracy score of your trained model using the validation dataset (`val_X`, `val_Y`). Print out your accuracy score.
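
Conceptually, the accuracy score is the fraction of images whose predicted class matches the true class. The sketch below works through that calculation on made-up One-Hot labels and made-up predicted probabilities; the function name `accuracy_from_one_hot` is hypothetical, and your own code may rely on a library function instead.

```python
import numpy as np

def accuracy_from_one_hot(true_one_hot, predicted_probs):
    """Fraction of rows where the most probable class equals the true class."""
    true_classes = np.argmax(true_one_hot, axis=1)
    predicted_classes = np.argmax(predicted_probs, axis=1)
    return float(np.mean(true_classes == predicted_classes))

# Made-up One-Hot labels for 4 images and 3 classes
demo_val_Y = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
])

# Made-up model outputs; the third image is misclassified as class 1
demo_probs = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.6, 0.3],
])

score = accuracy_from_one_hot(demo_val_Y, demo_probs)
print(f"Accuracy: {score}")  # Accuracy: 0.75
```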

In [None]:
# Insert your code for Step 11 here



### **Step 12: Plot Image with Label**

In the cell below, write the code to print out image `10` from your test image set (`test_X`) with its label as a title. Make your label reflect what your image is -- don't just use the labels in the Class_06_1 code.

In [None]:
# Insert your code for Step 12 here



### **Step 13: Plot 4 Frames with Label**

In the cell below, write the code to generate a 2 x 2 plot showing 4 images from the training dataset along with their labels. As above, make your labels reflect the type of images in your particular dataset.

In [None]:
# Insert your code for Step 13 here



## **Assignment Turn-in**

When you have completed and run all of the code cells, use the **File --> Print.. --> Save to PDF** to generate a PDF of your Colab notebook. Save your PDF as `Assignment_01.lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas.