<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/main/Class_03_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

**Module 3: Introduction to TensorFlow**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)

### Module 3 Material

* Part 3.1: Deep Learning and Neural Network Introduction
* Part 3.2: Using Keras to Build Regression Models
* Part 3.3: Using Keras to Build Classification Models
* **Part 3.4: Saving and Loading a Keras Neural Network**
* Part 3.5: Early Stopping in Keras to Prevent Overfitting


### Google CoLab Instructions

The following code mounts your Google Drive, authenticates your Google account, and ensures that Google CoLab is running the correct version of TensorFlow.

In [None]:
# YOU MUST RUN THIS CODE CELL FIRST

try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: using Google CoLab")
    %tensorflow_version 2.x
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("Note: not using Google CoLab")
    COLAB = False

Make sure your Gmail address is included as the last line in the output above.

## Datasets for Class_03_4

In this class we will be using the **_Orange Quality_** dataset for the Examples and the **_Apple Quality_** dataset for the **Exercises**. Both of these datasets will be downloaded from the course HTTPS server [https://biologicslab.co](https://biologicslab.co).

### Orange Quality Dataset

[Orange Quality Dataset](https://www.kaggle.com/datasets/shruthiiiee/orange-quality)

**Description**

This tabular dataset contains numerical attributes describing the quality of oranges, including their size, weight, sweetness (Brix), acidity (pH), softness, harvest time, and ripeness, as well as categorical attributes such as color, variety, presence of blemishes, and overall quality. The number of oranges in the dataset (_n_=241) is relatively small, making it difficult to achieve robust learning using neural networks.

The list below shows the actual `Column Names` as well as a brief description of their contents.

`Column Name`: Description

* `Size (cm)` : Size of orange in cm
* `Weight (g)` : Weight of orange in g
* `Brix (Sweetness)`: Sweetness level in Brix
* `pH (Acidity)`: Acidity level (pH)
* `Softness (1-5)`: Softness rating (1-5)
* `HarvestTime (days)`: Days since harvest
* `Ripeness (1-5)` : Ripeness rating (1-5)
* `Color` : Fruit color
* `Variety` : Orange variety
* `Blemishes (Y/N)` : Presence of blemishes (Yes/No)
* `Quality (1-5)` : Overall quality rating (1-5)

Columns containing non-numeric values:
* `Color`
* `Variety`
* `Blemishes (Y/N)`


The number of examples in the Orange Quality dataset for each `Color` is as follows:
~~~text
Color
Deep Orange      75
Light Orange     64
Orange           38
Orange-Red       55
Yellow-Orange     9
~~~

The number of examples in the Orange Quality dataset for each `Variety` is as follows:
~~~text
Variety
Ambiance                 11
Blood Orange              2
California Valencia       7
Cara Cara                21
Clementine               14
Clementine (Seedless)     4
Hamlin                    5
Honey Tangerine           7
Jaffa                    11
Midsweet (Hybrid)         5
Minneola (Hybrid)        12
Moro (Blood)             16
Murcott (Hybrid)          3
Navel                    16
Navel (Early Season)      2
Navel (Late Season)       3
Ortanique (Hybrid)       13
Satsuma Mandarin         13
Star Ruby                18
Tangelo (Hybrid)          1
Tangerine                14
Temple                   18
Valencia                 11
Washington Navel         14
~~~

The number of examples in the Orange Quality dataset for each `Blemishes (Y/N)` value is as follows:
~~~text
Blemishes (Y/N)
N                          149
N (Minor)                    1
N (Split Skin)               1
Y (Bruise)                   1
Y (Bruising)                 9
Y (Minor Insect Damage)      6
Y (Minor)                   14
Y (Mold Spot)               10
Y (Scars)                   17
Y (Split Skin)               8
Y (Sunburn Patch)           23
Y (Sunburn)                  2
~~~

We will need to use this information about these non-numeric categorical values when we prepare the feature vector below.
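
As a quick sanity check on the tables above, the counts in each table should add up to the 241 oranges in the dataset. The sketch below copies the `Blemishes (Y/N)` counts from the table and also collapses the detailed labels into plain `Y`/`N` by looking at the first character of each label. The collapsing step is only an illustration of one possible way to simplify this column; it is not used in the examples that follow.

```python
from collections import Counter

# Counts copied from the `Blemishes (Y/N)` table above
blemish_counts = {
    "N": 149, "N (Minor)": 1, "N (Split Skin)": 1,
    "Y (Bruise)": 1, "Y (Bruising)": 9, "Y (Minor Insect Damage)": 6,
    "Y (Minor)": 14, "Y (Mold Spot)": 10, "Y (Scars)": 17,
    "Y (Split Skin)": 8, "Y (Sunburn Patch)": 23, "Y (Sunburn)": 2,
}

# Collapse each detailed label to its first character ("Y" or "N")
collapsed = Counter()
for label, count in blemish_counts.items():
    collapsed[label[0]] += count

# Total number of oranges across all labels
total = sum(blemish_counts.values())

print(dict(collapsed))
print(total)
```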

### Apple Quality Dataset

[Apple Quality Data Set](https://www.kaggle.com/datasets/nelgiriyewithana/apple-quality)

**Description:**

This is the same Apple Quality dataset that has been used several times in previous lessons. The Apple Quality dataset contains information about various attributes of a large sample of apples (_n_=4000). The size of this dataset is sufficiently large to enable robust learning with neural networks.

The dataset includes details such as fruit ID, size, weight, sweetness, crunchiness, juiciness, ripeness, acidity, and quality.

**Key Features:**

* **A_id:** Unique identifier for each fruit
* **Size:** Size of the fruit
* **Weight:** Weight of the fruit
* **Sweetness:** Degree of sweetness of the fruit
* **Crunchiness:** Texture indicating the crunchiness of the fruit
* **Juiciness:** Level of juiciness of the fruit
* **Ripeness:** Stage of ripeness of the fruit
* **Acidity:** Acidity level of the fruit
* **Quality:** Overall quality of the fruit

Columns containing non-numeric values:

* `Quality`: "good" and "bad"

# Part 3.4: Saving and Loading a Keras Neural Network

Complex neural networks will take a _long_ time to fit/train. It is helpful to be able to save a trained neural network so that you can reload it and use it again. A reloaded neural network will **not** require retraining.

Keras provides the following two formats for saving neural networks:

* **JSON** - Stores the neural network structure (no weights) in the [JSON file format](https://en.wikipedia.org/wiki/JSON).
* **Keras** - Stores the complete neural network (with weights) in the native Keras format.

Usually, you will want to save in native Keras format.

### Example 1: Build and Train a Classification Neural Network

The code in Example 1 builds and trains a neural network called `orModel` that can classify the `Quality` of an orange based on its physical and chemical characteristics.

The code in the cell below reads the Orange Quality dataset from the course HTTPS server and creates a DataFrame called `orDF` (i.e. "orange" DataFrame).

In order to create a feature vector, the 3 non-numeric columns in the dataset (`Color`, `Variety` and `Blemishes (Y/N)`) must be dealt with. The column `Color` is handled by mapping its strings to integers. The columns `Variety` and `Blemishes (Y/N)` are simply excluded (dropped) from the column list when generating the X-values.

While most of the numerical values are relatively small numbers, the weight values are an order of magnitude larger, so they are standardized to their z-scores. This brings the weight values closer in magnitude to the other numerical values and significantly improves the accuracy of the model. After the weights are standardized, the X-values are generated and stored in a NumPy array called `orX`.
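
To make the standardization step concrete, here is a minimal sketch of what a z-score computes, written with plain NumPy. The weight values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical weights in grams (not taken from the dataset)
weights = np.array([120.0, 150.0, 180.0, 210.0, 240.0])

# z-score: subtract the mean, then divide by the standard deviation.
# ddof=0 (population standard deviation) is NumPy's default and is also
# the default used by scipy.stats.zscore.
z = (weights - weights.mean()) / weights.std()

print(z.round(3))
```

After standardization the values have a mean of 0 and a standard deviation of 1, which puts them on a scale similar to the other columns.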

For this assignment we won't bother to split the data into Training/Validation sets, nor will we bother to shuffle the data.

Since we are building a classification neural network, we will need to One-Hot Encode the column `Quality (1-5)` which contains the Y-values. It should be noted that this column is already numeric, so we are **not** using One-Hot Encoding to replace string values with integers. Rather, One-Hot Encoding the Y-values is necessary to give them the **_correct format_** for the neural network.
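
The "correct format" is one column per class, with a 1 marking the class each sample belongs to. The sketch below builds such an array with plain NumPy, which is similar in spirit to what `pd.get_dummies()` produces. The quality ratings are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical quality ratings (numeric class labels)
quality = np.array([3.0, 4.5, 3.0, 5.0, 4.5])

# Sorted unique labels become the columns; `index` is each row's column
classes, index = np.unique(quality, return_inverse=True)

# Pick one row of the identity matrix per sample
one_hot = np.eye(len(classes), dtype="float32")[index]

print(classes)   # column order of the encoded array
print(one_hot)

# argmax recovers the column index, and `classes` maps it back to a label
recovered = classes[np.argmax(one_hot, axis=1)]
```

The number of columns in the encoded array is the number of distinct classes, which is why the output layer of a classifier is sized from the second dimension of the Y-values.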

Again, because we want `orModel` to act as a _classifier_, not a "regressor", we will use the `softmax` activation function in the output layer.

Finally, we will compile the model using `categorical_crossentropy` as the loss function instead of using `mean_squared_error`.   
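
For intuition, the two ideas above can be sketched in a few lines of NumPy. This is a simplified, per-sample illustration with made-up numbers; it is not the Keras implementation, which also handles batches and guards against taking the log of zero.

```python
import numpy as np

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)   # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def categorical_crossentropy(y_true, y_prob):
    """Negative log of the probability assigned to the true class.

    Assumes y_true is one-hot and y_prob contains no zeros.
    """
    return -np.sum(y_true * np.log(y_prob))

logits = np.array([2.0, 1.0, 0.1])      # hypothetical raw outputs
probs = softmax(logits)
y_true = np.array([1.0, 0.0, 0.0])      # one-hot: class 0 is correct

loss = categorical_crossentropy(y_true, probs)
print(probs.round(3), round(float(loss), 3))
```

The loss shrinks toward 0 as the probability assigned to the correct class approaches 1, which is what training tries to achieve.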

In [None]:
# Example 1: Build and Train Classification Model

import pandas as pd
import numpy as np
from scipy.stats import zscore
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

# Read dataset and create DataFrame --------------------------------
orDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/orange_quality.csv",
    na_values=['NA', '?'])


# Create feature vector ---------------------------------------------

# Map str to int
mapping = {'Orange':0,'Deep Orange':1,
           'Light Orange':2,'Orange-Red':3,
           'Yellow-Orange':4}
orDF['Color'] = orDF['Color'].map(mapping)

# Standardize to z-scores
orDF['Weight (g)'] = zscore(orDF['Weight (g)'])

# Generate X-values
orX = orDF[['Size (cm)', 'Weight (g)', 'Brix (Sweetness)', 'pH (Acidity)',
       'Softness (1-5)', 'HarvestTime (days)', 'Ripeness (1-5)',
        'Color']].values
orX = np.asarray(orX).astype('float32')

# Generate Y-values
dummies = pd.get_dummies(orDF['Quality (1-5)'], dtype=int) # Classification
orY = dummies.values
orY = np.asarray(orY).astype('float32')

# Build neural network-----------------------------------------
orModel = Sequential()
orModel.add(Dense(50, input_dim=orX.shape[1], activation='relu')) # Hidden 1
orModel.add(Dense(25, activation='relu')) # Hidden 2
orModel.add(Dense(orY.shape[1],activation='softmax')) # Output
orModel.compile(loss='categorical_crossentropy', optimizer='adam')

# Train model
orModel.fit(orX,orY,verbose=2,epochs=100)

Training should go pretty fast since there are only 241 oranges in the dataset.

If your code is correct, you should see something similar to the following output:

~~~text
Epoch 1/100
8/8 - 0s - loss: 8.4513 - 306ms/epoch - 38ms/step
Epoch 2/100
8/8 - 0s - loss: 4.5695 - 61ms/epoch - 8ms/step
Epoch 3/100
8/8 - 0s - loss: 2.3987 - 39ms/epoch - 5ms/step
Epoch 4/100
8/8 - 0s - loss: 2.1689 - 46ms/epoch - 6ms/step
Epoch 5/100
8/8 - 0s - loss: 2.0318 - 34ms/epoch - 4ms/step

.......................

Epoch 95/100
8/8 - 0s - loss: 1.1693 - 28ms/epoch - 3ms/step
Epoch 96/100
8/8 - 0s - loss: 1.1688 - 31ms/epoch - 4ms/step
Epoch 97/100
8/8 - 0s - loss: 1.1676 - 28ms/epoch - 4ms/step
Epoch 98/100
8/8 - 0s - loss: 1.1612 - 27ms/epoch - 3ms/step
Epoch 99/100
8/8 - 0s - loss: 1.1614 - 28ms/epoch - 4ms/step
Epoch 100/100
8/8 - 0s - loss: 1.1533 - 28ms/epoch - 3ms/step

<keras.callbacks.History at 0x2d31709f910>
~~~

### **Exercise 1: Build and Train a Classification Neural Network**

In the cell below build and train a new classification neural network called `apModel`.

Start by reading the dataset and creating a DataFrame called `apDF` ("apple" DataFrame) using this code chunk:
~~~text
apDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/apple_quality.csv",
    na_values=['NA', '?'])
~~~
The goal of your neural network model `apModel` will be to classify the apples in the Apple Quality dataset using the values in the following columns: 'Size', 'Weight', 'Sweetness', 'Crunchiness', 'Juiciness', 'Acidity' and 'Ripeness'. Since all of these columns are numeric, there is no need to pre-process any of them. Moreover, the numerical values all have a similar magnitude so you don't need to standardize any column to z-scores. Nor do you need to split the data into Training/Validation sets or shuffle the data. When you generate your X-values, you should call them `apX`.

Since you are building a classification neural network, you will need to One-Hot Encode the column containing the Y-values, `Quality`. This column is non-numeric, so by One-Hot Encoding it, you will accomplish two things: (1) replace string values with integers and (2) give the Y-values the correct format for the neural network. When you generate your Y-values, you should call them `apY`.

Again, because you want your `apModel` to act as a _classifier_, use the `softmax` activation function in the output layer. You should also compile your model using `categorical_crossentropy` as the loss function.

Finally, train (fit) your model on your X-values (`apX`) and your Y-values (`apY`) for 100 epochs.  

In [None]:
# Insert your code for Exercise 1 here



Training your model will take longer since the Apple Quality dataset contains 4,000 items instead of only 241 in the Orange Quality dataset.

If your code is correct you should see something similar to the following output:
~~~text
Epoch 1/100
125/125 - 1s - loss: 0.4866 - 746ms/epoch - 6ms/step
Epoch 2/100
125/125 - 0s - loss: 0.3555 - 440ms/epoch - 4ms/step
Epoch 3/100
125/125 - 0s - loss: 0.3111 - 444ms/epoch - 4ms/step
Epoch 4/100
125/125 - 0s - loss: 0.2861 - 435ms/epoch - 3ms/step
Epoch 5/100
125/125 - 0s - loss: 0.2686 - 497ms/epoch - 4ms/step

............................................

Epoch 95/100
125/125 - 0s - loss: 0.0764 - 429ms/epoch - 3ms/step
Epoch 96/100
125/125 - 0s - loss: 0.0734 - 429ms/epoch - 3ms/step
Epoch 97/100
125/125 - 0s - loss: 0.0741 - 447ms/epoch - 4ms/step
Epoch 98/100
125/125 - 1s - loss: 0.0732 - 507ms/epoch - 4ms/step
Epoch 99/100
125/125 - 0s - loss: 0.0723 - 418ms/epoch - 3ms/step
Epoch 100/100
125/125 - 0s - loss: 0.0719 - 416ms/epoch - 3ms/step
~~~

Notice that in this example, the loss went from `0.4866` after the 1st epoch down to `0.0719` after the 100th epoch.
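
The `125/125` at the start of each output line (and the `8/8` in the orange example) is the number of batches, or steps, processed per epoch. A minimal sketch of the arithmetic, assuming the default batch size of 32 that `fit()` uses when `batch_size` is not specified:

```python
import math

def steps_per_epoch(n_samples, batch_size=32):
    """Number of batches needed to pass over every sample once."""
    return math.ceil(n_samples / batch_size)

print(steps_per_epoch(241))    # Orange Quality dataset
print(steps_per_epoch(4000))   # Apple Quality dataset
```

The last batch of the orange dataset is only partly full, which is why the division is rounded up.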

### Example 2: Determine the Model's RMSE and Accuracy

The overall objective of this assignment is to convince you that you can save a _trained_ neural network to a file, and then later, recreate the neural network from the file, **_without changing the model's accuracy_**.

#### Why is this important?

As you already know, it can take significant time and processing power to train even the relatively small neural networks that we have created so far in this course. Neural networks that are used commercially (think "Siri" or "Alexa" or ChatGPT) are many times larger and require enormous resources as well as weeks (or months) to train. Obviously, if you had to train a neural network every time you wanted to use it, it wouldn't be very practical and there would be little interest in "AI". However, once the neural network has been trained, you can save it to a file, and then re-use it over and over again, without any loss in the neural network's ability to solve problems (i.e. loss in accuracy).

The code in the cell below measures the ability of the `orModel` neural network to predict an orange's quality (Y-value) based on its physical and chemical characteristics (X-values). Two measures of predictive ability are computed, the **_Root Mean Square Error (RMSE)_** and **_Accuracy_**. The code stores the RMSE value in the variable `orScore` and the Accuracy value in the variable `orCorrect` and then prints out these values.
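
To show what these two measures compute, here is a small NumPy-only sketch using made-up targets and predictions (4 samples, 2 classes). It is meant as an illustration of the arithmetic, not a replacement for the scikit-learn functions used in the example.

```python
import numpy as np

# Hypothetical one-hot targets and predicted probabilities (4 samples, 2 classes)
y_true = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype="float32")
y_pred = np.array([[0.8, 0.2], [0.4, 0.6], [0.3, 0.7], [0.1, 0.9]], dtype="float32")

# RMSE: square root of the mean squared difference over every element
rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))

# Accuracy: fraction of rows where the most probable class is the true class
accuracy = np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1))

print(f"RMSE: {rmse:.4f}")
print(f"Accuracy: {accuracy:.2f}")
```

Note that RMSE looks at how far every predicted probability is from its target, while Accuracy only asks whether the largest probability landed on the right class. In this sketch the third sample is the only one classified incorrectly.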


In [None]:
# Example 2: Determine the model's RMSE and Accuracy

import sklearn
from sklearn import metrics
from sklearn.metrics import accuracy_score

# Measure RMSE error.
orPred = orModel.predict(orX)
orScore = np.sqrt(metrics.mean_squared_error(orPred,orY))
print(f"Before save score (RMSE): {orScore}")

# Measure the accuracy
orPredict_classes = np.argmax(orPred,axis=1)
orExpected_classes = np.argmax(orY,axis=1)
orCorrect = accuracy_score(orExpected_classes,orPredict_classes)
print(f"Before save Accuracy: {orCorrect}")

If your code is correct you should see something similar to the following output:
~~~text
8/8 [==============================] - 0s 3ms/step
Before save score (RMSE): 0.26590439677238464
Before save Accuracy: 0.5518672199170125
~~~

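The `orModel` isn't doing that great (Accuracy: 0.5518672199170125, or about 55%). In a two-class problem, 55% would be barely better than flipping a coin (50%). Here the output layer has 8 quality classes, so uniform random guessing would be right only about 12.5% of the time. The model has clearly learned something, but it still misclassifies nearly half of the oranges it was trained on.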

However, as mentioned above, the number of items (oranges) in this dataset is relatively small (_n_=241). As you can see, this is really too few samples to train a neural network and expect to achieve a high level of accuracy.
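
One way to put an accuracy value in context is to compare it with simple baselines. The sketch below computes two of them: the accuracy expected from guessing uniformly at random, and the accuracy obtained by always predicting the most common class. The function names and the label list are hypothetical and are used only for illustration.

```python
from collections import Counter

def chance_accuracy(n_classes):
    """Expected accuracy of guessing uniformly at random among n_classes."""
    return 1.0 / n_classes

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels for illustration (not taken from the dataset)
labels = [3, 3, 4, 5, 3, 4, 3, 2, 3, 4]

print(chance_accuracy(2))          # coin flip between two classes
print(chance_accuracy(8))          # eight classes
print(majority_baseline(labels))   # always predict the most common label
```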

### **Exercise 2: Determine the Model's RMSE and Accuracy**

In the cell below, determine the RMSE and Accuracy of your `apModel`. Store the RMSE value in a variable called `apScore` and the Accuracy value in a variable called `apCorrect`. Print out these values as shown in Example 2 above.

In [None]:
# Insert your code for Exercise 2 here



If your code is correct you should see something similar to the following output:

~~~text
125/125 [==============================] - 0s 2ms/step
Before save score (RMSE): 0.13398617506027222
Before save Accuracy: 0.977
~~~

WOW! Your `apModel` "nailed-it" with better than 95% accuracy.

In the examples that follow, we save the trained neural network to a file, reload it without fitting it again, and perform another prediction. The RMSE should match the previous one exactly if we saved and reloaded the neural network correctly.

### Example 3: Save the Model

The code in the cell below saves the _trained_ neural network `orModel` as a file in two different file formats: JSON and the native Keras format.

The code saves each file in the current working directory (`save_path = "."`). The filename of the JSON file is `orModel.json` while the filename of the native Keras file is `orModel.keras`.

In [None]:
# Example 3: Save the model

import os

# Save path is the current directory
save_path = "."

# Save neural network structure to JSON (no weights)
orModel_json = orModel.to_json()
with open(os.path.join(save_path,"orModel.json"), "w") as json_file:
    json_file.write(orModel_json)

# Save the complete model (structure and weights) in the native Keras format
orModel.save(os.path.join(save_path,"orModel.keras"))

# Print out the files in the save directory
files = os.listdir(save_path)
print(files)

If your code is correct you should see something like the following output:

~~~text
['.config', 'orModel.json', 'orModel.keras', 'drive', 'sample_data']
~~~

After running the code cell above, there should now be two new files in your current working directory called `orModel.keras` and `orModel.json`.

### **Exercise 3: Save the Model**

In the code cell below save your _trained_ neural network `apModel` as a JSON file with the filename `apModel.json`, and as a native Keras file with the filename `apModel.keras`. Save both files to your current working directory (`save_path = "."`).

In [None]:
# Insert your code for Exercise 3 here



If your code is correct you should see something like the following output (the other files listed and their order depend on where you run the notebook):

~~~text
['.ipynb_checkpoints', 'apModel.keras', 'apModel.json', 'Class_03_4.ipynb', 'orModel.keras', 'orModel.json']
~~~

You should now see two more files containing your neural network.

The advantage of the JSON format is that it can be visually inspected -- just click on the file name in the file browser panel. The JSON file preserves the model's _architecture_, which you can see by looking at the JSON file, but it does not store the weights, so if you want to use a model rebuilt from it, you will need to train it all over again.
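
Because the architecture file is plain JSON, it can also be read with Python's standard `json` module. The sketch below parses a hand-written, heavily simplified stand-in for such a file; the real file written by `to_json()` contains many more fields, and its exact layout can vary between Keras versions.

```python
import json

# A hand-written, heavily simplified stand-in for a saved architecture file.
# The real file written by to_json() contains many more fields.
example_json = """
{
  "class_name": "Sequential",
  "config": {
    "layers": [
      {"class_name": "Dense", "config": {"units": 50, "activation": "relu"}},
      {"class_name": "Dense", "config": {"units": 25, "activation": "relu"}},
      {"class_name": "Dense", "config": {"units": 8, "activation": "softmax"}}
    ]
  }
}
"""

architecture = json.loads(example_json)

# Walk the layer list and collect each layer's type, size and activation
layer_info = [
    (layer["class_name"], layer["config"]["units"], layer["config"]["activation"])
    for layer in architecture["config"]["layers"]
]
for info in layer_info:
    print(info)

# Note: nothing here holds weight values, only the structure.
```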

On the other hand, you can't view the contents of the native Keras file in a text editor, since it is a binary file rather than plain text. Nevertheless, you should always save your model in the native Keras format since this **_preserves the architecture and the values of the weights_** of the model's connections. By preserving these values you don't have to waste time retraining the model.
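
Although a `.keras` file cannot be read as text, in recent Keras versions it is a zip archive, so its member files can at least be listed with Python's standard `zipfile` module. The sketch below uses a hypothetical helper, `list_archive_members()`, and demonstrates it on a small stand-in archive so that it runs without a trained model. The member names in the stand-in are assumptions about what a `.keras` file typically contains and may differ between versions.

```python
import os
import tempfile
import zipfile

def list_archive_members(path):
    """Return the names stored in a zip archive, or [] if it is not a zip."""
    if not zipfile.is_zipfile(path):
        return []
    with zipfile.ZipFile(path) as archive:
        return sorted(archive.namelist())

# Build a small stand-in archive so the sketch runs without a trained model.
# The member names are assumptions about what a .keras file typically holds.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.keras")
with zipfile.ZipFile(demo_path, "w") as archive:
    archive.writestr("config.json", "{}")
    archive.writestr("metadata.json", "{}")
    archive.writestr("model.weights.h5", b"placeholder bytes")

members = list_archive_members(demo_path)
print(members)
```

To try this on a real model, pass the path of a saved file such as `orModel.keras` to the helper instead of `demo_path`.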

### Example 4: Create New Model from Saved Model

Once a trained model has been saved in the native Keras format, it is a simple matter to read the file and make an exact copy of the model using the Keras function `load_model()` as shown in the cell below. In Example 4 we have given the re-loaded neural network the name `or2Model` to differentiate it from the one that we built previously.

In [None]:
# Example 4: Create new model from saved model

from tensorflow.keras.models import load_model

# Look in current folder
save_path = "."

# Create or2Model from the saved native Keras file
or2Model = load_model(os.path.join(save_path,"orModel.keras"))

# Print out model summary
or2Model.summary()


If your code is correct, you should see something similar to the following output:

~~~text
Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_9 (Dense)             (None, 50)                450       
                                                                 
 dense_10 (Dense)            (None, 25)                1275      
                                                                 
 dense_11 (Dense)            (None, 8)                 208       
                                                                 
=================================================================
Total params: 1,933
Trainable params: 1,933
Non-trainable params: 0
___________________________
~~~
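
The `Param #` column in a summary like the one above can be checked by hand. Each `Dense` layer has one weight for every input-unit pair plus one bias per unit. A minimal sketch, assuming the 8 input columns used to build `orX` and the layer sizes shown in the summary:

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return (n_inputs + 1) * n_units

# Layer sizes read from the summary above: 8 inputs -> 50 -> 25 -> 8
sizes = [8, 50, 25, 8]
per_layer = [dense_params(a, b) for a, b in zip(sizes, sizes[1:])]

print(per_layer)        # parameters in each Dense layer
print(sum(per_layer))   # total parameters
```

These are the values that a saved model file must store in order to reproduce the trained network.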

### **Exercise 4: Create New Model from Saved Model**

In the cell below create a new neural network called `ap2Model` from the file `apModel.keras` in your current directory. Print out a summary of your new `ap2Model`.

In [None]:
# Insert your code for Exercise 4 here



If your code is correct you should see something similar to the following output:

~~~text
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 50)                400       
                                                                 
 dense_1 (Dense)             (None, 25)                1275      
                                                                 
 dense_2 (Dense)             (None, 2)                 52        
                                                                 
=================================================================
Total params: 1,727
Trainable params: 1,727
Non-trainable params: 0
~~~

### Example 5: Compare the Predictive Accuracy of the Old and New Models

The code in the cell below computes the RMSE and the Accuracy of our new model `or2Model` and compares these values with those of the original `orModel`. We are trying to answer the question: does the re-loaded model have the same accuracy as the original model?

In [None]:
# Example 5: Determine new model's RMSE and Accuracy

# Measure RMSE error.
or2Pred = or2Model.predict(orX)
or2Score = np.sqrt(metrics.mean_squared_error(or2Pred,orY))
print(f"Before save score (RMSE): {orScore}")
print(f"After save score (RMSE) : {or2Score}")

# Measure the accuracy
or2Predict_classes = np.argmax(or2Pred,axis=1)
or2Expected_classes = np.argmax(orY,axis=1)
or2Correct = accuracy_score(or2Expected_classes,or2Predict_classes)
print(f"Before save Accuracy: {orCorrect}")
print(f"After save Accuracy : {or2Correct}")


If your code is correct you should see something similar to the following output:

~~~text
8/8 [==============================] - 0s 4ms/step
Before save score (RMSE): 0.26590439677238464
After save score (RMSE) : 0.26590439677238464
Before save Accuracy: 0.5518672199170125
After save Accuracy : 0.5518672199170125
~~~

As you can see, there is **_no difference_** in the accuracy of the saved model compared to the original one. Train Once...Use Anywhere!
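
The reason the values match exactly is that a model's predictions are fully determined by its weights, and saving to a file stores those weights without changing them. The sketch below illustrates the same idea with a toy one-layer "network" in plain NumPy, using `np.savez()` and `np.load()` as stand-ins for the Keras save and load functions. It is an analogy only and does not use Keras.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)

# A toy one-layer "network": weights W, bias b, and a few input rows
W = rng.normal(size=(4, 3))
b = rng.normal(size=3)
X = rng.normal(size=(5, 4))

def predict(X, W, b):
    """Linear layer followed by a row-wise softmax."""
    scores = X @ W + b
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

before = predict(X, W, b)

# Save the parameters to disk, then load them back
path = os.path.join(tempfile.mkdtemp(), "toy_weights.npz")
np.savez(path, W=W, b=b)
with np.load(path) as saved:
    W2, b2 = saved["W"], saved["b"]

after = predict(X, W2, b2)
print(np.array_equal(before, after))
```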

### **Exercise 5: Compare the Predictive Accuracy of the Old and New Models**

In the cell below write the code to compute the RMSE and Accuracy values for your `ap2Model` and print out these values along with the values for your original `apModel`.

In [None]:
# Insert your code for Exercise 5 here




If your code is correct you should see something similar to the following output:

~~~text
125/125 [==============================] - 0s 2ms/step
Before save score (RMSE): 0.13398617506027222
After save score (RMSE) : 0.13398617506027222
Before save Accuracy: 0.977
After save Accuracy : 0.977
~~~

Again, there is no loss in accuracy using a trained neural network that has been saved and then recreated from the saved `.keras` file.

## **Lesson Turn-in**

When you have completed and run all of the code cells, use **File --> Print --> Save to PDF** to generate a PDF of your Colab notebook. Save your PDF as `Class_03_4.lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas.