<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/master/Class_03_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# BIO 1173: Intro Computational Biology

**Module 3: Introduction to TensorFlow**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)


### Module 3 Material

* Part 3.1: Deep Learning and Neural Network Introduction
* Part 3.2: Introduction to Tensorflow and Keras
* Part 3.3: Saving and Loading a Keras Neural Network
* **Part 3.4: Early Stopping in Keras to Prevent Overfitting**
* Part 3.5: Extracting Weights and Manual Calculation


# Google CoLab Instructions

The following code ensures that Google CoLab is running the correct version of TensorFlow.

In [None]:
try:
    %tensorflow_version 2.x
    COLAB = True
    print("Note: using Google CoLab")
except:
    print("Note: not using Google CoLab")
    COLAB = False

### Lesson Setup

Run the next code cell to load the necessary packages.

In [None]:
# You MUST run this code cell first
import tensorflow as tf
import numpy as np
import pandas as pd
import h5py

import os
import shutil
path = '/'
memory = shutil.disk_usage(path)
dirpath = os.getcwd()
print("Your current working directory is : " + dirpath)
print("Disk", memory)
print("Numpy version =", (np.__version__)) 
print("Tensorflow version =", (tf.__version__))
print("Available GPU acceleration =", tf.test.gpu_device_name())
print("Current version of h5py =", (h5py.__version__))

# Part 3.4: Early Stopping in Keras to Prevent Overfitting

It can be difficult to determine how many epochs to cycle through to train a neural network. **_Overfitting_** will occur if you train the neural network for too many epochs, and the neural network will not perform well on new data, despite attaining a good accuracy on the training set. Overfitting occurs when a neural network is trained to the point that it begins to memorize rather than generalize, as demonstrated in Figure 3.OVER. 

**Figure 3.OVER: Training vs. Validation Error for Overfitting**
![Training vs. Validation Error for Overfitting](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/class_3_training_val.png "Training vs. Validation Error for Overfitting")

It is important to segment the original dataset into several datasets:

* **Training Set**
* **Validation Set**
* **Holdout Set**

You can construct these sets in several different ways. The following programs demonstrate some of these.

The first method is a training and validation set. We use the training data to train the neural network until the validation set no longer improves. This attempts to stop at a near-optimal training point. This method will only give accurate "out of sample" predictions for the validation set; this is usually 20% of the data. The predictions for the training data will be overly optimistic, as these were the data that we used to train the neural network. Figure 3.VAL demonstrates how we divide the dataset.

**Figure 3.VAL: Training with a Validation Set**
![Training with a Validation Set](https://raw.githubusercontent.com/jeffheaton/t81_558_deep_learning/master/images/class_1_train_val.png "Training with a Validation Set")
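The split described above can be sketched in plain Python. This is only an illustrative sketch, not the method used later in this lesson: the function name `three_way_split`, the 20% fractions, and the seed value are assumptions chosen for the example, and it also carves out the holdout set mentioned in the list above.

```python
import random

def three_way_split(rows, val_frac=0.2, holdout_frac=0.2, seed=42):
    """Shuffle rows reproducibly, then cut them into train/validation/holdout."""
    shuffled = list(rows)
    # A seeded generator gives the same shuffle every time the code is re-run
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_frac)
    n_holdout = int(len(shuffled) * holdout_frac)
    validation = shuffled[:n_val]
    holdout = shuffled[n_val:n_val + n_holdout]
    train = shuffled[n_val + n_holdout:]
    return train, validation, holdout

# 150 row indices, the same size as the Iris dataset
train, validation, holdout = three_way_split(range(150))
print(len(train), len(validation), len(holdout))  # 90 30 30
```

Every row ends up in exactly one of the three sets, so the validation and holdout rows are never seen during training.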

## Early Stopping

We will now see an example of classification training with early stopping. We will train the neural network until the error no longer improves on the validation set.

### Example 1: Early Stopping with Classification: Iris data

The code in the cell below builds and trains a **_classification_** neural network called `irisModel`. The model is trained/fitted to the Iris flower dataset (`iris.csv`) downloaded from the course HTTPS server `corgi.genomelab.utsa.edu` and stored in the DataFrame `irisDF`. The independent variables (attributes) are the values for `sepal_length`, `sepal_width`, `petal_length` and `petal_width`, which are stored in the variable `irisX`. The dependent variable (response variable) that the model will be trained to classify is the column `species`, which contains the names of the three Iris species in the dataset: _Iris setosa_, _Iris versicolor_ and _Iris virginica_. Since the species names are entered in the `species` column as strings, it is necessary to use One-Hot Encoding to convert these strings into the values `0` and `1` using the command `pd.get_dummies()`. The variable holding the dependent values, `irisY`, is created from the `dummies.values` as shown below.
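To make the idea of One-Hot Encoding concrete, here is a minimal plain-Python sketch of roughly what `pd.get_dummies()` produces. The helper `one_hot` and the label strings are hypothetical and chosen only for illustration; the actual strings in `iris.csv` may differ.

```python
def one_hot(labels):
    """Return the sorted class names and one row of 0/1 flags per label."""
    classes = sorted(set(labels))
    rows = [[1 if label == cls else 0 for cls in classes] for label in labels]
    return classes, rows

classes, encoded = one_hot(["setosa", "virginica", "versicolor", "setosa"])
print(classes)  # ['setosa', 'versicolor', 'virginica']
print(encoded)  # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Each row contains a single `1` in the column of its class, which is the form the output layer of a classification network is trained to reproduce.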

In order to implement _Early Stopping_, it is first necessary to split the dataset into 4 separate groups: X train, X test, Y train and Y test using the function `train_test_split()`. The argument `test_size=0.25` tells the function that 75% of the data should be put into the two train sets (i.e. `irisX_train` and `irisY_train`) and the remaining 25% should be put into the two validation sets `irisX_test` and `irisY_test`. Since the separation of data into training and test sets is a random process, the argument `random_state=42` is used for teaching/demonstration purposes to ensure that the split is identical each time the code is re-run.

The model, `irisModel`, is a densely connected sequential neural network with two hidden layers. The 1st hidden layer has 50 neurons and the 2nd hidden layer has 25. The activation function for both hidden layers is `relu`. Since the function of this model is classification, the `softmax` activation function is used in the output layer. The model is compiled with the `categorical_crossentropy` loss function and the `adam` optimizer.
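As a rough illustration of why `softmax` suits classification, the sketch below computes it by hand in plain Python. The function and the three input scores are made up for the example and are not taken from `irisModel`; Keras performs the equivalent calculation internally.

```python
import math

def softmax(scores):
    """Convert raw output scores into probabilities that sum to 1."""
    largest = max(scores)
    # Subtracting the largest score keeps exp() numerically stable
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```

The largest raw score receives the largest probability, and the three values can be read as the model's confidence in each class.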

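The early stopping callback is stored in the variable `irisMonitor`, which is created with `EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=1, mode='auto', restore_best_weights=True)`.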

The meaning/function of the different arguments will be discussed below.

Finally, the model is fitted to the Iris data with the number of epochs set to **1000!** Don't worry, you won't have to wait forever for the training to complete--thanks to early stopping. 

In [None]:
# Example 1: Early stopping - iris data

import pandas as pd
import io
import requests
from sklearn import metrics
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping

irisDF = pd.read_csv("https://corgi.genomelab.utsa.edu/BIO1173/Datasets/iris.csv",
    na_values=['NA', '?'])

# Convert to numpy - Classification
irisX = irisDF[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].values
dummies = pd.get_dummies(irisDF['species']) # Classification
species = dummies.columns
irisY = dummies.values

# Split into validation and training sets
irisX_train, irisX_test, irisY_train, irisY_test = train_test_split(    
    irisX, irisY, test_size=0.25, random_state=42)

# Build neural network
irisModel = Sequential()
irisModel.add(Dense(50, input_dim=irisX.shape[1], activation='relu')) # Hidden 1
irisModel.add(Dense(25, activation='relu')) # Hidden 2
irisModel.add(Dense(irisY.shape[1],activation='softmax')) # Output
irisModel.compile(loss='categorical_crossentropy', optimizer='adam')

irisMonitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, 
        verbose=1, mode='auto', restore_best_weights=True)
irisModel.fit(irisX_train,irisY_train,validation_data=(irisX_test,irisY_test),
        callbacks=[irisMonitor],verbose=2,epochs=1000)


Even though the number of epochs was set to 1000, the training/fitting should have stopped much earlier. For example, on the machine on which this assignment was created, the training stopped after only 102 epochs.

There are a number of parameters that are specified to the **EarlyStopping** object. 

* **min_delta** This value should be kept small. It simply means the minimum change in error to be registered as an improvement.  Setting it even smaller will not likely have a great deal of impact.
* **patience** How long should the training wait for the validation error to improve?  
* **verbose** How much progress information do you want?
* **mode** In general, always set this to "auto".  This allows you to specify if the error should be minimized or maximized.  Consider accuracy, where higher numbers are desired vs log-loss/RMSE where lower numbers are desired.
* **restore_best_weights** This should always be set to `True`.  This restores the weights to the values they had at the epoch where the monitored validation score was best (for `val_loss`, the lowest value).  Unless you are manually tracking the weights yourself (we do not use this technique in this course), you should have Keras perform this step for you.
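The interaction between `min_delta` and `patience` can be sketched in plain Python. This is a simplified illustration of the stopping rule, not the Keras implementation: the function `early_stopping_epoch` and the list of validation losses are made up for the example.

```python
def early_stopping_epoch(val_losses, min_delta=1e-3, patience=5):
    """Return (stop_epoch, best_epoch), both 1-based, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best_loss - loss > min_delta:
            # The loss dropped by more than min_delta: record it and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out, so every requested epoch was used
    return len(val_losses), best_epoch

losses = [0.90, 0.70, 0.55, 0.50, 0.4995, 0.51, 0.52, 0.50, 0.53, 0.54, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(losses)
print(stop_epoch, best_epoch)  # 9 4
```

In this made-up run the drop from 0.50 to 0.4995 is smaller than `min_delta`, so it does not count as an improvement. Training stops at epoch 9, and with `restore_best_weights=True` the weights from epoch 4 would be the ones kept.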

As you can see from the output above, not all of the requested epochs were used.  The neural network training stopped once the error on the validation set no longer improved.

### **Exercise 1: Early Stopping with Classification: Heart Failure data**

In the cell below, write the code to read the Heart Failure dataset ("heart_failure.csv") from the course HTTPS server ("corgi.genomelab.utsa.edu") and store the data in a DataFrame called `hfDF`. As independent variables, **only** use the columns Age, RestingBP, Cholesterol, MaxHR and Oldpeak. You should name the variable holding the independent values `hfX`. 

Use the column `HeartDisease` as your dependent (response) variable. You will need to One-Hot Encode the `HeartDisease` values. Assign the dependent values to the variable `hfY` using the `dummies.values` as was done in Example 1. 

Use the `train_test_split(hfX, hfY, test_size=0.25, random_state=42)` function to create `hfX_train`, `hfX_test`, `hfY_train` and `hfY_test` datasets. 

Build a Sequential neural network called `hfModel` with 2 hidden layers with 50 neurons in the first layer and 25 neurons in the second layer. Use `relu` activation for these two hidden layers. The output layer should use `softmax` activation. Compile your model using `categorical_crossentropy` as the loss function with `adam` as the optimizer. 

After your model has been compiled, create a variable called `hfMonitor` to provide `EarlyStopping()` with the same arguments as shown in Example 1. 

Finally, train/fit the model for 1000 epochs using the `hfMonitor` to enable early stopping. 

In [None]:
# Insert your code for Exercise 1 here 



If your code is correct, your model `hfModel` should have stopped early, before reaching 30 epochs. Training stopped once `val_loss` had reached a minimum and then failed to improve for 5 consecutive epochs (`patience=5`). 

### Example 2: Compute Accuracy Score for `irisModel`

Let's see what effect early stopping might have on the accuracy of the `irisModel`.

The code below illustrates how to compute the accuracy score for the model `irisModel` created in Example 1 using the Keras `model.predict()` function and the `accuracy_score()` function from the `scikit-learn` metrics package.  

In [None]:
# Example 2: Compute accuracy score

from sklearn.metrics import accuracy_score

irisPred = irisModel.predict(irisX_test)
irisPredict_classes = np.argmax(irisPred,axis=1)
irisExpected_classes = np.argmax(irisY_test,axis=1)
irisCorrect = accuracy_score(irisExpected_classes,irisPredict_classes)
print(f"Accuracy: {irisCorrect}")

If your code is correct you should see something similar to the following output:

### **Exercise 2: Compute Accuracy Score for hfModel**

In the cell below, compute the accuracy score for your `hfModel` and print out the results.

In [None]:
# Insert your code for Exercise 2 here



If your code is correct you should see something similar to the following output:

Clearly your `hfModel` is better at predicting heart disease than the `irisModel` is at predicting the Iris flower species based on the sepal and petal dimensions.

## Early Stopping with Regression

The following code demonstrates how we can apply early stopping to a regression problem.  The technique is similar to the early stopping for classification code that we just saw.

### Example 3: Early Stopping with Regression

The code below uses the Apple Quality dataset downloaded from the course HTTPS server ("corgi.genomelab.utsa.edu") to create a DataFrame called `apDF`. When we use a neural network for **_classification_**, we design the neural network so there is one output neuron for each "class" that we are trying to classify. For example, with the Iris data there were three classes, one for each species (i.e. _Iris setosa_, _Iris versicolor_, and _Iris virginica_). So our neural network model, `irisModel`, had just 3 neurons in the output layer, one for each class (species). 

Suppose we could build a _perfect_ classification neural network for Iris flowers, with 100% accuracy. If we entered the sepal and petal dimensions of a flower taken from an _Iris setosa_ plant into this perfect neural network, the output neuron representing _Iris setosa_ would hold the value `1.0`, while the other two output neurons (representing the other two species) would hold the value `0`. 

In other words, our goal with a classification neural network is to have only one output neuron with a value close to `1` and the rest of the output neurons to have a value close to `0` when we "feed" the neural network the independent values (attributes) from a single subject. 
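The way a predicted class is read off the output neurons can be sketched in plain Python. The probabilities and one-hot rows below are invented for illustration and are not output from `irisModel`; the helper `argmax` mirrors what `np.argmax(..., axis=1)` does for each row.

```python
def argmax(values):
    """Return the index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical output-layer values for four flowers (each row sums to 1)
predicted_probs = [
    [0.97, 0.02, 0.01],
    [0.10, 0.30, 0.60],
    [0.20, 0.70, 0.10],
    [0.05, 0.50, 0.45],
]
# Hypothetical one-hot encoded true classes for the same four flowers
one_hot_truth = [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]]

predicted = [argmax(row) for row in predicted_probs]
expected = [argmax(row) for row in one_hot_truth]
accuracy = sum(p == e for p, e in zip(predicted, expected)) / len(expected)
print(predicted, expected, accuracy)  # [0, 2, 1, 1] [0, 2, 1, 2] 0.75
```

The last flower is misclassified because its largest output value sits in the wrong column, even though the correct class came a close second.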

The goal of a **_regression_** neural network is somewhat different. In a regression neural network there is only a **single** output neuron. When we use a regression neural network to make a prediction, the single output neuron will end up holding a floating point number. The value of this number is the **_prediction_**.

The neural network `apModel`, constructed in the cell below, is designed to predict the `Sweetness` of a particular apple based on its Size, Weight, Crunchiness, Juiciness, Acidity, Ripeness and Quality. According to the output from the command `apDF.describe()` (see Appendix), the `Sweetness` of an apple can range from a minimum of -6.894485 to a maximum of 6.374916. Therefore, when we make a prediction with the trained `apModel`, we would expect the value in the output neuron to fall roughly between -6.894485 and 6.374916.


In [None]:
# Example 3: Early stopping with regression


# Read the datafile
apDF = pd.read_csv("https://corgi.genomelab.utsa.edu/BIO1173/Datasets/apple_quality.csv", 
                    na_values=['NA', '?'])

# Define the mapping dictionary
mapping = {'bad': 0, 'good': 1}
# Map the string labels in the Quality column to integers
apDF['Quality'] = apDF['Quality'].map(mapping)

# Assign the independent variables (x)
apX = apDF[['Size', 'Weight', 'Crunchiness',
            'Juiciness', 'Acidity', 'Ripeness', 'Quality']].values

# Assign the dependent variables (y)
apY = apDF['Sweetness'].values


# Split into validation and training sets
apX_train, apX_test, apY_train, apY_test = train_test_split(    
    apX, apY, test_size=0.25, random_state=42)


# Build the neural network
apModel = Sequential()
apModel.add(Dense(25, input_dim=apX.shape[1], activation='relu')) # Hidden 1
apModel.add(Dense(10, activation='relu')) # Hidden 2
apModel.add(Dense(1)) # Output
apModel.compile(loss='mean_squared_error', optimizer='adam')

apMonitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, 
        patience=5, verbose=1, mode='auto',
        restore_best_weights=True)
apModel.fit(apX_train,apY_train,validation_data=(apX_test,apY_test),
        callbacks=[apMonitor], verbose=2,epochs=1000)


If your code is correct, training/fitting should have stopped before 100 epochs. 

### **Exercise 3: Early Stopping with Regression**

In the cell below, use the same Apple Quality dataset used in Example 3 to construct a regression neural network that predicts the `Acidity` of an apple instead of its `Sweetness`. 

Call your new model `ap2Model`, the independent variables `ap2X`, the dependent variable `ap2Y`, `ap2X_train` and so forth. As in Example 3, implement early stopping.  

In [None]:
# Insert your code for Exercise 3 here




If your code is correct, training/fitting should have stopped around 100 epochs. 

### Example 4: Compute the RMSE for the neural network `apModel`

When working with neural networks that perform a regression analysis, it is customary to use the Root Mean Square Error (RMSE) as a measurement of predictive accuracy. The code in the cell below shows how to compute the RMSE for the `apModel` neural network and then print out the result.
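As a small hand-checkable illustration of what the RMSE measures, the sketch below computes it in plain Python. The `actual` and `predicted` values are made up for the example and are unrelated to the apple data.

```python
import math

def rmse(actual, predicted):
    """Root Mean Square Error: square root of the average squared difference."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.0, 5.0]
# Squared errors are 0.25, 0, 1 and 1; their mean is 0.5625
print(rmse(actual, predicted))  # 0.75
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones. The result is expressed in the same units as the value being predicted.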

In [None]:
# Example 4 Compute the RMSE

# Measure RMSE error. 
apPred = apModel.predict(apX_test)
apScore = np.sqrt(metrics.mean_squared_error(apY_test, apPred))

# Print out the results
print(f"Final score (RMSE): {apScore}")

If your code is correct you should see something similar to the following output:

### **Exercise 4: Compute the RMSE for the neural network `ap2Model`**

In the cell below, write the code to compute RMSE for your neural network model `ap2Model` and then print out the results.

In [None]:
# Insert your code for Exercise 4 here



If your code is correct you should see something similar to the following output:

### Example 5: _Ad Hoc_ prediction for neural network `apModel`

The code in the cell below uses the `apModel` built and trained in Example 3 above, to predict the sweetness of the first apple in the Apple Quality dataset (`Apple0`). 

The complete list of **all** the attributes for `Apple0` is as follows:

* **Size =** -3.970049
* **Weight =** -2.512336
* **Sweetness =** 5.346330 
* **Crunchiness =** -1.012009
* **Juiciness =**  1.844900
* **Ripeness =** 0.329840
* **Acidity =** -0.491590
* **Quality =** 1

Keep in mind that the attribute `Sweetness` was the dependent (response) variable in Example 3. In other words, the neural network model `apModel` was designed and trained **_only_** to predict the sweetness of an apple given its other attributes.

**IMPORTANT:** The _order_ in which the `np.array()` values for `Apple0` are entered is critical. To use the model `apModel` to make a prediction, the X values in the prediction must be in exactly the same order as the X values in the training and test datasets. The features were assigned to the variable `apX` in Example 3 with `apDF[['Size', 'Weight', 'Crunchiness', 'Juiciness', 'Acidity', 'Ripeness', 'Quality']].values`.

Therefore, when creating the `np.array()` for `Apple0` in the cell below, the data values were entered in the following order: Size, Weight, Crunchiness, Juiciness, Acidity, Ripeness, Quality.

The order in which attributes are placed in the X variable `apX` is not important from the standpoint of training the neural network. However, when it comes to making _predictions_ with the trained neural network, the X values passed to it for the ad hoc example **must** be in the same order that was used during training for the model to work correctly. 
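One way to guard against ordering mistakes is to store the attributes by name and build the row from a fixed list of feature names. The sketch below is a plain-Python illustration under that assumption: the names `FEATURE_ORDER` and `to_feature_row` are hypothetical helpers, not part of Keras or of this course's code.

```python
# Same feature order that was used to build apX in Example 3
FEATURE_ORDER = ["Size", "Weight", "Crunchiness", "Juiciness",
                 "Acidity", "Ripeness", "Quality"]

def to_feature_row(attributes, feature_order=FEATURE_ORDER):
    """Arrange named attributes into the order the model was trained on."""
    missing = [name for name in feature_order if name not in attributes]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(attributes[name]) for name in feature_order]

# Attributes of Apple0 listed in an arbitrary order; Sweetness is ignored
apple0 = {"Quality": 1, "Acidity": -0.491590, "Sweetness": 5.346330,
          "Size": -3.970049, "Ripeness": 0.329840, "Weight": -2.512336,
          "Juiciness": 1.844900, "Crunchiness": -1.012009}

row = to_feature_row(apple0)
print(row)
```

The resulting list could then be wrapped as `np.array([row], dtype=float)` before being passed to `apModel.predict()`.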

In [None]:
# Example 5: Ad hoc prediction

# x = Size, Weight, Crunchiness, Juiciness, Acidity, Ripeness, Quality
Apple0 = np.array( [[-3.970049, -2.512336, -1.012009,
                     1.844900, -0.491590, 0.329840, 1 ]], dtype=float)

# Use the neural network to predict the sweetness of Apple0
apPred = apModel.predict(Apple0)

# Print out the results
print(f"Model predicts that the sweetness of Apple0 is: {apPred}")
print(f"The actual sweetness of Apple0 was: {apDF['Sweetness'].values[0]}")

If your code is correct you should see something similar to the following output:

### **Exercise 5: _Ad Hoc_ prediction for neural network `ap2Model`**

In the cell below, use the `ap2Model` that you created in **Exercise 3** to predict the acidity of the same apple (`Apple0`) used in Example 5 above. Use the values for the attributes of `Apple0`, shown above in Example 5, for creating the `np.array()` for your _ad hoc_ prediction. 

**BE CAREFUL:** You must make sure to use exactly the same order of these attributes as you used in creating your `ap2Model`.  

In [None]:
# Insert your code for Exercise 5 here



If your code is correct you should see something similar to the following output:

### SUMMARY

The primary objective of this lesson was to demonstrate how to implement **_Early Stopping_** in the training/fitting of neural networks. Clearly, the ability to stop training when the loss function on the validation set reaches a minimum can save a significant amount of time. However, perhaps more importantly, Early Stopping can prevent a neural network from **_overtraining_**. Overtraining occurs when the neural network can predict training examples with very high accuracy but cannot generalize to new data. In other words, the neural network starts learning **_specific details_** about the training data. While this improves the model's loss on the particular training set, the model will actually perform worse when presented with new data that it hasn't seen before. 


## **Lesson Turn-in**

When you have completed all of the code cells, and run them in sequential order (the last code cell should be number 12), use **File --> Print... --> Save to PDF** to generate a PDF of your JupyterLab notebook. Save your PDF as `Class_03_4.lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas.

## Appendix

The cell below shows the output from the command `apDF.head()`:

The cell below shows an abbreviated version of the output from the command `apDF.describe()`: