<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/master/Class_05_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# BIO 1173: Intro Computational Biology

**Module 5: Regularization and Dropout**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)

### Module 5 Material

* Part 5.1: Introduction to Regularization: Ridge and Lasso
* **Part 5.2: Using K-Fold Cross Validation with Keras**
* Part 5.3: Using L1 and L2 Regularization with Keras to Decrease Overfitting
* Part 5.4: Drop Out for Keras to Decrease Overfitting
* Part 5.5: Benchmarking Keras Deep Learning Regularization Techniques



# Google CoLab Instructions

The following code checks whether this notebook is running in Google CoLab. If it is, the code maps your GDrive to ```/content/drive``` and selects TensorFlow 2.x.

In [1]:
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    COLAB = True
    print("Note: using Google CoLab")
    %tensorflow_version 2.x
except:
    print("Note: not using Google CoLab")
    COLAB = False

Note: not using Google CoLab


### Lesson Setup

Run the next code cell to load necessary packages

In [1]:
# You MUST run this code cell first

from sklearn import metrics
from sklearn.model_selection import KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from scipy.stats import zscore
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd
import time

import os
import shutil
path = '/'
memory = shutil.disk_usage(path)
dirpath = os.getcwd()
print("Your current working directory is : " + dirpath)
print("Disk", memory)

Your current working directory is : C:\Users\David\BIO1173\Class_05_2
Disk usage(total=4000108531712, used=992608657408, free=3007499874304)


# Part 5.2: Using K-Fold Cross-validation with Keras

You can use cross-validation for a variety of purposes in predictive modeling:

* Generating out-of-sample predictions from a neural network
* Estimating a good number of epochs to train a neural network for (early stopping)
* Evaluating the effectiveness of certain hyperparameters, such as activation functions, neuron counts, and layer counts

Cross-validation uses several folds and multiple models to provide each data segment a chance to serve as both the validation and training set. Figure 5.CROSS shows cross-validation.

**Figure 5.CROSS: K-Fold Crossvalidation**
![K-Fold Crossvalidation](https://biologicslab.co/BIO1173/images/class_1_kfold.png "K-Fold Crossvalidation")
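
As a minimal sketch of the idea in the figure, the following code splits ten toy items into five folds and counts how often each item lands in the validation set. The names `items`, `kf_demo`, and `times_validated` are illustrative and do not appear elsewhere in this lesson.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy data items, identified only by their index
items = np.arange(10)

# Same settings as the lesson: 5 folds, shuffled, fixed seed
kf_demo = KFold(5, shuffle=True, random_state=42)

# Count how many times each item is used for validation
times_validated = np.zeros(len(items), dtype=int)

for fold, (train_idx, val_idx) in enumerate(kf_demo.split(items), start=1):
    times_validated[val_idx] += 1
    print(f"Fold #{fold}: train={train_idx}, validate={val_idx}")

# Every item is validated exactly once and trained on in the other folds
print("Times each item was validated:", times_validated)
```

Each item is in the validation set of exactly one fold and in the training set of the remaining four, which is why the concatenated fold predictions cover the whole dataset.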

It is important to note that each fold will have one model (neural network). To generate predictions for new data (not present in the training set), predictions from the fold models can be handled in several ways:

* Choose the model with the highest validation score as the final model.
* Present new data to the five models (one for each fold) and average the result (this is an [ensemble](https://en.wikipedia.org/wiki/Ensemble_learning)).
* Retrain a new model (using the same settings as the cross-validation) on the entire dataset. Train for the same number of epochs and with the same hidden layer structure.
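
The second and third options can be sketched as follows. To keep the example fast, it uses scikit-learn's `LinearRegression` as a stand-in for the neural network and synthetic data; both are assumptions made for illustration, as are the names `x_demo`, `y_demo`, and `x_new`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 3))
y_demo = x_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)
x_new = rng.normal(size=(4, 3))  # "new" data not seen in training

# Train one model per fold (LinearRegression stands in for a Keras model)
fold_models = []
for train_idx, _ in KFold(5, shuffle=True, random_state=42).split(x_demo):
    fold_models.append(LinearRegression().fit(x_demo[train_idx], y_demo[train_idx]))

# Option 2: ensemble, average the predictions of the five fold models
fold_preds = np.stack([m.predict(x_new) for m in fold_models])  # shape (5, 4)
ensemble_pred = fold_preds.mean(axis=0)

# Option 3: retrain one model with the same settings on the entire dataset
final_model = LinearRegression().fit(x_demo, y_demo)
final_pred = final_model.predict(x_new)

print("Ensemble prediction: ", ensemble_pred)
print("Retrained prediction:", final_pred)
```

With a Keras network the structure is the same: keep the fitted `model` from each fold in a list and average `model.predict(...)`, or build one fresh model and fit it on all of the data.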

Generally, I prefer the last approach and will retrain a model on the entire data set once I have selected hyper-parameters. Of course, I will always set aside a final holdout set for model validation that I do not use in any aspect of the training process.

## Regression vs Classification K-Fold Cross-Validation

Regression and classification are handled somewhat differently concerning cross-validation. Regression is the simpler case where you can break up the data set into K folds with little regard for where each item lands. For regression, the data items should fall into the folds as randomly as possible. It is also important to remember that not every fold will necessarily have the same number of data items. It is not always possible for the data set to be evenly divided into K folds. For regression cross-validation, we will use the Scikit-Learn class **KFold**.
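
A small sketch of the uneven-fold case: 13 items cannot be divided evenly into 5 folds, so **KFold** makes some folds one item larger than others. The 13-item toy array is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# 13 items do not divide evenly into 5 folds
toy = np.arange(13)

fold_sizes = [len(test_idx)
              for _, test_idx in KFold(5, shuffle=True, random_state=42).split(toy)]

print(fold_sizes)  # → [3, 3, 3, 2, 2]
```

The fold sizes differ by at most one item, and together they still cover every item exactly once.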

Cross-validation for classification could also use the **KFold** object; however, this technique would not ensure that the class balance remains the same in each fold as in the original. The balance of classes in the data a model is applied to should remain the same as (or similar to) the balance in its training set. Drift in this distribution is one of the most important things to monitor after a trained model has been placed into actual use. Because of this, we want to make sure that the cross-validation itself does not introduce an unintended shift. Keeping the class proportions of every fold equal to those of the whole dataset is called stratified sampling, and it is accomplished by using the Scikit-Learn object **StratifiedKFold** in place of **KFold** whenever you use classification. In summary, you should use the following two objects in Scikit-Learn:

* **KFold**: when dealing with a regression problem.
* **StratifiedKFold**: when dealing with a classification problem.
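
The difference between the two objects can be seen on a small, deliberately imbalanced toy label array (80 items of class 0 and 20 of class 1). The data and the helper name `class_counts_per_fold` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels: 80% class 0, 20% class 1
labels = np.array([0] * 80 + [1] * 20)
features = np.zeros((len(labels), 1))  # placeholder features; only labels matter here


def class_counts_per_fold(splitter):
    """Return [count of class 0, count of class 1] for each validation fold."""
    return [np.bincount(labels[test_idx], minlength=2).tolist()
            for _, test_idx in splitter.split(features, labels)]


plain_counts = class_counts_per_fold(KFold(5, shuffle=True, random_state=42))
strat_counts = class_counts_per_fold(StratifiedKFold(5, shuffle=True, random_state=42))

print("KFold:          ", plain_counts)   # class counts may vary from fold to fold
print("StratifiedKFold:", strat_counts)   # every fold keeps the 80/20 balance
```

With **StratifiedKFold**, every validation fold holds 16 items of class 0 and 4 of class 1, the same 80/20 balance as the full label array. **KFold** gives no such guarantee.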

## Body Performance dataset

For this lesson we will be using the [Body Performance dataset](https://www.kaggle.com/datasets/kukuroo3/body-performance-data).
The dataset records each person's performance grade together with age and several exercise performance measurements. This is a relatively large dataset with 12 categories of information about 13,393 individuals.

The 12 categories are:
* **age:** 20 to 64
* **gender:** M, F
* **height_cm:** (If you want to convert to feet, divide by 30.48)
* **weight_kg:**
* **body fat_%:**
* **diastolic:** diastolic blood pressure (min)
* **systolic:** systolic blood pressure (max)
* **gripForce:**
* **sit and bend forward_cm:**
* **sit-ups counts:**
* **broad jump_cm:**
* **class:** A, B, C, D (A: best) / stratified


In [86]:
# Read the data set
df = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/bodyPerformance.csv",
    na_values=['NA','?'])

# Set max rows and max columns
pd.set_option('display.max_rows', 8)
pd.set_option('display.max_columns', 8)

# Display DataFrame
display(df)

Unnamed: 0,age,gender,height_cm,weight_kg,...,sit and bend forward_cm,sit-ups counts,broad jump_cm,class
0,27.0,M,172.3,75.24,...,18.4,60.0,217.0,C
1,25.0,M,165.0,55.80,...,16.3,53.0,229.0,A
2,31.0,M,179.6,78.00,...,12.0,49.0,181.0,C
3,32.0,M,174.5,71.10,...,15.2,53.0,219.0,B
...,...,...,...,...,...,...,...,...,...
13389,21.0,M,179.7,63.90,...,1.1,48.0,167.0,D
13390,39.0,M,177.2,80.50,...,16.4,45.0,229.0,A
13391,64.0,F,146.1,57.70,...,9.2,0.0,75.0,D
13392,34.0,M,164.0,66.10,...,7.1,51.0,180.0,C


## Out-of-Sample Regression Predictions with K-Fold Cross-Validation

The following code trains a neural network on the Body Performance dataset using 5-fold cross-validation. The expected performance of a neural network of the type trained here would be the score for the generated out-of-sample predictions. We begin by preparing a feature vector from the dataset to predict **class**, encoded as the numbers 0 (A) through 3 (D). Because the network outputs a single number and is scored with RMSE, this model is set up as a regression problem.

In [84]:
# Read and preprocess the data

# Read the data set
dfBig = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/bodyPerformance.csv",
    na_values=['NA','?'])

# Only use 15% for neural network
df=dfBig.sample(frac=.15)

# Map gender
mapping = {'M': 1, 'F': 0}
df['gender'] = df['gender'].map(mapping)

# Map class
mapping =  {'A': 0,
            'B': 1,
            'C': 2,
            'D': 3}
df['class'] = df['class'].map(mapping)

# Generate list of columns for x
x_columns = df.columns.drop('class')

# Standardize values with their Z-scores
for col in x_columns:
    df[col] = zscore(df[col])

# Generate x-values as numpy array
x = df[x_columns].values
x = np.asarray(x).astype('float32')

# Generate y-values as numpy array
y = df['class'].values
y = np.asarray(y).astype('float32')

Now that the feature vector is created, a 5-fold cross-validation can be performed to generate out-of-sample predictions. We will assume 500 epochs and not use early stopping. Later we will see how we can estimate a better epoch count.

In [66]:
# Set up KFold cross-validation (regression) and train the model

# Set EPOCHS
EPOCHS=500

# Record the start time in st
st = time.time()

# Cross-Validate
kf = KFold(5, shuffle=True, random_state=42) # Use KFold for regression
oos_y = []
oos_pred = []

fold = 0
for train, test in kf.split(x):
    fold+=1
    print(f"Fold #{fold}")
        
    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]
    
    model = Sequential()
    model.add(Dense(20, input_dim=x.shape[1], activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    
    model.fit(x_train,y_train,validation_data=(x_test,y_test),verbose=0,
              epochs=EPOCHS)
    
    pred = model.predict(x_test)
    
    oos_y.append(y_test)
    oos_pred.append(pred)    

    # Measure this fold's RMSE
    score = np.sqrt(metrics.mean_squared_error(pred,y_test))
    print(f"Fold score (RMSE): {score}")

# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
score = np.sqrt(metrics.mean_squared_error(oos_pred,oos_y))
print(f"Final, out of sample score (RMSE): {score}")    
    
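# Write the cross-validated prediction
# Re-create the fold order (same random_state) so the rows of df line up
# with the concatenated out-of-sample values
oos_idx = np.concatenate([test for _, test in kf.split(x)])
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
oosDF = pd.concat([df.iloc[oos_idx].reset_index(drop=True), oos_y, oos_pred], axis=1)
#oosDF.to_csv(filename_write,index=False)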

# Record the end time in et
et = time.time()

# Print out time
seconds = int((et-st))
seconds = seconds % (24 * 3600)
hour = seconds // 3600
seconds %= 3600
minutes = seconds // 60
seconds %= 60
print("Elapsed time = %d:%02d:%02d" % (hour, minutes, seconds))

Fold #1
Fold score (RMSE): 0.6252933740615845
Fold #2
Fold score (RMSE): 0.6707077026367188
Fold #3
Fold score (RMSE): 0.698620080947876
Fold #4
Fold score (RMSE): 0.6351367235183716
Fold #5
Fold score (RMSE): 0.623394787311554
Final, out of sample score (RMSE): 0.6513092517852783
Elapsed time = 0:09:55


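The code above trains every fold for a fixed number of epochs (500), so it does not report how many epochs were actually needed. If you add early stopping, you can record the epoch at which each fold stopped; a common technique is to then train on the entire dataset for the average number of epochs required.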
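
The following sketch shows one way to collect an epoch count per fold and average it. To stay lightweight, it uses scikit-learn's `MLPRegressor` with its built-in early stopping as a stand-in for the Keras model, and synthetic data; both are assumptions for illustration and not the method used elsewhere in this lesson. Note that `MLPRegressor` stops on an internal validation split taken from the training data rather than on the fold's test set.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 4))
y_demo = x_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

MAX_EPOCHS = 500
epochs_needed = []

for fold, (train_idx, _) in enumerate(
        KFold(5, shuffle=True, random_state=42).split(x_demo), start=1):
    # Stand-in network: two hidden layers, stops when the internal
    # validation score has not improved for 5 consecutive epochs
    net = MLPRegressor(hidden_layer_sizes=(20, 10), early_stopping=True,
                       n_iter_no_change=5, max_iter=MAX_EPOCHS, random_state=0)
    net.fit(x_demo[train_idx], y_demo[train_idx])
    epochs_needed.append(net.n_iter_)  # epochs actually run for this fold
    print(f"Fold #{fold}: stopped after {net.n_iter_} epochs")

# Average epoch count to use when retraining on the entire dataset
mean_epochs = int(np.mean(epochs_needed))
print(f"Average epochs needed: {mean_epochs}")
```

In Keras, the analogous approach would be to pass an `EarlyStopping` callback to `model.fit` and record how many epochs each fold ran before averaging.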


## Classification with Stratified K-Fold Cross-Validation

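The following code trains a classification neural network on the Body Performance dataset with cross-validation to generate out-of-sample predictions. It also assembles the out-of-sample results (predictions on the test sets) into a DataFrame; the line that writes them to a CSV file is commented out.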

It is good to perform stratified k-fold cross-validation with classification data.  This technique ensures that the percentages of each class remain the same across all folds.  Use the **StratifiedKFold** object instead of the **KFold** object used in the regression.
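
The classification code in this section compiles the model with the `categorical_crossentropy` loss, which expects one-hot encoded targets, and then converts predicted probabilities back to class numbers with `np.argmax`. The round trip can be sketched on a few toy values; the names `classes`, `one_hot`, and `probs` and the probability values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy class labels already mapped to numbers (A=0, B=1, C=2, D=3)
classes = pd.Series([0, 2, 1, 3, 2])

# One-hot encode: one column per class, a single 1.0 in each row
one_hot = pd.get_dummies(classes).values.astype('float32')
print(one_hot.shape)  # → (5, 4)

# argmax along each row recovers the original class number
decoded = np.argmax(one_hot, axis=1)
print(decoded.tolist())  # → [0, 2, 1, 3, 2]

# The same call turns softmax-style probabilities into a predicted class
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.1, 0.2, 0.6, 0.1]])
print(np.argmax(probs, axis=1).tolist())  # → [0, 2]
```

This is why the accuracy calculation applies `np.argmax` to both the predictions and the one-hot `y_test` before comparing them.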

In [68]:
# Read and preprocess the data

import pandas as pd
from scipy.stats import zscore

# Read the data set
dfBig = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/bodyPerformance.csv",
    na_values=['NA','?'])

# Only use 15% for neural network
df=dfBig.sample(frac=.15)

# Map gender
mapping = {'M': 1, 'F': 0}
df['gender'] = df['gender'].map(mapping)

# Map class
mapping =  {'A': 0,
            'B': 1,
            'C': 2,
            'D': 3}
df['class'] = df['class'].map(mapping)

# Create list of X columns
x_columns = df.columns.drop('class')

# Standardize ranges with their Z-scores
for col in x_columns:
    df[col] = zscore(df[col])

# Generate x-values as numpy array
x = df[x_columns].values
x = np.asarray(x).astype('float32')

# Generate y-values as numpy array
dummies = pd.get_dummies(df['class']) # Classification
FitClass = dummies.columns
y = dummies.values
y = np.asarray(y).astype('float32')

In [69]:
# Classification with Stratified K-Fold Cross-Validation

import pandas as pd
import os
import numpy as np
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

# Record the start time in st
st = time.time()

# Cross-validate
# Use StratifiedKFold for classification
kf = StratifiedKFold(5, shuffle=True, random_state=42) 
    
oos_y = []
oos_pred = []
fold = 0

# StratifiedKFold needs the class labels (y) to stratify the folds
for train, test in kf.split(x, df['class']):
    fold+=1
    print(f"Fold #{fold}")
        
    x_train = x[train]
    y_train = y[train]
    x_test = x[test]
    y_test = y[test]
    
    model = Sequential()
    # Hidden 1
    model.add(Dense(50, input_dim=x.shape[1], activation='relu')) 
    model.add(Dense(25, activation='relu')) # Hidden 2
    model.add(Dense(y.shape[1],activation='softmax')) # Output
    model.compile(loss='categorical_crossentropy', optimizer='adam')

    model.fit(x_train,y_train,validation_data=(x_test,y_test),
              verbose=0, epochs=EPOCHS)
    
    pred = model.predict(x_test)
    
    oos_y.append(y_test)
    # raw probabilities to chosen class (highest probability)
    pred = np.argmax(pred,axis=1) 
    oos_pred.append(pred)  

    # Measure this fold's accuracy
    y_compare = np.argmax(y_test,axis=1) # For accuracy calculation
    score = metrics.accuracy_score(y_compare, pred)
    print(f"Fold score (accuracy): {score}")

# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
oos_y_compare = np.argmax(oos_y,axis=1) # For accuracy calculation

score = metrics.accuracy_score(oos_y_compare, oos_pred)
print(f"Final score (accuracy): {score}")    
    
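# Write the cross-validated prediction
# Re-create the fold order (same random_state) so the rows of df line up
# with the concatenated out-of-sample values
oos_idx = np.concatenate([test for _, test in kf.split(x, df['class'])])
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
oosDF = pd.concat([df.iloc[oos_idx].reset_index(drop=True), oos_y, oos_pred], axis=1)
#oosDF.to_csv(filename_write,index=False)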

# Record the end time in et
et = time.time()

# Print out time
seconds = int((et-st))
seconds = seconds % (24 * 3600)
hour = seconds // 3600
seconds %= 3600
minutes = seconds // 60
seconds %= 60
print("Elapsed time = %d:%02d:%02d" % (hour, minutes, seconds))

Fold #1
Fold score (accuracy): 0.6442786069651741
Fold #2
Fold score (accuracy): 0.6840796019900498
Fold #3
Fold score (accuracy): 0.6393034825870647
Fold #4
Fold score (accuracy): 0.6691542288557214
Fold #5
Fold score (accuracy): 0.6608478802992519
Final score (accuracy): 0.6595321055251369
Elapsed time = 0:09:48


## Training with both a Cross-Validation and a Holdout Set

If you have a considerable amount of data, it is always valuable to set aside a holdout set before you cross-validate. This holdout set will be the final evaluation before using your model for its real-world use. Figure 5.HOLDOUT shows this division.

**Figure 5.HOLDOUT: Cross-Validation and a Holdout Set**
![Cross Validation and a Holdout Set](https://biologicslab.co/BIO1173/images/class_3_hold_train_val.png "Cross-Validation and a Holdout Set")

The following program uses a holdout set and then still cross-validates.  
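
Before the full program, here is a minimal sketch of the division itself on 100 toy rows: 10% is set aside as the holdout, and only the remaining 90% is cross-validated. The toy arrays `x_toy` and `y_toy` and the fixed `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# 100 toy rows; y_toy doubles as a row identifier
x_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Keep a 10% holdout that cross-validation never sees
x_main, x_holdout, y_main, y_holdout = train_test_split(
    x_toy, y_toy, test_size=0.10, random_state=42)

print(f"Main set: {len(x_main)} rows, holdout: {len(x_holdout)} rows")

# Cross-validate only on the main set
fold_shapes = []
for fold, (train_idx, val_idx) in enumerate(KFold(5).split(x_main), start=1):
    fold_shapes.append((len(train_idx), len(val_idx)))
    print(f"Fold #{fold}: train on {len(train_idx)}, validate on {len(val_idx)}")

# No holdout row appears anywhere in the main set
overlap = set(y_main.tolist()) & set(y_holdout.tolist())
print("Rows shared between main and holdout:", len(overlap))
```

Because the holdout rows never enter any fold, the holdout score is an evaluation on data that played no part in training or validation.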

In [80]:
import pandas as pd
from scipy.stats import zscore
from sklearn.model_selection import train_test_split

# Read the data set
dfBig = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/bodyPerformance.csv",
    na_values=['NA','?'])

# Only use 20% for neural network
df=dfBig.sample(frac=.20)

# Map gender
mapping = {'M': 1, 'F': 0}
df['gender'] = df['gender'].map(mapping)

# Map class
mapping =  {'A': 0,
            'B': 1,
            'C': 2,
            'D': 3}
df['class'] = df['class'].map(mapping)


# Create list of X columns
x_columns = df.columns.drop('age')

# Standardize ranges with their Z-scores
for col in x_columns:
    df[col] = zscore(df[col])


# Convert to numpy - Regression (predict age)
x = df[x_columns].values
x = np.asarray(x).astype('float32')
y = df['age'].values
y = np.asarray(y).astype('float32')


Now that the data has been preprocessed, we are ready to build the neural network.

In [83]:
from sklearn.model_selection import train_test_split
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore
from sklearn.model_selection import KFold

# Keep a 10% holdout
x_main, x_holdout, y_main, y_holdout = train_test_split(    
    x, y, test_size=0.10) 

# Cross-validate
kf = KFold(5)
    
oos_y = []
oos_pred = []
fold = 0
for train, test in kf.split(x_main):        
    fold+=1
    print(f"Fold #{fold}")
        
    x_train = x_main[train]
    y_train = y_main[train]
    x_test = x_main[test]
    y_test = y_main[test]
    
    model = Sequential()
    model.add(Dense(20, input_dim=x.shape[1], activation='relu'))
    model.add(Dense(5, activation='relu'))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    
    model.fit(x_train,y_train,validation_data=(x_test,y_test),
              verbose=0,epochs=EPOCHS)
    
    pred = model.predict(x_test)
    
    oos_y.append(y_test)
    oos_pred.append(pred) 

    # Measure this fold's RMSE
    score = np.sqrt(metrics.mean_squared_error(pred,y_test))
    print(f"Fold score (RMSE): {score}")


# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
score = np.sqrt(metrics.mean_squared_error(oos_pred,oos_y))
print()
print(f"Cross-validated score (RMSE): {score}")    
    
# Evaluate the holdout set using the last fold's neural network
holdout_pred = model.predict(x_holdout)

score = np.sqrt(metrics.mean_squared_error(holdout_pred,y_holdout))
print(f"Holdout score (RMSE): {score}")    


Fold #1
Fold score (RMSE): 8.436017990112305
Fold #2
Fold score (RMSE): 8.459228515625
Fold #3
Fold score (RMSE): 8.455117225646973
Fold #4
Fold score (RMSE): 8.372004508972168
Fold #5
Fold score (RMSE): 8.859386444091797

Cross-validated score (RMSE): 8.518101692199707
Holdout score (RMSE): 8.80966567993164
