<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/master/Class_05_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# BIO 1173: Intro Computational Biology

**Module 5: Regularization and Dropout**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)

### Module 5 Material

* Part 5.1: Part 5.1: Introduction to Regularization: Ridge and Lasso
* **Part 5.2: Using K-Fold Cross Validation with Keras**
* Part 5.3: Using L1 and L2 Regularization with Keras to Decrease Overfitting
* Part 5.4: Drop Out for Keras to Decrease Overfitting
* Part 5.5: Benchmarking Keras Deep Learning Regularization Techniques



# Google CoLab Instructions

The following code ensures that Google CoLab is running the correct version of TensorFlow.
  Running the following code will map your GDrive to ```/content/drive```.

In [1]:
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    COLAB = True
    print("Note: using Google CoLab")
    %tensorflow_version 2.x
except:
    print("Note: not using Google CoLab")
    COLAB = False

Note: not using Google CoLab


### Lesson Setup

Run the next code cell to load necessary packages

In [1]:
# You MUST run this code cell first

from sklearn import metrics
from sklearn.model_selection import KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from scipy.stats import zscore
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd
import time

import os
import shutil
path = '/'
memory = shutil.disk_usage(path)
dirpath = os.getcwd()
print("Your current working directory is : " + dirpath)
print("Disk", memory)

Your current working directory is : C:\Users\David\BIO1173\Class_05_2
Disk usage(total=4000108531712, used=992608657408, free=3007499874304)


# Part 5.2: Using K-Fold Cross-validation with Keras

You can use cross-validation for a variety of purposes in predictive modeling:

* Generating out-of-sample predictions from a neural network
* Estimate a good number of epochs to train a neural network for (early stopping)
* Evaluate the effectiveness of certain hyperparameters, such as activation functions, neuron counts, and layer counts

Cross-validation uses several folds and multiple models to provide each data segment a chance to serve as both the validation and training set. Figure 5.CROSS shows cross-validation.

**Figure 5.CROSS: K-Fold Crossvalidation**
![K-Fold Crossvalidation](https://biologicslab.co/BIO1173/images/class_1_kfold.png "K-Fold Crossvalidation")

It is important to note that each fold will have one model (neural network). To generate predictions for new data (not present in the training set), predictions from the fold models can be handled in several ways:

* Choose the model with the highest validation score as the final model.
* Preset new data to the five models (one for each fold) and average the result (this is an [ensemble](https://en.wikipedia.org/wiki/Ensemble_learning)).
* Retrain a new model (using the same settings as the cross-validation) on the entire dataset. Train for as many epochs and with the same hidden layer structure.

Generally, I prefer the last approach and will retrain a model on the entire data set once I have selected hyper-parameters. Of course, I will always set aside a final holdout set for model validation that I do not use in any aspect of the training process.

## Regression vs Classification K-Fold Cross-Validation

Regression and classification are handled somewhat differently concerning cross-validation. Regression is the simpler case where you can break up the data set into K folds with little regard for where each item lands. For regression, the data items should fall into the folds as randomly as possible. It is also important to remember that not every fold will necessarily have the same number of data items. It is not always possible for the data set to be evenly divided into K folds. For regression cross-validation, we will use the Scikit-Learn class **KFold**.

Cross-validation for classification could also use the **KFold** object; however, this technique would not ensure that the class balance remains the same in each fold as in the original. The balance of classes that a model was trained on must remain the same (or similar) to the training set. Drift in this distribution is one of the most important things to monitor after a trained model has been placed into actual use. Because of this, we want to make sure that the cross-validation itself does not introduce an unintended shift. This technique is called stratified sampling and is accomplished by using the Scikit-Learn object **StratifiedKFold** in place of **KFold** whenever you use classification. In summary, you should use the following two objects in Scikit-Learn:

* **KFold** When dealing with a regression problem.
* **StratifiedKFold** When dealing with a classification problem.

The following two sections demonstrate cross-validation with classification and regression. 

## Out-of-Sample Regression Predictions with K-Fold Cross-Validation

The following code trains the simple dataset using a 5-fold cross-validation. The expected performance of a neural network of the type trained here would be the score for the generated out-of-sample predictions. We begin by preparing a feature vector using the **jh-simple-dataset** to predict age. This model is set up as a regression problem.

### Example 1A: Out-of-Sample Regression Predictions with K-Fold Cross-Validation



In [2]:
# Example 1A

# Read the data set
aqDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/apple_quality.csv",
    na_values=['NA','?'])

# Generate dummies for Quality
aqDF = pd.concat([aqDF,pd.get_dummies(aqDF['Quality'],prefix="Quality")],axis=1)
aqDF.drop('Quality', axis=1, inplace=True)

# Standardize ranges
aqDF['Size'] = zscore(aqDF['Size'])
aqDF['Weight'] = zscore(aqDF['Weight'])
aqDF['Sweetness'] = zscore(aqDF['Sweetness'])
aqDF['Crunchiness'] = zscore(aqDF['Crunchiness'])
# aqDF['Juiciness'] = zscore(aqDF['Juiciness'])
aqDF['Ripeness'] = zscore(aqDF['Ripeness'])
aqDF['Acidity'] = zscore(aqDF['Acidity'])

# Generate X
aqX_columns = aqDF.columns.drop('Juiciness').drop('A_id')
aqX = aqDF[aqX_columns].values
aqX = np.asarray(aqX).astype('float32')

# Generate Y
aqY = aqDF['Juiciness'].values
aqY = np.asarray(aqY).astype('float32')

# Print aqX
#print(aqX[0:4])

### Example 1B: Out-of-Sample Regression Predictions with K-Fold Cross-Validation

Now that the feature vector is created a 5-fold cross-validation can be performed to generate out-of-sample predictions.  We will assume 500 epochs and not use early stopping.  Later we will see how we can estimate a more optimal epoch count.

In [3]:
# Example 1B: Out-of-Sample Regression Predictions with K-Fold Cross-Validation

# Set EPOCHS
EPOCHS=500

# Record the start time in st
st = time.time()

# Cross-Validate
kf = KFold(5, shuffle=True, random_state=42) # Use for KFold classification
oos_y = []
oos_pred = []

fold = 0
for train, test in kf.split(aqX):
    fold+=1
    print(f"Fold #{fold}")
        
    aqX_train = aqX[train]
    aqY_train = aqY[train]
    aqX_test = aqX[test]
    aqY_test = aqY[test]
    
    aqModel = Sequential()
    aqModel.add(Dense(20, input_dim=aqX.shape[1], activation='relu'))
    aqModel.add(Dense(10, activation='relu'))
    aqModel.add(Dense(1))
    aqModel.compile(loss='mean_squared_error', optimizer='adam')
    
    aqModel.fit(aqX_train,aqY_train,validation_data=(aqX_test,aqY_test),verbose=0,
              epochs=EPOCHS)
    
    aqPred = aqModel.predict(aqX_test)
    
    oos_y.append(aqY_test)
    oos_pred.append(aqPred)    

    # Measure this fold's RMSE
    score = np.sqrt(metrics.mean_squared_error(aqPred,aqY_test))
    print(f"Fold score (RMSE): {score}")

# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
score = np.sqrt(metrics.mean_squared_error(oos_pred,oos_y))
print(f"Final, out of sample score (RMSE): {score}")    
    
# Write the cross-validated prediction
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
oosDF = pd.concat( [aqDF, oos_y, oos_pred],axis=1 )
#oosDF.to_csv(filename_write,index=False)

# Record the end time in et
et = time.time()

# Print out time
seconds = int((et-st))
seconds = seconds % (24 * 3600)
hour = seconds // 3600
seconds %= 3600
minutes = seconds // 60
seconds %= 60
print("Elapsed time = %d:%02d:%02d" % (hour, minutes, seconds))

Fold #1
Fold score (RMSE): 0.9102703928947449
Fold #2
Fold score (RMSE): 0.9169977307319641
Fold #3
Fold score (RMSE): 0.9683015942573547
Fold #4
Fold score (RMSE): 0.954668402671814
Fold #5
Fold score (RMSE): 1.029394507408142
Final, out of sample score (RMSE): 0.9568834900856018
Elapsed time = 0:16:47


As you can see, the above code also reports the average number of epochs needed.  A common technique is to then train on the entire dataset for the average number of epochs required.

## Classification with Stratified K-Fold Cross-Validation

The following code trains and fits the **jh**-simple-dataset dataset with cross-validation to generate out-of-sample.  It also writes the out-of-sample (predictions on the test set) results.

It is good to perform stratified k-fold cross-validation with classification data.  This technique ensures that the percentages of each class remain the same across all folds.  Use the **StratifiedKFold** object instead of the **KFold** object used in the regression.

### Example 2A: Classification with Stratified K-Fold Cross-Validation

In [17]:
# Example 2A: Classification with Stratified K-Fold Cross-Validation

# Read the data set
obDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/ObesityDataSet.csv",
    na_values=['NA','?'])

# Map Gender
mapping = {'Male': 1, 'Female': 0}
obDF['Gender'] = obDF['Gender'].map(mapping)

# Map family_history_with_overweight
mapping = {'yes': 1, 'no': 0}
obDF['family_history_with_overweight'] = obDF['family_history_with_overweight'].map(mapping)

# Map FAVC
mapping = {'yes': 1, 'no': 0}
obDF['FAVC'] = obDF['FAVC'].map(mapping)

# Map SMOKE
mapping = {'yes': 1, 'no': 0}
obDF['SMOKE'] = obDF['SMOKE'].map(mapping)

# Map SCC
mapping = {'yes': 1, 'no': 0}
obDF['SCC'] = obDF['SCC'].map(mapping)

# Map NObeyesdad
mapping = {'Insufficient_Weight': 0,
            'Normal_Weight': 1,
            'Overweight_Level_I': 2,
            'Overweight_Level_II': 3,
            'Obesity_Type_I': 4,
            'Obesity_Type_II': 5,
            'Obesity_Type_III': 6}

obDF['NObeyesdad'] = obDF['NObeyesdad'].map(mapping)

# Generate dummies for CAEC
obDF = pd.concat([obDF,pd.get_dummies(obDF['CAEC'],prefix="CAEC")],axis=1)
obDF.drop('CAEC', axis=1, inplace=True)

# Generate dummies for CALC
obDF = pd.concat([obDF,pd.get_dummies(obDF['CALC'],prefix="CALC")],axis=1)
obDF.drop('CALC', axis=1, inplace=True)

# Generate dummies for MTRANS
obDF = pd.concat([obDF,pd.get_dummies(obDF['MTRANS'],prefix="MTRANS")],axis=1)
obDF.drop('MTRANS', axis=1, inplace=True)

# Standardize ranges
obDF['Height'] = zscore(obDF['Height'])
obDF['Weight'] = zscore(obDF['Weight'])

# Generate X
obX_columns = obDF.columns.drop('NObeyesdad')
obX = obDF[obX_columns].values
obX = np.asarray(obX).astype('float32')

# Generate Y
# obY = obDF['NObeyesdad'].values
# Generate dummies for 'NObeyesdad'

dummies = pd.get_dummies(obDF['NObeyesdad'],dtype=int) # Classification
OBtypes = dummies.columns
obY = dummies.values
#obDF = pd.concat([obDF,pd.get_dummies(obDF['NObeyesdad'],prefix="NObeyesdad")],axis=1)
#obY = obDF['NObeyesdad'].values
obY = np.asarray(obY).astype('float32')

# Print
# print (obX[0:4])

### Example 2B: Classification with Stratified K-Fold Cross-Validation


In [19]:
# Example 2B: Classification with Stratified K-Fold Cross-Validation

# Import libraries
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

# Record the start time in st
st = time.time()

# np.argmax(pred,axis=1)
# Cross-validate
# Use for StratifiedKFold classification
kf = StratifiedKFold(5, shuffle=True, random_state=42) 
    
oos_y = []
oos_pred = []
fold = 0

# Must specify y StratifiedKFold for
for train, test in kf.split(obX,obDF['NObeyesdad']):  
    fold+=1
    print(f"Fold #{fold}")
        
    obX_train = obX[train]
    obY_train = obY[train]
    obX_test = obX[test]
    obY_test = obY[test]
    
    obModel = Sequential()
    # Hidden 1
    obModel.add(Dense(50, input_dim=obX.shape[1], activation='relu')) 
    obModel.add(Dense(25, activation='relu')) # Hidden 2
    obModel.add(Dense(obY.shape[1],activation='softmax')) # Output
    obModel.compile(loss='categorical_crossentropy', optimizer='adam')

    obModel.fit(obX_train,obY_train,validation_data=(obX_test,obY_test),
              verbose=0, epochs=EPOCHS)
    
    pred = obModel.predict(obX_test)
    
    oos_y.append(obY_test)
    # raw probabilities to chosen class (highest probability)
    pred = np.argmax(pred,axis=1) 
    oos_pred.append(pred)  

    # Measure this fold's accuracy
    y_compare = np.argmax(obY_test,axis=1) # For accuracy calculation
    score = metrics.accuracy_score(y_compare, pred)
    print(f"Fold score (accuracy): {score}")

# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
oos_y_compare = np.argmax(oos_y,axis=1) # For accuracy calculation

score = metrics.accuracy_score(oos_y_compare, oos_pred)
print(f"Final score (accuracy): {score}")    
    
# Write the cross-validated prediction
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
oosDF = pd.concat( [obDF, oos_y, oos_pred],axis=1 )
#oosDF.to_csv(filename_write,index=False)

# Print out time
seconds = int((et-st))
seconds = seconds % (24 * 3600)
hour = seconds // 3600
seconds %= 3600
minutes = seconds // 60
seconds %= 60
print("Elapsed time = %d:%02d:%02d" % (hour, minutes, seconds))


Fold #1
Fold score (accuracy): 0.9692671394799054
Fold #2
Fold score (accuracy): 0.9715639810426541
Fold #3
Fold score (accuracy): 0.943127962085308
Fold #4
Fold score (accuracy): 0.9597156398104265
Fold #5
Fold score (accuracy): 0.9691943127962085
Final score (accuracy): 0.9625769777356703
Elapsed time = 23:40:46


## Training with both a Cross-Validation and a Holdout Set

If you have a considerable amount of data, it is always valuable to set aside a holdout set before you cross-validate. This holdout set will be the final evaluation before using your model for its real-world use. Figure 5. HOLDOUT shows this division.

**Figure 5. HOLDOUT: Cross-Validation and a Holdout Set**
![Cross Validation and a Holdout Set](https://biologicslab.co/BIO1173/images/class_3_hold_train_val.png "Cross-Validation and a Holdout Set")

The following program uses a holdout set and then still cross-validates.  

### Example 4: Training with both a Cross-Validation and a Holdout Set



In [23]:
# Example 4: Training with both a Cross-Validation and a Holdout Set

EPOCHS=500

import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore
from sklearn.model_selection import KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

# Record the start time in st
st = time.time()

# Cross-Validate
kf = KFold(5, shuffle=True, random_state=42) # Use for KFold classification
oos_y = []
oos_pred = []

fold = 0
for train, test in kf.split(obX):
    fold+=1
    print(f"Fold #{fold}")
        
    obX_train = obX[train]
    obY_train = obY[train]
    obX_test = obX[test]
    obY_test = obY[test]
    
    cvModel = Sequential()
    cvModel.add(Dense(20, input_dim=obX.shape[1], activation='relu'))
    cvModel.add(Dense(10, activation='relu'))
    cvModel.add(Dense(1))
    cvModel.compile(loss='mean_squared_error', optimizer='adam')
    
    cvModel.fit(obX_train,obY_train,validation_data=(obX_test,obY_test),verbose=0,
              epochs=EPOCHS)
    
    pred = cvModel.predict(obX_test)
    
    oos_y.append(obY_test)
    oos_pred.append(pred)    

    # Measure this fold's RMSE
    score = np.sqrt(metrics.mean_squared_error(pred,obY_test))
    print(f"Fold score (RMSE): {score}")

# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
score = np.sqrt(metrics.mean_squared_error(oos_pred,oos_y))
print(f"Final, out of sample score (RMSE): {score}")    
    
# Write the cross-validated prediction
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
# oosDF = pd.concat( [df, oos_y, oos_pred],axis=1 )
oosDF = pd.concat( [obDF, oos_y, oos_pred],axis=1 )
#oosDF.to_csv(filename_write,index=False)

# Print out time
seconds = int((et-st))
seconds = seconds % (24 * 3600)
hour = seconds // 3600
seconds %= 3600
minutes = seconds // 60
seconds %= 60
print("Elapsed time = %d:%02d:%02d" % (hour, minutes, seconds))


Fold #1


ValueError: y_true and y_pred have different number of output (1!=7)