<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/master/Class_05_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

**Module 5: Regularization and Dropout**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)

### Module 5 Material

* Part 5.1: Introduction to Regularization: Ridge and Lasso
* **Part 5.2: Using K-Fold Cross Validation with Keras**
* Part 5.3: Using L1 and L2 Regularization with Keras to Decrease Overfitting
* Part 5.4: Drop Out for Keras to Decrease Overfitting
* Part 5.5: Benchmarking Keras Deep Learning Regularization Techniques



## Google CoLab Instructions

The following code cell mounts your Google Drive at `/content/drive` when it runs in Google CoLab and selects TensorFlow 2.x. Outside of CoLab it simply reports that CoLab is not in use.

In [1]:
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    COLAB = True
    print("Note: using Google CoLab")
    %tensorflow_version 2.x
except:
    print("Note: not using Google CoLab")
    COLAB = False

Note: not using Google CoLab


### Lesson Setup

Run the next code cell to load necessary packages

In [2]:
# You MUST run this code cell first

from sklearn import metrics
from sklearn.model_selection import KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from scipy.stats import zscore
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd
import time

import os
import shutil
path = '/'
memory = shutil.disk_usage(path)
dirpath = os.getcwd()
print("Your current working directory is : " + dirpath)
print("Disk", memory)

Your current working directory is : C:\Users\David\BIO1173\Class_05_2
Disk usage(total=4000108531712, used=994841530368, free=3005267001344)


# Part 5.2: Using K-Fold Cross-validation with Keras

**_K-fold validation_** is a technique used in machine learning to evaluate the performance and generalization ability of a model. In K-fold validation, the original dataset is randomly partitioned into K equal-sized subsets. The model is trained and evaluated K times, with each iteration using a different subset as the validation set and the remaining subsets as the training set. This allows for a more robust evaluation of the model's performance as it reduces the variance that may result from using a single train-test split. The final performance metric is typically averaged over the K iterations for a more reliable estimation of the model's performance.
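The partitioning described above can be seen on a toy example. The following is a minimal sketch, assuming ten made-up samples and the Scikit-Learn `KFold` class; the variable names (`X`, `validation_rows`) are illustrative only and are not used elsewhere in this lesson.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, identified by their row index (made-up data)
X = np.arange(10).reshape(-1, 1)

# Randomly partition the rows into K=5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)

validation_rows = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
    # Each iteration uses a different subset for validation
    print(f"Fold {fold}: train rows={train_idx}, validation rows={val_idx}")
    validation_rows.extend(val_idx.tolist())

# Every row serves as validation data exactly once across the K folds
print(sorted(validation_rows))
```

Each row appears in the validation set of exactly one fold and in the training set of the other four.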

You can use cross-validation for a variety of purposes in predictive modeling:

* Generating out-of-sample predictions from a neural network
* Estimating a good number of epochs to train a neural network for (early stopping)
* Evaluating the effectiveness of certain hyperparameters, such as activation functions, neuron counts, and layer counts

Cross-validation uses several folds and multiple models to provide each data segment a chance to serve as both the validation and training set. Figure 5.CROSS shows cross-validation.

**Figure 5.CROSS: K-Fold Crossvalidation**
![K-Fold Crossvalidation](https://biologicslab.co/BIO1173/images/class_1_kfold.png "K-Fold Crossvalidation")

It is important to note that each fold will have one model (neural network). To generate predictions for new data (not present in the training set), predictions from the fold models can be handled in several ways:

* Choose the model with the highest validation score as the final model.
* Present new data to the five models (one for each fold) and average the result (this is an [ensemble](https://en.wikipedia.org/wiki/Ensemble_learning)).
* Retrain a new model (using the same settings as the cross-validation) on the entire dataset. Train for as many epochs, and with the same hidden layer structure, as you used during cross-validation.

Generally, I prefer the last approach and will retrain a model on the entire data set once I have selected hyper-parameters. Of course, I will always set aside a final holdout set for model validation that I do not use in any aspect of the training process.
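The ensemble and retraining options can be sketched in a few lines. This is only an illustration under stated assumptions: a Scikit-Learn `LinearRegression` stands in for the neural network so the example runs quickly, and the data (`X`, `y`, `X_new`) are randomly generated rather than taken from this lesson's datasets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Made-up regression data (assumption: 100 samples, 3 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Train one model per fold (LinearRegression is a stand-in for a network)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_models = []
for train_idx, _ in kf.split(X):
    fold_models.append(LinearRegression().fit(X[train_idx], y[train_idx]))

# New data that was not part of the training set
X_new = rng.normal(size=(4, 3))

# Ensemble option: average the predictions of the five fold models
fold_preds = np.stack([m.predict(X_new) for m in fold_models])  # shape (5, 4)
ensemble_pred = fold_preds.mean(axis=0)

# Retraining option: fit one model with the same settings on all the data
final_model = LinearRegression().fit(X, y)
final_pred = final_model.predict(X_new)

print("Ensemble :", ensemble_pred)
print("Retrained:", final_pred)
```

With a neural network, the retraining option would reuse the layer structure and epoch count chosen during cross-validation.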

## Regression vs Classification K-Fold Cross-Validation

Regression and classification are handled somewhat differently concerning cross-validation. Regression is the simpler case where you can break up the data set into K folds with little regard for where each item lands. For regression, the data items should fall into the folds as randomly as possible. It is also important to remember that not every fold will necessarily have the same number of data items. It is not always possible for the data set to be evenly divided into K folds. For regression cross-validation, we will use the Scikit-Learn class **KFold**.
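To illustrate the point about uneven folds, here is a small sketch that assumes 13 made-up samples split into 5 folds. The names `X` and `fold_sizes` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold

# 13 samples cannot be divided evenly into 5 folds
X = np.arange(13).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Count how many items land in the validation part of each fold
fold_sizes = [len(val_idx) for _, val_idx in kf.split(X)]
print(fold_sizes)
```

The fold sizes differ by at most one item, and together they still cover all 13 samples.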

Cross-validation for classification could also use the **KFold** object; however, this technique would not ensure that the class balance in each fold remains the same as in the original dataset. The balance of classes in the data a model encounters should remain the same as (or similar to) the balance in its training set. Drift in this distribution is one of the most important things to monitor after a trained model has been placed into actual use. Because of this, we want to make sure that the cross-validation itself does not introduce an unintended shift. Splitting the data so that every fold preserves the class proportions is called stratified sampling, and it is accomplished by using the Scikit-Learn object **StratifiedKFold** in place of **KFold** whenever you use classification. In summary, you should use the following two objects in Scikit-Learn:

* **KFold** When dealing with a regression problem.
* **StratifiedKFold** When dealing with a classification problem.
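The difference between the two objects can be checked on an imbalanced toy label vector. The sketch below assumes 80 samples of class 0 and 20 of class 1 (made-up data); `minority_counts` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels: 80% class 0, 20% class 1 (made-up data)
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((len(y), 1))  # the feature values do not affect the split

def minority_counts(splitter):
    """Count the class-1 items in the validation part of each fold."""
    return [int(y[val_idx].sum()) for _, val_idx in splitter.split(X, y)]

plain_counts = minority_counts(KFold(n_splits=5, shuffle=True, random_state=42))
strat_counts = minority_counts(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=42))

print("KFold class-1 count per fold:          ", plain_counts)
print("StratifiedKFold class-1 count per fold:", strat_counts)
```

With **StratifiedKFold** every fold of 20 items holds 4 class-1 items, the same 20% share as the full label vector, while the plain **KFold** counts are free to vary from fold to fold.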

## Datasets for this lesson



### Body Performance dataset

For this lesson we will be using the [Body Performance dataset](https://www.kaggle.com/datasets/kukuroo3/body-performance-data).
The dataset pairs a performance grade with each person's age, body measurements, and exercise performance results. It is a relatively large dataset with 12 categories of information about 13,393 individuals.

The 12 categories are:
* **age:** 20 to 64
* **gender:** M, F
* **height_cm:** (If you want to convert to feet, divide by 30.48)
* **weight_kg:**
* **body fat_%:**
* **diastolic:** diastolic blood pressure (min)
* **systolic:** systolic blood pressure (max)
* **gripForce:**
* **sit and bend forward_cm:**
* **sit-ups counts:**
* **broad jump_cm:**
* **class:** A, B, C, D (A: best) / stratified


In [3]:
# Read the data set
df = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/bodyPerformance.csv",
    na_values=['NA','?'])

# Set max rows and max columns
pd.set_option('display.max_rows', 8)
pd.set_option('display.max_columns', 8)

# Display DataFrame
display(df)

Unnamed: 0,age,gender,height_cm,weight_kg,...,sit and bend forward_cm,sit-ups counts,broad jump_cm,class
0,27.0,M,172.3,75.24,...,18.4,60.0,217.0,C
1,25.0,M,165.0,55.80,...,16.3,53.0,229.0,A
2,31.0,M,179.6,78.00,...,12.0,49.0,181.0,C
3,32.0,M,174.5,71.10,...,15.2,53.0,219.0,B
...,...,...,...,...,...,...,...,...,...
13389,21.0,M,179.7,63.90,...,1.1,48.0,167.0,D
13390,39.0,M,177.2,80.50,...,16.4,45.0,229.0,A
13391,64.0,F,146.1,57.70,...,9.2,0.0,75.0,D
13392,34.0,M,164.0,66.10,...,7.1,51.0,180.0,C


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13393 entries, 0 to 13392
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   age                      13393 non-null  float64
 1   gender                   13393 non-null  object 
 2   height_cm                13393 non-null  float64
 3   weight_kg                13393 non-null  float64
 4   body fat_%               13393 non-null  float64
 5   diastolic                13393 non-null  float64
 6   systolic                 13393 non-null  float64
 7   gripForce                13393 non-null  float64
 8   sit and bend forward_cm  13393 non-null  float64
 9   sit-ups counts           13393 non-null  float64
 10  broad jump_cm            13393 non-null  float64
 11  class                    13393 non-null  object 
dtypes: float64(10), object(2)
memory usage: 1.2+ MB


### Blood Samples dataset



In [5]:
# Read the data set
df = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/Blood_samples.csv",
    na_values=['NA','?'])

# Set max rows and max columns
pd.set_option('display.max_rows', 8)
pd.set_option('display.max_columns', 8)

# Display DataFrame
display(df)

Unnamed: 0,Glucose,Cholesterol,Hemoglobin,Platelets,...,Creatinine,Troponin,C-reactive Protein,Disease
0,0.739597,0.650198,0.713631,0.868491,...,0.095512,0.465957,0.769230,Healthy
1,0.121786,0.023058,0.944893,0.905372,...,0.659060,0.816982,0.401166,Diabetes
2,0.452539,0.116135,0.544560,0.400640,...,0.417295,0.799074,0.779208,Thalasse
3,0.136609,0.015605,0.419957,0.191487,...,0.490349,0.637061,0.354094,Anemia
...,...,...,...,...,...,...,...,...,...
2347,0.407101,0.124738,0.983306,0.663867,...,0.310964,0.310900,0.622403,Thalasse
2348,0.344356,0.783918,0.582171,0.996841,...,0.606719,0.395145,0.134021,Anemia
2349,0.351722,0.014278,0.898615,0.167550,...,0.882164,0.411158,0.146255,Diabetes
2350,0.032726,0.053596,0.102633,0.221356,...,0.437285,0.288961,0.709262,Anemia


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2351 entries, 0 to 2350
Data columns (total 25 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Glucose                                    2351 non-null   float64
 1   Cholesterol                                2351 non-null   float64
 2   Hemoglobin                                 2351 non-null   float64
 3   Platelets                                  2351 non-null   float64
 4   White Blood Cells                          2351 non-null   float64
 5   Red Blood Cells                            2351 non-null   float64
 6   Hematocrit                                 2351 non-null   float64
 7   Mean Corpuscular Volume                    2351 non-null   float64
 8   Mean Corpuscular Hemoglobin                2351 non-null   float64
 9   Mean Corpuscular Hemoglobin Concentration  2351 non-null   float64
 10  Insulin                 

In [7]:
df.describe()

Unnamed: 0,Glucose,Cholesterol,Hemoglobin,Platelets,...,Heart Rate,Creatinine,Troponin,C-reactive Protein
count,2351.0,2351.0,2351.0,2351.0,...,2351.0,2351.0,2351.0,2351.0
mean,0.362828,0.393648,0.58619,0.504027,...,0.582255,0.425075,0.454597,0.430308
std,0.251889,0.239449,0.271498,0.303347,...,0.250915,0.229298,0.251189,0.243034
min,0.010994,0.012139,0.003021,0.012594,...,0.11455,0.021239,0.00749,0.004867
25%,0.129198,0.195818,0.346092,0.200865,...,0.339125,0.213026,0.288961,0.196192
50%,0.351722,0.397083,0.609836,0.533962,...,0.61086,0.417295,0.426863,0.481601
75%,0.582278,0.582178,0.791215,0.754841,...,0.800666,0.606719,0.682164,0.631426
max,0.96846,0.905026,0.983306,0.999393,...,0.996873,0.925924,0.972803,0.797906


In [8]:
# Simple function to print out elapsed time

def elaspedTime(start,end):
    # Print out time
    seconds = int((end-start))
    seconds = seconds % (24 * 3600)
    hour = seconds // 3600
    seconds %= 3600
    minutes = seconds // 60
    seconds %= 60
    print("Elapsed time = %d:%02d:%02d" % (hour, minutes, seconds))
    print()

def rename_col_by_index(dataframe, index_mapping):
    dataframe.columns = [index_mapping.get(i, col) for i, col in enumerate(dataframe.columns)]
    return dataframe

## Out-of-Sample Regression Predictions with K-Fold Cross-Validation

The following code trains a neural network on the Body Performance dataset using 5-fold cross-validation. The score of the out-of-sample predictions generated this way estimates the performance you can expect from a neural network of the type trained here. We begin by preparing a feature vector from the **Body Performance dataset** to predict weight (`weight_kg`). This model is set up as a regression problem.
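The preprocessing in Example 1A standardizes each feature column with `scipy.stats.zscore`. As a quick sketch of what that call computes, the example below compares it with the formula z = (x - mean) / standard deviation on a handful of illustrative values; the array `values` is made up for this demonstration.

```python
import numpy as np
from scipy.stats import zscore

# A few illustrative measurements (made-up values)
values = np.array([75.24, 55.80, 78.00, 71.10, 63.90])

# Manual z-score: subtract the mean, divide by the standard deviation
# (np.std uses the population standard deviation, ddof=0, by default)
z_manual = (values - values.mean()) / values.std()

# scipy.stats.zscore also uses ddof=0 by default
z_scipy = zscore(values)

print(z_manual)
print(z_scipy)
```

After standardization the values have a mean of 0 and a standard deviation of 1, which puts features measured in different units on a comparable scale.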

### Example 1A: Preprocess data for **Regression** with K-Fold Cross-Validation 



In [9]:
# Example 1A

# Read the data set
bpBigDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/bodyPerformance.csv",
    na_values=['NA','?'])

# Only use 10% for neural network
bpDF=bpBigDF.sample(frac=0.10)

# Map gender
mapping = {'M': 1, 'F': 0}
bpDF['gender'] = bpDF['gender'].map(mapping)

# Map class
mapping =  {'A': 0,
            'B': 1,
            'C': 2,
            'D': 3}
bpDF['class'] = bpDF['class'].map(mapping)

# Generate list of columns for x
bpX_columns = bpDF.columns.drop('weight_kg')  # every column except the target

# Standardize values with their Z-scores
for col in bpX_columns:
    bpDF[col] = zscore(bpDF[col])

# Generate x-values as numpy array
bpX = bpDF[bpX_columns].values
bpX = np.asarray(bpX).astype('float32')

# Generate y-values as numpy array
# No One-Hot encoding with Regression
bpY = bpDF['weight_kg'].values
bpY = np.asarray(bpY).astype('float32')

### Example 1B:

Now that the feature vector is created, a 5-fold cross-validation can be performed to generate out-of-sample predictions. We will assume 100 epochs and not use early stopping. Later we will see how we can estimate a more optimal epoch count.

In [10]:
# Example 1B

# Set variables
EPOCHS=100 # number of epochs for each loop
numK=5     # Set number of K-folds
filename_write="OutOfSampleKfoldPred.csv" # Set filename

# Record the start time in T_start
T_start = time.time()

# Specify type of K-fold Cross-Validation
kf = KFold(numK, shuffle=True, random_state=42) # Use KFold for regression

# Initial arrays for Out Of Samples (oos)
oos_y = []    # array to hold actual y-values
oos_pred = [] # array to hold predicted y-values

# START LOOP HERE -----------------------------------#

fold = 0 # initialize fold count

# Run loop for each fold
for train, test in kf.split(bpX):
    fold+=1  # increment loop counter
    print(f"Fold #{fold} starting...")

    # Generate this fold's train and test datasets
    x_train = bpX[train]
    y_train = bpY[train]
    x_test = bpX[test]
    y_test = bpY[test]

    # Build new model for this fold
    model = Sequential()
    model.add(Dense(20, input_dim=bpX.shape[1], activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))

    # Compile model for regression 
    model.compile(loss='mean_squared_error', optimizer='adam')

    # Run model for this fold
    model.fit(x_train,y_train,validation_data=(x_test,y_test),verbose=0,
              epochs=EPOCHS)

    # Store model predictions
    pred = model.predict(x_test)

    # oos_y contains fold's actual y-values
    oos_y.append(y_test)
    
    # oos_pred contains fold's predicted y-values
    oos_pred.append(pred)    

    # Measure this fold's RMSE and print out when fold is done
    score = np.sqrt(metrics.mean_squared_error(pred,y_test))
    print(f"Fold score (RMSE): {score}")

# END LOOP HERE -----------------------------------#


# Build the oos prediction list and calculate the error.

# actual y-values for all loops
oos_y = np.concatenate(oos_y)

# predicted y-values from all loops
oos_pred = np.concatenate(oos_pred)

# Compute Final (grand) total from all K loops
score = np.sqrt(metrics.mean_squared_error(oos_pred,oos_y))
print(f"Final, out of sample score (RMSE): {score} \n")    
    
# Write the cross-validated prediction
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
# oosDF = pd.concat( [bpDF, oos_y, oos_pred],axis=1 )
oosDF = pd.concat( [oos_y,oos_pred],axis=1 )

# Uncomment the next line to write file
#oosDF.to_csv(filename_write,index=False)

# Record the end time in T_end
T_end = time.time()

# Print out elapsed time
elaspedTime(T_start,T_end)


Fold #1 starting...
Fold score (RMSE): 5.738091945648193
Fold #2 starting...
Fold score (RMSE): 6.243690490722656
Fold #3 starting...
Fold score (RMSE): 6.179523944854736
Fold #4 starting...
Fold score (RMSE): 5.553683280944824
Fold #5 starting...
Fold score (RMSE): 5.415999889373779
Final, out of sample score (RMSE): 5.8359222412109375 

Elapsed time = 0:01:30
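
Example 1B reports an RMSE for each fold and a final RMSE computed over the concatenated out-of-sample predictions. The final value is generally not the plain average of the fold scores. The sketch below shows the difference using two made-up folds; the helper `rmse` and all numbers are illustrative assumptions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length arrays."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

# Two made-up folds of actual and predicted values
fold_truth = [np.array([70.0, 60.0, 80.0]), np.array([65.0, 75.0])]
fold_preds = [np.array([68.0, 63.0, 80.0]), np.array([69.0, 75.0])]

# Score each fold separately
fold_scores = [rmse(t, p) for t, p in zip(fold_truth, fold_preds)]

# Pool all out-of-sample predictions, then score once
pooled_score = rmse(np.concatenate(fold_truth), np.concatenate(fold_preds))

print("Fold scores       :", fold_scores)
print("Mean of fold RMSEs:", np.mean(fold_scores))
print("Pooled RMSE       :", pooled_score)
```

Pooling weights every prediction equally, so folds of different sizes or with a few large errors shift the pooled score away from the simple mean of the fold scores.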



### Example 1C: Print out actual and predicted y-values



In [11]:
# Example 1C: Print out actual and predicted y-values 

# Rename columns
new_column_mapping = {0: 'Actual Wt (kg)', 1: 'Predicted Wt (kg)'}
oosDF = rename_col_by_index(oosDF, new_column_mapping)

# Set display options
pd.set_option('display.max_rows', 8)
pd.set_option('display.max_columns', 8)

# Display DataFrame
display(oosDF)

Unnamed: 0,Actual Wt (kg),Predicted Wt (kg)
0,75.500000,69.805740
1,58.400002,61.141804
2,84.900002,76.254089
3,60.900002,66.370972
...,...,...
1335,48.099998,52.042274
1336,73.800003,86.053925
1337,53.200001,47.598969
1338,57.000000,59.565720


### **Exercise 1A: Preprocess data for _Regression_ with K-Fold Cross-Validation** 

Class counts for the `Disease` column of the Blood Samples dataset:

Disease
Anemia      623
Diabetes    540
Healthy     556
Thalasse    509
Thromboc    123

In [12]:
# Insert your code for Exercise 1A here

# Read the data set
bldBigDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/Blood_samples.csv",
    na_values=['NA','?'])

# Only use 80% for neural network
bldDF=bldBigDF.sample(frac=0.80)

# Map Diseases
mapping =  {'Anemia': 0,
            'Diabetes': 1,
            'Healthy': 2,
            'Thalasse': 3,
            'Thromboc': 4}
bldDF['Disease'] = bldDF['Disease'].map(mapping)

# Generate list of columns for x
bldX_columns = bldDF.columns.drop('Glucose')  # every column except the target

# Standardize values with their Z-scores
for col in bldX_columns:
    bldDF[col] = zscore(bldDF[col])

# Generate x-values as numpy array
bldX = bldDF[bldX_columns].values
bldX = np.asarray(bldX).astype('float32')

# Generate y-values as numpy array
# No One-Hot encoding with Regression
bldY = bldDF['Glucose'].values
bldY = np.asarray(bldY).astype('float32')

In [13]:
bldDF.head()

Unnamed: 0,Glucose,Cholesterol,Hemoglobin,Platelets,...,Creatinine,Troponin,C-reactive Protein,Disease
1503,0.407101,-1.136944,1.466834,0.525258,...,-0.494823,-0.559265,0.791413,1.178715
325,0.531424,-0.274366,0.983851,-0.303907,...,-1.248306,-0.275234,0.554768,1.178715
754,0.353734,1.506013,0.627161,-1.630251,...,-0.071593,-0.096447,0.418602,1.990862
1816,0.104427,0.002459,0.598667,-0.578757,...,0.273141,-0.354749,-1.758049,-0.44558
1519,0.389399,0.272868,-0.382109,-0.914306,...,-0.181818,0.129667,-1.575788,0.366567


### **Exercise 1B:**

Now that the feature vector is created, a 5-fold cross-validation can be performed to generate out-of-sample predictions. We will assume 100 epochs and not use early stopping. Later we will see how we can estimate a more optimal epoch count.

In [14]:
# Insert your code for Exercise 1B here

# Set variables
EPOCHS=100 # number of epochs for each loop
numK=5     # Set number of K-folds
filename_write="OutOfSampleKfoldPred.csv" # Set filename

# Record the start time in T_start
T_start = time.time()

# Specify type of K-fold Cross-Validation
kf = KFold(numK, shuffle=True, random_state=42) # Use KFold for regression

# Initial arrays for Out Of Samples (oos)
oos_y = []    # array to hold actual y-values
oos_pred = [] # array to hold predicted y-values

# START LOOP HERE -----------------------------------#

fold = 0 # initialize fold count

# Run loop for each fold
for train, test in kf.split(bldX):
    fold+=1  # increment loop counter
    print(f"Fold #{fold} starting...")

    # Generate this fold's train and test datasets
    x_train = bldX[train]
    y_train = bldY[train]
    x_test = bldX[test]
    y_test = bldY[test]

    # Build new model for this fold
    model = Sequential()
    model.add(Dense(20, input_dim=bldX.shape[1], activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))

    # Compile model for regression 
    model.compile(loss='mean_squared_error', optimizer='adam')

    # Run model for this fold
    model.fit(x_train,y_train,validation_data=(x_test,y_test),verbose=0,
              epochs=EPOCHS)

    # Store model predictions
    pred = model.predict(x_test)

    # oos_y contains fold's actual y-values
    oos_y.append(y_test)
    
    # oos_pred contains fold's predicted y-values
    oos_pred.append(pred)    

    # Measure this fold's RMSE and print out when fold is done
    score = np.sqrt(metrics.mean_squared_error(pred,y_test))
    print(f"Fold score (RMSE): {score}")

# END LOOP HERE -----------------------------------#


# Build the oos prediction list and calculate the error.

# actual y-values for all loops
oos_y = np.concatenate(oos_y)

# predicted y-values from all loops
oos_pred = np.concatenate(oos_pred)

# Compute Final (grand) total from all K loops
score = np.sqrt(metrics.mean_squared_error(oos_pred,oos_y))
print(f"Final, out of sample score (RMSE): {score} \n")    
    
# Write the cross-validated prediction
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
# oosDF = pd.concat( [bpDF, oos_y, oos_pred],axis=1 )
oosDF = pd.concat( [oos_y,oos_pred],axis=1 )

# Uncomment the next line to write file
#oosDF.to_csv(filename_write,index=False)

# Record the end time in T_end
T_end = time.time()

# Print out elapsed time
elaspedTime(T_start,T_end)


Fold #1 starting...
Fold score (RMSE): 0.034317292273044586
Fold #2 starting...
Fold score (RMSE): 1.2464899157293985e-07
Fold #3 starting...
Fold score (RMSE): 1.3111900898366002e-07
Fold #4 starting...
Fold score (RMSE): 6.244755695661297e-06
Fold #5 starting...
Fold score (RMSE): 3.5743195780924e-07
Final, out of sample score (RMSE): 0.015363468788564205 

Elapsed time = 0:01:58



### **Exercise 1C: Print out actual and predicted y-values**



In [15]:
# Insert your code for Exercise 1C here

# Rename columns
new_column_mapping = {0: 'Actual Glucose', 1: 'Predicted Glucose'}
oosDF = rename_col_by_index(oosDF, new_column_mapping)

# Set display options
pd.set_option('display.max_rows', 8)
pd.set_option('display.max_columns', 8)

# Display DataFrame
display(oosDF)

Unnamed: 0,Actual Glucose,Predicted Glucose
0,0.399017,0.398986
1,0.102006,0.102230
2,0.253417,0.253489
3,0.121786,0.122025
...,...,...
1877,0.652122,0.652121
1878,0.107165,0.107165
1879,0.596150,0.596150
1880,0.789111,0.789111


## Classification with Stratified K-Fold Cross-Validation

The following code trains and fits the Body Performance dataset with stratified cross-validation to generate out-of-sample predictions. It also collects the out-of-sample results (predictions on the test folds) into a DataFrame.

It is good to perform stratified k-fold cross-validation with classification data.  This technique ensures that the percentages of each class remain the same across all folds.  Use the **StratifiedKFold** object instead of the **KFold** object used in the regression.
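The classification examples in this section one-hot encode the class labels with `pd.get_dummies` and convert the network's class probabilities back to a class index with `np.argmax` before computing accuracy. Here is a minimal sketch of that round trip; the labels and the probability rows in `probs` are made up and stand in for real softmax output.

```python
import numpy as np
import pandas as pd
from sklearn import metrics

# Six made-up class labels drawn from four classes (0..3)
labels = pd.Series([0, 1, 2, 3, 2, 0])

# One-hot encode: one column per class, one row per sample
y_onehot = pd.get_dummies(labels).values.astype("float32")

# Made-up class probabilities standing in for softmax output
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.2, 0.3, 0.4],
    [0.1, 0.1, 0.1, 0.7],
    [0.1, 0.1, 0.7, 0.1],
    [0.3, 0.4, 0.2, 0.1],
])

# The chosen class is the column with the highest probability
pred_class = np.argmax(probs, axis=1)

# Recover the actual class index from the one-hot rows
true_class = np.argmax(y_onehot, axis=1)

accuracy = metrics.accuracy_score(true_class, pred_class)
print("Predicted:", pred_class)
print("Actual   :", true_class)
print("Accuracy :", accuracy)
```

Four of the six predictions match the actual class, so the accuracy is 4/6.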

### Example 2A: Preprocess data for Classification with Stratified K-Fold Cross-Validation



In [16]:
# Example 2A: Preprocess data for Classification with Stratified K-Fold Cross-Validation

import pandas as pd
from scipy.stats import zscore

# Read the data set
bpBig = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/bodyPerformance.csv",
    na_values=['NA','?'])

# Only use 10% for neural network
bpDF=bpBig.sample(frac=0.10)

# Map gender
mapping = {'M': 1, 'F': 0}
bpDF['gender'] = bpDF['gender'].map(mapping)

# Map class
mapping =  {'A': 0,
            'B': 1,
            'C': 2,
            'D': 3}
bpDF['class'] = bpDF['class'].map(mapping)

# Generate list of columns for x
bpX_columns = bpDF.columns.drop('class')

# Standardize values with their Z-scores
for col in bpX_columns:
    bpDF[col] = zscore(bpDF[col])

# Generate x-values as numpy array
bpX = bpDF[bpX_columns].values
bpX = np.asarray(bpX).astype('float32')

# Generate y-values as numpy array
dummies = pd.get_dummies(bpDF['class']) # Classification
FitClass = dummies.columns
bpY= dummies.values
bpY = np.asarray(bpY).astype('float32')

### Example 2B: Classification with Stratified K-Fold Cross-Validation


In [17]:
# Example 2B: Classification with Stratified K-Fold Cross-Validation

import pandas as pd
import os
import numpy as np
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import ModelCheckpoint

# Set variables
EPOCHS=100 # number of epochs for each loop
numK=5     # Set number of folds
filename_write="StratKfoldOosPred.csv" # Set filename

# Record the start time in T_start
T_start = time.time()

# np.argmax(pred,axis=1)
# Cross-validate
# Use for StratifiedKFold classification
kf = StratifiedKFold(numK, shuffle=True, random_state=42) 

# Initial arrays to hold Out Of Samples (oos)
oos_y = []
oos_pred = []
fold = 0

# START LOOP HERE -----------------------------------#

# StratifiedKFold needs the class labels (y) to stratify the folds
for train, test in kf.split(bpX,bpDF['class']):  
    fold+=1
    print(f"Fold #{fold} starting...")
        
    # Generate fold's train and test datasets
    x_train = bpX[train]
    y_train = bpY[train]
    x_test = bpX[test]
    y_test = bpY[test]
    
    # Build new model for this fold
    model = Sequential()
    # Hidden 1
    model.add(Dense(50, input_dim=bpX.shape[1], activation='relu')) 
    model.add(Dense(25, activation='relu')) # Hidden 2
    model.add(Dense(bpY.shape[1],activation='softmax')) # Output

    # Compile model for classification
    model.compile(loss='categorical_crossentropy', optimizer='adam')

    # Define the checkpoint callback to save the model with the best performance
    checkpoint = ModelCheckpoint(f'Model_2_bestFold_{fold}.h5', 
                    monitor='val_loss', save_best_only=True, 
                    save_weights_only=True, mode='min', verbose=0)

    # Run model for this fold
    model.fit(x_train,y_train,validation_data=(x_test,y_test),
              verbose=0, callbacks=[checkpoint], epochs=EPOCHS)

    # Use model to predict y-values from x_test
    pred = model.predict(x_test)

    # Save actual y-values 
    oos_y.append(y_test)
    # raw probabilities to chosen class (highest probability)
    pred = np.argmax(pred,axis=1) 
    oos_pred.append(pred)  

    # Measure this fold's accuracy
    y_compare = np.argmax(y_test,axis=1) # For accuracy calculation
    score = metrics.accuracy_score(y_compare, pred)
    print(f"Fold score (accuracy): {score}")

# END LOOP HERE -----------------------------------#

# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
oos_y_compare = np.argmax(oos_y,axis=1) # For accuracy calculation

score = metrics.accuracy_score(oos_y_compare, oos_pred)
print(f"Final score (accuracy): {score} \n")    
    
# Write the cross-validated prediction
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
oosDF = pd.concat( [oos_pred,oos_y],axis=1 )

# Record the end time in T_end
T_end = time.time()

# Print out elapsed time
elaspedTime(T_start,T_end)

Fold #1 starting...
Fold score (accuracy): 0.6119402985074627
Fold #2 starting...
Fold score (accuracy): 0.6343283582089553
Fold #3 starting...
Fold score (accuracy): 0.6343283582089553
Fold #4 starting...
Fold score (accuracy): 0.7313432835820896
Fold #5 starting...
Fold score (accuracy): 0.6666666666666666
Final score (accuracy): 0.655713218820015 

Elapsed time = 0:01:33



In [18]:
# Example 2C: Print out actual and predicted y-values

# Rename columns
# Column 0 holds the predicted class (oos_pred); the remaining
# columns hold the one-hot encoded actual class (oos_y)
new_column_mapping = {0: 'Predicted class', 1: 'Actual class: 0'}
oosDF = rename_col_by_index(oosDF, new_column_mapping)

# Set display options
pd.set_option('display.max_rows', 8)
pd.set_option('display.max_columns', 8)

# Display DataFrame
display(oosDF)

Unnamed: 0,Predicted class,Actual class: 0,1,2,3
0,2,0.0,0.0,0.0,1.0
1,1,0.0,1.0,0.0,0.0
2,0,1.0,0.0,0.0,0.0
3,0,0.0,1.0,0.0,0.0
...,...,...,...,...,...
1335,0,0.0,1.0,0.0,0.0
1336,3,0.0,0.0,1.0,0.0
1337,1,1.0,0.0,0.0,0.0
1338,3,0.0,0.0,0.0,1.0


### **Exercise 2A: Preprocess data for Classification with Stratified K-Fold Cross-Validation**



In [19]:
# Insert your code for Exercise 2A here

import pandas as pd
from scipy.stats import zscore

# Read the data set
bldBig = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/Blood_samples.csv",
    na_values=['NA','?'])

# Only use 30% for neural network
bldDF=bldBig.sample(frac=0.30)


# Map Diseases
mapping =  {'Anemia': 0,
            'Diabetes': 1,
            'Healthy': 2,
            'Thalasse': 3,
            'Thromboc': 4}
bldDF['Disease'] = bldDF['Disease'].map(mapping)

# Generate list of columns for x
bldX_columns = bldDF.columns.drop('Disease')

# Standardize values with their Z-scores
for col in bldX_columns:
    bldDF[col] = zscore(bldDF[col])

# Generate x-values as numpy array
bldX = bldDF[bldX_columns].values
bldX = np.asarray(bldX).astype('float32')

# Generate y-values as numpy array
dummies = pd.get_dummies(bldDF['Disease']) # Classification
DiseaseClass = dummies.columns
bldY= dummies.values
bldY = np.asarray(bldY).astype('float32')

### **Exercise 2B: Classification with Stratified K-Fold Cross-Validation**


In [20]:
# Insert your code for Exercise 2B here

import pandas as pd
import os
import numpy as np
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import ModelCheckpoint

# Set variables
EPOCHS=100 # number of epochs for each loop
numK=5     # Set number of folds
filename_write="StratKfoldOosPred.csv" # Set filename

# Record the start time in T_start
T_start = time.time()

# np.argmax(pred,axis=1)
# Cross-validate
# Use for StratifiedKFold classification
kf = StratifiedKFold(numK, shuffle=True, random_state=42) 

# Initial arrays to hold Out Of Samples (oos)
oos_y = []
oos_pred = []
fold = 0

# START LOOP HERE -----------------------------------#

# StratifiedKFold needs the class labels (y) to stratify the folds
for train, test in kf.split(bldX,bldDF['Disease']):  
    fold+=1
    print(f"Fold #{fold} starting...")
        
    # Generate fold's train and test datasets
    x_train = bldX[train]
    y_train = bldY[train]
    x_test = bldX[test]
    y_test = bldY[test]
    
    # Build new model for this fold
    model = Sequential()
    # Hidden 1
    model.add(Dense(50, input_dim=bldX.shape[1], activation='relu')) 
    model.add(Dense(25, activation='relu')) # Hidden 2
    model.add(Dense(bldY.shape[1],activation='softmax')) # Output

    # Compile model for classification
    model.compile(loss='categorical_crossentropy', optimizer='adam')

    # Define the checkpoint callback to save the model with the best performance
    checkpoint = ModelCheckpoint(f'Model_2B_bestFold_{fold}.h5', 
                    monitor='val_loss', save_best_only=True, 
                    save_weights_only=True, mode='min', verbose=0)

    # Run model for this fold
    model.fit(x_train,y_train,validation_data=(x_test,y_test),
              verbose=0, callbacks=[checkpoint], epochs=EPOCHS)

    # Use model to predict y-values from x_test
    pred = model.predict(x_test)

    # Save actual y-values 
    oos_y.append(y_test)
    # raw probabilities to chosen class (highest probability)
    pred = np.argmax(pred,axis=1) 
    oos_pred.append(pred)  

    # Measure this fold's accuracy
    y_compare = np.argmax(y_test,axis=1) # For accuracy calculation
    score = metrics.accuracy_score(y_compare, pred)
    print(f"Fold score (accuracy): {score}")

# END LOOP HERE -----------------------------------#

# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
oos_y_compare = np.argmax(oos_y,axis=1) # For accuracy calculation

score = metrics.accuracy_score(oos_y_compare, oos_pred)
print(f"Final score (accuracy): {score} \n")    
    
# Combine predicted (first column) and actual (one-hot) values in one DataFrame
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
oosDF = pd.concat( [oos_pred,oos_y],axis=1 )

# Record the end time in T_end
T_end = time.time()

# Print out elapsed time
elaspedTime(T_start,T_end)

Fold #1 starting...
Fold score (accuracy): 1.0
Fold #2 starting...
Fold score (accuracy): 1.0
Fold #3 starting...
Fold score (accuracy): 1.0
Fold #4 starting...
Fold score (accuracy): 1.0
Fold #5 starting...
Fold score (accuracy): 1.0
Final score (accuracy): 1.0 

Elapsed time = 0:01:05
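
To see why `kf.split()` is given the class labels in the cell above, the following minimal sketch checks how `StratifiedKFold` distributes classes across folds. The label array is hypothetical (60, 30, and 10 samples in three classes) and the feature matrix is only a placeholder; neither comes from the blood-sample data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 60 of class 0, 30 of class 1, 10 of class 2
labels = np.array([0] * 60 + [1] * 30 + [2] * 10)
# Placeholder features; only the labels determine the stratified split
features = np.zeros((labels.size, 3))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_counts = []
for fold, (train_idx, test_idx) in enumerate(skf.split(features, labels), start=1):
    # Count how many samples of each class landed in this fold's test set
    counts = np.bincount(labels[test_idx], minlength=3)
    fold_counts.append(counts)
    print(f"Fold #{fold}: test-set class counts = {counts.tolist()}")
```

Each test fold should keep the same 6:3:1 class ratio as the full label array, which is what stratification is meant to guarantee.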



In [21]:
# Example 2C: Print out actual and predicted y-values

# Rename columns
# oosDF holds the predicted class first, followed by the one-hot actual classes
new_column_mapping = {0: 'Predicted Disease class', 1: 'Actual Disease Class: 0'}
oosDF = rename_col_by_index(oosDF, new_column_mapping)

# Set display options
pd.set_option('display.max_rows', 8)
pd.set_option('display.max_columns', 8)

# Display DataFrame
display(oosDF)

Unnamed: 0,Predicted Disease class,Actual Disease Class: 0,1,2,3,4
0,2,0.0,0.0,1.0,0.0,0.0
1,0,1.0,0.0,0.0,0.0,0.0
2,3,0.0,0.0,0.0,1.0,0.0
3,0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...
701,3,0.0,0.0,0.0,1.0,0.0
702,4,0.0,0.0,0.0,0.0,1.0
703,3,0.0,0.0,0.0,1.0,0.0
704,0,1.0,0.0,0.0,0.0,0.0


## Training with both a Cross-Validation and a Holdout Set

If you have a considerable amount of data, it is valuable to set aside a holdout set before you cross-validate. The model never sees the holdout set during cross-validation, so it serves as the final evaluation before you put the model to real-world use. Figure 5. HOLDOUT shows this division.

**Figure 5. HOLDOUT: Cross-Validation and a Holdout Set**
![Cross Validation and a Holdout Set](https://biologicslab.co/BIO1173/images/class_3_hold_train_val.png "Cross-Validation and a Holdout Set")

The following program sets aside a holdout set and then cross-validates on the remaining data.
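
Before the full Keras example, here is a minimal sketch of the same workflow using only scikit-learn. It rests on several assumptions: the data are synthetic, `LinearRegression` stands in for the neural network, and the final model is refit on all of the non-holdout data before it is scored on the holdout. That last step is one common choice, not the approach used in Example 3B, which scores the model from the last fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, train_test_split

# Synthetic regression data (assumption, not the course data set)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)

# 1) Set the holdout aside first; it is not used during cross-validation
X_main, X_holdout, y_main, y_holdout = train_test_split(
    X, y, test_size=0.10, random_state=42)

# 2) Cross-validate on the remaining ("main") data only
fold_rmse = []
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in kf.split(X_main):
    model = LinearRegression().fit(X_main[train_idx], y_main[train_idx])
    pred = model.predict(X_main[test_idx])
    fold_rmse.append(np.sqrt(mean_squared_error(y_main[test_idx], pred)))

# 3) Refit on all of the main data, then score once on the holdout
final_model = LinearRegression().fit(X_main, y_main)
holdout_rmse = np.sqrt(
    mean_squared_error(y_holdout, final_model.predict(X_holdout)))

print(f"Mean fold RMSE: {np.mean(fold_rmse):.4f}")
print(f"Holdout RMSE:   {holdout_rmse:.4f}")
```

The key point is the order of operations: the holdout split happens once, before any fold is created, so no fold ever trains on holdout rows.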

### Example 3A: Read and Preprocess the Data



In [22]:
# Example 3A: Read and preprocess the data

# Read the data set
bpBigDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/bodyPerformance.csv",
    na_values=['NA','?'])

# Only use 20% for neural network
bpDF=bpBigDF.sample(frac=0.20)

# Map gender
mapping = {'M': 1, 'F': 0}
bpDF['gender'] = bpDF['gender'].map(mapping)

# Map class
mapping =  {'A': 0,
            'B': 1,
            'C': 2,
            'D': 3}
bpDF['class'] = bpDF['class'].map(mapping)

# Generate list of columns for x (every column except the target, 'systolic')
bpX_columns = bpDF.columns.drop('systolic')

# Standardize values with their Z-scores
for col in bpX_columns:
    bpDF[col] = zscore(bpDF[col])

# Generate x-values as numpy array
bpX = bpDF[bpX_columns].values
bpX = np.asarray(bpX).astype('float32')

# Generate y-values as numpy array
# Do NOT one-hot encode the target for regression
bpY = bpDF['systolic'].values
bpY = np.asarray(bpY).astype('float32')

### Example 3B: Training with both a Cross-Validation and a Holdout Set 

Now that the data has been preprocessed, we are ready to build the neural network.

In [23]:
# Example 3B: Training with both a Cross-Validation and a Holdout Set

import time
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn import metrics
from sklearn.model_selection import KFold, train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Set variables
EPOCHS=100 # number of epochs for each loop
numK=5     # Set number of K-folds
# filename_write="OutOfSampleKfoldPred.csv" # Set filename

# Record the start time in T_start
T_start = time.time()

# Keep a 10% holdout
bpX_main, bpX_holdout, bpY_main, bpY_holdout = train_test_split(    
    bpX, bpY, test_size=0.10) 

# Initialize lists for the out-of-sample (oos) values
oos_y = []    # array to hold actual y-values
oos_pred = [] # array to hold predicted y-values

# Cross-validate
kf = KFold(numK)

fold = 0 # initialize fold count

# START LOOP HERE -----------------------------------#

# Run loop for each fold
for train, test in kf.split(bpX_main):        
    fold+=1
    print(f"Starting Fold #{fold}...")

    # Generate this fold's train and test datasets
    x_train = bpX_main[train]
    y_train = bpY_main[train]
    x_test = bpX_main[test]
    y_test = bpY_main[test]

    # Build new model for this fold
    model = Sequential()
    model.add(Dense(20, input_dim=bpX.shape[1], activation='relu'))
    model.add(Dense(5, activation='relu'))
    model.add(Dense(1))

    # Compile model for regression
    model.compile(loss='mean_squared_error', optimizer='adam')

    # Run model for this fold
    model.fit(x_train,y_train,validation_data=(x_test,y_test),
              verbose=0,epochs=EPOCHS)

    # Use model to predict y-values from x_test
    pred = model.predict(x_test)

    # Save actual y-values 
    oos_y.append(y_test)
    oos_pred.append(pred) 

    # Measure this fold's error (RMSE)
    score = np.sqrt(metrics.mean_squared_error(pred,y_test))
    print(f"Fold score (RMSE): {score}")


# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
score = np.sqrt(metrics.mean_squared_error(oos_pred,oos_y))
print()
print(f"Cross-validated score (RMSE): {score}")    

# Combine actual and predicted values in one DataFrame
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
oosDF = pd.concat([oos_y,oos_pred],axis=1 )

# Evaluate on the holdout set (using the model from the last fold)
holdout_pred = model.predict(bpX_holdout)

score = np.sqrt(metrics.mean_squared_error(holdout_pred,bpY_holdout))
print(f"Holdout score (RMSE): {score}")    

# Record the end time in T_end
T_end = time.time()

# Print out elapsed time
elaspedTime(T_start,T_end)

Starting Fold #1...
Fold score (RMSE): 10.276460647583008
Starting Fold #2...
Fold score (RMSE): 12.139886856079102
Starting Fold #3...
Fold score (RMSE): 10.246830940246582
Starting Fold #4...
Fold score (RMSE): 10.663908958435059
Starting Fold #5...
Fold score (RMSE): 10.272160530090332

Cross-validated score (RMSE): 10.7442626953125
Holdout score (RMSE): 9.729659080505371
Elapsed time = 0:02:28
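
The cross-validated RMSE printed above is computed from the pooled out-of-sample predictions, so it is not simply the average of the five fold scores. The short sketch below reproduces it approximately from the fold scores alone, under the assumption that the folds are nearly equal in size (they differ by at most one row here, so the match is close but not exact).

```python
import numpy as np

# Fold RMSE values copied from the output above
fold_rmse = np.array([10.276460647583008, 12.139886856079102,
                      10.246830940246582, 10.663908958435059,
                      10.272160530090332])

# A plain average of the fold scores
mean_of_rmse = fold_rmse.mean()

# Pooled RMSE: average the squared errors first, then take the square root
# (exact only when every fold has the same number of rows)
pooled_rmse = np.sqrt(np.mean(fold_rmse ** 2))

print(f"Mean of fold RMSEs: {mean_of_rmse:.4f}")
print(f"Pooled RMSE:        {pooled_rmse:.4f}")
```

Because squaring weights large errors more heavily, the pooled value is pulled toward the worst fold (Fold #2) and sits slightly above the plain average.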



### Example 3C: Print out actual and predicted y-values



In [24]:
# Example 3C: Print out actual and predicted y-values

# Rename columns
new_column_mapping = {0: 'Actual Systolic pressure', 1: 'Predicted Systolic pressure'}
oosDF = rename_col_by_index(oosDF, new_column_mapping)

# Set display options
pd.set_option('display.max_rows', 8)
pd.set_option('display.max_columns', 8)

# Display DataFrame
display(oosDF)

Unnamed: 0,Actual Systolic pressure,Predicted Systolic pressure
0,142.0,118.408318
1,118.0,121.304047
2,153.0,154.796783
3,122.0,124.149635
...,...,...
2407,144.0,141.003220
2408,129.0,133.930969
2409,139.0,132.655197
2410,121.0,119.744713


### **Exercise 3A: Read and Preprocess the Data**



In [25]:
# Insert your code for Exercise 3A here

# Read the data set
bldBigDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/Blood_samples.csv",
    na_values=['NA','?'])

# Only use 20% for neural network
bldDF=bldBigDF.sample(frac=0.20)

# Map Diseases
mapping =  {'Anemia': 0,
            'Diabetes': 1,
            'Healthy': 2,
            'Thalasse': 3,
            'Thromboc': 4}
bldDF['Disease'] = bldDF['Disease'].map(mapping)

# Generate list of columns for x (every column except the target, 'BMI')
bldX_columns = bldDF.columns.drop('BMI')

# Standardize values with their Z-scores
for col in bldX_columns:
    bldDF[col] = zscore(bldDF[col])

# Generate x-values as numpy array
bldX = bldDF[bldX_columns].values
bldX = np.asarray(bldX).astype('float32')

# Generate y-values as numpy array
# Do NOT one-hot encode the target for regression
bldY = bldDF['BMI'].values
bldY = np.asarray(bldY).astype('float32')

### **Exercise 3B: Training with both a Cross-Validation and a Holdout Set**

Now that the data has been preprocessed, we are ready to build the neural network.

In [26]:
# Insert your code for Exercise 3B here

import time
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn import metrics
from sklearn.model_selection import KFold, train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Set variables
EPOCHS=100 # number of epochs for each loop
numK=5     # Set number of K-folds
# filename_write="OutOfSampleKfoldPred.csv" # Set filename

# Record the start time in T_start
T_start = time.time()

# Keep a 10% holdout
bldX_main, bldX_holdout, bldY_main, bldY_holdout = train_test_split(    
    bldX, bldY, test_size=0.10) 

# Initialize lists for the out-of-sample (oos) values
oos_y = []    # array to hold actual y-values
oos_pred = [] # array to hold predicted y-values

# Cross-validate
kf = KFold(numK)

fold = 0 # initialize fold count

# START LOOP HERE -----------------------------------#

# Run loop for each fold
for train, test in kf.split(bldX_main):        
    fold+=1
    print(f"Starting Fold #{fold}...")

    # Generate this fold's train and test datasets
    x_train = bldX_main[train]
    y_train = bldY_main[train]
    x_test = bldX_main[test]
    y_test = bldY_main[test]

    # Build new model for this fold
    model = Sequential()
    model.add(Dense(20, input_dim=bldX.shape[1], activation='relu'))
    model.add(Dense(5, activation='relu'))
    model.add(Dense(1))

    # Compile model for regression
    model.compile(loss='mean_squared_error', optimizer='adam')

    # Run model for this fold
    model.fit(x_train,y_train,validation_data=(x_test,y_test),
              verbose=0,epochs=EPOCHS)

    # Use model to predict y-values from x_test
    pred = model.predict(x_test)

    # Save actual y-values 
    oos_y.append(y_test)
    oos_pred.append(pred) 

    # Measure this fold's error (RMSE)
    score = np.sqrt(metrics.mean_squared_error(pred,y_test))
    print(f"Fold score (RMSE): {score}")


# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
score = np.sqrt(metrics.mean_squared_error(oos_pred,oos_y))
print()
print(f"Cross-validated score (RMSE): {score}")    

# Combine actual and predicted values in one DataFrame
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
oosDF = pd.concat([oos_y,oos_pred],axis=1 )

# Evaluate on the holdout set (using the model from the last fold)
holdout_pred = model.predict(bldX_holdout)

score = np.sqrt(metrics.mean_squared_error(holdout_pred,bldY_holdout))
print(f"Holdout score (RMSE): {score}")    

# Record the end time in T_end
T_end = time.time()

# Print out elapsed time
elaspedTime(T_start,T_end)

Starting Fold #1...
Fold score (RMSE): 0.06385468691587448
Starting Fold #2...
Fold score (RMSE): 0.07605636864900589
Starting Fold #3...
Fold score (RMSE): 0.07875346392393112
Starting Fold #4...
Fold score (RMSE): 0.04226360470056534
Starting Fold #5...
Fold score (RMSE): 0.006202332209795713

Cross-validated score (RMSE): 0.059919603168964386
Holdout score (RMSE): 0.007993982173502445
Elapsed time = 0:00:43



### **Exercise 3C: Print out actual and predicted y-values**



In [27]:
# Insert your code for Exercise 3C here

# Rename columns
new_column_mapping = {0: 'Actual BMI', 1: 'Predicted BMI'}
oosDF = rename_col_by_index(oosDF, new_column_mapping)

# Set display options
pd.set_option('display.max_rows', 8)
pd.set_option('display.max_columns', 8)

# Display DataFrame
display(oosDF)

Unnamed: 0,Actual BMI,Predicted BMI
0,0.142010,0.125563
1,0.524159,0.503896
2,0.381684,0.381648
3,0.102749,0.174065
...,...,...
419,0.779364,0.778209
420,0.553962,0.553220
421,0.559571,0.559449
422,0.233581,0.233712


## **Lesson Turn-in**

When you have completed all of the code cells and run them in sequential order (the last code cell should be number ??), use **File --> Print... --> Save to PDF** to generate a PDF of your JupyterLab notebook. Save your PDF as `Class_05_2.lastname.pdf`, where _lastname_ is your last name, and upload the file to Canvas.