<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/master/Class_05_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

**Module 5: Regularization and Dropout**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)

### Module 5 Material

* Part 5.1: Introduction to Regularization: Ridge and Lasso
* **Part 5.2: Using K-Fold Cross Validation with Keras**
* Part 5.3: Using L1 and L2 Regularization with Keras to Decrease Overfitting
* Part 5.4: Drop Out for Keras to Decrease Overfitting
* Part 5.5: Benchmarking Keras Deep Learning Regularization Techniques



### Lesson Setup

Run the next code cell to load the necessary packages.

In [1]:
# You MUST run this code cell first

from sklearn import metrics
from sklearn.model_selection import KFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from scipy.stats import zscore
from sklearn.model_selection import train_test_split

import numpy as np
import pandas as pd
import time

import os
import shutil
path = '/'
memory = shutil.disk_usage(path)
dirpath = os.getcwd()
print("Your current working directory is : " + dirpath)
print("Disk", memory)

Your current working directory is : C:\Users\David\BIO1173_Test\Class_05_2
Disk usage(total=4000108531712, used=1006327681024, free=2993780850688)


### Google CoLab Instructions

The following code ensures that Google CoLab is running the correct version of TensorFlow.

In [2]:
# You must run this cell second
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: using Google CoLab")
    %tensorflow_version 2.x
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("Note: not using Google CoLab")
    COLAB = False

Note: not using Google CoLab


## Datasets for this lesson

For this lesson we will be using the [Body Performance dataset](https://www.kaggle.com/datasets/kukuroo3/body-performance-data) for the Examples, and the [Multiple Disease dataset](https://www.kaggle.com/datasets/ehababoelnaga/multiple-disease-prediction) for the **Exercises**.

### Body Performance dataset

[Body Performance](https://www.kaggle.com/datasets/kukuroo3/body-performance-data)


![___](https://biologicslab.co/BIO1173/images/Broadjump.jpg)

The [Body Performance dataset](https://www.kaggle.com/datasets/kukuroo3/body-performance-data) was provided by the [Seoul Olympic Games Korea Sports Promotion Foundation](https://www.bigdata-culture.kr/bigdata/user/data_market/detail.do?id=ace0aea7-5eee-48b9-b616-637365d665c1). 

The dataset has 12 categories of body performance for a relatively large number of men and women (_n_=13,393). To speed up neural network training, we will generally use only a fraction of the total number of subjects. 

The 12 categories of fitness measurements are:
* **age:** 20 ~64
* **gender:** M,F
* **height_cm:** (If you want to convert to feet, divide by 30.48)
* **weight_kg:**
* **body fat_%:**
* **diastolic:** diastolic blood pressure (min)
* **systolic:** systolic blood pressure (max)
* **gripForce:**
* **sit and bend forward_cm:**
* **sit-ups counts:**
* **broad jump_cm:**
* **class:** A,B,C,D ( A: best) / stratified

The output for the command `df.info()` is as follows:
~~~text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13393 entries, 0 to 13392
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   age                      13393 non-null  float64
 1   gender                   13393 non-null  object 
 2   height_cm                13393 non-null  float64
 3   weight_kg                13393 non-null  float64
 4   body fat_%               13393 non-null  float64
 5   diastolic                13393 non-null  float64
 6   systolic                 13393 non-null  float64
 7   gripForce                13393 non-null  float64
 8   sit and bend forward_cm  13393 non-null  float64
 9   sit-ups counts           13393 non-null  float64
 10  broad jump_cm            13393 non-null  float64
 11  class                    13393 non-null  object 
dtypes: float64(10), object(2)
~~~

As you can see, the only columns that have string values (type `object`) are `gender` and `class`. The remaining columns are all numeric. Since all columns have the same `Non-Null Count`, there is no missing data. 

### Multiple Disease Prediction Dataset

[Multiple Disease Prediction](https://www.kaggle.com/datasets/ehababoelnaga/multiple-disease-prediction?resource=download/)


![___](https://biologicslab.co/BIO1173/images/blood_sample.jpg)

A blood sample can provide valuable information about disease prediction through various methods such as genetic testing, biomarker analysis, and assessing various blood parameters. Here's why it's important:

* **Early Detection:** Blood samples can contain markers that indicate the presence of certain diseases even before symptoms appear. Detecting these markers early can lead to earlier treatment and better outcomes.
* **Risk Assessment:** Analyzing blood samples can help assess an individual's risk of developing certain diseases based on genetic predispositions or biomarker patterns. This information can guide preventative measures and lifestyle changes to reduce the risk.
* **Personalized Medicine:** Understanding an individual's genetic makeup and biomarker profile can help tailor medical treatments to their specific needs, increasing treatment efficacy and reducing adverse effects.
* **Monitoring Disease Progression:** Blood tests can be used to monitor disease progression and response to treatment over time, allowing for adjustments in treatment plans as needed.
* **Population Health Management:** Blood sample analysis on a larger scale can provide insights into the prevalence of certain diseases within a population, aiding public health efforts in disease prevention and management.

Overall, leveraging blood samples for disease prediction can significantly impact individual health outcomes, public health strategies, and the advancement of personalized medicine.

**Description of Multiple Disease Prediction Dataset**

This dataset is for the prediction of human diseases based on blood sample values and a panel of clinical assessments taken from 2351 subjects. The last column, `Disease`, lists five disease categories: 

* **Anemia:** Anemia is the condition of not having enough healthy red blood cells or hemoglobin to carry oxygen to the body's tissues. 
* **Diabetes:** Diabetes is a chronic (long-lasting) condition that affects how your body turns food into energy. With diabetes, your body doesn’t make enough insulin or can’t use it as well as it should. When there isn’t enough insulin or cells stop responding to insulin, too much blood sugar stays in your bloodstream.
* **Healthy:** No apparent medical condition.
* **Thalasse:** Thalassemia is an inherited blood disorder that inhibits the production of the protein hemoglobin.  
* **Thromboc:** Thrombocytopenia is the medical condition for low blood platelets. Normal platelet counts for adults are between 150,000 and 450,000 platelets per microliter (uL) of blood. Thrombocytopenia is a platelet count below 150,000.

**Key Features of the dataset:**

The following are the attributes of the Blood Sample dataset:

* **Cholesterol:** This is the level of cholesterol in the blood, measured in milligrams per deciliter (mg/dL).
* **Hemoglobin:** This is the protein in red blood cells that carries oxygen from the lungs to the rest of the body
* **Platelets:** Platelets are blood cells that help with clotting
* **White Blood Cells (WBC):** These are cells of the immune system that help fight infections
* **Red Blood Cells (RBC):** These are the cells that carry oxygen from the lungs to the rest of the body
* **Hematocrit:** This is the percentage of blood volume that is occupied by red blood cells
* **Mean Corpuscular Volume (MCV):** This is the average volume of red blood cells
* **Mean Corpuscular Hemoglobin (MCH):** This is the average amount of hemoglobin in a red blood cell
* **Mean Corpuscular Hemoglobin Concentration (MCHC):** This is the average concentration of hemoglobin in a red blood cell
* **Insulin:** This is a hormone that helps regulate blood sugar levels
* **BMI (Body Mass Index):** This is a measure of body fat based on height and weight
* **Systolic Blood Pressure (SBP):** This is the pressure in the arteries when the heart beats
* **Diastolic Blood Pressure (DBP):** This is the pressure in the arteries when the heart is at rest between beats
* **Triglycerides:** These are a type of fat found in the blood, measured in milligrams per deciliter (mg/dL)
* **HbA1c (Glycated Hemoglobin):** This is a measure of average blood sugar levels over the past two to three months
* **LDL (Low-Density Lipoprotein) Cholesterol:** This is the "bad" cholesterol that can build up in the arteries
* **HDL (High-Density Lipoprotein) Cholesterol:** This is the "good" cholesterol that helps remove LDL cholesterol from the arteries
* **ALT (Alanine Aminotransferase):** This is an enzyme found primarily in the liver
* **AST (Aspartate Aminotransferase):** This is an enzyme found in various tissues including the liver and heart
* **Heart Rate:** This is the number of heartbeats per minute (bpm)
* **Creatinine:** This is a waste product produced by muscles and filtered out of the blood by the kidneys
* **Troponin:** This is a protein released into the bloodstream when there is damage to the heart muscle
* **C-reactive Protein (CRP):** This is a marker of inflammation in the body
* **Disease:** This indicates which disease category, if any, the subject belongs to

The data has been standardized using a scaling technique called **_Min-Max Scaling_**. In this procedure, the values have been shifted and rescaled so that they end up ranging between 0 and 1. The following shows the minimum and maximum values used for scaling each category:

* **Glucose:** (70, 140) mg/dL
* **Cholesterol:** (125, 200) mg/dL
* **Hemoglobin:** (13.5, 17.5) g/dL
* **Platelets:** (150000, 450000) per microliter of blood
* **White Blood Cells:** (4000, 11000) per cubic millimeter of blood
* **Red Blood Cells:** (4.2, 5.4) million cells per microliter of blood
* **Hematocrit:** (38, 52) percent
* **Mean Corpuscular Volume:** (80, 100) femtoliters
* **Mean Corpuscular Hemoglobin:** (27, 33) picograms
* **Mean Corpuscular Hemoglobin Concentration:** (32, 36) grams per deciliter
* **Insulin:** (5, 25) microU/mL
* **BMI:** (18.5, 24.9) kg/m^2
* **Systolic Blood Pressure:** (90, 120) mmHg
* **Diastolic Blood Pressure:** (60, 80) mmHg
* **Triglycerides:** (50, 150) mg/dL
* **HbA1c:** (4, 6) percent
* **LDL Cholesterol:** (70, 130) mg/dL
* **HDL Cholesterol:** (40, 60) mg/dL
* **ALT:** (10, 40) U/L
* **AST:** (10, 40) U/L
* **Heart Rate:** (60, 100) beats per minute
* **Creatinine:** (0.6, 1.2) mg/dL
* **Troponin:** (0, 0.04) ng/mL
* **C-reactive Protein:** (0, 3) mg/L

#### Understanding Min-Max Scaling
As an example, consider the column called `Glucose`. According to the information above, the subject with the lowest Glucose measurement had `70 mg/dL` while the subject with the largest Glucose measurement had `140 mg/dL`. After Min-Max scaling, the subject with the minimum measurement now has Glucose=0, while the subject with the maximum measurement now has Glucose=1. All of the remaining subjects have a glucose value somewhere between 0 and 1, in proportion to where their original measurement fell between the minimum and the maximum.
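
The arithmetic behind this rescaling can be sketched in a few lines of Python. This is an illustration only: the helper name `min_max` and the sample value of 105 mg/dL are assumptions made for the example, while the limits 70 and 140 come from the list above.

```python
def min_max(value, lower, upper):
    """Rescale value so that lower maps to 0.0 and upper maps to 1.0."""
    return (value - lower) / (upper - lower)

GLUCOSE_MIN, GLUCOSE_MAX = 70, 140   # mg/dL, limits listed above

lowest = min_max(70, GLUCOSE_MIN, GLUCOSE_MAX)     # subject at the minimum
highest = min_max(140, GLUCOSE_MIN, GLUCOSE_MAX)   # subject at the maximum
midpoint = min_max(105, GLUCOSE_MIN, GLUCOSE_MAX)  # hypothetical subject halfway between

print(lowest, midpoint, highest)   # prints: 0.0 0.5 1.0
```

A subject exactly halfway between the two limits ends up with a scaled value of 0.5.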

The output for the command `df.info()` is as follows:
~~~text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2351 entries, 0 to 2350
Data columns (total 25 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   Glucose                                    2351 non-null   float64
 1   Cholesterol                                2351 non-null   float64
 2   Hemoglobin                                 2351 non-null   float64
 3   Platelets                                  2351 non-null   float64
 4   White Blood Cells                          2351 non-null   float64
 5   Red Blood Cells                            2351 non-null   float64
 6   Hematocrit                                 2351 non-null   float64
 7   Mean Corpuscular Volume                    2351 non-null   float64
 8   Mean Corpuscular Hemoglobin                2351 non-null   float64
 9   Mean Corpuscular Hemoglobin Concentration  2351 non-null   float64
 10  Insulin                                    2351 non-null   float64
 11  BMI                                        2351 non-null   float64
 12  Systolic Blood Pressure                    2351 non-null   float64
 13  Diastolic Blood Pressure                   2351 non-null   float64
 14  Triglycerides                              2351 non-null   float64
 15  HbA1c                                      2351 non-null   float64
 16  LDL Cholesterol                            2351 non-null   float64
 17  HDL Cholesterol                            2351 non-null   float64
 18  ALT                                        2351 non-null   float64
 19  AST                                        2351 non-null   float64
 20  Heart Rate                                 2351 non-null   float64
 21  Creatinine                                 2351 non-null   float64
 22  Troponin                                   2351 non-null   float64
 23  C-reactive Protein                         2351 non-null   float64
 24  Disease                                    2351 non-null   object 
dtypes: float64(24), object(1)
memory usage: 459.3+ KB
~~~

As you can see, the only column that is _not_ numeric is `Disease` which has 5 categorical values: `Anemia`, `Diabetes`, `Healthy`, `Thalasse`, and `Thromboc`. Since all columns have the same `Non-Null Count` (_n_=2351), there is no missing data. 

As mentioned above, the data is **_already scaled_** in the range 0.0-1.0 using the minimum and maximum values for each data type.  

# Part 5.2: Using K-Fold Cross-validation with Keras

**_K-fold cross-validation_** is a technique used in machine learning to evaluate the performance and generalization ability of a model. In _K_-fold cross-validation, the original dataset is randomly partitioned into _K_ subsets (folds) of approximately equal size. The model is trained and evaluated _K_ times, with each iteration using a different subset as the validation set and the remaining subsets as the training set. This allows for a more robust evaluation of the model's performance because it reduces the variance that may result from using a single train-test split. The final performance metric is typically averaged over the _K_ iterations for a more reliable estimate of the model's performance.

You can use cross-validation for a variety of purposes in predictive modeling:

* Generating out-of-sample predictions from a neural network
* Estimating a good number of epochs to train a neural network for (early stopping)
* Evaluating the effectiveness of certain hyperparameters, such as activation functions, neuron counts, and layer counts

Cross-validation uses several folds and multiple models to provide each data segment a chance to serve as both the validation and training set. Figure 5.CROSS shows cross-validation.

**Figure 5.CROSS: K-Fold Crossvalidation**
![K-Fold Crossvalidation](https://biologicslab.co/BIO1173/images/class_1_kfold.png "K-Fold Crossvalidation")
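
The rotation shown in the figure can be sketched with Scikit-Learn's `KFold` on a toy array. The ten-element array `data` and the counter `times_validated` are assumptions made purely for illustration; they are not part of the lesson's datasets.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)   # ten hypothetical samples, labeled 0..9
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Count how often each sample is placed in a validation fold
times_validated = np.zeros(len(data), dtype=int)
for fold, (train_idx, val_idx) in enumerate(kf.split(data), start=1):
    times_validated[val_idx] += 1
    print(f"Fold {fold}: train={train_idx}, validation={val_idx}")

# Every sample serves as validation data exactly once
print(times_validated)
```

Across the five folds each sample is used four times for training and once for validation, which is what lets every prediction be made out-of-sample.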

It is important to note that each fold will have one model (neural network). To generate predictions for new data (not present in the training set), predictions from the fold models can be handled in several ways:

* Choose the model with the highest validation score as the final model.
* Present new data to the five models (one for each fold) and average the result (this is an [ensemble](https://en.wikipedia.org/wiki/Ensemble_learning)).
* Retrain a new model (using the same settings as the cross-validation) on the entire dataset. Train for the same number of epochs and with the same hidden layer structure.

Generally, I prefer the last approach and will retrain a model on the entire data set once I have selected hyper-parameters. Of course, I will always set aside a final holdout set for model validation that I do not use in any aspect of the training process.
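
For comparison, the ensemble option from the list above amounts to a simple average over the fold models. The sketch below uses made-up prediction values in place of real model output; the array `fold_predictions` is hypothetical.

```python
import numpy as np

# Hypothetical predictions (kg) from five fold models for three new subjects.
# Each row is one model; each column is one subject.
fold_predictions = np.array([
    [70.0, 55.0, 82.0],
    [71.0, 54.0, 80.0],
    [69.0, 56.0, 81.0],
    [72.0, 55.0, 83.0],
    [68.0, 55.0, 79.0],
])

# Average across models (axis 0) to get one ensemble prediction per subject
ensemble_prediction = fold_predictions.mean(axis=0)
print(ensemble_prediction)   # prints: [70. 55. 81.]
```

With real Keras models, each row would presumably come from calling `predict()` on one fold's model with the same new data.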

## Regression vs Classification _K_-Fold Cross-Validation

Regression and classification are handled somewhat differently concerning cross-validation. Regression is the simpler case where you can break up the data set into _K_ folds with little regard for where each item lands. For regression, the data items should fall into the folds as _randomly_ as possible. It is also important to remember that not every fold will necessarily have the same number of data items. It is not always possible for the data set to be evenly divided into _K_ folds. For regression cross-validation, we will use the Scikit-Learn class **KFold**.
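
The point about uneven folds is easy to check on a small example. The sketch below assumes a hypothetical set of 11 items, which cannot be divided evenly into 5 folds.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(11)   # 11 hypothetical items, 5 folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Size of the validation (test) portion of each fold
fold_sizes = [len(val_idx) for _, val_idx in kf.split(samples)]
print(fold_sizes)
```

The fold sizes differ by at most one item, so the imbalance is negligible for datasets of realistic size.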

Cross-validation for classification could also use the **KFold** object; however, this technique would not ensure that the class balance in each fold matches that of the original dataset. The balance of classes in the data a model is validated on should remain the same as (or similar to) the balance in the data it was trained on. Drift in this distribution is one of the most important things to monitor after a trained model has been placed into actual use. Because of this, we want to make sure that the cross-validation itself does not introduce an unintended shift. Keeping the class proportions constant across folds is called **_stratified sampling_** and is accomplished by using the Scikit-Learn object **StratifiedKFold** in place of **KFold** whenever you use classification. In summary, you should use the following two objects in Scikit-Learn:

* **KFold:** when dealing with a regression problem.
* **StratifiedKFold:** when dealing with a classification problem.
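
A minimal sketch of stratified sampling follows. The label array (15 samples of class 0 and 5 of class 1) and the placeholder `features` array are assumptions chosen so the 3:1 class ratio is easy to follow.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 15 samples of class 0 and 5 samples of class 1 (3:1 ratio)
labels = np.array([0] * 15 + [1] * 5)
features = np.zeros((len(labels), 1))   # placeholder; only the labels matter here

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Count how many samples of each class land in each validation fold
fold_counts = []
for _, val_idx in skf.split(features, labels):
    counts = np.bincount(labels[val_idx], minlength=2)
    fold_counts.append(counts.tolist())

print(fold_counts)   # prints: [[3, 1], [3, 1], [3, 1], [3, 1], [3, 1]]
```

Every validation fold keeps the original 3:1 ratio. Note that `split()` needs the labels as its second argument so that it knows which classes to balance.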

### Create functions for this lesson

The code in the cell below creates 2 useful functions for this lesson, `elaspedTime(start,stop)` and `rename_col_by_index(dataframe, index_mapping)`. 

In [3]:
# Create functions

# Simple function to print out elapsed time
def elaspedTime(start,end):
    # Print out time
    seconds = int((end-start))
    seconds = seconds % (24 * 3600)
    hour = seconds // 3600
    seconds %= 3600
    minutes = seconds // 60
    seconds %= 60
    print("Elapsed time = %d:%02d:%02d" % (hour, minutes, seconds))
    print()

# Simple function to change column name in a dataframe
def rename_col_by_index(dataframe, index_mapping):
    dataframe.columns = [index_mapping.get(i, col) for i, col in enumerate(dataframe.columns)]
    return dataframe
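
The time arithmetic inside `elaspedTime()` can be followed with a worked example. The sketch below uses `divmod()` instead of separate `//` and `%` steps, and the value of 3725 seconds is a hypothetical elapsed time.

```python
total_seconds = 3725   # hypothetical elapsed time in seconds

# 3725 s = 1 hour (3600 s) with 125 s left over
hours, remainder = divmod(total_seconds, 3600)
# 125 s = 2 minutes (120 s) with 5 s left over
minutes, seconds = divmod(remainder, 60)

formatted = "%d:%02d:%02d" % (hours, minutes, seconds)
print("Elapsed time =", formatted)   # prints: Elapsed time = 1:02:05
```

Because `elaspedTime()` first takes the seconds modulo `24 * 3600`, a run lasting longer than a day would wrap around and be reported as a shorter time.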


## Out-of-Sample Regression Predictions with K-Fold Cross-Validation

The following code trains a neural network on the Body Performance dataset using 5-fold cross-validation. The expected performance of a neural network of the type trained here is the score for the generated out-of-sample predictions. 

### Example 1-Step 1: Prepare feature vector

In the cell below, we prepare a feature vector using the **Body Performance** dataset to predict an individual's weight in kg. This model is set up as a regression problem.

The datafile `bodyPerformance.csv` is read from the course HTTPS server to create a DataFrame called `bpBigDF`. To speed training of the neural network, we will only use 15% of the data in `bpBigDF` using the code chunk:
~~~text
# Only use 15% for neural network
bpDF=bpBigDF.sample(frac=0.15)
~~~
The actual DataFrame used for Example 1 will be `bpDF`. 

As mentioned above, there are two non-numeric columns, `gender` and `class`. The code below uses `mapping` to convert the letters `M` to the integer `1` and `F` to the integer `0` in the `gender` column. Mapping is also used to convert the letters in the column `class` to integers using this code chunk:
~~~text
# Map gender
mapping = {'M': 1, 'F': 0}
bpDF['gender'] = bpDF['gender'].map(mapping)

# Map class
mapping =  {'A': 0,
            'B': 1,
            'C': 2,
            'D': 3}
bpDF['class'] = bpDF['class'].map(mapping)
~~~
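
One thing to watch for when mapping: `Series.map()` with a dictionary returns `NaN` for any value that is not a key in the dictionary. The sketch below uses a small hypothetical column in which `'X'` stands in for an unexpected label, so it is an illustration rather than part of the lesson's data.

```python
import pandas as pd

# Hypothetical column; 'X' stands in for an unexpected or mistyped label
gender = pd.Series(['M', 'F', 'M', 'X'])

mapping = {'M': 1, 'F': 0}
mapped = gender.map(mapping)

print(mapped.tolist())
# Count the values that had no entry in the mapping dictionary
print("Unmapped values:", mapped.isna().sum())
```

Checking `isna().sum()` after mapping is a quick way to confirm that every category was covered.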

Since the goal of this regression neural network is to predict weight, the column `weight_kg` is dropped when creating a list of column names to be used for the x-values:
~~~text
# Generate list of columns for x
bpX_columns = bpDF.columns.drop('weight_kg')  # weight_kg is y
~~~
In this example, we will use the `Min-Max` scaler for standardizing values.
~~~text
# Standardize values with Min-Max
for col in bpX_columns:
    bpDF[col] = minmax_scale(bpDF[col])
~~~
Having processed all of the x-values, we can now generate the X feature vector using this code chunk:
~~~text
# Generate X feature vector
bpX = bpDF[bpX_columns].values
bpX = np.asarray(bpX).astype('float32')
~~~
Finally, since this is a regression analysis we **don't** want to One-Hot encode the values in the column `weight_kg` but use them directly:
~~~text
# Generate Y feature vector
bpY = bpDF['weight_kg'].values
bpY = np.asarray(bpY).astype('float32')
~~~

In [4]:
# Example 1-Step 1: Prepare feature vector

from sklearn.preprocessing import minmax_scale
# Read the data set
bpBigDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/bodyPerformance.csv",
    na_values=['NA','?'])

# Only use 15% for neural network
bpDF=bpBigDF.sample(frac=0.15)

# Map category `gender`
mapping = {'M': 1, 'F': 0}
bpDF['gender'] = bpDF['gender'].map(mapping)

# Map category `class`
mapping =  {'A': 0,
            'B': 1,
            'C': 2,
            'D': 3}
bpDF['class'] = bpDF['class'].map(mapping)

# Generate list of columns for x
bpX_columns = bpDF.columns.drop('weight_kg')  # weight_kg is y

# Standardize values with Min-Max
for col in bpX_columns:
    bpDF[col] = minmax_scale(bpDF[col])

# Generate X feature vector
bpX = bpDF[bpX_columns].values
bpX = np.asarray(bpX).astype('float32')

# Generate Y feature vector
bpY = bpDF['weight_kg'].values
bpY = np.asarray(bpY).astype('float32')

# Print first 4 values in bpX
np.set_printoptions(suppress=True,precision=4)
print(bpX[0:4])

[[0.4651 1.     0.6595 0.4308 0.4937 0.4632 0.713  0.7085 0.4231 0.7559
  0.6667]
 [0.2093 1.     0.642  0.3371 0.4684 0.5158 0.8102 0.6384 0.3846 0.7659
  1.    ]
 [0.6744 0.     0.286  0.7121 0.4304 0.3158 0.2362 0.5314 0.1538 0.3579
  1.    ]
 [0.2558 1.     0.716  0.3884 0.5443 0.5368 0.7339 0.8137 0.4615 0.6455
  0.6667]]


If your code is correct you should see something similar to the following output:
~~~text
[[0.4651 1.     0.8912 0.2554 0.7373 0.469  0.7895 0.7083 0.5733 0.7972
  0.3333]
 [0.0233 0.     0.4791 0.4892 0.5254 0.1947 0.2684 0.5924 0.36   0.5455
  0.6667]
 [0.0233 0.     0.2824 0.5758 0.5847 0.4602 0.3316 0.8116 0.48   0.6049
  0.3333]
 [0.0698 1.     0.6339 0.1775 0.5339 0.531  0.7474 0.2699 0.5867 0.8357
  1.    ]]
~~~

These are the Min-Max scaled scores of the 11 fitness measurements for the first 4 individuals in `bpX`. 

If you re-run this cell, you will get slightly different values due to the fact that we are using only a **_random sample_** of the entire dataset. Each time the cell is run, a different random sample is selected. 
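
If repeatable results are wanted, `DataFrame.sample()` accepts a `random_state` argument that fixes the selection. The lesson code does not set it; the sketch below uses a hypothetical 100-row DataFrame to show the effect.

```python
import pandas as pd

# Hypothetical stand-in for the full dataset
df = pd.DataFrame({'subject': range(100)})

# The same random_state gives the same 15% sample every time
sample_a = df.sample(frac=0.15, random_state=42)
sample_b = df.sample(frac=0.15, random_state=42)

print(len(sample_a))                                # prints: 15
print(sample_a.index.equals(sample_b.index))        # prints: True
```

The value `42` is arbitrary; any fixed integer makes the sample reproducible.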

See if you can work out the gender of each individual in the output shown above from their Min-Max scores. (HINT: The first record is from a man, the second from a woman).

### Example 1-Step 2: Out-of-Sample Regression Predictions with K-Fold Cross-Validation

Now that we have created our X and Y feature vectors, we are ready to perform 5-fold cross-validation to generate out-of-sample (oos) predictions. For this example, we will use 100 epochs and no early stopping. Later we will see how we can estimate a more optimal epoch count.

In [5]:
# Example 1-Step 2: Out-of-Sample Regression Predictions with K-Fold Cross-Validation

# Set variables
EPOCHS=100 # number of epochs for each loop
numK=5     # Set number of K-folds

# Record the start time in T_start
T_start = time.time()

# Specify type of K-fold Cross-Validation
kf = KFold(numK, shuffle=True, random_state=42) 

# Initial arrays for Out Of Samples (oos)
oos_y = []    # array to hold actual y-values
oos_pred = [] # array to hold predicted y-values

# START LOOP HERE -----------------------------------------------#

fold = 0 # initialize fold count

# Run loop for each fold
for train, test in kf.split(bpX):
    fold+=1  # increment loop counter
    print(f"Fold #{fold} starting...")

    # Generate this fold's train and test datasets
    x_train = bpX[train]
    y_train = bpY[train]
    x_test = bpX[test]
    y_test = bpY[test]

    # Build new model for this fold
    model = Sequential()
    model.add(Dense(20, input_dim=bpX.shape[1], 
                    activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))

    # Compile model for this fold (regression) 
    model.compile(loss='mean_squared_error', 
                  optimizer='adam')

    # Run model for this fold
    model.fit(x_train,y_train,
              validation_data=(x_test,y_test),verbose=0,
              epochs=EPOCHS)

    # Store model predictions
    pred = model.predict(x_test)

    # oos_y contains fold's actual Y-values
    oos_y.append(y_test)
    
    # oos_pred contains fold's predicted Y-values
    oos_pred.append(pred)    

    # Measure and print RMSE for this fold
    score = np.sqrt(metrics.mean_squared_error(pred,y_test))
    print(f"Fold score (RMSE): {score}")

# END LOOP HERE -----------------------------------------------#


# Build the oos prediction list and calculate the error.

# Actual y-values for all loops
oos_y = np.concatenate(oos_y)

# Predicted y-values from all loops
oos_pred = np.concatenate(oos_pred)

# Compute Final (Grand) total from all K loops
score = np.sqrt(metrics.mean_squared_error(oos_pred,oos_y))
print(f"Final, out of sample score (RMSE): {score} \n")    
    
# Write the cross-validated prediction
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
oosDF = pd.concat( [oos_y,oos_pred],axis=1 )

# Uncomment the next line to write file
#oosDF.to_csv(filename_write,index=False)

# Record the end time in T_end
T_end = time.time()

# Print out elapsed time
elaspedTime(T_start,T_end)


Fold #1 starting...
Fold score (RMSE): 4.976469993591309
Fold #2 starting...
Fold score (RMSE): 5.076278209686279
Fold #3 starting...
Fold score (RMSE): 5.131648540496826
Fold #4 starting...
Fold score (RMSE): 5.244519233703613
Fold #5 starting...
Fold score (RMSE): 5.0899658203125
Final, out of sample score (RMSE): 5.104522705078125 

Elapsed time = 0:02:00



This is a fairly large neural network, so it may take some time to finish all 5 K-fold training loops. Please be patient.

If your code is correct you should see something similar to the following output:
~~~text
Fold #1 starting...
13/13 [==============================] - 0s 1ms/step
Fold score (RMSE): 5.109001159667969
Fold #2 starting...
13/13 [==============================] - 0s 1ms/step
Fold score (RMSE): 5.294778347015381
Fold #3 starting...
13/13 [==============================] - 0s 1ms/step
Fold score (RMSE): 4.638960838317871
Fold #4 starting...
13/13 [==============================] - 0s 1ms/step
Fold score (RMSE): 4.801635265350342
Fold #5 starting...
13/13 [==============================] - 0s 2ms/step
Fold score (RMSE): 4.9324259757995605
Final, out of sample score (RMSE): 4.9606804847717285 

Elapsed time = 0:01:49
~~~

As you can see, the data was processed by 5 different neural network models, one model for each _K_ fold. Due to the random nature of the training process, each model performed somewhat differently. 

In this particular example, the third model (Fold #3) had the lowest RMSE (`4.63896`) and therefore was the most accurate. On the other hand, the second model (Fold #2) had the highest RMSE (`5.294778`) and therefore was the least accurate. The `Final` out-of-sample score was `4.9607`. 

### Example 1-Step 3: Print out Actual and Predicted Y-values

So what does an RMSE value like `4.961` mean for this model trained on this dataset? Our goal in Example 1 was to predict the **weight** of individuals given measurements of 11 fitness characteristics. This is a basic _regression_ problem for a neural network. 

The code in the cell below prints out the actual weights and predicted weights (in kilograms) for all of the "out-of-sample" individuals. Out-of-sample refers to data that was _not_ used in the process of developing the neural network model. 

Using oos data, the accuracy and performance of the model can be evaluated on new, unseen data to assess its generalizability and potential for predicting future outcomes. This helps ensure that the neural network model was not overfitting or performing well only on the data it was trained on. 
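
To make the RMSE concrete, here is a hand-sized calculation: take each prediction error, square it, average the squares, and take the square root. The weights below are hypothetical and were chosen so the arithmetic is easy to follow.

```python
import math

actual    = [60.0, 70.0, 80.0, 90.0]   # hypothetical actual weights (kg)
predicted = [61.0, 77.0, 79.0, 83.0]   # hypothetical predicted weights (kg)

# Errors are 1, 7, -1 and -7 kg, so the squared errors are 1, 49, 1 and 49
squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]

# Mean squared error = 100 / 4 = 25; its square root is the RMSE
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(rmse)   # prints: 5.0
```

An RMSE of 5 kg therefore describes a typical prediction error of about 5 kg, with larger errors weighted more heavily than smaller ones because they are squared before averaging.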

In [6]:
# Example 1-Step 3: Print out actual and predicted y-values 

# Rename columns
new_column_mapping = {0: 'Actual Wt (kg)', 1: 'Predicted Wt (kg)'}
oosDF = rename_col_by_index(oosDF, new_column_mapping)

# Set display options
pd.set_option('display.max_rows', 8)
pd.set_option('display.max_columns', 8)

# Display DataFrame
display(oosDF)

Unnamed: 0,Actual Wt (kg),Predicted Wt (kg)
0,88.500000,88.168480
1,47.099998,55.621708
2,75.000000,74.698868
3,80.800003,80.698418
...,...,...
2005,65.199997,70.975533
2006,60.599998,65.756805
2007,81.400002,79.140892
2008,64.900002,65.492760


If your code is correct you should see something similar to the following table:

![__](https://biologicslab.co/BIO1173/images/class_05_2_Exm1C.png)

By inspection you can see that the model's weight predictions are close to, but not exactly the same as, the actual weights of the out-of-sample (oos) individuals. For example, the actual weight of the last individual (2008) was 77.00 kg while the model predicted his weight as 68.98 kg. For this particular individual the difference (i.e. error) is about 8 kg. We should not be too surprised by an error of this size, since we know that the final RMSE value was about 5 kg.

-----------------------------------------------
## Standardize Data Twice?

Standardizing data twice, first to a range of 0 to 1 and then taking their Z-scores, is not a recommended practice. This is because standardizing data to a specific range (0 to 1) and then calculating Z-scores introduces redundant and unnecessary transformations that can distort the interpretation of the data.

When data is standardized to a range of 0 to 1, the original distribution and variability of the data are altered to fit within a specific interval. Subsequently, calculating Z-scores on this already transformed data can lead to incorrect normalization and scaling of the variables.

It is more appropriate to choose one standardization method based on the requirements of the analysis. Standardizing the data either to a specific range or to Z-scores alone is typically sufficient for most analytical purposes and helps maintain the integrity of the data without introducing unnecessary complexity or distorting the distribution of the variables.

-----------------------------------------------

### **Exercise 1A: Prepare feature vector**

In the cell below prepare a feature vector for the Multiple Disease Prediction dataset for Out-of-Sample _Regression_ with K-Fold Cross-Validation. When preparing your feature vector, keep in mind that your regression neural network will be used to predict `Glucose` levels in the blood. 

Use the following code chunk to read the data from the course HTTPS server:
~~~text
# Read the data set
mdpBigDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/Blood_samples.csv",
    na_values=['NA','?'])
~~~
Use 60% of `mdpBigDF` to create the actual DataFrame that you will use. Call this DataFrame `mdpDF`. 
~~~text
# Only use 60% for neural network
mdpDF=mdpBigDF.sample(frac=0.60)
~~~
The only non-numeric column in the Multiple Disease Prediction dataset is `Disease`. You will need to map the five categorical values in this column to integers using the following code chunk:
~~~text
# Map Diseases
mapping =  {'Anemia': 0,
            'Diabetes': 1,
            'Healthy': 2,
            'Thalasse': 3,
            'Thromboc': 4}
mdpDF['Disease'] = mdpDF['Disease'].map(mapping)
~~~

Then create a list of columns to be used for generating the x-values by dropping the column `Glucose`--which is the y-value. Call your list `mdpX_columns`. Do **not** standardize the x-values to their Z-scores. The data in the Multiple Disease Prediction dataset has already been standardized to the range 0.0 to 1.0. As described above, applying a second standardization is not considered good practice.

Use your list `mdpX_columns` to generate the X feature vector called `mdpX` using the following code chunk: 
~~~text
# Generate X feature vector
mdpX = mdpDF[mdpX_columns].values
mdpX = np.asarray(mdpX).astype('float32')
~~~
You will also need to generate the Y feature vector directly from the column `Glucose` using the following code chunk:
~~~text
# Generate Y feature vector
mdpY = mdpDF['Glucose'].values
mdpY = np.asarray(mdpY).astype('float32')
~~~
Finally, print out the x-values for only the first 4 individuals in `mdpX`.  

In [7]:
# Insert your code for Exercise 1A here



If your code is correct you should see something similar to the following output:
~~~text
[[0.7188 0.2893 0.8843 0.4285 0.3649 0.8499 0.9119 0.044  0.9623 0.6245
  0.2501 0.2845 0.1755 0.0946 0.8535 0.1943 0.3111 0.3411 0.8494 0.2628
  0.2146 0.9056 0.4891 1.    ]
 [0.2842 0.3127 0.9912 0.872  0.4886 0.4946 0.8336 0.1907 0.6656 0.252
  0.6283 0.4225 0.0297 0.0888 0.7088 0.2313 0.2963 0.7866 0.4691 0.5526
  0.1389 0.0176 0.5855 3.    ]
 [0.4421 0.3317 0.1711 0.9908 0.9924 0.9137 0.88   0.0255 0.9315 0.3856
  0.7794 0.3035 0.2239 0.1417 0.3134 0.5512 0.9223 0.4851 0.4863 0.5196
  0.0763 0.0732 0.0842 3.    ]
 [0.2482 0.2956 0.0559 0.0445 0.4161 0.5485 0.3389 0.9069 0.7223 0.2431
  0.3694 0.6434 0.0431 0.2762 0.3675 0.4501 0.9894 0.0877 0.318  0.7435
  0.2066 0.5948 0.294  0.    ]]
~~~

Notice that there are no negative values. If you have any negative values in your output, it probably means that you converted your data to Z-scores. As explained above, this is not a good idea since the data in the Multiple Disease Prediction dataset has already been standardized to the range 0 to 1 using `Min-Max` scaling.  

### **Exercise 1B: Out-of-Sample Regression Predictions with K-Fold Cross-Validation**

Use the feature vector you just created in **Exercise 1A** to perform a 5-fold cross-validation of out-of-sample predictions of glucose blood levels. Follow the template shown above in Example 1B. Set the number of epochs to 100. 

In [8]:
# Insert your code for Exercise 1B here



If your code is correct you should see something similar to the following output:
~~~text
Fold #1 starting...
9/9 [==============================] - 0s 1ms/step
Fold score (RMSE): 0.0006521505420096219
Fold #2 starting...
9/9 [==============================] - 0s 1ms/step
Fold score (RMSE): 3.028195578735904e-07
Fold #3 starting...
9/9 [==============================] - 0s 2ms/step
Fold score (RMSE): 0.00023288458760362118
Fold #4 starting...
9/9 [==============================] - 0s 1ms/step
Fold score (RMSE): 0.0007166583673097193
Fold #5 starting...
9/9 [==============================] - 0s 1ms/step
Fold score (RMSE): 7.3592500484664924e-06
Final, out of sample score (RMSE): 0.000445868237875402 

Elapsed time = 0:01:23
~~~
The `Final, out of sample score (RMSE): 0.000445868237875402` in this particular run appears to be quite small. When it comes to RMSE values, the smaller the better. But let's see what this error level looks like in the context of predicting blood glucose levels.

### **Exercise 1C: Print out actual and predicted y-values**

In the cell below, rename the column headers of the DataFrame `oosDF` to read `Actual Glucose` and `Predicted Glucose` using the function `rename_col_by_index()`. Once the columns have been renamed, display the DataFrame `oosDF` to see a list of the actual and predicted glucose levels.

In [9]:
# Insert your code for Exercise 1C here



If your code is correct you should see something similar to the following table:

![__](https://biologicslab.co/BIO1173/images/class_05_2_Exe1C.png)

By inspection of the output above, you can see the model's predictions of glucose blood levels are very close to, or even exactly the same as, the actual values of the out-of-sample individuals. The Final RMSE value = `0.000445868237875402` indicates that your model is very accurate for predicting glucose blood levels!   

## Classification with Stratified K-Fold Cross-Validation

**_Classification with Stratified K-Fold Cross-Validation_** is a method used to evaluate the performance of a classification model. It involves splitting the dataset into a specified number of folds (_K_) while ensuring that each fold maintains the same proportion of classes as the original dataset (_stratification_).

The process works as follows:

* The dataset is divided into _K_ equal-sized folds.
* For each fold, the model is trained on _K_-1 folds and validated on the remaining fold.
* This process is repeated _K_ times, with each fold used as the validation set exactly once.
* The performance metrics (such as accuracy, precision, recall) are averaged across all _K_ iterations to provide a robust estimate of the model's performance.

Stratified K-Fold Cross-Validation is particularly useful for imbalanced datasets where one class may dominate the others. By ensuring that each fold maintains the class proportions, this technique provides a **_more reliable estimate_** of the model's performance.
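
As a small illustration of the stratification idea, the sketch below compares how many minority-class samples land in each validation fold with `KFold` and with `StratifiedKFold`. The labels are made up (90 samples of class 0 followed by 10 of class 1) and the data is deliberately left unshuffled, which is a worst case for plain `KFold`.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, then 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here

def minority_counts(splitter):
    """Return the number of class-1 samples in each validation fold."""
    return [int(y[test].sum()) for _, test in splitter.split(X, y)]

plain_counts = minority_counts(KFold(n_splits=5))
strat_counts = minority_counts(StratifiedKFold(n_splits=5))

print("KFold minority count per fold:          ", plain_counts)
print("StratifiedKFold minority count per fold:", strat_counts)
```

With plain `KFold` the minority class ends up concentrated in a single fold, while `StratifiedKFold` spreads it evenly so that every fold keeps the original 9:1 class ratio.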

### Example 2A: Prepare feature vector

The following code prepares a feature vector from the Body Performance dataset for classification with stratified _K_-fold cross-validation.  **Stratified** _K_-fold cross-validation is the preferred form with classification data.  This technique ensures that the percentages of each class remain the same across all folds. 

In Example 2, we want our neural network model to predict the fitness level (`class`) of an individual based on his/her 11 fitness measurements (x-values). 

As in Example 1A, we need to map the categorical values in columns `gender` and `class` to integers. The code is pretty much the same except we will need to drop the column `class`, since this is the y-value. Be careful to never include the y-value in the column list for the x-values.

Since this is a **_classification_** model, we need to generate the y-values differently. In particular, we need to One-Hot encode the column `class` as shown in the following code chunk:
~~~text
# Generate y-values as numpy array
dummies = pd.get_dummies(bpDF['class']) # Classification
FitClass = dummies.columns
bpY= dummies.values
bpY = np.asarray(bpY).astype('float32')
~~~
To double check that the y-values are in the correct shape, the y-values for the first 10 individuals in `bpY` are printed out.

In [10]:
# Example 2A: Prepare feature vector

import pandas as pd
from scipy.stats import zscore

# Read the data set
bpBig = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/bodyPerformance.csv",
    na_values=['NA','?'])

# Only use 10% for neural network
bpDF=bpBig.sample(frac=0.10)

# Map gender
mapping = {'M': 1, 'F': 0}
bpDF['gender'] = bpDF['gender'].map(mapping)

# Map class
mapping =  {'A': 0,
            'B': 1,
            'C': 2,
            'D': 3}
bpDF['class'] = bpDF['class'].map(mapping)

# Generate list of columns for x
bpX_columns = bpDF.columns.drop('class')

# Standardize values with their Z-scores
for col in bpX_columns:
    bpDF[col] = zscore(bpDF[col])

# Generate x-values as numpy array
bpX = bpDF[bpX_columns].values
bpX = np.asarray(bpX).astype('float32')

# Generate y-values as numpy array
dummies = pd.get_dummies(bpDF['class']) # Classification
FitClass = dummies.columns
bpY= dummies.values
bpY = np.asarray(bpY).astype('float32')

# Print out bpY
print(bpY[0:10])

[[0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]]


If your code is correct you should see something similar to the following output:
~~~text
[[0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]]
~~~
As you can see, the y-values are One-Hot encoded. Since there are 4 possible fitness classifications, `A`, `B`, `C` and `D`, there are 4 dummy columns. In this particular example, the actual fitness level of the first subject was `B` since he has a `1` in the second column. The next three individuals were in the top fitness category, `A`, since they have a `1` in the first column. 
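
Reading a class off a One-Hot encoded row can also be done in code. The sketch below is a minimal example with made-up labels (not the Body Performance data): it One-Hot encodes a few class letters and then recovers them by finding the position of the `1` in each row with `np.argmax()`. The names `labels`, `class_names` and `decoded` are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical class labels for five subjects (illustrative only)
labels = pd.Series(['B', 'A', 'A', 'A', 'C'])

# One-Hot encode the labels with pd.get_dummies()
dummies = pd.get_dummies(labels)
class_names = dummies.columns  # one column per class that is present
one_hot = np.asarray(dummies.values).astype('float32')

# Recover the class of each row: position of the 1 in that row
class_index = np.argmax(one_hot, axis=1)
decoded = class_names[class_index]

print(one_hot)
print(list(decoded))
```

Note that `pd.get_dummies()` only creates a column for each class that actually occurs, so this small example has 3 dummy columns rather than 4.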

### Example 2B: Classification with Stratified K-Fold Cross-Validation

Once again, the code in the cell below is fairly similar to the code used in Example 1B. The most important difference is the selection of the **stratified** variable `kf` as shown in this code chunk:
~~~text
# Cross-validate
# Use for StratifiedKFold classification
kf = StratifiedKFold(numK, shuffle=True, random_state=42) 
~~~
As mentioned above, **_stratified_ _K_-fold cross-validation** should always be used with classification neural networks. The new `kf` variable comes into play at the beginning of the `for` loop:
~~~text
for train, test in kf.split(bpX,bpDF['class']): 
~~~
Stratification affects sample splitting by ensuring that the proportion of different classes or categories in the dataset is maintained across the training and validation sets. This was **not** done previously in Example 1B where `kf = KFold()`. In the code below, `kf = StratifiedKFold()` instead.

As mentioned above, when splitting the data into subsets for training and testing, stratification helps prevent bias and ensures that each subset is representative of the overall dataset in terms of class distribution. By using stratified sampling, the resulting training and validation sets will have a similar distribution of classes as the original dataset. This is especially important in scenarios where the classes are imbalanced, as stratification helps ensure that all classes are adequately represented in both subsets. This approach can lead to more reliable and accurate model evaluation, particularly in classification tasks where the goal is to predict class labels.

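The neural network model below has an additional feature, a `ModelCheckpoint` callback, which monitors `val_loss` during training and saves the model's weights whenever `val_loss` improves. It is created with this code chunk: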
~~~text
    # Define the checkpoint callback to save the model with the best performance
    checkpoint = ModelCheckpoint(f'Model_2_bestFold_{fold}.h5', 
                    monitor='val_loss', save_best_only=True, 
                    save_weights_only=True, mode='min', verbose=0)
~~~


In [11]:
# Example 2B: Classification with Stratified K-Fold Cross-Validation

import pandas as pd
import os
import numpy as np
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import ModelCheckpoint

# Set variables
EPOCHS=100 # number of epochs for each loop
numK=5     # Set number of folds
filename_write="StratKfoldOosPred.csv" # Set filename

# Record the start time in T_start
T_start = time.time()

# Cross-validate
# Use for StratifiedKFold classification
kf = StratifiedKFold(numK, shuffle=True, random_state=42) 

# Initial arrays to hold Out Of Samples (oos)
oos_y = []
oos_pred = []
fold = 0

# START LOOP HERE -----------------------------------#

# StratifiedKFold needs the class labels (y) in order to stratify the split
for train, test in kf.split(bpX,bpDF['class']):  
    fold+=1
    print(f"Fold #{fold} starting...")
        
    # Generate fold's train and test datasets
    x_train = bpX[train]
    y_train = bpY[train]
    x_test = bpX[test]
    y_test = bpY[test]
    
    # Build new model for this fold
    model = Sequential()
    # Hidden 1
    model.add(Dense(50, input_dim=bpX.shape[1], activation='relu')) 
    model.add(Dense(25, activation='relu')) # Hidden 2
    model.add(Dense(bpY.shape[1],activation='softmax')) # Output

    # Compile model for classification
    model.compile(loss='categorical_crossentropy', optimizer='adam')

    # Define the checkpoint callback to save the model with the best performance
    checkpoint = ModelCheckpoint(f'Model_2_bestFold_{fold}.h5', 
                    monitor='val_loss', save_best_only=True, 
                    save_weights_only=True, mode='min', verbose=0)

    # Run model for this fold
    model.fit(x_train,y_train,validation_data=(x_test,y_test),
              verbose=0, callbacks=[checkpoint], epochs=EPOCHS)

    # Use model to predict y-values from x_test
    pred = model.predict(x_test)

    # Save actual y-values 
    oos_y.append(y_test)
    # Convert raw probabilities to chosen class (highest probability)
    pred = np.argmax(pred,axis=1) 
    oos_pred.append(pred)  

    # Measure this fold's accuracy
    y_compare = np.argmax(y_test,axis=1) # For accuracy calculation
    score = metrics.accuracy_score(y_compare, pred)
    print(f"Fold score (accuracy): {score}")

# END LOOP HERE -----------------------------------#

# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
oos_y_compare = np.argmax(oos_y,axis=1) # For accuracy calculation

score = metrics.accuracy_score(oos_y_compare, oos_pred)
print(f"Final score (accuracy): {score} \n")    
    
# Combine the predictions and the actual y-values into one DataFrame
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
oosDF = pd.concat( [oos_pred,oos_y],axis=1 )

# Record the end time in T_end
T_end = time.time()

# Print out elapsed time
elaspedTime(T_start,T_end)

Fold #1 starting...
Fold score (accuracy): 0.6455223880597015
Fold #2 starting...
Fold score (accuracy): 0.6268656716417911
Fold #3 starting...
Fold score (accuracy): 0.6865671641791045
Fold #4 starting...
Fold score (accuracy): 0.6380597014925373
Fold #5 starting...
Fold score (accuracy): 0.6704119850187266
Final score (accuracy): 0.6534727408513816 

Elapsed time = 0:01:31



If your code is correct you should see something similar to the following output:
~~~text
Fold #1 starting...
9/9 [==============================] - 0s 1ms/step
Fold score (accuracy): 0.6865671641791045
Fold #2 starting...
9/9 [==============================] - 0s 2ms/step
Fold score (accuracy): 0.6753731343283582
Fold #3 starting...
9/9 [==============================] - 0s 2ms/step
Fold score (accuracy): 0.7014925373134329
Fold #4 starting...
9/9 [==============================] - 0s 1ms/step
Fold score (accuracy): 0.6305970149253731
Fold #5 starting...
9/9 [==============================] - 0s 2ms/step
Fold score (accuracy): 0.704119850187266
Final score (accuracy): 0.6796116504854369 

Elapsed time = 0:01:30
~~~
Since this is classification, our performance measure is `accuracy`. For this particular example, the Final accuracy score is `0.68`. Let's see what this means in the context of predicting an individual's fitness level.

You should also see that there are now 5 new files in your local directory called `Model_2_bestFold_1.h5` to `Model_2_bestFold_5.h5`. These files are the 5 **_saved models_** (one per fold), each holding the weights from the epoch with the lowest `val_loss`. 

### Example 2C: Print out actual and predicted y-values

So what does an accuracy value of `0.68` mean? Our goal in Example 2 was to predict the fitness level of individuals given measurements of 11 fitness characteristics. This is a basic classification problem for a neural network using tabular data (i.e., a Pandas DataFrame).

The code in the cell below prints out the actual and predicted fitness levels for all of the "out-of-sample" individuals. The code uses the function `rename_col_by_index()` to create the appropriate column labels.

In [12]:
# Example 2C: Print out actual and predicted y-values

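# Rename columns
# In Example 2B, oosDF was built as pd.concat([oos_pred, oos_y]), so
# column 0 holds the predicted class and the remaining columns hold
# the One-Hot encoded actual class
new_column_mapping = {0: 'Predicted class', 1: 'Actual: 0'}
oosDF = rename_col_by_index(oosDF, new_column_mapping)

# Set display options
pd.set_option('display.max_rows', 8)
pd.set_option('display.max_columns', 8)

# Display DataFrame
display(oosDF)

Unnamed: 0,Predicted class,Actual: 0,1,2,3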
0,1,1.0,0.0,0.0,0.0
1,3,0.0,0.0,0.0,1.0
2,2,0.0,1.0,0.0,0.0
3,3,0.0,0.0,0.0,1.0
...,...,...,...,...,...
1335,2,0.0,0.0,1.0,0.0
1336,0,1.0,0.0,0.0,0.0
1337,0,0.0,1.0,0.0,0.0
1338,2,0.0,0.0,1.0,0.0


If your code is correct you should see something similar to the following table:

![__](https://biologicslab.co/BIO1173/images/class_05_2_Exm2C.png)

By inspection you can see that the model's predictions of fitness level are pretty good but not perfect. For example, the model correctly predicted the fitness level for the first two subjects in the example shown above (level 0) but incorrectly predicted the third individual (index 2) as having a fitness level of `2` (C) when in fact, the individual's actual fitness level was `3` (D). The Final accuracy value = `0.68` indicates that the model is pretty good, but not perfect.   
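
One way to see what an accuracy value means is to tabulate the hits and misses by class. The sketch below is a minimal example with made-up class indices (not the output of Example 2B) that computes the accuracy and a confusion matrix with `sklearn.metrics`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted class indices (0=A, 1=B, 2=C, 3=D)
actual    = np.array([0, 0, 1, 1, 2, 2, 3, 3, 0, 1])
predicted = np.array([0, 0, 1, 2, 2, 3, 3, 3, 1, 1])

# Fraction of individuals whose predicted class equals the actual class
accuracy = metrics.accuracy_score(actual, predicted)

# Rows are actual classes, columns are predicted classes
cm = metrics.confusion_matrix(actual, predicted, labels=[0, 1, 2, 3])

print(f"Accuracy: {accuracy}")
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions, and each off-diagonal entry shows which class was mistaken for which.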

### **Exercise 2A: Prepare feature vector**

In the cell below, write the code to prepare a feature vector for the Multiple Disease Prediction dataset for classification with stratified _K_-fold cross-validation. 

Start by reading the dataset from the course HTTPS server:
~~~text
# Read the data set
mdpBigDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/Blood_samples.csv",
    na_values=['NA','?'])
~~~

Use 60% of the data in `mdpBigDF` for your DataFrame, `mdpDF`:
~~~text
# Only use 60% for neural network
mdpDF=mdpBigDF.sample(frac=0.60)
~~~
Use the same mapping of the `Disease` column as you did in **Exercise 1A**. Remember, do **not** convert your data to Z-scores since the data is already standardized.

Since this model will be doing classification, you will need to handle the y-values differently. Use the following code chunk for generating your y-values:
~~~text
# Generate y-values as numpy array
dummies = pd.get_dummies(mdpDF['Disease']) # Classification
DiseaseClass = dummies.columns
mdpY= dummies.values
mdpY = np.asarray(mdpY).astype('float32')
~~~
Finally, print out the actual disease `classes` (i.e. `mdpY`) for the first 10 subjects.


In [13]:
# Insert your code for Exercise 2A here




If your code is correct you should see something similar to the following output:
~~~text
[[1. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
~~~

Since there are 5 disease categories:

* Anemia   =  0
* Diabetes =  1
* Healthy  =  2
* Thalasse =  3
* Thromboc =  4

there should be 5 dummy columns.

In the specific example shown above, the first individual has `Anemia` since there is a `1` in the first column of the first row, while the second individual is healthy since there is a `1` in the third column of the second row.

### **Exercise 2B: Classification with Stratified K-Fold Cross-Validation**

In the cell below, use the feature vector you created in **Exercise 2A** to predict disease categories in the Multiple Disease Prediction dataset using stratified _K_-fold cross-validation setting _K_=5. Train your model for 100 epochs. 

Follow the template in Example 2B for building your neural network.

Make sure to create a `checkpoint` using the following code chunk:
~~~text
    # Define the checkpoint callback to save the model with the best performance
    checkpoint = ModelCheckpoint(f'Model_2B_bestFold_{fold}.h5', 
                    monitor='val_loss', save_best_only=True, 
                    save_weights_only=True, mode='min', verbose=0)
~~~
and include this `checkpoint` when you run your model:
~~~text
    # Run model for this fold
    model.fit(x_train,y_train,validation_data=(x_test,y_test),
              verbose=0, callbacks=[checkpoint], epochs=EPOCHS)
~~~

You should notice that your model will generate five new files in your current directory with the names `Model_2B_bestFold_1.h5` to `Model_2B_bestFold_5.h5`. These are the saved models with the connection weights that generated the lowest `val_loss`.  

In [14]:
# Insert your code for Exercise 2B here



If your code is correct you should see something similar to the following output:
~~~text
Fold #1 starting...
9/9 [==============================] - 0s 1ms/step
Fold score (accuracy): 1.0
Fold #2 starting...
9/9 [==============================] - 0s 2ms/step
Fold score (accuracy): 1.0
Fold #3 starting...
9/9 [==============================] - 0s 2ms/step
Fold score (accuracy): 1.0
Fold #4 starting...
9/9 [==============================] - 0s 2ms/step
Fold score (accuracy): 1.0
Fold #5 starting...
9/9 [==============================] - 0s 2ms/step
Fold score (accuracy): 1.0
Final score (accuracy): 1.0 

Elapsed time = 0:01:29
~~~

In this particular run, the model was able to achieve **perfect** accuracy with a `Final score (accuracy)= 1.0`. We should expect the model's predictions of the `Disease` class to be perfect, or very close to it, when we compare the actual disease class to the predicted disease class in **Exercise 2C**.

### **Exercise 2C: Print out actual and predicted y-values**

In the cell below, print out the actual and predicted disease classes. Label your column headers  `Actual Disease Class` and  `Predicted Disease Class`.  

In [15]:
# Insert your code for Exercise 2C here



If your code is correct you should see something similar to the following table:

![__](https://biologicslab.co/BIO1173/images/class_05_2_Exe2C.png)

By inspection you can see that the model's predictions of the disease category are extremely good. In the example shown above there were no errors. 

## Training with both a Cross-Validation and a Holdout Set

Training with both Cross-Validation and a separate **_Holdout Set_** is useful because it offers a comprehensive approach to model evaluation and validation. Here are the key reasons why using both techniques together is beneficial:

* **Cross-Validation:** 
1. Provides a robust estimate of the model's performance by training and validating the model on multiple subsets of the data.
2. Helps in tuning hyperparameters and assessing model generalization by averaging performance across multiple validation sets.
3. Reduces the risk of overfitting by using multiple validation sets and ensuring the model's performance is not biased by a single validation set.

* **Holdout Set:**
1. Offers an additional level of validation by providing a completely unseen dataset for final model evaluation.
2. Mimics real-world scenarios where the model needs to perform well on new, unseen data.
3. Provides a final checkpoint to ensure that the model's performance on the Holdout Set aligns with the expected performance based on Cross-Validation results.

By combining Cross-Validation for model training, tuning, and initial validation, with a Holdout Set for final model evaluation, practitioners can ensure that their model is well-optimized, generalizes well to unseen data, and performs as expected in real-world applications. If you have a considerable amount of data, it is always valuable to set aside a holdout set before you cross-validate. This holdout set will be the final evaluation before using your model for its real-world use. Figure 5. HOLDOUT shows this division.

**Figure 5. HOLDOUT: Cross-Validation and a Holdout Set**
![Cross Validation and a Holdout Set](https://biologicslab.co/BIO1173/images/class_3_hold_train_val.png "Cross-Validation and a Holdout Set")
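
The division into a holdout set and cross-validation folds can be sketched with a small made-up dataset. In the example below each sample stores its own ID number so that we can check that no holdout sample ever appears in a training or validation fold. The `random_state=42` argument and the variable names are assumptions made here for reproducibility and illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Hypothetical data: 100 samples, each feature row stores its own ID
X = np.arange(100).reshape(-1, 1)
y = np.arange(100, dtype='float32')

# Set aside a 10% holdout before any cross-validation
X_main, X_holdout, y_main, y_holdout = train_test_split(
    X, y, test_size=0.10, random_state=42)

# Cross-validate on the remaining 90% only
kf = KFold(n_splits=5)
seen_ids = set()
for fold, (train, test) in enumerate(kf.split(X_main), start=1):
    seen_ids.update(X_main[train].ravel().tolist())
    seen_ids.update(X_main[test].ravel().tolist())
    print(f"Fold #{fold}: train={len(train)}, validate={len(test)}")

holdout_ids = set(X_holdout.ravel().tolist())
print(f"Holdout size: {len(holdout_ids)}")
print("Holdout overlaps with folds?", bool(seen_ids & holdout_ids))
```

Because the holdout samples are removed before `KFold` is applied, they cannot leak into any fold.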

  

### Example 3A: Prepare feature vector

The code in the cell below creates a feature vector from the Body Performance dataset for regression. In this example, the model will try to predict the `systolic` blood pressure of individuals. The code is very similar to the code used in Example 1A. Since we will be performing regression, the y-values will **not** be One-Hot encoded.

Since we need additional data for the holdout set, the fraction of `bpBigDF` used for the analysis will be increased from 15% to 20%.


In [16]:
# Example 3A: Prepare feature vector

# Read the data set
bpBigDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/bodyPerformance.csv",
    na_values=['NA','?'])

# Increase to 20% for holdout set
bpDF=bpBigDF.sample(frac=0.20)

# Map gender
mapping = {'M': 1, 'F': 0}
bpDF['gender'] = bpDF['gender'].map(mapping)

# Map class
mapping =  {'A': 0,
            'B': 1,
            'C': 2,
            'D': 3}
bpDF['class'] = bpDF['class'].map(mapping)

# Generate list of columns for x
bpX_columns = bpDF.columns.drop('systolic')  # 'systolic' is the y-value

# Standardize values with their Z-scores
for col in bpX_columns:
    bpDF[col] = zscore(bpDF[col])

# Generate x-values as numpy array
bpX = bpDF[bpX_columns].values
bpX = np.asarray(bpX).astype('float32')

# Generate y-values as numpy array
# Do NOT One-Hot encoding with Regression
bpY = bpDF['systolic'].values
bpY = np.asarray(bpY).astype('float32')

# Print y-values
print(bpY[0:10])

[ 99. 136. 140. 135. 128. 143. 146. 137. 132. 128.]


If your code is correct you should see something similar to the following output:
~~~text
[124. 129. 121. 137. 112. 147. 116. 127. 115. 116.]
~~~
These are the actual `systolic` blood pressure measurements (in mmHg) for the first 10 subjects in `bpY`.

### Example 3B: Training with both a Cross-Validation and a Holdout Set 

Now that the data has been preprocessed, we are ready to build and train the neural network. The code chunk that creates the additional holdout set is:
~~~text
# Keep a 10% holdout
bpX_main, bpX_holdout, bpY_main, bpY_holdout = train_test_split(    
    bpX, bpY, test_size=0.10) 
~~~
During the `for` loop, the datasets `bpX_main` and `bpY_main` are split again into test and training sets with this code chunk:
~~~text
    # Generate this fold's train and test datasets
    x_train = bpX_main[train]
    y_train = bpY_main[train]
    x_test = bpX_main[test]
    y_test = bpY_main[test]
~~~


In [17]:
# Example 3B: Training with both a Cross-Validation and a Holdout Set

from sklearn.model_selection import train_test_split
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore
from sklearn.model_selection import KFold

# Set variables
EPOCHS=100 # number of epochs for each loop
numK=5     # Set number of K-folds

# Record the start time in T_start
T_start = time.time()

# Keep a 10% holdout
bpX_main, bpX_holdout, bpY_main, bpY_holdout = train_test_split(    
    bpX, bpY, test_size=0.10) 

# Initial arrays for Out Of Samples (oos)
oos_y = []    # array to hold actual y-values
oos_pred = [] # array to hold predicted y-values

# Cross-validate
kf = KFold(numK)

fold = 0 # initialize fold count

# START LOOP HERE -----------------------------------#

# Run loop for each fold
for train, test in kf.split(bpX_main):        
    fold+=1
    print(f"Starting Fold #{fold}...")

    # Generate this fold's train and test datasets
    x_train = bpX_main[train]
    y_train = bpY_main[train]
    x_test = bpX_main[test]
    y_test = bpY_main[test]

    # Build new model for this fold
    model = Sequential()
    model.add(Dense(20, input_dim=bpX.shape[1], activation='relu'))
    model.add(Dense(5, activation='relu'))
    model.add(Dense(1))

    # Compile model for regression
    model.compile(loss='mean_squared_error', optimizer='adam')

    # Run model for this fold
    model.fit(x_train,y_train,validation_data=(x_test,y_test),
              verbose=0,epochs=EPOCHS)
    
    # Use model to predict y-values from x_test
    pred = model.predict(x_test)

    # Save actual y-values 
    oos_y.append(y_test)
    oos_pred.append(pred) 

    # Measure this fold's error (RMSE)
    score = np.sqrt(metrics.mean_squared_error(pred,y_test))
    print(f"Fold score (RMSE): {score}")


# Build the oos prediction list and calculate the error.
oos_y = np.concatenate(oos_y)
oos_pred = np.concatenate(oos_pred)
score = np.sqrt(metrics.mean_squared_error(oos_pred,oos_y))
print()
print(f"Cross-validated score (RMSE): {score}")    

# Combine the actual y-values and the predictions into one DataFrame
oos_y = pd.DataFrame(oos_y)
oos_pred = pd.DataFrame(oos_pred)
oosDF = pd.concat([oos_y,oos_pred],axis=1 )

# Predict the holdout set (using the neural network from the last fold)
# THIS IS THE FIRST TIME THE HOLDOUT DATA IS USED!
holdout_pred = model.predict(bpX_holdout)

score = np.sqrt(metrics.mean_squared_error(holdout_pred,bpY_holdout))
print(f"Holdout score (RMSE): {score}")    

# Record the end time in T_end
T_end = time.time()

# Print out elapsed time
elaspedTime(T_start,T_end)

Starting Fold #1...
Fold score (RMSE): 10.775425910949707
Starting Fold #2...
Fold score (RMSE): 10.332239151000977
Starting Fold #3...
Fold score (RMSE): 10.509686470031738
Starting Fold #4...
Fold score (RMSE): 10.549328804016113
Starting Fold #5...
Fold score (RMSE): 12.115436553955078

Cross-validated score (RMSE): 10.875533103942871
Holdout score (RMSE): 10.59890079498291
Elapsed time = 0:02:16



If your code is correct you should see something similar to the following output:
~~~text
Starting Fold #1...
16/16 [==============================] - 0s 1ms/step
Fold score (RMSE): 11.849387168884277
Starting Fold #2...
16/16 [==============================] - 0s 1ms/step
Fold score (RMSE): 11.288235664367676
Starting Fold #3...
16/16 [==============================] - 0s 2ms/step
Fold score (RMSE): 10.869893074035645
Starting Fold #4...
16/16 [==============================] - 0s 1ms/step
Fold score (RMSE): 10.248887062072754
Starting Fold #5...
16/16 [==============================] - 0s 1ms/step
Fold score (RMSE): 9.961965560913086

Cross-validated score (RMSE): 10.865667343139648
9/9 [==============================] - 0s 2ms/step
Holdout score (RMSE): 11.717198371887207
Elapsed time = 0:02:09
~~~

In the example shown above, the RMSE for the **_holdout_** data was `11.717`. It should be noted that the data "held back" in the holdout set was **never** seen during training by any of the 5 separate neural network models (one model for each fold). Only after training was complete was the holdout data used, and in this code the holdout predictions come from the model trained in the last fold. This mimics real-world scenarios where the model needs to perform well on new, unseen data.

### Example 3C: Print out actual and predicted y-values

The code in the cell below prints out the actual and predicted systolic blood pressure of the out-of-sample subjects from the cross-validation folds, which are stored in `oosDF`. 


In [18]:
# Example 3C: Print out actual and predicted y-values

# Rename columns
new_column_mapping = {0: 'Actual Systolic Pressure (mmHg)', 1: 'Predicted Systolic Pressure (mmHg)'}
oosDF = rename_col_by_index(oosDF, new_column_mapping)

# Set display options
pd.set_option('display.max_rows', 8)
pd.set_option('display.max_columns', 8)

# Display DataFrame
display(oosDF)

| | Actual Systolic Pressure (mmHg) | Predicted Systolic Pressure (mmHg) |
|---|---|---|
| 0 | 111.0 | 122.738640 |
| 1 | 137.0 | 134.768951 |
| 2 | 123.0 | 129.864639 |
| 3 | 136.0 | 134.051117 |
| ... | ... | ... |
| 2407 | 134.0 | 107.484818 |
| 2408 | 120.0 | 120.928993 |
| 2409 | 147.0 | 117.389870 |
| 2410 | 133.0 | 135.984360 |


If your code is correct you should see something similar to the following table:

![__](https://biologicslab.co/BIO1173/images/class_05_2_Exm3C.png)

By inspection you can see that the model's predictions for systolic blood pressure are not as accurate as might be desired. The RMSE for the **_holdout_** data in this particular run was `11.717` mmHg, or roughly 10% of a typical systolic pressure in the table. So while the model did reasonably well for the first subject in the reference table above (index `0`), the prediction for the last subject (index `2410`) was not very close.
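RMSE summarizes per-subject errors like these in a single number. As a rough illustration, the sketch below recomputes it by hand for just the eight rows displayed in the Example 3C cell output. Eight rows are too few to represent the whole holdout set, so the result should not be expected to match the holdout score reported for the full run:

```python
import math

# The eight rows visible in the Example 3C cell output (mmHg)
actual = [111.0, 137.0, 123.0, 136.0, 134.0, 120.0, 147.0, 133.0]
predicted = [122.738640, 134.768951, 129.864639, 134.051117,
             107.484818, 120.928993, 117.389870, 135.984360]

# Signed error for each subject: positive means the model over-predicted
errors = [p - a for a, p in zip(actual, predicted)]

# RMSE: square the errors, average them, then take the square root
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))

for a, p, e in zip(actual, predicted, errors):
    print(f"actual={a:6.1f}  predicted={p:8.3f}  error={e:+8.3f}")
print(f"RMSE over these 8 rows: {rmse:.3f} mmHg")
```

Because the errors are squared before averaging, the two large misses (subjects `2407` and `2409`) dominate the result, which is why RMSE is sensitive to a few poor predictions.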

### **Exercise 3A: Prepare feature vector**

In the cell below, write the code to prepare a feature vector for the Multiple Disease Prediction dataset for regression with stratified K-fold cross-validation and a holdout set. Use 80% of the data for your DataFrame, `mdpDF`, to compensate for the extra holdout data. 
~~~text
# Use 80% for neural network
mdpDF=mdpBigDF.sample(frac=0.80)
~~~

Use the same mapping of the `Disease` column as you did in Exercise 1A. The goal of your neural network will be to predict the body mass index (`BMI`), so you will need to drop that column when making your list, `mdpX_columns`, that holds the names of the columns to be included when generating your x-values. 

As above, do _not_ standardize the x data to their z-scores.

Since this model will be doing regression, use `mdpDF['BMI'].values` when creating the y-values, as shown in the following code chunk:
~~~text
# Generate y-values as numpy array
mdpY = mdpDF['BMI'].values
mdpY = np.asarray(mdpY).astype('float32')
~~~

Finally, print out the y-values for the first 10 subjects using the following code chunk:
~~~text
# Print y-values
print(mdpY[0:10])
~~~


In [19]:
# Insert your code for Exercise 3A here



If your code is correct you should see something similar to the following output:
~~~text
[0.142  0.0718 0.8393 0.497  0.019  0.6797 0.4039 0.5901 0.1415 0.651 ]
~~~
These are the standardized `BMI` values (min-max scaled to the range 0 to 1) for the first 10 subjects in `mdpY`. If we know the minimum and maximum of the actual `BMI` measurements, _before_ they were standardized, we can reverse the process to get the actual, unstandardized `BMI` value. The code in the cell below shows how to "reverse" the standardization of the `BMI` value `0.1414939`, which appears rounded to `0.1415` in the output shown above. 

In [20]:
# Convert standardized number back into their original non-standardized value

# create a simple function
def convert_from_0_to_1(value, min_val, max_val):
    return value * (max_val - min_val) + min_val

# The min, max BMI values were: 18.5 and 24.9 
min_val = 18.5  # Minimum original value before standardization
max_val = 24.9  # Maximum original value before standardization

# Assign the standardized value to reverse
stand_value = 0.1414939  # y-value from the output above (printed as 0.1415)

# Call function and print out value
original_value = convert_from_0_to_1(stand_value, min_val, max_val)
print(f"Standardize BMI Value: {stand_value} is equal to the original BMI value: {original_value}")


Standardize BMI Value: 0.1414939 is equal to the original BMI value: 19.40556096


If your code is correct, you should see the following output:
~~~text
Standardize BMI Value: 0.1414939 is equal to the original BMI value: 19.40556096
~~~
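The same conversion can be applied to every value at once. The sketch below reuses the `convert_from_0_to_1` formula and adds a hypothetical forward function, `scale_to_0_1`, to confirm that the two are inverses of each other. The scaled values are copied from the expected output of Exercise 3A, so your own values will differ:

```python
def scale_to_0_1(value, min_val, max_val):
    """Min-max scale an original value into the range 0 to 1 (hypothetical helper)."""
    return (value - min_val) / (max_val - min_val)

def convert_from_0_to_1(value, min_val, max_val):
    """Invert min-max scaling (same formula as the cell above)."""
    return value * (max_val - min_val) + min_val

# BMI range quoted in the cell above
min_val, max_val = 18.5, 24.9

# Scaled y-values copied from the expected output of Exercise 3A
scaled_bmi = [0.142, 0.0718, 0.8393, 0.497, 0.019,
              0.6797, 0.4039, 0.5901, 0.1415, 0.651]

# Convert every scaled value back to BMI units, then forward again
original_bmi = [convert_from_0_to_1(v, min_val, max_val) for v in scaled_bmi]
round_trip = [scale_to_0_1(v, min_val, max_val) for v in original_bmi]

for s, o in zip(scaled_bmi, original_bmi):
    print(f"scaled={s:.4f} -> BMI={o:.2f}")
```

Every recovered value falls between `18.5` and `24.9`, as it must when the scaled inputs lie between 0 and 1.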

### **Exercise 3B: Training with both a Cross-Validation and a Holdout Set**

In the cell below, write the code to create and train a regression neural network with both a Cross-Validation and a Holdout Set using the feature vector you created in **Exercise 3A**. Set the number of _K_-folds to `5` and the number of epochs to `100`. 

In [21]:
# Insert your code for Exercise 3B here



If your code is correct, you should see something similar to the following output:
~~~text
Starting Fold #1...
11/11 [==============================] - 0s 1ms/step
Fold score (RMSE): 2.8767874027835205e-05
Starting Fold #2...
11/11 [==============================] - 0s 1ms/step
Fold score (RMSE): 6.265888259804342e-06
Starting Fold #3...
11/11 [==============================] - 0s 2ms/step
Fold score (RMSE): 0.00013639968528877944
Starting Fold #4...
11/11 [==============================] - 0s 2ms/step
Fold score (RMSE): 1.5171485756582115e-06
Starting Fold #5...
11/11 [==============================] - 0s 1ms/step
Fold score (RMSE): 0.0023843529634177685

Cross-validated score (RMSE): 0.0010675085941329598
6/6 [==============================] - 0s 2ms/step
Holdout score (RMSE): 0.0026710291858762503
Elapsed time = 0:01:38
~~~

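The `Holdout score (RMSE): 0.0026710291858762503` obtained in this particular run is very low. Because the `BMI` values are scaled to the range 0 to 1, an RMSE of about `0.0027` means the typical prediction error is less than 0.3% of the full `BMI` range, so the model is very accurate on the holdout data.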
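An RMSE computed on scaled values can be expressed in original `BMI` units by multiplying it by the range of the data. The sketch below assumes the minimum and maximum `BMI` values quoted earlier in this lesson (`18.5` and `24.9`) and uses the holdout score from the sample output, so your own number will differ:

```python
min_val, max_val = 18.5, 24.9                # BMI range quoted earlier in this lesson
holdout_rmse_scaled = 0.0026710291858762503  # holdout score from the sample output

# y_original = y_scaled * (max_val - min_val) + min_val, so the additive
# min_val cancels in (prediction - actual) and every error scales by the range
holdout_rmse_bmi = holdout_rmse_scaled * (max_val - min_val)

print(f"Holdout RMSE: {holdout_rmse_scaled:.4%} of the BMI range")
print(f"Holdout RMSE: {holdout_rmse_bmi:.4f} BMI units")
```

Only the range matters here, not the minimum, because RMSE is built from differences between predicted and actual values.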

### **Exercise 3C: Print out actual and predicted y-values**

In the cell below, print out the actual and predicted y-values from the holdout set. 


In [22]:
# Insert your code for Exercise 3C here



If your code is correct, you should see something similar to the following table:

![__](https://biologicslab.co/BIO1173/images/class_05_2_Exe3C.png)

## **Lesson Turn-in**

When you have completed all of the code cells and run them in sequential order (the last code cell should be number 22), use **File --> Print.. --> Save to PDF** to generate a PDF of your JupyterLab notebook. Save your PDF as `Class_05_2.lastname.pdf`, where _lastname_ is your last name, and upload the file to Canvas.