<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/master/Class_04_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# BIO 1173: Intro Computational Biology

##### **Module 4: Training for Tabular Data**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)

### Module 4 Material

* Part 4.1: Encoding a Feature Vector for Keras Deep Learning
* Part 4.2: Keras Multiclass Classification for Deep Neural Networks with ROC and AUC
* **Part 4.3: Keras Regression for Deep Neural Networks with RMSE**
* Part 4.4: Backpropagation, Nesterov Momentum, and ADAM Neural Network Training
* Part 4.5: Neural Network RMSE and Log Loss Error Calculation from Scratch


### Lesson Setup

Run the next code cell to load necessary packages

In [None]:
!pip install statsmodels

In [None]:
# You MUST run this code cell first

# Regression neural network
import tensorflow.keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn import metrics

%matplotlib inline
import matplotlib.pyplot as plt
from scipy.stats import zscore
import numpy as np
import pandas as pd

import os
import shutil
path = '/'
memory = shutil.disk_usage(path)
dirpath = os.getcwd()
print("Your current working directory is : " + dirpath)
print("Disk", memory)

# Part 4.3: Keras Regression for Deep Neural Networks with RMSE

In Keras, regression models can be built using deep neural networks to predict continuous values. The RMSE (Root Mean Squared Error) is a common metric used to evaluate the performance of regression models. It measures the average magnitude of the errors between predicted and actual values. To train a regression model using Keras, you would typically define a deep neural network architecture with dense layers, activation functions, and optimizer. The target variable would be continuous and the loss function used would be mean squared error (MSE).

During training, the model would minimize the MSE loss function by adjusting the weights and biases of the neural network using backpropagation. The model's performance can be evaluated using RMSE on a separate test dataset, where a lower RMSE indicates a better performing model. Overall, Keras Regression for Deep Neural Networks with RMSE involves building a neural network for regression tasks, training it using MSE loss, and evaluating its performance using RMSE.

We evaluate regression results differently than classification results. The Examples below train a neural network for regression on the Body Fat data set, `bodyfat.csv`. We begin by preparing the data set.

## Body Fat Dataset for the Examples

In this lesson we will be using the [Body Fat Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/body-fat-prediction-dataset) for the Examples. 

Volume, and hence body **_density_**, can be accurately measured in a variety of ways. The technique of [underwater weighing](https://en.wikipedia.org/wiki/Hydrostatic_weighing) computes body volume as the difference between body weight measured in air and weight measured during water submersion. In other words, body volume is equal to the loss of weight in water, with the appropriate temperature correction for the water's density.

![___](https://biologicslab.co/BIO1173/images/underwater-weigh.jpg)
**Image of a woman having her body density measured by Hydrostatic Weighing**

Using this technique,

$\text{Body Density} = \frac{W_{A}}{\frac{W_{A} - W_{W}}{c.f.} - LV}$

where:

* $W_{A}$ = Weight in air (kg)
* $W_{W}$ = Weight in water (kg)
* $c.f.$ = Water correction factor (= 0.997 at 76-78 deg F)
* $LV$ = Residual Lung Volume (liters)
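As a quick numerical check of the formula, the sketch below plugs in made-up values. The function name `body_density` and the sample measurements are hypothetical and are not taken from the dataset.

```python
def body_density(weight_air_kg, weight_water_kg, lung_volume_l, cf=0.997):
    """Body density from hydrostatic weighing.

    cf is the water correction factor (0.997 at 76-78 deg F).
    """
    # Body volume = weight lost in water (corrected for water density)
    # minus the air still trapped in the lungs
    body_volume = (weight_air_kg - weight_water_kg) / cf - lung_volume_l
    return weight_air_kg / body_volume

# Hypothetical subject (illustrative numbers only)
density = body_density(weight_air_kg=70.0, weight_water_kg=3.0, lung_volume_l=1.2)
print(f"Body density: {density:.4f}")
```

With these illustrative inputs the result is slightly above 1, which is the range you should expect for human body density.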

Determining a person's body density by water submersion is at best inconvenient. Wearing a swimsuit, you are completely immersed in a tank of water and asked to expel as much air from your lungs as possible. A measurement is taken, and the displacement of water is measured to determine body density. Typically, 4 to 5 trials are repeated. Testing usually takes about 10-15 minutes.

In the Examples below we will see if we can construct and train a deep neural network that can accurately predict body density using Keras regression and a set of clinical measurements.
  
The factors (X-variables) for training your neural network are:
* **Percent body fat:** from Siri's (1956) equation
* **Age:** (years)
* **Weight:** (lbs)
* **Height:** (inches)
* **Neck circumference:** (cm)
* **Chest circumference:** (cm)
* **Abdomen 2 circumference:** (cm)
* **Hip circumference:** (cm)
* **Thigh circumference:** (cm)
* **Knee circumference:** (cm)
* **Ankle circumference:** (cm)
* **Biceps (extended) circumference:** (cm)
* **Forearm circumference:** (cm)
* **Wrist circumference:** (cm)

The response variable (Y) that your neural network will try to predict is:
* **Density:** determined from underwater weighing

## Medical Costs Dataset for the **Exercises**

In this lesson we will be using the [Medical Costs Personal Datasets](https://www.kaggle.com/datasets/mirichoi0218/insurance) for the **Exercises**. 

Understanding the factors involved in personal medical costs in the US is important for several reasons:

* **Financial burden:** Medical costs can be a significant financial burden for individuals and families, impacting their ability to afford necessary healthcare services. Understanding the factors influencing these costs can help individuals make informed decisions about their healthcare spending and budgeting.
* **Access to healthcare:** High medical costs can create barriers to accessing healthcare services, particularly for individuals with limited financial resources. By understanding the factors contributing to medical costs, policymakers and healthcare providers can work towards improving affordability and access to care.
* **Health outcomes:** The cost of healthcare can influence individuals' decisions to seek treatment or adhere to medical recommendations. Understanding factors influencing medical costs can help identify disparities in access to care and develop interventions to improve health outcomes.
* **Policy implications:** Knowledge of the factors shaping personal medical costs can inform healthcare policies and regulations aimed at controlling costs, improving quality of care, and expanding access to healthcare services. This understanding is crucial for policymakers seeking to address healthcare affordability and sustainability in the US.

The Medical Costs dataset contains statistical information about the insurance medical bill (`charges`) for individuals (_n_=1338) living in 4 different areas of the US. 

The dataset includes some of the factors (X-variables) that contribute to medical costs:
* **age:** age of primary beneficiary
* **sex:** insurance contractor gender (female, male)
* **bmi:** body mass index, an objective index of body weight relative to height (kg / $m^2$), computed as weight divided by height squared; ideally 18.5 to 24.9
* **children:** Number of children covered by health insurance / Number of dependents
* **smoker:** whether the beneficiary smokes (yes, no)
* **region:** the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.

The response variable (Y) that your neural network will try to predict will be:
* **charges:** Individual medical costs billed by health insurance

### Example 1: Read the datafile, create DataFrame and display

The code in the cell below reads the Body Fat datafile, `bodyfat.csv` from the course HTTPS server and creates a DataFrame called `bfDF`. The display options are set for 6 rows and 8 columns.

In [None]:
# Example 1: Read data, create DataFrame and display

# Read the data set
bfDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/bodyfat.csv",
    na_values=['NA','?'])

# Set display options
pd.set_option('display.max_rows', 6)
pd.set_option('display.max_columns', 8)

# Display DataFrame
display(bfDF)

If your code is correct you should see the following table:

![___](https://biologicslab.co/BIO1173/images/class_04_3_Exe1.png)



### **Exercise 1: Read the datafile, create DataFrame and display**

In the cell below, read the Medical Costs datafile `medical_costs.csv` from the course HTTPS server and create a DataFrame called `mcDF`. Set your display options for 6 rows and 7 columns.

In [None]:
# Insert your code for Exercise 1 here



If your code is correct you should see the following table:

![___](https://biologicslab.co/BIO1173/images/class_04_3_Exm1.png)



### Example 2: Determine Preprocessing Steps

In almost every instance, some degree of preprocessing must be done before data can be used for training a neural network. The cell below uses the Pandas attribute `DataFrame.dtypes` (note: an attribute, so no parentheses) to print out a list of the data types of the columns in a DataFrame. Example 2 prints out the data types in `bfDF`. What we are looking for are the columns with categorical (non-numeric) values.

Depending on the number of columns in the DataFrame, you may have to adjust the number of rows to display. Since the DataFrame `bfDF` has 15 columns, the number of rows to display has to be set to 15, as shown below.

In [None]:
# Example 2: Print data types

# Set number of rows to display to the number of columns in the DataFrame
pd.set_option('display.max_rows', 15)

# Print data types
print(bfDF.dtypes)

If your code is correct you should see the following output:

~~~text
Density    float64
BodyFat    float64
Age          int64
Weight     float64
Height     float64
Neck       float64
Chest      float64
Abdomen    float64
Hip        float64
Thigh      float64
Knee       float64
Ankle      float64
Biceps     float64
Forearm    float64
Wrist      float64
dtype: object
~~~

Columns with data types that are either `int64` or `float64` are numeric while columns that are `object` are categorical (string) values which must be converted into numerical values during data preprocessing. At least for the Body Fat dataset, we don't have to worry about categorical values.
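Instead of scanning the `dtypes` listing by eye, the non-numeric columns can also be listed programmatically. The sketch below uses a small made-up DataFrame (`toyDF` is hypothetical, not the course data) and the Pandas method `select_dtypes()`.

```python
import pandas as pd

# Small made-up DataFrame (not the course data)
toyDF = pd.DataFrame({
    "age": [23, 45, 31],
    "bmi": [22.1, 30.4, 27.8],
    "smoker": ["no", "yes", "no"],
})

# Columns that are not numeric will need encoding before training
categorical_columns = toyDF.select_dtypes(exclude="number").columns.tolist()
print(categorical_columns)
```

Any column name that appears in this list would need mapping or One-Hot Encoding before it can be fed to a neural network.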

### **Exercise 2: Determine Preprocessing Steps**

In the cell below use the Pandas attribute `DataFrame.dtypes` to print out a list of the data types in the DataFrame `mcDF`. Since this DataFrame has 7 columns, set the number of rows to display to 7.

In [None]:
# Insert your code for Exercise 2 here



If your code is correct you should see the following output:

~~~text
age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object
~~~

Again, columns with data types that are either `int64` or `float64` are numeric while columns that are `object` are categorical (string) values. In the Medical Costs data set there are 3 columns with categorical values, `sex`, `smoker` and `region`. You will have to convert these strings into numerical values during data preprocessing in **Exercise 3**.

### Example 3: Preprocess data, generate X and Y, split data

From the results of Example 2, we know that the DataFrame `bfDF` only has numeric values. There are also no missing values to worry about. However, with numeric data it is generally a good idea to standardize the values. Standardizing numeric values helps ensure that features are on a similar scale, which can lead to faster convergence during training. Neural networks often perform better when input features are standardized as it makes it easier for the optimizer to find the optimal weights and biases.
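To make the idea of a Z-score concrete, here is a minimal sketch that standardizes a few made-up weights by hand with NumPy. The values are hypothetical; the calculation (subtract the mean, divide by the population standard deviation) is what `scipy.stats.zscore` computes with its default settings.

```python
import numpy as np

# Hypothetical weights in lbs (illustrative only)
weights = np.array([150.0, 175.0, 200.0])

# Z-score: distance from the mean, measured in standard deviations
z = (weights - weights.mean()) / weights.std()
print(z)

# After standardizing, the values have mean 0 and standard deviation 1
print(z.mean(), z.std())
```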

The code in the cell below standardizes most of the feature columns in the Body Fat dataset by converting their values into Z-scores. The target column, `Density`, is deliberately left unchanged. Note that the code as written also leaves `Height` and `Chest` in their original units, which is visible in the printed output.

Once the data has been standardized, the independent (X) values are generated in a 2-step process. First, a list of the columns that are to be included, called `bfX_columns`, is created using the Pandas command:

> `bfX_columns = bfDF.columns.drop('Density')`

At this point we can drop one (or more) columns that we **don't** want to be included in the X-values. Since the column `Density` is going to be our Y-values, we drop this column.

The second step in creating the X values is to use the column list, `bfX_columns` as part of the following command:

> `bfX = bfDF[bfX_columns].values`

This command creates the variable `bfX` by using the Pandas attribute `.values`. The variable `bfX` is **not** a DataFrame, but a NumPy array containing the X-values. The next line of code makes sure all values in `bfX` are type `float32`, which Keras needs to work correctly.

> `bfX = np.asarray(bfX).astype('float32')`

The next step is to generate the Y-values for the regression directly from the target column, `Density`. It is very important **not** to One-Hot Encode the Y-values in a regression analysis! We just want to use the numerical values as they are.

The last preprocessing step is to split the X-values in `bfX` and the Y-values in `bfY` into training and test (validation) data sets. The parameter, `test_size=0.25` specifies that we want about 25% of the X and Y data going into the test data sets `bfX_test` and `bfY_test`, respectively, and the rest of the data going into the training data sets `bfX_train` and `bfY_train`. 
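The sketch below illustrates the idea behind a 75/25 split using only NumPy. The helper `simple_split` is hypothetical and is not scikit-learn's implementation; it only shows that the rows are shuffled and that X and Y are cut with the same row indices so each subject's measurements stay paired with that subject's target value.

```python
import numpy as np

def simple_split(X, y, test_size=0.25, seed=42):
    """Hypothetical helper: shuffle row indices, then cut off a test block."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(np.ceil(test_size * len(X)))  # this helper rounds up
    test_idx, train_idx = order[:n_test], order[n_test:]
    # The same indices are applied to X and y so rows stay aligned
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(16, dtype="float32").reshape(8, 2)  # 8 made-up subjects, 2 features
y = np.arange(8, dtype="float32")                 # made-up target values
X_train, X_test, y_train, y_test = simple_split(X, y)
print(X_train.shape, X_test.shape)
```

With 8 rows and `test_size=0.25`, 2 rows go to the test set and 6 to the training set.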

As a final check, the X data for the first 4 subjects in the test or validation set, `bfX_test`, is printed out.  

In [None]:
# Example 3: Preprocess data, generate X and Y, split data


# Standardize ranges to z-scores
bfDF['BodyFat'] = zscore(bfDF['BodyFat'])
bfDF['Age'] = zscore(bfDF['Age'])
bfDF['Weight'] = zscore(bfDF['Weight'])
bfDF['Neck'] = zscore(bfDF['Neck'])
bfDF['Abdomen'] = zscore(bfDF['Abdomen'])
bfDF['Hip'] = zscore(bfDF['Hip'])
bfDF['Thigh'] = zscore(bfDF['Thigh'])
bfDF['Knee'] = zscore(bfDF['Knee'])
bfDF['Ankle'] = zscore(bfDF['Ankle'])
bfDF['Biceps'] = zscore(bfDF['Biceps'])
bfDF['Forearm'] = zscore(bfDF['Forearm'])
bfDF['Wrist'] = zscore(bfDF['Wrist'])


# Generate X
bfX_columns = bfDF.columns.drop('Density')
bfX = bfDF[bfX_columns].values
bfX = np.asarray(bfX).astype('float32')

# Generate Y
bfY = bfDF['Density'].values
bfY = np.asarray(bfY).astype('float32')


# Create train/test
bfX_train, bfX_test, bfY_train, bfY_test = train_test_split(    
    bfX, bfY, test_size=0.25, random_state=42)

# Print out bfX_test
print(bfX_test[0:4])

If your code is correct you should see the following output:

~~~text
[[ 5.89148048e-03 -7.85951495e-01  1.29814422e+00  7.37500000e+01
   1.03373802e+00  1.07500000e+02  2.36399174e-01  6.42705977e-01
   1.02949166e+00  1.12567818e+00  1.47654676e+00  1.36856163e+00
   2.49723125e+00  1.25598311e+00]
 [ 5.89148048e-03 -1.50154281e+00  7.07650632e-02  6.97500000e+01
  -6.56227350e-01  1.05099998e+02 -1.72459662e-01  5.52793778e-02
  -1.91993043e-01 -1.20679036e-01 -1.19643927e-01 -1.23840421e-01
  -4.28372264e-01 -5.68578303e-01]
 [ 1.05951631e+00 -1.49870321e-01  1.47476256e-01  7.00000000e+01
  -3.67696702e-01  1.08000000e+02  1.15633154e+00  4.32910770e-01
   8.19549024e-01  5.85590065e-01  2.94183314e-01  4.06791419e-01
  -4.28372264e-01 -8.90559793e-01]
 [ 1.61540598e-01 -7.85951495e-01 -5.70869297e-02  7.10000000e+01
   1.68145999e-01  1.00500000e+02 -2.09628657e-01 -1.68502197e-01
  -3.06507230e-01 -5.36131442e-01 -4.15234804e-01 -4.22320843e-01
   1.79062374e-02 -5.68578303e-01]]
~~~

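This output shows the X-values for 4 subjects in the test set. For each subject there are 14 floating-point numbers representing, in order, their: `BodyFat`, `Age`, `Weight`, `Height`, `Neck`, `Chest`, `Abdomen`, `Hip`, `Thigh`, `Knee`, `Ankle`, `Biceps`, `Forearm` and `Wrist`. Most of these are Z-scores; the 4th and 6th values (`Height` and `Chest`) are still in their original units because the code above does not standardize those two columns.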

### **Exercise 3: Preprocess data, generate X and Y, split data**

From the results of **Exercise 2**, we know that the DataFrame `mcDF` has three columns with categorical values: `sex`, `smoker` and `region`. The columns `sex` and `smoker` are **_binary_**. In other words, these columns only contain two values. The column `sex` contains the strings `male` and `female`, while the column `smoker` contains the strings `yes` and `no`. As a general rule, _mapping_ is the most efficient way to handle binary categories. 

In the cell below, map the string `male` to a value of `1` and the string `female` to a value of `0` using the following lines of code:

~~~text
# Map sex
mapping = {'male': 1, 'female': 0}
mcDF['sex'] = mcDF['sex'].map(mapping)
~~~

Use the same approach to map the strings `yes` to `1` and `no` to `0` for the column `smoker`.

The column `region` contains 4 strings: `northeast`, `northwest`, `southeast` and `southwest`. In this instance, you should use One-Hot Encoding, instead of mapping, to preprocess the strings in `region` using the Pandas function `pd.get_dummies()` as shown in the code example below.

~~~text
# Generate dummies for region
mcDF = pd.concat([mcDF,pd.get_dummies(mcDF['region'],prefix="region")],axis=1)
mcDF.drop('region', axis=1, inplace=True)
~~~
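To see both techniques side by side, here is a minimal sketch on a made-up DataFrame. The column names `insured` and `habitat` are hypothetical and chosen so they do not overlap with the exercise data; `dtype=int` is used so the dummy columns hold 0/1 instead of True/False.

```python
import pandas as pd

# Made-up data: one binary category and one multi-valued category
toyDF = pd.DataFrame({
    "insured": ["yes", "no", "yes"],
    "habitat": ["forest", "desert", "wetland"],
})

# Binary category: map the two strings to 1 and 0
toyDF["insured"] = toyDF["insured"].map({"yes": 1, "no": 0})

# Multi-valued category: one new 0/1 column per distinct value
dummies = pd.get_dummies(toyDF["habitat"], prefix="habitat", dtype=int)
toyDF = pd.concat([toyDF.drop(columns="habitat"), dummies], axis=1)
print(toyDF)
```

Each row ends up with exactly one `1` among the `habitat_*` columns, which is why this is called One-Hot Encoding.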

Two of the numeric columns, `age` and `bmi`, need to be standardized to their Z-scores.

Once the data has been standardized, generate the independent (X) values, `mcX`, by first creating a list of columns to be included, called `mcX_columns`, making sure to drop the target column, `charges`.

Then generate the Y-values for the regression directly from the target column, `charges`. Again, it is very important **not** to One-Hot Encode the Y-values in a regression analysis! We just want to use the numerical values as they are.

Finally, split the X-values in `mcX` and the Y-values in `mcY` into training and test (validation) data sets setting the parameter,`test_size=0.25`. The test data sets should be called `mcX_test` and `mcY_test`, and the training data sets called `mcX_train` and `mcY_train`. 

As a final check, print out the X data for the first 4 subjects in the test or validation set, `mcX_test`.  

In [None]:
# Insert your code for Exercise 3 here



If your code is correct you should see something similar to the following output:

~~~text
[[ 0.41246668  0.         -0.9003412   2.          0.          1.
   0.          0.          0.        ]
 [-0.22834402  0.         -0.10554571  0.          0.          0.
   1.          0.          0.        ]
 [ 1.7652893   0.         -0.6198251   0.          1.          0.
   1.          0.          0.        ]
 [ 0.48366788  1.         -0.80683583  3.          0.          0.
   1.          0.          0.        ]]
~~~

These are the values for: `age`, `sex`, `bmi`, `children`, `smoker` and the last 4 values are the dummy columns for `region`. 

You can tell the gender of these subjects by the 2nd value in each array. In this case, the first three subjects have `0` as the second value, so they are female, while the 4th subject has `1` making him a male.  

### Example 4: Construct, compile and fit neural network

The code in the cell below constructs a sequential regression neural network (a plain stack of layers) called `bfModel` with 3 hidden layers: 50 neurons in the 1st layer, 25 neurons in the 2nd layer and 10 neurons in the 3rd layer. Since this is a regression neural network, there is only a single neuron in the output layer. The "voltage" in this neuron at the end of a run represents the neural network's prediction of an individual's body `Density`.

Since the objective of the model `bfModel` is regression, we compile the model using the 'mean_squared_error' as the loss function along with `adam` as the optimizer.
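To get a feel for the size of this network, the sketch below counts its trainable parameters by hand. It assumes 14 input features (the number of X columns in the Body Fat data) and fully connected layers with one bias per neuron; the helper `dense_params` is hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected (Dense) layer."""
    return n_inputs * n_units + n_units

# 14 inputs, three hidden layers (50, 25, 10) and 1 output neuron
layer_sizes = [14, 50, 25, 10, 1]

# Pair each layer with the next one and count its parameters
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(counts)
print(counts, total_params)
```

These are the numbers the optimizer adjusts during fitting; most of them sit in the first two hidden layers.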

An EarlyStopping monitor called `bfMonitor` is created to stop the fitting process if the validation loss (`val_loss`) does not improve, that is, decrease by at least `min_delta`, for 50 consecutive epochs.

Finally, the model is run ("fitted") for 1000 epochs using the training and test data created in Example 3. The verbose setting (output) is set to 0. 

**WARNING---WARNING--WARNING--WARNING**

The variable `verbose` is set to `0`.  
You will **NOT** see anything happening when you run the next cell. 

> BE PATIENT. 

Depending upon your computer's speed, it will take some time to complete. Just relax. 

![___](https://biologicslab.co/BIO1173/images/KeepCalm.png)

In [None]:
# Example 4: Construct, compile and fit neural network

# Construct
bfModel = Sequential()
bfModel.add(Dense(50, input_dim=bfX.shape[1], activation='relu')) # Hidden 1
bfModel.add(Dense(25, activation='relu')) # Hidden 2
bfModel.add(Dense(10, activation='relu')) # Hidden 3
bfModel.add(Dense(1)) # Output

# Compile
bfModel.compile(loss='mean_squared_error', optimizer='adam')

# Create Monitor
bfMonitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, 
                        patience=50, verbose=1, mode='auto', 
                        restore_best_weights=True)

# Fit
bfModel.fit(bfX_train,bfY_train,validation_data=(bfX_test,bfY_test),
          callbacks=[bfMonitor],verbose=0,epochs=1000)


Assuming everything went smoothly, you should see something similar to the output below:

~~~text
Restoring model weights from the end of the best epoch: 121.
Epoch 171: early stopping

<keras.callbacks.History at 0x242baf29880>
~~~

In this particular example, the lowest validation loss occurred at epoch 121. The model ran another 50 epochs before the monitor `bfMonitor` terminated the fitting and restored the connection weights between all of the neurons to the values they had at the end of epoch 121.
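The relationship between the best epoch and the stopping epoch can be sketched in plain Python. The function `simulate_early_stopping` below is a simplified, hypothetical illustration of the patience logic, not the Keras implementation, and the loss values are made up.

```python
def simulate_early_stopping(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch); stop_epoch is None if never triggered."""
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement: remember this epoch and reset the counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            # No improvement: count how long we have been waiting
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None

# Made-up validation losses: best at epoch 3, then no improvement for 3 epochs
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.6]
result = simulate_early_stopping(losses, patience=3)
print(result)
```

In this sketch training stops exactly `patience` epochs after the best epoch, the same pattern as the "best epoch 121, early stopping at epoch 171" message above.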

### **Exercise 4: Construct, compile and fit neural network**

In the cell below construct a sequential regression neural network called `mcModel` with 3 hidden layers: 50 neurons in the 1st layer, 25 neurons in the 2nd layer and 10 neurons in the 3rd layer. Make sure that there is only a single neuron in the output layer.

Compile the model using the 'mean_squared_error' as the loss function along with `adam` as the optimizer.

Create an EarlyStopping monitor called `mcMonitor` to stop the fitting process if the validation loss (`val_loss`) does not improve, that is, decrease by at least `min_delta`, for 50 consecutive epochs.

Finally, fit your model for 1000 epochs using the training and test data created in **Exercise 3**. 

You may set the verbose argument to either `0` (no output) or `2` (output after each epoch). In either case, your model `mcModel` will take significantly **LONGER** to run than the previous model. If you select `verbose=2` be prepared for several "pages" of screen output.

In [None]:
# Insert your code for Exercise 4 here



Here is the output from setting `verbose=0` and waiting several minutes for the fitting to terminate at epoch 695, with the best weights restored from epoch 645.

~~~text
Restoring model weights from the end of the best epoch: 645.
Epoch 695: early stopping

<keras.callbacks.History at 0x242cd496f70>
~~~

## Mean Square Error

Using **_Mean Squared Error (MSE)_** for regression neural networks is common for several reasons:

* **Differentiable and continuous:** MSE is a differentiable and continuous loss function, making it suitable for optimization algorithms like gradient descent. This allows the neural network to update its parameters smoothly during training to minimize the error.
* **Mathematically well-defined:** MSE calculates the average of the squared differences between predicted and actual values, providing a clear measure of how well the model is performing in terms of minimizing prediction errors. It provides a single, interpretable metric for assessing the model's performance.
* **Emphasis on outliers:** Squaring the errors in MSE gives higher weights to larger errors, making the model more sensitive to outliers in the data. This can be useful in regression tasks where accurately predicting extreme values is important.
* **Convex in the prediction error:** MSE is a convex function of the predictions, so for a simple linear model it has a single global minimum. In a deep neural network the loss is generally not convex in the weights, but MSE still provides smooth, well-behaved gradients for optimizers such as Adam.
* **Widely used:** MSE is a commonly used loss function for regression tasks in neural networks, which means there are well-established techniques and frameworks for implementing and optimizing models with MSE as the loss function.

The mean square error (MSE) is the average of the squared differences between the prediction ($\hat{y}$) and the expected value ($y$). MSE values are expressed in the squared units of the target, so they are hard to interpret on their own. If the MSE for a model has decreased, that is good, but beyond this there is not much more you can determine. We seek low MSE values. The following equation shows how to calculate MSE.

$$ \mbox{MSE} = \frac{1}{n} \sum_{i=1}^n \left(\hat{y}_i - y_i\right)^2 $$
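The equation can be checked by hand on a tiny made-up example. The arrays below are hypothetical; the calculation follows the formula term by term.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])  # expected values (made up)
y_pred = np.array([1.5, 2.0, 2.0])  # predicted values (made up)

errors = y_pred - y_true     # [0.5, 0.0, -1.0]
squared = errors ** 2        # [0.25, 0.0, 1.0]
mse = squared.mean()         # (0.25 + 0.0 + 1.0) / 3
print(mse)
```

Notice that the single error of size 1.0 contributes four times as much to the MSE as the error of size 0.5.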


### Example 5: Compute MSE

The code in the cell below uses the function `metrics.mean_squared_error()` from the Python library called `scikit-learn` (alias `sklearn`) to compute the MSE for the model `bfModel`. This function takes 2 arguments: an array containing the model's **_predicted_** value for body `Density` for every subject in the validation dataset `bfX_test`, and their **_actual_** body `Density` values, stored in `bfY_test`. Finally, the code prints out the results.

In [None]:
# Example 5: Compute MSE

# Use model to predict values
bfPred = bfModel.predict(bfX_test)

# Compare predicted and actual values  
score = metrics.mean_squared_error(bfPred,bfY_test)

# Print results
print("Final score (MSE): {}".format(score))

If your code is correct, you should see the following output:

~~~text
2/2 [==============================] - 0s 4ms/step
Final score (MSE): 0.0009884501341730356
~~~

This is a relatively small number so that's a good sign. However, when it comes to MSE, all you really know is that `smaller is better`. 

### **Exercise 5: Compute MSE**

In the cell below use the function `metrics.mean_squared_error()` to compute the MSE for your model `mcModel`. Call your prediction `mcPred`. Print out the results as illustrated in Example 5. 

In [None]:
# Insert your code for Exercise 5 here



If your code is correct, you should see something similar to the following output:

~~~text
11/11 [==============================] - 0s 2ms/step
Final score (MSE): 19451004.0
~~~

Compared to the MSE computed above in Example 5, your MSE for `mcModel`, 19451004.0, might seem to be extremely large. However, looks can be deceiving. Your model was predicting the cost of medical treatment in the _tens of thousands_ of dollars, while the other model, `bfModel`, was predicting body density, which has an average (mean) value of only about 1.06 in the test set.

## Root Mean Square Error

Using Root Mean Squared Error (RMSE) for regression neural networks has several advantages:

* **Scale interpretation:** RMSE is in the same units as the target variable, providing a more interpretable measure of error compared to MSE. This makes it easier to understand the magnitude of the errors in the predicted values.
* **Outlier sensitivity:** RMSE penalizes large errors more heavily than smaller errors because the errors are squared before they are averaged, making the metric more sensitive to outliers. This can be beneficial for regression tasks where accurately predicting extreme values is important.
* **Averaging effect:** RMSE averages the errors across all samples in the dataset, providing a single metric that represents the overall model performance. This can simplify the evaluation process and make it easier to compare different models.

The root mean square error (RMSE) is the square root of the MSE. Because of this, the RMSE is in the same units as the target variable. We want low RMSE values. The following equation calculates RMSE.

$$ \mbox{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n \left(\hat{y}_i - y_i\right)^2} $$
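A short worked example, using made-up numbers, shows how RMSE follows from MSE and how it can be compared to the mean of the target.

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0])  # expected values (made up)
y_pred = np.array([12.0, 18.0, 33.0])  # predicted values (made up)

mse = np.mean((y_pred - y_true) ** 2)  # (4 + 4 + 9) / 3
rmse = np.sqrt(mse)                    # back in the units of y
relative = rmse / y_true.mean()        # RMSE as a fraction of the mean

print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"RMSE / mean: {relative:.4f}")
```

Here the RMSE is about 2.38 in the units of `y`, or roughly 12% of the mean value of 20.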

### Example 6: Compute RMSE

The code in the cell below again uses the function `metrics.mean_squared_error()` from the Python library called `scikit-learn` (alias `sklearn`) to compute the MSE for the model `bfModel`. To compute the _Root_ Mean Squared Error, you simply take the square root of the MSE, using the NumPy function `np.sqrt()` as shown in the next line of code:

> `score = np.sqrt(metrics.mean_squared_error(bfPred,bfY_test))`


In [None]:
# Example 6: Compute RMSE

# Compute RMSE
score = np.sqrt(metrics.mean_squared_error(bfPred,bfY_test))

# Print RMSE
print("Final score (RMSE): {}".format(score))

# Print mean of Y
print(f"Average body `Density`:{bfY_test.mean()}")

# Print comparison
print(f"RMSE as percent of mean `Density`:  {score/bfY_test.mean()}")

If your code is correct you should see something similar to the following output:

~~~text
Final score (RMSE): 0.03143962845206261
Average body `Density`:1.0570443868637085
RMSE as percent of mean `Density`:  0.029742959886789322
~~~

The RMSE is about 3% of the mean `Density`, so the typical error in the `bfModel's` predictions of body `Density` is about 3% of the average value.

### **Exercise 6: Compute RMSE**

In the cell below compute and print out the RMSE for your model `mcModel`, the average (mean) value of `charges` of the subjects in the validation set, and the RMSE expressed as a fraction of the mean `charges`.

To compute the _Root_ Mean Squared Error, you simply take the square root of the MSE, using the NumPy function `np.sqrt()` as shown in the next line of code:

> `score = np.sqrt(metrics.mean_squared_error(mcPred,mcY_test))`


In [None]:
# Insert your code for Exercise 6 here



If your code is correct you should see something similar to the following output:

~~~text
Final score (RMSE): 4410.3291015625
Average insurance `charges`:13277.865234375
RMSE as percent of mean `charges`:  0.3321564793586731625
~~~

In this example, RMSE for your `mcModel` was \\$4,410. On average, the people in the validation group spent \\$13,278 on insurance `charges` each year. Therefore, when expressed as a percentage of the mean, the RMSE was about 33% of the average costs. 

## Lift Chart

We often visualize the results of regression with a lift chart. Lift charts are graphical tools used to evaluate the performance of predictive models, including neural network models. The lift chart shows how much better the model is at predicting outcomes compared to a random model. They also help in understanding the model's ability to rank or classify instances correctly.

To generate a lift chart, perform the following activities:

* Sort the data by expected output and plot these values.
* For every point on the x-axis, plot that same data point's predicted value in another color.
* The x-axis is just 0 to 100% of the dataset. The expected always starts low and ends high.
* The y-axis is ranged according to the values predicted.

You can interpret the lift chart as follows:

* The expected and predicted lines should be close. Notice where one is above the other.
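The ordering step is the heart of this kind of chart. The sketch below shows it on made-up numbers using NumPy only (no plotting): the rows are sorted by the expected value, and each prediction travels with its own expected value.

```python
import numpy as np

expected = np.array([3.0, 1.0, 2.0])   # actual values (made up)
predicted = np.array([2.5, 1.2, 2.1])  # model output for the same rows (made up)

# Sort positions by expected value, then reorder both arrays the same way
order = np.argsort(expected)
expected_sorted = expected[order]
predicted_sorted = predicted[order]

print(expected_sorted)
print(predicted_sorted)
```

Plotting `expected_sorted` and `predicted_sorted` against their position gives the two lines of the chart; the expected line always rises from left to right, while the predicted line wanders above and below it.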


### Example 7: Plot Lift Chart

The code in the cell below creates a function called `chart_regression()`. The function takes two required arguments: (1) the model's **_predictions_** of the response variable for each subject and (2) the **_actual_** response variable for each subject. In this example, the `bfModel's` predictions are stored in the variable `bfPred` while the corresponding actual values are stored in `bfY_test`.

The function first places the predicted values (`pred`) and the actual values (`y`) side by side in a DataFrame. The actual values in `y` are 'flattened', meaning that any values stored in a multi-dimensional array are converted into a single, contiguous one-dimensional array. The rows are then sorted by the actual value, from small to large, so each prediction stays paired with its actual value.

The function plots two lines. In this example, the blue line shows all of the `Density` values in the test set, plotted from the smallest to the largest, left-to-right. The orange line shows the model's predicted value for each actual value. Sometimes the model's predictions are greater than the actual values (the orange line is above the blue) and sometimes the predicted values are lower than the actual values (the orange line is below the blue). The differences between these two lines are the _'errors'_ that are used in the calculation of the **Root Mean Squared _ERROR_ (RMSE)**.

In [None]:
# Example 7: Plot Lift Chart

# Define plot function
def chart_regression(pred, y, sort=True):
    t = pd.DataFrame({'pred': pred, 'y': y.flatten()})
    if sort:
        t.sort_values(by=['y'], inplace=True)
    plt.plot(t['y'].tolist(), label='expected')
    plt.plot(t['pred'].tolist(), label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()
    
# Use function to plot the chart
chart_regression(bfPred.flatten(),bfY_test)

If your code is correct you should see a Lift Plot similar to the one shown below.

![___](https://biologicslab.co/BIO1173/images/class_04_3_Lift1.png)

### **Exercise 7: Plot Lift Chart**

In the cell below plot the Lift Chart for your `mcModel` predictions.

In [None]:
# Insert your code for Exercise 7 here



If your code is correct you should see a Lift Plot similar to the one shown below.

![___](https://biologicslab.co/BIO1173/images/class_04_3_Lift2.png)

By visual inspection, your `mcModel's` ability to predict the insurance costs (`charges`) becomes less accurate as the insurance costs start to increase more rapidly, starting around the X value of 240. The reason for this sudden increase is not immediately obvious.

## **Lesson Turn-in**

When you have completed all of the code cells, and run them in sequential order (the last code cell should be number 16), use the **File --> Print.. --> Save to PDF** to generate a PDF of your JupyterLab notebook. Save your PDF as `Class_04_3.lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas.