<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/main/Class_02_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

##### **Module 2: Neural Networks with Tensorflow and Keras**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Biology, Health and the Environment](https://sciences.utsa.edu/bhe/), [UTSA](https://www.utsa.edu/)

### Module 2 Material

* Part 2.1: Introduction to Neural Networks with Tensorflow and Keras
* **Part 2.2: Encoding Feature Vectors**
* Part 2.3: Early Stopping and Dropout to Prevent Overfitting
* Part 2.4: Saving and Loading a Keras Neural Network

## Google CoLab Instructions

You MUST run the following code cell to get credit for this class lesson. By running this code cell, you will map your GDrive to /content/drive and print out your Google GMAIL address. Your Instructor will use your GMAIL address to verify the author of this class lesson.

In [None]:
# You must run this cell first
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: Using Google CoLab")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("**WARNING**: Your GMAIL address was **not** printed in the output below.")
    print("**WARNING**: You will NOT receive credit for this lesson.")
    COLAB = False

You should see the following output except your GMAIL address should appear on the last line.

![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image01B.png)

If your GMAIL address does not appear your lesson will **not** be graded.

### Create Custom Function

The cell below creates a custom function called `hms_string()`. This function is needed to record the time required to train your neural network model.

If you fail to run this cell now, you will receive one (or more) error message(s) later in this lesson.

In [2]:
# Create custom function

# ------------------------------------------------------------------------
# 0️⃣  Create hms_string()
# ------------------------------------------------------------------------

# Simple function to format elapsed time as h:mm:ss.ss
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)
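
As a quick check, the sketch below shows one way `hms_string()` might be used to time a block of code. The `time.sleep()` call is only a stand-in for model training (an assumption for illustration), and the function is repeated here so the snippet runs on its own.

```python
import time

def hms_string(sec_elapsed):
    """Format a number of seconds as h:mm:ss.ss."""
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)

# Time a short block of code
start_time = time.time()
time.sleep(0.1)  # stand-in for model training (assumption)
elapsed = time.time() - start_time
print(f"Elapsed time: {hms_string(elapsed)}")

# A fixed value makes the output format easy to see
example = hms_string(3661.5)
print(example)
```

The fixed value of 3661.5 seconds corresponds to 1 hour, 1 minute and 1.5 seconds.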

## **Datasets for Class_02_2**

For Class_02_2 we will be using the `Wisconsin Breast Cancer` dataset for the Examples and the `Heart Disease` dataset for the **Exercises**.

### **`Breast Cancer Wisconsin (Diagnostic)` Data Set**

[Breast Cancer Wisconsin (Diagnostic) Data Set](https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data)


![___](https://biologicslab.co/BIO1173/images/breast_cancer.png)


The average risk of developing breast cancer in the United States is 13%, or 1 in 8. Approximately 42,000 women in the US die from breast cancer each year. As with most cancers, early detection and treatment are singularly important in preventing mortality.

The Breast Cancer Wisconsin (BCW) dataset contains detailed microscopic measurements of cell nuclei obtained by fine needle aspirates (FNAs) from breast tumors found in 569 women. Some of these tumors were later determined to be **_malignant_** (cancerous), while other tumors were found to be **_benign_** (non-cancerous). Being able to differentiate cancerous from non-cancerous tumors is of obvious importance.  

Fine needle aspiration (FNA), also called a fine needle aspiration biopsy, is a minimally invasive procedure that uses a thin needle and syringe to extract a sample of cells, tissue, or fluid from an abnormal area or lump in the body. The sample is then examined under a microscope to confirm a diagnosis or guide treatment.

![___](https://biologicslab.co/BIO1173/images/fna_tech.png)


The features in the Breast Cancer Wisconsin dataset, computed from digitized images of breast mass cell nuclei obtained by FNA, are as follows:

**Attribute Information:**

* **ID number**
* **Diagnosis:** (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

*  **radius:** (mean of distances from center to points on the perimeter)
* **texture:** (standard deviation of gray-scale values)
* **perimeter:**
* **area:**
* **smoothness:** (local variation in radius lengths)
* **compactness:** (perimeter<sup>2</sup> / area - 1.0)
* **concavity:** (severity of concave portions of the contour)
* **concave points:** (number of concave portions of the contour)
* **symmetry:**
* **fractal dimension:** ("coastline approximation" - 1)

The mean, standard error and "worst" or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.

* Field 1: ID number
* Field 2: Diagnosis (M = malignant, B = benign)
* Fields 3-32: the 30 real-valued features
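
The field numbering described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic (10 features x 3 statistics = 30 fields, starting at field 3); the labels generated here are assumptions and may not match the actual column names in the CSV file.

```python
# Ten base features, in the order listed above
features = ["radius", "texture", "perimeter", "area", "smoothness",
            "compactness", "concavity", "concave points", "symmetry",
            "fractal dimension"]

# Label templates for the three statistics (hypothetical naming)
templates = ["Mean {}", "{} SE", "Worst {}"]

# Fields 1 and 2 hold the ID number and the diagnosis
field_names = {1: "ID number", 2: "Diagnosis"}

# The 30 computed features occupy fields 3 through 32
field = 3
for template in templates:
    for feature in features:
        field_names[field] = template.format(feature.title())
        field += 1

for number in (3, 13, 23):
    print(number, field_names[number])
```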





### **Heart Disease Dataset**

[Heart Disease Dataset](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction)


![___](https://biologicslab.co/BIO1173/images/HD.jpg)

**Description**

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of 5 CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs and this dataset contains 11 features that can be used to predict a possible heart disease.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

* **Age:** age of the patient [years]
* **Sex:** sex of the patient [M: Male, F: Female]
* **ChestPainType:** chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
* **RestingBP:** resting blood pressure [mm Hg]
* **Cholesterol:** serum cholesterol [mg/dl]
* **FastingBS:** fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
* **RestingECG:** resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
* **MaxHR:** maximum heart rate achieved [Numeric value between 60 and 202]
* **ExerciseAngina:** exercise-induced angina [Y: Yes, N: No]
* **Oldpeak:** oldpeak = ST [Numeric value measured in depression]
* **ST_Slope:** the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
* **HeartDisease:** output class [1: heart disease, 0: Normal]

# **Encoding a Feature Vector for Keras Deep Learning**

Neural networks can accept many types of data. We will continue our focus on tabular data, where there are well-defined rows and columns. This kind of data is what you would typically see in a Microsoft Excel spreadsheet. Tabular data can contain both numbers and words (e.g. `male` or `female`).

Neural networks require numeric input. This numeric form is called a **_feature vector_**. If the tabular data contains any words, we will need to convert each distinct word into a specific number. Each input neuron receives one feature (or column) from this vector. Each row of training data typically becomes one vector.

In this lesson, we will see how to encode tabular data stored in a Pandas DataFrame into a feature vector that can be used by two types of neural networks: (1) classification and (2) regression.
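
To make the idea of a feature vector concrete, here is a minimal sketch using a tiny made-up table. The column names and values are invented for illustration and are not taken from the course datasets.

```python
import pandas as pd

# Toy table with made-up values (not from the course datasets)
toyDF = pd.DataFrame({
    "age": [40, 49, 37],
    "resting_bp": [140, 160, 130],
    "max_hr": [172, 156, 98],
})

# Each row becomes one numeric feature vector
toyX = toyDF.to_numpy(dtype="float32")

print(toyX.shape)  # (number of rows, number of features)
print(toyX[0])     # feature vector for the first row
```

Here each of the three columns would feed one input neuron, and each row supplies one training example.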



### Example 1 - Step 1: Read dataset and store values in a DataFrame

Data is the essence of neural networks and deep learning. Neural networks are of little use until they have been trained on **large** datasets. Only by making repeated adjustments in the weights of their neural connections, during many rounds of training (epochs) on a particular dataset can a neural network **learn** to make accurate predictions.  

Not surprisingly, building and training neural networks begins with a dataset. The code in the cell below reads the Breast Cancer Wisconsin (BCW) dataset file, `wcbreast.csv`. With few exceptions, all of the datasets that we will use in this course are stored on a dedicated HTTPS server, https://biologicslab.co.  

This code snippet uses the Pandas `pd.read_csv()` function to read the Breast Cancer Wisconsin dataset from the course web server and store the information in a DataFrame called `bcwDF`:
~~~text
# Read file and create DataFrame
bcwDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/wcbreast.csv",
    index_col=0,
    na_values=['NA','?'])
~~~
In Python programming, DataFrames are usually just called `df`. However, in this course we need to give DataFrames more explicit names since we will typically be using two different DataFrames at the same time, one for the Examples and one for the **Exercises**.

The name `bcwDF` was chosen to remind us that the DataFrame contains the **B**reast **C**ancer **W**isconsin dataset.

As a general rule, it is always a good idea to display at least part of your new DataFrame to make sure it was read correctly. Since large DataFrames can have many columns and many, many rows (too many to display in a CoLab notebook), it is helpful to specify the maximum number of rows and columns to display. This is accomplished in the cell below using this code chunk:

~~~text
# Set display options
pd.set_option('display.max_columns', 4)
pd.set_option('display.max_rows', 4)
~~~
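
Because `pd.set_option()` changes the display settings for the rest of the session, the lesson resets them several times. As an alternative sketch (not the approach used in this lesson), `pd.option_context()` applies the settings only inside a `with` block. The DataFrame `demoDF` below is made up purely for illustration.

```python
import pandas as pd

# Small made-up DataFrame, used only to illustrate the display options
demoDF = pd.DataFrame({f"col{i}": range(10) for i in range(8)})

rows_before = pd.get_option("display.max_rows")

# Options set here apply only inside the with-block
with pd.option_context("display.max_rows", 4, "display.max_columns", 4):
    rows_inside = pd.get_option("display.max_rows")
    print(demoDF)

# The previous setting is restored automatically
rows_after = pd.get_option("display.max_rows")
print(rows_before, rows_inside, rows_after)
```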



In [None]:
# Example 1 - Step 1: Read data and create dataframe

import numpy as np
import pandas as pd


# Read file and create DataFrame
bcwDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/wcbreast.csv",
    index_col=0,
    na_values=['NA','?'])

# Set display options
pd.set_option('display.max_columns', 4)
pd.set_option('display.max_rows', 4)

# Display DataFrame
display(bcwDF)

If the code is correct you should see the following table:

![__](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image01B.png)


There are several observations that you should make from this table. First, looking at the very bottom you see:
~~~text
569 rows x 32 columns
~~~
This means that our DataFrame `bcwDF` has clinical information for `569` subjects (i.e. 1 row/subject) and that there are `32` clinical features (1 feature/column) recorded for each subject.

By inspection, we can see at least one column, `diagnosis` has non-numerical values (the strings "M" and "B"), but there could be more, since we can only see a small fraction of the entire 32 columns.
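
Rather than relying on visual inspection, the non-numeric columns can also be listed programmatically. The sketch below uses a made-up miniature table in place of `bcwDF`; the column name `mean_radius` and all values are assumptions for illustration.

```python
import pandas as pd

# Made-up miniature table that mimics the layout of bcwDF
miniDF = pd.DataFrame({
    "id": [1, 2, 3],
    "diagnosis": ["M", "B", "M"],
    "mean_radius": [10.0, 12.5, 15.0],
})

# Keep only the columns whose values are not numeric
non_numeric = miniDF.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```

Applying the same `select_dtypes()` call to a full DataFrame would list every column that still needs encoding.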

### Example 1 - Step 2: Display data types

In order to create a feature vector, we need to know which column(s) in our DataFrame are non-numeric, i.e., contain string values. We can easily print out the different data types in a DataFrame using the Pandas method `df.info()`.   

However, in order to see **_all_** of the different data types, we need to change the number of rows to display. While we could simply set this option to `32` (i.e. the number of columns in the DataFrame), here we use a slightly more elegant method, `len(bcwDF.columns)`, which automatically computes the number of columns for us.     

In [None]:
# Example 1 - Step 2: Find data types

import pandas as pd

# Set max rows to the number of columns
pd.set_option('display.max_rows', len(bcwDF.columns))

# Print data types
bcwDF.info()

If the code is correct you should see the following table:

![__](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image02B.png)

You should make the following observations from this output:
* The target column is the column that you seek to predict. By convention, the target column, containing the `Y-values`, will usually be the rightmost column in a display, or the last column in a list. In this particular example, however, the target column, `diagnosis`, is the second column in the list.   
* There is a column called `id` which identifies each subject. We should exclude this column from our analysis because it contains no information useful for making a prediction.
* From the data types output, we can see that, with the exception of the column `diagnosis`, all of the fields are **_numeric_**. Non-numeric values are classified as `object` while numeric values are classified as being either `int64` or `float64`. Numeric columns might not require further processing before they are used to generate our `X-values`.
* Categorical values (strings) are only found in the target column (`diagnosis`) which we will take care of later when we generate our `Y-values`.


### **Exercise 1 - Step 1: Read dataset and store values in a DataFrame**

In the cell below, use the Pandas function `pd.read_csv()` to read the Heart Disease data file `heart_disease.csv` located on the course HTTPS server. Save the data to a new DataFrame called `hdDF`.

_Code Hints:_

1. In order to read this file correctly, you **must** comment out the following line of code:

~~~text    
# index_col=0,
~~~
The `index_col=0` parameter in `pandas.read_csv()` tells `Pandas` to use the first column of the CSV file as the index of the resulting DataFrame. Whether or not you need to specify it depends on the structure of the CSV file you're reading.

##### **When to Use `index_col=0`**
Use it when the first column contains row labels (i.e., meaningful identifiers like patient IDs, sample names, etc.) rather than actual data.

##### **When to Omit `index_col`**
If the first column is just another data column (like age, cholesterol, etc.), and not meant to be the index, then you should not use index_col=0. The easiest way to "omit" this argument is to `comment it out` by placing a `#` at the start of the line.

**WARNING:** If you don't comment out that line the column `Age` will not be placed in your DataFrame `hdDF` correctly.
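
The effect of `index_col=0` can be seen with two tiny CSV files held in memory. This is only a sketch: the CSV text below is made up and is not the course data file.

```python
import io
import pandas as pd

# Two tiny made-up CSV files held in memory
csv_with_labels = "sample,height,weight\ns1,170,65\ns2,182,80\n"
csv_without_labels = "Age,RestingBP\n40,140\n49,160\n"

# First column holds row labels, so use it as the index
labeledDF = pd.read_csv(io.StringIO(csv_with_labels), index_col=0)

# First column is real data: omitting index_col keeps Age as a column
keptDF = pd.read_csv(io.StringIO(csv_without_labels))

# Using index_col=0 here moves Age out of the columns and into the index
lostDF = pd.read_csv(io.StringIO(csv_without_labels), index_col=0)

print(list(labeledDF.columns))
print(list(keptDF.columns))
print(list(lostDF.columns))
```

In the last case `Age` is no longer among the columns, which is the problem the warning above describes.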


Set the display for 6 rows and 6 columns and then print out a display of `hdDF`.


In [None]:
# Insert your code for Exercise 1 - Step 1 here



If your code is correct, you should see the following table.

![Heart Failure DataFrame](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image01a.png)

Check your output carefully. From left to right, there should be an unnamed index column with ascending numbers (i.e., 0, 1, 2, ...) and just to the right of the index column there should be a column called `Age`. If your output doesn't have an index column, and the column `Age` is the first column on the left, go back and re-read the instructions for **Exercise 1 - Step 1**.

Your `hdDF` DataFrame has information on `918` subjects (number of rows = 918) and `12` clinical values for each subject (number of columns = 12). There is clearly more than one column with non-numeric values, but you won't know exactly how many until you run **Exercise 1 - Step 2**.

### **Exercise 1 - Step 2: Display data types**

In the cell below, write the code to print out the different data types in your DataFrame `hdDF` using the Pandas method `df.info()`. Use `len(hdDF.columns)` to set the number of rows to display, before you print out the data types.     

In [None]:
# Insert your code for Exercise 1 - Step 2 here


If the code is correct you should see the following table:

![__](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image03B.png)

You should make the following observations from the above output:
* The target column is the column that you usually want to predict with a classification neural network. In this instance, the last column in this list, `HeartDisease`, will be your target column (Y-values).
* The column `FastingBS` doesn't appear to contain information that would be especially useful for predicting heart disease, so you will need to drop it.
* Some fields are numeric (data type `int64` or `float64`) and might not require further processing.
* There are categorical values (data type `object`) in 5 columns including: `Sex`, `ChestPainType`, `RestingECG`, `ExerciseAngina` and `ST_Slope`. The categorical values (strings) in these columns will need to be taken care of, before you can use them in generating your X-values.

### Example 2: Drop unnecessary columns

The `id` column in the `bcwDF` DataFrame does not contain information useful for predicting breast cancer, so we need to exclude it from the feature vector containing the `X-values`. To do this we will use the Pandas method `df.drop()` as shown by the next code chunk:
~~~text
bcwDF.drop('id', axis=1, inplace=True)
~~~
The call to `df.drop()` above uses three arguments. The first argument, `'id'`, is the name of the column to be dropped. The second argument, `axis=1`, means to drop a column (rather than a row), while the third argument, `inplace=True`, means to change the DataFrame itself **_permanently_** instead of returning a modified copy.

**NOTE:** After you run a code cell where you drop a column, you will get an error if you try to re-run the same cell, since there is no longer any column to drop.

If you need to run the cell again, you will first need to re-read the datafile and re-create the DataFrame `bcwDF` with **all** of the original columns by running Example 1 again.

In [None]:
# Example 2: Drop unnecessary columns

import pandas as pd


# Drop specific column
bcwDF.drop('id', axis=1, inplace=True)

# Set the max rows and max columns
pd.set_option('display.max_columns', 4)
pd.set_option('display.max_rows', 4)

# Display the updated DataFrame
display(bcwDF)

If your code is correct you should see the following table:

![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image02a.png)

You should note that the column `id` has been removed. Instead of the original `32` columns, there are now only `31` columns.

**NOTE:** If you get an error that column `id` doesn't exist, it probably means that you have already run this cell and dropped the column. To correct this error, simply go back and re-read the datafile by re-running Example 1 to create a fresh copy of `bcwDF`.
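
One possible way to make a drop step safe to re-run is the `errors="ignore"` argument of `df.drop()`, which skips labels that are no longer present. This is an alternative sketch, not the approach used in this lesson, and `miniDF` is a made-up stand-in for `bcwDF`.

```python
import pandas as pd

# Made-up miniature DataFrame standing in for bcwDF
miniDF = pd.DataFrame({"id": [1, 2],
                       "diagnosis": ["M", "B"],
                       "value": [0.5, 0.7]})

# errors="ignore" skips labels that are not present, so this can be re-run
miniDF = miniDF.drop(columns="id", errors="ignore")
miniDF = miniDF.drop(columns="id", errors="ignore")  # second run: no error

print(list(miniDF.columns))
```

The trade-off is that a misspelled column name would also be silently ignored, so the explicit error can be the safer choice while learning.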

### **Exercise 2: Drop unnecessary columns**

Since the column `FastingBS` in the Heart Disease dataset doesn't contain information that will be especially useful for predicting heart disease, this column should not be included in the analysis. In the cell below, write the code to drop the `FastingBS` column. Set your display to 6 rows and 6 columns and print out your updated DataFrame `hdDF`.

In [None]:
# Insert your code for Exercise 2 here



If your code is correct you should see the following table:

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image04B.png)

Since the column `FastingBS` wasn't displayed previously (**Exercise 1 - Step 1**), you can't tell from the table whether it was dropped. However, the number of columns is now `11`, instead of the original `12`, so you can assume your code was successful.

**NOTE:** If you get an error that column `FastingBS` doesn't exist, it probably means that you have already run this cell and dropped the column. To correct this error, simply go back and re-read the datafile to create a fresh copy of `hdDF` by running all of the code cells starting with **Exercise 1** again.

--------------------
### **One-Hot Encoding**

**One-Hot Encoding** is a technique used to convert categorical variables into a binary matrix (0s and 1s). Each category is represented as a vector where only one element is "hot" (i.e., 1) and the rest are 0.

**Example:**
For a feature `Color` with values `["Red", "Green", "Blue"]`, `One-Hot Encoding` transforms it as:

| Original Value | Red | Green | Blue |
|----------------|-----|-------|------|
| Red            |  1  |   0   |  0   |
| Green          |  0  |   1   |  0   |
| Blue           |  0  |   0   |  1   |

**Why it's used in neural networks:**
- Neural networks require numerical input.
- One-Hot Encoding avoids assigning arbitrary numerical values to categories, which could mislead the model into thinking there's an ordinal relationship.
- It ensures each category is treated independently and equally.

##### **Note:** For high-cardinality features, `One-Hot Encoding` can lead to a large number of input dimensions, which may affect performance and memory usage.
-------------------
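
The `Color` table above can be reproduced with `pd.get_dummies()`. This is a minimal sketch; note that `get_dummies()` orders the new columns alphabetically, so they are reordered here to match the table.

```python
import pandas as pd

colorDF = pd.DataFrame({"Color": ["Red", "Green", "Blue"]})

# get_dummies() creates one 0/1 column per category (alphabetical order)
dummies = pd.get_dummies(colorDF["Color"], dtype=int)

# Reorder the columns to match the table above
dummies = dummies[["Red", "Green", "Blue"]]
print(dummies)
```

Each row contains exactly one `1`, in the column that corresponds to its original value.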

### Example 3: One-Hot Encode Categorical Variables

In general, neural networks can only process **numerical** data, not string (categorical) data.

Your `hdDF` DataFrame has 5 columns that have categorical variables (strings):
1. `Sex`
2. `ChestPainType`
3. `RestingECG`
4. `ExerciseAngina`
5. `ST_Slope`.

You will need to One-Hot Encode each of these `5` columns separately.

To help you get started, Example 3 illustrates how to One-Hot encode the first column, `Sex`.

The following line of code uses the Pandas function, `pd.get_dummies()` to `One-Hot Encode` the string values in the `Sex` column and turn them into the integer values `0` and `1`.

You should notice that the code below uses `try:` and `except:` blocks. Normally, you would **not** use `try` and `except` when creating a feature vector. They have been added here simply as a teaching aid.

The `try` block lets you test a block of code for errors while the `except` block lets you handle the error. The `try` block demonstrates the 3 steps you need to One-Hot Encode the column `Sex`. The `except` block is there in case you try to re-run this cell after you have already dropped the column `Sex`. Instead of giving an error message and stopping, the `except` block simply warns you that the column has already been dropped.


In [None]:
# Example 3: One-Hot encode categorical variables

import pandas as pd

# Adding try and except blocks as teaching aid-------------------------
try:
    # Step 1 - Get dummy values
    dummies = pd.get_dummies(hdDF['Sex'],prefix="Sex", dtype=int)

    # Step 2 - Add dummies to DataFrame
    hdDF = pd.concat([hdDF,dummies],axis=1)

    # Step 3- Drop column replaced by dummies
    hdDF.drop('Sex', axis=1, inplace=True)
    print("Column 'Sex' has been dropped")

except KeyError:
    print("ERROR: Column 'Sex' may have already been dropped")

# Set the max rows and max columns
pd.set_option('display.max_columns', 6)
pd.set_option('display.max_rows', 6)

# Display the updated DataFrame
display(hdDF)

If your code is correct, you should see the following table.

![__](https://biologicslab.co/BIO1173/images/class_04/class_04_1_image04a.png)


You should notice that the original column `Sex` has been replaced by two new columns called `Sex_F` and `Sex_M`. You should also notice that these two new columns contain either the number `0` or the number `1`. This is what **`One Hot Encoding`** does. One-Hot Encoding replaces the categorical (string) values with integer values, in this case `M` or `F` have been replaced by the numbers `0` and `1`.  
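
The three steps in Example 3 (get dummies, add them to the DataFrame, drop the original column) can also be wrapped in a small helper. The function name `one_hot_replace()` and the pet data below are hypothetical, chosen only to illustrate the pattern on a table unrelated to the course data.

```python
import pandas as pd

def one_hot_replace(df, column, prefix):
    """Return a copy of df with `column` replaced by its dummy columns."""
    # Step 1 - Get dummy values
    dummies = pd.get_dummies(df[column], prefix=prefix, dtype=int)
    # Steps 2 and 3 - Drop the original column and add the dummies
    return pd.concat([df.drop(columns=column), dummies], axis=1)

# Made-up miniature table (not the course data)
petDF = pd.DataFrame({"weight": [4.2, 30.1, 5.0],
                      "species": ["cat", "dog", "cat"]})

encodedDF = one_hot_replace(petDF, "species", prefix="species")
print(encodedDF)
```

Because the helper returns a new DataFrame instead of modifying its input, it can be re-run without the "already dropped" problem.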

### **Exercise 3: One-Hot encode categorical variables**

Your `hdDF` DataFrame still has 4 columns with categorical variables:
1. `ChestPainType`
2. `RestingECG`
3. `ExerciseAngina`
4. `ST_Slope`.

For **Exercise 3**, you are to One-Hot Encode these remaining `4` columns using the example shown in Example 3. To help you with your coding, **Exercise 3** has been divided into a series of steps.

### **Exercise 3 - Step 1: One-Hot encode the column `ChestPainType`**

In the cell below, `One-Hot Encode` the column `ChestPainType` in the Heart Disease DataFrame `hdDF`. Use the word `Pain` as the dummy `prefix`.

After you add the dummies back into the DataFrame, drop the column `ChestPainType`.

Set the display for 6 columns and 6 rows and print out a display of your updated DataFrame.

In [None]:
# Insert your code for Exercise 3 - Step 1 here



If your code is correct, you should see the following table.

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image05B.png)


### **Exercise 3 - Step 2: One-Hot encode the column `RestingECG`**

In the cell below, One-Hot encode the column `RestingECG` in the Heart Disease DataFrame, `hdDF`. Use the word `RestingECG` as the dummy `prefix`.

Add the dummies to the DataFrame and then drop the column `RestingECG`. Set the display for 6 columns and 6 rows and print out a display of your updated DataFrame.

In [None]:
# Insert your code for Exercise 3 - Step 2 here



If your code is correct, you should see the following table.

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image06B.png)

### **Exercise 3 - Step 3: One-Hot encode the column `ExerciseAngina`**

In the cell below, One-Hot encode the column `ExerciseAngina` in the Heart Disease DataFrame, `hdDF`. Use the word `ExAngina` as the dummy `prefix`.

Add the dummies to the DataFrame and then drop the column `ExerciseAngina`. Set the display for 6 columns and 6 rows and print out a display of your updated DataFrame.

In [None]:
# Insert your code for Exercise 3 - Step 3 here



If your code is correct, you should see the following table.

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image07B.png)


### **Exercise 3 - Step 4: One-Hot encode the column `ST_Slope`**

In the cell below, One-Hot encode the column `ST_Slope` in the Heart Disease DataFrame, `hdDF`. Add the dummies to the DataFrame and then drop the column `ST_Slope`. Set the display for 6 columns and 6 rows and print out a display of your updated DataFrame.

In [None]:
# Insert your code for Exercise 3 - Step 4 here



If the code is correct you should see the following table.

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image08B.png)  

You should notice that the number of columns in your DataFrame, `hdDF`, has increased from the `12` original columns to a total of **20** columns! This is a clear example of **_column inflation_**, an inevitable consequence of using One-Hot Encoding.
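
The final column count can be checked with a little arithmetic. The sketch below assumes the category counts given in the dataset description earlier in this lesson; each categorical column is removed and replaced by one new column per category.

```python
# Number of categories per categorical column, from the dataset description
categories = {
    "Sex": 2,             # M, F
    "ChestPainType": 4,   # TA, ATA, NAP, ASY
    "RestingECG": 3,      # Normal, ST, LVH
    "ExerciseAngina": 2,  # Y, N
    "ST_Slope": 3,        # Up, Flat, Down
}

# Original 12 columns minus the dropped FastingBS column
columns_after_drop = 12 - 1

# Remove the 5 categorical columns, then add one column per category
expected_columns = columns_after_drop - len(categories) + sum(categories.values())
print(expected_columns)
```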

### Sample Code: Print out column names

If your coding has been correct so far, your DataFrame, `hdDF`, should have 20 columns. This is an inconveniently large number of columns to display on your computer screen. In this situation, you can use the code in the cell below to print out a complete list of all the column names in the DataFrame:
~~~text
# Use for loop
for idx, col in enumerate(hdDF.columns, start=1):
    print(f"{idx}. {col}")
~~~
Run the next cell to check whether your coding is correct so far, before going on to the next part of the lesson.

In [None]:
# Print column names in updated hdDF

# Use for loop
for idx, col in enumerate(hdDF.columns, start=1):
    print(f"{idx}. {col}")


If the code is correct you should see the following output

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image09B.png)  

If your output doesn't match the output above, figure out which columns were not successfully `One-Hot Encoded`, and/or dropped, and make the necessary code fixes.

When you are making significant changes to a DataFrame, it's always a good idea to go back to the cell where you originally created the DataFrame (in this lesson, **Exercise 1 - Step 1**), re-run that cell to make a new copy of `hdDF`, and then re-run your code cells.

**WARNING:** If your output does _not_ start with the name `Age`, it probably means that you didn't read the datafile correctly. You need to go back and re-read the instructions if you want this lesson to receive a passing grade.

## **Generate X and Y for a Classification Neural Network**

Now that unnecessary columns have been dropped, and all of the string data has been `One-Hot Encoded` (except for the target column), we are ready to use the data stored in the updated DataFrame as input for a neural network.

There are two basic ways to use tabular data as input to a neural network: the network can perform **_classification_** or **_regression_**. There are a small number of very important differences in how you generate X and Y values for these two tasks.

We will begin creating a feature vector for a **classification** neural network.

### Example 4: Create Feature Vector for _Classification_ Neural Network

The goal of a **classification neural network** is to accurately categorize input data (`X-values`) into predefined classes or categories (`Y-values`). The network learns to identify patterns and features within the input data that are associated with each class, allowing it to make predictions about the class of new, unseen data. During training (fitting), the neural network tries to minimize the classification error as a way to improve the overall accuracy of the network's predictions.

The code in the cell below creates a feature vector to hold the `X-values` as well as another feature vector to hold the `Y-values`. Keep in mind that a `feature vector` can only contain _numerical values_.  

In the code below, the first step is to specify the column name in the DataFrame that contains the Y-values. For the DataFrame `bcwDF` the Y-values are in the column called **`diagnosis`** so we set the target name using the following line of code:
```text
bcw_target_name='diagnosis'
```
Once we have specified which column will be the target, the next step is to make sure that the data in this column is **not** included as part of your feature vector holding the `X-values`. The code for doing this is shown here using the `df.drop()` method:
```text
    # ----------------------------------------------------------------
    #  2️⃣  Feature matrix - drop the target column
    # ----------------------------------------------------------------
    bcwX = bcwDF.drop(columns=bcw_target_name).to_numpy(dtype=np.float32)
```

**IMPORTANT:** Most classification neural networks use `categorical cross-entropy` as the loss function. This function expects the target labels to be in `One-Hot Encoded` format, where each label is represented as a vector with a `1` in the position of the correct class and `0s` elsewhere. Here is the code that generates the feature vector containing our One-Hot Encoded `Y-values`:
```text
    # ----------------------------------------------------------------
    #  3️⃣  Target matrix – one‑hot encode the diagnosis column
    # ----------------------------------------------------------------
    bcwY = pd.get_dummies(bcwDF[bcw_target_name], dtype=int).to_numpy(dtype=np.float32)
```
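
To see why the loss function works with One-Hot Encoded targets, here is a minimal NumPy sketch of the categorical cross-entropy calculation. It is a plain illustration of the formula, not the Keras implementation, and the target and prediction values are made up.

```python
import numpy as np

# One-hot targets for two made-up samples: class 0, then class 1
y_true = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

# Made-up predicted probabilities (each row sums to 1)
y_pred = np.array([[0.9, 0.1],
                   [0.2, 0.8]])

# Categorical cross-entropy: -sum(y_true * log(y_pred)) per sample, then averaged
per_sample = -np.sum(y_true * np.log(y_pred), axis=1)
loss = per_sample.mean()

print(per_sample)
print(loss)
```

Because each one-hot row contains a single `1`, only the predicted probability of the correct class contributes to that sample's loss.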

You should note that when the cell below is run, `X-values` will have the name `bcwX` and the `Y-values` will have the name `bcwY`. This is to remind you that the X-values are for the Breast Cancer Wisconsin (`bcw`) dataset.

We will have to use these names below when we build our classification neural network.

In [None]:
# Example 4: Create Feature Vector for Classification Neural Network

import numpy as np
import pandas as pd

# Set target name
bcw_target_name = 'diagnosis'

# ------------------------------------------------------------------
# 1️⃣ Sanity check – make sure the target column exists
# ------------------------------------------------------------------
if bcw_target_name not in bcwDF.columns:
    raise KeyError(f"'{bcw_target_name}' column not found in the dataframe.")

# ------------------------------------------------------------------
# 2️⃣ Feature matrix – drop the target column
# ------------------------------------------------------------------
bcwX = bcwDF.drop(columns=bcw_target_name).to_numpy(dtype=np.float32)

# ------------------------------------------------------------------
# 3️⃣ Target matrix – one-hot encode the diagnosis column
# ------------------------------------------------------------------
bcwY = pd.get_dummies(bcwDF[bcw_target_name], dtype=int).to_numpy(dtype=np.float32)

# ------------------------------------------------------------------
# 4️⃣ Quick sanity-check prints (useful for Colab output)
# ------------------------------------------------------------------
np.set_printoptions(suppress=True, precision=4)

print(f"\nFeature matrix shape: {bcwX.shape}")
print(f"Target matrix shape: {bcwY.shape}")

print("\nFirst 4 feature vectors:")
print(bcwX[:4])

print("\nCorresponding one-hot targets:")
print(bcwY[:4])

If the code is correct you should see the following output

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image10B.png)  


**IMPORTANT:** You should never see anything but numeric values in your feature vectors. If you see any "words" or "letters" (i.e. strings), your feature vector was not generated correctly. If you try to feed a feature vector containing strings to your neural network for training, the training will immediately terminate with an error message.
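One way to check this programmatically, rather than by eye, is to test the array's `dtype`. The helper `is_all_numeric` below is a hypothetical name introduced for this sketch; it is not part of the lesson's code.

```python
import numpy as np

def is_all_numeric(arr) -> bool:
    """Return True if the array has a numeric dtype (no strings or objects)."""
    return np.issubdtype(np.asarray(arr).dtype, np.number)

good = np.array([[1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
bad = np.array([["M", "2.0"], ["B", "4.0"]])  # strings slipped in

print(is_all_numeric(good))  # numeric dtype
print(is_all_numeric(bad))   # string dtype
```

You could call the same helper on `bcwX` and `bcwY` after running Example 4.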

### **Exercise 4: Create Feature Vectors for _Classification_ Neural Network**

In the cell below, create `X-` and `Y-` feature vectors for a classification neural network from the data in your Heart Disease DataFrame `hdDF`. Use the column `HeartDisease` for your **target column** (i.e. your `Y-values`), and all of the other columns for your `X-values`. Call your X-feature vector **`hdX`** and your `Y-feature vector` **`hdY`**. Don't forget that you **MUST** `One-Hot Encode` the values in the target column.

**Code Hints:**

1. Set the target name to "HeartDisease"
2. Change `bcwX` to `hdX` everywhere in the code that you copied from Example 4
3. Change `bcwY` to `hdY` everywhere in the code that you copied from Example 4

Don't forget to change the names for `X-` and `Y-` feature vectors in the print statements at the end of the code cell.

In [None]:
# Insert your code for Exercise 4 here



If the code is correct you should see the following output

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image11B.png)  

You might notice that there are a lot of `1s` and `0s` in your `X feature vector`. This is because you `One-Hot Encoded` several non-numeric columns in **Exercise 3**.

Also note that there are no "words" or "letters" in either the `X-` or the `Y-` feature vector.

### Example 5A:  Construct, Compile and Train _Classification_ Neural Network

When building a **classification** neural network, there are two important points to remember:

* Classification neural networks have an output neuron count equal to the number of classes.
* Classification neural networks should use the **softmax** activation function in the output layer and **categorical_crossentropy** as the loss function when you compile your neural network.
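To see what softmax and categorical cross-entropy actually compute, here is a small NumPy sketch. It is a simplified re-implementation for illustration only (Keras does this internally); the function names and the example numbers are assumptions.

```python
import numpy as np

def softmax(logits):
    # Subtract the row maximum for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    # Mean over samples of -sum(y_true * log(y_prob))
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-(y_true * np.log(y_prob)).sum(axis=1).mean())

logits = np.array([[2.0, 0.0], [0.0, 0.0]])   # raw outputs for 2 samples, 2 classes
y_true = np.array([[1.0, 0.0], [0.0, 1.0]])   # one-hot targets

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)

print(probs)  # each row sums to 1
print(loss)
```

Because the targets are one-hot, only the predicted probability of the correct class contributes to the loss for each sample.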

The code in the cell below starts out by defining several parameters using the following code snippet:
```text
# ---------------------------------------------------------------------------
# 1️⃣ Define parameters
# ---------------------------------------------------------------------------

EPOCHS=200
PATIENCE=20
VERBOSE=2
lr=0.0010
OPTIMIZER = Adam(learning_rate=lr)
```
This makes it easier to change one (or more) of these values later if you want to "tune" your training of your neural network.

The following code snippet splits the X and Y data into `training` and `validation` data sets:
```text
# ---------------------------------------------------------------------------
# 2️⃣ Split data
# ---------------------------------------------------------------------------

# Define split value
split_val=0.8

# Train / validation split
split = int(split_val * bcwX.shape[0])
x_train, x_val = bcwX[:split], bcwX[split:]
y_train, y_val = bcwY[:split], bcwY[split:]
```
The variable `split_val=0.8` means that 80% of the dataset will be used for `training` and the remaining 20% will be used for `validation`.
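The split arithmetic can be checked on a small, made-up array. The 10-row array `X` below is an assumption for illustration. Note that slicing takes the first 80% of rows in their existing order; the optional shuffled variant is shown only as an alternative for data that happens to be sorted, and is not what Example 5A does.

```python
import numpy as np

X = np.arange(20).reshape(10, 2)       # 10 hypothetical rows, 2 features
split_val = 0.8
split = int(split_val * X.shape[0])    # int(8.0) -> 8

# Slice-based split, as in the snippet above (row order is preserved)
x_train, x_val = X[:split], X[split:]
print(x_train.shape, x_val.shape)

# Optional alternative: shuffle the row indices first
rng = np.random.default_rng(42)
idx = rng.permutation(X.shape[0])
x_train_s, x_val_s = X[idx[:split]], X[idx[split:]]
print(x_train_s.shape, x_val_s.shape)
```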

The model is built and compiled with this code snippet
```text
# ---------------------------------------------------------------------------
# 3️⃣ Build and compile model
# ---------------------------------------------------------------------------
inputs = Input(shape=(bcwX.shape[1],))
x = Dense(25, activation="relu")(inputs)
x = Dropout(0.2)(x)
x = Dense(50, activation="relu")(x)
outputs = Dense(bcwY.shape[1], activation="softmax")(x)

# Create model
bcw_model = Model(inputs, outputs)

# Compile model
bcw_model.compile(
    loss="categorical_crossentropy",
    optimizer=OPTIMIZER,
    metrics=["accuracy"],
)

```
You should note that this line of code:
```text
inputs = Input(shape=(bcwX.shape[1],))
```
ensures that the number of neurons in the input layer is exactly equal to the number of columns in the `X-feature` vector (`bcwX.shape[1]`).

You should also note that this line of code:
```text
outputs = Dense(bcwY.shape[1], activation="softmax")(x)
```
ensures that the number of neurons in the output layer is exactly equal to the number of classes (`bcwY.shape[1]`).
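Because the input and output sizes are taken from the data, the number of trainable weights in the network depends on them as well. The sketch below counts the parameters of the three `Dense` layers by hand; the values `n_features=30` and `n_classes=2` are assumptions chosen for illustration, and `Dropout` adds no parameters.

```python
def dense_params(n_in: int, n_out: int) -> int:
    # One weight per input/output pair, plus one bias per output neuron
    return n_in * n_out + n_out

def count_params(n_features: int, n_classes: int) -> int:
    # Input -> Dense(25) -> Dropout -> Dense(50) -> Dense(n_classes)
    return (
        dense_params(n_features, 25)
        + dense_params(25, 50)
        + dense_params(50, n_classes)
    )

print(count_params(n_features=30, n_classes=2))
```

You can compare a hand count like this with the total reported by `bcw_model.summary()`.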

You should note that since we are building a _classification_ neural network, we need to use `categorical_crossentropy` as our loss function
```text
loss="categorical_crossentropy",
```

Here is the code snippet that performs the actual training of the neural network:
```text

# ---------------------------------------------------------------------------
# 5️⃣ Train model
# ---------------------------------------------------------------------------
bcw_history = bcw_model.fit(
    x_train,
    y_train,
    validation_data=(x_val, y_val),
    epochs=EPOCHS,
    batch_size=32,
    callbacks=callbacks,
    verbose=VERBOSE,
)
```
You should note that the prefix `bcw_` has been added to both the model name (`bcw_model`) and to the `history` variable (`bcw_history`). This has been done to keep these objects separate from the code you will be writing in **Exercise 5A** below.

In [None]:
# Example 5A: Construct, Compile and Train Classification Neural Network

from __future__ import annotations

import os
import time
import random
import numpy as np
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import (
    EarlyStopping,
    ModelCheckpoint,
    ReduceLROnPlateau,
)

# ---------------------------------------------------------------------------
# 1️⃣ Define parameters
# ---------------------------------------------------------------------------
EPOCHS = 200
PATIENCE = 20
VERBOSE=2
lr = 0.0010
OPTIMIZER = Adam(learning_rate=lr)

# ---------------------------------------------------------------------------
# 2️⃣ Split data
# ---------------------------------------------------------------------------
split_val = 0.8
split = int(split_val * bcwX.shape[0])
x_train, x_val = bcwX[:split], bcwX[split:]
y_train, y_val = bcwY[:split], bcwY[split:]

# ---------------------------------------------------------------------------
# 3️⃣ Build and compile model
# ---------------------------------------------------------------------------
inputs = Input(shape=(bcwX.shape[1],))
x = Dense(25, activation="relu")(inputs)
x = Dropout(0.2)(x)
x = Dense(50, activation="relu")(x)
outputs = Dense(bcwY.shape[1], activation="softmax")(x)

# Create model
bcw_model = Model(inputs, outputs)

# Compile model
bcw_model.compile(
    loss="categorical_crossentropy",
    optimizer=OPTIMIZER,
    metrics=["accuracy"],
)

# ---------------------------------------------------------------------------
# 4️⃣ Add Callbacks
# ---------------------------------------------------------------------------
checkpoint_path = "bcw_best_classification_model.keras"
callbacks = [
    EarlyStopping(
        monitor="val_loss", patience=PATIENCE, restore_best_weights=True
    ),
    ModelCheckpoint(
        filepath=checkpoint_path,
        monitor="val_loss",
        save_best_only=True,
    ),
    ReduceLROnPlateau(
        monitor="val_loss", factor=0.5, patience=PATIENCE, verbose=1
    ),
]

# ---------------------------------------------------------------------------
# 5️⃣ Train model
# ---------------------------------------------------------------------------
print(f"-- Training (classification) is starting for {EPOCHS} epochs ----------------------------")
start_time = time.time()
bcw_history = bcw_model.fit(
    x_train,
    y_train,
    validation_data=(x_val, y_val),
    epochs=EPOCHS,
    batch_size=32,
    callbacks=callbacks,
    verbose=VERBOSE,
)

# ---------------------------------------------------------------------------
# 6️⃣ Inspect training
# ---------------------------------------------------------------------------
print("\nTraining complete.")
print("Best validation accuracy:", max(bcw_history.history["val_accuracy"]))

elapsed_time = time.time() - start_time
print(f"Elapsed time: {hms_string(elapsed_time)}")


If the code is correct you should see something _similar_ to the following final output from the training

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image13B.png)  

You should notice 3 things about the output:

1. The training stopped well short of 200 epochs. This occurred due to `EarlyStopping` in the `callbacks`:
```text  
callbacks = [
    EarlyStopping(
        monitor="val_loss", patience=PATIENCE, restore_best_weights=True
    ),
```
Since we set `PATIENCE = 20` and the callback monitors `val_loss`, training continued for 20 more epochs after the validation loss reached its lowest value, waiting for an improvement, and then stopped early.

2. The `Best validation accuracy` was very high (over 90%), meaning our neural network `bcw_model` learned to classify tumors as `malignant` or `benign` with high accuracy based on the clinical measurements of the tumor.

3. Training time was very short, less than 20 seconds.
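The patience logic behind the first point can be sketched in a few lines of plain Python. This is a simplified illustration, not the Keras implementation (for example, it ignores `min_delta`), and the function name and loss values are made up.

```python
def early_stop_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for patience-based stopping."""
    best = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: remember it and reset the counter
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

val_losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.66]
print(early_stop_epoch(val_losses, patience=3))
```

With `restore_best_weights=True`, the weights kept at the end are the ones from the best epoch, not from the epoch where training stopped.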

### Example 5B:  Visualize Training Curves

Visualizing training curves—specifically **train loss vs. validation loss** and **train accuracy vs. validation accuracy**—is incredibly useful for diagnosing and improving the performance of a neural network. Here's why:

##### **Monitor Learning Progress**
* **Train Loss/Accuracy** shows how well the model is fitting the training data.
* **Validation Loss/Accuracy** indicates how well the model generalizes to unseen data.

These curves help you understand whether the model is learning effectively or struggling.

##### **Detect Overfitting**
* If training loss keeps decreasing while validation loss starts increasing, the model is likely **overfitting**—memorizing training data rather than learning general patterns.
* Similarly, if **training accuracy increases** but **validation accuracy plateaus or drops**, it's another sign of overfitting.

##### **Detect Underfitting**
* If both training and validation metrics are poor and don’t improve, the model might be **underfitting**—too simple to capture the data's complexity.

##### **Identify Optimal Stopping Point**
These curves help determine when to stop training (e.g., using early stopping) to avoid wasting time and resources once validation performance stops improving.

The code in the cell below uses the graphics package `matplotlib.pyplot` to generate a plot of **`training loss vs validation loss`** and a plot of **`training accuracy vs validation accuracy`**.

In [None]:
# Example 5B: Visualize training curve

import matplotlib.pyplot as plt

# Plot train loss vs val loss
plt.figure(figsize=(8, 5))
plt.plot(bcw_history.history['loss'], label='train loss')
plt.plot(bcw_history.history['val_loss'], label='val loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()

# Plot train acc vs val acc
plt.figure(figsize=(8, 5))
plt.plot(bcw_history.history['accuracy'], label='train acc')
plt.plot(bcw_history.history['val_accuracy'], label='val acc')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

If the code is correct you should see something _similar_ to the following output

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image12B.png)  

Here is an analysis of these two graphs:

#### **Top Graph: Loss vs. Epochs**

* **Training Loss (blue):**
* * Starts high and decreases steadily, indicating that the model is learning and fitting the training data well.
* **Validation Loss (orange):**
* * Initially decreases but then plateaus or increases, suggesting that the model begins to overfit after a certain number of epochs.
* * If validation loss increases while training loss continues to decrease, it's a classic sign of overfitting.

#### **Bottom Graph: Accuracy vs. Epochs**

* **Training Accuracy (blue):**
* * Increases steadily and approaches close to `1.0`, showing that the model is nearly perfectly fitting the training data.

* **Validation Accuracy (orange):**
* * Initially improves but then levels off or slightly declines, again pointing to **overfitting**—the model performs well on training data but not on unseen validation data.

#### **Interpretation**
The model is **learning well initially**, but after a certain point, it starts to memorize the training data rather than generalize.
This is evident from the **divergence between training and validation** metrics after a certain number of epochs.
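The same diagnosis can be made numerically from the `history` dictionary. The sketch below uses made-up loss values (not real training output) to locate the epoch with the lowest validation loss and to show the widening gap between the two curves.

```python
import numpy as np

# Hypothetical values shaped like bcw_history.history
history = {
    "loss":     [0.60, 0.40, 0.30, 0.22, 0.15, 0.10],
    "val_loss": [0.62, 0.45, 0.38, 0.36, 0.39, 0.44],
}

train_loss = np.array(history["loss"])
val_loss = np.array(history["val_loss"])

best_epoch = int(val_loss.argmin()) + 1   # epochs are numbered from 1
gap = val_loss - train_loss               # grows as the model overfits

print("Best epoch:", best_epoch)
print("Validation - training loss per epoch:", gap)
```

A gap that keeps growing after the best epoch is the numerical counterpart of the diverging curves in the plots.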

### **Exercise 5A:  Construct, Compile and Train _Classification_ Neural Network**

In the cell below write the code to construct, compile and train a classification neural network called `hd_model` to analyze the data in the DataFrame `hdDF`.

**Code Hints:**

1. Change the prefix `bcw` to `hd` **_everywhere_** in the code that you copied from Example 5A.

**NOTE:** If you get an error when you try to run your code, it probably means that you missed one (or more) places that had the prefix `bcw`.



In [None]:
# Insert your code for Exercise 5A here



If the code is correct you should see something _similar_ to the following final output from the training

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image14B.png)  

### **Exercise 5B:  Visualize Training Curves**

In the cell below write the code to visualize the training curves for your `hd_model`.

**Code Hints:**

Change `bcw_history` to `hd_history`.

In [None]:
# Insert your code for Exercise 5B here



If the code is correct you should see something _similar_ to the following output

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image15B.png)  

Here is an analysis of these two graphs:

#### **Top Graph: Loss vs. Epochs**

* **Training Loss (blue):**
* * Shows a **steady decline**, indicating that the model is successfully minimizing the loss on the training data.
* **Validation Loss (orange):**
* * Initially decreases, but then **flattens or slightly increases**, suggesting that the model starts to **overfit** after a certain number of epochs.

#### **Bottom Graph: Accuracy vs. Epochs**
* **Training Accuracy (blue):**
* * Increases consistently, approaching high values (likely near `1.0`), which means the model is fitting the training data very well.
* **Validation Accuracy (orange):**
* * Improves early on but then **plateaus or slightly declines**, again pointing to **overfitting**—the model performs well on training data but struggles to generalize.

#### **Overall Interpretation**
* The model is learning effectively during early epochs.
* After a certain point, generalization performance stagnates, as seen in the validation metrics.
* This is a classic case of **overfitting**, where the model memorizes training data but fails to perform well on unseen data.

--------------------------------

## **Classification vs. Regression: A General Overview**

In supervised machine learning, tasks are typically categorized into **classification** or **regression**, depending on the nature of the output variable.

### **Classification**

- **Goal**: Predict a **discrete label** or **category**.
- **Output**: Categorical values (e.g., "spam" or "not spam", "cat", "dog", "bird").
- **Examples**:
  - Email spam detection
  - Disease diagnosis (e.g., predicting if a patient has a disease)
  - Image recognition (e.g., identifying objects in photos)
- **Algorithms**:
  - Logistic Regression
  - Decision Trees
  - Random Forest
  - Support Vector Machines (SVM)
  - **Neural Networks (for multi-class classification)**

### **Regression**

- **Goal**: Predict a **continuous value**.
- **Output**: Real numbers (e.g., price, temperature, age).
- **Examples**:
  - Predicting medical costs
  - Estimating a person's weight based on height
  - Forecasting the spread of infections
- **Algorithms**:
  - Linear Regression
  - Polynomial Regression
  - Decision Trees
  - Random Forest
  - **Neural Networks (for regression tasks)**

##### **Key Differences Between Classification and Regression**

| Feature | Classification                 | Regression                       |
|---------|--------------------------------|----------------------------------|
| Output  | Discrete categories            | Continuous values                |
| Evaluation | Accuracy, Precision, Recall, F1 Score | MSE, RMSE, MAE, R² |
| Use Case | Label prediction               | Value estimation                 |
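The evaluation metrics in the table can be computed by hand on a few made-up predictions. All numbers below are assumptions for illustration only.

```python
import numpy as np

# Classification: compare the predicted class (argmax) with the true class
y_true_cls = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])
y_prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]])
accuracy = float((y_prob.argmax(axis=1) == y_true_cls.argmax(axis=1)).mean())

# Regression: measure how far the predicted values are from the true values
y_true_reg = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_reg = np.array([12.0, 18.0, 33.0, 40.0])
errors = y_pred_reg - y_true_reg
mae = float(np.abs(errors).mean())
mse = float((errors ** 2).mean())
rmse = float(np.sqrt(mse))

print("Accuracy:", accuracy)
print("MAE:", mae, "MSE:", mse, "RMSE:", rmse)
```

Accuracy counts correct labels, while MAE, MSE and RMSE measure the size of the numeric error.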

-------------------------------

## **Generate Feature Vectors for a Regression Neural Network**

As mentioned above, the procedure for generating `X` and `Y` feature vectors for a regression neural network is somewhat different from the procedure used for classification. Even though these differences are not large, they are important. If your `X` and `Y` feature vectors are not generated in the correct format, your neural network will not train correctly.

### Example 6: Generate Feature Vectors for Regression Neural Network


For regression, we want to predict a variable that has a **_range of values_**. For Example 6, we will generate `X` and `Y` feature vectors for a regression neural network designed to predict the `mean_area` of tumor cell nuclei in the Breast Cancer Wisconsin dataset.

The first step is to specify the target column
```text
# ------------------------------------------------------------------
# 1️⃣  Identify feature / target columns
# ------------------------------------------------------------------
BCW_TARGET_COL = "mean_area"  # continuous variable we want to predict
```

Then the code in the cell below automatically classifies columns in `bcwDF` as categorical or numeric (excluding the target column).  
```text
# Identify categorical and numeric columns directly
categorical_cols = [col for col in bcwDF.columns if bcwDF[col].dtype == "object" and col != BCW_TARGET_COL]
numeric_cols = [col for col in bcwDF.columns if col != BCW_TARGET_COL and col not in categorical_cols]
```

It then builds two preprocessing pipelines:  
- **Numeric** - imputes missing values with the median and standard-scales the features.  
- **Categorical** - imputes missing values with the most frequent value and `one-hot-encodes` the categories (for a category with only two values, one of the two resulting columns is dropped).  

A `ColumnTransformer` then combines these pipelines so that both numeric and categorical data are processed consistently before model training.

The data is then split into training and validation datasets using this code snippet:
```text
# ------------------------------------------------------------------
# 3️⃣  Split into train / test sets
# ------------------------------------------------------------------
bcwX = bcwDF.drop(columns=[BCW_TARGET_COL])
bcwY = bcwDF[BCW_TARGET_COL].values.astype(np.float32)

test_size = 0.2

bcwX_train, bcwX_val, bcwY_train, bcwY_val = train_test_split(
    bcwX, bcwY,
    test_size=test_size,
    random_state=42,
    shuffle=True,
)

```
The next step transforms the training data
```text
# ------------------------------------------------------------------
# 4️⃣  Fit the transformer on the training data
# ------------------------------------------------------------------
bcwX_train_proc = preprocess.fit_transform(bcwX_train)
bcwX_val_proc = preprocess.transform(bcwX_val)
```
* The preprocess `ColumnTransformer` learns the necessary statistics from the training data:
* * For numeric columns - it calculates median values for imputation and the mean & standard deviation for scaling.
* * For categorical columns - it determines the most frequent category for imputation and the unique categories for one-hot encoding.
* After fitting, it immediately **transforms** `bcwX_train` into a processed NumPy array `bcwX_train_proc`.

* The same learned parameters (median, mean, std, category levels) are applied to the validation features, producing `bcwX_val_proc` without altering the training data.

**Result:** Both training and validation feature sets are now clean, imputed, scaled, and one-hot-encoded, ready for downstream modeling.
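The key idea, learning the statistics from the training data only and then reusing them for the validation data, can be sketched in plain NumPy. The tiny arrays below are made up, and this sketch covers only the scaling step, not imputation or one-hot encoding.

```python
import numpy as np

x_train = np.array([[1.0], [2.0], [3.0]])   # hypothetical training feature
x_val = np.array([[4.0]])                   # hypothetical validation feature

# "Fit": learn the statistics from the training data only
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)   # population std (ddof=0), as StandardScaler uses

# "Transform": apply the same statistics to both sets
x_train_scaled = (x_train - mean) / std
x_val_scaled = (x_val - mean) / std

print(x_train_scaled.ravel())
print(x_val_scaled.ravel())
```

The validation value ends up outside the range of the scaled training values, which is expected: the validation data never influences the mean or standard deviation.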

The next step is to scale the target (the `Y-values`):
```text
# ------------------------------------------------------------------
# 5️⃣  Scale the target
# ------------------------------------------------------------------
from sklearn.preprocessing import StandardScaler

scale_y = True
if scale_y:
    y_scaler = StandardScaler()
    bcwY_train = y_scaler.fit_transform(bcwY_train.reshape(-1, 1)).ravel()
    bcwY_val = y_scaler.transform(bcwY_val.reshape(-1, 1)).ravel()
```
##### **Why do we scale the target?**
* **Regression neural networks** trained by gradient descent often converge faster and produce more stable numeric results when the target has zero mean and unit variance, because the loss and its gradients stay in a moderate range.
* Some models, especially those that penalize large coefficients (e.g., regularized regressions), can benefit from a scaled target to keep coefficient magnitudes in a comparable range.
* When you later need to inverse-transform predictions back to the original scale, you can use `y_scaler.inverse_transform(predictions.reshape(-1, 1))`.
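The inverse transform is just the scaling formula run backwards. The NumPy sketch below uses made-up target values and made-up "predictions" to show that an error measured in scaled units converts to original units by multiplying by the standard deviation learned from the training targets.

```python
import numpy as np

y_train = np.array([100.0, 200.0, 300.0])   # hypothetical target values
mu, sigma = y_train.mean(), y_train.std()   # what the scaler learns

y_true_scaled = (y_train - mu) / sigma
# Hypothetical model output in scaled units (true value plus a small error)
pred_scaled = y_true_scaled + np.array([0.1, -0.2, 0.0])

# Inverse transform: back to the original units
pred_original = pred_scaled * sigma + mu

mae_scaled = np.abs(pred_scaled - y_true_scaled).mean()
mae_original = np.abs(pred_original - y_train).mean()

print("MAE (scaled units):  ", mae_scaled)
print("MAE (original units):", mae_original)
```

This is why an MAE reported during training on a scaled target is not directly in the units of the original column.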

Finally, we print out the first 4 values in the X and Y feature vectors so we can make sure they have the correct format.


In [None]:
# Example 6: Generate Feature Vectors for Regression Neural Network

from __future__ import annotations

import numpy as np
import pandas as pd
from pathlib import Path

# ------------------------------------------------------------------
# 1️⃣  Identify feature / target columns
# ------------------------------------------------------------------
BCW_TARGET_COL = "mean_area"  # continuous variable we want to predict

# ------------------------------------------------------------------
# 2️⃣  Pre‑processing pipeline
# ------------------------------------------------------------------
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Identify categorical and numeric columns directly
categorical_cols = [col for col in bcwDF.columns if bcwDF[col].dtype == "object" and col != BCW_TARGET_COL]
numeric_cols = [col for col in bcwDF.columns if col != BCW_TARGET_COL and col not in categorical_cols]

numeric_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)

categorical_pipe = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(drop="if_binary", sparse_output=False)),
    ]
)

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipe, numeric_cols),
        ("cat", categorical_pipe, categorical_cols),
    ]
)

# ------------------------------------------------------------------
# 3️⃣  Split into train / test sets
# ------------------------------------------------------------------
bcwX = bcwDF.drop(columns=[BCW_TARGET_COL])
bcwY = bcwDF[BCW_TARGET_COL].values.astype(np.float32)

test_size = 0.2

bcwX_train, bcwX_val, bcwY_train, bcwY_val = train_test_split(
    bcwX, bcwY,
    test_size=test_size,
    random_state=42,
    shuffle=True,
)

# ------------------------------------------------------------------
# 4️⃣  Fit the transformer on the training data
# ------------------------------------------------------------------
bcwX_train_proc = preprocess.fit_transform(bcwX_train)
bcwX_val_proc = preprocess.transform(bcwX_val)

# ------------------------------------------------------------------
# 5️⃣  Scale the target
# ------------------------------------------------------------------
from sklearn.preprocessing import StandardScaler

scale_y = True
if scale_y:
    y_scaler = StandardScaler()
    bcwY_train = y_scaler.fit_transform(bcwY_train.reshape(-1, 1)).ravel()
    bcwY_val = y_scaler.transform(bcwY_val.reshape(-1, 1)).ravel()

# ------------------------------------------------------------------
# 6️⃣  Inspect the first few rows
# ------------------------------------------------------------------
np.set_printoptions(suppress=True, precision=4)
print("First 4 rows of processed X (bcwX_train_proc):")
print(bcwX_train_proc[:4])
print("\nCorresponding y (bcwY_train):")
print(bcwY_train[:4])


If the code is correct you should see the following output

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image16B.png)  

### **Exercise 6: Generate Feature Vectors for Regression Neural Network**

In the cell below write the code to generate feature vectors for your `hdDF` DataFrame. Set your target column to `"MaxHR"` (i.e. `HD_TARGET_COL = "MaxHR"`) since you are to predict the maximum heart rate.

**Code Hints:**

Change the prefix `bcw` to `hd` everywhere in the code that you copied from Example 6.



In [None]:
# Insert your code for Exercise 6 here



If the code is correct you should see the following output

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image17B.png)  

## **Classification vs. Regression in Neural Networks**

While **classification** and **regression** are fundamentally different tasks—classification predicts **discrete categories**, whereas regression predicts **continuous values**—the **core architecture** of the neural networks used for both can be remarkably similar. This includes shared components like:

- Hidden layers
- Activation functions (e.g., ReLU)
- Optimizers (e.g., Adam)

However, there are **critical differences** that must be addressed when switching between these tasks. These differences primarily affect the **output layer**, **loss function**, **label representation**, and **evaluation metrics**.

#### **Key Differences**

| Component      | Classification                                            | Regression                          |
|----------------|-----------------------------------------------------------|-------------------------------------|
| Output Layer   | `Dense(num_classes, activation='softmax')`                | `Dense(1)` (linear activation by default) |
| Loss Function  | `categorical_crossentropy` or `sparse_categorical_crossentropy` | `mean_squared_error`, `mean_absolute_error` |
| Label Format   | One‑hot encoded or integer class labels                  | Continuous numeric values           |
| Metrics        | `accuracy`, `precision`, `recall`                        | `mse`, `mae`, `r²`                  |

#### **Summary**
In essence, while the internal structure of classification and regression models can be nearly identical, the task-specific components—especially the output layer and loss function—must be carefully tailored to the nature of the prediction problem. This ensures that the model learns appropriately and that its performance is evaluated meaningfully.

### Example 7: Construct, Compile and Train Regression Neural Network

The code in the cell below performs a regression analysis of the data in the Breast Cancer Wisconsin (`bcw`) dataset stored in the `bcwDF` DataFrame.

Here is the code that builds the neural network
```text
# ---------------------------------------------------------------------------
# 2️⃣ Build and compile regression model
# ---------------------------------------------------------------------------
inputs = Input(shape=(bcwX.shape[1],))
x = Dense(25, activation="relu")(inputs)
x = Dropout(0.2)(x)
x = Dense(50, activation="relu")(x)
outputs = Dense(1)(x)  # Single output for regression
```
You should notice the following 2 points:
1. There is 1 input neuron for each column in the `X` feature vector
```text
inputs = Input(shape=(bcwX.shape[1],))
```
2. There is only **`1`** neuron in the output layer
```text
outputs = Dense(1)(x)  # Single output for regression
```

Here is the code that compiles the model
```text
# Compile model
bcw_model.compile(
    loss="mean_squared_error",  # or "mae"
    optimizer=OPTIMIZER,
    metrics=["mae"],  # Mean Absolute Error
)
```

You should notice that with regression we need to use a different loss function
```text
loss="mean_squared_error",  # or "mae"
```
and we need to choose a metric that suits a regression task, since `accuracy` is not meaningful for continuous values
```text
metrics=["mae"],  # Mean Absolute Error
```
Finally, here is the code that actually `trains` (fits) the model
```text
# ---------------------------------------------------------------------------
# 4️⃣ Train model
# ---------------------------------------------------------------------------
print(f"-- Training (regression) is starting for {EPOCHS} epochs ----------------------------")
start_time = time.time()
bcw_history = bcw_model.fit(
    bcwX_train_proc,
    bcwY_train,
    validation_data=(bcwX_val_proc, bcwY_val),
    epochs=EPOCHS,
    batch_size=32,
    callbacks=callbacks,
    verbose=VERBOSE,
)
```
As before, we will train our model `bcw_model` for `200` epochs unless the `EarlyStopping` callback kicks in sooner.

In [None]:
# Example 7: Construct, Compile and Train Regression Neural Network

from __future__ import annotations

import os
import time
import random
import numpy as np
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import (
    EarlyStopping,
    ModelCheckpoint,
    ReduceLROnPlateau,
)

# ---------------------------------------------------------------------------
# 1️⃣ Define parameters
# ---------------------------------------------------------------------------
EPOCHS = 200
PATIENCE = 20
VERBOSE = 2
lr = 0.0010
OPTIMIZER = Adam(learning_rate=lr)


# ---------------------------------------------------------------------------
# 2️⃣ Build and compile regression model
# ---------------------------------------------------------------------------
inputs = Input(shape=(bcwX.shape[1],))
x = Dense(25, activation="relu")(inputs)
x = Dropout(0.2)(x)
x = Dense(50, activation="relu")(x)
outputs = Dense(1)(x)  # Single output for regression

# Create model
bcw_model = Model(inputs, outputs)

# Compile model
bcw_model.compile(
    loss="mean_squared_error",  # or "mae"
    optimizer=OPTIMIZER,
    metrics=["mae"],  # Mean Absolute Error
)

# ---------------------------------------------------------------------------
# 3️⃣ Add Callbacks
# ---------------------------------------------------------------------------
checkpoint_path = "bcw_best_regression_model.keras"
callbacks = [
    EarlyStopping(
        monitor="val_loss", patience=PATIENCE, restore_best_weights=True
    ),
    ModelCheckpoint(
        filepath=checkpoint_path,
        monitor="val_loss",
        save_best_only=True,
    ),
    ReduceLROnPlateau(
        monitor="val_loss", factor=0.5, patience=PATIENCE, verbose=VERBOSE
    ),
]

# ---------------------------------------------------------------------------
# 4️⃣ Train model
# ---------------------------------------------------------------------------
print(f"-- Training (regression) is starting for {EPOCHS} epochs ----------------------------")
start_time = time.time()
bcw_history = bcw_model.fit(
    bcwX_train_proc,
    bcwY_train,
    validation_data=(bcwX_val_proc, bcwY_val),
    epochs=EPOCHS,
    batch_size=32,
    callbacks=callbacks,
    verbose=VERBOSE,
)


# ---------------------------------------------------------------------------
# 5️⃣ Inspect training
# ---------------------------------------------------------------------------
print("\nTraining complete.")
print("Best validation MAE:", min(bcw_history.history["val_mae"]))

elapsed_time = time.time() - start_time
print(f"Elapsed time: {hms_string(elapsed_time)}")


If the code is correct you should see something _similar_ to the following final output from the training

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image22B.png)  

In the example above, EarlyStopping terminated the training at `Epoch 87` after only `18` seconds of training.
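
To make the role of `patience` concrete, here is a minimal pure-Python sketch of the stopping rule. It is a simplification, not the Keras implementation: it treats any strict decrease in validation loss as an improvement and ignores options such as `min_delta`. The function name `simulate_early_stopping` and the loss values are made up for illustration.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based.

    Simplified rule: stop once `patience` consecutive epochs
    have passed without a new lowest validation loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Ran through every epoch without triggering the stop
    return len(val_losses), best_epoch


# Made-up validation losses, for illustration only
toy_val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.56]
stop_epoch, best_epoch = simulate_early_stopping(toy_val_losses, patience=3)
print(f"Stopped at epoch {stop_epoch}; best epoch was {best_epoch}")
# Stopped at epoch 7; best epoch was 4
```

Under this simplified rule, a run that stops at `Epoch 87` with a patience of `20` would have reached its lowest validation loss around epoch 67, which is the epoch whose weights `restore_best_weights=True` would bring back.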

### **Exercise 7: Construct, Compile and Train Regression Neural Network**

In the cell below write the code to construct, compile and train a regression neural network called `hd_model` on the Heart Disease dataset using the `X-` and `Y-` feature vectors that you created in **Exercise 6**.

**Code Hints:**

Change the prefix `bcw` to `hd` everywhere in the code that you copied from Example 7.

In [None]:
# Insert your code for Exercise 7 here



If the code is correct you should see something _similar_ to the following final output from the training

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image23B.png)  

### Example 8: Evaluate the Model

The code in the cell below computes the Mean Absolute Error (MAE) for the `bcw_model` in both raw units (the scaled values the model was trained on) and original units (mm<sup>2</sup>, as the data were before scaling).
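
An error is a difference between two values, so when it is converted between scaled and original units the scaler's offset cancels and only its scale factor matters. The sketch below illustrates this with standard-library Python only. It assumes the target was standardized (mean removed, then divided by the standard deviation), which may not match how `y_scaler` was configured earlier in the notebook, and all numbers are made up.

```python
from statistics import mean, pstdev

# Made-up tumor areas (mm^2) and predictions, for illustration only
actual = [400.0, 550.0, 700.0, 950.0]
predicted = [450.0, 500.0, 760.0, 900.0]

mu = mean(actual)        # offset removed by standardization
sigma = pstdev(actual)   # population standard deviation (the scale factor)


def scale(values):
    """Standardize values using the mean and std of `actual`."""
    return [(v - mu) / sigma for v in values]


def mae(a, b):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


mae_original = mae(actual, predicted)
mae_scaled = mae(scale(actual), scale(predicted))

print(f"MAE computed in original units: {mae_original:.2f}")
print(f"Scaled MAE multiplied by std:   {mae_scaled * sigma:.2f}")
```

Both lines print the same value: multiplying the scaled MAE by the scale factor recovers the MAE in original units, with no need to add the mean back.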

In [None]:
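# Example 8: Evaluate the model

# evaluate() returns [loss, mae] because the model was compiled with metrics=["mae"]
val_metrics = bcw_model.evaluate(bcwX_val_proc, bcwY_val, verbose=0)
mae_scaled = val_metrics[1]
print(f"\nValidation MAE (raw units) = {mae_scaled:.4f}")

# Convert MAE back to original units before scaling.
# An error is a difference, so only the scaler's scale applies, not its offset:
# subtracting the inverse-transformed zero cancels the offset for any linear scaler.
endpoints = y_scaler.inverse_transform(np.array([[0.0], [mae_scaled]]))
mae_raw = abs(endpoints[1, 0] - endpoints[0, 0])
print(f"Validation MAE (original units) = {mae_raw:.4f}")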


If the code is correct you should see something _similar_ to the following output

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image29B.png)  

As shown in the `Appendix` (below), the `mean_area` of the breast cancer tumors was 654.9 mm<sup>2</sup>. The `Validation MAE` in original units is about 140 mm<sup>2</sup>. This means the ability of the `bcw_model` to predict tumor areas is **reasonable but not excellent**. The model is better than a “predict-the-mean” baseline, but it is likely to produce clinically relevant errors for many downstream tasks.
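
To put the MAE in context, it can help to express it as a fraction of the mean. The sketch below uses the two approximate figures quoted above. Both are rounded, so treat the result as indicative only; the helper name `relative_mae` is made up for this example.

```python
def relative_mae(mae, mean_value):
    """MAE expressed as a fraction of the mean of the target."""
    return mae / mean_value


mean_area = 654.9   # mm^2, from the Appendix
val_mae = 140.0     # mm^2, approximate value quoted in the text

rel = relative_mae(val_mae, mean_area)
print(f"Relative MAE: {rel:.1%}")
# Relative MAE: 21.4%
```

A typical error of roughly one fifth of the average tumor area supports the "reasonable but not excellent" reading.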


### **Exercise 8: Evaluate the Model**

In the cell below write the code to compute the MAE for the ability of your `hd_model` to predict maximum heart rate (`MaxHR`).

In [None]:
# Insert your code for Exercise 8 here




If the code is correct you should see something _similar_ to the following output

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image28B.png)  

The Mean Absolute Error (MAE) is about 17 bpm (beats per minute). Since the average maximum heart rate in the Heart Disease dataset is 137 bpm (see Appendix below), here is what you can conclude about the accuracy of your regression `hd_model`.

**Baseline performance**
If you had simply predicted the mean (137 bpm) for every person, the MAE on the original scale would equal the mean absolute deviation of the target.
The Appendix notes that most `MaxHR` values fall between 110 and 160 bpm, so this baseline is well below 50 bpm and is probably closer to 20 bpm.
Thus **17 bpm is only a modest improvement** over the naïve baseline.

**Comparing to typical error tolerances**
As a rough rule of thumb, a relative MAE below 5-10% is often considered “excellent,” while 10-20% is “acceptable.”
Your 12% (17 / 137) falls in the **acceptable range**.
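
Rather than estimating the baseline, you can compute it directly. The sketch below shows the idea in plain Python. The `MaxHR` values are made up, so substitute your own validation targets in original units; the function name `mean_baseline_mae` is a hypothetical helper, not part of the notebook.

```python
def mean_baseline_mae(values):
    """MAE obtained by predicting the mean of `values` for every sample."""
    center = sum(values) / len(values)
    return sum(abs(v - center) for v in values) / len(values)


# Made-up MaxHR values (bpm), for illustration only
toy_max_hr = [110, 120, 130, 140, 150, 160]
baseline = mean_baseline_mae(toy_max_hr)

model_mae = 17.0      # bpm, approximate value quoted in the text
mean_max_hr = 137.0   # bpm, from the Appendix

print(f"Baseline MAE on toy data: {baseline:.1f} bpm")
print(f"Relative MAE of the model: {model_mae / mean_max_hr:.1%}")
# Baseline MAE on toy data: 15.0 bpm
# Relative MAE of the model: 12.4%
```

A model is only useful if its MAE is clearly below this baseline when both are computed on the same data.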




### Example 9: Plot Predicted vs Actual

The code in the cell below plots the mean tumor area predicted by the regression `bcw_model` against the actual mean tumor area using the common Python plotting library `matplotlib.pyplot`.

In [None]:
# Example 9: Plot Predicted vs Actual

import numpy as np
import matplotlib.pyplot as plt

# Predict on validation set
y_pred_scaled = bcw_model.predict(bcwX_val_proc)

# Inverse transform predictions and actual values
y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1)).ravel()
# np.asarray() lets this work whether bcwY_val is a NumPy array or a pandas Series
y_true = y_scaler.inverse_transform(np.asarray(bcwY_val).reshape(-1, 1)).ravel()

# Plot predicted vs actual
plt.figure(figsize=(8, 6))
plt.scatter(y_true, y_pred, alpha=0.6, color='blue', edgecolor='k')
plt.plot([y_true.min(), y_true.max()], [y_true.min(), y_true.max()], 'r--', lw=2, label='Perfect Prediction')
plt.xlabel("Actual Mean Area")
plt.ylabel("Predicted Mean Area")
plt.title("Predicted vs Actual Mean Area")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()


If the code is correct you should see something _similar_ to the following output

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image30B.png)  

Based on the scatter plot of Predicted vs. Actual Mean Area in the Wisconsin Breast Cancer dataset, we can conclude the following:

#### **Model Performance Summary:**
* **Strong alignment with the red dashed line** (Perfect Prediction) indicates that the model's predictions are very close to the actual values.
* The **tight clustering of blue dots** around the line suggests **low error and high accuracy**.
* There are no major outliers or systematic deviations, which implies the model generalizes well across the dataset.

#### **Conclusion:**
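The scatter plot suggests the regression model captures the overall trend in the `mean_area` of tumors. Read it together with the MAE from Example 8: points can look tightly clustered across a wide axis range while individual errors remain large enough to matter, so the plot alone does not show that the model is ready for clinical or diagnostic support settings where accurate tumor size estimation is important.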
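
A visual impression of "tight clustering" can be backed by a number. One common choice is the coefficient of determination, R², which is 1.0 for perfect predictions and 0.0 for a model no better than predicting the mean. The sketch below computes it in plain Python on made-up values; the notebook itself does not compute R², so this is an optional addition.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up actual and predicted areas (mm^2), for illustration only
toy_true = [400.0, 550.0, 700.0, 950.0]
toy_pred = [450.0, 500.0, 760.0, 900.0]

print(f"R^2 = {r_squared(toy_true, toy_pred):.3f}")
# R^2 = 0.933
```

To apply it to the plot, you could pass the `y_true` and `y_pred` arrays from Example 9 to the same function.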

### **Exercise 9: Plot Predicted vs Actual**

In the cell below write the code needed to plot the Maximum Heart Rate (`MaxHR`) predicted by your regression `hd_model` vs the Actual `MaxHR` using the common Python plotting library `matplotlib.pyplot`.

In [None]:
# Insert your code for Exercise 9 here



If the code is correct you should see something _similar_ to the following output

![_ _](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image31B.png)  

Based on the scatter plot of **Predicted vs Actual MaxHR** from the Heart Disease dataset, here’s what we can conclude:

#### **Model Performance Insights:**
* The **red dashed line** represents **perfect predictions** — where predicted values exactly match actual values.
* The **blue dots** (data points) show a **noticeable spread** around this line, indicating that the model's predictions are not consistently accurate.
* Some predictions are close to the line, but many are **significantly off**, suggesting **moderate to high error**.

#### **Implications:**
* The model may be **underfitting** — not capturing enough complexity in the data.
* There could be **missing or weakly predictive features** in the input data.

The variability in prediction accuracy suggests the model might benefit from:
* A more expressive architecture
* Feature engineering
* Hyperparameter tuning
* Possibly ensemble methods or alternative regression algorithms
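
One rough way to tell underfitting from overfitting is to compare the training error, the validation error, and the predict-the-mean baseline. The sketch below is a heuristic only. The function `diagnose_fit` and both thresholds (`gap_ratio` and `near_baseline`) are arbitrary choices made for this illustration, and the example numbers are made up.

```python
def diagnose_fit(train_mae, val_mae, baseline_mae,
                 gap_ratio=1.25, near_baseline=0.9):
    """Rough heuristic with arbitrary, illustrative thresholds."""
    # Validation error far above training error: the model memorized the training set
    if val_mae > gap_ratio * train_mae:
        return "possible overfitting"
    # Training error barely beats the mean baseline: the model learned little
    if train_mae >= near_baseline * baseline_mae:
        return "possible underfitting"
    return "no obvious fit problem"


# Made-up MAE values in scaled units, for illustration only
print(diagnose_fit(train_mae=0.78, val_mae=0.80, baseline_mae=0.80))
print(diagnose_fit(train_mae=0.30, val_mae=0.60, baseline_mae=0.80))
print(diagnose_fit(train_mae=0.40, val_mae=0.45, baseline_mae=0.80))
```

With these made-up numbers the three calls report possible underfitting, possible overfitting, and no obvious problem, in that order. For your own model, the training and validation MAE per epoch are available in the history object returned by `fit()`.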

# **Lesson Turn-in**

When you have completed and run all of the code cells, use **File --> Print... --> Save to PDF** to generate a PDF of your Colab notebook. Save your PDF as `Class_02_2_lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas.

## Appendix

The code in the cells below uses the Pandas method `Series.describe()` to print a statistical summary of the column `mean_area` in the `Wisconsin Breast Cancer` dataset and the column `MaxHR` in the `Heart Disease` dataset.

In [None]:
bcwDF['mean_area'].describe()

The **mean_area** variable shows a wide range of tumor sizes with a positively skewed distribution. This suggests that while most tumors are relatively small to moderately sized, there are a few with very large areas that could be clinically significant and may warrant further investigation.

In [None]:
hdDF['MaxHR'].describe()

The **MaxHR** variable shows a relatively symmetric distribution with a moderate spread. Most individuals have a maximum heart rate between **110** and **160**, but there are a few with very high values (up to 202), which could be outliers or clinically significant cases.

## **Poly-A Tail**

## **UNIVAC**

![___](https://upload.wikimedia.org/wikipedia/commons/2/2f/Univac_I_Census_dedication.jpg)

**UNIVAC (Universal Automatic Computer)** was a line of electronic digital stored-program computers starting with the products of the Eckert–Mauchly Computer Corporation. Later the name was applied to a division of the Remington Rand company and successor organizations.

### **Historical Overview of UNIVAC**

**UNIVAC** (Universal Automatic Computer) was the first commercially produced digital computer in the United States. It was designed primarily for business and administrative use, marking a significant shift from earlier computers that were mostly used for scientific and military purposes.

#### Key Milestones

- **1946–1951**: Developed by **J. Presper Eckert** and **John Mauchly**, the creators of the ENIAC, under the company **Eckert-Mauchly Computer Corporation**.
- **1951**: The first UNIVAC I was delivered to the **U.S. Census Bureau**.
- **1952**: UNIVAC I gained national attention when it successfully predicted the outcome of the U.S. presidential election on live television, favoring Eisenhower over Stevenson.

#### Technical Specifications

- **Memory**: Used mercury delay lines for memory storage.
- **Storage**: Featured magnetic tape for data storage, a novel concept at the time.
- **Speed**: Could perform about 1,000 calculations per second.
- **Size**: Occupied over 35 square meters and weighed approximately 13 tons.

#### Impact and Legacy

- UNIVAC I was the first computer to be widely used for **business applications**, including payroll, inventory, and accounting.
- It helped establish the **commercial computer industry**, paving the way for companies like IBM to enter the market.
- The name "UNIVAC" became synonymous with "computer" in the 1950s and early 1960s.

#### Fun Fact

The UNIVAC I's prediction of the 1952 election was so unexpected that CBS initially hesitated to air it. The prediction turned out to be accurate, boosting public confidence in computing technology.

---

> UNIVAC represents a pivotal moment in computing history, transitioning from experimental machines to practical tools that shaped modern data processing.


The BINAC, built by the Eckert–Mauchly Computer Corporation, was the first general-purpose computer for commercial use, but it was not a success. The last UNIVAC-badged computer was produced in 1986.

### **History and structure**

J. Presper Eckert and John Mauchly built the ENIAC (Electronic Numerical Integrator and Computer) at the University of Pennsylvania's Moore School of Electrical Engineering between 1943 and 1946. A 1946 patent rights dispute with the university led Eckert and Mauchly to depart the Moore School to form the Electronic Control Company, later renamed Eckert–Mauchly Computer Corporation (EMCC), based in Philadelphia, Pennsylvania. That company first built a computer called BINAC (BINary Automatic Computer) for Northrop Aviation (which was little used, or perhaps not at all). Afterwards, the development of UNIVAC began in April 1946.[1] UNIVAC was first intended for the Bureau of the Census, which paid for much of the development, and then was put in production.

With the death of EMCC's chairman and chief financial backer Henry L. Straus in a plane crash on October 25, 1949, EMCC was sold to typewriter, office machine, electric razor, and gun maker Remington Rand on February 15, 1950. Eckert and Mauchly now reported to Leslie Groves, the retired army general who had previously managed building The Pentagon and led the Manhattan Project.

The most famous UNIVAC product was the UNIVAC I mainframe computer of 1951, which became known for predicting the outcome of the U.S. presidential election the following year: this incident is noteworthy because the computer correctly predicted an Eisenhower landslide over Adlai Stevenson, whereas the final Gallup poll had Eisenhower winning the popular vote 51–49 in a close contest.

The prediction led CBS's news boss in New York, Siegfried Mickelson, to believe the computer was in error, and he refused to allow the prediction to be read. Instead, the crew showed some staged theatrics that suggested the computer was not responsive, and announced it was predicting 8–7 odds for an Eisenhower win (the actual prediction was 100–1 in his favour).

When the predictions proved true—Eisenhower defeated Stevenson in a landslide, with UNIVAC coming within 3.5% of his popular vote total and four votes of his Electoral College total—Charles Collingwood, the on-air announcer, announced that they had failed to believe the earlier prediction.

The United States Army requested a UNIVAC computer from Congress in 1951. Colonel Wade Heavey explained to the Senate subcommittee that the national mobilization planning involved multiple industries and agencies: "This is a tremendous calculating process...there are equations that can not be solved by hand or by electrically operated computing machines because they involve millions of relationships that would take a lifetime to figure out." Heavey told the subcommittee it was needed to help with mobilization and other issues similar to the invasion of Normandy that were based on the relationships of various groups.

The UNIVAC was manufactured at Remington Rand's former Eckert-Mauchly Division plant on W Allegheny Avenue in Philadelphia, Pennsylvania. Remington Rand also had an engineering research lab in Norwalk, Connecticut, and later bought Engineering Research Associates (ERA) in St. Paul, Minnesota. In 1953 or 1954 Remington Rand merged their Norwalk tabulating machine division, the ERA "scientific" computer division, and the UNIVAC "business" computer division into a single division under the UNIVAC name. This severely annoyed those who had been with ERA and with the Norwalk laboratory.

In 1955 Remington Rand merged with Sperry Corporation to become Sperry Rand. General Douglas MacArthur, then the chairman of the Board of Directors of Remington Rand, was chosen to continue in that role in the new company. Harry Franklin Vickers, then the President of Sperry Corporation, continued as president and CEO of Sperry Rand. The UNIVAC division of Remington Rand was renamed the Remington Rand Univac division of Sperry Rand. William Norris was put in charge as Vice-President and General Manager reporting to the President of the Remington Rand Division (of Sperry Rand).

## **UNIVAC - A Quick Snapshot**

| Topic | Key Facts |
|-------|-----------|
| **Founded** | 1946 by J. Presper Eckert & John Mauchly (the inventors of the ENIAC) as the Electronic Control Company, later renamed Eckert–Mauchly Computer Corporation |
| **Full Name** | *UNIVersal Automatic Computer* |
| **First Product** | **UNIVAC I** – the first commercially produced digital computer in the United States |
| **First U.S. Government Use** | 1951: The first UNIVAC I was delivered to the U.S. Census Bureau, which had paid for much of its development |
| **First Commercial Sales** | 1954: General Electric is generally cited as the first private business to install a UNIVAC I |
| **Notable Programs** | 1952: Predicted Eisenhower's landslide win over Stevenson on election night for CBS |
| **Mass‑Market Success** | 1950s: UNIVAC I was widely used for business applications such as payroll, inventory, and accounting |
| **Key Innovations** | • Mercury delay-line memory<br>• Magnetic tape for data storage, a novel concept at the time<br>• Later models such as the **UNIVAC 1107** (1962) moved from vacuum tubes to transistors |
| **Corporate Evolution** | 1950: EMCC sold to Remington Rand<br>1955: Remington Rand merged with Sperry Corporation → **Sperry Rand**<br>1986: Sperry merged with Burroughs to form **Unisys** |
| **Legacy** | • The name “UNIVAC” became synonymous with “computer” in the 1950s and early 1960s<br>• Helped establish the **commercial computer industry**<br>• Paved the way for the **computer revolution** in business and government |
| **Fun Trivia** | • CBS at first refused to air UNIVAC I's 1952 prediction and announced 8–7 odds for Eisenhower; the actual prediction was 100–1<br>• UNIVAC I weighed approximately 13 tons and occupied over 35 square meters<br>• The last UNIVAC-badged computer was produced in 1986 |

---

## Quick Takeaway

UNIVAC didn’t just build computers; it helped build the **foundation of modern commercial computing**. From the first commercially produced digital computer in the United States to the transistorized mainframes that followed, its machines moved computing out of the laboratory and into business and government. The legacy lives on in **Unisys**, the company formed when Sperry merged with Burroughs in 1986.
