<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/master/Assignment_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

## **Assignment 01**

### **Objectives**

* Analyze a new data set by applying tools of previous lessons.
* Further develop critical thinking skills.



### Blood Sample Dataset

[Blood Sample Dataset](https://www.kaggle.com/datasets/ehababoelnaga/multiple-disease-prediction?resource=download/)

**Description:**

This dataset is for the prediction of human diseases based on blood sample values and a panel of clinical assessments taken from 1552 subjects. The last column, `Disease`, lists five disease categories:

* **Anemia:** Anemia is the condition of not having enough healthy red blood cells or hemoglobin to carry oxygen to the body's tissues. 
* **Diabetes:** Diabetes is a chronic (long-lasting) condition that affects how your body turns food into energy. With diabetes, your body doesn’t make enough insulin or can’t use it as well as it should. When there isn’t enough insulin or cells stop responding to insulin, too much blood sugar stays in your bloodstream.
* **Healthy:** No apparent medical condition.
* **Thalasse:** Thalassemia is an inherited blood disorder that inhibits the production of the protein hemoglobin.  
* **Thromboc:** Thrombocytopenia is the medical term for a low blood platelet count. Normal platelet counts for adults are between 150,000 and 450,000 platelets per microliter (uL) of blood. Thrombocytopenia is a platelet count below 150,000.

**Key Features of the dataset:**

The following are the attributes of the Blood Sample dataset:

* **Cholesterol:** This is the level of cholesterol in the blood, measured in milligrams per deciliter (mg/dL).
* **Hemoglobin:** This is the protein in red blood cells that carries oxygen from the lungs to the rest of the body
* **Platelets:** Platelets are blood cells that help with clotting
* **White Blood Cells (WBC):** These are cells of the immune system that help fight infections
* **Red Blood Cells (RBC):** These are the cells that carry oxygen from the lungs to the rest of the body
* **Hematocrit:** This is the percentage of blood volume that is occupied by red blood cells
* **Mean Corpuscular Volume (MCV):** This is the average volume of red blood cells
* **Mean Corpuscular Hemoglobin (MCH):** This is the average amount of hemoglobin in a red blood cell
* **Mean Corpuscular Hemoglobin Concentration (MCHC):** This is the average concentration of hemoglobin in a red blood cell
* **Insulin:** This is a hormone that helps regulate blood sugar levels
* **BMI (Body Mass Index):** This is a measure of body fat based on height and weight
* **Systolic Blood Pressure (SBP):** This is the pressure in the arteries when the heart beats
* **Diastolic Blood Pressure (DBP):** This is the pressure in the arteries when the heart is at rest between beats
* **Triglycerides:** These are a type of fat found in the blood, measured in milligrams per deciliter (mg/dL)
* **HbA1c (Glycated Hemoglobin):** This is a measure of average blood sugar levels over the past two to three months
* **LDL (Low-Density Lipoprotein) Cholesterol:** This is the "bad" cholesterol that can build up in the arteries
* **HDL (High-Density Lipoprotein) Cholesterol:** This is the "good" cholesterol that helps remove LDL cholesterol from the arteries
* **ALT (Alanine Aminotransferase):** This is an enzyme found primarily in the liver
* **AST (Aspartate Aminotransferase):** This is an enzyme found in various tissues including the liver and heart
* **Heart Rate:** This is the number of heartbeats per minute (bpm)
* **Creatinine:** This is a waste product produced by muscles and filtered out of the blood by the kidneys
* **Troponin:** This is a protein released into the bloodstream when there is damage to the heart muscle
* **C-reactive Protein (CRP):** This is a marker of inflammation in the body
* **Disease:** This indicates which disease category, if any, the subject falls into

## **Assignment Setup**

Run the next cell to load the necessary Python packages for this lesson.

In [None]:
# You MUST run this code cell

# Import Tensorflow and Keras
import tensorflow as tf
import io
from sklearn import metrics
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import load_model

# Import Numpy and Pandas
import numpy as np
import pandas as pd

# Import Utilities
import os
import shutil
path = '/'
memory = shutil.disk_usage(path)
dirpath = os.getcwd()
print("Your current working directory is : " + dirpath)
print("Disk", memory) 
print("Tensorflow version =", (tf.__version__))
print("Available GPU acceleration =", tf.test.gpu_device_name())

If you received an error message after running the above cell, it probably means one (or more) Python packages are not installed. Contact your instructor if you need assistance.

### Google CoLab Instructions

The following code ensures that Google CoLab is running the correct version of TensorFlow.

In [None]:
# You must run this cell
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: using Google CoLab")
    %tensorflow_version 2.x
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("Note: not using Google CoLab")
    COLAB = False

## **Part I: Neural Network for Disease Classification**

The goal of Part I is to build and assess the accuracy of a deep neural network that is able to predict the health status of a patient based on his/her blood sample values and the results obtained from a panel of clinical assessments.

Part I has been subdivided into a series of logical steps to help guide your coding.


### Part I - Step 1: Read the dataset 

In the cell below, write the Python code to read the Blood Sample dataset, `Blood_samples.csv`, located on the course fileserver. Create a new DataFrame called `bsDF`. 

You can use this code snippet to read the datafile:

~~~text
bsDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/Blood_samples.csv", 
    na_values=['NA', '?'])
~~~
    
Set the display options to 8 columns and 8 rows and display the contents of `bsDF`.
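
As a minimal sketch of how the display limits work, the cell below applies them to a made-up DataFrame. The name `toyDF` and its values are hypothetical stand-ins, not part of the Blood Sample dataset.

```python
import pandas as pd

# Hypothetical toy DataFrame (20 rows, 12 columns) standing in for real data
toyDF = pd.DataFrame({f"col{i}": range(20) for i in range(12)})

# Limit the printed output to at most 8 columns and 8 rows
pd.set_option('display.max_columns', 8)
pd.set_option('display.max_rows', 8)

# The middle rows and columns are elided with "..." in the printed output
print(toyDF)
```

The same two `pd.set_option()` calls should work unchanged before displaying `bsDF`.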

In [None]:
# Insert your code for Part I - Step 1 here




If your code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/Assignment_01_Step1.png)

In [None]:
# Insert your code for Part 1-Step 2 here 



If your code is correct, you should see all 25 column names.

### Part I - Step 2: Print out a statistical summary 

In the cell below, use the Pandas DataFrame method `describe()` (for example, `bsDF.describe()`) to print out a statistical summary of the Blood Sample dataset.
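
For orientation, here is a small sketch of what a statistical summary looks like on a made-up DataFrame. The name `toyDF` and its numbers are hypothetical. Only the column labels are borrowed from the feature list above.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns
toyDF = pd.DataFrame({
    "Hemoglobin": [0.2, 0.4, 0.6, 0.8],
    "Platelets": [0.1, 0.3, 0.5, 0.7],
})

# describe() is a DataFrame method: it reports count, mean, std,
# min, quartiles and max for every numeric column
summary = toyDF.describe()
print(summary)
```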

In [None]:
# Insert your code for Part I - Step 2 here 



### Part I - Step 3: Shuffle and Reindex the DataFrame

In the cell below, write the code to shuffle and reindex the DataFrame `bsDF`. Set the random seed to `420` before shuffling. Set the display options to 8 rows and 8 columns and print out the shuffled and reindexed DataFrame.

(HINT: This topic was covered in Class_02_3, Example 2)
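
One common way to shuffle rows is to reindex the DataFrame with a random permutation of its own index. The sketch below shows this on a hypothetical `toyDF`. It assumes this is the approach the class example used, which may not match it exactly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with a default 0..4 index
toyDF = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

# Seed NumPy's random generator so the shuffle is repeatable
np.random.seed(420)

# Reorder the rows using a random permutation of the index labels
shuffledDF = toyDF.reindex(np.random.permutation(toyDF.index))
print(shuffledDF)
```

Each row keeps its original index label, so the labels appear in a random order on the left. Calling `reset_index(drop=True)` afterwards would renumber them instead.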

In [None]:
# Insert your code for Part I - Step 3 here 



If your code is correct, you should see the same DataFrame table shown in Step 1 above, but with the index values on the left side in a random order.

### Part I - Step 4: Create independent and dependent variables

In the cell below, write the code to create an independent variable called `bsX` and a dependent variable called `bsY`. 

The independent variable, `bsX` should contain the clinical data from **all** of the columns in the DataFrame `bsDF`, _except_ the very last column, `Disease`. In other words, there should be 24 column names in the definition of `bsX`. 

Here is a simple way to generate the variable `bsX`. First, create a list called `bsX_columns` using the following code snippet:

`bsX_columns= bsDF.columns.drop('Disease') `

If you printed out the variable `bsX_columns` you would see that it contained all of the column names in the DataFrame `bsDF`, except the column name `Disease`.

The next step is to use `bsX_columns` to generate our Numpy array, `bsX` using the following code snippet:

`bsX = bsDF[bsX_columns].values`

The last column, `Disease`, is the dependent or response variable for your neural network. Use this column to create `bsY`. Since this column contains string variables (i.e. the names of diseases) you will have to `One-Hot Encode` it. As part of creating `bsY`, create a variable called `diseaseLst` from the `dummies.columns`. Print out the values in `diseaseLst` using the command `print(*diseaseLst)`. By adding an asterisk `*` before the list variable, only the names in the list are printed out.
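
The following is a minimal sketch of One-Hot Encoding with `pd.get_dummies()` on a made-up column. The names `toyDF`, `toyLst` and `toyY` are hypothetical. The `dtype=float` argument is an assumption added here so that the encoded array holds numbers rather than booleans.

```python
import pandas as pd

# Hypothetical toy labels (not the real dataset)
toyDF = pd.DataFrame({"Disease": ["Anemia", "Healthy", "Diabetes", "Healthy"]})

# One column per category, with a 1.0 marking each row's category
dummies = pd.get_dummies(toyDF["Disease"], dtype=float)

toyLst = dummies.columns   # category names, in sorted order
toyY = dummies.values      # NumPy array of shape (rows, categories)

# The asterisk unpacks the names so they print without list formatting
print(*toyLst)
print(toyY)
```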

In [None]:
# Insert your code for Part I - Step 4 here 



If your code is correct you should see the following output:

### Part I - Step 5: Split the dataset and build the neural network

In the cell below, start by splitting the independent variable, `bsX` and the dependent variable, `bsY`, into train and test datasets with 80% of the data going into the training data and the remaining 20% going into the test (validation) data. Set the variable `random_state` to the value `420`.
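
As an illustration of an 80/20 split, the sketch below calls `train_test_split()` on made-up arrays. All `toy...` names are hypothetical. The arrays are built so that each row of `toyX` can be matched to its `toyY` value after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with 2 features each; row i is [2i, 2i+1]
toyX = np.arange(200, dtype=float).reshape(100, 2)
toyY = np.arange(100, dtype=float)

# test_size=0.20 keeps 80% for training; random_state makes it repeatable
toyX_train, toyX_test, toyY_train, toyY_test = train_test_split(
    toyX, toyY, test_size=0.20, random_state=420)

print(toyX_train.shape, toyX_test.shape)
```

The rows are shuffled before splitting, but the X and Y arrays are shuffled together, so each sample stays paired with its own target value.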

Then build a _classification_ neural network called `bsModel` with **3 hidden layers**. The 1st hidden layer should have `80` neurons, the 2nd layer `40` neurons and the 3rd layer `20` neurons. Compile your neural network and print out a summary of `bsModel`.

In [None]:
# Insert your code for Part I - Step 5 here 




If your code is correct, you should see the following output:

~~~text
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 80)                2000      
                                                                 
 dense_1 (Dense)             (None, 40)                3240      
                                                                 
 dense_2 (Dense)             (None, 20)                820       
                                                                 
 dense_3 (Dense)             (None, 5)                 105       
                                                                 
=================================================================
Total params: 6,165
Trainable params: 6,165
Non-trainable params: 0

~~~
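
The parameter counts in the summary can be checked by hand: each Dense layer has one weight per input-output pair plus one bias per output neuron. The short sketch below repeats that arithmetic for the layer sizes used here (24 input columns, then 80, 40, 20 and 5 neurons). The variable names are hypothetical.

```python
# Layer widths: 24 inputs, three hidden layers, 5 output classes
layer_sizes = [24, 80, 40, 20, 5]

# Parameters per Dense layer = inputs * outputs (weights) + outputs (biases)
params_per_layer = [n_in * n_out + n_out
                    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(params_per_layer)

print(params_per_layer, total_params)   # [2000, 3240, 820, 105] 6165
```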

### Part I - Step 6: Create Early Stopping monitor and fit the model

In the cell below, use the Keras function `EarlyStopping()` to create a monitor that will stop the training/fitting of your model, `bsModel`, after waiting **_50_ epochs** for an improvement in the `val_loss` value. Set the `min_delta` to `1e-3` and have the monitor restore the best weights after stopping.

Then train/fit `bsModel` for 1000 epochs using the monitor for early stopping. Set the `verbose` variable to a value of `2`.
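
To build intuition for how the waiting period and `min_delta` interact, here is a simplified pure-Python illustration of the idea. It is not the Keras implementation. The function `epochs_until_stop` and the loss values are hypothetical.

```python
def epochs_until_stop(val_losses, patience=50, min_delta=1e-3):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")   # lowest validation loss seen so far
    wait = 0              # epochs since the last real improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # The loss dropped by more than min_delta: reset the counter
            best = loss
            wait = 0
        else:
            # Too small a change (or a worse loss): keep waiting
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up losses that level off after the third epoch
toy_losses = [1.0, 0.8, 0.7, 0.6999, 0.6995, 0.6992, 0.70]
stop_epoch = epochs_until_stop(toy_losses, patience=3, min_delta=1e-3)
print(stop_epoch)   # 6
```

The changes after epoch 3 are all smaller than `min_delta`, so they do not count as improvements and the counter runs out three epochs later.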

In [None]:
# Insert your code for Part I - Step 6 here 



If your code is correct, you should expect the training to stop fairly soon due to **early stopping**.

### Part I - Step 7: Compute the accuracy score

In the cell below, write the code to compute the `accuracy_score` for your classification model, `bsModel`, and print out its value.
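
Because both the predictions and the targets are One-Hot style arrays, each row must first be reduced to a single class index with `np.argmax()`. The sketch below shows this on made-up numbers. All `toy...` names and values are hypothetical.

```python
import numpy as np

# Hypothetical model outputs: one row per sample, one column per class
toyPred = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.3, 0.3, 0.4],
                    [0.6, 0.3, 0.1]])

# Hypothetical One-Hot encoded true labels for the same samples
toyTrue = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1],
                    [0, 1, 0]])

# Convert each row to the index of its largest value
predClasses = np.argmax(toyPred, axis=1)
trueClasses = np.argmax(toyTrue, axis=1)

# Accuracy is the fraction of samples where the two indices agree
toyAccuracy = np.mean(predClasses == trueClasses)
print(f"Accuracy: {toyAccuracy}")   # Accuracy: 0.75
```

Passing the same two index arrays to `metrics.accuracy_score(trueClasses, predClasses)` from scikit-learn should return the same fraction.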

In [None]:
# Insert your code for Part I - Step 7 here



If your code is correct, you should see a very high value for Accuracy.

### Part I - Step 8: _Ad hoc_ prediction

To assess your neural network's utility as a diagnostic tool in a clinical environment, use `bsModel` to make two _ad hoc_ predictions. 

Start by running the next two code cells to create clinical values for two patients `Subject0` and `Subject3`. 

In [None]:
# Run this cell to create data for Subject0

Subject0 = np.array( [[0.1967969 , 0.16542872, 
        0.86688889, 0.1897373 , 0.84452864,
        0.26358894, 0.3063053 , 0.15061818, 
        0.91539114, 0.96827259, 0.42538595, 
        0.74910886, 0.61077082, 0.70906404, 
        0.63093562, 0.20622291, 0.79804346, 
        0.22110572, 0.76061147, 0.5913137,
        0.54413641, 0.49432947, 0.01228667, 
        0.63308778]], dtype=float)
Sub0Diag = "Thalasse" 

In [None]:
# Run this cell to create data for Subject3

Subject3 = np.array( [[0.1071651 , 0.60334057, 
        0.79121548, 0.17883956, 0.71867373,
        0.82576883, 0.75365724, 0.39666911, 
        0.76266705, 0.73267095, 0.23287651, 
        0.11933781, 0.1425914 , 0.47437821,
        0.73136867, 0.4895137 , 0.10267911, 
        0.86103504, 0.60906836, 0.67741541,
        0.97919154, 0.55496046, 0.47242801, 
        0.62168682]], dtype=float)

Sub3Diag = "Thromboc"

### Example code for making _ad hoc_ prediction

The code in the cell below illustrates how to use `bsModel` to predict the disease status of `Subject0`.

In [None]:
# Example for making an ad hoc prediction

# Use the neural network to predict the disease
sub0Pred = bsModel.predict(Subject0)

# Print out the results
print(*diseaseLst)
print(*sub0Pred)
sub0Pred = np.argmax(sub0Pred)
print(f"Model predicts that Subject0 is most likely suffering from: {diseaseLst[sub0Pred]}")
print(f"Subject0 has been diagnosed with:{Sub0Diag} ")

If your code is correct you should see something _similar_ to the following output:

### Part I - Step 8: _Ad hoc_ prediction

Using the example above as a template, use your neural network model to predict the disease status of `Subject3`. 

In [None]:
# Insert your code for Part I - Step 8 here



If your code is correct, your Subject3 should be suffering from `Thromboc`. 

## Part II: Neural Network to Predict C-reactive Protein Levels

The goal of Part II is to build a neural network that can predict the levels of C-reactive protein. Unlike the neural network constructed in Part I, the job of this neural network is **_regression_** instead of classification. Before we start, here is some information about C-reactive protein and why it is of clinical significance.

[C-Reactive Protein (CRP) Test](https://medlineplus.gov/lab-tests/c-reactive-protein-crp-test/#:~:text=A%20c%2Dreactive%20protein%20test,have%20inflammation%20in%20your%20body)

**_What is a c-reactive (CRP) protein test?_**

A c-reactive protein test measures the level of c-reactive protein (CRP) in a sample of your blood. CRP is a protein that your liver makes. Normally, you have low levels of c-reactive protein in your blood. Your liver releases more CRP into your bloodstream if you have inflammation in your body. High levels of CRP may mean you have a serious health condition that causes inflammation. A CRP test can show whether you have inflammation in your body and how much. But the test can't show what's causing the inflammation or which part of your body is inflamed.

A CRP test may be used to help find or monitor inflammation in acute or chronic conditions, including:

* **Infections:** from bacteria or viruses
* **Inflammatory bowel disease:** disorders of the intestines that include Crohn's disease and ulcerative colitis
* **Autoimmune disorders:** such as lupus, rheumatoid arthritis, and vasculitis
* **Lung diseases:** such as asthma

### Part II - Step 1: Re-Read the Blood Sample dataset

In order to separate the data/code between Part I and Part II, re-read the Blood Sample dataset from the course web server and store these values in a new DataFrame called `bs2DF`. 

Use the DataFrame `bs2DF` to create a new variable called `bsGrps` that contains the number of patients in the column `Disease` grouped by their health status. Print out `bsGrps`.

(HINT: This topic was covered in Class_02_3, Example 4)
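
As a sketch of counting rows per category, the cell below groups a made-up DataFrame by its `Disease` column. The names `toyDF` and `toyGrps` are hypothetical, and using `size()` is one option among several. The class example may have used a different call.

```python
import pandas as pd

# Hypothetical toy data (not the real dataset)
toyDF = pd.DataFrame({
    "Disease": ["Healthy", "Anemia", "Healthy", "Diabetes", "Anemia", "Healthy"],
    "Insulin": [0.1, 0.4, 0.2, 0.9, 0.5, 0.3],
})

# Group the rows by category and count how many rows fall into each group
toyGrps = toyDF.groupby("Disease").size()
print(toyGrps)
```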

In [None]:
# Insert your code for Part II - Step 1 here



If your code is correct you should see the following output:

### Part II - Step 2: Create disease list

In the cell below, create a variable called `diseaseLst` that contains a list of the category values in the column `Disease` in the DataFrame `bs2DF`. 

The trick here is to use the `.unique()` method so that the list has only one value from each category. Print out the number of categories on one line, and the names of the categories on the next line.

(HINT: Look at Class_02_2 Example 3--Step 2 for reference)
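
The sketch below shows what `.unique()` returns for a made-up column. The names `toyDF` and `toyLst` are hypothetical.

```python
import pandas as pd

# Hypothetical toy labels containing repeated categories
toyDF = pd.DataFrame(
    {"Disease": ["Healthy", "Anemia", "Healthy", "Diabetes", "Anemia"]})

# unique() keeps one copy of each value, in order of first appearance
toyLst = toyDF["Disease"].unique().tolist()

print(f"Number of categories: {len(toyLst)}")
print(f"Categories: {toyLst}")
```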


In [None]:
# Insert your code for Part II - Step 2 here



If your code is correct, you should see the following output:

### Part II - Step 3: Map disease names to integers

In order to use the string data in column `Disease` as an independent variable, you will need to map these 5 disease names to the integer values, `0` through `4`. 

Set the display to 8 rows and 8 columns and display DataFrame `bs2DF` after mapping.

(HINT: See Class_01_9, Example 8).
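
One way to replace strings with integers is to pass a dictionary to the Series method `map()`. The sketch below does this on made-up data. The dictionary `toyMap` and its integer assignments are hypothetical and chosen arbitrarily. They are not necessarily the mapping expected for `bs2DF`.

```python
import pandas as pd

# Hypothetical toy labels (not the real dataset)
toyDF = pd.DataFrame({"Disease": ["Healthy", "Anemia", "Diabetes", "Healthy"]})

# Dictionary from category name to integer code (arbitrary assignment)
toyMap = {"Anemia": 0, "Diabetes": 1, "Healthy": 2}

# map() looks up every value of the column in the dictionary
toyDF["Disease"] = toyDF["Disease"].map(toyMap)
print(toyDF)
```

Any value missing from the dictionary would become `NaN`, so it is worth checking that every category has an entry.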

In [None]:
# Insert your code for Part II - Step 3 here



If your code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/Assignment_01_Part2Step3.png)

### Part II - Step 4: Shuffle and Reindex the DataFrame

In the cell below, write the code to shuffle and reindex the DataFrame `bs2DF`. Set the random seed to `420` before shuffling. Set the display options to 8 rows and 8 columns and print out the shuffled and reindexed DataFrame.

In [None]:
# Insert your code for Part II - Step 4 here 



If your code is correct, you should see the same DataFrame table shown in Part II Step 3 above, but with the index values on the left side in a random order.

### Part II - Step 5: Create independent and dependent variables

In the cell below, create the independent variable, `bs2X`, that contains the values from **_all of the columns_** in the DataFrame, `bs2DF`, **_except_** the column `C-reactive Protein`. In other words, make sure to include the column `Disease` in `bs2X`, but **not** the column `C-reactive Protein`.

Since you are building a **_regression_** neural network to predict the value of `C-reactive Protein`, use the values in this column as your dependent variable, `bs2Y`.

Split the independent variable, `bs2X` and the dependent variable, `bs2Y`, into train and test datasets with 80% of the data going into the training data and the remaining 20% going into the test (validation) data. Set the variable `random_state` to the value `420`.

Then build a **_regression_** neural network called `bs2Model` with 3 hidden layers. The 1st hidden layer should have 80 neurons, the 2nd layer 40 neurons and the 3rd layer 20 neurons. Compile your neural network and print out a summary of `bs2Model`.

In [None]:
# Insert your code for Part II - Step 5 here 



If your code is correct, the summary of model `bs2Model` should have the following:

~~~text
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_4 (Dense)             (None, 80)                2000      
                                                                 
 dense_5 (Dense)             (None, 40)                3240      
                                                                 
 dense_6 (Dense)             (None, 20)                820       
                                                                 
 dense_7 (Dense)             (None, 1)                 21        
                                                                 
=================================================================
Total params: 6,081
Trainable params: 6,081
Non-trainable params: 0
~~~

### Part II - Step 6: Create Early Stopping monitor and fit the model

In the cell below, use the Keras function `EarlyStopping()` to create a monitor called `bs2Monitor` that will stop the training/fitting of your model, `bs2Model`, after waiting **_50_ epochs** for an improvement in the `val_loss` value. Set the `min_delta` to `1e-3` and have the monitor restore the best weights after stopping.

Then train/fit `bs2Model` for 1000 epochs using the monitor for early stopping. Set the `verbose` variable to a value of `2`.

In [None]:
# Insert your code for Part II - Step 6 here 



If your code is correct, training/fitting should have stopped well before 200 epochs. 

### Part II - Step 7: Compute the Root Mean Square Error (RMSE)

When working with neural networks that perform a regression analysis, it is customary to use the Root Mean Square Error (RMSE) as a measurement of predictive accuracy. In the cell below write the code to compute the RMSE for your `bs2Model` neural network and then print out the result.
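
The RMSE is the square root of the mean of the squared differences between predicted and actual values. The sketch below computes it with NumPy on made-up numbers. All `toy...` names and values are hypothetical. The predictions are given shape `(n, 1)` on the assumption that a single-output Keras model returns them that way.

```python
import numpy as np

# Hypothetical true values, shape (4,)
toyTrue = np.array([0.2, 0.5, 0.9, 0.4])

# Hypothetical predictions, shape (4, 1) like a one-neuron output layer
toyPred = np.array([[0.1], [0.5], [1.0], [0.6]])

# Flatten the predictions so both arrays have the same shape,
# then take the root of the mean squared difference
toyRMSE = np.sqrt(np.mean((toyPred.flatten() - toyTrue) ** 2))
print(f"RMSE: {toyRMSE}")
```

Without the `flatten()` call, NumPy would broadcast the `(4, 1)` and `(4,)` arrays into a 4 x 4 grid of differences and the result would be wrong.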

In [None]:
# Insert your code for Part II - Step 7 here 



If your code is correct you should see something _similar_ to the following output:

### Part II - Step 8: _Ad hoc_ prediction

To assess your neural network's utility as a diagnostic tool in a clinical environment, use `bs2Model` to make two _ad hoc_ predictions. 

Start by running the next two code cells to create clinical values for two patients `Subject163` and `Subject205`. 

In [None]:
# Run this cell to create Subject163

Subject163 = np.array( [[0.14152901, 0.57785393, 
    0.09730975, 0.12706108, 0.22775076,
    0.63645922, 0.40086091, 0.61307206, 
    0.02086084, 0.85518951, 0.44480601, 
    0.55396231, 0.49913998, 0.5690086 , 
    0.89760374, 0.459586  , 0.48598845, 
    0.12536889, 0.77141595, 0.17089377,
    0.35264676, 0.92592426, 0.69949006, 2]], dtype=float)
Sub163C = "0.2939618585350387"

In [None]:
# Run this cell to create Subject205

Subject205 = np.array([[0.58227775, 0.11463251, 
    0.84626618, 0.736968  , 0.48740453,
    0.90404592, 0.39454734, 0.06592621, 
    0.68745776, 0.45989471, 0.07954346, 
    0.78002702, 0.54174906, 0.06648775, 
    0.15177372, 0.53362722, 0.71344132, 
    0.42099619, 0.21107792, 0.61167368,
    0.92816086, 0.14738511, 0.75177812, 0.]], dtype=float)

Sub205C = "0.709262"

### Example code for making an _ad hoc_ prediction

The code in the cell below illustrates how to use `bs2Model` to predict the C-reactive protein level of `Subject163`.

In [None]:
# Example for making an ad hoc prediction

# Make the prediction
Sub163_Pred = bs2Model.predict(Subject163)

# Print out the results
print(f"Model predicts that the C-reactive protein of Subject163 is: {Sub163_Pred}")
print(f"The actual C-reactive protein of Subject163 was: {Sub163C}")

If your code is correct you should see something similar to the following output:

### Part II - Step 8: _Ad hoc_ prediction

Using the example above as a template, use your neural network model to predict the C-reactive Protein level of `Subject205`. 

In [None]:
# Insert your code for Part II - Step 8 here 



If your code is correct, the C-reactive protein of Subject205 should be around `0.7`.

## **Assignment Turn-in**

When you have completed all of the code cells and run them in sequential order (the last code cell should be number 25), use **File --> Print... --> Save to PDF** to generate a PDF of your JupyterLab notebook. Save your PDF as `Assignment_01.lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas.