<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/master/Class_02_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

**Module 2: Machine Learning**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)


### Module 2 Material

* Part 2.1: Pandas DataFrame Operations
* Part 2.2: Categorical Values 
* **Part 2.3: Grouping, Sorting and Shuffling on Pandas**

### Lesson Setup

Run the next code cell to load necessary packages

In [1]:
# You MUST run this code cell first
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore

import shutil
path = '/'
memory = shutil.disk_usage(path)
dirpath = os.getcwd()
print("Your current working directory is : " + dirpath)
print("Disk", memory)

Your current working directory is : C:\Users\David\BIO1173_Test\Class_02_3
Disk usage(total=4000108531712, used=1002413891584, free=2997694640128)


### Google CoLab Instructions

The following code ensures that Google CoLab is running the correct version of TensorFlow.

In [None]:
# You must run this cell second
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: using Google CoLab")
    %tensorflow_version 2.x
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("Note: not using Google CoLab")
    COLAB = False

# Part 2.3: Grouping, Sorting, and Shuffling  

We will take a look at a few ways to affect an entire Pandas data frame. These techniques will allow us to group, sort, and shuffle data sets. These are all essential operations for both data preprocessing and evaluation.


### Datasets for Class_02_3

In this class we will again be using the **_Obesity Prediction_** dataset for the Examples and the **_Heart Disease_** dataset for the Exercises.

### Obesity Prediction Dataset

[Obesity Prediction Dataset](https://www.kaggle.com/datasets/mrsimple07/obesity-prediction)

**Description**

The dataset provides comprehensive information on individuals' demographic characteristics, physical attributes, and lifestyle habits, aiming to facilitate the analysis and prediction of obesity prevalence. It includes key variables such as age, gender, height, weight, body mass index (BMI), physical activity level, and obesity category. 

* **Age:** The age of the individual, expressed in years.
* **Gender:** The gender of the individual, categorized as male or female.
* **Height:** The height of the individual, typically measured in centimeters or inches.
* **Weight:** The weight of the individual, typically measured in kilograms or pounds.
* **BMI:** A calculated metric derived from the individual's weight and height
* **PhysicalActivityLevel:** This variable quantifies the individual's level of physical activity
* **ObesityCategory:** Categorization of individuals based on their BMI into different obesity categories

### Example 1: Read data file and create Pandas DataFrame

The cell below uses the Pandas `read_csv()` function to read the `obesity_prediction.csv` file with the following code chunk:
~~~text
opDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/obesity_prediction.csv",
    na_values=['NA','?'])
~~~
`read_csv()` is the Pandas function for reading CSV files. In the cell below, `read_csv()` takes 2 arguments. The first argument, `"https://biologicslab.co/BIO1173/data/obesity_prediction.csv"`, is a string that gives the location (URL) and name of the file. The second argument, `na_values=['NA','?']`, tells Pandas to treat the strings `NA` and `?` as missing values, which are stored as NaN (Not a Number).
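
The effect of `na_values` is easiest to see on a tiny file. The following is a minimal sketch, not part of the lesson's datasets: it reads a hypothetical three-row CSV from an in-memory string (the names `csv_text` and `demoDF` are illustrative only) and counts the missing values that result.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real data file
csv_text = "Age,BMI\n25,22.5\nNA,31.0\n40,?\n"

# Treat the strings 'NA' and '?' as missing values
demoDF = pd.read_csv(io.StringIO(csv_text), na_values=['NA', '?'])

# Count the missing (NaN) values in each column
missing = demoDF.isna().sum()
print(missing)
```

Because the `?` is converted to NaN instead of being kept as text, the `BMI` column is read as numbers and can be used in calculations.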

As the file is read, Pandas creates a Pandas DataFrame variable called `opDF` to hold the information. 

After reading the datafile into a DataFrame, it is always a good idea to use the function `display()` to print out a specified number of rows and columns to make sure the data was read correctly. The code below sets the maximum number of rows to 6 and the maximum number of columns to 6. 

In [None]:
# Example 1: Read data file and create Pandas DataFrame

# Read the datafile 
opDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/obesity_prediction.csv",
    na_values=['NA','?'])

# Set max rows and max columns
pd.set_option('display.max_rows', 6)
pd.set_option('display.max_columns', 6) 

# Display DataFrame
display(opDF)

If your code is correct, you should see the following output:

![_](https://biologicslab.co/BIO1173/images/class_02_2_Exm1.png)

You can see from looking at the last line of the output, the DataFrame `opDF` has 7 columns and 1000 rows. This means `opDF` has clinical measurements for 1000 patients and for each patient, there are 7 separate clinical measurements.

### Heart Disease Dataset

[Heart Disease Dataset](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction)

**Description**

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of 5 CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

* **Age:** age of the patient [years]
* **Sex:** sex of the patient [M: Male, F: Female]
* **ChestPainType:** chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
* **RestingBP:** resting blood pressure [mm Hg]
* **Cholesterol:** serum cholesterol [mg/dl]
* **FastingBS:** fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
* **RestingECG:** resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
* **MaxHR:** maximum heart rate achieved [Numeric value between 60 and 202]
* **ExerciseAngina:** exercise-induced angina [Y: Yes, N: No]
* **Oldpeak:** oldpeak = ST [Numeric value measured in depression]
* **ST_Slope:** the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
* **HeartDisease:** output class [1: heart disease, 0: Normal]

### **Exercise 1: Read data file into a Pandas DataFrame**

In the cell below use the Pandas `read_csv()` method to read the `heart_disease.csv` file that is located on the course HTTPS server using this code chunk:
~~~text
# Read the datafile 
hdDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/heart_disease.csv",
    na_values=['NA','?'])
~~~
As the file is read, have Pandas create a DataFrame variable called `hdDF` to hold the heart disease dataset. 

Then use the `display()` function to print out 6 rows and 6 columns of `hdDF`.

In [None]:
# Insert your code for Exercise 1 here



If your code is correct, you should see the following output:

![_](https://biologicslab.co/BIO1173/images/class_02_2_Ex1.png)

You can see from the output that your DataFrame `hdDF` has 12 columns of clinical data for 918 subjects (i.e. 918 rows with 1 row/subject).

## Shuffling a Dataset

There may be information lurking in the order of the rows of your dataset. Unless you are dealing with time-series data, the order of the rows should _not_ be significant. However, consider the situation where your training set included patients in a clinical study. Perhaps this dataset is ordered by the age of the patient. It is okay to have an individual column that specifies age but having the data in this order might be problematic.  

Consider if you were to split the data into training and validation sets. You could end up with your validation set having mostly older patients and the training set having mostly younger patients. Separating the data into a k-fold cross validation could have similar problems. Because of these issues, it is important to always shuffle a data set before you begin your analysis.
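
The problem described above can be sketched with made-up numbers. The example below assumes a hypothetical study of 100 patients stored in order of age, and an 80/20 split in which the last 20% of rows become the validation set. The names `demoDF`, `split_gap`, and `shuffledDF` are illustrative and do not appear elsewhere in this lesson.

```python
import numpy as np
import pandas as pd

# Hypothetical study: 100 patients stored in order of age (20 to 80 years)
demoDF = pd.DataFrame({'Age': np.linspace(20, 80, 100)})

def split_gap(df):
    """Mean age of the last 20% of rows minus mean age of the first 80%."""
    cut = int(len(df) * 0.8)
    train, valid = df.iloc[:cut], df.iloc[cut:]
    return valid['Age'].mean() - train['Age'].mean()

# Split the age-ordered data without shuffling
sorted_gap = split_gap(demoDF)

# Shuffle the rows first, then split the same way
rng = np.random.default_rng(0)
shuffledDF = demoDF.iloc[rng.permutation(len(demoDF))]
shuffled_gap = split_gap(shuffledDF)

print(f"Age gap without shuffling: {sorted_gap:.1f} years")
print(f"Age gap after shuffling:   {shuffled_gap:.1f} years")
```

Without shuffling, every patient in the validation set is older than every patient in the training set. After shuffling, both sets draw from the whole age range.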

Shuffling and reindexing are often performed together. **Shuffling** randomizes the order of the rows, but it does _not_ change the Pandas row labels (the index values shown in the first column of the output): each row keeps the label it had before the shuffle. Generally, this will not cause any issues, and it allows tracing back to the original order of the data. However, I usually prefer to reset this index. Typically I do not care about the initial position, and there are a few instances where this unordered index can cause issues. Therefore I recommend that you **reindex** the dataset, that is, renumber the rows starting from 0, as well as shuffle it.
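
The difference between shuffling the rows and renumbering them can be sketched as follows. This is a minimal illustration on a hypothetical five-row DataFrame; the names `demoDF`, `shuffledDF`, and `resetDF` are illustrative only. It uses the Pandas method `reset_index(drop=True)` to renumber the rows after the shuffle.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row DataFrame with the default labels 0..4
demoDF = pd.DataFrame({'Subject': ['a', 'b', 'c', 'd', 'e']})

# Shuffle: the rows are rearranged, and each row keeps its original label
np.random.seed(42)
shuffledDF = demoDF.reindex(np.random.permutation(demoDF.index))

# Renumber: discard the old labels and number the rows 0..4 again
resetDF = shuffledDF.reset_index(drop=True)

print(list(shuffledDF.index))  # original labels, now out of order
print(list(resetDF.index))     # [0, 1, 2, 3, 4]
```

The argument `drop=True` discards the old labels. Without it, Pandas would keep them as a new column called `index`.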

### Example 2: Shuffling and Reindex a DataFrame

The following code demonstrates how to shuffle a DataFrame with the method `df.reindex()`. The NumPy function `random.permutation()` returns the index labels of `opDF` in random order, and `df.reindex()` rearranges the rows to follow that order. Each row keeps its original label, so both the data and the index values appear scrambled in the output.

The code also sets the random seed to an integer value. In this example, the value `42` was selected.

As will be explained below, setting the random seed is simply a way to ensure that the same random numbers are used when the code cell is first run. This is useful in a learning environment, but it is not something that would normally be done.

In [None]:
# Example 2: Shuffle and Reindex a DataFrame

# Set the random seed to 42
np.random.seed(42) 

# Use random.permutation function for shuffling & reindexing
opDF = opDF.reindex(np.random.permutation(opDF.index))

# Set the max rows and max columns
pd.set_option('display.max_rows', 6)
pd.set_option('display.max_columns', 6)

# Display the shuffled & reindexed DataFrame
display(opDF)

If the random seed was set to 42, the output from the above cell should look like the following:

![_](https://biologicslab.co/BIO1173/images/class_02_3_Exm2.png)

As you can see, the index values at the left side of the DataFrame have been randomized.

NOTE: If you rerun this cell, you will get different index values, because the second run shuffles the already-shuffled `opDF`.

-------------------

### **Setting the Random Seed**

Setting the random seed in Python is a way to control the output of random number generators. A random seed is an initial value used as a starting point for generating random numbers. By setting the seed to a specific value, you can ensure the same sequence of random numbers is generated every time you run your code.

The purpose of setting the random seed is primarily two-fold:

* **Reproducibility:** When developing or debugging code that involves randomness, it is often useful to have deterministic behavior. Setting the random seed allows you to reproduce the same results consistently. This can be crucial for debugging issues or creating reproducible experiments.
* **Comparability:** In certain scenarios, you might need to compare the performance or behavior of different algorithms or models. By setting the same seed for each algorithm or model, you can ensure they are being evaluated on the same random data. This helps in making fair and meaningful comparisons.

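To set the random seed in Python, you typically use the `seed()` function from the `random` module. You provide an integer value as the seed parameter, representing the starting point for random number generation. NumPy keeps its own random number generator, which is why the examples in this lesson call `np.random.seed()` instead. For example: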

>  `import random`  <br>
>  `random.seed(123)  # Set the random seed to 123`

By setting the random seed, you can control the behavior of random number generators and achieve reproducible and comparable results in scenarios where randomness is involved.
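
The reproducibility described above can be checked directly. The sketch below, which uses illustrative variable names, seeds NumPy's generator twice with the same value and compares the two shuffles, then does the same for Python's built-in `random` module. The two generators are independent, so each one has to be seeded separately.

```python
import random
import numpy as np

# Same NumPy seed, same permutation
np.random.seed(123)
first = np.random.permutation(10)
np.random.seed(123)
second = np.random.permutation(10)
print(np.array_equal(first, second))  # True

# Same seed for Python's random module, same random number
random.seed(123)
a = random.random()
random.seed(123)
b = random.random()
print(a == b)  # True
```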

------------------

### **Exercise 2: Shuffling and Reindex a DataFrame**

In the cell below, shuffle and reindex the DataFrame `hdDF`. Set the random seed = 1604. Display 8 rows and 8 columns of the shuffled and reindexed DataFrame.

In [None]:
# Insert your code for Exercise 2 here



If the random seed was set to 1604, the output from the above cell should look like the following:

![_](https://biologicslab.co/BIO1173/images/class_02_4_Exm3.png)

As in Example 2, the index values at the left side of the `hdDF` DataFrame should have been randomized.

NOTE: If you rerun this cell, you will get different index values.

## Sorting a Data Set

While it is always good to shuffle a data set before training, you may also wish to **_sort_** the data set. Sorting the data set allows you to order the rows in either ascending or descending order for one or more columns.
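
Sorting on more than one column can be sketched as follows. This is a minimal example on a hypothetical four-row DataFrame (`demoDF` and `sortedDF` are illustrative names): the rows are sorted by category from A-to-Z, and within each category by BMI from largest to smallest.

```python
import pandas as pd

# Hypothetical four-row DataFrame
demoDF = pd.DataFrame({
    'Category': ['Obese', 'Normal', 'Obese', 'Normal'],
    'BMI': [31.0, 22.0, 35.0, 24.0],
})

# Sort by Category (ascending), then by BMI (descending) within each category
sortedDF = demoDF.sort_values(by=['Category', 'BMI'],
                              ascending=[True, False])
print(sortedDF)
```

When `by` is a list of column names, `ascending` can be a list of the same length that gives the direction for each column.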

### Example 3: Sorting a DataFrame

The code in the cell below sorts the Obesity Prediction dataset. The code starts by re-reading the dataset to restore it to its original form. 

Sorting is accomplished using the Pandas method `df.sort_values()`:

~~~text
opDF = opDF.sort_values(by='ObesityCategory', ascending=True)
~~~
The argument `ascending=True` means that the sorting will be from A-to-Z in this example, since the values in the `ObesityCategory` column are strings.

The code then prints out the categorical value of `ObesityCategory` using square bracket indexing `[ ]` as shown in the following code chunk:
~~~text
# Print out the first Obesity type
print(f"The first Obesity Type is: {opDF['ObesityCategory'].iloc[0]}")
~~~
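The indexer `.iloc[0]` tells Pandas to select only the value in the first position (remember Python sequences begin with `0`). Position is what matters here: after sorting, the first row is no longer the row whose index label is `0`.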
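
Position-based and label-based lookups give different answers once a DataFrame has been sorted. The following is a minimal sketch, using a hypothetical three-row DataFrame with illustrative names, of how `.iloc[0]` (the first row by position) differs from `.loc[0]` (the row whose index label is 0).

```python
import pandas as pd

# Hypothetical three-row DataFrame with the default labels 0, 1, 2
demoDF = pd.DataFrame({'Category': ['Obese', 'Normal', 'Overweight']})

# Sort A-to-Z; each row keeps its original label
sortedDF = demoDF.sort_values(by='Category')

first_by_position = sortedDF['Category'].iloc[0]  # first row after sorting
first_by_label = sortedDF['Category'].loc[0]      # row that was labeled 0

print(first_by_position)  # Normal
print(first_by_label)     # Obese
```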

In [None]:
# Example 3: Sorting a data set

# Read the datafile 
opDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/obesity_prediction.csv",
    na_values=['NA','?'])

# Sort the DataFrame by ObesityCategory
opDF = opDF.sort_values(by='ObesityCategory', ascending=True)

# Print out the first Obesity type
print(f"The first Obesity Type is:\
        {opDF['ObesityCategory'].iloc[0]}")
      
# Set the max rows and max columns
pd.set_option('display.max_rows', 5)
pd.set_option('display.max_columns', 7)

# Display the DataFrame
display(opDF)

If the code is correct, you should see the following output. 

![_](https://biologicslab.co/BIO1173/images/class_02_4_Sort1.png)

As with the shuffle in Example 2, the index values at the left are no longer sequential. This time, however, they have not been randomly "scrambled": the whole DataFrame has been re-ordered alphabetically, based on the values in the `ObesityCategory` column.

### **Exercise 3: Sorting a DataFrame**

In the cell below, write the Python code to sort the Heart Disease dataset. Start by re-reading the Heart Disease dataset from the course HTTPS server to create a fresh version of your DataFrame, `hdDF`.

Then sort `hdDF` by the `RestingBP` column in **_descending_** order so that the largest resting blood pressure value is at the beginning of the sorted DataFrame. Display 8 rows and 8 columns of the sorted DataFrame.

Use the following code chunk to print out the highest resting blood pressure:
~~~text
# Print out the first RestingBP value
print(f"The highest resting blood pressure value is: {hdDF['RestingBP'].iloc[0]}")
~~~

In [None]:
# Insert your code for Exercise 3 here



If the code is correct, you should see the following output. 

![__](https://biologicslab.co/BIO1173/images/class_02_4_Sort2.png)


## Grouping a Data Set

**_Grouping_** is a typical operation on data sets. Structured Query Language (SQL) calls this operation a "GROUP BY." Programmers use grouping to summarize data: the rows are split into groups, and each group is reduced to a single summary value. The summary therefore usually has far fewer rows than the original data, and you cannot undo the grouping. Because of this loss of information, it is essential to keep your original data before the grouping.

### Example 4: Grouping a dataset

You can use grouping to perform summaries of a large DataFrame. Since we modified our DataFrame above, we start by re-reading it to create a fresh version of `opDF`. 

The code in the cell below groups the subjects in the Obesity Prediction dataset by their obesity category (`ObesityCategory`) and computes the average (mean) Body Mass Index (BMI) of each group using the following code chunk:
~~~text
opDF.groupby('ObesityCategory')['BMI'].mean()
~~~
In addition to **mean**, you can use other aggregating functions, such as **sum** or **count**, to summarize each group. For example, you could use `count` if you wanted to know which obesity category had the greatest number of subjects.

In [None]:
# Example 4: Grouping a dataset

# Read the datafile 
opDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/obesity_prediction.csv",
    na_values=['NA','?'])

# Group by ObesityCategory 
op_grp = opDF.groupby('ObesityCategory')['BMI'].mean()

# Show the mean BMI of each group (a Pandas Series)
op_grp

If your code is correct you should see the following output:
~~~text
ObesityCategory
Normal weight    22.018367
Obese            33.858442
Overweight       27.308287
Underweight      15.360809
Name: BMI, dtype: float64
~~~
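
Several aggregating functions can also be computed in a single call with the Pandas method `agg()`. The following is a minimal sketch on a hypothetical five-row DataFrame (the names `demoDF` and `summary` are illustrative), not a replacement for the code in Example 4.

```python
import pandas as pd

# Hypothetical five-row DataFrame
demoDF = pd.DataFrame({
    'Category': ['Obese', 'Normal', 'Obese', 'Normal', 'Normal'],
    'BMI': [31.0, 22.0, 35.0, 24.0, 20.0],
})

# One row per category, one column per aggregating function
summary = demoDF.groupby('Category')['BMI'].agg(['mean', 'count', 'sum'])
print(summary)
```

The five original rows are reduced to two summary rows, one for each category, which is why the original data should be kept if it will be needed again.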

----------------------

### **BMI Categories**

Obesity is a serious medical problem in the US. According to the Centers for Disease Control and Prevention (CDC), obesity rates in the US have been steadily increasing over the past few decades. In 2018, approximately 42.4% of US adults were considered obese. This high prevalence of obesity is concerning because it can lead to a wide range of serious medical issues.

Obesity is linked to numerous health conditions and diseases such as type 2 diabetes, heart disease, stroke, certain types of cancer, sleep apnea, and osteoarthritis. These conditions can significantly reduce the quality of life and increase the risk of premature death.

Additionally, obesity puts a substantial burden on the healthcare system. The medical costs associated with obesity-related conditions are estimated to be billions of dollars each year. Obesity also has an impact on productivity and overall economic well-being due to increased rates of absenteeism and decreased work productivity.

It is important to address obesity as a public health issue by promoting healthy lifestyles, encouraging regular physical activity, and providing access to nutritious foods.

Clinically, the diagnosis of obesity is based on a patient's Body Mass Index (BMI). The [Centers for Disease Control and Prevention (CDC)](https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/english_bmi_calculator/bmi_calculator.html) recognizes 5 BMI categories as follows:

* **Underweight:** BMI below 18.5
* **Normal Weight:** BMI from 18.5 to below 25
* **Overweight:** BMI from 25 to below 30
* **Obese:** BMI from 30 to below 40
* **Extremely Obese:** BMI of 40 or greater

These values agree well with the group summary above, in Example 4.

----------------------

### **Exercise 4: Grouping a dataset**

In the cell below, start by re-reading the Heart Disease dataset to create a fresh copy of `hdDF`.

Then group the subjects in the Heart Disease dataset by their resting ECG results (`RestingECG`) and compute the average (mean) of the oldpeak (`Oldpeak`) measurements in each group. Store the result in a variable called `hd_grp`, which is used again in Exercise 5.

In [None]:
# Insert your code for Exercise 4 here



If your code is correct, you should see the following output:

~~~text
RestingECG
LVH       1.069681
Normal    0.786051
ST        1.008989
Name: Oldpeak, dtype: float64
~~~

---------------------------------

### **What is Oldpeak?**

In cardiac physiology, **_oldpeak_** refers to the [ST depression](https://en.wikipedia.org/wiki/ST_depression) observed on an electrocardiogram (ECG) during exercise stress testing. It measures the amount of ST segment deviation below the baseline level. ST depression is indicative of insufficient blood supply to the heart muscle, particularly during exercise when the heart requires more oxygen. The magnitude of oldpeak is often used as an indicator of the severity of coronary artery disease.

![__](https://biologicslab.co/BIO1173/images/ST_depression.jpg)

-----------------

## Saving Values to a Dictionary

The Pandas method `to_dict()` is used to convert a DataFrame into a Python dictionary. This method allows you to transform the data structure of a DataFrame into the key-value format of a dictionary.

The basic syntax to use this function is as follows:
~~~text
DataFrame.to_dict(orient='dict')
~~~

The `to_dict()` method takes an optional parameter called **_orient_** which specifies the format of the resulting dictionary. By default, it is set to `'dict'`, which returns a nested dictionary: each key is a column label, and each value is itself a dictionary that maps the row labels to that column's data values.

This method is useful when you need to convert a DataFrame into a dictionary for further processing or analysis using Python dictionary methods and functions. A Pandas Series also has a `to_dict()` method; it returns a flat dictionary that maps each index label to its value.
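
The shape of the dictionary depends on what is being converted. The following minimal sketch uses a hypothetical two-row DataFrame (all variable names are illustrative) to compare the default DataFrame format, the `orient='records'` format, and the result of calling `to_dict()` on a single column, which is a Series.

```python
import pandas as pd

# Hypothetical two-row DataFrame with row labels 's1' and 's2'
demoDF = pd.DataFrame({'BMI': [22.0, 31.0], 'Age': [30, 45]},
                      index=['s1', 's2'])

# Default orient='dict': {column label: {row label: value}}
as_dict = demoDF.to_dict()

# orient='records': a list with one dictionary per row
as_records = demoDF.to_dict(orient='records')

# A Series gives a flat dictionary: {row label: value}
series_dict = demoDF['BMI'].to_dict()

print(as_dict)
print(as_records)
print(series_dict)
```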


### Example 5: Save values to a dictionary

The code in the cell below saves the mean BMI values of the different obesity categories, `op_grp`, computed in Example 4, to a Python dictionary `op_dic`.

In [None]:
# Example 5: Save values to dictionary 

# Use to_dict() method to save BMI means
op_dic = op_grp.to_dict()

# Print out the dictionary
op_dic

If your code is correct you should see the following output:
~~~text
{'Normal weight': 22.018366725098193,
 'Obese': 33.8584419831173,
 'Overweight': 27.30828686187581,
 'Underweight': 15.360808884000432}
~~~

### **Exercise 5: Save values to a dictionary**

In the cell below, save the mean Oldpeak values for the different resting ECG categories `hd_grp`, computed in Exercise 4, to a Python dictionary `hd_dic`. 

In [None]:
# Insert your code for Exercise 5 here



If your code is correct you should see the following output:

~~~text
{'LVH': 1.0696808510638298,
 'Normal': 0.7860507246376811,
 'ST': 1.0089887640449438}
~~~

## Using a Python Dictionary to access elements

A Python dictionary allows you to access an individual element quickly.  For example, you could quickly look up the mean BMI for obese patients in the Obesity Prediction dataset.  You will see that target encoding, introduced later in this module, uses this technique. 
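
The lookup idea can be sketched on a small scale. The example below assumes a hypothetical four-row DataFrame and illustrative names (`demoDF`, `means`, `CategoryMeanBMI`). It looks up a single value, shows the dictionary method `get()` for a key that may be absent, and uses the Pandas method `map()` to look up every row's category at once. It is only a sketch of dictionary lookup, not the target encoding procedure itself.

```python
import pandas as pd

# Hypothetical four-row DataFrame
demoDF = pd.DataFrame({
    'Category': ['Obese', 'Normal', 'Obese', 'Normal'],
    'BMI': [31.0, 22.0, 35.0, 24.0],
})

# Dictionary of mean BMI per category
means = demoDF.groupby('Category')['BMI'].mean().to_dict()

# Look up a single value by its key
obese_mean = means['Obese']

# get() returns None instead of raising an error when the key is absent
missing = means.get('Underweight')

# Look up the category mean for every row
demoDF['CategoryMeanBMI'] = demoDF['Category'].map(means)
print(demoDF)
```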

### Example 6: Access elements in the dictionary

The code in the cell below uses the `op_dic` dictionary created in Example 5 to quickly access the mean BMI value of one of the obesity categories.

In [None]:
# Example 6: Access element in a dictionary 

# Print out mean for underweight BMI category
op_dic["Underweight"]


If your code is correct, you should see the following output:
~~~text
15.360808884000432
~~~

The mean BMI for subjects that were in the `Underweight` category in the Obesity Prediction dataset was `15.36`, rounded off to the second decimal place. 

### **Exercise 6: Access elements in the dictionary**

In the cell below, use the `hd_dic` dictionary created in Exercise 5 to quickly access the mean `Oldpeak` value for the `ST` category in the resting ECG column.

In [None]:
# Insert your code for Exercise 6 here



If your code is correct you should see the following output:
~~~text
1.0089887640449438
~~~

### Example 7: Using `groupby count`

The code below shows how to count the number of subjects in each `ObesityCategory` in the Obesity Prediction dataset.

In [None]:
# Example 7: Use groupby count 

opDF.groupby('ObesityCategory')['BMI'].count().to_dict()

If your code is correct you should see the following output:
~~~text
{'Normal weight': 371, 'Obese': 191, 'Overweight': 295, 'Underweight': 143}
~~~
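
One detail of `count()` is worth knowing: it counts only the non-missing values in the selected column. The following minimal sketch, on a hypothetical four-row DataFrame with one missing BMI value (all names are illustrative), compares `count()` with two Pandas alternatives that count every row, `size()` and `value_counts()`.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row DataFrame; one Obese subject has no BMI value
demoDF = pd.DataFrame({
    'Category': ['Obese', 'Normal', 'Obese', 'Normal'],
    'BMI': [31.0, 22.0, np.nan, 24.0],
})

# count() skips the missing BMI value
by_count = demoDF.groupby('Category')['BMI'].count().to_dict()

# size() counts every row in each group
by_size = demoDF.groupby('Category').size().to_dict()

# value_counts() also counts every row, without a groupby
by_value_counts = demoDF['Category'].value_counts().to_dict()

print(by_count)
print(by_size)
print(by_value_counts)
```

When the counted column has no missing values, all three approaches give the same numbers.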

### **Exercise 7: Use `groupby count`**

In the cell below, write the code to count the number of subjects in each `ChestPainType` category of the Heart Disease dataset, using the `RestingBP` column for the count.

In [None]:
# Insert your code for Exercise 7 here



If your code is correct you should see the following output:
~~~text
{'ASY': 496, 'ATA': 173, 'NAP': 203, 'TA': 46}
~~~

## **Lesson Turn-in**

When you have completed all of the code cells, and run them in sequential order (the last code cell should be number 16), use the **File --> Print.. --> Save to PDF** to generate a PDF of your JupyterLab notebook. Save your PDF as `Class_02_3.lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas.