<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_02_1_python_pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

**Module 2: Machine Learning**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Integrative Biology](https://sciences.utsa.edu/integrative-biology/), [UTSA](https://www.utsa.edu/)


### Module 2 Material

* Part 2.1: Pandas DataFrame Operations
* **Part 2.2: Categorical Values** 
* Part 2.3: Grouping, Sorting and Shuffling on Pandas
* Part 2.4: Using Apply and Map in Pandas for Keras

### Google CoLab Instructions

The following code ensures that Google CoLab is running the correct version of TensorFlow.

In [None]:
try:
    %tensorflow_version 2.x
    COLAB = True
    print("Note: using Google CoLab")
except:
    print("Note: not using Google CoLab")
    COLAB = False

### Lesson Setup

Run the next code cell to load the necessary packages.

In [None]:
# You MUST run this code cell first
import pandas as pd
import os
import numpy as np
from sklearn import metrics
from scipy.stats import zscore

import shutil
path = '/'
memory = shutil.disk_usage(path)
dirpath = os.getcwd()
print("Your current working directory is : " + dirpath)
print("Disk", memory)

# Part 2.2: Categorical and Continuous Values

Neural networks require their input to be a fixed number of columns. This input format is very similar to spreadsheet data, except that it must be entirely numeric. It is essential to represent the data so that the neural network can train from it. Before we look at specific ways to preprocess data, it is important to consider four basic types of data, as defined by [[Cite:stevens1946theory]](http://psychology.okstate.edu/faculty/jgrice/psyc3214/Stevens_FourScales_1946.pdf). Statisticians commonly refer to these as the [levels of measurement](https://en.wikipedia.org/wiki/Level_of_measurement):

* Character Data (strings)
    * **Nominal** - Individual discrete items, no order. For example, color, zip code, and shape.
    * **Ordinal** - Individual distinct items have an implied order. For example, grade level, job title, Starbucks(tm) coffee size (tall, grande, venti).
* Numeric Data
    * **Interval** - Numeric values, no defined start.  For example, temperature. You would never say, "yesterday was twice as hot as today."
    * **Ratio** - Numeric values, clearly defined start.  For example, speed. You could say, "The first car is going twice as fast as the second."
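
The difference between nominal and ordinal data can be sketched in Pandas with `pd.Categorical`. This is only an illustration on made-up values (the variable names `color` and `size` are hypothetical and are not part of the lesson datasets): an unordered categorical models nominal data, while an ordered one models ordinal data and supports comparisons such as `max()`.

```python
import pandas as pd

# Nominal: discrete items with no order (toy values, not from the lesson data)
color = pd.Categorical(["red", "blue", "red"], ordered=False)

# Ordinal: discrete items with an implied order, smallest to largest
size = pd.Categorical(
    ["grande", "tall", "venti"],
    categories=["tall", "grande", "venti"],
    ordered=True,
)

# Ordered categoricals can be ranked; unordered ones cannot
largest = size.max()
size_codes = size.codes.tolist()  # position of each value in the category order

print("Is color ordered?", color.ordered)
print("Largest size:", largest)
print("Size codes:", size_codes)
```

Calling `color.max()` on the unordered categorical would raise an error, which is one way Pandas reflects that nominal values have no ranking.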

## Datasets for Class_02_2

In this class we will be using the **_Obesity Prediction_** dataset for the Examples and the **_Heart Failure_** dataset for the Exercises. Both of these datasets can be downloaded from the course HTTPS server [https://biologicslab.co](https://biologicslab.co).

### Obesity Prediction Dataset

[Obesity Prediction Dataset](https://www.kaggle.com/datasets/mrsimple07/obesity-prediction)

**Description**

The dataset provides comprehensive information on individuals' demographic characteristics, physical attributes, and lifestyle habits, aiming to facilitate the analysis and prediction of obesity prevalence. It includes key variables such as age, gender, height, weight, body mass index (BMI), physical activity level, and obesity category. 

* **Age:** The age of the individual, expressed in years.
* **Gender:** The gender of the individual, categorized as male or female.
* **Height:** The height of the individual, typically measured in centimeters or inches.
* **Weight:** The weight of the individual, typically measured in kilograms or pounds.
* **BMI:** A calculated metric derived from the individual's weight and height
* **PhysicalActivityLevel:** This variable quantifies the individual's level of physical activity
* **ObesityCategory:** Categorization of individuals based on their BMI into different obesity categories

### Example 1: Read data file and create Pandas DataFrame

The cell below uses the Pandas `read_csv()` function to read the `obesity_prediction.csv` file. The core of the code is:
~~~text
obDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/obesity_prediction.csv",
    na_values=['NA','?'])
~~~
The function `read_csv()` is an important Pandas function for reading CSV files. Here, `read_csv()` takes 2 arguments. The first argument, `"https://biologicslab.co/BIO1173/data/obesity_prediction.csv"`, is a string that provides the location (path and filename) of the file. The second argument, `na_values=['NA','?']`, tells Pandas to treat the strings `NA` and `?` as missing values, which are stored as NaN (Not a Number).

As the file is read, Pandas creates a DataFrame to hold the information. The cell below stores it in `obOrigDF` and then makes a working copy called `obDF`, so the original data can be restored later in the lesson. After reading the datafile into a DataFrame, it is always a good idea to use the function `display()` to print out a specified number of rows and columns to make sure the data was read correctly. The code below sets the maximum number of rows to 12 and the maximum number of columns to 0. In this notebook, setting the maximum number of columns to 0 means "display all the columns". 
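
To see what `na_values` does without downloading anything, here is a minimal sketch that reads a tiny made-up CSV from a string. The data and the names `csv_text` and `toyDF` are assumptions for illustration only.

```python
import io
import pandas as pd

# A tiny made-up CSV with one '?' and one 'NA' standing in for missing data
csv_text = "Age,Weight\n25,70.5\n31,?\nNA,82.0\n"

# Treat both 'NA' and '?' as missing values while reading
toyDF = pd.read_csv(io.StringIO(csv_text), na_values=['NA', '?'])

# Count the cells that Pandas stored as NaN
n_missing = int(toyDF.isna().sum().sum())

print(toyDF)
print("Missing values:", n_missing)
```

Without `'?'` in `na_values`, the `Weight` column would be read as text rather than numbers, because `?` cannot be converted to a number.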

In [None]:
# Example 1: Read data file and create Pandas DataFrame

# Read the datafile 
obOrigDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/obesity_prediction.csv",
    na_values=['NA','?'])

# Make a copy of the original dataset
obDF = obOrigDF.copy()

# Set max rows and max columns
pd.set_option('display.max_rows', 12)
pd.set_option('display.max_columns', 0)  # Zero means all columns

# Display DataFrame
display(obDF)

### Heart Disease Dataset

[Heart Disease Dataset](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction)

**Description**

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of 5 CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs and this dataset contains 11 features that can be used to predict a possible heart disease.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

* **Age:** age of the patient [years]
* **Sex:** sex of the patient [M: Male, F: Female]
* **ChestPainType:** chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
* **RestingBP:** resting blood pressure [mm Hg]
* **Cholesterol:** serum cholesterol [mg/dl]
* **FastingBS:** fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
* **RestingECG:** resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
* **MaxHR:** maximum heart rate achieved [Numeric value between 60 and 202]
* **ExerciseAngina:** exercise-induced angina [Y: Yes, N: No]
* **Oldpeak:** oldpeak = ST [Numeric value measured in depression]
* **ST_Slope:** the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
* **HeartDisease:** output class [1: heart disease, 0: Normal]

### **Exercise 1: Read data file into a Pandas DataFrame**

In the cell below use the Pandas `read_csv()` method to read the `heart_disease.csv` file that is located on the course HTTPS server using this code chunk:
~~~text
# Read the datafile 
hdOrigDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/heart_disease.csv",
    na_values=['NA','?'])
~~~
As the file is read, have Pandas create a DataFrame variable called `hdOrigDF` to hold the original information. You will need to use the original DataFrame `hdOrigDF` later in this lesson.

Then make a copy of the original DataFrame using this line of code:
~~~text
# Make a copy of the original dataset
hdDF = hdOrigDF.copy()
~~~
Then use the `display()` function to print out 8 rows and 8 columns of `hdDF`.

In [None]:
# Insert your code for Exercise 1 here



If your code is correct, you should see the following output:

![_](https://biologicslab.co/BIO1173/images/class_02_2_Ex1.png)

You can see from the output that your DataFrame `hdDF` has 12 columns of clinical data for 918 subjects (i.e. 918 rows with 1 row/subject).

## Data Normalization

Neural network datasets need to be normalized for the following reasons:

* **Improving Convergence:** Normalization helps ensure that the input values to a neural network are within a similar range. This prevents certain features from dominating others and avoids issues such as slow convergence or the network getting stuck in local optima. By normalizing the data, we can achieve a more balanced training process and faster convergence.
* **Avoiding Gradient Instability:** During the training of a neural network, backpropagation is used to adjust the weights based on the gradient of the loss function. If the input features have significantly different scales, the gradient updates may become unstable. Normalizing the data mitigates this problem by keeping the input values at a similar magnitude, leading to more stable and reliable gradient updates.
* **Efficient Computation:** Normalizing the data to a common range between 0 and 1 or -1 and 1 can improve the efficiency of computations within the neural network. Many activation functions and optimization algorithms are designed to work well with inputs in this range. By normalizing the data, we can leverage these computational efficiencies and speed up the training process.
* **Generalization:** Normalization helps the neural network generalize better to unseen data. If the input features have different scales or distributions in the training and test datasets, the network may struggle to generalize its learned patterns effectively. By normalizing the data, we ensure that the network receives consistent input representations, improving its ability to handle new, unseen samples.
* **Better Weight Initialization:** Normalizing the data can facilitate better weight initialization in a neural network. Weight initialization methods like Xavier or He initialization assume that the input features are normalized to have zero mean and unit variance. By normalizing the data, we align the network's expectations with these weight initialization techniques, enhancing the overall training process.
* **Handling Outliers:** Normalization can help handle outliers in the data. Outliers can disproportionately influence the learning process and bias the network's decisions. By normalizing the data, outliers are brought closer to the range of other values, minimizing their impact on the network's behavior.

In summary, normalizing neural network datasets improves convergence, avoids gradient instability, enhances computational efficiency, promotes generalization, aids in weight initialization, and helps handle outliers. These benefits contribute to more effective training and improved performance of neural networks.
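
As a small illustration of bringing values into the 0 to 1 range mentioned above, the sketch below applies min-max scaling to a few made-up height values. Min-max scaling is shown here only as one possible approach; the rest of this lesson uses Z-scores instead, and the array `x` is invented for the example.

```python
import numpy as np

# Made-up height values in centimeters
x = np.array([150.0, 160.0, 170.0, 190.0])

# Min-max scaling: the smallest value maps to 0 and the largest to 1
x_minmax = (x - x.min()) / (x.max() - x.min())

print(x_minmax)
```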


## Encoding Continuous Values

One common transformation for data **_normalization_** is to convert the input values into Z-scores.  Normalizing numeric inputs into a standard form makes it easier for a program to compare values.  Consider if a friend told you that he received a 10-dollar discount.  Is this a good deal?  Maybe.  But the cost is not normalized.  If your friend purchased a car, the discount is not that good.  If your friend bought lunch, this is an excellent discount!

Converting a number into a percentage is a common form of normalization.  If your friend tells you they got 10% off, we know that this is a better discount than 5%.  It does not matter how much the purchase price was.  

For machine learning, a better form of normalization than percentages is the Z-Score:

$$ z = \frac{x - \mu}{\sigma} $$

To calculate the Z-Score, you also need to calculate the mean (&mu; or $\bar{x}$) and the standard deviation (&sigma;).  You can calculate the mean with this equation:

$$ \mu = \bar{x} = \frac{x_1+x_2+\cdots +x_N}{N} $$

The standard deviation is calculated as follows:

$$ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2} $$

Example 2 and Exercise 2 below will demonstrate how to replace numerical values in a DataFrame with their Z-scores. Average values will end up having Z-scores near zero, values that are greater than average will have positive Z-scores, and values below average will have negative Z-scores. 
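
The formulas above can be checked by hand on a small made-up array. The sketch below computes the mean, the standard deviation (dividing by N, as in the formula above) and the Z-scores with NumPy, and then compares the result with `scipy.stats.zscore`. The array `x` is invented for illustration.

```python
import numpy as np
from scipy.stats import zscore

# Small made-up sample
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()      # mean of the sample
sigma = x.std()    # NumPy divides by N by default, matching the formula above

# Apply the Z-score formula directly
z_manual = (x - mu) / sigma

# SciPy's zscore also divides by N by default (ddof=0)
z_scipy = zscore(x)

print("mean:", mu, "std:", sigma)
print("manual :", z_manual)
print("scipy  :", z_scipy)
```

One thing to keep in mind: the Pandas `std()` method divides by N-1 by default, so Z-scores computed with `(s - s.mean()) / s.std()` on a Pandas Series will differ slightly from the `zscore()` result unless `ddof=0` is passed to `std()`.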

---------------------

### **Z-scores**

**_Z-scores_**, also known as _standard scores_, are statistical measures that indicate how far a particular value is from the mean of a dataset, measured in terms of standard deviations. They are important because they allow us to compare and analyze data points from different distributions with different units and scales.

The calculation of a Z-score involves subtracting the mean of the distribution from the specific value and then dividing the result by the standard deviation. This transforms the original value into a standardized value that represents its relative position within the distribution.

Z-scores are important for several reasons:

* **Standardization:** Z-scores provide a way to standardize data, making it easier to compare values from different datasets and variables. By converting values to a common scale, we can compare observations, identify outliers, and analyze data more accurately.
* **Normal distribution:** Z-scores are frequently used with normally distributed data. After standardization the values have a mean of 0 and a standard deviation of 1, so normally distributed data can be compared directly against the standard normal distribution, which makes it easier to apply statistical techniques and interpret the results.
* **Identification of extreme values:** Z-scores help in identifying extreme values, known as outliers. Values whose Z-scores have an absolute value greater than a certain threshold (e.g., 2 or 3) are often considered outliers, indicating that they deviate significantly from the mean.
* **Probability calculations:** Z-scores also enable us to calculate probabilities and determine the likelihood of a value occurring in a normal distribution. By converting a value to its Z-score, we can look up the probability associated with that Z-score in a standard normal distribution table.

Overall, Z-scores provide a standardized way to analyze, compare, and interpret data, making them an essential tool in statistical analysis and research.
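
The outlier and probability ideas in the list above can be sketched on made-up numbers. The threshold of 2 and the array `values` are assumptions chosen for the example; `norm.cdf` from `scipy.stats` plays the role of the standard normal distribution table.

```python
import numpy as np
from scipy.stats import norm, zscore

# Made-up measurements with one value far from the rest
values = np.array([10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 30.0])

# Flag values whose Z-score is more than 2 standard deviations from the mean
z = zscore(values)
outliers = values[np.abs(z) > 2]

# Probability that a standard normal value falls below a Z-score of 1.0
p_below = norm.cdf(1.0)

print("Outliers:", outliers)
print("P(Z < 1.0):", round(p_below, 4))
```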

-----------------------------------------------

### Example 2: Convert to Z-scores

The cell below shows how to use the `zscore()` function from the `scipy.stats` module to compute Z-score values for the height and weight measurements in the Obesity Prediction dataset. Height and weight measurements are good candidates for converting to Z-scores since their values are on quite different scales. 

In [None]:
# Example 2: Convert to Z-scores

# Import the zscore function
from scipy.stats import zscore

# Set max rows and max columns
pd.set_option('display.max_rows', 5)
pd.set_option('display.max_columns', 7)

# Regenerate original DataFrame
obDF = obOrigDF.copy()

# Display DataFrame before calculation
print("DataFrame before Z-score conversion")
display(obDF)

# Convert height values to Z-scores
obDF['Height'] = zscore(obDF['Height'])

# Convert weight values to Z-scores
obDF['Weight'] = zscore(obDF['Weight']) 

# Display DataFrame
print("DataFrame after Z-score conversion")
display(obDF)

As you can see, after converting to Z-scores, the height and weight measurements have gone from relatively large, all positive numbers, to small positive and negative numbers near zero. 

### **Exercise 2: Convert to Z-scores**

In the cell below, use the `zscore()` function from the `scipy.stats` module to compute Z-score values for the resting blood pressure (`RestingBP`) values and the serum cholesterol values (`Cholesterol`) in the Heart Disease dataset. Set the display parameters to 5 rows and 10 columns. Then print out the `hdDF` DataFrame before, and after, converting the values to Z-scores.

In [None]:
# Insert your code for Exercise 2 here



If your code is correct you should see the following output:

![_](https://biologicslab.co/BIO1173/images/class_02_2_Zscore1.png)

As you can see, after converting to Z-scores, the values for resting blood pressure (`RestingBP`) and serum cholesterol (`Cholesterol`) are much more similar. 

### Encoding Categorical Values as Dummies

The traditional means of encoding categorical values (i.e. converting string data to numerical values) is to replace them with **_dummy variables_**.  This technique is also called One-Hot Encoding. 

**One-Hot Encoding** is a technique used in data preprocessing and feature engineering to convert categorical variables into a numerical representation that can be used by machine learning algorithms. It is important because many machine learning models require numerical input, and categorical variables cannot be directly used in their raw form.

In One-Hot Encoding, each category in a categorical variable is converted into a new binary feature column (i.e. `0` or `1`). For a variable with N categories, N new binary columns are created, where each column represents a specific category. If an observation belongs to a particular category, the corresponding feature column is assigned a value of 1, otherwise 0.

There are a few reasons why One-Hot Encoding is important:

* **Retaining categorical information:** One-Hot Encoding allows us to retain the categorical information that would otherwise be lost if we simply assigned numerical labels to each category. By creating separate binary columns, we preserve the distinctiveness of each category, enabling the model to understand and utilize this information.
* **Avoiding numerical assumptions:** By giving each category its own binary column instead of a single numeric code, we avoid implying a numerical order or relationship that does not exist in the original data. This prevents the model from mistakenly interpreting the encoded values as meaningful in terms of order or magnitude.
* **Compatibility with machine learning algorithms:** Many machine learning algorithms require numerical input. By converting categorical variables into a binary representation, One-Hot Encoding makes it possible to feed categorical data into these algorithms, expanding the range of models that can be utilized.
* **Handling multi-class categories:** One-Hot Encoding is particularly useful when dealing with categorical variables with multiple classes. By creating binary columns for each class, we allow the model to learn distinct patterns and relationships between the categories.

It is important to note that One-Hot Encoding can increase the dimensionality of the dataset, especially if the categorical variable has a large number of classes. This can potentially lead to the **"curse of dimensionality"** and affect the performance of the model. However, it is a widely used and effective technique for incorporating categorical variables into machine learning models.
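
Before working with the real dataset, here is a minimal sketch of One-Hot Encoding on a made-up column of colors. The Series `colors` and the prefix `color` are invented for illustration; `pd.get_dummies()` is the same function used in Example 3 below.

```python
import pandas as pd

# Made-up nominal data with 3 categories
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# One new 0/1 column per category; columns come out in sorted category order
onehot = pd.get_dummies(colors, prefix="color", dtype=int)

# Every row should contain exactly one 1
ones_per_row = onehot.sum(axis=1).tolist()

print(onehot)
print("Ones per row:", ones_per_row)
```

With 3 categories the single `color` column becomes 3 columns, which is the growth in dimensionality described above.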


## Example 3: Preprocessing data

Example 3 has been broken down into 5 separate steps. Each step illustrates an important technique used to **_preprocess_** data before it can be used with deep neural networks. You will be using these steps over-and-over again in this course. 

### Example 3 - _Step 1:_ Determine categories that are not numeric

The first step in encoding categorical variables is to determine which column(s) have non-numeric values. 

We start by simply displaying all of the columns in the `obDF` DataFrame. To see all the columns, we set the option `display.max_columns` to `0`, which means **_all_** of the columns. 

**WARNING**: Be careful. **NEVER** set the option `display.max_rows` to `0`! You will end up printing out hundreds of pages of values. 

In [None]:
# Example 3 Step 1: Determine categories that are not numeric

# Set max rows and max columns
pd.set_option('display.max_rows', 6)
pd.set_option('display.max_columns', 0) # 0 means all columns

# Display the DataFrame
display(obDF)

From simply looking at the output we can see that there are two columns with non-numeric values, `Gender` and `ObesityCategory`.

### Example 3 - _Step 2:_ Print out a list of the category values 

For Example 3-Step 2: we are only going to encode the `ObesityCategory` column using One-Hot Encoding. If we were going to use this data to build a neural network, and we wanted to include the gender column, the categorical values in this column (i.e. "Male" and "Female") would also need to be encoded. 

The code in the cell below shows how to determine the number of different categories that are used in `ObesityCategory` column in the `obDF` DataFrame. This step is not really necessary for this particular dataset since the number of columns is relatively small. However in a dataset with a large number of columns, this step might be very helpful.

The trick here is to use the Python `list()` function with the `unique()` method when creating the category list. This ensures that the list contains only the name of each _different_ category in the column and not simply a list containing all of the category names repeated hundreds of times. 

In [None]:
# Example 3 Step 2: Print out the category values

# Generate a list with only unique values
numObCat = list(obDF['ObesityCategory'].unique())

# Print out the results
print(f'Number of obesity categories: {len(numObCat)}')
print(f'numObCat: {numObCat}')

As you can see from the output above:
~~~text
Number of obesity categories: 4
numObCat: ['Normal weight', 'Obese', 'Overweight', 'Underweight']
~~~
there are four different strings used as categorical values. 

### Example 3 - _Step 3:_ Encode the column

From the output above we know that there are exactly 4 different category values used in the `ObesityCategory` column: `Normal weight`, `Obese`, `Overweight` and `Underweight`. We need to One-Hot Encode these values.

The code in the cell below uses Pandas' `pd.get_dummies()` function to create dummy columns that can be used to replace the `ObesityCategory` column in the `obDF` DataFrame. To make it easier to remember what the dummy columns represent, we are going to add the prefix `obCat` to each of the new dummy columns. 

In [None]:
# Example 3 - Step 3: Encode the actual column

# Encode the ObesityCategory column
dummies = pd.get_dummies(obDF['ObesityCategory'],prefix='obCat', dtype=int)

# Print out the first 10 values
print(dummies[0:10])

If your code is correct you should see the following output:
~~~text
    obCat_Normal weight  obCat_Obese  obCat_Overweight  obCat_Underweight
0                     1            0                 0                  0
1                     0            1                 0                  0
2                     0            0                 1                  0
..                  ...          ...               ...                ...
7                     0            0                 1                  0
8                     1            0                 0                  0
9                     1            0                 0                  0

[10 rows x 4 columns]
~~~

These four dummy columns now encode the categorical data that is in the `ObesityCategory` column. Notice that for each row (i.e. for each _subject_), only one column has a value of `1`, while the rest of the columns in that row contain `0`. For this reason, this type of encoding is called **_One-Hot_** Encoding. 

### Example 3 - _Step 4:_ Merge dummy columns into dataset

The code in the cell above only created 4 dummy columns. For the new dummy/one-hot encoded values to be of any use, they must be merged back into the dataset. We can now add them back into our `obDF` DataFrame using Pandas' `pd.concat()` function shown in the next cell.

In [None]:
# Example 3 - Step 4: Merge dummies with dataset

# Merge dummies with the DataFrame
obDF = pd.concat([obDF,dummies],axis=1)

# Set max rows and max columns
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 6)

# Display certain columns in the DataFrame
display(obDF[['BMI','ObesityCategory','obCat_Normal weight',
                  'obCat_Obese','obCat_Overweight','obCat_Underweight']])

If your code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_02_2obdummies.png)

From the output above, we can see that the four dummy columns, with the prefix `obCat_`, have been added to the `obDF` DataFrame. 

### Example 3 - _Step 5:_ Remove the original column

Usually, you will remove the original column because the goal is to get the DataFrame to be entirely numeric before using it in a neural network. The cell below shows how to use the Pandas `df.drop()` method to drop the `ObesityCategory` column from the `obDF` DataFrame. The argument `axis=1` tells the method that you want to drop a column instead of a row.  


In [None]:
# Example 3 - Step 5: Remove the original column

# Use drop method to drop the ObesityCategory column
obDF.drop('ObesityCategory', axis=1, inplace=True)

# Set max rows and max columns
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 6)

# Display certain columns in the DataFrame
display(obDF[['BMI','obCat_Normal weight',
                  'obCat_Obese','obCat_Overweight','obCat_Underweight']])

If your code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_02_2_dropOB.png)

You should note that the column `ObesityCategory` is no longer present.

## **Exercise 3**

In **Exercise 3** you are to repeat the same 5 steps from Example 3, but using the Heart Disease dataset in the DataFrame `hdDF`. In other words, you are to follow the same 5 steps shown in Example 3 to One-Hot Encode data in the Heart Failure dataset.

### **Exercise 3 - _Step 1:_ Determine categories that are not numeric**

In the first step, print out 6 rows and **all** of the columns in the Heart Failure DataFrame `hdDF`. This dataset has 12 columns so you might have to use the scroll bar at the bottom to see the columns on the right.

In [None]:
# Insert your code for Exercise 3 Step 1 here



If your code is correct you should see something similar to the following output:

![_](https://biologicslab.co/BIO1173/images/class_02_2_Ex3_3.png)

There may be a slider at the bottom of your output that will let you see the columns that didn't fit on your computer screen. 

There are actually 5 columns that have non-numeric values: `Sex`, `ChestPainType`, `RestingECG`, `ExerciseAngina` and `ST_Slope`.


### **Exercise 3 - _Step 2:_ Print out a list of the category values** 

Even though there are 5 columns with non-numeric values, for Exercise 3 you are to only encode the `ChestPainType` column. 

In the cell below write the Python code to print out a list showing the number of different categories in the `ChestPainType` column. 

In [None]:
# Insert your code for Exercise 3 Step 2 here



If your code is correct you should see the following output:

~~~text
Number of chest pain categories: 4
numHfCat: ['ATA', 'NAP', 'ASY', 'TA']
~~~

### **Exercise 3 - _Step 3:_ Encode the column**

In the cell below use the `pd.get_dummies()` function to create dummy columns for the column `ChestPainType` in the `hdDF` DataFrame. To make it easier to remember what the dummy columns represent, add the prefix `Pain` to each dummy column. 

In [None]:
# Insert your code for Exercise 3 Step 3 here



If your code is correct you should see the following output:

~~~text
   Pain_ASY  Pain_ATA  Pain_NAP  Pain_TA
0         0         1         0        0
1         0         0         1        0
2         0         1         0        0
3         1         0         0        0
4         0         0         1        0
5         0         0         1        0
6         0         1         0        0
7         0         1         0        0
8         1         0         0        0
9         0         1         0        0

[10 rows x 4 columns]
~~~

Notice that for each row, there is only one column that has a value of `1`, while the rest of the columns in that row contain `0`. This is why this type of encoding is referred to as "One-Hot" encoding. 

### **Exercise 3 - _Step 4:_ Merge dummy columns into dataset**

In the cell below write the code to add the dummy columns back into the `hdDF` DataFrame using Pandas' `pd.concat()`. Set the display option to print out 6 rows and 6 columns.

In [None]:
# Insert your code for Exercise 3 Step 4 here



If your code is correct you should see the following output:

![_](https://biologicslab.co/BIO1173/images/class_02_2_Dummy1.png)


### **Exercise 3 - _Step 5:_ Remove the original column**

In the cell below, write the code to drop the `ChestPainType` column from the `hdDF` DataFrame. Display 6 rows of the columns `Age`, `Pain_ASY`, `Pain_ATA`, `Pain_NAP` and `Pain_TA`. 


In [None]:
# Insert your code for Exercise 3 Step 5 here



If your code is correct you should see the following output:

![_](https://biologicslab.co/BIO1173/images/class_02_2_Dummy2.png)


### Removing the First Level

The **pd.get_dummies** function also includes a parameter named *drop_first*, which specifies whether to get k-1 dummies out of k categorical levels by removing the first level. 

Why would you want to remove the first level? 

Consider the column `Gender` in the `obDF` DataFrame. This column contains the string values `Male` and `Female`. Suppose we were to One-Hot Encode the `Gender` column with the following line of Python code:

> `dummies = pd.get_dummies(obDF['Gender'],prefix='Gender', dtype=int)`
>

Here is what the two dummy columns would look like:

~~~text
    Gender_Female  Gender_Male
0               0            1
1               0            1
2               1            0
..            ...          ...
7               0            1
8               0            1
9               0            1

[10 rows x 2 columns]
~~~

Now, ask yourself the question, "Do we really need **both** columns to know if a subject was a female or a male?" The answer is, "Not really". 

Suppose we removed the first level using the following line of code:

> `dummies = pd.get_dummies(obDF['Gender'],prefix='Gender', dtype=int, drop_first=True)`
>
Here is the output:

~~~text
    Gender_Male
0             1
1             1
2             0
..          ...
7             1
8             1
9             1

[10 rows x 1 columns]
~~~

Since the first level `Gender_Female` was dropped there is now only the `Gender_Male` dummy column left. However, you only need this column to know the gender of each subject in the `obDF` dataframe! 

Consider the subject in Row 0. This subject is a male since he has a `1` in the `Gender_Male` column. This is also true of the subject in Row 1. But the subject in Row 2 _must_ be a female because she has a `0` in the `Gender_Male` column. In other words, we can determine the gender of every subject by the values in a single column. 

It turns out that this idea is not limited to situations where there are only two possible choices. You can **always** drop one column in _any_ series of dummy columns **without** losing any information. For this reason the function `pd.get_dummies()` is often used with the argument `drop_first` set to `True` to simplify this process. 
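
The claim that no information is lost can be checked on a made-up column with three categories. In the sketch below (the Series `size` and its values are invented for illustration), the dropped column is rebuilt from the remaining ones: a row belongs to the dropped category exactly when all remaining dummy columns are `0`.

```python
import pandas as pd

# Made-up column with 3 categories
size = pd.Series(["small", "large", "medium", "small"], name="size")

# Full encoding (3 columns) and reduced encoding (2 columns)
full = pd.get_dummies(size, prefix="size", dtype=int)
reduced = pd.get_dummies(size, prefix="size", dtype=int, drop_first=True)

# The first level in sorted order, 'size_large', is the one that gets dropped.
# It can be rebuilt: it is 1 only where every remaining column is 0.
recovered = 1 - reduced.sum(axis=1)
matches = bool((recovered == full["size_large"]).all())

print(reduced)
print("Dropped column recovered correctly:", matches)
```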

### Example 4: Using `drop_first=True`

The code in the cell below begins by regenerating the original `obDF` DataFrame from the backup `obOrigDF` DataFrame using the `df.copy()` command. The code then creates dummy columns for the `ObesityCategory` column in the DataFrame `obDF` but drops the first category before merging the remaining 3 dummy columns with the DataFrame. Finally, the code drops the `ObesityCategory` column from the `obDF` DataFrame before printing out 4 rows and all of the columns. 

In [None]:
# Example 4 - Use drop_first = True

# Make a new copy
obDF=obOrigDF.copy()

# Encode the ObesityCategory column as dummy variables
obDummies = pd.get_dummies(obDF['ObesityCategory'], drop_first=True, dtype= int, prefix='obCat')

# Merge the dummies with the DataFrame
obDF = pd.concat([obDF,obDummies],axis=1)

# Drop the column replaced by the dummies
obDF.drop('ObesityCategory', axis=1, inplace=True)

# Set max rows and max columns
pd.set_option('display.max_rows', 4)
pd.set_option('display.max_columns', 0)

# Display the DataFrame
display(obDF)

If your code is correct you should see the following output:

![__](https://biologicslab.co/BIO1173/images/class_02_2_Exm4.png)

Notice the first subject (row `0`), who is a male, age 56. He must also have a _normal weight_, even though the first category, `obCat_Normal weight`, was dropped. Why? Since he has a zero in each of the remaining 3 obesity categories, the only possibility is that he had a `1` in the dropped category.

### **Exercise 4: Using `drop_first=True`**

In the cell below, start by regenerating the complete (original) Heart Failure DataFrame using the command `hdDF = hdOrigDF.copy()`. Then write the Python code to One-Hot encode the `ChestPainType` column, setting the `drop_first` argument to `True` to drop the first dummy column. Merge the remaining 3 dummy columns with the `hdDF` DataFrame before dropping the `ChestPainType` column.  Set the display options to print out 6 rows and 8 columns and print out the updated DataFrame.

In [None]:
# Insert your code for Exercise 4 here



If your code is correct you should see the following output:

![_](https://biologicslab.co/BIO1173/images/class_02_2_Drop1.png)


## **Lesson Turn-in**

When you have completed all of the code cells, and run them in sequential order (the last code cell should be number 18), use the **File --> Print.. --> Save to PDF** to generate a PDF of your JupyterLab notebook. Save your PDF as `Class_02_2.lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas.