<a href="https://colab.research.google.com/github/DavidSenseman/BIO1173/blob/main/Class_02_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---------------------------
**COPYRIGHT NOTICE:** This Jupyterlab Notebook is a Derivative work of [Jeff Heaton](https://github.com/jeffheaton) licensed under the Apache License, Version 2.0 (the "License"); You may not use this file except in compliance with the License. You may obtain a copy of the License at

> [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

------------------------

# **BIO 1173: Intro Computational Biology**

**Module 2: Machine Learning**

* Instructor: [David Senseman](mailto:David.Senseman@utsa.edu), [Department of Biology, Health and the Environment](https://sciences.utsa.edu/bhe/), [UTSA](https://www.utsa.edu/)

### Module 2 Material

* Part 2.1: Pandas DataFrame Operations
* **Part 2.2: Categorical Values**
* Part 2.3: Grouping, Sorting and Shuffling on Pandas

## Google CoLab Instructions

You MUST run the following code cell to get credit for this class lesson. By running this code cell, you will map your GDrive to ```/content/drive``` and print out your Google GMAIL address. Your Instructor will use your GMAIL address to verify the author of this class lesson.

In [None]:
# You must run this cell first
try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    from google.colab import auth
    auth.authenticate_user()
    COLAB = True
    print("Note: Using Google CoLab")
    import requests
    gcloud_token = !gcloud auth print-access-token
    gcloud_tokeninfo = requests.get('https://www.googleapis.com/oauth2/v3/tokeninfo?access_token=' + gcloud_token[0]).json()
    print(gcloud_tokeninfo['email'])
except:
    print("**WARNING**: Your GMAIL address was **not** printed in the output below.")
    print("**WARNING**: You will NOT receive credit for this lesson.")
    COLAB = False

Make sure your GMAIL address is visible in the output above.

## Define Functions

The code in the next cell creates one (or more) functions that are used later in this class lesson. If you don't run the next cell, you will get an error.

In [3]:
# Create functions for this lesson

def list_float_columns(dataframe):
    """
    Create a list of all columns in a DataFrame that contain float values.

    Parameters:
    dataframe (pd.DataFrame): The DataFrame to check.

    Returns:
    list: A list of column names that contain float values.
    """
    float_columns = [col for col in dataframe.columns if dataframe[col].dtype == 'float64']
    return float_columns

# **Categorical and Continuous Values**

Neural networks require their input to be a fixed number of columns. This input format is very similar to spreadsheet data, except that it must be entirely numeric. It is essential to represent the data so that the neural network can train from it. Before we look at specific ways to preprocess data, it is important to consider four basic types of data, as defined by [[Cite:stevens1946theory]](http://psychology.okstate.edu/faculty/jgrice/psyc3214/Stevens_FourScales_1946.pdf). Statisticians commonly refer to these as the [levels of measurement](https://en.wikipedia.org/wiki/Level_of_measurement):

* Character Data (strings)
    * **Nominal** - Individual discrete items, no order. For example, color, zip code, and shape.
    * **Ordinal** - Individual distinct items have an implied order. For example, grade level, job title, Starbucks(tm) coffee size (tall, grande, venti).
* Numeric Data
    * **Interval** - Numeric values, no defined start.  For example, temperature. You would never say, "yesterday was twice as hot as today."
    * **Ratio** - Numeric values, clearly defined start.  For example, speed. You could say, "The first car is going twice as fast as the second."
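
As a rough illustration of the difference between nominal and ordinal data, Pandas can mark a categorical variable as unordered or ordered. The sketch below is only a toy example: the values and the variable names `colors` and `sizes` are invented for illustration and are not part of the course datasets.

```python
import pandas as pd

# Nominal: categories with no order (toy values)
colors = pd.Categorical(["red", "green", "blue", "green"], ordered=False)

# Ordinal: categories with an implied order (toy values)
sizes = pd.Categorical(
    ["tall", "venti", "grande", "tall"],
    categories=["tall", "grande", "venti"],
    ordered=True,
)

# Ordered categoricals support min/max; unordered ones do not
print(sizes.min(), sizes.max())        # tall venti
print(sizes.codes.tolist())            # [0, 2, 1, 0]
print(colors.ordered, sizes.ordered)   # False True
```

The integer codes of an ordered categorical follow the declared order, which is why an ordinal variable can often be mapped to integers without losing meaning.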

## **Datasets for Class_02_2**

In this class we will be using the **_Obesity Prediction_** dataset for the Examples and the **_Heart Failure_** dataset for the **Exercises**. Both of these datasets will be downloaded from the course HTTPS server [https://biologicslab.co](https://biologicslab.co).

## **Obesity Prediction Dataset**

[Obesity Prediction Dataset](https://www.kaggle.com/datasets/mrsimple07/obesity-prediction)

**Description**

The dataset provides comprehensive information on individuals' demographic characteristics, physical attributes, and lifestyle habits, aiming to facilitate the analysis and prediction of obesity prevalence. It includes key variables such as age, gender, height, weight, body mass index (BMI), physical activity level, and obesity category.

* **Age:** The age of the individual, expressed in years.
* **Gender:** The gender of the individual, categorized as male or female.
* **Height:** The height of the individual, typically measured in centimeters or inches.
* **Weight:** The weight of the individual, typically measured in kilograms or pounds.
* **BMI:** A calculated metric derived from the individual's weight and height
* **PhysicalActivityLevel:** This variable quantifies the individual's level of physical activity
* **ObesityCategory:** Categorization of individuals based on their BMI into different obesity categories

### Example 1: Read data file and create Pandas DataFrame

The cell below uses the Pandas `read_csv()` function to read the `obesity_prediction.csv` file with this code chunk:
~~~text
opDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/obesity_prediction.csv",
    na_values=['NA','?'])
~~~
The function `read_csv()` is an important Pandas function for reading CSV files. Here, `read_csv()` takes 2 arguments.

The first argument, `"https://biologicslab.co/BIO1173/data/obesity_prediction.csv"`, is a string that provides the location (URL) and filename of the datafile. The second argument, `na_values=['NA','?']`, tells Pandas to treat the strings `NA` and `?` as NaN (Not a Number), which is how Pandas marks missing values.
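
To see the effect of `na_values` without downloading anything, the sketch below reads a tiny CSV from a string. The CSV text and the name `toyDF` are made up for this illustration; only `pd.read_csv()` and its `na_values` argument come from the lesson.

```python
import io
import pandas as pd

# Toy CSV text (invented values) using 'NA' and '?' to mark missing data
csv_text = "Age,Weight\n56,71.9\n?,89.9\nNA,72.9\n"

# na_values lists the strings that read_csv should treat as missing (NaN)
toyDF = pd.read_csv(io.StringIO(csv_text), na_values=['NA', '?'])

print(toyDF)
print(toyDF.isna().sum())  # number of missing values in each column
```

Both the `?` and the `NA` entries in the `Age` column are read as NaN, so the column can still be treated as numeric.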

As the file is read, Pandas creates a DataFrame called `opDF` to hold the information.

After reading the datafile into a DataFrame, it is always a good idea to use the function `display()` to print out a specified number of rows and columns to make sure the data was read correctly.

The code in the cell below sets the maximum number of displayed rows to 8 and the maximum number of displayed columns to the number of columns in `opDF`, so that every column is shown.

In [None]:
# Example 1: Read data file and create Pandas DataFrame

import pandas as pd

# Read the datafile
opDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/obesity_prediction.csv",
    na_values=['NA','?'])

# Set max columns and max rows
pd.set_option('display.max_columns', opDF.shape[1])
pd.set_option('display.max_rows', 8)


# Display DataFrame
display(opDF)

If your code is correct, you should see the following output:

![_](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image11A.png)

You can see from looking at the last line of the output, the DataFrame `opDF` has 7 columns and 1000 rows. This means `opDF` has clinical measurements for 1000 patients and for each patient, there are 7 separate clinical measurements.

## **Heart Disease Dataset**

[Heart Disease Dataset](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction)

**Description**

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of 5 CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

* **Age:** age of the patient [years]
* **Sex:** sex of the patient [M: Male, F: Female]
* **ChestPainType:** chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
* **RestingBP:** resting blood pressure [mm Hg]
* **Cholesterol:** serum cholesterol [mg/dl]
* **FastingBS:** fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
* **RestingECG:** resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
* **MaxHR:** maximum heart rate achieved [Numeric value between 60 and 202]
* **ExerciseAngina:** exercise-induced angina [Y: Yes, N: No]
* **Oldpeak:** oldpeak = ST [Numeric value measured in depression]
* **ST_Slope:** the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
* **HeartDisease:** output class [1: heart disease, 0: Normal]

### **Exercise 1: Read data file into a Pandas DataFrame**

In the cell below use the Pandas `read_csv()` method to read the `heart_disease.csv` file that is located on the course HTTPS server using this code chunk:
~~~text
# Read the datafile
hdDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/heart_disease.csv",
    na_values=['NA','?'])
~~~
As the file is read, have Pandas create a DataFrame called `hdDF` to hold the heart disease data.

Use the `display()` function to print out 6 rows and 6 columns of `hdDF`.

In [None]:
# Insert your code for Exercise 1 here



If your code is correct, you should see the following output:

![_](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image12A.png)

You can see by looking at the last line of the output, your DataFrame, `hdDF`, has 12 columns and 918 rows. In other words `hdDF` has clinical measurements for 918 patients and 12 separate clinical measurements for each patient.

## **Data Normalization**

Neural network datasets need to be _normalized_ for the following reasons:

* **Improving Convergence:** Normalization helps ensure that the input values to a neural network are within a similar range. This prevents certain features from dominating others and avoids issues such as slow convergence or the network getting stuck in local optima. By normalizing the data, we can achieve a more balanced training process and faster convergence.
* **Avoiding Gradient Instability:** During the training of a neural network, backpropagation is used to adjust the weights based on the gradient of the loss function. If the input features have significantly different scales, the gradient updates may become unstable. Normalizing the data mitigates this problem by keeping the input values at a similar magnitude, leading to more stable and reliable gradient updates.
* **Efficient Computation:** Normalizing the data to a common range between 0 and 1 or -1 and 1 can improve the efficiency of computations within the neural network. Many activation functions and optimization algorithms are designed to work well with inputs in this range. By normalizing the data, we can leverage these computational efficiencies and speed up the training process.
* **Generalization:** Normalization helps the neural network generalize better to unseen data. If the input features have different scales or distributions in the training and test datasets, the network may struggle to generalize its learned patterns effectively. By normalizing the data, we ensure that the network receives consistent input representations, improving its ability to handle new, unseen samples.
* **Better Weight Initialization:** Normalizing the data can facilitate better weight initialization in a neural network. Weight initialization methods like Xavier or He initialization assume that the input features are normalized to have zero mean and unit variance. By normalizing the data, we align the network's expectations with these weight initialization techniques, enhancing the overall training process.
* **Handling Outliers:** Normalization can help when dealing with outliers in the data. Outliers can disproportionately influence the learning process and bias the network's decisions. Putting every feature on a common scale limits the magnitude of extreme raw values and makes outliers easier to spot, although a linear rescaling such as the Z-score does not change how far an outlier sits from the other values in relative terms.

In summary, normalizing neural network datasets improves convergence, avoids gradient instability, enhances computational efficiency, promotes generalization, aids in weight initialization, and helps handle outliers. These benefits contribute to more effective training and improved performance of neural networks.
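
One of the points above mentions rescaling data to a common range such as 0 to 1. A minimal sketch of that idea (min-max scaling, which is a different technique from the Z-score used in the rest of this lesson) is shown below. The DataFrame `toyDF` and its values are invented for illustration.

```python
import pandas as pd

# Toy measurements on very different scales (invented values)
toyDF = pd.DataFrame({
    "Height_cm": [150.0, 160.0, 170.0, 180.0],
    "Weight_kg": [50.0, 60.0, 80.0, 100.0],
})

# Min-max scaling: (x - min) / (max - min) maps each column onto [0, 1]
scaledDF = (toyDF - toyDF.min()) / (toyDF.max() - toyDF.min())

print(scaledDF)
```

After scaling, the smallest value in each column is 0 and the largest is 1, so neither column dominates simply because of its units.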


## **Encoding Continuous Values**

One common transformation for data **_normalization_** is to convert the input values into Z-scores.  Normalizing numeric inputs into a standard form makes it easier for a program to compare values.  Consider if a friend told you that he received a 10-dollar discount.  Is this a good deal?  Maybe.  But the cost is not normalized.  If your friend purchased a car, the discount is not that good.  If your friend bought lunch, this is an excellent discount!

Converting a number into a percentage is a common form of normalization.  If your friend tells you they got 10% off, we know that this is a better discount than 5%.  It does not matter how much the purchase price was.  

For machine learning, a better form of normalization than percentages is the Z-Score:

$$ z = \frac{x - \mu}{\sigma} $$

To calculate the Z-Score, you also need to calculate the mean (&mu; or $\bar{x}$) and the standard deviation (&sigma;).  You can calculate the mean with this equation:

$$ \mu = \bar{x} = \frac{x_1+x_2+\cdots +x_N}{N} $$

The standard deviation is calculated as follows:

$$ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2} $$

Example 2 and Exercise 2 below will demonstrate how to replace numerical values in a DataFrame with their Z-scores. Average values will end up having Z-scores near zero, values that are greater than average will have positive Z-scores, and values below average will have negative Z-scores.
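
The formulas above can be checked by hand on a small array. The sketch below uses invented values and computes the mean, the standard deviation and the Z-scores step by step, then compares the result with `scipy.stats.zscore`, which by default uses the same population formula (division by the number of values).

```python
import numpy as np
from scipy.stats import zscore

# Toy values (invented for illustration)
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.sum() / len(x)                             # mean
sigma = np.sqrt(((x - mu) ** 2).sum() / len(x))   # population standard deviation
z_manual = (x - mu) / sigma                       # Z-score of every value

# scipy's zscore applies the same formula (ddof=0 by default)
z_scipy = zscore(x)

print(mu, sigma)                        # 5.0 2.0
print(z_manual)
print(np.allclose(z_manual, z_scipy))   # True
```

The value `5.0` equals the mean, so its Z-score is 0, while `9.0` lies two standard deviations above the mean and gets a Z-score of 2.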

---------------------

## **Z-scores**

**_Z-scores_**, also known as _standard scores_, are statistical measures that indicate how far a particular value is from the mean of a dataset, measured in terms of standard deviations. They are important because they allow us to compare and analyze data points from different distributions with different units and scales.

The calculation of a Z-score involves subtracting the mean of the distribution from the specific value and then dividing the result by the standard deviation. This transforms the original value into a standardized value that represents its relative position within the distribution.

Z-scores are important for several reasons:

* **Standardization:** Z-scores provide a way to standardize data, making it easier to compare values from different datasets and variables. By converting values to a common scale, we can compare observations, identify outliers, and analyze data more accurately.
* **Normal distribution:** After conversion to Z-scores, a variable has a mean of 0 and a standard deviation of 1. If the original data are normally distributed, the Z-scores follow the standard normal distribution, which makes it easier to apply statistical techniques and make meaningful interpretations.
* **Identification of extreme values:** Z-scores help in identifying extreme values, known as outliers. Values whose Z-scores exceed a certain threshold in absolute value (e.g., 2 or 3) are often considered outliers, indicating that they deviate significantly from the mean.
* **Probability calculations:** Z-scores also enable us to calculate probabilities and determine the likelihood of a value occurring in a normal distribution. By converting a value to its Z-score, we can look up the probability associated with that Z-score in a standard normal distribution table.

Overall, Z-scores provide a standardized way to analyze, compare, and interpret data, making them an essential tool in statistical analysis and research.
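
Two of the uses listed above, flagging extreme values and looking up probabilities, can be sketched in a few lines. The data and the threshold of 2 are assumptions chosen for this illustration, not values taken from the course datasets.

```python
import numpy as np
from scipy.stats import norm, zscore

# Toy data (invented): nine typical values and one extreme value
values = np.array([10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 11.0, 9.0, 30.0])

z = zscore(values)

# Flag values whose |Z| exceeds the chosen threshold
threshold = 2
outliers = values[np.abs(z) > threshold]
print(outliers)

# Probability of observing a value at or below Z = 1
# under the standard normal distribution
print(round(norm.cdf(1.0), 4))  # 0.8413
```

Here `norm.cdf()` plays the role of the standard normal distribution table mentioned above.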

-----------------------------------------------

### Example 2: Convert Floats to Z-values

The code in the cell below uses the function `list_float_columns()` that was constructed at the start of this notebook to generate a list of column names that contain float values.

The code then uses this list to convert the float values in the selected columns to the Z-value equivalent.

Finally, the first 5 Z-score values in each column are printed out to check the accuracy of the code.

In [None]:
# Example 2: Convert floats to Z-values

import pandas as pd
from scipy.stats import zscore

# Use function defined at the start of this notebook
float_columns = list_float_columns(opDF)
print(f"Columns with float values: {float_columns}")

# Convert floats to Z-scores
for col in float_columns:
    opDF[col] = zscore(opDF[col])

# Print the first 5 values of each float column
for col in float_columns:
    print(f"First 5 values in column '{col}': {opDF[col].head().tolist()}")

If your code is correct, you should see the following output:
~~~text
Columns with float values: ['Height', 'Weight', 'BMI']
First 5 values in column 'Height': [0.3418640392162576, -0.5749847402902392, -0.19216404055993186, -0.15456698221858464, 1.3116345561014482]
First 5 values in column 'Weight': [0.0500759354697495, 1.2097390434498765, 0.11126628433386448, 0.8825352755671579, -0.1397761977927924]
First 5 values in column 'BMI': [-0.16096979006604262, 1.3741152139851138, 0.1501289782474755, 0.8115135795924825, -0.7107971919502696]
~~~

As you can see, after converting to their Z-scores, the height, weight and BMI measurements have gone from relatively large, all positive numbers, to small values, near zero, that are both positive and negative. A value equal to `0` means average, while positive values are above average and negative values are below average.

NOTE: If you get an error message, it probably means that you forgot to run the Define functions cell at the start of this notebook.

### **Exercise 2: Convert to Z-scores**

In the cell below, use the `list_float_columns()` function that was created at the start of this notebook to generate a list of the columns in `hdDF` that contain float values. Then use the `zscore` function from the `scipy.stats` module to compute Z-score values for each float value in these columns.

At the end, write the code to print out the first 5 Z-scores from each column to check the accuracy of your code.

In [None]:
# Insert your code for Exercise 2 here



If your code is correct, you should see the following output:

~~~text
Columns with float values: ['Oldpeak']
First 5 values in column 'Oldpeak': [-0.8324323931317043, 0.10566352743655559, -0.8324323931317043, 0.5747114877206856, -0.8324323931317043]
~~~

## **Encoding Categorical Values: Mapping**

In many cases, the data used to train a neural network must be numeric, either an integer or a float value. There are two ways to convert non-numeric values (i.e. strings) to numbers: (1) Mapping and (2) One-Hot Encoding.

Mapping of categorical variables was covered earlier in Class_01_6 (Example 8 and **Exercise 8**). Here we revisit the mapping procedure before presenting a discussion of One-Hot Encoding.

## Example 3: Preprocessing data

Example 3 has been broken down into 6 separate steps. Each step illustrates an important technique used to **_preprocess_** data before it can be used with deep neural networks. You will be using these steps over-and-over again in this course.

### Example 3 - Step 1: Determine categories that are not numeric

The first step in encoding categorical variables is to determine which column(s) have non-numerical values.

We can use the Pandas expression `df.select_dtypes(exclude='number').columns` to generate a list of column names.

The code in the cell below also uses the `starred` print option.
~~~text
# Print result
print(*non_numerical_columns)
~~~

By simply inserting an asterisk `*` before the list variable, the print statement only prints out the column names.
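
To see what the asterisk does, the short sketch below prints the same list with and without it. The list `column_names` is a stand-in written by hand for this illustration.

```python
# Toy list of column names (written by hand for illustration)
column_names = ['Gender', 'ObesityCategory']

print(column_names)              # ['Gender', 'ObesityCategory']
print(*column_names)             # Gender ObesityCategory
print(*column_names, sep=', ')   # Gender, ObesityCategory
```

The asterisk unpacks the list so that each item is passed to `print()` as a separate argument, which is why the brackets and quotation marks disappear.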

In [None]:
# Example 3-Step 1: Determine columns with non-numeric values

import pandas as pd

# Select columns
non_numerical_columns = opDF.select_dtypes(exclude='number').columns

# Print result
print(*non_numerical_columns)

If your code is correct, you should see the following output:
~~~text
Gender ObesityCategory
~~~

There are two columns in the `opDF` DataFrame, `Gender` and `ObesityCategory`, whose string values we will need to convert into numbers.

### Example 3 - Step 2: Map Strings to Integers

In most situations where the number of categorical values is relatively small, "mapping" is generally easier than One-Hot Encoding.

The code in the cell below maps the string `Male` to the number `1` and the string `Female` to the number `2`.

To make sure the correct change occurred, the `display(df)` function is used to print out the revised DataFrame.

In [None]:
# Example 3 - Step 2: Map strings to integers

# Set max columns and max rows
pd.set_option('display.max_columns', opDF.shape[1])
pd.set_option('display.max_rows', 8)


# Define the mapping dictionary
mapping = {'Male': 1, 'Female': 2}

# Check if all values to be mapped are present in the column
unique_values = opDF['Gender'].unique()

# Find values in the column that are not in the mapping dictionary
missing_values = [value for value in unique_values if value not in mapping]

if missing_values:
    print(f"Error: The following values in the 'Gender' column are not in the mapping dictionary: {missing_values}")
    print(f"Error: Either your mapping is wrong or you have already converted the strings to integers")
else:
    # Map the 'Gender' column using the mapping dictionary
    opDF['Gender'] = opDF['Gender'].map(mapping)
    print("Obesity data after mapping:")
    display(opDF)


If the code is correct, you should see the following output:

![_](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image18A.png)

However, if your output looks like this:

~~~text
Error: The following values in the 'Gender' column are not in the mapping dictionary: [1, 2]
Error: Either your mapping is wrong or you have already converted the strings to integers
~~~
It could mean that you have already converted the strings. You might need to go back and re-run Example 1.

### Example 3 - Step 3: Print out a list of the category values

The code in the cell below shows how to determine the number of different categories that are used in the `ObesityCategory` column in the `opDF` DataFrame. This step is not really necessary for this particular dataset since the number of different string values in the `ObesityCategory` column is relatively small. However, in a dataset with a large number of different string values in a column, this step might be very helpful.

The trick here is to use the Python `list()` function in conjunction with the `unique()` method when creating the category list. This ensures that the list contains only the name of each _different_ category in the column and not simply a list containing all of the category names repeated hundreds of times.

In [None]:
# Example 3 Step 3: Print category values

import pandas as pd

# Generate a list with only unique values
numOpCat = list(opDF['ObesityCategory'].unique())

# Print out the results
print(f'Number of obesity categories: {len(numOpCat)}')
print(f'numOpCat: {numOpCat}')

Number of obesity categories: 4
numOpCat: ['Normal weight', 'Obese', 'Overweight', 'Underweight']


As you can see from the output above:
~~~text
Number of obesity categories: 4
numOpCat: ['Normal weight', 'Obese', 'Overweight', 'Underweight']
~~~
there are four different strings used as categorical values.

------------------------------------

## **Encoding Categorical Values: One-Hot Encoding**

**One-Hot Encoding** is a technique used in data preprocessing and feature engineering to convert categorical variables into a numerical representation that can be used by machine learning algorithms. It is important because many machine learning models require numerical input, and categorical variables cannot be directly used in their raw form.

In One-Hot Encoding, each category in a categorical variable is converted into a new binary feature column (i.e. `0` or `1`). For a variable with _N_ categories, _N_ new binary columns are created, where each column represents a specific category. If an observation belongs to a particular category, the corresponding feature column is assigned a value of 1, otherwise 0.

There are a few reasons why One-Hot Encoding is important:

* **Retaining categorical information:** One-Hot Encoding allows us to retain the categorical information that would otherwise be lost if we simply assigned numerical labels to each category. By creating separate binary columns, we preserve the distinctiveness of each category, enabling the model to understand and utilize this information.
* **Avoiding numerical assumptions:** By converting categorical variables into separate binary columns rather than a single column of integer labels, we avoid implying a numerical order or relationship that may not exist in the original data. This prevents the model from mistakenly interpreting the numerical values as meaningful in terms of order or magnitude.
* **Compatibility with machine learning algorithms:** Many machine learning algorithms require numerical input. By converting categorical variables into a binary representation, One-Hot Encoding makes it possible to feed categorical data into these algorithms, expanding the range of models that can be utilized.
* **Handling multi-class categories:** One-Hot Encoding is particularly useful when dealing with categorical variables with multiple classes. By creating binary columns for each class, we allow the model to learn distinct patterns and relationships between the categories.

It is important to note that One-Hot Encoding can increase the dimensionality of the dataset, especially if the categorical variable has a large number of classes. This can potentially lead to the **"curse of dimensionality"** and affect the performance of the model. However, it is a widely used and effective technique for incorporating categorical variables into machine learning models.
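
The idea can be sketched on a very small example before applying it to the obesity data. The Series `toy` and its pet names are invented for illustration; `pd.get_dummies()` is the same Pandas function used in the steps that follow.

```python
import pandas as pd

# Toy categorical column (invented values) with N = 3 categories
toy = pd.Series(['cat', 'dog', 'bird', 'dog'], name='Pet')

# One-Hot Encoding: one binary column per category
toy_dummies = pd.get_dummies(toy, prefix='Pet', dtype=int)
print(toy_dummies)

# Every row contains exactly one 1
print(toy_dummies.sum(axis=1).tolist())   # [1, 1, 1, 1]
print(toy_dummies.shape)                  # (4, 3)
```

Three categories produce three new columns, which shows how a column with many categories can quickly increase the width of a dataset.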

-------------------------------------


### Example 3 - Step 4: One-Hot Encode the Column

From the output above we know that there are exactly 4 different category values used in the `ObesityCategory` column: `Normal weight`, `Obese`, `Overweight` and `Underweight`. We need to One-Hot Encode these values.

The code in the cell below uses Pandas' `pd.get_dummies()` function to create dummy columns that can be used to replace the `ObesityCategory` column in the `opDF` DataFrame. To make it easier to remember what the dummy columns represent, we are going to add the prefix `OBCat` to each of the new dummy columns. What you use as a prefix is totally up to you, since it doesn't have any effect on how the data is processed. The prefix is just a reminder of what the original data was.

In [None]:
# Example 3 - Step 4: One-Hot Encode the column

import pandas as pd

# Encode the ObesityCategory column
dummies = pd.get_dummies(opDF['ObesityCategory'],prefix='OBCat', dtype=int)

# Display dummies DataFrame
display(dummies)

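|  | OBCat_Normal weight | OBCat_Obese | OBCat_Overweight | OBCat_Underweight |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... |
| 996 | 0 | 1 | 0 | 0 |
| 997 | 0 | 1 | 0 | 0 |
| 998 | 1 | 0 | 0 | 0 |
| 999 | 1 | 0 | 0 | 0 |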


If your code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image02A.png)



The variable called `dummies`, that was created by the One-Hot Encoding, is actually a new, **_separate_** DataFrame which is displayed above.

These four dummy columns encode the categorical data that is in the `ObesityCategory` column.

It is important to know how dummy columns encode categorical information. Notice that for each row, only **_one_** column has a value of `1`, while the other columns in that row contain `0`.

For example, the first patient (index value `0`) has a `1` only in the column `OBCat_Normal weight` while the second patient (index value `1`) only has a `1` in the column `OBCat_Obese`.  

For this reason, this type of encoding is called **_One-Hot_** Encoding.

### Example 3 - Step 5: Merge Dummy Columns into Dataset

As mentioned above, One-Hot Encoding only generates a new, separate DataFrame called `dummies`. In order to use it, it is up to you to **_merge_** the `dummies` DataFrame with the DataFrame containing the dataset, in this example, `opDF`.   

The code in the cell below shows how to merge these two DataFrames using the Pandas function `pd.concat()`. The word 'concat' in this command is short for `concatenate`.

**_Concatenate_** means combining multiple strings, lists, or other sequences into a single sequence. It can be done using various methods including the `concat()` function in Pandas.

The code fragment that accomplishes the concatenation is:
~~~text
# Merge dummies with the DataFrame
opDF = pd.concat([opDF,dummies],axis=1)
~~~

The argument `axis=1` specifies that the concatenation should be done along **_columns_**. This means that the DataFrames are joined _horizontally_, with the columns from the second DataFrame (`dummies`) added next to the columns from the first DataFrame (`opDF`).
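
The difference between the two axes is easier to see on a small example. The sketch below uses two invented DataFrames, `left` and `right`, and compares the shape of the result for `axis=1` and `axis=0`.

```python
import pandas as pd

# Two toy DataFrames that share the same row index (invented values)
left = pd.DataFrame({'A': [1, 2, 3]})
right = pd.DataFrame({'B': [10, 20, 30]})

side_by_side = pd.concat([left, right], axis=1)   # join along columns
stacked = pd.concat([left, right], axis=0)        # join along rows

print(side_by_side.shape)   # (3, 2)
print(stacked.shape)        # (6, 2)
```

With `axis=1` the rows are matched by their index labels, so the number of rows stays the same and only the number of columns grows. With `axis=0` the rows of the second DataFrame are placed under those of the first, and cells with no matching column are filled with NaN.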

The code below also illustrates how you can display only **_selected columns_** in a large DataFrame by specifying their column name.

In [None]:
# Example 3 - Step 5: Merge dummies with dataset

import pandas as pd

# Merge dummies with the DataFrame
opDF = pd.concat([opDF,dummies],axis=1)

# Set max columns and max rows
pd.set_option('display.max_columns', opDF.shape[1])
pd.set_option('display.max_rows', 8)

# Display certain columns in the DataFrame
display(opDF[['BMI','ObesityCategory','OBCat_Normal weight',
                  'OBCat_Obese','OBCat_Overweight','OBCat_Underweight']])

|  | BMI | ObesityCategory | OBCat_Normal weight | OBCat_Obese | OBCat_Overweight | OBCat_Underweight |
|---|---|---|---|---|---|---|
| 0 | -0.160970 | Normal weight | 1 | 0 | 0 | 0 |
| 1 | 1.374115 | Obese | 0 | 1 | 0 | 0 |
| 2 | 0.150129 | Overweight | 0 | 0 | 1 | 0 |
| 3 | 0.811514 | Overweight | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 996 | 1.767533 | Obese | 0 | 1 | 0 | 0 |
| 997 | 1.172337 | Obese | 0 | 1 | 0 | 0 |
| 998 | -0.546350 | Normal weight | 1 | 0 | 0 | 0 |
| 999 | -0.221481 | Normal weight | 1 | 0 | 0 | 0 |


If your code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image03A.png)

From the output above, we can see that the four dummy columns with the prefix `OBCat_` have been added to the `opDF` DataFrame.

### Example 3 - Step 6: Remove the original column

Usually, you will need to remove the column that was One-Hot Encoded from your DataFrame for two reasons. First, the column still contains string values, which a neural network can't use. Second, the information in that column is **_redundant_** since it is already encoded in the dummy columns.

The cell below shows how to use the Pandas `df.drop()` method to drop the `ObesityCategory` column from the `opDF` DataFrame. As above, the argument `axis=1` tells the method that you want to drop the column instead of a row.  

In [None]:
# Example 3-Step 6: Remove the original column

import pandas as pd

# Use drop method to drop the ObesityCategory column
opDF.drop('ObesityCategory', axis=1, inplace=True)

# Set max columns and max rows
pd.set_option('display.max_columns', opDF.shape[1])
pd.set_option('display.max_rows', 8)

# Display DataFrame
display(opDF)

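|  | Age | Gender | Height | Weight | BMI | PhysicalActivityLevel | OBCat_Normal weight | OBCat_Obese | OBCat_Overweight | OBCat_Underweight |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | 1 | 0.341864 | 0.050076 | -0.160970 | 4 | 1 | 0 | 0 | 0 |
| 1 | 69 | 1 | -0.574985 | 1.209739 | 1.374115 | 2 | 0 | 1 | 0 | 0 |
| 2 | 46 | 2 | -0.192164 | 0.111266 | 0.150129 | 4 | 0 | 0 | 1 | 0 |
| 3 | 32 | 1 | -0.154567 | 0.882535 | 0.811514 | 3 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 996 | 35 | 2 | -0.482874 | 1.705189 | 1.767533 | 1 | 0 | 1 | 0 | 0 |
| 997 | 49 | 2 | -1.308268 | 0.490161 | 1.172337 | 1 | 0 | 1 | 0 | 0 |
| 998 | 64 | 1 | -0.568685 | -0.853282 | -0.546350 | 4 | 1 | 0 | 0 | 0 |
| 999 | 66 | 2 | 0.823374 | 0.242315 | -0.221481 | 1 | 1 | 0 | 0 | 0 |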


If your code is correct you should see the following output:

![___](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image16A.png)

You should note that the column `ObesityCategory` is no longer present.

If you get an error, it might mean that you have already removed the column. To correct this particular error, go back and re-run your code starting with Example 1.
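The sketch below shows one way to make a drop safe to re-run. It is not part of the original lesson: the small DataFrame `demoDF` is made up for illustration, and the sketch relies on the `errors='ignore'` argument of `df.drop()`, which skips labels that are not present instead of raising an error.

```python
import pandas as pd

# Hypothetical toy DataFrame (not the lesson's data)
demoDF = pd.DataFrame({
    'BMI': [22.5, 31.0],
    'ObesityCategory': ['Normal weight', 'Obese'],
})

# The first drop removes the column
demoDF = demoDF.drop(columns='ObesityCategory', errors='ignore')

# A second drop does nothing instead of raising a KeyError
demoDF = demoDF.drop(columns='ObesityCategory', errors='ignore')

print(list(demoDF.columns))
```

The lesson's own cells use `inplace=True` without `errors='ignore'`, so re-running from Example 1 remains the fix for those cells.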

## **Exercise 3**

In **Exercise 3** you are to repeat the same 6 steps from Example 3, but using the Heart Disease dataset in the DataFrame `hdDF`. In other words, you are to follow the same 6 steps shown in Example 3 to first map and then One-Hot Encode data in the Heart Disease dataset.

### **Exercise 3 - Step 1: Determine categories that are not numeric**

In the first step, use the Pandas function `df.select_dtypes()` to select all of the columns in your DataFrame `hdDF` that contain non-numeric (string) values.

Print out the results using the `starred` print statement.

In [None]:
# Insert your code for Exercise 3 - Step 1 here

import pandas as pd

# Select columns
non_numerical_columns = hdDF.select_dtypes(exclude='number').columns

# Print result
print(*non_numerical_columns)

Sex ChestPainType RestingECG ExerciseAngina ST_Slope


If your code is correct you should see the following output:
~~~text
Sex ChestPainType RestingECG ExerciseAngina ST_Slope
~~~
There are 5 columns that have non-numeric values: `Sex`, `ChestPainType`, `RestingECG`, `ExerciseAngina` and `ST_Slope`.


### **Exercise 3 - Step 2: Map Strings to Integers**

In the cell below write the code to map the string `M` to the number `1` and the string `F` to the number `2` in the column `Sex`.

Print out your revised DataFrame to verify that the correct mapping occurred.

In [None]:
# Insert your code for Exercise 3 - Step 2 here

# Set max columns and max rows
pd.set_option('display.max_columns', hdDF.shape[1])
pd.set_option('display.max_rows', 8)

# Define the mapping dictionary
mapping = {'M': 1, 'F': 2}

# Check if all values to be mapped are present in the column
unique_values = hdDF['Sex'].unique()

# Find values in the column that are not in the mapping dictionary
missing_values = [value for value in unique_values if value not in mapping]

if missing_values:
    print(f"Error: The following values in the 'Sex' column are not in the mapping dictionary: {missing_values}")
    print("Error: Either your mapping is wrong or you have already converted the strings to integers")
else:
    # Map the 'Sex' column using the mapping dictionary
    hdDF['Sex'] = hdDF['Sex'].map(mapping)
    print("Heart disease data after mapping:")
    display(hdDF)

Heart disease data after mapping:


Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,1,ATA,140,289,0,Normal,172,N,-0.832432,Up,0
1,49,2,NAP,160,180,0,Normal,156,N,0.105664,Flat,1
2,37,1,ATA,130,283,0,ST,98,N,-0.832432,Up,0
3,48,2,ASY,138,214,0,Normal,108,Y,0.574711,Flat,1
...,...,...,...,...,...,...,...,...,...,...,...,...
914,68,1,ASY,144,193,1,Normal,141,N,2.357094,Flat,1
915,57,1,ASY,130,131,0,Normal,115,Y,0.293283,Flat,1
916,57,2,ATA,130,236,0,LVH,174,N,-0.832432,Flat,1
917,38,1,NAP,138,175,0,Normal,173,N,-0.832432,Up,0


If the code is correct, you should see the following output:

![_](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image13A.png)

However, if your output looks like this:

~~~text
Error: The following values in the 'Sex' column are not in the mapping dictionary: [1, 2]
Error: Either your mapping is wrong or you have already converted the strings to integers
~~~
It could mean that you have already converted the strings. You might need to go back and re-run **Exercise 1**.
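The check for unmapped values matters because `.map()` does not raise an error when a value has no entry in the dictionary; it silently returns `NaN` for it. The sketch below illustrates this with a made-up column (the names `demo_sex` and `demo_mapping` are hypothetical, not part of the lesson).

```python
import pandas as pd

# Hypothetical toy column (not the lesson's data)
demo_sex = pd.Series(['M', 'F', 'M'])
demo_mapping = {'M': 1, 'F': 2}

# First pass: strings are converted to integers
first_pass = demo_sex.map(demo_mapping)

# Second pass: the integers 1 and 2 are not keys in the dictionary,
# so every value becomes NaN
second_pass = first_pass.map(demo_mapping)

print(first_pass.tolist())
print(second_pass.isna().all())
```

Mapping the same column twice without the check would therefore wipe out the data rather than stop with an error.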

### **Exercise 3 - Step 3: Print out a list of the category values**

In the cell below write the Python code to print out a list showing the number of different categories in the `ChestPainType` column.

In [None]:
# Insert your code for Exercise 3 Step 3 here

import pandas as pd

# Generate a list with only unique values
numHdCat = list(hdDF['ChestPainType'].unique())

# Print out the results
print(f'Number of chest pain categories: {len(numHdCat)}')
print(f'numHdCat: {numHdCat}')

Number of chest pain categories: 4
numHdCat: ['ATA', 'NAP', 'ASY', 'TA']


If your code is correct you should see the following output:

~~~text
Number of chest pain categories: 4
numHdCat: ['ATA', 'NAP', 'ASY', 'TA']
~~~
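As an aside, Pandas also offers `nunique()` and `value_counts()` for this kind of summary. The sketch below uses a short made-up Series (`demo_pain` is a hypothetical name) rather than the lesson's data.

```python
import pandas as pd

# Hypothetical toy column (not the lesson's data)
demo_pain = pd.Series(['ATA', 'NAP', 'ATA', 'ASY', 'TA', 'ASY', 'ASY'])

# Number of distinct categories
print(demo_pain.nunique())

# How many rows fall into each category
print(demo_pain.value_counts())
```

`value_counts()` is handy before One-Hot Encoding because it shows whether any category is very rare.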

### **Exercise 3 - Step 4: One-Hot Encode the column**

In the cell below use the `pd.get_dummies()` function to create dummy columns for the column `ChestPainType`. To make it easier to remember what the dummy columns represent, add the prefix `Pain` to each dummy column.

In [None]:
# Insert your code for Exercise 3 Step 4 here

import pandas as pd

# Encode the ChestPainType column
dummies = pd.get_dummies(hdDF['ChestPainType'], prefix='Pain', dtype=int)

# Display dummies DataFrame
display(dummies)

Unnamed: 0,Pain_ASY,Pain_ATA,Pain_NAP,Pain_TA
0,0,1,0,0
1,0,0,1,0
2,0,1,0,0
3,1,0,0,0
...,...,...,...,...
914,1,0,0,0
915,1,0,0,0
916,0,1,0,0
917,0,0,1,0


If your code is correct you should see the following output:

![__](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image14A.png)


Notice that for each row, there is only one column that has a value of `1`, while the rest of the columns in that row contain `0`. The first patient (index `0`) has the pain type `ATA` since he/she has a `1` in that column.
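The "exactly one `1` per row" property can be checked in code. The following is a minimal sketch on made-up data (`demo_pain` and `demo_dummies` are hypothetical names); it also uses `idxmax(axis=1)` to read back which dummy column holds the `1` in each row.

```python
import pandas as pd

# Hypothetical toy column (not the lesson's data)
demo_pain = pd.Series(['ATA', 'NAP', 'ATA', 'ASY'])
demo_dummies = pd.get_dummies(demo_pain, prefix='Pain', dtype=int)

# Every row of a One-Hot Encoded column should sum to exactly 1
print((demo_dummies.sum(axis=1) == 1).all())

# For each row, the label of the column that contains the 1
print(demo_dummies.idxmax(axis=1).tolist())
```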

### **Exercise 3 - Step 5: Merge dummy columns into dataset**

In the cell below write the code to add the dummy columns back into the `hdDF` DataFrame using the Pandas `pd.concat()` function. Set the display options to show 8 rows and all of the columns. Unlike Example 3-Step 5, don't print out specific columns. Instead just use the command `display(hdDF)` to see your DataFrame.

In [None]:
# Insert your code for Exercise 3 Step 5 here

import pandas as pd

# Merge dummies with the DataFrame
hdDF = pd.concat([hdDF,dummies],axis=1)

# Set max columns and max rows
pd.set_option('display.max_columns', hdDF.shape[1])
pd.set_option('display.max_rows', 8)

# Print out
display(hdDF)

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease,Pain_ASY,Pain_ATA,Pain_NAP,Pain_TA
0,40,1,ATA,140,289,0,Normal,172,N,-0.832432,Up,0,0,1,0,0
1,49,2,NAP,160,180,0,Normal,156,N,0.105664,Flat,1,0,0,1,0
2,37,1,ATA,130,283,0,ST,98,N,-0.832432,Up,0,0,1,0,0
3,48,2,ASY,138,214,0,Normal,108,Y,0.574711,Flat,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
914,68,1,ASY,144,193,1,Normal,141,N,2.357094,Flat,1,1,0,0,0
915,57,1,ASY,130,131,0,Normal,115,Y,0.293283,Flat,1,1,0,0,0
916,57,2,ATA,130,236,0,LVH,174,N,-0.832432,Flat,1,0,1,0,0
917,38,1,NAP,138,175,0,Normal,173,N,-0.832432,Up,0,0,0,1,0


If your code is correct you should see the following output:

![_](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image15A.png)
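A common pitfall with the merge step is running the cell more than once, which adds a second copy of every dummy column. The sketch below reproduces this on made-up data (`demoDF` and `demo_dummies` are hypothetical names) and shows one way to detect it.

```python
import pandas as pd

# Hypothetical toy data (not the lesson's dataset)
demoDF = pd.DataFrame({'Age': [40, 49, 37],
                       'ChestPainType': ['ATA', 'NAP', 'ATA']})
demo_dummies = pd.get_dummies(demoDF['ChestPainType'], prefix='Pain', dtype=int)

# Merging once adds one copy of each dummy column
merged_once = pd.concat([demoDF, demo_dummies], axis=1)
print(merged_once.shape)

# Merging again (as happens when the cell is re-run) duplicates them
merged_twice = pd.concat([merged_once, demo_dummies], axis=1)
print(merged_twice.columns.duplicated().any())
```

If your own DataFrame shows duplicated column names, re-running the notebook from the cell that first reads the data file restores a clean copy.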


### **Exercise 3 - Step 6: Remove the original column**

In the cell below, write the code to drop the `ChestPainType` column from the `hdDF` DataFrame.

Display 8 rows and all of the columns of your modified DataFrame.


In [None]:
# Insert your code for Exercise 3 Step 6 here

# Use drop method to drop the ChestPainType column
hdDF.drop('ChestPainType', axis=1, inplace=True)

# Set max columns and max rows
pd.set_option('display.max_columns', hdDF.shape[1])
pd.set_option('display.max_rows', 8)

# Display DataFrame
display(hdDF)

Unnamed: 0,Age,Sex,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease,Pain_ASY,Pain_ATA,Pain_NAP,Pain_TA
0,40,1,140,289,0,Normal,172,N,-0.832432,Up,0,0,1,0,0
1,49,2,160,180,0,Normal,156,N,0.105664,Flat,1,0,0,1,0
2,37,1,130,283,0,ST,98,N,-0.832432,Up,0,0,1,0,0
3,48,2,138,214,0,Normal,108,Y,0.574711,Flat,1,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
914,68,1,144,193,1,Normal,141,N,2.357094,Flat,1,1,0,0,0
915,57,1,130,131,0,Normal,115,Y,0.293283,Flat,1,1,0,0,0
916,57,2,130,236,0,LVH,174,N,-0.832432,Flat,1,0,1,0,0
917,38,1,138,175,0,Normal,173,N,-0.832432,Up,0,0,0,1,0


If your code is correct you should see the following output:

![_](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image17A.png)


### Removing the First Level

The **pd.get_dummies** function also includes a parameter named *drop_first*, which specifies whether to get k-1 dummies out of k categorical levels by removing the first level.

Why would you want to remove the first level?

Consider the column `Gender` in the `opDF` DataFrame. This column contains the string values `Male` and `Female`. Suppose we were to One-Hot Encode the `Gender` column with the following line of Python code:

> `dummies = pd.get_dummies(opDF['Gender'],prefix='Gender', dtype=int)`
>

Here is what the two dummy columns would look like:

~~~text
    Gender_Female  Gender_Male
0               0            1
1               0            1
2               1            0
..            ...          ...
7               0            1
8               0            1
9               0            1

[10 rows x 2 columns]
~~~

Now, ask yourself the question, "Do we really need **both** columns to know if a subject was a female or a male?"

The answer is, "Not really".

Suppose we removed the `First Level` using the following line of code:

> `dummies = pd.get_dummies(opDF['Gender'],prefix='Gender', dtype=int, drop_first=True)`
>

Here is the output:

~~~text
    Gender_Male
0             1
1             1
2             0
..          ...
7             1
8             1
9             1

[10 rows x 1 columns]
~~~

Since the first level `Gender_Female` was dropped, there is now only the single dummy column `Gender_Male` left. However, you only need this column to know the gender of each subject in the `opDF` dataframe!

Consider the subject in Row 0. This subject is a male since he has a `1` in the `Gender_Male` column. This is also true of the subject in Row 1. But the subject in Row 2 _must_ be a female because she has a `0` in the `Gender_Male` column. In other words, we can determine the gender of every subject by the values in a single column.

It turns out that this idea is not limited to situations where there are only two possible choices. You can **always** drop one column in _any_ set of dummy columns **without** losing any information. For this reason the function `pd.get_dummies()` is often used with the argument `drop_first` set to `True`.
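The claim that no information is lost can be verified directly. The sketch below uses a made-up column with three levels (`demo_levels` and the prefix `Level` are hypothetical) and rebuilds the dropped dummy column from the ones that remain.

```python
import pandas as pd

# Hypothetical toy column with three levels (not the lesson's data)
demo_levels = pd.Series(['A', 'B', 'C', 'B'])

# All three dummy columns
full = pd.get_dummies(demo_levels, prefix='Level', dtype=int)

# Only two dummy columns; the first level (Level_A) is dropped
reduced = pd.get_dummies(demo_levels, prefix='Level', dtype=int, drop_first=True)

# A row belongs to the dropped level exactly when all remaining columns are 0
recovered = 1 - reduced.sum(axis=1)

print(list(reduced.columns))
print((recovered == full['Level_A']).all())
```

Because each row has exactly one `1` across the full set of dummy columns, the dropped column is always 1 minus the sum of the others.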

### Example 4: Using `drop_first=True`

The code in the cell below begins by regenerating the original `opDF` DataFrame by re-reading the data file.

The code then creates dummy columns for the `ObesityCategory` column in `opDF`, dropping the first category. It merges the remaining 3 dummy columns with the DataFrame and finally drops the `ObesityCategory` column.

In [None]:
# Example 4 - Use drop_first = True

import pandas as pd

# Read the datafile
opDF = pd.read_csv(
    "https://biologicslab.co/BIO1173/data/obesity_prediction.csv",
    na_values=['NA','?'])

# Encode the ObesityCategory column as dummy variables
dummies = pd.get_dummies(opDF['ObesityCategory'], drop_first=True,
                         dtype=int, prefix='OBCat')

# Merge the dummies with the DataFrame
opDF = pd.concat([opDF,dummies],axis=1)

# Drop the column replaced by the dummies
opDF.drop('ObesityCategory', axis=1, inplace=True)

# Set max rows and max columns
pd.set_option('display.max_rows', 4)
pd.set_option('display.max_columns', 6)

# Display the DataFrame
display(opDF)

Unnamed: 0,Age,Gender,Height,...,OBCat_Obese,OBCat_Overweight,OBCat_Underweight
0,56,Male,173.575262,...,0,0,0
1,69,Male,164.127306,...,1,0,0
...,...,...,...,...,...,...,...
998,64,Male,164.192222,...,0,0,0
999,66,Female,178.537130,...,0,0,0


If your code is correct you should see the following output:

![_](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image09A.png)

Notice that the first subject (row `0`) is a male, age 56. He must also have a _normal weight_, even though the first category, `OBCat_Normal weight`, was dropped.

Why?

Since he has a zero in each of the remaining 3 obesity categories, the only possibility is that he would have had a `1` in the dropped category.
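Which category counts as "first" depends on how `pd.get_dummies()` orders the dummy columns. For a column of plain strings, the dummy columns come out in sorted order, so the level that sorts first is the one removed. The sketch below checks this on a made-up Series (`demo_cats` is a hypothetical name) that contains the four obesity categories.

```python
import pandas as pd

# Hypothetical toy column listing the four categories in arbitrary order
demo_cats = pd.Series(['Overweight', 'Normal weight', 'Obese', 'Underweight'])

# Column names with and without drop_first
all_cols = list(pd.get_dummies(demo_cats, prefix='OBCat', dtype=int).columns)
kept_cols = list(pd.get_dummies(demo_cats, prefix='OBCat', dtype=int,
                                drop_first=True).columns)

# The dropped column is the one missing from the reduced set
dropped = [col for col in all_cols if col not in kept_cols]
print(dropped)
```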

### **Exercise 4: Using `drop_first=True`**

In the cell below, start by regenerating the complete (original) Heart Disease DataFrame using your code from **Exercise 1**.

Then write the Python code to One-Hot Encode the `ChestPainType` column, setting the `drop_first` argument to `True` to drop the first dummy column. Merge the remaining 3 dummy columns with the `hdDF` DataFrame before dropping the `ChestPainType` column. Set the display options to print out 6 rows and 8 columns and print out the updated DataFrame.

In [None]:
# Insert your code for Exercise 4 here



If your code is correct you should see the following output:

![_](https://biologicslab.co/BIO1173/images/class_02/class_02_2_image10A.png)


## **Lesson Turn-in**

When you have completed and run all of the code cells, use **File --> Print... --> Save to PDF** to generate a PDF of your Colab notebook. Save your PDF as `Class_02_2.lastname.pdf` where _lastname_ is your last name, and upload the file to Canvas.

## **Poly-A Tail**



![___](https://upload.wikimedia.org/wikipedia/commons/e/ee/Pdp-11-40.jpg)


The **PDP–11** is a series of 16-bit minicomputers originally sold by Digital Equipment Corporation (DEC) from 1970 into the late 1990s, one of a set of products in the Programmed Data Processor (PDP) series. In total, around 600,000 PDP-11s of all models were sold, making it one of DEC's most successful product lines. The PDP-11 is considered by some experts to be the most popular minicomputer.

The PDP–11 included a number of innovative features in its instruction set and additional general-purpose registers that made it easier to program than earlier models in the PDP series. Further, the innovative Unibus system allowed external devices to be more easily interfaced to the system using direct memory access, opening the system to a wide variety of peripherals. The PDP–11 replaced the PDP–8 in many real-time computing applications, although both product lines lived in parallel for more than 10 years. The ease of programming of the PDP–11 made it popular for general-purpose computing.

The design of the PDP–11 inspired the design of late-1970s microprocessors including the Intel x86[1] and the Motorola 68000. The design features of PDP–11 operating systems, and other operating systems from Digital Equipment, influenced the design of operating systems such as CP/M and hence also MS-DOS. The first officially named version of Unix ran on the PDP–11/20 in 1970. It is commonly stated that the C programming language took advantage of several low-level PDP–11–dependent programming features, albeit not originally by design.

An effort to expand the PDP–11 from 16- to 32-bit addressing led to the VAX-11 design, which took part of its name from the PDP–11.

**History**

Previous machines

In 1963, DEC introduced what is considered to be the first commercial minicomputer in the form of the PDP–5. This was a 12-bit design adapted from the 1962 LINC machine that was intended to be used in a lab setting. DEC slightly simplified the LINC system and instruction set, aiming the PDP-5 at smaller settings that did not need the power of their larger 18-bit PDP-4. The PDP-5 was a success, ultimately selling about 1,000 machines. This led to the PDP–8, a further cost-reduced 12-bit model that sold about 50,000 units.

During this period, the computer market was moving from computer word lengths based on units of 6 bits to units of 8 bits, following the introduction of the 7-bit ASCII standard. In 1967–1968, DEC engineers designed a 16-bit machine, the PDP–X,[5] but management ultimately canceled the project as it did not appear to offer a significant advantage over their existing 12- and 18-bit platforms.

This prompted several of the engineers from the PDP-X program to leave DEC and form Data General. The next year they introduced the 16-bit Data General Nova.[6] The Nova sold tens of thousands of units and launched what would become one of DEC's major competitors through the 1970s and 1980s.

**Release**

Ken Olsen, president and founder of DEC, was more interested in a small 8-bit machine than the larger 16-bit system. This became the "Desk Calculator" project. Not long after, Datamation published a note about a desk calculator being developed at DEC, which caused concern at Wang Laboratories, who were heavily invested in that market. Before long, it became clear that the entire market was moving to 16-bit, and the Desk Calculator began a 16-bit design as well.

The team decided that the best approach to a new architecture would be to minimize the memory bandwidth needed to execute the instructions. Larry McGowan coded a series of assembly language programs using the instruction sets of various existing platforms and examined how much memory would be exchanged to execute them. Harold McFarland joined the effort and had already written a very complex instruction set that the team rejected, but a second one was simpler and would ultimately form the basis for the PDP–11.

When they first presented the new architecture, the managers were dismayed. It lacked single instruction-word immediate data and short addresses, both of which were considered essential to improving memory performance. McGowan and McFarland were eventually able to convince them that the system would work as expected, and suddenly "the Desk Calculator project got hot". Much of the system was developed using a PDP-10 where the SIM-11 simulated what would become the PDP–11/20 and Bob Bowers wrote an assembler for it.

At a late stage, the marketing team wanted to ship the system with 2K of memory[a] as the minimal configuration. When McGowan stated this would mean an assembler could not run on the system, the minimum was expanded to 4K. The marketing team also wanted to use the forward slash character for comments in the assembler code, as was the case in the PDP–8 assembler. McGowan stated that he would then have to use semicolon to indicate division, and the idea was dropped.[7]

The PDP–11 family was announced in January 1970 and shipments began early that year. DEC sold over 170,000 PDP–11s in the 1970s.

Initially manufactured of small-scale transistor–transistor logic, a single-board large-scale integration version of the processor was developed in 1975. A two- or three-chip processor, the J-11 was developed in 1979.

The last models of the PDP–11 line were the single board PDP–11/94 and PDP–11/93 introduced in 1990.