# Day2: EDA & Preprocessing

# EDA: Exploratory Data Analysis

**Exploratory Data Analysis (EDA)** is the process of analyzing and summarizing the main characteristics of a dataset using statistical and visualization techniques. It helps you understand the data's structure, detect patterns, spot anomalies, and determine relationships among variables.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset into a Pandas DataFrame
df = pd.read_csv('../data/heart.csv')

In [None]:
# View the first few rows of the dataset
df.head()

In [None]:
# Get information about data types and missing values
df.info()

In [None]:
# Get summary statistics of numeric columns
df.describe()

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
# Drop duplicates if necessary
df.drop_duplicates(inplace=True)
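Before dropping anything, it can help to count how many exact duplicate rows exist. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the heart dataset) to show `duplicated()` alongside `drop_duplicates()`.

```python
import pandas as pd

# Hypothetical toy frame; the third row repeats the first.
toy = pd.DataFrame({
    'Age': [40, 49, 40, 37],
    'Sex': ['M', 'F', 'M', 'M'],
})

# duplicated() marks every repeat of an earlier row as True.
n_dupes = toy.duplicated().sum()
print(f'duplicate rows: {n_dupes}')

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates()
print(deduped.shape)
```

If the count is zero, dropping duplicates leaves the frame unchanged.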

## Univariate Analysis

In [None]:
# Age Distribution
plt.figure(figsize=(8, 6))
df['Age'].hist(bins=20, color='skyblue', edgecolor='black')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

# Chest Pain Type Frequency
plt.figure(figsize=(8, 6))
df['ChestPainType'].value_counts().plot(kind='bar', color='salmon', edgecolor='black')
plt.title('Chest Pain Type Frequency')
plt.xlabel('Chest Pain Type')
plt.ylabel('Count')
plt.show()

## Bivariate Analysis

In [None]:
# Correlation Heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

# Age vs Cholesterol
plt.figure(figsize=(8, 6))
plt.scatter(df['Age'], df['Cholesterol'], alpha=0.6, edgecolor='k')
plt.title('Age vs Cholesterol')
plt.xlabel('Age')
plt.ylabel('Cholesterol')
plt.show()

## Grouped Analysis

In [None]:
# Resting Blood Pressure by Chest Pain Type
plt.figure(figsize=(10, 6))
sns.boxplot(x='ChestPainType', y='RestingBP', data=df, palette='Set2')
plt.title('Resting Blood Pressure by Chest Pain Type')
plt.xlabel('Chest Pain Type')
plt.ylabel('Resting Blood Pressure (mm Hg)')
plt.show()

## Outlier Detection

In [None]:
# Box Plot for Cholesterol
plt.figure(figsize=(8, 6))
sns.boxplot(y='Cholesterol', data=df, color='lightblue')
plt.title('Cholesterol Distribution and Outliers')
plt.ylabel('Cholesterol (mg/dl)')
plt.show()
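By default, a box plot flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The same rule can be applied numerically to list the flagged values. A minimal sketch, assuming a hypothetical `chol` series rather than the real `Cholesterol` column:

```python
import pandas as pd

# Hypothetical cholesterol-like values (mg/dl); 603 is deliberately extreme.
chol = pd.Series([180, 195, 210, 223, 237, 250, 264, 289, 603])

q1 = chol.quantile(0.25)
q3 = chol.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = chol[(chol < lower) | (chol > upper)]
print(outliers.tolist())
```

Whether a flagged value is an error or a genuine extreme measurement is a judgment call that the rule itself cannot make.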

## Patterns and Anomalies

In [None]:
# Exercise-Induced Angina by Heart Disease Status
plt.figure(figsize=(8, 6))
sns.countplot(x='ExerciseAngina', hue='HeartDisease', data=df, palette='pastel')
plt.title('Exercise-Induced Angina by Heart Disease Status')
plt.xlabel('Exercise Angina')
plt.ylabel('Count')
plt.legend(title='Heart Disease', loc='upper right')
plt.show()

# Preprocessing

Preprocessing is an essential step of the machine learning workflow and important for the performance of models. This notebook will introduce the major steps of preprocessing for machine learning. 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
data = pd.read_csv('../data/heart.csv')
# Check out the first few rows
data.head()

Below is a "data dictionary", containing information about each of the variables in the dataset.

| Feature           | Data Type                    | Description                                                                                                         |
|-------------------|-----------------------------|---------------------------------------------------------------------------------------------------------------------|
| **Age**           | continuous                 | Age of the patient in years                                                                                        |
| **Sex**           | binary discrete (M/F)      | Sex of the patient: M = Male, F = Female                                                                           |
| **ChestPainType** | multi-valued discrete (TA, ATA, NAP, ASY) | Type of chest pain: TA = Typical Angina, ATA = Atypical Angina, NAP = Non-Anginal Pain, ASY = Asymptomatic         |
| **RestingBP**     | continuous                 | Resting blood pressure measured in mm Hg                                                                           |
| **Cholesterol**   | continuous                 | Serum cholesterol level measured in mg/dl                                                                          |
| **FastingBS**     | binary discrete (0/1)      | Fasting blood sugar: 1 = Fasting blood sugar > 120 mg/dl, 0 = Fasting blood sugar ≤ 120 mg/dl                      |
| **RestingECG**    | multi-valued discrete (Normal, ST, LVH) | Resting electrocardiogram results: Normal = Normal, ST = ST-T wave abnormality, LVH = Left ventricular hypertrophy |
| **MaxHR**         | continuous                 | Maximum heart rate achieved, numeric value between 60 and 202                                                      |
| **ExerciseAngina**| binary discrete (Y/N)      | Presence of exercise-induced angina: Y = Yes, N = No                                                               |
| **Oldpeak**       | continuous                 | Depression of the ST segment measured in numeric value (Oldpeak)                                                   |
| **ST_Slope**      | multi-valued discrete (Up, Flat, Down) | Slope of the peak exercise ST segment: Up = upsloping, Flat = flat, Down = downsloping                            |
| **HeartDisease**  | binary discrete (0/1)      | Output class indicating heart disease: 1 = Presence of heart disease, 0 = Normal                                  |

## Exploratory Data Analysis

Let's start by getting familiar with our data. This is an important first step before jumping into any modeling.

How many samples in the dataset do we have?

In [None]:
data.shape

This is a pretty small dataset.

Let's look at the distribution of the `Age` variable:

In [None]:
ax = data['Age'].hist(grid=False, bins=np.linspace(1, 100, 20))
ax.set_xlabel('Age')
ax.set_ylabel('Frequency')
plt.show()

How does age correlate with the other numeric variables? We can use the `corr()` method to find out:

In [None]:
data.corr(numeric_only=True)

---
### Challenge 1: More EDA

Create the following plots, or examine the following distributions, while exploring your data:

1. What are the column names of this data frame?
2. A histogram of the continuous variables.
3. What are the unique values of `ExerciseAngina`, and their counts?
4. What are the unique `ChestPainType` values, and their counts?

---

# What would be a good machine learning question for this data set?

# Creating Train and Test Splits

Next, we'll want to split our dataset into training and test data. When creating the model, we need to make sure it only sees the training data. Then, we can examine how well it **generalizes** to data it hasn't seen before. The train and test split is a foundational concept in machine learning. Be sure you're confident you understand why we do this before moving forward!

A dataset is often broken up into a feature set, or **design matrix** (typically with the variable name `X`) as well as the target or response variable `y`. Both have $D$ samples, but the design matrix will have a second dimension indicating the number of features we're using for prediction.

In this case, we'll extract the output variable `RestingBP` from the data frame to make the `X` and `y` variables. We use a capital `X` to denote it is a `matrix` or 2-D array, and use a lowercase `y` to denote that it is a `vector`, or 1-D array.

In [None]:
# Remove the response variable
X = data.drop(columns=['RestingBP'])
# Assign response variable to its own variable
y = data['RestingBP'].astype(np.float64)
# Confidence check
print(X.shape)
print(y.shape)

Now, we perform the train/test split. The package `scikit-learn` is the most commonly used package for machine learning in Python. It provides a function we can easily use to perform this split. Let's import it:

In [None]:
from sklearn.model_selection import train_test_split

We commonly do an 80/20 split, where 80% of the data is used for training, and the remaining 20% is used for testing. We can customize this using the parameters of the `train_test_split` function, which you can find in the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

We typically split the data randomly. However, sometimes we want this random split to occur in a *reproducible* fashion. This might be when we're testing our code, and want the same random split every time. Or, during a workshop, when we want all participants to get the same split, so that the results look the same for everyone. A reproducible random split can be done by setting the `random_state`, which is an input argument to `train_test_split`.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)

In [None]:
print(f'X train shape: {X_train.shape}; y train shape: {y_train.shape}')
print(f'X test shape: {X_test.shape}; y test shape: {y_test.shape}')
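To see what `random_state` buys us, the sketch below splits a small made-up array twice with the same seed and checks that both calls return the same test rows. The arrays `X_toy` and `y_toy` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Same random_state -> identical split on every call.
a_train, a_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=23)
b_train, b_test, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=23)

print(np.array_equal(a_test, b_test))
print(a_train.shape, a_test.shape)
```

With `test_size=0.2`, 2 of the 10 toy samples land in the test set and the remaining 8 in the training set.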

BEFORE we split the data, there are certain preprocessing tasks we need to do. ORDER MATTERS!

## Missing Data Preprocessing

First, let's check to see if there are any missing values in the data set. Missing values are represented by `NaN`. 

**Question:** In this case, what do missing values stand for?

In [None]:
data.isnull().sum()

There are no `NaN` values. But does that mean nothing is missing?

In [None]:
data['Sex'].unique()

In [None]:
data['Age'].unique()

In [None]:
data['RestingBP'].unique()

In [None]:
data['Cholesterol'].unique()

A resting blood pressure or cholesterol level of `0` is not a plausible measurement, so in these two columns `0` represents a missing value. Let's replace those with `np.nan`.

In [None]:
data.columns

In [None]:
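# Replace the placeholder zeros with NaN. Assigning the result back is more
# reliable than calling replace(..., inplace=True) on a selected column.
data['RestingBP'] = data['RestingBP'].replace(0, np.nan)
data['Cholesterol'] = data['Cholesterol'].replace(0, np.nan)

# Count the missing values we just introduced
data[['RestingBP', 'Cholesterol']].isna().sum()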

### Imputation

In the case of missing values, we have the option to fill in the missing values with the best guess. This is called **imputation**. Here we'll impute any missing values using the average, or mean, of all the data that does exist, as that's the best guess for a data point if all we have is the data itself. To do that we'll use the `SimpleImputer` to assign the mean to all missing values in the data.

There are also other strategies that can be used to impute missing data ([see documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)).

Let's see how the `SimpleImputer` works on a subset of the data. 

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan,
                        strategy='mean', 
                        copy=True)
imputed = imputer.fit_transform(data[['RestingBP','Cholesterol']])

Now let's check that the previously null values have been filled in. 

In [None]:
print(imputed[data[data['RestingBP'].isna()].index])
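The mean is only one possible fill value. As a rough illustration of how the choice of strategy matters, the sketch below imputes the same made-up column (`col`, not taken from the heart data) with the mean and with the median.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-column data with one missing value and one extreme value.
col = np.array([[100.0], [110.0], [120.0], [np.nan], [400.0]])

mean_imp = SimpleImputer(strategy='mean').fit_transform(col)
median_imp = SimpleImputer(strategy='median').fit_transform(col)

# The mean is pulled up by the extreme value; the median is not.
print(mean_imp[3, 0])    # mean of 100, 110, 120, 400
print(median_imp[3, 0])  # median of 100, 110, 120, 400
```

When a column contains strong outliers, the median is often the more conservative choice.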

### Dropping Null Values

Another option is to use `DataFrame.dropna()` to drop null values from the `DataFrame`. This should almost always be used with the `subset` argument, which restricts the function to dropping only rows that are null in certain column(s).

In [None]:
data = data.dropna(subset=['Sex'])

# Now this line will return an empty dataframe
data[data['Sex'].isna()]

## Categorical Data Processing

The heart disease dataset contains both categorical and continuous features, which need to be preprocessed in different ways. First, we want to transform the categorical variables from strings to **indicator variables**. Indicator variables have one column per level. For example, the `ChestPainType` variable will change from ATA/NAP/ASY/TA --> ATA (1/0), NAP (1/0), ASY (1/0), and TA (1/0). For each set of indicator variables, there should be a 1 in exactly one column.

In [None]:
data['ST_Slope'].unique()

In [None]:
data.dtypes

Let's make a list of the categorical variable names to be transformed into indicator variables.

In [None]:
# Define the variable names that are categorical for use later
cat_var_names = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope', 'FastingBS']
data_cat = data[cat_var_names]
data_cat.head()

### Categorical Variable Encoding (One-hot & Dummy)

Many machine learning algorithms require that categorical data be encoded numerically in some fashion. There are two main ways to do so:


- **One-hot-encoding**, which creates `k` new variables for a single categorical variable with `k` categories (or levels), where each new variable is coded with a `1` for the observations that contain that category, and a `0` for each observation that doesn't. 
- **Dummy encoding**, which creates `k-1` new variables for a categorical variable with `k` categories

However, with some machine learning algorithms we can run into the so-called ["Dummy Variable Trap"](https://www.algosome.com/articles/dummy-variable-trap-regression.html) when one-hot encoding categorical variables. Within each set of one-hot-encoded variables, the columns add up to a single column of all `1`s. That column duplicates the model's intercept term (and the all-`1`s column of every other one-hot-encoded variable), so the features are multicollinear. This can lead to misleading results.

To resolve this, we keep an intercept term in our model (which is all `1`s) and remove the first one-hot-encoded variable for each categorical variable, resulting in `k-1` so-called "Dummy Variables".

Luckily the `OneHotEncoder` from `sklearn` can perform both one-hot and dummy encoding simply by setting the `drop` parameter (`drop = 'first'` for Dummy Encoding and `drop = None` for One Hot Encoding). 
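The difference between the two encodings is easiest to see on a tiny example. The sketch below uses a made-up `colors` feature with three levels (not a column from the heart data) and calls `.toarray()` on the encoder output so that it does not depend on the name of the dense-output argument in your scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature with k = 3 levels.
colors = np.array([['red'], ['green'], ['blue'], ['green']])

# .toarray() converts the default sparse output to a dense array.
one_hot = OneHotEncoder(drop=None).fit_transform(colors).toarray()
dummy = OneHotEncoder(drop='first').fit_transform(colors).toarray()

print(one_hot.shape)  # k columns
print(dummy.shape)    # k - 1 columns

# Each one-hot row sums to 1, which duplicates an intercept column of 1s.
print(one_hot.sum(axis=1))
```

The dropped level is not lost: a row of all `0`s in the dummy encoding stands for the first category.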

**Question:** How many total columns will there be in the output?

In [None]:
from sklearn.preprocessing import OneHotEncoder
# `sparse_output` replaced the older `sparse` argument in scikit-learn 1.2
dummy_e = OneHotEncoder(categories='auto', drop='first', sparse_output=False)
dummy_e.fit(data_cat);
dummy_e.categories_

In [None]:
temp = dummy_e.transform(data_cat)

## Continuous Data Preprocessing

For numeric data, we don't need to create indicator variables. Instead, we need to normalize our variables, which helps improve the performance of many machine learning models.

Let's subset out the continuous variables to be normalized.

In [None]:
data_num = data.drop(columns=cat_var_names + ['HeartDisease'])
data_num.head()

### Normalization

[Normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) is a transformation that puts data into some known "normal" scale. We use normalization to improve the performance of many machine learning algorithms (see [here](https://en.wikipedia.org/wiki/Feature_scaling)). There are many forms of normalization, but perhaps the most useful to machine learning algorithms is called the "z-score" also known as the standard score. 

To z-score normalize the data, we simply subtract the mean of the data, and divide by the standard deviation. This results in data with a mean of `0` and a standard deviation of `1`.

We'll use the `StandardScaler` from `sklearn` to do normalization.

In [None]:
from sklearn.preprocessing import StandardScaler
norm_e = StandardScaler()
np.nanmean(norm_e.fit_transform(data_num), axis=0)

To check that the normalization works, let's look at the mean and standard deviation of the resulting columns.

**Question:** What should the mean and standard deviation be?

In [None]:
# nan-aware functions skip the NaNs that replaced the zero placeholders
scaled = norm_e.fit_transform(data_num)
print('mean:', np.nanmean(scaled, axis=0))
print('std:', np.nanstd(scaled, axis=0))
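To connect the scaler back to the z-score formula, the sketch below computes the z-scores by hand on a made-up feature `x` and compares them with the output of `StandardScaler`. The values are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical measurements for a single feature.
x = np.array([[120.0], [130.0], [140.0], [150.0], [160.0]])

# z = (x - mean) / std, using the population standard deviation (ddof=0),
# which is what StandardScaler uses as well.
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)
z_sklearn = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_sklearn))
print(z_manual.mean(), z_manual.std())
```

Note that pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`), so a hand calculation with pandas will differ slightly unless `ddof=0` is passed.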

---
## Challenge 1: Fitting preprocessing functions

The simple imputer, normalization and one-hot-encoding rely on sklearn functions that are fit to a data set. 

1) What is being fit for each of the three functions?
    1) One Hot Encoding
    2) Standard Scaler
    3) Simple Imputer
    
*YOUR ANSWER HERE*

When we are preprocessing data we have a few options: 
1) Fit on the whole data set
2) Fit on the training data
3) Fit on the testing data

Which of the above methods would you use and why?

*YOUR ANSWER HERE*

---


## Combine it all together

Now let's combine what we've learned to preprocess the entire dataset.

First we will reload the data set to start with a clean copy.

In [None]:
data = pd.read_csv('../data/heart.csv')
# Assign the result back instead of using inplace=True on a selected column
data['RestingBP'] = data['RestingBP'].replace(0, np.nan)
data['Cholesterol'] = data['Cholesterol'].replace(0, np.nan)

In [None]:
data.columns

In [None]:
# Perform the train-test split
y = data['HeartDisease']
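X = data.drop(columns=['HeartDisease'])
# stratify=y keeps the class balance the same in both splits;
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=23)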
print(X_train.shape)

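We want to fit our preprocessing steps (the encoder, imputer, and scaler) on the training data using `fit_transform`, then apply them to the test data using `transform`. This more closely resembles the workflow when you bring in brand new test data, and it keeps information from the test set out of the fitted parameters.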
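As a small illustration of the fit-on-train, transform-on-test pattern, the sketch below scales a made-up feature. The arrays `train` and `test` are hypothetical. The point is that the test value is expressed in units of the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train and test values for one feature.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_)   # mean learned from the training data
print(test_scaled)    # test value expressed in training-set units
```

Calling `fit_transform` on the test data instead would recompute the mean and standard deviation from the test set, so the two splits would no longer be on the same scale.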

First, we will subset out the categorical and numerical features separately. 

In [None]:
# List the categorical and numerical variable column names
cat_var = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope', 'FastingBS']
num_var = ['Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']
# Subset the training data
X_train_cat = X_train[cat_var]
X_train_num = X_train[num_var]

# Subset the test data
X_test_cat = X_test[cat_var]
X_test_num = X_test[num_var]

Now, let's process the categorical data with **Dummy encoding**

In [None]:
import warnings
warnings.filterwarnings('ignore')

# Categorical feature encoding
X_train_dummy = dummy_e.fit_transform(X_train_cat)
X_test_dummy = dummy_e.transform(X_test_cat)


# Check the shape
X_train_dummy.shape, X_test_dummy.shape

Now, let's process the numerical data by imputing any missing values and normalizing the results.

In [None]:
# Numerical feature standardization

# Impute the data
X_train_imp = imputer.fit_transform(X_train_num)
X_test_imp = imputer.transform(X_test_num)

# Check for missing values
print(np.isnan(X_train_imp).any(), np.isnan(X_test_imp).any())

# normalize
X_train_norm = norm_e.fit_transform(X_train_imp)
X_test_norm = norm_e.transform(X_test_imp)

X_train_norm.shape, X_test_norm.shape

Now that we've processed the numerical and categorical data separately, we can put the two arrays back together.

In [None]:
X_train = np.hstack((X_train_dummy, X_train_norm))
X_test = np.hstack((X_test_dummy, X_test_norm))

X_train.shape, X_test.shape

In [None]:
dummy_e.get_feature_names_out()

In [None]:
X_train
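The same sequence of steps can also be bundled into a single object. The sketch below is an alternative that this notebook does not use: it wires the encoder, imputer, and scaler together with scikit-learn's `ColumnTransformer` and `Pipeline`. The frames `toy_train` and `toy_test` are hypothetical miniatures, not the heart data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature frames with one categorical and one numeric column.
toy_train = pd.DataFrame({'Sex': ['M', 'F', 'M', 'F'],
                          'Cholesterol': [200.0, np.nan, 240.0, 220.0]})
toy_test = pd.DataFrame({'Sex': ['F', 'M'],
                         'Cholesterol': [np.nan, 260.0]})

# Numeric columns: impute with the mean, then z-score.
num_pipe = Pipeline([('impute', SimpleImputer(strategy='mean')),
                     ('scale', StandardScaler())])

preprocess = ColumnTransformer(
    [('cat', OneHotEncoder(drop='first'), ['Sex']),
     ('num', num_pipe, ['Cholesterol'])],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training frame only, then reuse the fitted steps on the test frame.
train_out = preprocess.fit_transform(toy_train)
test_out = preprocess.transform(toy_test)
print(train_out.shape, test_out.shape)
```

Bundling the steps this way makes it harder to accidentally fit a transformer on the test data, since `fit_transform` and `transform` are each called once for the whole pipeline.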

---
## Challenge 2: Order of Preprocessing

In the preprocessing we did the following steps: 

1) Null values
2) One-hot-encoding
3) Imputation
4) Normalization

Now, consider that we change the order of the steps in the following ways. What effect might that have on the algorithms?
**Hint**: Try copying the code from above and trying it out!

- One-Hot-Encoding before Null Values
- Normalization before Null values

**Bonus:** Are there any other switches in order that might affect preprocessing?

---

In [None]:
# YOUR CODE HERE

Finally, let's save our results as separate `.csv` files, so we won't have to run the preprocessing again.

First we will convert them to DataFrames, add column names, and then save them as `.csv` files.

In [None]:
# Column names: dummy-encoded categorical features first, then the numeric ones
feature_names = ['Sex_M', 'ChestPainType_ATA', 'ChestPainType_NAP',
                 'ChestPainType_TA', 'RestingECG_Normal', 'RestingECG_ST',
                 'ExerciseAngina_Y', 'ST_Slope_Flat', 'ST_Slope_Up', 'FastingBS_1',
                 'Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'Oldpeak']

X_train = pd.DataFrame(X_train, columns=feature_names)
X_test = pd.DataFrame(X_test, columns=feature_names)

y_train = pd.DataFrame(y_train)
y_train.columns = ['HeartDisease']

y_test = pd.DataFrame(y_test)
y_test.columns = ['HeartDisease']

X_train.to_csv('../data/heart_X_train.csv')
X_test.to_csv('../data/heart_X_test.csv')
y_train.to_csv('../data/heart_y_train.csv')
y_test.to_csv('../data/heart_y_test.csv')
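One detail to keep in mind when loading these files later: `to_csv` writes the index as an unnamed first column by default. The sketch below shows the round trip on a made-up frame, using an in-memory buffer in place of a file on disk. The frame `toy` is hypothetical.

```python
import io
import pandas as pd

# Hypothetical frame standing in for one of the processed splits.
toy = pd.DataFrame({'Age': [0.5, -1.2], 'Sex_M': [1.0, 0.0]})

# to_csv writes the index as an unnamed first column by default.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# index_col=0 tells read_csv to treat that first column as the index again.
restored = pd.read_csv(buffer, index_col=0)
print(restored.columns.tolist())
print(restored.shape)
```

Without `index_col=0`, the saved index would come back as an extra column and could be mistaken for a feature.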


Although now we will move on to talk about classification, all of the choices we make in the preprocessing pipeline are extremely important to machine learning.