## Homework 3 - Feature Engineering for Classification

Goal: Train the best classifier possible for Heart Disease Prediction while trying various feature engineering techniques and learning to run experiments to find the best possible model.


### Feature Preprocessing
For each categorical feature, implement two feature representations:
1. OneHot Encoding: For example, transform `X_train['Sex']` with values `M` or `F` into two features, `X_train['Sex_M']` and `X_train['Sex_F']`, with values `0` or `1`.
2. Target Encoding: For example, transform `X_train['Sex']` with values `M` or `F` into a feature `X_train['Sex-TargetEncoded']` whose value is the average rate of heart disease among `M` and `F` samples respectively.

Please implement these yourself, but you can check against sklearn.preprocessing implementations for correctness.

The set of categorical features is:
`categorical_features = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']`
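
A minimal sketch of how both encodings could be fit on the training data and then reused on the test data. The helper names (`fit_one_hot`, `apply_one_hot`, `fit_target_encoding`, `apply_target_encoding`) are hypothetical, the tiny dataframe is made up, and falling back to the overall training rate for categories never seen in training is an assumption rather than a requirement of the assignment.

```python
import pandas as pd


def fit_one_hot(train_col):
    """Record the categories seen in the training column."""
    return sorted(train_col.unique())


def apply_one_hot(col, categories):
    """Create one 0/1 column per training category, e.g. Sex_F and Sex_M."""
    return pd.DataFrame(
        {f"{col.name}_{c}": (col == c).astype(int) for c in categories},
        index=col.index,
    )


def fit_target_encoding(train_col, y_train):
    """Mean label per category plus the overall mean, from training data only."""
    return y_train.groupby(train_col).mean(), y_train.mean()


def apply_target_encoding(col, rates, fallback):
    """Map each category to its training rate; unseen categories get the fallback."""
    return col.map(rates).fillna(fallback).rename(f"{col.name}-TargetEncoded")


# Tiny made-up example (not the heart disease data)
train = pd.DataFrame({"Sex": ["M", "F", "M", "M"]})
labels = pd.Series([1, 0, 1, 0])

categories = fit_one_hot(train["Sex"])
one_hot = apply_one_hot(train["Sex"], categories)

rates, fallback = fit_target_encoding(train["Sex"], labels)
encoded = apply_target_encoding(train["Sex"], rates, fallback)
```

Calling the `apply_*` helpers on `X_test` columns with the values returned by the `fit_*` helpers keeps test data out of the encoding parameters.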


For numerical features, implement two feature normalizations:
1. Standard Scaler (scale numeric values to have zero mean and unit variance). For example `X_train['Age-Scaled'] = (X_train['Age'] - mean) / standard_deviation`
2. MinMax Normalization (subtract min and divide by max - min). For example `X_train['Age-MinMax'] = (X_train['Age'] - min) / (max - min)`

Please implement these yourself but you can check against the sklearn.preprocessing implementations.

The set of numeric features is:
`numeric_features = ["Age", "RestingBP", "Cholesterol", "FastingBS", "MaxHR", "Oldpeak"]`
    
Note, any feature preprocessing parameters (like mean or standard deviation) should be calculated on training data only.
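
As a rough illustration of the note above, the sketch below computes the scaling parameters on a training column and applies them unchanged to a test column. The function names and the three-value `Age` column are made up for the example. It uses `ddof=0` so the result matches `sklearn.preprocessing.StandardScaler`; the pandas default for `.std()` is `ddof=1`, which gives slightly different values.

```python
import pandas as pd


def fit_standard(train_col):
    """Mean and population standard deviation (ddof=0) of the training column."""
    return train_col.mean(), train_col.std(ddof=0)


def apply_standard(col, mean, std):
    """Scale any column with parameters that came from the training data."""
    return (col - mean) / std


def fit_min_max(train_col):
    """Minimum and maximum of the training column."""
    return train_col.min(), train_col.max()


def apply_min_max(col, lo, hi):
    """Min-max normalize any column with the training minimum and maximum."""
    return (col - lo) / (hi - lo)


# Made-up values standing in for X_train['Age'] and X_test['Age']
train_age = pd.Series([40, 50, 60], name="Age")
test_age = pd.Series([70], name="Age")

mean, std = fit_standard(train_age)
lo, hi = fit_min_max(train_age)

train_scaled = apply_standard(train_age, mean, std)
test_scaled = apply_standard(test_age, mean, std)
test_min_max = apply_min_max(test_age, lo, hi)
```

Because the parameters come from the training data, test values can fall outside `[0, 1]` after min-max normalization, as the test value of 70 does here.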

### Feature Engineering:
Create at least 5 custom features as functions of other features:

Some ideas:
  - A binary feature representing High Heart Rate `Custom-BinaryHighMaxHR`
  - A bucketized categorical feature of Oldpeak or of MaxHR, for example `Custom-BucketizedOldpeak`
  - A feature cross of bucketized versions of Oldpeak and MaxHR, for example `HighOldPeak_X_LowMaxHR`
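
The three ideas above could be sketched as follows. The thresholds, bucket edges, and bucket labels are illustrative assumptions, not values prescribed by the assignment, and the three-row dataframe is made up.

```python
import pandas as pd

# Made-up rows standing in for the training data
df = pd.DataFrame({"MaxHR": [100, 150, 180], "Oldpeak": [2.5, 0.0, 1.0]})

# Binary feature: 1 when MaxHR is above an assumed threshold
hr_threshold = 140
df["Custom-BinaryHighMaxHR"] = (df["MaxHR"] > hr_threshold).astype(int)

# Bucketized feature: assumed bucket edges at 0.5 and 1.5
df["Custom-BucketizedOldpeak"] = pd.cut(
    df["Oldpeak"],
    bins=[-float("inf"), 0.5, 1.5, float("inf")],
    labels=["low", "mid", "high"],
).astype(str)

# Feature cross: 1 only when Oldpeak is in the high bucket AND MaxHR is not high
df["Custom-HighOldPeak_X_LowMaxHR"] = (
    (df["Custom-BucketizedOldpeak"] == "high") & (df["Custom-BinaryHighMaxHR"] == 0)
).astype(int)
```

The bucketized column is categorical, so it still needs OneHot or Target Encoding before it is passed to a model.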
  
### Feature Experiments:
Please run at least this set of experiments, but try others as you see fit.
1. All features with no scaling on numeric features and OneHot for categorical. No custom features.
2. All features using StandardScaler for numeric and OneHot for categorical. No custom features.
3. All features using StandardScaler for numeric and TargetEncoding for categorical. No custom features.
4. All features using MinMax-Normalization for numeric and OneHot for categorical. No custom features.
5. The kitchen sink: Include everything to try to get the best performance possible.
6. Only custom features. How good can you get with your own custom feature set?
7. Only categorical features using one of the encodings.
8. Only numerical features using one of the encodings.


### Model Training
Model training Data:
Separate data into a training set with 80% of the samples and a test set with 20%. 
`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)`. Note that ideally we would also have a validation set, but since this dataset is fairly small, we will just do train/test splits. Please use the same random state as above so we have comparable performance across the class.

### Models
Please run all experiments using at least 4 different models:
1. Logistic regression using `sklearn.linear_model.LogisticRegression`. Extra credit of +2 if you use your own gradient descent implementation from Homework 2.
2. Decision Trees: using `sklearn.tree.DecisionTreeClassifier`. Hyperparameter here is the depth of the tree.
3. Neural Net classifier using `sklearn.neural_network.MLPClassifier`. Hyperparameters here include the shape of the network and the learning rate. Consider using the 'adam' solver as a constructor argument and 'relu' activation functions (both are the defaults), and experiment with the shape of the network, but start small. In `hidden_layer_sizes`, the i-th entry is the number of units in the i-th hidden layer, so `(4,)` is 4 hidden units in one layer and `(3, 3)` is 2 hidden layers of 3 units each.
4. At least 1 other classifier. Some ideas include SVM classifiers or Gradient Boosted Decision trees which are all available in sklearn as well
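
One possible way to organize the models and a few hyperparameter settings is a dictionary keyed by experiment label, sketched below. The function name `make_models`, the labels, and the specific hyperparameter values are assumptions chosen for illustration; `GradientBoostingClassifier` stands in for the "other classifier". In scikit-learn, `hidden_layer_sizes=(4,)` is one hidden layer of 4 units and `(3, 3)` is two hidden layers of 3 units each.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier


def make_models(seed=42):
    """Return a dict mapping a label to an unfitted classifier."""
    models = {"LogReg": LogisticRegression(max_iter=1000)}

    # Decision trees: the main hyperparameter is the tree depth
    for depth in (2, 4, 8):
        models[f"DecisionTree-depth{depth}"] = DecisionTreeClassifier(
            max_depth=depth, random_state=seed
        )

    # Small neural nets: vary the hidden layer shape
    for shape in ((4,), (3, 3)):
        label = "x".join(str(units) for units in shape)
        models[f"MLP-{label}"] = MLPClassifier(
            hidden_layer_sizes=shape,
            solver="adam",
            activation="relu",
            learning_rate_init=0.01,
            max_iter=2000,
            random_state=seed,
        )

    # One additional classifier
    models["GradientBoosting"] = GradientBoostingClassifier(random_state=seed)
    return models


models = make_models()
```

Looping over `models.items()` for each feature setup gives one experiment per (feature setup, model) pair.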


### Question 1: Run Experiments (15 points)

We have many varying parameters. Often in ML we run multiple experiments to find the best result.
- at least 8 different feature setups, maybe more
- at least 4 different model algorithms
- some number of hyperparameters, let's say we try at least 10 for each model

This gives us a minimum of 320 different experiments. To easily keep track of these, let's create a spreadsheet to track the experiment results. 

Create a .csv file with the following columns tracking your experiments. Make sure this .csv can be loaded into Google sheets so I can grade it. Upload the .csv separately.

1. Experiment Name string. For example `Experiment1-LogReg-OneHot-Custom`. This name can be anything.
2. List of Features Included in Model, separated by semicolons. Please name these according to this scheme: `<FeatureName>-<Preprocessing>-<Value>`, for example `MaxHR-StandardScaler` or `RestingECG-OneHot-Normal`. For custom features, name them `Custom-<Name>`, for example `Custom-BucketizedOldpeak`. As an example, a row in this column might look like `MaxHR; MaxHR-StandardScaler; Custom-BinaryHighHeartRate; ...`. If your features are in your dataframe, you can create this list with `';'.join(X_train.columns)`.
3. Model Type (MyLogisticRegression or DecisionTree or anything else you want to try)
4. Epochs (for logistic regression)
5. Hyperparameters used in your model, for example depth of tree or size of neural network
6. Training Accuracy (w/ threshold of 0.5)
7. Testing Accuracy (w/ threshold of 0.5)
8. Training PR-AUC (Area under the precision/recall curve)
9. Testing PR-AUC (Area under the precision/recall curve)
10. Training Precision (w/ threshold of 0.5)
11. Testing Precision (w/ threshold of 0.5)
12. Training Recall (w/ threshold of 0.5)
13. Testing Recall (w/ threshold of 0.5)
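
A sketch of how one row of this spreadsheet could be produced. The data is synthetic, the `evaluate` helper and the short column names are hypothetical, and `average_precision_score` is used as the PR-AUC summary, which is an assumption: it can differ slightly from a trapezoidal area computed from `precision_recall_curve`.

```python
import csv
import io

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split


def evaluate(model, X, y, prefix, threshold=0.5):
    """Return the four metrics for one data split, keyed by column name."""
    scores = model.predict_proba(X)[:, 1]      # predicted probability of class 1
    preds = (scores >= threshold).astype(int)  # apply the decision threshold
    return {
        f"{prefix} Accuracy": accuracy_score(y, preds),
        f"{prefix} PR-AUC": average_precision_score(y, scores),
        f"{prefix} Precision": precision_score(y, preds, zero_division=0),
        f"{prefix} Recall": recall_score(y, preds, zero_division=0),
    }


# Synthetic stand-in data so the sketch runs on its own
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.20, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

row = {
    "Experiment Name": "Example-LogReg-Synthetic",
    "Features": ";".join(["f0", "f1", "f2"]),
    "Model Type": "LogisticRegression",
    "Hyperparameters": "max_iter=1000",
}
row.update(evaluate(model, X_tr, y_tr, "Training"))
row.update(evaluate(model, X_te, y_te, "Testing"))

# Write to an in-memory buffer here; a real run would open a .csv file instead
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
```

Replacing `buffer` with `open("experiments.csv", "w", newline="")` (a hypothetical file name) and calling `writer.writerow` once per experiment would produce the file to upload.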

Extra Credit (2 points): Use Weights And Biases (https://wandb.ai/) to track your experiments as an alternative to a spreadsheet (https://towardsdatascience.com/introduction-to-weight-biases-track-and-visualize-your-machine-learning-experiments-in-3-lines-9c9553b0f99d)


### Question 2: Analyze Logistic Regression Experiments (4 points)

- Question 2.1: What is the best set of features and parameters for Logistic Regression in terms of Test PR-AUC? Is this the same as the best in terms of accuracy? Discuss why you think this experiment showed the best results.
- Question 2.2: If features are well normalized, we can get a sense of feature importance by looking at the absolute value of the weights of each feature. Print out the highest 5 weights by absolute value. Which features are these? Discuss what effect this set of 5 features has on the model.
- Question 2.3: Train a model with just those top 5 features. What is the Test Accuracy and PR-AUC?
- Question 2.4: Describe the custom features you added. Why did you pick these?
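
For Question 2.2, ranking weights by absolute value could look like the sketch below. It uses synthetic data with made-up feature names `f0` to `f5`, where only `f0` and `f3` drive the label; with your own model, `X` would be the normalized training dataframe.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic, already-normalized features; only f0 and f3 influence the label
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 6)), columns=[f"f{i}" for i in range(6)])
y = (3 * X["f0"] - 2 * X["f3"] + 0.1 * rng.normal(size=300) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# One weight per feature; coef_ has shape (1, n_features) for binary problems
weights = pd.Series(model.coef_[0], index=X.columns)

# Order by absolute value but keep the signed weights for interpretation
order = weights.abs().sort_values(ascending=False).index
top5 = weights.reindex(order).head(5)
print(top5)
```

`list(top5.index)` then gives the column subset to train on for Question 2.3.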

### Question 3: Analyze Other Experiments (4 points)

- Question 3.1: What is the best set of features and parameters for the Decision Tree classifier in terms of Test PR-AUC? Discuss why you think this experiment showed the best results.
- Question 3.2: What is the best set of features and parameters for the Neural Net classifier in terms of Test PR-AUC? Discuss why you think this experiment showed the best results.
- Question 3.3: What is the best set of features and parameters for another chosen model in terms of Test PR-AUC? Discuss why you think this experiment showed the best results.

### Question 4: Achieving particular performance characteristics (2 points)
- Question 4.1: From your trained models, can you produce a classifier that has around 90% recall on the test set? How? What is the precision of this model?
- Question 4.2: From your trained models, can you produce a classifier that has around 90% precision on the test set? How? What is the recall of this model?
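
One way to approach these questions is to move the decision threshold away from 0.5. The sketch below picks the highest threshold that still reaches a target recall, using `sklearn.metrics.precision_recall_curve`. The helper name `threshold_for_recall` and the labels and scores are made up; in practice the scores would come from `model.predict_proba(X_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_recall(y_true, scores, target_recall=0.90):
    """Return (threshold, precision, recall) for the highest threshold meeting the target."""
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision and recall have one more entry than thresholds;
    # the final point (recall = 0) has no threshold, so it is excluded
    meets_target = np.where(recall[:-1] >= target_recall)[0]
    best = meets_target[-1]  # recall is non-increasing, so the last index is the highest threshold
    return thresholds[best], precision[best], recall[best]


# Made-up labels and scores: 11 positives and 3 negatives
y_true = np.array([0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
scores = np.array([0.10, 0.20, 0.30, 0.40, 0.50, 0.55, 0.60,
                   0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95])

threshold, precision_at_t, recall_at_t = threshold_for_recall(y_true, scores)
predictions = (scores >= threshold).astype(int)
```

A precision target works the same way in the other direction: choose the lowest threshold whose precision meets the target, then read off the recall at that point.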

### Question 5: Taking a step back (2 points)
- Question 5.1: What is your best performing model over all experiments? Describe why you think this was the best. What are the PR-AUC and Accuracy of this model? Feel free to share your results on Discord; the best model in the class will receive extra credit of 3 points!
- Question 5.2: Discuss your thoughts on this process. What surprised you? What was hard or tedious? What did you learn?


In [2]:
# Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

In [3]:
# Load data from our csv
df = pd.read_csv('../../Data/heart.csv')

# Select out Numeric and Categorical Features
numeric_features = ["Age", "RestingBP", "Cholesterol", "FastingBS", "MaxHR", "Oldpeak"]
categorical_features = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']
label_feature = "HeartDisease"
X = df[numeric_features + categorical_features]
y = df[label_feature]

# Create Train / Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [4]:
# Examples of Pandas operations you can use:
# Calculate mean of a column:
print(X_train['Age'].mean())
# Calculate standard deviation of a column:
print(X_train['Age'].std())

# Example of creating a custom new feature column, in this case a binary feature indicating age > 50
X_train_example = X_train.copy()
X_train_example['Binary-AgeGt50'] = (X_train['Age'] > 50).astype(int)

53.65122615803815
9.364289800106517


In [5]:
# Feature Preprocessing
# 1. OneHot Encoding: For example, transform X_train['Sex'] with values M or F into two features X_train['Sex_M'], X_train['Sex_F'] with values 0 or 1
# Build Sex_M and Sex_F as 0/1 indicator columns from the values M and F in column Sex
X_train = X_train.copy()  # work on a copy so pandas does not warn about modifying a slice
sex = X_train.pop("Sex")  # remove the original Sex column and keep its values

# Insert the indicator columns at position 6, where the Sex column was
X_train.insert(6, "Sex_M", (sex == "M").astype(int))
X_train.insert(7, "Sex_F", (sex == "F").astype(int))
X_train # X_train now has Sex_M and Sex_F columns holding 0/1 values

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,Sex_M,Sex_F,ChestPainType,RestingECG,ExerciseAngina,ST_Slope
795,42,120,240,1,194,0.8,1,0,NAP,Normal,N,Down
25,36,130,209,0,178,0.0,1,0,NAP,Normal,N,Up
84,56,150,213,1,125,1.0,1,0,ASY,Normal,Y,Flat
10,37,130,211,0,142,0.0,0,1,NAP,Normal,N,Up
344,51,120,0,1,104,0.0,1,0,ASY,Normal,N,Flat
...,...,...,...,...,...,...,...,...,...,...,...,...
106,48,120,254,0,110,0.0,0,1,ASY,ST,N,Up
270,45,120,225,0,140,0.0,1,0,ASY,Normal,N,Up
860,60,130,253,0,144,1.4,1,0,ASY,Normal,Y,Up
435,60,152,0,0,118,0.0,1,0,ASY,ST,Y,Up


In [6]:
# Feature Preprocessing
# 2. Target Encoding: For example, transform X_train['Sex'] with values M or F into a feature X_train['Sex-TargetEncoded'] with value equal to the average rate of heart disease of M and F respectively.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42) # X_train was modified in the previous cell, so restore it here
X_train = X_train.copy()  # work on a copy so pandas does not warn about modifying a slice

# Average heart disease rate for each Sex value, computed on the training data only
heartDiseaseRateBySex = y_train.groupby(X_train["Sex"]).mean()

# Replace each M / F value with its training rate; this yields a float column
gender = X_train.pop("Sex").map(heartDiseaseRateBySex)
X_train.insert(6, "Sex-TargetEncoded", gender)
X_train # Sex-TargetEncoded column added where the Sex column was

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,Sex-TargetEncoded,ChestPainType,RestingECG,ExerciseAngina,ST_Slope
795,42,120,240,1,194,0.8,0.632042,NAP,Normal,N,Down
25,36,130,209,0,178,0.0,0.632042,NAP,Normal,N,Up
84,56,150,213,1,125,1.0,0.632042,ASY,Normal,Y,Flat
10,37,130,211,0,142,0.0,0.253012,NAP,Normal,N,Up
344,51,120,0,1,104,0.0,0.632042,ASY,Normal,N,Flat
...,...,...,...,...,...,...,...,...,...,...,...
106,48,120,254,0,110,0.0,0.253012,ASY,ST,N,Up
270,45,120,225,0,140,0.0,0.632042,ASY,Normal,N,Up
860,60,130,253,0,144,1.4,0.632042,ASY,Normal,Y,Up
435,60,152,0,0,118,0.0,0.632042,ASY,ST,Y,Up


In [7]:
# Feature Preprocessing
# 1. Standard Scaler (scale numeric values to have zero mean and unit variance). For example X_train['Age-Scaled'] = (X_train['Age'] - mean) / standard_deviation
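X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42) # X_train was modified in the previous cell, so restore it here
# "Age" is not in the list below, so it stays unscaled; add it to cover all of numeric_features
# pandas .std() is the sample standard deviation (ddof=1); sklearn's StandardScaler uses ddof=0, so values differ slightly
for i in ["RestingBP", "Cholesterol", "FastingBS", "MaxHR", "Oldpeak"]:
    X_train[i+"-Scaled"] = (X_train[i] - X_train[i].mean()) / X_train[i].std()
    X_train.pop(i)
X_train # ''-Scaled columns added and original columns removed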

Unnamed: 0,Age,Sex,ChestPainType,RestingECG,ExerciseAngina,ST_Slope,RestingBP-Scaled,Cholesterol-Scaled,FastingBS-Scaled,MaxHR-Scaled,Oldpeak-Scaled
795,42,M,NAP,Normal,N,Down,-0.708502,0.372549,1.841354,2.282796,-0.096995
25,36,M,NAP,Normal,N,Up,-0.166172,0.086087,-0.542339,1.651116,-0.835717
84,56,M,ASY,Normal,Y,Flat,0.918489,0.123050,1.841354,-0.441327,0.087685
10,37,F,NAP,Normal,N,Up,-0.166172,0.104569,-0.542339,0.229834,-0.835717
344,51,M,ASY,Normal,N,Flat,-0.708502,-1.845220,1.841354,-1.270407,-0.835717
...,...,...,...,...,...,...,...,...,...,...,...
106,48,F,ASY,ST,N,Up,-0.708502,0.501919,-0.542339,-1.033527,-0.835717
270,45,M,ASY,Normal,N,Up,-0.708502,0.233938,-0.542339,0.150874,-0.835717
860,60,M,ASY,Normal,Y,Up,-0.166172,0.492678,-0.542339,0.308794,0.457046
435,60,M,ASY,ST,Y,Up,1.026955,-1.845220,-0.542339,-0.717687,-0.835717


In [8]:
# Feature Preprocessing
# 2. MinMax Normalization (subtract min and divide by max - min). For example X_train['Age-MinMax'] = (X_train['Age'] - min) / (max - min)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42) # X_train was modified in the previous cell, so restore it here
# "Age" is not in the list below, so it stays unnormalized; add it to cover all of numeric_features
for i in ["RestingBP", "Cholesterol", "FastingBS", "MaxHR", "Oldpeak"]:
    X_train[i+"-MinMax"] = (X_train[i] - X_train[i].min()) / (X_train[i].max() - X_train[i].min())
    X_train.pop(i)
X_train # ''-MinMax columns added and original columns removed

Unnamed: 0,Age,Sex,ChestPainType,RestingECG,ExerciseAngina,ST_Slope,RestingBP-MinMax,Cholesterol-MinMax,FastingBS-MinMax,MaxHR-MinMax,Oldpeak-MinMax
795,42,M,NAP,Normal,N,Down,0.60,0.398010,1.0,0.943662,0.386364
25,36,M,NAP,Normal,N,Up,0.65,0.346600,0.0,0.830986,0.295455
84,56,M,ASY,Normal,Y,Flat,0.75,0.353234,1.0,0.457746,0.409091
10,37,F,NAP,Normal,N,Up,0.65,0.349917,0.0,0.577465,0.295455
344,51,M,ASY,Normal,N,Flat,0.60,0.000000,1.0,0.309859,0.295455
...,...,...,...,...,...,...,...,...,...,...,...
106,48,F,ASY,ST,N,Up,0.60,0.421227,0.0,0.352113,0.295455
270,45,M,ASY,Normal,N,Up,0.60,0.373134,0.0,0.563380,0.295455
860,60,M,ASY,Normal,Y,Up,0.65,0.419569,0.0,0.591549,0.454545
435,60,M,ASY,ST,Y,Up,0.76,0.000000,0.0,0.408451,0.295455


In [12]:
# Feature Engineering
# Create at least 5 custom features as functions of other features ---> first custom feature
# A binary feature representing High Heart Rate Custom-BinaryHighMaxHR
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42) # X_train was modified in the previous cell, so restore it here
# Assumption: a max heart rate above the training-set average counts as a high max heart rate
averageHeartRate = X_train["MaxHR"].mean()
Custom_BinaryHighMaxHR = (X_train["MaxHR"] > averageHeartRate).astype(int).tolist()
print(Custom_BinaryHighMaxHR) # Printing out custom feature

[1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 