<font size="+3"><b>Assignment 4: Pipelines and Hyperparameter Tuning</b></font>

***
* **Full Name** = Nathan Ante
* **UCID** = 30157706
***

<font color='Blue'>
In this assignment, you will be putting together everything you have learned so far. You will need to do all the appropriate preprocessing, test different supervised learning models, and evaluate the results. You will also be asked to describe the process by which you came up with the code. More details for each step can be found below. Please cite any websites or AI tools that you used to help you with this assignment.
</font>

<font color='Red'>
For this assignment, in addition to your .ipynb file, please also attach a PDF file. To generate this PDF file, you can use the print function (located under the "File" within Jupyter Notebook). Name this file ENGG444_Assignment##__yourUCID.pdf (this name is similar to your main .ipynb file). We will evaluate your assignment based on the two files and you need to provide both.
</font>


|         **Question**         | **Point(s)** |
|:----------------------------:|:------------:|
|  **1. Preprocessing Tasks**  |              |
|              1.1             |       2      |
|              1.2             |       2      |
|              1.3             |       4      |
| **2. Pipeline and Modeling** |              |
|              2.1             |       3      |
|              2.2             |       6      |
|              2.3             |       5      |
|              2.4             |       3      |
|     **3. Bonus Question**    |     **2**    |
|           **Total**          |    **25**    |

## **0. Dataset**

This data is a subset of the **Heart Disease Dataset**, which contains information about patients with possible coronary artery disease. The data has **14 attributes** and **294 instances**. The attributes include demographic, clinical, and laboratory features, such as age, sex, chest pain type, blood pressure, cholesterol, and electrocardiogram results. The last attribute is the **diagnosis of heart disease**, which is a categorical variable with values from 0 (no presence) to 4 (high presence). The data can be used for **classification** tasks, such as predicting the presence or absence of heart disease based on the other attributes.

In [1]:
import pandas as pd

# Define the data source link
_link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data'

# Read the CSV file into a Pandas DataFrame, considering '?' as missing values
df = pd.read_csv(_link, na_values='?',
                 names=['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs',
                        'restecg', 'thalach', 'exang', 'oldpeak', 'slope',
                        'ca', 'thal', 'num'])

# Display the DataFrame
display(df)


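|     | age | sex | cp  | trestbps | chol  | fbs | restecg | thalach | exang | oldpeak | slope | ca  | thal | num |
|:---:|:---:|:---:|:---:|:--------:|:-----:|:---:|:-------:|:-------:|:-----:|:-------:|:-----:|:---:|:----:|:---:|
|  0  | 28  |  1  |  2  |  130.0   | 132.0 | 0.0 |   2.0   |  185.0  |  0.0  |   0.0   |  NaN  | NaN | NaN  |  0  |
|  1  | 29  |  1  |  2  |  120.0   | 243.0 | 0.0 |   0.0   |  160.0  |  0.0  |   0.0   |  NaN  | NaN | NaN  |  0  |
|  2  | 29  |  1  |  2  |  140.0   |  NaN  | 0.0 |   0.0   |  170.0  |  0.0  |   0.0   |  NaN  | NaN | NaN  |  0  |
|  3  | 30  |  0  |  1  |  170.0   | 237.0 | 0.0 |   1.0   |  170.0  |  0.0  |   0.0   |  NaN  | NaN | 6.0  |  0  |
|  4  | 31  |  0  |  2  |  100.0   | 219.0 | 0.0 |   1.0   |  150.0  |  0.0  |   0.0   |  NaN  | NaN | NaN  |  0  |
| ... | ... | ... | ... |   ...    |  ...  | ... |   ...   |   ...   |  ...  |   ...   |  ...  | ... | ...  | ... |
| 289 | 52  |  1  |  4  |  160.0   | 331.0 | 0.0 |   0.0   |  94.0   |  1.0  |   2.5   |  NaN  | NaN | NaN  |  1  |
| 290 | 54  |  0  |  3  |  130.0   | 294.0 | 0.0 |   1.0   |  100.0  |  1.0  |   0.0   |  2.0  | NaN | NaN  |  1  |
| 291 | 56  |  1  |  4  |  155.0   | 342.0 | 1.0 |   0.0   |  150.0  |  1.0  |   3.0   |  2.0  | NaN | NaN  |  1  |
| 292 | 58  |  0  |  2  |  180.0   | 393.0 | 0.0 |   0.0   |  110.0  |  1.0  |   1.0   |  2.0  | NaN | 7.0  |  1  |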


# **1. Preprocessing Tasks**

- **1.1** Find out which columns have more than 60% of their values missing and drop them from the data frame. Explain why this is a reasonable way to handle these columns. **(2 Points)**

- **1.2** For the remaining columns that have some missing values, choose an appropriate imputation method to fill them in. You can use the `SimpleImputer` class from `sklearn.impute` or any other method you prefer. Explain why you chose this method and how it affects the data. **(2 Points)**

- **1.3** Assign the `num` column to the variable `y` and the rest of the columns to the variable `X`. The `num` column indicates the presence or absence of heart disease based on the angiographic disease status of the patients. Create a `ColumnTransformer` object that applies different preprocessing steps to different subsets of features. Use `StandardScaler` for the numerical features, `OneHotEncoder` for the categorical features, and `passthrough` for the binary features. List the names of the features that belong to each group and explain why they need different transformations. You will use this `ColumnTransformer` in a pipeline in the next question. **(4 Points)**

<font color='Green'><b>Answer:</b></font>

- **1.1** .....................

Dropping columns with more than 60% of their values missing is reasonable because those columns provide too little observed data for a model to learn from. Imputing that much missing data would mean most of the column is invented rather than measured, which can introduce bias or distort the original distribution. Removing those columns leaves the model with features that are mostly based on real observations.
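As a small illustration of the 60% rule, the sketch below applies it to a made-up three-column frame. The frame `toy` and its values are hypothetical and are not taken from the heart disease data; `isna().mean()` is used as a shorthand for the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the assignment data).
toy = pd.DataFrame({
    "age": [28, 29, 30, 31, 32],                     # 0% missing
    "chol": [132.0, np.nan, 237.0, 219.0, 243.0],    # 20% missing
    "ca": [np.nan, np.nan, np.nan, np.nan, 2.0],     # 80% missing
})

# Fraction of missing values per column (mean of a boolean mask).
missing_fraction = toy.isna().mean()

# Keep only the columns at or below the 60% threshold.
kept = toy.loc[:, missing_fraction <= 0.60]
print(list(kept.columns))  # ['age', 'chol']
```

Here `ca` is dropped because only one of its five values was actually observed.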

In [2]:
# 1.1
# Add necessary code here.

# Find all missing and find the ratio by dividing by the total number of items
missing = df.isna().sum()
missing_percentages = (missing / len(df)) * 100
print('Percentages before dropping...')
print(missing_percentages)

# Drop the columns that have a missing percentage greater than 60
drop_columns = missing_percentages[missing_percentages > 60].index
df.drop(columns=drop_columns, inplace=True)

print('\nPercentages after dropping...')
new_missing_percentages = (df.isna().sum() / len(df)) * 100
print(new_missing_percentages)

Percentages before dropping...
age          0.000000
sex          0.000000
cp           0.000000
trestbps     0.340136
chol         7.823129
fbs          2.721088
restecg      0.340136
thalach      0.340136
exang        0.340136
oldpeak      0.000000
slope       64.625850
ca          98.979592
thal        90.476190
num          0.000000
dtype: float64

Percentages after dropping...
age         0.000000
sex         0.000000
cp          0.000000
trestbps    0.340136
chol        7.823129
fbs         2.721088
restecg     0.340136
thalach     0.340136
exang       0.340136
oldpeak     0.000000
num         0.000000
dtype: float64


<font color='Green'><b>Answer:</b></font>

- **1.2** .....................

I used two `SimpleImputer` strategies: `mean` for the numerical columns and `most_frequent` for the categorical and binary columns. The mean is only meaningful for continuous values, while the most frequent value always yields a valid category. Mean imputation replaces each missing value with the average of the observed values in the column. This leaves the column mean unchanged, but it shrinks the variance and can introduce bias if values are not missing at random. Most-frequent imputation replaces each missing value with the column's most common value. It introduces no new categories and keeps the range unchanged, but it inflates the share of the most common category and can introduce bias in the same way.
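The effect of mean imputation on a column can be checked on a tiny example. This is a minimal sketch on a hypothetical column (the values are invented, not taken from the dataset); it compares the mean and standard deviation of the observed values with those of the imputed column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column with two missing entries (not the assignment data).
col = np.array([[100.0], [120.0], [np.nan], [140.0], [np.nan], [160.0]])

imputed = SimpleImputer(strategy="mean").fit_transform(col)
observed = col[~np.isnan(col)]

# The mean is preserved, but the spread shrinks because the filled-in
# values all sit exactly at the mean.
print("mean before:", observed.mean(), "after:", imputed.mean())
print("std  before:", observed.std(), "after:", imputed.std())
```

The mean stays at 130 in both cases, while the standard deviation of the imputed column is smaller than that of the observed values.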

In [3]:
# Check how many unique values in each to determine if categorical or numerical
df.nunique()

age          38
sex           2
cp            4
trestbps     31
chol        153
fbs           2
restecg       3
thalach      71
exang         2
oldpeak      10
num           2
dtype: int64

In [4]:
# 1.2
# Add necessary code here.

from sklearn.impute import SimpleImputer

# Print before missing values
print('Before Simple Imputer...')
print(df.isna().sum())

# These columns are separated by numerical data and categorical data
numeric_columns = ['age','trestbps', 'chol', 'thalach', 'oldpeak']
# 'exang' must be listed here, otherwise its missing value is never imputed
categorical_columns = ['sex', 'cp', 'num', 'fbs', 'restecg', 'exang']

# Use mean strategy for numerical data
mean_imputer = SimpleImputer(strategy='mean')
df[numeric_columns] = mean_imputer.fit_transform(df[numeric_columns])

# Use most frequent strategy for categorical data
most_frequent_imputer = SimpleImputer(strategy='most_frequent')
df[categorical_columns] = most_frequent_imputer.fit_transform(df[categorical_columns])

print('\nAfter Simple Imputer...')
print(df.isna().sum())

Before Simple Imputer...
age          0
sex          0
cp           0
trestbps     1
chol        23
fbs          8
restecg      1
thalach      1
exang        1
oldpeak      0
num          0
dtype: int64

After Simple Imputer...
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
num         0
dtype: int64


<font color='Green'><b>Answer:</b></font>

- **1.3** .....................

    - **Numerical Features:** these are features with continuous values that can be measured on a scale. They need `StandardScaler` so that all numerical features are on a similar scale. Models that rely on distances or regularized coefficients, such as SVMs and logistic regression, are sensitive to features with very different ranges.
        - age
        - trestbps
        - chol
        - thalach
        - oldpeak

    - **Categorical Features:** these are features that represent qualitative categories with more than two values. `OneHotEncoder` is required because the integer codes have no meaningful order; it converts each category into its own 0/1 column so that models do not treat the codes as magnitudes.
        - cp
        - restecg
    
    - **Binary Features:** similar to categorical features but limited to two categories, here already encoded as 0 and 1. Binary features do not need preprocessing and can use `passthrough` because they are already in a format that the model can use directly.
        - sex
        - fbs
        - exang
        - num (binary, but it is the target `y`, so it is not passed to the `ColumnTransformer`)

In [5]:
print('Unique values in each column...')
print(df.nunique())

Unique values in each column...
age          38
sex           2
cp            4
trestbps     32
chol        154
fbs           2
restecg       3
thalach      72
exang         2
oldpeak      10
num           2
dtype: int64


In [6]:
# 1.3
# Add necessary code here.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

# Separate data into X and y
y = df['num']
X = df.drop(columns=['num'])

# Split the data set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define numerical, categorical, and binary features.
numerical_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
categorical_features = ['cp', 'restecg']
binary_features = ['sex', 'fbs', 'exang']

# Create the column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', StandardScaler(), numerical_features),
        ('categorical', OneHotEncoder(sparse_output=False), categorical_features),
        ('binary', 'passthrough', binary_features)
    ]
)

# **2. Pipeline and Modeling**

- **2.1** Create **three** `Pipeline` objects that take the column transformer from the previous question as the first step and add one or more models as the subsequent steps. You can use any models from `sklearn` or other libraries that are suitable for binary classification. For each pipeline, explain **why** you selected the model(s) and what are their **strengths and weaknesses** for this data set. **(3 Points)**

- **2.2** Use `GridSearchCV` to perform a grid search over the hyperparameters of each pipeline and find the best combination that maximizes the cross-validation score. Report the best parameters and the best score for each pipeline. Then, update the hyperparameters of each pipeline using the best parameters from the grid search. **(6 Points)**

- **2.3** Form a stacking classifier that uses the three pipelines from the previous question as the base estimators and a meta-model as the `final_estimator`. You can choose any model for the meta-model that is suitable for binary classification. Explain **why** you chose the meta-model and how it combines the predictions of the base estimators. Then, use `StratifiedKFold` to perform a cross-validation on the stacking classifier and present the accuracy scores and F1 scores for each fold. Report the mean and the standard deviation of each score in the format of `mean ± std`. For example, `0.85 ± 0.05`. **(5 Points)**

- **2.4**: Interpret the final results of the stacking classifier and compare its performance with the individual models. Explain how stacking classifier has improved or deteriorated the prediction accuracy and F1 score, and what are the possible reasons for that. **(3 Points)**

<font color='Green'><b>Answer:</b></font>

- **2.1** .....................

    - **Logistic Regression:** I chose this model because of its simplicity and ease of use. It trains and makes predictions quickly.
        - **Strengths:**
            - Fast training and fast predictions
            - Scales to very large datasets and works well with sparse data
            - Simple and easy to interpret predictions
        - **Weaknesses:**
            - Performs poorly if features have a non-linear relationship.
            - Sensitive to outliers
    
    - **Random Forest:** I chose this model because it is a popular and commonly used model for classification. It is powerful and performs well with a lot of different datasets.
        - **Strengths:**
            - Powerful and often does not require heavy tuning of parameters
            - Less prone to overfitting the training data than a single decision tree
            - No need for scaling of data
        - **Weaknesses:**
            - Can be time consuming on larger datasets
            - Slower than linear models
            - Require more resources such as memory
            - Does not perform well on high dimensional and sparse data

    - **Support Vector Classifier:** I chose this model because it works well on a variety of datasets and is a good choice when little is known in advance about the structure of the data.
        - **Strengths:**
            - Powerful and performs well on a variety of datasets
            - Allow for complex decision boundaries, even with low dimensional data
            - Works well with low and high dimensional data
        - **Weaknesses:**
            - Does not scale well to a large number of samples
            - Training on large datasets can use more memory and take longer to complete
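The difference between a linear model and a kernel model on non-linear data can be seen on a deliberately artificial example. The sketch below uses hypothetical XOR-style points (not the heart disease data), and the `C` and `gamma` values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Hypothetical XOR-style toy data: the class depends on an interaction
# between the two features, so no straight line separates the classes.
corners = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
labels = np.array([0, 0, 1, 1])
X_xor = np.tile(corners, (10, 1))
y_xor = np.tile(labels, 10)

# Accuracy is measured on the training points themselves; the goal is only
# to show which decision boundaries each model is able to represent.
linear_acc = LogisticRegression(max_iter=1000).fit(X_xor, y_xor).score(X_xor, y_xor)
rbf_acc = SVC(kernel="rbf", C=100, gamma=2.0).fit(X_xor, y_xor).score(X_xor, y_xor)

print(f"logistic regression: {linear_acc:.2f}")  # at most 0.75 for any linear boundary
print(f"SVC with RBF kernel: {rbf_acc:.2f}")
```

A linear boundary can classify at most three of the four corners correctly, while the RBF kernel separates all of them.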


In [7]:
# 2.1
# Add necessary code here.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

pipeline_lr = Pipeline([
    ('preprocessor', preprocessor), 
    ('classifier', LogisticRegression(max_iter=1000))
    ])

pipeline_svc = Pipeline([
    ('preprocessor', preprocessor), 
    ('classifier', SVC())
    ])

pipeline_rf = Pipeline([
    ('preprocessor', preprocessor), 
    ('classifier', RandomForestClassifier())
    ])

<font color='Green'><b>Answer:</b></font>

- **2.2** .....................

### Logistic Regression
- **Best Parameters:**
    - classifier__C: 1, 
    - classifier__penalty: 'l1', 
    - classifier__solver: 'liblinear'
- **Best Accuracy Score:** 0.825531914893617
- **Best F1 Score:** 0.7460931899641577

### Random Forest Classifier
- **Best Parameters:**
    - classifier__max_depth: 5, 
    - classifier__min_samples_split: 5, 
    - classifier__n_estimators: 200
- **Best Accuracy Score:** 0.8297872340425532
- **Best F1 Score:** 0.7432424363458846

### Support Vector Classifier
- **Best Parameters:** 
    - classifier__C: 1, 
    - classifier__gamma: 'auto', 
    - classifier__kernel: 'rbf'
- **Best Accuracy Score:** 0.8297872340425532
- **Best F1 Score:** 0.7537895286806358


In [8]:
# 2.2
# Add necessary code here.
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score, f1_score

# Define parameter grids for Logistic Regression classifier
param_grid_lr = {
    'classifier__C': [0.001, 0.1, 1, 10],                   
    'classifier__penalty': ['l1', 'l2'],                   
    'classifier__solver': ['liblinear', 'saga']      
}

# Define parameter grids for Random Forest classifier
param_grid_rf = {
    'classifier__n_estimators': [100, 200, 300],  
    'classifier__max_depth': [5, 10, 15],
    'classifier__min_samples_split': [2, 5, 10]          
}

# Define parameter grids for SVM classifier
param_grid_svc = {
    'classifier__C': [0.1, 1, 10],                   
    'classifier__kernel': ['linear', 'rbf', 'poly', 'sigmoid'],
    'classifier__gamma': ['scale', 'auto']
}

# Define the scoring metrics you want to use
scoring = {
    'accuracy': make_scorer(accuracy_score),         # Scoring based on accuracy_score
    'f1_score': make_scorer(f1_score)                # Scoring based on F1_score
}

# Perform grid search for each pipeline
grid_search_lr = GridSearchCV(pipeline_lr, param_grid_lr, cv=5, scoring=scoring, refit='accuracy', n_jobs=-1)
grid_search_rf = GridSearchCV(pipeline_rf, param_grid_rf, cv=5, scoring=scoring, refit='accuracy', n_jobs=-1)
grid_search_svc = GridSearchCV(pipeline_svc, param_grid_svc, cv=5, scoring=scoring, refit='accuracy', n_jobs=-1)


# Fit the models
grid_search_rf.fit(X_train, y_train)
grid_search_lr.fit(X_train, y_train)
grid_search_svc.fit(X_train, y_train)

# Find the best parameters for each model
best_params_rf = grid_search_rf.best_params_
best_params_lr = grid_search_lr.best_params_
best_params_svc = grid_search_svc.best_params_

# Find the best scores for each model
best_accuracy_score_rf = grid_search_rf.best_score_
best_f1_score_rf = grid_search_rf.cv_results_['mean_test_f1_score'][grid_search_rf.best_index_]

best_accuracy_score_lr = grid_search_lr.best_score_
best_f1_score_lr = grid_search_lr.cv_results_['mean_test_f1_score'][grid_search_lr.best_index_]

best_accuracy_score_svc = grid_search_svc.best_score_
best_f1_score_svc = grid_search_svc.cv_results_['mean_test_f1_score'][grid_search_svc.best_index_]

print("Logistic Regression...")
print(f"    Best Parameters: {best_params_lr}")
print(f"    Best Accuracy Score: {best_accuracy_score_lr}")
print(f"    Best F1 Score: {best_f1_score_lr}")

print("\nRandom Forest Classifier...")
print(f"    Best Parameters: {best_params_rf}")
print(f"    Best Accuracy Score: {best_accuracy_score_rf}")
print(f"    Best F1 Score: {best_f1_score_rf}")

print("\nSupport Vector Classifier...")
print(f"    Best Parameters: {best_params_svc}")
print(f"    Best Accuracy Score: {best_accuracy_score_svc}")
print(f"    Best F1 Score: {best_f1_score_svc}")

Logistic Regression...
    Best Parameters: {'classifier__C': 1, 'classifier__penalty': 'l1', 'classifier__solver': 'liblinear'}
    Best Accuracy Score: 0.825531914893617
    Best F1 Score: 0.7460931899641577

Random Forest Classifier...
    Best Parameters: {'classifier__max_depth': 5, 'classifier__min_samples_split': 5, 'classifier__n_estimators': 200}
    Best Accuracy Score: 0.8297872340425532
    Best F1 Score: 0.7432424363458846

Support Vector Classifier...
    Best Parameters: {'classifier__C': 1, 'classifier__gamma': 'auto', 'classifier__kernel': 'rbf'}
    Best Accuracy Score: 0.8297872340425532
    Best F1 Score: 0.7537895286806358


In [9]:
# Update Logistic Regression pipeline
pipeline_lr.set_params(**grid_search_lr.best_params_)

# Update Random Forest pipeline
pipeline_rf.set_params(**grid_search_rf.best_params_)

# Update SVC pipeline
pipeline_svc.set_params(**grid_search_svc.best_params_)

<font color='Green'><b>Answer:</b></font>

- **2.3** .....................

#### Why Logistic Regression and how it combines predictions of the base estimators...
I chose Logistic Regression as the meta-model because it's simple and easy to interpret. It is also efficient, as it is fast in both training and prediction.

After each base estimator makes a prediction on the data, the meta-model combines these predictions to make a final prediction. During training, `StackingClassifier` generates cross-validated predictions from each base estimator and uses them as the input features of the meta-model, so the meta-model learns how much weight to give each base estimator. At test time, the fitted base estimators make their predictions and the meta-model turns them into the final prediction.
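What the meta-model actually receives can be inspected with `transform`. The following is a minimal sketch on synthetic data from `make_classification`, used here as a stand-in for the heart disease features; the names `X_demo`, `y_demo`, and `stack_demo` are hypothetical, and the base estimators are plain models rather than the tuned pipelines.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic binary classification data (an assumption, not the assignment data).
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)

stack_demo = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
        ("svc", SVC()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack_demo.fit(X_demo, y_demo)

# One column per base estimator: these are the meta-model's input features.
meta_features = stack_demo.transform(X_demo)
print(meta_features.shape)  # (120, 3)

# Which output of each base estimator was stacked.
print(stack_demo.stack_method_)

# One learned weight per base estimator in the logistic regression meta-model.
print(stack_demo.final_estimator_.coef_)
```

With the default `stack_method='auto'`, the logistic regression and random forest contribute their predicted probability for the positive class, while `SVC()` without `probability=True` contributes its `decision_function` value.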

#### Accuracy Score...
- **Mean and Standard Deviation:** 0.84 ± 0.07
- **Scores for each fold:** [0.8723404255319149, 0.8297872340425532, 0.851063829787234, 0.9361702127659575, 0.723404255319149]

#### F1 Score...
- **Mean and Standard Deviation:** 0.77 ± 0.09
- **Scores for each fold:** [0.8235294117647058, 0.7333333333333333, 0.7741935483870968, 0.9090909090909091, 0.6285714285714286]
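The `mean ± std` figures can be reproduced from the per-fold scores listed above using only the standard library. This sketch assumes the population standard deviation (`statistics.pstdev`), which corresponds to the default behaviour of `np.std`; the helper name `mean_pm_std` is hypothetical.

```python
from statistics import mean, pstdev, stdev

# Per-fold scores copied from the results above.
accuracy_folds = [0.8723404255319149, 0.8297872340425532, 0.851063829787234,
                  0.9361702127659575, 0.723404255319149]
f1_folds = [0.8235294117647058, 0.7333333333333333, 0.7741935483870968,
            0.9090909090909091, 0.6285714285714286]

def mean_pm_std(scores, sample=False):
    """Format scores as 'mean ± std' (population std unless sample=True)."""
    spread = stdev(scores) if sample else pstdev(scores)
    return f"{mean(scores):.2f} ± {spread:.2f}"

print("Accuracy:", mean_pm_std(accuracy_folds))  # 0.84 ± 0.07
print("F1:      ", mean_pm_std(f1_folds))        # 0.77 ± 0.09

# The sample standard deviation (n - 1 in the denominator) is slightly larger.
print("Accuracy (sample std):", mean_pm_std(accuracy_folds, sample=True))  # 0.84 ± 0.08
```

With only five folds the choice between population and sample standard deviation changes the second decimal, so it is worth stating which one is reported.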

In [10]:
# 2.3
# Add necessary code here.
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import StratifiedKFold

stacking_classifier = StackingClassifier(
    estimators=[
        ('lr', pipeline_lr),
        ('rf', pipeline_rf),
        ('svc', pipeline_svc)
    ],
    final_estimator=LogisticRegression(max_iter=1000)
)

# Perform cross-validation with StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracy_scores = []
f1_scores = []

for train_index, test_index in cv.split(X_train, y_train):
    X_train_fold, X_test_fold = X_train.iloc[train_index], X_train.iloc[test_index]
    y_train_fold, y_test_fold = y_train.iloc[train_index], y_train.iloc[test_index]

    stacking_classifier.fit(X_train_fold, y_train_fold)
    y_pred_fold = stacking_classifier.predict(X_test_fold)

    accuracy_scores.append(accuracy_score(y_test_fold, y_pred_fold))
    f1_scores.append(f1_score(y_test_fold, y_pred_fold))

# Report mean and standard deviation of accuracy and F1 scores

print('Accuracy Scores...')
print(accuracy_scores)

print("\nF1 Scores...")
print(f1_scores)


print(f"\nAccuracy Scores: {np.mean(accuracy_scores):.2f} ± {np.std(accuracy_scores):.2f}")
print(f"F1 Scores: {np.mean(f1_scores):.2f} ± {np.std(f1_scores):.2f}")

Accuracy Scores...
[0.8723404255319149, 0.8297872340425532, 0.851063829787234, 0.9361702127659575, 0.723404255319149]

F1 Scores...
[0.8235294117647058, 0.7333333333333333, 0.7741935483870968, 0.9090909090909091, 0.6285714285714286]

Accuracy Scores: 0.84 ± 0.07
F1 Scores: 0.77 ± 0.09


<font color='Green'><b>Answer:</b></font>

- **2.4** .....................

#### Overview of Scores...
- Stacking Classifier:
    - Accuracy Scores: 0.84 ± 0.07
    - F1 Scores: 0.77 ± 0.09

- Logistic Regression...
    - Best Accuracy Score: 0.826
    - Best F1 Score: 0.746

- Random Forest Classifier...
    - Best Accuracy Score: 0.830
    - Best F1 Score: 0.743

- Support Vector Classifier...
    - Best Accuracy Score: 0.830
    - Best F1 Score: 0.754

#### Answer...
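- Overall, the stacking classifier's mean accuracy (0.84) and F1 score (0.77) are slightly higher than those of each individual model (accuracy 0.83, F1 0.74 to 0.75). The gains of roughly 0.01 to 0.03 are smaller than the standard deviation across folds (0.07 and 0.09), and the individual scores come from the grid search's own 5-fold split rather than the shuffled `StratifiedKFold` split used for the stacking classifier, so the improvement should be read as modest rather than conclusive. The increase could be due to the stacking classifier's ability to combine the strengths of each model, which could capture a broader range of patterns in the data.

- Possible reasons for performance increase:
    - The base estimators make different kinds of errors, so their combination can capture a wider variety of patterns in the data.
    - The meta-model learns from cross-validated predictions how much weight to give each base estimator.
    - Each model has weaknesses; stacking helps cover some of the weaknesses of individual models.
    - Combining several estimators can reduce overfitting compared with relying on a single model.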
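The individual scores in this comparison come from the grid search's internal 5-fold split, while the stacking scores come from a shuffled `StratifiedKFold`. One way to compare like with like is to score every model on the same folds. The sketch below shows the idea on synthetic data from `make_classification`; the data, the untuned models, and the names `shared_cv`, `candidates`, and `results` are assumptions for illustration, not the assignment's pipelines.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (an assumption, not the heart disease data).
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=0)

# A fixed random_state makes every call below use identical folds.
shared_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "lr": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=25, random_state=0),
}
# The stack is built from the two base models defined above.
candidates["stack"] = StackingClassifier(
    estimators=list(candidates.items()),
    final_estimator=LogisticRegression(max_iter=1000),
)

results = {}
for name, model in candidates.items():
    scores = cross_validate(model, X_demo, y_demo, cv=shared_cv,
                            scoring=["accuracy", "f1"])
    results[name] = scores
    acc, f1 = scores["test_accuracy"], scores["test_f1"]
    print(f"{name}: accuracy {acc.mean():.2f} ± {acc.std():.2f}, "
          f"F1 {f1.mean():.2f} ± {f1.std():.2f}")
```

Because every model is evaluated on identical splits, differences between rows reflect the models rather than the luck of the fold assignment.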

**Bonus Question**: The stacking classifier has achieved a high accuracy and F1 score, but there may be still room for improvement. Suggest **two** possible ways to improve the modeling using the stacking classifier, and explain **how** and **why** they could improve the performance. **(2 points)**

<font color='Green'><b>Answer:</b></font>

- **More models:** having a larger variety of base estimators gives a wider view of the data. Each model has strengths that can cover other models' weaknesses, and because each model sees the data differently, a more diverse set can capture more patterns in the data and improve overall performance.

- **Use a meta-model that fits the dataset:** Even with the insight from the base estimators, the meta-model is the one making the final prediction, so a meta-model that suits the specific dataset can improve overall performance. For example, a linear meta-model is a poor choice when the relationships it has to learn are not linear.