<font size="+3"><b>Assignment 4: Pipelines and Hyperparameter Tuning</b></font>

***
* **Full Name** = Aarsh Shah    
* **UCID** = 30150079
***

<font color='Blue'>
In this assignment, you will put together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models, and evaluate the results. You will also be asked to describe the process by which you came up with the code. More details for each step can be found below. Please cite any websites or AI tools that you used to help you with this assignment.
</font>

<font color='Red'>
For this assignment, in addition to your .ipynb file, please also attach a PDF file. To generate this PDF file, you can use the print function (located under the "File" menu in Jupyter Notebook). Name this file ENGG444_Assignment##__yourUCID.pdf (this name is similar to your main .ipynb file). We will evaluate your assignment based on the two files, so you need to provide both.
</font>


|         **Question**         | **Point(s)** |
|:----------------------------:|:------------:|
|  **1. Preprocessing Tasks**  |              |
|              1.1             |       2      |
|              1.2             |       2      |
|              1.3             |       4      |
| **2. Pipeline and Modeling** |              |
|              2.1             |       3      |
|              2.2             |       6      |
|              2.3             |       5      |
|              2.4             |       3      |
|     **3. Bonus Question**    |     **2**    |
|           **Total**          |    **27**    |

## **0. Dataset**

This data is a subset of the **Heart Disease Dataset**, which contains information about patients with possible coronary artery disease. The data has **14 attributes** and **294 instances**. The attributes include demographic, clinical, and laboratory features, such as age, sex, chest pain type, blood pressure, cholesterol, and electrocardiogram results. The last attribute is the **diagnosis of heart disease**, which is a categorical variable with values from 0 (no presence) to 4 (high presence). The data can be used for **classification** tasks, such as predicting the presence or absence of heart disease based on the other attributes.

In [8]:
import pandas as pd

# Define the data source link
_link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data'

# Read the CSV file into a Pandas DataFrame, considering '?' as missing values
df = pd.read_csv(_link, na_values='?',
                 names=['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs',
                        'restecg', 'thalach', 'exang', 'oldpeak', 'slope',
                        'ca', 'thal', 'num'])

# Display the DataFrame
display(df)

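|     | age | sex | cp  | trestbps | chol  | fbs | restecg | thalach | exang | oldpeak | slope | ca  | thal | num |
|----:|----:|----:|----:|---------:|------:|----:|--------:|--------:|------:|--------:|------:|----:|-----:|----:|
| 0   | 28  | 1   | 2   | 130.0    | 132.0 | 0.0 | 2.0     | 185.0   | 0.0   | 0.0     | NaN   | NaN | NaN  | 0   |
| 1   | 29  | 1   | 2   | 120.0    | 243.0 | 0.0 | 0.0     | 160.0   | 0.0   | 0.0     | NaN   | NaN | NaN  | 0   |
| 2   | 29  | 1   | 2   | 140.0    | NaN   | 0.0 | 0.0     | 170.0   | 0.0   | 0.0     | NaN   | NaN | NaN  | 0   |
| 3   | 30  | 0   | 1   | 170.0    | 237.0 | 0.0 | 1.0     | 170.0   | 0.0   | 0.0     | NaN   | NaN | 6.0  | 0   |
| 4   | 31  | 0   | 2   | 100.0    | 219.0 | 0.0 | 1.0     | 150.0   | 0.0   | 0.0     | NaN   | NaN | NaN  | 0   |
| ... | ... | ... | ... | ...      | ...   | ... | ...     | ...     | ...   | ...     | ...   | ... | ...  | ... |
| 289 | 52  | 1   | 4   | 160.0    | 331.0 | 0.0 | 0.0     | 94.0    | 1.0   | 2.5     | NaN   | NaN | NaN  | 1   |
| 290 | 54  | 0   | 3   | 130.0    | 294.0 | 0.0 | 1.0     | 100.0   | 1.0   | 0.0     | 2.0   | NaN | NaN  | 1   |
| 291 | 56  | 1   | 4   | 155.0    | 342.0 | 1.0 | 0.0     | 150.0   | 1.0   | 3.0     | 2.0   | NaN | NaN  | 1   |
| 292 | 58  | 0   | 2   | 180.0    | 393.0 | 0.0 | 0.0     | 110.0   | 1.0   | 1.0     | 2.0   | NaN | 7.0  | 1   |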


# **1. Preprocessing Tasks**

- **1.1** Find out which columns have more than 60% of their values missing and drop them from the data frame. Explain why this is a reasonable way to handle these columns. **(2 Points)**

- **1.2** For the remaining columns that have some missing values, choose an appropriate imputation method to fill them in. You can use the `SimpleImputer` class from `sklearn.impute` or any other method you prefer. Explain why you chose this method and how it affects the data. **(2 Points)**

- **1.3** Assign the `num` column to the variable `y` and the rest of the columns to the variable `X`. The `num` column indicates the presence or absence of heart disease based on the angiographic disease status of the patients. Create a `ColumnTransformer` object that applies different preprocessing steps to different subsets of features. Use `StandardScaler` for the numerical features, `OneHotEncoder` for the categorical features, and `passthrough` for the binary features. List the names of the features that belong to each group and explain why they need different transformations. You will use this `ColumnTransformer` in a pipeline in the next question. **(4 Points)**

<font color='Green'><b>Answer:</b></font>

- **1.1** Dropping these columns is reasonable because a column with more than 60% of its values missing carries too little observed information to be useful. Any imputation would fill most of the column with estimated values, so the feature would mostly reflect the imputation method rather than the patients, which adds bias instead of signal. Here the rule removes `slope`, `ca`, and `thal`.

In [9]:
# 1.1
# Add necessary code here.
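# Fraction of missing values per column; drop the columns where it exceeds 60%.
# (dropna's `thresh` counts NON-missing values, so thresh=0.6*len(df) would
# instead drop every column with more than 40% missing.)
missing_frac = df.isna().mean()
df = df.drop(columns=missing_frac[missing_frac > 0.6].index)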
display(df)

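|     | age | sex | cp  | trestbps | chol  | fbs | restecg | thalach | exang | oldpeak | num |
|----:|----:|----:|----:|---------:|------:|----:|--------:|--------:|------:|--------:|----:|
| 0   | 28  | 1   | 2   | 130.0    | 132.0 | 0.0 | 2.0     | 185.0   | 0.0   | 0.0     | 0   |
| 1   | 29  | 1   | 2   | 120.0    | 243.0 | 0.0 | 0.0     | 160.0   | 0.0   | 0.0     | 0   |
| 2   | 29  | 1   | 2   | 140.0    | NaN   | 0.0 | 0.0     | 170.0   | 0.0   | 0.0     | 0   |
| 3   | 30  | 0   | 1   | 170.0    | 237.0 | 0.0 | 1.0     | 170.0   | 0.0   | 0.0     | 0   |
| 4   | 31  | 0   | 2   | 100.0    | 219.0 | 0.0 | 1.0     | 150.0   | 0.0   | 0.0     | 0   |
| ... | ... | ... | ... | ...      | ...   | ... | ...     | ...     | ...   | ...     | ... |
| 289 | 52  | 1   | 4   | 160.0    | 331.0 | 0.0 | 0.0     | 94.0    | 1.0   | 2.5     | 1   |
| 290 | 54  | 0   | 3   | 130.0    | 294.0 | 0.0 | 1.0     | 100.0   | 1.0   | 0.0     | 1   |
| 291 | 56  | 1   | 4   | 155.0    | 342.0 | 1.0 | 0.0     | 150.0   | 1.0   | 3.0     | 1   |
| 292 | 58  | 0   | 2   | 180.0    | 393.0 | 0.0 | 0.0     | 110.0   | 1.0   | 1.0     | 1   |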
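
The relationship between the "more than 60% missing" rule and the `thresh` argument of `dropna` is easy to get backwards, because `thresh` counts non-missing values. The following is a minimal sketch on a hypothetical toy frame (the column names `a`, `b`, `c` and their values are made up for illustration) that compares an explicit missing-fraction mask with two `thresh` settings.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'a' is complete, 'b' is 50% missing, 'c' is 75% missing
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, np.nan],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# Explicit rule: drop columns whose missing fraction exceeds 60%
missing_frac = toy.isna().mean()
kept_explicit = toy.drop(columns=missing_frac[missing_frac > 0.6].index)

# Same rule through thresh: keep columns with at least 40% non-missing values
kept_thresh_40 = toy.dropna(axis=1, thresh=int(np.ceil(0.4 * len(toy))))

# thresh at 60% is stricter: it requires at least 60% non-missing values,
# so it also drops 'b', which is only 50% missing
kept_thresh_60 = toy.dropna(axis=1, thresh=int(np.ceil(0.6 * len(toy))))

print(list(kept_explicit.columns))
print(list(kept_thresh_40.columns))
print(list(kept_thresh_60.columns))
```

On a frame where no column falls between 40% and 60% missing, both `thresh` settings give the same result, which is why the difference can go unnoticed.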


<font color='Green'><b>Answer:</b></font>

- **1.2**

- I used `SimpleImputer` with a different strategy for each data type.
    - For the continuous columns ('age', 'trestbps', 'chol', 'thalach', 'oldpeak') I imputed with the mean. The mean is simple, and after visualizing the dataset there did not appear to be many outliers, which is the main situation where the median would be preferable.
    - For 'sex', 'cp', 'fbs', 'restecg', and 'exang', which are stored as numbers but are categorical or binary in nature, I imputed with the most frequent value (mode). This assumes that the most common category is the most likely value when the actual value is missing.

- This affects the data in two ways:
    - Mean imputation preserves each column's mean but reduces its variance, because every filled value sits exactly at the center of the distribution. It can also introduce bias if the values are not missing completely at random.
    - Mode imputation does not introduce any new categories, which keeps things straightforward, but it increases the prevalence of the most common category and can skew the distribution, especially if the values are not missing completely at random.
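
The effect of mean imputation on the spread of a column can be checked numerically. The sketch below uses synthetic, cholesterol-like values (not the assignment data) and hides 20% of them at random; the numbers are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
values = rng.normal(loc=250.0, scale=50.0, size=200)  # synthetic stand-in for 'chol'

# Hide 20% of the values completely at random
with_gaps = values.copy()
missing_idx = rng.choice(values.size, size=40, replace=False)
with_gaps[missing_idx] = np.nan

# Fill the gaps with the mean of the observed values
imputed = SimpleImputer(strategy="mean").fit_transform(with_gaps.reshape(-1, 1)).ravel()

observed_mean = np.nanmean(with_gaps)
observed_std = np.nanstd(with_gaps)
print(f"mean: observed {observed_mean:.2f}, after imputation {imputed.mean():.2f}")
print(f"std:  observed {observed_std:.2f}, after imputation {imputed.std():.2f}")
```

The mean is unchanged by construction, while the standard deviation shrinks because the filled values add no spread.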

In [10]:
# 1.2
# Add necessary code here.
from sklearn.impute import SimpleImputer

# Every column except 'num' may contain missing values
features = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak']

# Continuous columns are imputed with the mean;
# categorical and binary columns are imputed with the most frequent value
float_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
int_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang']

# Imputation
imputer_float = SimpleImputer(strategy='mean')
imputer_int = SimpleImputer(strategy='most_frequent')

df[float_cols] = imputer_float.fit_transform(df[float_cols])
df[int_cols] = imputer_int.fit_transform(df[int_cols])

# Ensure integer columns are cast back to integers in case they were converted to float by imputation
df[int_cols] = df[int_cols].astype(int)

display(df)

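|     | age  | sex | cp  | trestbps | chol       | fbs | restecg | thalach | exang | oldpeak | num |
|----:|-----:|----:|----:|---------:|-----------:|----:|--------:|--------:|------:|--------:|----:|
| 0   | 28.0 | 1   | 2   | 130.0    | 132.000000 | 0   | 2       | 185.0   | 0     | 0.0     | 0   |
| 1   | 29.0 | 1   | 2   | 120.0    | 243.000000 | 0   | 0       | 160.0   | 0     | 0.0     | 0   |
| 2   | 29.0 | 1   | 2   | 140.0    | 250.848708 | 0   | 0       | 170.0   | 0     | 0.0     | 0   |
| 3   | 30.0 | 0   | 1   | 170.0    | 237.000000 | 0   | 1       | 170.0   | 0     | 0.0     | 0   |
| 4   | 31.0 | 0   | 2   | 100.0    | 219.000000 | 0   | 1       | 150.0   | 0     | 0.0     | 0   |
| ... | ...  | ... | ... | ...      | ...        | ... | ...     | ...     | ...   | ...     | ... |
| 289 | 52.0 | 1   | 4   | 160.0    | 331.000000 | 0   | 0       | 94.0    | 1     | 2.5     | 1   |
| 290 | 54.0 | 0   | 3   | 130.0    | 294.000000 | 0   | 1       | 100.0   | 1     | 0.0     | 1   |
| 291 | 56.0 | 1   | 4   | 155.0    | 342.000000 | 1   | 0       | 150.0   | 1     | 3.0     | 1   |
| 292 | 58.0 | 0   | 2   | 180.0    | 393.000000 | 0   | 0       | 110.0   | 1     | 1.0     | 1   |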
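
One limitation of imputing the whole data frame up front is that the means and modes are computed from every row, including rows that later end up in validation folds. A possible alternative, sketched here on synthetic data, is to place the imputers inside the `ColumnTransformer` so they are refit on each training fold. The frame `demo`, the target `demo_y`, and the step names are hypothetical and not part of the assignment code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data with a few missing values (not the heart disease data)
rng = np.random.default_rng(0)
n = 120
demo = pd.DataFrame({
    "chol": rng.normal(250, 50, n),
    "thalach": rng.normal(140, 20, n),
    "cp": rng.integers(1, 5, n).astype(float),
})
demo_y = (demo["thalach"] + rng.normal(0, 10, n) < 140).astype(int)
demo.loc[rng.choice(n, 15, replace=False), "chol"] = np.nan
demo.loc[rng.choice(n, 10, replace=False), "cp"] = np.nan

# Imputation lives inside each branch, so it is learned from training folds only
numeric_branch = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
categorical_branch = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
leak_free_preprocessor = ColumnTransformer([
    ("num", numeric_branch, ["chol", "thalach"]),
    ("cat", categorical_branch, ["cp"]),
])

model = Pipeline([
    ("preprocessor", leak_free_preprocessor),
    ("classifier", LogisticRegression()),
])

# The raw frame still contains NaNs; the pipeline handles them per fold
scores = cross_val_score(model, demo, demo_y, cv=5)
print(f"accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
```

With only a handful of missing values the difference in scores is usually small, but this layout keeps the cross-validation estimate free of information from the held-out rows.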


<font color='Green'><b>Answer:</b></font>

- **1.3**

1. Numerical Features:
    - Features: 'age', 'trestbps', 'chol', 'thalach', 'oldpeak'
    - Transformation: `StandardScaler`
    - These are continuous variables with different ranges and units. Standard scaling (z-score normalization) gives each feature a mean of 0 and a standard deviation of 1, so that scale-sensitive models such as KNN and SVC do not let the features with the largest raw values dominate.

2. Categorical Features:
    - Features: 'cp', 'restecg'
    - Transformation: `OneHotEncoder`
    - These features are category codes with no inherent numerical order or spacing. One-hot encoding turns each category into its own 0/1 column, so the model does not assume an ordinal relationship where none exists.

3. Binary Features:
    - Features: 'sex', 'fbs', 'exang'
    - Transformation: `passthrough` (no transformation)
    - These features are already encoded as 0 or 1, so they need neither scaling nor one-hot encoding.

In [11]:
# 1.3
# Add necessary code here.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

y = df['num']
X = df.drop('num', axis=1)

numerical_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
categorical_features = ['cp', 'restecg']
binary_features = ['sex', 'fbs', 'exang']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features),
        ('pass', 'passthrough', binary_features),
    ])

X_transformed = preprocessor.fit_transform(X)
X_transformed

array([[-2.54234669, -0.14707632, -1.83302736, ...,  1.        ,
         0.        ,  0.        ],
       [-2.41411716, -0.71634131, -0.1210522 , ...,  1.        ,
         0.        ,  0.        ],
       [-2.41411716,  0.42218868,  0.        , ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [ 1.04808013,  1.27608618,  1.40584457, ...,  1.        ,
         1.        ,  1.        ],
       [ 1.30453919,  2.69924868,  2.19242775, ...,  0.        ,
         0.        ,  1.        ],
       [ 2.2021459 , -0.14707632,  0.37249019, ...,  1.        ,
         0.        ,  1.        ]])

# **2. Pipeline and Modeling**

- **2.1** Create **three** `Pipeline` objects that take the column transformer from the previous question as the first step and add one or more models as the subsequent steps. You can use any models from `sklearn` or other libraries that are suitable for binary classification. For each pipeline, explain **why** you selected the model(s) and what are their **strengths and weaknesses** for this data set. **(3 Points)**

- **2.2** Use `GridSearchCV` to perform a grid search over the hyperparameters of each pipeline and find the best combination that maximizes the cross-validation score. Report the best parameters and the best score for each pipeline. Then, update the hyperparameters of each pipeline using the best parameters from the grid search. **(6 Points)**

- **2.3** Form a stacking classifier that uses the three pipelines from the previous question as the base estimators and a meta-model as the `final_estimator`. You can choose any model for the meta-model that is suitable for binary classification. Explain **why** you chose the meta-model and how it combines the predictions of the base estimators. Then, use `StratifiedKFold` to perform a cross-validation on the stacking classifier and present the accuracy scores and F1 scores for each fold. Report the mean and the standard deviation of each score in the format of `mean ± std`. For example, `0.85 ± 0.05`. Interpret the results and compare them with the baseline scores from the previous assignment. **(5 Points)**

- **2.4**: Interpret the final results of the stacking classifier and compare its performance with the individual models. Explain how stacking classifier has improved or deteriorated the prediction accuracy and F1 score, and what are the possible reasons for that. **(3 Points)**

<font color='Green'><b>Answer:</b></font>

- **2.1**


Selected Models:

- Logistic Regression: I chose this because the problem is inherently binary classification, and logistic regression performs well when the classes are close to linearly separable. It suits simple, interpretable problems, which fits this dataset well.
    - Strengths:
        - It provides a probabilistic understanding of the predictions, which is very useful for medical diagnoses.
        - It is efficient and offers quick training and prediction times, which is beneficial for a dataset with a moderate number of features and samples (as is the case here).
    - Weaknesses: If the relationship between the features and the target is more complex than linear (which is possible for this dataset; see SVC below), a more flexible model can outperform it, and logistic regression may underfit.

- KNN (K-Nearest Neighbours Classification): I chose KNN mainly because it performed better than the Random Forest model I also tried (accuracy of 80%+ for KNN vs 70%+ for Random Forest). More generally, it is a non-parametric method that is simple and intuitive, and it makes almost no assumptions about the distribution of the data.
    - Strengths: No assumptions about the data distribution, and easily adaptable to the heart disease dataset.
    - Weaknesses: KNN relies on distances, so it is sensitive to feature scaling. If the scaling in part 1 were not done properly, it could underperform.

- SVC: I selected this because the heart disease dataset has a relatively high-dimensional feature space and the relationship between the features and the target is unknown. SVC can handle both linear and non-linear decision boundaries through the kernel trick.
    - Strengths: It is effective at defining complex, higher-dimensional decision boundaries.
    - Weaknesses: It requires careful tuning of hyperparameters such as the kernel and `C`. The grid search handles that tuning, so the main remaining weakness is the computational cost of fitting every combination.

In [12]:
# 2.1
# Add necessary code here.
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Define pipelines with different classifiers
pipeline_lr = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

pipeline_knn = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', KNeighborsClassifier())
])

pipeline_svc = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', SVC())
])


<font color='Green'><b>Answer:</b></font>

- **2.2**

- Logistic Regression:
    - Best parameters:
        - `C`: 0.1
    - Accuracy = 0.82
    - F1 = 0.72

- K-Nearest Neighbours:
    - Best parameters:
        - `n_neighbors`: 7
        - `p`: 2
        - `weights`: uniform
    - Accuracy = 0.82
    - F1 = 0.74

- SVC:
    - Best parameters:
        - `C`: 0.1
        - `gamma`: 0.01 (has no effect, because the linear kernel ignores `gamma`)
        - `kernel`: linear
    - Accuracy = 0.82
    - F1 = 0.73

In [16]:
# 2.2
# Add necessary code here.

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score, f1_score


# Define parameter grids for each classifier
param_grid_lr = {
    'classifier__C': [0.1, 1, 10],
}

param_grid_knn = {
    'classifier__n_neighbors': [3, 5, 7, 9],
    'classifier__weights': ['uniform', 'distance'],
    'classifier__p': [1, 2],
}

param_grid_svc = {
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['linear', 'rbf'],
    'classifier__gamma': [0.01, 0.1, 1, 10, 100],
}

scoring = {'accuracy': make_scorer(accuracy_score), 'f1_score': make_scorer(f1_score)}

# Create GridSearchCV objects for each pipeline
grid_search_lr = GridSearchCV(pipeline_lr, param_grid_lr, cv=5, scoring=scoring, refit='accuracy')
grid_search_knn = GridSearchCV(pipeline_knn, param_grid_knn, cv=5, scoring=scoring, refit='accuracy')
grid_search_svc = GridSearchCV(pipeline_svc, param_grid_svc, cv=5, scoring=scoring, refit='accuracy')


# Fit each grid search to find the best model
grid_search_lr.fit(X, y)
grid_search_knn.fit(X, y)
grid_search_svc.fit(X, y)

# Print out the best parameters and scores for each model
print("Best parameters for LR:", grid_search_lr.best_params_)
print("Best accuracy for LR:", grid_search_lr.cv_results_['mean_test_accuracy'].max())
print("Best F1 score for LR:", grid_search_lr.cv_results_['mean_test_f1_score'][grid_search_lr.best_index_])

print("Best parameters for KNN:", grid_search_knn.best_params_)
print("Best accuracy for KNN:", grid_search_knn.cv_results_['mean_test_accuracy'].max())
print("Best F1 score for KNN:", grid_search_knn.cv_results_['mean_test_f1_score'][grid_search_knn.best_index_])

print("Best parameters for SVC:", grid_search_svc.best_params_)
print("Best accuracy for SVC:", grid_search_svc.cv_results_['mean_test_accuracy'].max())
print("Best F1 score for SVC:", grid_search_svc.cv_results_['mean_test_f1_score'][grid_search_svc.best_index_])


Best parameters for LR: {'classifier__C': 0.1}
Best accuracy for LR: 0.8196960841613091
Best F1 score for LR: 0.7201635703043407
Best parameters for KNN: {'classifier__n_neighbors': 7, 'classifier__p': 2, 'classifier__weights': 'uniform'}
Best accuracy for KNN: 0.8230859146697839
Best F1 score for KNN: 0.739776993969735
Best parameters for SVC: {'classifier__C': 0.1, 'classifier__gamma': 0.01, 'classifier__kernel': 'linear'}
Best accuracy for SVC: 0.8231443600233781
Best F1 score for SVC: 0.7346270559734795
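
Two details of the grid search can be made more explicit. First, `gamma` only affects the RBF kernel, so a list of dictionaries avoids refitting identical linear models for every `gamma` value. Second, the winning values can be copied back onto the original pipeline with `set_params`. The sketch below illustrates both on synthetic data from `make_classification`; the names `svc_pipeline`, `svc_grid`, and `search` are hypothetical, and the scaler stands in for the full preprocessor.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data (a stand-in for X and y)
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=0)

svc_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("classifier", SVC()),
])

# gamma is searched only together with the RBF kernel
svc_grid = [
    {"classifier__kernel": ["linear"], "classifier__C": [0.1, 1, 10]},
    {"classifier__kernel": ["rbf"], "classifier__C": [0.1, 1, 10],
     "classifier__gamma": [0.01, 0.1, 1]},
]

search = GridSearchCV(svc_pipeline, svc_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

# Copy the best hyperparameters onto the original (still unfitted) pipeline
svc_pipeline.set_params(**search.best_params_)

print("candidates evaluated:", len(search.cv_results_["params"]))
print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

`GridSearchCV` works on clones, so `svc_pipeline` itself still has to be fitted after `set_params`; `search.best_estimator_` is the already fitted equivalent.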


<font color='Green'><b>Answer:</b></font>

- **2.3**

Chosen meta-model: Because the problem is a medical diagnosis of heart disease, it makes sense for the final estimator to output a probability rather than only a label. I chose logistic regression because it produces probability estimates and learns a weight for each base model's prediction, so it can aggregate the patterns that the different models capture. In lecture, we also saw that logistic regression is efficient and simple to understand, which let me iterate quickly with a good balance between speed and performance.

How it combines the base estimators: The stacking classifier fits each base estimator on the training data and collects their outputs. By default these are class probabilities or decision-function values obtained through internal cross-validation, so the meta-model is not trained on predictions that the base estimators made for samples they had already seen. These outputs become the input features of the meta-model. With logistic regression as the meta-model, it treats them as input variables and learns a weight for each one. In essence, it learns how much to trust each base estimator when making the final prediction.
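
To make the combination step concrete, the sketch below fits a small `StackingClassifier` on synthetic data and inspects what the meta-model actually receives. The data and the untuned base estimators are assumptions for illustration; they are not the tuned pipelines from 2.2.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary classification data (a stand-in for X and y)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("knn", KNeighborsClassifier()),
        ("svc", SVC()),
    ],
    final_estimator=LogisticRegression(),
)
demo_stack.fit(X_demo, y_demo)

# Which output of each base estimator feeds the meta-model
print(demo_stack.stack_method_)

# One meta-feature per base estimator for a binary problem
meta_features = demo_stack.transform(X_demo[:5])
print(meta_features.shape)

# The learned weight for each base estimator's output
print(demo_stack.final_estimator_.coef_)
```

`SVC` contributes its decision-function value rather than a probability because it does not expose `predict_proba` unless it is created with `probability=True`.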


Interpreting the results: the three individual models perform almost identically (accuracy differs by less than 1 percentage point and F1 score by about 2), and the stacking classifier lands at the same level (accuracy 0.82 ± 0.06, F1 0.74 ± 0.09). This suggests a relatively strong relationship between the features and the outcome that is simple enough for any of the models to pick up. Comparing with the baseline scores from the previous assignment shows how differently the models behaved on that dataset. The F1 scores on the wine dataset for three models (SVC: 0.94, DT: 0.86, LSVC: 0.76) differ substantially (by up to 0.18 between SVC and LSVC), which suggests that the wine dataset contains more complicated relationships that SVC could model but the other models could not. It is also worth noting that the worst model on the wine dataset still scored higher than every model on the heart disease dataset. This may mean that more feature engineering and more data are needed for this dataset and model combination to excel.

In [15]:
# 2.3
# Add necessary code here.
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Define the stacking classifier
estimators = [
    ('lr', grid_search_lr.best_estimator_),
    ('knn', grid_search_knn.best_estimator_),
    ('svc', grid_search_svc.best_estimator_)
]

stacking_classifier = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())

# Perform cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
acc_scores = cross_val_score(stacking_classifier, X, y, cv=cv, scoring='accuracy')
f1_scores = cross_val_score(stacking_classifier, X, y, cv=cv, scoring='f1')

# Report the scores
print(f"Accuracy: {acc_scores.mean():.2f} ± {acc_scores.std():.2f}")
print(f"F1 Score: {f1_scores.mean():.2f} ± {f1_scores.std():.2f}")


Accuracy: 0.82 ± 0.06
F1 Score: 0.74 ± 0.09
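
The cell above reports only the aggregated scores. A possible way to also list the per-fold values, and to compute both metrics in a single pass, is `cross_validate` with a list of scorers. The sketch below uses synthetic data and a reduced two-model stack (`demo_stack`, a hypothetical name) so that it runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (a stand-in for X and y)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_stack = StackingClassifier(
    estimators=[("lr", LogisticRegression()), ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One cross-validation run yields both metrics for every fold
results = cross_validate(demo_stack, X_demo, y_demo, cv=cv, scoring=["accuracy", "f1"])

for fold, (acc, f1) in enumerate(zip(results["test_accuracy"], results["test_f1"]), start=1):
    print(f"Fold {fold}: accuracy = {acc:.2f}, F1 = {f1:.2f}")

for metric in ("accuracy", "f1"):
    fold_scores = results[f"test_{metric}"]
    print(f"{metric}: {fold_scores.mean():.2f} ± {fold_scores.std():.2f}")
```

Because both metrics come from the same fitted models, the stacking classifier is trained five times instead of ten.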


<font color='Green'><b>Answer:</b></font>

- **2.4**

The mean ± standard deviation across the five folds is: accuracy 0.82 ± 0.06 and F1 score 0.74 ± 0.09.

- The stacking classifier reaches an accuracy of 82%, which looks reasonably good. However, the standard deviation of 6% shows a fair amount of variability across folds, so its predictive power depends noticeably on which patients end up in the training and test subsets. The F1 score of 74% reflects both false negatives and false positives. In a medical setting this is critical, and 74% leaves considerable room for improvement, since a missed diagnosis (false negative) could be the difference between life and death. An F1 score that is lower than the accuracy suggests that the model handles the positive (disease) class less well than the overall accuracy implies.

- The stacking classifier has neither clearly improved nor degraded the results compared with the individual models: its accuracy (0.82) and F1 score (0.74) match the best individual scores. The most likely reason is that the three base models performed very similarly and probably make similar (highly correlated) predictions. An ensemble's strength comes from the diversity of its members' predictions; if all models agree most of the time, the meta-learner has little room to improve. Another possible reason is that the individual models or the stacking classifier itself are underfitting or overfitting, which limits how much stacking can help.
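
Whether the base models really agree most of the time can be measured rather than assumed. One simple check, sketched below on synthetic data with untuned models (assumptions for illustration only), is to collect out-of-fold predictions with `cross_val_predict` and compute the share of samples on which each pair of models predicts the same class.

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary classification data (a stand-in for X and y)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

models = {
    "lr": LogisticRegression(),
    "knn": KNeighborsClassifier(),
    "svc": SVC(),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Out-of-fold predictions: each sample is predicted by a model that did not see it
out_of_fold = {
    name: cross_val_predict(model, X_demo, y_demo, cv=cv)
    for name, model in models.items()
}

# Fraction of samples on which two models predict the same class
agreement = {
    (a, b): float((out_of_fold[a] == out_of_fold[b]).mean())
    for a, b in combinations(out_of_fold, 2)
}

for (a, b), rate in agreement.items():
    print(f"{a} vs {b}: {rate:.2%} identical predictions")
```

Agreement rates close to 100% would support the explanation that the meta-model has little diversity to exploit.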

**Bonus Question**: The stacking classifier has achieved a high accuracy and F1 score, but there may be still room for improvement. Suggest **two** possible ways to improve the modeling using the stacking classifier, and explain **how** and **why** they could improve the performance. **(2 points)**

<font color='Green'><b>Answer:</b></font>

- Optimize the Meta-Model: I used `LogisticRegression` with no tuning as the meta-model because it was simple. To improve performance, we could tune its hyperparameters or experiment with a more flexible meta-model, such as a gradient-boosted ensemble. This would also add a different kind of model to the stack instead of using `LogisticRegression` twice (once as a base estimator and again as the final estimator). A more flexible meta-model might capture interactions between the base models' predictions that a simple logistic regression misses.

- Tuning the Stacking Classifier itself: I used a very basic configuration, passing only the base estimators and a final estimator. Performance could be improved by experimenting with different configurations, changing how the base models are combined, or adding or removing base models. For example, we could set `passthrough=True` so that the original features are passed to the final estimator alongside the base models' predictions, or change `cv` to alter the number of folds used to generate those predictions. This could help because some combination may suit the data better. If one of the original features has a very strong relationship with the target, enabling `passthrough` lets the final estimator use it directly, which may improve the model.
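
Both suggestions can be explored with a grid search over the stacking classifier itself. The sketch below is one possible setup on synthetic data: it searches over the meta-model's `C` and over `passthrough`. The data, the reduced two-model stack, and the grid values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (a stand-in for X and y)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_stack = StackingClassifier(
    estimators=[("lr", LogisticRegression()), ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # folds used internally to build the meta-features
)

# Tune the meta-model and decide whether it should also see the raw features
stack_grid = {
    "final_estimator__C": [0.1, 1, 10],
    "passthrough": [False, True],
}

stack_search = GridSearchCV(demo_stack, stack_grid, cv=3, scoring="f1")
stack_search.fit(X_demo, y_demo)

# With passthrough the meta-model sees 2 predictions plus the 8 original features
meta_width = stack_search.best_estimator_.transform(X_demo[:5]).shape[1]

print("best parameters:", stack_search.best_params_)
print("best CV F1:", round(stack_search.best_score_, 3))
print("meta-feature columns:", meta_width)
```

Nested cross-validation makes this search noticeably slower than tuning a single model, so a small grid is a sensible starting point.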
