<font size="+3"><b>Assignment 4: Pipelines and Hyperparameter Tuning</b></font>

***
* **Full Name** = David Rodriguez
* **UCID** = 30145288
***

<font color='Blue'>
In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models, and evaluate the results. More details for each step can be found below. You will also be asked to describe the process by which you came up with the code. More details can be found below. Please cite any websites or AI tools that you used to help you with this assignment.
</font>

<font color='Red'>
For this assignment, in addition to your .ipynb file, please also attach a PDF file. To generate this PDF file, you can use the print function (located under the "File" within Jupyter Notebook). Name this file ENGG444_Assignment##__yourUCID.pdf (this name is similar to your main .ipynb file). We will evaluate your assignment based on the two files and you need to provide both.
</font>


|         **Question**         | **Point(s)** |
|:----------------------------:|:------------:|
|  **1. Preprocessing Tasks**  |              |
|              1.1             |       2      |
|              1.2             |       2      |
|              1.3             |       4      |
| **2. Pipeline and Modeling** |              |
|              2.1             |       3      |
|              2.2             |       6      |
|              2.3             |       5      |
|              2.4             |       3      |
|     **3. Bonus Question**    |     **2**    |
|           **Total**          |    **25**    |

## **0. Dataset**

This data is a subset of the **Heart Disease Dataset**, which contains information about patients with possible coronary artery disease. The data has **14 attributes** and **294 instances**. The attributes include demographic, clinical, and laboratory features, such as age, sex, chest pain type, blood pressure, cholesterol, and electrocardiogram results. The last attribute is the **diagnosis of heart disease**, which is a categorical variable with values from 0 (no presence) to 4 (high presence). The data can be used for **classification** tasks, such as predicting the presence or absence of heart disease based on the other attributes.

In [82]:
import pandas as pd

# Define the data source link
_link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data'

# Read the CSV file into a Pandas DataFrame, considering '?' as missing values
df = pd.read_csv(_link, na_values='?',
                 names=['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs',
                        'restecg', 'thalach', 'exang', 'oldpeak', 'slope',
                        'ca', 'thal', 'num'])

# Display the DataFrame
display(df)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,28,1,2,130.0,132.0,0.0,2.0,185.0,0.0,0.0,,,,0
1,29,1,2,120.0,243.0,0.0,0.0,160.0,0.0,0.0,,,,0
2,29,1,2,140.0,,0.0,0.0,170.0,0.0,0.0,,,,0
3,30,0,1,170.0,237.0,0.0,1.0,170.0,0.0,0.0,,,6.0,0
4,31,0,2,100.0,219.0,0.0,1.0,150.0,0.0,0.0,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,52,1,4,160.0,331.0,0.0,0.0,94.0,1.0,2.5,,,,1
290,54,0,3,130.0,294.0,0.0,1.0,100.0,1.0,0.0,2.0,,,1
291,56,1,4,155.0,342.0,1.0,0.0,150.0,1.0,3.0,2.0,,,1
292,58,0,2,180.0,393.0,0.0,0.0,110.0,1.0,1.0,2.0,,7.0,1


# **1. Preprocessing Tasks**

- **1.1** Find out which columns have more than 60% of their values missing and drop them from the data frame. Explain why this is a reasonable way to handle these columns. **(2 Points)**

- **1.2** For the remaining columns that have some missing values, choose an appropriate imputation method to fill them in. You can use the `SimpleImputer` class from `sklearn.impute` or any other method you prefer. Explain why you chose this method and how it affects the data. **(2 Points)**

- **1.3** Assign the `num` column to the variable `y` and the rest of the columns to the variable `X`. The `num` column indicates the presence or absence of heart disease based on the angiographic disease status of the patients. Create a `ColumnTransformer` object that applies different preprocessing steps to different subsets of features. Use `StandardScaler` for the numerical features, `OneHotEncoder` for the categorical features, and `passthrough` for the binary features. List the names of the features that belong to each group and explain why they need different transformations. You will use this `ColumnTransformer` in a pipeline in the next question. **(4 Points)**

<font color='Green'><b>Answer:</b></font>

- **1.1** .....................
## Find out which columns have more than 60% of their values missing and drop them from the data frame. Explain why this is a reasonable way to handle these columns.
As shown below the columns with more than 60% of its entries empty or NaN, are slope, ca, and thal. The reason this is a responsible way to handle these columns is because filling them in leads to high bias. Most metrics such as using the mean, most_frequent, and a constant are not suitable when it substitutes a significant portion of the data. Additionally, since we still have several other features that can be utilized to train and predict our model it is not irresponsible to drop the column. 

Upon further examination the columns being dropped slope refers to the slop of the peak exercise ST segment on the patient's ECG during testing. 'ca' is the may stand for the number of major vessels colored by fluoroscopy or the coronary arteries. 'thal' refers to thalassemia, which is a genetic blood disorder that affects the production of hemoglobin.  It is characterized by abnormal hemoglobin production. These are likely to be empty because they are harder to examine and are not often recorded. In future models, including them may be ideal.



In [83]:
# 1.1
# Add necessary code here.
# print(len(df))
# print(df.isnull().sum())
missing_values = df.isnull().sum() # Count the number of missing values in each column
missing_values = missing_values[missing_values > 0.6 * len(df)] # Select columns with more than 60% missing values
print(missing_values) # Display the columns with more than 60% missing values
df = df.drop(missing_values.index, axis=1) # Drop the columns with more than 60% missing values
print(df.isnull().sum()) # Display the number of missing values in each column

# Inspect the nature of the data in each column to see if its binary, numerical, or categorical
print('trestbps:', df['trestbps'].unique())
print('chol:', df['chol'].unique())
print('fbs:', df['fbs'].unique())
print('restecg:', df['restecg'].unique())
print('thalach:', df['thalach'].unique())
print('exang:', df['exang'].unique())

slope    190
ca       291
thal     266
dtype: int64
age          0
sex          0
cp           0
trestbps     1
chol        23
fbs          8
restecg      1
thalach      1
exang        1
oldpeak      0
num          0
dtype: int64
trestbps: [130. 120. 140. 170. 100. 105. 110. 125. 150.  98. 112. 145. 190. 160.
 115. 142. 180. 132. 135.  nan 108. 124. 113. 122.  92. 118. 106. 200.
 138. 136. 128. 155.]
chol: [132. 243.  nan 237. 219. 198. 225. 254. 298. 161. 214. 220. 160. 167.
 308. 264. 166. 340. 209. 260. 211. 173. 283. 194. 223. 315. 275. 297.
 292. 182. 200. 204. 241. 339. 147. 273. 307. 289. 215. 281. 250. 184.
 245. 291. 295. 269. 196. 268. 228. 358. 201. 249. 266. 186. 207. 218.
 412. 224. 238. 230. 163. 240. 280. 257. 263. 276. 284. 195. 227. 253.
 187. 202. 328. 168. 216. 129. 190. 188. 179. 210. 272. 180. 100. 259.
 468. 274. 320. 221. 309. 312. 171. 208. 246. 305. 217. 365. 344. 394.
 256. 326. 277. 270. 229.  85. 347. 251. 222. 287. 318. 213. 294. 193.
 271. 156. 267. 282. 1

<font color='Green'><b>Answer:</b></font>

- **1.2** ..................... 
## For the remaining columns that have some missing values, choose an appropriate imputation method to fill them in. You can use the `SimpleImputer` class from `sklearn.impute` or any other method you prefer. Explain why you chose this method and how it affects the data.

There are multiple ways of filling in null values, utilizing the feature's mean, the feature's most_frequent, and a constant. Other methods in basic models can use ffil, bfill and different interpolation methods, these are not able to be applied to our model due to the unorganized data, and high dimension disabling this possibility. The mean, most_frequent, and constant are good options here depending on the data type. The mean way is valid for numerical and continous data since it still somewhat preserves the nature of the data. This will likely be an outlier in the model however, when data is scarce including it could prove beneficial. For categorical and binary data types, it is best to not use this method since it is possible to create a whole new unique data type which can undermine the function of the model. That's why I utilized most_frequent for these data types. In the examples: the binary and categorical features that had empty values were the, fbs, restecg, exang, this was evident upon inspecting the unique data types, and reading the description online. The numerical feature were chol, trestbps and thalch due to them being continous data types.

In [92]:
# 1.2
# Add necessary code here.
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
# for x in df[df.isnull().sum().index]:
#     print(x)
#     print(df[x].unique())

Binary_And_Categorical = ['fbs','restecg','exang']
Numerical_Features = ['chol','trestbps','thalach']
imputerBinaryAndCategorical = SimpleImputer(strategy='most_frequent') # Create an imputer object with a mean filling strategy
imputerNumerical = SimpleImputer(strategy='mean') # Create an imputer object with a mean filling strategy

preprocessor = ColumnTransformer(
    transformers=[
        ('num', imputerNumerical, Numerical_Features),  # Impute numerical features with mean
        ('cat', imputerBinaryAndCategorical, Binary_And_Categorical)  # Impute categorical features with most frequent
    ],
    remainder='passthrough'  # Include all remaining columns in the output DataFrame
)

other_columns = list(set(df.columns) - set(Numerical_Features) - set(Binary_And_Categorical))
df_imputed = pd.DataFrame(preprocessor.fit_transform(df), columns=Numerical_Features + Binary_And_Categorical + other_columns)

print(df_imputed) # Display the number of missing values in each column
print(df_imputed.isnull().sum()) # Display the number of missing values in each column

# for x in df.columns:
#     # print(df[x], df[x].dtype)


           chol  trestbps  thalach  fbs  restecg  exang    cp  oldpeak  num  \
0    132.000000     130.0    185.0  0.0      2.0    0.0  28.0      1.0  2.0   
1    243.000000     120.0    160.0  0.0      0.0    0.0  29.0      1.0  2.0   
2    250.848708     140.0    170.0  0.0      0.0    0.0  29.0      1.0  2.0   
3    237.000000     170.0    170.0  0.0      1.0    0.0  30.0      0.0  1.0   
4    219.000000     100.0    150.0  0.0      1.0    0.0  31.0      0.0  2.0   
..          ...       ...      ...  ...      ...    ...   ...      ...  ...   
289  331.000000     160.0     94.0  0.0      0.0    1.0  52.0      1.0  4.0   
290  294.000000     130.0    100.0  0.0      1.0    1.0  54.0      0.0  3.0   
291  342.000000     155.0    150.0  1.0      0.0    1.0  56.0      1.0  4.0   
292  393.000000     180.0    110.0  0.0      0.0    1.0  58.0      0.0  2.0   
293  275.000000     130.0    115.0  0.0      1.0    1.0  65.0      1.0  4.0   

     sex  age  
0    0.0  0.0  
1    0.0  0.0  
2  

<font color='Green'><b>Answer:</b></font>

- **1.3** .....................

In [None]:
# 1.3
# Add necessary code here.

# Assign the `num` column to the variable `y` and the rest of the columns to the variable `X`. The `num` column indicates the presence or absence of heart disease based on the angiographic disease status of the patients.
# Create a `ColumnTransformer` object that applies different preprocessing steps to different subsets of features. 
#Use `StandardScaler` for the numerical features, `OneHotEncoder` for the categorical features, and `passthrough` for the binary features. List the names of the features that belong to each group and explain why they need different transformations. You will use this `ColumnTransformer` in a pipeline in the next question.
y = df['num']
X = df.drop(columns=['num'])

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline

print(X.dtypes)
# num2 =list(df.select_dtypes(include=['float64']).columns)
numerical_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak','fbs']
categorical_features = ['cp','restecg',]
categorial_features = df.select_dtypes(include=['int64']).columns.tolist()






age         float64
sex         float64
cp          float64
trestbps    float64
chol        float64
fbs         float64
restecg     float64
thalach     float64
exang       float64
oldpeak     float64
dtype: object


# **2. Pipeline and Modeling**

- **2.1** Create **three** `Pipeline` objects that take the column transformer from the previous question as the first step and add one or more models as the subsequent steps. You can use any models from `sklearn` or other libraries that are suitable for binary classification. For each pipeline, explain **why** you selected the model(s) and what are their **strengths and weaknesses** for this data set. **(3 Points)**

- **2.2** Use `GridSearchCV` to perform a grid search over the hyperparameters of each pipeline and find the best combination that maximizes the cross-validation score. Report the best parameters and the best score for each pipeline. Then, update the hyperparameters of each pipeline using the best parameters from the grid search. **(6 Points)**

- **2.3** Form a stacking classifier that uses the three pipelines from the previous question as the base estimators and a meta-model as the `final_estimator`. You can choose any model for the meta-model that is suitable for binary classification. Explain **why** you chose the meta-model and how it combines the predictions of the base estimators. Then, use `StratifiedKFold` to perform a cross-validation on the stacking classifier and present the accuracy scores and F1 scores for each fold. Report the mean and the standard deviation of each score in the format of `mean ± std`. For example, `0.85 ± 0.05`. Interpret the results and compare them with the baseline scores from the previous assignment. **(5 Points)**

- **2.4**: Interpret the final results of the stacking classifier and compare its performance with the individual models. Explain how stacking classifier has improved or deteriorated the prediction accuracy and F1 score, and what are the possible reasons for that. **(3 Points)**

<font color='Green'><b>Answer:</b></font>

- **2.1** .....................

In [None]:
# 2.1
# Add necessary code here.

<font color='Green'><b>Answer:</b></font>

- **2.2** .....................

In [None]:
# 2.2
# Add necessary code here.

<font color='Green'><b>Answer:</b></font>

- **2.3** .....................

In [None]:
# 2.3
# Add necessary code here.

<font color='Green'><b>Answer:</b></font>

- **2.4** .....................

**Bonus Question**: The stacking classifier has achieved a high accuracy and F1 score, but there may be still room for improvement. Suggest **two** possible ways to improve the modeling using the stacking classifier, and explain **how** and **why** they could improve the performance. **(2 points)**

<font color='Green'><b>Answer:</b></font>