Q1

The goal is to build a **binary classification model** that predicts whether a patient has **heart disease (1)** or not (0), using clinical data from the **UCI Heart Disease Dataset**. The dataset includes **303 patient records** with **13 features**, such as age, chest pain type, blood pressure, cholesterol levels, and ECG results. The original target variable `num` ranges from 0 to 4, indicating increasing severity of heart disease. For binary classification, it is recoded as `0` (no heart disease) when `num` = 0 and `1` (heart disease present) when `num` > 0.
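
The recoding of `num` described above can be sketched as follows. This is a minimal illustration on a small made-up series, not the notebook's actual code; the values and the name `y_binary` are assumptions for demonstration only.

```python
import pandas as pd

# Hypothetical sample of the original target `num`
# (0 = no disease, 1-4 = increasing severity); values are invented
num = pd.Series([0, 2, 1, 0, 4, 3], name="num")

# Collapse severity levels 1-4 into a single positive class
y_binary = (num > 0).astype(int)

# Share of positive cases, a quick check on class balance
positive_rate = y_binary.mean()

print(y_binary.tolist())
print(positive_rate)
```

Computing the positive rate on the real target is a quick way to judge how balanced the two classes are before choosing evaluation metrics.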

This is a **supervised learning** problem with a mix of numerical and categorical variables. Challenges include **missing values**, potential **class imbalance**, and the need for **interpretable models** due to the medical context.

Accurate prediction can support **early diagnosis** and improve clinical decision-making.

Q2

In [9]:
from ucimlrepo import fetch_ucirepo

# Fetch the UCI Heart Disease dataset (repository id 45)
heart_disease = fetch_ucirepo(id=45)
  
# Convert features and target to standard pandas DataFrames
X = heart_disease.data.features.copy()
y = heart_disease.data.targets.copy()

# Data types
print("\nData types:")
print(X.dtypes)


Data types:
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca          float64
thal        float64
dtype: object


In [10]:
# 1. Continuous variables → float64 
continuous_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
X[continuous_features] = X[continuous_features].astype(float)

# 2. Categorical/discrete variables → nullable Int64 (missing entries in 'thal' become <NA>)
categorical_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'thal']
X[categorical_features] = X[categorical_features].astype('Int64')

# Data types
print("\nData types:")
print(X.dtypes)


Data types:
age         float64
sex           Int64
cp            Int64
trestbps    float64
chol        float64
fbs           Int64
restecg       Int64
thalach     float64
exang         Int64
oldpeak     float64
slope         Int64
ca          float64
thal          Int64
dtype: object


In [11]:
# Define continuous numerical features 
numerical_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

# Display min and max for each
print(X[numerical_features].agg(['min', 'max']).transpose())

            min    max
age        29.0   77.0
trestbps   94.0  200.0
chol      126.0  564.0
thalach    71.0  202.0
oldpeak     0.0    6.2


In [12]:
from sklearn.preprocessing import StandardScaler

# List of numerical features to standardize (excluding 'ca')
numerical_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

# Initialize the scaler
scaler = StandardScaler()

# Apply standardization only to selected numerical features
X[numerical_features] = scaler.fit_transform(X[numerical_features])

Basic data transformations were applied to prepare the dataset for analysis:

- **Variable Type Identification**:  
  Based on the official UCI dataset documentation and domain knowledge:
  - **Continuous numerical variables**: `age`, `trestbps`, `chol`, `thalach`, and `oldpeak` were treated as numerical, since they are measured on ratio or interval scales with meaningful numeric ranges. The variable `ca` (number of major vessels colored by fluoroscopy) is an integer-valued numerical feature. It is stored as `float64` because it contains missing values, and no type conversion is applied to it at this stage.

  - **Categorical/discrete variables**: `sex`, `cp`, `fbs`, `restecg`, `exang`, `slope`, and `thal` are encoded as integers but represent discrete categories, as stated in the original dataset description. For example, `cp` (chest pain type) takes values 1 to 4, each denoting a distinct type of chest pain, and `thal` records the thallium stress test result (3 = normal, 6 = fixed defect, 7 = reversible defect).

- **Missing Value Identification**:  
  The features `ca` and `thal` contain missing values, which the `ucimlrepo` loader parses as `NaN`. No imputation or deletion was applied at this stage, as missing value handling is addressed in Question 6.


- **Standardization**:  
  Standardization was applied to the continuous numerical features (`age`, `trestbps`, `chol`, `thalach`, and `oldpeak`) using z-score normalization. Because `ca` is integer-valued with a narrow range, it was not standardized.

  Although standardization is **not strictly necessary** for every classification algorithm (tree-based models, for example, are insensitive to feature scale), it benefits scale-sensitive models such as logistic regression, SVM, and k-NN.
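
As a rough sketch of what z-score normalization computes, the snippet below standardizes a small made-up array by hand. The cholesterol values are hypothetical; the formula uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import numpy as np

# Hypothetical cholesterol readings, for illustration only
chol = np.array([126.0, 200.0, 240.0, 300.0, 564.0])

# z-score: subtract the mean, then divide by the population
# standard deviation (ddof=0)
z = (chol - chol.mean()) / chol.std(ddof=0)

# After standardization the feature has mean ~0 and standard deviation ~1
print(z.mean(), z.std(ddof=0))
```

After this transformation, features measured in very different units (years, mm Hg, mg/dl) contribute on a comparable scale to distance-based or gradient-based models.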


These steps ensure that the dataset is properly typed and scaled for downstream modeling and analysis; the missing values in `ca` and `thal` remain to be handled.
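
The missing-value check and the nullable integer conversion mentioned above can be illustrated on a toy frame. This is a sketch under assumptions: the frame `toy` and its values are invented and only mimic how `ca` and `thal` arrive as `float64` columns containing `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking `ca` and `thal`
toy = pd.DataFrame({
    "ca": [0.0, np.nan, 2.0],
    "thal": [3.0, 7.0, np.nan],
})

# Count missing values per column
missing_counts = toy.isna().sum()

# The nullable Int64 dtype keeps integer category codes and
# represents missing entries as <NA> instead of forcing float64
toy["thal"] = toy["thal"].astype("Int64")

print(missing_counts)
print(toy.dtypes)
```

Using the nullable `Int64` dtype lets the category codes stay integers while the missing entries are preserved for later handling.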