# Heart Disease Prediction (Imbalanced Data)
## Advanced Classification Project

### 1. Problem Statement
- Predict the presence of heart disease in patients from clinical features, using the dataset's `num` column as a multiclass target.
- Address class imbalance and optimize classification thresholds.

### 2. Setup

 - Importing packages

In [None]:
import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from sklearn.model_selection import train_test_split, RandomizedSearchCV 
from imblearn.over_sampling import SMOTE 
from sklearn.linear_model import LogisticRegression  
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score 

 - Importing the Dataset

In [None]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
heart_disease = fetch_ucirepo(id=45) 

# data (as pandas dataframes) 
X = heart_disease.data.features 
y = heart_disease.data.targets 
# merge features and target into a single DataFrame
dataset = pd.concat([X,y], axis=1)
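
Since the problem statement centers on class imbalance, it can help to look at the class counts before modelling. The sketch below uses a small hypothetical `num` Series (the values are illustrative, not taken from the UCI data); in the notebook the same calls would presumably be applied to `dataset['num']`.

```python
import pandas as pd

# Hypothetical stand-in for the `num` target column; values are illustrative only.
num = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 2, 3], name="num")

# Count rows per class and express each count as a share of all rows.
counts = num.value_counts().sort_index()
shares = (counts / len(num) * 100).round(1)

summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```

A strongly skewed `percent` column is the signal that plain accuracy may be misleading and that resampling or class weighting is worth trying.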

 - Method for evaluating the model

In [21]:
def evaluate(model, X_test, y_test):
    """
    Evaluate a classification model's performance using common metrics.
    
    Parameters:
    -----------
    model : classifier
        Trained classification model implementing `predict()` and `predict_proba()`
    X_test : array-like or DataFrame
        Feature matrix of test data
    y_test : array-like or Series
        True target values for test data
        
    Returns:
    --------
    dict
        Dictionary containing performance metrics (as percentages)
    """
    y_pred = model.predict(X_test)
    
    # Macro averaging gives every class equal weight, so rare classes count as much as common ones
    avg_method = 'macro'  # alternatives: 'micro' or 'weighted'
    
    return {
        # Built-in round() works for both Python floats and NumPy scalars
        'Accuracy': round(100 * accuracy_score(y_test, y_pred), 2),
        'Precision': round(100 * precision_score(y_test, y_pred, average=avg_method), 2),
        'Recall': round(100 * recall_score(y_test, y_pred, average=avg_method), 2),
        'F1 Score': round(100 * f1_score(y_test, y_pred, average=avg_method), 2),
        'Roc_Auc': round(100 * roc_auc_score(y_test, model.predict_proba(X_test), multi_class='ovo'), 2)
    }
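
To make the effect of `average='macro'` concrete, here is a minimal hand-computed sketch on hypothetical toy labels (not the heart disease data). The helper `per_class_recall` is an illustrative name, not part of scikit-learn. It shows how accuracy can look acceptable while macro recall is much lower when rare classes are missed.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label that appears in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in sorted(support)}

# Hypothetical toy labels: class 0 dominates, classes 1 and 2 are rare.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

recalls = per_class_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Macro recall is the unweighted mean of the per-class recalls.
macro_recall = sum(recalls.values()) / len(recalls)

print(recalls)
print(accuracy, macro_recall)
```

Here the majority class is recalled perfectly and one rare class is never predicted, so the macro average is pulled well below the accuracy.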

### 3. Cleaning the dataset

 - Removing rows with null values

In [None]:
((dataset.isnull().sum()/ len(dataset)) * 100).round(2)

In [None]:
dataset.dropna(subset=['ca','thal'], how = 'any', axis = 0, inplace = True)
((dataset.isnull().sum()/ len(dataset)) * 100).round(2)

- Splitting the dataset into train and test sets

In [None]:
X = dataset.drop(['num'], axis = 1)
y = dataset['num']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42  # fixed seed for reproducibility
)
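
With imbalanced classes, a purely random split can leave a rare class under-represented in the test set. One option, not used in the split above, is the `stratify` argument of `train_test_split`. The sketch below runs on hypothetical stand-in data (`X_demo`, `y_demo`) rather than the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, three imbalanced classes (70/20/10).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([0] * 70 + [1] * 20 + [2] * 10)

# stratify=y_demo asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(np.bincount(y_tr))
print(np.bincount(y_te))
```

Stratifying changes which rows land in the test set, so any metrics computed after such a change would differ from those reported in this notebook.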

### 4. Training Models

#### **Baseline**: Without fixing imbalance

**Logistic Regression**

In [None]:
LR_model = LogisticRegression()
LR_model.fit(X_train,y_train)

**RandomForestClassifier**

In [None]:
RFC_model = RandomForestClassifier()
RFC_model.fit(X_train,y_train)

**XGBClassifier**

In [10]:
XGB_model = XGBClassifier()
XGB_model.fit(X_train,y_train)

In [23]:
evaluate(LR_model, X_test, y_test)


{'Accuracy': 66.67,
 'Precision': 33.48,
 'Recall': 32.97,
 'F1 Score': 32.78,
 'Roc_Auc': np.float64(65.67)}

In [24]:
evaluate(RFC_model, X_test, y_test)

{'Accuracy': 61.67,
 'Precision': 22.48,
 'Recall': 25.67,
 'F1 Score': 23.95,
 'Roc_Auc': np.float64(66.08)}

In [25]:
evaluate(XGB_model, X_test, y_test)

{'Accuracy': 58.33,
 'Precision': 30.32,
 'Recall': 30.19,
 'F1 Score': 29.64,
 'Roc_Auc': np.float64(62.67)}

#### **Improved**: With SMOTE

In [27]:
# Oversample only the training split so the test set keeps its original class distribution
smote = SMOTE(random_state = 42)
X_reshaped, y_reshaped = smote.fit_resample(X_train, y_train)
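
As a rough illustration of the idea behind SMOTE, the sketch below creates one synthetic minority sample by interpolating between a point and its nearest minority neighbor. This is a simplified sketch and not imbalanced-learn's implementation: the real algorithm picks at random among several nearest neighbors (five by default), and the helper names and toy rows here are hypothetical.

```python
import random

def nearest_neighbor(point, others):
    """Return the element of `others` closest to `point` (squared Euclidean distance)."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(point, o)))

def interpolate_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between `point` and `neighbor`."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

# Hypothetical minority-class rows with two features each.
minority = [[1.0, 2.0], [1.5, 2.5], [4.0, 4.0]]
rng = random.Random(42)

base = minority[0]
neighbor = nearest_neighbor(base, minority[1:])
synthetic = interpolate_sample(base, neighbor, rng)

print(neighbor)
print(synthetic)
```

Because the synthetic rows lie between existing minority rows, they add density to the minority region without duplicating any row exactly.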

In [28]:
LR_model.fit(X_reshaped,y_reshaped)
evaluate(LR_model, X_test, y_test)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

{'Accuracy': 65.0,
 'Precision': 41.5,
 'Recall': 38.16,
 'F1 Score': 39.03,
 'Roc_Auc': np.float64(63.33)}
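
The convergence warning above suggests two remedies: scaling the features or raising `max_iter`. A minimal sketch of both, assuming a scikit-learn pipeline and hypothetical stand-in data from `make_classification` (in the notebook, the resampled training data would take the place of `X_demo` and `y_demo`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data with 13 features and 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=6,
    n_classes=3, random_state=42,
)

# Standardize the features, then fit logistic regression with a larger iteration budget.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_demo, y_demo)

# n_iter_ reports how many solver iterations were actually used.
print(scaled_lr[-1].n_iter_)
```

Keeping the scaler inside the pipeline means it is fitted on the training data only and then reused unchanged when `predict` or `predict_proba` is called on the test set. Whether this improves the scores reported here has not been checked.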

In [29]:
RFC_model.fit(X_reshaped,y_reshaped)
evaluate(RFC_model, X_test, y_test)

{'Accuracy': 53.33,
 'Precision': 27.29,
 'Recall': 26.97,
 'F1 Score': 26.64,
 'Roc_Auc': np.float64(59.09)}

In [30]:
XGB_model.fit(X_reshaped,y_reshaped)
evaluate(XGB_model, X_test, y_test)

{'Accuracy': 53.33,
 'Precision': 25.39,
 'Recall': 23.41,
 'F1 Score': 24.02,
 'Roc_Auc': np.float64(63.13)}