# Heart Disease Prediction Project

## A CRISP-DM Approach to Binary Classification

This notebook follows the Cross-Industry Standard Process for Data Mining (CRISP-DM) methodology to build a predictive model for heart disease diagnosis.

---

## Table of Contents

1. [Business Understanding](#1-business-understanding)
2. [Data Understanding](#2-data-understanding)
3. [Data Preparation](#3-data-preparation)
4. [Modeling](#4-modeling)
5. [Evaluation](#5-evaluation)
6. [Deployment Scenario](#6-deployment-scenario)
7. [Conclusion](#7-conclusion)


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, roc_auc_score, confusion_matrix, 
                             classification_report, roc_curve)
import joblib
import os

# Set style for better-looking plots
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)



---

## 1. Business Understanding

### Real-World Importance

Cardiovascular disease is the leading cause of death globally, responsible for an estimated 17.9 million deaths each year according to the World Health Organization. Early detection and accurate diagnosis are crucial for:

- **Preventing fatal outcomes**: Early intervention can significantly reduce mortality rates
- **Reducing healthcare costs**: Accurate predictions help allocate medical resources efficiently
- **Improving patient outcomes**: Timely treatment can prevent complications and improve quality of life
- **Supporting clinical decision-making**: Machine learning models can assist healthcare professionals in making informed diagnostic decisions

### Primary Questions of Interest

1. **Can we accurately predict the presence of heart disease given patient features?**
   - This question addresses the core predictive capability of our model. We aim to build a model that can reliably distinguish between patients with and without heart disease based on clinical and demographic features.

2. **Which factors are most strongly associated with heart disease?**
   - Understanding feature importance helps identify key risk factors that clinicians should pay attention to. This can inform preventive care strategies and patient education.

### Success Criteria

For this binary classification problem, we will evaluate our models using multiple metrics:
- **Accuracy**: Overall correctness of predictions
- **Precision**: Ability to avoid false positives (incorrectly diagnosing healthy patients)
- **Recall**: Ability to identify all true cases (avoiding false negatives, which could be fatal)
- **F1-Score**: Balanced measure of precision and recall
- **ROC-AUC**: Overall discriminative ability of the model

Given the medical context, **recall is particularly important** - we want to minimize false negatives (missing actual heart disease cases).
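
As a rough illustration of how these metrics relate to one another, the sketch below derives them from the four cells of a confusion matrix. The function name `classification_metrics` and the counts are made up for this example and are not results from this dataset.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive the headline metrics from confusion-matrix counts.

    tp/fp/fn/tn are hypothetical counts of true positives, false positives,
    false negatives, and true negatives.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative counts only: 40 caught cases, 10 missed cases (false negatives)
example = classification_metrics(tp=40, fp=10, fn=10, tn=40)
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

Every missed case adds to `fn` and lowers recall directly, which is why recall is the metric to watch in this setting.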


---

## 2. Data Understanding

### 2.1 Loading and Initial Inspection


In [None]:
# Load the dataset
df = pd.read_csv('Heart_Disease_Prediction.csv')

# Display first few rows
print("First 5 rows of the dataset:")
print(df.head())
print("\n" + "="*80 + "\n")

# Display dataset shape
print(f"Dataset shape: {df.shape}")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")
print("\n" + "="*80 + "\n")

# Display column names and data types
print("Column information:")
print(df.info())
print("\n" + "="*80 + "\n")

# Display column names
print("Column names:")
print(df.columns.tolist())


### 2.2 Missing Values Analysis


In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_percentage = (missing_values / len(df)) * 100

missing_df = pd.DataFrame({
    'Column': missing_values.index,
    'Missing Count': missing_values.values,
    'Missing Percentage': missing_percentage.values
})

missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

if len(missing_df) > 0:
    print("Missing values found:")
    print(missing_df)
else:
    print("✓ No missing values found in the dataset!")


### 2.3 Descriptive Statistics


In [None]:
# Display descriptive statistics for numerical columns
print("Descriptive Statistics:")
print(df.describe())
print("\n" + "="*80 + "\n")

# Display descriptive statistics for categorical columns
print("Categorical columns summary:")
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    print(f"\n{col}:")
    print(df[col].value_counts())


### 2.4 Target Variable Analysis


In [None]:
# Analyze target variable distribution
target_col = 'Heart Disease'
print(f"Target variable: {target_col}")
print(f"\nValue counts:")
print(df[target_col].value_counts())
print(f"\nValue counts (percentage):")
print(df[target_col].value_counts(normalize=True) * 100)

# Visualize target distribution
plt.figure(figsize=(10, 6))
ax = sns.countplot(data=df, x=target_col, palette='Set2')
plt.title('Distribution of Heart Disease Cases', fontsize=16, fontweight='bold')
plt.xlabel('Heart Disease Status', fontsize=12)
plt.ylabel('Count', fontsize=12)

# Add count labels on bars
for container in ax.containers:
    ax.bar_label(container, fontsize=11)

plt.tight_layout()
plt.show()

# Check for class imbalance
presence_count = (df[target_col] == 'Presence').sum()
absence_count = (df[target_col] == 'Absence').sum()
imbalance_ratio = max(presence_count, absence_count) / min(presence_count, absence_count)

print(f"\nClass imbalance ratio: {imbalance_ratio:.2f}")
if imbalance_ratio > 1.5:
    print("⚠ Warning: Significant class imbalance detected. Stratification recommended for train/test split.")
else:
    print("✓ Classes are relatively balanced.")


### 2.5 Visualizations: Histograms for Key Numerical Variables


In [None]:
# Select key numerical variables for visualization
numerical_cols = ['Age', 'BP', 'Cholesterol', 'Max HR', 'ST depression']

# Create histograms
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols):
    axes[idx].hist(df[col], bins=20, edgecolor='black', alpha=0.7, color='skyblue')
    axes[idx].set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel(col, fontsize=10)
    axes[idx].set_ylabel('Frequency', fontsize=10)
    axes[idx].grid(True, alpha=0.3)

# Remove empty subplot
fig.delaxes(axes[5])

plt.suptitle('Histograms of Key Numerical Variables', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()


### 2.6 Boxplots to Detect Outliers


In [None]:
# Create boxplots for numerical variables
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols):
    box = axes[idx].boxplot(df[col], patch_artist=True, 
                            boxprops=dict(facecolor='lightblue', alpha=0.7),
                            medianprops=dict(color='red', linewidth=2))
    axes[idx].set_title(f'Boxplot of {col}', fontsize=12, fontweight='bold')
    axes[idx].set_ylabel(col, fontsize=10)
    axes[idx].grid(True, alpha=0.3)

# Remove empty subplot
fig.delaxes(axes[5])

plt.suptitle('Boxplots for Outlier Detection', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

# Print outlier information using IQR method
print("Outlier Detection (using IQR method):")
print("="*60)
for col in numerical_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    print(f"\n{col}:")
    print(f"  Lower bound: {lower_bound:.2f}, Upper bound: {upper_bound:.2f}")
    print(f"  Number of outliers: {len(outliers)} ({len(outliers)/len(df)*100:.2f}%)")


### 2.7 Correlation Heatmap


In [None]:
# Prepare data for correlation analysis
# First, encode the target variable temporarily for correlation
df_temp = df.copy()
le_temp = LabelEncoder()
df_temp['Heart Disease Encoded'] = le_temp.fit_transform(df[target_col])

# Select numerical columns for correlation
corr_cols = ['Age', 'Sex', 'Chest pain type', 'BP', 'Cholesterol', 'FBS over 120',
             'EKG results', 'Max HR', 'Exercise angina', 'ST depression', 
             'Slope of ST', 'Number of vessels fluro', 'Thallium', 'Heart Disease Encoded']

# Calculate correlation matrix
correlation_matrix = df_temp[corr_cols].corr()

# Create heatmap
plt.figure(figsize=(14, 12))
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))  # Mask upper triangle
sns.heatmap(correlation_matrix, mask=mask, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8},
            vmin=-1, vmax=1)
plt.title('Correlation Heatmap (Including Target Variable)', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

# Display correlations with target variable
print("\nCorrelation with Heart Disease (encoded):")
print("="*60)
target_corr = correlation_matrix['Heart Disease Encoded'].sort_values(ascending=False)
target_corr = target_corr[target_corr.index != 'Heart Disease Encoded']
print(target_corr)


### 2.8 Data Understanding Summary

**Key Observations:**

1. **Dataset Size**: The dataset contains 270 samples with 14 columns (13 features + 1 target).

2. **Target Variable**: 
   - Binary classification: "Presence" vs "Absence" of heart disease
   - Classes appear to be relatively balanced

3. **Features Include**:
   - Demographic: Age, Sex
   - Clinical measurements: BP (Blood Pressure), Cholesterol, Max HR (Maximum Heart Rate)
   - Medical indicators: Chest pain type, EKG results, Exercise angina, ST depression, etc.

4. **Data Quality**: 
   - No missing values detected
   - Some outliers may be present in numerical features (to be handled in data preparation)

5. **Correlations**: 
   - Features showing strong correlation with the target will be important for prediction
   - Understanding these relationships helps in feature selection and model interpretation
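
The IQR rule used to flag outliers in Section 2.6 can be sketched with the standard library alone. The readings below are invented for illustration, and `iqr_outliers` is a hypothetical helper rather than part of the notebook. It uses `statistics.quantiles` with `method='inclusive'`, which interpolates linearly between order statistics.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Made-up blood-pressure readings (mmHg), not taken from the dataset
bp_readings = [94, 110, 120, 120, 130, 130, 140, 140, 150, 200]
lower, upper, flagged = iqr_outliers(bp_readings)
print(f"Bounds: [{lower}, {upper}], flagged: {flagged}")
```

A flagged value is only a candidate outlier. In clinical data, an extreme reading may be a real and informative measurement rather than an error.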


---

## 3. Data Preparation

### 3.1 Handle Missing Values


In [None]:
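# Verify no missing values (already checked in Section 2.2, confirmed here before preprocessing)
total_missing = df.isnull().sum().sum()
print(f"Total missing values: {total_missing}")
if total_missing == 0:
    print("\nSince there are no missing values, no imputation is needed.")
    print("This is ideal as we don't need to make assumptions about missing data.")
else:
    print("\nMissing values found; impute or drop them before modeling.")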


### 3.2 Encode Categorical Variables


In [None]:
# Create a copy of the dataframe for preprocessing
df_processed = df.copy()

# Encode the target variable (Heart Disease)
label_encoder = LabelEncoder()
df_processed['Heart Disease'] = label_encoder.fit_transform(df_processed['Heart Disease'])

# Check the encoding
print("Target variable encoding:")
print(f"  'Absence' -> {label_encoder.transform(['Absence'])[0]}")
print(f"  'Presence' -> {label_encoder.transform(['Presence'])[0]}")
print("\nNote: 0 = Absence, 1 = Presence")

# Display data types to identify which columns need encoding
print("\n" + "="*60)
print("Data types:")
print(df_processed.dtypes)

# Check unique values in each column to understand categorical vs numerical
print("\n" + "="*60)
print("Unique values per column:")
for col in df_processed.columns:
    if col != 'Heart Disease':  # Skip target
        unique_vals = df_processed[col].unique()
        print(f"{col}: {len(unique_vals)} unique values - {sorted(unique_vals)[:10]}")

# Note: Most columns appear to be already numerical or ordinal
# The dataset seems to have been pre-processed, so we'll proceed with scaling


### 3.3 Separate Features and Target


In [None]:
# Separate features and target
X = df_processed.drop('Heart Disease', axis=1)
y = df_processed['Heart Disease']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"\nFeature columns: {list(X.columns)}")
print(f"\nTarget distribution:")
print(y.value_counts())
print(f"\nTarget distribution (percentage):")
print(y.value_counts(normalize=True) * 100)


### 3.4 Train/Test Split


In [None]:
# Create train/test split with stratification
# 80/20 split with stratification to maintain class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=RANDOM_STATE, 
    stratify=y  # Ensures both sets have similar class distribution
)

print(f"Training set size: {X_train.shape[0]} samples ({X_train.shape[0]/len(df)*100:.1f}%)")
print(f"Test set size: {X_test.shape[0]} samples ({X_test.shape[0]/len(df)*100:.1f}%)")
print(f"\nTraining set target distribution:")
print(y_train.value_counts())
print(f"\nTest set target distribution:")
print(y_test.value_counts())


### 3.5 Feature Scaling

**Justification for Scaling:**
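- Scale-sensitive algorithms such as regularized Logistic Regression are affected by feature ranges; tree-based models such as Random Forest are not, though scaling does them no harm
- Features like Age, BP, Cholesterol, and Max HR have different ranges
- StandardScaler transforms each feature to mean 0 and standard deviation 1 on the training data, which helps gradient-based solvers converge
- We fit the scaler only on training data to avoid data leakage
- The pipelines in Section 4 fit their own scaler internally, so the scaled frames produced here serve only to inspect the transformation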
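
To make the leakage point concrete, here is a minimal standard-library sketch of the same idea: the mean and standard deviation come from the training values only and are then reused unchanged on the test values. The helper names and numbers are illustrative assumptions. The population standard deviation (`pstdev`) is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Learn mean and population std from training data only."""
    mu = fmean(train_values)
    sigma = pstdev(train_values)
    # Fall back to 1.0 for a constant column to avoid division by zero
    return mu, sigma or 1.0


def standardize(values, mu, sigma):
    """Apply previously learned parameters; never refit on test data."""
    return [(v - mu) / sigma for v in values]


# Made-up blood-pressure values for illustration
train_bp = [110, 120, 130, 140, 150]
test_bp = [125, 160]

mu, sigma = fit_standardizer(train_bp)
train_scaled = standardize(train_bp, mu, sigma)
test_scaled = standardize(test_bp, mu, sigma)  # reuses the training mu and sigma

print(f"train mean={fmean(train_scaled):.2f}, train std={pstdev(train_scaled):.2f}")
print(f"test scaled={[round(v, 2) for v in test_scaled]}")
```

The scaled test values need not have mean 0 or standard deviation 1. That is expected, because the test set took no part in estimating the parameters.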


In [None]:
# Initialize scaler
scaler = StandardScaler()

# Fit scaler on training data only (to avoid data leakage)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

print("Feature scaling completed.")
print(f"\nScaled training data statistics (first 5 features):")
print(X_train_scaled.iloc[:, :5].describe())
print("\nNote: Mean should be ~0 and std should be ~1 for scaled features")


---

## 4. Modeling

We will train two classification models:
1. **Logistic Regression** - A baseline linear model that's interpretable and fast
2. **Random Forest Classifier** - A nonlinear ensemble model that can capture complex patterns

Both models will be evaluated using cross-validation to ensure robust performance estimates.
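
For intuition about what stratified cross-validation does, the sketch below assigns sample indices to folds so that each fold keeps roughly the same class mix. It is a simplified round-robin illustration and not scikit-learn's actual algorithm. The function name `stratified_folds` and the toy labels are assumptions made for this example.

```python
from collections import defaultdict


def stratified_folds(labels, n_splits):
    """Deal the indices of each class round-robin into n_splits folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Toy labels: 15 negatives and 10 positives
labels = [0] * 15 + [1] * 10
folds = stratified_folds(labels, n_splits=5)

for i, fold in enumerate(folds):
    positives = sum(labels[j] for j in fold)
    print(f"fold {i}: size={len(fold)}, positives={positives}")
```

Each fold serves once as the validation set while the remaining folds are used for training, so every sample is validated exactly once.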


### 4.1 Model 1: Logistic Regression


In [None]:
# Create Logistic Regression model with pipeline
lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(random_state=RANDOM_STATE, max_iter=1000))
])

# Train the model
lr_pipeline.fit(X_train, y_train)

# Cross-validation evaluation
cv_scores_lr = cross_val_score(lr_pipeline, X_train, y_train, 
                                cv=5, scoring='accuracy')

print("Logistic Regression - Cross-Validation Results:")
print(f"  Mean CV Accuracy: {cv_scores_lr.mean():.4f} (+/- {cv_scores_lr.std() * 2:.4f})")
print(f"  Individual CV scores: {cv_scores_lr}")

# Make predictions on test set
y_pred_lr = lr_pipeline.predict(X_test)
y_pred_proba_lr = lr_pipeline.predict_proba(X_test)[:, 1]

print("\n✓ Logistic Regression model trained successfully!")


### 4.2 Model 2: Random Forest Classifier


In [None]:
# Create Random Forest model with pipeline
rf_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(
        n_estimators=100,
        random_state=RANDOM_STATE,
        max_depth=10,
        min_samples_split=5,
        min_samples_leaf=2
    ))
])

# Train the model
rf_pipeline.fit(X_train, y_train)

# Cross-validation evaluation
cv_scores_rf = cross_val_score(rf_pipeline, X_train, y_train, 
                               cv=5, scoring='accuracy')

print("Random Forest - Cross-Validation Results:")
print(f"  Mean CV Accuracy: {cv_scores_rf.mean():.4f} (+/- {cv_scores_rf.std() * 2:.4f})")
print(f"  Individual CV scores: {cv_scores_rf}")

# Make predictions on test set
y_pred_rf = rf_pipeline.predict(X_test)
y_pred_proba_rf = rf_pipeline.predict_proba(X_test)[:, 1]

print("\n✓ Random Forest model trained successfully!")


### 4.3 Feature Importance (Random Forest)


In [None]:
# Extract feature importance from Random Forest
feature_importance = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': rf_pipeline.named_steps['classifier'].feature_importances_
}).sort_values('Importance', ascending=False)

print("Feature Importance (Random Forest):")
print("="*60)
print(feature_importance.to_string(index=False))

# Visualize feature importance
plt.figure(figsize=(12, 8))
sns.barplot(data=feature_importance, x='Importance', y='Feature', palette='viridis')
plt.title('Feature Importance - Random Forest Model', fontsize=16, fontweight='bold')
plt.xlabel('Importance Score', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.tight_layout()
plt.show()


---

## 5. Evaluation

We will evaluate both models using multiple metrics to get a comprehensive understanding of their performance.


### 5.1 Evaluation Metrics for Logistic Regression


In [None]:
# Calculate metrics for Logistic Regression
lr_accuracy = accuracy_score(y_test, y_pred_lr)
lr_precision = precision_score(y_test, y_pred_lr)
lr_recall = recall_score(y_test, y_pred_lr)
lr_f1 = f1_score(y_test, y_pred_lr)
lr_roc_auc = roc_auc_score(y_test, y_pred_proba_lr)
lr_cm = confusion_matrix(y_test, y_pred_lr)

print("="*60)
print("LOGISTIC REGRESSION - EVALUATION METRICS")
print("="*60)
print(f"Accuracy:  {lr_accuracy:.4f}")
print(f"Precision: {lr_precision:.4f}")
print(f"Recall:    {lr_recall:.4f}")
print(f"F1-Score:  {lr_f1:.4f}")
print(f"ROC-AUC:   {lr_roc_auc:.4f}")
print("\nConfusion Matrix:")
print(lr_cm)
print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr, target_names=['Absence', 'Presence']))


### 5.2 Evaluation Metrics for Random Forest


In [None]:
# Calculate metrics for Random Forest
rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_precision = precision_score(y_test, y_pred_rf)
rf_recall = recall_score(y_test, y_pred_rf)
rf_f1 = f1_score(y_test, y_pred_rf)
rf_roc_auc = roc_auc_score(y_test, y_pred_proba_rf)
rf_cm = confusion_matrix(y_test, y_pred_rf)

print("="*60)
print("RANDOM FOREST - EVALUATION METRICS")
print("="*60)
print(f"Accuracy:  {rf_accuracy:.4f}")
print(f"Precision: {rf_precision:.4f}")
print(f"Recall:    {rf_recall:.4f}")
print(f"F1-Score:  {rf_f1:.4f}")
print(f"ROC-AUC:   {rf_roc_auc:.4f}")
print("\nConfusion Matrix:")
print(rf_cm)
print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf, target_names=['Absence', 'Presence']))


### 5.3 Visualizations: Confusion Matrices


In [None]:
# Create confusion matrix visualizations
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Logistic Regression Confusion Matrix
sns.heatmap(lr_cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['Absence', 'Presence'],
            yticklabels=['Absence', 'Presence'])
axes[0].set_title('Logistic Regression\nConfusion Matrix', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Actual', fontsize=12)
axes[0].set_xlabel('Predicted', fontsize=12)

# Random Forest Confusion Matrix
sns.heatmap(rf_cm, annot=True, fmt='d', cmap='Greens', ax=axes[1],
            xticklabels=['Absence', 'Presence'],
            yticklabels=['Absence', 'Presence'])
axes[1].set_title('Random Forest\nConfusion Matrix', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Actual', fontsize=12)
axes[1].set_xlabel('Predicted', fontsize=12)

plt.tight_layout()
plt.show()


### 5.4 ROC Curves Comparison


In [None]:
# Calculate ROC curves
fpr_lr, tpr_lr, _ = roc_curve(y_test, y_pred_proba_lr)
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_pred_proba_rf)

# Plot ROC curves
plt.figure(figsize=(10, 8))
plt.plot(fpr_lr, tpr_lr, label=f'Logistic Regression (AUC = {lr_roc_auc:.4f})', 
         linewidth=2, color='blue')
plt.plot(fpr_rf, tpr_rf, label=f'Random Forest (AUC = {rf_roc_auc:.4f})', 
         linewidth=2, color='green')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier (AUC = 0.50)', linewidth=1)
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curves Comparison', fontsize=16, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()


### 5.5 Model Comparison Summary


In [None]:
# Create comparison dataframe
comparison_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC'],
    'Logistic Regression': [lr_accuracy, lr_precision, lr_recall, lr_f1, lr_roc_auc],
    'Random Forest': [rf_accuracy, rf_precision, rf_recall, rf_f1, rf_roc_auc]
})

comparison_df['Difference'] = comparison_df['Random Forest'] - comparison_df['Logistic Regression']
comparison_df['Best Model'] = comparison_df.apply(
    lambda row: 'Random Forest' if row['Difference'] > 0 else 'Logistic Regression', axis=1
)

print("="*70)
print("MODEL COMPARISON SUMMARY")
print("="*70)
print(comparison_df.to_string(index=False))
print("\n" + "="*70)

# Determine best model by ROC-AUC on the test set
best_model_name = 'Random Forest' if rf_roc_auc > lr_roc_auc else 'Logistic Regression'
best_model = rf_pipeline if best_model_name == 'Random Forest' else lr_pipeline

print(f"\n🏆 BEST MODEL: {best_model_name}")
print("   Selected based on ROC-AUC score (threshold-independent ranking quality)")
print(f"\nKey Metrics for Best Model:")
if best_model_name == 'Random Forest':
    print(f"   Accuracy:  {rf_accuracy:.4f}")
    print(f"   Precision: {rf_precision:.4f}")
    print(f"   Recall:    {rf_recall:.4f}")
    print(f"   F1-Score:  {rf_f1:.4f}")
    print(f"   ROC-AUC:   {rf_roc_auc:.4f}")
else:
    print(f"   Accuracy:  {lr_accuracy:.4f}")
    print(f"   Precision: {lr_precision:.4f}")
    print(f"   Recall:    {lr_recall:.4f}")
    print(f"   F1-Score:  {lr_f1:.4f}")
    print(f"   ROC-AUC:   {lr_roc_auc:.4f}")


### 5.6 Metric Interpretation

**Understanding the Metrics:**

1. **Accuracy**: The proportion of correct predictions out of all predictions.
   - Tells us: Overall, how often is the model correct?
   - In our case: Both models achieve high accuracy, indicating good overall performance.

2. **Precision**: The proportion of positive predictions that are actually correct.
   - Tells us: When the model predicts "Presence", how often is it right?
   - Important for: Avoiding false alarms (incorrectly diagnosing healthy patients).

3. **Recall (Sensitivity)**: The proportion of actual positives that were correctly identified.
   - Tells us: Of all patients with heart disease, how many did we catch?
   - **Critical for medical diagnosis**: Missing a true case (false negative) could be fatal.

4. **F1-Score**: Harmonic mean of precision and recall.
   - Tells us: Balanced measure when both precision and recall are important.
   - Useful when: We need a single metric that considers both false positives and false negatives.

5. **ROC-AUC**: Area under the Receiver Operating Characteristic curve.
   - Tells us: How well the model distinguishes between classes across all thresholds.
   - Range: 0 to 1, where 0.5 corresponds to random guessing and 1.0 to perfect separation. Higher is better.
   - **Threshold-independent**: it summarizes ranking quality without committing to a single decision cutoff.
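
One way to read ROC-AUC is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half. The sketch below computes it that way by brute force over all pairs. `roc_auc_pairwise` is a hypothetical helper for illustration, and the scores are invented.

```python
def roc_auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("Both classes must be present to compute AUC")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Invented scores: one positive (0.35) is ranked below one negative (0.4)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(f"AUC = {roc_auc_pairwise(y_true, scores):.2f}")
```

Three of the four positive/negative pairs are ordered correctly here, so the function returns 0.75.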

**Model Selection Rationale:**
- We selected the model with the highest ROC-AUC score because it compares the models across all decision thresholds rather than only at the default cutoff of 0.5.
- In medical contexts, we also prioritize high recall to minimize false negatives (missing actual heart disease cases).
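
Prioritizing recall usually means lowering the decision threshold applied to the predicted probabilities. The sketch below shows the trade-off on invented probabilities. `recall_precision_at` is a hypothetical helper, and the thresholds 0.5 and 0.3 are arbitrary choices for illustration, not tuned values.

```python
def recall_precision_at(y_true, proba, threshold):
    """Recall and precision when predicting positive for proba >= threshold."""
    preds = [1 if p >= threshold else 0 for p in proba]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Invented labels and predicted probabilities of "Presence"
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
proba = [0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.32, 0.1]

for threshold in (0.5, 0.3):
    recall, precision = recall_precision_at(y_true, proba, threshold)
    print(f"threshold={threshold}: recall={recall:.2f}, precision={precision:.2f}")
```

Lowering the threshold catches more true cases at the cost of more false alarms. How far to lower it depends on the relative cost of a missed diagnosis versus an unnecessary follow-up.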


---

## 6. Deployment Scenario

Let's create a realistic new patient scenario and use our best model to make a prediction.


In [None]:
# Create a realistic new patient scenario
# Example: A 58-year-old male patient with specific clinical measurements
new_patient = pd.DataFrame({
    'Age': [58],
    'Sex': [1],  # 1 = Male
    'Chest pain type': [4],  # 4 = asymptomatic (1 = typical angina)
    'BP': [140],  # Blood pressure
    'Cholesterol': [250],
    'FBS over 120': [0],  # Fasting blood sugar <= 120
    'EKG results': [2],
    'Max HR': [150],  # Maximum heart rate achieved
    'Exercise angina': [1],  # Yes
    'ST depression': [2.0],  # ST depression induced by exercise
    'Slope of ST': [2],
    'Number of vessels fluro': [2],  # Number of major vessels colored by fluoroscopy
    'Thallium': [7]
})

print("="*70)
print("NEW PATIENT SCENARIO")
print("="*70)
print("\nPatient Profile:")
print(f"  Age: {new_patient['Age'].values[0]} years")
print(f"  Sex: {'Male' if new_patient['Sex'].values[0] == 1 else 'Female'}")
print(f"  Blood Pressure: {new_patient['BP'].values[0]} mmHg")
print(f"  Cholesterol: {new_patient['Cholesterol'].values[0]} mg/dL")
print(f"  Max Heart Rate: {new_patient['Max HR'].values[0]} bpm")
print(f"  Exercise Angina: {'Yes' if new_patient['Exercise angina'].values[0] == 1 else 'No'}")
print(f"  ST Depression: {new_patient['ST depression'].values[0]} mm")
print(f"  Chest Pain Type: {new_patient['Chest pain type'].values[0]}")

# Make prediction using the best model
prediction_proba = best_model.predict_proba(new_patient)[0]
prediction = best_model.predict(new_patient)[0]

print("\n" + "="*70)
print("MODEL PREDICTION")
print("="*70)
print(f"\nPredicted Class: {prediction}")
print(f"  {'Presence' if prediction == 1 else 'Absence'} of Heart Disease")
print(f"\nPrediction Probabilities:")
print(f"  Probability of Absence: {prediction_proba[0]:.4f} ({prediction_proba[0]*100:.2f}%)")
print(f"  Probability of Presence: {prediction_proba[1]:.4f} ({prediction_proba[1]*100:.2f}%)")

# Interpretation
print("\n" + "="*70)
print("INTERPRETATION")
print("="*70)
if prediction == 1:
    print("\n⚠️  The model predicts PRESENCE of heart disease.")
    print("   This means the patient is classified as having heart disease.")
    print("   Clinical recommendation: Further diagnostic tests and medical consultation recommended.")
    print(f"   Predicted probability: {prediction_proba[1]*100:.1f}%")
else:
    print("\n✓ The model predicts ABSENCE of heart disease.")
    print("   This means the patient is classified as not having heart disease.")
    print("   Clinical recommendation: Continue routine monitoring and preventive care.")
    print(f"   Predicted probability: {prediction_proba[0]*100:.1f}%")

print("\n⚠️  IMPORTANT DISCLAIMER:")
print("   This model is for educational purposes only.")
print("   Medical decisions should always be made by qualified healthcare professionals.")
print("   This prediction should not replace professional medical diagnosis.")


### 6.1 Save the Best Model


In [None]:
# Create models directory if it doesn't exist
os.makedirs('models', exist_ok=True)

# Save the best model
model_filename = 'models/best_heart_disease_model.pkl'
joblib.dump(best_model, model_filename)
print(f"✓ Best model saved to: {model_filename}")

# Also save the label encoder for future use
label_encoder_filename = 'models/label_encoder.pkl'
joblib.dump(label_encoder, label_encoder_filename)
print(f"✓ Label encoder saved to: {label_encoder_filename}")

# The scaler is a step inside the saved pipeline, so it needs no separate file
print("\nNote: The scaler is included in the pipeline, so no separate scaler file is needed.")


---

## 7. Conclusion

### Summary of Findings

1. **Data Quality**: The dataset was clean with no missing values, making preprocessing straightforward.

2. **Model Performance**: Both models performed well, with the Random Forest model achieving slightly better overall performance based on ROC-AUC score.

3. **Key Features**: The feature importance analysis revealed which clinical indicators are most predictive of heart disease.

4. **Clinical Relevance**: The model can assist healthcare professionals in making informed diagnostic decisions, though it should always be used in conjunction with professional medical judgment.

### Next Steps

- **Model Improvement**: 
  - Hyperparameter tuning using GridSearchCV or RandomizedSearchCV
  - Feature engineering to create new meaningful features
  - Ensemble methods combining multiple models
  
- **Deployment Considerations**:
  - Create a web application or API for real-time predictions
  - Implement model monitoring to track performance over time
  - Regular retraining with new data to maintain accuracy

- **Further Analysis**:
  - Investigate specific feature interactions
  - Analyze misclassified cases to understand model limitations
  - Explore different algorithms (XGBoost, SVM, Neural Networks)

---

**Project completed following CRISP-DM methodology.**
