# The Heart Whisperer: A Machine Learning Adventure

**Once upon a time, in a hospital far, far away...**

Doctors were drowning in patient data. ECGs piling up. Cholesterol numbers flying around like confetti. They needed a hero.

Enter: **You. The Data Scientist.**

Your mission? Build a model that can look at a patient's vitals and whisper: *"This heart... needs attention."*

Let's begin this adventure.

## Chapter 1: Gathering the Troops (Imports)

Every hero needs their tools. Batman has gadgets. We have... libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    confusion_matrix, 
    classification_report, 
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    roc_curve
)
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
print("Tools loaded. Let's save some hearts.")

## Chapter 2: Meeting Our Patients (Loading Data)

918 patients walked into our clinic. Each one carrying secrets in their blood pressure, cholesterol, and heartbeats.

Let's meet them.

In [None]:
df = pd.read_csv('heart.csv')

print(f"Patients in waiting room: {len(df)}")
print(f"Vital signs recorded: {df.shape[1]}")
print("\nFirst 5 patients walk in...")
df.head()

In [None]:
print("What information do we have on each patient?")
print("="*50)
for col in df.columns:
    print(f"  {col}: {df[col].dtype}")
print("="*50)
print(f"\nMissing values? {df.isnull().sum().sum()} (phew!)")

## Chapter 3: The Plot Thickens (EDA)

Before we build our heart-whispering model, we need to understand our patients.

**The big question:** How many hearts are in danger?

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# The verdict
colors = ['#2ecc71', '#e74c3c']
labels = ['Healthy Heart', 'Heart Disease']
counts = df['HeartDisease'].value_counts().sort_index()

# Pie chart - the dramatic reveal
axes[0].pie(counts, labels=labels, colors=colors, autopct='%1.1f%%', 
            explode=(0, 0.05), shadow=True, startangle=90)
axes[0].set_title('The Diagnosis Distribution\n(Our Challenge)', fontsize=14, fontweight='bold')

# Bar chart - the numbers
bars = axes[1].bar(labels, counts, color=colors, edgecolor='black', linewidth=2)
axes[1].set_ylabel('Number of Patients', fontsize=12)
axes[1].set_title('How Many Need Our Help?', fontsize=14, fontweight='bold')
for bar, count in zip(bars, counts):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 10, 
                 str(count), ha='center', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

print(f"\n{counts[1]} patients need us. Let's not let them down.")

### Age: Does Getting Older Mean More Risk?

Spoiler alert: Your heart has opinions about your birthday candles.

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

# Age distribution by heart disease
for i, label in enumerate(labels):
    subset = df[df['HeartDisease'] == i]['Age']
    ax.hist(subset, bins=20, alpha=0.7, label=label, color=colors[i], edgecolor='black')

ax.set_xlabel('Age', fontsize=12)
ax.set_ylabel('Number of Patients', fontsize=12)
ax.set_title('Age Distribution: When Do Hearts Start Complaining?', fontsize=14, fontweight='bold')
ax.axvline(df[df['HeartDisease']==1]['Age'].mean(), color='#e74c3c', linestyle='--', 
           label=f'Mean Age (Disease): {df[df["HeartDisease"]==1]["Age"].mean():.1f}')
ax.axvline(df[df['HeartDisease']==0]['Age'].mean(), color='#2ecc71', linestyle='--',
           label=f'Mean Age (Healthy): {df[df["HeartDisease"]==0]["Age"].mean():.1f}')
# Build the legend after the mean-age lines so their labels are included
ax.legend()

plt.tight_layout()
plt.show()

print(f"\nAverage age of healthy hearts: {df[df['HeartDisease']==0]['Age'].mean():.1f} years")
print(f"Average age of troubled hearts: {df[df['HeartDisease']==1]['Age'].mean():.1f} years")
print("\nLesson: Age is just a number... that your heart takes VERY seriously.")

### The Chest Pain Mystery

Not all chest pain is created equal. Let's decode the types:
- **ASY**: Asymptomatic (the silent danger)
- **ATA**: Atypical Angina  
- **NAP**: Non-Anginal Pain
- **TA**: Typical Angina

In [None]:
fig, ax = plt.subplots(figsize=(10, 6))

chest_pain_crosstab = pd.crosstab(df['ChestPainType'], df['HeartDisease'])
chest_pain_crosstab.columns = labels
chest_pain_crosstab.plot(kind='bar', ax=ax, color=colors, edgecolor='black', width=0.8)

ax.set_xlabel('Chest Pain Type', fontsize=12)
ax.set_ylabel('Count', fontsize=12)
ax.set_title('The Chest Pain Plot Twist\n(ASY = The Silent Killer)', fontsize=14, fontweight='bold')
ax.legend(title='Diagnosis')
plt.xticks(rotation=0)

plt.tight_layout()
plt.show()

print("\nPlot twist: ASY (Asymptomatic) patients have the HIGHEST heart disease rate!")
print("Sometimes the quietest chest is hiding the loudest problem.")
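
The bar chart above shows raw counts, while the claim is about rates. A minimal sketch of how a per-group disease rate could be computed is shown here. The `toy` frame is hypothetical stand-in data (not `heart.csv`), and `rates` and `disease_rate` are names invented for this illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for heart.csv; the values are made up.
toy = pd.DataFrame({
    "ChestPainType": ["ASY", "ASY", "ASY", "ASY", "ATA", "ATA", "NAP", "TA"],
    "HeartDisease":  [1,     1,     1,     0,     0,     1,     0,     0],
})

# normalize="index" turns each row of counts into proportions that sum to 1
rates = pd.crosstab(toy["ChestPainType"], toy["HeartDisease"], normalize="index")

# Column 1 is the share of patients with heart disease in each chest pain group
disease_rate = rates[1].sort_values(ascending=False)
print(disease_rate)
```

Swapping `toy` for `df` should give the rates behind the plot twist, assuming the column names match the ones used in this notebook.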

### Quick Stats Overview

In [None]:
print("Numerical Features - Quick Health Check:")
print("="*60)
df.describe().round(2)

## Chapter 4: Preparing for Battle (Data Preprocessing)

Our model can't read text. It only speaks numbers. Time for translation.

In [None]:
# Make a copy - never mess with the original evidence
df_model = df.copy()

# Identify categorical columns
cat_cols = df_model.select_dtypes(include=['object']).columns.tolist()
print(f"Categorical columns to encode: {cat_cols}")

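# Label encoding for categorical variables.
# Each column gets its own encoder, kept in a dict so the codes can be mapped back later.
encoders = {}
for col in cat_cols:
    encoders[col] = LabelEncoder()
    df_model[col] = encoders[col].fit_transform(df_model[col])
    print(f"  {col} encoded: {list(encoders[col].classes_)}")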

print("\nTranslation complete. Model can now understand.")
df_model.head()
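
To make the translation concrete, here is a small sketch of what label encoding does to the chest pain codes. The `pain` series is made up for illustration. The one-hot variant at the end uses `pd.get_dummies`, which is an alternative this notebook does not use; it is shown only because integer codes imply an ordering that nominal categories do not have.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up sample using the four chest pain codes listed earlier
pain = pd.Series(["TA", "ASY", "NAP", "ATA", "ASY"])

# LabelEncoder sorts the unique values and numbers them 0..n-1
encoder = LabelEncoder()
codes = encoder.fit_transform(pain)
print(encoder.classes_.tolist())  # category order used for the integer codes
print(codes.tolist())

# Alternative (not used in this notebook): one 0/1 column per category
one_hot = pd.get_dummies(pain, prefix="ChestPainType")
print(one_hot.columns.tolist())
```

Tree-based models usually cope with integer codes, while a linear model such as Logistic Regression may read them as an ordered scale.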

In [None]:
# Separate features and target
X = df_model.drop('HeartDisease', axis=1)
y = df_model['HeartDisease']

print(f"Features (X): {X.shape}")
print(f"Target (y): {y.shape}")
print(f"\nFeatures we're using to predict: {list(X.columns)}")

In [None]:
# The sacred split: training warriors vs testing judges
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training army size: {len(X_train)} patients")
print(f"Testing jury size: {len(X_test)} patients")
print(f"\nTraining set disease ratio: {y_train.mean():.2%}")
print(f"Testing set disease ratio: {y_test.mean():.2%}")
print("\nPerfectly balanced, as all things should be. - Thanos")
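
What `stratify=y` buys us can be seen on a toy example. The labels below are hypothetical (60 positives, 40 negatives), and `X_toy` and `y_toy` are names invented for this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 positives and 40 negatives (not the real dataset)
y_toy = np.array([1] * 60 + [0] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both splits keep the positive rate of the full label vector
print(f"train: {y_tr.mean():.2%}, test: {y_te.mean():.2%}")
```

Note that "balanced" here means the two splits share the same disease ratio, not that the two classes are equally sized.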

In [None]:
# Scale the features - put everyone on equal footing
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Features scaled. No more favoritism toward big numbers.")
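
For the curious, here is a sketch of what the scaler computes, using made-up numbers for two features on very different scales. `scaler_demo`, `train`, and `test` are illustration-only names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values for two features on very different scales
train = np.array([[40.0, 200.0], [50.0, 240.0], [60.0, 280.0]])
test = np.array([[50.0, 320.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler_demo.transform(test)        # reuses those train statistics

# Same result by hand: z = (x - mean) / std, using the population std (ddof=0)
manual = (train - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(train_scaled, manual))
print(test_scaled)
```

Fitting on the training set only keeps information about the test patients from leaking into preprocessing.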

## Chapter 5: The Tournament of Models

Three champions enter the arena. Only one will become THE HEART WHISPERER.

**The Contenders:**
1. **Logistic Regression** - The wise elder, simple but reliable
2. **Random Forest** - The army of decision trees
3. **Gradient Boosting** - The perfectionist who learns from mistakes

In [None]:
# Initialize our contenders
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}

# Store results
results = {}

print("LET THE TOURNAMENT BEGIN!")
print("="*60)

In [None]:
for name, model in models.items():
    print(f"\n{'='*60}")
    print(f"{name} enters the arena...")
    print(f"{'='*60}")
    
    # Train: tree ensembles get the raw features, Logistic Regression gets the scaled ones
    if name in ('Random Forest', 'Gradient Boosting'):
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        y_pred_proba = model.predict_proba(X_test)[:, 1]
    else:
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
        y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, y_pred_proba)
    
    results[name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'auc': auc,
        'y_pred': y_pred,
        'y_pred_proba': y_pred_proba,
        'model': model
    }
    
    print(f"\nResults for {name}:")
    print(f"  Accuracy:  {accuracy:.4f} ({accuracy*100:.2f}%)")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall:    {recall:.4f}")
    print(f"  F1 Score:  {f1:.4f}")
    print(f"  AUC-ROC:   {auc:.4f}")
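
The loop above branches on the model name to decide between raw and scaled features. One alternative, sketched here on synthetic data from `make_classification` (nothing below comes from `heart.csv`), is to bundle the scaler with the model in a scikit-learn pipeline so that every contender can be fit on the same `X_train`. This is an assumption about how the notebook could be restructured, not what it does.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; nothing here comes from heart.csv
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training data, then the classifier
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# predict / predict_proba apply the fitted scaler automatically
proba = pipe.predict_proba(X_te)[:, 1]
print(f"Test accuracy: {pipe.score(X_te, y_te):.3f}")
```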

### Tournament Summary

In [None]:
# Create summary dataframe
summary_df = pd.DataFrame({
    'Model': list(results.keys()),
    'Accuracy': [r['accuracy'] for r in results.values()],
    'Precision': [r['precision'] for r in results.values()],
    'Recall': [r['recall'] for r in results.values()],
    'F1 Score': [r['f1'] for r in results.values()],
    'AUC-ROC': [r['auc'] for r in results.values()]
}).round(4)

print("\nTOURNAMENT SCOREBOARD")
print("="*70)
summary_df

In [None]:
# Visual comparison
fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(len(summary_df))
width = 0.15
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'AUC-ROC']
colors_metrics = ['#3498db', '#e74c3c', '#2ecc71', '#f39c12', '#9b59b6']

for i, metric in enumerate(metrics):
    bars = ax.bar(x + i*width, summary_df[metric], width, label=metric, color=colors_metrics[i])

ax.set_ylabel('Score', fontsize=12)
ax.set_title('The Tournament Results\n(Who Will Be Crowned?)', fontsize=14, fontweight='bold')
ax.set_xticks(x + width * 2)
ax.set_xticklabels(summary_df['Model'])
ax.set_ylim(0.7, 1.0)
ax.axhline(y=0.9, color='gray', linestyle='--', alpha=0.5, label='90% threshold')
# Build the legend last so the 90% threshold line is listed too
ax.legend(loc='lower right')

plt.tight_layout()
plt.show()

# Declare winner
winner = summary_df.loc[summary_df['F1 Score'].idxmax(), 'Model']
print(f"\nAND THE WINNER IS... {winner.upper()}!")

## Chapter 6: The Perfect Confusion Matrix

This is it. The moment of truth. The confusion matrix tells us:
- **True Positives (TP):** We said disease, it WAS disease (saved a life)
- **True Negatives (TN):** We said healthy, it WAS healthy (no false alarm)
- **False Positives (FP):** We said disease, but they were fine (unnecessary panic)
- **False Negatives (FN):** We said healthy, but they had disease (THE WORST MISTAKE)
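
The four cells above can be pulled straight out of scikit-learn's matrix. Here is a small sketch on ten hypothetical patients; the `_demo` names and label values are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = heart disease, 0 = healthy
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
y_pred_demo = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]

# For binary 0/1 labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# Precision and recall follow directly from the cells
precision_demo = tp / (tp + fp)  # of the predicted disease cases, how many were real
recall_demo = tp / (tp + fn)     # of the real disease cases, how many we caught
print(precision_demo, recall_demo)
```

Recall is the metric that punishes false negatives, which is why it deserves extra attention in a screening setting.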

In [None]:
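def plot_perfect_confusion_matrix(y_true, y_pred, model_name, ax=None):
    """
    Draw a 2x2 confusion matrix on `ax` (a new figure is created if `ax` is None).

    Correct cells (TN, TP) are green and errors (FP, FN) are red. Each cell shows
    its count and its share of all samples. Returns the confusion matrix array.
    """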
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 6))
    
    cm = confusion_matrix(y_true, y_pred)
    
    # Calculate percentages
    cm_percent = cm.astype('float') / cm.sum() * 100
    
    # Create custom annotations
    labels = np.array([
        [f'TN\n{cm[0,0]}\n({cm_percent[0,0]:.1f}%)', f'FP\n{cm[0,1]}\n({cm_percent[0,1]:.1f}%)'],
        [f'FN\n{cm[1,0]}\n({cm_percent[1,0]:.1f}%)', f'TP\n{cm[1,1]}\n({cm_percent[1,1]:.1f}%)']
    ])
    
    # Custom colormap: green for correct, red for errors
    colors_cm = np.array([
        ['#2ecc71', '#e74c3c'],  # TN (green), FP (red)
        ['#e74c3c', '#2ecc71']   # FN (red), TP (green)
    ])
    
    # Plot
    for i in range(2):
        for j in range(2):
            ax.add_patch(plt.Rectangle((j, 1-i), 1, 1, fill=True, 
                                        color=colors_cm[i, j], alpha=0.7))
            ax.text(j + 0.5, 1.5 - i, labels[i, j], 
                    ha='center', va='center', fontsize=14, fontweight='bold',
                    color='white' if cm[i,j] > cm.max()/3 else 'black')
    
    ax.set_xlim(0, 2)
    ax.set_ylim(0, 2)
    ax.set_xticks([0.5, 1.5])
    ax.set_yticks([0.5, 1.5])
    ax.set_xticklabels(['Healthy', 'Disease'], fontsize=12)
    ax.set_yticklabels(['Disease', 'Healthy'], fontsize=12)
    ax.set_xlabel('Predicted', fontsize=14, fontweight='bold')
    ax.set_ylabel('Actual', fontsize=14, fontweight='bold')
    ax.set_title(f'Confusion Matrix: {model_name}', fontsize=16, fontweight='bold', pad=20)
    
    # Add border
    for spine in ax.spines.values():
        spine.set_linewidth(2)
    
    return cm

In [None]:
# Plot confusion matrices for all models
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

for ax, (name, res) in zip(axes, results.items()):
    plot_perfect_confusion_matrix(y_test, res['y_pred'], name, ax)

plt.tight_layout()
plt.savefig('confusion_matrices_all.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nGreen = Correct predictions (what we want)")
print("Red = Errors (what we want to minimize)")

### The ULTIMATE Confusion Matrix (Best Model)

In [None]:
# Get best model based on F1 score
best_model_name = summary_df.loc[summary_df['F1 Score'].idxmax(), 'Model']
best_results = results[best_model_name]

fig, ax = plt.subplots(figsize=(10, 8))

cm = plot_perfect_confusion_matrix(y_test, best_results['y_pred'], 
                                    f'{best_model_name} (THE CHAMPION)', ax)

# Add summary stats below
tn, fp, fn, tp = cm.ravel()
summary_text = f"""
True Negatives: {tn} healthy patients correctly identified
True Positives: {tp} disease patients correctly identified 
False Positives: {fp} false alarms (said disease, was healthy)
False Negatives: {fn} missed diagnoses (said healthy, had disease)

Total Correct: {tn + tp} / {len(y_test)} = {(tn + tp) / len(y_test) * 100:.1f}%
"""

plt.figtext(0.5, -0.05, summary_text, ha='center', fontsize=11, 
            fontfamily='monospace', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.savefig('best_confusion_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

### Seaborn Style Confusion Matrix (Alternative View)

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, (name, res) in zip(axes, results.items()):
    cm = confusion_matrix(y_test, res['y_pred'])
    
    # Seaborn heatmap
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=['Healthy', 'Disease'],
                yticklabels=['Healthy', 'Disease'],
                ax=ax, annot_kws={'size': 20}, cbar=False)
    
    ax.set_xlabel('Predicted', fontsize=12)
    ax.set_ylabel('Actual', fontsize=12)
    ax.set_title(f'{name}\nAccuracy: {res["accuracy"]*100:.1f}%', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('confusion_matrices_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

## Chapter 7: ROC Curves - The Performance Art

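The ROC (Receiver Operating Characteristic) curve shows how well a model separates the two classes as the decision threshold sweeps from strict to lenient.

The closer the curve hugs the top-left corner, the better. AUC = 1.0 means every diseased patient is ranked above every healthy one. AUC = 0.5 means random guessing.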
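
A tiny worked example may help before the real plot. The four labels and scores below are hypothetical, and the `_demo` names are invented for this sketch.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for four patients
y_true_demo = [0, 0, 1, 1]
scores_demo = [0.1, 0.4, 0.35, 0.8]

# Each threshold yields one (false positive rate, true positive rate) point
fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_true_demo, scores_demo)

# AUC equals the chance that a random diseased patient outscores a random healthy one
auc_demo = roc_auc_score(y_true_demo, scores_demo)
print(auc_demo)  # 0.75: three of the four diseased/healthy pairs are ranked correctly
```

The one misranked pair is the diseased patient scored 0.35 against the healthy patient scored 0.4.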

In [None]:
fig, ax = plt.subplots(figsize=(10, 8))

colors_roc = ['#3498db', '#e74c3c', '#2ecc71']

for (name, res), color in zip(results.items(), colors_roc):
    fpr, tpr, _ = roc_curve(y_test, res['y_pred_proba'])
    ax.plot(fpr, tpr, color=color, linewidth=2, 
            label=f'{name} (AUC = {res["auc"]:.3f})')

# Diagonal line (random guessing)
ax.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Guess (AUC = 0.500)')

ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curves: The Battle of Discrimination\n(Higher = Better)', fontsize=14, fontweight='bold')
ax.legend(loc='lower right', fontsize=10)
ax.set_xlim([0, 1])
ax.set_ylim([0, 1.05])

# Add grid
ax.grid(True, alpha=0.3)

# Shade the area under the best curve
best_fpr, best_tpr, _ = roc_curve(y_test, best_results['y_pred_proba'])
ax.fill_between(best_fpr, best_tpr, alpha=0.1, color='green')

plt.tight_layout()
plt.savefig('roc_curves.png', dpi=300, bbox_inches='tight')
plt.show()

## Chapter 8: Feature Importance - What Matters Most?

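What signs should doctors look for? Let's ask the Random Forest, whether or not it won the tournament, because its feature importances are easy to read.

In [None]:
# Use the Random Forest's impurity-based feature importances (it may not be the tournament winner)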
rf_model = results['Random Forest']['model']
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=True)

fig, ax = plt.subplots(figsize=(10, 8))

colors_importance = plt.cm.RdYlGn(np.linspace(0.2, 0.8, len(feature_importance)))
bars = ax.barh(feature_importance['feature'], feature_importance['importance'], 
               color=colors_importance, edgecolor='black')

ax.set_xlabel('Importance Score', fontsize=12)
ax.set_title('What Makes a Heart Troubled?\n(Feature Importance from Random Forest)', 
             fontsize=14, fontweight='bold')

# Add value labels
for bar, val in zip(bars, feature_importance['importance']):
    ax.text(val + 0.005, bar.get_y() + bar.get_height()/2, 
            f'{val:.3f}', va='center', fontsize=10)

plt.tight_layout()
plt.savefig('feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nTOP 3 WARNING SIGNS:")
print("="*40)
for i, row in feature_importance.tail(3).iloc[::-1].iterrows():
    print(f"  {row['feature']}: {row['importance']:.3f}")
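
A quick sanity check on how to read these scores, sketched on synthetic data where we know which columns carry signal. The data comes from `make_classification` and has nothing to do with `heart.csv`; `forest` and `importances` are illustration-only names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first 2 columns are informative,
# and the remaining 3 are pure noise
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=2, n_redundant=0,
    class_sep=2.0, shuffle=False, random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_demo, y_demo)
importances = forest.feature_importances_

# Impurity-based importances are normalized, so they sum to 1
print(importances.round(3), importances.sum())
```

Because the scores sum to 1, each one is a relative share of the forest's split quality, so a value is only meaningful next to the others.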

## Chapter 9: Classification Report - The Full Medical Report

In [None]:
print(f"\n{'='*60}")
print(f"FINAL MEDICAL REPORT: {best_model_name.upper()}")
print(f"{'='*60}\n")

print(classification_report(y_test, best_results['y_pred'], 
                           target_names=['Healthy', 'Heart Disease']))

print("\nInterpretation:")
print("-" * 40)
print(f"Precision (Disease): When we say 'disease', we're right {results[best_model_name]['precision']*100:.1f}% of the time")
print(f"Recall (Disease): We catch {results[best_model_name]['recall']*100:.1f}% of actual disease cases")
print(f"F1 Score: Balanced performance of {results[best_model_name]['f1']*100:.1f}%")

## Epilogue: The Heart Whisperer Lives

Our journey is complete. We built a model that looks at a patient's vitals and predicts heart disease, and we judged it on held-out test patients using accuracy, precision, recall, F1, and AUC-ROC.

**What we learned:**
1. Age matters, but so do many other factors
2. Patients recorded as asymptomatic (ASY, no chest pain symptoms) had the highest heart disease rate in this dataset
3. Machine learning can be a powerful tool in healthcare
4. But always remember: this model assists doctors, it doesn't replace them

**The End.**

*...or is it just the beginning?*

In [None]:
print("""
      _____
     /     \\
    /       \\
   |  HEART  |
   |  SAVED  |
    \\       /
     \\_____/
        |
        |
    ____|____
   |         |
   |  MODEL  |
   |  READY  |
   |_________|

Thank you for joining this adventure!
""")