# Exploratory Data Analysis (EDA) - Heart Disease Dataset

**Team Member:** Lam Nguyen (Lead)  
**Course:** CMPE 257 - Machine Learning  
**Project:** Heart Disease Risk Assessment - Multi-Class Prediction

---

## 📚 What is EDA and Why Do We Need It?

**Exploratory Data Analysis (EDA)** is the critical first step in any machine learning project. Think of it as "getting to know your data" before building models.

### Purpose of EDA:
1. **Understand the data structure** - What features do we have? What do they mean?
2. **Identify data quality issues** - Missing values? Outliers? Errors?
3. **Discover patterns and relationships** - Which features are correlated? Which might be important?
4. **Detect problems early** - Class imbalance? Data distribution issues?
5. **Inform preprocessing decisions** - What transformations do we need?

### Why EDA Matters:
- **Garbage in, garbage out** - If we don't understand our data, we'll build bad models
- **Prevents wasted time** - Finding issues early saves hours of debugging later
- **Guides feature engineering** - Understanding relationships helps create better features
- **Sets realistic expectations** - Seeing how much the classes overlap gives a rough sense of what performance is achievable

---

## 📋 Dataset Overview

**Source:** UCI Heart Disease Dataset (920 patient records from 4 medical centers)

**Target Variable:** Heart disease severity (0-4)
- **0** = No significant disease (< 50% artery blockage)
- **1-4** = Progressively worse disease severity

**Clinical Features (13, plus the target):** Age, sex, chest pain type, blood pressure, cholesterol, ECG results, etc.

---

## 1️⃣ Setup: Import Libraries

We'll import all the tools we need for analysis and visualization.

In [1]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical analysis
from scipy import stats

# Display settings for better visualization
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.max_rows', 100)      # Show up to 100 rows
pd.set_option('display.precision', 3)        # 3 decimal places

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

✅ Libraries imported successfully!
Pandas version: 2.2.2
NumPy version: 1.26.4


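## 2️⃣ Load the Dataset

**Important:** Make sure you've downloaded the dataset from Kaggle and placed it in `../data/raw/`.

The cell below expects the file to be named `heart_disease_uci.csv`. If your download has a different name (for example `heart.csv` or `heart_disease.csv`), rename the file or change the path in the cell.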
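
Because the filename varies between downloads, it can help to look for the file before calling `pd.read_csv`. The following is a minimal sketch, not part of the original notebook: the helper `find_dataset` and the list `CANDIDATE_NAMES` are hypothetical names introduced here for illustration.

```python
from pathlib import Path

# Hypothetical candidate filenames; adjust them to match your download.
CANDIDATE_NAMES = ["heart_disease_uci.csv", "heart.csv", "heart_disease.csv"]


def find_dataset(data_dir, candidates=CANDIDATE_NAMES):
    """Return the first candidate file that exists in data_dir, or None."""
    data_dir = Path(data_dir)
    for name in candidates:
        path = data_dir / name
        if path.is_file():
            return path
    return None


dataset_path = find_dataset("../data/raw")
if dataset_path is None:
    print("No dataset found in ../data/raw; check the filename and location.")
else:
    print(f"Found dataset at {dataset_path}")
```

If a path is found, it can be passed directly to `pd.read_csv(dataset_path)`.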

In [None]:
# Load the dataset
# MODIFY THIS PATH based on your actual filename
df = pd.read_csv('../data/raw/heart_disease_uci.csv')

print("✅ Dataset loaded successfully!")
print(f"\n📊 Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"   (We have {df.shape[0]} patient records with {df.shape[1]} columns, including the target)")


## 3️⃣ First Look at the Data

Let's see what the data actually looks like. The first few rows give us a quick sense of the structure.

In [None]:
# Display first 10 rows
print("🔍 First 10 rows of the dataset:")
display(df.head(10))

## 4️⃣ Understanding the Features

Let's understand what each column means from a medical perspective:

### 📋 Feature Definitions:

1. **age** - Age in years
2. **sex** - Sex (1 = male, 0 = female)
3. **cp** - Chest pain type:
   - 0: Typical angina (chest pain from reduced blood flow)
   - 1: Atypical angina
   - 2: Non-anginal pain
   - 3: Asymptomatic (no symptoms)
4. **trestbps** - Resting blood pressure (mm Hg)
5. **chol** - Serum cholesterol (mg/dl)
6. **fbs** - Fasting blood sugar > 120 mg/dl (1 = true, 0 = false)
7. **restecg** - Resting electrocardiographic results:
   - 0: Normal
   - 1: ST-T wave abnormality
   - 2: Left ventricular hypertrophy
8. **thalach** - Maximum heart rate achieved during exercise
9. **exang** - Exercise induced angina (1 = yes, 0 = no)
10. **oldpeak** - ST depression induced by exercise relative to rest
11. **slope** - Slope of peak exercise ST segment:
    - 0: Upsloping
    - 1: Flat
    - 2: Downsloping
12. **ca** - Number of major vessels colored by fluoroscopy (0-3)
13. **thal** - Thallium stress test result (often mislabeled as "thalassemia"):
    - 0: Normal
    - 1: Fixed defect
    - 2: Reversible defect
14. **target** - Heart disease severity (0 = no disease, 1-4 = increasing severity)

Let's check the column names and types:

In [None]:
# Get basic information about the dataset
print("📊 Dataset Information:")
print("=" * 60)
df.info()

print("\n" + "=" * 60)
print("\n🔍 Column Names:")
print(df.columns.tolist())

### 💡 What to Look For:
- **Data types:** Are they correct? (integers for categorical, floats for continuous)
- **Non-null counts:** Do we have missing values?
- **Memory usage:** Is the dataset small enough to fit in memory?
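
The three checks above can also be collected into a single table, which is easier to scan than the `df.info()` text output. This is a minimal sketch on a tiny made-up frame; `column_overview` is a hypothetical helper introduced here, and the toy values are not from the dataset.

```python
import pandas as pd


def column_overview(frame):
    """Summarize dtype, non-null count, missing count, and unique count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "non_null": frame.notna().sum(),
        "missing": frame.isna().sum(),
        "n_unique": frame.nunique(dropna=True),
    })


# Tiny made-up frame, only to show the shape of the output
toy = pd.DataFrame({
    "age": [63, 67, 41, 56],
    "sex": ["Male", "Male", "Female", None],
    "chol": [233.0, 286.0, None, 236.0],
})

overview = column_overview(toy)
print(overview)
```

On the real data, `column_overview(df)` would give one row per column; a low `n_unique` on a numeric column is a hint that it is really categorical.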

## 5️⃣ Statistical Summary

Statistical summaries help us understand:
- **Range** of values (min, max)
- **Central tendency** (mean, median)
- **Spread** (standard deviation, quartiles)
- **Potential outliers** (values far from mean)

In [None]:
print("📈 Statistical Summary of Numerical Features:")
print("=" * 80)
display(df.describe().round(2))

print("\n" + "=" * 80)
print("\n🎯 What to Notice:")
# Report ranges only for columns that exist, so a renamed column does not raise a KeyError
range_labels = {
    'age': ('Age', 'years'),
    'trestbps': ('Blood pressure', 'mm Hg'),
    'chol': ('Cholesterol', 'mg/dl'),
    'thalach': ('Max heart rate', 'bpm'),
}
for col, (label, unit) in range_labels.items():
    if col in df.columns:
        print(f"- {label} range: {df[col].min()} - {df[col].max()} {unit}")

### 🤔 Questions to Ask:
1. Do the ranges make medical sense?
2. Are there any impossible values? (e.g., 0 mm Hg blood pressure)
3. Are the means and medians similar? (If not, data might be skewed)
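
The first two questions can be turned into an automatic check by counting values outside a plausible range. This is a minimal sketch: the bounds in `PLAUSIBLE_RANGES` are illustrative assumptions chosen for this example, not clinical guidance, and `count_out_of_range` is a hypothetical helper.

```python
import pandas as pd

# Illustrative bounds chosen for this sketch; they are assumptions, not clinical guidance.
PLAUSIBLE_RANGES = {
    "age": (18, 100),
    "trestbps": (60, 250),
    "chol": (80, 700),
}


def count_out_of_range(frame, ranges=PLAUSIBLE_RANGES):
    """Count non-missing values falling outside [low, high] for each listed column."""
    counts = {}
    for col, (low, high) in ranges.items():
        if col in frame.columns:
            values = frame[col].dropna()
            counts[col] = int(((values < low) | (values > high)).sum())
    return counts


# Made-up rows; the zeros stand in for "impossible" measurements
toy = pd.DataFrame({
    "age": [45, 61, 29],
    "trestbps": [120, 0, 145],
    "chol": [210, 0, 0],
})

out_of_range = count_out_of_range(toy)
print(out_of_range)
```

Any non-zero count is a prompt to inspect those rows by hand before deciding whether they are errors or genuine extremes.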

## 6️⃣ Missing Values Analysis

**Why this matters:** Missing data can:
- Bias our model
- Reduce predictive power
- Lead to errors during training

We need to know:
1. **How many** values are missing?
2. **Which columns** have missing data?
3. **Is the missingness random** or systematic?

In [None]:
# Count missing values
missing_counts = df.isnull().sum()
missing_percent = (df.isnull().sum() / len(df)) * 100

# Create a summary dataframe
missing_df = pd.DataFrame({
    'Missing_Count': missing_counts,
    'Percentage': missing_percent
})

# Filter to only show columns with missing values
missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)

print("🔍 Missing Values Analysis:")
print("=" * 60)

if len(missing_df) == 0:
    print("✅ Great news! No missing values detected!")
else:
    display(missing_df)
    
    # Visualize missing data
    plt.figure(figsize=(10, 6))
    plt.bar(missing_df.index, missing_df['Percentage'])
    plt.xlabel('Features')
    plt.ylabel('Percentage Missing (%)')
    plt.title('Missing Data by Feature')
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
    
print(f"\n📊 Total missing values: {df.isnull().sum().sum()}")
print(f"📊 Overall completeness: {((1 - df.isnull().sum().sum() / (df.shape[0] * df.shape[1])) * 100):.2f}%")

### 💡 Important Note on Missing Values:

Sometimes values that look like valid numbers are actually **coded missing values**. For example:
- A cholesterol of **0** is medically impossible (should be missing)
- A blood pressure of **0** doesn't make sense

Let's check for these "disguised" missing values:

In [None]:
print("🔍 Checking for Suspicious Zero Values:")
print("=" * 60)

# Check specific columns where 0 doesn't make sense
suspicious_cols = ['trestbps', 'chol', 'thalach']

for col in suspicious_cols:
    if col in df.columns:
        zero_count = (df[col] == 0).sum()
        zero_percent = (zero_count / len(df)) * 100
        print(f"{col:15} : {zero_count:3} zero values ({zero_percent:.2f}%)")
        
        if zero_count > 0:
            print(f"  ⚠️  Warning: {col} has {zero_count} zero values (medically unlikely!)")

print("\n💡 Note: We'll need to handle these in the preprocessing step!")
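
One possible way to handle disguised missing values during preprocessing is to convert the suspicious zeros to `NaN`, so that they are counted and imputed like any other missing value. This is a minimal sketch on made-up rows; `zeros_to_nan` is a hypothetical helper, and which columns to treat this way is an assumption to confirm against the data. Note that a zero in `oldpeak` is a legitimate reading, so that column is left alone here.

```python
import numpy as np
import pandas as pd


def zeros_to_nan(frame, columns):
    """Return a copy where zeros in the given columns are replaced by NaN."""
    cleaned = frame.copy()
    for col in columns:
        if col in cleaned.columns:
            # Cast to float first so the column can hold NaN
            cleaned[col] = cleaned[col].astype(float).replace(0, np.nan)
    return cleaned


# Made-up rows for illustration only
toy = pd.DataFrame({
    "chol": [210, 0, 180],
    "trestbps": [120, 130, 0],
    "oldpeak": [0.0, 1.2, 0.0],
})

cleaned = zeros_to_nan(toy, ["chol", "trestbps"])
print(cleaned.isna().sum())
```

The original frame is left unchanged, which keeps the raw values available for comparison.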

## 7️⃣ Target Variable Analysis (THE MOST IMPORTANT PART!)

### Why This is Critical:
The target variable (what we're trying to predict) determines everything:
- What type of problem this is (classification vs regression)
- What metrics we'll use
- What algorithms might work best
- Whether we need special techniques (like SMOTE for imbalance)

### What is Class Imbalance?
**Class imbalance** occurs when we have many more examples of some classes than others.

**Example:** If 90% of patients have no disease (class 0) and only 2% have severe disease (class 4):
- A "dumb" model could just predict "no disease" for everyone and be 90% accurate!
- But it would NEVER catch severe cases (the most important ones clinically)
- This is why **accuracy alone is misleading** with imbalanced data
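
The example above can be checked with a few lines of arithmetic. This is a minimal sketch using made-up labels that follow the 90% / 2% split described; the counts for classes 1 to 3 are assumptions chosen to fill the remaining 8%.

```python
import numpy as np

# Made-up labels matching the example: 90% class 0, 2% class 4
y_true = np.array([0] * 90 + [1] * 4 + [2] * 2 + [3] * 2 + [4] * 2)
y_pred = np.zeros_like(y_true)  # a "dumb" model that always predicts "no disease"

accuracy = (y_true == y_pred).mean()

# Recall per class: fraction of each true class that was predicted correctly
recall_per_class = {
    int(c): float((y_pred[y_true == c] == c).mean()) for c in np.unique(y_true)
}

print(f"Accuracy: {accuracy:.2f}")
print(f"Recall per class: {recall_per_class}")
```

Accuracy comes out at 0.90, while recall is 1.0 for class 0 and 0.0 for every disease class, which is exactly the failure mode described above.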

Let's check our target distribution:

In [None]:
# Assuming target column is named 'target' or 'num' (common in this dataset)
# Adjust if your column name is different
target_col = 'target' if 'target' in df.columns else 'num' if 'num' in df.columns else df.columns[-1]

print(f"🎯 Target Variable: '{target_col}'")
print("=" * 80)

# Count and percentage of each class
target_counts = df[target_col].value_counts().sort_index()
target_percent = (df[target_col].value_counts(normalize=True) * 100).sort_index()

target_summary = pd.DataFrame({
    'Count': target_counts,
    'Percentage': target_percent.round(2)
})

print("\n📊 Distribution of Heart Disease Severity:")
print()
print("Class | Description                    | Count | Percentage")
print("-" * 70)
severity_labels = [
    "No disease (< 50% blockage)",
    "Mild disease",
    "Moderate disease",
    "Severe disease",
    "Very severe disease"
]

for i in range(5):
    if i in target_counts.index:
        count = target_counts[i]
        pct = target_percent[i]
        label = severity_labels[i] if i < len(severity_labels) else f"Class {i}"
        print(f"  {i}   | {label:30} | {count:5} | {pct:6.2f}%")
    else:
        print(f"  {i}   | {severity_labels[i]:30} |     0 |   0.00%")

print("\n" + "=" * 80)

In [None]:
# Visualize the distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Bar plot
axes[0].bar(target_counts.index, target_counts.values, color='skyblue', edgecolor='black')
axes[0].set_xlabel('Severity Level', fontsize=12)
axes[0].set_ylabel('Number of Patients', fontsize=12)
axes[0].set_title('Distribution of Heart Disease Severity (Count)', fontsize=14, fontweight='bold')
axes[0].set_xticks(range(5))
axes[0].grid(axis='y', alpha=0.3)

# Add count labels on bars
for i, (idx, count) in enumerate(target_counts.items()):
    axes[0].text(idx, count + 5, str(count), ha='center', fontweight='bold')

# Pie chart
colors = ['#90EE90', '#FFD700', '#FFA500', '#FF6347', '#DC143C']
axes[1].pie(target_percent.values, labels=target_percent.index, autopct='%1.1f%%',
            startangle=90, colors=colors[:len(target_percent)])
axes[1].set_title('Proportion of Each Severity Level', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

# Calculate imbalance ratio
max_class = target_counts.max()
min_class = target_counts.min()
imbalance_ratio = max_class / min_class

print(f"\n⚖️  Class Imbalance Analysis:")
print(f"   Most common class: {target_counts.idxmax()} with {max_class} samples")
print(f"   Least common class: {target_counts.idxmin()} with {min_class} samples")
print(f"   Imbalance ratio: {imbalance_ratio:.2f}:1")
print()

if imbalance_ratio > 3:
    print("   ⚠️  SIGNIFICANT IMBALANCE DETECTED!")
    print("   📌 Consider SMOTE or class weights (applied to training data only)")
    print("   📌 Do not rely on simple accuracy as the only metric")
    print("   📌 Use stratified k-fold cross-validation")
else:
    print("   ✅ Classes are reasonably balanced")

### 🎓 Understanding the Results:

**Imbalance Ratio** (largest class count divided by smallest class count) tells us how severe the imbalance is. As rough rules of thumb:
- **< 2:1** - Mild imbalance (usually okay)
- **2-3:1** - Moderate imbalance (consider balancing techniques)
- **3-10:1** - Severe imbalance (balancing techniques or class weights are strongly advisable)
- **> 10:1** - Extreme imbalance (very challenging problem)

**Why This Matters:**
- Clinically, missing a severe case (false negative) is much worse than a false alarm
- We want HIGH SENSITIVITY (ability to detect disease) even if it means more false alarms
- This is why we'll report **weighted F1-score** together with per-class recall instead of accuracy alone (weighted F1 is averaged by class size, so on its own it can still hide poor recall on rare classes)
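
To make the choice of F1 averaging concrete, the sketch below compares macro and weighted F1 on a made-up example. It uses two classes only for brevity, and `f1_per_class` is a hypothetical helper written with plain NumPy rather than any particular library's metric function.

```python
import numpy as np


def f1_per_class(y_true, y_pred, labels):
    """Per-class F1 from true positives, false positives, and false negatives."""
    scores = []
    for c in labels:
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return np.array(scores)


# Made-up labels: 90 majority cases, 10 minority cases
y_true = np.array([0] * 90 + [1] * 10)
# The model finds only 2 of the 10 minority cases
y_pred = np.array([0] * 90 + [0] * 8 + [1] * 2)

labels = np.array([0, 1])
f1 = f1_per_class(y_true, y_pred, labels)
support = np.array([(y_true == c).sum() for c in labels])

macro_f1 = f1.mean()                           # every class counts equally
weighted_f1 = np.average(f1, weights=support)  # classes weighted by their size

print(f"Per-class F1: {np.round(f1, 3)}")
print(f"Macro F1:     {macro_f1:.3f}")
print(f"Weighted F1:  {weighted_f1:.3f}")
```

Here the weighted score stays high because the majority class dominates the average, while the macro score and the per-class values expose the weak minority class. Reporting per-class recall next to any averaged score avoids that blind spot.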

## 8️⃣ Feature Distribution Analysis

Understanding how each feature is distributed helps us:
1. **Detect outliers** - Values that are suspiciously far from normal
2. **Understand skewness** - Is the data bell-shaped or lopsided?
3. **Identify scaling needs** - Do we need normalization?
4. **Choose appropriate imputation** - Mean for normal distributions, median for skewed
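
The imputation rule in point 4 can be expressed as a small helper. This is a minimal sketch: `suggest_imputation` is a hypothetical function, the 0.5 skewness cutoff is a rule-of-thumb assumption, and the two series are made up.

```python
import pandas as pd


def suggest_imputation(series, skew_threshold=0.5):
    """Suggest 'mean' for roughly symmetric data and 'median' otherwise.

    The default threshold of 0.5 is a rule-of-thumb assumption, not a fixed standard.
    """
    skewness = series.dropna().skew()
    return "mean" if abs(skewness) < skew_threshold else "median"


# Made-up examples: one symmetric series, one with a long right tail
symmetric = pd.Series([48, 50, 52, 49, 51, 50, None])
skewed = pd.Series([1, 1, 2, 2, 3, 3, 4, 40, None])

print(suggest_imputation(symmetric))
print(suggest_imputation(skewed))
```

The median is preferred for skewed data because a few extreme values pull the mean away from the typical value.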

### 📊 Continuous Features:
Let's look at the continuous (numeric) features first:

In [None]:
# Identify continuous features (typically those with more unique values)
continuous_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']

# Filter to only include features that exist in the dataset
continuous_features = [f for f in continuous_features if f in df.columns]

print(f"📊 Analyzing {len(continuous_features)} Continuous Features:")
print(continuous_features)
print()

# Create subplots for histograms
n_features = len(continuous_features)
n_cols = 3
n_rows = (n_features + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 4))
# np.atleast_1d + ravel gives a flat array of axes even for a single row or a single subplot
axes = np.atleast_1d(axes).ravel()

for idx, feature in enumerate(continuous_features):
    ax = axes[idx]
    
    # Histogram (counts per bin); mean and median lines are added below
    ax.hist(df[feature].dropna(), bins=30, alpha=0.7, color='skyblue', edgecolor='black')
    ax.set_xlabel(feature, fontsize=11)
    ax.set_ylabel('Frequency', fontsize=11)
    ax.set_title(f'Distribution of {feature}', fontsize=12, fontweight='bold')
    ax.grid(axis='y', alpha=0.3)
    
    # Add mean and median lines
    mean_val = df[feature].mean()
    median_val = df[feature].median()
    ax.axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_val:.1f}')
    ax.axvline(median_val, color='green', linestyle='--', linewidth=2, label=f'Median: {median_val:.1f}')
    ax.legend()

# Hide empty subplots
for idx in range(len(continuous_features), len(axes)):
    axes[idx].axis('off')

plt.tight_layout()
plt.show()

### 🎓 How to Interpret Distributions:

**Normal Distribution (Bell Curve):**
- Mean ≈ Median
- Symmetric shape
- Most data near center
- **Example:** Height, many physiological measurements

**Right-Skewed (Positively Skewed):**
- Mean > Median
- Long tail to the right
- Most values on the left
- **Example:** Income, cholesterol in some populations
- **Action:** Consider log transformation or use median for imputation

**Left-Skewed (Negatively Skewed):**
- Mean < Median
- Long tail to the left
- Most values on the right
- **Example:** Age at death, test scores

**Bimodal (Two Peaks):**
- Two distinct groups
- **Example:** Could indicate different populations (e.g., healthy vs. sick)
- **Action:** Might need to analyze groups separately
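
To illustrate the log transformation suggested for right-skewed data, the sketch below applies `np.log1p` to synthetic values and compares skewness before and after. The data are randomly generated for illustration (a log-normal sample with an arbitrary seed), not taken from the heart disease dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values; the distribution parameters are arbitrary assumptions
rng = np.random.default_rng(seed=0)
raw = pd.Series(rng.lognormal(mean=5.0, sigma=0.8, size=1000))

# log1p computes log(1 + x), which is also safe when x is 0
transformed = np.log1p(raw)

skew_before = raw.skew()
skew_after = transformed.skew()

print(f"Skewness before: {skew_before:.2f}")
print(f"Skewness after:  {skew_after:.2f}")
```

The transformed values should be much closer to symmetric. Whether a transformation is worthwhile depends on the model: tree-based models are largely unaffected by monotonic transformations, while linear models often benefit.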

In [None]:
# Summary statistics and skewness for each continuous feature
print("📊 Statistical Analysis of Distributions:")
print("=" * 80)
print(f"{'Feature':<15} | {'Mean':<8} | {'Median':<8} | {'Std Dev':<8} | {'Skewness':<10} | Shape")
print("-" * 80)

for feature in continuous_features:
    mean = df[feature].mean()
    median = df[feature].median()
    std = df[feature].std()
    skew = df[feature].skew()
    
    # Determine shape (|skew| < 0.5 is treated as roughly symmetric)
    if abs(skew) < 0.5:
        shape = "Fairly symmetric"
    elif skew > 0:
        shape = "Right-skewed ➡️"
    else:
        shape = "Left-skewed ⬅️"
    
    print(f"{feature:<15} | {mean:>7.2f} | {median:>7.2f} | {std:>7.2f} | {skew:>9.2f} | {shape}")

print("\n💡 Interpretation:")
print("   • Skewness near 0 = symmetric (normal-like)")
print("   • Positive skewness = right tail (mean > median)")
print("   • Negative skewness = left tail (mean < median)")
print("   • |Skewness| > 1 = highly skewed (consider transformation)")

### 📊 Categorical Features:

Now let's look at categorical features (sex, chest pain type, etc.)

In [None]:
# Identify categorical features
categorical_features = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']
categorical_features = [f for f in categorical_features if f in df.columns]

print(f"📊 Analyzing {len(categorical_features)} Categorical Features:")
print(categorical_features)
print()

# Create subplots for bar charts
n_features = len(categorical_features)
n_cols = 3
n_rows = (n_features + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 4))
# np.atleast_1d + ravel gives a flat array of axes even for a single row or a single subplot
axes = np.atleast_1d(axes).ravel()

for idx, feature in enumerate(categorical_features):
    ax = axes[idx]
    
    # Count values
    value_counts = df[feature].value_counts().sort_index()
    
    # Bar plot at integer positions so text, boolean, and numeric categories all plot the same way
    positions = list(range(len(value_counts)))
    ax.bar(positions, value_counts.values, color='coral', edgecolor='black', alpha=0.7)
    ax.set_xticks(positions)
    ax.set_xticklabels([str(cat) for cat in value_counts.index])
    ax.set_xlabel(feature, fontsize=11)
    ax.set_ylabel('Count', fontsize=11)
    ax.set_title(f'Distribution of {feature}', fontsize=12, fontweight='bold')
    ax.grid(axis='y', alpha=0.3)
    
    # Add count labels just above each bar
    for pos, count in zip(positions, value_counts.values):
        ax.text(pos, count + value_counts.max() * 0.01, str(count), 
                ha='center', fontweight='bold')

# Hide empty subplots
for idx in range(len(categorical_features), len(axes)):
    axes[idx].axis('off')

plt.tight_layout()
plt.show()

## 9️⃣ Outlier Detection

### What are Outliers?
**Outliers** are data points that are significantly different from other observations. They can be:
1. **Valid extreme values** - Some people are just unusual (e.g., Olympic athletes)
2. **Measurement errors** - Equipment malfunction or recording mistakes
3. **Data entry errors** - Typos (e.g., age 900 instead of 90)

### Why Care About Outliers?
- Can drastically affect model training
- Can skew statistical measures (mean, standard deviation)
- Some algorithms (for example least-squares regression and distance-based methods such as k-NN) are more sensitive to outliers than tree-based models

### Detection Method: Box Plots
We'll use **box plots** which show:
- **Box:** 25th to 75th percentile (middle 50% of data)
- **Line in box:** Median (50th percentile)
- **Whiskers:** Extend to the most extreme data points within 1.5 × IQR (Interquartile Range) of the box edges
- **Dots:** Outliers (beyond whiskers)

In [None]:
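# Box plots for outlier detection
# Recompute the grid for the continuous features (n_rows/n_cols would otherwise carry over from the categorical cell)
n_cols = 3
n_rows = (len(continuous_features) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 4))
axes = np.atleast_1d(axes).ravel()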

for idx, feature in enumerate(continuous_features):
    ax = axes[idx]
    
    # Create box plot
    bp = ax.boxplot(df[feature].dropna(), vert=True, patch_artist=True)
    
    # Color the box
    for patch in bp['boxes']:
        patch.set_facecolor('lightblue')
    
    ax.set_ylabel(feature, fontsize=11)
    ax.set_title(f'Outlier Detection: {feature}', fontsize=12, fontweight='bold')
    ax.grid(axis='y', alpha=0.3)

# Hide empty subplots
for idx in range(len(continuous_features), len(axes)):
    axes[idx].axis('off')

plt.tight_layout()
plt.show()

In [None]:
# Statistical outlier detection using IQR method
print("🔍 Outlier Analysis Using IQR Method:")
print("=" * 80)
print(f"{'Feature':<15} | {'Q1':<8} | {'Q3':<8} | {'IQR':<8} | {'Lower':<8} | {'Upper':<8} | Outliers")
print("-" * 80)

outlier_summary = {}

for feature in continuous_features:
    data = df[feature].dropna()
    
    # Calculate quartiles and IQR
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    
    # Calculate bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Find outliers
    outliers = data[(data < lower_bound) | (data > upper_bound)]
    n_outliers = len(outliers)
    outlier_percent = (n_outliers / len(data)) * 100
    
    outlier_summary[feature] = {
        'count': n_outliers,
        'percent': outlier_percent,
        'values': outliers.tolist()
    }
    
    print(f"{feature:<15} | {Q1:>7.1f} | {Q3:>7.1f} | {IQR:>7.1f} | {lower_bound:>7.1f} | {upper_bound:>7.1f} | {n_outliers} ({outlier_percent:.1f}%)")

print("\n" + "=" * 80)
print("\n💡 IQR Method Explanation:")
print("   • IQR = Q3 - Q1 (spread of middle 50% of data)")
print("   • Lower bound = Q1 - 1.5×IQR")
print("   • Upper bound = Q3 + 1.5×IQR")
print("   • Values outside these bounds are considered outliers")
print("\n⚠️  Note: Not all outliers are errors! Some may be valid extreme cases.")

## 🔟 Correlation Analysis

### What is Correlation?
**Correlation** measures how two variables move together:
- **+1**: Perfect positive correlation (when one ↑, other ↑)
- **0**: No linear correlation (a non-linear relationship may still exist)
- **-1**: Perfect negative correlation (when one ↑, other ↓)
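
The coefficient itself is short enough to compute by hand. The sketch below implements Pearson's r directly and checks it against pandas, then computes a rank-based (Spearman) version by applying the same formula to ranks, which can be a better fit for an ordinal target such as severity 0-4. The `age` and `max_hr` values are made up for illustration, and `pearson_r` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return float((x_centered * y_centered).sum()
                 / np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum()))


# Made-up values: max heart rate falls as age rises
age = pd.Series([35, 42, 50, 58, 63, 70])
max_hr = pd.Series([182, 175, 168, 150, 147, 130])

r_manual = pearson_r(age, max_hr)
r_pandas = age.corr(max_hr)  # pandas uses Pearson by default

# Spearman correlation is Pearson correlation computed on the ranks
rho = pearson_r(age.rank(), max_hr.rank())

print(f"Pearson (manual): {r_manual:.3f}")
print(f"Pearson (pandas): {r_pandas:.3f}")
print(f"Spearman (ranks): {rho:.3f}")
```

Because the made-up relationship is strictly decreasing, the rank-based value is exactly -1, while Pearson's r is slightly weaker since the decline is not perfectly linear.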

### Why This Matters:
1. **Feature Selection:** Highly correlated features are redundant
2. **Model Performance:** Multicollinearity can hurt some models (like linear regression)
3. **Understanding Relationships:** Which features relate to the target?
4. **Feature Engineering:** Create new features by combining correlated ones

### Medical Context:
We expect some correlations, like:
- Age and max heart rate (negative - older = lower max HR)
- Exercise-induced angina and oldpeak (positive - both indicate stress response)

In [None]:
# Calculate correlation matrix (numeric columns only; pandas 2.x raises an error on text columns otherwise)
correlation_matrix = df.corr(numeric_only=True)

# Create heatmap
plt.figure(figsize=(14, 12))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of All Features', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()

print("\n🎨 Heatmap Color Guide:")
print("   🔴 Red = Strong positive correlation (both variables move together)")
print("   🔵 Blue = Strong negative correlation (variables move in opposite directions)")
print("   ⚪ White = No correlation (no relationship)")

In [None]:
# Find strongest correlations with target
target_corr = correlation_matrix[target_col].sort_values(ascending=False)

print("\n🎯 Features Most Correlated with Target (Heart Disease Severity):")
print("=" * 70)
print(f"{'Feature':<20} | {'Correlation':<12} | Interpretation")
print("-" * 70)

for feature, corr in target_corr.items():
    if feature != target_col:
        if abs(corr) > 0.5:
            strength = "STRONG"
        elif abs(corr) > 0.3:
            strength = "MODERATE"
        elif abs(corr) > 0.1:
            strength = "WEAK"
        else:
            strength = "VERY WEAK"
        
        direction = "positive" if corr > 0 else "negative"
        print(f"{feature:<20} | {corr:>11.3f} | {strength} {direction}")

print("\n" + "=" * 70)

In [None]:
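# Visualize top correlations with target, ranked by absolute strength so strong negatives are included
feature_corr = target_corr.drop(target_col)
top_features = feature_corr.reindex(feature_corr.abs().sort_values(ascending=False).index).head(10)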

plt.figure(figsize=(10, 8))
colors = ['green' if x > 0 else 'red' for x in top_features.values]
plt.barh(range(len(top_features)), top_features.values, color=colors, alpha=0.7, edgecolor='black')
plt.yticks(range(len(top_features)), top_features.index)
plt.xlabel('Correlation with Target', fontsize=12)
plt.title('Top 10 Features Correlated with Heart Disease', fontsize=14, fontweight='bold')
plt.axvline(x=0, color='black', linestyle='--', linewidth=1)
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print("\n💡 Interpretation:")
print("   🟢 Green bars: As feature increases, heart disease severity increases")
print("   🔴 Red bars: As feature increases, heart disease severity decreases")

### 🔍 Finding Multicollinearity:

**Multicollinearity** = When two features are highly correlated with each other (not just the target)

**Why it's a problem:**
- Redundant information (one feature doesn't add much if we already have the other)
- Can confuse some models about which feature is important
- Increases model complexity unnecessarily

**Rule of thumb:** If |correlation| > 0.7, features are similar enough to consider dropping one (a stricter cutoff such as 0.8 is also common)
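
The pairwise check can also be written without nested loops by masking the upper triangle of the correlation matrix. This is a minimal sketch on made-up columns; `high_corr_pairs` is a hypothetical helper, and its default threshold of 0.7 simply mirrors the value used in the detection cell of this notebook.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(corr, threshold=0.7):
    """List (feature_a, feature_b, r) for each pair with |r| above the threshold."""
    # Keep only entries above the diagonal so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    pairs = pairs[pairs.abs() > threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]


# Made-up data: b is roughly 2 * a, c is unrelated to both
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8, 12.2],
    "c": [5, 1, 4, 2, 6, 3],
})

result = high_corr_pairs(toy.corr())
print(result)
```

Only the `a`/`b` pair is reported, since the other two pairs have correlations close to zero.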

In [None]:
# Find pairs of highly correlated features
print("🔍 Detecting Multicollinearity (|correlation| > 0.7):")
print("=" * 70)

high_corr_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        feat1 = correlation_matrix.columns[i]
        feat2 = correlation_matrix.columns[j]
        corr_value = correlation_matrix.iloc[i, j]
        
        if abs(corr_value) > 0.7 and feat1 != target_col and feat2 != target_col:
            high_corr_pairs.append((feat1, feat2, corr_value))
            print(f"   {feat1} <-> {feat2}: {corr_value:.3f}")

if len(high_corr_pairs) == 0:
    print("   ✅ No severe multicollinearity detected!")
else:
    print(f"\n   ⚠️  Found {len(high_corr_pairs)} pairs of highly correlated features")
    print("   💡 Consider removing one feature from each pair during preprocessing")

## 1️⃣1️⃣ Feature Relationships by Target Class

Now let's see how features behave differently across severity levels. This helps us understand:
- Which features truly separate classes
- Whether patterns make medical sense
- Which features will be most useful for prediction

In [None]:
# Box plots grouped by target class
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

selected_features = continuous_features[:6]  # Take first 6 for visualization

for idx, feature in enumerate(selected_features):
    if idx < len(axes):
        ax = axes[idx]
        
        # Create grouped box plot
        df.boxplot(column=feature, by=target_col, ax=ax)
        ax.set_xlabel('Severity Level', fontsize=11)
        ax.set_ylabel(feature, fontsize=11)
        ax.set_title(f'{feature} by Heart Disease Severity', fontsize=12, fontweight='bold')
        ax.get_figure().suptitle('')  # Remove default title

plt.tight_layout()
plt.show()

print("\n💡 What to Look For:")
print("   • Clear separation between boxes = feature is discriminative")
print("   • Overlapping boxes = feature may not help distinguish classes")
print("   • Trend across severity levels = feature has predictive relationship")
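
A numeric complement to the box plots is a table of per-class medians, which makes trends across severity levels easy to read off. This is a minimal sketch on made-up rows; the column name `num` for the target and the values themselves are assumptions for illustration.

```python
import pandas as pd

# Made-up rows: oldpeak rises with severity, age does not follow a clear trend
toy = pd.DataFrame({
    "num": [0, 0, 0, 1, 1, 2, 2, 3],
    "oldpeak": [0.0, 0.2, 0.1, 1.0, 1.4, 2.0, 2.2, 3.1],
    "age": [45, 60, 52, 58, 49, 61, 55, 57],
})

# One row per severity level, one column per feature
medians_by_class = toy.groupby("num")[["oldpeak", "age"]].median()
print(medians_by_class)
```

On the real data, `df.groupby(target_col)[continuous_features].median()` would give the same kind of table. A column whose medians move steadily across classes matches the "trend across severity levels" pattern described above.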

## 1️⃣2️⃣ Pair Plots for Key Features

**Pair plots** show relationships between multiple features at once. Each cell shows:
- **Diagonal:** Distribution of single feature
- **Off-diagonal:** Scatter plot of two features, colored by target class

This helps us see:
- Clustering patterns
- Whether classes are separable
- Non-linear relationships

In [None]:
# Select top features for pair plot (based on correlation with target)
top_features_for_pairplot = target_corr.drop(target_col).abs().sort_values(ascending=False).head(4).index.tolist()
features_to_plot = top_features_for_pairplot + [target_col]

print(f"📊 Creating pair plot for top {len(top_features_for_pairplot)} features...")
print(f"Features: {top_features_for_pairplot}")

# Create pair plot
pair_plot = sns.pairplot(df[features_to_plot], hue=target_col, palette='Set2', 
                         diag_kind='kde', plot_kws={'alpha': 0.6, 's': 80})
pair_plot.fig.suptitle('Pair Plot of Top Features by Disease Severity', 
                       fontsize=16, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

print("\n💡 How to Read This:")
print("   • Colors represent different severity levels")
print("   • Look for clear color separation = classes are distinguishable")
print("   • Diagonal shows feature distributions by class")
print("   • Off-diagonal shows feature relationships")

## 1️⃣3️⃣ Summary and Key Findings

Let's summarize what we've learned from this EDA:

In [None]:
print("="*80)
print("🎯 EXPLORATORY DATA ANALYSIS SUMMARY")
print("="*80)

print("\n📊 DATASET OVERVIEW:")
print(f"   • Total samples: {df.shape[0]}")
print(f"   • Total features: {df.shape[1]} (including target)")
print(f"   • Continuous features: {len(continuous_features)}")
print(f"   • Categorical features: {len(categorical_features)}")

print("\n🔍 DATA QUALITY:")
total_missing = df.isnull().sum().sum()
print(f"   • Missing values: {total_missing} ({(total_missing / (df.shape[0] * df.shape[1]) * 100):.2f}%)")
if total_missing > 0:
    print(f"   ⚠️  Action needed: Implement imputation strategy")
else:
    print(f"   ✅ No missing values detected")

# Check for zero values in key features
zero_issues = []
for col in ['trestbps', 'chol', 'thalach']:
    if col in df.columns:
        zeros = (df[col] == 0).sum()
        if zeros > 0:
            zero_issues.append(f"{col}: {zeros} zeros")

if zero_issues:
    print(f"   ⚠️  Suspicious zero values found:")
    for issue in zero_issues:
        print(f"      - {issue}")

print("\n⚖️  CLASS BALANCE:")
for i, count in target_counts.items():
    pct = (count / len(df)) * 100
    print(f"   • Class {i}: {count} samples ({pct:.1f}%)")

if imbalance_ratio > 3:
    print(f"   ⚠️  Severe imbalance (ratio: {imbalance_ratio:.1f}:1)")
    print(f"   📌 Apply SMOTE or other balancing techniques (training data only)")

print("\n📈 KEY CORRELATIONS WITH TARGET:")
top_3_corr = target_corr.drop(target_col).abs().sort_values(ascending=False).head(3)
for feature, corr in top_3_corr.items():
    direction = "↑" if target_corr[feature] > 0 else "↓"
    print(f"   • {feature}: {abs(corr):.3f} {direction}")

print("\n🔄 MULTICOLLINEARITY:")
if len(high_corr_pairs) > 0:
    print(f"   ⚠️  {len(high_corr_pairs)} pairs of highly correlated features")
    print(f"   📌 Consider feature selection during preprocessing")
else:
    print(f"   ✅ No severe multicollinearity issues")

print("\n🎯 NEXT STEPS FOR PREPROCESSING:")
print("   1. ✅ Handle missing/zero values (imputation strategy)")
print("   2. ✅ Encode categorical variables (one-hot or label encoding)")
print("   3. ✅ Scale/normalize features (StandardScaler or MinMaxScaler)")
print("   4. ✅ Address class imbalance (SMOTE, on the training split only)")
print("   5. ✅ Feature selection (remove highly correlated features if needed)")
print("   6. ✅ Create train/test splits (stratified by target) before fitting scalers or SMOTE")

print("\n" + "="*80)
print("✅ EDA COMPLETE! Ready for preprocessing phase.")
print("="*80)

## 🎓 Learning Objectives Achieved:

By completing this EDA, you now understand:

1. ✅ **Why EDA is critical** - It's the foundation for all subsequent work
2. ✅ **Data quality assessment** - Missing values, outliers, errors
3. ✅ **Distribution analysis** - Normal, skewed, bimodal patterns
4. ✅ **Class imbalance** - Why it matters and how to detect it
5. ✅ **Correlation analysis** - Feature relationships and multicollinearity
6. ✅ **Feature behavior** - How features differ across target classes
7. ✅ **Medical context** - What the features mean clinically

## 📚 Key Concepts to Remember:

- **EDA guides all preprocessing decisions**
- **Class imbalance requires special handling (SMOTE, stratified CV)**
- **Outliers aren't always errors - investigate before removing**
- **High correlation ≠ causation**
- **Understanding your data > blindly applying algorithms**

---

## 💾 Save Findings

Let's save our EDA findings for the team:

In [None]:
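# Make sure the output directory exists before writing to it
import os
os.makedirs('../results', exist_ok=True)

# Save summary statistics
summary_stats = df.describe()
summary_stats.to_csv('../results/eda_summary_statistics.csv')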

# Save correlation matrix
correlation_matrix.to_csv('../results/correlation_matrix.csv')

# Save key findings to a text file
with open('../results/eda_findings.txt', 'w') as f:
    f.write("EXPLORATORY DATA ANALYSIS FINDINGS\n")
    f.write("=" * 60 + "\n\n")
    f.write(f"Dataset Size: {df.shape[0]} samples, {df.shape[1]} features\n")
    f.write(f"Missing Values: {df.isnull().sum().sum()}\n")
    f.write(f"Class Imbalance Ratio: {imbalance_ratio:.2f}:1\n\n")
    f.write("Top 5 Features Correlated with Target:\n")
    for i, (feature, corr) in enumerate(target_corr.drop(target_col).abs().sort_values(ascending=False).head(5).items(), 1):
        f.write(f"  {i}. {feature}: {corr:.3f}\n")

print("\n✅ Results saved to ../results/")
print("   • eda_summary_statistics.csv")
print("   • correlation_matrix.csv")
print("   • eda_findings.txt")