# Heart Disease Prediction: Exploratory Data Analysis

This notebook explores the heart disease dataset to understand its characteristics, distributions, and potential patterns that could inform feature engineering and model selection.

## Project Overview

This is a supervised binary classification project: we use patients' health and lifestyle indicators to predict Heart Disease Status (Yes/No), with the aim of identifying at-risk individuals who could benefit from early intervention or further diagnostic tests.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sys

# Add project root to path to import custom modules
sys.path.append('..')

# Set plotting style (the 'seaborn-whitegrid' style name was renamed in matplotlib 3.6 and later removed)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)

In [None]:
# Load project configuration
from src.config import RAW_DATA_PATH, TARGET_COLUMN, CATEGORICAL_FEATURES, NUMERICAL_FEATURES
from src.data.data_loader import load_data
from src.data.preprocessor import check_data_quality, clean_data

# Load the raw data
data = load_data(RAW_DATA_PATH)

# Display the first few rows
print(f"Dataset shape: {data.shape}")
data.head()

## Data Overview

Let's examine the basic information about our dataset: data types, missing values, and summary statistics.

In [None]:
# Check data types
data.dtypes

In [None]:
# Check for missing values
missing_values = data.isnull().sum()
missing_percentage = (missing_values / len(data)) * 100

missing_df = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage': missing_percentage
})

# Show only columns with missing values
missing_df[missing_df['Missing Values'] > 0].sort_values('Percentage', ascending=False)

In [None]:
# Check data quality with our custom function
quality_summary = check_data_quality(data)

# Check for duplicates
print(f"Number of duplicate rows: {quality_summary['duplicates']}")
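
The implementation of `check_data_quality` lives in `src.data.preprocessor` and is not shown in this notebook. As a rough illustration only, a helper of this kind might look like the sketch below. The function name `check_data_quality_sketch` and every key except `duplicates` are assumptions, not the project's actual API.

```python
import pandas as pd

def check_data_quality_sketch(df: pd.DataFrame) -> dict:
    """Hypothetical stand-in for check_data_quality; the real helper may differ."""
    return {
        "n_rows": len(df),
        # Fully identical rows beyond their first occurrence
        "duplicates": int(df.duplicated().sum()),
        # Missing-value count per column
        "missing_by_column": df.isnull().sum().to_dict(),
        # Columns with a single distinct value carry no signal
        "constant_columns": [c for c in df.columns if df[c].nunique(dropna=False) <= 1],
    }

# Toy data: the third row repeats the first, and one Age value is missing
toy = pd.DataFrame({
    "Age": [54, 61, 54, None],
    "Smoking": ["Yes", "No", "Yes", "No"],
})
quality_report = check_data_quality_sketch(toy)
print(quality_report)
```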

In [None]:
# Summary statistics for numerical features
data.describe()

In [None]:
# Summary of categorical features
for col in CATEGORICAL_FEATURES:
    if col in data.columns:
        print(f"\n{col}:")
        value_counts = data[col].value_counts(dropna=False)
        percentage = value_counts / len(data) * 100
        summary = pd.DataFrame({'Count': value_counts, 'Percentage': percentage})
        print(summary)

## Target Variable Analysis

Let's examine the distribution of our target variable (Heart Disease Status) to check whether the classes are imbalanced.

In [None]:
# Class distribution
target_counts = data[TARGET_COLUMN].value_counts()
target_percent = 100 * target_counts / len(data)

# Create DataFrame for display
target_summary = pd.DataFrame({
    'Count': target_counts,
    'Percentage': target_percent
})

print("Heart Disease Status Distribution:")
target_summary

In [None]:
# Visualize target distribution
plt.figure(figsize=(10, 6))
# Fix the bar order so it matches target_counts / target_percent
ax = sns.countplot(x=TARGET_COLUMN, data=data, order=target_counts.index)

# Add count and percentage labels (bars sit at x = 0, 1, ... in the order given above)
for i, (count, pct) in enumerate(zip(target_counts.values, target_percent.values)):
    ax.text(i, count, f'{count} ({pct:.1f}%)',
            ha="center", va="bottom", fontsize=12)

plt.title('Distribution of Heart Disease Status', fontsize=14)
plt.ylabel('Count', fontsize=12)
plt.xlabel('Heart Disease Status', fontsize=12)
plt.show()

## Feature Analysis

Now let's analyze the individual features and their relationship with the target variable.

In [None]:
# Clean the data before analysis
cleaned_data = clean_data(data)

# Confirm cleaning results
print(f"Original shape: {data.shape}, Cleaned shape: {cleaned_data.shape}")
print("\nMissing values after cleaning:")
print(cleaned_data.isnull().sum().sum())

### Numerical Features Analysis

In [None]:
# Distribution of numerical features
n_cols = 3
n_rows = max(1, -(-len(NUMERICAL_FEATURES) // n_cols))  # ceiling division avoids an extra empty row
fig, axes = plt.subplots(n_rows, n_cols, figsize=(18, 4 * n_rows), squeeze=False)
axes = axes.flatten()

for i, feature in enumerate(NUMERICAL_FEATURES):
    if feature in cleaned_data.columns:
        ax = axes[i]
        
        # Plot histogram with KDE
        sns.histplot(cleaned_data[feature], kde=True, ax=ax)
        
        # Add vertical line for mean and median
        ax.axvline(cleaned_data[feature].mean(), color='red', linestyle='--', label='Mean')
        ax.axvline(cleaned_data[feature].median(), color='green', linestyle='-.', label='Median')
        
        ax.set_title(f'Distribution of {feature}')
        ax.legend()

# Hide empty subplots if any
for j in range(i + 1, len(axes)):
    axes[j].set_visible(False)

plt.tight_layout()
plt.show()
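
The mean and median lines in the histograms give a visual hint of skew: a mean well to the right of the median suggests a long right tail. A numeric check can complement the plots. The sketch below uses made-up values and pandas' built-in `Series.skew()`; it is not part of the original analysis.

```python
import pandas as pd

# Toy feature with one large value pulling the mean to the right
values = pd.Series([1, 2, 3, 4, 100], name="toy_feature")

summary = {
    "mean": values.mean(),
    "median": values.median(),
    "skew": values.skew(),  # sample skewness; positive means a right tail
}
print(summary)
```

In the notebook, the same idea would apply to `cleaned_data[NUMERICAL_FEATURES].skew()`, assuming all listed features are present and numeric.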

In [None]:
# Boxplots of numerical features by target class
n_cols = 2
n_rows = max(1, -(-len(NUMERICAL_FEATURES) // n_cols))  # ceiling division avoids an extra empty row
fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 5 * n_rows), squeeze=False)
axes = axes.flatten()

for i, feature in enumerate(NUMERICAL_FEATURES):
    if feature in cleaned_data.columns:
        ax = axes[i]
        
        # Convert target to categorical for better visualization
        target = cleaned_data[TARGET_COLUMN].astype('category')
        
        # Plot boxplot
        sns.boxplot(x=target, y=feature, data=cleaned_data, ax=ax)
        
        ax.set_title(f'{feature} by Heart Disease Status')
        ax.set_xlabel('Heart Disease Status')
        ax.set_ylabel(feature)

# Hide empty subplots if any
for j in range(i + 1, len(axes)):
    axes[j].set_visible(False)

plt.tight_layout()
plt.show()

### Categorical Features Analysis

In [None]:
# Analysis of categorical features
for feature in CATEGORICAL_FEATURES:
    if feature in cleaned_data.columns:
        plt.figure(figsize=(12, 6))
        
        # Create a cross-tabulation
        crosstab = pd.crosstab(
            cleaned_data[feature], 
            cleaned_data[TARGET_COLUMN],
            normalize='index'
        )
        
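        # Plot stacked bar chart on the figure created above
        # (without ax=, DataFrame.plot opens a second figure and leaves the first one empty)
        crosstab.plot(kind='bar', stacked=True, ax=plt.gca())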
        
        plt.title(f'{feature} vs. Heart Disease Status')
        plt.xlabel(feature)
        plt.ylabel('Proportion')
        plt.xticks(rotation=45)
        plt.legend(title='Heart Disease')
        plt.tight_layout()
        plt.show()
        
        # Show contingency table
        counts = pd.crosstab(cleaned_data[feature], cleaned_data[TARGET_COLUMN])
        percentages = pd.crosstab(
            cleaned_data[feature], 
            cleaned_data[TARGET_COLUMN], 
            normalize='index'
        ).round(4) * 100
        
        # Combine counts and percentages
        combined = counts.copy()
        for col in counts.columns:
            combined[f"{col} %"] = percentages[col]
        
        print(f"\n{feature} vs. Heart Disease Status:")
        display(combined)
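
The contingency tables show proportions, but they do not summarize how strongly each categorical feature is associated with the target. One option, not used in the original notebook, is Cramér's V, which rescales the chi-square statistic to a 0 to 1 range. The sketch below is a minimal version without continuity or bias correction, run on made-up data; `cramers_v` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V from the uncorrected chi-square statistic (0 = no association, 1 = perfect)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: row total * column total / n
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim))) if min_dim > 0 else 0.0

# Toy data (made up): the first feature matches the target exactly, the second is independent of it
toy = pd.DataFrame({
    "Smoking":  ["Yes", "Yes", "No", "No", "Yes", "No", "Yes", "No"],
    "Diabetes": ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes"],
    "Heart Disease Status": ["Yes", "Yes", "No", "No", "Yes", "No", "Yes", "No"],
})
v_perfect = cramers_v(toy["Smoking"], toy["Heart Disease Status"])
v_independent = cramers_v(toy["Diabetes"], toy["Heart Disease Status"])
print(v_perfect, v_independent)
```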

### Correlation Analysis

Let's look at the correlation between numerical features and the target variable.

In [None]:
# Convert target to numeric for correlation analysis
cleaned_data_corr = cleaned_data.copy()
if cleaned_data_corr[TARGET_COLUMN].dtype == 'object':
    cleaned_data_corr[TARGET_COLUMN] = cleaned_data_corr[TARGET_COLUMN].map({'Yes': 1, 'No': 0})

# Select only numerical columns
numerical_data = cleaned_data_corr.select_dtypes(include=[np.number])

# Compute correlation matrix
corr_matrix = numerical_data.corr()

# Plot correlation heatmap
plt.figure(figsize=(14, 12))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
cmap = sns.diverging_palette(220, 10, as_cmap=True)

sns.heatmap(corr_matrix, mask=mask, cmap=cmap, vmax=1, vmin=-1, center=0,
            square=True, linewidths=.5, annot=True, fmt='.2f', cbar_kws={"shrink": .5})

plt.title('Correlation Matrix of Numerical Features', fontsize=16)
plt.tight_layout()
plt.show()

In [None]:
# Focus on correlations with the target variable
target_correlations = corr_matrix[TARGET_COLUMN].drop(TARGET_COLUMN).sort_values(ascending=False)

plt.figure(figsize=(10, 8))
sns.barplot(x=target_correlations.values, y=target_correlations.index)
plt.title('Correlation with Heart Disease Status', fontsize=14)
plt.xlabel('Correlation Coefficient', fontsize=12)
plt.axvline(x=0, color='black', linestyle='--')
plt.grid(axis='x')
plt.tight_layout()
plt.show()

print("\nNumerical features correlation with Heart Disease Status:")
display(pd.DataFrame({'Correlation': target_correlations}))
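
When the target is coded as 0/1, the Pearson coefficient computed by `corr()` is the point-biserial correlation, which depends only on the difference in group means, the feature's spread, and the class proportions. The sketch below checks this equivalence on made-up values; the column names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data (made up): one numeric feature and a 0/1 target
toy = pd.DataFrame({
    "Age": [35, 42, 51, 58, 63, 70],
    "target": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation, as computed by the heatmap cell
pearson_r = toy["Age"].corr(toy["target"])

# Point-biserial form: (mean_pos - mean_neg) / population_std * sqrt(p * (1 - p))
pos = toy.loc[toy["target"] == 1, "Age"]
neg = toy.loc[toy["target"] == 0, "Age"]
p = toy["target"].mean()
point_biserial_r = (pos.mean() - neg.mean()) / toy["Age"].std(ddof=0) * np.sqrt(p * (1 - p))

print(pearson_r, point_biserial_r)
```

One practical consequence is that a feature's correlation with the target shrinks as the classes become more imbalanced, even if the gap between group means stays the same.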

## Feature Engineering Exploration

Let's explore some potential feature interactions that might be useful for our model.

In [None]:
# Import feature engineering functions
from src.features.feature_engineering import create_interaction_features, compute_lipid_ratio

# Create interaction features
enhanced_data = create_interaction_features(cleaned_data)

# Add lipid ratio feature
enhanced_data = compute_lipid_ratio(enhanced_data)

# Check new features
new_features = [col for col in enhanced_data.columns if col not in cleaned_data.columns]
print(f"New features created: {new_features}")

# Preview the enhanced data
enhanced_data[new_features + [TARGET_COLUMN]].head()
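
The bodies of `create_interaction_features` and `compute_lipid_ratio` are defined in `src.features.feature_engineering` and are not shown here. Purely as an illustration of the general pattern, the sketch below builds one ratio and one product feature. The helper names, the column names, and the choice of which columns to combine are all assumptions and may not match the project's actual functions.

```python
import numpy as np
import pandas as pd

def add_ratio_feature(df: pd.DataFrame, numerator: str, denominator: str, name: str) -> pd.DataFrame:
    """Hypothetical helper: numerator / denominator, with NaN where the denominator is zero."""
    out = df.copy()
    out[name] = out[numerator] / out[denominator].replace(0, np.nan)
    return out

def add_interaction_feature(df: pd.DataFrame, a: str, b: str, name: str) -> pd.DataFrame:
    """Hypothetical helper: elementwise product of two numeric columns."""
    out = df.copy()
    out[name] = out[a] * out[b]
    return out

# Toy data with assumed column names (not taken from the project config)
toy = pd.DataFrame({
    "Cholesterol Level": [200.0, 240.0, 180.0],
    "Triglyceride Level": [100.0, 0.0, 90.0],
    "Age": [50, 60, 40],
    "BMI": [25.0, 30.0, 22.0],
})
toy_enhanced = add_ratio_feature(toy, "Cholesterol Level", "Triglyceride Level", "Lipid Ratio (sketch)")
toy_enhanced = add_interaction_feature(toy_enhanced, "Age", "BMI", "Age x BMI")
print(toy_enhanced)
```

Replacing a zero denominator with NaN keeps infinite values out of later correlation and scaling steps; how those NaNs are then handled is a separate modeling decision.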

In [None]:
# Analyze correlation of new features with target
enhanced_data_corr = enhanced_data.copy()
if enhanced_data_corr[TARGET_COLUMN].dtype == 'object':
    enhanced_data_corr[TARGET_COLUMN] = enhanced_data_corr[TARGET_COLUMN].map({'Yes': 1, 'No': 0})

# Select numerical columns including the new features
numerical_enhanced = enhanced_data_corr.select_dtypes(include=[np.number])

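# Correlation of new features with target (new features that are not numeric are skipped)
numeric_new_features = [col for col in new_features if col in numerical_enhanced.columns]
new_features_corr = numerical_enhanced[numeric_new_features + [TARGET_COLUMN]].corr()[TARGET_COLUMN].drop(TARGET_COLUMN)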

plt.figure(figsize=(10, 6))
sns.barplot(x=new_features_corr.values, y=new_features_corr.index)
plt.title('New Features: Correlation with Heart Disease Status', fontsize=14)
plt.xlabel('Correlation Coefficient', fontsize=12)
plt.axvline(x=0, color='black', linestyle='--')
plt.grid(axis='x')
plt.tight_layout()
plt.show()

print("\nNew features correlation with Heart Disease Status:")
display(pd.DataFrame({'Correlation': new_features_corr}))

## Age Group Analysis

Let's analyze heart disease rates across different age groups.

In [None]:
# Create age groups
if 'Age' in cleaned_data.columns:
    age_data = cleaned_data.copy()
    
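    # Create age bins. With right=False each bin is left-closed: [0, 30), [30, 40), ...
    # An open-ended top bin keeps ages of 100 or more; labels assume whole-year ages.
    age_bins = [0, 30, 40, 50, 60, 70, np.inf]
    age_labels = ['<30', '30-39', '40-49', '50-59', '60-69', '70+']
    
    age_data['Age Group'] = pd.cut(age_data['Age'], bins=age_bins, labels=age_labels, right=False)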
    
    # Calculate heart disease rate by age group
    if age_data[TARGET_COLUMN].dtype == 'object':
        age_data['Heart Disease Numeric'] = age_data[TARGET_COLUMN].map({'Yes': 1, 'No': 0})
    else:
        age_data['Heart Disease Numeric'] = age_data[TARGET_COLUMN]
    
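    # observed=False keeps empty age groups in the table and avoids a pandas FutureWarning
    age_group_analysis = age_data.groupby('Age Group', observed=False)['Heart Disease Numeric'].agg(['count', 'mean', 'sum'])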
    age_group_analysis.columns = ['Total Count', 'Heart Disease Rate', 'Heart Disease Count']
    age_group_analysis['Heart Disease Rate'] = age_group_analysis['Heart Disease Rate'] * 100
    
    # Plot
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
    
    # Count by age group
    sns.barplot(x=age_group_analysis.index, y='Total Count', data=age_group_analysis, ax=ax1)
    ax1.set_title('Sample Count by Age Group', fontsize=14)
    ax1.set_ylabel('Count')
    ax1.set_xlabel('Age Group')
    
    # Add count labels
    for i, p in enumerate(ax1.patches):
        ax1.annotate(f"{int(p.get_height())}", 
                  (p.get_x() + p.get_width() / 2., p.get_height()), 
                  ha = 'center', va = 'bottom')
    
    # Heart disease rate by age group
    sns.barplot(x=age_group_analysis.index, y='Heart Disease Rate', data=age_group_analysis, ax=ax2)
    ax2.set_title('Heart Disease Rate by Age Group', fontsize=14)
    ax2.set_ylabel('Heart Disease Rate (%)')
    ax2.set_xlabel('Age Group')
    
    # Add percentage labels
    for i, p in enumerate(ax2.patches):
        ax2.annotate(f"{p.get_height():.1f}%", 
                  (p.get_x() + p.get_width() / 2., p.get_height()), 
                  ha = 'center', va = 'bottom')
    
    plt.tight_layout()
    plt.show()
    
    print("\nHeart Disease Analysis by Age Group:")
    display(age_group_analysis)
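
Bin edges are easy to get subtly wrong, so it helps to check how `pd.cut` treats boundary ages. The sketch below uses made-up ages and left-closed bins (`right=False`). It also shows that a finite upper edge of 100 silently drops a 100-year-old, while an open-ended top bin keeps them.

```python
import numpy as np
import pandas as pd

# Boundary ages chosen to probe each edge
ages = pd.Series([29, 30, 39, 40, 69, 70, 100])

# Left-closed bins with an open-ended top bin
bins = [0, 30, 40, 50, 60, 70, np.inf]
labels = ['<30', '30-39', '40-49', '50-59', '60-69', '70+']
groups = pd.cut(ages, bins=bins, labels=labels, right=False)
print(groups.tolist())

# With a finite top edge, right=False makes the last bin [70, 100), so age 100 becomes NaN
capped = pd.cut(pd.Series([100]), bins=[0, 30, 40, 50, 60, 70, 100], right=False)
print(capped.isna().tolist())
```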

## Risk Factor Combinations

Let's explore how combinations of risk factors affect heart disease rates.

In [None]:
# Analyze common risk factor combinations
risk_factors = ['Smoking', 'High Blood Pressure', 'Diabetes', 'Family Heart Disease']
risk_data = cleaned_data.copy()

# Ensure all risk factors are binary
for factor in risk_factors:
    if factor in risk_data.columns and risk_data[factor].dtype == 'object':
        risk_data[factor] = risk_data[factor].map({'Yes': 1, 'No': 0})

# Convert target to numeric if needed
if risk_data[TARGET_COLUMN].dtype == 'object':
    risk_data['Heart Disease Numeric'] = risk_data[TARGET_COLUMN].map({'Yes': 1, 'No': 0})
else:
    risk_data['Heart Disease Numeric'] = risk_data[TARGET_COLUMN]

# Calculate number of risk factors for each person
available_factors = [f for f in risk_factors if f in risk_data.columns]
risk_data['Risk Factor Count'] = risk_data[available_factors].sum(axis=1)

# Analysis by risk factor count
risk_count_analysis = risk_data.groupby('Risk Factor Count')['Heart Disease Numeric'].agg(['count', 'mean', 'sum'])
risk_count_analysis.columns = ['Total Count', 'Heart Disease Rate', 'Heart Disease Count']
risk_count_analysis['Heart Disease Rate'] = risk_count_analysis['Heart Disease Rate'] * 100

# Plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Count by risk factor count
sns.barplot(x=risk_count_analysis.index, y='Total Count', data=risk_count_analysis, ax=ax1)
ax1.set_title('Sample Count by Number of Risk Factors', fontsize=14)
ax1.set_ylabel('Count')
ax1.set_xlabel('Number of Risk Factors')

# Add count labels
for i, p in enumerate(ax1.patches):
    ax1.annotate(f"{int(p.get_height())}", 
              (p.get_x() + p.get_width() / 2., p.get_height()), 
              ha = 'center', va = 'bottom')

# Heart disease rate by risk factor count
sns.barplot(x=risk_count_analysis.index, y='Heart Disease Rate', data=risk_count_analysis, ax=ax2)
ax2.set_title('Heart Disease Rate by Number of Risk Factors', fontsize=14)
ax2.set_ylabel('Heart Disease Rate (%)')
ax2.set_xlabel('Number of Risk Factors')

# Add percentage labels
for i, p in enumerate(ax2.patches):
    ax2.annotate(f"{p.get_height():.1f}%", 
              (p.get_x() + p.get_width() / 2., p.get_height()), 
              ha = 'center', va = 'bottom')

plt.tight_layout()
plt.show()

print("\nHeart Disease Analysis by Number of Risk Factors:")
display(risk_count_analysis)
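
Counting risk factors treats every factor as interchangeable. To see which specific combinations drive the rate, one could group by the factor columns themselves. The sketch below does this on made-up data with two factors; in the notebook the same pattern would apply to `risk_data` grouped by `available_factors`, provided every listed factor has been encoded as 0/1.

```python
import pandas as pd

# Toy data (made up): two binary risk factors and a 0/1 outcome
toy = pd.DataFrame({
    "Smoking":               [1, 1, 0, 0, 1, 0],
    "Diabetes":              [1, 1, 0, 0, 0, 0],
    "Heart Disease Numeric": [1, 0, 0, 0, 1, 0],
})
factors = ["Smoking", "Diabetes"]

# One row per observed combination of factor values
combo_rates = (
    toy.groupby(factors)["Heart Disease Numeric"]
       .agg(["count", "mean"])
       .rename(columns={"count": "Total Count", "mean": "Heart Disease Rate"})
)
combo_rates["Heart Disease Rate"] *= 100  # express as a percentage
print(combo_rates)
```

Combinations with very few samples produce unstable rates, so the `Total Count` column should be read alongside the rate.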

## Summary of Exploratory Analysis

Let's summarize our key findings from the exploratory data analysis.

In [None]:
# Calculate the imbalance ratio (use TARGET_COLUMN for consistency with the rest of the notebook)
if TARGET_COLUMN in cleaned_data.columns:
    if cleaned_data[TARGET_COLUMN].dtype == 'object':
        pos_count = (cleaned_data[TARGET_COLUMN] == 'Yes').sum()
        neg_count = (cleaned_data[TARGET_COLUMN] == 'No').sum()
    else:
        pos_count = (cleaned_data[TARGET_COLUMN] == 1).sum()
        neg_count = (cleaned_data[TARGET_COLUMN] == 0).sum()
        
    imbalance_ratio = neg_count / pos_count if pos_count > 0 else float('inf')
    print(f"Class imbalance ratio (Negative:Positive): {imbalance_ratio:.2f}:1")

# Identify top correlated features
if 'target_correlations' in locals():
    top_positive = target_correlations[target_correlations > 0].head(5)
    top_negative = target_correlations[target_correlations < 0].sort_values().head(5)
    
    print("\nTop positively correlated features with Heart Disease:")
    for feature, corr in top_positive.items():
        print(f"- {feature}: {corr:.3f}")
        
    print("\nTop negatively correlated features with Heart Disease:")
    for feature, corr in top_negative.items():
        print(f"- {feature}: {corr:.3f}")

# Summary of missing values
missing_counts = cleaned_data.isnull().sum()
features_with_missing = missing_counts[missing_counts > 0]
if len(features_with_missing) > 0:
    print("\nFeatures with missing values:")
    for feature, count in features_with_missing.items():
        print(f"- {feature}: {count} missing values ({count/len(cleaned_data)*100:.2f}%)")
else:
    print("\nNo missing values after cleaning.")

## Key Takeaways

Based on the exploratory data analysis, here are the key takeaways:

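1. **Class Imbalance**: The dataset shows significant class imbalance, with heart disease cases being the minority class. This will need to be addressed in our modeling approach, for example through class weighting or resampling, and by evaluating with recall, F1, or precision-recall curves rather than accuracy alone.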

2. **Important Features**: Several features show strong correlation with heart disease status, including [top features identified above].

3. **Age Factor**: Heart disease prevalence increases with age, with a notable jump in the 50+ age groups.

4. **Risk Factor Combinations**: The presence of multiple risk factors significantly increases heart disease likelihood.

5. **Feature Engineering**: We've created several interaction features that show promising correlations with the target variable.

6. **Missing Values**: [Summary of missing value handling strategy].

These insights will guide our feature engineering, model selection, and evaluation approaches in the subsequent modeling phase.
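
As one possible way to act on the class imbalance takeaway, the sketch below computes inverse-frequency class weights on a made-up target. The formula `n_samples / (n_classes * class_count)` is the same heuristic scikit-learn applies for `class_weight='balanced'`. The helper name `balanced_class_weights` is hypothetical, and this is only one option among several (resampling, threshold tuning), not a method the notebook has committed to.

```python
import pandas as pd

def balanced_class_weights(y: pd.Series) -> dict:
    """Inverse-frequency weights: rarer classes get proportionally larger weights."""
    counts = y.value_counts()
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Toy target (made up): 8 negatives, 2 positives
toy_target = pd.Series(["No"] * 8 + ["Yes"] * 2)
weights = balanced_class_weights(toy_target)
print(weights)
```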