# Heart Disease Prediction – Exploratory Data Analysis (EDA)

## Objective
The goal of this notebook is to perform Exploratory Data Analysis (EDA) on the
Heart Disease UCI dataset to understand:
- Data distribution
- Feature relationships
- Class balance
- Potential preprocessing requirements

This analysis supports informed model selection and feature engineering
decisions as part of an end-to-end MLOps pipeline.

In [2]:
# Import core data analysis and visualization libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style for consistency
sns.set_theme(style="whitegrid")

In [None]:
# Load the heart disease dataset from the data directory
data_path = "../data/heart.csv"
df = pd.read_csv(data_path)

# Display first few rows
df.head()

## Dataset Overview

This dataset contains patient health indicators such as:
- Age
- Sex
- Chest pain type
- Blood pressure
- Cholesterol levels

The target variable `target` indicates:
- `1` → Presence of heart disease
- `0` → Absence of heart disease

In [None]:
# Dataset dimensions
print("Dataset Shape:", df.shape)

# Column dtypes and non-null counts (a count below the row total means missing values)
df.info()

## Data Quality Check

We inspect the dataset for:
- Missing values
- Inconsistent data types
- Potential preprocessing needs

In [None]:
# Check for missing values
df.isnull().sum()
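
The checks listed above can be gathered into one per-column summary. The following is a minimal sketch, not part of the original analysis: the helper names `quality_report` and `count_duplicate_rows` are hypothetical, and the duplicate-row check is an extra assumption about what "data quality" should cover here.

```python
import pandas as pd


def quality_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing-value count, distinct values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),   # spot columns parsed as object/str
        "n_missing": frame.isnull().sum(),   # same figure as df.isnull().sum()
        "n_unique": frame.nunique(),         # low counts hint at categorical codes
    })


def count_duplicate_rows(frame: pd.DataFrame) -> int:
    """Count rows that exactly repeat an earlier row."""
    return int(frame.duplicated().sum())
```

In this notebook the helpers would be called as `quality_report(df)` and `count_duplicate_rows(df)`. A column with only a handful of distinct values is probably a categorical code rather than a continuous measurement, which matters when choosing an encoding later.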

## Target Variable Distribution

Understanding class balance is critical for:
- Model evaluation
- Metric selection
- Bias detection

In [None]:
# Plot class distribution
plt.figure(figsize=(6,4))
sns.countplot(x="target", data=df)
plt.title("Heart Disease Class Distribution")
plt.xlabel("Target Class")
plt.ylabel("Count")
plt.show()
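
A count plot shows the balance visually, but exact proportions are easier to quote when choosing metrics. The sketch below tabulates counts and shares per class; the helper name `class_balance` is hypothetical, and it assumes the label column is named `target` as described above.

```python
import pandas as pd


def class_balance(frame: pd.DataFrame, target_col: str = "target") -> pd.DataFrame:
    """Return the count and relative share of each class in target_col."""
    counts = frame[target_col].value_counts().sort_index()
    return pd.DataFrame({
        "count": counts,
        "share": counts / counts.sum(),  # fractions that sum to 1.0
    })
```

Calling `class_balance(df)` in this notebook would give the numbers behind the bar chart. If one class holds a much larger share than the other, accuracy alone becomes misleading and metrics such as recall or ROC AUC deserve more weight.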

## Distribution of Numerical Features

Histograms help understand:
- Feature spread
- Skewness
- Outliers

In [None]:
# Plot histograms for numerical features
df.hist(figsize=(14,10), bins=20)
plt.suptitle("Feature Distributions", fontsize=16)
plt.show()
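
Skewness and outliers can also be quantified rather than read off the histograms. The following sketch is one possible approach, not the notebook's original method: it uses the sample skewness from pandas and the common 1.5 × IQR rule for outliers. The helper name `skew_and_outliers` and the multiplier `k` are assumptions introduced for illustration.

```python
import pandas as pd


def skew_and_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Per numeric column: sample skewness and number of IQR-rule outliers."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # A value is flagged when it lies more than k * IQR beyond the quartiles.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return pd.DataFrame({
        "skew": numeric.skew(),
        "n_outliers": outside.sum(),
    })
```

In this notebook it would be called as `skew_and_outliers(df)`. The IQR rule is only meaningful for continuous measurements; for binary or category-coded columns the outlier count should be ignored.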

## Correlation Analysis

Correlation analysis helps identify:
- Strongly related features
- Redundant variables
- Potential multicollinearity

In [None]:
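# Compute the correlation matrix over numeric columns only
# (non-numeric columns would make DataFrame.corr() raise in pandas 2.x)
corr_matrix = df.select_dtypes(include="number").corr()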

# Plot heatmap
plt.figure(figsize=(12,8))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation Heatmap")
plt.show()
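
A heatmap with many cells is hard to scan, so the two questions it answers can be extracted directly: which features correlate most with the label, and which feature pairs are nearly redundant. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the label column is assumed to be `target`, and the `0.8` threshold is an arbitrary illustrative cut-off.

```python
import pandas as pd


def rank_by_target_correlation(frame: pd.DataFrame,
                               target_col: str = "target") -> pd.Series:
    """Pearson correlation of each numeric feature with the target,
    ordered by absolute value (strongest first), signs preserved."""
    corr = frame.select_dtypes(include="number").corr()[target_col]
    corr = corr.drop(target_col)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


def correlated_pairs(frame: pd.DataFrame, target_col: str = "target",
                     threshold: float = 0.8) -> list:
    """Feature pairs whose absolute correlation exceeds the threshold,
    returned as (feature_a, feature_b, abs_corr) tuples, strongest first."""
    features = frame.select_dtypes(include="number").drop(columns=[target_col])
    corr = features.corr().abs()
    cols = list(corr.columns)
    pairs = [
        (a, b, float(corr.loc[a, b]))
        for i, a in enumerate(cols)
        for b in cols[i + 1:]          # visit each unordered pair once
        if corr.loc[a, b] > threshold
    ]
    return sorted(pairs, key=lambda item: item[2], reverse=True)
```

In this notebook the calls would be `rank_by_target_correlation(df)` and `correlated_pairs(df)`. Pearson correlation measures linear association, so values for category-coded features such as chest pain type should be read with caution.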

## Key EDA Insights

- The dataset is clean with no missing values.
- The target variable shows moderate class balance.
- Certain features (age, cholesterol, chest pain type) show
  noticeable correlation with heart disease.
- Feature ranges vary widely, so scaling is advisable for scale-sensitive
  models (e.g., logistic regression, SVM, k-NN); tree-based models do not
  need it.

These insights guide preprocessing, model selection,
and evaluation strategies in later stages of the MLOps pipeline.
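
The scaling insight above can be sketched as a z-score standardization whose statistics come from the training split only, so that no information leaks from held-out data. This is an illustrative sketch rather than the pipeline's actual preprocessing step: the helper name `standardize` and its parameters are hypothetical, and a library scaler could be used instead.

```python
import pandas as pd


def standardize(train: pd.DataFrame, other: pd.DataFrame, columns: list):
    """Z-score the given columns in both frames using the training
    mean and standard deviation. Returns scaled copies; inputs are untouched."""
    mean = train[columns].mean()
    std = train[columns].std(ddof=0)   # population standard deviation

    scaled_train = train.copy()
    scaled_other = other.copy()
    scaled_train[columns] = (train[columns] - mean) / std
    scaled_other[columns] = (other[columns] - mean) / std
    return scaled_train, scaled_other
```

Only continuous columns should be passed in `columns`; binary and category-coded columns are usually left as they are or encoded separately.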