# HeartBeat AI Exploratory Data Analysis

End-to-end analysis of cardiac and respiratory auscultation signals covering data inspection, preprocessing, baseline modeling, evaluation, and interpretability.


## Notebook Roadmap
- Dataset structure overview & sanity checks
- Audio preprocessing experiments (waveform + log-mel)
- Baseline HeartBeat ensemble training runs
- Evaluation with multi-metric reporting & confusion matrix
- Prediction demos and Grad-CAM-style insights
- Feature interpretation: frequency bands, temporal textures, cadence


In [9]:
import sys
from pathlib import Path

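# Locate the repository root so that `src` is importable.
# `__file__` is undefined inside a notebook, so fall back to the parent of the
# working directory (this assumes the notebook runs from a first-level subfolder).
try:
    ROOT = Path(__file__).resolve().parents[1]
except NameError:
    ROOT = Path.cwd().parent

if str(ROOT) not in sys.path:
    sys.path.append(str(ROOT))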


In [16]:
from __future__ import annotations

import json
from pathlib import Path
from typing import Dict

import librosa
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import Audio, Image, display

from src.preprocessing import AudioPreprocessor
from src.train import train_pipeline


ModuleNotFoundError: No module named 'tensorflow'

## 1. Data Loading & Exploration
We mirror the required repository structure (`data/train/<class>` and `data/test/<class>`) and build a manifest to inspect class balance, durations, and sampling rates.


In [None]:
DATA_DIR = Path("../data/train")
CLASSES = [
    "normal_heart",
    "murmur",
    "extrasystole",
    "normal_resp",
    "wheeze",
    "crackle",
]


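def build_manifest(root: Path) -> pd.DataFrame:
    """Scan `root/<class>` for WAV files and record path, label, duration (s), and sampling rate."""
    columns = ["path", "label", "duration", "sr"]
    records = []
    for label in CLASSES:
        class_dir = root / label
        if not class_dir.exists():
            continue
        for audio_path in sorted(class_dir.rglob("*.wav")):
            try:
                # sr=None keeps the file's native sampling rate.
                y, sr = librosa.load(audio_path.as_posix(), sr=None, mono=True)
                duration = len(y) / sr
            except Exception:
                # Keep unreadable files in the manifest, flagged with NaN.
                duration = np.nan
                sr = np.nan
            records.append({"path": str(audio_path), "label": label, "duration": duration, "sr": sr})
    # Explicit columns keep `manifest.label` valid even when no files are found.
    return pd.DataFrame(records, columns=columns)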

manifest = build_manifest(DATA_DIR)
manifest.head()


In [None]:
if not manifest.empty:
    plt.figure(figsize=(8, 4))
    sns.countplot(data=manifest, x="label", order=CLASSES)
    plt.xticks(rotation=30, ha="right")
    plt.title("Class Distribution - Training Set")
    plt.show()
else:
    print("Manifest is empty. Populate data/train before running analysis.")
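
The countplot above covers class balance only. The following is a minimal sketch of the remaining checks (durations and sampling rates per class), assuming the manifest columns produced by `build_manifest`. The helper name `summarize_manifest` and the toy rows are hypothetical and only stand in for real data.

```python
import numpy as np
import pandas as pd


def summarize_manifest(df: pd.DataFrame) -> pd.DataFrame:
    """Per-class clip count, duration statistics, and number of distinct sampling rates."""
    grouped = df.groupby("label")
    summary = grouped.agg(
        clips=("path", "count"),
        mean_duration_s=("duration", "mean"),   # NaN durations are skipped
        min_duration_s=("duration", "min"),
        max_duration_s=("duration", "max"),
        n_sample_rates=("sr", "nunique"),       # NaN is not counted as a rate
    )
    # Files that failed to load carry a NaN duration in the manifest.
    summary["unreadable"] = grouped["duration"].apply(lambda s: int(s.isna().sum()))
    return summary


# Toy manifest (hypothetical values) with one unreadable file.
toy = pd.DataFrame({
    "path": ["a.wav", "b.wav", "c.wav", "d.wav"],
    "label": ["murmur", "murmur", "wheeze", "wheeze"],
    "duration": [2.0, 4.0, 5.0, np.nan],
    "sr": [4000, 4000, 8000, np.nan],
})
toy_summary = summarize_manifest(toy)
print(toy_summary)
```

A class with more than one distinct sampling rate is a cue that resampling matters before any fixed-length segmentation.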


In [None]:
def preview_sample(label: str):
    """Play one readable clip of the given class and plot its waveform."""
    subset = manifest[(manifest["label"] == label) & manifest["duration"].notna()]
    if subset.empty:
        print(f"No samples for {label}.")
        return
    sample_path = Path(subset.sample(1, random_state=0).iloc[0]["path"])
    y, sr = librosa.load(sample_path.as_posix(), sr=None)
    display(Audio(y, rate=sr))
    plt.figure(figsize=(10, 2))
    plt.plot(np.arange(len(y)) / sr, y)  # sample index -> seconds
    plt.xlabel("Time (s)")
    plt.ylabel("Amplitude")
    plt.title(f"Waveform preview: {label}")
    plt.show()

# preview_sample("normal_heart")


## 2. Data Preprocessing Experiments
We evaluate the waveform normalization, fixed-length segmentation, mel-spectrogram creation, and augmentation techniques used by `AudioPreprocessor`.


In [None]:
preprocessor = AudioPreprocessor(
    target_sample_rate=4000,
    segment_seconds=5.0,
    n_mels=128,
)

if not manifest.empty:
    example_path = Path(manifest.iloc[0].path)
    waveform = preprocessor.load_waveform(example_path)
    mel = preprocessor.to_mel_spectrogram(waveform)

    fig, ax = plt.subplots(1, 2, figsize=(12, 3))
    ax[0].plot(waveform)
    ax[0].set_title("Normalized Waveform")
    sns.heatmap(mel[..., 0], ax=ax[1], cmap="magma")
    ax[1].set_title("Log-Mel Spectrogram")
    plt.tight_layout()
else:
    print("Preprocessing demo awaiting audio files.")
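
To make the first two steps concrete, here is a minimal NumPy sketch of peak normalization and fixed-length segmentation. It is not the `AudioPreprocessor` implementation, whose internals are not shown in this notebook. The function names `peak_normalize` and `fix_length` are hypothetical, and the 4000 Hz / 5 s values simply reuse the constructor arguments above.

```python
import numpy as np


def peak_normalize(y: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale the waveform so its largest absolute amplitude is about 1."""
    if y.size == 0:
        return y
    return y / (np.max(np.abs(y)) + eps)


def fix_length(y: np.ndarray, sample_rate: int, segment_seconds: float) -> np.ndarray:
    """Truncate or zero-pad to exactly sample_rate * segment_seconds samples."""
    target = int(round(sample_rate * segment_seconds))
    if len(y) >= target:
        return y[:target]
    return np.pad(y, (0, target - len(y)))


# Synthetic 3-second, 50 Hz tone at a quarter of full scale (illustrative only).
sample_rate = 4000
t = np.arange(3 * sample_rate) / sample_rate
tone = 0.25 * np.sin(2 * np.pi * 50 * t)

segment = fix_length(peak_normalize(tone), sample_rate, segment_seconds=5.0)
print(segment.shape, float(np.max(np.abs(segment))))
```

Normalizing before padding keeps the padded zeros from affecting the scale factor; the real preprocessor may order these steps differently.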


In [None]:
if not manifest.empty:
    aug_wave, aug_meta = preprocessor.augment(waveform)
    plt.figure(figsize=(10, 2))
    plt.plot(aug_wave)
    plt.title(f"Augmented waveform (noise={aug_meta['noise_factor']:.3f}, shift={aug_meta['time_shift']})")
    plt.show()
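
The plot title above reads `noise_factor` and `time_shift` from the augmentation metadata. A minimal sketch of an augmentation that would produce those two fields is shown below, assuming additive Gaussian noise and a circular shift. This is an illustration, not the behavior of `preprocessor.augment`; the function `augment_sketch` and its `max_noise` / `max_shift` parameters are hypothetical.

```python
import numpy as np


def augment_sketch(y, rng, max_noise=0.02, max_shift=400):
    """Add Gaussian noise and circularly shift the waveform; return it with its metadata."""
    noise_factor = float(rng.uniform(0.0, max_noise))
    time_shift = int(rng.integers(-max_shift, max_shift + 1))  # in samples
    noisy = y + noise_factor * rng.standard_normal(len(y))
    shifted = np.roll(noisy, time_shift)  # samples pushed past one end wrap to the other
    meta = {"noise_factor": noise_factor, "time_shift": time_shift}
    return shifted.astype(np.float32), meta


rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 50 * np.arange(4000) / 4000).astype(np.float32)
aug_demo, aug_demo_meta = augment_sketch(clean, rng)
print(aug_demo.shape, aug_demo_meta)
```

A circular shift keeps the segment length fixed, at the cost of wrapping the tail of the recording to the front; zero-filling the gap is an alternative if that wrap-around is undesirable.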


## 3. Baseline Training Runs
The training routine in `src/train.py` orchestrates waveform/spectrogram preparation, model creation, callbacks, evaluation, and artifact persistence.


In [None]:
# Uncomment when data directories contain audio files.
# results = train_pipeline(Path("../data"), epochs=20, batch_size=32)
# results


## 4. Model Evaluation & Diagnostics
We track accuracy, precision/recall/F1, confusion matrix, and training curves. Outputs are persisted to `outputs/` and `monitoring/metrics.json` for the dashboard/API.


In [None]:
metrics_file = Path("../monitoring/metrics.json")
if metrics_file.exists():
    metrics: Dict[str, object] = json.loads(metrics_file.read_text())
    display(metrics)  # a bare expression inside an if-block is not echoed by Jupyter
else:
    print("Run training to generate metrics.json")


In [None]:
conf_plot = Path("../outputs/confusion_matrix.png")
if conf_plot.exists():
    display(Image(filename=str(conf_plot)))
else:
    print("Confusion matrix plot pending training run.")
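
For reference, the per-class metrics listed at the top of this section can all be derived from a confusion matrix. The sketch below assumes rows are true classes and columns are predicted classes (the scikit-learn convention); the layout actually written by the training run is not shown here, and `per_class_metrics` and the 2x2 example matrix are hypothetical.

```python
import numpy as np


def per_class_metrics(conf):
    """Accuracy plus per-class precision, recall, and F1 from a square confusion matrix."""
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    predicted = conf.sum(axis=0)  # column sums: how often each class was predicted
    actual = conf.sum(axis=1)     # row sums: how often each class truly occurred
    # Classes that are never predicted (or never occur) get 0 instead of a division error.
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {
        "accuracy": float(tp.sum() / conf.sum()),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up two-class example: 20 clips, 3 misclassified.
demo_metrics = per_class_metrics([[8, 2], [1, 9]])
print(demo_metrics)
```

With six imbalanced classes, accuracy alone can hide a class that is never predicted, which is why the per-class recall row is worth reading next to the matrix.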


## 5. Prediction Examples & Grad-CAM
After training, we will load the persisted model through `HeartbeatPredictor` to demonstrate single/batch inference, confidence spreads, and Grad-CAM overlays on spectrograms.


In [None]:
# from src.prediction import HeartbeatPredictor
# predictor = HeartbeatPredictor(model_path=Path("../models/heartbeat_model.h5"))
# predictor.load()
# predictor.predict(Path("../data/test/normal_heart/example.wav"))
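
One simple way to report a confidence spread is the gap between the top two class probabilities. The sketch below assumes the predictor yields one probability per entry of `CLASSES`; the output format of `HeartbeatPredictor` is not shown in this notebook, so `confidence_summary` and the example probabilities are hypothetical.

```python
import numpy as np

CLASSES = ["normal_heart", "murmur", "extrasystole", "normal_resp", "wheeze", "crackle"]


def confidence_summary(probs, classes=CLASSES):
    """Return the top class, its probability, and the margin over the runner-up."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]  # indices from most to least likely
    top, runner_up = order[0], order[1]
    return {
        "label": classes[top],
        "confidence": float(probs[top]),
        "runner_up": classes[runner_up],
        "margin": float(probs[top] - probs[runner_up]),
    }


# Made-up probability vector for a single clip.
demo_summary = confidence_summary([0.05, 0.70, 0.15, 0.04, 0.03, 0.03])
print(demo_summary)
```

A small margin flags clips where the model is torn between two classes, which are good candidates for Grad-CAM inspection.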


## 6. Feature Interpretation
We will interpret three clinically relevant feature families:
1. **Frequency bands** (low-frequency murmurs vs. high-frequency wheezes) using mel-bin statistics.
2. **Temporal textures** with short-time energy and spectral flux to highlight extra sounds.
3. **Cadence/shape** via autocorrelation and envelope tracking to monitor arrhythmias.

Each subsection will include quantitative plots + text summaries connecting trends to medical reasoning.
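
For the temporal-texture family (item 2), the sketch below computes short-time energy and a half-wave-rectified spectral flux with plain NumPy. It is one common formulation, not a method fixed by this project; the function names, the 256-sample frame, and the 128-sample hop are assumptions chosen for a 4000 Hz signal.

```python
import numpy as np


def frame_signal(y, frame_length, hop_length):
    """Slice a 1-D signal into overlapping frames of shape (n_frames, frame_length)."""
    n_frames = 1 + (len(y) - frame_length) // hop_length
    idx = np.arange(frame_length)[None, :] + hop_length * np.arange(n_frames)[:, None]
    return y[idx]


def short_time_energy(y, frame_length=256, hop_length=128):
    """Mean squared amplitude of each frame."""
    return (frame_signal(y, frame_length, hop_length) ** 2).mean(axis=1)


def spectral_flux(y, frame_length=256, hop_length=128):
    """Frame-to-frame increase in magnitude spectrum (decreases are clipped to zero)."""
    frames = frame_signal(y, frame_length, hop_length) * np.hanning(frame_length)
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    rise = np.maximum(np.diff(magnitude, axis=0), 0.0)
    return np.sqrt((rise ** 2).sum(axis=1))


# One second of silence with a short burst, standing in for an extra sound.
signal = np.zeros(4000)
signal[2000:2100] = 1.0

energy = short_time_energy(signal)
flux = spectral_flux(signal)
print("loudest frame:", int(energy.argmax()), "| flux onset frame:", int(flux.argmax()))
```

Energy peaks where the burst is fully inside a frame, while flux responds at its onset and stays at zero once the burst has passed, which is what makes the pair useful for locating extra sounds between regular beats.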


In [None]:
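def frequency_band_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Mean mel power per clip in three coarse bands (16 low, 16 mid, 32 high mel bins)."""
    rows = []
    # Skip files that failed to load while building the manifest.
    for row in df.dropna(subset=["duration"]).itertuples():
        y, sr = librosa.load(row.path, sr=4000)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
        bands = mel.mean(axis=1)  # average over time: one value per mel bin
        rows.append({
            "label": row.label,
            "low_band": bands[:16].mean(),
            "mid_band": bands[16:32].mean(),
            "high_band": bands[32:].mean(),
        })
    return pd.DataFrame(rows)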

# band_stats = frequency_band_summary(manifest)
# band_stats.groupby("label").mean()
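
For the cadence family (item 3), the sketch below tracks an amplitude envelope and reads the dominant beat period from its autocorrelation. It is a minimal illustration under stated assumptions: the function names, the 50 ms smoothing window, and the 0.3 to 2.0 s search range (30 to 200 beats per minute) are hypothetical choices, and the input is a synthetic beat train rather than a recording.

```python
import numpy as np


def amplitude_envelope(y, sample_rate, smooth_seconds=0.05):
    """Rectify the signal and smooth it with a moving average."""
    win = max(1, int(sample_rate * smooth_seconds))
    return np.convolve(np.abs(y), np.ones(win) / win, mode="same")


def dominant_period_seconds(envelope, sample_rate, min_period=0.3, max_period=2.0):
    """Lag (in seconds) of the strongest autocorrelation peak within the search range."""
    env = envelope - envelope.mean()
    n = len(env)
    lo = int(min_period * sample_rate)
    hi = min(int(max_period * sample_rate), n - 1)
    if lo >= hi:
        raise ValueError("signal is too short for the requested period range")
    # Zero-padded FFT gives the linear (non-circular) autocorrelation.
    spectrum = np.fft.rfft(env, n=2 * n)
    autocorr = np.fft.irfft(spectrum * np.conj(spectrum))[:n]
    return (lo + int(np.argmax(autocorr[lo:hi + 1]))) / sample_rate


# Synthetic beat train: 0.1 s bursts of a 50 Hz tone every 0.8 s (75 beats per minute).
sample_rate = 4000
beats = np.zeros(8 * sample_rate)
burst = np.sin(2 * np.pi * 50 * np.arange(int(0.1 * sample_rate)) / sample_rate)
for k in range(10):
    start = int((0.2 + 0.8 * k) * sample_rate)
    beats[start:start + len(burst)] = burst

period = dominant_period_seconds(amplitude_envelope(beats, sample_rate), sample_rate)
bpm = 60.0 / period
print(f"period = {period:.3f} s, rate = {bpm:.1f} bpm")
```

On real recordings an irregular rhythm weakens and broadens the autocorrelation peak, so the peak height relative to lag zero is as informative as its position.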


> **Next steps:** populate the dataset, execute the cells above, capture plots/tables for the README and dashboard, and document insights gathered from each feature slice.
