# 💡 *If this notebook helps you understand the ECG reconstruction process, please consider giving it an upvote. It really helps others discover it too!*

**Main idea:**  
1) From the training CSVs we build per-lead *shape templates* by detecting R-peaks, cutting beat-aligned windows, and taking a **median beat** per lead.  
2) At test time, for each requested lead we **tile** that template beat to the target length using a few **BPM hypotheses** (beats-per-minute).  
3) We apply a light **low-pass filter**, normalize to **shape space (z-score)**, and make a **micro-ensemble** of:  
   - best BPM by autocorrelation,  
   - a fixed prior BPM for that lead,  
   - a plain per-lead mean (resampled) template.  
4) For limb leads, we optionally apply **Einthoven's law** consistency by deriving III/aVR/aVL/aVF from I and II and **blending**.  
5) Finally we scale to the required output range and write `submission.csv`.

*The goal is to understand the structure of ECG signals and how careful signal processing can go surprisingly far, even without machine learning.*

## Config

In [None]:

import os
import numpy as np
import pandas as pd
from tqdm.auto import tqdm
from scipy.signal import butter, filtfilt, find_peaks


TRAIN_DIR = '/kaggle/input/physionet-ecg-image-digitization/train/'
TRAIN_CSV = '/kaggle/input/physionet-ecg-image-digitization/train.csv'
TEST_DIR  = '/kaggle/input/physionet-ecg-image-digitization/test/'
TEST_CSV  = '/kaggle/input/physionet-ecg-image-digitization/test.csv'


# Config
LEADS = ['I','II','III','aVR','aVL','aVF','V1','V2','V3','V4','V5','V6']

# Beat extraction window around R 
R_PRE_S   = 0.20
R_POST_S  = 0.40
BEAT_LEN  = 360  # resample each beat to this length (shape template)

# Band-pass for R detection & morphology preservation
BP_LO_HZ  = 5.0
BP_HI_HZ  = 25.0
BP_ORDER  = 2

# BPM sweep (choose best by autocorr peak score)
BPM_CANDIDATES = [55, 65, 75, 85, 95]

# Output scale 
MIN_VAL, MAX_VAL = 0.0, 0.09

# Einthoven blend weight
EINTHOVEN_BLEND_W = 0.6 

# Micro-ensemble weights (order: best-BPM, prior-BPM, plain-mean branches)
ENSEMBLE_W = np.array([0.5, 0.3, 0.2], dtype=np.float32)


### Explanation
- **Beat window** (`R_PRE_S`, `R_POST_S`): captures P–QRS–T around each R-peak (0.20 s before, 0.40 s after).  
- **Band-pass 5–25 Hz**: emphasizes QRS energy for R-peak detection; the beats themselves are cut from the unfiltered signal, so their morphology is kept.  
- **BPM candidates**: a small sweep of plausible resting rates; we pick the one that yields the strongest autocorrelation peak.  
- **Einthoven blend**: pulls the limb leads toward the values implied by Einthoven's relations (I, II → III/aVR/aVL/aVF), so they contradict each other less.  
- **Ensemble**: stabilizes predictions by mixing hypotheses instead of trusting a single BPM guess.
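To make the ensemble bullet concrete, here is a minimal sketch of mixing three branches in shape space. Only the weights `[0.5, 0.3, 0.2]` come from the config above; the three toy branches are made-up stand-ins for the best-BPM, prior-BPM, and plain-mean outputs.

```python
import numpy as np

def zscore(x):
    """Normalize to zero mean and unit variance (shape space)."""
    x = np.asarray(x, np.float32)
    return (x - x.mean()) / (x.std() + 1e-8)

# Toy branches (assumptions for illustration, not real templates).
t = np.linspace(0.0, 1.0, 500, dtype=np.float32)
branch_best = np.sin(2 * np.pi * 5 * t)         # stand-in for the best-BPM tiling
branch_prior = np.sin(2 * np.pi * 6 * t)        # stand-in for the prior-BPM tiling
branch_mean = 0.1 * np.cos(2 * np.pi * 1 * t)   # stand-in for the plain mean template

# Normalize the weights so they sum to 1, then take the weighted sum.
weights = np.array([0.5, 0.3, 0.2], dtype=np.float32)
weights = weights / weights.sum()

stack = np.stack([zscore(branch_best), zscore(branch_prior), zscore(branch_mean)])
y_ens = weights @ stack   # shape (500,)
```

Because every branch is z-scored first, the weights control how much each *shape* contributes, regardless of the original amplitude of the branch.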

## Utilities

In [None]:
def zscore(x):
    import numpy as np
    x = np.asarray(x, np.float32)
    s = np.std(x) + 1e-8
    return (x - np.mean(x)) / s

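def bandpass(x, fs, lo=BP_LO_HZ, hi=BP_HI_HZ, order=BP_ORDER):
    import numpy as np
    from scipy.signal import butter, filtfilt
    x = np.asarray(x, np.float32)
    nyq = 0.5 * fs
    lo_n = max(lo/nyq, 1e-3)
    hi_n = min(hi/nyq, 0.99)
    if hi_n <= lo_n:
        return x  # pass band collapses at this sampling rate
    b, a = butter(order, [lo_n, hi_n], btype='band')
    # filtfilt pads each end by 3 * max(len(a), len(b)) samples by default
    # and raises ValueError when the input is not longer than that
    if len(x) <= 3 * max(len(a), len(b)):
        return x
    return filtfilt(b, a, x).astype(np.float32)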

def soft_minmax_scale(x, lo=MIN_VAL, hi=MAX_VAL):
    import numpy as np
    x = np.asarray(x, np.float32)
    mn, mx = float(np.min(x)), float(np.max(x))
    if not np.isfinite(mn) or not np.isfinite(mx) or mx <= mn:
        return np.full_like(x, (lo+hi)/2, np.float32)
    y = (x - mn) / (mx - mn)
    return (lo + y * (hi - lo)).astype(np.float32)

def autocorr_peak_score(y, fs, min_rr_s=0.35, max_rr_s=1.5):
    """Return normalized peak of autocorrelation within plausible RR range."""
    import numpy as np
    y = zscore(y)
    ac = np.correlate(y, y, mode='full')[len(y)-1:]
    lo = int(max(min_rr_s * fs, 1))
    hi = int(min(max_rr_s * fs, len(ac)-1))
    if hi <= lo:
        return 0.0
    segment = ac[lo:hi]
    if segment.size == 0:
        return 0.0
    peak = float(segment.max())
    norm = float(ac[0]) + 1e-8
    return float(np.clip(peak / norm, 0.0, 1.0))

def resample_to_length(x, n):
    import numpy as np
    return np.interp(
        np.linspace(0, 1, n, dtype=np.float32),
        np.linspace(0, 1, len(x), dtype=np.float32),
        x
    ).astype(np.float32)

def apply_lowpass(x, fs, cutoff=15.0, order=2):
    import numpy as np
    from scipy.signal import butter, filtfilt
    if len(x) <= 10:
        return x
    nyq = 0.5 * fs
    wn = min(cutoff/nyq, 0.99)
    b, a = butter(order, wn, btype='low')
    return filtfilt(b, a, x).astype(np.float32)

def scale_to_lead_range(x, lead_stat=None, lo=MIN_VAL, hi=MAX_VAL):
    """Final mapping: plain min-max scaling to [lo, hi] (`lead_stat` is accepted but unused)."""
    return soft_minmax_scale(x, lo, hi)

# Einthoven relations 
def derive_limb_leads_from_I_II(yI, yII):
    III = yII - yI
    aVR = -(yI + yII) / 2.0
    aVL = yI - 0.5 * yII
    aVF = yII - 0.5 * yI
    return {'III': III, 'aVR': aVR, 'aVL': aVL, 'aVF': aVF}

def soft_blend(a, b, w):
    return (1.0 - w) * a + w * b


**Key utilities**  
- `autocorr_peak_score`: quantifies how *periodic* a signal is at plausible RR intervals; used to pick the best BPM hypothesis.  
- `bandpass` & `apply_lowpass`: gentle filters to emphasize QRS and then smooth the final shape.  
- `resample_to_length`: ensures any template/lead matches the requested number of rows.
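As a quick sanity check of the periodicity score, the sketch below re-implements `autocorr_peak_score` in a self-contained form and compares a clean 75 BPM sine with white noise. The sampling rate, duration, and test signals are assumptions chosen for illustration.

```python
import numpy as np

def autocorr_peak_score(y, fs, min_rr_s=0.35, max_rr_s=1.5):
    """Normalized autocorrelation peak within a plausible RR-lag range."""
    y = np.asarray(y, np.float64)
    y = (y - y.mean()) / (y.std() + 1e-8)
    ac = np.correlate(y, y, mode="full")[len(y) - 1:]   # lags 0..N-1
    lo = int(max(min_rr_s * fs, 1))
    hi = int(min(max_rr_s * fs, len(ac) - 1))
    if hi <= lo:
        return 0.0
    return float(np.clip(ac[lo:hi].max() / (ac[0] + 1e-8), 0.0, 1.0))

fs = 250                                   # assumed sampling rate
t = np.arange(2000) / fs                   # 8 seconds
periodic = np.sin(2 * np.pi * 1.25 * t)    # 1.25 Hz = 75 BPM, period of 200 samples
noise = np.random.default_rng(0).standard_normal(t.size)

score_periodic = autocorr_peak_score(periodic, fs)
score_noise = autocorr_peak_score(noise, fs)
print(f"periodic: {score_periodic:.2f}, noise: {score_noise:.2f}")
```

The periodic score lands near 0.9 rather than 1.0: the raw autocorrelation at lag `k` sums only `N - k` products, so even a perfectly periodic series scores roughly `(N - k) / N` at its period.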

## Build per-lead stats & beat-aligned templates from the train set

We **detect R-peaks** on a filtered/z-scored signal, extract windows around each peak, resample to a fixed `BEAT_LEN`, and compute the **median beat** per lead.  
We also collect **lead-wise BPM samples** to form a **prior** (median BPM) per lead.
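Before reading the full function, it may help to see the same steps on a toy signal. The sketch below is only an illustration under stated assumptions: the synthetic signal (Gaussian bumps standing in for QRS and T waves), the sampling rate, and the fixed `prominence=0.5` are all made up, and one beat is deliberately exaggerated to show why a median is used instead of a mean.

```python
import numpy as np
from scipy.signal import find_peaks

fs = 250                                   # assumed sampling rate
n = 2500
idx = np.arange(n, dtype=np.float64)
r_locs = np.arange(150, n - 150, 200)      # one beat every 0.8 s (75 BPM)
amps = np.ones(len(r_locs))
amps[4] = 2.0                              # one exaggerated beat acting as an outlier

# Synthetic signal: narrow "QRS" bump plus a broad "T wave" per beat.
sig = np.zeros(n)
for r, amp in zip(r_locs, amps):
    sig += amp * np.exp(-0.5 * ((idx - r) / 3.0) ** 2)
    sig += amp * 0.2 * np.exp(-0.5 * ((idx - r - 60) / 15.0) ** 2)

# R-peak detection with a refractory distance of 0.35 s.
pks, _ = find_peaks(sig, distance=int(0.35 * fs), prominence=0.5)

# Cut a window of 0.20 s before and 0.40 s after each R, resample to 360 samples.
n_pre, n_post = int(round(0.20 * fs)), int(round(0.40 * fs))
beat_len = 360
beats = []
for pk in pks:
    start, stop = pk - n_pre, pk + n_post
    if start < 0 or stop >= n:
        continue                           # skip beats cut off by the record edges
    seg = sig[start:stop + 1]
    beats.append(np.interp(np.linspace(0, 1, beat_len),
                           np.linspace(0, 1, len(seg)), seg))
beats = np.vstack(beats)

median_beat = np.median(beats, axis=0)     # robust to the exaggerated beat
mean_beat = beats.mean(axis=0)             # pulled upward by the exaggerated beat
```

In this toy case the median beat keeps the typical R amplitude, while the mean is inflated by the single outlier beat.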

In [None]:
def build_per_lead_stats_and_beats(train_csv, train_dir, leads=LEADS):
    import os
    import numpy as np
    import pandas as pd
    from tqdm.auto import tqdm
    from scipy.signal import find_peaks

    meta = pd.read_csv(train_csv)
    lead_vals = {ld: [] for ld in leads}
    lead_beats = {ld: [] for ld in leads}
    lead_bpm_samples = {ld: [] for ld in leads}

    for row in tqdm(meta.itertuples(index=False), total=len(meta), desc="Scan train"):
        rid = str(row.id)
        fs  = int(row.fs)
        csvp = os.path.join(train_dir, rid, f"{rid}.csv")
        if not os.path.exists(csvp):
            continue
        try:
            df = pd.read_csv(csvp)
        except Exception:  # skip unreadable files, but let KeyboardInterrupt through
            continue

        for ld in leads:
            if ld not in df.columns:
                continue
            y = df[ld].dropna().to_numpy(np.float32)
            if y.size < 200:
                continue

            # Aggregate stats pool (raw)
            lead_vals[ld].append(y)

            # R-peak detection on band-passed signal
            y_bp = bandpass(zscore(y), fs)
            prominence = max(0.4 * np.std(y_bp), 0.15)
            distance = int(0.35 * fs)
            pks, _ = find_peaks(y_bp, distance=distance, prominence=prominence)
            if len(pks) < 2:
                continue

            # BPM estimate for this record & lead
            rr = np.diff(pks) / float(fs)
            rr = rr[(rr > 0.3) & (rr < 2.0)]
            if rr.size >= 1:
                bpm = float(np.clip(60.0 / np.median(rr), 40.0, 160.0))
                lead_bpm_samples[ld].append(bpm)

            # Extract beats around R and resample to BEAT_LEN
            n_pre  = int(round(R_PRE_S * fs))
            n_post = int(round(R_POST_S * fs))
            for pk in pks:
                a, b = pk - n_pre, pk + n_post
                if a < 0 or b >= len(y):
                    continue
                seg = y[a:b+1].astype(np.float32)
                seg_rs = np.interp(
                    np.linspace(0, 1, BEAT_LEN, dtype=np.float32),
                    np.linspace(0, 1, len(seg), dtype=np.float32),
                    seg
                ).astype(np.float32)
                lead_beats[ld].append(seg_rs)


    lead_stats = {}
    for ld in leads:
        if len(lead_vals[ld]) == 0:
            lead_stats[ld] = {'mean': 0.0, 'std': 0.1, 'median': 0.0, 'min': -0.5, 'max': 0.5}
            continue
        vals = np.concatenate(lead_vals[ld]).astype(np.float32)
        if vals.size == 0:
            lead_stats[ld] = {'mean': 0.0, 'std': 0.1, 'median': 0.0, 'min': -0.5, 'max': 0.5}
        else:
            lead_stats[ld] = {
                'mean': float(np.mean(vals)),
                'std':  float(np.std(vals)) if vals.size > 1 else 0.1,
                'median': float(np.median(vals)),
                'min': float(np.min(vals)),
                'max': float(np.max(vals))
            }
    return lead_stats, lead_beats, lead_bpm_samples

def build_lead_templates(lead_beats, leads=LEADS):
    import numpy as np
    lead_template = {}
    for ld in leads:
        if len(lead_beats[ld]) > 0:
            arr = np.vstack(lead_beats[ld]).astype(np.float32)
            tpl = np.median(arr, axis=0).astype(np.float32)
        else:
            t = np.linspace(0, 1, BEAT_LEN, dtype=np.float32)
            tpl = np.sin(2 * np.pi * t).astype(np.float32)
        lead_template[ld] = zscore(tpl)
    return lead_template


## Plain per-lead mean templates 


This branch **ignores beats**: it z-scores each training signal, resamples it to a fixed length, and averages per lead across the training set.  
Because the records are not beat-aligned, the average is heavily smoothed, so it acts as a conservative, low-variance baseline in the ensemble.
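The resample-then-average step can be sketched on toy data as follows. The sine records, their lengths, and the helper name `mean_template` are assumptions for illustration; the point is to show how the result depends on whether the inputs happen to be in phase.

```python
import numpy as np

def zscore(x):
    x = np.asarray(x, np.float64)
    return (x - x.mean()) / (x.std() + 1e-8)

def resample_to_length(x, n):
    """Linear interpolation onto n evenly spaced points."""
    x = np.asarray(x, np.float64)
    return np.interp(np.linspace(0, 1, n), np.linspace(0, 1, len(x)), x)

def mean_template(signals, template_len=500):
    """Hypothetical helper: z-score, resample, and average a list of signals."""
    rows = [resample_to_length(zscore(s), template_len) for s in signals]
    return np.mean(np.vstack(rows), axis=0)

# Toy "records" of different lengths, each holding three cycles of a sine.
lengths = [1000, 2500, 5000]
phases = [0.0, 2 * np.pi / 3, 4 * np.pi / 3]
aligned = [np.sin(2 * np.pi * 3 * np.linspace(0, 1, n)) for n in lengths]
shifted = [np.sin(2 * np.pi * 3 * np.linspace(0, 1, n) + ph)
           for n, ph in zip(lengths, phases)]

tpl_aligned = mean_template(aligned)   # in-phase inputs: the shape survives
tpl_shifted = mean_template(shifted)   # out-of-phase inputs: the shape cancels
print(np.std(tpl_aligned), np.std(tpl_shifted))
```

When the inputs are in phase the averaged template keeps its shape; when they are spread in phase it flattens toward zero. Real records fall somewhere in between, which is one reason this branch gets the smallest ensemble weight.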


In [None]:
def make_plain_mean_template(train_csv, train_dir, leads=LEADS, template_len=500):
    import os
    import numpy as np
    import pandas as pd
    meta = pd.read_csv(train_csv)
    lead_means = {}
    for ld in leads:
        resamp_signals = []
        for row in meta.itertuples(index=False):
            rid = str(row.id)
            csvp = os.path.join(train_dir, rid, f"{rid}.csv")
            if not os.path.exists(csvp):
                continue
            try:
                df = pd.read_csv(csvp)
            except Exception:  # skip unreadable files, but let KeyboardInterrupt through
                continue
            if ld not in df.columns:
                continue
            s = df[ld].dropna().to_numpy(np.float32)
            if s.size < 50:
                continue
            s_norm = (s - np.mean(s)) / (np.std(s) + 1e-8)
            s_rs = np.interp(
                np.linspace(0, 1, template_len, dtype=np.float32),
                np.linspace(0, 1, len(s_norm), dtype=np.float32),
                s_norm
            ).astype(np.float32)
            resamp_signals.append(s_rs)
        if len(resamp_signals) > 0:
            lead_means[ld] = np.mean(np.vstack(resamp_signals), axis=0).astype(np.float32)
        else:
            t = np.linspace(0, 1, template_len, dtype=np.float32)
            lead_means[ld] = np.sin(2*np.pi*t).astype(np.float32)
    return lead_means


## BPM tiling & selection via autocorrelation

In [None]:

def tile_template(template_beat, fs, n_out, bpm, amp=1.0):
    import numpy as np
    beat_samples = max(4, int(round((60.0 / max(bpm, 1e-6)) * fs)))
    one = np.interp(
        np.linspace(0, 1, beat_samples, dtype=np.float32),
        np.linspace(0, 1, len(template_beat), dtype=np.float32),
        template_beat
    ).astype(np.float32)
    reps = int(np.ceil(n_out / len(one)))
    y = np.tile(one, reps)[:n_out]
    y = zscore(y) * float(amp)
    return y.astype(np.float32)

def choose_best_bpm(template_beat, fs, n_out, bpm_list=BPM_CANDIDATES):
    best_bpm, best_score, best_y = None, -1.0, None
    for bpm in bpm_list:
        y = tile_template(template_beat, fs, n_out, bpm, amp=1.0)
        sc = autocorr_peak_score(y, fs)
        if sc > best_score:
            best_bpm, best_score, best_y = bpm, sc, y
    return best_bpm, best_y


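We **stretch and tile** the median-beat template to match a candidate BPM and the target number of samples.  
We then pick the BPM whose synthesized series shows the **strongest periodicity** (highest autocorrelation peak in plausible RR lags).  
Note that the score is computed on the synthesized series itself, so the choice depends only on the template, `fs`, and the target length, not on the test record.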
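The BPM-to-samples arithmetic behind the tiling can be sketched like this. The Hann-window "template", `fs = 500`, and the output length are assumptions for illustration, and `tile_beat` is a simplified stand-in for `tile_template` (without the final z-score).

```python
import numpy as np

def tile_beat(template_beat, fs, n_out, bpm):
    """Stretch one beat to 60/bpm seconds and repeat it up to n_out samples."""
    beat_samples = max(4, int(round(60.0 / bpm * fs)))
    one = np.interp(np.linspace(0, 1, beat_samples),
                    np.linspace(0, 1, len(template_beat)),
                    template_beat)
    reps = int(np.ceil(n_out / beat_samples))
    return np.tile(one, reps)[:n_out], beat_samples

template = np.hanning(360)     # toy single-bump "beat" (assumption)
fs, n_out = 500, 5000          # assumed sampling rate and target length

y, period = tile_beat(template, fs, n_out, bpm=75)
print(period, len(y))          # 75 BPM at 500 Hz gives 400 samples per beat

# Samples per beat for each candidate in the sweep.
periods = {bpm: int(round(60.0 / bpm * fs)) for bpm in (55, 65, 75, 85, 95)}
print(periods)
```

Each candidate BPM simply changes how many samples one beat occupies; the template shape is the same, only stretched or compressed in time.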

## Step 1–2: Build assets (stats, beat templates, mean templates, BPM priors)

In [None]:

print("[1/4] Scanning train to build stats & beat-aligned templates...")
lead_stats, lead_beats, lead_bpms = build_per_lead_stats_and_beats(TRAIN_CSV, TRAIN_DIR, LEADS)
lead_template = build_lead_templates(lead_beats, LEADS)

print("[2/4] Building plain mean templates (ensemble branch)...")
mean_templates = make_plain_mean_template(TRAIN_CSV, TRAIN_DIR, LEADS, template_len=500)

per_lead_bpm_prior = {}
for ld in LEADS:
    if len(lead_bpms[ld]) > 0:
        per_lead_bpm_prior[ld] = float(np.median(np.array(lead_bpms[ld], dtype=np.float32)))
    else:
        per_lead_bpm_prior[ld] = 75.0  # sensible fallback


## Step 3–4: Predict test with micro-ensemble and Einthoven blending

In [None]:

print("[3/4] Predicting test...")
test = pd.read_csv(TEST_CSV)

# Pre-group rows by record id to handle Einthoven consistency across a record
records = {}
for r in test.itertuples(index=False):
    records.setdefault(int(r.id), []).append(r)

predictions = {}

for rid, items in tqdm(records.items(), desc="Records"):
    tmp_store = {}     
    scales_store = {}  

    # Synthesize each requested lead
    for r in items:
        lead = str(r.lead)
        fs   = int(r.fs)
        n    = int(r.number_of_rows)

        tpl_beat = lead_template.get(lead, lead_template['II'])

        # BPM sweep branch
        best_bpm, y_best = choose_best_bpm(tpl_beat, fs, n, BPM_CANDIDATES)

        # Fixed bpm branch (median prior per lead)
        bpm_fixed = per_lead_bpm_prior.get(lead, 75.0)
        y_fixed   = tile_template(tpl_beat, fs, n, bpm_fixed, amp=1.0)

        # Plain mean template branch
        y_mean = resample_to_length(mean_templates.get(lead, mean_templates['II']), n)

        # Light low-pass on each branch
        y_best  = apply_lowpass(y_best, fs, cutoff=15.0, order=2)
        y_fixed = apply_lowpass(y_fixed, fs, cutoff=15.0, order=2)
        y_mean  = apply_lowpass(y_mean, fs,  cutoff=15.0, order=2)

        # Normalize branches to shape space
        B = zscore(y_best)
        F = zscore(y_fixed)
        M = zscore(y_mean)

        # Micro-ensemble in shape space
        w = ENSEMBLE_W / (np.sum(ENSEMBLE_W) + 1e-8)
        y_syn = (w[0]*B + w[1]*F + w[2]*M).astype(np.float32)

        tmp_store[lead] = y_syn
        scales_store[lead] = (fs, n)

    #  Einthoven pass 
    if EINTHOVEN_BLEND_W > 0.0 and 'I' in tmp_store and 'II' in tmp_store:
        for dlead in ['III', 'aVR', 'aVL', 'aVF']:
            if dlead not in scales_store:
                continue
            _, n_d = scales_store[dlead]
            yI_rs  = resample_to_length(tmp_store['I'],  n_d)
            yII_rs = resample_to_length(tmp_store['II'], n_d)
            derived_all = derive_limb_leads_from_I_II(yI_rs, yII_rs)
            ydrv = zscore(derived_all[dlead])

            if dlead in tmp_store and len(tmp_store[dlead]) == n_d:
                tmp_store[dlead] = soft_blend(tmp_store[dlead], ydrv, EINTHOVEN_BLEND_W)
            else:
                tmp_store[dlead] = ydrv

    # Final scaling per lead 
    for lead, y in tmp_store.items():
        fs, n_rows = scales_store.get(lead, (500, len(y)))
        y_scaled = scale_to_lead_range(y, None, MIN_VAL, MAX_VAL)
        predictions[(rid, lead)] = y_scaled.astype(np.float32)


## Submission

In [None]:

print("[4/4] Writing submission.csv ...")
rows = []
for r in test.itertuples(index=False):
    rid   = int(r.id)
    lead  = str(r.lead)
    n     = int(r.number_of_rows)
    y     = predictions[(rid, lead)]
    if len(y) != n:
        y = resample_to_length(y, n)
    for i in range(n):
        rows.append((f"{rid}_{i}_{lead}", float(y[i])))

sub = pd.DataFrame(rows, columns=['id','value'])
sub.to_csv('submission.csv', index=False)
sub.head(7)



## Why this works (and when it might fail)
ECGs are quasi-periodic. A **median beat template** captures typical morphology while rejecting outliers.  
By **tiling** the template at plausible BPMs, we synthesize a full-length series that matches the requested
sampling rate and duration. The **autocorrelation** criterion encourages rhythmic consistency.
An **ensemble** of hypotheses hedges against a wrong BPM pick. **Einthoven blending** reduces contradictions
across limb leads by pulling III/aVR/aVL/aVF toward the values implied by I and II.
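The limb-lead relations used in `derive_limb_leads_from_I_II` can be checked numerically. The sketch below uses random arrays as stand-ins for leads I and II (an assumption for illustration) and also shows that the identities hold for raw amplitudes but not once each lead is z-scored separately.

```python
import numpy as np

def zscore(x):
    x = np.asarray(x, np.float64)
    return (x - x.mean()) / (x.std() + 1e-8)

def derive_limb_leads(lead_i, lead_ii):
    """Einthoven/Goldberger relations, same formulas as in the utilities cell."""
    return {
        "III": lead_ii - lead_i,
        "aVR": -(lead_i + lead_ii) / 2.0,
        "aVL": lead_i - 0.5 * lead_ii,
        "aVF": lead_ii - 0.5 * lead_i,
    }

rng = np.random.default_rng(0)
lead_i = rng.standard_normal(1000)    # toy stand-in for lead I
lead_ii = rng.standard_normal(1000)   # toy stand-in for lead II
d = derive_limb_leads(lead_i, lead_ii)

# Exact identities in amplitude space.
print(np.allclose(lead_i + d["III"], lead_ii))          # I + III = II
print(np.allclose(d["aVR"] + d["aVL"] + d["aVF"], 0))   # augmented leads sum to zero

# After separate z-scoring, the identity no longer holds exactly.
gap = zscore(lead_i) + zscore(d["III"]) - zscore(lead_ii)
print(float(np.abs(gap).max()))
```

Since this notebook blends in z-scored shape space and then min-max scales each lead separately, the relations are best read as a soft shape constraint rather than an exact amplitude identity.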

Edge cases include bigeminy/trigeminy, strong motion artifacts, or sudden rate changes within the record, since tiling a single beat assumes one regular rhythm.  

**Reproducibility:** No random seeds are required since the pipeline is deterministic.  
**Dependencies:** `numpy`, `pandas`, `scipy`, `tqdm`  (all standard on Kaggle).

## Final Thoughts

If this notebook helped you learn something new,  
**don't forget to leave an upvote 🚀** - it really helps others find it and keeps me motivated to share more tutorials like this!  

Good luck with your own experiments, and happy Kaggle-ing!