# ⚕️ PhysioNet ECG Image Digitization - In-Depth EDA

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 30px; border-radius: 15px; color: white; text-align: center;">
    <h1 style="margin: 0; font-size: 2.5em;">📊 Comprehensive Exploratory Data Analysis</h1>
    <p style="font-size: 1.2em; margin-top: 10px;">Converting Decades of ECG Images to Digital Time Series</p>
</div>

---

## 🎯 Competition Overview

**Challenge**: Extract time series data from ECG images to enable modern ML software to process billions of historical ECG recordings.

**Why This Matters**: 
- 🏥 Billions of ECG images exist globally as paper printouts, scans, and photos
- 🤖 Modern diagnostic software requires digital time series data
- 💡 Converting these images unlocks decades of medical data for AI training
- ❤️ Better models = improved cardiovascular diagnosis and treatment

**Key Challenge**: Physical printouts, scanning, and photography introduce artifacts (rotations, blurring, stains, damage) that make digitization difficult.

---

## 📚 What We'll Explore

1. **Dataset Structure** - Files, formats, and organization
2. **Image Properties** - Dimensions, quality, and variants
3. **Signal Analysis** - ECG waveforms and their characteristics
4. **Lead Configuration** - Understanding the 12-lead ECG layout
5. **Image Degradation** - Types and impact of artifacts
6. **Modeling Insights** - Key findings for building solutions

Let's dive in! 🚀

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
from pathlib import Path
from PIL import Image
import warnings
from scipy import signal
from collections import Counter

!pip install --upgrade plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

warnings.filterwarnings('ignore')

# Set style for beautiful plots
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (16, 8)
plt.rcParams['font.size'] = 11
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 12

# Define paths
DATA_PATH = Path('/kaggle/input/physionet-ecg-image-digitization')
TRAIN_PATH = DATA_PATH / 'train'
TEST_PATH = DATA_PATH / 'test'

print("✅ Libraries imported successfully!")
print(f"📁 Data path: {DATA_PATH}")

In [None]:
# Load training and test metadata
train_df = pd.read_csv(DATA_PATH / 'train.csv')
test_df = pd.read_csv(DATA_PATH / 'test.csv')

print(f"📊 Training samples: {len(train_df):,}")
print(f"📊 Test samples: {test_df['id'].nunique():,} unique IDs")
print(f"📊 Test rows: {len(test_df):,} (one per ID and lead)")
print(f"📊 Total values to predict: {test_df['number_of_rows'].sum():,}")

# Display first few rows
print("\n" + "="*80)
print("TRAINING DATA SAMPLE")
print("="*80)
display(train_df.head(10))

print("\n" + "="*80)
print("TEST DATA SAMPLE")
print("="*80)
display(test_df.head(15))

---
# 📈 1. Dataset Statistics

Understanding the distribution of sampling frequencies, signal lengths, and data characteristics.


In [None]:
# Analyze sampling frequency distribution
fs_counts = train_df['fs'].value_counts().sort_index()

fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# Bar plot
colors = sns.color_palette("rocket", len(fs_counts))
axes[0].bar(fs_counts.index.astype(str), fs_counts.values, color=colors, edgecolor='black', linewidth=1.5)
axes[0].set_xlabel('Sampling Frequency (Hz)', fontsize=13, fontweight='bold')
axes[0].set_ylabel('Count', fontsize=13, fontweight='bold')
axes[0].set_title('📊 Distribution of Sampling Frequencies', fontsize=15, fontweight='bold', pad=20)
axes[0].grid(axis='y', alpha=0.3, linestyle='--')

# Add value labels on bars
for i, (fs, count) in enumerate(zip(fs_counts.index, fs_counts.values)):
    axes[0].text(i, count + 10, f'{count}\n({count/len(train_df)*100:.1f}%)', 
                ha='center', va='bottom', fontweight='bold', fontsize=11)

# Pie chart
colors_pie = sns.color_palette("Set2", len(fs_counts))
wedges, texts, autotexts = axes[1].pie(fs_counts.values, labels=[f'{fs} Hz' for fs in fs_counts.index], 
                                        autopct='%1.1f%%', startangle=90, colors=colors_pie,
                                        explode=[0.05]*len(fs_counts), shadow=True)
axes[1].set_title('📊 Sampling Frequency Proportions', fontsize=15, fontweight='bold', pad=20)

for autotext in autotexts:
    autotext.set_color('white')
    autotext.set_fontweight('bold')
    autotext.set_fontsize(12)

plt.tight_layout()
plt.show()

print("\n" + "="*60)
print("SAMPLING FREQUENCY STATISTICS")
print("="*60)
for fs, count in fs_counts.items():
    print(f"  {fs:4d} Hz: {count:3d} samples ({count/len(train_df)*100:5.2f}%)")
print("="*60)

In [None]:
# Analyze signal lengths
sig_len_stats = train_df.groupby('fs')['sig_len'].agg(['min', 'max', 'mean', 'std', 'count'])

print("\n" + "="*80)
print("SIGNAL LENGTH STATISTICS BY SAMPLING FREQUENCY")
print("="*80)
display(sig_len_stats)

# Verify the relationship: sig_len = fs * 10 seconds
train_df['calculated_sig_len'] = train_df['fs'] * 10
train_df['sig_len_match'] = train_df['sig_len'] == train_df['calculated_sig_len']

print(f"\n✅ All signal lengths match fs × 10 seconds: {train_df['sig_len_match'].all()}")

# Visualize
fig, ax = plt.subplots(figsize=(16, 6))

for fs in sorted(train_df['fs'].unique()):
    subset = train_df[train_df['fs'] == fs]
    ax.scatter([fs]*len(subset), subset['sig_len'], alpha=0.6, s=100, 
               label=f'{fs} Hz (n={len(subset)})', edgecolors='black', linewidth=0.5)

ax.set_xlabel('Sampling Frequency (Hz)', fontsize=13, fontweight='bold')
ax.set_ylabel('Signal Length (samples)', fontsize=13, fontweight='bold')
ax.set_title('📏 Signal Length vs Sampling Frequency (All = fs × 10 seconds)', 
             fontsize=15, fontweight='bold', pad=20)
ax.grid(True, alpha=0.3, linestyle='--')

# Add reference line before building the legend so its label is included
fs_range = sorted(train_df['fs'].unique())
sig_len_range = [fs * 10 for fs in fs_range]
ax.plot(fs_range, sig_len_range, 'r--', linewidth=2, label='Theoretical (fs × 10)', alpha=0.7)
ax.legend(fontsize=11, loc='best')

plt.tight_layout()
plt.show()

---
# 📊 2. Understanding ECG Structure

## What is a 12-Lead ECG?

An ECG (Electrocardiogram) measures the electrical activity of the heart from 12 different perspectives:

**Limb Leads (6 leads):**
- **I, II, III**: Bipolar limb leads (measure voltage between two electrodes)
- **aVR, aVL, aVF**: Augmented unipolar limb leads

**Chest Leads (6 leads):**
- **V1-V6**: Precordial leads placed across the chest
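
The six limb leads are not independent: under the standard lead definitions (Einthoven's law and the Goldberger augmented leads), III, aVR, aVL, and aVF can all be written in terms of I and II. The sketch below shows those relations on toy values. The function name `derive_limb_leads` is made up for illustration, and whether the competition CSVs satisfy the relations exactly is something to check rather than assume.

```python
import numpy as np

def derive_limb_leads(lead_i, lead_ii):
    """Derive III, aVR, aVL, aVF from leads I and II (standard definitions)."""
    lead_i = np.asarray(lead_i, dtype=float)
    lead_ii = np.asarray(lead_ii, dtype=float)
    return {
        "III": lead_ii - lead_i,           # Einthoven's law: II = I + III
        "aVR": -(lead_i + lead_ii) / 2.0,  # Goldberger augmented leads
        "aVL": lead_i - lead_ii / 2.0,
        "aVF": lead_ii - lead_i / 2.0,
    }

# Toy values in mV, not taken from the dataset
derived = derive_limb_leads([0.1, 0.5, -0.2], [0.3, 1.0, -0.1])
print(derived["III"])
```

If the relations hold in the data, a model only needs to recover two limb leads well and can reconstruct the other four, which may be a useful consistency check on predictions.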

## Standard Layout:
- **Top 3 rows**: Each contains 4 leads (2.5 seconds each)
- **Bottom row**: Lead II "rhythm strip" (10 seconds - full cardiac rhythm)

## ECG Waveform Components:
- **P wave**: Atrial depolarization
- **QRS complex**: Ventricular depolarization (the heartbeat)
- **T wave**: Ventricular repolarization
- **Grid**: Time (horizontal) vs Voltage (vertical)
  - Standard: 1 mm = 0.04 s (time, at 25 mm/s paper speed), 1 mm = 0.1 mV (voltage, at 10 mm/mV gain)

In [None]:
# Load a sample ECG for detailed analysis (first row of train.csv)
sample_id = str(train_df['id'].iloc[0])
sample_fs = train_df['fs'].iloc[0]
sample_sig_len = train_df['sig_len'].iloc[0]

# Load time series data
ecg_data = pd.read_csv(TRAIN_PATH / sample_id / f'{sample_id}.csv')

print(f"📋 Sample ECG ID: {sample_id}")
print(f"📊 Sampling Frequency: {sample_fs} Hz")
print(f"📏 Signal Length: {sample_sig_len} samples ({sample_sig_len/sample_fs:.1f} seconds)")
print(f"📌 ECG Data Shape: {ecg_data.shape}")
print(f"\n🔍 Available Leads: {list(ecg_data.columns)}")

display(ecg_data.head())

In [None]:
# Create beautiful 12-lead ECG visualization
lead_names = ['I', 'II', 'III', 'aVR', 'aVL', 'aVF', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6']
colors = plt.cm.tab20(np.linspace(0, 1, 12))

fig, axes = plt.subplots(12, 1, figsize=(20, 18), sharex=True)
fig.suptitle(f'🫀 12-Lead ECG Waveform - Sample ID: {sample_id} (fs={sample_fs} Hz)', 
             fontsize=18, fontweight='bold', y=0.995)

time_axis = np.arange(len(ecg_data)) / sample_fs

for idx, (ax, lead, color) in enumerate(zip(axes, lead_names, colors)):
    # Plot the waveform
    ax.plot(time_axis, ecg_data[lead], color=color, linewidth=1.5, alpha=0.9)
    ax.fill_between(time_axis, ecg_data[lead], alpha=0.2, color=color)
    
    # Styling
    ax.set_ylabel(lead, fontsize=13, fontweight='bold', rotation=0, ha='right', va='center')
    ax.grid(True, alpha=0.3, linestyle='--', linewidth=0.5)
    ax.set_xlim(0, len(ecg_data) / sample_fs)
    
    # Add horizontal line at zero
    ax.axhline(y=0, color='black', linestyle='-', linewidth=0.5, alpha=0.5)
    
    # Remove spines for cleaner look
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    
    # Calculate and display basic stats
    mean_val = ecg_data[lead].mean()
    std_val = ecg_data[lead].std()
    ax.text(0.02, 0.95, f'μ={mean_val:.3f}, σ={std_val:.3f}', 
            transform=ax.transAxes, fontsize=9, verticalalignment='top',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

axes[-1].set_xlabel('Time (seconds)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# Demonstrate lead duration differences
# In clinical ECGs: Lead II is 10 seconds, others are 2.5 seconds

print("=" * 80)
print("LEAD DURATION IN CLINICAL ECG IMAGES")
print("=" * 80)
print("📏 Lead II (Rhythm Strip):  10.0 seconds  ← Full cardiac rhythm")
print("📏 Other 11 leads:           2.5 seconds  ← Shorter synchronized segments")
print("=" * 80)
print("\n💡 Note: In our CSV files, all leads show the full 10-second recording.")
print("   However, in the ECG IMAGES, only the first 2.5 seconds of leads I, III,")
print("   aVR, aVL, aVF, V1-V6 are displayed. Lead II shows all 10 seconds.")
print("=" * 80)

# Visualize this concept
fig, axes = plt.subplots(3, 1, figsize=(20, 10))

# Full 10-second view
time_full = np.arange(len(ecg_data)) / sample_fs
axes[0].plot(time_full, ecg_data['II'], color='crimson', linewidth=2, label='Lead II')
axes[0].set_title('🔴 Lead II - Full 10-Second Rhythm Strip (as shown in ECG images)', 
                  fontsize=14, fontweight='bold', pad=15)
axes[0].set_ylabel('Amplitude (mV)', fontsize=12, fontweight='bold')
axes[0].legend(fontsize=12, loc='upper right')
axes[0].grid(True, alpha=0.3)
axes[0].set_xlim(0, 10)

# First 2.5 seconds - other leads
time_25s = time_full[time_full <= 2.5]
axes[1].plot(time_25s, ecg_data['I'][:len(time_25s)], color='blue', linewidth=2, label='Lead I')
axes[1].plot(time_25s, ecg_data['V1'][:len(time_25s)], color='green', linewidth=2, label='Lead V1', alpha=0.7)
axes[1].plot(time_25s, ecg_data['V5'][:len(time_25s)], color='orange', linewidth=2, label='Lead V5', alpha=0.7)
axes[1].axvline(x=2.5, color='red', linestyle='--', linewidth=2, label='2.5s cutoff')
axes[1].set_title('🔵 Other Leads - First 2.5 Seconds Only (as shown in ECG images)', 
                  fontsize=14, fontweight='bold', pad=15)
axes[1].set_ylabel('Amplitude (mV)', fontsize=12, fontweight='bold')
axes[1].legend(fontsize=12, loc='upper right')
axes[1].grid(True, alpha=0.3)
axes[1].set_xlim(0, 10)

# Comparison
axes[2].plot(time_full, ecg_data['II'], color='crimson', linewidth=2, alpha=0.8, label='Lead II (10s shown)')
axes[2].plot(time_25s, ecg_data['I'][:len(time_25s)], color='blue', linewidth=2, label='Lead I (2.5s shown)')
axes[2].axvspan(0, 2.5, alpha=0.2, color='blue', label='Other leads visible region')
axes[2].axvspan(2.5, 10, alpha=0.1, color='gray', label='Lead II only region')
axes[2].set_title('📊 Comparison: Lead II vs Other Leads Display Duration', 
                  fontsize=14, fontweight='bold', pad=15)
axes[2].set_xlabel('Time (seconds)', fontsize=12, fontweight='bold')
axes[2].set_ylabel('Amplitude (mV)', fontsize=12, fontweight='bold')
axes[2].legend(fontsize=11, loc='upper right')
axes[2].grid(True, alpha=0.3)
axes[2].set_xlim(0, 10)

plt.tight_layout()
plt.show()

---
# 🖼️ 3. Image Analysis

Now let's examine the ECG images themselves - their properties, quality, and challenges.

In [None]:
# Analyze image properties across multiple samples
sample_ids = train_df['id'].head(20).tolist()
image_properties = []

for sid in sample_ids:  # separate name so the global sample_id is not overwritten
    sample_dir = TRAIN_PATH / str(sid)
    img_path = sample_dir / f"{sid}-0001.png"  # Original image
    
    if img_path.exists():
        img = Image.open(img_path)
        img_array = np.array(img)
        
        image_properties.append({
            'id': sid,
            'width': img.size[0],
            'height': img.size[1],
            'aspect_ratio': img.size[0] / img.size[1],
            'mode': img.mode,
            'channels': img_array.shape[2] if len(img_array.shape) == 3 else 1,
            'dtype': str(img_array.dtype),
            'file_size_kb': img_path.stat().st_size / 1024,
            'mean_intensity': img_array.mean(),
            'std_intensity': img_array.std()
        })

props_df = pd.DataFrame(image_properties)

print("=" * 80)
print("IMAGE PROPERTIES ANALYSIS")
print("=" * 80)
display(props_df.head(10))

print("\n" + "=" * 80)
print("STATISTICAL SUMMARY")
print("=" * 80)
display(props_df.describe())

# Visualize distributions
fig, axes = plt.subplots(2, 3, figsize=(20, 12))

# Width distribution
axes[0, 0].hist(props_df['width'], bins=20, color='skyblue', edgecolor='black', alpha=0.7)
axes[0, 0].set_title('📏 Image Width Distribution', fontsize=13, fontweight='bold')
axes[0, 0].set_xlabel('Width (pixels)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].grid(axis='y', alpha=0.3)

# Height distribution
axes[0, 1].hist(props_df['height'], bins=20, color='lightcoral', edgecolor='black', alpha=0.7)
axes[0, 1].set_title('📏 Image Height Distribution', fontsize=13, fontweight='bold')
axes[0, 1].set_xlabel('Height (pixels)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].grid(axis='y', alpha=0.3)

# Aspect ratio
axes[0, 2].hist(props_df['aspect_ratio'], bins=20, color='lightgreen', edgecolor='black', alpha=0.7)
axes[0, 2].set_title('📐 Aspect Ratio Distribution', fontsize=13, fontweight='bold')
axes[0, 2].set_xlabel('Width / Height')
axes[0, 2].set_ylabel('Frequency')
axes[0, 2].grid(axis='y', alpha=0.3)

# File size
axes[1, 0].hist(props_df['file_size_kb'], bins=20, color='plum', edgecolor='black', alpha=0.7)
axes[1, 0].set_title('💾 File Size Distribution', fontsize=13, fontweight='bold')
axes[1, 0].set_xlabel('File Size (KB)')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].grid(axis='y', alpha=0.3)

# Mean intensity
axes[1, 1].hist(props_df['mean_intensity'], bins=20, color='gold', edgecolor='black', alpha=0.7)
axes[1, 1].set_title('💡 Mean Pixel Intensity', fontsize=13, fontweight='bold')
axes[1, 1].set_xlabel('Mean Intensity (0-255)')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].grid(axis='y', alpha=0.3)

# Channels
channel_counts = props_df['channels'].value_counts()
axes[1, 2].bar(channel_counts.index.astype(str), channel_counts.values, 
               color='coral', edgecolor='black', alpha=0.7)
axes[1, 2].set_title('🎨 Color Channels Distribution', fontsize=13, fontweight='bold')
axes[1, 2].set_xlabel('Number of Channels')
axes[1, 2].set_ylabel('Frequency')
axes[1, 2].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n✅ Most common dimensions: {props_df['width'].mode().values[0]:.0f} × {props_df['height'].mode().values[0]:.0f} pixels")
print(f"✅ Most common format: {props_df['mode'].mode().values[0]} ({props_df['channels'].mode().values[0]:.0f} channels)")


In [None]:
# Load and display a sample ECG image
img_path = TRAIN_PATH / str(sample_id) / f'{sample_id}-0001.png'
img = Image.open(img_path)
img_array = np.array(img)

fig, axes = plt.subplots(1, 2, figsize=(22, 10))

# Original image
axes[0].imshow(img_array)
axes[0].set_title(f'🖼️ Original ECG Image - ID: {sample_id}\n({img.size[0]} × {img.size[1]} pixels, {img.mode})', 
                  fontsize=14, fontweight='bold', pad=15)
axes[0].axis('off')

# Add annotations
axes[0].text(100, 100, '📊 ECG Grid', fontsize=12, color='red', 
             bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7), fontweight='bold')

# Grayscale version
img_gray = cv2.cvtColor(img_array[:,:,:3], cv2.COLOR_RGB2GRAY) if len(img_array.shape) == 3 else img_array
axes[1].imshow(img_gray, cmap='gray')
axes[1].set_title('⚫ Grayscale Version\n(Useful for signal extraction)', 
                  fontsize=14, fontweight='bold', pad=15)
axes[1].axis('off')

plt.tight_layout()
plt.show()

print(f"📊 Image shape: {img_array.shape}")
print(f"📊 Data type: {img_array.dtype}")
print(f"📊 Value range: [{img_array.min()}, {img_array.max()}]")
print(f"📊 Mean intensity: {img_array.mean():.2f}")

---
# 📸 4. Image Quality Variants

The dataset includes multiple versions of each ECG, simulating real-world degradation:

| Code | Description | Challenge |
|------|-------------|-----------|
| **0001** | Original synthetic image | ✅ Clean baseline |
| **0003** | Printed & scanned (color) | 🔸 Scanning artifacts |
| **0004** | Printed & scanned (B&W) | 🔸 Loss of color info |
| **0005** | Mobile photos (color print) | 🔶 Camera distortion, lighting |
| **0006** | Mobile photos (screen) | 🔶 Moiré patterns, glare |
| **0009** | Stained & soaked prints | 🔴 Physical damage, stains |
| **0010** | Extensively damaged | 🔴 Severe degradation |
| **0011** | Mold (color) | 🔴 Biological damage |
| **0012** | Mold (B&W) | 🔴 Severe quality loss |
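
Not every record necessarily ships with every variant, so it helps to ask a sample folder which ones it actually contains. The sketch below assumes the `train/{id}/{id}-{code}.png` naming used elsewhere in this notebook. `VARIANTS` and `list_variants` are illustrative names, and the demo runs on a throwaway folder with a made-up ID rather than real data.

```python
from pathlib import Path
import tempfile

# Codes and descriptions copied from the table above
VARIANTS = {
    "0001": "Original synthetic image",
    "0003": "Printed & scanned (color)",
    "0004": "Printed & scanned (B&W)",
    "0005": "Mobile photos (color print)",
    "0006": "Mobile photos (screen)",
    "0009": "Stained & soaked prints",
    "0010": "Extensively damaged",
    "0011": "Mold (color)",
    "0012": "Mold (B&W)",
}

def list_variants(sample_dir):
    """Return {code: path} for every known variant PNG present in sample_dir."""
    sample_dir = Path(sample_dir)
    found = {}
    for code in VARIANTS:
        path = sample_dir / f"{sample_dir.name}-{code}.png"
        if path.exists():
            found[code] = path
    return found

# Demo on a temporary folder with a made-up ID (not a real record)
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "12345"
    demo_dir.mkdir()
    for code in ("0001", "0005"):
        (demo_dir / f"12345-{code}.png").touch()
    demo_found = sorted(list_variants(demo_dir))

print(demo_found)
```

On the real data, `list_variants(TRAIN_PATH / sample_id)` would give the variants available for one record.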

Let's visualize these differences!

In [None]:
# Compare different image quality variants
image_variants = {
    '0001': 'Original Clean',
    '0003': 'Color Scanned',
    '0004': 'B&W Scanned',
    '0005': 'Mobile Photo',
    '0010': 'Extensively Damaged'
}

fig, axes = plt.subplots(len(image_variants), 2, figsize=(20, 5*len(image_variants)))
fig.suptitle(f'🔍 Image Quality Comparison - Sample ID: {sample_id}', 
             fontsize=18, fontweight='bold', y=0.995)

for idx, (variant_code, variant_name) in enumerate(image_variants.items()):
    img_path = TRAIN_PATH / str(sample_id) / f'{sample_id}-{variant_code}.png'
    
    if img_path.exists():
        img = cv2.imread(str(img_path))
        img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        
        # RGB version
        axes[idx, 0].imshow(img_rgb)
        axes[idx, 0].set_title(f'{variant_name} - Color', fontsize=13, fontweight='bold')
        axes[idx, 0].axis('off')
        
        # Intensity histogram
        axes[idx, 1].hist(img_gray.flatten(), bins=50, color='steelblue', 
                         edgecolor='black', alpha=0.7)
        axes[idx, 1].set_title(f'{variant_name} - Intensity Distribution', 
                              fontsize=13, fontweight='bold')
        axes[idx, 1].set_xlabel('Pixel Intensity (0-255)')
        axes[idx, 1].set_ylabel('Frequency')
        axes[idx, 1].grid(axis='y', alpha=0.3)
        
        # Add statistics
        mean_int = img_gray.mean()
        std_int = img_gray.std()
        axes[idx, 1].axvline(mean_int, color='red', linestyle='--', linewidth=2, 
                            label=f'Mean: {mean_int:.1f}')
        axes[idx, 1].legend()
    else:
        axes[idx, 0].text(0.5, 0.5, 'Image not found', ha='center', va='center', fontsize=14)
        axes[idx, 0].axis('off')
        axes[idx, 1].text(0.5, 0.5, 'Image not found', ha='center', va='center', fontsize=14)
        axes[idx, 1].axis('off')

plt.tight_layout()
plt.show()

In [None]:
# Zoom into a specific region to see degradation details
crop_height = 400
crop_width = 600
start_y = 600
start_x = 800

variants_to_compare = ['0001', '0003', '0005', '0010']
variant_names_short = ['Original', 'Scanned', 'Photo', 'Damaged']

fig, axes = plt.subplots(2, len(variants_to_compare), figsize=(24, 12))
fig.suptitle('🔬 Zoomed Detail Comparison - Signal Quality Analysis', 
             fontsize=16, fontweight='bold', y=0.995)

for idx, (variant_code, variant_name) in enumerate(zip(variants_to_compare, variant_names_short)):
    img_path = TRAIN_PATH / str(sample_id) / f'{sample_id}-{variant_code}.png'
    
    if img_path.exists():
        img = cv2.imread(str(img_path))
        img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        
        # Crop the region
        cropped_rgb = img_rgb[start_y:start_y+crop_height, start_x:start_x+crop_width]
        cropped_gray = img_gray[start_y:start_y+crop_height, start_x:start_x+crop_width]
        
        # Show full image with crop location
        axes[0, idx].imshow(img_gray, cmap='gray')
        rect = plt.Rectangle((start_x, start_y), crop_width, crop_height, 
                            fill=False, edgecolor='red', linewidth=3)
        axes[0, idx].add_patch(rect)
        axes[0, idx].set_title(f'{variant_name}\nFull Image', fontsize=12, fontweight='bold')
        axes[0, idx].axis('off')
        
        # Show cropped detail
        axes[1, idx].imshow(cropped_gray, cmap='gray')
        axes[1, idx].set_title(f'{variant_name}\nZoomed Detail', fontsize=12, fontweight='bold')
        axes[1, idx].axis('off')

plt.tight_layout()
plt.show()

---
# 📐 5. Grid Structure Analysis

ECG images contain a calibrated grid that defines:
- **Horizontal axis**: Time (standard: 1mm = 0.04 seconds)
- **Vertical axis**: Voltage (standard: 1mm = 0.1 mV)

Understanding this grid is crucial for accurate signal extraction!
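
To make the grid scales concrete, here is a minimal sketch that converts traced pixel coordinates into seconds and millivolts. It assumes the image resolution in pixels per millimetre is already known (`px_per_mm` is a hypothetical input, and the value 8.0 below is just a toy number) and that the baseline row of the lead has been located some other way.

```python
import numpy as np

SECONDS_PER_MM = 0.04  # standard time scale quoted above
MV_PER_MM = 0.1        # standard voltage scale quoted above

def pixels_to_signal(cols, rows, baseline_row, px_per_mm):
    """Convert trace pixel coordinates to (time in s, voltage in mV)."""
    cols = np.asarray(cols, dtype=float)
    rows = np.asarray(rows, dtype=float)
    time_s = cols / px_per_mm * SECONDS_PER_MM
    # Image rows grow downward, so a point above the baseline has a smaller row index
    voltage_mv = (baseline_row - rows) / px_per_mm * MV_PER_MM
    return time_s, voltage_mv

# Toy trace: 8 px per mm, baseline at row 100
t_demo, v_demo = pixels_to_signal(
    cols=[0, 8, 16], rows=[100, 92, 108], baseline_row=100, px_per_mm=8.0
)
print(t_demo, v_demo)
```

One millimetre to the right advances time by 0.04 s, and one millimetre up adds 0.1 mV, so any error in `px_per_mm` scales both axes of the recovered signal.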

In [None]:
# Detect and visualize grid structure
img_path = TRAIN_PATH / str(sample_id) / f'{sample_id}-0001.png'
img = cv2.imread(str(img_path))
img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Edge detection
edges = cv2.Canny(img_gray, 30, 100)

# Detect lines using Hough Transform
lines = cv2.HoughLinesP(edges, 1, np.pi/180, threshold=100, minLineLength=100, maxLineGap=10)

# Separate horizontal and vertical lines
horizontal_lines = []
vertical_lines = []

img_with_lines = img_rgb.copy()

if lines is not None:
    for line in lines:
        x1, y1, x2, y2 = line[0]
        angle = np.abs(np.arctan2(y2 - y1, x2 - x1) * 180 / np.pi)
        
        if angle < 10 or angle > 170:  # Horizontal
            horizontal_lines.append(line)
            cv2.line(img_with_lines, (x1, y1), (x2, y2), (255, 0, 0), 2)
        elif 80 < angle < 100:  # Vertical
            vertical_lines.append(line)
            cv2.line(img_with_lines, (x1, y1), (x2, y2), (0, 255, 0), 2)

print("=" * 80)
print("GRID STRUCTURE ANALYSIS")
print("=" * 80)
print(f"✅ Total lines detected: {len(lines) if lines is not None else 0}")
print(f"📊 Horizontal lines: {len(horizontal_lines)} (shown in RED)")
print(f"📊 Vertical lines: {len(vertical_lines)} (shown in GREEN)")
print("=" * 80)

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(24, 8))

axes[0].imshow(img_gray, cmap='gray')
axes[0].set_title('📄 Original Grayscale Image', fontsize=14, fontweight='bold')
axes[0].axis('off')

axes[1].imshow(edges, cmap='gray')
axes[1].set_title('🔍 Edge Detection (Canny)', fontsize=14, fontweight='bold')
axes[1].axis('off')

axes[2].imshow(img_with_lines)
axes[2].set_title(f'📐 Detected Grid Lines\n🔴 Horizontal ({len(horizontal_lines)}) | 🟢 Vertical ({len(vertical_lines)})', 
                 fontsize=14, fontweight='bold')
axes[2].axis('off')

plt.tight_layout()
plt.show()
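
Hough line counts depend heavily on thresholds, so a complementary way to measure the grid is to estimate its spacing directly. The sketch below uses a different technique from the cell above: autocorrelation of the column-wise intensity profile. It is demonstrated on a synthetic grid, not on a competition image. `estimate_grid_pitch` and its lag limits are illustrative choices, and on real scans the strongest peak may correspond to either the minor or the major grid lines.

```python
import numpy as np

def estimate_grid_pitch(gray, min_lag=3, max_lag=100):
    """Estimate the spacing in pixels between vertical grid lines."""
    profile = gray.astype(float).mean(axis=0)  # one value per column
    profile -= profile.mean()                  # remove the DC offset
    # Keep non-negative lags only: index 0 is lag 0
    acf = np.correlate(profile, profile, mode="full")[len(profile) - 1:]
    max_lag = min(max_lag, len(acf) - 1)
    # The lag with the strongest self-similarity is taken as the grid pitch
    return int(np.argmax(acf[min_lag:max_lag + 1]) + min_lag)

# Synthetic test image: white page with a dark vertical line every 10 pixels
demo_grid = np.full((200, 200), 255, dtype=np.uint8)
demo_grid[:, ::10] = 100
pitch = estimate_grid_pitch(demo_grid)
print(pitch)
```

Dividing the estimated pitch by the physical spacing of the detected lines gives a pixels-per-millimetre figure, which is what a pixel-to-millivolt conversion needs.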

In [None]:
# Analyze horizontal and vertical projections to understand structure
horizontal_proj = np.sum(img_gray, axis=1)
vertical_proj = np.sum(img_gray, axis=0)

fig = plt.figure(figsize=(20, 12))
gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)

# Main image
ax_main = fig.add_subplot(gs[1:, :2])
ax_main.imshow(img_gray, cmap='gray', aspect='auto')
ax_main.set_title('🖼️ ECG Image', fontsize=14, fontweight='bold')
ax_main.set_xlabel('Column Index (Width)', fontsize=12)
ax_main.set_ylabel('Row Index (Height)', fontsize=12)

# Vertical projection (top): one value per column
ax_top = fig.add_subplot(gs[0, :2], sharex=ax_main)
ax_top.plot(vertical_proj, color='blue', linewidth=2)
ax_top.fill_between(range(len(vertical_proj)), vertical_proj, alpha=0.3, color='blue')
ax_top.set_title('📊 Vertical Projection (Sum across rows)', fontsize=12, fontweight='bold')
ax_top.set_ylabel('Intensity Sum', fontsize=10)
ax_top.grid(True, alpha=0.3)
ax_top.tick_params(labelbottom=False)

# Horizontal projection (right): one value per row
ax_right = fig.add_subplot(gs[1:, 2], sharey=ax_main)
ax_right.plot(horizontal_proj, range(len(horizontal_proj)), color='red', linewidth=2)
ax_right.fill_betweenx(range(len(horizontal_proj)), horizontal_proj, alpha=0.3, color='red')
ax_right.set_title('📊 Horizontal\nProjection\n(Sum across columns)', fontsize=11, fontweight='bold')
ax_right.set_xlabel('Intensity Sum', fontsize=10)
ax_right.grid(True, alpha=0.3)
ax_right.tick_params(labelleft=False)
ax_right.invert_xaxis()
# No invert_yaxis() here: the y-axis is shared with ax_main, which imshow already
# draws with row 0 at the top; inverting it again would flip the image.

plt.suptitle('📐 Image Projection Analysis - Understanding ECG Layout', 
             fontsize=16, fontweight='bold', y=0.995)
plt.show()

print("\n💡 Interpretation:")
print("   - Vertical projection shows the overall horizontal distribution of signals")
print("   - Horizontal projection reveals the vertical layout (rows of leads)")
print("   - Peaks and valleys help identify lead boundaries and grid structure")
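
As a rough illustration of how a row-wise projection can locate lead rows, the sketch below inverts the intensities so that dark ink becomes a peak and then applies `scipy.signal.find_peaks`. It runs on a synthetic page with four flat traces. The function name, the `min_separation` default, and the prominence threshold are assumptions for the demo, and real ECG rows are wavy and noisy, so expect to smooth the profile and tune the parameters.

```python
import numpy as np
from scipy.signal import find_peaks

def find_trace_rows(gray, min_separation=50):
    """Return row indices where dark content is concentrated."""
    darkness = 255.0 - gray.astype(float)  # dark ink -> large values
    row_profile = darkness.sum(axis=1)     # one value per image row
    peaks, _ = find_peaks(
        row_profile,
        distance=min_separation,               # keep peaks at least this far apart
        prominence=0.5 * row_profile.max(),    # ignore weak bumps
    )
    return peaks

# Synthetic page: white background with four 3-pixel-thick horizontal traces
demo_page = np.full((400, 300), 255, dtype=np.uint8)
for r in (50, 150, 250, 350):
    demo_page[r - 1:r + 2, :] = 0
trace_rows = find_trace_rows(demo_page)
print(trace_rows)
```

The midpoints between consecutive detected rows are natural candidates for cropping boundaries between lead rows.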

---
# 📊 6. Signal Characteristics

Let's analyze the actual ECG signal properties and their relationships.

In [None]:
# Comprehensive signal statistics
lead_stats = []

for lead in lead_names:
    lead_data = ecg_data[lead].values
    
    lead_stats.append({
        'Lead': lead,
        'Mean (mV)': lead_data.mean(),
        'Std Dev (mV)': lead_data.std(),
        'Min (mV)': lead_data.min(),
        'Max (mV)': lead_data.max(),
        'Range (mV)': lead_data.max() - lead_data.min(),
        'Median (mV)': np.median(lead_data),
        'RMS': np.sqrt(np.mean(lead_data**2))
    })

stats_df = pd.DataFrame(lead_stats)

print("=" * 100)
print("SIGNAL STATISTICS BY LEAD")
print("=" * 100)
display(stats_df.style.background_gradient(cmap='RdYlGn', subset=['Range (mV)', 'Std Dev (mV)']))

# Visualize distributions
fig, axes = plt.subplots(2, 2, figsize=(20, 12))

# Mean values
axes[0, 0].barh(stats_df['Lead'], stats_df['Mean (mV)'], color='skyblue', edgecolor='black')
axes[0, 0].set_title('📊 Mean Amplitude by Lead', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Mean (mV)', fontsize=12)
axes[0, 0].axvline(x=0, color='red', linestyle='--', linewidth=1)
axes[0, 0].grid(axis='x', alpha=0.3)

# Standard deviation
axes[0, 1].barh(stats_df['Lead'], stats_df['Std Dev (mV)'], color='lightcoral', edgecolor='black')
axes[0, 1].set_title('📊 Signal Variability (Std Dev) by Lead', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Standard Deviation (mV)', fontsize=12)
axes[0, 1].grid(axis='x', alpha=0.3)

# Range
axes[1, 0].barh(stats_df['Lead'], stats_df['Range (mV)'], color='lightgreen', edgecolor='black')
axes[1, 0].set_title('📊 Signal Range (Max - Min) by Lead', fontsize=14, fontweight='bold')
axes[1, 0].set_xlabel('Range (mV)', fontsize=12)
axes[1, 0].grid(axis='x', alpha=0.3)

# RMS
axes[1, 1].barh(stats_df['Lead'], stats_df['RMS'], color='plum', edgecolor='black')
axes[1, 1].set_title('📊 RMS (Root Mean Square) by Lead', fontsize=14, fontweight='bold')
axes[1, 1].set_xlabel('RMS', fontsize=12)
axes[1, 1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Lead correlation analysis
correlation_matrix = ecg_data[lead_names].corr()

fig, ax = plt.subplots(figsize=(14, 12))

# Create heatmap
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8},
            vmin=-1, vmax=1, ax=ax)

ax.set_title('🔗 Inter-Lead Correlation Matrix\n(Understanding relationships between ECG leads)', 
             fontsize=16, fontweight='bold', pad=20)

plt.tight_layout()
plt.show()

print("\n💡 Key Observations:")
print("   - Some leads show strong positive correlations (similar waveforms)")
print("   - Some leads show negative correlations (inverted waveforms)")
print("   - These relationships are based on cardiac electrical vectors")
print("   - aVR typically shows inverse patterns compared to other leads")

In [None]:
# Analyze frequency content of ECG signals
from scipy.fft import fft, fftfreq

# Select a few representative leads
selected_leads = ['II', 'V1', 'V5']
colors_freq = ['crimson', 'blue', 'green']

fig, axes = plt.subplots(len(selected_leads), 2, figsize=(20, 12))
fig.suptitle('🌊 Time vs Frequency Domain Analysis', fontsize=16, fontweight='bold', y=0.995)

for idx, (lead, color) in enumerate(zip(selected_leads, colors_freq)):
    signal_data = ecg_data[lead].values
    n = len(signal_data)
    
    # Time domain
    time = np.arange(n) / sample_fs
    axes[idx, 0].plot(time, signal_data, color=color, linewidth=1.5)
    axes[idx, 0].set_title(f'Lead {lead} - Time Domain', fontsize=13, fontweight='bold')
    axes[idx, 0].set_xlabel('Time (s)', fontsize=11)
    axes[idx, 0].set_ylabel('Amplitude (mV)', fontsize=11)
    axes[idx, 0].grid(True, alpha=0.3)
    
    # Frequency domain (FFT)
    yf = fft(signal_data)
    xf = fftfreq(n, 1/sample_fs)
    
    # Only positive frequencies
    positive_freqs = xf > 0
    axes[idx, 1].plot(xf[positive_freqs], np.abs(yf[positive_freqs]), color=color, linewidth=1.5)
    axes[idx, 1].set_title(f'Lead {lead} - Frequency Domain', fontsize=13, fontweight='bold')
    axes[idx, 1].set_xlabel('Frequency (Hz)', fontsize=11)
    axes[idx, 1].set_ylabel('Magnitude', fontsize=11)
    axes[idx, 1].set_xlim(0, 50)  # Focus on physiologically relevant frequencies
    axes[idx, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Frequency Domain Insights:")
print("   - ECG signals contain frequencies mainly < 50 Hz")
print("   - Dominant frequencies correspond to heart rate (1-3 Hz)")
print("   - Higher frequencies capture fine details of QRS complex")
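
To see the link between the dominant low-frequency peak and heart rate, here is a small sketch that picks the strongest spectral component inside a plausible heart-rate band and converts it to beats per minute. It is run on a synthetic pulse train at 1.2 Hz, not on dataset signals. The function name and the 0.5 to 3.5 Hz band are assumptions, and on real ECGs the strongest peak can sit on a harmonic, so treat the result as a rough estimate.

```python
import numpy as np

def dominant_rate_bpm(x, fs, fmin=0.5, fmax=3.5):
    """Estimate heart rate from the strongest spectral peak in [fmin, fmax] Hz."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                       # drop the DC component
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    band = (freqs >= fmin) & (freqs <= fmax)
    peak_freq = freqs[band][np.argmax(spectrum[band])]
    return 60.0 * peak_freq

# Synthetic 10-second signal: 12 Gaussian "beats" at 1.2 Hz, sampled at 500 Hz
fs_demo = 500
t = np.arange(10 * fs_demo) / fs_demo
beat_times = (np.arange(12) + 0.5) / 1.2
pulses = sum(np.exp(-0.5 * ((t - b) / 0.04) ** 2) for b in beat_times)
bpm = dominant_rate_bpm(pulses, fs_demo)
print(round(bpm, 1))
```

With a 10-second window the frequency resolution is 0.1 Hz, so the estimate is quantized to steps of 6 beats per minute.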

---
# 🎯 7. Test Data Analysis

Understanding what we need to predict.

In [None]:
# Analyze test data structure
print("=" * 80)
print("TEST DATA STRUCTURE")
print("=" * 80)

test_summary = test_df.groupby('id').agg({
    'lead': 'count',
    'fs': 'first',
    'number_of_rows': 'sum'
}).reset_index()
test_summary.columns = ['id', 'num_leads', 'fs', 'total_rows_to_predict']

print(f"\n📊 Total unique test IDs: {len(test_summary)}")
print(f"📊 Total predictions required: {test_df['number_of_rows'].sum():,} rows")
print(f"\n{test_summary.to_string(index=False)}")

# Analyze predictions per lead
lead_predictions = test_df.groupby('lead')['number_of_rows'].agg(['min', 'max', 'mean', 'count'])
print("\n" + "=" * 80)
print("PREDICTIONS REQUIRED PER LEAD")
print("=" * 80)
print(f"\n{lead_predictions.to_string()}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(20, 6))

# Predictions per lead
lead_counts = test_df.groupby('lead')['number_of_rows'].mean()
colors_bar = plt.cm.tab10(np.arange(len(lead_counts)))

axes[0].bar(lead_counts.index, lead_counts.values, color=colors_bar, edgecolor='black', linewidth=1.5)
axes[0].set_title('📊 Average Predictions Required per Lead', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Lead', fontsize=12)
axes[0].set_ylabel('Number of Rows to Predict', fontsize=12)
axes[0].grid(axis='y', alpha=0.3)
axes[0].tick_params(axis='x', rotation=0)

# Add value labels
for i, (lead, count) in enumerate(zip(lead_counts.index, lead_counts.values)):
    axes[0].text(i, count + 50, f'{count:.0f}', ha='center', va='bottom', fontweight='bold')

# Sampling frequency distribution in test
test_fs_dist = test_summary['fs'].value_counts().sort_index()
axes[1].bar(test_fs_dist.index.astype(str), test_fs_dist.values, 
           color='steelblue', edgecolor='black', linewidth=1.5)
axes[1].set_title('📊 Test Data Sampling Frequency Distribution', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Sampling Frequency (Hz)', fontsize=12)
axes[1].set_ylabel('Count', fontsize=12)
axes[1].grid(axis='y', alpha=0.3)

# Add value labels
for i, (fs, count) in enumerate(zip(test_fs_dist.index, test_fs_dist.values)):
    axes[1].text(i, count + 0.05, f'{count}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

print("\n💡 Key Points:")
print("   - Lead II requires ~4× as many predictions as the other leads (10 s vs 2.5 s)")
print("   - All 12 leads must be predicted for each test image")
print("   - Output format: {base_id}_{row_id}_{lead}")


In [None]:
# Display a sample test image
test_image_files = list(TEST_PATH.glob('*.png'))
if test_image_files:
    sample_test_img_path = test_image_files[0]
    sample_test_id = sample_test_img_path.stem
    
    test_img = Image.open(sample_test_img_path)
    test_img_array = np.array(test_img)
    
    fig, axes = plt.subplots(1, 2, figsize=(22, 10))
    
    axes[0].imshow(test_img_array)
    axes[0].set_title(f'🎯 Sample Test Image\nID: {sample_test_id}',
                     fontsize=14, fontweight='bold', pad=15)
    axes[0].axis('off')
    
    # Grayscale
    if len(test_img_array.shape) == 3:
        test_img_gray = cv2.cvtColor(test_img_array[:,:,:3], cv2.COLOR_RGB2GRAY)
    else:
        test_img_gray = test_img_array
    
    axes[1].imshow(test_img_gray, cmap='gray')
    axes[1].set_title('⚫ Grayscale Version', fontsize=14, fontweight='bold', pad=15)
    axes[1].axis('off')
    
    plt.tight_layout()
    plt.show()
    
    print(f"📊 Test Image ID: {sample_test_id}")
    print(f"📊 Image Shape: {test_img_array.shape}")
    print(f"📊 Image Size: {test_img.size}")
else:
    print("⚠️ No test images found in the test directory")

---
# 💡 8. Key Insights & Modeling Recommendations

## 🎯 Main Challenges Identified:

### 1. **Image Variability**
- ✅ Consistent dimensions (2200×1700) - Good for modeling!
- ⚠️ Multiple quality levels (clean → severely damaged)
- ⚠️ Color vs B&W, scanned vs photographed

### 2. **Signal Extraction**
- 📐 Grid structure must be detected and calibrated
- 🎯 12 different lead regions must be identified
- 📏 Lead II: 10 seconds, Others: 2.5 seconds
- 🔄 Variable sampling rates (250/500/1000 Hz)

### 3. **Quality Degradation**
- Stains, mold, physical damage
- Scanning artifacts, misalignments
- Mobile photo distortions (lighting, angle, blur)

---

## 🛠️ Recommended Modeling Approach:

### **Pipeline Components:**

1. **Image Preprocessing**
   - Grid detection and alignment
   - Rotation/skew correction
   - Contrast enhancement
   - Denoising for damaged images

2. **Lead Segmentation**
   - Identify 12 lead regions
   - Handle different lead durations
   - Account for standard ECG layout

3. **Signal Extraction**
   - Pixel-to-voltage conversion
   - Time axis calibration
   - Waveform smoothing

4. **Model Architecture Options**
   - **U-Net / Segmentation**: Identify signal pixels
   - **CNN + LSTM**: Image → Time series
   - **Transformer**: Attention-based extraction
   - **Ensemble**: Combine multiple approaches

5. **Post-Processing**
   - Temporal alignment
   - Baseline correction
   - Interpolation to match sampling rate
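
To make the signal-extraction and post-processing steps above more concrete, here is a minimal sketch of pixel-to-voltage conversion, time-axis calibration, and interpolation to the target sampling rate. It assumes the common ECG paper calibration of 10 mm/mV and 25 mm/s, which should be verified for this dataset, and it assumes that earlier steps have already produced a per-column trace (`y_pixels`), a baseline row, and a grid scale (`px_per_mm`). All names and parameters here are hypothetical.

```python
import numpy as np

def trace_to_signal(y_pixels, baseline_row, px_per_mm, fs, n_samples,
                    mm_per_mv=10.0, mm_per_s=25.0):
    """Convert a per-column trace (row index of the waveform) into millivolts
    sampled at `fs`, interpolated to `n_samples` points."""
    y_pixels = np.asarray(y_pixels, dtype=float)
    # Image rows grow downward, so a higher voltage means a smaller row index.
    millivolts = (baseline_row - y_pixels) / (px_per_mm * mm_per_mv)
    # Each pixel column covers 1 / (px_per_mm * mm_per_s) seconds.
    t_pixels = np.arange(len(y_pixels)) / (px_per_mm * mm_per_s)
    # Target time stamps at the requested sampling rate.
    t_target = np.arange(n_samples) / fs
    # Linear interpolation; values past the last pixel are held constant.
    return np.interp(t_target, t_pixels, millivolts)

# Toy example: a flat trace 80 px above the baseline at 8 px/mm.
# 500 columns at 8 px/mm and 25 mm/s span 2.5 s, i.e. 1250 samples at 500 Hz.
trace = np.full(500, 300.0)
sig = trace_to_signal(trace, baseline_row=380, px_per_mm=8, fs=500, n_samples=1250)
```

Linear interpolation is only one option; a band-limited resampler may preserve sharp QRS peaks better, at the cost of possible ringing near the segment edges.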

---

## 📊 Success Metrics:

- **Modified SNR**: Accounts for small alignment errors
- Higher score = better reconstruction
- Focus on signal fidelity over perfect alignment
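
For local validation, a simple stand-in for the metric can help. The sketch below computes a plain SNR in dB and a shift-tolerant variant that takes the best SNR over small integer time shifts. This is an illustration only and not the official competition metric; the function names and the `max_shift` window are assumptions.

```python
import numpy as np

def snr_db(reference, estimate):
    """Plain signal-to-noise ratio in dB between two equal-length signals."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    noise_power = np.sum((reference - estimate) ** 2)
    signal_power = np.sum(reference ** 2)
    if noise_power == 0:
        return float("inf")
    return 10.0 * np.log10(signal_power / noise_power)

def shift_tolerant_snr_db(reference, estimate, max_shift=5):
    """Best SNR over integer shifts within +/- max_shift samples."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    best = -np.inf
    for shift in range(-max_shift, max_shift + 1):
        # Compare only the overlapping parts of the two shifted signals.
        if shift >= 0:
            ref, est = reference[shift:], estimate[:len(estimate) - shift]
        else:
            ref, est = reference[:shift], estimate[-shift:]
        best = max(best, snr_db(ref, est))
    return best

# Toy check: a 1.2 Hz sine sampled at 500 Hz, with the estimate delayed by 3 samples.
t = np.arange(1000) / 500
ref = np.sin(2 * np.pi * 1.2 * t)
delayed = np.roll(ref, 3)
print(snr_db(ref, delayed), shift_tolerant_snr_db(ref, delayed))
```

The shift-tolerant score is never lower than the plain one, which reflects the idea that small alignment errors should not dominate the evaluation.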

---

## 🚀 Next Steps:

1. Build baseline model (start simple!)
2. Implement robust preprocessing
3. Handle multiple image quality levels
4. Optimize for SNR metric
5. Ensemble different approaches

Good luck! 🍀

In [None]:
# Final comprehensive summary
print("=" * 100)
print("🎯 COMPREHENSIVE DATASET SUMMARY")
print("=" * 100)

summary_stats = {
    'Category': [
        'Training Samples',
        'Test Images',
        'Total Test Predictions',
        'ECG Leads',
        'Sampling Frequencies',
        'Image Dimensions (typical)',
        'Image Variants per Sample',
        'Lead II Duration',
        'Other Leads Duration',
        'Evaluation Metric'
    ],
    'Value': [
        f'{len(train_df):,}',
        f'{len(test_df["id"].unique()):,}',
        f'{test_df["number_of_rows"].sum():,}',
        '12 (I, II, III, aVR, aVL, aVF, V1-V6)',
        '250, 500, 1000 Hz',
        '2200 × 1700 pixels',
        '9 quality variants (0001-0012)',
        '10 seconds (full rhythm)',
        '2.5 seconds',
        'Modified SNR (dB)'
    ]
}

summary_df = pd.DataFrame(summary_stats)
display(summary_df.style.set_properties(**{
    'text-align': 'left',
    'font-size': '12pt',
}).hide(axis='index'))



---
<div style="background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%); padding: 30px; border-radius: 15px; color: white; text-align: center;">
    <h2>🙏 Thank You!</h2>
    <p style="font-size: 1.2em;">If you found this EDA helpful, please:</p>
    <h3>⭐ Upvote this notebook!</h3>
    <p>Your support helps others discover useful content.</p>
    <br>
    <p style="font-size: 0.9em;">💬 Questions or suggestions? Leave a comment!</p>
    <p style="font-size: 0.9em;">🤝 Let's collaborate and learn together!</p>
</div>

---


## 🔗 Connect & Collaborate:

Good luck with your models! Let's digitize those ECGs!