# **Digitizing Waveforms From an ECG Image Without Training**


## **def analyze_and_split(img):**

This function processes an ECG waveform image and extracts numerical data from it through several stages:

### 1. **Image Preprocessing**
- **Grayscale Conversion**: Converts the color image to grayscale for simpler processing
- **Gaussian Blur**: Applies a 5×5 Gaussian blur to suppress noise (the code computes this blurred image, but the later steps operate on the unblurred grayscale image)

### 2. **Grid Line Removal**
- **HSV Color Space Conversion**: Converts to HSV for better color segmentation
- **Red Color Masking**: Identifies and creates a mask for red grid lines
- **Inpainting**: Removes the detected grid lines while preserving the waveform

### 3. **Waveform Segmentation**
- **Binarization**: Uses Otsu's thresholding to create a binary image (waveform vs background)
- **Contour Detection**: Finds all external contours in the image
- **Main Waveform Selection**: Selects the largest contour as the primary ECG waveform
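
A closed contour runs along both the upper and the lower edge of the drawn stroke, so after sorting by X most columns carry several Y values. As an alternative (not what this notebook does), a binary mask can be collapsed to one Y per column. The sketch below uses the median foreground row; `trace_from_mask` is a hypothetical helper name.

```python
import numpy as np

def trace_from_mask(binary):
    """Return (x, y) with one y per column: the median row of foreground pixels."""
    rows, cols = np.nonzero(binary)
    xs = np.unique(cols)  # columns that contain ink, in ascending order
    ys = np.array([np.median(rows[cols == c]) for c in xs])
    return xs, ys

# 5x4 toy mask: a 3-pixel-thick stroke in column 1, single pixels in columns 0 and 3.
toy = np.zeros((5, 4), dtype=np.uint8)
toy[1:4, 1] = 255
toy[2, 0] = 255
toy[0, 3] = 255
xs, ys = trace_from_mask(toy)
```

Columns without any foreground pixel (column 2 in the toy mask) are simply skipped, so a later interpolation step would have to bridge those gaps.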

### 4. **Coordinate Processing**
- **Point Extraction**: Extracts X-Y coordinates from the contour
- **Y-axis Inversion**: Flips the Y-axis since image coordinates increase downward
- **Normalization**: 
  - Shifts Y-values to start at zero
  - Scales Y-values to range [0, 1] for standardization
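
The inversion and normalization steps can be sketched on a handful of pixel rows. The guard against a flat trace is an addition for illustration; the function in this notebook divides by the maximum directly.

```python
import numpy as np

def normalize_trace(y_pixels):
    """Flip pixel rows so that up is positive, then rescale to [0, 1]."""
    y = -np.asarray(y_pixels, dtype=float)  # image rows grow downward
    y = y - y.min()                         # lowest point becomes 0
    span = y.max()
    # Guard (not in the notebook): a perfectly flat trace has span 0.
    return y / span if span > 0 else np.zeros_like(y)

normalized = normalize_trace([120, 100, 140, 120])
```

Because each crop is scaled by its own minimum and maximum, the output values are relative amplitudes rather than millivolts.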

### 5. **Resampling and Cleaning**
- **Interpolation**: Resamples the waveform onto 10,380 evenly spaced X positions (10,000 plus the points trimmed in the next step)
- **Endpoint Trimming**: Removes 240 points from the start and 140 points from the end to eliminate edge artifacts
- **Final Length**: Results in 10,000 uniformly spaced data points
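
Following the code in `analyze_and_split`, the resampling grid has 10,000 + 240 + 140 points, so that 10,000 remain after trimming. A short check of that arithmetic on a synthetic trace (the sine wave is stand-in data, not an ECG):

```python
import numpy as np

n_out, remove_start, remove_end = 10_000, 240, 140

# Stand-in trace: 500 pixel columns of a sine wave (synthetic data).
x = np.arange(500)
y = np.sin(x / 20.0)

x_new = np.linspace(x.min(), x.max(), n_out + remove_start + remove_end)
y_interp = np.interp(x_new, x, y)             # 10,380 samples
y_clean = y_interp[remove_start:-remove_end]  # 10,000 samples remain
```

Note that `np.interp` expects increasing X values, which is why the contour points are sorted by X beforehand.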

### 6. **Lead Separation**
- **Segmentation**: Splits the continuous waveform into 4 equal segments representing different ECG leads
- **Segment Length**: Each lead contains 2,500 data points
- **Zero-padding**: The code contains a padding safeguard for short segments, but it is never triggered because 4 × 2,500 points are always available after resampling
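
Since the trace length is exactly 4 × 2,500, the split can also be written as a single reshape. This is a sketch of an equivalent formulation, not the loop used in the function below; the `np.arange` array is a placeholder for the cleaned trace.

```python
import numpy as np

segment_length = 2500
y_clean = np.arange(10_000, dtype=float)  # placeholder for the cleaned trace

# reshape requires len(y_clean) == 4 * segment_length and raises otherwise,
# which makes a wrong input length visible instead of silently dropping a lead.
leads = y_clean.reshape(4, segment_length)
```

Row `i` of `leads` then holds the same samples as `y_clean[i * 2500:(i + 1) * 2500]`.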

### 7. **Output and Visualization**
- **DataFrame Creation**: Organizes the extracted data into a pandas DataFrame with separate columns for each lead
- **Visualization**: Plots all 4 leads with grid lines and point count annotations
- **Statistics Display**: Shows the number of data points extracted for each lead


In [None]:
import cv2
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image 
from IPython.display import display
import io

In [None]:
# === Load Image ===
img_path = "/kaggle/input/physionet-ecg-image-digitization/test/1053922973.png"
img = cv2.imread(img_path)
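# Crop the four rows of the printout (row ranges are hard-coded for this image)
img1 = img[530:840]    # row 1: I, aVR, V1, V4
img2 = img[840:1170]   # row 2: II, aVL, V2, V5
img3 = img[1130:1420]  # row 3: III, aVF, V3, V6
img4 = img[1400:1600]  # row 4: long lead II strip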
print(img.shape)
# OpenCV loads images as BGR; convert to RGB so matplotlib shows true colors
for crop in (img, img1, img2, img3, img4):
    plt.imshow(cv2.cvtColor(crop, cv2.COLOR_BGR2RGB))
    plt.show()

In [None]:
test=pd.read_csv('/kaggle/input/physionet-ecg-image-digitization/test.csv')
display(test)

In [None]:
submission=pd.read_parquet('/kaggle/input/physionet-ecg-image-digitization/sample_submission.parquet')
display(submission)
submission['idi']=submission['id'].apply(lambda x:x.split('_')[0])
submission['index']=submission['id'].apply(lambda x:x.split('_')[1])
submission['lead']=submission['id'].apply(lambda x:x.split('_')[2])
display(submission)

In [None]:
def analyze_and_split(img):
    # === Preprocessing ===
    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Apply Gaussian blur (note: `blur` is not used by the steps below)
    blur = cv2.GaussianBlur(gray, (5, 5), 0)
    
    # === Grid Removal ===
    # Convert to HSV color space
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    # Define range for red color (assuming grid lines are red)
    lower_red = np.array([0, 80, 80])
    upper_red = np.array([20, 255, 255])
    # Create mask for red color
    mask_red = cv2.inRange(hsv, lower_red, upper_red)
    # Inpaint the grayscale image to remove the grid
    gray_no_grid = cv2.inpaint(gray, mask_red, 3, cv2.INPAINT_TELEA)
    
    # === Binarization (Thresholding) ===
    # Apply Otsu's thresholding with inverse binary
    _, binary = cv2.threshold(gray_no_grid, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    
    # === Waveform Contour Extraction ===
    # Find all external contours
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    # Select the contour with the largest area (assumed to be the main waveform)
    wave = max(contours, key=cv2.contourArea)
    
    # === Coordinate Extraction and Preprocessing ===
    # Squeeze the contour array to remove single-dimensional entries
    wave_points = np.squeeze(wave)
    # Sort points by X-coordinate
    wave_points = wave_points[np.argsort(wave_points[:,0])]
    
    # Extract X and Y coordinates
    x = wave_points[:,0]
    # Invert Y-axis (since pixel Y increases downwards)
    y = -wave_points[:,1]
    # Normalize Y: shift to start at zero
    y = y - np.min(y)
    # Normalize Y: scale to a maximum of 1
    y = y / np.max(y)
    
    # === Resampling to 10000 Points (after edge trimming) ===
    remove_start = 240  # Number of points to remove from the start
    remove_end = 140    # Number of points to remove from the end
    # Create new X-coordinates: 10000 kept points plus the 380 to be trimmed
    x_new = np.linspace(x.min(), x.max(), 10000 + remove_start + remove_end)
    # Interpolate Y-coordinates
    y_interp = np.interp(x_new, x, y)
    # Drop the edge samples, leaving exactly 10000 points
    y_clean = y_interp[remove_start:-remove_end]
    # ------------------------------------------------------------------
    
    # ------------------------------------------------------------------
    # === Split into 4 Leads (2500 Points Each) ===
    segment_length = 2500
    leads = []
    
    # Total available points for splitting: 10000
    for i in range(4):
        start_idx = i * segment_length
        end_idx = (i + 1) * segment_length
        
        if end_idx <= len(y_clean):
            lead_data = y_clean[start_idx:end_idx]
            # Zero-padding if necessary (should not occur here)
            if len(lead_data) < segment_length:
                lead_data = np.pad(lead_data, (0, segment_length - len(lead_data)), 'constant')
            leads.append(lead_data)
    # ------------------------------------------------------------------
    
    # === Create DataFrame ===
    df_leads = pd.DataFrame()
    for i, lead in enumerate(leads):
        df_leads[f'lead_{i+1}'] = lead
    
    # === Visualization (X-axis is index number) ===
    plt.figure(figsize=(15, 10))
    for i in range(4):
        plt.subplot(2, 2, i+1)
        # X-axis defaults to index 0, 1, 2, ...
        plt.plot(df_leads[f'lead_{i+1}'])
        plt.title(f"Lead {i+1} - {segment_length} Points")
        plt.xlabel("Data Point Index")
        plt.ylabel("Normalized Voltage")
        plt.grid(True, alpha=0.3)
        
        # Display the number of data points
        plt.text(0.02, 0.98, f'Points: {len(df_leads[f"lead_{i+1}"])}', 
                 transform=plt.gca().transAxes, verticalalignment='top',
                 bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
    
    plt.tight_layout()
    plt.show()
    
    # === Display Basic Statistics for Each Lead ===
    print("Number of data points for each lead:")
    for i in range(4):
        print(f"Lead {i+1}: {len(leads[i])} points")
    
    return df_leads

# Example Usage
# df_ecg = analyze_and_split(img) 
# print(f"Overall Data Shape: {df_ecg.shape}")

In [None]:
def analyze_and_split_for2(img):
    # === Preprocessing ===
    # Convert to grayscale
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Apply Gaussian blur (note: `blur` is not used by the steps below)
    blur = cv2.GaussianBlur(gray, (5, 5), 0)
    
    # === Grid Removal ===
    # Convert to HSV color space
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    # Define range for red color (assuming grid lines are red)
    lower_red = np.array([0, 80, 80])
    upper_red = np.array([20, 255, 255])
    # Create mask for red color
    mask_red = cv2.inRange(hsv, lower_red, upper_red)
    # Inpaint the grayscale image to remove the grid
    gray_no_grid = cv2.inpaint(gray, mask_red, 3, cv2.INPAINT_TELEA)
    
    # === Binarization (Thresholding) ===
    # Apply Otsu's thresholding with inverse binary
    _, binary = cv2.threshold(gray_no_grid, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    
    # === Waveform Contour Extraction ===
    # Find all external contours
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    # Select the contour with the largest area (assumed to be the main waveform)
    wave = max(contours, key=cv2.contourArea)
    
    # === Coordinate Extraction and Preprocessing ===
    # Squeeze the contour array to remove single-dimensional entries
    wave_points = np.squeeze(wave)
    # Sort points by X-coordinate
    wave_points = wave_points[np.argsort(wave_points[:,0])]
    
    # Extract X and Y coordinates
    x = wave_points[:,0]
    # Invert Y-axis (since pixel Y increases downwards)
    y = -wave_points[:,1]
    # Normalize Y: shift to start at zero
    y = y - np.min(y)
    # Normalize Y: scale to a maximum of 1
    y = y / np.max(y)
    
    # === Resampling to 10000 Points ===
    # Create new X-coordinates for resampling
    remove_start = 240  # Number of points to remove from the start
    remove_end = 140    # Number of points to remove from the end 
    x_new = np.linspace(x.min(), x.max(), 10000+remove_start+remove_end)
    # Interpolate Y-coordinates
    y_interp = np.interp(x_new, x, y)
    y_clean = y_interp[remove_start:-remove_end]
    # ------------------------------------------------------------------
    
    # === Create Single Lead DataFrame ===
    df_lead = pd.DataFrame({'lead_1': y_clean})
    
    # === Visualization ===
    plt.figure(figsize=(12, 6))
    plt.plot(df_lead['lead_1'])
    plt.title(f"Single Lead - {len(y_clean)} Points")
    plt.xlabel("Data Point Index")
    plt.ylabel("Normalized Voltage")
    plt.grid(True, alpha=0.3)
    
    # Display the number of data points
    plt.text(0.02, 0.98, f'Points: {len(df_lead["lead_1"])}', 
             transform=plt.gca().transAxes, verticalalignment='top',
             bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
    
    plt.tight_layout()
    plt.show()
    
    # === Display Basic Statistics ===
    print(f"Number of data points: {len(y_clean)}")
    print(f"Data range: {np.min(y_clean):.4f} to {np.max(y_clean):.4f}")
    print(f"Mean: {np.mean(y_clean):.4f}, Std: {np.std(y_clean):.4f}")
    
    return df_lead

In [None]:
#I,aVR,V1,V4
df1=analyze_and_split(img1)
display(df1)

# Write each 2500-point segment into the matching submission rows
mask_i = (submission['idi'] == '1053922973') & (submission['lead'] == 'I')
submission.loc[mask_i, 'value'] = df1['lead_1'].values

mask_avr = (submission['idi'] == '1053922973') & (submission['lead'] == 'aVR')
submission.loc[mask_avr, 'value'] = df1['lead_2'].values

mask_v1 = (submission['idi'] == '1053922973') & (submission['lead'] == 'V1')
submission.loc[mask_v1, 'value'] = df1['lead_3'].values

mask_v4 = (submission['idi'] == '1053922973') & (submission['lead'] == 'V4')
submission.loc[mask_v4, 'value'] = df1['lead_4'].values

In [None]:
#II,aVL,V2,V5
df2=analyze_and_split(img2)
display(df2)

# Lead II (lead_1 of this row) is skipped here; it is taken from the long strip (img4)
mask_avl = (submission['idi'] == '1053922973') & (submission['lead'] == 'aVL')
submission.loc[mask_avl, 'value'] = df2['lead_2'].values

mask_v2 = (submission['idi'] == '1053922973') & (submission['lead'] == 'V2')
submission.loc[mask_v2, 'value'] = df2['lead_3'].values

mask_v5 = (submission['idi'] == '1053922973') & (submission['lead'] == 'V5')
submission.loc[mask_v5, 'value'] = df2['lead_4'].values

In [None]:
#III,aVF,V3,V6
df3=analyze_and_split(img3)
display(df3)

mask_iii = (submission['idi'] == '1053922973') & (submission['lead'] == 'III')
submission.loc[mask_iii, 'value'] = df3['lead_1'].values

mask_avf = (submission['idi'] == '1053922973') & (submission['lead'] == 'aVF')
submission.loc[mask_avf, 'value'] = df3['lead_2'].values

mask_v3 = (submission['idi'] == '1053922973') & (submission['lead'] == 'V3')
submission.loc[mask_v3, 'value'] = df3['lead_3'].values

mask_v6 = (submission['idi'] == '1053922973') & (submission['lead'] == 'V6')
submission.loc[mask_v6, 'value'] = df3['lead_4'].values

In [None]:
#II
df4=analyze_and_split_for2(img4)
display(df4)

mask_ii = (submission['idi'] == '1053922973') & (submission['lead'] == 'II')
submission.loc[mask_ii, 'value'] = df4['lead_1'].values

In [None]:
all_mappings = [
    (df1, {'I': 'lead_1', 'aVR': 'lead_2', 'V1': 'lead_3', 'V4': 'lead_4'}),
    (df2, {'aVL': 'lead_2', 'V2': 'lead_3', 'V5': 'lead_4'}),
    (df3, {'III': 'lead_1', 'aVF': 'lead_2', 'V3': 'lead_3', 'V6': 'lead_4'}),
    (df4, {'II': 'lead_1'})
]

for df, mapping in all_mappings:
    for lead_name, df_column in mapping.items():
        mask = (submission['idi'] == '1053922973') & (submission['lead'] == lead_name)
        submission.loc[mask, 'value'] = df[df_column].values

In [None]:
display(submission[submission['idi']=='1053922973'])