# Apex Weld Quality – Phase 1 Dashboard
## Data Preparation, Feature Engineering & Analysis

This notebook is the **interactive analysis dashboard** for Phase 1 of the Apex weld-quality project. It:

1. Ingests and validates all weld-run data (sensor CSVs + images + labels)
2. Explores dataset statistics, label distributions, and signal quality
3. Engineers features from sensor time-series and image statistics
4. Creates a reproducible, group-based train/val/test split (no leakage)
5. Builds PyTorch datasets ready for downstream modelling
6. Exports all artefacts and a data-card summary

---

## 1. Import Libraries and Configuration

In [None]:
# ── Standard libraries ────────────────────────────────────────
import sys, os, json, logging, warnings
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import seaborn as sns

# Optional: interactive plots
try:
    import plotly.express as px
    import plotly.graph_objects as go
    from plotly.subplots import make_subplots
    HAS_PLOTLY = True
except ImportError:
    HAS_PLOTLY = False
    print("plotly not installed – falling back to matplotlib only")

import torch
from torch.utils.data import DataLoader

# ── Project modules ──────────────────────────────────────────
# Ensure project root is on sys.path
PROJECT_ROOT = Path().resolve()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src.config import (
    DATA_DIR, LABELS_CSV, SPLIT_DIR, OUTPUT_DIR, DASHBOARD_DIR,
    LABEL_COL, SAMPLE_ID_COL, LABEL_MAP, LABEL_INV,
    SENSOR_COLUMNS, FIXED_SEQ_LEN, IMAGE_SIZE,
    TRAIN_RATIO, VAL_RATIO, TEST_RATIO, RANDOM_SEED,
)
from src.data_ingestion import ingest
from src.feature_engineering import (
    build_feature_table, extract_sensor_features,
    extract_image_features, sensor_to_fixed_tensor,
)
from src.splitter import group_split, save_split, load_split
from src.dataset import WeldDataset, compute_normalize_stats

# ── Logging & style ──────────────────────────────────────────
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
warnings.filterwarnings("ignore", category=FutureWarning)
sns.set_theme(style="whitegrid", palette="muted", font_scale=1.1)
plt.rcParams.update({"figure.dpi": 110, "savefig.dpi": 150, "figure.figsize": (12, 5)})

print("All imports OK ✓")

## 2. Define Project Constants and Paths

All configuration is centralised in `src/config.py`. We print the key values here for reproducibility.

In [None]:
config_info = {
    "DATA_DIR": str(DATA_DIR),
    "LABELS_CSV": str(LABELS_CSV),
    "SPLIT_DIR": str(SPLIT_DIR),
    "OUTPUT_DIR": str(OUTPUT_DIR),
    "SENSOR_COLUMNS": SENSOR_COLUMNS,
    "FIXED_SEQ_LEN": FIXED_SEQ_LEN,
    "IMAGE_SIZE": IMAGE_SIZE,
    "SPLIT_RATIOS": f"train={TRAIN_RATIO}, val={VAL_RATIO}, test={TEST_RATIO}",
    "RANDOM_SEED": RANDOM_SEED,
    "LABEL_MAP": LABEL_MAP,
}
for k, v in config_info.items():
    print(f"  {k:20s} : {v}")

## 3. Data Ingestion – Discover and Load All Weld Runs

Each weld run lives in its own folder (e.g. `08-17-22-0011-00/`) containing:
- A sensor CSV with columns: Date, Time, Part No, Pressure, CO2 Weld Flow, Feed, Primary Weld Current, Wire Consumed, Secondary Weld Voltage, Remarks
- An `images/` sub-folder with inspection photographs

The `ingest()` function discovers all runs, validates each CSV, catalogues images, and merges labels from `labels.csv`.
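
The discovery step can be pictured with a minimal sketch, assuming the folder layout described above. This is not the project's `ingest()`: `discover_runs`, the set of image extensions, and the issue strings are hypothetical names chosen for illustration, and only the standard library is used.

```python
import csv
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}  # assumed image extensions


def discover_runs(data_dir):
    """Return one manifest-like record per run folder found under data_dir."""
    records = []
    for run_dir in sorted(p for p in Path(data_dir).iterdir() if p.is_dir()):
        csv_files = sorted(run_dir.glob("*.csv"))
        if not csv_files:
            continue  # no sensor CSV, so not treated as a weld run
        with open(csv_files[0], newline="") as fh:
            n_rows = sum(1 for _ in csv.DictReader(fh))  # data rows, header excluded
        image_dir = run_dir / "images"
        images = (
            sorted(p for p in image_dir.iterdir() if p.suffix.lower() in IMAGE_EXTS)
            if image_dir.is_dir() else []
        )
        issues = []  # hypothetical issue codes
        if n_rows == 0:
            issues.append("empty_csv")
        if not images:
            issues.append("no_images")
        records.append({
            "sample_id": run_dir.name,
            "n_sensor_rows": n_rows,
            "n_images": len(images),
            "issues": issues,
        })
    return records
```

Keeping `issues` as a list per run means a later validation step can flag a run simply by checking whether the list is non-empty.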

In [None]:
# Run the ingestion pipeline
manifest, sensor_data = ingest(DATA_DIR, LABELS_CSV)

print(f"Runs discovered : {len(manifest)}")
print(f"Labelled        : {manifest[LABEL_COL].notna().sum()}")
print(f"Sensor data keys: {len(sensor_data)}")
print()

# Show manifest
display(manifest.drop(columns=["image_paths"], errors="ignore").head(10))

In [None]:
# Show a sample raw sensor DataFrame for one run
sample_run = list(sensor_data.keys())[0]
print(f"Sample run: {sample_run}  ({len(sensor_data[sample_run])} rows)")
display(sensor_data[sample_run].head(10))
print(f"\nDtypes:\n{sensor_data[sample_run].dtypes}")

## 4. Data Validation – Identify Missing, Corrupt, and Mismatched Records

We check for: empty CSVs, NaN values in sensor channels, zero-variance columns, missing images, and duration outliers.
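
The duration-outlier check uses Tukey's IQR fences. A minimal standard-library sketch of the rule is below; `iqr_outliers` and the sample durations are made up for illustration. The `"inclusive"` quantile method interpolates linearly between data points, which should correspond to the default behaviour of pandas `quantile`.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (low, high, outliers) using Tukey's IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return low, high, [v for v in values if v < low or v > high]


durations_s = [41.0, 42.5, 43.0, 44.0, 95.0]  # made-up durations in seconds
low, high, outliers = iqr_outliers(durations_s)
print(f"fences=[{low}, {high}]  outliers={outliers}")
```

With very few runs the quartiles are unstable, so a flagged duration is a prompt to inspect the run rather than proof that it is bad.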

In [None]:
# Build validation report
val_rows = []
for _, row in manifest.iterrows():
    sid = row[SAMPLE_ID_COL]
    sdf = sensor_data[sid]
    nan_counts = sdf[SENSOR_COLUMNS].isnull().sum().to_dict()
    val_rows.append({
        "sample_id": sid,
        "n_rows": row["n_sensor_rows"],
        "duration_s": row["duration_s"],
        "n_images": row["n_images"],
        "const_cols": row["const_sensor_cols"],
        "issues": row["issues"],
        **{f"nan_{c}": nan_counts[c] for c in SENSOR_COLUMNS},
    })

validation_df = pd.DataFrame(val_rows)

# Flag runs with issues
print("=== Data Validation Report ===\n")
n_runs_with_issues = (validation_df["issues"].apply(len) > 0).sum()
print(f"Total runs with issues: {n_runs_with_issues} / {len(validation_df)}")

# Duration outlier detection (IQR)
durations = validation_df["duration_s"].dropna()
Q1, Q3 = durations.quantile(0.25), durations.quantile(0.75)
IQR = Q3 - Q1
low, high = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
# Build the mask on the full column so it stays aligned with validation_df
# even when some durations are NaN (NaN compares as False, so it is never flagged).
outlier_mask = (validation_df["duration_s"] < low) | (validation_df["duration_s"] > high)
if outlier_mask.any():
    print(f"\nDuration outliers (outside [{low:.1f}, {high:.1f}]s):")
    display(validation_df.loc[outlier_mask, ["sample_id", "duration_s"]])
else:
    print(f"No duration outliers (IQR range [{low:.1f}, {high:.1f}]s)")

# Show full table
display(validation_df.style.applymap(
    lambda v: "background-color: #ffcccc" if isinstance(v, list) and len(v) > 0 else "",
    subset=["issues"]
))

## 5. Define the Unit of Prediction and Align Labels

**Decision:** One sample = one complete weld run (Part No). Each run folder maps to exactly one row in `labels.csv`. The label is binary: `0 = good`, `1 = defect`.
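
The one-run-to-one-label assumption can be checked explicitly. The sketch below is illustrative only: `check_label_alignment` and its report keys are hypothetical, and it takes plain `(sample_id, label)` pairs rather than the project's manifest.

```python
from collections import Counter


def check_label_alignment(run_ids, label_rows):
    """Compare run folders against a list of (sample_id, label) pairs."""
    counts = Counter(sid for sid, _ in label_rows)
    run_set, labelled_set = set(run_ids), set(counts)
    return {
        "unlabelled_runs": sorted(run_set - labelled_set),
        "orphan_labels": sorted(labelled_set - run_set),
        "duplicate_labels": sorted(sid for sid, n in counts.items() if n > 1),
        "invalid_values": sorted({sid for sid, lab in label_rows if lab not in (0, 1)}),
    }


report = check_label_alignment(
    ["run_a", "run_b", "run_c"],
    [("run_a", 0), ("run_b", 1), ("run_b", 1), ("run_d", 2)],
)
print(report)
```

An empty list under every key means each run has exactly one binary label.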

In [None]:
# Verify label alignment
print("Label map:", LABEL_MAP)
print()

label_check = manifest[[SAMPLE_ID_COL, LABEL_COL]].copy()
label_check["label_name"] = label_check[LABEL_COL].map(LABEL_MAP)
label_check["has_sensor_data"] = label_check[SAMPLE_ID_COL].isin(sensor_data.keys())

n_labelled = label_check[LABEL_COL].notna().sum()
n_unlabelled = label_check[LABEL_COL].isna().sum()
print(f"Labelled runs   : {n_labelled}")
print(f"Unlabelled runs : {n_unlabelled}")

if n_unlabelled > 0:
    print("⚠ Unlabelled runs (will be excluded from supervised training):")
    display(label_check[label_check[LABEL_COL].isna()])

display(label_check)

## 6. Exploratory Dataset Overview Dashboard

Summary statistics across all weld runs: durations, row counts, runs-per-date, and per-sensor aggregates.

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 9))
fig.suptitle("Dataset Overview Dashboard", fontsize=15, fontweight="bold")

# (a) Weld duration histogram
ax = axes[0, 0]
durations = manifest["duration_s"].dropna()
ax.hist(durations, bins=15, color="steelblue", edgecolor="white")
ax.set_xlabel("Duration (s)")
ax.set_ylabel("Count")
ax.set_title("Weld Duration Distribution")
ax.axvline(durations.mean(), color="red", ls="--", label=f"mean={durations.mean():.1f}s")
ax.legend()

# (b) Rows per run
ax = axes[0, 1]
rows_per_run = manifest["n_sensor_rows"]
ax.hist(rows_per_run, bins=15, color="darkorange", edgecolor="white")
ax.set_xlabel("Sensor Rows")
ax.set_ylabel("Count")
ax.set_title("Sensor Readings per Run")
ax.axvline(rows_per_run.mean(), color="red", ls="--", label=f"mean={rows_per_run.mean():.0f}")
ax.legend()

# (c) Runs per date
ax = axes[1, 0]
manifest["_date"] = manifest[SAMPLE_ID_COL].str[:8]
date_counts = manifest["_date"].value_counts().sort_index()
ax.bar(date_counts.index, date_counts.values, color="mediumseagreen", edgecolor="white")
ax.set_xlabel("Date group")
ax.set_ylabel("# Runs")
ax.set_title("Runs per Date")
ax.tick_params(axis="x", rotation=30)

# (d) Summary stats table (as text)
ax = axes[1, 1]
ax.axis("off")
all_sensor = pd.concat([sensor_data[sid][SENSOR_COLUMNS] for sid in sensor_data], ignore_index=True)
stats = all_sensor.describe().T[["mean", "std", "min", "max"]].round(2)
tbl = ax.table(
    cellText=stats.values,
    rowLabels=stats.index,
    colLabels=stats.columns,
    loc="center",
    cellLoc="center",
)
tbl.auto_set_font_size(False)
tbl.set_fontsize(9)
tbl.scale(1.2, 1.4)
ax.set_title("Global Sensor Statistics", fontsize=12, pad=20)

plt.tight_layout(rect=[0, 0, 1, 0.95])
DASHBOARD_DIR.mkdir(parents=True, exist_ok=True)
fig.savefig(DASHBOARD_DIR / "01_dataset_overview.png", bbox_inches="tight")
plt.show()
print(f"Saved → {DASHBOARD_DIR / '01_dataset_overview.png'}")

## 7. Label Distribution Analysis

In [None]:
labelled = manifest[manifest[LABEL_COL].notna()].copy()
labelled["label_name"] = labelled[LABEL_COL].map(LABEL_MAP)

if len(labelled) > 0:
    vc = labelled["label_name"].value_counts()
    
    fig, axes = plt.subplots(1, 2, figsize=(12, 4.5))
    
    # Bar chart
    colors = ["#4CAF50" if x == "good" else "#F44336" for x in vc.index]
    axes[0].bar(vc.index, vc.values, color=colors, edgecolor="white", width=0.5)
    for i, (lbl, cnt) in enumerate(vc.items()):
        axes[0].text(i, cnt + 0.1, str(cnt), ha="center", fontweight="bold")
    axes[0].set_title("Label Counts (bar)")
    axes[0].set_ylabel("# Runs")
    
    # Pie chart
    axes[1].pie(vc.values, labels=vc.index, autopct="%1.0f%%", colors=colors,
                startangle=90, textprops={"fontsize": 12})
    axes[1].set_title("Label Proportions")
    
    plt.tight_layout()
    fig.savefig(DASHBOARD_DIR / "02_label_distribution.png", bbox_inches="tight")
    plt.show()
    
    # Class imbalance warning
    majority = vc.max()
    minority = vc.min()
    ratio = majority / minority if minority > 0 else float("inf")
    print(f"\nClass imbalance ratio: {ratio:.1f}:1  (majority={vc.idxmax()}, minority={vc.idxmin()})")
    if ratio > 3:
        print("⚠ Significant imbalance detected. Consider:")
        print("  • Weighted loss function (pos_weight in BCEWithLogitsLoss / weight in CrossEntropyLoss)")
        print("  • Oversampling minority class (SMOTE, random duplication)")
        print("  • Class-balanced sampling in DataLoader (e.g. WeightedRandomSampler)")
    else:
        print("✓ Imbalance within acceptable range.")
else:
    print("No labels available – skipping distribution analysis.")

## 8. Sensor Signal Visualization – Representative Examples

Multi-panel plots of all 6 sensor channels for selected weld runs with weld-phase annotations (idle → arc ignition → steady state → ramp-down → cool-down).
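
The phase boundaries come from thresholding the weld current. A minimal sketch of that logic, assuming the same threshold (10.0), ignition length (3 s) and ramp-down length (1 s) as the plotting cell; `arc_window` and `phase_spans` are hypothetical helper names, and the clamping of overlapping phases is an addition for very short welds.

```python
def arc_window(current, t, threshold=10.0):
    """Return (t_start, t_end) of the arc-on interval, or None if the arc never fires."""
    on = [i for i, c in enumerate(current) if c > threshold]
    if not on:
        return None
    return t[on[0]], t[on[-1]]


def phase_spans(current, t, threshold=10.0, ignition_s=3.0, ramp_s=1.0):
    """Map each weld phase to a (start, end) time span in seconds."""
    window = arc_window(current, t, threshold)
    if window is None:
        return {}
    t_start, t_end = window
    ign_end = min(t_start + ignition_s, t_end)    # clamp for very short welds
    ramp_start = max(t_end - ramp_s, ign_end)     # never start before ignition ends
    return {
        "idle": (t[0], t_start),
        "ignition": (t_start, ign_end),
        "steady state": (ign_end, ramp_start),
        "ramp-down": (ramp_start, t_end),
        "cool-down": (t_end, t[-1]),
    }


t = [float(i) for i in range(10)]
current = [0, 0, 50, 200, 200, 200, 200, 200, 50, 0]
print(phase_spans(current, t))
```

The fixed 3 s and 1 s windows are heuristics; they mark approximate regions for visual inspection and are not measured phase transitions.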

In [None]:
# Select representative runs: first good, first defect, and one from a different date
example_ids = []
if len(labelled) > 0:
    good_ids = labelled[labelled[LABEL_COL] == 0][SAMPLE_ID_COL].tolist()
    defect_ids = labelled[labelled[LABEL_COL] == 1][SAMPLE_ID_COL].tolist()
    if good_ids:
        example_ids.append(good_ids[0])
    if defect_ids:
        example_ids.append(defect_ids[0])
    # Add one from a different date if available
    # (guarded so an empty example_ids list cannot raise IndexError)
    if example_ids:
        first_date = example_ids[0][:8]
        for sid in manifest[SAMPLE_ID_COL].tolist():
            if sid not in example_ids and sid[:8] != first_date:
                example_ids.append(sid)
                break
else:
    example_ids = list(sensor_data.keys())[:3]

print(f"Plotting {len(example_ids)} representative runs: {example_ids}\n")

for sid in example_ids:
    sdf = sensor_data[sid].copy()
    
    # Get label name
    lbl_row = manifest[manifest[SAMPLE_ID_COL] == sid]
    lbl_val = lbl_row[LABEL_COL].values[0] if len(lbl_row) else None
    lbl_name = LABEL_MAP.get(lbl_val, "unlabelled") if pd.notna(lbl_val) else "unlabelled"
    
    fig, axes = plt.subplots(3, 2, figsize=(15, 10), sharex=True)
    fig.suptitle(f"Run: {sid}  |  Label: {lbl_name}", fontsize=14, fontweight="bold")
    
    # Time axis (seconds from start)
    if "datetime" in sdf.columns and sdf["datetime"].notna().any():
        t = (sdf["datetime"] - sdf["datetime"].min()).dt.total_seconds().values
    else:
        t = np.arange(len(sdf)) * 0.11  # ~110ms intervals
    
    for i, col in enumerate(SENSOR_COLUMNS):
        ax = axes[i // 2, i % 2]
        ax.plot(t, sdf[col].values, linewidth=0.8)
        ax.set_ylabel(col, fontsize=9)
        ax.grid(True, alpha=0.3)
    
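    # Shade weld phases on every panel; the arc-on window is located by
    # thresholding Primary Weld Current. The unshaded middle is steady state.
    current = sdf["Primary Weld Current"].values
    threshold = 10.0
    arc_on = np.where(current > threshold)[0]
    if len(arc_on) > 0:
        t_start = t[arc_on[0]]
        t_end = t[arc_on[-1]]
        ign_end = min(t_start + 3, t_end)      # ignition: first 3 s of arc, clamped
        ramp_start = max(t_end - 1, ign_end)   # ramp-down: last 1 s of arc, clamped
        for ax in axes.flat:
            ax.axvspan(0, t_start, alpha=0.08, color="blue", label="idle")
            ax.axvspan(t_start, ign_end, alpha=0.12, color="orange", label="ignition")
            ax.axvspan(ramp_start, t_end, alpha=0.12, color="purple", label="ramp-down")
            ax.axvspan(t_end, t[-1], alpha=0.08, color="gray", label="cool-down")
        axes[0, 0].legend(loc="upper right", fontsize=8)  # show the phase key once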
    
    axes[-1, 0].set_xlabel("Time (s)")
    axes[-1, 1].set_xlabel("Time (s)")
    plt.tight_layout(rect=[0, 0, 1, 0.96])
    fig.savefig(DASHBOARD_DIR / f"03_signals_{sid}.png", bbox_inches="tight")
    plt.show()

## 9. Data Quality Indicators – Outliers, Noise, and Class Imbalance

Per-run boxplots of sensor summary stats, outlier detection (IQR), and signal-to-noise ratio over the arcing phase (samples where Primary Weld Current exceeds the threshold of 10).
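
Here "signal-to-noise ratio" means the mean of a channel divided by its standard deviation over the arcing samples. A minimal standard-library sketch follows; `arc_snr` is a hypothetical helper, and returning NaN when there are too few arcing samples is a choice made for this sketch.

```python
import math
import statistics


def arc_snr(signal, current, threshold=10.0, min_samples=5):
    """Mean/std ratio of `signal` over the samples where the arc is on."""
    arc = [s for s, c in zip(signal, current) if c > threshold]
    if len(arc) <= min_samples:
        return math.nan  # too few arcing samples to estimate anything
    sd = statistics.pstdev(arc)  # population std, like NumPy's default
    return statistics.fmean(arc) / sd if sd > 0 else math.inf


current = [0, 200, 200, 200, 200, 200, 200, 0]
voltage = [0, 19, 21, 19, 21, 19, 21, 0]
print(arc_snr(voltage, current))
```

A perfectly constant signal yields an infinite ratio, so an average over runs is only meaningful after non-finite values are filtered out.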

In [None]:
# Per-run summary stats for boxplot analysis
run_stats = []
for sid, sdf in sensor_data.items():
    row = {"sample_id": sid}
    for col in SENSOR_COLUMNS:
        vals = sdf[col].dropna()
        row[f"{col}__mean"] = vals.mean()
        row[f"{col}__max"] = vals.max()
        row[f"{col}__std"] = vals.std()
    # SNR during arcing phase
    current = sdf["Primary Weld Current"].values
    arcing = current > 10.0
    if arcing.sum() > 5:
        arc_current = current[arcing]
        row["arc_current_snr"] = arc_current.mean() / arc_current.std() if arc_current.std() > 0 else np.inf
        arc_voltage = sdf["Secondary Weld Voltage"].values[arcing]
        row["arc_voltage_snr"] = arc_voltage.mean() / arc_voltage.std() if np.std(arc_voltage) > 0 else np.inf
    else:
        row["arc_current_snr"] = 0.0
        row["arc_voltage_snr"] = 0.0
    run_stats.append(row)

run_stats_df = pd.DataFrame(run_stats).set_index("sample_id")

# Boxplots of per-run sensor means and maxes
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
fig.suptitle("Per-Run Sensor Statistics – Boxplots", fontsize=14, fontweight="bold")

for i, col in enumerate(SENSOR_COLUMNS):
    ax = axes[i // 3, i % 3]
    data_to_plot = [
        run_stats_df[f"{col}__mean"].dropna().values,
        run_stats_df[f"{col}__max"].dropna().values,
        run_stats_df[f"{col}__std"].dropna().values,
    ]
    bp = ax.boxplot(data_to_plot, labels=["mean", "max", "std"], patch_artist=True)
    for patch, color in zip(bp["boxes"], ["#2196F3", "#FF9800", "#4CAF50"]):
        patch.set_facecolor(color)
        patch.set_alpha(0.6)
    ax.set_title(col, fontsize=10)
    ax.grid(True, alpha=0.3)

plt.tight_layout(rect=[0, 0, 1, 0.95])
fig.savefig(DASHBOARD_DIR / "04_sensor_boxplots.png", bbox_inches="tight")
plt.show()

# SNR summary
print("\n=== Signal-to-Noise Ratio (arcing phase) ===")
print(f"  Primary Weld Current   SNR: mean={run_stats_df['arc_current_snr'].mean():.2f}")
print(f"  Secondary Weld Voltage SNR: mean={run_stats_df['arc_voltage_snr'].mean():.2f}")

# IQR outlier counts per column
print("\n=== Outlier Runs per Sensor (IQR method) ===")
for col in SENSOR_COLUMNS:
    means = run_stats_df[f"{col}__mean"].dropna()
    Q1, Q3 = means.quantile(0.25), means.quantile(0.75)
    IQR = Q3 - Q1
    outliers = ((means < Q1 - 1.5*IQR) | (means > Q3 + 1.5*IQR)).sum()
    print(f"  {col:30s}: {outliers} outlier runs")

## 10. Preprocessing and Standardization – Resampling, Normalization, Padding

Each run's sensor time-series is padded/truncated to `FIXED_SEQ_LEN` rows. Normalization stats (mean, std per channel) are computed from the **training set only** to avoid data leakage.
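
As a rough illustration of these two steps, the sketch below pads or truncates a `(T, C)` array and fits z-score parameters on training runs only. It is not the project's `sensor_to_fixed_tensor` or `compute_normalize_stats`; `to_fixed_length` and `fit_zscore` are hypothetical names, and padding with zeros at the end of the sequence is an assumption.

```python
import numpy as np


def to_fixed_length(seq, seq_len):
    """Truncate or zero-pad a (T, C) array to shape (seq_len, C)."""
    seq = np.asarray(seq, dtype=np.float32)
    out = np.zeros((seq_len, seq.shape[1]), dtype=np.float32)
    n = min(len(seq), seq_len)
    out[:n] = seq[:n]  # rows beyond n stay zero (padding)
    return out


def fit_zscore(train_seqs):
    """Per-channel mean and std over all time steps of the training runs only."""
    stacked = np.concatenate([np.asarray(s, dtype=np.float64) for s in train_seqs], axis=0)
    std = stacked.std(axis=0)
    std[std == 0] = 1.0  # constant channels: avoid division by zero
    return stacked.mean(axis=0), std


train = [np.array([[1.0, 10.0], [3.0, 10.0]]), np.array([[5.0, 10.0]])]
mu, sd = fit_zscore(train)
x = (to_fixed_length(train[0], seq_len=4) - mu) / sd
print(x.shape)
```

Validation and test runs are transformed with the same `mu` and `sd`; refitting them on held-out data would leak information into evaluation.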

In [None]:
# Demonstrate fixed-length tensor conversion and normalization

# First, compute median sampling interval
intervals = []
for sid, sdf in sensor_data.items():
    if "datetime" in sdf.columns and sdf["datetime"].notna().sum() > 1:
        dt = sdf["datetime"].diff().dt.total_seconds().dropna()
        intervals.extend(dt.values.tolist())
median_interval = np.median(intervals) if intervals else 0.11
print(f"Median sampling interval: {median_interval*1000:.1f} ms")
print(f"Fixed sequence length: {FIXED_SEQ_LEN} rows")

# Convert one example to fixed tensor (before normalization)
demo_sid = list(sensor_data.keys())[0]
raw_tensor = sensor_to_fixed_tensor(sensor_data[demo_sid], FIXED_SEQ_LEN)
print(f"\nRaw tensor shape: {raw_tensor.shape}  (seq_len × n_channels)")
print(f"Before normalization – channel means: {raw_tensor.mean(axis=0).round(2)}")
print(f"Before normalization – channel stds:  {raw_tensor.std(axis=0).round(2)}")

# We'll compute norm stats from the split in next sections – preview here
train_ids = list(sensor_data.keys())[:6]  # stand-in for actual train split
norm_stats_preview = compute_normalize_stats(sensor_data, train_ids)
print(f"\nNormalization stats (from {len(train_ids)} train runs):")
print(f"  mean: {np.round(norm_stats_preview['mean'], 3)}")
print(f"  std:  {np.round(norm_stats_preview['std'], 3)}")

# Apply normalization
mu, sd = norm_stats_preview["mean"], norm_stats_preview["std"].copy()
sd[sd == 0] = 1.0
normed = (raw_tensor - mu) / sd

# Before / after plot
# Look up the channel index instead of hard-coding 3 (assumes tensor columns follow SENSOR_COLUMNS)
ch = list(SENSOR_COLUMNS).index("Primary Weld Current")
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
axes[0].plot(raw_tensor[:, ch], label="Primary Weld Current (raw)")
axes[0].set_title(f"Before Normalization – {demo_sid}")
axes[0].set_xlabel("Time step")
axes[0].legend()
axes[1].plot(normed[:, ch], label="Primary Weld Current (z-scored)", color="darkorange")
axes[1].set_title("After Z-Score Normalization")
axes[1].set_xlabel("Time step")
axes[1].legend()
plt.tight_layout()
fig.savefig(DASHBOARD_DIR / "05_normalization_demo.png", bbox_inches="tight")
plt.show()

## 11. Feature Engineering – Sensor Statistics and Derived Signals

Using `build_feature_table()` we extract per-run features: global stats (mean, std, min, max, median, range, IQR), windowed volatility, rate-of-change, and weld-phase timing features.
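
To make these feature groups concrete, here is a small standard-library sketch for a single channel. It is not the project's `extract_sensor_features`: `channel_features`, the feature names, the window size, and the use of non-overlapping windows are all assumptions for illustration.

```python
import statistics


def channel_features(values, window=10):
    """Summary features for one sensor channel (needs len(values) >= window >= 1)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    diffs = [b - a for a, b in zip(values, values[1:])]  # first differences
    # Non-overlapping windows; a trailing partial window is dropped.
    win_stds = [
        statistics.pstdev(values[i:i + window])
        for i in range(0, len(values) - window + 1, window)
    ]
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),
        "min": min(values),
        "max": max(values),
        "median": median,
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "diff_mean": statistics.fmean(diffs),
        "diff_max": max(diffs),
        "win_std_mean": statistics.fmean(win_stds),
        "win_std_max": max(win_stds),
    }


print(channel_features([0, 1, 2, 3, 4, 5, 6, 7], window=4))
```

Applying the same function to every channel and prefixing each key with the channel name would give one flat feature row per run.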

In [None]:
# Build the full feature table
feature_df = build_feature_table(manifest, sensor_data)

print(f"Feature table shape: {feature_df.shape[0]} samples × {feature_df.shape[1]} features\n")
print("Feature categories:")

# Categorize features
cats = {}
for col in feature_df.columns:
    prefix = col.split("__")[0] if "__" in col else col
    cats.setdefault(prefix, []).append(col)

for cat, cols in cats.items():
    print(f"  {cat:35s}: {len(cols)} features")

print(f"\nSample features for {feature_df.index[0]}:")
display(feature_df.iloc[:3].T.round(3))

# Feature mapping table
print("\n=== Raw Input → Engineered Feature Mapping ===")
mapping = [
    ("Sensor CSV", "Global stats", "mean, std, min, max, median, range, IQR per channel"),
    ("Sensor CSV", "Windowed stats", "sliding-window mean-of-std, std-of-mean, max-std"),
    ("Sensor CSV", "Rate of change", "diff mean, std, max, min per channel"),
    ("Sensor CSV", "Weld phases", "arc fraction, start/end idx, duration fraction"),
    ("Images/", "Image brightness", "mean, std, min, max of pixel intensity (grayscale)"),
    ("Images/", "Image texture", "histogram entropy, edge density (Sobel gradient)"),
]
mapping_df = pd.DataFrame(mapping, columns=["Raw Source", "Feature Group", "Description"])
display(mapping_df)

## 12. Feature Correlation and Importance Analysis

Pearson correlation heatmap, highly-correlated feature pairs, and point-biserial correlation with binary labels.
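
Point-biserial correlation is Pearson's r computed between a binary variable and a continuous one. It can be written as r_pb = (M1 - M0) / s_n * sqrt(n1 * n0) / n, where M1 and M0 are the group means, n1 and n0 the group sizes, n their sum, and s_n the population standard deviation of all values. A minimal sketch of that formula, with `point_biserial` as a hypothetical helper name:

```python
import math
import statistics


def point_biserial(labels, values):
    """Point-biserial r between binary labels (0/1) and a continuous feature."""
    g1 = [v for lab, v in zip(labels, values) if lab == 1]
    g0 = [v for lab, v in zip(labels, values) if lab == 0]
    n1, n0 = len(g1), len(g0)
    if n1 == 0 or n0 == 0:
        return math.nan  # undefined when one class is missing
    s_n = statistics.pstdev(g0 + g1)  # population std of all values
    if s_n == 0:
        return math.nan  # undefined for a constant feature
    n = n1 + n0
    return (statistics.fmean(g1) - statistics.fmean(g0)) / s_n * math.sqrt(n1 * n0) / n


print(point_biserial([0, 0, 1, 1], [1.0, 2.0, 3.0, 4.0]))
```

With only a handful of labelled runs, even large values of |r| can arise by chance, so the ranking is better read as a list of candidates than as a measure of importance.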

In [None]:
# --- Correlation heatmap ---
# Select numeric columns only, drop any with zero variance
numeric_feats = feature_df.select_dtypes(include=[np.number])
nonconst = numeric_feats.loc[:, numeric_feats.std() > 0]

corr = nonconst.corr()

fig, ax = plt.subplots(figsize=(16, 14))
sns.heatmap(
    corr, ax=ax, cmap="RdBu_r", center=0, vmin=-1, vmax=1,
    xticklabels=False, yticklabels=True,
    linewidths=0.1, cbar_kws={"shrink": 0.6},
)
ax.set_title("Feature Correlation Matrix", fontsize=14)
ax.tick_params(axis="y", labelsize=6)
plt.tight_layout()
fig.savefig(DASHBOARD_DIR / "06_correlation_heatmap.png", bbox_inches="tight")
plt.show()

# Highly correlated pairs (|r| > 0.95)
high_corr = []
for i in range(len(corr)):
    for j in range(i + 1, len(corr)):
        if abs(corr.iloc[i, j]) > 0.95:
            high_corr.append((corr.index[i], corr.columns[j], round(corr.iloc[i, j], 3)))

print(f"\nHighly correlated pairs (|r| > 0.95): {len(high_corr)}")
if high_corr:
    hc_df = pd.DataFrame(high_corr, columns=["Feature A", "Feature B", "r"])
    display(hc_df.head(20))
else:
    print("  None found.")

# Point-biserial correlation with label
if manifest[LABEL_COL].notna().sum() > 2:
    from scipy.stats import pointbiserialr
    
    labelled_ids = manifest[manifest[LABEL_COL].notna()][SAMPLE_ID_COL].tolist()
    feat_labelled = nonconst.loc[nonconst.index.isin(labelled_ids)]
    labels_arr = manifest.set_index(SAMPLE_ID_COL).loc[feat_labelled.index, LABEL_COL].values.astype(float)
    
    pb_corrs = {}
    for col in feat_labelled.columns:
        vals = feat_labelled[col].values
        if np.std(vals) > 0:
            r, p = pointbiserialr(labels_arr, vals)
            pb_corrs[col] = abs(r)
    
    pb_series = pd.Series(pb_corrs).sort_values(ascending=False)
    
    # Top 20 features by label correlation
    fig, ax = plt.subplots(figsize=(10, 8))
    top_n = min(20, len(pb_series))
    top = pb_series.head(top_n)
    ax.barh(range(top_n), top.values, color="steelblue")
    ax.set_yticks(range(top_n))
    ax.set_yticklabels(top.index, fontsize=8)
    ax.set_xlabel("|Point-Biserial r|")
    ax.set_title(f"Top {top_n} Features by Label Correlation")
    ax.invert_yaxis()
    plt.tight_layout()
    fig.savefig(DASHBOARD_DIR / "07_feature_importance.png", bbox_inches="tight")
    plt.show()
    
    print("\nMost discriminative signals:")
    for feat, r in top.head(5).items():
        base = feat.split("__")[0]
        print(f"  {feat:45s}  |r|={r:.3f}  (derived from {base})")
else:
    print("Insufficient labels for point-biserial correlation.")

## 13. Create Reproducible Group-Based Train/Val/Test Split

Runs are grouped by the date prefix of their sample ID so that all runs from the same session land in the same split, which prevents temporal leakage. A fixed random seed makes the split reproducible.
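
One way to picture a group-based split is the greedy sketch below. It is not the project's `group_split()`: the function name, the default ratios, the seed, and the greedy assignment rule are all assumptions for illustration.

```python
import random
from collections import defaultdict


def group_split_sketch(sample_ids, ratios=(0.7, 0.15, 0.15), seed=42, prefix_len=8):
    """Assign whole date groups to train/val/test; no group is ever divided."""
    groups = defaultdict(list)
    for sid in sample_ids:
        groups[sid[:prefix_len]].append(sid)  # e.g. "08-17-22"

    keys = sorted(groups)                 # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)

    split = {"train": [], "val": [], "test": []}
    targets = {name: r * len(sample_ids) for name, r in zip(split, ratios)}
    for key in keys:
        # Give the group to the split that is furthest below its target size.
        name = max(split, key=lambda s: targets[s] - len(split[s]))
        split[name].extend(groups[key])
    return split


ids = [f"08-{day}-22-{i:04d}-00" for day in (15, 16, 17, 18, 19) for i in range(4)]
print({name: len(v) for name, v in group_split_sketch(ids).items()})
```

Because whole groups are assigned, the achieved ratios only approximate the targets. With just two or three distinct dates, at least one partition can end up empty, so the per-split summary is worth checking.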

In [None]:
# Perform group-based split
split_map = group_split(manifest)
split_path = save_split(split_map)

print(f"Split saved to: {split_path}\n")
print("Split summary:")
for k, ids in split_map.items():
    print(f"  {k:5s}: {len(ids)} runs → {ids}")

# Verify no overlap
all_ids_flat = []
for ids in split_map.values():
    all_ids_flat.extend(ids)
assert len(all_ids_flat) == len(set(all_ids_flat)), "ERROR: Overlap between splits!"
print("\n✓ No overlap between train/val/test sets.")

# Label distribution within each split
if manifest[LABEL_COL].notna().sum() > 0:
    print("\nLabel distribution per split:")
    lbl_lookup = manifest.set_index(SAMPLE_ID_COL)[LABEL_COL]
    split_label_summary = []
    for split_name, ids in split_map.items():
        labels = lbl_lookup.loc[ids].dropna()
        n_good = (labels == 0).sum()
        n_defect = (labels == 1).sum()
        n_unlabelled = len(ids) - len(labels)
        split_label_summary.append({
            "split": split_name, "n_runs": len(ids),
            "good": n_good, "defect": n_defect, "unlabelled": n_unlabelled,
        })
    display(pd.DataFrame(split_label_summary))

## 14. Build PyTorch Dataset and DataLoader Pipeline

Instantiate `WeldDataset` for each split with normalization stats computed on the training set only. Verify tensor shapes.

In [None]:
# Compute normalization stats from TRAIN set only
norm_stats = compute_normalize_stats(sensor_data, split_map["train"])
print("Normalization stats (train-only):")
for ch, m, s in zip(SENSOR_COLUMNS, norm_stats["mean"], norm_stats["std"]):
    print(f"  {ch:30s}  mean={m:10.3f}  std={s:10.3f}")

# Build datasets
datasets = {}
for split_name, ids in split_map.items():
    ds = WeldDataset(
        manifest=manifest,
        sensor_data=sensor_data,
        sample_ids=ids,
        feature_df=feature_df,
        normalize_stats=norm_stats,
    )
    datasets[split_name] = ds

# Demo: access one sample
print("\n=== Sample from train set ===")
sample = datasets["train"][0]
for k, v in sample.items():
    if isinstance(v, torch.Tensor):
        print(f"  {k:15s}: shape={tuple(v.shape)}, dtype={v.dtype}")
    else:
        print(f"  {k:15s}: {v}")

# Create a DataLoader and iterate one batch
train_loader = DataLoader(datasets["train"], batch_size=min(4, len(datasets["train"])), shuffle=True)
batch = next(iter(train_loader))
print(f"\n=== One batch (batch_size={batch['sensor_seq'].shape[0]}) ===")
print(f"  sensor_seq : {tuple(batch['sensor_seq'].shape)}")
print(f"  features   : {tuple(batch['features'].shape)}")
print(f"  label      : {batch['label'].tolist()}")
print(f"  sample_id  : {batch['sample_id']}")

## 15. Export Artefacts – Feature Table, Manifest, Normalization Stats, Split Files

In [None]:
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
DASHBOARD_DIR.mkdir(parents=True, exist_ok=True)

# 1. Feature table
ft_path = OUTPUT_DIR / "feature_table.csv"
feature_df.to_csv(ft_path)
print(f"✓ Feature table        → {ft_path}  ({ft_path.stat().st_size / 1024:.1f} KB)")

# 2. Manifest (drop non-serializable columns)
manifest_export = manifest.drop(columns=["image_paths"], errors="ignore")
manifest_path = OUTPUT_DIR / "manifest.csv"
manifest_export.to_csv(manifest_path, index=False)
print(f"✓ Manifest             → {manifest_path}  ({manifest_path.stat().st_size / 1024:.1f} KB)")

# 3. Normalization stats
norm_path = OUTPUT_DIR / "normalize_stats.json"
with open(norm_path, "w") as f:
    json.dump({k: v.tolist() for k, v in norm_stats.items()}, f, indent=2)
print(f"✓ Normalization stats  → {norm_path}")

# 4. Split definition (written by save_split() in Section 13; confirm it exists)
split_file = Path(split_path)
status = "✓" if split_file.exists() else "✗ MISSING"
print(f"{status} Split definition     → {split_file}")

# 5. Dashboard plots
plot_files = sorted(DASHBOARD_DIR.glob("*.png"))
print(f"✓ Dashboard plots      → {len(plot_files)} PNGs in {DASHBOARD_DIR}")
for p in plot_files:
    print(f"    {p.name}  ({p.stat().st_size / 1024:.1f} KB)")

## 16. Generate Data Card Summary Report

Programmatic one-page data card summarising dataset characteristics, preprocessing decisions, and known issues.

In [None]:
from IPython.display import Markdown

# Build data card content
n_labelled = manifest[LABEL_COL].notna().sum()
n_good = (manifest[LABEL_COL] == 0).sum()
n_defect = (manifest[LABEL_COL] == 1).sum()
dur = manifest["duration_s"].dropna()

data_card = f"""# Apex Weld Quality – Data Card

## Dataset Overview
| Property | Value |
|----------|-------|
| Dataset name | Apex Weld Quality – sampleData |
| Date range | Aug 17–18, 2022 |
| Total weld runs | {len(manifest)} |
| Labelled runs | {n_labelled} (good={n_good}, defect={n_defect}) |
| Sensor channels | {len(SENSOR_COLUMNS)} ({', '.join(SENSOR_COLUMNS)}) |
| Sampling rate | ~9–10 Hz (median interval ≈ {median_interval*1000:.0f} ms) |
| Rows per run | {int(manifest['n_sensor_rows'].min())}–{int(manifest['n_sensor_rows'].max())} (mean ≈ {manifest['n_sensor_rows'].mean():.0f}) |
| Weld duration | {dur.min():.1f}–{dur.max():.1f} s (mean ≈ {dur.mean():.1f} s) |
| Images per run | {int(manifest['n_images'].min())}–{int(manifest['n_images'].max())} |

## Label Definitions
- **0 = good**: Normal weld with no identified defects
- **1 = defect**: Weld with one or more quality issues

## Preprocessing Choices
| Choice | Value |
|--------|-------|
| Unit of prediction | One complete weld run (Part No) |
| Sequence length | Fixed at {FIXED_SEQ_LEN} rows (pad with zeros / truncate) |
| Normalization | Per-channel z-score (mean/std from train set only) |
| Image processing | Grayscale, resized to {IMAGE_SIZE}, basic statistics extracted |
| Split strategy | Group-by-date to prevent temporal leakage |
| Split ratios | Train {TRAIN_RATIO:.0%} / Val {VAL_RATIO:.0%} / Test {TEST_RATIO:.0%} |
| Random seed | {RANDOM_SEED} |

## Known Issues & Assumptions
- **Labels are placeholders**: The provided `labels.csv` contains template labels that must
  be replaced with ground-truth annotations before training.
- **No audio/video data**: Only sensor CSVs and still images are present in the sample data.
  The pipeline is designed to accommodate additional modalities when available.
- **Small dataset**: With only {len(manifest)} runs, overfitting risk is high. Consider data
  augmentation and regularisation strategies.
- **Remarks column**: Currently empty across all runs; reserved for operator notes.

## Feature Engineering Summary
- {feature_df.shape[1]} engineered features per run
- Global sensor stats: mean, std, min, max, median, range, IQR
- Windowed volatility: sliding-window std, coefficient of variation
- Rate of change: first-difference statistics
- Weld phase timing: arc fraction, start/end indices, duration
- Image statistics: brightness, contrast, entropy, edge density

## Files Produced
- `outputs/feature_table.csv` – Full feature matrix
- `outputs/manifest.csv` – Run manifest with validation info
- `outputs/normalize_stats.json` – Z-score parameters
- `splits/split.json` – Train/val/test split definition
- `outputs/dashboard/` – Analysis plots (PNG)
"""

# Save and render
data_card_path = OUTPUT_DIR / "data_card.md"
with open(data_card_path, "w", encoding="utf-8") as f:
    f.write(data_card)
print(f"✓ Data card saved → {data_card_path}\n")

display(Markdown(data_card))

In [None]:
# Final Phase 1 summary report
from main import _make_summary

summary = _make_summary(manifest, feature_df, split_map)
print(summary)
print("\n\n🎉 Phase 1 pipeline complete. All artefacts exported.")