# 02_federated_split_summary.ipynb



## Federated Split Summary

In this notebook we:
1. Load the preprocessed CMAPSS (FD003) and SECOM sliding-window datasets.
2. Apply a non-IID split by operating condition to simulate three IIoT edge clients.
3. Inspect the label distribution per client and summarize the number of CMAPSS samples per client in a table.
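
As a rough illustration of step 2, the sketch below groups samples by a condition key and assigns whole groups to clients, so each client sees a different mix of conditions. It is not the implementation of `non_iid_split_by_condition`; the helper `split_by_condition`, the `condition_of` callback, and the toy data are hypothetical names introduced only for this example.

```python
from collections import defaultdict


def split_by_condition(samples, condition_of, num_clients=3):
    """Hypothetical helper: assign whole condition groups to clients round-robin."""
    # Group the samples by their condition key.
    groups = defaultdict(list)
    for sample in samples:
        groups[condition_of(sample)].append(sample)

    # Hand out entire groups, so no condition is shared between clients.
    clients = {cid: [] for cid in range(num_clients)}
    for i, key in enumerate(sorted(groups)):
        clients[i % num_clients].extend(groups[key])
    return clients


# Toy data (assumption): tuples of (sample_id, operating_condition).
toy = [(i, i % 4) for i in range(12)]
splits = split_by_condition(toy, condition_of=lambda s: s[1])

for cid, items in splits.items():
    conditions = sorted({cond for _, cond in items})
    print(f"Client {cid}: conditions={conditions}, count={len(items)}")
```

Because groups are never divided, the clients end up with disjoint condition sets and unequal sample counts, which is the kind of skew a non-IID split is meant to produce.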


In [None]:
# Standard imports
import pandas as pd

# Import our data loader (creates sliding-window Data objects)
from phase2_data_loader import load_cmapss

# Import the federated splitting function
from src.phase2_federated_split import non_iid_split_by_condition

# For nice DataFrame display in the notebook we use ace_tools
# (this helper may not be installed in every environment)
import ace_tools as tools

# Import the SECOM loader (creates sliding-window Data objects)
from src.preprocess_secom import load_secom

In [None]:
# 1️⃣ Load CMAPSS FD003 data as sliding windows
#    - data_dir: path where your CMAPSS files live
#    - dataset: which FD00x file to use
#    - window_size & stride: controls time-series windowing
data_list_cmapps = load_cmapss(
    data_dir='data/CMAPSS',
    dataset='FD003',
    window_size=30,
    stride=5
)

# Load the SECOM data as sliding windows (same window_size and stride as above)
data_list_secom = load_secom('data/SECOM', window_size=30, stride=5, threshold=0.8)

# Quick sanity check: number of windows per dataset
print(f"Total sliding-window samples for CMAPSS: {len(data_list_cmapps)}")
print(f"Total sliding-window samples for SECOM: {len(data_list_secom)}")


In [None]:
# 2️⃣ Split into 3 non-IID clients by operating condition
splits_cmapss = non_iid_split_by_condition(data_list_cmapps, cm_pass_threshold=125.0)

# Split the SECOM windows across clients (by index modulo 3)
splits_secom = non_iid_split_by_condition(data_list_secom)

# Inspect the CMAPSS label distribution per client around the threshold
threshold = 125
for cid, items in splits_cmapss.items():
    ys = [d.y.item() for d in items]
    print(f"Client {cid}: count={len(ys)}, "
          f"early-life>={threshold}: {sum(y >= threshold for y in ys)}, "
          f"late-life<{threshold}: {sum(y < threshold for y in ys)}")


# Inspect the SECOM label distribution per client (0 = healthy, 1 = failure)
for cid, items in splits_secom.items():
    labels = [int(d.y.item()) for d in items]
    print(f"Client {cid}: total={len(labels)}, healthy={labels.count(0)}, failures={labels.count(1)}")



In [None]:
# 3️⃣ Build a summary DataFrame of CMAPSS sample counts per client
summary = pd.DataFrame([
    {
        'Client ID': client_id,
        'Num Samples': len(items)
    }
    for client_id, items in splits_cmapss.items()
])

# Display interactively in the notebook
tools.display_dataframe_to_user(
    name='Federated Split Summary',
    dataframe=summary
)


### Interpretation

- The table shows how many CMAPSS sliding-window samples each simulated edge client holds.
- This non-IID split (by operating condition) will drive our federated training experiments.
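
To make the imbalance between clients easier to read, the counts can be turned into per-client shares. The sketch below is a minimal example under stated assumptions: `client_shares` is a hypothetical helper, and the toy `splits` dictionary stands in for the real `splits_cmapss`. If the later experiments use sample-weighted averaging (as FedAvg-style aggregation commonly does), these shares would correspond to the client weights.

```python
def client_shares(splits):
    """Hypothetical helper: fraction of all samples held by each client."""
    total = sum(len(items) for items in splits.values())
    return {cid: len(items) / total for cid, items in splits.items()}


# Toy stand-in (assumption) for the real split: 50, 30, and 20 samples.
splits = {0: list(range(50)), 1: list(range(30)), 2: list(range(20))}
shares = client_shares(splits)

for cid, share in shares.items():
    print(f"Client {cid}: {share:.1%} of all samples")
```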
