# merge_all – Build Unified Dataset and Compute Run-Level Metrics

This notebook ingests the enriched straight-line intervals (from `summary_enriched.json`) together with the original CSV logs of each run, and produces a **row-level, analysis-ready dataset**. It relies on helper utilities implemented in **`report_fct.py`** and **`cog_analysis.py`**.

## What this notebook does

1. **Load run intervals**
   - Reads `summary_enriched.json` and maps each `run` to its list of intervals.
   - Locates the matching run folder under the data root and loads the two boat CSVs.

2. **Clip data to intervals**
   - Skips intervals shorter than 30 s, then slices each boat’s time series to `[start_time, end_time]` using `filter_interval(...)` from `report_fct.py`. Intervals in which either boat has no data after clipping are also skipped.

3. **Assign roles and metadata**
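   - Determines **master/slave** roles from the `*_master_leeward` flags in the summary. For the gain computation, boat 1 is the master when `boat1_master_leeward` is true; otherwise boat 2 is. The `boat_role` column is set from each boat’s own flag.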
   - Attaches interval/rider/equipment metadata to each row:
     - `run`, `interval_id`, `boat_name`, `opponent_name`
     - `boat_role` (master/slave), `boat_weight`, `interval_duration`, `mast_brand`

4. **Compute directional gains over time**
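   - For each timestamp `t`, calls `compute_directional_gain(...)` (from `report_fct.py`) on the master and slave series truncated to samples at or before `t`, and stores the **Forward**, **Lateral**, and **VMG** values of its `Total Gain` row as `gain_forward`, `gain_lateral`, and `gain_vmg` (`NaN` when the function returns an empty result).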

5. **Recompute “line” values**
   - The line labels in the data we received appeared to be mixed up, so the values are reassigned instead of being used as labelled.
   - Reassigns `Line_C`, `Line_L`, `Line_R` by sorted magnitude into:
     - `Line_C2` (largest), `Line_L2` (middle), `Line_R2` (smallest)
   - Derives `side_line2 = Line_L2 + Line_R2` and `total_line2 = side_line2 + Line_C2`.

6. **Concatenate all rows**
   - Stacks all intervals and both boats into a single DataFrame and writes it to **`all_data.csv`**.
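
Step 5 can be illustrated on a toy table. This is a minimal sketch, not the notebook's full pipeline: the sample values are invented, and only the column names (`Line_C`, `Line_L`, `Line_R` and their `*2` counterparts) come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy values, invented for illustration; the labels are deliberately mixed up.
df = pd.DataFrame({
    "Line_C": [10.0, 3.0],
    "Line_L": [4.0, 12.0],
    "Line_R": [6.0, 5.0],
})

# Sort each row in ascending order: column 0 = smallest, column 2 = largest.
sorted_lines = np.sort(df[["Line_C", "Line_L", "Line_R"]].to_numpy(), axis=1)

df["Line_R2"] = sorted_lines[:, 0]  # smallest
df["Line_L2"] = sorted_lines[:, 1]  # middle
df["Line_C2"] = sorted_lines[:, 2]  # largest
df["side_line2"] = df["Line_L2"] + df["Line_R2"]
df["total_line2"] = df["side_line2"] + df["Line_C2"]

print(df[["Line_C2", "Line_L2", "Line_R2", "side_line2", "total_line2"]])
```

Because the reassignment is a per-row sort, `total_line2` always equals the sum of the three original columns; only the attribution of values to individual lines changes. Note that `np.sort` places `NaN` last, so a row with a missing value ends up with `NaN` in `Line_C2`.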

## Output

- **`all_data.csv`**: unified, row-level dataset across all runs, intervals, boats, and derived signals.
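
Before analysing `all_data.csv`, it can be worth checking a few invariants that follow from the steps above. The helper below is a hypothetical sketch (`check_all_data` is not part of `report_fct.py` or `cog_analysis.py`); it only relies on column names listed in this notebook, and the last check assumes the two boats of an interval have distinct, non-empty names.

```python
import pandas as pd


def check_all_data(df: pd.DataFrame) -> dict:
    """Return a few boolean sanity checks on the unified dataset (hypothetical helper)."""
    lines = df[["Line_C2", "Line_L2", "Line_R2"]].dropna()
    ordered = (lines["Line_C2"] >= lines["Line_L2"]) & (lines["Line_L2"] >= lines["Line_R2"])
    boats_per_interval = df.groupby(["run", "interval_id"])["boat_name"].nunique()
    return {
        # The sorted reassignment implies C2 >= L2 >= R2 on every complete row.
        "lines_ordered": bool(ordered.all()),
        # boat_role should only take the two documented values.
        "roles_valid": bool(df["boat_role"].isin(["master", "slave"]).all()),
        # Assumption: each (run, interval_id) pair holds rows for two distinctly named boats.
        "two_boats_per_interval": bool((boats_per_interval == 2).all()),
    }


# Tiny made-up sample with the same column names as the real output.
sample = pd.DataFrame({
    "run": ["r1", "r1"],
    "interval_id": [1, 1],
    "boat_name": ["A", "B"],
    "boat_role": ["master", "slave"],
    "Line_C2": [10.0, 12.0],
    "Line_L2": [6.0, 5.0],
    "Line_R2": [4.0, 3.0],
})
print(check_all_data(sample))
```

On the real file, the same function could be called as `check_all_data(pd.read_csv("all_data.csv"))`.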


In [1]:
import json
import os

import numpy as np
import pandas as pd

from cog_analysis import load_boat_data
from report_fct import filter_interval, compute_directional_gain

def build_csv_from_summary(summary_path, data_root, output_csv="all_data.csv"):
    """Build the unified row-level CSV from the interval summary and the per-run boat logs."""
    with open(summary_path, "r") as f:
        summary = json.load(f)

    all_rows = []

    for run_entry in summary:
        run_name = run_entry["run"]
        intervals = run_entry["intervals"]

        # Locate the run folder
        run_path = None
        for root, dirs, files in os.walk(data_root):
            if os.path.basename(root) == run_name:
                run_path = root
                break

        if not run_path:
            print(f"⚠️ Run folder not found for: {run_name}")
            continue

        # Load the CSV files
        csv_files = [f for f in os.listdir(run_path) if f.endswith(".csv")]
        if len(csv_files) != 2:
            print(f"⚠️ Skipping {run_name}: expected 2 CSVs, found {len(csv_files)}")
            continue

        csv_paths = [os.path.join(run_path, f) for f in csv_files]

        try:
            df1, df2, name1, name2 = load_boat_data(csv_paths[0], csv_paths[1])
            if df1.empty or df2.empty:
                continue
        except Exception as e:
            print(f"❌ Error loading CSVs for {run_name}: {e}")
            continue

        # Process the intervals
        for i, interval in enumerate(intervals):
            start, end = interval["start_time"], interval["end_time"]
            if end - start < 30:
                print(f"⚠️ Skipping interval {i + 1} for {run_name}: duration < 30 seconds")
                continue

            df1_clip = filter_interval(df1, start, end)
            df2_clip = filter_interval(df2, start, end)
            if df1_clip.empty or df2_clip.empty:
                print(f"⚠️ Skipping interval {i + 1} for {run_name}: no data in interval")
                continue

            # Assign the master/slave roles
            master_df, slave_df = (df1_clip, df2_clip)
            if not interval.get("boat1_master_leeward", False):
                master_df, slave_df = df2_clip, df1_clip

            # Build the row-level data
            for df_clip, prefix, other_prefix in [(df1_clip, "boat1", "boat2"), (df2_clip, "boat2", "boat1")]:
                df = df_clip.copy()
                df["run"] = run_name
                df["interval_id"] = i + 1
                df["boat_name"] = interval.get(f"{prefix}_name", "")
                df["opponent_name"] = interval.get(f"{other_prefix}_name", "")
                df["boat_role"] = "master" if interval.get(f"{prefix}_master_leeward", False) else "slave"
                df["boat_weight"] = interval.get(f"{prefix}_total_weight", None)
                df["interval_duration"] = interval.get("duration", None)
                df["mast_brand"] = interval.get(f"{prefix}_mast_brand", None)

                # Compute the gains
                gain_forward = []
                gain_lateral = []
                gain_vmg = []

                for t in df["SecondsSince1970"]:
                    m_clip = master_df[master_df["SecondsSince1970"] <= t]
                    s_clip = slave_df[slave_df["SecondsSince1970"] <= t]
                    gain_df = compute_directional_gain(m_clip, s_clip)

                    if gain_df.empty:
                        gain_forward.append(np.nan)
                        gain_lateral.append(np.nan)
                        gain_vmg.append(np.nan)
                    else:
                        gain_forward.append(gain_df.loc["Total Gain", "Forward"])
                        gain_lateral.append(gain_df.loc["Total Gain", "Lateral"])
                        gain_vmg.append(gain_df.loc["Total Gain", "VMG"])

                df["gain_forward"] = gain_forward
                df["gain_lateral"] = gain_lateral
                df["gain_vmg"] = gain_vmg
                
                # Arbitrary reassignment of the lines by sorting the values
                lines = df[["Line_C", "Line_L", "Line_R"]].values
                sorted_lines = np.sort(lines, axis=1)  # ascending sort, row by row

                # Create the new columns from the sorted values
                df["Line_R2"] = sorted_lines[:, 0]  # smallest
                df["Line_L2"] = sorted_lines[:, 1]  # middle
                df["Line_C2"] = sorted_lines[:, 2]  # largest
                df["side_line2"] = df["Line_L2"] + df["Line_R2"]
                df["total_line2"] = df["side_line2"] + df["Line_C2"]

                all_rows.append(df)

    # Final save
    if not all_rows:
        print("❌ No valid data found.")
        return

    df_global = pd.concat(all_rows, ignore_index=True)
    df_global.to_csv(output_csv, index=False)
    print(f"✅ Global CSV saved to: {output_csv}")
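
`filter_interval` lives in `report_fct.py` and its implementation is not shown in this notebook. As a rough illustration of the clipping described in step 2, the sketch below uses a hypothetical stand-in, `clip_to_interval`. It assumes the time column is `SecondsSince1970` (as in the gain loop above) and that both interval bounds are inclusive; the sample log and its `value` column are invented.

```python
import pandas as pd

MIN_DURATION_S = 30  # same threshold as the duration check in the cell above


def clip_to_interval(df, start, end, time_col="SecondsSince1970"):
    """Hypothetical stand-in for filter_interval: keep rows with start <= t <= end."""
    mask = (df[time_col] >= start) & (df[time_col] <= end)
    return df.loc[mask].copy()


# Made-up log sampled every 10 s.
log = pd.DataFrame({
    "SecondsSince1970": [100, 110, 120, 130, 140, 150],
    "value": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
})

start, end = 110, 140
if end - start < MIN_DURATION_S:
    clipped = log.iloc[0:0]  # interval too short: treat as empty
else:
    clipped = clip_to_interval(log, start, end)

print(clipped)
```

With these values the interval lasts exactly 30 s, so it passes the duration check, and the clipped frame keeps the four samples from 110 to 140 inclusive.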


In [2]:
build_csv_from_summary(
    summary_path="summary_enriched.json",
    data_root="../Data_Sailnjord/Straight_lines",
    output_csv="all_data.csv"
)


✅ Global CSV saved to: all_data.csv
