# Linking Interaction-Level and Time-Resolved AIS Data

## Purpose of this Notebook

This notebook establishes a reliable link between time-resolved AIS observations
and interaction-level behavioral summaries.

The behavioral analysis performed in previous steps produces one row per
ship–ship interaction (`behavior_summary.csv`), while the original AIS-based
dataset (`classified_ais_dcpa_tcpa.csv`) contains one row per timestamp.
To enable detailed inspection, visualization, and validation of ship behavior
over time, it is necessary to connect these two representations.

This notebook augments the time-resolved AIS dataset with interaction identifiers,
resulting in a linked dataset (`classified_linked.csv`) where each AIS observation
is assigned to a specific ship–ship interaction.

This linkage enables:
- Reconstruction of full interaction trajectories
- Validation of behavioral summary metrics
- Visualization of ship behavior over time
- Consistent referencing between interaction-level and point-level data

---

## Input Data

### 1. `classified_ais_dcpa_tcpa.csv`

This dataset contains time-aligned AIS observations for pairs of ships.
Each row represents a single timestamp and includes:

- MMSI identifiers of both ships
- Position (latitude, longitude)
- Speed over ground and course over ground
- Distance between ships
- Risk indicators (DCPA and TCPA)
- Region classification (harbor / open sea)

### 2. `behavior_summary.csv`

This dataset contains one row per ship–ship interaction and summarizes
the behavioral response of both vessels during the interaction.
Each interaction is identified by:
- A canonical ship pair
- An interaction identifier
- Temporal bounds (start and end time)

---

## Key Concept: Canonical Ship Pairs

AIS data may represent the same ship pair in different orders
(e.g., ship A as `mmsi_1` and ship B as `mmsi_2`, or vice versa).

To ensure consistent interaction assignment, ship pairs are transformed into
a canonical representation:

- `ship_low`  = minimum of the two MMSI values
- `ship_high` = maximum of the two MMSI values

This guarantees that `(A, B)` and `(B, A)` are treated as the same ship pair
throughout the interaction definition and linking process.
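
As a small illustration of the canonical representation, the sketch below builds a toy table in which the same two ships appear in both orders. The MMSI values are illustrative only; the column names follow the ones used in this notebook.

```python
import pandas as pd

# Toy data: rows 0 and 1 describe the same two ships in opposite order
# (MMSI values are illustrative only).
pairs = pd.DataFrame({
    "mmsi_1": [227314000, 228238700, 212373000],
    "mmsi_2": [228238700, 227314000, 227297000],
})

# Row-wise min/max gives an order-independent key for each pair.
pairs["ship_low"] = pairs[["mmsi_1", "mmsi_2"]].min(axis=1)
pairs["ship_high"] = pairs[["mmsi_1", "mmsi_2"]].max(axis=1)

# Rows 0 and 1 now share the same (ship_low, ship_high) key.
n_unique_pairs = len(pairs[["ship_low", "ship_high"]].drop_duplicates())
print(pairs)
print("Distinct canonical pairs:", n_unique_pairs)
```

Three input rows collapse to two canonical pairs, because the first two rows differ only in column order.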

---

## Interaction Assignment Strategy

Interactions are defined by grouping AIS observations for the same canonical
ship pair and splitting them when the time gap between consecutive observations
exceeds a specified threshold (60 minutes).

Each AIS observation is assigned:
- `ship_low`
- `ship_high`
- `interaction_id`

These identifiers exactly match those used in `behavior_summary.csv`,
ensuring a one-to-many relationship between interactions and AIS observations.
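
The gap-based split can be followed by hand on a toy example. The sketch below is a minimal illustration, assuming made-up pair keys and timestamps; it applies the same diff-then-cumulative-sum idea described above and is not the notebook's production cell.

```python
import pandas as pd

MAX_TIME_GAP_SEC = 3600  # 60-minute threshold, as stated above

# Toy observations for two canonical pairs (keys and timestamps are made up).
obs = pd.DataFrame({
    "ship_low":  [1, 1, 1, 1, 2, 2],
    "ship_high": [9, 9, 9, 9, 8, 8],
    "time_window": pd.to_datetime([
        "2015-12-09 18:00", "2015-12-09 18:05",  # pair (1, 9): first interaction
        "2015-12-09 20:30", "2015-12-09 20:35",  # gap > 60 min: second interaction
        "2015-12-09 18:00", "2015-12-09 19:00",  # pair (2, 8): gap == 60 min, no split
    ]),
})

obs = obs.sort_values(["ship_low", "ship_high", "time_window"]).reset_index(drop=True)

# Seconds since the previous observation of the same pair (NaN for the first one).
gap_s = obs.groupby(["ship_low", "ship_high"])["time_window"].diff().dt.total_seconds()

# A new interaction starts at the first row of a pair or after a gap above the threshold.
obs["new_interaction"] = (gap_s.isna() | (gap_s > MAX_TIME_GAP_SEC)).astype(int)

# Cumulative count of starts within each pair, shifted so that ids begin at 0.
obs["interaction_id"] = obs.groupby(["ship_low", "ship_high"])["new_interaction"].cumsum() - 1

print(obs[["ship_low", "ship_high", "time_window", "interaction_id"]])
```

Note that `interaction_id` restarts at 0 for every pair, so an interaction is only identified uniquely by the triple (`ship_low`, `ship_high`, `interaction_id`). A gap of exactly 60 minutes does not trigger a split because the comparison is strict.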


In [5]:
# ============================================================
# Add interaction linking columns to classified_ais_dcpa_tcpa.csv
# Produces: classified_ais_dcpa_tcpa_linked.csv
# (Adds ship_low, ship_high, interaction_id to every AIS pair row)
# + Includes simple verification tests against behavior_summary.csv
# ============================================================

import pandas as pd
import numpy as np

# ----------------------------
# CONFIG
# ----------------------------
CLASSIFIED_FILE = "classified_ais_dcpa_tcpa.csv"
BEHAVIOR_FILE   = "behavior_summary.csv"   
OUTPUT_FILE     = "classified_ais_dcpa_tcpa_linked.csv"

MAX_TIME_GAP_SEC = 3600  # must match the behavior notebook

# ----------------------------
# Load
# ----------------------------
df = pd.read_csv(CLASSIFIED_FILE, parse_dates=["time_window"])
print("Loaded classified:", df.shape)

# ----------------------------
# Canonical pair (A,B) == (B,A)
# ----------------------------
df["ship_low"]  = df[["mmsi_1", "mmsi_2"]].min(axis=1)
df["ship_high"] = df[["mmsi_1", "mmsi_2"]].max(axis=1)

# ----------------------------
# Sort by canonical pair + time
# ----------------------------
df = df.sort_values(["ship_low", "ship_high", "time_window"]).reset_index(drop=True)

# ----------------------------
# Define interactions (same logic as behavior notebook)
# ----------------------------
df["time_diff_s"] = (
    df.groupby(["ship_low", "ship_high"])["time_window"]
      .diff()
      .dt.total_seconds()
)

df["new_interaction"] = df["time_diff_s"].isna() | (df["time_diff_s"] > MAX_TIME_GAP_SEC)

df["interaction_id"] = (
    df.groupby(["ship_low", "ship_high"])["new_interaction"]
      .cumsum()
      .astype(int) - 1
)

# ----------------------------
# Save linked classified file
# ----------------------------
df.to_csv(OUTPUT_FILE, index=False)
print(f"✅ Saved linked classified file: {OUTPUT_FILE}")
print("Columns added: ship_low, ship_high, interaction_id")

Loaded classified: (677044, 17)
✅ Saved linked classified file: classified_ais_dcpa_tcpa_linked.csv
Columns added: ship_low, ship_high, interaction_id


In [6]:
# ============================================================
# TESTS / VERIFICATION
# ============================================================

# ---- Test 1: Basic sanity checks
print("\n--- Test 1: Sanity checks ---")
print("Any missing ship_low/ship_high?", df["ship_low"].isna().any(), df["ship_high"].isna().any())
print("Any missing interaction_id?", df["interaction_id"].isna().any())
print("interaction_id min/max:", df["interaction_id"].min(), df["interaction_id"].max())

# ---- Test 2: Pull one interaction from behavior_summary and extract its full trajectory
print("\n--- Test 2: Extract one interaction trajectory using behavior_summary.csv ---")
try:
    bs = pd.read_csv(BEHAVIOR_FILE, parse_dates=["start_time", "end_time"])
    print("Loaded behavior_summary:", bs.shape)

    # pick the interaction with the most points (ties broken by longest duration)
    candidate = bs.sort_values(["points_count", "duration_min"], ascending=False).iloc[0]

    ship_low  = candidate["ship_low"]
    ship_high = candidate["ship_high"]
    inter_id  = candidate["interaction_id"]
    start_t   = candidate["start_time"]
    end_t     = candidate["end_time"]

    print("\nSelected interaction from behavior_summary:")
    print(f"ship_low={ship_low}, ship_high={ship_high}, interaction_id={inter_id}")
    print(f"start_time={start_t}, end_time={end_t}, points_count={candidate['points_count']}")

    traj = df[
        (df["ship_low"] == ship_low) &
        (df["ship_high"] == ship_high) &
        (df["interaction_id"] == inter_id)
    ].sort_values("time_window")

    print("\nTrajectory extracted from linked classified CSV:")
    print("traj rows:", len(traj))
    print(traj[[
        "time_window",
        "mmsi_1", "mmsi_2",
        "lat_1", "lon_1", "speed_1", "course_1",
        "lat_2", "lon_2", "speed_2", "course_2",
        "distance_m", "DCPA_m", "TCPA_s",
        "ship_low", "ship_high", "interaction_id"
    ]].head(10))

    # ---- Test 3: Strong correctness test
    # Check that the extracted trajectory time window matches behavior_summary
    print("\n--- Test 3: Linking correctness checks ---")
    if len(traj) == 0:
        print("❌ ERROR: No rows found in classified for the selected behavior_summary interaction.")
    else:
        traj_start = traj["time_window"].iloc[0]
        traj_end   = traj["time_window"].iloc[-1]

        print("Behavior start/end:", start_t, end_t)
        print("Classified traj start/end:", traj_start, traj_end)

        # Exact match should hold if both were computed with the same logic
        ok_time_match = (traj_start == start_t) and (traj_end == end_t)
        print("Time window match:", "✅" if ok_time_match else "⚠️ (check if behavior_summary was generated from same exact file)")

        # Check: all rows in this extracted traj have the same key (ship_low, ship_high, interaction_id)
        ok_keys = (
            (traj["ship_low"].nunique() == 1) and
            (traj["ship_high"].nunique() == 1) and
            (traj["interaction_id"].nunique() == 1)
        )
        print("Key uniqueness inside trajectory:", "✅" if ok_keys else "❌")

        # Check: time gaps inside the interaction never exceed MAX_TIME_GAP_SEC (definition of interaction)
        gaps = traj["time_window"].diff().dt.total_seconds()
        max_gap = gaps.max()
        print("Max time gap inside extracted interaction (s):", max_gap)
        ok_gap = (pd.isna(max_gap) or max_gap <= MAX_TIME_GAP_SEC)
        print("Gap rule satisfied:", "✅" if ok_gap else "❌")

except FileNotFoundError:
    print(f"⚠️ Could not find {BEHAVIOR_FILE}.")
    print("You can still use the linked classified file; tests 2/3 require behavior_summary.csv.")



--- Test 1: Sanity checks ---
Any missing ship_low/ship_high? False False
Any missing interaction_id? False
interaction_id min/max: 0 153

--- Test 2: Extract one interaction trajectory using behavior_summary.csv ---
Loaded behavior_summary: (26946, 24)

Selected interaction from behavior_summary:
ship_low=227314000, ship_high=228238700, interaction_id=0
start_time=2015-12-09 18:33:00, end_time=2015-12-10 03:09:00, points_count=357

Trajectory extracted from linked classified CSV:
traj rows: 758
               time_window     mmsi_1     mmsi_2      lat_1     lon_1  \
134989 2015-12-09 18:33:00  228238700  227314000  48.292885 -5.090133   
134990 2015-12-09 18:33:00  228238700  227314000  48.292885 -5.090133   
134991 2015-12-09 18:38:00  228238700  227314000  48.292046 -5.088555   
134992 2015-12-09 18:38:00  228238700  227314000  48.292046 -5.088555   
134993 2015-12-09 18:43:00  227314000  228238700  48.288815 -5.088119   
134994 2015-12-09 18:43:00  227314000  228238700  48.288815 

In [7]:
# ----------------------------------------------------
# Remove symmetric duplicate AIS pair rows
# ----------------------------------------------------

import pandas as pd

file_in  = "classified_ais_dcpa_tcpa_linked.csv"
file_out = "classified_linked.csv"

df = pd.read_csv(file_in, parse_dates=["time_window"])

before = len(df)

# Drop duplicated observations caused by (A,B) vs (B,A)
df = df.drop_duplicates(
    subset=["time_window", "ship_low", "ship_high"]
).reset_index(drop=True)

after = len(df)

df.to_csv(file_out, index=False)

print(f"Rows before deduplication: {before}")
print(f"Rows after deduplication:  {after}")
print(f"Removed duplicates:        {before - after}")
print("✅ Saved final linked file:", file_out)


Rows before deduplication: 677044
Rows after deduplication:  516954
Removed duplicates:        160090
✅ Saved final linked file: classified_linked.csv


In [13]:
import pandas as pd

# ============================
# Load final datasets
# ============================
classified = pd.read_csv("classified_linked.csv", parse_dates=["time_window"])
behavior = pd.read_csv("behavior_summary.csv", parse_dates=["start_time", "end_time"])

print("Loaded classified_linked:", classified.shape)
print("Loaded behavior_summary:", behavior.shape)

# ============================
# Select one interaction from behavior_summary
# ============================
row = behavior.sample(1, random_state=42).iloc[0]

ship_low = row["ship_low"]
ship_high = row["ship_high"]
interaction_id = row["interaction_id"]
expected_points = row["points_count"]

print("\nSelected interaction:")
print(f"ship_low={ship_low}, ship_high={ship_high}, interaction_id={interaction_id}")
print(f"Expected points_count: {expected_points}")
print(f"Start: {row['start_time']}, End: {row['end_time']}")

# ============================
# Extract trajectory
# ============================
traj = classified[
    (classified["ship_low"] == ship_low) &
    (classified["ship_high"] == ship_high) &
    (classified["interaction_id"] == interaction_id)
].sort_values("time_window")

print("\nExtracted trajectory rows:", len(traj))

# ============================
# Show trajectory preview
# ============================
print("\nTrajectory preview (first 10 rows):")
cols_to_show = [
    "time_window",
    "mmsi_1", "mmsi_2",
    "lat_1", "lon_1", "speed_1", "course_1",
    "lat_2", "lon_2", "speed_2", "course_2",
    "distance_m", "DCPA_m", "TCPA_s",
    "ship_low", "ship_high", "interaction_id"
]
cols_to_show = [c for c in cols_to_show if c in traj.columns]  # safe if a column is missing
print(traj[cols_to_show].head(10))

# ============================
# Tests
# ============================
print("\n--- Final consistency checks ---")

# 1. Row count match
print("Points count match:",
      "✅" if len(traj) == expected_points else "❌")

# 2. Unique timestamps
print("Unique time_window:",
      "✅" if traj["time_window"].is_unique else "❌")

# 3. Time bounds match
print("Start time match:",
      "✅" if traj["time_window"].iloc[0] == row["start_time"] else "❌")

print("End time match:",
      "✅" if traj["time_window"].iloc[-1] == row["end_time"] else "❌")

# 4. Max time gap rule
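max_gap = traj["time_window"].diff().dt.total_seconds().max()
print("Max gap (s):", max_gap)
# A single-row trajectory has no gaps (max_gap is NaN) and trivially satisfies the rule.
# 3600 s is the MAX_TIME_GAP_SEC value used when the interactions were defined.
print("Gap rule satisfied:",
      "✅" if pd.isna(max_gap) or max_gap <= 3600 else "❌")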


Loaded classified_linked: (516954, 22)
Loaded behavior_summary: (26946, 24)

Selected interaction:
ship_low=212373000, ship_high=227297000, interaction_id=0
Expected points_count: 4
Start: 2015-11-17 19:11:00, End: 2015-11-17 19:14:00

Extracted trajectory rows: 4

Trajectory preview (first 10 rows):
             time_window     mmsi_1     mmsi_2      lat_1     lon_1  speed_1  \
2774 2015-11-17 19:11:00  227297000  212373000  48.296932 -4.747700      5.1   
2775 2015-11-17 19:12:00  227297000  212373000  48.296780 -4.749918      5.2   
2776 2015-11-17 19:13:00  212373000  227297000  48.286070 -4.742983     12.7   
2777 2015-11-17 19:14:00  212373000  227297000  48.287495 -4.738005     12.7   

      course_1      lat_2     lon_2  speed_2  course_2   distance_m  \
2774     262.9  48.283173 -4.752725     12.0      64.0  1574.730481   
2775     264.2  48.284615 -4.747889     12.9      65.9  1361.059796   
2776      65.5  48.296326 -4.752366      5.4     253.7  1336.177500   
2777      71.

# Validation, Linking, and Deduplication

## Interaction Linking Validation

After assigning canonical ship pairs (`ship_low`, `ship_high`) and
`interaction_id` values to the time-resolved AIS data, a series of validation
steps is performed to ensure correctness and consistency between datasets.

### 1. Completeness Check  
All AIS observations are verified to contain valid values for:
- `ship_low`
- `ship_high`
- `interaction_id`

This confirms that every AIS row is assigned to a well-defined interaction.

---

### 2. Cross-Dataset Consistency Check  

A random interaction is selected from `behavior_summary.csv`, and the
corresponding full AIS trajectory is extracted from the linked classified data.

The following conditions are verified:

- The number of AIS rows matches `points_count`
- The first and last timestamps match `start_time` and `end_time`
- All timestamps within the interaction are unique
- The maximum time gap within the trajectory respects the interaction
  definition threshold (≤ 60 minutes)

This step confirms that:
- Interaction summaries are correctly linked to their underlying AIS data
- Behavioral metrics were computed over the correct temporal windows
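
The check above inspects a single sampled interaction. As a sketch of how the same conditions could be tested for every interaction at once, the example below recomputes row counts and time bounds per interaction and compares them with the summary. The function name `find_mismatches` and the toy frames are hypothetical; only the column names (`points_count`, `start_time`, `end_time`, `time_window`, and the three key columns) are taken from this notebook.

```python
import pandas as pd

KEYS = ["ship_low", "ship_high", "interaction_id"]

def find_mismatches(linked: pd.DataFrame, summary: pd.DataFrame) -> pd.DataFrame:
    """Return summary rows whose count or time bounds disagree with the linked AIS rows."""
    # Recompute per-interaction statistics from the point-level data.
    recomputed = (
        linked.groupby(KEYS)["time_window"]
              .agg(n_rows="count", first_time="min", last_time="max")
              .reset_index()
    )
    # Left merge keeps summary rows that have no AIS rows at all; they are flagged too,
    # because comparisons against the resulting NaN/NaT values evaluate to False.
    merged = summary.merge(recomputed, on=KEYS, how="left")
    ok = (
        (merged["n_rows"] == merged["points_count"])
        & (merged["first_time"] == merged["start_time"])
        & (merged["last_time"] == merged["end_time"])
    )
    return merged[~ok]

# Toy inputs (hypothetical values); the second summary row has a wrong points_count.
linked = pd.DataFrame({
    "ship_low": [1, 1, 1, 2, 2],
    "ship_high": [9, 9, 9, 8, 8],
    "interaction_id": [0, 0, 0, 0, 0],
    "time_window": pd.to_datetime([
        "2015-11-17 19:11", "2015-11-17 19:12", "2015-11-17 19:13",
        "2015-11-17 20:00", "2015-11-17 20:01",
    ]),
})
summary = pd.DataFrame({
    "ship_low": [1, 2],
    "ship_high": [9, 8],
    "interaction_id": [0, 0],
    "points_count": [3, 3],
    "start_time": pd.to_datetime(["2015-11-17 19:11", "2015-11-17 20:00"]),
    "end_time": pd.to_datetime(["2015-11-17 19:13", "2015-11-17 20:01"]),
})

bad = find_mismatches(linked, summary)
print("Interactions with mismatches:", len(bad))
```

An empty result on the real files would indicate that every summary row agrees with its linked AIS rows, which is a stronger statement than a single sampled interaction can support.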

---

### 3. Trajectory Reconstruction Verification  

The extracted AIS trajectory is explicitly printed and inspected to confirm:

- Correct temporal ordering
- Consistent ship identity across rows
- Physically meaningful evolution of position, speed, course, DCPA, and TCPA

This provides a transparent, human-interpretable validation that each
interaction corresponds to a coherent navigational situation.
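
Two of the points above, temporal ordering and consistent ship identity, can also be checked programmatically. The sketch below is a minimal, hypothetical helper (`check_trajectory` is not part of the notebook) that assumes the rows of one interaction have already been extracted; it does not replace the visual inspection of physical plausibility.

```python
import pandas as pd

def check_trajectory(traj: pd.DataFrame) -> dict:
    """Basic structural checks on the rows of one interaction (already filtered)."""
    # Each row should involve the same two ships, regardless of column order.
    low = traj[["mmsi_1", "mmsi_2"]].min(axis=1)
    high = traj[["mmsi_1", "mmsi_2"]].max(axis=1)
    return {
        "time_ordered": bool(traj["time_window"].is_monotonic_increasing),
        "same_ship_pair": bool(low.nunique() == 1 and high.nunique() == 1),
    }

# Toy trajectory: the ship order flips in the last row, which is acceptable
# because identity is judged on the unordered pair (values are illustrative).
traj = pd.DataFrame({
    "time_window": pd.to_datetime([
        "2015-11-17 19:11", "2015-11-17 19:12", "2015-11-17 19:13",
    ]),
    "mmsi_1": [227297000, 227297000, 212373000],
    "mmsi_2": [212373000, 212373000, 227297000],
})

report = check_trajectory(traj)
reversed_report = check_trajectory(traj.iloc[::-1])
print(report)
print(reversed_report)
```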

---

## Duplicate Observation Handling

During earlier preprocessing, ship pairs were treated as ordered
(`(mmsi_1, mmsi_2)` and `(mmsi_2, mmsi_1)`), resulting in duplicated AIS rows
that represent the same physical encounter.

After introducing canonical ship pairing (`ship_low`, `ship_high`), these
duplicates become identifiable.

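Duplicates are removed using the following keys:
- `time_window`
- `ship_low`
- `ship_high`

`interaction_id` does not need to be part of the key: rows that share a
canonical pair and a timestamp always fall into the same interaction, because
a time gap of zero never starts a new one.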

This deduplication step ensures that:
- Each AIS observation appears exactly once
- Interaction trajectories are not artificially inflated
- Behavioral metrics and visualizations are not biased by duplicated data

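The effectiveness of deduplication is verified by re-running the trajectory
extraction on `classified_linked.csv` and confirming that the row count of the
sampled interaction now exactly matches `points_count` from
`behavior_summary.csv`. Before deduplication, the interaction inspected in the
first test returned 758 rows against a `points_count` of 357.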
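
The effect of the deduplication key can be seen on a toy table. The sketch below assumes made-up MMSI values and distances, and uses a key of `time_window`, `ship_low`, and `ship_high`, as in the deduplication cell above.

```python
import pandas as pd

# Toy rows: each timestamp appears twice, once per ordering of the pair
# (MMSI values and distances are made up).
rows = pd.DataFrame({
    "time_window": pd.to_datetime(["2015-12-09 18:33"] * 2 + ["2015-12-09 18:38"] * 2),
    "mmsi_1": [20, 10, 10, 20],
    "mmsi_2": [10, 20, 20, 10],
    "distance_m": [1500.0, 1500.0, 1400.0, 1400.0],
})
rows["ship_low"] = rows[["mmsi_1", "mmsi_2"]].min(axis=1)
rows["ship_high"] = rows[["mmsi_1", "mmsi_2"]].max(axis=1)

key = ["time_window", "ship_low", "ship_high"]

# Count rows that repeat an earlier key before dropping anything.
n_dupes = int(rows.duplicated(subset=key).sum())

# keep="first" (the pandas default) retains whichever ordering appears first.
deduped = rows.drop_duplicates(subset=key).reset_index(drop=True)

print("Duplicates found:", n_dupes)
print(deduped)
```

Because the first occurrence is kept, the retained row may list either ship as `mmsi_1`. Code that reads per-ship columns such as `lat_1` or `speed_1` from the deduplicated file should therefore check `mmsi_1` on each row rather than assume a fixed orientation.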

---

## Output Data

### `classified_linked.csv`

The final output of this notebook is a **clean, deduplicated, and interaction-linked**
AIS dataset containing:

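- Original time-resolved AIS measurements
- Canonical ship pair identifiers (`ship_low`, `ship_high`)
- Interaction identifiers (`interaction_id`)
- Helper columns from the interaction split (`time_diff_s`, `new_interaction`),
  which are written out together with the identifiers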

This dataset serves as the foundation for:
- Detailed trajectory visualization
- Interaction-level case studies
- Behavioral pattern analysis
- Validation of interaction summaries
- Subsequent modeling or classification work

---

## Interpretation Note

This notebook establishes a **reliable one-to-many link** between:
- Interaction-level behavioral summaries (one row per interaction)  
- Time-resolved AIS trajectories (many rows per interaction)  

Together, `behavior_summary.csv` and `classified_linked.csv` form the final,
consistent data representation used for all subsequent behavioral analysis
and visualization stages of the thesis.
