# Observation log and cascade annotation demo

This notebook shows how to:

1. Read an observation CSV from `data/observations/`.
2. Run the annotation step (if needed) to add simulated cascade fields.
3. Inspect the distribution of `sim_main_cascade` and how it changes over time.
4. Compare `sim_confidence` across cascade levels with simple exploratory plots.


In [None]:
%matplotlib inline

import sys
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt

# Ensure the project root is on sys.path so that src/ and scripts/ are importable.
# This assumes the notebook is run from a directory one level below the project root.
PROJECT_ROOT = Path("..").resolve()
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

from scripts.annotate_observations_with_cascade import annotate_observations


## 1. Prepare input and output paths

By default we use:

- `data/observations/3I_ATLAS_observations_template.csv` as input,
- `data/observations/3I_ATLAS_observations_with_cascade.csv` as output.

If the annotated file does not exist yet, we will create it via the
`annotate_observations(...)` function.

In [None]:
data_dir = PROJECT_ROOT / "data" / "observations"

input_csv = data_dir / "3I_ATLAS_observations_template.csv"
output_csv = data_dir / "3I_ATLAS_observations_with_cascade.csv"

print("Input CSV:", input_csv)
print("Output CSV:", output_csv)

if not output_csv.exists():
    print("Annotated CSV does not exist yet. Running annotation...")
    annotate_observations(input_csv, output_csv)
else:
    print("Annotated CSV already exists – using existing file.")
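The existence check above reuses the annotated file even if the input CSV has changed since it was written. A minimal sketch of a staleness check based on file modification times follows; `needs_annotation` is a hypothetical helper, not part of the project's scripts, and the temporary files exist only for illustration.

```python
import os
import tempfile
from pathlib import Path


def needs_annotation(input_path: Path, output_path: Path) -> bool:
    """Hypothetical helper: True if the output is missing or older than the input."""
    if not output_path.exists():
        return True
    return input_path.stat().st_mtime > output_path.stat().st_mtime


# Illustration with throwaway files (not the real observation CSVs).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.csv"
    dst = Path(tmp) / "output.csv"
    src.write_text("a,b\n1,2\n")
    print("output missing:", needs_annotation(src, dst))

    dst.write_text("a,b,c\n1,2,3\n")
    # Force explicit timestamps so the comparison does not depend on clock resolution.
    os.utime(src, (1_000, 1_000))
    os.utime(dst, (2_000, 2_000))
    print("output newer than input:", needs_annotation(src, dst))

    os.utime(src, (3_000, 3_000))
    print("input modified afterwards:", needs_annotation(src, dst))
```

In the notebook, `if needs_annotation(input_csv, output_csv):` could replace the plain `exists()` test. Modification times are only a heuristic, so deleting the output file remains the surest way to force a fresh annotation.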


## 2. Load annotated observations

We expect two extra columns:

- `sim_main_cascade` – dominant cascade level (1–23)
- `sim_confidence` – model "confidence" in [0, 1]


In [None]:
df = pd.read_csv(output_csv)
print("Columns:", list(df.columns))
print("Rows:", len(df))

# Parse datetime if present
if "datetime_utc" in df.columns:
    df["datetime_utc"] = pd.to_datetime(df["datetime_utc"], errors="coerce")

df.head()
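Before plotting, it can help to confirm that the loaded frame actually contains the two expected columns and that their values fall in the documented ranges. The sketch below is one way to do that; `check_annotated_frame` is a hypothetical helper and the example frame is made up for illustration.

```python
import pandas as pd

# Columns the annotation step is expected to add, with ranges as described above.
EXPECTED_COLUMNS = ("sim_main_cascade", "sim_confidence")


def check_annotated_frame(frame: pd.DataFrame) -> list:
    """Hypothetical helper: return a list of problems; an empty list means no issue found."""
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    # Missing values are ignored here; only present values are range-checked.
    cascade = frame["sim_main_cascade"].dropna()
    if not cascade.between(1, 23).all():
        problems.append("sim_main_cascade has values outside 1-23")

    confidence = frame["sim_confidence"].dropna()
    if not confidence.between(0.0, 1.0).all():
        problems.append("sim_confidence has values outside [0, 1]")
    return problems


# Made-up frame for illustration only; not real observation data.
example = pd.DataFrame(
    {"sim_main_cascade": [1, 12, 23], "sim_confidence": [0.2, 0.9, 1.0]}
)
print(check_annotated_frame(example))
```

Calling `check_annotated_frame(df)` right after `pd.read_csv(...)` would surface a missing or malformed column before any plot silently misrepresents it.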


## 3. Distribution of simulated main cascades

We start with a simple histogram of `sim_main_cascade` values to see how
the simulator populates the 1–23 range for our observation log.

In [None]:
plt.figure(figsize=(8, 4))
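# Use half-integer bin edges so that each integer level 1–23 gets its own bar,
# even when the data do not span the full range.
df["sim_main_cascade"].hist(bins=[x - 0.5 for x in range(1, 25)])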
plt.xlabel("sim_main_cascade (1–23)")
plt.ylabel("Count")
plt.title("Histogram of simulated main cascades")
plt.tight_layout()
plt.show()
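The same distribution can be read as a table of exact counts, which is easier to export than a plot. The sketch below counts observations per level and fills unobserved levels with zero; `cascade_counts` is a hypothetical helper, and the 1–23 range comes from the column description above.

```python
import pandas as pd


def cascade_counts(values: pd.Series, lo: int = 1, hi: int = 23) -> pd.Series:
    """Hypothetical helper: count observations per cascade level in lo..hi."""
    counts = values.dropna().astype(int).value_counts()
    # Reindex so every level appears, including those with zero observations.
    # Note: values outside lo..hi are dropped by this step.
    return counts.reindex(range(lo, hi + 1), fill_value=0)


# Made-up values for illustration only.
example = pd.Series([3, 3, 7, 23, None])
counts = cascade_counts(example)
print(counts[counts > 0])
```

With the notebook's frame this would be `cascade_counts(df["sim_main_cascade"])`, and the result can be written out with `to_csv` if a summary file is wanted.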


## 4. Cascade vs. time (timeline)

If `datetime_utc` is available and parsed correctly, we can plot the
simulated main cascade as a function of time.

- Each point is one observation.
- The y-axis is `sim_main_cascade` (1–23).
- Color or marker style can be added later (e.g. keyed by `channel` or `source`).

In [None]:
if "datetime_utc" in df.columns and df["datetime_utc"].notna().any():
    df_sorted = df.sort_values("datetime_utc")

    plt.figure(figsize=(10, 4))
    plt.scatter(df_sorted["datetime_utc"], df_sorted["sim_main_cascade"], s=20)
    plt.xlabel("datetime_utc")
    plt.ylabel("sim_main_cascade")
    plt.title("Simulated main cascade over time")
    plt.tight_layout()
    plt.show()
else:
    print("datetime_utc column is missing or empty – timeline plot skipped.")


## 5. Confidence vs. cascade level

We can also look at how the simulator's `sim_confidence` varies
across cascade levels.


In [None]:
plt.figure(figsize=(8, 4))
plt.scatter(df["sim_main_cascade"], df["sim_confidence"], s=20)
plt.xlabel("sim_main_cascade")
plt.ylabel("sim_confidence")
plt.title("Confidence vs. simulated main cascade")
plt.tight_layout()
plt.show()
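A scatter plot with many overlapping points can hide the typical confidence per level. As a complement, the sketch below aggregates `sim_confidence` by `sim_main_cascade`; `confidence_by_cascade` is a hypothetical helper and the example values are made up.

```python
import pandas as pd


def confidence_by_cascade(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-level count, mean, min and max of sim_confidence."""
    return (
        frame.groupby("sim_main_cascade")["sim_confidence"]
        .agg(["count", "mean", "min", "max"])
    )


# Made-up frame for illustration only; not real observation data.
example = pd.DataFrame(
    {"sim_main_cascade": [2, 2, 5], "sim_confidence": [0.4, 0.6, 0.9]}
)
summary = confidence_by_cascade(example)
print(summary)
```

Levels with a very small `count` deserve caution, since their mean rests on only a few observations.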


From here you can:

- filter by `source` or `channel` and compare their cascade profiles,
- export summary statistics (e.g. counts per cascade level),
- overlay manual `cascade_level` (if set) and `sim_main_cascade` to see
  how human and simulator perspectives align or differ.
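As a starting point for the last item, the sketch below compares manual `cascade_level` values with `sim_main_cascade` on the rows where both are present. It assumes `cascade_level` uses the same 1–23 scale as the simulator output; `compare_manual_and_sim` is a hypothetical helper and the example data are made up.

```python
import pandas as pd


def compare_manual_and_sim(frame: pd.DataFrame):
    """Hypothetical helper: (rows compared, agreement rate, cross-tabulation)."""
    # Keep only rows where both the manual and the simulated level are set.
    labelled = frame.dropna(subset=["cascade_level", "sim_main_cascade"])
    if labelled.empty:
        return 0, float("nan"), pd.DataFrame()

    # Fraction of labelled rows where the two levels are identical.
    agreement = (labelled["cascade_level"] == labelled["sim_main_cascade"]).mean()
    # Rows: manual level; columns: simulated level.
    table = pd.crosstab(labelled["cascade_level"], labelled["sim_main_cascade"])
    return len(labelled), float(agreement), table


# Made-up frame for illustration only; the second row has no manual level.
example = pd.DataFrame(
    {"cascade_level": [4, None, 7, 9], "sim_main_cascade": [4, 5, 7, 10]}
)
n_rows, rate, table = compare_manual_and_sim(example)
print(f"compared {n_rows} rows, agreement rate {rate:.2f}")
print(table)
```

Exact equality is a strict criterion. If near misses matter, the absolute difference between the two columns may be a more informative measure.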
