In [5]:
# =========================================
# Setup: Ensure src module is importable
# =========================================

import sys
from pathlib import Path

# Add project root to sys.path
ROOT = Path("..").resolve()
if str(ROOT) not in sys.path:
    sys.path.insert(0, str(ROOT))

print(f"Project root: {ROOT}")
print("Setup complete.")

Project root: C:\Mes Documents\Pas cours\Devoir Stage\Space
Setup complete.


# Astronomical Transient Classification – End-to-End Pipeline

This notebook demonstrates an end-to-end prototype pipeline for astronomical
transient classification using SkyPortal-like JSON data.

The goals are:
- to transform heterogeneous, deeply nested JSON data into structured datasets
  suitable for machine learning,
- to demonstrate how a local Large Language Model (LLM) can act as a
  "copilot" to assist astronomers during transient vetting.

The notebook is organized in four stages:

1. Load and inspect the raw JSON data (Sections 1–2)
2. Parse and structure the data into ML-ready formats (Parquet) (Sections 3–5)
3. Perform sanity checks on the structured datasets (Section 6)
4. Use a local LLM (Mistral) to generate copilot-style summaries and suggestions (Sections 7–10)


In [9]:
# =========================================
# Imports and setup
# =========================================

import pandas as pd
from collections import Counter

from src.parser import load_json
from src.dataset import build_dataset, save_parquet as save_sources_parquet
from src.lightcurves import build_lightcurve_dataset, save_parquet as save_lc_parquet
from src.llm_copilot import (
    load_datasets,
    run_copilot_for_source,
    query_llm,
)

print("Environment ready.")


Environment ready.


## 1. Loading the raw JSON data

The input JSON file contains a small number of astronomical transients
(10–20 objects), but each object is deeply nested and includes rich information
such as photometry, spectra, comments, follow-up requests, and derived statistics.

As a result, the file is large (tens of thousands of lines), even for a small
number of sources.


In [11]:
# =========================================
# Load raw JSON
# =========================================

json_path = "data/sources_sample.json"  # relative to the current working directory; adapt if needed
data = load_json(json_path)

print(f"Number of sources loaded: {len(data)}")


FileNotFoundError: [Errno 2] No such file or directory: 'data/sources_sample.json'
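
The `FileNotFoundError` above is consistent with the relative path being resolved against the notebook's working directory rather than against the project root computed in the setup cell. The sketch below shows one way to anchor the path on that root. It assumes, without confirmation from this notebook, that `data/` sits directly under the project root; `resolve_data_path` is a hypothetical helper, not part of `src`.

```python
from pathlib import Path


def resolve_data_path(root: Path, relative: str) -> Path:
    """Return root/relative, raising a descriptive error if the file is missing.

    Hypothetical helper: assumes data files live under the project root.
    """
    candidate = (root / relative).resolve()
    if not candidate.exists():
        raise FileNotFoundError(
            f"Expected data file at {candidate}; check the project layout."
        )
    return candidate
```

With the setup cell already run, this could be used as `json_path = resolve_data_path(ROOT, "data/sources_sample.json")`, so the load no longer depends on where the notebook kernel was started.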

## 2. Lightweight inspection of the JSON structure

Before structuring the data, we briefly inspect the JSON schema to understand
which fields are present and to justify our feature selection choices.

This step is used for reasoning and documentation, not for exhaustive parsing.
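
The cell below only counts top-level and TNS keys. To see nested structure as well, the same idea can be applied recursively with a depth limit. This is a minimal sketch: `collect_key_paths` is a hypothetical helper, and the field names in the toy records are illustrative rather than taken from the real file.

```python
from collections import Counter


def collect_key_paths(obj, prefix="", max_depth=3, counter=None):
    """Count dotted key paths in nested dicts/lists, down to max_depth dict levels."""
    if counter is None:
        counter = Counter()
    if max_depth == 0:
        return counter
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            counter[path] += 1
            collect_key_paths(value, path, max_depth - 1, counter)
    elif isinstance(obj, list):
        # Lists do not consume a depth level; "[]" marks that a list was crossed.
        for item in obj:
            collect_key_paths(item, f"{prefix}[]", max_depth, counter)
    return counter


# Toy records with hypothetical field names, for illustration only.
toy = [
    {"id": "A", "photometry": [{"mjd": 1.0, "mag": 19.2}, {"mjd": 2.0, "mag": 19.0}]},
    {"id": "B", "tns_info": {"name": "X"}},
]

paths = Counter()
for record in toy:
    collect_key_paths(record, counter=paths)
```

Calling `paths.most_common()` then lists the most frequent key paths first, which helps separate fields present on every source from rarely populated ones.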


In [None]:
# =========================================
# Inspect top-level and TNS keys
# =========================================

top_level_keys = Counter()
tns_keys = Counter()

for obj in data:
    top_level_keys.update(obj.keys())
    if isinstance(obj.get("tns_info"), dict):
        tns_keys.update(obj["tns_info"].keys())

print("Top-level keys:")
top_level_keys


In [None]:
print("TNS info keys:")
tns_keys


## 3. Design choices for data structuring

From the inspection above, we make the following choices:

- **Structured ML features**:
  - source metadata (position, score, redshift)
  - astrophysical class labels (when available)
  - raw photometry time series

- **Deferred to LLM reasoning**:
  - comments
  - follow-up requests
  - human annotations and summaries

- **Ignored for this prototype**:
  - UI-related fields
  - deeply nested administrative metadata
  - pre-computed photometric statistics (to avoid data leakage)

This results in two compact and interpretable datasets.
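
The three-way split above can be sketched as a whitelist: fields listed as structured go to the ML table, free-text fields are kept aside for the LLM, and everything else is dropped. The field names below are assumptions made for illustration; the real names should come from the inspection step, and `split_record` is a hypothetical helper rather than the logic in `src.dataset`.

```python
# Hypothetical field names; replace with those found during inspection.
STRUCTURED_FIELDS = ("id", "ra", "dec", "redshift", "score")
TEXT_FIELDS = ("comments", "followup_requests", "annotations")


def split_record(record: dict) -> tuple[dict, dict]:
    """Split one raw source into (structured features, free-text context).

    Any key not listed in either tuple is ignored, which is how UI fields,
    administrative metadata, and pre-computed statistics are left out.
    """
    features = {name: record.get(name) for name in STRUCTURED_FIELDS}
    context = {name: record.get(name, []) for name in TEXT_FIELDS}
    return features, context
```

An explicit whitelist keeps the feature set auditable: a new field in the JSON never enters the ML table unless someone adds it on purpose.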


## 4. Building structured datasets (Part 1)

We now convert the raw JSON objects into:
- a **source-level dataset** (one row per transient),
- a **long-format lightcurve dataset** (one row per photometric measurement).
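
As an illustration of the long format, the sketch below turns one source's nested photometry list into one row per measurement, each tagged with the source identifier. The keys `photometry`, `mjd`, `filter`, `mag`, and `magerr` are assumed names, and `photometry_to_rows` is a hypothetical stand-in; the actual `build_lightcurve_dataset` may differ.

```python
def photometry_to_rows(source: dict) -> list[dict]:
    """Return one flat row per photometric point of a single source."""
    rows = []
    # "or []" also covers sources where the photometry key is present but null.
    for point in source.get("photometry") or []:
        rows.append(
            {
                "source_id": source.get("id"),
                "mjd": point.get("mjd"),
                "filter": point.get("filter"),
                "mag": point.get("mag"),
                "magerr": point.get("magerr"),
            }
        )
    return rows
```

Concatenating the rows from all sources and passing them to `pd.DataFrame` gives a single table that can be grouped by `source_id` to recover individual lightcurves.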


In [None]:
# =========================================
# Build source-level dataset
# =========================================

sources_df = build_dataset(data)
sources_df.head()


In [None]:
# =========================================
# Build lightcurve dataset
# =========================================

lightcurves_df = build_lightcurve_dataset(data)
lightcurves_df.head()


## 5. Saving datasets to Parquet

Parquet is a columnar format that stores column types alongside the data, which gives compact files and efficient column-wise reads for downstream processing.


In [None]:
# =========================================
# Save datasets
# =========================================

save_sources_parquet(sources_df, "data/sources.parquet")
save_lc_parquet(lightcurves_df, "data/lightcurves.parquet")

print("Parquet files written.")


## 6. Reloading structured datasets

We reload the Parquet files to confirm that they can be read back on their own,
without the raw JSON, and are ready for downstream ML or LLM-based processing.
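
Beyond counting rows, a few cross-table checks can catch structural problems early. The sketch below assumes that the lightcurve table links to sources through a column named `source_id`, which is an assumption; the `id` column on the source table is the one used later in this notebook. `check_datasets` is a hypothetical helper.

```python
import pandas as pd


def check_datasets(sources: pd.DataFrame, lightcurves: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sources["id"].duplicated().any():
        problems.append("duplicate source ids")
    # Photometry rows pointing at a source that is not in the source table.
    orphans = set(lightcurves["source_id"]) - set(sources["id"])
    if orphans:
        problems.append(f"photometry for unknown sources: {sorted(orphans)}")
    # Sources that ended up with no photometry at all.
    empty = set(sources["id"]) - set(lightcurves["source_id"])
    if empty:
        problems.append(f"sources without photometry: {sorted(empty)}")
    return problems
```

Returning a list instead of raising on the first failure makes it possible to print every issue at once, for example with `print(check_datasets(sources_df, lightcurves_df))`.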


In [None]:
# =========================================
# Reload Parquet datasets
# =========================================

sources_df, lightcurves_df = load_datasets(
    "data/sources.parquet",
    "data/lightcurves.parquet",
)

print(f"Sources: {len(sources_df)}")
print(f"Photometry points: {len(lightcurves_df)}")


## 7. LLM-based astronomer copilot (Part 2)

Instead of training a new classifier, we use a local instruction-tuned LLM
(Mistral 7B Instruct) as a reasoning layer on top of structured data.

The goal is to:
- summarize key information,
- assess whether a transient is likely extragalactic,
- suggest whether follow-up observations are warranted.
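
To make the idea of a reasoning layer concrete, the sketch below reduces a lightcurve to a few numbers and embeds them, together with the source metadata, in a prompt that lists the three goals above. Both functions are hypothetical illustrations and the row keys `mjd` and `mag` are assumed; the prompt actually built inside `src.llm_copilot` is not shown in this notebook and may differ.

```python
def summarize_lightcurve(rows: list[dict]) -> dict:
    """Reduce photometry rows to a few numbers that fit in a prompt."""
    detections = [r for r in rows if r.get("mag") is not None]
    if not detections:
        return {"n_points": 0}
    mjds = [r["mjd"] for r in detections]
    mags = [r["mag"] for r in detections]
    return {
        "n_points": len(detections),
        "span_days": round(max(mjds) - min(mjds), 2),
        "brightest_mag": min(mags),  # smaller magnitude means brighter
    }


def build_prompt(features: dict, lc_summary: dict) -> str:
    """Assemble a plain-text prompt from structured inputs."""
    lines = [
        "You are assisting an astronomer with transient vetting.",
        f"Source metadata: {features}",
        f"Lightcurve summary: {lc_summary}",
        "Tasks:",
        "1. Summarize the key information.",
        "2. Assess whether the transient is likely extragalactic.",
        "3. Say whether follow-up observations are warranted.",
        "State your uncertainty when the data are insufficient.",
    ]
    return "\n".join(lines)
```

Passing a compact summary rather than the full photometry keeps the prompt short, which matters for a 7B model with a limited context window.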


In [None]:
# =========================================
# Local LLM sanity check
# =========================================

print(query_llm("In one sentence, explain what a supernova is."))


## 8. Running the copilot on a single transient

We now run the full copilot pipeline on one example transient.


In [None]:
# =========================================
# Run copilot on a single transient
# =========================================

example_id = sources_df["id"].iloc[0]
print(f"Selected transient: {example_id}")

copilot_output = run_copilot_for_source(
    source_id=example_id,
    sources=sources_df,
    lightcurves=lightcurves_df,
)

print("=== Copilot output ===")
print(copilot_output)


## 9. Running the copilot on multiple transients

This checks that the pipeline runs unchanged on several sources rather than on a single hand-picked example.
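
When looping over many sources, one failing call (for example a source without photometry, or an LLM timeout) would stop a plain `for` loop. A minimal sketch of a more tolerant batch runner follows; `run_batch` is a hypothetical helper that takes any one-argument callable, so it makes no assumption about the internals of `run_copilot_for_source`.

```python
def run_batch(source_ids, run_one):
    """Apply run_one to each id; collect outputs and errors separately."""
    results, errors = {}, {}
    for source_id in source_ids:
        try:
            results[source_id] = run_one(source_id)
        except Exception as exc:
            # Record the failure and continue so one bad source does not stop the batch.
            errors[source_id] = repr(exc)
    return results, errors
```

In this notebook it could wrap the copilot call as `run_batch(sources_df["id"].head(3), lambda sid: run_copilot_for_source(source_id=sid, sources=sources_df, lightcurves=lightcurves_df))`.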


In [None]:
# =========================================
# Run copilot on multiple transients
# =========================================

for source_id in sources_df["id"].head(3):
    print("\n" + "=" * 80)
    print(f"Transient ID: {source_id}")
    print(run_copilot_for_source(
        source_id=source_id,
        sources=sources_df,
        lightcurves=lightcurves_df,
    ))


## 10. User-driven question

Finally, we let a hypothetical astronomer ask a specific question about the
transient. The copilot output from Section 8 is passed to the LLM as context,
followed by the question.


In [None]:
# =========================================
# User-driven question
# =========================================

user_question = (
    "Is this transient more likely a supernova or a galactic variable star?"
)

combined_prompt = copilot_output + "\n\nUser question:\n" + user_question
answer = query_llm(combined_prompt)

print("=== User question answer ===")
print(answer)


## Conclusion

This notebook demonstrates:
- how to tame large, deeply nested astronomical JSON files,
- how to extract a minimal, ML-relevant core dataset,
- how to use a local LLM as a reasoning copilot for astronomers.

The focus is on clarity, robustness, and design choices rather than
model performance or over-engineering.
