In [None]:
import sys
from pathlib import Path
ROOT = Path("..").resolve()
if str(ROOT) not in sys.path:
    sys.path.insert(0, str(ROOT))

print(f"Project root: {ROOT}")
print("Setup complete.")

Project root: C:\Mes Documents\Pas cours\Devoir Stage\Space
Setup complete.


# Astronomical Transient Classification – End-to-End Pipeline

This notebook demonstrates an end-to-end prototype pipeline for astronomical
transient classification using SkyPortal-like JSON data.

The goals are:
- to transform heterogeneous, deeply nested JSON data into structured datasets
  suitable for machine learning,
- to demonstrate how a local Large Language Model (LLM) can act as a
  "copilot" to assist astronomers during transient vetting.

The notebook is organized as follows:

1. Load and inspect the raw JSON data
2. Parse and structure the data into ML-ready formats (Parquet)
3. Perform sanity checks on the structured datasets
4. Use a local LLM (Mistral) to generate copilot-style summaries and suggestions


In [None]:
import pandas as pd
from collections import Counter
from src.parser import load_json
from src.dataset import build_dataset, save_parquet as save_sources_parquet
from src.lightcurves import build_lightcurve_dataset, save_parquet as save_lc_parquet
from src.llm_copilot import (
    load_datasets,
    run_copilot_for_source,
    query_llm,
)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Environment ready.


## 1. Loading the raw JSON data

In [132]:
# =========================================
# Load raw JSON
# =========================================

json_path = "../data/sources_sample.json"  # adapt path if needed
data = load_json(json_path)

print(f"Number of sources loaded: {len(data)}")


Number of sources loaded: 10


## 2. Lightweight inspection of the JSON structure

Before structuring the data, we briefly inspect the JSON schema to understand
which fields are present and to justify our feature selection choices.

This step is used for reasoning and documentation, not for exhaustive parsing.


In [133]:
# =========================================
# Inspect top-level and TNS keys
# =========================================

top_level_keys = Counter()
tns_keys = Counter()

for obj in data:
    top_level_keys.update(obj.keys())
    if isinstance(obj.get("tns_info"), dict):
        tns_keys.update(obj["tns_info"].keys())

print("Top-level keys:")
top_level_keys


Top-level keys:


Counter({'dec_dis': 10,
         'redshift_origin': 10,
         'e_mag_nearest_source': 10,
         'modified': 10,
         'ra': 10,
         'ra_err': 10,
         'redshift_history': 10,
         'transient': 10,
         'score': 10,
         'dec': 10,
         'dec_err': 10,
         'host_id': 10,
         'varstar': 10,
         'origin': 10,
         'offset': 10,
         'summary': 10,
         'is_roid': 10,
         'alias': 10,
         't0': 10,
         'summary_history': 10,
         'mpc_name': 10,
         'healpix': 10,
         'ra_dis': 10,
         'redshift': 10,
         'altdata': 10,
         'gcn_crossmatch': 10,
         'internal_key': 10,
         'redshift_error': 10,
         'dist_nearest_source': 10,
         'tns_name': 10,
         'detect_photometry_count': 10,
         'id': 10,
         'mag_nearest_source': 10,
         'tns_info': 10,
         'created_at': 10,
         'thumbnails': 10,
         'photstats': 10,
         'followup_requests'

In [134]:
print("TNS info keys:")
tns_keys


TNS info keys:


Counter({'ra': 5,
         'dec': 5,
         'objid': 5,
         'radeg': 5,
         'decdeg': 5,
         'public': 5,
         'source': 5,
         'objname': 5,
         'spectra': 5,
         'hostname': 5,
         'redshift': 5,
         'reporter': 5,
         'radeg_err': 5,
         'decdeg_err': 5,
         'discoverer': 5,
         'photometry': 5,
         'reporterid': 5,
         'name_prefix': 5,
         'object_type': 5,
         'discoverymag': 5,
         'discmagfilter': 5,
         'discoverydate': 5,
         'host_redshift': 5,
         'internal_names': 5,
         'end_prop_period': 5,
         'reporting_group': 5,
         'class_ads_bibcodes': 5,
         'discovery_ads_bibcode': 5,
         'discovery_data_source': 5,
         'discoverer_internal_name': 5})
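
The two counters above only look at one nesting level. As a minimal sketch of how the same idea extends to deeper levels, the helper below counts dotted key paths and reports the fraction of records in which each path appears. The function names and the toy records are hypothetical and are not part of `src.parser`.

```python
from collections import Counter


def iter_key_paths(obj, prefix=""):
    """Yield a dotted path for every dict key reachable from obj."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from iter_key_paths(value, path)
    elif isinstance(obj, list):
        # All items of a list share one path, marked with "[]".
        for item in obj:
            yield from iter_key_paths(item, f"{prefix}[]")


def key_coverage(records):
    """Return {key_path: fraction of records in which the path appears}."""
    if not records:
        return {}
    counts = Counter()
    for record in records:
        # A set ensures each path is counted once per record.
        counts.update(set(iter_key_paths(record)))
    return {path: n / len(records) for path, n in counts.items()}


# Toy records invented for illustration (not the real SkyPortal schema).
toy = [
    {"id": "a", "tns_info": {"object_type": "SN Ia"}, "photometry": [{"mjd": 1.0}]},
    {"id": "b", "tns_info": None, "photometry": []},
]
coverage = key_coverage(toy)
```

Paths with a coverage well below 1.0 are the ones that need explicit handling of missing values during structuring.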

## 3. Design choices for data structuring

From the inspection above, we make the following choices:

- **Structured ML features**:
  - source metadata (position, score, redshift)
  - astrophysical class labels (when available)
  - raw photometry time series

- **Deferred to LLM reasoning**:
  - comments
  - follow-up requests
  - human annotations and summaries

- **Ignored for this prototype**:
  - UI-related fields
  - deeply nested administrative metadata
  - pre-computed photometric statistics (to avoid data leakage)

This results in two compact and interpretable datasets.
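
To make these choices concrete, here is a minimal sketch of how one nested source object could be reduced to a flat row. It is not the actual `build_dataset` implementation: `flatten_source` is a hypothetical helper, and it assumes that the class label is stored as a plain string under `tns_info["object_type"]`, which may not match the real layout.

```python
def flatten_source(obj):
    """Map one nested source object to a flat, ML-friendly row (illustrative)."""
    tns = obj.get("tns_info")
    if not isinstance(tns, dict):
        tns = {}  # sources without TNS information get empty defaults
    redshift = obj.get("redshift")
    return {
        "id": obj.get("id"),
        "ra": obj.get("ra"),
        "dec": obj.get("dec"),
        "score": obj.get("score"),
        "redshift": redshift,
        # Assumption: the label is a plain string in tns_info["object_type"].
        "label": tns.get("object_type"),
        "has_tns": bool(tns),
        "has_redshift": redshift is not None,
    }


# Invented example object; the values are not taken from the real data.
example = {
    "id": "SRC1", "ra": 10.5, "dec": -3.2, "score": 0.9,
    "redshift": None, "tns_info": None, "summary": "kept for the LLM, not here",
}
row = flatten_source(example)
```

Fields that are not listed explicitly, such as `summary` in the example, never reach the row, which keeps the structured dataset compact.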


## 4. Building structured datasets (Part 1)

We now convert the raw JSON objects into:
- a **source-level dataset** (one row per transient),
- a **long-format lightcurve dataset** (one row per photometric measurement).


In [None]:

sources_df = build_dataset(data)
sources_df.head()


|   | id | ra | dec | score | is_transient | is_varstar | is_roid | redshift | label | has_tns | has_redshift | has_photometry | has_spectra | n_photometry | n_spectra |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ZTF25aaktqzg | 233.857452 | 12.057747 | 0.999921 | False | False | False | 0.0061 | SN Ia | True | True | True | True | 16 | 3 |
| 1 | ZTF25aajygin | 205.251743 | 39.058164 | 0.999947 | False | False | False | 0.02022 | SN II | True | True | True | True | 9 | 2 |
| 2 | ZTF25aajuqtp | 117.185365 | 66.197481 | 0.611303 | False | False | False | 0.01439 | SN Ic-BL | True | True | True | True | 5 | 1 |
| 3 | ZTF25aajqkdo | 107.164235 | 61.305093 | 0.999521 | False | False | False | 0.025386 | SN Ia | True | True | True | True | 5 | 1 |
| 4 | ZTF21ackbmei | 291.646695 | 36.699277 | 0.998749 | False | False | False |  |  | False | False | False | False | 0 | 0 |


In [None]:
lightcurves_df = build_lightcurve_dataset(data)
lightcurves_df.head()


|   | id | jd | flux | flux_err | filter | is_detection |
|---|---|---|---|---|---|---|
| 0 | ZTF25aaktqzg | 2460754.0 |  |  | L-GOTO | False |
| 1 | ZTF25aaktqzg | 2460760.0 |  |  | orange-ATLAS | False |
| 2 | ZTF25aaktqzg | 2460762.0 | 16.98 | 0.05 | L-GOTO | True |
| 3 | ZTF25aaktqzg | 2460798.0 |  |  | BG-q-BlackGem | False |
| 4 | ZTF25aaktqzg | 2460800.0 | 15.21 | 0.02 | BG-q-BlackGem | True |
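
The long format shown above can be produced by emitting one row per photometric point and repeating the source `id` on each row. The sketch below illustrates that idea only. `to_long_format` is a hypothetical helper, and the per-source `photometry` list with its `jd`, `flux`, `flux_err` and `filter` keys is an assumed layout, not necessarily the one handled by `build_lightcurve_dataset`.

```python
def to_long_format(sources):
    """Return one flat row per photometric point across all sources."""
    rows = []
    for src in sources:
        # Assumption: each source carries a "photometry" list (possibly missing).
        for point in src.get("photometry") or []:
            flux = point.get("flux")
            rows.append({
                "id": src["id"],
                "jd": point.get("jd"),
                "flux": flux,
                "flux_err": point.get("flux_err"),
                "filter": point.get("filter"),
                # A point without a measured value is treated as a non-detection.
                "is_detection": flux is not None,
            })
    return rows


# Invented toy input; the values are for illustration only.
toy_sources = [
    {"id": "SRC1", "photometry": [
        {"jd": 100.0, "filter": "g"},  # no flux: non-detection
        {"jd": 102.0, "flux": 17.0, "flux_err": 0.1, "filter": "g"},
    ]},
    {"id": "SRC2", "photometry": None},  # source without photometry
]
long_rows = to_long_format(toy_sources)
```

A source without photometry contributes no rows, which is consistent with such sources appearing only in the source-level dataset.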


## 5. Saving datasets to Parquet


In [None]:
save_sources_parquet(sources_df, "../data/sources.parquet")
save_lc_parquet(lightcurves_df, "../data/lightcurves.parquet")

print("Parquet files written.")


Parquet files written.


## 6. Reloading structured datasets

We reload the Parquet files to confirm that they round-trip correctly and can
be used on their own for downstream ML or LLM-based processing.


In [None]:
sources_df, lightcurves_df = load_datasets(
    "../data/sources.parquet",
    "../data/lightcurves.parquet",
)

print(f"Sources: {len(sources_df)}")
print(f"Photometry points: {len(lightcurves_df)}")


Sources: 10
Photometry points: 37
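
Beyond row counts, a few cross-dataset checks help catch structuring mistakes early. The sketch below is one possible set of checks, written against plain lists of records (for example the result of `df.to_dict("records")`) so that it stays independent of the project code. `check_datasets` is a hypothetical helper and the toy rows are invented.

```python
from collections import Counter


def check_datasets(source_rows, lightcurve_rows):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # 1. Source ids must be unique.
    ids = [row["id"] for row in source_rows]
    duplicates = sorted(i for i, n in Counter(ids).items() if n > 1)
    if duplicates:
        problems.append(f"duplicate source ids: {duplicates}")

    # 2. Every lightcurve point must belong to a known source.
    points_per_id = Counter(row["id"] for row in lightcurve_rows)
    orphans = sorted(set(points_per_id) - set(ids))
    if orphans:
        problems.append(f"lightcurve ids missing from sources: {orphans}")

    # 3. n_photometry must match the number of lightcurve rows per source.
    for row in source_rows:
        if row["n_photometry"] != points_per_id[row["id"]]:
            problems.append(f"n_photometry mismatch for {row['id']}")
    return problems


# Invented toy rows for illustration.
toy_sources = [{"id": "A", "n_photometry": 2}, {"id": "B", "n_photometry": 0}]
toy_points = [{"id": "A"}, {"id": "A"}, {"id": "C"}]
problems = check_datasets(toy_sources, toy_points)
```

Returning a list of messages rather than raising on the first failure makes it possible to see all issues in a single run.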


## 7. LLM-based astronomer copilot (Part 2)

Instead of training a new classifier, we use a local instruction-tuned LLM
(Mistral 7B Instruct) as a reasoning layer on top of structured data.

The goal is to:
- summarize key information,
- assess whether a transient is likely extragalactic,
- suggest whether follow-up observations are warranted.
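
As a rough illustration of what "a reasoning layer on top of structured data" can mean in practice, the sketch below condenses a lightcurve into a few numbers and formats them, together with the source metadata, into a prompt. It is not the code in `src.llm_copilot`: `summarize_lightcurve`, `build_prompt` and the prompt wording are hypothetical, and the toy values are invented.

```python
def summarize_lightcurve(points):
    """Condense photometry rows into a few values that fit in a prompt."""
    detections = [p for p in points if p["is_detection"]]
    times = [p["jd"] for p in points]
    values = [p["flux"] for p in detections]
    return {
        "n_points": len(points),
        "n_detections": len(detections),
        "span_days": max(times) - min(times) if times else 0.0,
        "filters": sorted({p["filter"] for p in points}),
        "min_value": min(values) if values else None,
        "max_value": max(values) if values else None,
    }


def build_prompt(source, lc_summary):
    """Format source metadata and a lightcurve summary as an LLM prompt."""
    return (
        "You are assisting an astronomer with transient vetting.\n"
        f"Source: {source['id']} (score={source['score']}, "
        f"redshift={source['redshift']}, TNS label={source['label']})\n"
        f"Lightcurve: {lc_summary['n_points']} points over "
        f"{lc_summary['span_days']:.1f} days in filters "
        f"{', '.join(lc_summary['filters'])}\n"
        "Answer in four numbered points: key information, whether the source "
        "is likely extragalactic, whether follow-up is warranted, and which "
        "follow-up you would suggest."
    )


# Invented toy inputs for illustration.
toy_points = [
    {"jd": 100.0, "flux": None, "filter": "g", "is_detection": False},
    {"jd": 102.5, "flux": 17.0, "filter": "g", "is_detection": True},
    {"jd": 110.0, "flux": 16.2, "filter": "r", "is_detection": True},
]
toy_source = {"id": "SRC1", "score": 0.9, "redshift": 0.01, "label": "SN Ia"}
summary = summarize_lightcurve(toy_points)
prompt = build_prompt(toy_source, summary)
```

Passing a short summary instead of the raw time series keeps the prompt small, at the cost of hiding the lightcurve shape from the model.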


In [139]:
# =========================================
# Local LLM sanity check
# =========================================

print(query_llm("In one sentence, explain what a supernova is."))


 A supernova is an extremely bright exploding star, marking the end of its life cycle and releasing vast amounts of energy and heavy elements into space.


## 8. Running the copilot on a single transient

We now run the full copilot pipeline on one example transient.


In [140]:
# =========================================
# Run copilot on a single transient
# =========================================

example_id = sources_df["id"].iloc[0]
print(f"Selected transient: {example_id}")

copilot_output = run_copilot_for_source(
    source_id=example_id,
    sources=sources_df,
    lightcurves=lightcurves_df,
)

print("=== Copilot output ===")
print(copilot_output)


Selected transient: ZTF25aaktqzg
=== Copilot output ===
 1. Key information: The source ZTF25aaktqzg was detected with a high score (0.9999) and is reported in the Transient Name Server (TNS) with a confirmed classification as SN Ia. It has a redshift value of 0.0061, spectra are available, and it was observed over a period of 70.8 days using various filters.

2. Likely extragalactic: Yes, the TNS reported classification as Supernova (SN Ia) and the high redshift value support an extragalactic origin. However, the heuristic flag for being a transient is false, which may indicate some uncertainty in its classification.

3. Follow-up observations are likely needed: Given the TNS reported classification as SN Ia, it would be beneficial to confirm this classification and gather more data on the supernova's properties.

4. Suggested follow-up: Additional spectroscopy observations can help confirm the SN Ia classification and provide information about subtypes or peculiarities. More photomet

## 9. Running the copilot on multiple transients

We run the copilot on the first three sources to check that the pipeline works beyond a single example.


In [141]:
# =========================================
# Run copilot on multiple transients
# =========================================

for source_id in sources_df["id"].head(3):
    print("\n" + "=" * 80)
    print(f"Transient ID: {source_id}")
    print(run_copilot_for_source(
        source_id=source_id,
        sources=sources_df,
        lightcurves=lightcurves_df,
    ))



Transient ID: ZTF25aaktqzg
 1. Key information: The source ZTF25aaktqzg is a transient object with a reported redshift of 0.0061, indicating an extragalactic origin. It has been classified as a Type Ia Supernova (SN Ia) by the TNS based on available spectra and redshift measurements. The lightcurve summary shows 16 photometry points across various filters over a period of approximately 71 days, with a flux range from 15.21 to 18.9175.

2. Likely extragalactic: Yes, the TNS classification as SN Ia and the available redshift measurement support an extragalactic origin for this transient. However, it's important to note that while the TNS label is a high-confidence signal, there may still be some uncertainty associated with the initial classification.

3. Follow-up observations: Given the confirmed classification as a SN Ia and the available redshift measurement, follow-up observations are likely warranted to confirm the initial classification and gather more detailed data about the supe