# 🚀 Master Notebook: DR5 Spectroscopy Pipeline

This notebook orchestrates the **complete workflow** for training a classifier on LAMOST DR5 spectra:  
**batch selection → catalog generation/enrichment → preprocessing + features → training → logs & artifacts.**

**Quick contents**
- [🧪 Step 0: SETUP & IMPORTS](#etape-0)
- [▶️ Run a full session](#run-full)
- [1) Downloading the spectra](#step-1-download)
- [2) Selecting the batch of spectra](#step-2-select)
- [3) Catalog: generation & Gaia enrichment](#step-3-catalog)
- [3bis) Processing & feature extraction](#step-3bis-features)
- [4) Training the model](#step-4-train)

> Orchestrator used: **`MasterPipeline`**  
> Key methods: `select_batch`, `generate_and_enrich_catalog`, `process_data`, `run_training_session`, `run_full_pipeline`, `interactive_training_runner`.

<a id="etape-0"></a>

## 🧪 Step 0: SETUP & IMPORTS

Initializes the environment, creates/validates the directory tree
(`RAW_DATA_DIR`, `CATALOG_DIR`, `PROCESSED_DIR`, `MODELS_DIR`, `REPORTS_DIR`)
and instantiates **`MasterPipeline`**.

**Expected after execution:**
- a ready-to-use `pipeline` object,
- messages about the project root and, if configured, an attempt to connect to **Gaia**.

> ℹ️ **Interactive training UI**: available via `pipeline.interactive_training_runner()`.
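To make the "creates/validates the directory tree" step concrete, here is a minimal sketch of what such a helper could look like. It is not the actual `setup_project_env` implementation: the function name `ensure_project_dirs` and the sub-folder names in `DEFAULT_LAYOUT` are assumptions chosen for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: these relative folder names are assumptions,
# not the ones hard-coded in the project's own utils module.
DEFAULT_LAYOUT = {
    "RAW_DATA_DIR": "data/raw",
    "CATALOG_DIR": "data/catalog",
    "PROCESSED_DIR": "data/processed",
    "MODELS_DIR": "data/models",
    "REPORTS_DIR": "data/reports",
}


def ensure_project_dirs(root, layout=DEFAULT_LAYOUT):
    """Create each directory if it is missing and return a name -> Path mapping."""
    root = Path(root)
    paths = {}
    for key, relative in layout.items():
        path = root / relative
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        paths[key] = path
    return paths


# Demo in a throwaway directory so nothing is written to the real project.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = ensure_project_dirs(tmp)
    created = sorted(demo_paths)
    all_exist = all(p.is_dir() for p in demo_paths.values())
```

Returning a plain dict keyed by the same names as the `MasterPipeline` arguments keeps the instantiation call readable.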

In [1]:
from utils import setup_project_env, load_env_vars
from pipeline.master import MasterPipeline
from astroquery.gaia import Gaia
from pipeline.classifier import SpectralClassifier

# Automatically initialize the environment and paths
paths = setup_project_env()

# Load the Gaia credentials from .env
env_vars = load_env_vars()

try:
    print("Attempting to connect to the Gaia archive...")
    Gaia.login(user=env_vars.get("GAIA_USER"), password=env_vars.get("GAIA_PASS"))
    print("Connected to Gaia.")
except Exception as e:
    print(f"WARNING: Gaia login failed ({e}). 'bulk' mode may fail.")

# Instantiate the master pipeline
pipeline = MasterPipeline(
    raw_data_dir=paths["RAW_DATA_DIR"],
    catalog_dir=paths["CATALOG_DIR"],
    processed_dir=paths["PROCESSED_DIR"],
    models_dir=paths["MODELS_DIR"],
    reports_dir=paths["REPORTS_DIR"],
)

print("\nSetup complete. You are ready to run your pipeline.")

[INFO] Project root detected: C:\Users\alexb\Documents\Google_Cloud\alex_labs_google_sprint\astro_spectro_git
[INFO] 'src' folder added to sys.path.
[INFO] Environment variables loaded from C:\Users\alexb\Documents\Google_Cloud\alex_labs_google_sprint\astro_spectro_git\.env
Attempting to connect to the Gaia archive...
INFO: Login to gaia TAP server [astroquery.gaia.core]
INFO: OK [astroquery.utils.tap.core]
INFO: Login to gaia data server [astroquery.gaia.core]
INFO: OK [astroquery.utils.tap.core]
Connected to Gaia.

Setup complete. You are ready to run your pipeline.


---

<a id="run-full"></a>
## ▶️ Run a full session

Runs **the whole pipeline from A to Z**:
`select_batch → generate_and_enrich_catalog → process_data → run_training_session`.

> 💡 **Recommended starting parameters**: `batch_size=200–500`, `n_estimators=200–400` (RF/XGB).  
> Enable `enrich_gaia=True` once connectivity is stable.
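The chain above can also be run one stage at a time, which helps when you want to inspect intermediate results. The sketch below is only an illustration of that ordering: `run_steps_manually` and `RecordingPipeline` are hypothetical names, and forwarding the training keyword arguments to `run_training_session` is an assumption to check against the real `MasterPipeline` signature.

```python
def run_steps_manually(pipeline, batch_size=500, enrich_gaia=False, **training_kwargs):
    """Call the four stages one by one, in the documented order.

    Assumption: run_training_session accepts the training keyword
    arguments (model_type, n_estimators, ...) passed through here.
    """
    pipeline.select_batch(batch_size=batch_size, strategy="random")
    pipeline.generate_and_enrich_catalog(enrich_gaia=enrich_gaia)
    pipeline.process_data()
    return pipeline.run_training_session(**training_kwargs)


class RecordingPipeline:
    """Stand-in object that only records which methods were called."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. the stage methods.
        def method(*args, **kwargs):
            self.calls.append(name)
        return method


demo = RecordingPipeline()
run_steps_manually(demo, batch_size=200, model_type="RandomForest", n_estimators=200)
```

With the real `pipeline` object from Step 0, you would pass it in place of `demo`.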

In [None]:
pipeline.run_full_pipeline(
    batch_size=500,                 # batch size
    model_type="RandomForest",      # "RandomForest" or "XGBoost"
    n_estimators=100,               # number of trees in the final model
    prediction_target="main_class", # e.g. "main_class", "sub_class_top25", "sub_class_bins"
    save_and_log=True,              # save the model + JSON report
    enrich_gaia=False,              # True to enable Gaia
    # ...Gaia kwargs if enrich_gaia=True
)

---

<a id="step-1-download"></a>
## 1) Downloading the spectra

Uses the wrapped **`dr5_downloader.py`** script.  
This step is delegated to **[01_download_spectra.ipynb](./01_download_spectra.ipynb)** (run it as needed).

> ⚠️ **Quota / time**: depending on the requested volume, the download can take a long time.

<a id="step-2-select"></a>
## 2) Selecting the batch of spectra

Picks a **new batch** of `.fits.gz` files to process, without reusing spectra already recorded in the log.

- `batch_size`: number of spectra,
- `strategy`: e.g. `"random"`.

The selector relies on **DatasetBuilder** to guarantee that samples are unique.
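The idea of "new files only, picked at random" can be sketched in a few lines. This is not the `DatasetBuilder` code: `pick_new_batch`, its arguments, and the toy file names are hypothetical, and the real class may store its log differently.

```python
import random


def pick_new_batch(all_files, used_files, batch_size, seed=None):
    """Return up to batch_size distinct files that are not in used_files."""
    used = set(used_files)
    # Sort so that a given seed always yields the same selection.
    candidates = sorted(f for f in set(all_files) if f not in used)
    rng = random.Random(seed)
    return rng.sample(candidates, min(batch_size, len(candidates)))


# Toy example: 10 files on disk, 6 of them already logged as used.
all_files = [f"plate/spec-{i:03d}.fits.gz" for i in range(10)]
used_files = all_files[:6]
batch = pick_new_batch(all_files, used_files, batch_size=3, seed=0)
```

Capping the sample at the number of candidates avoids an error when fewer new spectra remain than the requested `batch_size`.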


In [2]:
pipeline.select_batch(batch_size=5000, strategy="random")


=== STEP 1: SELECTING A NEW BATCH ===
--- Building a new training batch ---
  > 46170 spectra found in 'C:\Users\alexb\Documents\Google_Cloud\alex_labs_google_sprint\astro_spectro_git\data\raw'
  > 31496 spectra already used in the log.
  > 14674 **new** spectra available.
  > Random selection of 5000 spectra.


['M31_011N40_B1/spec-55863-M31_011N40_B1_sp07-243.fits.gz',
 'M6201/spec-55862-M6201_sp03-173.fits.gz',
 'B6301/spec-55863-B6301_sp03-047.fits.gz',
 'M31_011N40_B1/spec-55863-M31_011N40_B1_sp02-164.fits.gz',
 'F5907/spec-55859-F5907_sp09-180.fits.gz',
 'M5904/spec-55859-M5904_sp11-123.fits.gz',
 'B6210/spec-55862-B6210_sp11-014.fits.gz',
 'B6301/spec-55863-B6301_sp01-120.fits.gz',
 'M6201/spec-55862-M6201_sp08-132.fits.gz',
 'B6302/spec-55863-B6302_sp08-049.fits.gz',
 'M6201/spec-55862-M6201_sp02-233.fits.gz',
 'M6201/spec-55862-M6201_sp14-008.fits.gz',
 'F5902/spec-55859-F5902_sp05-220.fits.gz',
 'GAC_060N28_B1/spec-55863-GAC_060N28_B1_sp16-064.fits.gz',
 'M6203/spec-55862-M6203_sp15-176.fits.gz',
 'M31_011N40_M1/spec-55863-M31_011N40_M1_sp06-086.fits.gz',
 'GAC_105N29_B1/spec-55863-GAC_105N29_B1_sp14-179.fits.gz',
 'GAC_060N28_B1/spec-55863-GAC_060N28_B1_sp09-096.fits.gz',
 'B6212/spec-55862-B6212_sp07-208.fits.gz',
 'GAC_105N29_B1/spec-55863-GAC_105N29_B1_sp14-170.fits.gz',
 'M5904/

<a id="step-3-catalog"></a>
## 3) Catalog: generation & Gaia enrichment

From the current batch, produces a **local catalog** (CSV) and can **enrich it via Gaia** (positions, photometry, etc.).

**Outputs:**
- `master_catalog_temp.csv` (local catalog), then `master_catalog_gaia.csv` if enriched,
- an updated `pipeline.master_catalog_df`.

> ℹ️ **Gaia coupling**: handled by the orchestrator (matching + stats).  
> ⚠️ **Connection**: if Gaia authentication fails, rerun without `enrich_gaia` or check your credentials.
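As a rough illustration of the "matching + stats" part, the sketch below counts matched objects and flags rows by RUWE. It is an assumption-laden toy: the keys `source_id` and `ruwe`, the function name `match_stats`, and the choice to flag rows rather than drop them are all hypothetical, and the real cross-match is done by the orchestrator against the Gaia archive.

```python
def match_stats(rows, ruwe_max=1.4):
    """Count Gaia-matched rows and add an is_good_ruwe flag to every row.

    Assumption: an unmatched object has source_id None and ruwe None.
    """
    matched = [r for r in rows if r.get("source_id") is not None]
    for r in rows:
        ruwe = r.get("ruwe")
        # 1 only when a RUWE value exists and is below the threshold.
        r["is_good_ruwe"] = int(ruwe is not None and ruwe < ruwe_max)
    return len(matched), len(rows)


# Toy catalog: three matched objects (one with high RUWE) and one unmatched.
rows = [
    {"source_id": 1, "ruwe": 1.02},
    {"source_id": 2, "ruwe": 2.70},
    {"source_id": None, "ruwe": None},
    {"source_id": 4, "ruwe": 1.31},
]
n_matched, n_total = match_stats(rows)
print(f"Gaia: {n_matched}/{n_total} objects matched.")
```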

In [3]:
pipeline.generate_and_enrich_catalog(
    enrich_gaia=True,
    mode='bulk',
    include_risky=False,   # set to True to enable radius/mass/age (beta, still being tested)
    ruwe_max=1.4           # optional RUWE cut; omit it to also keep high-RUWE entries (RUWE < 1.4 is a good filter)
)


=== STEP 2: CATALOG GENERATION AND ENRICHMENT ===


Header extraction: 100%|██████████| 5000/5000 [00:54<00:00, 92.57file/s]


[OK] Catalog written: C:\Users\alexb\Documents\Google_Cloud\alex_labs_google_sprint\astro_spectro_git\data\catalog\master_catalog_temp.csv  (5000 rows)
  > Local catalog of 5000 spectra created.
  > Attempting cross-match in 'bulk' mode…
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
  > Gaia: 4350/5000 objects matched.


<a id="step-3bis-features"></a>
## 3bis) Processing & feature extraction

Runs the **spectral preprocessing** and **feature extraction**.  
A `features_YYYYMMDDTHHMMSSZ.csv` CSV is written to `processed/`.
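The `YYYYMMDDTHHMMSSZ` pattern is a compact UTC timestamp. A minimal sketch of how such a file name can be built with the standard library follows; `features_filename` is a hypothetical helper, not necessarily how the pipeline names its files internally.

```python
from datetime import datetime, timezone


def features_filename(now=None):
    """Build a features_<UTC timestamp>.csv name, e.g. features_20250823T000250Z.csv."""
    now = now or datetime.now(timezone.utc)
    # Convert to UTC first so the trailing "Z" is truthful.
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"features_{stamp}.csv"


example = features_filename(datetime(2025, 8, 23, 0, 2, 50, tzinfo=timezone.utc))
```

Because the fields go from year to second, sorting these names alphabetically also sorts them chronologically.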

- Updates `pipeline.features_df` (in memory) & `pipeline.last_features_path` (on disk).
- Features = photometric/astrometric measurements, line indices, spectral-neighborhood summaries, etc.

> 💡 Some internal steps rely on line detection/association (Balmer, Ca II H/K, Mg_b, Na_D).
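To give a feel for what a line index such as an equivalent width measures, here is a toy calculation on a synthetic line. It uses the textbook definition (the integral of `1 - flux/continuum` over wavelength) with a trapezoidal rule. The function name, the toy spectrum, and the sign convention are assumptions, since the notebook does not state how the pipeline computes its `eq_width` features.

```python
def equivalent_width(wavelengths, flux, continuum):
    """Trapezoidal integral of (1 - flux/continuum) over wavelength.

    With this convention the result is positive for absorption and
    negative for emission; the pipeline's own convention may differ.
    """
    depth = [1.0 - f / c for f, c in zip(flux, continuum)]
    total = 0.0
    for i in range(len(wavelengths) - 1):
        step = wavelengths[i + 1] - wavelengths[i]
        total += 0.5 * (depth[i] + depth[i + 1]) * step
    return total


# Toy absorption line near H-alpha on a flat continuum of 1.0 (wavelengths in Angstrom).
wl = [6560.0, 6561.0, 6562.0, 6563.0, 6564.0, 6565.0, 6566.0]
fx = [1.0, 1.0, 0.5, 0.5, 0.5, 1.0, 1.0]
cont = [1.0] * len(wl)
ew = equivalent_width(wl, fx, cont)  # result is in the same unit as wl
```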

In [4]:
pipeline.process_data()


=== STEP 3: DATA PROCESSING AND FEATURE EXTRACTION ===

--- Starting the processing pipeline for 5000 spectra ---


Processing spectra: 100%|██████████| 5000/5000 [02:07<00:00, 39.25it/s]


  > Derived Gaia features added: 22 columns
  > delta_ms added (poly deg=3).
  > Creating photometric color features...

Processing pipeline finished. 5000 spectra processed and enriched.

  > Feature dataset saved to: features_20250823T000250Z.csv


Unnamed: 0,file_path,feature_Hα_prominence,feature_Hα_fwhm,feature_Hα_eq_width,feature_Hβ_prominence,feature_Hβ_fwhm,feature_Hβ_eq_width,feature_CaIIK_prominence,feature_CaIIK_fwhm,feature_CaIIK_eq_width,...,flux_bp_g_ratio_log10,bp_rp_excess_dev,is_good_ruwe,has_astrom_excess,is_variable_flag,parallax_missing,distance_gspphot_missing,delta_ms,feature_color_gr,feature_color_ri
0,M31_011N40_B1/spec-55863-M31_011N40_B1_sp07-24...,0.310306,21.930589,5.976233,0.575739,32.819931,-5.660850,0.806614,13.531117,9.342789,...,-0.264246,0.188488,1,0,0,0,0,-0.132822,0.38,-0.01
1,M6201/spec-55862-M6201_sp03-173.fits.gz,0.179503,0.000000,-67.941815,0.277713,25.812926,19.636811,0.818638,28.150925,39.168190,...,-0.685154,0.330248,1,0,0,0,0,0.833385,1.52,1.23
2,B6301/spec-55863-B6301_sp03-047.fits.gz,0.312114,22.122069,6.578693,0.651384,32.666060,-6.994321,0.962405,4.074939,8.758194,...,-0.265885,0.187768,1,0,0,0,0,-0.018717,0.45,0.13
3,M31_011N40_B1/spec-55863-M31_011N40_B1_sp02-16...,0.000000,0.000000,0.000000,0.604770,26.177773,18.896551,1.707828,27.679161,37.511276,...,,,0,0,1,1,1,,1.32,0.62
4,F5907/spec-55859-F5907_sp09-180.fits.gz,0.303846,24.371396,9.729840,0.821943,37.674891,-3.393420,1.154643,34.300582,-10.046530,...,-0.247835,0.181181,1,1,0,0,0,0.293726,0.33,0.12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,B6001/spec-55860-B6001_sp07-087.fits.gz,0.404938,31.109652,-5.313483,0.365336,23.174658,4.421275,0.362522,27.606540,36.118685,...,-0.426021,0.231465,1,0,0,0,0,-6.760314,1.15,0.46
4996,GAC_060N28_B1/spec-55863-GAC_060N28_B1_sp07-06...,0.407344,24.535665,10.504724,0.803855,33.764777,-7.111623,2.119339,34.251186,-9.844155,...,-0.278502,0.198263,1,0,0,0,0,0.114821,0.45,0.16
4997,B6301/spec-55863-B6301_sp13-165.fits.gz,0.295404,,1.626657,0.638900,31.446162,-8.367735,0.831999,23.092219,19.325186,...,-0.301794,0.193900,1,0,0,0,0,-0.152355,0.63,0.12
4998,M5904/spec-55859-M5904_sp02-072.fits.gz,0.667428,,-0.043611,1.122903,51.141386,3.177288,5.294030,29.118972,59.116286,...,-0.569623,0.337275,1,1,0,0,0,1.312830,1.40,0.82


<a id="step-4-train"></a>
## 4) Training the model

Trains a **classifier** (RF/XGBoost) with optional **feature selection** (`SelectFromModel`, threshold `"median"` by default), then **evaluates** and **logs** the run.

- Summary: number of features kept, scores, reports, artifact paths.
- In a notebook: dedicated UI via `pipeline.interactive_training_runner()`.

> 💡 **Tip**: start with RF for quick feedback, then move to XGB to gain performance.
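The `"median"` threshold means, roughly, "keep the features whose importance is at least the median importance". The standard-library sketch below illustrates that rule only. It does not call scikit-learn, `select_by_median` is a hypothetical helper, and the importance values are made-up toy numbers, not results from this pipeline.

```python
from statistics import median


def select_by_median(importances):
    """Keep features whose importance is at least the median importance."""
    threshold = median(importances.values())
    return [name for name, value in importances.items() if value >= threshold]


# Toy importances (invented for illustration; they sum to 1.0).
importances = {
    "feature_CaIIK_eq_width": 0.30,
    "feature_CaIIK_fwhm": 0.05,
    "delta_ms": 0.40,
    "feature_color_gr": 0.20,
    "parallax_missing": 0.05,
}
kept = select_by_median(importances)
```

With a median cut, about half of the features survive regardless of how many there are, which is why it makes a convenient default before any finer tuning.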

In [5]:
pipeline.interactive_training_runner()

VBox(children=(HBox(children=(Dropdown(description='Model:', options=('XGBoost', 'RandomForest', 'SVM'), valu…
