# 🚀 Master Notebook – DR5 Spectroscopy Pipeline

This notebook orchestrates the **complete workflow** for training a classifier on LAMOST DR5 spectra:  
**batch selection → catalog generation/enrichment → preprocessing + features → training → logs & artifacts.**

**Quick contents**
- [🧪 Step 0: SETUP & IMPORTS](#etape-0)
- [▶️ Run a full session](#run-full)
- [1) Downloading the spectra](#step-1-download)
- [2) Selecting the spectrum batch](#step-2-select)
- [3) Catalog: generation & Gaia enrichment](#step-3-catalog)
- [3bis) Processing & feature extraction](#step-3bis-features)
- [4) Model training](#step-4-train)

> Orchestrator used: **`MasterPipeline`**  
> Key methods: `select_batch`, `generate_and_enrich_catalog`, `process_data`, `run_training_session`, `run_full_pipeline`, `interactive_training_runner`.

<a id="etape-0"></a>


## 🧪 Step 0: SETUP & IMPORTS

Initializes the environment, creates/validates the directory tree
(`RAW_DATA_DIR`, `CATALOG_DIR`, `PROCESSED_DIR`, `MODELS_DIR`, `REPORTS_DIR`)
and instantiates **`MasterPipeline`**.

**Expected after execution:**
- a ready-to-use `pipeline` object,
- messages about the project root and, if configured, an attempt to connect to **Gaia**.

> ℹ️ **Interactive training UI**: available via `pipeline.interactive_training_runner()`.
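As a rough illustration of the "create/validate the directory tree" step, here is a minimal sketch. The helper `ensure_project_dirs` and the sub-folder names are assumptions made for this example (only the five key names come from this notebook); the real `setup_project_env` may behave differently.

```python
from pathlib import Path
import tempfile

# Hypothetical layout: the keys come from this notebook, the sub-paths are assumed.
DEFAULT_LAYOUT = {
    "RAW_DATA_DIR": "data/raw",
    "CATALOG_DIR": "data/catalog",
    "PROCESSED_DIR": "data/processed",
    "MODELS_DIR": "data/models",
    "REPORTS_DIR": "data/reports",
}


def ensure_project_dirs(project_root, layout=DEFAULT_LAYOUT):
    """Create each directory if it is missing and return a name -> path mapping."""
    root = Path(project_root)
    paths = {}
    for name, rel in layout.items():
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        paths[name] = str(path)
    return paths


# Demo in a throwaway directory so nothing is written to a real project.
with tempfile.TemporaryDirectory() as tmp:
    demo_paths = ensure_project_dirs(tmp)
    all_created = all(Path(p).is_dir() for p in demo_paths.values())
```

Because `mkdir` is called with `exist_ok=True`, running the setup cell several times is harmless.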

In [1]:
from utils import setup_project_env, load_env_vars
from pipeline.master import MasterPipeline
from astroquery.gaia import Gaia
from pipeline.classifier import SpectralClassifier

# Automatically initialize the environment and paths
paths = setup_project_env()

# Load the Gaia credentials from .env
env_vars = load_env_vars()

try:
    print("Attempting to connect to the Gaia archive...")
    Gaia.login(user=env_vars.get("GAIA_USER"), password=env_vars.get("GAIA_PASS"))
    print("Connected to Gaia successfully.")
except Exception as e:
    print(f"WARNING: Gaia connection failed ({e}). 'bulk' mode may fail.")

# Instantiate the master pipeline
pipeline = MasterPipeline(
    raw_data_dir=paths["RAW_DATA_DIR"],
    catalog_dir=paths["CATALOG_DIR"],
    processed_dir=paths["PROCESSED_DIR"],
    models_dir=paths["MODELS_DIR"],
    reports_dir=paths["REPORTS_DIR"],
)

print("\nSetup complete. You are ready to run your pipeline.")

[INFO] Project root detected: C:\Users\alexb\Documents\Google_Cloud\alex_labs_google_sprint\astro_spectro_git
[INFO] 'src' folder added to sys.path.
[INFO] Environment variables loaded from C:\Users\alexb\Documents\Google_Cloud\alex_labs_google_sprint\astro_spectro_git\.env
Attempting to connect to the Gaia archive...
INFO: Login to gaia TAP server [astroquery.gaia.core]
INFO: OK [astroquery.utils.tap.core]
INFO: Login to gaia data server [astroquery.gaia.core]
INFO: OK [astroquery.utils.tap.core]
Connected to Gaia successfully.

Setup complete. You are ready to run your pipeline.
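
The credentials used above are read from a `.env` file. The sketch below shows one plausible way to parse such a file and to check that both values are present before attempting a login. `parse_env_text` and `gaia_credentials` are hypothetical helpers written for this example; they are not the project's `load_env_vars`.

```python
def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def gaia_credentials(env):
    """Return (user, password), or None when either value is missing."""
    user, password = env.get("GAIA_USER"), env.get("GAIA_PASS")
    if not user or not password:
        return None
    return user, password


# Placeholder values, for illustration only
sample = """
# Gaia archive account
GAIA_USER=jdoe
GAIA_PASS="s3cret"
"""
creds = gaia_credentials(parse_env_text(sample))
missing = gaia_credentials(parse_env_text("GAIA_USER=jdoe"))
```

Checking for `None` first makes it possible to skip the `Gaia.login` call, and warn the user, when the `.env` file is incomplete.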



---

<a id="run-full"></a>
## ▶️ Run a full session

Runs **the whole pipeline end to end**:
`select_batch → generate_and_enrich_catalog → process_data → run_training_session`.

> 💡 **Recommended starting parameters**: `batch_size=200–500`, `n_estimators=200–400` (RF/XGB).  
> Enable `enrich_gaia=True` once connectivity is stable.

In [None]:
pipeline.run_full_pipeline(
    batch_size=500,                 # batch size
    model_type="RandomForest",      # "RandomForest" or "XGBoost"
    n_estimators=100,               # number of trees in the final model
    prediction_target="main_class", # e.g. "main_class", "sub_class_top25", "sub_class_bins"
    save_and_log=True,              # save the model + JSON report
    enrich_gaia=False,              # set to True to enable Gaia
    # ...Gaia kwargs if enrich_gaia=True
)
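
To make the chaining explicit, here is a minimal sketch of the four stages called in the documented order. `run_steps` and `RecordingPipeline` are hypothetical names used only for this illustration, and the keyword arguments passed to `run_training_session` are assumed; the real `run_full_pipeline` may do more (logging, error handling).

```python
def run_steps(pipeline, batch_size, enrich_gaia=False, **training_kwargs):
    """Call the four stages in the order listed in this section."""
    pipeline.select_batch(batch_size=batch_size, strategy="random")
    pipeline.generate_and_enrich_catalog(enrich_gaia=enrich_gaia)
    pipeline.process_data()
    # Assumption: training options are forwarded as keyword arguments
    return pipeline.run_training_session(**training_kwargs)


class RecordingPipeline:
    """Stand-in object that only records which stage was called."""

    def __init__(self):
        self.calls = []

    def select_batch(self, **kwargs):
        self.calls.append("select_batch")

    def generate_and_enrich_catalog(self, **kwargs):
        self.calls.append("generate_and_enrich_catalog")

    def process_data(self):
        self.calls.append("process_data")

    def run_training_session(self, **kwargs):
        self.calls.append("run_training_session")
        return "done"


fake = RecordingPipeline()
result = run_steps(fake, batch_size=500, model_type="RandomForest", n_estimators=100)
```

Running the stages one by one, as the following sections do, is useful when a single stage needs to be repeated, for instance the Gaia enrichment.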


---

<a id="step-1-download"></a>
## 1) Downloading the spectra

Uses the wrapped **`dr5_downloader.py`** script.  
This step lives in a separate notebook, **[01_download_spectra.ipynb](./01_download_spectra.ipynb)** (run it as needed).

> ⚠️ **Quota / time**: depending on the requested volume, the download can take a long time.


<a id="step-2-select"></a>
## 2) Selecting the spectrum batch

Picks a **new batch** of `.fits.gz` files to process, without reusing spectra already recorded in the log.

- `batch_size`: number of spectra,
- `strategy`: e.g. `"random"`.

The selector relies on **DatasetBuilder** to guarantee that samples are unique.
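
The selection rule can be sketched as a set difference followed by random sampling. This is only an illustration under assumptions: `select_new_batch` is a hypothetical function, the file names are invented, and the real `DatasetBuilder` may handle the log and the error cases differently.

```python
import random


def select_new_batch(all_files, used_files, batch_size, seed=None):
    """Randomly pick `batch_size` files that are not yet listed in the usage log."""
    used = set(used_files)
    # Sort the candidates so that a given seed always yields the same batch
    candidates = sorted(f for f in set(all_files) if f not in used)
    if batch_size > len(candidates):
        raise ValueError(
            f"Only {len(candidates)} new spectra available, {batch_size} requested."
        )
    rng = random.Random(seed)
    return rng.sample(candidates, batch_size)  # sampling without replacement


# Invented file names, for illustration only
all_files = [f"plan/spec-{i:03d}.fits.gz" for i in range(10)]
used_files = all_files[:4]
batch = select_new_batch(all_files, used_files, batch_size=3, seed=0)
```

Sampling without replacement from the unused files guarantees that a batch contains no duplicates and no spectrum that has already been logged.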


In [2]:
pipeline.select_batch(batch_size=3000, strategy="random")


=== STEP 1: SELECTING A NEW BATCH ===
--- Building a new training batch ---
  > 43392 spectra found in 'C:\Users\alexb\Documents\Google_Cloud\alex_labs_google_sprint\astro_spectro_git\data\raw'
  > 13000 spectra already used in the log.
  > 30392 **new** spectra available.
  > Random selection of 3000 spectra.


['M5901/spec-55859-M5901_sp01-195.fits.gz',
 'M6201/spec-55862-M6201_sp11-115.fits.gz',
 'GAC_122N29_B1/spec-55874-GAC_122N29_B1_sp01-066.fits.gz',
 'GAC_060N28_B1/spec-55863-GAC_060N28_B1_sp16-109.fits.gz',
 'GAC_105N29_B1/spec-55863-GAC_105N29_B1_sp13-014.fits.gz',
 'GAC_060N28_B1/spec-55863-GAC_060N28_B1_sp08-119.fits.gz',
 'M6201/spec-55862-M6201_sp06-147.fits.gz',
 'F5907/spec-55859-F5907_sp10-064.fits.gz',
 'B6301/spec-55863-B6301_sp09-034.fits.gz',
 'M6201/spec-55862-M6201_sp13-011.fits.gz',
 'GAC_105N29_B1/spec-55863-GAC_105N29_B1_sp07-163.fits.gz',
 'M6201/spec-55862-M6201_sp05-081.fits.gz',
 'M6201/spec-55862-M6201_sp05-003.fits.gz',
 'B7401/spec-55874-B7401_sp07-140.fits.gz',
 'GAC_105N29_B1/spec-55863-GAC_105N29_B1_sp14-019.fits.gz',
 'M6201/spec-55862-M6201_sp06-216.fits.gz',
 'M5901/spec-55859-M5901_sp07-112.fits.gz',
 'B6001/spec-55860-B6001_sp07-049.fits.gz',
 'B6001/spec-55860-B6001_sp01-155.fits.gz',
 'M5901/spec-55859-M5901_sp07-095.fits.gz',
 'M31_011N40_M1/spec-558


<a id="step-3-catalog"></a>
## 3) Catalog: generation & Gaia enrichment

From the current batch, builds a **local catalog** (CSV) and can **enrich it via Gaia** (positions, photometry…).

**Outputs:**
- `master_catalog_temp.csv` (local catalog), then `master_catalog_gaia.csv` if enriched,
- an updated `pipeline.master_catalog_df`.

> ℹ️ **Gaia coupling**: handled by the orchestrator (cross-matching + stats).  
> ⚠️ **Connection**: if Gaia authentication fails, rerun without `enrich_gaia` or check your credentials.
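
For intuition, here is a small local sketch of a positional cross-match followed by a RUWE cut. It is not the orchestrator's implementation (the `bulk` mode queries the Gaia archive): the function names, the 1 arcsec radius, the row layout and the sample coordinates are all assumptions made for this example.

```python
import math


def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds; inputs are in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, numerically stable for small separations
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


def crossmatch(local_rows, gaia_rows, radius_arcsec=1.0, ruwe_max=None):
    """Attach the nearest Gaia source within the radius; optionally drop high-RUWE matches."""
    matched = []
    for row in local_rows:
        best, best_sep = None, radius_arcsec
        for g in gaia_rows:
            sep = angular_sep_arcsec(row["ra"], row["dec"], g["ra"], g["dec"])
            if sep <= best_sep:
                best, best_sep = g, sep
        if best is None:
            continue  # no Gaia source close enough
        if ruwe_max is not None and best["ruwe"] > ruwe_max:
            continue  # assumed behaviour: reject poor astrometric solutions
        matched.append({**row, "source_id": best["source_id"],
                        "ruwe": best["ruwe"], "sep_arcsec": best_sep})
    return matched


# Invented positions, for illustration only
local_rows = [
    {"file": "a.fits.gz", "ra": 10.0, "dec": 20.0},
    {"file": "b.fits.gz", "ra": 50.0, "dec": -5.0},
    {"file": "c.fits.gz", "ra": 120.0, "dec": 30.0},
]
gaia_rows = [
    {"source_id": 1, "ra": 10.0, "dec": 20.0001, "ruwe": 1.0},
    {"source_id": 2, "ra": 50.0, "dec": -5.0001, "ruwe": 2.3},
]
all_matches = crossmatch(local_rows, gaia_rows)
clean_matches = crossmatch(local_rows, gaia_rows, ruwe_max=1.4)
match_rate = len(all_matches) / len(local_rows)
```

The nested loop is quadratic and only suitable for tiny examples; for thousands of spectra a spatial index or a server-side match is the practical choice.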

In [3]:
pipeline.generate_and_enrich_catalog(
    enrich_gaia=True,
    mode='bulk',
    include_risky=False,   # <-- True enables radius/mass/age (beta, still being tested)
    ruwe_max=1.4           # optional: leaving it out also keeps high-RUWE entries; RUWE < 1.4 is a good filter
)


=== STEP 2: CATALOG GENERATION AND ENRICHMENT ===


Extracting headers: 100%|██████████| 3000/3000 [00:37<00:00, 79.56file/s]


[OK] Catalog written: C:\Users\alexb\Documents\Google_Cloud\alex_labs_google_sprint\astro_spectro_git\data\catalog\master_catalog_temp.csv  (3000 rows)
  > Local catalog of 3000 spectra created.
  > Attempting cross-match in 'bulk' mode…
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
INFO: Query finished. [astroquery.utils.tap.core]
  > Gaia: 2651/3000 objects matched.



<a id="step-3bis-features"></a>
## 3bis) Processing & feature extraction

Runs the **spectral preprocessing** and the **feature extraction**.  
A `features_YYYYMMDDTHHMMSSZ.csv` file is written to `processed/`.

- Updates `pipeline.features_df` (in memory) & `pipeline.last_features_path` (on disk).
- Features = photometric/astrometric measurements, line indices, spectral-neighborhood summaries…

> 💡 Some internal steps rely on line detection/association (Balmer, Ca II H/K, Mg_b, Na_D).
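
As an example of a line index, the sketch below computes an equivalent width, W = ∫ (1 − F/F_c) dλ, with the trapezoidal rule on a synthetic absorption line near Hα (6562.8 Å). The function name, the flat continuum, the integration window and the sign convention (positive for absorption) are assumptions; the pipeline's `*_eq_width` features may be defined differently.

```python
def equivalent_width(wavelengths, fluxes, continuum, window):
    """Trapezoidal integral of (1 - F / F_c) over [lo, hi]; positive for absorption."""
    lo, hi = window
    points = [
        (w, 1.0 - f / continuum)
        for w, f in zip(wavelengths, fluxes)
        if lo <= w <= hi
    ]
    ew = 0.0
    for (w0, d0), (w1, d1) in zip(points, points[1:]):
        ew += 0.5 * (d0 + d1) * (w1 - w0)
    return ew


# Synthetic spectrum: flat continuum at 1.0 with a box-shaped dip of depth 0.5
wavelengths = [6550.0 + i for i in range(26)]  # 6550 .. 6575 Å, 1 Å steps
fluxes = [0.5 if 6558.0 <= w <= 6567.0 else 1.0 for w in wavelengths]
ew_halpha = equivalent_width(wavelengths, fluxes, continuum=1.0, window=(6550.0, 6575.0))
```

With this convention an emission line gives a negative value, which is one possible reading of the mixed signs seen in equivalent-width columns.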

In [4]:
pipeline.process_data()


=== STEP 3: DATA PROCESSING AND FEATURE EXTRACTION ===

--- Starting the processing pipeline for 3000 spectra ---


Processing spectra: 100%|██████████| 3000/3000 [01:32<00:00, 32.28it/s]


  > Creating photometric color features...

Processing pipeline finished. 3000 spectra processed and enriched.

  > Feature dataset saved to: features_20250820T020828Z.csv
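
The file name above follows the `features_YYYYMMDDTHHMMSSZ.csv` pattern, i.e. a compact UTC timestamp. A minimal sketch of how such a name can be produced is shown below; `features_filename` is a hypothetical helper, not necessarily what the pipeline uses.

```python
from datetime import datetime, timezone


def features_filename(now=None):
    """Build a features_YYYYMMDDTHHMMSSZ.csv name from a UTC timestamp."""
    if now is None:
        now = datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"features_{stamp}.csv"


example = features_filename(datetime(2025, 8, 20, 2, 8, 28, tzinfo=timezone.utc))
```

Timestamped names sort chronologically and avoid overwriting the features produced by an earlier run.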


Unnamed: 0,file_path,feature_Hα_prominence,feature_Hα_fwhm,feature_Hα_eq_width,feature_Hβ_prominence,feature_Hβ_fwhm,feature_Hβ_eq_width,feature_CaIIK_prominence,feature_CaIIK_fwhm,feature_CaIIK_eq_width,...,pmra,pmdec,radial_velocity,distance_gspphot,astrometric_excess_noise,phot_variable_flag,bp_g,g_rp,feature_color_gr,feature_color_ri
0,M5901/spec-55859-M5901_sp01-195.fits.gz,0.897656,0.000000,-25.007062,1.584174,22.257277,14.582983,2.112454,30.612651,34.735588,...,-6.877512,-1.312645,,547.982178,0.000000,NOT_AVAILABLE,1.093296,1.025904,1.60,0.79
1,M6201/spec-55862-M6201_sp11-115.fits.gz,0.394714,0.000000,-10.810985,1.051613,23.918516,12.584038,5.413777,,-7.816648,...,7.078463,-5.884965,,809.528625,0.354053,NOT_AVAILABLE,1.271902,1.089396,1.54,1.01
2,GAC_122N29_B1/spec-55874-GAC_122N29_B1_sp01-06...,0.000000,0.000000,0.000000,0.253845,28.783282,5.656627,1.099345,27.254958,31.305362,...,-2.308271,-1.709090,,957.626526,0.000000,NOT_AVAILABLE,0.782026,0.862169,1.25,0.50
3,GAC_060N28_B1/spec-55863-GAC_060N28_B1_sp16-10...,0.413267,19.550819,5.444872,0.683954,,-0.373323,1.048922,,5.901286,...,-1.211095,-3.415593,-17.403130,977.754578,0.000000,NOT_AVAILABLE,0.589247,0.745384,0.78,0.50
4,GAC_105N29_B1/spec-55863-GAC_105N29_B1_sp13-01...,0.612846,22.183278,7.211732,1.367968,31.939313,-8.790034,3.679062,34.434944,-8.373902,...,0.079957,-2.774574,,1870.028442,0.073624,NOT_AVAILABLE,0.281265,0.443264,0.34,0.10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,M5904/spec-55859-M5904_sp01-016.fits.gz,0.635621,29.834264,-8.458881,0.954457,24.406084,17.638279,7.796431,26.016579,43.957247,...,-0.530957,-2.156844,,624.186096,0.594187,VARIABLE,1.314123,1.182482,1.92,0.98
2996,M5901/spec-55859-M5901_sp14-066.fits.gz,0.210758,28.733648,-3.498011,0.506400,24.435788,-4.875719,1.652133,26.057355,31.594651,...,-2.359372,-4.947338,,1528.496460,0.000000,NOT_AVAILABLE,0.528336,0.659740,0.78,0.29
2997,M6201/spec-55862-M6201_sp03-198.fits.gz,0.363929,24.131413,6.541094,0.831629,29.249404,-9.581387,2.508117,,6.711009,...,-4.742670,-1.496797,,1973.120483,0.000000,NOT_AVAILABLE,0.362835,0.531857,0.53,0.13
2998,GAC_122N29_B1/spec-55874-GAC_122N29_B1_sp14-04...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.267317,26.967935,34.074897,...,-7.056076,-34.412493,24.168312,170.580307,0.000000,NOT_AVAILABLE,1.013433,0.997933,1.44,0.85



<a id="step-4-train"></a>
## 4) Model training

Trains a **classifier** (RF/XGBoost) with optional **feature selection** (`SelectFromModel`, `"median"` threshold by default), then **evaluates** and **logs** the run.

- Summary: number of features kept, scores, reports, artifact paths.
- In a notebook: dedicated UI via `pipeline.interactive_training_runner()`.

> 💡 **Tip**: start with RF for quick feedback, then switch to XGB for better performance.
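
To show what a `"median"` threshold means in practice, the sketch below mimics the rule in plain Python: a feature is kept when its importance is greater than or equal to the median importance, which is the rule scikit-learn documents for `SelectFromModel`. The helper `select_by_median` is hypothetical and the importance values are made up; in the pipeline they would come from the fitted model.

```python
from statistics import median


def select_by_median(importances):
    """Keep the features whose importance is >= the median importance."""
    threshold = median(importances.values())
    return [name for name, value in importances.items() if value >= threshold]


# Made-up importance values, for illustration only
importances = {
    "feature_CaIIK_eq_width": 0.30,
    "feature_CaIIK_fwhm": 0.25,
    "bp_g": 0.20,
    "g_rp": 0.15,
    "pmra": 0.06,
    "pmdec": 0.04,
}
kept = select_by_median(importances)
```

With an even number of features and distinct importances, this rule keeps exactly the top half, which roughly halves the input dimension before the final fit.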

In [5]:
pipeline.interactive_training_runner()

VBox(children=(HBox(children=(Dropdown(description='Model:', options=('XGBoost', 'RandomForest', 'SVM'), valu…
