# EXERCISES — SDSS Spectra Classification with Transformer Encoders (Binary & Multiclass)
Generated on 2025-10-28T18:01:56.090282Z

> Exercises-only notebook. Read the guidance, then complete the TODO blocks. No solutions are included.


## Objective

You will:
- Acquire a tiny working set of SDSS spectra using **astroquery** and the **Nair & Abraham (2010)** catalog (Vizier).
- Prepare train/val/test arrays for **binary** and **multiclass** classification.
- Build a **Transformer Encoder** model for spectra.
- Train, evaluate, and analyze results (ROC, PR, confusion matrix, macro-F1, calibration hints).

This notebook is *exercises-only*: there are **no solutions**. Follow the guidance and fill the TODOs.



### 0. Environment Notes

- Internet may be required for the Vizier/SDSS steps. If not available, skip the download cells and place your prepared files under `./spectra` and `./catalog` before continuing.
- Recommended packages: `astroquery`, `astropy`, `numpy`, `pandas`, `torch`, `matplotlib`, `scikit-learn`.


In [None]:

# 0. Create directories if they do not exist
import os
os.makedirs('spectra', exist_ok=True)
os.makedirs('images', exist_ok=True)       # not used in this notebook, but kept for symmetry
os.makedirs('catalog', exist_ok=True)
os.makedirs('intermediate', exist_ok=True)
print("Directories ready.")



### 1. Download Nair & Abraham (2010) catalog (Vizier)

If `astroquery` is not installed, uncomment the pip line below.


In [None]:

# Install astroquery if needed (uncomment in your environment)
# !pip install astroquery


In [None]:

from astroquery.vizier import Vizier
import pandas as pd
import os

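# Remove the default row limit (50 rows). The limit has to be passed to the
# instance created below: setting Vizier.ROW_LIMIT on the module-level object
# does not carry over to a new Vizier(...) instance.

# Query the Nair & Abraham (2010) catalog
catalog_id = 'J/ApJS/186/427'
vizier = Vizier(columns=['*'], row_limit=-1)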
result = vizier.get_catalogs(catalog_id)

# Convert to pandas DataFrame
catalog_df = result[0].to_pandas()

# Save to CSV for later use
os.makedirs('./catalog', exist_ok=True)
catalog_df.to_csv('./catalog/nair_abraham_2010.csv', index=False)

print(f"Downloaded {len(catalog_df)} galaxies from Nair & Abraham (2010) catalog")
print(f"Available columns: {list(catalog_df.columns)[:12]} ...")
catalog_df.head()



### 2. Query SDSS for spectra (tiny sample)

We cone-search around RA/Dec for a handful of sources and download the corresponding FITS spectra.

If you already have spectra, skip this and ensure FITS files live under `./spectra/`.


In [None]:

from astroquery.sdss import SDSS
from astropy.io import fits
import pandas as pd
import os
from astropy import coordinates as coords
from astropy import units as u

# Directory to save spectra
os.makedirs('./spectra', exist_ok=True)

# Example: Search for spectra for the first 10 sources using RA/Dec cone search
# Use the correct column names from Nair & Abraham catalog: '_RA' and '_DE'
rows = catalog_df.dropna(subset=['_RA', '_DE']).head(10)

print(f"Searching for SDSS spectra for {len(rows)} galaxies...")

spec_matches = []
for idx, row in rows.iterrows():
    ra = row['_RA']
    dec = row['_DE']
    sdss_name = row.get('SDSS', f'nair_{idx}')  # SDSS identifier if present
    
    source_id = f"nair_{idx}"
    try:
        pos = coords.SkyCoord(ra, dec, unit='deg')
        # Query for spectroscopic objects within 3 arcsec
        spec_query = SDSS.query_region(pos, radius=3*u.arcsec, spectro=True)
        
        if spec_query is not None and len(spec_query) > 0:
            # Take the first returned match (not guaranteed to be the closest one)
            match = spec_query[0]
            plate = match['plate']
            fiberID = match['fiberID']
            mjd = match['mjd']
            specobjid = match['specobjid']
            
            fits_path = f'./spectra/spec-{specobjid}.fits'
            
            # Skip if already on disk
            if os.path.exists(fits_path):
                print(f"Spectrum exists, skipping: {fits_path}")
            else:
                sp = SDSS.get_spectra(plate=plate, fiberID=fiberID, mjd=mjd)
                if sp:
                    sp[0].writeto(fits_path, overwrite=True)
                    print(f"Downloaded: {fits_path} for {sdss_name} at RA={ra:.6f}, Dec={dec:.6f}")
                else:
                    print(f"Failed to download spectrum for {sdss_name}")
                    continue
            
            spec_matches.append({
                'source_id': source_id,
                'sdss_name': sdss_name,
                'ra': ra, 'dec': dec,
                'specobjid': specobjid,
                'spectrum_file': fits_path,
                'plate': plate, 'fiber': fiberID, 'mjd': mjd
            })
        else:
            print(f"No spectrum found near {sdss_name} at RA={ra:.6f}, Dec={dec:.6f}")
    except Exception as e:
        print(f"Error for {sdss_name} at RA={ra:.6f}, Dec={dec:.6f}: {e}")

# Save spectroscopic matches
if spec_matches:
    spec_df = pd.DataFrame(spec_matches)
    os.makedirs('./intermediate', exist_ok=True)
    spec_df.to_csv('./intermediate/sdss_spectra_matches.csv', index=False)
    print(f"\nFound {len(spec_matches)} sources with SDSS spectra")
    print(spec_df.head())  # a bare expression inside an if-block is not displayed by the notebook
else:
    print("\nNo spectroscopic matches found")



### 3. Parse FITS spectra → numpy arrays

Goal: build arrays `wavelength`, `flux`, optional `ivar`, and labels.
For this exercise, you will **define your own labels** for a binary and a multiclass task.
Suggestions:
- Binary: strong emission vs. weak/no emission (use a heuristic on the Hα region)
- Multiclass: 3–4 bins by spectral features or by a metadata column if available
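
One possible starting point for the binary heuristic is sketched below. It is an illustration under stated assumptions, not the intended solution: the function names, the window widths, and the threshold of 1.5 are hypothetical choices, and the redshift `z` is assumed to be available for each spectrum. SDSS wavelengths are vacuum wavelengths, so the vacuum Hα value is used.

```python
import numpy as np

HALPHA_VAC = 6564.6  # rest-frame H-alpha in Angstrom (vacuum)

def halpha_strength(wave, flux, z=0.0, half_width=15.0, cont_offset=60.0):
    """Mean flux in a window around H-alpha divided by the median flux in two side bands.

    Window sizes are illustrative assumptions, not tuned values.
    """
    rest = np.asarray(wave) / (1.0 + z)            # observed -> rest frame
    dist = np.abs(rest - HALPHA_VAC)
    line = dist <= half_width
    side = (dist > cont_offset) & (dist <= cont_offset + 2 * half_width)
    if line.sum() == 0 or side.sum() == 0:
        return np.nan                              # line not covered by this spectrum
    continuum = np.median(flux[side])
    if not np.isfinite(continuum) or continuum <= 0:
        return np.nan
    return float(np.mean(flux[line]) / continuum)

def binary_label(strength, threshold=1.5):
    """1 = strong emission, 0 = weak/no emission. Drop NaN strengths before calling."""
    return int(strength >= threshold)

# Synthetic check: a flat continuum and the same continuum plus a Gaussian emission line
wave = np.linspace(6000.0, 7000.0, 2001)
flat = np.ones_like(wave)
emitter = flat + 5.0 * np.exp(-0.5 * ((wave - HALPHA_VAC) / 3.0) ** 2)

s_flat = halpha_strength(wave, flat)
s_emit = halpha_strength(wave, emitter)
print(s_flat, s_emit, binary_label(s_flat), binary_label(s_emit))
```

For the multiclass task, the same strength value could be cut into 3–4 bins (for example with `np.digitize`), although bin edges chosen this way are arbitrary and worth inspecting on a histogram first.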



In [None]:

# TODO: Implement FITS parsing and heuristic labeling
# - Walk through ./spectra/*.fits
# - Extract wavelength, flux (and ivar if present)
# - Build small arrays for train/val/test splits
# - Define labels for binary and multiclass tasks
# - Save to npz under ./intermediate/
#
# Sketch (left commented out on purpose; adapt it yourself):
# import glob
# import numpy as np
# from astropy.io import fits
# files = sorted(glob.glob("./spectra/*.fits"))
# for fp in files:
#     with fits.open(fp) as hdul:
#         data = hdul[1].data                  # HDU 1 holds the coadded spectrum in SDSS spec files
#         flux = data["flux"]
#         loglam = data["loglam"]              # SDSS stores log10(wavelength / Angstrom)
#         wave = 10 ** loglam
#         ivar = data["ivar"] if "ivar" in data.names else None
#         ... your processing ...
# Save: np.savez("./intermediate/spectra_train.npz", wavelength=..., flux=..., ivar=..., label_binary=..., label_multi=...)
pass
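
Individual spectra generally cover slightly different wavelength ranges, so they need a shared grid before they can be stacked into one array. The sketch below shows one simple way to do that, followed by a shuffled train/val/test split. The helper names `resample_to_grid` and `split_indices`, the grid limits, and the 70/15/15 fractions are assumptions made for illustration. Interpolating `ivar` linearly is a crude shortcut used here only to flag invalid bins.

```python
import numpy as np

def resample_to_grid(wave, flux, ivar, grid):
    """Interpolate one spectrum onto `grid` and return (normalized flux, validity mask)."""
    covered = (grid >= wave[0]) & (grid <= wave[-1])
    flux_i = np.interp(grid, wave, flux, left=0.0, right=0.0)
    ivar_i = np.interp(grid, wave, ivar, left=0.0, right=0.0)
    valid = covered & (ivar_i > 0)                 # bins outside coverage or with ivar == 0 are invalid
    scale = np.median(flux_i[valid]) if valid.any() else 1.0
    if not np.isfinite(scale) or scale == 0:
        scale = 1.0
    return np.where(valid, flux_i / scale, 0.0), valid

def split_indices(n, frac_train=0.7, frac_val=0.15, seed=0):
    """Shuffle indices 0..n-1 and cut them into train/val/test parts."""
    idx = np.random.default_rng(seed).permutation(n)
    n_train = int(round(frac_train * n))
    n_val = int(round(frac_val * n))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Synthetic demo: two constant spectra with different wavelength coverage
grid = np.logspace(np.log10(3900.0), np.log10(9000.0), 1024)
wave_a = np.linspace(3800.0, 9200.0, 3000)
wave_b = np.linspace(4500.0, 8000.0, 2000)
flux_a, valid_a = resample_to_grid(wave_a, np.full(3000, 2.0), np.ones(3000), grid)
flux_b, valid_b = resample_to_grid(wave_b, np.full(2000, 2.0), np.ones(2000), grid)

X = np.stack([flux_a, flux_b])       # shape (n_spectra, n_bins)
M = np.stack([valid_a, valid_b])     # boolean mask with the same shape
train_idx, val_idx, test_idx = split_indices(20)
print(X.shape, M.sum(axis=1), len(train_idx), len(val_idx), len(test_idx))
```

With only a handful of spectra, a random split can leave a class out of the validation or test set entirely, so a stratified split may be the safer choice once labels exist.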



### 4. Build a Transformer Encoder for spectra (skeleton)

Use a learnable positional embedding, mask invalid/padded bins, and produce a pooled embedding for classification.
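
As a hint for the overall shape of such a model, here is a minimal PyTorch sketch. It is one possible layout, not the expected answer: splitting the spectrum into fixed-size patches, the layer sizes, and folding the classification head into a single `nn.Linear` (instead of separate `BinaryHead`/`MultiHead` classes) are all simplifying assumptions. It expects a boolean `valid` mask per bin and assumes every spectrum has at least one valid patch.

```python
import torch
import torch.nn as nn

class SpectraTransformer(nn.Module):
    def __init__(self, n_bins=1024, patch=16, d_model=64, n_heads=4,
                 n_layers=2, n_classes=1, dropout=0.1):
        super().__init__()
        assert n_bins % patch == 0, "n_bins must be divisible by patch"
        self.patch = patch
        n_tokens = n_bins // patch
        self.embed = nn.Linear(patch, d_model)                      # one token per patch of bins
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, d_model))  # learnable positional embedding
        nn.init.normal_(self.pos, std=0.02)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            dropout=dropout, batch_first=True)
        # Nested tensors are disabled so the output keeps the full token length in eval mode
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers,
                                             enable_nested_tensor=False)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, flux, valid):
        # flux: (B, n_bins) float, valid: (B, n_bins) bool
        b = flux.shape[0]
        flux = flux * valid                                         # zero out invalid bins
        tokens = flux.reshape(b, -1, self.patch)
        tok_valid = valid.reshape(b, -1, self.patch).any(dim=-1)    # (B, T): patch has any valid bin
        x = self.embed(tokens) + self.pos
        x = self.encoder(x, src_key_padding_mask=~tok_valid)        # True in the mask = ignore token
        w = tok_valid.unsqueeze(-1).to(x.dtype)
        pooled = (x * w).sum(dim=1) / w.sum(dim=1).clamp(min=1.0)   # masked mean pooling
        return self.head(pooled)                                    # raw logits

# Tiny smoke test with random data; the last two patches are marked invalid
torch.manual_seed(0)
model = SpectraTransformer(n_bins=64, patch=8, d_model=16, n_heads=2,
                           n_layers=1, n_classes=3)
model.eval()
flux = torch.randn(4, 64)
valid = torch.ones(4, 64, dtype=torch.bool)
valid[:, -16:] = False
with torch.no_grad():
    logits = model(flux, valid)
print(logits.shape)
```

With `n_classes=1` the output can be squeezed and passed to `BCEWithLogitsLoss`; with `n_classes=K` it goes straight into `CrossEntropyLoss`.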


In [None]:

# TODO: Implement SpectraTransformer (encoder-only) and heads for binary/multiclass
# class SpectraTransformer(nn.Module): ...
# class BinaryHead(nn.Module): ...
# class MultiHead(nn.Module): ...
pass



### 5. Training & Evaluation

- Train binary classifier. Report ROC AUC, PR AUC, accuracy, F1.
- Tune a decision threshold on the validation set using the PR curve.
- Train multiclass classifier. Report macro-F1 and plot confusion matrix.
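
Threshold tuning on the PR curve can be sketched as follows, using plain NumPy so the mechanics stay visible. Picking the threshold that maximizes F1 is only one reasonable criterion, and the helper names and toy scores are made up for illustration. `sklearn.metrics.precision_recall_curve` provides the same curve in practice.

```python
import numpy as np

def pr_curve(y_true, scores):
    """Precision and recall for each distinct score used as threshold (predict 1 if score >= t)."""
    y_true = np.asarray(y_true).astype(int)
    scores = np.asarray(scores, dtype=float)
    thresholds = np.unique(scores)                 # ascending
    n_pos = max(int(y_true.sum()), 1)
    precision, recall = [], []
    for t in thresholds:
        pred = scores >= t
        tp = int(np.sum(pred & (y_true == 1)))
        precision.append(tp / max(int(pred.sum()), 1))
        recall.append(tp / n_pos)
    return np.array(precision), np.array(recall), thresholds

def best_f1_threshold(y_true, scores):
    """Return (threshold, F1) at the point of the PR curve with the highest F1."""
    p, r, t = pr_curve(y_true, scores)
    f1 = 2 * p * r / np.clip(p + r, 1e-12, None)
    i = int(np.argmax(f1))
    return float(t[i]), float(f1[i])

# Toy validation set: labels and predicted probabilities
y_val = [0, 0, 1, 0, 1, 1]
p_val = [0.1, 0.4, 0.35, 0.6, 0.8, 0.9]
thr, f1 = best_f1_threshold(y_val, p_val)
print(thr, f1)
```

The chosen threshold should then be frozen and applied unchanged to the test set; tuning it on test data would inflate the reported metrics.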


In [None]:

# TODO: Dataloaders, training loops, evaluation metrics, and plots
# Tips:
# - Use BCEWithLogitsLoss for binary, CrossEntropyLoss for multiclass
# - Consider class weights if imbalance is severe
# - Plot ROC, PR, confusion matrix
pass
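
For the multiclass report, the confusion matrix and macro-F1 can be computed by hand as a sanity check against whatever library you use. The sketch below is NumPy-only, and the helper names and toy labels are assumptions for illustration. Rows of the matrix are true classes and columns are predicted classes.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """cm[i, j] = number of samples with true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def macro_f1(cm):
    """Unweighted mean of per-class F1; classes with no support and no predictions count as 0."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    denom = 2 * tp + fp + fn
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1.0), 0.0)
    return float(f1.mean())

# Toy example with three classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
score = macro_f1(cm)
print(cm)
print(score)
```

Because macro-F1 weights every class equally, a rare class with poor recall pulls the score down noticeably, which is usually the desired behavior when classes are imbalanced.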



### 6. Optional: Calibration & Uncertainty (bonus)

- Implement temperature scaling on the validation logits to reduce overconfidence.
- Try MC dropout at test time to estimate predictive variance.
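
Temperature scaling divides the logits by a single scalar `T` fitted on validation data, which leaves the argmax (and hence accuracy) unchanged while softening the probabilities. The sketch below fits `T` with a simple grid search over the validation negative log-likelihood. The grid range, the helper names, and the synthetic logits are illustrative assumptions, and an optimizer such as L-BFGS on `log T` is a common alternative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)           # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Mean negative log-likelihood of the true labels at temperature T."""
    p = softmax(logits / T)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))

def fit_temperature(val_logits, val_labels, grid=None):
    """Pick the temperature on `grid` with the lowest validation NLL."""
    if grid is None:
        grid = np.linspace(0.25, 5.0, 96)          # step 0.05, includes T close to 1
    losses = [nll(val_logits, val_labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic validation set: calibrated logits, then inflated by 4x to mimic overconfidence
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=2000)
calibrated = rng.normal(size=(2000, 3)) + np.eye(3)[labels]
overconfident = 4.0 * calibrated

T = fit_temperature(overconfident, labels)
print(T, nll(overconfident, labels, 1.0), nll(overconfident, labels, T))
```

At test time the same `T` is applied to the test logits before the softmax (or sigmoid, for the binary model).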
