# 03 — Prepare Data 

**Purpose.** This notebook performs **data preprocessing only**. It takes the joined Kepler tables we built earlier and produces cleaned, standardized feature matrices and a reusable **preprocessing transformer**.

## TL;DR
- **Input:** `data/processed/features/table_v1.parquet`, `data/processed/labels/labels_v1.csv`, `data/processed/splits/split_v1.csv`
- **Process (train-only fit to avoid leakage):**
  1) **Impute** missing numeric values (median) and add missingness indicators  
  2) **Winsorize** extreme tails (clip at the 0.1% / 99.9% quantiles)  
  3) **Power-transform** skewed distributions (Yeo–Johnson; safe with zeros/negatives)  
  4) **Standardize** features (mean 0, std 1) for linear models  
- **Output:** 
  - `artifacts/preprocess_v1.pkl` — fitted preprocessing pipeline  
  - Clean matrices: `data/processed/features_proc/X_{train,val,test}.parquet`  
  - Targets/IDs: `data/processed/features_proc/y_{train,val,test}.csv`

## Why this structure?
This follows the Appendix B “Explore/Prepare the Data” guidance: work on **copies** of the data, **write functions/pipelines** for every transform so they are repeatable, fix/guard **outliers**, **impute** missing values, and **scale** features for algorithms that need it. We **fit the transformer on the train split only** and apply it to val/test to avoid leakage (cleaning choices must not “peek” at held-out data). 

## Inputs (what each file contains)
- **`table_v1.parquet`** — one row per `kepid` with astrophysical + detectability features already joined and cleaned at a basic level.  
- **`labels_v1.csv`** — `label_strict` and `label_lenient` built from DR25 counts and (optionally) KOI dispositions.  
- **`split_v1.csv`** — stratified `train/val/test` tags by (Teff-bin × label).
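
As a sketch of how the three inputs fit together, the join on `kepid` can be guarded so that duplicate or missing keys fail loudly instead of silently changing the row count. The toy frames and their values below are invented for illustration; only the column names `kepid`, `label_lenient`, and `split` come from this notebook.

```python
import pandas as pd

# Toy stand-ins for table_v1.parquet, labels_v1.csv, and split_v1.csv (values are made up).
features = pd.DataFrame({"kepid": [1, 2, 3], "teff": [5300.0, 6200.0, 5700.0]})
labels = pd.DataFrame({"kepid": [1, 2, 3], "label_lenient": [0, 1, 0]})
splits = pd.DataFrame({"kepid": [1, 2, 3], "split": ["train", "val", "test"]})

# validate="one_to_one" raises pandas.errors.MergeError if kepid repeats on either side.
merged = (
    features
    .merge(labels, on="kepid", validate="one_to_one")
    .merge(splits, on="kepid", validate="one_to_one")
)

# An inner join drops unmatched rows silently, so compare the row counts explicitly.
rows_lost = len(features) - len(merged)
print(rows_lost)
```

A non-zero `rows_lost` would mean that some `kepid` values are missing from the labels or splits file.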

## Transform details (with rationale)
- **Median imputation (+ indicator columns):** robust, simple, and lets models learn if “missingness” is predictive.  
- **Winsorization (0.1%/99.9%):** tames extreme tails/outliers without dropping rows.  
- **Yeo–Johnson power transform:** reduces skew for many features (e.g., CDPPs, radius) while handling zeros.  
- **Standardization:** brings features onto comparable scales—important for Logistic Regression and distance-based methods; harmless for trees.
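
To make the winsorization step concrete, here is a minimal NumPy sketch on synthetic data. The helper name `winsorize_bounds` and the injected outlier are assumptions for illustration, not part of the notebook's pipeline.

```python
import numpy as np

def winsorize_bounds(x, p_lo=0.001, p_hi=0.999):
    """Return (lo, hi) clip bounds taken from the given quantiles, ignoring NaNs."""
    return np.nanquantile(x, p_lo), np.nanquantile(x, p_hi)

rng = np.random.default_rng(42)
x = rng.lognormal(mean=5.0, sigma=1.0, size=10_000)  # right-skewed synthetic feature
x[0] = 1e9                                           # one absurd outlier

lo, hi = winsorize_bounds(x)
x_clipped = np.clip(x, lo, hi)  # values outside [lo, hi] are pulled to the bounds

# No rows are dropped; only the extreme tails are capped.
print(len(x_clipped) == len(x), x.max(), x_clipped.max())
```

In the real pipeline the bounds would be computed on the train split only and then reused for val/test.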

All steps are packaged in a single scikit-learn `Pipeline` wrapped in a `ColumnTransformer` so we can: (i) reuse it across notebooks, (ii) reproduce results easily, and (iii) treat preprocessing choices as tunable hyperparameters later if needed.
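
The sketch below shows, on a toy frame, how such a pipeline changes the column count: `SimpleImputer(add_indicator=True)` appends one indicator column per feature that had missing values at fit time. The column names `a`, `b`, `c` are invented, and the winsorization step is left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer, StandardScaler

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "a": rng.lognormal(size=200),    # right-skewed, strictly positive
    "b": rng.normal(size=200),       # roughly symmetric, includes negatives
    "c": rng.exponential(size=200),  # right-skewed
})
toy.loc[::10, "c"] = np.nan          # only column "c" has missing values

toy_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
    ("power", PowerTransformer(method="yeo-johnson", standardize=False)),
    ("scale", StandardScaler()),
])
toy_ct = ColumnTransformer([("num", toy_pipe, ["a", "b", "c"])], remainder="drop")

toy_out = toy_ct.fit_transform(toy)
print(toy_out.shape)  # (200, 4): 3 features + 1 missingness indicator for "c"
```

Note that the indicator column passes through the later steps too, so it is power-transformed and standardized along with the numeric features.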

## Leakage policy
- **Fit** the transformer on **train only**.  
- **Transform** train/val/test using that frozen transformer.  
This guarantees validation/test statistics reflect generalization, not “cleaning with future knowledge.”
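
A minimal illustration of the policy, using synthetic arrays in place of the real splits: the scaler's statistics come from train only, so the validation data is scaled with the train mean and standard deviation and will generally not come out with mean exactly 0.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=10.0, scale=2.0, size=(1000, 1))  # synthetic "train" split
X_val = rng.normal(loc=10.0, scale=2.0, size=(200, 1))     # synthetic "val" split

scaler = StandardScaler().fit(X_train)  # statistics are learned from train only
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)       # reuses the frozen train statistics

print(X_train_s.mean())  # numerically ~0 by construction
print(X_val_s.mean())    # near 0, but not forced to 0
```

Calling `fit` (or `fit_transform`) on val or test would be the leak: the held-out statistics would then shape the features the model is evaluated on.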

## Artifacts & how downstream code uses them
- **`preprocess_v1.pkl`**: load it with `joblib.load` and call `.transform(X)` on a frame with the same feature columns in any training notebook.  
- **`X_*.parquet` / `y_*.csv`**: ready-to-fit matrices/labels for quick baselines.
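
A sketch of the save/load round trip, using a tiny stand-in transformer and a temporary file (the name `preprocess_demo.pkl` is hypothetical) so it runs without the project data:

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the fitted preprocessing pipeline.
fitted = StandardScaler().fit(np.array([[0.0], [2.0], [4.0]]))

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "preprocess_demo.pkl"  # hypothetical file name
    joblib.dump(fitted, pkl_path)
    reloaded = joblib.load(pkl_path)

# The reloaded object applies exactly the same frozen statistics.
X_new = np.array([[2.0], [6.0]])
same = np.allclose(reloaded.transform(X_new), fitted.transform(X_new))
print(same)
```

One caveat for the real artifact: pickles store classes by reference, so loading `preprocess_v1.pkl` most likely requires a compatible `Winsorize` class to be defined or importable under the same module path as when it was saved. Moving that class into a small shared module is one way to handle this.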

## Sanity checks we run here
- Shapes of `X/y` for each split  
- Positive class rate per split (should be **much less than 1.0**; if it’s 1.0 you filtered out all negatives—see Troubleshooting)
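
The positive-rate check can be sketched as a small helper. The function name `positive_rate_by_split` and the toy labels are illustrative assumptions; the column names match the ones used in this notebook.

```python
import pandas as pd

def positive_rate_by_split(frame, target="label_lenient", split_col="split"):
    """Mean of a 0/1 target per split, i.e. the positive class rate."""
    return frame.groupby(split_col)[target].mean()

toy = pd.DataFrame({
    "split": ["train"] * 4 + ["val"] * 2 + ["test"] * 2,
    "label_lenient": [0, 0, 0, 1, 0, 1, 0, 1],
})
rates = positive_rate_by_split(toy)
print(rates)

# A rate of exactly 1.0 (or 0.0) in any split means one class is missing there.
degenerate = rates[(rates == 0.0) | (rates == 1.0)]
print(degenerate.empty)
```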

## Troubleshooting
- **All labels are 1 (positive rate ~1.0):** the coverage filter in the *previous* notebook likely removed every negative. Loosen it and rebuild `table_v1.parquet`:
  - Try `MIN_QUARTERS = 4`, `MIN_DUTY = 0.2`, `DATASPAN_Q = 0.05`, then re-export.  
- **Feature count changed unexpectedly:** a column may be entirely missing/constant; check the attribute table in the EDA notebook and update the feature list if needed.
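
A quick way to look for such columns is sketched below; the helper `suspicious_columns` and the toy frame are hypothetical. One plausible cause of a changed feature count is that `SimpleImputer`, in its default configuration, drops columns that are entirely missing at fit time.

```python
import numpy as np
import pandas as pd

def suspicious_columns(frame):
    """List columns that are entirely missing or hold a single distinct value."""
    all_missing = [c for c in frame.columns if frame[c].isna().all()]
    constant = [
        c for c in frame.columns
        if c not in all_missing and frame[c].nunique(dropna=True) <= 1
    ]
    return {"all_missing": all_missing, "constant": constant}

toy_features = pd.DataFrame({
    "teff": [5300.0, 6200.0, 5700.0],
    "empty": [np.nan, np.nan, np.nan],  # entirely missing
    "flat": [1.0, 1.0, 1.0],            # constant
})
report = suspicious_columns(toy_features)
print(report)
```

Running this on the train split (the data the transformer is fitted on) is the relevant check, since a column can be populated overall but empty within train.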

## Reproducibility
- This notebook is deterministic given its inputs (seeds are fixed upstream) and builds a mini “data card” recording the timestamp, target, and transform list.
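
The final cell only displays the card. If it should also be persisted next to the other artifacts, one option is a small JSON writer like the sketch below; the helper `write_data_card` and the file name `data_card_demo.json` are assumptions, not something the notebook currently does.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def write_data_card(card, path):
    """Serialize a data-card dict to JSON, creating parent folders as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(card, indent=2))
    return path

demo_card = {
    "generated_at": datetime.now(timezone.utc).isoformat(),
    "target": "label_lenient",
    "transforms": ["median+indicator", "winsor 0.1%/99.9%", "Yeo-Johnson", "standardize"],
}

# A temporary directory keeps the example from touching the project tree.
with tempfile.TemporaryDirectory() as tmp:
    card_path = write_data_card(demo_card, Path(tmp) / "artifacts" / "data_card_demo.json")
    reloaded_card = json.loads(card_path.read_text())

print(reloaded_card["target"])
```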


In [17]:
from pathlib import Path
import pandas as pd, numpy as np


ROOT = Path("/Users/chrisjuarez/CPSC483_ML_Project")  
X_PATH = ROOT/"data/processed/features/table_v1.parquet"
Y_PATH = ROOT/"data/processed/labels/labels_v1.csv"
SPLIT  = ROOT/"data/processed/splits/split_v1.csv"

dfX = pd.read_parquet(X_PATH)
dfy = pd.read_csv(Y_PATH)
spl = pd.read_csv(SPLIT)
df  = dfX.merge(dfy, on="kepid").merge(spl, on="kepid")

FEATURES = [c for c in dfX.columns if c != "kepid"]
TARGET   = "label_lenient"     # switch to label_strict if you prefer

df.head(20), len(df), FEATURES[:6]

(       kepid    teff   logg   feh  radius   mass  kepmag  rrmscdpp03p0  \
 0   10000785  5333.0  4.616 -1.00   0.650  0.635  15.749       445.410   
 1   10000797  6289.0  4.270 -0.44   1.195  0.968  13.994        80.767   
 2   10000800  5692.0  4.547 -0.04   0.866  0.965  15.379       226.348   
 3   10000823  6580.0  4.377 -0.16   1.169  1.191  15.558       181.468   
 4   10000827  5648.0  4.559 -0.10   0.841  0.939  14.841       124.834   
 5   10000876  5249.0  4.410  0.18   0.953  0.849  14.458       104.839   
 6   10000939  4312.0  4.663 -0.50   0.579  0.564  15.939       312.067   
 7   10000941  5115.0  4.477  0.08   0.854  0.798  13.632        86.826   
 8   10000962  5496.0  4.592 -0.24   0.776  0.869  14.574       109.243   
 9   10000976  5629.0  4.546  0.04   0.870  0.972  14.586       133.739   
 10  10000981  5107.0  3.490 -0.58   2.706  0.825  13.233       222.827   
 11  10001000  5009.0  4.516  0.02   0.801  0.768  15.229       194.697   
 12  10001002  6409.0  4.

In [18]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np, pandas as pd, joblib, os

# Winsorizer: clip each column to quantile bounds to tame extreme tails
# (Appendix B: fix/remove outliers). Works on copies; originals remain in data/processed.
class Winsorize(BaseEstimator, TransformerMixin):
    def __init__(self, p_lo=0.001, p_hi=0.999):
        # Only hyperparameters are set here; fitted state is created in fit().
        self.p_lo = p_lo
        self.p_hi = p_hi
    def fit(self, X, y=None):
        Xa = np.asarray(X, dtype=float)
        # Per-column clip bounds learned from the data passed to fit (train only).
        self.lo_ = np.nanquantile(Xa, self.p_lo, axis=0)
        self.hi_ = np.nanquantile(Xa, self.p_hi, axis=0)
        return self
    def transform(self, X):
        Xa = np.asarray(X, dtype=float)
        # np.clip returns a new array, so the input is left untouched.
        return np.clip(Xa, self.lo_, self.hi_)

numeric_pipe = Pipeline(steps=[
    ("impute",  SimpleImputer(strategy="median", add_indicator=True)),     # fill missing + track it
    ("winsor",  Winsorize(0.001, 0.999)),                                  # robust outlier control
    ("power",   PowerTransformer(method="yeo-johnson", standardize=False)), # handles skew incl. zeros
    ("scale",   StandardScaler())                                          # feature scaling
])

preprocess = ColumnTransformer(
    transformers=[("num", numeric_pipe, FEATURES)],
    remainder="drop"
)

# fit on TRAIN only (no leakage)
Xtr = df.loc[df.split=="train", FEATURES]
preprocess.fit(Xtr)

os.makedirs("artifacts", exist_ok=True)
joblib.dump(preprocess, "artifacts/preprocess_v1.pkl")
"Saved artifacts/preprocess_v1.pkl"

  x = um.multiply(x, x, out=x)
  ret = umr_sum(x, axis, dtype, out, keepdims=keepdims, where=where)


'Saved artifacts/preprocess_v1.pkl'

In [19]:
def transform_split(tag):
    X = df.loc[df.split==tag, FEATURES]
    y = df.loc[df.split==tag, TARGET].to_numpy().astype(int)
    Xp = preprocess.transform(X)
    # attach names for convenience
    try:
        cols = preprocess.get_feature_names_out()
    except Exception:
        cols = [f"f{i}" for i in range(Xp.shape[1])]
    out_dir = ROOT/"data/processed/features_proc"; out_dir.mkdir(parents=True, exist_ok=True)
    pd.DataFrame(Xp, columns=cols).to_parquet(out_dir/f"X_{tag}.parquet", index=False)
    pd.DataFrame({"y": y, "kepid": df.loc[df.split==tag, "kepid"].values}).to_csv(out_dir/f"y_{tag}.csv", index=False)
    return Xp.shape, y.mean()

shapes = {tag: transform_split(tag) for tag in ["train","val","test"]}
shapes

{'train': ((105532, 23), 0.02406852897699276),
 'val': ((15077, 23), 0.024076407773429728),
 'test': ((30153, 23), 0.024044042052200443)}

In [20]:
from datetime import datetime, timezone
card = {
  "generated_at": datetime.now(timezone.utc).isoformat(),
  "source_features": str(X_PATH),
  "target": TARGET,
  "transforms": ["median+indicator", "winsor 0.1%/99.9%", "Yeo-Johnson", "standardize"],
  "files": ["data/processed/features_proc/X_train.parquet",
            "data/processed/features_proc/X_val.parquet",
            "data/processed/features_proc/X_test.parquet",
            "artifacts/preprocess_v1.pkl"]
}
pd.Series(card)

generated_at                        2025-09-17T22:53:40.457712+00:00
source_features    /Users/chrisjuarez/CPSC483_ML_Project/data/pro...
target                                                 label_lenient
transforms         [median+indicator, winsor 0.1%/99.9%, Yeo-John...
files              [data/processed/features_proc/X_train.parquet,...
dtype: object