# Step 2 – Pre-processing pipeline

## 2-B Import libs & load the cleaned data

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

DATA = Path("../data/asteroids_clean.csv")   # file you saved in Step 1
df = pd.read_csv(DATA)

print(df.shape)     # expect (1436, 31)
df.head(3)


(1436, 31)


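| Unnamed: 0 | name | a | e | i | om | w | q | ad | per_y | data_arc | ... | UB | IR | spec_B | spec_T | G | moid | class | n | per | ma |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 3.038918 | 0.069094 | 9.948162 | 217.408407 | 95.63757 | 2.828947 | 3.24889 | 5.297692 | 10333.0 | ... | NaN | NaN | NaN | NaN | NaN | 1.83752 | MBA | 0.186048 | 1934.9821 | 226.241935 |
| 1 | NaN | 2.781803 | 0.200606 | 9.233482 | 19.677473 | 164.05448 | 2.223758 | 3.339848 | 4.639784 | 7498.0 | ... | NaN | NaN | NaN | NaN | NaN | 1.22752 | MBA | 0.212429 | 1694.681031 | 97.864386 |
| 2 | NaN | 2.532657 | 0.150951 | 7.307953 | 152.847672 | 256.627796 | 2.15035 | 2.914963 | 4.030627 | 10256.0 | ... | NaN | NaN | NaN | NaN | NaN | 1.17367 | MBA | 0.244534 | 1472.186639 | 135.680806 |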


**What we do**  
1. Import pandas/NumPy.  
2. Set a global random seed for reproducibility.  
3. Load the cleaned CSV produced in Step 1.  

**Why**  
Everyone on the team starts from the same dataset and gets identical
train/validation splits when we use `random_state=42`.

*Expected output* → **(1436, 31)** rows × columns.
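
As a quick illustration of the reproducibility claim, the sketch below splits a small made-up table twice with the same `random_state` and checks that both calls pick the same rows. The `toy` frame and its values are hypothetical stand-ins, not the asteroid data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the asteroid table (10 rows, 2 columns).
toy = pd.DataFrame({"a": np.arange(10.0), "e": np.arange(10.0) / 10})

# Two independent calls with the same random_state ...
train_1, val_1 = train_test_split(toy, test_size=0.2, random_state=42)
train_2, val_2 = train_test_split(toy, test_size=0.2, random_state=42)

# ... select exactly the same rows, in the same order.
same_split = train_1.index.equals(train_2.index) and val_1.index.equals(val_2.index)
print(same_split)
```

Note that it is the explicit `random_state` argument that pins the split; `np.random.seed` only affects code that draws from NumPy's global generator.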


## 2-C Define target, drop junk, group columns

In [14]:
TARGET = "diameter"

DROP_ALWAYS = [
    "Unnamed: 0",                 # ghost index (may already be absent)
    "GM", "G", "IR", "extent",    # 100 % missing
    "UB", "BV", "spec_B", "spec_T",  # > 99 % missing
    "name"                        # mostly NaN and an arbitrary ID
]

X = df.drop(columns=[TARGET] + DROP_ALWAYS, errors="ignore")
y = df[TARGET]

X.shape


(1436, 21)

**What**  
• Separate features `X` from the regression target `y`.  
• Remove columns that cannot inform the model (all-missing or ID-like).

**Why**  
Dropping junk early keeps the pipeline lightweight and avoids leaking
an identifier (`name`) that the model could memorise instead of learning
real patterns.
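
One way to double-check which columns deserve a place in `DROP_ALWAYS` is to compute the share of missing values per column. The snippet below is a minimal sketch on a made-up frame; the column values and the 70 % threshold are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one complete column, one empty, one mostly empty.
toy = pd.DataFrame({
    "a":  [2.5, 2.8, 3.0, 3.1],
    "GM": [np.nan, np.nan, np.nan, np.nan],
    "UB": [np.nan, 0.4, np.nan, np.nan],
})

# Fraction of missing values per column, highest first.
missing_frac = toy.isna().mean().sort_values(ascending=False)
print(missing_frac)

# Columns above an (assumed) threshold are candidates for dropping.
THRESHOLD = 0.70
to_drop = missing_frac[missing_frac > THRESHOLD].index.tolist()
print(to_drop)
```

Running the same two lines on `df` would show whether the "100 % missing" and "> 99 % missing" comments still hold for the current CSV.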


In [15]:
# Cast condition_code (0–9 quality rating) to categorical
X["condition_code"] = X["condition_code"].astype("object")

# per  = orbital period in days  |  per_y = same in years
# Keep just one to avoid perfect collinearity
X = X.drop(columns=["per_y"])


**What**  
1. `condition_code` is a *label* (0–9), not a quantity → treat it as a
   category so the model gets one-hot dummies.  
2. `per_y` duplicates `per`; we keep `per` (days) and drop the years
   version.

**Why**  
Categorical coding prevents the model from interpreting “code 9” as
nine times something.  Dropping `per_y` removes a perfectly correlated
duplicate of `per`, which would otherwise make linear-model coefficients
unstable and hard to interpret.
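
The redundancy can be checked directly. This sketch uses only the three `per` / `per_y` pairs printed by `df.head(3)` in 2-B and computes their correlation and the implied days-per-year factor.

```python
import pandas as pd

# The three rows shown by df.head(3) in 2-B.
toy = pd.DataFrame({
    "per_y": [5.297692, 4.639784, 4.030627],
    "per":   [1934.9821, 1694.681031, 1472.186639],
})

# Pearson correlation between the two period columns.
corr = toy["per"].corr(toy["per_y"])

# Days per year implied by each row.
days_per_year = toy["per"] / toy["per_y"]
print(corr, days_per_year.tolist())
```

A correlation indistinguishable from 1 and a constant ratio near 365.25 indicate that one column is just a rescaled copy of the other.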


In [16]:
NUMERIC_COLS     = X.select_dtypes(["int64", "float64"]).columns.tolist()
CATEGORICAL_COLS = X.select_dtypes(["object", "bool"]).columns.tolist()

print(f"{len(NUMERIC_COLS)} numeric  |  {len(CATEGORICAL_COLS)} categorical")
print("Categoricals:", CATEGORICAL_COLS)


16 numeric  |  4 categorical
Categoricals: ['condition_code', 'neo', 'pha', 'class']


**What**  
Ask pandas for two column lists: numeric and categorical.

**Why**  
These lists feed the ColumnTransformer so each branch (scaling vs
one-hot) knows exactly which columns to handle.

*Expected* → **16 numeric | 4 categorical**  
(`neo`, `pha`, `class`, `condition_code`)
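
Because `ColumnTransformer` drops any column that is not listed in a branch (its default is `remainder="drop"`), it can be worth confirming that the two lists cover every column. The sketch below does this on a made-up frame; `n_obs` and the values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with the same kinds of dtypes as X.
toy = pd.DataFrame({
    "a": [2.5, 2.8],
    "n_obs": [120, 340],
    "neo": pd.Series(["N", "Y"], dtype="object"),
    "condition_code": pd.Series([0, 2]).astype("object"),
})

numeric_cols = toy.select_dtypes(["int64", "float64"]).columns.tolist()
categorical_cols = toy.select_dtypes(["object", "bool"]).columns.tolist()

# Any column in neither list would be silently dropped later.
unassigned = set(toy.columns) - set(numeric_cols) - set(categorical_cols)
print(numeric_cols, categorical_cols, unassigned)
```

An empty `unassigned` set means every column is routed to exactly one branch.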


## 2-D Build the column-wise pipelines

In [17]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale",  StandardScaler())
])

categorical_pipe = Pipeline([
    # fill_value is only used with strategy="constant", so it is omitted here
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, NUMERIC_COLS),
    ("cat", categorical_pipe, CATEGORICAL_COLS)
])


**What**  
*Numeric branch*  
  • Impute NaNs with the **median** (robust to outliers).  
  • Standard-scale to mean 0 / std 1.

*Categorical branch*  
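  • Replace NaNs with the **most-frequent** label; no separate “Missing” category is created, because `SimpleImputer` honours `fill_value` only with `strategy="constant"`.  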
  • One-hot encode; `handle_unknown="ignore"` keeps the model alive when
    it sees a brand-new category later.
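
To make the `handle_unknown="ignore"` behaviour concrete, here is a minimal sketch with made-up labels: `"XYZ"` is a hypothetical class that the encoder never saw during `fit`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Labels seen at fit time vs. labels arriving later.
train_labels = np.array([["MBA"], ["OMB"], ["MBA"]], dtype=object)
new_labels = np.array([["MBA"], ["XYZ"]], dtype=object)  # "XYZ" is unseen

encoder = OneHotEncoder(handle_unknown="ignore").fit(train_labels)

# The unseen label becomes an all-zero row instead of raising an error.
encoded = encoder.transform(new_labels).toarray()
print(encoded)
```

The all-zero row carries no class information, so a large number of unseen labels in new data is still worth monitoring.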

**Why**  
Encapsulating every step in a Pipeline means the same transforms, with
parameters learned from the training data only, are applied during
cross-validation and on the real test data. This keeps validation
information from leaking into the preprocessing.
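
The leakage protection only holds during cross-validation if the preprocessor is refit inside each fold, which is what happens when it is chained with the estimator. The sketch below shows that pattern on synthetic data; `Ridge` is a placeholder estimator chosen for illustration, and the toy columns and target are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (not the asteroid table).
rng = np.random.default_rng(42)
n = 60
toy_X = pd.DataFrame({
    "a": rng.normal(2.7, 0.3, n),
    "e": rng.uniform(0.0, 0.3, n),
    "class": pd.Series(rng.choice(["MBA", "OMB"], n), dtype="object"),
})
toy_X.loc[::10, "a"] = np.nan          # inject a few missing values
toy_y = rng.normal(10.0, 2.0, n)       # synthetic target

num_pipe = Pipeline([("impute", SimpleImputer(strategy="median")),
                     ("scale", StandardScaler())])
cat_pipe = Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                     ("encode", OneHotEncoder(handle_unknown="ignore"))])
preprocess_toy = ColumnTransformer([
    ("num", num_pipe, ["a", "e"]),
    ("cat", cat_pipe, ["class"]),
])

# Preprocessing + estimator in one object: cross_val_score clones and
# refits the whole chain on the training part of every fold.
model = Pipeline([("prep", preprocess_toy), ("reg", Ridge())])
scores = cross_val_score(model, toy_X, toy_y, cv=5,
                         scoring="neg_mean_absolute_error")
print(scores)
```

Transforming the full dataset first and cross-validating afterwards would instead let the medians and scaling parameters see the held-out folds.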


## 2-E Train / validation split before fitting

In [18]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=RANDOM_STATE
)

print(X_train.shape, X_val.shape)


(1148, 20) (288, 20)


**What**  
Reserve 20 % of the data for **validation**.

**Why**  
We must assess model quality on unseen data.  Splitting *before*
calling `preprocess.fit()` ensures the imputer and scaler learn only
from the training subset.
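
A small sketch of what "learn only from the training subset" means for the scaler: after fitting on the training rows, the training column is centred at zero, while the validation column is shifted by whatever amount its mean differs from the training mean. The values are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up single feature with a wide spread.
values = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256, 512], dtype=float).reshape(-1, 1)

train, val = train_test_split(values, test_size=0.2, random_state=42)

# Fit on the training rows only.
scaler = StandardScaler().fit(train)

train_scaled_mean = scaler.transform(train).mean()   # ~0 by construction
val_scaled_mean = scaler.transform(val).mean()       # generally not 0
print(scaler.mean_[0], train_scaled_mean, val_scaled_mean)
```

The non-zero validation mean is expected and harmless; it simply reflects that the validation rows played no part in estimating the scaling parameters.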


## 2-F Fit the pipeline once to prove it works

In [19]:
preprocess.fit(X_train)

X_train_ready = preprocess.transform(X_train)
X_val_ready   = preprocess.transform(X_val)

print("Train matrix →", X_train_ready.shape)
print("Validation  →", X_val_ready.shape)


Train matrix → (1148, 39)
Validation  → (288, 39)


**What**  
• `.fit()` learns medians, most-frequent labels, scaling parameters, and
  one-hot vocabularies **only from the training data**.  
• `.transform()` converts raw rows into a pure-numeric matrix.

**Why**  
Checking the dimensions confirms that the train and validation matrices
have the same number of columns: the 16 numeric features plus 23 one-hot
columns give the 39 shown above.


In [20]:
import joblib
joblib.dump(preprocess, "../data/preprocess.pkl")


['../data/preprocess.pkl']

Saving the fitted transformer lets teammates (or a deployment script)
load it instantly:

```python
import joblib
preprocess = joblib.load("../data/preprocess.pkl")
```