This notebook takes the cleaned data from the initial exploratory analysis and prepares it for machine learning by performing feature engineering, selection, and scaling.

# Project Initialization
## Objective
To set up the environment by loading the pre-cleaned dataset and the necessary libraries for data processing and modeling.

## Methodology
We will load the `asteroids_clean.csv` file, which was the output of the previous notebook (`01_eda.ipynb`). The first code cell imports only `pandas`, `numpy`, and `pathlib`; the `sklearn` modules for model selection and preprocessing, and `joblib` for saving Python objects, are imported in later cells at the point where they are first used.

## Justification & Alternatives
- Loading Clean Data: Starting from the cleaned CSV ensures that we are building upon a consistent and verified baseline, making the workflow more modular and efficient.

- Library Choices:

  - sklearn.model_selection: This module contains `train_test_split`, the standard function for partitioning data.

  - sklearn.preprocessing: This module contains `StandardScaler`, a common tool for feature scaling.

  - joblib: This library is used for saving and loading Python objects. It can be more efficient than plain `pickle` for objects that carry large NumPy arrays, such as fitted scikit-learn models and scalers.

## 2-B: Import Libraries & Load Cleaned Data

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

DATA = Path("../data/asteroids_clean.csv")   # file you saved in Step 1
df = pd.read_csv(DATA)

print(df.shape)     # expect (1436, 31)
df.head(3)


(1436, 31)


Unnamed: 0,name,a,e,i,om,w,q,ad,per_y,data_arc,...,UB,IR,spec_B,spec_T,G,moid,class,n,per,ma
0,,3.038918,0.069094,9.948162,217.408407,95.63757,2.828947,3.24889,5.297692,10333.0,...,,,,,,1.83752,MBA,0.186048,1934.9821,226.241935
1,,2.781803,0.200606,9.233482,19.677473,164.05448,2.223758,3.339848,4.639784,7498.0,...,,,,,,1.22752,MBA,0.212429,1694.681031,97.864386
2,,2.532657,0.150951,7.307953,152.847672,256.627796,2.15035,2.914963,4.030627,10256.0,...,,,,,,1.17367,MBA,0.244534,1472.186639,135.680806


# Thesis Justification
## Objective: 
To initialize the workspace for the preprocessing stage.

## Methodology:
The necessary libraries (`pandas`, `numpy`, `pathlib`) are imported. The `asteroids_clean.csv` file, which is the verified output from the Exploratory Data Analysis (EDA) phase (notebook `01_eda.ipynb`), is loaded into a pandas DataFrame. A global random seed is set to ensure reproducibility.

## Justification:

 - Reproducibility: Fixing a `RANDOM_STATE` is a cornerstone of scientifically valid computational research. `np.random.seed` seeds NumPy's global generator, and passing `random_state=RANDOM_STATE` explicitly to scikit-learn functions, such as the train-test split performed later, makes those steps deterministic so that others can replicate them precisely.

 - Modularity: By loading the pre-cleaned CSV, this notebook builds upon the validated work of the previous step. This modular approach makes the project workflow more organized, efficient, and less prone to errors, as the initial data cleaning does not need to be repeated.
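The reproducibility claim can be checked directly. The following is a minimal sketch on synthetic arrays (the names `X_demo` and `y_demo` are illustrative and not part of the notebook): two independent calls to `train_test_split` with the same `random_state` should return the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42

# Synthetic stand-in for the asteroid table (hypothetical values).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same random_state.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=RANDOM_STATE)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=RANDOM_STATE)

# Compare X_train, X_val, y_train, y_val pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)  # True
```

Passing `random_state` explicitly is what pins the split here; the global `np.random.seed` call only affects code that draws from NumPy's global generator.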

**What we do**  
1. Import pandas/NumPy.  
2. Set a global random seed for reproducibility.  
3. Load the cleaned CSV produced in Step 1.  

**Why**  
Everyone on the team starts from the same dataset and gets identical
train/validation splits when we use `random_state=42`.

*Expected output* → **(1436, 31)** rows × columns.


## 2-C: Define Target, Drop Uninformative Columns, and Group Features

### Code Block 1: Feature/Target Separation and Dropping

In [2]:
TARGET = "diameter"

DROP_ALWAYS = [
    "Unnamed: 0",                 # ghost index (may already be absent)
    "GM", "G", "IR", "extent",    # 100 % missing
    "UB", "BV", "spec_B", "spec_T",  # > 99 % missing
    "name"                        # mostly NaN and an arbitrary ID
]

X = df.drop(columns=[TARGET] + DROP_ALWAYS, errors="ignore")
y = df[TARGET]

X.shape


(1436, 21)

# Thesis Justification
## Objective:
To perform initial feature selection by removing columns that provide no predictive value and to formally separate the predictor features (`X`) from the target variable (`y`).

## Methodology:
The target variable `diameter` is defined. A list, `DROP_ALWAYS`, contains columns identified during EDA as uninformative. These columns, along with the target, are dropped from the original DataFrame to create the feature matrix `X`. The target variable `y` is created as a separate Series.

## Justification of Dropped Columns:

 - No Information Content: `GM`, `G`, `IR`, `extent` are removed as they were found to be 100% null. `UB`, `BV`, `spec_B`, `spec_T` are removed due to having over 99% missing values; imputing such a high percentage would introduce more noise than signal.

 - Identifier, Not a Feature: `name` is an arbitrary label that is missing for most rows and carries no physical information. Where it is present, including it would risk data leakage: a model could simply memorize the diameter for a given name rather than learning a generalizable physical relationship.

 - `errors="ignore"`: This argument makes the code robust by preventing an error if a column in DROP_ALWAYS has already been removed or is not present.
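The drop list above was written by hand from the EDA findings. As an illustrative sketch only, the same kind of list could be derived from the share of missing values per column. The toy table, its values, and the `MISSING_THRESHOLD` cut-off are assumptions made for this example, not the notebook's actual procedure.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table; the real columns hold different values.
toy = pd.DataFrame({
    "a": [2.5, 3.0, 2.8, 3.1],
    "GM": [np.nan, np.nan, np.nan, np.nan],      # entirely missing
    "data_arc": [10333.0, np.nan, 10256.0, 7498.0],  # partly missing
})

MISSING_THRESHOLD = 0.99  # assumed cut-off, mirroring the "> 99 % missing" rule

# Fraction of NaN values in each column.
missing_share = toy.isna().mean()
drop_candidates = missing_share[missing_share > MISSING_THRESHOLD].index.tolist()

# errors="ignore" lets the call succeed even though "Unnamed: 0" is absent here.
reduced = toy.drop(columns=drop_candidates + ["Unnamed: 0"], errors="ignore")
print(drop_candidates, list(reduced.columns))
```

A partly missing column such as `data_arc` in this toy table stays in the frame and is left for the imputer to handle.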

**What**  
• Separate features `X` from the regression target `y`.  
• Remove columns that cannot inform the model (all-missing or ID-like).

**Why**  
Dropping junk early keeps the pipeline lightweight and avoids leaking
an identifier (`name`) that the model could memorise instead of learning
real patterns.


## Code Block 2: Data Type Correction and Redundancy Removal

In [3]:
# Cast condition_code (0–9 quality rating) to categorical
X["condition_code"] = X["condition_code"].astype("object")

# per  = orbital period in days  |  per_y = same in years
# Keep just one to avoid perfect collinearity
X = X.drop(columns=["per_y"])


# Thesis Justification
## Objective:
To correct data types for categorical variables and remove redundant features to avoid multicollinearity.

## Methodology:
The `condition_code` column is explicitly cast to the `object` data type. The `per_y` (orbital period in years) column is dropped.

## Justification:

 - `condition_code` as Categorical: Although stored as numbers, `condition_code` is a 0–9 rating of orbit quality, not a measured quantity. Treating it as numeric would impose an evenly spaced, linear relationship between the codes (e.g., that code 9 is nine times code 1). Casting it to `object` routes it to the categorical branch so that it is one-hot encoded. The cost of this choice is that the encoding no longer carries the ordering of the codes.

 - Multicollinearity: `per` (period in days) and `per_y` (period in years) are perfectly correlated because they measure the same physical quantity in different units. Including both would introduce perfect multicollinearity, which can destabilize the coefficient estimates of linear models. Dropping one of them loses no information.
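The redundancy of the two period columns can be verified on the three rows shown in the `df.head(3)` output earlier. This is a small sketch; the conversion factor of 365.25 days per year is an assumption that these rows happen to satisfy.

```python
import numpy as np
import pandas as pd

# The per_y and per values from the three rows printed by df.head(3).
periods = pd.DataFrame({
    "per_y": [5.297692, 4.639784, 4.030627],
    "per":   [1934.9821, 1694.681031, 1472.186639],
})

DAYS_PER_YEAR = 365.25  # assumed conversion factor (Julian year)

# If per is only a unit conversion of per_y, the ratio is constant.
ratio = periods["per"] / periods["per_y"]
is_rescaling = np.allclose(ratio, DAYS_PER_YEAR, rtol=1e-5)

# Pearson correlation of a pure rescaling is 1 up to rounding.
correlation = periods["per"].corr(periods["per_y"])
print(is_rescaling, correlation)
```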

**What**  
1. `condition_code` is a *label* (0–9), not a quantity → treat it as a
   category so the model gets one-hot dummies.  
2. `per_y` duplicates `per`; we keep `per` (days) and drop the years
   version.

**Why**  
Categorical coding prevents the model from interpreting “code 9” as
nine times something.  Removing duplicate signals avoids redundant,
perfectly correlated features that can mislead linear models.


## Code Block 3: Programmatic Column Grouping

In [4]:
NUMERIC_COLS     = X.select_dtypes(["int64", "float64"]).columns.tolist()
CATEGORICAL_COLS = X.select_dtypes(["object", "bool"]).columns.tolist()

print(f"{len(NUMERIC_COLS)} numeric  |  {len(CATEGORICAL_COLS)} categorical")
print("Categoricals:", CATEGORICAL_COLS)


16 numeric  |  4 categorical
Categoricals: ['condition_code', 'neo', 'pha', 'class']


# Thesis Justification
## Objective:
To programmatically separate the column names into numeric and categorical groups.

## Methodology:
The `select_dtypes` method is used to automatically identify and list the names of columns belonging to numeric and categorical types.

## Justification:
This automated approach is more robust and less error-prone than manually defining these lists. These lists are critical inputs for the `ColumnTransformer` in the next step, ensuring that the correct preprocessing steps (e.g., scaling vs. one-hot encoding) are applied to the appropriate columns.
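To illustrate how the dtype cast interacts with the grouping, here is a minimal sketch on a hypothetical three-row table (the values are invented). It shows `condition_code` moving from the numeric list to the categorical list once it is cast to `object`.

```python
import pandas as pd

# Hypothetical rows; dtypes are set explicitly so the result does not
# depend on platform or pandas defaults.
toy = pd.DataFrame({
    "a": pd.Series([2.5, 3.0, 2.8], dtype="float64"),
    "condition_code": pd.Series([0, 1, 0], dtype="int64"),
    "neo": pd.Series(["N", "N", "Y"], dtype="object"),
    "class": pd.Series(["MBA", "MBA", "APO"], dtype="object"),
})

numeric_before = toy.select_dtypes(["int64", "float64"]).columns.tolist()

# Same cast as in the notebook.
toy["condition_code"] = toy["condition_code"].astype("object")

numeric_after = toy.select_dtypes(["int64", "float64"]).columns.tolist()
categorical_after = toy.select_dtypes(["object", "bool"]).columns.tolist()
print(numeric_before, numeric_after, categorical_after)
```

One caveat: the explicit list `["int64", "float64"]` would skip columns stored in other numeric widths such as `int32`. `select_dtypes(include="number")` is a broader alternative if that ever becomes relevant.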

**What**  
Ask pandas for two column lists: numeric and categorical.

**Why**  
These lists feed the ColumnTransformer so each branch (scaling vs
one-hot) knows exactly which columns to handle.

*Expected* → **16 numeric | 4 categorical**  
(`neo`, `pha`, `class`, `condition_code`)


## 2-D: Build the Column-wise Preprocessing Pipelines

In [5]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale",  StandardScaler())
])

categorical_pipe = Pipeline([
    # fill_value is only used with strategy="constant", so it is omitted here
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, NUMERIC_COLS),
    ("cat", categorical_pipe, CATEGORICAL_COLS)
])


# Thesis Justification
## Objective:
To construct a comprehensive preprocessing pipeline that handles missing values and applies appropriate transformations to numeric and categorical data types separately.

## Methodology:
A `ColumnTransformer` is defined, which applies two distinct sub-pipelines (`numeric_pipe` and `categorical_pipe`) to their respective column groups.

## Justification of `numeric_pipe`:

 - Imputation Strategy: `SimpleImputer(strategy="median")` is chosen to fill missing numerical values. The median is a robust measure of central tendency that is less sensitive to outliers than the mean, which is a desirable property for astronomical data that can have extreme values.

 - Scaling Strategy: `StandardScaler` is used to standardize features by removing the mean and scaling to unit variance. This matters for distance-based algorithms (e.g., SVMs) and for models that are regularized or trained by gradient descent (e.g., regularized linear regression), where features on larger scales would otherwise dominate the model's objective function.

## Justification of `categorical_pipe`:

 - Imputation Strategy: `SimpleImputer(strategy="most_frequent")` is a standard approach for filling missing categorical labels.

 - Encoding Strategy: `OneHotEncoder` converts categorical data into a numerical format without implying an artificial order. The argument `handle_unknown="ignore"` makes the pipeline robust to categories that appear at prediction time but were not seen during fitting: instead of raising an error, the encoder outputs all zeros for that feature's one-hot columns.
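The effect of `handle_unknown="ignore"` is easiest to see on a tiny example. In this sketch the class labels `APO` and `TJN` are hypothetical stand-ins, and the encoder is used on its own rather than inside the notebook's pipeline.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on two labels only (hypothetical training data).
train_labels = np.array([["MBA"], ["APO"], ["MBA"]], dtype=object)
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_labels)

# "TJN" was never seen during fit.
new_labels = np.array([["MBA"], ["TJN"]], dtype=object)
encoded = encoder.transform(new_labels).toarray()

# Learned categories are sorted: APO, MBA.
# MBA -> [0, 1]; unseen TJN -> [0, 0]
print(encoder.categories_[0])
print(encoded)
```

An all-zero row is indistinguishable from "none of the known categories", which is usually acceptable for rare labels but worth keeping in mind when interpreting a model.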

**What**  
*Numeric branch*  
  • Impute NaNs with the **median** (robust to outliers).  
  • Standard-scale to mean 0 / std 1.

*Categorical branch*  
  • Replace NaNs with the **most-frequent** label.  
  • One-hot encode; `handle_unknown="ignore"` keeps the model alive when
    it sees a brand-new category later.

**Why**  
Encapsulating every step in a Pipeline means the same fitted
transforms are applied during cross-validation and on the real test
data. As long as the pipeline is fitted on training data only, this
prevents data leakage.


## 2-E: Train / Validation Split Before Fitting

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=RANDOM_STATE
)

print(X_train.shape, X_val.shape)


(1148, 20) (288, 20)


# Thesis Justification
## Objective:
To partition the dataset into a training set for model development and a validation set for unbiased performance evaluation.

## Methodology:
The `train_test_split` function is used to reserve 20% of the data for validation (`X_val`, `y_val`).

## Justification:
This is the most critical step for preventing data leakage. The validation set simulates unseen data. By splitting the data before fitting the preprocessing pipeline, we ensure that the imputer and scaler learn their parameters (e.g., medians, means, standard deviations) only from the training data. This prevents any information from the validation set from "leaking" into the training process, which would lead to overly optimistic and invalid performance metrics. The 80/20 split is a standard convention that balances the need for sufficient training data with a robust validation set size.
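As a rough illustration of the split-then-fit order, the sketch below uses a synthetic feature (not the asteroid data) and a bare `StandardScaler` instead of the full `preprocess` object. The scaler's parameters come from the training rows alone, and the validation rows are transformed with those same parameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data (assumed distribution, for illustration only).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(200, 1))

# Split first, as in the notebook.
X_tr, X_va = train_test_split(X_demo, test_size=0.20, random_state=42)

# Fit on the training rows only, then reuse the learned mean and std.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_va_scaled = scaler.transform(X_va)

# The learned mean equals the training mean, not the full-data mean.
print(scaler.mean_, X_tr.mean(axis=0))
# Training data is centred exactly; validation data only approximately.
print(X_tr_scaled.mean(), X_va_scaled.mean())
```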

**What**  
Reserve 20 % of the data for **validation**.

**Why**  
We must assess model quality on unseen data.  Splitting *before*
calling `preprocess.fit()` ensures the imputer and scaler learn only
from the training subset.


## 2-F: Fit and Apply the Preprocessing Pipeline

In [7]:
preprocess.fit(X_train)

X_train_ready = preprocess.transform(X_train)
X_val_ready   = preprocess.transform(X_val)

print("Train matrix →", X_train_ready.shape)
print("Validation  →", X_val_ready.shape)


Train matrix → (1148, 39)
Validation  → (288, 39)


# Thesis Justification
## Objective:
To apply the defined preprocessing steps to the training and validation sets.

## Methodology:
The `preprocess` pipeline is first fitted to the training data (`X_train`) using the `.fit()` method. Then, the fitted pipeline is used to transform both the training and validation sets using the `.transform()` method.

## Justification:
The "fit on train, transform on both" paradigm is strictly followed.

 - `preprocess.fit(X_train)`: This step learns the necessary parameters for transformation (medians for imputation, means/stds for scaling, vocabulary for one-hot encoding) exclusively from the training data.

 - `preprocess.transform(...)`: This step applies the learned transformations consistently to both datasets, ensuring that the validation data is processed in the exact same manner as the training data, which is essential for a fair evaluation. The final shapes are checked as a sanity test to confirm the pipeline executed correctly and the one-hot encoding expanded the feature space as expected.



**What**  
• `.fit()` learns medians, most-frequent labels, scaling parameters, and
  one-hot vocabularies **only from the training data**.  
• `.transform()` converts raw rows into a pure-numeric matrix.

**Why**  
Checking the dimensions confirms all columns (plus one-hot expansions)
are present and identical in train & validation matrices.


## Code: Saving the Fitted Pipeline

In [8]:
import joblib
from pathlib import Path
# Persist the fitted ColumnTransformer so later notebooks can reuse it
joblib.dump(preprocess, Path("../data/preprocess.pkl"))


['../data/preprocess.pkl']

# Thesis Justification
## Objective:
To persist the fitted preprocessing pipeline for future use.

## Methodology:
The `joblib.dump` function is used to serialize and save the entire fitted `preprocess` object to a file.

## Justification:
Saving the fitted pipeline is crucial for reproducibility and deployment. It allows subsequent notebooks (for modeling) or production scripts to load the exact same transformation and apply it to new data without having to retrain it. This guarantees that any new data is processed identically to the original training data, which is a requirement for making valid predictions. `joblib` is generally preferred over `pickle` for scikit-learn objects as it can be more efficient with objects containing large NumPy arrays.
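A quick round-trip check can confirm that a reloaded object behaves like the original. This sketch uses a small `StandardScaler` and a temporary directory instead of the real `preprocess.pkl`; the file name `scaler_demo.pkl` is hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data and a small fitted transformer standing in for `preprocess`.
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
scaler = StandardScaler().fit(X_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    artifact = Path(tmp_dir) / "scaler_demo.pkl"  # hypothetical file name
    joblib.dump(scaler, artifact)
    restored = joblib.load(artifact)

# The restored object should reproduce the original transformation.
roundtrip_ok = np.allclose(scaler.transform(X_demo), restored.transform(X_demo))
print(roundtrip_ok)  # True
```

Because joblib files are pickle-based, they should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.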

Saving the fitted transformer lets teammates (or a deployment script)
load it instantly:

```python
preprocess = joblib.load("../data/preprocess.pkl")
```


In [None]:
!git add notebooks/02_preprocessing.ipynb data/preprocess.pkl
!git commit -m "Step 2: complete preprocessing pipeline"
!git push
