# Thesis Documentation for `03_baselines.ipynb`

This document provides a detailed methodological justification for the steps in the 03_baselines.ipynb notebook. The purpose of this notebook is to establish baseline performance metrics against which more complex models can be compared.

## 3-A & 3-B: Workspace Initialization and Data Recreation

In [1]:
# 3-A: Imports & load essentials
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
import joblib

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

DATA      = Path("../data/asteroids_clean.csv")
PREPROC_P = Path("../data/preprocess.pkl")

df         = pd.read_csv(DATA)
preprocess = joblib.load(PREPROC_P)   # fitted transformer from Step 2

# 3-B: Recreate X and y
TARGET = "diameter"
DROP_ALWAYS = ["Unnamed: 0", "GM", "G", "IR", "extent",
               "UB", "BV", "spec_B", "spec_T", "name",  # junk
               "per_y"]                                 # duplicate of per

X = df.drop(columns=[TARGET] + DROP_ALWAYS, errors="ignore").copy()
y = df[TARGET].copy()

# Cast condition_code (0–9 quality label) to categorical
X["condition_code"] = X["condition_code"].astype("object")


# Thesis Justification
## Objective:
To create a consistent and reproducible environment for model training by loading all necessary data, objects, and libraries.

## Methodology:
The script imports the required libraries, loads the cleaned dataset (`asteroids_clean.csv`), and, crucially, loads the pre-fitted preprocessing pipeline (`preprocess.pkl`) that was saved in the previous notebook. The feature matrix X and target vector y are then recreated using the exact same logic as in the preprocessing notebook.

## Justification:

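  - Consistency: Loading the saved `preprocess` object guarantees that the same imputation, scaling, and encoding steps defined in notebook 02 are applied here. Note that `Pipeline.fit` re-fits the preprocessor on `X_train`; because the split used here reproduces the training set from notebook 02, the learned parameters should match. Fitting on training rows only is what prevents data leakage and keeps the comparison between models valid.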

  - Reproducibility: Re-executing the same feature selection and type casting steps ensures that the data fed into the pipeline has the precise structure the fitted pipeline expects. This makes the notebook a self-contained and verifiable unit of work.

**Purpose** Bring in scikit-learn, load the cleaned dataset, and load the
already-fitted `preprocess.pkl` so every model sees the same transforms.


Mirror the exact column drops and type cast from Step 2 so the data
arriving at `preprocess` has the layout it expects.
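
One detail is worth verifying when reusing a saved transformer: calling `fit` on a scikit-learn `Pipeline` re-fits every step, including one that was fitted beforehand. The sketch below illustrates this with toy arrays and a `StandardScaler` standing in for `preprocess`; both are assumptions for illustration, not the project's data or transformer.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: "old" plays the role of the data seen in Step 2,
# "new" plays the role of X_train in this notebook.
old = np.array([[0.0], [10.0]])
new = np.array([[100.0], [200.0], [300.0]])
y_new = np.array([1.0, 2.0, 3.0])

# Fit the scaler once, as if it had been saved and reloaded.
scaler = StandardScaler().fit(old)
mean_before = float(scaler.mean_[0])

# Pipeline.fit calls fit_transform on every transformer step,
# so the scaler's learned mean is replaced by the mean of "new".
pipe = Pipeline([("prep", scaler), ("reg", DummyRegressor(strategy="median"))])
pipe.fit(new, y_new)
mean_after = float(pipe.named_steps["prep"].mean_[0])

print(mean_before, mean_after)
```

If the parameters learned in Step 2 must stay untouched, one option is to call `preprocess.transform(...)` directly and fit only the regressor on the transformed data.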


## 3-C: Consistent Data Partitioning

In [2]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=RANDOM_STATE
)


# Thesis Justification
## Objective:
To partition the data into training and validation sets that are identical to those used in the preprocessing step.

## Methodology:
`train_test_split` is called with the same `test_size` (0.20) and, most importantly, the same `random_state` (42) as used previously.

## Justification:
Using an identical `random_state` is essential for sound model comparison. Given the same rows in the same order, it ensures that the `X_train` and `X_val` sets in this notebook contain exactly the same data points as the sets used to fit and evaluate the preprocessor. With the data held fixed, differences in performance can be attributed to the models themselves rather than to variations in the data they are trained or evaluated on.

Hold out 20 % of rows as a validation set (identical random seed as
before) so scores are comparable across all models.
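
As a minimal sketch of why the seed matters, the snippet below splits a hypothetical array of 50 row indices twice with the same `random_state` and checks that both validation sets are identical. The array `rows` is an assumption for illustration and stands in for the real `X`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the rows of X.
rows = np.arange(50)

# Two independent calls with the same seed and the same input.
train_a, val_a = train_test_split(rows, test_size=0.20, random_state=42)
train_b, val_b = train_test_split(rows, test_size=0.20, random_state=42)

same_split = np.array_equal(train_a, train_b) and np.array_equal(val_a, val_b)
print(same_split, len(val_a))
```

The guarantee only holds when the input has the same number of rows in the same order in both notebooks; re-sorting or filtering `df` before the split would change which rows land in each set.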


# Baseline 1: DummyRegressor (median)

## 3-D: Fit & Predict

In [3]:
dummy_model = Pipeline([
    ("prep", preprocess),                     # loaded from Step 2; Pipeline.fit re-fits it on X_train
    ("reg",  DummyRegressor(strategy="median"))
])

dummy_model.fit(X_train, y_train)
y_pred_dummy = dummy_model.predict(X_val)


# Thesis Justification
## Objective:
To establish an absolute performance floor by creating a non-intelligent model. Any subsequent, more complex model must outperform this baseline to be considered useful.

## Methodology:
A `DummyRegressor` is placed into a `Pipeline` with the pre-fitted `preprocess` object. The `strategy="median"` instructs the model to simply predict the median value of the training set's target (`y_train`) for every single instance in the validation set.

## Justification:
The `DummyRegressor` serves as a sanity check. It answers the question: "How well can we predict the diameter if we ignore all the features and just guess the most typical value?" The resulting metrics are the reference level that any useful model should beat. Because R² is defined relative to predicting the mean of the evaluation targets, a median predictor can itself score slightly below zero; a model whose R² falls below the dummy's is performing worse than this naïve strategy.

**Dummy (median)** simply predicts the training-set median for every
asteroid.  This sets the absolute performance floor.
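
A minimal sketch of this behaviour, using made-up numbers rather than the asteroid data: the regressor ignores the feature values entirely and returns the median of the training targets for every row.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Hypothetical toy data; the feature values are irrelevant to the prediction.
X_train_toy = np.zeros((5, 2))
y_train_toy = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
X_val_toy = np.ones((3, 2))

dummy = DummyRegressor(strategy="median").fit(X_train_toy, y_train_toy)
preds = dummy.predict(X_val_toy)

# Every prediction equals the training median (3.0 here), not the mean (22.0).
print(preds)
```

Choosing the median rather than the mean makes the baseline less sensitive to a few very large targets, such as the value 100.0 in this toy example.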


# Evaluation Metrics

## 3-E: Defining the Metrics Function

In [4]:
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# 1️⃣  Try to import the new function (exists in sklearn ≥ 1.4)
try:
    from sklearn.metrics import root_mean_squared_error  # noqa
    def _rmse(y_true, y_pred):
        return root_mean_squared_error(y_true, y_pred)
except ImportError:
    # 2️⃣  Fallback for older sklearn: manual sqrt of mean_squared_error
    from sklearn.metrics import mean_squared_error
    def _rmse(y_true, y_pred):
        return np.sqrt(mean_squared_error(y_true, y_pred))

def metrics(y_true, y_pred):
    return {
        "MAE":  mean_absolute_error(y_true, y_pred),
        "RMSE": _rmse(y_true, y_pred),
        "R²":   r2_score(y_true, y_pred)
    }

scores_dummy = metrics(y_val, y_pred_dummy)
scores_dummy


{'MAE': 2.655770833333333,
 'RMSE': 6.86362135647065,
 'R²': -0.04744953304303601}

# Thesis Justification
## Objective:
To define a standardized set of metrics for evaluating and comparing all regression models in this project.

## Methodology:
A helper function `metrics` is created to compute three standard regression metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (R²). A compatibility wrapper `_rmse` is included to handle different versions of scikit-learn.

## Justification of Metric Choices:

 - Mean Absolute Error (MAE): Measures the average absolute difference between the predicted and actual values. It is easily interpretable as the typical prediction error in the original units of the target (km).

 - Root Mean Squared Error (RMSE): This metric squares the errors before averaging, which penalizes larger errors more heavily than smaller ones. It is also in the original units of the target, making it interpretable.

 - R-squared (R²): The coefficient of determination represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). An R² of 1 indicates perfect prediction, while an R² of 0 indicates the model performs no better than predicting the mean. Negative values indicate the model is worse than the mean-predicting baseline.
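
The three definitions can be checked by hand. The sketch below computes them with NumPy on a small, made-up target vector and a constant median prediction; the numbers are assumptions for illustration only. It also shows why a median predictor can have a negative R²: the median differs from the mean whenever the targets are skewed.

```python
import numpy as np

# Hypothetical targets with one large value, predicted by their median (2.5).
y_true = np.array([1.0, 2.0, 3.0, 10.0])
y_pred = np.full_like(y_true, np.median(y_true))

errors = y_true - y_pred

mae = np.mean(np.abs(errors))            # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))     # squaring weights large errors more

ss_res = np.sum(errors ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # spread around the mean
r2 = 1.0 - ss_res / ss_tot                        # negative when ss_res > ss_tot

print(mae, rmse, r2)
```

Here the MAE is 2.5, the RMSE is roughly 3.84, and R² is about -0.18, because predicting the mean (4.0) would have produced a smaller sum of squared errors than predicting the median.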

## Code Robustness:
The `try`/`except` block for the RMSE calculation keeps the notebook portable across environments: it uses `root_mean_squared_error` when the installed scikit-learn provides it (version 1.4 or later) and otherwise falls back to the manual `np.sqrt(mean_squared_error(...))`.

We record **MAE**, **RMSE** and **R²** for the naïve “predict-the-median” model.  
These values are the reference floor: every real model should beat them.


# Baseline 2: LinearRegression

## 3-F & 3-G: Fit, Predict, and Evaluate

In [5]:
# 3-F: Fit & predict
lin_model = Pipeline([
    ("prep", preprocess),          # same pre-processor; Pipeline.fit re-fits it on X_train
    ("reg",  LinearRegression())
])

lin_model.fit(X_train, y_train)
y_pred_lin = lin_model.predict(X_val)

# 3-G: Evaluate LinearRegression
scores_lin = metrics(y_val, y_pred_lin)
scores_lin


{'MAE': 2.226734298170547, 'RMSE': 9.30398774148489, 'R²': -0.9247074738179806}

# Thesis Justification
## Objective:
To establish the performance of a simple, classical linear model. This serves as the first "intelligent" baseline.

## Methodology:
A `LinearRegression` model is placed in a `Pipeline` and trained on the preprocessed training data. Predictions are made on the validation set, and the same `metrics` function is used to evaluate its performance.

## Justification:
Linear Regression is an excellent baseline because it is fast, interpretable, and provides a clear signal of whether linear relationships exist between the engineered features and the target. If this simple model shows a significant improvement over the `DummyRegressor` (e.g., lower MAE/RMSE, higher R²), it validates that the feature engineering process has successfully extracted predictive information. It sets a new, more challenging performance target for more complex, non-linear models (like tree ensembles) to beat.

A straight-line model in the engineered feature space.  
Runs instantly and shows whether simple linear relationships capture real signal.


If MAE and RMSE drop (and R² rises) compared with the Dummy baseline,
that is evidence that even a basic linear model learns something meaningful.
In this run only MAE drops (2.23 vs 2.66); RMSE and R² both get worse.
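
MAE and RMSE can move in opposite directions, which is the pattern in the scores reported above. The sketch below uses made-up numbers (not the asteroid data) to show one way this happens: a model that is accurate on most rows but badly wrong on one row has a lower MAE and a higher RMSE than a model with uniformly moderate errors.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Hypothetical model A: every prediction is off by 2.
pred_a = y_true + np.array([2.0, -2.0, 2.0, -2.0, 2.0])
# Hypothetical model B: four exact predictions and one miss of 8.
pred_b = y_true + np.array([0.0, 0.0, 0.0, 0.0, 8.0])

def mae(y, p):
    return np.mean(np.abs(y - p))

def rmse(y, p):
    return np.sqrt(np.mean((y - p) ** 2))

mae_a, rmse_a = mae(y_true, pred_a), rmse(y_true, pred_a)
mae_b, rmse_b = mae(y_true, pred_b), rmse(y_true, pred_b)

print(mae_a, rmse_a)   # uniform errors: MAE and RMSE coincide
print(mae_b, rmse_b)   # one large error: lower MAE, higher RMSE
```

A pattern like this in real results suggests inspecting the largest residuals on the validation set before drawing conclusions about the model as a whole.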


## 3-H: Side-by-Side Comparison Table

In [6]:
# 3-H — put both rows in one DataFrame
import pandas as pd

pd.DataFrame(
    [scores_dummy, scores_lin],
    index=["Dummy (median)", "LinearRegression"]
)


| Model            | MAE      | RMSE     | R²        |
|------------------|----------|----------|-----------|
| Dummy (median)   | 2.655771 | 6.863621 | -0.04745  |
| LinearRegression | 2.226734 | 9.303988 | -0.924707 |


# Thesis Justification
## Objective:
To present the performance metrics of the baseline models in a clear, concise, and easily comparable format.

## Methodology:
The dictionaries containing the scores for each model are used to construct a pandas DataFrame.

## Justification:
A summary table is the standard and most effective way to compare model performance in a research context. It allows for at-a-glance assessment of the relative strengths and weaknesses of each approach. This table will serve as the foundation for model selection, with the results from more advanced models being added as new rows in subsequent notebooks. This systematic comparison is essential for justifying the final choice of model for the project.

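A single table makes the comparison explicit: LinearRegression lowers MAE but has a
higher RMSE and a much lower R² than the naïve baseline, which suggests a few very
large errors on the validation set.  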
Future models (Random Forest, Gradient Boosting, …) will be added as extra rows.
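
One possible way to grow the table in later notebooks is sketched below. The helper `add_result` is hypothetical (it does not appear in the notebook); the two rows reuse the rounded scores reported above.

```python
import pandas as pd

def add_result(table, name, scores):
    """Return a new table with one extra row of scores for model `name`."""
    row = pd.DataFrame([scores], index=[name])
    return pd.concat([table, row])

# Start from the dummy baseline, then append the linear model.
table = pd.DataFrame(
    [{"MAE": 2.655771, "RMSE": 6.863621, "R²": -0.04745}],
    index=["Dummy (median)"],
)
table = add_result(
    table,
    "LinearRegression",
    {"MAE": 2.226734, "RMSE": 9.303988, "R²": -0.924707},
)

# Which model is best depends on the metric chosen.
best_mae = table["MAE"].idxmin()
best_rmse = table["RMSE"].idxmin()
print(table)
print(best_mae, best_rmse)
```

Keeping the helper free of side effects (it returns a new DataFrame instead of mutating the old one) makes it safe to re-run cells in any order.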


In [9]:
!git add notebooks/03_baselines.ipynb
!git commit -m "Step 3: evaluated Dummy & Linear baselines"
!git push


fatal: pathspec 'notebooks/03_baselines.ipynb' did not match any files
On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   02_preprocessing.ipynb
	modified:   03_baselines.ipynb

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	../data/data.csv

no changes added to commit (use "git add" and/or "git commit -a")
Everything up-to-date
