# Step 3 – Baseline models


## 3A ― Imports & load essentials (code)

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
import joblib

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

DATA      = Path("../data/asteroids_clean.csv")
PREPROC_P = Path("../data/preprocess.pkl")

df         = pd.read_csv(DATA)
preprocess = joblib.load(PREPROC_P)   # fitted transformer from Step 2


**Purpose** Bring in scikit-learn, load the cleaned dataset, and load the
already-fitted `preprocess.pkl` so every model sees the same transforms.
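A quick way to confirm that a pickled transformer really carries its fitted state is to round-trip it through `joblib` and call `check_is_fitted`. The sketch below uses a toy `StandardScaler` and a hypothetical file name as stand-ins; the actual object stored in `preprocess.pkl` is assumed to be a fitted scikit-learn transformer.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.utils.validation import check_is_fitted

# Toy stand-in for the Step 2 transformer (assumption, not the real object)
scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_preprocess.pkl"   # hypothetical file name
    joblib.dump(scaler, path)
    loaded = joblib.load(path)

# Raises NotFittedError if the object had been saved before fitting
check_is_fitted(loaded)
print(loaded.mean_)   # the learned mean survives the round trip
```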


## 3B ― Recreate X and y (code)


In [2]:
TARGET = "diameter"

DROP_ALWAYS = ["Unnamed: 0", "GM", "G", "IR", "extent",
               "UB", "BV", "spec_B", "spec_T", "name",  # junk
               "per_y"]                                 # duplicate of per

X = df.drop(columns=[TARGET] + DROP_ALWAYS, errors="ignore").copy()
y = df[TARGET].copy()

# Cast condition_code (0–9 quality label) to categorical
X["condition_code"] = X["condition_code"].astype("object")


Mirror the exact column drops and type cast from Step 2 so the data
arriving at `preprocess` has the layout it expects.
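As a small illustration of why `errors="ignore"` and the `object` cast matter, here is a sketch on a toy frame. The column values are made up, and `GM` is deliberately absent to show that the drop does not raise.

```python
import pandas as pd

# Toy frame with invented values; only the column names echo the notebook
toy = pd.DataFrame({
    "diameter": [1.2, 3.4],
    "name": ["a", "b"],
    "condition_code": [0, 9],
    "a": [2.1, 2.7],
})

# "GM" is missing from toy; errors="ignore" skips it instead of raising KeyError
X_toy = toy.drop(columns=["diameter", "name", "GM"], errors="ignore").copy()

# Integer quality labels become object dtype, so a dtype-based column
# selector would treat them as categorical rather than numeric
X_toy["condition_code"] = X_toy["condition_code"].astype("object")

print(list(X_toy.columns))
print(X_toy.dtypes)
```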


## 3C ― Train / validation split (code)

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=RANDOM_STATE
)


Hold out 20% of rows as a validation set (same random seed as
before) so scores are comparable across all models.
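The comparability claim rests on `random_state` making the split deterministic. A minimal check on toy arrays (not the asteroid data) could look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two independent calls with the same seed
split_a = train_test_split(X_toy, y_toy, test_size=0.20, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.20, random_state=42)

# Every returned array (X_train, X_val, y_train, y_val) should match
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
print(len(split_a[0]), len(split_a[1]))   # train rows, validation rows
```

The same rows land in the validation set on every run as long as the input data, its row order, and the seed are unchanged.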


# Baseline 1 – DummyRegressor (median)

## 3D ― Fit & predict (code)

In [5]:
dummy_model = Pipeline([
    ("prep", preprocess),                     # loaded from Step 2; Pipeline.fit refits it on X_train
    ("reg",  DummyRegressor(strategy="median"))
])

dummy_model.fit(X_train, y_train)
y_pred_dummy = dummy_model.predict(X_val)


**Dummy (median)** simply predicts the training-set median for every
asteroid.  This sets the absolute performance floor.
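To make the behaviour concrete, the following sketch fits a `DummyRegressor` on invented targets and shows that the features are ignored entirely: every prediction equals the training median.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Features are placeholders; DummyRegressor never looks at them
X_tr = np.zeros((5, 1))
y_tr = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # invented, skewed targets

dummy = DummyRegressor(strategy="median").fit(X_tr, y_tr)
preds = dummy.predict(np.zeros((3, 1)))

print(np.median(y_tr))   # training median
print(preds)             # the same constant for every row
```

The median is used rather than the mean because it is insensitive to the single large target, which is a reasonable choice when a few very large objects dominate the distribution.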


## 3E ― Metrics (code)

In [11]:
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# 1️⃣  Try to import the new function (exists in sklearn ≥ 1.4)
try:
    from sklearn.metrics import root_mean_squared_error  # noqa
    def _rmse(y_true, y_pred):
        return root_mean_squared_error(y_true, y_pred)
except ImportError:
    # 2️⃣  Fallback for older sklearn: manual sqrt of mean_squared_error
    from sklearn.metrics import mean_squared_error
    def _rmse(y_true, y_pred):
        return np.sqrt(mean_squared_error(y_true, y_pred))

def metrics(y_true, y_pred):
    return {
        "MAE":  mean_absolute_error(y_true, y_pred),
        "RMSE": _rmse(y_true, y_pred),
        "R²":   r2_score(y_true, y_pred)
    }

scores_dummy = metrics(y_val, y_pred_dummy)
scores_dummy


{'MAE': 2.655770833333333,
 'RMSE': 6.86362135647065,
 'R²': -0.04744953304303601}

We record **MAE**, **RMSE** and **R²** for the naïve “predict‐the-median” model.  
These values are the absolute floor: every real model must beat them.
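The slightly negative R² is expected for a constant predictor. For any constant prediction `c`, R² equals `-n * (c - mean(y))² / SS_tot`, which is zero only when `c` is exactly the mean of the evaluated targets. A small numeric check on invented values:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 20.0])   # invented, right-skewed
c = 3.0                                          # a constant prediction, e.g. a median

y_pred = np.full_like(y_true, c)
r2 = r2_score(y_true, y_pred)

# Closed form for a constant predictor
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
identity = -len(y_true) * (c - y_true.mean()) ** 2 / ss_tot

print(r2, identity)
```

On skewed data the median sits below the mean, so a median baseline will typically score a little under zero, consistent with the value reported above.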


# Baseline 2 – LinearRegression

## 3F ― Fit & predict with LinearRegression (code)

In [12]:
lin_model = Pipeline([
    ("prep", preprocess),          # same pre-processor; Pipeline.fit refits it on X_train
    ("reg",  LinearRegression())
])

lin_model.fit(X_train, y_train)
y_pred_lin = lin_model.predict(X_val)


A straight-line model in the engineered feature space.  
Runs instantly and shows whether simple linear relationships capture real signal.
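One detail worth keeping in mind when a pre-fitted transformer is placed inside a `Pipeline`: calling `fit` on the pipeline fits the transformer again on the data passed in, so the state loaded from disk is not what is used for prediction. The sketch below shows this with a toy `StandardScaler` standing in for `preprocess`; all data is invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# "Pre-fitted" transformer: learned mean is 5.0
scaler = StandardScaler().fit(np.array([[0.0], [10.0]]))
mean_before = scaler.mean_.copy()

X_tr = np.array([[1.0], [2.0], [3.0]])
y_tr = np.array([2.0, 4.0, 6.0])

pipe = Pipeline([("prep", scaler), ("reg", LinearRegression())])
pipe.fit(X_tr, y_tr)   # fits the scaler again, this time on X_tr

mean_after = pipe.named_steps["prep"].mean_
print(mean_before, mean_after)
print(pipe.predict(np.array([[4.0]])))
```

Refitting on `X_train` only is usually what you want, since it keeps validation rows out of the learned statistics.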


## 3G ― Metrics for the LinearRegression baseline (code)

In [13]:
# 3-G — evaluate LinearRegression
scores_lin = metrics(y_val, y_pred_lin)
scores_lin


{'MAE': 2.226734298170547, 'RMSE': 9.30398774148489, 'R²': -0.9247074738179806}

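Compared with the Dummy baseline, MAE drops (2.66 → 2.23), but RMSE rises
(6.86 → 9.30) and R² falls (-0.05 → -0.92). The linear model is closer on a
typical asteroid, yet the higher RMSE suggests it makes a few very large
errors, so it does not clearly beat the baseline.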
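MAE and RMSE can move in opposite directions, as the two score dictionaries above show (MAE is lower for LinearRegression while RMSE is higher). The toy comparison below, with invented predictions, illustrates how a handful of large misses produces that pattern because RMSE squares each error before averaging.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.zeros(10)

pred_a = np.full(10, 3.0)               # uniformly mediocre: every error is 3
pred_b = np.array([1.0] * 9 + [15.0])   # mostly close, one large miss

mae_a = mean_absolute_error(y_true, pred_a)
mae_b = mean_absolute_error(y_true, pred_b)

# sqrt of MSE works on any scikit-learn version
rmse_a = np.sqrt(mean_squared_error(y_true, pred_a))
rmse_b = np.sqrt(mean_squared_error(y_true, pred_b))

print(f"A: MAE={mae_a:.2f} RMSE={rmse_a:.2f}")
print(f"B: MAE={mae_b:.2f} RMSE={rmse_b:.2f}")
```

Model B wins on MAE and loses on RMSE. Inspecting the largest residuals of the linear model would be one way to check whether the same thing is happening here.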


## 3H ― Side-by-side comparison table (code)

In [14]:
# 3-H — put both rows in one DataFrame
import pandas as pd

pd.DataFrame(
    [scores_dummy, scores_lin],
    index=["Dummy (median)", "LinearRegression"]
)


| Model | MAE | RMSE | R² |
|---|---|---|---|
| Dummy (median) | 2.655771 | 6.863621 | -0.04745 |
| LinearRegression | 2.226734 | 9.303988 | -0.924707 |


A single table makes the trade-off clear: LinearRegression improves MAE over
the naïve baseline but is worse on RMSE and R².  
Future models (Random Forest, Gradient Boosting, …) will be added as extra rows.
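One possible way to grow the table is to keep the score dictionaries in a single mapping keyed by model name and rebuild the DataFrame from it. The sketch below hard-codes the two baseline scores (rounded from the outputs above) so that it runs on its own; the `results` name and the commented-out Random Forest line are illustrative assumptions.

```python
import pandas as pd

# Scores rounded from the two baselines in this notebook
results = {
    "Dummy (median)":   {"MAE": 2.656, "RMSE": 6.864, "R²": -0.047},
    "LinearRegression": {"MAE": 2.227, "RMSE": 9.304, "R²": -0.925},
}

# A later model would add one more entry, for example:
# results["RandomForest"] = metrics(y_val, y_pred_rf)   # hypothetical names

table = pd.DataFrame.from_dict(results, orient="index")
print(table.sort_values("MAE"))
```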


In [15]:
!git add notebooks/03_baselines.ipynb
!git commit -m "Step 3: evaluated Dummy & Linear baselines"
!git push


fatal: pathspec 'notebooks/03_baselines.ipynb' did not match any files
On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   02_preprocessing.ipynb

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	../data/data.csv
	03_baselines.ipynb

no changes added to commit (use "git add" and/or "git commit -a")
Everything up-to-date
