
# 8. Training a regression model (DAT158-style)

In this notebook we train a simple, curriculum-friendly **Ridge regression** to examine **whether engine size** affects car price.
We use our own `train.csv` and `test.csv`, which already exist in the project.


## 8.1 Load the data

In [1]:

import pandas as pd

df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

df_train.head()


Unnamed: 0,id,brand,model,model_year,milage,fuel_type,engine,transmission,ext_col,int_col,accident,clean_title,price
0,0,MINI,Cooper S Base,2007,213000,Gasoline,172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel,A/T,Yellow,Gray,None reported,Yes,4200.0
1,1,Lincoln,LS V8,2002,143250,Gasoline,252.0HP 3.9L 8 Cylinder Engine Gasoline Fuel,A/T,Silver,Beige,At least 1 accident or damage reported,Yes,4999.0
2,2,Chevrolet,Silverado 2500 LT,2002,136731,E85 Flex Fuel,320.0HP 5.3L 8 Cylinder Engine Flex Fuel Capab...,A/T,Blue,Gray,None reported,Yes,13900.0
3,3,Genesis,G90 5.0 Ultimate,2017,19500,Gasoline,420.0HP 5.0L 8 Cylinder Engine Gasoline Fuel,Transmission w/Dual Shift Mode,Black,Black,None reported,Yes,45000.0
4,4,Mercedes-Benz,Metris Base,2021,7388,Gasoline,208.0HP 2.0L 4 Cylinder Engine Gasoline Fuel,7-Speed A/T,Black,Beige,None reported,Yes,97500.0



## 8.2 Extract engine size from text (`engine` → `engine_size`)

Many rows describe the engine as free text (e.g. `"240.0HP 2.0L 4 Cylinder Engine"`). We extract the displacement in liters; rows without a liter figure get `NaN`.


In [2]:

import re
import numpy as np

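def extract_engine_size(s):
    """Return the displacement in liters found in an engine description, or NaN."""
    # A number (digits with an optional decimal part) followed by "L", e.g. "1.6L".
    m = re.search(r'(\d+(?:\.\d+)?)\s*L', str(s))
    return float(m.group(1)) if m else np.nan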

df_train['engine_size'] = df_train['engine'].apply(extract_engine_size)
df_test['engine_size'] = df_test['engine'].apply(extract_engine_size)

df_train[['engine','engine_size']].head()


Unnamed: 0,engine,engine_size
0,172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel,1.6
1,252.0HP 3.9L 8 Cylinder Engine Gasoline Fuel,3.9
2,320.0HP 5.3L 8 Cylinder Engine Flex Fuel Capab...,5.3
3,420.0HP 5.0L 8 Cylinder Engine Gasoline Fuel,5.0
4,208.0HP 2.0L 4 Cylinder Engine Gasoline Fuel,2.0
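
Not every engine description has to contain a liter figure, so it can help to see what the extraction returns in those cases. The sketch below is a standalone variant of the extraction step: the helper name `extract_liters` and all sample strings except the first are hypothetical, and the pattern requires digits with an optional decimal part.

```python
import re
import numpy as np

# Hypothetical helper mirroring the extraction step: a number followed by "L".
LITER_RE = re.compile(r'(\d+(?:\.\d+)?)\s*L')

def extract_liters(text):
    """Return the liter value in an engine description, or NaN if none is found."""
    m = LITER_RE.search(str(text))
    return float(m.group(1)) if m else np.nan

samples = [
    "172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel",  # taken from the table above
    "3.5 Liter",       # hypothetical: space between number and unit
    "Electric Motor",  # hypothetical: no displacement at all
    None,              # hypothetical: missing value
]

results = [extract_liters(s) for s in samples]
share_missing = float(np.mean(np.isnan(results)))

for s, r in zip(samples, results):
    print(repr(s), "->", r)
print("share without engine size:", share_missing)
```

Rows that end up as `NaN` are not dropped here; in the full model they are filled in by the median imputer in section 8.4.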



## 8.3 Simple feature engineering

We add the car's **age** (`age = 2025 - model_year`) and **base colors** to reduce the number of categories.


In [3]:

ANCHOR_YEAR = 2025

df_train['age'] = ANCHOR_YEAR - df_train['model_year']
df_test['age'] = ANCHOR_YEAR - df_test['model_year']

BASE_COLORS = {
    'black':'Black','white':'White','gray':'Gray','grey':'Gray','silver':'Silver',
    'blue':'Blue','red':'Red','green':'Green','brown':'Brown','beige':'Beige',
    'yellow':'Yellow','gold':'Yellow','orange':'Yellow'
}
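def to_base_color(s):
    """Map a free-text color name to a base color; unmatched values become 'Other'."""
    s = str(s).lower()
    # Substring match, checked in the insertion order of BASE_COLORS.
    for k, v in BASE_COLORS.items():
        if k in s:
            return v
    return 'Other'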

df_train['ext_base'] = df_train['ext_col'].map(to_base_color)
df_test['ext_base']  = df_test['ext_col'].map(to_base_color)
df_train['int_base'] = df_train['int_col'].map(to_base_color)
df_test['int_base']  = df_test['int_col'].map(to_base_color)

df_train[['ext_col','ext_base','int_col','int_base','age']].head()


Unnamed: 0,ext_col,ext_base,int_col,int_base,age
0,Yellow,Yellow,Gray,Gray,18
1,Silver,Silver,Beige,Beige,23
2,Blue,Blue,Gray,Gray,23
3,Black,Black,Black,Black,8
4,Black,Black,Beige,Beige,4
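
The color mapping is a substring match, so longer marketing names collapse to a base color while anything unrecognized (including missing values) falls into `Other`. A small standalone sketch of that behavior, where every color name except `Yellow` is made up for illustration:

```python
import pandas as pd

BASE_COLORS = {
    'black': 'Black', 'white': 'White', 'gray': 'Gray', 'grey': 'Gray', 'silver': 'Silver',
    'blue': 'Blue', 'red': 'Red', 'green': 'Green', 'brown': 'Brown', 'beige': 'Beige',
    'yellow': 'Yellow', 'gold': 'Yellow', 'orange': 'Yellow',
}

def to_base_color(s):
    """First base color whose key occurs in the lower-cased name, else 'Other'."""
    s = str(s).lower()
    for key, base in BASE_COLORS.items():
        if key in s:
            return base
    return 'Other'

# Hypothetical color names; None stands in for a missing value.
colors = pd.Series(["Yellow", "Silver Ice Metallic", "Midnight Black", "Dark Graphite", None])
mapped = colors.map(to_base_color)

print(pd.DataFrame({"raw": colors, "base": mapped}))
print(mapped.value_counts())
```

Checking `value_counts()` on the real `ext_base` and `int_base` columns in the same way would show how much of the data ends up in `Other`.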


In [9]:
total_rows = len(df_train)
missing_prices = df_train['price'].isna().sum()
present_prices = total_rows - missing_prices
percent_missing = (missing_prices / total_rows) * 100

print(f"Total rows in training data: {total_rows:,}")
print(f"Missing prices: {missing_prices:,} ({percent_missing:.2f}%)")
print(f"Rows with a valid price: {present_prices:,}")

Total rows in training data: 71,353
Missing prices: 1 (0.00%)
Rows with a valid price: 71,352


In [10]:
# Drop the single row without a price (1 of 71,353)
df_train = df_train.dropna(subset=['price']).reset_index(drop=True)
print("After removal:", len(df_train), "rows left")

After removal: 71352 rows left



## 8.4 Preprocessing and model (scikit-learn Pipeline)

We use a `ColumnTransformer` to:
- impute numeric columns (median) and scale them,
- impute categorical columns and one-hot encode them,
- binarize `accident` and `clean_title` directly in scikit-learn.


In [12]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, OrdinalEncoder, FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge

features = [
    'engine_size', 'model_year', 'milage', 'fuel_type', 'transmission',
    'ext_base', 'int_base', 'age', 'accident', 'clean_title'
]
X = df_train[features].copy()
y = df_train['price'].copy()

num_cols = ['engine_size','model_year','milage','age']
cat_cols = ['fuel_type','transmission','ext_base','int_base']

def is_accident_reported(X):
    """Converts accident column to binary: 1 if reported, 0 if None reported."""
    return (X != 'None reported').astype(int)

accident_bin = Pipeline([
    ('imp', SimpleImputer(strategy='most_frequent')),
    ('to_bin', FunctionTransformer(is_accident_reported))
])
clean_ord = Pipeline([
    ('imp', SimpleImputer(strategy='most_frequent')),
    ('ord', OrdinalEncoder(categories=[['No','Yes']]))
])

numeric_pipe = Pipeline([
    ('imp', SimpleImputer(strategy='median')),
    ('scale', StandardScaler())
])
categorical_pipe = Pipeline([
    ('imp', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])

preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_pipe, num_cols),
        ('cat', categorical_pipe, cat_cols),
        ('acc_bin', accident_bin, ['accident']),
        ('ct_ord', clean_ord, ['clean_title']),
    ],
    remainder='drop'
)

ridge_model = Pipeline([
    ('prep', preprocess),
    ('model', Ridge(alpha=1.0, random_state=42))
])
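
To see what the numeric and categorical branches do, it can help to run the same kind of `ColumnTransformer` on a tiny frame. The sketch below uses made-up rows and only two columns; `sparse_threshold=0` is set here purely so the result prints as a dense array, and is not part of the pipeline above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows with one missing value per column.
train = pd.DataFrame({
    "milage": [10_000.0, 50_000.0, np.nan, 120_000.0],
    "fuel_type": ["Gasoline", "Diesel", "Gasoline", np.nan],
})
# A new row with a category that was never seen during fit.
new = pd.DataFrame({"milage": [30_000.0], "fuel_type": ["Hydrogen"]})

prep = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("imp", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), ["milage"]),
        ("cat", Pipeline([
            ("imp", SimpleImputer(strategy="most_frequent")),
            ("ohe", OneHotEncoder(handle_unknown="ignore")),
        ]), ["fuel_type"]),
    ],
    sparse_threshold=0,  # force dense output for readability
)

Xt = prep.fit_transform(train)   # 1 scaled column + 2 one-hot columns
Xnew = prep.transform(new)       # unseen category -> all-zero one-hot part

print(Xt.shape)
print(Xnew)
```

With `handle_unknown='ignore'`, a category that only appears in `test.csv` does not raise an error; its one-hot columns are simply all zero.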


## 8.5 Train/validate on `train.csv` (80/20 hold-out)

We use a simple hold-out split to measure **MAE**, **RMSE**, and **R²**.


In [13]:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from math import sqrt

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

ridge_model.fit(X_tr, y_tr)
y_pred = ridge_model.predict(X_val)

mae = mean_absolute_error(y_val, y_pred)
rmse = sqrt(mean_squared_error(y_val, y_pred))
r2 = r2_score(y_val, y_pred)

print(f"MAE:  {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R²:   {r2:.3f}")

MAE:  22,898
RMSE: 66,862
R²:   0.122
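
The three metrics can be written out by hand, which makes it easier to read the numbers above. The sketch below uses four made-up prices and predictions, one of them far off, and checks the manual formulas against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values: three small errors and one large one.
y_true = np.array([10_000.0, 20_000.0, 30_000.0, 100_000.0])
y_hat = np.array([12_000.0, 18_000.0, 33_000.0, 60_000.0])

err = y_true - y_hat
mae = np.mean(np.abs(err))                      # average absolute error
rmse = np.sqrt(np.mean(err ** 2))               # squares errors, so large misses dominate
ss_res = np.sum(err ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot                      # 0 means no better than predicting the mean

print(f"MAE:  {mae:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"R2:   {r2:.3f}")

# The manual formulas agree with scikit-learn's implementations.
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_hat)))
assert np.isclose(r2, r2_score(y_true, y_hat))
```

Because RMSE squares each error, a single large miss pulls it well above MAE. An RMSE that is several times the MAE, as in the hold-out result above, is consistent with a small number of very large errors, although the notebook does not verify this directly.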



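## 8.6 What does engine size mean in the model?

To get a sense of the direction of the relationship, we fit a **numeric-only** Ridge without the categorical features
(which makes the coefficients easier to interpret) and check the sign of `engine_size`. The features are standardized, so each coefficient is the change in predicted price per one standard deviation of that feature. In the output below, `engine_size` has a **positive** coefficient. Note that `age` is an exact linear function of `model_year`, so the two share a single effect that is split into equal and opposite coefficients.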


In [14]:

# Numeric-only pipeline: impute, standardize, then Ridge.
# Pipeline, SimpleImputer, StandardScaler, Ridge and train_test_split are imported in earlier cells.
num_only = Pipeline([
    ('imp', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
    ('ridge', Ridge(alpha=1.0, random_state=42))
])

Xn = df_train[['engine_size','model_year','milage','age']].copy()
yn = df_train['price'].copy()

# Keep only rows where an engine size could be extracted.
mask = Xn['engine_size'].notna()
Xn_tr, Xn_val, yn_tr, yn_val = train_test_split(Xn[mask], yn[mask], test_size=0.2, random_state=42)

num_only.fit(Xn_tr, yn_tr)
# Coefficients are on the standardized scale (price change per one standard deviation).
coef = num_only.named_steps['ridge'].coef_
pd.DataFrame({'feature': Xn.columns, 'coef': coef}).sort_values('coef', ascending=False)


Unnamed: 0,feature,coef
0,engine_size,7527.672755
1,model_year,3163.193706
3,age,-3163.193706
2,milage,-18157.959465



## 8.7 Train on all of `train.csv` and predict `test.csv`

Finally, we train on all training rows and create `predicted_price` for `test.csv`. This is useful in Streamlit or in a report.


In [15]:

ridge_model.fit(X, y)

X_test = df_test[features].copy()
df_test['predicted_price'] = ridge_model.predict(X_test)

df_test[['predicted_price']].head()


Unnamed: 0,predicted_price
0,14273.259079
1,62798.861428
2,63797.729717
3,38354.455137
4,37473.155439



## 8.8 Save the model and predictions

We save both the model (the pipeline) and a CSV with the predictions.


In [16]:

import os, joblib

os.makedirs('models', exist_ok=True)
joblib.dump({'pipeline': ridge_model, 'anchor_year': ANCHOR_YEAR}, 'models/engine_ridge.pkl')

df_test[['predicted_price']].to_csv('engine_price_predictions.csv', index=False)
print("Saved: models/engine_ridge.pkl and engine_price_predictions.csv")


Saved: models/engine_ridge.pkl and engine_price_predictions.csv
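
Reading the saved bundle back, for example from a Streamlit app, could look like the sketch below. It is self-contained: it fits a small stand-in pipeline on made-up numbers and writes to a temporary directory instead of `models/`, but the dictionary layout (`pipeline`, `anchor_year`) follows the cell above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (engine_size, age) and prices, made up for the round trip.
X_demo = np.array([[1.6, 18.0], [3.9, 23.0], [5.0, 8.0], [2.0, 4.0]])
y_demo = np.array([4_200.0, 4_999.0, 45_000.0, 97_500.0])
pipe = Pipeline([("scale", StandardScaler()), ("ridge", Ridge(alpha=1.0))]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "engine_ridge.pkl")
    joblib.dump({"pipeline": pipe, "anchor_year": 2025}, path)
    bundle = joblib.load(path)       # load while the temporary file still exists

restored = bundle["pipeline"]
anchor_year = bundle["anchor_year"]  # needed to recompute `age` for new cars
same = np.allclose(restored.predict(X_demo), pipe.predict(X_demo))

print("anchor year:", anchor_year)
print("predictions identical after reload:", same)
```

One caveat for the real pipeline: pickle stores plain functions by reference, so `is_accident_reported` must be defined or importable under the same module path in the process that loads the file. Moving it into a small shared module is one way to handle that.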
