# Car Price Prediction with CatBoost

This notebook trains and evaluates a **CatBoostRegressor** model to predict car prices using the `cleaned_autocentral_data.csv` dataset.

We will:
- Load and inspect the data
- Separate **features (X)** and **target (y)**
- Let CatBoost handle categorical features natively
- Split the data into **train/test** sets
- Train a **CatBoostRegressor** with explicitly chosen hyperparameters and early stopping
- Evaluate the model with **RMSE**, **MAE**, and **R²** and briefly interpret the results.


In [None]:
# If CatBoost is not installed, run this (uncomment the next line):
# !pip install catboost scikit-learn pandas numpy

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

from catboost import CatBoostRegressor

RANDOM_STATE = 42
DATA_PATH = "cleaned_autocentral_data.csv"
TARGET_COLUMN = "price"


In [None]:
# Load dataset and separate features/target

df = pd.read_csv(DATA_PATH)

if TARGET_COLUMN not in df.columns:
    raise ValueError(f"Target column '{TARGET_COLUMN}' not found in dataset.")

X = df.drop(columns=[TARGET_COLUMN])
y = df[TARGET_COLUMN]

print("Data shape:", df.shape)
print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("Columns:", list(X.columns))


Data shape: (11387, 8)
Features shape: (11387, 7)
Target shape: (11387,)
Columns: ['year', 'brand', 'model', 'mileage', 'cv', 'fuel_type', 'transmission']


In [None]:
# Identify categorical feature columns for CatBoost

categorical_features = X.select_dtypes(include=["object", "category"]).columns.tolist()
print("Categorical features:", categorical_features)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=RANDOM_STATE,
)

print("Train size:", X_train.shape[0])
print("Test size:", X_test.shape[0])


Categorical features: ['brand', 'model']
Train size: 9109
Test size: 2278
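
The output above lists only `brand` and `model` as categorical, although `fuel_type` and `transmission` sound categorical too. One possible explanation, and this is an assumption about the cleaned CSV rather than something shown here, is that those two columns are already integer-coded, in which case `select_dtypes` skips them. A minimal sketch on made-up data (the toy values and the `integer_coded` list are hypothetical):

```python
import pandas as pd

# Made-up rows that mimic the notebook's column names.
toy = pd.DataFrame({
    "brand": pd.Series(["audi", "bmw", "seat"], dtype="object"),
    "model": pd.Series(["a3", "x1", "leon"], dtype="object"),
    "fuel_type": [0, 1, 0],        # assumed integer codes
    "transmission": [1, 1, 0],     # assumed integer codes
    "mileage": [42_000.0, 88_500.0, 15_300.0],
})

# Dtype-based detection only finds object/category columns.
by_dtype = toy.select_dtypes(include=["object", "category"]).columns.tolist()

# Integer-coded categoricals have to be named explicitly.
integer_coded = ["fuel_type", "transmission"]
all_categorical = by_dtype + [c for c in integer_coded if c not in by_dtype]

print("Detected by dtype:", by_dtype)
print("With integer-coded columns:", all_categorical)
```

CatBoost accepts integer or string columns in `cat_features`, so a list like `all_categorical` could be passed to `fit` as-is. Whether treating these columns as categorical improves the model is an empirical question for this dataset.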


In [None]:
# Handle missing values in categorical features

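# Fill NaN values in categorical columns with 'Unknown'.
# Copy first so the assignments below do not raise SettingWithCopyWarning.
X_train = X_train.copy()
X_test = X_test.copy()
for col in categorical_features:
    X_train[col] = X_train[col].fillna('Unknown')
    X_test[col] = X_test[col].fillna('Unknown')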

print("Missing values after handling:")
print(X_train[categorical_features].isnull().sum())

Missing values after handling:
brand    0
model    0
dtype: int64


In [None]:
# Build and train CatBoostRegressor

cat_model = CatBoostRegressor(
    loss_function="RMSE",
    depth=10,
    learning_rate=0.02,
    n_estimators=2000,
    random_seed=RANDOM_STATE,
    subsample=0.75,
    colsample_bylevel=0.75,
    l2_leaf_reg=2,
    min_data_in_leaf=5,
    verbose=100,
    early_stopping_rounds=100,
)

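# Fit the model. CatBoost accepts pandas DataFrames and column names for categorical features.
# Note: eval_set is the test set here, so early stopping and best-iteration selection
# are tuned on it and the test metrics reported later are likely somewhat optimistic.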
cat_model.fit(
    X_train,
    y_train,
    cat_features=categorical_features,
    eval_set=(X_test, y_test),
)


0:	learn: 48679.8927550	test: 49410.0306323	best: 49410.0306323 (0)	total: 37ms	remaining: 1m 13s
100:	learn: 23019.4575733	test: 26060.0564355	best: 26060.0564355 (100)	total: 3.88s	remaining: 1m 12s
200:	learn: 19398.9984322	test: 23793.1151816	best: 23793.1151816 (200)	total: 7.84s	remaining: 1m 10s
300:	learn: 18029.3284023	test: 23244.2486729	best: 23244.2486729 (300)	total: 11.9s	remaining: 1m 7s
400:	learn: 17110.7418401	test: 22990.9041010	best: 22990.9041010 (400)	total: 15.9s	remaining: 1m 3s
500:	learn: 16402.9048590	test: 22796.0065863	best: 22796.0065863 (500)	total: 19.

<catboost.core.CatBoostRegressor at 0x1957fcabcb0>
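
The training cell uses the test set as `eval_set`, which means early stopping sees the same data that is later used for the final metrics. A common alternative is to hold out a separate validation set for early stopping and touch the test set only once at the end. The sketch below shows only the splitting step, on synthetic data; the `*_demo` names and the 60/20/20 proportions are assumptions for illustration, not the notebook's setup:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

# Step 1: carve off the final test set (20% of all rows).
X_trainval, X_test_demo, y_trainval, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Step 2: split the remainder into train and validation.
# 25% of the remaining 80% equals 20% of the full data.
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))

# The model would then be fitted with eval_set=(X_val_demo, y_val_demo)
# and evaluated once on (X_test_demo, y_test_demo).
```

With this layout the reported test metrics are not influenced by the choice of stopping iteration, at the cost of training on fewer rows.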

In [None]:
# Evaluate CatBoost model

y_pred = cat_model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("=== CatBoost Regression: Car Price Prediction ===")
print(f"Number of samples (train): {len(X_train)}")
print(f"Number of samples (test) : {len(X_test)}")
print()
print("Performance on test set:")
print(f"- RMSE (Root Mean Squared Error): {rmse:,.2f}")
print(f"- MAE  (Mean Absolute Error)    : {mae:,.2f}")
print(f"- R² Score                      : {r2:.4f}")


=== CatBoost Regression: Car Price Prediction ===
Number of samples (train): 9109
Number of samples (test) : 2278

Performance on test set:
- RMSE (Root Mean Squared Error): 22,987.24
- MAE  (Mean Absolute Error)    : 10,864.73
- R² Score                      : 0.7886
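
To put the RMSE in context, it helps to know what a trivial model would score. Because `r2_score` is defined as 1 − SS_res / SS_tot, the reported RMSE and R² together imply the RMSE of a baseline that always predicts the mean test price. The arithmetic below uses only the two numbers printed above; the variable names are illustrative:

```python
import math

# Values copied from the evaluation output above.
rmse_reported = 22_987.24
r2_reported = 0.7886

# R² = 1 - MSE / Var(y_test), so the spread of test prices around their mean,
# which is the RMSE of a constant mean predictor, is RMSE / sqrt(1 - R²).
baseline_rmse = rmse_reported / math.sqrt(1.0 - r2_reported)

# Fraction of the baseline RMSE that the model removes.
rmse_reduction = 1.0 - rmse_reported / baseline_rmse

print(f"Implied mean-baseline RMSE: {baseline_rmse:,.0f}")
print(f"RMSE reduction vs. baseline: {rmse_reduction:.1%}")
```

The implied baseline is roughly 50,000, which is in line with the test error logged at iteration 0 of training, so the model cuts the error of a mean-only predictor by a little more than half.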


In [None]:
# Save the trained CatBoost model to disk

import joblib

MODEL_PATH_CAT = "catboost_car_price_model.pkl"

joblib.dump(cat_model, MODEL_PATH_CAT)
print(f"Saved CatBoost model to {MODEL_PATH_CAT}")


### Interpretation of metrics

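- **RMSE (Root Mean Squared Error)**: Squares each error before averaging, so large errors weigh more heavily than in MAE. It is in the same units as the target. Here the RMSE (22,987) is roughly twice the MAE (10,865), which suggests that a minority of cars with large price errors accounts for much of the RMSE.
- **MAE (Mean Absolute Error)**: Average absolute difference between predicted and true prices, in the same units as the target. A lower MAE means predictions are closer on average.
- **R² Score**: Proportion of variance in car prices explained by the model. A value of 1 is a perfect fit and 0 matches a model that always predicts the mean; on held-out data it can be negative if the model does worse than that. Here the model explains about 79% of the variance in test prices.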
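
The definitions above can be checked by hand on a toy example. The sketch below computes the three metrics with plain NumPy on made-up prices; the `regression_metrics` helper is illustrative and not part of the notebook. It shows that a single large error raises RMSE while leaving MAE unchanged, and that R² becomes negative for predictions worse than the mean:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE and R² computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    ss_res = float(np.sum(errors ** 2))                       # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))     # total sum of squares
    return {
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "mae": float(np.mean(np.abs(errors))),
        "r2": 1.0 - ss_res / ss_tot,
    }

# Made-up prices, for illustration only.
y_true = [10_000, 20_000, 30_000, 40_000]
y_even = [11_000, 19_000, 31_000, 39_000]      # every prediction off by 1,000
y_outlier = [10_000, 20_000, 30_000, 36_000]   # one prediction off by 4,000
y_reversed = [40_000, 30_000, 20_000, 10_000]  # worse than predicting the mean

even = regression_metrics(y_true, y_even)
outlier = regression_metrics(y_true, y_outlier)
reversed_ = regression_metrics(y_true, y_reversed)

print("even errors :", even)
print("one outlier :", outlier)
print("reversed    :", reversed_)
```

Both the evenly spread errors and the single outlier give an MAE of 1,000, but the outlier case doubles the RMSE to 2,000. The reversed predictions give an R² of -3, which is why R² on a test set is not bounded below by 0.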

You can compare these metrics directly with those of the XGBoost model to see which algorithm performs better on this dataset, provided both models are evaluated on the same test split.
