# Car Price Prediction with XGBoost

This notebook trains and evaluates an **XGBoost (XGBRegressor)** model to predict car prices using the `cleaned_autocentral_data.csv` dataset.

We will:
- Load and inspect the data
- Separate **features (X)** and **target (y)**
- Preprocess data with **One-Hot Encoding** for categorical features and **scaling** for numeric ones (via `ColumnTransformer` and `Pipeline`)
- Split into **train/test** sets with a fixed random state
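- Define an **XGBoost regressor** with reasonable starting hyperparameters (hand-picked values, not the library defaults)
- **Tune hyperparameters** using `RandomizedSearchCV` with 5-fold cross-validation on the training set
- Evaluate the tuned model on the test set using **RMSE**, **MAE**, and **R²**, and briefly interpret the results
- Save the tuned pipeline to disk with `joblib`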


In [None]:
# If XGBoost is not installed, run this (uncomment the next line):
# !pip install xgboost scikit-learn pandas numpy

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

from xgboost import XGBRegressor

RANDOM_STATE = 42
DATA_PATH = "cleaned_autocentral_data.csv"
TARGET_COLUMN = "price"


In [None]:
# Load dataset and separate features/target

df = pd.read_csv(DATA_PATH)

if TARGET_COLUMN not in df.columns:
    raise ValueError(f"Target column '{TARGET_COLUMN}' not found in dataset.")

X = df.drop(columns=[TARGET_COLUMN])
y = df[TARGET_COLUMN]

print("Data shape:", df.shape)
print("Features shape:", X.shape)
print("Target shape:", y.shape)
print("Columns:", list(X.columns))


In [None]:
# Build preprocessing: One-Hot Encode categoricals, scale numerics

categorical_features = X.select_dtypes(include=["object", "category"]).columns.tolist()
numeric_features = X.select_dtypes(exclude=["object", "category"]).columns.tolist()

print("Categorical features:", categorical_features)
print("Numeric features:", numeric_features)

categorical_transformer = OneHotEncoder(handle_unknown="ignore")
numeric_transformer = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)
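
To make the two transformations concrete, here is a plain-Python sketch of what one-hot encoding and standardization compute. It is an illustration only, not scikit-learn's implementation, and the `fuel` and `mileage` values are hypothetical rather than taken from the dataset. It shows the effect of `handle_unknown="ignore"`: a category that was not seen during fitting is encoded as a row of zeros instead of raising an error.

```python
import math

def fit_one_hot(values):
    """Learn the sorted list of categories seen during fitting."""
    return sorted(set(values))

def one_hot(value, categories):
    """Encode one value; a category unseen at fit time becomes all zeros."""
    return [1.0 if value == c else 0.0 for c in categories]

def fit_standardizer(values):
    """Learn the mean and population standard deviation (ddof=0) of a column."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, std

def standardize(value, mean, std):
    """Shift and rescale one value; assumes std is non-zero."""
    return (value - mean) / std

# Hypothetical training columns (not from cleaned_autocentral_data.csv).
fuel_train = ["petrol", "diesel", "petrol", "hybrid"]
mileage_train = [20_000, 40_000, 60_000, 80_000]

categories = fit_one_hot(fuel_train)
mean, std = fit_standardizer(mileage_train)

known = one_hot("diesel", categories)      # category seen during fitting
unknown = one_hot("electric", categories)  # unseen category: all zeros
scaled = [standardize(v, mean, std) for v in mileage_train]

print("Categories:", categories)
print("diesel   ->", known)
print("electric ->", unknown)
print("Scaled mileage:", [round(s, 3) for s in scaled])
```

Because the statistics are learned inside the pipeline, they come only from the rows passed to `fit`, so test-set rows never influence the encoding or the scaling.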


In [None]:
# Train/test split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=RANDOM_STATE,
)

print("Train size:", X_train.shape[0])
print("Test size:", X_test.shape[0])


In [None]:
# Build the XGBoost pipeline with reasonable starting hyperparameters
# (hand-picked values, not XGBoost's library defaults)

xgb_reg = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="reg:squarederror",
    random_state=RANDOM_STATE,
    n_jobs=-1,
    eval_metric="rmse",  # Evaluation metric during training
)

base_model = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("regressor", xgb_reg),
    ]
)

base_model


In [None]:
# Hyperparameter tuning with RandomizedSearchCV

param_distributions = {
    "regressor__n_estimators": [200, 300, 500, 800],
    "regressor__max_depth": [3, 4, 6, 8],
    "regressor__learning_rate": [0.01, 0.03, 0.05, 0.1],
    "regressor__subsample": [0.6, 0.8, 1.0],
    "regressor__colsample_bytree": [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(
    estimator=base_model,
    param_distributions=param_distributions,
    n_iter=20,
    scoring="neg_root_mean_squared_error",
    cv=5,
    verbose=1,
    n_jobs=-1,
    random_state=RANDOM_STATE,
)

print("Starting hyperparameter tuning (RandomizedSearchCV)...")
search.fit(X_train, y_train)

# best_score_ holds the negated RMSE; flip the sign to report a positive RMSE
print(f"Best CV RMSE: {-search.best_score_:,.2f}")
print("Best hyperparameters:")
for param, value in search.best_params_.items():
    print(f"  {param}: {value}")

best_model = search.best_estimator_
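
To get a feel for the cost and coverage of the search, the arithmetic can be sketched in plain Python. The candidate values, `n_iter`, and `cv` are copied from the cell above; the sampled combination at the end is only illustrative and will not match what scikit-learn's own sampler picks.

```python
import math
import random

# Same candidate values as the search space defined above.
param_distributions = {
    "regressor__n_estimators": [200, 300, 500, 800],
    "regressor__max_depth": [3, 4, 6, 8],
    "regressor__learning_rate": [0.01, 0.03, 0.05, 0.1],
    "regressor__subsample": [0.6, 0.8, 1.0],
    "regressor__colsample_bytree": [0.6, 0.8, 1.0],
}
n_iter, cv = 20, 5

# Number of distinct combinations an exhaustive grid search would try.
grid_size = math.prod(len(values) for values in param_distributions.values())

# Each sampled combination is fitted once per fold; with the default
# refit=True there is one extra fit on the full training set at the end.
cv_fits = n_iter * cv
coverage = n_iter / grid_size

# Illustrative draw of one combination (independent of scikit-learn's sampler).
rng = random.Random(42)
example_combo = {name: rng.choice(values) for name, values in param_distributions.items()}

print(f"Grid size        : {grid_size}")
print(f"Cross-val fits   : {cv_fits} (+1 refit)")
print(f"Share of grid hit: {coverage:.1%}")
print("Example combination:", example_combo)
```

The search therefore evaluates only a small share of the full grid. Raising `n_iter` widens coverage at a proportional increase in runtime.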


In [None]:
# Evaluate best XGBoost model on the test set

y_pred = best_model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("=== XGBoost Regression: Car Price Prediction ===")
print(f"Number of samples (train): {len(X_train)}")
print(f"Number of samples (test) : {len(X_test)}")
print()
print("Performance on test set:")
print(f"- RMSE (Root Mean Squared Error): {rmse:,.2f}")
print(f"- MAE  (Mean Absolute Error)    : {mae:,.2f}")
print(f"- R² Score                      : {r2:.4f}")


### Interpretation of metrics

- **RMSE (Root Mean Squared Error)**: Square root of the mean squared error, in the same units as the target. Squaring penalizes large errors more strongly, so a low RMSE indicates few large price mistakes.
- **MAE (Mean Absolute Error)**: Average absolute difference between predicted and true prices, in the same units as the target. A lower MAE means predictions are closer on average.
- **R² Score**: Proportion of variance in car prices explained by the model. A value of 1 is a perfect fit and 0 matches always predicting the mean price; on held-out data R² can be negative when the model does worse than that baseline.

You can compare these metrics with other models (e.g., CatBoost) to see which algorithm performs best on this dataset.
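
As a worked example, the three metrics can be computed by hand on a handful of prices. This is a minimal plain-Python sketch; the helper `regression_metrics` and the numbers are hypothetical and are not results from this notebook.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (rmse, mae, r2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

# Hypothetical prices: three small errors and one large miss of 4,000.
y_true = [10_000, 20_000, 30_000, 40_000]
y_pred = [11_000, 19_000, 30_000, 44_000]

rmse, mae, r2 = regression_metrics(y_true, y_pred)
print(f"RMSE: {rmse:,.2f}")  # 2,121.32
print(f"MAE : {mae:,.2f}")   # 1,500.00
print(f"R²  : {r2:.4f}")     # 0.9640

# Baseline: always predict the mean of the true prices.
mean_price = sum(y_true) / len(y_true)
_, _, r2_baseline = regression_metrics(y_true, [mean_price] * len(y_true))
print(f"R² of mean baseline: {r2_baseline:.4f}")  # 0.0000
```

The single 4,000 miss pulls RMSE well above MAE, which is why a large gap between the two usually points to a few badly mispriced cars rather than uniformly noisy predictions.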


In [None]:
# Save the best fitted pipeline (preprocessing + XGBoost regressor) to disk

import joblib

MODEL_PATH_XGB = "xgboost_car_price_model.pkl"

joblib.dump(best_model, MODEL_PATH_XGB)
print(f"Saved XGBoost model to {MODEL_PATH_XGB}")
