# Car Price Modeling - Methods and Effectiveness

## 1. Problem description

The goal of the project is to predict car prices from sales-offer data obtained by web scraping. The training set has 25 columns (including the `ID` and the target `Cena`) that describe vehicle parameters, history, equipment and other factors affecting market value.

### Importing the required libraries

In [100]:
from sklearn.metrics import mean_squared_error
import pandas as pd
import numpy as np
from datetime import datetime
from sklearn.model_selection import KFold, train_test_split
import catboost as cb
import xgboost as xgb
import optuna
from optuna.integration import CatBoostPruningCallback
import lightgbm as lgb
import warnings
warnings.filterwarnings('ignore')

### Loading the data

In [101]:
train_data = pd.read_csv('../data/raw/sales_ads_train.csv')
test_data = pd.read_csv('../data/raw/sales_ads_test.csv')

In [102]:
print(f"Train data shape: {train_data.shape}")
print(f"Test data shape: {test_data.shape}")

Train data shape: (135397, 25)
Test data shape: (72907, 24)


In [103]:
print(train_data['Cena'].describe())

count    1.353970e+05
mean     6.306938e+04
std      8.807748e+04
min      5.850000e+02
25%      1.780000e+04
50%      3.580000e+04
75%      7.599000e+04
max      6.999000e+06
Name: Cena, dtype: float64


## 2. Data preparation

The first step is to prepare the data for modeling: combining the train and test sets, handling missing values and transforming variables.

In [104]:
train_data['data_source'] = 'original'
train_data

Unnamed: 0,ID,Cena,Waluta,Stan,Marka_pojazdu,Model_pojazdu,Wersja_pojazdu,Generacja_pojazdu,Rok_produkcji,Przebieg_km,...,Typ_nadwozia,Liczba_drzwi,Kolor,Kraj_pochodzenia,Pierwszy_wlasciciel,Data_pierwszej_rejestracji,Data_publikacji_oferty,Lokalizacja_oferty,Wyposazenie,data_source
0,1,13900,PLN,Used,Renault,Grand Espace,Gr 2.0T 16V Expression,,2005.0,213000.0,...,minivan,5.0,blue,,,,28/04/2021,"SŁONECZNA 1 - 99-300 Kutno, kutnowski, Łódzkie...","['ABS', 'Electric front windows', 'Drivers air...",original
1,2,25900,PLN,Used,Renault,Megane,1.6 16V 110,III (2008-2016),2010.0,117089.0,...,station_wagon,5.0,silver,,,16/06/2010,04/05/2021,"ul. Wiosenna 8 - 41-407 Imielin, Centrum (Polska)","['ABS', 'Electric front windows', 'Drivers air...",original
2,3,35900,PLN,Used,Opel,Zafira,Tourer 1.6 CDTI ecoFLEX Start/Stop,C (2011-2019),2015.0,115600.0,...,minivan,5.0,white,Denmark,,,03/05/2021,"Sianów, koszaliński, Zachodniopomorskie","['ABS', 'Electric front windows', 'Passengers ...",original
3,4,5999,PLN,Used,Ford,Focus,1.6 TDCi FX Silver / Silver X,Mk2 (2004-2011),2007.0,218000.0,...,compact,5.0,blue,,,27/11/2007,02/05/2021,"Gdańsk, Pomorskie, Przymorze Wielkie","['ABS', 'Electric front windows', 'Drivers air...",original
4,5,44800,PLN,Used,Toyota,Avensis,1.8,III (2009-),2013.0,,...,,4.0,other,Poland,Yes,20/05/2013,02/05/2021,"Świdnik, świdnicki, Lubelskie","['ABS', 'Electric front windows', 'Drivers air...",original
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
135392,135393,45499,PLN,Used,Opel,Astra,,K (2015-),2018.0,136931.0,...,compact,5.0,silver,Poland,Yes,22/08/2018,,"Wyścigowa 58A - 53-012 Wrocław, Krzyki (Polska)","['ABS', 'ASR (traction control)', 'On-board co...",original
135393,135394,269855,PLN,New,Mercedes-Benz,Vito,124 CDI Tourer Lang,W447 (2014-),2021.0,8.0,...,minivan,4.0,white,,,,01/05/2021,"Garbarska 79 - 26-600 Radom, Mazowieckie (Polska)","['ABS', 'Electrically adjustable mirrors', 'Pa...",original
135394,135395,21900,PLN,Used,,Zafira,,B (2005-2014),,179000.0,...,minivan,5.0,black,,,,01/05/2021,"WALERIANY 30 A - 96-330 Waleriany, żyrardowski...","['ABS', 'Electric front windows', 'Drivers air...",original
135395,135396,4450,PLN,Used,Renault,Clio,1.2i RT,II (1998-2012),2001.0,156000.0,...,city_cars,,blue,Germany,,,30/04/2021,"Kcynia, nakielski, Kujawsko-pomorskie","['ABS', 'Electric front windows', 'Passengers ...",original


In [105]:
combined_train_data = train_data.copy()

In [106]:
test_data['data_source'] = 'test'

In [107]:

test_ids = test_data['ID'].values
combined_train_data['is_train'] = 1
test_data['is_train'] = 0
all_data = pd.concat([combined_train_data, test_data], axis=0, ignore_index=True)

In [108]:
KURS_EUR_PLN = 4.5

def przelicz_na_pln(row):
    if pd.notna(row['Waluta']) and row['Waluta'] == 'EUR':
        return row['Cena'] * KURS_EUR_PLN
    else:
        return row['Cena']

In [109]:
all_data['Cena_PLN'] = all_data.apply(przelicz_na_pln, axis=1)
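
The row-wise `apply` above can be slow on a frame of roughly 200,000 rows. A minimal vectorized sketch of the same conversion follows, on a toy frame with made-up values standing in for `all_data`; it assumes, as above, that `EUR` is the only foreign currency.

```python
import numpy as np
import pandas as pd

KURS_EUR_PLN = 4.5  # same fixed rate as in the notebook

# Toy frame standing in for all_data (values are made up for illustration)
demo = pd.DataFrame({
    "Cena": [10000.0, 2000.0, np.nan],
    "Waluta": ["PLN", "EUR", "EUR"],
})

# Multiply only the EUR rows; a missing price stays missing, as in the row-wise function
demo["Cena_PLN"] = np.where(
    demo["Waluta"] == "EUR",
    demo["Cena"] * KURS_EUR_PLN,
    demo["Cena"],
)
print(demo)
```

A comparison against a missing `Waluta` evaluates to `False`, so such rows keep their original price, which matches the `pd.notna` check in `przelicz_na_pln`.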

In [110]:
median_price = all_data.loc[all_data['is_train'] == 1, 'Cena_PLN'].median()
all_data['Cena_PLN'] = all_data['Cena_PLN'].fillna(median_price)

In [None]:
all_data['log_Cena'] = np.log1p(all_data['Cena_PLN'])
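
The short sketch below shows what the `log1p` transform does to the price range reported by `describe()` above, and that `np.expm1` inverts it exactly. The value `0.24` is only an illustrative error size on the log scale.

```python
import numpy as np

# Minimum, median and maximum of Cena taken from the summary printed earlier
prices = np.array([585.0, 35800.0, 6999000.0])

log_prices = np.log1p(prices)    # log(1 + x): compresses the long right tail
restored = np.expm1(log_prices)  # exp(x) - 1: the inverse transform

# An error of 0.24 on the log scale corresponds to the same multiplicative
# factor on (1 + price), whatever the price level
factor = np.exp(0.24)

print(log_prices.round(3), restored, round(float(factor), 3))
```

Because the models are trained on `log_Cena`, their errors are relative rather than absolute: a cheap and an expensive car that are both mispriced by the same percentage contribute about equally to the loss.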

## 3. Feature engineering

Creating new variables is key to improving model quality. We use domain knowledge about the car market to build informative features.


In [150]:
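# Note: age is measured against the notebook's run date, not the offer date (the offers shown above are from 2021), so it shifts on every re-run
current_year = datetime.now().year
all_data['Wiek_pojazdu'] = current_year - all_data['Rok_produkcji']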
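
An alternative is to measure the age at the moment the offer was published, which does not depend on when the notebook is run. The sketch below assumes the `DD/MM/YYYY` format visible in the `Data_publikacji_oferty` column. The column name `Wiek_przy_publikacji` and the fallback to the most common publication year are assumptions made for illustration.

```python
import pandas as pd

# Toy rows; the dates follow the DD/MM/YYYY format visible in the data preview
demo = pd.DataFrame({
    "Rok_produkcji": [2005.0, 2010.0, 2018.0],
    "Data_publikacji_oferty": ["28/04/2021", "04/05/2021", None],
})

pub_date = pd.to_datetime(
    demo["Data_publikacji_oferty"], format="%d/%m/%Y", errors="coerce"
)
pub_year = pub_date.dt.year

# Assumption: a missing publication date falls back to the most common year
pub_year = pub_year.fillna(pub_year.mode().iloc[0])

demo["Wiek_przy_publikacji"] = pub_year - demo["Rok_produkcji"]
print(demo)
```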

In [113]:
all_data['log_Przebieg_km'] = np.log1p(all_data['Przebieg_km'])

In [114]:
all_data['Efektywnosc_silnika'] = all_data['Moc_KM'] / (all_data['Pojemnosc_cm3'] / 1000)
all_data['Efektywnosc_silnika'] = all_data['Efektywnosc_silnika'].replace([np.inf, -np.inf], np.nan)
all_data['Efektywnosc_silnika'] = all_data['Efektywnosc_silnika'].fillna(all_data['Efektywnosc_silnika'].median())

In [115]:
all_data['Sredni_roczny_przebieg'] = all_data['Przebieg_km'] / all_data['Wiek_pojazdu'].replace(0, 0.5)
all_data['Sredni_roczny_przebieg'] = all_data['Sredni_roczny_przebieg'].replace([np.inf, -np.inf], np.nan)
all_data['Sredni_roczny_przebieg'] = all_data['Sredni_roczny_przebieg'].fillna(all_data['Sredni_roczny_przebieg'].median())

In [116]:
import ast  # literal_eval parses list literals without executing scraped text, unlike eval()

if 'Wyposazenie' in all_data.columns:
    if isinstance(all_data['Wyposazenie'].iloc[0], str):
        all_data['Wyposazenie'] = all_data['Wyposazenie'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) and x.startswith('[') else [])
    all_data['Liczba_elementow_wyposazenia'] = all_data['Wyposazenie'].apply(len)

    premium_features = [
        'Leather upholstery', 'GPS navigation', 'Heated front seats', 
        'Xenon lights', 'LED lights', 'Automatic air conditioning',
        'Panoramic roof', 'Electrically adjustable seats', 'Active cruise control'
    ]
    for feature in premium_features:
        all_data[f'ma_{feature.replace(" ", "_")}'] = all_data['Wyposazenie'].apply(
            lambda x: 1 if isinstance(x, list) and any(feature in item for item in x) else 0
        )
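
The equipment parsing and flag logic can be tried on a few toy rows. This sketch parses with `ast.literal_eval` and falls back to an empty list for malformed strings. `parse_equipment` is a hypothetical helper and the sample rows are made up.

```python
import ast

def parse_equipment(raw):
    """Return the equipment list encoded in `raw`, or [] when it cannot be parsed."""
    if not isinstance(raw, str) or not raw.startswith("["):
        return []
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # truncated or otherwise malformed list literal
    return parsed if isinstance(parsed, list) else []

rows = [
    "['ABS', 'Leather upholstery', 'LED lights']",  # well-formed
    "['ABS'",                                       # malformed: missing bracket
    None,                                           # missing value
]

parsed_rows = [parse_equipment(r) for r in rows]
counts = [len(items) for items in parsed_rows]

# Same substring test as the notebook's ma_* flags
has_leather = [
    int(any("Leather upholstery" in item for item in items))
    for items in parsed_rows
]
print(counts, has_leather)
```

The flag test is a substring match, so a premium feature name also matches any longer equipment label that contains it.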

### 3.1 Filling in missing values and target encoding

In [117]:
numeric_cols = ['Rok_produkcji', 'Przebieg_km', 'Moc_KM', 'Pojemnosc_cm3', 
                'Liczba_drzwi', 'Liczba_elementow_wyposazenia', 'Efektywnosc_silnika',
                'Wiek_pojazdu', 'log_Przebieg_km', 'Sredni_roczny_przebieg']

for col in numeric_cols:
    if col in all_data.columns and all_data[col].isnull().sum() > 0:
        median_val = all_data.loc[(all_data['is_train'] == 1) & (all_data['data_source'] == 'original'), col].median()
        all_data[col] = all_data[col].fillna(median_val)

In [118]:
categorical_cols = ['Stan', 'Marka_pojazdu', 'Model_pojazdu', 'Rodzaj_paliwa', 
                   'Naped', 'Skrzynia_biegow', 'Typ_nadwozia', 'Kolor', 'Kraj_pochodzenia']

for col in categorical_cols:
    if col in all_data.columns and all_data[col].isnull().sum() > 0:
        all_data[col] = all_data[col].fillna('nieznany')

In [119]:
train_marka_mean_price = all_data.loc[(all_data['is_train'] == 1) & 
                                      (all_data['data_source'] == 'original')].groupby('Marka_pojazdu')['log_Cena'].mean()
all_data['Marka_avg_price'] = all_data['Marka_pojazdu'].map(train_marka_mean_price)
all_data['Marka_avg_price'] = all_data['Marka_avg_price'].fillna(train_marka_mean_price.mean())

In [120]:
train_model_mean_price = all_data.loc[(all_data['is_train'] == 1) & 
                                      (all_data['data_source'] == 'original')].groupby(['Marka_pojazdu', 'Model_pojazdu'])['log_Cena'].mean()
all_data['Model_avg_price'] = all_data.apply(
    lambda x: train_model_mean_price.get((x['Marka_pojazdu'], x['Model_pojazdu']), np.nan), axis=1)
all_data['Model_avg_price'] = all_data['Model_avg_price'].fillna(all_data['Marka_avg_price'])
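
The brand and model means above are computed over all training rows, so each row's own price is part of its encoding, and that information leaks into any later cross-validation. A common remedy is out-of-fold target encoding. The sketch below is one possible variant on toy data, not the notebook's method; `oof_target_encode` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def oof_target_encode(categories, target, n_splits=5, seed=42):
    """Encode each row with the category mean computed on the other folds only."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)

    rng = np.random.default_rng(seed)
    fold_of_row = rng.permutation(len(target)) % n_splits  # balanced random folds

    encoded = pd.Series(np.nan, index=target.index)
    for fold in range(n_splits):
        in_fold = fold_of_row == fold
        # Category means from rows outside the current fold
        means = target[~in_fold].groupby(categories[~in_fold]).mean()
        encoded[in_fold] = categories[in_fold].map(means)

    # Categories never seen outside the row's fold fall back to the global mean
    return encoded.fillna(target.mean())

brands = ["Opel", "Opel", "Opel", "Ford", "Ford", "Ford"]
log_price = [10.0, 10.2, 10.4, 9.0, 9.2, 9.4]
enc = oof_target_encode(brands, log_price, n_splits=3)
print(enc.round(3).tolist())
```

For the test set, the encoding can still use means from the whole training set, as the notebook does; only the training rows need the out-of-fold treatment.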

In [121]:
color_counts = all_data.loc[all_data['is_train'] == 1, 'Kolor'].value_counts(normalize=True)
all_data['Kolor_freq'] = all_data['Kolor'].map(color_counts)
all_data['Kolor_freq'] = all_data['Kolor_freq'].fillna(color_counts.min())

In [122]:
all_data['Wiek_x_Przebieg'] = all_data['Wiek_pojazdu'] * all_data['log_Przebieg_km']

In [123]:
all_data['Moc_x_Pojemnosc'] = all_data['Moc_KM'] * all_data['Pojemnosc_cm3'] / 1000

In [124]:
if 'Liczba_elementow_wyposazenia' in all_data.columns:
    all_data['Wiek_per_Wyposazenie'] = all_data['Wiek_pojazdu'] / (all_data['Liczba_elementow_wyposazenia'] + 1)

In [125]:
all_data['Oryginalnie_EUR'] = all_data['Waluta'].apply(lambda x: 1 if pd.notna(x) and x == 'EUR' else 0)

### 3.2 Encoding categorical variables

In [126]:
all_data_encoded = pd.get_dummies(all_data, columns=[
    'Stan', 'Rodzaj_paliwa', 'Naped', 'Skrzynia_biegow', 'Typ_nadwozia'
])
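
Calling `pd.get_dummies` on the combined train and test frame, as above, guarantees that both parts end up with the same dummy columns. The toy sketch below shows what can go wrong when the two parts are encoded separately; the category values are made up.

```python
import pandas as pd

train = pd.DataFrame({"Naped": ["front", "rear"]})
test = pd.DataFrame({"Naped": ["front", "4x4"]})

# Encoded separately: each frame only gets columns for the categories it contains
separate_train = pd.get_dummies(train, columns=["Naped"])
separate_test = pd.get_dummies(test, columns=["Naped"])

# Encoded together: one shared column set for both parts
combined = pd.get_dummies(
    pd.concat([train, test], ignore_index=True), columns=["Naped"]
)

print(list(separate_train.columns))
print(list(separate_test.columns))
print(list(combined.columns))
```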

In [127]:
for cat_col in ['Marka_pojazdu', 'Model_pojazdu', 'Kolor', 'Kraj_pochodzenia']:
    if cat_col in all_data.columns:
        target_means = all_data.loc[(all_data['is_train'] == 1) & 
                                    (all_data['data_source'] == 'original')].groupby(cat_col)['log_Cena'].mean()
        all_data_encoded[f'{cat_col}_target_enc'] = all_data[cat_col].map(target_means)
        all_data_encoded[f'{cat_col}_target_enc'] = all_data_encoded[f'{cat_col}_target_enc'].fillna(target_means.mean())

In [128]:
features = [
    'Wiek_pojazdu', 'log_Przebieg_km', 'Moc_KM', 'Pojemnosc_cm3', 
    'Liczba_elementow_wyposazenia', 'Efektywnosc_silnika', 'Sredni_roczny_przebieg',
    'Oryginalnie_EUR',
    
    'Marka_avg_price', 'Model_avg_price', 'Kolor_freq',
    
    'ma_Leather_upholstery', 'ma_GPS_navigation', 'ma_Heated_front_seats',
    'ma_Xenon_lights', 'ma_LED_lights', 'ma_Automatic_air_conditioning',
    'ma_Panoramic_roof', 'ma_Electrically_adjustable_seats', 'ma_Active_cruise_control',
    
    'Wiek_x_Przebieg', 'Moc_x_Pojemnosc', 'Wiek_per_Wyposazenie',
    
    'Marka_pojazdu_target_enc', 'Model_pojazdu_target_enc', 
    'Kolor_target_enc', 'Kraj_pochodzenia_target_enc'
]

features += [col for col in all_data_encoded.columns if col.startswith(('Stan_', 'Rodzaj_paliwa_', 
                                                        'Naped_', 'Skrzynia_biegow_', 'Typ_nadwozia_'))]

In [129]:
X_all = all_data_encoded[features].copy()

missing = X_all.isnull().sum()
if missing.sum() > 0:
    print(f"Missing values in the data: {missing[missing > 0]}")
    
    for col in X_all.columns:
        if X_all[col].isnull().sum() > 0:
            if X_all[col].dtype.kind in 'ifc':
                median_val = all_data_encoded.loc[(all_data_encoded['is_train'] == 1) & 
                                                  (all_data_encoded['data_source'] == 'original'), col].median()
                X_all[col] = X_all[col].fillna(median_val)
            else:
                X_all[col] = X_all[col].fillna('nieznany')

In [147]:
X_train_all = X_all[all_data_encoded['is_train'] == 1]
y_train_all = all_data_encoded.loc[all_data_encoded['is_train'] == 1, 'log_Cena']
X_test = X_all[all_data_encoded['is_train'] == 0]

X_train, X_val, y_train, y_val = train_test_split(
    X_train_all, y_train_all, test_size=0.2, random_state=42
)
X_train_all

Unnamed: 0,Wiek_pojazdu,log_Przebieg_km,Moc_KM,Pojemnosc_cm3,Liczba_elementow_wyposazenia,Efektywnosc_silnika,Sredni_roczny_przebieg,Oryginalnie_EUR,Marka_avg_price,Model_avg_price,...,Typ_nadwozia_SUV,Typ_nadwozia_city_cars,Typ_nadwozia_compact,Typ_nadwozia_convertible,Typ_nadwozia_coupe,Typ_nadwozia_minivan,Typ_nadwozia_nieznany,Typ_nadwozia_sedan,Typ_nadwozia_small_cars,Typ_nadwozia_station_wagon
0,20.0,12.269052,170.0,1998.0,18,85.085085,10650.000000,0,10.055536,9.695962,...,False,False,False,False,False,True,False,False,False,False
1,15.0,11.670698,110.0,1598.0,27,68.836045,7805.933333,0,10.055536,10.125719,...,False,False,False,False,False,False,False,False,False,True
2,10.0,11.657900,136.0,1598.0,24,85.106383,11560.000000,0,10.045612,9.843740,...,False,False,False,False,False,True,False,False,False,False
3,18.0,12.292255,90.0,1560.0,17,57.692308,12111.111111,0,10.353471,10.050524,...,False,False,True,False,False,False,False,False,False,False
4,12.0,11.883554,136.0,1798.0,25,75.093867,10066.666667,0,10.462687,10.205401,...,False,False,False,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
135392,7.0,11.827240,150.0,1598.0,9,93.867334,19561.571429,0,10.045612,9.999211,...,False,False,True,False,False,False,False,False,False,False
135393,4.0,2.197225,237.0,1950.0,35,121.538462,2.000000,0,11.015688,11.164070,...,False,False,False,False,False,True,False,False,False,False
135394,12.0,12.095147,120.0,1700.0,24,70.588235,10066.666667,0,10.509313,9.866612,...,False,False,False,False,False,True,False,False,False,False
135395,24.0,11.957618,60.0,1149.0,15,52.219321,6500.000000,0,10.055536,9.940847,...,False,True,False,False,False,False,False,False,False,False


## 4. Modeling

### 4.1 Defining the model evaluation function

In [131]:
def calculate_rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))
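
Because the target is `log_Cena = log1p(Cena_PLN)`, the value returned by `calculate_rmse` is an RMSE on the log scale. Written out, with $y_i$ the true price in PLN, $\hat{y}_i$ the predicted price and $n$ the number of rows (symbols introduced here for illustration):

$$
\mathrm{RMSE}_{\log} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(\log(1+\hat{y}_i) - \log(1+y_i)\bigr)^2}
$$

A value of about 0.24 therefore describes a typical multiplicative error, not an error in PLN.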

### 4.2 Hyperparameter optimization for CatBoost

In [132]:
def objective_catboost(trial):
    param = {
        "iterations": trial.suggest_int("iterations", 500, 3000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "depth": trial.suggest_int("depth", 4, 10),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1e-8, 10.0, log=True),
        "random_strength": trial.suggest_float("random_strength", 1e-8, 10.0, log=True),
        "bagging_temperature": trial.suggest_float("bagging_temperature", 0, 10.0),
        "border_count": trial.suggest_int("border_count", 32, 255),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 1, 100),
        "verbose": False,
        "random_seed": 42
    }
    
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    log_rmse_scores = []
    
    for train_idx, val_idx in kf.split(X_train_all):
        X_train_fold, X_val_fold = X_train_all.iloc[train_idx], X_train_all.iloc[val_idx]
        y_train_fold, y_val_fold = y_train_all.iloc[train_idx], y_train_all.iloc[val_idx]
        
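        # Note: this callback only prunes if pruning_callback.check_pruned() is called after fit(); it is not called here
        pruning_callback = CatBoostPruningCallback(trial, "RMSE")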
        
        model = cb.CatBoostRegressor(**param)
        model.fit(
            X_train_fold, 
            y_train_fold,
            eval_set=[(X_val_fold, y_val_fold)],
            callbacks=[pruning_callback],
            early_stopping_rounds=100,
            verbose=0
        )
        
        y_val_pred_log = model.predict(X_val_fold)
        
        log_rmse = calculate_rmse(y_val_fold, y_val_pred_log)
        log_rmse_scores.append(log_rmse)
    
    return np.mean(log_rmse_scores)

study_catboost = optuna.create_study(
    direction="minimize",
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=10),
    sampler=optuna.samplers.TPESampler(seed=42)
)

[I 2025-03-26 17:50:22,019] A new study created in memory with name: no-name-0ad4f55c-1063-4fff-b7cb-ebc3c0946bb7


In [133]:
n_trials = 100
study_catboost.optimize(objective_catboost, n_trials=n_trials)

[I 2025-03-26 17:50:47,087] Trial 0 finished with value: 0.24937447734728932 and parameters: {'iterations': 1436, 'learning_rate': 0.2536999076681772, 'depth': 9, 'l2_leaf_reg': 0.0024430162614261413, 'random_strength': 2.5361081166471375e-07, 'bagging_temperature': 1.5599452033620265, 'border_count': 45, 'min_data_in_leaf': 87}. Best is trial 0 with value: 0.24937447734728932.
[I 2025-03-26 17:51:25,827] Trial 1 finished with value: 0.2505528392925503 and parameters: {'iterations': 2003, 'learning_rate': 0.11114989443094977, 'depth': 4, 'l2_leaf_reg': 5.360294728728285, 'random_strength': 0.31044435499483225, 'bagging_temperature': 2.1233911067827616, 'border_count': 72, 'min_data_in_leaf': 19}. Best is trial 0 with value: 0.24937447734728932.
[I 2025-03-26 17:51:59,688] Trial 2 finished with value: 0.24596051627574528 and parameters: {'iterations': 1260, 'learning_rate': 0.05958389350068958, 'depth': 7, 'l2_leaf_reg': 4.17890272377219e-06, 'random_strength': 0.0032112643094417484, 'b

In [134]:
best_params_catboost = study_catboost.best_params
print(f"Best CatBoost parameters: {best_params_catboost}")
print(f"Best CatBoost RMSE: {study_catboost.best_value:.6f}")

Best CatBoost parameters: {'iterations': 1747, 'learning_rate': 0.2489988131361436, 'depth': 8, 'l2_leaf_reg': 9.693267277829479, 'random_strength': 0.01855317047679346, 'bagging_temperature': 3.648981417538457, 'border_count': 188, 'min_data_in_leaf': 16}
Best CatBoost RMSE: 0.242198


### 4.3 Hyperparameter optimization for XGBoost

In [135]:
def objective_xgboost(trial):
    param = {
        "n_estimators": trial.suggest_int("n_estimators", 500, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "gamma": trial.suggest_float("gamma", 0, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 0, 1.0),
        "reg_lambda": trial.suggest_float("reg_lambda", 0, 1.0),
        "random_state": 42,
    }
    
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    rmse_scores = []
    
    for train_idx, val_idx in kf.split(X_train_all):
        X_train_fold, X_val_fold = X_train_all.iloc[train_idx], X_train_all.iloc[val_idx]
        y_train_fold, y_val_fold = y_train_all.iloc[train_idx], y_train_all.iloc[val_idx]
        
        model = xgb.XGBRegressor(**param)
        model.fit(X_train_fold, y_train_fold)
        
        y_val_pred = model.predict(X_val_fold)
        rmse = np.sqrt(mean_squared_error(y_val_fold, y_val_pred))
        rmse_scores.append(rmse)
    
    return np.mean(rmse_scores)

study_xgboost = optuna.create_study(direction="minimize")
study_xgboost.optimize(objective_xgboost, n_trials=n_trials)

[I 2025-03-26 18:02:45,279] A new study created in memory with name: no-name-fb5d8571-6261-434d-bf5a-b94a3b14971a
[I 2025-03-26 18:02:50,685] Trial 0 finished with value: 0.2618401510712619 and parameters: {'n_estimators': 565, 'learning_rate': 0.25216535842068255, 'max_depth': 3, 'subsample': 0.651491348687464, 'colsample_bytree': 0.9996406681167839, 'gamma': 0.30908399359520766, 'reg_alpha': 0.6508661364352807, 'reg_lambda': 0.8216174291550922}. Best is trial 0 with value: 0.2618401510712619.
[I 2025-03-26 18:03:03,549] Trial 1 finished with value: 0.2504803883760256 and parameters: {'n_estimators': 1471, 'learning_rate': 0.027657330673387295, 'max_depth': 8, 'subsample': 0.6886654400096859, 'colsample_bytree': 0.6468427685266801, 'gamma': 0.5578700811656532, 'reg_alpha': 0.027193590263238754, 'reg_lambda': 0.004206121671670782}. Best is trial 1 with value: 0.2504803883760256.
[I 2025-03-26 18:03:15,618] Trial 2 finished with value: 0.24426183906892315 and parameters: {'n_estimators'

In [136]:
best_params_xgboost = study_xgboost.best_params
print(f"Best XGBoost parameters: {best_params_xgboost}")
print(f"Best XGBoost RMSE: {study_xgboost.best_value:.6f}")

Best XGBoost parameters: {'n_estimators': 1758, 'learning_rate': 0.02954699562557916, 'max_depth': 9, 'subsample': 0.7401487958272507, 'colsample_bytree': 0.7726810811112611, 'gamma': 0.017253448880046818, 'reg_alpha': 0.8941255472600743, 'reg_lambda': 0.14915705500630402}
Best XGBoost RMSE: 0.238168


### 4.4 Hyperparameter optimization for LightGBM

In [154]:
def objective_lightgbm(trial):
    param = {
        "n_estimators": trial.suggest_int("n_estimators", 500, 2000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 20, 150),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 20, 100),
        "max_bin": trial.suggest_int("max_bin", 32, 512),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.6, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 10),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.6, 1.0),
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        "random_state": 42,
    }
    
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    rmse_scores = []
    
    for train_idx, val_idx in kf.split(X_train_all):
        X_train_fold, X_val_fold = X_train_all.iloc[train_idx], X_train_all.iloc[val_idx]
        y_train_fold, y_val_fold = y_train_all.iloc[train_idx], y_train_all.iloc[val_idx]
        
        model = lgb.LGBMRegressor(**param)
        model.fit(X_train_fold, y_train_fold)
        
        y_val_pred = model.predict(X_val_fold)
        rmse = np.sqrt(mean_squared_error(y_val_fold, y_val_pred))
        rmse_scores.append(rmse)
    
    return np.mean(rmse_scores)

study_lightgbm = optuna.create_study(direction="minimize")
study_lightgbm.optimize(objective_lightgbm, n_trials=n_trials)

[I 2025-03-26 19:35:07,681] A new study created in memory with name: no-name-94efaabc-3c63-4296-962d-e2f707a3250c
[I 2025-03-26 19:35:24,488] Trial 0 finished with value: 0.24709467326567153 and parameters: {'n_estimators': 982, 'learning_rate': 0.02485816755979505, 'num_leaves': 121, 'max_depth': 9, 'min_data_in_leaf': 43, 'max_bin': 474, 'bagging_fraction': 0.638319140631777, 'bagging_freq': 2, 'feature_fraction': 0.9918394118066567, 'lambda_l1': 0.0005131498273655017, 'lambda_l2': 1.6235042281648685e-08}. Best is trial 0 with value: 0.24709467326567153.
[I 2025-03-26 19:35:45,648] Trial 1 finished with value: 0.249439191112076 and parameters: {'n_estimators': 1143, 'learning_rate': 0.01347847309116874, 'num_leaves': 133, 'max_depth': 9, 'min_data_in_leaf': 65, 'max_bin': 176, 'bagging_fraction': 0.9368821143692417, 'bagging_freq': 9, 'feature_fraction': 0.9353200530862622, 'lambda_l1': 0.0038197178136982627, 'lambda_l2': 0.00018558601605438778}. Best is trial 0 with value: 0.2470946

In [155]:
best_params_lightgbm = study_lightgbm.best_params
print(f"Best LightGBM parameters: {best_params_lightgbm}")
print(f"Best LightGBM RMSE: {study_lightgbm.best_value:.6f}")

Best LightGBM parameters: {'n_estimators': 1791, 'learning_rate': 0.0387164675395099, 'num_leaves': 104, 'max_depth': 12, 'min_data_in_leaf': 31, 'max_bin': 478, 'bagging_fraction': 0.9689389046925748, 'bagging_freq': 8, 'feature_fraction': 0.6648950176606436, 'lambda_l1': 0.0003272703745609526, 'lambda_l2': 0.5057924016887364}
Best LightGBM RMSE: 0.238193


## 5. Final models and ensemble

In [156]:
final_catboost_params = best_params_catboost.copy()
final_catboost_params['verbose'] = 0

In [157]:
final_catboost_model = cb.CatBoostRegressor(**final_catboost_params)
final_catboost_model.fit(X_train_all, y_train_all)

final_xgboost_model = xgb.XGBRegressor(**best_params_xgboost, random_state=42)
final_xgboost_model.fit(X_train_all, y_train_all)

final_lightgbm_model = lgb.LGBMRegressor(**best_params_lightgbm, random_state=42)
final_lightgbm_model.fit(X_train_all, y_train_all)

In [158]:
def objective_ensemble_weights(trial):
    w1 = trial.suggest_float("catboost_weight", 0.1, 0.7)
    w2 = trial.suggest_float("xgboost_weight", 0.1, 0.7)
    w3 = trial.suggest_float("lightgbm_weight", 0.1, 0.7)
    
    sum_weights = w1 + w2 + w3
    w1 /= sum_weights
    w2 /= sum_weights
    w3 /= sum_weights
    
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    rmse_scores = []
    
    for train_idx, val_idx in kf.split(X_train_all):
        X_val_fold = X_train_all.iloc[val_idx]
        y_val_fold = y_train_all.iloc[val_idx]
        
        # Caution: the final_* models were fit on all of X_train_all, so these are in-sample predictions
        y_pred_catboost = final_catboost_model.predict(X_val_fold)
        y_pred_xgboost = final_xgboost_model.predict(X_val_fold)
        y_pred_lightgbm = final_lightgbm_model.predict(X_val_fold)
        
        y_pred_ensemble = w1 * y_pred_catboost + w2 * y_pred_xgboost + w3 * y_pred_lightgbm
        
        rmse = np.sqrt(mean_squared_error(y_val_fold, y_pred_ensemble))
        rmse_scores.append(rmse)
    
    return np.mean(rmse_scores)

study_ensemble = optuna.create_study(direction="minimize")
study_ensemble.optimize(objective_ensemble_weights, n_trials=200)

[I 2025-03-26 20:18:19,462] A new study created in memory with name: no-name-8cf4b105-89cb-4682-b639-615c0d0026cc
[I 2025-03-26 20:18:22,182] Trial 0 finished with value: 0.15341881115798878 and parameters: {'catboost_weight': 0.3650118923076131, 'xgboost_weight': 0.6077675373266845, 'lightgbm_weight': 0.5474364730532295}. Best is trial 0 with value: 0.15341881115798878.
[I 2025-03-26 20:18:24,954] Trial 1 finished with value: 0.16042571407994624 and parameters: {'catboost_weight': 0.6092739955257018, 'xgboost_weight': 0.1269599663553166, 'lightgbm_weight': 0.5212425815201408}. Best is trial 0 with value: 0.15341881115798878.
[I 2025-03-26 20:18:28,416] Trial 2 finished with value: 0.14727772555751142 and parameters: {'catboost_weight': 0.1017082827331882, 'xgboost_weight': 0.632235851772512, 'lightgbm_weight': 0.2104492567096668}. Best is trial 2 with value: 0.14727772555751142.
[I 2025-03-26 20:18:31,170] Trial 3 finished with value: 0.15854626927874407 and parameters: {'catboost_wei

In [159]:
best_weights = study_ensemble.best_params
catboost_weight = best_weights["catboost_weight"]
xgboost_weight = best_weights["xgboost_weight"]
lightgbm_weight = best_weights["lightgbm_weight"]

In [160]:
sum_weights = catboost_weight + xgboost_weight + lightgbm_weight
catboost_weight /= sum_weights
xgboost_weight /= sum_weights
lightgbm_weight /= sum_weights

In [161]:
print(f"Ensemble weights: CatBoost={catboost_weight:.3f}, XGBoost={xgboost_weight:.3f}, LightGBM={lightgbm_weight:.3f}")
print(f"Best ensemble RMSE: {study_ensemble.best_value:.6f}")

Ensemble weights: CatBoost=0.112, XGBoost=0.777, LightGBM=0.111
Best ensemble RMSE: 0.144592
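
The weight search above scores models that were already fit on every row of `X_train_all`, so the folds contain no unseen data and the reported RMSE is an in-sample figure. Weights are usually tuned on out-of-fold predictions instead. The sketch below illustrates the difference on synthetic data, with two least-squares models standing in for the boosted models. The data and the helpers `fit_predict` and `best_weight` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 40
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n)  # only 2 useful columns

def fit_predict(X_tr, y_tr, X_te, cols):
    """Least-squares stand-in for a real model, restricted to the given columns."""
    coef, *_ = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)
    return X_te[:, cols] @ coef

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Model 0 is flexible (all columns), model 1 is simple (two columns)
model_cols = [np.arange(p), np.arange(2)]

# In-sample: each model is fit on all rows and scored on the same rows
in_sample = np.column_stack([fit_predict(X, y, X, c) for c in model_cols])

# Out-of-fold: every row is predicted by models that never saw it
folds = rng.permutation(n) % 5
oof = np.zeros((n, len(model_cols)))
for k in range(5):
    held_out = folds == k
    for j, c in enumerate(model_cols):
        oof[held_out, j] = fit_predict(X[~held_out], y[~held_out], X[held_out], c)

def best_weight(preds):
    """Grid-search the weight of model 0; model 1 receives 1 - w."""
    grid = np.linspace(0.0, 1.0, 101)
    scores = [rmse(w * preds[:, 0] + (1 - w) * preds[:, 1], y) for w in grid]
    best = int(np.argmin(scores))
    return float(grid[best]), scores[best]

w_in, rmse_in = best_weight(in_sample)
w_oof, rmse_oof = best_weight(oof)
print(f"in-sample:   w={w_in:.2f}, RMSE={rmse_in:.3f}")
print(f"out-of-fold: w={w_oof:.2f}, RMSE={rmse_oof:.3f}")
```

In-sample scoring rewards whichever model fits the training rows most closely and pushes its weight to the edge of the search range. The XGBoost weight above may have reached the top of its range for the same reason, but that has not been verified here.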


### 5.1 Model predictions and preparing the submission

In [162]:
y_pred_catboost = final_catboost_model.predict(X_test)
y_pred_xgboost = final_xgboost_model.predict(X_test)
y_pred_lightgbm = final_lightgbm_model.predict(X_test)

y_pred_ensemble_log = catboost_weight * y_pred_catboost + xgboost_weight * y_pred_xgboost + lightgbm_weight * y_pred_lightgbm

y_pred_ensemble = np.expm1(y_pred_ensemble_log)

test_orig_eur = all_data.loc[all_data['is_train'] == 0, 'Waluta'] == 'EUR'
test_orig_eur

135397    False
135398    False
135399    False
135400    False
135401    False
          ...  
208299    False
208300    False
208301    False
208302    False
208303    False
Name: Waluta, Length: 72907, dtype: bool

In [163]:
submission = pd.DataFrame({
    'ID': test_ids,
    'Cena': y_pred_ensemble
})

submission_path = 'submit.csv'
submission.to_csv(submission_path, index=False)
submission.head()

Unnamed: 0,ID,Cena
0,1,188947.481506
1,2,19779.49735
2,3,21031.495542
3,4,97610.188858
4,5,82827.044346


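# 6. Modeling conclusions

## 6.1 Model performance
All RMSE values below are on the `log1p(price)` scale.
- **CatBoost** (5-fold CV RMSE ≈ 0.2422): the highest error of the three models. Categorical variables were target-encoded or one-hot encoded before training, so CatBoost's native categorical handling was not used.
- **XGBoost** (5-fold CV RMSE ≈ 0.2382): the lowest cross-validated error.
- **LightGBM** (5-fold CV RMSE ≈ 0.2382): practically tied with XGBoost (0.238193 vs 0.238168). The trial logs shown above do not show it training faster than XGBoost.
- **Ensemble** (RMSE ≈ 0.1446): this figure was computed on rows the final models had already been trained on, so it is an in-sample score and cannot be compared with the cross-validated values above. Whether the ensemble beats the single models on unseen data was not measured.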

## 6.2 Key data transformations
- Log transformation of the price (`log1p`) - reduces the right skew of the distribution
- Currency conversion (EUR to PLN at a fixed rate of 4.5) - puts all prices on one scale
- Vehicle age computed in place of the production year
- Log transformation of mileage - better represents its non-linear effect
- Target encoding for high-cardinality categorical variables - computed on the whole training set, so the cross-validated scores include some target leakage

## 6.3 Most important features affecting price
The ranking below reflects domain expectations; the notebook does not compute feature importances or correlations.
1. Vehicle make and model (target encoding)
2. Vehicle age (expected strong negative relationship with price)
3. Mileage (expected negative relationship, log-transformed)
4. Engine power and displacement (expected positive relationship)
5. Premium equipment (leather upholstery, GPS, etc.)
6. Fuel type and gearbox
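
A ranking like the one above can be checked with permutation importance: shuffle one feature on held-out data and measure how much the error grows. The sketch below shows the idea on synthetic data with a least-squares model. The feature names, coefficients and data are made up, and the notebook's boosted models are not used.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
names = ["age", "log_mileage", "noise"]  # hypothetical feature names
X = rng.normal(size=(n, 3))
# Synthetic log-price: strong age effect, weaker mileage effect, no noise effect
y = -1.0 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.2, size=n)

train, valid = np.arange(0, 400), np.arange(400, n)

# Fit a linear model with an intercept on the training part
A_train = np.column_stack([X[train], np.ones(len(train))])
coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)

def predict(features):
    return np.column_stack([features, np.ones(len(features))]) @ coef

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

baseline = rmse(predict(X[valid]), y[valid])

importance = {}
for j, name in enumerate(names):
    shuffled = X[valid].copy()
    # Shuffling breaks the link between feature j and the target
    shuffled[:, j] = rng.permutation(shuffled[:, j])
    importance[name] = rmse(predict(shuffled), y[valid]) - baseline

for name, value in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name}: +{value:.3f} RMSE when shuffled")
```

The same loop works with any fitted model that exposes a `predict` method, provided the scoring data was not used for training.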

## 6.4 Business potential

The model has a wide range of business applications:
1. **For dealers** - valuing used vehicles, optimizing pricing policy, identifying market bargains.
2. **For listing platforms** - automatic price verification, suggestions for sellers, detection of price anomalies.
3. **For insurers** - more accurate valuation of vehicles for insurance purposes.
4. **For consumers** - a tool for checking whether prices on the used-car market are fair.

The project shows that in-depth feature engineering combined with tuned gradient-boosting models predicts car prices with a cross-validated RMSE of about 0.24 on the log-price scale for the single models.