# -- ðŸš— Used Car Price Prediction --
# 3. Feature Engineering 

Bu notebookta:

- Modelleme iÃ§in feature seÃ§imi,
- Train/Test ayrÄ±mÄ±,
- One-Hot Encoding ile baseline model kurulumu,
- Feature engineering (price_per_km, km_per_year, log_kmDriven, log_price),
- Train / Validation / Test ayrÄ±mÄ±,
- Feature engineering sonrasÄ± baseline CV ve validation performansÄ±

adÄ±mlarÄ± gerÃ§ekleÅŸtirilecektir.


In [1]:
# KÃ¼tÃ¼phaneler + Veri YÃ¼kleme

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("seaborn-v0_8")
pd.set_option("display.max_columns", None)

# Veri Seti
df = pd.read_csv("/kaggle/input/automl888888/used_cars_dataset_v2.csv")

# Veri Temizleme
def clean_km(x):
    if pd.isna(x):
        return np.nan
    x = str(x).lower().replace("km", "").replace(",", "").strip()
    try:
        return float(x)
    except ValueError:
        return np.nan

def clean_price(x):
    if pd.isna(x):
        return np.nan
    x = str(x)
    x = (
        x.replace("â‚¹", "")
         .replace(",", "")
         .replace("rs.", "")
         .replace("rs", "")
         .strip())
    try:
        return float(x)
    except ValueError:
        return np.nan

df["kmDriven_clean"] = df["kmDriven"].apply(clean_km)
df["AskPrice_clean"] = df["AskPrice"].apply(clean_price)

# Eksik kritik satÄ±rlarÄ± at
df = df.dropna(subset=["kmDriven_clean", "AskPrice_clean"])

df.head()

Unnamed: 0,Brand,model,Year,Age,kmDriven,Transmission,Owner,FuelType,PostedDate,AdditionInfo,AskPrice,kmDriven_clean,AskPrice_clean
0,Honda,City,2001,23,"98,000 km",Manual,second,Petrol,Nov-24,"Honda City v teck in mint condition, valid gen...","â‚¹ 1,95,000",98000.0,195000.0
1,Toyota,Innova,2009,15,190000.0 km,Manual,second,Diesel,Jul-24,"Toyota Innova 2.5 G (Diesel) 7 Seater, 2009, D...","â‚¹ 3,75,000",190000.0,375000.0
2,Volkswagen,VentoTest,2010,14,"77,246 km",Manual,first,Diesel,Nov-24,"Volkswagen Vento 2010-2013 Diesel Breeze, 2010...","â‚¹ 1,84,999",77246.0,184999.0
3,Maruti Suzuki,Swift,2017,7,"83,500 km",Manual,second,Diesel,Nov-24,Maruti Suzuki Swift 2017 Diesel Good Condition,"â‚¹ 5,65,000",83500.0,565000.0
4,Maruti Suzuki,Baleno,2019,5,"45,000 km",Automatic,first,Petrol,Nov-24,"Maruti Suzuki Baleno Alpha CVT, 2019, Petrol","â‚¹ 6,85,000",45000.0,685000.0


## 3.1 Modelleme Ä°Ã§in Veri HazÄ±rlÄ±ÄŸÄ± (Feature Selection)


In [2]:
# Modelleme Ä°Ã§in Veri HazÄ±rlÄ±ÄŸÄ± (Feature Selection)

features = [
    "Brand", "model", "Year", "Age",
    "kmDriven_clean", "Transmission",
    "Owner", "FuelType"]

target = "AskPrice_clean"

df_model = df[features + [target]].copy()
df_model.head()


Unnamed: 0,Brand,model,Year,Age,kmDriven_clean,Transmission,Owner,FuelType,AskPrice_clean
0,Honda,City,2001,23,98000.0,Manual,second,Petrol,195000.0
1,Toyota,Innova,2009,15,190000.0,Manual,second,Diesel,375000.0
2,Volkswagen,VentoTest,2010,14,77246.0,Manual,first,Diesel,184999.0
3,Maruti Suzuki,Swift,2017,7,83500.0,Manual,second,Diesel,565000.0
4,Maruti Suzuki,Baleno,2019,5,45000.0,Automatic,first,Petrol,685000.0


## 3.2 Trainâ€“Test Split (EÄŸitim ve Test Setlerinin OluÅŸturulmasÄ±)


In [3]:
X = df_model.drop("AskPrice_clean", axis=1)
y = df_model["AskPrice_clean"]

# EÄŸitim %80 â€“ Test %20
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

X_train.shape, X_test.shape


((11924, 8), (2981, 8))

## 3.3 Kategorik DeÄŸiÅŸkenlerin Encoding Ä°ÅŸlemi (OneHotEncoder) ve Baseline Pipeline


In [4]:
# Kategorik ve sayÄ±sal kolonlarÄ±n ayrÄ±mÄ±
cat_cols = ["Brand", "model", "Transmission", "Owner", "FuelType"]
num_cols = ["Year", "Age", "kmDriven_clean"]

# One-hot encoder
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", "passthrough", num_cols)])

# Basit pipeline (baseline model)
baseline_model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", LinearRegression())])

baseline_model

## 3.4 Baseline Modelin EÄŸitilmesi ve Ä°lk Performans SonuÃ§larÄ±


In [5]:
# Modeli eÄŸitme
baseline_model.fit(X_train, y_train)

# Tahminler
y_pred = baseline_model.predict(X_test)

# Performans metrikleri
mse = mean_squared_error(y_test, y_pred)
rmse = mse ** 0.5
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

mse, rmse, mae, r2


(1639474415726.2817,
 1280419.6248598667,
 538396.9695133752,
 0.36600805642698875)

## 3.5 Baseline Modelin Cross-Validation (K-Fold) ile DeÄŸerlendirilmesi


In [6]:
# 5 katlÄ± CV
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = cross_val_score(
    baseline_model, X, y,
    scoring="r2",
    cv=kfold,
    n_jobs=-1)

cv_scores, cv_scores.mean()

(array([0.36600805, 0.40504958, 0.27251369, 0.40337956, 0.36061624]),
 0.3615134245242876)

## 3.6 Feature Engineering: Yeni Ã–zelliklerin OluÅŸturulmasÄ±

OluÅŸturulan yeni deÄŸiÅŸkenler:

- price_per_km : Fiyat / km â†’ araÃ§ deÄŸer yoÄŸunluÄŸu
- km_per_year : YÄ±llÄ±k ortalama kilometre 
- log_kmDriven, log_price : Kilometrenin log dÃ¶nÃ¼ÅŸÃ¼mÃ¼ 

In [7]:
df_fe = df_model.copy()

# 1) km baÅŸÄ±na fiyat (araÃ§ deÄŸer gÃ¶stergesi)
df_fe["price_per_km"] = df_fe["AskPrice_clean"] / (df_fe["kmDriven_clean"] + 1)

# 2) AracÄ±n kullanÄ±m yoÄŸunluÄŸu (yÄ±llÄ±k km)
df_fe["km_per_year"] = df_fe["kmDriven_clean"] / (df_fe["Age"] + 1)

# 3) Log-transform (saÄŸa Ã§arpÄ±k daÄŸÄ±lÄ±mlar iÃ§in)
df_fe["log_kmDriven"] = np.log1p(df_fe["kmDriven_clean"])
df_fe["log_price"] = np.log1p(df_fe["AskPrice_clean"])

df_fe.head()


Unnamed: 0,Brand,model,Year,Age,kmDriven_clean,Transmission,Owner,FuelType,AskPrice_clean,price_per_km,km_per_year,log_kmDriven,log_price
0,Honda,City,2001,23,98000.0,Manual,second,Petrol,195000.0,1.989776,4083.333333,11.492733,12.18076
1,Toyota,Innova,2009,15,190000.0,Manual,second,Diesel,375000.0,1.973674,11875.0,12.154785,12.834684
2,Volkswagen,VentoTest,2010,14,77246.0,Manual,first,Diesel,184999.0,2.394902,5149.733333,11.254763,12.128111
3,Maruti Suzuki,Swift,2017,7,83500.0,Manual,second,Diesel,565000.0,6.766386,10437.5,11.332614,13.244583
4,Maruti Suzuki,Baleno,2019,5,45000.0,Automatic,first,Petrol,685000.0,15.221884,7500.0,10.71444,13.437176


### 3.6.1 Feature Engineering Ä°ÅŸlemlerinin Ã–zeti

- 1. **price_per_km**  
   - FormÃ¼l: `AskPrice_clean / kmDriven_clean`  
   - AmaÃ§: AynÄ± fiyata daha az kilometre yapan araÃ§larÄ±n daha deÄŸerli olmasÄ±nÄ± modele yansÄ±tmak.

- 2. **km_per_year**  
   - FormÃ¼l: `kmDriven_clean / Age`  
   - AmaÃ§: AracÄ±n kullanÄ±m yoÄŸunluÄŸunu modele dahil ederek fiyat tahminini iyileÅŸtirmek.

- 3. **Log dÃ¶nÃ¼ÅŸÃ¼mleri (log_kmDriven & log_price)**  


- **SonuÃ§:**  
- Bu Ã¼Ã§ iÅŸlem ile:
- Yeni anlamlÄ± Ã¶zellikler Ã¼retildi,
- DaÄŸÄ±lÄ±mlar daha dengeli hale getirildi,
- Modelin Ã¶ÄŸrenme kapasitesi artÄ±rÄ±lmaya hazÄ±rlandÄ±.


## 3.7 Train / Validation / Test Split


In [8]:
# Feature-engineered veri seti
X_fe = df_fe.drop("AskPrice_clean", axis=1)
y_fe = df_fe["AskPrice_clean"]

# 1) Ã–nce Train (%70) ve Temp (%30)
X_train_full, X_temp, y_train_full, y_temp = train_test_split(
    X_fe, y_fe,
    test_size=0.30,
    random_state=42)

# 2) Temp â†’ Validation (%15) ve Test (%15)
X_val, X_test_final, y_val, y_test_final = train_test_split(
    X_temp, y_temp,
    test_size=0.50,
    random_state=42)

X_train_full.shape, X_val.shape, X_test_final.shape


((10433, 12), (2236, 12), (2236, 12))

## 3.8 Train Set Ãœzerinde K-Fold Cross-Validation ile Feature Engineering SonrasÄ± Baseline Model


In [9]:
# Kategorik ve sayÄ±sal kolonlar
cat_cols = ["Brand", "model", "Transmission", "Owner", "FuelType"]
num_cols = ["Year", "Age", "kmDriven_clean", "price_per_km",
            "km_per_year", "log_kmDriven", "log_price"]

# Preprocessing pipeline
preprocessor_fe = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", "passthrough", num_cols)])

# Baseline model
baseline_fe_model = Pipeline(steps=[
    ("preprocess", preprocessor_fe),
    ("model", LinearRegression())])

# 5 katlÄ± CV
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

cv_scores_fe = cross_val_score(
    baseline_fe_model,
    X_train_full,
    y_train_full,
    scoring="r2",
    cv=kfold,
    n_jobs=-1)

cv_scores_fe, cv_scores_fe.mean()


(array([0.490913  , 0.6197182 , 0.53152207, 0.60744894, 0.59299837]),
 0.5685201170122596)

### 3.8.1  Feature Engineering SonrasÄ± Baseline CV PerformansÄ±

- 5 katlÄ± cross-validation RÂ² skorlarÄ±:  
  `[0.49, 0.62, 0.53, 0.61, 0.59]`  

- Ortalama CV skoru: `0.5685` civarÄ±nda.

- **SonuÃ§:**
- Feature engineering sonrasÄ± modelin performansÄ± ~%35 â†’ ~%57 seviyesine yÃ¼kselmiÅŸtir.  
- Bu, Ã¼retilen yeni Ã¶zelliklerin modele Ã¶nemli katkÄ± saÄŸladÄ±ÄŸÄ±nÄ± gÃ¶sterir.  
- ArtÄ±k daha gÃ¼Ã§lÃ¼ modeller (RandomForest, XGBoost, LightGBM) kullanarak performansÄ± daha da artÄ±rabiliriz.


## 3.9 Validation Set Ãœzerinde Baseline FE Modelin DeÄŸerlendirilmesi


In [10]:
# Train set Ã¼zerinde modeli eÄŸit
baseline_fe_model.fit(X_train_full, y_train_full)

# Validation set Ã¼zerinde tahminler
y_val_pred = baseline_fe_model.predict(X_val)

# Performans metrikleri
val_mse = mean_squared_error(y_val, y_val_pred)
val_rmse = val_mse ** 0.5
val_mae = mean_absolute_error(y_val, y_val_pred)
val_r2 = r2_score(y_val, y_val_pred)

val_mse, val_rmse, val_mae, val_r2


(977805893637.6936, 988840.6816255557, 507941.6834246942, 0.5714288508799485)

###  3.9.1 Validation Set PerformansÄ±

- **MSE:** yaklaÅŸÄ±k `9.78e11`  
- **RMSE:** yaklaÅŸÄ±k `988,840`  
- **MAE:** yaklaÅŸÄ±k `507,942`  
- **RÂ²:** `0.5714` civarÄ±nda

- **SonuÃ§:**
- Validation RÂ² deÄŸeri (~0.571) â†’ Train Ã¼zerindeki CV ortalamasÄ± (~0.568) ile neredeyse aynÄ±dÄ±r.
- Bu, aÅŸÄ±rÄ± Ã¶ÄŸrenme (overfitting) olmadÄ±ÄŸÄ±nÄ± gÃ¶sterir.
- Feature engineering baÅŸarÄ±lÄ± olmuÅŸ ve model stabil Ã§alÄ±ÅŸmaktadÄ±r.
- ArtÄ±k performansÄ± artÄ±rmak iÃ§in daha gÃ¼Ã§lÃ¼ modellere (RandomForest, XGBoost, LightGBM) geÃ§ilebilir.
