
# -- 🚗 Used Car Price Prediction --
# 2. Baseline Model

In this notebook, a baseline model is built on the dataset using a **simple feature set and a basic model (Linear Regression)**.

Goals:
- Use a preprocessing pipeline that is not overly complex,
- Fit a simple linear model,
- Compare the resulting scores with more advanced models later.


In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

plt.style.use("seaborn-v0_8")
pd.set_option("display.max_columns", None)

In [None]:
# Dataset

df = pd.read_csv("/kaggle/input/automl88/used_cars_dataset_v2.csv")
df.head()


In [None]:
# Data cleaning: convert kmDriven and AskPrice to numeric

def clean_km(x):
    if pd.isna(x):
        return np.nan
    x = str(x).lower().replace("km", "").replace(",", "").strip()
    try:
        return float(x)
    except ValueError:
        return np.nan

def clean_price(x):
    if pd.isna(x):
        return np.nan
    # Lowercase first so variants such as "Rs." or "RS" match the replacements below
    x = str(x).lower()
    x = (
        x.replace("₹", "")
         .replace(",", "")
         .replace("rs.", "")
         .replace("rs", "")
         .strip())
    try:
        return float(x)
    except ValueError:
        return np.nan

df["kmDriven_clean"] = df["kmDriven"].apply(clean_km)
df["AskPrice_clean"] = df["AskPrice"].apply(clean_price)

# Basic handling of missing values: drop rows whose km or price could not be parsed
df = df.dropna(subset=["kmDriven_clean", "AskPrice_clean"])

df[["Brand", "model", "Year", "Age", "kmDriven_clean", "AskPrice_clean"]].head()
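The cleaning step above can also be expressed with vectorized pandas string methods. The following is a minimal sketch under assumptions: the sample strings (`"98,000 km"`, `"₹ 1,95,000"`, `"Rs. 3,75,000"`) are made up for illustration and may not match the exact formats in the dataset, and the names `sample_km`, `sample_price`, `km_numeric`, and `price_numeric` are hypothetical.

```python
import pandas as pd

# Hypothetical raw values; the real column formats may differ
sample_km = pd.Series(["98,000 km", "1,90,000 km", None, "unknown"])
sample_price = pd.Series(["₹ 1,95,000", "Rs. 3,75,000", None])

# Strip the unit and thousands separators, then coerce unparseable values to NaN
km_numeric = pd.to_numeric(
    sample_km.str.lower().str.replace(r"km|,", "", regex=True).str.strip(),
    errors="coerce")

# Strip the currency markers and separators, then coerce unparseable values to NaN
price_numeric = pd.to_numeric(
    sample_price.str.lower().str.replace(r"₹|rs\.?|,", "", regex=True).str.strip(),
    errors="coerce")

print(km_numeric.tolist())
print(price_numeric.tolist())
```

Like `clean_km` and `clean_price`, this turns anything that cannot be parsed into `NaN`, so the subsequent `dropna` call would behave the same way.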



## 2.1 Baseline Feature Set and Target Variable

Here we use a simple feature set:

- Categorical: `Brand`, `model`, `Transmission`, `Owner`, `FuelType`
- Numeric: `Year`, `Age`, `kmDriven_clean`
- Target: `AskPrice_clean`
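To illustrate how a categorical/numeric split like this is typically turned into a model matrix, here is a small sketch on toy data. The rows and the names `toy_train`, `toy_new`, and `toy_encoder` are invented for illustration and are not taken from the dataset. `sparse_threshold=0` is set only to force a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows (made up); only a subset of the notebook's columns is used
toy_train = pd.DataFrame({
    "Brand": ["Honda", "Toyota", "Honda"],
    "FuelType": ["Petrol", "Diesel", "Petrol"],
    "Age": [5, 3, 8],
})
# A new row whose Brand was never seen during fitting
toy_new = pd.DataFrame({"Brand": ["Kia"], "FuelType": ["Petrol"], "Age": [2]})

toy_encoder = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Brand", "FuelType"]),
        ("num", "passthrough", ["Age"]),
    ],
    sparse_threshold=0,  # always return a dense array for readability
)

encoded_train = toy_encoder.fit_transform(toy_train)
encoded_new = toy_encoder.transform(toy_new)

# Columns: Brand_Honda, Brand_Toyota, FuelType_Diesel, FuelType_Petrol, Age
print(encoded_train.shape)
print(encoded_new)
```

With `handle_unknown="ignore"`, the unseen brand is encoded as all zeros in the `Brand` columns instead of raising an error, which matters when rare brands or models end up only in the test split.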


In [None]:
# Data preparation for the baseline (train-test split)

features = ["Brand", "model", "Year", "Age",
            "kmDriven_clean", "Transmission", "Owner", "FuelType"]
target = "AskPrice_clean"

df_model = df[features + [target]].copy()

X = df_model.drop(target, axis=1)
y = df_model[target]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

X_train.shape, X_test.shape


In [None]:
# Baseline Model 

cat_cols = ["Brand", "model", "Transmission", "Owner", "FuelType"]
num_cols = ["Year", "Age", "kmDriven_clean"]

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", "passthrough", num_cols)])

baseline_model = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", LinearRegression())])

baseline_model



## 2.2 Training the Baseline Model and Test Performance


In [None]:
# Train the baseline model and evaluate it on the test set

baseline_model.fit(X_train, y_train)
y_pred = baseline_model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Test MSE :", mse)
print("Test RMSE:", rmse)
print("Test MAE :", mae)
print("Test R2  :", r2)
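
As a reference for how the reported metrics relate to each other, the sketch below recomputes RMSE, MAE, and R² by hand on four made-up values and checks them against scikit-learn. The arrays `y_true_demo` and `y_pred_demo` are hypothetical and unrelated to the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up targets and predictions, for illustration only
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 6.0, 10.0])

residuals = y_true_demo - y_pred_demo

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean(residuals ** 2))
# MAE: mean absolute residual
mae_manual = np.mean(np.abs(residuals))
# R2: 1 - (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# The manual values should agree with the library functions
assert np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true_demo, y_pred_demo)))
assert np.isclose(mae_manual, mean_absolute_error(y_true_demo, y_pred_demo))
assert np.isclose(r2_manual, r2_score(y_true_demo, y_pred_demo))
```

RMSE and MAE are in the same units as the target (here, the asking price), while R² is unitless: a model that always predicts the mean of the evaluated targets scores 0, and a perfect model scores 1.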


## 2.3 Evaluating the Baseline Model with Cross-Validation

Below, 5-fold cross-validation results are computed on the training set.
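For readers who want to see what the 5-fold procedure does internally, here is a minimal sketch that runs the folds by hand on synthetic data and compares the result with `cross_val_score`. The data (`X_demo`, `y_demo`) is randomly generated and the variable names are hypothetical; a plain `LinearRegression` stands in for the full pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# Manual loop: fit on four folds, score R2 on the held-out fold
manual_scores = []
for train_idx, val_idx in kfold_demo.split(X_demo):
    fold_model = LinearRegression()
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    fold_pred = fold_model.predict(X_demo[val_idx])
    manual_scores.append(r2_score(y_demo[val_idx], fold_pred))

# The same procedure through the library helper
library_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, scoring="r2", cv=kfold_demo)

assert np.allclose(manual_scores, library_scores)
```

Because `KFold` is given an integer `random_state`, both passes see identical splits, so the two sets of scores match.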


In [None]:
# 5-fold cross-validation on the training set only, so the test set stays held out

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = cross_val_score(
    baseline_model, X_train, y_train,
    scoring="r2",
    cv=kfold,
    n_jobs=-1)

print("CV R2 Scores:", cv_scores)
print("CV R2 Mean  :", cv_scores.mean())


## 2.4 Baseline Summary of Findings

- In this notebook, a **baseline** was established using a Linear Regression model with a simple feature set.
- The K-fold CV results show that the model's R² score settles at around 0.36.
- The model errors (RMSE, MAE) are relatively high, which indicates that the price dynamics are not fully captured.
- In the next step, the aim is to improve performance by deriving more informative variables through **feature engineering**.
