# 02: Feature Engineering

This notebook walks through the `FeatureEngineer` + `build_pipeline()` flow visually. It:
- Verifies the cyclic month encoding
- Inspects the frequency encoding
- Checks the shape and column names of the pipeline output
- Documents the safeguard against training-serving skew

In [None]:
import sys

sys.path.insert(0, "..")

import numpy as np
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (10, 4)
plt.rcParams["axes.spines.top"] = False
plt.rcParams["axes.spines.right"] = False

## 1. FeatureEngineer Transformations

In [None]:
from src.features import FeatureEngineer
from src.io import read_input_dataset
from src.config import Paths
from src.preprocess import preprocess_basic
from src.split import split_dataset

paths = Paths()
df, _ = read_input_dataset(paths.raw_data)
df = preprocess_basic(
    df,
    target_col="is_canceled",
    label_map={"no": 0, "yes": 1, 0: 0, 1: 1},
)
print(f"Preprocessed shape: {df.shape}")

In [None]:
# Train/test split
split = split_dataset(df, target_col="is_canceled")
X_train, y_train = split.X_train, split.y_train
X_test, y_test = split.X_test, split.y_test
print(f"Train: {X_train.shape}, Test: {X_test.shape}")

### 1.1 Cyclic Month Encoding

In [None]:
fe = FeatureEngineer()
X_fe = fe.fit_transform(X_train)

if "arrival_date_month_sin" in X_fe.columns:
    months = (
        X_train["arrival_date_month"].unique()
        if "arrival_date_month" in X_train.columns
        else []
    )
    sample = (
        X_fe[["arrival_date_month_sin", "arrival_date_month_cos"]]
        .drop_duplicates()
        .sort_values("arrival_date_month_sin")
    )

    theta = np.linspace(0, 2 * np.pi, 200)
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    # Unit circle
    axes[0].plot(np.cos(theta), np.sin(theta), "lightgrey", zorder=0)
    axes[0].scatter(
        sample["arrival_date_month_cos"],
        sample["arrival_date_month_sin"],
        c=range(len(sample)),
        cmap="hsv",
        s=80,
        zorder=1,
    )
    axes[0].set_title("Cyclic month encoding (unit circle)")
    axes[0].set_aspect("equal")
    axes[0].axhline(0, c="grey", lw=0.5)
    axes[0].axvline(0, c="grey", lw=0.5)

    # sin/cos values
    axes[1].scatter(
        sample["arrival_date_month_cos"],
        sample["arrival_date_month_sin"],
        c=range(len(sample)),
        cmap="hsv",
        s=60,
    )
    axes[1].set_xlabel("cos")
    axes[1].set_ylabel("sin")
    axes[1].set_title("sin vs cos scatter")

    plt.tight_layout()
    plt.show()
    print(
        "January (1) and December (12) are neighbours on the circle; "
        "one-hot encoding cannot represent that adjacency."
    )
else:
    print("arrival_date_month_sin not found.")
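
The adjacency claim can also be checked numerically. The following is a minimal sketch, not the `FeatureEngineer` implementation: it assumes months are numbered 1 to 12 and mapped to the angle `2 * pi * (month - 1) / 12`, and the helper `encode_month` is hypothetical.

```python
import math


def encode_month(month: int) -> tuple[float, float]:
    """Map a month number (1-12) to a (sin, cos) point on the unit circle.

    Hypothetical helper; the actual FeatureEngineer mapping may differ.
    """
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)


jan, feb, jul, dec = (encode_month(m) for m in (1, 2, 7, 12))

# Neighbouring months are equally far apart, including the Dec -> Jan wrap-around.
d_jan_feb = math.dist(jan, feb)
d_dec_jan = math.dist(dec, jan)
# Months half a year apart sit on opposite sides of the circle.
d_jan_jul = math.dist(jan, jul)

print(f"Jan-Feb: {d_jan_feb:.3f}, Dec-Jan: {d_dec_jan:.3f}, Jan-Jul: {d_jan_jul:.3f}")
```

With an ordinal encoding (1 to 12), December and January would be 11 units apart. On the circle they are exactly as close as any other pair of consecutive months.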

### 1.2 Frequency Encoding — Country

In [None]:
if "country_freq" in X_fe.columns:
    freq_stats = X_fe["country_freq"].describe()
    print(freq_stats)
    X_fe["country_freq"].hist(bins=30, color="#0891b2", edgecolor="none")
    plt.title("Country Frequency Encoding Distribution")
    plt.xlabel("Frequency (0–1)")
    plt.ylabel("Number of records")
    plt.tight_layout()
    plt.show()
else:
    print("country_freq column not found.")
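
For reference, frequency encoding can be sketched in a few lines. This is an illustration under stated assumptions, not the code in `src.features`: it assumes the encoded value is the category's share of the training rows and that categories unseen during fitting map to `0.0`. The helpers `fit_frequency_map` and `apply_frequency_map` and the toy country lists are hypothetical.

```python
from collections import Counter


def fit_frequency_map(values: list[str]) -> dict[str, float]:
    """Learn each category's relative frequency from training data only."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def apply_frequency_map(
    values: list[str], freq_map: dict[str, float], default: float = 0.0
) -> list[float]:
    """Replace each category with its learned frequency.

    Categories unseen during fitting get `default` (an assumption of this sketch).
    """
    return [freq_map.get(value, default) for value in values]


train_countries = ["PRT", "PRT", "PRT", "GBR", "FRA"]
serving_countries = ["PRT", "GBR", "USA"]  # "USA" never appeared in training

freq_map = fit_frequency_map(train_countries)
encoded = apply_frequency_map(serving_countries, freq_map)
print(encoded)
```

The map is learned from the training rows only and then reused unchanged, so the same country always receives the same value regardless of which batch it appears in.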

## 2. Full Pipeline (FeatureEngineer + ColumnTransformer)

In [None]:
from src.features import build_pipeline

pipe = build_pipeline(model=None)  # no model: only the preprocessing steps are needed
prep = pipe.named_steps.get(
    "preprocessor", pipe[:-1] if hasattr(pipe, "__len__") else pipe
)
prep.fit(X_train)

feature_names = prep.get_feature_names_out()
print(f"Pipeline output size: {len(feature_names)} features")
print("First 20 features:")
for name in feature_names[:20]:
    print(" ", name)

## Summary

| Step | Column / Type | Transformation |
|------|---------------|----------------|
| FeatureEngineer | `arrival_date_month` | sin/cos cyclic |
| FeatureEngineer | `country` | frequency (0–1) |
| ColumnTransformer | numeric | median impute → StandardScaler |
| ColumnTransformer | categorical | most_frequent impute → OHE |

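The pipeline applies **the same transformations** at training and serving time, because the statistics learned during `fit` are stored in the fitted pipeline and reused by `transform`. This removes the main source of training-serving skew, provided the same fitted pipeline object is the one that gets deployed.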
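
The mechanism can be illustrated without scikit-learn. The sketch below uses a hypothetical `MedianCenterer` class and made-up numbers. It shows that reusing persisted fit-time statistics keeps serving output consistent with training, while re-fitting on serving data silently changes the result.

```python
import json
import statistics


class MedianCenterer:
    """Hypothetical stateful transform: learns a median in fit(), subtracts it in transform()."""

    def __init__(self, median=None):
        self.median = median

    def fit(self, values):
        self.median = statistics.median(values)
        return self

    def transform(self, values):
        return [v - self.median for v in values]


train_values = [3.0, 10.0, 45.0]
serving_values = [100.0, 200.0]

# Training time: fit on training data and persist the learned state.
fitted = MedianCenterer().fit(train_values)
artifact = json.dumps({"median": fitted.median})

# Serving time: restore the state and only call transform().
served = MedianCenterer(**json.loads(artifact))
consistent = served.transform(serving_values)

# Anti-pattern: re-fitting on serving data changes the learned statistic.
skewed = MedianCenterer().fit(serving_values).transform(serving_values)

print(consistent, skewed)
```

The two results differ because the second path learns its median from the serving batch. Deploying the fitted object, rather than the fitting code, is what prevents this.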

Next notebook: `03_training.ipynb`