* Objective: Consolidate all preprocessing, feature engineering, and optimized model training into a robust, reproducible, executable pipeline. This notebook serves as the blueprint for the `pipeline.py` script.

* Key Library Choices: `pandas`, `numpy`, `scikit-learn` (for `Pipeline` and its transformers), `lightgbm` (for `LGBMClassifier`), and `joblib` or `pickle` for saving the model.

* Specific Technical Steps/Code Snippets:

**Define Preprocessing Functions:** Encapsulate data loading, joining, and timestamp conversions into reusable functions.
**Define Feature Engineering Functions:** Create functions for RFM calculation, time-based features, categorical encoding, etc., ensuring they are robust to new data.
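
As a rough illustration of the preprocessing step described above (joining tables and converting timestamps), the sketch below uses two tiny in-memory tables in place of the real files. The column names `order_id`, `customer_id`, and `order_purchase_timestamp`, and both helper functions, are assumptions made for this example; they are not taken from the project's notebooks.

```python
import pandas as pd


def convert_timestamps(df, columns):
    """Return a copy of df with the given columns parsed as datetimes."""
    out = df.copy()
    for col in columns:
        # errors="coerce" turns unparseable values into NaT instead of raising
        out[col] = pd.to_datetime(out[col], errors="coerce")
    return out


def join_orders_customers(df_orders, df_customers):
    """Left-join customer attributes onto orders via customer_id."""
    # validate= raises if customer_id is duplicated in the customer table
    return df_orders.merge(
        df_customers, on="customer_id", how="left", validate="many_to_one"
    )


# Hypothetical stand-ins for the real order and customer tables
df_orders = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "customer_id": ["c1", "c2", "c1"],
    "order_purchase_timestamp": [
        "2025-01-05 10:00:00", "2025-02-01 12:30:00", "not a date",
    ],
})
df_customers = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "customer_unique_id": ["u1", "u2"],
    "customer_state": ["SP", "RJ"],
})

df_transactions = join_orders_customers(
    convert_timestamps(df_orders, ["order_purchase_timestamp"]),
    df_customers,
)
print(df_transactions.dtypes)
```

Keeping each step as a small pure function (input frame in, new frame out) makes it easier to reuse the same code in `pipeline.py` and to test it in isolation.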

In [1]:
def create_rfm_features(df_transactions, snapshot_date):
    # ... RFM calculation logic as in FE notebook ...
    return df_rfm

def create_target_variable(df_transactions, snapshot_date, observation_days=180, prediction_days=90):
    # ... Churn definition logic as in Baseline notebook ...
    return df_with_target
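
The two function bodies above are elided in this notebook, so the following is only one possible way to fill them in, not the logic from the FE or Baseline notebooks. It assumes a transaction table with the hypothetical columns `order_id`, `order_date`, and `payment_value`, and it assumes churn means "purchased during the observation window but not during the prediction window". The function names carry a `_sketch` suffix to keep them distinct from the real ones.

```python
import pandas as pd


def rfm_sketch(df_transactions, snapshot_date):
    """One row per customer with Recency (days), Frequency, Monetary."""
    snapshot_date = pd.Timestamp(snapshot_date)
    # Use only history strictly before the snapshot to avoid leaking the future
    history = df_transactions[df_transactions["order_date"] < snapshot_date]
    rfm = history.groupby("customer_unique_id").agg(
        last_purchase=("order_date", "max"),
        Frequency=("order_id", "nunique"),
        Monetary=("payment_value", "sum"),
    )
    rfm["Recency"] = (snapshot_date - rfm["last_purchase"]).dt.days
    rfm["avg_order_value"] = rfm["Monetary"] / rfm["Frequency"]
    return rfm.reset_index()


def churn_target_sketch(df_transactions, snapshot_date,
                        observation_days=180, prediction_days=90):
    """Label customers active in the observation window: 1 = did not return."""
    snapshot_date = pd.Timestamp(snapshot_date)
    obs_start = snapshot_date - pd.Timedelta(days=observation_days)
    pred_end = snapshot_date + pd.Timedelta(days=prediction_days)
    dates = df_transactions["order_date"]
    ids = df_transactions["customer_unique_id"]
    active = ids[(dates >= obs_start) & (dates < snapshot_date)].unique()
    returned = set(ids[(dates >= snapshot_date) & (dates < pred_end)])
    target = pd.DataFrame({"customer_unique_id": active})
    target["is_churn"] = (~target["customer_unique_id"].isin(returned)).astype(int)
    return target


# Hypothetical transactions: u1 stops buying, u2 returns after the snapshot
df_tx = pd.DataFrame({
    "customer_unique_id": ["u1", "u1", "u2", "u2"],
    "order_id": ["o1", "o2", "o3", "o4"],
    "order_date": pd.to_datetime(
        ["2025-01-10", "2025-03-01", "2025-02-15", "2025-04-20"]
    ),
    "payment_value": [100.0, 50.0, 80.0, 30.0],
})
snapshot = "2025-04-01"
rfm = rfm_sketch(df_tx, snapshot)
target = churn_target_sketch(df_tx, snapshot)
features = rfm.merge(target, on="customer_unique_id", how="inner")
print(features)
```

Computing features only from rows before `snapshot_date` and labels only from rows on or after it is what keeps the target from leaking into the features.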

**Create Scikit-learn Pipeline:** Combine preprocessing steps (e.g., imputation, scaling, encoding) with the optimized model. This ensures consistency between training and inference.

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from lightgbm import LGBMClassifier

# --- Example parameters; replace them with the values found by Optuna or grid search ---
optimized_model_params = {
    "n_estimators": 500,
    "learning_rate": 0.03,
    "num_leaves": 31,
    "max_depth": -1,
    "min_child_samples": 20,
    "subsample": 0.8,  # note: row subsampling only takes effect when subsample_freq > 0
    "colsample_bytree": 0.8,
    "objective": "binary"
}

# --- Feature selection ---
numerical_features = ['Recency', 'Frequency', 'Monetary', 'avg_order_value']
categorical_features = ['customer_state']

# --- Preprocessing ---
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    remainder='passthrough'
)

# --- Build the pipeline ---
final_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LGBMClassifier(**optimized_model_params, random_state=42))
])

**Full Pipeline Training:** Train the complete pipeline on the entire training dataset (or a re-split, if cross-validation was used extensively in optimization).

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split

# --- Sample data (only so the notebook runs end to end) ---
baseline_df = pd.DataFrame({
    'customer_unique_id': range(1, 101),
    'Recency': [10, 20, 30, 40, 50]*20,
    'Frequency': [1, 2, 3, 4, 5]*20,
    'Monetary': [100, 200, 300, 400, 500]*20,
    'avg_order_value': [50, 60, 70, 80, 90]*20,
    'customer_state': ['CA', 'NY', 'TX', 'FL', 'WA']*20,
    'last_purchase': pd.date_range(start='2025-01-01', periods=100),
    'is_churn': [0, 1, 0, 1, 0]*20
})

# --- Build X and y ---
X = baseline_df.drop(['customer_unique_id','is_churn','last_purchase'], axis=1)
y = baseline_df['is_churn']

# --- Train-test split ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# --- Fit the pipeline ---
final_pipeline.fit(X_train, y_train)


[LightGBM] [Info] Number of positive: 32, number of negative: 48
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000281 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 24
[LightGBM] [Info] Number of data points in the train set: 80, number of used features: 4
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.400000 -> initscore=-0.405465
[LightGBM] [Info] Start training from score -0.405465


**Final Evaluation:** Evaluate the entire pipeline on the held-out test set to confirm performance.
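
The notebook does not show an evaluation cell, so here is a minimal sketch of what one could look like. `evaluate_pipeline` is a hypothetical helper that accepts any fitted pipeline exposing `predict` and `predict_proba`. To keep the example self-contained, it is demonstrated on synthetic data with a `LogisticRegression` stand-in rather than the LightGBM pipeline; the choice of metrics (ROC AUC, PR AUC, classification report) is also an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score, classification_report, roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_pipeline(pipeline, X_test, y_test):
    """Score a fitted binary classifier pipeline on held-out data."""
    proba = pipeline.predict_proba(X_test)[:, 1]  # probability of class 1
    preds = pipeline.predict(X_test)
    return {
        "roc_auc": roc_auc_score(y_test, proba),
        "pr_auc": average_precision_score(y_test, proba),
        "report": classification_report(y_test, preds, zero_division=0),
    }


# Synthetic stand-in data and model, used only to exercise the helper
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.6, 0.4], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
stand_in_pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("classifier", LogisticRegression()),
]).fit(X_tr, y_tr)

metrics = evaluate_pipeline(stand_in_pipeline, X_te, y_te)
print(f"ROC AUC: {metrics['roc_auc']:.3f}  PR AUC: {metrics['pr_auc']:.3f}")
print(metrics["report"])
```

In the notebook itself the same helper would be called as `evaluate_pipeline(final_pipeline, X_test, y_test)`.
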
**Save Final Pipeline:** Serialize the complete pipeline, including preprocessor and model, for deployment.

In [10]:
import os
import joblib

# --- Create the folder if it does not exist ---
os.makedirs('models', exist_ok=True)

# --- Save the pipeline ---
joblib.dump(final_pipeline, 'models/churn_prediction_pipeline.pkl')
print("Pipeline saved successfully!")

Pipeline saved successfully!
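
A quick round-trip check can confirm that the serialized file reproduces the in-memory pipeline before it is handed over for deployment. The sketch below is self-contained, so it uses a small stand-in pipeline and a temporary directory instead of `final_pipeline` and the `models/` folder; the data and the model are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data and pipeline, used only to demonstrate the round trip
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)
stand_in_pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("classifier", LogisticRegression()),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "churn_prediction_pipeline.pkl"
    joblib.dump(stand_in_pipeline, path)
    reloaded_pipeline = joblib.load(path)

# The reloaded object should give exactly the same predictions
predictions_match = np.array_equal(
    stand_in_pipeline.predict(X_demo), reloaded_pipeline.predict(X_demo)
)
print("Round trip OK:", predictions_match)
```

Pickle-based files are generally only safe to load in an environment with the same library versions that wrote them, so it is worth recording the `scikit-learn` and `lightgbm` versions alongside the saved pipeline.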


**Generate Feature List:** Save the list of expected features for inference.
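
No cell for this step appears in the notebook, so the following is a sketch under assumptions: the feature list is stored as JSON, and the helper names `save_feature_list`, `load_feature_list`, and `align_to_features` are hypothetical. A temporary directory is used here so the example runs anywhere; in the project the file would more naturally sit next to the saved pipeline.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

numerical_features = ['Recency', 'Frequency', 'Monetary', 'avg_order_value']
categorical_features = ['customer_state']
expected_features = numerical_features + categorical_features


def save_feature_list(features, path):
    """Write the expected input columns to a JSON file."""
    payload = {"features": list(features)}
    Path(path).write_text(json.dumps(payload, indent=2), encoding="utf-8")


def load_feature_list(path):
    """Read the expected input columns back from JSON."""
    return json.loads(Path(path).read_text(encoding="utf-8"))["features"]


def align_to_features(df, features):
    """Fail loudly on missing columns; drop extras and fix the column order."""
    missing = [col for col in features if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected features: {missing}")
    return df[features]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "feature_list.json"
    save_feature_list(expected_features, path)
    loaded_features = load_feature_list(path)

# Hypothetical inference row with an extra column and a shuffled order
new_data = pd.DataFrame({
    'customer_state': ['CA'],
    'customer_unique_id': [101],
    'Monetary': [250.0],
    'Recency': [12],
    'avg_order_value': [62.5],
    'Frequency': [4],
})
aligned = align_to_features(new_data, loaded_features)
print(list(aligned.columns))
```

Validating incoming data against the saved list at inference time surfaces schema drift as an explicit error instead of a silent mis-prediction.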