* Objective: To create more sophisticated and predictive features from the raw transactional data, aiming to improve model performance beyond the baseline.

* Key Library Choices: `pandas`, `numpy`, `scikit-learn` (for transformers), `feature_engine` (optional, for advanced feature engineering pipelines).

* Specific Technical Steps/Code Snippets:

**Load Cleaned/Joined Data:** Start with the comprehensive, joined Olist data.
**Define Snapshot Date:** Crucial for time-based feature engineering. All features must be computed only from orders placed on or before the snapshot date, so that nothing from the churn prediction window leaks into them.
**RFM Features:** Calculate Recency, Frequency, Monetary value per customer.

In [5]:
import os
import pandas as pd

os.chdir(r"C:\Users\user\Desktop\zero2end-churn-prediction")

customers = pd.read_csv('data/olist_customers_dataset.csv')
orders = pd.read_csv('data/olist_orders_dataset.csv')
order_items = pd.read_csv('data/olist_order_items_dataset.csv')
products = pd.read_csv('data/olist_products_dataset.csv')

In [6]:
# Orders + Customers
df_merged = pd.merge(orders, customers, on='customer_id', how='left')
df_merged['order_purchase_timestamp'] = pd.to_datetime(df_merged['order_purchase_timestamp'])

# Orders + Order Items
df_merged = pd.merge(df_merged, order_items, on='order_id', how='left')

# There is no payment column here, so build one from price + freight_value
df_merged['payment_value'] = df_merged['price'] + df_merged['freight_value']

In [7]:
# Example: df_merged is ready and contains order_purchase_timestamp, price, freight_value
df_customer_transactions = df_merged.copy()

# Create payment_value from price + freight_value only if it does not exist yet
if 'payment_value' not in df_customer_transactions.columns:
    df_customer_transactions['payment_value'] = df_customer_transactions['price'] + df_customer_transactions['freight_value']

# Set snapshot_date (e.g. 90 days before the last purchase; the final 90 days are the prediction window)
snapshot_date = df_customer_transactions['order_purchase_timestamp'].max() - pd.Timedelta(days=90)

# Keep only rows on or before the snapshot date
customer_data = df_customer_transactions[df_customer_transactions['order_purchase_timestamp'] <= snapshot_date]

# Recency
recency_df = customer_data.groupby('customer_unique_id')['order_purchase_timestamp'].max().reset_index()
recency_df['Recency'] = (snapshot_date - recency_df['order_purchase_timestamp']).dt.days

# Frequency
frequency_df = customer_data.groupby('customer_unique_id')['order_id'].nunique().reset_index()
frequency_df.rename(columns={'order_id': 'Frequency'}, inplace=True)

# Monetary
monetary_df = customer_data.groupby('customer_unique_id')['payment_value'].sum().reset_index()
monetary_df.rename(columns={'payment_value': 'Monetary'}, inplace=True)

# Merge the three RFM tables
df_fe = pd.merge(recency_df[['customer_unique_id', 'Recency']], frequency_df, on='customer_unique_id', how='left')
df_fe = pd.merge(df_fe, monetary_df, on='customer_unique_id', how='left')

df_fe.head()

|   | customer_unique_id | Recency | Frequency | Monetary |
|---|---|---|---|---|
| 0 | 0000366f3b9a7992bf8c76cfdf3221e2 | 70 | 1 | 141.9 |
| 1 | 0000b849f77a49e4a4ce2b2a4ca5be3f | 73 | 1 | 27.19 |
| 2 | 0000f46a3911fa3c0805444483337064 | 495 | 1 | 86.22 |
| 3 | 0000f6ccb0745a6a4b88665a16c9f078 | 279 | 1 | 43.62 |
| 4 | 0004aac84e0df4da2b147fca70cf8255 | 246 | 1 | 196.89 |


**Time-Based Features:**
* Average time between purchases.

* Number of purchases in last 7, 30, 90 days.

* Days since first purchase.

* Purchase velocity (e.g., frequency / days since first purchase).
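
The time-based features above could be computed roughly as follows. This is a minimal sketch on a made-up toy table, not the notebook's actual implementation: `orders_toy`, `time_fe`, and the column names `avg_days_between`, `days_since_first`, `orders_last_{N}d`, and `purchase_velocity` are assumptions introduced for illustration, and the trailing windows are measured backwards from the snapshot date.

```python
import pandas as pd

# Toy order-level data (hypothetical values, one row per order)
orders_toy = pd.DataFrame({
    "customer_unique_id": ["a", "a", "a", "b", "c"],
    "order_id": ["o1", "o2", "o3", "o4", "o5"],
    "order_purchase_timestamp": pd.to_datetime(
        ["2018-01-01", "2018-01-11", "2018-01-31", "2018-01-28", "2018-03-01"]
    ),
})
snapshot_date = pd.Timestamp("2018-02-01")

# Keep only orders on or before the snapshot so no future data leaks in
hist = orders_toy[orders_toy["order_purchase_timestamp"] <= snapshot_date]
hist = hist.sort_values(["customer_unique_id", "order_purchase_timestamp"])

# Gap in days between consecutive orders of the same customer
# (NaN for each customer's first order)
gap_days = hist.groupby("customer_unique_id")["order_purchase_timestamp"].diff().dt.days

time_fe = (
    hist.assign(gap_days=gap_days)
    .groupby("customer_unique_id")
    .agg(
        first_purchase=("order_purchase_timestamp", "min"),
        n_orders=("order_id", "nunique"),
        avg_days_between=("gap_days", "mean"),  # NaN for single-order customers
    )
    .reset_index()
)
time_fe["days_since_first"] = (snapshot_date - time_fe["first_purchase"]).dt.days

# Number of orders in the trailing 7, 30, and 90 days before the snapshot
for window in (7, 30, 90):
    start = snapshot_date - pd.Timedelta(days=window)
    counts = (
        hist[hist["order_purchase_timestamp"] > start]
        .groupby("customer_unique_id")["order_id"]
        .nunique()
    )
    time_fe[f"orders_last_{window}d"] = (
        time_fe["customer_unique_id"].map(counts).fillna(0).astype(int)
    )

# Purchase velocity: orders per day of tenure.
# ASSUMPTION: tenure is floored at 1 day to avoid division by zero.
time_fe["purchase_velocity"] = time_fe["n_orders"] / time_fe["days_since_first"].clip(lower=1)

print(time_fe)
```

Customers with a single order get a missing `avg_days_between`, which has to be imputed or flagged later. Customer `c` is absent from the result because their only order falls after the snapshot date.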

**Product-Related Features:**
* Number of unique product categories purchased.

* Most frequent product category.

* Average product price.

* Ratio of expensive to cheap items.
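
A possible way to derive the product-related features is sketched below on a toy item-level table. In the Olist data the category would come from joining `products` to `order_items` on `product_id`, a join the cells above do not perform yet. The "expensive" threshold (overall median price), the `+ 1` in the ratio's denominator, and the names `items_toy`, `product_fe`, and `most_frequent` are assumptions made for this example.

```python
import pandas as pd

# Toy item-level data (hypothetical values, one row per purchased item)
items_toy = pd.DataFrame({
    "customer_unique_id": ["a", "a", "a", "b", "b"],
    "product_category_name": ["toys", "toys", "books", "books", None],
    "price": [20.0, 150.0, 30.0, 200.0, 40.0],
})

# ASSUMPTION: an item counts as "expensive" if its price is above the overall median
price_threshold = items_toy["price"].median()
items_toy["is_expensive"] = items_toy["price"] > price_threshold


def most_frequent(s):
    """Return the most common non-missing value, or None if there is none."""
    modes = s.mode()  # mode() ignores NaN; ties come back sorted
    return modes.iloc[0] if not modes.empty else None


product_fe = (
    items_toy.groupby("customer_unique_id")
    .agg(
        n_categories=("product_category_name", "nunique"),  # NaN not counted
        top_category=("product_category_name", most_frequent),
        avg_price=("price", "mean"),
        n_expensive=("is_expensive", "sum"),
        n_items=("price", "size"),
    )
    .reset_index()
)

# Ratio of expensive to cheap items.
# ASSUMPTION: 1 is added to the denominator so customers with no cheap items
# do not trigger a division by zero.
n_cheap = product_fe["n_items"] - product_fe["n_expensive"]
product_fe["expensive_to_cheap_ratio"] = product_fe["n_expensive"] / (n_cheap + 1)

print(product_fe)
```

The threshold is computed on the whole toy table here. With real data it should be computed only on rows up to the snapshot date, for the same leakage reason as the RFM features.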

**Geographic/Demographic Features:**
* Customer state (one-hot encoded).

* Number of unique cities/states purchased from (if applicable).
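
The geographic features might look like the sketch below. It reads "unique cities/states" as the distinct delivery addresses seen for one `customer_unique_id` (a customer can appear under several `customer_id` values), which is an interpretation rather than something the notebook states. The toy table `cust_toy`, the result `geo_fe`, and the choice of the last seen state as the customer's state are assumptions.

```python
import pandas as pd

# Toy customer rows (hypothetical values); one row per order's delivery address
cust_toy = pd.DataFrame({
    "customer_unique_id": ["a", "a", "b", "c"],
    "customer_city": ["sao paulo", "campinas", "rio de janeiro", "sao paulo"],
    "customer_state": ["SP", "SP", "RJ", "SP"],
})

geo_fe = (
    cust_toy.groupby("customer_unique_id")
    .agg(
        n_cities=("customer_city", "nunique"),
        n_states=("customer_state", "nunique"),
        # ASSUMPTION: the last row per customer is the most recent address;
        # with real data, sort by order_purchase_timestamp first.
        customer_state=("customer_state", "last"),
    )
    .reset_index()
)

# One-hot encode the state; dtype=int yields 0/1 instead of booleans
geo_fe = pd.get_dummies(geo_fe, columns=["customer_state"], prefix="state", dtype=int)

print(geo_fe)
```

Brazil has 27 federative units, so this adds at most 27 `state_*` columns. Rare states could be grouped into an "other" bucket if the model is sensitive to sparse columns.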

**Order-Related Features:**
* Average freight value.

* Payment type (one-hot encoded).

* Number of installments (if applicable).
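
The order-related features can be sketched as below. Payment type and installments are not in any table loaded above; in the public Olist release they sit in `olist_order_payments_dataset.csv` (columns `payment_type` and `payment_installments`), which would have to be read and joined on `order_id` first. The toy table `pay_toy` is simplified to one row per order, and it counts orders per payment type, which reduces to a one-hot encoding for single-order customers. All names other than the Olist column names are assumptions.

```python
import pandas as pd

# Toy data (hypothetical values), simplified to one row per order
pay_toy = pd.DataFrame({
    "customer_unique_id": ["a", "a", "b"],
    "order_id": ["o1", "o2", "o3"],
    "freight_value": [10.0, 20.0, 8.0],
    "payment_type": ["credit_card", "boleto", "credit_card"],
    "payment_installments": [3, 1, 6],
})

# Numeric aggregates per customer
order_fe = pay_toy.groupby("customer_unique_id").agg(
    avg_freight=("freight_value", "mean"),
    avg_installments=("payment_installments", "mean"),
    max_installments=("payment_installments", "max"),
)

# Count of orders per payment type, one column per type
pay_counts = pd.crosstab(
    pay_toy["customer_unique_id"], pay_toy["payment_type"]
).add_prefix("pay_")

order_fe = order_fe.join(pay_counts).reset_index()

print(order_fe)
```

With the real tables, freight is recorded per item and payments per payment record, so each should be aggregated to order level before joining. Otherwise the join duplicates rows and inflates the averages.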

**Interaction Features:** Combine existing features (e.g., Recency * Monetary).
**Handle Missing Values:** Imputation strategies (mean, median, mode) or explicit missing-value indicator columns.
**Categorical Encoding:** One-hot encoding, target encoding, or label encoding for suitable features.
**Feature Scaling:** Apply standard scaling or min-max scaling to numerical features if required by the model.
**Store Processed Data:** Save the engineered feature set for subsequent notebooks.
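
The last five steps can be chained as in the sketch below, which uses scikit-learn's `ColumnTransformer`. It is one possible arrangement under stated assumptions, not the notebook's pipeline: the toy table `fe_toy`, the choice of median and most-frequent imputation, one-hot encoding only (no target or label encoding), standard scaling, and the output path in the system temp directory are all chosen for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy feature table (hypothetical values) with gaps in both column types
fe_toy = pd.DataFrame({
    "customer_unique_id": ["a", "b", "c", "d"],
    "Recency": [10.0, 200.0, np.nan, 50.0],
    "Monetary": [100.0, 20.0, 60.0, np.nan],
    "top_category": ["toys", "books", np.nan, "toys"],
})

# Missing-value indicator, recorded before imputation overwrites the gap
fe_toy["Recency_missing"] = fe_toy["Recency"].isna().astype(int)
# Interaction feature; NaN propagates here and is imputed below
fe_toy["Recency_x_Monetary"] = fe_toy["Recency"] * fe_toy["Monetary"]

num_cols = ["Recency", "Monetary", "Recency_x_Monetary"]
cat_cols = ["top_category"]
flag_cols = ["Recency_missing"]

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
pre = ColumnTransformer(
    [
        ("num", numeric, num_cols),
        ("cat", categorical, cat_cols),
        ("flag", "passthrough", flag_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# NOTE: in a real workflow, fit on the training split only and reuse the
# fitted transformer on validation/test data.
X = pre.fit_transform(fe_toy)

# Rebuild column names from the fitted encoder's categories
encoder = pre.named_transformers_["cat"].named_steps["encode"]
cat_names = [f"top_category_{c}" for c in encoder.categories_[0]]
processed = pd.DataFrame(X, columns=num_cols + cat_names + flag_cols)
processed.insert(0, "customer_unique_id", fe_toy["customer_unique_id"].to_numpy())

# Store for later notebooks (temp directory used here only to keep the sketch side-effect free)
out_path = Path(tempfile.gettempdir()) / "olist_features_sketch.csv"
processed.to_csv(out_path, index=False)
```

Tree-based models generally do not need the scaling step, so the `scale` stage can be dropped for them while keeping the imputers and the encoder.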
