# Task 4: Proxy Target Variable Engineering (RFM-Based Risk Labeling)

## Objective
Since direct fraud/default outcomes are either unavailable or extremely sparse,
this task constructs a **proxy supervised learning target** using customer
behavioral segmentation based on **RFM (Recency, Frequency, Monetary) analysis**.

The resulting binary label (`is_high_risk`) is integrated with engineered
features from Task 3 to produce a **modeling-ready analytics base table**.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sys
from pathlib import Path

# ensure project root (parent of the notebooks folder) is on sys.path so local 'src' can be imported
sys.path.insert(0, str((Path.cwd() / "..").resolve()))

from src.proxy_target import proxy_target_pipeline

plt.rcParams["figure.figsize"] = (8, 5)
sns.set_style("whitegrid")


In [None]:
# Task 3 outputs
features = pd.read_csv("../data/processed/task-3-features.csv")

# Raw transaction data
transactions = pd.read_csv("../data/raw/data.csv")

transactions["TransactionStartTime"] = pd.to_datetime(
    transactions["TransactionStartTime"]
)

features.head()


### Why a Proxy Target is Needed

The dataset does not contain reliable or sufficiently frequent fraud/default
labels suitable for supervised modeling.

To enable downstream risk modeling, we construct a **proxy target variable**
using **customer behavioral patterns**, following industry-standard
RFM (Recency, Frequency, Monetary) segmentation techniques.

Customers exhibiting **prolonged inactivity, low engagement, and low monetary value**
are treated as higher risk.


In [None]:
rfm_labeled, features_with_target, cluster_centers = proxy_target_pipeline(
    transactions=transactions,
    feature_df=features
)

rfm_labeled.head()
#features_with_target.head()

## RFM Feature Construction

For each customer:
- **Recency**: Days since last transaction (higher = worse)
- **Frequency**: Number of transactions
- **Monetary**: Total transaction amount

A consistent snapshot date (latest transaction date) is used to ensure
recency comparability across customers.
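
The three definitions above can be sketched as a single group-by aggregation. This is a minimal illustration on toy data, not the code inside `proxy_target_pipeline`. `CustomerId` and `TransactionStartTime` appear in this notebook, while the `Amount` column name is an assumption about the raw data.

```python
import pandas as pd

# Toy transactions; "Amount" is an assumed column name for the transaction value.
tx = pd.DataFrame({
    "CustomerId": ["C1", "C1", "C2", "C3", "C3", "C3"],
    "TransactionStartTime": pd.to_datetime([
        "2019-01-01", "2019-01-10", "2018-12-01",
        "2019-01-05", "2019-01-08", "2019-01-12",
    ]),
    "Amount": [100.0, 50.0, 20.0, 10.0, 30.0, 60.0],
})

# One snapshot date shared by all customers: the latest transaction in the data.
snapshot = tx["TransactionStartTime"].max()

rfm = (
    tx.groupby("CustomerId")
      .agg(
          last_tx=("TransactionStartTime", "max"),     # most recent activity
          Frequency=("TransactionStartTime", "count"),  # number of transactions
          Monetary=("Amount", "sum"),                   # total transaction amount
      )
      .reset_index()
)

# Recency in whole days between the snapshot and each customer's last transaction.
rfm["Recency"] = (snapshot - rfm["last_tx"]).dt.days
rfm = rfm[["CustomerId", "Recency", "Frequency", "Monetary"]]
print(rfm)
```

Using one snapshot date for every customer keeps Recency values on the same reference point, so a customer with Recency 42 has been inactive exactly 40 days longer than one with Recency 2.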


In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

sns.boxplot(y=rfm_labeled["Recency"], ax=axes[0])
axes[0].set_title("Recency Distribution")

sns.boxplot(y=rfm_labeled["Frequency"], ax=axes[1])
axes[1].set_title("Frequency Distribution")

sns.boxplot(y=rfm_labeled["Monetary"], ax=axes[2])
axes[2].set_title("Monetary Distribution")

plt.show()


In [None]:
rfm_labeled[["Recency", "Frequency", "Monetary"]].hist(bins=30, figsize=(15, 5))
plt.suptitle("RFM Feature Distributions")
plt.show()


## Behavioral Segmentation via K-Means

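RFM features are standardized using `StandardScaler` so that each feature
contributes on a comparable scale to the distance computation, rather than
Monetary dominating because of its larger range.
Customers are segmented using **K-Means clustering (k=3)** with a fixed random seed
to ensure reproducibility.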

Clusters are interpreted using their **centroid characteristics**.
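
The scaling and clustering step might look roughly like the following sketch on toy data. The seed value `42` and `n_init=10` are assumptions chosen for illustration; the values used inside `proxy_target_pipeline` are not shown in this notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy RFM table with three visibly different behavioral groups.
rfm = pd.DataFrame({
    "Recency":   [2, 3, 1, 40, 45, 50, 10, 12, 11],
    "Frequency": [30, 28, 35, 1, 2, 1, 10, 12, 9],
    "Monetary":  [900.0, 850.0, 1000.0, 20.0, 15.0, 30.0, 300.0, 320.0, 280.0],
})
cols = ["Recency", "Frequency", "Monetary"]

# Standardize to zero mean and unit variance per feature.
scaler = StandardScaler()
scaled = scaler.fit_transform(rfm[cols])

# random_state fixes the initialization so repeated runs give the same clusters.
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
rfm["cluster"] = kmeans.fit_predict(scaled)

# Map centroids back to original units so they are easier to interpret.
centers = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_), columns=cols
)
print(centers)
```

Cluster ids themselves carry no meaning and can differ between seeds, which is why interpretation relies on the centroid values rather than on the id.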


In [None]:
cluster_centers
# rfm_labeled.head()

Cluster centers are examined to identify behavioral profiles.

The **high-risk cluster** is defined as the one exhibiting:
- High Recency (long inactivity)
- Low Frequency
- Low Monetary value
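
One way to turn these three criteria into an explicit selection rule is to score each centroid and pick the highest score. The scoring formula below is an assumption made for illustration, and the centroid values are hypothetical; the rule actually used by `proxy_target_pipeline` may differ.

```python
import pandas as pd

# Hypothetical centroids in original units, indexed by cluster id.
centers = pd.DataFrame(
    {
        "Recency": [2.0, 45.0, 11.0],
        "Frequency": [31.0, 1.3, 10.3],
        "Monetary": [916.7, 21.7, 300.0],
    },
    index=pd.Index([0, 1, 2], name="cluster"),
)

# Standardize each column across centroids so the three criteria are comparable.
z = (centers - centers.mean()) / centers.std(ddof=0)

# Higher score = longer inactivity, fewer transactions, lower spend.
risk_score = z["Recency"] - z["Frequency"] - z["Monetary"]
high_risk_cluster = risk_score.idxmax()

# Map per-customer cluster assignments to the binary proxy label.
assignments = pd.Series([0, 1, 2, 1, 0], name="cluster")
is_high_risk = (assignments == high_risk_cluster).astype(int)
print(high_risk_cluster, is_high_risk.tolist())
```

Deriving the high-risk cluster id from the centroids, instead of hard-coding it, keeps the label stable if the cluster numbering changes between runs.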


In [None]:
sns.scatterplot(
    data=rfm_labeled,
    x="Frequency",
    y="Monetary",
    hue="cluster",
    palette="tab10",
    alpha=0.7
)

plt.scatter(
    cluster_centers["Frequency"],
    cluster_centers["Monetary"],
    color="black",
    marker="X",
    s=200,
    label="Cluster Centers"
)

plt.title("RFM Clusters (Frequency vs Monetary)")
plt.legend()
plt.show()


In [None]:
sns.pairplot(
    rfm_labeled,
    vars=["Recency", "Frequency", "Monetary"],
    hue="cluster",
    palette="tab10",
    diag_kind="kde"
)
plt.show()


In [None]:
sns.countplot(x="is_high_risk", data=rfm_labeled)
plt.title("Proxy Target Distribution (is_high_risk)")
plt.xlabel("is_high_risk")
plt.ylabel("Number of Customers")
plt.show()

rfm_labeled["is_high_risk"].value_counts(normalize=True)


The class distribution is logged for auditability.
Because only one of the three clusters is labeled high risk, the assignment is
conservative: the high-risk class represents a minority of customers,
consistent with real-world risk modeling.
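
A minimal sketch of how the class balance could be recorded with Python's standard `logging` module is shown below. The logger name and the toy labels are hypothetical, and this is not necessarily how the pipeline logs its output.

```python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("proxy_target")  # hypothetical logger name

# Toy labels standing in for rfm_labeled["is_high_risk"].
labels = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0], name="is_high_risk")

counts = labels.value_counts().sort_index()
shares = labels.value_counts(normalize=True).sort_index()

# Record both absolute counts and proportions for later audit.
logger.info("is_high_risk counts: %s", counts.to_dict())
logger.info("is_high_risk shares: %s", shares.round(3).to_dict())

# Share of the positive class; 0.0 if no customer was labeled high risk.
high_risk_share = float(shares.get(1, 0.0))
```

Keeping the share as a number makes it easy to compare across reruns and to decide later whether class weighting or resampling is needed.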


In [None]:
# Print the shape explicitly: a notebook cell only displays its last expression
print(features_with_target.shape)
features_with_target.head()
# features_with_target["is_high_risk"].value_counts(normalize=True)

The final analytics base table contains:
- Customer-level engineered features (Task 3)
- Binary proxy target (`is_high_risk`)

This dataset is ready for supervised modeling.
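
Joining the labels onto the feature table can be sketched as a keyed merge with a few integrity checks. The frames and the `total_amount` feature name below are hypothetical, and the merge performed inside `proxy_target_pipeline` may be implemented differently.

```python
import pandas as pd

# Toy stand-ins for the Task 3 features and the RFM labels.
features = pd.DataFrame({
    "CustomerId": ["C1", "C2", "C3"],
    "total_amount": [150.0, 20.0, 100.0],  # hypothetical Task 3 feature
})
labels = pd.DataFrame({
    "CustomerId": ["C1", "C2", "C3"],
    "is_high_risk": [0, 1, 0],
})

# validate="one_to_one" raises if either side has duplicate CustomerId values,
# which would otherwise silently duplicate rows in the base table.
abt = features.merge(labels, on="CustomerId", how="left", validate="one_to_one")

# A left join keeps every feature row; unmatched customers get a NaN label.
missing = int(abt["is_high_risk"].isna().sum())
if missing:
    raise ValueError(f"{missing} customers have no proxy label")

print(abt.shape)
```

Checking for missing labels before saving avoids passing rows with an undefined target to the modeling step.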


In [None]:
rfm_labeled[["CustomerId", "is_high_risk"]].to_csv(
    "../data/processed/rfm_labels.csv", index=False
)

features_with_target.to_csv(
    "../data/processed/features_with_target.csv", index=False
)


Because reliable and sufficiently frequent fraud labels are not available, a proxy target variable was engineered using customer behavioral segmentation. RFM (Recency, Frequency, Monetary) metrics were calculated at the customer level using a consistent snapshot date. The RFM values were standardized, and customers were clustered using K-Means (k=3) to identify distinct behavioral groups. The cluster exhibiting high recency (long inactivity), low transaction frequency, and low monetary value was labeled as high risk. This binary proxy target (`is_high_risk`) was merged with engineered features from Task 3 to create a modeling-ready dataset. The clustering uses a fixed random seed, and both the labels and the merged table are saved under `../data/processed/`, so the steps can be reproduced and audited.