Step 1: Load and Prepare the Data
What we did:
We loaded the raw transaction data and converted the TransactionStartTime column to a pandas datetime type.

Why:
We need this date column to calculate Recency: how recently a customer made a purchase.
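As a minimal sketch of the date conversion, the snippet below parses a few hypothetical timestamp strings (the values and the ISO-style format are assumptions, not taken from the dataset). Passing errors="coerce" turns unparseable entries into NaT so they can be counted rather than raising an exception. Whether that is preferable to failing fast depends on how clean the raw file is.

```python
import pandas as pd

# Hypothetical raw values; the real column may use a different format.
raw_times = pd.Series(["2018-11-15T02:18:49Z", "not a date", None])

# errors="coerce" maps unparseable entries to NaT; utc=True gives one consistent timezone.
parsed = pd.to_datetime(raw_times, errors="coerce", utc=True)

n_bad = int(parsed.isna().sum())
print(f"{n_bad} of {len(parsed)} timestamps could not be parsed")
```

Rows with NaT timestamps would need to be dropped or repaired before computing Recency, since they cannot contribute to a customer's most recent transaction date.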

In [1]:
import pandas as pd
from datetime import timedelta

def load_and_prepare_data(filepath: str) -> pd.DataFrame:
    df = pd.read_csv(filepath)
    df['TransactionStartTime'] = pd.to_datetime(df['TransactionStartTime'])
    return df

def calculate_rfm(df: pd.DataFrame) -> pd.DataFrame:
    # Reference date: one day after the latest transaction, so Recency is at least 1
    snapshot_date = df['TransactionStartTime'].max() + timedelta(days=1)

    rfm = df.groupby('CustomerId').agg({
        'TransactionStartTime': lambda x: (snapshot_date - x.max()).days,  # Recency
        'TransactionId': 'count',  # Frequency
        'Amount': 'sum'  # Monetary
    }).reset_index()

    # Column order matches the aggregation order above
    rfm.columns = ['CustomerId', 'Recency', 'Frequency', 'Monetary']
    return rfm


Step 2: Calculate RFM Metrics
What we did:
For each customer, we calculated:

Recency: How many days ago their last transaction was.

Frequency: How many total transactions they’ve made.

Monetary: Total money spent (sum of transaction amounts).

Why:
These 3 features are commonly used in marketing and credit scoring to understand customer engagement and financial behavior.
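To make the three definitions concrete, here is a small worked example on hypothetical transactions (the customers, dates, and amounts are invented for illustration). It follows the same snapshot-date logic as calculate_rfm but uses pandas named aggregation, which is a stylistic choice rather than the notebook's exact code.

```python
import pandas as pd

# Hypothetical toy transactions; column names follow the notebook.
toy = pd.DataFrame({
    "CustomerId": ["A", "A", "B"],
    "TransactionId": ["t1", "t2", "t3"],
    "TransactionStartTime": pd.to_datetime(["2024-01-01", "2024-01-10", "2024-01-05"]),
    "Amount": [100.0, 50.0, 20.0],
})

# Snapshot date is one day after the latest transaction in the data (2024-01-11 here).
snapshot_date = toy["TransactionStartTime"].max() + pd.Timedelta(days=1)

toy_rfm = toy.groupby("CustomerId").agg(
    Recency=("TransactionStartTime", lambda x: (snapshot_date - x.max()).days),
    Frequency=("TransactionId", "count"),
    Monetary=("Amount", "sum"),
).reset_index()

print(toy_rfm)
# Customer A: Recency 1, Frequency 2, Monetary 150.0
# Customer B: Recency 6, Frequency 1, Monetary 20.0
```

Customer A bought more recently, more often, and spent more, so A would look more engaged than B on all three metrics.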

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def preprocess_and_cluster_rfm(rfm: pd.DataFrame, n_clusters: int = 3) -> tuple[pd.DataFrame, KMeans]:
    # Drop rows with missing values
    rfm_clean = rfm.dropna(subset=['Recency', 'Frequency', 'Monetary']).copy()

    # Scale
    scaler = StandardScaler()
    rfm_scaled = scaler.fit_transform(rfm_clean[['Recency', 'Frequency', 'Monetary']])

    # K-Means clustering
    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
    rfm_clean['Cluster'] = kmeans.fit_predict(rfm_scaled)

    return rfm_clean, kmeans


Step 3: Preprocess the RFM Data and Run Clustering
What we did:
We standardized the RFM values to zero mean and unit variance so that no single feature dominates the distance calculation, then used K-Means to divide customers into 3 groups based on their RFM profiles.

Why:
We want to discover hidden customer segments. We expect some clusters to be more active (lower risk) and others less active (potentially higher risk).
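The effect of standardization can be seen on a few hypothetical RFM rows (the numbers below are invented). Before scaling, Monetary varies by tens of thousands while Recency and Frequency vary by tens, so Euclidean distance, which K-Means relies on, would be driven almost entirely by Monetary. After StandardScaler, every column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical RFM rows: columns are Recency, Frequency, Monetary.
toy_values = np.array([
    [1.0, 20.0, 50000.0],
    [30.0, 5.0, 2000.0],
    [90.0, 1.0, 100.0],
    [5.0, 12.0, 30000.0],
])

scaled = StandardScaler().fit_transform(toy_values)

# Raw spreads differ by orders of magnitude; scaled spreads are all 1.
print("raw std per column:    ", toy_values.std(axis=0).round(2))
print("scaled mean per column:", scaled.mean(axis=0).round(2))
print("scaled std per column: ", scaled.std(axis=0).round(2))
```

StandardScaler does not remove skew or outliers. If the RFM distributions are heavily skewed, a log transform before scaling is a common option, though the notebook does not apply one.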

In [4]:
import matplotlib.pyplot as plt
import seaborn as sns

def analyze_and_label_clusters(rfm_clustered: pd.DataFrame, kmeans: KMeans) -> pd.DataFrame:
    # Cluster center interpretation
    cluster_centers = pd.DataFrame(
        kmeans.cluster_centers_,
        columns=['Recency', 'Frequency', 'Monetary']
    )

    # Centers are in standardized units. The sort is lexicographic, so Recency
    # decides the ranking; Frequency and Monetary only break ties.
    high_risk_cluster = cluster_centers.sort_values(
        by=['Recency', 'Frequency', 'Monetary'],
        ascending=[False, True, True]  # high Recency, low Frequency/Monetary
    ).index[0]

    # Assign binary label (this adds a column to the caller's DataFrame)
    rfm_clustered['is_high_risk'] = (rfm_clustered['Cluster'] == high_risk_cluster).astype(int)

    return rfm_clustered[['CustomerId', 'is_high_risk']].copy()


Step 4: Identify the High-Risk Group
What we did:
We examined the 3 clusters and picked the one where customers:

Have high Recency (long time since last purchase)

Have low Frequency (rarely purchase)

Have low Monetary (spend very little)

This group was labeled as the high-risk cluster.

Why:
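These customers are less engaged and may be financially inactive, so we assume they are more likely to default. Membership in this cluster is therefore used as a proxy for credit risk; it is a label derived from behavior, not an observed default outcome.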
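One way to sanity-check the choice of high-risk cluster is to look at per-cluster averages in the original units. The sketch below uses a hypothetical clustered table (the values and cluster assignments are invented) and, as an alternative to the notebook's sort on cluster centers, ranks each metric separately and sums the ranks so that all three metrics contribute to the decision.

```python
import pandas as pd

# Hypothetical clustered RFM table, shaped like the output of preprocess_and_cluster_rfm.
toy_clustered = pd.DataFrame({
    "CustomerId": ["A", "B", "C", "D", "E", "F"],
    "Recency": [2, 3, 40, 45, 85, 90],
    "Frequency": [30, 25, 8, 10, 1, 2],
    "Monetary": [9000.0, 8000.0, 1500.0, 1200.0, 50.0, 80.0],
    "Cluster": [0, 0, 1, 1, 2, 2],
})

# Mean RFM per cluster in original units, plus cluster sizes.
profile = toy_clustered.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean()
profile["n_customers"] = toy_clustered.groupby("Cluster").size()
print(profile)

# Higher rank = riskier: long Recency, low Frequency, low Monetary.
risk_rank = (
    profile["Recency"].rank(ascending=True)
    + profile["Frequency"].rank(ascending=False)
    + profile["Monetary"].rank(ascending=False)
)
high_risk_cluster = risk_rank.idxmax()
print("High-risk cluster:", high_risk_cluster)  # cluster 2 for this toy data
```

If the rank-sum choice and the sort-based choice disagree on real data, that is a sign the clusters do not separate cleanly along the "disengaged" direction and the label deserves a closer look.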

In [5]:
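def merge_target_with_raw(raw_df: pd.DataFrame, risk_df: pd.DataFrame) -> pd.DataFrame:
    # Left join keeps every raw transaction; risk_df holds one label per CustomerId
    merged = raw_df.merge(risk_df, on='CustomerId', how='left')
    # Customers without a label (e.g. dropped before clustering) default to 0
    merged['is_high_risk'] = merged['is_high_risk'].fillna(0).astype(int)
    return merged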


Step 5: Create the is_high_risk Column
What we did:
We created a new column:

1 if the customer is in the high-risk cluster

0 if they’re in a low/medium-risk cluster

Why:
This gives us a binary target label we can now use to train a credit scoring model.
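Before training on the new label, it may help to check how many customers received no label from the clustering step and how balanced the target is. The sketch below does this on hypothetical frames (names and values are invented), using the indicator option of DataFrame.merge to count transactions whose customer was missing from the risk table.

```python
import pandas as pd

# Hypothetical transaction-level data and customer-level risk labels.
toy_raw = pd.DataFrame({
    "CustomerId": ["A", "A", "B", "C"],
    "Amount": [10.0, 20.0, 5.0, 7.0],
})
toy_risk = pd.DataFrame({"CustomerId": ["A", "B"], "is_high_risk": [0, 1]})

# indicator=True adds a _merge column showing where each row came from.
merged = toy_raw.merge(toy_risk, on="CustomerId", how="left", indicator=True)
unmatched = int((merged["_merge"] == "left_only").sum())

merged["is_high_risk"] = merged["is_high_risk"].fillna(0).astype(int)
merged = merged.drop(columns="_merge")

# Class balance at customer level (one row per customer).
customer_share = merged.drop_duplicates("CustomerId")["is_high_risk"].mean()

print(f"Transactions without a label: {unmatched}")
print(f"Share of high-risk customers: {customer_share:.2f}")
```

Measuring the share per customer rather than per transaction avoids over-weighting frequent buyers, who by construction are unlikely to fall in the high-risk cluster.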

In [6]:

# Step 1: Load and prepare
raw_df = load_and_prepare_data('../data/raw/data.csv')

# Step 2: RFM metrics
rfm = calculate_rfm(raw_df)

# Step 3: Scale and cluster
rfm_clustered, kmeans_model = preprocess_and_cluster_rfm(rfm)

# Steps 4-5: Identify the high-risk cluster and label each customer
risk_df = analyze_and_label_clusters(rfm_clustered, kmeans_model)

# Merge the label back onto the raw transactions
final_raw_with_risk = merge_target_with_raw(raw_df, risk_df)

# Save for Task 3 re-pipeline
final_raw_with_risk.to_csv("../data/processed/raw_plus_risk.csv", index=False)
print("✅ Saved: raw_plus_risk.csv with is_high_risk column.")


✅ Saved: raw_plus_risk.csv with is_high_risk column.


We used RFM analysis and K-Means clustering to identify disengaged customers, treated here as a proxy for higher credit risk, and created a new column, is_high_risk, to serve as the target variable for a credit scoring model.