<h1 style="color: salmon">K-Medoids Clustering (n=15k)</h1>

In [None]:
import pandas as pd
import joblib
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn_extra.cluster import KMedoids
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, silhouette_samples
import matplotlib.cm as cm

In [None]:
df = pd.read_csv("../data/lastest.csv", encoding="utf-8")
df.head()

In [None]:
df.shape

<h2 style="color: salmon;">Data Pre-Processing</h2>

In [None]:
# Remove the unnecessary single quotes
df['category'] = df['category'].str.strip("'").str.split('_').str[1]
df['customer'] = df['customer'].str.strip("'")
df['age'] = df['age'].str.strip("'")
df['gender'] = df['gender'].str.strip("'")
df['merchant'] = df['merchant'].str.strip("'")

# Add the age label as in the original paper
# Keys are the quote-stripped codes, so "6" (not "'6'") must be used for ">65"
age_map = {
    "0": "<=18", "1": "19-25", "2": "26-35", "3": "36-45",
    "4": "46-55", "5": "56-65", "6": ">65", "U": "Unknown"
}
df['age_labeled'] = df['age'].map(age_map)

# Convert from step to hour of day (e.g. 2 means 2 AM)
df["hour_of_day"] = df["step"] % 24

# Drop noise cols:
df = df.drop(
    columns=[
        'zipcodeOri',
        'zipMerchant'
    ],
    errors="ignore"
)

df.head()

<h2 style="color:salmon">Feature Engineering</h2>

### Spending Velocity and Spending Frequency: how much a customer spends, and how many transactions they make, within a rolling time window

In [None]:
df = df.sort_values(by=['customer', 'step'])
df = df.reset_index(drop=True)

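# Temporary timedelta column so rolling() can use time-based windows ('3h', '6h', '24h').
# Note: assigning `.values` below relies on df being sorted by customer (done above),
# because groupby().rolling() returns its rows grouped by customer.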
df['temp_time'] = pd.to_timedelta(df['step'], unit='h')

# Spending
df['spending_vel_3h'] = (
    df.groupby('customer')
    .rolling('3h', on='temp_time')['amount']
    .sum()
    .values
)

df['spending_vel_6h'] = (
    df.groupby('customer')
    .rolling('6h', on='temp_time')['amount']
    .sum()
    .values
)

df['spending_vel_24h'] = (
    df.groupby('customer')
    .rolling('24h', on='temp_time')['amount']
    .sum()
    .values
)

# Frequency
df['frequency_3h'] = (
    df.groupby('customer')
    .rolling('3h', on='temp_time')['amount']
    .count()
    .values
)

df['frequency_6h'] = (
    df.groupby('customer')
    .rolling('6h', on='temp_time')['amount']
    .count()
    .values
)

df['frequency_24h'] = (
    df.groupby('customer')
    .rolling('24h', on='temp_time')['amount']
    .count()
    .values
)
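
To make the rolling-window features concrete, here is a minimal sketch on a toy frame. The customers, steps, and amounts are made up for illustration; only the `groupby().rolling()` pattern is taken from the cell above. With pandas' default right-closed window, a `'3h'` window at step `t` covers steps in `(t-3, t]`.

```python
import pandas as pd

# Toy data (hypothetical): two customers, steps measured in hours
toy = pd.DataFrame({
    "customer": ["A", "A", "A", "B", "B"],
    "step": [0, 1, 5, 0, 2],
    "amount": [10.0, 20.0, 5.0, 7.0, 3.0],
}).sort_values(["customer", "step"]).reset_index(drop=True)

toy["temp_time"] = pd.to_timedelta(toy["step"], unit="h")

# One rolling object per customer, windowed on the timedelta column
windows = toy.groupby("customer").rolling("3h", on="temp_time")["amount"]

toy["spending_vel_3h"] = windows.sum().values   # money spent in the last 3 hours
toy["frequency_3h"] = windows.count().values    # transactions in the last 3 hours

print(toy[["customer", "step", "amount", "spending_vel_3h", "frequency_3h"]])
```

Customer A's transaction at step 5 stands alone because the previous one (step 1) falls outside the 3-hour window, while the transaction at step 1 still sees the one at step 0.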

In [None]:
df.head()

### High-Risk Categories and Merchants (Mean Target Encoding)

In [None]:
category_risk_map = df.groupby('category')['fraud'].mean()

df['category_risk_score'] = df['category'].map(category_risk_map)

In [None]:
merchant_risk_map = df.groupby('merchant')['fraud'].mean()

df['merchant_risk_score'] = df['merchant'].map(merchant_risk_map)
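
One caveat worth keeping in mind: the risk maps above are computed on the full dataset before the train/test split, so the test rows contribute to their own encoding. The sketch below shows one possible alternative, assuming a split is already available: fit the map on training rows only and fall back to the overall training fraud rate for unseen categories. The toy frame, the split, and the fallback choice are all assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    "category": ["food", "food", "travel", "travel", "health", "toys"],
    "fraud":    [0,      0,      1,        0,        0,        1],
})

# Hypothetical split: first four rows for training, last two for testing
train = toy.iloc[:4].copy()
test = toy.iloc[4:].copy()

# Fit the encoding on training rows only
risk_map = train.groupby("category")["fraud"].mean()
global_rate = train["fraud"].mean()

train["category_risk_score"] = train["category"].map(risk_map)
# Categories never seen in training get the overall training fraud rate
test["category_risk_score"] = test["category"].map(risk_map).fillna(global_rate)

print(train)
print(test)
```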

### Age and Gender targeting

In [None]:
age_risk_map = df.groupby('age')['fraud'].mean()

df['age_risk_score'] = df['age'].map(age_risk_map)

In [None]:
# According to the EDA, no Enterprise ('E') transactions were fraudulent
df['is_enterprise'] = (df["gender"] == "E").astype(int)
df['is_enterprise'].value_counts()

In [None]:
df.head()

In [None]:
df.columns.to_list()

<h2 style="color: salmon">Final Cleaning and Splitting</h2>

In [None]:
# Keep only the model features; "fraud" goes last so it can be split off as the target
keep_cols = [
 'spending_vel_3h',
 'spending_vel_6h',
 'spending_vel_24h',
 'frequency_3h',
 'frequency_6h',
 'frequency_24h',
 'category_risk_score',
 'merchant_risk_score',
 'age_risk_score',
 'is_enterprise',
 'fraud'
]
df = df[keep_cols]

In [None]:
X = df.iloc[:, 0:-1]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Standardize so that no feature dominates the distance just because of its units. For example, spending velocity
# can be a few hundred while a risk score is around 0.8, so without StandardScaler the Manhattan distance
# would be driven almost entirely by spending velocity.
scaler = StandardScaler()
scaler.fit(X_train)

# transform() returns a plain NumPy array (no column names, hard to read),
# so wrap the train and test results back into DataFrames with the original column names.
# Note: these new DataFrames get a fresh RangeIndex (0..n-1), not the original row labels.
X_train_scaled = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
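
The effect described in the comment above can be checked on a few made-up rows. The sketch below measures what fraction of the Manhattan distance between two points comes from each feature, before and after standardization. The values are hypothetical, and the scaling is written out by hand with the same formula `StandardScaler` uses (mean and population standard deviation per column).

```python
import numpy as np

# Hypothetical rows: [spending_vel_24h, merchant_risk_score]
X = np.array([
    [100.0, 0.1],
    [300.0, 0.9],
    [120.0, 0.8],
])

def feature_shares(a, b):
    """Fraction of the Manhattan distance contributed by each feature."""
    per_feature = np.abs(a - b)
    return per_feature / per_feature.sum()

raw_shares = feature_shares(X[0], X[1])

# Same formula StandardScaler applies: (x - mean) / population std
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_shares = feature_shares(X_scaled[0], X_scaled[1])

print("raw shares:   ", raw_shares)
print("scaled shares:", scaled_shares)
```

On the raw values almost all of the distance comes from the spending column; after scaling, the two features contribute roughly equally.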

In [None]:
X_train_sample = X_train_scaled.sample(n=15000, random_state=42)
# Start with 3 clusters (Normal, Suspicious, Outlier)
# Manhattan (L1) distance is less sensitive to extreme values than Euclidean, which suits skewed fraud data
kmed = KMedoids(n_clusters=3, metric='manhattan', init='k-medoids++', random_state=42)

kmed.fit(X_train_sample)
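
For intuition about what a medoid is: unlike a K-Means centroid, it is always an actual data point, namely the one with the smallest total distance to the other points in its cluster. The following is a minimal NumPy sketch of that definition for a single cluster under Manhattan distance. It is not how `KMedoids` is implemented internally, and `manhattan_medoid` is a hypothetical helper name.

```python
import numpy as np

def manhattan_medoid(points):
    """Return the index of the point with the smallest total L1 distance to all others."""
    # Pairwise Manhattan distances, shape (n, n)
    dists = np.abs(points[:, None, :] - points[None, :, :]).sum(axis=2)
    return int(dists.sum(axis=1).argmin())

# Toy cluster with one extreme point (hypothetical values)
pts = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [10.0, 10.0],
])

medoid_idx = manhattan_medoid(pts)
print("medoid:", pts[medoid_idx])   # a real row of the data
print("mean:  ", pts.mean(axis=0))  # pulled toward the extreme point
```

The mean is dragged toward the extreme point, while the medoid stays on one of the typical rows, which is the robustness property that motivates K-Medoids here.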

<h2 style="color: salmon">Testing Phase</h2>

### External Validation (using Fraud col)

In [None]:
# 1. Assign a cluster ID to every row in the scaled data
# predict() computes the Manhattan distance from each point to the 3 medoids and picks the closest one
train_clusters = kmed.predict(X_train_scaled)   # Uses the medoids found on the 15k sample to label all 400k+ training rows
test_clusters = kmed.predict(X_test_scaled)     # Label the unseen test data => the real test

# 2. Add these labels back to the original (unscaled) frames so they are easy to read
# This makes it easier to compare 'cluster' vs 'fraud'
X_train['cluster'] = train_clusters
X_test['cluster'] = test_clusters

# Add the actual fraud labels back for comparison
X_test['is_fraud'] = y_test
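
The nearest-medoid rule that `predict()` is described as applying above can be written out in a few lines. This is a sketch under the assumption of Manhattan distance and fixed medoids; the medoid coordinates and new points are made up, and `assign_to_nearest` is a hypothetical helper, not a library function.

```python
import numpy as np

def assign_to_nearest(points, medoids):
    """Label each point with the index of its closest medoid (Manhattan distance)."""
    # Distance from every point to every medoid, shape (n_points, n_medoids)
    dists = np.abs(points[:, None, :] - medoids[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)

# Hypothetical medoids and new points in scaled feature space
medoids = np.array([[0.0, 0.0], [5.0, 5.0]])
new_points = np.array([[1.0, 1.0], [4.0, 6.0], [2.0, 2.0]])

labels = assign_to_nearest(new_points, medoids)
print(labels)
```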

In [None]:
# 3. Check how many fraud cases are in each cluster for the test set
test_results = X_test.groupby('cluster')['is_fraud'].agg(['count', 'sum', 'mean'])
test_results.columns = ['Total Transactions', 'Fraud Count', 'Fraud Percentage (%)']
test_results['Fraud Percentage (%)'] *= 100  # mean() of a 0/1 column is a fraction; convert to percent

print("--- K-Medoids Test Results ---")
print(test_results)

# Cluster 2 is the fraudulent cluster: 73% of its transactions are fraud

### Elbow Test: is k=3 optimal?

In [None]:
# Elbow test: which k gives the best trade-off between fit and number of clusters?
inertias = []
k_range = range(2, 11)

for k in k_range:
    ktest = KMedoids(n_clusters=k, metric='manhattan', init='k-medoids++', random_state=42)
    ktest.fit(X_train_sample)
    inertias.append(ktest.inertia_)     # inertia_: sum of distances of samples to their closest medoid

plt.plot(k_range, inertias, marker='o', linestyle='-', color='b')
plt.title('Elbow Method for Optimal k')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia (Total Cost)')
plt.xticks(k_range)
plt.grid(True)
plt.show()
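
Reading an elbow off a plot is subjective. One simple heuristic, sketched below, is to look for the k where the curve bends most, i.e. where the second difference of the inertia values is largest. The inertia values here are hypothetical and chosen only to illustrate the calculation; they are not the results of the run above, and this heuristic is only one of several ways to pick an elbow.

```python
import numpy as np

k_values = np.arange(2, 8)
# Hypothetical inertia values for k = 2..7 (not taken from the real run)
toy_inertias = np.array([100.0, 80.0, 60.0, 40.0, 36.0, 33.0])

# Second difference: how sharply the slope changes at each interior k
second_diff = np.diff(toy_inertias, n=2)

# Entry j of second_diff is centered on k_values[j + 1]
elbow_k = int(k_values[1 + second_diff.argmax()])
print("second differences:", second_diff)
print("suggested elbow at k =", elbow_k)
```

In this toy series the inertia falls by 20 per step up to k=5 and then flattens, so the bend is located at k=5.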

# => the elbow suggests k=4 might be best?

In [None]:
kmed4 = KMedoids(n_clusters=4, metric='manhattan', init='k-medoids++', random_state=42)
kmed4.fit(X_train_sample)

In [None]:
# Re-run the same evaluation with the k=4 model
train_clust_new = kmed4.predict(X_train_scaled)  
test_clust_new = kmed4.predict(X_test_scaled) 

X_train['cluster'] = train_clust_new
X_test['cluster'] = test_clust_new
X_test['is_fraud'] = y_test

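test_results = X_test.groupby('cluster')['is_fraud'].agg(['count', 'sum', 'mean'])
test_results.columns = ['Total Transactions', 'Fraud Count', 'Fraud Percentage (%)']
test_results['Fraud Percentage (%)'] *= 100  # mean() of a 0/1 column is a fraction; convert to percent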

print("--- K-Medoids Test Results ---")
print(test_results)

### Silhouette: measures how well separated the clusters are. For fraud detection, a score of 0.2 to 0.4 is expected

In [None]:
sample_labels = kmed.labels_
avg_silhouette = silhouette_score(X_train_sample, sample_labels, metric='manhattan')

print(f"--- Silhouette Results ---")
print(f"Average Silhouette Score: {avg_silhouette:.4f}")

# Create the Silhouette Plot
fig, ax1 = plt.subplots(1, 1)
fig.set_size_inches(10, 7)

# Get silhouette samples for each point
sample_silhouette_values = silhouette_samples(X_train_sample, sample_labels, metric='manhattan')

y_lower = 10
for i in range(kmed.n_clusters):
    # Aggregate the silhouette scores for samples belonging to cluster i, and sort them
    ith_cluster_silhouette_values = sample_silhouette_values[sample_labels == i]
    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    color = cm.nipy_spectral(float(i) / kmed.n_clusters)
    ax1.fill_betweenx(np.arange(y_lower, y_upper),
                      0, ith_cluster_silhouette_values,
                      facecolor=color, edgecolor=color, alpha=0.7)

    # Label the silhouette plots with their cluster numbers at the middle
    ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
    y_lower = y_upper + 10  # leave a gap of 10 blank rows between clusters

ax1.set_title("Silhouette Plot for K-Medoids Clusters")
ax1.set_xlabel("Silhouette Coefficient Values")
ax1.set_ylabel("Cluster Label")

# The vertical line for average silhouette score of all the values
ax1.axvline(x=avg_silhouette, color="red", linestyle="--")
ax1.set_yticks([])  # Clear the yaxis labels / ticks
ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

plt.show()
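
For reference, the silhouette coefficient of a single point is `s = (b - a) / max(a, b)`, where `a` is its mean distance to the other points in its own cluster and `b` is its mean distance to the points of the nearest other cluster. The sketch below computes this by hand with Manhattan distance on four made-up points. `silhouette_of` is a hypothetical helper, and it assumes every cluster has at least two points.

```python
import numpy as np

# Toy data (hypothetical): two tight, well-separated clusters
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_of(X, labels, i):
    """Silhouette coefficient of point i under Manhattan distance."""
    dists = np.abs(X - X[i]).sum(axis=1)
    own = labels == labels[i]
    own[i] = False                       # exclude the point itself
    a = dists[own].mean()                # cohesion: mean distance within own cluster
    b = min(                             # separation: nearest other cluster
        dists[labels == c].mean()
        for c in np.unique(labels) if c != labels[i]
    )
    return (b - a) / max(a, b)

s0 = silhouette_of(X, labels, 0)
print(f"silhouette of point 0: {s0:.4f}")
```

A value near 1 means the point sits firmly inside its cluster; values near 0 indicate the kind of overlap between clusters that the plot above is meant to reveal.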

# Silhouette is 0.28 for k=3 vs 0.23 for k=4 => the k=4 clusters overlap more, so stay with k=3 (better separated)

### Explain the "Why": use the medoids to understand which features the model relies on to detect fraud

In [None]:
# 1. Get the medoid coordinates (the centers of your clusters)
scaled_medoids = kmed.cluster_centers_

# 2. Reverse the scaling to get original units (USD, counts, etc.)
# Note: Ensure 'scaler' and 'X_train' are the ones from the previous cells
original_medoids = scaler.inverse_transform(scaled_medoids)

# 3. Create a DataFrame for easy viewing
# We use the column names from X_train (excluding 'cluster' if you added it there)
feature_names = [col for col in X_train.columns if col != 'cluster']
medoid_profiles = pd.DataFrame(original_medoids, columns=feature_names)
medoid_profiles.index.name = 'Cluster'

print("--- Typical Transaction Profile per Cluster ---")
medoid_profiles
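
The reverse-scaling step above is just the standardization formula run backwards: `x = z * std + mean`, applied per column. A minimal NumPy sketch with made-up values, assuming the same per-column mean and population standard deviation that `StandardScaler` stores after fitting:

```python
import numpy as np

# Hypothetical unscaled rows: [spending_vel_24h, frequency_24h]
X = np.array([[100.0, 1.0], [300.0, 3.0], [200.0, 2.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)            # population std (ddof=0), as StandardScaler uses

X_scaled = (X - mean) / std    # forward transform

# Pretend row 1 was chosen as a medoid in scaled space
medoid_scaled = X_scaled[1]

# Inverse transform back to original units
medoid_original = medoid_scaled * std + mean
print(medoid_original)
```

Because a medoid is a real row, the recovered values match an actual transaction profile rather than an average that no customer produced.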

In [None]:
# We melt the dataframe to make it compatible with Seaborn's plotting
plot_data = medoid_profiles.reset_index().melt(id_vars='Cluster')

plt.figure(figsize=(12, 6))
sns.barplot(data=plot_data, x='variable', y='value', hue='Cluster')
plt.xticks(rotation=45)
plt.title("The 'Why': Feature Values of Cluster Medoids")
plt.ylabel("Original Scale Value")
plt.xlabel("Features")
plt.legend(title="Cluster ID")
plt.show()

### **Summary of K-Medoids Fraud Analysis (15k Hybrid Foundation)**

The K-Medoids model was fitted on a **15,000-sample** subset of the training data to balance high-volume fraud detection with behavioral clustering. By identifying **Cluster 1** as the primary **"Fraud Magnet,"** the system narrowed the search space from 15,000 transactions down to **6,253 high-risk transactions**. This unsupervised stage acts as the feature generator for the next stage, Random Forest fitting.

* **Comprehensive Coverage**: Cluster 1 captured **1,337 fraudulent transactions**, a significant increase in total fraud detected compared with the smaller sample (more than 850 fraud cases at n=10k).
* **Risk Concentration**: At this larger scale the fraud density in Cluster 1 is **21.38%**. It remains the clear danger zone, while Clusters 0 and 2 act as **99.9% clean "safe zones"**.
* **Optimal Structure**: The **Elbow Method** hinted at $k=4$, but the Silhouette comparison favored **$k=3$**, which was kept as the more stable choice.
* **Scientific Justification**: A **Silhouette Score of 0.15** indicates a complex overlap between high-spending legitimate users and fraudsters in the 15k set. This overlap justifies the **Hybrid System architecture**, using the Random Forest to bridge the accuracy gap.
* **Behavioral DNA**: Medoid analysis continues to show that the system's primary logic for flagging risk is rooted in **extreme spending velocity** and **high-risk merchant profiles** rather than simple transaction counts.
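
The density and search-space figures quoted above follow directly from the counts in the summary, as this short check shows. Only the three counts are taken from the text; the variable names are illustrative.

```python
# Counts quoted in the summary above
sample_size = 15_000
cluster1_size = 6_253
cluster1_fraud = 1_337

# Share of Cluster 1 transactions that are fraudulent
fraud_density_pct = cluster1_fraud / cluster1_size * 100

# Share of the sample that still needs review after filtering to Cluster 1
search_space_pct = cluster1_size / sample_size * 100

print(f"Fraud density in Cluster 1: {fraud_density_pct:.2f}%")
print(f"Remaining search space:     {search_space_pct:.2f}%")
```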

In [None]:
print(kmed.labels_)

In [None]:
# Save the model and scaler, then export the training sample with cluster labels for Random Forest fitting

joblib.dump(kmed, '../app/models/kmedoids_model.pkl')
joblib.dump(scaler, '../app/models/scaler.pkl')

X_hybrid_export = X_train_sample.copy()
y_train_sample = y_train.iloc[X_train_sample.index]     # Positional lookup: X_train_scaled has a fresh RangeIndex, so the sample's index values are row positions in y_train
X_hybrid_export['cluster'] = kmed.labels_
X_hybrid_export['fraud'] = y_train_sample.values
X_hybrid_export.to_csv("../data/hybrid_training_data_15k.csv", index=False)
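
The `.iloc` lookup in the cell above works because of how the scaled frame was built: wrapping the NumPy array in a new `DataFrame` gives it a `RangeIndex`, so the sampled index values are row positions, while `y_train` still carries the original shuffled labels. The sketch below reproduces that situation on made-up data to show why a positional lookup is the right choice here.

```python
import pandas as pd

# y_train keeps the original (shuffled) row labels from the split (hypothetical values)
y_train = pd.Series([0, 1, 0, 1, 1], index=[42, 17, 19, 23, 88])

# The scaled frame was rebuilt from a NumPy array, so it has a fresh RangeIndex 0..4
X_scaled = pd.DataFrame({"feature": [0.1, 0.2, 0.3, 0.4, 0.5]})

sample = X_scaled.sample(n=3, random_state=0)

# sample.index holds row positions, so use .iloc (position-based), not .loc (label-based)
y_sample = y_train.iloc[sample.index]

print(sample.index.tolist())
print(y_sample.tolist())
```

Using `y_train.loc[sample.index]` instead would look for labels 0 to 4, which do not exist in `y_train` in this toy setup.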