# **K-Means Clustering on Credit Card Fraud Dataset**
This notebook applies K-Means clustering to credit card transactions to analyze transaction patterns and flag potential fraud. It follows the standard K-Means steps:
1. Choose K
2. Initialize Centroids
3. Assign Points
4. Recalculate Centroids
5. Repeat until Convergence
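
The five steps above can be sketched in plain NumPy. This is a minimal illustration, not the scikit-learn implementation used later in the notebook. The function name `kmeans_sketch`, the random-point initialization, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain NumPy K-Means (Lloyd's algorithm); illustrative only."""
    rng = np.random.default_rng(seed)
    # Step 2: initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated 2-D blobs (synthetic, not the fraud dataset)
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
toy_labels, toy_centroids = kmeans_sketch(X_toy, k=2)  # Step 1: choose K = 2
print(toy_centroids)
```

On data this cleanly separated, each blob should end up in its own cluster. Real transaction data is rarely this tidy, which is why the rest of the notebook relies on scikit-learn's `KMeans`.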

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.manifold import TSNE
from sklearn.metrics import confusion_matrix, accuracy_score
from scipy.stats import mode

# Load the dataset
# Ensure 'credit_card_fraud_10k.csv' is in the same folder as your notebook
df = pd.read_csv('credit_card_fraud_10k.csv')
print(f"Dataset loaded successfully. Shape: {df.shape}")
df.head()

# **1. Data Preprocessing (Feature Scaling)**
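As noted under Disadvantages (PDF Page 2), K-Means needs rescaled or normalized features because its distance-based results change with feature scale. We also encode categorical columns such as 'merchant_category' as integers. Note that label encoding imposes an arbitrary numeric order on the categories, which a distance-based method treats as meaningful.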
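
To see why scale matters, the sketch below compares Euclidean distances before and after z-scoring. The three transactions and their values are hypothetical (a trust score in the 0 to 1 range is assumed), and the manual z-score stands in for what `StandardScaler` computes.

```python
import numpy as np

# Hypothetical transactions: columns are [amount, device_trust_score]
X_demo = np.array([
    [100.0, 0.90],   # a: small amount, trusted device
    [120.0, 0.10],   # b: similar amount, untrusted device
    [900.0, 0.90],   # c: large amount, trusted device
])

def euclidean(p, q):
    return float(np.linalg.norm(p - q))

# Raw scale: the amount column dominates; the trust score barely registers
raw_ab = euclidean(X_demo[0], X_demo[1])
raw_ac = euclidean(X_demo[0], X_demo[2])

# Z-score each column (population std, ddof=0, as StandardScaler uses)
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
std_ab = euclidean(X_std[0], X_std[1])
std_ac = euclidean(X_std[0], X_std[2])

print(f"raw:    d(a,b)={raw_ab:.2f}  d(a,c)={raw_ac:.2f}")
print(f"scaled: d(a,b)={std_ab:.2f}  d(a,c)={std_ac:.2f}")
```

On the raw values, the large trust-score gap between a and b is almost invisible next to the amount gap between a and c. After scaling, the two distances are of comparable size, so both features can influence the cluster assignment.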

In [None]:
# Encoding categorical data
le = LabelEncoder()
df['merchant_category_encoded'] = le.fit_transform(df['merchant_category'])

# Selecting numerical features for clustering
features = ['amount', 'transaction_hour', 'merchant_category_encoded', 'foreign_transaction',
            'location_mismatch', 'device_trust_score', 'velocity_last_24h', 'cardholder_age']
X = df[features]
y_true = df['is_fraud']

# Standardizing features (Crucial for distance-based algorithms)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("Features scaled and ready for K-Means.")

# **2. K-Means Implementation & EM Interpretation**
Following the Expectation-Maximization (E-M) framework (PDF Page 5):
E-Step: Assign each transaction to the nearest centroid.
M-Step: Recalculate centroids as the mean of all transactions in that cluster.
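
In symbols, the two steps can be written as alternating minimizations of a single objective. The notation here is introduced for illustration and is not taken from the PDF: $x_i$ is a scaled transaction, $c_i$ its cluster index, $\mu_k$ the centroid of cluster $k$, and $C_k = \{\, i : c_i = k \,\}$.

$$J = \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert^2$$

$$\text{E-Step:}\quad c_i \leftarrow \arg\min_{k} \lVert x_i - \mu_k \rVert^2$$

$$\text{M-Step:}\quad \mu_k \leftarrow \frac{1}{|C_k|} \sum_{i \in C_k} x_i$$

Neither step can increase $J$, so the iteration converges, although only to a local minimum that depends on the initial centroids.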

In [None]:
# Initializing K-Means with K=2 (Fraud vs Normal)
# PDF Page 2: "Predefined value of K is required"
kmeans = KMeans(n_clusters=2, random_state=42, n_init='auto')
clusters = kmeans.fit_predict(X_scaled)

df['cluster'] = clusters
print("K-Means clustering complete.")

# **3. Evaluation: Label Matching & Confusion Matrix**
Because K-Means is an unsupervised method, it never sees the fraud labels, and its cluster IDs (0 and 1) are arbitrary. We use the label-permutation logic (PDF Page 8) to map each discovered cluster to the fraud label that is most common among its members.
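
The majority-vote matching can be sketched on a toy example. The helper `map_clusters_to_labels` and both arrays are hypothetical; they are chosen to show a pitfall that can arise when fraud is rare: if fraud is a minority in every cluster, every cluster maps to "Normal" and accuracy simply equals the share of normal transactions.

```python
import numpy as np

def map_clusters_to_labels(cluster_ids, y):
    """Relabel each cluster with the most common true label among its members."""
    mapped = np.zeros_like(y)
    for c in np.unique(cluster_ids):
        mask = cluster_ids == c
        values, counts = np.unique(y[mask], return_counts=True)
        mapped[mask] = values[counts.argmax()]
    return mapped

# Hypothetical imbalanced toy labels: 2 fraud cases out of 10
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# Hypothetical cluster assignment in which fraud is a minority of both clusters
c_toy = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 0])

mapped = map_clusters_to_labels(c_toy, y_toy)
accuracy = (mapped == y_toy).mean()
fraud_recall = (mapped[y_toy == 1] == 1).mean()
print(f"accuracy={accuracy:.2f}, fraud recall={fraud_recall:.2f}")
```

Here accuracy looks respectable while no fraud case is recovered, so it is worth reading the confusion matrix alongside the accuracy figure rather than relying on accuracy alone.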

In [None]:
# Match cluster IDs to real labels using the mode of the cluster
labels = np.zeros_like(clusters)
for i in range(2):
    mask = (clusters == i)
    labels[mask] = mode(y_true[mask])[0]

print(f"Clustering Accuracy: {accuracy_score(y_true, labels)*100:.2f}%")

# Plotting the Confusion Matrix
mat = confusion_matrix(y_true, labels)
plt.figure(figsize=(6, 4))
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False, cmap='Blues',
            xticklabels=['Normal', 'Fraud'], yticklabels=['Normal', 'Fraud'])
plt.xlabel('True Label')
plt.ylabel('Predicted Cluster')
plt.title('K-Means Confusion Matrix')
plt.show()

# **4. Visualizing Clusters with t-SNE**
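To visualize our high-dimensional data in 2D, we use t-distributed Stochastic Neighbor Embedding (PDF Page 8). The plot helps show whether the two clusters separate cleanly. Because t-SNE preserves local neighborhoods rather than global distances, the embedding cannot by itself confirm that the clusters are compact and spherical (PDF Page 2).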

In [None]:
# Dimensionality reduction for visualization
tsne = TSNE(n_components=2, init='random', learning_rate='auto', random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

plt.figure(figsize=(10, 7))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=clusters, s=5, cmap='viridis')
plt.title('t-SNE Visualization of Transaction Clusters')
plt.colorbar(label='Cluster ID')
plt.show()

# **5. Optimized Performance: Mini-Batch K-Means**
As mentioned in Example 2 (PDF Page 10), we can use MiniBatchKMeans to handle larger datasets more quickly.
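
As a rough check on what the speed-up costs, the sketch below fits both estimators on the same data and compares their results. It uses synthetic blobs from `make_blobs` as a stand-in for the fraud dataset, and the sample size, batch size, and `n_init` values are arbitrary choices for illustration.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data (not the fraud dataset): two well-separated blobs
X_blobs, _ = make_blobs(n_samples=2000, centers=[[-5, -5], [5, 5]],
                        cluster_std=1.0, random_state=42)

full = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_blobs)
mini = MiniBatchKMeans(n_clusters=2, batch_size=100, n_init=3,
                       random_state=42).fit(X_blobs)

# Adjusted Rand Index ignores how cluster IDs are numbered;
# 1.0 means the two partitions are identical
ari = adjusted_rand_score(full.labels_, mini.labels_)
print(f"Full inertia:       {full.inertia_:.1f}")
print(f"Mini-batch inertia: {mini.inertia_:.1f}")
print(f"Agreement (ARI):    {ari:.3f}")
```

Mini-batch updates use only a sample of the data per step, so the inertia is typically slightly higher than with full K-Means. On messier data the two partitions may agree less closely than they do on these blobs.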

In [None]:
# Implementation of Mini-Batch K-Means for efficiency
minibatch = MiniBatchKMeans(n_clusters=2, random_state=42, batch_size=100, n_init='auto')
minibatch_clusters = minibatch.fit_predict(X_scaled)

print("Mini-Batch K-Means clustering completed.")

# Compare mean amount and device trust score per cluster
# (grouped by the full K-Means assignments in df['cluster'], not minibatch_clusters)
analysis = df.groupby('cluster')[['amount', 'device_trust_score']].mean()
print("\n--- Cluster Character Analysis ---")
print(analysis)