# Phase 6: Proactive Pattern Detection
>_(Part of the Data Science lifecycle for uncovering high-risk clusters & anomalies)_

#### **Objective:**
Use **unsupervised learning** techniques to surface _hidden structure_, such as:
* **Alias clusters** (e.g. same entity across alt names)
* **Outlier behavior** (e.g. odd country associations)
* **Potential identity masking** strategies

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA

#### **Load & Prepare Features:**

In [None]:
df = pd.read_csv("../data/sanctions_features.csv")
df = df[df["fuzz_ratio_reference"].notna()].copy()

df["is_match"] = ((df["fuzz_ratio"] > 75) & (df["common_token_count"] > 0)).astype(int)

matched_df = df[df["is_match"] == 1].copy()
matched_df["is_match"].value_counts()

is_match
1    12
Name: count, dtype: int64
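
To make the matching rule concrete, the following minimal sketch applies the same two-condition filter to a small hand-made frame. The rows and their values are invented for illustration; only the column names and the thresholds come from the cell above.

```python
import pandas as pd

# Toy rows (hypothetical values, not taken from sanctions_features.csv)
toy = pd.DataFrame({
    "fuzz_ratio": [90, 80, 60, 75],
    "common_token_count": [2, 0, 3, 1],
})

# Same rule as above: similarity strictly above 75 AND at least one shared token
toy["is_match"] = (
    (toy["fuzz_ratio"] > 75) & (toy["common_token_count"] > 0)
).astype(int)

print(toy["is_match"].tolist())
```

Note that the comparison is strict, so a row with `fuzz_ratio` exactly equal to 75 is not counted as a match.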


#### **Clustering for Alias Networks (DBSCAN):**

Normalize Features

In [None]:
features = ["length_diff", "fuzz_ratio", "common_token_count", "word_count"]
# Standardize each feature to zero mean and unit variance
X = StandardScaler().fit_transform(matched_df[features])
print(X)

[[ 0.00000000e+00 -1.42108547e-14  0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00 -1.42108547e-14  0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00 -1.42108547e-14  0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00 -1.42108547e-14  0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00 -1.42108547e-14  0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00 -1.42108547e-14  0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00 -1.42108547e-14  0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00 -1.42108547e-14  0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00 -1.42108547e-14  0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00 -1.42108547e-14  0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00 -1.42108547e-14  0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00 -1.42108547e-14  0.00000000e+00  0.00000000e+00]]
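
A scaled matrix that is zero in every cell (up to floating-point noise such as `-1.42e-14`) suggests that each feature takes a single value across all matched rows, which leaves a distance-based method nothing to separate. A quick check along the following lines can flag this before clustering. This is only a sketch: the helper name `constant_columns`, the tolerance, and the demo values are assumptions, not part of the original notebook.

```python
import pandas as pd

def constant_columns(frame: pd.DataFrame, tol: float = 1e-12) -> list:
    """Return the names of columns whose standard deviation is (near) zero."""
    stds = frame.std(ddof=0)
    return [col for col, s in stds.items() if s <= tol]

# Hypothetical example: every row identical, as the all-zero output suggests
demo = pd.DataFrame({
    "length_diff": [0] * 4,
    "fuzz_ratio": [100.0] * 4,
    "common_token_count": [2] * 4,
    "word_count": [2] * 4,
})

print(constant_columns(demo))
```

If the returned list contains every feature, it may be worth loosening the match filter or adding more informative features before running DBSCAN.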


Run DBSCAN

In [21]:
db = DBSCAN(eps=0.8, min_samples=3).fit(X)
matched_df["cluster"] = db.labels_
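
DBSCAN assigns the label `-1` to points that do not belong to any dense region, so it helps to count clusters and noise points separately after fitting. The sketch below does this on a tiny invented dataset (the coordinates are assumptions chosen for illustration), using the same `eps` and `min_samples` as above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical 2-D points: three close together, one far away
points = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [5.0, 5.0]])

labels = DBSCAN(eps=0.8, min_samples=3).fit(points).labels_

# Label -1 means "noise", so it is excluded from the cluster count
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print(labels.tolist(), n_clusters, n_noise)
```

Applied to `matched_df["cluster"]`, the same two counts give a quick sense of whether the chosen `eps` is grouping everything into one cluster or discarding most rows as noise.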

Visualize

In [None]:
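# Project the standardized features onto two principal components for plotting
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
print(X_pca)

plt.figure(figsize=(10, 6))
# Both components are needed: PC1 on the x-axis, PC2 on the y-axis
sns.scatterplot(x=X_pca[:, 0], y=X_pca[:, 1], hue=matched_df["cluster"], palette="tab10")
plt.title("Alias Clusters via DBSCAN")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.grid(True)
plt.tight_layout()
plt.show()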

[[0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]]


  explained_variance_ratio_ = explained_variance_ / total_var
