# Phase 6: Proactive Pattern Detection
>_(Part of the Data Science lifecycle for uncovering high-risk clusters & anomalies)_

#### **Objective:**
Use **unsupervised learning** techniques to surface _hidden structure_, such as:
* **Alias clusters** (e.g. the same entity listed under alternative names)
* **Outlier behavior** (e.g. unusual country associations)
* **Potential identity-masking** strategies

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest
from sklearn.decomposition import PCA

#### **Load & Prepare Features:**

In [None]:
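# Load the engineered features and keep only rows with a non-null fuzz_ratio_reference.
df = pd.read_csv("../data/sanctions_features.csv")
df = df[df["fuzz_ratio_reference"].notna()].copy()

# Flag a row as a match when the fuzzy score exceeds 50 and at least one token is shared
# (the threshold presumes fuzz_ratio is on a 0-100 scale).
df["is_match"] = ((df["fuzz_ratio"] > 50) & (df["common_token_count"] > 0)).astype(int)

# Keep only the matched rows for the clustering steps.
matched_df = df[df["is_match"] == 1].copy()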

# info() prints its summary directly and returns None, so it is not wrapped in print().
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49339 entries, 0 to 49338
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   ent_num               49339 non-null  int64  
 1   cleaned_name          49338 non-null  object 
 2   fuzz_ratio_reference  49339 non-null  object 
 3   name_length           49339 non-null  int64  
 4   word_count            49339 non-null  int64  
 5   has_country_in_name   49339 non-null  int64  
 6   fuzz_ratio            49339 non-null  float64
 7   length_diff           49339 non-null  int64  
 8   common_token_count    49339 non-null  int64  
dtypes: float64(1), int64(6), object(2)
memory usage: 3.4+ MB
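
The matching rule in this cell compares `fuzz_ratio` against 50, which only selects rows if the score is stored on a 0-100 scale; the notebook does not state the scale. As a hedged sanity check, the sketch below applies the same rule to a small made-up frame (the values are illustrative, not taken from `sanctions_features.csv`) and counts how many rows pass each condition. `summarize_match_rule` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd

def summarize_match_rule(frame: pd.DataFrame, threshold: float = 50) -> dict:
    """Count rows passing each part of the is_match rule (hypothetical helper)."""
    score_ok = frame["fuzz_ratio"] > threshold
    token_ok = frame["common_token_count"] > 0
    return {
        "rows": len(frame),
        "score_above_threshold": int(score_ok.sum()),
        "shares_a_token": int(token_ok.sum()),
        "is_match": int((score_ok & token_ok).sum()),
        "fuzz_ratio_max": float(frame["fuzz_ratio"].max()),
    }

# Illustrative data only: the same scores on a 0-100 scale and on a 0-1 scale.
toy_100 = pd.DataFrame(
    {"fuzz_ratio": [92.0, 40.0, 75.0], "common_token_count": [2, 1, 0]}
)
toy_unit = toy_100.assign(fuzz_ratio=toy_100["fuzz_ratio"] / 100)

summary_100 = summarize_match_rule(toy_100)
summary_unit = summarize_match_rule(toy_unit)
print(summary_100)
print(summary_unit)
```

Run against the real `df`, a `fuzz_ratio_max` at or below 1 would suggest the threshold needs rescaling. That is one possibility to check, not a confirmed diagnosis.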


#### **Clustering for Alias Networks (DBSCAN):**

Normalize Features

In [5]:
# Standardize the clustering features to zero mean and unit variance.
features = ["length_diff", "fuzz_ratio", "common_token_count", "word_count"]
X = StandardScaler().fit_transform(matched_df[features])

ValueError: Found array with 0 sample(s) (shape=(0, 4)) while a minimum of 1 is required by StandardScaler.
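
The `ValueError` above means `matched_df` contains no rows, so `StandardScaler` has nothing to fit. The following is a minimal sketch of a guard that fails earlier with a more descriptive message. `scale_features` is a hypothetical helper, and the toy values are made up for illustration rather than drawn from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

FEATURES = ["length_diff", "fuzz_ratio", "common_token_count", "word_count"]

def scale_features(frame: pd.DataFrame, columns=FEATURES):
    """Standardize `columns`; raise a descriptive error on an empty frame (hypothetical helper)."""
    if frame.empty:
        raise ValueError(
            "No rows to scale: the is_match filter selected nothing. "
            "Check the fuzz_ratio threshold and common_token_count values."
        )
    return StandardScaler().fit_transform(frame[columns])

# Illustrative data only.
toy = pd.DataFrame({
    "length_diff": [0, 3, 1, 5],
    "fuzz_ratio": [95.0, 60.0, 80.0, 55.0],
    "common_token_count": [2, 1, 2, 1],
    "word_count": [2, 3, 2, 4],
})
X_toy = scale_features(toy)
print(X_toy.shape)

# An empty selection now produces the descriptive message instead of the scaler's error.
try:
    scale_features(toy.iloc[0:0])
    empty_message = None
except ValueError as err:
    empty_message = str(err)
print(empty_message)
```

In the notebook this would be called as `X = scale_features(matched_df, features)`. The guard only reports the empty selection; the matching rule itself still has to be inspected to find out why no rows qualified.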

Run DBSCAN