In [1]:
#pip install faker
#pip install rapidfuzz

Project: Optimizing Name Matching Algorithms for Sanctions Screening

Project Goal:
Develop and optimize a name-matching algorithm for sanctions and politically exposed person (PEP) screening. The solution will focus on improving precision and recall, benchmarking its performance, and automating testing processes.

Key Objectives:
1. Algorithm Development:
   - Implement a fuzzy matching algorithm using libraries such as FuzzyWuzzy, rapidfuzz, or Levenshtein.
   - Optimize matching rules for efficiency and effectiveness, considering diverse naming conventions and common misspellings.
2. Benchmark Testing:
   - Create an internal benchmark dataset with synthetic customer names and a list of sanctions/PEPs.
   - Evaluate algorithm performance using precision, recall, and F1-score metrics.
   - Compare results with external benchmarking tools (e.g., World-Check).
3. Automation:
   - Develop an automated pipeline in Python to test the algorithm on both synthetic and real-world-like datasets.
   - Deploy the solution in a simulated environment to monitor real-time performance.
4. Rule Optimization:
   - Explore configurations for balancing precision and recall.
   - Provide insights on trade-offs by calculating missed true positives and false positives.
5. Operational Tools:
   - Build a dashboard (e.g., Streamlit) for compliance teams to visualize matching results and tune rules dynamically.
   - Introduce LLM-assisted (Large Language Model) explanations to provide clarity on why matches occur, improving decision-making speed for compliance agents.
6. Scalability:
   - Extend the algorithm to handle additional screening scenarios, such as vessel names or aliases.
   - Implement modular code for easy integration with other screening tools.

Data Requirements:
1. Synthetic Customer Data:
   - Names with variations (e.g., spelling mistakes, alternative spellings).
   - Demographics to simulate naming conventions across regions.
2. Sanctioned Entities Data:
   - Publicly available sanctions lists (e.g., OFAC, UN sanctions).
3. Benchmarking Data:
   - Historical screening cases (synthetic data with true positives and false positives).

Technologies and Tools:
- Programming Languages: Python (core) and basic Java for code integration.
- Libraries:
  - FuzzyWuzzy, Levenshtein, or rapidfuzz for fuzzy matching.
  - scikit-learn for precision/recall evaluation.
- Data Visualization: Streamlit or Plotly for dashboards.
- Automation: Pytest or unittest for automated testing pipelines.
- Deployment: Docker or Flask for staging.

Deliverables:
1. Name-Matching Algorithm:
   - Optimized for precision and recall.
   - Configurable rules for custom scenarios.
2. Benchmark Dataset:
   - Comprehensive test set for algorithm evaluation.
3. Automation Pipeline:
   - Automated testing and performance reporting.
4. Interactive Dashboard:
   - For visualizing screening results and optimizing rules.
5. Documentation:
   - Detailed explanation of methods, trade-offs, and deployment instructions.

Impact Alignment with Job Role:
- Ownership: Demonstrates ability to take ownership of testing and tuning strategies.
- Automation: Highlights capability to automate processes for efficiency.
- Cross-Functional Skills: Showcases communication of insights to technical and non-technical stakeholders through dashboards and reports.
- Scalability: Reflects aptitude for designing solutions that grow with the business's needs.

This project would directly address many aspects of the job description while giving you hands-on experience with key concepts in name screening, algorithm optimization, and compliance operations.

In [5]:
from faker import Faker
import pandas as pd
import random
from rapidfuzz import process, fuzz


# Initialize Faker
fake = Faker()

# Set seeds for reproducibility (Faker and the random module are seeded separately)
Faker.seed(42)
random.seed(42)

# Generate customer names
def generate_customer_data(num_records):
    data = []
    for _ in range(num_records):
        name = fake.name()
        nationality = fake.country()
        birthdate = fake.date_of_birth(minimum_age=18, maximum_age=70)
        data.append({"Name": name, "Nationality": nationality, "Birthdate": birthdate})
    return pd.DataFrame(data)

# Generate sanctioned names
def generate_sanctioned_data(num_records):
    sanctioned_names = []
    for _ in range(num_records):
        name = fake.name()
        # Pick one variation per name: unchanged, lowercased, scrambled characters, or an "a" -> "@" typo
        variations = [
            name,
            name.lower(),
            ''.join(random.sample(name, len(name))),  # Scrambled name
            name.replace("a", "@"),  # Typo
        ]
        sanctioned_names.append({"Sanctioned_Name": random.choice(variations)})
    return pd.DataFrame(sanctioned_names)

# Generate datasets
num_customers = 1000
num_sanctioned = 200

customers_df = generate_customer_data(num_customers)
sanctioned_df = generate_sanctioned_data(num_sanctioned)

# Save datasets to CSV (create the output directory first so to_csv does not fail)
import os

os.makedirs("Name Matching Algorithm/data", exist_ok=True)
customer_file_path = "Name Matching Algorithm/data/customers_data.csv"
sanctioned_file_path = "Name Matching Algorithm/data/sanctioned_data.csv"
customers_df.to_csv(customer_file_path, index=False)
sanctioned_df.to_csv(sanctioned_file_path, index=False)

customer_file_path, sanctioned_file_path


('Name Matching Algorithm/data/customers_data.csv',
 'Name Matching Algorithm/data/sanctioned_data.csv')

### Develop and Test a Name-Matching Algorithm
We will use fuzzy matching algorithms to compare customer names against the sanctioned list.

Steps:
1. Implement the name matching logic using rapidfuzz (faster and more flexible than FuzzyWuzzy).
2. Tune the matching threshold to balance precision and recall.
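
To make the threshold concrete, it helps to see what a score of 85 or 90 means for typical variations. The sketch below is a pure-Python reimplementation written under the assumption that `fuzz.ratio` is a normalized Indel similarity, i.e. 200 × LCS / (len(a) + len(b)), where LCS is the longest common subsequence length. The helper names `lcs_length` and `indel_ratio` are illustrative and not part of rapidfuzz, and real rapidfuzz scores may differ if a processor or another scorer is configured.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Similarity in [0, 100], assumed to mirror fuzz.ratio: 200 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 100.0
    return 200.0 * lcs_length(a, b) / total


typo_score = indel_ratio("Craig Mcmahon", "Cr@ig Mcm@hon")  # two characters replaced
case_score = indel_ratio("Allison Hill", "allison hill")    # only the capitals differ
print(round(typo_score, 2), round(case_score, 2))           # prints 84.62 83.33
```

Under this formula both variants fall below a threshold of 85, which suggests that case differences and single-character substitutions can be missed unless names are normalized before scoring.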

In [14]:
# Load datasets
customers_df = pd.read_csv("Name Matching Algorithm/data/customers_data.csv")
sanctioned_df = pd.read_csv("Name Matching Algorithm/data/sanctioned_data.csv")

# Define a function for fuzzy matching
def match_name(customer_name, sanctioned_list, threshold=90):
    """Return (best_match, score) if the best score meets the threshold, else (None, 0)."""
    # extractOne returns a (match, score, index) tuple, or None when there are no choices
    result = process.extractOne(customer_name, sanctioned_list, scorer=fuzz.ratio)
    if result is not None and result[1] >= threshold:
        return result[0], result[1]
    return None, 0

# Perform name matching
sanctioned_list = sanctioned_df["Sanctioned_Name"].tolist()
customers_df["Matched_Name"], customers_df["Match_Score"] = zip(
    *customers_df["Name"].apply(lambda x: match_name(x, sanctioned_list))
)

# Keep only customers whose best match met the threshold (match_name returns a score of 0 otherwise)
matched_customers = customers_df[customers_df["Match_Score"] > 0]

# Save results for review
matched_customers.to_csv("Name Matching Algorithm/data/matched_customers.csv", index=False)
"Name Matching Algorithm/data/matched_customers.csv"


'Name Matching Algorithm/data/matched_customers.csv'

### Evaluate Performance (Precision, Recall, F1-Score)
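We need to measure how well the algorithm performs against a synthetic ground truth. In this demo a customer counts as a true positive only if the exact name string appears in the sanctioned list, so every true positive scores 100 and recall is 1.0 by construction; only precision responds to the threshold.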

Steps:
1. Generate labels for true positives and false positives in the dataset.
2. Compute metrics.
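
As a reminder of what the three metrics measure, the following minimal sketch computes them from raw counts without scikit-learn. The function name `precision_recall_f1` and the example labels are hypothetical and chosen only for illustration.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = sanctioned match)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical labels: one real hit, one false alert, two clean customers
y_true = [1, 0, 0, 0]
y_pred = [1, 1, 0, 0]
metrics = precision_recall_f1(y_true, y_pred)
print(metrics)
```

With one true positive, one false positive and no false negatives, precision is 0.5, recall is 1.0 and F1 is about 0.667: every real hit is caught, but half of the alerts are noise.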

In [15]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Generate synthetic ground truth (for demo purposes)
customers_df["True_Positive"] = customers_df["Name"].apply(
    lambda x: 1 if x in sanctioned_list else 0
)
customers_df["Predicted_Positive"] = customers_df["Match_Score"].apply(
    lambda x: 1 if x > 0 else 0
)

# Calculate metrics
precision = precision_score(customers_df["True_Positive"], customers_df["Predicted_Positive"])
recall = recall_score(customers_df["True_Positive"], customers_df["Predicted_Positive"])
f1 = f1_score(customers_df["True_Positive"], customers_df["Predicted_Positive"])

precision, recall, f1


(0.5, 1.0, 0.6666666666666666)

### Automate the Testing Pipeline
Create a function to rerun the above steps with different datasets and thresholds. This can later be expanded into a fully automated pipeline.

In [12]:
def evaluate_matching(customers_df, sanctioned_df, threshold=80):
    """Re-run matching at the given threshold and return (precision, recall, f1).

    Expects customers_df to already contain a "True_Positive" ground-truth column.
    """
    customers_df = customers_df.copy()  # avoid mutating the caller's DataFrame
    sanctioned_list = sanctioned_df["Sanctioned_Name"].tolist()
    customers_df["Matched_Name"], customers_df["Match_Score"] = zip(
        *customers_df["Name"].apply(lambda x: match_name(x, sanctioned_list, threshold))
    )
    customers_df["Predicted_Positive"] = customers_df["Match_Score"].apply(
        lambda x: 1 if x > 0 else 0
    )
    precision = precision_score(customers_df["True_Positive"], customers_df["Predicted_Positive"])
    recall = recall_score(customers_df["True_Positive"], customers_df["Predicted_Positive"])
    f1 = f1_score(customers_df["True_Positive"], customers_df["Predicted_Positive"])
    return precision, recall, f1
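
One possible shape for the expanded pipeline is a small harness that scores a labelled benchmark once and then reports missed true positives and false alerts at each threshold. The sketch below uses only the standard library. `difflib.SequenceMatcher` is a stand-in scorer (an assumption made so the example runs without rapidfuzz), and `sweep_thresholds`, `difflib_score` and the labelled pairs are hypothetical.

```python
from difflib import SequenceMatcher


def difflib_score(a: str, b: str) -> float:
    """Stand-in scorer (assumption): difflib similarity scaled to 0-100, not rapidfuzz."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def sweep_thresholds(labelled_pairs, thresholds, scorer=difflib_score):
    """labelled_pairs: iterable of (customer_name, candidate_name, is_true_match)."""
    scored = [(scorer(c, s), bool(label)) for c, s, label in labelled_pairs]
    report = []
    for threshold in thresholds:
        tp = sum(1 for score, label in scored if score >= threshold and label)
        fp = sum(1 for score, label in scored if score >= threshold and not label)
        fn = sum(1 for score, label in scored if score < threshold and label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        report.append({"threshold": threshold, "precision": precision,
                       "recall": recall, "missed": fn, "false_alerts": fp})
    return report


# Hypothetical labelled pairs for illustration only
pairs = [
    ("john smith", "john smith", True),
    ("john smith", "jon smith", True),
    ("mary jones", "marc james", False),
    ("ali khan", "peter brown", False),
]
report = sweep_thresholds(pairs, thresholds=[70, 80, 90, 100])
for row in report:
    print(row)
```

Because the scorer is passed in as a parameter, the same harness could be pointed at a rapidfuzz scorer or wrapped in a pytest test that fails when recall drops below an agreed floor.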


In [13]:
import streamlit as st

# Load data
matched_customers = pd.read_csv("Name Matching Algorithm/data/matched_customers.csv")

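# Streamlit app: save this cell as a script and launch it with `streamlit run <script>.py`;
# the widgets below do not render when executed directly in a notebook kernel
st.title("Sanctions Screening Dashboard")

# matched_customers.csv only holds rows that already met the matching threshold,
# so slider values below that threshold display every row
threshold = st.slider("Match Threshold", 0, 100, 80)
filtered_matches = matched_customers[matched_customers["Match_Score"] >= threshold]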

st.dataframe(filtered_matches)
st.write(f"Total Matches Above Threshold: {len(filtered_matches)}")




### Different Dataset

In [2]:
from faker import Faker
import pandas as pd
import random
import re

# Initialize Faker
fake = Faker()

# Set seed for reproducibility
Faker.seed(42)
random.seed(42)

# Generate complex customer data
def generate_customer_data(num_records):
    data = []
    for _ in range(num_records):
        name = fake.name()
        nationality = fake.country()
        birthdate = fake.date_of_birth(minimum_age=18, maximum_age=70)
        data.append({"Name": name, "Nationality": nationality, "Birthdate": birthdate})
    return pd.DataFrame(data)

# Generate complex sanctioned names with variations
def generate_sanctioned_data(num_records):
    data = []
    for _ in range(num_records):
        name = fake.name()
        variations = [
            name,
            name.lower(),
            ''.join(random.sample(name, len(name))),  # Scrambled name
            name.replace("a", "@"),  # Typo
            name + " Jr.",  # Alias
        ]
        data.append({"Sanctioned_Name": random.choice(variations), "Sanctioned_Nationality": fake.country()})
    return pd.DataFrame(data)

# Generate datasets
num_customers = 2000
num_sanctioned = 500

customers_df = generate_customer_data(num_customers)
sanctioned_df = generate_sanctioned_data(num_sanctioned)

# Save datasets for reference
customers_df.to_csv("customers_data.csv", index=False)
sanctioned_df.to_csv("sanctioned_data.csv", index=False)


In [3]:
customers_df.head()

              Name     Nationality   Birthdate
0     Allison Hill  Czech Republic  1961-05-28
1  Patrick Gardner           Nauru  1976-05-15
2    Daniel Wagner        Anguilla  1983-10-03
3  Meredith Barnes        Kiribati  1985-03-29
4  Abigail Shaffer           Qatar  1976-05-28


In [4]:
sanctioned_df.head()

     Sanctioned_Name  Sanctioned_Nationality
0     Mckenzie Green             Switzerland
1   evoatSrs teentnP   Sao Tome and Principe
2      Cr@ig Mcm@hon                 Liberia
3     Michael Martin                   Qatar
4  Thomas Hudson Jr.               Swaziland


In [5]:
#Standardize names and remove noise for better matching
def preprocess_name(name):
    # Remove special characters and standardize case
    return re.sub(r"[^a-zA-Z\s]", "", name).lower().strip()

# Preprocess names in both datasets
customers_df["Preprocessed_Name"] = customers_df["Name"].apply(preprocess_name)
sanctioned_df["Preprocessed_Sanctioned_Name"] = sanctioned_df["Sanctioned_Name"].apply(preprocess_name)
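
One caveat with the `[^a-zA-Z\s]` pattern is that it also deletes accented letters, which matters for names from many regions. The sketch below shows one possible alternative that folds accents to their base letters with `unicodedata` before applying the same regex. `preprocess_name_ascii_fold` is a hypothetical helper, the example name is made up, and whether accent folding is appropriate for a given sanctions list is an assumption to validate.

```python
import re
import unicodedata


def regex_only(name: str) -> str:
    """Same cleaning rule as preprocess_name: drop non-letters, lowercase, trim."""
    return re.sub(r"[^a-zA-Z\s]", "", name).lower().strip()


def preprocess_name_ascii_fold(name: str) -> str:
    """Fold accented letters to base letters first, then apply the same cleaning rule."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^a-zA-Z\s]", "", without_marks)
    return re.sub(r"\s+", " ", cleaned).lower().strip()


accented = "Jos\u00e9 N\u00fa\u00f1ez"  # "José Núñez", written with escapes to pin the encoding
print(regex_only(accented))                          # prints jos nez
print(preprocess_name_ascii_fold(accented))          # prints jose nunez
print(preprocess_name_ascii_fold("Cr@ig Mcm@hon"))   # prints crig mcmhon
```

The folded form keeps every base letter, so an accented name and its unaccented transliteration normalize to the same string instead of losing characters.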


In [6]:
#Use fuzzy matching to find the best matches between customer names and sanctioned names
from rapidfuzz import process, fuzz

# Define the fuzzy matching function
def match_name(customer_name, sanctioned_list, threshold=85):
    match, score, _ = process.extractOne(customer_name, sanctioned_list, scorer=fuzz.ratio)
    if score >= threshold:
        return match, score
    return None, 0

# Perform name matching
sanctioned_list = sanctioned_df["Preprocessed_Sanctioned_Name"].tolist()
customers_df["Matched_Name"], customers_df["Match_Score"] = zip(
    *customers_df["Preprocessed_Name"].apply(lambda x: match_name(x, sanctioned_list))
)


In [7]:
#Introduce a synthetic ground truth and evaluate the algorithm’s performance
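# Simulate ground truth for demonstration purposes: a customer is a true positive only when
# the preprocessed name exactly equals a preprocessed sanctioned name (so recall is 1.0 by construction)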
customers_df["True_Positive"] = customers_df["Name"].apply(
    lambda x: 1 if preprocess_name(x) in sanctioned_list else 0
)
customers_df["Predicted_Positive"] = customers_df["Match_Score"].apply(lambda x: 1 if x > 0 else 0)

from sklearn.metrics import precision_score, recall_score, f1_score

# Calculate evaluation metrics
precision = precision_score(customers_df["True_Positive"], customers_df["Predicted_Positive"])
recall = recall_score(customers_df["True_Positive"], customers_df["Predicted_Positive"])
f1 = f1_score(customers_df["True_Positive"], customers_df["Predicted_Positive"])

print(f"Precision: {precision}, Recall: {recall}, F1-Score: {f1}")


Precision: 0.18604651162790697, Recall: 1.0, F1-Score: 0.3137254901960785
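
Because the ground truth above relies on exact string equality, the benchmark cannot show whether scrambled or misspelled variants are missed. One way to get a more informative ground truth, sketched below under stated assumptions, is to record the source name alongside each generated variant so that true positives are known independently of spelling. `make_labelled_variants`, the column names `Source_Name` and `Variant`, and the example names are hypothetical.

```python
import random


def make_labelled_variants(names, seed=42):
    """Return one variant per name together with the source name and variant type."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    makers = {
        "exact": lambda n: n,
        "lowercase": lambda n: n.lower(),
        "scrambled": lambda n: "".join(rng.sample(list(n), len(n))),
        "typo": lambda n: n.replace("a", "@"),
        "suffix": lambda n: n + " Jr.",
    }
    records = []
    for name in names:
        kind = rng.choice(sorted(makers))
        records.append({"Source_Name": name,
                        "Sanctioned_Name": makers[kind](name),
                        "Variant": kind})
    return records


# Hypothetical names for illustration only
records = make_labelled_variants(["Allison Hill", "Daniel Wagner", "Craig Mcmahon"])

# A customer is a true positive when their name equals any Source_Name,
# regardless of how the listed variant is spelled.
true_names = {r["Source_Name"] for r in records}
for r in records:
    print(r)
```

With labels of this kind, recall can drop below 1.0 whenever a variant scores under the threshold, which makes the precision and recall trade-off measurable per variant type.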


In [8]:
#Incorporate attributes like nationality to improve filtering
# Map each preprocessed sanctioned name to its nationality (first occurrence wins, as with .values[0])
nationality_by_name = (
    sanctioned_df.drop_duplicates("Preprocessed_Sanctioned_Name")
    .set_index("Preprocessed_Sanctioned_Name")["Sanctioned_Nationality"]
    .to_dict()
)

def contextual_filter(row):
    # Flag a row only when a name match exists and the nationalities agree
    matched_name = row["Matched_Name"]
    if pd.isna(matched_name):
        return 0
    return int(row["Nationality"] == nationality_by_name.get(matched_name))

customers_df["Contextual_Match"] = customers_df.apply(contextual_filter, axis=1)
customers_df["Predicted_Positive"] = customers_df["Contextual_Match"]
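
The filter above compares against a single nationality per matched name, so if two listed entities share a name only the first row is considered. A minimal sketch of a lookup that keeps every listed nationality is shown below. `build_nationality_index`, `contextual_match` and the example rows are hypothetical, and it assumes plain `(name, nationality)` tuples rather than a DataFrame.

```python
from collections import defaultdict


def build_nationality_index(sanctioned_rows):
    """Map each sanctioned name to the set of nationalities listed for it."""
    index = defaultdict(set)
    for name, nationality in sanctioned_rows:
        index[name].add(nationality)
    return index


def contextual_match(matched_name, customer_nationality, index):
    """True only when a name match exists and the nationality agrees with any listed entry."""
    if not matched_name:
        return False
    return customer_nationality in index.get(matched_name, set())


# Hypothetical rows: two distinct listed entities share one name
index = build_nationality_index([
    ("michael martin", "Qatar"),
    ("michael martin", "Liberia"),
    ("thomas hudson jr", "Swaziland"),
])
print(contextual_match("michael martin", "Liberia", index))  # prints True
print(contextual_match("michael martin", "Nauru", index))    # prints False
print(contextual_match(None, "Qatar", index))                # prints False
```

Note that a strict nationality filter raises precision at the cost of recall, since a sanctioned person screened under a different or missing nationality would be dropped.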


In [9]:
#Evaluate performance at different thresholds to find the optimal balance
thresholds = [80, 85, 90, 95]
results = []

# Score every customer once with no cutoff, keeping the unthresholded best match and score
# so later cells (e.g. the dashboard slider) can apply any threshold
customers_df["Matched_Name"], customers_df["Match_Score"] = zip(
    *customers_df["Preprocessed_Name"].apply(lambda x: match_name(x, sanctioned_list, threshold=0))
)

for threshold in thresholds:
    # Apply each threshold to the same scores instead of re-running the matching
    predicted = (customers_df["Match_Score"] >= threshold).astype(int)
    precision = precision_score(customers_df["True_Positive"], predicted)
    recall = recall_score(customers_df["True_Positive"], predicted)
    f1 = f1_score(customers_df["True_Positive"], predicted)
    results.append({"Threshold": threshold, "Precision": precision, "Recall": recall, "F1": f1})

results_df = pd.DataFrame(results)
print(results_df)


   Threshold  Precision  Recall        F1
0         80   0.052980     1.0  0.100629
1         85   0.186047     1.0  0.313725
2         90   0.421053     1.0  0.592593
3         95   0.666667     1.0  0.800000
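
In sanctions screening a missed hit is usually costlier than a false alert, so one option for choosing a threshold is an F-beta score with beta > 1, which weights recall more heavily than F1 does. The sketch below applies the standard F-beta formula to the precision and recall values from the table above. The helper name `f_beta` and the choice of beta = 2 are assumptions for illustration.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (threshold, precision, recall) copied from the results table above
rows = [(80, 0.052980, 1.0), (85, 0.186047, 1.0), (90, 0.421053, 1.0), (95, 0.666667, 1.0)]

for threshold, precision, recall in rows:
    print(threshold,
          round(f_beta(precision, recall, beta=1.0), 4),
          round(f_beta(precision, recall, beta=2.0), 4))

# Pick the threshold with the highest recall-weighted score
best_threshold = max(rows, key=lambda r: f_beta(r[1], r[2], beta=2.0))[0]
print(best_threshold)
```

Since recall is 1.0 at every threshold in this table, any beta selects 95. The weighting only starts to matter once the benchmark contains variants that a high threshold can miss.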


In [10]:
#Create a Streamlit dashboard to visualize results interactively
import streamlit as st

st.title("Sanctions Screening Dashboard")

threshold = st.slider("Match Threshold", 0, 100, 85)
filtered_matches = customers_df[customers_df["Match_Score"] >= threshold]

st.dataframe(filtered_matches[["Name", "Matched_Name", "Match_Score"]])
st.write(f"Total Matches Above Threshold: {len(filtered_matches)}")




In [11]:
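#Save results for future tuning and comparison; write the header only when the log file is first created
import os
results_df.to_csv("performance_log.csv", mode='a', index=False, header=not os.path.exists("performance_log.csv"))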
