Cleaning Process
The raw dataset contained 158,663 rows of political speaker name variants, each paired with a country and legislature. The goal is to match these names against English-language CNN/DailyMail news articles and to align the country names with the global north dataset.
The most impactful step was filtering to Latin-script names only, dropping Cyrillic, Arabic, Hebrew, CJK, and Indic scripts (Odia, Telugu, Tamil, Gujarati, etc.) with a Unicode allowlist. This alone removed 62,427 rows. The remaining steps cleaned up Wikipedia scraping artifacts: HTML tags, wiki markup, OCR-garbled entries from scanned DRC parliament records (digits substituted for visually similar letters, such as 1 for I), and province names that had leaked into the name field for Afghan entries. Names that were fully or partially all-caps were title-cased to match how names appear in journalism. Hyphens in country names were replaced with spaces (United-States-of-America → United States of America). A name_ascii column was added with diacritics stripped (e.g. Erdoğan → Erdogan) to support flexible matching, since news articles are inconsistent about using them.
Finally, country names were renamed to match the global north dataset (United States of America → United States), and rows for countries or territories not present in that dataset were dropped.

In [2]:
import pandas as pd
import re
import unicodedata

In [3]:
df = pd.read_csv("names.csv", delimiter=',')
print(f"Original rows: {len(df):,}")
print(f"Unique people: {df['id'].nunique():,}")
df.head(20)

Original rows: 158,663
Unique people: 78,382


Unnamed: 0,id,name,country,legislature
0,9fd33b27-fd4c-4eba-9a8f-d4d23f603c63,Sergei Schamba,Abkhazia,Assembly
1,9fd33b27-fd4c-4eba-9a8f-d4d23f603c63,Sergei Shamba,Abkhazia,Assembly
2,9fd33b27-fd4c-4eba-9a8f-d4d23f603c63,Sergei Šamba,Abkhazia,Assembly
3,9fd33b27-fd4c-4eba-9a8f-d4d23f603c63,Sergej Sjamba,Abkhazia,Assembly
4,9fd33b27-fd4c-4eba-9a8f-d4d23f603c63,Sergej Šamba,Abkhazia,Assembly
5,9fd33b27-fd4c-4eba-9a8f-d4d23f603c63,Siergiej Szamba,Abkhazia,Assembly
6,da988bab-32d4-46c0-bb7b-5c6a6eb129e7,Valeri Bganba,Abkhazia,Assembly
7,da988bab-32d4-46c0-bb7b-5c6a6eb129e7,Valerij Bganba,Abkhazia,Assembly
8,da988bab-32d4-46c0-bb7b-5c6a6eb129e7,Waleri Bganba,Abkhazia,Assembly
9,da988bab-32d4-46c0-bb7b-5c6a6eb129e7,Walerij Bganba,Abkhazia,Assembly


In [4]:
# STEP 1: Drop ALL non-Latin scripts
# Allowlist approach: keep a name only if every character is Latin
# (base + extended), ASCII punctuation, a digit, or a space.
# Anything outside these ranges is treated as non-Latin and the row is dropped.
# Latin Unicode ranges used:
#   Basic Latin:          U+0020–U+007E
#   Latin-1 Supplement:   U+00C0–U+00FF  (letters portion of the block)
#   Latin Extended-A/B:   U+0100–U+024F
#   Latin Extended Add.:  U+1E00–U+1EFF  (e.g. Vietnamese)
# Spaces, hyphens, apostrophes, periods, commas and parentheses all sit inside
# Basic Latin, so they pass without a separate rule.
# The pattern is compiled once here rather than on every call.
LATIN_ONLY = re.compile(
    r'^[\u0020-\u007E'       # Basic ASCII (letters, digits, punctuation)
    r'\u00C0-\u00FF'         # Latin-1 Supplement (à, é, ñ, ü…)
    r'\u0100-\u024F'         # Latin Extended A+B (ā, ș, ž, ğ…)
    r'\u1E00-\u1EFF'         # Latin Extended Additional (Vietnamese etc.)
    r']+$'
)

def is_latin_name(name):
    """Return True only if every character is Latin-script or allowed punctuation."""
    return bool(LATIN_ONLY.match(str(name)))

latin_mask = df['name'].apply(is_latin_name)
print(f"\nStep 1 — Dropping {(~latin_mask).sum():,} non-Latin-script names "
      f"({latin_mask.sum():,} Latin names kept)")
df = df[latin_mask].copy()


Step 1 — Dropping 62,427 non-Latin-script names (96,236 Latin names kept)
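One property of the Step 1 allowlist is worth checking by hand: it contains no combining marks, so a Latin name stored in decomposed (NFD) form would be rejected even though it looks identical on screen. The sketch below restates the same ranges and adds `is_latin_name_nfc`, a hypothetical variant that composes to NFC first. Whether names.csv actually contains decomposed names is not known here, and the sample strings are illustrative rather than rows from the dataset.

```python
import re
import unicodedata

# Same character ranges as the Step 1 allowlist.
LATIN_ONLY = re.compile(
    r'^[\u0020-\u007E'   # Basic Latin (printable ASCII)
    r'\u00C0-\u00FF'     # Latin-1 Supplement letters
    r'\u0100-\u024F'     # Latin Extended-A and -B
    r'\u1E00-\u1EFF'     # Latin Extended Additional
    r']+$'
)

def is_latin_name(name):
    """Return True if every character is inside the allowlisted ranges."""
    return bool(LATIN_ONLY.match(str(name)))

def is_latin_name_nfc(name):
    """Hypothetical variant: compose to NFC first so decomposed input still passes."""
    return is_latin_name(unicodedata.normalize("NFC", str(name)))

# Illustrative strings, not rows taken from names.csv.
composed = unicodedata.normalize("NFC", "Sergei \u0160amba")   # Š as a single code point
decomposed = unicodedata.normalize("NFD", composed)            # S followed by a combining caron
cyrillic = "\u0421\u0435\u0440\u0433\u0435\u0439"              # "Сергей"

for label, value in [("composed", composed), ("decomposed", decomposed), ("cyrillic", cyrillic)]:
    print(f"{label:<10} plain={is_latin_name(value)!s:<5} nfc={is_latin_name_nfc(value)}")
```

If the decomposed case matters for a given source, normalizing the name column to NFC before Step 1 would be one way to avoid dropping those rows.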


In [5]:
# STEP 2: Strip HTML / wiki markup
# Artifacts from Wikipedia scraping that would break any name lookup.
df['name'] = df['name'].str.replace(r'<[^>]+>', '', regex=True)       # <br />, <b>, etc.
df['name'] = df['name'].str.replace(r'\[.*?\]', '', regex=True)       # any [bracketed] text, e.g. [edit]
df['name'] = df['name'].str.replace(r'thumb\|.*', '', regex=True)     # wiki thumbnail captions
df['name'] = df['name'].str.replace(r'\\+', '', regex=True)           # stray backslashes
df['name'] = df['name'].str.replace(r'^\|+', '', regex=True)          # leading pipe chars
# Keep only the text before the first remaining pipe: "Name | suffix" → "Name".
# Rows whose suffix was in a non-Latin script were already dropped in Step 1.
df['name'] = df['name'].str.replace(r'\s*\|\s*.*$', '', regex=True)
df['name'] = df['name'].str.strip()

# STEP 3: Drop OCR-garbled names
# Mainly DRC entries scanned with bad OCR: "MUB1" instead of "MUBI",
# "N1ADIMBA" instead of "NIADIMBA". Pattern: a digit between two capital
# letters, directly before two capitals, or directly after two capitals.
def is_ocr_garbage(name):
    return bool(re.search(r'[A-Z][0-9][A-Z]|[0-9][A-Z]{2}|[A-Z]{2}[0-9]', name))
ocr_mask = df['name'].apply(is_ocr_garbage)
print(f"Step 3 — Dropping {ocr_mask.sum():,} OCR-garbled names")
df = df[~ocr_mask].copy()

# STEP 4: Drop names with residual junk characters
junk_mask = df['name'].str.contains(r'[<>{}\^~`]', na=False, regex=True)
print(f"Step 4 — Dropping {junk_mask.sum():,} names with residual junk characters")
df = df[~junk_mask].copy()

Step 3 — Dropping 22 OCR-garbled names
Step 4 — Dropping 25 names with residual junk characters
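To see what Steps 2 and 3 do to individual strings, the rules can be replayed with plain `re` outside pandas. This is a minimal sketch: `strip_markup` and `MARKUP_RULES` are names introduced here for illustration, the patterns are copied from the cell above, and the sample strings are made up rather than taken from the dataset.

```python
import re

# Step 2 patterns, applied in the same order as in the notebook cell.
MARKUP_RULES = [
    r'<[^>]+>',       # HTML tags such as <br /> or <b>
    r'\[.*?\]',       # bracketed wiki remnants such as [edit]
    r'thumb\|.*',     # wiki thumbnail captions
    r'\\+',           # stray backslashes
    r'^\|+',          # leading pipe characters
    r'\s*\|\s*.*$',   # everything from the first remaining pipe onward
]

def strip_markup(name):
    """Apply every Step 2 rule in order, then trim surrounding whitespace."""
    for pattern in MARKUP_RULES:
        name = re.sub(pattern, '', name)
    return name.strip()

# Step 3 pattern: a digit adjacent to capital letters.
OCR_PATTERN = re.compile(r'[A-Z][0-9][A-Z]|[0-9][A-Z]{2}|[A-Z]{2}[0-9]')

def is_ocr_garbage(name):
    return bool(OCR_PATTERN.search(name))

# Made-up inputs for illustration.
for raw in ["John Smith<br />", "Jane Doe [edit]", "|Ali Khan | MP"]:
    print(f"{raw!r:22} -> {strip_markup(raw)!r}")

for raw in ["MUB1", "N1ADIMBA", "MUBI", "Louis 2nd"]:
    print(f"{raw!r:22} OCR-garbled: {is_ocr_garbage(raw)}")
```

Note that a string consisting only of markup (for example a bare thumbnail caption) comes out empty, so checking for empty names after Step 2 may be worthwhile.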


In [6]:
# STEP 5: Strip region/province leakage from name field
# Afghan entries like "Abdul Qadir Qalatwal, zabul" accidentally include a
# province name after the comma. Heuristic: trailing ", lowercase-word(s)".
df['name'] = df['name'].str.replace(r',\s+[a-z][a-zA-Z\s]*$', '', regex=True).str.strip()

# STEP 6: Normalize whitespace
df['name'] = df['name'].str.replace(r'\s+', ' ', regex=True).str.strip()

# STEP 7: Drop exact duplicate (id, name) pairs
before = len(df)
df = df.drop_duplicates(subset=['id', 'name'])
print(f"Step 7 — Dropped {before - len(df):,} exact duplicate (id, name) rows")

Step 7 — Dropped 1 exact duplicate (id, name) rows
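The Step 5 heuristic is easy to over- or under-apply, so it helps to try it on a few edge cases. The sketch below reuses the Step 5 and Step 6 patterns with plain `re`; `strip_suffix_and_spaces` is a helper name introduced here, and apart from the "Qalatwal, zabul" example quoted in the cell above, the inputs are invented.

```python
import re

PROVINCE_SUFFIX = re.compile(r',\s+[a-z][a-zA-Z\s]*$')   # Step 5: trailing ", lowercase-word(s)"
WHITESPACE_RUN = re.compile(r'\s+')                      # Step 6: collapse runs of whitespace

def strip_suffix_and_spaces(name):
    """Drop a trailing lowercase suffix after a comma, then normalize spaces."""
    name = PROVINCE_SUFFIX.sub('', name).strip()
    return WHITESPACE_RUN.sub(' ', name).strip()

examples = [
    "Abdul Qadir Qalatwal, zabul",   # province suffix: removed
    "Smith, John",                   # capitalized after the comma: kept
    "Jean  Dupont, fils",            # any lowercase suffix is removed, not only provinces
    "Ahmad Khan,kabul",              # no space after the comma: not matched
]
for raw in examples:
    print(f"{raw!r:32} -> {strip_suffix_and_spaces(raw)!r}")
```

The third and fourth cases show the two failure modes: a legitimate lowercase suffix is stripped, and a leaked province with no space after the comma survives.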


In [7]:
# STEP 8: Strip whitespace from country / legislature
df['country'] = df['country'].str.strip()
df['legislature'] = df['legislature'].str.strip()

# STEP 9: Add ASCII-normalized name column for loose matching
# News articles may write "Erdogan" (no diacritics) even when the correct
# spelling is "Erdoğan". Keeping both versions lets you match either way.
def to_ascii(name):
    """Strip combining diacritics after NFKD decomposition.

    Letters with no Unicode decomposition (e.g. ł, ø, đ, ı, ß) pass through
    unchanged, so the result is not guaranteed to be pure ASCII.
    """
    normalized = unicodedata.normalize('NFKD', name)
    return ''.join(c for c in normalized if not unicodedata.combining(c))
df['name_ascii'] = df['name'].apply(to_ascii)
# Flag rows where the ASCII version differs (i.e. had diacritics worth noting)
df['has_diacritics'] = df['name'] != df['name_ascii']
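NFKD decomposition only removes diacritics that Unicode models as a base letter plus a combining mark. Letters such as ł, ø, đ, ı and ß have no such decomposition and survive `to_ascii` unchanged. The sketch below shows this and adds a small fallback table; `FALLBACK` and `to_ascii_strict` are hypothetical additions, and the table covers only a handful of letters chosen for illustration.

```python
import unicodedata

def to_ascii(name):
    """Same logic as Step 9: decompose, then drop combining marks."""
    normalized = unicodedata.normalize('NFKD', name)
    return ''.join(c for c in normalized if not unicodedata.combining(c))

# Hypothetical fallback for letters that NFKD leaves intact (not exhaustive).
FALLBACK = str.maketrans({
    'ł': 'l', 'Ł': 'L',
    'ø': 'o', 'Ø': 'O',
    'đ': 'd', 'Đ': 'D',
    'ı': 'i',
    'ß': 'ss',
    'æ': 'ae', 'Æ': 'AE',
})

def to_ascii_strict(name):
    """Apply to_ascii, then map the remaining special letters by table."""
    return to_ascii(name).translate(FALLBACK)

for raw in ["Erdoğan", "Wałęsa", "Søren"]:
    print(f"{raw:10} to_ascii={to_ascii(raw):10} strict={to_ascii_strict(raw)}")
```

Whether the stricter form helps depends on how the news text spells these names; keeping both columns would allow matching on either.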

In [8]:
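# STEP 11: Normalize name casing
# ~6,800 fully all-caps names and ~8,400 partial-caps names exist (mostly
# Albanian, Algerian, DRC entries). Title-case them to match how names appear
# in news articles. Names with no all-caps word are left untouched; in any
# other name every word is re-cased with str.capitalize(), so mixed-case or
# hyphenated words lose their inner capitals ("Jean-Pierre" → "Jean-pierre").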
PARTICLES = {'de','van','von','der','den','al','el','la','le',
             'du','di','da','dos','das','ep','ben','bin'}

def normalize_case(name):
    words = name.split()
    if not any(w.isupper() and len(w) > 1 for w in words):
        return name  # already mixed-case, leave it alone
    result = []
    for i, word in enumerate(words):
        if word.lower() in PARTICLES and i > 0:
            result.append(word.lower())
        else:
            result.append(word.capitalize())
    return ' '.join(result)

df['name'] = df['name'].apply(normalize_case)
# Refresh ASCII column after casing changes
df['name_ascii'] = df['name'].apply(to_ascii)
df['has_diacritics'] = df['name'] != df['name_ascii']

# Step 12: Normalize country names (hyphens → spaces) 
# Countries are stored as "United-States-of-America" — replace hyphens with
# spaces so they match natural English as it appears in news text.
df['country'] = df['country'].str.replace('-', ' ', regex=False)
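Because `str.capitalize()` lowercases everything after the first character, the Step 11 function also rewrites words that were already correctly cased whenever the name contains at least one all-caps word. The sketch below reproduces that function next to `normalize_case_strict`, a hypothetical alternative that changes only all-caps words and capitalizes each hyphen-separated part. It is not the version that produced the outputs in this notebook, and the sample names are illustrative.

```python
PARTICLES = {'de', 'van', 'von', 'der', 'den', 'al', 'el', 'la', 'le',
             'du', 'di', 'da', 'dos', 'das', 'ep', 'ben', 'bin'}

def normalize_case(name):
    """Step 11 as written in the notebook."""
    words = name.split()
    if not any(w.isupper() and len(w) > 1 for w in words):
        return name
    result = []
    for i, word in enumerate(words):
        if word.lower() in PARTICLES and i > 0:
            result.append(word.lower())
        else:
            result.append(word.capitalize())
    return ' '.join(result)

def normalize_case_strict(name):
    """Hypothetical variant: re-case only all-caps words, hyphen-aware."""
    result = []
    for i, word in enumerate(name.split()):
        if not (word.isupper() and len(word) > 1):
            result.append(word)                      # leave mixed-case words untouched
        elif word.lower() in PARTICLES and i > 0:
            result.append(word.lower())              # particles stay lowercase
        else:
            result.append('-'.join(part.capitalize() for part in word.split('-')))
    return ' '.join(result)

for raw in ["Jean-Pierre BEMBA", "IRMINGARD SCHEWE-GERIGK", "MOHAMED BEN ALI", "Sergei Shamba"]:
    print(f"{raw:26} notebook={normalize_case(raw):26} strict={normalize_case_strict(raw)}")
```

Names with apostrophes (for example "O'NEIL") would still lose the capital after the apostrophe in both versions.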

In [9]:
# Step 13: Align with global-north-countries-2026 dataset
# Rename countries with mismatched names first, then drop anything not in the
# reference list (unrecognised states, territories, dependencies, etc.)

COUNTRY_RENAMES = {
    'UK':                      'United Kingdom',
    'United States of America': 'United States',
    'Czech Republic':           'Czechia',
    'Macedonia':                'North Macedonia',
    'Congo Brazzaville':        'Republic of the Congo',
    'Congo Kinshasa':           'DR Congo',
    'Cabo Verde':               'Cape Verde',
    'Timor Leste':              'Timor-Leste',
    'Swaziland':                'Eswatini',
    'Guinea Bissau':            'Guinea-Bissau',
}
df['country'] = df['country'].replace(COUNTRY_RENAMES)

gn = pd.read_csv('global-north-countries-2026.csv', delimiter=',')
valid_countries = set(gn['country'].str.strip())

before = len(df)
df = df[df['country'].isin(valid_countries)].copy()
print(f"Step 13 — Renamed {len(COUNTRY_RENAMES)} mismatched country names")
print(f"          Dropped {before - len(df):,} rows from territories/unrecognised countries")
print(f"          Countries remaining: {df['country'].nunique()}")


Step 13 — Renamed 10 mismatched country names
          Dropped 3,704 rows from territories/unrecognised countries
          Countries remaining: 182
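Before dropping 3,704 rows it can be useful to list which country names failed to match, since a missed rename looks the same as a genuinely excluded territory. The sketch below is one way to do that without pandas. `align_country`, `VALID` and `raw_countries` are hypothetical: `VALID` stands in for the country column of global-north-countries-2026.csv, only three of the ten renames are repeated, and the input list is invented.

```python
from collections import Counter

# Subset of the Step 13 mapping, repeated here so the sketch is self-contained.
COUNTRY_RENAMES = {
    'United States of America': 'United States',
    'Czech Republic': 'Czechia',
    'Guinea Bissau': 'Guinea-Bissau',
}

def align_country(raw, valid):
    """Apply Steps 8, 12 and 13 to one country string; None means 'would be dropped'."""
    name = raw.strip().replace('-', ' ')        # trim, then hyphens -> spaces
    name = COUNTRY_RENAMES.get(name, name)      # rename if a mapping exists
    return name if name in valid else None

# Hypothetical stand-in for the reference list.
VALID = {'United States', 'Czechia', 'Guinea-Bissau'}

# Invented raw values in the dataset's hyphenated style.
raw_countries = ['United-States-of-America', 'Czech-Republic',
                 'Guinea-Bissau', 'Abkhazia', 'Abkhazia']

unmatched = Counter(c for c in raw_countries if align_country(c, VALID) is None)
for country, rows in unmatched.most_common():
    print(f"{country}: {rows} rows would be dropped")
```

Reviewing this list once makes it clear whether any remaining entry needs another line in `COUNTRY_RENAMES`.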


In [31]:
df = df.reset_index(drop=True)

print(f"\n{'─'*60}")
print(f"Final clean rows:           {len(df):,}  (was 158,663)")
print(f"Unique people (by id):      {df['id'].nunique():,}")
print(f"Unique name variants:       {df['name'].nunique():,}")
print(f"  w/ diacritics:            {df['has_diacritics'].sum():,}")
print(f"Countries covered:          {df['country'].nunique():,}")
print(f"\nRows per country (top 15):")
print(df.groupby('country')['name'].count().sort_values(ascending=False).head(15).to_string())



────────────────────────────────────────────────────────────
Final clean rows:           92,484  (was 158,663)
Unique people (by id):      69,290
Unique name variants:       91,466
  w/ diacritics:            21,048
Countries covered:          182

Rows per country (top 15):
country
Turkey            7370
Germany           4772
United States     3684
Greece            3018
Israel            2692
Italy             2588
Portugal          2415
Poland            2345
United Kingdom    2045
Sweden            1872
France            1748
Tanzania          1662
Bulgaria          1574
Switzerland       1429
Norway            1378


In [11]:
# Spot-check a random sample of cleaned rows
df.sample(30)

Unnamed: 0,id,name,country,legislature,name_ascii,has_diacritics
90731,aee4adc6-a000-4017-9a48-6ce4b7799bcc,Joe Natuman,Vanuatu,Parliament,Joe Natuman,False
22693,683ce1c7-0f75-4eff-8c48-b682489b5b4a,Irmingard Schewe-Gerigk,Germany,Bundestag,Irmingard Schewe-Gerigk,False
78667,dab44c05-6dd1-41e2-b187-029289a5b644,İbrahim Göker,Turkey,Assembly,Ibrahim Goker,True
35814,d49f8809-5c51-433e-9b47-fab0830bb242,Yehoshua Rabinovitz,Israel,Knesset,Yehoshua Rabinovitz,False
38190,e9295a98-7d30-4e05-854f-a8e43dfb6664,Luigi Perrone,Italy,Senate,Luigi Perrone,False
34166,200acc74-aa74-466a-8a8c-c8ae5662a5a2,Eli'ezer Kaplan,Israel,Knesset,Eli'ezer Kaplan,False
22330,2040b078-9b08-4914-aaa0-72561ada4165,Helge Braun,Germany,Bundestag,Helge Braun,False
82739,324fccdb-b808-41a1-b934-0f0d9d1a333f,Ömer Özen,Turkey,Assembly,Omer Ozen,True
75571,701545f2-115d-443e-be7d-96e92eeba9c7,Slim Besbes,Tunisia,Majlis,Slim Besbes,False
88362,d8160e24-2725-4e90-b77f-78f3407e2cdc,John Ensign,United States,House,John Ensign,False


In [33]:
import os

df.to_csv("names_clean.csv", index=False)
print(os.getcwd())  # directory where names_clean.csv was written

C:\Users\ASUS
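The cleaned file is meant for matching names against news text. A minimal sketch of how the diacritic-stripped form could be used for that is shown below. It is an assumption about the downstream step, not part of this notebook: `find_mentions` is a hypothetical helper, the article sentence is invented, and a real run over 92,484 names would need something faster than one regex search per name.

```python
import re
import unicodedata

def to_ascii(text):
    """Same diacritic stripping as Step 9."""
    normalized = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in normalized if not unicodedata.combining(c))

def find_mentions(article, names):
    """Return the names that occur in the article as whole words.

    Both sides are diacritic-stripped and case-folded, so "Erdoğan"
    matches "Erdogan" and vice versa.
    """
    haystack = to_ascii(article).casefold()
    hits = []
    for name in names:
        needle = to_ascii(name).casefold()
        # Lookarounds prevent matches inside longer words.
        if re.search(r'(?<!\w)' + re.escape(needle) + r'(?!\w)', haystack):
            hits.append(name)
    return hits

# Invented sentence and a few names in the style of the cleaned data.
article = "Turkish President Recep Tayyip Erdogan met Helge Braun in Berlin."
names = ["Recep Tayyip Erdoğan", "Helge Braun", "Joe Natuman"]
print(find_mentions(article, names))
```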
