The goal of this notebook is to preprocess the Wikidata dataset. 

# Description of the dataset before processing

The Wikidata dataset consists of 10 smaller subsets, each containing the names (person_label), genders (genders), dates of birth (dobs), countries (countries) and continents (continents) of 500 people from a given continent and gender. It also contains the Wikidata identifier of each person (person). 

The continents included are: 
- Africa
- Asia
- Europe
- North America
- South America

The genders included are: 
- Male
- Female

These two categories include anyone tagged as male, female, cisgender male or cisgender female. 

When fetching data from Wikidata, the date of birth was restricted to the range 1900 to 2025, both to exclude entries whose date of birth was not filled in and to avoid particularly old names that the more modern services being used may handle poorly. 

# Preprocessing tasks

The 10 subsets need to be checked for duplicates. Some of the people included have multiple countries and continents of citizenship, meaning they may have shown up multiple times across different queries. The duplicate check is made first on the Wikidata identifier, then on rows sharing the same name and country. 
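The two-stage check can be sketched on a toy frame. The rows below are invented for illustration; only the column names `person`, `person_label` and `countries` come from the dataset.

```python
import pandas as pd

# Toy rows (invented): one entity returned by two queries, plus distinct
# entities that share a name, two of them in the same country.
toy = pd.DataFrame(
    {
        "person": ["Q1", "Q1", "Q2", "Q3", "Q4"],
        "person_label": ["Ana Silva", "Ana Silva", "Jan Nowak", "Jan Nowak", "Jan Nowak"],
        "countries": ["Brazil", "Portugal", "Poland", "Poland", "Germany"],
    }
)

# Stage 1: keep one row per Wikidata identifier.
by_id = toy.drop_duplicates(subset=["person"], keep="first")

# Stage 2: keep one row per (name, country) pair.
by_name = by_id.drop_duplicates(subset=["person_label", "countries"], keep="first")

print(len(toy), len(by_id), len(by_name))  # 5 4 3
```

Note that the second stage keeps "Jan Nowak" once per country, since the same name with a different country is a different input for the services.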

Once checked for duplicates, the dataset will need to be processed manually to identify the first, middle and last name(s) of the people included. This step is fully manual, but can be made easier by flagging beforehand the names that have particularities in them. 

Particularities in this context include: 
- Any middle names
- Double-barrelled names (e.g. Smith-Jones)
- Non-ASCII characters (any character that is not the standard a-zA-Z; all accented letters are included here too)
- Numbers in the name
- Trailing spaces and other irregularities with blank spaces
- Potential punctuation in the name (e.g. Adam J.)
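
As a rough illustration of these checks, here is a minimal sketch using only the standard library. The helper `flag_name` and its reason labels are hypothetical and simpler than the vectorized masks used later in this notebook.

```python
import re
import unicodedata


def flag_name(name: str) -> list:
    """Return the particularities found in a full name (hypothetical helper)."""
    reasons = []
    # Whitespace problems are checked on the raw string, before any cleanup.
    if name != name.strip() or re.search(r"\s{2,}", name):
        reasons.append("whitespace")
    # Normalize, trim and collapse whitespace for the remaining checks.
    norm = re.sub(r"\s+", " ", unicodedata.normalize("NFKC", name).strip())
    if norm.count(" ") > 1:
        reasons.append("middleName")
    if "-" in norm:
        reasons.append("hyphen")
    if not norm.isascii():
        reasons.append("non_ascii")
    if re.search(r"\d", norm):
        reasons.append("digit")
    if re.search(r"[.,/&()]", norm):
        reasons.append("punctuation")
    return reasons


for example in ["Clara Benson", "Esther Ruth Mbabazi", "Wrays Pérez", "Adam J. "]:
    print(repr(example), flag_name(example))
```

A name that returns an empty list is a candidate for the automatic split; anything else goes to manual review.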

Any full name not flagged by this identification process should thus be a simple "FirstName LastName" name. It will be separated into first-name and last-name columns on the assumption that the first token is the first name and the second token is the last name. This is NOT guaranteed, but is supported by a partial visual check of the dataset. 
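
A minimal sketch of that splitting rule, assuming a hypothetical helper `split_two_token_name` that refuses to guess when the name does not have exactly two tokens:

```python
from typing import Optional, Tuple


def split_two_token_name(full_name: str) -> Tuple[Optional[str], Optional[str]]:
    """Split "First Last" into two parts; return (None, None) for anything else."""
    tokens = full_name.split()
    if len(tokens) != 2:
        return None, None
    return tokens[0], tokens[1]


print(split_two_token_name("Clara Benson"))         # ('Clara', 'Benson')
print(split_two_token_name("Esther Ruth Mbabazi"))  # (None, None)
print(split_two_token_name("Jorginho"))             # (None, None)
```

The rule is purely positional: it cannot tell whether a label lists the family name first, which is why the manual check remains necessary.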

Finally, most of the services that can make use of the name's country of origin only accept the ISO code of the country. Since the country is only recorded as a string here, we use pycountry to run a fuzzy search on the string and return the ISO code. 

In [67]:
import numpy as np
import pandas as pd
from pathlib import Path
import pycountry

## Joining subsets 

In [68]:
data_dir = Path("../../data/rawData/wikidata5k")

csv_files = list(data_dir.glob("*.csv"))

dfs = [pd.read_csv(f) for f in csv_files]

big_df = pd.concat(dfs, ignore_index=True)

print(f"Combined {len(csv_files)} files.")
print(f"Big dataset shape: {big_df.shape}")

Combined 10 files.
Big dataset shape: (5000, 6)


## Removing duplicate identifiers

In [69]:
clean_df = big_df.drop_duplicates(subset=['person'], keep='first')

In [70]:
print("Before cleaning:", big_df.shape)
print("After cleaning :", clean_df.shape)

duplicate_count = big_df.shape[0] - clean_df.shape[0]
duplicate_ratio = duplicate_count / big_df.shape[0] * 100
print(f"Removed {duplicate_count} duplicates ({duplicate_ratio:.2f}% of rows).")

Before cleaning: (5000, 6)
After cleaning : (4881, 6)
Removed 119 duplicates (2.38% of rows).


In [71]:
print("Unique persons:", clean_df['person'].nunique())
print("Rows in clean_df:", clean_df.shape[0])

Unique persons: 4881
Rows in clean_df: 4881


In [72]:
missing_summary = clean_df.isnull().mean().sort_values(ascending=False)
print("Fraction of missing values per column:")
print(missing_summary)

Fraction of missing values per column:
person          0.0
person_label    0.0
genders         0.0
dobs            0.0
countries       0.0
continents      0.0
dtype: float64


In [73]:
print(clean_df[['continents', 'countries', 'person_label']].value_counts(normalize=False).head(20))

continents     countries  person_label            
Europe         Poland     Andrzej Kowalski            5
               Hungary    Tibor Flórián               2
               Italy      Lorenzo Casini              2
South America  Brazil     Fabio Ribeiro               2
               Venezuela  José Urriola                1
                          Jhonnatan Medina-Álvarez    1
Africa         Algeria    Isma Kaddouri               1
South America  Venezuela  Victor Luces                1
Africa         Algeria    Abdel Medioub               1
                          Asma Guesmi                 1
                          Fatima Hellilou             1
                          Khaled Benaissa             1
                          Leila Beratto               1
                          Mohamed Esseghir            1
                          Mohamed Zerguini            1
                          Myriam Belkiri              1
                          Ratiba Derfoul             

## Removing duplicate names

After removing duplicates based on the identifier, some names still appeared more than once. Keeping them is unnecessary: since the services used are deterministic, they would produce the same result for each copy of a name as long as the other inputs, such as the country of origin, are the same. Based on this, duplicate names that also have the same country of origin are removed. 

In [74]:
clean_df = clean_df.drop_duplicates(subset=['continents', 'countries', 'person_label'], keep='first')
print(clean_df['person_label'].value_counts().sum())
print(clean_df[['continents', 'countries', 'person_label']].value_counts(normalize=False).head(20))

4874
continents  countries  person_label         
Africa      Algeria    Abdel Medioub            1
                       Asma Guesmi              1
                       Boumédiene Bensmaine     1
                       Brahim Haggiag           1
                       Fadila Hachemaoui        1
                       Faouzia Mebarki          1
                       Fatima Hellilou          1
                       Ghaniyya Bouhouia        1
                       Hassane Mezine           1
                       Hicham Benayad-Cherif    1
                       Isma Kaddouri            1
                       Khaled Benaissa          1
                       Leila Beratto            1
                       Mohamed Esseghir         1
                       Mohamed Zerguini         1
                       Myriam Belkiri           1
                       Ratiba Derfoul           1
                       Selma Hellal             1
                       Sofiane Benfatah         1


## Flagging names

In [75]:
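# Flags every name that does not contain exactly one space. Note that this
# also catches single-token names (zero spaces), which are flagged separately
# as hasNoLastName in the next cell.
middle_name_mask = clean_df['person_label'].str.strip().str.count(" ") != 1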

middle_name_count = middle_name_mask.sum()

print(f"Total rows: {clean_df.shape[0]}")
print(f"Potential middle names: {middle_name_count}")
print(f"Percentage: {middle_name_count / clean_df.shape[0] * 100:.2f}%")

clean_df['hasMiddleName'] = middle_name_mask

Total rows: 4874
Potential middle names: 937
Percentage: 19.22%


In [76]:
no_last_name_mask = clean_df['person_label'].str.count(" ") == 0
no_last_name_count = no_last_name_mask.sum()

print(f"Total rows: {clean_df.shape[0]}")
print(f"Potential rows without last name: {no_last_name_count}")
print(f"Percentage: {no_last_name_count / clean_df.shape[0] * 100:.2f}%")

clean_df['hasNoLastName'] = no_last_name_mask

Total rows: 4874
Potential rows without last name: 55
Percentage: 1.13%


In [77]:
import re
import unicodedata
# This code was generated by ChatGPT and tweaked by me. 

# 1) Keep the original text; add a normalized helper column for consistent tests
def normalize_name(s):
    if pd.isna(s):
        return s
    s = unicodedata.normalize("NFKC", str(s))  # unify apostrophes/spaces, etc.
    s = s.strip()
    s = re.sub(r"\s+", " ", s)                # collapse internal whitespace
    return s

clean_df["person_label_norm"] = clean_df["person_label"].map(normalize_name)

# 2) Build masks (vectorized)
s_raw  = clean_df["person_label"].astype(str)
s_norm = clean_df["person_label_norm"].astype(str)

hyphen_mask       = s_norm.str.contains(r"-", na=False)
apostrophe_mask   = s_norm.str.contains(r"['\u2019\u02BC]", na=False)  # ', ’, ʼ
punct_mask        = s_norm.str.contains(r"[.,/&(){}\[\]<>@#?!$%^*_=+\\]", na=False)
digit_mask        = s_norm.str.contains(r"\d", na=False)
# Non-capturing group: avoids the pandas UserWarning about match groups in str.contains
initial_mask      = s_norm.str.contains(r"(?:^|\s)[A-Za-z]\.", na=False)  # initials like "J."
# Non-ASCII: flag any name containing at least one non-ASCII character
non_ascii_mask    = ~s_norm.map(lambda x: x.isascii())

# Whitespace issues measured on the raw (pre-normalization) text
whitespace_mask   = s_raw.str.contains(
    r"^\s|[\u00A0\u2007\u202F]|\s{2,}|\s$", na=False
)

# 3) Combine into a single "hasDifficultName" flag
difficult_mask = (
    middle_name_mask
    | no_last_name_mask
    | hyphen_mask
    | apostrophe_mask
    | punct_mask
    | digit_mask
    | initial_mask
    | non_ascii_mask
    | whitespace_mask
)

clean_df["hasDifficultName"] = difficult_mask

# 4) (Optional) Keep a reason code for explainability/auditing
def reason_row(i):
    reasons = []
    if middle_name_mask.iat[i]: reasons.append("middleName")
    if no_last_name_mask.iat[i]: reasons.append("noLastName")
    if hyphen_mask.iat[i]:      reasons.append("hyphen")
    if apostrophe_mask.iat[i]:  reasons.append("apostrophe")
    if punct_mask.iat[i]:       reasons.append("punctuation")
    if digit_mask.iat[i]:       reasons.append("digit")
    if initial_mask.iat[i]:     reasons.append("initials")
    if non_ascii_mask.iat[i]:   reasons.append("non_ascii")
    if whitespace_mask.iat[i]:  reasons.append("whitespace")
    return "|".join(reasons)

clean_df["difficult_reason"] = [reason_row(i) for i in range(len(clean_df))]

# 5) (Optional) Quick summary for your report
summary = {
    "total": len(clean_df),
    "difficult_count": difficult_mask.sum(),
    "difficult_pct": 100 * difficult_mask.mean(),
    "by_reason": {
        "hyphen": int(hyphen_mask.sum()),
        "apostrophe": int(apostrophe_mask.sum()),
        "punctuation": int(punct_mask.sum()),
        "digit": int(digit_mask.sum()),
        "initials": int(initial_mask.sum()),
        "non_ascii": int(non_ascii_mask.sum()),
        "whitespace": int(whitespace_mask.sum()),
    },
}
print(summary)

{'total': 4874, 'difficult_count': np.int64(1688), 'difficult_pct': np.float64(34.632745178498155), 'by_reason': {'hyphen': 219, 'apostrophe': 32, 'punctuation': 145, 'digit': 1, 'initials': 131, 'non_ascii': 762, 'whitespace': 4}}




## Separating non-flagged names

In [78]:
def split_name(full_name: str):
    # Split a two-token name into (first token, second token).
    tokens = full_name.strip().split()
    return tokens[0], tokens[1]

# Apply only to names with exactly one space. The other rows are left as NaN
# through index alignment.
# Note: `surName` holds the first token (the first name), `lastName` the second.
clean_df[["surName", "lastName"]] = clean_df[
    ~clean_df['hasMiddleName']]['person_label'].apply(
    lambda x: pd.Series(split_name(x))
)

## Adding ISO code for countries

In [79]:
def findISOCode(country):
    if isinstance(country, str): # in case of missing values
        try:
            return pycountry.countries.search_fuzzy(country)[0].alpha_2
        except LookupError:
            return None  # if pycountry can't find a match
    else:
        return None
clean_df["iso_country"] = clean_df['countries'].apply(findISOCode)
clean_df.head()

Unnamed: 0,person,person_label,genders,dobs,countries,continents,hasMiddleName,hasNoLastName,person_label_norm,hasDifficultName,difficult_reason,surName,lastName,iso_country
0,http://www.wikidata.org/entity/Q100065772,Clara Benson,female,2000-08-19T00:00:00Z,Ghana,Africa,False,False,Clara Benson,False,,Clara,Benson,GH
1,http://www.wikidata.org/entity/Q100136959,Esther Ruth Mbabazi,female,1995-01-01T00:00:00Z,Uganda,Africa,True,False,Esther Ruth Mbabazi,True,middleName,,,UG
2,http://www.wikidata.org/entity/Q100146654,Dalia Ziada,female,1982-01-01T00:00:00Z,Egypt,Africa,False,False,Dalia Ziada,False,,Dalia,Ziada,EG
3,http://www.wikidata.org/entity/Q100152601,Fatou Haidara,female,1962-01-01T00:00:00Z,Mali,Africa,False,False,Fatou Haidara,False,,Fatou,Haidara,ML
4,http://www.wikidata.org/entity/Q100165910,Claude Haffner,female,1976-01-01T00:00:00Z,Democratic Republic of the Congo,Africa,False,False,Claude Haffner,False,,Claude,Haffner,


In [80]:
df_na_isoCountry = clean_df[clean_df['iso_country'].isna()]
print(f"Total number of missing ISO codes : {df_na_isoCountry['countries'].value_counts().sum()}")
df_na_isoCountry['countries'].value_counts()

Total number of missing ISO codes : 333


countries
Soviet Union                        119
Turkey                               50
Democratic Republic of the Congo     25
Ivory Coast                          21
Cape Verde                           16
Empire of Japan                      13
Kingdom of Italy                     13
United Arab Republic                 10
German Democratic Republic           10
Kingdom of Nejd and Hejaz             8
Ottoman Empire                        8
British Raj                           6
German Reich                          6
French protectorate of Tunisia        5
Kingdom of Yugoslavia                 3
Yugoslavia                            3
Czechoslovakia                        3
Kingdom of Egypt                      2
Cisleithania                          2
French mandate of Lebanon             2
Somaliland                            1
British Hong Kong                     1
Taiwan under Japanese rule            1
Weimar Republic                       1
Colony of Jamaica             

It appears that many of the missing ISO codes belong to countries that no longer exist (Soviet Union), or whose name is written in a way that isn't recognized by the ISO norm (Turkey, now spelled Türkiye). 

Given that the services do NOT require a country code to return a result, it is acceptable not to handle those missing ISO codes and to simply leave those fields empty. 
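
Should some of these codes be wanted later, one option would be a small manual override table for present-day countries. The sketch below is not part of the pipeline: `ISO_OVERRIDES` and `iso_with_override` are hypothetical names, and historical states such as the Soviet Union are deliberately left unmapped.

```python
from typing import Optional

# Hypothetical manual overrides for present-day countries from the list above.
# The values are the ISO 3166-1 alpha-2 codes of these countries.
ISO_OVERRIDES = {
    "Turkey": "TR",
    "Democratic Republic of the Congo": "CD",
    "Ivory Coast": "CI",
    "Cape Verde": "CV",
}


def iso_with_override(country: str, looked_up: Optional[str]) -> Optional[str]:
    """Prefer the looked-up code; fall back to the manual override table."""
    if looked_up is not None:
        return looked_up
    return ISO_OVERRIDES.get(country)


print(iso_with_override("Turkey", None))        # TR
print(iso_with_override("Soviet Union", None))  # None
print(iso_with_override("Ghana", "GH"))         # GH
```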

## Curating continent
To align both datasets and ensure they contain matching data, we use the same approach to curate the continents. This means using the same file to derive a continent from the ISO code obtained here. 

In [81]:
code_to_continent_table = pd.read_csv('../../data/rawData/countryCodesToContinent/country-and-continent-codes-list.csv')

In [82]:
def findContinent(row: pd.Series):
    kaggleCode = row['iso_country']
    # Rows without an ISO code fall through and end up with an empty continent.
    if kaggleCode is not None:
        try:
            return code_to_continent_table[code_to_continent_table['Two_Letter_Country_Code']==kaggleCode].iloc[0]['Continent_Name']
        except IndexError:  # no row in the lookup table matches this code
            print(type(kaggleCode), kaggleCode)
            return None
    return None

clean_df['continents'] = clean_df.apply(findContinent, axis=1)
# The lookup fails for Namibia ('NA'), so its continent is set by hand.
clean_df.loc[clean_df['iso_country'] == 'NA', 'continents'] = 'Africa'

<class 'str'> NA
<class 'str'> NA
<class 'str'> NA
<class 'str'> NA
<class 'str'> NA
<class 'str'> NA
<class 'str'> NA
<class 'str'> NA
<class 'str'> NA
<class 'str'> NA
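
The `NA` lines printed above are most likely explained by how `pd.read_csv` parses the lookup table: by default, the string `NA` is read as a missing value, so Namibia's code would never match. Whether the real file behaves this way is an assumption; the sketch below only shows the behaviour on an invented two-row excerpt that reuses the column names from this notebook.

```python
import io

import pandas as pd

# Two-row stand-in for the lookup table (invented excerpt).
csv_text = "Continent_Name,Two_Letter_Country_Code\nAfrica,NA\nAfrica,GH\n"

# Default parsing: the string "NA" becomes a missing value.
default_read = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False, "NA" is kept as a literal string.
literal_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read["Two_Letter_Country_Code"].isna().sum())  # 1
print(literal_read["Two_Letter_Country_Code"].tolist())      # ['NA', 'GH']
```

Reading the table with `keep_default_na=False` would be an alternative to patching Namibia's continent by hand, at the cost of also keeping any genuinely empty cells as empty strings.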


In [83]:
clean_df.head()

Unnamed: 0,person,person_label,genders,dobs,countries,continents,hasMiddleName,hasNoLastName,person_label_norm,hasDifficultName,difficult_reason,surName,lastName,iso_country
0,http://www.wikidata.org/entity/Q100065772,Clara Benson,female,2000-08-19T00:00:00Z,Ghana,Africa,False,False,Clara Benson,False,,Clara,Benson,GH
1,http://www.wikidata.org/entity/Q100136959,Esther Ruth Mbabazi,female,1995-01-01T00:00:00Z,Uganda,Africa,True,False,Esther Ruth Mbabazi,True,middleName,,,UG
2,http://www.wikidata.org/entity/Q100146654,Dalia Ziada,female,1982-01-01T00:00:00Z,Egypt,Africa,False,False,Dalia Ziada,False,,Dalia,Ziada,EG
3,http://www.wikidata.org/entity/Q100152601,Fatou Haidara,female,1962-01-01T00:00:00Z,Mali,Africa,False,False,Fatou Haidara,False,,Fatou,Haidara,ML
4,http://www.wikidata.org/entity/Q100165910,Claude Haffner,female,1976-01-01T00:00:00Z,Democratic Republic of the Congo,,False,False,Claude Haffner,False,,Claude,Haffner,


## Resetting index

In [84]:
# reset_index returns a new DataFrame, so the result must be assigned back
clean_df = clean_df.reset_index(drop=True)
clean_df

Unnamed: 0,person,person_label,genders,dobs,countries,continents,hasMiddleName,hasNoLastName,person_label_norm,hasDifficultName,difficult_reason,surName,lastName,iso_country
0,http://www.wikidata.org/entity/Q100065772,Clara Benson,female,2000-08-19T00:00:00Z,Ghana,Africa,False,False,Clara Benson,False,,Clara,Benson,GH
1,http://www.wikidata.org/entity/Q100136959,Esther Ruth Mbabazi,female,1995-01-01T00:00:00Z,Uganda,Africa,True,False,Esther Ruth Mbabazi,True,middleName,,,UG
2,http://www.wikidata.org/entity/Q100146654,Dalia Ziada,female,1982-01-01T00:00:00Z,Egypt,Africa,False,False,Dalia Ziada,False,,Dalia,Ziada,EG
3,http://www.wikidata.org/entity/Q100152601,Fatou Haidara,female,1962-01-01T00:00:00Z,Mali,Africa,False,False,Fatou Haidara,False,,Fatou,Haidara,ML
4,http://www.wikidata.org/entity/Q100165910,Claude Haffner,female,1976-01-01T00:00:00Z,Democratic Republic of the Congo,,False,False,Claude Haffner,False,,Claude,Haffner,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4869,http://www.wikidata.org/entity/Q102012734,Iranildo Conceição Espíndola,male,1969-01-24T00:00:00Z,Brazil,South America,True,False,Iranildo Conceição Espíndola,True,middleName|non_ascii,,,BR
4870,http://www.wikidata.org/entity/Q102035839,Wrays Pérez,male,1961-04-18T00:00:00Z,Peru,South America,False,False,Wrays Pérez,True,non_ascii,Wrays,Pérez,PE
4871,http://www.wikidata.org/entity/Q102035885,João Luís Barbosa,male,1962-05-20T00:00:00Z,Brazil,South America,True,False,João Luís Barbosa,True,middleName|non_ascii,,,BR
4872,http://www.wikidata.org/entity/Q102035903,Jorginho,male,1962-05-19T00:00:00Z,Brazil,South America,True,True,Jorginho,True,middleName|noLastName,,,BR


# Recording preprocessed dataset

In [85]:
clean_df.to_csv("../../data/Wikidata5k.csv", index=False)

In [86]:
wikidata5k = pd.read_csv('../../data/Wikidata5k.csv')

In [87]:
wikidata5k.shape

(4874, 14)

In [89]:
wikidata5k.head(10)

Unnamed: 0,person,person_label,genders,dobs,countries,continents,hasMiddleName,hasNoLastName,person_label_norm,hasDifficultName,difficult_reason,surName,lastName,iso_country
0,http://www.wikidata.org/entity/Q100065772,Clara Benson,female,2000-08-19T00:00:00Z,Ghana,Africa,False,False,Clara Benson,False,,Clara,Benson,GH
1,http://www.wikidata.org/entity/Q100136959,Esther Ruth Mbabazi,female,1995-01-01T00:00:00Z,Uganda,Africa,True,False,Esther Ruth Mbabazi,True,middleName,,,UG
2,http://www.wikidata.org/entity/Q100146654,Dalia Ziada,female,1982-01-01T00:00:00Z,Egypt,Africa,False,False,Dalia Ziada,False,,Dalia,Ziada,EG
3,http://www.wikidata.org/entity/Q100152601,Fatou Haidara,female,1962-01-01T00:00:00Z,Mali,Africa,False,False,Fatou Haidara,False,,Fatou,Haidara,ML
4,http://www.wikidata.org/entity/Q100165910,Claude Haffner,female,1976-01-01T00:00:00Z,Democratic Republic of the Congo,,False,False,Claude Haffner,False,,Claude,Haffner,
5,http://www.wikidata.org/entity/Q100214040,Bâ Odette Yattara,female,1944-01-01T00:00:00Z,Mali,Africa,True,False,Bâ Odette Yattara,True,middleName|non_ascii,,,ML
6,http://www.wikidata.org/entity/Q100230407,Traoré Oumou Touré,female,1950-01-01T00:00:00Z,Mali,Africa,True,False,Traoré Oumou Touré,True,middleName|non_ascii,,,ML
7,http://www.wikidata.org/entity/Q100233625,Diallo Sène,female,1952-07-01T00:00:00Z,Mali,Africa,False,False,Diallo Sène,True,non_ascii,Diallo,Sène,ML
8,http://www.wikidata.org/entity/Q100233649,Fatoumata Sylla,female,1954-12-11T00:00:00Z,Mali,Africa,False,False,Fatoumata Sylla,False,,Fatoumata,Sylla,ML
9,http://www.wikidata.org/entity/Q100233714,Maïga Zeïnab Mint Youba,female,1955-11-30T00:00:00Z,Mali,Africa,True,False,Maïga Zeïnab Mint Youba,True,middleName|non_ascii,,,ML
