This notebook preprocesses the Wikidata dataset.

# Description of the dataset

The Wikidata dataset consists of 10 smaller subsets, each containing the names (person_label), dates of birth (dobs), countries (countries) and continents (continents) of 500 people from a given continent and gender. It also contains the Wikidata identifier of each person (person).

The continents included are:
- Africa
- Asia
- Europe
- North America
- South America

The genders included are:
- Male
- Female

These two categories include anyone tagged as male, female, cisgender male or cisgender female.
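
The grouping of tags into two categories can be sketched as a simple lookup. This is only an illustration: the label strings in `GENDER_GROUPS` are taken from the sentence above and are an assumption about what the `genders` column actually contains, and `group_gender` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Assumed raw labels; the actual strings in the `genders` column may differ.
GENDER_GROUPS = {
    "male": "Male",
    "cisgender male": "Male",
    "female": "Female",
    "cisgender female": "Female",
}

def group_gender(labels: pd.Series) -> pd.Series:
    """Map raw gender labels to the two categories; unknown labels become NaN."""
    return labels.str.strip().str.lower().map(GENDER_GROUPS)

example = pd.Series(["male", "Cisgender female", "female", "cisgender male"])
grouped = group_gender(example)
print(grouped.tolist())
```

Labels that are not in the lookup map to NaN, which makes unexpected tags easy to spot with `isna()`.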

When fetching data from Wikidata, dates of birth were restricted to the range 1900 to 2025 to exclude entries whose date of birth was not filled in.
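
A quick way to verify that range after loading is sketched below. It assumes that the `dobs` values are date strings that `pd.to_datetime` can parse and that both bounds are inclusive; neither is confirmed by the notebook. `dob_in_range` is a hypothetical helper and the example values are made up.

```python
import pandas as pd

def dob_in_range(dobs: pd.Series, first_year: int = 1900, last_year: int = 2025) -> pd.Series:
    """Boolean mask: True where the birth year lies in [first_year, last_year]."""
    # errors="coerce" turns unparseable values into NaT, which then fail the check.
    years = pd.to_datetime(dobs, errors="coerce", utc=True).dt.year
    return years.between(first_year, last_year)

# Made-up values in an assumed ISO-like format
example = pd.Series(["1950-03-14T00:00:00Z", "1899-12-31T00:00:00Z", "not a date"])
mask = dob_in_range(example)
print(mask.tolist())
```

Applied to the real data, `(~dob_in_range(df["dobs"])).sum()` would count the rows that fall outside the expected window or fail to parse.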

# Preprocessing tasks

The 10 subsets need to be checked for duplicates. Some of the people included have multiple countries and continents of citizenship, so they may appear several times across different queries. Duplicates are identified by Wikidata identifier.
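
Dropping duplicates keeps a single row per person, which also discards the person's other countries and continents. If that information matters, one alternative is to collapse the rows instead of dropping them. The sketch below assumes the column names described above; `collect_citizenships`, the `|` separator and the toy rows are illustrative choices, not something the notebook does.

```python
import pandas as pd

def collect_citizenships(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse rows to one per `person`, joining distinct countries and continents."""
    def join_unique(values: pd.Series) -> str:
        # dict.fromkeys drops repeats while keeping first-seen order
        return "|".join(dict.fromkeys(values.astype(str)))

    return (
        df.groupby("person", sort=False)
          .agg(
              person_label=("person_label", "first"),
              countries=("countries", join_unique),
              continents=("continents", join_unique),
          )
          .reset_index()
    )

# Toy rows with placeholder identifiers
toy = pd.DataFrame({
    "person": ["Q1", "Q2", "Q1"],
    "person_label": ["A B", "C D", "A B"],
    "countries": ["France", "Brazil", "Algeria"],
    "continents": ["Europe", "South America", "Africa"],
})
merged = collect_citizenships(toy)
print(merged)
```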

Once deduplicated, the dataset will need to be processed manually to identify the first, middle and last name(s) of each person.

In [30]:
import numpy as np
import pandas as pd
from pathlib import Path

In [31]:
data_dir = Path("../../data/rawData/")

csv_files = list(data_dir.glob("*.csv"))

dfs = [pd.read_csv(f) for f in csv_files]

big_df = pd.concat(dfs, ignore_index=True)

print(f"Combined {len(csv_files)} files.")
print(f"Big dataset shape: {big_df.shape}")

Combined 11 files.
Big dataset shape: (9881, 7)


In [32]:
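# Keep the first row for each Wikidata identifier.
# .copy() makes clean_df an independent DataFrame, so adding columns later
# does not trigger pandas' SettingWithCopyWarning.
clean_df = big_df.drop_duplicates(subset=['person'], keep='first').copy()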

In [33]:
print("Before cleaning:", big_df.shape)
print("After cleaning :", clean_df.shape)

duplicate_count = big_df.shape[0] - clean_df.shape[0]
duplicate_ratio = duplicate_count / big_df.shape[0] * 100
print(f"Removed {duplicate_count} duplicates ({duplicate_ratio:.2f}% of rows).")

Before cleaning: (9881, 7)
After cleaning : (4881, 7)
Removed 5000 duplicates (50.60% of rows).


In [34]:
print("Unique persons:", clean_df['person'].nunique())
print("Rows in clean_df:", clean_df.shape[0])

Unique persons: 4881
Rows in clean_df: 4881


In [35]:
missing_summary = clean_df.isnull().mean().sort_values(ascending=False)
print("Fraction of missing values per column:")
print(missing_summary)

Fraction of missing values per column:
Unnamed: 0      0.0
person          0.0
person_label    0.0
genders         0.0
dobs            0.0
countries       0.0
continents      0.0
dtype: float64


In [36]:
print(clean_df[['continents', 'countries', 'person_label']].value_counts(normalize=False).head(20))

continents     countries  person_label            
Europe         Poland     Andrzej Kowalski            5
               Hungary    Tibor Flórián               2
               Italy      Lorenzo Casini              2
South America  Brazil     Fabio Ribeiro               2
               Venezuela  José Urriola                1
                          Jhonnatan Medina-Álvarez    1
Africa         Algeria    Isma Kaddouri               1
South America  Venezuela  Victor Luces                1
Africa         Algeria    Abdel Medioub               1
                          Asma Guesmi                 1
                          Fatima Hellilou             1
                          Khaled Benaissa             1
                          Leila Beratto               1
                          Mohamed Esseghir            1
                          Mohamed Zerguini            1
                          Myriam Belkiri              1
                          Ratiba Derfoul             

In this part, we explore how much curation the names need.

Curation, in this context, means one or both of the following, in any order:
- Separating first, middle and last names
- Handling potentially problematic characters, such as hyphenated names (-), names with apostrophes ('), and so on
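
To make the first task concrete, here is a minimal sketch of a naive whitespace split. It is only a baseline for judging how many labels need manual attention, not the curation procedure itself: `naive_split` is a hypothetical helper, the third label is made up, and treating the last token as the last name is an assumption that fails for many naming conventions.

```python
import pandas as pd

def naive_split(label: str) -> dict:
    """Split on whitespace: first token, last token, everything between as middle."""
    tokens = label.split()
    if not tokens:
        return {"first": "", "middle": "", "last": ""}
    if len(tokens) == 1:
        return {"first": tokens[0], "middle": "", "last": ""}
    return {
        "first": tokens[0],
        "middle": " ".join(tokens[1:-1]),
        "last": tokens[-1],
    }

# The first two labels appear in the dataset; the third is invented for illustration.
labels = pd.Series(["Lorenzo Casini", "Jhonnatan Medina-Álvarez", "Ana Maria Silva"])
parts = pd.DataFrame(labels.map(naive_split).tolist())
print(parts)
```

Hyphenated surnames stay in one piece under this split, but multi-word surnames and particles would be misassigned, which is why the flags computed below are useful for routing rows to manual review.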

In [41]:
middle_name_mask = clean_df['person_label'].str.count(" ") >= 2
middle_name_count = middle_name_mask.sum()

print(f"Total rows: {clean_df.shape[0]}")
print(f"Potential middle names: {middle_name_count}")
print(f"Percentage: {middle_name_count / clean_df.shape[0] * 100:.2f}%")

clean_df['hasMiddleName'] = middle_name_mask

Total rows: 4881
Potential middle names: 882
Percentage: 18.07%




In [45]:
no_last_name_mask = clean_df['person_label'].str.count(" ") == 0
no_last_name_count = no_last_name_mask.sum()

print(f"Total rows: {clean_df.shape[0]}")
print(f"Potential single-word names: {no_last_name_count}")
print(f"Percentage: {no_last_name_count / clean_df.shape[0] * 100:.2f}%")

clean_df['hasNoLastName'] = no_last_name_mask

Total rows: 4881
Potential single-word names: 55
Percentage: 1.13%




In [None]:
import pandas as pd
import re
import unicodedata

# 1) Keep the original text; add a normalized helper column for consistent tests
def normalize_name(s):
    if pd.isna(s):
        return s
    s = unicodedata.normalize("NFKC", str(s))  # compatibility forms, e.g. non-breaking space -> plain space
    s = s.strip()
    s = re.sub(r"\s+", " ", s)                # collapse internal whitespace
    return s

clean_df["person_label_norm"] = clean_df["person_label"].map(normalize_name)

# 2) Build masks (vectorized)
s_raw  = clean_df["person_label"].astype(str)
s_norm = clean_df["person_label_norm"].astype(str)

hyphen_mask       = s_norm.str.contains(r"-", na=False)
apostrophe_mask   = s_norm.str.contains(r"['\u2019\u02BC]", na=False)  # ', ’, ʼ
punct_mask        = s_norm.str.contains(r"[.,/&(){}\[\]<>@#?!$%^*_=+\\]", na=False)
digit_mask        = s_norm.str.contains(r"\d", na=False)
# Initials like "J."; the group is non-capturing so pandas does not warn about match groups
initial_mask      = s_norm.str.contains(r"(?:^|\s)[A-Za-z]\.", na=False)
# Non-ASCII: flag any name containing at least one non-ASCII character
non_ascii_mask    = ~s_norm.map(lambda x: x.isascii())

# Whitespace issues measured on the raw (pre-normalization) text
whitespace_mask   = s_raw.str.contains(
    r"^\s|[\u00A0\u2007\u202F]|\s{2,}|\s$", na=False
)

# 3) Combine into a single "hasDifficultName" flag
difficult_mask = (
    hyphen_mask
    | apostrophe_mask
    | punct_mask
    | digit_mask
    | initial_mask
    | non_ascii_mask
    | whitespace_mask
)

clean_df["hasDifficultName"] = difficult_mask

# 4) (Optional) Keep a reason code for explainability/auditing
def reason_row(i):
    reasons = []
    if hyphen_mask.iat[i]:      reasons.append("hyphen")
    if apostrophe_mask.iat[i]:  reasons.append("apostrophe")
    if punct_mask.iat[i]:       reasons.append("punctuation")
    if digit_mask.iat[i]:       reasons.append("digit")
    if initial_mask.iat[i]:     reasons.append("initials")
    if non_ascii_mask.iat[i]:   reasons.append("non_ascii")
    if whitespace_mask.iat[i]:  reasons.append("whitespace")
    return "|".join(reasons)

clean_df["difficult_reason"] = [reason_row(i) for i in range(len(clean_df))]

# 5) (Optional) Quick summary for your report
summary = {
    "total": len(clean_df),
    "difficult_count": difficult_mask.sum(),
    "difficult_pct": 100 * difficult_mask.mean(),
    "by_reason": {
        "hyphen": int(hyphen_mask.sum()),
        "apostrophe": int(apostrophe_mask.sum()),
        "punctuation": int(punct_mask.sum()),
        "digit": int(digit_mask.sum()),
        "initials": int(initial_mask.sum()),
        "non_ascii": int(non_ascii_mask.sum()),
        "whitespace": int(whitespace_mask.sum()),
    },
}
print(summary)

{'total': 4881, 'difficult_count': np.int64(1104), 'difficult_pct': np.float64(22.618315918869083), 'by_reason': {'hyphen': 219, 'apostrophe': 32, 'punctuation': 145, 'digit': 1, 'initials': 131, 'non_ascii': 763, 'whitespace': 4}}



In [49]:
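# Note: this writes into the folder that the loading cell globs for *.csv,
# so a later re-run of the notebook would also read all.csv.
# index=False avoids adding another unnamed index column to the file.
clean_df.to_csv("../../data/rawData/all.csv", index=False)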
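
Since `all.csv` is saved in the same folder that the loading cell globs, a later run would read it together with the subsets. One possible guard is to skip the combined file by name when listing the inputs. This is a sketch under assumptions: `list_subset_files` is a hypothetical helper, and the subset file names in the demo are invented.

```python
import tempfile
from pathlib import Path

def list_subset_files(data_dir: Path, exclude: tuple = ("all.csv",)) -> list:
    """Return the CSV files in data_dir, skipping combined outputs by file name."""
    return sorted(p for p in data_dir.glob("*.csv") if p.name not in exclude)

# Demo in a temporary folder with invented file names
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for name in ("africa_female.csv", "europe_male.csv", "all.csv"):
        (tmp_dir / name).touch()
    found = [p.name for p in list_subset_files(tmp_dir)]

print(found)
```

Writing the combined file to a separate output folder would achieve the same thing without a filter; which option fits better depends on how the project's data directories are organised.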