The goal of this notebook is to produce a single master dataset that combines all the smaller datasets. Work on the quality of the service can then rely on one dataset instead of requiring a separate service handler for each source, which simplifies the process.

In [61]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Fetching the datasets

In [2]:
wikidata_df = pd.read_csv("../../data/Wikidata5k.csv")
kaggleOlympic_df = pd.read_csv("../../data/KaggleOlympic2024.csv")

# Noting the source

In [3]:
wikidata_df['source'] = 'wikidata'
kaggleOlympic_df['source'] = 'kaggleOlympic'


In [4]:
print(wikidata_df.columns)

Index(['person', 'person_label', 'genders', 'dobs', 'countries', 'continents',
       'hasMiddleName', 'hasNoLastName', 'person_label_norm',
       'hasDifficultName', 'difficult_reason', 'surName', 'lastName',
       'iso_country', 'source'],
      dtype='object')


In [5]:
print(kaggleOlympic_df.columns)

Index(['name', 'name_short', 'name_tv', 'gender', 'function', 'country_code',
       'country', 'country_long', 'nationality', 'nationality_long',
       'nationality_code', 'birth_date', 'birth_place', 'birth_country',
       'residence_place', 'residence_country', 'lang', 'iso_country',
       'continents', 'hasMiddleName', 'hasNoLastName', 'person_label_norm',
       'hasDifficultName', 'difficult_reason', 'lastName', 'surName',
       'source'],
      dtype='object')


From this, we can see that the columns of the two datasets correspond as follows:

| Wikidata      | KaggleOlympic |
|---------------|---------------|
| person_label  | name          |
| surName       | surName       |
| lastName      | lastName      |
| genders       | gender        |
| iso_country   | iso_country   |
| continents    | continents    |
| hasMiddleName | hasMiddleName |
| hasNoLastName | hasNoLastName |
| dobs          | birth_date    |
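
Before relying on this correspondence, it can help to confirm that every linked column actually exists in both frames. The following is a minimal sketch, not part of the original pipeline: `column_links`, `missing_links` and the two demo frames are hypothetical names introduced here, and the demo frames are empty stand-ins so the example runs without the CSV files.

```python
import pandas as pd

# Hypothetical mapping mirroring the table above: Wikidata column -> KaggleOlympic column.
column_links = {
    "person_label": "name",
    "surName": "surName",
    "lastName": "lastName",
    "genders": "gender",
    "iso_country": "iso_country",
    "continents": "continents",
    "hasMiddleName": "hasMiddleName",
    "hasNoLastName": "hasNoLastName",
    "dobs": "birth_date",
}


def missing_links(left: pd.DataFrame, right: pd.DataFrame, links: dict) -> list:
    """Return the (left, right) column pairs where at least one side is absent."""
    return [
        (left_col, right_col)
        for left_col, right_col in links.items()
        if left_col not in left.columns or right_col not in right.columns
    ]


# Empty stand-in frames (columns only); the right one deliberately lacks "birth_date".
left_demo = pd.DataFrame(columns=list(column_links.keys()))
right_demo = pd.DataFrame(
    columns=[col for col in column_links.values() if col != "birth_date"]
)

print(missing_links(left_demo, right_demo, column_links))
```

With the real data, the call would presumably be `missing_links(wikidata_df, kaggleOlympic_df, column_links)`, and an empty list would mean every link in the table is usable.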

# Curating the genders

In [6]:
print(kaggleOlympic_df['gender'].unique())
print(wikidata_df['genders'].unique())

['Male' 'Female']
['female' 'male']


In [7]:
gender_map = {
    'Male':'male',
    'Female':'female'
}
kaggleOlympic_df["gender"] = kaggleOlympic_df["gender"].map(gender_map)
print(kaggleOlympic_df['gender'].unique())

['male' 'female']
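
One property of `Series.map` with a dictionary is worth keeping in mind: any value that is not a key of the dictionary becomes `NaN`. That is harmless here because the column only contained `'Male'` and `'Female'`, but the sketch below (on made-up values, not the real data) shows the difference from lowercasing with `str.lower()`, which is a possible alternative if other spellings ever appear.

```python
import pandas as pd

# Made-up values for illustration: one already-lowercase entry and one missing entry.
demo = pd.Series(["Male", "Female", "male", None])

gender_map = {"Male": "male", "Female": "female"}

# map() only translates exact keys; "male" and None both end up as NaN.
mapped = demo.map(gender_map)

# str.lower() normalises the case and leaves only the genuinely missing value empty.
lowered = demo.str.lower()

print(mapped.isna().sum())   # values lost or missing after map()
print(lowered.isna().sum())  # values missing after str.lower()
```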


# Cleaning up the years and date of birth

In [8]:
wikidata_df["dobs"] = pd.to_datetime(wikidata_df["dobs"], errors="coerce")
wikidata_df["birth_year"] = wikidata_df["dobs"].dt.year

kaggleOlympic_df["birth_date"] = pd.to_datetime(kaggleOlympic_df["birth_date"], errors="coerce")
kaggleOlympic_df["birth_year"] = kaggleOlympic_df["birth_date"].dt.year



In [9]:
kaggleOlympic_df.head()

Unnamed: 0,name,name_short,name_tv,gender,function,country_code,country,country_long,nationality,nationality_long,...,continents,hasMiddleName,hasNoLastName,person_label_norm,hasDifficultName,difficult_reason,lastName,surName,source,birth_year
0,ALEKSANYAN Artur,ALEKSANYAN A,Artur ALEKSANYAN,male,Athlete,ARM,Armenia,Armenia,Armenia,Armenia,...,Europe,False,False,ALEKSANYAN Artur,False,,ALEKSANYAN,Artur,kaggleOlympic,1991
1,AMOYAN Malkhas,AMOYAN M,Malkhas AMOYAN,male,Athlete,ARM,Armenia,Armenia,Armenia,Armenia,...,Europe,False,False,AMOYAN Malkhas,False,,AMOYAN,Malkhas,kaggleOlympic,1999
2,GALSTYAN Slavik,GALSTYAN S,Slavik GALSTYAN,male,Athlete,ARM,Armenia,Armenia,Armenia,Armenia,...,Europe,False,False,GALSTYAN Slavik,False,,GALSTYAN,Slavik,kaggleOlympic,1996
3,HARUTYUNYAN Arsen,HARUTYUNYAN A,Arsen HARUTYUNYAN,male,Athlete,ARM,Armenia,Armenia,Armenia,Armenia,...,Europe,False,False,HARUTYUNYAN Arsen,False,,HARUTYUNYAN,Arsen,kaggleOlympic,1999
4,TEVANYAN Vazgen,TEVANYAN V,Vazgen TEVANYAN,male,Athlete,ARM,Armenia,Armenia,Armenia,Armenia,...,Europe,False,False,TEVANYAN Vazgen,False,,TEVANYAN,Vazgen,kaggleOlympic,1999
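
With `errors="coerce"`, any value that cannot be parsed becomes `NaT` instead of raising an error, so the year extracted from it is missing as well. The sketch below uses made-up values to show this. The explicit `format="%Y-%m-%d"` and the cast to the nullable `Int64` dtype are assumptions added for illustration; the real files may use a different date layout.

```python
import pandas as pd

# Made-up values: one valid date, one unparseable string, one missing entry.
raw = pd.Series(["1991-10-21", "not a date", None])

# Unparseable and missing entries are coerced to NaT rather than raising.
parsed = pd.to_datetime(raw, errors="coerce", format="%Y-%m-%d")

# The nullable Int64 dtype keeps whole-number years alongside missing values.
years = parsed.dt.year.astype("Int64")

print(years)
print("rows without a usable birth year:", int(years.isna().sum()))
```

Counting the `NaT` values after parsing is a cheap way to see how many rows will lack a birth year in the master dataset.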


# Determining the final structure
By now, the two datasets are in a compatible state and can be linked. What remains to be decided is the final structure and naming convention of the master dataset. I propose the following:

- ID
- fullName
- firstName
- lastName
- gender
- isoCountry
- continent
- birthYear
- hasMiddleName
- hasNoLastName
- source

This list is the basis for renaming the original columns of each dataset.
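
One way to apply this structure uniformly is to wrap the rename and the column selection in a single helper that fails loudly when a core column is missing. This is only a sketch: `CORE_COLUMNS`, `to_core`, `kaggle_like_map` and the one-row demo frame are hypothetical, and `ID` is left out because it is not among the columns selected when the sets are joined.

```python
import pandas as pd

# Proposed master columns (without ID, which is not selected when the sets are joined).
CORE_COLUMNS = [
    "fullName", "firstName", "lastName", "gender", "isoCountry",
    "continent", "birthYear", "hasMiddleName", "hasNoLastName", "source",
]


def to_core(df: pd.DataFrame, rename_map: dict) -> pd.DataFrame:
    """Rename a source frame to the master naming scheme and keep only the core columns."""
    renamed = df.rename(columns=rename_map)
    missing = [col for col in CORE_COLUMNS if col not in renamed.columns]
    if missing:
        raise KeyError(f"missing core columns after renaming: {missing}")
    return renamed[CORE_COLUMNS]


# Hypothetical rename map for a frame shaped like the Kaggle one.
kaggle_like_map = {
    "name": "fullName",
    "surName": "firstName",
    "iso_country": "isoCountry",
    "continents": "continent",
    "birth_year": "birthYear",
}

# One made-up row, including an extra column ("lang") that should be dropped.
demo = pd.DataFrame({
    "name": ["DOE John"], "surName": ["John"], "lastName": ["DOE"],
    "gender": ["male"], "iso_country": ["US"], "continents": ["North America"],
    "birth_year": [1990], "hasMiddleName": [False], "hasNoLastName": [False],
    "source": ["demo"], "lang": ["en"],
})

core = to_core(demo, kaggle_like_map)
print(list(core.columns))
```

Only the columns that change name need to appear in the map; columns that already follow the naming scheme pass through `rename` untouched.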

# Renaming wikidata

In [18]:
# Renaming Wikidata
wikidata_df_renamed = wikidata_df.rename(columns={
    'person_label':'fullName',
    'surName': 'firstName',
    'lastName': 'lastName',
    'genders' : 'gender',
    'iso_country' : 'isoCountry',
    'continents' : 'continent',
    'birth_year' : 'birthYear',
    'hasMiddleName' : 'hasMiddleName',
    'hasNoLastName' : 'hasNoLastName',
    'source' : 'source'
})

In [19]:
wikidata_df_renamed.columns

Index(['person', 'fullName', 'gender', 'dobs', 'countries', 'continent',
       'hasMiddleName', 'hasNoLastName', 'person_label_norm',
       'hasDifficultName', 'difficult_reason', 'firstName', 'lastName',
       'isoCountry', 'source', 'birthYear'],
      dtype='object')

# Renaming Kaggle Olympic

In [21]:
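# `surName` holds the given name here (e.g. "Artur" in the head() output above), so it maps to `firstName`.
kaggleOlympic_df_renamed = kaggleOlympic_df.rename(columns={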
    'name':'fullName',
    'surName': 'firstName',
    'lastName': 'lastName',
    'gender' : 'gender',
    'iso_country' : 'isoCountry',
    'continents' : 'continent',
    'birth_year' : 'birthYear',
    'hasMiddleName' : 'hasMiddleName',
    'hasNoLastName' : 'hasNoLastName',
    'source' : 'source'
})

In [22]:
kaggleOlympic_df_renamed.columns

Index(['fullName', 'name_short', 'name_tv', 'gender', 'function',
       'country_code', 'country', 'country_long', 'nationality',
       'nationality_long', 'nationality_code', 'birth_date', 'birth_place',
       'birth_country', 'residence_place', 'residence_country', 'lang',
       'isoCountry', 'continent', 'hasMiddleName', 'hasNoLastName',
       'person_label_norm', 'hasDifficultName', 'difficult_reason', 'lastName',
       'firstName', 'source', 'birthYear'],
      dtype='object')

# Joining the sets

In [23]:
core_columns = [
    "fullName",
    "firstName",
    "lastName",
    "gender",
    "isoCountry",
    "continent",
    "birthYear",
    "hasMiddleName",
    "hasNoLastName",
    "source",
]
wikidata_core = wikidata_df_renamed[core_columns]
kaggle_core = kaggleOlympic_df_renamed[core_columns]

In [24]:
# Note: the name `all` shadows Python's built-in all() for the rest of the notebook.
all = pd.concat([wikidata_core, kaggle_core], ignore_index=True)

In [28]:
print(all.columns)
print(all.describe(exclude=np.number))

Index(['fullName', 'firstName', 'lastName', 'gender', 'isoCountry',
       'continent', 'birthYear', 'hasMiddleName', 'hasNoLastName', 'source'],
      dtype='object')
             fullName firstName lastName gender isoCountry continent  \
count           15987     12268    12268  15987      15640     15987   
unique          15973      6309    10247      2        204         6   
top     WATANABE Yuta    Daniel     WANG   male         US    Europe   
freq                2        51       40   8069       1457      6230   

       hasMiddleName hasNoLastName         source  
count          15987         15987          15987  
unique             2             2              2  
top            False         False  kaggleOlympic  
freq           13423         15878          11113  
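
The `count` row above is lower for `firstName`, `lastName` and `isoCountry` than for the other columns, which means those columns contain missing values. A possible way to see which source they come from is to count missing values per source, as sketched below on a made-up frame (the names and values are illustrative, not taken from the real data).

```python
import pandas as pd

# Made-up rows standing in for the concatenated dataset.
demo = pd.DataFrame({
    "firstName": ["Ada", None, "Artur", None],
    "isoCountry": ["GB", "US", None, "AM"],
    "source": ["wikidata", "wikidata", "kaggleOlympic", "kaggleOlympic"],
})

# isna() gives a boolean frame; grouping it by source and summing counts the gaps.
missing_by_source = demo.drop(columns="source").isna().groupby(demo["source"]).sum()

print(missing_by_source)
```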


# Checking for duplication in the new master set

When checking for duplicates, it is important to look beyond the full name and also consider additional information such as the country and the birth year. As shown below, 28 rows share their full name with at least one other row (14 pairs at most). Requiring the country to match as well reduces this to 14 rows (7 pairs at most), and requiring the birth year to match reduces it to 0.

Since the combination of fullName + country + birth year has no duplicates, every row is unique under that key, and there is no need to remove the rows flagged by the weaker combinations.

In [59]:
# keep=False flags every row of a duplicate group, so each count is a number of rows, not of groups.
print(all.duplicated(subset=['fullName'], keep=False).sum())
print(all.duplicated(subset=['fullName', 'isoCountry'], keep=False).sum())
print(all.duplicated(subset=['fullName', 'birthYear'], keep=False).sum())
print(all.duplicated(subset=['fullName', 'isoCountry', 'birthYear'], keep=False).sum())

28
14
0
0
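
To look at the flagged rows themselves rather than only counting them, the same mask can be used to filter and sort the frame. The sketch below runs on a small made-up frame; `duplicate_rows` is a hypothetical helper and the names and values are invented for the example.

```python
import pandas as pd

# Made-up rows: two namesakes from the same country, two from different countries.
demo = pd.DataFrame({
    "fullName": ["DOE John", "DOE John", "ROE Ann", "ROE Ann", "POE Sam"],
    "isoCountry": ["US", "US", "CA", "GB", "AU"],
    "birthYear": [1994, 1997, 1990, 1990, 1988],
})


def duplicate_rows(df: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Return every row whose values in `keys` also occur in another row, grouped together."""
    mask = df.duplicated(subset=keys, keep=False)
    return df[mask].sort_values(keys)


for keys in (
    ["fullName"],
    ["fullName", "isoCountry"],
    ["fullName", "isoCountry", "birthYear"],
):
    print(keys, len(duplicate_rows(demo, keys)))
```

As in the notebook, each extra key narrows the set of flagged rows: namesakes stop counting as duplicates once their country or birth year differs.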


# Recording the master Dataset

In [64]:
all.to_csv("../../data/all.csv")
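
The proposed structure lists an `ID` column, but none is created before saving, and `to_csv` writes the row index as an unnamed first column by default. One possible way to get a named identifier, assuming the positional index is acceptable as an ID, is the `index_label` parameter. The sketch below writes a made-up frame to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Made-up rows standing in for the master dataset.
demo = pd.DataFrame({
    "fullName": ["DOE John", "ROE Ann"],
    "source": ["wikidata", "kaggleOlympic"],
})

# index_label names the index column in the output; here it becomes "ID".
buffer = io.StringIO()
demo.to_csv(buffer, index_label="ID")

header = buffer.getvalue().splitlines()[0]
print(header)
```

When reading the file back, `pd.read_csv(path, index_col="ID")` would restore that column as the index rather than as an `Unnamed: 0` column.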