# Olympic Athletes Data Cleaning Project
## Comprehensive Analysis of Athlete Biographical Data

## Introduction
This project focuses on cleaning and analyzing Olympic athlete biographical data to extract meaningful insights about competitors' physical characteristics, origins, and lifespans. The cleaned dataset will enable further analysis of trends in athlete demographics across different Olympic games and nations.

## Background
The dataset contains biographical information about Olympic athletes including:
- Personal details (names, birth/death dates)
- Physical measurements (height, weight)
- Geographic origins (birth cities, countries, regions)
- National Olympic Committee affiliations

Key challenges addressed:
- Inconsistent measurement formats (height/weight)
- Complex date string parsing
- Geographic location extraction from unstructured text
- Handling missing and malformed data

## Tools I Used
**Data Processing & Cleaning:**
- `Pandas` - Core data manipulation and transformation
- `NumPy` - Numerical operations and conditional logic

**String Processing:**
- Regular expressions - For complex pattern matching in text fields
- String operations - For cleaning and extracting substrings

**Date Handling:**
- `pd.to_datetime` - For converting diverse date formats

**Workflow:**
- Jupyter Notebook - Interactive development and documentation

In [1]:
import pandas as pd, numpy as np, babel as bl, scipy as sp, seaborn as sns
import matplotlib.pyplot as plt  # the plotting API lives in pyplot, not the top-level package
from babel import numbers
from scipy import stats

## Data Loading & Initial Inspection

In [2]:
# Load raw athlete bios data
# read_csv already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed
df_1 = pd.read_csv('/Users/brtelfer/Documents/Python_Data_Projects/*18_Olympics_Data_Cleaning/bios.csv')
df_1.head(3)

Unnamed: 0,Roles,Sex,Full name,Used name,Born,Died,NOC,athlete_id,Measurements,Affiliations,Nick/petnames,Title(s),Other names,Nationality,Original name,Name order
0,Competed in Olympic Games,Male,"François Joseph Marie Antoine ""Jean-François""•...",Jean-François•Blanchy,"12 December 1886 in Bordeaux, Gironde (FRA)","2 October 1960 in Saint-Jean-de-Luz, Pyrénées-...",France,1,,,,,,,,
1,Competed in Olympic Games,Male,Arnaud Benjamin•Boetsch,Arnaud•Boetsch,"1 April 1969 in Meulan, Yvelines (FRA)",,France,2,183 cm / 76 kg,"Racing Club de France, Paris (FRA)",,,,,,
2,Competed in Olympic Games • Administrator,Male,Jean Laurent Robert•Borotra,Jean•Borotra,"13 August 1898 in Biarritz, Pyrénées-Atlantiqu...","17 July 1994 in Arbonne, Pyrénées-Atlantiques ...",France,3,183 cm / 76 kg,"TCP, Paris (FRA)",Le Basque Bondissant (The Bounding Basque),,,,,
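
Beyond `head()`, a quick missing-value summary helps decide which columns need cleaning first. The sketch below is illustrative only: it rebuilds a small `sample` frame from a few of the columns shown in the preview above (the names `sample`, `missing`, and `ids_unique` are assumptions, not part of the original notebook) and counts missing values per column.

```python
import pandas as pd

# Small stand-in for the raw data, using values from the three previewed rows
sample = pd.DataFrame({
    'athlete_id': [1, 2, 3],
    'Used name': ['Jean-François•Blanchy', 'Arnaud•Boetsch', 'Jean•Borotra'],
    'NOC': ['France', 'France', 'France'],
    'Measurements': [None, '183 cm / 76 kg', '183 cm / 76 kg'],
    'Affiliations': [None, 'Racing Club de France, Paris (FRA)', 'TCP, Paris (FRA)'],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    'n_missing': sample.isna().sum(),
    'pct_missing': (sample.isna().mean() * 100).round(1),
})
print(missing)

# Check whether the id column uniquely identifies each row
ids_unique = sample['athlete_id'].is_unique
print('athlete_id unique:', ids_unique)
```

On the full dataset the same two checks would run against `df_1` instead of `sample`.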


## Data Cleaning Process

### 1. Name Standardization

In [3]:
# Create working copy
df = df_1.copy()

# Replace the bullet separator (•) between given and family names with a space
df['Used name'] = df['Used name'].str.replace("•", " ", regex=False)
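
The cell above cleans only `Used name`, while `Full name` uses the same bullet separator. As a minimal sketch, assuming the same treatment is wanted for both columns, the replacement can be applied in a loop. The `names` frame here is a hypothetical stand-in built from the previewed rows.

```python
import pandas as pd

# Hypothetical stand-in for the two name columns
names = pd.DataFrame({
    'Full name': ['Arnaud Benjamin•Boetsch', 'Jean Laurent Robert•Borotra'],
    'Used name': ['Arnaud•Boetsch', 'Jean•Borotra'],
})

for col in ['Full name', 'Used name']:
    # regex=False treats the bullet as a literal character, not a pattern
    names[col] = names[col].str.replace('•', ' ', regex=False).str.strip()

print(names)
```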

### 2. Height & Weight Extraction

In [4]:
# Split "183 cm / 76 kg" into a height part and a weight part
df[['Height_cm','Weight_kg']] = df['Measurements'].str.split('/', n=1, expand=True)

# A lone measurement lands in the first column whatever its unit, so blank out values carrying the wrong unit
df['Height_cm'] = df['Height_cm'].where(~df['Height_cm'].str.contains('kg', na=False))
df['Weight_kg'] = df['Weight_kg'].where(~df['Weight_kg'].str.contains('cm', na=False))

# Strip units and convert to numbers (unparseable values become NaN)
df['Height_cm'] = pd.to_numeric(df['Height_cm'].str.replace('cm', '', regex=False).str.strip(), errors='coerce')
df['Weight_kg'] = pd.to_numeric(df['Weight_kg'].str.replace('kg', '', regex=False).str.strip(), errors='coerce')
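
With the split-based approach, an entry that holds only a weight (for example `76 kg`) ends up in the height column and is discarded. An alternative sketch, not the method used in this notebook, is to extract each value by its unit so a lone measurement is still kept. The sample values below are illustrative, and the lone-measurement rows are hypothetical.

```python
import pandas as pd

# Illustrative inputs; the second and third entries are hypothetical single-measurement cases
measurements = pd.Series(['183 cm / 76 kg', '76 kg', '170 cm', None])

# Capture the number that directly precedes each unit
height_cm = pd.to_numeric(
    measurements.str.extract(r'(\d+(?:\.\d+)?)\s*cm', expand=False), errors='coerce'
)
weight_kg = pd.to_numeric(
    measurements.str.extract(r'(\d+(?:\.\d+)?)\s*kg', expand=False), errors='coerce'
)

print(pd.DataFrame({'Height_cm': height_cm, 'Weight_kg': weight_kg}))
```

Ranges or other free-text values would not match these patterns cleanly and would need their own handling.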

### 3. Date Processing

In [5]:
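# Regex for the date formats at the start of Born/Died: "12 December 1886", a bare year, a leading word,
# or a parenthesised form such as "(c. 1886)". A raw string avoids invalid-escape warnings for \d and \s.
pattern = r'(^\d{1,2}\s\w+\s\d{4}|^\d{4}|^\w+|^\(\w+\s\d{4}\)|^\(\d{4}\s\w+\s\d{4}\)|\(c\.\s\d{4}\))'
df['Birthday'] = df['Born'].str.extract(pattern, expand=False)

# Convert to datetime; format='mixed' requires pandas 2.0+, and unparseable strings become NaT
df['Birthday'] = pd.to_datetime(df['Birthday'], format='mixed', errors='coerce')

# Same process for death dates
df['Deathday'] = df['Died'].str.extract(pattern, expand=False)
df['Deathday'] = pd.to_datetime(df['Deathday'], format='mixed', errors='coerce')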
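
Because `errors='coerce'` silently turns partial dates into `NaT`, it can be useful to keep the year separately. The following is a minimal sketch under that assumption: full dates are parsed with an explicit day-month-year format, and a nullable integer `birth_year` is kept for entries that carry only a year. The year-only row and the variable names are hypothetical.

```python
import pandas as pd

born = pd.Series([
    '12 December 1886 in Bordeaux, Gironde (FRA)',
    '1 April 1969 in Meulan, Yvelines (FRA)',
    '1898',   # hypothetical year-only entry
    None,
])

# Parse only complete "day month year" prefixes, with an explicit format
full_date = born.str.extract(r'^(\d{1,2}\s\w+\s\d{4})', expand=False)
birthday = pd.to_datetime(full_date, format='%d %B %Y', errors='coerce')

# Keep the first four-digit year even when no full date is available
birth_year = pd.to_numeric(
    born.str.extract(r'(\d{4})', expand=False), errors='coerce'
).astype('Int64')

print(pd.DataFrame({'Birthday': birthday, 'birth_year': birth_year}))
```

An explicit format is stricter than `format='mixed'`, so anything that does not match it shows up as `NaT` instead of being guessed.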

### 4. Geographic Data Extraction

In [6]:
# Extract birth location components from strings like "... in Bordeaux, Gironde (FRA)"
df['Born_City'] = df['Born'].str.extract(r'in\s(.*),')
df['Born_Region'] = df['Born'].str.extract(r',\s(.*)\s')       # text between the comma and the code, e.g. Gironde
df['Born_Country'] = df['Born'].str.extract(r'\((\w+)\)$')    # code in the trailing parentheses, e.g. FRA

## Final Data Structure

In [7]:
# Select relevant columns for cleaned dataset
columns_to_keep = ['athlete_id', 'Full name','Birthday', 'Born_City', 'Born_Region', 'Born_Country', 'NOC', 'Height_cm', 'Weight_kg', 'Deathday']
df_c = df[columns_to_keep]

# Display cleaned data structure
df_c.head()

Unnamed: 0,athlete_id,Full name,Birthday,Born_City,Born_Region,Born_Country,NOC,Height_cm,Weight_kg,Deathday
0,1,"François Joseph Marie Antoine ""Jean-François""•...",1886-12-12,Bordeaux,Gironde,FRA,France,,,1960-10-02
1,2,Arnaud Benjamin•Boetsch,1969-04-01,Meulan,Yvelines,FRA,France,183.0,76.0,NaT
2,3,Jean Laurent Robert•Borotra,1898-08-13,Biarritz,Pyrénées-Atlantiques,FRA,France,183.0,76.0,1994-07-17
3,4,Jacques Marie Stanislas Jean•Brugnon,1895-05-11,Paris VIIIe,Paris,FRA,France,168.0,64.0,1978-03-20
4,5,Henry Albert•Canet,1878-04-17,Wandsworth,England,GBR,France,,,1930-07-25
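
The birth-location extraction in step 4 uses three separate patterns. One alternative sketch, offered as an assumption and not as the notebook's method, is a single regex with named groups, so that all three columns come from one match and a row either parses completely or not at all. The sample strings are rebuilt from the cleaned rows above, and the last entry is a hypothetical date with no location.

```python
import pandas as pd

born = pd.Series([
    '12 December 1886 in Bordeaux, Gironde (FRA)',
    '11 May 1895 in Paris VIIIe, Paris (FRA)',
    '17 April 1878 in Wandsworth, England (GBR)',
    '1 April 1969',   # hypothetical entry without a location
])

# city = text up to the first comma, region = text before the code,
# country = three capital letters in the trailing parentheses
place = born.str.extract(
    r'\bin\s(?P<Born_City>[^,]+),\s(?P<Born_Region>.+)\s\((?P<Born_Country>[A-Z]{3})\)$'
)

print(place)
```

Rows that do not fit the full "city, region (CODE)" shape come back as missing in all three columns, which makes them easy to find and review.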


## Key Insights & Next Steps

**Data Quality Assessment:**
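- Extracted structured columns (dates, birth location, height, weight) from the free-text `Born`, `Died`, and `Measurements` fields
- Handled edge cases in measurement and date formats: values with the wrong unit become NaN, and unparseable dates become NaT
- Kept the raw data intact by running every transformation on a working copy (`df`) and leaving `df_1` untouched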

**Potential Analyses:**
1. Athlete physical characteristics by sport/country
2. Longevity trends across Olympic generations
3. Geographic distribution of athletes
4. Height/weight correlations with performance

**Future Improvements:**
- Additional validation for extracted dates
- Standardization of country/region names
- Integration with competition results data
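
As a starting point for the date-validation item above, the sketch below flags rows whose death date precedes the birth date or whose lifespan is implausibly long. It is illustrative only: the `athletes` frame mirrors a few cleaned rows, the row with `athlete_id` 99 is a made-up bad record, and the 120-year cutoff is an arbitrary assumption.

```python
import pandas as pd

athletes = pd.DataFrame({
    'athlete_id': [1, 2, 3, 99],   # 99 is a hypothetical bad record
    'Birthday': pd.to_datetime(['1886-12-12', '1969-04-01', '1898-08-13', '1950-01-01']),
    'Deathday': pd.to_datetime(['1960-10-02', None, '1994-07-17', '1940-01-01']),
})

# Lifespan in years; stays missing when either date is missing
athletes['lifespan_years'] = (athletes['Deathday'] - athletes['Birthday']).dt.days / 365.25

# Flag negative or implausibly long lifespans for manual review
athletes['suspect_dates'] = athletes['lifespan_years'].notna() & (
    (athletes['lifespan_years'] < 0) | (athletes['lifespan_years'] > 120)
)

print(athletes[['athlete_id', 'lifespan_years', 'suspect_dates']])
```

The same `lifespan_years` column could also feed the longevity analysis listed under Potential Analyses.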