**Table of contents**<a id='toc0_'></a>    
- [Prepare the notebook](#toc1_)    
  - [Import necessary libraries](#toc1_1_)    
  - [Import the datasets](#toc1_2_)    
  - [Describe the datasets](#toc1_3_)    
- [Cyclists](#toc2_)    
  - [Syntactic accuracy](#toc2_1_)    
    - [Check that birth_year, weight and height are numeric](#toc2_1_1_)    
    - [Check that the nationality is a valid country](#toc2_1_2_)    
    - [Check that the urls follow the same format (lowercase, hyphen-separated, alphanumerical)](#toc2_1_3_)    
  - [Semantic accuracy](#toc2_2_)    
    - [Check that the weights and heights are possible values](#toc2_2_1_)    
    - [Check that the BMI is reasonable (a professional cyclist most likely isn't obese, i.e. BMI > 30)](#toc2_2_2_)    
    - [Check that the birth years make sense (not in the future or before 1868, when cycling was established as a professional sport)](#toc2_2_3_)    
- [Races](#toc3_)    
  - [Syntactic accuracy](#toc3_1_)    
    - [Check that the urls follow the same format (lowercase, hyphen-separated, alphanumerical)](#toc3_1_1_)    
    - [Confirm the date column is a valid timestamp](#toc3_1_2_)    
    - [Confirm position is a non-negative integer](#toc3_1_3_)    
    - [Confirm fields are numeric](#toc3_1_4_)    
    - [Confirm the cyclist team format is consistent (lowercase, alphanumeric + dots, words separated with hyphens)](#toc3_1_5_)    
    - [Ensure there are no duplicate entries for the same race and cyclist combination.](#toc3_1_6_)    
  - [Semantic accuracy](#toc3_2_)    
    - [Make sure the delta is consistent with the position (delta increases with position)](#toc3_2_1_)    
    - [Make sure the startlist_quality is realistic](#toc3_2_2_)    
    - [Ensure the climb_total increases with the profile](#toc3_2_3_)    
    - [Ensure that the average temperature makes sense](#toc3_2_4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Prepare the notebook](#toc0_)

## <a id='toc1_1_'></a>[Import necessary libraries](#toc0_)

In [None]:
!pip install pandas
!pip install pycountry
!pip install pycountry-convert






In [2]:
import pandas as pd

## <a id='toc1_2_'></a>[Import the datasets](#toc0_)

In [3]:
# Load the dataset
df_races = pd.read_csv('dataset/races.csv')
df_cyclists = pd.read_csv('dataset/cyclists.csv')

## <a id='toc1_3_'></a>[Describe the datasets](#toc0_)

In [4]:
df_races.describe()

| | points | uci_points | length | climb_total | profile | startlist_quality | average_temperature | position | cyclist_age | delta |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 589388.0 | 251086.0 | 589865.0 | 442820.0 | 441671.0 | 589865.0 | 29933.0 | 589865.0 | 589752.0 | 589865.0 |
| mean | 89.221635 | 74.601547 | 166776.180584 | 2330.469215 | 2.611611 | 1101.161178 | 21.731768 | 74.219491 | 28.486208 | 418.292794 |
| std | 54.43533 | 100.947962 | 64545.605664 | 1375.710722 | 1.491741 | 380.586928 | 5.884761 | 48.404023 | 3.855631 | 842.961596 |
| min | 18.0 | 6.0 | 1000.0 | 2.0 | 1.0 | 115.0 | 10.0 | 0.0 | 13.0 | -6906.0 |
| 25% | 50.0 | 16.0 | 152500.0 | 1309.0 | 1.0 | 844.0 | 17.0 | 32.0 | 26.0 | 10.0 |
| 50% | 80.0 | 60.0 | 178200.0 | 2255.0 | 2.0 | 988.0 | 22.0 | 70.0 | 28.0 | 156.0 |
| 75% | 100.0 | 100.0 | 203500.0 | 3273.0 | 4.0 | 1309.0 | 26.0 | 112.0 | 31.0 | 624.0 |
| max | 350.0 | 800.0 | 338000.0 | 6974.0 | 5.0 | 2047.0 | 36.0 | 209.0 | 56.0 | 61547.0 |


# <a id='toc2_'></a>[Cyclists](#toc0_)

## <a id='toc2_1_'></a>[Syntactic accuracy](#toc0_)

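Syntactic accuracy means that every value belongs to the domain of its attribute: numeric fields hold numbers, nationalities name real countries, and URLs follow a single format.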

### <a id='toc2_1_1_'></a>[Check that birth_year, weight and height are numeric](#toc0_)

In [5]:
# Flag values that cannot be parsed as numbers. Missing values (NaN) are flagged as True too,
# so this table does not distinguish missing entries from non-numeric ones.
numeric_issues = df_cyclists[['birth_year', 'weight', 'height']].apply(lambda x: pd.to_numeric(x, errors='coerce')).isna()
print(numeric_issues)

      birth_year  weight  height
0          False    True    True
1          False   False   False
2          False   False   False
3          False   False   False
4          False   False   False
...          ...     ...     ...
6129       False    True    True
6130       False   False   False
6131       False    True    True
6132       False   False   False
6133       False   False   False

[6134 rows x 3 columns]
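
Because `isna()` is applied after the conversion, the table above marks missing values as `True` alongside values that genuinely fail to parse. Below is a minimal sketch of how the two cases could be told apart. The `sample` frame and the variable names are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample: one missing weight and one non-numeric weight
sample = pd.DataFrame({"weight": [70.0, None, "heavy"], "height": [180, 175, 190]})

was_missing = sample.isna()                               # NaN before any conversion
converted = sample.apply(pd.to_numeric, errors="coerce")  # unparseable values become NaN
non_numeric = converted.isna() & ~was_missing             # NaN introduced by the conversion

print(non_numeric)
```

Only the `"heavy"` entry ends up flagged in `non_numeric`; the missing weight is reported separately in `was_missing`.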


### <a id='toc2_1_2_'></a>[Check that the nationality is a valid country](#toc0_)

In [6]:
import pycountry_convert as pc
import pycountry


# Helper function to standardize country names
def standardize_country_name(country_name):
    if pd.isna(country_name):  # Check for NaN (missing) values
        return None
    try:
        # Try to get the alpha-2 code from the name
        country_code = pc.country_name_to_country_alpha2(country_name, cn_name_format="default")
        # Get the country name from the alpha-2 code
        return pycountry.countries.get(alpha_2=country_code).name
    except KeyError:
        # Return None if no match is found
        return None

# Apply the standardization function and validate
df_cyclists['standardized_nationality'] = df_cyclists['nationality'].apply(standardize_country_name)
df_cyclists['valid_nationality'] = df_cyclists['standardized_nationality'].notna()

# Filter and output records where valid_nationality is False
invalid_nationality_records = df_cyclists[df_cyclists['valid_nationality'] == False]
print(invalid_nationality_records[['name', 'nationality', 'standardized_nationality', 'valid_nationality']])


               name nationality standardized_nationality  valid_nationality
9     Scott  Davies         NaN                     None              False
102   Primož  Čerin  Yugoslavia                     None              False
6100   Kam-Po  Wong    Hongkong                     None              False


### <a id='toc2_1_3_'></a>[Check that the urls follow the same format (lowercase, hyphen-separated, alphanumerical)](#toc0_)

In [7]:
df_cyclists['_url_format_issue'] = df_cyclists['_url'].str.match(r'^[a-z0-9-]+$') == False

invalid_url_records = df_cyclists[df_cyclists['_url_format_issue'] == True]
print(invalid_url_records)

Empty DataFrame
Columns: [_url, name, birth_year, weight, height, nationality, standardized_nationality, valid_nationality, _url_format_issue]
Index: []
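
To make the pattern concrete: `^[a-z0-9-]+$` accepts only strings made entirely of lowercase letters, digits and hyphens. The sketch below applies the same character class with the standard `re` module to a few made-up slugs (illustrative values, not rows from the dataset).

```python
import re

URL_PATTERN = re.compile(r"[a-z0-9-]+")

# Made-up examples, not taken from the dataset
candidates = ["jane-doe", "jane-doe-2", "Jane-Doe", "jane_doe", "jane doe", ""]

# fullmatch requires the whole string to match the pattern
results = {slug: bool(URL_PATTERN.fullmatch(slug)) for slug in candidates}
for slug, ok in results.items():
    print(f"{slug!r}: {'ok' if ok else 'format issue'}")
```

`fullmatch` is used here instead of `^...$` because `$` also matches just before a trailing newline, so `fullmatch` is slightly stricter.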


## <a id='toc2_2_'></a>[Semantic accuracy](#toc0_)

### <a id='toc2_2_1_'></a>[Check that the weights and heights are possible values](#toc0_)

The bounds used are:

- The tallest person ever (Robert Wadlow), at 272 cm
- The shortest adult ever (Chandra Bahadur Dangi), at 54.6 cm
- The heaviest person ever (Jon Brower Minnoch), at 635 kg
- The lightest person ever (Lucia Zarate), at 2.1 kg

In [8]:
df_cyclists['weight_issue'] = df_cyclists['weight'].notna() & ~df_cyclists['weight'].between(2.1, 635, inclusive='both')
df_cyclists['height_issue'] = df_cyclists['height'].notna() & ~df_cyclists['height'].between(54.6, 272, inclusive='both')

invalid_weight_records = df_cyclists[df_cyclists['weight_issue'] == True]
invalid_height_records = df_cyclists[df_cyclists['height_issue'] == True]
print(invalid_weight_records)   
print(invalid_height_records)


Empty DataFrame
Columns: [_url, name, birth_year, weight, height, nationality, standardized_nationality, valid_nationality, _url_format_issue, weight_issue, height_issue]
Index: []
Empty DataFrame
Columns: [_url, name, birth_year, weight, height, nationality, standardized_nationality, valid_nationality, _url_format_issue, weight_issue, height_issue]
Index: []


### <a id='toc2_2_2_'></a>[Check that the BMI is reasonable (a professional cyclist most likely isn't obese, i.e. BMI > 30)](#toc0_)

In [9]:
# BMI = weight (kg) / height (m)^2. A BMI above 30 is classed as obese, which is unlikely for a
# professional athlete; the check also flags values below 15.
df_cyclists['bmi'] = df_cyclists['weight'] / ((df_cyclists['height'] / 100) ** 2)
df_cyclists['bmi_issue'] = df_cyclists['bmi'].notna() & (~df_cyclists['bmi'].between(15, 30))

invalid_bmi = df_cyclists[df_cyclists['bmi_issue'] == True]
print(invalid_bmi)


Empty DataFrame
Columns: [_url, name, birth_year, weight, height, nationality, standardized_nationality, valid_nationality, _url_format_issue, weight_issue, height_issue, bmi, bmi_issue]
Index: []
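
As a worked example of the formula used above, a rider of 70 kg and 180 cm (made-up values) has a BMI of 70 / 1.80² ≈ 21.6, well inside the 15 to 30 band. The same arithmetic as a small helper follows; `bmi` and `bmi_in_range` are hypothetical names used only for this sketch.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2

def bmi_in_range(weight_kg, height_cm, low=15, high=30):
    """True if the BMI lies inside the inclusive [low, high] band."""
    return low <= bmi(weight_kg, height_cm) <= high

print(round(bmi(70, 180), 1))   # 21.6
print(bmi_in_range(70, 180))    # True
print(bmi_in_range(120, 170))   # False
```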


### <a id='toc2_2_3_'></a>[Check that the birth years make sense (not in the future or before 1868, when cycling was established as a professional sport)](#toc0_)

In [10]:
current_year = 2024
# Lower bound: 1868, when cycling was established as a professional sport
df_cyclists['birth_year_issue'] = df_cyclists['birth_year'].notna() & ~df_cyclists['birth_year'].between(1868, current_year)

invalid_birth_year_records = df_cyclists[df_cyclists['birth_year_issue'] == True]
print(invalid_birth_year_records)

Empty DataFrame
Columns: [_url, name, birth_year, weight, height, nationality, standardized_nationality, valid_nationality, _url_format_issue, weight_issue, height_issue, bmi, bmi_issue, birth_year_issue]
Index: []


# <a id='toc3_'></a>[Races](#toc0_)

## <a id='toc3_1_'></a>[Syntactic accuracy](#toc0_)

### <a id='toc3_1_1_'></a>[Check that the urls follow the same format (lowercase, hyphen-separated, alphanumerical)](#toc0_)

In [11]:
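# Race URLs contain slashes as well as hyphens, so '/' is allowed; '-' is placed last in the class so it is read as a literal
df_races['_url_format_issue'] = df_races['_url'].str.match(r'^[a-z0-9/-]+$') == False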

invalid_url_records = df_races[df_races['_url_format_issue'] == True]
print(invalid_url_records)

Empty DataFrame
Columns: [_url, name, points, uci_points, length, climb_total, profile, startlist_quality, average_temperature, date, position, cyclist, cyclist_age, is_tarmac, is_cobbled, is_gravel, cyclist_team, delta, _url_format_issue]
Index: []


### <a id='toc3_1_2_'></a>[Confirm the date column is a valid timestamp](#toc0_)

In [12]:
df_races['date_format_issue'] = pd.to_datetime(df_races['date'], errors='coerce').isna()

invalid_date_records = df_races[df_races['date_format_issue'] == True]
print(invalid_date_records)

Empty DataFrame
Columns: [_url, name, points, uci_points, length, climb_total, profile, startlist_quality, average_temperature, date, position, cyclist, cyclist_age, is_tarmac, is_cobbled, is_gravel, cyclist_team, delta, _url_format_issue, date_format_issue]
Index: []
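
The check relies on `errors='coerce'`, which makes `pd.to_datetime` return `NaT` for anything it cannot parse instead of raising. Here is a minimal illustration on made-up strings. The dataset's actual date format is not shown in this notebook, so the format string below is an assumption.

```python
import pandas as pd

# Made-up values: the first is a well-formed timestamp, the other two are not
dates = pd.Series(["2022-07-01 04:02:02", "not a date", "2022-13-45 00:00:00"])

parsed = pd.to_datetime(dates, format="%Y-%m-%d %H:%M:%S", errors="coerce")
date_issue = parsed.isna()  # True wherever parsing failed (or the value was missing)

print(date_issue.tolist())
```

Note that a missing date would also come out as `NaT`, so this style of check flags missing and malformed values alike.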


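### <a id='toc3_1_3_'></a>[Confirm position is a non-negative integer](#toc0_)

In [13]:
# Positions start at 0 in this dataset, so 0 is accepted. The isinstance check relies on the column
# having an integer dtype; a float such as 3.0 would be flagged.
df_races['position_issue'] = df_races['position'].apply(lambda x: isinstance(x, int) and x >= 0) == False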

invalid_position_records = df_races[df_races['position_issue'] == True]
print(invalid_position_records)

Empty DataFrame
Columns: [_url, name, points, uci_points, length, climb_total, profile, startlist_quality, average_temperature, date, position, cyclist, cyclist_age, is_tarmac, is_cobbled, is_gravel, cyclist_team, delta, _url_format_issue, date_format_issue, position_issue]
Index: []

[0 rows x 21 columns]
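
A variant of this check that does not depend on the column's dtype could coerce to numeric first and then test for non-negative whole numbers. This is a sketch on made-up values, not the notebook's own method.

```python
import pandas as pd

# Made-up positions: valid, valid, fractional, negative, missing
positions = pd.Series([0, 3, 2.5, -1, None])

numeric = pd.to_numeric(positions, errors="coerce")
# An issue is anything missing, negative, or with a fractional part
position_issue = numeric.isna() | (numeric < 0) | (numeric % 1 != 0)

print(position_issue.tolist())
```

The first two values pass; the fractional, negative and missing entries are all flagged.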


### <a id='toc3_1_4_'></a>[Confirm fields are numeric](#toc0_)

In [14]:
# List of fields to check
numeric_fields = ['points', 'uci_points', 'length', 'climb_total', 'startlist_quality', 'average_temperature', 'delta']

# Identify NaNs
original_na = df_races[numeric_fields].isna()

# Convert non-numeric values to NaN
df_races[numeric_fields] = df_races[numeric_fields].apply(pd.to_numeric, errors='coerce')

# Identify new NaNs created by non-numeric values
numeric_issues = df_races[numeric_fields].isna() & ~original_na  # True where NaN was caused by non-numeric values

# Adding a column to flag rows with any numeric issues
df_races['numeric_issue'] = numeric_issues.any(axis=1)

invalid_number_records = df_races[df_races['numeric_issue'] == True]
print(invalid_number_records)

Empty DataFrame
Columns: [_url, name, points, uci_points, length, climb_total, profile, startlist_quality, average_temperature, date, position, cyclist, cyclist_age, is_tarmac, is_cobbled, is_gravel, cyclist_team, delta, _url_format_issue, date_format_issue, position_issue, numeric_issue]
Index: []

[0 rows x 22 columns]


### <a id='toc3_1_5_'></a>[Confirm the cyclist team format is consistent (lowercase, alphanumeric + dots, words separated with hyphens)](#toc0_)

In [15]:
df_races['cyclist_team_format_issue'] = (
    df_races['cyclist_team'].notna() & ~df_races['cyclist_team'].str.match(r'^[a-z0-9.-]+$').fillna(False)
)

invalid_cyclist_team_records = df_races[df_races['cyclist_team_format_issue'] == True]
print(invalid_cyclist_team_records)


Empty DataFrame
Columns: [_url, name, points, uci_points, length, climb_total, profile, startlist_quality, average_temperature, date, position, cyclist, cyclist_age, is_tarmac, is_cobbled, is_gravel, cyclist_team, delta, _url_format_issue, date_format_issue, position_issue, numeric_issue, cyclist_team_format_issue]
Index: []

[0 rows x 23 columns]


### <a id='toc3_1_6_'></a>[Ensure there are no duplicate entries for the same race and cyclist combination.](#toc0_)

In [16]:
# Duplicate key: race URL, position and cyclist
df_races['duplicate_issue'] = df_races.duplicated(subset=['_url', 'position', 'cyclist'], keep=False)

duplicates = df_races[df_races['duplicate_issue'] == True]
print(duplicates)

Empty DataFrame
Columns: [_url, name, points, uci_points, length, climb_total, profile, startlist_quality, average_temperature, date, position, cyclist, cyclist_age, is_tarmac, is_cobbled, is_gravel, cyclist_team, delta, _url_format_issue, date_format_issue, position_issue, numeric_issue, cyclist_team_format_issue, duplicate_issue]
Index: []

[0 rows x 24 columns]
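
The key used above includes `position`, so it only catches rows that repeat the same race, position and cyclist. If the intent is one row per race and cyclist, as the heading suggests, the key could be narrowed to `['_url', 'cyclist']`. The sketch below shows the difference on made-up rows; it is not run against the real dataset, so whether the narrower key finds anything there is unknown.

```python
import pandas as pd

# Made-up rows: 'alice' appears twice in race-a, at different positions
rows = pd.DataFrame({
    "_url": ["race-a", "race-a", "race-a", "race-b"],
    "position": [0, 1, 5, 0],
    "cyclist": ["alice", "bob", "alice", "alice"],
})

with_position = rows.duplicated(subset=["_url", "position", "cyclist"], keep=False)
race_and_cyclist = rows.duplicated(subset=["_url", "cyclist"], keep=False)

print(with_position.tolist())
print(race_and_cyclist.tolist())
```

With `position` in the key nothing is flagged, while the narrower key flags both `alice` rows in `race-a`.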


## <a id='toc3_2_'></a>[Semantic accuracy](#toc0_)

### <a id='toc3_2_1_'></a>[Make sure the delta is consistent with the position (delta increases with position)](#toc0_)

In [17]:
# Create the 'delta_issue' column to flag issues. Records are grouped by url, and a row is flagged
# when its delta is smaller than the delta of the previous row in the same race. Equal deltas are
# not flagged, and rows are assumed to be ordered by position within each race.
df_races['delta_issue'] = (
    df_races.groupby(['_url'])['delta']
    .transform(lambda x: (x.shift() > x).fillna(False))
)
problematic_rows = df_races[df_races['delta_issue'] == True]

# Next we'll retrieve the row before each problematic row within the same group
# so that we can get some extra context
previous_rows = (
    df_races.groupby('_url')
    .shift()
    .loc[problematic_rows.index]
)

# Concatenate the problematic rows with their preceding rows for inspection
combined_issues = pd.concat([previous_rows, problematic_rows]).sort_index()

# Display the combined result
print(combined_issues)


                     name  points  uci_points    length  climb_total  profile  \
423        Tour de France   100.0       120.0  128000.0        781.0      1.0   
423        Tour de France   100.0       120.0  128000.0        781.0      1.0   
1908    Tirreno-Adriatico    50.0         NaN  181000.0          NaN      3.0   
1908    Tirreno-Adriatico    50.0         NaN  181000.0          NaN      3.0   
6970    Tirreno-Adriatico    50.0        60.0  167000.0        776.0      1.0   
...                   ...     ...         ...       ...          ...      ...   
589006     Tour de France   100.0       120.0  113000.0        713.0      1.0   
589013     Tour de France   100.0       120.0  113000.0        713.0      1.0   
589013     Tour de France   100.0       120.0  113000.0        713.0      1.0   
589016     Tour de France   100.0       120.0  113000.0        713.0      1.0   
589016     Tour de France   100.0       120.0  113000.0        713.0      1.0   

        startlist_quality  
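
To see what the flag does, here is the same comparison on two made-up races, written with `groupby(...).shift()` directly rather than a `transform` lambda. The data and race names are illustrative, and rows are assumed to be sorted by position within each race.

```python
import pandas as pd

# Made-up results, already ordered by position within each race
toy = pd.DataFrame({
    "_url": ["race-a", "race-a", "race-a", "race-b", "race-b", "race-b"],
    "position": [0, 1, 2, 0, 1, 2],
    "delta": [0.0, 10.0, 5.0, 0.0, 0.0, 7.0],
})

previous_delta = toy.groupby("_url")["delta"].shift()  # NaN for the first row of each race
toy["delta_issue"] = previous_delta > toy["delta"]     # comparisons with NaN evaluate to False

print(toy[toy["delta_issue"]])
```

Only the third row of `race-a` is flagged: its delta (5) is smaller than that of the rider ahead (10). The tie in `race-b` is not flagged because the comparison is strict.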

### <a id='toc3_2_2_'></a>[Make sure the startlist_quality is realistic](#toc0_)

According to ProCyclingStats, the theoretical maximum startlist_quality for a 150-rider race is 2275 points (https://www.procyclingstats.com/calendar/uci/startlist-quality).

In [18]:
df_races['startlist_quality_issue'] = ~df_races['startlist_quality'].between(0, 2275, inclusive="both").fillna(False)

invalid_startlist_quality_records = df_races[df_races['startlist_quality_issue'] == True]
print(invalid_startlist_quality_records)

Empty DataFrame
Columns: [_url, name, points, uci_points, length, climb_total, profile, startlist_quality, average_temperature, date, position, cyclist, cyclist_age, is_tarmac, is_cobbled, is_gravel, cyclist_team, delta, _url_format_issue, date_format_issue, position_issue, numeric_issue, cyclist_team_format_issue, duplicate_issue, delta_issue, startlist_quality_issue]
Index: []

[0 rows x 26 columns]


### <a id='toc3_2_3_'></a>[Ensure the climb_total increases with the profile](#toc0_)

Given that the profile says how mountainous the stage's terrain is (the five profile values probably correspond to the 5 icons explained here: https://www.procyclingstats.com/info/profile-score-explained), we'd expect higher values of profile to have higher average climb_totals. This is indeed what we observe.

In [19]:
# Calculate the average climb_total for each profile
average_climb_total_per_profile = df_races.groupby('profile')['climb_total'].mean().reset_index()

average_climb_total_per_profile

| | profile | climb_total |
|---|---|---|
| 0 | 1.0 | 1115.032447 |
| 1 | 2.0 | 2216.014574 |
| 2 | 3.0 | 2417.451732 |
| 3 | 4.0 | 3493.641104 |
| 4 | 5.0 | 3737.367327 |
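
The visual check can also be made explicit. The sketch below re-enters the five means from the output above and tests that they rise with the profile, using `Series.is_monotonic_increasing`. The variable name `mean_climb` is illustrative.

```python
import pandas as pd

# Mean climb_total per profile, copied from the output above
mean_climb = pd.Series(
    [1115.032447, 2216.014574, 2417.451732, 3493.641104, 3737.367327],
    index=[1.0, 2.0, 3.0, 4.0, 5.0],
    name="climb_total",
)

print(mean_climb.is_monotonic_increasing)
# Step between consecutive profiles
print(mean_climb.diff().dropna().round(1).tolist())
```

This only compares group means; it says nothing about whether individual stages within a profile have a plausible climb_total.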


### <a id='toc3_2_4_'></a>[Ensure that the average temperature makes sense](#toc0_)

The coldest bike race takes place in temperatures of -43 °C (https://road.cc/content/news/174123-whats-it-ride-43%C2%B0c-worlds-coldest-bike-race). I wasn't able to find the hottest race ever, but there are records of stages with average temperatures of 40 °C (https://www.cyclingnews.com/features/the-heat-is-on-how-the-vuelta-a-espana-peloton-is-battling-the-first-weeks-intense-temperatures/), so we'll set 43 °C as the upper limit.

In [20]:
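# Only non-missing temperatures are checked; missing values are not counted as issues
df_races['average_temperature_issue'] = df_races['average_temperature'].notna() & ~df_races['average_temperature'].between(-43, 43, inclusive="both")

# Filter on the issue flag, not on the temperature column itself
invalid_temperature_records = df_races[df_races['average_temperature_issue'] == True]
print(invalid_temperature_records)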

Empty DataFrame
Columns: [_url, name, points, uci_points, length, climb_total, profile, startlist_quality, average_temperature, date, position, cyclist, cyclist_age, is_tarmac, is_cobbled, is_gravel, cyclist_team, delta, _url_format_issue, date_format_issue, position_issue, numeric_issue, cyclist_team_format_issue, duplicate_issue, delta_issue, startlist_quality_issue, average_temperature_issue]
Index: []

[0 rows x 27 columns]
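
Only 29,933 of the 589,865 rows have an `average_temperature` (see the `describe()` output earlier), so how missing values are treated matters for this check. `Series.between` returns `False` for `NaN`, which means a bare negation flags every missing value. A sketch on made-up temperatures:

```python
import pandas as pd

# Made-up temperatures: plausible, missing, too hot, too cold
temps = pd.Series([22.0, None, 50.0, -50.0])

in_range = temps.between(-43, 43, inclusive="both")
negated_only = ~in_range              # the missing value is flagged too
guarded = temps.notna() & ~in_range   # the missing value is left alone

print(negated_only.tolist())
print(guarded.tolist())
```

Guarding with `notna()` keeps "missing" and "out of range" as separate findings, which matches how the cyclist weight and height checks are written.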
