**Table of contents**<a id='toc0_'></a>    
- [Prepare the notebook](#toc1_)    
  - [Import necessary libraries](#toc1_1_)    
  - [Import the datasets](#toc1_2_)    
- [Task 2: Data Transformation](#toc2_)    
  - [Feature engineering and/or novel feature definition](#toc2_1_)    
  - [Outlier detection](#toc2_2_)    
- [PCA](#toc3_)    
- [Distributional approach](#toc4_)    
- [Connectivity approach](#toc5_)    
- [One-class SVM](#toc6_)    
- [Isolation forest](#toc7_)    
  - [Get the final list of 'outlier' columns by getting the columns that were identified by a majority of tests](#toc7_1_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Prepare the notebook](#toc0_)

## <a id='toc1_1_'></a>[Import necessary libraries](#toc0_)

In [5]:
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install seaborn
!pip install outlier_utils
!pip install plotly





In [6]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import numpy as np

from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer
from sklearn.mixture import GaussianMixture

## <a id='toc1_2_'></a>[Import the datasets](#toc0_)

In [7]:
# Load the dataset
df_races = pd.read_csv('dataset/races.csv')
df_cyclists = pd.read_csv('dataset/cyclists.csv')

# Describe the datasets

In [8]:
df_races.describe()

| | points | uci_points | length | climb_total | profile | startlist_quality | average_temperature | position | cyclist_age | delta |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 589388.0 | 251086.0 | 589865.0 | 442820.0 | 441671.0 | 589865.0 | 29933.0 | 589865.0 | 589752.0 | 589865.0 |
| mean | 89.221635 | 74.601547 | 166776.180584 | 2330.469215 | 2.611611 | 1101.161178 | 21.731768 | 74.219491 | 28.486208 | 418.292794 |
| std | 54.43533 | 100.947962 | 64545.605664 | 1375.710722 | 1.491741 | 380.586928 | 5.884761 | 48.404023 | 3.855631 | 842.961596 |
| min | 18.0 | 6.0 | 1000.0 | 2.0 | 1.0 | 115.0 | 10.0 | 0.0 | 13.0 | -6906.0 |
| 25% | 50.0 | 16.0 | 152500.0 | 1309.0 | 1.0 | 844.0 | 17.0 | 32.0 | 26.0 | 10.0 |
| 50% | 80.0 | 60.0 | 178200.0 | 2255.0 | 2.0 | 988.0 | 22.0 | 70.0 | 28.0 | 156.0 |
| 75% | 100.0 | 100.0 | 203500.0 | 3273.0 | 4.0 | 1309.0 | 26.0 | 112.0 | 31.0 | 624.0 |
| max | 350.0 | 800.0 | 338000.0 | 6974.0 | 5.0 | 2047.0 | 36.0 | 209.0 | 56.0 | 61547.0 |


## Syntactic accuracy

Syntactic accuracy means that every entry belongs to the domain of its attribute, i.e. it has the expected type and format. The checks below verify this column by column.

### Check that birth_year, weight and height are numeric

In [9]:
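# Coerce the numeric fields; anything that cannot be parsed becomes NaN.
# Note: the mask is also True for values that were already missing, so True
# here means "missing or non-numeric", not necessarily a malformed entry.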
numeric_issues = df_cyclists[['birth_year', 'weight', 'height']].apply(lambda x: pd.to_numeric(x, errors='coerce')).isna()
print(numeric_issues)

      birth_year  weight  height
0          False    True    True
1          False   False   False
2          False   False   False
3          False   False   False
4          False   False   False
...          ...     ...     ...
6129       False    True    True
6130       False   False   False
6131       False    True    True
6132       False   False   False
6133       False   False   False

[6134 rows x 3 columns]
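
The mask printed above cannot by itself separate missing entries from malformed ones, because `isna()` is `True` for both. One possible way to tell them apart is sketched below. The `sample` frame and the `find_non_numeric` helper are illustrative assumptions, not part of the dataset or the original notebook.

```python
import pandas as pd


def find_non_numeric(frame, columns):
    """Return a mask that is True only where a non-missing value fails numeric conversion."""
    # Remember which cells were missing before any conversion
    originally_missing = frame[columns].isna()
    # Coerce each column; unparseable values become NaN
    coerced = frame[columns].apply(pd.to_numeric, errors="coerce")
    # Keep only the NaNs that were introduced by the coercion
    return coerced.isna() & ~originally_missing


# Made-up sample data (hypothetical values)
sample = pd.DataFrame({
    "birth_year": [1990, "unknown", None],
    "weight": ["70.5", None, 68],
})

mask = find_non_numeric(sample, ["birth_year", "weight"])
print(mask)
```

With this variant, a `True` cell points at a value that was present but not parseable, while missing values are left for a separate completeness check.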


### Check that the nationality is a valid country

In [18]:
import pycountry_convert as pc
import pycountry


# Helper function to standardize country names
def standardize_country_name(country_name):
    if pd.isna(country_name):  # Check for NaN (missing) values
        return None
    try:
        # Try to get the alpha-2 code from the name
        country_code = pc.country_name_to_country_alpha2(country_name, cn_name_format="default")
        # Get the country name from the alpha-2 code
        return pycountry.countries.get(alpha_2=country_code).name
    except KeyError:
        # Return None if no match is found
        return None

# Apply the standardization function and validate
df_cyclists['standardized_nationality'] = df_cyclists['nationality'].apply(standardize_country_name)
df_cyclists['valid_nationality'] = df_cyclists['standardized_nationality'].notna()

# Filter and output records where valid_nationality is False
invalid_nationality_records = df_cyclists[df_cyclists['valid_nationality'] == False]
print(invalid_nationality_records[['name', 'nationality', 'standardized_nationality', 'valid_nationality']])


               name nationality standardized_nationality  valid_nationality
9     Scott  Davies         NaN                     None              False
102   Primož  Čerin  Yugoslavia                     None              False
6100   Kam-Po  Wong    Hongkong                     None              False


### Check that the URLs follow the same format (lowercase, hyphen-separated, alphanumeric)

In [None]:
df_cyclists['_url_format_issue'] = df_cyclists['_url'].str.match(r'^[a-z0-9-]+$') == False

invalid_url_records = df_cyclists[df_cyclists['_url_format_issue'] == True]
print(invalid_url_records)

Empty DataFrame
Columns: [_url, name, birth_year, weight, height, nationality, valid_nationality, standardized_nationality, _url_format_issue]
Index: []
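
To illustrate what the pattern accepts and rejects, here is a small standalone sketch using only the standard library. The slugs are made-up examples and `is_valid_slug` is a hypothetical helper, not something defined in the notebook. It uses `re.fullmatch`, which, unlike `^...$` with `match`, also rejects a string that ends in a newline.

```python
import re

# Lowercase letters, digits and hyphens only
SLUG_PATTERN = re.compile(r"[a-z0-9-]+")


def is_valid_slug(value):
    """Return True if value is a string made only of lowercase letters, digits and hyphens."""
    return isinstance(value, str) and SLUG_PATTERN.fullmatch(value) is not None


# Made-up slugs for illustration
for slug in ["jane-doe", "jane-doe-2", "Jane-Doe", "jane doe", "jane-doe\n"]:
    print(repr(slug), is_valid_slug(slug))
```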


## Semantic accuracy

### Check that the weights and heights are possible values

The bounds used are:

- The tallest person ever (Robert Wadlow), at 272 cm
- The shortest adult ever (Chandra Bahadur Dangi), at 54.6 cm
- The heaviest person ever (Jon Brower Minnoch), at 635 kg
- The lightest person ever (Lucia Zarate), at 2.1 kg

In [None]:
df_cyclists['weight_issue'] = df_cyclists['weight'].notna() & ~df_cyclists['weight'].between(2.1, 635, inclusive='both')
df_cyclists['height_issue'] = df_cyclists['height'].notna() & ~df_cyclists['height'].between(54.6, 272, inclusive='both')

invalid_weight_records = df_cyclists[df_cyclists['weight_issue'] == True]
invalid_height_records = df_cyclists[df_cyclists['height_issue'] == True]
print(invalid_weight_records)   
print(invalid_height_records)


Empty DataFrame
Columns: [_url, name, birth_year, weight, height, nationality, valid_nationality, standardized_nationality, _url_format_issue, name_format_issue, birth_year_issue, weight_issue, height_issue]
Index: []
Empty DataFrame
Columns: [_url, name, birth_year, weight, height, nationality, valid_nationality, standardized_nationality, _url_format_issue, name_format_issue, birth_year_issue, weight_issue, height_issue]
Index: []
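
The weight and height checks share the same shape: a value is flagged only when it is present and falls outside a closed interval. A minimal sketch of that logic as a reusable function follows. The `out_of_range` helper and the sample values are assumptions made for illustration.

```python
import pandas as pd


def out_of_range(series, lower, upper):
    """True where a value is present but lies outside [lower, upper]."""
    # between() is False for NaN, so notna() keeps missing values unflagged
    return series.notna() & ~series.between(lower, upper, inclusive="both")


# Made-up heights in cm: plausible, missing, too tall, exactly on the lower bound
heights_cm = pd.Series([180.0, None, 300.0, 54.6])
flags = out_of_range(heights_cm, 54.6, 272)
print(flags.tolist())
```

Missing values are deliberately not flagged here, so that semantic accuracy is kept separate from completeness.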


### Check that the BMI is reasonable (a professional cyclist is most likely not obese, i.e. BMI > 30)

In [None]:
# BMI check: flag values outside the 15-30 range. A BMI over 30 is considered obese,
# which is unlikely for a professional athlete; values under 15 are treated as implausibly low.
df_cyclists['bmi'] = df_cyclists['weight'] / ((df_cyclists['height'] / 100) ** 2)
df_cyclists['bmi_issue'] = df_cyclists['bmi'].notna() & (~df_cyclists['bmi'].between(15, 30))

invalid_bmi = df_cyclists[df_cyclists['bmi_issue'] == True]
print(invalid_bmi)


Empty DataFrame
Columns: [_url, name, birth_year, weight, height, nationality, valid_nationality, standardized_nationality, _url_format_issue, name_format_issue, birth_year_issue, weight_issue, height_issue, bmi, bmi_issue]
Index: []
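
As a worked example of the formula used above (weight in kg divided by the square of height in metres), the sketch below computes the BMI for one made-up rider. The `bmi` and `is_plausible_bmi` helpers are hypothetical names; the 15 to 30 range is the one used in the check above.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by squared height in metres."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


def is_plausible_bmi(value, lower=15, upper=30):
    """True if the BMI lies inside the closed range used by the check above."""
    return lower <= value <= upper


# Made-up rider: 70 kg, 180 cm
example = bmi(70, 180)
print(round(example, 1), is_plausible_bmi(example))
```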


### Check that the birth years make sense (not in the future or before 1868, when cycling was established as a professional sport)

In [None]:
current_year = 2024
# Valid range: from 1868 (the lower bound stated in the heading) up to the current year
df_cyclists['birth_year_issue'] = df_cyclists['birth_year'].notna() & ~df_cyclists['birth_year'].between(1868, current_year)

invalid_birth_year_records = df_cyclists[df_cyclists['birth_year_issue'] == True]
print(invalid_birth_year_records)

Empty DataFrame
Columns: [_url, name, birth_year, weight, height, nationality, valid_nationality, standardized_nationality, _url_format_issue, name_format_issue, birth_year_issue]
Index: []


# Races

## Syntactic accuracy

### Check that the URLs follow the same format (lowercase, alphanumeric, hyphen- and slash-separated)

In [35]:
# The hyphen is placed last in the character class so it is unambiguously a literal
df_races['_url_format_issue'] = df_races['_url'].str.match(r'^[a-z0-9/-]+$') == False

invalid_url_records = df_races[df_races['_url_format_issue'] == True]
print(invalid_url_records)

Empty DataFrame
Columns: [_url, name, points, uci_points, length, climb_total, profile, startlist_quality, average_temperature, date, position, cyclist, cyclist_age, is_tarmac, is_cobbled, is_gravel, cyclist_team, delta, _url_format_issue]
Index: []


### Confirm the date column is a valid timestamp

In [36]:
df_races['date_format_issue'] = pd.to_datetime(df_races['date'], errors='coerce').isna()

invalid_date_records = df_races[df_races['date_format_issue'] == True]
print(invalid_date_records)

Empty DataFrame
Columns: [_url, name, points, uci_points, length, climb_total, profile, startlist_quality, average_temperature, date, position, cyclist, cyclist_age, is_tarmac, is_cobbled, is_gravel, cyclist_team, delta, _url_format_issue, date_format_issue]
Index: []
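
The date check relies on `pd.to_datetime(..., errors='coerce')` turning anything unparseable into `NaT`. The sketch below shows that behaviour on a few made-up strings (the sample values and their format are assumptions, not taken from the dataset). Note that a missing date also becomes `NaT`, so this flag covers both missing and malformed entries.

```python
import pandas as pd

# Made-up date strings: one valid timestamp, one malformed entry, one missing value
raw_dates = pd.Series(["2020-07-01 04:30:00", "not a date", None])

# Unparseable and missing entries both become NaT
parsed = pd.to_datetime(raw_dates, errors="coerce")
date_issue = parsed.isna()
print(date_issue.tolist())
```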


### Confirm position is a non-negative integer

In [38]:
df_races['position_issue'] = df_races['position'].apply(lambda x: isinstance(x, int) and x >= 0) == False

invalid_position_records = df_races[df_races['position_issue'] == True]
print(invalid_position_records)

Empty DataFrame
Columns: [_url, name, points, uci_points, length, climb_total, profile, startlist_quality, average_temperature, date, position, cyclist, cyclist_age, is_tarmac, is_cobbled, is_gravel, cyclist_team, delta, _url_format_issue, date_format_issue, position_issue]
Index: []

[0 rows x 21 columns]
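
The `isinstance(x, int)` test depends on the values reaching the lambda as Python `int` objects; NumPy integer scalars such as `np.int64` are not instances of `int`. A dtype-based check avoids that dependency. The following is a sketch of one such alternative, with `invalid_positions` as a hypothetical helper and made-up sample data.

```python
import pandas as pd


def invalid_positions(series):
    """True where a position is not a whole number or is negative."""
    if pd.api.types.is_integer_dtype(series):
        # Integer columns only need the sign check
        return series < 0
    # Otherwise coerce to numbers and flag missing, fractional or negative values
    numeric = pd.to_numeric(series, errors="coerce")
    not_whole = numeric.isna() | (numeric % 1 != 0)
    return not_whole | (numeric < 0)


# Made-up examples: an integer column and a float column
int_flags = invalid_positions(pd.Series([0, 3, -1]))
float_flags = invalid_positions(pd.Series([1.0, 2.5, None]))
print(int_flags.tolist(), float_flags.tolist())
```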


### Confirm fields are numeric

In [42]:
# List of fields to check
numeric_fields = ['points', 'uci_points', 'length', 'climb_total', 'startlist_quality', 'average_temperature', 'delta']

# Identify NaNs
original_na = df_races[numeric_fields].isna()

# Convert non-numeric values to NaN
df_races[numeric_fields] = df_races[numeric_fields].apply(pd.to_numeric, errors='coerce')

# Identify new NaNs created by non-numeric values
numeric_issues = df_races[numeric_fields].isna() & ~original_na  # True where NaN was caused by non-numeric values

# Adding a column to flag rows with any numeric issues
df_races['numeric_issue'] = numeric_issues.any(axis=1)

invalid_number_records = df_races[df_races['numeric_issue'] == True]
print(invalid_number_records)

Empty DataFrame
Columns: [_url, name, points, uci_points, length, climb_total, profile, startlist_quality, average_temperature, date, position, cyclist, cyclist_age, is_tarmac, is_cobbled, is_gravel, cyclist_team, delta, _url_format_issue, date_format_issue, position_issue, numeric_issue]
Index: []

[0 rows x 22 columns]
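
Each check above stores its result in a boolean column whose name ends in `_issue`. One way to get an overview of all of them at once is to count the `True` values per flag column, as sketched below. The `summarize_issue_flags` helper and the sample frame are illustrative assumptions.

```python
import pandas as pd


def summarize_issue_flags(frame):
    """Count the flagged rows for every column whose name ends in '_issue'."""
    issue_columns = [col for col in frame.columns if col.endswith("_issue")]
    # Summing booleans counts the True values
    return frame[issue_columns].sum()


# Made-up frame with two flag columns
sample = pd.DataFrame({
    "position": [1, 2, 3],
    "position_issue": [False, False, True],
    "numeric_issue": [False, False, False],
})

summary = summarize_issue_flags(sample)
print(summary)
```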
