**Table of contents**<a id='toc0_'></a>    
- [Prepare the notebook](#toc1_)    
  - [Import necessary libraries](#toc1_1_)    
  - [Import the datasets](#toc1_2_)    
- [Task 1: Data understanding](#toc2_)    
  - [Assessing Data Quality](#toc2_1_)    
    - [Verify the datatypes make intuitive sense](#toc2_1_1_)    
    - [Check for missing values](#toc2_1_2_)    
    - [Check for possible placeholder values](#toc2_1_3_)    
    - [Check for duplicates](#toc2_1_4_)    
    - [Check for races with more than one road type (cobble, tarmac, gravel)](#toc2_1_5_)    
    - [TODO: Check for Inconsistent Values in Categorical Columns?](#toc2_1_6_)    
    - [Check Numeric Ranges](#toc2_1_7_)    
  - [Data Distribution](#toc2_2_)    
    - [Identify outliers using IQR](#toc2_2_1_)    
    - [Histograms](#toc2_2_2_)    
  - [Relationships between features](#toc2_3_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Prepare the notebook](#toc0_)

## <a id='toc1_1_'></a>[Import necessary libraries](#toc0_)

In [None]:
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn
!pip install scipy

In [82]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import KNNImputer
from sklearn.impute import SimpleImputer

## <a id='toc1_2_'></a>[Import the datasets](#toc0_)

In [83]:
# Load the dataset
df_races = pd.read_csv('dataset/races.csv')
df_cyclists = pd.read_csv('dataset/cyclists.csv')

# <a id='toc2_'></a>[Task 1: Data understanding](#toc0_)

## <a id='toc2_1_'></a>[Assessing Data Quality](#toc0_)

### <a id='toc2_1_1_'></a>[Verify the datatypes make intuitive sense](#toc0_)

We'll start by checking that each column's datatype matches the data it holds. This lets us catch obvious mistakes early, such as a numeric value (e.g. height or weight) stored as a string.

In [None]:
df_races.dtypes

The datatypes seem to make sense for these columns. This is my understanding of what each column represents:

| Column               | Description                                                                                     | Data Type    |
|----------------------|-------------------------------------------------------------------------------------------------|--------------|
| _url                 | The URL of the stage.                                                                           | object       |
| name                 | The name of the event or race.                                                                  | object       |
| points               | Points awarded to cyclists for the... stage?                                                    | float64      |
| uci_points           | UCI points awarded for the... stage?                                                            | float64      |
| length               | The total length of the race stage in meters.                                                   | float64      |
| climb_total          | The total elevation climbed in the stage in meters.                                             | float64      |
| profile              | The terrain profile of the stage? Seems encoded in some magic numeric value                     | float64      |
| startlist_quality    | A numeric value representing the quality of the riders at the start of the stage                | float64      |
| average_temperature  | The average temperature during the stage.                                                       | float64      |
| date                 | The date and time when the race stage occurred.                                                 | object       |
| position             | The final position of the cyclist in the race stage.                                            | int64        |
| cyclist              | The name of the cyclist competing in the race stage.                                            | object       |
| cyclist_age          | The age of the cyclist at the time of the race stage.                                           | float64      |
| is_tarmac            | True if race was on tarmac.                                                                     | bool         |
| is_cobbled           | True if race was on cobblestone.                                                                | bool         |
| is_gravel            | True if race was on gravel.                                                                     | bool         |
| cyclist_team         | The name of the cyclist's team.                                                                 | object       |
| delta                | Time difference between the cyclist and the winner of the stage, in seconds.                    | float64      |


In [None]:
df_cyclists.dtypes

| Column       | Description                                                      | Data Type |
|--------------|------------------------------------------------------------------|-----------|
| _url         | The URL of the cyclist.                                          | object    |
| name         | The full name of the cyclist.                                    | object    |
| birth_year   | The birth year of the cyclist.                                   | float64   |
| weight       | The weight of the cyclist in kilograms.                          | float64   |
| height       | The height of the cyclist in meters.                             | float64   |
| nationality  | The nationality of the cyclist.                                  | object    |

In [None]:
df_races.describe()

In [None]:
df_cyclists.describe()
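
Reading `dtypes` by eye works for a handful of columns, but the "number stored as a string" case can also be flagged automatically. The following is a minimal sketch under stated assumptions: the helper name `numeric_looking_columns`, the 95% threshold, and the toy values are all hypothetical and not part of the original notebook.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def numeric_looking_columns(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Return non-numeric columns whose non-missing values mostly parse as numbers.

    Hypothetical helper: the name and the default threshold are assumptions.
    """
    suspects = []
    for col in df.columns:
        # Columns that are already numeric (or datetime) are not suspicious
        if is_numeric_dtype(df[col]) or is_datetime64_any_dtype(df[col]):
            continue
        values = df[col].dropna()
        if values.empty:
            continue
        # errors="coerce" turns unparseable entries into NaN instead of raising
        parsed = pd.to_numeric(values, errors="coerce")
        if parsed.notna().mean() >= threshold:
            suspects.append(col)
    return suspects


# Toy frame (made-up values) where 'weight' was accidentally read as strings
toy = pd.DataFrame({
    "name": ["Rider A", "Rider B", "Rider C"],
    "weight": ["68", "72.5", "80"],
    "height": [1.78, 1.82, 1.90],
})
print(numeric_looking_columns(toy))
```

On the real data this could be called as `numeric_looking_columns(df_races)` or `numeric_looking_columns(df_cyclists)`; an empty list would support the conclusion that no numeric column is stored as text.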

### <a id='toc2_1_2_'></a>[Check for missing values](#toc0_)

In [None]:
# Check for missing values in each column
missing_values = df_races.isna().sum()

# Display the result
print(missing_values)

In [None]:
# Check for missing values in each column
missing_values = df_cyclists.isna().sum()

# Display the result
print(missing_values)

### <a id='toc2_1_3_'></a>[Check for possible placeholder values](#toc0_)

In [None]:
# List of potential "placeholders" (all lowercase, because the column values are lowercased before comparing)
unknown_values = ["unknown", "n/a", "none", "missing", "na", "null", "", "other"]

# Check each column for occurrences of these values (case-insensitive)
unknown_counts = df_races.apply(lambda col: col.astype(str).str.lower().isin(unknown_values).sum())

# Display the counts of "placeholder" values for each column
print(unknown_counts)

### <a id='toc2_1_4_'></a>[Check for duplicates](#toc0_)

In [None]:
# Find duplicate rows
duplicate_rows = df_races[df_races.duplicated()]

# Count the number of duplicate rows
num_duplicate_rows = duplicate_rows.shape[0]

# Display the number of duplicate rows (and the rows themselves if count > 0)
print(f"Number of duplicate rows: {num_duplicate_rows}")
if num_duplicate_rows:
    print("Duplicate rows:")
    print(duplicate_rows)

### <a id='toc2_1_5_'></a>[Check for races with more than one road type (cobble, tarmac, gravel)](#toc0_)

In [None]:
# Check whether more than one road type is True for any record (we don't expect this to happen).
# Summing the three boolean flags counts how many road types are set in each row.
road_type_columns = ['is_tarmac', 'is_cobbled', 'is_gravel']
multiple_road_types = df_races[df_races[road_type_columns].sum(axis=1) > 1]

print(multiple_road_types)

### <a id='toc2_1_6_'></a>[TODO: Check for Inconsistent Values in Categorical Columns?](#toc0_)
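
One possible way to approach this check is to group the raw values of a categorical column by a normalized form (whitespace stripped, case folded) and report every group that contains more than one spelling. This is only a sketch: the helper name `inconsistent_spellings` and the toy values are hypothetical, and the notebook has not yet settled on a method.

```python
import pandas as pd


def inconsistent_spellings(series: pd.Series) -> dict:
    """Map each normalized value to its raw spellings, keeping only values with several spellings.

    Hypothetical helper: normalization here means strip + casefold, which is an assumption.
    """
    groups = {}
    for value in series.dropna().astype(str):
        groups.setdefault(value.strip().casefold(), set()).add(value)
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


# Made-up values illustrating case and whitespace inconsistencies
toy_nationality = pd.Series(["Italy", "italy ", "France", "ITALY", "France", None])
print(inconsistent_spellings(toy_nationality))
```

Applied to the real data, candidates would be columns such as `df_cyclists['nationality']` or `df_races['cyclist_team']`; an empty dictionary means no case or whitespace variants were found.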

### <a id='toc2_1_7_'></a>[Check Numeric Ranges](#toc0_)

Make sure numeric values fall within realistic ranges (e.g. a length cannot be negative or zero, and points cannot be negative).

In [None]:
negative_lengths = df_races[df_races['length'] <= 0]
print(f"Number of races with negative or zero length: {len(negative_lengths)}")

negative_climbs = df_races[df_races['climb_total'] <= 0]
print(f"Number of races with negative or zero climb total: {len(negative_climbs)}")

negative_points = df_races[(df_races['points'] < 0) | (df_races['uci_points'] < 0)]
print(f"Number of races with negative points: {len(negative_points)}")

negative_positions = df_races[df_races['position'] < 0]
print(f"Number of races with negative positions: {len(negative_positions)}")

negative_delta = df_races[df_races['delta'] < 0]
print(f"Number of races with negative delta times: {len(negative_delta)}")


In [None]:
negative_weights = df_cyclists[df_cyclists['weight'] <= 0]
print(f"Number of cyclists with negative or zero weight: {len(negative_weights)}")

negative_heights = df_cyclists[df_cyclists['height'] <= 0]
print(f"Number of cyclists with negative or zero height: {len(negative_heights)}")
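
The individual checks above can also be expressed as a table of bounds that is applied in a loop, which makes it easy to add upper limits as well. This is a sketch only: the helper name `out_of_range_counts` and the bounds in `PLAUSIBLE_RANGES` are assumptions chosen for illustration, not values taken from the dataset documentation.

```python
import pandas as pd

# Hypothetical plausibility bounds (inclusive). These limits are assumptions.
PLAUSIBLE_RANGES = {
    "weight": (40, 120),         # kilograms
    "birth_year": (1900, 2010),
}


def out_of_range_counts(df: pd.DataFrame, ranges: dict) -> dict:
    """Count non-missing values that fall outside the given inclusive bounds."""
    counts = {}
    for col, (low, high) in ranges.items():
        values = df[col].dropna()
        # Series.between is inclusive on both ends by default
        counts[col] = int((~values.between(low, high)).sum())
    return counts


# Made-up rows: one impossible weight and one implausible birth year
toy_cyclists = pd.DataFrame({
    "weight": [68.0, -1.0, 75.0, None],
    "birth_year": [1985.0, 1790.0, 1999.0, 2001.0],
})
print(out_of_range_counts(toy_cyclists, PLAUSIBLE_RANGES))
```

Missing values are dropped before counting, so they are reported by the missing-value check rather than here.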

## <a id='toc2_2_'></a>[Data Distribution](#toc0_)

### <a id='toc2_2_1_'></a>[Identify outliers using IQR](#toc0_)

In [95]:
def iqr(df: pd.DataFrame, column_name: str) -> pd.DataFrame:
    # Use the IQR to find outliers for the column
    Q1 = df[column_name].quantile(0.25)
    Q3 = df[column_name].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df[(df[column_name] < lower_bound) | (df[column_name] > upper_bound)]

    print(f"Number of outliers: {len(outliers)}")

    # Plot the data using a boxplot
    plt.figure(figsize=(10, 6))
    sns.boxplot(x=df[column_name])
    plt.title(f'Boxplot of {column_name} with IQR Outlier Detection')
    plt.xlabel(f'Value for column {column_name}')
    plt.show()

    return outliers
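
To make the fence arithmetic concrete, here is a small worked example of the same 1.5 × IQR rule without the plotting step. The data are made up, and the helper name `iqr_fences` is hypothetical.

```python
import pandas as pd


def iqr_fences(values: pd.Series) -> tuple:
    """Return the (lower, upper) fences of the 1.5 * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    width = q3 - q1
    return q1 - 1.5 * width, q3 + 1.5 * width


# Made-up sample with one obviously extreme value
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Q1 = 3 and Q3 = 7, so IQR = 4 and the fences are 3 - 6 = -3 and 7 + 6 = 13
lower, upper = iqr_fences(sample)
sample_outliers = sample[(sample < lower) | (sample > upper)]

print(lower, upper)
print(sample_outliers.tolist())
```

On the notebook's data, the `iqr` function defined above would be called as, for example, `iqr(df_cyclists, 'weight')`, which prints the outlier count, draws the boxplot, and returns the outlier rows.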

### <a id='toc2_2_2_'></a>[Histograms](#toc0_)

In [None]:
# Cyclist heights

plt.figure(figsize=(10, 6))

sns.histplot(df_cyclists['height'].dropna(), kde=False, bins=50, color='blue')

plt.title('Distribution of heights of cyclists', fontsize=16)
plt.xlabel('Height', fontsize=14)
plt.ylabel('Frequency', fontsize=14)

plt.show()


In [None]:
# Cyclist weights

plt.figure(figsize=(10, 6))

sns.histplot(df_cyclists['weight'].dropna(), kde=False, bins=50, color='blue')

plt.title('Distribution of weights of cyclists', fontsize=16)
plt.xlabel('Weight', fontsize=14)
plt.ylabel('Frequency', fontsize=14)

plt.show()

## <a id='toc2_3_'></a>[Relationships between features](#toc0_)

We'll determine how closely two columns are related by calculating their Pearson correlation coefficient.

In [None]:
# Select only the float and integer columns (Pearson correlation only works on numerical data).
# Note that this filter does not include the boolean columns.
df_numeric = df_races.select_dtypes(include=['float64', 'int64'])

# List to hold the results
correlations = []

# Loop through all possible pairs of columns
for col1 in df_numeric.columns:
    for col2 in df_numeric.columns:
        # Keep only pairs where col1 comes before col2 in alphabetical order. This skips
        # correlating a column with itself and computes each unordered pair exactly once.
        if col1 < col2:
            corr_value = df_numeric[col1].corr(df_numeric[col2], method='pearson')
            correlations.append((col1, col2, corr_value))

corr_df = pd.DataFrame(correlations, columns=['Feature_1', 'Feature_2', 'Correlation'])

# Sort by the ABSOLUTE value of the correlation coefficient (highest first)
corr_df['Abs_Correlation'] = corr_df['Correlation'].abs()
sorted_corr_df = corr_df.sort_values(by='Abs_Correlation', ascending=False)

# Show top 10 pairs with the highest correlations
print(sorted_corr_df.head(10))

In [None]:
# Select only the float and integer columns (Pearson correlation only works on numerical data)
df_numeric = df_cyclists.select_dtypes(include=['float64', 'int64'])

# List to hold the results
correlations = []

# Loop through all possible pairs of columns
for col1 in df_numeric.columns:
    for col2 in df_numeric.columns:
        # Keep only pairs where col1 comes before col2 in alphabetical order. This skips
        # correlating a column with itself and computes each unordered pair exactly once.
        if col1 < col2:
            corr_value = df_numeric[col1].corr(df_numeric[col2], method='pearson')
            correlations.append((col1, col2, corr_value))

corr_df = pd.DataFrame(correlations, columns=['Feature_1', 'Feature_2', 'Correlation'])

# Sort by the ABSOLUTE value of the correlation coefficient (highest first)
corr_df['Abs_Correlation'] = corr_df['Correlation'].abs()
sorted_corr_df = corr_df.sort_values(by='Abs_Correlation', ascending=False)

# Show top 10 pairs with the highest correlations
print(sorted_corr_df.head(10))
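
The pairwise loops above could also be condensed by computing the full correlation matrix once with `DataFrame.corr` and keeping only the entries above the diagonal. This is a sketch of an alternative, not the notebook's method: the helper name `ranked_correlations` and the toy values are hypothetical. One difference is that pairs follow column order here rather than alphabetical order.

```python
import numpy as np
import pandas as pd


def ranked_correlations(df_numeric: pd.DataFrame) -> pd.DataFrame:
    """Return every unordered column pair once, sorted by absolute Pearson correlation."""
    corr = df_numeric.corr(method="pearson")
    names = corr.columns.to_numpy()
    # Indices strictly above the diagonal: each pair once, no self-correlation
    rows, cols = np.triu_indices(len(names), k=1)
    pairs = pd.DataFrame({
        "Feature_1": names[rows],
        "Feature_2": names[cols],
        "Correlation": corr.to_numpy()[rows, cols],
    })
    pairs["Abs_Correlation"] = pairs["Correlation"].abs()
    return pairs.sort_values("Abs_Correlation", ascending=False).reset_index(drop=True)


# Made-up values: climb_total is an exact multiple of length, so that pair ranks first
toy_races = pd.DataFrame({
    "length": [100.0, 150.0, 200.0, 250.0],
    "climb_total": [1000.0, 1500.0, 2000.0, 2500.0],
    "points": [50.0, 20.0, 80.0, 30.0],
})
print(ranked_correlations(toy_races))
```

With the notebook's data this would be called as `ranked_correlations(df_numeric).head(10)`.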

In [None]:
from scipy.stats import f_oneway

# List to hold the ANOVA results
anova_results = []

# Loop through the numerical columns (height and weight) and test them against nationality
for col in ['height', 'weight']:
    # Drop rows with missing values in the numerical column
    df_non_missing = df_cyclists.dropna(subset=[col])
    
    # Group data by nationality and perform ANOVA
    grouped_data = [group[col].dropna() for name, group in df_non_missing.groupby('nationality')]
    
    # Perform ANOVA only if there are at least two groups with data
    if len(grouped_data) > 1:
        f_stat, p_value = f_oneway(*grouped_data)
        anova_results.append(('nationality', col, f_stat, p_value))
    else:
        anova_results.append(('nationality', col, float('NaN'), float('NaN')))

# Convert results to DataFrame
anova_df = pd.DataFrame(anova_results, columns=['Categorical_Feature', 'Numeric_Feature', 'F-Statistic', 'p-value'])

# Show results sorted by F-statistic
print(anova_df.sort_values(by='F-Statistic', ascending=False))


In [None]:
# Group by 'nationality' and calculate the average weight
avg_weight_by_country = df_cyclists.groupby('nationality')['weight'].mean().reset_index()

# Rename the columns for clarity
avg_weight_by_country.columns = ['Nationality', 'Average_Weight']
avg_weight_by_country_sorted = avg_weight_by_country.sort_values(by='Average_Weight', ascending=True)


# Display the result
print(avg_weight_by_country_sorted)

In [None]:
# Group by 'nationality' and calculate the average height
avg_height_by_country = df_cyclists.groupby('nationality')['height'].mean().reset_index()

# Rename the columns for clarity
avg_height_by_country.columns = ['Nationality', 'Average_Height']
avg_height_by_country_sorted = avg_height_by_country.sort_values(by='Average_Height', ascending=True)


# Display the result
print(avg_height_by_country_sorted)