## Autoreload

Autoreload lets the notebook pick up code changes dynamically: if we update some helper functions *outside* of the notebook, the modified modules are re-imported before the next cell runs, so we do not need to restart the notebook.

In [2]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Imports

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import re

import os
import sys
sys.path.append('../dataset/')

In [4]:
cyclist_df = pd.read_csv(os.path.join('dataset','cyclists.csv'))
races_df = pd.read_csv(os.path.join('dataset','races.csv'))

## Preliminary exploration

We begin with a preliminary exploration of the two datasets.

### Cyclists

In [5]:
cyclist_df.shape

(6134, 6)

In [6]:
cyclist_df.head()

Unnamed: 0,_url,name,birth_year,weight,height,nationality
0,bruno-surra,Bruno Surra,1964.0,,,Italy
1,gerard-rue,Gérard Rué,1965.0,74.0,182.0,France
2,jan-maas,Jan Maas,1996.0,69.0,189.0,Netherlands
3,nathan-van-hooydonck,Nathan Van Hooydonck,1995.0,78.0,192.0,Belgium
4,jose-felix-parra,José Félix Parra,1997.0,55.0,171.0,Spain


In [7]:
cyclist_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6134 entries, 0 to 6133
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   _url         6134 non-null   object 
 1   name         6134 non-null   object 
 2   birth_year   6121 non-null   float64
 3   weight       3078 non-null   float64
 4   height       3143 non-null   float64
 5   nationality  6133 non-null   object 
dtypes: float64(3), object(3)
memory usage: 287.7+ KB


We can see that there are missing values. Specifically, about half of the cyclists have no height and/or weight recorded (3143 and 3078 non-null values, respectively, out of 6134 rows).
But first, let's check for duplicates.

#### Duplicates

First, a general check.

In [None]:
duplicates = cyclist_df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

Number of duplicate rows: 0


To find actual duplicates, the columns worth checking are the cyclists' names and identifiers.

In [None]:
cyclist_df[cyclist_df["_url"].duplicated(keep="first")]

No duplicate URLs (i.e. identifiers). There are homonyms, though, so one should be aware of this.

In [None]:
cyclist_df[cyclist_df["name"].duplicated(keep=False)]

Unnamed: 0,_url,name,birth_year,weight,height,nationality
347,andrea-peron-1,Andrea Peron,1971,70.0,183.0,Italy
1745,roman-kreuziger-sr,Roman Kreuziger,1965,,,Czech Republic
2235,alessandro-pozzi2,Alessandro Pozzi,1969,,,Italy
2601,roman-kreuziger,Roman Kreuziger,1986,65.0,183.0,Czech Republic
2682,andrea-peron,Andrea Peron,1988,70.0,178.0,Italy
2862,antonio-cabello-baena,Antonio Cabello,1990,67.0,179.0,Spain
2939,jesus-lopez23,Jesús López,1955,,,Spain
2953,alberto-fernandez-sainz,Alberto Fernández,1981,,,Spain
3238,antonio-cabello,Antonio Cabello,1956,,,Spain
4917,sergio-dominguez-rodriguez,Sergio Domínguez,1979,,,Spain


Upon manual checking, all these cyclists exist as distinct people, so there are no duplicated rows. Let's get back to checking the missing values.
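The difference between a homonym and a likely duplicate can be illustrated on a small example. This is only a sketch on made-up rows (the names, URLs and years below are hypothetical, not taken from `cyclists.csv`): rows that share a name are candidates for manual review, while rows that share both name and birth year are much stronger duplicate suspects.

```python
import pandas as pd

# Toy data: hypothetical rows, not taken from cyclists.csv
toy = pd.DataFrame({
    "_url": ["anna-rossi", "anna-rossi-2", "marco-bianchi", "marco-bianchi-1"],
    "name": ["Anna Rossi", "Anna Rossi", "Marco Bianchi", "Marco Bianchi"],
    "birth_year": [1970.0, 1992.0, 1985.0, 1985.0],
})

# Same name only: homonyms or duplicates, to be checked by hand
same_name = toy[toy["name"].duplicated(keep=False)]

# Same name AND same birth year: stronger duplicate suspects
same_name_and_year = toy[toy.duplicated(subset=["name", "birth_year"], keep=False)]

print(same_name["_url"].tolist())
print(same_name_and_year["_url"].tolist())
```

Adding `birth_year` to the subset only narrows the list to review; it does not replace the manual check, since two different people can share both name and birth year.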

#### Missing Data

In [8]:
n_rows = cyclist_df.shape[0]
for col in cyclist_df.columns:
    print(f"There are {n_rows - cyclist_df[col].count()} null values in the {col} column, i.e. {100*(n_rows - cyclist_df[col].count())/n_rows:.2f}% are missing")

There are 0 null values in the _url column, i.e. 0.00% are missing
There are 0 null values in the name column, i.e. 0.00% are missing
There are 13 null values in the birth_year column, i.e. 0.21% are missing
There are 3056 null values in the weight column, i.e. 49.82% are missing
There are 2991 null values in the height column, i.e. 48.76% are missing
There are 1 null values in the nationality column, i.e. 0.02% are missing
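The per-column loop above can also be written as a single summary table with `isna().sum()` and `isna().mean()`. The sketch below runs on a toy frame with made-up values standing in for `cyclist_df`; the column names `n_missing` and `pct_missing` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for cyclist_df
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "weight": [70.0, np.nan, np.nan, 65.0],
    "height": [180.0, np.nan, 175.0, 170.0],
})

# isna() gives a boolean frame; sum() counts the True values per column,
# mean() gives the fraction of True values per column
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(2),
})
print(missing)
```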


Do cyclists without a recorded weight at least have a height, or vice versa?

In [9]:
# Count the number of rows missing both weight and height
missing_weight_height_count = cyclist_df[cyclist_df['weight'].isna() & cyclist_df['height'].isna()].shape[0]
print(f"Number of rows missing both weight and height: {missing_weight_height_count}")
missing_weight_or_height_count = cyclist_df[cyclist_df['weight'].isna() | cyclist_df['height'].isna()].shape[0]
print(f"Number of rows missing either weight or height: {missing_weight_or_height_count}")

Number of rows missing both weight and height: 2984
Number of rows missing either weight or height: 3063
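Combining these counts with the per-column totals gives the rows missing exactly one of the two measurements: 3063 - 2984 = 79, of which 3056 - 2984 = 72 lack only the weight and 2991 - 2984 = 7 lack only the height. A minimal sketch of the same bookkeeping on toy data (made-up values, not the real dataset), using an exclusive OR and a 2x2 table of the missingness patterns:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for cyclist_df
toy = pd.DataFrame({
    "weight": [70.0, np.nan, np.nan, 65.0, np.nan],
    "height": [180.0, np.nan, 175.0, np.nan, np.nan],
})

no_weight = toy["weight"].isna()
no_height = toy["height"].isna()

both = int((no_weight & no_height).sum())         # missing weight AND height
either = int((no_weight | no_height).sum())       # missing at least one
exactly_one = int((no_weight ^ no_height).sum())  # missing one but not the other

# 2x2 table of the four missingness patterns
pattern_table = pd.crosstab(
    no_weight.rename("weight missing"),
    no_height.rename("height missing"),
)
print(both, either, exactly_one)
print(pattern_table)
```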


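Almost none: of the 3063 cyclists missing at least one of the two measurements, only 79 (3063 - 2984) have exactly one of them. This means that using a regressor to impute one measurement from the other would be problematic.

Of all these cyclists, how many have neither a weight, nor a height, nor a birth year?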

In [1]:
# .all(axis=1) ANDs the three null flags row by row, giving a boolean Series of length `cyclist_df.shape[0]`; `.sum()` then counts its True values
with_no_info = cyclist_df[['birth_year', 'weight', 'height']].isna().all(axis=1)
print(f"There are {with_no_info.sum()} cyclists without birth_year, weight, height")
print("And they are:")
for _, row in cyclist_df[with_no_info].iterrows():
    print(f"_url: {row['_url']:<20} name: {row['name']:<20} nationality: {row['nationality']}")

NameError: name 'cyclist_df' is not defined

### Races

In [None]:
races_df.shape

In [None]:
races_df.head()

Unnamed: 0,_url,name,points,uci_points,length,climb_total,profile,startlist_quality,average_temperature,date,position,cyclist,cyclist_age,is_tarmac,is_cobbled,is_gravel,cyclist_team,delta
0,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,0,sean-kelly,22.0,True,False,False,vini-ricordi-pinarello-sidermec-1986,0.0
1,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,1,gerrie-knetemann,27.0,True,False,False,norway-1987,0.0
2,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,2,rene-bittinger,24.0,True,False,False,,0.0
3,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,3,joseph-bruyere,30.0,True,False,False,navigare-blue-storm-1993,0.0
4,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,4,sven-ake-nilsson,27.0,True,False,False,spain-1991,0.0


In [None]:
races_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 589865 entries, 0 to 589864
Data columns (total 18 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   _url                 589865 non-null  object 
 1   name                 589865 non-null  object 
 2   points               589388 non-null  float64
 3   uci_points           251086 non-null  float64
 4   length               589865 non-null  float64
 5   climb_total          442820 non-null  float64
 6   profile              441671 non-null  float64
 7   startlist_quality    589865 non-null  int64  
 8   average_temperature  29933 non-null   float64
 9   date                 589865 non-null  object 
 10  position             589865 non-null  int64  
 11  cyclist              589865 non-null  object 
 12  cyclist_age          589752 non-null  float64
 13  is_tarmac            589865 non-null  bool   
 14  is_cobbled           589865 non-null  bool   
 15  is_gravel        

UCI points are assigned to fewer than half of the performances (251086 of 589865, about 42.6%).
Average temperature is almost nonexistent in the dataset (29933 non-null values, about 5.1%).
Before delving into a more accurate analysis of the missing data, let's check for duplicates.

#### Duplicates

Are all the races distinct?

In [None]:
race_names = np.sort(races_df['name'].unique())
for race in race_names:
    print(race)


At first glance, there are a lot of suspected duplicates. With the `check_if_same` function we can check whether two slightly different names actually correspond to the same race, by comparing the first part of the associated `_url`.
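The definition of `check_if_same` is not shown in this notebook. As a rough idea of what such a comparison might look like, here is a hypothetical stand-in, `share_url_prefix`, which takes the DataFrame explicitly and returns a tuple whose first element is the boolean answer. Both function names and the return shape are assumptions made for illustration; the toy rows reuse names and URLs that appear in the dataset.

```python
import pandas as pd


def url_prefixes(df: pd.DataFrame, race_name: str) -> set:
    """Set of URL prefixes (text before the first '/') used by a race name."""
    urls = df.loc[df["name"] == race_name, "_url"]
    return set(urls.str.split("/").str[0])


def share_url_prefix(df: pd.DataFrame, name_a: str, name_b: str) -> tuple:
    """Hypothetical stand-in for the helper: (same race?, common prefixes)."""
    common = url_prefixes(df, name_a) & url_prefixes(df, name_b)
    return bool(common), common


# Toy frame with a handful of (name, _url) pairs
toy = pd.DataFrame({
    "_url": [
        "san-sebastian/2016/result",
        "san-sebastian/2017/result",
        "amstel-gold-race/2018/result",
    ],
    "name": [
        "Clasica Ciclista San Sebastian",
        "Clásica Ciclista San Sebastian",
        "Amstel Gold Race",
    ],
})

print(share_url_prefix(toy, "Clasica Ciclista San Sebastian", "Clásica Ciclista San Sebastian"))
print(share_url_prefix(toy, "Clasica Ciclista San Sebastian", "Amstel Gold Race"))
```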

In [None]:
# Initialize a list to store pairs of races that are actually the same
same_races = []

# Iterate through all pairs of race names
for i in range(len(race_names)):
    for j in range(i + 1, len(race_names)):
        race1 = race_names[i]
        race2 = race_names[j]
        # Use the check_if_same function to compare the races
        try:
            same = check_if_same(race1, race2)[0]
            if same:
                same_races.append((race1, race2))
        except TypeError:
            print(f"Caught error at races {race_names[i]} and {race_names[j]}")
        

# Print the pairs of races that are actually the same
for race1, race2 in same_races:
    print(f"The races '{race1}' and '{race2}' are actually the same.")

In [None]:
# Each pair (i, j) with j > i is visited once, so no division by 2 is needed
print(f"Allegedly, {len(same_races)} pairs of race names refer to the same race, out of {len(race_names)} distinct names")

It looks like many races are the same but changed name over the years. To confirm this, one should of course check the data more carefully against multiple sources.

In [None]:
races_df.groupby("name")['_url'].unique()

name
Amstel Gold Race                      [amstel-gold-race/2018/result, amstel-gold-rac...
Clasica Ciclista San Sebastian        [san-sebastian/2016/result, san-sebastian/2006...
Clásica Ciclista San Sebastian                              [san-sebastian/2017/result]
Clásica Ciclista San Sebastián        [san-sebastian/2019/result, san-sebastian/1990...
Clásica San Sebastián                 [san-sebastian/1981/result, san-sebastian/1982...
                                                            ...                        
Vuelta Ciclista al País Vasco         [itzulia-basque-country/2012/stage-1, itzulia-...
Vuelta a España                       [vuelta-a-espana/2016/stage-14, vuelta-a-espan...
Vuelta al País Vasco                  [itzulia-basque-country/2007/stage-3, itzulia-...
World Championships - Road Race       [world-championship/1996/result, world-champio...
World Championships ME - Road Race    [world-championship/2002/result, world-champio...
Name: _url, Length: 61, dty

Indeed, URLs that share the same first part (i.e. the same race) are recorded under different, although very similar, names.
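One way to list every race that appears under more than one name is to group by the first part of the URL instead of by name. The sketch below assumes that the text before the first `/` in `_url` identifies the race; the `race_id` column is a name introduced here for illustration, and the toy rows reuse names and URLs from the output above.

```python
import pandas as pd

# Toy frame with a handful of (name, _url) pairs
toy = pd.DataFrame({
    "_url": [
        "san-sebastian/2016/result",
        "san-sebastian/2017/result",
        "san-sebastian/2019/result",
        "amstel-gold-race/2018/result",
    ],
    "name": [
        "Clasica Ciclista San Sebastian",
        "Clásica Ciclista San Sebastian",
        "Clásica Ciclista San Sebastián",
        "Amstel Gold Race",
    ],
})

# race_id (assumed helper column): the part of _url before the first '/'
toy["race_id"] = toy["_url"].str.split("/").str[0]

# All distinct names used for each race identifier
names_per_race = toy.groupby("race_id")["name"].unique()

# Keep only the races recorded under more than one name
renamed = names_per_race[names_per_race.apply(len) > 1]
print(renamed)
```

A column like `race_id` could then serve as a canonical race key in place of `name`, if the assumption about the URL structure holds across the whole dataset.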

#### Missing Data

In [None]:
tot_zero_delta = races_df[(races_df['position'] > 0) & (races_df['delta'] == 0)].shape[0]
print('Number of records with position > 0 and delta = 0', tot_zero_delta)
print(f'Percentage of times delta was not recorded: {tot_zero_delta / races_df.shape[0] * 100:.2f}%')

A lot of $0.0$ values. We cannot be sure whether these are NaNs in disguise or whether the calculations/conversions the delta went through before entering this dataset were simply not precise enough. In a photo finish, deltas are usually well below the first decimal digit.
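A possible way to narrow this down is to look at the share of zero deltas per stage rather than overall. The heuristic below is an assumption, not something established by the data: a stage where every rider behind the winner has a delta of 0 looks more like an unrecorded gap, whereas a few zeros near the front could simply be riders credited with the winner's time. The sketch uses made-up rows and hypothetical URLs.

```python
import pandas as pd

# Toy data: hypothetical stages, not taken from races.csv
toy = pd.DataFrame({
    "_url": ["race-a/2000/stage-1"] * 4 + ["race-b/2001/result"] * 4,
    "position": [0, 1, 2, 3, 0, 1, 2, 3],
    "delta": [0.0, 0.0, 12.0, 45.0, 0.0, 0.0, 0.0, 0.0],
})

# Rows behind the winner (position 0 is the winner in this dataset)
chasers = toy[toy["position"] > 0]

# Fraction of zero deltas among the chasers, per stage
zero_share = chasers["delta"].eq(0).groupby(chasers["_url"]).mean()

# Stages where no chaser has a non-zero delta: candidates for unrecorded deltas
suspicious = zero_share[zero_share == 1.0].index.tolist()
print(zero_share)
print(suspicious)
```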