# Checks of cyclists data

In [2]:
# Autoreload allows the notebook to dynamically load code: if we update some helper functions *outside* of the notebook, we do not need to reload the notebook.
%load_ext autoreload
%autoreload 2

In [3]:
import pandas as pd
import numpy as np
import re
import unicodedata

We load the dataset from a CSV file and display the first few rows to get an initial understanding of the data. This helps us verify that the data has been loaded correctly and gives us a glimpse of its structure and contents.

In [4]:
csv_file = "../data/cyclists.csv"
dataset = pd.read_csv(csv_file)
dataset.head()

Unnamed: 0,_url,name,birth_year,weight,height,nationality
0,bruno-surra,Bruno Surra,1964.0,,,Italy
1,gerard-rue,Gérard Rué,1965.0,74.0,182.0,France
2,jan-maas,Jan Maas,1996.0,69.0,189.0,Netherlands
3,nathan-van-hooydonck,Nathan Van Hooydonck,1995.0,78.0,192.0,Belgium
4,jose-felix-parra,José Félix Parra,1997.0,55.0,171.0,Spain


## Initial Info

Now we provide a concise summary of the DataFrame, including the number of non-null entries, data types of each column, and memory usage. It helps us quickly identify missing values and understand the overall structure of the dataset.

In [5]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6134 entries, 0 to 6133
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   _url         6134 non-null   object 
 1   name         6134 non-null   object 
 2   birth_year   6121 non-null   float64
 3   weight       3078 non-null   float64
 4   height       3143 non-null   float64
 5   nationality  6133 non-null   object 
dtypes: float64(3), object(3)
memory usage: 287.7+ KB


We also generate descriptive statistics for the numerical columns of the DataFrame. The summary includes the count, mean, standard deviation, minimum, and maximum, as well as the 25th, 50th, and 75th percentiles, which helps us understand the distribution and central tendency of the data.

In [6]:
dataset.describe()

Unnamed: 0,birth_year,weight,height
count,6121.0,3078.0,3143.0
mean,1974.071884,68.658739,179.815145
std,15.535834,6.348183,6.443447
min,1933.0,48.0,154.0
25%,1962.0,64.0,175.0
50%,1974.0,69.0,180.0
75%,1987.0,73.0,184.0
max,2004.0,94.0,204.0


We use the `value_counts()` method to count the occurrences of each unique value in a given column of the DataFrame. The calls below are commented out; uncomment one to inspect that column.

In [7]:
# Count the number of occurrences of each value in every column
#dataset['_url'].value_counts()
#dataset['name'].value_counts()
#dataset['birth_year'].value_counts()
#dataset['weight'].value_counts()
#dataset['height'].value_counts()
#dataset['nationality'].value_counts()
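
The null-count and `value_counts()` checks that are repeated column by column in the following sections can also be run for all columns at once. The snippet below is a minimal sketch on a toy frame with invented values; `sample` and `null_summary` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Toy frame with invented values, used only for illustration.
sample = pd.DataFrame({
    "_url": ["rider-a", "rider-b", "rider-c"],
    "weight": [float("nan"), 74.0, 69.0],
    "nationality": ["Italy", "France", "Italy"],
})

def null_summary(df):
    """Return the null count and null percentage of every column."""
    nulls = df.isnull().sum()
    return pd.DataFrame({
        "nulls": nulls,
        "percent": (nulls / len(df) * 100).round(2),
    })

summary = null_summary(sample)
print(summary)

# value_counts() sorts by frequency, most common value first,
# and ignores null values by default.
nationality_counts = sample["nationality"].value_counts()
print(nationality_counts)
```

On the real data, `null_summary(dataset)` should give the same figures as the per-column `print` statements used below.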

## Check on '_url' data

We start with the `_url` column: we check the number of null values and count the occurrences of each unique value.

In [13]:
print('Total number of null values in _url column: ' + str(dataset['_url'].isnull().sum())
      + ' (' + str(round(dataset['_url'].isnull().sum() / len(dataset) * 100, 2)) + '%)')

print('\nCount occurrences of each value in _url column:')
url_counts = dataset['_url'].value_counts()
print(url_counts)

Total number of null values in _url column: 0 (0.0%)

Count occurrences of each value in _url column:
_url
ward-vanhoof            1
bruno-surra             1
gerard-rue              1
jan-maas                1
nathan-van-hooydonck    1
                       ..
stian-remme             1
scott-davies            1
joost-van-leijen        1
chad-haga               1
willy-moonen            1
Name: count, Length: 6134, dtype: int64


There are many distinct values and no null values. The count series has length 6134, equal to the number of rows, so every `_url` is unique.

Since there are so many distinct values, we check whether each one is syntactically correct.

In this block we split each `_url` and `name` into components and compare them to find discrepancies.

In [20]:
# For each row, split '_url' and 'name' into components and compare them
i=0
for index, row in dataset.iterrows():
    if not pd.isnull(row['_url']):
        url = row['_url'].lower().split('-')

        # Normalize the name to ASCII-only characters
        norm_name = re.sub(r'[ł]', 'l', row['name'].lower())
        norm_name = unicodedata.normalize('NFKD', norm_name).encode('ASCII', 'ignore').decode('utf-8')
        name = re.split(r'\s+', norm_name)
        
        if url != name:
            i+=1
            print(row['_url'], row['name'])
print('Tot: ' + str(i))

graeme-brown Graeme Allen  Brown
jean-claude-theilliere Jean-Claude  Theillière
jesus-rodriguez-rodriguez Jesús  Rodríguez
iban-herrero-atienzar Ivan  Herrero
juan-ayuso-pesquera Juan  Ayuso
anselmo-fuerte-abelenda Anselmo  Fuerte
luis-ricardo-mesa-saavedra Luis Ricardo  Mesa
joseba-lopez-cuesta Joseba  López
yecid-sierra Yecid Arturo  Sierra
edgar-humberto-ruiz Edgar-humberto  Ruiz
manuel-mesa José Manuel  Mesa
willy-in-t-ven Willy In 't Ven
marlon-alirio-perez-arango Marlon Alirio  Pérez
jean-pierre-berckmans Jean-Pierre  Berckmans
iban-sastre-estevez Iban  Sastre
gonzalo-rabunal-rios Gonzalo  Rabuñal
jean-jacques-philipp Jean-Jacques  Philipp
jean-pierre-monsere Jean-Pierre  Monseré
jose-alberto-benitez-roman José Alberto  Benítez
patrick-busolini2 Patrick  Busolini
manuel-cardoso Manuel Antonio Leal  Cardoso
syver-waersted Syver  Wærsted
fernando-mendes Fernando dos Reis Dias  Mendes
tom-jelte-slagter Tom-Jelte  Slagter
inaki-isasi-flores Iñaki  Isasi
juan-martinez-oliver Juan  Mar
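
Several of the mismatches above are caused only by hyphens inside the name (for example `Jean-Claude`), which the whitespace split keeps as a single component. The sketch below is one possible variant of the comparison, not the notebook's method: it also splits the name on hyphens. `name_tokens` and `url_matches_name` are hypothetical helper names.

```python
import re
import unicodedata

def name_tokens(name):
    """Lowercase a name, strip accents, and split it into components."""
    lowered = name.lower().replace("ł", "l")
    ascii_name = (
        unicodedata.normalize("NFKD", lowered)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Assumption added here: split on hyphens as well as whitespace.
    return [token for token in re.split(r"[\s-]+", ascii_name) if token]

def url_matches_name(url, name):
    """Return True when the slug components equal the name components."""
    return url.lower().split("-") == name_tokens(name)

print(url_matches_name("jean-claude-theilliere", "Jean-Claude  Theillière"))
print(url_matches_name("graeme-brown", "Graeme Allen  Brown"))
```

With this variant, hyphenated first names no longer count as discrepancies, while cases with an extra middle name or a missing second surname are still reported. Letters without a Unicode decomposition, such as `æ` in `Wærsted`, are dropped by the ASCII encoding step and would need an explicit replacement like the one used for `ł`.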

Now we check whether each `_url` contains any character other than letters (after removing hyphens).

In [10]:
# For each row, check if '_url' contains any character that is not a letter
i=0
for index, row in dataset.iterrows():
    if not pd.isnull(row['_url']):
            # Delete '-' character from '_url' object
            url = row['_url'].replace('-', '')

            if not url.isalpha():
                i+=1
                print(row['_url'])

print("Tot: ", i)

patrick-busolini2
andrea-peron-1
hans-kanel2
peter-williams-1
juan-garcia2
anders-lund-1
pedro-pinto2
raimondas-rumsas-1
jesper-hansen-1
alessandro-fantini-1
romain-gregoire1
marco-zanotti-1
daniel-lloyd1
jose-antonio-747
kenny-van-braeckel2
alessandro-pozzi2
pino-cerami-2
johan-wellens2
francisco-martin-826
ricardo-martinez2
leon-paul-menard2
benjamin-king-1
hans-dekkers-1
kevin-neirynck-1
antonio-cruz-1
jesus-lopez23
filippo-colombo1
jose-manuel-garcia-114
andrea-zatti-1
marco-cattaneo-2
francisco-lopez-393
michel-nottebart-2
jan-svorada-1
jesus-hernandez-3
israel-nunez-1
philippe-durel2
daniel-jimenez1
georges-claes2
filippo-magli2
frederic-brun-2
manuel-rodriguez-305
stefan-van-dijk-1
luke-roberts-1
benjamin-thomas-2
wim-vervoort-1
connor-brown2
alvaro-sierra-1
william-frischkorn-1
romain-maes2
marc-van-den-brande2
miguel-martinez-1
manuel-martinez-1
Tot:  52
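
All the flagged values end with a numeric suffix, with or without a separating hyphen. As a sketch of how the same check could be written without a row loop, the snippet below uses pandas string methods on a few slugs taken from the output above. Note that it flags digits only, whereas `isalpha()` flags any non-letter character. The names `urls`, `flagged`, and `parts` are illustrative.

```python
import pandas as pd

urls = pd.Series(
    ["patrick-busolini2", "andrea-peron-1", "andrea-peron", "jose-antonio-747"]
)

# Boolean mask: True where the slug contains at least one digit.
has_digit = urls.str.contains(r"\d", regex=True)
flagged = urls[has_digit]
print(flagged.tolist())

# Split each flagged slug into its base and its numeric suffix.
parts = flagged.str.extract(r"^(?P<base>.*?)-?(?P<suffix>\d+)$")
print(parts)
```

Separating the base from the suffix makes it easy to look up whether another row uses the same base slug, which would suggest that the suffix distinguishes two cyclists with the same name.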


## Check on 'name' data

Now we consider the `name` column: we check the number of null values and count the occurrences of each unique value.

In [12]:
print('Total number of null values in name column: ' + str(dataset['name'].isnull().sum())
        + ' (' + str(round(dataset['name'].isnull().sum() / len(dataset) * 100, 2)) + '%)')

print('\nCount occurrences of each value in name column:')
name_counts = dataset['name'].value_counts()
print(name_counts)   

Total number of null values in name column: 0 (0.0%)

Count occurrences of each value in name column:
name
Sergio  Domínguez       2
Alberto  Fernández      2
Jesús  López            2
Antonio  Cabello        2
Alessandro  Pozzi       2
                       ..
Juan José  Martínez     1
Iñigo  Elosegui         1
Paolo  Alberati         1
Jackson  Rodríguez      1
Jean-Philippe  Dojwa    1
Name: count, Length: 6127, dtype: int64


There are many distinct values and no null values. The count series has length 6127 against 6134 rows, so a few names appear more than once.

Since there are so many distinct values, we check whether each one is syntactically correct.

In this block we check whether any `name` contains numeric characters.

In [11]:
# For each row, check if 'name' contains any digit
for index, row in dataset.iterrows():
    if not pd.isnull(row['name']):
        if any(char.isdigit() for char in row['name']):
            print(row['_url'], row['name'])

In this block we list, for each `name` that appears more than once, the corresponding `_url` values.

In [101]:
# Save names with at least two occurrences
temp = dataset['name'].value_counts()
temp = temp[temp > 1]

# Print urls of names with at least two occurrences
for name in temp.index:
    urls = dataset[dataset['name'] == name]['_url']
    print(name, urls, '\n')


Jesús  López 2939         jesus-lopez23
5040    jesus-lopez-carril
Name: _url, dtype: object 

Roman  Kreuziger 1745    roman-kreuziger-sr
2601       roman-kreuziger
Name: _url, dtype: object 

Alberto  Fernández 2953     alberto-fernandez-sainz
5720    alberto-fernandez-blanco
Name: _url, dtype: object 

Antonio  Cabello 2862    antonio-cabello-baena
3238          antonio-cabello
Name: _url, dtype: object 

Andrea  Peron 347     andrea-peron-1
2682      andrea-peron
Name: _url, dtype: object 

Sergio  Domínguez 4917    sergio-dominguez-rodriguez
4919        sergio-dominguez-munoz
Name: _url, dtype: object 

Alessandro  Pozzi 2235    alessandro-pozzi2
5722     alessandro-pozzi
Name: _url, dtype: object 
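
The printed Series objects above are hard to scan. A more compact listing can be obtained with `duplicated` and `groupby`; the following is a sketch on a few rows copied from the output above, with `riders` and `urls_by_name` as illustrative names.

```python
import pandas as pd

riders = pd.DataFrame({
    "_url": ["andrea-peron-1", "andrea-peron", "roman-kreuziger-sr",
             "roman-kreuziger", "jan-maas"],
    "name": ["Andrea  Peron", "Andrea  Peron", "Roman  Kreuziger",
             "Roman  Kreuziger", "Jan Maas"],
})

# keep=False marks every row of a duplicated name, not only the repeats.
duplicated = riders[riders.duplicated("name", keep=False)]

# One list of urls per duplicated name.
urls_by_name = duplicated.groupby("name")["_url"].apply(list)
print(urls_by_name)
```

The result has one row per repeated name, so homonyms such as the two `Roman  Kreuziger` entries can be reviewed side by side.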



## Check on 'birth_year' data

Now we consider the `birth_year` column: we check the number of null values and count the occurrences of each unique value.

In [14]:
print('Total number of null values in birth_year column: ' + str(dataset['birth_year'].isnull().sum())
        + ' (' + str(round(dataset['birth_year'].isnull().sum() / len(dataset) * 100, 2)) + '%)')

print('\nCount occurrences of each value in birth_year column:')
birth_year_counts = dataset['birth_year'].value_counts()
print(birth_year_counts)

Total number of null values in birth_year column: 13 (0.21%)

Count occurrences of each value in birth_year column:
birth_year
1964.0    145
1962.0    141
1970.0    140
1974.0    138
1980.0    133
         ... 
1937.0      4
1934.0      2
1938.0      2
1933.0      1
1936.0      1
Name: count, Length: 71, dtype: int64


There are 71 distinct values and a few null values (13).

Since there are several distinct values, we check whether each one is syntactically correct.

In this block we check whether any `birth_year` value does not end with '.0', i.e. is not a whole number.

In [10]:
# For each row, check if 'birth_year' ends with '.0'
for index, row in dataset.iterrows():
    if not pd.isnull(row['birth_year']):
        if not str(row['birth_year']).endswith('.0'):
            print(row['_url'], row['birth_year'])

In this block we check whether any `birth_year` value is not of the form 'nnnn', or does not start with '19' or '20'.

In [11]:
# For each row, check if 'birth_year' is in the form 'nnnn'
for index, row in dataset.iterrows():
    if not pd.isnull(row['birth_year']):
        # Delete '.0'  from 'birth_year' object
        year = str(row['birth_year']).replace('.0', '')

        # Check if 'birth_year' float64 is a number and if it's in form 19nn or 20nn
        if not year.isdigit() or not (year.startswith('19') or year.startswith('20')):
            print(row['_url'], row['birth_year'])
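
Since `birth_year` is stored as `float64`, the two string-based checks above can also be expressed numerically. The snippet below is a sketch of that alternative, not the notebook's code; the values `1975.5` and `1850.0` are invented to show what a failing row would look like.

```python
import pandas as pd

# 1975.5 and 1850.0 are invented invalid values for illustration.
birth_year = pd.Series([1964.0, 1996.0, float("nan"), 2004.0, 1975.5, 1850.0])

# Ignore nulls, as the loops above do.
known = birth_year.dropna()

# Equivalent of "does not end with '.0'": the value has a fractional part.
non_integer = known[known % 1 != 0]

# Equivalent of "does not start with '19' or '20'" for four-digit years.
out_of_range = known[~known.between(1900, 2099)]

print(non_integer.tolist())
print(out_of_range.tolist())
```

Working on numbers avoids depending on how floats are rendered as strings.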

We inspect the cyclists whose `birth_year` is unusually small or large, looking for possible outliers.

In [30]:
# Dataset info, for 'birth_year' column
dataset['birth_year'].describe()

count    6121.000000
mean     1974.071884
std        15.535834
min      1933.000000
25%      1962.000000
50%      1974.000000
75%      1987.000000
max      2004.000000
Name: birth_year, dtype: float64

In [31]:
n = 1935
# Get rows where 'birth_year' is smaller than n
filtered_data = dataset[dataset['birth_year'] < n]

print('Rows where birth year is smaller than ' + str(n) + ':')
for index, row in filtered_data.iterrows():
    # Print '_url' and the corresponding 'birth_year'
    print(index, row['_url'], row['birth_year'])

Rows where birth year is smaller than 1935:
398 rik-van-looy 1933.0
1578 hans-junkermann 1934.0
1972 aldo-moser 1934.0


In [33]:
n = 2000
# Get data where 'birth_year' is greater than n
filtered_data = dataset[dataset['birth_year'] > n]

print('Rows where birth year is greater than ' + str(n) + ':')
for index, row in filtered_data.iterrows():
    # Print '_url' and the corresponding 'birth_year'
    print(index, row['_url'], row['birth_year'])

Rows where birth year is greater than 2000:
38 thomas-gloag 2001.0
80 thibau-nys 2002.0
88 juan-ayuso-pesquera 2002.0
156 hugo-page 2001.0
164 marco-brenner 2002.0
175 pierre-gautherat 2003.0
187 antonio-tiberi 2001.0
295 lennert-van-eetvelt 2001.0
296 lars-boven 2001.0
336 laurence-pithie 2002.0
355 loe-van-belle 2002.0
459 ben-tulett 2001.0
503 max-walker 2001.0
664 matthew-riccitello 2002.0
729 santiago-umba 2002.0
806 joshua-tarling 2004.0
855 lorenzo-germani 2002.0
920 martin-svrcek 2003.0
928 embret-svestad-bardseng 2002.0
986 milan-fretin 2001.0
1035 alejandro-franco-gonzalez 2001.0
1090 mateu-estelrich 2001.0
1181 xabier-isasa-larranaga 2001.0
1219 frederik-wandahl 2001.0
1222 alessio-nieri 2001.0
1367 lewis-askey 2001.0
1388 lorenzo-milesi 2002.0
1633 romain-gregoire1 2003.0
1677 samuel-watson 2001.0
1685 kevin-vauquelin 2001.0
1712 olav-kooij 2001.0
1795 jan-christen 2004.0
1885 dries-de-pooter 2002.0
2096 hugo-toumire 2001.0
2168 igor-arrieta-lizarraga 2002.0
2232 luke-lampe

## Check on 'weight' data

Now we consider the `weight` column: we check the number of null values and count the occurrences of each unique value.

In [15]:
print('Total number of null values in weight column: ' + str(dataset['weight'].isnull().sum())
        + ' (' + str(round(dataset['weight'].isnull().sum() / len(dataset) * 100, 2)) + '%)')

print('\nCount occurrences of each value in weight column:')
weight_counts = dataset['weight'].value_counts()
print(weight_counts)

Total number of null values in weight column: 3056 (49.82%)

Count occurrences of each value in weight column:
weight
70.0    272
68.0    219
65.0    193
67.0    177
72.0    169
69.0    162
73.0    146
63.0    140
66.0    139
64.0    137
74.0    135
62.0    131
75.0    128
71.0    125
60.0     98
61.0     90
78.0     86
77.0     67
58.0     64
76.0     63
80.0     53
59.0     49
79.0     30
82.0     26
55.0     25
81.0     22
83.0     20
57.0     20
56.0     19
85.0     10
53.0      7
52.0      6
84.0      6
54.0      4
51.0      4
90.0      4
87.0      3
88.0      3
63.5      2
89.0      2
50.0      2
58.5      2
86.0      2
71.5      1
48.0      1
91.0      1
67.5      1
66.5      1
78.1      1
77.5      1
74.5      1
81.4      1
62.5      1
93.0      1
73.5      1
79.5      1
65.1      1
92.0      1
94.0      1
Name: count, dtype: int64


There are many distinct values, but also a lot of null values (49.82%). All the values shown are syntactically correct.

We inspect the cyclists whose `weight` is unusually small or large, looking for possible outliers.

In [25]:
# Dataset info, for 'weight' column
dataset['weight'].describe()

count    3078.000000
mean       68.658739
std         6.348183
min        48.000000
25%        64.000000
50%        69.000000
75%        73.000000
max        94.000000
Name: weight, dtype: float64

In [28]:
n = 50
# Get rows where 'weight' is smaller than n
filtered_data = dataset[dataset['weight'] < n]

print('Rows where weight is smaller than ' + str(n) + ':')
for index, row in filtered_data.iterrows():
    # Print '_url' and the corresponding 'weight'
    print(index, row['_url'], row['weight'])

Rows where weight is smaller than 50:
1375 jose-humberto-rujano 48.0


In [29]:
n = 90
# Get data where 'weight' is greater than n
filtered_data = dataset[dataset['weight'] > n]

print('Rows where weight is greater than ' + str(n) + ':')
for index, row in filtered_data.iterrows():
    # Print '_url' and the corresponding 'weight'
    print(index, row['_url'], row['weight'])

Rows where weight is greater than 90:
292 jens-mouris 91.0
4554 gerrit-solleveld 93.0
5490 soren-waerenskjold 92.0
5913 magnus-backstedt 94.0
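
The same "smaller than / greater than a threshold" filter is applied to `birth_year`, `weight`, and `height`. One way to avoid repeating it is a small helper, sketched below on rows taken from the outputs above. `rows_outside` is a hypothetical name, and the bounds 50 and 90 are the hand-picked thresholds already used in this section.

```python
import pandas as pd

riders = pd.DataFrame({
    "_url": ["gerard-rue", "jose-humberto-rujano", "bruno-surra",
             "jens-mouris", "magnus-backstedt"],
    "weight": [74.0, 48.0, float("nan"), 91.0, 94.0],
})

def rows_outside(df, column, low, high):
    """Return '_url' and `column` for rows strictly below low or above high."""
    # Comparisons with NaN are False, so rows with a null value are skipped.
    mask = (df[column] < low) | (df[column] > high)
    return df.loc[mask, ["_url", column]]

flagged = rows_outside(riders, "weight", 50, 90)
print(flagged)
```

The returned frame keeps the original index, so the row numbers match those printed by the loops above.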


## Check on 'height' data

Now we consider the `height` column: we check the number of null values and count the occurrences of each unique value.

In [16]:
print('Total number of null values in height column: ' + str(dataset['height'].isnull().sum())
        + ' (' + str(round(dataset['height'].isnull().sum() / len(dataset) * 100, 2)) + '%)')

print('\nCount occurrences of each value in height column:')
height_counts = dataset['height'].value_counts()
print(height_counts)

Total number of null values in height column: 2991 (48.76%)

Count occurrences of each value in height column:
height
180.0    277
178.0    226
183.0    193
181.0    181
175.0    169
182.0    165
185.0    161
176.0    154
184.0    152
179.0    137
177.0    133
174.0    129
173.0    120
186.0    107
190.0     97
170.0     90
187.0     85
172.0     80
188.0     77
171.0     67
189.0     48
169.0     46
191.0     37
192.0     34
168.0     24
167.0     23
193.0     22
164.0     20
194.0     17
195.0     13
165.0     13
196.0      7
197.0      6
166.0      6
198.0      4
160.0      4
162.0      3
159.0      3
161.0      2
199.0      2
163.0      2
154.0      1
204.0      1
155.0      1
158.0      1
202.0      1
157.0      1
200.0      1
Name: count, dtype: int64


There are many distinct values, but also a lot of null values (48.76%). All the values shown are syntactically correct.

We inspect the cyclists whose `height` is unusually small or large, looking for possible outliers.

In [22]:
# Dataset info, for 'height' column
dataset['height'].describe()

count    3143.000000
mean      179.815145
std         6.443447
min       154.000000
25%       175.000000
50%       180.000000
75%       184.000000
max       204.000000
Name: height, dtype: float64

In [23]:
n = 160
# Get rows where 'height' is smaller than n
filtered_data = dataset[dataset['height'] < n]

print('Rows where height is smaller than ' + str(n) + ':')
for index, row in filtered_data.iterrows():
    # Print '_url' and the corresponding 'height'
    print(index, row['_url'], row['height'])

Rows where height is smaller than 160:
720 vicente-belda 154.0
840 samuel-dumoulin 159.0
969 jorge-ferrio 159.0
1818 alfons-de-bal 155.0
2720 mulu-kinfe-hailemichael 158.0
3994 wladimiro-panizza 157.0
4973 masatoshi-ichikawa 159.0


In [24]:
n = 200
# Get data where 'height' is greater than n
filtered_data = dataset[dataset['height'] > n]

print('Rows where height is greater than ' + str(n) + ':')
for index, row in filtered_data.iterrows():
    # Print '_url' and the corresponding 'height'
    print(index, row['_url'], row['height'])

Rows where height is greater than 200:
679 conor-dunne 204.0
3956 mathias-norsgaard 202.0


## Check on 'weight' and 'height' data

Now we consider the `weight` and `height` columns together and check the combinations of null values.

In [17]:
# For each row, check whether 'height' is present when 'weight' is null
i=0
for index, row in dataset.iterrows():
    if pd.isnull(row['weight']) and not pd.isnull(row['height']):
        i+=1
        print(row['_url'], row['height'])
print("Tot: ", i)

idar-andersen 182.0
thomas-bonnet 175.0
syver-waersted 193.0
loe-van-belle 184.0
negasi-abreha 186.0
davide-baldaccini 176.0
pier-andre-cote 181.0
valere-thiebaud 184.0
louis-bendixen 190.0
robin-carpenter 178.0
vicente-belda 154.0
emanuel-duarte 182.0
embret-svestad-bardseng 185.0
joaquim-adrego-pereira-andrade 175.0
jacob-eriksson 182.0
xabier-isasa-larranaga 185.0
ivan-quaranta 174.0
luis-ocana 178.0
alfons-de-bal 155.0
adam-de-vos 187.0
yanto-barker 182.0
umberto-poli 167.0
dario-lillo 185.0
miguel-heidemann 186.0
torstein-traeen 181.0
bogdan-bondariew 196.0
anders-halland-johannessen 176.0
christian-raymond 178.0
max-poole 185.0
andrzej-mierzejewski 173.0
venceslau-fernandes 181.0
sandy-dujardin 178.0
oliver-rees 184.0
charly-mottet 164.0
johannes-staune-mittet 182.0
rob-britton 188.0
jonas-iversby-hvideberg 185.0
sergey-kolesnikov 179.0
sebastian-kolze-changizi 181.0
vadim-kravchenko 178.0
libardo-nino-corredor 182.0
joan-bou 182.0
wladimiro-panizza 157.0
enekoitz-azparren-irurzu

In [18]:
# For each row, check whether 'weight' is present when 'height' is null
i=0
for index, row in dataset.iterrows():
    if pd.isnull(row['height']) and not pd.isnull(row['weight']):
        i+=1
        print(row['_url'], row['weight'])
print("Tot: ", i)

mario-de-sarraga 69.0
frank-hoste 76.0
davide-orrico 70.0
nicolas-dalla-valle 73.0
nils-brun 64.0
yannis-voisard 56.0
jean-nuttli 70.0
Tot:  7


In [19]:
# For each row, check whether both 'weight' and 'height' are null
i=0
for index, row in dataset.iterrows():
    if pd.isnull(row['height']) and pd.isnull(row['weight']):
        i+=1
        print(row['_url'])
print("Tot: ", i)

bruno-surra
willy-moonen
scott-davies
stian-remme
evgueny-anachkine
maurizio-biondo
patrice-thevenard
luc-suykerbuyk
alain-gallopin
urs-graf
jean-claude-theilliere
paul-kimmage
alejandro-paleo
samuel-blanco
sean-sullivan
gaetano-baronchelli
julien-mazet
roberto-giucolsi
leonardo-guidi
aitor-alonso
noel-vanclooster
jesus-rodriguez-rodriguez
noan-lelarge
eric-van-lancker
luca-maggioni
christian-muselet
marc-siemons
theo-smit
luigi-sestili
iban-herrero-atienzar
jules-bruessing
theo-eltink
willy-vigouroux
venancio-teran
valentin-dorronsoro
eddy-verstraeten
paulo-jose-dos-santos-ferreira
wim-omloop
sascha-henrix
manuele-tarozzi
roy-knickman
luigi-della-bianca
jannes-slendebroek
primoz-cerin
francisco-leon
anselmo-fuerte-abelenda
nic-hamilton
alain-van-den-bossche
hubert-arbes
peter-van-de-knoop
louis-de-koning
harm-ottenbros
stephan-joho
miroslav-uryga
luis-ricardo-mesa-saavedra
marco-villa
jacques-jolidon
dirk-wayenberg
jose-salvador-sanchis
giulio-tomi
claudio-cerri
salvatore-scamardella
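
The three loops above count the combinations of missing `weight` and `height` one at a time. As a sketch, the same counts can be obtained from boolean masks, and `pd.crosstab` shows all combinations in a single table. The toy frame uses rows taken from the outputs above; `body` and `counts` are illustrative names.

```python
import pandas as pd

nan = float("nan")
body = pd.DataFrame({
    "_url": ["bruno-surra", "gerard-rue", "vicente-belda", "frank-hoste"],
    "weight": [nan, 74.0, nan, 76.0],
    "height": [nan, 182.0, 154.0, nan],
})

w_missing = body["weight"].isna()
h_missing = body["height"].isna()

# One count per combination checked by the loops above.
counts = {
    "only weight missing": int((w_missing & ~h_missing).sum()),
    "only height missing": int((~w_missing & h_missing).sum()),
    "both missing": int((w_missing & h_missing).sum()),
}
print(counts)

# The same information as a 2x2 contingency table.
print(pd.crosstab(w_missing, h_missing,
                  rownames=["weight missing"], colnames=["height missing"]))
```

The masks can also be used to select the rows themselves, for example `body.loc[w_missing & ~h_missing, "_url"]`.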


## Check on 'nationality' data

Now we consider the `nationality` column: we check the number of null values and count the occurrences of each unique value.

In [21]:
print('Total number of null values in nationality column: ' + str(dataset['nationality'].isnull().sum())
        + ' (' + str(round(dataset['nationality'].isnull().sum() / len(dataset) * 100, 2)) + '%)')

print('\nCount occurrences of each value in nationality column:')
nationality_counts = dataset['nationality'].value_counts()
print(nationality_counts)

Total number of null values in nationality column: 1 (0.02%)

Count occurrences of each value in nationality column:
nationality
Italy                 1029
Spain                  872
Belgium                869
France                 741
Netherlands            380
                      ... 
Dominican Republic       1
Liechtenstein            1
Zimbabwe                 1
Puerto Rico              1
Hongkong                 1
Name: count, Length: 72, dtype: int64


There are 72 distinct values and just one null value.

Since there are many distinct values, we check whether each one is syntactically correct.

In this block we check whether any `nationality` value contains a digit.

In [128]:
# For each row, check if 'nationality' contains any digit
for index, row in dataset.iterrows():
    if not pd.isnull(row['nationality']):
        if any(char.isdigit() for char in row['nationality']):
            print(row['_url'], row['nationality'])
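
The digit check only catches one kind of malformed value. A stricter option, sketched below as an assumption rather than as part of the original analysis, is to accept only letters and spaces and flag everything else. The value `Ital1` is invented to show a failing case, and the allowed pattern may need extending for names that contain other characters.

```python
import pandas as pd

# 'Ital1' is an invented malformed value; the others appear in the output above.
nationality = pd.Series(
    ["Italy", "Dominican Republic", None, "Hongkong", "Ital1"]
)

# Ignore nulls, then keep the values that are not made of letters and spaces only.
known = nationality.dropna()
suspicious = known[~known.str.fullmatch(r"[A-Za-z ]+")]
print(suspicious.tolist())
```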