# Checks of cyclists data

In [2]:
# Autoreload allows the notebook to dynamically load code: if we update some helper functions *outside* of the notebook, we do not need to reload the notebook.
%load_ext autoreload
%autoreload 2

In [3]:
import pandas as pd
import numpy as np
import re
import unicodedata

We load the dataset from a CSV file and display the first few rows to get an initial understanding of the data. This helps us verify that the data has been loaded correctly and gives us a glimpse of its structure and contents.

In [4]:
csv_file = "../data/cyclists.csv"
dataset = pd.read_csv(csv_file)
dataset.head()

Unnamed: 0,_url,name,birth_year,weight,height,nationality
0,bruno-surra,Bruno Surra,1964.0,,,Italy
1,gerard-rue,Gérard Rué,1965.0,74.0,182.0,France
2,jan-maas,Jan Maas,1996.0,69.0,189.0,Netherlands
3,nathan-van-hooydonck,Nathan Van Hooydonck,1995.0,78.0,192.0,Belgium
4,jose-felix-parra,José Félix Parra,1997.0,55.0,171.0,Spain


## Initial Info

Now we provide a concise summary of the DataFrame, including the number of non-null entries, data types of each column, and memory usage. It helps us quickly identify missing values and understand the overall structure of the dataset.

In [6]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6134 entries, 0 to 6133
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   _url         6134 non-null   object 
 1   name         6134 non-null   object 
 2   birth_year   6121 non-null   float64
 3   weight       3078 non-null   float64
 4   height       3143 non-null   float64
 5   nationality  6133 non-null   object 
dtypes: float64(3), object(3)
memory usage: 287.7+ KB
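
The missing-value picture that `info()` gives can also be obtained as a table. The following is a minimal sketch, not part of the original notebook: `sample` is a small hypothetical stand-in for `dataset`, and `summary` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for `dataset`
sample = pd.DataFrame({
    "_url": ["bruno-surra", "gerard-rue", "jan-maas"],
    "birth_year": [1964.0, 1965.0, 1996.0],
    "weight": [np.nan, 74.0, 69.0],
    "height": [np.nan, 182.0, 189.0],
})

# Number and share of missing entries per column
missing_counts = sample.isna().sum()
missing_share = sample.isna().mean().round(3)

summary = pd.DataFrame({"missing": missing_counts, "share": missing_share})
print(summary)
```

Applied to the full `dataset`, the same two calls should report the gaps in `weight` and `height` that the non-null counts above already suggest.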


We also generate descriptive statistics for the numerical columns of the DataFrame: count, mean, standard deviation, minimum and maximum, plus the 25th, 50th, and 75th percentiles. This summary helps us understand the distribution and central tendency of the data.

In [8]:
dataset.describe()

Unnamed: 0,birth_year,weight,height
count,6121.0,3078.0,3143.0
mean,1974.071884,68.658739,179.815145
std,15.535834,6.348183,6.443447
min,1933.0,48.0,154.0
25%,1962.0,64.0,175.0
50%,1974.0,69.0,180.0
75%,1987.0,73.0,184.0
max,2004.0,94.0,204.0


We use the `value_counts()` method to count the occurrences of each unique value in a column of the DataFrame. Only the `name` column is active here; uncomment another line to inspect a different column.

In [8]:
# Count the occurrences of each value, one column at a time
#dataset['_url'].value_counts()
dataset['name'].value_counts()
#dataset['birth_year'].value_counts()
#dataset['weight'].value_counts()
#dataset['height'].value_counts()
#dataset['nationality'].value_counts()

name
Sergio  Domínguez       2
Alberto  Fernández      2
Jesús  López            2
Antonio  Cabello        2
Alessandro  Pozzi       2
                       ..
Juan José  Martínez     1
Iñigo  Elosegui         1
Paolo  Alberati         1
Jackson  Rodríguez      1
Jean-Philippe  Dojwa    1
Name: count, Length: 6127, dtype: int64

## Check on '_url' data

In this block we split the `_url` and `name` values into their components and compare them to see where they disagree.

In [11]:
# For each row, split '_url' and 'name' into tokens and compare them
i = 0
for index, row in dataset.iterrows():
    if not pd.isnull(row['_url']):
        url = row['_url'].lower().split('-')

        # Normalize the name to ASCII-only characters ('ł' has no NFKD decomposition, so map it by hand)
        norm_name = re.sub(r'[ł]', 'l', row['name'].lower())
        norm_name = unicodedata.normalize('NFKD', norm_name).encode('ASCII', 'ignore').decode('utf-8')
        name = re.split(r'\s+', norm_name)

        if url != name:
            i += 1
            print(row['_url'], row['name'], i)

graeme-brown Graeme Allen  Brown 1
jean-claude-theilliere Jean-Claude  Theillière 2
jesus-rodriguez-rodriguez Jesús  Rodríguez 3
iban-herrero-atienzar Ivan  Herrero 4
juan-ayuso-pesquera Juan  Ayuso 5
anselmo-fuerte-abelenda Anselmo  Fuerte 6
luis-ricardo-mesa-saavedra Luis Ricardo  Mesa 7
joseba-lopez-cuesta Joseba  López 8
yecid-sierra Yecid Arturo  Sierra 9
edgar-humberto-ruiz Edgar-humberto  Ruiz 10
manuel-mesa José Manuel  Mesa 11
willy-in-t-ven Willy In 't Ven 12
marlon-alirio-perez-arango Marlon Alirio  Pérez 13
jean-pierre-berckmans Jean-Pierre  Berckmans 14
iban-sastre-estevez Iban  Sastre 15
gonzalo-rabunal-rios Gonzalo  Rabuñal 16
jean-jacques-philipp Jean-Jacques  Philipp 17
jean-pierre-monsere Jean-Pierre  Monseré 18
jose-alberto-benitez-roman José Alberto  Benítez 19
patrick-busolini2 Patrick  Busolini 20
manuel-cardoso Manuel Antonio Leal  Cardoso 21
syver-waersted Syver  Wærsted 22
fernando-mendes Fernando dos Reis Dias  Mendes 23
tom-jelte-slagter Tom-Jelte  Slagter 24
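
Several of the mismatches above come only from hyphenated given names (for example `Jean-Claude`), because the name is split on whitespace while the `_url` is split on hyphens. The sketch below is a variation on the check, not the notebook's own method: the hypothetical helper `name_tokens` splits the name on both whitespace and hyphens, and `sample` is a three-row stand-in for `dataset`.

```python
import re
import unicodedata

import pandas as pd


def name_tokens(name):
    """Lower-case a name, strip accents, split on whitespace and hyphens."""
    lowered = name.lower().replace("ł", "l")
    ascii_only = (
        unicodedata.normalize("NFKD", lowered)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return [token for token in re.split(r"[\s-]+", ascii_only) if token]


# Hypothetical stand-in for `dataset`, using rows shown in the outputs above
sample = pd.DataFrame({
    "_url": ["gerard-rue", "jean-claude-theilliere", "graeme-brown"],
    "name": ["Gérard Rué", "Jean-Claude  Theillière", "Graeme Allen  Brown"],
})

url_tokens = sample["_url"].str.lower().str.split("-")
tokens = sample["name"].map(name_tokens)

# Row-wise list comparison; True marks a real discrepancy
mismatch = pd.Series(
    [u != n for u, n in zip(url_tokens, tokens)], index=sample.index
)
print(sample[mismatch])
```

With this splitting rule only rows whose tokens truly differ, such as a middle name missing from the `_url`, remain flagged.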

Now we check whether each `_url` contains any character other than letters (after removing hyphens).

In [40]:
# For each row, check whether '_url' contains any character that is not a letter
i = 0
for index, row in dataset.iterrows():
    if not pd.isnull(row['_url']):
        # Remove the '-' characters from '_url'
        url = row['_url'].replace('-', '')

        if not url.isalpha():
            i += 1
            print(row['_url'])

print("Tot: ", i)

patrick-busolini2
andrea-peron-1
hans-kanel2
peter-williams-1
juan-garcia2
anders-lund-1
pedro-pinto2
raimondas-rumsas-1
jesper-hansen-1
alessandro-fantini-1
romain-gregoire1
marco-zanotti-1
daniel-lloyd1
jose-antonio-747
kenny-van-braeckel2
alessandro-pozzi2
pino-cerami-2
johan-wellens2
francisco-martin-826
ricardo-martinez2
leon-paul-menard2
benjamin-king-1
hans-dekkers-1
kevin-neirynck-1
antonio-cruz-1
jesus-lopez23
filippo-colombo1
jose-manuel-garcia-114
andrea-zatti-1
marco-cattaneo-2
francisco-lopez-393
michel-nottebart-2
jan-svorada-1
jesus-hernandez-3
israel-nunez-1
philippe-durel2
daniel-jimenez1
georges-claes2
filippo-magli2
frederic-brun-2
manuel-rodriguez-305
stefan-van-dijk-1
luke-roberts-1
benjamin-thomas-2
wim-vervoort-1
connor-brown2
alvaro-sierra-1
william-frischkorn-1
romain-maes2
marc-van-den-brande2
miguel-martinez-1
manuel-martinez-1
Tot:  52
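
All the flagged values end in a number, which looks like a disambiguation suffix. As a rough sketch under that assumption, the same check can be written with pandas string methods, and the suffix can be pulled out for inspection. The series `urls` is a hypothetical stand-in for `dataset['_url']`, which has no missing values according to `info()`.

```python
import pandas as pd

# Hypothetical stand-in for dataset['_url']
urls = pd.Series(
    ["patrick-busolini2", "andrea-peron-1", "andrea-peron", "jose-antonio-747"],
    name="_url",
)

# True where the value, hyphens removed, is not purely alphabetic
has_non_letter = ~urls.str.replace("-", "", regex=False).str.isalpha()

# Trailing digits, with or without a separating hyphen; NaN when absent
suffix = urls.str.extract(r"-?(\d+)$")[0]

print(pd.DataFrame({"_url": urls, "flagged": has_non_letter, "suffix": suffix}))
```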


## Check on 'name' data

In this block we check if the `name` contains any numeric characters.

In [9]:
# For each row, check whether 'name' contains any digit
for index, row in dataset.iterrows():
    if not pd.isnull(row['name']):
        if any(char.isdigit() for char in row['name']):
            print(row['_url'], row['name'])

In this block we list, for each `name` that occurs more than once, the `_url` values associated with it.

In [101]:
# Keep the names with at least two occurrences
temp = dataset['name'].value_counts()
temp = temp[temp > 1]

# Print the urls of the names with at least two occurrences
for name in temp.index:
    urls = dataset[dataset['name'] == name]['_url']
    print(name, urls, '\n')


Jesús  López 2939         jesus-lopez23
5040    jesus-lopez-carril
Name: _url, dtype: object 

Roman  Kreuziger 1745    roman-kreuziger-sr
2601       roman-kreuziger
Name: _url, dtype: object 

Alberto  Fernández 2953     alberto-fernandez-sainz
5720    alberto-fernandez-blanco
Name: _url, dtype: object 

Antonio  Cabello 2862    antonio-cabello-baena
3238          antonio-cabello
Name: _url, dtype: object 

Andrea  Peron 347     andrea-peron-1
2682      andrea-peron
Name: _url, dtype: object 

Sergio  Domínguez 4917    sergio-dominguez-rodriguez
4919        sergio-dominguez-munoz
Name: _url, dtype: object 

Alessandro  Pozzi 2235    alessandro-pozzi2
5722     alessandro-pozzi
Name: _url, dtype: object 
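
The same listing can be produced more compactly with `duplicated` and `groupby`. This is a minimal sketch on a hypothetical five-row stand-in for `dataset`; `urls_by_name` is an illustrative name.

```python
import pandas as pd

# Hypothetical stand-in for `dataset`, using rows shown in the outputs above
sample = pd.DataFrame({
    "_url": [
        "andrea-peron-1", "andrea-peron", "jan-maas",
        "roman-kreuziger-sr", "roman-kreuziger",
    ],
    "name": [
        "Andrea  Peron", "Andrea  Peron", "Jan Maas",
        "Roman  Kreuziger", "Roman  Kreuziger",
    ],
})

# Keep every row whose name occurs more than once
shared = sample[sample.duplicated("name", keep=False)]

# One list of urls per repeated name
urls_by_name = shared.groupby("name")["_url"].agg(list)
print(urls_by_name)
```

Each entry maps one repeated name to all of its `_url` values, which makes it easier to see whether the rows are distinct riders or duplicates.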



## Check on 'birth_year' data

In this block we check whether any `birth_year` value has a fractional part, that is, whether its string form does not end with '.0'.

In [10]:
# For each row, check whether 'birth_year' ends with '.0'
for index, row in dataset.iterrows():
    if not pd.isnull(row['birth_year']):
        if not str(row['birth_year']).endswith('.0'):
            print(row['_url'], row['birth_year'])

In this block we check whether any `birth_year` value is not of the form 'nnnn', or does not start with '19' or '20'.

In [11]:
# For each row, check whether 'birth_year' has the form 'nnnn'
for index, row in dataset.iterrows():
    if not pd.isnull(row['birth_year']):
        # Strip only a trailing '.0' from the string form of 'birth_year'
        year = re.sub(r'\.0$', '', str(row['birth_year']))

        # Check that the year is made of digits only and has the form 19nn or 20nn
        if not year.isdigit() or not (year.startswith('19') or year.startswith('20')):
            print(row['_url'], row['birth_year'])
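
Because `birth_year` is stored as `float64`, the century test can also be done numerically instead of on strings. The sketch below uses hypothetical values (`1850.0` is an invented bad entry, not taken from the data), and `bad_years` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dataset['birth_year']; 1850.0 is an invented outlier
years = pd.Series([1964.0, 1996.0, np.nan, 1850.0, 2004.0], name="birth_year")

present = years.dropna()

# The first two digits of a four-digit year are the year floor-divided by 100
century = (present // 100).astype(int)

# Years that are neither 19nn nor 20nn
bad_years = present[~century.isin([19, 20])]
print(bad_years)
```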

## Check on 'weight' data

In this block we check whether any `weight` value has a fractional part, that is, whether its string form does not end with '.0'.

In [13]:
# For each row, check whether 'weight' ends with '.0'
for index, row in dataset.iterrows():
    if not pd.isnull(row['weight']):
        if not str(row['weight']).endswith('.0'):
            print(row['_url'], row['weight'])

chad-haga 71.5
txomin-juaristi 67.5
awet-gebremedhin-andemeskel 58.5
pablo-torres 63.5
alessandro-bazzana 63.5
martijn-verschoor 74.5
jure-golcer 66.5
elmar-reinders 78.1
nikias-arndt 77.5
daan-hoole 81.4
thomas-danielson 58.5
ilia-koshevoy 62.5
maxime-farazijn 73.5
harry-tanfield 79.5
florian-stork 65.1
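
The '.0' test is really a test for a non-zero fractional part, so it can be expressed with the modulo operator and reused for any numeric column. A minimal sketch follows; `fractional_values` is a hypothetical helper and `sample` a small stand-in for `dataset`.

```python
import numpy as np
import pandas as pd


def fractional_values(frame, column):
    """Return the rows whose value in `column` is not a whole number."""
    values = frame[column].dropna()
    fractional = values[values % 1 != 0]
    return frame.loc[fractional.index, ["_url", column]]


# Hypothetical stand-in for `dataset`, using rows shown in the outputs above
sample = pd.DataFrame({
    "_url": ["chad-haga", "jan-maas", "bruno-surra", "daan-hoole"],
    "weight": [71.5, 69.0, np.nan, 81.4],
})

result = fractional_values(sample, "weight")
print(result)
```

The same helper could be called with `"birth_year"` or `"height"` in place of `"weight"`.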


In this block we check whether the integer part of any `weight` value contains a character that is not a digit.

In [121]:
# For each row, check whether the integer part of 'weight' is made of digits only
for index, row in dataset.iterrows():
    if not pd.isnull(row['weight']):
        # Drop the last two characters (the decimal point and the decimal digit)
        weight = str(row['weight'])[:-2]

        if not weight.isdigit():
            print(row['_url'], row['weight'])

## Check on 'height' data

In this block we check whether any `height` value has a fractional part, that is, whether its string form does not end with '.0'.

In [14]:
# For each row, check whether 'height' ends with '.0'
for index, row in dataset.iterrows():
    if not pd.isnull(row['height']):
        if not str(row['height']).endswith('.0'):
            print(row['_url'], row['height'])

In this block we check whether the integer part of any `height` value contains a character that is not a digit.

In [122]:
# For each row, check whether the integer part of 'height' is made of digits only
for index, row in dataset.iterrows():
    if not pd.isnull(row['height']):
        # Drop the last two characters (the decimal point and the decimal digit)
        height = str(row['height'])[:-2]

        if not height.isdigit():
            print(row['_url'], row['height'])

## Check on 'weight' and 'height' data

In [8]:
# For each row, report the cases where 'weight' is missing but 'height' is present
i=0
for index, row in dataset.iterrows():
    if pd.isnull(row['weight']) and not pd.isnull(row['height']):
        i+=1
        print(row['_url'], row['height'], i)

idar-andersen 182.0 1
thomas-bonnet 175.0 2
syver-waersted 193.0 3
loe-van-belle 184.0 4
negasi-abreha 186.0 5
davide-baldaccini 176.0 6
pier-andre-cote 181.0 7
valere-thiebaud 184.0 8
louis-bendixen 190.0 9
robin-carpenter 178.0 10
vicente-belda 154.0 11
emanuel-duarte 182.0 12
embret-svestad-bardseng 185.0 13
joaquim-adrego-pereira-andrade 175.0 14
jacob-eriksson 182.0 15
xabier-isasa-larranaga 185.0 16
ivan-quaranta 174.0 17
luis-ocana 178.0 18
alfons-de-bal 155.0 19
adam-de-vos 187.0 20
yanto-barker 182.0 21
umberto-poli 167.0 22
dario-lillo 185.0 23
miguel-heidemann 186.0 24
torstein-traeen 181.0 25
bogdan-bondariew 196.0 26
anders-halland-johannessen 176.0 27
christian-raymond 178.0 28
max-poole 185.0 29
andrzej-mierzejewski 173.0 30
venceslau-fernandes 181.0 31
sandy-dujardin 178.0 32
oliver-rees 184.0 33
charly-mottet 164.0 34
johannes-staune-mittet 182.0 35
rob-britton 188.0 36
jonas-iversby-hvideberg 185.0 37
sergey-kolesnikov 179.0 38
sebastian-kolze-changizi 181.0 39
vadim-

In [9]:
# For each row, report the cases where 'height' is missing but 'weight' is present
i=0
for index, row in dataset.iterrows():
    if pd.isnull(row['height']) and not pd.isnull(row['weight']):
        i+=1
        print(row['_url'], row['weight'], i)

mario-de-sarraga 69.0 1
frank-hoste 76.0 2
davide-orrico 70.0 3
nicolas-dalla-valle 73.0 4
nils-brun 64.0 5
yannis-voisard 56.0 6
jean-nuttli 70.0 7


In [10]:
# For each row, report the cases where both 'weight' and 'height' are missing
i=0
for index, row in dataset.iterrows():
    if pd.isnull(row['height']) and pd.isnull(row['weight']):
        i+=1
        print(row['_url'], i)

bruno-surra 1
willy-moonen 2
scott-davies 3
stian-remme 4
evgueny-anachkine 5
maurizio-biondo 6
patrice-thevenard 7
luc-suykerbuyk 8
alain-gallopin 9
urs-graf 10
jean-claude-theilliere 11
paul-kimmage 12
alejandro-paleo 13
samuel-blanco 14
sean-sullivan 15
gaetano-baronchelli 16
julien-mazet 17
roberto-giucolsi 18
leonardo-guidi 19
aitor-alonso 20
noel-vanclooster 21
jesus-rodriguez-rodriguez 22
noan-lelarge 23
eric-van-lancker 24
luca-maggioni 25
christian-muselet 26
marc-siemons 27
theo-smit 28
luigi-sestili 29
iban-herrero-atienzar 30
jules-bruessing 31
theo-eltink 32
willy-vigouroux 33
venancio-teran 34
valentin-dorronsoro 35
eddy-verstraeten 36
paulo-jose-dos-santos-ferreira 37
wim-omloop 38
sascha-henrix 39
manuele-tarozzi 40
roy-knickman 41
luigi-della-bianca 42
jannes-slendebroek 43
primoz-cerin 44
francisco-leon 45
anselmo-fuerte-abelenda 46
nic-hamilton 47
alain-van-den-bossche 48
hubert-arbes 49
peter-van-de-knoop 50
louis-de-koning 51
harm-ottenbros 52
stephan-joho 53
miros
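
The three loops above count the combinations of missing `weight` and `height` one at a time. As a sketch, a single cross-tabulation gives all four counts at once. `sample` is a hypothetical four-row stand-in for `dataset`, and the labels "missing" and "present" are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `dataset`, using rows shown in the outputs above
sample = pd.DataFrame({
    "_url": ["bruno-surra", "gerard-rue", "idar-andersen", "frank-hoste"],
    "weight": [np.nan, 74.0, np.nan, 76.0],
    "height": [np.nan, 182.0, 182.0, np.nan],
})

labels = {True: "missing", False: "present"}
weight_state = sample["weight"].isna().map(labels)
height_state = sample["height"].isna().map(labels)

# Rows: state of 'weight'; columns: state of 'height'
table = pd.crosstab(
    weight_state, height_state, rownames=["weight"], colnames=["height"]
)
print(table)
```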

## Check on 'nationality' data

In this block we check whether any `nationality` value contains a digit.

In [128]:
# For each row, check whether 'nationality' contains any digit
for index, row in dataset.iterrows():
    if not pd.isnull(row['nationality']):
        if any(char.isdigit() for char in row['nationality']):
            print(row['_url'], row['nationality'])
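
The digit check can be combined with a look at the missing `nationality` entry that `info()` reports. This is only a sketch: `sample` is a hypothetical stand-in for `dataset`, and the `rider-a` / `rider-b` rows, including the value `"Spain1"`, are invented to exercise both conditions.

```python
import pandas as pd

# Hypothetical stand-in for `dataset`; the last two rows are invented
sample = pd.DataFrame({
    "_url": ["bruno-surra", "gerard-rue", "rider-a", "rider-b"],
    "nationality": ["Italy", "France", None, "Spain1"],
})

# True where the value contains a digit; missing values count as False
has_digit = sample["nationality"].str.contains(r"\d", na=False)

# True where the value is missing
is_missing = sample["nationality"].isna()

print(sample[has_digit | is_missing])
```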