# Data Cleaning

In [21]:
# Autoreload lets the notebook pick up code changes dynamically: if we update helper functions *outside* the notebook, we do not need to restart the kernel and re-import them.
%load_ext autoreload
%autoreload 2



In [22]:
import pandas as pd
import numpy as np

We load the two datasets from their CSV files and display the first few rows of each. This verifies that the data loaded correctly and gives a first look at its structure and contents.

In [23]:
csv_file = "../data/cyclists.csv"
cyclists_dataset = pd.read_csv(csv_file)
cyclists_dataset.head()

Unnamed: 0,_url,name,birth_year,weight,height,nationality
0,bruno-surra,Bruno Surra,1964.0,,,Italy
1,gerard-rue,Gérard Rué,1965.0,74.0,182.0,France
2,jan-maas,Jan Maas,1996.0,69.0,189.0,Netherlands
3,nathan-van-hooydonck,Nathan Van Hooydonck,1995.0,78.0,192.0,Belgium
4,jose-felix-parra,José Félix Parra,1997.0,55.0,171.0,Spain


In [24]:
csv_file = "../data/races.csv"
races_dataset = pd.read_csv(csv_file)
races_dataset.head()

Unnamed: 0,_url,name,points,uci_points,length,climb_total,profile,startlist_quality,average_temperature,date,position,cyclist,cyclist_age,is_tarmac,is_cobbled,is_gravel,cyclist_team,delta
0,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,0,sean-kelly,22.0,True,False,False,vini-ricordi-pinarello-sidermec-1986,0.0
1,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,1,gerrie-knetemann,27.0,True,False,False,norway-1987,0.0
2,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,2,rene-bittinger,24.0,True,False,False,,0.0
3,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,3,joseph-bruyere,30.0,True,False,False,navigare-blue-storm-1993,0.0
4,tour-de-france/1978/stage-6,Tour de France,100.0,,162000.0,1101.0,1.0,1241,,1978-07-05 04:02:24,4,sven-ake-nilsson,27.0,True,False,False,spain-1991,0.0


Create a single dataset by joining the cyclists and races data on the cyclist URL (inner join).

In [25]:
# Inner-join the two datasets on the cyclist URL ('_url' in cyclists, 'cyclist' in races)
merged_dataset = pd.merge(cyclists_dataset, races_dataset, left_on='_url', right_on='cyclist', how='inner')

# Rename the cyclist URL column to '_url_cyclist' and the race URL column to '_url_race'
merged_dataset = merged_dataset.rename(columns={'_url_x': '_url_cyclist', '_url_y': '_url_race'})
# Rename the cyclist name column to 'name_cyclist' and the race name column to 'name_race'
merged_dataset = merged_dataset.rename(columns={'name_x': 'name_cyclist', 'name_y': 'name_race'})
# Keep only the year-month-day part of 'date' (drop the time)
merged_dataset['date'] = merged_dataset['date'].str.split(' ').str[0]

merged_dataset.head()


Unnamed: 0,_url_cyclist,name_cyclist,birth_year,weight,height,nationality,_url_race,name_race,points,uci_points,...,average_temperature,date,position,cyclist,cyclist_age,is_tarmac,is_cobbled,is_gravel,cyclist_team,delta
0,bruno-surra,Bruno Surra,1964.0,,,Italy,vuelta-a-espana/1989/stage-1,Vuelta a España,80.0,,...,,1989-04-24,110,bruno-surra,25.0,True,False,False,,15.0
1,gerard-rue,Gérard Rué,1965.0,74.0,182.0,France,tour-de-france/1997/stage-2,Tour de France,100.0,,...,,1997-07-07,132,gerard-rue,32.0,True,False,False,denmark-1991,0.0
2,gerard-rue,Gérard Rué,1965.0,74.0,182.0,France,tour-de-france/1990/stage-1,Tour de France,100.0,,...,,1990-07-01,66,gerard-rue,25.0,True,False,False,france-1978,635.0
3,gerard-rue,Gérard Rué,1965.0,74.0,182.0,France,tour-de-france/1992/stage-7,Tour de France,100.0,,...,,1992-07-11,35,gerard-rue,27.0,True,False,False,france-1978,65.0
4,gerard-rue,Gérard Rué,1965.0,74.0,182.0,France,tour-de-france/1990/stage-9,Tour de France,100.0,,...,,1990-07-09,41,gerard-rue,25.0,True,False,False,france-1978,37.0
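
Because the merge uses `how='inner'`, cyclists with no race results, and race rows whose `cyclist` value has no match in the cyclists table, are dropped without any warning. The sketch below shows one way to check for this with `indicator=True` and `validate`. It runs on toy frames whose names and values are made up for illustration, not taken from the dataset.

```python
import pandas as pd

# Toy stand-ins for the two tables (hypothetical values)
cyclists = pd.DataFrame({
    "_url": ["rider-a", "rider-b", "rider-c"],
    "name": ["Rider A", "Rider B", "Rider C"],
})
races = pd.DataFrame({
    "_url": ["race-1", "race-1", "race-2"],
    "name": ["Race 1", "Race 1", "Race 2"],
    "cyclist": ["rider-a", "rider-b", "rider-x"],
})

# An outer merge with indicator=True labels each row by its origin;
# validate="one_to_many" raises MergeError if a cyclist URL is duplicated.
check = pd.merge(
    cyclists, races,
    left_on="_url", right_on="cyclist",
    how="outer", indicator=True, validate="one_to_many",
)
unmatched_counts = check["_merge"].value_counts()

# The inner merge keeps only the rows labelled "both"
inner = pd.merge(cyclists, races, left_on="_url", right_on="cyclist", how="inner")
print(unmatched_counts)
print(len(inner))
```

Rows labelled `left_only` are cyclists without races, and rows labelled `right_only` are race results that reference an unknown cyclist.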


## Cyclists

In [None]:
cyclists_dataset.isnull().sum() # check number of missing values

_url              0
name              0
birth_year       13
weight         3056
height         2991
nationality       1
dtype: int64
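
Raw counts are easier to judge relative to the size of the table. A small sketch, on a made-up frame rather than the real data, of how the share of missing values per column could be reported next to the counts:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "birth_year": [1964.0, 1965.0, np.nan, 1995.0],
    "weight": [np.nan, 74.0, np.nan, 78.0],
})

# isnull().mean() gives the fraction of missing entries in each column
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "share_missing": toy.isnull().mean(),
})
print(missing)
```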

## Imputation

### Imputation of Birth Year and Nationality null values
For missing birth years and nationalities, we tried to recover as much data as possible manually through online searches.

In [27]:
# Show urls of the cyclists with 'birth_year' missing values
cyclists_dataset[cyclists_dataset['birth_year'].isnull()]['_url']

9             scott-davies
601       vladimir-malakov
894         antonio-zanini
2408     filippo-simonetti
2515         carlos-garcia
2536       alexandr-osipov
3046      nicolai-kosyakov
3551            nevens-guy
4142           oscar-pumar
4384         javier-luquin
4756        thierry-lauder
6072    sergei-jermachenko
6080       batik-odriozola
Name: _url, dtype: object

For the birth years we found, we manually set the corresponding 'birth_year' value.

In [28]:
cyclists_dataset.loc[cyclists_dataset['_url'] == 'scott-davies', 'birth_year'] = 1995
cyclists_dataset.loc[cyclists_dataset['_url'] == 'vladimir-malakov', 'birth_year'] = 1958
cyclists_dataset.loc[cyclists_dataset['_url'] == 'antonio-zanini', 'birth_year'] = 1965
cyclists_dataset.loc[cyclists_dataset['_url'] == 'nevens-guy', 'birth_year'] = 1962
cyclists_dataset.loc[cyclists_dataset['_url'] == 'sergei-jermachenko', 'birth_year'] = 1956 

We followed the same approach for nationality. The only missing value belonged to the cyclist Scott Davies.

In [29]:
cyclists_dataset.loc[cyclists_dataset['_url'] == 'scott-davies', 'nationality'] = 'Great Britain'

For the remaining birth years, we used the mode.

In [30]:
cyclists_dataset['birth_year'] = cyclists_dataset['birth_year'].fillna(cyclists_dataset['birth_year'].mode()[0]) # substitute the 8 remaining missing values with the mode

In [31]:
cyclists_dataset.isnull().sum() # check if missing values are filled correctly

_url              0
name              0
birth_year        0
weight         3056
height         2991
nationality       0
dtype: int64
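
For reference, a minimal sketch of the mode-based fill on toy values (not the real data). `Series.mode()` returns a Series because several values can tie for the highest frequency. The result is sorted, so `[0]` picks the smallest of the tied values.

```python
import numpy as np
import pandas as pd

# Toy birth years with two missing entries and a tie between 1985 and 1990
birth_year = pd.Series([1990.0, 1985.0, 1990.0, np.nan, 1985.0, np.nan])

modes = birth_year.mode()             # NaN is ignored by default
filled = birth_year.fillna(modes[0])  # every gap receives the same value
print(modes.tolist())
print(filled.tolist())
```

A drawback of this choice is that every imputed cyclist receives the same year, which adds an artificial spike at that value in the birth-year distribution.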

### Imputation of Weight and Height null values

Since height and weight are highly correlated, weight can be used to estimate a missing height, and height to estimate a missing weight.
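
One way this idea could be realised is a simple linear fit on the rows where both values are known, followed by predicting the missing side from the observed one. The sketch below is an assumption about the approach, not the method used in this notebook. It runs on made-up numbers, and the helper `impute_from` is hypothetical. Rows where both values are missing cannot be filled this way and stay as NaN.

```python
import numpy as np
import pandas as pd

def impute_from(df, target, predictor):
    """Fill missing `target` values with a linear fit on `predictor` (hypothetical helper)."""
    complete = df.dropna(subset=[target, predictor])
    # Least-squares line: target ≈ slope * predictor + intercept
    slope, intercept = np.polyfit(complete[predictor], complete[target], deg=1)
    # Only rows with a missing target and an observed predictor can be filled
    fillable = df[target].isnull() & df[predictor].notnull()
    out = df[target].copy()
    out.loc[fillable] = slope * df.loc[fillable, predictor] + intercept
    return out

# Toy data (hypothetical values): weight in kg, height in cm
toy = pd.DataFrame({
    "weight": [74.0, 69.0, 78.0, 55.0, np.nan, 70.0, np.nan],
    "height": [182.0, 189.0, 192.0, 171.0, 180.0, np.nan, np.nan],
})

# Fit both directions on the original data before overwriting either column
new_weight = impute_from(toy, "weight", "height")
new_height = impute_from(toy, "height", "weight")
toy["weight"], toy["height"] = new_weight, new_height
print(toy)
```

Computing both filled columns before assigning them keeps each fit based on observed values only, so imputed weights never feed into the height fit or vice versa.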



## Deletions of rows

### Deletion of rows

## Races

## Cyclists & Races