# Data Cleaning

This notebook is meant to be run once to produce the cleaned datasets used in the next steps of the project.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import procyclingstats as pcs
import re
import seaborn as sns
import sys
from typing import Tuple

sys.path.append('../dataset/')
sys.path.append('../utility/')

races_df = pd.read_csv(os.path.join('dataset','races_new.csv'))
cyclists_df = pd.read_csv(os.path.join('dataset','cyclists_new.csv'))

## Cyclists

The cyclists data doesn't appear to contain any strange values, but we can drop the `name` column: its only use would be counting how many cyclists share a given name, which is probably not a useful statistic.

In [3]:
cyclists_df = cyclists_df.drop(columns=['name'])

## Races

For the races we have a bit more work to do

### Delta

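The delta has already been discussed in a dedicated notebook, so here we only report the cleaning process: keep the rows whose `delta` lies between 0 and 20000 (inclusive), then drop duplicate (`_url`, `cyclist`) pairs, keeping the first occurrence.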

In [4]:
initial_len = races_df.shape[0]
races_df = races_df[(races_df['delta'] >= 0) & (races_df['delta'] <= 20000)]
races_df = races_df.drop_duplicates(subset=['_url', 'cyclist'], keep='first')
print(f"Removed {initial_len - races_df.shape[0]} rows")

Removed 15330 rows
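
To make the effect of the two steps easier to see, here is a minimal sketch on toy data. The helper `clean_delta`, the `toy_df` frame and its values are hypothetical and only illustrate the same filter and deduplication used above.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the real dataset).
toy_df = pd.DataFrame({
    '_url': ['race-a/2000/stage-1'] * 4 + ['race-b/2001/stage-2'],
    'cyclist': ['rider-1', 'rider-1', 'rider-2', 'rider-3', 'rider-1'],
    'delta': [0.0, 12.0, -5.0, 25000.0, 30.0],
})

def clean_delta(df: pd.DataFrame, max_delta: float = 20000) -> pd.DataFrame:
    """Keep rows with 0 <= delta <= max_delta, then one row per (_url, cyclist)."""
    in_range = df['delta'].between(0, max_delta)  # inclusive on both ends
    return df[in_range].drop_duplicates(subset=['_url', 'cyclist'], keep='first')

toy_cleaned = clean_delta(toy_df)
print(f"Removed {len(toy_df) - len(toy_cleaned)} rows")
```

In this toy frame the negative delta, the delta above 20000 and the repeated (`_url`, `cyclist`) pair are removed, so two rows remain.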


### Bad columns

Just a reminder of why these columns are not useful: `name` can change between editions, so it needs to be retrieved from `_url`; `is_cobbled` is always false; `is_gravel` is always false; `cyclist_team` can change from year to year, and the team name is often just the name of the sponsor (so the same team can appear under different names); `average_temperature` is almost always null; `Unnamed: 0` is a column created by mistake somewhere.

In [5]:
columns_to_drop = [
    'name', 
    'is_cobbled',
    'is_gravel',
    'cyclist_team',
    'average_temperature',
    'Unnamed: 0',
]
races_df = races_df.drop(columns=columns_to_drop)
races_df.head()

,_url,stage_type,points,uci_points,length,climb_total,profile,startlist_quality,date,position,cyclist,cyclist_age,is_tarmac,delta
0,tour-de-france/1978/stage-6,RR,100.0,,162000.0,1101.0,1.0,1241,1978-07-05 04:02:24,0,sean-kelly,22.0,True,0.0
1,tour-de-france/1978/stage-6,RR,70.0,,162000.0,1101.0,1.0,1241,1978-07-05 04:02:24,1,gerrie-knetemann,27.0,True,0.0
2,tour-de-france/1978/stage-6,RR,50.0,,162000.0,1101.0,1.0,1241,1978-07-05 04:02:24,2,rene-bittinger,24.0,True,0.0
3,tour-de-france/1978/stage-6,RR,40.0,,162000.0,1101.0,1.0,1241,1978-07-05 04:02:24,3,joseph-bruyere,30.0,True,0.0
4,tour-de-france/1978/stage-6,RR,32.0,,162000.0,1101.0,1.0,1241,1978-07-05 04:02:24,4,sven-ake-nilsson,27.0,True,0.0
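
Claims such as "always false" or "almost always null" can be checked with a small summary before dropping anything. The sketch below runs on toy data; the helper `summarize_columns`, the toy values and the 0.5 threshold are assumptions chosen for illustration, not part of the original analysis.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column count of distinct non-null values and fraction of nulls."""
    return pd.DataFrame({
        'n_unique': df.nunique(dropna=True),
        'null_fraction': df.isna().mean(),
    })

# Toy data (hypothetical values).
toy_df = pd.DataFrame({
    'is_cobbled': [False, False, False, False, False],
    'average_temperature': [None, None, 21.0, None, 18.0],
    'points': [100.0, 70.0, 50.0, 40.0, 32.0],
})

summary = summarize_columns(toy_df)
# Columns with at most one distinct value carry no information.
constant_columns = summary.index[summary['n_unique'] <= 1].tolist()
# Columns that are null in more than half of the rows (arbitrary threshold).
mostly_null_columns = summary.index[summary['null_fraction'] > 0.5].tolist()
print(constant_columns, mostly_null_columns)
```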


### Recreating Name

Here we rebuild the name in a consistent way, i.e. a race has the same name in every edition (for stages, the stage number is appended so that different stages remain distinguishable).

In [6]:
def get_name_stage(row) -> Tuple[str, str]:
    """Return the race slug and the last URL segment (e.g. the stage) of a row."""
    array_of_info = row['_url'].split('/')
    return array_of_info[0], array_of_info[-1]

# build the name column as '<race slug>-<last URL segment>', dropping the year
races_df['name'] = races_df.apply(lambda row: '-'.join(get_name_stage(row)), axis=1)
# move the name column to the second position
cols = list(races_df.columns)
cols.insert(1, cols.pop(cols.index('name')))
races_df = races_df[cols]

races_df.head()

,_url,name,stage_type,points,uci_points,length,climb_total,profile,startlist_quality,date,position,cyclist,cyclist_age,is_tarmac,delta
0,tour-de-france/1978/stage-6,tour-de-france-stage-6,RR,100.0,,162000.0,1101.0,1.0,1241,1978-07-05 04:02:24,0,sean-kelly,22.0,True,0.0
1,tour-de-france/1978/stage-6,tour-de-france-stage-6,RR,70.0,,162000.0,1101.0,1.0,1241,1978-07-05 04:02:24,1,gerrie-knetemann,27.0,True,0.0
2,tour-de-france/1978/stage-6,tour-de-france-stage-6,RR,50.0,,162000.0,1101.0,1.0,1241,1978-07-05 04:02:24,2,rene-bittinger,24.0,True,0.0
3,tour-de-france/1978/stage-6,tour-de-france-stage-6,RR,40.0,,162000.0,1101.0,1.0,1241,1978-07-05 04:02:24,3,joseph-bruyere,30.0,True,0.0
4,tour-de-france/1978/stage-6,tour-de-france-stage-6,RR,32.0,,162000.0,1101.0,1.0,1241,1978-07-05 04:02:24,4,sven-ake-nilsson,27.0,True,0.0
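
The same name can also be derived without a row-wise `apply`, using pandas string methods. This is only a sketch of an alternative on toy data (the URLs other than the Tour de France 1978 one are made up), not the method used in this notebook.

```python
import pandas as pd

# Toy data: two editions of the same stage plus a different race (hypothetical URLs).
toy_df = pd.DataFrame({
    '_url': [
        'tour-de-france/1978/stage-6',
        'tour-de-france/1979/stage-6',
        'race-b/2001/stage-1',
    ],
    'points': [100.0, 100.0, 80.0],
})

# Split each URL on '/' and join the first and last segments, dropping the year.
parts = toy_df['_url'].str.split('/')
# DataFrame.insert places the new column directly at position 1, right after _url.
toy_df.insert(1, 'name', parts.str[0] + '-' + parts.str[-1])
print(toy_df)
```

Both editions of the Tour de France stage end up with the same `name`, which is the property described above.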


### Saving the Datasets

In [7]:
races_df.to_csv(os.path.join('dataset', 'races_cleaned.csv'), index=False)
cyclists_df.to_csv(os.path.join('dataset', 'cyclists_cleaned.csv'), index=False)
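
A note on `index=False`: a column called `Unnamed: 0`, like the one dropped earlier, is what pandas produces when a DataFrame is written together with its index and then read back without `index_col`. Whether that is how the column ended up in the original file is an assumption; the sketch below only demonstrates the mechanism on toy data, with a hypothetical `round_trip` helper.

```python
import io
import pandas as pd

# Toy data (hypothetical values).
toy_df = pd.DataFrame({'cyclist': ['rider-1', 'rider-2'], 'delta': [0.0, 12.0]})

def round_trip(df: pd.DataFrame, write_index: bool) -> pd.DataFrame:
    """Write df to an in-memory CSV and read it back with default settings."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=write_index)
    buffer.seek(0)
    return pd.read_csv(buffer)

with_index = round_trip(toy_df, write_index=True)
without_index = round_trip(toy_df, write_index=False)
print(with_index.columns.tolist())
print(without_index.columns.tolist())
```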