# DataTour de France

## Collecting global data 

As a first step, we will collect general data about the Tour de France. We'll store this data in a pandas DataFrame so that we have well-formatted data for our analysis later on.

To that end, we will scrape data from several websites and aggregate it into a single DataFrame. We will use **BeautifulSoup**, a Python library for parsing HTML and XML files.

In [116]:
from bs4 import BeautifulSoup
import re
import math
import requests
import numpy as np
import pandas as pd

In [117]:
r = requests.get("http://bikeraceinfo.com/tdf/tdfindex.html")
soup = BeautifulSoup(r.text, "lxml")
table = soup.find("table")

Let's build a dictionary we will use later to feed our DataFrame.

In [118]:
metadata = {"year": [], "winner": [], "second": [], "third": [], "winner_origin": [], "winning_team": []}

### Retrieve each year the Tour de France took place

[This web page](http://bikeraceinfo.com/tdf/tdfindex.html) lists general information about every edition of the Tour de France. We'll scrape it step by step to retrieve all the data it contains.

In [119]:
regex = re.compile(r"[0-9]{4}$")

for sib in table.tr.next_siblings:
    try:
        y = sib.td.text
        if regex.match(y) is not None:
            metadata["year"].append(y)
        else:
            # Skip (and print) the rows for years in which the Tour de France did not take place
            print(y)
    except AttributeError:
        pass

1915-1918 World War I, no Tours held
1940-1946 World War II, no Tours held 
2020V
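
The year filter above can be illustrated on a few sample cells. This is a minimal sketch, assuming the cell texts look like the ones printed above; `is_single_year` is a hypothetical helper, and `re.fullmatch` is used here instead of `re.match` with a `$` anchor so that the whole cell must be exactly four digits.

```python
import re

YEAR_RE = re.compile(r"[0-9]{4}")


def is_single_year(cell_text):
    """Return True when the cell holds exactly one four-digit year."""
    return YEAR_RE.fullmatch(cell_text.strip()) is not None


# Sample cell texts, including the rows that were printed above
cells = ["1903", "1915-1918 World War I, no Tours held", "2020V", "2019"]

years = [c for c in cells if is_single_year(c)]
skipped = [c for c in cells if not is_single_year(c)]

print(years)
print(skipped)
```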


### Retrieve every winner of the Tour de France

Next, we move to the second column and retrieve each winner. The cell holds additional data that we will need to clean further before using it for analysis.

In [120]:
def remove_backspace(text):
    """Remove newline and tab characters from the text."""
    regex = re.compile(r"[\n\t]")
    text = regex.sub("", text)
    return text

In [121]:
for sib in table.tr.next_siblings:
    try:
        t = sib.td.next_sibling.next_sibling.get_text()
        metadata["winner"].append(remove_backspace(t))
    except AttributeError:
        pass

In [122]:
# Remove the unneeded second-to-last entry
del metadata["winner"][-2]

### Retrieve each second-place finisher

In [123]:
for sib in table.tr.next_siblings:
    try:
        t = sib.find_all("td")
        # Because the table does not have the same number of cells when the Tour did not take place
        try:
            sec = t[3].get_text(strip=True)
            metadata["second"].append(sec)
        except IndexError:
            pass
    except AttributeError:
        pass

In [124]:
# Remove the unneeded second-to-last entry
del metadata["second"][-2]

### Retrieve each third-place finisher

In [125]:
for sib in table.tr.next_siblings:
    try:
        t = sib.find_all("td")
        # Because the table does not have the same number of cells when the Tour did not take place
        try:
            thd = t[4].get_text(strip=True)
            metadata["third"].append(thd)
        except IndexError:
            pass
    except AttributeError:
        pass

In [126]:
# Remove the unneeded second-to-last entry
del metadata["third"][-2]
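
The winner, second-place and third-place loops above all walk the same rows, so they could in principle be merged. The following is a rough sketch of a single-pass version, not the notebook's actual method: the HTML is a made-up miniature of the results table (placeholder rider and team names), the column order is assumed to match the indices used above, and `html.parser` is used so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table: a header row, a one-cell "no Tour" row,
# and one regular row with five cells. All names are placeholders.
html = """
<table>
  <tr><td>Year</td><td>Winner</td><td>Team</td><td>Second</td><td>Third</td></tr>
  <tr><td>1915-1918 World War I, no Tours held</td></tr>
  <tr><td>1919</td><td>Rider A</td><td>Country, Team</td><td>Rider B</td><td>Rider C</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
for tr in soup.find("table").find_all("tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) < 5:
        # Rows for years without a Tour have fewer cells
        continue
    year, winner, team, second, third = cells[:5]
    records.append(
        {"year": year, "winner": winner, "team": team, "second": second, "third": third}
    )

print(records)
```

Iterating over `find_all("tr")` yields only tags, so the `AttributeError` guard needed with `next_siblings` goes away. A stray row that still has five cells would pass the length check and need separate handling.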

### Retrieve each winning team (the team of the yellow jersey winner)

In [127]:
for i, sib in enumerate(table.tr.next_siblings):
    try:
        t = sib.find_all("td")
        # Because the table does not have the same number of cells when the Tour did not take place
        try:
            team = t[2].get_text(strip=True)
            if "," in team:
                te = team.split(",")
                metadata["winner_origin"].append(te[0])
                metadata["winning_team"].append(te[1])
            else:
                print(math.ceil(i/2-3), team)
        except IndexError:
            pass
    except AttributeError:
        pass

90 USAUS Postal
91 USADiscovery
92 SpainCaissed'Epargne
93 SpainDiscovery
94 SpainCSC-Saxo Bank
95 SpainAstana
96 LuxembourgSaxo Bank
97 AustraliaBMC
98 Great BritainSky
99 Great BritainSky
100 ItalyAstana
101 Great BritainSky
102 Great BritainSky
103 Great BritainSky
104 Great BritainSky
105 ColombiaINEOS
106 
107 SloveniaUAE-Team Emirates


Because the rest of the dataset is not well formatted, let's add the remaining data manually.

In [128]:
metadata["winner_origin"].append("USA")
metadata["winning_team"].append("US Postal")

In [129]:
metadata["winner_origin"].append("USA")
metadata["winning_team"].append("Discovery")

In [130]:
metadata["winner_origin"].append("Spain")
metadata["winning_team"].append("Caisse d'Epargne")

In [131]:
metadata["winner_origin"].append("Spain")
metadata["winning_team"].append("Discovery")

In [132]:
metadata["winner_origin"].append("Spain")
metadata["winning_team"].append("CSC-Saxo Bank")

In [133]:
metadata["winner_origin"].append("Spain")
metadata["winning_team"].append("Astana")

In [134]:
metadata["winner_origin"].append("Luxembourg")
metadata["winning_team"].append("Saxo Bank")

In [135]:
metadata["winner_origin"].append("Australia")
metadata["winning_team"].append("BMC")

In [136]:
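# Rows 98-104 are Great Britain / Sky, except row 100 (Italy / Astana)
for i in range(98, 105):
    origin, team = ("Italy", "Astana") if i == 100 else ("Great Britain", "Sky")
    metadata["winner_origin"].append(origin)
    metadata["winning_team"].append(team)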

In [137]:
metadata["winner_origin"].append("Colombia")
metadata["winning_team"].append("INEOS")

In [138]:
metadata["winner_origin"].append("Slovenia")
metadata["winning_team"].append("UAE-Team Emirates")
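
As an alternative to typing each pair by hand, the fused cells printed above (such as `USAUS Postal`) could be split on a list of known country names. This is only a sketch under stated assumptions: `split_origin_team` and `COUNTRIES` are hypothetical names, and the country list is simply read off the printed output.

```python
# Country names as they appear at the start of the fused cells printed above
COUNTRIES = [
    "USA", "Spain", "Luxembourg", "Australia",
    "Great Britain", "Italy", "Colombia", "Slovenia",
]


def split_origin_team(cell_text, countries=COUNTRIES):
    """Split a fused 'CountryTeam' cell into (country, team).

    Returns (None, None) when the cell does not start with a known country.
    """
    # Try longer names first so a short name never shadows a longer one
    for country in sorted(countries, key=len, reverse=True):
        if cell_text.startswith(country):
            return country, cell_text[len(country):].strip()
    return None, None


samples = ["USAUS Postal", "Great BritainSky", "SloveniaUAE-Team Emirates", ""]
pairs = [split_origin_team(s) for s in samples]
print(pairs)
```

A cell such as `SpainCaissed'Epargne` would still come out as `Caissed'Epargne`, so team names with a missing space would need a manual fix either way.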

In [139]:
for values in metadata.values():
    print(len(values))

107
107
107
107
107
107
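
The length check above is done by eye. A small guard could make it explicit before building the DataFrame. This is a minimal sketch; `check_equal_lengths` is a hypothetical helper and the sample dictionaries are made up.

```python
def check_equal_lengths(columns):
    """Return a {name: length} dict, or raise if the lists differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return lengths


# Made-up sample data: one consistent dictionary and one broken dictionary
good = {"year": ["1903", "1904"], "winner": ["a", "b"]}
bad = {"year": ["1903", "1904"], "winner": ["a"]}

print(check_equal_lengths(good))

try:
    check_equal_lengths(bad)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)
```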


In [140]:
df = pd.DataFrame(metadata)

## Let's add some new fields in our metadata dictionary

Before we move on, we want to add some new fields to our `metadata` dictionary. We will use them later to feed our DataFrame with data from new sources.

In [449]:
metadata["winner_chrono"] = []
metadata["winner_timedelta"] = []

### Let's clean up some columns

#### Retrieve age of winners

Our "winner" column contains more information, but it is messy. Let's clean it up a bit more, starting with the age of the winner, which we extract with a simple regular expression. Scanning through the results, only one value looks suspicious: a 14-year-old winner, which seems too young to win the Tour de France. On closer inspection, some data is missing for that year, so we'll replace the value manually.

In [450]:
def get_age(text):
    """Return the first two-digit number found in the text as an int."""
    regex = re.compile(r"\d{2}")
    find = re.search(regex, text)
    return int(find.group())

In [451]:
list_age = []

for i, win in enumerate(df["winner"]):
    age = get_age(win)
    list_age.append(age)
    print(i, age, df["winner"][i])

0 32 Garin, Maurice, 3293hr 33min 14sec
1 20 Cornet, Henri, 2096hr 5min 55sec
2 24 Trousselier, Louis, 2435 points
3 27 Pottier, René, 2731 points
4 24 Petit-Breton, Lucien, 2447 points
5 25 Petit-Breton, Lucien, 2536 points
6 22 Faber, François, 2237 points
7 22 Lapize, Octave, 2263 points
8 29 Garrigou, Gustave, 2943 points
9 24 Defraye, Odile, 2449 points
10 23 Thys, Philippe, 23197hr 54min 0sec
11 24 Thys, Philippe, 24200hr 28min 49sec
12 33 Lambot, Firmin, 33231hr 7min 15sec
13 30 Thys, Philippe, 30228hr 36min 13sec
14 33 Scieur, Léon 33221hr 36min 0sec
15 36 Lambot, Firmin, 36222hr 8min 6sec
16 34 Pélissier, Henri, 34222hr 15min 30sec
17 30 Bottecchia, Ottavio, 30226hr 18min 21sec
18 31 Bottecchia, Ottavio, 31219hr 10min 18sec
19 33 Buysse, Lucien, 33238hr 44min 25sec
20 28 Frantz, Nicolas, 28198hr 16min 42sec
21 29 Frantz, Nicolas, 29192hr 48min 58sec 
22 33 De Waele, Maurice, 33186hr 39min 16sec
23 26 Leducq, André, 26172hr 12min 16sec
24 27 Magne, Antonin, 27177hr 10min 3sec
...

In [452]:
# Manually fix the age for the year with missing data
list_age[31] = 24

In [453]:
# Finally, let's add the data to our DataFrame
df["age"] = list_age

#### Retrieve winners' chrono

The "winner" column also contains each winner's total time (chrono): how long it took them to ride from the first stage to the last. We want to store that information in a dedicated column, which may be helpful later on if we want to compare the performance of each winner.

In [454]:
def get_chrono(text):
    regex = re.compile(r"(?<=\d{2}).+")
    find = re.search(regex, text)
    return find.group()
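
To see what the look-behind in `get_chrono` does, the same pattern can be run on a few of the cells printed earlier. This is only an illustration: the match starts right after the first two consecutive digits (the age) and takes the rest of the string.

```python
import re

# Same pattern as in get_chrono: everything that follows the first two digits
CHRONO_RE = re.compile(r"(?<=\d{2}).+")

samples = [
    "Thys, Philippe, 23197hr 54min 0sec",
    "Trousselier, Louis, 2435 points",
    "Garin, Maurice, 3293hr 33min 14sec",
]

tails = [CHRONO_RE.search(s).group() for s in samples]
for sample, tail in zip(samples, tails):
    print(repr(sample), "->", repr(tail))
```

The second sample shows why the points-based years cannot be converted to a duration: what remains after the age is `35 points`, not a time.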

In [455]:
df.iloc[21]["winner"].replace("\xa0", " ")

'Frantz, Nicolas, 29192hr 48min 58sec '

In [456]:
for i, win in enumerate(df["winner"]):
    chrono = get_chrono(win)
    # Replace non-breaking spaces
    chrono = chrono.replace("\xa0", " ")
    # Handle one badly formatted cell
    chrono = chrono.replace("; ", "")
    try:
        delta = pd.to_timedelta(chrono)
        metadata["winner_chrono"].append(chrono)
        metadata["winner_timedelta"].append(delta)
    except ValueError:
        metadata["winner_chrono"].append(np.nan)
        metadata["winner_timedelta"].append(np.nan)
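
The `try`/`except` loop above could also be written as a vectorized conversion. This is a sketch, assuming the cleaned strings are already in a Series: `pd.to_timedelta` with `errors="coerce"` turns values it cannot parse into `NaT` instead of raising.

```python
import pandas as pd

# Two time strings and one points-based value, as extracted above
chronos = pd.Series(["197hr 54min 0sec", "35 points", "90hr 43min 20sec"])

# Unparseable entries become NaT rather than raising ValueError
deltas = pd.to_timedelta(chronos, errors="coerce")
print(deltas)
```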

Because some of the winners' times are improperly formatted or even wrong, we will correct those values manually.

In [457]:
df["winner_chrono"] = metadata["winner_chrono"]
df["winner_timedelta"] = metadata["winner_timedelta"]

In [458]:
df.at[0, "winner_chrono"] = "94hr 33min 14sec"
df.at[0, "winner_timedelta"] = pd.to_timedelta("94hr 33min 14sec")

In [459]:
df.at[92, "winner_chrono"] = "89hr 39min 30sec"
df.at[92, "winner_timedelta"] = pd.to_timedelta("89hr 39min 30sec")

In [460]:
df.at[96, "winner_chrono"] = "91hr 58min 48sec"
df.at[96, "winner_timedelta"] = pd.to_timedelta("91hr 58min 48sec")

In [461]:
df[df["winner_chrono"].isnull()]

| | year | winner | second | third | winner_origin | winning_team | age | winner_chrono | winner_timedelta |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 1905 | Trousselier, Louis, 2435 points | HippolyteAucouturier61 pts. | Jean-BaptisteDortignacq64 pts. | France | Peugeot | 24 | NaN | NaT |
| 3 | 1906 | Pottier, René, 2731 points | Georges Passerieu39 pts | LouisTrousselier59pts | France | Peugeot | 27 | NaN | NaT |
| 4 | 1907 | Petit-Breton, Lucien, 2447 points | Gustave Garrigou66 pts | Émile Georget74 pts | France | Peugeot | 24 | NaN | NaT |
| 5 | 1908 | Petit-Breton, Lucien, 2536 points | François Faber68 pts | Georges Passerieu75 pts | France | Peugeot | 25 | NaN | NaT |
| 6 | 1909 | Faber, François, 2237 points | Gustave Garrigou57 pts | Jean Alavoine66 pts | Luxembourg | Alcyon | 22 | NaN | NaT |
| 7 | 1910 | Lapize, Octave, 2263 points | François Faber67 pts | GustaveGarrigou86 pts | France | Alcyon | 22 | NaN | NaT |
| 8 | 1911 | Garrigou, Gustave, 2943 points | Paul Duboc61 pts | Émile Georget84 pts | France | Alcyon | 29 | NaN | NaT |
| 9 | 1912 | Defraye, Odile, 2449 points | Eugène Christophe108 pts | GustaveGarrigou140 pts | Belgium | Alcyon | 24 | NaN | NaT |


In [462]:
df.iloc[76]

year                                            1990
winner              LeMond, Greg, 2990hr 43min 20sec
second                  Claudio Chiappucci2min 16sec
third                        Erik Breukink2min 29sec
winner_origin                                 U.S.A.
winning_team                                       Z
age                                               29
winner_chrono                       90hr 43min 20sec
winner_timedelta                     3 days 18:43:20
Name: 76, dtype: object

In [463]:
df.at[106, "winner_chrono"]

'      87hr 20min 5sec'

In [464]:
df.at[14, "winner_chrono"] = "221hr 50min 26sec"
df.at[14, "winner_timedelta"] = pd.to_timedelta("221hr 50min 26sec")

In [465]:
df.at[31, "winner_chrono"] = "148hr 29min 12sec"
df.at[31, "winner_timedelta"] = pd.to_timedelta("148hr 29min 12sec")

In [466]:
df.at[54, "winner_chrono"] = "133hr 49min 42sec"
df.at[54, "winner_timedelta"] = pd.to_timedelta("133hr 49min 42sec")

In [467]:
df.at[64, "winner_chrono"] = "112hr 3min 2sec"
df.at[64, "winner_timedelta"] = pd.to_timedelta("112hr 3min 2sec")

In [468]:
df.at[74, "winner_chrono"] = "84hr 27min 58sec"
df.at[74, "winner_timedelta"] = pd.to_timedelta("84hr 27min 58sec")

In [469]:
df.at[80, "winner_chrono"] = "103hr 38min 38sec"
df.at[80, "winner_timedelta"] = pd.to_timedelta("103hr 38min 38sec")
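
The repeated pairs of `df.at` assignments above could be driven from a single mapping of row label to corrected time. The sketch below uses a small stand-in frame (`df_demo`, hypothetical) with two of the corrections listed above; it is one possible layout, not the notebook's own code.

```python
import pandas as pd

# Stand-in frame with two rows whose times could not be parsed
df_demo = pd.DataFrame({"winner_chrono": ["bad", "bad"]}, index=[14, 31])
df_demo["winner_timedelta"] = pd.to_timedelta(df_demo["winner_chrono"], errors="coerce")

# Row label -> corrected time, taken from the manual fixes above
corrections = {
    14: "221hr 50min 26sec",
    31: "148hr 29min 12sec",
}

for row, chrono in corrections.items():
    # Keep the text column and the timedelta column in sync
    df_demo.at[row, "winner_chrono"] = chrono
    df_demo.at[row, "winner_timedelta"] = pd.to_timedelta(chrono)

print(df_demo)
```

Keeping the corrections in one dictionary makes it harder for the text and timedelta columns to drift apart, since both are written from the same string.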

In [484]:
for i in range(97, 107):
    df.at[i, "winner_chrono"] = df.at[i, "winner_chrono"].replace("\t", "")
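
If the goal is to drop all surrounding whitespace rather than tabs only, a vectorized `str.strip()` is one option. This is a sketch on made-up sample values; the leading tab in the second entry is an assumption about what the raw cells may contain.

```python
import pandas as pd

# Sample values with leading spaces, a leading tab, and a missing entry
chronos = pd.Series(["      87hr 20min 5sec", "\t83hr 56min 40sec", None])

# .str.strip() removes leading and trailing whitespace and leaves missing values untouched
cleaned = chronos.str.strip()
print(cleaned.tolist())
```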

In [485]:
df.at[99, "winner_chrono"]

'      83hr 56min 40sec'

#### Clean up winners' names

The "winner" column is still messy. Let's clean it up further now that we have extracted all the values we care about.

0                Garin, Maurice, 3293hr 33min 14sec
1                  Cornet, Henri, 2096hr 5min 55sec
2                   Trousselier, Louis, 2435 points
3                        Pottier, René, 2731 points
4                 Petit-Breton, Lucien, 2447 points
                           ...                     
102     Christopher Froome, 31      89hr 4min 48sec
103    Christopher Froome, 32      86hr 20min 55sec
104        Geraint Thomas, 32      83hr 17min 13sec
105            Egan Bernal, 22      82hr 57min 0sec
106          Tadej Pogacar, 20      87hr 20min 5sec
Name: winner, Length: 107, dtype: object
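
One possible way to isolate the name, given the values shown above, is to keep everything before the first digit and trim the leftover separators. This is a sketch only: `extract_name` is a hypothetical helper, and it assumes no rider name contains a digit.

```python
import re


def extract_name(winner_text):
    """Return the part of the cell that precedes the first digit, trimmed."""
    head = re.split(r"\d", winner_text, maxsplit=1)[0]
    # Normalize non-breaking spaces, then drop trailing commas and whitespace
    return head.replace("\xa0", " ").strip(" ,\t")


samples = [
    "Garin, Maurice, 3293hr 33min 14sec",
    "Scieur, Léon 33221hr 36min 0sec",
    "Christopher Froome, 31      89hr 4min 48sec",
]

names = [extract_name(s) for s in samples]
print(names)
```

The older rows use "Last, First" and the recent ones "First Last", so the two formats would still need to be reconciled afterwards.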