In [98]:
import pandas as pd
import re
from geopy.geocoders import Nominatim

# Cleaning the 'races' file

In [110]:
races = pd.read_csv('../data/races.csv', index_col = 0)

In [111]:
races.head(10)

Unnamed: 0,0,1,0.1,1.1,2
0,19,NOV,Valencia MotoGP™ Official Test,Circuit Ricardo Tormo,SPAIN
1,25,NOV,Jerez MotoGP™ Official Test,Circuito de Jerez - Angel Nieto,SPAIN
2,7,FEB,Sepang MotoGP™ Official Test,Sepang International Circuit,MALAYSIA
3,19,FEB,Jerez Moto2™-Moto3™ Test,Circuito de Jerez - Angel Nieto,SPAIN
4,22,FEB,Qatar MotoGP™ Test,Losail International Circuit,QATAR
5,28,FEB,Qatar Moto2™-Moto3™ Test,Losail International Circuit,QATAR
6,8,MAR,1 - Grand Prix of Qatar,Losail International Circuit,QATAR
7,22,MAR,2 - OR Thailand Grand Prix,TT Circuit Assen,NETHERLANDS
8,5,APR,3 - Red Bull Grand Prix of The Americas,Circuit Of The Americas,UNITED STATES
9,19,APR,4 - Gran Premio Motul de la República Argentina,Termas de Río Hondo,ARGENTINA


## Renaming columns

In [53]:
races.rename(columns = {'0':'Day', '1':'Month', '0.1':'Race', '1.1':'Circuit', '2':'Country'}, inplace = True)

In [54]:
races.head()

Unnamed: 0,Day,Month,Race,Circuit,Country
0,19,NOV,Valencia MotoGP™ Official Test,Circuit Ricardo Tormo,SPAIN
1,25,NOV,Jerez MotoGP™ Official Test,Circuito de Jerez - Angel Nieto,SPAIN
2,7,FEB,Sepang MotoGP™ Official Test,Sepang International Circuit,MALAYSIA
3,19,FEB,Jerez Moto2™-Moto3™ Test,Circuito de Jerez - Angel Nieto,SPAIN
4,22,FEB,Qatar MotoGP™ Test,Losail International Circuit,QATAR
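
One caveat with `rename`: by default it silently ignores keys that do not match any column, so a typo in the mapping would go unnoticed. A minimal sketch of a stricter variant, assuming a toy frame (`sample`) with the same positional labels as the scraped file:

```python
import pandas as pd

# Toy frame with the same positional column labels as the scraped file.
sample = pd.DataFrame(
    [[19, "NOV", "Valencia MotoGP™ Official Test", "Circuit Ricardo Tormo", "SPAIN"]],
    columns=["0", "1", "0.1", "1.1", "2"],
)

column_names = {"0": "Day", "1": "Month", "0.1": "Race", "1.1": "Circuit", "2": "Country"}

# errors="raise" makes pandas raise a KeyError if a key is not an existing column,
# instead of skipping it silently.
renamed = sample.rename(columns=column_names, errors="raise")
print(list(renamed.columns))
```

Returning a new frame instead of using `inplace=True` also keeps the original available for comparison.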


## Removing 'Tests'

For this project, I will focus only on the points-scoring races, which are the ones called 'Grand Prix'.

In [55]:
# Keep only the rows whose 'Race' name does not contain the word 'Test'.
races = races.loc[~races['Race'].str.contains('Test')]

In [56]:
races.head()

Unnamed: 0,Day,Month,Race,Circuit,Country
6,8,MAR,1 - Grand Prix of Qatar,Losail International Circuit,QATAR
7,22,MAR,2 - OR Thailand Grand Prix,TT Circuit Assen,NETHERLANDS
8,5,APR,3 - Red Bull Grand Prix of The Americas,Circuit Of The Americas,UNITED STATES
9,19,APR,4 - Gran Premio Motul de la República Argentina,Termas de Río Hondo,ARGENTINA
10,3,MAY,5 - Gran Premio Red Bull de España,Circuito de Jerez - Angel Nieto,SPAIN
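
The filter relies on a boolean mask that is negated with `~`. A small sketch of the same idea on a toy frame follows; the missing race name is a hypothetical case added only to show what `na=False` does, and `regex=False` treats 'Test' as a literal string:

```python
import pandas as pd

sample = pd.DataFrame({
    "Race": [
        "Qatar MotoGP™ Test",
        "1 - Grand Prix of Qatar",
        None,  # hypothetical missing name, not present in the scraped data
    ]
})

# na=False turns missing names into False, so the mask is purely boolean.
is_test = sample["Race"].str.contains("Test", regex=False, na=False)

# ~ inverts the mask: keep everything that is not a test session.
grands_prix = sample.loc[~is_test]
print(is_test.tolist())
```

Without `na=False`, a missing name would produce a missing value in the mask, and negating or indexing with it can fail.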


## Updating dates

Now that the tests are removed, all the remaining dates fall in the same year, 2020. I will create a new 'Date' column of type datetime.

In [57]:
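# Build strings such as '8MAR2020' and parse them with an explicit format
# (infer_datetime_format is deprecated in recent pandas versions).
races['Date'] = pd.to_datetime(races['Day'].astype(str) + races['Month'] + '2020', format='%d%b%Y', errors='coerce')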

In [62]:
races = races[['Date', 'Race', 'Circuit', 'Country']].reset_index(drop=True)

In [63]:
races.head()

Unnamed: 0,Date,Race,Circuit,Country
0,2020-03-08,1 - Grand Prix of Qatar,Losail International Circuit,QATAR
1,2020-03-22,2 - OR Thailand Grand Prix,TT Circuit Assen,NETHERLANDS
2,2020-04-05,3 - Red Bull Grand Prix of The Americas,Circuit Of The Americas,UNITED STATES
3,2020-04-19,4 - Gran Premio Motul de la República Argentina,Termas de Río Hondo,ARGENTINA
4,2020-05-03,5 - Gran Premio Red Bull de España,Circuito de Jerez - Angel Nieto,SPAIN
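
Because `errors='coerce'` turns anything unparseable into `NaT` without warning, it is worth checking for missing dates after the conversion. A minimal sketch on toy data, assuming English month abbreviations; the `'???'` month is a hypothetical bad value added for illustration:

```python
import pandas as pd

sample = pd.DataFrame({
    "Day": [8, 22, 5, 19],
    "Month": ["MAR", "MAR", "APR", "???"],  # last value is deliberately invalid
})

# Build strings such as '8 Mar 2020' and parse them with an explicit format.
date_text = sample["Day"].astype(str) + " " + sample["Month"].str.title() + " 2020"
sample["Date"] = pd.to_datetime(date_text, format="%d %b %Y", errors="coerce")

# Rows that failed to parse show up as NaT and can be listed explicitly.
failed = sample.loc[sample["Date"].isna()]
print(failed)
```

An empty `failed` frame on the real data would confirm that every day and month combination was parsed.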


In [73]:
races.dtypes

Date       datetime64[ns]
Race               object
Circuit            object
Country            object
dtype: object

## Removing numbers from race names

The races are already in calendar order, so the round number at the start of each race name can be removed.

In [68]:
# '^' anchors the pattern so only a leading 'N - ' prefix is removed.
races['Race'] = races['Race'].apply(lambda x: re.sub(r'^\d+ - ', '', x))

In [69]:
races.head()

Unnamed: 0,Date,Race,Circuit,Country
0,2020-03-08,Grand Prix of Qatar,Losail International Circuit,QATAR
1,2020-03-22,OR Thailand Grand Prix,TT Circuit Assen,NETHERLANDS
2,2020-04-05,Red Bull Grand Prix of The Americas,Circuit Of The Americas,UNITED STATES
3,2020-04-19,Gran Premio Motul de la República Argentina,Termas de Río Hondo,ARGENTINA
4,2020-05-03,Gran Premio Red Bull de España,Circuito de Jerez - Angel Nieto,SPAIN
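
The same prefix removal can be written with the vectorised `str.replace`. The sketch below anchors the pattern with `^` so that a number appearing later in a name is left alone; the third name is a hypothetical example, not one from the calendar:

```python
import pandas as pd

sample = pd.Series([
    "1 - Grand Prix of Qatar",
    "10 - Motul TT Assen",
    "Round 3 - Example Grand Prix",  # hypothetical name with a number mid-string
])

# ^ anchors the match to the start of the string; \s* tolerates uneven spacing.
cleaned = sample.str.replace(r"^\d+\s*-\s*", "", regex=True)
print(cleaned.tolist())
```

An unanchored pattern such as `\d+ - ` would also strip the `3 - ` from the third name.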


## Checking duplicates

Except for the country, the other columns should not have any duplicates.

In [83]:
races['Date'].duplicated(keep = False).any()

False

In [84]:
races['Race'].duplicated(keep = False).any()

False

In [85]:
races['Circuit'].duplicated(keep = False).any()

True

In [86]:
races.loc[races['Circuit'].duplicated(keep = False)]

Unnamed: 0,Date,Race,Circuit,Country
1,2020-03-22,OR Thailand Grand Prix,TT Circuit Assen,NETHERLANDS
9,2020-06-28,Motul TT Assen,TT Circuit Assen,NETHERLANDS
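
The per-column checks above can also be collapsed into a single pass. A small sketch on a toy frame that reproduces the duplicated circuit:

```python
import pandas as pd

sample = pd.DataFrame({
    "Race": ["OR Thailand Grand Prix", "Motul TT Assen", "Grand Prix of Qatar"],
    "Circuit": ["TT Circuit Assen", "TT Circuit Assen", "Losail International Circuit"],
})

# Count, per column, how many rows share their value with another row.
duplicate_counts = {
    column: int(sample[column].duplicated(keep=False).sum())
    for column in ["Race", "Circuit"]
}
print(duplicate_counts)
```

With `keep=False` every member of a duplicated group is flagged, so a count of 2 means one pair of rows shares a value.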


The calendar on motogp.com contains an error, so the scraped data does too. As there is just one mistake (the circuit and country for the Thailand Grand Prix), I will correct it manually.

In [93]:
races.at[1,'Circuit'] = 'Chang International Circuit'

In [94]:
races.at[1,'Country'] = 'THAILAND'

In [95]:
races.loc[races['Circuit'].duplicated(keep = False)]

Unnamed: 0,Date,Race,Circuit,Country
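
A manual fix keyed on the row position breaks if the row order ever changes. A minimal sketch of the same correction selected by race name instead, on a toy frame:

```python
import pandas as pd

sample = pd.DataFrame({
    "Race": ["Grand Prix of Qatar", "OR Thailand Grand Prix"],
    "Circuit": ["Losail International Circuit", "TT Circuit Assen"],
    "Country": ["QATAR", "NETHERLANDS"],
})

# Select the row by its race name and overwrite both columns in one assignment.
is_thailand = sample["Race"] == "OR Thailand Grand Prix"
sample.loc[is_thailand, ["Circuit", "Country"]] = ["Chang International Circuit", "THAILAND"]
print(sample.loc[is_thailand])
```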


## Looking for coordinates with GeoPy

In [100]:
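# Create a geolocator to look up coordinates.
# Recent geopy versions require a custom user_agent; the name below is a placeholder.
geolocator = Nominatim(user_agent='motogp-races-cleaning')

# Create a new column with the geocoding result for each circuit
races['Lat_Lon'] = races['Circuit'].apply(geolocator.geocode)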

  


In [101]:
races['Lat_Lon']

0     (Lusail International Circuit, أم صلال, ‏قطر, ...
1     (Sepang International Circuit, Jalan Kuarters ...
2     (Circuit of The Americas, Larkdale Lane, Lake ...
3     (Termas de Río Hondo, Departamento Río Hondo, ...
4                                                  None
5     (Le Mans, Sarthe, Pays de la Loire, France mét...
6     (Autodromo del Mugello, Mugellino, Omo morto, ...
7     (Circuit de Barcelona-Catalunya, BV-5003, Mont...
8     (Sachsenring, Marienthal Ost, Zwickau-West, Zw...
9     (TT Circuit Assen, TT-tunnelweg, Assen, Drenth...
10    (KymiRing, 748, Kymentie, Iitti, Kouvolan seut...
11    (Automotodrom (rozc.), 3842, Žebětín, Brno, ok...
12    (Red Bull Ring, Spielberg, Murtal, Steiermark,...
13    (Silverstone Circuit, Village Corner, Silverst...
14    (Misano World Circuit Marco Simoncelli, Via Ca...
15    (Motorland Aragón, Puigmoreno, Alcañiz, Bajo A...
16    (ツインリンクもてぎ, Twin Ring Motegi, 茂木町, 芳賀郡, 栃木県, 関...
17    (Phillip Island, Norfolk Island, Australia
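
Nominatim is a shared public service whose usage policy asks for no more than one request per second, and geopy ships a `RateLimiter` helper for that. The sketch below is one possible way to wrap the lookup, not the approach used in this notebook: `build_geocoder`, `geocode_series` and `fake_geocode` are hypothetical helper names, and the demonstration uses a stand-in geocoder so it runs offline.

```python
import pandas as pd


def build_geocoder(user_agent: str):
    """Return a Nominatim geocode function limited to one call per second."""
    # Imported lazily so the rest of this sketch runs without geopy installed.
    from geopy.extra.rate_limiter import RateLimiter
    from geopy.geocoders import Nominatim

    geolocator = Nominatim(user_agent=user_agent)
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)


def geocode_series(names: pd.Series, geocode) -> pd.Series:
    """Apply any geocode callable to a column of place names."""
    return names.apply(geocode)


# Offline demonstration with a stand-in geocoder (hypothetical, no network access).
def fake_geocode(name):
    known = {"TT Circuit Assen": "(TT Circuit Assen, TT-tunnelweg, Assen, ...)"}
    return known.get(name)  # None when the name is not found


demo = geocode_series(
    pd.Series(["TT Circuit Assen", "Circuito de Jerez - Angel Nieto"]), fake_geocode
)
print(demo.isna().tolist())
```

With network access, the real lookup would be `geocode_series(races['Circuit'], build_geocoder('some-app-name'))`, where the user agent string is whatever identifies the application.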

As we can see above, the geolocator couldn't find an address for row 4. Let's check the name.

In [104]:
races.loc[4, 'Circuit']

'Circuito de Jerez - Angel Nieto'
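
Instead of scanning the output by eye, the unresolved rows can be selected directly, since a failed lookup is stored as a missing value. A small sketch with stand-in results (plain strings in place of geopy location objects):

```python
import pandas as pd

sample = pd.DataFrame({
    "Circuit": ["TT Circuit Assen", "Circuito de Jerez - Angel Nieto"],
    # Stand-in geocoder results: text where a match was found, None where not.
    "Lat_Lon": ["(TT Circuit Assen, TT-tunnelweg, Assen, ...)", None],
})

# isna() flags the rows where the geocoder returned nothing.
unresolved = sample.loc[sample["Lat_Lon"].isna(), "Circuit"]
print(unresolved.tolist())
```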

The problem may be caused by the 'Angel Nieto' suffix appended to the circuit name. Let's remove it:

In [105]:
races.at[4, 'Circuit'] = 'Circuito de Jerez'

In [106]:
races.loc[4, 'Circuit']

'Circuito de Jerez'

Now I can geocode the circuits again. This time I will keep only the coordinates.

In [107]:
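# Keep (latitude, longitude) only; leave None where the geocoder found no match.
races['Lat_Lon'] = races['Circuit'].apply(geolocator.geocode).apply(lambda loc: (loc.latitude, loc.longitude) if loc is not None else None)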

In [108]:
races.head()

Unnamed: 0,Date,Race,Circuit,Country,Lat_Lon
0,2020-03-08,Grand Prix of Qatar,Losail International Circuit,QATAR,"(25.4909996, 51.4520682675116)"
1,2020-03-22,OR Thailand Grand Prix,Chang International Circuit,THAILAND,"(2.7601913, 101.736858607147)"
2,2020-04-05,Red Bull Grand Prix of The Americas,Circuit Of The Americas,UNITED STATES,"(30.1387146, -97.6364097869511)"
3,2020-04-19,Gran Premio Motul de la República Argentina,Termas de Río Hondo,ARGENTINA,"(-27.4959255, -64.8640783)"
4,2020-05-03,Gran Premio Red Bull de España,Circuito de Jerez,SPAIN,"(36.69444715, -6.15631689958845)"


## Saving the cleaned races dataframe

In [109]:
races.to_csv('../data/races_cleaned.csv')
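
One thing to keep in mind when saving: CSV has no tuple type, so a column of `(lat, lon)` tuples is read back as plain text. The sketch below shows the round trip on a toy frame and one possible workaround, splitting the pair into two numeric columns; the `Latitude` and `Longitude` column names are my own choice, not part of the original data.

```python
import io

import pandas as pd

sample = pd.DataFrame({
    "Circuit": ["Circuito de Jerez"],
    "Lat_Lon": [(36.69444715, -6.15631689958845)],
})

# Round trip through an in-memory CSV: the tuple comes back as a string.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(type(reloaded.loc[0, "Lat_Lon"]))

# Splitting the pair into two float columns keeps the values numeric on reload.
sample["Latitude"] = [lat for lat, _ in sample["Lat_Lon"]]
sample["Longitude"] = [lon for _, lon in sample["Lat_Lon"]]
print(sample[["Latitude", "Longitude"]])
```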

# Cleaning the 'riders' file

In [115]:
riders = pd.read_csv('../data/riders.csv', index_col = 0)

In [116]:
riders.head()

Unnamed: 0,0,1,2,3
0,Andrea Dovizioso,Ducati Team,Bike: Ducati,Forlimpopoli
1,Johann Zarco,Reale Avintia Racing,Bike: Ducati,Cannes
2,Danilo Petrucci,Ducati Team,Bike: Ducati,Terni
3,Maverick Viñales,Monster Energy Yamaha MotoGP,Bike: Yamaha,Figueres
4,Fabio Quartararo,Petronas Yamaha SRT,Bike: Yamaha,Nice
