# Data Cleaning for the Races Dataset

In [30]:
# Import libraries

import pandas as pd

In [31]:
# Read the CSV file, treating "\N" as a missing value
df = pd.read_csv('../data/races.csv', na_values=["\\N"])
df.head()

Unnamed: 0,raceId,year,round,circuitId,name,date,time,url,fp1_date,fp1_time,fp2_date,fp2_time,fp3_date,fp3_time,quali_date,quali_time,sprint_date,sprint_time
0,1,2009,1,1,Australian Grand Prix,2009-03-29,06:00:00,http://en.wikipedia.org/wiki/2009_Australian_G...,,,,,,,,,,
1,2,2009,2,2,Malaysian Grand Prix,2009-04-05,09:00:00,http://en.wikipedia.org/wiki/2009_Malaysian_Gr...,,,,,,,,,,
2,3,2009,3,17,Chinese Grand Prix,2009-04-19,07:00:00,http://en.wikipedia.org/wiki/2009_Chinese_Gran...,,,,,,,,,,
3,4,2009,4,3,Bahrain Grand Prix,2009-04-26,12:00:00,http://en.wikipedia.org/wiki/2009_Bahrain_Gran...,,,,,,,,,,
4,5,2009,5,4,Spanish Grand Prix,2009-05-10,12:00:00,http://en.wikipedia.org/wiki/2009_Spanish_Gran...,,,,,,,,,,


In [32]:
df.dtypes

raceId          int64
year            int64
round           int64
circuitId       int64
name           object
date           object
time           object
url            object
fp1_date       object
fp1_time       object
fp2_date       object
fp2_time       object
fp3_date       object
fp3_time       object
quali_date     object
quali_time     object
sprint_date    object
sprint_time    object
dtype: object

The date columns (`date`, `fp1_date`, `fp2_date`, `fp3_date`, `quali_date`, and `sprint_date`) have an inconsistent data type: they are stored as `object`, but they hold dates and should be `datetime64[ns]`.

The time columns are also stored as `object`. pandas has no native NumPy-backed dtype for a time of day in `%H:%M:%S` format: Python's `datetime.time` exists, but a column of those values is still reported as `object`. These columns are therefore left as strings; when the CSV files are uploaded to BigQuery, the values should be recognized as the `TIME` type.
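If the times ever need to be handled inside pandas, two options are sketched below on a small made-up Series (not the real file). `pd.to_datetime(...).dt.time` yields Python `datetime.time` objects in an `object` column, while `pd.to_timedelta` produces a native timedelta column measuring time elapsed since midnight.

```python
import datetime as dt
import pandas as pd

# Hypothetical sample standing in for a column such as `time`
times = pd.Series(["06:00:00", "09:00:00", None])

# Option 1: Python datetime.time objects; the Series dtype remains object
as_time = pd.to_datetime(times, format="%H:%M:%S").dt.time

# Option 2: elapsed time since midnight, stored with a native timedelta dtype
as_delta = pd.to_timedelta(times)

print(as_time.dtype)
print(as_delta.dtype)
```

Neither conversion is needed for the BigQuery upload described above; the sketch only illustrates what pandas offers.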

In [33]:
# Convert the date columns to datetime
df['date'] = pd.to_datetime(df['date'])
df['fp1_date'] = pd.to_datetime(df['fp1_date'])
df['fp2_date'] = pd.to_datetime(df['fp2_date'])
df['fp3_date'] = pd.to_datetime(df['fp3_date'])
df['quali_date'] = pd.to_datetime(df['quali_date'])
df['sprint_date'] = pd.to_datetime(df['sprint_date'])
df.head()

Unnamed: 0,raceId,year,round,circuitId,name,date,time,url,fp1_date,fp1_time,fp2_date,fp2_time,fp3_date,fp3_time,quali_date,quali_time,sprint_date,sprint_time
0,1,2009,1,1,Australian Grand Prix,2009-03-29,06:00:00,http://en.wikipedia.org/wiki/2009_Australian_G...,NaT,,NaT,,NaT,,NaT,,NaT,
1,2,2009,2,2,Malaysian Grand Prix,2009-04-05,09:00:00,http://en.wikipedia.org/wiki/2009_Malaysian_Gr...,NaT,,NaT,,NaT,,NaT,,NaT,
2,3,2009,3,17,Chinese Grand Prix,2009-04-19,07:00:00,http://en.wikipedia.org/wiki/2009_Chinese_Gran...,NaT,,NaT,,NaT,,NaT,,NaT,
3,4,2009,4,3,Bahrain Grand Prix,2009-04-26,12:00:00,http://en.wikipedia.org/wiki/2009_Bahrain_Gran...,NaT,,NaT,,NaT,,NaT,,NaT,
4,5,2009,5,4,Spanish Grand Prix,2009-05-10,12:00:00,http://en.wikipedia.org/wiki/2009_Spanish_Gran...,NaT,,NaT,,NaT,,NaT,,NaT,


In [34]:
df.dtypes

raceId                  int64
year                    int64
round                   int64
circuitId               int64
name                   object
date           datetime64[ns]
time                   object
url                    object
fp1_date       datetime64[ns]
fp1_time               object
fp2_date       datetime64[ns]
fp2_time               object
fp3_date       datetime64[ns]
fp3_time               object
quali_date     datetime64[ns]
quali_time             object
sprint_date    datetime64[ns]
sprint_time            object
dtype: object

The date columns now have the correct dtype, `datetime64[ns]`.
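The same conversion can be written as a loop over the date columns. The sketch below uses a tiny made-up frame; the explicit `format="%Y-%m-%d"` is an assumption based on the ISO-style dates visible in the output above.

```python
import pandas as pd

# Hypothetical two-row sample with a few of the date columns from races.csv
sample = pd.DataFrame({
    "date": ["2022-04-24", "2009-03-29"],
    "fp1_date": ["2022-04-22", None],
    "sprint_date": ["2022-04-23", None],
})

# Select `date` plus every column whose name ends in `_date`
date_cols = [c for c in sample.columns if c == "date" or c.endswith("_date")]

# An explicit format makes malformed values raise instead of being guessed
for col in date_cols:
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d")

print(sample.dtypes)
```

Selecting columns by name keeps the cell short and picks up any additional `*_date` column automatically.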

## Identifying Missing Data

In [35]:
# Count missing values per column
df.isnull().sum()

raceId            0
year              0
round             0
circuitId         0
name              0
date              0
time            731
url               0
fp1_date       1035
fp1_time       1057
fp2_date       1035
fp2_time       1057
fp3_date       1053
fp3_time       1072
quali_date     1035
quali_time     1057
sprint_date    1107
sprint_time    1110
dtype: int64

Several time and date fields have missing values. Most of the gaps are in older races, which suggests the information was not recorded or published at the time. Some fields did not exist in that period at all: the Sprint race, for example, was only introduced in Formula 1 in 2021 (see the example below). Not every gap is historical, though: in the output below, `fp3_date` is `NaT` on Sprint weekends, and the 2021 rows have session dates but no times for the practice, qualifying, or Sprint sessions.

In [36]:
# Races whose Sprint date is not missing
df[df.sprint_date.notnull()]

Unnamed: 0,raceId,year,round,circuitId,name,date,time,url,fp1_date,fp1_time,fp2_date,fp2_time,fp3_date,fp3_time,quali_date,quali_time,sprint_date,sprint_time
1046,1061,2021,10,9,British Grand Prix,2021-07-18,14:00:00,http://en.wikipedia.org/wiki/2021_British_Gran...,2021-07-16,,2021-07-17,,NaT,,2021-07-16,,2021-07-17,
1050,1065,2021,14,14,Italian Grand Prix,2021-09-12,13:00:00,http://en.wikipedia.org/wiki/2021_Italian_Gran...,2021-09-10,,2021-09-11,,NaT,,2021-09-10,,2021-09-11,
1055,1071,2021,19,18,São Paulo Grand Prix,2021-11-14,17:00:00,http://en.wikipedia.org/wiki/2021_S%C3%A3o_Pau...,2021-11-12,,2021-11-13,,NaT,,2021-11-12,,2021-11-13,
1060,1077,2022,4,21,Emilia Romagna Grand Prix,2022-04-24,13:00:00,http://en.wikipedia.org/wiki/2022_Emilia_Romag...,2022-04-22,11:30:00,2022-04-23,10:30:00,NaT,,2022-04-22,15:00:00,2022-04-23,14:30:00
1067,1084,2022,11,70,Austrian Grand Prix,2022-07-10,13:00:00,http://en.wikipedia.org/wiki/2022_Austrian_Gra...,2022-07-08,11:30:00,2022-07-09,10:30:00,NaT,,2022-07-08,15:00:00,2022-07-09,14:30:00
1077,1095,2022,21,18,São Paulo Grand Prix,2022-11-13,18:00:00,http://en.wikipedia.org/wiki/2022_Brazilian_Gr...,2022-11-11,15:30:00,2022-11-12,15:30:00,NaT,,2022-11-11,19:00:00,2022-11-12,19:30:00
1082,1101,2023,4,73,Azerbaijan Grand Prix,2023-04-30,11:00:00,https://en.wikipedia.org/wiki/2023_Azerbaijan_...,2023-04-28,09:30:00,2023-04-29,09:30:00,NaT,,2023-04-28,13:00:00,2023-04-29,13:30:00
1087,1107,2023,9,70,Austrian Grand Prix,2023-07-02,13:00:00,https://en.wikipedia.org/wiki/2023_Austrian_Gr...,2023-06-30,11:30:00,2023-07-01,10:30:00,NaT,,2023-06-30,15:00:00,2023-07-01,14:30:00
1090,1110,2023,12,13,Belgian Grand Prix,2023-07-30,13:00:00,https://en.wikipedia.org/wiki/2023_Belgian_Gra...,2023-07-28,11:30:00,2023-07-29,10:30:00,NaT,,2023-07-28,15:00:00,2023-07-29,14:30:00
1095,1115,2023,17,78,Qatar Grand Prix,2023-10-08,17:00:00,https://en.wikipedia.org/wiki/2023_Qatar_Grand...,2023-10-06,13:30:00,2023-10-07,13:00:00,NaT,,2023-10-06,17:00:00,2023-10-07,17:30:00
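One way to see where the gaps fall is to compute the share of missing values per year. The sketch below runs on a small made-up frame; with the real data, `sample` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical sample; the real frame would be `df` from the cells above
sample = pd.DataFrame({
    "year": [1950, 1950, 2021, 2022],
    "time": [None, None, "14:00:00", "13:00:00"],
    "sprint_date": [None, None, "2021-07-17", None],
})

# Share of missing values per year for each remaining column
missing_by_year = sample.drop(columns="year").isna().groupby(sample["year"]).mean()
print(missing_by_year)
```

A value of 1.0 means the column is missing for every race of that year, and 0.0 means it is always present.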


## Identifying Duplicate Records

In [37]:
# Check for fully duplicated rows
df[df.duplicated(keep=False)]

Unnamed: 0,raceId,year,round,circuitId,name,date,time,url,fp1_date,fp1_time,fp2_date,fp2_time,fp3_date,fp3_time,quali_date,quali_time,sprint_date,sprint_time


In [38]:
# Check for duplicates in the 'raceId' column only
df[df.duplicated(subset='raceId', keep=False)]

Unnamed: 0,raceId,year,round,circuitId,name,date,time,url,fp1_date,fp1_time,fp2_date,fp2_time,fp3_date,fp3_time,quali_date,quali_time,sprint_date,sprint_time


No duplicate records were found.
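The same checks can be expressed as counts rather than inspected visually. This is a minimal sketch on a made-up frame; the `(year, round)` check is an added assumption that each pair should identify a single race.

```python
import pandas as pd

# Hypothetical sample with the key columns of races.csv
sample = pd.DataFrame({
    "raceId": [1, 2, 3],
    "year": [2009, 2009, 2009],
    "round": [1, 2, 3],
})

# Number of rows that repeat an earlier row in every column
n_full_duplicates = int(sample.duplicated().sum())

# True when no raceId value appears more than once
race_id_is_unique = sample["raceId"].is_unique

# Assumption: each (year, round) pair should identify a single race
n_year_round_duplicates = int(sample.duplicated(subset=["year", "round"]).sum())

print(n_full_duplicates, race_id_is_unique, n_year_round_duplicates)
```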

## Descriptive Analysis of the Numeric Variables

In [39]:
# Summary statistics for the numeric variables
df.describe()

Unnamed: 0,raceId,year,round,circuitId
count,1125.0,1125.0,1125.0,1125.0
mean,565.710222,1992.703111,8.579556,23.889778
std,328.813817,20.603848,5.15991,19.633527
min,1.0,1950.0,1.0,1.0
25%,282.0,1977.0,4.0,9.0
50%,563.0,1994.0,8.0,18.0
75%,845.0,2011.0,13.0,34.0
max,1144.0,2024.0,24.0,80.0


The summary statistics above show no inconsistencies: every column has 1,125 non-null values, `year` ranges from 1950 to 2024, and `round` ranges from 1 to 24.
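Beyond the ranges reported by `describe()`, a couple of cross-column checks could be added. The sketch below uses a made-up frame and assumes that the `year` column should always match the year of `date`.

```python
import pandas as pd

# Hypothetical sample with the columns used in the checks
sample = pd.DataFrame({
    "year": [2009, 2009],
    "round": [1, 2],
    "date": pd.to_datetime(["2009-03-29", "2009-04-05"]),
})

# Assumption: the season year equals the calendar year of the race date
year_matches_date = bool((sample["date"].dt.year == sample["year"]).all())

# Round numbers start at 1 within each season
rounds_are_positive = bool((sample["round"] >= 1).all())

print(year_matches_date, rounds_are_positive)
```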

## Saving the Cleaned Dataset

In [40]:
df.to_csv('../data_cleaned/races.csv', index=False)
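Because CSV stores no type information, the date columns come back as strings unless they are parsed again on load. The sketch below shows the round trip with an in-memory buffer and a made-up frame instead of the real file.

```python
import io
import pandas as pd

# Hypothetical sample with one populated and one missing Sprint date
sample = pd.DataFrame({
    "raceId": [1, 1061],
    "date": pd.to_datetime(["2009-03-29", "2021-07-18"]),
    "sprint_date": pd.to_datetime([None, "2021-07-17"]),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype when reading the file back
reloaded = pd.read_csv(buffer, parse_dates=["date", "sprint_date"])
print(reloaded.dtypes)
```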