# Data Cleaning for the Results Dataset

In [14]:
# Import libraries

import pandas as pd

In [15]:
# Read the CSV file, treating the literal "\N" as a missing value
df = pd.read_csv('../data/results.csv', na_values=["\\N"])
df.head()

Unnamed: 0,resultId,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId
0,1,18,1,1,22.0,1,1.0,1,1,10.0,58,1:34:50.616,5690616.0,39.0,2.0,1:27.452,218.3,1
1,2,18,2,2,3.0,5,2.0,2,2,8.0,58,+5.478,5696094.0,41.0,3.0,1:27.739,217.586,1
2,3,18,3,3,7.0,7,3.0,3,3,6.0,58,+8.163,5698779.0,41.0,5.0,1:28.090,216.719,1
3,4,18,4,4,5.0,11,4.0,4,4,5.0,58,+17.181,5707797.0,58.0,7.0,1:28.603,215.464,1
4,5,18,5,1,23.0,3,5.0,5,5,4.0,58,+18.014,5708630.0,43.0,1.0,1:27.418,218.385,1


In [16]:
df.dtypes

resultId             int64
raceId               int64
driverId             int64
constructorId        int64
number             float64
grid                 int64
position           float64
positionText        object
positionOrder        int64
points             float64
laps                 int64
time                object
milliseconds       float64
fastestLap         float64
rank               float64
fastestLapTime      object
fastestLapSpeed    float64
statusId             int64
dtype: object

Several numeric columns (`number`, `position`, `milliseconds`, `fastestLap`, and `rank`) share the same inconsistency: they are typed as `float64` even though they hold integer values. This happens because `int64` cannot represent missing values, so pandas inferred `float64` for these columns instead.
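
To illustrate why pandas falls back to `float64`, here is a minimal sketch on a toy Series. The values are made up for illustration and are not read from `results.csv`.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical, not from results.csv): one entry is missing
raw = pd.Series([22, 3, np.nan])

# NumPy-backed int64 has no missing-value marker, so pandas stores floats
print(raw.dtype)  # float64

# The nullable extension dtype 'Int64' keeps integers and marks gaps as <NA>
nullable = raw.astype('Int64')
print(nullable.dtype)  # Int64
print(nullable.isna().sum())  # 1
```

The capital `I` matters: `'int64'` would raise an error on this Series because of the missing entry, while `'Int64'` accepts it.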

The `fastestLapTime` column, which holds time values, is stored as `object`; its values will be kept only for readable display. A separate column is therefore needed to represent these times in a format that supports arithmetic.

In [17]:
# Fix the dtype of the columns that should hold integers
# The 'Int64' dtype (capital 'I') stores integers and missing values together
df['number'] = pd.to_numeric(df['number']).astype('Int64')
df['position'] = pd.to_numeric(df['position']).astype('Int64')
df['milliseconds'] = pd.to_numeric(df['milliseconds']).astype('Int64')
df['fastestLap'] = pd.to_numeric(df['fastestLap']).astype('Int64')
df['rank'] = pd.to_numeric(df['rank']).astype('Int64')
df.head()

Unnamed: 0,resultId,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId
0,1,18,1,1,22,1,1,1,1,10.0,58,1:34:50.616,5690616,39,2,1:27.452,218.3,1
1,2,18,2,2,3,5,2,2,2,8.0,58,+5.478,5696094,41,3,1:27.739,217.586,1
2,3,18,3,3,7,7,3,3,3,6.0,58,+8.163,5698779,41,5,1:28.090,216.719,1
3,4,18,4,4,5,11,4,4,4,5.0,58,+17.181,5707797,58,7,1:28.603,215.464,1
4,5,18,5,1,23,3,5,5,5,4.0,58,+18.014,5708630,43,1,1:27.418,218.385,1


In [18]:
# Convert a lap-time string in 'M:SS.mmm' format to milliseconds
def time_to_milliseconds(t):
    if pd.isnull(t):  # Missing value: propagate as missing
        return None
    try:
        minutes, rest = t.split(':')  # Split the minutes from the remainder
        seconds, milliseconds = rest.split('.')  # Split the seconds from the milliseconds
        # Total = minutes and seconds converted to milliseconds, plus the millisecond part
        total_ms = (int(minutes) * 60 + int(seconds)) * 1000 + int(milliseconds)
        return total_ms
    except ValueError:  # Unexpected format: return missing
        return None

# Create the 'fastestLapTime_ms' column with the lap time in milliseconds
df['fastestLapTime_ms'] = df['fastestLapTime'].apply(time_to_milliseconds).astype('Int64')
df.dtypes

resultId               int64
raceId                 int64
driverId               int64
constructorId          int64
number                 Int64
grid                   int64
position               Int64
positionText          object
positionOrder          int64
points               float64
laps                   int64
time                  object
milliseconds           Int64
fastestLap             Int64
rank                   Int64
fastestLapTime        object
fastestLapSpeed      float64
statusId               int64
fastestLapTime_ms      Int64
dtype: object
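
As an alternative to the row-by-row `apply`, the same conversion could be sketched with `Series.str.extract`. This is not the approach used in the notebook; it assumes every valid lap time follows the `M:SS.mmm` pattern exactly (three millisecond digits), and the sample Series below is built by hand.

```python
import pandas as pd

# Hand-built sample in 'M:SS.mmm' format; None marks a missing lap time
lap_times = pd.Series(['1:27.452', '1:28.753', None])

# Capture minutes, seconds and milliseconds; rows that do not match become NaN
parts = lap_times.str.extract(r'^(\d+):(\d{2})\.(\d{3})$')
parts = parts.apply(pd.to_numeric)  # convert each captured column to numbers

# Combine the three parts into total milliseconds, then use nullable integers
lap_ms = ((parts[0] * 60 + parts[1]) * 1000 + parts[2]).astype('Int64')
print(lap_ms.tolist())
```

Strings that deviate from the assumed pattern end up as `<NA>` here, which mirrors the `except ValueError` branch of `time_to_milliseconds` only approximately.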

In [19]:
df.dtypes

resultId               int64
raceId                 int64
driverId               int64
constructorId          int64
number                 Int64
grid                   int64
position               Int64
positionText          object
positionOrder          int64
points               float64
laps                   int64
time                  object
milliseconds           Int64
fastestLap             Int64
rank                   Int64
fastestLapTime        object
fastestLapSpeed      float64
statusId               int64
fastestLapTime_ms      Int64
dtype: object

The inconsistency in the column dtypes has been fixed.

## Identifying Missing Data

In [20]:
# Count missing values per column
df.isnull().sum()

resultId                 0
raceId                   0
driverId                 0
constructorId            0
number                   6
grid                     0
position             10953
positionText             0
positionOrder            0
points                   0
laps                     0
time                 19079
milliseconds         19079
fastestLap           18507
rank                 18249
fastestLapTime       18507
fastestLapSpeed      18507
statusId                 0
fastestLapTime_ms    18507
dtype: int64

The missing values in these columns occur when the driver did not finish or did not take part in the race, as indicated by the `positionText` column, which records codes such as "R" (retired), "F" (failed to qualify), or "W" (withdrew). These gaps are therefore expected and consistent with the sporting context.

In [21]:
df[df.position.isnull()]

Unnamed: 0,resultId,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId,fastestLapTime_ms
8,9,18,9,2,4,2,,R,9,0.0,47,,,15,9,1:28.753,215.100,4,88753
9,10,18,10,7,12,18,,R,10,0.0,43,,,23,13,1:29.558,213.166,3,89558
10,11,18,11,8,18,19,,R,11,0.0,32,,,24,15,1:30.892,210.038,7,90892
11,12,18,12,4,6,20,,R,12,0.0,30,,,20,16,1:31.384,208.907,8,91384
12,13,18,13,6,2,4,,R,13,0.0,29,,,23,6,1:28.175,216.510,5,88175
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26737,26743,1143,861,3,43,19,,R,19,0.0,0,,,,0,,,4,
26738,26744,1143,839,214,31,20,,R,20,0.0,0,,,,0,,,4,
26756,26762,1144,822,15,77,9,,R,18,0.0,30,,,14,19,1:29.482,212.462,130,89482
26757,26763,1144,861,3,43,20,,R,19,0.0,26,,,5,17,1:29.411,212.631,5,89411
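
One way to check the link between a missing `position` and the `positionText` codes is to count the codes on the affected rows. The sketch below uses a small hand-made frame (hypothetical rows, not the real data); on the notebook's `df` the same two expressions would apply unchanged.

```python
import pandas as pd

# Hand-made frame with only the two relevant columns (hypothetical rows)
toy = pd.DataFrame({
    'position': pd.array([1, 2, None, None, None], dtype='Int64'),
    'positionText': ['1', '2', 'R', 'R', 'W'],
})

# Which codes appear where 'position' is missing?
missing_mask = toy['position'].isna()
codes_when_missing = toy.loc[missing_mask, 'positionText'].value_counts()
print(codes_when_missing)

# True if any row with a missing position still carries a numeric positionText
has_numeric_code = toy.loc[missing_mask, 'positionText'].str.isdigit().any()
print(has_numeric_code)
```

If `has_numeric_code` came back `True` on the real data, some missing positions would not be explained by a status code and would deserve a closer look.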


## Identifying Duplicate Data

In [22]:
# Check for fully duplicated rows
df[df.duplicated(keep=False)]

Unnamed: 0,resultId,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId,fastestLapTime_ms


In [23]:
# Check for duplicates in the 'resultId' column only
df[df.duplicated(subset='resultId', keep=False)]

Unnamed: 0,resultId,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId,fastestLapTime_ms


No duplicate records were found.
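
For reference, `keep=False` flags every member of a duplicate group, whereas the default `keep='first'` leaves the first occurrence unflagged. A small sketch on made-up rows shows the difference:

```python
import pandas as pd

# Hypothetical rows: resultId 2 appears twice
toy = pd.DataFrame({'resultId': [1, 2, 2, 3], 'points': [10.0, 8.0, 8.0, 6.0]})

# keep=False marks both copies, which is convenient for inspection
flag_all = toy.duplicated(subset='resultId', keep=False).tolist()
print(flag_all)    # [False, True, True, False]

# The default keep='first' marks only the later copy, which suits dropping rows
flag_later = toy.duplicated(subset='resultId').tolist()
print(flag_later)  # [False, False, True, False]
```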

## Descriptive Analysis of the Numeric Variables

In [24]:
# Summary statistics for the numeric variables
df.describe()

Unnamed: 0,resultId,raceId,driverId,constructorId,number,grid,position,positionOrder,points,laps,milliseconds,fastestLap,rank,fastestLapSpeed,statusId,fastestLapTime_ms
count,26759.0,26759.0,26759.0,26759.0,26753.0,26759.0,15806.0,26759.0,26759.0,26759.0,7680.0,8252.0,8510.0,8252.0,26759.0,8252.0
mean,13380.977391,551.687283,278.67353,50.180537,18.153927,11.134796,8.020499,12.794051,1.987632,46.301768,6185832.952865,42.732913,10.334313,204.11633,17.224971,90827.59525
std,7726.134642,313.265036,282.703039,61.551498,15.581135,7.20286,4.840796,7.665951,4.351209,29.496557,1669306.488585,16.60346,6.140957,21.377265,26.026104,12338.955962
min,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,526.0,1.0,0.0,89.54,1.0,55404.0
25%,6690.5,300.0,57.0,6.0,7.0,5.0,4.0,6.0,0.0,23.0,5378454.5,33.0,5.0,193.3225,1.0,80848.25
50%,13380.0,531.0,172.0,25.0,16.0,11.0,8.0,12.0,0.0,53.0,5788193.5,46.0,10.0,204.894,10.0,90267.5
75%,20069.5,811.0,399.5,63.0,24.0,17.0,11.0,18.0,2.0,66.0,6402676.0,54.0,15.0,217.324,14.0,99476.75
max,26764.0,1144.0,862.0,215.0,208.0,34.0,33.0,39.0,50.0,200.0,15090540.0,85.0,24.0,257.32,141.0,202300.0


The maximum of 200 in the `laps` column (laps completed) is far above what is typical of current Formula 1 races, which may indicate an inconsistency.

In [25]:
# Select the records with 200 laps
df[df.laps == 200]

Unnamed: 0,resultId,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId,fastestLapTime_ms
18112,18113,748,509,107,4,2,1,1,1,8.0,200,3:36:11.36,12971360,,,,,1,
18113,18114,748,449,107,1,3,2,2,2,6.0,200,+0:12.75,12984110,,,,,1,
18114,18115,748,510,108,99,26,3,3,3,4.0,200,+3:07.30,13158660,,,,,1,
18115,18116,748,511,109,7,8,4,4,4,3.0,200,+3:07.98,13159340,,,,,1,
18116,18117,748,512,110,3,17,5,5,5,2.0,200,+3:11.35,13162710,,,,,1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20218,20219,800,611,113,45,27,11,11,11,0.0,200,+8:22.19,14259460,,,,,1,
20219,20220,800,612,113,45,27,11,11,11,0.0,200,+8:22.19,14259460,,,,,1,
20220,20221,800,653,113,45,27,11,11,11,0.0,200,+8:22.19,14259460,,,,,1,
20260,20261,794,555,113,10,2,2,2,2,3.0,200,+2:43.56,14203090,,,,,1,


The records with 200 laps belong to editions of the Indianapolis 500 held in the 1950s, a race run over 200 laps. This confirms that these values are genuine.
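
A quick way to support this conclusion is to list which races contain 200-lap results and to check the implied distance. The frame below is a hand-made excerpt using `raceId` values visible in the output above; the lap length is an assumption (a 2.5-mile oval), not something stored in the dataset.

```python
import pandas as pd

# Hand-made excerpt of the results table (raceId and laps only)
toy = pd.DataFrame({
    'raceId': [748, 748, 800, 794, 18],
    'laps':   [200, 200, 200, 200, 58],
})

# Distinct races that contain at least one 200-lap result
races_200 = sorted(toy.loc[toy['laps'] == 200, 'raceId'].unique().tolist())
print(races_200)  # [748, 794, 800]

# Sanity check on distance, assuming a 2.5-mile lap
LAP_LENGTH_MILES = 2.5  # assumption: lap length of the Indianapolis oval
total_miles = 200 * LAP_LENGTH_MILES
print(total_miles)  # 500.0
```

Matching these `raceId` values against a table of race names would confirm the event directly; that table is not loaded in this notebook.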

## Saving the Cleaned Dataset

In [26]:
df.to_csv('../data_cleaned/results.csv', index=False)
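
CSV files do not store dtypes, so the nullable `Int64` columns are likely to be inferred as `float64` again when the cleaned file is read back. The sketch below shows this on a toy frame written to an in-memory buffer (hypothetical values, no file on disk) and how passing `dtype` to `read_csv` restores the type.

```python
import io
import pandas as pd

# Toy frame with a nullable integer column (hypothetical values)
toy = pd.DataFrame({
    'resultId': [1, 2],
    'position': pd.array([1, None], dtype='Int64'),
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# A plain read infers float64 again because of the empty field
buffer.seek(0)
plain = pd.read_csv(buffer)
print(plain['position'].dtype)  # float64

# Passing dtype restores the nullable integer type
buffer.seek(0)
restored = pd.read_csv(buffer, dtype={'position': 'Int64'})
print(restored['position'].dtype)  # Int64
```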