# Kaggle Project: Formula 1 World Championship (1950 - 2022)

The dataset assigned to this project stores information about Formula 1: races, drivers, constructors, standings, circuits, laps and pit stops.


## Libraries

In [1]:
import math
import numpy as np
import scipy as sc
import pandas as pd
import sklearn as sk
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

## Datasets

In total there are 14 different datasets (https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020). Each one is loaded with the following function:

In [2]:
# Show only 3 decimal places when displaying floats
pd.set_option('display.float_format', lambda x: '%.3f' % x)

# Helper to read a dataset stored as CSV
def load_dataset(path):
    # Read as UTF-8: decoding these files as ISO-8859-1 garbles accented names
    dataset = pd.read_csv(path, header=0, delimiter=',', encoding="utf-8")
    return dataset
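
Since every dataset is loaded the same way, the fourteen calls could also be collected in a dictionary. The following is only a sketch: `load_datasets` and its `encoding` parameter are hypothetical names, and the two small files are made up for the demonstration.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_datasets(folder, names, encoding="utf-8"):
    """Read <folder>/<name>.csv for every name and return a dict of DataFrames."""
    return {
        name: pd.read_csv(Path(folder) / f"{name}.csv", encoding=encoding)
        for name in names
    }


# Demo with two made-up files written to a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "status.csv").write_text(
        "statusId,status\n1,Finished\n2,Disqualified\n", encoding="utf-8"
    )
    Path(tmp, "seasons.csv").write_text(
        "year,url\n2009,http://example.org/2009\n", encoding="utf-8"
    )
    tables = load_datasets(tmp, ["status", "seasons"])

# Shape of every loaded table, keyed by dataset name
shapes = {name: df.shape for name, df in tables.items()}
print(shapes)
```

With the real files, the folder would be `databases` and the list of names would hold the fourteen dataset names used in this notebook.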

<br >

### <ins>Circuits dataset</ins>

In [3]:
# Load the dataset
circuits = load_dataset('databases/circuits.csv')

print("Dataset shape:", circuits.shape)
print("\nDataset preview:")
display(circuits.head())

Dataset shape: (76, 9)

Dataset preview:


Unnamed: 0,circuitId,circuitRef,name,location,country,lat,lng,alt,url
0,1,albert_park,Albert Park Grand Prix Circuit,Melbourne,Australia,-37.85,144.968,10,http://en.wikipedia.org/wiki/Melbourne_Grand_P...
1,2,sepang,Sepang International Circuit,Kuala Lumpur,Malaysia,2.761,101.738,18,http://en.wikipedia.org/wiki/Sepang_Internatio...
2,3,bahrain,Bahrain International Circuit,Sakhir,Bahrain,26.032,50.511,7,http://en.wikipedia.org/wiki/Bahrain_Internati...
3,4,catalunya,Circuit de Barcelona-Catalunya,Montmeló,Spain,41.57,2.261,109,http://en.wikipedia.org/wiki/Circuit_de_Barcel...
4,5,istanbul,Istanbul Park,Istanbul,Turkey,40.952,29.405,130,http://en.wikipedia.org/wiki/Istanbul_Park


<br >

### <ins>Constructor results dataset</ins>

In [4]:
# Load the dataset
constructor_results = load_dataset('databases/constructor_results.csv')

print("Dataset shape:", constructor_results.shape)
print("\nDataset preview:")
display(constructor_results.head())

Dataset shape: (12080, 5)

Dataset preview:


Unnamed: 0,constructorResultsId,raceId,constructorId,points,status
0,1,18,1,14.0,\N
1,2,18,2,8.0,\N
2,3,18,3,9.0,\N
3,4,18,4,5.0,\N
4,5,18,5,2.0,\N


<br >

### <ins>Constructor standings dataset</ins>

In [5]:
# Load the dataset
constructor_standings = load_dataset('databases/constructor_standings.csv')

print("Dataset shape:", constructor_standings.shape)
print("\nDataset preview:")
display(constructor_standings.head())

Dataset shape: (12841, 7)

Dataset preview:


Unnamed: 0,constructorStandingsId,raceId,constructorId,points,position,positionText,wins
0,1,18,1,14.0,1,1,1
1,2,18,2,8.0,3,3,0
2,3,18,3,9.0,2,2,0
3,4,18,4,5.0,4,4,0
4,5,18,5,2.0,5,5,0


<br >

### <ins>Constructors dataset</ins>

In [6]:
# Load the dataset
constructors = load_dataset('databases/constructors.csv')

print("Dataset shape:", constructors.shape)
print("\nDataset preview:")
display(constructors.head())

Dataset shape: (211, 5)

Dataset preview:


Unnamed: 0,constructorId,constructorRef,name,nationality,url
0,1,mclaren,McLaren,British,http://en.wikipedia.org/wiki/McLaren
1,2,bmw_sauber,BMW Sauber,German,http://en.wikipedia.org/wiki/BMW_Sauber
2,3,williams,Williams,British,http://en.wikipedia.org/wiki/Williams_Grand_Pr...
3,4,renault,Renault,French,http://en.wikipedia.org/wiki/Renault_in_Formul...
4,5,toro_rosso,Toro Rosso,Italian,http://en.wikipedia.org/wiki/Scuderia_Toro_Rosso


<br >

### <ins>Driver standings dataset</ins>

In [7]:
# Load the dataset
driver_standings = load_dataset('databases/driver_standings.csv')

print("Dataset shape:", driver_standings.shape)
print("\nDataset preview:")
display(driver_standings.head())

Dataset shape: (33686, 7)

Dataset preview:


Unnamed: 0,driverStandingsId,raceId,driverId,points,position,positionText,wins
0,1,18,1,10.0,1,1,1
1,2,18,2,8.0,2,2,0
2,3,18,3,6.0,3,3,0
3,4,18,4,5.0,4,4,0
4,5,18,5,4.0,5,5,0


<br >

### <ins>Drivers dataset</ins>

In [8]:
# Load the dataset
drivers = load_dataset('databases/drivers.csv')

print("Dataset shape:", drivers.shape)
print("\nDataset preview:")
display(drivers.head())

Dataset shape: (854, 9)

Dataset preview:


Unnamed: 0,driverId,driverRef,number,code,forename,surname,dob,nationality,url
0,1,hamilton,44,HAM,Lewis,Hamilton,1985-01-07,British,http://en.wikipedia.org/wiki/Lewis_Hamilton
1,2,heidfeld,\N,HEI,Nick,Heidfeld,1977-05-10,German,http://en.wikipedia.org/wiki/Nick_Heidfeld
2,3,rosberg,6,ROS,Nico,Rosberg,1985-06-27,German,http://en.wikipedia.org/wiki/Nico_Rosberg
3,4,alonso,14,ALO,Fernando,Alonso,1981-07-29,Spanish,http://en.wikipedia.org/wiki/Fernando_Alonso
4,5,kovalainen,\N,KOV,Heikki,Kovalainen,1981-10-19,Finnish,http://en.wikipedia.org/wiki/Heikki_Kovalainen


<br >

### <ins>Lap times dataset</ins>

In [9]:
# Load the dataset
lap_times = load_dataset('databases/lap_times.csv')

print("Dataset shape:", lap_times.shape)
print("\nDataset preview:")
display(lap_times.head())

Dataset shape: (528785, 6)

Dataset preview:


Unnamed: 0,raceId,driverId,lap,position,time,milliseconds
0,841,20,1,1,1:38.109,98109
1,841,20,2,1,1:33.006,93006
2,841,20,3,1,1:32.713,92713
3,841,20,4,1,1:32.803,92803
4,841,20,5,1,1:32.342,92342


<br >

### <ins>Pit stops dataset</ins>

In [10]:
# Load the dataset
pit_stops = load_dataset('databases/pit_stops.csv')

print("Dataset shape:", pit_stops.shape)
print("\nDataset preview:")
display(pit_stops.head())

Dataset shape: (9299, 7)

Dataset preview:


Unnamed: 0,raceId,driverId,stop,lap,time,duration,milliseconds
0,841,153,1,1,17:05:23,26.898,26898
1,841,30,1,1,17:05:52,25.021,25021
2,841,17,1,11,17:20:48,23.426,23426
3,841,4,1,12,17:22:34,23.251,23251
4,841,13,1,13,17:24:10,23.842,23842


<br >

### <ins>Qualifying dataset</ins>

In [11]:
# Load the dataset
qualifying = load_dataset('databases/qualifying.csv')

print("Dataset shape:", qualifying.shape)
print("\nDataset preview:")
display(qualifying.head())

Dataset shape: (9395, 9)

Dataset preview:


Unnamed: 0,qualifyId,raceId,driverId,constructorId,number,position,q1,q2,q3
0,1,18,1,1,22,1,1:26.572,1:25.187,1:26.714
1,2,18,9,2,4,2,1:26.103,1:25.315,1:26.869
2,3,18,5,1,23,3,1:25.664,1:25.452,1:27.079
3,4,18,13,6,2,4,1:25.994,1:25.691,1:27.178
4,5,18,2,2,3,5,1:25.960,1:25.518,1:27.236


<br >

### <ins>Races dataset</ins>

In [12]:
# Load the dataset
races = load_dataset('databases/races.csv')

print("Dataset shape:", races.shape)
print("\nDataset preview:")
display(races.head())

Dataset shape: (1079, 18)

Dataset preview:


Unnamed: 0,raceId,year,round,circuitId,name,date,time,url,fp1_date,fp1_time,fp2_date,fp2_time,fp3_date,fp3_time,quali_date,quali_time,sprint_date,sprint_time
0,1,2009,1,1,Australian Grand Prix,2009-03-29,06:00:00,http://en.wikipedia.org/wiki/2009_Australian_G...,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N
1,2,2009,2,2,Malaysian Grand Prix,2009-04-05,09:00:00,http://en.wikipedia.org/wiki/2009_Malaysian_Gr...,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N
2,3,2009,3,17,Chinese Grand Prix,2009-04-19,07:00:00,http://en.wikipedia.org/wiki/2009_Chinese_Gran...,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N
3,4,2009,4,3,Bahrain Grand Prix,2009-04-26,12:00:00,http://en.wikipedia.org/wiki/2009_Bahrain_Gran...,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N
4,5,2009,5,4,Spanish Grand Prix,2009-05-10,12:00:00,http://en.wikipedia.org/wiki/2009_Spanish_Gran...,\N,\N,\N,\N,\N,\N,\N,\N,\N,\N


<br >

### <ins>Results dataset</ins>

In [13]:
# Load the dataset
results = load_dataset('databases/results.csv')

print("Dataset shape:", results.shape)
print("\nDataset preview:")
display(results.head())

Dataset shape: (25660, 18)

Dataset preview:


Unnamed: 0,resultId,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId
0,1,18,1,1,22,1,1,1,1,10.0,58,1:34:50.616,5690616,39,2,1:27.452,218.3,1
1,2,18,2,2,3,5,2,2,2,8.0,58,+5.478,5696094,41,3,1:27.739,217.586,1
2,3,18,3,3,7,7,3,3,3,6.0,58,+8.163,5698779,41,5,1:28.090,216.719,1
3,4,18,4,4,5,11,4,4,4,5.0,58,+17.181,5707797,58,7,1:28.603,215.464,1
4,5,18,5,1,23,3,5,5,5,4.0,58,+18.014,5708630,43,1,1:27.418,218.385,1


<br >

### <ins>Seasons dataset</ins>

In [14]:
# Load the dataset
seasons = load_dataset('databases/seasons.csv')

print("Dataset shape:", seasons.shape)
print("\nDataset preview:")
display(seasons.head())

Dataset shape: (73, 2)

Dataset preview:


Unnamed: 0,year,url
0,2009,http://en.wikipedia.org/wiki/2009_Formula_One_...
1,2008,http://en.wikipedia.org/wiki/2008_Formula_One_...
2,2007,http://en.wikipedia.org/wiki/2007_Formula_One_...
3,2006,http://en.wikipedia.org/wiki/2006_Formula_One_...
4,2005,http://en.wikipedia.org/wiki/2005_Formula_One_...


<br >

### <ins>Sprint results dataset</ins>

In [15]:
# Load the dataset
sprint_results = load_dataset('databases/sprint_results.csv')

print("Dataset shape:", sprint_results.shape)
print("\nDataset preview:")
display(sprint_results.head())

Dataset shape: (100, 16)

Dataset preview:


Unnamed: 0,resultId,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,fastestLapTime,statusId
0,1,1061,830,9,33,2,1,1,1,3,17,25:38.426,1538426,14,1:30.013,1
1,2,1061,1,131,44,1,2,2,2,2,17,+1.430,1539856,17,1:29.937,1
2,3,1061,822,131,77,3,3,3,3,1,17,+7.502,1545928,17,1:29.958,1
3,4,1061,844,6,16,4,4,4,4,0,17,+11.278,1549704,16,1:30.163,1
4,5,1061,846,1,4,6,5,5,5,0,17,+24.111,1562537,16,1:30.566,1


<br >

### <ins>Status dataset</ins>

In [16]:
# Load the dataset
status = load_dataset('databases/status.csv')

print("Dataset shape:", status.shape)
print("\nDataset preview:")
display(status.head())

Dataset shape: (139, 2)

Dataset preview:


Unnamed: 0,statusId,status
0,1,Finished
1,2,Disqualified
2,3,Accident
3,4,Collision
4,5,Engine


<br >

## Objective

Of all the datasets, results is the only one that brings together the outcomes recorded in the others. Its rows reference other datasets such as races, drivers, constructors and status. Before choosing an objective, we explore this dataset.
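
To illustrate how those references can be resolved, the sketch below joins a few result rows with driver and status lookups through their ID columns. The three small tables are toy stand-ins that reuse IDs and names visible in the previews above; they are not the full datasets.

```python
import pandas as pd

# Toy stand-ins for results, drivers and status (values taken from the previews)
results_demo = pd.DataFrame({
    "raceId": [18, 18, 18],
    "driverId": [1, 2, 3],
    "statusId": [1, 1, 5],
})
drivers_demo = pd.DataFrame({
    "driverId": [1, 2, 3],
    "surname": ["Hamilton", "Heidfeld", "Rosberg"],
})
status_demo = pd.DataFrame({
    "statusId": [1, 5],
    "status": ["Finished", "Engine"],
})

# Left joins keep every result row; validate checks that each ID maps to one lookup row
enriched = (
    results_demo
    .merge(drivers_demo, on="driverId", how="left", validate="many_to_one")
    .merge(status_demo, on="statusId", how="left", validate="many_to_one")
)
print(enriched)
```

A left join is used so that a result whose ID is missing from a lookup table is kept with a null rather than silently dropped.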

To begin, we display the dataset again:

In [17]:
dataset = results.copy()

print("Dataset shape:", dataset.shape)
print("Dataset preview:")
display(dataset.head())

Dataset shape: (25660, 18)
Dataset preview:


Unnamed: 0,resultId,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId
0,1,18,1,1,22,1,1,1,1,10.0,58,1:34:50.616,5690616,39,2,1:27.452,218.3,1
1,2,18,2,2,3,5,2,2,2,8.0,58,+5.478,5696094,41,3,1:27.739,217.586,1
2,3,18,3,3,7,7,3,3,3,6.0,58,+8.163,5698779,41,5,1:28.090,216.719,1
3,4,18,4,4,5,11,4,4,4,5.0,58,+17.181,5707797,58,7,1:28.603,215.464,1
4,5,18,5,1,23,3,5,5,5,4.0,58,+18.014,5708630,43,1,1:27.418,218.385,1


<br >

The first attribute can be dropped, since it is only the instance identifier.

In [18]:
dataset = dataset.drop(['resultId'], axis=1)

print("Dataset shape:", dataset.shape)
print("Dataset preview:")
display(dataset.head())

Dataset shape: (25660, 17)
Dataset preview:


Unnamed: 0,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId
0,18,1,1,22,1,1,1,1,10.0,58,1:34:50.616,5690616,39,2,1:27.452,218.3,1
1,18,2,2,3,5,2,2,2,8.0,58,+5.478,5696094,41,3,1:27.739,217.586,1
2,18,3,3,7,7,3,3,3,6.0,58,+8.163,5698779,41,5,1:28.090,216.719,1
3,18,4,4,5,11,4,4,4,5.0,58,+17.181,5707797,58,7,1:28.603,215.464,1
4,18,5,1,23,3,5,5,5,4.0,58,+18.014,5708630,43,1,1:27.418,218.385,1


<br >

We check the attribute types:

In [19]:
print("Data type information for the dataset:")
dataset.info()

Data type information for the dataset:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25660 entries, 0 to 25659
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   raceId           25660 non-null  int64  
 1   driverId         25660 non-null  int64  
 2   constructorId    25660 non-null  int64  
 3   number           25660 non-null  object 
 4   grid             25660 non-null  int64  
 5   position         25660 non-null  object 
 6   positionText     25660 non-null  object 
 7   positionOrder    25660 non-null  int64  
 8   points           25660 non-null  float64
 9   laps             25660 non-null  int64  
 10  time             25660 non-null  object 
 11  milliseconds     25660 non-null  object 
 12  fastestLap       25660 non-null  object 
 13  rank             25660 non-null  object 
 14  fastestLapTime   25660 non-null  object 
 15  fastestLapSpeed  25660 non-null  object 
 16  statusId         256

<br >

Nine attributes are stored as objects: number, position, positionText, time, milliseconds, fastestLap, rank, fastestLapTime and fastestLapSpeed. First, we inspect the values these attributes take:

In [20]:
labelCategoric = ["number", "position", "positionText", "time", "milliseconds", "fastestLap", "rank", "fastestLapTime", "fastestLapSpeed"]

for atr in labelCategoric:
    print("Unique values in " + atr + ": ", dataset[atr].nunique())
    print(dataset[atr].unique())

Unique values in number:  130
['22' '3' '7' '5' '23' '8' '14' '1' '4' '12' '18' '6' '2' '9' '11' '20'
 '10' '16' '19' '15' '21' '17' '24' '25' '28' '27' '29' '30' '26' '0' '31'
 '34' '32' '33' '35' '36' '39' '38' '41' '37' '40' '42' '50' '43' '51'
 '55' '66' '44' '45' '52' '13' '77' '54' '208' '46' '48' '56' '69' '71'
 '58' '64' '62' '60' '\\N' '74' '72' '68' '70' '99' '98' '65' '73' '97'
 '76' '88' '89' '57' '53' '49' '47' '87' '61' '59' '83' '92' '95' '82'
 '81' '67' '93' '101' '102' '117' '103' '108' '119' '121' '125' '124'
 '113' '128' '135' '107' '123' '120' '126' '110' '130' '116' '136' '118'
 '114' '109' '127' '105' '112' '122' '129' '104' '115' '75' '91' '84' '90'
 '94' '78' '86' '85' '79' '63']
Unique values in position:  34
['1' '2' '3' '4' '5' '6' '7' '8' '\\N' '9' '10' '11' '12' '13' '14' '15'
 '16' '17' '18' '19' '20' '21' '22' '23' '24' '25' '26' '27' '28' '29'
 '30' '31' '32' '33']
Unique values in positionText:  39
['1' '2' '3' '4' '5' '6' '7' '8' 'R' 'D' '9' '10' '11' '12' '13' '14' '15'
 '16' '17' '18' 

<br >

Some values appear as `\N` (printed as `'\\N'` in the arrays above), which is how this dataset marks nulls. The same marker is used in the other datasets. Before continuing, we handle these null values.

In [21]:
for atr in labelCategoric:
    dataset.loc[(dataset[atr] == '\\N'), atr] = np.nan
    
print(dataset.isnull().sum()/len(dataset)*100)

raceId             0.000
driverId           0.000
constructorId      0.000
number             0.023
grid               0.000
position          42.194
positionText       0.000
positionOrder      0.000
points             0.000
laps               0.000
time              72.860
milliseconds      72.864
fastestLap        71.917
rank              71.118
fastestLapTime    71.917
fastestLapSpeed   71.917
statusId           0.000
dtype: float64
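
As an alternative to replacing the marker after loading, the nulls could be recognised while the file is read. The sketch below assumes `\N` is the only null marker and uses a small made-up CSV held in memory instead of the real file; `na_values` is a standard `pd.read_csv` parameter.

```python
import io

import pandas as pd

# Made-up CSV text; "\\N" in Python source is the two characters \N
csv_text = (
    "resultId,number,position,positionText\n"
    "1,22,1,1\n"
    "2,\\N,\\N,R\n"
    "3,7,3,3\n"
)

# Treat \N as missing at read time, so numeric columns are parsed as numbers
demo = pd.read_csv(io.StringIO(csv_text), na_values=["\\N"])

# Percentage of nulls per column
null_pct = demo.isna().mean() * 100
print(demo.dtypes)
print(null_pct)
```

Read this way, columns such as number and position arrive as floats instead of objects, which would make part of the later type conversion unnecessary.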


<br >

For the number attribute, the rows with nulls can be dropped, since they make up less than 1% of the data. While we are at it, we convert this attribute to a numeric type.

In [22]:
# Drop rows with a null number; copy() avoids a SettingWithCopyWarning on the next line
dataset = dataset[dataset['number'].notna()].copy()
dataset['number'] = dataset['number'].astype('int64')

print(dataset.isnull().sum()/len(dataset)*100)

raceId             0.000
driverId           0.000
constructorId      0.000
number             0.000
grid               0.000
position          42.181
positionText       0.000
positionOrder      0.000
points             0.000
laps               0.000
time              72.854
milliseconds      72.858
fastestLap        71.911
rank              71.112
fastestLapTime    71.911
fastestLapSpeed   71.911
statusId           0.000
dtype: float64


<br>

For position, time, milliseconds, fastestLap, rank, fastestLapTime and fastestLapSpeed there are too many nulls to do the same. Many null-handling functions are not designed to work on object columns, so we first convert every numeric attribute stored as an object to an integer or float type.
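
Once the columns are numeric, one option suggested by the imports at the top of this notebook is an IterativeImputer driven by a RandomForestRegressor. The sketch below only shows the mechanics on synthetic data: the values, the column relationships and the hyperparameters are assumptions, not results from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic numeric table: rank depends on grid, and a quarter of rank is missing
rng = np.random.default_rng(0)
grid = rng.integers(1, 21, size=40).astype(float)
demo = pd.DataFrame({"grid": grid, "laps": 58.0 - grid, "rank": grid + 1.0})
demo.loc[::4, "rank"] = np.nan

# Each column with nulls is predicted from the other columns
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=5,
    random_state=0,
)
imputed = pd.DataFrame(imputer.fit_transform(demo), columns=demo.columns)
print(imputed.isna().sum())
```

Whether imputation is sensible for columns that are more than 70% null is a separate question that this sketch does not answer.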

milliseconds, rank, fastestLap and fastestLapSpeed can be converted directly. Since they still contain nulls, we cast them all to float:

In [23]:
# Convert milliseconds
dataset["milliseconds"] = dataset["milliseconds"].astype('float64')

# Convert rank
dataset["rank"] = dataset["rank"].astype('float64')

# Convert fastestLap
dataset["fastestLap"] = dataset["fastestLap"].astype('float64')

# Convert fastestLapSpeed
dataset["fastestLapSpeed"] = dataset["fastestLapSpeed"].astype('float64')

display(dataset.head())
print("\n")
dataset.info()

Unnamed: 0,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId
0,18,1,1,22,1,1,1,1,10.0,58,1:34:50.616,5690616.0,39.0,2.0,1:27.452,218.3,1
1,18,2,2,3,5,2,2,2,8.0,58,+5.478,5696094.0,41.0,3.0,1:27.739,217.586,1
2,18,3,3,7,7,3,3,3,6.0,58,+8.163,5698779.0,41.0,5.0,1:28.090,216.719,1
3,18,4,4,5,11,4,4,4,5.0,58,+17.181,5707797.0,58.0,7.0,1:28.603,215.464,1
4,18,5,1,23,3,5,5,5,4.0,58,+18.014,5708630.0,43.0,1.0,1:27.418,218.385,1




<class 'pandas.core.frame.DataFrame'>
Int64Index: 25654 entries, 0 to 25659
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   raceId           25654 non-null  int64  
 1   driverId         25654 non-null  int64  
 2   constructorId    25654 non-null  int64  
 3   number           25654 non-null  int64  
 4   grid             25654 non-null  int64  
 5   position         14833 non-null  object 
 6   positionText     25654 non-null  object 
 7   positionOrder    25654 non-null  int64  
 8   points           25654 non-null  float64
 9   laps             25654 non-null  int64  
 10  time             6964 non-null   object 
 11  milliseconds     6963 non-null   float64
 12  fastestLap       7206 non-null   float64
 13  rank             7411 non-null   float64
 14  fastestLapTime   7206 non-null   object 
 15  fastestLapSpeed  7206 non-null   float64
 16  statusId         25654 non-null  int64  
dtypes: float64
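
The same cast can be written once for several columns with `pd.to_numeric`. With `errors="coerce"`, any value that cannot be parsed becomes NaN instead of raising an error. This is a sketch on a small made-up frame; `numeric_cols` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Made-up object columns that hold numbers as strings, with some nulls
demo = pd.DataFrame({
    "milliseconds": ["5690616", "5696094", np.nan],
    "fastestLapSpeed": ["218.3", np.nan, "216.719"],
})

numeric_cols = ["milliseconds", "fastestLapSpeed"]

# Apply pd.to_numeric column by column; unparseable values become NaN
demo[numeric_cols] = demo[numeric_cols].apply(pd.to_numeric, errors="coerce")
print(demo.dtypes)
```

Unlike `astype('float64')`, this variant would not fail if a stray `\N` marker were still present, at the cost of hiding such values as NaN.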

<br >

For fastestLapTime, the minutes must be separated from the seconds and both combined into a single unit. We store the values in seconds.

In [24]:
# Convert fastestLapTime from minutes:seconds to seconds
print(type(dataset['fastestLapTime']))

dataset['fastestLapTime'] = dataset['fastestLapTime'].str.split(":", n=1, expand= False)

minutes = pd.to_numeric(dataset['fastestLapTime'].str[0], errors='coerce')
seconds = pd.to_numeric(dataset['fastestLapTime'].str[1], errors='coerce')

dataset['fastestLapTime'] = (minutes * 60) + seconds

display(dataset.head())
print("\n")
dataset.info()

<class 'pandas.core.series.Series'>


Unnamed: 0,raceId,driverId,constructorId,number,grid,position,positionText,positionOrder,points,laps,time,milliseconds,fastestLap,rank,fastestLapTime,fastestLapSpeed,statusId
0,18,1,1,22,1,1,1,1,10.0,58,1:34:50.616,5690616.0,39.0,2.0,87.452,218.3,1
1,18,2,2,3,5,2,2,2,8.0,58,+5.478,5696094.0,41.0,3.0,87.739,217.586,1
2,18,3,3,7,7,3,3,3,6.0,58,+8.163,5698779.0,41.0,5.0,88.09,216.719,1
3,18,4,4,5,11,4,4,4,5.0,58,+17.181,5707797.0,58.0,7.0,88.603,215.464,1
4,18,5,1,23,3,5,5,5,4.0,58,+18.014,5708630.0,43.0,1.0,87.418,218.385,1




<class 'pandas.core.frame.DataFrame'>
Int64Index: 25654 entries, 0 to 25659
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   raceId           25654 non-null  int64  
 1   driverId         25654 non-null  int64  
 2   constructorId    25654 non-null  int64  
 3   number           25654 non-null  int64  
 4   grid             25654 non-null  int64  
 5   position         14833 non-null  object 
 6   positionText     25654 non-null  object 
 7   positionOrder    25654 non-null  int64  
 8   points           25654 non-null  float64
 9   laps             25654 non-null  int64  
 10  time             6964 non-null   object 
 11  milliseconds     6963 non-null   float64
 12  fastestLap       7206 non-null   float64
 13  rank             7411 non-null   float64
 14  fastestLapTime   7206 non-null   float64
 15  fastestLapSpeed  7206 non-null   float64
 16  statusId         25654 non-null  int64  
dtypes: float64
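
The minutes-to-seconds conversion can also be wrapped in a small reusable function. The sketch below assumes every lap time follows the `m:ss.mmm` pattern seen in the preview; `lap_time_to_seconds` is a hypothetical helper, and values that do not match the pattern come out as NaN.

```python
import numpy as np
import pandas as pd


def lap_time_to_seconds(times):
    """Convert strings such as '1:27.452' to seconds; non-matching values give NaN."""
    parts = times.str.extract(r"^(\d+):(\d+(?:\.\d+)?)$")
    minutes = pd.to_numeric(parts[0], errors="coerce")
    seconds = pd.to_numeric(parts[1], errors="coerce")
    return minutes * 60 + seconds


# Two lap times from the preview plus one missing value
lap_times_demo = pd.Series(["1:27.452", np.nan, "1:28.090"])
lap_seconds = lap_time_to_seconds(lap_times_demo)
print(lap_seconds)
```

The regular expression makes the expected format explicit, so a malformed entry is turned into NaN rather than into a wrong number.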

<br >

All conversions have been applied except for position, positionText and positionOrder. Looking at their values:

In [54]:
print("Unique values in position: ", dataset['position'].nunique())
print(dataset['position'].unique())

print("\nUnique values in positionText: ", dataset['positionText'].nunique())
print(dataset['positionText'].unique())

print("\nUnique values in positionOrder: ", dataset['positionOrder'].nunique())
print(dataset['positionOrder'].unique())

Unique values in position:  33
['1' '2' '3' '4' '5' '6' '7' '8' nan '9' '10' '11' '12' '13' '14' '15'
 '16' '17' '18' '19' '20' '21' '22' '23' '24' '25' '26' '27' '28' '29'
 '30' '31' '32' '33']

Unique values in positionText:  39
['1' '2' '3' '4' '5' '6' '7' '8' 'R' 'D' '9' '10' '11' '12' '13' '14' '15'
 '16' '17' '18' '19' '20' '21' 'N' 'W' 'F' 'E' '22' '23' '24' '25' '26'
 '27' '28' '29' '30' '31' '32' '33']

Unique values in positionOrder:  39
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39]
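
The output suggests that position is null where positionText holds a letter code, although this has not been verified here. Under that assumption, one possible treatment is to store position as a nullable integer and keep a separate flag for classified finishers. The frame below is made up, and `classified` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Made-up rows: two classified finishers and two entries with letter codes
demo = pd.DataFrame({
    "position": ["1", "2", np.nan, np.nan],
    "positionText": ["1", "2", "R", "D"],
    "positionOrder": [1, 2, 3, 4],
})

# Nullable Int64 keeps integer positions while still allowing missing values
demo["position"] = pd.to_numeric(demo["position"], errors="coerce").astype("Int64")

# Flag rows that have a numeric finishing position
demo["classified"] = demo["position"].notna()

# Letter codes found among the unclassified rows
unclassified_codes = sorted(demo.loc[~demo["classified"], "positionText"].unique())
print(demo.dtypes)
print(unclassified_codes)
```

positionOrder has no nulls, so it could serve as the ordering variable while the flag records whether that order corresponds to a classified finish.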


In [None]:
#Outliers
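
One common way to flag outliers in a numeric column is the interquartile range rule. The sketch below is an assumption about how this step could be approached, not the method chosen in this notebook. The first five speeds come from the preview above, and the value 90.0 is made up to act as an outlier.

```python
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Five fastestLapSpeed values from the preview plus one made-up extreme value
speeds = pd.Series([218.3, 217.586, 216.719, 215.464, 218.385, 90.0])
mask = iqr_outliers(speeds)
print(speeds[mask])
```

The multiplier `k` controls how strict the rule is; 1.5 is the conventional default.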

In [None]:
dataset.describe()

In [None]:
# Look at the correlation between the numeric attributes to better understand the data
correlacio = dataset.corr(numeric_only=True)
plt.figure()
ax = sns.heatmap(correlacio, annot=True, linewidths=.5)
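
Besides the heatmap, the correlation matrix can be read numerically by ranking attributes by their absolute correlation with a column of interest. The sketch below uses positionOrder as that column purely as an example, since no objective has been fixed yet, and takes its values from the first five result rows shown earlier.

```python
import pandas as pd

# First five result rows from the preview, plus one object column
demo = pd.DataFrame({
    "grid": [1, 5, 7, 11, 3],
    "positionOrder": [1, 2, 3, 4, 5],
    "points": [10.0, 8.0, 6.0, 5.0, 4.0],
    "positionText": ["1", "2", "3", "4", "5"],
})

# numeric_only=True skips object columns such as positionText
corr = demo.corr(numeric_only=True)

# Rank the other attributes by absolute correlation with positionOrder
ranked = (
    corr["positionOrder"]
    .drop("positionOrder")
    .abs()
    .sort_values(ascending=False)
)
print(ranked)
```

With only five rows the numbers themselves carry no meaning; the point is the pattern of selecting numeric columns and sorting by strength of association.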

In [None]:
# Look at the pairwise relationships between attributes using pairplot
relacio = sns.pairplot(dataset)