# 🏎️ F1 Data Science Project – 2025 Data Acquisition
This notebook automates the **acquisition, preparation, and quality audit** of F1 data for the 2025 season, using the FastF1 API and open sources (OpenFlights for logistics).
> ⚠️ **Note**: Data for a Grand Prix is not always available immediately after the race (it depends on the FIA). The pipeline handles this case and only includes the GPs whose data is actually accessible at run time. An availability log is printed for each GP processed.
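
The "load what is available, log what is not" behaviour described in the note can be sketched as follows. This is a minimal illustration, not the notebook's code: `load_available` and `fake_loader` are hypothetical names, and the fake loader stands in for a real session download so the sketch runs offline.

```python
from typing import Callable, Iterable

def load_available(rounds: Iterable[int], loader: Callable[[int], object]):
    """Try each round; keep the ones that load and record why the others failed."""
    loaded, skipped = {}, {}
    for rd in rounds:
        try:
            loaded[rd] = loader(rd)
        except Exception as exc:  # broad on purpose: any failure means "not available"
            skipped[rd] = str(exc)
    return loaded, skipped

# Hypothetical stand-in for a session loader: round 3 is "not published yet".
def fake_loader(rd: int) -> str:
    if rd == 3:
        raise RuntimeError("data not available")
    return f"session-{rd}"

loaded, skipped = load_available([1, 2, 3], fake_loader)
for rd in sorted(loaded):
    print(f"Round {rd:02d}: available")
for rd, reason in sorted(skipped.items()):
    print(f"Round {rd:02d}: skipped ({reason})")
```

Keeping the skipped rounds together with their error message makes it easy to re-run the notebook later and see which GPs have become available.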


## 1️⃣ Imports & Setup
Environment preparation, library imports, and configuration of the cache and data folders.


In [19]:
# Import the required libraries
from pathlib import Path
import pandas as pd
import fastf1
from datetime import date
from math import radians, sin, cos, sqrt, atan2

In [20]:
# Enable the FastF1 cache (create the folder if needed)
cache_dir = Path('../.cache_f1')
cache_dir.mkdir(exist_ok=True)
fastf1.Cache.enable_cache(cache_dir)

data_dir = Path('../data')
data_dir.mkdir(exist_ok=True)

# Fetch the 2025 season calendar
YEAR = 2025
schedule = fastf1.get_event_schedule(YEAR, include_testing=False)
today = pd.Timestamp(date.today())
completed = schedule[schedule['EventDate'] <= today]

# Initialise the lists used to collect the data

results_list = []
quali_list = []
weather_list = []
pits_list = []
standings_driver = []
standings_team = []
flight_legs = []

last_airport = None
last_loc = None
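
The `EventDate <= today` filter above can be illustrated on a toy calendar. This is a small sketch under stated assumptions: `toy_schedule` and `completed_events` are hypothetical, and only the three columns shown are assumed. Note that an event dated on the cutoff day itself counts as completed, which is why the loading step still has to cope with data that is not yet published.

```python
import pandas as pd

# Toy calendar; the real one comes from fastf1.get_event_schedule.
toy_schedule = pd.DataFrame({
    "RoundNumber": [1, 2, 3],
    "Location": ["Melbourne", "Shanghai", "Suzuka"],
    "EventDate": pd.to_datetime(["2025-03-16", "2025-03-23", "2025-04-06"]),
})

def completed_events(schedule: pd.DataFrame, cutoff: pd.Timestamp) -> pd.DataFrame:
    """Rows whose EventDate is on or before the cutoff (same-day events included)."""
    return schedule[schedule["EventDate"] <= cutoff]

done = completed_events(toy_schedule, pd.Timestamp("2025-03-23"))
print(done["Location"].tolist())
```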

## 2️⃣ Logistics & Airport Mapping ✈️
Download of the OpenFlights database and mapping of F1 circuits → main airports (IATA).

In [21]:
# Download the OpenFlights airport table + map F1 circuits → IATA

# Download and prepare the OpenFlights CSV
airports_df = pd.read_csv(
    'https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat',
    header=None,
    names=[
        'AirportID', 'Name', 'City', 'Country', 'IATA', 'ICAO',
        'Latitude', 'Longitude', 'Altitude', 'Timezone', 'DST', 'Tz',
        'Type', 'Source'
    ],
    dtype={'IATA': str}
)
airports_df = airports_df[airports_df['IATA'].notna() & (airports_df['IATA'].str.len() == 3)]
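
OpenFlights marks missing values with the literal `\N`, so rows without an IATA code are removed by the length filter above. The sketch below shows the same parsing on two made-up rows, with the optional `na_values` argument turning `\N` into a proper NaN. The sample rows and the `SAMPLE` / `COLUMNS` names are illustrative assumptions, not real airport data.

```python
import io
import pandas as pd

# Two made-up rows in the OpenFlights column layout; the second has no IATA code.
SAMPLE = r'''1,"Alpha Field","Alpha","Nowhere","AAA","AAAA",10.0,20.0,100,1,"E","Etc/UTC","airport","Example"
2,"Beta Strip","Beta","Nowhere",\N,"BBBB",11.0,21.0,200,1,"E","Etc/UTC","airport","Example"
'''

COLUMNS = ['AirportID', 'Name', 'City', 'Country', 'IATA', 'ICAO',
           'Latitude', 'Longitude', 'Altitude', 'Timezone', 'DST', 'Tz',
           'Type', 'Source']

sample_df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS,
                        dtype={'IATA': str}, na_values=['\\N'])

# Same filter as the notebook: keep rows with a three-letter IATA code.
valid = sample_df[sample_df['IATA'].notna() & (sample_df['IATA'].str.len() == 3)]
print(valid['IATA'].tolist())
```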

In [22]:
# F1 circuits → main airports (complete for the 2025 calendar): IATA airport codes
# Note: airports are chosen based on their use by F1 teams for logistics
# Sources:
# - https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat
# - General F1/logistics knowledge (official F1 sites, Wikipedia, specialist forums, F1 press)
CIRCUIT_IATA = {
    'Melbourne': 'MEL',
    'Shanghai': 'PVG',
    'Suzuka': 'NGO',
    'Sakhir': 'BAH',
    'Jeddah': 'JED',
    'Miami': 'MIA',
    'Imola': 'BLQ',
    'Monaco': 'NCE',
    'Barcelona': 'BCN',
    'Montréal': 'YUL',
    'Spielberg': 'VIE',
    'Silverstone': 'LHR',
    'Spa-Francorchamps': 'LGG',
    'Budapest': 'BUD',
    'Zandvoort': 'AMS',
    'Monza': 'MXP',
    'Baku': 'GYD',
    'Marina Bay': 'SIN',
    'Austin': 'AUS',
    'Mexico City': 'MEX',
    'São Paulo': 'GRU',
    'Las Vegas': 'LAS',
    'Lusail': 'DOH',
    'Yas Island': 'AUH',
}
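
Because the mapping is keyed on location names (including accented ones), a quick coverage check can catch a mistyped key or an unknown airport code before the main loop runs. The following is a sketch only: `check_mapping` and the toy inputs are hypothetical; in the notebook the inputs would be `schedule['Location']`, `CIRCUIT_IATA` and `set(airports_df['IATA'])`.

```python
def check_mapping(locations, circuit_to_iata, known_iata):
    """Return (locations without a mapping, mapped codes absent from the airport table)."""
    unmapped = [loc for loc in locations if loc not in circuit_to_iata]
    unknown = sorted({code for code in circuit_to_iata.values() if code not in known_iata})
    return unmapped, unknown

# Toy inputs chosen to trigger both kinds of problem.
toy_locations = ['Melbourne', 'Montréal', 'Imola']
toy_mapping = {'Melbourne': 'MEL', 'Montréal': 'YUL', 'Monaco': 'XXX'}
toy_known = {'MEL', 'YUL', 'NCE'}

unmapped, unknown = check_mapping(toy_locations, toy_mapping, toy_known)
print("Locations without airport:", unmapped)
print("Codes missing from airport table:", unknown)
```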

## 3️⃣ Utility Functions
For logistics (distance, coordinates) and extraction of actual pit stops.
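
The distance helper in this section relies on the haversine formula. A minimal self-contained check compares it against cases whose answer is known analytically (a quarter of a great circle, and a zero-length trip). The name `haversine_km` is hypothetical and simply mirrors the notebook's function with the same 6371 km radius.

```python
from math import radians, sin, cos, sqrt, atan2, pi

EARTH_RADIUS_KM = 6371  # mean Earth radius, same constant as the notebook

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return EARTH_RADIUS_KM * 2 * atan2(sqrt(a), sqrt(1 - a))

# Sanity checks with analytically known answers.
quarter = haversine_km(0, 0, 0, 90)        # a quarter of the equator
pole = haversine_km(0, 0, 90, 0)           # equator to North Pole, also a quarter circle
same = haversine_km(48.0, 2.0, 48.0, 2.0)  # identical points
print(round(quarter, 2), round(pole, 2), same)
```

Both quarter-circle cases should agree with `EARTH_RADIUS_KM * pi / 2` up to floating-point error.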

In [23]:
# Get the coordinates of an airport from its IATA code
def get_coords_for_iata(iata):
    row = airports_df[airports_df['IATA'] == iata]
    if not row.empty:
        return float(row['Latitude'].iloc[0]), float(row['Longitude'].iloc[0])
    else:
        return None, None

# Haversine function (distance in km between two lat/lon points)
def haversine(lat1, lon1, lat2, lon2):
    R = 6371
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat/2)**2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon/2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c

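# Extract pit stops from lap data (one row per detected stop)
def extract_real_pitstops(laps_df):
    """
    Extract actual pit stops, detected when both Stint and Compound change
    between two consecutive laps of the same driver (a stop onto the same
    compound is therefore not captured).
    Adds columns of interest to enrich the strategy analysis.
    """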
    pitstops = []
    for driver, d in laps_df.groupby('Driver'):
        d = d.sort_values('LapNumber')
        prev_row = None
        for idx, row in d.iterrows():
            # Skip the comparison if stint info is missing on either lap
            if prev_row is not None and pd.notna(row['Stint']) and pd.notna(prev_row['Stint']):
                stint_change = row['Stint'] != prev_row['Stint']
                compound_change = row['Compound'] != prev_row['Compound']
                if stint_change and compound_change:
                    pitstops.append({
                        'Driver': driver,
                        'DriverNumber': row['DriverNumber'],
                        'TeamName': row['Team'],
                        'LapIn': prev_row['LapNumber'],
                        'LapOut': row['LapNumber'],
                        'PitInTime': prev_row['PitInTime'],
                        'PitOutTime': row['PitOutTime'],
                        'CompoundIn': prev_row['Compound'],
                        'CompoundOut': row['Compound'],
                        'TyreLifeIn': prev_row['TyreLife'],
                        'TyreLifeOut': row['TyreLife'],
                        'PositionIn': prev_row['Position'],
                        'PositionOut': row['Position'],
                        'event': row['event'],
                        'round': row['round']
                    })
            prev_row = row
    return pd.DataFrame(pitstops)
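
The same detection rule (both Stint and Compound differ from the driver's previous lap) can also be written without a row loop, using a per-driver `shift`. This is an alternative sketch on toy data, not the notebook's implementation: `detect_stops` and `toy_laps` are hypothetical, and only a subset of the output columns is reproduced.

```python
import pandas as pd

# Toy laps for one driver: stint 1 on MEDIUM, then stint 2 on HARD from lap 3.
toy_laps = pd.DataFrame({
    'Driver': ['AAA'] * 4,
    'LapNumber': [1, 2, 3, 4],
    'Stint': [1, 1, 2, 2],
    'Compound': ['MEDIUM', 'MEDIUM', 'HARD', 'HARD'],
})

def detect_stops(laps: pd.DataFrame) -> pd.DataFrame:
    """Flag laps where both Stint and Compound differ from the driver's previous lap."""
    laps = laps.sort_values(['Driver', 'LapNumber']).copy()
    # Previous lap of the same driver, aligned on the current lap's index.
    prev = laps.groupby('Driver')[['LapNumber', 'Stint', 'Compound']].shift(1)
    has_prev = prev['Stint'].notna() & laps['Stint'].notna()
    changed = (laps['Stint'] != prev['Stint']) & (laps['Compound'] != prev['Compound'])
    stops = laps.loc[has_prev & changed, ['Driver', 'LapNumber', 'Compound']]
    stops = stops.rename(columns={'LapNumber': 'LapOut', 'Compound': 'CompoundOut'})
    stops['LapIn'] = prev.loc[stops.index, 'LapNumber']
    stops['CompoundIn'] = prev.loc[stops.index, 'Compound']
    return stops.reset_index(drop=True)

stops = detect_stops(toy_laps)
print(stops)
```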


## 4️⃣ Retrieving F1 Data & Logistics
For each available GP, we download the results, qualifying, weather, pit stops, and standings, and compute the flight legs used for the CO₂ logistics.

In [24]:
# Overview of the 2025 GPs
for _, ev in schedule.iterrows():
    rd = int(ev['RoundNumber'])
    loc = ev['Location']
    event_date = ev['EventDate']
    available = "✅" if event_date <= today else "⏳"
    print(f"{available} Round {rd:02d} – {loc} (date: {event_date.date()})")

✅ Round 01 – Melbourne (date: 2025-03-16)
✅ Round 02 – Shanghai (date: 2025-03-23)
✅ Round 03 – Suzuka (date: 2025-04-06)
✅ Round 04 – Sakhir (date: 2025-04-13)
✅ Round 05 – Jeddah (date: 2025-04-20)
✅ Round 06 – Miami (date: 2025-05-04)
✅ Round 07 – Imola (date: 2025-05-18)
✅ Round 08 – Monaco (date: 2025-05-25)
✅ Round 09 – Barcelona (date: 2025-06-01)
✅ Round 10 – Montréal (date: 2025-06-15)
⏳ Round 11 – Spielberg (date: 2025-06-29)
⏳ Round 12 – Silverstone (date: 2025-07-06)
⏳ Round 13 – Spa-Francorchamps (date: 2025-07-27)
⏳ Round 14 – Budapest (date: 2025-08-03)
⏳ Round 15 – Zandvoort (date: 2025-08-31)
⏳ Round 16 – Monza (date: 2025-09-07)
⏳ Round 17 – Baku (date: 2025-09-21)
⏳ Round 18 – Marina Bay (date: 2025-10-05)
⏳ Round 19 – Austin (date: 2025-10-19)
⏳ Round 20 – Mexico City (date: 2025-10-26)
⏳ Round 21 – São Paulo (date: 2025-11-09)
⏳ Round 22 – Las Vegas (date: 2025-11-22)
⏳ Roun

In [None]:
for _, ev in completed.iterrows():
    rd = int(ev['RoundNumber'])
    loc = ev['Location']
    print(f"\n▶ Round {rd:02d} – {loc}")

    try:
        # Sessions F1
        race = fastf1.get_session(YEAR, rd, 'R')
        race.load()
        results_list.append(race.results.assign(round=rd, event=loc))
        laps = race.laps.assign(round=rd, event=loc)
        pits_real = extract_real_pitstops(laps)
        pits_list.append(pits_real)
        weather_list.append(race.weather_data.assign(round=rd, event=loc))
        print(f"Result columns: {race.results.columns.tolist()}")  # Debug: show the result columns
        pilot_vars = ['DriverNumber', 'FullName', 'HeadshotUrl', 'TeamName', 'Position', 'GridPosition', 'Points', 'Status']
        standings_driver.append(race.results[pilot_vars].assign(round=rd, event=loc))
        team_vars = ['TeamName', 'TeamColor']
        teams = race.results[team_vars].drop_duplicates()
        team_points = race.results.groupby('TeamName')['Points'].sum().reset_index()
        teams = teams.merge(team_points, on='TeamName')
        standings_team.append(teams.assign(round=rd, event=loc))

    except Exception as e:
        print(f"  ❌ Error while loading race data for {loc}: {e}")
        continue
    # Qualifying
    try:
        qual = fastf1.get_session(YEAR, rd, 'Q')
        qual.load()
        quali_list.append(qual.results.assign(round=rd, event=loc))
    except Exception as e:
        print(f"  (no Q session found) → {e}")

    # CO₂ logistics: flight leg from the previous GP's airport
    airport = CIRCUIT_IATA.get(loc, None)
    if last_airport and airport:
        lat1, lon1 = get_coords_for_iata(last_airport)
        lat2, lon2 = get_coords_for_iata(airport)
        if None not in (lat1, lon1, lat2, lon2):
            dist_km = haversine(lat1, lon1, lat2, lon2)
            flight_legs.append({
                'from': last_airport, 'to': airport, 'distance_km': dist_km,
                'event_from': last_loc, 'event_to': loc
            })
    last_airport = airport
    last_loc = loc
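
The leg-building logic at the end of the loop can be isolated as a pure function, which makes its skipping behaviour easier to see: a leg is recorded only when both consecutive stops have known coordinates. This is a sketch with hypothetical names (`build_legs`, `haversine_km`) and artificial coordinates chosen so the distance is easy to verify.

```python
from math import radians, sin, cos, sqrt, atan2, pi

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return radius_km * 2 * atan2(sqrt(a), sqrt(1 - a))

def build_legs(stops, coords):
    """stops: ordered (event, iata) pairs; coords: iata -> (lat, lon).

    Consecutive pairs are linked only when both airports are known,
    mirroring the notebook's loop.
    """
    legs = []
    for (ev_a, a), (ev_b, b) in zip(stops, stops[1:]):
        if a in coords and b in coords:
            legs.append({'from': a, 'to': b, 'event_from': ev_a, 'event_to': ev_b,
                         'distance_km': haversine_km(*coords[a], *coords[b])})
    return legs

# Artificial airports: the second event has no mapped airport.
toy_coords = {'AAA': (0, 0), 'BBB': (0, 90), 'CCC': (90, 0)}
toy_stops = [('One', 'AAA'), ('Two', None), ('Three', 'BBB'), ('Four', 'CCC')]

legs = build_legs(toy_stops, toy_coords)
print(legs)
```

With these inputs only the BBB to CCC leg is produced, since the unmapped second event breaks both legs that touch it.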

## 5️⃣ Building the Final DataFrames
We concatenate all the lists to obtain the final datasets.

In [26]:

df_results    = pd.concat(results_list, ignore_index=True)
df_quali      = pd.concat(quali_list,   ignore_index=True)
df_pits       = pd.concat(pits_list,    ignore_index=True)
df_weather    = pd.concat(weather_list, ignore_index=True)
df_drv_stand  = pd.concat(standings_driver, ignore_index=True)
df_team_stand = pd.concat(standings_team,   ignore_index=True)
df_flights    = pd.DataFrame(flight_legs)
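
`pd.concat` raises a `ValueError` when given an empty list, which would happen here if no session could be loaded at all. A small guard, sketched below with the hypothetical helper `concat_or_empty`, returns an empty frame instead so the later cells can still run.

```python
import pandas as pd

def concat_or_empty(frames, columns=None):
    """Concatenate a list of DataFrames, or return an empty frame if the list is empty."""
    if not frames:
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)

empty = concat_or_empty([], columns=['round', 'event'])
both = concat_or_empty([pd.DataFrame({'round': [1]}), pd.DataFrame({'round': [2]})])
print(len(empty), both['round'].tolist())
```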


## 6️⃣ Quick Quality Audit 🕵️‍♂️
We check the main missing values for each dataset.

In [27]:
def missing_report(df):
    miss = df.isna().sum()
    pct = 100 * miss / len(df)
    return pd.DataFrame({'missing_count': miss, 'missing_pct': pct.round(2)}).sort_values('missing_pct', ascending=False).head(10)

print("---- Race results ----\n")
display(missing_report(df_results))
print("---- Pitstops ----\n")
display(missing_report(df_pits))
print("---- Weather ----\n")
display(missing_report(df_weather))
print("---- Flight legs ----\n")
display(missing_report(df_flights))
print("---- Driver standings ----\n")
display(missing_report(df_drv_stand))
print("---- Team standings ----\n")
display(missing_report(df_team_stand))
print("---- Qualifying ----\n")
display(missing_report(df_quali))

---- Race results ----



Unnamed: 0,missing_count,missing_pct
Q1,199,100.0
Q3,199,100.0
Q2,199,100.0
Time,27,13.57
DriverNumber,0,0.0
BroadcastName,0,0.0
Abbreviation,0,0.0
TeamId,0,0.0
TeamColor,0,0.0
TeamName,0,0.0


---- Pitstops ----



Unnamed: 0,missing_count,missing_pct
PitOutTime,12,4.72
PitInTime,12,4.72
Driver,0,0.0
TeamName,0,0.0
DriverNumber,0,0.0
LapOut,0,0.0
LapIn,0,0.0
CompoundIn,0,0.0
CompoundOut,0,0.0
TyreLifeIn,0,0.0


---- Weather ----



Unnamed: 0,missing_count,missing_pct
Time,0,0.0
AirTemp,0,0.0
Humidity,0,0.0
Pressure,0,0.0
Rainfall,0,0.0
TrackTemp,0,0.0
WindDirection,0,0.0
WindSpeed,0,0.0
round,0,0.0
event,0,0.0


---- Flight legs ----



Unnamed: 0,missing_count,missing_pct
from,0,0.0
to,0,0.0
distance_km,0,0.0
event_from,0,0.0
event_to,0,0.0


---- Driver standings ----



Unnamed: 0,missing_count,missing_pct
DriverNumber,0,0.0
FullName,0,0.0
HeadshotUrl,0,0.0
TeamName,0,0.0
Position,0,0.0
GridPosition,0,0.0
Points,0,0.0
Status,0,0.0
round,0,0.0
event,0,0.0


---- Team standings ----



Unnamed: 0,missing_count,missing_pct
TeamName,0,0.0
TeamColor,0,0.0
Points,0,0.0
round,0,0.0
event,0,0.0


---- Qualifying ----



Unnamed: 0,missing_count,missing_pct
Points,200,100.0
Time,200,100.0
GridPosition,200,100.0
Q3,101,50.5
Q2,55,27.5
Q1,2,1.0
DriverNumber,0,0.0
TeamId,0,0.0
TeamColor,0,0.0
TeamName,0,0.0


## 7️⃣ Missing-Value Audit, Cleaning & Data Preparation
We drop the unneeded columns, document the NaNs, and prepare the datasets for the rest of the pipeline.

The goal is to guarantee **clean and reliable data** before starting the exploratory data analysis (EDA).

### ---- Race results ----

In [28]:
df_results[df_results['Time'].isna()]

Unnamed: 0,DriverNumber,BroadcastName,Abbreviation,DriverId,TeamName,TeamColor,TeamId,FirstName,LastName,FullName,...,ClassifiedPosition,GridPosition,Q1,Q2,Q3,Time,Status,Points,round,event
14,30,L LAWSON,LAW,lawson,Red Bull Racing,3671C6,red_bull,Liam,Lawson,Liam Lawson,...,R,18.0,NaT,NaT,NaT,NaT,Retired,0.0,1,Melbourne
15,5,G BORTOLETO,BOR,bortoleto,Kick Sauber,52E252,sauber,Gabriel,Bortoleto,Gabriel Bortoleto,...,R,15.0,NaT,NaT,NaT,NaT,Retired,0.0,1,Melbourne
16,14,F ALONSO,ALO,alonso,Aston Martin,229971,aston_martin,Fernando,Alonso,Fernando Alonso,...,R,12.0,NaT,NaT,NaT,NaT,Retired,0.0,1,Melbourne
17,55,C SAINZ,SAI,sainz,Williams,64C4FF,williams,Carlos,Sainz,Carlos Sainz,...,R,10.0,NaT,NaT,NaT,NaT,Retired,0.0,1,Melbourne
18,7,J DOOHAN,DOO,doohan,Alpine,0093CC,alpine,Jack,Doohan,Jack Doohan,...,R,14.0,NaT,NaT,NaT,NaT,Retired,0.0,1,Melbourne
19,6,I HADJAR,HAD,hadjar,Racing Bulls,6692FF,rb,Isack,Hadjar,Isack Hadjar,...,R,11.0,NaT,NaT,NaT,NaT,Retired,0.0,1,Melbourne
36,14,F ALONSO,ALO,alonso,Aston Martin,229971,aston_martin,Fernando,Alonso,Fernando Alonso,...,R,13.0,NaT,NaT,NaT,NaT,Retired,0.0,2,Shanghai
37,16,C LECLERC,LEC,leclerc,Ferrari,E80020,ferrari,Charles,Leclerc,Charles Leclerc,...,D,6.0,NaT,NaT,NaT,NaT,Disqualified,0.0,2,Shanghai
38,44,L HAMILTON,HAM,hamilton,Ferrari,E80020,ferrari,Lewis,Hamilton,Lewis Hamilton,...,D,5.0,NaT,NaT,NaT,NaT,Disqualified,0.0,2,Shanghai
39,10,P GASLY,GAS,gasly,Alpine,0093CC,alpine,Pierre,Gasly,Pierre Gasly,...,D,16.0,NaT,NaT,NaT,NaT,Disqualified,0.0,2,Shanghai


The Q1/Q2/Q3 columns are entirely missing from the race results because they are **only populated for qualifying sessions**. They are ignored here.  
Missing values in the “Time” column correspond only to drivers who retired or were disqualified (see the “Status” column).  
**No cleaning is needed for these rows: they are legitimate and must be kept for the analysis of retirements and disqualifications.**
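
The claim that a missing `Time` always goes with a retirement or disqualification can be checked rather than read off the first rows. The sketch below does this on a toy frame; the frame and the list of accepted statuses are assumptions based on the rows displayed above, and the real `Status` column may contain other values.

```python
import pandas as pd

toy_results = pd.DataFrame({
    'Abbreviation': ['AAA', 'BBB', 'CCC', 'DDD'],
    'Status': ['Finished', 'Retired', 'Disqualified', 'Finished'],
    'Time': pd.to_timedelta(['01:30:00', None, None, '01:30:05']),
})

# Rows with no race time, and how their statuses are distributed.
missing_time = toy_results[toy_results['Time'].isna()]
status_counts = missing_time['Status'].value_counts().to_dict()

# Any row here would contradict the "retired or disqualified only" reading.
unexpected = missing_time[~missing_time['Status'].isin(['Retired', 'Disqualified'])]
print(status_counts, len(unexpected))
```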

In [29]:
# Drop the Q1/Q2/Q3 columns, which are not needed here
df_results = df_results.drop(columns=['Q1', 'Q2', 'Q3'], errors='ignore')

### ---- Pit stops ----

In [30]:
df_pits[df_pits[['PitInTime', 'PitOutTime']].isna().any(axis=1)]

Unnamed: 0,Driver,DriverNumber,TeamName,LapIn,LapOut,PitInTime,PitOutTime,CompoundIn,CompoundOut,TyreLifeIn,TyreLifeOut,PositionIn,PositionOut,event,round
126,ALB,23,Williams,32.0,33.0,NaT,NaT,MEDIUM,HARD,8.0,1.0,5.0,5.0,Miami,6
127,ALO,14,Aston Martin,32.0,33.0,NaT,NaT,HARD,MEDIUM,10.0,2.0,15.0,15.0,Miami,6
128,ANT,12,Mercedes,32.0,33.0,NaT,NaT,MEDIUM,HARD,8.0,1.0,6.0,6.0,Miami,6
130,HAM,44,Ferrari,32.0,33.0,NaT,NaT,HARD,MEDIUM,9.0,1.0,9.0,9.0,Miami,6
132,LAW,30,Racing Bulls,31.0,32.0,NaT,NaT,HARD,MEDIUM,8.0,1.0,17.0,17.0,Miami,6
133,LEC,16,Ferrari,32.0,33.0,NaT,NaT,MEDIUM,HARD,8.0,1.0,8.0,8.0,Miami,6
134,NOR,4,McLaren,32.0,33.0,NaT,NaT,MEDIUM,HARD,8.0,1.0,2.0,2.0,Miami,6
135,PIA,81,McLaren,32.0,33.0,NaT,NaT,MEDIUM,HARD,8.0,1.0,1.0,1.0,Miami,6
136,RUS,63,Mercedes,32.0,33.0,NaT,NaT,HARD,MEDIUM,8.0,1.0,3.0,3.0,Miami,6
137,SAI,55,Williams,32.0,33.0,NaT,NaT,MEDIUM,HARD,14.0,1.0,7.0,7.0,Miami,6


Some stops made under the Virtual Safety Car (VSC) or Safety Car may lack precise timestamps (`PitInTime`, `PitOutTime`) even though the tyre change did take place.  
This reflects a limitation of the FIA/FastF1 timing data under neutralised conditions.  
No correction is applied: analyses of stop frequency and stop type remain valid; only the pit-delta study must be restricted to rows with valid timestamps.
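
Restricting the pit-delta study to valid rows could look like the sketch below. It assumes `PitInTime` and `PitOutTime` hold timedelta-like values (as the `NaT` entries suggest); the toy frame and the `PitDelta_s` column name are hypothetical.

```python
import pandas as pd

toy_pits = pd.DataFrame({
    'Driver': ['AAA', 'BBB'],
    'PitInTime': pd.to_timedelta(['00:30:00', None]),
    'PitOutTime': pd.to_timedelta(['00:30:24', None]),
})

# Keep only stops with both timestamps, then compute the time spent in the pit lane.
valid = toy_pits.dropna(subset=['PitInTime', 'PitOutTime']).copy()
valid['PitDelta_s'] = (valid['PitOutTime'] - valid['PitInTime']).dt.total_seconds()
print(valid[['Driver', 'PitDelta_s']])
```

The unfiltered frame is left untouched, so counts of stops per driver or per compound still include the rows without timestamps.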


### ---- Qualifying ----

In [31]:
df_quali

Unnamed: 0,DriverNumber,BroadcastName,Abbreviation,DriverId,TeamName,TeamColor,TeamId,FirstName,LastName,FullName,...,ClassifiedPosition,GridPosition,Q1,Q2,Q3,Time,Status,Points,round,event
0,4,L NORRIS,NOR,norris,McLaren,FF8000,mclaren,Lando,Norris,Lando Norris,...,,,0 days 00:01:15.912000,0 days 00:01:15.415000,0 days 00:01:15.096000,NaT,,,1,Melbourne
1,81,O PIASTRI,PIA,piastri,McLaren,FF8000,mclaren,Oscar,Piastri,Oscar Piastri,...,,,0 days 00:01:16.062000,0 days 00:01:15.468000,0 days 00:01:15.180000,NaT,,,1,Melbourne
2,1,M VERSTAPPEN,VER,max_verstappen,Red Bull Racing,3671C6,red_bull,Max,Verstappen,Max Verstappen,...,,,0 days 00:01:16.018000,0 days 00:01:15.565000,0 days 00:01:15.481000,NaT,,,1,Melbourne
3,63,G RUSSELL,RUS,russell,Mercedes,27F4D2,mercedes,George,Russell,George Russell,...,,,0 days 00:01:15.971000,0 days 00:01:15.798000,0 days 00:01:15.546000,NaT,,,1,Melbourne
4,22,Y TSUNODA,TSU,tsunoda,Racing Bulls,6692FF,rb,Yuki,Tsunoda,Yuki Tsunoda,...,,,0 days 00:01:16.225000,0 days 00:01:16.009000,0 days 00:01:15.670000,NaT,,,1,Melbourne
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,5,G BORTOLETO,BOR,bortoleto,Kick Sauber,01C00E,sauber,Gabriel,Bortoleto,Gabriel Bortoleto,...,,,0 days 00:01:12.385000,NaT,NaT,NaT,,,10,Montréal
196,55,C SAINZ,SAI,sainz,Williams,1868DB,williams,Carlos,Sainz,Carlos Sainz,...,,,0 days 00:01:12.398000,NaT,NaT,NaT,,,10,Montréal
197,18,L STROLL,STR,stroll,Aston Martin,229971,aston_martin,Lance,Stroll,Lance Stroll,...,,,0 days 00:01:12.517000,NaT,NaT,NaT,,,10,Montréal
198,30,L LAWSON,LAW,lawson,Racing Bulls,6C98FF,rb,Liam,Lawson,Liam Lawson,...,,,0 days 00:01:12.525000,NaT,NaT,NaT,,,10,Montréal


- The Points, Time and GridPosition columns are always missing in qualifying (they are only populated after the race),  
  so they are dropped for clarity.
- The NaNs in Q2/Q3 reflect the knockout format of F1 qualifying: a time is only recorded for the segments a driver actually took part in.
- We keep all rows and NaNs to respect the sporting logic.
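
Because the NaNs encode how far each driver progressed, they can be turned into explicit features instead of being imputed. The sketch below derives two such columns on toy data; `SegmentsTimed` and `LastSegmentTime` are hypothetical names, and the times are invented.

```python
import pandas as pd

toy_quali = pd.DataFrame({
    'Abbreviation': ['AAA', 'BBB', 'CCC'],
    'Q1': pd.to_timedelta(['00:01:16.000', '00:01:16.500', '00:01:17.000']),
    'Q2': pd.to_timedelta(['00:01:15.500', '00:01:16.100', None]),
    'Q3': pd.to_timedelta(['00:01:15.100', None, None]),
})

# Number of segments with a recorded time (1, 2 or 3).
toy_quali['SegmentsTimed'] = toy_quali[['Q1', 'Q2', 'Q3']].notna().sum(axis=1)

# Time from the latest segment reached: Q3 if present, else Q2, else Q1.
toy_quali['LastSegmentTime'] = (
    toy_quali['Q3'].fillna(toy_quali['Q2']).fillna(toy_quali['Q1'])
)
print(toy_quali[['Abbreviation', 'SegmentsTimed', 'LastSegmentTime']])
```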


In [32]:
# Drop the columns that are not relevant for qualifying
df_quali = df_quali.drop(columns=['Points', 'Time', 'GridPosition'], errors='ignore')

### ---- Flights and carbon emissions ----

#### Assumptions for the logistics carbon footprint calculation

- **Total mass transported**: 1,400 tonnes per Grand Prix (official estimate taken from press releases by DHL, F1's main logistics partner).
- **Emission factor**: 0.587 kg CO₂ per tonne-kilometre (t·km), according to the official UK BEIS 2024 conversion factors for long-haul air freight ([source](https://www.gov.uk/government/publications/greenhouse-gas-reporting-conversion-factors-2024)).
- **Calculation applied**:  
  > CO₂ emissions (kg) = distance (km) × mass transported (t) × emission factor (kg CO₂ / t·km)

**Note**: These assumptions follow the calculation standards used by the FIA and the logistics industry, which supports the order of magnitude of the analysis.
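
As a worked instance of the formula, a hypothetical 1,000 km leg with the two constants stated above gives 1,000 × 1,400 × 0.587 = 821,800 kg of CO₂, or about 821.8 tonnes. The helper name `leg_emissions_kg` is an illustrative assumption.

```python
MASS_TONNES = 1400        # cargo mass per Grand Prix assumed in the text
KG_CO2_PER_TKM = 0.587    # emission factor assumed in the text

def leg_emissions_kg(distance_km, mass_t=MASS_TONNES, factor=KG_CO2_PER_TKM):
    """CO2 in kg for one leg: distance (km) x mass (t) x factor (kg CO2 per t.km)."""
    return distance_km * mass_t * factor

co2_kg = leg_emissions_kg(1000.0)
print(f"{co2_kg:,.0f} kg = {co2_kg / 1000:,.1f} t")
```

Since the result is linear in all three inputs, any change to the mass or factor assumption rescales every leg by the same ratio.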


### 🧮 Logistics CO₂ calculation
We apply the assumptions stated above (DHL mass estimate, UK BEIS emission factor) to estimate the carbon footprint of the logistics legs.

In [33]:
MASS_TONNES = 1400
KG_CO2_PER_TKM = 0.587

df_flights['CO2_kg'] = df_flights['distance_km'] * MASS_TONNES * KG_CO2_PER_TKM
df_flights['CO2_tonnes'] = (df_flights['CO2_kg'] / 1000).round(2)
df_flights['CO2_kg'] = df_flights['CO2_kg'].round(2)
df_flights

Unnamed: 0,from,to,distance_km,event_from,event_to,CO2_kg,CO2_tonnes
0,MEL,PVG,8017.369472,Melbourne,Shanghai,6588674.23,6588.67
1,PVG,NGO,1456.86876,Shanghai,Suzuka,1197254.75,1197.25
2,NGO,BAH,8052.263175,Suzuka,Sakhir,6617349.88,6617.35
3,BAH,JED,1272.192236,Sakhir,Jeddah,1045487.58,1045.49
4,JED,MIA,11621.236283,Jeddah,Miami,9550331.98,9550.33
5,MIA,BLQ,8149.753826,Miami,Imola,6697467.69,6697.47
6,BLQ,NCE,339.502228,Imola,Monaco,279002.93,279.0
7,NCE,BCN,496.29652,Monaco,Barcelona,407856.48,407.86
8,BCN,YUL,5911.345375,Barcelona,Montréal,4857943.63,4857.94


A closer inspection of the results revealed that the driver Kimi Antonelli appears under several name variants (e.g. "Andrea Kimi Antonelli" and "Kimi Antonelli").
We normalise his name to a single form.
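
This kind of inconsistency can be found systematically by counting distinct names per driver identifier. The sketch below uses a toy frame; it assumes a `DriverId` column is available alongside `FullName`, as in the race results.

```python
import pandas as pd

toy = pd.DataFrame({
    'DriverId': ['antonelli', 'antonelli', 'norris', 'norris'],
    'FullName': ['Andrea Kimi Antonelli', 'Kimi Antonelli',
                 'Lando Norris', 'Lando Norris'],
})

# Distinct FullName values per driver; more than one means inconsistent naming.
variants = toy.groupby('DriverId')['FullName'].unique()
inconsistent = {drv: sorted(names) for drv, names in variants.items() if len(names) > 1}
print(inconsistent)
```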


In [34]:
# For df_results
mask_results = df_results['DriverId'] == 'antonelli'
df_results.loc[mask_results, 'FirstName'] = 'Kimi'
df_results.loc[mask_results, 'FullName'] = 'Kimi Antonelli'

# For df_quali
mask_quali = df_quali['DriverId'] == 'antonelli'
df_quali.loc[mask_quali, 'FirstName'] = 'Kimi'
df_quali.loc[mask_quali, 'FullName'] = 'Kimi Antonelli'

# For df_drv_stand (no DriverId column was kept, so match on FullName)
mask_drv = df_drv_stand['FullName'] == 'Andrea Kimi Antonelli'
df_drv_stand.loc[mask_drv, 'FullName'] = 'Kimi Antonelli'

## 8️⃣ Final Export 🚀
We save all the datasets in Parquet format for the rest of the pipeline (EDA, ML, dashboard).

In [35]:
# Save the DataFrames as Parquet files (timestamped file names are currently disabled)
#now = datetime.now()
#date_str = now.strftime("%d_%m_%Y_%H_%M")

df_results.to_parquet(data_dir / f'results_{YEAR}.parquet')
df_quali.to_parquet(data_dir / f'qualifying_{YEAR}.parquet')
df_pits.to_parquet(data_dir / f'pitstops_{YEAR}.parquet')
df_weather.to_parquet(data_dir / f'weather_{YEAR}.parquet')
df_drv_stand.to_parquet(data_dir / f'driver_standings_{YEAR}.parquet')
df_team_stand.to_parquet(data_dir / f'team_standings_{YEAR}.parquet')
df_flights.to_parquet(data_dir / f'flightlegs_{YEAR}.parquet')
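
If the timestamped file names hinted at by the commented-out lines are wanted later, the naming could be centralised in one helper. This is only a sketch: `output_path` is a hypothetical function, and it reuses the `%d_%m_%Y_%H_%M` format from the commented code.

```python
from datetime import datetime
from pathlib import Path

def output_path(data_dir, name, year, stamp=None):
    """Build '<name>_<year>[_<stamp>].parquet' inside data_dir."""
    suffix = f"_{stamp:%d_%m_%Y_%H_%M}" if stamp is not None else ""
    return Path(data_dir) / f"{name}_{year}{suffix}.parquet"

plain = output_path('../data', 'results', 2025)
stamped = output_path('../data', 'results', 2025, stamp=datetime(2025, 6, 20, 14, 5))
print(plain.name)
print(stamped.name)
```

Without a stamp the file name matches the ones written above, so downstream notebooks that read the fixed names keep working.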