# Project Group 11

Members: Alessandro Casati (6544649), Bas van den Muijsenberg (5797578), Hylke Bleeker (5589355), Jan-Pieter Vermeer (6340261), Mike Geerts (6276210)


# Research Objective

## Analysis and delay prediction of flights across top 20 airports in Europe in 2023-24

**Research Question**  
What factors drive airport delays across the top 20 European airports in 2023-24, and can we build predictive models to identify possible delays for a flight using the information available?

**Sub Questions**
- Which types of delay occur most often at the airports?
- Which airports experience the highest, the lowest, and the most consistent delay rates?
- How do delays relate to traffic volume and efficiency?
- How do delays distribute geographically across Europe, and which airports stand out as persistent “delay hotspots”?
- Which features (traffic levels, weekday/weekend, month, etc.) are the strongest predictors of high-delay days?
- Which model can predict the chance of a delay and the amount of delay for a flight?

# Contribution Statement

*Be specific. Some of the tasks can be coding (expect everyone to do this), background research, conceptualisation, visualisation, data analysis, data modelling*

**Author 1**:

**Author 2**:

**Author 3**:

# Data Used

## Datasets
| Dataset | Dataset description (function) | Columns needed |
|---|---|---|
| Airport Traffic | Airport capacity of inbound and outbound flights for network manager and airport operator | All |
| Airport ATFM_Delay | The Airport Arrival ATFM Delay provides an indication of ATFM delays on the ground due to constraints at airports | FLT_DATE, APT_ICAO (to merge DB) + all the remaining types of delays |
| Airport Punctuality | Daily performance tracker of airport punctuality, delays, and operational efficiency | All + add APT_ICAO for merging |
| All_Pre_Departure_Delays | Daily overview of total pre-departure delay in minutes per airport that includes all delay causes | All |
| ATC_Pre_Departure_Delay | Proxy for ATC induced delays at the departure stand (IATA delay code 89) | Pending (depends on data analysis findings) |
| Additional taxi-out time | Taxi time from gate to runway and associated delay | All |
| Additional taxi-in time | Taxi time from runway to gate and associated delay | All |
| Airport IATA delays | Monthly flight delay information per airport by IATA delay cause | All |
| Airport Reactionary delays | Monthly reactionary flight delays as a share of total network delays per airport | All |
| Airport punctuality distribution | Monthly flight punctuality statistics showing delayed, on-time and early departures | All |

## Information extracted from datasets
| Dataset | Key Information |
|---|---|
| Airport Traffic | IFR arrivals, departures for Airport Operator and Network Manager |
| Airport ATFM_Delay | Disruption type and count, number of delayed arrival flights |
| Airport Punctuality | Arrival & departure punctuality %, avg delay (dep & arr), operated schedules % |
| Pre_Departure_Delay | AO departures, AO pre-departure delay (minutes) |
| Additional taxi-out time | Number of flights with available data + total taxi-out time |
| Additional taxi-in time | Number of flights with available data + total taxi-in time |

# Data Pipeline

## Data Exploration & Cleaning
- Import & clean the datasets: remove missing/invalid values, select only useful columns and merge.
- Create basic delay metrics:
  - Delay categories (On time, Delayed, Delayed15)
  - Day category (Weekend, Weekdays)
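
As a rough illustration of the two derived columns listed above, the sketch below labels each row with a day category and a delay category. The column names `date` and `delay_min` and the thresholds (more than 0 minutes counts as Delayed, 15 minutes or more as Delayed15) are assumptions made for the example, not definitions taken from the datasets.

```python
import numpy as np
import pandas as pd

# Toy data; column names and thresholds are assumptions for illustration only.
flights = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-06", "2023-01-07", "2023-01-08"]),
    "delay_min": [0, 7, 22],
})

# Day category: dayofweek runs from 0 (Monday) to 6 (Sunday), so 5 and 6 are the weekend.
flights["day_category"] = np.where(
    flights["date"].dt.dayofweek >= 5, "Weekend", "Weekday"
)

# Delay category: conditions are checked in order, so the 15-minute rule wins over "> 0".
conditions = [flights["delay_min"] >= 15, flights["delay_min"] > 0]
flights["delay_category"] = np.select(
    conditions, ["Delayed15", "Delayed"], default="On time"
)

print(flights)
```

Keeping the thresholds in one place makes it easy to adjust them once the team has agreed on the exact definitions.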

## Descriptive Data Analysis
- Time patterns: heatmaps/line charts/stacked charts (delays by day, month, cause).
- Geospatial view: interactive maps color-coded by average airport delay.
- Efficiency scatterplots: traffic volume vs. % delay

## Predictive Modeling
- Predict delay occurrence for a flight using departure airport and its performances, average airport delay, weekday/weekend & month.
- Models: Logistic Regression and/or Random Forest.
- Testing with 2025 data (unseen during training).
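
The plan to test on 2025 data amounts to a time-based split instead of a random one. A minimal sketch of that split, using a toy daily table whose column names (`date`, `month`, `is_weekend`, `high_delay`) are placeholders and not the final feature set:

```python
import pandas as pd

# Toy daily table; column names are placeholders, not the real dataset schema.
daily = pd.DataFrame({
    "date": pd.to_datetime(["2023-06-01", "2024-06-01", "2025-06-01", "2025-06-02"]),
    "high_delay": [1, 0, 1, 0],
})
daily["month"] = daily["date"].dt.month
daily["is_weekend"] = (daily["date"].dt.dayofweek >= 5).astype(int)

feature_cols = ["month", "is_weekend"]
is_test_year = daily["date"].dt.year == 2025

# Train on 2023-24 only; keep every 2025 row unseen until evaluation.
X_train = daily.loc[~is_test_year, feature_cols]
y_train = daily.loc[~is_test_year, "high_delay"]
X_test = daily.loc[is_test_year, feature_cols]
y_test = daily.loc[is_test_year, "high_delay"]

print(len(X_train), len(X_test))
```

Any scaling or encoding step would then be fitted on `X_train` only, so that no information from 2025 leaks into training.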

## Clustering & Patterns
- Cluster airports by daily delay patterns (e.g., “always congested,” “seasonal peaks,” “mostly smooth”).

# Data Cleaning

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


## TO-DO list, by Wed 8/10
| Dataset | Key Information | Assigned to |
|---|---|---|
| Airport Traffic | Imported + top20 defined: missing total number of departures and arrivals | Ale |
| Airport ATFM_Delay | Extract top20 airports by comparing with df_top20; compute overall totals summed over both years | Bas |
| Airport Punctuality | Filter for 23-24, find top 20, compare with df_top20 (some airports might be missing), get all the columns | Mike |
| Pre_Departure_Delay | TBD: is the data actually useful for our RQ? If yes, extract top20 + last 5 columns | JP |
| Additional taxi-out time | Extract top20 airports + TF, TOT_REF_NB_FL, TOT_REF&ADD_TIME_MIN, PIVOT_LABEL | Hylke |
| Additional taxi-in time | Extract top20 airports + TF, TOT_REF_NB_FL, TOT_REF&ADD_TIME_MIN, PIVOT_LABEL | Hylke |

Some comments: for now we focus on the combined totals for 2023-24. Note that some datasets do not share matching columns and cannot be merged directly; you may need to compare them manually or use other references (especially for the Punctuality dataset).
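
One way to spot airports that fail to match during a merge is the `indicator` option of `DataFrame.merge`. The sketch below uses two toy frames (the `value` column is made up) and lists the codes that have no counterpart in the second dataset:

```python
import pandas as pd

# Toy frames; "value" is a made-up column standing in for any dataset to be merged.
top = pd.DataFrame({"APT_ICAO": ["LTFM", "EHAM", "EGLL"]})
other = pd.DataFrame({"APT_ICAO": ["LTFM", "EHAM"], "value": [1.0, 2.0]})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
merged = top.merge(other, on="APT_ICAO", how="left", indicator=True)

# Airports from the reference list that found no match in the other dataset.
missing = merged.loc[merged["_merge"] == "left_only", "APT_ICAO"].tolist()
print(missing)
```

Checking `missing` after every merge gives an early warning when a dataset uses different airport identifiers.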

## Airport Traffic
Importing dataset

In [2]:
df_AT = pd.read_excel("datasets/AirportTraffic.xlsx")
df_TIn = pd.read_excel("datasets/Taxi-In_Additional_Time.xlsx")
df_TOut = pd.read_excel("datasets/Taxi-Out_Additional_Time.xlsx")
df_AAD = pd.read_excel("datasets/AA_ATFM_Delay.xlsx")
df_AP = pd.read_excel("datasets/Airports_Punctuality.xlsx")
df_AID= pd.read_excel("datasets/Airport_IATA_delays_airline_reported.xlsx")
df_PDD = pd.read_excel("datasets/Primary_departure_Delay_Causes_AP.xlsx")
df_FPD = pd.read_excel("datasets/Flight_Punctuality_Distribution.xlsx")


Find top 20 airports by total flights and create a new dataframe with key data

In [4]:
df_top20 = (
    df_AT.groupby("APT_ICAO")[["FLT_TOT_1", "FLT_DEP_1", "FLT_ARR_1"]] #Group by airport code
    .sum().sort_values(by="FLT_TOT_1",ascending=False) #Sum the values for each code of the 3 columns indicated
    .head(20).reset_index()) #Change "20" to change the number of airports analysed
# Adding airport's city name and state from original dataset
df_top20 = (df_top20.merge(df_AT[["APT_ICAO", "APT_NAME", "STATE_NAME"]]
                           .drop_duplicates(), on="APT_ICAO", how="left"))
df_top20.head(20)

Unnamed: 0,APT_ICAO,FLT_TOT_1,FLT_DEP_1,FLT_ARR_1,APT_NAME,STATE_NAME
0,LTFM,1014869,507380,507489,Istanbul,Türkiye
1,EHAM,947254,473657,473597,Amsterdam - Schiphol,Netherlands
2,EGLL,933098,466473,466625,London - Heathrow,United Kingdom
3,LFPG,921916,460997,460919,Paris-Charles-de-Gaulle,France
4,EDDF,871219,435565,435654,Frankfurt,Germany
5,LEMD,809535,404744,404791,Madrid - Barajas,Spain
6,LEBL,666969,333465,333504,Barcelona,Spain
7,EDDM,624316,312141,312175,Munich,Germany
8,LIRF,582158,291068,291090,Rome - Fiumicino,Italy
9,EGKK,522331,261146,261185,London - Gatwick,United Kingdom


## Airport ATFM_Delay

Daily delay for all top 20 airports.
Rows with missing values (NaN) are removed, only the top 20 airports are kept, and the result is sorted by airport traffic.

In [5]:
df_AAD = df_AAD.dropna()
df_AAD_top20 = df_AAD[df_AAD["APT_ICAO"].isin(df_top20["APT_ICAO"])]
df_AAD_top20

#Lists of code and names of top20 airports
airports_name_list = df_top20["APT_NAME"].tolist()
airports_code_list = df_top20["APT_ICAO"].tolist()

#Dataframe of codes, useful for merging 
df_airport_codes = pd.DataFrame({'APT_ICAO': airports_code_list})

df_AAD_top20 = df_AAD_top20.set_index("APT_ICAO").loc[airports_code_list].reset_index()


df_AAD_top20.head()

Unnamed: 0,APT_ICAO,YEAR,MONTH_NUM,MONTH_MON,FLT_DATE,APT_NAME,STATE_NAME,FLT_ARR_1,DLY_APT_ARR_1,DLY_APT_ARR_A_1,...,DLY_APT_ARR_R_1,DLY_APT_ARR_S_1,DLY_APT_ARR_T_1,DLY_APT_ARR_V_1,DLY_APT_ARR_W_1,DLY_APT_ARR_NA_1,FLT_ARR_1_DLY,FLT_ARR_1_DLY_15,ATFM_VERSION,Pivot Label
0,LTFM,2023,1,JAN,44928,Istanbul,Türkiye,676,1720.0,0.0,...,0.0,0.0,0.0,0.0,1720.0,0.0,83.0,38.0,v2,Istanbul (LTFM)
1,LTFM,2023,1,JAN,44938,Istanbul,Türkiye,607,665.0,0.0,...,0.0,0.0,0.0,0.0,665.0,0.0,34.0,19.0,v2,Istanbul (LTFM)
2,LTFM,2023,1,JAN,44939,Istanbul,Türkiye,635,78.0,0.0,...,0.0,0.0,0.0,0.0,78.0,0.0,5.0,2.0,v2,Istanbul (LTFM)
3,LTFM,2023,1,JAN,44948,Istanbul,Türkiye,646,21.0,0.0,...,0.0,0.0,0.0,0.0,21.0,0.0,3.0,0.0,v2,Istanbul (LTFM)
4,LTFM,2023,1,JAN,44954,Istanbul,Türkiye,641,351.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,30.0,6.0,v2,Istanbul (LTFM)
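
The `FLT_DATE` values in the output (for example 44928) look like Excel serial day numbers and not like dates. A minimal sketch, assuming that is indeed the encoding, converts them with `pd.to_datetime` and derives a weekend flag; the column names `FLT_DATE_DT` and `IS_WEEKEND` are made up for the example.

```python
import pandas as pd

# Three FLT_DATE values taken from the output above.
sample = pd.DataFrame({"FLT_DATE": [44928, 44938, 44954]})

# Excel counts days from 1899-12-30 (for dates after February 1900).
sample["FLT_DATE_DT"] = pd.to_datetime(
    sample["FLT_DATE"], unit="D", origin="1899-12-30"
)

# dayofweek: Monday = 0 ... Sunday = 6
sample["IS_WEEKEND"] = sample["FLT_DATE_DT"].dt.dayofweek >= 5

print(sample)
```

If the column is already parsed as a datetime when the file is read, the conversion step can be skipped and only the weekend flag is needed.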


Total delay for every airport 

In [6]:
ATFM_cols = df_AAD.columns[7:27].tolist() #Columns needed

df_AAD_total_delays_per_airport = (
    df_AAD_top20[df_AAD_top20["APT_ICAO"].isin(airports_code_list)].groupby(["APT_ICAO", "APT_NAME", "STATE_NAME"])[ATFM_cols]
    .sum()
    .reset_index()
)

order = df_top20["APT_ICAO"].tolist()
df_AAD_total_delays_per_airport = df_AAD_total_delays_per_airport.set_index("APT_ICAO").loc[order].reset_index() #Do we care for the order?

df_AAD_total_delays_per_airport

Unnamed: 0,APT_ICAO,APT_NAME,STATE_NAME,FLT_ARR_1,DLY_APT_ARR_1,DLY_APT_ARR_A_1,DLY_APT_ARR_C_1,DLY_APT_ARR_D_1,DLY_APT_ARR_E_1,DLY_APT_ARR_G_1,...,DLY_APT_ARR_O_1,DLY_APT_ARR_P_1,DLY_APT_ARR_R_1,DLY_APT_ARR_S_1,DLY_APT_ARR_T_1,DLY_APT_ARR_V_1,DLY_APT_ARR_W_1,DLY_APT_ARR_NA_1,FLT_ARR_1_DLY,FLT_ARR_1_DLY_15
0,LTFM,Istanbul,Türkiye,75507,230592.0,0.0,311.0,0.0,0.0,19684.0,...,0.0,2706.0,0.0,0.0,1415.0,0.0,206476.0,0.0,6800.0,4442.0
1,EHAM,Amsterdam - Schiphol,Netherlands,433799,1540691.0,0.0,12665.0,0.0,6204.0,650129.0,...,7292.0,18581.0,0.0,591.0,4280.0,0.0,840949.0,0.0,93661.0,29292.0
2,EGLL,London - Heathrow,United Kingdom,336787,1230954.0,5624.0,96191.0,0.0,4202.0,35423.0,...,6433.0,4818.0,0.0,14679.0,741.0,0.0,1062843.0,0.0,65314.0,28145.0
3,LFPG,Paris-Charles-de-Gaulle,France,135569,119040.0,24.0,763.0,0.0,241.0,1152.0,...,265.0,10259.0,0.0,9306.0,3704.0,0.0,85256.0,0.0,6960.0,2882.0
4,EDDF,Frankfurt,Germany,186063,563486.0,493.0,10253.0,0.0,9417.0,177461.0,...,14543.0,1445.0,0.0,7450.0,18717.0,0.0,323690.0,0.0,27388.0,13731.0
5,LEMD,Madrid - Barajas,Spain,294867,278691.0,1530.0,95450.0,0.0,0.0,47110.0,...,3723.0,20656.0,0.0,0.0,1858.0,1032.0,107332.0,0.0,18858.0,5932.0
6,LEBL,Barcelona,Spain,136472,384152.0,1079.0,5123.0,0.0,960.0,16489.0,...,0.0,523.0,0.0,0.0,3889.0,31802.0,324287.0,0.0,15815.0,7609.0
7,EDDM,Munich,Germany,70404,108797.0,25.0,2193.0,0.0,950.0,5265.0,...,0.0,1794.0,0.0,464.0,263.0,0.0,97843.0,0.0,5032.0,2348.0
8,LIRF,Rome - Fiumicino,Italy,24871,39145.0,964.0,0.0,0.0,0.0,8431.0,...,4829.0,773.0,0.0,0.0,0.0,0.0,22789.0,0.0,2222.0,959.0
9,EGKK,London - Gatwick,United Kingdom,191655,1032548.0,0.0,32521.0,0.0,239.0,363605.0,...,7075.0,0.0,0.0,205442.0,1594.0,0.0,422072.0,0.0,38960.0,22324.0
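
The `FLT_ARR_1` totals in this table are much lower than those in the traffic table (75,507 arrivals for LTFM here against 507,489 in `df_top20`), which suggests that `dropna()` discards most days. The sketch below contrasts dropping rows with filling the delay columns with zero. It assumes that a missing delay value means "no delay recorded", which should be checked against the dataset documentation before use; the toy values are made up.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the ATFM layout; values are made up.
raw = pd.DataFrame({
    "APT_ICAO": ["LTFM", "LTFM", "LTFM"],
    "FLT_ARR_1": [676, 607, 650],
    "DLY_APT_ARR_1": [1720.0, np.nan, np.nan],
    "DLY_APT_ARR_W_1": [1720.0, np.nan, np.nan],
})

# Current approach: any row with a NaN disappears, including its arrivals.
dropped = raw.dropna()

# Alternative (assumption: NaN means no delay): keep every day, set delays to 0.
delay_cols = [c for c in raw.columns if c.startswith("DLY_APT_ARR")]
filled = raw.copy()
filled[delay_cols] = filled[delay_cols].fillna(0)

print(dropped["FLT_ARR_1"].sum(), filled["FLT_ARR_1"].sum())
```

The choice matters for every per-flight or per-day average computed later, because dropping delay-free days inflates those averages.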


In [None]:
import pandas as pd
from pathlib import Path

#SETTINGS
FILE = Path("datasets/AA_ATFM_Delay.xlsx")
SHEET = "Blad1"
AIRPORTS_20 = [
    "LTFM","EHAM","EGLL","LFPG","EDDF","LEMD","LEBL","EDDM","LIRF","EGKK",
    "LSZH","LGAV","EIDW","LOWW","LEPA","EKCH","LPPT","LTFJ","LTAI","ENGM"
]
YEARS = [2023, 2024]   # use [2023] for only one year

#setting the letter into what it is
delay_mapping = {
    "A": "Accident/Incident",
    "C": "ATC Capacity",
    "D": "De-icing",
    "E": "Equipment (non-ATC)",
    "G": "Aerodrome Capacity",
    "I": "Industrial Action (ATC)",
    "M": "Airspace Management",
    "N": "Industrial Action (non-ATC)",
    "O": "Other",
    "P": "Special Event",
    "R": "ATC Routeing",
    "S": "ATC Staffing",
    "T": "Equipment (ATC)",
    "V": "Environmental Issues",
    "W": "Weather",
    "NA": "Not specified"
}

#LOAD & FILTER
df = pd.read_excel(FILE, sheet_name=SHEET)
df = df[df["YEAR"].isin(YEARS) & df["APT_ICAO"].isin(AIRPORTS_20)].copy()

#what columns
delay_cols = sorted([
    c for c in df.columns
    if c.startswith("DLY_APT_ARR_") and c.endswith("_1") and c != "DLY_APT_ARR_1"
])

#FIND MOST-OCCURRING TYPE PER (airport, year, month): count the days on which each delay type occurred, then pick the most frequent one
occ = (df[delay_cols].fillna(0) > 0)
group_keys = ["APT_ICAO", "YEAR", "MONTH_NUM"]
counts = occ.groupby([df[k] for k in group_keys]).sum()

best_col = counts.idxmax(axis=1)
best_code = (best_col.str.replace("DLY_APT_ARR_", "", regex=False)
                      .str.replace("_1", "", regex=False))

#table making
long_df = (
    best_code.reset_index(name="delay_code")
    .assign(
        Month=lambda d: d["YEAR"].astype(str) + "-" + d["MONTH_NUM"].astype(str).str.zfill(2),
        delay_type=lambda d: d["delay_code"].map(delay_mapping)
    )
    .loc[:, ["Month", "APT_ICAO", "delay_type"]]
    .sort_values(["APT_ICAO", "Month"])
    .set_index("Month")
)

#printing, but maybe remove it before committing?
display(long_df)
print("Rows:", len(long_df))





Unnamed: 0_level_0,APT_ICAO,delay_type
Month,Unnamed: 1_level_1,Unnamed: 2_level_1
2023-01,EDDF,Equipment (non-ATC)
2023-02,EDDF,Weather
2023-03,EDDF,Weather
2023-04,EDDF,Weather
2023-05,EDDF,Aerodrome Capacity
...,...,...
2024-08,LTFM,Weather
2024-09,LTFM,Weather
2024-10,LTFM,Weather
2024-11,LTFM,Weather


Rows: 480
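
One caveat with `idxmax`: when all counts in a row are equal, it returns the first column, so a month without any recorded delay would be labelled with the first code in `delay_cols`. The sketch below, on a toy `counts` table, masks such months so that they show up as missing; this is a possible refinement and not part of the cell above.

```python
import pandas as pd

# Toy occurrence counts per month for two delay types.
counts = pd.DataFrame(
    {"DLY_APT_ARR_A_1": [0, 0], "DLY_APT_ARR_W_1": [5, 0]},
    index=["2023-01", "2023-02"],
)

best_col = counts.idxmax(axis=1)

# Keep the label only where at least one delay occurred; otherwise leave it as NaN.
best_col = best_col.where(counts.sum(axis=1) > 0)

print(best_col)
```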


In [None]:
# --- SETTINGS ---
FILE = Path("datasets/AA_ATFM_Delay.xlsx")  # change if needed
SHEET = "Blad1"
AIRPORTS_20 = [
    "LTFM","EHAM","EGLL","LFPG","EDDF","LEMD","LEBL","EDDM","LIRF","EGKK",
    "LSZH","LGAV","EIDW","LOWW","LEPA","EKCH","LPPT","LTFJ","LTAI","ENGM"
]
YEARS = [2023, 2024]   # use [2023] if you only want one year

#setting the letter into what it is
delay_mapping = {
    "A": "Accident/Incident",
    "C": "ATC Capacity",
    "D": "De-icing",
    "E": "Equipment (non-ATC)",
    "G": "Aerodrome Capacity",
    "I": "Industrial Action (ATC)",
    "M": "Airspace Management",
    "N": "Industrial Action (non-ATC)",
    "O": "Other",
    "P": "Special Event",
    "R": "ATC Routeing",
    "S": "ATC Staffing",
    "T": "Equipment (ATC)",
    "V": "Environmental Issues",
    "W": "Weather",
    "NA": "Not specified",
}

#same as above
df = pd.read_excel(FILE, sheet_name=SHEET)
df = df[df["YEAR"].isin(YEARS) & df["APT_ICAO"].isin(AIRPORTS_20)].copy()

#same as above
delay_cols = sorted([
    c for c in df.columns
    if c.startswith("DLY_APT_ARR_") and c.endswith("_1") and c != "DLY_APT_ARR_1"
])

# Any day with value > 0 counts as an occurrence
occ = (df[delay_cols].fillna(0) > 0)

# Sum occurrences across the whole period, per airport
counts = occ.groupby(df["APT_ICAO"]).sum()

def top_k_codes(row, k=3):
    cols = row.nlargest(k).index.tolist()
    return [c.replace("DLY_APT_ARR_", "").replace("_1", "") for c in cols]

top = counts.apply(top_k_codes, axis=1, result_type="expand")
top.columns = ["Most occurring", "Second most", "Third most"]

# Map letters -> descriptions
top = top.map(lambda x: delay_mapping.get(x, x))  # DataFrame.map replaces the deprecated applymap (pandas >= 2.1)

# order airports as requested
top = top.reindex(AIRPORTS_20)

# show in VS Code / Jupyter
display(top)




Unnamed: 0_level_0,Most occurring,Second most,Third most
APT_ICAO,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
LTFM,Weather,Aerodrome Capacity,Equipment (ATC)
EHAM,Aerodrome Capacity,Weather,ATC Capacity
EGLL,Weather,ATC Capacity,Aerodrome Capacity
LFPG,Weather,Industrial Action (ATC),ATC Staffing
EDDF,Weather,ATC Capacity,Aerodrome Capacity
LEMD,ATC Capacity,Weather,Aerodrome Capacity
LEBL,Environmental Issues,Weather,Aerodrome Capacity
EDDM,Weather,ATC Capacity,Aerodrome Capacity
LIRF,Weather,Aerodrome Capacity,Other
EGKK,Aerodrome Capacity,Weather,ATC Staffing


Mean delay for every airport

In [None]:
ATFM_cols = df_AAD.columns[7:27].tolist() #Columns needed

df_AAD_mean_delays_per_airport = (
    df_AAD_top20[df_AAD_top20['APT_ICAO'].isin(airports_code_list)]
    .groupby(['APT_ICAO', 'APT_NAME', 'STATE_NAME'])[ATFM_cols]
    .mean()
    .reset_index()
)

order = df_top20['APT_ICAO'].tolist()
df_AAD_mean_delays_per_airport = (
    df_AAD_mean_delays_per_airport
    .set_index('APT_ICAO')
    .loc[order]
    .reset_index()
)

df_AAD_mean_delays_per_airport

## Taxi In & Out

In [7]:
df_taxi_time_in = (
    df_TIn[(df_TIn["APT_ICAO"].isin(airports_code_list)) & 
           ((df_TIn["YEAR"] == 2023) | (df_TIn["YEAR"] == 2024))]
    .groupby("APT_ICAO")[["VALID_FL", "TOTAL_REF_NB_FL", "TOTAL_REF_TIME_MIN", "TOTAL_ADD_TIME_MIN"]]
    .sum()
    .reset_index()
)
# In order: keep only the top 20 airports, keep only 2023-24 values, group by ICAO code and sum the
# listed columns, then reset the index so the code becomes a regular column (useful for merging)

# Rename the columns to mark them as taxi-in values
in_cols_to_rename = {
    "VALID_FL": "VALID_FL_IN",
    "TOTAL_REF_NB_FL": "TOTAL_REF_NB_FL_IN",
    "TOTAL_REF_TIME_MIN": "TOT_REF_TIME_MIN_IN",
    "TOTAL_ADD_TIME_MIN": "TOT_ADD_TIME_MIN_IN"
}
df_taxi_time_in = df_taxi_time_in.rename(columns=in_cols_to_rename)

# Merging taxi with airport codes
df_taxi_time_final = pd.merge(df_airport_codes, df_taxi_time_in, on='APT_ICAO', how='left')


In [9]:
df_taxi_time_out = (
    df_TOut[(df_TOut["APT_ICAO"].isin(airports_code_list)) & 
           ((df_TOut["YEAR"] == 2023) | (df_TOut["YEAR"] == 2024))]
    .groupby("APT_ICAO")[["VALID_FL", "TOTAL_REF_NB_FL", "TOTAL_REF_TIME_MIN", "TOTAL_ADD_TIME_MIN"]]
    .sum()
    .reset_index()
)
#Same operations as df_taxi_time_in

# Rename the columns to mark them as taxi-out values
out_cols_to_rename = {
    "VALID_FL": "VALID_FL_OUT",
    "TOTAL_REF_NB_FL": "TOTAL_REF_NB_FL_OUT",
    "TOTAL_REF_TIME_MIN": "TOT_REF_TIME_MIN_OUT",
    "TOTAL_ADD_TIME_MIN": "TOT_ADD_TIME_MIN_OUT"
}
df_taxi_time_out = df_taxi_time_out.rename(columns=out_cols_to_rename)

# Merging taxi in with taxi out
df_taxi_time = pd.merge(df_taxi_time_final, df_taxi_time_out, on='APT_ICAO', how='left')

df_taxi_time.head()

  APT_ICAO  VALID_FL_IN  TOTAL_REF_NB_FL_IN  TOT_REF_TIME_MIN_IN  \
0     LTFM     510112.0            466200.0         5.139009e+06   
1     EHAM     473916.0            425232.0         2.938714e+06   
2     EGLL     464878.0            456854.0         2.472468e+06   
3     LFPG     457627.0            421245.0         3.141329e+06   
4     EDDF     435576.0            403923.0         2.897641e+06   

   TOT_ADD_TIME_MIN_IN  VALID_FL_OUT  TOTAL_REF_NB_FL_OUT  \
0         1.569870e+06      506963.0             472337.0   
1         9.292076e+05      472543.0             429268.0   
2         1.703258e+06      465112.0             458402.0   
3         9.167525e+05      454437.0             395185.0   
4         1.077685e+06      432316.0             411475.0   

   TOT_REF_TIME_MIN_OUT  TOT_ADD_TIME_MIN_OUT  
0          5.598360e+06          2.405860e+06  
1          3.937064e+06          1.418437e+06  
2          7.091690e+06          3.199987e+06  
3          4.481997e+06         
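
The totals above become comparable across airports once they are divided by the number of flights. A minimal sketch for taxi-in, using the LTFM and EHAM values from the output; the choice of `TOTAL_REF_NB_FL_IN` as the denominator and the column name `ADD_TIME_PER_FL_IN` are assumptions for the example.

```python
import pandas as pd

# LTFM and EHAM taxi-in totals copied from the output above.
taxi = pd.DataFrame({
    "APT_ICAO": ["LTFM", "EHAM"],
    "TOTAL_REF_NB_FL_IN": [466200.0, 425232.0],
    "TOT_ADD_TIME_MIN_IN": [1.569870e6, 9.292076e5],
})

# Average additional taxi-in time per flight, in minutes (hypothetical column name).
taxi["ADD_TIME_PER_FL_IN"] = (
    taxi["TOT_ADD_TIME_MIN_IN"] / taxi["TOTAL_REF_NB_FL_IN"]
)

print(taxi.round(2))
```

The same division applies to the `_OUT` columns for taxi-out.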

## Airport Punctuality

In [10]:
airport_map = {
    "LTFM": "Istanbul",
    "EHAM": "Amsterdam",
    "EGLL": "London Heathrow",
    "LFPG": "Paris Charles de Gaulle",
    "EDDF": "Frankfurt",
    "LEMD": "Madrid Barajas",
    "LEBL": "Barcelona",
    "EDDM": "Munich",
    "LIRF": "Rome Fiumicino",
    "EGKK": "London Gatwick",
    "LSZH": "Zurich",
    "LGAV": "Athens",
    "EIDW": "Dublin",
    "LOWW": "Vienna",
    "LEPA": "Palma de Mallorca",
    "EKCH": "Copenhagen",
    "LPPT": "Lisbon",
    "LTFJ": "Istanbul Sabiha Gokcen",
    "LTAI": "Antalya",
    "ENGM": "Oslo"
}

# Airport names (as spelled in the punctuality dataset) for the top 20 airports
top20_names = [airport_map[a] for a in df_top20['APT_ICAO'] if a in airport_map]
df_AP_top20 = df_AP[df_AP['Airport'].isin(top20_names)].copy()  # copy avoids SettingWithCopyWarning on insert

# Reverse lookup (name -> ICAO code) to add the merge key
name_to_icao = {name: icao for icao, name in airport_map.items()}
icao_codes = df_AP_top20["Airport"].map(name_to_icao)
df_AP_top20.insert(loc=2, column="APT_ICAO", value=icao_codes)

df_AP_top20

Unnamed: 0,Date,Airport,APT_ICAO,Departure Punctuality %,Arrival Punctuality %,Avg Departure Schedule Delay,Avg Arrival Schedule Delay,Avg Departure - Arrival Schedule Delay,Operated Schedules %
1367,2022-01-01,Frankfurt,EDDF,0.818898,0.804217,13.043920,13.426908,-0.382988,0.922170
1368,2022-01-02,Frankfurt,EDDF,0.707216,0.713389,17.652062,18.506590,-0.854528,0.946463
1369,2022-01-03,Frankfurt,EDDF,0.738938,0.755991,16.235140,13.956209,2.278931,0.952183
1370,2022-01-04,Frankfurt,EDDF,0.630542,0.725962,16.605624,14.621955,1.983669,0.953811
1371,2022-01-05,Frankfurt,EDDF,0.782816,0.821951,10.382379,9.699187,0.683192,0.946785
...,...,...,...,...,...,...,...,...,...
43735,2025-09-24,Istanbul,LTFM,0.840440,0.831309,8.972031,11.150967,-2.178936,0.922894
43736,2025-09-25,Istanbul,LTFM,0.794157,0.843750,11.858544,8.783967,3.074576,0.930314
43737,2025-09-26,Istanbul,LTFM,0.713147,0.770039,15.778220,13.545795,2.232425,0.938824
43738,2025-09-27,Istanbul,LTFM,0.748021,0.782377,17.316623,13.748999,3.567624,0.948931


## Primary Departure Causes
The dataset contains the sum of total delay per main category (Airline, Airport, En Route, Miscellaneous, Government, Weather) per airport.


In [None]:
df_PDD_TOP20 = (
    df_PDD[df_PDD["APT_ICAO"].isin(airports_code_list)]
)

df_PDD_summary = (
    df_PDD_TOP20
    .groupby(["Year","APT_ICAO"], as_index=False)
    .agg({
        "Airline": "sum",
        "En_Route": "sum",
        "Miscellaneous": "sum",
        "Government": "sum",
        "Weather": "sum",
        "Total": "sum" # SUM IS NOT EQUAL TO ACTUAL SUM AFTER DATA TRUNCATION --> needs investigation; maybe calculate sum in python
    })
    .sort_values(by="Total", ascending=False)   
    .reset_index(drop=True)
)

df_PDD_summary.head()

## Airport IATA delays
Year lobt = Year of the Last Off-Block Time (LOBT) \
TD = Total Delay (minutes) \
TF = Total Flights affected \
Total Flights Period = total number of flights at that airport in the period \
ADM = Average Delay per Movement = TD / Total Flights Period (minutes per flight) \
Total Delay Period = total delay minutes (all causes) for the period \
PD = Proportion of Delay = (TD / Total Delay Period) * 100 \
Delay rate = TF / Total Flights Period
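
The definitions above can be checked with a few lines of pandas. The sketch uses made-up numbers, and the column name `Total_Delay_Period` and the `*_check` names are assumptions for the example; recomputing the metrics this way is a quick sanity check against the `adm` and `pd` columns shipped with the dataset.

```python
import pandas as pd

# Made-up rows: two delay causes at one airport in one month.
rows = pd.DataFrame({
    "TD": [1200.0, 300.0],                   # delay minutes for this cause
    "TF": [80, 20],                          # flights affected by this cause
    "Total_Flights_Period": [1000, 1000],    # all flights in the period
    "Total_Delay_Period": [6000.0, 6000.0],  # all delay minutes in the period
})

rows["adm_check"] = rows["TD"] / rows["Total_Flights_Period"]      # minutes per flight
rows["pd_check"] = rows["TD"] / rows["Total_Delay_Period"] * 100   # percent of all delay
rows["delay_rate"] = rows["TF"] / rows["Total_Flights_Period"]     # share of flights affected

print(rows)
```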


In [None]:
df_AID.head()

In [None]:
# Filter the delay dataset for only the top 20 airports
df_AID_TOP20 = df_AID[df_AID["APT_ICAO"].isin(airports_code_list)].copy()

# Calculate delay ratio as percentage
df_AID_TOP20["Delay_Ratio"] = (df_AID_TOP20["TF"] / df_AID_TOP20["Total_Flights_Period"]) * 100

# Provide key insights per year per month per airport
df_AID_TOP20_YMA = (
    df_AID_TOP20
    .groupby(["Year_Lobt","Month_Lobt","APT_ICAO"], as_index=False)
    .agg({
        "TD": "sum",
        "TF": "sum",
        "Total_Flights_Period": "sum",
        "adm": "mean",
        "pd": "mean",
        "Delay_Ratio": "mean" 
    })
    .sort_values(by="TF", ascending=False)
    .rename(columns={
        "TD": "Total Delay (TD)",
        "TF": "Total Flights (TF)",
        "Total_Flights_Period": "Total_Flights_Period",
        "adm": "Avg Delay per Movement (in min)",
        "pd": "Avg Proportion of Delay (%)",
        "Delay_Ratio": "Avg Delay Ratio (%)"  # key must match the column created above
    })
    .reset_index(drop=True)
)

display(df_AID_TOP20_YMA)

In [12]:
import plotly.express as px
import ipywidgets as widgets
from IPython.display import display

In [None]:
df_AID_TOP20_YMA = df_AID_TOP20_YMA.rename(columns={
    "Delay_Ratio": "Avg Delay Ratio (%)"
})

# month order
month_order = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December"
]

# Aggregate dataframe for line chart
df_avg = (
    df_AID_TOP20_YMA
    .groupby(["Year_Lobt", "Month_Lobt", "APT_ICAO"], as_index=False)[["Avg Delay Ratio (%)"]]
    .mean()
)

# Month lobt follows chronological order
df_avg["Month_Lobt"] = pd.Categorical(df_avg["Month_Lobt"], categories=month_order, ordered=True)

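# Average across all airports (the "ALL" series); observed=False keeps the current
# behaviour for the categorical month column and avoids the pandas FutureWarning
df_all_avg = (
    df_avg.groupby(["Year_Lobt", "Month_Lobt"], as_index=False, observed=False)[["Avg Delay Ratio (%)"]]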
    .mean()
)
df_all_avg["APT_ICAO"] = "ALL"

# Combine & sort
df_plot = pd.concat([df_avg, df_all_avg], ignore_index=True)
df_plot = df_plot.sort_values(["APT_ICAO", "Year_Lobt", "Month_Lobt"])

# Plot function with airport filter
def plot_delay_ratio(selected_airport="ALL"):
    if selected_airport == "ALL":
        dff = df_plot[df_plot["APT_ICAO"] == "ALL"]
        title = "Average Delay Ratio by Month — All Airports"
    else:
        dff = df_plot[df_plot["APT_ICAO"] == selected_airport]
        title = f"Average Delay Ratio by Month — {selected_airport}"
    
    # Plot one line per year — break lines for missing months
    fig = px.line(
        dff,
        x="Month_Lobt",
        y="Avg Delay Ratio (%)",
        color="Year_Lobt",
        markers=True,
        line_shape="linear", 
        title=title,
        labels={
            "Month_Lobt": "Month",
            "Avg Delay Ratio (%)": "Average Delay Ratio (%)",
            "Year_Lobt": "Year"
        }
    )


    # Clean layout
    fig.update_layout(
        template="plotly_white",
        xaxis=dict(categoryorder="array", categoryarray=month_order),
        yaxis_title="Average Delay Ratio (%)",
        hovermode="x unified",
        xaxis_tickangle=-45,
        legend_title="Year"
    )

    fig.show()

# Dropdown for airport selection
airport_list = ["ALL"] + sorted(df_AID_TOP20_YMA["APT_ICAO"].unique())

dropdown = widgets.Dropdown(
    options=airport_list,
    value="ALL",
    description="Airport:",
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='40%')
)

widgets.interact(plot_delay_ratio, selected_airport=dropdown)


## Flight Punctuality Distribution

In [14]:
# filter the delay dataset for only the top 20 airports
df_FPD = (
    df_FPD[df_FPD["APT_ICAO"].isin(airports_code_list)]
)
df_FPD.head()

Unnamed: 0,Year,Month,Flight_direction,APT_ICAO,Delayed_Flights(>= 5 Min),Ontime_Flights(0 to 4 Min),Early_Flights_(Before STD),Nb_Flight_Tot
0,2023,August,Departure,LTFM,12097,3488,1877,17462
1,2024,August,Departure,LTFM,10805,3225,3286,17316
2,2023,July,Departure,LTFM,11466,3553,2023,17042
3,2024,August,Departure,EHAM,11458,3339,2167,16964
4,2024,June,Departure,LTFM,8685,3793,4307,16785
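
Because the three flight counts add up to `Nb_Flight_Tot`, they can be turned into percentage shares that are comparable across airports and months. A minimal sketch using two rows from the output above; the name `shares` is made up for the example.

```python
import pandas as pd

# Rows 0 and 3 from the output above.
fpd = pd.DataFrame({
    "APT_ICAO": ["LTFM", "EHAM"],
    "Delayed_Flights(>= 5 Min)": [12097, 11458],
    "Ontime_Flights(0 to 4 Min)": [3488, 3339],
    "Early_Flights_(Before STD)": [1877, 2167],
    "Nb_Flight_Tot": [17462, 16964],
})

count_cols = [
    "Delayed_Flights(>= 5 Min)",
    "Ontime_Flights(0 to 4 Min)",
    "Early_Flights_(Before STD)",
]

# Divide each count by the row total to get percentage shares.
shares = fpd[count_cols].div(fpd["Nb_Flight_Tot"], axis=0) * 100

print(shares.round(1))
```

A row whose shares do not sum to 100 would point to a data issue worth checking before modelling.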
