# Project Group 11

Members: Alessandro Casati (6544649), Bas van den Muijsenberg (5797578), Hylke Bleeker (5589355), Jan-Pieter Vermeer (6340261), Mike Geerts (6276210)


# Research Objective

## Analysis and delay prediction of flights across top 20 airports in Europe in 2023-24

**Research Question**  
What factors drive airport delays across the top 20 European airports in 2023-24, and can we build predictive models that identify likely delays for a flight from the information available?

**Sub Questions**
- What type of delays are occurring most at the airports?
- Which airports have the highest, the lowest, and the most consistent delay rates?
- How do delays relate to traffic volume and efficiency?
- How do delays distribute geographically across Europe, and which airports stand out as persistent “delay hotspots”?
- Which features (traffic levels, weekday/weekend, month, etc.) are the strongest predictors of high-delay days?
- Which model can predict both the chance of a delay and the amount of delay for a flight?

# Contribution Statement

*Be specific. Some of the tasks can be coding (expect everyone to do this), background research, conceptualisation, visualisation, data analysis, data modelling*

**Author 1**:

**Author 2**:

**Author 3**:

# Data Used

## Datasets
| Dataset | Dataset description (function) | Columns needed |
|---|---|---|
| Airport Traffic | Airport capacity of inbound and outbound flights for network manager and airport operator | All |
| Airport ATFM_Delay | The Airport Arrival ATFM Delay provides an indication of ATFM delays on the ground due to constraints at airports | FLT_DATE, APT_ICAO (to merge DB) + all the remaining types of delays |
| Airport Punctuality | Daily performance tracker of airport punctuality, delays, and operational efficiency | All + add APT_ICAO for merging |
| All_Pre_Departure_Delays | Daily overview of total pre-departure delay in minutes per airport that includes all delay causes | All |
| ATC_Pre_Departure_Delay | Proxy for ATC induced delays at the departure stand (IATA delay code 89) | Pending (depends on data analysis findings) |
| Additional taxi-out time | Taxi time from gate to runway and associated delay | All |
| Additional taxi-in time | Taxi time from runway to gate and associated delay | All |

## Information extracted from datasets
| Dataset | Key Information |
|---|---|
| Airport Traffic | IFR arrivals, departures for Airport Operator and Network Manager |
| Airport ATFM_Delay | Disruption type and count, number of delayed arrival flights |
| Airport Punctuality | Arrival & departure punctuality %, avg delay (dep & arr), operated schedules % |
| Pre_Departure_Delay | AO departures, AO pre-departure delay (minutes) |
| Additional taxi-out time | Number of flights with available data + total taxi-out time |
| Additional taxi-in time | Number of flights with available data + total taxi-in time |

# Data Pipeline

## Data Exploration & Cleaning
- Import & clean the datasets: remove missing/invalid values, select only useful columns and merge.
- Create basic delay metrics:
  - Delay categories (On time, Delayed, Delayed15)
  - Day category (Weekend, Weekdays)

## Descriptive Data Analysis
- Time patterns: heatmaps/line charts/stacked charts (delays by day, month, cause).
- Geospatial view: interactive maps color-coded by average airport delay.
- Efficiency scatterplots: traffic volume vs. % delay
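
For the time-pattern heatmaps, one possible starting point is a month-by-weekday table of mean delay. The sketch below uses made-up records and an assumed column name `DELAY_MIN`; the resulting table could then be handed to a plotting library of choice.

```python
import pandas as pd

# Hypothetical daily delay records for one airport; DELAY_MIN is an assumed name.
df = pd.DataFrame({
    "FLT_DATE": pd.to_datetime(["2023-01-02", "2023-01-09", "2023-01-07", "2023-02-06"]),
    "DELAY_MIN": [10.0, 20.0, 5.0, 8.0],
})
df["MONTH"] = df["FLT_DATE"].dt.month
df["WEEKDAY"] = df["FLT_DATE"].dt.day_name()

# Rows are months, columns are weekdays, cells hold the mean delay.
# Combinations without data stay NaN.
heat = df.pivot_table(index="MONTH", columns="WEEKDAY", values="DELAY_MIN", aggfunc="mean")
```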

## Predictive Modeling
- Predict delay occurrence for a flight using the departure airport and its performance, average airport delay, weekday/weekend & month.
- Models: Logistic Regression and/or Random Forest.
- Testing with 2025 data (unseen during training).
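
The modeling plan above might look like the sketch below, assuming scikit-learn is available. All data is synthetic and every column name (`TRAFFIC`, `IS_WEEKEND`, `MONTH`, `HIGH_DELAY`) is hypothetical; the point is only to show training on 2023-24 and scoring on a held-out 2025 set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def make_days(n, year):
    """Generate synthetic airport-days; all columns are hypothetical."""
    traffic = rng.integers(200, 1500, size=n)
    # Toy rule: busier days are more likely to be high-delay days.
    p = 1 / (1 + np.exp(-(traffic - 850) / 150))
    return pd.DataFrame({
        "YEAR": year,
        "TRAFFIC": traffic,
        "IS_WEEKEND": rng.integers(0, 2, size=n),
        "MONTH": rng.integers(1, 13, size=n),
        "HIGH_DELAY": (rng.random(n) < p).astype(int),
    })

# Train on 2023-24, test on 2025 (never seen during training).
train = pd.concat([make_days(300, 2023), make_days(300, 2024)], ignore_index=True)
test = make_days(200, 2025)
features = ["TRAFFIC", "IS_WEEKEND", "MONTH"]

models = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(train[features], train["HIGH_DELAY"])
    scores[name] = accuracy_score(test["HIGH_DELAY"], model.predict(test[features]))
```

Splitting by year rather than at random avoids leaking information from the test period into training.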

## Clustering & Patterns
- Cluster airports by daily delay patterns (e.g., “always congested,” “seasonal peaks,” “mostly smooth”).
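
One way to approach the clustering is k-means on per-airport delay summaries. The sketch below assumes scikit-learn is available; the airport codes, feature names and values are placeholders, and the choice of three clusters only mirrors the three example labels above.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical per-airport summaries of daily delay in minutes (made-up values).
profiles = pd.DataFrame(
    {
        "MEAN_DELAY": [25.0, 24.0, 6.0, 5.0, 12.0, 13.0],
        "STD_DELAY": [4.0, 5.0, 2.0, 2.5, 15.0, 14.0],
    },
    index=["AAAA", "BBBB", "CCCC", "DDDD", "EEEE", "FFFF"],
)

# Scale first so that neither feature dominates the distance computation.
X = StandardScaler().fit_transform(profiles)
profiles["CLUSTER"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```

The cluster numbers themselves carry no meaning; labels such as "always congested" would have to be assigned after inspecting each cluster's mean and spread.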

# Data Cleaning

In [2]:
import pandas as pd
import numpy as np


## TO-DO list, by Wed 8/10
| Dataset | Key Information | Assigned to |
|---|---|---|
| Airport Traffic | Imported + top20 defined: missing total number of departures and arrivals | Ale |
| Airport ATFM_Delay | Extract the top 20 airports by comparing with df_top20 and sum the delays over both years (2023-24) | Bas |
| Airport Punctuality | Filter for 2023-24, find the top 20, compare with df_top20 (some airports might be missing), keep all columns | Mike |
| Pre_Departure_Delay | TBD: is the data actually useful for our RQ? If yes, extract the top 20 + last 5 columns | JP |
| Additional taxi-out time | Extract top20 airports + TF, TOT_REF_NB_FL, TOT_REF&ADD_TIME_MIN, PIVOT_LABEL | Hylke |
| Additional taxi-in time | Extract top20 airports + TF, TOT_REF_NB_FL, TOT_REF&ADD_TIME_MIN, PIVOT_LABEL | Hylke |

Some comments: for now we focus on the combined totals for 2023-24. Note that some datasets do not share matching columns, so they cannot be merged directly; you might need to compare them manually or use other references (especially for the Punctuality dataset).
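
Before merging two datasets, a quick way to see which airports fail to match is a left merge with `indicator=True`. The frames and the `SOME_METRIC` column below are made up for illustration; only the ICAO codes are real.

```python
import pandas as pd

# Hypothetical frames keyed by ICAO code; the second one lacks one airport.
df_traffic = pd.DataFrame({"APT_ICAO": ["EHAM", "EGLL", "LFPG"], "FLT_TOT_1": [3, 2, 1]})
df_other = pd.DataFrame({"APT_ICAO": ["EHAM", "EGLL"], "SOME_METRIC": [0.8, 0.7]})

# indicator=True adds a "_merge" column saying where each row was found.
check = df_traffic.merge(df_other, on="APT_ICAO", how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "APT_ICAO"].tolist()
```

Any code listed in `unmatched` would need manual matching or another reference before the datasets can be combined.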

## Airport Traffic
Importing dataset

In [92]:
df_AT = pd.read_excel("datasets/AirportTraffic.xlsx")
df_TIn = pd.read_excel("datasets/Taxi-In_Additional_Time.xlsx")
df_TOut = pd.read_excel("datasets/Taxi-Out_Additional_Time.xlsx")

Find top 20 airports by total flights and create a new dataframe with key data

In [86]:
df_top20 = (
    df_AT.groupby("APT_ICAO")[["FLT_TOT_1", "FLT_DEP_1", "FLT_ARR_1"]] #Group by airport code
    .sum().sort_values(by="FLT_TOT_1",ascending=False) #Sum the values for each code of the 3 columns indicated
    .head(20).reset_index()) #Change "20" to change the number of airports analysed
# Adding airport's city name and state from original dataset
df_top20 = (df_top20.merge(df_AT[["APT_ICAO", "APT_NAME", "STATE_NAME"]]
                           .drop_duplicates(), on="APT_ICAO", how="left"))
df_top20.head(20)

## Airport ATFM_Delay
Importing dataset

In [71]:
df_AAD = pd.read_excel("datasets/AA_ATFM_Delay.xlsx")

Daily delay for each of the top 20 airports.
Drop rows with missing values, keep only the top 20 airports, and order them by airport traffic.

In [79]:
df_AAD = df_AAD.dropna()
df_AAD_top20 = df_AAD[df_AAD["APT_ICAO"].isin(df_top20["APT_ICAO"])]

#Lists of code and names of top20 airports
airports_name_list = df_top20["APT_NAME"].tolist()
airports_code_list = df_top20["APT_ICAO"].tolist()

#Dataframe of codes, useful for merging 
df_airport_codes = pd.DataFrame({'APT_ICAO': airports_code_list})

df_AAD_top20 = df_AAD_top20.set_index("APT_ICAO").loc[airports_code_list].reset_index()

#df_AAD_top20.head()

Total delay for every airport 

In [None]:
ATFM_cols = df_AAD.columns[7:27].tolist() #Columns needed

df_AAD_total_delays_per_airport = (
    df_AAD_top20[df_AAD_top20["APT_ICAO"].isin(airports_code_list)].groupby(["APT_ICAO", "APT_NAME", "STATE_NAME"])[ATFM_cols]
    .sum()
    .reset_index()
)

order = df_top20["APT_ICAO"].tolist()
df_AAD_total_delays_per_airport = df_AAD_total_delays_per_airport.set_index("APT_ICAO").loc[order].reset_index() #Do we care for the order?

df_AAD_total_delays_per_airport
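
With per-airport totals in hand, the most common delay type could be read off as sketched below. The `DLY_*` column names and all numbers are placeholders, not the actual ATFM delay columns.

```python
import pandas as pd

# Hypothetical per-airport totals; DLY_* names and values are placeholders.
totals = pd.DataFrame({
    "APT_ICAO": ["AAAA", "BBBB"],
    "DLY_WEATHER": [120, 30],
    "DLY_CAPACITY": [80, 200],
    "DLY_STAFFING": [10, 50],
})
cause_cols = [c for c in totals.columns if c.startswith("DLY_")]

# Ranking of causes over all airports combined.
overall = totals[cause_cols].sum().sort_values(ascending=False)

# Dominant cause for each airport: the column holding the row-wise maximum.
totals["TOP_CAUSE"] = totals[cause_cols].idxmax(axis=1)
```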

## Airport Punctuality

In [3]:
import os

import pandas as pd

print(os.listdir("datasets"))

# The punctuality file in datasets/ is named "Airports_Punctuality.xlsx"
df_AP = pd.read_excel("datasets/Airports_Punctuality.xlsx")
df_AT = pd.read_excel("datasets/AirportTraffic.xlsx")

# Normalise the column names: strip whitespace and use upper case
df_AP.columns = df_AP.columns.str.strip().str.upper()
df_AT.columns = df_AT.columns.str.strip().str.upper()

# Keep only the years 2023 and 2024
if 'YEAR' in df_AP.columns:
    df_AP = df_AP[df_AP['YEAR'].isin([2023, 2024])]
elif 'FLT_DATE' in df_AP.columns:
    df_AP['FLT_DATE'] = pd.to_datetime(df_AP['FLT_DATE'], errors='coerce')
    df_AP = df_AP[df_AP['FLT_DATE'].dt.year.isin([2023, 2024])]

# Find the top 20 airports by number of flights.
# Use the first column that looks like a total-flights column, if there is one.
flight_col_candidates = [c for c in df_AP.columns if 'FLT' in c and 'TOT' in c]
main_flight_col = flight_col_candidates[0] if flight_col_candidates else None

if main_flight_col:
    df_AP_top20 = (
        df_AP.groupby("APT_ICAO")[main_flight_col]
        .sum()
        .sort_values(ascending=False)
        .head(20)
        .reset_index()
    )
else:
    # Fallback: rank airports by the number of rows they have in the dataset
    df_AP_top20 = (
        df_AP.groupby("APT_ICAO")
        .size()
        .sort_values(ascending=False)
        .head(20)
        .reset_index(name='ENTRY_COUNT')
    )

# Add airport and country names when those columns are available
extra_cols = [col for col in ['APT_NAME', 'STATE_NAME'] if col in df_AP.columns]
if extra_cols:
    df_AP_top20 = df_AP_top20.merge(
        df_AP[['APT_ICAO'] + extra_cols].drop_duplicates(),
        on='APT_ICAO',
        how='left'
    )

# Compare with the top 20 from AirportTraffic, since some airports might be missing
df_top20_AT = (
    df_AT.groupby("APT_ICAO")[["FLT_TOT_1"]]
    .sum()
    .sort_values(by="FLT_TOT_1", ascending=False)
    .head(20)
    .reset_index()
)

# Sets of ICAO codes
codes_punctuality = set(df_AP_top20["APT_ICAO"].tolist())
codes_traffic = set(df_top20_AT["APT_ICAO"].tolist())

missing_in_punctuality = codes_traffic - codes_punctuality
missing_in_traffic = codes_punctuality - codes_traffic

print("Airports in Traffic top20 but missing in Punctuality data:")
print(missing_in_punctuality)
print("\nAirports in Punctuality top20 but not in Traffic top20:")
print(missing_in_traffic)

# Finally, keep all columns for the airports present in both top 20 lists
common_codes = list(codes_punctuality & codes_traffic)
df_AP_top20_final = df_AP[df_AP["APT_ICAO"].isin(common_codes)].copy()

print("\n✅ Airport Punctuality top20 dataset (2023–24) ready:")
print(df_AP_top20_final.head())

# Optional: save (create the output folder first if it does not exist)
os.makedirs("data/processed", exist_ok=True)
df_AP_top20_final.to_csv("data/processed/AirportPunctuality_Top20_2023_2024.csv", index=False)


['.gitkeep', 'AA_ATFM_Delay.xlsx', 'Airports_Punctuality.xlsx', 'AirportTraffic.xlsx', 'ATC_Pre_Departure_Delay.xlsx', 'Taxi-In_Additional_Time.xlsx', 'Taxi-Out_Additional_Time.xlsx']

## Taxi In & Out

In [93]:
df_taxi_time_in = (
    df_TIn[(df_TIn["APT_ICAO"].isin(airports_code_list)) & 
           ((df_TIn["YEAR"] == 2023) | (df_TIn["YEAR"] == 2024))]
    .groupby("APT_ICAO")[["VALID_FL", "TOTAL_REF_NB_FL", "TOTAL_REF_TIME_MIN", "TOTAL_ADD_TIME_MIN"]]
    .sum()
    .reset_index()
)
#In order: keep only the top 20 airports, keep only rows from 2023-24, group by airport code and sum the
#listed columns, then reset the index so the codes become a regular column (useful for merging)

# Rename the columns to mark them as taxi-in values
in_cols_to_rename = {
    "VALID_FL": "VALID_FL_IN",
    "TOTAL_REF_NB_FL": "TOTAL_REF_NB_FL_IN",
    "TOTAL_REF_TIME_MIN": "TOT_REF_TIME_MIN_IN",
    "TOTAL_ADD_TIME_MIN": "TOT_ADD_TIME_MIN_IN"
}
df_taxi_time_in = df_taxi_time_in.rename(columns=in_cols_to_rename)

# Merging taxi with airport codes
df_taxi_time_final = pd.merge(df_airport_codes, df_taxi_time_in, on='APT_ICAO', how='left')


In [None]:
df_taxi_time_out = (
    df_TOut[(df_TOut["APT_ICAO"].isin(airports_code_list)) & 
           ((df_TOut["YEAR"] == 2023) | (df_TOut["YEAR"] == 2024))]
    .groupby("APT_ICAO")[["VALID_FL", "TOTAL_REF_NB_FL", "TOTAL_REF_TIME_MIN", "TOTAL_ADD_TIME_MIN"]]
    .sum()
    .reset_index()
)
#Same operations as df_taxi_time_in

# Rename the columns to mark them as taxi-out values
out_cols_to_rename = {
    "VALID_FL": "VALID_FL_OUT",
    "TOTAL_REF_NB_FL": "TOTAL_REF_NB_FL_OUT",
    "TOTAL_REF_TIME_MIN": "TOT_REF_TIME_MIN_OUT",
    "TOTAL_ADD_TIME_MIN": "TOT_ADD_TIME_MIN_OUT"
}
df_taxi_time_out = df_taxi_time_out.rename(columns=out_cols_to_rename)

# Merging taxi in with taxi out
df_taxi_time = pd.merge(df_taxi_time_final, df_taxi_time_out, on='APT_ICAO', how='left')

print(df_taxi_time.head())
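
From these totals, an average additional taxi time per flight could be derived as sketched below. It assumes that dividing total additional minutes by the number of reference flights is the intended metric; the airport codes and values are made up, while the column names follow the renaming above.

```python
import pandas as pd

# Hypothetical totals using the renamed taxi-in columns; values are made up.
df_taxi = pd.DataFrame({
    "APT_ICAO": ["AAAA", "BBBB"],
    "TOTAL_REF_NB_FL_IN": [1000, 0],
    "TOT_ADD_TIME_MIN_IN": [2500.0, 0.0],
})

# Replace zero flight counts with NaN so the division never divides by zero.
flights = df_taxi["TOTAL_REF_NB_FL_IN"].where(df_taxi["TOTAL_REF_NB_FL_IN"] > 0)
df_taxi["AVG_ADD_TIME_IN"] = df_taxi["TOT_ADD_TIME_MIN_IN"] / flights
```

The same calculation would apply to the `_OUT` columns, which makes taxi delays comparable across airports of different sizes.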