# Project Group 11

Members: Alessandro Casati (6544649), Bas van den Muijsenberg (5797578), Hylke Bleeker (5589355), Jan-Pieter Vermeer (6340261), Mike Geerts (6276210)


# Research Objective

## Analysis and delay prediction of flights across top 20 airports in Europe in 2023-24

**Research Question**  
What factors drive airport delays across the top 20 European airports in 2023-24, and can we build predictive models that identify likely delays for a flight using the information available?

**Sub Questions**
- Which types of delay occur most often at the airports?
- Which airports experience the highest, the lowest, and the most consistent delay rates?
- How do delays relate to traffic volume and efficiency?
- How do delays distribute geographically across Europe, and which airports stand out as persistent “delay hotspots”?
- Which features (traffic levels, weekday/weekend, month, etc.) are the strongest predictors of high-delay days?
- Which model can predict the chance of a delay and the amount of delay for a flight?

# Contribution Statement

*Be specific. Some of the tasks can be coding (expect everyone to do this), background research, conceptualisation, visualisation, data analysis, data modelling*

**Author 1**:

**Author 2**:

**Author 3**:

# Data Used

## Datasets
| Dataset | Dataset description (function) | Columns needed |
|---|---|---|
| Airport Traffic | Airport capacity of inbound and outbound flights for network manager and airport operator | All |
| Airport ATFM_Delay | The Airport Arrival ATFM Delay provides an indication of ATFM delays on the ground due to constraints at airports | FLT_DATE, APT_ICAO (to merge DB) + all the remaining types of delays |
| Airport Punctuality | Daily performance tracker of airport punctuality, delays, and operational efficiency | All + add APT_ICAO for merging |
| All_Pre_Departure_Delays | Daily overview of total pre-departure delay in minutes per airport that includes all delay causes | All |
| ATC_Pre_Departure_Delay | Proxy for ATC induced delays at the departure stand (IATA delay code 89) | Pending (depends on data analysis findings) |
| Additional taxi-out time | Taxi time from gate to runway and associated delay | All |
| Additional taxi-in time | Taxi time from runway to gate and associated delay | All |

## Information extracted from datasets
| Dataset | Key Information |
|---|---|
| Airport Traffic | IFR arrivals, departures for Airport Operator and Network Manager |
| Airport ATFM_Delay | Disruption type and count, number of delayed arrival flights |
| Airport Punctuality | Arrival & departure punctuality %, avg delay (dep & arr), operated schedules % |
| Pre_Departure_Delay | AO departures, AO pre-departure delay (minutes) |
| Additional taxi-out time | Number of flights with available data + total taxi-out time |
| Additional taxi-in time | Number of flights with available data + total taxi-in time |

# Data Pipeline

## Data Exploration & Cleaning
- Import & clean the datasets: remove missing/invalid values, select only useful columns and merge.
- Create basic delay metrics:
  - Delay categories (On time, Delayed, Delayed15)
  - Day category (Weekend, Weekday)
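
The two derived metrics above could be sketched as follows. This is a minimal illustration on a hypothetical sample: the `AVG_DELAY_MIN` column, the helper `delay_category`, and the thresholds (0 and 15 minutes) are assumptions for the example, not values fixed by the project.

```python
import pandas as pd

# Hypothetical daily sample; AVG_DELAY_MIN is an assumed column name.
sample = pd.DataFrame({
    "FLT_DATE": pd.to_datetime(["2023-01-06", "2023-01-07", "2023-01-08", "2023-01-09"]),
    "AVG_DELAY_MIN": [0.0, 4.5, 22.0, 15.0],
})

# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
sample["DAY_CATEGORY"] = sample["FLT_DATE"].dt.dayofweek.map(
    lambda d: "Weekend" if d >= 5 else "Weekday"
)

def delay_category(minutes: float) -> str:
    """Assumed thresholds: no delay, under 15 minutes, 15 minutes or more."""
    if minutes <= 0:
        return "On time"
    if minutes < 15:
        return "Delayed"
    return "Delayed15"

sample["DELAY_CATEGORY"] = sample["AVG_DELAY_MIN"].apply(delay_category)
print(sample)
```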

## Descriptive Data Analysis
- Time patterns: heatmaps/line charts/stacked charts (delays by day, month, cause).
- Geospatial view: interactive maps color-coded by average airport delay.
- Efficiency scatterplots: traffic volume vs. % delay
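
As a rough sketch of the data preparation behind the time-pattern heatmaps, the daily records could be pivoted into an airport-by-month grid. The column names follow the ATFM dataset, but the values below are made up for illustration.

```python
import pandas as pd

# Hypothetical daily records (values are illustrative only)
daily = pd.DataFrame({
    "APT_ICAO": ["EHAM", "EHAM", "EHAM", "EGLL", "EGLL", "EGLL"],
    "MONTH_NUM": [1, 1, 2, 1, 2, 2],
    "DLY_APT_ARR_1": [100.0, 300.0, 50.0, 80.0, 20.0, 40.0],
})

# Rows = airports, columns = months, cells = mean daily delay in minutes
heatmap_data = daily.pivot_table(
    index="APT_ICAO", columns="MONTH_NUM", values="DLY_APT_ARR_1", aggfunc="mean"
)
print(heatmap_data)
```

A grid shaped like `heatmap_data` can then be handed to a plotting function such as `seaborn.heatmap`.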

## Predictive Modeling
- Predict delay occurrence for a flight using the departure airport and its performance, average airport delay, weekday/weekend & month.
- Models: Logistic Regression and/or Random Forest.
- Testing with 2025 data (unseen during training).
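
One possible way to set up the train/test split described above is to split by year rather than at random, so that 2025 stays unseen during training. The `IS_WEEKEND` and `HIGH_DELAY` columns and all values below are hypothetical placeholders for features and a target that still need to be defined.

```python
import pandas as pd

# Hypothetical modelling table; IS_WEEKEND and HIGH_DELAY are assumed columns
daily = pd.DataFrame({
    "YEAR": [2023, 2023, 2024, 2024, 2025, 2025],
    "MONTH_NUM": [1, 7, 1, 7, 1, 7],
    "IS_WEEKEND": [0, 1, 0, 1, 1, 0],
    "APT_ICAO": ["EHAM", "EGLL", "EHAM", "EGLL", "EHAM", "EGLL"],
    "HIGH_DELAY": [0, 1, 0, 1, 1, 0],
})

# One-hot encode the airport code; YEAR is left out of the features
features = pd.get_dummies(
    daily[["MONTH_NUM", "IS_WEEKEND", "APT_ICAO"]], columns=["APT_ICAO"]
)

# Train on 2023-24, test on 2025
train_mask = daily["YEAR"].isin([2023, 2024])
test_mask = daily["YEAR"] == 2025

X_train, y_train = features[train_mask], daily.loc[train_mask, "HIGH_DELAY"]
X_test, y_test = features[test_mask], daily.loc[test_mask, "HIGH_DELAY"]
print(X_train.shape, X_test.shape)
```

`X_train` and `y_train` could then be passed to the `fit` method of a scikit-learn `LogisticRegression` or `RandomForestClassifier`.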

## Clustering & Patterns
- Cluster airports by daily delay patterns (e.g., “always congested,” “seasonal peaks,” “mostly smooth”).
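
Before any clustering, each airport needs a compact description of its daily delay pattern. The sketch below builds one such profile (mean, standard deviation, and their ratio) on made-up values; the feature choice and the names `mean_delay`, `std_delay`, and `cv` are assumptions, not a fixed design.

```python
import pandas as pd

# Hypothetical daily delay minutes for two airports
daily = pd.DataFrame({
    "APT_ICAO": ["EHAM"] * 4 + ["LIRF"] * 4,
    "DLY_APT_ARR_1": [1000.0, 1200.0, 800.0, 1000.0, 0.0, 0.0, 400.0, 0.0],
})

# Per-airport profile: average level and day-to-day variability
profile = daily.groupby("APT_ICAO")["DLY_APT_ARR_1"].agg(
    mean_delay="mean", std_delay="std"
)
# Coefficient of variation: low = consistently delayed, high = occasional peaks
profile["cv"] = profile["std_delay"] / profile["mean_delay"]
print(profile)
```

In this toy data EHAM has a high, steady delay level, while LIRF has a low mean with a single peak, which is the kind of contrast a clustering step would need to pick up.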

# Data Cleaning

In [2]:
import pandas as pd
import numpy as np


## TO-DO list, by Wed 8/10
| Dataset | Key Information | Assigned to |
|---|---|---|
| Airport Traffic | Imported + top20 defined: missing total number of departures and arrivals | Ale |
| Airport ATFM_Delay | Extract top20 airports by comparing with df_top10 and overall, and sum over both years | Bas |
| Airport Punctuality | Filter for 23-24, find top 20, compare with df_top20 (some airports might be missing), get all the columns | Mike |
| Pre_Departure_Delay | TBD: is the data actually useful for our RQ? If yes, extract top20 + last 5 columns | JP |
| Additional taxi-out time | Extract top20 airports + TF, TOT_REF_NB_FL, TOT_REF&ADD_TIME_MIN, PIVOT_LABEL | Hylke |
| Additional taxi-in time | Extract top20 airports + TF, TOT_REF_NB_FL, TOT_REF&ADD_TIME_MIN, PIVOT_LABEL | Hylke |

Some comments: for now we focus on the combined totals for 2023-24. Note that some datasets don't have matching columns, so they can't be merged directly; you might need to compare them manually or use other reference columns (especially for the Punctuality dataset).

## Airport Traffic
Importing dataset

In [3]:
df_AT = pd.read_excel("datasets/AirportTraffic.xlsx")

Find top 20 airports by total flights and create a new dataframe with key data

In [7]:
# Sum total, departing and arriving flights per airport code, then keep the busiest airports
df_top20 = (
    df_AT.groupby("APT_ICAO")[["FLT_TOT_1", "FLT_DEP_1", "FLT_ARR_1"]]
    .sum()
    .sort_values(by="FLT_TOT_1", ascending=False)
    .head(20)  # change 20 to analyse a different number of airports
    .reset_index()
)
# Add each airport's name and state from the original dataset
df_top20 = df_top20.merge(
    df_AT[["APT_ICAO", "APT_NAME", "STATE_NAME"]].drop_duplicates(),
    on="APT_ICAO",
    how="left",
)
df_top20.head(20)

Unnamed: 0,APT_ICAO,FLT_TOT_1,FLT_DEP_1,FLT_ARR_1,APT_NAME,STATE_NAME
0,LTFM,1014869,507380,507489,Istanbul,Türkiye
1,EHAM,947254,473657,473597,Amsterdam - Schiphol,Netherlands
2,EGLL,933098,466473,466625,London - Heathrow,United Kingdom
3,LFPG,921916,460997,460919,Paris-Charles-de-Gaulle,France
4,EDDF,871219,435565,435654,Frankfurt,Germany
5,LEMD,809535,404744,404791,Madrid - Barajas,Spain
6,LEBL,666969,333465,333504,Barcelona,Spain
7,EDDM,624316,312141,312175,Munich,Germany
8,LIRF,582158,291068,291090,Rome - Fiumicino,Italy
9,EGKK,522331,261146,261185,London - Gatwick,United Kingdom
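
The name lookup merged into the ranking could be made more defensive. If an airport ever appeared under two spellings of `APT_NAME`, `drop_duplicates()` on all three columns would keep both rows and the merge would duplicate that airport. The sketch below deduplicates on `APT_ICAO` alone and lets pandas check the relationship via the `validate` argument of `merge`. The sample data and the names `traffic`, `totals`, `names`, and `top` are illustrative only.

```python
import pandas as pd

# Hypothetical slice of the traffic dataset (flight counts are made up)
traffic = pd.DataFrame({
    "APT_ICAO": ["EHAM", "EHAM", "EGLL"],
    "APT_NAME": ["Amsterdam - Schiphol", "Amsterdam - Schiphol", "London - Heathrow"],
    "STATE_NAME": ["Netherlands", "Netherlands", "United Kingdom"],
    "FLT_TOT_1": [1300, 1250, 1280],
})

totals = traffic.groupby("APT_ICAO")[["FLT_TOT_1"]].sum().reset_index()

# Keep exactly one name/state row per airport code
names = traffic[["APT_ICAO", "APT_NAME", "STATE_NAME"]].drop_duplicates(subset="APT_ICAO")

# validate raises a MergeError if either side has repeated keys
top = totals.merge(names, on="APT_ICAO", how="left", validate="one_to_one")
print(top)
```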


## Airport ATFM_Delay
Importing dataset

In [36]:
df_AAD = pd.read_excel("datasets/AA_ATFM_Delay.xlsx")

Daily delay for each of the top 20 airports.
Drop rows containing NaN, keep only the top 20 airports, and sort them by airport traffic.

In [37]:
df_AAD = df_AAD.dropna()
df_AAD_top20 = df_AAD[df_AAD["APT_ICAO"].isin(df_top20["APT_ICAO"])]

order = df_top20["APT_ICAO"].tolist()
df_AAD_top20 = df_AAD_top20.set_index("APT_ICAO").loc[order].reset_index()

df_AAD_top20.head()

Unnamed: 0,APT_ICAO,YEAR,MONTH_NUM,MONTH_MON,FLT_DATE,APT_NAME,STATE_NAME,FLT_ARR_1,DLY_APT_ARR_1,DLY_APT_ARR_A_1,...,DLY_APT_ARR_R_1,DLY_APT_ARR_S_1,DLY_APT_ARR_T_1,DLY_APT_ARR_V_1,DLY_APT_ARR_W_1,DLY_APT_ARR_NA_1,FLT_ARR_1_DLY,FLT_ARR_1_DLY_15,ATFM_VERSION,Pivot Label
0,LTFM,2023,1,JAN,44928,Istanbul,Türkiye,676,1720.0,0.0,...,0.0,0.0,0.0,0.0,1720.0,0.0,83.0,38.0,v2,Istanbul (LTFM)
1,LTFM,2023,1,JAN,44938,Istanbul,Türkiye,607,665.0,0.0,...,0.0,0.0,0.0,0.0,665.0,0.0,34.0,19.0,v2,Istanbul (LTFM)
2,LTFM,2023,1,JAN,44939,Istanbul,Türkiye,635,78.0,0.0,...,0.0,0.0,0.0,0.0,78.0,0.0,5.0,2.0,v2,Istanbul (LTFM)
3,LTFM,2023,1,JAN,44948,Istanbul,Türkiye,646,21.0,0.0,...,0.0,0.0,0.0,0.0,21.0,0.0,3.0,0.0,v2,Istanbul (LTFM)
4,LTFM,2023,1,JAN,44954,Istanbul,Türkiye,641,351.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,30.0,6.0,v2,Istanbul (LTFM)
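
Two things in the output above may be worth checking. `FLT_DATE` shows values such as 44928, which look like Excel serial dates. The listed days are also not consecutive, which could mean that `dropna()` is removing days whose delay cells are empty. The sketch below shows one alternative under those assumptions: fill empty delay columns with 0 instead of dropping the row, and convert the serial numbers to dates. The middle row of the sample is hypothetical; the other two reuse values from the output.

```python
import pandas as pd

raw = pd.DataFrame({
    "APT_ICAO": ["LTFM", "LTFM", "LTFM"],
    "FLT_DATE": [44928, 44929, 44938],       # Excel serial day numbers
    "DLY_APT_ARR_1": [1720.0, None, 665.0],  # None = assumed "no delay recorded"
    "DLY_APT_ARR_W_1": [1720.0, None, 665.0],
})

# Treat missing delay minutes as zero rather than discarding the whole day
delay_cols = [c for c in raw.columns if c.startswith("DLY_APT_ARR")]
cleaned = raw.copy()
cleaned[delay_cols] = cleaned[delay_cols].fillna(0)

# Excel's day 0 corresponds to 1899-12-30 for dates in this range
cleaned["FLT_DATE"] = pd.to_datetime(cleaned["FLT_DATE"], unit="D", origin="1899-12-30")
print(cleaned)
```

Whether empty cells really mean zero delay should be confirmed against the dataset documentation before adopting this.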


Total delay for every airport 

In [33]:
AAD_delay_cols = [
    "DLY_APT_ARR_1",
    "DLY_APT_ARR_A_1",
    "DLY_APT_ARR_C_1",
    "DLY_APT_ARR_D_1",
    "DLY_APT_ARR_E_1",
    "DLY_APT_ARR_G_1",
    "DLY_APT_ARR_I_1",
    "DLY_APT_ARR_M_1",
    "DLY_APT_ARR_N_1",
    "DLY_APT_ARR_O_1",
    "DLY_APT_ARR_P_1",
    "DLY_APT_ARR_R_1",
    "DLY_APT_ARR_S_1",
    "DLY_APT_ARR_T_1",
    "DLY_APT_ARR_V_1",
    "DLY_APT_ARR_W_1",
    "DLY_APT_ARR_NA_1",
    "FLT_ARR_1_DLY",
    "FLT_ARR_1_DLY_15"
]

df_AAD_total_delays_per_airport = (
    df_AAD_top20.groupby(["APT_ICAO", "APT_NAME", "STATE_NAME"])[AAD_delay_cols]
    .sum()
    .reset_index()
)

order = df_top20["APT_ICAO"].tolist()
df_AAD_total_delays_per_airport = df_AAD_total_delays_per_airport.set_index("APT_ICAO").loc[order].reset_index()

df_AAD_total_delays_per_airport

Unnamed: 0,APT_ICAO,APT_NAME,STATE_NAME,DLY_APT_ARR_1,DLY_APT_ARR_A_1,DLY_APT_ARR_C_1,DLY_APT_ARR_D_1,DLY_APT_ARR_E_1,DLY_APT_ARR_G_1,DLY_APT_ARR_I_1,...,DLY_APT_ARR_O_1,DLY_APT_ARR_P_1,DLY_APT_ARR_R_1,DLY_APT_ARR_S_1,DLY_APT_ARR_T_1,DLY_APT_ARR_V_1,DLY_APT_ARR_W_1,DLY_APT_ARR_NA_1,FLT_ARR_1_DLY,FLT_ARR_1_DLY_15
0,LTFM,Istanbul,Türkiye,230592.0,0.0,311.0,0.0,0.0,19684.0,0.0,...,0.0,2706.0,0.0,0.0,1415.0,0.0,206476.0,0.0,6800.0,4442.0
1,EHAM,Amsterdam - Schiphol,Netherlands,1540691.0,0.0,12665.0,0.0,6204.0,650129.0,0.0,...,7292.0,18581.0,0.0,591.0,4280.0,0.0,840949.0,0.0,93661.0,29292.0
2,EGLL,London - Heathrow,United Kingdom,1230954.0,5624.0,96191.0,0.0,4202.0,35423.0,0.0,...,6433.0,4818.0,0.0,14679.0,741.0,0.0,1062843.0,0.0,65314.0,28145.0
3,LFPG,Paris-Charles-de-Gaulle,France,119040.0,24.0,763.0,0.0,241.0,1152.0,7904.0,...,265.0,10259.0,0.0,9306.0,3704.0,0.0,85256.0,0.0,6960.0,2882.0
4,EDDF,Frankfurt,Germany,563486.0,493.0,10253.0,0.0,9417.0,177461.0,0.0,...,14543.0,1445.0,0.0,7450.0,18717.0,0.0,323690.0,0.0,27388.0,13731.0
5,LEMD,Madrid - Barajas,Spain,278691.0,1530.0,95450.0,0.0,0.0,47110.0,0.0,...,3723.0,20656.0,0.0,0.0,1858.0,1032.0,107332.0,0.0,18858.0,5932.0
6,LEBL,Barcelona,Spain,384152.0,1079.0,5123.0,0.0,960.0,16489.0,0.0,...,0.0,523.0,0.0,0.0,3889.0,31802.0,324287.0,0.0,15815.0,7609.0
7,EDDM,Munich,Germany,108797.0,25.0,2193.0,0.0,950.0,5265.0,0.0,...,0.0,1794.0,0.0,464.0,263.0,0.0,97843.0,0.0,5032.0,2348.0
8,LIRF,Rome - Fiumicino,Italy,39145.0,964.0,0.0,0.0,0.0,8431.0,15.0,...,4829.0,773.0,0.0,0.0,0.0,0.0,22789.0,0.0,2222.0,959.0
9,EGKK,London - Gatwick,United Kingdom,1032548.0,0.0,32521.0,0.0,239.0,363605.0,0.0,...,7075.0,0.0,0.0,205442.0,1594.0,0.0,422072.0,0.0,38960.0,22324.0
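
To relate these totals to the sub-question about which delay types occur most, each cause column can be expressed as a share of the airport's total delay. The sketch below uses three airports and three cause columns copied from the table above; restricting it to these three causes is a simplification for the example.

```python
import pandas as pd

# Totals copied from the output above (subset of airports and cause columns)
totals = pd.DataFrame({
    "APT_ICAO": ["LTFM", "EHAM", "EGKK"],
    "DLY_APT_ARR_1": [230592.0, 1540691.0, 1032548.0],
    "DLY_APT_ARR_C_1": [311.0, 12665.0, 32521.0],
    "DLY_APT_ARR_G_1": [19684.0, 650129.0, 363605.0],
    "DLY_APT_ARR_W_1": [206476.0, 840949.0, 422072.0],
}).set_index("APT_ICAO")

cause_cols = ["DLY_APT_ARR_C_1", "DLY_APT_ARR_G_1", "DLY_APT_ARR_W_1"]

# Share of each cause in the airport's total arrival ATFM delay
shares = totals[cause_cols].div(totals["DLY_APT_ARR_1"], axis=0)

# Column name of the largest cause per airport
dominant = shares.idxmax(axis=1)
print(shares.round(3))
print(dominant)
```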
