# ANAC Flight Delay Prediction

## Data Loading and Integration

This section loads the Brazilian ANAC VRA (Active Regular Flights) open dataset directly from the official data source.  
Monthly CSV files are dynamically retrieved for the selected year, validated for availability, and merged into a single DataFrame for further analysis.


In [1]:
import pandas as pd
import requests
from urllib.parse import quote

# Base URL for ANAC open VRA (Active Regular Flights) dataset
BASE_URL = (
    "https://sistemas.anac.gov.br/dadosabertos/"
    "Voos%20e%20opera%C3%A7%C3%B5es%20a%C3%A9reas/"
    "Voo%20Regular%20Ativo%20%28VRA%29/"
)

# Month mapping required to match ANAC folder structure
MONTHS = {
    1: "Janeiro",
    2: "Fevereiro",
    3: "Março",
    4: "Abril",
    5: "Maio",
    6: "Junho",
    7: "Julho",
    8: "Agosto",
    9: "Setembro",
    10: "Outubro",
    11: "Novembro",
    12: "Dezembro",
}

# Years selected for analysis
YEARS = [2018]

dfs = []

def file_exists(url):
    """
    Check if a remote file exists before attempting download.
    This avoids runtime errors caused by missing monthly datasets.
    """
    try:
        r = requests.head(url, timeout=10)
        return r.status_code == 200
    except requests.RequestException:
        return False

# Download and load available monthly datasets
for year in YEARS:
    for month_num, month_name in MONTHS.items():

        folder = f"{month_num:02d} - {month_name}"
        filename = f"VRA_{year}{month_num}.csv"

        url = f"{BASE_URL}{year}/{quote(folder)}/{filename}"

        if not file_exists(url):
            print(f"❌ Missing: {year}-{month_num:02d}")
            continue

        print(f"✔️ Downloading {year}-{month_num:02d}")

        df = pd.read_csv(
            url,
            sep=";",
            encoding="utf-8",
            skiprows=1,
            low_memory=False
        )

        # Add explicit temporal identifiers for downstream analysis
        df["year"] = year
        df["month"] = month_num

        dfs.append(df)

# =========================
# FINAL MERGE
# =========================

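# Combine all monthly datasets into a single DataFrame.
# pd.concat raises on an empty list, so fail early with a clearer message.
if not dfs:
    raise RuntimeError("No VRA files were downloaded; check YEARS and BASE_URL")
df_final = pd.concat(dfs, ignore_index=True)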

print("✅ All available VRA data merged")
print(df_final.shape)


✔️ Downloading 2018-01
✔️ Downloading 2018-02
✔️ Downloading 2018-03
✔️ Downloading 2018-04
✔️ Downloading 2018-05
✔️ Downloading 2018-06
✔️ Downloading 2018-07
✔️ Downloading 2018-08
✔️ Downloading 2018-09
✔️ Downloading 2018-10
✔️ Downloading 2018-11
✔️ Downloading 2018-12
✅ All available VRA data merged
(1035169, 14)


After loading, the twelve monthly files for 2018 are consolidated into a single DataFrame of 1,035,169 rows and 14 columns, preserving temporal identifiers (`year` and `month`) to support downstream feature engineering and exploratory analysis.
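The URL construction used inside the loading loop can be isolated and checked without any network access. The sketch below is only an illustration: `build_vra_url` is a hypothetical helper name, and the folder and file naming simply mirror the loading cell.

```python
from urllib.parse import quote

# Same base URL as in the loading cell
BASE_URL = (
    "https://sistemas.anac.gov.br/dadosabertos/"
    "Voos%20e%20opera%C3%A7%C3%B5es%20a%C3%A9reas/"
    "Voo%20Regular%20Ativo%20%28VRA%29/"
)

def build_vra_url(year, month_num, month_name):
    """Hypothetical helper: assemble the URL of one monthly VRA file."""
    folder = f"{month_num:02d} - {month_name}"   # e.g. "03 - Março"
    filename = f"VRA_{year}{month_num}.csv"      # month is not zero-padded here
    # quote() percent-encodes spaces and non-ASCII characters in the folder name
    return f"{BASE_URL}{year}/{quote(folder)}/{filename}"

url = build_vra_url(2018, 3, "Março")
print(url)
```

Building the URL in one place makes it easier to spot encoding problems, since accented month names such as "Março" must be percent-encoded as UTF-8.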


## Initial Dataset Inspection

After consolidating all available monthly files, an initial inspection is performed to understand the dataset structure, column availability, and basic integrity before proceeding with data preparation.


In [2]:
df_final

Unnamed: 0,ICAO Empresa Aérea,Número Voo,Código Autorização (DI),Código Tipo Linha,ICAO Aeródromo Origem,ICAO Aeródromo Destino,Partida Prevista,Partida Real,Chegada Prevista,Chegada Real,Situação Voo,Código Justificativa,year,month
0,AFR,454,0,I,LFPG,SBGR,01/01/2018 20:20,,02/01/2018 08:20,,REALIZADO,,2018,1
1,ARG,1248,0,I,SABE,SBGR,01/01/2018 19:30,,01/01/2018 23:15,,REALIZADO,,2018,1
2,ARG,1251,1,I,SBGL,SABE,,01/01/2018 19:53,,01/01/2018 22:10,REALIZADO,,2018,1
3,ARG,1254,1,I,SABE,SBGL,,01/01/2018 11:02,,01/01/2018 14:29,REALIZADO,,2018,1
4,ARG,1278,0,I,SABE,SBFL,01/01/2018 17:55,,01/01/2018 21:05,,REALIZADO,,2018,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1035164,AZU,4040,0,N,SBCF,SBBR,31/12/2018 07:45,31/12/2018 08:07,31/12/2018 09:10,31/12/2018 09:26,REALIZADO,,2018,12
1035165,AZU,4034,0,N,SBVT,SBKP,31/12/2018 20:20,31/12/2018 20:20,31/12/2018 22:00,31/12/2018 22:00,REALIZADO,,2018,12
1035166,AZU,4033,0,N,SBRJ,SBKP,31/12/2018 16:35,31/12/2018 16:29,31/12/2018 17:45,31/12/2018 17:28,REALIZADO,,2018,12
1035167,AZU,4029,0,N,SBRJ,SBKP,31/12/2018 07:55,31/12/2018 07:46,31/12/2018 09:05,31/12/2018 08:49,REALIZADO,,2018,12


In [None]:
anac_df = df_final.copy()

In [5]:
anac_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1035169 entries, 0 to 1035168
Data columns (total 14 columns):
 #   Column                   Non-Null Count    Dtype 
---  ------                   --------------    ----- 
 0   ICAO Empresa Aérea       1035169 non-null  object
 1   Número Voo               1035169 non-null  object
 2   Código Autorização (DI)  1035045 non-null  object
 3   Código Tipo Linha        1034579 non-null  object
 4   ICAO Aeródromo Origem    1035169 non-null  object
 5   ICAO Aeródromo Destino   1035169 non-null  object
 6   Partida Prevista         1023143 non-null  object
 7   Partida Real             881971 non-null   object
 8   Chegada Prevista         1023143 non-null  object
 9   Chegada Real             881976 non-null   object
 10  Situação Voo             1035169 non-null  object
 11  Código Justificativa     267992 non-null   object
 12  year                     1035169 non-null  int64 
 13  month                    1035169 non-null  int64 
dtypes: int64(2), object(12)

In [7]:
anac_df.isna().sum()

Unnamed: 0,0
ICAO Empresa Aérea,0
Número Voo,0
Código Autorização (DI),124
Código Tipo Linha,590
ICAO Aeródromo Origem,0
ICAO Aeródromo Destino,0
Partida Prevista,12026
Partida Real,153198
Chegada Prevista,12026
Chegada Real,153193


## Missing Values Assessment

Before applying any transformations or filtering, missing values are evaluated to understand data completeness and identify columns that may require cleaning, imputation, or exclusion.


In [16]:
anac_df.isna().sum()

Unnamed: 0,0
ICAO Empresa Aérea,0
Número Voo,0
ICAO Aeródromo Origem,0
ICAO Aeródromo Destino,0
Partida Prevista,11986
Partida Real,114397
Situação Voo,0
year,0
month,0


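The inspection shows that missing values are concentrated in the timestamp columns, chiefly `Partida Real` and, to a lesser extent, `Partida Prevista`, while the airline, flight number, aerodrome, and status columns are complete.  
These results guide subsequent filtering and cleaning decisions applied in the data preparation stage.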
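Absolute counts are easier to judge next to their share of all rows. The following is a minimal sketch on a toy frame standing in for `anac_df` (the values are illustrative, not real data); `missing_summary` is a hypothetical name.

```python
import pandas as pd

# Toy frame standing in for anac_df (illustrative values only)
toy = pd.DataFrame({
    "Partida Prevista": ["01/01/2018 20:20", None, "01/01/2018 17:55", "01/01/2018 01:04"],
    "Partida Real": [None, "01/01/2018 19:53", None, "01/01/2018 01:06"],
    "Situação Voo": ["REALIZADO"] * 4,
})

# Count and percentage of missing values per column, worst first
missing_summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(2),
}).sort_values("missing", ascending=False)

print(missing_summary)
```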


## Removal of Irrelevant and Non-Informative Columns

Some columns in the original dataset do not contribute to the delay prediction objective or are outside the scope of the current analysis. These fields are removed to reduce noise and simplify downstream processing.


In [9]:
anac_df = anac_df.drop(columns=[
    'Código Autorização (DI)',
    'Código Tipo Linha',
    'Chegada Prevista',
    'Chegada Real',
    'Código Justificativa'
])


## Filtering Completed Flights

Only flights with status `REALIZADO` (completed flights) are retained for analysis.  
Canceled or non-completed flights are excluded, as departure delay cannot be reliably measured for these cases.


In [10]:
anac_df = anac_df[anac_df['Situação Voo']=='REALIZADO']

In [14]:
anac_df

Unnamed: 0,ICAO Empresa Aérea,Número Voo,ICAO Aeródromo Origem,ICAO Aeródromo Destino,Partida Prevista,Partida Real,Situação Voo,year,month
0,AFR,454,LFPG,SBGR,01/01/2018 20:20,,REALIZADO,2018,1
1,ARG,1248,SABE,SBGR,01/01/2018 19:30,,REALIZADO,2018,1
2,ARG,1251,SBGL,SABE,,01/01/2018 19:53,REALIZADO,2018,1
3,ARG,1254,SABE,SBGL,,01/01/2018 11:02,REALIZADO,2018,1
4,ARG,1278,SABE,SBFL,01/01/2018 17:55,,REALIZADO,2018,1
...,...,...,...,...,...,...,...,...,...
1035164,AZU,4040,SBCF,SBBR,31/12/2018 07:45,31/12/2018 08:07,REALIZADO,2018,12
1035165,AZU,4034,SBVT,SBKP,31/12/2018 20:20,31/12/2018 20:20,REALIZADO,2018,12
1035166,AZU,4033,SBRJ,SBKP,31/12/2018 16:35,31/12/2018 16:29,REALIZADO,2018,12
1035167,AZU,4029,SBRJ,SBKP,31/12/2018 07:55,31/12/2018 07:46,REALIZADO,2018,12


This filtering step ensures consistency between the target variable definition and the operational scope of the dataset.
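Before discarding rows, it can help to look at how many flights fall under each status. The sketch below uses a toy frame; the `CANCELADO` label is an assumption used for illustration, and only `REALIZADO` is taken from the notebook.

```python
import pandas as pd

# Toy frame (illustrative values); "CANCELADO" is an assumed status label
toy = pd.DataFrame({
    "Número Voo": ["454", "1248", "4040"],
    "Situação Voo": ["REALIZADO", "CANCELADO", "REALIZADO"],
})

# Status distribution before filtering
status_counts = toy["Situação Voo"].value_counts()
print(status_counts)

# Keep completed flights; .copy() yields an independent frame,
# which avoids chained-assignment warnings in later cells
completed = toy[toy["Situação Voo"] == "REALIZADO"].copy()
print(len(completed))
```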


## Removal of Incomplete Records

Records containing missing values are removed to ensure data consistency and avoid introducing bias through artificial imputation.  
Given the operational nature of the dataset, incomplete records cannot be reliably corrected and are excluded from further analysis.


In [18]:
anac_df = anac_df.dropna()

In [19]:
anac_df

Unnamed: 0,ICAO Empresa Aérea,Número Voo,ICAO Aeródromo Origem,ICAO Aeródromo Destino,Partida Prevista,Partida Real,Situação Voo,year,month
7,AVA,85,SKBO,SBGR,01/01/2018 01:04,01/01/2018 01:06,REALIZADO,2018,1
11,AZU,2404,SBRF,SBMO,01/01/2018 13:50,01/01/2018 13:44,REALIZADO,2018,1
13,AZU,2428,SBCF,SBBR,01/01/2018 17:30,01/01/2018 17:25,REALIZADO,2018,1
17,AZU,2537,SBCF,SBPA,01/01/2018 21:35,01/01/2018 21:52,REALIZADO,2018,1
19,AZU,2596,SBCY,SBPV,01/01/2018 12:15,01/01/2018 11:59,REALIZADO,2018,1
...,...,...,...,...,...,...,...,...,...
1035164,AZU,4040,SBCF,SBBR,31/12/2018 07:45,31/12/2018 08:07,REALIZADO,2018,12
1035165,AZU,4034,SBVT,SBKP,31/12/2018 20:20,31/12/2018 20:20,REALIZADO,2018,12
1035166,AZU,4033,SBRJ,SBKP,31/12/2018 16:35,31/12/2018 16:29,REALIZADO,2018,12
1035167,AZU,4029,SBRJ,SBKP,31/12/2018 07:55,31/12/2018 07:46,REALIZADO,2018,12


In [20]:
anac_df.isna().sum()

Unnamed: 0,0
ICAO Empresa Aérea,0
Número Voo,0
ICAO Aeródromo Origem,0
ICAO Aeródromo Destino,0
Partida Prevista,0
Partida Real,0
Situação Voo,0
year,0
month,0


This step leaves 862,919 records, all of which contain the minimum required information for delay calculation and feature engineering.
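A bare `dropna()` removes a row if any column is missing. A slightly more explicit variant, sketched below on toy data, restricts the check to the columns that the delay calculation actually needs and reports how many rows were removed. The `note` column and the variable names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Partida Prevista": ["01/01/2018 20:20", None, "01/01/2018 01:04"],
    "Partida Real": [None, "01/01/2018 19:53", "01/01/2018 01:06"],
    "note": [None, None, None],  # hypothetical unrelated column, always empty
})

# Only these columns are required to compute a departure delay
required = ["Partida Prevista", "Partida Real"]

rows_before = len(toy)
clean = toy.dropna(subset=required)
rows_removed = rows_before - len(clean)

print(f"removed {rows_removed} of {rows_before} rows")
```

In this toy frame, a bare `toy.dropna()` would discard every row because of the empty `note` column, which is why naming the required columns can be safer when unrelated columns are still present.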


## Data Type Standardization

Data types are explicitly standardized to ensure semantic correctness, memory efficiency, and compatibility with downstream feature engineering and modeling steps.


In [29]:
# Create an explicit copy to avoid chained assignment issues
anac_df = anac_df.copy()

# Columns representing identifiers and categorical attributes
category_cols = [
    'ICAO Empresa Aérea',
    'Número Voo',
    'ICAO Aeródromo Origem',
    'ICAO Aeródromo Destino',
    'Situação Voo'
]

# Cast categorical features to pandas 'category' dtype
anac_df[category_cols] = anac_df[category_cols].astype('category')

# Datetime columns used for delay calculation and temporal feature extraction
datetime_cols = [
    'Partida Prevista',
    'Partida Real'
]


Explicit type casting ensures consistent behavior across the data preparation pipeline and prevents ambiguity during feature extraction and modeling.
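The memory benefit of the `category` dtype can be checked directly. This is a small sketch on a synthetic column of repeated airport codes, not a measurement of the real dataset.

```python
import pandas as pd

# Synthetic column: 100,000 values drawn from only four distinct codes
codes = pd.Series(["SBGR", "SBKP", "SBRJ", "SBCF"] * 25_000)

# deep=True includes the memory held by the string values themselves
raw_bytes = codes.memory_usage(deep=True)
category_bytes = codes.astype("category").memory_usage(deep=True)

print(f"strings:  {raw_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The saving comes from storing each distinct value once and replacing the column with small integer codes, so it is largest for columns with few unique values relative to their length.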


In [32]:
anac_df[datetime_cols] = anac_df[datetime_cols].apply(
    pd.to_datetime,
    dayfirst=True,
    errors='coerce'
)

In [34]:
anac_df['year'] = anac_df['year'].astype('int16')
anac_df['month'] = anac_df['month'].astype('int8')

In [48]:
anac_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 862919 entries, 7 to 1035168
Data columns (total 16 columns):
 #   Column                  Non-Null Count   Dtype         
---  ------                  --------------   -----         
 0   ICAO Empresa Aérea      862919 non-null  category      
 1   Número Voo              862919 non-null  category      
 2   ICAO Aeródromo Origem   862919 non-null  category      
 3   ICAO Aeródromo Destino  862919 non-null  category      
 4   Partida Prevista        862919 non-null  datetime64[ns]
 5   Partida Real            862919 non-null  datetime64[ns]
 6   Situação Voo            862919 non-null  category      
 7   year                    862919 non-null  int16         
 8   month                   862919 non-null  int8          
 9   delay_minutes           862919 non-null  float64       
 10  is_delayed              862919 non-null  int8          
 11  day_of_week             862919 non-null  int8          
 12  week_of_year            862919 non

## Departure Delay Calculation

The departure delay is computed as the difference between the actual and scheduled departure times.  
This continuous variable represents the foundation for defining the target variable and supports both exploratory analysis and feature engineering.


In [36]:
anac_df['delay_minutes'] = (
    anac_df['Partida Real'] - anac_df['Partida Prevista']
).dt.total_seconds() / 60


Positive values indicate delayed departures, while negative values represent early departures.  
This feature is used exclusively for target definition and exploratory analysis and is excluded from the modeling feature set to prevent data leakage.
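As a worked example, the calculation can be applied to three departures listed in the dataset preview (31/12/2018). The sketch parses the timestamps with an explicit `format`, which is an alternative to the `dayfirst=True` option used in the notebook.

```python
import pandas as pd

# Scheduled and actual departures taken from the dataset preview
toy = pd.DataFrame({
    "Partida Prevista": ["31/12/2018 07:45", "31/12/2018 16:35", "31/12/2018 20:20"],
    "Partida Real": ["31/12/2018 08:07", "31/12/2018 16:29", "31/12/2018 20:20"],
})

# An explicit day-first format avoids ambiguous dates such as 02/01/2018
for col in ["Partida Prevista", "Partida Real"]:
    toy[col] = pd.to_datetime(toy[col], format="%d/%m/%Y %H:%M")

# Timedelta -> seconds -> minutes
toy["delay_minutes"] = (
    toy["Partida Real"] - toy["Partida Prevista"]
).dt.total_seconds() / 60

print(toy["delay_minutes"].tolist())
```

The three flights come out as 22 minutes late, 6 minutes early, and exactly on time.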


## Target Variable Definition

A binary target variable is defined based on a delay threshold of 15 minutes.  
Flights departing more than 15 minutes after the scheduled time are labeled as delayed.


In [37]:
anac_df['is_delayed'] = (anac_df['delay_minutes'] > 15).astype('int8')

This binary formulation enables the use of classification models such as logistic regression, while maintaining interpretability and alignment with common operational delay definitions.
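The strict inequality matters at the boundary: a departure exactly 15 minutes late is not labeled as delayed. A minimal sketch follows, where `DELAY_THRESHOLD_MIN` is a hypothetical constant name.

```python
import pandas as pd

DELAY_THRESHOLD_MIN = 15  # hypothetical constant; the notebook hard-codes 15

# Illustrative delays in minutes, including the boundary value
delay_minutes = pd.Series([-6.0, 0.0, 15.0, 15.5, 22.0])

# Strictly greater than the threshold -> delayed (1), otherwise 0
is_delayed = (delay_minutes > DELAY_THRESHOLD_MIN).astype("int8")

print(is_delayed.tolist())
```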


## Day of Week Extraction

The day of the week is extracted from the scheduled departure time to capture weekly operational patterns that may influence flight delays.


In [42]:
anac_df['day_of_week'] = (
    anac_df['Partida Prevista']
    .dt.dayofweek
    .astype('int8')
)


The feature ranges from 0 (Monday) to 6 (Sunday) and allows the model to learn differences in delay behavior across weekdays and weekends.
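The encoding can be verified on a few known dates. This short sketch also shows `dt.day_name()`, which is handy for readable labels in exploratory tables.

```python
import pandas as pd

# 1 January 2018 was a Monday; 6 and 7 January fall on the weekend
dates = pd.Series(pd.to_datetime(["2018-01-01", "2018-01-06", "2018-01-07"]))

day_of_week = dates.dt.dayofweek.astype("int8")  # Monday=0 ... Sunday=6
day_name = dates.dt.day_name()                   # readable labels

print(list(zip(day_of_week.tolist(), day_name.tolist())))
```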


## Temporal Feature Engineering

Multiple calendar-based features are derived from the scheduled departure time to capture seasonal, weekly, and intraday patterns associated with flight delays.


In [43]:
anac_df['week_of_year'] = (
    anac_df['Partida Prevista']
    .dt.isocalendar()
    .week
    .astype('int8')
)


In [44]:
anac_df['week_of_month'] = (
    ((anac_df['Partida Prevista'].dt.day - 1) // 7 + 1)
    .astype('int8')
)


In [45]:
anac_df['hour'] = (
    anac_df['Partida Prevista']
    .dt.hour
    .astype('int8')
)


In [46]:
anac_df['is_weekend'] = (
    anac_df['day_of_week'] >= 5
).astype('int8')


In [49]:
anac_df['day_of_year'] = (
    anac_df['Partida Prevista']
    .dt.dayofyear
    .astype('int16')
)


All temporal features are extracted exclusively from scheduled departure times to avoid information leakage and ensure model validity.
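One detail worth keeping in mind is that `dt.isocalendar().week` follows the ISO 8601 calendar, in which the last days of December can belong to week 1 of the following ISO year. This explains why flights on 31/12/2018 carry `week_of_year = 1` in the final dataset. A small sketch of the effect:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2018-01-01", "2018-12-30", "2018-12-31"]))

iso = dates.dt.isocalendar()              # columns: year, week, day
week_of_year = iso.week.astype("int8")
iso_year = iso.year                       # ISO year, may differ from calendar year

# Same week-of-month formula as in the notebook
week_of_month = ((dates.dt.day - 1) // 7 + 1).astype("int8")

print(week_of_year.tolist(), iso_year.tolist(), week_of_month.tolist())
```

Because week 1 then mixes early January and late December flights, `day_of_year` or `month` may be the more reliable seasonal signal around the year boundary.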


## Removal of Leakage-Prone and Non-Modeling Features

Columns that either introduce data leakage or are no longer required after feature extraction are removed to ensure a clean, model-ready dataset.


In [57]:
cols_to_drop = [
    'Partida Prevista',
    'Partida Real',
    'delay_minutes',
    'Situação Voo'
]

anac_df = anac_df.drop(columns=cols_to_drop)


After this step, the dataset contains only features available at prediction time, fully aligned with the modeling objective.
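A lightweight guard can make this property explicit. The sketch below is one possible check on a toy frame; `LEAKAGE_COLS`, `model_df`, and `leftover` are hypothetical names.

```python
import pandas as pd

# Columns only known after the flight has departed
LEAKAGE_COLS = {"Partida Real", "delay_minutes"}

toy = pd.DataFrame({
    "hour": [7, 20],
    "Partida Real": pd.to_datetime(["2018-12-31 08:07", "2018-12-31 20:20"]),
    "delay_minutes": [22.0, 0.0],
    "is_delayed": [1, 0],
})

model_df = toy.drop(columns=sorted(LEAKAGE_COLS))

# Fail loudly if any post-departure column survived
leftover = LEAKAGE_COLS & set(model_df.columns)
assert not leftover, f"leakage-prone columns still present: {leftover}"

print(list(model_df.columns))
```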


## Final Dataset Validation

A final inspection is performed to confirm data integrity, data types, and overall readiness for modeling.


In [58]:
anac_df

Unnamed: 0,ICAO Empresa Aérea,Número Voo,ICAO Aeródromo Origem,ICAO Aeródromo Destino,year,month,is_delayed,day_of_week,week_of_year,week_of_month,hour,is_weekend,day_of_year
7,AVA,85,SKBO,SBGR,2018,1,0,0,1,1,1,0,1
11,AZU,2404,SBRF,SBMO,2018,1,0,0,1,1,13,0,1
13,AZU,2428,SBCF,SBBR,2018,1,0,0,1,1,17,0,1
17,AZU,2537,SBCF,SBPA,2018,1,1,0,1,1,21,0,1
19,AZU,2596,SBCY,SBPV,2018,1,0,0,1,1,12,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1035164,AZU,4040,SBCF,SBBR,2018,12,1,0,1,5,7,0,365
1035165,AZU,4034,SBVT,SBKP,2018,12,0,0,1,5,20,0,365
1035166,AZU,4033,SBRJ,SBKP,2018,12,0,0,1,5,16,0,365
1035167,AZU,4029,SBRJ,SBKP,2018,12,0,0,1,5,7,0,365


In [59]:
anac_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 862919 entries, 7 to 1035168
Data columns (total 13 columns):
 #   Column                  Non-Null Count   Dtype   
---  ------                  --------------   -----   
 0   ICAO Empresa Aérea      862919 non-null  category
 1   Número Voo              862919 non-null  category
 2   ICAO Aeródromo Origem   862919 non-null  category
 3   ICAO Aeródromo Destino  862919 non-null  category
 4   year                    862919 non-null  int16   
 5   month                   862919 non-null  int8    
 6   is_delayed              862919 non-null  int8    
 7   day_of_week             862919 non-null  int8    
 8   week_of_year            862919 non-null  int8    
 9   week_of_month           862919 non-null  int8    
 10  hour                    862919 non-null  int8    
 11  is_weekend              862919 non-null  int8    
 12  day_of_year             862919 non-null  int16   
dtypes: category(4), int16(2), int8(7)
memory usage: 21.8 MB


The inspection confirms that the dataset contains no missing values or leakage-prone features and is fully structured for machine learning workflows.


## Exploratory Analysis: Target Distribution

This analysis examines the distribution of the target variable to assess class balance and establish a baseline understanding of flight delays in the dataset.


In [41]:
anac_df['is_delayed'].value_counts(normalize=True)


Unnamed: 0_level_0,proportion
is_delayed,Unnamed: 1_level_1
0,0.830933
1,0.169067


Delayed flights account for about 16.9% of the observations, against 83.1% on-time departures, which makes this a moderately imbalanced classification problem.  
Because a model that always predicts "not delayed" would already reach about 83% accuracy, evaluation should rely on metrics beyond accuracy, such as recall and precision.
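The effect of the imbalance on accuracy can be made concrete with a trivial baseline. The sketch uses only the two class proportions reported above; the variable names are hypothetical.

```python
# Class proportions reported by value_counts(normalize=True)
p_on_time = 0.830933
p_delayed = 0.169067

# A trivial classifier that always predicts "not delayed" (class 0)
baseline_accuracy = p_on_time    # correct on every on-time flight
baseline_recall_delayed = 0.0    # never identifies a delayed flight

# Number of on-time flights per delayed flight
imbalance_ratio = p_on_time / p_delayed

print(f"accuracy={baseline_accuracy:.3f}, "
      f"recall(delayed)={baseline_recall_delayed:.1f}, "
      f"ratio={imbalance_ratio:.1f}:1")
```

Any candidate model therefore needs to beat roughly 83% accuracy while also achieving non-zero recall on the delayed class to be useful.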


## Exploratory Analysis: Delay by Hour of Day

This analysis evaluates how the probability of flight delays varies throughout the day, aiming to identify intraday operational patterns.


In [69]:
anac_df.groupby('hour', observed=True)['is_delayed'].mean().sort_values(ascending=False)


Unnamed: 0_level_0,is_delayed
hour,Unnamed: 1_level_1
0,0.234129
21,0.210326
19,0.206647
20,0.205254
22,0.204082
17,0.19298
18,0.188132
14,0.185
16,0.181259
12,0.18007


The highest delay rates occur for flights scheduled at hour 0 (23.4%) and in the evening, from hour 19 to hour 22 (roughly 20% to 21%), followed by the afternoon hours.  
This pattern is consistent with cumulative operational effects, where delays propagate throughout the day as aircraft rotations and airport congestion increase.


## Exploratory Analysis: Delay by Day of Week

This analysis examines how delay probabilities vary across the days of the week, capturing potential weekly operational patterns.


In [70]:
anac_df.groupby('day_of_week', observed=True)['is_delayed'].mean()


Unnamed: 0_level_0,is_delayed
day_of_week,Unnamed: 1_level_1
0,0.171706
1,0.142911
2,0.158822
3,0.193311
4,0.201181
5,0.161553
6,0.149177


Delay rates vary across the week: they peak on Friday (day 4, 20.1%) and Thursday (day 3, 19.3%) and are lowest on Tuesday (day 1, 14.3%) and Sunday (day 6, 14.9%).  
This reinforces the relevance of weekly temporal patterns in flight delay prediction.


## Exploratory Analysis: Delay by Airline

This analysis evaluates how delay probabilities vary across airlines, highlighting potential differences in operational performance.


In [64]:
anac_df.groupby(
    'ICAO Empresa Aérea',
    observed=True
)['is_delayed'].mean().sort_values(ascending=False)


Unnamed: 0_level_0,is_delayed
ICAO Empresa Aérea,Unnamed: 1_level_1
ISS,0.735294
GTI,0.577703
CCA,0.533493
CLX,0.469261
LCO,0.460306
...,...
CQB,0.000000
RIM,0.000000
MAA,0.000000
SID,0.000000


Significant variation in delay rates is observed across airlines.  
However, extreme values may be associated with carriers that operate a small number of flights, requiring cautious interpretation.
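One simple way to handle small carriers is to compute the flight count next to the delay rate and keep only groups above a minimum volume. The sketch below uses made-up carrier codes and a made-up cutoff (`MIN_FLIGHTS`); a suitable threshold for the real data is a judgment call.

```python
import pandas as pd

MIN_FLIGHTS = 3  # hypothetical cutoff, chosen only for this toy example

# Toy data with made-up carrier codes
toy = pd.DataFrame({
    "ICAO Empresa Aérea": ["AAA", "BBB", "BBB", "BBB", "BBB", "CCC", "CCC", "CCC"],
    "is_delayed":         [1,     0,     1,     0,     0,     1,     1,     0],
})

stats = (
    toy.groupby("ICAO Empresa Aérea", observed=True)
    .agg(delay_rate=("is_delayed", "mean"), flights=("is_delayed", "count"))
)

# Drop carriers whose rate rests on too few flights, then rank the rest
reliable = (
    stats[stats["flights"] >= MIN_FLIGHTS]
    .sort_values("delay_rate", ascending=False)
)

print(reliable)
```

Here the single-flight carrier with a 100% delay rate is excluded, and the ranking is based only on carriers with enough observations.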


## Exploratory Analysis: Delay by Origin Airport

This analysis investigates how delay probabilities vary across origin aerodromes, aiming to identify potential operational bottlenecks at departure airports.


In [66]:
anac_df.groupby(
    'ICAO Aeródromo Origem',
    observed=True
)['is_delayed'].mean().sort_values(ascending=False)



Unnamed: 0_level_0,is_delayed
ICAO Aeródromo Origem,Unnamed: 1_level_1
LLBG,1.000000
KHSV,0.666667
SEQU,0.650000
EBBR,0.569767
ELLX,0.538462
...,...
SWFX,0.000000
SWHP,0.000000
VABB,0.000000
VIDP,0.000000


Large differences in delay rates are observed across origin airports.  
However, extreme delay probabilities are often associated with aerodromes with very low flight volumes and should be interpreted with caution.


## Exploratory Analysis: Delay Rate and Flight Volume by Origin Airport

To avoid misleading interpretations driven by small sample sizes, delay rates are analyzed together with the number of flights per origin airport.


In [71]:
airport_stats = (
    anac_df
    .groupby('ICAO Aeródromo Origem', observed=True)
    .agg(
        delay_rate=('is_delayed', 'mean'),
        flights=('is_delayed', 'count')
    )
    .sort_values('delay_rate', ascending=False)
)

airport_stats


Unnamed: 0_level_0,delay_rate,flights
ICAO Aeródromo Origem,Unnamed: 1_level_1,Unnamed: 2_level_1
LLBG,1.000000,8
KHSV,0.666667,3
SEQU,0.650000,40
EBBR,0.569767,86
ELLX,0.538462,520
...,...,...
SWFX,0.000000,6
SWHP,0.000000,12
VABB,0.000000,1
VIDP,0.000000,1


Airports with extreme delay rates often correspond to very low flight volumes: LLBG shows a 100% delay rate on only 8 flights, while VABB and VIDP show 0% on a single flight each. Sample size therefore plays a critical role in interpreting operational performance.  
Airports with both high delay rates and substantial flight volumes, such as ELLX (53.8% over 520 flights), are more indicative of systemic constraints.
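One way to quantify how much a rate can be trusted is a confidence interval for the proportion. The sketch below uses the Wilson score interval, which is a technique added here for illustration and is not part of the original notebook. The delayed counts are derived from the table as rate times flights.

```python
import math

def wilson_interval(delayed, flights, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if flights <= 0:
        raise ValueError("flights must be positive")
    p = delayed / flights
    denom = 1 + z**2 / flights
    center = (p + z**2 / (2 * flights)) / denom
    half = z * math.sqrt(p * (1 - p) / flights + z**2 / (4 * flights**2)) / denom
    return center - half, center + half

# (delayed, flights) pairs implied by the table: delay_rate * flights
airports = {"LLBG": (8, 8), "KHSV": (2, 3), "ELLX": (280, 520)}

intervals = {code: wilson_interval(d, n) for code, (d, n) in airports.items()}
for code, (low, high) in intervals.items():
    print(f"{code}: [{low:.3f}, {high:.3f}]")
```

The interval for ELLX is much narrower than the one for LLBG or KHSV, which matches the intuition that its high delay rate is backed by far more flights.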
