# Flight Punctuality Analysis at Dublin Airport

This project examines how weather conditions influence flight punctuality at Dublin Airport.  
The analysis combines flight activity data (arrivals, departures, delays, cancellations) with historical and forecast weather data from Met Éireann to identify trends, quantify the impact of adverse conditions, and project future delay probabilities.
 
By aligning operational flight records with local weather observations, the study provides insights into how rain, wind, and visibility affect airport performance and passenger reliability.


### Notebook Control Flag Explanation

This notebook contains code to download flight history data from the Aviation Edge API.  
Because downloading six months of data can take a long time and may stress the API, we use a **control flag** called `RUN_DOWNLOAD` to decide whether the download should run.

- **RUN_DOWNLOAD = False** → The download section is skipped.  
  Use this setting when you want to run analysis, visualisations, or other notebook functionality without refreshing the data.

- **RUN_DOWNLOAD = True** → The download section executes.  
  Use this setting only when you deliberately want to refresh the flight history data and update the cumulative JSON files.
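
The flag pattern can be sketched in isolation as follows. This is a minimal illustration rather than the notebook's download code: `load_or_refresh` and the `fetch` callable are hypothetical names, and the cache is a throwaway JSON file.

```python
import json
import tempfile
from pathlib import Path

RUN_DOWNLOAD = False  # same idea as the notebook flag


def load_or_refresh(cache_file, fetch, run_download):
    """Return cached records unless a refresh is explicitly requested.

    `fetch` is a hypothetical zero-argument callable standing in for the
    expensive API download.
    """
    cache_file = Path(cache_file)
    if run_download:
        records = fetch()
        cache_file.write_text(json.dumps(records), encoding="utf-8")
        return records
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    return []  # nothing cached and no download requested


# Usage with a temporary cache file and fake download functions
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "flights.json"
    first = load_or_refresh(cache, lambda: [{"flight": "XX123"}], run_download=True)
    second = load_or_refresh(cache, lambda: [], run_download=RUN_DOWNLOAD)

print(first == second)
```

With the flag off, the second call never touches `fetch` and simply reads back what the first call stored.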

This design ensures:
- The notebook can be safely re-run without triggering unwanted downloads.
- Existing JSON files are preserved and can be loaded for analysis.
- You have full control over when heavy API calls are made.

👉 In practice: keep `RUN_DOWNLOAD = False` most of the time, and flip it to `True` only when you need new data.


In [1]:
# --- Control flag to enable/disable data refresh ---
RUN_DOWNLOAD = False   # Change to True only when you want to refresh data

### 📦 Step 2 – Install and Import Required Libraries

This step prepares the environment for the Dublin Airport flight punctuality project.  
It ensures that all required Python packages are available and sets up the project’s directory structure inside the `project` root.

The notebook imports essential libraries for:

- 📊 **Data manipulation** (`pandas`, `numpy`)
- 📅 **Date and time handling** (`datetime`, `matplotlib.dates`)
- 📈 **Plotting and visualisation** (`matplotlib`, `seaborn`, `plotly`)
- 🤖 **Machine learning and model persistence** (`scikit-learn`, `joblib`)
- 📂 **File handling and paths** (`os`, `pathlib`, `json`)
- 🌐 **Web access** (`requests`)
- 🧩 **Interactivity and display** (`ipywidgets`, `IPython.display`)

It also defines key directories (`data`, `outputs`, `models`, `docs`) inside the `project` folder and ensures they exist.  
This structure keeps raw data, processed outputs, trained models, and documentation organised and reproducible.

📌 *Note: `%pip install` commands can be used inside Jupyter notebooks if a package is missing.  
For scripts or terminal use, run `pip install` directly.*
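
One way to find out which packages are missing before installing anything is to ask the import system. The sketch below is an assumption about how such a check could look, not part of the notebook; `missing_packages` is a hypothetical helper. Note that it takes import names, which can differ from the pip name (for example `sklearn` for `scikit-learn`).

```python
import importlib.util


def missing_packages(import_names):
    """Return the top-level import names that cannot be found in this environment."""
    return [name for name in import_names if importlib.util.find_spec(name) is None]


# "json" ships with Python; the second name is deliberately made up
print(missing_packages(["json", "surely_not_installed_pkg_xyz"]))
```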


In [2]:
%pip install plotly --quiet

# Setup: imports, paths and basic config

# --- Core Python modules ---
import json              # config files / JSON handling
import os                # operating system interactions
import time              # time management
import warnings          # manage warnings
from datetime import date, timedelta  # date calculations
from pathlib import Path              # path management

# --- Data science / numerical libraries ---
import numpy as np       # numerical operations
import pandas as pd      # data manipulation

# --- Plotting libraries ---
import matplotlib.pyplot as plt   # static plotting
import plotly.express as px       # interactive plotting
import seaborn as sns             # enhanced plotting

# --- Machine learning libraries ---
import joblib                     # model persistence
from sklearn.ensemble import GradientBoostingClassifier   # example model
from sklearn.linear_model import LogisticRegression       # example model
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score
)  # model evaluation
from sklearn.model_selection import (
    cross_val_score,
    train_test_split
)  # model validation

# --- API / external requests ---
import requests                   # API calls

# Plotting style
sns.set(style='whitegrid')

# Explicit project root: programming-for-data-analytics/project
ROOT = Path.cwd().resolve()
if ROOT.name != "project":
    # climb up until we find project folder
    for parent in ROOT.parents:
        if parent.name == "project":
            ROOT = parent
            break

# Define key directories inside project
DATA_DIR = ROOT / "data"
OUTPUT_DIR = ROOT / "outputs"
MODEL_DIR = ROOT / "models"
DOCS_DIR = ROOT / "docs"

# Ensure directories exist
for path in [DATA_DIR, OUTPUT_DIR, MODEL_DIR, DOCS_DIR]:
    path.mkdir(parents=True, exist_ok=True)

print(f"Project root: {ROOT}")
print(f"Data directory: {DATA_DIR}")
print(f"Output directory: {OUTPUT_DIR}")
print(f"Model directory: {MODEL_DIR}")
print(f"Docs directory: {DOCS_DIR}")


Note: you may need to restart the kernel to use updated packages.
Project root: C:\Users\eCron\OneDrive\Documents\ATU_CourseWork\Programming For Data Analytics\programming-for-data-analytics\project
Data directory: C:\Users\eCron\OneDrive\Documents\ATU_CourseWork\Programming For Data Analytics\programming-for-data-analytics\project\data
Output directory: C:\Users\eCron\OneDrive\Documents\ATU_CourseWork\Programming For Data Analytics\programming-for-data-analytics\project\outputs
Model directory: C:\Users\eCron\OneDrive\Documents\ATU_CourseWork\Programming For Data Analytics\programming-for-data-analytics\project\models
Docs directory: C:\Users\eCron\OneDrive\Documents\ATU_CourseWork\Programming For Data Analytics\programming-for-data-analytics\project\docs


### Step 3 – Utilise Helper Functions for Dublin Airport Data Processing

This section defines a set of reusable helper functions that simplify common tasks in the project.  
They are designed specifically to support the analysis of **Dublin Airport flight activity and weather data** by handling messy inputs and preparing clean datasets for exploration and modelling.

The functions help with:

- ✅ Detecting and parsing inconsistent datetime formats in flight and weather logs  
- ✅ Standardising and cleaning temperature and precipitation columns from Met Éireann datasets  
- ✅ Loading and preparing Dublin Airport daily weather data from local CSV files  
- ✅ Defining Irish seasonal boundaries (Winter, Spring, Summer, Autumn) for comparative analysis  
- ✅ Filtering weather data for a custom date range to align with flight events  
- ✅ Validating user-provided date inputs for reproducible analysis  
- ✅ Detecting header rows in raw CSV files downloaded from dashboards  
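
To make the first item concrete, here is a minimal, standard-library sketch of format detection: try each candidate format and keep the first one that parses enough of the samples. It mirrors the idea behind the notebook's helper but is not that helper; `detect_format` and the sample strings are made up for illustration.

```python
from datetime import datetime


def detect_format(samples, formats, min_match_ratio=0.7):
    """Return the first format that parses at least `min_match_ratio` of samples."""
    for fmt in formats:
        matches = 0
        for value in samples:
            try:
                datetime.strptime(value, fmt)
                matches += 1
            except ValueError:
                pass  # this sample does not fit the candidate format
        if samples and matches / len(samples) >= min_match_ratio:
            return fmt
    return None


# Three well-formed timestamps and one bad value (illustrative data)
samples = ["18/11/2025 14:30", "19/11/2025 06:05", "20/11/2025 23:59", "n/a"]
chosen = detect_format(samples, ["%Y-%m-%d %H:%M", "%d/%m/%Y %H:%M"])
print(chosen)
```

Tolerating a share of unparseable values is what lets a mostly clean column still be parsed with one explicit format instead of slower, more ambiguous inference.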

Each helper is **modular**: it performs one clear task and can be reused across notebooks and scripts.  
This improves readability, reduces duplication, and supports good programming practices for the final project.

📌 *Tip: These helpers are written to be beginner-friendly, with comments explaining their purpose and logic. They make it easier to align flight activity with weather conditions when investigating delays and cancellations.*

📖 References:  
- [Real Python – Python Helper Functions](https://realpython.com/defining-your-own-python-function/)  
- [GeeksforGeeks – Python Helper Functions](https://www.geeksforgeeks.org/python-helper-functions/)  
- [Wikipedia – DRY Principle (Don't Repeat Yourself)](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)


In [3]:
# 📂 Helper Functions for Dublin Airport Project
# These functions handle parsing dates, cleaning weather data, preparing ranges,
# defining Irish seasons, and detecting CSV headers.
# Keep them in one cell so they are easy to reuse across the notebook.

import pandas as pd
import warnings
from pathlib import Path
from calendar import monthrange

# 📅 Detect the most likely datetime format from sample strings
def detect_datetime_format(samples, formats, dayfirst=True, min_match_ratio=0.7, min_absolute=5):
    """
    Try each format and return the first one that parses at least
    'min_match_ratio' of the samples (70% by default) and no fewer than
    'min_absolute' of them. Helps ensure consistent parsing of date strings.
    """
    for fmt in formats:
        parsed = pd.to_datetime(samples, format=fmt, dayfirst=dayfirst, errors='coerce')
        matches = parsed.notna().sum()
        if matches >= max(min_absolute, int(len(samples) * min_match_ratio)):
            return fmt
    return None

# 📅 Parse a datetime column using format detection or fallback
def parse_datetime_column(df, date_col, candidate_formats=None, dayfirst=True):
    """
    Parse a datetime column using known formats.
    Falls back to flexible parsing if none match.
    """
    if candidate_formats is None:
        candidate_formats = [
            '%Y-%m-%d %H:%M:%S', '%Y-%m-%d %H:%M', '%d-%b-%Y %H:%M',
            '%d/%m/%Y %H:%M:%S', '%d/%m/%Y %H:%M', '%d-%m-%Y %H:%M',
            '%d %b %Y %H:%M', '%d %B %Y %H:%M',
        ]

    sample_vals = df[date_col].dropna().astype(str).head(100).tolist()
    chosen_fmt = detect_datetime_format(sample_vals, candidate_formats, dayfirst=dayfirst)

    if chosen_fmt:
        print(f"✅ Detected datetime format: {chosen_fmt}")
        return pd.to_datetime(df[date_col], format=chosen_fmt, dayfirst=dayfirst, errors='coerce')
    else:
        print("⚠️ No single format matched. Falling back to flexible parsing.")
        with warnings.catch_warnings():
            warnings.filterwarnings('ignore', message='Could not infer format')
            return pd.to_datetime(df[date_col], dayfirst=dayfirst, errors='coerce')

# 🌡️ Ensure temperature column is numeric and named 'temp'
def parse_temperature_column(df, col_name='temp'):
    """
    Convert the temperature column to numeric and rename it to 'temp'.
    If no exact match, look for any column containing 'temp'.
    """
    if col_name not in df.columns:
        col_name = next((c for c in df.columns if 'temp' in c.lower()), None)
        if col_name is None:
            raise KeyError("No temperature column found.")
    # Convert first, then drop the old name so only a numeric 'temp' remains
    df['temp'] = pd.to_numeric(df[col_name], errors='coerce')
    if col_name != 'temp':
        df.drop(columns=[col_name], inplace=True)
    return df

# 📂 Load cleaned weather data from local CSV
def load_cleaned_weather_data(filepath="data/dublin_airport_daily.csv"):
    """
    Load weather dataset from CSV and strip spaces from column names.
    """
    df = pd.read_csv(filepath, low_memory=False)
    df.columns = df.columns.str.strip()
    return df

# 🍂 Define Irish seasonal boundaries for a given year (leap year safe)
def define_irish_seasons(year=2025):
    """
    Return start and end dates for Irish meteorological seasons.
    Handles leap years correctly for February.
    """
    feb_days = monthrange(year, 2)[1]  # 28 or 29
    data = [
        ("Winter", pd.Timestamp(f"{year-1}-12-01"), pd.Timestamp(f"{year}-02-{feb_days} 23:59")),
        ("Spring", pd.Timestamp(f"{year}-03-01"), pd.Timestamp(f"{year}-05-31 23:59")),
        ("Summer", pd.Timestamp(f"{year}-06-01"), pd.Timestamp(f"{year}-08-31 23:59")),
        ("Autumn", pd.Timestamp(f"{year}-09-01"), pd.Timestamp(f"{year}-11-30 23:59")),
    ]
    return pd.DataFrame(data, columns=["season", "start", "end"])

# 📊 Filter and prepare weather data for a custom date range
def prepare_weather_data(df, start_date, end_date):
    """
    Filter weather data to a date range and add useful time features.
    Handles separate 'date' and 'time' columns if present.
    """
    df = df.copy()
    df.columns = df.columns.str.strip()

    if 'date' not in df.columns:
        raise KeyError("Expected 'date' column not found.")

    if 'time' in df.columns:
        dt_strings = df['date'].astype(str) + " " + df['time'].astype(str)
        df['datetime'] = pd.to_datetime(dt_strings, errors='coerce', dayfirst=True)
    else:
        df['datetime'] = pd.to_datetime(df['date'], errors='coerce', dayfirst=True)

    df = df.dropna(subset=['datetime'])
    mask = (df['datetime'] >= pd.to_datetime(start_date)) & (df['datetime'] <= pd.to_datetime(end_date))
    range_df = df.loc[mask].copy()

    # Add date and hour columns for plotting
    range_df['date'] = range_df['datetime'].dt.date
    range_df['hour'] = range_df['datetime'].dt.strftime('%H:%M')

    return range_df.sort_values('datetime').reset_index(drop=True)

# 📆 Convert user input strings into a validated date range
def get_custom_range(start_str, end_str):
    """
    Convert string inputs into datetime objects and validate order.
    """
    try:
        start = pd.to_datetime(start_str, dayfirst=True)
        end = pd.to_datetime(end_str, dayfirst=True)
        if start > end:
            raise ValueError("Start date must be before end date.")
        return start, end
    except Exception as e:
        print(f"❌ Invalid date range: {e}")
        return None, None

# 🔍 Detect the header row in a CSV file
def detect_header(lines, keywords=("station","date","rain","temp","wind")):
    """
    Detect the most likely header row in a CSV file.
    Looks for lines containing known weather keywords and multiple columns.
    """
    for i, line in enumerate(lines):
        line_lower = line.strip().lower()
        if any(line_lower.startswith(k) for k in keywords) and "," in line:
            columns = line.split(",")
            if len(columns) > 3:  # header rows usually have multiple columns
                return i
    print("⚠️ Warning: header row not found. Defaulting to first line.")
    return 0


### 📂 Step 4 – Download Dublin Airport Daily Data and Detect Header Row

In this step, the notebook retrieves the **Dublin Airport Daily Data CSV** directly from Met Éireann’s open data service.  
This dataset contains daily weather observations (e.g., precipitation, temperature, wind speed, radiation) recorded at Dublin Airport, which will later be aligned with flight activity logs to analyse flight punctuality.

The process includes:

- 🌐 **Downloading the raw CSV** from Met Éireann using the `requests` library.  
- 📂 **Defining a local output path** (`data/dublin_airport_daily.csv`) to store the file inside the project’s `data` folder.  
- ✅ **Checking the HTTP response** to ensure the download was successful.  
- 📑 **Splitting the file into lines** so the structure can be inspected before loading into pandas.  
- 🔍 **Detecting the header row** using the `detect_header` helper function defined earlier.  
  This ensures that column names (such as `date`, `maxtp`, `mintp`, `rain`, `wdsp`) are correctly identified even if the file contains metadata lines at the top.  
- 🖨️ **Printing the detected header row** to confirm the correct starting point for parsing.

📌 *Tip: Detecting the header row is important because Met Éireann CSVs often include metadata lines before the actual data table.  
By confirming the header row, you avoid misaligned columns and ensure clean parsing in later steps.*
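
The detection idea can be tried on a small in-memory sample. The text below is invented to resemble a file with metadata above the table; it is not real Met Éireann content, and `find_header_index` is a hypothetical stand-in for the notebook's `detect_header`. The column-count check matters because a metadata line can itself start with a keyword such as `date` or `rain`.

```python
# Invented sample: metadata lines followed by the real table
raw_text = """Station Name: EXAMPLE
Station Height: 71 M
date: - Date of observation
rain: - Precipitation Amount (mm)

date,ind,maxtp,ind,mintp,rain
01-jan-1942,0,9.7,0,6.8,0.0
"""


def find_header_index(lines, keywords=("station", "date", "rain"), min_columns=4):
    """Return the index of the first keyword-led line with enough comma-separated fields."""
    for index, line in enumerate(lines):
        stripped = line.strip().lower()
        if stripped.startswith(keywords) and len(stripped.split(",")) >= min_columns:
            return index
    return None  # let the caller decide what to do when nothing matches


sample_lines = raw_text.splitlines()
sample_header_index = find_header_index(sample_lines)
print(sample_header_index, sample_lines[sample_header_index])
```

Unlike the notebook helper, this sketch returns `None` instead of defaulting to line 0, which makes a failed detection harder to overlook.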


In [4]:
# 📂 Step 4 – Download Dublin Airport Daily Data CSV and Detect Header Row

from pathlib import Path
import requests

# --- Define output path for cleaned CSV ---
DATA_PATH = Path("data/dublin_airport_daily.csv")

# --- Download raw CSV from Met Éireann (Dublin Airport Daily Data) ---
url = "https://cli.fusio.net/cli/climate_data/webdata/dly532.csv"
response = requests.get(url, timeout=60)  # timeout avoids hanging on a stalled connection

# ✅ Check for successful response
if response.status_code != 200:
    raise RuntimeError(f"❌ Failed to download data: HTTP {response.status_code}")

# --- Split response into lines ---
lines = response.text.splitlines()

# --- Detect header row using helper function ---
header_index = detect_header(lines)

# ✅ Confirm detected header row
print(f"✅ Header row detected at line {header_index}:")
print(lines[header_index])


✅ Header row detected at line 25:
date,ind,maxtp,ind,mintp,igmin,gmin,ind,rain,cbl,wdsp,ind,hm,ind,ddhm,ind,hg,sun,dos,g_rad,soil,pe,evap,smd_wd,smd_md,smd_pd


### 📑 Step 4b – Load and Inspect Dublin Airport Daily Data

After detecting the correct header row in the raw CSV file, the next step is to **load the dataset into pandas**.  
This allows us to immediately inspect the structure of the Dublin Airport Daily Data and confirm that the columns (e.g., `date`, `maxtp`, `mintp`, `rain`, `wdsp`) are correctly aligned.

The process includes:

- 📂 Reading the CSV into a pandas DataFrame, starting from the detected header row  
- 🔍 Displaying the first few rows with `head()` to verify column names and sample values  
- 🧾 Using `info()` to check datatypes and identify potential missing values  
- 📊 Summarising numeric columns with `describe()` to get a quick statistical overview  

📌 *Why this matters:* Inspecting the dataset before saving ensures that the header detection worked correctly and that the file is ready for consistent downstream analysis.  
This step acts as a validation checkpoint before committing the cleaned file to the `data/` folder.
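
One thing worth checking during inspection is the repeated `ind` column name in the header; pandas keeps such duplicates apart by adding suffixes (`ind.1`, `ind.2`, and so on). The sketch below shows one possible way to give them descriptive names instead. It assumes each `ind` column describes the measurement column that follows it, which should be confirmed against the file's metadata lines; `label_indicator_columns` is a hypothetical helper.

```python
def label_indicator_columns(columns, indicator="ind"):
    """Rename each indicator column after the column that follows it."""
    labelled = []
    for position, name in enumerate(columns):
        has_next = position + 1 < len(columns)
        if name == indicator and has_next:
            labelled.append(f"{indicator}_{columns[position + 1]}")
        else:
            labelled.append(name)
    return labelled


# First part of the header row printed in Step 4
header = "date,ind,maxtp,ind,mintp,igmin,gmin,ind,rain,cbl,wdsp".split(",")
print(label_indicator_columns(header))
```

The resulting list could then be supplied to `pd.read_csv` through `names=` together with `header=0`, so the original header row is replaced rather than read as data.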


In [5]:
# 📑 Step 4b – Load and Inspect Dublin Airport Daily Data

# --- Load CSV into pandas using detected header row ---
df = pd.read_csv(
    url,  # still using the online source
    skiprows=header_index  # skip metadata lines before the header
)

# ✅ Inspect the first few rows
print("First 5 rows of Dublin Airport Daily Data:")
display(df.head())

# ✅ Check column types and missing values
print("\nDataFrame info:")
df.info()  # info() prints its report directly and returns None

# ✅ Quick statistical summary
print("\nSummary statistics:")
print(df.describe(include='all'))


First 5 rows of Dublin Airport Daily Data:


Unnamed: 0,date,ind,maxtp,ind.1,mintp,igmin,gmin,ind.2,rain,cbl,...,hg,sun,dos,g_rad,soil,pe,evap,smd_wd,smd_md,smd_pd
0,01-jan-1942,0,9.7,0,6.8,0,4.7,2,0.0,1020.3,...,,0.0,0,,,1.1,1.4,,,
1,02-jan-1942,0,9.9,0,7.9,0,6.7,0,0.1,1016.2,...,,0.0,0,,,0.7,0.9,,,
2,03-jan-1942,0,11.2,0,8.9,0,7.2,0,1.5,1006.8,...,,0.1,0,,,0.5,0.6,,,
3,04-jan-1942,0,9.2,0,2.7,0,3.4,0,3.5,1001.5,...,,0.6,0,,,0.6,0.7,,,
4,05-jan-1942,0,3.5,1,-0.8,0,0.0,0,0.6,1013.4,...,,3.4,0,,,0.6,0.7,,,



DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30620 entries, 0 to 30619
Data columns (total 26 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    30620 non-null  object 
 1   ind     30620 non-null  int64  
 2   maxtp   30620 non-null  float64
 3   ind.1   30620 non-null  int64  
 4   mintp   30620 non-null  float64
 5   igmin   30620 non-null  int64  
 6   gmin    30620 non-null  object 
 7   ind.2   30620 non-null  int64  
 8   rain    30620 non-null  float64
 9   cbl     30620 non-null  float64
 10  wdsp    30620 non-null  float64
 11  ind.3   30620 non-null  int64  
 12  hm      30620 non-null  object 
 13  ind.4   30620 non-null  int64  
 14  ddhm    30620 non-null  object 
 15  ind.5   30620 non-null  int64  
 16  hg      30620 non-null  object 
 17  sun     30620 non-null  float64
 18  dos     30620 non-null  object 
 19  g_rad   30620 non-null  object 
 20  soil    30620 non-null  object 
 21  pe      30620 non-

### 📁 Step 5 – Save the Cleaned Dublin Airport Daily Data CSV

After detecting the correct header row in the raw Met Éireann dataset, we now save a **cleaned version** of the Dublin Airport Daily Data file into the project’s `data/` folder.  

This step ensures:

- 📂 The dataset is stored locally for reuse without needing to re-download from Met Éireann each time  
- 📑 All future analysis references a consistent, structured version of the data (starting at the correct header row)  
- 🔄 The workflow remains reproducible and version-controlled, supporting transparent project documentation  
- 🛠️ Analysts and reviewers can always work from the same baseline dataset, avoiding inconsistencies caused by raw file metadata  

📌 *Why this matters:*  
Saving cleaned data locally is a best practice in data science. It keeps results consistent across runs, makes collaboration easier, and allows you to track changes over time.  
This approach supports reproducibility and version control in your Dublin Airport punctuality analysis.  

📖 Reference:  
- [GeeksforGeeks – Explain Data Versioning](https://www.geeksforgeeks.org/machine-learning/explain-data-versioning/)
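
The save step amounts to writing the header row and everything after it. A minimal sketch under stated assumptions follows: `save_from_header` is a hypothetical helper, the sample lines are invented, and a temporary directory stands in for the project's `data/` folder.

```python
import tempfile
from pathlib import Path


def save_from_header(lines, header_index, out_path):
    """Write the header row and everything after it; return the number of lines written."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    kept = lines[header_index:]
    out_path.write_text("\n".join(kept) + "\n", encoding="utf-8")
    return len(kept)


# Invented sample: two metadata lines, then header and one data row
raw_lines = ["Station: EXAMPLE", "", "date,rain", "01-jan-1942,0.0"]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "example_daily.csv"
    written = save_from_header(raw_lines, 2, target)
    saved_text = target.read_text(encoding="utf-8")

print(written)
print(saved_text)
```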


In [6]:
# 📁 Step 5 – Save the Cleaned CSV File

# --- Ensure 'data' folder exists ---
DATA_PATH.parent.mkdir(parents=True, exist_ok=True)

# --- Save cleaned data starting from the detected header row ---
with open(DATA_PATH, "w", encoding="utf-8") as f:
    for line in lines[header_index:]:
        f.write(line + "\n")

# ✅ Confirm save location
print(f"📁 Saved cleaned climate data for Dublin Airport to: {DATA_PATH.resolve()}")


📁 Saved cleaned climate data for Dublin Airport to: C:\Users\eCron\OneDrive\Documents\ATU_CourseWork\Programming For Data Analytics\programming-for-data-analytics\project\data\dublin_airport_daily.csv


### 📂 Step 6 – Validate Saved CSV Against Step 4 Output

Instead of re-printing the same inspection results, this step **confirms that the locally saved CSV file is identical to the dataset inspected in Step 4**.  

The process includes:

- 📂 Reloading the locally saved CSV (`data/dublin_airport_daily.csv`)  
- 🌐 Reloading the online CSV directly from Met Éireann (skipping metadata lines)  
- ✅ Comparing the two DataFrames with `equals()` to check for exact match  
- 📊 Printing a simple confirmation message and shape comparison  

📌 *Why this matters:* This validation supports reproducibility. It checks that the cleaned file saved in Step 5 is a faithful copy of the dataset originally inspected in Step 4.  
Reviewers can trust that all downstream analysis is based on the same consistent dataset.
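
`DataFrame.equals` compares the parsed tables: same shape, same elements and dtypes, with NaNs in the same position counted as equal. A complementary check, not used in the notebook, is to compare the text itself with a hash. The sketch below is one way to do that with the standard library; `sha256_of_lines` is a hypothetical helper and the sample rows are invented.

```python
import hashlib


def sha256_of_lines(lines):
    """Hash lines with normalised line endings so CRLF/LF differences do not matter."""
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.rstrip("\r\n").encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


downloaded = ["date,rain", "01-jan-1942,0.0"]
saved = "date,rain\r\n01-jan-1942,0.0\r\n".splitlines()  # same content, Windows line endings
altered = ["date,rain", "01-jan-1942,0.1"]               # one value changed

same_as_saved = sha256_of_lines(downloaded) == sha256_of_lines(saved)
same_as_altered = sha256_of_lines(downloaded) == sha256_of_lines(altered)
print(same_as_saved, same_as_altered)
```

Storing the digest alongside the CSV would also allow later runs to detect whether the source file has changed since it was saved.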

[https://www.geeksforgeeks.org/create-effective-and-reproducible-code-using-pandas/](https://www.geeksforgeeks.org/create-effective-and-reproducible-code-using-pandas/)

In [7]:
# 📂 Step 6 – Validate Saved CSV Against Step 4 Output

# --- Reload the locally saved CSV ---
df_local = pd.read_csv(DATA_PATH)

# --- Reload the online CSV (using header_index from Step 4) ---
df_online = pd.read_csv(url, skiprows=header_index)

# ✅ Compare the two DataFrames
if df_local.equals(df_online):
    print("✅ Validation successful: Local CSV matches the online dataset from Step 4.")
else:
    print("❌ Validation failed: Local CSV differs from the online dataset.")

# Optional: show shape comparison
print(f"Local shape: {df_local.shape}, Online shape: {df_online.shape}")


✅ Validation successful: Local CSV matches the online dataset from Step 4.
Local shape: (30620, 26), Online shape: (30620, 26)


### 📑 Step 7 – Download and Save Flight Activity Data

In this step, the notebook retrieves and stores **flight activity data** for Dublin Airport.  
This dataset will later be aligned with Met Éireann weather observations to analyse how conditions such as rain, wind, and visibility impact flight punctuality.

The process includes:

- 🌐 Collecting flight schedules and activity logs (arrivals, departures, delays, cancellations) from public APIs or dashboards  
- 📂 Defining a local output path (`data/dublin_airport_flights.csv`) to store the file inside the project’s `data` folder  
- ✅ Checking the response to ensure the download or export was successful  
- 📑 Parsing the raw data into a structured format, including scheduled vs actual times and delay minutes  
- 📁 Saving a cleaned version of the dataset locally for reproducibility and future analysis  

📌 *Why this matters:* Having flight activity data stored locally ensures that the project can consistently align flight events with weather conditions.  
It also supports reproducibility, version control, and enables predictive modelling of delays and cancellations without repeatedly querying external APIs.
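
As an illustration of the "scheduled vs actual times and delay minutes" item, delay can be derived by subtracting the two timestamps. The sketch below is an assumption about how this might be done: `delay_minutes`, the record keys, and the timestamp format are hypothetical and would need to match the real flight data.

```python
from datetime import datetime


def delay_minutes(scheduled, actual, fmt="%Y-%m-%dT%H:%M:%S"):
    """Return actual minus scheduled in whole minutes, or None if either value is missing."""
    if not scheduled or not actual:
        return None  # e.g. a cancelled flight has no actual time
    delta = datetime.strptime(actual, fmt) - datetime.strptime(scheduled, fmt)
    return int(delta.total_seconds() // 60)


# Hypothetical record layout for illustration only
record = {"scheduled": "2025-11-18T14:30:00", "actual": "2025-11-18T15:05:00"}
late_by = delay_minutes(record["scheduled"], record["actual"])
print(late_by)
```

A negative result means the flight ran ahead of schedule, so early arrivals stay distinguishable from on-time ones.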


### 📑 Step 8 – Dublin Airport flight information analysis

This cell prepares the environment for **Dublin Airport flight information analysis** by defining key date ranges and output directories:

- 🗓️ **Date range:**  
  - Calculates today’s date and subtracts ~six months (182 days) to define the analysis window.  
  - Converts both dates into ISO format (`YYYY-MM-DD`) for use in API queries.  
  - These values (`DATE_FROM`, `DATE_TO`) specify the six-month period of **flight activity data** (arrivals, departures, delays, cancellations) to be downloaded.

- 📂 **Output directories:**  
  - Creates a root `data/` folder for project storage.  
  - Inside it, a `raw_flights/` subfolder is created to hold raw JSON files retrieved from the Aviation Edge API.  
  - This ensures reproducibility and a clear separation between raw flight inputs and processed datasets.

- ✅ **Checkpoint:**  
  - Prints the computed date range so you can confirm the correct six-month window before downloading flight information.
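
Because the window is anchored to `date.today()`, it shifts every time the notebook runs. The sketch below shows one possible way to make the window reproducible by accepting an explicit reference date; `six_month_window` is a hypothetical helper, not part of the notebook.

```python
from datetime import date, timedelta


def six_month_window(reference=None, days=182):
    """Return (date_from, date_to) as ISO strings, counting back `days` from `reference`."""
    reference = reference or date.today()
    start = reference - timedelta(days=days)
    return start.isoformat(), reference.isoformat()


# Pinning the reference date reproduces the window from the 18 November 2025 run
window = six_month_window(date(2025, 11, 18))
print(window)  # ('2025-05-20', '2025-11-18')
```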


In [9]:

# --- Compute date range for the past six months ---
today = date.today()
six_months_ago = today - timedelta(days=182)  # approx 6 months

DATE_FROM = six_months_ago.isoformat()
DATE_TO = today.isoformat()

# --- Output directories ---
DATA_DIR = Path("data")
RAW_DIR = DATA_DIR / "raw_flights"
RAW_DIR.mkdir(parents=True, exist_ok=True)

print(f"Date range: {DATE_FROM} to {DATE_TO}")


Date range: 2025-05-20 to 2025-11-18


### ✈️ Step 9 – Download Six Months of Flight History for Dublin (Arrivals and Departures)

In this step we use **Aviation Edge’s Flights History API** to collect six months of flight schedules for Dublin Airport (IATA: DUB).  
The endpoint provides detailed records for each flight, including:

- **Scheduled, estimated, and actual times** (departure and arrival)
- **Delay minutes** (either reported or inferred)
- **Flight status** (e.g., scheduled, landed, cancelled, diverted)
- **Airline and flight identifiers**

We request **both arrivals and departures** for the date range **2025-05-20 to 2025-11-18**, which covers the six months up to the day the download was run.  
The raw JSON files are saved for reproducibility in the folder:

- `data/raw_flights/dub_arrival_history.json`  
- `data/raw_flights/dub_departure_history.json`

Additionally, a `fetch_log.txt` file is generated to record progress, errors, and confirmation of successful downloads.  
This log provides transparency and makes troubleshooting easier if API requests fail or return incomplete data.

**Important notes for reproducibility:**
- The code cell was executed on **18 November 2025** using a private API key from Aviation Edge.
- To run the download yourself, you must:
  1. Sign up for an account at [aviation-edge.com](https://aviation-edge.com/) and obtain an API key.
  2. Store the key securely (e.g., as an environment variable).
  3. Set the notebook control flag `RUN_DOWNLOAD = True` to enable downloading.
- By default, the notebook will skip downloading if `RUN_DOWNLOAD = False`, and instead use the existing JSON files.  
  This prevents unnecessary API calls and ensures consistent results for reviewers.
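
A small sketch of the "store the key securely" advice: read the key from an environment variable, fail early if it is absent, and mask it before printing. The variable name `AVIATION_EDGE_API_KEY` is the one the notebook reads; `read_api_key` and `mask` are hypothetical helpers, and the key shown is fake.

```python
import os


def read_api_key(var_name="AVIATION_EDGE_API_KEY", environ=None):
    """Return the key from the environment, or raise if it is missing or blank."""
    environ = os.environ if environ is None else environ
    key = (environ.get(var_name) or "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set.")
    return key


def mask(key, visible=4):
    """Show only the last few characters, for use in logs and printed output."""
    return "*" * max(len(key) - visible, 0) + key[-visible:]


# A dict stands in for os.environ so the example runs without a real key
fake_env = {"AVIATION_EDGE_API_KEY": "abc123-not-a-real-key"}
print(mask(read_api_key(environ=fake_env)))
```

Keeping the key out of the notebook source also keeps it out of version control and out of any exported copy of the notebook.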

⚠️ **Best practice:** Only re-run the download when you want to refresh the dataset.  
Frequent downloads are unnecessary and may exceed API rate limits.
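
When a request does fail, waiting progressively longer between attempts is gentler on rate limits than retrying immediately. The sketch below illustrates exponential backoff with a fake request function and an injectable `sleep`, so it runs instantly; `call_with_backoff` and `flaky_request` are hypothetical names, and `retries` is assumed to be at least 1.

```python
import time


def call_with_backoff(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `func` until it succeeds or `retries` attempts are used up.

    Waits base_delay * 2**attempt between attempts (1s, 2s, 4s, ...).
    """
    last_error = None
    for attempt in range(retries):
        try:
            return func()
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Fake request that fails twice, then succeeds
attempts = {"count": 0}
waits = []


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return ["record"]


result = call_with_backoff(flaky_request, sleep=waits.append)
print(result, waits)
```

Passing `waits.append` in place of `time.sleep` records the delays that would have been applied without actually pausing.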

**References:**
- [Aviation Edge official site](https://aviation-edge.com/)  
- [Aviation Edge API documentation on GitHub](https://github.com/AviationEdgeAPI/Aviation-Edge-Complete-API)




In [10]:
# --- API setup ---
API_KEY = os.getenv("AVIATION_EDGE_API_KEY")   # Read API key from environment variable
if not API_KEY:
    raise RuntimeError("API key not found. Please set AVIATION_EDGE_API_KEY.")

BASE_URL = "https://aviation-edge.com/v2/public/flightsHistory"  # Endpoint for flight history
IATA_CODE = "DUB"  # Airport code for Dublin

# --- Directory setup ---
DATA_DIR = Path("data")              # Root data folder
RAW_DIR = DATA_DIR / "raw_flights"   # Subfolder for raw flight data
RAW_DIR.mkdir(parents=True, exist_ok=True)  # Create folders if missing

# --- Log file path ---
LOG_FILE = RAW_DIR / "fetch_log.txt"  # Text log for progress and errors

def log_message(message: str):
    """Print message and append to log file for tracking progress."""
    print(message)
    with open(LOG_FILE, "a", encoding="utf-8") as log:
        log.write(message + "\n")

def fetch_day(iata_code: str, flight_type: str, day: date, retries: int = 3):
    """
    Fetch flight history for a single day (arrival/departure).
    Retries up to 'retries' times if errors occur.
    """
    params = {
        "key": API_KEY,
        "code": iata_code,
        "type": flight_type,
        "date_from": day.isoformat(),
        "date_to": day.isoformat()
    }

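    for attempt in range(retries):
        wait = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s, ...
        try:
            resp = requests.get(BASE_URL, params=params, timeout=60)  # API request
        except requests.RequestException as exc:
            # Timeouts and connection errors are retried like HTTP errors
            log_message(f"⚠️ Request error on {day} (attempt {attempt+1}/{retries}): {exc}. Retrying in {wait}s...")
            time.sleep(wait)
            continue

        if resp.status_code == 200:
            try:
                data = resp.json()  # Parse JSON response
            except ValueError:
                log_message(f"⚠️ Non-JSON response on {day}: {resp.text[:200]}")
                return []
            if not isinstance(data, list):
                # Defensive check: a non-list payload (e.g. an error object) is not flight data
                log_message(f"⚠️ Unexpected response on {day}: {str(data)[:200]}")
                return []
            log_message(f"✅ {flight_type.capitalize()} {day}: {len(data)} records fetched")
            return data

        log_message(f"⚠️ Error {resp.status_code} on {day} (attempt {attempt+1}/{retries}). Retrying in {wait}s...")
        time.sleep(wait)

    log_message(f"❌ Failed after {retries} retries on {day}")
    return []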

def fetch_history(iata_code: str, flight_type: str, start_date: date, end_date: date):
    """
    Loop through each day in the date range and fetch daily history.
    Append results into one cumulative JSON file (avoids overwriting with empty data).
    """
    results = []
    total_days = (end_date - start_date).days + 1
    filename = RAW_DIR / f"{iata_code.lower()}_{flight_type}_history.json"

    # Load existing cumulative file if present
    if filename.exists():
        with open(filename, "r", encoding="utf-8") as f:
            try:
                results = json.load(f)
            except Exception:
                results = []
                log_message(f"⚠️ Existing {filename.name} could not be read, starting fresh.")

    # Loop through each day in range
    for i in range(total_days):
        day = start_date + timedelta(days=i)
        log_message(f"Day {i+1}/{total_days}: {day}")
        day_data = fetch_day(iata_code, flight_type, day)

        # Save only if data was fetched
        if day_data:
            results.extend(day_data)
            with open(filename, "w", encoding="utf-8") as f:
                json.dump(results, f, ensure_ascii=False, indent=2)
            log_message(f"💾 Saved {len(day_data)} records for {day} into {filename.name}")
        else:
            log_message(f"⏩ Skipped saving {day}, no data returned")

        time.sleep(1)  # Pause politely between requests

    return results

# --- Conditional download control ---
if RUN_DOWNLOAD:
    # Define date range (last ~6 months)
    today = date.today()
    start_date = today - timedelta(days=182)
    end_date = today

    log_message(f"Fetching flights from {start_date} to {end_date} for {IATA_CODE}...")

    # Fetch arrivals and departures
    arrivals = fetch_history(IATA_CODE, "arrival", start_date, end_date)
    departures = fetch_history(IATA_CODE, "departure", start_date, end_date)

    log_message(f"✅ Completed: {len(arrivals)} arrivals and {len(departures)} departures fetched.")
else:
    log_message("⏩ Skipping download step (RUN_DOWNLOAD=False). Using existing JSON files.")


⏩ Skipping download step (RUN_DOWNLOAD=False). Using existing JSON files.
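
Because `fetch_history` reloads the cumulative file and then extends it, a second run over an overlapping date range appends the same days again. The sketch below shows one possible way to drop repeats before analysis. The helper name `dedupe_records` and the choice of key (record type, `flight.iataNumber`, and the scheduled time of the matching leg) are assumptions made for illustration and are not part of the notebook. The sample records are hypothetical, built from values visible in the outputs further down.

```python
def dedupe_records(records):
    """Return records with repeats removed, keeping the first of each key."""
    seen = set()
    unique = []
    for rec in records:
        flight_type = rec.get("type")
        leg = rec.get(flight_type) or {}    # the "arrival" or "departure" block
        flight = rec.get("flight") or {}
        # Assumed identity of a flight record: type + flight number + scheduled time
        key = (flight_type, flight.get("iataNumber"), leg.get("scheduledTime"))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical records shaped like the API output inspected later in the notebook
sample = [
    {"type": "arrival", "flight": {"iataNumber": "fr1739"},
     "arrival": {"scheduledTime": "2025-05-20t01:00:00.000"}},
    {"type": "arrival", "flight": {"iataNumber": "fr9612"},
     "arrival": {"scheduledTime": "2025-05-20t01:10:00.000"}},
]
doubled = sample + sample            # simulates downloading the same day twice
cleaned = dedupe_records(doubled)
print(len(doubled), len(cleaned))
```

Records that lack all three key fields would collapse into a single entry under this key, so a stricter key may be needed for real data.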


### 📂 Step 10 – Inspect Headings of Downloaded JSON Files

Before tidying the flight history data, it's important to **inspect the structure of the raw JSON files**.  
The Aviation Edge API responses can vary depending on whether the file contains arrivals or departures, and not all fields are always present.

**What this step does:**
- Loads the raw JSON files for Dublin Airport arrivals (`dub_arrival_history.json`) and departures (`dub_departure_history.json`).
- Uses `pandas.json_normalize` to flatten the nested JSON into a tabular structure.
- Prints out all available column headings so we can see which fields exist.
- Shows a sample record (truncated for readability) to preview the nested structure.

**Why this matters:**
- Helps identify which fields are consistently available and relevant for analysis.
- Prevents errors later by ensuring we only select columns that actually exist.
- Guides the design of the tidy DataFrame schema in Step 11 (e.g. keeping scheduled/actual times, delays, status, airline, etc., while dropping baggage or codeshare metadata).

👉 This inspection step is a diagnostic tool: it gives us visibility into the raw data so we can confidently build the parsing logic in the next step.
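
As a small illustration of what flattening produces and how field availability can be checked, the sketch below normalises two hypothetical records (their values are taken from the outputs shown in this notebook, but the records themselves are simplified assumptions). The variable `coverage` is an illustrative name; it holds the share of non-null values per flattened column.

```python
import pandas as pd

# Two simplified, hypothetical records; the second has no "delay" field
records = [
    {"type": "arrival", "status": "landed",
     "arrival": {"scheduledTime": "2025-05-20t01:00:00.000", "delay": 15},
     "flight": {"iataNumber": "fr1739"}},
    {"type": "arrival", "status": "landed",
     "arrival": {"scheduledTime": "2025-05-20t01:10:00.000"},
     "flight": {"iataNumber": "fr9612"}},
]

# Nested keys become dotted column names such as "arrival.delay"
flat = pd.json_normalize(records)
print(sorted(flat.columns.tolist()))

# Share of records in which each field is present (1.0 = always present)
coverage = flat.notna().mean().sort_values()
print(coverage)
```

Columns with low coverage are candidates for dropping, or for a fallback calculation, when the tidy schema is designed.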


In [11]:
# Step 10 - inspect headings of downloaded JSON files
RAW_DIR = Path("data") / "raw_flights"
ARR_FILE = RAW_DIR / "dub_arrival_history.json"
DEP_FILE = RAW_DIR / "dub_departure_history.json"

def inspect_keys(json_file, sample_size=50):
    """Inspect nested keys in a JSON file by sampling records."""
    with open(json_file, "r", encoding="utf-8") as f:
        records = json.load(f)

    # Use pandas.json_normalize to flatten structure
    import pandas as pd
    df = pd.json_normalize(records)

    # Show all column headings
    print(f"\n--- Keys in {json_file.name} ---")
    print(sorted(df.columns.tolist()))

    # Preview the first record, if the file is not empty (truncated for readability)
    if records:
        print("\nSample record:")
        print(json.dumps(records[0], indent=2)[:500])

# Inspect both files
inspect_keys(ARR_FILE)
inspect_keys(DEP_FILE)



--- Keys in dub_arrival_history.json ---
['airline.iataCode', 'airline.icaoCode', 'airline.name', 'arrival.actualRunway', 'arrival.actualTime', 'arrival.baggage', 'arrival.delay', 'arrival.estimatedRunway', 'arrival.estimatedTime', 'arrival.gate', 'arrival.iataCode', 'arrival.icaoCode', 'arrival.scheduledTime', 'arrival.terminal', 'codeshared.airline.iataCode', 'codeshared.airline.icaoCode', 'codeshared.airline.name', 'codeshared.flight.iataNumber', 'codeshared.flight.icaoNumber', 'codeshared.flight.number', 'departure.actualRunway', 'departure.actualTime', 'departure.delay', 'departure.estimatedRunway', 'departure.estimatedTime', 'departure.gate', 'departure.iataCode', 'departure.icaoCode', 'departure.scheduledTime', 'departure.terminal', 'flight.iataNumber', 'flight.icaoNumber', 'flight.number', 'status', 'type']

Sample record:
{
  "type": "arrival",
  "status": "landed",
  "departure": {
    "iataCode": "vlc",
    "icaoCode": "levc",
    "terminal": "1",
    "gate": "3",
    "dela

### 📂 Step 11 – Parse Flight History JSON into Tidy DataFrames

This step takes the raw JSON files downloaded from the Aviation Edge API  
(`dub_arrival_history.json` and `dub_departure_history.json`) and converts them into clean, analysis-ready DataFrames.

**What the code does:**
- Loads the raw JSON files for arrivals and departures.
- Flattens the nested JSON structure into tabular form using `pandas.json_normalize`.
- Keeps only the **relevant fields** for weather analysis:
  - Flight number (`flight_iata`)
  - Airline name
  - Flight status (landed, cancelled, active, etc.)
  - Scheduled, estimated, and actual times
  - Delay (either reported by API or calculated if missing)
  - Optional operational details (terminal, and the API's `actualRunway` field, which holds a timestamp rather than a runway identifier)
- Converts timestamps into proper datetime objects.
- Derives `date` and `time` columns to align flights with hourly weather data.
- Adds a `type` column to distinguish arrivals vs departures.
- Returns three DataFrames:
  - `df_arrivals` → tidy arrivals
  - `df_departures` → tidy departures
  - `df_all` → combined dataset of both arrivals and departures

**Why this matters:**
- The raw JSON files are large and verbose; this step filters them down to only the fields needed for analysis.
- The tidy DataFrames provide a consistent schema that can be merged directly with weather observations.
- Having both separate and combined views makes it easy to analyse arrivals and departures independently or together.
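
To make the alignment with hourly weather data concrete, here is a minimal sketch that floors each scheduled time to the hour and joins on that key. This is an alternative to merging on separate `date` and `time` columns, not the notebook's own method. The weather table, its column names (`obs_hour`, `rain_mm`), and its values are invented purely for illustration.

```python
import pandas as pd

# Two flights with scheduled times as they appear in the tidy output
flights = pd.DataFrame({
    "flight_iata": ["fr1739", "fr9612"],
    "sched": pd.to_datetime(["2025-05-20 01:00:00", "2025-05-20 01:10:00"]),
})

# Hypothetical hourly weather observations (names and values are assumptions)
weather = pd.DataFrame({
    "obs_hour": pd.to_datetime(["2025-05-20 01:00:00", "2025-05-20 02:00:00"]),
    "rain_mm": [0.2, 0.0],
})

# Floor each scheduled time to the start of its hour, then left-join the weather
flights["obs_hour"] = flights["sched"].dt.floor("h")
merged = flights.merge(weather, on="obs_hour", how="left")
print(merged)
```

A left join keeps every flight even when no weather observation exists for its hour, which makes gaps in the weather record visible as missing values.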



In [12]:
# 📂 Step 11 – Parse Flight History JSON into Tidy DataFrames

RAW_DIR = Path("data") / "raw_flights"
ARR_FILE = RAW_DIR / "dub_arrival_history.json"
DEP_FILE = RAW_DIR / "dub_departure_history.json"

def parse_flights(json_file, flight_type="arrival"):
    """
    Load a JSON file (arrivals or departures) and return a tidy DataFrame.
    Drops irrelevant fields and keeps only those useful for weather analysis.
    """
    with open(json_file, "r", encoding="utf-8") as f:
        records = json.load(f)

    if not records:  # safeguard against empty files
        return pd.DataFrame()

    df = pd.json_normalize(records)

    # Column mapping based on flight type
    if flight_type == "arrival":
        cols_map = {
            "flight.iataNumber": "flight_iata",
            "airline.name": "airline",
            "status": "status",
            "arrival.scheduledTime": "sched",
            "arrival.estimatedTime": "est",
            "arrival.actualTime": "act",
            "arrival.delay": "delay",
            "arrival.terminal": "terminal",
            "arrival.actualRunway": "runway"
        }
    else:  # departure
        cols_map = {
            "flight.iataNumber": "flight_iata",
            "airline.name": "airline",
            "status": "status",
            "departure.scheduledTime": "sched",
            "departure.estimatedTime": "est",
            "departure.actualTime": "act",
            "departure.delay": "delay",
            "departure.terminal": "terminal",
            "departure.actualRunway": "runway"
        }

    # Keep only available columns
    available = {k: v for k, v in cols_map.items() if k in df.columns}
    df = df[list(available.keys())].rename(columns=available)

    # Parse timestamps
    for c in ["sched", "est", "act"]:
        if c in df.columns:
            df[c] = pd.to_datetime(df[c], errors="coerce")

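    # Delay in minutes: keep the API value where present, otherwise fall back per row
    # to actual minus scheduled time (negative values mean the flight ran early)
    if {"act", "sched"}.issubset(df.columns):
        computed = (df["act"] - df["sched"]).dt.total_seconds() / 60.0
    else:
        computed = pd.Series(float("nan"), index=df.index)
    reported = df["delay"] if "delay" in df.columns else computed
    df["delay_calc"] = pd.to_numeric(reported, errors="coerce").fillna(computed)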

    # Merge keys for weather alignment
    df["date"] = df["sched"].dt.date
    df["time"] = df["sched"].dt.strftime("%H:%M")
    df["is_cancelled"] = df["status"].str.lower().eq("cancelled")

    df["type"] = flight_type  # add explicit type column

    return df

# --- Load both arrivals and departures ---
df_arrivals = parse_flights(ARR_FILE, "arrival")
df_departures = parse_flights(DEP_FILE, "departure")

# --- Optionally combine into one DataFrame ---
df_all = pd.concat([df_arrivals, df_departures], ignore_index=True)

print("Arrivals shape:", df_arrivals.shape)
print("Departures shape:", df_departures.shape)
print("Combined shape:", df_all.shape)

print(df_all.head())


Arrivals shape: (131556, 14)
Departures shape: (137720, 14)
Combined shape: (269276, 14)
  flight_iata            airline  status               sched  \
0      fr1739            ryanair  landed 2025-05-20 01:00:00   
1      fr9612            ryanair  landed 2025-05-20 01:10:00   
2       fr651            ryanair  landed 2025-05-20 01:15:00   
3      aa8330  american airlines  landed 2025-05-20 04:25:00   
4      ba6124    british airways  landed 2025-05-20 04:25:00   

                  est                 act  delay terminal  \
0 2025-05-20 01:15:00 2025-05-20 01:15:00   15.0       t1   
1 2025-05-20 01:11:00 2025-05-20 01:03:00    NaN       t1   
2 2025-05-20 01:12:00 2025-05-20 01:05:00    NaN       t1   
3 2025-05-20 03:39:00 2025-05-20 03:39:00    NaN       t2   
4 2025-05-20 03:39:00 2025-05-20 03:39:00    NaN       t2   

                    runway  delay_calc        date   time  is_cancelled  \
0  2025-05-20t01:15:00.000        15.0  2025-05-20  01:00         False   
1  2025-0

### 📂 Step 12 – Save Tidy Flight DataFrames

In this step, the cleaned flight DataFrames created in Step 11 are written out to disk.  
Saving them ensures that the tidy, analysis-ready datasets are preserved and can be reused without re-parsing the raw JSON files.

**What is saved:**
- `dub_arrivals_tidy.csv` → arrivals into Dublin Airport, with relevant fields (flight number, airline, status, times, delay, etc.).
- `dub_departures_tidy.csv` → departures from Dublin Airport, with the same tidy structure.
- `dub_flights_tidy.csv` → combined dataset containing both arrivals and departures, distinguished by a `type` column.

**Why this matters:**
- The tidy CSVs are much smaller and easier to work with than the raw JSON files.
- They provide a consistent schema for merging with weather data (using `date` and `time`).
- Reviewers and collaborators can immediately use these files without needing to rerun the full API download.

👉 This step finalises the preprocessing pipeline: raw JSON → tidy DataFrames → reusable CSVs in the `data/` folder.
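
One thing to keep in mind when reusing the CSVs is that CSV does not store column types, so timestamps come back as plain text unless they are parsed on load. The sketch below demonstrates this with an in-memory buffer instead of a file on disk; the single-row DataFrame is a hypothetical stand-in for the tidy data.

```python
import io
import pandas as pd

# Hypothetical one-row stand-in for a tidy flights DataFrame
df = pd.DataFrame({
    "flight_iata": ["fr1739"],
    "sched": pd.to_datetime(["2025-05-20 01:00:00"]),
    "act": pd.to_datetime(["2025-05-20 01:15:00"]),
    "delay_calc": [15.0],
})

# Write to an in-memory buffer (stands in for a CSV file under data/)
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default load: timestamp columns are read back as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Load with parse_dates: timestamp columns are restored as datetimes
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["sched", "act"])

print(plain.dtypes)
print(restored.dtypes)
```

The same applies to the derived `date` column, which would also be read back as text unless it is converted after loading.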


In [13]:
# 📂 Step 12 – Save Tidy Flight DataFrames

DATA_DIR = Path("data")

# Save arrivals
df_arrivals.to_csv(DATA_DIR / "dub_arrivals_tidy.csv", index=False)

# Save departures
df_departures.to_csv(DATA_DIR / "dub_departures_tidy.csv", index=False)

# Save combined dataset
df_all.to_csv(DATA_DIR / "dub_flights_tidy.csv", index=False)

print("✅ Saved tidy datasets into data/ folder")


✅ Saved tidy datasets into data/ folder
