# Flight Punctuality Analysis at Dublin Airport

This project examines how weather conditions influence flight punctuality at Dublin Airport.  
The analysis combines flight activity data (arrivals, departures, delays, cancellations) with historical and forecast weather data from Met Éireann to identify trends, quantify the impact of adverse conditions, and project future delay probabilities.
 
By aligning operational flight records with local weather observations, the study provides insights into how rain, wind, and visibility affect airport performance and passenger reliability.


### Notebook Control Flag Explanation

This notebook contains code to download flight history data from the Aviation Edge API.  
Because downloading six months of data can take a long time and may stress the API, we use a **control flag** called `RUN_DOWNLOAD` to decide whether the download should run.

- **RUN_DOWNLOAD = False** → The download section is skipped.  
  Use this setting when you want to run analysis, visualizations, or other notebook functionality without refreshing the data.

- **RUN_DOWNLOAD = True** → The download section executes.  
  Use this setting only when you deliberately want to refresh the flight history data and update the cumulative JSON files.

This design ensures:
- The notebook can be safely re-run without triggering unwanted downloads.
- Existing JSON files are preserved and can be loaded for analysis.
- You have full control over when heavy API calls are made.

👉 In practice: keep `RUN_DOWNLOAD = False` most of the time, and flip it to `True` only when you need new data.
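
As a minimal sketch of how such a flag can gate an expensive step, the snippet below wraps the download in a function that only runs when the flag is `True`. The names `refresh_if_enabled` and `fake_download` are hypothetical stand-ins and are not part of this notebook; the real download code may be organised differently.

```python
from typing import Callable, Optional

RUN_DOWNLOAD = False  # same flag name as in the notebook


def refresh_if_enabled(run_download: bool, download: Callable[[], int]) -> Optional[int]:
    """Call `download` only when the flag is True; otherwise skip it."""
    if not run_download:
        print("RUN_DOWNLOAD is False: skipping download, using existing files.")
        return None
    return download()


def fake_download() -> int:
    """Hypothetical stand-in for the real API call; returns a record count."""
    return 42


skipped = refresh_if_enabled(RUN_DOWNLOAD, fake_download)  # download not called
ran = refresh_if_enabled(True, fake_download)              # download called
```

Passing the flag as an argument rather than reading a global inside the function makes the skip logic easy to check in isolation.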


In [1]:
# --- Control flag to enable/disable data refresh ---
RUN_DOWNLOAD = False   # Change to True only when you want to refresh data

### 📦 Step 2 – Install and Import Required Libraries

This step prepares the environment for the Dublin Airport Flight Rerouting Project.  
It ensures that all required Python packages are available and sets up the project’s directory structure inside the `project` root.

The notebook imports essential libraries for:

- 📊 **Data manipulation** (`pandas`, `numpy`)
- 📅 **Date and time handling** (`datetime`, `matplotlib.dates`)
- 📈 **Plotting and visualisation** (`matplotlib`, `seaborn`, `plotly`)
- 🤖 **Machine learning and model persistence** (`scikit-learn`, `joblib`)
- 📂 **File handling and paths** (`os`, `pathlib`, `json`)
- 🌐 **Web access** (`requests`)
- 🧩 **Interactivity and display** (`ipywidgets`, `IPython.display`)

It also defines key directories (`data`, `outputs`, `models`, `docs`) inside the `project` folder and ensures they exist.  
This structure keeps raw data, processed outputs, trained models, and documentation organised and reproducible.

📌 *Note: `%pip install` commands can be used inside Jupyter notebooks if a package is missing.  
For scripts or terminal use, run `pip install` directly.*
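
One way to find out which packages are missing before installing anything is to ask the import system whether each module can be located. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `find_missing` and the placeholder module name are illustrative only. Note that import names can differ from install names (for example, `scikit-learn` is imported as `sklearn`).

```python
import importlib.util


def find_missing(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


# 'json' and 'pathlib' ship with Python; the last name is a deliberately bogus placeholder.
missing = find_missing(["json", "pathlib", "a_module_that_does_not_exist_xyz"])
print(missing)
```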


In [2]:
%pip install plotly --quiet

# --- Core Python modules ---
import json              # config files / JSON handling
import os                # operating system interactions
import time              # time management
import warnings          # manage warnings
from datetime import date, timedelta  # date calculations
from pathlib import Path              # path management
from calendar import monthrange       # leap-year safe month calculations

# --- Data science / numerical libraries ---
import numpy as np       # numerical operations
import pandas as pd      # data manipulation

# --- Plotting libraries ---
import matplotlib.pyplot as plt   # static plotting
import plotly.express as px       # interactive plotting
import seaborn as sns             # enhanced plotting

# --- Machine learning libraries ---
import joblib                     # model persistence
from sklearn.ensemble import GradientBoostingClassifier   # example model
from sklearn.linear_model import LogisticRegression       # example model
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score
)  # model evaluation
from sklearn.model_selection import (
    cross_val_score,
    train_test_split
)  # model validation

# --- API / external requests ---
import requests                   # API calls

# --- Plotting style ---
sns.set_theme(style='whitegrid')

# --- Explicit project root: programming-for-data-analytics/project ---
ROOT = Path.cwd().resolve()
if ROOT.name != "project":
    # climb up until we find project folder
    for parent in ROOT.parents:
        if parent.name == "project":
            ROOT = parent
            break

# --- Define key directories inside project ---
DATA_DIR = ROOT / "data"
OUTPUT_DIR = ROOT / "outputs"
MODEL_DIR = ROOT / "models"
DOCS_DIR = ROOT / "docs"

# --- Ensure directories exist ---
for path in [DATA_DIR, OUTPUT_DIR, MODEL_DIR, DOCS_DIR]:
    path.mkdir(parents=True, exist_ok=True)

print(f"Project root: {ROOT}")
print(f"Data directory: {DATA_DIR}")
print(f"Output directory: {OUTPUT_DIR}")
print(f"Model directory: {MODEL_DIR}")
print(f"Docs directory: {DOCS_DIR}")


Note: you may need to restart the kernel to use updated packages.
Project root: C:\Users\eCron\OneDrive\Documents\ATU_CourseWork\Programming For Data Analytics\programming-for-data-analytics\project
Data directory: C:\Users\eCron\OneDrive\Documents\ATU_CourseWork\Programming For Data Analytics\programming-for-data-analytics\project\data
Output directory: C:\Users\eCron\OneDrive\Documents\ATU_CourseWork\Programming For Data Analytics\programming-for-data-analytics\project\outputs
Model directory: C:\Users\eCron\OneDrive\Documents\ATU_CourseWork\Programming For Data Analytics\programming-for-data-analytics\project\models
Docs directory: C:\Users\eCron\OneDrive\Documents\ATU_CourseWork\Programming For Data Analytics\programming-for-data-analytics\project\docs


### Step 3 – Utilise Helper Functions for Dublin Airport Data Processing

This section defines a set of reusable helper functions that simplify common tasks in the project.  
They are designed specifically to support the analysis of **Dublin Airport flight activity and weather data** by handling messy inputs and preparing clean datasets for exploration and modelling.

The functions help with:

- ✅ Detecting and parsing inconsistent datetime formats in flight and weather logs  
- ✅ Standardising and cleaning temperature and precipitation columns from Met Éireann datasets  
- ✅ Loading and preparing Dublin Airport hourly weather data from local CSV files  
- ✅ Defining Irish seasonal boundaries (Winter, Spring, Summer, Autumn) for comparative analysis  
- ✅ Filtering weather data for a custom date range to align with flight events  
- ✅ Validating user-provided date inputs for reproducible analysis  
- ✅ Detecting header rows in raw CSV files downloaded from dashboards  

Each helper is **modular**: it performs one clear task and can be reused across notebooks and scripts.  
This improves readability, reduces duplication, and supports good programming practices for the final project.

📌 *Tip: These helpers are written to be beginner-friendly, with comments explaining their purpose and logic. They make it easier to align flight activity with weather conditions when investigating delays and cancellations.*

📖 References:  
- [Real Python – Python Helper Functions](https://realpython.com/defining-your-own-python-function/)  
- [GeeksforGeeks – Python Helper Functions](https://www.geeksforgeeks.org/python-helper-functions/)  
- [Wikipedia – DRY Principle (Don't Repeat Yourself)](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)
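
To illustrate the idea behind datetime format detection, here is a small standard-library sketch that tries each candidate format against a handful of sample strings and keeps the first one that parses enough of them. It is a simplified illustration, not the notebook's pandas-based helper: the function name `first_matching_format` and the sample values are assumptions, and month-name parsing with `%b` assumes an English locale.

```python
from datetime import datetime


def first_matching_format(samples, formats, min_ratio=0.7):
    """Return the first format that parses at least `min_ratio` of the samples."""
    for fmt in formats:
        hits = 0
        for value in samples:
            try:
                datetime.strptime(value, fmt)
                hits += 1
            except ValueError:
                pass  # this sample does not fit the current format
        if samples and hits / len(samples) >= min_ratio:
            return fmt
    return None


# Synthetic samples in the style of the hourly weather file, plus one bad value.
samples = ["01-jan-1945 00:00", "01-jan-1945 01:00", "not a date", "02-jan-1945 13:00"]
formats = ["%Y-%m-%d %H:%M", "%d-%b-%Y %H:%M"]

chosen = first_matching_format(samples, formats)
print(chosen)
```

Tolerating a small share of unparseable values keeps one stray entry from forcing a slower, less predictable fallback parser.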


In [3]:
# 📂 Helper Functions for Dublin Airport Project
# These functions handle parsing dates, cleaning weather data, preparing ranges,
# defining Irish seasons, and detecting CSV headers.
# Keep them in one cell so they are easy to reuse across the notebook.

# 🔍 Detect the header row in a CSV file
def detect_header(lines, keywords=("station","date","rain","temp","wind")):
    """
    Detect the most likely header row in a CSV file.
    Looks for lines containing known weather keywords and multiple columns.
    """
    for i, line in enumerate(lines):
        line_lower = line.strip().lower()
        if any(line_lower.startswith(k) for k in keywords) and "," in line:
            columns = line.split(",")
            if len(columns) > 3:  # header rows usually have multiple columns
                return i
    print("⚠️ Warning: header row not found. Defaulting to first line.")
    return 0

# 📅 Detect the most likely datetime format from sample strings
def detect_datetime_format(samples, formats, dayfirst=True, min_match_ratio=0.7, min_absolute=5):
    """
    Try each format in order and return the first one whose match count reaches
    both 'min_absolute' and 'min_match_ratio' (70% by default) of the samples.
    Returns None if no format qualifies. Helps ensure consistent parsing of date strings.
    """
    for fmt in formats:
        parsed = pd.to_datetime(samples, format=fmt, dayfirst=dayfirst, errors='coerce')
        matches = parsed.notna().sum()
        if matches >= max(min_absolute, int(len(samples) * min_match_ratio)):
            return fmt
    return None

# 📅 Parse a datetime column using format detection or fallback
def parse_datetime_column(df, date_col, candidate_formats=None, dayfirst=True):
    """
    Parse a datetime column using known formats.
    Falls back to flexible parsing if none match.
    """
    if candidate_formats is None:
        candidate_formats = [
            '%Y-%m-%d %H:%M:%S', '%Y-%m-%d %H:%M', '%d-%b-%Y %H:%M',
            '%d/%m/%Y %H:%M:%S', '%d/%m/%Y %H:%M', '%d-%m-%Y %H:%M',
            '%d %b %Y %H:%M', '%d %B %Y %H:%M',
        ]

    sample_vals = df[date_col].dropna().astype(str).head(100).tolist()
    chosen_fmt = detect_datetime_format(sample_vals, candidate_formats, dayfirst=dayfirst)

    if chosen_fmt:
        print(f"✅ Detected datetime format: {chosen_fmt}")
        return pd.to_datetime(df[date_col], format=chosen_fmt, dayfirst=dayfirst, errors='coerce')
    else:
        print("⚠️ No single format matched. Falling back to flexible parsing.")
        with warnings.catch_warnings():
            warnings.filterwarnings('ignore', message='Could not infer format')
            return pd.to_datetime(df[date_col], dayfirst=dayfirst, errors='coerce')

# 🕒 Ensure full datetime column for arrivals/departures and weather
def prepare_datetime(df, date_col='date', time_col=None):
    """
    Ensure DataFrame has a full datetime column.
    Works for datasets with separate date and time columns or an already combined datetime.
    """
    df = df.copy()
    df.columns = df.columns.str.strip().str.lower()

    if 'datetime' in df.columns:
        df['datetime'] = pd.to_datetime(df['datetime'], errors='coerce')
    elif date_col in df.columns and time_col:
        dt_strings = df[date_col].astype(str) + " " + df[time_col].astype(str)
        df['datetime'] = pd.to_datetime(dt_strings, format="%d-%b-%Y %H:%M", errors='coerce')
    elif date_col in df.columns:
        # Explicit format for Met Éireann hourly data: '01-jan-1945 00:00'
        df['datetime'] = pd.to_datetime(df[date_col], format="%d-%b-%Y %H:%M", errors='coerce')
    else:
        raise KeyError("No suitable date/time columns found")

    # Add convenience fields, then drop rows whose timestamp could not be parsed
    df['date'] = df['datetime'].dt.date
    df['hour'] = df['datetime'].dt.hour
    return df.dropna(subset=['datetime']).reset_index(drop=True)

# 🌡️ Ensure temperature column is numeric and named 'temp'
def parse_temperature_column(df, col_name='temp'):
    """
    Convert the temperature column to numeric and rename it to 'temp'.
    If no exact match, look for any column containing 'temp'.
    """
    if col_name not in df.columns:
        col_name = next((c for c in df.columns if 'temp' in c.lower()), None)
        if col_name is None:
            raise KeyError("No temperature column found.")
    df = df.copy()
    # Always convert to numeric, whatever the source column is called
    df['temp'] = pd.to_numeric(df[col_name], errors='coerce')
    if col_name != 'temp':
        # Drop the original column so the result behaves like a rename
        df = df.drop(columns=[col_name])
    return df

# 📂 Load cleaned weather data from local CSV
def load_cleaned_weather_data(filepath="data/dublin_airport_hourly.csv"):
    """
    Load weather dataset from CSV and strip spaces from column names.
    Default path now points to hourly Dublin Airport data.
    """
    df = pd.read_csv(filepath, low_memory=False)
    df.columns = df.columns.str.strip()
    return df

# 📊 Prepare weather data with proper datetime column
def prepare_weather_data(df, date_col='date'):
    """
    Ensure weather DataFrame has a proper datetime column.
    Works for Met Éireann hourly datasets where 'date' already includes time.
    """
    df = df.copy()
    df.columns = df.columns.str.strip().str.lower()

    if 'datetime' in df.columns:
        df['datetime'] = pd.to_datetime(df['datetime'], errors='coerce')
    elif date_col in df.columns:
        # Explicit format: '01-jan-1945 00:00'
        df['datetime'] = pd.to_datetime(df[date_col], format="%d-%b-%Y %H:%M", errors='coerce')
    else:
        raise ValueError("No suitable date column found in weather dataset")

    df['date'] = df['datetime'].dt.date
    df['hour'] = df['datetime'].dt.hour
    return df

# 🛠️ Clean and standardise key weather columns
def clean_weather_columns(df):
    """
    Standardise Dublin Airport hourly weather data:
    - Convert rainfall, temperature, wind speed, and pressure to numeric
    - Handle 'Tr' (trace) rainfall as 0.0
    - Ensure consistent column naming
    """
    df = df.copy()
    df.columns = df.columns.str.strip().str.lower()

    # Handle rainfall column
    if 'rain' in df.columns:
        df['rain'] = df['rain'].replace('Tr', 0.0)   # trace rainfall → 0
        df['rain'] = pd.to_numeric(df['rain'], errors='coerce')

    # Handle temperature column
    temp_col = next((c for c in df.columns if 'temp' in c), None)
    if temp_col:
        df['temp'] = pd.to_numeric(df[temp_col], errors='coerce')

    # Handle wind speed column
    if 'wdsp' in df.columns:
        df['wdsp'] = pd.to_numeric(df['wdsp'], errors='coerce')

    # Handle pressure column
    if 'msl' in df.columns:
        df['msl'] = pd.to_numeric(df['msl'], errors='coerce')

    return df

# 📆 Convert user input strings into a validated date range
def get_custom_range(start_str, end_str):
    """
    Convert string inputs into datetime objects and validate order.
    Handles ISO (YYYY-MM-DD) and European (DD/MM/YYYY) formats gracefully.
    """
    try:
        # Try ISO format first
        try:
            start = pd.to_datetime(start_str, format="%Y-%m-%d", errors="raise")
        except Exception:
            start = pd.to_datetime(start_str, dayfirst=True)

        try:
            # Same strict ISO date format as the start value
            end = pd.to_datetime(end_str, format="%Y-%m-%d", errors="raise")
        except Exception:
            end = pd.to_datetime(end_str, dayfirst=True)

        if start > end:
            raise ValueError("Start date must be before end date.")
        return start, end
    except Exception as e:
        print(f"❌ Invalid date range: {e}")
        return None, None

# 🍂 Define Irish seasonal boundaries for a given year (leap year safe)
def define_irish_seasons(year=2025):
    """
    Return start and end dates for Irish meteorological seasons.
    Handles leap years correctly for February.
    Seasons are defined as:
    - Winter: 1 December (previous year) → end of February
    - Spring: 1 March → 31 May
    - Summer: 1 June → 31 August
    - Autumn: 1 September → 30 November
    """
    # Determine the number of days in February for leap year safety
    feb_days = monthrange(year, 2)[1]

    data = [
        ("Winter", pd.Timestamp(f"{year-1}-12-01"), pd.Timestamp(f"{year}-02-{feb_days} 23:59")),
        ("Spring", pd.Timestamp(f"{year}-03-01"), pd.Timestamp(f"{year}-05-31 23:59")),
        ("Summer", pd.Timestamp(f"{year}-06-01"), pd.Timestamp(f"{year}-08-31 23:59")),
        ("Autumn", pd.Timestamp(f"{year}-09-01"), pd.Timestamp(f"{year}-11-30 23:59")),
    ]

    return pd.DataFrame(data, columns=["season", "start", "end"])

# 🍂 Assign Irish seasons to a DataFrame in bulk (vectorised)
def assign_season_vectorized(df):
    """
    Vectorised season assignment for Irish meteorological seasons.
    Maps each timestamp's month to its season, so no per-year loop is needed
    and every December is tagged as Winter, including the final year's.
    Much faster than row-wise apply.
    """
    df = df.copy()

    # Month number → season (same boundaries as define_irish_seasons)
    month_to_season = {
        12: "Winter", 1: "Winter", 2: "Winter",
        3: "Spring", 4: "Spring", 5: "Spring",
        6: "Summer", 7: "Summer", 8: "Summer",
        9: "Autumn", 10: "Autumn", 11: "Autumn",
    }

    # Rows with a missing datetime get a missing season
    df['season'] = df['datetime'].dt.month.map(month_to_season)

    return df


### 📂 Step 4 – Download Dublin Airport Hourly Data and Detect Header Row

In this step, the notebook retrieves the **Dublin Airport Hourly Data CSV** directly from Met Éireann’s open data service.  
This dataset contains hourly weather observations (e.g., precipitation, temperature, wind speed, visibility) recorded at Dublin Airport, which will later be aligned with flight activity logs to analyse rerouting events.

The process includes:

- 🌐 **Downloading the raw CSV** from Met Éireann using the `requests` library.  
- 📂 **Defining a local output path** (`data/dublin_airport_hourly.csv`) to store the file inside the project’s `data` folder.  
- ✅ **Checking the HTTP response** to ensure the download was successful, falling back to a previously saved local copy if it was not.  
- 📑 **Splitting the file into lines** so the structure can be inspected before loading into pandas.  
- 🔍 **Detecting the header row** using the `detect_header` helper function defined earlier.  
  This ensures that column names (such as `date`, `rain`, `temp`, `wdsp`, `msl`) are correctly identified even if the file contains metadata lines at the top.  
- 🖨️ **Printing the detected header row** to confirm the correct starting point for parsing.

📌 *Tip: Detecting the header row is important because Met Éireann CSVs often include metadata lines before the actual data table.  
By confirming the header row, you avoid misaligned columns and ensure clean parsing in later steps.*
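
The sketch below shows the header-detection idea on a small in-memory file, using only the standard library. The metadata lines are synthetic and only imitate the general shape of such files; `find_header_index` is a simplified, hypothetical variant of the notebook's `detect_header`, not a copy of it.

```python
import csv

# Synthetic file: descriptive lines first, then the real comma-separated table.
raw_text = """Station Name: EXAMPLE AIRPORT
Station Height: 71 M
date:  -  Date and Time (utc)
rain:  -  Precipitation Amount (mm)

date,ind,rain,ind,temp
01-jan-1945 00:00,2,0.0,0,4.9
01-jan-1945 01:00,3,0.0,0,5.1
"""


def find_header_index(lines, keywords=("station", "date", "rain", "temp", "wind"), min_columns=4):
    """Return the index of the first line that starts with a keyword and has enough columns."""
    for i, line in enumerate(lines):
        lowered = line.strip().lower()
        if lowered.startswith(keywords) and len(line.split(",")) >= min_columns:
            return i
    return None


lines = raw_text.splitlines()
header_index = find_header_index(lines)

# Parse everything from the header row onwards.
rows = list(csv.reader(lines[header_index:]))
header, data = rows[0], rows[1:]
print(header_index, header, len(data))
```

The column-count check matters here: the description line beginning with `date:` also starts with a keyword, but it contains no commas, so it is not mistaken for the header.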


In [4]:
# 📂 Step 4 – Download Dublin Airport Hourly Data CSV and Detect Header Row

# Note: pathlib.Path and requests are imported earlier in the notebook, so we avoid re-importing them here.

# --- Define output path for cleaned CSV ---
DATA_PATH = Path("data/dublin_airport_hourly.csv")

# --- URL for the hourly CSV (may return 404 if file moved) ---
url = "https://cli.fusio.net/cli/climate_data/webdata/hly532.csv"

# --- Attempt to download the remote CSV, with a safe fallback to a local copy ---
try:
    response = requests.get(url, timeout=30)
except Exception as e:
    response = None
    print(f"⚠️ Network error when fetching URL: {e}")

if response is None or getattr(response, "status_code", None) != 200:
    # If remote download failed, try to use a previously saved local file if available
    if DATA_PATH.exists():
        print(f"⚠️ Remote download failed (status: {getattr(response,'status_code',None)}). Falling back to local file: {DATA_PATH}")
        text = DATA_PATH.read_text(encoding="utf-8")
        lines = text.splitlines()
    else:
        raise RuntimeError(f"❌ Failed to download data: HTTP {getattr(response,'status_code',None)} and no local fallback at {DATA_PATH}")
else:
    # Successful download: use remote content
    lines = response.text.splitlines()

# --- Detect header row using helper function ---
header_index = detect_header(lines)

# ✅ Confirm detected header row
print(f"✅ Header row detected at line {header_index}:")
print(lines[header_index])


✅ Header row detected at line 23:
date,ind,rain,ind,temp,ind,wetb,dewpt,vappr,rhum,msl,ind,wdsp,ind,wddir,ww,w,sun,vis,clht,clamt


### 📑 Step 4b – Load and Inspect Dublin Airport Hourly Data

After detecting the correct header row in the raw CSV file, the next step is to **load the hourly dataset into pandas**.  
This allows us to immediately inspect the structure of the Dublin Airport Hourly Data and confirm that the columns (e.g., `date`, `temp`, `rain`, `wdsp`, `msl`) are correctly aligned. The `date` column already includes the time of day (for example `01-jan-1945 00:00`), so there is no separate `time` column.

The process includes:

- 📂 Reading the CSV into a pandas DataFrame, starting from the detected header row  
- 🔍 Displaying the first few rows with `head()` to verify column names and sample values  
- 🧾 Using `info()` to check datatypes and identify potential missing values  
- 📊 Summarising numeric columns with `describe()` to get a quick statistical overview  

📌 *Why this matters:* Inspecting the hourly dataset ensures that the header detection worked correctly and that the file is ready for consistent downstream analysis.  
Since arrivals and departures depend on **exact times**, hourly weather data provides the necessary granularity to align flight events with Dublin Airport conditions. This step acts as a validation checkpoint before committing the cleaned file to the `data/` folder.


In [5]:
# 📑 Step 4b – Load and Inspect Dublin Airport Hourly Data

# --- Load CSV into pandas using detected header row ---
df = pd.read_csv(
    "https://cli.fusio.net/cli/climate_data/webdata/hly532.csv",  # Dublin Airport hourly dataset
    skiprows=header_index,  # skip metadata lines before the header
    low_memory=False        # process file in one pass, avoids mixed-type warnings
)

# ✅ Inspect the first few rows
print("First 5 rows of Dublin Airport Hourly Data:")
display(df.head())

# ✅ Check column types and missing values
print("\nDataFrame info:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

# ✅ Quick statistical summary
print("\nSummary statistics:")
print(df.describe(include='all'))


First 5 rows of Dublin Airport Hourly Data:


Unnamed: 0,date,ind,rain,ind.1,temp,ind.2,wetb,dewpt,vappr,rhum,...,ind.3,wdsp,ind.4,wddir,ww,w,sun,vis,clht,clamt
0,01-jan-1945 00:00,2,0.0,0,4.9,0,4.6,4.4,8.2,95,...,1,0,1,0,50,4,0.0,200,2,8
1,01-jan-1945 01:00,3,0.0,0,5.1,0,4.9,4.4,8.5,97,...,1,0,1,0,45,4,0.0,200,2,8
2,01-jan-1945 02:00,2,0.0,0,5.1,0,4.8,4.4,8.5,97,...,1,0,1,0,50,4,0.0,4800,4,8
3,01-jan-1945 03:00,0,0.2,0,5.2,0,5.0,4.4,8.5,97,...,1,0,1,0,50,4,0.0,6000,4,8
4,01-jan-1945 04:00,2,0.0,0,5.6,0,5.4,5.0,8.8,97,...,1,7,1,250,50,5,0.0,6000,4,8



DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 708577 entries, 0 to 708576
Data columns (total 21 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   date    708577 non-null  object 
 1   ind     708577 non-null  int64  
 2   rain    708577 non-null  float64
 3   ind.1   708577 non-null  int64  
 4   temp    708577 non-null  float64
 5   ind.2   708577 non-null  int64  
 6   wetb    708577 non-null  float64
 7   dewpt   708577 non-null  float64
 8   vappr   708577 non-null  object 
 9   rhum    708577 non-null  object 
 10  msl     708577 non-null  float64
 11  ind.3   708577 non-null  int64  
 12  wdsp    708577 non-null  int64  
 13  ind.4   708577 non-null  int64  
 14  wddir   708577 non-null  object 
 15  ww      708577 non-null  int64  
 16  w       708577 non-null  int64  
 17  sun     708577 non-null  float64
 18  vis     708577 non-null  object 
 19  clht    708577 non-null  object 
 20  clamt   708577 non-null  object
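
The `info()` output above shows several measurement columns (`vappr`, `rhum`, `wddir`, `vis`, `clht`, `clamt`) loaded as `object` rather than a numeric type, which usually means some entries are not plain numbers. A hedged sketch of how such columns could be checked and converted follows; the sample values are synthetic, and the use of a blank string for a missing reading is an assumption about the raw file rather than something confirmed here.

```python
import pandas as pd

# Synthetic sample: a blank string stands in for a missing reading (assumption).
sample = pd.DataFrame({
    "rhum": ["95", "97", " ", "88"],
    "vis": ["200", "4800", "6000", " "],
    "temp": [4.9, 5.1, 5.1, 5.2],
})

# Convert each text column and count how many entries could not be parsed.
report = {}
for col in ["rhum", "vis"]:
    converted = pd.to_numeric(sample[col], errors="coerce")
    report[col] = int(converted.isna().sum())
    sample[col] = converted

print(report)
print(sample.dtypes)
```

Counting the coerced values before overwriting the column gives a quick sense of how much data would be lost to `NaN`.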

### 📁 Step 5 – Save the Cleaned Dublin Airport Hourly Data CSV

After detecting the correct header row in the raw Met Éireann dataset, we now save a **cleaned version** of the Dublin Airport Hourly Data file into the project’s `data/` folder.  

This step ensures:

- 📂 The dataset is stored locally for reuse without needing to re-download from Met Éireann each time  
- 📑 All future analysis references a consistent, structured version of the data (starting at the correct header row)  
- 🔄 The workflow remains reproducible and version-controlled, supporting transparent project documentation  
- 🛠️ Analysts and reviewers can always work from the same baseline dataset, avoiding inconsistencies caused by raw file metadata  
- ⏱️ Hourly granularity is preserved, which is essential for aligning weather conditions with flight arrivals and departures  

📌 *Why this matters:*  
Saving cleaned hourly data locally is a best practice in data science. It guarantees consistency across runs, makes collaboration easier, and allows you to track changes over time.  
For Dublin Airport analysis, hourly weather data provides the necessary detail to study how conditions at specific times affect flight operations, ensuring reproducibility and transparency in your rerouting and delay modelling work.  

📖 Reference:  
- [GeeksforGeeks – Explain Data Versioning](https://www.geeksforgeeks.org/machine-learning/explain-data-versioning/)


In [6]:
# 📁 Step 5 – Save the Cleaned CSV File

# --- Ensure 'data' folder exists ---
DATA_PATH.parent.mkdir(parents=True, exist_ok=True)

# --- Save cleaned data starting from the detected header row ---
with open(DATA_PATH, "w", encoding="utf-8") as f:
    for line in lines[header_index:]:
        f.write(line + "\n")

# ✅ Confirm save location
print(f"📁 Saved cleaned climate data for Dublin Airport to: {DATA_PATH.resolve()}")


📁 Saved cleaned climate data for Dublin Airport to: C:\Users\eCron\OneDrive\Documents\ATU_CourseWork\Programming For Data Analytics\programming-for-data-analytics\project\data\dublin_airport_hourly.csv


### 📂 Step 6 – Validate Saved CSV Against Step 4 Output

Instead of re-printing the same inspection results, this step **confirms that the locally saved CSV file is identical to the hourly dataset inspected in Step 4**.  

The process includes:

- 📂 Reloading the locally saved CSV (`data/dublin_airport_hourly.csv`)  
- 🌐 Reloading the online CSV directly from Met Éireann (skipping metadata lines)  
- ✅ Comparing the two DataFrames with `equals()` to check for exact match  
- 📊 Printing a simple confirmation message and shape comparison  

📌 *Why this matters:* This validation ensures reproducibility. It proves that the cleaned hourly file saved in Step 5 is a faithful copy of the dataset originally inspected in Step 4.  
Reviewers can trust that all downstream analysis is based on the same consistent dataset.  
For Dublin Airport analysis, this step is especially important because **hourly granularity** is required to align weather conditions with arrivals and departures at specific times.  

📖 Reference:  
- [GeeksforGeeks – Create Effective and Reproducible Code Using Pandas](https://www.geeksforgeeks.org/create-effective-and-reproducible-code-using-pandas/)
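
`equals()` only answers yes or no. When a comparison fails, `pandas.testing.assert_frame_equal` can help locate the difference, because its error message names the first column that differs. The sketch below demonstrates both on tiny synthetic frames; it is an optional diagnostic idea, not part of the notebook's validation cell.

```python
import io

import pandas as pd
from pandas.testing import assert_frame_equal

# Tiny synthetic CSV standing in for the saved and the online file.
csv_text = "date,rain,temp\n01-jan-1945 00:00,0.0,4.9\n01-jan-1945 01:00,0.0,5.1\n"

df_a = pd.read_csv(io.StringIO(csv_text))
df_b = pd.read_csv(io.StringIO(csv_text))

# A third copy with one value changed on purpose.
df_c = df_b.copy()
df_c.loc[1, "temp"] = 9.9

same = df_a.equals(df_b)       # identical content
different = df_a.equals(df_c)  # one cell differs

# assert_frame_equal raises AssertionError describing the first mismatch it finds.
try:
    assert_frame_equal(df_a, df_c)
    message = ""
except AssertionError as err:
    message = str(err)

print(same, different)
print(message)
```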


In [7]:
# 📂 Step 6 – Validate Saved CSV Against Step 4 Output

# --- Reload the locally saved CSV (hourly Dublin Airport data) ---
df_local = pd.read_csv("data/dublin_airport_hourly.csv", low_memory=False)

# --- Reload the online CSV (using header_index from Step 4) ---
df_online = pd.read_csv(
    "https://cli.fusio.net/cli/climate_data/webdata/hly532.csv",
    skiprows=header_index,
    low_memory=False
)

# ✅ Compare the two DataFrames
if df_local.equals(df_online):
    print("✅ Validation successful: Local CSV matches the online dataset from Step 4.")
else:
    print("❌ Validation failed: Local CSV differs from the online dataset.")

# Optional: show shape comparison
print(f"Local shape: {df_local.shape}, Online shape: {df_online.shape}")


✅ Validation successful: Local CSV matches the online dataset from Step 4.
Local shape: (708577, 21), Online shape: (708577, 21)


### 📊 Step 7 – Enhance Dublin Airport Hourly Weather Data with Seasons

After validating the saved hourly dataset, this step enriches the data by preparing timestamps, cleaning numeric weather columns, and tagging each record with its Irish meteorological season.  

The process includes:

- 📂 Loading the locally saved hourly CSV (`data/dublin_airport_hourly.csv`)  
- 🕒 Preparing a full `datetime` column by parsing the combined date and time stored in the `date` column  
- 🛠️ Cleaning mixed-type weather columns (`rain`, `temp`, `wdsp`, `msl`) into consistent numeric values  
- 🍂 Adding a `season` column using the `assign_season_vectorized` helper  

📌 *Why this matters:*  
By making the dataset **season-aware**, you can easily filter and analyse weather conditions and flight delays by season. This ensures that downstream analysis captures both the **hourly granularity** and the **seasonal context**, which are critical for understanding operational impacts at Dublin Airport.
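
As a small illustration of what a season-aware dataset makes possible, the sketch below tags a few synthetic timestamps with a season and then averages temperature per season. The month-to-season table follows the meteorological seasons described in the helper functions; the sample values are invented for the example.

```python
import pandas as pd

# Meteorological seasons: Dec-Feb, Mar-May, Jun-Aug, Sep-Nov.
MONTH_TO_SEASON = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}

# Synthetic readings, including a leap day and the first day of two seasons.
sample = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2024-02-29 12:00", "2024-03-01 00:00", "2024-07-15 09:00", "2024-12-01 00:00",
    ]),
    "temp": [5.0, 7.0, 18.0, 4.0],
})

sample["season"] = sample["datetime"].dt.month.map(MONTH_TO_SEASON)

# Per-season average, the kind of summary a season column enables.
mean_temp = sample.groupby("season")["temp"].mean()
print(mean_temp)
```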


In [8]:
# 📊 Step 7 – Enhance Dublin Airport Hourly Weather Data with Seasons

# --- Load the locally saved hourly CSV ---
df_weather = load_cleaned_weather_data()

# --- Prepare datetime column (parse combined date+time) ---
df_weather = prepare_weather_data(df_weather)

# --- Clean numeric weather columns (rain, temp, wind, pressure) ---
df_weather = clean_weather_columns(df_weather)

# --- Vectorized season assignment ---
df_weather = assign_season_vectorized(df_weather)

# ✅ Inspect result
print("First 10 rows with season tagging:")
display(df_weather[['datetime', 'temp', 'rain', 'wdsp', 'msl', 'season']].head(10))


First 10 rows with season tagging:


Unnamed: 0,datetime,temp,rain,wdsp,msl,season
0,1945-01-01 00:00:00,4.9,0.0,0,1035.8,Winter
1,1945-01-01 01:00:00,5.1,0.0,0,1035.8,Winter
2,1945-01-01 02:00:00,5.1,0.0,0,1035.8,Winter
3,1945-01-01 03:00:00,5.2,0.2,0,1036.1,Winter
4,1945-01-01 04:00:00,5.6,0.0,7,1036.2,Winter
5,1945-01-01 05:00:00,5.6,0.0,9,1035.9,Winter
6,1945-01-01 06:00:00,6.0,0.0,9,1035.9,Winter
7,1945-01-01 07:00:00,6.1,0.1,9,1036.1,Winter
8,1945-01-01 08:00:00,6.1,0.0,9,1036.5,Winter
9,1945-01-01 09:00:00,6.1,0.0,5,1037.3,Winter


### 📑 Step 8 – Download and Save Flight Activity Data

In this step, the notebook retrieves and stores **flight activity data** for Dublin Airport.  
This dataset will later be aligned with Met Éireann weather observations to analyse how conditions such as rain, wind, and visibility impact flight punctuality.

The process includes:

- 🌐 Collecting flight schedules and activity logs (arrivals, departures, delays, cancellations) from public APIs or dashboards  
- 📂 Defining a local output path (`data/dublin_airport_flights.csv`) to store the file inside the project’s `data` folder  
- ✅ Checking the response to ensure the download or export was successful  
- 📑 Parsing the raw data into a structured format, including scheduled vs actual times and delay minutes  
- 📁 Saving a cleaned version of the dataset locally for reproducibility and future analysis  

📌 *Why this matters:* Having flight activity data stored locally ensures that the project can consistently align flight events with weather conditions.  
It also supports reproducibility, version control, and enables predictive modelling of delays and cancellations without repeatedly querying external APIs.
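
To make the "scheduled vs actual times and delay minutes" step concrete, here is a minimal sketch of deriving delay minutes from two timestamp columns. The column names, flight codes, and the 15-minute threshold are all illustrative assumptions; the real API fields may be named and structured differently.

```python
import pandas as pd

# Synthetic records; the last flight has no actual time (e.g. not yet landed).
flights = pd.DataFrame({
    "flight": ["XX100", "XX200", "XX300"],
    "scheduled": ["2025-06-01 10:00", "2025-06-01 11:30", "2025-06-01 12:00"],
    "actual": ["2025-06-01 10:25", "2025-06-01 11:20", None],
})

# Parse both columns; unparseable or missing values become NaT.
for col in ("scheduled", "actual"):
    flights[col] = pd.to_datetime(flights[col], errors="coerce")

# Positive values are late, negative values are early, NaN means unknown.
flights["delay_minutes"] = (flights["actual"] - flights["scheduled"]).dt.total_seconds() / 60

# Assumed threshold for flagging a delay; adjust to the project's own definition.
flights["is_delayed"] = flights["delay_minutes"] > 15

print(flights[["flight", "delay_minutes", "is_delayed"]])
```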

### 📑 Step 9 – Dublin Airport flight information analysis

This cell prepares the environment for **Dublin Airport flight information analysis** by defining key date ranges and output directories:

- 🗓️ **Date range:**  
  - Calculates today’s date and subtracts ~six months (182 days) to define the analysis window.  
  - Converts both dates into ISO format (`YYYY-MM-DD`) for use in API queries.  
  - These values (`DATE_FROM`, `DATE_TO`) specify the six-month period of **flight activity data** (arrivals, departures, delays, cancellations) to be downloaded.

- 📂 **Output directories:**  
  - Creates a root `data/` folder for project storage.  
  - Inside it, a `raw_flights/` subfolder is created to hold raw JSON files retrieved from the Aviation Edge API.  
  - This ensures reproducibility and a clear separation between raw flight inputs and processed datasets.

- ✅ **Checkpoint:**  
  - Prints the computed date range so you can confirm the correct six-month window before downloading flight information.
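
The date-window logic can be checked in isolation by passing a fixed reference date instead of `date.today()`. The sketch below is only an illustration: `six_month_window` is a hypothetical helper name, and the reference date is chosen to match the printed output of this notebook.

```python
from datetime import date, timedelta

def six_month_window(reference: date, days: int = 182) -> tuple[str, str]:
    """Return (date_from, date_to) as ISO strings for a window ending on `reference`."""
    start = reference - timedelta(days=days)
    return start.isoformat(), reference.isoformat()

date_from, date_to = six_month_window(date(2025, 11, 19))
print(f"Date range: {date_from} to {date_to}")  # Date range: 2025-05-21 to 2025-11-19
```

Note that 182 days is an approximation: a calendar six months before 19 November would be 19 May, two days earlier than the computed start.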


In [10]:
# 📑 Step 9 – Dublin Airport flight information analysis
# --- Compute date range for the past six months ---
today = date.today()
six_months_ago = today - timedelta(days=182)  # approx 6 months

DATE_FROM = six_months_ago.isoformat()
DATE_TO = today.isoformat()

# --- Output directories ---
DATA_DIR = Path("data")
RAW_DIR = DATA_DIR / "raw_flights"
RAW_DIR.mkdir(parents=True, exist_ok=True)

print(f"Date range: {DATE_FROM} to {DATE_TO}")


Date range: 2025-05-21 to 2025-11-19


### ✈️ Step 10 – Download Six Months of Flight History for Dublin (Arrivals and Departures)

In this step we use **Aviation Edge’s Flights History API** to collect six months of flight schedules for Dublin Airport (IATA: DUB).  
The endpoint provides detailed records for each flight, including:

- **Scheduled, estimated, and actual times** (departure and arrival)
- **Delay minutes** (either reported or inferred)
- **Flight status** (e.g., scheduled, landed, cancelled, diverted)
- **Airline and flight identifiers**

We request **both arrivals and departures** for the date range **2025-05-20 to 2025-11-18**, ensuring coverage of the most recent six months.  
The raw JSON files are saved for reproducibility in the folder:

- `data/raw_flights/dub_arrival_history.json`  
- `data/raw_flights/dub_departure_history.json`

Additionally, a `fetch_log.txt` file is generated to record progress, errors, and confirmation of successful downloads.  
This log provides transparency and makes troubleshooting easier if API requests fail or return incomplete data.

**Important notes for reproducibility:**
- The code cell was executed on **18 November 2025** using a private API key from Aviation Edge.
- To run the download yourself, you must:
  1. Sign up for an account at [aviation-edge.com](https://aviation-edge.com/) and obtain an API key.
  2. Store the key securely (e.g., as an environment variable).
  3. Set the notebook control flag `RUN_DOWNLOAD = True` to enable downloading.
- By default, the notebook will skip downloading if `RUN_DOWNLOAD = False`, and instead use the existing JSON files.  
  This prevents unnecessary API calls and ensures consistent results for reviewers.
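
As a sketch of the key-handling steps above, the snippet below reads the key from an environment mapping and builds the query parameters for one day without contacting the API. The helper names `load_api_key` and `build_params` are hypothetical; the parameter names (`key`, `code`, `type`, `date_from`, `date_to`) are copied from this notebook's download cell, and the key value is a placeholder.

```python
import os
from datetime import date

def load_api_key(env=os.environ, name: str = "AVIATION_EDGE_API_KEY") -> str:
    """Read the key from the environment and fail early with a clear message."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"API key not found. Please set {name}.")
    return key

def build_params(api_key: str, iata_code: str, flight_type: str, day: date) -> dict:
    """Query parameters for a single day, mirroring the names used in the download cell."""
    return {
        "key": api_key,
        "code": iata_code,
        "type": flight_type,
        "date_from": day.isoformat(),
        "date_to": day.isoformat(),
    }

# Demo with a fake environment so that no real key is needed
fake_env = {"AVIATION_EDGE_API_KEY": "demo-key"}
params = build_params(load_api_key(fake_env), "DUB", "arrival", date(2025, 11, 18))
print(params["type"], params["date_from"])
```

Passing the environment as an argument keeps the helper testable; in the notebook itself the default `os.environ` would be used.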

⚠️ **Best practice:** Only re-run the download when you want to refresh the dataset.  
Frequent downloads are unnecessary and may exceed API rate limits.

**References:**
- [Aviation Edge official site](https://aviation-edge.com/)  
- [Aviation Edge API documentation on GitHub](https://github.com/AviationEdgeAPI/Aviation-Edge-Complete-API)

In [11]:
# ✈️ Step 10 – Download Six Months of Flight History for Dublin (Arrivals and Departures)
# --- API setup ---
API_KEY = os.getenv("AVIATION_EDGE_API_KEY")   # Read API key from environment variable
if not API_KEY:
    raise RuntimeError("API key not found. Please set AVIATION_EDGE_API_KEY.")

BASE_URL = "https://aviation-edge.com/v2/public/flightsHistory"  # Endpoint for flight history
IATA_CODE = "DUB"  # Airport code for Dublin

# --- Directory setup ---
DATA_DIR = Path("data")              # Root data folder
RAW_DIR = DATA_DIR / "raw_flights"   # Subfolder for raw flight data
RAW_DIR.mkdir(parents=True, exist_ok=True)  # Create folders if missing

# --- Log file path ---
LOG_FILE = RAW_DIR / "fetch_log.txt"  # Text log for progress and errors

def log_message(message: str):
    """Print message and append to log file for tracking progress."""
    print(message)
    with open(LOG_FILE, "a", encoding="utf-8") as log:
        log.write(message + "\n")

def fetch_day(iata_code: str, flight_type: str, day: date, retries: int = 3):
    """
    Fetch flight history for a single day (arrival/departure).
    Retries up to 'retries' times if errors occur.
    """
    params = {
        "key": API_KEY,
        "code": iata_code,
        "type": flight_type,
        "date_from": day.isoformat(),
        "date_to": day.isoformat()
    }

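    for attempt in range(retries):
        try:
            resp = requests.get(BASE_URL, params=params, timeout=60)  # API request
        except requests.RequestException as exc:
            # Timeouts and connection errors are retried like HTTP errors
            error = f"Error {type(exc).__name__}"
        else:
            if resp.status_code == 200:
                try:
                    data = resp.json()  # Parse JSON response
                except ValueError:
                    log_message(f"⚠️ Non-JSON response on {day}: {resp.text[:200]}")
                    return []
                if not isinstance(data, list):
                    # A JSON object instead of a list of flights (e.g. an error payload) counts as no data
                    log_message(f"⚠️ Unexpected payload on {day}: {str(data)[:200]}")
                    return []
                log_message(f"✅ {flight_type.capitalize()} {day}: {len(data)} records fetched")
                return data
            error = f"Error {resp.status_code}"

        wait = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s
        log_message(f"⚠️ {error} on {day} (attempt {attempt+1}/{retries}). Retrying in {wait}s...")
        time.sleep(wait)

    log_message(f"❌ Failed after {retries} retries on {day}")
    return []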

def fetch_history(iata_code: str, flight_type: str, start_date: date, end_date: date):
    """
    Loop through each day in the date range and fetch daily history.
    Append results into one cumulative JSON file (avoids overwriting with empty data).
    """
    results = []
    total_days = (end_date - start_date).days + 1
    filename = RAW_DIR / f"{iata_code.lower()}_{flight_type}_history.json"

    # Load existing cumulative file if present
    if filename.exists():
        with open(filename, "r", encoding="utf-8") as f:
            try:
                results = json.load(f)
            except Exception:
                results = []
                log_message(f"⚠️ Existing {filename.name} could not be read, starting fresh.")

    # Loop through each day in range
    for i in range(total_days):
        day = start_date + timedelta(days=i)
        log_message(f"Day {i+1}/{total_days}: {day}")
        day_data = fetch_day(iata_code, flight_type, day)

        # Save only if data was fetched
        if day_data:
            results.extend(day_data)
            with open(filename, "w", encoding="utf-8") as f:
                json.dump(results, f, ensure_ascii=False, indent=2)
            log_message(f"💾 Saved {len(day_data)} records for {day} into {filename.name}")
        else:
            log_message(f"⏩ Skipped saving {day}, no data returned")

        time.sleep(1)  # Pause politely between requests

    return results

# --- Conditional download control ---
if RUN_DOWNLOAD:
    # Define date range (last ~6 months)
    today = date.today()
    start_date = today - timedelta(days=182)
    end_date = today

    log_message(f"Fetching flights from {start_date} to {end_date} for {IATA_CODE}...")

    # Fetch arrivals and departures
    arrivals = fetch_history(IATA_CODE, "arrival", start_date, end_date)
    departures = fetch_history(IATA_CODE, "departure", start_date, end_date)

    log_message(f"✅ Completed: {len(arrivals)} arrivals and {len(departures)} departures fetched.")
else:
    log_message("⏩ Skipping download step (RUN_DOWNLOAD=False). Using existing JSON files.")


⏩ Skipping download step (RUN_DOWNLOAD=False). Using existing JSON files.
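
Because `fetch_history` appends each day's records to the existing cumulative file, re-running the download over an overlapping date range may store the same flight twice. One possible safeguard, sketched below, is to drop duplicates before analysis. The helper `dedupe_flights` and the choice of key (flight number plus scheduled time) are assumptions for illustration; codeshare entries carry different flight numbers and would remain separate under this key.

```python
def dedupe_flights(records: list[dict], flight_type: str) -> list[dict]:
    """Keep the first record for each (flight number, scheduled time) pair."""
    seen = set()
    unique = []
    for rec in records:
        key = (
            (rec.get("flight") or {}).get("iataNumber"),
            (rec.get(flight_type) or {}).get("scheduledTime"),
        )
        if key in seen:
            continue  # already stored from an earlier run
        seen.add(key)
        unique.append(rec)
    return unique

# Invented sample records shaped like the raw JSON
sample = [
    {"flight": {"iataNumber": "fr1739"}, "arrival": {"scheduledTime": "2025-05-20t01:00:00.000"}},
    {"flight": {"iataNumber": "fr1739"}, "arrival": {"scheduledTime": "2025-05-20t01:00:00.000"}},
    {"flight": {"iataNumber": "fr9612"}, "arrival": {"scheduledTime": "2025-05-20t01:10:00.000"}},
]
print(len(dedupe_flights(sample, "arrival")))  # 2
```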


### 📂 Step 11 – Inspect Headings of Downloaded JSON Files

Before tidying the flight history data, it’s important to **inspect the structure of the raw JSON files**.  
The Aviation Edge API responses can vary depending on whether the file contains arrivals or departures, and not all fields are always present.

**What this step does:**
- Loads the raw JSON files for Dublin Airport arrivals (`dub_arrival_history.json`) and departures (`dub_departure_history.json`).
- Uses `pandas.json_normalize` to flatten the nested JSON into a tabular structure.
- Prints out all available column headings so we can see which fields exist.
- Shows a sample record (truncated for readability) to preview the nested structure.
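
As a small illustration of the flattening described above, `pandas.json_normalize` turns nested keys into dotted column names and fills fields that are absent from a record with `NaN`. The two records below are invented and only mimic the shape of the raw files.

```python
import pandas as pd

records = [
    {"type": "arrival", "status": "landed",
     "flight": {"iataNumber": "fr1739"},
     "arrival": {"scheduledTime": "2025-05-20t01:00:00.000", "delay": 15}},
    {"type": "arrival", "status": "landed",
     "flight": {"iataNumber": "fr9612"},
     "arrival": {"scheduledTime": "2025-05-20t01:10:00.000"}},  # no delay reported
]

df = pd.json_normalize(records)
print(sorted(df.columns))
print(df["arrival.delay"].isna().tolist())  # [False, True]
```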

**Why this matters:**
- Helps identify which fields are consistently available and relevant for analysis.
- Prevents errors later by ensuring we only select columns that actually exist.
- Guides the design of the tidy DataFrame schema in Step 12 (e.g. keeping scheduled/actual times, delays, status, airline, etc., while dropping baggage or codeshare metadata).

👉 This inspection step is a diagnostic tool: it gives us visibility into the raw data so we can confidently build the parsing logic in the next step.


In [12]:
# 📂 Step 11 – Inspect headings of downloaded JSON files
RAW_DIR = Path("data") / "raw_flights"
ARR_FILE = RAW_DIR / "dub_arrival_history.json"
DEP_FILE = RAW_DIR / "dub_departure_history.json"

def inspect_keys(json_file):
    """Print the flattened column names of a JSON file and preview its first record."""
    with open(json_file, "r", encoding="utf-8") as f:
        records = json.load(f)

    # Flatten every record so that fields missing from some records still appear
    import pandas as pd
    df = pd.json_normalize(records)

    # Show all column headings
    print(f"\n--- Keys in {json_file.name} ---")
    print(sorted(df.columns.tolist()))

    # Preview the first record, if the file is not empty
    if records:
        print("\nSample record:")
        print(json.dumps(records[0], indent=2)[:500])  # truncate for readability

# Inspect both files
inspect_keys(ARR_FILE)
inspect_keys(DEP_FILE)



--- Keys in dub_arrival_history.json ---
['airline.iataCode', 'airline.icaoCode', 'airline.name', 'arrival.actualRunway', 'arrival.actualTime', 'arrival.baggage', 'arrival.delay', 'arrival.estimatedRunway', 'arrival.estimatedTime', 'arrival.gate', 'arrival.iataCode', 'arrival.icaoCode', 'arrival.scheduledTime', 'arrival.terminal', 'codeshared.airline.iataCode', 'codeshared.airline.icaoCode', 'codeshared.airline.name', 'codeshared.flight.iataNumber', 'codeshared.flight.icaoNumber', 'codeshared.flight.number', 'departure.actualRunway', 'departure.actualTime', 'departure.delay', 'departure.estimatedRunway', 'departure.estimatedTime', 'departure.gate', 'departure.iataCode', 'departure.icaoCode', 'departure.scheduledTime', 'departure.terminal', 'flight.iataNumber', 'flight.icaoNumber', 'flight.number', 'status', 'type']

Sample record:
{
  "type": "arrival",
  "status": "landed",
  "departure": {
    "iataCode": "vlc",
    "icaoCode": "levc",
    "terminal": "1",
    "gate": "3",
    "dela

### 📂 Step 12 – Parse Flight History JSON into Tidy DataFrames

This step takes the raw JSON files downloaded from the Aviation Edge API  
(`dub_arrival_history.json` and `dub_departure_history.json`) and converts them into clean, analysis-ready DataFrames.

**What the code does:**
- Loads the raw JSON files for arrivals and departures.
- Flattens the nested JSON structure into tabular form by extracting each field with a dotted-path helper (`_safe_get`), which returns `None` for missing keys, rather than with `pandas.json_normalize`.
- Keeps only the **relevant fields** for weather analysis:
  - Flight number (`flight_iata`)
  - Airline name
  - Flight status (landed, cancelled, active, etc.)
  - Scheduled, estimated, and actual times
  - Delay in minutes as reported by the API (`delay`), plus `delay_calc`, which copies the reported delay and is computed from actual minus scheduled time only when the API reports no delays at all
  - Optional operational details (terminal, actual runway time)
- Converts timestamps into proper datetime objects.
- Derives `date` and `time` columns from the scheduled time to align flights with hourly weather data.
- Adds an `is_cancelled` flag and a `type` column to distinguish arrivals vs departures.
- Produces three DataFrames:
  - `df_arrivals` → tidy arrivals
  - `df_departures` → tidy departures
  - `df_all` → combined dataset of both arrivals and departures

**Why this matters:**
- The raw JSON files are large and verbose; this step filters them down to only the fields needed for analysis.
- The tidy DataFrames provide a consistent schema that can be merged directly with weather observations.
- Having both separate and combined views makes it easy to analyse arrivals and departures independently or together.
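
The parsing cell below computes `delay_calc` from timestamps only when the `delay` column is entirely empty, so rows where the API omitted a delay stay `NaN`. If a per-row fallback is preferred, one possible approach is sketched here. The column name `delay_filled` and the sample rows are assumptions for illustration, not part of the notebook's pipeline.

```python
import pandas as pd

df = pd.DataFrame({
    "sched": pd.to_datetime(["2025-05-20 01:00", "2025-05-20 01:10", "2025-05-20 04:25"]),
    "act": pd.to_datetime(["2025-05-20 01:15", "2025-05-20 01:03", None]),
    "delay": [15.0, None, None],  # reported delay, missing for two flights
})

# Delay implied by the timestamps (negative when the flight was early)
computed = (df["act"] - df["sched"]).dt.total_seconds() / 60.0

# Prefer the reported value; fall back to the computed one row by row
df["delay_filled"] = pd.to_numeric(df["delay"], errors="coerce").fillna(computed)
print(df[["delay", "delay_filled"]])
```

The third row stays `NaN` because neither a reported delay nor an actual time is available. Whether early arrivals should remain negative or be clipped to zero is a separate modelling choice.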



In [13]:
# 📂 Step 12 – Parse Flight History JSON into Tidy DataFrames
# NOTE: Rewritten to avoid pandas.json_normalize issues (e.g. nested/circular structures).
#       We safely extract nested fields and build a flat DataFrame.

RAW_DIR = Path("data") / "raw_flights"
ARR_FILE = RAW_DIR / "dub_arrival_history.json"
DEP_FILE = RAW_DIR / "dub_departure_history.json"

def _safe_get(d, path):
    """Get nested value from dict using dotted path, return None if missing."""
    if d is None:
        return None
    parts = path.split(".")
    val = d
    for p in parts:
        if isinstance(val, dict) and p in val:
            val = val[p]
        else:
            return None
    return val

def parse_flights(json_file, flight_type="arrival"):
    """
    Load a JSON file (arrivals or departures) and return a tidy DataFrame.
    Safely extracts nested fields to avoid json_normalize edge-cases.
    """
    with open(json_file, "r", encoding="utf-8") as f:
        records = json.load(f)

    if not records:  # safeguard against empty files
        return pd.DataFrame()

    # Define mapping: target_column -> dotted path in record
    if flight_type == "arrival":
        mapping = {
            "flight_iata": "flight.iataNumber",
            "airline": "airline.name",
            "status": "status",
            "sched": "arrival.scheduledTime",
            "est": "arrival.estimatedTime",
            "act": "arrival.actualTime",
            "delay": "arrival.delay",
            "terminal": "arrival.terminal",
            "runway": "arrival.actualRunway",
        }
    else:
        mapping = {
            "flight_iata": "flight.iataNumber",
            "airline": "airline.name",
            "status": "status",
            "sched": "departure.scheduledTime",
            "est": "departure.estimatedTime",
            "act": "departure.actualTime",
            "delay": "departure.delay",
            "terminal": "departure.terminal",
            "runway": "departure.actualRunway",
        }

    rows = []
    for rec in records:
        row = {}
        for col, path in mapping.items():
            row[col] = _safe_get(rec, path)
        rows.append(row)

    df = pd.DataFrame(rows)

    # Parse timestamps into datetimes
    for c in ["sched", "est", "act"]:
        if c in df.columns:
            df[c] = pd.to_datetime(df[c], errors="coerce")

    # Compute delay_calc: prefer explicit 'delay' if present, else compute from times
    if "delay" not in df.columns or df["delay"].isna().all():
        # compute from sched/act where possible
        df["delay_calc"] = (df["act"] - df["sched"]).dt.total_seconds() / 60.0
    else:
        df["delay_calc"] = pd.to_numeric(df["delay"], errors="coerce")

    # Merge keys for weather alignment
    if "sched" in df.columns:
        df["date"] = df["sched"].dt.date
        df["time"] = df["sched"].dt.strftime("%H:%M")
    else:
        df["date"] = pd.NaT
        df["time"] = None

    df["is_cancelled"] = df.get("status", pd.Series(dtype="object")).astype(str).str.lower().eq("cancelled")
    df["type"] = flight_type  # add explicit type column

    return df

# --- Load both arrivals and departures ---
df_arrivals = parse_flights(ARR_FILE, "arrival")
df_departures = parse_flights(DEP_FILE, "departure")

# --- Optionally combine into one DataFrame ---
df_all = pd.concat([df_arrivals, df_departures], ignore_index=True, sort=False)

print("Arrivals shape:", df_arrivals.shape)
print("Departures shape:", df_departures.shape)
print("Combined shape:", df_all.shape)

print(df_all.head())


Arrivals shape: (131556, 14)
Departures shape: (137720, 14)
Combined shape: (269276, 14)
  flight_iata            airline  status               sched  \
0      fr1739            ryanair  landed 2025-05-20 01:00:00   
1      fr9612            ryanair  landed 2025-05-20 01:10:00   
2       fr651            ryanair  landed 2025-05-20 01:15:00   
3      aa8330  american airlines  landed 2025-05-20 04:25:00   
4      ba6124    british airways  landed 2025-05-20 04:25:00   

                  est                 act  delay terminal  \
0 2025-05-20 01:15:00 2025-05-20 01:15:00   15.0       t1   
1 2025-05-20 01:11:00 2025-05-20 01:03:00    NaN       t1   
2 2025-05-20 01:12:00 2025-05-20 01:05:00    NaN       t1   
3 2025-05-20 03:39:00 2025-05-20 03:39:00    NaN       t2   
4 2025-05-20 03:39:00 2025-05-20 03:39:00    NaN       t2   

                    runway  delay_calc        date   time  is_cancelled  \
0  2025-05-20t01:15:00.000        15.0  2025-05-20  01:00         False   
1  2025-0

### 📂 Step 13 – Save Tidy Flight DataFrames

In this step, the cleaned flight DataFrames created in Step 12 are written out to disk.  
Saving them ensures that the tidy, analysis-ready datasets are preserved and can be reused without re-parsing the raw JSON files.

**What is saved:**
- `dub_arrivals_tidy.csv` → arrivals into Dublin Airport, with relevant fields (flight number, airline, status, times, delay, etc.).
- `dub_departures_tidy.csv` → departures from Dublin Airport, with the same tidy structure.
- `dub_flights_tidy.csv` → combined dataset containing both arrivals and departures, distinguished by a `type` column.
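
One detail worth keeping in mind when reusing these files: CSV stores timestamps as text, so datetime columns need to be parsed again on load. The sketch below illustrates this with a throwaway file in a temporary directory; the file name `demo_tidy.csv` and the single sample row are made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "flight_iata": ["fr1739"],
    "sched": pd.to_datetime(["2025-05-20 01:00:00"]),
    "type": ["arrival"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_tidy.csv"
    df.to_csv(path, index=False)
    as_text = pd.read_csv(path)                         # 'sched' comes back as plain strings
    as_time = pd.read_csv(path, parse_dates=["sched"])  # 'sched' is parsed back into datetimes

print(as_text["sched"].dtype, as_time["sched"].dtype)
```

Calling `pd.to_datetime` on the column after loading, as a later step of this notebook does, achieves the same result.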

**Why this matters:**
- The tidy CSVs are much smaller and easier to work with than the raw JSON files.
- They provide a consistent schema for merging with weather data (using `date` and `time`).
- Reviewers and collaborators can immediately use these files without needing to rerun the full API download.

👉 This step finalises the preprocessing pipeline: raw JSON → tidy DataFrames → reusable CSVs in the `data/` folder.


In [14]:
# 📂 Step 13 – Save Tidy Flight DataFrames

DATA_DIR = Path("data")

# Save arrivals
df_arrivals.to_csv(DATA_DIR / "dub_arrivals_tidy.csv", index=False)

# Save departures
df_departures.to_csv(DATA_DIR / "dub_departures_tidy.csv", index=False)

# Save combined dataset
df_all.to_csv(DATA_DIR / "dub_flights_tidy.csv", index=False)

print("✅ Saved tidy datasets into data/ folder")


✅ Saved tidy datasets into data/ folder


### 🌦️ Step 14 – Define Meteorological Seasons in Ireland for 2025 (Relevant to Dublin Airport Temperature Analysis)

Ireland’s meteorological seasons follow a fixed calendar pattern, as outlined by Met Éireann:

- **Winter**: 1 December (previous year) to 28 February (29 February in leap years)  
- **Spring**: 1 March to 31 May  
- **Summer**: 1 June to 31 August  
- **Autumn**: 1 September to 30 November  
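
The fixed calendar boundaries above can be generated programmatically. The notebook relies on its own `define_irish_seasons` helper, whose implementation is not shown in this section, so the function below (`season_bounds`) is a hypothetical stand-in written only to illustrate the idea, with season ends set to 23:59 as in the printed output of this step.

```python
import calendar
from datetime import datetime

def season_bounds(year: int) -> list[tuple[str, datetime, datetime]]:
    """Meteorological seasons for `year`; winter starts on 1 December of the previous year."""
    feb_last = calendar.monthrange(year, 2)[1]  # 28, or 29 in a leap year
    return [
        ("Winter", datetime(year - 1, 12, 1), datetime(year, 2, feb_last, 23, 59)),
        ("Spring", datetime(year, 3, 1), datetime(year, 5, 31, 23, 59)),
        ("Summer", datetime(year, 6, 1), datetime(year, 8, 31, 23, 59)),
        ("Autumn", datetime(year, 9, 1), datetime(year, 11, 30, 23, 59)),
    ]

for name, start, end in season_bounds(2025):
    print(f"{name}: {start:%d-%b-%Y %H:%M} -> {end:%d-%b-%Y %H:%M}")
```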

In this step, we use the `define_irish_seasons` helper to define seasonal boundaries for 2025. These boundaries will later help us filter and analyse temperature data by season.

📌 *Reference:* [Met Éireann – Irish Seasons](https://www.met.ie/education/outreach-irish-seasons)


In [15]:
# 🌦️ Step 14 – Define Meteorological Seasons in Ireland for 2025

# --- Generate seasonal boundaries using helper function ---
seasons_2025 = define_irish_seasons()

# --- Display formatted season ranges ---
print("📅 Irish Seasons for 2025:")
for _, row in seasons_2025.iterrows():
    print(f"  {row['season']}: {row['start'].strftime('%d-%b-%Y %H:%M')} → {row['end'].strftime('%d-%b-%Y %H:%M')}")

📅 Irish Seasons for 2025:
  Winter: 01-Dec-2024 00:00 → 28-Feb-2025 23:59
  Spring: 01-Mar-2025 00:00 → 31-May-2025 23:59
  Summer: 01-Jun-2025 00:00 → 31-Aug-2025 23:59
  Autumn: 01-Sep-2025 00:00 → 30-Nov-2025 23:59


### 📆 Step 15 – Define and Validate a Custom Date Range (Dublin Weather)

In this step, we define a **custom date range** for Dublin Airport hourly weather data and validate it against the seasonal boundaries.  
This ensures that the selected range is valid, reproducible, and contextually meaningful for downstream analysis.

The process includes:

- 📅 Using the `get_custom_range` helper to parse start and end dates from user input  
- ⚠️ Falling back to default values if the input is invalid or cannot be parsed  
- 🍂 Checking whether the custom range falls entirely within a single Irish meteorological season  
- 🕒 Preparing the `df_weather` dataset with a full `datetime` column for accurate filtering  
- 🔍 Filtering `df_weather` to include only rows within the validated custom range  

📌 *Why this matters:*  
Validating custom ranges ensures that analysis is seasonally consistent and avoids mixing weather effects across boundaries.  
By confirming whether the range lies within a single season, we maintain clarity in interpretation.  
Filtering the dataset to the exact range provides a focused subset of hourly Dublin Airport weather data, ready for alignment with arrivals and departures.
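
The single-season check can be expressed as a small standalone function. This is a sketch under stated assumptions: `season_containing` and the hard-coded `SEASONS_2025` list are illustrative names, not the notebook's helpers, and the function assumes the start of the range is not after its end.

```python
from datetime import datetime
from typing import Optional

SEASONS_2025 = [
    ("Winter", datetime(2024, 12, 1), datetime(2025, 2, 28, 23, 59)),
    ("Spring", datetime(2025, 3, 1), datetime(2025, 5, 31, 23, 59)),
    ("Summer", datetime(2025, 6, 1), datetime(2025, 8, 31, 23, 59)),
    ("Autumn", datetime(2025, 9, 1), datetime(2025, 11, 30, 23, 59)),
]

def season_containing(start: datetime, end: datetime, seasons=SEASONS_2025) -> Optional[str]:
    """Return the season that contains the whole range, or None if it crosses a boundary."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    for name, s_start, s_end in seasons:
        if s_start <= start and end <= s_end:
            return name
    return None

print(season_containing(datetime(2025, 10, 27), datetime(2025, 10, 31, 23, 59)))  # Autumn
print(season_containing(datetime(2025, 8, 30), datetime(2025, 9, 2)))             # None
```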


In [16]:
# 📆 Step 15 – Define and Validate a Custom Date Range (Dublin Weather)

# --- Define custom date range using helper ---
custom_start, custom_end = get_custom_range("2025-10-27", "2025-10-31 23:59")

# --- Fallback to default if input is invalid ---
if custom_start is None or custom_end is None:
    print("⚠️ Invalid custom range returned by get_custom_range(); falling back to defaults.")
    custom_start = pd.Timestamp("2025-11-12")
    custom_end = pd.Timestamp("2025-11-16 23:59")

# --- Check which season the range falls into ---
matched_season = None
for _, row in seasons_2025.iterrows():
    if row["start"] <= custom_start <= row["end"] and row["start"] <= custom_end <= row["end"]:
        matched_season = row["season"]
        break

# --- Display season match result ---
if matched_season:
    print(f"📆 The custom range falls entirely within: {matched_season}")
else:
    print("⚠️ The custom range spans multiple seasons or falls outside defined bounds.")

# --- Prepare filtered weather data for Dublin Airport ---
# Ensure full datetime column first
df_weather = prepare_datetime(df_weather, date_col='date', time_col='time')

# Then filter using the validated custom range
range_df = prepare_weather_data(df_weather)
range_df = range_df[(range_df['datetime'] >= custom_start) & (range_df['datetime'] <= custom_end)]

print(f"✅ Filtered Dublin weather data contains {len(range_df)} rows from {custom_start.date()} → {custom_end.date()}")
display(range_df.head())


📆 The custom range falls entirely within: Autumn
✅ Filtered Dublin weather data contains 120 rows from 2025-10-27 → 2025-10-31


Unnamed: 0,date,ind,rain,ind.1,temp,ind.2,wetb,dewpt,vappr,rhum,...,wddir,ww,w,sun,vis,clht,clamt,datetime,hour,season
708456,2025-10-27,0,0.0,0,9.4,0,8.0,6.4,9.6,81,...,260,2,11,0.0,20000,27,7,2025-10-27 00:00:00,0,Autumn
708457,2025-10-27,0,0.0,0,9.2,0,7.6,5.8,9.2,79,...,270,2,11,0.0,20000,26,7,2025-10-27 01:00:00,1,Autumn
708458,2025-10-27,0,0.0,0,9.1,0,7.5,5.6,9.1,79,...,270,2,11,0.0,20000,999,3,2025-10-27 02:00:00,2,Autumn
708459,2025-10-27,0,0.0,0,9.4,0,8.0,6.4,9.6,81,...,270,2,11,0.0,20000,37,7,2025-10-27 03:00:00,3,Autumn
708460,2025-10-27,0,0.0,0,8.9,0,7.5,5.9,9.2,81,...,270,2,11,0.0,20000,999,3,2025-10-27 04:00:00,4,Autumn


### 📊 Step 16 – Load and Clean Dublin Airport Flight Data

In this step, we load the tidy flight datasets created earlier (arrivals, departures, or the combined file) and prepare them for analysis.  
The goal is to ensure that the flight records have a consistent **datetime column** and clean numeric delay values, so they can be merged with weather data later.

The process includes:

- 📂 Checking which tidy flight files are available in the `data/` folder  
- 📥 Loading the combined dataset if present, or concatenating arrivals and departures as fallback  
- 🧹 Normalising column names to lowercase and stripping whitespace  
- 🕒 Creating a canonical `datetime` column based on the scheduled time (`sched`)  
- 🔢 Converting delay fields into numeric values for analysis  
- 🚫 Dropping rows without valid datetime values  

📌 *Why this matters:*  
A clean and consistent flight dataset ensures reproducibility and makes it possible to align flights with weather observations.  
By anchoring on the scheduled time (`sched`), we can reliably merge flights with hourly weather data in the next step.
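
To illustrate why a clean `datetime` column matters, the sketch below shows one way flights could be matched to hourly weather: floor each scheduled time to the start of its hour and join on that key. The column name `weather_hour` and all sample values are assumptions for illustration; the notebook's actual merge may differ.

```python
import pandas as pd

flights = pd.DataFrame({
    "flight_iata": ["fr9612", "aa8330"],
    "datetime": pd.to_datetime(["2025-05-20 01:10", "2025-05-20 04:25"]),
})
weather = pd.DataFrame({
    "datetime": pd.to_datetime(["2025-05-20 01:00", "2025-05-20 04:00"]),
    "rain": [0.2, 0.0],  # made-up values for illustration
})

# Floor each scheduled time to the start of its hour, then join on that key
flights["weather_hour"] = flights["datetime"].dt.floor("h")
merged = flights.merge(
    weather.rename(columns={"datetime": "weather_hour"}),
    on="weather_hour",
    how="left",  # keep every flight even if no weather row matches
)
print(merged[["flight_iata", "weather_hour", "rain"]])
```

Flooring assigns a 01:10 flight to the 01:00 observation; rounding to the nearest hour would be an equally defensible choice.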


In [17]:
# 📊 Step 16 – Load and clean Dublin Airport flight data

DATA_DIR = Path("data")

# --- Inspect available tidy flight files ---
print("📂 Available files in data/:", os.listdir(DATA_DIR))

# --- Load the combined tidy flights file if present; fall back to arrivals+departures ---
flights_path = DATA_DIR / "dub_flights_tidy.csv"
arr_path = DATA_DIR / "dub_arrivals_tidy.csv"
dep_path = DATA_DIR / "dub_departures_tidy.csv"

if flights_path.exists():
    # Preferred: combined tidy dataset
    df_flights = pd.read_csv(flights_path, low_memory=False)
    source_used = "combined"
elif arr_path.exists() and dep_path.exists():
    # Fallback: arrivals + departures concatenated
    df_arrivals = pd.read_csv(arr_path, low_memory=False)
    df_departures = pd.read_csv(dep_path, low_memory=False)
    # Harmonise columns before concatenation
    common_cols = sorted(set(df_arrivals.columns).intersection(set(df_departures.columns)))
    df_flights = pd.concat(
        [df_arrivals[common_cols], df_departures[common_cols]],
        ignore_index=True
    )
    source_used = "arrivals+departures"
else:
    raise FileNotFoundError("❌ No tidy flight files found in data/. Expected dub_flights_tidy.csv or both arrivals/departures.")

print(f"✅ Loaded flight dataset from: {source_used} ({len(df_flights)} rows)")

# --- Normalise column names and whitespace ---
df_flights.columns = df_flights.columns.str.strip().str.lower()

# --- Ensure a proper datetime column ---
# Use 'sched' (scheduled time) as the canonical datetime for weather alignment
if "sched" in df_flights.columns:
    df_flights["datetime"] = pd.to_datetime(df_flights["sched"], errors="coerce")
elif "datetime" in df_flights.columns:
    df_flights["datetime"] = pd.to_datetime(df_flights["datetime"], errors="coerce")
elif "date" in df_flights.columns and "time" in df_flights.columns:
    df_flights = prepare_datetime(df_flights, date_col="date", time_col="time")
else:
    raise KeyError("❌ Flight data needs either 'sched', 'datetime', or both 'date' and 'time' columns.")

# --- Convert delay columns to numeric if present ---
for col in ["arr_delay", "dep_delay", "delay", "delay_minutes", "delay_calc"]:
    if col in df_flights.columns:
        df_flights[col] = pd.to_numeric(df_flights[col], errors="coerce")

# --- Drop rows without valid datetime and reset index ---
df_flights = df_flights.dropna(subset=["datetime"]).reset_index(drop=True)

print(f"🧹 Cleaned flight dataset has {len(df_flights)} rows with valid datetime")
display(df_flights.head())


📂 Available files in data/: ['dublin_airport_hourly.csv', 'dub_arrivals_tidy.csv', 'dub_departures_tidy.csv', 'dub_flights_tidy.csv', 'raw_flights']
✅ Loaded flight dataset from: combined (269276 rows)
🧹 Cleaned flight dataset has 269270 rows with valid datetime


Unnamed: 0,flight_iata,airline,status,sched,est,act,delay,terminal,runway,delay_calc,date,time,is_cancelled,type,datetime
0,fr1739,ryanair,landed,2025-05-20 01:00:00,2025-05-20 01:15:00,2025-05-20 01:15:00,15.0,t1,2025-05-20t01:15:00.000,15.0,2025-05-20,01:00,False,arrival,2025-05-20 01:00:00
1,fr9612,ryanair,landed,2025-05-20 01:10:00,2025-05-20 01:11:00,2025-05-20 01:03:00,,t1,2025-05-20t01:03:00.000,,2025-05-20,01:10,False,arrival,2025-05-20 01:10:00
2,fr651,ryanair,landed,2025-05-20 01:15:00,2025-05-20 01:12:00,2025-05-20 01:05:00,,t1,2025-05-20t01:05:00.000,,2025-05-20,01:15,False,arrival,2025-05-20 01:15:00
3,aa8330,american airlines,landed,2025-05-20 04:25:00,2025-05-20 03:39:00,2025-05-20 03:39:00,,t2,2025-05-20t03:39:00.000,,2025-05-20,04:25,False,arrival,2025-05-20 04:25:00
4,ba6124,british airways,landed,2025-05-20 04:25:00,2025-05-20 03:39:00,2025-05-20 03:39:00,,t2,2025-05-20t03:39:00.000,,2025-05-20,04:25,False,arrival,2025-05-20 04:25:00
