# Advanced Temporal Feature Engineering for Flight Delay Prediction

This notebook adds advanced temporal features to the cleaned flight dataset:
- Day of the week
- Week of the year  
- Major holiday flags

These features can help capture seasonal and holiday-related patterns in flight delays.

## 1. Import Required Libraries

In [2]:
import warnings
from datetime import datetime, timedelta

import holidays
import numpy as np
import pandas as pd

warnings.filterwarnings("ignore")

print("✅ All libraries imported successfully, including holidays library")

✅ All libraries imported successfully, including holidays library


## 2. Load and Parse Flight Data

In [3]:
# Load the cleaned flight data
print("Loading cleaned flight data...")
df = pd.read_csv("dataset/flights_cleaned.csv")
print(f"Loaded {len(df):,} flights")
print(f"Date range: {df['FL_DATE'].min()} to {df['FL_DATE'].max()}")

# Convert FL_DATE to datetime if not already
df["FL_DATE"] = pd.to_datetime(df["FL_DATE"])
print(f"FL_DATE converted to datetime: {df['FL_DATE'].dtype}")

# Display sample data
print("\nSample data:")
df.head()

Loading cleaned flight data...
Loaded 500,000 flights
Date range: 2019-01-01 to 2023-08-31
FL_DATE converted to datetime: datetime64[ns]

Sample data:


Unnamed: 0,FL_DATE,AIRLINE,FL_NUMBER,ORIGIN,DEST,CRS_DEP_TIME,CRS_ARR_TIME,ARR_DELAY,CRS_ELAPSED_TIME,DISTANCE
0,2019-01-09,United Air Lines Inc.,1562,FLL,EWR,1155,1501,-14.0,186.0,1065.0
1,2022-11-19,Delta Air Lines Inc.,1149,MSP,SEA,2120,2315,-5.0,235.0,1399.0
2,2022-07-22,United Air Lines Inc.,459,DEN,MSP,954,1252,0.0,118.0,680.0
3,2023-03-06,Delta Air Lines Inc.,2295,MSP,SFO,1609,1829,24.0,260.0,1589.0
4,2020-02-23,Spirit Air Lines,407,MCO,DFW,1840,2041,-1.0,181.0,985.0
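The load step assumes every `FL_DATE` value parses cleanly. The following is a minimal sketch of how that assumption could be checked, using a hypothetical three-row extract (including one deliberately malformed date) in place of `dataset/flights_cleaned.csv`. The variable names `csv_text`, `sample`, and `n_bad` are illustrative and not part of the notebook.

```python
import io

import pandas as pd

# Hypothetical extract standing in for dataset/flights_cleaned.csv;
# the last row carries a deliberately malformed date.
csv_text = """FL_DATE,AIRLINE,ARR_DELAY
2019-01-09,United Air Lines Inc.,-14.0
2022-11-19,Delta Air Lines Inc.,-5.0
not-a-date,Spirit Air Lines,-1.0
"""

sample = pd.read_csv(io.StringIO(csv_text))

# An explicit format avoids per-row format inference, and errors="coerce"
# turns unparseable values into NaT instead of raising.
sample["FL_DATE"] = pd.to_datetime(
    sample["FL_DATE"], format="%Y-%m-%d", errors="coerce"
)

n_bad = int(sample["FL_DATE"].isna().sum())
print(f"Rows with unparseable dates: {n_bad}")
```

Counting `NaT` values right after conversion makes silent date problems visible before any temporal feature is derived from the column.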


## 3. Extract Day of the Week

In [4]:
# Extract day of week (0=Monday, 6=Sunday)
df["day_of_week"] = df["FL_DATE"].dt.dayofweek
df["day_of_week_name"] = df["FL_DATE"].dt.day_name()

# Create weekend flag
df["is_weekend"] = df["day_of_week"].isin([5, 6]).astype(int)

print("Day of week distribution:")
print(df["day_of_week_name"].value_counts().sort_index())

print(f"\nWeekend flights: {df['is_weekend'].mean():.1%}")
print(f"Weekday flights: {(1 - df['is_weekend'].mean()):.1%}")

# Show sample
df[["FL_DATE", "day_of_week", "day_of_week_name", "is_weekend"]].head()

Day of week distribution:
day_of_week_name
Friday       74305
Monday       74428
Saturday     64276
Sunday       72757
Thursday     74387
Tuesday      69255
Wednesday    70592
Name: count, dtype: int64

Weekend flights: 27.4%
Weekday flights: 72.6%


Unnamed: 0,FL_DATE,day_of_week,day_of_week_name,is_weekend
0,2019-01-09,2,Wednesday,0
1,2022-11-19,5,Saturday,1
2,2022-07-22,4,Friday,0
3,2023-03-06,0,Monday,0
4,2020-02-23,6,Sunday,1
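The integer `day_of_week` places Sunday (6) far from Monday (0) even though the two days are adjacent. One common option, sketched here as an assumption rather than as part of the notebook's pipeline, is a sine/cosine encoding that maps the week onto a circle. The helper names `cyclic_encode` and `circle_distance` are hypothetical.

```python
import numpy as np
import pandas as pd


def cyclic_encode(values: pd.Series, period: int) -> pd.DataFrame:
    """Map a periodic integer feature onto the unit circle (hypothetical helper)."""
    angle = 2 * np.pi * values.astype(float) / period
    return pd.DataFrame(
        {"sin": np.sin(angle), "cos": np.cos(angle)}, index=values.index
    )


def circle_distance(a: pd.Series, b: pd.Series) -> float:
    """Euclidean distance between two encoded points."""
    return float(np.hypot(a["sin"] - b["sin"], a["cos"] - b["cos"]))


# Dates from the sample rows: Wednesday, Saturday, Monday, Sunday
dates = pd.to_datetime(
    pd.Series(["2019-01-09", "2022-11-19", "2023-03-06", "2020-02-23"])
)
day_of_week = dates.dt.dayofweek  # 0=Monday ... 6=Sunday
encoded = cyclic_encode(day_of_week, period=7)

wednesday, monday, sunday = encoded.iloc[0], encoded.iloc[2], encoded.iloc[3]
print(f"Sunday-Monday distance:    {circle_distance(sunday, monday):.3f}")
print(f"Wednesday-Monday distance: {circle_distance(wednesday, monday):.3f}")
```

In the encoded space Sunday sits closer to Monday than Wednesday does, which the raw 0 to 6 coding cannot express. Tree-based models may not need this, so whether to add the two columns is a modelling choice.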


## 4. Extract Week of the Year

In [5]:
# Extract ISO 8601 week of year (1-53); dates around New Year can belong to the adjacent ISO year
df["week_of_year"] = df["FL_DATE"].dt.isocalendar().week

print("Week of year distribution:")
print(f"Min week: {df['week_of_year'].min()}")
print(f"Max week: {df['week_of_year'].max()}")
print(f"Unique weeks: {df['week_of_year'].nunique()}")

# Group by week to see flight volume patterns
weekly_flights = df.groupby("week_of_year").size()
print(f"\nFlights per week - Min: {weekly_flights.min():,}, Max: {weekly_flights.max():,}")

# Show sample
df[["FL_DATE", "week_of_year"]].head()

Week of year distribution:
Min week: 1
Max week: 53
Unique weeks: 53

Flights per week - Min: 1,527, Max: 11,255


Unnamed: 0,FL_DATE,week_of_year
0,2019-01-09,2
1,2022-11-19,46
2,2022-07-22,29
3,2023-03-06,10
4,2020-02-23,8
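Because `dt.isocalendar()` follows ISO 8601, a late-December date can fall in week 1 and an early-January date in week 52 or 53. The short sketch below illustrates this with three dates chosen for the purpose (only the last one appears in the sample rows). The `check` frame is illustrative, not something the notebook builds.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2019-12-30", "2021-01-01", "2022-07-22"]))
iso = dates.dt.isocalendar()  # columns: year, week, day

check = pd.DataFrame(
    {
        "FL_DATE": dates,
        "calendar_year": dates.dt.year,
        "iso_year": iso["year"],
        "iso_week": iso["week"],
    }
)
print(check)
```

Here 2019-12-30 is reported as week 1 of ISO year 2020, and 2021-01-01 as week 53 of ISO year 2020. If `week_of_year` is later combined with the calendar year, pairing it with the ISO year instead keeps the two consistent.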


## 5. Flag Major Holidays

In [7]:
import holidays

# Get unique years in the dataset
years = df["FL_DATE"].dt.year.unique()
print(f"Years in dataset: {sorted(years)}")

# Create holiday dictionary for all years
all_holidays = {}
for year in years:
    us_holidays = holidays.US(years=year)
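    # Keep federal holidays by keyword; note that "INDEPENDENCE" also matches
    # "Juneteenth National Independence Day", and "(observed)" dates are kept too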
    for date, name in us_holidays.items():
        if any(
            keyword in name.upper()
            for keyword in [
                "NEW YEAR",
                "MARTIN LUTHER KING",
                "WASHINGTON",
                "MEMORIAL",
                "INDEPENDENCE",
                "LABOR",
                "COLUMBUS",
                "VETERANS",
                "THANKSGIVING",
                "CHRISTMAS",
            ]
        ):
            all_holidays[date] = name

print(f"\nTotal holidays identified: {len(all_holidays)}")
for date, name in sorted(all_holidays.items()):
    print(f"  {date.strftime('%Y-%m-%d')}: {name}")

# Vectorized holiday detection via set membership and dictionary lookups
print("\nApplying lightning-fast holiday detection...")

# Create sets of dates for different holiday categories
near_holiday_dates = set()
holiday_period_dates = set()

holiday_name_map = {}
period_name_map = {}
days_to_holiday_map = {}

for holiday_date, holiday_name in all_holidays.items():
    # Near holidays (±3 days)
    for offset in range(-3, 4):
        date = holiday_date + pd.Timedelta(days=offset)
        near_holiday_dates.add(date)
        # Where windows overlap, the holiday processed last overwrites earlier entries
        holiday_name_map[date] = holiday_name

    # Holiday periods (7 days before through 3 days after the holiday)
    for offset in range(-7, 4):
        date = holiday_date + pd.Timedelta(days=offset)
        holiday_period_dates.add(date)
        period_name_map[date] = holiday_name
        # Signed offset in days: negative before the holiday, positive after
        days_to_holiday_map[date] = offset

print(f"Total near-holiday dates to check: {len(near_holiday_dates)}")
print(f"Total holiday period dates to check: {len(holiday_period_dates)}")

# Vectorized operations using pandas isin() - extremely fast
df["is_near_holiday"] = df["FL_DATE"].isin(near_holiday_dates).astype(int)
df["is_holiday_period"] = df["FL_DATE"].isin(holiday_period_dates).astype(int)

# Map names and days using pandas map() - also very fast
df["nearest_holiday"] = df["FL_DATE"].map(holiday_name_map)
df["holiday_period_name"] = df["FL_DATE"].map(period_name_map)
df["days_to_holiday"] = df["FL_DATE"].map(days_to_holiday_map)

print("Holiday analysis complete!")
print(f"Flights near holidays: {df['is_near_holiday'].sum():,}")
print(f"Flights in holiday periods: {df['is_holiday_period'].sum():,}")

# Show holiday distribution
print("\nHoliday distribution:")
holiday_counts = df[df["is_near_holiday"] == 1]["nearest_holiday"].value_counts()
for holiday, count in holiday_counts.items():
    print(f"  {holiday}: {count:,} flights")

# Show sample
df[["FL_DATE", "is_near_holiday", "nearest_holiday", "is_holiday_period", "days_to_holiday"]].head(10)

Years in dataset: [np.int32(2019), np.int32(2020), np.int32(2021), np.int32(2022), np.int32(2023)]

Total holidays identified: 62
  2019-01-01: New Year's Day
  2019-01-21: Martin Luther King Jr. Day
  2019-02-18: Washington's Birthday
  2019-05-27: Memorial Day
  2019-07-04: Independence Day
  2019-09-02: Labor Day
  2019-10-14: Columbus Day
  2019-11-11: Veterans Day
  2019-11-28: Thanksgiving Day
  2019-12-25: Christmas Day
  2020-01-01: New Year's Day
  2020-01-20: Martin Luther King Jr. Day
  2020-02-17: Washington's Birthday
  2020-05-25: Memorial Day
  2020-07-03: Independence Day (observed)
  2020-07-04: Independence Day
  2020-09-07: Labor Day
  2020-10-12: Columbus Day
  2020-11-11: Veterans Day
  2020-11-26: Thanksgiving Day
  2020-12-25: Christmas Day
  2021-01-01: New Year's Day
  2021-01-18: Martin Luther King Jr. Day
  2021-02-15: Washington's Birthday
  2021-05-31: Memorial Day
  2021-06-18: Juneteenth National Independence Day (observed)
  2021-06-19: Juneteenth Nation

Unnamed: 0,FL_DATE,is_near_holiday,nearest_holiday,is_holiday_period,days_to_holiday
0,2019-01-09,0,,0,
1,2022-11-19,0,,1,-5.0
2,2022-07-22,0,,0,
3,2023-03-06,0,,0,
4,2020-02-23,0,,0,
5,2019-07-31,0,,0,
6,2023-06-11,0,,0,
7,2019-07-08,0,,0,
8,2023-02-12,0,,0,
9,2020-08-22,0,,0,
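The dictionary lookups in this cell store a single holiday per date, so where two windows overlap (for example Christmas Day and New Year's Day) the name and offset come from whichever holiday was processed last, not necessarily the closest one. The following is a sketch of one alternative, assuming a sorted holiday table and `numpy.searchsorted`. It uses three holiday dates from the printed list and four illustrative flight dates. The names `holiday_table` and `flights` are hypothetical.

```python
import numpy as np
import pandas as pd

# Three holiday dates taken from the printed list; a real run would use all of them
holiday_table = (
    pd.DataFrame(
        {
            "holiday_date": pd.to_datetime(["2019-11-28", "2019-12-25", "2020-01-01"]),
            "holiday_name": ["Thanksgiving Day", "Christmas Day", "New Year's Day"],
        }
    )
    .sort_values("holiday_date")
    .reset_index(drop=True)
)

# Illustrative flight dates, not rows from the dataset
flights = pd.DataFrame(
    {"FL_DATE": pd.to_datetime(["2019-12-25", "2019-12-27", "2019-12-30", "2019-11-20"])}
)

h = holiday_table["holiday_date"].to_numpy()
d = flights["FL_DATE"].to_numpy()

# Index of the first holiday on or after each flight date, and the one before it
nxt = np.searchsorted(h, d, side="left")
nxt_c = np.clip(nxt, 0, len(h) - 1)
prv_c = np.clip(nxt - 1, 0, len(h) - 1)

# Signed offsets in days: negative before the holiday, positive after
one_day = np.timedelta64(1, "D")
days_to_next = (d - h[nxt_c]) / one_day
days_to_prev = (d - h[prv_c]) / one_day

# Pick the closer holiday; ties go to the upcoming one
use_prev = np.abs(days_to_prev) < np.abs(days_to_next)
idx = np.where(use_prev, prv_c, nxt_c)
offset = np.where(use_prev, days_to_prev, days_to_next)

flights["nearest_holiday"] = holiday_table["holiday_name"].to_numpy()[idx]
flights["days_to_holiday"] = offset.astype(int)
flights["is_near_holiday"] = (np.abs(offset) <= 3).astype(int)
print(flights)
```

With this approach every flight gets a holiday name and a signed distance, and the ±3 day flag is derived from that distance rather than from a separate set. Unlike the notebook's columns, `days_to_holiday` is never missing, so a downstream model would see large offsets instead of `NaN` for mid-season dates.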


## 6. Summary and Save Enhanced Dataset

In [8]:
# Show final dataset info
new_features = [
    "day_of_week",
    "day_of_week_name",
    "is_weekend",
    "week_of_year",
    "is_near_holiday",
    "nearest_holiday",
    "is_holiday_period",
    "holiday_period_name",
    "days_to_holiday",
]

print("=== ENHANCED DATASET SUMMARY ===")
print(f"Total flights: {len(df):,}")
print(f"Original features: {len(df.columns) - len(new_features)}")
print(f"New temporal features added: {len(new_features)}")
print(f"Total features: {len(df.columns)}")

print("\nNew features added:")
for feature in new_features:
    if feature in df.columns:
        print(f"  ✅ {feature}")

print("\nFeature statistics:")
print(f"Weekend flights: {df['is_weekend'].mean():.1%}")
print(f"Flights near holidays: {df['is_near_holiday'].mean():.1%}")
print(f"Flights in holiday periods: {df['is_holiday_period'].mean():.1%}")

# Show correlation with delays (if ARR_DELAY exists)
# Pearson correlation only measures linear association, so values near zero
# do not rule out non-linear or cyclic effects of day_of_week and week_of_year
if "ARR_DELAY" in df.columns:
    print("\nCorrelation with arrival delays:")
    correlations = (
        df[["day_of_week", "is_weekend", "week_of_year", "is_near_holiday", "is_holiday_period", "ARR_DELAY"]]
        .corr()["ARR_DELAY"]
        .drop("ARR_DELAY")
    )
    for feature, corr in correlations.items():
        print(f"  {feature}: {corr:.3f}")

# Save enhanced dataset
output_file = "dataset/flights_cleaned_temporal.csv"
df.to_csv(output_file, index=False)
print(f"\n💾 Enhanced dataset saved to: {output_file}")

# Show final sample
print("\nSample of enhanced dataset:")
df[["FL_DATE", "day_of_week_name", "week_of_year", "is_near_holiday", "nearest_holiday"]].head()

=== ENHANCED DATASET SUMMARY ===
Total flights: 500,000
Original features: 10
New temporal features added: 9
Total features: 19

New features added:
  ✅ day_of_week
  ✅ day_of_week_name
  ✅ is_weekend
  ✅ week_of_year
  ✅ is_near_holiday
  ✅ nearest_holiday
  ✅ is_holiday_period
  ✅ holiday_period_name
  ✅ days_to_holiday

Feature statistics:
Weekend flights: 27.4%
Flights near holidays: 19.7%
Flights in holiday periods: 30.1%

Correlation with arrival delays:
  day_of_week: 0.011
  is_weekend: 0.003
  week_of_year: 0.005
  is_near_holiday: 0.011
  is_holiday_period: 0.011

💾 Enhanced dataset saved to: dataset/flights_cleaned_temporal.csv

Sample of enhanced dataset:


Unnamed: 0,FL_DATE,day_of_week_name,week_of_year,is_near_holiday,nearest_holiday
0,2019-01-09,Wednesday,2,0,
1,2022-11-19,Saturday,46,0,
2,2022-07-22,Friday,29,0,
3,2023-03-06,Monday,10,0,
4,2020-02-23,Sunday,8,0,
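The correlations reported for `day_of_week` and `week_of_year` are Pearson coefficients on integer codes, which only capture linear trends. The sketch below uses synthetic toy values (not results from the flight data) to show how a strong day-of-week pattern can still produce a coefficient of about zero, and how per-category means expose it. The names `toy`, `pearson`, and `mean_by_day` are illustrative.

```python
import pandas as pd

# Synthetic toy data: delays are highest on Monday (0) and Sunday (6)
# and lowest mid-week. These values are invented for illustration only.
toy = pd.DataFrame(
    {
        "day_of_week": list(range(7)) * 2,
        "ARR_DELAY": [30.0, 10.0, 0.0, -5.0, 0.0, 10.0, 30.0] * 2,
    }
)

# The linear coefficient misses the U-shaped pattern entirely
pearson = toy["day_of_week"].corr(toy["ARR_DELAY"])

# Per-day means make the pattern visible
mean_by_day = toy.groupby("day_of_week")["ARR_DELAY"].mean()
spread = mean_by_day.max() - mean_by_day.min()

print(f"Pearson correlation: {pearson:.3f}")
print(f"Spread of mean delay across days: {spread:.1f} minutes")
print(mean_by_day)
```

Applied to the real dataset, a `groupby` on `day_of_week`, `week_of_year`, or `days_to_holiday` may therefore be a more informative first look than the correlation table alone.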
