### AviationStack API Overview

AviationStack is a REST API that provides real-time and historical flight data from global aviation sources. It offers comprehensive flight information including schedules, status, delays, and airline/airport details.

### How You'll Use It in Your Project

#### Primary Data Source:

Fetch real-time flight status (scheduled/estimated/actual times)

Get historical delay patterns for specific flights/routes

Access airline and aircraft information

Retrieve airport-specific data

#### Key Use Cases:

Flight Identification: Look up flights by number/route to get unique flight IDs

Delay History: Analyze past performance of the same flight route

Real-time Status: Check current flight status for model features

Alternative Flights: Find other flights between same origin-destination

### Benefits for Your Project

Comprehensive Data: Single API for most flight information needs

Real-time & Historical: Supports both current status and pattern analysis

Reliable Source: Commercial-grade API with good uptime

Easy Integration: Simple REST endpoints with JSON responses

Free Tier Available: Sufficient for capstone project development

##### Libraries used for exploration: 

In [0]:
import requests
import pandas as pd
import json
from pprint import pprint
from datetime import datetime, timedelta
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, count, avg, sum as spark_sum, when
from pyspark.sql.types import *

#### CONFIGURATION

In [0]:
api_key = dbutils.secrets.get(scope="my-secrets", key="aviation_stack_api")
base_url = "https://api.aviationstack.com/v1"

#### 1. BASIC API EXPLORATION

In [0]:
# Get real-time flight data (the key is passed via params rather than embedded in the URL string)
response = requests.get(f"{base_url}/flights", params={"access_key": api_key}, timeout=30)
data = response.json()

print(f"\n✅ API Response Status: {response.status_code}")
print(f"📊 Total Flights Retrieved: {len(data.get('data', []))}")
# 'pagination' describes the result window (offset / total matching records), not quota usage
print(f"🔢 Pagination (offset / total results): {data.get('pagination', {}).get('offset', 0)} / {data.get('pagination', {}).get('total', 'N/A')}")

# Display response structure
print("\n📋 Top-Level Response Keys:")
for key, value in data.items():
    if key == 'data':
        print(f"  • {key}: List[{len(value)} flight objects]")
    else:
        print(f"  • {key}: {type(value).__name__}")

# Detailed look at first flight
if data.get('data'):
    first_flight = data['data'][0]
    print("\n🔍 Single Flight Object Structure:")
    print(f"  Main Keys: {list(first_flight.keys())}")

    print("\n  Nested Object Details:")
    for key in ['flight', 'airline', 'departure', 'arrival', 'aircraft', 'live']:
        if key in first_flight and first_flight[key]:
            if isinstance(first_flight[key], dict):
                print(f"    • {key}: {list(first_flight[key].keys())}")
            else:
                print(f"    • {key}: {type(first_flight[key]).__name__}")


✅ API Response Status: 200
📊 Total Flights Retrieved: 100
🔢 Pagination (offset / total results): 0 / 91419

📋 Top-Level Response Keys:
  • pagination: dict
  • data: List[100 flight objects]

🔍 Single Flight Object Structure:
  Main Keys: ['flight_date', 'flight_status', 'departure', 'arrival', 'airline', 'flight', 'aircraft', 'live']

  Nested Object Details:
    • flight: ['number', 'iata', 'icao', 'codeshared']
    • airline: ['name', 'iata', 'icao']
    • departure: ['airport', 'timezone', 'iata', 'icao', 'terminal', 'gate', 'delay', 'scheduled', 'estimated', 'actual', 'estimated_runway', 'actual_runway']
    • arrival: ['airport', 'timezone', 'iata', 'icao', 'terminal', 'gate', 'baggage', 'scheduled', 'delay', 'estimated', 'actual', 'estimated_runway', 'actual_runway']
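
The response above holds 100 flight objects out of 91419 matching records, so anything beyond the first page has to be requested in further calls. The following is a minimal sketch of paging through results, assuming the endpoint accepts `limit` and `offset` query parameters. `iter_flight_pages` and `fake_fetch` are hypothetical names introduced here, and `fake_fetch` stands in for the real HTTP call so the example runs offline.

```python
from typing import Callable, Dict, Iterator, List


def iter_flight_pages(
    fetch_page: Callable[[Dict[str, int]], dict],
    limit: int = 100,
    max_pages: int = 3,
) -> Iterator[List[dict]]:
    """Yield one list of flight objects per page.

    fetch_page is a hypothetical callable that receives the paging
    parameters and returns the parsed JSON response as a dict.
    max_pages caps the number of calls, since each page costs one request.
    """
    offset = 0
    for _ in range(max_pages):
        payload = fetch_page({"limit": limit, "offset": offset})
        flights = payload.get("data") or []
        if not flights:
            break
        yield flights
        total = (payload.get("pagination") or {}).get("total", 0)
        offset += len(flights)
        if offset >= total:
            break


def fake_fetch(params: Dict[str, int]) -> dict:
    """Offline stand-in for the API: pretends 250 records match the query."""
    total = 250
    start = params["offset"]
    end = min(start + params["limit"], total)
    return {
        "pagination": {"limit": params["limit"], "offset": start, "total": total},
        "data": [{"flight": {"iata": f"XX{i}"}} for i in range(start, end)],
    }


pages = list(iter_flight_pages(fake_fetch, limit=100, max_pages=5))
print([len(page) for page in pages])
```

In the notebook, `fetch_page` could wrap `requests.get(f"{base_url}/flights", params={"access_key": api_key, **params}).json()`. Every page counts against the monthly request quota, which is why the sketch caps the number of pages.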


#### 2. DATA EXTRACTION & TRANSFORMATION

In [0]:
def extract_flight_details(flights_data):
    """
    Extract and flatten nested JSON into analysis-ready DataFrame
    Handles missing/null values gracefully
    """
    extracted_flights = []
    
    for flight in flights_data:
        # Safely extract nested fields
        flight_info = flight.get('flight', {}) or {}
        airline_info = flight.get('airline', {}) or {}
        departure_info = flight.get('departure', {}) or {}
        arrival_info = flight.get('arrival', {}) or {}
        aircraft_info = flight.get('aircraft', {}) or {}
        live_info = flight.get('live', {}) or {}
        
        flight_record = {
            # Flight Identification
            'flight_date': flight.get('flight_date'),
            'flight_status': flight.get('flight_status'),
            'flight_iata': flight_info.get('iata'),
            'flight_icao': flight_info.get('icao'),
            'flight_number': flight_info.get('number'),
            'flight_codeshared': flight_info.get('codeshared'),
            
            # Airline Information
            'airline_name': airline_info.get('name'),
            'airline_iata': airline_info.get('iata'),
            'airline_icao': airline_info.get('icao'),
            
            # Departure Details
            'departure_airport': departure_info.get('airport'),
            'departure_iata': departure_info.get('iata'),
            'departure_icao': departure_info.get('icao'),
            'departure_terminal': departure_info.get('terminal'),
            'departure_gate': departure_info.get('gate'),
            'departure_delay': departure_info.get('delay'),
            'departure_scheduled': departure_info.get('scheduled'),
            'departure_estimated': departure_info.get('estimated'),
            'departure_actual': departure_info.get('actual'),
            'departure_timezone': departure_info.get('timezone'),
            
            # Arrival Details
            'arrival_airport': arrival_info.get('airport'),
            'arrival_iata': arrival_info.get('iata'),
            'arrival_icao': arrival_info.get('icao'),
            'arrival_terminal': arrival_info.get('terminal'),
            'arrival_gate': arrival_info.get('gate'),
            'arrival_baggage': arrival_info.get('baggage'),
            'arrival_delay': arrival_info.get('delay'),
            'arrival_scheduled': arrival_info.get('scheduled'),
            'arrival_estimated': arrival_info.get('estimated'),
            'arrival_actual': arrival_info.get('actual'),
            'arrival_timezone': arrival_info.get('timezone'),
            
            # Aircraft Information
            'aircraft_registration': aircraft_info.get('registration'),
            'aircraft_iata': aircraft_info.get('iata'),
            'aircraft_icao': aircraft_info.get('icao'),
            'aircraft_icao24': aircraft_info.get('icao24'),
            
            # Live Tracking (if available)
            'live_updated': live_info.get('updated'),
            'live_latitude': live_info.get('latitude'),
            'live_longitude': live_info.get('longitude'),
            'live_altitude': live_info.get('altitude'),
            'live_direction': live_info.get('direction'),
            'live_speed_horizontal': live_info.get('speed_horizontal'),
            'live_speed_vertical': live_info.get('speed_vertical'),
            'live_is_ground': live_info.get('is_ground')
        }
        
        extracted_flights.append(flight_record)
    
    return pd.DataFrame(extracted_flights)

# Extract data
flights_df = extract_flight_details(data.get('data', []))

print(f"\nExtracted {len(flights_df)} flights")
print(f"Total Columns: {len(flights_df.columns)}")
print(f"DataFrame Shape: {flights_df.shape}")

print("\nAvailable Data Fields:")
print(f"  • Flight Info: {sum(1 for col in flights_df.columns if col.startswith('flight_'))} fields")
print(f"  • Airline Info: {sum(1 for col in flights_df.columns if col.startswith('airline_'))} fields")
print(f"  • Departure Info: {sum(1 for col in flights_df.columns if col.startswith('departure_'))} fields")
print(f"  • Arrival Info: {sum(1 for col in flights_df.columns if col.startswith('arrival_'))} fields")
print(f"  • Aircraft Info: {sum(1 for col in flights_df.columns if col.startswith('aircraft_'))} fields")
print(f"  • Live Tracking: {sum(1 for col in flights_df.columns if col.startswith('live_'))} fields")

# Display sample
print("\nSample Flight Records:")
display(flights_df.head(10))


Extracted 100 flights
Total Columns: 42
DataFrame Shape: (100, 42)

Available Data Fields:
  • Flight Info: 6 fields
  • Airline Info: 3 fields
  • Departure Info: 10 fields
  • Arrival Info: 11 fields
  • Aircraft Info: 4 fields
  • Live Tracking: 8 fields

Sample Flight Records:


flight_date,flight_status,flight_iata,flight_icao,flight_number,flight_codeshared,airline_name,airline_iata,airline_icao,departure_airport,departure_iata,departure_icao,departure_terminal,departure_gate,departure_delay,departure_scheduled,departure_estimated,departure_actual,departure_timezone,arrival_airport,arrival_iata,arrival_icao,arrival_terminal,arrival_gate,arrival_baggage,arrival_delay,arrival_scheduled,arrival_estimated,arrival_actual,arrival_timezone,aircraft_registration,aircraft_iata,aircraft_icao,aircraft_icao24,live_updated,live_latitude,live_longitude,live_altitude,live_direction,live_speed_horizontal,live_speed_vertical,live_is_ground
2025-11-30,scheduled,CZ7171,CSN7171,7171,"List(gj, cdc, loong air, gj8887, cdc8887, 8887)",China Southern Airlines,CZ,CSN,Hangzhou,HGH,ZSHC,3,B22,,2025-11-30T07:05:00+00:00,2025-11-30T07:05:00+00:00,,Asia/Shanghai,Beijing Capital International,PEK,ZBAA,3,,,,2025-11-30T09:20:00+00:00,,,Asia/Shanghai,,,,,,,,,,,,
2025-11-30,scheduled,3U4849,CSC4849,4849,"List(gj, cdc, loong air, gj8887, cdc8887, 8887)",Sichuan Airlines,3U,CSC,Hangzhou,HGH,ZSHC,3,B22,,2025-11-30T07:05:00+00:00,2025-11-30T07:05:00+00:00,,Asia/Shanghai,Beijing Capital International,PEK,ZBAA,3,,,,2025-11-30T09:20:00+00:00,,,Asia/Shanghai,,,,,,,,,,,,
2025-11-30,scheduled,MF5908,CXA5908,5908,"List(gj, cdc, loong air, gj8887, cdc8887, 8887)",Xiamen Airlines,MF,CXA,Hangzhou,HGH,ZSHC,3,B22,,2025-11-30T07:05:00+00:00,2025-11-30T07:05:00+00:00,,Asia/Shanghai,Beijing Capital International,PEK,ZBAA,3,,,,2025-11-30T09:20:00+00:00,,,Asia/Shanghai,,,,,,,,,,,,
2025-11-30,scheduled,EU7709,UEA7709,7709,"List(gj, cdc, loong air, gj8887, cdc8887, 8887)",Chengdu Airlines,EU,UEA,Hangzhou,HGH,ZSHC,3,B22,,2025-11-30T07:05:00+00:00,2025-11-30T07:05:00+00:00,,Asia/Shanghai,Beijing Capital International,PEK,ZBAA,3,,,,2025-11-30T09:20:00+00:00,,,Asia/Shanghai,,,,,,,,,,,,
2025-11-30,scheduled,G59537,HXA9537,9537,"List(gj, cdc, loong air, gj8887, cdc8887, 8887)",China Express Air,G5,HXA,Hangzhou,HGH,ZSHC,3,B22,,2025-11-30T07:05:00+00:00,2025-11-30T07:05:00+00:00,,Asia/Shanghai,Beijing Capital International,PEK,ZBAA,3,,,,2025-11-30T09:20:00+00:00,,,Asia/Shanghai,,,,,,,,,,,,
2025-11-30,scheduled,TV2509,TBA2509,2509,"List(gj, cdc, loong air, gj8887, cdc8887, 8887)",Tibet Airlines,TV,TBA,Hangzhou,HGH,ZSHC,3,B22,,2025-11-30T07:05:00+00:00,2025-11-30T07:05:00+00:00,,Asia/Shanghai,Beijing Capital International,PEK,ZBAA,3,,,,2025-11-30T09:20:00+00:00,,,Asia/Shanghai,,,,,,,,,,,,
2025-11-30,scheduled,MU4369,CES4369,4369,"List(gj, cdc, loong air, gj8887, cdc8887, 8887)",China Eastern Airlines,MU,CES,Hangzhou,HGH,ZSHC,3,B22,,2025-11-30T07:05:00+00:00,2025-11-30T07:05:00+00:00,,Asia/Shanghai,Beijing Capital International,PEK,ZBAA,3,,,,2025-11-30T09:20:00+00:00,,,Asia/Shanghai,,,,,,,,,,,,
2025-11-30,scheduled,ZH1701,CSZ1701,1701,"List(ca, cca, air china ltd, ca1701, cca1701, 1701)",Shenzhen Airlines,ZH,CSZ,Hangzhou,HGH,ZSHC,4,407,,2025-11-30T07:00:00+00:00,2025-11-30T07:00:00+00:00,,Asia/Shanghai,Beijing Capital International,PEK,ZBAA,3,,,,2025-11-30T09:20:00+00:00,,,Asia/Shanghai,,,,,,,,,,,,
2025-11-30,scheduled,UA7544,UAL7544,7544,"List(ca, cca, air china ltd, ca1701, cca1701, 1701)",United Airlines,UA,UAL,Hangzhou,HGH,ZSHC,4,407,,2025-11-30T07:00:00+00:00,2025-11-30T07:00:00+00:00,,Asia/Shanghai,Beijing Capital International,PEK,ZBAA,3,,,,2025-11-30T09:20:00+00:00,,,Asia/Shanghai,,,,,,,,,,,,
2025-11-30,scheduled,SC5579,CDG5579,5579,"List(ca, cca, air china ltd, ca1701, cca1701, 1701)",Shandong Airlines,SC,CDG,Hangzhou,HGH,ZSHC,4,407,,2025-11-30T07:00:00+00:00,2025-11-30T07:00:00+00:00,,Asia/Shanghai,Beijing Capital International,PEK,ZBAA,3,,,,2025-11-30T09:20:00+00:00,,,Asia/Shanghai,,,,,,,,,,,,
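
The `flight_codeshared` column above still holds a nested object (displayed as `List(...)`), which is awkward to filter or join on. The following sketch flattens it into scalar columns. The six key names in `CODESHARE_KEYS` are an assumption about the shape of the codeshare object and should be checked against a real response. `flatten_codeshare` is a hypothetical helper, and the second sample row (`XX100`) is made up to show the non-codeshare case.

```python
import pandas as pd

# Assumed keys of the nested codeshare object (verify against a live response).
CODESHARE_KEYS = [
    "airline_name", "airline_iata", "airline_icao",
    "flight_number", "flight_iata", "flight_icao",
]


def flatten_codeshare(codeshared):
    """Return flat codeshare_* fields; all None when there is no codeshare."""
    info = codeshared if isinstance(codeshared, dict) else {}
    return {f"codeshare_{key}": info.get(key) for key in CODESHARE_KEYS}


sample = pd.DataFrame({
    "flight_iata": ["CZ7171", "XX100"],  # XX100 is a made-up flight
    "flight_codeshared": [
        {
            "airline_name": "loong air", "airline_iata": "gj", "airline_icao": "cdc",
            "flight_number": "8887", "flight_iata": "gj8887", "flight_icao": "cdc8887",
        },
        None,
    ],
})

# Expand each nested object into its own set of columns.
codeshare_cols = pd.DataFrame(
    sample["flight_codeshared"].map(flatten_codeshare).tolist(),
    index=sample.index,
)
flat = pd.concat([sample.drop(columns="flight_codeshared"), codeshare_cols], axis=1)
flat["is_codeshare"] = flat["codeshare_flight_iata"].notna()
print(flat[["flight_iata", "codeshare_flight_iata", "is_codeshare"]])
```

Several rows in the sample share the same codeshare block, which suggests they are marketing listings of one operated flight. Grouping on `codeshare_flight_iata` may therefore be worth doing before counting flights per route.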


#### 3. DATA QUALITY ASSESSMENT

In [0]:
# Calculate completeness for key fields
key_fields = [
    'flight_iata', 'airline_name', 'departure_airport', 'arrival_airport',
    'departure_scheduled', 'arrival_scheduled', 'flight_status',
    'departure_delay', 'arrival_delay'
]

print("\nField Completeness (% non-null):")
completeness = {}
for field in key_fields:
    if field in flights_df.columns:
        pct_complete = (flights_df[field].notna().sum() / len(flights_df) * 100)
        completeness[field] = pct_complete
        status = "✅" if pct_complete > 80 else "⚠️" if pct_complete > 50 else "❌"
        print(f"  {status} {field:25s}: {pct_complete:5.1f}%")

# Live tracking availability
live_tracking_available = flights_df['live_latitude'].notna().sum()
print(f"\nLive Tracking Data Available: {live_tracking_available}/{len(flights_df)} flights ({live_tracking_available/len(flights_df)*100:.1f}%)")


Field Completeness (% non-null):
  ✅ flight_iata              : 100.0%
  ✅ airline_name             :  99.0%
  ✅ departure_airport        :  98.0%
  ✅ arrival_airport          : 100.0%
  ✅ departure_scheduled      : 100.0%
  ✅ arrival_scheduled        : 100.0%
  ✅ flight_status            : 100.0%
  ❌ departure_delay          :   7.0%
  ❌ arrival_delay            :   0.0%

Live Tracking Data Available: 0/100 flights (0.0%)


#### 4. FLIGHT STATUS ANALYSIS

In [0]:
status_distribution = flights_df['flight_status'].value_counts()
print("\nFlight Status Breakdown:")
# Use 'n' as the loop variable so the pyspark 'count' import is not overwritten
for status, n in status_distribution.items():
    pct = n / len(flights_df) * 100
    print(f"  • {status:20s}: {n:3d} flights ({pct:5.1f}%)")

# Visualize status distribution
print("\nStatus Distribution Chart:")
display(flights_df['flight_status'].value_counts().to_frame('count'))


Flight Status Breakdown:
  • scheduled           : 100 flights (100.0%)

Status Distribution Chart:


count
100


#### 5. DELAY ANALYSIS (FOR PROJECT)

In [0]:
# Filter for flights with delay data
delayed_departures = flights_df[flights_df['departure_delay'].notna() & (flights_df['departure_delay'] > 0)]
delayed_arrivals = flights_df[flights_df['arrival_delay'].notna() & (flights_df['arrival_delay'] > 0)]

print(f"\nDelay Statistics:")
print(f"  Departure Delays:")
print(f"    • Delayed Flights: {len(delayed_departures)}/{len(flights_df)} ({len(delayed_departures)/len(flights_df)*100:.1f}%)")
if len(delayed_departures) > 0:
    print(f"    • Average Delay: {delayed_departures['departure_delay'].mean():.1f} minutes")
    print(f"    • Median Delay: {delayed_departures['departure_delay'].median():.1f} minutes")
    print(f"    • Max Delay: {delayed_departures['departure_delay'].max():.0f} minutes")
    print(f"    • Min Delay: {delayed_departures['departure_delay'].min():.0f} minutes")

print(f"\n  Arrival Delays:")
print(f"    • Delayed Flights: {len(delayed_arrivals)}/{len(flights_df)} ({len(delayed_arrivals)/len(flights_df)*100:.1f}%)")
if len(delayed_arrivals) > 0:
    print(f"    • Average Delay: {delayed_arrivals['arrival_delay'].mean():.1f} minutes")
    print(f"    • Median Delay: {delayed_arrivals['arrival_delay'].median():.1f} minutes")
    print(f"    • Max Delay: {delayed_arrivals['arrival_delay'].max():.0f} minutes")

# Delay severity classification
# Note: missing delay values are grouped under 'No Delay' here, so that bucket
# also counts flights for which the API returned no delay figure at all.
if len(delayed_departures) > 0:
    flights_df['delay_severity'] = flights_df['departure_delay'].apply(
        lambda x: 'No Delay' if pd.isna(x) or x == 0 
        else 'Minor (1-15 min)' if x <= 15
        else 'Moderate (16-30 min)' if x <= 30
        else 'Significant (31-60 min)' if x <= 60
        else 'Severe (>60 min)'
    )

    print("\nDelay Severity Distribution:")
    severity_dist = flights_df['delay_severity'].value_counts()
    # 'n' avoids overwriting the pyspark 'count' import
    for severity, n in severity_dist.items():
        print(f"  • {severity:25s}: {n:3d} flights")


Delay Statistics:
  Departure Delays:
    • Delayed Flights: 7/100 (7.0%)
    • Average Delay: 93.0 minutes
    • Median Delay: 125.0 minutes
    • Max Delay: 125 minutes
    • Min Delay: 12 minutes

  Arrival Delays:
    • Delayed Flights: 0/100 (0.0%)

Delay Severity Distribution:
  • No Delay                 :  93 flights
  • Severe (>60 min)         :   5 flights
  • Minor (1-15 min)         :   2 flights
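
Only 7% of flights carried a departure delay value, so most of the 93 "No Delay" flights above are really flights with no delay information. The following sketch shows one way to keep those apart, using `pd.cut` with an explicit "Unknown" bucket. The sample delays are made up, and unlike the cell above this version treats negative values (early departures) as "No Delay".

```python
import numpy as np
import pandas as pd

# Made-up delays in minutes; NaN means the API reported no delay value.
delays = pd.Series([np.nan, 0, 12, 14, 30, 45, 125], name="departure_delay")

# Right-inclusive bins: (-inf, 0], (0, 15], (15, 30], (30, 60], (60, inf]
bins = [-np.inf, 0, 15, 30, 60, np.inf]
labels = [
    "No Delay",
    "Minor (1-15 min)",
    "Moderate (16-30 min)",
    "Significant (31-60 min)",
    "Severe (>60 min)",
]

severity = pd.cut(delays, bins=bins, labels=labels)
# Missing delays fall outside every bin; give them their own category.
severity = severity.cat.add_categories("Unknown").fillna("Unknown")

print(severity.value_counts().reindex(labels + ["Unknown"]))
```

Keeping "Unknown" separate matters for the target variable later on: a flight with no reported delay is not the same as a flight known to be on time.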


#### 6. AIRLINE PERFORMANCE ANALYSIS

In [0]:
airline_stats = flights_df.groupby('airline_name').agg({
    'flight_iata': 'count',
    'departure_delay': ['mean', 'max'],
    'arrival_delay': ['mean', 'max']
}).round(1)

airline_stats.columns = ['total_flights', 'avg_dep_delay', 'max_dep_delay', 'avg_arr_delay', 'max_arr_delay']
airline_stats = airline_stats.sort_values('total_flights', ascending=False)

print(f"\n Airlines in Dataset: {flights_df['airline_name'].nunique()}")
print(f"\nTop Airlines by Flight Volume:")
display(airline_stats.head(10))


 Airlines in Dataset: 41

Top Airlines by Flight Volume:


total_flights,avg_dep_delay,max_dep_delay,avg_arr_delay,max_arr_delay
25,125.0,125.0,,
7,,,,
5,125.0,125.0,,
4,,,,
4,125.0,125.0,,
4,125.0,125.0,,
4,125.0,125.0,,
3,,,,
3,,,,
2,,,,


#### 7. ROUTE ANALYSIS

In [0]:
# Create route column
flights_df['route'] = flights_df['departure_iata'] + ' → ' + flights_df['arrival_iata']

print(f"\nDeparture Airports: {flights_df['departure_iata'].nunique()}")
print(f"Arrival Airports: {flights_df['arrival_iata'].nunique()}")
print(f"Unique Routes: {flights_df['route'].nunique()}")

print("\nTop 10 Departure Airports:")
top_departures = flights_df['departure_airport'].value_counts().head(10)
# 'n' avoids overwriting the pyspark 'count' import
for airport, n in top_departures.items():
    iata = flights_df[flights_df['departure_airport'] == airport]['departure_iata'].iloc[0]
    print(f"  • {airport:40s} ({iata}): {n:3d} flights")

print("\nTop 10 Routes:")
top_routes = flights_df['route'].value_counts().head(10)
for route, n in top_routes.items():
    print(f"  • {route:15s}: {n:3d} flights")


Departure Airports: 17
Arrival Airports: 46
Unique Routes: 54

Top 10 Departure Airports:
  • Beijing Capital International            (PEK):  34 flights
  • Taiyuan                                  (TYN):  17 flights
  • Hangzhou                                 (HGH):  11 flights
  • Malay                                    (MPH):   8 flights
  • Cukurova International Airport           (COV):   5 flights
  • Chongqing Jiangbei International         (CKG):   4 flights
  • Haneda Airport                           (HND):   4 flights
  • Fukuoka                                  (FUK):   4 flights
  • Trabzon                                  (TZX):   2 flights
  • Bolshoye Savino                          (PEE):   2 flights

Top 10 Routes:
  • HGH → PEK      :  10 flights
  • MPH → MNL      :   7 flights
  • TYN → URC      :   5 flights
  • TYN → KWE      :   5 flights
  • HND → FUK      :   4 flights
  • CKG → TYN      :   4 flights
  • COV → [output truncated]

#### 8. AIRCRAFT TYPE ANALYSIS

In [0]:

aircraft_with_type = flights_df[flights_df['aircraft_iata'].notna()]
print(f"\nFlights with Aircraft Type Data: {len(aircraft_with_type)}/{len(flights_df)} ({len(aircraft_with_type)/len(flights_df)*100:.1f}%)")

if len(aircraft_with_type) > 0:
    print("\nTop 10 Aircraft Types:")
    top_aircraft = flights_df['aircraft_iata'].value_counts().head(10)
    # 'n' avoids overwriting the pyspark 'count' import
    for aircraft, n in top_aircraft.items():
        print(f"  • {aircraft:10s}: {n:3d} flights")


Flights with Aircraft Type Data: 0/100 (0.0%)


#### 9. ML FEATURE ENGINEERING PREVIEW

In [0]:
# Create ML-ready features
ml_features = flights_df[[
    'flight_iata', 'airline_name', 'departure_iata', 'arrival_iata',
    'route', 'flight_status', 'departure_delay', 'arrival_delay',
    'departure_scheduled', 'aircraft_iata'
]].copy()

# Add binary delay indicator (target variable)
ml_features['is_delayed'] = ((ml_features['departure_delay'].notna()) & 
                              (ml_features['departure_delay'] > 15)).astype(int)

# Add time-based features
ml_features['scheduled_hour'] = pd.to_datetime(ml_features['departure_scheduled']).dt.hour
ml_features['scheduled_day_of_week'] = pd.to_datetime(ml_features['departure_scheduled']).dt.dayofweek

# Add route popularity (could indicate congestion)
route_counts = flights_df['route'].value_counts()
ml_features['route_popularity'] = ml_features['route'].map(route_counts)

print("\n🤖 ML-Ready Feature Set:")
print(f"  • Total Features: {len(ml_features.columns)}")
print(f"  • Sample Size: {len(ml_features)} flights")
print(f"  • Delayed Flights (target=1): {ml_features['is_delayed'].sum()} ({ml_features['is_delayed'].mean()*100:.1f}%)")

print("\nSample ML Features:")
display(ml_features.head(10))

print("\nFeature Importance Indicators:")
if ml_features['is_delayed'].sum() > 0:
    print("\nDelay Rate by Hour of Day:")
    hourly_delays = ml_features.groupby('scheduled_hour')['is_delayed'].agg(['sum', 'count', 'mean'])
    hourly_delays['delay_rate_%'] = (hourly_delays['mean'] * 100).round(1)
    display(hourly_delays.sort_values('delay_rate_%', ascending=False).head(10))


🤖 ML-Ready Feature Set:
  • Total Features: 14
  • Sample Size: 100 flights
  • Delayed Flights (target=1): 5 (5.0%)

Sample ML Features:


flight_iata,airline_name,departure_iata,arrival_iata,route,flight_status,departure_delay,arrival_delay,departure_scheduled,aircraft_iata,is_delayed,scheduled_hour,scheduled_day_of_week,route_popularity
CZ7171,China Southern Airlines,HGH,PEK,HGH → PEK,scheduled,,,2025-11-30T07:05:00+00:00,,0,7,6,10
3U4849,Sichuan Airlines,HGH,PEK,HGH → PEK,scheduled,,,2025-11-30T07:05:00+00:00,,0,7,6,10
MF5908,Xiamen Airlines,HGH,PEK,HGH → PEK,scheduled,,,2025-11-30T07:05:00+00:00,,0,7,6,10
EU7709,Chengdu Airlines,HGH,PEK,HGH → PEK,scheduled,,,2025-11-30T07:05:00+00:00,,0,7,6,10
G59537,China Express Air,HGH,PEK,HGH → PEK,scheduled,,,2025-11-30T07:05:00+00:00,,0,7,6,10
TV2509,Tibet Airlines,HGH,PEK,HGH → PEK,scheduled,,,2025-11-30T07:05:00+00:00,,0,7,6,10
MU4369,China Eastern Airlines,HGH,PEK,HGH → PEK,scheduled,,,2025-11-30T07:05:00+00:00,,0,7,6,10
ZH1701,Shenzhen Airlines,HGH,PEK,HGH → PEK,scheduled,,,2025-11-30T07:00:00+00:00,,0,7,6,10
UA7544,United Airlines,HGH,PEK,HGH → PEK,scheduled,,,2025-11-30T07:00:00+00:00,,0,7,6,10
SC5579,Shandong Airlines,HGH,PEK,HGH → PEK,scheduled,,,2025-11-30T07:00:00+00:00,,0,7,6,10



Feature Importance Indicators:

Delay Rate by Hour of Day:


sum,count,mean,delay_rate_%
5,15,0.3333333333333333,33.3
0,7,0.0,0.0
0,2,0.0,0.0
0,2,0.0,0.0
0,14,0.0,0.0
0,5,0.0,0.0
0,51,0.0,0.0
0,4,0.0,0.0



#### 10. API CAPABILITIES SUMMARY

#### KEY DATA AVAILABLE FOR DELAY PREDICTION:

REAL-TIME FLIGHT DATA

  Flight status (scheduled, active, landed, cancelled, diverted), 
  Current delays (departure & arrival in minutes), 
  Live tracking coordinates (lat/lon, altitude, speed), 
  Gate and terminal information

SCHEDULE INFORMATION

  Scheduled departure/arrival times, 
  Estimated departure/arrival times, 
  Actual departure/arrival times (when available)
  
ROUTE & AIRPORT DATA

  Origin/destination airport codes (IATA/ICAO), 
  Airport names and locations, 
  Route patterns and popularity

AIRLINE & AIRCRAFT INFO

  Airline names and codes, 
  Aircraft type and registration, 
  Codeshare information

HISTORICAL PATTERNS (via repeated API calls)

  Airline on-time performance trends, 
  Route-specific delay patterns, 
  Time-of-day delay correlations, 
  Seasonal variations

#### FUTURE IMPLEMENTATIONS:

OPENSKY NETWORK

   Match flights by the aircraft's ICAO 24-bit address (icao24), 
   Correlate live positions with flight status, 
   Validate arrival estimates with actual tracking

OPEN-METEO WEATHER API

   Match by airport coordinates and time, 
   Correlate weather conditions with delays, 
   Predict weather-related delays
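
Matching "by airport and time" can be sketched as a join on the airport code and the scheduled departure hour. The example below is only an illustration under stated assumptions: the weather table is made up, its column names (`airport_iata`, `time`, `wind_speed_kmh`, `precipitation_mm`) are placeholders rather than Open-Meteo field names, and both tables are assumed to use the same time zone.

```python
import pandas as pd

# Two flights taken from the sample records in this notebook.
flights = pd.DataFrame({
    "flight_iata": ["CZ7171", "ZH1701"],
    "departure_iata": ["HGH", "HGH"],
    "departure_scheduled": pd.to_datetime(
        ["2025-11-30T07:05:00+00:00", "2025-11-30T07:00:00+00:00"]
    ),
})

# Made-up hourly weather observations, one row per airport and hour.
weather = pd.DataFrame({
    "airport_iata": ["HGH", "HGH"],
    "time": pd.to_datetime(
        ["2025-11-30T07:00:00+00:00", "2025-11-30T08:00:00+00:00"]
    ),
    "wind_speed_kmh": [12.0, 30.0],
    "precipitation_mm": [0.0, 1.5],
})

# Round each departure down to the hour so it lines up with the hourly table.
flights["weather_hour"] = flights["departure_scheduled"].dt.floor("h")

enriched = flights.merge(
    weather,
    how="left",  # keep flights even when no weather row matches
    left_on=["departure_iata", "weather_hour"],
    right_on=["airport_iata", "time"],
)
print(enriched[["flight_iata", "weather_hour", "wind_speed_kmh", "precipitation_mm"]])
```

A left join keeps flights that have no weather match, which makes gaps in weather coverage visible as nulls instead of silently dropping rows.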

ML MODEL TRAINING

   Target: is_delayed (binary classification), 
   Features: airline, route, time, weather, congestion, 
   Historical data for pattern recognition

REAL-TIME PREDICTIONS

   Query API for scheduled flights, 
   Enrich with weather + congestion data, 
   Generate delay probability scores, 
   Suggest alternative flights
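
Suggesting alternatives amounts to filtering for later departures on the same origin-destination pair. The sketch below runs on a flattened table with the same column names as `flights_df`. `find_alternatives` is a hypothetical helper, the flights in the sample are made up, and codeshare duplicates are not handled.

```python
import pandas as pd


def find_alternatives(flights: pd.DataFrame, flight_iata: str, max_results: int = 3) -> pd.DataFrame:
    """Return later flights on the same route as flight_iata, earliest first."""
    target = flights.loc[flights["flight_iata"] == flight_iata]
    if target.empty:
        return flights.iloc[0:0]  # empty frame with the same columns
    ref = target.iloc[0]
    same_route = (
        (flights["departure_iata"] == ref["departure_iata"])
        & (flights["arrival_iata"] == ref["arrival_iata"])
    )
    departs_later = flights["departure_scheduled"] > ref["departure_scheduled"]
    not_self = flights["flight_iata"] != flight_iata
    matches = flights[same_route & departs_later & not_self]
    return matches.sort_values("departure_scheduled").head(max_results)


# Made-up schedule for illustration.
schedule = pd.DataFrame({
    "flight_iata": ["AA1", "BB2", "CC3", "DD4"],
    "departure_iata": ["HGH", "HGH", "HGH", "HGH"],
    "arrival_iata": ["PEK", "PEK", "PEK", "CKG"],
    "departure_scheduled": pd.to_datetime([
        "2025-11-30T07:00:00+00:00",
        "2025-11-30T09:30:00+00:00",
        "2025-11-30T08:15:00+00:00",
        "2025-11-30T08:00:00+00:00",
    ]),
})

alternatives = find_alternatives(schedule, "AA1")
print(alternatives[["flight_iata", "departure_scheduled"]])
```

A production version would likely also rank candidates by predicted delay probability and exclude cancelled flights.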

API LIMITATIONS TO CONSIDER:
  
 Free tier: 100-500 requests/month, 
 Live tracking not available for all flights, 
 Historical data requires repeated polling, 
 Rate limits require caching strategy
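
The request quota translates directly into a maximum polling frequency. The following rough budget check assumes one request per poll and a 30-day month. Both are simplifications, and the function names are illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def requests_per_month(poll_interval_minutes: int, days: int = 30) -> int:
    """Number of polls in a month at a fixed interval (one request per poll)."""
    return (days * MINUTES_PER_DAY) // poll_interval_minutes


def min_poll_interval_minutes(monthly_quota: int, days: int = 30) -> float:
    """Shortest interval that stays within the monthly quota."""
    return (days * MINUTES_PER_DAY) / monthly_quota


for interval in (15, 30):
    print(f"Polling every {interval} min uses {requests_per_month(interval)} requests/month")

for quota in (100, 500):
    hours = min_poll_interval_minutes(quota) / 60
    print(f"A quota of {quota} requests/month allows one request every {hours:.1f} hours")
```

Under these assumptions, a quota of 100 to 500 requests per month supports roughly one call every 1.4 to 7.2 hours, so responses need to be cached and reused rather than fetched on demand.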


#### 11. NEXT STEPS & RECOMMENDATIONS

IMPLEMENTATION STRATEGY:

PHASE 1: DATA COLLECTION (ONGOING)
  
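  Set up scheduled job to poll API every 15-30 minutes (about 1,440-2,880 requests per month, which exceeds the free-tier quota and would need a paid plan or a longer interval)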
  
  Store raw responses in Delta Lake Bronze layer
  
  Build 2-4 weeks of historical data
  
  Focus on top 10-20 airports for manageable scope
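
For the Bronze layer, each raw response can be wrapped in a small envelope that records when and from where it was fetched, leaving the payload itself untouched. The sketch below uses only the standard library. `to_bronze_record` and its field names are hypothetical, not an established schema.

```python
import json
from datetime import datetime, timezone


def to_bronze_record(payload: dict, endpoint: str, ingested_at: datetime = None) -> dict:
    """Wrap one raw API response with ingestion metadata."""
    ingested_at = ingested_at or datetime.now(timezone.utc)
    return {
        "endpoint": endpoint,
        "ingested_at": ingested_at.isoformat(),
        "ingest_date": ingested_at.date().isoformat(),  # convenient partition column
        "record_count": len(payload.get("data") or []),
        "raw_json": json.dumps(payload, sort_keys=True),  # payload kept verbatim
    }


# Fixed timestamp and a tiny made-up payload so the example is reproducible.
fixed_time = datetime(2025, 11, 30, 7, 0, tzinfo=timezone.utc)
record = to_bronze_record(
    {"pagination": {"total": 2}, "data": [{}, {}]},
    endpoint="flights",
    ingested_at=fixed_time,
)
print(record["ingest_date"], record["record_count"])

# In Databricks the record could then be appended to a Delta table, for example
# (table name is hypothetical):
# spark.createDataFrame([record]).write.format("delta").mode("append") \
#     .saveAsTable("bronze.aviationstack_flights")
```

Storing the payload as a JSON string keeps the Bronze table stable even if the API adds or renames fields, with parsing deferred to a later layer.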

PHASE 2: DATA ENRICHMENT
  
  Match AviationStack flights with OpenSky tracking
  
  Add weather data from Open-Meteo for each airport
  
  Calculate airport congestion metrics
  
  Create time-based features (hour, day, season)

PHASE 3: FEATURE ENGINEERING
  
  Airline historical performance scores
  
  Route congestion indices
  
  Weather severity scores
  
  Time-of-day risk factors
  
  Cascading delay indicators
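
An airline performance score built from a handful of flights is noisy; in the airline table earlier in this notebook most carriers have only 2-4 flights. One common remedy, sketched here as an assumption and not as the project's chosen method, is to shrink each airline's delay rate toward the overall rate. The data and the `prior_weight` value are made up.

```python
import pandas as pd

# Made-up history: one row per past flight.
history = pd.DataFrame({
    "airline_iata": ["AA", "AA", "AA", "AA", "BB", "BB", "CC"],
    "is_delayed":   [1,    0,    0,    0,    1,    1,    1],
})

overall_rate = history["is_delayed"].mean()  # prior: delay rate across all airlines
prior_weight = 5  # acts like 5 pseudo-flights at the overall rate (tunable)

stats = history.groupby("airline_iata")["is_delayed"].agg(delayed="sum", flights="count")
stats["raw_rate"] = stats["delayed"] / stats["flights"]
stats["smoothed_rate"] = (
    (stats["delayed"] + prior_weight * overall_rate)
    / (stats["flights"] + prior_weight)
)
print(stats.round(3))
```

Airline CC has a raw rate of 100% from a single flight, while its smoothed rate sits much closer to the overall rate. To avoid leaking the target, such scores should be computed only from flights that departed before the one being predicted.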

PHASE 4: MODEL TRAINING
  
  Binary classification: Will flight be delayed? (>15 min)
  
  Regression: How long will the delay be?
  
  Multi-class: Delay severity category
  
  Use MLflow for experiment tracking

PHASE 5: PRODUCTION DEPLOYMENT
  
  Real-time prediction API endpoint
  
  Power BI dashboard with live updates
  
  Alert system for high-probability delays
  
  Alternative flight recommendation engine

💡 KEY SUCCESS METRICS:

  • Prediction Accuracy: >80% for binary classification

  • Mean Absolute Error: <15 minutes for delay duration

  • Lead Time: Predict 2-4 hours in advance

  • User Satisfaction: Actionable recommendations
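
Accuracy on its own can be misleading when delays are rare. In the sample explored in this notebook, 5 of 100 flights had `is_delayed = 1`, so a model that always predicts "on time" would already score 95%. The short check below illustrates this; `binary_metrics` is an illustrative helper, and the labels simply mirror that 5/95 split.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for 0/1 labels (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


y_true = [1] * 5 + [0] * 95   # same class balance as the sample: 5 delayed, 95 not
always_on_time = [0] * 100    # trivial baseline that never predicts a delay

baseline = binary_metrics(y_true, always_on_time)
print(baseline)
```

The baseline clears the 80% accuracy target while catching no delays at all, so it may be worth tracking recall and precision (or a similar metric) for the delayed class alongside accuracy.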
