# **US Flights 2023:** Flight Delay Analysis

## **Section 1:** Imports

In [1]:
import sys

In [3]:
sys.path.append("../scripts")

In [4]:
import utils 
import load_data
import query

## **Section 2:** Create connection with MongoDB

In [5]:
MONGO_URI = "mongodb://localhost:27017/"
MONGO_DATABASE_NAME = "flights_delay_db"

In [6]:
client, database = utils.connect_to_mongodb(MONGO_URI, MONGO_DATABASE_NAME, timeout=5000)

Databases: ['admin', 'config', 'flights_db', 'local']
MongoDB connection established to database: flights_delay_db


## **Section 3:** Load data

In [7]:
is_loaded = load_data.load_data_to_mongodb(database)

if is_loaded:
    print("Raw data loaded successfully into MongoDB.")

Loading data from us-flights-2023.csv
Inserted 10000 records into collection 'us_flights_2023'
Loading data from cancelled-diverted-2023.csv
Inserted 10000 records into collection 'cancelled_deverted_2023'
Loading data from weather-meteo-by-airport.csv
Inserted 10000 records into collection 'weather_meteo_by_airport'
Loading data from airports-geolocation.csv
Inserted 364 records into collection 'airports_geolocation'
Loading data from airports.csv
Inserted 10000 records into collection 'airports'
Loading data from airport-frequencies.csv
Inserted 10000 records into collection 'airport_frequencies'
Loading data from runways.csv
Inserted 10000 records into collection 'runways'
Data loading completed.
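
Every large file contributes exactly 10,000 records in the log above, which is consistent with a per-file row cap, although `load_data.load_data_to_mongodb` itself is not shown in this notebook. A minimal sketch of such a capped reader follows. `read_capped`, `coerce`, and `MAX_ROWS` are hypothetical names, and the sketch only parses rows and leaves out the MongoDB insert:

```python
import csv
import io
from itertools import islice

MAX_ROWS = 10_000  # assumed cap, inferred only from the log output


def coerce(value):
    """Turn CSV text into None, int, or float where possible; otherwise keep the string."""
    if value == "":
        return None
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


def read_capped(handle, max_rows=MAX_ROWS):
    """Read at most max_rows rows from an open CSV file as dictionaries."""
    reader = csv.DictReader(handle)
    return [{k: coerce(v) for k, v in row.items()} for row in islice(reader, max_rows)]


# Made-up sample rows standing in for a real CSV file
sample = io.StringIO("Dep_Airport,Dep_Delay,Aircraft_age\nLGA,35,12.5\nJFK,,3\nATL,-2,8\n")
rows = read_capped(sample, max_rows=2)
print(rows)
```

Type coercion matters here because comparisons such as `{"Dep_Delay": {"$gt": 15}}` only match numeric values, not numeric-looking strings.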


In [8]:
collections = database.list_collection_names()
print("Available collections in database:", collections)

Available collections in database: ['airport_frequencies', 'us_flights_2023', 'airports', 'cancelled_deverted_2023', 'runways', 'airports_geolocation', 'weather_meteo_by_airport']


## **Section 4:** Query analysis

### **Data Structure**

#### **File 1:** `us-flights-2023.csv`

| Field | Type | Description |
|-------|-----|------|
| FlightDate | Date | Flight date (YYYY-MM-DD) |
| Day_Of_Week | Integer | Day of week (1-7, 1=Monday) |
| Airline | String | Airline name |
| Tail_number | String | Unique aircraft identifier |
| Dep_Airport | String | Departure airport IATA code |
| Dep_CityName | String | Departure city name |
| DepTime_label | String | Departure time interval |
| Dep_Delay | Integer | Departure delay in minutes |
| Dep_Delay_Tag | Integer | Delay indicator (0/1) |
| Dep_Delay_Type | String | Departure delay type |
| Arr_Airport | String | Arrival airport IATA code |
| Arr_CityName | String | Arrival city name |
| Arr_Delay | Integer | Arrival delay in minutes |
| Arr_Delay_Type | String | Arrival delay type |
| Flight_Duration | Integer | Flight duration in minutes |
| Distance_type | String | Distance category (Short, Medium, Long) |
| Delay_Carrier | Integer | Carrier delay (min) |
| Delay_Weather | Integer | Weather delay (min) |
| Delay_NAS | Integer | NAS system delay (min) |
| Delay_Security | Integer | Security delay (min) |
| Delay_LastAircraft | Integer | Previous flight delay (min) |
| Manufacturer | String | Aircraft manufacturer |
| Model | String | Aircraft model |
| Aircraft_age | Float | Aircraft age in years |

**Number of rows:** 6,743,404  
**Number of columns:** 24

#### **File 2:** `cancelled-diverted-2023.csv`

| Field | Type | Description |
|-------|-----|------|
| FlightDate | Date | Flight date (YYYY-MM-DD) |
| Day_Of_Week | Integer | Day of week (1-7, 1=Monday) |
| Airline | String | Airline name |
| Tail_Number | String | Unique aircraft identifier |
| Cancelled | Integer | Cancellation indicator (0/1) |
| Diverted | Integer | Diversion indicator (0/1) |
| Dep_Airport | String | Departure airport IATA code |
| Arr_Airport | String | Arrival airport IATA code |
| Dep_CityName | String | Departure city |
| Arr_CityName | String | Arrival city |
| DepTime_label | String | Departure time interval |
| Dep_Delay | Integer | Departure delay (min) |
| Arr_Delay | Integer | Arrival delay (min) |
| Dep_Delay_Tag | Integer | Departure delay indicator (0/1) |
| Arr_Delay_Tag | Integer | Arrival delay indicator (0/1) |
| Dep_Delay_Type | String | Departure delay type |
| Arr_Delay_Type | String | Arrival delay type |
| Flight_Duration | Integer | Flight duration (min) |
| Distance_type | String | Distance category |
| Delay_Carrier | Integer | Carrier delay (min) |
| Delay_Weather | Integer | Weather delay (min) |
| Delay_NAS | Integer | NAS system delay (min) |
| Delay_Security | Integer | Security delay (min) |
| Delay_LastAircraft | Integer | Previous flight delay (min) |

**Number of rows:** 104,488  
**Number of columns:** 23

#### **File 3:** `weather-meteo-by-airport.csv`

| Field | Type | Description |
|-------|-----|------|
| time | DateTime | Measurement date and time |
| tavg | Float | Average temperature (°C) |
| tmin | Float | Minimum temperature (°C) |
| tmax | Float | Maximum temperature (°C) |
| prcp | Float | Precipitation (mm) |
| snow | Float | Snow (mm) |
| wdir | Float | Wind direction (degrees) |
| wspd | Float | Wind speed (km/h) |
| pres | Float | Atmospheric pressure (hPa) |
| airport_id | String | Airport IATA code |

**Number of rows:** 132,860  
**Number of columns:** 10

#### **File 4:** `airports-geolocation.csv`

| Field | Type | Description |
|-------|-----|------|
| IATA_CODE | String | Airport IATA code |
| AIRPORT | String | Airport name |
| CITY | String | City |
| STATE | String | State |
| COUNTRY | String | Country |
| LATITUDE | Float | Geographic latitude |
| LONGITUDE | Float | Geographic longitude |

**Number of rows:** 364  
**Number of columns:** 7

#### **File 5:** `airports.csv`

| Field | Type | Description |
|-------|-----|------|
| id | Integer | Unique identifier |
| ident | String | Airport ICAO code |
| type | String | Airport type |
| name | String | Full airport name |
| latitude_deg | Float | Geographic latitude |
| longitude_deg | Float | Geographic longitude |
| elevation_ft | Integer | Elevation (ft) |
| continent | String | Continent |
| iso_country | String | ISO country code |
| iso_region | String | ISO region code |
| municipality | String | Municipality/city |
| scheduled_service | String | Scheduled flights (yes/no) |
| gps_code | String | GPS code |
| iata_code | String | IATA code |
| local_code | String | Local code |
| home_link | String | Homepage URL |
| wikipedia_link | String | Wikipedia link |
| keywords | String | Keywords |

**Number of rows:** 61,221  
**Number of columns:** 18

#### **File 6:** `airport-frequencies.csv`

| Field | Type | Description |
|-------|-----|------|
| id | Integer | Unique identifier |
| airport_ref | Integer | Reference to airports.id |
| airport_ident | String | Airport ICAO code |
| type | String | Frequency type |
| description | String | Frequency description |
| frequency_mhz | Float | Frequency (MHz) |

**Number of rows:** 28,927  
**Number of columns:** 6

#### **File 7:** `runways.csv`

| Field | Type | Description |
|-------|-----|------|
| id | Integer | Unique identifier |
| airport_ref | Integer | Reference to airports.id |
| airport_ident | String | Airport ICAO code |
| length_ft | Integer | Runway length (ft) |
| width_ft | Integer | Runway width (ft) |
| surface | String | Runway surface material |
| lighted | Integer | Lighting (1=yes, 0=no) |
| closed | Integer | Closed (1=yes, 0=no) |
| le_ident | String | Lower end identifier |
| le_latitude_deg | Float | Lower end latitude |
| le_longitude_deg | Float | Lower end longitude |
| le_elevation_ft | Float | Lower end elevation (ft) |
| le_heading_degT | Float | Lower end azimuth |
| le_displaced_threshold_ft | Integer | Lower end threshold displacement |
| he_ident | String | Higher end identifier |
| he_latitude_deg | Float | Higher end latitude |
| he_longitude_deg | Float | Higher end longitude |
| he_elevation_ft | Float | Higher end elevation (ft) |
| he_heading_degT | Float | Higher end azimuth |
| he_displaced_threshold_ft | Integer | Higher end threshold displacement |

**Number of rows:** 41,761  
**Number of columns:** 20

#### **Keys for Joining**

- **US_flights_2023.Dep_Airport** ↔ **Airports_Geolocation.IATA_CODE**
- **US_flights_2023.Dep_Airport** ↔ **Weather_Meteo_by_Airport.airport_id**
- **Airports.iata_code** ↔ **US_flights_2023.Dep_Airport**
- **Airports.ident** ↔ **Airport_Frequencies.airport_ident**
- **Airports.ident** ↔ **Runways.airport_ident**
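
Flights carry IATA codes, while `runways` and `airport_frequencies` are keyed by the `ident` column of `airports`, so reaching them from a flight takes two hops. A small in-memory sketch of that chain follows. The sample rows are made up for illustration and `runways_for_flight` is a hypothetical helper:

```python
flights = [{"Dep_Airport": "LGA", "Dep_Delay": 35}]
airports = [
    {"ident": "KLGA", "iata_code": "LGA", "type": "large_airport"},
    {"ident": "KJFK", "iata_code": "JFK", "type": "large_airport"},
]
runways = [
    {"airport_ident": "KLGA", "length_ft": 7003},
    {"airport_ident": "KLGA", "length_ft": 7001},
    {"airport_ident": "KJFK", "length_ft": 14511},
]

# Hop 1: flights.Dep_Airport (IATA) -> airports.ident
ident_by_iata = {a["iata_code"]: a["ident"] for a in airports if a.get("iata_code")}

# Hop 2: airports.ident -> runways.airport_ident
runways_by_ident = {}
for runway in runways:
    runways_by_ident.setdefault(runway["airport_ident"], []).append(runway)


def runways_for_flight(flight):
    """Return the runway rows of the flight's departure airport (empty if unknown)."""
    ident = ident_by_iata.get(flight["Dep_Airport"])
    return runways_by_ident.get(ident, [])


print(len(runways_for_flight(flights[0])))
```

The pipelines below follow the same order: a `$lookup` into `airports` on `iata_code` first, then a second `$lookup` on `airport_info.ident`.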

##### Test Query

In [9]:
test_query = [
    {
        "$match": {
            "Dep_Delay": {"$gt": 15}
        }
    },
    {"$limit": 5} 
]

In [10]:
test_results = list(database.us_flights_2023.aggregate(test_query))

print(f"Found {len(test_results)} flights (showing only 5 for testing)\n")

for i, flight in enumerate(test_results, 1):
    print(f"Flight {i}:")
    print(f"  Date: {flight.get('FlightDate')}")
    print(f"  Airline: {flight.get('Airline')}")
    print(f"  Dep_Airport: {flight.get('Dep_Airport')}")
    print(f"  Dep_Delay: {flight.get('Dep_Delay')} minutes")
    print()

Found 5 flights (showing only 5 for testing)

Flight 1:
  Date: 2023-01-11
  Airline: Endeavor Air
  Dep_Airport: LGA
  Dep_Delay: 35 minutes

Flight 2:
  Date: 2023-01-12
  Airline: Endeavor Air
  Dep_Airport: LGA
  Dep_Delay: 132 minutes

Flight 3:
  Date: 2023-01-19
  Airline: Endeavor Air
  Dep_Airport: LGA
  Dep_Delay: 676 minutes

Flight 4:
  Date: 2023-01-22
  Airline: Endeavor Air
  Dep_Airport: LGA
  Dep_Delay: 29 minutes

Flight 5:
  Date: 2023-01-23
  Airline: Endeavor Air
  Dep_Airport: LGA
  Dep_Delay: 27 minutes



##### **Create Indexes for query optimization**

In [11]:
# Check existing indexes first
print("Current indexes on us_flights_2023:")
for index in database.us_flights_2023.list_indexes():
    print(f"  - {index['name']}: {index.get('key', {})}")
print()

# Create indexes
indexes_to_create = [
    ("weather_meteo_by_airport", "airport_id", "idx_airport_id"),
    ("us_flights_2023", "Dep_Airport", "idx_dep_airport"),
    ("us_flights_2023", "Dep_Delay", "idx_dep_delay"),
    ("airports", "iata_code", "idx_iata_code"),
    ("airports", "ident", "idx_ident"),
    ("runways", "airport_ident", "idx_runway_airport_ident"),
]

for collection_name, field, index_name in indexes_to_create:
    try:
        result = database[collection_name].create_index(
            [(field, 1)],
            name=index_name,
            background=True  # Ignored by MongoDB 4.2 and later; only affects older servers
        )
        print(f"Created index '{index_name}' on {collection_name}.{field}")
    except Exception as e:
        # Index might already exist
        if "already exists" in str(e):
            print(f"Index '{index_name}' already exists on {collection_name}.{field}")
        else:
            print(f"Error creating '{index_name}': {e}")

Current indexes on us_flights_2023:
  - _id_: SON([('_id', 1)])

Created index 'idx_airport_id' on weather_meteo_by_airport.airport_id
Created index 'idx_dep_airport' on us_flights_2023.Dep_Airport
Created index 'idx_dep_delay' on us_flights_2023.Dep_Delay
Created index 'idx_iata_code' on airports.iata_code
Created index 'idx_ident' on airports.ident
Created index 'idx_runway_airport_ident' on runways.airport_ident
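
An equality `$lookup` can generally use an index on its `foreignField`, so it can help to compare the fields a pipeline joins on with the indexes that exist. A rough sketch of such a check follows, assuming a hypothetical `lookup_targets` helper that only understands the plain `localField`/`foreignField` form of `$lookup`:

```python
def lookup_targets(pipeline):
    """Return the set of (collection, foreignField) pairs used by $lookup stages."""
    targets = set()
    for stage in pipeline:
        spec = stage.get("$lookup")
        if spec and "foreignField" in spec:
            targets.add((spec["from"], spec["foreignField"]))
    return targets


# Indexed (collection, field) pairs, mirroring the index list in the cell above
indexed = {
    ("weather_meteo_by_airport", "airport_id"),
    ("us_flights_2023", "Dep_Airport"),
    ("us_flights_2023", "Dep_Delay"),
    ("airports", "iata_code"),
    ("airports", "ident"),
    ("runways", "airport_ident"),
}

# Shortened example pipeline with two joins
example_pipeline = [
    {"$match": {"Delay_Weather": {"$gt": 10}}},
    {"$lookup": {"from": "airports", "localField": "Dep_Airport",
                 "foreignField": "iata_code", "as": "airport_info"}},
    {"$lookup": {"from": "airport_frequencies", "localField": "airport_info.ident",
                 "foreignField": "airport_ident", "as": "frequencies"}},
]

missing = lookup_targets(example_pipeline) - indexed
print(sorted(missing))
```

Applied to the pipelines in this notebook, a check like this would flag `airport_frequencies.airport_ident` and `airports_geolocation.IATA_CODE`, which are joined on but do not appear in the index list.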


### **Query 1:** Large airports with high delays in bad weather

Which airports of type 'large_airport' have an average departure delay of more than 15 minutes when precipitation exceeded 10 mm, and have runways longer than 10,000 feet?

In [12]:
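pipeline_query_1 = [
    # STAGE 1: Keep delayed departures (more than 5 minutes late)
    {
        "$match": {
            "Dep_Delay": {"$gt": 5, "$ne": None}
        }
    },
    # STAGE 2: Join with weather data (by airport only, not by date)
    {
        "$lookup": {
            "from": "weather_meteo_by_airport",
            "localField": "Dep_Airport",
            "foreignField": "airport_id",
            "as": "weather_info"
        }
    },
    # STAGE 3: Unwind weather info
    {
        "$unwind": {
            "path": "$weather_info",
            "preserveNullAndEmptyArrays": False
        }
    },
    # STAGE 4: Keep weather records with precipitation above 10 mm
    {
        "$match": {
            "weather_info.prcp": {"$gt": 10, "$ne": None}
        }
    },
    # STAGE 5: Average the delay per departure airport
    {
        "$group": {
            "_id": "$Dep_Airport",
            "avg_delay": {"$avg": "$Dep_Delay"},
            "total_flights": {"$sum": 1},
            "avg_precipitation": {"$avg": "$weather_info.prcp"}
        }
    },
    # STAGE 6: Keep airports whose average delay exceeds 15 minutes
    {"$match": {"avg_delay": {"$gt": 15}}},
    # STAGE 7: Join with airports and keep only large airports
    {
        "$lookup": {
            "from": "airports",
            "localField": "_id",
            "foreignField": "iata_code",
            "as": "airport_info"
        }
    },
    {"$unwind": "$airport_info"},
    {"$match": {"airport_info.type": "large_airport"}},
    # STAGE 8: Join with runways and require at least one longer than 10,000 ft
    {
        "$lookup": {
            "from": "runways",
            "localField": "airport_info.ident",
            "foreignField": "airport_ident",
            "as": "runways"
        }
    },
    {"$match": {"runways.length_ft": {"$gt": 10000}}},
    # STAGE 9: Drop the joined arrays, sort by average delay, keep the top 5
    {"$project": {"airport_info": 0, "runways": 0}},
    {"$sort": {"avg_delay": -1}},
    {"$limit": 5}
]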

In [13]:
query_results, execution_time = query.execute_query(database.us_flights_2023, pipeline_query_1, "Large airports with high delays in bad weather")

Executing query: Large airports with high delays in bad weather
Query 'Large airports with high delays in bad weather' execution time: 0.9692 seconds
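
The `query.execute_query` helper is imported from `../scripts` and is not shown in this notebook. Judging only by its call signature and log output, it could look roughly like the sketch below. The body is an assumption, and `FakeCollection` is a hypothetical stand-in so the example runs without a database:

```python
import time


def execute_query(collection, pipeline, description):
    """Run an aggregation, print its wall-clock time, and return (results, seconds)."""
    print(f"Executing query: {description}")
    start = time.perf_counter()
    results = list(collection.aggregate(pipeline))  # list() consumes the whole cursor
    elapsed = time.perf_counter() - start
    print(f"Query '{description}' execution time: {elapsed:.4f} seconds")
    return results, elapsed


class FakeCollection:
    """Hypothetical stand-in that mimics only the aggregate() method."""

    def __init__(self, docs):
        self.docs = docs

    def aggregate(self, pipeline):
        # Ignores the pipeline and simply yields the stored documents
        return iter(self.docs)


results, seconds = execute_query(FakeCollection([{"_id": "LGA"}]), [], "demo")
print(len(results))
```

Consuming the cursor inside the timed region matters, because `aggregate` returns a cursor and later result batches are only fetched as it is iterated.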


### **Query 2:** Airlines with weather-related delays at high-elevation airports

Which airlines have the most weather-related delays (weather delay > 10 minutes) on flights departing from airports with an elevation above 500 feet and more than 5 communication frequencies?

In [14]:
pipeline_query_2 = [
    # STAGE 1: Filter flights with significant weather delays
    {
        "$match" : {
            "Delay_Weather": {"$gt": 10, "$ne": None}
        }
    },
    # STAGE 2: Join with airports to get elevation data
    {
        "$lookup" : {
            "from": "airports",
            "localField": "Dep_Airport",
            "foreignField": "iata_code",
            "as": "airport_info"
        }
    },
    # STAGE 3: Unwind airport info (convert array to individual documents)
    {
        "$unwind": {
            "path": "$airport_info",
            "preserveNullAndEmptyArrays": False
        }
    },
    # STAGE 4: Filter airports with elevation > 500 ft
    {
        "$match": {
            "airport_info.elevation_ft": {"$gt": 500, "$ne": None}
        }
    },
    # STAGE 5: Join with airport_frequencies
    {
        "$lookup": {
            "from": "airport_frequencies",
            "localField": "airport_info.ident",   # ICAO code (e.g. KATL)
            "foreignField": "airport_ident",
            "as": "frequencies"
        }
    },
    # STAGE 6: Add field with frequency count
    {
        "$addFields": {
            "frequency_count": {"$size": "$frequencies"}
        }
    },
    # STAGE 7: Filter airports with more than 5 frequencies
    {
        "$match": {
            "frequency_count": {"$gt": 5}
        }
    },
    # STAGE 8: Group by airline and calculate statistics
    {
        "$group": {
            "_id": "$Airline",                           # Group by airline
            "total_weather_delays": {"$sum": 1},         # Count delays
            "avg_delay_minutes": {"$avg": "$Delay_Weather"},
            "total_delay_minutes": {"$sum": "$Delay_Weather"},
            "affected_airports": {"$addToSet": "$Dep_Airport"}  # Unique airports
        }
    },
    # STAGE 9: Sort by total delays (descending)
    {
        "$sort": {"total_weather_delays": -1}
    },
    # STAGE 10: Limit to top 10 airlines
    {
        "$limit": 10
    }
]

In [15]:
query_results, execution_time = query.execute_query(database.us_flights_2023, pipeline_query_2, "Airlines with weather-related delays at high-elevation airports")

Executing query: Airlines with weather-related delays at high-elevation airports
Query 'Airlines with weather-related delays at high-elevation airports' execution time: 0.0585 seconds


### **Query 3:** Average age of old aircraft with NAS delays on lit asphalt runways

What is the average age of aircraft older than 10 years on flights delayed by the national aviation system (NAS delay > 5 minutes), at airports with lit asphalt runways?

In [16]:
pipeline_query_3 = [
    # STAGE 1: Filter flights with NAS delays and old aircraft
    {
        "$match": {
            "Delay_NAS": {"$gt": 5, "$ne": None},
            "Aircraft_age": {"$gt": 10, "$ne": None}
        }
    },
    # STAGE 2: Join with airports to get airport identifier (ident)
    {
        "$lookup": {
            "from": "airports",
            "localField": "Dep_Airport",
            "foreignField": "iata_code",
            "as": "airport_info"
        }
    },
    # STAGE 3: Unwind airport info
    {
        "$unwind": {
            "path": "$airport_info",
            "preserveNullAndEmptyArrays": False
        }
    },
    # STAGE 4: Join with runways to get runway details
    {
        "$lookup": {
            "from": "runways",
            "localField": "airport_info.ident",  # ICAO code
            "foreignField": "airport_ident",
            "as": "runway_info"
        }
    },
    # STAGE 5: Unwind runway info
    {
        "$unwind": {
            "path": "$runway_info",
            "preserveNullAndEmptyArrays": False
        }
    },
    # STAGE 6: Filter for asphalt runways with lighting
    {
        "$match": {
            "runway_info.surface": "Asphalt",
            "runway_info.lighted": 1
        }
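    },
    # STAGE 7: Collapse back to one document per flight (a flight may match several runways)
    {
        "$group": {
            "_id": "$_id",
            "Aircraft_age": {"$first": "$Aircraft_age"}
        }
    },
    # STAGE 8: Average the aircraft age across the matching flights
    {
        "$group": {
            "_id": None,
            "avg_aircraft_age": {"$avg": "$Aircraft_age"},
            "total_flights": {"$sum": 1}
        }
    }
]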

In [17]:
query_results, execution_time = query.execute_query(database.us_flights_2023, pipeline_query_3, "Average age of old aircraft with NAS delays on lit asphalt runways")

Executing query: Average age of old aircraft with NAS delays on lit asphalt runways
Query 'Average age of old aircraft with NAS delays on lit asphalt runways' execution time: 0.0860 seconds
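
Unwinding `runway_info` emits one document per runway, so a flight from an airport with several lit asphalt runways would be counted several times in a later average unless the rows are collapsed back to one per flight. A plain-Python sketch of the effect, using made-up numbers:

```python
flights = [
    {"flight_id": 1, "Aircraft_age": 20, "runways": ["A", "B", "C"]},  # 3 matching runways
    {"flight_id": 2, "Aircraft_age": 12, "runways": ["A"]},            # 1 matching runway
]

# Equivalent of $unwind: one row per (flight, runway) pair
unwound = [(f["flight_id"], f["Aircraft_age"]) for f in flights for _ in f["runways"]]
avg_per_row = sum(age for _, age in unwound) / len(unwound)

# Collapse back to one row per flight before averaging
age_by_flight = dict(unwound)
avg_per_flight = sum(age_by_flight.values()) / len(age_by_flight)

print(avg_per_row, avg_per_flight)  # 18.0 16.0
```

In an aggregation the same effect can be avoided by matching on the joined array without unwinding it, for example with `$elemMatch`, or by grouping on the flight `_id` before computing the average.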


### **Query 4:** Cities with most diverted flights in high wind conditions

Which cities have the most diverted flights when the wind direction is greater than 180 degrees and the wind speed is over 20 km/h?

In [18]:
pipeline_query_4 = [
    # STAGE 1: Filter diverted flights only
    {
        "$match": {
            "Diverted": 1  # Note: Number 1, not string "1"
        }
    },
    # STAGE 2: Join with weather data
    {
        "$lookup": {
            "from": "weather_meteo_by_airport",
            "localField": "Dep_Airport",
            "foreignField": "airport_id",
            "as": "weather_info"
        }
    },
    # STAGE 3: Unwind weather info
    {
        "$unwind": {
            "path": "$weather_info",
            "preserveNullAndEmptyArrays": False
        }
    },
    # STAGE 4: Filter for high wind conditions
    {
        "$match": {
            "weather_info.wdir": {"$gt": 180, "$ne": None},
            "weather_info.wspd": {"$gt": 20, "$ne": None}
        }
    },
    # STAGE 5: Join with geolocation to get city names
    {
        "$lookup": {
            "from": "airports_geolocation",
            "localField": "Dep_Airport",
            "foreignField": "IATA_CODE",
            "as": "geo_info"
        }
    },
    # STAGE 6: Unwind geolocation info
    {
        "$unwind": {
            "path": "$geo_info",
            "preserveNullAndEmptyArrays": False
        }
    },
    # STAGE 7: Group by city and count diversions
    {
        "$group": {
            "_id": {
                "city": "$geo_info.CITY",
                "state": "$geo_info.STATE"
            },
            "total_diversions": {"$sum": 1},
            "avg_wind_speed": {"$avg": "$weather_info.wspd"},
            "avg_wind_direction": {"$avg": "$weather_info.wdir"},
            "affected_airports": {"$addToSet": "$Dep_Airport"}
        }
    },
    # STAGE 8: Sort by number of diversions (descending)
    {
        "$sort": {"total_diversions": -1}
    },
    
    # STAGE 9: Limit to top 10 cities
    {
        "$limit": 10
    }
]

In [19]:
query_results, execution_time = query.execute_query(database.cancelled_deverted_2023, pipeline_query_4, "Cities with most diverted flights in high wind conditions")

Executing query: Cities with most diverted flights in high wind conditions
Query 'Cities with most diverted flights in high wind conditions' execution time: 0.0220 seconds


### **Query 5:** Aircraft models with late aircraft delays at medium airports

Which aircraft models have the most delays caused by a late-arriving aircraft (delay > 15 minutes) at 'medium_airport' airports with more than 3 runways?

In [20]:
pipeline_query_5 = [
    # STAGE 1: Filter flights with late aircraft delays > 15 minutes
    {
        "$match": {
            "Delay_LastAircraft": {"$gt": 15, "$ne": None}
        }
    },
    # STAGE 2: Join with airports to get airport type
    {
        "$lookup": {
            "from": "airports",
            "localField": "Dep_Airport",
            "foreignField": "iata_code",
            "as": "airport_info"
        }
    },
    # STAGE 3: Unwind airport info
    {
        "$unwind": {
            "path": "$airport_info",
            "preserveNullAndEmptyArrays": False
        }
    },
    # STAGE 4: Filter for medium airports
    {
        "$match": {
            "airport_info.type": "medium_airport"
        }
    },
    # STAGE 5: Join with runways to get runway information
    {
        "$lookup": {
            "from": "runways",
            "localField": "airport_info.ident",  # ICAO code
            "foreignField": "airport_ident",
            "as": "runways_info"
        }
    },
    # STAGE 6: Add field with runway count
    {
        "$addFields": {
            "runway_count": {"$size": "$runways_info"}
        }
    },
    # STAGE 7: Filter airports with more than 3 runways
    {
        "$match": {
            "runway_count": {"$gt": 3}
        }
    },
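    # STAGE 8: Group by aircraft model and count delayed flights
    {
        "$group": {
            "_id": "$Model",
            "total_delays": {"$sum": 1},
            "avg_delay_minutes": {"$avg": "$Delay_LastAircraft"}
        }
    },
    # STAGE 9: Sort by total delays (descending)
    {
        "$sort": {"total_delays": -1}
    },
    # STAGE 10: Limit to top 10 aircraft models
    {
        "$limit": 10
    }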
]

In [21]:
query_results, execution_time = query.execute_query(database.us_flights_2023, pipeline_query_5, "Aircraft models with late aircraft delays at medium airports")

Executing query: Aircraft models with late aircraft delays at medium airports
Query 'Aircraft models with late aircraft delays at medium airports' execution time: 0.0955 seconds
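
Answering this question comes down to counting the qualifying flights per `Model` and keeping the largest counts. A plain-Python sketch of that ranking step, using made-up rows in place of the documents that survive the filters:

```python
from collections import Counter

# Made-up flights that already passed the delay, airport type, and runway filters
qualifying_flights = [
    {"Model": "737-800", "Delay_LastAircraft": 45},
    {"Model": "A320", "Delay_LastAircraft": 20},
    {"Model": "737-800", "Delay_LastAircraft": 16},
]

# Group by model and count, then keep the ten most frequent models
counts = Counter(flight["Model"] for flight in qualifying_flights)
top_models = counts.most_common(10)
print(top_models)  # [('737-800', 2), ('A320', 1)]
```

This corresponds to a `$group` on `$Model` with a `{"$sum": 1}` counter, followed by `$sort` and `$limit`.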


### **Query 6:** Average departure delay for morning flights in snowy conditions at TWR airports

What is the average departure delay for morning flights (DepTime_label = 'Morning') at airports that recorded snow (snow > 0) and have a frequency of type 'TWR'?

In [22]:
pipeline_query_6 = [
    # STAGE 1: Filter morning flights
    {
        "$match": {
            "DepTime_label": "Morning",
            "Dep_Delay": {"$ne": None}
        }
    },
    # STAGE 2: Join with weather data for snow conditions
    {
        "$lookup": {
            "from": "weather_meteo_by_airport",
            "localField": "Dep_Airport",
            "foreignField": "airport_id",
            "as": "weather_info"
        }
    },
    # STAGE 3: Unwind weather info
    {
        "$unwind": {
            "path": "$weather_info",
            "preserveNullAndEmptyArrays": False
        }
    },
    # STAGE 4: Filter for snowy conditions
    {
        "$match": {
            "weather_info.snow": {"$gt": 0, "$ne": None}
        }
    },
    # STAGE 5: Join with airports to get ident for frequency lookup
    {
        "$lookup": {
            "from": "airports",
            "localField": "Dep_Airport",
            "foreignField": "iata_code",
            "as": "airport_info"
        }
    },
    # STAGE 6: Unwind airport info
    {
        "$unwind": {
            "path": "$airport_info",
            "preserveNullAndEmptyArrays": False
        }
    },
    # STAGE 7: Join with frequencies to find TWR frequencies
    {
        "$lookup": {
            "from": "airport_frequencies",
            "localField": "airport_info.ident",
            "foreignField": "airport_ident",
            "as": "frequencies"
        }
    },
    # STAGE 8: Filter for airports with TWR frequency type
    {
        "$match": {
            "frequencies.type": "TWR"
        }
    },
    # STAGE 9: Group and calculate average delay
    {
        "$group": {
            "_id": "$Dep_Airport",
            "avg_departure_delay": {"$avg": "$Dep_Delay"},
            "total_morning_flights": {"$sum": 1},
            "avg_snow": {"$avg": "$weather_info.snow"},
            "airport_name": {"$first": "$airport_info.name"}
        }
    },
    # STAGE 10: Sort by average delay (descending)
    {
        "$sort": {"avg_departure_delay": -1}
    },
    # STAGE 11: Limit to top 10 results
    {
        "$limit": 10
    }
]

In [23]:
query_results, execution_time = query.execute_query(database.us_flights_2023, pipeline_query_6, "Average departure delay for morning flights in snowy conditions at TWR airports")

Executing query: Average departure delay for morning flights in snowy conditions at TWR airports
Query 'Average departure delay for morning flights in snowy conditions at TWR airports' execution time: 0.9246 seconds
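
The weather `$lookup` matches on the airport only, so each flight is paired with every weather record for that airport rather than with the record for its own day. A plain-Python sketch of the difference follows, assuming `FlightDate` and the date part of `time` share the `YYYY-MM-DD` text format. The sample rows are made up:

```python
flights = [
    {"Dep_Airport": "ORD", "FlightDate": "2023-01-10", "Dep_Delay": 40},
    {"Dep_Airport": "ORD", "FlightDate": "2023-07-04", "Dep_Delay": 5},
]
weather = [
    {"airport_id": "ORD", "time": "2023-01-10", "snow": 30.0},
    {"airport_id": "ORD", "time": "2023-07-04", "snow": 0.0},
]

# Airport-only join: every flight meets every weather row of its airport
by_airport = [(f, w) for f in flights for w in weather
              if f["Dep_Airport"] == w["airport_id"]]

# Airport-and-date join: each flight meets only the weather of its own day
by_airport_and_date = [(f, w) for f, w in by_airport
                       if f["FlightDate"] == w["time"][:10]]

snowy_airport_only = [f["Dep_Delay"] for f, w in by_airport if w["snow"] > 0]
snowy_same_day = [f["Dep_Delay"] for f, w in by_airport_and_date if w["snow"] > 0]
print(snowy_airport_only, snowy_same_day)  # [40, 5] [40]
```

With the airport-only join, the July flight is treated as a snowy-day flight. In MongoDB the date condition could be added with the `let`/`pipeline` form of `$lookup`, provided both fields are stored in a comparable type.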


### **Query 7:** Aircraft manufacturers with most cancelled flights at airports with local codes

Which aircraft manufacturers have the most cancelled flights at airports that have a local code?

In [24]:
pipeline_query_7 = [
    # STAGE 1: Filter cancelled flights
    {
        "$match": {
            "Cancelled": 1
        }
    },
    # STAGE 2: Join with airports to check for local codes
    {
        "$lookup": {
            "from": "airports",
            "localField": "Dep_Airport",
            "foreignField": "iata_code",
            "as": "airport_info"
        }
    },
    # STAGE 3: Unwind airport info
    {
        "$unwind": {
            "path": "$airport_info",
            "preserveNullAndEmptyArrays": False
        }
    },
    # STAGE 4: Filter airports that have local codes
    {
        "$match": {
            "airport_info.local_code": {"$nin": [None, ""]}
        }
    },
    # STAGE 5: Group by manufacturer and count cancellations
    {
        "$group": {
            "_id": "$Manufacturer",
            "total_cancellations": {"$sum": 1},
            "affected_airports": {"$addToSet": "$Dep_Airport"},
            "cancellation_rate": {
                "$avg": {
                    "$cond": [{"$eq": ["$Cancelled", 1]}, 1, 0]
                }
            }
        }
    },
    # STAGE 6: Sort by total cancellations (descending)
    {
        "$sort": {"total_cancellations": -1}
    },
    # STAGE 7: Limit to top 10 manufacturers
    {
        "$limit": 10
    }
]

In [25]:
query_results, execution_time = query.execute_query(database.us_flights_2023, pipeline_query_7, "Aircraft manufacturers with most cancelled flights at airports with local codes")

Executing query: Aircraft manufacturers with most cancelled flights at airports with local codes
Query 'Aircraft manufacturers with most cancelled flights at airports with local codes' execution time: 0.0200 seconds
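
A filter that should exclude both null and empty `local_code` values needs some care in Python, because a dict literal silently keeps only the last value for a repeated key. A minimal sketch of the pitfall and of a `$nin` form that avoids it follows. `matches_nin` is a hypothetical local emulation, not a PyMongo function:

```python
# A dict literal keeps only the last value assigned to a repeated key
condition = {"$ne": None, "$ne": ""}
print(condition)  # {'$ne': ''}

# $nin takes a list, so several excluded values fit under a single key
combined = {"$nin": [None, ""]}


def matches_nin(value, excluded):
    """Tiny local emulation of $nin for a single scalar value."""
    return value not in excluded


kept = [v for v in ["LGA", "", None] if matches_nin(v, combined["$nin"])]
print(kept)  # ['LGA']
```

An explicit `$and` with two separate `$ne` conditions would be another way to express the same filter.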


### **Query 8:** Airports with highest NAS delay proportion by day of week

Which airports have the highest proportion of delays caused by NAS (Delay_NAS) on particular days of the week?

In [26]:
pipeline_query_8 = [
    # STAGE 1: Filter flights with NAS delays
    {
        "$match": {
            "Delay_NAS": {"$gt": 0, "$ne": None}
        }
    },
    # STAGE 2: Calculate the total delay across all delay causes
    {
        "$addFields": {
            "total_delay": {
                "$add": [
                    {"$ifNull": ["$Delay_Carrier", 0]},
                    {"$ifNull": ["$Delay_Weather", 0]},
                    {"$ifNull": ["$Delay_NAS", 0]},
                    {"$ifNull": ["$Delay_Security", 0]},
                    {"$ifNull": ["$Delay_LastAircraft", 0]}
                ]
            }
        }
    },
    # STAGE 3: Calculate NAS delay proportion
    {
        "$addFields": {
            "nas_proportion": {
                "$cond": [
                    {"$gt": ["$total_delay", 0]},
                    {"$divide": ["$Delay_NAS", "$total_delay"]},
                    0
                ]
            }
        }
    },
    # STAGE 4: Group by airport and day of week
    {
        "$group": {
            "_id": {
                "airport": "$Dep_Airport",
                "day_of_week": "$Day_Of_Week"
            },
            "avg_nas_proportion": {"$avg": "$nas_proportion"},
            "total_nas_delays": {"$sum": 1},
            "avg_nas_delay_minutes": {"$avg": "$Delay_NAS"},
            "total_flights": {"$sum": 1}
        }
    },
    # STAGE 5: Calculate NAS delay percentage
    {
        "$addFields": {
            "nas_delay_percentage": {"$multiply": ["$avg_nas_proportion", 100]}
        }
    },
    # STAGE 6: Sort by NAS delay percentage (descending)
    {
        "$sort": {"nas_delay_percentage": -1}
    },
    # STAGE 7: Group by day to get top airports per day
    {
        "$group": {
            "_id": "$_id.day_of_week",
            "top_airports": {
                "$push": {
                    "airport": "$_id.airport",
                    "nas_delay_percentage": "$nas_delay_percentage",
                    "total_nas_delays": "$total_nas_delays",
                    "avg_nas_delay_minutes": "$avg_nas_delay_minutes"
                }
            }
        }
    },
    # STAGE 8: Get top 3 airports per day
    {
        "$project": {
            "day_of_week": "$_id",
            "top_airports": {"$slice": ["$top_airports", 3]},
            "_id": 0
        }
    },
    # STAGE 9: Sort by day of week
    {
        "$sort": {"day_of_week": 1}
    }
]

In [27]:
query_results, execution_time = query.execute_query(database.us_flights_2023, pipeline_query_8, "Airports with highest NAS delay proportion by day of week")

Executing query: Airports with highest NAS delay proportion by day of week
Query 'Airports with highest NAS delay proportion by day of week' execution time: 0.0390 seconds
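
The per-flight proportion used in this query is the NAS delay divided by the sum of the five delay causes, with missing values counted as zero. A worked sketch of that arithmetic in plain Python follows. `nas_proportion` is a hypothetical helper and the sample flight is made up:

```python
DELAY_FIELDS = [
    "Delay_Carrier",
    "Delay_Weather",
    "Delay_NAS",
    "Delay_Security",
    "Delay_LastAircraft",
]


def nas_proportion(flight):
    """Share of a flight's attributed delay minutes that is due to NAS (0 when no delay)."""
    total = sum(flight.get(field) or 0 for field in DELAY_FIELDS)  # mirrors $ifNull -> 0
    return flight["Delay_NAS"] / total if total > 0 else 0


flight = {"Delay_Carrier": 10, "Delay_Weather": None, "Delay_NAS": 30,
          "Delay_Security": 0, "Delay_LastAircraft": 20}
print(nas_proportion(flight))  # 30 / 60 = 0.5
```

The pipeline then averages this value per airport and day of week and multiplies it by 100 to report a percentage.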


### **Query 9:** Most common aircraft models for long-distance flights from NYC with displaced thresholds

What are the most common aircraft models on long-haul flights (Distance_type = 'Long') departing from New York airports that have a runway with a displaced threshold (le_displaced_threshold_ft > 0)?

In [28]:
pipeline_query_9 = [
    # STAGE 1: Filter long-distance flights from New York airports
    {
        "$match": {
            "Distance_type": "Long",
            "Dep_CityName": {"$regex": "New York", "$options": "i"},
            "Model": {"$nin": [None, ""]}
        }
    },
    # STAGE 2: Join with airports to get ICAO ident
    {
        "$lookup": {
            "from": "airports",
            "localField": "Dep_Airport",
            "foreignField": "iata_code",
            "as": "airport_info"
        }
    },
    # STAGE 3: Unwind airport info
    {
        "$unwind": {
            "path": "$airport_info",
            "preserveNullAndEmptyArrays": False
        }
    },
    # STAGE 4: Join with runways to check for displaced thresholds
    {
        "$lookup": {
            "from": "runways",
            "localField": "airport_info.ident",
            "foreignField": "airport_ident",
            "as": "runway_info"
        }
    },
    # STAGES 5-6: Keep flights whose departure airport has at least one runway
    # with a displaced threshold (low end or high end). Matching on the array
    # with $elemMatch instead of unwinding it keeps one document per flight,
    # so total_flights is not inflated by the number of matching runways.
    {
        "$match": {
            "runway_info": {
                "$elemMatch": {
                    "$or": [
                        {"le_displaced_threshold_ft": {"$gt": 0}},
                        {"he_displaced_threshold_ft": {"$gt": 0}}
                    ]
                }
            }
        }
    },
    # STAGE 7: Group by aircraft model and count occurrences
    {
        "$group": {
            "_id": {
                "manufacturer": "$Manufacturer",
                "model": "$Model"
            },
            "total_flights": {"$sum": 1},
            "airports": {"$addToSet": "$Dep_Airport"},
            "avg_flight_duration": {"$avg": "$Flight_Duration"},
            "avg_departure_delay": {"$avg": "$Dep_Delay"}
        }
    },
    # STAGE 8: Sort by total flights (descending)
    {
        "$sort": {"total_flights": -1}
    },
    # STAGE 9: Limit to top 10 aircraft models
    {
        "$limit": 10
    },
    # STAGE 10: Project final format
    {
        "$project": {
            "aircraft_model": {
                "manufacturer": "$_id.manufacturer",
                "model": "$_id.model"
            },
            "total_flights": 1,
            "airports_used": {"$size": "$airports"},
            "avg_flight_duration_minutes": {"$round": ["$avg_flight_duration", 2]},
            "avg_departure_delay_minutes": {"$round": ["$avg_departure_delay", 2]},
            "_id": 0
        }
    }
]

In [29]:
query_results, execution_time = query.execute_query(database.us_flights_2023, pipeline_query_9, "Most common aircraft models for long-distance flights from NYC with displaced thresholds")

Executing query: Most common aircraft models for long-distance flights from NYC with displaced thresholds
Query 'Most common aircraft models for long-distance flights from NYC with displaced thresholds' execution time: 0.0200 seconds
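
Stage 1 needs to exclude both `None` and the empty string for `Model`. In Python, a dict literal that repeats a key silently keeps only the last value, so two `$ne` entries collapse into one, whereas `$nin` expresses both exclusions under a single key. A minimal sketch in plain Python (no MongoDB connection needed; the variable names are illustrative):

```python
# A dict literal keeps only the last value for a repeated key,
# so the None condition is lost here.
duplicate_keys = {"$ne": None, "$ne": ""}

# $nin ("not in") carries both exclusions under one key.
both_excluded = {"$nin": [None, ""]}

print(duplicate_keys)
print(both_excluded)
```

The same consideration applies to any filter in the notebook that has to reject several values of one field at once.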


### **Query 10:** Airports with longest average flight duration for late aircraft delays in low pressure

Which airports have the longest average flight duration for flights delayed by a late-arriving aircraft (Delay_LastAircraft > 10 minutes) in low-pressure conditions (pres < 1000 hPa), counting only airports that have a non-empty home_link?

In [30]:
pipeline_query_10 = [
    # STAGE 1: Filter flights with late aircraft delays
    {
        "$match": {
            "Delay_LastAircraft": {"$gt": 10, "$ne": None},
            "Flight_Duration": {"$ne": None}
        }
    },
    # STAGE 2: Join with weather data for pressure conditions
    {
        "$lookup": {
            "from": "weather_meteo_by_airport",
            "localField": "Dep_Airport",
            "foreignField": "airport_id",
            "as": "weather_info"
        }
    },
    # STAGE 3: Unwind weather info
    {
        "$unwind": {
            "path": "$weather_info",
            "preserveNullAndEmptyArrays": False
        }
    },
    # STAGE 4: Filter for low pressure conditions
    {
        "$match": {
            "weather_info.pres": {"$lt": 1000, "$ne": None}
        }
    },
    # STAGE 5: Join with airports to check for home_link
    {
        "$lookup": {
            "from": "airports",
            "localField": "Dep_Airport",
            "foreignField": "iata_code",
            "as": "airport_info"
        }
    },
    # STAGE 6: Unwind airport info
    {
        "$unwind": {
            "path": "$airport_info",
            "preserveNullAndEmptyArrays": False
        }
    },
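    # STAGE 7: Filter airports with home_link
    {
        "$match": {
            # $nin excludes both null and empty strings; repeating "$ne" in a
            # dict literal would keep only the last condition.
            "airport_info.home_link": {"$nin": [None, ""]}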
        }
    },
    # STAGE 8: Group by airport and calculate statistics
    {
        "$group": {
            "_id": "$Dep_Airport",
            "avg_flight_duration": {"$avg": "$Flight_Duration"},
            "total_flights": {"$sum": 1},
            "avg_late_aircraft_delay": {"$avg": "$Delay_LastAircraft"},
            "avg_pressure": {"$avg": "$weather_info.pres"},
            "airport_name": {"$first": "$airport_info.name"},
            "home_link": {"$first": "$airport_info.home_link"},
            "city": {"$first": "$Dep_CityName"}
        }
    },
    # STAGE 9: Sort by average flight duration (descending)
    {
        "$sort": {"avg_flight_duration": -1}
    },
    # STAGE 10: Limit to top 15 results
    {
        "$limit": 15
    },
    # STAGE 11: Project final format
    {
        "$project": {
            "airport_code": "$_id",
            "airport_name": 1,
            "city": 1,
            "avg_flight_duration_hours": {
                "$round": [
                    {"$divide": ["$avg_flight_duration", 60]}, 
                    2
                ]
            },
            "avg_late_aircraft_delay_minutes": {"$round": ["$avg_late_aircraft_delay", 2]},
            "avg_pressure_hpa": {"$round": ["$avg_pressure", 2]},
            "total_flights_analyzed": 1,
            "home_link": 1,
            "_id": 0
        }
    }
]

In [31]:
query_results, execution_time = query.execute_query(database.us_flights_2023, pipeline_query_10, "Airports with longest average flight duration for late aircraft delays in low pressure")

Executing query: Airports with longest average flight duration for late aircraft delays in low pressure
Query 'Airports with longest average flight duration for late aircraft delays in low pressure' execution time: 0.2885 seconds
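
Stage 2 joins weather records on `airport_id` only. If the weather collection holds more than one record per airport (for example, one per day), the following `$unwind` pairs each flight with every one of them, so a flight can pass the low-pressure filter because of a different day and can be counted more than once. The sketch below imitates that behavior in plain Python on made-up records. The `date` field and the helper names are assumptions for illustration, not fields confirmed by the dataset.

```python
# Made-up records; "flight" and "date" are illustrative field names.
flights = [
    {"flight": "F1", "Dep_Airport": "JFK", "date": "2023-01-01", "Flight_Duration": 300},
    {"flight": "F2", "Dep_Airport": "JFK", "date": "2023-01-02", "Flight_Duration": 100},
]
weather = [
    {"airport_id": "JFK", "date": "2023-01-01", "pres": 995.0},
    {"airport_id": "JFK", "date": "2023-01-02", "pres": 1012.0},
    {"airport_id": "JFK", "date": "2023-01-03", "pres": 990.0},
]


def join_by_airport(flights, weather):
    """Imitate $lookup on the airport code alone, followed by $unwind."""
    return [
        {**f, "weather_info": w}
        for f in flights
        for w in weather
        if w["airport_id"] == f["Dep_Airport"]
    ]


def join_by_airport_and_date(flights, weather):
    """Imitate a join that also requires the weather record's date to match."""
    return [
        {**f, "weather_info": w}
        for f in flights
        for w in weather
        if w["airport_id"] == f["Dep_Airport"] and w["date"] == f["date"]
    ]


def low_pressure(rows, limit=1000):
    """Keep joined rows whose weather record is below the pressure limit."""
    return [r for r in rows if r["weather_info"]["pres"] < limit]


low_airport_only = low_pressure(join_by_airport(flights, weather))
low_same_day = low_pressure(join_by_airport_and_date(flights, weather))

print(len(low_airport_only), "rows when joining on airport only")
print(len(low_same_day), "rows when joining on airport and date")
```

In this toy data, the airport-only join counts both flights twice, including F2, whose own day was not a low-pressure day. Whether the real collections behave this way depends on how many weather records exist per airport.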


## **Section 5:** Performance optimization
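
One common starting point, offered here as a sketch rather than as the notebook's own optimization, is to index the fields that the `$lookup` stages in Queries 9 and 10 join on (`airports.iata_code`, `runways.airport_ident`, `weather_meteo_by_airport.airport_id`). The helper name `ensure_lookup_indexes` is hypothetical; it relies only on PyMongo's `Collection.create_index`, and any benefit should be checked against the measured execution times.

```python
# Hypothetical helper, not part of the notebook's utils/query modules.
# Each pair is (collection name, field used as foreignField in a $lookup).
LOOKUP_INDEXES = [
    ("airports", "iata_code"),
    ("runways", "airport_ident"),
    ("weather_meteo_by_airport", "airport_id"),
]


def ensure_lookup_indexes(database, specs=LOOKUP_INDEXES):
    """Create a single-field ascending index for each (collection, field) pair.

    `database` is expected to behave like a pymongo Database, i.e. support
    database[collection_name].create_index(field).
    """
    created = []
    for collection_name, field in specs:
        # create_index returns the name of the index it created or found.
        created.append(database[collection_name].create_index(field))
    return created
```

With the notebook's connection this would be called as `ensure_lookup_indexes(database)`, after which the queries can be re-run to compare timings. Calling it again with the same specifications does not create duplicate indexes.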

## **Section 6:** Visualization