# Phase II: Data Curation, Exploratory Analysis and Plotting
Central Motivation:

We plan on using the [U.S. Department of Transportation (DOT) Bureau of Transportation Statistics 2015 Flight Delays and Cancellations dataset from Kaggle](https://www.kaggle.com/datasets/usdot/flight-delays), which contains airline, airport, and detailed flight data from 2015. As college students who came to Boston from across the country, we have all had to become frequent flyers. On top of this, roughly [25 million people report anxiety around flying (2024)](https://www.stratosjets.com/blog/fear-of-flying-statistics-trends-facts/), and recent airline news may have increased this number. We want to see how we can use the airline data to find trends and make predictions that may calm some nerves, or simply improve the overall flying experience.

Using these data, we want to determine:
1. What causes or correlations (time of travel, weather, etc.) can we link to delays and cancellations? Can we predict them or help reduce them?
2. What causes or correlations can we link to accidents? How can we use these to improve air travel safety?

In [None]:
%pip install --upgrade kagglehub pandas altair seaborn matplotlib plotly geopandas shapely contextily

## Data Collection

This dataset can be imported from Kaggle using the [Kaggle Hub Client Library](https://github.com/Kaggle/kagglehub), which makes it easy to download the data and load it into pandas DataFrames.

In [None]:
import kagglehub
path = kagglehub.dataset_download("usdot/flight-delays")

In [None]:
import pandas as pd

airlines_df = pd.read_csv(path + "/airlines.csv", low_memory=False, encoding='UTF-8')
airports_df = pd.read_csv(path + "/airports.csv", low_memory=False, encoding='UTF-8')
flights_df = pd.read_csv(path + "/flights.csv", low_memory=False, encoding='UTF-8')

## Data Cleanup

This section merges the three DataFrames. First, we strip the file name prefixes from the column names. Next, we add airport information for both the origin and the destination airport, which means joining the airports DataFrame to the flights table twice. We then join the airlines table to the flights table. Lastly, we create a datetime column from the existing date and scheduled departure time columns.

In [None]:
# Remove file name prefix from column names
airlines_df.columns = airlines_df.columns.str.replace(r'^airlines\.csv/', '', regex=True)
airlines_df.rename(columns={'AIRLINE': 'AIRLINE NAME'}, inplace=True)

airports_df.columns = airports_df.columns.str.replace(r'^airports\.csv/', '', regex=True)

flights_df.columns = flights_df.columns.str.replace(r'^flights\.csv/', '', regex=True)

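To make the prefix-stripping step concrete, here is a minimal sketch on a toy frame. The column names are hypothetical stand-ins for the pattern the cell above targets. A column without the prefix is left unchanged, so the replacement is safe to run even if the raw files have no prefix.

```python
import pandas as pd

# Toy frame with hypothetical column names: one prefixed, one already clean
toy = pd.DataFrame(columns=['airlines.csv/IATA_CODE', 'AIRLINE'])

# '^' anchors the match at the start of the name; '\.' matches a literal dot
toy.columns = toy.columns.str.replace(r'^airlines\.csv/', '', regex=True)

cleaned_columns = list(toy.columns)
print(cleaned_columns)
```
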
In [None]:
# Join airports data to flights table
origin_airports = airports_df.add_prefix('origin_airport/')
destination_airports = airports_df.add_prefix('destination_airport/')

flights_df = flights_df.join(origin_airports.set_index('origin_airport/IATA_CODE'), on='ORIGIN_AIRPORT')
flights_df = flights_df.join(destination_airports.set_index('destination_airport/IATA_CODE'), on='DESTINATION_AIRPORT')

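A left join like the one above keeps every flight, but any airport code with no entry in the lookup table ends up with NaN in the joined columns. The sketch below shows one way to list such codes. It uses toy data: the code `XXX` and the airport names are made up for illustration, and we make no claim here about which codes, if any, fail to match in the real dataset.

```python
import pandas as pd

# Toy data; 'XXX' is a made-up code with no entry in the lookup table
toy_flights = pd.DataFrame({'ORIGIN_AIRPORT': ['BOS', 'JFK', 'XXX', 'BOS']})
toy_airports = pd.DataFrame({
    'IATA_CODE': ['BOS', 'JFK'],
    'AIRPORT': ['Placeholder Airport 1', 'Placeholder Airport 2'],
})

# Same pattern as above: prefix the lookup columns, then left-join on the code
origin = toy_airports.add_prefix('origin_airport/')
joined = toy_flights.join(origin.set_index('origin_airport/IATA_CODE'), on='ORIGIN_AIRPORT')

# Rows whose code found no match carry NaN in the joined columns
unmatched = joined[joined['origin_airport/AIRPORT'].isna()]
unmatched_codes = unmatched['ORIGIN_AIRPORT'].value_counts()
print(unmatched_codes)
```

Running the same `isna()` filter on `flights_df` after the joins would show how many flights would be silently dropped from any later grouping on the airport columns.
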
In [None]:
# Join airlines data to flights table
flights_df = flights_df.join(airlines_df.set_index('IATA_CODE'), on='AIRLINE')

In [None]:
# Create DATE column from YEAR, MONTH, DAY, and SCHEDULED_DEPARTURE columns
flights_df['DATE'] = pd.to_datetime(
    flights_df['YEAR'].astype(str) + '-' +
    flights_df['MONTH'].astype(str) + '-' +
    flights_df['DAY'].astype(str) + ' ' +
    flights_df['SCHEDULED_DEPARTURE'].astype(str).str.zfill(4),
    format='%Y-%m-%d %H%M'
)

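The `zfill(4)` call in the cell above matters because the scheduled departure time is stored as an integer, so 00:05 appears as `5`. A small sketch on toy rows (the values are illustrative, not taken from the dataset) shows the padding and the resulting timestamps:

```python
import pandas as pd

toy = pd.DataFrame({
    'YEAR': [2015, 2015],
    'MONTH': [1, 12],
    'DAY': [1, 31],
    'SCHEDULED_DEPARTURE': [5, 2359],  # 00:05 and 23:59 stored as integers
})

# Pad to four digits so that '5' becomes '0005', which %H%M reads as 00:05
time_str = toy['SCHEDULED_DEPARTURE'].astype(str).str.zfill(4)

toy['DATE'] = pd.to_datetime(
    toy['YEAR'].astype(str) + '-' +
    toy['MONTH'].astype(str) + '-' +
    toy['DAY'].astype(str) + ' ' + time_str,
    format='%Y-%m-%d %H%M'
)
print(toy['DATE'])
```
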
## Data Visualization

In [None]:
import altair as alt
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

alt.data_transformers.disable_max_rows()

In [None]:
"""
Visualizes the geographical distribution of flight cancellations across airports in the USA.
- Filters flight data to include only cancelled flights
- Groups cancellations by origin airport with their geographical coordinates
- Creates an interactive bubble map where bubble size represents the number of cancellations
- Includes hover information showing airport name, code, and cancellation count
- Reference: https://plotly.com/python/bubble-maps/
"""

# Filter for cancelled flights
cancelled_flights = flights_df[flights_df['CANCELLED'] == 1]

# Group by origin airport and count cancellations
cancellations_by_airport = cancelled_flights.groupby(['ORIGIN_AIRPORT', 
                                                     'origin_airport/AIRPORT',
                                                     'origin_airport/LATITUDE', 
                                                     'origin_airport/LONGITUDE']).size().reset_index(name='cancellation_count')

# Create the bubble map
fig = px.scatter_geo(cancellations_by_airport,
                     lat='origin_airport/LATITUDE',
                     lon='origin_airport/LONGITUDE',
                     size='cancellation_count',
                     hover_name='origin_airport/AIRPORT',
                     hover_data=['ORIGIN_AIRPORT', 'cancellation_count'],
                     scope='usa',
                     title='Cancelled Flights by Origin Airport')

# Customize the map appearance
fig.update_layout(
    title_font_size=20,
    title_x=0.5,
    width=1000,  # Set width in pixels
    height=750,  # Set height in pixels
    geo=dict(
        landcolor='rgb(217, 217, 217)',
        showlakes=True,
        lakecolor='rgb(255, 255, 255)',
    )
)

# Display the map
fig.show()

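The airline chart in the next cell takes the mean of the `CANCELLED` column as the cancellation ratio. This works because `CANCELLED` is a 0/1 flag, so its mean equals cancelled flights divided by total flights. A minimal check on toy data (the airline names are placeholders) compares the two calculations:

```python
import pandas as pd

toy = pd.DataFrame({
    'AIRLINE NAME': ['Airline A'] * 4 + ['Airline B'] * 2,
    'CANCELLED': [1, 0, 0, 0, 1, 1],
})

# Sum counts the cancelled flights, count gives the total, mean gives the ratio
by_airline = toy.groupby('AIRLINE NAME')['CANCELLED'].agg(['sum', 'count', 'mean'])
by_airline['manual_ratio'] = by_airline['sum'] / by_airline['count']
print(by_airline)
```
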
In [None]:
"""
Analyzes and visualizes the flight cancellation rates by airline.
- Calculates the cancellation ratio (cancelled flights divided by total flights) for each airline
- Sorts airlines from highest to lowest cancellation ratio
- Creates a bar chart showing these ratios with proper formatting and labels
"""

agg_flights = flights_df.groupby('AIRLINE NAME').agg(cancel_ratio=('CANCELLED', 'mean')).reset_index()
agg_flights = agg_flights.sort_values(by='cancel_ratio', ascending=False)

plt.figure(figsize=(14, 8))
sns.barplot(data=agg_flights, x='AIRLINE NAME', y='cancel_ratio')
plt.xlabel("Airline")
plt.ylabel("Cancellation Ratio")
plt.title("Cancellation Ratio by Airline (Sorted by Highest Ratio)")
plt.xticks(rotation=45, ha='right')
plt.ylim(0, agg_flights['cancel_ratio'].max() * 1.1)
plt.tight_layout()
plt.show()