# Phase II: Data Curation, Exploratory Analysis and Plotting
We plan to use the [2015 Flight Delays and Cancellations dataset](https://www.kaggle.com/datasets/usdot/flight-delays), published on Kaggle by the U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics. It contains airline data, airport data, and detailed flight data from 2015.

Using these data, we want to determine:
1. What causes or correlations (time of travel, weather, etc.) can we link to delays and cancellations? Can we use them to predict or reduce delays and cancellations?
2. What causes or correlations can we link to accidents? How can we use these to improve air travel safety?

## Data Collection

This dataset can be downloaded from Kaggle using the [Kaggle Hub Client Library](https://github.com/Kaggle/kagglehub), which returns a local path to the files so they can be loaded into pandas DataFrames.

In [None]:
import kagglehub
path = kagglehub.dataset_download("usdot/flight-delays")

In [None]:
import pandas as pd

airlines_df = pd.read_csv(path + "/airlines.csv", low_memory=False, encoding='UTF-8')
airports_df = pd.read_csv(path + "/airports.csv", low_memory=False, encoding='UTF-8')
flights_df = pd.read_csv(path + "/flights.csv", low_memory=False, encoding='UTF-8')

## Data Cleanup

We merge the three DataFrames into one. First, we strip the file name prefixes from the column names. Next, we join the airports DataFrame onto the flights table twice, once for the origin airport and once for the destination airport, and then join the airlines table onto the result. Lastly, we create a datetime column from the existing date columns.
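
To make the double airport join concrete, here is a minimal sketch on toy data. The two small tables and their values are made up for illustration; only the column names mirror the real dataset. It also shows one way to count flights whose airport code finds no match, which is worth checking after any left join.

```python
import pandas as pd

# Toy stand-ins for the real tables; the values are made up for illustration.
airports = pd.DataFrame({
    "IATA_CODE": ["ATL", "LAX"],
    "CITY": ["Atlanta", "Los Angeles"],
})
flights = pd.DataFrame({
    "ORIGIN_AIRPORT": ["ATL", "LAX", "XXX"],  # "XXX" has no match in airports
    "DESTINATION_AIRPORT": ["LAX", "ATL", "ATL"],
})

# Prefix the lookup columns so origin and destination fields do not collide.
origin = airports.add_prefix("origin_airport/").set_index("origin_airport/IATA_CODE")
destination = airports.add_prefix("destination_airport/").set_index("destination_airport/IATA_CODE")

# DataFrame.join defaults to a left join, so every flight row is kept.
joined = flights.join(origin, on="ORIGIN_AIRPORT").join(destination, on="DESTINATION_AIRPORT")

# Rows whose key found no airport carry NaN in the joined columns.
unmatched = joined["origin_airport/CITY"].isna().sum()
print(joined)
print("flights with no matching origin airport:", unmatched)
```

Because the join keeps unmatched flights rather than dropping them, a nonzero count here points to codes that need cleaning before any per-airport analysis.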

In [None]:
# Remove file name prefix from column names
airlines_df.columns = airlines_df.columns.str.replace(r'^airlines\.csv/', '', regex=True)
airlines_df.rename(columns={'AIRLINE': 'AIRLINE NAME'}, inplace=True)

airports_df.columns = airports_df.columns.str.replace(r'^airports\.csv/', '', regex=True)

flights_df.columns = flights_df.columns.str.replace(r'^flights\.csv/', '', regex=True)

In [None]:
# Join airports data to flights table
origin_airports = airports_df.add_prefix('origin_airport/')
destination_airports = airports_df.add_prefix('destination_airport/')

flights_df = flights_df.join(origin_airports.set_index('origin_airport/IATA_CODE'), on='ORIGIN_AIRPORT')
flights_df = flights_df.join(destination_airports.set_index('destination_airport/IATA_CODE'), on='DESTINATION_AIRPORT')

In [None]:
# Join airlines data to flights table
flights_df = flights_df.join(airlines_df.set_index('IATA_CODE'), on='AIRLINE')

In [None]:
# Create DATE column from YEAR, MONTH, DAY, and SCHEDULED_DEPARTURE columns
flights_df['DATE'] = pd.to_datetime(
    flights_df['YEAR'].astype(str) + '-' +
    flights_df['MONTH'].astype(str) + '-' +
    flights_df['DAY'].astype(str) + ' ' +
    flights_df['SCHEDULED_DEPARTURE'].astype(str).str.zfill(4),
    format='%Y-%m-%d %H%M'
)
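
As a cross-check on the string-based construction above, the following sketch builds the same kind of timestamp arithmetically on toy rows. It is an alternative technique, not the approach used in this notebook, and it assumes `SCHEDULED_DEPARTURE` holds integers in HHMM form (so `5` means 00:05). The `toy` DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Toy rows; SCHEDULED_DEPARTURE is assumed to be an integer in HHMM form.
toy = pd.DataFrame({
    "YEAR": [2015, 2015],
    "MONTH": [1, 12],
    "DAY": [1, 31],
    "SCHEDULED_DEPARTURE": [5, 2359],
})

# Zero-pad the time to four digits, then split it into hours and minutes.
hhmm = toy["SCHEDULED_DEPARTURE"].astype(str).str.zfill(4)
hours = pd.to_timedelta(hhmm.str[:2].astype(int), unit="h")
minutes = pd.to_timedelta(hhmm.str[2:].astype(int), unit="min")

# pd.to_datetime can assemble dates from columns named year, month, and day.
dates = pd.to_datetime(toy[["YEAR", "MONTH", "DAY"]].rename(columns=str.lower))
toy["DATE"] = dates + hours + minutes
print(toy[["SCHEDULED_DEPARTURE", "DATE"]])
```

Comparing a few rows of this result against the `DATE` column is a quick way to confirm that the zero-padding handles early-morning departures correctly.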

In [None]:
flights_df.head()

## Data Visualization

In [None]:
import altair as alt
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

alt.data_transformers.disable_max_rows()