# Flight delay time exploratory data analysis

**For this week's exercises, scroll down to "Part 2"**.

In [None]:
import numpy as np
import pandas as pd
import glob
import seaborn as sns
import networkx as nx

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

First we read in the input files. We can use the `glob` package with `*` as a wildcard to make a list of all the csv files, and then open and concatenate all the files in the list to get a single dataframe.

In [None]:
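# ignore_index=True gives the combined frame a fresh 0..n-1 index
# instead of repeating each file's own row numbers.
df = pd.concat(
    [pd.read_csv(f) for f in glob.glob("/kaggle/input/historical-flight-and-weather-data/*.csv")],
    ignore_index=True,
)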
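
To see the glob-and-concatenate pattern in isolation, here is a minimal sketch on two tiny CSV files written to a temporary folder. The file names and values are made up for illustration and are not taken from the Kaggle dataset.

```python
import glob
import os
import tempfile

import pandas as pd

# Build two tiny CSV files in a temporary directory (hypothetical stand-ins
# for the files in the Kaggle input folder).
tmp_dir = tempfile.mkdtemp()
pd.DataFrame({"carrier_code": ["AA", "DL"], "arrival_delay": [5, -3]}).to_csv(
    os.path.join(tmp_dir, "part_1.csv"), index=False
)
pd.DataFrame({"carrier_code": ["UA"], "arrival_delay": [42]}).to_csv(
    os.path.join(tmp_dir, "part_2.csv"), index=False
)

# sorted() makes the file order deterministic; glob gives no ordering guarantee.
paths = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))

# Read every file and stack the rows; ignore_index=True renumbers them 0..n-1.
toy_df = pd.concat([pd.read_csv(p) for p in paths], ignore_index=True)
print(toy_df)
```

Sorting the paths is optional, but it keeps the row order reproducible from run to run.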

Next, let's explore some basic characteristics of our data.

In [None]:
df.head()

In [None]:
df.dtypes

In [None]:
df.hist(figsize=(20,20)); # Tip: put a semicolon at the end of the line to avoid printing a bunch of text output.

In [None]:
df.shape

From the initial analysis above, we can see that we've got a dataset of about 5.5 million flights, with each record including information about the airline (`carrier_code`), origin and destination airport, date and time, and weather. This dataset is not well documented, but we'll assume that `*_x` corresponds to weather at the origin airport and `*_y` corresponds to weather at the destination airport. There is also information about flight delays and cancellations.

Our goal is always to do something useful. With this dataset, that could mean gaining insight into which conditions are related to delayed and canceled flights, and potentially predicting or avoiding those delays in the future. We will explore the dataset with that goal in mind.

First, we'll look into the frequency of delays and cancellations:

In [None]:
(df.arrival_delay > 0).sum() / df.shape[0]

In [None]:
(df.arrival_delay > 30).sum() / df.shape[0]

In [None]:
(df.arrival_delay > 60).sum() / df.shape[0]

In [None]:
(df.departure_delay > 0).sum() / df.shape[0]

In [None]:
((df.arrival_delay > 0) & (df.departure_delay > 0)).sum() / df.shape[0]

In [None]:
df.cancelled_code.value_counts()

In [None]:
(df.cancelled_code != "N").sum() / df.shape[0]

From the above, we can see that 34% of flight arrivals are delayed, 12% are delayed by more than 30 minutes, and 7% are delayed by more than one hour. (We're assuming the times are in minutes. Hopefully the benefit of having a well-documented dataset is apparent here.)

If we assume that a cancelled code of "N" means not cancelled, and everything else is cancelled, then about 1.5% of flights are cancelled.
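
The threshold calculations above can also be written compactly. A minimal sketch on made-up delay values (not taken from the dataset), assuming, as we did above, that delays are in minutes:

```python
import pandas as pd

# Toy arrival delays in minutes (made-up values for illustration only).
toy = pd.DataFrame({"arrival_delay": [-10, 0, 5, 20, 35, 45, 70, 90, -2, 15]})

# The mean of a boolean Series is the fraction of True values, so this is
# equivalent to (mask).sum() / len(toy) for each threshold.
shares = pd.Series(
    {f"> {t} min": (toy.arrival_delay > t).mean() for t in (0, 30, 60)}
)
print(shares)
```

One caveat: a comparison against a missing value evaluates to `False`, so with either form any flight with a missing delay is counted as "not delayed" while still appearing in the denominator.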

We can start out by looking at how conditions were different for flights that were canceled compared to other flights. One way to do this is to create two sets of histograms:

In [None]:
df_cancel = df[df.cancelled_code != "N"]
df_cancel.hist(figsize=(20,20)); 

In [None]:
df_nocancel = df[df.cancelled_code == "N"]
df_nocancel.hist(figsize=(20,20)); 

One insight this gives us is that the maximum wind speed for non-canceled flights appears much higher than the maximum wind speed for flights that were canceled. We can investigate this further:

In [None]:
print(df_cancel.HourlyWindSpeed_x.mean(), df_cancel.HourlyWindSpeed_x.median(), df_cancel.HourlyWindSpeed_x.max())
print(df_nocancel.HourlyWindSpeed_x.mean(), df_nocancel.HourlyWindSpeed_x.median(), df_nocancel.HourlyWindSpeed_x.max())
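
When comparing maxima like this, it helps to keep the group sizes in view: a much larger group will tend to contain more extreme values simply because it has more observations. A minimal sketch of a grouped summary, using made-up wind speeds and made-up cancellation codes ("A" and "B" are placeholders, not documented codes):

```python
import pandas as pd

# Toy data for illustration only; values and non-"N" codes are invented.
toy = pd.DataFrame({
    "cancelled_code": ["N", "N", "N", "N", "A", "B"],
    "HourlyWindSpeed_x": [5.0, 8.0, 12.0, 40.0, 10.0, 14.0],
})

# Assumption carried over from the text: "N" means the flight was not cancelled.
toy["cancelled"] = toy.cancelled_code != "N"

# One row per group, with the group size next to the summary statistics.
summary = toy.groupby("cancelled")["HourlyWindSpeed_x"].agg(
    ["count", "mean", "median", "max"]
)
print(summary)
```

Putting `count` next to `max` makes it easier to judge whether a difference in maxima reflects the conditions or just the sample sizes.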

## Part 2: Network analysis

Last week, we started an exploratory analysis of this dataset, treating it as tabular data. However, there is also a graph or network aspect of this dataset: it's a "transportation network". This week, we will explore that aspect.

First, let's calculate the number of flights on each "route", where a route is a unique pair of origin and destination airports:

In [None]:
num_flights = df.groupby(by=["origin_airport", "destination_airport"]).size()

num_flights.head()

Next, let's create a directed graph of the different routes.

In [None]:
pd.DataFrame(num_flights.reset_index()).dtypes

In [None]:
num_flights = num_flights.reset_index()

num_flights.columns = ['origin_airport','destination_airport','num_flights']

g = nx.DiGraph()

for _, edge in num_flights.iterrows():
    g.add_edge(edge['origin_airport'], edge['destination_airport'], weight=edge['num_flights'])
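
As an alternative to looping over rows, NetworkX can build the graph directly from an edge-list data frame. A minimal sketch on a handful of made-up flights (the airport codes here are just illustrative values):

```python
import networkx as nx
import pandas as pd

# Toy flight records for illustration only.
toy_flights = pd.DataFrame({
    "origin_airport": ["JFK", "JFK", "LAX", "ORD", "JFK"],
    "destination_airport": ["LAX", "LAX", "JFK", "JFK", "ORD"],
})

# Count flights per (origin, destination) pair; name the count column "weight"
# so that it becomes the edge attribute of the same name.
routes = (
    toy_flights.groupby(["origin_airport", "destination_airport"])
    .size()
    .reset_index(name="weight")
)

# Build a directed graph with one edge per route.
toy_g = nx.from_pandas_edgelist(
    routes,
    source="origin_airport",
    target="destination_airport",
    edge_attr="weight",
    create_using=nx.DiGraph,
)
print(toy_g.edges(data=True))
```

This should produce the same kind of graph as the explicit loop, and it avoids `iterrows`, which can be slow on large frames.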


We can make a plot of the graph:

In [None]:
nx.draw(g)

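Next, let's calculate the degree centrality and (weighted) betweenness centrality of each airport and create a data frame that includes the columns `airport`, `deg_cen`, and `bet_cen`. Keep in mind that NetworkX interprets edge weights as distances when it computes shortest paths, so weighting by the number of flights treats busy routes as "long" ones: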

In [None]:
deg_cen = nx.degree_centrality(g)

df_deg_cen = pd.DataFrame(deg_cen.items())
df_deg_cen.columns = ["airport", "deg_cen"]

df_deg_cen.head()

In [None]:
bet_cen = nx.betweenness_centrality(g, weight="weight")

df_bet_cen = pd.DataFrame(bet_cen.items())
df_bet_cen.columns = ["airport", "bet_cen"]

df_bet_cen.head()
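
Because `betweenness_centrality` uses the `weight` attribute as a distance for its shortest paths, the choice of weight changes which airports look "central". The sketch below compares flight counts against their inverse on a three-node toy graph. The node names, the counts, and the `inv_flights` attribute are all made up for illustration, and inverting the count is only one possible choice, not necessarily the right one for this dataset.

```python
import networkx as nx

toy_g = nx.DiGraph()
# Hypothetical routes: A->B and B->C are busy, the direct A->C route is rare.
toy_g.add_edge("A", "B", num_flights=100)
toy_g.add_edge("B", "C", num_flights=100)
toy_g.add_edge("A", "C", num_flights=1)

# Derive a distance-like attribute so that busier routes become "shorter".
for _, _, data in toy_g.edges(data=True):
    data["inv_flights"] = 1.0 / data["num_flights"]

# Flight count as distance: the rare direct route A->C looks shortest.
by_count = nx.betweenness_centrality(toy_g, weight="num_flights")
# Inverse count as distance: the busy path through B looks shortest.
by_inverse = nx.betweenness_centrality(toy_g, weight="inv_flights")

print("count as weight:  ", by_count)
print("inverse as weight:", by_inverse)
```

With counts as weights, the hub `B` lies on no shortest path between other airports; with inverse counts, the busy path through `B` becomes the shortest one from `A` to `C`, so `B` gets a higher score.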

In [None]:
df_bet_cen.set_index("airport", inplace=True)
df_deg_cen.set_index("airport", inplace=True)

# Join on the shared "airport" index; this builds a new frame instead of
# adding a column to df_bet_cen in place.
net_stats = df_bet_cen.join(df_deg_cen)

net_stats.head()


Now, let's add our network statistics for each airport to the data frame of flights:

In [None]:
net_stats.reset_index(inplace=True)

df_net_stats = df.merge(net_stats, left_on="origin_airport", right_on="airport")

df_net_stats["origin_bet_cen"] = df_net_stats["bet_cen"]
df_net_stats["origin_deg_cen"] = df_net_stats["deg_cen"]
df_net_stats.drop(["airport", "deg_cen", "bet_cen"], inplace=True, axis=1)

df_net_stats.head()

In [None]:
df_net_stats = df_net_stats.merge(net_stats, left_on="destination_airport", right_on="airport")

df_net_stats["destination_bet_cen"] = df_net_stats["bet_cen"]
df_net_stats["destination_deg_cen"] = df_net_stats["deg_cen"]
df_net_stats.drop(["airport", "deg_cen", "bet_cen"], inplace=True, axis=1)

df_net_stats.head()
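
The merge, rename, and drop steps above can also be expressed as a lookup with `Series.map`. A minimal sketch on toy data, where the centrality values are invented and `toy_stats` is a hypothetical stand-in for a statistics table indexed by airport:

```python
import pandas as pd

# Toy flight records for illustration only.
toy_flights = pd.DataFrame({
    "origin_airport": ["JFK", "LAX", "ORD"],
    "destination_airport": ["LAX", "JFK", "JFK"],
    "arrival_delay": [12, -4, 30],
})

# Made-up centrality values, indexed by airport code.
toy_stats = pd.DataFrame(
    {"bet_cen": [0.5, 0.0, 0.0], "deg_cen": [2.0, 1.0, 1.0]},
    index=pd.Index(["JFK", "LAX", "ORD"], name="airport"),
)

# For each end of the flight, look up each statistic by airport code.
for role in ("origin", "destination"):
    for stat in ("bet_cen", "deg_cen"):
        toy_flights[f"{role}_{stat}"] = toy_flights[f"{role}_airport"].map(
            toy_stats[stat]
        )

print(toy_flights)
```

Unlike the default inner merge, `map` keeps every flight row and leaves `NaN` where an airport has no entry in the statistics table; it also preserves the original row order.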

Finally, let's calculate the correlations of our network statistics with our "arrival delay" dependent variable:

In [None]:
df_net_stats[["arrival_delay", "destination_bet_cen","destination_deg_cen", "origin_bet_cen","origin_deg_cen"]].corr()

Can you conclude anything from these correlations?