## Exploratory analysis of the US Airport Dataset

This dataset covers five years (1995, 2000, 2005, 2010 and 2015) of routes between various US airports, along with the carriers operating those routes.

Let's see what we can make of this!

In [None]:
import networkx as nx
import pandas as pd
air_data = pd.read_csv('dump_output.csv')

We create five graphs, one for each year. It may take a couple of minutes to load all of them.

In [None]:
air_graph_1995 = air_data[air_data.YEAR == 1995]
G_1995 = nx.DiGraph()
for row in air_graph_1995.iterrows():
    G_1995.add_edge(row[1].ORIGIN, row[1].DEST)

In [None]:
air_graph_2000 = air_data[air_data.YEAR == 2000]
G_2000 = nx.DiGraph()
for row in air_graph_2000.iterrows():
    G_2000.add_edge(row[1].ORIGIN, row[1].DEST)

In [None]:
air_graph_2005 = air_data[air_data.YEAR == 2005]
G_2005 = nx.DiGraph()
for row in air_graph_2005.iterrows():
    G_2005.add_edge(row[1].ORIGIN, row[1].DEST)

In [None]:
air_graph_2010 = air_data[air_data.YEAR == 2010]
G_2010 = nx.DiGraph()
for row in air_graph_2010.iterrows():
    G_2010.add_edge(row[1].ORIGIN, row[1].DEST)

In [None]:
air_graph_2015 = air_data[air_data.YEAR == 2015]
G_2015 = nx.DiGraph()
for row in air_graph_2015.iterrows():
    G_2015.add_edge(row[1].ORIGIN, row[1].DEST)
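
The five cells above differ only in the year, so they could be collapsed into a single loop. The following is a minimal sketch, assuming the same `YEAR`, `ORIGIN` and `DEST` columns; the small `toy_air_data` frame and the `graphs` dictionary are made up for illustration and stand in for `air_data` and the `G_<year>` variables.

```python
import networkx as nx
import pandas as pd

# Hypothetical stand-in for air_data, with the same three columns used above
toy_air_data = pd.DataFrame({
    'YEAR':   [1995, 1995, 2015, 2015, 2015],
    'ORIGIN': ['ANC', 'SEA', 'ANC', 'SEA', 'SEA'],
    'DEST':   ['SEA', 'ANC', 'SEA', 'JFK', 'JFK'],
})

# One directed graph per year, keyed by year
graphs = {}
for year, rows in toy_air_data.groupby('YEAR'):
    # from_pandas_edgelist adds one edge per row; repeated routes collapse
    # into a single edge because DiGraph does not keep parallel edges
    graphs[year] = nx.from_pandas_edgelist(
        rows, source='ORIGIN', target='DEST', create_using=nx.DiGraph)

print({year: G.number_of_edges() for year, G in graphs.items()})
```

With the real data, `graphs[2015]` would play the role of `G_2015`.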

So let's have a look at the important nodes in this network, i.e. the important airports. We'll use PageRank, betweenness centrality and degree centrality.

In [None]:
sorted(nx.pagerank(G_2015).items(), key=lambda x:x[1], reverse=True)[0:10]
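
The same sort-and-slice pattern should carry over to the other two measures, since each NetworkX centrality function returns a dictionary mapping node to score. A small sketch on a made-up graph (the helper name `top_n` and the toy airports are assumptions for illustration, not part of the dataset):

```python
import networkx as nx

def top_n(scores, n=10):
    """Return the n highest-scoring (node, score) pairs, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]

# Toy directed network: 'HUB' exchanges routes with every other airport
toy = nx.DiGraph()
for airport in ['A', 'B', 'C', 'D']:
    toy.add_edge('HUB', airport)
    toy.add_edge(airport, 'HUB')

print(top_n(nx.betweenness_centrality(toy), 3))
print(top_n(nx.degree_centrality(toy), 3))
```

Note that for a directed graph `degree_centrality` counts both incoming and outgoing edges, so scores above 1 are possible.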

In [None]:
# Find the betweenness_centrality and sort the result to get the top 10 nodes


In [None]:
# Find the degree_centrality and sort the result to get the top 10 nodes


'ANC' is the airport code of Anchorage airport in Alaska, and according to PageRank and betweenness centrality it is the most important airport in this network. Isn't that weird? Thoughts?

Related blog post: https://toreopsahl.com/2011/08/12/why-anchorage-is-not-that-important-binary-ties-and-sample-selection/

Let's find the airlines that operate the most routes each year. Note that we are counting routes, not flights, because routes are what this dataset gives us.

In [None]:
# Top 5 carriers per year by number of route rows; columns are named explicitly
airline = [(i, air_data[air_data.YEAR == i].UNIQUE_CARRIER_NAME.value_counts()[0:5]
            .rename_axis('Airline').reset_index(name='count'))
           for i in [1995, 2000, 2005, 2010, 2015]]

In [None]:
# count of routes of top 5 airlines of each year
airline

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# dictionary mapping airline name -> list of (year, route count) pairs
d = {}
for i, top5 in airline:
    # positional column access, so this does not depend on the column names
    for name, count in zip(top5.iloc[:, 0], top5.iloc[:, 1]):
        d.setdefault(name, []).append((i, count))

plt.figure(num=None, figsize=(10,8), dpi=80, facecolor='w', edgecolor='k')
labels = []
for i, j in d.items():
    year = [k for k, l in j]
    per = [l for k,l in j]
    plt.plot(year, per)
    labels.append(i)
plt.legend(labels, ncol=4, loc='upper center', 
           bbox_to_anchor=[0.5, 1.1], 
           columnspacing=1.0, labelspacing=0.0,
           handletextpad=0.0, handlelength=1.5,
           fancybox=True, shadow=True)

#### Exercise

Find the hubs of Delta Air Lines Inc, United Air Lines Inc and Southwest Airlines Co. in this network by counting, for each airline, the routes that fly out of each airport.
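
One possible starting point, sketched on made-up rows rather than the real data: filter by carrier, then count how often each airport appears as an origin. The `toy_routes` frame and the `busiest_origins` helper are hypothetical; only the column names `UNIQUE_CARRIER_NAME` and `ORIGIN` come from the dataset.

```python
import pandas as pd

# Hypothetical rows with the same columns as air_data
toy_routes = pd.DataFrame({
    'UNIQUE_CARRIER_NAME': ['Delta Air Lines Inc'] * 4 + ['Southwest Airlines Co.'] * 3,
    'ORIGIN': ['ATL', 'ATL', 'ATL', 'DTW', 'DAL', 'DAL', 'MDW'],
    'DEST':   ['JFK', 'LAX', 'DTW', 'ATL', 'HOU', 'MDW', 'DAL'],
})

def busiest_origins(routes, carrier, n=3):
    """Count outgoing route rows per origin airport for one carrier."""
    own = routes[routes.UNIQUE_CARRIER_NAME == carrier]
    return own.ORIGIN.value_counts().head(n)

print(busiest_origins(toy_routes, 'Delta Air Lines Inc'))
```

On the real data you would probably also want to restrict this to a single year before counting.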

What is the average shortest path length of this network?

In [None]:
nx.average_shortest_path_length(G_2015)

A graph is connected if it has only one component. A directed graph is strongly connected if every node is reachable from every other node, and is weakly connected if the undirected copy of the graph is connected.
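
To make the difference between the two notions concrete, here is a tiny sketch on a made-up directed graph (the node names are arbitrary):

```python
import networkx as nx

# Toy digraph: A -> B -> C with no way back
toy = nx.DiGraph([('A', 'B'), ('B', 'C')])

# Ignoring edge direction, A, B and C form a single piece
print(nx.is_weakly_connected(toy))
# C cannot reach A, so the graph is not strongly connected
print(nx.is_strongly_connected(toy))

# Adding an airport with no routes breaks weak connectivity as well
toy.add_node('D')
print(nx.is_weakly_connected(toy))
print([sorted(c) for c in nx.weakly_connected_components(toy)])
```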

In [None]:
nx.is_weakly_connected(G_2015)

What does this mean? The airport dataset has components that are unreachable from each other. Well, that's a problem!

In [None]:
# list the connected components of the graph G_2015 (use to_undirected)


What are the components? A single airport and a component of two airports?

Let's remove these nodes from the network so that we can find the average shortest path length of this graph.

In [None]:
G_2015.remove_nodes_from(['AIK', 'SPB', 'SSB'])
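
Hard-coding the three airport codes works for this dataset. A more general sketch, assuming we simply want to keep the largest weakly connected component, could look like this (the toy graph and variable names are illustrative only):

```python
import networkx as nx

# Toy digraph: a 3-airport cycle, a separate pair, and a lone airport
toy = nx.DiGraph([('A', 'B'), ('B', 'C'), ('C', 'A'), ('X', 'Y')])
toy.add_node('Z')

# Largest weakly connected component, chosen by node count
largest = max(nx.weakly_connected_components(toy), key=len)

# Everything outside that component gets dropped
stray = [node for node in toy.nodes() if node not in largest]
toy.remove_nodes_from(stray)

print(sorted(stray))
print(nx.is_weakly_connected(toy))
print(nx.average_shortest_path_length(toy))
```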

In [None]:
# Find if the network is weakly connected


In [None]:
# Find the average shortest path length

In [None]:
# Plot the histogram of degree centrality


^ The degree distribution looks like what the Barabási-Albert (preferential attachment) model would produce.
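
For comparison, one can generate a synthetic Barabási-Albert graph and look at its degrees. This is a rough sketch with `n`, `m` and `seed` picked arbitrarily for illustration; it is not fitted to the airport data.

```python
from collections import Counter
import networkx as nx

# Synthetic preferential-attachment graph; n, m and seed are arbitrary choices
ba = nx.barabasi_albert_graph(n=1000, m=3, seed=42)

degrees = [deg for _, deg in ba.degree()]
counts = Counter(degrees)

# Most nodes end up with a degree close to m, while a few hubs collect far more links
print('max degree:', max(degrees))
print('most common degree:', counts.most_common(1)[0][0])
print('nodes with degree >= 20:', sum(1 for deg in degrees if deg >= 20))
```

A histogram of `nx.degree_centrality(ba).values()` can then be placed next to the one for `G_2015` to compare the shapes by eye.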

#### Exercise! (Actually this is a game :D)

How can we decrease the average shortest path length of this network?

Think of an effective way to add new edges to decrease the average shortest path length.
Let's see if we can come up with a nice way to do this. Whoever gets the largest decrease wins!!!

The rules are simple:
- You can't add more than 2% of the current edges (~500 edges)