# Using RAPIDS cuGraph and cuSpatial to analyze airport and flight data
## Intro
We have airports and flights datasets.  We have cuGraph and cuSpatial.  What craziness can we get up to here?

We're going to use cuGraph and cuSpatial to answer these questions of our data:
1. Which airport is the most trafficked airport in our dataset?
1. What is the max number of plane rides (hops) you need to take to get from the most trafficked airport to any other airport in our dataset?
1. How many hops do you need to take to get from the most trafficked airport to one of the least trafficked airports?
1. How far is that distance really?
1. What is the topology of our airport network, based on our dataset and distance from one another?

Note: The Airports data in this toy dataset is using hashed identifiers. In the beginning, this may throw you for a loop, but by the end of the notebook everything will be clearer.

## Imports and Data Gathering/Prep

In [None]:
import pandas as pd
import numpy as np
import cuspatial, cugraph, cudf, cuml

In [None]:
!wget https://raw.githubusercontent.com/rapidsai/cuDataShader/master/cudatashader-notebooks/data/airports.csv
!wget https://raw.githubusercontent.com/rapidsai/cuDataShader/master/cudatashader-notebooks/data/flights.csv

In [None]:
data_dir = './'
fdf = cudf.read_csv(data_dir+'flights.csv')
adf = cudf.read_csv(data_dir+'airports.csv')

In [None]:
fdf.head()

In [None]:
fdf.dtypes

### Prep
Since we'll be using cuGraph, which uses int32, and the above dtypes are int64, we recast each Series:

In [None]:
fdf['ORIGIN_AIRPORT_ID'] = fdf['ORIGIN_AIRPORT_ID'].astype(np.int32)
fdf['DEST_AIRPORT_ID'] = fdf['DEST_AIRPORT_ID'].astype(np.int32)
fdf['PASSENGERS'] = fdf['PASSENGERS'].astype(np.int32)
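
One thing to watch when downcasting: a value too big for int32 won't survive the cast intact. Here's a minimal CPU-only sketch of a range check, using plain Python lists in place of cuDF Series. `fits_int32` and the sample IDs are made up for illustration; they aren't part of cuDF or cuGraph.

```python
# Hypothetical helper (not part of cuDF/cuGraph): check that every
# value fits in a signed 32-bit integer before downcasting.
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

def fits_int32(values):
    """Return True if all values lie within the int32 range."""
    return all(INT32_MIN <= v <= INT32_MAX for v in values)

airport_ids = [13930, 16838, 10001]   # sample IDs for illustration
print(fits_int32(airport_ids))         # these fit comfortably
print(fits_int32([2**31]))             # one past INT32_MAX does not
```

On the real data, you could apply the same idea to each column's min and max before calling `astype`.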

In [None]:
fdf.dtypes

Okay, better!  Now let's make some graphs.  Why?  'Cause graphs are fun and informative!

## Graphs
Recall that we're going to ask these questions of our data:
1. Which airport is the most trafficked airport in our dataset?
1. What is the max number of plane rides (hops) you need to take to get from the most trafficked airport to any other airport in our dataset?
1. How many hops do you need to take to get from the most trafficked airport to one of the least trafficked airports?
1. How far is that distance really?
1. What is the topology of our airport network, based on our dataset and distance from one another?

**Let's get started!**

### Build the foundations

In [None]:
G = cugraph.Graph()
G.add_edge_list(fdf["ORIGIN_AIRPORT_ID"], fdf["DEST_AIRPORT_ID"])

In [None]:
# cudf uses the same formatting controls as Pandas!
pd.options.display.max_rows = 10

fdf["ORIGIN_AIRPORT_ID"].value_counts()

In [None]:
fdf["DEST_AIRPORT_ID"].value_counts()

### Question 1: Which airport is the most trafficked airport in our dataset?

The easiest way to find out which airport is the most trafficked is the same way Google does it for websites: PageRank!

In [None]:
df_page = cugraph.pagerank(G)

So now we have a DataFrame of PageRank scores, `df_page`.  Great!  What does it look like?

In [None]:
df_page.head()

The PageRank results are ordered by vertex number rather than by rank, but it is easy to find the max rank and sort the results.  Let's get our max and the top 10 airports in our dataset.

In [None]:
pr_max = df_page['pagerank'].max()
print(pr_max)

In [None]:
sort_pr = df_page.sort_values('pagerank', ascending=False)
sort_pr.head(10)

In [None]:
# Just for fun, sort ascending to see which airports have the least traffic
sort_pr = df_page.sort_values('pagerank', ascending=True)
sort_pr.head(10)

The descending sort gives us the top 10 most trafficked airports.  While it was easy to see from the origin and destination airport counts that 13930 would be the most trafficked, the order of the others in the list required a bit more work.  It is also interesting that no single airport accounts for 1% of the total flights.
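
Curious what PageRank is doing under the hood? Here's a small CPU sketch of the classic power-iteration idea on a toy network. This is not how cuGraph implements it; the function name, the damping factor of 0.85, the iteration count, and the made-up airports are all assumptions for illustration.

```python
# Toy PageRank by power iteration on the CPU. The graph, damping
# factor and iteration count are illustrative assumptions.
def pagerank(edges, damping=0.85, iterations=100):
    vertices = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Every vertex keeps a small baseline share of rank
        new_rank = {v: (1.0 - damping) / n for v in vertices}
        for v in vertices:
            # A vertex with no outgoing edges spreads its rank evenly
            targets = out_links[v] or vertices
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank

# Made-up hub-and-spoke network: "HUB" exchanges flights with everyone
toy_flights = [("HUB", "A"), ("A", "HUB"), ("HUB", "B"), ("B", "HUB"),
               ("HUB", "C"), ("C", "HUB"), ("A", "B")]
scores = pagerank(toy_flights)
for airport, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(airport, round(score, 4))
```

The scores sum to 1, and the hub comes out on top because every other airport sends rank its way.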

### Question 2: max number of plane rides (hops)?

Let's run a breadth-first search (BFS) starting from the most popular airport, 13930, and see how many hops it takes to get to every other airport, including the isolated ones.  The BFS gives us the hop count from 13930 to each vertex, so later we can look up any airport we choose.

In [None]:
df = cugraph.bfs(G,13930)

In [None]:
df.count()

In [None]:
df['predecessor'].value_counts()

Hmmm...what's `-1`?  Why is its count so high?  Well, maybe it doesn't matter...let's get the max.

In [None]:
df["distance"].max()

**Whoa!**  That distance value looks unexpected...but it really isn't.  In the BFS demo, Brad told us that this occurs because the isolated vertex, 0, is unreachable.  Whenever a graph contains disconnected components, the distance to the unreachable vertices will always be max_int.  He also showed us how to fix it by dropping all insanely large distances.  We'll keep `df` untouched, in case we need it again, and put the filtered result in a second dataframe, `df2`.

In [None]:
# drop all large distances 
exp="distance < 100"
df2 = df.query(exp)

In [None]:
df2['predecessor'].value_counts()

That looks better!  The most common predecessor is now a real airport and, of course, it's airport 13930.  Now, let's see what the real graph distance is.

In [None]:
df2["distance"].max()

Okay, great!  Within this dataset, any airport that can be reached from 13930 is no more than 5 flights away.
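
If the `-1` predecessors and the giant distances still feel mysterious, this small CPU sketch of BFS may help. It is a plain-Python illustration, not cuGraph's implementation; the sentinel value `UNREACHED` and the toy edge list are assumptions chosen to mimic what we saw above.

```python
from collections import deque

UNREACHED = 2**31 - 1  # assumed sentinel, mimicking the max_int seen above

def bfs(num_vertices, edges, start):
    """Return (distance, predecessor) lists indexed by vertex ID."""
    neighbors = [[] for _ in range(num_vertices)]
    for src, dst in edges:
        neighbors[src].append(dst)

    distance = [UNREACHED] * num_vertices
    predecessor = [-1] * num_vertices
    distance[start] = 0
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in neighbors[v]:
            if distance[w] == UNREACHED:   # first visit is the shortest
                distance[w] = distance[v] + 1
                predecessor[w] = v
                queue.append(w)
    return distance, predecessor

# Vertex IDs 0-2 never appear in an edge, so BFS can never reach them
toy_edges = [(3, 4), (4, 5), (3, 6), (5, 6)]
dist, pred = bfs(7, toy_edges, start=3)
print(dist)
print(pred)
reachable = [d for d in dist if d < UNREACHED]
print(max(reachable))
```

Every vertex that BFS never visits keeps a predecessor of `-1` and the sentinel distance, and so does the start vertex's predecessor. Filtering out the sentinel before taking the max is the same trick as the `distance < 100` query.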

### Question 3: How many hops do you need to take to get from the most trafficked airport to one of the least trafficked airports?
Let's find out how many flights it takes to get us to a remote airport.  Let's pick one that has only 1 flight departing from it.  I'm choosing `16838`, but you can change that value to another airport.  Also, there's a helper function below that prints the path nicely.

In [None]:
end_airport = 16838 # change to any other airport

In [None]:
def print_path(df, id):
    # Use the BFS predecessors and distances to trace the path
    # from vertex id back to the BFS starting vertex (airport 13930 here).
    # Note: values are looked up by index label, so df must be the
    # unfiltered BFS result whose row index matches the vertex ID.
    dist = df['distance'][id]
    lastVert = id
    for i in range(dist):
        nextVert = df['predecessor'][lastVert]
        d = df['distance'][lastVert]
        print("Airport " + str(lastVert) + " was reached from airport " + str(nextVert) +
        " where the graph distance to Airport 13930 was " + str(d) )
        lastVert = nextVert

In [None]:
print_path(df, end_airport)

If you used my number, it would take 3 flights.  So now we know which airports you would connect through between those two airports.  But that is the graph distance.  What about the real distances?
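
If you'd rather get the route back as a list instead of printed text, here's a small CPU sketch of the same predecessor walk. `trace_path` is a hypothetical helper, and the predecessor table is made up: only 13930 and 16838 come from this notebook, the in-between IDs are invented.

```python
def trace_path(predecessor, start, end):
    """Walk predecessors from end back to start; return the path start -> end."""
    path = [end]
    while path[-1] != start:
        prev = predecessor[path[-1]]
        if prev == -1:                 # ran out of predecessors: unreachable
            return []
        path.append(prev)
    path.reverse()
    return path

# Made-up predecessor table; the intermediate airport IDs are invented
toy_pred = {13930: -1, 11111: 13930, 12222: 11111, 16838: 12222, 19999: -1}
route = trace_path(toy_pred, start=13930, end=16838)
print(route)
print(len(route) - 1, "flights")
print(trace_path(toy_pred, start=13930, end=19999))  # unreachable airport
```

The number of flights is just the number of vertices on the route minus one, and an unreachable airport gives back an empty list.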

### Question 4:  How far is that distance really?
Well, for that, we need to bring in our other dataset, `adf`, which lists each airport's latitude and longitude, as well as the GPU-accelerated `cuSpatial` library to compute the Haversine distances (distances along the surface of a sphere, like the globe, instead of along a straight line).

In [None]:
adf.head()

Let's calculate the Haversine distance for every flight in our dataset at once.  This is a great time to use merge().  We'll do 2 merges, first on `ORIGIN_AIRPORT_ID` and then on `DEST_AIRPORT_ID`. To do the merges, we'll need to cast the key columns of our flights dataframe back to int64 so that they can be matched against `AIRPORT_ID` in `adf`.

In [None]:
fdf['AIRPORT_ID'] = fdf['ORIGIN_AIRPORT_ID'].astype(np.int64) # create a common key with origin airport
hdf = fdf.merge(adf, on=['AIRPORT_ID'], how='left')
hdf.rename(columns = {'LATITUDE': 'LATITUDE_O', 'LONGITUDE': 'LONGITUDE_O'}, inplace=True) # Origin lat and long
hdf['AIRPORT_ID'] = hdf['DEST_AIRPORT_ID'].astype(np.int64) # recreate a common key with destination airport
hdf = hdf.merge(adf, on=['AIRPORT_ID'], how='left')
hdf.rename(columns = {'LATITUDE': 'LATITUDE_D', 'LONGITUDE': 'LONGITUDE_D'}, inplace=True) # Destination lat and long
hdf.head()

In [None]:
x1 = hdf["LONGITUDE_O"]
y1 = hdf["LATITUDE_O"]
x2 = hdf["LONGITUDE_D"]
y2 = hdf["LATITUDE_D"]

hdf['H-distance'] = cuspatial.haversine_distance(x1, y1, x2, y2)
hdf.head(10)
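
For reference, here's what the Haversine formula looks like for a single pair of points on the CPU. This is a sketch of the textbook formula, not cuSpatial's code; the function name and the mean Earth radius of 6371 km are assumptions. The argument order (longitude first, then latitude) follows the call above.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

# A quarter of the way around the equator
print(round(haversine_km(0.0, 0.0, 90.0, 0.0), 1))  # 10007.5
```

A quarter of the equator is a handy sanity check, since it should equal the radius times pi/2.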

Let's get the actual distances that one must fly to get between those airports

In [None]:
H = cugraph.Graph()
# The Haversine distances are attached as edge weights, but BFS counts hops
# and ignores them; we sum the distances along the path ourselves below
H.add_edge_list(hdf["ORIGIN_AIRPORT_ID"], hdf["DEST_AIRPORT_ID"], hdf["H-distance"])
hgdf = cugraph.bfs(H,13930)

**Fun Fact:** Dropping the rows with `-1` predecessors throws off your indexes, so the path lookup no longer returns a valid answer.  Try it if you'd like!

In [None]:
def print_dist_path(df, id):
    # Use the BFS predecessors and distances to trace the path
    # from vertex id back to the BFS starting vertex (airport 13930 here),
    # summing the Haversine distance of each hop along the way
    dist = df['distance'][id]
    hdist = 0
    print("Your overall flight has " + str(dist) + " hops")
    lastVert = id
    for i in range(dist):
        nextVert = df['predecessor'][lastVert]
        # Find a flight in hdf that covers this hop
        a = hdf.query("ORIGIN_AIRPORT_ID == @nextVert and DEST_AIRPORT_ID == @lastVert")
        # Take the first match by position; the filtered rows may not start at index 0
        hop_dist = a["H-distance"].iloc[0]
        hdist = hdist + hop_dist
        print("Airport: " + str(lastVert) + " was reached from Airport " + str(nextVert) +
        " and flight distance was " + str(hop_dist) )
        lastVert = nextVert
    print("Your total flying distance was " + str(hdist))

In [None]:
print_dist_path(hgdf, 16838)

Okay, pretty cool.  We now know the distance between these airports...but where are they in the world?  Normally, we'd use [cuDataShader](https://github.com/rapidsai/cuDataShader) for this, but it is not included in this container.  [They've got a great example here that you can adapt to your needs](https://github.com/rapidsai/cuDataShader/blob/master/cudatashader-notebooks/cuDatashader%20Edge%20Bundling%20(US%20air%20traffic).ipynb).

### Question 5: What is the topology of our airport network?

Let's look at the topology of this network of airports.  One way to do that is to measure the modularity of our airport system!  To do that, we use Louvain.  However, we need to make some changes to our data, as Louvain requires vertex IDs that start from 0.  It also requires weights.  Let's see how weights change our answer.  We will use our Haversine distances as the weights for one graph, and leave the other unweighted by giving every edge a weight of 1.0!

In [None]:
L = cugraph.Graph()
L2 = cugraph.Graph()
hdf["ORIGIN_AIRPORT_ID_0"] = hdf["ORIGIN_AIRPORT_ID"] - 10001
hdf["DEST_AIRPORT_ID_0"] = hdf["DEST_AIRPORT_ID"] - 10001
hdf["data"]= 1.0
L2.add_edge_list(hdf["ORIGIN_AIRPORT_ID_0"], hdf["DEST_AIRPORT_ID_0"], hdf["data"]) # Unweighted Modularity
L.add_edge_list(hdf["ORIGIN_AIRPORT_ID_0"], hdf["DEST_AIRPORT_ID_0"], hdf["H-distance"]) # Distance Weighted Modularity
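
Subtracting a constant keeps the mapping easy to undo, but it leaves unused IDs wherever the original numbering has gaps. As an alternative, here's a CPU sketch that renumbers arbitrary IDs to contiguous integers starting at 0 and keeps a lookup table to get back. `renumber` is a hypothetical helper written for this example, not a cuGraph call.

```python
def renumber(sources, destinations):
    """Map arbitrary vertex IDs to contiguous integers starting at 0."""
    ids = sorted(set(sources) | set(destinations))
    to_dense = {original: dense for dense, original in enumerate(ids)}
    dense_src = [to_dense[s] for s in sources]
    dense_dst = [to_dense[d] for d in destinations]
    return dense_src, dense_dst, ids   # ids[dense] recovers the original ID

src = [13930, 13930, 10001]   # sample IDs for illustration
dst = [16838, 10001, 16838]
dense_src, dense_dst, ids = renumber(src, dst)
print(dense_src, dense_dst)
print([ids[v] for v in dense_src])   # back to the original IDs
```

With this approach, translating a partition's vertices back to airport IDs is a list lookup rather than an offset.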

In [None]:
# Call Louvain on each graph
hgdf, mod = cugraph.louvain(L)
hgdf2, mod2 = cugraph.louvain(L2)
# Print the modularity scores
print('Modularity using Distance as a weight was {}'.format(mod))
print()
print('Modularity unweighted was {}'.format(mod2))
print()
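
So what is that modularity number actually measuring? Roughly, how much more edge weight falls inside the partitions than you'd expect by chance, and Louvain searches for a partition that makes it large. Here's a CPU sketch of the standard modularity formula for an undirected weighted graph. The function and the toy graph are assumptions for illustration and are not taken from cuGraph.

```python
from collections import defaultdict

def modularity(edges, partition):
    """Modularity of an undirected weighted graph.

    edges: iterable of (u, v, weight); partition: dict vertex -> community.
    """
    total_weight = 0.0                 # m: sum of all edge weights
    internal = defaultdict(float)      # weight of edges inside each community
    degree = defaultdict(float)        # summed weighted degree per community
    for u, v, w in edges:
        total_weight += w
        degree[partition[u]] += w
        degree[partition[v]] += w
        if partition[u] == partition[v]:
            internal[partition[u]] += w
    return sum(internal[c] / total_weight
               - (degree[c] / (2 * total_weight)) ** 2
               for c in degree)

# Two triangles joined by a single edge, all weights 1.0 (made-up toy graph)
toy_edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
             (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0), (2, 3, 1.0)]
two_groups = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
one_group = {v: 0 for v in range(6)}
print(round(modularity(toy_edges, two_groups), 4))  # 0.3571
print(round(modularity(toy_edges, one_group), 4))   # 0.0
```

Splitting the two triangles apart scores well above lumping everything together, which is the kind of structure Louvain rewards. Swapping the 1.0 weights for distances changes which splits look attractive.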

In [None]:
hgdf.head(10)

In [None]:
hgdf2.head(10)

Those are high partition numbers for both graphs.  This is, of course, based on a small dataset of flights.  I'll be working on a larger one in notebooks_contrib that uses DOT 2015 flight data and uses cuDataShader for graph visualizations.  Let's see how many partitions there are and what the value counts look like.

In [None]:
print(len(hgdf['partition'].unique()))
print(len(hgdf2['partition'].unique()))

In [None]:
hgdf['partition'].value_counts()

In [None]:
hgdf2['partition'].value_counts()

It seems that the unweighted graph is less modular.  Let's remove the partitions that contain only 1 airport from both results.

In [None]:
def get_mod(df):
    val_counts = df['partition'].value_counts()
    relevant_partitions = val_counts[val_counts>1].index
    print(len(relevant_partitions))
    query = 'partition == '+ str(relevant_partitions[0])
    for i in range (1, len(relevant_partitions)):
            query += ' or partition == '+ str(relevant_partitions[i])
    return df.query(query)
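
The same filtering idea, sketched on the CPU with a plain dict standing in for the Louvain result: count how many vertices each partition has, then keep only the vertices whose partition has more than one member. `drop_singletons` and the toy labels are made up for illustration.

```python
from collections import Counter

def drop_singletons(partition):
    """Keep only vertices whose partition has more than one member."""
    sizes = Counter(partition.values())
    return {v: p for v, p in partition.items() if sizes[p] > 1}

toy_partition = {0: 0, 1: 0, 2: 1, 3: 2, 4: 2, 5: 2}  # made-up labels
kept = drop_singletons(toy_partition)
print(kept)
print(len(set(kept.values())), "partitions with more than one member")
```

Vertex 2 sits alone in partition 1, so it is the only one dropped.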

In [None]:
# List the airports that belong to each partition
def get_partitions(df):
    # Copy to the host once; per-element lookups on GPU columns are slow
    # and break when the filtered rows no longer have a 0..n-1 index
    pdf = df.to_pandas()
    for part_id, group in pdf.groupby("partition"):
        # Undo the 10001 offset to recover the original airport IDs
        part = (group["vertex"] + 10001).tolist()
        print("Partition " + str(part_id) + " contains these airports:")
        print(part)

In [None]:
print("Number of partitions > 1 in Distance Weighted Modularity:")
hgdf_1 = get_mod(hgdf)
print("Number of partitions > 1 in Unweighted Modularity:")
hgdf_2 = get_mod(hgdf2)

hgdf_1.head()

In [None]:
print("------Distance Weighted Modularity------")
get_partitions(hgdf_1)
print("------Unweighted Modularity------")
get_partitions(hgdf_2)

Okay, great!  Now that we know what each partition is, you can once again use [cuDataShader](https://github.com/rapidsai/cuDataShader) or [cuXFilter](https://github.com/rapidsai/cuxfilter) to visualize the results.  Let's make a pretty picture (that sound you just heard was Allan Enemark grinding his teeth :).  He's a friend, so I shouldn't suffer any physical harm at his hands.  He also leads the team that does data visualizations and builds their libraries, such as [cuXFilter](https://github.com/rapidsai/cuxfilter) and [cuDataShader](https://github.com/rapidsai/cuDataShader)).