## Graph Analysis with GraphFrames
In this notebook we'll go over basic graph analysis using the GraphFrames API. We'll be working with publicly available bike data from [the Bay Area Bike Share portal](http://www.bayareabikeshare.com/open-data), specifically the second year of data.

**Note:** GraphX computation is only supported using the Scala and RDD APIs. GraphFrames is built on DataFrames and has a Python API, which is why we use it here.

#### Graph Theory and Graph Processing
Graph processing is an important aspect of analysis that applies to a lot of use cases. Fundamentally, graph theory and graph processing are about describing entities and the relationships between them. Nodes (or vertices) are the units, while edges are the relationships defined between them. This works well for social network analysis and for running algorithms like [PageRank](https://en.wikipedia.org/wiki/PageRank) to better understand and weigh relationships.

Some business use cases include finding central people in social networks (who is most popular in a group of friends), measuring the importance of papers in bibliographic networks (which papers are most referenced), and of course ranking web pages!

#### Graphs and Bike Trip Data
As mentioned, in this example we'll be using Bay Area Bike Share data. This data is free for use by the public on the website linked above. We're going to orient our analysis by making every station a vertex and every trip an edge connecting two stations. Because each trip has a start station and an end station, this creates a *directed* graph.
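
To make the modelling concrete, here is a minimal plain-Python sketch of how trip records turn into vertices and directed edges. The station names and the `trips` list are made up for illustration; they are not rows from the bike share dataset.

```python
# Hypothetical toy trip records, not taken from the real dataset.
trips = [
    {"start": "Station A", "end": "Station B"},
    {"start": "Station A", "end": "Station B"},
    {"start": "Station B", "end": "Station A"},
    {"start": "Station B", "end": "Station C"},
]

# Every trip becomes one directed edge: (source station, destination station).
edges = [(trip["start"], trip["end"]) for trip in trips]

# Every station that appears at either end of a trip is a vertex.
vertices = sorted({station for edge in edges for station in edge})

print(vertices)  # ['Station A', 'Station B', 'Station C']
print(edges)
```

Note that repeated trips are kept as repeated edges, and that `("Station A", "Station B")` is a different edge from `("Station B", "Station A")`. That is what makes the graph directed.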

**Further Reference:**
* [Graph Theory on Wikipedia](https://en.wikipedia.org/wiki/Graph_theory)
* [PageRank on Wikipedia](https://en.wikipedia.org/wiki/PageRank)

#### **Table of Contents**
* **Setup & Data**
* **Imports**
* **Building the Graph**
* **PageRank**
* **Trips from Station to Station**
* **In Degrees and Out Degrees**

In [None]:
bikeStations = spark.read.option("header", True).csv("/data/training/bikeshare/*_station_data.csv")
tripData = spark.read.option("header", True).csv("/data/training/bikeshare/*_trip_data.csv")

In [None]:
bikeStations.show()

In [None]:
tripData.show()

In [None]:
bikeStations.printSchema()
tripData.printSchema()

In [None]:
from graphframes import *
from pyspark.sql import *
from pyspark.sql.functions import *

## Building the Graph
Now that we've imported our data, we're going to need to build our graph. To do so we're going to do two things. We are going to build the structure of the vertices (or nodes) and we're going to build the structure of the edges.

You may have noticed that our `bikeStations` data has station ids, while the cells below identify the stations of each trip by name (`Start Station` and `End Station`). With GraphX this would complicate things, because GraphX requires vertices to be identified by a numeric value, so we would have to perform some joins to attach those ids to each trip. GraphFrames only needs an `id` column in the vertices and `src` and `dst` columns in the edges, and the values can be strings. We can therefore use the station name itself as the vertex id.
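
Because station names act as vertex identifiers in the cells below, an edge endpoint only connects to a vertex when the two strings match exactly. The following is a minimal plain-Python sketch of a consistency check you might run on a small sample. The helper name `find_unmatched_endpoints` and the sample data are hypothetical.

```python
def find_unmatched_endpoints(vertex_ids, edges):
    """Return the edge endpoints that do not match any vertex id exactly."""
    known = set(vertex_ids)
    return sorted({station for edge in edges for station in edge if station not in known})


# Hypothetical sample: the last edge has a trailing space in its destination.
vertex_ids = ["Station A", "Station B", "Station C"]
edges = [
    ("Station A", "Station B"),
    ("Station B", "Station C"),
    ("Station A", "Station C "),
]

unmatched = find_unmatched_endpoints(vertex_ids, edges)
print(unmatched)  # ['Station C ']
```

An empty result means every trip starts and ends at a known station. Anything else usually points to spelling or whitespace differences between the two files.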

In [None]:
stationVertices = (bikeStations
  .withColumnRenamed("name", "id")
  .select("id")
  .distinct())

In [None]:
stationEdges = (tripData
  .withColumnRenamed("Start Station", "src")
  .withColumnRenamed("End Station", "dst")
  .select("src", "dst"))

In [None]:
stationGraph = GraphFrame(stationVertices, stationEdges)
stationGraph.cache()

In [None]:
stationVertices.take(1)

In [None]:
print("Total Number of Stations: ", stationGraph.vertices.count())
print("Total Number of Trips: ", stationGraph.edges.count())
# sanity check
print("Total Number of Trips in Original Data: ", tripData.count())

Now that we're all set up and have computed some basic statistics, let's run some algorithms!

## PageRank

GraphFrames includes a number of built-in algorithms. [PageRank](https://en.wikipedia.org/wiki/PageRank) is one of the best known. It was popularized by the Google search engine and is named after Larry Page, one of its creators. To quote Wikipedia:

> PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

What's awesome about this concept is that it readily applies to any graph-like structure, be it web pages or bike stations. Let's go ahead and run PageRank on our data. We can either run it for a set number of iterations or until convergence: in the GraphFrames Python API, passing `maxIter` to `pageRank` runs a fixed number of iterations, while passing `tol` runs until the ranks converge within that tolerance. The result is a new GraphFrame whose `vertices` carry a `pagerank` column.
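
To build intuition for what the algorithm computes, here is a toy PageRank written in plain Python using power iteration. It is only a sketch on a made-up graph: it is not how GraphFrames implements PageRank, and GraphFrames may scale its scores differently. The function name `toy_pagerank` and its parameters are assumptions for illustration.

```python
def toy_pagerank(edges, reset_probability=0.15, max_iter=50):
    """Toy PageRank by power iteration over (src, dst) edges.

    Scores are normalised so that they sum to 1.
    """
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_degree = {node: 0 for node in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if out_degree[node] == 0)
        base = reset_probability / n + (1 - reset_probability) * dangling / n
        new_rank = {node: base for node in nodes}
        # Each node passes its rank along its outgoing edges in equal shares.
        for src, dst in edges:
            new_rank[dst] += (1 - reset_probability) * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Hypothetical toy graph: three stations feed a hub, which sends trips back to two of them.
edges = [("A", "Hub"), ("B", "Hub"), ("C", "Hub"), ("Hub", "A"), ("Hub", "B")]
ranks = toy_pagerank(edges)
for node, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(f"{node}: {score:.3f}")
```

The hub ends up with the highest score because every other node links to it, which mirrors why heavily used stations float to the top of the ranking.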

In [None]:
# TODO
ranks = stationGraph

In [None]:
ranks.sort(ranks.pagerank.desc()).show()

Answer should be:
* San Jose Diridon Caltrain Station
* San Francisco Caltrain (Townsend at 4th)
* Mountain View Caltrain Station
* Redwood City Caltrain Station
* San Francisco Caltrain 2 (330 Townsend)
* Harry Bridges Plaza (Ferry Building)
* 2nd at Townsend
* Santa Clara at Almaden
* Townsend at 7th
* Embarcadero at Sansome


We can see above that the Caltrain stations seem to be significant! This makes sense, as these are natural connectors: getting to and from the train without needing a car is likely one of the most popular uses of a bike share program.

## Trips From Station to Station
One question you might ask is which station-to-station trips are the most common in the dataset. We can answer it by grouping the edges by source and destination and counting the rows in each group. The result has one row per distinct (source, destination) pair, with a count of all the trips that share that pair. Think about it this way: we have a number of trips that are exactly the same, from station A to station B, and we just want to count those up!

In the query below we're going to grab the most common station-to-station trips and print out the top 10.
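
The counting idea can be sketched in plain Python with `collections.Counter`, treating each (source, destination) pair as a key. The edges below are made up for illustration and the variable names are assumptions.

```python
from collections import Counter

# Hypothetical toy edges: one (src, dst) pair per trip.
edges = [
    ("Station A", "Station B"),
    ("Station A", "Station B"),
    ("Station B", "Station A"),
    ("Station A", "Station B"),
    ("Station B", "Station A"),
    ("Station B", "Station C"),
]

# Identical (src, dst) pairs collapse into a single key with a count.
trip_counts = Counter(edges)

for (src, dst), count in trip_counts.most_common(2):
    print(f"There were {count} trips from {src} to {dst}.")
```

On a DataFrame of edges the same shape of computation is a group-by on `src` and `dst` followed by a count and a descending sort.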

In [None]:
# TODO
trips = None

Answer should be:
* There were 9150 trips from Harry Bridges Plaza (Ferry Building) to Embarcadero at Sansome.
* There were 8508 trips from San Francisco Caltrain 2 (330 Townsend) to Townsend at 7th.
* There were 7620 trips from 2nd at Townsend to Harry Bridges Plaza (Ferry Building).
* There were 6888 trips from Harry Bridges Plaza (Ferry Building) to 2nd at Townsend.
* There were 6874 trips from Embarcadero at Sansome to Steuart at Market.
* There were 6836 trips from Townsend at 7th to San Francisco Caltrain 2 (330 Townsend).
* There were 6351 trips from Embarcadero at Folsom to San Francisco Caltrain (Townsend at 4th).
* There were 6215 trips from San Francisco Caltrain (Townsend at 4th) to Harry Bridges Plaza (Ferry Building).
* There were 6039 trips from Steuart at Market to 2nd at Townsend.
* There were 5959 trips from Steuart at Market to San Francisco Caltrain (Townsend at 4th).

## In Degrees and Out Degrees
Remember that in this instance we've got a directed graph. That means that our trips are directional, from one location to another. This gives us access to a wealth of analysis: we can count the trips that end at a specific station (its in degree) and the trips that start from it (its out degree).

Naturally we can sort this information and find the stations with lots of inbound and outbound trips! Check out this definition of [Vertex Degrees](http://mathworld.wolfram.com/VertexDegree.html) for more information.

Now that we've defined that process, let's go ahead and find the stations that have lots of inbound and outbound traffic.
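
As a small illustration of the two definitions, the plain-Python sketch below counts in degrees and out degrees for a made-up list of directed edges. The data and variable names are hypothetical.

```python
from collections import Counter

# Hypothetical toy edges: one (src, dst) pair per trip.
edges = [
    ("Station A", "Station B"),
    ("Station A", "Station B"),
    ("Station A", "Station B"),
    ("Station B", "Station A"),
    ("Station B", "Station A"),
    ("Station B", "Station C"),
]

# In degree: how many edges point at a station (trips ending there).
in_degree = Counter(dst for _, dst in edges)
# Out degree: how many edges leave a station (trips starting there).
out_degree = Counter(src for src, _ in edges)

for station, degree in in_degree.most_common():
    print(f"{station} has {degree} in degrees.")
```

`Station C` never appears as a source, so it has no entry in `out_degree`; looking it up in the `Counter` simply returns 0.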

In [None]:
# TODO in degrees
inDeg = None

Answer should be:
* San Francisco Caltrain (Townsend at 4th) has 90532 in degrees.
* San Francisco Caltrain 2 (330 Townsend) has 57570 in degrees.
* Harry Bridges Plaza (Ferry Building) has 49215 in degrees.
* Embarcadero at Sansome has 45224 in degrees.
* 2nd at Townsend has 43717 in degrees.
* Market at Sansome has 40518 in degrees.
* Steuart at Market has 38981 in degrees.
* Townsend at 7th has 38011 in degrees.
* Temporary Transbay Terminal (Howard at Beale) has 34878 in degrees.
* Market at 4th has 26248 in degrees.

In [None]:
# TODO out degrees
outDeg = None

One interesting follow-up question is which station has the highest ratio of in degrees to out degrees. In other words, which station acts as almost a pure trip sink: a station where trips often end but rarely start.
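
The ratio itself is simple arithmetic once both degree counts are available. Here is a plain-Python sketch on made-up edges; the helper name `degree_ratios` is hypothetical, and skipping stations with no outbound trips is one possible way to avoid dividing by zero, not the only one.

```python
from collections import Counter


def degree_ratios(edges):
    """Return in degree divided by out degree for every station with outbound trips."""
    in_degree = Counter(dst for _, dst in edges)
    out_degree = Counter(src for src, _ in edges)
    # Stations that never appear as a source are skipped to avoid division by zero.
    return {station: in_degree[station] / out_degree[station] for station in out_degree}


# Hypothetical toy edges: most trips end at Station B, few start there.
edges = [
    ("Station A", "Station B"),
    ("Station A", "Station B"),
    ("Station C", "Station B"),
    ("Station B", "Station A"),
    ("Station A", "Station C"),
]

ratios = degree_ratios(edges)
for station, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{station}: {ratio:.2f}")
```

A ratio above 1 marks a station that receives more trips than it sends out, and a ratio below 1 marks the opposite.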

In [1]:
# TODO
degreeRatio = None

We can do something similar by getting the stations with the lowest ratio of in degrees to out degrees, meaning that trips start from that station but don't end there as often. This is essentially the opposite of what we have above.


In [None]:
# TODO reverse previous computation

The conclusions from the above analysis should be relatively straightforward. A higher ratio means that many more trips come into that station than leave it, and a lower ratio means that many more trips leave from that station than come into it!

Hopefully you've gotten some value out of this notebook! Graph structures are everywhere once you start looking for them, and hopefully GraphFrames will make analyzing them easy!