# <center>Exploring GraphX</center>
## <center>Visualizing GraphX and Exploring Graph Operators</center>
### <center>July 20, 2016</center>

<img src="http://spark.apache.org/docs/latest/img/graphx_logo.png" width="600" align="center">

## Welcome to the second lab in the course, Exploring GraphX.

### GraphX is Apache Spark's API for graph and graph-parallel computations.

In the last exercise, you were introduced to GraphX: you created the components that make up a Graph, then built a full Graph from its vertices and edges. In this lab, you will get more practice constructing a Graph, extract information using Graph Operators, and look at ways to visualize the Graph.

### Some Notebook Commands
#### In case you haven't dealt with a Jupyter Notebook before, here are some quick, useful commands that may be handy to get started.
<ul>
    <li>Run a cell: CTRL + ENTER</li>
    <li>Create a cell above a cell: a</li>
    <li>Create a cell below a cell: b</li>
    <li>Change a cell to Markdown: m</li>
    
    <li>Change a cell to code: y</li>
</ul>

<b> If you are interested in more keyboard shortcuts, go to Help -> Keyboard Shortcuts </b>

In the last exercise, you created our simple recreation of "facebook". You were given most of the code, so let's go ahead and recreate the same graph with a little less help and a bit more intuition!

First we will import the following libraries:

- org.apache.spark._ 
- org.apache.spark.graphx._
- org.apache.spark.rdd.RDD 

In [21]:
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">import org.apache.spark.&#95;<br>
import org.apache.spark.graphx.&#95;<br>
import org.apache.spark.rdd.RDD</font>
</td>
</table>

Our "facebook" graph had the following People:

- Billy Bill -> VertexId = 1
- Jacob Johnson -> VertexId = 2
- Andrew Smith -> VertexId = 3

and 2 Pages:

- Iron Man Fan Page -> VertexId = 4
- Captain America Fan Page -> VertexId = 5

We are going to create the vertices in one step! They will be assigned to a variable called vertexRDD.

Hint: The type is RDD[(Long, (String, String))]

In [22]:
val vertexRDD: RDD[(Long, (String, String))] = sc.parallelize(Array((1L, ("Billy Bill", "Person")), (2L, ("Jacob Johnson", "Person")), (3L, ("Andrew Smith", "Person")), (4L, ("Iron Man Fan Page", "Page")), (5L, ("Captain America Fan Page", "Page"))))

Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">val vertexRDD: RDD[(Long, (String, String))] = sc.parallelize(Array((1L, ("Billy Bill", "Person")), (2L, ("Jacob Johnson", "Person")), (3L, ("Andrew Smith", "Person")), (4L, ("Iron Man Fan Page", "Page")), (5L, ("Captain America Fan Page", "Page"))))</font>
</td>
</table>

Awesome! Now let's create the same relationships in one step too:

- Billy is Friends with Jacob
- Billy is Friends with Andrew
- Jacob is a Follower of the Iron Man Fan Page
- Jacob is a Follower of the Captain America Fan Page
- Andrew is a Follower of the Captain America Fan Page

This RDD of edges will be called edgeRDD.

Hint: The Type is RDD[Edge[String]]

In [23]:
val edgeRDD: RDD[Edge[String]] = sc.parallelize(Array(Edge(1L, 2L, "Friends"), Edge(1L, 3L, "Friends"), Edge(2L, 4L, "Follower"), Edge(2L, 5L, "Follower"), Edge(3L, 5L, "Follower")))

Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">val edgeRDD: RDD[Edge[String]] = sc.parallelize(Array(Edge(1L, 2L, "Friends"), Edge(1L, 3L, "Friends"), Edge(2L, 4L, "Follower"), Edge(2L, 5L, "Follower"), Edge(3L, 5L, "Follower")))</font>
</td>
</table>

Now let's create a variable called defaultvertex, which will be the "fallback" attribute for any vertex that an edge refers to but that was never defined. It is just a tuple containing "Self" and "Missing".

In [24]:
var defaultvertex = ("Self", "Missing")

Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">var defaultvertex = ("Self", "Missing")</font>
</td>
</table>
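To see what the "fallback" does, here is a minimal sketch. It assumes the notebook's SparkContext `sc` and the vertexRDD and defaultvertex variables from the cells above. The edge to vertex 6 and the names danglingEdgeRDD and demoGraph are made up for illustration.

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Hypothetical edge: vertex 6L appears here but was never defined in vertexRDD.
val danglingEdgeRDD: RDD[Edge[String]] = sc.parallelize(Array(Edge(1L, 6L, "Friends")))

// Build a throwaway graph so the lab's own graph is not affected.
val demoGraph = Graph(vertexRDD, danglingEdgeRDD, defaultvertex)

// Look up vertex 6 and print whatever attribute GraphX assigned to it.
val missingVertex = demoGraph.vertices.filter { case (id, _) => id == 6L }.collect
missingVertex.foreach(println)
```

Since vertex 6 is not in vertexRDD, it should come back carrying the default attribute ("Self", "Missing") instead of causing an error.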

Alright, now let's go ahead and construct the Graph! We will name it facebook again!

In [25]:
var facebook = Graph(vertexRDD, edgeRDD, defaultvertex)

Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">var facebook = Graph(vertexRDD, edgeRDD, defaultvertex)</font>
</td>
</table>

Perfect! Here's a reminder of the visualized Graph:

<img src = "http://i.imgur.com/rhkiopM.png">

Alright, now we will take a look at a few of the Graph Operators! You use them by calling them on the Graph variable, "facebook" in our case. Let's find out how many vertices there are in this graph by using the numVertices function.

In [26]:
facebook.numVertices

5

Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">facebook.numVertices</font>
</td>
</table>

Sweet! Now let's find out the number of edges using the numEdges function.

In [27]:
facebook.numEdges

5

Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">facebook.numEdges</font>
</td>
</table>

Coincidentally, they are both the same. So make sure you didn't just use the same function both times! Haha.

The next Operators we will look at involve degrees. Here, the degree of a vertex is the number of edges it touches! GraphX graphs are directed, so every edge has a direction. Sometimes a relationship can be read both ways, such as:

-> Billy is a Friend of Andrew. 

-> Andrew is a Friend of Billy.

Note that our graph still stores this friendship as a single directed edge, from Billy to Andrew.

However, there are cases where the edge or "relationship" is not mutual. For example, this holds:

-> Jacob is a Follower of the Captain America Fan Page.

but the reverse does not:

-> Captain America Fan Page is a Follower of Jacob.

So, if we are looking at a specific vertex, we can determine the edges that point "out" with the function outDegrees. However, the question is... How do we find a specific vertex? We use the filter function like we did in the last exercise!

We can use the filter function on the outDegrees function of facebook and select the case where the id is the number or numbers we want.

Let's find Billy's outDegree information by filtering on an id of 1 and using the collect function afterwards. Let's save it as Billy_outDegree.

Note: The case we will need is case(id, outdegree), as the id of the person is the first parameter and the outdegree number is the second parameter.

In [28]:
var Billy_outDegree = facebook.outDegrees.filter{ case(id, outdegree) => id == 1}.collect

Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">var Billy_outDegree = facebook.outDegrees.filter{ case(id, outdegree) => id == 1}.collect</font>
</td>
</table>

Awesome! Now let's go ahead and print out Billy_outDegree. The result is an array, so we will need to index it using () parentheses instead of [] brackets. The index should be 0.

In [29]:
print(Billy_outDegree(0))

(1,2)

Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">print(Billy_outDegree(0))</font>
</td>
</table>

The result contains the vertex id as the first element and the outDegree of that vertex as the second element. Therefore the outDegree of Billy is 2.

Now let's do the same for Billy, but let's find his inDegree. Use Billy_inDegree as the variable, then print the first element of Billy_inDegree like before.

In [37]:
var Billy_inDegree = facebook.inDegrees.filter{ case(id, indegree) => id == 1}.collect

In [40]:
print(Billy_inDegree(0))

Name: java.lang.ArrayIndexOutOfBoundsException
Message: 0
StackTrace: $line110.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
$line110.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:50)
$line110.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:52)
$line110.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:54)
$line110.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:56)
$line110.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:58)
$line110.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:60)
$line110.$read$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:62)
$line110.$read$$iwC$$iwC$$iwC$$iwC.<init>(<console>:64)
$line110.$read$$iwC$$iwC$$iwC.<init>(<console>:66)
$line110.$read$$iwC$$iwC.<init>(<console>:68)
$line110.$read$$iwC.<init>(<console>:70)
$line110.$read.<init>(<console>:72)
$line110.$read$.<init>(<console>:76)
$line110.$read$.<clinit>(<console>)

Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">var Billy_inDegree = facebook.inDegrees.filter{ case(id, indegree) => id == 1}.collect<br>
print(Billy_inDegree(0))</font>
</td>
</table>

You got an error when you tried to print Billy_inDegree, didn't you? That's to be expected: no edge points to Billy's vertex, so inDegrees has no entry for it and the Billy_inDegree array is empty.
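The missing entry can be reproduced without Spark. The sketch below is plain Scala, not GraphX: edgePairs is a made-up stand-in for the lab's edges, and counting by destination mimics what inDegrees reports, where vertices with no incoming edges simply do not appear.

```scala
// Plain-Scala stand-in for the lab's edges as (srcId, dstId) pairs; no Spark needed.
val edgePairs: Seq[(Long, Long)] =
  Seq((1L, 2L), (1L, 3L), (2L, 4L), (2L, 5L), (3L, 5L))

// Count incoming edges per destination id.
val inDegreeById: Map[Long, Int] =
  edgePairs.groupBy(_._2).map { case (id, incoming) => (id, incoming.size) }

// Vertex 1 is never a destination, so it has no entry at all.
val billyEntry: Option[Int] = inDegreeById.get(1L)

// Treat "no entry" as an inDegree of 0 instead of failing.
val billyInDegree: Int = billyEntry.getOrElse(0)

println(billyEntry)
println(billyInDegree)
```

In the notebook, the same idea would be `Billy_inDegree.headOption.map(_._2).getOrElse(0)`, which avoids indexing into an empty array.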

Now let's take a look at the degrees operator. We will do something different than before and use a for loop to cycle through the total degree of each vertex (inDegree + outDegree).

In [31]:
for (degree <- facebook.degrees.collect) {
    println(degree)
}

(1,2)
(2,3)
(3,2)
(4,1)
(5,2)


Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">for (degree <- facebook.degrees.collect) { <br>
    println(degree) <br>
}</font>
</td>
</table>

The list that shows up is in the same format as before: the first element is the vertex id and the second element is the degree of that vertex.
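As a cross-check on those numbers, here is a small plain-Scala sketch (not GraphX) that counts how many edge endpoints each vertex id appears in. The name edgeList is made up for illustration and mirrors the edges we defined in edgeRDD.

```scala
// The lab's edges as (srcId, dstId) pairs; no Spark needed.
val edgeList: Seq[(Long, Long)] =
  Seq((1L, 2L), (1L, 3L), (2L, 4L), (2L, 5L), (3L, 5L))

// Every edge touches two vertices, so list both endpoints and count per id.
val totalDegree: Map[Long, Int] =
  edgeList
    .flatMap { case (src, dst) => Seq(src, dst) }
    .groupBy(identity)
    .map { case (id, hits) => (id, hits.size) }

// Print in id order, in the same (id, degree) format as above.
totalDegree.toSeq.sorted.foreach(println)
```

Each vertex's count here is its inDegree plus its outDegree, which is why Jacob (id 2) ends up with 3: one incoming friendship and two outgoing follows.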

The next Graph Operators we are looking at are .vertices, .edges, and .triplets. You have used and seen them in the last exercise. It is important to know how to use each of them in a case:

- .vertices -> Uses format of the defined Vertices of the graph. <br>
Ex. We defined our Vertices as (Long, (String, String)), therefore when you call a case on this, you must define variables for each such as (id, (name, user_type)).

- .edges -> Uses format of the defined Edges of the graph. <br>
Ex. We defined our Edges as Edge[String], therefore when you call a case on this, you can just define one variable such as (relation). However, this variable will have attributes such as .srcId (Source Id), .dstId (Destination Id), and .attr (Attribute).

- .triplets -> Uses the combined format of the defined Vertices and Edges. <br>
Ex. Following the above example, when you call a case on this, you define one variable such as (triplet). This variable will have attributes of both Vertices and Edges: .srcAttr (Source Attribute) and .dstAttr (Destination Attribute) from Vertices, and .srcId (Source Id), .dstId (Destination Id), and .attr (Attribute) from Edges.
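The three formats above can be sketched side by side. This assumes the `facebook` graph built earlier in this lab; the variable names pageNames, friendEdges, and followedPages are made up for illustration.

```scala
// .vertices: each element is (id, (name, user_type)), so the case names every part.
val pageNames = facebook.vertices
  .filter { case (id, (name, userType)) => userType == "Page" }
  .map { case (id, (name, userType)) => name }
  .collect

// .edges: each element is a single Edge value with .srcId, .dstId, and .attr.
val friendEdges = facebook.edges
  .filter(relation => relation.attr == "Friends")
  .collect

// .triplets: one value carrying both the edge attribute and the vertex attributes.
val followedPages = facebook.triplets
  .filter(triplet => triplet.attr == "Follower")
  .map(triplet => triplet.dstAttr._1)
  .collect

pageNames.foreach(println)
friendEdges.foreach(println)
followedPages.foreach(println)
```

The order of the collected results is not guaranteed, so it is safer to compare them as sets or by size than by position.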

Since you've dealt with .vertices and .edges, we will do a quick example with each, then start looking at how to visualize the graph with .triplets, since it is a combination of .vertices and .edges.

Unfortunately, GraphX does not have any built-in visualization, so it's important to know how to create views. Let's go ahead and try printing out all of the vertices.

Hint: Use a for loop and the collect function on .vertices

In [32]:
for (vertex <- facebook.vertices.collect) {
    println(vertex)
}

(1,(Billy Bill,Person))
(2,(Jacob Johnson,Person))
(3,(Andrew Smith,Person))
(4,(Iron Man Fan Page,Page))
(5,(Captain America Fan Page,Page))


Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">for (vertex <- facebook.vertices.collect) {<br>
    println(vertex)<br>
}</font>
</td>
</table>

Awesome! Now let's do the same with edges just so we have an idea of all the vertices and edges.

In [33]:
for (edge <- facebook.edges.collect) {
    println(edge)
}

Edge(1,2,Friends)
Edge(1,3,Friends)
Edge(2,4,Follower)
Edge(2,5,Follower)
Edge(3,5,Follower)


Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">for (edge <- facebook.edges.collect) {<br>
    println(edge)<br>
}</font>
</td>
</table>

Alright, now let's use triplets to create a view of the graph. Just like in the last two examples, we will use the collect function on .triplets. This time, however, we will combine the Source Attribute (.srcAttr), the edge attribute (.attr), and the Destination Attribute (.dstAttr) in the same printed line to describe each relationship.

Hint: Make sure to use the index on the Source and Destination Attribute!

In [34]:
for (triplet <- facebook.triplets.collect) {
  print(triplet.srcAttr._1)
  print(" is a ")
  print(triplet.attr)
  print(" of ")
  println(triplet.dstAttr._1)
}

Billy Bill is a Friends of Jacob Johnson
Billy Bill is a Friends of Andrew Smith
Jacob Johnson is a Follower of Iron Man Fan Page
Jacob Johnson is a Follower of Captain America Fan Page
Andrew Smith is a Follower of Captain America Fan Page


Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">for (triplet <- facebook.triplets.collect) {<br>
  print(triplet.srcAttr._1)<br>
  print(" is a ")<br>
  print(triplet.attr)<br>
  print(" of ")<br>
  println(triplet.dstAttr._1)<br>
}</font>
</td>
</table>

The view looks great! It is important to know how to create a view because GraphX does not have any visualization built in; it is mainly a parallel graph processing library. There are alternatives such as GraphLab and Gephi, but we won't be looking into these in this course.
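To make it clearer what a triplet combines, here is a rough plain-Scala sketch (not GraphX) that builds the same view by looking up each edge's source and destination names. The names vertexNames, relations, and view are made up for illustration.

```scala
// Vertex names by id, mirroring the first element of each vertex attribute.
val vertexNames: Map[Long, String] = Map(
  1L -> "Billy Bill",
  2L -> "Jacob Johnson",
  3L -> "Andrew Smith",
  4L -> "Iron Man Fan Page",
  5L -> "Captain America Fan Page"
)

// Edges as (srcId, dstId, attr), mirroring edgeRDD.
val relations: Seq[(Long, Long, String)] = Seq(
  (1L, 2L, "Friends"),
  (1L, 3L, "Friends"),
  (2L, 4L, "Follower"),
  (2L, 5L, "Follower"),
  (3L, 5L, "Follower")
)

// A "triplet" is an edge plus the attributes of the two vertices it connects.
val view: Seq[String] = relations.map { case (src, dst, relation) =>
  s"${vertexNames(src)} is a $relation of ${vertexNames(dst)}"
}

view.foreach(println)
```

This is the lookup that .triplets does for you: instead of joining ids to names by hand, each triplet already carries .srcAttr and .dstAttr.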

The only issue we have is that the relations "Friends" and "Follower" differ in number: one is plural and one is singular. So when we print the view, the "Follower" lines read correctly while the "Friends" lines are grammatically incorrect. 

You may have noticed that the visualization shown earlier is correct, but the actual graph is not. That is not a mistake: the visualization shows what we want the graph to be, and the error was left in the graph so we can learn how to change it! 

We will want to change "Friends" to the singular "Friend" in the Graph. We will be exploring how to do this in the next lab, so start thinking!

Now we will take a look at an important algorithm in GraphX: pageRank.

PageRank is an algorithm that measures the importance of each vertex based on the edges pointing to it (how many there are and how important their sources are). There are two options for PageRank, static and dynamic. Static runs for a fixed number of iterations, while dynamic runs until the ranks converge.
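To give a feel for the fixed-iteration ("static") idea, here is a simplified plain-Scala sketch. It is not GraphX's implementation: the update rule, the reset probability of 0.15, the starting rank of 1.0, and the names links, step, and simpleRanks are all assumptions made for illustration, so the numbers need not match what GraphX prints.

```scala
// The lab's edges as (srcId, dstId) pairs and the vertex ids; no Spark needed.
val links: Seq[(Long, Long)] =
  Seq((1L, 2L), (1L, 3L), (2L, 4L), (2L, 5L), (3L, 5L))
val vertexIds: Seq[Long] = Seq(1L, 2L, 3L, 4L, 5L)

// Number of outgoing edges per source vertex.
val outDegree: Map[Long, Int] =
  links.groupBy(_._1).map { case (id, outgoing) => (id, outgoing.size) }

// Assumed reset probability for this sketch.
val resetProb = 0.15

// One iteration: each vertex receives an equal share of rank from every vertex pointing to it.
def step(ranks: Map[Long, Double]): Map[Long, Double] =
  vertexIds.map { v =>
    val incoming = links.collect {
      case (src, dst) if dst == v => ranks(src) / outDegree(src)
    }.sum
    v -> (resetProb + (1.0 - resetProb) * incoming)
  }.toMap

// "Static" variant: start every vertex at 1.0 and run a fixed 10 iterations.
val initialRanks: Map[Long, Double] = vertexIds.map(_ -> 1.0).toMap
val simpleRanks: Map[Long, Double] =
  (1 to 10).foldLeft(initialRanks)((ranks, _) => step(ranks))

simpleRanks.toSeq.sortBy(_._1).foreach(println)
```

A dynamic variant would replace the fixed loop with a check that stops once no rank changes by more than a chosen tolerance. In GraphX the corresponding calls are pageRank (tolerance-based) and staticPageRank (fixed number of iterations).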

We won't worry too much about the details, as we will just introduce the concept. In this case, the pageRank function has already been called on our graph, and the resulting vertices have been collected into a variable called rank. Now go ahead and try to print it out!

Note: rank is a collection, so you will need to use a for loop!

In [35]:
val rank = facebook.pageRank(0.1).vertices.collect

In [36]:
for (rankee <- rank) {
    println(rankee)
}

(1,0.15)
(2,0.21375)
(3,0.21375)
(4,0.21375)
(5,0.34124999999999994)


Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">for (rankee <- rank) {<br>
    println(rankee)<br>
}</font>
</td>
</table>

Alright! So what do these numbers mean? The first value is the ID of the vertex, and the second value is its rank as determined by pageRank. The higher the number, the higher the rank. So it looks like ID = 5 ("Captain America Fan Page") is the most important, which makes sense because it has two followers (the most). 

This is just an introduction to pageRank. If you would like to dive deeper into it, feel free to take a look at the documentation: http://spark.apache.org/docs/latest/graphx-programming-guide.html#pagerank

That's it for this lab. In the next exercise, we will take a look at modifying the graph and how GraphX does it with RDDs, which are immutable!