# <center>Exploring GraphX</center>
## <center>Introduction to Graph-Parallel</center>
### <center>July 18, 2016</center>

<img src="http://spark.apache.org/docs/latest/img/graphx_logo.png" width="600" align="center">

## Welcome to the first lab in the course, Exploring GraphX.

### GraphX is Apache Spark's API for graph and graph-parallel computations.

In this lab exercise, you will learn about the GraphX library and how to build a simple directed multigraph with Scala. We will also explore a few GraphX classes and discuss why they matter.

### Some Notebook Commands
#### In case you haven't dealt with a Jupyter Notebook before, here are some quick, useful commands that may be handy to get started.
<ul>
    <li>Run a cell: CTRL + ENTER</li>
    <li>Create a cell above a cell: a</li>
    <li>Create a cell below a cell: b</li>
    <li>Change a cell to Markdown: m</li>
    <li>Change a cell to code: y</li>
</ul>

<b> If you are interested in more keyboard shortcuts, go to Help -> Keyboard Shortcuts </b>

Hello! Before we start creating our graph, we need to import the following libraries:

- org.apache.spark._ 
- org.apache.spark.graphx._
- org.apache.spark.rdd.RDD 

In [1]:
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">import org.apache.spark.&#95;<br>
import org.apache.spark.graphx.&#95;<br>
import org.apache.spark.rdd.RDD</font>
</td>
</table>

Now to begin, as a reminder, we have a SparkContext called sc.

Next, we will create the "Vertices" of our graph. Let's keep the graph simple and relatable by using "Facebook" as an example. We will create an Array called facebook_vertices that consists of 3 people and 2 pages.

In [3]:
val facebook_vertices = Array((1L, ("Billy Bill", "Person")), (2L, ("Jacob Johnson", "Person")), (3L, ("Andrew Smith", "Person")), (4L, ("Iron Man Fan Page", "Page")), (5L, ("Captain America Fan Page", "Page")))

Here, we are just making a simple array that has 3 People: 
- Billy Bill
- Jacob Johnson
- Andrew Smith

and 2 Pages:

- Iron Man Fan Page
- Captain America Fan Page

These will become our vertices later on. Each vertex carries a unique 64-bit identifier (1L, 2L, 3L, ...) and a user-defined attribute, which here is a (name, type) tuple such as ("Billy Bill", "Person").

Next, we will create the relationships between them. The variable relationships will become the "Edges" of our graph.

In [2]:
val relationships = Array(Edge(1L, 2L, "Friends"), Edge(1L, 3L, "Friends"), Edge(2L, 4L, "Follower"), Edge(2L, 5L, "Follower"), Edge(3L, 5L, "Follower"))

Now we have created another Array called relationships, whose elements are Edges. Each Edge has a srcId (Source ID), a dstId (Destination ID), and an attr (the edge attribute, here the type of relationship). These are the relationships that we created:

- Billy is Friends with Jacob
- Billy is Friends with Andrew
- Jacob is a Follower of the Iron Man Fan Page
- Jacob is a Follower of the Captain America Fan Page
- Andrew is a Follower of the Captain America Fan Page
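
As a small illustration of the three Edge fields described above, the following sketch builds a single Edge and reads each field back. The variable name `friendship` is made up for this example, and the snippet assumes the GraphX library is on the classpath (as it is in this notebook).

```scala
import org.apache.spark.graphx.Edge

// An Edge bundles a source ID, a destination ID, and a user-defined attribute.
val friendship = Edge(1L, 2L, "Friends")

println(friendship.srcId) // 1
println(friendship.dstId) // 2
println(friendship.attr)  // Friends
```

Note that the direction matters: this edge goes from vertex 1 (Billy) to vertex 2 (Jacob), not the other way around.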

Now we have our Vertices (facebook_vertices) and Edges (relationships). However, they are just Arrays. When we create our Graph, these variables need to be RDDs. To create RDDs, we will use the parallelize function of SparkContext (sc). We will also annotate each variable with its RDD type so that the element types are explicit.

In [4]:
val vertexRDD: RDD[(Long, (String, String))] = sc.parallelize(facebook_vertices)
val edgeRDD: RDD[Edge[String]] = sc.parallelize(relationships)
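
If you want to convince yourself that parallelize kept every element, a quick sanity check is to count the resulting RDDs. The sketch below is self-contained and uses made-up names (`localVertices`, `distributedVertices`, and so on) with a smaller data set. It assumes a local Spark installation; `SparkContext.getOrCreate` should simply hand back the notebook's existing context if there is one.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, VertexId}
import org.apache.spark.rdd.RDD

// Reuses the running SparkContext if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("parallelize-sketch"))

// A plain Array lives on the driver; parallelize turns it into a distributed RDD.
val localVertices = Array((1L, ("Billy Bill", "Person")), (2L, ("Jacob Johnson", "Person")))
val distributedVertices: RDD[(VertexId, (String, String))] = sc.parallelize(localVertices)

val localEdges = Array(Edge(1L, 2L, "Friends"))
val distributedEdges: RDD[Edge[String]] = sc.parallelize(localEdges)

println(distributedVertices.count()) // 2
println(distributedEdges.count())    // 1
```

VertexId is GraphX's type alias for Long, which is why RDD[(Long, ...)] and RDD[(VertexId, ...)] can be used interchangeably.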

Now we have our Vertices and Edges in the proper format, but before we define our graph we need to define one more thing: a "fallback" user. Its attribute is assigned by default to any vertex that an edge refers to but that does not exist in our vertices. Let's call it "Self" - since you can be friends with "Yourself" and have a page that follows "Itself."

In [5]:
val defaultvertex = ("Self", "Missing")

This variable is just a tuple. Now we can move on to creating our Graph. We will create a variable called facebook, which will be our instance of Graph, built from 3 arguments - vertexRDD, edgeRDD, and defaultvertex.

In [7]:
val facebook = Graph(vertexRDD, edgeRDD, defaultvertex)

Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">val facebook = Graph(vertexRDD, edgeRDD, defaultvertex)</font>
</td>
</table>

Here's a visual representation created by me to show what the graph should look like:

<img src = "http://i.imgur.com/rhkiopM.png">
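
To see what the fallback vertex attribute actually does, here is a minimal sketch with a deliberately incomplete vertex set. The ID 99L and the names `knownVertices`, `danglingEdges`, `fallback`, and `smallGraph` are hypothetical and exist only for this illustration. The snippet assumes a local Spark installation and reuses the running SparkContext if one is available.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Reuses the running SparkContext if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("default-vertex-sketch"))

// Only vertex 1 is defined, but the edge below also mentions the undefined ID 99.
val knownVertices: RDD[(VertexId, (String, String))] =
  sc.parallelize(Seq((1L, ("Billy Bill", "Person"))))
val danglingEdges: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 99L, "Friends")))

val fallback = ("Self", "Missing")
val smallGraph = Graph(knownVertices, danglingEdges, fallback)

// Vertex 99 was created for us and carries the fallback attribute.
val attrOf99 = smallGraph.vertices
  .filter { case (id, _) => id == 99L }
  .map(_._2)
  .first()
println(attrOf99) // (Self,Missing)
```

In our facebook graph every edge points to a vertex we defined, so the fallback attribute is never used there.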

We did it! We made facebook! (a directed multigraph representing facebook :) ). The Graph we created exposes some interesting components that it built from our parameters. Let's try printing out the vertices component of facebook.

In [11]:
print(facebook.vertices)

VertexRDDImpl[12] at RDD at VertexRDD.scala:57

Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">print(facebook.vertices)</font>
</td>
</table>

These are the vertices of our graph. Note that printing the RDD only shows its description, not its contents. You can do the same for Edges by using the edges component. Try printing it out!

In [12]:
print (facebook.edges)

EdgeRDDImpl[14] at RDD at EdgeRDD.scala:40

Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">print(facebook.edges)</font>
</td>
</table>

Now, what's so important about these two components? You can use them to create views of their respective components of the graph! However, they are slightly different from each other, so we will take a look at vertices first!

Right now, vertices returns every vertex in the graph, so we need to select the ones we want with the filter function. We pattern match each vertex with a case expression that names its attributes, and then define the condition to be met.

In [13]:
facebook.vertices.filter { case (id, (name, user_type)) => user_type == "Person" }.count

3

As you can see, we used the filter function, named the attributes of each vertex, and then gave a condition that only selects a "Person" in our graph. Counting the result gives 3, which matches the 3 people in our graph. We could just as easily have replaced the count function with collect, which returns an Array of tuples, and used a for loop to print out each person.
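
The collect variant mentioned above could look roughly like the sketch below. To keep it self-contained it builds its own small graph; the names `people`, `follows`, `demoGraph`, and `persons` are made up for this example, and the snippet assumes a local Spark installation (it reuses the running SparkContext if one exists).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Reuses the running SparkContext if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("collect-sketch"))

val people: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
  (1L, ("Billy Bill", "Person")),
  (2L, ("Jacob Johnson", "Person")),
  (4L, ("Iron Man Fan Page", "Page"))))
val follows: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(1L, 2L, "Friends"),
  Edge(2L, 4L, "Follower")))
val demoGraph = Graph(people, follows, ("Self", "Missing"))

// collect brings the matching vertices back to the driver as an Array of
// (id, (name, type)) tuples.
val persons = demoGraph.vertices
  .filter { case (_, (_, userType)) => userType == "Person" }
  .collect

// Destructure each tuple in the for loop and print one line per person.
for ((id, (name, _)) <- persons) {
  println(s"$id: $name")
}
```

The order of the printed lines is not guaranteed, because it depends on how the vertices are partitioned. Keep in mind that collect pulls everything to the driver, so it is only a good idea for small results like this one.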

Now let's try the same with Edges. Here the case expression binds only one variable, which is the edge itself. The Edge class has the attributes srcId (source ID), dstId (destination ID), and attr (attribute), which stores the edge property.

Let's see if you are able to use the filter function on facebook.edges to find how many people follow the "Captain America Fan Page".

Hint: The destination will be the Captain America Fan Page's ID and the relationship has to be Follower.

In [17]:
facebook.edges.filter {case edge => (edge.dstId == 5L) && edge.attr == "Follower"}.count()

2

Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">facebook.edges.filter { case (relation) => relation.dstId == 5L && relation.attr == "Follower"}.count</font>
</td>
</table>

The answer should be 2! Now that you have gotten some insight into the Vertices and Edges of the graph, you may be thinking: how can I visualize GraphX? Unfortunately, GraphX does not have any built-in visualizations; it is mainly a parallel graph processing library. The closest option we have for looking at the data is through views, as we did above with Vertices and Edges.

However, there is an easier way to create views, and that is with the EdgeTriplet class. This class contains information about both the Edge and its two Vertices because it logically joins the vertex and edge properties. We will discuss it more later on, but here is a little taste of what EdgeTriplets can do.

In [18]:
val selected = facebook.triplets.filter { case (triplet) => triplet.srcAttr._1 == "Billy Bill"}.collect

In [19]:
for (person <- selected) {
    print(person.srcAttr._1)
    print(" is ")
    print(person.attr)
    print(" with ")
    println(person.dstAttr._1)
}

Billy Bill is Friends with Jacob Johnson
Billy Bill is Friends with Andrew Smith


First we created a variable called selected, which contains the collected triplets whose source is Billy Bill. Then we looped over that collection and printed each of Billy Bill's relationships and with whom. You are able to do much more with the EdgeTriplet class, but that will be discussed later.

Note: selected is an Array, so you can access an individual element by putting its index between parentheses, for example selected(0).
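
Here is a rough sketch of that kind of index access. It builds its own two-vertex graph so that it can run on its own; the names `pair`, `link`, `tinyGraph`, `fromBilly`, and `firstTriplet` are made up for this example. It assumes a local Spark installation and reuses the running SparkContext if one exists.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Reuses the running SparkContext if one exists; otherwise starts a local one.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("triplet-index-sketch"))

val pair: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
  (1L, ("Billy Bill", "Person")),
  (2L, ("Jacob Johnson", "Person"))))
val link: RDD[Edge[String]] = sc.parallelize(Seq(Edge(1L, 2L, "Friends")))
val tinyGraph = Graph(pair, link, ("Self", "Missing"))

// collect returns an Array of EdgeTriplets, so elements are read with parentheses.
val fromBilly = tinyGraph.triplets
  .filter(triplet => triplet.srcAttr._1 == "Billy Bill")
  .collect
val firstTriplet = fromBilly(0)

// Each triplet exposes the source attribute, the edge attribute, and the destination attribute.
println(firstTriplet.srcAttr._1 + " is " + firstTriplet.attr + " with " + firstTriplet.dstAttr._1)
// Billy Bill is Friends with Jacob Johnson
```

Indexing past the end of the Array throws an exception, so it is worth checking the length first when you are not sure how many triplets matched.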

Can you think of the possibilities of the EdgeTriplet class?

Now with that lingering question in your mind, let's see if you can create another graph with the knowledge you have gained!

This time, we will make it a little different: it will just model "real" relationships between people. Let's pick some popular Simpsons characters:

- Homer Simpson -> VertexId = 1
- Bart Simpson -> VertexId = 2
- Marge Simpson -> VertexId = 3
- Milhouse Houten -> VertexId = 4

However, we are going to try to create a vertex RDD called characters all in one step! Let's see if you can combine the two steps we learned earlier!

In [20]:
val characters: RDD[(VertexId, (String, String))] = sc.parallelize(Array((1L, ("Homer Simpson", "Person")), (2L, ("Bart Simpson", "Person")), (3L, ("Marge Simpson", "Person")), (4L, ("Milhouse Houten", "Person"))))

Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">val characters: RDD[(VertexId, (String, String))] = sc.parallelize(Array((1L, ("Homer Simpson", "Person")), (2L, ("Bart Simpson", "Person")), (3L, ("Marge Simpson", "Person")), (4L, ("Milhouse Houten", "Person"))))</font>
</td>
</table>

Awesome! Now let's model some of their relationships (for simplicity's sake, we will only model a few):

- Homer Simpson is the Father of Bart Simpson
- Marge Simpson is the Wife of Homer Simpson
- Bart Simpson is the Friend of Milhouse Houten

We can also create an edge RDD called simpson_relationships in one step! It is done similarly to the previous step, so if you're stuck, take a look there!

In [22]:
val simpson_relationships: RDD[Edge[String]] = sc.parallelize(Array(Edge(1L, 2L, "Father"), Edge(3L, 1L, "Wife"), Edge(2L, 4L, "Friend")))

Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">val simpson_relationships : RDD[Edge[String]] = sc.parallelize(Array(Edge(1L, 2L, "Father"), Edge(3L, 1L, "Wife"), Edge(2L, 4L, "Friend")))</font>
</td>
</table>

Now we will just reuse the defaultvertex variable as our "fallback" user. If you don't have this variable instantiated, then go ahead and scroll up to do so.

Now let's create a graph called the_simpsons from our Vertices (characters), Edges (simpson_relationships), and defaultvertex.

In [26]:
val the_simpsons = Graph(characters, simpson_relationships, defaultvertex)

Highlight over the box below for the answer
<table width="100%" cellspacing="0" cellpadding="0" border="0" align="center" bgcolor="#ff6600">
<td> <font color = "white">val the_simpsons = Graph(characters, simpson_relationships, defaultvertex)</font>
</td>
</table>

Awesome! You have successfully created the_simpsons graph. Keep thinking about the EdgeTriplet class and let your curiosity guide you!

You are now done with this exercise.

In [27]:
val selectedSimpsons = the_simpsons.triplets.filter { case (triplet) => triplet.srcAttr._1 == "Homer Simpson"}.collect

In [30]:
for (each <- selectedSimpsons) {
    print(each.srcAttr._1)
    print(" is ")
    print(each.attr)
    print(" with ")
    println(each.dstAttr._1)
}

Homer Simpson is Father with Bart Simpson
