# GraphFrames Demonstration

## What is GraphFrames?

[GraphFrames](http://graphframes.github.io/user-guide.html) is a DataFrame-based API that allows you to analyze graph data (social graphs, network graphs, property graphs, etc.) using the power of a Spark cluster. It is the successor to the RDD-based [GraphX](http://spark.apache.org/docs/latest/graphx-programming-guide.html) API. For an introduction to GraphFrames, see [this blog post](https://databricks.com/blog/2016/03/03/introducing-graphframes.html).

This notebook contains a simple demonstration of just one small part of the GraphFrames API. Hopefully, it'll whet your appetite.

**NOTE**: Please use a Spark 2.0 cluster to run this notebook.

An RDD-based GraphX demonstration conceptually similar to this demo, with a detailed explanation, can be found here: <http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html>. This version is based on the new GraphFrames API.

The following example creates a _property graph_. A property graph is a _directed_ graph whose vertices and edges can carry attributes (properties). Each edge "points" from one node, or _vertex_, to another, and represents the relationship between the first node and the second node.

The image, at the right, shows an example of a property graph. In that example, a portion of a fictitious dating site, there are four people (the vertices), connected by edges that have the following meanings:

* Valentino likes Angelica; he rated her 9 stars.
* Angelica also gave Valentino 9 stars.
* Mercedes gave Valentino 8 stars.
* Anthony doesn't really like Angelica at all; he gave her 0 stars.
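To make the "directed edge with a property" idea concrete, here is a minimal plain-Python sketch of the four relationships listed above. It does not use GraphFrames; the `toy_ratings` list and the `rating()` helper are hypothetical names introduced only for this illustration.

```python
# Each directed edge is (source, destination, properties).
# The names and star ratings come from the list above; the layout is
# an illustrative assumption, not the GraphFrames representation.
toy_ratings = [
    ("Valentino", "Angelica", {"stars": 9}),
    ("Angelica", "Valentino", {"stars": 9}),
    ("Mercedes", "Valentino", {"stars": 8}),
    ("Anthony", "Angelica", {"stars": 0}),
]


def rating(src, dst):
    """Return the stars src gave dst, or None if src never rated dst."""
    for s, d, props in toy_ratings:
        if (s, d) == (src, dst):
            return props["stars"]
    return None


# Direction matters: Mercedes rated Valentino, but not the other way round.
print(rating("Mercedes", "Valentino"))   # 8
print(rating("Valentino", "Mercedes"))   # None
```

Because the edges are directed, a rating from one person to another says nothing about a rating in the opposite direction.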

<img src="http://i.imgur.com/P8wyWCZ.png" alt="Property Graph Example" style="float: right"/>

Let's expand on this World's Most Unpopular Dating Site just a little, for our example.

## Constructing our graph

In [None]:
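# Import only the names this notebook uses. Wildcard imports from
# pyspark.sql.functions shadow Python built-ins such as sum() and max().
from graphframes import GraphFrame
from pyspark.sql.functions import array, sort_array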

# These are the people. They will become the vertices. Each person record
# has three fields.
#
# - a unique numeric ID
# - a name
# - an age
people = [
  (1, "Angelica", 23),
  (2, "Valentino", 31),
  (3, "Mercedes", 52),
  (4, "Anthony", 39),
  (5, "Roberto", 45),
  (6, "Julia", 34)
]

# This map makes the edge code easier to read.
name_to_id = {name: person_id for person_id, name, _age in people}

# The vertices are the people.
#
# NOTE: Prior to Spark 2.0, you need to use:
# vertices = sqlContext.createDataFrame(people).toDF("id", "name", "age")
vertices = spark.createDataFrame(people).toDF("id", "name", "age")

# The edges connect the people. Each edge holds the rating (0 to 10 stars)
# that the first person (src) gave the second person (dst).
# The edges use the IDs to identify the vertices. GraphFrames expects an "id"
# column in the vertices and "src" and "dst" columns in the edges.
edges = spark.createDataFrame(
  [
    # (ID of the person giving the rating, ID of the person being rated, stars)
    (name_to_id["Valentino"], name_to_id["Angelica"],  9), # Valentino likes Angelica, giving her a 9-star rating.
    (name_to_id["Valentino"], name_to_id["Julia"],     2),
    (name_to_id["Mercedes"],  name_to_id["Valentino"], 8),
    (name_to_id["Mercedes"],  name_to_id["Roberto"],   3),
    (name_to_id["Anthony"],   name_to_id["Angelica"],  0),
    (name_to_id["Roberto"],   name_to_id["Mercedes"],  5),
    (name_to_id["Roberto"],   name_to_id["Angelica"],  7),
    (name_to_id["Angelica"],  name_to_id["Valentino"], 9),
    (name_to_id["Anthony"],   name_to_id["Mercedes"],  1),
    (name_to_id["Julia"],     name_to_id["Anthony"],   8),
    (name_to_id["Anthony"],   name_to_id["Julia"],    10)

  ]
).toDF("src", "dst", "stars")

# The graph is the combination of the vertices and the edges.
g = GraphFrame(vertices, edges)

In [None]:
g.vertices.show()

In [None]:
g.edges.show()

Let's look at the incoming degree of each vertex, which is the number of edges coming _into_ each node (i.e., the number of people who have rated a particular person).

In [None]:
g.inDegrees.show()

We can make that more readable.

In [None]:
in_degrees = g.inDegrees
joined = in_degrees.join(vertices, in_degrees['id'] == vertices['id'])
joined.select(joined['name'], joined['inDegree'].alias('number of ratings')).show()

Let's do the same with outgoing degrees, i.e., the number of ratings each person has made.

In [None]:
out_degrees = g.outDegrees
joined = out_degrees.join(vertices, out_degrees['id'] == vertices['id'])
joined.select(joined['name'], joined['outDegree'].alias('number of ratings')).show()

`degrees` is just the sum of the incoming and outgoing degrees for each vertex. Or, to put it another way, it's the number of times each person's ID appears in an edge, as either the source or the destination.

In [None]:
degrees = g.degrees
joined = degrees.join(vertices, degrees['id'] == vertices['id'])
joined.select(joined['name'], joined['degree'].alias('ratings given or received')).show()
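As a cross-check on the relationship between the three degree measures, the following plain-Python sketch counts them directly from the edge list. It is not the GraphFrames implementation; the `ratings` list simply repeats the edges defined earlier, with names in place of numeric IDs, and the variable names are assumptions made for this illustration.

```python
from collections import Counter

# (src, dst, stars) triples mirroring the edges defined earlier,
# using names instead of numeric IDs for readability.
ratings = [
    ("Valentino", "Angelica", 9),
    ("Valentino", "Julia", 2),
    ("Mercedes", "Valentino", 8),
    ("Mercedes", "Roberto", 3),
    ("Anthony", "Angelica", 0),
    ("Roberto", "Mercedes", 5),
    ("Roberto", "Angelica", 7),
    ("Angelica", "Valentino", 9),
    ("Anthony", "Mercedes", 1),
    ("Julia", "Anthony", 8),
    ("Anthony", "Julia", 10),
]

# In-degree: how often a person appears as the destination of an edge.
in_deg = Counter(dst for _src, dst, _stars in ratings)
# Out-degree: how often a person appears as the source of an edge.
out_deg = Counter(src for src, _dst, _stars in ratings)
# Degree: the sum of the two.
total_deg = in_deg + out_deg

for name in sorted(total_deg):
    print(name, in_deg[name], out_deg[name], total_deg[name])
```

Every edge contributes one incoming and one outgoing count, so the degrees summed over all vertices always equal twice the number of edges.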

## Who likes whom?

We can visually scan the output to find people who really like each other, but only because the graph is so small. In a larger graph, that would be tedious. So, let's see if we can find a way to process the graph programmatically to find people who really like each other. We'll define "really like" as two people who have each rated the other with 8 or more stars.

We want to find sets of two people, each of whom rated the other at least 8. [GraphFrame motif finding](http://graphframes.github.io/user-guide.html#motif-finding) can help here. GraphFrame
motif finding uses a simple string-based Domain-Specific Language (DSL) to express structural queries. For instance, the expression `"(a)-[ab]->(b); (b)-[ba]->(a)"` searches for pairs of vertices,
`a` and `b`, that are connected by edges in both directions, and names those two edges `ab` and `ba`. As it happens, this is _exactly_ what we want.
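To make the semantics of that pattern concrete, here is a plain-Python sketch of the same search over the edge list. It is only an illustration of what the motif matches, not how GraphFrames evaluates it; the `stars` dictionary and the `mutual` and `strong` names are assumptions introduced for this example.

```python
# (src, dst) -> stars, mirroring the edges defined earlier (names, not IDs).
stars = {
    ("Valentino", "Angelica"): 9,
    ("Valentino", "Julia"): 2,
    ("Mercedes", "Valentino"): 8,
    ("Mercedes", "Roberto"): 3,
    ("Anthony", "Angelica"): 0,
    ("Roberto", "Mercedes"): 5,
    ("Roberto", "Angelica"): 7,
    ("Angelica", "Valentino"): 9,
    ("Anthony", "Mercedes"): 1,
    ("Julia", "Anthony"): 8,
    ("Anthony", "Julia"): 10,
}

# Pattern: an edge a -> b AND an edge b -> a.
# Each matching pair shows up twice, once per ordering of (a, b).
mutual = [
    (a, b, ab, stars[(b, a)])
    for (a, b), ab in stars.items()
    if (b, a) in stars
]

# Keep only the pairs where both ratings are 8 stars or more.
strong = [(a, b) for a, b, ab, ba in mutual if ab >= 8 and ba >= 8]

for a, b in strong:
    print(a, "<->", b)
```

Note that the pattern alone also matches pairs such as Mercedes and Roberto, who rated each other but not highly; the star filter is what narrows the result to people who really like each other.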

In [None]:
pairs = g.find("(a)-[ab]->(b); (b)-[ba]->(a)")
match_ups = pairs.filter((pairs['ab.stars'] >= 8) & (pairs['ba.stars'] >= 8))
match_ups.show()

In that output, we can see, for instance, that Valentino likes Angelica (one row) and Angelica likes Valentino (another row). Can we collapse that output down to simple rows of matched pairs?

We can, using a little DataFrame manipulation.

* Use the `array()` function from `pyspark.sql.functions` to pull the two name columns together into a single array column.
* Use the `sort_array()` function to sort that array column. Thus, the two rows with `[Valentino, Angelica]` and `[Angelica, Valentino]` will both end up having the value `[Angelica, Valentino]`.
* Use `distinct` to remove the duplicate rows. That way, we don't end up with two `[Angelica <-> Valentino]` entries.

All three operations can be performed in one statement, but we'll do each step separately, so we can see the transformations.

In [None]:
df1 = match_ups.select(array(match_ups['a.name'], match_ups['b.name']).alias('names'))
df1.show(truncate = False)

In [None]:
df2 = df1.select(sort_array(df1['names']).alias('names'))
df2.show(truncate = False)

In [None]:
final_matchups = df2.distinct()
final_matchups.show(truncate = False)
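The same three steps can be mimicked in plain Python, which may help clarify why sorting has to come before de-duplication. This is only a sketch: the `matched` list is a hard-coded stand-in for the name pairs in `match_ups`, and the other variable names are hypothetical.

```python
# Stand-in for the (a.name, b.name) pairs; each couple appears in both orders.
matched = [
    ("Valentino", "Angelica"),
    ("Angelica", "Valentino"),
    ("Julia", "Anthony"),
    ("Anthony", "Julia"),
]

# Step 1 (like array()): combine the two names into one list per row.
as_arrays = [[a, b] for a, b in matched]

# Step 2 (like sort_array()): sort within each row, so both orderings
# of the same couple become identical.
sorted_arrays = [sorted(names) for names in as_arrays]

# Step 3 (like distinct()): drop the now-identical duplicates.
# Lists are not hashable, so convert each row to a tuple first.
unique_pairs = sorted({tuple(names) for names in sorted_arrays})

print(unique_pairs)
# [('Angelica', 'Valentino'), ('Anthony', 'Julia')]
```

Without the sort in step 2, the two orderings of a couple would count as different rows, and `distinct` would leave both in place.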

Finally, let's pull the matches back to the driver and print them nicely.

In [None]:
for n1, n2 in [row.names for row in final_matchups.collect()]:
  print("{0} and {1} really like each other.".format(n1, n2))