# Lab 14: Spark GraphFrames

This notebook demonstrates examples from the [GraphFrames User Guide](https://graphframes.github.io/graphframes/docs/_site/user-guide.html).

This notebook is intended to run on Databricks, but with some modification it could run on other Spark installations.

## Step 1: Installing custom packages

### Install GraphFrames Spark package


1. Go to **Workspace**, then from the drop-down menu choose **Create** > **Library**.
2. On the page that opens, choose "Maven" as the library source, then under **Coordinates** click **Search Packages**.
3. Type in graphframes and select the graphframes package (the latest version is fine).
4. Wait for the package to be installed on the cluster.


### Install networkx package for graph visualization

GraphFrames has no built-in visualization capabilities, but we can use Python's networkx to visualize graph data.

1. Go to **Workspace**, then from the drop-down menu choose **Create** > **Library**.
2. On the page that opens, choose "PyPI" as the library source.
3. Type "networkx" in the package text box.
4. Click "Create" to install the package.

Run the following to enable visualization. 

The code below visualizes the graph held in a GraphFrame. It is modified from the solution suggested here: https://stackoverflow.com/questions/45720931/pyspark-how-to-visualize-a-graphframe

See [networkx drawing API documentation](https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.drawing.nx_pylab.draw_networkx.html#networkx.drawing.nx_pylab.draw_networkx) for more details.

In [4]:
import networkx as nx
import matplotlib.pyplot as plt

def PlotGraph(g,node_size=300,node_color='r',labels=None):
    # Usage: display(PlotGraph(g)), where g is the given GraphFrame object.
    # node_size can be a single number or an array with one entry per node.
    # node_color can be a string or an array of floats.
    # labels can be a dictionary of text labels keyed by node.
    # See the networkx drawing API documentation linked above for the full option list.
    Gplot=nx.DiGraph()
    # add nodes
    for row in g.vertices.select('id').collect():
        Gplot.add_node(row['id'])
    # add edges
    for row in g.edges.select('src','dst').take(1000):
        Gplot.add_edge(row['src'],row['dst'])

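    # Draw on a fresh figure and return it, so display(PlotGraph(g)) has a figure to render.
    fig = plt.figure()
    nx.draw(Gplot, with_labels=True, alpha=0.8, arrows=True, node_size=node_size, node_color=node_color, labels=labels)
    return fig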


In [5]:
from functools import reduce
from pyspark.sql.functions import col, lit, when
from graphframes import *

## Creating GraphFrames

Users can create GraphFrames from vertex and edge DataFrames.

* Vertex DataFrame: A vertex DataFrame should contain a special column named "id" which specifies unique IDs for each vertex in the graph.
* Edge DataFrame: An edge DataFrame should contain two special columns: "src" (source vertex ID of edge) and "dst" (destination vertex ID of edge).

Both DataFrames can have arbitrary other columns. Those columns can represent vertex and edge attributes.

The example graph built in the cells below also ships with the GraphFrames package and can be loaded directly:

```python
from graphframes.examples import Graphs
same_g = Graphs(spark).friends()
display(PlotGraph(same_g))
```

Create the vertices first:

In [8]:
vertices = spark.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("d", "David", 29),
  ("e", "Esther", 32),
  ("f", "Fanny", 36),
  ("g", "Gabby", 60)], ["id", "name", "age"])

And then some edges:

In [10]:
edges = spark.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
], ["src", "dst", "relationship"])
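
Before building the graph, it can be worth checking that every edge endpoint refers to a known vertex ID. The sketch below does this in plain Python on copies of the IDs above; the names `vertex_ids`, `edge_pairs`, `unknown`, and `isolated` are illustrative and not part of GraphFrames. On a large dataset the same check would more likely be written with DataFrame joins.

```python
# Plain-Python copy of the IDs used in the vertex and edge DataFrames above.
vertex_ids = {"a", "b", "c", "d", "e", "f", "g"}
edge_pairs = [("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
              ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e")]

# Every ID that appears as the source or destination of some edge.
endpoints = {v for pair in edge_pairs for v in pair}

unknown = endpoints - vertex_ids    # endpoints with no matching vertex row
isolated = vertex_ids - endpoints   # vertices that no edge touches

print("unknown endpoints:", sorted(unknown))   # prints: unknown endpoints: []
print("isolated vertices:", sorted(isolated))  # prints: isolated vertices: ['g']
```

Gabby (`g`) has no edges at all, which is why she does not appear in the degree tables and ends up in a connected component of her own.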

Let's create a graph from these vertices and these edges:

In [12]:
g = GraphFrame(vertices, edges)
print(g)

In [13]:
display(PlotGraph(g))

## Basic graph and DataFrame queries

GraphFrames provide several simple graph queries, such as node degree.

Also, since GraphFrames represent graphs as pairs of vertex and edge DataFrames, it is easy to run powerful queries directly on those DataFrames. They are available as the vertices and edges attributes of the GraphFrame.

In [15]:
display(g.vertices)

id,name,age
a,Alice,34
b,Bob,36
c,Charlie,30
d,David,29
e,Esther,32
f,Fanny,36
g,Gabby,60


In [16]:
display(g.edges)

src,dst,relationship
a,b,friend
b,c,follow
c,b,follow
f,c,follow
e,f,follow
e,d,friend
d,a,friend
a,e,friend


The incoming degree of the vertices:

In [18]:
display(g.inDegrees)

id,inDegree
f,1
e,1
d,1
c,2
b,2
a,1


The outgoing degree of the vertices:

In [20]:
display(g.outDegrees)

id,outDegree
f,1
e,2
d,1
c,1
b,1
a,2


The degree of the vertices:

In [22]:
display(g.degrees)

id,degree
f,2
e,3
d,2
c,3
b,3
a,3
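
To make the three degree tables concrete, here is a plain-Python sketch that recomputes them from the edge list with `collections.Counter`. It only illustrates what the numbers mean and is not how GraphFrames computes them; the variable names are made up for this example.

```python
from collections import Counter

# (src, dst) pairs copied from the edges DataFrame above.
edge_pairs = [("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
              ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e")]

in_degree = Counter(dst for _, dst in edge_pairs)    # edges arriving at each vertex
out_degree = Counter(src for src, _ in edge_pairs)   # edges leaving each vertex
degree = in_degree + out_degree                      # total edges touching each vertex

# One line per vertex: id, inDegree, outDegree, degree.
for v in sorted(degree):
    print(v, in_degree[v], out_degree[v], degree[v])
```

Vertex `g` never appears in an edge, so it is absent from all three counters, just as it is absent from the tables above.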


You can run queries directly on the vertices DataFrame. For example, we can find the age of the youngest person in the graph:

In [24]:
# find youngest age
youngest = g.vertices.groupBy().min("age")
display(youngest)

min(age)
29


Likewise, you can run queries on the edges DataFrame. For example, let's count the number of 'follow' relationships in the graph:

In [26]:
numFollows = g.edges.filter("relationship = 'follow'").count()
print("The number of follow edges is", numFollows)

## Motif finding

Using motifs you can build more complex relationships involving edges and vertices. The following cell finds the pairs of vertices with edges in both directions between them. The result is a DataFrame, in which the column names are given by the motif keys.

Check out the [GraphFrame User Guide](http://graphframes.github.io/user-guide.html#motif-finding) for more details on the API.

In [28]:
# Search for pairs of vertices with edges in both directions between them.
motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a)")
display(motifs)

a,e,b,e2
"List(c, Charlie, 30)","List(c, b, follow)","List(b, Bob, 36)","List(b, c, follow)"
"List(b, Bob, 36)","List(b, c, follow)","List(c, Charlie, 30)","List(c, b, follow)"
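
The motif above can be read as a pattern match over the edge list: bind `a` and `b` to any two vertices such that both `a -> b` and `b -> a` exist. A plain-Python sketch of that reading follows (an illustration of the semantics only, with made-up variable names, not the GraphFrames implementation).

```python
# (src, dst) pairs copied from the edges DataFrame above.
edge_pairs = [("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
              ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e")]
edge_set = set(edge_pairs)

# A match binds (a, b) such that both a -> b and b -> a are edges.
mutual = sorted((a, b) for (a, b) in edge_set if (b, a) in edge_set)

print(mutual)  # prints: [('b', 'c'), ('c', 'b')]
```

The same pair shows up twice, once for each way of assigning the two vertices to `a` and `b`, which matches the two rows in the output above.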


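Find directed triangles. The motif itself does not filter on the relationship type; in this graph the only triangle happens to consist of friend edges. Each triangle is reported once per rotation of its vertices, so the single triangle a -> e -> d -> a appears three times.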

In [30]:
# find triangles, i.e. a -> b -> c -> a
triangles = g.find("(a)-[e]->(b); (b)-[e2]->(c); (c)-[e3]->(a)")
display(triangles)

a,e,b,e2,c,e3
"List(d, David, 29)","List(d, a, friend)","List(a, Alice, 34)","List(a, e, friend)","List(e, Esther, 32)","List(e, d, friend)"
"List(a, Alice, 34)","List(a, e, friend)","List(e, Esther, 32)","List(e, d, friend)","List(d, David, 29)","List(d, a, friend)"
"List(e, Esther, 32)","List(e, d, friend)","List(d, David, 29)","List(d, a, friend)","List(a, Alice, 34)","List(a, e, friend)"
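
If you want to count distinct triangles rather than motif matches, the rotations need to be collapsed. The plain-Python sketch below reproduces the three bindings and then de-duplicates them by vertex set. This is one simple approach chosen for illustration; it would also merge two triangles that use the same three vertices in opposite directions, which does not occur in this graph.

```python
from itertools import product

# (src, dst) pairs copied from the edges DataFrame above.
edge_pairs = [("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
              ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e")]
edge_set = set(edge_pairs)
nodes = sorted({v for pair in edge_pairs for v in pair})

# All bindings (a, b, c) with a -> b, b -> c and c -> a, as the motif matches them.
bindings = [(a, b, c) for a, b, c in product(nodes, repeat=3)
            if (a, b) in edge_set and (b, c) in edge_set and (c, a) in edge_set]

# Collapse rotations by keeping only the set of vertices in each match.
unique_triangles = {frozenset(t) for t in bindings}

print(bindings)               # prints: [('a', 'e', 'd'), ('d', 'a', 'e'), ('e', 'd', 'a')]
print(len(unique_triangles))  # prints: 1
```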


## Subgraphs

GraphFrames provides APIs for building subgraphs by filtering on edges and vertices. These filters can be composed. For example, the following subgraph keeps only the 'friend' edges and the people who are connected by at least one of them.

In [32]:
g2 = g.filterEdges("relationship = 'friend'").dropIsolatedVertices()

In [33]:
display(PlotGraph(g2))
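
As a rough check on what ends up in `g2`, the two steps can be mimicked in plain Python: keep the edges whose relationship is 'friend', then keep the vertices that still touch an edge. This is a sketch of the intended effect with illustrative variable names, not the GraphFrames implementation.

```python
vertex_ids = ["a", "b", "c", "d", "e", "f", "g"]
edges = [("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
         ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
         ("d", "a", "friend"), ("a", "e", "friend")]

# Step 1: keep only the 'friend' edges (the filterEdges step).
friend_edges = [(s, d, rel) for s, d, rel in edges if rel == "friend"]

# Step 2: keep only vertices that still touch an edge (the dropIsolatedVertices step).
touched = {v for s, d, _ in friend_edges for v in (s, d)}
remaining = [v for v in vertex_ids if v in touched]

print(len(friend_edges))  # prints: 4
print(remaining)          # prints: ['a', 'b', 'd', 'e']
```

Note that Bob (`b`) stays in the subgraph even though his only friend edge points at him rather than away from him.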

## Standard graph algorithms

GraphFrames comes with a number of standard graph algorithms built in:
* Breadth-first search (BFS)
* Connected components
* Strongly connected components
* Label Propagation Algorithm (LPA)
* PageRank (regular and personalized)
* Shortest paths
* Triangle count

## Connected components

Compute the connected component membership of each vertex and return a DataFrame with each vertex assigned a component ID. The GraphFrames connected components implementation can take advantage of checkpointing to improve performance, so the cell below sets a checkpoint directory before running it.

In [36]:
sc.setCheckpointDir("/tmp/graphframes-example-connected-components")
result = g.connectedComponents()
display(result)

id,name,age,component
a,Alice,34,412316860416
b,Bob,36,412316860416
c,Charlie,30,412316860416
d,David,29,412316860416
e,Esther,32,412316860416
f,Fanny,36,412316860416
g,Gabby,60,146028888064
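
The component IDs in the table are arbitrary labels; what matters is which vertices share one. The plain-Python sketch below groups the same vertices with a small union-find structure, treating edges as undirected. Union-find is used here purely for illustration and is not a description of the algorithm GraphFrames runs; the names `parent`, `find`, and `components` are made up for this example.

```python
vertex_ids = ["a", "b", "c", "d", "e", "f", "g"]
edge_pairs = [("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
              ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e")]

# Each vertex starts as the representative of its own component.
parent = {v: v for v in vertex_ids}

def find(v):
    # Follow parent links to the representative, shortening the path as we go.
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v

for src, dst in edge_pairs:
    # Direction is ignored: an edge in either direction joins two components.
    parent[find(src)] = find(dst)

# Group vertices by their representative.
components = {}
for v in vertex_ids:
    components.setdefault(find(v), []).append(v)

print(sorted(components.values()))  # prints: [['a', 'b', 'c', 'd', 'e', 'f'], ['g']]
```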


## PageRank

Identify important vertices in a graph based on connections.

In [38]:
results = g.pageRank(resetProbability=0.15, tol=0.01)
display(results.vertices)

id,name,age,pagerank
g,Gabby,60,0.1799821386239711
b,Bob,36,2.655507832863289
e,Esther,32,0.3708523318767607
a,Alice,34,0.4491063370653874
f,Fanny,36,0.3283606792049851
d,David,29,0.3283606792049851
c,Charlie,30,2.6878300011606218
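
For intuition about where these scores come from, here is a minimal textbook-style power iteration in plain Python. It is a sketch under stated assumptions, not the GraphFrames implementation: it normalizes scores to sum to 1 (the scores in the table above sum to roughly the number of vertices), it spreads the rank of vertices without out-edges evenly over all vertices, and it runs a fixed number of iterations instead of using a tolerance. Only the ordering of the vertices should be compared with the table. The function name `pagerank` and its parameters are hypothetical.

```python
vertex_ids = ["a", "b", "c", "d", "e", "f", "g"]
edge_pairs = [("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"),
              ("e", "f"), ("e", "d"), ("d", "a"), ("a", "e")]

def pagerank(vertex_ids, edge_pairs, reset_probability=0.15, iterations=100):
    n = len(vertex_ids)
    out_links = {v: [] for v in vertex_ids}
    for src, dst in edge_pairs:
        out_links[src].append(dst)

    rank = {v: 1.0 / n for v in vertex_ids}
    for _ in range(iterations):
        # Rank held by vertices without out-edges is spread evenly over all vertices.
        dangling = sum(rank[v] for v in vertex_ids if not out_links[v])
        new_rank = {v: (reset_probability + (1 - reset_probability) * dangling) / n
                    for v in vertex_ids}
        # Every vertex passes the rest of its rank in equal shares along its out-edges.
        for v, targets in out_links.items():
            for t in targets:
                new_rank[t] += (1 - reset_probability) * rank[v] / len(targets)
        rank = new_rank
    return rank

ranks = pagerank(vertex_ids, edge_pairs)
for v, r in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(v, round(r, 4))
```

As in the table above, Charlie (`c`) and Bob (`b`) come out on top because they point at each other and receive rank from the rest of the graph, while the isolated vertex `g` receives only the reset share.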


In [39]:
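# Scale node size by inverse PageRank, so lower-ranked vertices are drawn larger.
# Sizes are looked up by vertex id so that they line up with the node order PlotGraph uses.
rank = {r['id']: r['pagerank'] for r in results.vertices.select('id', 'pagerank').collect()}
display(PlotGraph(g, node_size=[100 / rank[r['id']] for r in g.vertices.select('id').collect()]))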