# PySpark GraphFrame example

GraphFrame tutorials: https://graphframes.github.io/graphframes/docs/_site/user-guide.html

## Important notes

Before starting `jupyter-notebook --no-browser`, run the following commands in your terminal to set the required environment variables. If you started the notebook without them, stop it in the terminal, run the commands, and start the notebook again.

The first two variables are Spark paths. The last one is the Python 3 path.

```
export SPARK_HOME=/home/ubuntu/spark-3.0.3-bin-hadoop3.2
export PYTHONPATH=/home/ubuntu/spark-3.0.3-bin-hadoop3.2/python
export PYSPARK_PYTHON=/usr/bin/python3
```

**Your machine might have different Spark paths depending on your OS username and Spark location. Do NOT copy and paste these lines blindly.**

**Your machine might have a different Python 3 path. You can find it by running `whereis python3` in the terminal. Usually the first path listed is the one you want.**
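
A quick check from Python can confirm that the three variables are visible to the notebook process. The sketch below is only one way to do it; `REQUIRED_VARS` and `missing_spark_vars` are names made up for this example and are not part of Spark.

```python
import os

# The three variables exported in the commands above.
REQUIRED_VARS = ("SPARK_HOME", "PYTHONPATH", "PYSPARK_PYTHON")


def missing_spark_vars(environ=None):
    """Return the names of required variables that are unset or empty.

    `environ` defaults to the real process environment; a plain dict
    can be passed instead for testing (hypothetical helper).
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_spark_vars()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("All three variables are set.")
```

If anything is reported as not set, the notebook was most likely started from a terminal where the `export` commands had not been run.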

## Create Spark Session

In [6]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[*]") \
    .appName("Python Spark GraphFrame example") \
    .config("spark.jars.packages", "graphframes:graphframes:0.8.1-spark3.0-s_2.12") \
    .getOrCreate()

## Cluster mode and pseudo mode

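`local[*]` runs Spark in local (pseudo-distributed) mode on this machine, using all available CPU cores.

To enable cluster mode, use a `spark://IP-address:port` master URL instead. You can find this URL in the Spark web UI, for example `spark://JIAYU1AB6.localdomain:7077`.

In cluster mode, the notebook distributes its computation tasks to your Spark cluster.

Make sure you shut down and restart this notebook whenever you switch modes. `getOrCreate()` returns the session that is already running, so a changed master URL would otherwise not take effect.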
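
One way to make switching modes less error-prone is to choose the master URL in a single place. The following is a minimal sketch under stated assumptions: the environment variable `GRAPHFRAME_DEMO_MASTER` and the helpers `resolve_master` and `build_session` are invented for this example, and Spark itself does not read that variable.

```python
import os


def resolve_master(environ=None):
    """Return the master URL to use.

    Reads the hypothetical GRAPHFRAME_DEMO_MASTER variable and falls
    back to local mode with all cores when it is unset or empty.
    """
    env = os.environ if environ is None else environ
    return env.get("GRAPHFRAME_DEMO_MASTER") or "local[*]"


def build_session(master):
    """Create (or reuse) a SparkSession for the given master URL."""
    # Imported here so the helper above works even without PySpark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(master)
        .appName("Python Spark GraphFrame example")
        .config("spark.jars.packages",
                "graphframes:graphframes:0.8.1-spark3.0-s_2.12")
        .getOrCreate()
    )


print("Master URL:", resolve_master())
```

In a freshly restarted notebook, `spark = build_session(resolve_master())` would then pick up either the cluster URL or `local[*]`.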

## Create a GraphFrame

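A GraphFrame consists of two DataFrames: one of vertices and one of edges. The vertex DataFrame needs a unique ID column named `id`, and the edge DataFrame needs `src` and `dst` columns.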
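
To illustrate the two-DataFrame structure without any files, the sketch below builds the same kind of graph from in-memory rows. The rows are the three users and three relationships used throughout this notebook. `VERTEX_ROWS`, `EDGE_ROWS`, `dangling_endpoints`, and `build_graph` are names invented for this example, and `build_graph` assumes an existing `SparkSession` is passed in.

```python
# Vertex rows: (id, name, age). Edge rows: (src, dst, relationship).
VERTEX_ROWS = [("a", "Alice", 34), ("b", "Bob", 36), ("c", "Charlie", 30)]
EDGE_ROWS = [("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow")]


def dangling_endpoints(vertex_rows, edge_rows):
    """Return sorted edge endpoints that have no matching vertex id."""
    ids = {row[0] for row in vertex_rows}
    endpoints = {end for src, dst, _ in edge_rows for end in (src, dst)}
    return sorted(endpoints - ids)


def build_graph(spark):
    """Build a GraphFrame from the in-memory rows above."""
    # Imported lazily so the check above runs without Spark installed.
    from graphframes import GraphFrame

    v = spark.createDataFrame(VERTEX_ROWS, ["id", "name", "age"])
    e = spark.createDataFrame(EDGE_ROWS, ["src", "dst", "relationship"])
    return GraphFrame(v, e)


print("Edge endpoints without a vertex:",
      dangling_endpoints(VERTEX_ROWS, EDGE_ROWS))
```

The `dangling_endpoints` check is optional; it is included only because an edge that points at a missing vertex ID is an easy mistake to make when preparing your own CSV files.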

### Download users.csv to your local path

Ubuntu:

```
wget https://raw.githubusercontent.com/DataOceanLab/CPTS-415-Project-Examples/main/users.csv
```

macOS:

```
curl -O https://raw.githubusercontent.com/DataOceanLab/CPTS-415-Project-Examples/main/users.csv
```

### Download relationships.csv to your local path

Ubuntu:

```
wget https://raw.githubusercontent.com/DataOceanLab/CPTS-415-Project-Examples/main/relationships.csv
```

macOS:

```
curl -O https://raw.githubusercontent.com/DataOceanLab/CPTS-415-Project-Examples/main/relationships.csv
```

In [2]:
# Create a Vertex DataFrame with unique ID column "id"
v = spark.read.option("inferSchema", "true").option("delimiter", ",").option("header", "true").csv("users.csv")
v.show()

# Create an Edge DataFrame with "src" and "dst" columns
e = spark.read.option("inferSchema", "true").option("delimiter", ",").option("header", "true").csv("relationships.csv")
e.show()

# Create a GraphFrame from the vertex and edge DataFrames
from graphframes import GraphFrame
g = GraphFrame(v, e)

g.vertices.show()
g.edges.show()

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  a|  Alice| 34|
|  b|    Bob| 36|
|  c|Charlie| 30|
+---+-------+---+

+---+---+------------+
|src|dst|relationship|
+---+---+------------+
|  a|  b|      friend|
|  b|  c|      follow|
|  c|  b|      follow|
+---+---+------------+

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  a|  Alice| 34|
|  b|    Bob| 36|
|  c|Charlie| 30|
+---+-------+---+

+---+---+------------+
|src|dst|relationship|
+---+---+------------+
|  a|  b|      friend|
|  b|  c|      follow|
|  c|  b|      follow|
+---+---+------------+



## Get degree statistics of each vertex

In [3]:
# Query: Get in-degree of each vertex.
g.inDegrees.show()

# Query: Get out-degree of each vertex.
g.outDegrees.show()

# Query: Get degree of each vertex
g.degrees.show()

# Query: Get the max degree among all vertices
g.degrees.agg({'degree': 'max'}).show()

+---+--------+
| id|inDegree|
+---+--------+
|  c|       1|
|  b|       2|
+---+--------+

+---+---------+
| id|outDegree|
+---+---------+
|  c|        1|
|  b|        1|
|  a|        1|
+---+---------+

+---+------+
| id|degree|
+---+------+
|  c|     2|
|  b|     3|
|  a|     1|
+---+------+

+-----------+
|max(degree)|
+-----------+
|          3|
+-----------+
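
The degree tables above can be reproduced by hand from the three edges, which makes it clear what each column counts. The sketch below uses plain Python rather than Spark; `EDGES` simply restates the `src` and `dst` columns of `relationships.csv` as shown earlier.

```python
from collections import Counter

# (src, dst) pairs from the edge table shown earlier.
EDGES = [("a", "b"), ("b", "c"), ("c", "b")]

# In-degree counts how often a vertex appears as a destination.
in_degree = Counter(dst for _, dst in EDGES)
# Out-degree counts how often a vertex appears as a source.
out_degree = Counter(src for src, _ in EDGES)
# Degree is the sum of the two.
degree = in_degree + out_degree

print(sorted(in_degree.items()))
print(sorted(out_degree.items()))
print(sorted(degree.items()))
print(max(degree.values()))
```

Vertex `a` has no incoming edges, so it never appears as a destination. That is why it is absent from the in-degree table rather than listed with a count of 0.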


## Query: Count the number of "follow" connections in the graph.

In [4]:
g.edges.filter("relationship = 'follow'").count()

2

## Run the PageRank algorithm and show the results

In [5]:
results = g.pageRank(resetProbability=0.01, maxIter=3)
results.vertices.select("id", "pagerank").show()

+---+--------+
| id|pagerank|
+---+--------+
|  a|    0.01|
|  b|1.980199|
|  c|1.009801|
+---+--------+
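
To see where these numbers come from, the iteration can be worked through in plain Python. This is a sketch under stated assumptions, not the GraphFrames implementation: it assumes every rank starts at 1.0, ranks are not normalized, and all vertices are updated at once in each iteration. The function name `pagerank` and its parameters are chosen for this example. With `reset_probability=0.01` and `max_iter=3`, these assumptions give the same values as the table above for this small graph.

```python
def pagerank(edges, vertices, reset_probability, max_iter):
    """Run a fixed number of unnormalized PageRank iterations."""
    # Number of outgoing edges per vertex.
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1

    # Assumption: every vertex starts with rank 1.0.
    ranks = {v: 1.0 for v in vertices}
    for _ in range(max_iter):
        # Each vertex splits its current rank evenly over its out-edges.
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        # New rank = reset share + damped sum of incoming contributions.
        ranks = {
            v: reset_probability + (1.0 - reset_probability) * incoming[v]
            for v in vertices
        }
    return ranks


ranks = pagerank(
    edges=[("a", "b"), ("b", "c"), ("c", "b")],
    vertices=["a", "b", "c"],
    reset_probability=0.01,
    max_iter=3,
)
for vertex in sorted(ranks):
    print(vertex, round(ranks[vertex], 6))
```

Vertex `a` receives no incoming rank, so it stays at the reset probability of 0.01, while `b` and `c` pass most of their rank back and forth between each other.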

