### GraphFrames

Agenda:
* Creating vertices and edges
* Viewing properties of a GraphFrame
* Graph filtering
* Motifs - finding patterns
* Graph Algorithms

In [0]:
# GraphFrames jar needs to be installed
# https://spark-packages.org/package/graphframes/graphframes
import graphframes as gf

spark.conf.set("spark.sql.shuffle.partitions", sc.defaultParallelism)

help(gf)
# GraphFrame = 2 dataframes: vertices and edges

In [0]:
# Let's load in some sample data
fraud_df = spark.read.csv("/mnt/training/fraud/paysim-fraud-detection.csv", header=True, inferSchema=True)
display(fraud_df)

#### Vertices

* The vertices DataFrame must contain an **id** column

In [0]:
import pyspark.sql.functions as F

fraud_vertices = (fraud_df
                  .select(F.col("nameOrig").alias("id"))
                  .union(fraud_df
                        .select(F.col("nameDest").alias("id")))
                  .distinct()
)

display(fraud_vertices)

#### Edges

* The edges DataFrame must contain **src** and **dst** columns

In [0]:
fraud_edges = (fraud_df
               .select(F.col("nameOrig").alias("src")
                      ,F.col("nameDest").alias("dst")
                      ,F.col("*"))
               .drop("nameOrig", "nameDest")
)

display(fraud_edges)
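
To make the two transformations above concrete without a Spark cluster, here is a minimal plain-Python sketch of the same idea. The rows and the `amount` values are hypothetical; only the `nameOrig` / `nameDest` column names come from the cells above.

```python
# Hypothetical transaction rows; only the column names mirror the dataset.
transactions = [
    {"nameOrig": "C1", "nameDest": "M1", "amount": 10.0},
    {"nameOrig": "C2", "nameDest": "M1", "amount": 5.0},
    {"nameOrig": "C1", "nameDest": "C2", "amount": 7.5},
]

# Vertices: every distinct account that appears as sender or receiver.
vertex_ids = sorted({t[key] for t in transactions for key in ("nameOrig", "nameDest")})

# Edges: one per transaction, with the endpoints renamed to src and dst
# and the remaining column kept as an edge attribute.
edges = [
    {"src": t["nameOrig"], "dst": t["nameDest"], "amount": t["amount"]}
    for t in transactions
]

print(vertex_ids)
print(edges[0])
```

An account that both sends and receives money (`C2` here) still appears only once among the vertices, which is what the `union(...).distinct()` step achieves in the Spark version.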

In [0]:
# Let's create our first GraphFrame

fraud_graph = gf.GraphFrame(fraud_vertices, fraud_edges)

fraud_vertices.cache()
fraud_edges.cache()

display(fraud_graph)

#### Viewing properties of a GraphFrame

In [0]:
# All of these return a Spark DataFrame

display(fraud_graph.vertices) # same as our created dataframe
#display(fraud_graph.edges) # same as our created edges
#display(fraud_graph.degrees) # total edges connected to a vertex
#display(fraud_graph.inDegrees) # incoming edges
#display(fraud_graph.outDegrees) # outgoing edges
#display(fraud_graph.triplets) # source / edge / destination combined
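
As a rough illustration of what the three degree properties count, the following plain-Python sketch (not GraphFrames code) computes them on a hypothetical toy edge list.

```python
from collections import Counter

# Hypothetical toy edge list of (src, dst) pairs; not taken from the fraud dataset.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c")]

out_degrees = Counter(src for src, _ in toy_edges)  # outgoing edges per vertex
in_degrees = Counter(dst for _, dst in toy_edges)   # incoming edges per vertex
degrees = out_degrees + in_degrees                  # all edges touching a vertex

print(dict(out_degrees))
print(dict(in_degrees))
print(dict(degrees))
```

Note that vertex `a` has no incoming edge and is therefore simply missing from `in_degrees`. As far as I know, `inDegrees` and `outDegrees` in GraphFrames likewise omit vertices whose count would be zero.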

### Graph filtering

In [0]:
# filtering vertices
fraud_graph_filtered_v = fraud_graph.filterVertices("id == 'C1525806158'")
display(fraud_graph_filtered_v.vertices)
#display(fraud_graph_filtered_v.edges)

In [0]:
# filtering edges
fraud_graph_filtered_e = fraud_graph.filterEdges("isFraud == 1")
display(fraud_graph_filtered_e.vertices)
#display(fraud_graph_filtered_e.edges)

In [0]:
# Note that filterEdges keeps all vertices, even those that are not on an edge with isFraud == 1
print(f"""
Count of edges: {fraud_graph_filtered_e.edges.count()}
Count of vertices: {fraud_graph_filtered_e.vertices.count()}
""")
      
print(f"Original fraud edge count: {fraud_edges.filter('isFraud == 1').count()}")

# If you want to remove orphaned vertices, combine it with dropIsolatedVertices()

# fraud_graph_filtered_e_clean = (fraud_graph
#                                 .filterEdges("isFraud == 1")
#                                 .dropIsolatedVertices()
#                                )
# print(f"""
# Count of edges: {fraud_graph_filtered_e_clean.edges.count()}
# Count of vertices: {fraud_graph_filtered_e_clean.vertices.count()}
# """)
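
The difference between filtering edges and additionally dropping isolated vertices can be sketched in plain Python. This is only an analogue of the behaviour described above on a hypothetical toy graph, not the GraphFrames implementation.

```python
# Hypothetical toy graph; the third tuple element mimics the isFraud edge column.
vertices = {"a", "b", "c", "d"}
edges = [("a", "b", 1), ("b", "c", 0), ("c", "d", 0)]

# filterEdges analogue: keep only matching edges, leave the vertex set untouched.
kept_edges = [(s, d, f) for s, d, f in edges if f == 1]

# dropIsolatedVertices analogue: keep only vertices that still touch an edge.
connected = {v for s, d, _ in kept_edges for v in (s, d)}
kept_vertices = vertices & connected

print(len(vertices), len(kept_edges), sorted(kept_vertices))
```

After the edge filter alone, all four vertices are still present; only the second step reduces the vertex set to the endpoints of the remaining edge.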
  

### Motifs

(vertex)-[edge]->(vertex)

In [0]:
# Names in a motif bind vertices and edges to columns of the result: b is the same vertex in both patterns below, while c may or may not be the same vertex as a
# A semicolon separates multiple patterns that must all match
# Filters on attributes (e.g. isFraud below) are applied to the resulting DataFrame

money_launderers_df = (fraud_graph
                       .find("(a)-[e1]->(b); (b)-[e2]->(c)")
                       .filter("e1.isFraud == 1 AND e2.isFraud == 0")
                      )

display(money_launderers_df)
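
Conceptually, the two-hop motif above behaves like a self-join of the edge list on the shared middle vertex. The sketch below shows that idea in plain Python on hypothetical toy edges; it is an illustration of the pattern semantics, not of how GraphFrames executes the query.

```python
# Hypothetical toy edges as (src, dst, is_fraud); not taken from the dataset.
edges = [("a", "b", 1), ("b", "c", 0), ("b", "a", 0), ("c", "d", 1)]

# "(a)-[e1]->(b); (b)-[e2]->(c)": pair every edge e1 with every edge e2
# that starts where e1 ends, then filter on the edge attributes.
two_hops = [
    (e1, e2)
    for e1 in edges
    for e2 in edges
    if e1[1] == e2[0]                  # shared middle vertex b
    and e1[2] == 1 and e2[2] == 0      # e1 is fraud, e2 is not
]

paths = [(e1[0], e1[1], e2[1]) for e1, e2 in two_hops]
print(paths)
```

The second match ends at the vertex it started from, which is the case where `c` happens to be the same vertex as `a`.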

In [0]:
# Vertices and edges can be left unnamed, e.g. () or [] - an unnamed element is matched but left out of the resulting DataFrame

outgoing_edges_df = (fraud_graph
                       .find("(a)-[edge]->()")
                      )

display(outgoing_edges_df)

## Graph Algorithms

### PageRank

In [0]:
# Let's load in a smaller example graph 

from graphframes.examples import Graphs
g = Graphs(sqlContext).friends()
display(g.vertices)
#display(g.edges)
#display(g.triplets)

In [0]:
g_pagerank = g.pageRank(resetProbability=0.15, maxIter=10)
display(g_pagerank.vertices)
#display(g_pagerank.edges)
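
For intuition about what `resetProbability` and `maxIter` control, here is a minimal sketch of textbook PageRank by power iteration on a hypothetical toy graph. The function name and the handling of dangling vertices are assumptions of this sketch; the scaling of the scores may differ from what GraphFrames returns, so compare orderings rather than raw values.

```python
def pagerank(vertices, edges, reset_probability=0.15, max_iter=10):
    """Textbook PageRank by power iteration; the ranks sum to 1."""
    n = len(vertices)
    ranks = {v: 1.0 / n for v in vertices}
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    for _ in range(max_iter):
        # Every vertex receives the "random jump" share first.
        new_ranks = {v: reset_probability / n for v in vertices}
        for v, targets in out_links.items():
            if targets:
                # Split the remaining rank evenly over outgoing edges.
                share = (1 - reset_probability) * ranks[v] / len(targets)
                for t in targets:
                    new_ranks[t] += share
            else:
                # Assumption: a dangling vertex spreads its rank over all vertices.
                for t in vertices:
                    new_ranks[t] += (1 - reset_probability) * ranks[v] / n
        ranks = new_ranks
    return ranks


# Hypothetical toy graph: c is linked from both a and b, b only from a.
toy_vertices = ["a", "b", "c"]
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = pagerank(toy_vertices, toy_edges)
print(sorted(ranks, key=ranks.get, reverse=True))
```

A higher `reset_probability` pulls all scores towards the uniform value, while more iterations let the scores settle closer to their fixed point.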

### Triangle count (3-clique)

In [0]:
g_trianglecount = g.triangleCount()
display(g_trianglecount)

In [0]:
# Let's add a few more edges to see some triangles

new_edges = [
  {"src": "c",
  "dst": "a"
  },
  {"src": "c",
   "dst": "e" 
  }
]

new_edges_df = spark.createDataFrame(new_edges)

all_edges_df = (g.edges
               .unionByName(new_edges_df, allowMissingColumns=True))

new_g = gf.GraphFrame(g.vertices, all_edges_df)

display(new_g.triplets)

In [0]:
new_g_trianglecount = new_g.triangleCount()
display(new_g_trianglecount)
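
The per-vertex count can be reproduced by brute force on a small graph. The sketch below assumes that edge direction is ignored when looking for triangles, and it uses a hypothetical toy graph rather than the friends graph.

```python
from itertools import combinations


def triangle_counts(vertices, edges):
    """Count, per vertex, the triangles it belongs to, ignoring edge direction."""
    neighbours = {v: set() for v in vertices}
    for src, dst in edges:
        if src != dst:  # ignore self-loops
            neighbours[src].add(dst)
            neighbours[dst].add(src)

    counts = {v: 0 for v in vertices}
    # Check every unordered triple of vertices once.
    for a, b, c in combinations(sorted(vertices), 3):
        if b in neighbours[a] and c in neighbours[a] and c in neighbours[b]:
            for v in (a, b, c):
                counts[v] += 1
    return counts


# Hypothetical toy graph: a, b, c form a triangle; d hangs off c.
toy_vertices = ["a", "b", "c", "d"]
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
counts = triangle_counts(toy_vertices, toy_edges)
print(counts)
```

Checking all triples is cubic in the number of vertices, so this is only suitable for sanity-checking results on tiny graphs.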

### Label propagation

In [0]:
g_labelprop = g.labelPropagation(maxIter=5)
display(g_labelprop)

In [0]:
# Label propagation is cheap to compute, but convergence is not guaranteed and the result can be trivial (e.g. everything ends up in one community). E.g.:
new_g_labelprop = new_g.labelPropagation(maxIter=5)
display(new_g_labelprop)
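
The idea behind label propagation can be sketched in a few lines of plain Python: every vertex starts with its own label and repeatedly adopts the most common label among its neighbours. The synchronous update and the tie-breaking rule (smallest label wins) are assumptions made here to keep the sketch deterministic; GraphFrames may resolve ties differently.

```python
from collections import Counter


def label_propagation(vertices, edges, max_iter=5):
    """Synchronous label propagation, treating edges as undirected."""
    neighbours = {v: set() for v in vertices}
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    labels = {v: v for v in vertices}  # every vertex starts in its own community
    for _ in range(max_iter):
        new_labels = {}
        for v in vertices:
            votes = Counter(labels[n] for n in neighbours[v])
            if not votes:  # an isolated vertex keeps its label
                new_labels[v] = labels[v]
                continue
            top = max(votes.values())
            # Assumption: break ties by the smallest label to stay deterministic.
            new_labels[v] = min(label for label, count in votes.items() if count == top)
        labels = new_labels
    return labels


# Hypothetical toy graph: two triangles with no edge between them.
toy_vertices = ["a", "b", "c", "x", "y", "z"]
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("x", "y"), ("y", "z"), ("z", "x")]
labels = label_propagation(toy_vertices, toy_edges)
print(labels)
```

On this toy graph the two triangles end up as two communities. On other graphs a synchronous update like this one can oscillate instead of settling, which is one reason the result should be treated with care.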

### Breadth-first search

In [0]:
g_bfs = g.bfs("name = 'Esther'", "age < 32")
display(g_bfs)

In [0]:
new_g_bfs = new_g.bfs("name = 'Esther'", "age > 29 and age < 36 and name != 'Esther'")
display(new_g_bfs)
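
The two arguments used above are a start condition and a target condition. A minimal plain-Python sketch of the same search on a hypothetical toy graph looks like this; it returns a single shortest path, whereas GraphFrames returns the matching paths as rows of a DataFrame.

```python
from collections import deque


def bfs_path(edges, start, is_target):
    """Return one shortest directed path from start to a vertex satisfying is_target."""
    out_links = {}
    for src, dst in edges:
        out_links.setdefault(src, []).append(dst)

    queue = deque([[start]])  # queue of paths, shortest first
    visited = {start}
    while queue:
        path = queue.popleft()
        if is_target(path[-1]):
            return path
        for nxt in out_links.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # no vertex satisfying is_target is reachable


# Hypothetical toy graph with a short and a long route from a to c.
toy_edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "e"), ("e", "c")]
shortest = bfs_path(toy_edges, "a", lambda v: v == "c")
print(shortest)
```

Because edges are followed in their stored direction only, searching from `c` back to `a` in this toy graph finds nothing.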

### Further reading

https://graphframes.github.io/graphframes/docs/_site/user-guide.html  
https://docs.databricks.com/spark/latest/graph-analysis/graphframes/user-guide-python.html  
https://blog.devgenius.io/graph-modeling-in-pyspark-using-graphframes-part-1-e7cb42099182

### Task
Dataset: Star Wars Social Network  
https://www.kaggle.com/datasets/ruchi798/star-wars?resource=download&select=starwars-full-interactions-allCharacters-merged.json

Todo:
* Preprocess the data:
  * create a dataframe for vertices
  * create a dataframe for edges. Please make the edges undirected 
    * hint: the data is currently directed, but actually the direction has no meaning for this dataset
* Create a GraphFrame using the vertices and undirected edges
* Run PageRank on the GraphFrame.
  * Order by PageRank score, descending. Discuss: why do the values and the PageRank scores not correlate across characters?
* Using motifs, find characters who never appear together with Luke but appear at least 5 times together with a character who appears at least once with Luke.
  * E.g. Luke never appears in a scene together with Padme, but both characters appear in scenes with R2-D2.

In [0]:
# Your solution:
