### Task
Dataset: Star Wars Social Network  
https://www.kaggle.com/datasets/ruchi798/star-wars?resource=download&select=starwars-full-interactions-allCharacters-merged.json

Todo:
* Preprocess the data:
  * create a dataframe for vertices
  * create a dataframe for edges. Please make the edges undirected 
    * hint: the data is currently directed, but actually the direction has no meaning for this dataset
* Create a GraphFrame using the vertices and undirected edges
* Run pagerank on top of the GraphFrame. 
  * Order by pagerank, descending. Discuss why the characters' values and pageranks do not correlate.
* Using motifs, find characters who never appear together with Luke but appear at least 5 times together with a character who appears at least once with Luke. 
  * E.g. Luke never appears in a scene with Padme, but both characters appear in scenes with R2-D2.
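
The motif condition above is a two-hop query: start at Luke, step to a character `b` who shares a scene with him, then step to a character `c` who shares at least 5 scenes with `b` and none with Luke. A minimal plain-Python sketch of that logic follows. It assumes a made-up toy edge list (the scene counts are not taken from the dataset), and the helper name `two_hop_only` is hypothetical.

```python
from collections import defaultdict

# Made-up toy scene counts, NOT taken from the Kaggle dataset.
toy_edges = [
    ("LUKE", "R2-D2", 10),
    ("R2-D2", "PADME", 7),
    ("LUKE", "HAN", 8),
    ("HAN", "LEIA", 6),
    ("LUKE", "LEIA", 9),
    ("R2-D2", "JAR JAR", 2),
]


def two_hop_only(edges, start, min_scenes=5):
    """Return characters two hops from `start` that never appear with `start`."""
    # Build an undirected adjacency map: neighbours[x][y] = shared scene count.
    neighbours = defaultdict(dict)
    for src, dst, scenes in edges:
        neighbours[src][dst] = scenes
        neighbours[dst][src] = scenes

    direct = set(neighbours[start])  # everyone who appears with `start`
    result = set()
    for b in direct:
        for c, scenes in neighbours[b].items():
            # c must not be the start, must not appear with the start,
            # and must share at least `min_scenes` scenes with b.
            if c != start and c not in direct and scenes >= min_scenes:
                result.add(c)
    return result


found = two_hop_only(toy_edges, "LUKE")
print(found)
```

In this toy graph LEIA is excluded because she appears with Luke directly, and JAR JAR is excluded because he shares fewer than 5 scenes with R2-D2.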

In [None]:
# Your solution:
import pyspark.sql.functions as F
import graphframes as gf


df = spark.read.json("dbfs:/FileStore/shared_uploads/starwars_full_interactions_allCharacters_merged.json", multiLine=True)

# create the dataframe with edges
edges_df = (df
           .select(F.explode("links").alias("links"))
           .select(F.col("links.source").alias("src")
                  ,F.col("links.target").alias("dst")
                  ,F.col("links.value")))

# make the edges undirected: add a reversed copy (dst -> src) of every edge
edges_df = (edges_df
           .unionByName(edges_df
                       .select(F.col("dst").alias("src")
                              ,F.col("src").alias("dst")
                              ,F.col("value")))
           )

# create the dataframe with vertices
vertices_df = (df
              .select(F.posexplode("nodes"))
              .select(F.col("pos").alias("id")
                     ,F.col("col.name")
                     ,F.col("col.colour")
                     ,F.col("col.value"))
              )

# create the graphframe
starwars_graph = gf.GraphFrame(vertices_df, edges_df)

# cache both dataframes: PageRank and the motif search below reuse them
vertices_df.cache()
edges_df.cache()

pagerank_result = starwars_graph.pageRank(resetProbability=0.15, maxIter=10)
# display(pagerank_result.vertices.orderBy(F.col("pagerank").desc()))
# PageRank top 5: Darth Vader, Obi-Wan, C-3PO, Padme, Luke.
# Reasoning:
# Some characters (e.g. Padme, Qui-Gon) appear in fewer scenes but share them with many different characters: low "value", high "pagerank".
# Other characters (Luke, Han, R2-D2) appear in many scenes but only with a handful of other characters: high "value", but not as high a "pagerank".


# two-hop paths Luke -> b -> c, where b and c share at least 5 scenes
starwars_motif_df = (starwars_graph
                       .find("(a)-[e1]->(b); (b)-[e2]->(c)")
                       .filter("a.name = 'LUKE'") # start every path at Luke
                       .filter("a.id != c.id AND e2.value >= 5") # c must not be Luke, and b and c must share at least 5 scenes
                    )

# ids of the intermediate characters (b); every one of them appears with Luke
b_ids = [row["id"] for row in starwars_motif_df.select("b.id").distinct().collect()]

# drop every c that is itself one of those b characters, i.e. that appears with Luke
starwars_motif_df = (starwars_motif_df
                    .filter(~F.col("c.id").isin(b_ids)))

two_steps_from_luke_df = starwars_motif_df.select("c.*").distinct()

#display(two_steps_from_luke_df)
# 15 distinct characters
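
The reasoning about `value` versus `pagerank` can be illustrated on a toy graph. The sketch below is a plain-Python, unweighted PageRank by power iteration, not the GraphFrames implementation. It assumes that `pageRank` does not use the `value` column as a weight (worth verifying against the GraphFrames documentation), and GraphFrames may normalise its scores differently. The node names, scene counts, and the `pagerank` helper are hypothetical; `damping=0.85` mirrors `resetProbability=0.15`.

```python
def pagerank(neighbours, damping=0.85, iterations=50):
    """Unweighted PageRank by power iteration on an undirected graph."""
    n = len(neighbours)
    ranks = {node: 1.0 / n for node in neighbours}
    for _ in range(iterations):
        new_ranks = {}
        for node in neighbours:
            # Each neighbour splits its rank evenly over all of its own edges.
            incoming = sum(ranks[nb] / len(neighbours[nb]) for nb in neighbours[node])
            new_ranks[node] = (1 - damping) / n + damping * incoming
        ranks = new_ranks
    return ranks


# Hypothetical toy data: BUSY has by far the most scenes but a single partner,
# while HUB has few scenes spread over four different partners.
scene_counts = {"HUB": 5, "BUSY": 50, "A": 20, "B": 1, "C": 1, "D": 1}
neighbours = {
    "HUB": ["A", "B", "C", "D"],
    "A": ["HUB", "BUSY"],
    "B": ["HUB"],
    "C": ["HUB"],
    "D": ["HUB"],
    "BUSY": ["A"],
}

ranks = pagerank(neighbours)
by_rank = sorted(ranks, key=ranks.get, reverse=True)
by_scenes = sorted(scene_counts, key=scene_counts.get, reverse=True)
print("highest pagerank:", by_rank[0])
print("most scenes:", by_scenes[0])
```

Under these assumptions the score depends only on who is connected to whom, so the character with the most scenes is not the one with the highest rank.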