# PySpark GraphX

GraphX exposes Scala and Java APIs, but it has no Python bindings, so it cannot be used directly from PySpark.
To work with graphs in PySpark, the usual approach is to use GraphFrames, a graph processing library built on top of Spark DataFrames that is available in Python.

In [1]:
from pyspark.sql import SparkSession  # Import the SparkSession class from PySpark.

# Initialize a SparkSession, the entry point to programming Spark with the Dataset and DataFrame API.
# .builder: starts building a SparkSession.
# .appName("ReadFromCassandra"): sets the application name, which appears in logs and the Spark UI.
# .config("spark.jars.packages", ...): Maven coordinates of the Cassandra connector and GraphFrames packages for Spark to fetch.
# .config("spark.cassandra.connection.host" / "spark.cassandra.connection.port", ...): address of the Cassandra node.
# .getOrCreate(): returns the existing SparkSession if there is one, otherwise creates a new one.

spark = SparkSession.builder \
    .appName("ReadFromCassandra") \
    .config("spark.jars.packages", "com.datastax.spark:spark-cassandra-connector_2.12:3.4.0,graphframes:graphframes:0.8.2-spark3.1-s_2.12") \
    .config("spark.cassandra.connection.host", "127.0.0.1") \
    .config("spark.cassandra.connection.port", "9042") \
    .getOrCreate()

## What is GraphFrames?
GraphFrames extends DataFrames with graph algorithms and graph abstractions (vertices, edges).
It’s built on top of DataFrames, so you can easily integrate with other PySpark workflows.

In [2]:
from graphframes import GraphFrame

In [3]:
# Load the customers table from Cassandra into a Spark DataFrame called df

df = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="customers", keyspace="pyspark_keyspace") \
    .load()

## Vertices:
These represent the nodes (or entities) in the graph. Each vertex typically has:
- a unique ID (like a primary key)
- associated attributes (like name, age, etc.)

## Edges:
These represent the connections (or relationships) between vertices. Each edge has:
- a source vertex ID (src)
- a destination vertex ID (dst)
- an optional attribute (like the type of relationship)
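
To make the vertex and edge shapes concrete, here is a minimal plain-Python sketch (no Spark required). The sample rows and the helper `dangling_edges` are hypothetical and only illustrate the idea that every `src` and `dst` should refer to an existing vertex `id`.

```python
# Hypothetical in-memory stand-ins for the vertices and edges tables.
vertices = [
    {"id": "c1", "first_name": "John", "last_name": "Brown"},
    {"id": "c2", "first_name": "Jane", "last_name": "Brown"},
    {"id": "c3", "first_name": "Bob", "last_name": "Miller"},
]
edges = [
    {"src": "c1", "dst": "c2", "relationship": "same_family_name"},
    {"src": "c2", "dst": "c1", "relationship": "same_family_name"},
]


def dangling_edges(vertices, edges):
    """Return the edges whose src or dst is not a known vertex id."""
    known_ids = {v["id"] for v in vertices}
    return [e for e in edges if e["src"] not in known_ids or e["dst"] not in known_ids]


print(dangling_edges(vertices, edges))  # no dangling edges in this sample
```

A check like this can be useful before building a graph, because an edge that points at a missing vertex usually signals a data problem upstream.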

In [4]:
# Create vertices (must have 'id' column)
vertices = df.selectExpr("customer_id as id", "first_name", "last_name", "address", "assets_value", "salary")

The goal is to connect customers who share the same family name (last_name).
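
As a rough illustration of what pairing customers by family name produces, the plain-Python sketch below groups hypothetical `(customer_id, last_name)` rows and emits every ordered pair of distinct customers within a group. The sample data and the function name `same_family_edges` are assumptions made for this example; the notebook itself does the pairing with a DataFrame self-join.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical sample rows: (customer_id, last_name).
customers = [("c1", "Brown"), ("c2", "Brown"), ("c3", "Brown"), ("c4", "Miller")]


def same_family_edges(customers):
    """Return directed (src, dst) pairs for distinct customers sharing a last name."""
    by_name = defaultdict(list)
    for customer_id, last_name in customers:
        by_name[last_name].append(customer_id)
    edges = []
    for ids in by_name.values():
        # permutations(ids, 2) yields every ordered pair of distinct ids,
        # so both (a, b) and (b, a) are produced.
        edges.extend(permutations(ids, 2))
    return edges


edges = same_family_edges(customers)
print(len(edges))  # 3 Browns give 3 * 2 = 6 edges; the single Miller gives none
```

A group of n customers with the same family name yields n * (n - 1) directed edges, so large groups dominate the edge count.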

In [9]:
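# Create edges: self-join on last_name to pair distinct customers with the same family name.
# The join condition is symmetric, so each pair appears in both directions (A -> B and B -> A).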
from pyspark.sql import functions as F

edges = df.alias("df1").join(
    df.alias("df2"),
    (F.col("df1.last_name") == F.col("df2.last_name")) & (F.col("df1.customer_id") != F.col("df2.customer_id"))
).select(
    F.col("df1.customer_id").alias("src"),
    F.col("df2.customer_id").alias("dst"),
    F.col("df1.first_name").alias("src_first_name"),
    F.col("df2.first_name").alias("dst_first_name"),
    F.col("df1.last_name").alias("last_name"),
    F.lit("same_family_name").alias("relationship")
).dropDuplicates()

In [10]:
# Show the generated edges
edges.show()

+--------------------+--------------------+--------------+--------------+---------+----------------+
|                 src|                 dst|src_first_name|dst_first_name|last_name|    relationship|
+--------------------+--------------------+--------------+--------------+---------+----------------+
|1f3fcf8a-d218-45c...|63c6073c-8af1-414...|          John|       Michael|    Brown|same_family_name|
|1f3fcf8a-d218-45c...|47150df6-af41-434...|          John|          John|    Brown|same_family_name|
|1f3fcf8a-d218-45c...|6a35e9e0-59cf-4f9...|          John|          Jane|    Brown|same_family_name|
|1f3fcf8a-d218-45c...|5c18aff6-5c5a-4a8...|          John|        Olivia|    Brown|same_family_name|
|1f3fcf8a-d218-45c...|0ae62535-bb04-443...|          John|          Jane|    Brown|same_family_name|
|1f3fcf8a-d218-45c...|2c038c58-4074-461...|          John|         David|    Brown|same_family_name|
|1f3fcf8a-d218-45c...|2779860d-d659-427...|          John|         David|    Brown|same_fam

In [11]:
from graphframes import GraphFrame

# Create the graph
g = GraphFrame(vertices, edges)



In [13]:
# Show the vertices
print("Vertices:")
g.vertices.show()

Vertices:
+--------------------+----------+---------+--------------------+------------+-------+
|                  id|first_name|last_name|             address|assets_value| salary|
+--------------------+----------+---------+--------------------+------------+-------+
|2b78763c-ceed-4a0...|      Jane|   Taylor|9330 Oak St, Chicago|   555194.86| 6552.5|
|d5e3cd2b-610b-48f...|       Bob|   Miller|9635 Elm St, Houston|   390784.68|5150.91|
|2176e702-a2e3-4ad...|    Sophia|      Doe|1293 Main St, Chi...|   675510.48|7455.35|
|549324a6-37fd-48b...|    Olivia|    Davis|2433 Oak St, Phoenix|   501835.93|5618.59|
|6a35e9e0-59cf-4f9...|      Jane|    Brown|5050 Main St, New...|   930475.79|5771.22|
|68d56238-3bc8-4bc...|      John|   Miller|2809 Oak St, Los ...|   908318.48|4867.72|
|de040c86-0422-4f7...|     Emily|    Smith|7840 Main St, New...|   111917.41|7307.63|
|14f7fbdd-fa82-4bb...|     Emily|   Taylor|5999 Main St, Hou...|   514902.95| 3216.6|
|7fc8ce80-b21f-4b3...|     David|   Taylor|2

In [14]:
# Show the edges
print("Edges:")
g.edges.show()

Edges:
+--------------------+--------------------+--------------+--------------+---------+----------------+
|                 src|                 dst|src_first_name|dst_first_name|last_name|    relationship|
+--------------------+--------------------+--------------+--------------+---------+----------------+
|63c6073c-8af1-414...|1f3fcf8a-d218-45c...|       Michael|          John|    Brown|same_family_name|
|63c6073c-8af1-414...|5c18aff6-5c5a-4a8...|       Michael|        Olivia|    Brown|same_family_name|
|63c6073c-8af1-414...|5f7dddd9-7486-435...|       Michael|          John|    Brown|same_family_name|
|63c6073c-8af1-414...|2779860d-d659-427...|       Michael|         David|    Brown|same_family_name|
|63c6073c-8af1-414...|2bb56d74-6e87-41a...|       Michael|         Emily|    Brown|same_family_name|
|63c6073c-8af1-414...|0fb80c1c-901c-480...|       Michael|       Michael|    Brown|same_family_name|
|63c6073c-8af1-414...|47150df6-af41-434...|       Michael|          John|    Brown|s

# PageRank

PageRank measures how important each node is in a graph, based on the idea that a node is important if other important nodes point to it.

Simple example: imagine a group of friends: Alice, Bob, and Charlie. If everyone talks about Alice, Alice is probably important.
PageRank is like counting how many people mention or "recommend" each other, where a mention from an important person counts for more.
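
The idea can be sketched in a few lines of plain Python. This is a simplified, unnormalized iteration written for illustration, not the GraphFrames implementation, which may treat normalization and vertices without out-edges differently. The function name `pagerank` and the sample edges are assumptions for this example.

```python
def pagerank(edges, vertices, reset_probability=0.15, max_iter=5):
    """Unnormalized PageRank: every vertex starts at 1.0 and is updated max_iter times."""
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1
    ranks = {v: 1.0 for v in vertices}
    for _ in range(max_iter):
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            # Each vertex splits its current rank evenly across its out-edges.
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = {v: reset_probability + (1 - reset_probability) * incoming[v] for v in vertices}
    return ranks


# Bob and Charlie both point to Alice, so Alice ends up with the highest score.
mentions = [("Bob", "Alice"), ("Charlie", "Alice")]
print(pagerank(mentions, ["Alice", "Bob", "Charlie"]))

# Three friends who all point to each other get identical scores.
clique = [(a, b) for a in "ABC" for b in "ABC" if a != b]
print(pagerank(clique, ["A", "B", "C"]))
```

When every member of a group points to every other member, as in a graph built from shared family names with edges in both directions, this update leaves all scores equal. That is consistent with PageRank values close to 1.0 for every customer.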

In [16]:
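# resetProbability=0.15 is the chance of jumping to a random vertex (a damping factor of 0.85);
# maxIter=5 runs a fixed number of five iterations.
pagerank = g.pageRank(resetProbability=0.15, maxIter=5)
pagerank.vertices.show()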

+--------------------+----------+---------+--------------------+------------+-------+------------------+
|                  id|first_name|last_name|             address|assets_value| salary|          pagerank|
+--------------------+----------+---------+--------------------+------------+-------+------------------+
|1125910a-60ba-481...|     Emily|    Smith|8495 Main St, Pho...|   696736.52|8470.16|0.9999999999999996|
|1011d0c2-2e47-43a...|    Olivia|      Doe|3108 Oak St, Los ...|   655369.65|9708.16|0.9999999999999998|
|d920e993-7616-41c...|    Sophia| Williams|4517 Pine Ave, Ch...|   159208.98|5307.29|1.0000000000000002|
|18f478a5-2b59-4b2...|     Emily| Williams|6367 Main St, Pho...|   607651.75|6815.21|1.0000000000000002|
|86d99ab8-3f72-4fe...|    Sophia|    Smith|725 Main St, Los ...|   505158.31|7464.95|0.9999999999999996|
|2bb52b82-4f81-421...|     Alice| Williams|935 Main St, Los ...|   445681.62|7320.36|1.0000000000000002|
|1f3fcf8a-d218-45c...|      John|    Brown|295 Maple Bl

# Triangle Count

Triangle Count reports, for each node, how many triangles it belongs to. A triangle is 3 people (or nodes) who are all connected to each other.

Simple example:
- Alice is friends with Bob.
- Bob is friends with Charlie.
- Alice is also friends with Charlie.

That's a triangle: everyone in it knows each other.
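
A per-vertex triangle count can be sketched in plain Python as follows. The function name `triangle_counts` and the sample edges are assumptions for illustration; the sketch ignores edge direction and is not how GraphFrames computes the result internally.

```python
from itertools import combinations


def triangle_counts(edges):
    """Count, for each vertex, the triangles it belongs to (edge direction ignored)."""
    neighbors = {}
    for src, dst in edges:
        if src == dst:
            continue  # self-loops cannot form a triangle
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)
    counts = {}
    for v, nbrs in neighbors.items():
        # A triangle through v is a pair of v's neighbors that are also connected to each other.
        counts[v] = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
    return counts


friends = [("Alice", "Bob"), ("Bob", "Charlie"), ("Alice", "Charlie")]
print(triangle_counts(friends))  # each of the three friends is in exactly one triangle

# In a fully connected group of n vertices, each vertex is in (n - 1) * (n - 2) / 2 triangles.
group_of_12 = list(combinations(range(12), 2))
print(triangle_counts(group_of_12)[0])  # 11 * 10 / 2 = 55
```

If each family-name group is fully connected, a count of 55 would correspond to a group of 12 customers and a count of 15 to a group of 7.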

In [18]:
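# triangleCount() returns the vertices with an added "count" column:
# the number of triangles each vertex belongs to.
triangles = g.triangleCount()
triangles.show()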

+-----+--------------------+----------+---------+--------------------+------------+-------+
|count|                  id|first_name|last_name|             address|assets_value| salary|
+-----+--------------------+----------+---------+--------------------+------------+-------+
|   55|2da63cc1-8e2d-49f...|      Jane|    Davis|2539 Maple Blvd, ...|   885284.37|6039.09|
|   15|b36b1045-184b-49e...|      Jane|      Doe|3190 Main St, Hou...|   413022.51|3504.42|
|   28|fb6edfff-3187-478...|     Alice|   Garcia|2998 Pine Ave, Ph...|   907752.67|5457.94|
|   15|a045bef9-b281-442...|    Olivia|      Doe|3786 Oak St, Los ...|   744804.54|8491.88|
|   55|d652f7b4-2c5c-4ea...|      Jane|   Taylor|9849 Maple Blvd, ...|   343330.21|5051.17|
|   45|5c18aff6-5c5a-4a8...|    Olivia|    Brown|3902 Oak St, New ...|   314844.16|9021.98|
|   15|48a04704-1533-478...|   Charlie|  Johnson|9564 Pine Ave, Lo...|   722351.04|9520.67|
|  153|6ab68490-3d3f-4cc...|    Olivia| Williams|2248 Maple Blvd, ...|   839865.