<a href="https://colab.research.google.com/github/Theseyh/Big-Data-Framework/blob/main/BDF_11_Graph_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 00 - Configuration of Apache Spark on Colaboratory


### Installing Java, Spark, and Findspark


---


This code installs Apache Spark 3.2.3, Java 8, and [Findspark](https://github.com/minrk/findspark), a library that makes it easy for Python to find Spark.

In [21]:
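import os

# Spark release to install; also used to build the download URL below
os.environ["SPARK_VERSION"] = "spark-3.2.3"
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# -nc skips the download if the archive is already present (avoids .tgz.1, .tgz.2 copies)
!wget -nc https://archive.apache.org/dist/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop2.7.tgz
!tar xf $SPARK_VERSION-bin-hadoop2.7.tgz
!echo $SPARK_VERSION-bin-hadoop2.7.tgz
!pip install -q findspark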

Hit:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:3 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:4 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Hit:5 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:6 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Fetched 128 kB in 1s (92.9 kB/s)
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)

In [22]:
!wget https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.2-s_2.12/graphframes-0.8.2-spark3.2-s_2.12.jar

--2024-11-27 10:37:12--  https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.2-s_2.12/graphframes-0.8.2-spark3.2-s_2.12.jar
Resolving repos.spark-packages.org (repos.spark-packages.org)... 3.168.132.80, 3.168.132.114, 3.168.132.68, ...
Connecting to repos.spark-packages.org (repos.spark-packages.org)|3.168.132.80|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 247880 (242K) [binary/octet-stream]
Saving to: ‘graphframes-0.8.2-spark3.2-s_2.12.jar’


2024-11-27 10:37:12 (7.12 MB/s) - ‘graphframes-0.8.2-spark3.2-s_2.12.jar’ saved [247880/247880]



### Set Environment Variables
Set the locations where Spark and Java are installed.

In [23]:
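import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark/"
os.environ["DRIVE_DATA"] = "/content/gdrive/My Drive/Big Data Framework/data/"
# Each "!" command runs in its own subshell, so "!export" would not persist;
# extend PATH through os.environ instead.
os.environ["PATH"] += ":/content/spark/bin:/content/spark/sbin"

# Point /content/spark at the unpacked distribution (-f: no error if the link does not exist yet)
!rm -f /content/spark
!ln -s /content/$SPARK_VERSION-bin-hadoop2.7 /content/spark

# Make the GraphFrames jar available on Spark's classpath
!mv graphframes-0.8.2-spark3.2-s_2.12.jar /content/spark/jars/

!echo $SPARK_HOME
!env | grep "DRIVE_DATA"

!ls -l /content/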

/content/spark/
DRIVE_DATA=/content/gdrive/My Drive/Big Data Framework/data/
total 799428
drwx------  7 root root      4096 Nov 27 10:16 gdrive
drwxr-xr-x  1 root root      4096 Nov 25 19:13 sample_data
lrwxrwxrwx  1 root root        34 Nov 27 10:37 spark -> /content/spark-3.2.3-bin-hadoop2.7
drwxr-xr-x 13  501 1000      4096 Nov 14  2022 spark-3.2.3-bin-hadoop2.7
-rw-r--r--  1 root root 272866820 Nov 14  2022 spark-3.2.3-bin-hadoop2.7.tgz
-rw-r--r--  1 root root 272866820 Nov 14  2022 spark-3.2.3-bin-hadoop2.7.tgz.1
-rw-r--r--  1 root root 272866820 Nov 14  2022 spark-3.2.3-bin-hadoop2.7.tgz.2
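
Before starting Spark it can help to confirm that the configured locations really exist. The following is a minimal sketch, not part of the original notebook; the helper name `missing_locations` is hypothetical.

```python
import os

def missing_locations(names, environ=os.environ):
    """Return the variables that are unset or do not point to an existing directory.

    Hypothetical helper: `names` is a list of environment variable names.
    """
    problems = {}
    for name in names:
        path = environ.get(name)
        if not path:
            problems[name] = "not set"
        elif not os.path.isdir(path):
            problems[name] = f"no such directory: {path}"
    return problems

# An empty dict means every listed variable points to an existing directory
print(missing_locations(["JAVA_HOME", "SPARK_HOME"]))
```

If the printed dictionary is not empty, the paths assigned above probably do not match what was actually installed.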


### Start a SparkSession
This will start a local Spark session.

In [24]:
!python -V

import findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext.getOrCreate()

sc.addPyFile('/content/spark/jars/graphframes-0.8.2-spark3.2-s_2.12.jar')

# Example: shows the PySpark version
print("PySpark version {0}".format(sc.version))

# Example: parallelise an array and show the first 2 elements
sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)

Python 3.10.12
PySpark version 3.2.3


[2, 3]

In [25]:
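from pyspark.sql import SparkSession
# We create a SparkSession object (or we retrieve it if it is already created)
# Note: spark.jars.packages only takes effect if it is set before the JVM starts;
# here the jar was already copied into $SPARK_HOME/jars, so it is on the classpath anyway.
spark = SparkSession \
    .builder \
    .appName("My application") \
    .config("spark.jars.packages", "graphframes:graphframes:0.8.2-spark3.2-s_2.12") \
    .master("local[4]") \
    .getOrCreate()
# We get the SparkContext
sc = spark.sparkContext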

In [26]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).



---


# 11 - Graph processing

## GraphX: Graph processing with RDDs

Parallel graph programming using Spark

- Main abstraction: [*Graph*](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.Graph)
    -   Directed multigraph with properties assigned to vertices and edges
    -   It is an extension of RDDs
- It includes graph constructors, basic operators (*reverse*, *subgraph*…) and graph algorithms (*PageRank*, *Triangle Counting*…)
- Only available in Scala.
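
To make the idea of a directed property multigraph concrete, here is a small pure-Python sketch. It is only an analogy to the GraphX abstraction, not its API: the class `PropertyGraph`, its fields and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class PropertyGraph:
    """Toy directed multigraph with properties on vertices and edges."""
    vertices: dict = field(default_factory=dict)  # vertex id -> property
    edges: list = field(default_factory=list)     # (src, dst, property); parallel edges allowed

    def reverse(self):
        # Same vertices, every edge flipped
        return PropertyGraph(dict(self.vertices), [(d, s, p) for s, d, p in self.edges])

    def subgraph(self, vpred=lambda vid, prop: True, epred=lambda s, d, p: True):
        # Keep the vertices accepted by vpred, then the edges accepted by epred
        # whose two endpoints both survived
        kept = {vid: prop for vid, prop in self.vertices.items() if vpred(vid, prop)}
        kept_edges = [(s, d, p) for s, d, p in self.edges
                      if s in kept and d in kept and epred(s, d, p)]
        return PropertyGraph(kept, kept_edges)

# Hypothetical data: two parallel edges between vertices 1 and 2
g = PropertyGraph(
    vertices={1: "Ann", 2: "Ben", 3: "Cat"},
    edges=[(1, 2, "friend"), (1, 2, "colleague"), (2, 3, "friend")],
)
friends = g.subgraph(epred=lambda s, d, p: p == "friend")
print(g.reverse().edges)
print(friends.edges)
```

The two edges from vertex 1 to vertex 2 are what makes this a multigraph: they share endpoints but carry different properties.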

Documentation: [spark.apache.org/docs/latest/graphx-programming-guide.html](http://spark.apache.org/docs/latest/graphx-programming-guide.html)

API: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.package

## Graphs in GraphX
<img src="http://persoal.citius.usc.es/tf.pena/TCDM/figs/grapxgraph.png" alt="Graph in GraphX" style="width: 50px;"/>
(Source: M.S. Malak, R. East "Spark GraphX in action", Manning, 2016)

### Example of a simple graph
<img src="http://persoal.citius.usc.es/tf.pena/TCDM/figs/simpsonsgraph.png" alt="The Simpsons graph" style="width: 600px;"/>
(Source: P. Zecević, M. Bonaći "Spark in action", Manning, 2017)

## GraphFrames: Graph processing with DataFrames

In Python we can use [*GraphFrames*](https://graphframes.github.io/graphframes/docs/_site/quick-start.html) which wraps GraphX algorithms under the DataFrames API, providing a Python interface.

- Support for multiple languages is in the works
    - For now, available for Scala and Python
- Not yet integrated into Spark
    - Available as an external package (https://spark-packages.org/package/graphframes/graphframes)

More information:
- Project web: https://graphframes.github.io/graphframes/docs/_site/
- Python API : https://graphframes.github.io/graphframes/docs/_site/api/python/index.html
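
Because GraphFrames is an external package, Spark has to be told about it before the JVM starts. One common way, sketched here as an alternative to copying the jar by hand, is the `PYSPARK_SUBMIT_ARGS` environment variable with the `--packages` option. The helper name `graphframes_submit_args` is hypothetical, and the default versions are simply the ones used elsewhere in this notebook.

```python
import os

def graphframes_submit_args(version="0.8.2", spark="3.2", scala="2.12"):
    """Build a PYSPARK_SUBMIT_ARGS value that pulls GraphFrames from the package repository."""
    # Maven-style coordinate: groupId:artifactId:version
    coordinate = f"graphframes:graphframes:{version}-spark{spark}-s_{scala}"
    # The value must end with "pyspark-shell"
    return f"--packages {coordinate} pyspark-shell"

# Must be set before the first SparkContext / SparkSession is created
os.environ["PYSPARK_SUBMIT_ARGS"] = graphframes_submit_args()
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

The variable is only read when the JVM is launched, so setting it after a session already exists has no effect.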


### Graphs using pyspark and GraphFrames

In [27]:
# The following example shows how to create a GraphFrame, query it, and run the PageRank algorithm.
# Source: https://graphframes.github.io/graphframes/docs/_site/quick-start.html

# Import only what is needed instead of "import *"
from graphframes import GraphFrame

# Create a Vertex DataFrame with unique ID column "id"
# (spark.createDataFrame replaces the deprecated SQLContext entry point)
v = spark.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = spark.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
], ["src", "dst", "relationship"])

# Create a GraphFrame
g = GraphFrame(v, e)

# Query: Get in-degree of each vertex.
g.inDegrees.show()

# Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship = 'follow'").count()

# Run PageRank algorithm, and show results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()



+---+--------+
| id|inDegree|
+---+--------+
|  b|       2|
|  c|       1|
+---+--------+

+---+------------------+
| id|          pagerank|
+---+------------------+
|  c|1.8994109890559092|
|  b|1.0905890109440908|
|  a|              0.01|
+---+------------------+
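
To see where the values in the table above come from, here is a pure-Python sketch of a simplified PageRank iteration: every vertex starts with rank 1.0 and is then repeatedly updated to `reset + (1 - reset) * sum(rank[src] / out_degree[src])` over its incoming edges. This is an illustration under stated assumptions, not the GraphFrames implementation: the function name `pagerank_sketch` is hypothetical, and it ignores vertices without outgoing edges and any final normalisation the library may apply.

```python
def pagerank_sketch(vertices, edges, reset_probability=0.15, max_iter=20):
    """Simplified, unnormalised PageRank run for a fixed number of iterations."""
    # Number of outgoing edges per vertex
    out_degree = {v: 0 for v in vertices}
    for src, _dst in edges:
        out_degree[src] += 1

    ranks = {v: 1.0 for v in vertices}
    for _ in range(max_iter):
        # Each vertex spreads its current rank evenly over its outgoing edges
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = {v: reset_probability + (1.0 - reset_probability) * incoming[v]
                 for v in vertices}
    return ranks

# Same toy graph and parameters as in the cell above
ranks = pagerank_sketch(
    ["a", "b", "c"],
    [("a", "b"), ("b", "c"), ("c", "b")],
    reset_probability=0.01,
    max_iter=20,
)
for vertex, rank in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(vertex, round(rank, 4))
```

On this small graph the sketch agrees with the table: `a` has no incoming edges and stays at the reset probability, while `b` and `c` keep passing rank back and forth, which is why 20 iterations have not yet settled on equal-looking values.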



# Exercises

## Exercise 11.1:

A long time ago in a galaxy far, far away, the characters of the Star Wars franchise interacted with each other in an endless series of films. An ancient Jedi order, called the *Data Guardians of the Galaxy* (not affiliated with Marvel's namesake :), recorded all those interactions and saved them in a digital file so that they could be studied by future generations. This file was originally called (guess it) `sw.txt`, and you will find it in the `/data` directory.

Using PySpark, perform the following operations and answer the following questions:

1. Load the `$DRIVE_DATA/sw.txt` file. Take into account that it is a JSON file.
2. Using this information, create a graph of interactions between the Star Wars characters.
3. How many different characters are there?
4. How many interactions are there?
5. Who is the central character in Star Wars (the one who interacts in most scenes)?
6. Who is the character with the highest 'rank' in Star Wars (use the PageRank algorithm)?

In [31]:
sw_DF = spark.read.option("multiline", "true").json(os.environ["DRIVE_DATA"] + "sw.txt").cache()
sw = sw_DF.rdd

print(sw.collect())


[Row(links=[Row(source=0, target=1, value=32), Row(source=2, target=0, value=2), Row(source=0, target=20, value=5), Row(source=0, target=4, value=22), Row(source=0, target=18, value=41), Row(source=0, target=21, value=2), Row(source=0, target=15, value=12), Row(source=0, target=22, value=2), Row(source=0, target=23, value=8), Row(source=24, target=0, value=11), Row(source=0, target=26, value=3), Row(source=0, target=27, value=2), Row(source=0, target=8, value=47), Row(source=0, target=29, value=1), Row(source=0, target=30, value=1), Row(source=13, target=0, value=2), Row(source=0, target=19, value=4), Row(source=0, target=32, value=9), Row(source=0, target=33, value=2), Row(source=0, target=34, value=9), Row(source=0, target=35, value=1), Row(source=17, target=0, value=1), Row(source=38, target=0, value=2), Row(source=39, target=0, value=1), Row(source=40, target=0, value=3), Row(source=0, target=14, value=1), Row(source=0, target=43, value=1), Row(source=0, target=44, value=1), Row(so

In [32]:
from pyspark.sql import SparkSession
from graphframes import GraphFrame

# Get the Spark session (getOrCreate returns the one that is already running)
spark = SparkSession.builder \
    .appName("Star Wars Graph Analysis") \
    .getOrCreate()

# Load the JSON file
file_path = os.environ["DRIVE_DATA"] + "sw.txt"
sw_data = spark.read.option("multiline", "true").json(file_path)

# Show the structure of the data
sw_data.printSchema()


root
 |-- links: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- source: long (nullable = true)
 |    |    |-- target: long (nullable = true)
 |    |    |-- value: long (nullable = true)
 |-- nodes: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- colour: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: long (nullable = true)
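
The schema suggests that each link refers to its endpoints by number. A plausible reading, assumed here rather than confirmed by the file, is that `source` and `target` are zero-based positions in the `nodes` array and that `value` on a link counts shared scenes. The sketch below shows that mapping in plain Python on a made-up miniature document; the names `ALPHA`, `BETA`, `GAMMA` and all numbers are hypothetical.

```python
import json
from collections import Counter

# Hypothetical miniature document with the same shape as the schema above
raw = """
{
  "nodes": [
    {"name": "ALPHA", "value": 7, "colour": "#000000"},
    {"name": "BETA",  "value": 6, "colour": "#111111"},
    {"name": "GAMMA", "value": 3, "colour": "#222222"}
  ],
  "links": [
    {"source": 0, "target": 1, "value": 5},
    {"source": 2, "target": 0, "value": 2},
    {"source": 1, "target": 2, "value": 1}
  ]
}
"""

doc = json.loads(raw)

# Position in the "nodes" array -> character name
names = [node["name"] for node in doc["nodes"]]

# Resolve the numeric endpoints of every link to names
named_links = [(names[link["source"]], names[link["target"]], link["value"])
               for link in doc["links"]]

# Weighted degree: total link value over all links touching a character
weighted_degree = Counter()
for src, dst, value in named_links:
    weighted_degree[src] += value
    weighted_degree[dst] += value

print(named_links)
print(weighted_degree.most_common(1))
```

The same idea carries over to DataFrames: the vertices need an `id` column that matches the numbers used in the links, and the readable name can travel along as an ordinary attribute.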



In [33]:
# Vertices: one row per entry of the "nodes" array. posexplode also returns the
# position of each entry, which we use as the vertex id because the "source" and
# "target" fields of the links appear to be zero-based positions in that array.
nodes = sw_data.selectExpr("posexplode(nodes) as (id, node)") \
               .selectExpr("cast(id as long) as id", "node.name as name",
                           "node.value as value", "node.colour as colour")

# Edges: GraphFrame expects the columns "src" and "dst"
links = sw_data.selectExpr("explode(links) as link") \
               .selectExpr("link.source as src", "link.target as dst", "link.value as value")

# Create GraphFrame
graph = GraphFrame(nodes, links)