<div style="font-size:18pt; padding-top:20px; text-align:center"><b>Introduction to Graph Analysis with </b> <span style="font-weight:bold; color:green">Spark GraphFrames</span></div><hr>
<div style="text-align:right;">Sergei Papulin <span style="font-style: italic;font-weight: bold;">(papulin_bmstu@mail.ru)</span></div>

<a name="0"></a>
<div><span style="font-size:14pt; font-weight:bold">Contents</span>
    <ol>
        <li><a href="#1">Initial dataset</a></li>
        <li><a href="#2">Graph analysis with GraphFrames</a></li>
        <li><a href="#3">References</a></li>
    </ol>
</div>

Install the `Folium` Python library to plot maps:

`pip install folium --user`

In [None]:
import folium

[OPTIONAL] **Environment Setup**

In [None]:
import os
import sys

os.environ["SPARK_HOME"]="/usr/lib/spark"
os.environ["PYSPARK_PYTHON"]="/opt/anaconda3/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"]="/opt/anaconda3/bin/python"

spark_home = os.environ.get("SPARK_HOME")
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.10.7-src.zip"))
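
The `py4j` archive name above is tied to one Spark release, so the hard-coded `py4j-0.10.7-src.zip` may not exist in another installation. A minimal sketch of looking the archive up instead, assuming the usual `python/lib` layout under `SPARK_HOME`; the helper name `find_py4j_zip` is hypothetical and not part of Spark:

```python
import glob
import os


def find_py4j_zip(spark_home):
    """Return the path of the py4j source archive under spark_home, or None."""
    # Assumed layout: <spark_home>/python/lib/py4j-<version>-src.zip
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    # A distribution normally ships a single archive; if several match,
    # take the last one in sorted order.
    return matches[-1] if matches else None
```

The result could then replace the hard-coded path, for example `sys.path.insert(0, find_py4j_zip(spark_home))`, after checking that it is not `None`.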

Start the Spark context:

In [None]:
import pyspark
from pyspark.sql import SparkSession

To use Spark GraphFrames, you have to download its package and deploy it on the driver and executors. You can list Maven packages in the `spark.jars.packages` configuration parameter of `SparkConf` as follows:

In [None]:
conf = pyspark.SparkConf() \
        .set("spark.executor.memory", "1g") \
        .set("spark.executor.cores", "2") \
        .set("spark.jars.packages", "graphframes:graphframes:0.6.0-spark2.3-s_2.11")\
        .setAppName("airGraphApp") \
        .setMaster("local[4]")
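
The package coordinate encodes more than the GraphFrames version: it also names the Spark and Scala versions the jar was built for, and these should match the running cluster. A small sketch that splits the coordinate into its parts, assuming the version string follows the `<graphframes>-spark<X.Y>-s_<A.B>` pattern seen in the string above; the function and tuple names are hypothetical:

```python
from collections import namedtuple

GraphFramesPackage = namedtuple(
    "GraphFramesPackage",
    ["group", "artifact", "graphframes_version", "spark_version", "scala_version"])


def parse_graphframes_coordinate(coordinate):
    """Split 'group:artifact:<gf>-spark<X.Y>-s_<A.B>' into its parts."""
    group, artifact, version = coordinate.split(":")
    gf_version, spark_part, scala_part = version.split("-")
    return GraphFramesPackage(
        group, artifact, gf_version,
        spark_part[len("spark"):],   # "spark2.3" -> "2.3"
        scala_part[len("s_"):])      # "s_2.11"  -> "2.11"


pkg = parse_graphframes_coordinate("graphframes:graphframes:0.6.0-spark2.3-s_2.11")
print(pkg.spark_version, pkg.scala_version)  # prints "2.3 2.11"
```

Comparing `pkg.spark_version` with the first two components of `pyspark.__version__` is one way to catch a mismatched package before the session starts.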

In [None]:
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .getOrCreate()

After starting the Spark context, you can import the `graphframes` module:

In [None]:
import graphframes as gf

<a name="1"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">1. Initial dataset</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To contents</a></div>
    </div>
</div>

In [None]:
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

#### Airports

Download the `airports` dataset from the [openflights](https://openflights.org/data.html) website. 

Each entry of the dataset contains the following information:

<p>
<div style="width: 600px;">
    
|Parameter|Description|
|:-|:-|
|Airport ID |Unique OpenFlights identifier for this airport.|
|Name |Name of airport. May or may not contain the City name.|
|City |	Main city served by airport. May be spelled differently from Name.|
|Country |Country or territory where airport is located. See countries.dat to cross-reference to ISO 3166-1 codes.|
|IATA|	3-letter IATA code. Null if not assigned/unknown.|
|ICAO|	4-letter ICAO code. Null if not assigned.|
|Latitude|	Decimal degrees, usually to six significant digits. Negative is South, positive is North.|
|Longitude|	Decimal degrees, usually to six significant digits. Negative is West, positive is East.|
|Altitude|	In feet.|
|Timezone|	Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5.|
|DST|	Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown).|
|Tz database time zone|	Timezone in "tz" (Olson) format, eg. "America/Los_Angeles".|
|Type|	Type of the airport. Value "airport" for air terminals, "station" for train stations, "port" for ferry terminals and "unknown" if not known. In airports.csv, only type=airport is included.|
|Source|	Source of this data. "OurAirports" for data sourced from OurAirports, "Legacy" for old data not matched to OurAirports (mostly DAFIF), "User" for unverified user contributions. In airports.csv, only source=OurAirports is included|

</div>
</p>

Define the Spark DataFrame schema for the data:

In [None]:
airportSchema = StructType([
    StructField(name="airport_id", dataType=IntegerType(), nullable=False),
    StructField("name", StringType(), True),
    StructField("city", StringType(), True),
    StructField("country", StringType(), True),
    StructField("iata", StringType(), True),
    StructField("icao", StringType(), True),
    StructField("lat", DoubleType(), True),
    StructField("lng", DoubleType(), True),
    StructField("alt", IntegerType(), True)])
#    StructField("timezone", StringType(), True),
#     StructField("dst", StringType(), True),
#     StructField("tz", StringType(), True),
#     StructField("type", StringType(), True),
#     StructField("source", StringType(), True)])

Path to the downloaded file:

In [None]:
airports_data_path = "file:///YOUR_PATH/data/airlines/airports.dat"

[OPTIONAL] Copy the local dataset to HDFS

Load the `airports` data:

In [None]:
df_airports = spark.read.load(airports_data_path, 
                              format="csv", 
                              header="false", 
                              schema=airportSchema,
                              inferSchema="false",
                              sep=",")

print("Total number of airports:", df_airports.count())
df_airports.show(5)
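
The table above says that codes such as IATA can be null. In the OpenFlights files a missing value is written as the literal `\N`, so a plain CSV load keeps it as a two-character string unless told otherwise. The sketch below shows the conversion on two invented rows (not real airports) using only the standard library; `parse_rows` is a hypothetical helper:

```python
import csv
import io

NULL_MARKER = r"\N"  # literal backslash followed by N


def parse_rows(text, null_marker=NULL_MARKER):
    """Parse CSV text and replace the null marker with None."""
    reader = csv.reader(io.StringIO(text))
    return [[None if field == null_marker else field for field in row]
            for row in reader]


# Invented sample rows: id, name, city, country, iata, icao
sample = (
    '1,"Example Airport","Exampleville","Nowhere","AAA","AAAA"\n'
    '2,"Example Airfield","Exampleville","Nowhere",\\N,\\N\n'
)
rows = parse_rows(sample)
```

With Spark, passing the CSV reader option `nullValue="\\N"` to `spark.read.load` should have the same effect, so that missing IATA codes become real nulls rather than the string `\N`.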

Display the DataFrame schema of the loaded data:

In [None]:
df_airports.printSchema()

Filter the airports to keep only those located in Moscow, Russia:

In [None]:
moscow_airport_filter = F.lower(F.col("country")).like("rus%") & F.lower(F.col("city")).like("mos%")

In [None]:
df_airports.where(moscow_airport_filter).show(5)

In [None]:
m = folium.Map()

html_template = "<p><b>Name:</b> {0}<br><b>IATA</b>: {1}<br><b>City</b>: {2}<br><b>Country:</b> {3}</p>"

for index, row in df_airports.where(moscow_airport_filter).toPandas().iterrows():
    folium.Marker([row["lat"], row["lng"]], 
                  popup=folium.Popup(html=html_template.format(row["name"], row["iata"], 
                                                               row["city"], row["country"]), 
                                     max_width=400), 
                  tooltip="{}".format(row["name"])).add_to(m)

m.fit_bounds(m.get_bounds())
m

In [None]:
df_airports.where(F.col("iata")=="SVO").show()

#### Routes
(2014)

Download the `routes` dataset from the [openflights](https://openflights.org/data.html) website. 

Each entry of the dataset contains the following information:

<p>
<div style="width: 600px;">

|Parameter|Description|
|:-|:-|
|Airline|	2-letter (IATA) or 3-letter (ICAO) code of the airline.|
|Airline ID|	Unique OpenFlights identifier for airline (see Airline).|
|Source airport|	3-letter (IATA) or 4-letter (ICAO) code of the source airport.|
|Source airport ID|	Unique OpenFlights identifier for source airport (see Airport)|
|Destination airport|	3-letter (IATA) or 4-letter (ICAO) code of the destination airport.|
|Destination airport ID|	Unique OpenFlights identifier for destination airport (see Airport)|
|Codeshare|	"Y" if this flight is a codeshare (that is, not operated by Airline, but another carrier), empty otherwise.|
|Stops|	Number of stops on this flight ("0" for direct)|
|Equipment|	3-letter codes for plane type(s) generally used on this flight, separated by spaces|

</div>
</p>

Define the Spark DataFrame schema for the data:

In [None]:
routeSchema = StructType([
    StructField("airline", StringType(), False),
    StructField("airline_id", IntegerType(), True),
    StructField("src_airport", StringType(), True),
    StructField("src_airport_id", IntegerType(), True),
    StructField("dst_airport", StringType(), True),
    StructField("dst_airport_id", IntegerType(), True),
    StructField("codeshare", StringType(), True),
    StructField("stops", IntegerType(), True),
    StructField("equipment", StringType(), True)])

Path to the downloaded file:

In [None]:
routes_data_path = "file:///YOUR_PATH/data/airlines/routes.dat"

[OPTIONAL] Copy the local dataset to HDFS

Load the `routes` data:

In [None]:
df_routes_raw = spark.read.load(routes_data_path, 
                              format="csv", 
                              header="false", 
                              schema=routeSchema,
                              inferSchema="false",
                              sep=",")

print("Total number of routes:", df_routes_raw.count())
df_routes_raw.show(5)

Keep only the non-stop (direct) routes:

In [None]:
df_routes = df_routes_raw.where(F.col("stops") == 0)
df_routes.count()

Select all routes in `df_routes` that depart from `SVO` (Sheremetyevo, Moscow, Russia) and attach the coordinates of the source and destination airports:

In [None]:
df_routes_coord = df_routes.select("src_airport", "dst_airport")\
        .where((F.col("src_airport")=="SVO")).distinct()\
        .join(df_airports.select(F.col("iata").alias("src"), 
                                 F.col("lat").alias("src_lat"), 
                                 F.col("lng").alias("src_lng")), 
              on=[F.col("src_airport")==F.col("src")])\
        .join(df_airports.select(F.col("iata").alias("dst"), 
                                 F.col("city").alias("dst_city"),
                                 F.col("country").alias("dst_country"),
                                 F.col("lat").alias("dst_lat"), 
                                 F.col("lng").alias("dst_lng")), 
              on=[F.col("dst_airport")==F.col("dst")])

print("Number of routes from SVO:", df_routes_coord.count())
print("Number of countries:", df_routes_coord.select("dst_country").distinct().count())
df_routes_coord.show(5)

Display the routes on a map:

In [None]:
df_routs_coord_pn = df_routes_coord.toPandas()

In [None]:
df_routs_coord_pn.iloc[0]["src_lat"]

In [None]:
m = folium.Map()

df_routs_coord_pn = df_routes_coord.toPandas()
df_source_coord_pn = df_airports.where(F.col("iata")=="SVO").toPandas()

# Destinations
for index, row in df_routs_coord_pn.iterrows():
    folium.PolyLine([(row["src_lat"], row["src_lng"]), (row["dst_lat"], row["dst_lng"])], 
                    color="#888888", 
                    weight=0.5, 
                    opacity=0.5).add_to(m)    
    folium.CircleMarker(location=(row["dst_lat"], row["dst_lng"]),
                        radius= 5,
                        tooltip="{0}<br>{1}<br>{2}".format(row["dst"], row["dst_city"], row["dst_country"]),
                        color="seagreen",
                        fill_color="seagreen",
                        fill_opacity=0.5,
                        fill=True).add_to(m)

# Source
first_row = df_source_coord_pn.iloc[0]
folium.CircleMarker(location=(first_row["lat"], first_row["lng"]),
                        radius= 5,
                        tooltip="{0}<br>{1}<br>{2}".format(first_row["iata"], 
                                                             first_row["city"], 
                                                             first_row["country"]),
                        color="red",
                        fill_color="red",
                        fill_opacity=0.5,
                        fill=True).add_to(m)

m.fit_bounds(m.get_bounds())
m

Find all routes from SVO (Moscow) to JFK (New York):

In [None]:
df_routes.where((F.col("src_airport")=="SVO") & (F.col("dst_airport")=="JFK")).show()

<a name="2"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">2. Graph analysis with GraphFrames</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To contents</a></div>
    </div>
</div>

[PySpark GraphFrames Package](https://graphframes.github.io/graphframes/docs/_site/api/python/graphframes.html#graphframes.GraphFrame)

#### Creating graph

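Rename the columns of the initial DataFrames to meet the requirements of GraphFrames: the vertex DataFrame needs an `id` column, and the edge DataFrame needs `src` and `dst` columns.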

In [None]:
df_vertices = df_airports.withColumnRenamed("iata", "id")
df_vertices.show(5)

In [None]:
df_edges = df_routes.select(F.col("src_airport").alias("src"), 
                            F.col("dst_airport").alias("dst"), 
                            "airline")
df_edges.show(5)

Initialize a `GraphFrame` instance for routes:

In [None]:
gf_routes = gf.GraphFrame(df_vertices, df_edges)
gf_routes
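
GraphFrames expects an `id` column in the vertex DataFrame and `src` and `dst` columns in the edge DataFrame. It can also be worth checking whether every edge endpoint actually appears among the vertices, since the routes and airports files are maintained separately. A plain-Python sketch of both checks on toy data; the function names are hypothetical and `XXX` is an invented code:

```python
def missing_graph_columns(vertex_columns, edge_columns):
    """Return the required GraphFrame columns that are absent."""
    missing = []
    if "id" not in vertex_columns:
        missing.append("vertices.id")
    for name in ("src", "dst"):
        if name not in edge_columns:
            missing.append("edges." + name)
    return missing


def dangling_edges(vertex_ids, edges):
    """Return (src, dst) pairs with an endpoint that is not a known vertex."""
    known = set(vertex_ids)
    return [(src, dst) for src, dst in edges
            if src not in known or dst not in known]


# Toy data: "XXX" does not appear among the vertices.
toy_vertices = ["SVO", "JFK"]
toy_edges = [("SVO", "JFK"), ("SVO", "XXX")]
print(dangling_edges(toy_vertices, toy_edges))  # prints [('SVO', 'XXX')]
```

On the real data, the column check could be fed with `df_vertices.columns` and `df_edges.columns`.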

Display the triplet structure (source vertex, edge, destination vertex) of the GraphFrame for the given data:

In [None]:
gf_routes.triplets.show()

#### Degrees

The number of edges incoming to each vertex:

In [None]:
gf_routes.inDegrees.orderBy(-F.col("inDegree")).show()

The number of edges outgoing from each vertex:

In [None]:
gf_routes.outDegrees.orderBy(-F.col("outDegree")).show()
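
To make the two counts concrete, the sketch below computes them on a toy edge list in plain Python. Because the edge DataFrame keeps the `airline` column, the same airport pair can occur several times, once per airline; a degree therefore counts edges, not distinct neighbouring airports. The vertex names and airline codes are invented:

```python
from collections import Counter


def degrees(edges):
    """Return (in_degree, out_degree) counters for (src, dst, airline) edges."""
    out_degree = Counter(src for src, _, _ in edges)
    in_degree = Counter(dst for _, dst, _ in edges)
    return in_degree, out_degree


# Toy edges: A -> B is flown by two airlines, so it counts twice.
toy_edges = [("A", "B", "X1"), ("A", "B", "X2"), ("B", "C", "X1"), ("C", "A", "X1")]
in_degree, out_degree = degrees(toy_edges)
print(in_degree["B"], out_degree["A"])  # prints "2 2"
```

As in GraphFrames, a vertex with no incoming (or outgoing) edges simply has no entry in the corresponding result.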

#### PageRank

Run the `PageRank` algorithm:

In [None]:
gf_routes_pr = gf_routes.pageRank(resetProbability=0.1, maxIter=5)
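
Here `resetProbability` is the chance of jumping to a random vertex instead of following an edge, and `maxIter` fixes the number of iterations. The following is a simplified plain-Python sketch of such a fixed-iteration update, intended only to show the idea; it is not GraphFrames' implementation, and the absolute values it produces may differ from those returned above:

```python
def pagerank(edges, reset_probability=0.1, max_iter=5):
    """Simplified PageRank over (src, dst) edges with a fixed iteration count."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = {v: 0 for v in vertices}
    for src, _ in edges:
        out_degree[src] += 1
    # Assumption of this sketch: every vertex starts with rank 1.0.
    ranks = {v: 1.0 for v in vertices}
    for _ in range(max_iter):
        incoming = {v: 0.0 for v in vertices}
        # Each vertex splits its current rank evenly over its outgoing edges.
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = {v: reset_probability + (1.0 - reset_probability) * incoming[v]
                 for v in vertices}
    return ranks


# Toy hub-and-spoke graph: A is connected to B, C and D in both directions.
toy_edges = [("B", "A"), ("C", "A"), ("D", "A"),
             ("A", "B"), ("A", "C"), ("A", "D")]
ranks = pagerank(toy_edges)
```

In this toy graph the hub `A` ends up with the highest rank, which mirrors why large hub airports tend to lead the ranking.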

Display the airports sorted by their `pagerank` values in decreasing order:

In [None]:
gf_routes_pr.vertices\
            .select("id", "name", "city", "country", "pagerank")\
            .orderBy(-F.col("pagerank"))\
            .show()

Attach in-degrees and out-degrees of the vertices:

In [None]:
gf_routes_pr.vertices\
            .select("id", "name", "city", "country", "pagerank")\
            .join(gf_routes.inDegrees, on=["id"])\
            .join(gf_routes.outDegrees, on=["id"])\
            .orderBy(-F.col("pagerank"))\
            .show()

#### Motif

Find all routes matching the pattern KUF (Samara) -> X -> BCN (Barcelona), where the leg from KUF to X is operated by SU (Aeroflot) or S7:

In [None]:
motifs = gf_routes.find("(a)-[ab]->(b); (b)-[bc]->(c)")\
                .filter("a.id = 'KUF' and (ab.airline = 'SU' or ab.airline = 'S7') and c.id = 'BCN'")

motifs.show(truncate=True)
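
The motif `(a)-[ab]->(b); (b)-[bc]->(c)` describes two consecutive edges that share the middle vertex `b`, and the filter then constrains `a`, `c` and the first edge only. A plain-Python sketch of the same two-hop search over a toy edge list; the edges below are invented for illustration and are not real routes, `XX` is a made-up airline code, and the function name is hypothetical:

```python
def two_hop_routes(edges, origin, destination, first_leg_airlines):
    """Find origin -> X -> destination over (src, dst, airline) edges.

    Only the first leg is restricted to first_leg_airlines.
    """
    # Index candidate second legs by their source airport.
    legs_from = {}
    for src, dst, airline in edges:
        legs_from.setdefault(src, []).append((dst, airline))
    matches = []
    for src, mid, airline_ab in edges:
        if src != origin or airline_ab not in first_leg_airlines:
            continue
        for dst, airline_bc in legs_from.get(mid, []):
            if dst == destination:
                matches.append((src, airline_ab, mid, airline_bc, dst))
    return matches


# Invented toy edges.
toy_edges = [("KUF", "SVO", "SU"), ("KUF", "DME", "S7"), ("KUF", "LED", "XX"),
             ("SVO", "BCN", "SU"), ("DME", "BCN", "XX"), ("LED", "BCN", "XX")]
routes = two_hop_routes(toy_edges, "KUF", "BCN", {"SU", "S7"})
```

The route via `LED` is dropped because its first leg is flown by `XX`, while the route via `DME` is kept even though its second leg is not SU or S7.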

#### BFS

Is there a direct route from KUF to BCN?

In [None]:
df_paths = gf_routes.bfs(fromExpr = "id = 'KUF'", toExpr = "id = 'BCN'",  maxPathLength = 1)
df_paths.show()

Find a route from KUF to BCN with a single transfer, flown only by SU or S7 (note that `edgeFilter` applies to every edge of the path, so both legs must satisfy it, unlike the motif query above, which constrained only the first leg):

In [None]:
df_paths = gf_routes.bfs(fromExpr = "id = 'KUF'", 
                         toExpr = "id = 'BCN'", 
                         edgeFilter="(airline = 'SU') or (airline = 'S7')",
                         maxPathLength = 2)
df_paths.show()
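
The two parameters can be illustrated with a small breadth-first search in plain Python: edges rejected by the filter are ignored altogether, and the search stops once the path-length limit is reached. This is a sketch on invented toy edges (not real routes, `XX` is a made-up airline code), assuming the start and goal differ; it is not how GraphFrames implements `bfs`:

```python
def shortest_paths(edges, start, goal, max_path_length, edge_filter=None):
    """Return all shortest start -> goal paths (lists of edges) within the limit."""
    # Drop edges that fail the filter before searching.
    usable = [e for e in edges if edge_filter is None or edge_filter(e)]
    frontier = [(start, [])]  # (current vertex, edges taken so far)
    for _ in range(max_path_length):
        next_frontier = []
        for vertex, path in frontier:
            for edge in usable:
                if edge[0] == vertex:
                    next_frontier.append((edge[1], path + [edge]))
        found = [path for vertex, path in next_frontier if vertex == goal]
        if found:
            return found  # stop at the first level that reaches the goal
        frontier = next_frontier
    return []


# Invented toy edges: (src, dst, airline).
toy_edges = [("KUF", "SVO", "SU"), ("KUF", "DME", "S7"),
             ("SVO", "BCN", "SU"), ("DME", "BCN", "XX")]
direct = shortest_paths(toy_edges, "KUF", "BCN", max_path_length=1)
one_stop = shortest_paths(toy_edges, "KUF", "BCN", max_path_length=2,
                          edge_filter=lambda e: e[2] in ("SU", "S7"))
```

Here `direct` is empty, and `one_stop` contains only the path via `SVO`, because the `DME` to `BCN` leg fails the airline filter.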

<a name="3"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">3. References</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To contents</a></div>
    </div>
</div>

[Airport, airline and route data](https://openflights.org/data.html)

[GraphFrames Overview](https://graphframes.github.io/graphframes/docs/_site/index.html)

[PySpark GraphFrames Package](https://graphframes.github.io/graphframes/docs/_site/api/python/graphframes.html#graphframes.GraphFrame)

[Folium](https://github.com/python-visualization/folium)

[On-Time Flight Performance with GraphFrames for Apache Spark](https://databricks.com/blog/2016/03/16/on-time-flight-performance-with-graphframes-for-apache-spark.html)