## The following section is for Colab Users.
### Just run the following code cells

In [1]:
!apt-get install openjdk-11-jdk-headless -qq > /dev/null
!wget -q https://bitbucket.org/habedi/datasets/raw/b6769c4664e7ff68b001e2f43bc517888cbe3642/spark/spark-3.0.2-bin-hadoop2.7.tgz
!tar xf spark-3.0.2-bin-hadoop2.7.tgz
!rm -rf spark-3.0.2-bin-hadoop2.7.tgz*
!pip -q install findspark pyspark graphframes

[K     |████████████████████████████████| 281.3 MB 49 kB/s 
[K     |████████████████████████████████| 199 kB 68.3 MB/s 
[K     |████████████████████████████████| 154 kB 64.7 MB/s 
[?25h  Building wheel for pyspark (setup.py) ... [?25l[?25hdone


In [2]:
!wget https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.0-s_2.12/graphframes-0.8.2-spark3.0-s_2.12.jar -P /content/spark-3.0.2-bin-hadoop2.7/jars/
!cp /content/spark-3.0.2-bin-hadoop2.7/jars/graphframes-0.8.2-spark3.0-s_2.12.jar /content/spark-3.0.2-bin-hadoop2.7/graphframes-0.8.2-spark3.0-s_2.12.zip

--2022-07-22 20:49:38--  https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.0-s_2.12/graphframes-0.8.2-spark3.0-s_2.12.jar
Resolving repos.spark-packages.org (repos.spark-packages.org)... 13.33.33.70, 13.33.33.11, 13.33.33.102, ...
Connecting to repos.spark-packages.org (repos.spark-packages.org)|13.33.33.70|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 247882 (242K) [binary/octet-stream]
Saving to: ‘/content/spark-3.0.2-bin-hadoop2.7/jars/graphframes-0.8.2-spark3.0-s_2.12.jar’


2022-07-22 20:49:39 (47.6 MB/s) - ‘/content/spark-3.0.2-bin-hadoop2.7/jars/graphframes-0.8.2-spark3.0-s_2.12.jar’ saved [247882/247882]



In [3]:
import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.2-bin-hadoop2.7"
os.environ["HADOOP_HOME"] = os.environ["SPARK_HOME"]

os.environ["PYSPARK_DRIVER_PYTHON"] = "jupyter"
os.environ["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[*] pyspark-shell"

In [4]:
import findspark
findspark.init()

In [5]:
!export PYSPARK_SUBMIT_ARGS="--master local[*] pyspark-shell"
!export PYSPARK_DRIVER_PYTHON=jupyter
!export PYSPARK_DRIVER_PYTHON_OPTS=notebook

In [6]:
from pyspark.sql import SparkSession
from graphframes import *

spark = SparkSession.builder.master("local[*]").appName("GraphFrames").getOrCreate()

In [7]:
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 pyspark-shell"

**************************************************************************
**************************************************************************
**************************************************************************

In [8]:
from IPython.display import display, HTML
display(HTML("<style>pre { white-space: pre !important; }</style>"))

### Read departuredelays.csv in Edge DataFrame
### Read airport-codes-na.txt in Vertix DataFrame (the separator is Tab i.e sep = '\t' )

#### The US flight delays data set has five columns:
- The <b>date</b> column contains an integer like 02190925 . When converted, this maps to 02-19 09:25 am.
- The <b>delay</b> column gives the delay in minutes between the scheduled and actual departure times. Early departures show negative numbers.
- The <b>distance</b> column gives the distance in miles from the origin airport to the destination airport.
- The <b>origin</b> column contains the origin IATA airport code.
- The <b>destination</b> column contains the destination IATA airport code.

#### The airport-codes data set has four columns:
- The <b>IATA</b> column contains IATA airport code.
- The <b>City, State, and Country</b> columns contains information about the airport location. 

In [53]:
e_df=spark.read.csv('/content/departuredelays(1).csv',header=True,inferSchema=True)
e_df.show()

+-------+-----+--------+------+-----------+
|   date|delay|distance|origin|destination|
+-------+-----+--------+------+-----------+
|1011245|    6|     602|   ABE|        ATL|
|1020600|   -8|     369|   ABE|        DTW|
|1021245|   -2|     602|   ABE|        ATL|
|1020605|   -4|     602|   ABE|        ATL|
|1031245|   -4|     602|   ABE|        ATL|
|1030605|    0|     602|   ABE|        ATL|
|1041243|   10|     602|   ABE|        ATL|
|1040605|   28|     602|   ABE|        ATL|
|1051245|   88|     602|   ABE|        ATL|
|1050605|    9|     602|   ABE|        ATL|
|1061215|   -6|     602|   ABE|        ATL|
|1061725|   69|     602|   ABE|        ATL|
|1061230|    0|     369|   ABE|        DTW|
|1060625|   -3|     602|   ABE|        ATL|
|1070600|    0|     369|   ABE|        DTW|
|1071725|    0|     602|   ABE|        ATL|
|1071230|    0|     369|   ABE|        DTW|
|1070625|    0|     602|   ABE|        ATL|
|1071219|    0|     569|   ABE|        ORD|
|1080600|    0|     369|   ABE| 

In [14]:
v_df=spark.read.csv('/content/sample_data/airport-codes-na.txt',sep='\t',header=True)
v_df.show(5)

+----------+-----+-------+----+
|      City|State|Country|IATA|
+----------+-----+-------+----+
|Abbotsford|   BC| Canada| YXX|
|  Aberdeen|   SD|    USA| ABR|
|   Abilene|   TX|    USA| ABI|
|     Akron|   OH|    USA| CAK|
|   Alamosa|   CO|    USA| ALS|
+----------+-----+-------+----+
only showing top 5 rows



### In the vertix DataFrame, drop any duplicated rows with the same  IATA code.

In [15]:
v_df=v_df.drop_duplicates(['IATA'])
v_df.show()

+-------------------+-----+-------+----+
|               City|State|Country|IATA|
+-------------------+-----+-------+----+
|         Binghamton|   NY|    USA| BGM|
|            Lebanon|   NH|    USA| LEB|
|           Montreal|   PQ| Canada| YUL|
|         Dillingham|   AK|    USA| DLG|
|International Falls|   MN|    USA| INL|
|         Wolf Point|   MT|    USA| OLF|
|        New Orleans|   LA|    USA| MSY|
|            Toronto|   ON| Canada| YTO|
|            Spokane|   WA|    USA| GEG|
|              Havre|   MT|    USA| HVR|
|            Burbank|   CA|    USA| BUR|
|      Orange County|   CA|    USA| SNA|
|             Dryden|   ON| Canada| YHD|
|         Fort Dodge|   IA|    USA| FOD|
|          Green Bay|   WI|    USA| GRB|
|        Great Falls|   MT|    USA| GTF|
|              Homer|   AK|    USA| HOM|
|        Idaho Falls|   ID|    USA| IDA|
|      Sioux Lookout|   ON| Canada| YXL|
|       Grand Rapids|   MI|    USA| GRR|
+-------------------+-----+-------+----+
only showing top

### In the edges DataFrame:
- Rename the <b>date</b> columns to become <b>tripid</b>.
- Rename the <b>origin</b> columns to become <b>src</b>.
- Rename the <b>destination</b> columns to become <b>dst</b>.

In [54]:
e_df=e_df.withColumnRenamed('date','tripid')
e_df=e_df.withColumnRenamed('origin','src')
e_df=e_df.withColumnRenamed('destination','dst')
e_df.show(10)


+-------+-----+--------+---+---+
| tripid|delay|distance|src|dst|
+-------+-----+--------+---+---+
|1011245|    6|     602|ABE|ATL|
|1020600|   -8|     369|ABE|DTW|
|1021245|   -2|     602|ABE|ATL|
|1020605|   -4|     602|ABE|ATL|
|1031245|   -4|     602|ABE|ATL|
|1030605|    0|     602|ABE|ATL|
|1041243|   10|     602|ABE|ATL|
|1040605|   28|     602|ABE|ATL|
|1051245|   88|     602|ABE|ATL|
|1050605|    9|     602|ABE|ATL|
+-------+-----+--------+---+---+
only showing top 10 rows



### In the Vertix DataFrame:
- Rename the <b>IATA</b> columns to become <b>id</b>.

In [17]:
v_df=v_df.withColumnRenamed('IATA','id')
v_df.show(5)

+-------------------+-----+-------+---+
|               City|State|Country| id|
+-------------------+-----+-------+---+
|         Binghamton|   NY|    USA|BGM|
|            Lebanon|   NH|    USA|LEB|
|           Montreal|   PQ| Canada|YUL|
|         Dillingham|   AK|    USA|DLG|
|International Falls|   MN|    USA|INL|
+-------------------+-----+-------+---+
only showing top 5 rows



### Create GraphFrame from Vertix and Edges DataFrames

In [57]:
gf=GraphFrame(v_df,e_df)
gf.vertices.show()
gf.edges.show()

+-------------------+-----+-------+---+
|               City|State|Country| id|
+-------------------+-----+-------+---+
|         Binghamton|   NY|    USA|BGM|
|            Lebanon|   NH|    USA|LEB|
|           Montreal|   PQ| Canada|YUL|
|         Dillingham|   AK|    USA|DLG|
|International Falls|   MN|    USA|INL|
|         Wolf Point|   MT|    USA|OLF|
|        New Orleans|   LA|    USA|MSY|
|            Toronto|   ON| Canada|YTO|
|            Spokane|   WA|    USA|GEG|
|              Havre|   MT|    USA|HVR|
|            Burbank|   CA|    USA|BUR|
|      Orange County|   CA|    USA|SNA|
|             Dryden|   ON| Canada|YHD|
|         Fort Dodge|   IA|    USA|FOD|
|          Green Bay|   WI|    USA|GRB|
|        Great Falls|   MT|    USA|GTF|
|              Homer|   AK|    USA|HOM|
|        Idaho Falls|   ID|    USA|IDA|
|      Sioux Lookout|   ON| Canada|YXL|
|       Grand Rapids|   MI|    USA|GRR|
+-------------------+-----+-------+---+
only showing top 20 rows

+-------+-----

### Determine the number of airports

In [27]:
gf.vertices.count()

524

### Determine the number of trips 

In [28]:
gf.edges.count()

1267461

### What is the longest delay?

In [31]:
gf.edges.agg({'delay':'max'}).show()

+----------+
|max(delay)|
+----------+
|      1638|
+----------+



### Find out the number of delayed flights vs. early flights (flights that departed before actual time)

In [24]:
gf.edges.filter('delay >0').show() #late
gf.edges.filter('delay<=0').show() #early


+-------+-----+--------+---+---+
| tripid|delay|distance|src|dst|
+-------+-----+--------+---+---+
|1011245|    6|     602|ABE|ATL|
|1041243|   10|     602|ABE|ATL|
|1040605|   28|     602|ABE|ATL|
|1051245|   88|     602|ABE|ATL|
|1050605|    9|     602|ABE|ATL|
|1061725|   69|     602|ABE|ATL|
|1081230|   33|     369|ABE|DTW|
|1080625|    1|     602|ABE|ATL|
|1080607|    5|     569|ABE|ORD|
|1081219|   54|     569|ABE|ORD|
|1091215|   43|     602|ABE|ATL|
|1090600|  151|     369|ABE|DTW|
|1090625|    8|     602|ABE|ATL|
|1091219|   83|     569|ABE|ORD|
|1101725|    7|     602|ABE|ATL|
|1100625|   52|     602|ABE|ATL|
|1111215|  127|     602|ABE|ATL|
|1131215|   14|     602|ABE|ATL|
|1130625|   29|     602|ABE|ATL|
|1161219|   68|     569|ABE|ORD|
+-------+-----+--------+---+---+
only showing top 20 rows

+-------+-----+--------+---+---+
| tripid|delay|distance|src|dst|
+-------+-----+--------+---+---+
|1020600|   -8|     369|ABE|DTW|
|1021245|   -2|     602|ABE|ATL|
|1020605|   -4|  

### What flight destinations departing SFO are most likely to have significant delays? Select the top 10
#### Hint: you should get the average delay for each destination for trips that depart from SFO only

In [59]:
from pyspark.sql.functions import desc
temp=gf.edges.filter("src=='SFO'").filter('delay>0').groupby(['dst']).mean("delay")
temp.sort(desc("avg(delay)")).show(10) #choose the top 10

+---+------------------+
|dst|        avg(delay)|
+---+------------------+
|OKC|59.073170731707314|
|JAC| 57.13333333333333|
|COS|53.976190476190474|
|OTH| 48.09090909090909|
|SAT|            47.625|
|MOD| 46.80952380952381|
|SUN|46.723404255319146|
|CIC| 46.72164948453608|
|ABQ|           44.8125|
|ASE|44.285714285714285|
+---+------------------+
only showing top 10 rows



### Find the Incoming connections to the airport sorted in Desc. order.

In [52]:
gf.edges.filter('dst=="SFO"').sort(desc('delay')).show()

+-------+-----+--------+---+---+
| tripid|delay|distance|src|dst|
+-------+-----+--------+---+---+
|2102155|  724|    2084|HNL|SFO|
|3160600|  637|    1302|MCI|SFO|
|2040725|  622|    2084|HNL|SFO|
|2141405|  586|    2084|HNL|SFO|
|1020840|  509|    1612|MDW|SFO|
|2281625|  481|    1995|CLT|SFO|
|2100740|  447|     286|MFR|SFO|
|2141140|  430|    2229|EWR|SFO|
|1300600|  427|     133|CIC|SFO|
|3241530|  424|     840|DEN|SFO|
|2271459|  423|    2102|IAD|SFO|
|1051735|  416|     360|LAS|SFO|
|2281805|  409|     293|LAX|SFO|
|3151425|  409|    1273|DFW|SFO|
|2150840|  392|    2102|IAD|SFO|
|2261850|  389|    2247|JFK|SFO|
|1051915|  385|    1604|ORD|SFO|
|1301730|  385|     366|PSP|SFO|
|1061214|  384|     590|SEA|SFO|
|1020755|  378|    2191|PHL|SFO|
+-------+-----+--------+---+---+
only showing top 20 rows



### Find the Outgoing connections from the airport sorted in Desc. order.

In [53]:
gf.edges.filter('src=="SFO"').sort(desc('delay')).show()

+-------+-----+--------+---+---+
| tripid|delay|distance|src|dst|
+-------+-----+--------+---+---+
|2190925| 1638|    1604|SFO|ORD|
|2092110|  740|    2246|SFO|MIA|
|2092230|  636|    2247|SFO|JFK|
|1211508|  593|    2247|SFO|JFK|
|1021507|  536|    2247|SFO|JFK|
|2030906|  516|    2191|SFO|PHL|
|2131420|  504|     388|SFO|SAN|
|1021210|  484|     566|SFO|PHX|
|2282320|  474|    2247|SFO|JFK|
|2251500|  446|    1381|SFO|MSP|
|1051250|  437|    2247|SFO|JFK|
|1161105|  427|    1273|SFO|DFW|
|1200840|  425|    2247|SFO|JFK|
|2281535|  412|     293|SFO|LAX|
|2282005|  407|     478|SFO|PDX|
|1031755|  396|    1604|SFO|ORD|
|2091550|  383|     566|SFO|PHX|
|2122356|  376|    1859|SFO|ATL|
|2281955|  353|     388|SFO|SAN|
|1060023|  349|    1421|SFO|IAH|
+-------+-----+--------+---+---+
only showing top 20 rows



### Use motif finding to answer this question: which delays could we blame on SFO?
#### Hint: this practically means that SFO is a transit station

In [55]:
gf.find("(dst)-[SFO]->(src)").show()


+--------------------+--------------------+--------------------+
|                 dst|                 SFO|                 src|
+--------------------+--------------------+--------------------+
|[Allentown, PA, U...|[1011245, 6, 602,...|[Atlanta, GA, USA...|
|[Allentown, PA, U...|[1020600, -8, 369...|[Detroit, MI, USA...|
|[Allentown, PA, U...|[1021245, -2, 602...|[Atlanta, GA, USA...|
|[Allentown, PA, U...|[1020605, -4, 602...|[Atlanta, GA, USA...|
|[Allentown, PA, U...|[1031245, -4, 602...|[Atlanta, GA, USA...|
|[Allentown, PA, U...|[1030605, 0, 602,...|[Atlanta, GA, USA...|
|[Allentown, PA, U...|[1041243, 10, 602...|[Atlanta, GA, USA...|
|[Allentown, PA, U...|[1040605, 28, 602...|[Atlanta, GA, USA...|
|[Allentown, PA, U...|[1051245, 88, 602...|[Atlanta, GA, USA...|
|[Allentown, PA, U...|[1050605, 9, 602,...|[Atlanta, GA, USA...|
|[Allentown, PA, U...|[1061215, -6, 602...|[Atlanta, GA, USA...|
|[Allentown, PA, U...|[1061725, 69, 602...|[Atlanta, GA, USA...|
|[Allentown, PA, U...|[10

### Determine Airport Ranking in Desc. order using PageRank algorithm

In [29]:
temp1=gf.pageRank(resetProbability=0.15, tol=0.01 )
temp1.vertices.orderBy("pagerank", ascending=False ).distinct().show()

+---------------+-----+-------+---+------------------+
|           City|State|Country| id|          pagerank|
+---------------+-----+-------+---+------------------+
|     Ponca City|   OK|    USA|PNC|0.8994250492584518|
|      New Haven|   CT|    USA|HVN|0.8994250492584518|
|    Scottsbluff|   NE|    USA|BFF|0.8994250492584518|
|         Marion|   IL|    USA|MWA|0.8994250492584518|
|        Phoenix|   AZ|    USA|PHX|1.7636271609181418|
|         Clovis|   NM|    USA|CVN|0.8994250492584518|
|     Twin Falls|   ID|    USA|TWF|0.8994250492584518|
|        Chicago|   IL|    USA|MDW|1.4055634946889624|
|   Myrtle Beach|   SC|    USA|MYR|0.9294344697201518|
|        Oshkosh|   WI|    USA|OSH|0.8994250492584518|
|       Savannah|   GA|    USA|SAV|0.9594438901818515|
|Fort Saint John|   BC| Canada|YXJ|0.8994250492584518|
|       Hartford|   CT|    USA|BDL|1.0366589075200061|
|    Anahim Lake|   BC| Canada|YAA|0.8994250492584518|
|       Bismarck|   ND|    USA|BIS|0.8994250492584518|
|   Jackso

### Find and Save a Subragph that obtained from the following pattern:
#### The flight starts from an airport and return back to the same airport through 2 other airports.

In [57]:
sg=gf.find('(v1)-[]->(v2);(v2)-[]->(v1)')
sg.show()

+--------------------+--------------------+
|                  v1|                  v2|
+--------------------+--------------------+
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, GA, USA...|
|[Greenville, SC, ...|[Atlanta, 