<img width="200" style="float:left" 
     src="https://upload.wikimedia.org/wikipedia/commons/f/f3/Apache_Spark_logo.svg" /> 

<img style="float:left;height:200" 
     src="https://storage.googleapis.com/kaggle-datasets-images/1392/2506/2d89d2ffd3946c8e06d9d57a8ffb01ec/dataset-cover.jpg" />   

# Sections
* [Description](#0)
* [1. Setup](#1)
  * [1.1 Start Hadoop](#1.1)  
  * [1.2 Search for Spark Installation](#1.2)
  * [1.3 Create SparkSession](#1.3)
* [2. Kata](#2)
  * [2.1 Upload the Dataset to HDFS](#2.1)  
  * [2.2 Create the DataFrames](#2.2)
  * [2.3 Create the GraphFrame](#2.3)
  * [2.4 Exercises](#2.4)
* [3. Tear Down](#3)
  * [3.1 Stop Hadoop](#3.1)

<a id='0'></a>
## Description
<p>
<div>The goals of this kata are:</div>
<ul>    
    <li>Practice the Spark GraphFrames API</li>
    <li>Solve several exercises by yourself</li>
</ul>    
</p>

<p>We are going to work with a flights dataset. We want to better understand how the different cities are interconnected and which airports are the most important.</p>

Airports and the flights between them can be modeled as a graph where:

- **vertices**: airports.
- **edges**: flights.

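As a rough illustration of this model, the sketch below builds the vertex and edge collections from a handful of flight records in plain Python. The records and the `flight_records` name are made up for illustration; the kata builds the equivalent structures as Spark DataFrames.

```python
# Hypothetical flight records: (origin, destination, distance in miles).
flight_records = [
    ("ABQ", "DEN", 349),
    ("DEN", "BNA", 1013),
    ("ABQ", "DEN", 349),   # a second flight on the same route
    ("BNA", "ABQ", 1121),
]

# Vertices: every distinct airport that appears in the records.
vertices = sorted({airport for src, dst, _ in flight_records for airport in (src, dst)})

# Edges: one entry per distinct (src, dst, distance) route.
edges = sorted(set(flight_records))

print(vertices)
print(edges)
```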
<a id='1'></a>
## 1. Setup

Since we are going to process data stored in HDFS, let's start the service.

<a id='1.1'></a>
### 1.1 Start Hadoop

To start Hadoop, open a terminal and execute:
```sh
hadoop-start.sh
```

<a id='1.2'></a>
### 1.2 Search for Spark Installation 
This step is required just because we are working in the course environment.

In [1]:
import findspark
findspark.init()

Set the pandas maximum column width to `None` so that cell contents are displayed without truncation.

In [2]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

<a id='1.3'></a>
### 1.3 Create SparkSession

By setting this environment variable we can include extra libraries in our Spark cluster.<br/>
GraphFrames is not part of Spark core, so we have to add it this way.

In [3]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "graphframes:graphframes:0.8.2-spark3.2-s_2.12" --jars /opt/hive3/lib/hive-hcatalog-core-3.1.2.jar pyspark-shell'

The first step is always to create the SparkSession.

In [4]:
from pyspark.sql.session import SparkSession

spark = (SparkSession.builder
    .appName("Flights - Analytics - GraphFrames Kata")
    .enableHiveSupport()
    .getOrCreate())

<a id='2'></a>
## 2. Kata

<a id='2.1'></a>
### 2.1 Upload the Dataset to HDFS

Upload the provided dataset to the following HDFS path:<br/>
/datalake/raw/flights/

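One possible way to do the upload from Python is sketched below. It assumes the CSV is named `flights_jan08.csv` and sits in the current working directory, and that the `hdfs` client is on the `PATH`; the `hdfs_upload_commands` helper is hypothetical. When either assumption does not hold, the sketch only prints the commands instead of running them.

```python
import os
import shutil
import subprocess

def hdfs_upload_commands(local_file, hdfs_dir):
    """Return the `hdfs dfs` commands that create hdfs_dir and copy local_file into it."""
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        ["hdfs", "dfs", "-put", "-f", local_file, hdfs_dir],
    ]

# Assumption: the dataset file is in the current working directory.
commands = hdfs_upload_commands("flights_jan08.csv", "/datalake/raw/flights/")

# Only execute when the hdfs client and the local file are both available.
can_run = shutil.which("hdfs") is not None and os.path.isfile("flights_jan08.csv")
for command in commands:
    if can_run:
        subprocess.run(command, check=True)
    else:
        print("would run:", " ".join(command))
```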
<a id='2.2'></a>
### 2.2 Create the DataFrames

In [5]:
# create the flights DataFrame from the CSV data uploaded to HDFS in section 2.1
flights = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("/datalake/raw/flights/"))


In [6]:
from pyspark.sql.functions import col

# create the vertices DataFrame: a single "id" column with every distinct Origin airport
vertices = flights.select(col("Origin").alias("id")).distinct()

# create the edges DataFrame: one row per distinct (src, dst, Distance) route
edges = (flights.withColumnRenamed("Origin", "src")
                .withColumnRenamed("Dest", "dst")
                .select("src", "dst", "Distance")
                .distinct())

<a id='2.3'></a>
### 2.3 Create the GraphFrame

In [7]:
from graphframes import GraphFrame

#create the graphframe
graph = GraphFrame(vertices, edges)



<a id='2.4'></a>
### 2.4 Exercises

1. Find out the top 5 airports with the highest number of outbound flights.
2. Find out the top 5 airports with the highest number of inbound flights.
3. Find out the top 5 most important airports.
4. Find out the shortest paths between Albuquerque International Sunport airport and Nashville International Airport.
5. Identify routes between airports with no direct connection.

### 1. Find out the top 5 airports with the highest number of outbound flights.

In [8]:
graph.outDegrees.orderBy(col("outDegree").desc()).limit(5).toPandas()



| | id | outDegree |
|---|---|---|
| 0 | LAS | 54 |
| 1 | MDW | 47 |
| 2 | PHX | 42 |
| 3 | BWI | 38 |
| 4 | MCO | 33 |


### 2. Find out the top 5 airports with the highest number of inbound flights.

In [9]:
graph.inDegrees.orderBy(col("inDegree").desc()).limit(5).toPandas()



| | id | inDegree |
|---|---|---|
| 0 | LAS | 54 |
| 1 | MDW | 47 |
| 2 | PHX | 42 |
| 3 | BWI | 38 |
| 4 | MCO | 33 |

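To make clear what `outDegrees` and `inDegrees` count, here is a minimal plain-Python sketch on a made-up edge list. Because the edges DataFrame keeps one row per distinct route, each degree is a number of distinct routes rather than a number of individual flights. The `routes` list is hypothetical.

```python
from collections import Counter

# Hypothetical deduplicated routes (src, dst), mirroring the shape of the edges DataFrame.
routes = [("LAS", "PHX"), ("LAS", "MDW"), ("PHX", "LAS"), ("MDW", "LAS"), ("MDW", "PHX")]

out_degree = Counter(src for src, _ in routes)  # routes leaving each airport
in_degree = Counter(dst for _, dst in routes)   # routes arriving at each airport

# Top airports by degree, highest first (same idea as orderBy(...desc()).limit(n)).
top_out = out_degree.most_common(2)
top_in = in_degree.most_common(2)
print(top_out)
print(top_in)
```

When every route exists in both directions, the two rankings coincide, which would explain identical inbound and outbound tables.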

### 3. Find out the top 5 most important airports.

In [10]:
ranks = graph.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.orderBy(col("pagerank").desc()).limit(5).toPandas()



| | id | pagerank |
|---|---|---|
| 0 | LAS | 3.968101 |
| 1 | MDW | 3.464189 |
| 2 | PHX | 2.99963 |
| 3 | BWI | 2.859044 |
| 4 | MCO | 2.506806 |

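For intuition about what `pageRank` computes, the following is a simplified power-iteration sketch in plain Python. It is not the GraphFrames implementation: the `simple_pagerank` helper and the toy network are hypothetical, and the update rule used here is the textbook unnormalized form, so absolute values should not be expected to match GraphFrames output.

```python
def simple_pagerank(edges, reset_probability=0.15, max_iter=10):
    """Unnormalized PageRank by power iteration over a list of (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    ranks = {n: 1.0 for n in nodes}  # every vertex starts with rank 1.0
    for _ in range(max_iter):
        incoming = {n: 0.0 for n in nodes}
        for src, dst in edges:
            # Each vertex splits its current rank evenly among its outgoing edges.
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = {n: reset_probability + (1 - reset_probability) * incoming[n]
                 for n in nodes}
    return ranks

# Hypothetical toy network: LAS is a hub linked in both directions to every other airport.
toy_edges = [("LAS", "PHX"), ("PHX", "LAS"), ("LAS", "MDW"), ("MDW", "LAS"),
             ("LAS", "BWI"), ("BWI", "LAS")]
ranks = simple_pagerank(toy_edges)
for airport, rank in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(airport, round(rank, 3))
```

In this toy network the hub collects rank from every other airport, so it ends up with the highest score, which is the same intuition behind the ranking of real hub airports.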

### 4. Find out the shortest paths between Albuquerque International Sunport airport and Nashville International Airport.

The code for Albuquerque International Sunport is **ABQ**, and the code for Nashville International Airport is **BNA**.

<div class="alert alert-danger">
    <b>NOTE</b>: GraphFrames' BFS (breadth-first search) computes the shortest paths in terms of the <b>number of hops</b> between two vertices. It does <b>not</b> take edge weights into account. There are alternatives that take edge weights into consideration but, unfortunately, they are out of the scope of this course.
</div>

In [11]:
paths = graph.bfs(fromExpr="id = 'ABQ'", toExpr="id = 'BNA'")
paths.toPandas()



| | from | e0 | v1 | e1 | to |
|---|---|---|---|---|---|
| 0 | (ABQ,) | (ABQ, AUS, 619) | (AUS,) | (AUS, BNA, 756) | (BNA,) |
| 1 | (ABQ,) | (ABQ, BWI, 1670) | (BWI,) | (BWI, BNA, 588) | (BNA,) |
| 2 | (ABQ,) | (ABQ, DEN, 349) | (DEN,) | (DEN, BNA, 1013) | (BNA,) |
| 3 | (ABQ,) | (ABQ, HOU, 759) | (HOU,) | (HOU, BNA, 670) | (BNA,) |
| 4 | (ABQ,) | (ABQ, IAH, 744) | (IAH,) | (IAH, BNA, 657) | (BNA,) |
| 5 | (ABQ,) | (ABQ, LAS, 487) | (LAS,) | (LAS, BNA, 1588) | (BNA,) |
| 6 | (ABQ,) | (ABQ, LAX, 677) | (LAX,) | (LAX, BNA, 1797) | (BNA,) |
| 7 | (ABQ,) | (ABQ, MCI, 718) | (MCI,) | (MCI, BNA, 491) | (BNA,) |
| 8 | (ABQ,) | (ABQ, MCO, 1552) | (MCO,) | (MCO, BNA, 616) | (BNA,) |
| 9 | (ABQ,) | (ABQ, MDW, 1121) | (MDW,) | (MDW, BNA, 395) | (BNA,) |

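To see what "shortest in number of hops" means, the plain-Python sketch below runs a breadth-first search that collects every path with the minimum number of edges and ignores distances entirely. The `shortest_hop_paths` helper is hypothetical and is not how GraphFrames implements `bfs`; the DEN to BWI leg is an invented edge added to create a longer alternative.

```python
from collections import deque

def shortest_hop_paths(edges, start, goal):
    """Return every path from start to goal that uses the fewest edges (hops)."""
    neighbours = {}
    for src, dst, _distance in edges:  # the distance is deliberately ignored
        neighbours.setdefault(src, []).append(dst)

    paths, best_length = [], None
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if best_length is not None and len(path) > best_length:
            break  # every remaining path is longer than the shortest ones found
        if path[-1] == goal:
            paths.append(path)
            best_length = len(path)
            continue
        for nxt in neighbours.get(path[-1], []):
            if nxt not in path:  # do not revisit an airport within one path
                queue.append(path + [nxt])
    return paths

routes = [
    ("ABQ", "DEN", 349), ("DEN", "BNA", 1013),
    ("ABQ", "BWI", 1670), ("BWI", "BNA", 588),
    ("DEN", "BWI", 1491),  # hypothetical extra leg, enables a 3-hop path
]
hop_paths = shortest_hop_paths(routes, "ABQ", "BNA")
for path in hop_paths:
    print(" -> ".join(path))
```

Both two-hop paths are returned even though their total distances differ, and the three-hop path through DEN and BWI is discarded, which mirrors the behavior described in the note.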

### 5. Identify routes between airports with no direct connection.
**Hint:** Try to find vertices **a**, **b** and **c** where:

* There is an **edge from a to b**, an **edge from b to c**, but **no edge from a to c**.
* Additionally, you need to ensure that **a and c are not the same vertex**.

The easiest way to express this logic is by **using motif finding**. If you want to go down this road, check the [Motif Finding section](https://graphframes.github.io/graphframes/docs/_site/user-guide.html#motif-finding) in the [GraphFrames User Guide](https://graphframes.github.io/graphframes/docs/_site/user-guide.html#graphframes-user-guide).

In [12]:
res = graph.find("(a)-[]->(b); (b)-[]->(c); !(a)-[]->(c)").filter("c.id != a.id")
res.toPandas()



| | a | b | c |
|---|---|---|---|
| 0 | (RSW,) | (MDW,) | (OAK,) |
| 1 | (GEG,) | (LAS,) | (PHX,) |
| 2 | (MHT,) | (MDW,) | (IND,) |
| 3 | (ONT,) | (AUS,) | (JAX,) |
| 4 | (BNA,) | (TPA,) | (PBI,) |
| ... | ... | ... | ... |
| 12422 | (RDU,) | (MCI,) | (TUS,) |
| 12423 | (MCI,) | (RDU,) | (SAT,) |
| 12424 | (TUL,) | (PHX,) | (BHM,) |
| 12425 | (DTW,) | (BWI,) | (RDU,) |

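The same pattern can be spelled out in plain Python to check one's understanding of the motif. This is only a sketch on a made-up edge list; the `indirect_routes` helper is hypothetical and unrelated to how GraphFrames evaluates motifs.

```python
def indirect_routes(edges):
    """Return (a, b, c) triplets with edges a->b and b->c but no edge a->c, where a != c."""
    edge_set = set(edges)
    successors = {}
    for src, dst in edge_set:
        successors.setdefault(src, set()).add(dst)

    triplets = []
    for a, b in edge_set:
        for c in successors.get(b, ()):
            # Keep the triplet only if a and c differ and are not directly connected.
            if c != a and (a, c) not in edge_set:
                triplets.append((a, b, c))
    return sorted(triplets)

# Hypothetical routes: RSW and OAK are both connected to MDW, but not to each other.
toy_edges = [("RSW", "MDW"), ("MDW", "RSW"), ("MDW", "OAK"), ("OAK", "MDW")]
triplets = indirect_routes(toy_edges)
print(triplets)
```

Without the `c != a` check, round trips such as RSW to MDW and back to RSW would also be reported, which is why the motif query adds the extra filter.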

<a id='3'></a>
## 3. Tear Down

Once we complete the lab, we can stop all the services.

<a id='3.1'></a>
### 3.1 Stop Hadoop

To stop Hadoop, open a terminal and execute:
```sh
hadoop-stop.sh
```