<img width="200" style="float:left" 
     src="https://upload.wikimedia.org/wikipedia/commons/f/f3/Apache_Spark_logo.svg" />

<img style="float:left" src="https://storage.googleapis.com/kaggle-datasets-images/57/116/08a7f99f23e148898ab0eda150afc99f/dataset-cover.jpg" />

# Sections
* [Description](#0)
* [1. Setup](#1)
  * [1.1 Start Hadoop](#1.1)  
  * [1.2 Search for Spark Installation](#1.2)
  * [1.3 Create SparkSession](#1.3)
* [2. Lab](#2)
  * [2.1 Check Lab Files](#2.1)
  * [2.2 Create the DataFrames](#2.2)
  * [2.3 Create the GraphFrame](#2.3)
  * [2.4 Analytics](#2.4)
* [3. Tear Down](#3)
  * [3.1 Stop Hadoop](#3.1)

<a id='0'></a>
## Description
<p>
<div>The goals for this lab are:</div>
<ul>    
    <li>Get familiar with Spark GraphFrames API</li>
</ul>    
</p>

<p>We are going to work with a bike sharing <a href="https://www.kaggle.com/benhamner/sf-bay-area-bike-share">dataset</a>.</p>
<p>More precisely, we use a smaller version made up of two files: 201508_trip_data.csv and 201508_station_data.csv</p>

<a id='1'></a>
## 1. Setup

Since we are going to process data stored in HDFS, let's start the service first.

<a id='1.1'></a>
### 1.1 Start Hadoop

To start Hadoop, open a terminal and execute:
```sh
hadoop-start.sh
```

<a id='1.2'></a>
### 1.2 Search for Spark Installation 
This step is only required because we are working in the course environment.

In [None]:
import findspark
findspark.init()

We remove the pandas maximum column width so that long values are displayed in full.

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)

<a id='1.3'></a>
### 1.3 Create SparkSession

Setting this environment variable lets us load extra libraries into our Spark cluster.<br/>
GraphFrames is not part of Spark core, so we have to add it this way.

In [None]:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "graphframes:graphframes:0.8.2-spark3.2-s_2.12" --jars /opt/hive3/lib/hive-hcatalog-core-3.1.2.jar pyspark-shell'
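
As a minimal sketch of how that string is put together: `--packages` and `--jars` each take a comma-separated list, and the value ends with `pyspark-shell`. The helper name `build_submit_args` is hypothetical (it is not part of PySpark); the package coordinates and jar path are the ones used in the cell above. The variable has to be set before the SparkSession is created, otherwise it is not picked up.

```python
import os

# Values copied from the cell above; adjust them to your own environment.
packages = ["graphframes:graphframes:0.8.2-spark3.2-s_2.12"]
jars = ["/opt/hive3/lib/hive-hcatalog-core-3.1.2.jar"]


def build_submit_args(packages, jars):
    """Hypothetical helper: join packages and jars into a PYSPARK_SUBMIT_ARGS value."""
    parts = []
    if packages:
        # Maven coordinates, comma-separated
        parts += ["--packages", ",".join(packages)]
    if jars:
        # Local jar files, comma-separated
        parts += ["--jars", ",".join(jars)]
    # The value must end with "pyspark-shell"
    parts.append("pyspark-shell")
    return " ".join(parts)


submit_args = build_submit_args(packages, jars)
os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args
print(submit_args)
```

Adding another library would then only mean appending its coordinates to `packages`.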

The first step is always to create the SparkSession.

In [None]:
from pyspark.sql.session import SparkSession

spark = (SparkSession.builder
    .appName("Bike Sharing - Analytics - GraphFrames")
    .enableHiveSupport()
    .getOrCreate())

<a id='2'></a>
## 2. Lab

<a id='2.1'></a>
### 2.1 Check Lab Files

In order to complete this lab you need to have previously uploaded the datasets into HDFS.<br/>

Check that you have the data ready in HDFS:

http://localhost:50070/explorer.html#/datalake/raw/san-francisco-bay-bike-sharing/stations/

http://localhost:50070/explorer.html#/datalake/raw/san-francisco-bay-bike-sharing/trips/


<a id='2.2'></a>
### 2.2 Create the DataFrames

The first step after creating the SparkSession is to create one or more DataFrames<br/>

In [None]:
stations = (spark.read
            .option("header", "true")
            .option("inferSchema", "true")
            .csv("hdfs://localhost:9000/datalake/raw/san-francisco-bay-bike-sharing/stations/")
            .distinct())

trips = (spark.read
         .option("header", "true")
         .option("inferSchema", "true")
         .csv("hdfs://localhost:9000/datalake/raw/san-francisco-bay-bike-sharing/trips/"))

In [None]:
stations.limit(5).toPandas()

The data covers different areas of the San Francisco Bay Area.

In [None]:
stations.select("landmark").distinct().toPandas()

In [None]:
trips.limit(5).toPandas()

<a id='2.3'></a>
### 2.3 Create the GraphFrame

We are going to model our graph in the following way:<br/>
**vertices** : stations <br/>
**edges** : trips aggregation
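
To make "trips aggregation" concrete, here is a minimal pure-Python sketch, assuming a handful of made-up trips (the station ids and durations are hypothetical, not taken from the dataset). Every distinct (src, dst) pair becomes one edge carrying a trip count and an average duration.

```python
from collections import defaultdict

# Hypothetical trips as (src, dst, duration); not taken from the dataset.
toy_trips = [
    (70, 50, 600),
    (70, 50, 900),
    (50, 70, 300),
    (70, 60, 1200),
]

# (src, dst) -> [trip_count, duration_sum]
totals = defaultdict(lambda: [0, 0])
for src, dst, duration in toy_trips:
    totals[(src, dst)][0] += 1
    totals[(src, dst)][1] += duration

# One edge per distinct route, with the aggregated attributes
toy_edges = [
    {"src": src, "dst": dst, "trip_count": n, "duration_avg": total / n}
    for (src, dst), (n, total) in sorted(totals.items())
]

for edge in toy_edges:
    print(edge)
```

Four trips collapse into three edges here, because two of the trips share the same route.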

In [None]:
from pyspark.sql.functions import count,avg,desc,asc,col
from graphframes import GraphFrame

# GraphFrames requires the vertices DataFrame to have a column named id.
vertices = stations.withColumnRenamed("station_id","id")
    
# GraphFrames requires the edges DataFrame to have columns named src and dst
trips = (trips.withColumnRenamed("Start Terminal", "src")
              .withColumnRenamed("End Terminal", "dst")
              .withColumnRenamed("Start Station", "src_name")
              .withColumnRenamed("End Station", "dst_name"))
              
edges = (trips.groupBy("src","src_name", "dst", "dst_name")
              .agg(
                  count("*").alias("trip_count"),
                  avg("duration").alias("duration_avg")
              ))
                            
     
# Creates the graph
graph = GraphFrame(vertices, edges)

# Graph algorithms are iterative, so it is good practice to cache the graph
graph.cache()

In [None]:
graph.vertices.limit(5).toPandas()

In [None]:
graph.edges.limit(5).toPandas()

For practice, let's create a subgraph that contains only the stations whose landmark is "San Francisco".

In [None]:
# filterVertices also drops the edges that touch a removed station
subgraph = graph.filterVertices("landmark = 'San Francisco'")
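
A subgraph restricted to some vertices normally also leaves out the edges that touch a removed vertex, otherwise those edges point at stations that are no longer in the graph. A minimal pure-Python sketch of that filtering, assuming made-up toy data (ids and landmarks are hypothetical):

```python
# Hypothetical toy data; station ids and landmarks are made up for illustration.
toy_vertices = [
    {"id": 50, "landmark": "San Francisco"},
    {"id": 70, "landmark": "San Francisco"},
    {"id": 2, "landmark": "San Jose"},
]
toy_edges = [
    {"src": 50, "dst": 70},
    {"src": 70, "dst": 50},
    {"src": 2, "dst": 70},  # starts outside the subgraph
]

# Step 1: keep only the vertices that match the condition
kept_vertices = [v for v in toy_vertices if v["landmark"] == "San Francisco"]
kept_ids = {v["id"] for v in kept_vertices}

# Step 2: keep an edge only when both of its endpoints survived step 1
kept_edges = [e for e in toy_edges if e["src"] in kept_ids and e["dst"] in kept_ids]

print(kept_vertices)
print(kept_edges)
```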

<a id='2.4'></a>
### 2.4 Analytics

#### Which are the top 5 most common routes?

In [None]:
routes = subgraph.edges.orderBy(desc("trip_count"))

routes.limit(5).toPandas()

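#### Which are the stations where most of the trips depart from?
Trips depart from the **src** station of an edge, so we look at the out-degree. Each edge is an aggregated route, so the degree counts distinct routes rather than individual trips.

In [None]:
outDeg = subgraph.outDegrees
outDeg.orderBy(desc("outDegree"),asc("id")).limit(5).toPandas()

Let's get the names

In [None]:
outDeg.join(subgraph.vertices,"id").orderBy(desc("outDegree"),asc("id")).limit(5).toPandas()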

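#### Which are the stations where most of the trips get to?
Trips arrive at the **dst** station of an edge, so we look at the in-degree. Each edge is an aggregated route, so the degree counts distinct routes rather than individual trips.

In [None]:
inDeg = subgraph.inDegrees
inDeg.orderBy(desc("inDegree"),asc("id")).limit(5).toPandas()

In [None]:
inDeg.join(subgraph.vertices,"id").orderBy(desc("inDegree"),asc("id")).limit(5).toPandas()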
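
To keep the two notions apart: the out-degree of a station is the number of edges in which it is the `src`, and the in-degree is the number of edges in which it is the `dst`. Since an edge here is an aggregated route, a degree is a number of routes; counting trips means summing `trip_count` instead. A minimal pure-Python sketch on made-up edges (all values are hypothetical):

```python
from collections import Counter

# Hypothetical aggregated edges as (src, dst, trip_count); not taken from the dataset.
toy_edges = [
    (70, 50, 120),
    (70, 60, 30),
    (50, 70, 200),
    (60, 70, 10),
    (60, 50, 5),
]

# Degrees: number of routes leaving / arriving at each station
out_degree = Counter(src for src, _, _ in toy_edges)
in_degree = Counter(dst for _, dst, _ in toy_edges)

# Trip-weighted versions: number of trips leaving / arriving at each station
departures = Counter()
arrivals = Counter()
for src, dst, trip_count in toy_edges:
    departures[src] += trip_count
    arrivals[dst] += trip_count

print("out-degree:", dict(out_degree))
print("departures:", dict(departures))
print("in-degree:", dict(in_degree))
print("arrivals:", dict(arrivals))
```

In this toy data station 50 has the lowest out-degree but the most departures, so the two rankings can disagree. On the real graph, the trip-weighted figures could be obtained by grouping `graph.edges` by `src` (or `dst`) and summing `trip_count`.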

#### Which are the most relevant stations?
We are going to apply the PageRank algorithm

In [None]:
ranks = subgraph.pageRank(resetProbability=0.15, maxIter=10)

The algorithm returns a GraphFrame. <br/>
Notice we now have a new column in the vertices DataFrame called **pagerank**

In [None]:
ranks.vertices.limit(5).toPandas()

Notice we now have a new column in the edges DataFrame called **weight**

In [None]:
ranks.edges.limit(5).toPandas()
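
To give an intuition for the `resetProbability` and `maxIter` parameters, here is a minimal pure-Python sketch of a textbook PageRank iteration on a made-up graph. It is an illustration under stated assumptions, not the GraphFrames implementation: ranks are normalised to sum to 1, edges are unweighted, and the rank of vertices without outgoing edges is spread evenly. GraphFrames may scale and treat such cases differently, so the numbers are not expected to match.

```python
def pagerank(edges, reset_probability=0.15, max_iter=10):
    """Textbook PageRank sketch over a list of (src, dst) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)

    out_degree = {node: 0 for node in nodes}
    for src, _ in edges:
        out_degree[src] += 1

    # Start from a uniform distribution
    ranks = {node: 1.0 / n for node in nodes}

    for _ in range(max_iter):
        # Each vertex shares its rank equally among its outgoing edges
        incoming = {node: 0.0 for node in nodes}
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]

        # Rank held by vertices with no outgoing edges is spread evenly
        dangling = sum(ranks[node] for node in nodes if out_degree[node] == 0)

        # With probability reset_probability the walk jumps to a random vertex
        ranks = {
            node: reset_probability / n
            + (1 - reset_probability) * (incoming[node] + dangling / n)
            for node in nodes
        }
    return ranks


# Hypothetical toy graph: D has no incoming edges
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "C")]
toy_ranks = pagerank(toy_edges, reset_probability=0.15, max_iter=10)

for node, rank in sorted(toy_ranks.items()):
    print(node, round(rank, 4))
```

A vertex that nobody links to, such as `D` here, only ever receives the random-jump share, which is why it ends up with the lowest rank.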

<a id='3'></a>
## 3. Tear Down

Once we complete the lab we can stop all the services.

<a id='3.1'></a>
### 3.1 Stop Hadoop

To stop Hadoop, open a terminal and execute:
```sh
hadoop-stop.sh
```