
# Predicting Flight Delays with Spark ML 

---

In this notebook, we use features that have already been prepared with PySpark to predict flight delays via regression and classification.


### First unzip the data

In [1]:
!unzip flight_delay_sample.jsonl.zip

Archive:  flight_delay_sample.jsonl.zip
  inflating: flight_delay_sample.jsonl  


### Import the required modules

In [2]:
import findspark
findspark.init('/usr/local/spark')

import pyspark as ps    # for the pyspark suite
import warnings         # for displaying warnings
from pyspark.sql import SQLContext
from datetime import datetime
import numpy as np

In [3]:
try:
    # try to create a SparkContext that runs locally with 4 worker threads
    # (use 'local[*]' instead to use all available cores)
    sc = ps.SparkContext('local[4]')
    sqlContext = SQLContext(sc)
    print("Just created a SparkContext")
except ValueError:
    # give a warning if SparkContext already exists (for use inside pyspark)
    warnings.warn("SparkContext already exists in this scope")

Just created a SparkContext


### Load and Inspect our JSON Training Data (should take around a min with Spark):

In [4]:
from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
from pyspark.sql.types import StructType, StructField

schema = StructType([
  StructField("ArrDelay", DoubleType(), True),     # "ArrDelay":5.0
  StructField("CRSArrTime", TimestampType(), True),    # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
  StructField("CRSDepTime", TimestampType(), True),    # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
  StructField("Carrier", StringType(), True),     # "Carrier":"WN"
  StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31
  StructField("DayOfWeek", IntegerType(), True),  # "DayOfWeek":4
  StructField("DayOfYear", IntegerType(), True),  # "DayOfYear":365
  StructField("DepDelay", DoubleType(), True),     # "DepDelay":14.0
  StructField("Dest", StringType(), True),        # "Dest":"SAN"
  StructField("Distance", DoubleType(), True),     # "Distance":368.0
  StructField("FlightDate", DateType(), True),    # "FlightDate":"2015-12-30T16:00:00.000-08:00"
  StructField("FlightNum", StringType(), True),   # "FlightNum":"6109"
  StructField("Origin", StringType(), True),      # "Origin":"TUS"
])

In [5]:
features = sqlContext.read.json(
  "flight_delay_sample.jsonl",
  schema=schema
)

### Extract the head of the dataframe

### Determine the number of rows in the dataframe

### Check for null values in the features before using Spark ML

In [6]:
from pyspark.sql.functions import isnan

### Add a Route variable to replace FlightNum

Concatenate the `Origin` and `Dest` columns into a new column, `route`.

In [7]:
from pyspark.sql.functions import lit, concat

### Categorize or "bucketize" the arrival delay field using a DataFrame UDF

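Create a label from 0 to 3 that characterizes the arrival delay: 0 for `ArrDelay` <= -15, 1 for -15 < `ArrDelay` <= 0, 2 for 0 < `ArrDelay` <= 30, and 3 for `ArrDelay` > 30.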

In [8]:
# Wrap the function in pyspark.sql.functions.udf (Spark user-defined function),
# passing the return type from pyspark.sql.types

from pyspark.sql.functions import udf

In [9]:
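def bucketize_arr_delay(arr_delay):
    """Map an arrival delay to a bucket label from 0.0 to 3.0."""
    if arr_delay is None:
        return None      # keep missing delays as null instead of raising
    if arr_delay <= -15.0:
        return 0.0       # ArrDelay <= -15
    elif arr_delay <= 0.0:
        return 1.0       # -15 < ArrDelay <= 0
    elif arr_delay <= 30.0:
        return 2.0       # 0 < ArrDelay <= 30
    elif arr_delay > 30.0:
        return 3.0       # ArrDelay > 30
    return None          # NaN matches none of the comparisons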

In [10]:
# The function returns floats, so declare a numeric return type
dummy_function_udf = udf(bucketize_arr_delay, DoubleType())

# Add a categorical column via pyspark.sql.DataFrame.withColumn
# Replace features with the dataframe containing the route column
manual_bucketized_features = features.withColumn(
  "ArrDelayBucket",
  dummy_function_udf(features['ArrDelay'])
)
manual_bucketized_features.select("ArrDelay", "ArrDelayBucket").show(5)

+--------+--------------+
|ArrDelay|ArrDelayBucket|
+--------+--------------+
|    13.0|           2.0|
|    36.0|           3.0|
|   -21.0|           0.0|
|   -14.0|           1.0|
|    16.0|           2.0|
+--------+--------------+
only showing top 5 rows



### Use `pyspark.ml.feature.Bucketizer` to bucketize `ArrDelay`

Build the buckets again, this time with `pyspark.ml.feature.Bucketizer` instead of a UDF.

In [11]:
from pyspark.ml.feature import Bucketizer

### Turn categorical fields into categorical feature vectors

In [12]:
from pyspark.ml.feature import StringIndexer, VectorAssembler

In [13]:
# Turn categorical fields into categorical feature vectors, then drop intermediate fields


### Cross validate, train and evaluate a classifier of your choice (from MLlib)

In [14]:
# Train/test split
# training_data, test_data = 

# Instantiate and fit your model

# Evaluate model using test data

# Check a sample

In [15]:
# Remove the unzipped file once you do not need it any more
!rm flight_delay_sample.jsonl