# Building a Gradient-Boosted Trees Model with PySpark

You've previously built a Decision Tree classifier for flights that are likely to be delayed. In this exercise you'll compare that Decision Tree model to a Gradient-Boosted Trees model.

## 1. Import the required classes

In [1]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import SparkSession            # Entry point for creating a Spark session
from pyspark.sql.functions import round         # Column-wise rounding (shadows Python's built-in round)
from pyspark.ml.feature import StringIndexer    # Index categorical string columns
from pyspark.ml.feature import VectorAssembler  # Combine predictor columns into one vector
from pyspark.ml.classification import DecisionTreeClassifier  # Decision Tree classifier
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [2]:
# Create SparkSession object
spark = SparkSession.builder \
                    .master('local[*]') \
                    .appName('test') \
                    .getOrCreate()

## 2. Loading flights data

In [16]:
path = "/home/danae/Documents/pySparkTraining/files/"
# Read data from CSV file
flights = spark.read.csv(path + 'flights.csv',
                         sep = ',',
                         header = True,
                         inferSchema = True,
                         nullValue = 'NA')

## 3. Data Preparation

You previously loaded airline flight data from a CSV file. You're going to develop a model which will predict whether or not a given flight will be delayed.

In this exercise you need to trim those data down by:

- removing an uninformative column and
- removing rows which do not have information about whether or not a flight was delayed.

### 3.1 Column manipulation

In [18]:

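# Remove rows with a missing 'delay' value, then rows with any other missing value
flights = flights.filter('delay IS NOT NULL').dropna()

# Convert 'mile' to 'km' (the 'mile' column is left out later, when the predictors are selected)
flights = flights.withColumn('km', round(flights.mile * 1.60934, 0))

# Create 'label' column indicating whether the flight was delayed (1) or not (0),
# using a threshold of 15 on 'delay'
flights = flights.withColumn('label', (flights.delay >= 15).cast('integer'))

# Number of rows remaining, followed by the first five rows
print(flights.count())
flights.show(5)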

47022
+---+---+---+-------+------+---+----+------+--------+-----+------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|    km|label|
+---+---+---+-------+------+---+----+------+--------+-----+------+-----+
|  0| 22|  2|     UA|  1107|ORD| 316| 16.33|      82|   30| 509.0|    1|
|  2| 20|  4|     UA|   226|SFO| 337|  6.17|      82|   -8| 542.0|    0|
|  9| 13|  1|     AA|   419|ORD|1236| 10.33|     195|   -5|1989.0|    0|
|  5|  2|  1|     UA|   704|SFO| 550|  7.98|     102|    2| 885.0|    0|
|  7|  2|  6|     AA|   380|ORD| 733| 10.83|     135|   54|1180.0|    1|
+---+---+---+-------+------+---+----+------+--------+-----+------+-----+
only showing top 5 rows
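
The two derived columns can be checked by hand. The following is a minimal plain-Python sketch (no Spark required) that recomputes `km` and `label` for the five rows printed above. The `sample` list and the helper names are illustrative only; the 1.60934 factor and the threshold of 15 are taken from the cell above.

```python
# Plain-Python check of the derived columns. `sample` copies the
# (mile, delay) pairs from the five rows printed above.
MILES_TO_KM = 1.60934
DELAY_THRESHOLD = 15

sample = [(316, 30), (337, -8), (1236, -5), (550, 2), (733, 54)]

def to_km(mile):
    # Same arithmetic as round(flights.mile * 1.60934, 0)
    return float(round(mile * MILES_TO_KM))

def to_label(delay):
    # Same rule as (flights.delay >= 15).cast('integer')
    return int(delay >= DELAY_THRESHOLD)

derived = [(to_km(mile), to_label(delay)) for mile, delay in sample]
for (mile, delay), (km, label) in zip(sample, derived):
    print(mile, delay, km, label)
```

The printed `km` and `label` values should match the last two columns of the table above.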



### 3.2 Categorical columns
In the flights data there are two columns, `carrier` and `org`, which hold categorical data. You need to transform those columns into indexed numerical values.

In [23]:
# Create an indexer
flights_indexed = StringIndexer(inputCol = 'carrier', outputCol = 'carrier_idx')\
                                .fit(flights)\
                                .transform(flights)

# Repeat the process for the other categorical feature
flights_indexed = StringIndexer(inputCol='org', outputCol='org_idx')\
                                .fit(flights_indexed)\
                                .transform(flights_indexed)


The first step in encoding categorical features is to create a `StringIndexer`. Instances of this class are **Estimators** that take a DataFrame with a column of strings and map each unique string to a number. By default the most frequent string is mapped to 0, the next most frequent to 1, and so on.

Fitting the *Estimator* returns a **Transformer** that takes a DataFrame, attaches the mapping to it as metadata, and returns a new DataFrame with a numeric column corresponding to the string column.
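
To make the Estimator/Transformer split concrete, here is a rough plain-Python sketch of the idea. It is not Spark's implementation. It assumes the default `StringIndexer` ordering (most frequent value first); `fit_indexer` and `transform_column` are hypothetical helper names, and the carrier values are made up for illustration.

```python
from collections import Counter

def fit_indexer(values):
    """'Estimator' step: learn a string -> index mapping from the data."""
    counts = Counter(values)
    # Most frequent first; ties broken alphabetically (an assumption in this sketch)
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    return {s: float(i) for i, s in enumerate(ordered)}

def transform_column(mapping, values):
    """'Transformer' step: apply the learned mapping to a column."""
    return [mapping[v] for v in values]

carriers = ['UA', 'AA', 'UA', 'OO', 'UA', 'AA']   # made-up sample column
mapping = fit_indexer(carriers)
indexed = transform_column(mapping, carriers)
print(mapping)
print(indexed)
```

The mapping is learned once and can then be applied to any column with the same categories, which is why fitting and transforming are separate steps.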

In [25]:
flights_indexed.select('carrier', 'carrier_idx', 'org', 'org_idx')\
                .distinct()\
                .orderBy('carrier_idx', 'org_idx').show(5)

+-------+-----------+---+-------+
|carrier|carrier_idx|org|org_idx|
+-------+-----------+---+-------+
|     UA|        0.0|ORD|    0.0|
|     UA|        0.0|SFO|    1.0|
|     UA|        0.0|JFK|    2.0|
|     UA|        0.0|LGA|    3.0|
|     UA|        0.0|SMF|    4.0|
+-------+-----------+---+-------+
only showing top 5 rows



### 3.3 Assembling columns
The final stage of data preparation is to consolidate all of the predictor columns into a single column.

This has to be done before modeling because every Spark modeling routine expects the data to be in this form.

In [26]:
flights_mod = flights_indexed.select('mon', 'dom', 'dow', 'carrier_idx', 
                         'org_idx', 'km', 'depart', 'duration', 'delay', 'label')

# Create an assembler object
assembler = VectorAssembler(inputCols=[
    'mon', 'dom', 'dow', 'carrier_idx', 'org_idx', 'km', 'depart', 'duration'
], outputCol='features')

# Consolidate predictor columns
flights_assembled = assembler.transform(flights_mod)

# Check the resulting column
flights_assembled.select('features','label').show(5, truncate=False)

+-----------------------------------------+-----+
|features                                 |label|
+-----------------------------------------+-----+
|[0.0,22.0,2.0,0.0,0.0,509.0,16.33,82.0]  |1    |
|[2.0,20.0,4.0,0.0,1.0,542.0,6.17,82.0]   |0    |
|[9.0,13.0,1.0,1.0,0.0,1989.0,10.33,195.0]|0    |
|[5.0,2.0,1.0,0.0,1.0,885.0,7.98,102.0]   |0    |
|[7.0,2.0,6.0,1.0,0.0,1180.0,10.83,135.0] |1    |
+-----------------------------------------+-----+
only showing top 5 rows
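
Conceptually, the assembler concatenates the chosen columns of each row, in order, into one numeric vector. The following plain-Python sketch illustrates that idea (it is not Spark's implementation) using the first row from the tables above; `assemble` is a hypothetical helper.

```python
input_cols = ['mon', 'dom', 'dow', 'carrier_idx', 'org_idx', 'km', 'depart', 'duration']

def assemble(row, cols):
    """Concatenate the selected columns of one row into a list of floats."""
    return [float(row[c]) for c in cols]

# First row of the prepared data, copied from the tables above
row = {'mon': 0, 'dom': 22, 'dow': 2, 'carrier_idx': 0.0, 'org_idx': 0.0,
       'km': 509.0, 'depart': 16.33, 'duration': 82, 'delay': 30, 'label': 1}

features = assemble(row, input_cols)
print(features)
```

Note that `delay` and `label` stay out of the vector: `label` is the target, and `delay` is what the label was derived from.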



## 4. Train/test split

To objectively assess a Machine Learning model you need to be able to test it on an independent set of data. You can't use the same data that you used to train the model: of course the model will perform (relatively) well on those data!

You will split the data into two components:

- training data (used to train the model) and
- testing data (used to test the model).


In [28]:
# Split into training and testing sets in an 80:20 ratio
flights_train, flights_test = flights_assembled.randomSplit([0.8, 0.2], seed=17)
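
`randomSplit` assigns rows to the splits at random according to the weights, so the resulting sizes are only approximately 80:20. A simplified plain-Python sketch of that behaviour follows. It does not reproduce Spark's sampler or its seed handling, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, train_weight=0.8, seed=17):
    """Assign each row independently to the training or testing set."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1000))          # stand-in for 1000 DataFrame rows
train, test = random_split(rows)
print(len(train), len(test))      # roughly 800 and 200
```

Fixing the seed makes the split repeatable, which matters when you want to compare models on the same testing data.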

## 5. Build the models 
Now that you've split the flights data into training and testing sets, you can use the training set to fit a Decision Tree model and a Gradient-Boosted Trees model.

In [31]:
# Create the classifier objects and fit them to the training data
tree = DecisionTreeClassifier()
gbt = GBTClassifier()

tree_model = tree.fit(flights_train)
gbt_model = gbt.fit(flights_train)

In [39]:
# Compare AUC on testing data (the evaluator's default metric is areaUnderROC)
evaluator = BinaryClassificationEvaluator()

# Create predictions for the testing data
prediction_tree = tree_model.transform(flights_test)
prediction_gbt = gbt_model.transform(flights_test)

# A notebook cell displays only its last expression, so the value below is the GBT AUC
evaluator.evaluate(prediction_tree)
evaluator.evaluate(prediction_gbt)

0.7198696758978518
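
`BinaryClassificationEvaluator` reports the area under the ROC curve by default. One way to read that number: it is the probability that a randomly chosen delayed flight receives a higher score than a randomly chosen on-time flight. The plain-Python sketch below follows that interpretation with made-up labels and scores; `auc` is a hypothetical helper and not how Spark computes the metric.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]                 # made-up ground truth
scores = [0.9, 0.6, 0.7, 0.8, 0.3, 0.6]     # made-up predicted scores
print(auc(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so higher is better when comparing the two models.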

In [40]:
# Find the number of trees and the relative importance of features
print(gbt_model.featureImportances)
print(gbt_model.getNumTrees)

(8,[0,1,2,3,4,5,6,7],[0.18206047719666724,0.16061942914144267,0.1425136430435656,0.09221722599904801,0.16716807895229843,0.06189958003724091,0.13495097255514493,0.05857059307459232])
20
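
The importances are printed as a sparse vector whose positions follow the order of `inputCols` in the assembler. The small plain-Python sketch below pairs the printed values with their column names and ranks them. The values are copied from the output above; the variable names are illustrative.

```python
feature_names = ['mon', 'dom', 'dow', 'carrier_idx', 'org_idx', 'km', 'depart', 'duration']
importances = [0.18206047719666724, 0.16061942914144267, 0.1425136430435656,
               0.09221722599904801, 0.16716807895229843, 0.06189958003724091,
               0.13495097255514493, 0.05857059307459232]

# Sort features from most to least important
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f'{name:12s} {value:.3f}')
```

In this run `mon` and `org_idx` rank highest and `duration` lowest. The importances sum to 1, so each value can be read as a share of the total.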


In [41]:
# Inspect labels, predictions and class probabilities from the Decision Tree model
prediction_tree.select('label', 'prediction', 'probability').show(5, False)

+-----+----------+----------------------------------------+
|label|prediction|probability                             |
+-----+----------+----------------------------------------+
|1    |1.0       |[0.38793503480278424,0.6120649651972158]|
|1    |1.0       |[0.38793503480278424,0.6120649651972158]|
|0    |1.0       |[0.3168674698795181,0.6831325301204819] |
|1    |1.0       |[0.38793503480278424,0.6120649651972158]|
|1    |1.0       |[0.38793503480278424,0.6120649651972158]|
+-----+----------+----------------------------------------+
only showing top 5 rows



## 6. Delayed flights with a Random Forest

In this exercise you'll bring together cross validation and ensemble methods. You'll be training a Random Forest classifier to predict delayed flights, using cross validation to choose the best values for model parameters.

You'll find good values for the following parameters:

- `featureSubsetStrategy` — the number of features to consider for splitting at each node and
- `maxDepth` — the maximum number of splits along any branch.

In [44]:
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Create a random forest classifier
forest = RandomForestClassifier()

# Create a parameter grid
params = ParamGridBuilder() \
            .addGrid(forest.featureSubsetStrategy, ['all', 'onethird', 'sqrt', 'log2']) \
            .addGrid(forest.maxDepth, [2, 5, 10]) \
            .build()

# Create a binary classification evaluator
evaluator = BinaryClassificationEvaluator()

# Create a cross-validator
cv = CrossValidator(estimator = forest
                    , estimatorParamMaps = params
                    , evaluator = evaluator
                    , numFolds = 5)
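
The size of the grid determines how much work cross-validation does, because every parameter combination is fitted once per fold. Here is a quick plain-Python count for the grid above (the variable names are illustrative):

```python
from itertools import product

feature_subset_strategies = ['all', 'onethird', 'sqrt', 'log2']
max_depths = [2, 5, 10]
num_folds = 5

# Every pairing of the two parameter lists is one candidate model
combinations = list(product(feature_subset_strategies, max_depths))
n_fits = len(combinations) * num_folds
print(len(combinations), 'parameter combinations')
print(n_fits, 'model fits during cross-validation')
```

After the folds have been evaluated, `CrossValidator` refits the best combination on the full training set, so expect fitting to take noticeably longer than for a single model.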

In [None]:
# Fit the cross-validator to the training data
cv_model = cv.fit(flights_train)

# Average AUC for each parameter combination in grid
avg_auc = cv_model.avgMetrics

# Average AUC for the best model
best_model_auc = max(avg_auc)

# What are the optimal parameter values?
opt_max_depth = cv_model.bestModel.explainParam('maxDepth')
opt_feat_substrat = cv_model.bestModel.explainParam('featureSubsetStrategy')

# AUC for best model on testing data
best_auc = evaluator.evaluate(cv_model.transform(flights_test))

In [48]:
# Stop the Spark session
spark.stop()