# Tuning Machine Learning models in Spark

## ML Pipelines in Spark

ML model training and tuning often means running the same steps over and over, frequently with small variations in order to evaluate combinations of parameters.

In order to make this use case a lot easier, Spark provides the [Pipeline](https://spark.apache.org/docs/2.3.0/ml-pipeline.html) abstraction.

A Pipeline represents a series of steps in the processing of a dataset. Each step is a Transformer or an Estimator. The whole Pipeline is itself an Estimator, so we can .fit it in one step. When we do, the stages run in order: each Estimator is fit on the data produced so far and the resulting model transforms the data for the next stage, while each Transformer just has its .transform method called.

![pipelineestimator](https://spark.apache.org/docs/2.3.0/img/ml-Pipeline.png)

![PipelineModel](https://spark.apache.org/docs/2.3.0/img/ml-PipelineModel.png)
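
To make the fit/transform chaining concrete, here is a minimal plain-Python sketch of the idea. It is not Spark's implementation: the classes Scale, Center and Shift and the helpers fit_pipeline and transform_pipeline are hypothetical names invented for this illustration, and the "dataset" is just a list of floats.

```python
class Scale:
    """Toy Transformer: multiplies every value by a fixed factor."""
    def __init__(self, factor):
        self.factor = factor

    def transform(self, data):
        return [x * self.factor for x in data]


class Shift:
    """Toy Transformer (a fitted 'model'): adds a fixed offset."""
    def __init__(self, offset):
        self.offset = offset

    def transform(self, data):
        return [x + self.offset for x in data]


class Center:
    """Toy Estimator: learns the mean and returns a Shift that removes it."""
    def fit(self, data):
        mean = sum(data) / len(data)
        return Shift(-mean)


def fit_pipeline(stages, data):
    """Fit stages in order, feeding each one the output of the previous one."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(data)      # Estimator -> fitted Transformer
        fitted.append(stage)
        data = stage.transform(data)     # output becomes the next stage's input
    return fitted


def transform_pipeline(fitted, data):
    """Apply the fitted stages in order (what a PipelineModel does)."""
    for stage in fitted:
        data = stage.transform(data)
    return data


fitted = fit_pipeline([Scale(2.0), Center()], [1.0, 2.0, 3.0])
result = transform_pipeline(fitted, [4.0])
print(result)
```

The point of the design is that the fitted list only contains Transformers, so applying it to new data never needs to see the training set again.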

### Basic Example of ML

In [4]:
##################################################
## Libraries
##################################################
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression

In [5]:
##################################################
## Create or Load a DataFrame
##################################################

# Prepare training data from a list of (features, label) tuples.
training = spark.createDataFrame([
    (Vectors.dense([0.0, 1.1, 0.1]),1.0),
    (Vectors.dense([2.0, 1.0, -1.0]),0.0),
    (Vectors.dense([2.0, 1.3, 1.0]),0.0),
    (Vectors.dense([0.0, 1.2, -0.5]),1.0)], ["features","label"])

training.show()

+--------------+-----+
|      features|label|
+--------------+-----+
| [0.0,1.1,0.1]|  1.0|
|[2.0,1.0,-1.0]|  0.0|
| [2.0,1.3,1.0]|  0.0|
|[0.0,1.2,-0.5]|  1.0|
+--------------+-----+



In [6]:
##################################################
## Model
##################################################

# Create a LogisticRegression instance. This instance is an Estimator.
lr = LogisticRegression(maxIter=10, regParam=0.01)
# Print out the parameters, documentation, and any default values.
print("LogisticRegression parameters:\n" + lr.explainParams() + "\n")

LogisticRegression parameters:
aggregationDepth: suggested depth for treeAggregate (>= 2). (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty. (default: 0.0)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial (default: auto)
featuresCol: features column name. (default: features)
fitIntercept: whether to fit an intercept term. (default: True)
labelCol: label column name. (default: label)
lowerBoundsOnCoefficients: The lower bounds on coefficients if fitting under bound constrained optimization. The bound matrix must be compatible with the shape (1, number of features) for binomial regression, or (number of classes, number of features) for multinomial regression. (undefined)
lowerBoundsOnIntercepts: The lower bounds on intercepts if fitting under bound constrained optimization. The bou

In [7]:
# Learn a LogisticRegression model. This uses the parameters stored in lr.
# Now, model1 is a TRANSFORMER
model1 = lr.fit(training)

In [8]:
# Since model1 is a Model (i.e., a transformer produced by an Estimator),
# we can view the parameters it used during fit().
# This prints the parameter (name: value) pairs, where names are unique IDs for this
# LogisticRegression instance.
print("Model 1 was fit using parameters: ")
print(model1.extractParamMap())

Model 1 was fit using parameters: 
{Param(parent='LogisticRegression_4f83c81941d0', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2)'): 2, Param(parent='LogisticRegression_4f83c81941d0', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty'): 0.0, Param(parent='LogisticRegression_4f83c81941d0', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial.'): 'auto', Param(parent='LogisticRegression_4f83c81941d0', name='featuresCol', doc='features column name'): 'features', Param(parent='LogisticRegression_4f83c81941d0', name='fitIntercept', doc='whether to fit an intercept term'): True, Param(parent='LogisticRegression_4f83c81941d0', name='labelCol', doc='label column name'): 'label', Param(parent='LogisticRegression_4f83c81941d0', name='maxIter', doc='maximum nu

In [11]:
# We may alternatively specify parameters using a Python dictionary as a paramMap
paramMap = {lr.maxIter: 20}
paramMap[lr.maxIter] = 30  # Specify 1 Param, overwriting the original maxIter.
paramMap.update({lr.regParam: 0.1, lr.threshold: 0.55})  # Specify multiple Params.

# You can combine paramMaps, which are python dictionaries.
paramMap2 = {lr.probabilityCol: "myProbability"}  # Change output column name
paramMapCombined = paramMap.copy()
paramMapCombined.update(paramMap2)

# Now learn a new model using the paramMapCombined parameters.
# paramMapCombined overrides all parameters set earlier via lr.set* methods.
model2 = lr.fit(training, paramMapCombined)
print("Model 2 was fit using parameters: ")
print(model2.extractParamMap())

# Prepare test data
test = spark.createDataFrame([
    (1.0, Vectors.dense([-1.0, 1.5, 1.3])),
    (0.0, Vectors.dense([3.0, 2.0, -0.1])),
    (1.0, Vectors.dense([0.0, 2.2, -1.5]))], ["label", "features"])

# Make predictions on test data using the Transformer.transform() method.
# LogisticRegression.transform will only use the 'features' column.
# Note that model2.transform() outputs a "myProbability" column instead of the usual
# 'probability' column since we renamed the lr.probabilityCol parameter previously.
prediction = model2.transform(test)
result = prediction.select("features", "label", "myProbability", "prediction") \
    .collect()

for row in result:
    print("features=%s, label=%s -> prob=%s, prediction=%s"
          % (row.features, row.label, row.myProbability, row.prediction))

Model 2 was fit using parameters: 
{Param(parent='LogisticRegression_4f83c81941d0', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2)'): 2, Param(parent='LogisticRegression_4f83c81941d0', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty'): 0.0, Param(parent='LogisticRegression_4f83c81941d0', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial.'): 'auto', Param(parent='LogisticRegression_4f83c81941d0', name='featuresCol', doc='features column name'): 'features', Param(parent='LogisticRegression_4f83c81941d0', name='fitIntercept', doc='whether to fit an intercept term'): True, Param(parent='LogisticRegression_4f83c81941d0', name='labelCol', doc='label column name'): 'label', Param(parent='LogisticRegression_4f83c81941d0', name='maxIter', doc='maximum nu

In [13]:
# show() prints the table and returns None, so there is nothing to assign.
prediction.show()

+-----+--------------+--------------------+--------------------+----------+
|label|      features|       rawPrediction|       myProbability|prediction|
+-----+--------------+--------------------+--------------------+----------+
|  1.0|[-1.0,1.5,1.3]|[-2.8046569418746...|[0.05707304171034...|       1.0|
|  0.0|[3.0,2.0,-0.1]|[2.49587635664210...|[0.92385223117041...|       0.0|
|  1.0|[0.0,2.2,-1.5]|[-2.0935249027913...|[0.10972776114780...|       1.0|
+-----+--------------+--------------------+--------------------+----------+



### Basic example of Pipeline

In [14]:
##################################################
## Libraries
##################################################
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

In [15]:
##################################################
## Create or Load a DataFrame
##################################################

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

training.show()

+---+----------------+-----+
| id|            text|label|
+---+----------------+-----+
|  0| a b c d e spark|  1.0|
|  1|             b d|  0.0|
|  2|     spark f g h|  1.0|
|  3|hadoop mapreduce|  0.0|
+---+----------------+-----+



In [16]:
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")                                    # Split the text into words
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")               # Turn the tokenizer output into a term-frequency vector
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

In [17]:
# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents and print columns of interest.
prediction = model.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    rid, text, prob, pred = row  # 'pred' avoids shadowing the prediction DataFrame
    print("(%d, %s) --> prob=%s, prediction=%f" % (rid, text, str(prob), pred))

(4, spark i j k) --> prob=[0.15964077387874745,0.8403592261212527], prediction=1.000000
(5, l m n) --> prob=[0.8378325685476744,0.16216743145232562], prediction=0.000000
(6, spark hadoop spark) --> prob=[0.06926633132976037,0.9307336686702395], prediction=1.000000
(7, apache hadoop) --> prob=[0.9821575333444218,0.01784246665557808], prediction=0.000000


In [20]:
selected.show()

+---+------------------+--------------------+----------+
| id|              text|         probability|prediction|
+---+------------------+--------------------+----------+
|  4|       spark i j k|[0.15964077387874...|       1.0|
|  5|             l m n|[0.83783256854767...|       0.0|
|  6|spark hadoop spark|[0.06926633132976...|       1.0|
|  7|     apache hadoop|[0.98215753334442...|       0.0|
+---+------------------+--------------------+----------+



In [21]:
# Apply only the first two fitted stages (tokenizer, then hashingTF).
after_hashing = model.stages[1].transform(model.stages[0].transform(test))
after_hashing.show()

+---+------------------+--------------------+--------------------+
| id|              text|               words|            features|
+---+------------------+--------------------+--------------------+
|  4|       spark i j k|    [spark, i, j, k]|(262144,[20197,24...|
|  5|             l m n|           [l, m, n]|(262144,[18910,10...|
|  6|spark hadoop spark|[spark, hadoop, s...|(262144,[155117,2...|
|  7|     apache hadoop|    [apache, hadoop]|(262144,[66695,15...|
+---+------------------+--------------------+--------------------+



## Example: predicting flight delays

We'll be using the same [Transtats on-time performance (OTP) data](https://www.transtats.bts.gov/) from way back when. Remember it?

It's a table that contains all domestic departures by US air carriers that account for at least one percent of domestic scheduled passenger revenues, with data on each individual departure including [Tail Number](https://en.wikipedia.org/wiki/Tail_number), departure delay, origin, destination and carrier.


### Load the data

Opening .zip files in Spark is a bit of a pain. For now, let's just decompress the file we want to read. When we are ready to expand the processing to the cluster, we will need to do [this](https://stackoverflow.com/questions/28569788/how-to-open-stream-zip-files-through-spark).

In [44]:
file = 'On_Time_On_Time_Performance_2015_8.csv'
df = spark.read.csv(file,header=True, inferSchema=True)

In [45]:
df2 = df.select('FlightDate', 'DayOfWeek', 'Year', 'Month', 'DayofMonth', 'DayOfWeek',
                 'Carrier', 
                 'TailNum', 
                 'FlightNum', 
                 'Origin', 'OriginCityName', 'OriginStateName', 
                 'Dest', 'DestCityName', 'DestStateName',
                 'DepTime', 'DepDelay', 'AirTime', 'Distance')

df2.show()

+-------------------+---------+----+-----+----------+---------+-------+-------+---------+------+--------------+---------------+----+---------------+-------------+-------+--------+-------+--------+
|         FlightDate|DayOfWeek|Year|Month|DayofMonth|DayOfWeek|Carrier|TailNum|FlightNum|Origin|OriginCityName|OriginStateName|Dest|   DestCityName|DestStateName|DepTime|DepDelay|AirTime|Distance|
+-------------------+---------+----+-----+----------+---------+-------+-------+---------+------+--------------+---------------+----+---------------+-------------+-------+--------+-------+--------+
|2015-08-02 00:00:00|        7|2015|    8|         2|        7|     AA| N790AA|        1|   JFK|  New York, NY|       New York| LAX|Los Angeles, CA|   California|    854|    -6.0|  313.0|  2475.0|
|2015-08-03 00:00:00|        1|2015|    8|         3|        1|     AA| N784AA|        1|   JFK|  New York, NY|       New York| LAX|Los Angeles, CA|   California|    858|    -2.0|  316.0|  2475.0|
|2015-08-04 00:

In [46]:
df2.count()

510536

### Drop NAs

There are only a few departures for which any of the columns of interest contains null values. The most expedient way to handle them is to just drop them, since they won't make much of a difference.

In [47]:
df3 = df2.dropna()

NA-related functions are also grouped in a .na attribute of DataFrames: df2.na.drop() is equivalent to df2.dropna().

In [48]:
df3.count()

503956

## Feature extraction and generation of target variable

The departure hour is the most important factor in delays, so we need to calculate it from the departure time. The input file stores times as hhmm numbers (854 means 8:54), so Spark has read them as plain numbers rather than times. Dividing by 100 and casting to an integer gives the hour:

In [49]:
from pyspark.sql import types
from pyspark.sql import functions as f
flights = (df3.withColumn('DepHour', (df3['DepTime'] / 100).cast(types.IntegerType()))
              .withColumn('Delayed', f.when(df3['DepDelay'] > 15, 1).otherwise(0)))

In [52]:
flights.show()

+-------------------+---------+----+-----+----------+---------+-------+-------+---------+------+--------------+---------------+----+---------------+-------------+-------+--------+-------+--------+-------+-------+
|         FlightDate|DayOfWeek|Year|Month|DayofMonth|DayOfWeek|Carrier|TailNum|FlightNum|Origin|OriginCityName|OriginStateName|Dest|   DestCityName|DestStateName|DepTime|DepDelay|AirTime|Distance|DepHour|Delayed|
+-------------------+---------+----+-----+----------+---------+-------+-------+---------+------+--------------+---------------+----+---------------+-------------+-------+--------+-------+--------+-------+-------+
|2015-08-02 00:00:00|        7|2015|    8|         2|        7|     AA| N790AA|        1|   JFK|  New York, NY|       New York| LAX|Los Angeles, CA|   California|    854|    -6.0|  313.0|  2475.0|      8|      0|
|2015-08-03 00:00:00|        1|2015|    8|         3|        1|     AA| N784AA|        1|   JFK|  New York, NY|       New York| LAX|Los Angeles, CA|

#### Exercise

Calculate the 'DepHour' column that represents the hour as an int.

We will also generate a binary target variable. The aviation industry considers a flight delayed when it departs more than 15 minutes after its scheduled departure time, so we will use that. We will create it as an integer, since that is what the learning algorithms expect.
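
As a quick sanity check of the two derived columns, the same rules can be written for a single row in plain Python. This is only an illustration of the arithmetic; the function names dep_hour and is_delayed are made up for this sketch and are not part of the notebook.

```python
def dep_hour(dep_time):
    """Hour of departure from an hhmm number, e.g. 854 -> 8."""
    return int(dep_time / 100)


def is_delayed(dep_delay, threshold=15):
    """1 if the flight left more than `threshold` minutes late, else 0."""
    return 1 if dep_delay > threshold else 0


# First row of the table above: DepTime 854, DepDelay -6.0
print(dep_hour(854), is_delayed(-6.0))
```

Note that the comparison is strict, so a flight that leaves exactly 15 minutes late is not counted as delayed.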

In order to make the training times manageable, let's pick only 20% of the data to train.

In [54]:
samples = flights.sample(fraction=0.2,withReplacement=False,seed=42)

## Handle different fields in different ways (important!)

We have features of at least three kinds:

* Numeric continuous fields, which we can use as input to many algorithms as they are. In particular, decision trees can take continuous variables with any value as input, since they only look for the cutoff point that most increases the homogeneity of the resulting groups. In contrast, if we were using a logistic regression with regularization, for example, we would need to first scale the variables to have comparable magnitudes.

* There are fields which we will treat as categorical variables, but which are already integers. These need to be one-hot encoded.

* Finally, there are several categorical variables that are encoded as strings. These need to be one-hot encoded, but OneHotEncoder requires numeric input. Therefore, we will need to apply a StringIndexer to each of them before one-hot encoding.

We have generated the list of names of columns that have dataType string with a list comprehension, rather than hard-coding it, but it is just like the other ones.

## Handling categorical fields

Let's do the processing of just one field first, as an example. Then we will process the rest.

### StringIndexer 

A [StringIndexer](https://spark.apache.org/docs/2.2.0/ml-features.html#stringindexer) is an Estimator that takes a single string field and produces a Transformer that encodes that field as numeric label indices, ready to be fed to a one-hot encoder.

We need to specify an input column, an output column, and a way to handle invalid values. In this case, invalid values are labels that the indexer has not seen during fitting but that the transformer finds during processing. The handleInvalid parameter accepts 'error' (the default), which raises an exception; 'skip', which drops the offending rows; and 'keep', which is what we want: it assigns all unseen labels to a single extra category index.
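
The behaviour can be sketched in plain Python. This is a toy re-implementation for intuition, not Spark's code: fit_string_indexer and index_value are hypothetical helpers, the carrier codes are sample values, and breaking ties alphabetically is an assumption of this sketch. What it does mirror is that the most frequent label gets index 0.0 and that 'keep' sends every unseen label to one extra index.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Assumption of this sketch: ties are broken alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def index_value(mapping, value, handle_invalid="error"):
    """Look up one value, handling unseen labels like handleInvalid does."""
    if value in mapping:
        return mapping[value]
    if handle_invalid == "keep":
        return float(len(mapping))   # one extra bucket for all unseen labels
    if handle_invalid == "skip":
        return None                  # signals the caller to drop the row
    raise ValueError("Unseen label: %r" % value)


mapping = fit_string_indexer(["WN", "WN", "WN", "DL", "DL", "AA"])
print(mapping)
print(index_value(mapping, "ZZ", handle_invalid="keep"))
```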

In [55]:
from pyspark.ml.feature import StringIndexer

carrier_indexer = StringIndexer(inputCol= 'Carrier',outputCol= 'CarrierIndex')
carrier_indexer

StringIndexer_3ae5a4a9214e

In [57]:
carrier_transformer = carrier_indexer.fit(samples)
carrier_transformer.transform

<bound method Transformer.transform of StringIndexer_3ae5a4a9214e>

In [58]:
carrier_transformer.transform(samples)

DataFrame[FlightDate: timestamp, DayOfWeek: int, Year: int, Month: int, DayofMonth: int, DayOfWeek: int, Carrier: string, TailNum: string, FlightNum: int, Origin: string, OriginCityName: string, OriginStateName: string, Dest: string, DestCityName: string, DestStateName: string, DepTime: int, DepDelay: double, AirTime: double, Distance: double, DepHour: int, Delayed: int, CarrierIndex: double]

In [64]:
carrier_transformer.transform(samples).select(['Carrier','CarrierIndex']).show(5)

+-------+------------+
|Carrier|CarrierIndex|
+-------+------------+
|     AA|         2.0|
|     AA|         2.0|
|     AA|         2.0|
|     AA|         2.0|
|     AA|         2.0|
+-------+------------+
only showing top 5 rows



In [66]:
flights.show(1)

+-------------------+---------+----+-----+----------+---------+-------+-------+---------+------+--------------+---------------+----+---------------+-------------+-------+--------+-------+--------+-------+-------+
|         FlightDate|DayOfWeek|Year|Month|DayofMonth|DayOfWeek|Carrier|TailNum|FlightNum|Origin|OriginCityName|OriginStateName|Dest|   DestCityName|DestStateName|DepTime|DepDelay|AirTime|Distance|DepHour|Delayed|
+-------------------+---------+----+-----+----------+---------+-------+-------+---------+------+--------------+---------------+----+---------------+-------------+-------+--------+-------+--------+-------+-------+
|2015-08-02 00:00:00|        7|2015|    8|         2|        7|     AA| N790AA|        1|   JFK|  New York, NY|       New York| LAX|Los Angeles, CA|   California|    854|    -6.0|  313.0|  2475.0|      8|      0|
+-------------------+---------+----+-----+----------+---------+-------+-------+---------+------+--------------+---------------+----+---------------+

In [92]:
categorical_fields = ['Year','Month','DepHour','DayofMonth','DayOfWeek']

string_fields = [field.name for field in samples.schema.fields if field.dataType == types.StringType()]

continuous_fields = ['Distance']

target_field = 'Delayed'

In [93]:
string_fields

['Carrier',
 'TailNum',
 'Origin',
 'OriginCityName',
 'OriginStateName',
 'Dest',
 'DestCityName',
 'DestStateName']

In [94]:
string_indexers =[StringIndexer(inputCol=field,outputCol=field+'index',handleInvalid='keep') for field in string_fields] # 'keep' puts labels unseen during fitting into an extra category

### OneHotEncoder

A [OneHotEncoder](https://spark.apache.org/docs/2.2.0/ml-features.html#onehotencoder) maps a column of category indices with n categories to a column of vectors of length n-1: by default the last category is dropped and represented by the all-zeros vector.

We need to specify an input and an output column.
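
A small plain-Python sketch of that encoding, using the (size, indices, values) sparse notation that Spark prints. The helper one_hot is a hypothetical name for this illustration, and the figure of 13 categories is an assumption chosen so that the vector size is 12.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a sparse (size, indices, values) triple."""
    size = num_categories - 1 if drop_last else num_categories
    position = int(index)
    if position < size:
        return (size, [position], [1.0])
    # The dropped last category is encoded as all zeros.
    return (size, [], [])


print(one_hot(2.0, 13))    # an ordinary category
print(one_hot(12.0, 13))   # the last category, dropped by default
```

Dropping the last category avoids a redundant column: if all n-1 positions are zero, the row must belong to the remaining category.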

In [95]:
from pyspark.ml.feature import OneHotEncoder

transformed = carrier_transformer.transform(samples)
oh = OneHotEncoder(inputCol='CarrierIndex', outputCol='CarrierOneHOt')
oh_encoded = oh.transform(transformed)
oh_encoded

DataFrame[FlightDate: timestamp, DayOfWeek: int, Year: int, Month: int, DayofMonth: int, DayOfWeek: int, Carrier: string, TailNum: string, FlightNum: int, Origin: string, OriginCityName: string, OriginStateName: string, Dest: string, DestCityName: string, DestStateName: string, DepTime: int, DepDelay: double, AirTime: double, Distance: double, DepHour: int, Delayed: int, CarrierIndex: double, CarrierOneHOt: vector]

In [96]:
oh_encoded.select('CarrierIndex','CarrierOneHOt').show()

+------------+--------------+
|CarrierIndex| CarrierOneHOt|
+------------+--------------+
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
|         2.0|(12,[2],[1.0])|
+------------+--------------+
only showing top 20 rows



### SparseVectors

The vectors produced by OneHotEncoder will each have only one non-zero value, but can potentially be very long. An efficient way to represent them is therefore a SparseVector, and that is what OneHotEncoder generates. 

A SparseVector is a data structure that only stores the length of the vector, a list of positions, and a list of values. All other values are assumed to be 0s.

This way, a vector like the following, with length 15 and non-zero values only on positions 3 and 9:

```python
[0.0, 0.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

can be compactly expressed as

```python
(15, [3, 9], [6.0, 4.0])
```
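
The conversion between the two representations is short enough to write out. This is a plain-Python sketch with hypothetical helper names (to_dense, to_sparse), not the Spark class itself.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a full list."""
    dense = [0.0] * size
    for position, value in zip(indices, values):
        dense[position] = value
    return dense


def to_sparse(dense):
    """Keep only the length and the non-zero positions and values."""
    indices = [i for i, value in enumerate(dense) if value != 0.0]
    return (len(dense), indices, [dense[i] for i in indices])


dense = to_dense(15, [3, 9], [6.0, 4.0])
print(dense)
print(to_sparse(dense))
```

In PySpark itself the same vector can be built with Vectors.sparse(15, [3, 9], [6.0, 4.0]) from pyspark.ml.linalg.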

## Let's build our first Pipeline!

Our pipeline consists of a number of StringIndexers, followed by one OneHotEncoder per categorical column, followed by a VectorAssembler, with a RandomForestClassifier at the end.

A Spark Pipeline is a single Estimator. We build it by specifying the stages it comprises, and then we are ready to .fit it in one go. This will save us a lot of trouble, since we don't need to fit and transform each stage individually.

### StringIndexer stages

We only need to StringIndex some of the fields. We are going to build the input and output column names programmatically.


In [97]:
string_indexers =[StringIndexer(inputCol=field,outputCol=field+'index',handleInvalid='keep') for field in string_fields] # 'keep' puts labels unseen during fitting into an extra category

### OneHotEncoder

One OneHotEncoder per categorical column. We are also going to build these stages programmatically.

In [98]:
# The integer-coded categorical fields can be encoded directly, without a StringIndexer.
encoders_cat =[OneHotEncoder(inputCol=field,outputCol=field+'OneHot') for field in categorical_fields]
encoders_str =[OneHotEncoder(inputCol=field+'index',outputCol=field+'OneHot') for field in string_fields] 

### VectorAssembler

Once we have generated our features, we can assemble them into a single features column, together with the continuous_fields.

In [99]:
from pyspark.ml.feature import VectorAssembler

cols_to_concatenate = [ field + 'OneHot' for field in categorical_fields] + continuous_fields + [field+'OneHot' for field in string_fields]
cols_to_concatenate

['YearOneHot',
 'MonthOneHot',
 'DepHourOneHot',
 'DayofMonthOneHot',
 'DayOfWeekOneHot',
 'Distance',
 'CarrierOneHot',
 'TailNumOneHot',
 'OriginOneHot',
 'OriginCityNameOneHot',
 'OriginStateNameOneHot',
 'DestOneHot',
 'DestCityNameOneHot',
 'DestStateNameOneHot']

In [100]:
assembler = VectorAssembler(inputCols=cols_to_concatenate, outputCol='features')
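
Conceptually, the assembler just concatenates its inputs in the order given, shifting the positions of each sparse piece by the total width of everything before it. Here is a plain-Python sketch of that idea; assemble is a hypothetical helper, the input values are made up, and Spark's real VectorAssembler also handles dense vectors and other column types.

```python
def assemble(parts):
    """Concatenate numbers and sparse (size, indices, values) triples."""
    indices, values, offset = [], [], 0
    for part in parts:
        if isinstance(part, (int, float)):
            # A plain number occupies exactly one position.
            if part != 0:
                indices.append(offset)
                values.append(float(part))
            offset += 1
        else:
            size, part_indices, part_values = part
            # Shift the sparse positions by the width assembled so far.
            indices.extend(offset + i for i in part_indices)
            values.extend(part_values)
            offset += size
    return (offset, indices, values)


# A one-hot vector of size 12, a distance, and a one-hot vector of size 3.
features = assemble([(12, [2], [1.0]), 2475.0, (3, [0], [1.0])])
print(features)
```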

### RandomForestClassifier

Aaaaand we are ready to do some Machine Learning! We'll use a RandomForestClassifier to try to predict delayed versus non delayed flights, a binary classification task.

In [101]:
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(featuresCol='features',labelCol='Delayed')

### Pipeline!

Now that we have all the stages, we are finally ready to put them together into a single Estimator, our Pipeline.

In [102]:
from pyspark.ml.pipeline import Pipeline

pipeline = Pipeline(stages= string_indexers +
                            encoders_str +
                            encoders_cat +
                            [assembler] +
                            [rf])
pipeline

Pipeline_13ee26ca1baf

Now that we have gone to the trouble of building our Pipeline, fitting it and using it to predict the probability of delay on unseen data is as easy as using a single Estimator:

In [103]:
model = pipeline.fit(samples)

## Evaluating and tuning our Pipeline

Probably the most interesting use of Spark Pipelines is quickly (in terms of coding time) evaluating many combinations of hyperparameters and choosing the best ones. For that, we can use a TrainValidationSplit or a CrossValidator. The CrossValidator will generally give a more reliable estimate, since it evaluates every combination on several folds, but it takes several times as long. I'm using the TrainValidationSplit here; the API is the same, so swapping one for the other later is easy.

In [107]:
predictions = model.transform(flights.sample(fraction=0.03))
predictions

DataFrame[FlightDate: timestamp, DayOfWeek: int, Year: int, Month: int, DayofMonth: int, DayOfWeek: int, Carrier: string, TailNum: string, FlightNum: int, Origin: string, OriginCityName: string, OriginStateName: string, Dest: string, DestCityName: string, DestStateName: string, DepTime: int, DepDelay: double, AirTime: double, Distance: double, DepHour: int, Delayed: int, Carrierindex: double, TailNumindex: double, Originindex: double, OriginCityNameindex: double, OriginStateNameindex: double, Destindex: double, DestCityNameindex: double, DestStateNameindex: double, CarrierOneHot: vector, TailNumOneHot: vector, OriginOneHot: vector, OriginCityNameOneHot: vector, OriginStateNameOneHot: vector, DestOneHot: vector, DestCityNameOneHot: vector, DestStateNameOneHot: vector, YearOneHot: vector, MonthOneHot: vector, DepHourOneHot: vector, DayofMonthOneHot: vector, DayOfWeekOneHot: vector, features: vector, rawPrediction: vector, probability: vector, prediction: double]

In [108]:
predictions[['rawPrediction','probability','prediction','Delayed']].show(50)

+--------------------+--------------------+----------+-------+
|       rawPrediction|         probability|prediction|Delayed|
+--------------------+--------------------+----------+-------+
|[16.5459158689967...|[0.82729579344983...|       0.0|      0|
|[16.3265125259951...|[0.81632562629975...|       0.0|      0|
|[16.4937711266524...|[0.82468855633262...|       0.0|      0|
|[16.4676313895743...|[0.82338156947872...|       0.0|      1|
|[16.3265125259951...|[0.81632562629975...|       0.0|      0|
|[16.3265125259951...|[0.81632562629975...|       0.0|      0|
|[16.2961169198362...|[0.81480584599181...|       0.0|      0|
|[16.5459158689967...|[0.82729579344983...|       0.0|      0|
|[16.3195291143789...|[0.81597645571894...|       0.0|      0|
|[16.2961169198362...|[0.81480584599181...|       0.0|      0|
|[16.3265125259951...|[0.81632562629975...|       0.0|      0|
|[16.2961169198362...|[0.81480584599181...|       0.0|      0|
|[16.5718593128017...|[0.82859296564008...|       0.0| 

### Params and Evaluators

In order to evaluate different sets of parameters, we need a) the set of parameters to iterate through and b) a metric to compare the results. 

The first element is represented by ParamMaps, which we build with a ParamGridBuilder, and the second by an Evaluator that needs to be specific to the relevant task.

In [109]:
from pyspark.ml.tuning import TrainValidationSplit, CrossValidator

We now have all the elements in place to perform our fit:

And now we can predict on the rest of the flights and compare them with reality:

### Let's have a look

We are now ready to compare our predictions with reality. Do these features have any predictive power at all?

Not bad, considering we have not performed any feature engineering at all!

### Further Reading

https://spark.apache.org/docs/2.3.0/ml-tuning.html

https://stackoverflow.com/questions/28569788/how-to-open-stream-zip-files-through-spark