## 102 - Training Regression Algorithms with the L-BFGS Solver

In this example, we run a linear regression on the *Flight Delay* dataset to predict arrival delay times (the `ArrDelay` column).

We demonstrate how to use the `TrainRegressor` and the `ComputePerInstanceStatistics` APIs.

First, import the packages.

In [1]:
import numpy as np
import pandas as pd
import mmlspark


Next, load the dataset, which is stored as a Parquet file.

In [2]:
flightDelay = spark.read.parquet("wasbs://publicwasb@mmlspark.blob.core.windows.net/On_Time_Performance_2012_9.parquet")
# print some basic info
print("records read: " + str(flightDelay.count()))
print("Schema: ")
flightDelay.printSchema()
flightDelay.limit(10).toPandas()

records read: 490199
Schema: 
root
 |-- Quarter: long (nullable = true)
 |-- Month: long (nullable = true)
 |-- DayofMonth: long (nullable = true)
 |-- DayOfWeek: long (nullable = true)
 |-- Carrier: string (nullable = true)
 |-- OriginAirportID: long (nullable = true)
 |-- DestAirportID: long (nullable = true)
 |-- CRSDepTime: long (nullable = true)
 |-- DepTimeBlk: string (nullable = true)
 |-- CRSArrTime: long (nullable = true)
 |-- ArrDelay: double (nullable = true)
 |-- ArrTimeBlk: string (nullable = true)
 |-- Diverted: double (nullable = true)

   Quarter  Month  DayofMonth  ...  ArrDelay ArrTimeBlk  Diverted
0        3      9           9  ...      17.0  2100-2159       0.0
1        3      9          23  ...     159.0  2100-2159       0.0
2        3      9          24  ...       8.0  2100-2159       0.0
3        3      9          18  ...      32.0  2100-2159       0.0
4        3      9          16  ...       NaN  2100-2159       0.0
5        3      9          13  ...       5.0  

Split the dataset into train and test sets.

In [3]:
train, test = flightDelay.randomSplit([0.75, 0.25])
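
Because `randomSplit` assigns rows at random, the 75/25 proportions are approximate, and the split changes between runs unless a seed is passed through its optional `seed` argument. The plain-Python sketch below illustrates the idea of drawing one uniform number per row and bucketing it by cumulative weight. It is an illustration only, not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(rows, weights, seed=None):
    """Assign each row to one split by drawing a uniform number per row.

    Hypothetical helper for illustration only; not Spark's implementation.
    """
    total = sum(weights)
    # Cumulative upper bounds of each split's share of [0, 1), e.g. [0.75, 1.0]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last bucket catches any draw not claimed earlier
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(10000))
train_rows, test_rows = random_split(rows, [0.75, 0.25], seed=42)
print(len(train_rows), len(test_rows))  # close to 7500 and 2500, but not exact
```

Fixing the seed makes the split repeatable, which helps when comparing metrics across runs.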


Train a regressor on the dataset with `l-bfgs`.

In [4]:
from mmlspark.train import TrainRegressor, TrainedRegressorModel
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import StringIndexer
# Replace each string column with its numeric index.
# The indexer is fit on the training split only, then applied to both splits.
catCols = ["Carrier", "DepTimeBlk", "ArrTimeBlk"]
trainCat = train
testCat = test
for catCol in catCols:
    simodel = StringIndexer(inputCol=catCol, outputCol=catCol + "Tmp").fit(train)
    trainCat = simodel.transform(trainCat).drop(catCol).withColumnRenamed(catCol + "Tmp", catCol)
    testCat = simodel.transform(testCat).drop(catCol).withColumnRenamed(catCol + "Tmp", catCol)
# Linear regression with elastic-net regularization
lr = LinearRegression().setRegParam(0.1).setElasticNetParam(0.3)
model = TrainRegressor(model=lr, labelCol="ArrDelay").fit(trainCat)
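
To make the indexing step above concrete, here is a minimal plain-Python sketch of what a string indexer does: it learns a mapping from each distinct string to a numeric index on the training data, with more frequent values receiving lower indices, and then applies that same mapping to the test data. The helper `fit_string_index`, the alphabetical tie-breaking, and the sample carrier values are assumptions made for illustration; this is not the `StringIndexer` implementation.

```python
from collections import Counter

def fit_string_index(values):
    """Learn a value -> index mapping, most frequent value first.

    Hypothetical helper for illustration only.
    """
    counts = Counter(values)
    # Sort by descending frequency; ties broken alphabetically (an assumption of this sketch)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: float(i) for i, value in enumerate(ordered)}

# Made-up sample values standing in for the "Carrier" column
train_carriers = ["DL", "AA", "DL", "UA", "DL", "AA"]
test_carriers = ["AA", "UA", "DL"]

mapping = fit_string_index(train_carriers)            # fit on the training values only
indexed_test = [mapping[c] for c in test_carriers]    # reuse the same mapping on test values
print(mapping)
print(indexed_test)
```

In this sketch a value that appears only in the test data raises a `KeyError`, which is why the mapping has to cover every category that will be scored.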


Save the trained regressor, load it back, and score the test data.

In [5]:
import random
model_name = "flightDelayModel_{}.mml".format(random.randint(1, 25))
model.write().overwrite().save(model_name)
flightDelayModel = TrainedRegressorModel.load(model_name)

scoredData = flightDelayModel.transform(testCat)
scoredData.limit(10).toPandas()

   Quarter  Month  DayofMonth  ...  DepTimeBlk  ArrTimeBlk    scores
0        3      9           1  ...         9.0         1.0  4.642338
1        3      9           1  ...        11.0         2.0 -2.147081
2        3      9           1  ...         1.0         8.0 -1.424672
3        3      9           1  ...        12.0         6.0  4.673187
4        3      9           1  ...        11.0         8.0 -2.247999
5        3      9           1  ...        13.0         6.0  5.768083
6        3      9           1  ...         4.0        13.0 -4.031764
7        3      9           1  ...         1.0         8.0 -1.493748
8        3      9           1  ...        11.0         7.0 -1.286130
9        3      9           1  ...        15.0        16.0  3.747550

[10 rows x 14 columns]

Compute model metrics against the entire scored dataset.

In [6]:
from mmlspark.train import ComputeModelStatistics
metrics = ComputeModelStatistics().transform(scoredData)
metrics.toPandas()

   mean_squared_error  root_mean_squared_error       R^2  mean_absolute_error
0         1133.361571                33.665436  0.045415            17.529481

In [7]:
metrics.first()['root_mean_squared_error']

33.665435845486535
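
For readers who want to see how the four reported metrics relate to each other, the sketch below computes them from their textbook definitions in plain Python. It assumes `ComputeModelStatistics` follows these standard definitions, which is consistent with the output above (the reported RMSE is the square root of the reported MSE). `regression_metrics` and the toy inputs are hypothetical.

```python
import math

def regression_metrics(labels, scores):
    """Textbook regression metrics; hypothetical helper for illustration only."""
    n = len(labels)
    errors = [y - s for y, s in zip(labels, scores)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)  # total sum of squares
    return {
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "R^2": 1.0 - ss_res / ss_tot,
        "mean_absolute_error": mae,
    }

# Toy labels and predictions, not taken from the flight data
toy_metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
print(toy_metrics)
```

Under these definitions, the R^2 of about 0.045 reported above means the model explains only around 4.5% of the variance in `ArrDelay`.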

Finally, compute and show per-instance statistics, demonstrating the usage
of `ComputePerInstanceStatistics`.

In [8]:
from mmlspark.train import ComputePerInstanceStatistics
evalPerInstance = ComputePerInstanceStatistics().transform(scoredData)
evalPerInstance.select("ArrDelay", "Scores", "L1_loss", "L2_loss").limit(10).toPandas()

   ArrDelay    Scores    L1_loss      L2_loss
0      19.0  2.365212  16.634788   276.716181
1     -26.0 -0.322294  25.677706   659.344584
2     -11.0  8.009789  19.009789   361.372060
3      16.0  1.634964  14.365036   206.354251
4      -7.0 -0.236589   6.763411    45.743733
5      14.0  6.238971   7.761029    60.233570
6     -14.0  2.064378  16.064378   258.064234
7       8.0  5.721098   2.278902     5.193396
8      40.0  7.145334  32.854666  1079.429066
9     -20.0 -3.795250  16.204750   262.593929
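
The table above is consistent with `L1_loss` being the absolute error and `L2_loss` the squared error for each row. The short sketch below recomputes both for the first three rows shown; treating the columns this way is an assumption based on the displayed values, and `per_instance_losses` is a hypothetical helper.

```python
def per_instance_losses(label, score):
    """Return (absolute error, squared error) for one row; illustration only."""
    diff = label - score
    return abs(diff), diff ** 2

# (ArrDelay, Scores) pairs copied from the first three rows of the table above
rows = [(19.0, 2.365212), (-26.0, -0.322294), (-11.0, 8.009789)]
losses = [per_instance_losses(label, score) for label, score in rows]
for (label, score), (l1, l2) in zip(rows, losses):
    print(label, score, round(l1, 6), round(l2, 3))
```

Small differences in the last digits relative to the table are expected, because the displayed scores are rounded.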

In [9]:
spark.stop()
