## 105 - Training Regressions

This example notebook is similar to
[Notebook 102](102 - Regression Example with Flight Delay Dataset.ipynb).
It demonstrates two uses of `DataConversion()`.  The first converts the data
type of several columns after the dataset has been read into the Spark
DataFrame, instead of specifying the data types as the file is read.  The
second converts columns to categorical columns, instead of iterating over
the columns and applying the `StringIndexer`.

This sample demonstrates how to use the following APIs:
- [`TrainRegressor`
  ](http://mmlspark.azureedge.net/docs/pyspark/TrainRegressor.html)
- [`ComputePerInstanceStatistics`
  ](http://mmlspark.azureedge.net/docs/pyspark/ComputePerInstanceStatistics.html)
- [`DataConversion`
  ](http://mmlspark.azureedge.net/docs/pyspark/DataConversion.html)

First, import the pandas package.

In [1]:
import pandas as pd

Next, load the dataset: read the Parquet file from Azure Blob Storage
directly into a Spark DataFrame with `spark.read.parquet()`, then display
the first ten rows as a pandas dataframe.

Print the schema of the dataframe, and note the columns that are `long`.

In [2]:
flightDelay = spark.read.parquet("wasbs://publicwasb@mmlspark.blob.core.windows.net/On_Time_Performance_2012_9.parquet")
# print some basic info
print("records read: " + str(flightDelay.count()))
print("Schema: ")
flightDelay.printSchema()
flightDelay.limit(10).toPandas()

records read: 490199
Schema: 
root
 |-- Quarter: long (nullable = true)
 |-- Month: long (nullable = true)
 |-- DayofMonth: long (nullable = true)
 |-- DayOfWeek: long (nullable = true)
 |-- Carrier: string (nullable = true)
 |-- OriginAirportID: long (nullable = true)
 |-- DestAirportID: long (nullable = true)
 |-- CRSDepTime: long (nullable = true)
 |-- DepTimeBlk: string (nullable = true)
 |-- CRSArrTime: long (nullable = true)
 |-- ArrDelay: double (nullable = true)
 |-- ArrTimeBlk: string (nullable = true)
 |-- Diverted: double (nullable = true)

   Quarter  Month  DayofMonth  ...  ArrDelay ArrTimeBlk  Diverted
0        3      9           9  ...      17.0  2100-2159       0.0
1        3      9          23  ...     159.0  2100-2159       0.0
2        3      9          24  ...       8.0  2100-2159       0.0
3        3      9          18  ...      32.0  2100-2159       0.0
4        3      9          16  ...       NaN  2100-2159       0.0
5        3      9          13  ...       5.0  

Use the `DataConversion` transform API to convert the columns listed to
double.

The `DataConversion` API accepts the following types for the `convertTo`
parameter:
* `boolean`
* `byte`
* `short`
* `integer`
* `long`
* `float`
* `double`
* `string`
* `toCategorical`
* `clearCategorical`
* `date` -- converts a string or long to a date of the format
  "yyyy-MM-dd HH:mm:ss" unless another format is specified by
  the `dateTimeFormat` parameter.
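
To make the default `date` pattern concrete, here is a minimal pure-Python
sketch that parses a string laid out as "yyyy-MM-dd HH:mm:ss".  It is only an
analogue using the standard `datetime` module, not how `DataConversion` is
implemented; the helper name `parse_timestamp` and the sample timestamp are
made up for illustration, and only the string case is covered.

```python
from datetime import datetime

# Python strptime equivalent of the Java-style pattern "yyyy-MM-dd HH:mm:ss".
PYTHON_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_timestamp(text):
    """Parse a string such as '2012-09-14 21:05:00' into a datetime.

    Hypothetical helper for illustration; not part of MMLSpark.
    """
    return datetime.strptime(text, PYTHON_DATE_FORMAT)


# Made-up sample value, chosen only to match the pattern.
parsed = parse_timestamp("2012-09-14 21:05:00")
print(parsed.year, parsed.month, parsed.day, parsed.hour, parsed.minute)
```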

Again, print the schema and note that the columns are now `double`
instead of `long`.

In [3]:
from mmlspark.featurize import DataConversion
flightDelay = DataConversion(cols=["Quarter","Month","DayofMonth","DayOfWeek",
                                   "OriginAirportID","DestAirportID",
                                   "CRSDepTime","CRSArrTime"],
                             convertTo="double") \
                  .transform(flightDelay)
flightDelay.printSchema()
flightDelay.limit(10).toPandas()

root
 |-- Quarter: double (nullable = true)
 |-- Month: double (nullable = true)
 |-- DayofMonth: double (nullable = true)
 |-- DayOfWeek: double (nullable = true)
 |-- Carrier: string (nullable = true)
 |-- OriginAirportID: double (nullable = true)
 |-- DestAirportID: double (nullable = true)
 |-- CRSDepTime: double (nullable = true)
 |-- DepTimeBlk: string (nullable = true)
 |-- CRSArrTime: double (nullable = true)
 |-- ArrDelay: double (nullable = true)
 |-- ArrTimeBlk: string (nullable = true)
 |-- Diverted: double (nullable = true)

   Quarter  Month  DayofMonth  ...  ArrDelay ArrTimeBlk  Diverted
0      3.0    9.0        14.0  ...      -6.0  2000-2059       0.0
1      3.0    9.0        14.0  ...     -17.0  1500-1559       0.0
2      3.0    9.0        14.0  ...     -22.0  1300-1359       0.0
3      3.0    9.0        14.0  ...      -7.0  2100-2159       0.0
4      3.0    9.0        14.0  ...     -13.0  1300-1359       0.0
5      3.0    9.0        14.0  ...      21.0  1400-1459     

Split the dataset into train and test sets.

In [4]:
train, test = flightDelay.randomSplit([0.75, 0.25])
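
Because `randomSplit` assigns rows at random, the 75/25 proportions are
approximate, and the split changes between runs unless a seed is passed as
the optional second argument (`randomSplit(weights, seed)`).  The sketch
below illustrates that idea in plain Python.  It is not how Spark implements
the split; `random_split` and the toy row list are hypothetical.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split using one uniform draw per row.

    Illustrative only: Spark's own sampling strategy differs.
    """
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.75, 1.0] for weights [0.75, 0.25].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding gap.
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(1000))  # toy stand-in for the DataFrame rows
train_rows, test_rows = random_split(rows, [0.75, 0.25], seed=42)
print(len(train_rows), len(test_rows))
```

The two sizes add up to the number of input rows, but the ratio between them
is only close to 3:1, which is also what to expect from the Spark call.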

Create a regressor model and train it on the dataset.

First, use `DataConversion` to convert the columns `Carrier`, `DepTimeBlk`,
and `ArrTimeBlk` to categorical data.  Recall that in Notebook 102, this
was accomplished by iterating over the columns and converting the strings
to index values using the `StringIndexer` API.  The `DataConversion` API
simplifies the task by allowing you to specify all columns that will have
the same end type in a single command.
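
Conceptually, a categorical conversion replaces each distinct string with an
integer index, which is why `ArrTimeBlk` shows up as small integers in the
scored output later in this notebook.  The following pure-Python sketch shows
the idea only.  The function names are hypothetical, and the sorted index
order is an assumption for illustration; the order that `DataConversion` or
`StringIndexer` actually assigns may differ.

```python
def build_index(values):
    """Map each distinct value to an integer index.

    Sorted order is an assumption made for this sketch.
    """
    return {value: i for i, value in enumerate(sorted(set(values)))}


def to_categorical(values, index):
    """Replace every value with its integer index."""
    return [index[value] for value in values]


# A few time-block strings in the style of the ArrTimeBlk column.
arr_time_blk = ["2100-2159", "1300-1359", "2100-2159", "1500-1559"]
index = build_index(arr_time_blk)
encoded = to_categorical(arr_time_blk, index)
print(index)
print(encoded)
```

Building the mapping once and reusing it, as the sketch does, keeps the
encoding consistent across every dataset that is converted with it.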

Create a `LinearRegression` model using the limited-memory BFGS solver
(`l-bfgs`), an elastic net mixing parameter (`elasticNetParam`) of `0.3`,
and a regularization parameter (`regParam`) of `0.1`.
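
As a rough guide to what these two settings mean, Spark ML documents the
elastic net penalty as a blend of an L1 term and an L2 term, weighted by the
mixing parameter and scaled by the regularization parameter.  The sketch
below evaluates that penalty in plain Python for a made-up weight vector.
The function name and the example weights are hypothetical, and the sketch
ignores details such as feature standardization.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty = reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2).

    alpha is the elastic net mixing parameter: 0 gives a pure L2 (ridge)
    penalty and 1 gives a pure L1 (lasso) penalty.
    """
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_squared = sum(w * w for w in weights)
    return reg_param * (
        elastic_net_param * l1_norm
        + (1.0 - elastic_net_param) / 2.0 * l2_norm_squared
    )


# Hypothetical coefficients, not taken from the trained model.
example_weights = [1.0, -2.0]
penalty = elastic_net_penalty(example_weights, reg_param=0.1,
                              elastic_net_param=0.3)
print(round(penalty, 6))
```

With a mixing parameter of `0.3`, most of the penalty comes from the L2 term,
so the model leans toward shrinking coefficients rather than zeroing them.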

Train the model with the `TrainRegressor` API fit on the training dataset.

In [5]:
from mmlspark.train import TrainRegressor, TrainedRegressorModel
from pyspark.ml.regression import LinearRegression

trainCat = DataConversion(cols=["Carrier","DepTimeBlk","ArrTimeBlk"],
                          convertTo="toCategorical") \
               .transform(train)
testCat  = DataConversion(cols=["Carrier","DepTimeBlk","ArrTimeBlk"],
                          convertTo="toCategorical") \
               .transform(test)
lr = LinearRegression().setSolver("l-bfgs").setRegParam(0.1) \
                       .setElasticNetParam(0.3)
model = TrainRegressor(model=lr, labelCol="ArrDelay").fit(trainCat)

Score the regressor on the test data.

In [6]:
scoredData = model.transform(testCat)
scoredData.limit(10).toPandas()

   Quarter  Month  DayofMonth  ...  ArrTimeBlk  Diverted    scores
0      3.0    9.0         1.0  ...          10       0.0  3.970542
1      3.0    9.0         1.0  ...           2       0.0 -2.703046
2      3.0    9.0         1.0  ...          14       0.0  8.236376
3      3.0    9.0         1.0  ...          12       0.0  6.459875
4      3.0    9.0         1.0  ...           7       0.0  1.258493
5      3.0    9.0         1.0  ...          13       0.0  5.511770
6      3.0    9.0         1.0  ...           3       0.0 -1.883580
7      3.0    9.0         1.0  ...          12       0.0  6.679502
8      3.0    9.0         1.0  ...          14       0.0  7.575472
9      3.0    9.0         1.0  ...          13       0.0  7.578747

[10 rows x 14 columns]

Compute model metrics against the entire scored dataset.

In [7]:
from mmlspark.train import ComputeModelStatistics
metrics = ComputeModelStatistics().transform(scoredData)
metrics.toPandas()

   mean_squared_error  root_mean_squared_error       R^2  mean_absolute_error
0         1108.469661                33.293688  0.042336            17.477483
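
The four columns above are standard regression metrics.  The pure-Python
sketch below computes them from their textbook definitions, on the
assumption that `ComputeModelStatistics` follows the same definitions; the
function name and the toy labels and scores are hypothetical.

```python
import math


def regression_metrics(labels, scores):
    """Compute MSE, RMSE, R^2, and MAE from their textbook definitions."""
    n = len(labels)
    errors = [label - score for label, score in zip(labels, scores)]
    squared_error_sum = sum(e * e for e in errors)
    mse = squared_error_sum / n
    mae = sum(abs(e) for e in errors) / n
    mean_label = sum(labels) / n
    total_sum_of_squares = sum((label - mean_label) ** 2 for label in labels)
    return {
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "R^2": 1.0 - squared_error_sum / total_sum_of_squares,
        "mean_absolute_error": mae,
    }


# Toy data, not taken from the flight delay dataset.
toy_metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(toy_metrics)

# Consistency check on the table above: RMSE is the square root of MSE.
print(math.sqrt(1108.469661))
```

An R^2 of about 0.04 means the model explains only a small share of the
variance in `ArrDelay`, so the predictions stay close to the mean delay.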

Finally, compute and show statistics on individual predictions in the test
dataset, demonstrating the usage of `ComputePerInstanceStatistics`.

In [8]:
from mmlspark.train import ComputePerInstanceStatistics
evalPerInstance = ComputePerInstanceStatistics().transform(scoredData)
evalPerInstance.select("ArrDelay", "Scores", "L1_loss", "L2_loss") \
               .limit(10).toPandas()

   ArrDelay    Scores    L1_loss      L2_loss
0       2.0 -5.271233   7.271233    52.870830
1      -6.0 -2.190444   3.809556    14.512719
2     -14.0  5.055870  19.055870   363.126173
3      -5.0  6.082205  11.082205   122.815261
4       4.0 -1.326078   5.326078    28.367109
5      86.0  3.353821  82.646179  6830.390844
6     -17.0  5.617123  22.617123   511.534256
7     -13.0  5.835471  18.835471   354.774976
8     -12.0 -5.826737   6.173263    38.109181
9     -27.0 -4.678505  22.321495   498.249127
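
The values in the table are consistent with `L1_loss` being the absolute
error and `L2_loss` being the squared error of each prediction.  The short
sketch below recomputes three rows from the table in plain Python; the
function names are hypothetical and this is a check on the numbers, not the
library's implementation.

```python
def l1_loss(label, score):
    """Absolute error of a single prediction."""
    return abs(label - score)


def l2_loss(label, score):
    """Squared error of a single prediction."""
    return (label - score) ** 2


# (ArrDelay, Scores) pairs copied from rows 0, 1, and 5 of the table above.
sample_rows = [(2.0, -5.271233), (-6.0, -2.190444), (86.0, 3.353821)]
for label, score in sample_rows:
    print(label, score,
          round(l1_loss(label, score), 6),
          round(l2_loss(label, score), 4))
```

Row 5 shows why the squared error matters: one flight that arrived 86 minutes
late contributes far more to the mean squared error than the other rows
combined.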

In [9]:
spark.stop()
