## 106 - Quantile Regression with VowpalWabbit

We will demonstrate how to use the VowpalWabbit quantile regressor with
TrainRegressor and ComputeModelStatistics on the Triazines dataset.


This sample demonstrates how to use the following APIs:
- [`TrainRegressor`](http://mmlspark.azureedge.net/docs/pyspark/TrainRegressor.html)
- [`VowpalWabbitRegressor`](http://mmlspark.azureedge.net/docs/pyspark/VowpalWabbitRegressor.html)
- [`ComputeModelStatistics`](http://mmlspark.azureedge.net/docs/pyspark/ComputeModelStatistics.html)

In [1]:
triazines = spark.read.format("libsvm")\
    .load("wasbs://publicwasb@mmlspark.blob.core.windows.net/triazines.scale.svmlight")
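For readers unfamiliar with the svmlight/libsvm text format read above, the sketch below parses a single line by hand. Each line holds a label followed by `index:value` pairs whose indices are 1-based; the sketch shifts them to 0-based positions. The `parse_libsvm_line` helper and the sample line are hypothetical and only illustrate the format; this is not how Spark's reader is implemented.

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {zero_based_index: value})."""
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        # libsvm indices start at 1; shift them to 0-based positions.
        features[int(index) - 1] = float(value)
    return label, features

# Made-up sample line (not copied from the dataset file).
label, features = parse_libsvm_line("0.809 1:-0.6 2:-0.3325 4:-1.0")
print(label, features)
```

Features that are absent from a line (index 3 in the sample) are implicitly zero, which is why the loaded `features` column is a sparse vector.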


In [2]:
# print some basic info
print("records read: " + str(triazines.count()))
print("Schema: ")
triazines.printSchema()
triazines.limit(10).toPandas()

records read: 105
Schema: 
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

   label                                           features
0  0.809  (-0.6, -0.3325, -0.3325, -1.0, -1.0, -1.0, -1....
1  0.602  (-0.6, 0.0, 0.0, -1.0, -0.3325, -1.0, -1.0, 0....
2  0.442  (-0.6, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1....
3  0.718  (-0.6, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1....
4  0.697  (-0.6, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1....
5  0.757  (0.2, -0.6675, -1.0, -1.0, -1.0, 0.0, -1.0, 0....
6  0.900  (0.2, -0.6675, -1.0, -1.0, -1.0, 0.0, -1.0, 0....
7  0.564  (-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1....
8  0.772  (0.2, -0.6675, -1.0, -1.0, -1.0, 0.0, -1.0, 0....
9  0.801  (0.2, -0.6675, -1.0, -1.0, -1.0, 0.0, -1.0, 0....
  Unsupported type in conversion to Arrow: VectorUDT
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.

Split the dataset into a training set (85%) and a test set (15%).

In [3]:
train, test = triazines.randomSplit([0.85, 0.15], seed=1)
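Note that `randomSplit` assigns each row to a split independently at random, so the resulting sizes only approximate the requested 85/15 proportions. The pure-Python sketch below imitates that behaviour for 105 row indices. `bernoulli_split` is a hypothetical helper, and its random draws do not reproduce Spark's, so the exact counts will differ from those of the cell above.

```python
import random

def bernoulli_split(rows, weights, seed):
    """Assign each row to one split by drawing one uniform number per row."""
    total = sum(weights)
    # Cumulative upper bounds of each split after normalizing the weights.
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point round-off
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if u < upper:
                splits[i].append(row)
                break
    return splits

train_rows, test_rows = bernoulli_split(range(105), [0.85, 0.15], seed=1)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split repeatable, which is the role `seed=1` plays in the Spark call.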


Train the quantile regressor on the training data.

Note: check the task's stderr to see the output that VW itself prints.

The full list of command-line arguments is documented [here](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Command-Line-Arguments).

The learning rate, numPasses, and power_t are exposed as parameters so that they can be tuned with grid search.

In [4]:
from mmlspark.vw import VowpalWabbitRegressor
model = (VowpalWabbitRegressor(numPasses=20, args="--holdout_off --loss_function quantile -q :: -l 0.1")
            .fit(train))
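To make the `--loss_function quantile` option more concrete, here is a sketch of the quantile (pinball) loss in plain Python. It is an illustration on made-up labels, not VW's implementation. The target quantile is written `tau`; the value 0.5 (the median) is assumed here to match VW's default for `--quantile_tau`, since the command line above does not set it.

```python
def pinball_loss(label, prediction, tau=0.5):
    """Quantile (pinball) loss for a single example."""
    error = label - prediction
    # Under-predictions are weighted by tau, over-predictions by (1 - tau).
    return tau * error if error >= 0 else (tau - 1) * error

def mean_pinball_loss(labels, constant, tau):
    """Average pinball loss when every example gets the same prediction."""
    return sum(pinball_loss(y, constant, tau) for y in labels) / len(labels)

labels = [0.1, 0.2, 0.3, 0.4, 0.9]          # made-up values
candidates = [i / 100 for i in range(101)]  # constant predictions 0.00 .. 1.00

# The constant that minimizes the loss is an empirical quantile of the labels.
best_median = min(candidates, key=lambda c: mean_pinball_loss(labels, c, 0.5))
best_q90 = min(candidates, key=lambda c: mean_pinball_loss(labels, c, 0.9))
print(best_median, best_q90)
```

With `tau = 0.5` the loss is half the absolute error, so the model is fitted toward the conditional median rather than the conditional mean. This matters when reading squared-error metrics for such a model.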



Score the regressor on the test data.

In [5]:
scoredData = model.transform(test)
scoredData.limit(10).toPandas()

   label  ... prediction
0  0.258  ...   0.609252
1  0.427  ...   0.833140
2  0.550  ...   0.850142
3  0.614  ...   0.869448
4  0.631  ...   0.795053
5  0.637  ...   0.705518
6  0.641  ...   0.858001
7  0.678  ...   0.858001
8  0.788  ...   0.786930
9  0.801  ...   0.841320

[10 rows x 4 columns]

Compute regression metrics using ComputeModelStatistics.

In [6]:
from mmlspark.train import ComputeModelStatistics
metrics = ComputeModelStatistics(evaluationMetric='regression',
                                 labelCol='label',
                                 scoresCol='prediction') \
            .transform(scoredData)
metrics.toPandas()

   mean_squared_error  root_mean_squared_error       R^2  mean_absolute_error
0             0.04856                 0.220364 -0.647188             0.183519
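
The four metrics in this table follow standard definitions, sketched here in plain Python. The inputs are the ten scored rows displayed earlier (labels and predictions copied from that output), not the full test set, so the printed values will differ from the table. `regression_metrics` is a hypothetical helper, not the `ComputeModelStatistics` implementation.

```python
import math

def regression_metrics(labels, predictions):
    """Compute MSE, RMSE, R^2 and MAE from paired labels and predictions."""
    n = len(labels)
    errors = [p - y for y, p in zip(labels, predictions)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_label = sum(labels) / n
    # Total sum of squares: error of always predicting the mean label.
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    ss_res = mse * n
    return {
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "R^2": 1.0 - ss_res / ss_tot,
        "mean_absolute_error": mae,
    }

# First ten scored rows shown in the output of cell 5.
labels = [0.258, 0.427, 0.550, 0.614, 0.631, 0.637, 0.641, 0.678, 0.788, 0.801]
predictions = [0.609252, 0.833140, 0.850142, 0.869448, 0.795053,
               0.705518, 0.858001, 0.858001, 0.786930, 0.841320]
metrics_sketch = regression_metrics(labels, predictions)
for name, value in metrics_sketch.items():
    print(name, value)
```

A negative R^2, as in the table, means the squared error of the model exceeds that of always predicting the mean label.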

In [7]:
spark.stop()
