## 106 - Quantile Regression with LightGBM

This notebook demonstrates how to train a LightGBM quantile regressor and
evaluate it with ComputeModelStatistics on the Triazines dataset.


This sample demonstrates how to use the following APIs:
- [`TrainRegressor`](http://mmlspark.azureedge.net/docs/pyspark/TrainRegressor.html)
- [`LightGBMRegressor`](http://mmlspark.azureedge.net/docs/pyspark/LightGBMRegressor.html)
- [`ComputeModelStatistics`](http://mmlspark.azureedge.net/docs/pyspark/ComputeModelStatistics.html)

In [1]:
triazines = spark.read.format("libsvm")\
    .load("wasbs://publicwasb@mmlspark.blob.core.windows.net/triazines.scale.svmlight")


In [2]:
# print some basic info
print("records read: " + str(triazines.count()))
print("Schema: ")
triazines.printSchema()
triazines.limit(10).toPandas()

records read: 105
Schema: 
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)

   label                                           features
0  0.809  (-0.6, -0.3325, -0.3325, -1.0, -1.0, -1.0, -1....
1  0.602  (-0.6, 0.0, 0.0, -1.0, -0.3325, -1.0, -1.0, 0....
2  0.442  (-0.6, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1....
3  0.718  (-0.6, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1....
4  0.697  (-0.6, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1....
5  0.757  (0.2, -0.6675, -1.0, -1.0, -1.0, 0.0, -1.0, 0....
6  0.900  (0.2, -0.6675, -1.0, -1.0, -1.0, 0.0, -1.0, 0....
7  0.564  (-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1....
8  0.772  (0.2, -0.6675, -1.0, -1.0, -1.0, 0.0, -1.0, 0....
9  0.801  (0.2, -0.6675, -1.0, -1.0, -1.0, 0.0, -1.0, 0....
  Unsupported type in conversion to Arrow: VectorUDT
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.

Split the dataset into training and test sets.

In [3]:
train, test = triazines.randomSplit([0.85, 0.15], seed=1)


Train the quantile regressor on the training data.

In [4]:
from mmlspark.lightgbm import LightGBMRegressor
model = LightGBMRegressor(objective='quantile',
                          alpha=0.2,
                          learningRate=0.3,
                          numLeaves=31).fit(train)


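With `objective='quantile'`, LightGBM minimizes the pinball (quantile) loss, and `alpha=0.2` targets the 0.2 quantile: under-predictions are weighted by 0.2 and over-predictions by 0.8. The following is a minimal pure-Python sketch of that loss for illustration only; `pinball_loss` is a hypothetical helper, not part of mmlspark or LightGBM.

```python
def pinball_loss(labels, predictions, alpha):
    """Mean pinball (quantile) loss for the target quantile `alpha`."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    total = 0.0
    for y, q in zip(labels, predictions):
        diff = y - q
        # Under-prediction (label above prediction) costs alpha per unit;
        # over-prediction (label below prediction) costs (1 - alpha) per unit.
        total += alpha * diff if diff >= 0 else (alpha - 1.0) * diff
    return total / len(labels)


# With alpha=0.2, predicting one unit too high costs four times as much
# as predicting one unit too low, which pulls predictions downward.
print(pinball_loss([1.0], [0.0], alpha=0.2))  # under-prediction by 1
print(pinball_loss([0.0], [1.0], alpha=0.2))  # over-prediction by 1
```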

We can save the LightGBM model to a file and load it back using LightGBM's native representation.

In [5]:
from mmlspark.lightgbm import LightGBMRegressionModel
model.saveNativeModel("mymodel")
model = LightGBMRegressionModel.loadNativeModelFromFile("mymodel")


View the feature importances of the trained model.

In [6]:
print(model.getFeatureImportances())

[18.0, 4.0, 8.0, 0.0, 16.0, 16.0, 0.0, 3.0, 2.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 27.0, 27.0, 18.0, 28.0, 28.0, 0.0, 10.0, 0.0, 4.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0]
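
The list holds one value per feature, in feature order, and most entries are zero. A short sketch of one way to rank the nonzero entries follows; `top_features` is a hypothetical helper, and the sample input is just the first six values of the output above.

```python
def top_features(importances, k=5):
    """Return up to k (feature_index, importance) pairs, largest first."""
    nonzero = [(i, v) for i, v in enumerate(importances) if v > 0]
    # sorted() is stable, so tied features keep their original index order.
    return sorted(nonzero, key=lambda pair: pair[1], reverse=True)[:k]


# First six importances copied from the printed list above.
sample = [18.0, 4.0, 8.0, 0.0, 16.0, 16.0]
print(top_features(sample, k=3))
```

In the notebook, the same helper could be applied to the full list with `top_features(model.getFeatureImportances())`.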

Score the regressor on the test data.

In [7]:
scoredData = model.transform(test)
scoredData.limit(10).toPandas()

   label                                           features  prediction
0  0.258  (-0.2, 0.3325, -0.6675, -1.0, 0.3325, 0.0, -1....    0.414115
1  0.427  (-1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1....    0.539532
2  0.550  (-0.6, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1....    0.537624
3  0.614  (0.2, -0.6675, -1.0, -1.0, -1.0, 0.0, -1.0, 0....    0.640256
4  0.631  (-0.6, -1.0, -1.0, -1.0, -1.0, -1.0, -1.0, -1....    0.422801
5  0.637  (-0.2, 0.0, 0.0, -1.0, 0.3325, 0.0, -1.0, 0.0,...    0.521593
6  0.641  (0.2, -0.6675, -1.0, -1.0, -1.0, 0.0, -1.0, 0....    0.585361
7  0.678  (0.2, -0.6675, -1.0, -1.0, -1.0, 0.0, -1.0, 0....    0.585361
8  0.788  (0.2, -0.6675, -1.0, -1.0, -1.0, 0.0, -1.0, 0....    0.726604
9  0.801  (0.2, -0.6675, -1.0, -1.0, -1.0, 0.0, -1.0, 0....    0.634850
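
Because the model targets the 0.2 quantile, roughly 20% of labels should fall at or below their prediction if the model is well calibrated. The sketch below checks that fraction on the ten rows displayed above only, not on the full test set, so it is a rough sanity check at best; `coverage` is a hypothetical helper.

```python
def coverage(labels, predictions):
    """Fraction of labels that are at or below the predicted quantile."""
    hits = sum(1 for y, q in zip(labels, predictions) if y <= q)
    return hits / len(labels)


# The ten (label, prediction) pairs shown in the table above.
labels = [0.258, 0.427, 0.550, 0.614, 0.631, 0.637, 0.641, 0.678, 0.788, 0.801]
predictions = [0.414115, 0.539532, 0.537624, 0.640256, 0.422801,
               0.521593, 0.585361, 0.585361, 0.726604, 0.634850]

print(coverage(labels, predictions))
```

On these rows, 3 of the 10 labels lie at or below the prediction. A sample this small cannot confirm or refute calibration against the nominal 0.2.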

Compute regression metrics using ComputeModelStatistics.

In [8]:
from mmlspark.train import ComputeModelStatistics
metrics = ComputeModelStatistics(evaluationMetric='regression',
                                 labelCol='label',
                                 scoresCol='prediction') \
            .transform(scoredData)
metrics.toPandas()

   mean_squared_error  root_mean_squared_error       R^2  mean_absolute_error
0            0.014862                  0.12191  0.495869             0.107673
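
For reference, the four columns can be reproduced by hand. The sketch below assumes ComputeModelStatistics uses the standard definitions of these metrics; `regression_metrics` is a hypothetical helper and the input values are made up for illustration.

```python
import math


def regression_metrics(labels, predictions):
    """Compute MSE, RMSE, R^2 and MAE using their textbook definitions."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predictions)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_label = sum(labels) / n
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return {
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "R^2": 1.0 - ss_res / ss_tot,
        "mean_absolute_error": mae,
    }


# Toy data, not taken from the Triazines results.
print(regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))
```

As a consistency check on the table above, the reported root_mean_squared_error (0.12191) is the square root of the reported mean_squared_error (0.014862). These metrics measure squared and absolute error, so they are not the quantity a quantile objective optimizes.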

In [9]:
display(metrics)

In [10]:
metrics.first()['root_mean_squared_error']


0.12191041769242424

In [11]:
spark.stop()
