## 104 - Train, Test, Evaluate for Regression with Auto Imports Dataset

This sample notebook is based on the Gallery [Sample 6: Train, Test, Evaluate
for Regression: Auto Imports
Dataset](https://gallery.cortanaintelligence.com/Experiment/670fbfc40c4f44438bfe72e47432ae7a)
for AzureML Studio.  This experiment demonstrates how to build a regression
model to predict an automobile's price.  The process includes training, testing,
and evaluating the model on the Automobile Imports data set.

This sample demonstrates the use of several members of the mmlspark library:
- [`TrainRegressor`
  ](http://mmlspark.azureedge.net/docs/pyspark/TrainRegressor.html)
- [`SummarizeData`
  ](http://mmlspark.azureedge.net/docs/pyspark/SummarizeData.html)
- [`CleanMissingData`
  ](http://mmlspark.azureedge.net/docs/pyspark/CleanMissingData.html)
- [`ComputeStatistics`
  ](http://mmlspark.azureedge.net/docs/pyspark/ComputeStatistics.html)
- [`FindBestModel`
  ](http://mmlspark.azureedge.net/docs/pyspark/FindBestModel.html)

First, read the raw automobile price data from a Parquet file into a Spark
DataFrame using `spark.read.parquet()`.

In [1]:
data = spark.read.parquet("wasbs://publicwasb@mmlspark.blob.core.windows.net/AutomobilePriceRaw.parquet")



To learn more about the data that was just read into the DataFrame,
summarize the data using `SummarizeData` and print the summary.  For each
column of the DataFrame, `SummarizeData` reports summary statistics in the
following subcategories:
* Feature name
* Counts
  - Count
  - Unique Value Count
  - Missing Value Count
* Quantiles
  - Min
  - 1st Quartile
  - Median
  - 3rd Quartile
  - Max
* Sample Statistics
  - Sample Variance
  - Sample Standard Deviation
  - Sample Skewness
  - Sample Kurtosis
* Percentiles
  - P0.5
  - P1
  - P5
  - P95
  - P99
  - P99.5

Note that several columns have missing values (`normalized-losses`, `bore`,
`stroke`, `horsepower`, `peak-rpm`, `price`).  This summary can be very
useful during the initial phases of data discovery and characterization.
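
To make the categories above concrete, here is a minimal plain-Python sketch of a few of these statistics for a single column. It is an illustration only, not how `SummarizeData` is implemented: the helper `summarize_column` and the sample `horsepower` values are hypothetical, and the quartile convention used here may differ from the one Spark applies.

```python
import statistics

def summarize_column(values):
    """Summarize one numeric column; None marks a missing value."""
    present = [v for v in values if v is not None]
    # Quartiles of the non-missing values (inclusive interpolation).
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "count": len(present),                        # non-missing entries
        "unique_value_count": len(set(present)),
        "missing_value_count": len(values) - len(present),
        "min": min(present),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(present),
        "sample_variance": statistics.variance(present),
        "sample_std": statistics.stdev(present),
    }

# Hypothetical column with two missing entries.
horsepower = [111, 111, 154, None, 102, 115, 110, None, 140, 160]
summary = summarize_column(horsepower)
print(summary)
```

Note that `count` here excludes missing entries, which matches the output below where `normalized-losses` reports 164 of 205 rows.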

In [2]:
from mmlspark.stages import SummarizeData
summary = SummarizeData().transform(data)
summary.toPandas()

              Feature  Count  Unique_Value_Count  ...       P95      P99     P99_5
0           symboling  205.0                 6.0  ...      3.00      3.0      3.00
1   normalized-losses  164.0                53.0  ...    188.00    231.0    256.00
2                make  205.0                22.0  ...       NaN      NaN       NaN
3           fuel-type  205.0                 2.0  ...       NaN      NaN       NaN
4          aspiration  205.0                 2.0  ...       NaN      NaN       NaN
5          body-style  205.0                 5.0  ...       NaN      NaN       NaN
6        drive-wheels  205.0                 3.0  ...       NaN      NaN       NaN
7     engine-location  205.0                 2.0  ...       NaN      NaN       NaN
8          wheel-base  205.0                53.0  ...    110.00    115.6    115.60
9              length  205.0                75.0  ...    197.00    202.6    202.60
10              width  205.0                45.0  ...     70.50     71.7     72.00
11  

Split the dataset into train and test datasets.

In [3]:
# split the data into training and testing datasets
train, test = data.randomSplit([0.6, 0.4], seed=123)
train.limit(10).toPandas()

   symboling  normalized-losses        make  ... city-mpg highway-mpg    price
0         -1              137.0  mitsubishi  ...       23          30   9279.0
1          0              108.0      nissan  ...       17          22  14399.0
2          0              128.0      nissan  ...       17          22  13499.0
3          0              161.0      peugot  ...       19          24  11900.0
4          1              103.0      nissan  ...       31          37   7349.0
5          1              122.0      nissan  ...       31          37   6849.0
6          1              125.0  mitsubishi  ...       25          32   6989.0
7          1              125.0  mitsubishi  ...       25          32   8189.0
8          1              125.0  mitsubishi  ...       23          30   9279.0
9          1              128.0      nissan  ...       45          50   7099.0

[10 rows x 25 columns]
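
The idea behind a weighted random split can be sketched in plain Python: each row is assigned independently to one of the splits with probability proportional to its weight, so a 0.6/0.4 split yields only approximately 60% and 40% of the rows. The helper `random_split` below is hypothetical and does not reproduce Spark's sampler or its exact row assignment for `seed=123`.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to a split, proportionally to weights."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative boundaries in [0, 1], e.g. [0.6, 1.0] for weights [0.6, 0.4].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        # Pick the first split whose cumulative boundary exceeds u.
        index = next((i for i, b in enumerate(bounds) if u < b), len(bounds) - 1)
        splits[index].append(row)
    return splits

rows = list(range(205))  # stand-in for the 205 rows of the dataset
train_rows, test_rows = random_split(rows, [0.6, 0.4], seed=123)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the split reproducible across runs, which is why the cell above passes `seed=123`.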

Now use the `CleanMissingData` API to replace the missing values in the
dataset with something more useful or meaningful.  Specify a list of columns
to be cleaned, and specify the corresponding output column names, which are
not required to be the same as the input column names. `CleanMissingData`
offers the options of "Mean", "Median", or "Custom" for the replacement
value.  In the case of "Custom" value, the user also specifies the value to
use via the "customValue" parameter.  In this example, we will replace
missing values in numeric columns with the median value for the column.  We
will define the model here, then use it as a Pipeline stage when we train our
regression models and make our predictions in the following steps.
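
As a rough illustration of the "Median" mode, the plain-Python sketch below computes a replacement value from the non-missing training values and reuses it on the test values. This assumes the stage learns its replacement values when the pipeline is fit on the training set; the helpers `fit_median` and `fill_missing` and the sample `bore` values are hypothetical and are not part of mmlspark.

```python
import statistics

def fit_median(values):
    """Compute the median of the non-missing values (None marks missing)."""
    present = [v for v in values if v is not None]
    return statistics.median(present)

def fill_missing(values, replacement):
    """Replace every missing entry with the given replacement value."""
    return [replacement if v is None else v for v in values]

# Hypothetical training and test values for one column.
train_bore = [3.47, None, 2.68, 3.19, 3.19, None, 3.62]
test_bore = [None, 3.31]

median_bore = fit_median(train_bore)            # learned from training data only
clean_train = fill_missing(train_bore, median_bore)
clean_test = fill_missing(test_bore, median_bore)
print(median_bore, clean_test)
```

Reusing the training median on the test set keeps information from the test data out of the preprocessing step.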

In [4]:
from mmlspark.featurize import CleanMissingData
cols = ["normalized-losses", "stroke", "bore", "horsepower",
        "peak-rpm", "price"]
cleanModel = CleanMissingData().setCleaningMode("Median") \
                               .setInputCols(cols).setOutputCols(cols)


Now we will create two Regressor models for comparison: Poisson Regression
and Random Forest.  PySpark has several regressors implemented:
* `LinearRegression`
* `IsotonicRegression`
* `DecisionTreeRegressor`
* `RandomForestRegressor`
* `GBTRegressor` (Gradient-Boosted Trees)
* `AFTSurvivalRegression` (Accelerated Failure Time Model Survival)
* `GeneralizedLinearRegression` -- fit a generalized model by giving a symbolic
  description of the linear predictor (link function) and a description of the
  error distribution (family).  The following families are supported:
  - `Gaussian`
  - `Binomial`
  - `Poisson`
  - `Gamma`
  - `Tweedie` -- power link function specified through `linkPower`

Refer to the
[Pyspark API Documentation](http://spark.apache.org/docs/latest/api/python/)
for more details.

`TrainRegressor` creates a model based on the regressor and other parameters
that are supplied to it, then trains the model on the data.

In this next step, create a Poisson Regression model using the
`GeneralizedLinearRegression` API from Spark and build a Pipeline with
`CleanMissingData` and `TrainRegressor` as pipeline stages to create and
train the model.  Note that because `TrainRegressor` expects a `labelCol` to
be set, there is no need to set `linkPredictionCol` when setting up the
`GeneralizedLinearRegression`.  Fitting the pipe on the training dataset will
train the model.  Applying the `transform()` of the pipe to the test dataset
creates the predictions.
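
For reference, the standard form of a Poisson regression with a log link (the `family="poisson"`, `link="log"` setting used in the next cell) can be written as follows, where $\mathbf{x}$ is the feature vector, $y$ is the label (`price`), and $\boldsymbol{\beta}$ is the coefficient vector. The symbols are introduced here for illustration and do not appear in the original sample.

$$
y \mid \mathbf{x} \sim \mathrm{Poisson}(\mu), \qquad
\log \mu = \mathbf{x}^\top \boldsymbol{\beta}
\quad\Longleftrightarrow\quad
\mu = \mathbb{E}[y \mid \mathbf{x}] = \exp\!\left(\mathbf{x}^\top \boldsymbol{\beta}\right)
$$

Because of the exponential, the predicted mean is always positive, which is a reasonable property for a price.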

In [5]:
# train Poisson Regression Model
from pyspark.ml.regression import GeneralizedLinearRegression
from pyspark.ml import Pipeline
from mmlspark.train import TrainRegressor

glr = GeneralizedLinearRegression(family="poisson", link="log")
poissonModel = TrainRegressor().setModel(glr).setLabelCol("price").setNumFeatures(256)
poissonPipe = Pipeline(stages = [cleanModel, poissonModel]).fit(train)
poissonPrediction = poissonPipe.transform(test)


Next, repeat these steps to create a Random Forest Regression model using the
`RandomForestRegressor` API from Spark.

In [6]:
# train Random Forest regression on the same training data:
from pyspark.ml.regression import RandomForestRegressor

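rfr = RandomForestRegressor(maxDepth=30, maxBins=128, numTrees=8, minInstancesPerNode=1)
# Note: unlike the Poisson cell, TrainRegressor is fit on `train` right here
# (before cleanModel is applied), so the pipeline below receives an
# already-trained model as a plain transformer stage.
randomForestModel = TrainRegressor(model=rfr, labelCol="price", numFeatures=256).fit(train)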
randomForestPipe = Pipeline(stages = [cleanModel, randomForestModel]).fit(train)
randomForestPrediction = randomForestPipe.transform(test)


After the models have been trained and scored, compute some basic statistics
to evaluate the predictions.  The following statistics are calculated for
regression models:
* Mean squared error
* Root mean squared error
* R^2
* Mean absolute error

Use the `ComputeModelStatistics` API to compute basic statistics for
the Poisson and the Random Forest models.

In [7]:
from mmlspark.train import ComputeModelStatistics
poissonMetrics = ComputeModelStatistics().transform(poissonPrediction)
print("Poisson Metrics")
poissonMetrics.toPandas()

Poisson Metrics
   mean_squared_error  root_mean_squared_error       R^2  mean_absolute_error
0        6.387382e+06              2527.326961  0.856324          1617.131942

In [8]:
randomForestMetrics = ComputeModelStatistics().transform(randomForestPrediction)
print("Random Forest Metrics")
randomForestMetrics.toPandas()

Random Forest Metrics
   mean_squared_error  root_mean_squared_error       R^2  mean_absolute_error
0        1.144745e+07              3383.408586  0.742503          2038.477109
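
The four metrics reported above follow their usual textbook definitions, which can be sketched in plain Python. This is an illustration on hypothetical toy values, not the `ComputeModelStatistics` implementation; the helper `regression_metrics` is introduced here only for the example.

```python
import math

def regression_metrics(labels, predictions):
    """Compute MSE, RMSE, R^2, and MAE from paired labels and predictions."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predictions)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_label = sum(labels) / n
    # Total sum of squares around the mean label.
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    ss_res = mse * n
    return {
        "mean_squared_error": mse,
        "root_mean_squared_error": math.sqrt(mse),
        "R^2": 1.0 - ss_res / ss_tot,
        "mean_absolute_error": mae,
    }

# Hypothetical prices and predictions.
prices = [10000.0, 12000.0, 8000.0, 15000.0]
scores = [11000.0, 11500.0, 8500.0, 14000.0]
metrics = regression_metrics(prices, scores)
print(metrics)
```

The same relationship holds in the tables above: the root mean squared error is the square root of the mean squared error, for example √6.387382e+06 ≈ 2527.33 for the Poisson model.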

We can also compute per-instance statistics for `poissonPrediction`:

In [9]:
from mmlspark.train import ComputePerInstanceStatistics
def demonstrateEvalPerInstance(pred):
    return ComputePerInstanceStatistics().transform(pred) \
               .select("price", "Scores", "L1_loss", "L2_loss") \
               .limit(10).toPandas()
demonstrateEvalPerInstance(poissonPrediction)

     price        Scores      L1_loss       L2_loss
0   8949.0   7908.261962  1040.738038  1.083136e+06
1   9549.0   8676.384389   872.615611  7.614580e+05
2  13499.0  14587.748803  1088.748803  1.185374e+06
3   7999.0   6433.316004  1565.683996  2.451366e+06
4   7499.0   6722.115545   776.884455  6.035495e+05
5   7799.0   6355.008858  1443.991142  2.085110e+06
6   5499.0   6487.236219   988.236219  9.766108e+05
7   6649.0   6575.215170    73.784830  5.444201e+03
8   8249.0   6911.996479  1337.003521  1.787578e+06
9  14489.0  13400.356342  1088.643658  1.185145e+06

and with `randomForestPrediction`:

In [10]:
demonstrateEvalPerInstance(randomForestPrediction)

     price      Scores     L1_loss       L2_loss
0   6669.0   6457.2500    211.7500  4.483806e+04
1   8499.0   8073.3125    425.6875  1.812098e+05
2   9959.0  10710.0625    751.0625  5.640949e+05
3   8921.0   9949.3750   1028.3750  1.057555e+06
4  10595.0  19871.3750   9276.3750  8.605113e+07
5  12764.0  15938.2500   3174.2500  1.007586e+07
6  37028.0  17058.3125  19969.6875  3.987884e+08
7   7295.0   7128.3750    166.6250  2.776389e+04
8   5399.0   6384.2500    985.2500  9.707176e+05
9   7129.0   6624.0000    505.0000  2.550250e+05
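
The `L1_loss` and `L2_loss` columns in the two tables above are consistent with the absolute error and the squared error between `price` and `Scores`. The short sketch below recomputes them for one row from each table; the helper `per_instance_losses` is hypothetical and only illustrates the arithmetic.

```python
def per_instance_losses(label, score):
    """Return (absolute error, squared error) for one prediction."""
    l1 = abs(label - score)
    l2 = (label - score) ** 2
    return l1, l2

# Row 0 of the Poisson table: price 8949.0, score 7908.261962.
poisson_l1, poisson_l2 = per_instance_losses(8949.0, 7908.261962)

# Row 6 of the Random Forest table: price 37028.0, score 17058.3125.
forest_l1, forest_l2 = per_instance_losses(37028.0, 17058.3125)

print(poisson_l1, poisson_l2)
print(forest_l1, forest_l2)
```

The Random Forest row shows how the squared loss amplifies a single large miss: an absolute error of about 2.0e+04 becomes a squared error of about 4.0e+08.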

In [11]:
spark.stop()
