# AdventureWorks Product selection model training

#### Some code copied from https://docs.databricks.com/applications/machine-learning/mllib/binary-classification-mllib-pipelines.html
#### and from https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#multinomial-logistic-regression
#### and modified to fit the AdventureWorks data.

MLflow tracking docs: https://www.mlflow.org/docs/latest/tracking.html

### Use the MLflow Tracking API

Use the [MLflow Tracking API](https://www.mlflow.org/docs/latest/python_api/index.html) to start a run and log parameters, metrics, and artifacts (files) from your data science code. 

In [0]:
import mlflow

# Start an MLflow run

with mlflow.start_run(run_name="test run 3"):
  # Log a parameter (key-value pair)
  mlflow.log_param("param_1", 3)

  # Log metrics; step is an optional integer that marks progress within the run
  mlflow.log_metric("metric_1", 2, step=1)
  mlflow.log_metric("metric_2", 4, step=2)
  mlflow.log_metric("metric_3", 6, step=3)

  # Log an artifact (output file)
  with open("output.txt", "w") as f:
      f.write("Hello world!")
  mlflow.log_artifact("output.txt")

### Features for the model:
- CommuteDistance
- AgeBand
- HasChildren
- Education
- Salary

In [0]:
%sql select * from t_salesinfo limit 3

OrderDateKey,DueDateKey,CustomerKey,PromotionKey,SalesTerritoryKey,SalesAmount,ProductKey,ProductSubcategoryKey,ProductCategoryKey,Category,Subcategory,Model,Gender,Salary,OrderQuantity,DiscountAmount,TotalProductCost,TaxAmt,HasChildren,HomeOwner,AgeBand,Education,NumberCarsOwned,CommuteDistance,FiscalYear,FiscalQuarter,Month,MonthNumberOfYear,CalendarYear
20121228,20130109,11245,1,8,4.99,477,28,4,Accessories,Bottles and Cages,Water Bottle,M,120000.0,1,0,1.8663,0.3992,Y,0,Golden,High School,4,5-10 Miles,2012,2,December,12,2012
20121228,20130109,16313,1,8,4.99,477,28,4,Accessories,Bottles and Cages,Water Bottle,F,30000.0,1,0,1.8663,0.3992,Y,1,Middle,Partial College,1,0-1 Miles,2012,2,December,12,2012
20121229,20130110,12390,1,8,4.99,477,28,4,Accessories,Bottles and Cages,Water Bottle,F,40000.0,1,0,1.8663,0.3992,Y,0,Middle,Partial College,1,0-1 Miles,2012,2,December,12,2012


The Pipelines API provides a higher-level API built on top of DataFrames for constructing ML pipelines.
You can read more about the Pipelines API in the [programming guide](https://spark.apache.org/docs/latest/ml-guide.html).

**Multiclass classification** is the task of predicting one label out of more than two possible classes.
For example: which bike subcategory (Mountain, Road, Touring) will a customer buy?
This section demonstrates algorithms for making this type of prediction.

## Dataset Review


The input table `t_salesinfo` that you created provides the following columns.

Attribute information:

- CommuteDistance
- AgeBand
- HasChildren
- Education
- Salary

Target/Label: the first word of `Subcategory` for bikes (Mountain, Road, Touring)


## Preprocess Data

Spark ML classifiers take a numeric feature vector as input, so we have to convert the categorical variables in the dataset into numeric variables.
There are two ways we can do this.

* Category Indexing

  This assigns each category a numeric value from {0, 1, 2, ..., numCategories-1}.
  It introduces an implicit ordering among the categories, so it is most suitable for ordinal variables (e.g., Poor: 0, Average: 1, Good: 2).

* One-Hot Encoding

  This converts categories into binary vectors with at most one nonzero value (e.g., Blue: [1, 0], Green: [0, 1], Red: [0, 0]).
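
The two encodings can be sketched in plain Python. This is only an illustration, not Spark code: the helpers `index_categories` and `one_hot` are hypothetical names, the most-frequent-first ordering mirrors the default behavior of Spark's `StringIndexer`, and dropping the last category mirrors the encoder's `dropLast=True` default. Breaking ties alphabetically is an assumption made for this sketch.

```python
from collections import Counter

def index_categories(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically (tie-break is an assumption)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Return a binary vector with at most one nonzero entry."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0] * size
    if index < size:
        # The dropped (last) category stays all zeros
        vector[index] = 1
    return vector

colors = ["Blue", "Blue", "Blue", "Green", "Green", "Red"]
mapping = index_categories(colors)
encoded = {label: one_hot(i, len(mapping)) for label, i in mapping.items()}
print(mapping)
print(encoded)
```

With the last category dropped, three colors need only two slots, which matches the Blue/Green/Red example in the list above.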


Here, we use a combination of [StringIndexer] and [OneHotEncoderEstimator] to convert the categorical variables.
The `OneHotEncoderEstimator` returns a [SparseVector]. Note: [OneHotEncoderEstimator] was [renamed as OneHotEncoder] in Spark 3.0.

Since we have more than one stage of feature transformations, we use a [Pipeline] to tie the stages together.
This simplifies our code.

[StringIndexer]: http://spark.apache.org/docs/latest/ml-features.html#stringindexer
[OneHotEncoderEstimator]: https://spark.apache.org/docs/latest/ml-features.html#onehotencoderestimator
[SparseVector]: https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.linalg.SparseVector
[Pipeline]: http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.Pipeline
[renamed as OneHotEncoder]: https://issues.apache.org/jira/browse/SPARK-26133

In [0]:
%python

spdf_salesinfo = spark.sql('''
select split(Subcategory, ' ')[0] as Subcategory, AgeBand, 
       CommuteDistance, HasChildren, Education, Salary
FROM aw.t_salesinfo 
WHERE Category = 'Bikes' ''')

In [0]:
cols = spdf_salesinfo.columns
cols

In [0]:
import pyspark
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

# OneHotEncoderEstimator was renamed to OneHotEncoder in Spark 3.0.
# Compare the major version directly; distutils' LooseVersion is deprecated.
if int(pyspark.__version__.split(".")[0]) < 3:
    from pyspark.ml.feature import OneHotEncoderEstimator as CategoryEncoder
else:
    from pyspark.ml.feature import OneHotEncoder as CategoryEncoder

categoricalColumns = ["AgeBand", "CommuteDistance", "HasChildren", "Education"]

stages = []  # stages in our Pipeline

for categoricalCol in categoricalColumns:
    # Category indexing with StringIndexer
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    # One-hot encode the index column into a binary SparseVector
    encoder = CategoryEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"])
    # Add stages. These are not run here, but will run all at once later on.
    stages += [stringIndexer, encoder]

The code above indexes each categorical column with a `StringIndexer`
and then converts the indexed categories into one-hot encoded variables.
When the pipeline runs, the binary vectors are appended as new columns at the end of each row.

We use the `StringIndexer` again to encode our labels to label indices.

In [0]:
# Convert label into label indices using the StringIndexer
label_stringIdx = StringIndexer(inputCol="Subcategory", outputCol="label")
stages += [label_stringIdx]

Use a `VectorAssembler` to combine all the feature columns into a single vector column.
This includes both the numeric columns and the one-hot encoded binary vector columns in our dataset.

In [0]:
# Transform all features into a vector using VectorAssembler
numericCols = ["Salary"]
assemblerInputs = [c + "classVec" for c in categoricalColumns] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

Run the stages as a Pipeline. This puts the data through all of the feature transformations we described in a single call.

In [0]:
# Fit the feature pipeline on the data; the fitted model is applied to the same DataFrame below
partialPipeline = Pipeline().setStages(stages)
pipelineModel = partialPipeline.fit(spdf_salesinfo)
preppedDataDF = pipelineModel.transform(spdf_salesinfo)

display(preppedDataDF)

Subcategory,AgeBand,CommuteDistance,HasChildren,Education,Salary,AgeBandIndex,AgeBandclassVec,CommuteDistanceIndex,CommuteDistanceclassVec,HasChildrenIndex,HasChildrenclassVec,EducationIndex,EducationclassVec,label,features
Mountain,Late Middle,5-10 Miles,N,Bachelors,70000.0,1.0,"List(0, 3, List(1), List(1.0))",2.0,"List(0, 4, List(2), List(1.0))",0.0,"List(0, 1, List(0), List(1.0))",0.0,"List(0, 4, List(0), List(1.0))",0.0,"List(0, 13, List(1, 5, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 70000.0))"
Mountain,Late Middle,0-1 Miles,N,Graduate Degree,20000.0,1.0,"List(0, 3, List(1), List(1.0))",0.0,"List(0, 4, List(0), List(1.0))",0.0,"List(0, 1, List(0), List(1.0))",2.0,"List(0, 4, List(2), List(1.0))",0.0,"List(0, 13, List(1, 3, 7, 10, 12), List(1.0, 1.0, 1.0, 1.0, 20000.0))"
Mountain,Late Middle,5-10 Miles,N,Partial College,80000.0,1.0,"List(0, 3, List(1), List(1.0))",2.0,"List(0, 4, List(2), List(1.0))",0.0,"List(0, 1, List(0), List(1.0))",1.0,"List(0, 4, List(1), List(1.0))",0.0,"List(0, 13, List(1, 5, 7, 9, 12), List(1.0, 1.0, 1.0, 1.0, 80000.0))"
Mountain,Late Middle,0-1 Miles,N,Bachelors,10000.0,1.0,"List(0, 3, List(1), List(1.0))",0.0,"List(0, 4, List(0), List(1.0))",0.0,"List(0, 1, List(0), List(1.0))",0.0,"List(0, 4, List(0), List(1.0))",0.0,"List(0, 13, List(1, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 10000.0))"
Mountain,Golden,1-2 Miles,N,Partial High School,80000.0,0.0,"List(0, 3, List(0), List(1.0))",3.0,"List(0, 4, List(3), List(1.0))",0.0,"List(0, 1, List(0), List(1.0))",4.0,"List(0, 4, List(), List())",0.0,"List(0, 13, List(0, 6, 7, 12), List(1.0, 1.0, 1.0, 80000.0))"
Mountain,Golden,0-1 Miles,N,Graduate Degree,40000.0,0.0,"List(0, 3, List(0), List(1.0))",0.0,"List(0, 4, List(0), List(1.0))",0.0,"List(0, 1, List(0), List(1.0))",2.0,"List(0, 4, List(2), List(1.0))",0.0,"List(0, 13, List(0, 3, 7, 10, 12), List(1.0, 1.0, 1.0, 1.0, 40000.0))"
Mountain,Golden,0-1 Miles,N,Graduate Degree,80000.0,0.0,"List(0, 3, List(0), List(1.0))",0.0,"List(0, 4, List(0), List(1.0))",0.0,"List(0, 1, List(0), List(1.0))",2.0,"List(0, 4, List(2), List(1.0))",0.0,"List(0, 13, List(0, 3, 7, 10, 12), List(1.0, 1.0, 1.0, 1.0, 80000.0))"
Mountain,Golden,2-5 Miles,Y,Bachelors,80000.0,0.0,"List(0, 3, List(0), List(1.0))",1.0,"List(0, 4, List(1), List(1.0))",1.0,"List(0, 1, List(), List())",0.0,"List(0, 4, List(0), List(1.0))",0.0,"List(0, 13, List(0, 4, 8, 12), List(1.0, 1.0, 1.0, 80000.0))"
Mountain,Golden,0-1 Miles,N,Graduate Degree,170000.0,0.0,"List(0, 3, List(0), List(1.0))",0.0,"List(0, 4, List(0), List(1.0))",0.0,"List(0, 1, List(0), List(1.0))",2.0,"List(0, 4, List(2), List(1.0))",0.0,"List(0, 13, List(0, 3, 7, 10, 12), List(1.0, 1.0, 1.0, 1.0, 170000.0))"
Mountain,Golden,1-2 Miles,Y,Partial High School,10000.0,0.0,"List(0, 3, List(0), List(1.0))",3.0,"List(0, 4, List(3), List(1.0))",1.0,"List(0, 1, List(), List())",4.0,"List(0, 4, List(), List())",0.0,"List(0, 13, List(0, 6, 12), List(1.0, 1.0, 10000.0))"
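
The vector columns in the output above are displayed as `List(type, size, indices, values)`. Assuming the usual Spark ML vector layout, in which a leading `0` marks a sparse vector, the sketch below expands the first row's `features` value and slices it back into per-column blocks. The helper `to_dense` is a hypothetical plain-Python function, and the block widths are read off the per-column vector sizes shown in the output (3, 4, 1, 4) plus one slot for `Salary`.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for position, value in zip(indices, values):
        dense[position] = value
    return dense

# First output row: List(0, 13, List(1, 5, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 70000.0))
row = to_dense(13, [1, 5, 7, 8, 12], [1.0, 1.0, 1.0, 1.0, 70000.0])

# Block widths inferred from the classVec sizes in the output above
layout = [("AgeBand", 3), ("CommuteDistance", 4), ("HasChildren", 1),
          ("Education", 4), ("Salary", 1)]

slices = {}
offset = 0
for name, width in layout:
    slices[name] = row[offset:offset + width]
    offset += width

for name, block in slices.items():
    print(name, block)
```

Each categorical block has one slot fewer than the number of distinct values in its column, because the one-hot encoder drops the last category.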


In [0]:
# Keep relevant columns
selectedcols = ["label", "features"] + cols
dataset = preppedDataDF.select(selectedcols)
display(dataset)

label,features,Subcategory,AgeBand,CommuteDistance,HasChildren,Education,Salary
0.0,"List(0, 13, List(1, 5, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 70000.0))",Mountain,Late Middle,5-10 Miles,N,Bachelors,70000.0
0.0,"List(0, 13, List(1, 3, 7, 10, 12), List(1.0, 1.0, 1.0, 1.0, 20000.0))",Mountain,Late Middle,0-1 Miles,N,Graduate Degree,20000.0
0.0,"List(0, 13, List(1, 5, 7, 9, 12), List(1.0, 1.0, 1.0, 1.0, 80000.0))",Mountain,Late Middle,5-10 Miles,N,Partial College,80000.0
0.0,"List(0, 13, List(1, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 10000.0))",Mountain,Late Middle,0-1 Miles,N,Bachelors,10000.0
0.0,"List(0, 13, List(0, 6, 7, 12), List(1.0, 1.0, 1.0, 80000.0))",Mountain,Golden,1-2 Miles,N,Partial High School,80000.0
0.0,"List(0, 13, List(0, 3, 7, 10, 12), List(1.0, 1.0, 1.0, 1.0, 40000.0))",Mountain,Golden,0-1 Miles,N,Graduate Degree,40000.0
0.0,"List(0, 13, List(0, 3, 7, 10, 12), List(1.0, 1.0, 1.0, 1.0, 80000.0))",Mountain,Golden,0-1 Miles,N,Graduate Degree,80000.0
0.0,"List(0, 13, List(0, 4, 8, 12), List(1.0, 1.0, 1.0, 80000.0))",Mountain,Golden,2-5 Miles,Y,Bachelors,80000.0
0.0,"List(0, 13, List(0, 3, 7, 10, 12), List(1.0, 1.0, 1.0, 1.0, 170000.0))",Mountain,Golden,0-1 Miles,N,Graduate Degree,170000.0
0.0,"List(0, 13, List(0, 6, 12), List(1.0, 1.0, 10000.0))",Mountain,Golden,1-2 Miles,Y,Partial High School,10000.0


In [0]:
# Randomly split data into training and test sets. Set a seed for reproducibility.
(trainingData, testData) = dataset.randomSplit([0.7, 0.3], seed=100)
print(trainingData.count())
print(testData.count())
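
The two counts printed above will only be close to a 70/30 ratio, because `randomSplit` assigns rows by random sampling rather than by cutting at an exact position. The sketch below illustrates that idea in plain Python. It is not Spark's implementation; `random_split` is a hypothetical helper written for this note.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by an independent random draw."""
    total = sum(weights)
    # Cumulative upper bounds of each split after normalizing the weights
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding gap
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, test = random_split(range(10000), [0.7, 0.3], seed=100)
print(len(train), len(test))
```

Reusing the same seed reproduces the same assignment, which is why the notebook fixes `seed=100`.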

## Fit and Evaluate Models

We are now ready to try out some of the classification algorithms available in the Pipelines API.

The following algorithms support multiclass classification in the Python API:
- Decision Tree Classifier
- Random Forest Classifier

These are the general steps we will take to build our models:
- Create an initial model using the training set
- Tune parameters with a `ParamGridBuilder` grid and 5-fold cross-validation
- Evaluate the best model obtained from cross-validation using the test set
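
To get a feel for the cost of the tuning step, the sketch below expands a parameter grid the way a grid search does (a Cartesian product of candidate values) and counts the resulting model fits. It is plain Python rather than Spark code. In PySpark the grid would typically be built with `ParamGridBuilder` and passed to `CrossValidator` from `pyspark.ml.tuning`. The candidate values for `maxDepth` and `maxBins` are assumptions chosen for illustration, not values from this notebook.

```python
from itertools import product

# Hypothetical candidate values for two decision tree parameters
param_grid = {"maxDepth": [2, 5, 10], "maxBins": [20, 40, 80]}
num_folds = 5

# Every combination of one value per parameter
names = sorted(param_grid)
combinations = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

# One fit per combination per fold, plus a final refit of the best combination
fits_during_cv = len(combinations) * num_folds
total_fits = fits_during_cv + 1

print(len(combinations), "parameter combinations")
print(total_fits, "model fits in total")
```

The number of fits grows multiplicatively with every parameter added to the grid, so it pays to keep the grid small on large training sets.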

## Decision Trees

You can read more about [Decision Trees](http://spark.apache.org/docs/latest/mllib-decision-tree.html) in the Spark MLlib Programming Guide.
Decision trees are popular because they handle categorical
data and work out of the box with multiclass classification tasks.

In [0]:
display(trainingData)

label,features,Subcategory,AgeBand,CommuteDistance,HasChildren,Education,Salary
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 10000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,10000.0
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 10000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,10000.0
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 10000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,10000.0
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 10000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,10000.0
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 20000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,20000.0
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 20000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,20000.0
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 30000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,30000.0
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 30000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,30000.0
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 30000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,30000.0
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 30000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,30000.0


In [0]:
display(testData)

label,features,Subcategory,AgeBand,CommuteDistance,HasChildren,Education,Salary
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 10000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,10000.0
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 20000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,20000.0
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 30000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,30000.0
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 30000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,30000.0
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 40000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,40000.0
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 40000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,40000.0
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 40000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,40000.0
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 40000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,40000.0
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 40000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,40000.0
0.0,"List(0, 13, List(0, 3, 7, 8, 12), List(1.0, 1.0, 1.0, 1.0, 40000.0))",Mountain,Golden,0-1 Miles,N,Bachelors,40000.0


In [0]:
# https://spark.apache.org/docs/2.2.0/ml-classification-regression.html#multinomial-logistic-regression

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

import mlflow
import mlflow.spark

with mlflow.start_run(run_name="aw decisiontree"):
    dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")
    model = dt.fit(trainingData)
    predictions = model.transform(testData)
    predictions.select("prediction", "label", "features").show(5)
    
    # Compare predictions with the true labels and compute accuracy on the test set
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy")
    
    accuracy = evaluator.evaluate(predictions)
    
    # Log model properties, row counts, accuracy, and the model itself to MLflow
    mlflow.log_param("numNodes", model.numNodes)
    mlflow.log_param("depth", model.depth)
    mlflow.log_metric("Training Row Count", trainingData.count())
    mlflow.log_metric("Testing Row Count", testData.count())
    mlflow.log_metric("accuracy", accuracy)
    mlflow.spark.log_model(model, "dbfs/mnt/awdata/model")
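
With `metricName="accuracy"`, the evaluator reports the fraction of test rows whose prediction equals the label, and the test error is one minus that value. A minimal plain-Python sketch of the calculation, using made-up labels and predictions rather than results from this notebook:

```python
# Made-up label indices and predictions, for illustration only
labels = [0.0, 0.0, 1.0, 2.0]
predictions = [0.0, 1.0, 1.0, 2.0]

# Count the rows where the prediction matches the true label
correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
accuracy = correct / len(labels)
test_error = 1.0 - accuracy

print("accuracy:", accuracy)
print("test error:", test_error)
```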

In [0]:
%fs ls /mnt/awdata/model/sparkml/metadata

path,name,size
dbfs:/mnt/awdata/model/sparkml/metadata/_SUCCESS,_SUCCESS,0
dbfs:/mnt/awdata/model/sparkml/metadata/part-00000,part-00000,216
