# Classification - Adult Census using Vowpal Wabbit in MMLSpark

In this example, we predict incomes from the *Adult Census* dataset using the Vowpal Wabbit (VW) classifier in MMLSpark.
First, we read the data and split it into train and test sets as in this [example](https://github.com/Azure/mmlspark/blob/master/notebooks/samples/Classification%20-%20Adult%20Census.ipynb).

In [1]:
data = spark.read.parquet("wasbs://publicwasb@mmlspark.blob.core.windows.net/AdultCensusIncome.parquet")
data = data.select(["education", "marital-status", "hours-per-week", "income"])
train, test = data.randomSplit([0.75, 0.25], seed=123)
train.limit(10).toPandas()

  education       marital-status  hours-per-week  income
0      10th             Divorced            25.0   <=50K
1      10th             Divorced            40.0   <=50K
2      10th             Divorced            40.0   <=50K
3      10th             Divorced            40.0   <=50K
4      10th   Married-civ-spouse            16.0   <=50K
5      10th   Married-civ-spouse            35.0   <=50K
6      10th   Married-civ-spouse            40.0   <=50K
7      10th   Married-civ-spouse            40.0   <=50K
8      10th   Married-civ-spouse            40.0   <=50K
9      10th   Married-civ-spouse            40.0   <=50K
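
The `randomSplit([0.75, 0.25], seed=123)` call above assigns rows to splits at random, so the split sizes are only approximately 75% and 25%. The following is a minimal pure-Python sketch of a seeded, weighted random split. It is an illustration of the idea, not Spark's implementation, and `random_split` is a hypothetical helper name.

```python
import random
from itertools import accumulate


def random_split(rows, weights, seed):
    """Assign each row to one split by comparing a uniform draw to cumulative weights."""
    total = float(sum(weights))
    # Normalized cumulative bounds, e.g. [0.75, 1.0] for weights [0.75, 0.25]
    bounds = [w / total for w in accumulate(weights)]
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for k, upper in enumerate(bounds):
            if draw < upper:
                splits[k].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound
            splits[-1].append(row)
    return splits


rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.75, 0.25], seed=123)
print(len(train_rows), len(test_rows))
```

Because each row is assigned independently, re-running with the same seed reproduces the same split, while a different seed gives slightly different split sizes.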

Next, we define a pipeline that includes feature engineering and training of a VW classifier. We use a featurizer provided by VW that hashes the feature names.
Note that VW's native binary classification labels are -1 and 1, whereas this notebook derives a 0.0/1.0 `label` column from the income category (0.0 for incomes containing `<`, i.e. `<=50K`, and 1.0 otherwise) before feeding the training data into the pipeline.
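
To make the feature hashing idea concrete, here is a minimal pure-Python sketch of the hashing trick. It rests on several assumptions: `zlib.crc32` stands in for the hash function VW actually uses, `num_bits=18` is only an illustrative value, and the convention of hashing `name=value` for strings and the name alone for numbers is one common scheme, not necessarily the exact one used by `VowpalWabbitFeaturizer`. `hash_features` is a hypothetical helper.

```python
import zlib


def hash_features(row, num_bits=18):
    """Map a dict of raw features to a sparse {index: value} vector."""
    mask = (1 << num_bits) - 1  # keep only the lowest num_bits bits of the hash
    sparse = {}
    for name, value in row.items():
        if isinstance(value, str):
            # Categorical feature: hash "name=value", use an indicator weight of 1.0
            index = zlib.crc32(f"{name}={value}".encode("utf-8")) & mask
            weight = 1.0
        else:
            # Numeric feature: hash the name only, keep the numeric value as the weight
            index = zlib.crc32(name.encode("utf-8")) & mask
            weight = float(value)
        # Features that collide on the same index are summed
        sparse[index] = sparse.get(index, 0.0) + weight
    return sparse


row = {"education": "10th", "marital-status": "Divorced", "hours-per-week": 25.0}
features = hash_features(row)
print(features)
```

The benefit of this design is that no vocabulary has to be built up front: any feature name or category maps directly to an index in a fixed-size vector, at the cost of occasional collisions.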

In [2]:
from pyspark.sql.functions import when, col
from pyspark.ml import Pipeline
from mmlspark.vw import VowpalWabbitFeaturizer, VowpalWabbitClassifier

# Define the classification label: 0.0 if income contains "<" (<=50K), 1.0 otherwise
train = train.withColumn("label", when(col("income").contains("<"), 0.0).otherwise(1.0)).repartition(1).cache()
print(train.count())

# Specify featurizer
vw_featurizer = VowpalWabbitFeaturizer(inputCols=["education", "marital-status", "hours-per-week"],
                                       outputCol="features")

# Define VW classification model
args = "--loss_function=logistic --quiet --holdout_off"
vw_model = VowpalWabbitClassifier(featuresCol="features",
                                  labelCol="label",
                                  args=args,
                                  numPasses=10)

# Create a pipeline
vw_pipeline = Pipeline(stages=[vw_featurizer, vw_model])

24412
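
The `--loss_function=logistic` argument and `numPasses=10` setting above can be illustrated with a small pure-Python sketch of logistic-loss training on sparse features. This is plain stochastic gradient descent under stated assumptions, not VW's actual update rule. The helper names (`to_vw_label`, `fit_logistic`, `predict`) and the `learning_rate` value are hypothetical, and the sketch assumes a 0/1 label is mapped to the -1/1 space that the logistic loss expects.

```python
import math


def to_vw_label(label01):
    """Map a 0.0/1.0 label to the -1.0/1.0 space used by the logistic loss."""
    return 1.0 if label01 == 1.0 else -1.0


def logistic_loss(y, score):
    """Logistic loss for a label y in {-1, 1} and a raw linear score."""
    return math.log(1.0 + math.exp(-y * score))


def dot(weights, features):
    return sum(weights.get(i, 0.0) * v for i, v in features.items())


def fit_logistic(examples, num_passes=10, learning_rate=0.5):
    """Run several passes of plain SGD over (sparse_features, label01) pairs."""
    weights = {}
    for _ in range(num_passes):
        for features, label01 in examples:
            y = to_vw_label(label01)
            score = dot(weights, features)
            # Derivative of log(1 + exp(-y * s)) with respect to s
            gradient = -y / (1.0 + math.exp(y * score))
            for i, v in features.items():
                weights[i] = weights.get(i, 0.0) - learning_rate * gradient * v
    return weights


def predict(weights, features):
    """Return 1.0 if the linear score is positive, otherwise 0.0."""
    return 1.0 if dot(weights, features) > 0 else 0.0


# Toy data: feature 0 signals the negative class, feature 1 the positive class
examples = [({0: 1.0}, 0.0), ({1: 1.0}, 1.0), ({0: 1.0}, 0.0), ({1: 1.0}, 1.0)]
weights = fit_logistic(examples, num_passes=10)
print(predict(weights, {0: 1.0}), predict(weights, {1: 1.0}))
```

Each additional pass revisits the same examples and moves the weights further in the direction that lowers the loss, which is why multiple passes can help on a small dataset.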

We then train the model by fitting the pipeline to the training data.

In [3]:
# Train the model
vw_trained = vw_pipeline.fit(train)

After the model is trained, we apply it to predict the income of each sample in the test set.

In [4]:
# Making predictions
test = test.withColumn("label", when(col("income").contains("<"), 0.0).otherwise(1.0))
prediction = vw_trained.transform(test)
prediction.limit(10).toPandas()

  education  ... prediction
0      10th  ...        0.0
1      10th  ...        0.0
2      10th  ...        0.0
3      10th  ...        0.0
4      10th  ...        0.0
5      10th  ...        0.0
6      10th  ...        0.0
7      10th  ...        0.0
8      10th  ...        0.0
9      10th  ...        0.0

[10 rows x 9 columns]
  Unsupported type in conversion to Arrow: VectorUDT
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.

Finally, we evaluate model performance using the `ComputeModelStatistics` transformer, which by default computes the confusion matrix, accuracy, precision, recall, and AUC for classification models.

In [5]:
from mmlspark.train import ComputeModelStatistics
metrics = ComputeModelStatistics(evaluationMetric="classification", 
                                 labelCol="label", 
                                 scoredLabelsCol="prediction").transform(prediction)
metrics.toPandas()

  evaluation_type  ...       AUC
0  Classification  ...  0.698855

[1 rows x 6 columns]
  Unsupported type in conversion to Arrow: MatrixUDT
Attempting non-optimization as 'spark.sql.execution.arrow.fallback.enabled' is set to true.
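
For reference, the confusion matrix, accuracy, precision, and recall reported above follow from simple counts over the label and prediction columns. The sketch below computes them in pure Python on made-up toy data. `binary_metrics` is a hypothetical helper and does not reflect how `ComputeModelStatistics` is implemented.

```python
def binary_metrics(labels, predictions):
    """Compute confusion-matrix counts and derived metrics for 0.0/1.0 labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1.0 and p == 1.0)
    tn = sum(1 for y, p in pairs if y == 0.0 and p == 0.0)
    fp = sum(1 for y, p in pairs if y == 0.0 and p == 1.0)
    fn = sum(1 for y, p in pairs if y == 1.0 and p == 0.0)
    return {
        # Rows are true classes (0, 1); columns are predicted classes (0, 1)
        "confusion_matrix": [[tn, fp], [fn, tp]],
        "accuracy": (tp + tn) / len(pairs),
        # Fall back to 0.0 when the denominator is empty
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy labels and predictions, invented for illustration only
labels = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]
predictions = [0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0]
metrics_sketch = binary_metrics(labels, predictions)
print(metrics_sketch)
```

On imbalanced data such as income brackets, accuracy alone can look good even when recall on the minority class is poor, so it helps to read these metrics together.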

In [6]:
spark.stop()

