Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

#Logistic Regression Lab

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."

In [None]:
%run "../includes/setup_env"

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Reading the data

We begin by reading the data that we finished pre-processing in a prior Notebook.

*Note:* If you get an error message about a non-existent file, please uncomment the first line of the following cell. This will run yesterday's notebook for preparing our dataset.

In [None]:
# %run "../day_1/03_data_prep_lab"

df = spark.read.parquet("dbfs:/FileStore/tables/preprocessed").cache()
display(df)

Let's begin by dividing the data into training and test sets. With time-series data, we usually divide the data based on a time cut-off, and to avoid **leakage** we also leave a gap (2 weeks in this case) between the training and test data. The data is collected hourly, so if we do not wish to model at such a high frequency, we can additionally keep only every n-th row of the data.

In [None]:
from datetime import datetime
from pyspark.sql.functions import col, hour

# Train on everything before the cut-off date.
# To keep only every 3rd hourly row, add `& (hour(col('datetime')) % 3 == 0)` to the filter condition.
df_train = df.filter(col('datetime') < datetime(2015, 10, 1))
df_test = df.filter(col('datetime') > datetime(2015, 10, 15))
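
The split logic can be checked on a small scale without Spark. The following is a minimal sketch in plain Python, assuming made-up hourly readings from 1 September to 31 October 2015; the helper `split_with_gap` and the sample rows are hypothetical and only mirror the cut-off dates used in this notebook.

```python
from datetime import datetime, timedelta

def split_with_gap(rows, train_end, gap):
    """Split (timestamp, value) rows into train and test, discarding the gap window."""
    test_start = train_end + gap
    train = [r for r in rows if r[0] < train_end]
    test = [r for r in rows if r[0] > test_start]
    return train, test

# Hypothetical hourly readings covering 1 Sep 2015 00:00 to 31 Oct 2015 23:00.
start = datetime(2015, 9, 1)
rows = [(start + timedelta(hours=h), h) for h in range(61 * 24)]

train, test = split_with_gap(rows, datetime(2015, 10, 1), timedelta(days=14))

# Optional down-sampling: keep only the rows recorded at hours 0, 3, 6, ...
train_every_3h = [r for r in train if r[0].hour % 3 == 0]

print(len(train), len(test), len(train_every_3h))
```

Rows that fall inside the two-week window belong to neither set, which is the point of the gap: features built from recent history cannot straddle the boundary between training and test data.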

Let's look at some summary statistics for the labels in the data.

In [None]:
display(df_train.describe())

We now build a classifier for `y_0` (failure in the first component). We drop the other labels along with the `datetime` and `machineID` columns, and rename `y_0` to `error`.

In [None]:
df_train = df_train.drop("y_1","y_2","y_3","datetime", "machineID")
df_train = df_train.withColumnRenamed("y_0", "error")
df_train.cache()

df_test = df_test.drop("y_1","y_2","y_3","datetime", "machineID")
df_test = df_test.withColumnRenamed("y_0", "error")
df_test.cache()

Let's make sure we don't have any null values in our DataFrame.

In [None]:
recordCount = df_train.count()
noNullsRecordCount = df_train.na.drop().count()

print("We have {} records that contain null values.".format(recordCount - noNullsRecordCount))

In [None]:
display(df_train.groupBy("error").count())
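
The label counts matter because failures are often rare, in which case a model that never predicts a failure can still look accurate. Here is a small sketch of that check in plain Python. The counts are hypothetical placeholders, not values from this dataset, and `class_shares` is an illustrative helper.

```python
def class_shares(counts):
    """Convert per-label row counts into fractions of the total."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Hypothetical counts for illustration only; substitute the output of the cell above.
counts = {0: 990, 1: 10}

shares = class_shares(counts)

# Accuracy of a trivial classifier that always predicts the majority label.
majority_accuracy = max(shares.values())

print(shares, majority_accuracy)
```

If the majority share is close to 1, accuracy says little about the model, which is one reason to evaluate with a ranking metric such as the area under the ROC curve.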

## Train a Logistic Regression Model

Before we can apply the logistic regression model, we may need to do some data preparation, such as one-hot encoding our categorical variables using `StringIndexer` and `OneHotEncoderEstimator` (renamed to `OneHotEncoder` in Spark 3.0).

Let's start by taking a look at all of our columns, and determine which ones are categorical.

In [None]:
df_train.printSchema()
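
To make the encoding step concrete, here is a rough sketch in plain Python of what indexing followed by one-hot encoding does to a categorical column. It is not the Spark implementation: `fit_index` and `one_hot` are hypothetical helpers, the sample values are made up, and the frequency ordering and dropped last category are assumptions meant to mirror the default behaviour of the Spark transformers.

```python
from collections import Counter

def fit_index(values):
    """Map each category to an index, most frequent first (ties broken alphabetically)."""
    freq = Counter(values)
    ordered = sorted(freq, key=lambda v: (-freq[v], v))
    return {v: i for i, v in enumerate(ordered)}

def one_hot(index, size, drop_last=True):
    """Return an indicator vector; with drop_last the final category becomes all zeros."""
    width = size - 1 if drop_last else size
    return [1.0 if i == index else 0.0 for i in range(width)]

# Hypothetical categorical column.
models = ["model3", "model1", "model3", "model2", "model3", "model1"]

mapping = fit_index(models)
encoded = [one_hot(mapping[m], len(mapping)) for m in models]

print(mapping)
print(encoded)
```

Dropping the last category keeps the indicator columns from summing to one in every row, which avoids a redundant column in a linear model that also has an intercept.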

## Setting up the model

We set the label column (`labelCol`) of the `LogisticRegression` model to `error`, and the features column (`featuresCol`) to `norm_features`.

In [None]:
from pyspark.ml.classification import LogisticRegression

lr = (LogisticRegression()
     .setLabelCol("error")
     .setFeaturesCol("norm_features"))
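
For intuition about what the fitted model computes, a binary logistic regression turns a weighted sum of the features into a probability through the sigmoid function. The sketch below shows that scoring step in plain Python; `predict_proba`, the weights, and the intercept are hypothetical and are not taken from the model trained in this notebook.

```python
import math

def predict_proba(features, weights, intercept):
    """Probability of the positive class: sigmoid(intercept + weights . features)."""
    margin = intercept + sum(w * x for w, x in zip(weights, features))
    # Numerically stable sigmoid: avoid overflow in exp() for large |margin|.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)

# Hypothetical coefficients for a two-feature model.
weights = [0.8, -0.5]
intercept = -1.0

p = predict_proba([2.0, 1.0], weights, intercept)  # margin = -1.0 + 1.6 - 0.5 = 0.1
label = 1 if p > 0.5 else 0                        # 0.5 is the usual default threshold

print(p, label)
```

Training amounts to choosing the weights and intercept so that these probabilities match the observed `error` labels as closely as possible.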

### Hands-on lab
Create a pipeline that contains a single stage for the model we created above. Fit the pipeline to the training data, then use the fitted model to `transform` the test data.

In [None]:
# maximize this cell (click the + button on the right) to see the solution:
  
from pyspark.ml import Pipeline

pipeline = Pipeline(stages = [lr])
assert len(pipeline.getStages()) == 1 # make sure it's one stage only
print(pipeline.getStages())

lr_model = pipeline.fit(df_train)

df_pred = lr_model.transform(df_test) # apply the model to our held-out test set
display(df_pred)

### End of lab

In [None]:
df_pred.printSchema()

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Evaluate the Model

In [None]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()
print(evaluator.explainParams())

In [None]:
evaluator.setLabelCol("error")
evaluator.setRawPredictionCol('rawPrediction')

metricName = evaluator.getMetricName()
metricVal = evaluator.evaluate(df_pred)

print("{}: {}".format(metricName, metricVal))

We could wrap this into a function to make it easier to get the output of multiple metrics.

In [None]:
def printEval(df, labelCol = "error", rawPredictionCol = "rawPrediction"):
  evaluator = BinaryClassificationEvaluator()
  evaluator.setLabelCol(labelCol)
  evaluator.setRawPredictionCol(rawPredictionCol)

  auroc = evaluator.setMetricName("areaUnderROC").evaluate(df)
  print("AUROC: {}".format(auroc))

In [None]:
printEval(df_pred)
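
To interpret the number printed above: the area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following plain-Python sketch computes it directly from that definition. The function `auroc` and the toy labels and scores are illustrative; Spark builds the curve differently, so its value may differ slightly on real data.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: four rows with their true labels and model scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(auroc(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which gives a scale for judging the result of `printEval`.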

##Conclusion
Hmmmm... our results are not great yet. We'll look into how to improve our results later.
