# Machine Learning Pipeline

**What you will learn:**
* How to create a Machine Learning Pipeline.
* How to train a Machine Learning model.
* How to save & read the model.
* How to make predictions with the model.

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."

In [3]:
%run "../includes/mnt_blob"

## The Data

The dataset contains bike rental info from 2011 and 2012 in the Capital bikeshare system, plus additional relevant information such as weather.  

This dataset is from Fanaee-T and Gama (2013) and is hosted by the <a href="http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset" target="_blank">UCI Machine Learning Repository</a>.

## The Goal
We want to learn to predict bike rental counts (per hour) from information such as day of the week, weather, month, etc.  

Having good predictions of customer demand allows a business or service to prepare and increase supply as needed.

## Loading the data

We begin by loading our data, which is stored in CSV format.

In [7]:
fileName = "/mnt/data/bikeSharing/data-001/hour.csv"

initialDF = (spark.read          # Our DataFrameReader
  .option("header", "true")      # Let Spark know we have a header
  .option("inferSchema", "true") # Inferring the schema (it is a small dataset)
  .csv(fileName)                 # Location of our data
  .cache()                       # Mark the DataFrame as cached.
)

initialDF.count()                # Materialize the cache

initialDF.printSchema()

## Understanding the data

According to the <a href="http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset" target="_blank">UCI ML Repository description</a>, we have the following schema:

**Feature columns**:
* **dteday**: date
* **season**: season (1:spring, 2:summer, 3:fall, 4:winter)
* **yr**: year (0:2011, 1:2012)
* **mnth**: month (1 to 12)
* **hr**: hour (0 to 23)
* **holiday**: whether the day was a holiday or not
* **weekday**: day of the week
* **workingday**: `1` if the day is neither a weekend nor holiday, otherwise `0`.
* **weathersit**: 
  * 1: Clear, Few clouds, Partly cloudy
  * 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  * 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  * 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* **temp**: Normalized temperature in Celsius. The values are derived via `(t-t_min)/(t_max-t_min)`, `t_min=-8`, `t_max=+39` (only in hourly scale)
* **atemp**: Normalized feeling temperature in Celsius. The values are derived via `(t-t_min)/(t_max-t_min)`, `t_min=-16`, `t_max=+50` (only in hourly scale)
* **hum**: Normalized humidity. The values are divided by 100 (max)
* **windspeed**: Normalized wind speed. The values are divided by 67 (max)
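
To read a normalized temperature back in degrees Celsius, the formula above can be inverted. The snippet below is a minimal sketch in plain Python; the helper name `denormalize` and the sample value `0.5` are illustrative assumptions, not part of the dataset or of Spark.

```python
# Bounds quoted in the schema description above (degrees Celsius).
TEMP_MIN, TEMP_MAX = -8.0, 39.0
ATEMP_MIN, ATEMP_MAX = -16.0, 50.0


def denormalize(value, lower, upper):
    """Invert (t - t_min) / (t_max - t_min) to recover t."""
    return value * (upper - lower) + lower


# 0.5 is an illustrative normalized value, not a row taken from the dataset.
temp_celsius = denormalize(0.5, TEMP_MIN, TEMP_MAX)
atemp_celsius = denormalize(0.5, ATEMP_MIN, ATEMP_MAX)
print(temp_celsius, atemp_celsius)
```

A normalized value of `0` maps back to `t_min` and a value of `1` maps back to `t_max`.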

**Label columns**:
* **casual**: count of casual users
* **registered**: count of registered users
* **cnt**: count of total rental bikes including both casual and registered

**Extraneous columns**:
* **instant**: record index

For example, the first row is a record of hour 0 on January 1, 2011---and apparently, 16 people rented bikes around midnight!

## Preprocessing the data

So what do we need to do to get our data ready for Machine Learning?

**Recall our goal**: We want to learn to predict the count of bike rentals (the `cnt` column).  We refer to the count as our target "label".

**Features**: What can we use as features to predict the `cnt` label?  

We can use every column except `cnt`, with a few exceptions:
* `casual` & `registered`
  * The `cnt` column we want to predict equals the sum of the `casual` + `registered` columns.  We will remove the `casual` and `registered` columns from the data to make sure we do not use them to predict `cnt`.  (*Warning: This is a danger in careless Machine Learning.  Make sure you do not "cheat" by using information you will not have when making predictions*)
* `season` and the date column `dteday`: We could keep them, but they are well-represented by the other date-related columns like `yr`, `mnth`, and `weekday`.
* `holiday` and `weekday`: These features are highly correlated with the `workingday` column.
* row index column `instant`: A row index carries no information about demand, so it is useless as a feature.
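
The leakage risk from `casual` and `registered` can be checked directly. The sketch below uses plain Python and made-up rows (they are assumptions for illustration, not values read from `hour.csv`): it confirms that the two columns add up to the label and then removes them.

```python
# Illustrative rows only; the numbers are made up, not taken from hour.csv.
rows = [
    {"hr": 0, "casual": 3, "registered": 13, "cnt": 16},
    {"hr": 8, "casual": 10, "registered": 90, "cnt": 100},
]

# If this holds for every row, casual and registered together reveal the label.
leaks_label = all(r["casual"] + r["registered"] == r["cnt"] for r in rows)

# Drop the leaking columns before building features.
leaky_columns = {"casual", "registered"}
safe_rows = [{k: v for k, v in r.items() if k not in leaky_columns} for r in rows]
print(leaks_label, safe_rows)
```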

Let's drop the columns `instant`, `dteday`, `season`, `casual`, `holiday`, `weekday`, and `registered` from our DataFrame and then review our schema:

In [11]:
preprocessedDF = initialDF.drop("instant", "dteday", "season", "casual", "registered", "holiday", "weekday")

preprocessedDF.printSchema()

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Train/Test Split

Our final data preparation step will be to split our dataset into separate training and test sets.

Using the `randomSplit()` function, we split the data such that 70% of the data is reserved for training and the remaining 30% for testing. 

For more information see:
* Scala: <a href="https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.Dataset" target="_blank">Dataset.randomSplit()</a>
* Python: <a href="https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.randomSplit" target="_blank">DataFrame.randomSplit()</a>
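
Conceptually, a weighted random split assigns each row to a bucket with probability proportional to that bucket's weight. The following is a rough plain-Python sketch of that idea, not Spark's actual implementation; the function name `random_split` and the 1,000-row input are assumptions for illustration.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one bucket, with probability proportional to its weight."""
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # Guard against floating-point rounding.

    rng = random.Random(seed)
    buckets = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            if draw < upper:
                buckets[i].append(row)
                break
    return buckets


train, test = random_split(list(range(1000)), [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Because every row is assigned by an independent random draw, the resulting sizes are only approximately 70% and 30%. Fixing the seed makes the assignment repeatable.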

In [13]:
trainDF, testDF = preprocessedDF.randomSplit(
  [0.7, 0.3],  # 70-30 split
  seed=42)     # For reproducibility

print("We have %d training examples and %d test examples." % (trainDF.count(), testDF.count()))
assert (trainDF.count() == 12197)

## Visualize our data

Now that we have preprocessed our features, we can quickly visualize our data to get a sense of whether the features are meaningful.

We want to compare bike rental counts versus the hour of the day. 

To plot the data:
* Run the cell below
* From the list of plot types, select **Line**.
* Click the **Plot Options...** button.
* By dragging and dropping the fields, set the **Keys** to **hr** and the **Values** to **cnt**.

Once you've created the graph, go back and select different **Keys**. For example:
* **cnt** vs. **windspeed**
* **cnt** vs. **mnth**
* **cnt** vs. **workingday**
* **cnt** vs. **hum**
* **cnt** vs. **temp**
* ...etc.

In [15]:
display(trainDF)

A couple of notes:
* Rentals are low during the night, and they peak in the morning (8 am) and in the early evening (5 pm).  
* Rentals are high during the summer and low in winter.
* Rentals are higher on working days than on non-working days.

This indicates that the `hr`, `mnth` and `workingday` features are all useful and can help us predict our label `cnt`. 

But how do other features affect our prediction? 

Do combinations of those features matter? For example, high wind in summer is not going to have the same effect as high wind in winter.

As it turns out our features can be divided into two types:
 * **Numeric columns:**
   * `mnth`
   * `temp`
   * `hr`
   * `hum`
   * `atemp`
   * `windspeed`

* **Categorical Columns:**
  * `yr`
  * `workingday`
  * `weathersit`
  
We could treat both `mnth` and `hr` as categorical but we would lose the temporal relationships (e.g. 2:00 AM comes before 3:00 AM).

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) StringIndexer

For each of the categorical columns, we are going to create one `StringIndexer` where we
  * Set `inputCol` to something like `weathersit`
  * Set `outputCol` to something like `weathersitIndex`

This has the effect of treating a column like `weathersit` not as a number from 1 through 4, but rather as four distinct categories: **light**, **mist**, **medium** & **heavy**, for example.

For more information see:
* Scala: <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer" target="_blank">StringIndexer</a>
* Python: <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=stringindexer#pyspark.ml.feature.StringIndexer" target="_blank">StringIndexer</a>
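
By default, `StringIndexer` casts numeric input to strings and assigns index `0.0` to the most frequent value, `1.0` to the next most frequent, and so on. The plain-Python sketch below mimics that ordering; the function name `fit_string_indexer` and the sample values are assumptions for illustration, and the alphabetical tie-break is an assumption about how equal frequencies are resolved.

```python
from collections import Counter


def fit_string_indexer(values):
    """Map each distinct value to an index, most frequent value first."""
    counts = Counter(str(v) for v in values)
    # Sort by descending frequency, then alphabetically to break ties.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}


# Illustrative weathersit values, not taken from the dataset.
weathersit = [2, 2, 2, 1, 1, 3]
mapping = fit_string_indexer(weathersit)
indexed = [mapping[str(v)] for v in weathersit]
print(mapping)
print(indexed)
```

Note that the resulting index reflects frequency, not the original numeric order: in this sample, `2` is the most common value and therefore receives index `0.0`.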

Before we get started, let's review our current schema:

In [19]:
trainDF.printSchema()

Let's create the first `StringIndexer` for the `workingday` column.

After we create it, we can run a sample through the indexer to see how it would affect our `DataFrame`.

In [21]:
from pyspark.ml.feature import StringIndexer

workingdayStringIndexer = StringIndexer(
  inputCol="workingday", 
  outputCol="workingdayIndex")

# Just for demonstration purposes, we will use the StringIndexer to fit and
# then transform our training data set just to see how it affects the schema
workingdayStringIndexer.fit(trainDF).transform(trainDF).printSchema()

Next we will create the `StringIndexer` for the `yr` column and preview its effect.

In [23]:
yrStringIndexer = StringIndexer(
  inputCol="yr", 
  outputCol="yrIndex")

yrStringIndexer.fit(trainDF).transform(trainDF).printSchema()

And then create our last `StringIndexer` for the `weathersit` column.

In [25]:
weathersitStringIndexer = StringIndexer(
  inputCol="weathersit", 
  outputCol="weathersitIndex")

weathersitStringIndexer.fit(trainDF).transform(trainDF).printSchema()

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) VectorAssembler

The next step is to assemble the feature columns into a single feature vector.

To do that we will use the `VectorAssembler` where we
  * Set `inputCols` to the new list of feature columns
  * Set `outputCol` to `features`
  
  
For more information see:
* Scala: <a href="https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.feature.VectorAssembler" target="_blank">VectorAssembler</a>
* Python: <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.VectorAssembler" target="_blank">VectorAssembler</a>
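
In essence, the assembler concatenates the values of the chosen columns, in the order given, into one vector per row. Here is a minimal plain-Python sketch of that behavior; the `assemble` helper and the sample row are illustrative assumptions, not Spark code.

```python
def assemble(row, input_cols):
    """Concatenate the named columns of one row into a single list of floats."""
    return [float(row[col]) for col in input_cols]


# Illustrative row; the values are made up for the example.
row = {"mnth": 1, "temp": 0.24, "hr": 0, "hum": 0.81, "cnt": 16}
features = assemble(row, ["mnth", "temp", "hr", "hum"])
print(features)
```

The order of the input columns fixes the position of each feature in the vector, which is what later lets us pair the list of input names with per-feature scores.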

In [27]:
from pyspark.ml.feature import VectorAssembler

assemblerInputs  = [
  "mnth", "temp", "hr", "hum", "atemp", "windspeed", # Our numerical features
  "yrIndex", "workingdayIndex", "weathersitIndex"]   # Our new categorical features

vectorAssembler = VectorAssembler(
  inputCols=assemblerInputs, 
  outputCol="features")

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Random Forests

Random forests and ensembles of decision trees are more powerful than a single decision tree alone.

This is also the last step in our pipeline.

We will use the `RandomForestRegressor` where we
  * Set `labelCol` to the column that contains our label.
  * Set `seed` to ensure reproducibility.
  * Set `numTrees` to `3` so that we build 3 trees in our random forest.
  * Set `maxDepth` to `10` to control the depth/complexity of the tree.

For more information see:
* Scala: <a href="https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.regression.RandomForestRegressor" target="_blank">RandomForestRegressor</a>
* Python: <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.regression.RandomForestRegressor" target="_blank">RandomForestRegressor</a>
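
For regression, a random forest combines its trees by averaging their individual predictions. The sketch below illustrates this with three hand-written stand-in "trees"; the functions `tree_a`, `tree_b`, `tree_c` and all of their thresholds and outputs are hypothetical, chosen only to make the averaging visible.

```python
def tree_a(f):
    # Hypothetical tree that splits on the hour of the day.
    return 200.0 if 7 <= f["hr"] <= 19 else 30.0


def tree_b(f):
    # Hypothetical tree that splits on workingday.
    return 180.0 if f["workingday"] == 1 else 90.0


def tree_c(f):
    # Hypothetical tree that splits on normalized temperature.
    return 250.0 if f["temp"] > 0.5 else 100.0


def forest_predict(trees, features):
    """Average the predictions of all trees in the ensemble."""
    predictions = [tree(features) for tree in trees]
    return sum(predictions) / len(predictions)


sample = {"hr": 8, "workingday": 1, "temp": 0.3}
print(forest_predict([tree_a, tree_b, tree_c], sample))
```

Averaging smooths out the errors of individual trees, which is why an ensemble tends to be more stable than any single tree.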

In [29]:
from pyspark.ml.regression import RandomForestRegressor

rfr = (RandomForestRegressor()
      .setLabelCol("cnt") # The column of our label
      .setSeed(27)        # Some seed value for consistency
      .setNumTrees(3)     # A guess at the number of trees
      .setMaxDepth(10)    # A guess at the depth of each tree
)

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Create a Machine Learning Pipeline

Now let's wrap all of these stages into a Pipeline.

In [31]:
from pyspark.ml import Pipeline

pipeline = Pipeline().setStages([
  workingdayStringIndexer, # categorize workingday
  weathersitStringIndexer, # categorize weathersit
  yrStringIndexer,         # categorize yr
  vectorAssembler,         # assemble the feature vector for all columns
  rfr])

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Train the model

Fit the pipeline on the training data. This runs every stage in order and returns a fitted pipeline model.

In [33]:
pipelineModel = pipeline.fit(trainDF)

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Evaluate the model

Now that we have fitted a model, we can evaluate it.

In the case of a random forest, one of the best things to look at is the `featureImportances`:

In [35]:
from pyspark.ml.regression import RandomForestRegressionModel

rfrm = pipelineModel.stages[-1] # The RFRM is in the last stage of the model

#  Zip the list of features with their scores
scores = zip(assemblerInputs, rfrm.featureImportances)

# And pretty print 'em
for x in scores: print("%-15s = %s" % x)

print("-"*80)

Which features were most important?
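
One way to answer that question is to sort the name and score pairs by score. The sketch below uses plain Python with hypothetical scores (assumptions for illustration only); in the notebook, the real values come from `rfrm.featureImportances`.

```python
# Hypothetical scores for illustration; real values come from the fitted model.
names = ["mnth", "temp", "hr"]
scores = [0.05, 0.25, 0.70]

# Sort the (name, score) pairs from most to least important.
ranked = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)

for name, score in ranked:
    print("%-15s = %.2f" % (name, score))
```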

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Making Predictions

Next, apply the trained pipeline model to the test set.

In [38]:
# Using the model, create our predictions from the test data
predictionsDF = pipelineModel.transform(testDF)

# Reorder the columns for easier interpretation
reorderedDF = predictionsDF.select("cnt", "prediction", "yr", "yrIndex", "mnth", "hr", "workingday", "workingdayIndex", "weathersit", "weathersitIndex", "temp", "atemp", "hum", "windspeed")

display(reorderedDF)

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Evaluate

Next, we'll use `RegressionEvaluator` to assess the results. The default regression metric is RMSE.

For more information see:
* Scala: <a href="https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.evaluation.RegressionEvaluator" target="_blank">RegressionEvaluator</a>
* Python: <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.RegressionEvaluator" target="_blank">RegressionEvaluator</a>
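
RMSE (root mean squared error) is the square root of the average squared difference between label and prediction, so it is expressed in the same units as the label (rentals per hour here). A minimal plain-Python sketch of the computation follows; the `rmse` helper and the sample numbers are illustrative assumptions.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up labels and predictions for illustration.
labels = [16, 40, 32, 13]
predictions = [20.0, 37.0, 32.0, 13.0]
print(rmse(labels, predictions))
```

Lower is better, and large individual errors are penalized more heavily than small ones because the errors are squared before averaging.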

In [40]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator().setLabelCol("cnt")

rmse = evaluator.evaluate(predictionsDF)

print("Test RMSE = %f" % rmse)

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) ParamGrid

There are a lot of hyperparameters we could tune, and it would take a long time to configure them manually.

Instead of a manual (ad-hoc) approach, let's use Spark's `ParamGridBuilder` to define the candidate values, so that we can search for the best hyperparameters systematically.

In this example notebook, we keep these trees shallow and use a relatively small number of trees. Let's define a grid of hyperparameters to test:
  - maxDepth: max depth of each decision tree in the RF ensemble (Use the values `2, 5, 10`)
  - numTrees: number of trees in each RF ensemble (Use the values `10, 50`)

`addGrid()` accepts the parameter to tune (e.g. `rfr.maxDepth`) and a list of the possible values (e.g. `[2, 5, 10]`).

For more information see:
* Scala: <a href="https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.tuning.ParamGridBuilder" target="_blank">ParamGridBuilder</a>
* Python: <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.ParamGridBuilder" target="_blank">ParamGridBuilder</a>
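
The grid is the Cartesian product of the value lists, so three depths and two tree counts yield six candidate configurations. The plain-Python sketch below enumerates them with `itertools.product`; it only illustrates the idea and is not how `ParamGridBuilder` is implemented.

```python
from itertools import product

# The candidate values listed above.
grid = {"maxDepth": [2, 5, 10], "numTrees": [10, 50]}

names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

for combo in combinations:
    print(combo)
print("Total combinations:", len(combinations))
```

Every additional parameter multiplies the number of combinations, so grids grow quickly.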

In [42]:
from pyspark.ml.tuning import ParamGridBuilder

paramGrid = (ParamGridBuilder()
            .addGrid(rfr.maxDepth, [2, 5, 10])
            .addGrid(rfr.numTrees, [10, 50])
            .build())

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Cross-Validation

We are also going to use 3-fold cross-validation to identify the optimal maxDepth and numTrees combination.

![crossValidation](https://files.training.databricks.com/images/301/CrossValidation.png)

With 3-fold cross-validation, we train on 2/3 of the data and evaluate with the remaining (held-out) 1/3. We repeat this process 3 times, so each fold gets the chance to act as the validation set. We then average the results of the three rounds.
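
The fold mechanics can be sketched in plain Python as follows. The helper `k_fold_indices`, the 9-row input, and the per-fold scores are all assumptions for illustration; Spark's own fold assignment may differ in detail.

```python
import random


def k_fold_indices(n_rows, k, seed):
    """Return k (training, validation) index splits over n_rows shuffled rows."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]

    splits = []
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, validation))
    return splits


splits = k_fold_indices(9, k=3, seed=27)
for training, validation in splits:
    print("train on", sorted(training), "validate on", sorted(validation))

# Made-up per-fold RMSE values, averaged into a single score.
fold_scores = [98.0, 101.0, 104.0]
average = sum(fold_scores) / len(fold_scores)
print(average)
```

Each row is used for validation exactly once. With the grid in this notebook (3 depths × 2 tree counts = 6 combinations) and 3 folds, that amounts to 18 model fits, plus one final fit of the best configuration on the full training set.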

We pass in the `estimator` (our original pipeline), an `evaluator`, and an `estimatorParamMaps` to the `CrossValidator` so that it knows:
- Which model to use
- How to evaluate the model
- What hyperparameters to set on the model

We can also set the number of folds we want to split our data into (3), as well as setting a seed so we all have the same split in the data.

For more information see:
* Scala: <a href="https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.tuning.CrossValidator" target="_blank">CrossValidator</a>
* Python: <a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.tuning.CrossValidator" target="_blank">CrossValidator</a>

In [45]:
from pyspark.ml.tuning import CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = (RegressionEvaluator()
  .setLabelCol("cnt")
  .setPredictionCol("prediction"))

cv = (CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
  .setSeed(27))

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) A New Model

We can now use the `CrossValidator` to fit a new model - this could take several minutes on a small cluster.

In [47]:
cvModel = cv.fit(trainDF)

And now we can compare the average RMSE of each hyperparameter configuration. The lowest value identifies the best model:

In [49]:
# Zip the two lists together
results = list(zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics))

# And pretty print 'em, looking each value up by its Param rather than by position
for params, avgRmse in results:
  maxDepth = params[rfr.maxDepth]
  numTrees = params[rfr.numTrees]
  print("Depth: %s, Trees: %s\nAverage RMSE: %s\n" % (maxDepth, numTrees, avgRmse))
  
print("-"*80)

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) One last set of predictions

Using our newest model, let's make a final set of predictions:

In [51]:
# Using the model, create our predictions from the test data
finalPredictionsDF = cvModel.transform(testDF)

# Reorder the columns for easier interpretation
finalDF = finalPredictionsDF.select("cnt", "prediction", "yr", "yrIndex", "mnth", "hr", "workingday", "workingdayIndex", "weathersit", "weathersitIndex", "temp", "atemp", "hum", "windspeed")

display(finalDF)

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Evaluating the New Model

Let's see how our latest model does:

In [53]:
print("Test RMSE = %f" % evaluator.evaluate(finalPredictionsDF))

&copy; 2018 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>