##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Decision Trees
### Analyzing a bike sharing dataset

This notebook demonstrates creating an ML Pipeline to preprocess a dataset, train a Machine Learning model, save the model, and make predictions.

**Data**: The dataset contains bike rental info from 2011 and 2012 in the Capital Bikeshare system, plus additional relevant information such as weather. This dataset is from Fanaee-T and Gama (2013) and is hosted by the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset).

**Goal**: We want to learn to predict bike rental counts (per hour) from information such as day of the week, weather, season, etc.  Having good predictions of customer demand allows a business or service to prepare and increase supply as needed.  

In the next lab, we will also demonstrate hyperparameter tuning using cross-validation, as well as tree ensembles to fine-tune and improve our ML model.

[Decision Tree Regressor (Scala)](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.regression.DecisionTreeRegressor)

[Decision Tree Regressor (Python)](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.regression.DecisionTreeRegressor)

In [2]:
%run "../includes/mnt_blob"

## Load and understand the data

We begin by loading our data, which is stored in [Comma-Separated Value (CSV) format](https://en.wikipedia.org/wiki/Comma-separated_values). 

Use the `spark.read.csv` method to read the data and set a few options:
- `header`: set to true to indicate that the first line of the CSV data file is a header
- `inferSchema`: set to true to infer the datatypes
- The file is located at `/mnt/data/bikeSharing/data-001/hour.csv`.

[DataFrame Reader (Scala)](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.DataFrameReader)

[DataFrame Reader (Python)](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader)

In [4]:
# TODO
# df = spark.read.<FILL_IN>

Let's cache the DataFrame so subsequent uses will be able to read from memory, instead of re-reading the data from disk.

In [6]:
# df.cache()

__Question__: Is the DataFrame in the Storage tab of the Spark UI?

#### Data description

From the [UCI ML Repository description](http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset), we have the following schema.

**Feature columns**:
* dteday: date
* season: season (1:spring, 2:summer, 3:fall, 4:winter)
* yr: year (0:2011, 1:2012)
* mnth: month (1 to 12)
* hr: hour (0 to 23)
* holiday: whether day is holiday or not
* weekday: day of the week
* workingday: 1 if the day is neither a weekend nor a holiday, otherwise 0
* weathersit: 
  * 1: Clear, Few clouds, Partly cloudy
  * 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  * 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  * 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
* temp: Normalized temperature in Celsius. The values are derived via `(t-t_min)/(t_max-t_min)`, `t_min=-8`, `t_max=+39` (only in hourly scale)
* atemp: Normalized feeling temperature in Celsius. The values are derived via `(t-t_min)/(t_max-t_min)`, `t_min=-16`, `t_max=+50` (only in hourly scale)
* hum: Normalized humidity. The values are divided by 100 (max)
* windspeed: Normalized wind speed. The values are divided by 67 (max)
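The normalization formulas above can be inverted to recover readings in their original units. The following is a minimal plain-Python sketch that uses only the constants from the description. The helper name `denormalize` and the sample values are made up for illustration and do not come from the dataset.

```python
# Constants taken from the data description above.
TEMP_MIN, TEMP_MAX = -8.0, 39.0
ATEMP_MIN, ATEMP_MAX = -16.0, 50.0
HUM_MAX = 100.0
WINDSPEED_MAX = 67.0


def denormalize(value, lo, hi):
    """Invert (t - lo) / (hi - lo) to recover the original reading."""
    return value * (hi - lo) + lo


# Hypothetical normalized readings, chosen only for illustration.
temp_celsius = denormalize(0.5, TEMP_MIN, TEMP_MAX)
atemp_celsius = denormalize(0.5, ATEMP_MIN, ATEMP_MAX)
humidity = 0.5 * HUM_MAX
windspeed = 0.25 * WINDSPEED_MAX

print(temp_celsius, atemp_celsius, humidity, windspeed)
```

Decision trees split on thresholds, so they do not need the features to be de-normalized. The conversion is mainly useful when interpreting a learned split in familiar units.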

**Label columns**:
* casual: count of casual users
* registered: count of registered users
* cnt: count of total rental bikes including both casual and registered

**Extraneous columns**:
* instant: record index

For example, the first row is a record of hour 0 on January 1, 2011, and apparently 16 people rented bikes around midnight!

Let's look at a subset of our data. We'll use the [sample()](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample) method to sample 10% of the DataFrame without replacement, and call `display()` on the resulting DataFrame.

In [10]:
# TODO
display(<FILL_IN>)

## Preprocess data

So what do we need to do to get our data ready for Machine Learning?

*Recall our goal*: We want to learn to predict the count of bike rentals (the `cnt` column).  We refer to the count as our target "label".

*Features*: What can we use as features to predict the `cnt` label?  All the columns except `cnt`, with a few exceptions:
* The `cnt` column we want to predict equals the sum of the `casual` + `registered` columns.  We will remove the `casual` and `registered` columns from the data to make sure we do not use them to predict `cnt`.  (*Warning: This is a danger in careless Machine Learning.  Make sure you do not "cheat" by using information you will not have when making predictions*)
* date column `dteday`: We could keep it, but it is well-represented by the other date-related columns `season`, `yr`, `mnth`, and `weekday`.  We will discard it.
* `holiday` and `weekday`: These features are highly correlated with the `workingday` column.  We will discard them and keep `workingday`.
* row index column `instant`: This is a useless column to us.

Let's drop the columns `instant`, `dteday`, `casual`, `holiday`, `weekday`, and `registered` from our DataFrame.

In [13]:
# TODO
df = df.<FILL_IN>
display(df)

Now that we have the columns we care about, let's print the schema of our dataset to see the type of each column using `printSchema()`.

In [15]:
df.printSchema()

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Train/Test Split

Our final data preparation step will be to split our dataset into separate training and test sets.

Use `randomSplit` to split the data such that 70% of the data is reserved for training, and the remaining 30% for testing. Use the provided `seed` for reproducibility (i.e. if you re-run this notebook or compare with your neighbor, you will get the same results).

Python: [randomSplit()](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.randomSplit)

Scala: [randomSplit()](https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.sql.Dataset)
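To see why a fixed seed makes a random split repeatable, here is a small plain-Python sketch. It is an illustration only and is not Spark's implementation: the function `seeded_split` is hypothetical, and the integers stand in for DataFrame rows. Each row is assigned by an independent random draw, so the 70/30 proportions are approximate rather than exact.

```python
import random


def seeded_split(rows, train_fraction, seed):
    """Assign each row to train or test using an independent seeded draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(1000))  # stand-in for DataFrame rows

# The same seed yields the same assignment on every run.
train_a, test_a = seeded_split(rows, 0.7, seed=42)
train_b, test_b = seeded_split(rows, 0.7, seed=42)

print(len(train_a), len(test_a), train_a == train_b)
```

Because the split sizes come from random draws, a check on an exact training count only holds when the seed, the data, and the way the data is laid out all stay the same.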

In [17]:
# TODO
seed = 42
trainDF, testDF = df.<FILL_IN>

print("We have %d training examples and %d test examples." % (trainDF.count(), testDF.count()))
assert (trainDF.count() == 12197)

#### Visualize our data

Now that we have preprocessed our features and prepared a training dataset, we can quickly visualize our data to get a sense of whether the features are meaningful.

Calling `display()` on a DataFrame in Databricks and clicking the plot icon below the table will let you draw and pivot various plots.  See the [Visualizations section of the Databricks Guide](https://docs.databricks.com/user-guide/visualizations/index.html) for more ideas.

We want to compare bike rental counts versus hour of the day.  As one might expect, rentals are low during the night, and they peak in the morning (8am) and in the early evening (6pm).  This indicates the `hr` feature is useful and can help us predict our label `cnt`.  

Select the `hr` and `cnt` columns from `trainDF`, and visualize them as a bar chart (you might need to adjust the plot options).

In [19]:
# TODO
display(<FILL_IN>)

## Train a Machine Learning Pipeline

Let's learn an ML model to predict the `cnt` of bike rentals given a single `features` column of feature vectors.

We will put together a simple Pipeline with the following stages:
* `VectorAssembler`: Assemble the feature columns into a feature vector.
* `VectorIndexer`: Identify columns which should be treated as categorical.  This is done heuristically, identifying any column with a small number of distinct values as being categorical.  For us, this will be `yr` (2 values), `season` (4 values), `workingday` (2 values), and `weathersit` (4 values).  (`holiday` would also qualify, but we dropped it earlier.)
* `DecisionTreeRegressor`: This will build a decision tree to learn how to predict rental counts from the feature vectors.
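As intuition for the last stage, a regression tree repeatedly picks the feature threshold that leaves the labels in each child as similar as possible, measured by variance. The sketch below shows that idea for a single feature in plain Python. It is a simplification under stated assumptions, not Spark's implementation: the function names and the hour/count values are made up, and a production implementation handles many features, categorical splits, and stopping rules.

```python
def variance(values):
    """Population variance; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(xs, ys):
    """Return the threshold on one feature that minimizes weighted child variance."""
    best_threshold, best_score = None, float("inf")
    # The largest value is skipped because it would leave the right child empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        score = (len(left) * variance(left) + len(right) * variance(right)) / len(ys)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Made-up hours and rental counts, for illustration only.
hours = [1, 2, 3, 8, 9, 17, 18]
counts = [5, 4, 6, 200, 210, 220, 230]

threshold, score = best_split(hours, counts)
print(threshold, score)
```

Each leaf of the finished tree predicts the mean label of the training rows that reach it, which is why minimizing within-child variance leads to smaller prediction errors.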

First, we define the feature processing stages of the Pipeline:
* Assemble feature columns into a feature vector.
* Identify categorical features, and index them.

![Image of feature processing](http://training.databricks.com/databricks_guide/2-features.png)

Steps:
- To create our feature vector, we start by collecting the names of all of the feature columns in a list called `featuresCols`.
- Use [VectorAssembler](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.VectorAssembler), setting `inputCols` to `featuresCols` and `outputCol` to `rawFeatures`. This concatenates all feature columns into a single feature vector in the new column `rawFeatures`.
- Use [VectorIndexer](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.VectorIndexer) to identify categorical features in our `rawFeatures` and index them. If any column has `maxCategories` or fewer distinct values, then it is treated as a categorical variable.
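The `maxCategories` rule in the last step can be pictured with a few lines of plain Python. This is only a sketch of the heuristic as described above, not the VectorIndexer code: `categorical_columns` is a hypothetical helper and the rows are made up.

```python
def categorical_columns(rows, column_names, max_categories):
    """Return the names of columns with at most max_categories distinct values."""
    categorical = []
    for index, name in enumerate(column_names):
        distinct = {row[index] for row in rows}
        if len(distinct) <= max_categories:
            categorical.append(name)
    return categorical


# Made-up rows, for illustration only: (season, hr, temp).
columns = ["season", "hr", "temp"]
rows = [
    (1, 0, 0.24),
    (1, 5, 0.22),
    (2, 9, 0.50),
    (3, 14, 0.80),
    (4, 18, 0.36),
    (4, 23, 0.30),
]

print(categorical_columns(rows, columns, max_categories=4))
```

With `max_categories=4`, `season` is treated as categorical while `hr` and `temp` stay continuous. Raising the limit would pull in more columns, so the choice of `maxCategories` directly controls which features the tree splits on as categories.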

In [23]:
from pyspark.ml.feature import VectorAssembler, VectorIndexer

featuresCols = [c for c in df.columns if c != "cnt"] # Every column except the label "cnt"

vectorAssembler = VectorAssembler(inputCols=featuresCols, outputCol="rawFeatures")

vectorIndexer = VectorIndexer(inputCol="rawFeatures", outputCol="features", maxCategories=4)

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Decision Tree Regressor
Second, we define the model training stage of the Pipeline. [Decision Tree Regressor](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.regression.DecisionTreeRegressor) takes feature vectors and labels as input and learns to predict labels of new examples.

Let's take a look at some of the default parameters.

In [25]:
from pyspark.ml.regression import DecisionTreeRegressor

dt = DecisionTreeRegressor()
print(dt.explainParams())

By default, DecisionTreeRegressor looks for its labels in a column named `label` (the default value of `labelCol`), but our DataFrame has no column with that name. Let's tell the DecisionTreeRegressor that the label column is called `cnt`.

Use `dt.setLabelCol("")`

In [27]:
# TODO

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Pipeline
Now let's wrap all of these stages into a Pipeline.

In [29]:
# TODO
from pyspark.ml import Pipeline

pipeline = Pipeline().setStages([<FILL_IN>])

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Train

Fit the pipeline on the training data. This runs every stage in order and returns a fitted pipeline model.
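Conceptually, fitting a pipeline means fitting each stage in turn and passing its transformed output to the next stage. The plain-Python sketch below mimics that flow with two toy stages. Everything in it is hypothetical and greatly simplified relative to Spark ML, where estimators return separate model objects and operate on DataFrames.

```python
class MeanCenterer:
    """Toy stage: learns the mean during fit and subtracts it in transform."""

    def fit(self, data):
        self.mean = sum(data) / len(data)
        return self

    def transform(self, data):
        return [x - self.mean for x in data]


class Doubler:
    """Toy stage with nothing to learn."""

    def fit(self, data):
        return self

    def transform(self, data):
        return [2 * x for x in data]


def fit_pipeline(stages, data):
    """Fit each stage on the output of the previous one; return the fitted stages."""
    fitted = []
    for stage in stages:
        model = stage.fit(data)
        data = model.transform(data)
        fitted.append(model)
    return fitted


def run_pipeline(fitted, data):
    """Apply already-fitted stages in order to new data."""
    for model in fitted:
        data = model.transform(data)
    return data


fitted_stages = fit_pipeline([MeanCenterer(), Doubler()], [1.0, 2.0, 3.0])
print(run_pipeline(fitted_stages, [4.0]))
```

The key point is that whatever was learned from the training data (here, the mean) is reused unchanged when the fitted stages are applied to new data.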

In [31]:
pipelineModel = pipeline.fit(trainDF)

Let's print a text representation of the learned decision tree.

In [33]:
print(pipelineModel.stages[2].toDebugString)

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Save Model

Let's go ahead and save this model.

In [35]:
fileName = userhome + "/tmp/MyPipeline"
pipelineModel.write().overwrite().save(fileName)

Let's read this model back in.

In [37]:
from pyspark.ml import PipelineModel

savedModel = PipelineModel.load(fileName)

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Predictions

Next, apply the saved model to the test set.

In [39]:
# TODO
predictionsDF = savedModel.<FILL_IN>

display(predictionsDF.select("cnt", "prediction"))

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Evaluate

Next, we'll use [RegressionEvaluator](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.RegressionEvaluator) to assess the results. The default regression metric is RMSE.
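RMSE (root mean squared error) is the square root of the average squared difference between the labels and the predictions, so it is expressed in the same units as `cnt`. The plain-Python sketch below computes it by hand. The function name and the sample numbers are made up for illustration, and the code is independent of the evaluator class.

```python
import math


def root_mean_squared_error(labels, predictions):
    """Square root of the mean squared difference between two equal-length sequences."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up rental counts and predictions, for illustration only.
actual = [16, 40, 32, 13]
predicted = [20, 37, 32, 25]

print(root_mean_squared_error(actual, predicted))
```

Because the errors are squared before averaging, a single large miss (13 versus 25 here) contributes far more to the result than several small ones.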

In [41]:
# TODO
from <FILL_IN>

evaluator = <FILL_IN>

rmse = <FILL_IN>
print("Test RMSE = %f" % rmse)

##![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Next Steps

Wow! Our RMSE is really high. In the next lab, we will cover ways to decrease the RMSE of our model, including cross-validation, hyperparameter tuning, and ensembles of trees.

&copy; 2018 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>