# Bike Sharing with ML Pipelines

Spark ML offers a Pipeline API for building more complex transformation and machine learning pipelines. We will use these building blocks in this exercise.

# Loading Data

First we need to load the data from S3 or HDFS. We read the data as CSV using Spark's built-in CSV parser, but we still need to create a schema to specify column names and data types.

In [None]:
from pyspark.sql.types import *


schema = StructType(
    [
        StructField('row_id', IntegerType(), True),
        StructField('date', StringType(), True),
        StructField('season', IntegerType(), True),
        StructField('year', IntegerType(), True),
        StructField('month', IntegerType(), True),
        StructField('hour', IntegerType(), True),
        StructField('holiday', IntegerType(), True),
        StructField('weekday', IntegerType(), True),
        StructField('workingday', IntegerType(), True),
        StructField('weather', IntegerType(), True),
        StructField('temperature', DoubleType(), True),
        StructField('apparent_temperature', DoubleType(), True),
        StructField('humidity', DoubleType(), True),
        StructField('wind_speed', DoubleType(), True),
        StructField('casual', IntegerType(), True),
        StructField('registered', IntegerType(), True),
        StructField('counter', IntegerType(), True),
    ]
)

data = spark.read.schema(schema).csv(
    's3://dimajix-training/data/bike-sharing/hour_nohead.csv'
)

## Inspect Data

Let us have a look at the first 10 entries again.

In [None]:
data.limit(10).toPandas()

# Prepare for ML

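Again we need to convert all numerical (and also categorical) columns to doubles, since most ML algorithms require this. The plots and additional features further down also refer to a column 'ts', which is not part of the schema above, so it needs to be derived here as well (presumably from 'date', as a Unix timestamp in seconds).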

In [None]:
from pyspark.sql.functions import *


# YOUR CODE HERE
ddata = ...

# Split Data

Before diving into the real world of Spark ML Pipelines, we split the data into a training set and a test set. Let us use 80% for training and 20% for testing.
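
A random split of this kind is usually proportional only on average: every row is assigned independently, so the two parts contain roughly, not exactly, 80% and 20% of the rows. The plain-Python sketch below illustrates that idea without Spark. `random_split` is a hypothetical helper written only for this illustration; in Spark, `DataFrame.randomSplit` serves this purpose, and its internals may differ from this sketch.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row independently to one part, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))

    # Cumulative upper bounds of each part on the interval [0, 1]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    bounds[-1] = 1.0  # guard against floating point rounding

    parts = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform in [0, 1)
        for i, bound in enumerate(bounds):
            if u < bound:
                parts[i].append(row)
                break
    return parts


# Made-up input: 1000 row ids instead of real bike sharing records
rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=42)

print("train_rows: %d" % len(train_rows))
print("test_rows: %d" % len(test_rows))
```

Passing a fixed seed makes the split reproducible, which helps when comparing different models on the same test set.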

In [None]:
# Split ddata into train_data and test_data
# YOUR CODE HERE
...

print("train_data: %d" % train_data.count())
print("test_data: %d" % test_data.count())

# Create an ML Pipeline

Now we will create our first, very simple pipeline using all numerical variables as features. We already know the two relevant classes that perform the actual work:

    VectorAssembler - extracts all features and stores them inside a Vector
    LinearRegression - performs the regression
    
We create an ML Pipeline with these two components. As features we'll again use the columns

    year, season, month, hour, holiday, weekday, workingday,
    weather, temperature, apparent_temperature, humidity, wind_speed
    
and of course we want to predict "counter". The prediction shall be stored in "prediction".   
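
Conceptually, fitting a pipeline runs the stages in order: a plain transformer is applied directly, while an estimator is first fitted on the data produced so far and its fitted model then transforms the data for the next stage. The toy classes below (`ToyAssembler`, `ToyMeanRegressor`, `ToyMeanModel`, `ToyPipeline`, `ToyPipelineModel`) are hypothetical and unrelated to the real `pyspark.ml` classes. They only sketch this control flow on lists of dicts and are not the Spark implementation.

```python
class ToyAssembler:
    """Transformer: collects the given columns into one 'features' list."""

    def __init__(self, input_cols):
        self.input_cols = input_cols

    def transform(self, rows):
        return [dict(r, features=[r[c] for c in self.input_cols]) for r in rows]


class ToyMeanModel:
    """Fitted model: predicts the same constant for every row."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [dict(r, prediction=self.mean) for r in rows]


class ToyMeanRegressor:
    """Estimator: 'learns' the mean of the label column (stand-in for a real regressor)."""

    def __init__(self, label_col):
        self.label_col = label_col

    def fit(self, rows):
        values = [r[self.label_col] for r in rows]
        return ToyMeanModel(sum(values) / len(values))


class ToyPipelineModel:
    """Result of fitting: a chain of transformers only."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators have fit(); plain transformers are used as they are
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return ToyPipelineModel(fitted)


# Made-up records, not taken from the bike sharing data
toy_train = [
    {"hour": 8, "temperature": 0.3, "counter": 100},
    {"hour": 17, "temperature": 0.6, "counter": 300},
]
toy_pipeline = ToyPipeline([ToyAssembler(["hour", "temperature"]), ToyMeanRegressor("counter")])
toy_model = toy_pipeline.fit(toy_train)
toy_result = toy_model.transform([{"hour": 12, "temperature": 0.5, "counter": 180}])
print(toy_result)
```

The point of the design is that the fitted object bundles all preprocessing steps with the model, so the test data passes through exactly the same steps as the training data.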

In [None]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import *
from pyspark.ml.regression import *


# Create a Pipeline with multiple stages. You will probably need a VectorAssembler and a LinearRegression stage.
pipeline = Pipeline(
    stages=[
        # YOUR CODE HERE
        ...
    ]
)

# Fit model using the Pipeline
model = pipeline.fit(train_data)

## Predict Data

Now that we have a model, we want to perform predictions for the test data. Store the result in a variable 'prediction', since the following cells rely on it, and print a table with the first 10 entries of the predicted DataFrame.

In [None]:
# YOUR CODE HERE

# Make some Pictures again

Again we need to import matplotlib.pyplot and add some magic to display the plots inside the notebook

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

Let us plot the total number of rentals per day again, and let's compare that visually against the predictions.

In [None]:
daily = (
    prediction.groupBy('ts').agg({'counter': 'sum', 'prediction': 'sum'}).orderBy('ts')
)

pdf = daily.toPandas()

min_ts, max_ts = prediction.agg(min('ts'), max('ts')).collect()[0]

plt.figure(figsize=(16, 6), dpi=80, facecolor='w', edgecolor='k', tight_layout=True)
plt.plot(pdf['ts'], pdf['sum(counter)'])
plt.plot(pdf['ts'], pdf['sum(prediction)'])
axes = plt.gca()
axes.set_xlim([min_ts, max_ts])

# Evaluate Model

Again we want to evaluate the resulting model using RegressionEvaluator from package pyspark.ml.evaluation.

In [None]:
# YOUR CODE HERE
rmse = ...

print("RMSE of Simple Model = %f" % rmse)

# Logarithmic Metric

In this example, we might be less interested in the absolute prediction error than in the relative prediction error, which can be expressed very well on a logarithmic scale. The built-in evaluators cannot be used for that, so we need to create one on our own.

But first let us try to calculate the RMSE metric manually. RMSE is defined as

    sqrt(avg((predicted_value - true_value)**2))

The Root Mean Squared Logarithmic Error (RMSLE) is defined as

    sqrt(avg((log(predicted_value) - log(true_value))**2))
    
Both metrics can be easily implemented using standard functionality of Spark DataFrames.
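
To make the two definitions concrete, here is a small plain-Python sketch on made-up numbers (not taken from the bike sharing data). The function names `compute_rmse` and `compute_rmsle` are hypothetical; in the exercise the same arithmetic would be expressed with DataFrame aggregations instead. Note that the logarithm is only defined for positive values.

```python
import math


def compute_rmse(predicted, actual):
    """Root mean squared error of two equally long sequences."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def compute_rmsle(predicted, actual):
    """Root mean squared logarithmic error; all values must be positive."""
    squared = [(math.log(p) - math.log(a)) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up values: both predictions are off by a factor of 2
actual_values = [10.0, 1000.0]
predicted_values = [20.0, 2000.0]

toy_rmse = compute_rmse(predicted_values, actual_values)
toy_rmsle = compute_rmsle(predicted_values, actual_values)

print("RMSE = %f" % toy_rmse)
print("RMSLE = %f" % toy_rmsle)
```

Both predictions are wrong by the same factor, so they contribute equally to the RMSLE, whereas the RMSE is dominated by the record with the larger value.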

In [None]:
# YOUR CODE HERE
rmse = ...
rmsle = ...

print("RMSE = %f" % rmse)
print("RMSLE = %f" % rmsle)

## Logarithmic Model

Since our error metric is now in logarithmic space, it makes sense to optimize in that space, too. Therefore we switch to a logarithmic model.

We implement the logarithmic model by applying the following transformation to the "counter" column before fitting the linear model:

    lcounter = log(counter + 1)
    
Then we fit a linear regression model to the target variable lcounter (instead of counter). The predicted value should be stored in a column 'lprediction'.

But since we are eventually interested in the linear value (and not in the logarithmic value), we transform the predicted value back from the logarithmic scale into the linear scale by

    prediction = exp(lprediction) - 1
    
In order to perform the transformations, we can add multiple SQLTransformers at appropriate locations in the Pipeline. An SQLTransformer takes one keyword argument, for example

    SQLTransformer(statement="SELECT x+y AS z,* FROM __THIS__")
    
which will create a DataFrame with a new column 'z' that is the sum of the columns 'x' and 'y'. Separate SQLTransformers can be used for transforming the counter and the lprediction variables.
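
The forward and backward transformations described above are inverses of each other, and the `+ 1` keeps the logarithm defined for a count of zero. A quick plain-Python check of that round trip, using made-up counts, might look as follows. In the pipeline, each of the two arithmetic expressions would become the `SELECT` expression of one SQLTransformer.

```python
import math

# Made-up counts, including the edge case 0
counts = [0, 1, 16, 977]

# Forward transformation: lcounter = log(counter + 1)
lcounters = [math.log(c + 1) for c in counts]

# Backward transformation: prediction = exp(lprediction) - 1
restored = [math.exp(l) - 1 for l in lcounters]

for c, l, r in zip(counts, lcounters, restored):
    print("%4d -> %.4f -> %.4f" % (c, l, r))
```

Up to floating point rounding, the restored values equal the original counts.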

In [None]:
# The pipeline should have (in some correct order)
#  1x LinearRegression
#  2x SQLTransformer
#  1x VectorAssembler
pipeline = Pipeline(stages=[...])

# Fit model using the Pipeline
logmodel = pipeline.fit(train_data)

## Evaluate Model

Again we want to calculate the RMSE and RMSLE for the test data using the new logarithmic model.

In [None]:
# YOUR CODE HERE
logprediction = ...

rmse = ...
rmsle = ...

print("RMSE = %f" % rmse)
print("RMSLE = %f" % rmsle)

# Make some Pictures again

In [None]:
daily = (
    logprediction.groupBy('ts').agg({'counter': 'sum', 'prediction': 'sum'}).orderBy('ts')
)

pdf = daily.toPandas()

min_ts, max_ts = logprediction.agg(min('ts'), max('ts')).collect()[0]

plt.figure(figsize=(16, 6), dpi=80, facecolor='w', edgecolor='k', tight_layout=True)
plt.plot(pdf['ts'], pdf['sum(counter)'])
plt.plot(pdf['ts'], pdf['sum(prediction)'])
axes = plt.gca()
axes.set_xlim([min_ts, max_ts])

# Adding More Features

We might want to add more features in order to improve prediction quality. We propose the following additional features:

1. Features for modelling periodic effects over a year. This can be done by adding the two features:
        sin(ts / 31536000 * 6.28318531) 
        cos(ts / 31536000 * 6.28318531)
2. Similarly, for modelling periodic effects within a week, the following features can be used:
        sin(weekday / 7 * 6.28318531)
        cos(weekday / 7 * 6.28318531)
3. And for modelling periodic effects within a single day the following features can be used:
        sin(hour / 24 * 6.28318531)
        cos(hour / 24 * 6.28318531)
4. season, one-hot encoded
5. weather, one-hot encoded

You can use SQLTransformer for arithmetic transformations and a combination of

    StringIndexer(inputCol='categoricalFeature', outputCol='categoricalIndex')
    OneHotEncoder(inputCol='categoricalIndex', outputCol='categoricalOneHot')
    
for creating one-hot encoded categorical features.
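
The reason for using a sine/cosine pair for the periodic features is that a raw value such as the hour jumps from 23 back to 0, although both hours are adjacent in time. Mapping the value onto a circle removes that jump. The following plain-Python sketch illustrates this for the hour of the day; `cyclic_features` and `distance` are hypothetical helpers for this illustration only.

```python
import math

TWO_PI = 2 * math.pi  # approximately 6.28318531


def cyclic_features(value, period):
    """Map a periodic value onto the unit circle as a (sin, cos) pair."""
    angle = value / period * TWO_PI
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two (sin, cos) pairs."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


midnight = cyclic_features(0, 24)
late_evening = cyclic_features(23, 24)
noon = cyclic_features(12, 24)

print("23h vs 0h: %.3f" % distance(late_evening, midnight))
print("12h vs 0h: %.3f" % distance(noon, midnight))
```

In the encoded space, 23h lies close to 0h, while 12h is as far away from 0h as possible. Both the sine and the cosine are needed, because a single one of them maps two different hours of the day to the same value.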

In [None]:
# The Pipeline should have
#  1x LinearRegression
#  2x OneHotEncoder
#  2x StringIndexer
#  3x SQLTransformer (or maybe more)
#  1x VectorAssembler
pipeline2 = Pipeline(stages=[...])

logmodel2 = pipeline2.fit(train_data)

## Evaluate new Model

Again we want to evaluate our new model using RMSE and RMSLE metrics.

In [None]:
logprediction2 = logmodel2.transform(test_data)

rmse = ...
rmsle = ...

print("RMSE = %f" % rmse)
print("RMSLE = %f" % rmsle)