
# Model bike sharing data with SPSS
This notebook shows you how to create a predictive model of bike sharing trends by using IBM SPSS Algorithms on Apache Spark. You'll learn how to create a generalized linear model with the SPSS ML API, and how to view the model with the SPSS Model Viewer.

The generalized linear model (GLM) is an analytical algorithm for different types of data. It includes statistical models such as linear regression for normally distributed targets, logistic models for binary or multinomial targets, and log linear models for count data. In addition to building a model, the GLM provides features such as variable selection, automatic selection of the distribution and link function, and model evaluation statistics. The GLM has options for regularization, such as LASSO, ridge regression, and elastic net, and can handle a wide variety of data.
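
To make the role of the link function concrete, here is a minimal sketch in plain Scala. It is not part of the SPSS ML API; the object and method names (`LinkFunctions`, `identityInverse`, and so on) are hypothetical and chosen only for illustration. It shows how one linear predictor maps to a predicted mean under the identity, log, and logit links that correspond to the linear, log linear, and logistic models mentioned above.

```scala
// Illustrative sketch only: plain Scala, not the SPSS ML API.
// All names in this object are hypothetical.
object LinkFunctions {
  // Each inverse link maps the linear predictor eta to the mean of the target.
  def identityInverse(eta: Double): Double = eta                          // linear regression
  def logInverse(eta: Double): Double = math.exp(eta)                     // log linear (count) models
  def logitInverse(eta: Double): Double = 1.0 / (1.0 + math.exp(-eta))    // logistic models

  // Linear predictor: intercept plus the dot product of coefficients and inputs.
  def linearPredictor(intercept: Double,
                      coefficients: Seq[Double],
                      inputs: Seq[Double]): Double = {
    require(coefficients.length == inputs.length, "one coefficient per input")
    intercept + coefficients.zip(inputs).map { case (b, x) => b * x }.sum
  }
}
```

For example, `LinkFunctions.logitInverse(LinkFunctions.linearPredictor(0.0, Seq(1.0), Seq(0.0)))` evaluates the logistic curve at zero, which is a probability of one half.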

The bike sharing model will:
 - Identify what affects the number of bike rentals.
 - Predict future daily bike rental counts based on date, weather, and season.

This notebook runs on Scala and Spark. Some familiarity with Scala is recommended.

## Table of contents 
This notebook contains these main sections:

1. [Overview of the bike sharing data](#overview)
1. [Prepare the data](#prepare)
1. [Configure the generalized linear model](#configure) 
1. [View the model](#view)
1. [Summary and next steps](#next)

<a id="overview"></a>
# 1. Overview of the bike sharing data

You'll be looking at the daily count of bike rentals in the Capital Bikeshare system for the years 2011 and 2012, with corresponding weather and seasonal information. The [Capital Bikeshare](https://www.capitalbikeshare.com/home) system provides bicycles at over 400 stations in Washington, D.C. and neighboring cities in Virginia and Maryland.

The data set that you'll use has the following fields:

- instant: the record ID
- dteday: the date
- season: the season (1 = spring, 2 = summer, 3 = fall, 4 = winter)
- yr: the year (0 = 2011, 1 = 2012)
- mnth: the month (1 - 12)
- hr: the hour (0 - 23); this field exists only in the hourly file `hour.csv` and is absent from `day.csv`
- holiday: 0 = not a holiday, 1 = a holiday
- weekday: the day of the week (0 = Sunday through 6 = Saturday)
- workingday: 0 = a weekend or holiday, 1 = a work day
- weathersit: the weather conditions 
   - 1 = Clear or partly cloudy
   - 2 = Mist or clouds
   - 3 = Light precipitation
   - 4 = Heavy precipitation
- temp: the normalized temperature for the day in degrees Celsius (minimum = -8, maximum = +39) 
- atemp: the normalized feels-like temperature in degrees Celsius (minimum = -16, maximum = +50)
- hum: the normalized humidity (maximum = 100%)
- windspeed: the normalized wind speed in knots (maximum = 67)
- casual: the count of bikes rented to casual users
- registered: the count of bikes rented to registered users
- cnt: the total count of rented bikes (casual + registered)
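
The normalized fields can be mapped back to physical units. The sketch below assumes the values were produced by min-max scaling with the bounds listed above, which this notebook does not confirm, so treat the formula as an assumption. The object and method names are hypothetical. It also includes a simple check that `cnt` equals `casual + registered`.

```scala
// Illustrative sketch only; names are hypothetical.
object BikeFields {
  // Assumption: normalized = (raw - min) / (max - min), so raw = min + normalized * (max - min).
  def denormalize(normalized: Double, min: Double, max: Double): Double =
    min + normalized * (max - min)

  // Bounds taken from the field descriptions above.
  def tempCelsius(temp: Double): Double = denormalize(temp, -8.0, 39.0)
  def atempCelsius(atemp: Double): Double = denormalize(atemp, -16.0, 50.0)

  // The total count should equal the sum of casual and registered rentals.
  def countIsConsistent(casual: Int, registered: Int, cnt: Int): Boolean =
    casual + registered == cnt
}
```

Under that assumption, a normalized `temp` of 0 corresponds to -8 degrees Celsius and a value of 1 corresponds to +39 degrees Celsius.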


<a id="prepare"></a>
# 2. Prepare the data
To prepare the bike sharing data:  

1. [Get the data into your notebook](#load)
1. [Create a Spark DataFrame](#df)
1. [Enrich the DataFrame](#enrich)

<a id="load"></a>
## 2.1. Get the data into your notebook
To get the data and load it into your notebook:

1. Download the `Bike-Sharing-Dataset.zip` file from this website: [https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset](https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset).
1. Extract the file.
1. Load the `day.csv` file into the notebook by clicking the __Data__ icon on the notebook action bar. Drop the file into the box or browse to select the file.

The file is loaded to your object storage. The data set appears in the __Files__ list in the notebook and also in the __Data assets__ section of the project.

<a id="df"></a>
## 2.2. Create a Spark DataFrame
In the data panel on the right, under the `day.csv` data set, click **Insert to code > Insert SparkSession DataFrame** to add the generated code to the following cell. Rename the DataFrame variable to `df` before you run the cell.


+-------+--------------------+------+---+----+-------+-------+----------+----------+--------+--------+--------+---------+------+----------+----+
|instant|              dteday|season| yr|mnth|holiday|weekday|workingday|weathersit|    temp|   atemp|     hum|windspeed|casual|registered| cnt|
+-------+--------------------+------+---+----+-------+-------+----------+----------+--------+--------+--------+---------+------+----------+----+
|      1|2011-01-01 00:00:...|     1|  0|   1|      0|      6|         0|         2|0.344167|0.363625|0.805833| 0.160446|   331|       654| 985|
|      2|2011-01-02 00:00:...|     1|  0|   1|      0|      0|         0|         2|0.363478|0.353739|0.696087| 0.248539|   131|       670| 801|
|      3|2011-01-03 00:00:...|     1|  0|   1|      0|      1|         1|         1|0.196364|0.189405|0.437273| 0.248309|   120|      1229|1349|
|      4|2011-01-04 00:00:...|     1|  0|   1|      0|      2|         1|         1|     0.2|0.212122|0.590435| 0.160296|   108|  

<a id="enrich"></a>
## 2.3. Enrich the DataFrame

The generalized linear model algorithm requires generated properties for the fields in the DataFrame so that they have proper data types, measurement levels, and roles.

Run the SPSS DataFrame assistant `enrich` function to generate those properties:

In [2]:
import com.ibm.spss.ml.utils.DataFrameImplicits._
val df2 = df.enrich



Show the first three rows of the DataFrame:

In [3]:
df2.show(3)

+-------+--------------------+------+---+----+-------+-------+----------+----------+--------+--------+--------+---------+------+----------+----+
|instant|              dteday|season| yr|mnth|holiday|weekday|workingday|weathersit|    temp|   atemp|     hum|windspeed|casual|registered| cnt|
+-------+--------------------+------+---+----+-------+-------+----------+----------+--------+--------+--------+---------+------+----------+----+
|      1|2011-01-01 00:00:...|     1|  0|   1|      0|      6|         0|         2|0.344167|0.363625|0.805833| 0.160446|   331|       654| 985|
|      2|2011-01-02 00:00:...|     1|  0|   1|      0|      0|         0|         2|0.363478|0.353739|0.696087| 0.248539|   131|       670| 801|
|      3|2011-01-03 00:00:...|     1|  0|   1|      0|      1|         1|         1|0.196364|0.189405|0.437273| 0.248309|   120|      1229|1349|
+-------+--------------------+------+---+----+-------+-------+----------+----------+--------+--------+--------+---------+------+--

<a id="configure"></a>
# 3. Configure the generalized linear model 

Configure the generalized linear model with the `GeneralizedLinear()` method to analyze what conditions affect the number of rented bikes. 

First, import the SPSS generalized linear model algorithm package:

In [4]:
import com.ibm.spss.ml.classificationandregression.GeneralizedLinear
import com.ibm.spss.ml.classificationandregression.params._

Now, run the `GeneralizedLinear()` method. You set the `TargetField` parameter to `cnt` and the `Effects` list to the fields that describe the type of day, the season, and the weather conditions. By specifying `UNKNOWN` for the distribution and link function, the model automatically chooses the most appropriate settings for the data.

In [5]:
val gle = GeneralizedLinear().
  setTargetField("cnt").
  setInputFieldList(Array("season","yr","mnth","holiday","weekday","workingday","weathersit","temp","atemp","hum","windspeed","casual","registered")).
  setEffects(List(
    Effect(List("season"), List(0)), 
    Effect(List("mnth"), List(0)),
    Effect(List("holiday"), List(0)),
    Effect(List("weekday"), List(0)),
    Effect(List("workingday"), List(0)),
    Effect(List("weathersit"), List(0)),
    Effect(List("temp"), List(0)),
    Effect(List("atemp"), List(0)),
    Effect(List("hum"), List(0)),
    Effect(List("windspeed"), List(0)))).
  setDistribution("UNKNOWN").
  setLinkFunction("UNKNOWN").      
  setUseVariableSelection(true).
  setVariableSelectionMethod("FORWARD_STEPWISE").
  setDetectTwoWayInteraction(true).
  setTargetSortOrder("DESCENDING")

val gle_model = gle.fit(df2)


<a id="view"></a>
# 4. View the model 

View the model with the SPSS Model Viewer. The visualization for the generalized linear model includes tests of model effects, statistics for each parameter, and a table and chart of standardized deviation residuals.

## 4.1 Generate a project token

Before you can run the model viewer, you need to generate a project token:

1. In the **My Projects** banner, click the **More** icon and then click **Insert project token**. The project token is inserted into the first cell of the notebook, before the title.
2. Copy the text, which appears at the beginning of the notebook, into the following cell and run it.

## 4.2 Start the model viewer

The model viewer needs the project token from section 4.1. Paste the token code into the next cell and run it. Then run the cell after it to start the SPSS Model Viewer, where you can see a visualization of the model along with its statistics and other characteristics.

In [6]:
// @hidden_cell
// The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
import com.ibm.analytics.projectNotebookIntegration._
val pc = ProjectUtil.newProjectContext(sc, "***********", "**************")



In [7]:
import com.ibm.spss.scala.ModelViewer

kernel.magics.html(ModelViewer.createHTML(pc, gle_model.toPMML(), Option(gle_model.statXML()), Option("http://model-viewer.mybluemix.net")))


| Property | Value |
|---|---|
| Target Field | cnt |
| Scale Parameter Handling | |
| Probability Distribution [1] | Poisson |
| Link Function [1] | Identity |
| Model Type | Poisson Regression |
| Model Building Method | |
| Number of Predictors Input | 13 |
| Number of Predictors in Final Model | 8 |
| Log Likelihood [2] | -150463.306 |
| Deviance | 254763.550 |

[1] The probability distribution and link function were automatically detected based on the model with minimum ASE in the testing data (ASE = 1791390.785).

[2] The full log-likelihood function is displayed and used in computing information criteria.
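
The footnote on the probability distribution says that the distribution and link function were chosen by minimum ASE on the testing data. Reading ASE as average squared error (an assumption, since the notebook does not expand the acronym), the selection idea can be sketched in plain Scala as follows. This is not the SPSS implementation; the object name, method names, and candidate labels are hypothetical.

```scala
// Illustrative sketch only; not the SPSS selection procedure.
object AseSelection {
  // Average squared error between observed and predicted target values.
  def ase(observed: Seq[Double], predicted: Seq[Double]): Double = {
    require(observed.nonEmpty && observed.length == predicted.length,
      "observed and predicted must be non-empty and the same length")
    observed.zip(predicted).map { case (y, p) => (y - p) * (y - p) }.sum / observed.length
  }

  // Return the label of the candidate whose test-set predictions give the smallest ASE.
  def best(observed: Seq[Double], candidates: Map[String, Seq[Double]]): String =
    candidates.minBy { case (_, predicted) => ase(observed, predicted) }._1
}
```

Each entry in `candidates` would hold the test-set predictions of one distribution and link function combination; the combination with the lowest error is kept.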

| Records | Number | Percent |
|---|---|---|
| Included | 731 | 100.0 |
| Excluded | 0 | 0.0 |
| Total | 731 | 100.0 |

| Source | Wald Chi-Square (Type III) | df (Type III) | Sig. (Type III) |
|---|---|---|---|
| (Intercept) | 1281.07 | 1 | 0.0 |
| season | 18578.96 | 3 | 0.0 |
| mnth | 11688.056 | 11 | 0.0 |
| holiday | 71.121 | 1 | 0.0 |
| weekday | 2012.194 | 6 | 0.0 |
| weathersit | 23589.355 | 2 | 0.0 |
| temp | 20140.497 | 1 | 0.0 |
| hum | 489.377 | 1 | 0.0 |
| windspeed | 11157.071 | 1 | 0.0 |
| temp * hum | 1375.809 | 1 | 0.0 |

| Parameter | B | Std. Error | 95% Wald Confidence Interval, Lower | 95% Wald Confidence Interval, Upper | Wald Chi-Square | df | Sig. |
|---|---|---|---|---|---|---|---|
| (Intercept) | 895.016 | 73.017 | 751.905 | 1038.128 | 150.249 | 1.0 | <0.0001 |
| season = 1 | -1639.728 | 13.434 | -1666.057 | -1613.398 | 14898.939 | 1.0 | <0.0001 |
| season = 2 | -615.832 | 17.737 | -650.596 | -581.068 | 1205.485 | 1.0 | <0.0001 |
| season = 3 | -893.134 | 18.599 | -929.587 | -856.68 | 2305.973 | 1.0 | <0.0001 |
| season = 4 | 0 [1] | | | | | | |
| mnth = 1 | 332.988 | 11.585 | 310.282 | 355.694 | 826.18 | 1.0 | <0.0001 |
| mnth = 2 | 302.177 | 11.809 | 279.031 | 325.322 | 654.76 | 1.0 | <0.0001 |
| mnth = 3 | 392.554 | 13.033 | 367.009 | 418.099 | 907.149 | 1.0 | <0.0001 |
| mnth = 4 | -34.312 | 19.177 | -71.898 | 3.274 | 3.201 | 1.0 | 0.074 |
| mnth = 5 | 94.468 | 21.099 | 53.114 | 135.821 | 20.047 | 1.0 | <0.0001 |

[1] Set to zero because this parameter is redundant.
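
The Wald columns in the parameter estimates above follow from `B` and `Std. Error` by the standard formulas: the chi-square statistic is the squared ratio of the estimate to its standard error, and the 95% interval is the estimate plus or minus about 1.96 standard errors. The sketch below is plain Scala with hypothetical names, not an SPSS API, and it reproduces the intercept row to within rounding.

```scala
// Illustrative sketch only; names are hypothetical.
object WaldStats {
  // 97.5th percentile of the standard normal distribution.
  val Z975 = 1.959964

  // Wald chi-square with 1 degree of freedom: (B / SE)^2.
  def waldChiSquare(b: Double, stdError: Double): Double = {
    val z = b / stdError
    z * z
  }

  // 95% Wald confidence interval: B -/+ z * SE, returned as (lower, upper).
  def confidenceInterval(b: Double, stdError: Double): (Double, Double) =
    (b - Z975 * stdError, b + Z975 * stdError)
}
```

For the intercept, `WaldStats.waldChiSquare(895.016, 73.017)` and `WaldStats.confidenceInterval(895.016, 73.017)` agree with the listed 150.249, 751.905, and 1038.128 up to small rounding differences, because the table shows `B` and `Std. Error` rounded to three decimals.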


<a id="next"></a>
# 5. Summary and next steps
You have created a generalized linear model of the bike sharing data. Now you can:
 - Create a different model to compare model evaluations, like the test of model effects, residuals, and so on. See [SPSS documentation](https://apsportal.ibm.com/docs/content/kc_gen/integrations-gen2.html).
 - Predict future bike rental counts for incoming data.

## Authors

Kang Jiangbo and Yu Wenpei are SPSS Algorithm Engineers at IBM.

### Data citations
Lichman, M. (2013). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. 

Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg [Web link](https://link.springer.com/article/10.1007%2Fs13748-013-0040-3).

Copyright © 2017, 2018 IBM. This notebook and its source code are released under the terms of the MIT License.