*Analytical Information Systems*

# Tutorial 6 - Predictive Modeling II

Matthias Griebel<br>
Lehrstuhl für Wirtschaftsinformatik und Informationsmanagement

SS 2019

<h1>Agenda<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Modeling" data-toc-modified-id="Modeling-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Modeling</a></span><ul class="toc-item"><li><span><a href="#Models-for-supervised-learning" data-toc-modified-id="Models-for-supervised-learning-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Models for supervised learning</a></span></li><li><span><a href="#Metrics-for-regression" data-toc-modified-id="Metrics-for-regression-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Metrics for regression</a></span></li></ul></li><li><span><a href="#Up-to-you:--Price-forecasting-for-used-cars" data-toc-modified-id="Up-to-you:--Price-forecasting-for-used-cars-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Up to you:  Price forecasting for used cars</a></span></li><li><span><a href="#Exam-Questions" data-toc-modified-id="Exam-Questions-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Exam Questions</a></span><ul class="toc-item"><li><span><a href="#Exam-AIS-SS-2018" data-toc-modified-id="Exam-AIS-SS-2018-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Exam AIS SS 2018</a></span></li></ul></li></ul></div>

## Modeling

__Recap: CRISP-DM__
<img align="right" src="http://statistik-dresden.de/wp-content/uploads/2012/04/CRISP-DM_Process_Diagram1.png" style="width:50%">

Today, we will focus on 
- `Data Preparation`
- `Modeling` for regression tasks

### Models for supervised learning

`parsnip` contains wrappers for a number of [models](https://tidymodels.github.io/parsnip/articles/articles/Models.html). 

- Classification
    - Regression: `logistic_reg()`, `multinom_reg()`
    - Tree based: `decision_tree()`, `rand_forest()`, `boost_tree()`
    - ANN: `mlp()`
    - KNN: `nearest_neighbor()`
    - SVM: `svm_poly()`, `svm_rbf()`

- Regression
    - Regression: `linear_reg()`
    - Tree based: `decision_tree()`, `rand_forest()`, `boost_tree()`
    - ANN: `mlp()`
    - KNN: `nearest_neighbor()`
    - SVM: `svm_poly()`, `svm_rbf()`
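
As a minimal sketch of how these wrappers are used, the following specifies a model, selects an engine, and fits it. The built-in `mtcars` data and the chosen predictors are assumptions for illustration only; they are not part of the tutorial data.

```r
library(parsnip)

# Model specification: what kind of model, independent of the implementation
spec <- linear_reg(mode = "regression")

# Engine: which underlying implementation does the fitting (here stats::lm)
spec <- set_engine(spec, "lm")

# Fit on a toy data set (mtcars is only a stand-in)
fitted_model <- fit(spec, mpg ~ wt + hp, data = mtcars)

# Predictions come back as a tibble with a `.pred` column
preds <- predict(fitted_model, new_data = mtcars)
head(preds)
```

Swapping `linear_reg()` for another wrapper from the list (together with an engine that is installed) should leave the `fit()` and `predict()` calls unchanged.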


__LightGBM__

[LightGBM](https://lightgbm.readthedocs.io/en/latest/) is a gradient boosting framework that uses tree-based learning algorithms. Its documentation lists the following advantages:

- Faster training speed and higher efficiency
- Lower memory usage
- Better accuracy
- Support of parallel and GPU learning
- Capable of handling large-scale data

But: not yet supported by `parsnip`

### Metrics for regression

There are several metrics for evaluating regression models. As with classification metrics, the `yardstick` package contains all common regression metrics.


*Note: We define $x_i$ as the actual value and $y_i$ as the predicted value*

__Mean absolute error (MAE)__ 

$\frac{1}{n}\sum_{i=1}^{n}|x_i-y_i|$ 

- average absolute difference between $y_i$ and $x_i$
- good interpretability 

__Root-mean-square error (RMSE)__

$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-y_i)^2}$

- square root of the average of squared errors (MSE)
- each error contributes in proportion to its square, so larger errors have a disproportionately large effect 
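
The difference between MAE and RMSE can be illustrated with a small hand-made example. The vectors below are hypothetical numbers chosen so that the arithmetic is easy to follow; `mae()` and `rmse()` are helper functions defined here, not the `yardstick` functions.

```r
# Hypothetical actual values (x) and predictions (y)
x <- c(100, 200, 300, 400)
y <- c(110, 190, 310, 390)   # every prediction is off by 10

mae  <- function(x, y) mean(abs(x - y))
rmse <- function(x, y) sqrt(mean((x - y)^2))

mae(x, y)    # 10
rmse(x, y)   # 10

# Same total absolute error (40), but concentrated in a single prediction
y_outlier <- c(100, 200, 300, 440)

mae(x, y_outlier)    # 10
rmse(x, y_outlier)   # 20
```

Both prediction vectors have the same MAE, but the RMSE doubles once the error is concentrated in one observation.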

__Coefficient of determination ($R^2$)__

<img align="center" src="https://wikimedia.org/api/rest_v1/media/math/render/svg/6b863cb70dd04b45984983cb6ed00801d5eddc94" style="width:15%">

<img align="center" src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/86/Coefficient_of_Determination.svg/600px-Coefficient_of_Determination.svg.png" style="width:30%">

where 

total sum of squares:
$SS_{tot} = \sum_{i}(x_i - \bar{x})^2$ 

sum of squares of residuals:
$SS_{res} = \sum_{i}(x_i - y_i)^2$ 

- proportion of the variance in the dependent variable that is predictable from the independent variable(s)
- usually between 0 and 1; negative if the model fits worse than always predicting the mean $\bar{x}$

Source: [Wikipedia](https://en.wikipedia.org/wiki/Coefficient_of_determination)
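
A short worked example, assuming the usual definition $R^2 = 1 - SS_{res}/SS_{tot}$ and the convention from the note above ($x_i$ actual, $y_i$ predicted). The numbers are hypothetical and `r_squared()` is a helper defined here for illustration.

```r
r_squared <- function(x, y) {
  ss_tot <- sum((x - mean(x))^2)   # total sum of squares
  ss_res <- sum((x - y)^2)         # sum of squares of residuals
  1 - ss_res / ss_tot
}

x <- c(1, 2, 3, 4, 5)             # actual values, SS_tot = 10

# Good predictions: SS_res = 0.10, so R^2 is about 0.99
y_good <- c(1.1, 1.9, 3.2, 3.8, 5.0)
r_squared(x, y_good)

# Predictions worse than the mean: SS_res = 40, so R^2 = -3
y_bad <- c(5, 4, 3, 2, 1)
r_squared(x, y_bad)
```

Note that `yardstick` offers two variants: `rsq()` is based on the squared correlation, while `rsq_trad()` uses the formula shown here.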

__Mean Absolute Percentage Error (MAPE)__

$\frac{100\%}{n}\sum_{i=1}^{n}\left |\frac{x_i-y_i}{x_i}\right|$

- frequently used for (time series) forecasting
- undefined if any actual value $x_i$ is zero
- puts a heavier penalty on negative errors ($x_i < y_i$, forecast too high) than on positive errors, so selecting a method by MAPE is biased toward methods whose forecasts are too low
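
The last two properties can be checked with a few hypothetical numbers. `mape()` is a helper defined here that follows the formula above.

```r
mape <- function(x, y) 100 * mean(abs((x - y) / x))

# Asymmetry: a non-negative forecast can undershoot by at most 100 %,
# but it can overshoot without limit
mape(100, 0)     # 100
mape(100, 300)   # 200

# Same absolute error (10) weighs more when the actual value is small
mape(c(50, 200), c(60, 210))   # 12.5

# A zero actual value makes the metric unusable
mape(c(0, 100), c(10, 100))    # Inf
```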

## Up to you:  Price forecasting for used cars

__Kaggle used cars database__

Over 370,000 used cars scraped from eBay Kleinanzeigen. See the description on [Kaggle.com](https://www.kaggle.com/orgesleka/used-cars-database)


Why is that interesting?

<img align="center" src="images/06/wkda.png" style="width:80%">

Load required packages

In [None]:
library(tidyverse)
library(tidymodels)

Load data

In [None]:
autos <- read_csv('data/T06/autos.csv.zip')

In [None]:
autos %>%
  glimpse()

__Up to you: Price forecasting for used cars__

Build a regression model that predicts the price for used cars

1. Split the data into train and test set

In [None]:
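set.seed(2019)  # arbitrary seed (an addition here) so that the random split is reproducible
autos_split <- initial_split(autos, prop = 3/4)  # 75 % train, 25 % test
train_set <- training(autos_split)
test_set <- testing(autos_split)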

__Up to you: Price forecasting for used cars__

2. Prepare a recipe for data preprocessing, removing outliers or inconsistencies 

In [None]:
train_set %>%
    recipe(price ~ vehicleType + yearOfRegistration + gearbox + powerPS
                   + kilometer + fuelType + brand) -> rec
rec

*Check Prices*

In [None]:
sum(is.na(train_set$price))
options(repr.plot.width=7, repr.plot.height=3)
options(scipen=999)
train_set %>%
  ggplot(aes(x=vehicleType, y=price)) + geom_boxplot() + ylim(0,1000000)

Some outliers need to be removed

In [None]:
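# 1 %, 5 %, 95 % and 99 % quantiles of the training prices
quantiles_price <- quantile(train_set$price, probs = c(0.01, 0.05, 0.95, 0.99))
quantiles_price
# boxplot restricted to prices between the 1 % and the 99 % quantile
train_set %>%
  filter(price >= quantiles_price[1], price <= quantiles_price[4]) %>%
  ggplot(aes(x=vehicleType, y=price)) + geom_boxplot()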

Add findings to recipe

In [None]:
rec %>%
    # keep prices between the 5 % and the 99 % quantile of the training set
    step_filter(price >= quantiles_price[2], price <= quantiles_price[4]) -> rec
rec
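
Two details of `step_filter()` are easy to miss, so here is a minimal sketch on hypothetical toy data (`cars_toy`, `lower` and `upper` are names made up for this example). First, the `recipes` documentation recommends quasiquotation (`!!`) when the filter refers to objects in the user's environment, so that the values are stored in the recipe. Second, `step_filter()` has `skip = TRUE` by default, so the filter is applied when the recipe is prepped on the training data but not when new data is baked.

```r
library(recipes)

# Hypothetical toy data standing in for the used-car training set
cars_toy <- data.frame(
  price     = c(0, 500, 1500, 3000, 8000, 999999),
  kilometer = c(150000, 125000, 90000, 50000, 20000, 5000)
)

bounds <- quantile(cars_toy$price, probs = c(0.05, 0.99))
lower <- bounds[[1]]
upper <- bounds[[2]]

toy_rec <- recipe(price ~ kilometer, data = cars_toy)
# `!!` embeds the current values of lower/upper in the recipe
toy_rec <- step_filter(toy_rec, price >= !!lower, price <= !!upper)

toy_prepped <- prep(toy_rec)

# Training data: the filter is applied (the two extreme prices are dropped)
n_train <- nrow(juice(toy_prepped))

# New data: the step is skipped, so all rows are kept
n_new <- nrow(bake(toy_prepped, new_data = cars_toy))

c(train = n_train, new = n_new)
```

Under this default, baking a test set leaves its outliers in place, which is usually what one wants for an honest evaluation.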

*Check vehicle type*

In [None]:
unique(train_set$vehicleType)
table(train_set$vehicleType)
sum(is.na(train_set$vehicleType))

Add findings to recipe

In [None]:
rec %>%
    step_naomit(vehicleType) %>%
    step_dummy(vehicleType) -> rec
rec

*Check power*

In [None]:
train_set %>%
  ggplot(aes(x=vehicleType, y=powerPS)) + geom_boxplot()

In [None]:
quantiles_power <- quantile(train_set$powerPS, probs = c(0.05, 0.1, 0.15, 0.95, 0.99))
quantiles_power
train_set %>%
  filter(powerPS > 0, powerPS <= quantiles_power[4]) %>%
  ggplot(aes(x=vehicleType, y=powerPS)) + geom_boxplot()

Add findings to recipe

In [None]:
rec %>%
    # drop cars with zero power and those above the 95 % quantile (quantiles_power[4])
    step_filter(powerPS > 0, powerPS <= quantiles_power[4]) -> rec
rec

*Check year of registration*

In [None]:
train_set$yearOfRegistration %>% summary()
train_set %>%
  filter(yearOfRegistration > 1900, yearOfRegistration <= 2016) %>%
  ggplot(aes(x=vehicleType, y=yearOfRegistration)) + geom_boxplot()

Add findings to recipe

In [None]:
rec %>%
    step_filter(yearOfRegistration > 1950, yearOfRegistration <= 2016) -> rec
rec

*Check fuel type*

In [None]:
unique(train_set$fuelType)
table(train_set$fuelType)
sum(is.na(train_set$fuelType))

Add findings to recipe

In [None]:
rec %>%
    #step_knnimpute(fuelType) %>%
    step_naomit(fuelType) %>%
    step_other(fuelType, threshold = 0.01) %>%
    step_dummy(fuelType) -> rec
rec
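
To see what `step_other()` and `step_dummy()` do to a categorical column, here is a small sketch on hypothetical toy data. The data frame `fuel_toy` and the threshold of 0.2 are assumptions chosen so that the effect is visible with ten rows; the tutorial itself uses a threshold of 0.01.

```r
library(recipes)

# Hypothetical toy data: two frequent and two rare fuel types
fuel_toy <- data.frame(
  price    = c(1000, 1200, 900, 1500, 1100, 3000, 3200, 2800, 2000, 2500),
  fuelType = factor(c(rep("benzin", 5), rep("diesel", 3), "lpg", "cng"))
)

toy_rec <- recipe(price ~ fuelType, data = fuel_toy)
# Lump levels that occur in less than 20 % of the rows into "other"
toy_rec <- step_other(toy_rec, fuelType, threshold = 0.2)
# Replace the factor by 0/1 indicator columns (one level serves as reference)
toy_rec <- step_dummy(toy_rec, fuelType)

baked <- juice(prep(toy_rec))
names(baked)
```

The rare levels `lpg` and `cng` should no longer get their own indicator columns; they are represented by a single `other` column instead.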

*Check gearbox*

In [None]:
unique(train_set$gearbox)
table(train_set$gearbox)
sum(is.na(train_set$gearbox))

Add findings to recipe

In [None]:
rec %>%
    step_naomit(gearbox) %>%
    step_dummy(gearbox) -> rec
rec

*Check brand*

In [None]:
train_set %>% distinct(brand) %>% pull()

Add findings to recipe

In [None]:
german_brands <- c('volkswagen', 'audi', 'bmw', 'mercedes_benz', 'porsche', 'opel')
rec %>%
    step_mutate(brand = if_else(brand %in% german_brands, 'german', 'foreign')) %>%
    step_string2factor(brand) %>%
    step_dummy(brand) -> rec
rec

*Check Kilometer*

In [None]:
train_set %>%
  ggplot(aes(x=vehicleType, y=kilometer)) + geom_boxplot()
unique(train_set$kilometer)

Prepare recipe

In [None]:
rec %>% 
    check_missing(all_predictors()) %>%
    prep() -> prepped_rec

Bake train and test set

In [None]:
train_set_baked <- prepped_rec %>% juice()
test_set_baked <- prepped_rec %>% bake(new_data=test_set)

In [None]:
train_set_baked %>% head()

__Up to you: Price forecasting for used cars__

3. Fit two different models on the train set and evaluate them on the test set

*Linear Regression*

In [None]:
linear_reg(mode = 'regression') %>%
    set_engine('lm') %>%
    fit(price ~ ., data = train_set_baked) %>%
    predict(new_data = test_set_baked) %>%
    cbind(truth = test_set_baked$price) %>%
    metrics(truth, .pred) -> res_lin
res_lin

*XGBoost* (detailed [parameters](https://xgboost.readthedocs.io/en/latest/parameter.html))

In [None]:
boost_tree(mode="regression", tree_depth = 6, learn_rate = 0.3) %>%
    set_engine('xgboost') %>%
    fit(price ~ ., data = train_set_baked) %>%
    predict(new_data = test_set_baked) %>%
    cbind(truth = test_set_baked$price) %>%
    metrics(truth, .pred) -> res_xgb
res_xgb

## Exam Questions

### Exam AIS SS 2018

__Question 5: Supervised Learning__

(a) (3 points) Overfitting is a key problem which arises in supervised learning. Explain the central underlying trade-off using a simple plot.

> <img src="http://scott.fortmann-roe.com/docs/docs/BiasVariance/biasvariance.png" style="width:70%">
 Source: http://scott.fortmann-roe.com/docs/docs/BiasVariance/biasvariance.png

(b) (2 points) For each of the following machine learning approaches name a measure / algorithm variant which allows controlling over-fitting tendencies. You should only consider measures which are specific to this algorithm  (i.e., not cross-validation or other generic approaches).

Decision Tree

> - Pre-Pruning (Early Stopping Rule): Stop the algorithm before it becomes a fully-grown tree
- Post-pruning: Grow decision tree to its entirety, then trim the nodes of the decision tree in a bottom-up fashion

k-Nearest Neighbors

> - _increase_ k

Boosting Algorithms

> - Learning Rate
- other hyperparameters, e.g., number of boosting rounds

Linear Regression

> - Limit the number of independent variables
- Ridge: Penalize by sum-of-squares of parameters
- Lasso: Penalize by sum-of-absolute values of parameters
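
As a sketch of how the two penalties differ, both can be written as least-squares objectives with an added term. The notation ($y_i$ target, $x_i$ feature vector, $\beta$ coefficients, $p$ number of features, $\lambda \geq 0$ penalty weight) is introduced here for illustration and is not taken from the exam.

$$\text{Ridge:}\quad \min_{\beta_0,\beta} \sum_{i=1}^{n} \left(y_i - \beta_0 - x_i^T\beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2$$

$$\text{Lasso:}\quad \min_{\beta_0,\beta} \sum_{i=1}^{n} \left(y_i - \beta_0 - x_i^T\beta\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$

With $\lambda = 0$ both reduce to ordinary least squares. Larger values of $\lambda$ shrink the coefficients; the Lasso penalty can set some coefficients exactly to zero, which also limits the number of independent variables.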

(c) __Support Vector Machines__
A certain kind of SVM is characterized by the following optimization problem:

$$\min \underbrace{\frac{1}{2} w^Tw}_{A} + \underbrace{C \sum_k \epsilon_k}_{B}$$
subject to
$$y_i (wx_i+b) \geq 1 - \epsilon_i$$

i. (2 points) What kind of SVM is described here? Provide an intuition of the role of $\epsilon$ in the constraint.

> - Soft Margin SVM
- Slack variables $\epsilon_i$ can be added to allow misclassification of difficult or noisy examples

ii. (2 points) Briefly explain the two parts A and B of the objective function.

$$\min \underbrace{\frac{1}{2} w^Tw}_{A} + \underbrace{C \sum_k \epsilon_k}_{B}$$

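> A: Regularization term: minimizing $\frac{1}{2} w^Tw$ maximizes the margin between the hyperplanes

> B: Penalty term for margin violations and misclassifications; the parameter $C$ weights this penalty against the margin width and thereby controls overfitting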
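
The role of the two parts can be made explicit with a short derivation. It assumes the usual canonical scaling, in which the two margin hyperplanes are $wx + b = \pm 1$, and the standard additional constraint $\epsilon_i \geq 0$, which is not written out in the exam question.

$$\text{margin} = \frac{2}{\lVert w \rVert} = \frac{2}{\sqrt{w^Tw}} \quad\Rightarrow\quad \min_{w} \frac{1}{2} w^Tw \;\text{ is equivalent to }\; \max_{w} \frac{2}{\lVert w \rVert}$$

$$\epsilon_i \begin{cases} = 0 & \text{correctly classified, on or outside the margin} \\ \in (0, 1] & \text{correctly classified or on the decision boundary, but inside the margin} \\ > 1 & \text{misclassified} \end{cases}$$

A large $C$ makes violations expensive and leads to a narrower margin that follows the training data closely; a small $C$ tolerates more violations in exchange for a wider margin.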

(d) (2 points) Briefly explain the concept of bootstrap aggregation (bagging) and how it benefits supervised learning.

> Create ensembles of weak learners by repeatedly randomly resampling the training data
- Given a training set of size n, create m samples of size n by drawing n examples from the original data, with replacement
- Create m models from m samples
- Combine the m resulting models using simple majority vote (classification) or averaging (regression)

> Decreases error by reducing the variance that unstable learners introduce; the bias remains unchanged.
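
The three steps above can be sketched in base R. The simulated data, the number of models `m`, and the choice of a degree-6 polynomial regression as a deliberately flexible (unstable) base learner are all assumptions for illustration; in practice the base learner is typically a decision tree.

```r
set.seed(1)

# Hypothetical noisy training data
n <- 60
train <- data.frame(x = runif(n, 0, 10))
train$y <- sin(train$x) + rnorm(n, sd = 0.3)

# Points at which we want predictions
new_data <- data.frame(x = seq(1, 9, by = 0.5))

m <- 50  # number of bootstrap samples, and therefore models

# One column of predictions per bootstrap model
pred_matrix <- sapply(seq_len(m), function(b) {
  idx <- sample(n, size = n, replace = TRUE)         # draw n rows with replacement
  model <- lm(y ~ poly(x, 6), data = train[idx, ])   # fit one model per sample
  predict(model, newdata = new_data)
})

# Aggregate: average the m predictions for each point (regression case)
bagged_pred <- rowMeans(pred_matrix)
head(bagged_pred)
```

For classification, the averaging in the last step would be replaced by a majority vote over the m predicted classes.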

(e) (2 points) How would you assess the relative importance of variables in a random forest? Explain your answer. (You may consider the next question for an illustration.)

From (f) - See Tutorial 5
<img align="center" src="images/05/rf.png" style="width:60%">

Solution (e)

> The more splits (across all trees) use a feature, the more important it is (*Sex*: 3, *Age/Pclass*: 2). The position of the split matters as well (the first split is always on *Sex*, which has a high information gain).

>Not in lecture: 
- Gini Importance / Mean Decrease in Impurity (MDI)
    - Sum the impurity decreases of all splits (across all trees) that use the feature, weighted by the number of samples each split handles.
- Permutation Importance or Mean Decrease in Accuracy (MDA) 

>see (https://medium.com/the-artificial-impostor/feature-importance-measures-for-tree-models-part-i-47f187c1a2c3)
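
Permutation importance is model-agnostic, so its idea can be sketched without a random forest. The example below is a simplified illustration under stated assumptions: it uses `lm()` on the built-in `mtcars` data as a stand-in model, measures the increase in RMSE rather than the decrease in accuracy, and evaluates on the training data for brevity, whereas random forest implementations typically use out-of-bag samples. `permutation_importance()` is a helper defined here.

```r
set.seed(42)

# Stand-in model; any fitted model with a predict() method would work
model <- lm(mpg ~ wt + hp + qsec, data = mtcars)

rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))
baseline <- rmse(mtcars$mpg, predict(model, newdata = mtcars))

permutation_importance <- function(feature, n_repeats = 20) {
  increases <- replicate(n_repeats, {
    shuffled <- mtcars
    # Shuffling one column breaks its relationship with the target
    shuffled[[feature]] <- sample(shuffled[[feature]])
    rmse(mtcars$mpg, predict(model, newdata = shuffled)) - baseline
  })
  mean(increases)  # average increase in error over all repetitions
}

importance <- sapply(c("wt", "hp", "qsec"), permutation_importance)
sort(importance, decreasing = TRUE)
```

A feature whose permutation barely changes the error contributes little to the predictions; a large increase marks an important feature.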