*Analytical Information Systems*

# Tutorial 6 - Predictive Modeling II

Matthias Griebel<br>
Lehrstuhl für Wirtschaftsinformatik und Informationsmanagement

Summer semester 2019

<h1>Agenda<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Modeling" data-toc-modified-id="Modeling-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Modeling</a></span><ul class="toc-item"><li><span><a href="#Models-for-supervised-learning" data-toc-modified-id="Models-for-supervised-learning-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Models for supervised learning</a></span></li><li><span><a href="#Metrics-for-regression" data-toc-modified-id="Metrics-for-regression-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Metrics for regression</a></span></li></ul></li><li><span><a href="#Up-to-you:--Price-forecasting-for-used-cars" data-toc-modified-id="Up-to-you:--Price-forecasting-for-used-cars-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Up to you:  Price forecasting for used cars</a></span></li></ul></div>

## Modeling

__Recap: CRISP-DM__
<img align="right" src="http://statistik-dresden.de/wp-content/uploads/2012/04/CRISP-DM_Process_Diagram1.png" style="width:50%">

Today, we will focus on 
- `Data Preparation`
- `Modeling` for regression tasks

### Models for supervised learning

`parsnip` contains wrappers for a number of [models](https://tidymodels.github.io/parsnip/articles/articles/Models.html). 

- Classification
    - `logistic_reg()`,  `multinom_reg()`
    - `decision_tree()`, `rand_forest()`, `boost_tree()`
    - `mlp()`
    - `nearest_neighbor()`
    - `svm_poly()`, `svm_rbf()`

- Regression
    - `linear_reg()`
    - `decision_tree()`, `rand_forest()`, `boost_tree()`
    - `mlp()`
    - `nearest_neighbor()`
    - `svm_poly()`, `svm_rbf()`
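As a minimal sketch of how these wrappers are used: a model is first specified (type, mode, engine) and only then fitted. The built-in `mtcars` data and the formula `mpg ~ wt + hp` are stand-ins chosen for illustration, not part of the tutorial data.

```r
library(tidymodels)

# Specification: linear regression, computed with base R's lm()
lm_spec <- linear_reg() %>%
  set_engine("lm")

# Same interface for another model type. Random forests support both
# classification and regression, so the mode is set explicitly.
# (An engine would still have to be chosen with set_engine() before fitting.)
rf_spec <- rand_forest(trees = 500) %>%
  set_mode("regression")

# Fit the linear model on a stand-in dataset
lm_fit <- lm_spec %>%
  fit(mpg ~ wt + hp, data = mtcars)

# Predictions are returned as a tibble with a `.pred` column
preds <- predict(lm_fit, new_data = mtcars[1:3, ])
preds
```

Because the specification is separate from the fit, swapping the model type usually means changing only the first step.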


__LightGBM__

[LightGBM](https://lightgbm.readthedocs.io/en/latest/) is a gradient boosting framework that uses tree-based learning algorithms. Its documentation lists the following advantages:

- Faster training speed and higher efficiency
- Lower memory usage
- Better accuracy
- Support of parallel and GPU learning
- Capable of handling large-scale data

However, LightGBM is not yet supported by `parsnip`.

Loading required packages

In [None]:
library(tidyverse)
library(tidymodels)

### Metrics for regression

There are several metrics for evaluating regression models. As with classification metrics, the `yardstick` package contains all common regression metrics.
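A short sketch of how several regression metrics can be computed at once with `yardstick`. The data frame and its column names (`actual`, `predicted`) are made up for illustration.

```r
library(yardstick)

# Toy results: observed values and model predictions (illustrative numbers)
results <- data.frame(
  actual    = c(10, 20, 30, 40),
  predicted = c(12, 18, 33, 35)
)

# Bundle several regression metrics into one function
reg_metrics <- metric_set(mae, rmse, rsq_trad, mape)

# `truth` is the observed column, `estimate` the predicted column
metric_values <- reg_metrics(results, truth = actual, estimate = predicted)
metric_values
```

Note that `yardstick` offers two variants of $R^2$: `rsq()` is the squared correlation between truth and estimate, while `rsq_trad()` uses the traditional definition based on sums of squares.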


*Note: We define $x_i$ as the actual value and $y_i$ as the predicted value for observation $i$; $n$ is the number of observations.*

__Mean absolute error (MAE)__ 

$\frac{1}{n}\sum_{i=1}^{n}|x_i-y_i|$ 

- average absolute difference between $y_i$ and $x_i$
- easy to interpret: expressed in the same unit as the target variable
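A small worked example in base R, using made-up prices to show why the MAE is easy to interpret:

```r
# Illustrative car prices in EUR (not taken from the dataset)
actual    <- c(5000, 12000, 8000, 3000)
predicted <- c(5500, 11000, 8200, 2700)

# MAE: mean of the absolute differences
mae_value <- mean(abs(actual - predicted))
mae_value
```

The absolute errors are 500, 1000, 200 and 300, so the MAE is 500: on average, the predictions are off by 500 EUR.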

__Root-mean-square error (RMSE)__

$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-y_i)^2}$

- square root of the mean squared error (MSE)
- each error contributes in proportion to its square, so large errors have a disproportionately large effect
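The following base R sketch illustrates the effect of large errors. Both sets of predictions (made up for this example) have the same MAE, but the one with a single large error has twice the RMSE.

```r
mae  <- function(actual, predicted) mean(abs(actual - predicted))
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

actual <- c(10, 20, 30, 40)

# Errors spread evenly: every prediction is off by 3
pred_even  <- actual + c(3, -3, 3, -3)

# Same total absolute error, concentrated in one prediction
pred_spiky <- actual + c(0, 0, 0, 12)

mae_even   <- mae(actual, pred_even)    # 3
mae_spiky  <- mae(actual, pred_spiky)   # 3
rmse_even  <- rmse(actual, pred_even)   # 3
rmse_spiky <- rmse(actual, pred_spiky)  # 6
```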

__Coefficient of determination ($R^2$)__

<img align="center" src="https://wikimedia.org/api/rest_v1/media/math/render/svg/6b863cb70dd04b45984983cb6ed00801d5eddc94" style="width:15%">

<img align="center" src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/86/Coefficient_of_Determination.svg/600px-Coefficient_of_Determination.svg.png" style="width:30%">

where 

total sum of squares:
$SS_{tot} = \sum_{i}(x_i - \bar{x})^2$ 

sum of squares of residuals:
$SS_{res} = \sum_{i}(x_i - y_i)^2$ 

- proportion of the variance in the dependent variable that is predictable from the independent variable(s)
- usually between 0 and 1; negative if the model fits worse than always predicting the mean $\bar{x}$

Source: [Wikipedia](https://en.wikipedia.org/wiki/Coefficient_of_determination)
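A base R sketch of $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ the total sum of squares around the mean. The numbers are made up and include a deliberately bad model to show that $R^2$ can become negative.

```r
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)      # squared residuals
  ss_tot <- sum((actual - mean(actual))^2)   # squared deviations from the mean
  1 - ss_res / ss_tot
}

actual <- c(10, 20, 30, 40)

# Reasonable predictions: SS_res = 42, SS_tot = 500
r2_good <- r_squared(actual, c(12, 18, 33, 35))   # 0.916

# Predictions in reverse order: SS_res = 2000, worse than the mean
r2_bad  <- r_squared(actual, c(40, 30, 20, 10))   # -3
```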

__Mean Absolute Percentage Error (MAPE)__

$\frac{100\%}{n}\sum_{i=1}^{n}\left |\frac{x_i-y_i}{x_i}\right|$

- frequently used for (time series) forecasting
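- cannot be used if any actual value $x_i$ is zero (division by zero)
- puts a heavier penalty on negative errors ($x_i < y_i$, i.e. forecasts that are too high) than on positive errors; selecting models by MAPE is therefore biased toward methods whose forecasts are too low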
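Both caveats can be reproduced with a few lines of base R. The values are illustrative, and forecasts are assumed to be non-negative.

```r
mape <- function(actual, predicted) {
  100 * mean(abs((actual - predicted) / actual))
}

# A non-negative forecast that is too low can be off by at most 100 %
mape_low  <- mape(100, 0)     # 100

# A forecast that is too high has no upper bound
mape_high <- mape(100, 250)   # 150

# A zero actual value makes the metric unusable
mape_zero <- mape(c(0, 10), c(1, 10))   # Inf
```

Because under-forecasts are capped at 100 % while over-forecasts are not, a method that forecasts too low tends to look better under MAPE.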

## Up to you:  Price forecasting for used cars

__Kaggle used cars database__

Over 370,000 used cars scraped from eBay Kleinanzeigen. See the description on [Kaggle.com](https://www.kaggle.com/orgesleka/used-cars-database).


Why is that interesting?

<img align="center" src="images/06/wkda.png" style="width:80%">

Load data

In [None]:
autos <- read_csv('data/T06/autos.csv.zip')

In [None]:
autos %>%
  glimpse()

__Up to you: Price forecasting for used cars__

Build a regression model that predicts the price of used cars.

1. Split the data into a training set and a test set
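One possible way to do the split is `initial_split()` from `rsample`. The sketch below uses the built-in `mtcars` data as a stand-in for `autos`; the 80/20 proportion and the seed are arbitrary choices.

```r
library(rsample)

set.seed(42)  # make the random split reproducible

# Hold out 20 % of the rows for testing
car_split <- initial_split(mtcars, prop = 0.8)

car_train <- training(car_split)
car_test  <- testing(car_split)

nrow(car_train)
nrow(car_test)
```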

2. Prepare a recipe for data preprocessing that removes outliers and inconsistencies

*Check prices*

Some outliers need to be removed.
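A sketch of what such a check could look like with `dplyr`. The toy data frame, the column name `price` and both bounds are assumptions for illustration; suitable thresholds have to come from exploring the real data.

```r
library(dplyr)

# Toy stand-in for the autos data (illustrative values only)
toy_autos <- data.frame(price = c(0, 1, 500, 2500, 7900, 15000, 99999999))

# Inspect the spread of prices to spot implausible values
toy_autos %>%
  summarise(min = min(price), median = median(price), max = max(price))

# Hypothetical plausibility bounds in EUR
price_min <- 100
price_max <- 150000

autos_clean <- toy_autos %>%
  filter(between(price, price_min, price_max))

autos_clean
```

Inside a recipe, a comparable row filter can be expressed with `step_filter()`.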

Add findings to recipe

*Check vehicle type*

Add findings to recipe

*Check power*

Add findings to recipe

*Check year of registration*

Add findings to recipe

*Check fuel type*

Add findings to recipe

*Check gearbox*

Add findings to recipe

*Check brand*

Add findings to recipe

*Check kilometer*

Prepare recipe

Bake train and test set
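The prepare-and-bake workflow can be sketched as follows. The toy data, its column names and the two preprocessing steps are assumptions for illustration, not the intended solution.

```r
library(tidymodels)

# Toy training and test data (illustrative values and column names)
train <- data.frame(
  price   = c(500, 2500, 7900, 15000, 3200, 9800),
  power   = c(60, 90, 150, 200, 75, 140),
  gearbox = factor(c("manual", "manual", "automatic",
                     "automatic", "manual", "automatic"))
)
test <- data.frame(
  price   = c(1800, 12000),
  power   = c(70, 180),
  gearbox = factor(c("manual", "automatic"), levels = levels(train$gearbox))
)

# Define the preprocessing steps (nothing is computed yet)
car_recipe <- recipe(price ~ ., data = train) %>%
  step_normalize(all_numeric(), -all_outcomes()) %>%  # center and scale
  step_dummy(all_nominal(), -all_outcomes())          # one dummy per extra level

# prep() estimates the required statistics on the training set only
car_prep <- prep(car_recipe, training = train)

# bake() applies the trained steps to any dataset
train_baked <- bake(car_prep, new_data = train)
test_baked  <- bake(car_prep, new_data = test)
```

Estimating means and standard deviations on the training set alone keeps information from the test set out of the preprocessing.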

3. Fit and evaluate two different models on the training set

*Linear Regression*
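A minimal sketch of fitting and scoring a linear regression with `parsnip` and `yardstick`. It uses `mtcars` as a stand-in for the baked car data and, as one possible evaluation setup, scores the predictions on held-out rows.

```r
library(tidymodels)

set.seed(1)
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# Specify and fit the model on the training rows
lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt + hp, data = car_train)

# Attach the observed values to the predictions
lm_results <- predict(lm_fit, new_data = car_test) %>%
  bind_cols(car_test %>% select(mpg))

# metrics() returns RMSE, R^2 and MAE for numeric outcomes
lm_metrics <- lm_results %>%
  metrics(truth = mpg, estimate = .pred)

lm_metrics
```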

*XGBoost*
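A comparable sketch for gradient boosted trees via `boost_tree()` with the `xgboost` engine. It assumes that the `xgboost` package is installed; the `mtcars` stand-in data and the hyperparameter values are arbitrary choices, not tuned settings.

```r
library(tidymodels)

set.seed(1)
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test  <- testing(car_split)

# Boosted trees; the hyperparameter values are illustrative only
xgb_spec <- boost_tree(trees = 100, tree_depth = 3, learn_rate = 0.1) %>%
  set_mode("regression") %>%
  set_engine("xgboost")

xgb_fit <- xgb_spec %>%
  fit(mpg ~ ., data = car_train)

xgb_results <- predict(xgb_fit, new_data = car_test) %>%
  bind_cols(car_test %>% select(mpg))

# RMSE on the held-out rows
xgb_rmse <- xgb_results %>%
  rmse(truth = mpg, estimate = .pred)

xgb_rmse
```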