# Support Vector Machines: Solution

### Import modules

In [1]:
import os

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

### Load the data

In [2]:
# adjust to correct path if necessary
dataset_dir_path = os.path.join(os.path.pardir, os.path.pardir, os.path.pardir, 'datasets', 'regression')
rental_bikes_df = pd.read_csv(os.path.join(dataset_dir_path, 'rental_bikes.csv'))

### Get an overview of the data

In [3]:
rental_bikes_df.head()

Unnamed: 0,Date,Rented Bike Count,Hour,Temperature,Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature,Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm),Seasons,Holiday,Functioning Day
0,01/12/2017,254,0,-5.2,37,2.2,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
1,01/12/2017,204,1,-5.5,38,0.8,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
2,01/12/2017,173,2,-6.0,39,1.0,2000,-17.7,0.0,0.0,0.0,Winter,No Holiday,Yes
3,01/12/2017,107,3,-6.2,40,0.9,2000,-17.6,0.0,0.0,0.0,Winter,No Holiday,Yes
4,01/12/2017,78,4,-6.0,36,2.3,2000,-18.6,0.0,0.0,0.0,Winter,No Holiday,Yes


In [4]:
rental_bikes_df.describe().round(2)

Unnamed: 0,Rented Bike Count,Hour,Temperature,Humidity(%),Wind speed (m/s),Visibility (10m),Dew point temperature,Solar Radiation (MJ/m2),Rainfall(mm),Snowfall (cm)
count,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0,8760.0
mean,704.6,11.5,12.88,58.23,1.72,1436.83,4.07,0.57,0.15,0.08
std,645.0,6.92,11.94,20.36,1.04,608.3,13.06,0.87,1.13,0.44
min,0.0,0.0,-17.8,0.0,0.0,27.0,-30.6,0.0,0.0,0.0
25%,191.0,5.75,3.5,42.0,0.9,940.0,-4.7,0.0,0.0,0.0
50%,504.5,11.5,13.7,57.0,1.5,1698.0,5.1,0.01,0.0,0.0
75%,1065.25,17.25,22.5,74.0,2.3,2000.0,14.8,0.93,0.0,0.0
max,3556.0,23.0,39.4,98.0,7.4,2000.0,27.2,3.52,35.0,8.8


### Prepare the data

#### Store the target variable in `y` as a NumPy array

In [5]:
y = rental_bikes_df['Rented Bike Count'].values

In [6]:
# rental_bikes_df['weekday'] = pd.to_datetime(rental_bikes_df.Date, dayfirst=True ).dt.weekday

#### Remove the target variable and unneeded variables

In [7]:
rental_bikes_df_features = rental_bikes_df.drop(columns=['Rented Bike Count', 'Date'])

#### Convert categorical variables to dummy variables

In [8]:
rental_bikes_df_features = pd.get_dummies(rental_bikes_df_features, columns=['Seasons', 'Holiday', 'Functioning Day'], drop_first=True)
X = rental_bikes_df_features.values
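
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The values are made up for illustration; only the column names `Hour`, `Seasons`, and `Holiday` mirror the dataset. It assumes that pandas orders the categories alphabetically and drops the first one as the reference level.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset
toy_df = pd.DataFrame({
    "Hour": [0, 1, 2, 3],
    "Seasons": ["Winter", "Spring", "Summer", "Autumn"],
    "Holiday": ["No Holiday", "Holiday", "No Holiday", "No Holiday"],
})

# One dummy column per category, minus the first (alphabetical) category
toy_dummies = pd.get_dummies(toy_df, columns=["Seasons", "Holiday"], drop_first=True)

# "Autumn" and "Holiday" become the implicit reference levels
print(list(toy_dummies.columns))
```

A row in which all `Seasons_*` dummies are zero therefore represents the dropped category, so no information is lost while one redundant column per variable is avoided.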

#### Split the data into training and test sets for later validation

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=5)

#### Scale the data

When using SVMs, it is important to scale the data first. Here this is done by standardization, $z = \frac{x - \mu}{\sigma}$. The scaler is fitted on the training set only, and the test set is transformed with the same $\mu$ and $\sigma$.

In [10]:
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
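
The standardization formula can be checked by hand. The following sketch uses a small made-up array (not the rental bike data) and compares the manual computation with `StandardScaler`, assuming the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
toy_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
toy_test = np.array([[2.0, 200.0]])

# Manual standardization using the training mean and standard deviation
mu = toy_train.mean(axis=0)
sigma = toy_train.std(axis=0)  # ddof=0, the same convention as StandardScaler
manual_test = (toy_test - mu) / sigma

# The same transformation via StandardScaler, fitted on the training data only
toy_scaler = StandardScaler().fit(toy_train)
scaled_test = toy_scaler.transform(toy_test)

print(manual_test)
print(np.allclose(manual_test, scaled_test))
```

Using the training statistics for the test set keeps information about the test data out of the preprocessing step.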

### Model training

Initialize the SVM model with `SVR()` and choose suitable parameters. `kernel`, `gamma`, and `C` in particular are important parameters of an SVM. `gamma` and `C` are usually varied in powers of ten, e.g. `gamma=0.01`, `C=10`. For `kernel`, the main options are 'linear' and 'rbf' (Gaussian kernel); `gamma` is irrelevant for the linear kernel. Use the `help()` function to learn more about the parameters.

Try to reach an RMSE below 400, or even below 300.

In [19]:
help(svm.SVR)

Help on class SVR in module sklearn.svm._classes:

class SVR(sklearn.base.RegressorMixin, sklearn.svm._base.BaseLibSVM)
 |  SVR(*, kernel='rbf', degree=3, gamma='scale', coef0=0.0, tol=0.001, C=1.0, epsilon=0.1, shrinking=True, cache_size=200, verbose=False, max_iter=-1)
 |  
 |  Epsilon-Support Vector Regression.
 |  
 |  The free parameters in the model are C and epsilon.
 |  
 |  The implementation is based on libsvm. The fit time complexity
 |  is more than quadratic with the number of samples which makes it hard
 |  to scale to datasets with more than a couple of 10000 samples. For large
 |  datasets consider using :class:`~sklearn.svm.LinearSVR` or
 |  :class:`~sklearn.linear_model.SGDRegressor` instead, possibly after a
 |  :class:`~sklearn.kernel_approximation.Nystroem` transformer.
 |  
 |  Read more in the :ref:`User Guide <svm_regression>`.
 |  
 |  Parameters
 |  ----------
 |  kernel : {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'} or callable,          default='rbf'


In [28]:
svm_model = svm.SVR(kernel='rbf', gamma=1, C=1000)

In [None]:
svm_model.fit(X_train, y_train)
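
Instead of trying values for `gamma` and `C` by hand, the powers of ten can be searched systematically. The following is only a sketch of one possible approach, using `GridSearchCV` on synthetic stand-in data; the data, the grid values, and the names `X_demo`, `y_demo`, `pipeline`, and `search` are assumptions for illustration and are not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption, not the rental bike dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 2))
y_demo = 100 * np.sin(X_demo[:, 0]) + 50 * X_demo[:, 1] + rng.normal(0, 5, size=200)

# Scaling inside the pipeline is refitted on each cross-validation fold
pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# Candidate values in powers of ten; make_pipeline names the SVR step "svr"
param_grid = {
    "svr__gamma": [0.01, 0.1, 1],
    "svr__C": [1, 10, 100, 1000],
}

# scikit-learn maximizes scores, so the RMSE is reported with a negative sign
search = GridSearchCV(pipeline, param_grid, scoring="neg_root_mean_squared_error", cv=3)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best combination
```

On the full dataset, such a search can take a while because the SVR fit time grows more than quadratically with the number of samples, so a coarse grid is a reasonable starting point.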

### Apply the model

After training the model, you can use it to make predictions for the test set. Use `predict()` for this and store the predictions in `y_pred`.

In [30]:
y_pred = svm_model.predict(X_test)

### Validation

In the last step, use the RMSE (root mean squared error) to check how well your model predicts, and compare it with simply predicting the mean.

In [None]:
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred)).round()
print(rmse_test)

In [None]:
y_mean = np.repeat(np.mean(y_train), len(y_test))

rmse_baseline = np.sqrt(mean_squared_error(y_test, y_mean)).round()
print(rmse_baseline)
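
The RMSE and the mean baseline can also be traced on a tiny example. This sketch uses made-up target values chosen so that the arithmetic is easy to follow, and it uses `DummyRegressor(strategy="mean")` as an alternative way to build the same baseline; the toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical toy targets, not taken from the dataset
y_train_toy = np.array([100.0, 200.0, 300.0])
y_test_toy = np.array([150.0, 250.0])

# Baseline that always predicts the training mean (200); the features are ignored
baseline = DummyRegressor(strategy="mean").fit(np.zeros((3, 1)), y_train_toy)
y_base = baseline.predict(np.zeros((2, 1)))

# RMSE by hand: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_test_toy - y_base) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_test_toy, y_base))

print(rmse_manual, rmse_sklearn)
```

Both errors are 50 in absolute value, so both computations give an RMSE of 50. A trained model is only useful if its RMSE is clearly below that of such a baseline.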