# Megatutorial 4: Regression
In this megatutorial we turn to regression. To do so, we switch datasets: from now on we work with bikesharing.csv, which we already got to know in Orange.

## Scenario
A bike-sharing company has commissioned us to analyze a historical dataset so that a prediction model can be built from it. The company would like to predict the number of rented bikes one day in advance, using weather data and data about the day being predicted. The company can provide the following historical data:

* ``instant``: record index
* ``day``: day of date
* ``season``: season (spring, summer, fall, winter)
* ``yr``: year (2011, 2012)
* ``mnth``: month (1 to 12)
* ``hr``: hour (0 to 23)
* ``holiday``: whether the day is a holiday or not (yes, no)
* ``weekday``: day of the week (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday)
* ``workingday``: 1 if the day is neither a weekend nor a holiday, otherwise 0
* ``weathersit``: weather situation on the current day
* ``temp``: temperature in Celsius
* ``atemp``: feels-like temperature in Celsius
* ``hum``: humidity
* ``windspeed``: wind speed. The values are divided by 67 (max)
* ``casual``: count of casual users
* ``registered``: count of registered users
* ``cnt``: count of total rental bikes including both casual and registered

## Tasks
* Load the data into pandas.
* Get an overview of the data.
* Load scikit-learn.
* Develop two suitable regression models in scikit-learn.
* Evaluate your regression models using the training data.

In [39]:
from pandas import read_csv

# Transformers/functions for preprocessing
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Estimators for regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

# Metrics for regression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, root_mean_squared_error

## Loading the data

In [40]:
data = read_csv("../data/bikesharing.csv")

In [41]:
data.isna().any()

Unnamed: 0    False
season        False
holiday       False
weekday       False
workingday    False
weathersit    False
temp          False
atemp         False
hum            True
windspeed     False
casual        False
registered    False
cnt           False
day           False
month         False
year          False
dtype: bool
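
`isna().any()` only reports whether a column contains gaps at all. As a minimal sketch, the number and share of missing values per column could be inspected as follows. The frame `toy` and its values are invented for illustration and are not taken from bikesharing.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "temp": [8.2, 9.1, 1.2, 1.4],
    "hum": [0.81, np.nan, 0.44, np.nan],
})

# Number of missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)

# Share of missing values per column (mean of the boolean mask)
missing_share = toy.isna().mean()
print(missing_share)
```

Applied to the real `data` frame, the same two calls would show how many `hum` values have to be imputed.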

## Preprocessing

### Missing values

In [42]:
# Replace the missing humidity values with the median of the column
imputer_engine = SimpleImputer(strategy="median")
data["hum"] = imputer_engine.fit_transform(data[["hum"]])
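
To make the effect of the median strategy visible, here is a small sketch on an invented humidity column. The values are assumptions chosen so that the median is easy to check by hand.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical humidity column with two gaps
toy = pd.DataFrame({"hum": [0.2, np.nan, 0.4, 0.9, np.nan]})

imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(toy[["hum"]])  # 2D array of shape (5, 1)

# The median learned from the non-missing values 0.2, 0.4 and 0.9
learned_median = imputer.statistics_[0]
print(learned_median)

# Flatten to 1D before writing the values back into the column
toy["hum"] = filled.ravel()
print(toy["hum"].tolist())
```

In a stricter setup one would probably fit the imputer on the training split only and then apply `transform` to the test split, so that no information from the test rows enters the median.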

In [43]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  731 non-null    int64  
 1   season      731 non-null    object 
 2   holiday     731 non-null    object 
 3   weekday     731 non-null    object 
 4   workingday  731 non-null    object 
 5   weathersit  731 non-null    object 
 6   temp        731 non-null    float64
 7   atemp       731 non-null    float64
 8   hum         731 non-null    float64
 9   windspeed   731 non-null    float64
 10  casual      731 non-null    int64  
 11  registered  731 non-null    int64  
 12  cnt         731 non-null    int64  
 13  day         731 non-null    int64  
 14  month       731 non-null    int64  
 15  year        731 non-null    int64  
dtypes: float64(4), int64(7), object(5)
memory usage: 91.5+ KB


### Selecting target and features

In [44]:
data.columns

Index(['Unnamed: 0', 'season', 'holiday', 'weekday', 'workingday',
       'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'casual',
       'registered', 'cnt', 'day', 'month', 'year'],
      dtype='object')

In [45]:
features = [
    'season', 'holiday', 'weekday','weathersit',
    'temp', 'atemp', 'hum', 'windspeed', 'month'
]

target = [
    'cnt'
]

In [46]:
# .copy() gives X its own data, so later column assignments do not act on a slice of data
X = data[features].copy()
y = data[target]

### Label encoding

In [47]:
# LabelEncoder maps each category to an integer (labels are sorted alphabetically).
# Assigning to X[col] replaces the whole column, so it ends up with an integer dtype.
season_encoder_engine = LabelEncoder()
X["season"] = season_encoder_engine.fit_transform(X["season"])

holiday_encoder_engine = LabelEncoder()
X["holiday"] = holiday_encoder_engine.fit_transform(X["holiday"])

weekday_encoder_engine = LabelEncoder()
X["weekday"] = weekday_encoder_engine.fit_transform(X["weekday"])

weathersit_encoder_engine = LabelEncoder()
X["weathersit"] = weathersit_encoder_engine.fit_transform(X["weathersit"])
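
The following sketch shows which integers `LabelEncoder` assigns. The season labels are taken from the feature description above. Whether the CSV spells them exactly this way is an assumption.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed spelling of the season labels
seasons = ["spring", "summer", "fall", "winter", "summer"]

encoder = LabelEncoder()
codes = encoder.fit_transform(seasons)

# classes_ holds the sorted labels; the position is the assigned integer
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integers
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

The codes follow alphabetical order (fall, spring, summer, winter), not calendar order. A decision tree can cope with that, while a linear model reads the integers as an ordered scale, so one-hot encoding would be a common alternative there.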

### Hold-out resampling

In [48]:
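# Hold out 20 % of the rows for testing; without random_state the split
# (and therefore the scores) differ from run to run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)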
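
If the split should be repeatable, `train_test_split` accepts a `random_state`. A minimal sketch on invented arrays follows. The seed 42 is an arbitrary choice and is not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# The same seed yields the same partition on every call
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)

# True if all four parts of both splits are identical
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```

With a fixed seed, two models are compared on exactly the same test rows, and the reported scores can be reproduced later.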

## Modeling

In [54]:
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

predictions = linear_model.predict(X_test)

print(
    "R²", r2_score(y_test, predictions),
    "RMSE", root_mean_squared_error(y_test, predictions),
    "MAE", mean_absolute_error(y_test, predictions)
)


R² 0.5341878815932997 RMSE 1308.6327290097538 MAE 1131.6930858099993


In [56]:
tree_model = DecisionTreeRegressor(max_depth=5)
tree_model.fit(X_train, y_train)

predictions = tree_model.predict(X_test)

print(
    "R²", r2_score(y_test, predictions),
    "RMSE", root_mean_squared_error(y_test, predictions),
    "MAE", mean_absolute_error(y_test, predictions)
)

R² 0.5287973154682334 RMSE 1316.1829617144297 MAE 1071.5228154493652
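
To make the three reported metrics easier to interpret, the sketch below computes them by hand with NumPy on invented values and compares the results with scikit-learn. The numbers are assumptions for illustration only and are unrelated to the scores above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true rental counts and predictions
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true - y_pred

# MAE: mean of the absolute errors
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error
rmse = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus residual sum of squares over total sum of squares
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MAE", mae, "RMSE", rmse, "R²", r2)

# Cross-check against scikit-learn
sk_mae = mean_absolute_error(y_true, y_pred)
sk_rmse = np.sqrt(mean_squared_error(y_true, y_pred))
sk_r2 = r2_score(y_true, y_pred)
```

MAE and RMSE are expressed in the unit of the target, here rented bikes per day. RMSE weights large errors more heavily than MAE, and R² compares the model against always predicting the mean.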
