# Task 3: Modeling Optimization

**Course:** Introduction to Data Science
**Lecturer:** Prof. Dr. Hendrik Meth

**Group 2:**
- Linus Breitenberger
- Tristan Ruhm
- Prarichut Poachanuan
- Anushka Irphale
- Patryk Gadziosmki

<div style="width:100%;height:30px;background-color:#E31134"></div>

## 0. Importing Requirements

In [None]:
# importing libraries
import pandas as pd
from sklearn import linear_model
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import PowerTransformer

# 1. Reworking of Train and Test Data
Following the feedback we received on Task 2, we reworked the data to improve our linear regression model before tackling Task 3.

Our changes include:

- Removed 'temp' and kept 'atemp' during feature selection, as using both would cause multicollinearity.
- Removed the 'casual' and 'registered' labels, because we only use 'cnt'.
- Checked how 'mnth' is distributed (an anecdote from the lecture).
- Kept 'instant', because it appears to be relevant.
    
Since we changed our dataset, we also tested our models again to check whether they had improved.

<div style="width:100%;height:30px;background-color:#E31134"></div>


# 6. Summary of Task 3

## 1. Reworking and Optimization of train and test data

Since our final model for Task 2 did not seem to be optimal and some details had been overlooked, we decided to rework and optimize our train and test data.

Because of its high correlation with 'cnt', we decided to keep 'instant' in the data set. The features 'temp' and 'atemp' were strongly correlated with each other (multicollinearity), so we kept 'atemp', which had a marginally better correlation, and dropped 'temp'. We also removed the 'casual' and 'registered' labels, since we only use 'cnt'.
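The column changes described above can be sketched as follows. This is a minimal illustration on a small synthetic data frame, not our actual data set: the values are made up, and only the column names ('instant', 'temp', 'atemp', 'casual', 'registered', 'cnt') are taken from the text.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the bike rental data (all values are made up).
rng = np.random.default_rng(0)
n = 200
temp = rng.uniform(0.1, 0.9, n)
df = pd.DataFrame({
    "instant": np.arange(1, n + 1),
    "temp": temp,
    "atemp": temp + rng.normal(0, 0.02, n),  # almost identical to 'temp'
    "casual": rng.integers(0, 500, n),
    "registered": rng.integers(500, 4000, n),
})
df["cnt"] = df["casual"] + df["registered"]

# Inspect how strongly the two temperature features correlate
# with each other and with the label.
print(df[["temp", "atemp", "cnt"]].corr())

# Keep 'atemp'; drop 'temp' and the two partial labels.
reworked = df.drop(columns=["temp", "casual", "registered"])
print(reworked.columns.tolist())
```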

In the end, we tested our models with the new test data and obtained the following result:
- `MAE`: 923.977
- `R^2 value of the model`: 0.36870153666640637

The Mean Absolute Error (MAE) improved, whereas the coefficient of determination (R^2) deteriorated. We exported the reworked train and test data to CSV and used them for the `base linear regression` and the `model optimization`.


## 2. Baseline linear regression

### 2.1 Split

In our first step, we separated features and label for model building by splitting the training data into `train_features` and `train_labels` and the test data into `test_features` and `test_label`.
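A minimal sketch of such a split, assuming the label column is 'cnt' and every other column is a feature. The two small data frames are made-up stand-ins for the exported train and test CSV files.

```python
import pandas as pd

# Made-up stand-ins for the exported train and test data.
train = pd.DataFrame({
    "instant": [1, 2, 3, 4],
    "atemp": [0.30, 0.40, 0.50, 0.60],
    "cnt": [900, 1200, 1500, 1700],
})
test = pd.DataFrame({
    "instant": [5, 6],
    "atemp": [0.55, 0.45],
    "cnt": [1600, 1400],
})

# Everything except the label is a feature; the label is 'cnt'.
train_features = train.drop(columns=["cnt"])
train_labels = train["cnt"]
test_features = test.drop(columns=["cnt"])
test_label = test["cnt"]

print(train_features.columns.tolist())
```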

### 2.2 Linear Regression

For our initial model building, we employed linear regression to predict the bike rental count ('cnt') using labeled training data.

### Model Selection and Training:
We instantiated a linear regression model using the `linear_model.LinearRegression()` function. The model was trained on the training features (`train_features`) and labels (`train_labels`) using the `fit` method.

### Model Coefficients:
- The coefficients of the linear regression model, representing the weights assigned to each feature, are printed using `print(baseline_model.coef_)`. These coefficients provide insight into the contribution of each feature to the prediction of the target variable.

As in the previous task, this linear regression model serves as our baseline, providing a starting point for evaluation and potential refinement in subsequent stages of model development.
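The baseline steps can be sketched as follows. The data is synthetic and the evaluation with `r2_score` is an assumption for illustration; only `linear_model.LinearRegression()`, `fit`, `coef_`, `mean_absolute_error` and the variable names come from this notebook.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic data with a known linear relationship (values are made up).
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 2))
y = 1000 + 3000 * X[:, 0] + 1500 * X[:, 1] + rng.normal(0, 200, 300)

train_features, test_features = X[:200], X[200:]
train_labels, test_label = y[:200], y[200:]

# Instantiate and train the baseline model.
baseline_model = linear_model.LinearRegression()
baseline_model.fit(train_features, train_labels)

# One weight per feature.
print(baseline_model.coef_)

# Evaluate on the held-out data.
predictions = baseline_model.predict(test_features)
mae = mean_absolute_error(test_label, predictions)
r2 = r2_score(test_label, predictions)
print(mae, r2)
```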

## 2. Model Optimization

### 2.2 Model Building

To optimize the model further, we built additional models besides linear regression, since scikit-learn offers various other algorithms. By testing these algorithms, we can determine which model gives the best result. In the following, we present the results for each algorithm.

For each model, we used `forward selection` to choose the features that give the best result.
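The following sketch shows a simplified single-pass variant of forward selection: each candidate feature is tried once, in a fixed order, and kept only if R² on the test data improves. Classical forward selection would instead re-evaluate all remaining candidates at every step. The helper `forward_selection`, the synthetic data and the 'noise' feature are hypothetical and serve only as an illustration.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import r2_score

# Synthetic data: 'instant' and 'atemp' drive the label, 'noise' does not.
rng = np.random.default_rng(1)
n = 400
data = pd.DataFrame({
    "instant": np.arange(n, dtype=float),
    "atemp": rng.uniform(0, 1, n),
    "noise": rng.normal(0, 1, n),
})
data["cnt"] = 5 * data["instant"] + 2000 * data["atemp"] + rng.normal(0, 100, n)

# Hypothetical random split into train and test data.
shuffled = data.sample(frac=1.0, random_state=0)
train, test = shuffled.iloc[:300], shuffled.iloc[300:]


def forward_selection(model, candidates, train, test, label="cnt"):
    """Try each candidate once; keep it only if test R² improves."""
    selected, best_r2 = [], -np.inf
    for feature in candidates:
        trial = selected + [feature]
        model.fit(train[trial], train[label])
        r2 = r2_score(test[label], model.predict(test[trial]))
        decision = "keep" if r2 > best_r2 else "don't keep"
        print(f"{', '.join(trial)}: R² = {r2:.3f} ({decision})")
        if r2 > best_r2:
            selected, best_r2 = trial, r2
    return selected, best_r2


selected, best_r2 = forward_selection(
    linear_model.LinearRegression(), ["instant", "atemp", "noise"], train, test
)
print(selected, best_r2)
```

The same helper works with any scikit-learn regressor, since it only relies on `fit` and `predict`.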

#### 2.2.1 Polynomial Regression


For the polynomial regression, we used `make_pipeline()` together with `PolynomialFeatures()`. `PolynomialFeatures()` expands the input features into multiple polynomial terms, which are then passed to `linear_model.LinearRegression()` to obtain a single value as the prediction.

We fine-tuned the parameters to get the best possible result. The parameters `degree=2`, `interaction_only=False`, `include_bias=True` and `order='C'` gave the best outcome.
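A minimal sketch of such a pipeline with the parameter values listed above, fitted on synthetic data (the data and the variable names `poly_model`, `n_terms` and `r2` are assumptions for illustration):

```python
import numpy as np
from sklearn import linear_model
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a quadratic term and an interaction term.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = 1 + 2 * X[:, 0] ** 2 + 3 * X[:, 0] * X[:, 1] + rng.normal(0, 0.05, 200)

# Step 1 expands the features, step 2 fits a linear model on the expansion.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=False, include_bias=True, order="C"),
    linear_model.LinearRegression(),
)
poly_model.fit(X, y)

# Two input features become six terms: 1, x1, x2, x1^2, x1*x2, x2^2.
n_terms = poly_model.named_steps["polynomialfeatures"].n_output_features_
r2 = poly_model.score(X, y)
print(n_terms, r2)
```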


#### 2.2.2 K-Nearest-neighbours Regression

With K-nearest-neighbours regression, the 'cnt' value of a sample is predicted from its nearest neighbours in the feature space. For that, we used `KNeighborsRegressor()`, which predicts a numeric label as the average of the labels of these neighbours. A special aspect is that no linear relationship is required to build the model.

We used the parameters `weights='uniform'`, `algorithm='auto'`, `leaf_size` (default), `p=10`, `metric` (default) and `n_jobs=None`. For the number of nearest neighbours, we used `n_neighbors=3`, since a comparison with 3 values gave the best result in R².
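How the prediction is formed can be seen in a small sketch with made-up one-dimensional data. The parameter values follow the list above; `leaf_size` and `metric` are left at their scikit-learn defaults, and `p` is the power of the Minkowski distance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up training data with one feature.
X_train = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y_train = np.array([100.0, 200.0, 300.0, 1000.0, 1100.0])

knn_model = KNeighborsRegressor(
    n_neighbors=3, weights="uniform", algorithm="auto", p=10, n_jobs=None
)
knn_model.fit(X_train, y_train)

# The three nearest neighbours of 2.2 are x = 2, 3 and 1, so the
# prediction is the plain average of their labels: (200 + 300 + 100) / 3.
prediction = knn_model.predict(np.array([[2.2]]))
print(prediction)
```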

#### 2.2.3 Regression Tree / Decision Tree Regression

A regression tree can handle nominal features and non-linear relationships. We used the `DecisionTreeRegressor()` class.

For fine-tuning, we used the parameters `criterion='absolute_error'`, `splitter='best'`, `max_depth=5`, `min_samples_split` (default), `min_samples_leaf=5`, `min_weight_fraction_leaf=0.0` and `max_features=None`. We chose `max_features=None` so that all features are considered at each split.


#### 2.2.4 Support Vector Regression

Support Vector Regression fits a hyperplane in a feature space defined by the kernel, which can be linear, polynomial or Gaussian.

First, we used `SVR()` with a polynomial kernel, since polynomial regression had performed well in the evaluation. The result was not good, so we switched to the linear kernel, which performed better.

We used the parameters:
- `kernel='linear'`
- `C=100`: regularization parameter
- `epsilon`
- `max_iter=-1`: no limit on the number of iterations
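A minimal sketch of a linear-kernel SVR on synthetic data. Since no value for `epsilon` is given above, the sketch leaves it at the scikit-learn default, which is an assumption and not necessarily the value we used.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data with a linear relationship (values are made up).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.05, 200)

# epsilon is left at the library default (assumption, see text above).
svr_model = SVR(kernel="linear", C=100, max_iter=-1)
svr_model.fit(X, y)

# coef_ is only available for the linear kernel.
print(svr_model.coef_)
print(svr_model.score(X, y))
```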

### 2.3 Evaluation

### Polynomial Regression

<style>

    .heatMap tr:nth-child(11) { background: green; }
</style>

<div class="heatMap">

| Iteration | Selected Features | Performance (R²) | Decision |
|-----------|-------------------|------------------|----------|
| 1 | instant | 0.249 | - |
| 2 | instant, atemp | 0.381 | keep |
| 3 | instant, atemp, weathersit | 0.409 | keep |
| 4 | instant, atemp, weathersit, yr | 0.428 | keep |
| 5 | instant, atemp, weathersit, yr, season | 0.436553 | keep |
| 6 | instant, atemp, weathersit, yr, season, price reduction | 0.242 | don't keep |
| 7 | instant, atemp, weathersit, yr, season, leaflets | 0.312 | don't keep |
| 8 | instant, atemp, weathersit, yr, season, windspeed | 0.411 | don't keep |
| 9 | instant, atemp, weathersit, yr, season, workingday | 0.377 | don't keep |
| 10 | instant, atemp, weathersit, yr, season, weekday | 0.436904 | keep |
| 11 | instant, atemp, weathersit, yr, season, weekday, holiday | 0.437776 | keep |
| 12 | instant, atemp, weathersit, yr, season, weekday, holiday, mnth | 0.4194 | don't keep |

</div>


### K-Nearest-neighbours Regression
<style>

    .heatMap tr:nth-child(11) { background: green; }
</style>

<div class="heatMap">

| Iteration | Selected Features | Performance (R²) | Decision |
|-----------|-------------------|------------------|----------|
| 1 | instant | 0.414 | - |
| 2 | instant, atemp | 0.416 | keep |
| 3 | instant, atemp, weathersit | 0.416 | don't keep |
| 4 | instant, atemp, weekday | 0.412 | don't keep |
| 5 | instant, atemp, holiday | 0.416 | don't keep |
| 6 | instant, atemp, mnth | 0.416 | don't keep |
| 7 | instant, atemp, yr | 0.416 | don't keep |
| 8 | instant, atemp, season | 0.423 | keep |
| 9 | instant, atemp, season, price reduction | 0.423 | don't keep |
| 10 | instant, atemp, season, leaflets | 0.400 | don't keep |
| 11 | instant, atemp, season, windspeed | 0.424 | keep |
| 12 | instant, atemp, season, windspeed, workingday | 0.418 | don't keep |
</div>

### Regression Tree / Decision Tree
<style>

    .heatMap tr:nth-child(11) { background: green; }
</style>

<div class="heatMap">

| Iteration | Selected Features | Performance (R²) | Decision |
|-----------|-------------------|------------------|----------|
| 1 | instant | 0.418 | - |
| 2 | instant, season | 0.418 | don't keep |
| 3 | instant, atemp | 0.431 | keep |
| 4 | instant, atemp, windspeed | 0.430 | don't keep |
| 5 | instant, atemp, workingday | 0.432 | keep |
| 6 | instant, atemp, workingday, weathersit | 0.441 | keep |
| 7 | instant, atemp, workingday, weathersit, weekday | 0.442 | keep |
| 8 | instant, atemp, workingday, weathersit, weekday, holiday | 0.442 | don't keep |
| 9 | instant, atemp, workingday, weathersit, weekday, mnth | 0.442 | don't keep |
| 10 | instant, atemp, workingday, weathersit, weekday, yr | 0.442 | don't keep |
| 11 | instant, atemp, workingday, weathersit, weekday, price reduction | 0.442 | don't keep |
| 12 | instant, atemp, workingday, weathersit, weekday, leaflets | 0.438 | don't keep |
</div>

### Support Vector Regression

<style>

    .heatMap tr:nth-child(11) { background: green; }
</style>

<div class="heatMap">

| Iteration | Selected Features | Performance (R²) | Decision |
|-----------|-------------------|------------------|----------|
| 1 | instant | 0.272 | - |
| 2 | instant, weathersit | 0.301 | keep |
| 3 | instant, weathersit, workingday | 0.296 | don't keep |
| 4 | instant, weathersit, weekday | 0.282 | don't keep |
| 5 | instant, weathersit, atemp | 0.366 | keep |
| 6 | instant, weathersit, atemp, season | 0.370 | keep |
| 7 | instant, weathersit, atemp, season, windspeed | 0.370 | don't keep |
| 8 | instant, weathersit, atemp, season, holiday | 0.371 | keep |
| 9 | instant, weathersit, atemp, season, holiday, mnth | 0.389 | keep |
| 10 | instant, weathersit, atemp, season, holiday, mnth, yr | 0.389 | don't keep |
| 11 | instant, weathersit, atemp, season, holiday, mnth, price reduction | 0.390 | keep |
| 12 | instant, weathersit, atemp, season, holiday, mnth, price reduction, leaflets | 0.388 | don't keep |
</div>

### 2.4 Interpretation & Conclusion

For a better comparison, we also included the linear regression model.

<style>

    .heatMap tr:nth-child(3) { background: green; }
</style>

<div class="heatMap">

| Model | Features             | MAE | R²          |
|------------|--------------------------------|------------------|--------------------|
| Linear Regression    | Instant, season, yr, weathersit, atemp     | 911.851   | 0.3804996890160657                 |
| Polynomial Regression  | Instant, season, yr, mnth, holiday, weekday, weathersit, atemp           | 787.973            | 0.4194389747273023           |
| K-Nearest-Neighbours Regression    | Instant, season, atemp, windspeed       | 787.889            | 0.4241129170014134     |
| Regression Tree          | Instant, weekday, workingday, weathersit, atemp | 817.727            | 0.44207064714392696  |
| Support Vector Regression          | Instant, season, mnth, holiday, weathersit, atemp, price reduction          | 913.187            | 0.3900211198538589  |
</div>