# LM - Week 05 - BikeSharing Extended

## Introduction
In last week's tutorial we explored the basics of linear regression and applied it to a [__Bikesharing dataset__](https://www.kaggle.com/c/bike-sharing-demand/overview) in order to predict the number of bike rentals based on weather information.  
This week we will extend our toolbox with useful methods, such as __categorical predictors__ and __non-linear transformations__, to further improve our models' performance and make use of all the information at hand. 

## Categorical predictors



In [None]:
import numpy as np
import random
# Set seed for reproducibility
np.random.seed(42)  # Set seed for NumPy
random.seed(42) # Set seed for random module

### Load data

In [None]:
import pandas as pd

# Loading the data from a csv file
data = pd.read_csv("https://raw.githubusercontent.com/kbrennig/MODS_WS24_25/refs/heads/main/data/BikeSharing.csv")

### Explore data

If we look at the data again, we see that there are columns that we deliberately left out last time, e.g. __season__ or __workingday__. These are so-called __categorical__ variables, which take one value from a finite set of discrete values (as opposed to numerical variables, which can take virtually any value on the number line).

*Run the code below.*

In [None]:
data.describe()

#### __Create training and test sets__

Now let's create our training and test sets from our data.

*Run the code below.*

In [None]:
from sklearn.model_selection import train_test_split
data_training, data_test = train_test_split(data, test_size=0.2, random_state=42)
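
As a quick illustration of what the split does, the sketch below applies `train_test_split` to a small made-up DataFrame (`toy` and its values are hypothetical and not part of the tutorial data). With `test_size=0.2`, 20% of the rows end up in the test set, and a fixed `random_state` makes the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data with 10 rows, only for illustration
toy = pd.DataFrame({"temp": range(10), "count": range(100, 110)})

# Splitting twice with the same random_state
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)
toy_train_again, toy_test_again = train_test_split(toy, test_size=0.2, random_state=42)

# 8 rows for training, 2 rows for testing
print(len(toy_train), len(toy_test))
# The same random_state selects the same rows again
print(toy_test.index.equals(toy_test_again.index))
```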

### Model 

When it comes to training our linear regression model, not much has changed compared to last week. We can add categorical variables by putting `C()` around the predictor.

*Run the code below.*

In [None]:
from statsmodels.formula.api import ols

model_01 = ols(formula="count ~ temp + C(season)", data=data_training)
model_01 = model_01.fit()

print(model_01.summary(slim=True))
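
Conceptually, `C(season)` expands the column into one indicator (dummy) column per level except a reference level, which by default is the first one. The sketch below imitates this idea with `pd.get_dummies` on a made-up column (`toy` is hypothetical); it illustrates the principle and is not the code path that statsmodels uses internally.

```python
import pandas as pd

# Hypothetical season column with the four levels 1 to 4
toy = pd.DataFrame({"season": [1, 2, 3, 4, 2]})

# One indicator column per level, with the first level (season 1) dropped as reference
dummies = pd.get_dummies(toy["season"], prefix="season", drop_first=True, dtype=int)
print(dummies)
```

A row belonging to season 1 has zeros in all three indicator columns, which is why the coefficients of the remaining levels are read as differences to season 1.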

## Non-linear transformations

Non-linear transformations are another very powerful method to improve our model's performance and fit to the data, since it is unrealistic to expect linear relationships in most real-world problems. 

### Explore data

If we look at the data again and search for skewed distributions, we can see that the __casual__ variable seems to be strongly right-skewed, since its __mean__ is more than double its __median__. The __casual__ variable is the number of bikes rented by non-registered customers (as opposed to __registered__, the number rented by registered customers; __count__ = __casual__ + __registered__).

*Run the code below.*

In [None]:
data.describe()

To further verify our hypothesis we can also plot a histogram of the original data and compare it to the log-transformed data. The `np.log()` function computes the natural logarithm, i.e. the logarithm to the base __e__. 

*Run the code below.*

In [None]:
import matplotlib.pyplot as plt
plt.hist(data["casual"], bins=12)
plt.show()

In [None]:
import numpy as np

plt.hist(np.log(data["casual"]+1), bins=12)
plt.show()
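
The visual impression from the histograms can also be checked numerically. The following sketch uses a small made-up sample (`casual_toy` is hypothetical, not the tutorial data) to show how a long right tail pulls the mean above the median, and how the log transformation reduces the skewness.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed counts: many small values, a few very large ones
casual_toy = pd.Series([0, 0, 1, 1, 2, 3, 5, 8, 40, 120], dtype=float)
casual_toy_log = np.log(casual_toy + 1)

# The mean is pulled upwards by the large values, the median is not
print("mean:", casual_toy.mean(), "median:", casual_toy.median())

# Sample skewness before and after the log transformation
raw_skew = casual_toy.skew()
log_skew = casual_toy_log.skew()
print("skew raw:", round(raw_skew, 2), "skew log:", round(log_skew, 2))
```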

### Transform data

Keep in mind that we will deviate from our standard output variable __count__ and try to predict the variable __casual__ now, since it serves as a good example.

#### __Create training and test sets__

*Run the code below.*

In [None]:
data_training_log, data_test_log = train_test_split(data, test_size=0.2, random_state=42)

### Model

#### __Log Transformation of the dependent variable__

In the following we add a new column to the dataset in which the values of __casual__ are log-transformed. You might wonder about the +1 in `np.log(data_training_log["casual"]+1)`.  
The reason is that log(0) is mathematically undefined, so rows in which __casual__ equals 0 could not be transformed and used in the regression.  
It is therefore common to add 1 before taking the logarithm.

*Run the code below.*

In [None]:
from statsmodels.graphics.regressionplots import abline_plot

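# Working on an explicit copy avoids a possible SettingWithCopyWarning from pandas
data_training_log = data_training_log.copy()
# Transforming the casual variable using the natural logarithm
data_training_log["casual_log"] = np.log(data_training_log["casual"]+1)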

# Fitting the linear regression model
model_02 = ols(formula="casual_log ~ temp", data=data_training_log)
model_02 = model_02.fit()

# Displaying the actual observations of temp and casual_log
ax = data_training_log.plot.scatter(x='temp', y='casual_log')
# Plotting the regression line
abline_plot(model_results=model_02, ax=ax, color='red')
plt.show()

In [None]:
# Fitting the linear regression model
model_03 = ols(formula="casual ~ temp", data=data_training_log)
model_03 = model_03.fit()

# Displaying the actual observations of temp and casual
ax = data_training_log.plot.scatter(x='temp', y='casual')
# Plotting the regression line
abline_plot(model_results=model_03, ax=ax, color='red')
plt.show()

#### __Print fitted models__

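Since we used the logarithm of __casual__, the interpretation of the predictor coefficients changes accordingly. The sign of a coefficient still tells us whether an independent variable has a __positive__ or __negative__ influence, but its value no longer gives the magnitude __directly__ in numbers of bikes: a one-unit increase in the predictor changes log(__casual__ + 1) by the coefficient, i.e. it multiplies __casual__ + 1 by exp(coefficient).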
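
Although the magnitude cannot be read off directly in numbers of bikes, a coefficient on the log scale can be translated into a multiplicative effect. The sketch below uses made-up, noise-free data with an assumed slope of 0.05 (`beta_true` is hypothetical and not an estimate from the tutorial data) and `np.polyfit` instead of `ols`, purely to keep the example self-contained.

```python
import numpy as np

# Hypothetical predictor values and a log-scale outcome without noise
x = np.arange(0, 10, dtype=float)
beta_true = 0.05                      # assumed slope on the log scale
y_log = 1.0 + beta_true * x

# Fit a straight line to the log-scale outcome
slope, intercept = np.polyfit(x, y_log, deg=1)

# On the original scale, each extra unit of x multiplies the outcome by exp(slope)
multiplier = np.exp(slope)
pct_change = (multiplier - 1) * 100
print(f"slope = {slope:.3f}, multiplier = {multiplier:.4f}, change = {pct_change:.2f} %")
```

For small coefficients the percentage change is close to 100 times the coefficient, which gives a convenient rule of thumb when reading the summary of a log-transformed model.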

In [None]:
print(model_02.summary(slim=True))

In [None]:
print(model_03.summary(slim=True))

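Comparing the two summaries shows how differently the two models fit the data. Keep in mind that the R-squared values refer to different dependent variables (__casual_log__ vs. __casual__) and are therefore not directly comparable; the RMSE on the original scale, calculated further below, is the fairer comparison.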

#### __Make one prediction__

Let's try to predict the outcome for a specific temperature. We want to know how many bikes casual users would rent if the temperature were 35 degrees.

Again: since we modelled the logarithm of __casual__, the model's predictions are also on the logarithmic scale.

To undo this, we apply the exponential function to the prediction and subtract 1 at the end.

*Run the code below.*

In [None]:
new_data = pd.DataFrame({"temp":[35]})
prediction = model_02.predict(new_data)
print(np.exp(prediction) -1)
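
NumPy also offers `np.log1p` and `np.expm1`, which compute log(x + 1) and exp(x) - 1 in one step. The sketch below checks on a few made-up counts (`casual` here is a hypothetical array, not the tutorial column) that the forward and backward transformations used in this section cancel each other out.

```python
import numpy as np

# Hypothetical rental counts, including a zero
casual = np.array([0, 1, 9, 99], dtype=float)

casual_log = np.log1p(casual)        # same as np.log(casual + 1)
casual_back = np.expm1(casual_log)   # same as np.exp(casual_log) - 1

# Both variants agree, and the round trip recovers the original values
print(np.allclose(casual_log, np.log(casual + 1)))
print(np.allclose(casual_back, casual))
```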

#### __Make predictions for all entries in test set and calculate the RMSE__

Now we can make predictions on our held out test set and calculate the RMSE.

*Run the code below.*

In [None]:
from sklearn.metrics import root_mean_squared_error

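# Working on an explicit copy avoids a possible SettingWithCopyWarning from pandas
data_test_log = data_test_log.copy()
data_test_log["casual_log_pred"] = model_02.predict(data_test_log["temp"])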
# Transforming the predicted values back to the original scale
data_test_log["casual_pred"] = np.exp(data_test_log["casual_log_pred"]) - 1

rmse = root_mean_squared_error(data_test_log["casual"], data_test_log["casual_pred"])
print(rmse)
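
For reference, the RMSE is the square root of the mean squared difference between observed and predicted values. The sketch below computes it by hand with NumPy on made-up numbers (`y_true` and `y_pred` are hypothetical), which can help when checking results or when `root_mean_squared_error` is not available in the installed scikit-learn version.

```python
import numpy as np

# Hypothetical observed and predicted values
y_true = np.array([3.0, 5.0, 8.0])
y_pred = np.array([2.0, 5.0, 10.0])

# Root of the mean of the squared errors
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(round(rmse_manual, 4))
```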

*Adapt the code and calculate the RMSE of the other model (model_03) on the test data.*

*Write your code below.*

In [None]:
# Enter your code here!

#### __Polynomial terms__

Another possibility to model non-linearities is to extend our linear models by polynomial terms. 

*Run the code below.*

In [None]:
model_04 = ols(formula="casual ~ temp + I(temp**2)", data=data_training_log)
model_04 = model_04.fit()

print(model_04.summary(slim=True))
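
In the formula above, `I(temp**2)` tells the formula interface to evaluate `temp**2` as ordinary arithmetic, so the model simply receives an additional column with the squared values. The sketch below reproduces this idea with plain NumPy on made-up, noise-free data (the coefficients 1.0, 2.0 and -0.3 are assumptions for illustration) and shows that the model is still linear in its coefficients.

```python
import numpy as np

# Hypothetical predictor and a noise-free quadratic relationship
x = np.linspace(0, 10, 21)
y = 1.0 + 2.0 * x - 0.3 * x**2

# Design matrix with intercept, linear term and squared term
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares recovers the three coefficients
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 3))
```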

#### __Interaction terms__

Last but not least, we can use interaction terms whenever we expect the effect of one independent variable on the outcome to depend on the value of another independent variable.

*Run the code below.*

In [None]:
model_05 = ols(formula="casual ~ temp + humidity + temp:humidity", data=data_training_log)
model_05 = model_05.fit()

print(model_05.summary(slim=True))
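
An interaction term such as `temp:humidity` corresponds to an extra column containing the product of the two predictors. The sketch below uses made-up, noise-free data (`x1`, `x2` and all coefficients are hypothetical) to show the consequence: the slope of one predictor is no longer constant but changes with the level of the other.

```python
import numpy as np

# Hypothetical grid of two predictors
x1, x2 = np.meshgrid(np.arange(4.0), np.arange(3.0))
x1, x2 = x1.ravel(), x2.ravel()

# Noise-free outcome with an assumed interaction effect of 0.5
y = 2.0 + 3.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2

# Design matrix: intercept, both main effects and their product
X = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2, b12 = coef

# The slope of x1 depends on the value of x2
slope_at_0 = b1 + b12 * 0.0
slope_at_2 = b1 + b12 * 2.0
print(round(slope_at_0, 3), round(slope_at_2, 3))
```

In the formula syntax, `temp*humidity` is a shorthand for `temp + humidity + temp:humidity`.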

## Summary

To sum up, let us look at what we did in this week's tutorial:

1. In the first part, we learned how to incorporate __categorical variables__.
2. After that, we looked at how __non-linear transformations__ can be used to model the relationship between dependent and independent variables more accurately.


*You can adjust the code in the cell below to build and evaluate different models*

In [None]:
# Enter your code here!

# Bonus: Transform categorical data

For now our data shows us on which date bikes were rented but does not specify the weekday.  
But what if there is a pattern that shows us that more bikes were rented on Sundays?  

In this case we can extract the information __weekday__ from our object __datetime__ by using the function `.dt.day_name()`.     
Same thing goes for the extraction of the specific time at which a bike was rented by using the attribute `.dt.hour`.    

In the end, we have to dummy encode the categorical variables, since many of them (seasons, working days, hours, etc.) are stored as numbers and are therefore interpreted as numerical variables by default.
For each column that should be dummy encoded, the `OneHotEncoder` checks how many different values are present in that column and creates a new column for each of them. The values of a newly created column are either 1 or 0, depending on the original value of the row.
For example, if you dummy encode __season__ and drop the first level, you get the three columns __season_2__, __season_3__ and __season_4__. To understand the values let's look at __season_2__: a row contains `1` in this column if its value in the column __season__ was `2`, and `0` otherwise. 

To implement this in Python we use the class `OneHotEncoder` of scikit-learn (sklearn). 
First you have to create an object of this class, which can take the following arguments:
- __drop:__ With `'first'`, the first factor level of each column is dropped; in our case the first season. This works because all the information of the season column can be represented without one factor level: if all dummy columns of a row contain 0, the row must belong to season 1, which is the dropped factor level.
- __handle_unknown:__ Specifies how to deal with new data that contains a category which was not present when the `OneHotEncoder` was fitted. For example, imagine our training set only contains the first two seasons of the year. An encoder fitted on this data does not know the other two seasons, so we have to specify how they are handled when we transform the test data. In our case we chose `infrequent_if_exist`, which maps an unknown category to the infrequent category if one exists and otherwise encodes it as all zeros.
- __sparse_output:__ We set it to `False` because by default the encoder returns a sparse matrix, which cannot easily be converted into a pandas DataFrame.

Now that we created the object we can fit it to our data and simultaneously transform our training dataset by calling `fit_transform` on our OneHotEncoder. As input we pass it a subset of our training data consisting of all the columns we want to transform.

For our test set we proceed analogously, but we use the `transform` method instead of `fit_transform` because now we only want the method to transform the learned categories and deal with the unknown as specified in the argument `handle_unknown` of the `OneHotEncoder`. By fitting the OneHotEncoder on the training set and only transforming the test set we ensure that no information is leaked from our test set into our training set.

In [None]:
data["datetime"] = pd.to_datetime(data["datetime"])
data["weekday"] = data["datetime"].dt.day_name()                   # Monday, Tuesday, ...
data["hour"] = data["datetime"].dt.hour
data["month"] = data["datetime"].dt.month
data["year"] = data["datetime"].dt.year
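
To see what the `.dt` accessor returns, the same extraction can be run on two made-up timestamps (`toy` is a hypothetical DataFrame, not the tutorial data).

```python
import pandas as pd

# Two hypothetical timestamps given as strings
toy = pd.DataFrame({"datetime": ["2011-01-01 05:00:00", "2011-01-02 17:00:00"]})

# Strings have to be converted before the .dt accessor can be used
toy["datetime"] = pd.to_datetime(toy["datetime"])
toy["weekday"] = toy["datetime"].dt.day_name()   # name of the weekday as text
toy["hour"] = toy["datetime"].dt.hour            # hour of the day as integer
print(toy)
```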

In [None]:
from sklearn.preprocessing import OneHotEncoder
# Dummy encoding / One-hot encoding

data_training_cat, data_test_cat = train_test_split(data, test_size=0.2, random_state=42)

# Creating the OneHotEncoder object
encoder = OneHotEncoder(drop='first', handle_unknown='infrequent_if_exist', sparse_output=False)

# Fit and transform the training data
encoded_training = encoder.fit_transform(data_training_cat[["holiday", "season", "weekday"]])
# Create a DataFrame with the encoded variables; reusing the index of the split keeps the rows aligned in the join
encoded_training_df = pd.DataFrame(encoded_training, columns=encoder.get_feature_names_out(["holiday", "season", "weekday"]), index=data_training_cat.index)
# Join the encoded variables to the original DataFrame and remove the original columns
data_training_cat = data_training_cat.join(encoded_training_df).drop(columns=["holiday", "season", "weekday"])


# Transform the test data (the encoder is not fitted again)
encoded_test = encoder.transform(data_test_cat[["holiday", "season", "weekday"]])
encoded_test_df = pd.DataFrame(encoded_test, columns=encoder.get_feature_names_out(["holiday", "season", "weekday"]), index=data_test_cat.index)
data_test_cat = data_test_cat.join(encoded_test_df).drop(columns=["holiday", "season", "weekday"])
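
One detail worth keeping in mind: `join` aligns rows by index label, not by position, and the rows of a shuffled split keep their original index labels. The sketch below uses a made-up two-row frame (`frame` and `encoded` are hypothetical) to show what happens when the encoded array is wrapped in a DataFrame with and without the matching index.

```python
import numpy as np
import pandas as pd

# Hypothetical shuffled split: the index labels are not 0..n-1
frame = pd.DataFrame({"temp": [20.0, 25.0]}, index=[7, 3])
encoded = np.array([[1.0], [0.0]])

# Default RangeIndex (0, 1): no label matches, so the joined column is empty
misaligned = frame.join(pd.DataFrame(encoded, columns=["season_2"]))
# Reusing the index of the frame keeps every row with its own encoding
aligned = frame.join(pd.DataFrame(encoded, columns=["season_2"], index=frame.index))

print(misaligned["season_2"].isna().all())
print(aligned["season_2"].tolist())
```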

In [None]:
print(data_training_cat.columns)
print(data_training_cat.head())

Below you can see the dataset containing the dummy encoded variables instead of the original __holiday__, __season__ and __weekday__ columns.

In [None]:
print(data_training_cat.describe())

In [None]:
# At last we can train one model with many features
# Deselecting columns that should not be used as features
subset_train_X = data_training_cat.drop(columns=["datetime", "count", "hour", "month", "year", "casual", "registered", "atemp"])
feature_columns = list(subset_train_X.columns)

# Here we build the formula by constructing a string
# Add all selected features to the formula
formula_str = "count ~ " + " + ".join(feature_columns)

# Add an interaction term between temp and humidity
formula_str += " + temp:humidity"
print(formula_str)

In [None]:
model_06 = ols(formula=formula_str, data=data_training_cat)
model_06 = model_06.fit()
print(model_06.summary(slim=True))