## Linear models do not automatically capture interaction effects between input features. 

We use regression analysis to understand the relationships, patterns, and possible causal links in data. Often we are interested in how changes in the independent variables (the features) affect our outcome of interest.

### What is conditional dependence?
It describes the behavior of a specific variable when the others are held fixed.<br>

#### In linear models, the target value is modeled as a linear combination of the features.
#### The coefficients in a multiple linear model represent the relationship between a given feature `X` and the target `y`, assuming that all the other features remain constant.
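
As a sketch in symbols (the notation $\beta_j$, $x_j$ and $p$ is introduced here for illustration and is not from the original notebook), a linear model with $p$ features predicts

$$
\hat{y} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j ,
\qquad
\frac{\partial \hat{y}}{\partial x_j} = \beta_j .
$$

The effect of $x_j$ on the prediction is the constant $\beta_j$, whatever the values of the other features. This is the conditional reading of a coefficient, and it is also why such a model cannot represent an interaction unless a product term such as $x_j x_k$ is added as an extra feature.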

### What is marginal dependence?
It describes the behavior of a specific variable without holding the others fixed.<br>
For example, take the features `Sex`, `Age`, `Education` and the target `Wage`.<br>
When we plot `Age` vs `Wage`, what we see is a marginal dependence.
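
The difference between the two kinds of dependence can be made concrete with a minimal sketch on synthetic data. Everything below is an assumption made for illustration: the numbers are invented, `age`, `education` and `wage` are standardized toy variables rather than the real wage data, and the generating coefficients (1.0, 3.0, -0.5) are arbitrary. By construction, the marginal slope of `wage` on `age` is about 1.0 + 3.0 × (-0.5) = -0.5, while the conditional coefficient of `age` is 1.0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Synthetic toy data (assumed): in this sample, education decreases with age.
age = rng.normal(size=n)
education = -0.5 * age + rng.normal(scale=0.5, size=n)
wage = 1.0 * age + 3.0 * education + rng.normal(scale=0.1, size=n)

# Marginal dependence: regress wage on age alone, education is free to vary.
marginal_slope = np.polyfit(age, wage, deg=1)[0]

# Conditional dependence: regress wage on age and education together,
# so the age coefficient is read with education held fixed.
design = np.column_stack([age, education, np.ones(n)])
coef, *_ = np.linalg.lstsq(design, wage, rcond=None)
conditional_slope = coef[0]

print(f"marginal slope of wage on age:    {marginal_slope:.2f}")
print(f"conditional slope of wage on age: {conditional_slope:.2f}")
```

The two slopes even have opposite signs here, which is why a scatter plot of one feature against the target should not be read as that feature's coefficient in a multiple linear model.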

A feature may not be a good predictor of the target variable on its own, but in the presence of other features it can still help us model the target.

#### We can use the `PolynomialFeatures` class to model the interaction explicitly.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler, SplineTransformer

from sklearn.compose import ColumnTransformer
from sklearn.model_selection import TimeSeriesSplit, cross_validate, cross_val_score
from sklearn.pipeline import make_pipeline

from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_poisson_deviance

from sklearn.linear_model import Ridge, PoissonRegressor, RidgeCV
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor

from feature_engine.creation import CyclicalFeatures

## Bike Sharing Demand dataset

In [2]:
from sklearn.datasets import fetch_openml

bike_sharing = fetch_openml(
    "Bike_Sharing_Demand", version=2, as_frame=True, parser="pandas"
)
df = bike_sharing.frame

# The target of the prediction problem is the absolute count of bike rentals on an hourly basis.

# Rescale the target variable (number of hourly bike rentals) to a relative demand,
# so that the mean absolute error is more easily interpreted
# as a fraction of the maximum demand.

## Feature and target
y = df['count']/df['count'].max()

y_count = df.pop('count')

X = df

X.head()

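| | season | year | month | hour | holiday | weekday | workingday | weather | temp | feel_temp | humidity | windspeed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | spring | 0 | 1 | 0 | False | 6 | False | clear | 9.84 | 14.395 | 0.81 | 0.0 |
| 1 | spring | 0 | 1 | 1 | False | 6 | False | clear | 9.02 | 13.635 | 0.8 | 0.0 |
| 2 | spring | 0 | 1 | 2 | False | 6 | False | clear | 9.02 | 13.635 | 0.8 | 0.0 |
| 3 | spring | 0 | 1 | 3 | False | 6 | False | clear | 9.84 | 14.395 | 0.75 | 0.0 |
| 4 | spring | 0 | 1 | 4 | False | 6 | False | clear | 9.84 | 14.395 | 0.75 | 0.0 |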


In [3]:
# The "heavy_rain" category appears only 3 times in our data, so let's merge it into the "rain" category
X['weather'] = X['weather'].replace(to_replace='heavy_rain', value='rain')
X['weather'].value_counts()

clear    11413
misty     4544
rain      1422
Name: weather, dtype: int64

## Feature Engineering and Modeling pipeline

In [4]:
# cross validation split

tscv = TimeSeriesSplit(
    n_splits=5,
    gap=48,  # 2-day gap (48 hourly samples) between train and test
    max_train_size=10000,
    test_size=1000
)

def custom_scoring(est, x, y):
    y_pred = est.predict(x)
    # mean_poisson_deviance requires strictly positive predictions,
    # so all three metrics are computed only on samples where y_pred > 0.
    mask = y_pred > 0
    mae = mean_absolute_error(y[mask], y_pred[mask])
    mse = mean_squared_error(y[mask], y_pred[mask])
    mpd = mean_poisson_deviance(y[mask], y_pred[mask])
    return {'mean_absolute_error': mae, 'mean_squared_error': mse, 'mean_poisson_deviance': mpd}
    


def evaluate_pipeline(pipe, X, y, cv):

    score = cross_validate(pipe, X, y, cv=cv, scoring=custom_scoring)
    
    mae = np.mean(score['test_mean_absolute_error'])
    mse = np.mean(score['test_mean_squared_error'])
    mpd = np.mean(score['test_mean_poisson_deviance'])
    
    result = f'Mean absolute error : {mae}\nMean squared error : {mse}\nMean poisson deviance : {mpd}'

    print(result)
    return score  
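
To see what the splitter configured above produces, the following sketch prints the index ranges of each fold. It assumes 17,379 rows, which is the total of the `weather` value counts shown earlier; `n_rows` and `tscv_demo` are names introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_rows = 17_379  # assumed: sum of the weather value counts shown earlier

# Same settings as the splitter used in this notebook.
tscv_demo = TimeSeriesSplit(n_splits=5, gap=48, max_train_size=10_000, test_size=1_000)

# Only the number of rows matters for splitting, so a placeholder array is enough.
splits = list(tscv_demo.split(np.zeros((n_rows, 1))))

for fold, (train_idx, test_idx) in enumerate(splits):
    # Number of samples left out between the end of train and the start of test.
    skipped = test_idx[0] - train_idx[-1] - 1
    print(
        f"fold {fold}: train [{train_idx[0]}, {train_idx[-1]}] "
        f"({len(train_idx)} rows), skipped {skipped}, "
        f"test [{test_idx[0]}, {test_idx[-1]}] ({len(test_idx)} rows)"
    )
```

Each test fold always comes after its training window, so the model is never evaluated on hours that precede the data it was trained on, and the gap keeps the most recent two days out of training.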

In [5]:
categorical_columns = [col for col in X.select_dtypes(include='category')]