## Zero Inflated Models

Some regression datasets contain an unusually high number of zeros in the target. This is typical when you want to predict a count of rare events, such as
* Defects in manufacturing
* The number of natural disasters
* The number of crimes in a neighborhood

Usually nothing happens, meaning the target count is zero, but sometimes we actually have to do some modelling work.

**The classical machine learning algorithms can have a hard time dealing with such datasets.** 

Take linear regression, for example: the chance of it outputting an exact zero is vanishingly small. Sure, you can get regions where the prediction is close to zero, but modelling an output of exactly zero is infeasible in general. The same goes for neural networks.
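
As a quick illustration of this point, the sketch below fits a linear regression on synthetic data and counts how many predictions are exactly zero. The data-generating process and all variable names are made up for this example and are not part of the bike dataset used later.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical zero-inflated target: zero unless the feature exceeds 0.7
X_demo = rng.uniform(0, 1, size=(1000, 1))
y_demo = np.where(X_demo[:, 0] > 0.7, rng.poisson(20, size=1000), 0)

pred = LinearRegression().fit(X_demo, y_demo).predict(X_demo)

n_true_zeros = int((y_demo == 0).sum())
n_pred_zeros = int((pred == 0).sum())
print("true zeros:", n_true_zeros, "| predictions that are exactly zero:", n_pred_zeros)
```

Most targets are zero, yet the fitted line only passes near zero instead of returning it.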

What we can do to circumvent these problems is the following:
* Train a classifier to tell us whether the target is zero, or not.
* Train a regressor on all samples with a non-zero target.

In [124]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import SplineTransformer, OneHotEncoder, MinMaxScaler
from sklearn.model_selection import TimeSeriesSplit

from sklearn.model_selection import cross_val_score, cross_validate

from sklearn.metrics import mean_squared_error, mean_poisson_deviance, accuracy_score

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline

from sklearn.linear_model import PoissonRegressor
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor, HistGradientBoostingClassifier

import sklego
from sklego.meta import ZeroInflatedRegressor

In [2]:
print('scikit-lego VERSION',sklego.__version__)

scikit-lego VERSION 0.6.14


## Data

#### This dataset is not zero inflated. To demonstrate zero-inflated models, we will artificially inflate the zeros.

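Link: https://www.kaggle.com/datasets/hmavrodiev/london-bike-sharing-dataset

Alternatively, you can use a similar bike sharing dataset from OpenML (note that its column names differ from the ones described below):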

<pre>
from sklearn.datasets import fetch_openml

bike_sharing = fetch_openml(
    "Bike_Sharing_Demand", version=2, as_frame=True, parser="pandas"
)
df = bike_sharing.frame
</pre>

* "`cnt`" - the count of new bike shares
* "`t1`" - real temperature in C
* "`t2`" - "feels like" temperature in C
* "`hum`" - humidity in percentage
* "`wind_speed`" - wind speed in km/h
* "`weather_code`" - category of the weather
* "`is_holiday`" - boolean field - 1 holiday / 0 non holiday
* "`is_weekend`" - boolean field - 1 if the day is weekend
* "`season`" - category field meteorological seasons: 0-spring ; 1-summer; 2-fall; 3-winter.
* "`hour`" - hour of the day


#### **Objective**: predict future bike shares.

In [3]:
df = pd.read_csv('./data/london_bikes.csv')
df.head()

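| | cnt | t1 | hum | wind_speed | weather_code | is_holiday | is_weekend | season | hour |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 182 | 3.0 | 93.0 | 6.0 | 3 | 0 | 1 | 3 | 0 |
| 1 | 138 | 3.0 | 93.0 | 5.0 | 1 | 0 | 1 | 3 | 1 |
| 2 | 134 | 2.5 | 96.5 | 0.0 | 1 | 0 | 1 | 3 | 2 |
| 3 | 72 | 2.0 | 100.0 | 0.0 | 1 | 0 | 1 | 3 | 3 |
| 4 | 47 | 2.0 | 93.0 | 6.5 | 1 | 0 | 1 | 3 | 4 |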


## Features and labels

In [4]:
y = df.pop('cnt')
X = df

## Let's artificially inflate zeros

In [5]:
# Set every count below 500 to zero to mimic a zero-inflated target
y_zero_inflated = y.copy()
y_zero_inflated[y < 500] = 0
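
To see how strongly a threshold like this inflates the zeros, it helps to compare the share of zeros before and after. The sketch below does this on a small, made-up series (`y_demo` is a hypothetical stand-in for the real `cnt` column).

```python
import pandas as pd

# Hypothetical counts standing in for the real `cnt` column
y_demo = pd.Series([182, 138, 950, 72, 1200, 47, 640, 510])

# Same rule as above: everything below 500 becomes zero
y_demo_inflated = y_demo.copy()
y_demo_inflated[y_demo < 500] = 0

zero_share_before = (y_demo == 0).mean()
zero_share_after = (y_demo_inflated == 0).mean()
print(f"zeros before: {zero_share_before:.0%}, after: {zero_share_after:.0%}")
```

Running the same two `.mean()` lines on `y` and `y_zero_inflated` gives the corresponding shares for the bike data.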

## Distribution of Count data

In [23]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 4), sharey=True, constrained_layout=True)
sns.histplot(y, bins=20, ax=ax[0])
sns.histplot(y_zero_inflated, bins=20, ax=ax[1], color='seagreen')
ax[0].set(title='Count data')
ax[1].set(title='Artificially zero inflated data')
# Highlight the first bin, which holds the inflated zeros
ax[1].patches[0].set_facecolor('salmon')

<img src='./plots/count-data-distribution.png'>

## Data Preprocessing 

In [61]:
# Categorical features to one-hot encode
cat_features = ['weather_code', 'is_holiday', 'is_weekend', 'season']

## Periodic spline : Modeling seasonal effects

* Seasonal effects can be modelled using periodic splines, which have equal function value and equal derivatives at the first and last knot.

* The spline's period is the distance between the first and last knot (which we specify manually, if known).

* Periodic splines provide a better fit both within and outside of the range of training data given the additional information of periodicity. 

* Periodic splines can also be useful for naturally periodic features (such as day of the year), as the smoothness at the boundary knots prevents a jump in the transformed values (e.g. from Dec 31st to Jan 1st). 

* For naturally periodic features or more generally features where the period is known, it is advised to explicitly pass this information to the SplineTransformer by setting the knots manually.



In [78]:
n_knots = 13
period = 24
knots = np.linspace(0, period, n_knots)[:,None]
spline = SplineTransformer(n_knots=n_knots, knots=knots, degree=3, extrapolation='periodic')

In [90]:
spline_features = pd.DataFrame(data=spline.fit_transform(df[['hour']]), columns=[f'spline-{i}' for i in range(12)])
spline_features.head()

Unnamed: 0,spline-0,spline-1,spline-2,spline-3,spline-4,spline-5,spline-6,spline-7,spline-8,spline-9,spline-10,spline-11
0,0.166667,0.666667,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.020833,0.479167,0.479167,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.166667,0.666667,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.020833,0.479167,0.479167,0.020833,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.166667,0.666667,0.166667,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [110]:
for col in spline_features.columns:
    plt.plot(df['hour'][:24], spline_features[col][:24])

<img src='./plots/spline-features-plot.png'>

## Feature Pipeline

In [98]:
preprocess = ColumnTransformer(transformers=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'), cat_features),
    ('cyclic', SplineTransformer(n_knots=n_knots, knots=knots, degree=3, extrapolation='periodic'), ['hour'])
], remainder=MinMaxScaler())

## Model pipeline

In [99]:
def build_pipeline(model):
    pipe = make_pipeline(preprocess, model)
    return pipe

## Cross validation split

In [100]:
tscv = TimeSeriesSplit(n_splits=50, max_train_size=10000, test_size=336)
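
To make the effect of `max_train_size` and `test_size` concrete, the sketch below prints the index ranges that `TimeSeriesSplit` produces on a small, made-up series of 100 rows. The sizes are scaled down for readability. In the split above, `test_size=336` corresponds to 14 days of data, assuming one row per hour.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical series with 100 time-ordered rows
X_demo = np.arange(100).reshape(-1, 1)
tscv_demo = TimeSeriesSplit(n_splits=3, max_train_size=50, test_size=10)

folds = []
for train_idx, test_idx in tscv_demo.split(X_demo):
    # Record the first and last index of each train and test window
    fold = (int(train_idx.min()), int(train_idx.max()),
            int(test_idx.min()), int(test_idx.max()))
    folds.append(fold)
    print("train %d..%d | test %d..%d" % fold)
```

Each test window comes strictly after its training window, and the training window slides forward instead of growing once it reaches `max_train_size`.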

## Poisson Regression : Linear Model

In [105]:
model = make_pipeline(preprocess, PoissonRegressor(max_iter=600))

score = cross_val_score(estimator=model, X=X, y=y, scoring='neg_mean_poisson_deviance', cv=tscv)

print('Mean score :',-1* score.mean())


Mean score : 193.5139292883774
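
For reference, the score reported here is the mean Poisson deviance (lower is better). The sketch below computes it by hand on made-up numbers and compares the result with `sklearn.metrics.mean_poisson_deviance`. The helper name `poisson_deviance_by_hand` is hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_poisson_deviance

def poisson_deviance_by_hand(y_true, y_pred):
    """Mean of 2 * (y * log(y / y_hat) - y + y_hat), with y * log(y) taken as 0 at y = 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Replace the ratio by 1 where y is zero so that the log term vanishes there
    ratio = np.where(y_true > 0, y_true / y_pred, 1.0)
    log_term = y_true * np.log(ratio)
    return float(np.mean(2.0 * (log_term - y_true + y_pred)))

y_true_demo = [0, 2, 5, 1]
y_pred_demo = [0.5, 2.0, 4.0, 3.0]  # predictions must be strictly positive

by_hand = poisson_deviance_by_hand(y_true_demo, y_pred_demo)
from_sklearn = mean_poisson_deviance(y_true_demo, y_pred_demo)
print(by_hand, from_sklearn)
```

A perfect prediction gives a deviance of zero, and predictions of exactly zero are not allowed, which matters later when the zero-inflated model is scored.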


## Random Forest : Tree based Regressor

In [106]:
model = make_pipeline(preprocess, RandomForestRegressor(criterion='poisson'))

score = cross_val_score(estimator=model, X=X, y=y, scoring='neg_mean_poisson_deviance', cv=tscv)

print('Mean score :',-1* score.mean())


Mean score : 57.96441961149053


## HistGradientBoostingRegressor : BoostingRegressor

In [107]:
model = make_pipeline(preprocess, HistGradientBoostingRegressor(loss='poisson'))

score = cross_val_score(estimator=model, X=X, y=y, scoring='neg_mean_poisson_deviance', cv=tscv)

print('Mean score :',-1* score.mean())

Mean score : 53.24384758900737


### How will the `HistGradientBoostingRegressor` perform on a zero-inflated dataset?

In [111]:
model = make_pipeline(preprocess, HistGradientBoostingRegressor(loss='poisson'))

score = cross_val_score(estimator=model, X=X, y=y_zero_inflated, scoring='neg_mean_poisson_deviance', cv=tscv)

print('Mean score :',-1* score.mean())

Mean score : 120.89981599294003


### Why did the model perform worse?

* The target now contains an unusually high number of zeros, so a single regressor has to fit both the excess zeros and the non-zero counts (here, 500 and above).
* We need a model that can handle zero inflation.

## Zero Inflated Model

#### 1. Train a classifier to tell us whether the target is zero, or not. 

#### 2. Train a regressor on all samples with a non-zero target.

#### We are going to use scikit-lego for this task
*   `pip install scikit-lego`
*   `from sklego.meta import ZeroInflatedRegressor`

In [123]:
# Note: unlike the models above, this one is fit on the raw features (no `preprocess` step)
model = ZeroInflatedRegressor(
    classifier=HistGradientBoostingClassifier(),
    regressor=HistGradientBoostingRegressor(loss='poisson')
)

In [125]:
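def custom_metric(est, x, y):
    y_pred = est.predict(x)
    # Samples the model predicts as non-zero
    mask = y_pred > 0
    return {
        # Poisson deviance needs strictly positive predictions, so score only those
        'mean_poisson_deviance': mean_poisson_deviance(y[mask], y_pred[mask]),
        # Share of predicted zeros whose true target is also zero
        'accuracy_score': accuracy_score(y[~mask], y_pred[~mask])
    }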
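
To make clear what this metric measures, the sketch below applies the same masking logic to a handful of made-up targets and predictions (all values are hypothetical).

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_poisson_deviance

# Hypothetical targets and predictions from a zero-inflated model
y_true = np.array([0, 0, 600, 800, 0, 550])
y_pred = np.array([0.0, 0.0, 650.0, 0.0, 520.0, 500.0])

# Split the samples by what the model predicted
mask = y_pred > 0

# Deviance is computed on the samples predicted as non-zero
deviance_on_predicted_nonzero = mean_poisson_deviance(y_true[mask], y_pred[mask])

# Accuracy is computed on the samples predicted as zero
accuracy_on_predicted_zero = accuracy_score(y_true[~mask], y_pred[~mask])

print(deviance_on_predicted_nonzero, accuracy_on_predicted_zero)
```

Three samples are predicted as zero and two of them are truly zero, so the accuracy part is 2/3. A true zero that is predicted as non-zero still counts towards the deviance part.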
    

In [126]:
score_obj = cross_validate(estimator=model, X=X, y=y_zero_inflated, scoring=custom_metric, cv=tscv)

In [128]:
score_obj.keys()

dict_keys(['fit_time', 'score_time', 'test_mean_poisson_deviance', 'test_accuracy_score'])

In [162]:
sns.boxplot(score_obj['test_mean_poisson_deviance'], orient='h', color='seagreen')
sns.swarmplot(score_obj['test_mean_poisson_deviance'], orient='h', color='k');
plt.title('Mean Poisson Deviance');

<img src='./plots/mean-poisson-deviance.png'>

In [159]:
sns.histplot(score_obj['test_accuracy_score'], bins=10)
plt.title("Accuracy score: predicting zeros");

<img src='./plots/accuracy-zero-prediction.png'>