# Stream Ensemble Classification
---

## `NEweather` dataset

**Description:** The National Oceanic and Atmospheric Administration (NOAA)
has compiled a database of weather measurements from over 7,000 weather
stations worldwide. Records date back to the mid-1900s. Daily measurements
include a variety of features (temperature, pressure, wind speed, etc.) as
well as a series of indicators for precipitation and other weather-related
events. The `NEweather` dataset contains data from this database, specifically
from Offutt Air Force Base in Bellevue, Nebraska, covering 50 years
(1949-1999).

**Features:** 8 daily weather measurements
 
|       Attribute      | Description |
|:--------------------:|:-----------------------------|
| `temp`                   | Temperature
| `dew_pnt`                | Dew Point
| `sea_lvl_press`          | Sea Level Pressure
| `visibility`             | Visibility
| `avg_wind_spd`           | Average Wind Speed
| `max_sustained_wind_spd` | Maximum Sustained Wind Speed
| `max_temp`               | Maximum Temperature
| `min_temp`               | Minimum Temperature


**Class:** `rain` | 0: no rain, 1: rain
 
**Samples:** 18,159


In [1]:
import pandas as pd
from river.stream import iter_pandas
from river.metrics import Metrics, Accuracy, BalancedAccuracy, CohenKappa, GeometricMean, Rolling
from river.evaluate import progressive_val_score

In [2]:
data = pd.read_csv("../datasets/NEweather.csv")
features = data.columns[:-1]

## Online Bagging

---
[Online Bagging](https://riverml.xyz/0.10.1/api/ensemble/BaggingClassifier/) is an online bootstrap aggregation method for classification.
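
As a rough illustration of the resampling idea behind Online Bagging (not River's implementation), the sketch below lets every ensemble member learn each incoming sample `k` times, with `k` drawn from a Poisson(1) distribution, which approximates bootstrap sampling on a stream. `poisson`, `MajorityClass`, and `ToyOnlineBagging` are hypothetical helpers written for this example only; a real ensemble would use Hoeffding Trees as base learners.

```python
import math
import random
from collections import Counter


def poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


class MajorityClass:
    """Toy incremental learner (hypothetical): predicts the most frequent label seen."""

    def __init__(self):
        self.counts = Counter()

    def learn_one(self, x, y):
        self.counts[y] += 1

    def predict_one(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None


class ToyOnlineBagging:
    """Each member learns every sample k ~ Poisson(1) times."""

    def __init__(self, n_models, seed):
        self.rng = random.Random(seed)
        self.members = [MajorityClass() for _ in range(n_models)]

    def learn_one(self, x, y):
        for member in self.members:
            for _ in range(poisson(1.0, self.rng)):
                member.learn_one(x, y)

    def predict_one(self, x):
        votes = Counter(m.predict_one(x) for m in self.members)
        votes.pop(None, None)  # ignore members that have seen no data yet
        return votes.most_common(1)[0][0] if votes else None


# Feed a toy stream in which 70% of the labels are 1, then take a majority vote.
bag = ToyOnlineBagging(n_models=10, seed=42)
for label in [1, 1, 0, 1, 1, 0, 1, 1, 1, 0] * 20:
    bag.learn_one({}, label)
prediction = bag.predict_one({})
```

Because each member draws its own weights, the members end up trained on slightly different versions of the same stream, which is where the ensemble's diversity comes from.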

In [None]:
from river.ensemble import BaggingClassifier
from river.tree import HoeffdingTreeClassifier

model = BaggingClassifier(model=HoeffdingTreeClassifier(),
                          n_models=10,
                          seed=42)
metrics = Metrics(metrics=[Accuracy(),BalancedAccuracy(),GeometricMean(),CohenKappa()])
stream = iter_pandas(X=data[features], y=data['rain'])

progressive_val_score(dataset=stream,
                      model=model,
                      metric=metrics,
                      print_every=1000)

## Leveraging Bagging
---

[Leveraging Bagging](https://riverml.xyz/0.10.1/api/ensemble/LeveragingBaggingClassifier/) is an improvement over the Oza Bagging algorithm. It leverages bagging performance by increasing re-sampling, which it simulates with a Poisson distribution. To increase re-sampling, it uses a higher value of the Poisson parameter `w` (the average number of events), 6 by default. This increases the diversity of the input space by assigning a wider range of weights to the data samples.

To deal with concept drift, Leveraging Bagging uses the ADWIN algorithm to monitor the performance of each member of the ensemble. If concept drift is detected, the worst member of the ensemble (based on the error estimation by ADWIN) is replaced by a new (empty) classifier.
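
To get a feel for what raising the Poisson parameter from 1 (Oza Bagging) to 6 (the Leveraging Bagging default mentioned above) does to the sample weights, the following sketch evaluates the Poisson probability mass function directly. `poisson_pmf` is a helper defined here for illustration, not a River function.

```python
import math


def poisson_pmf(k, lam):
    """Probability that a Poisson(lam) variable equals k."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


for lam in (1, 6):
    p_skip = poisson_pmf(0, lam)  # weight 0: the member never sees the sample
    p_repeat = 1.0 - poisson_pmf(0, lam) - poisson_pmf(1, lam)  # weight >= 2
    print(f"lambda={lam}: P(skip)={p_skip:.4f}, "
          f"P(weight>=2)={p_repeat:.4f}, mean weight={lam}")
```

With a parameter of 1, a member skips a sample with probability e^-1 (about 37%). With a parameter of 6, that probability drops to e^-6 (about 0.25%), and most samples are presented several times with varying weights.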

In [None]:
from river.ensemble import LeveragingBaggingClassifier
from river.tree import HoeffdingTreeClassifier

model = LeveragingBaggingClassifier(model=HoeffdingTreeClassifier(),
                                    n_models=10,
                                    seed=42)
metrics = Metrics(metrics=[Accuracy(),BalancedAccuracy(),GeometricMean(),CohenKappa()])
stream = iter_pandas(X=data[features], y=data['rain'])

progressive_val_score(dataset=stream,
                      model=model,
                      metric=metrics,
                      print_every=1000)

## AdaptiveRandomForest
---



The 3 most important aspects of [ARF](https://riverml.xyz/0.10.1/api/ensemble/AdaptiveRandomForestClassifier/) are:
- inducing diversity through re-sampling
- inducing diversity through randomly selecting subsets of features for node splits
- drift detectors per base tree, which cause selective resets in response to drifts

It also allows training background trees, which start training if a warning is detected and replace the active tree if the warning escalates to a drift.
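
The background-tree mechanism described above can be sketched as a small state machine. This is a simplified illustration under stated assumptions, not River's code: `CountingLearner` and `ForestMember` are hypothetical classes, and the warning and drift signals are triggered by hand instead of coming from a real drift detector.

```python
class CountingLearner:
    """Hypothetical stand-in for a tree: it only counts the samples it has learned."""

    def __init__(self):
        self.n_seen = 0

    def learn_one(self, x, y):
        self.n_seen += 1


class ForestMember:
    """One ensemble slot holding an active learner and an optional background learner."""

    def __init__(self, new_learner):
        self._new_learner = new_learner
        self.active = new_learner()
        self.background = None

    def learn_one(self, x, y):
        self.active.learn_one(x, y)
        if self.background is not None:
            self.background.learn_one(x, y)

    def on_warning(self):
        # A warning starts a background learner (if none is training yet).
        if self.background is None:
            self.background = self._new_learner()

    def on_drift(self):
        # A confirmed drift promotes the background learner, or resets from scratch.
        if self.background is not None:
            self.active = self.background
        else:
            self.active = self._new_learner()
        self.background = None


member = ForestMember(CountingLearner)
for _ in range(2):
    member.learn_one({}, 0)
member.on_warning()               # background learner starts here
for _ in range(3):
    member.learn_one({}, 1)
seen_before_swap = (member.active.n_seen, member.background.n_seen)
member.on_drift()                 # background learner becomes the active one
```

The benefit of the warning stage is that the replacement learner has already seen some post-warning data by the time the drift is confirmed, so the swap does not restart from an empty model.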

In [3]:
from river.ensemble import AdaptiveRandomForestClassifier

model = AdaptiveRandomForestClassifier(n_models=10)
metrics = Metrics(metrics=[Accuracy(),BalancedAccuracy(),GeometricMean(),CohenKappa()])
stream = iter_pandas(X=data[features], y=data['rain'])

progressive_val_score(dataset=stream,
                      model=model,
                      metric=metrics,
                      print_every=1000)

[1,000] Accuracy: 75.38%, BalancedAccuracy: 66.46%, GeometricMean: 61.31%, CohenKappa: 37.02%
[2,000] Accuracy: 77.09%, BalancedAccuracy: 66.93%, GeometricMean: 61.94%, CohenKappa: 38.28%
[3,000] Accuracy: 78.53%, BalancedAccuracy: 67.75%, GeometricMean: 62.76%, CohenKappa: 40.46%
[4,000] Accuracy: 78.19%, BalancedAccuracy: 67.61%, GeometricMean: 62.48%, CohenKappa: 40.19%
[5,000] Accuracy: 78.10%, BalancedAccuracy: 67.93%, GeometricMean: 63.14%, CohenKappa: 40.62%
[6,000] Accuracy: 77.96%, BalancedAccuracy: 68.42%, GeometricMean: 64.19%, CohenKappa: 41.17%
[7,000] Accuracy: 78.50%, BalancedAccuracy: 69.27%, GeometricMean: 65.45%, CohenKappa: 42.79%
[8,000] Accuracy: 78.33%, BalancedAccuracy: 69.37%, GeometricMean: 65.66%, CohenKappa: 42.86%
[9,000] Accuracy: 78.25%, BalancedAccuracy: 69.21%, GeometricMean: 65.35%, CohenKappa: 42.67%
[10,000] Accuracy: 78.44%, BalancedAccuracy: 69.37%, GeometricMean: 65.54%, CohenKappa: 43.02%
[11,000] Accuracy: 78.63%, BalancedAccuracy: 69.47%, Geomet

Accuracy: 78.25%, BalancedAccuracy: 70.89%, GeometricMean: 68.08%, CohenKappa: 45.27%

In [12]:
import pickle

# Save the trained model to disk
with open('/Users/alessiobernardo/Downloads/ai_model.pkl', 'wb') as f:
    pickle.dump(model, f)

In [13]:
# Load the model back from disk
with open('/Users/alessiobernardo/Downloads/ai_model.pkl', 'rb') as f:
    model_loaded = pickle.load(f)

In [14]:
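# Sanity check: score the reloaded model on the first 5 samples.
# `metrics` still holds the counts from the full run above, so the scores barely change.
s = iter_pandas(X=data[features][:5], y=data['rain'][:5])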
progressive_val_score(dataset=s,
                      model=model_loaded,
                      metric=metrics,
                      print_every=1)

[1] Accuracy: 78.25%, BalancedAccuracy: 70.89%, GeometricMean: 68.09%, CohenKappa: 45.27%
[2] Accuracy: 78.25%, BalancedAccuracy: 70.89%, GeometricMean: 68.09%, CohenKappa: 45.28%
[3] Accuracy: 78.25%, BalancedAccuracy: 70.90%, GeometricMean: 68.09%, CohenKappa: 45.28%
[4] Accuracy: 78.25%, BalancedAccuracy: 70.89%, GeometricMean: 68.09%, CohenKappa: 45.28%
[5] Accuracy: 78.25%, BalancedAccuracy: 70.89%, GeometricMean: 68.09%, CohenKappa: 45.28%


Accuracy: 78.25%, BalancedAccuracy: 70.89%, GeometricMean: 68.09%, CohenKappa: 45.28%

## StreamingRandomPatches
---
[SRP](https://riverml.xyz/0.10.1/api/ensemble/SRPClassifier/) (Streaming Random Patches) is an ensemble method that simulates bagging or random subspaces. The default algorithm uses both bagging and random subspaces, namely Random Patches. The default base estimator is a Hoeffding Tree, but other base estimators can be used (unlike random forest variations).
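
The random-subspace half of Random Patches can be sketched as follows: each ensemble member draws one fixed subset of features and only ever sees that projection of the incoming sample. The subspace size of 5 out of the 8 `NEweather` features is an illustrative assumption, and `draw_subspace` and `project` are hypothetical helpers, not part of River.

```python
import random

FEATURES = ["temp", "dew_pnt", "sea_lvl_press", "visibility",
            "avg_wind_spd", "max_sustained_wind_spd", "max_temp", "min_temp"]


def draw_subspace(features, size, rng):
    """Pick the feature subset that one ensemble member will use for its whole life."""
    return sorted(rng.sample(features, size))


def project(x, subspace):
    """Keep only the features belonging to a member's subspace."""
    return {name: x[name] for name in subspace}


rng = random.Random(42)
subspaces = [draw_subspace(FEATURES, 5, rng) for _ in range(3)]

# A dummy sample in the dict format produced by a pandas row iterator.
x = {name: float(i) for i, name in enumerate(FEATURES)}
views = [project(x, subspace) for subspace in subspaces]
```

Combining these per-member views with Poisson-weighted resampling of the samples gives each member its own "patch" of the data, both in rows and in columns.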

In [None]:
from river.ensemble import SRPClassifier
from river.tree import HoeffdingTreeClassifier

model = SRPClassifier(model=HoeffdingTreeClassifier(),
                      n_models=10,
                      seed=42)
metrics = Metrics(metrics=[Accuracy(),BalancedAccuracy(),GeometricMean(),CohenKappa()])
stream = iter_pandas(X=data[features], y=data['rain'])

progressive_val_score(dataset=stream, 
                      model=model, 
                      metric=metrics, 
                      print_every=1000)

## Concept Drift Impact

Concept drift can negatively impact learning methods if not properly handled. Many real-world applications suffer from **model degradation** because their models cannot adapt to changes in the data.

---
## `AGRAWAL` dataset

We will load the data from a CSV file. The data was generated using the `AGRAWAL` data generator with 3 **gradual drifts** at the 5k, 10k, and 15k marks. It contains 9 features: 6 numeric and 3 categorical.

There are 10 functions for generating binary class labels from the features. These functions determine whether a **loan** should be approved.

| Feature    | Description            | Values                                                                |
|------------|------------------------|-----------------------------------------------------------------------|
| `salary`     | salary                 | uniformly distributed from 20k to 150k                                |
| `commission` | commission             | if (salary < 75k) then 0 else uniformly distributed from 10k to 75k   |
| `age`        | age                    | uniformly distributed from 20 to 80                                   |
| `elevel`     | education level        | uniformly chosen from 0 to 4                                          |
| `car`        | car maker              | uniformly chosen from 1 to 20                                         |
| `zipcode`    | zip code of the town   | uniformly chosen from 0 to 8                                          |
| `hvalue`     | value of the house     | uniformly distributed from 50k x zipcode to 100k x zipcode            |
| `hyears`     | years house owned      | uniformly distributed from 1 to 30                                    |
| `loan`       | total loan amount      | uniformly distributed from 0 to 500k                                  |

**Class:** `class` | 0: no loan, 1: loan
 
**Samples:** 20,000

`elevel`, `car`, and `zipcode` are categorical features.
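
To make the feature table concrete, here is a minimal sketch that draws one feature vector following the ranges listed above. It is not the generator used to build the CSV file: it omits the 10 labeling functions and the drifts, and it assumes that `age` and `hyears` are integers, which the table does not state. `sample_agrawal_features` is a hypothetical helper.

```python
import random


def sample_agrawal_features(rng):
    """Draw one feature vector following the ranges in the table (labels omitted)."""
    salary = rng.uniform(20_000, 150_000)
    # Commission is only paid when the salary is at least 75k.
    commission = 0.0 if salary < 75_000 else rng.uniform(10_000, 75_000)
    zipcode = rng.randint(0, 8)
    return {
        "salary": salary,
        "commission": commission,
        "age": rng.randint(20, 80),      # assumption: integer ages
        "elevel": rng.randint(0, 4),     # categorical
        "car": rng.randint(1, 20),       # categorical
        "zipcode": zipcode,              # categorical
        # House value range scales with the zip code, as stated in the table.
        "hvalue": rng.uniform(50_000 * zipcode, 100_000 * zipcode),
        "hyears": rng.randint(1, 30),    # assumption: integer years
        "loan": rng.uniform(0, 500_000),
    }


rng = random.Random(42)
samples = [sample_agrawal_features(rng) for _ in range(1000)]
```

Note that two features depend on others: `commission` on `salary` and `hvalue` on `zipcode`.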

In [3]:
data = pd.read_csv("../datasets/agr_a_20k.csv")
features = data.columns[:-1]

Since there are several drifts, let's use a [Rolling Window](https://riverml.xyz/0.10.1/api/metrics/Rolling/) to better track the model's behaviour. `Rolling` is a fixed-size wrapper that computes a metric over a window of the most recent samples: when the window is full, the oldest element is removed. The reported performance therefore reflects the most recent data, which may be affected by a concept drift.
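
The windowing idea can be sketched with a bounded `deque` from the standard library. `RollingAccuracy` is a hypothetical class written for this illustration; it mimics the behaviour described above for a single metric and is not how River implements `Rolling`.

```python
from collections import deque


class RollingAccuracy:
    """Accuracy over the most recent `window_size` predictions only."""

    def __init__(self, window_size):
        # A deque with maxlen silently drops its oldest item once it is full.
        self.hits = deque(maxlen=window_size)

    def update(self, y_true, y_pred):
        self.hits.append(y_true == y_pred)

    def get(self):
        return sum(self.hits) / len(self.hits) if self.hits else 0.0


rolling = RollingAccuracy(window_size=3)
outcomes = [(1, 0), (1, 0), (1, 1), (0, 0), (1, 1)]  # (y_true, y_pred) pairs
for y_true, y_pred in outcomes:
    rolling.update(y_true, y_pred)

cumulative = sum(t == p for t, p in outcomes) / len(outcomes)
windowed = rolling.get()
```

In this toy stream the two early mistakes still weigh on the cumulative accuracy (3 correct out of 5), while the windowed value only reflects the last three predictions, all of which are correct.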

## ADWIN Online Bagging

---
[ADWIN Online Bagging](https://riverml.xyz/0.10.1/api/ensemble/ADWINBaggingClassifier/) is the online bagging method with the addition of the ADWIN algorithm as a change detector. If concept drift is detected, the worst member of the ensemble (based on the error estimation by ADWIN) is replaced by a new (empty) classifier.
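
The replacement step described above can be sketched as follows. The drift signal and the per-member error estimates are assumed to be supplied by a change detector such as ADWIN; here they are hard-coded. `replace_worst` is a hypothetical helper, and the strings stand in for real classifiers.

```python
def replace_worst(members, error_rates, new_member):
    """Swap the member with the highest estimated error for a fresh one.

    Returns the index of the replaced member.
    """
    worst = max(range(len(members)), key=error_rates.__getitem__)
    members[worst] = new_member()
    error_rates[worst] = 0.0  # placeholder: a real detector would be reset here
    return worst


members = ["tree_a", "tree_b", "tree_c"]   # stand-ins for trained classifiers
error_rates = [0.21, 0.35, 0.18]           # assumed per-member error estimates
drift_detected = True                      # assumed output of a change detector

if drift_detected:
    replaced = replace_worst(members, error_rates, lambda: "fresh_tree")
```

Only one member is reset per detected drift, so the rest of the ensemble keeps its accumulated knowledge while the new member learns the current concept.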

In [None]:
from river.ensemble import ADWINBaggingClassifier
from river.tree import HoeffdingTreeClassifier

model = ADWINBaggingClassifier(model=HoeffdingTreeClassifier(nominal_attributes=['elevel', 'car', 'zipcode']),
                               n_models=10,
                               seed=42)
#metrics = Metrics(metrics=[Accuracy(),BalancedAccuracy(),GeometricMean(),CohenKappa()])
metrics = Rolling(metric=Metrics(metrics=[Accuracy(),BalancedAccuracy(),GeometricMean(),CohenKappa()]),window_size=500)
stream = iter_pandas(X=data[features], y=data['class'])

progressive_val_score(dataset=stream,
                      model=model,
                      metric=metrics,
                      print_every=500)

## Leveraging Bagging

In [None]:
from river.ensemble import LeveragingBaggingClassifier
from river.tree import HoeffdingTreeClassifier

model = LeveragingBaggingClassifier(model=HoeffdingTreeClassifier(nominal_attributes=['elevel', 'car', 'zipcode']),
                                    n_models=10,
                                    seed=42)
#metrics = Metrics(metrics=[Accuracy(),BalancedAccuracy(),GeometricMean(),CohenKappa()])
metrics = Rolling(metric=Metrics(metrics=[Accuracy(),BalancedAccuracy(),GeometricMean(),CohenKappa()]),window_size=500)
stream = iter_pandas(X=data[features], y=data['class'])

progressive_val_score(dataset=stream,
                      model=model,
                      metric=metrics,
                      print_every=500)

## AdaptiveRandomForest

In [4]:
from river.ensemble import AdaptiveRandomForestClassifier

model = AdaptiveRandomForestClassifier(n_models=10,nominal_attributes=['elevel', 'car', 'zipcode'])
#metrics = Metrics(metrics=[Accuracy(),BalancedAccuracy(),GeometricMean(),CohenKappa()])
metrics = Rolling(metric=Metrics(metrics=[Accuracy(),BalancedAccuracy(),GeometricMean(),CohenKappa()]),window_size=500)
stream = iter_pandas(X=data[features], y=data['class'])

progressive_val_score(dataset=stream,
                      model=model,
                      metric=metrics,
                      print_every=500)

[500] Accuracy: 65.53%, BalancedAccuracy: 57.43%, GeometricMean: 51.94%, CohenKappa: 16.11%	(rolling 500)
[1,000] Accuracy: 65.00%, BalancedAccuracy: 55.32%, GeometricMean: 47.22%, CohenKappa: 11.88%	(rolling 500)
[1,500] Accuracy: 70.60%, BalancedAccuracy: 63.23%, GeometricMean: 57.82%, CohenKappa: 29.21%	(rolling 500)
[2,000] Accuracy: 71.80%, BalancedAccuracy: 62.24%, GeometricMean: 54.59%, CohenKappa: 28.03%	(rolling 500)
[2,500] Accuracy: 72.00%, BalancedAccuracy: 61.90%, GeometricMean: 55.02%, CohenKappa: 27.07%	(rolling 500)
[3,000] Accuracy: 74.00%, BalancedAccuracy: 66.30%, GeometricMean: 62.11%, CohenKappa: 35.81%	(rolling 500)
[3,500] Accuracy: 77.80%, BalancedAccuracy: 68.47%, GeometricMean: 64.00%, CohenKappa: 41.49%	(rolling 500)
[4,000] Accuracy: 78.00%, BalancedAccuracy: 69.11%, GeometricMean: 64.74%, CohenKappa: 42.88%	(rolling 500)
[4,500] Accuracy: 72.60%, BalancedAccuracy: 63.67%, GeometricMean: 57.39%, CohenKappa: 30.86%	(rolling 500)
[5,000] Accuracy: 75.20%, Bala

Accuracy: 71.60%, BalancedAccuracy: 68.43%, GeometricMean: 67.09%, CohenKappa: 37.99%	(rolling 500)

## StreamingRandomPatches
---
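Here we explicitly set the drift and warning detection options: ADWIN with `delta=0.001` as the drift detector and ADWIN with `delta=0.01` as the warning detector.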

In [None]:
from river.ensemble import SRPClassifier
from river.tree import HoeffdingTreeClassifier
from river.drift import ADWIN

model = SRPClassifier(model=HoeffdingTreeClassifier(nominal_attributes=['elevel', 'car', 'zipcode']),
                      n_models=10,
                      drift_detector=ADWIN(delta=0.001),
                      warning_detector=ADWIN(delta=0.01),
                      seed=42)
#metrics = Metrics(metrics=[Accuracy(),BalancedAccuracy(),GeometricMean(),CohenKappa()])
metrics = Rolling(metric=Metrics(metrics=[Accuracy(),BalancedAccuracy(),GeometricMean(),CohenKappa()]),window_size=500)
stream = iter_pandas(X=data[features], y=data['class'])

progressive_val_score(dataset=stream, 
                      model=model, 
                      metric=metrics, 
                      print_every=500)