# Stream-based machine learning pipeline

## OUTLINE: 
* Concept drift algorithms
* Hoeffding Tree
* Hoeffding Adaptive Tree
* Evaluation - Holdout, Prequential, "Real-world" (incremetal)
* Note : same functionality can be achieved with higher performance in java based MOA (skmultiflow is a "child project" to MOA )

### Concept drift

Concept drifts categories:
<img src="./images/concept_drifts.png" alt="drawing" width="500"/>

    [1]
#### ADWIN
<img src="./images/adwin.png" alt="drawing" width="500"/>
    
    [2-3]

<small><small><small>
    
    [1] Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). "A survey on concept drift adaptation." ACM computing surveys (CSUR), 46(4), 1-37.

    [2] Grulich, Philipp Marian, et al. "Scalable Detection of Concept Drifts on Data Streams with Parallel Adaptive Windowing." EDBT. 2018.

    [3] Bifet, Albert, and Ricard Gavalda. "Learning from time-changing data with adaptive windowing." Proceedings of the 2007 SIAM international conference on data mining. Society for Industrial and Applied Mathematics, 2007.
</small></small></small>

## libraries

In [12]:
import numpy as np
import pandas as pd

#https://scikit-multiflow.github.io/scikit-multiflow/documentation.html#learning-methods
from skmultiflow.drift_detection import DDM
from skmultiflow.drift_detection.eddm import EDDM
from skmultiflow.drift_detection import PageHinkley
from skmultiflow.drift_detection.adwin import ADWIN
from skmultiflow.evaluation import EvaluateHoldout

from skmultiflow.meta import AdaptiveRandomForest
from skmultiflow.trees import HoeffdingTree
from skmultiflow.trees import HAT
from skmultiflow.evaluation import EvaluatePrequential

from skmultiflow.data import DataStream

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
%matplotlib inline
import glob

In [3]:
def scale_dataset(dataset):
    scaler = StandardScaler()
    scaler.fit(dataset.drop(columns=['fault_id']))
    dataset_scaled = pd.DataFrame(scaler.transform(dataset.drop(columns=['fault_id'])),columns=col_names[:-1])
    dataset_scaled['fault_id'] = dataset['fault_id'].copy()
    return dataset_scaled

In [4]:
dataset = pd.read_csv('workshop_dataset.csv',index_col=False).sample(frac=1).reset_index(drop=True)
col_names = dataset.columns.tolist()
test_size = len(dataset)//10
training_size = len(dataset) - test_size

# 1. Concept Drift Detection

In [4]:
# magnitude of row vectors - concept drift detectors take as input single value not list/vector
data_stream_magnitude = dataset[dataset.columns[:-1]].apply(np.linalg.norm, axis=1).values

In [9]:
adwin = ADWIN()
ddm = DDM()
eddm = EDDM()
ph = PageHinkley()

adwin_detected_changes = []
ddm_detected_changes = []
eddm_detected_changes = []
ph_detected_changes = []

for i in range(len(data_stream_magnitude)):
    adwin.add_element(data_stream[i])
    ddm.add_element(data_stream[i])
    eddm.add_element(data_stream[i])
    ph.add_element(data_stream[i])
    if adwin.detected_change():
        adwin_detected_changes.extend(i)
    if ddm.detected_change():
        ddm_detected_changes.extend(i)
    if eddm.detected_change():
        eddm_detected_changes.extend(i)
    if ph.detected_change():
        ph_detected_changes.extend(i)        

Change detected in data: 7644.600911861115 - at index: 28
Change detected in data: 7619.368761792521 - at index: 57


# 2. Classification pipeline

2.1 Holdout - follows batch machine learning logic, i.e. train incremetanlly models and test it on test dataset  
2.2 Prequential - test-then-train  
2.3 Real-world scenarios - model is incrementally trained and then incremetally tested by comparing predicted labels to ground truth  

## 2.1. Holdout evaluation

In [21]:
HT.reset()
HAT.reset()

In [22]:
samples = dataset.drop(columns=['fault_id'])
labels  = dataset['fault_id'].to_frame()

stream = DataStream(data = samples, y = labels)
stream.prepare_for_use()

#HT = HoeffdingTree()
#HAT = HAT()
#ARF = AdaptiveRandomForest()
evaluator = EvaluateHoldout(max_samples=100000,
                            max_time=6000,
                            n_wait=100,                            
                            batch_size=100,
                            test_size=test_size,
                            output_file='HAT_houldout.csv',
                            metrics=['precision','recall','f1'])
evaluator.evaluate(stream=stream, model=[HT,HAT], model_names=['HT','HAT'])

Holdout Evaluation
Evaluating 1 target(s).
Separating 2023 holdout samples.
Evaluating...

Processed samples: 11723
Mean performance:
HT - Precision: 1.0000
HT - Recall: 1.0000
HT - F1 score: 1.0000
HAT - Precision: 1.0000
HAT - Recall: 0.9564
HAT - F1 score: 0.9777


[HoeffdingTree(binary_split=False, grace_period=200, leaf_prediction='nba',
               max_byte_size=33554432, memory_estimate_period=1000000,
               nb_threshold=0, no_preprune=False, nominal_attributes=None,
               remove_poor_atts=None, split_confidence=1e-07,
               split_criterion='info_gain', stop_mem_management=False,
               tie_threshold=0.05),
 HAT(binary_split=False, grace_period=200, leaf_prediction='nba',
     max_byte_size=33554432, memory_estimate_period=1000000, nb_threshold=0,
     no_preprune=False, nominal_attributes=None, remove_poor_atts=False,
     split_confidence=1e-07, split_criterion='info_gain',
     stop_mem_management=False, tie_threshold=0.05)]

## 2.2. Prequential evaluation

In [23]:
HT.reset()
HAT.reset()
evaluator = EvaluatePrequential(n_wait=100, 
                                batch_size=100, 
                                pretrain_size=training_size, 
                                output_file='HAT_prequential.csv',
                                metrics=['precision','recall','f1'])
evaluator.evaluate(stream=stream, model=HAT, model_names=['HAT'])

In [16]:
# skmultiflow saves results to file with leading 5 lines containing configuraiton of evaluation, learner etc
# skmultiflow also did not evaluate last 200 samples
# for the sake of comparisson we shrink the MOA results
# accuracy in MOA is in % and in skmultiflow fraction
HAT_results = pd.read_csv('HAT_prequential.csv',skiprows=[0,1,2,3,4],index_col=False)

In [24]:
#HAT_results[]

## 2.3. Real-world scenario

### 2.3.1. Incremental approach

In [None]:
HAT.reset()

samples_train = dataset.drop(columns=['fault_id']).values[:train_size]
labels_train  = dataset['fault_id'].to_frame().values.flatten()[:train_size]

stream_train = DataStream(data=samples_train, y=labels_train)
stream_train.prepare_for_use()

for sample in range(len(labels_train)):
    X, Y = stream_train.next_sample()
    HAT.partial_fit(X, Y)

In [22]:
samples_test = dataset_test.drop(columns=['fault_id']).values[train_size:]
labels_test  = dataset_test['fault_id'].to_frame().values.flatten()[train_size:]

stream_test = DataStream(data = samples_test, y = labels_test)
stream_test.prepare_for_use()

labels_test_predicted = []
for sample in range(len(labels_test)):
    X, Y = stream_test.next_sample()
    Y_pred = HAT.predict(X)
    labels_test_predicted.extend(HAT.predict(X))
    
print('Classification report :\n' + str(classification_report(labels_test, labels_test_predicted)))

  return sum_value / self.sample_count


### 2.3.1. Bulk approach

In [161]:
HAT.reset()
HAT.fit(samples_train,labels_train.flatten())
labels_test_predicted = ARF.predict(samples_test)
print('Classification report :\n' + str(classification_report(labels_test, labels_test_predicted)))

AdaptiveRandomForest(binary_split=False, disable_weighted_vote=False,
                     drift_detection_method=ADWIN(delta=0.001), grace_period=50,
                     lambda_value=6, leaf_prediction='nba',
                     max_byte_size=33554432, max_features=1,
                     memory_estimate_period=2000000, n_estimators=10,
                     nb_threshold=0, no_preprune=False, nominal_attributes=None,
                     performance_metric='acc', random_state=None,
                     remove_poor_atts=False, split_confidence=0.01,
                     split_criterion='info_gain', stop_mem_management=False,
                     tie_threshold=0.05,