# Incremental (Online) Learning with Scikit-Multiflow 

With seemingly infinite streams of data, one of the key challenges is to create lightweight models that are always ready to predict and that adapt to changes in the data distribution.

### What is Incremental Learning?
At every iteration, the model predicts a class label, the true label is then revealed, and the model is updated with it.

Incremental learning refers to a family of scalable algorithms that update a model sequentially, one example (or small batch) at a time, as data arrives from a potentially unbounded stream.

An incremental model has the following characteristics:

- It can **predict at any time**.
- It can **adapt to concept drift**, i.e. changes in the data distribution.
- It is able to process an *infinite data stream with finite resources* (**time and memory**).

**Scikit-Multiflow** is a free Python framework for data-stream learning.

### Simple Online Classifier
There are many incremental models available in scikit-multiflow; one of the most popular is the Hoeffding tree.

#### Hoeffding Trees
Hoeffding trees are built using the Very Fast Decision Tree Learner (VFDT), an anytime system that builds decision trees using constant memory and constant time per example. Introduced in 2000 by Pedro Domingos and Geoff Hulten, it uses a well-known statistical result, the Hoeffding bound, to guarantee that its output is asymptotically nearly identical to that of a traditional (batch) learner.
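For intuition, the Hoeffding bound says that, with probability 1 − δ, the true mean of a random variable with range R lies within ε = sqrt(R² ln(1/δ) / (2n)) of the mean observed over n independent samples. The sketch below computes ε and applies the basic split test: split a leaf once the observed gap between the best and second-best attribute exceeds ε. The function names and the example gain values are assumptions made for illustration, and the tie-breaking rule used by VFDT is omitted; this is not the scikit-multiflow implementation.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the observed mean of n samples is within epsilon
    of the true mean with probability 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_best_gain, value_range, delta, n):
    """Basic Hoeffding split test (tie-breaking omitted in this sketch)."""
    return (best_gain - second_best_gain) > hoeffding_bound(value_range, delta, n)


# The bound shrinks as more examples reach the leaf.
eps_small = hoeffding_bound(1.0, 1e-7, 100)
eps_large = hoeffding_bound(1.0, 1e-7, 10000)

# With made-up gains 0.30 and 0.25, the gap of 0.05 is only trusted
# once enough examples have been observed.
split_early = should_split(0.30, 0.25, 1.0, 1e-7, 100)
split_late = should_split(0.30, 0.25, 1.0, 1e-7, 10000)
```

Because ε depends only on n, R, and δ, the leaf needs to keep sufficient statistics rather than the examples themselves, which is what allows constant memory per example.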

In [4]:
from skmultiflow.trees import HoeffdingTreeClassifier

tree = HoeffdingTreeClassifier()

In [1]:
from skmultiflow.data import SEAGenerator 

stream = SEAGenerator()      # create a stream
stream.prepare_for_use()     # prepare the stream (required before scikit-multiflow 0.5; deprecated and unnecessary from 0.5 on)

#### Training a Hoeffding Tree for Classification
To train the tree on the SEA data stream, we loop over as many samples as we want. For each sample we first predict, then record whether the prediction was correct, and only then update the tree.

In [5]:
correctness_dist = []                  # 1 if the prediction was correct, 0 otherwise
nb_iters = 2000
for i in range(nb_iters):
    X, Y = stream.next_sample()        # get the next sample
    prediction = tree.predict(X)       # predict Y using the tree
    if Y == prediction:                # check the prediction
        correctness_dist.append(1)
    else:
        correctness_dist.append(0)

    tree.partial_fit(X, Y)             # update the tree

In [6]:
%matplotlib notebook

In [3]:
import matplotlib.pyplot as plt

time = list(range(1, nb_iters + 1))    # sample index, including the last sample
accuracy = [sum(correctness_dist[:i]) / i for i in time]    # running accuracy after i samples
plt.plot(time, accuracy)
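Re-summing a slice for every point costs quadratic time in the number of samples, which becomes noticeable on longer streams. As a sketch, the same running accuracy can be computed in a single pass with `itertools.accumulate`. The helper name `running_accuracy` and the sample flags are made up for this example.

```python
from itertools import accumulate


def running_accuracy(correct_flags):
    """Return the cumulative accuracy after each prediction.

    correct_flags is a sequence of 1 (correct) and 0 (incorrect).
    """
    return [hits / n for n, hits in enumerate(accumulate(correct_flags), start=1)]


flags = [1, 0, 1, 1]
curve = running_accuracy(flags)
```

Passing `correctness_dist` from the training loop to this helper should give the same curve as the slicing version.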

### Alternative Approach with Scikit-Multiflow
Scikit-multiflow has a built-in way to run the same test-then-train (prequential) evaluation with less code, using the EvaluatePrequential class:

In [8]:
from skmultiflow.evaluation import EvaluatePrequential

In [9]:
# named `evaluator` to avoid shadowing Python's built-in eval()
evaluator = EvaluatePrequential(show_plot=True,
                                max_samples=10000,
                                metrics=['accuracy', 'kappa', 'running_time', 'model_size'])

In [4]:
evaluator.evaluate(stream=stream, model=tree, model_names=['TR'])
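Of the metrics requested above, kappa is probably the least familiar. Cohen's kappa compares the observed accuracy with the accuracy expected by chance given the class frequencies: kappa = (p_observed − p_chance) / (1 − p_chance). The sketch below is a plain-Python illustration under that definition; the function name is hypothetical, the handling of the degenerate case is a convention chosen here, and scikit-multiflow's internal computation may differ in detail.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa between true and predicted labels."""
    n = len(y_true)
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: probability that two independent draws from the
    # true and predicted label distributions coincide.
    p_chance = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_chance == 1.0:
        return 0.0  # degenerate case (a single class everywhere); convention chosen here
    return (p_observed - p_chance) / (1.0 - p_chance)


# Always predicting the majority class gives 75% accuracy here but kappa 0.
kappa_majority = cohen_kappa([1, 1, 1, 0], [1, 1, 1, 1])
# Perfect predictions on balanced labels give kappa 1.
kappa_perfect = cohen_kappa([1, 0, 1, 0], [1, 0, 1, 0])
```

This is why kappa is useful alongside accuracy on streams with imbalanced classes: a model that ignores the features can score high accuracy while its kappa stays near zero.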