# Stream Learning

Most human activity has time as a key feature. The relationship with time can be described, according to its scale, as short-term or long-term. For example, a short-term relationship could be the recording of a signal for some time before analysing or processing it. Based on the features that can be extracted from a sliding window, machine learning approaches usually build a dataset for training and testing. That dataset contains several sections of the signal, each with the desired output for that piece of information. However, this is more complicated with long-term relationships, which cannot be easily recorded because of their scale. In this case, the evolution of the input does not show in the signal itself, which makes those changes much harder to capture. Classical machine learning approaches have usually struggled to keep pace with this latter type. The common approach to tackle this kind of issue is to retrain the model from time to time and deploy an update, sometimes at intervals of minutes or even seconds. As an alternative, a new area within machine learning has arisen, called online learning or stream learning.

Nowadays, to avoid confusion with some teaching practices, the term online learning is being replaced by stream learning. This term designates a specific scenario for an intelligent model, one where the model is constantly fed with a potentially infinite stream of data. Therefore, instead of a stable dataset that is fed again and again to adjust the model, the model sees each piece of data only once. This seemingly slight change raises several issues that have to be tackled. It also brings new opportunities, which are covered in the following sections along with the problems.

However, before starting, some background concepts have to be formally defined.


## Data Stream

First and foremost, the term data stream is at the very core of stream learning. It refers to a sequential collection of individual elements, each of which, in a machine learning context, is a set of features measured simultaneously on an entity.

Each set of features measured at a certain time is also referred to as an observation or sample. At this point, it is worth mentioning that those samples can have a stable structure, e.g., all features are measured in every sample, or a more flexible one, with features that appear and disappear over time.

Therefore, generally speaking, we understand a data stream as a continuous sequence of samples over time.
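As a rough illustration of this definition, a data stream can be mimicked with a Python generator that yields one sample at a time. The function name `sensor_stream` and the feature names are hypothetical, chosen only for this sketch; the `pressure` feature appears only in some samples to mimic a flexible structure.

```python
import random
from typing import Dict, Iterator, Tuple

Sample = Dict[str, float]


def sensor_stream(n_samples: int, seed: int = 0) -> Iterator[Tuple[int, Sample]]:
    """Yield (timestamp, sample) pairs one at a time, like a data stream."""
    rng = random.Random(seed)
    for t in range(n_samples):
        # Stable part of the structure: measured in every sample.
        sample = {
            "temperature": 20 + rng.gauss(0, 1),
            "humidity": 0.5 + rng.gauss(0, 0.05),
        }
        # Flexible part: this feature only appears in every third sample.
        if t % 3 == 0:
            sample["pressure"] = 1013 + rng.gauss(0, 2)
        yield t, sample


for t, sample in sensor_stream(4):
    print(t, sorted(sample))
```

Because the generator is lazy, each sample only exists while it is being consumed, which is close to how a model sees a real stream.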


### Reactive and proactive data streams

Data streams can be further classified, depending on their relationship with the user, as reactive or proactive.

Reactive data streams are ones where the data comes to you. Typical examples are the visits to a website, the interactions with a server, the events on a machine, etc. In other words, a reactive data stream is any stream that is out of your control: you have no influence over it beyond receiving and reading it. It just happens and you have to react to it.

Proactive data streams are ones where you have control over the data stream. For example, you might be reading the data from a file. You decide at which speed you want to read the data, in what order, etc.
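The difference can be sketched in a few lines. Everything below is hypothetical and only for illustration: `read_proactively` stands for data we already hold and can read in any order, while `EventSource` stands for an external system that decides when our handler is called.

```python
from typing import Callable, Dict, List

Sample = Dict[str, float]

# Proactive: the data is already stored, so we choose the order and the pace.
recorded: List[Sample] = [{"clicks": 3.0}, {"clicks": 5.0}, {"clicks": 1.0}]


def read_proactively(data: List[Sample], reverse: bool = False) -> List[float]:
    ordered = reversed(data) if reverse else data
    return [sample["clicks"] for sample in ordered]


# Reactive: we only register a handler; the source decides when it is called.
class EventSource:
    def __init__(self) -> None:
        self._handlers: List[Callable[[Sample], None]] = []

    def subscribe(self, handler: Callable[[Sample], None]) -> None:
        self._handlers.append(handler)

    def emit(self, sample: Sample) -> None:
        for handler in self._handlers:
            handler(sample)


received: List[float] = []
source = EventSource()
source.subscribe(lambda sample: received.append(sample["clicks"]))

# In production these calls would be triggered by the outside world, not by us.
source.emit({"clicks": 7.0})
source.emit({"clicks": 2.0})

print(read_proactively(recorded, reverse=True), received)
```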


## Online processing
This concept refers to processing a stream of data observation by observation. More specifically, if we focus on machine learning, it refers to training a model by teaching it one sample at a time.

This concept makes stream learning the complete opposite of the traditional approach, where the samples are packed in batches and the model is adjusted after processing the errors of the whole batch. Therefore, in online processing, vectorization does not bring any speed-up, and numeric processing libraries such as NumPy and PyTorch add too much overhead. In addition, the new approach does not revisit past data as the batched approach does, because it works on a continuous data stream in which each sample can usually be seen by the model only once.

This establishes one of the key points of the approach: an online learning model is a stateful, dynamic object. Consequently, we are facing a new machine learning paradigm with its own pros and cons.
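To make the idea of a stateful model concrete, here is a minimal sketch of a linear regression adjusted by stochastic gradient descent, one sample at a time. The class `OnlineLinearRegression` is a hypothetical toy, not a River class; its method names only mirror the `predict_one` / `learn_one` convention used later in this document, and the synthetic target $y = 2x + 1$ is an assumption made for the example.

```python
import random


class OnlineLinearRegression:
    """Hypothetical toy model: linear regression fitted one sample at a time."""

    def __init__(self, learning_rate: float = 0.1) -> None:
        self.learning_rate = learning_rate
        self.weights = {}  # state that survives between samples
        self.bias = 0.0

    def predict_one(self, x):
        return self.bias + sum(
            self.weights.get(name, 0.0) * value for name, value in x.items()
        )

    def learn_one(self, x, y):
        # Gradient step on the squared error of this single sample.
        error = self.predict_one(x) - y
        for name, value in x.items():
            self.weights[name] = (
                self.weights.get(name, 0.0) - self.learning_rate * error * value
            )
        self.bias -= self.learning_rate * error


rng = random.Random(0)
toy_regressor = OnlineLinearRegression()
for _ in range(2000):
    x = {"x": rng.uniform(-1, 1)}
    y = 2.0 * x["x"] + 1.0
    toy_regressor.learn_one(x, y)  # each sample is seen once and then discarded

print(toy_regressor.weights, toy_regressor.bias)
```

No sample is stored anywhere: everything the model knows lives in `weights` and `bias`, which is what makes it a stateful object.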

## Datasets in training

Although in production about 90% of data streams are going to be reactive, a dataset is usually built when we have to tackle the training and evaluation of a model. The reason is that we usually do not have access to the real-time feed while developing the model, so the feed has to be simulated with some captured observations.

However, capturing a dataset is where the similarities between traditional machine learning and online machine learning end. Unlike the traditional approach, the online approach does not split the data into training and evaluation datasets, but uses the whole dataset in both stages. When an observation arrives, it is first used to evaluate the model and only afterwards to adjust it. This cycle is repeated continuously, and the order of the observations matters: because of the time component, it is essential to know which element comes before which. The idea is to simulate the same scenario that the model is going to find in real life, to ensure correct behaviour over time.
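The importance of the order "evaluate first, adjust afterwards" can be seen with a deliberately naive sketch. `LastValueModel` and `progressive_mae` are hypothetical helpers written for this illustration: the model simply repeats the last target it learned, so letting it learn before it predicts leaks the answer and produces a misleading, perfect score.

```python
class LastValueModel:
    """Hypothetical toy model that predicts the last target it has learned."""

    def __init__(self):
        self.last = None

    def predict_one(self, x):
        return self.last if self.last is not None else 0.0

    def learn_one(self, x, y):
        self.last = y


def progressive_mae(stream, model, learn_first=False):
    """Mean absolute error over a stream, testing each sample exactly once."""
    total, n = 0.0, 0
    for x, y in stream:
        if learn_first:
            model.learn_one(x, y)  # wrong order: the label leaks into the prediction
        total += abs(y - model.predict_one(x))
        n += 1
        if not learn_first:
            model.learn_one(x, y)  # right order: test, then train
    return total / n


toy_stream = [({"t": t}, float(t)) for t in range(1, 11)]
honest_mae = progressive_mae(toy_stream, LastValueModel())
leaky_mae = progressive_mae(toy_stream, LastValueModel(), learn_first=True)
print(honest_mae, leaky_mae)
```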


### Concept Drift

The main reason why a machine learning approach, offline or online, may not perform well on this kind of problem with a continuous feed of data is concept drift. Researchers use this term for the situation where the data starts to change over time: a pattern not previously noticed appears, or the balance among classes varies.

The advantage of online models is that they keep learning and, consequently, can cope with drift. So, this kind of approach can adapt to concept drift without having to retrain a new model from scratch. We will revisit this concept later with an example.
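A small synthetic sketch can illustrate the effect. The stream below is an assumption made for the example: its mean jumps from 0 to 5 halfway through. A "frozen" estimate computed once on the first half is compared with an exponentially weighted mean (with an arbitrarily chosen `alpha`) that keeps updating after every sample.

```python
import random

rng = random.Random(42)

# Synthetic stream: the mean jumps from 0 to 5 halfway through (the drift).
drift_stream = [rng.gauss(0.0, 1.0) for _ in range(500)]
drift_stream += [rng.gauss(5.0, 1.0) for _ in range(500)]

# Offline-style model: mean estimated once on the first half, then frozen.
frozen_mean = sum(drift_stream[:500]) / 500

# Online model: exponentially weighted mean updated after every sample.
alpha = 0.05
online_mean = 0.0

frozen_error = online_error = 0.0
for t, value in enumerate(drift_stream):
    if t >= 500:  # only score the samples that arrive after the drift
        frozen_error += abs(value - frozen_mean)
        online_error += abs(value - online_mean)
    online_mean += alpha * (value - online_mean)  # test first, then learn

frozen_mae = frozen_error / 500
online_mae = online_error / 500
print(f"frozen: {frozen_mae:.2f}, online: {online_mae:.2f}")
```

The frozen estimate stays wrong for the whole second half, whereas the online estimate recovers after a short transient.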



# Stream Machine Learning Libraries

Being a relatively new approach, stream learning does not have many implementations yet. In fact, the usual practice is an ad-hoc implementation for each particular problem and entity. Only recently have some libraries appeared on the scene with a more general approach. The main players nowadays can be summarized as:

- **[Apache SAMOA](https://incubator.apache.org/projects/samoa.html)**, a project to perform analysis and data mining on data streams. It has a part focused on machine learning. It has not received any update since 2020 and it is still in the incubator of the Apache Foundation. It is rumoured that the foundation is going to drop its development.

- **[MOA](https://moa.cms.waikato.ac.nz/)**, whose name comes from Massive Online Analysis. This project has been developed by the same authors as the WEKA project, so it is closely related to it and is also written in Java. It includes a collection of machine learning algorithms (classification, regression, clustering, outlier detection, concept drift detection and recommender systems) and tools for evaluation. Its use is mainly limited to the interface provided, or to implementing extensions if you want to work with the rest of the ecosystem.

- **[Vowpal Wabbit](https://vowpalwabbit.org/)**, a library with Python bindings and a more general scope, which covers topics such as reinforcement learning. It has a section focused on stream learning, but mostly from the point of view of reinforcement learning. Its downside is that it requires the data to be in a particular format, which significantly affects its performance and usability.

- **[River](https://riverml.xyz/)**, another Python library focused on stream learning, with a more general approach than Vowpal Wabbit. In this case, the set of models is similar to the one in scikit-learn, together with the tools to adapt them to the new approach. Additionally, it also makes it possible to develop reinforcement learning approaches. As its main advantage, it can work with the most common types of data, such as pandas DataFrames.

In this subject, River is going to be our reference implementation, in order to widen the range of problems that we can tackle. Let's see a couple of examples for the most common scenarios.



## Binary classification

This is probably the most elementary approach to machine learning in general. In this case, the models have a single output, which tells us whether an example belongs to a certain class or to the complementary one.

1. The first step is to install the library, if we haven't done so yet.

In [1]:
# A version of Python >3.8 is required
try:
    import river
except ImportError as err:
    !pip install river

    
# this library is only used to improve the readability of some structures
# https://rich.readthedocs.io/en/stable/introduction.html
try:
    from rich import print
except ImportError as err:
    !pip install rich
    from rich import print

2. Import a dataset to operate with. In this case, the problem is the detection of bank fraud with credit cards, using one of the example datasets that River provides.

In [2]:
from rich import print
from river import datasets

dataset = datasets.CreditCard()
print("The object contains the information of the dataset, such as the number of samples and features")
print(dataset)

In [3]:
# Let's take a look at the first example
sample, target = next(iter(dataset))
print(sample)
print(target)

Working with imbalanced classes is quite usual in online learning, for tasks such as fraud detection and spam classification. The CreditCard dataset is certainly imbalanced, and its description provides information about its classes. However, we can easily calculate the percentage that each class represents:

In [4]:
import collections  # Python standard library

# Counter behaves like a dictionary that maps each label to its number of samples
counts = collections.Counter(target for _, target in dataset)
total = sum(counts.values())

for label, count in counts.items():
    print(f'{label}: {count} ({count / total:.2%})')


This baseline example does not address the class imbalance. However, there are different approaches to deal with it in order to improve the ML models.

3. Build a model that can be used to discriminate between the two classes. In this particular case, a very simple linear model ([LogisticRegression](https://riverml.xyz/0.14.0/api/linear-model/LogisticRegression/)) is created in order to illustrate the point.

In [5]:
from river import linear_model

model = linear_model.LogisticRegression()

Before the model is trained, the predicted probability is exactly the same for each class, as can be seen by calling `predict_proba_one`. Let's see the response for the previous `sample`.

In [6]:
print(model.predict_proba_one(sample))

So, the untrained model assigns the same probability to both classes, i.e., it is a classifier with no knowledge. This is the point where things differ from traditional machine learning. The same sample that has been used to test is then used to adjust the model, because that sample will no longer be available afterwards. Consequently, any kind of performance metric must be calculated before adjusting the model.

4. Train the model with the tested sample

In [7]:
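# learn_one updates the model in place; its return value is not reassigned,
# because on River releases where learn_one returns None that would discard the model
model.learn_one(sample, target)

If we test again with the same sample, we see that the probabilities have changed.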

In [8]:
print(model.predict_proba_one(sample))
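The behaviour observed in the last cells (equal probabilities before training, a shift after a single `learn_one`) can be reproduced with a minimal logistic regression written from scratch. This is only a sketch under simple assumptions (plain SGD, zero-initialised weights); `TinyLogisticRegression` and the feature names are hypothetical and it is not River's implementation.

```python
import math


class TinyLogisticRegression:
    """Hypothetical toy model, only meant to illustrate the idea."""

    def __init__(self, learning_rate: float = 0.1) -> None:
        self.learning_rate = learning_rate
        self.weights = {}
        self.bias = 0.0

    def predict_proba_one(self, x):
        z = self.bias + sum(
            self.weights.get(name, 0.0) * value for name, value in x.items()
        )
        p_true = 1.0 / (1.0 + math.exp(-z))  # sigmoid(0) = 0.5 before any training
        return {False: 1.0 - p_true, True: p_true}

    def learn_one(self, x, y):
        # Gradient of the log loss for this single sample.
        gradient = self.predict_proba_one(x)[True] - float(y)
        for name, value in x.items():
            self.weights[name] = (
                self.weights.get(name, 0.0) - self.learning_rate * gradient * value
            )
        self.bias -= self.learning_rate * gradient


toy_classifier = TinyLogisticRegression()
toy_sample = {"amount": 1.5, "hour": 0.3}

proba_before = toy_classifier.predict_proba_one(toy_sample)
toy_classifier.learn_one(toy_sample, False)
proba_after = toy_classifier.predict_proba_one(toy_sample)

print(proba_before)
print(proba_after)
```

With all weights at zero the linear score is zero, so the sigmoid returns exactly 0.5 for each class; one update is enough to move the probability towards the observed label.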

To simply get an answer from the model, we can call <code>predict_one()</code>, which returns the class label without the probabilities.

In [9]:
print(model.predict_one(sample))

To integrate the steps and see the complete process, the following piece of code shows how to put them in a single loop and how to integrate a running metric for this kind of system. There are different metrics available in River. In this particular case, the area under the [ROC curve](https://riverml.xyz/0.14.0/api/metrics/ROCAUC/) is used, but we could have selected any other.

In [10]:
from river import metrics

model = linear_model.LogisticRegression()
metric = metrics.ROCAUC()

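for sample, target in dataset:
    # ROC AUC is computed from scores, so the class probabilities are passed
    # to the metric instead of the hard label returned by predict_one
    prediction = model.predict_proba_one(sample)
    metric.update(target, prediction)
    model.learn_one(sample, target)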

print(metric)

A common and simple way to improve the performance of the model is to scale the data. River offers different preprocessing operations, including methods for scaling data. One of them is data standardization, using [preprocessing.StandardScaler](https://riverml.xyz/0.14.0/api/preprocessing/StandardScaler/).

It should be highlighted that not only can models be used in a similar way to `scikit-learn`, but the library also has pipelines at its core to link different processes. For example, here is a pipeline with two operators: StandardScaler and LogisticRegression. It is also worth mentioning that we have not written the loop this time, because there is a function that performs the loop and the evaluation for us, i.e., `evaluate.progressive_val_score`.

In [11]:
from river import evaluate
from river import compose
from river import preprocessing

model = compose.Pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression()
)

print(model)

metric = metrics.ROCAUC()
evaluate.progressive_val_score(dataset, model, metric)

ROCAUC: 89.11%
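Standardizing a stream raises a practical question: the mean and variance are not known in advance, so they have to be estimated on the fly. A minimal sketch of one way to do this, using Welford's running mean and variance per feature, is shown below. `RunningStandardScaler` is a hypothetical class written for this illustration and should not be read as River's actual implementation.

```python
import math
from collections import defaultdict


class RunningStandardScaler:
    """Hypothetical sketch: per-feature running mean/variance (Welford's method)."""

    def __init__(self) -> None:
        self.counts = defaultdict(int)
        self.means = defaultdict(float)
        self.m2 = defaultdict(float)  # running sum of squared deviations

    def learn_one(self, x):
        for name, value in x.items():
            self.counts[name] += 1
            delta = value - self.means[name]
            self.means[name] += delta / self.counts[name]
            self.m2[name] += delta * (value - self.means[name])

    def transform_one(self, x):
        scaled = {}
        for name, value in x.items():
            n = self.counts[name]
            variance = self.m2[name] / n if n else 0.0
            # Features with no spread yet are mapped to 0 to avoid dividing by zero.
            scaled[name] = (
                (value - self.means[name]) / math.sqrt(variance) if variance > 0 else 0.0
            )
        return scaled


toy_scaler = RunningStandardScaler()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    toy_scaler.learn_one({"amount": value})

# After these samples the running mean is 5 and the standard deviation is 2.
print(toy_scaler.transform_one({"amount": 9.0}))
```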

## Multiclass Classification

The next step in complexity is multiclass classification, where each instance, instead of being part of a two-class problem, can belong to any label of a set. In this scenario, the steps are similar to those of binary classification, but the techniques or the loss functions are adapted to take the multiple classes into account. For example, using the same library as before, we use a dataset of image segments, each of which belongs to one of 7 possible classes.

In [12]:
from rich import print
from river import datasets

dataset = datasets.ImageSegments()
print(dataset)

As in binary classification, the dataset associates each sample with a particular target in a tuple-like structure. However, this is where we can see the first difference from the binary classifier. In this case, we are going to use a different classification method called the [Hoeffding tree](https://riverml.xyz/0.14.0/api/tree/HoeffdingTreeClassifier/). When the probabilities are checked for a certain sample (<code>predict_proba_one</code>), the result is empty. The reason is that the model has not seen any sample yet, so it has no information about the "possible" classes. If this were a binary classifier, it would output a probability of 50% for True and for False, because the classes would be implicit. But in this case, we're doing multiclass classification.

Going along with this behaviour, the <code>predict_one</code> method initially returns None because the model hasn't seen any labelled data yet.

In [13]:
from river import tree

data_stream = iter(dataset)
sample, target = next(data_stream)

model = tree.HoeffdingTreeClassifier()
print(model.predict_proba_one(sample))
print(model.predict_one(sample))

However, as the model learns examples, it adds their classes to its set of possible outputs. For example, after learning the first sample, the model assigns 100% of the probability to the class of that sample. In fact, there are no other options, since only one class has been observed.

In [14]:
model.learn_one(sample, target)
print(model.predict_proba_one(sample))
print(model.predict_one(sample))

If a second sample is used to train, we can see how the probabilities change.

In [15]:
sample, target = next(data_stream) # Next sample on the list

model.learn_one(sample, target)
print(model.predict_proba_one(sample))
print(model.predict_one(sample))

This is one of the key points of online classifiers: the models can deal with new classes that appear in the data stream.
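The way a model can discover classes on the fly can be sketched with a classifier that does nothing more than count the labels it has seen. `PriorClassifier` is a hypothetical toy (far simpler than a Hoeffding tree, and it ignores the features entirely), and the labels `"sky"` and `"path"` are only illustrative.

```python
from collections import Counter


class PriorClassifier:
    """Hypothetical toy model: predicts class frequencies observed so far."""

    def __init__(self) -> None:
        self.counts = Counter()

    def predict_proba_one(self, x):
        total = sum(self.counts.values())
        if total == 0:
            return {}  # no class is known before the first labelled sample
        return {label: count / total for label, count in self.counts.items()}

    def predict_one(self, x):
        proba = self.predict_proba_one(x)
        return max(proba, key=proba.get) if proba else None

    def learn_one(self, x, y):
        self.counts[y] += 1  # unseen labels are simply added as new classes


toy_prior = PriorClassifier()
empty_proba = toy_prior.predict_proba_one({})
empty_label = toy_prior.predict_one({})

toy_prior.learn_one({}, "sky")
one_class_proba = toy_prior.predict_proba_one({})

toy_prior.learn_one({}, "path")
two_class_proba = toy_prior.predict_proba_one({})

print(empty_proba, empty_label, one_class_proba, two_class_proba)
```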

Typically, each sample is used once to make a prediction. After the prediction is made, the ground truth emerges, and it is used first to evaluate the model and then to adjust it with the same observation. This scheme is usually called **progressive validation**.

In [16]:
from river import metrics

model = tree.HoeffdingTreeClassifier()

metric = metrics.ClassificationReport()

for sample, target in dataset:
    prediction = model.predict_one(sample)
    if prediction is not None:
        metric.update(target, prediction)
    model.learn_one(sample, target)

print(metric)

In this case, [ClassificationReport](https://riverml.xyz/0.14.0/api/metrics/ClassificationReport/) reports the precision, recall and F1 for each class that the model has seen. Additionally, the Support column indicates the number of instances of each class found in the stream. Finally, we can see the three aggregated averages and the overall accuracy of the system.

This exemplifies a typical pipeline in stream learning. It is so frequent that River has a function to encapsulate the whole process in a single call, as in the binary classification.

In [17]:
from river import evaluate

model = tree.HoeffdingTreeClassifier()
metric = metrics.ClassificationReport()

print(evaluate.progressive_val_score(dataset, model, metric))
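The per-class figures shown by the report (precision, recall, F1 and support) can all be maintained incrementally from three counters per class. The sketch below is a hypothetical `RunningReport` class written for illustration; it is not how River implements `ClassificationReport`, and the labels are made up.

```python
from collections import Counter


class RunningReport:
    """Hypothetical sketch: per-class precision, recall, F1 and support."""

    def __init__(self) -> None:
        self.tp = Counter()  # true positives per class
        self.fp = Counter()  # false positives per class
        self.fn = Counter()  # false negatives per class

    def update(self, y_true, y_pred):
        if y_true == y_pred:
            self.tp[y_true] += 1
        else:
            self.fp[y_pred] += 1
            self.fn[y_true] += 1

    def scores(self, label):
        tp, fp, fn = self.tp[label], self.fp[label], self.fn[label]
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        support = tp + fn  # number of true instances of the class seen so far
        return precision, recall, f1, support


toy_report = RunningReport()
pairs = [("a", "a"), ("a", "b"), ("b", "b"), ("b", "b"), ("c", "a")]
for y_true, y_pred in pairs:
    toy_report.update(y_true, y_pred)

for label in ("a", "b", "c"):
    print(label, toy_report.scores(label))
```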

## Regression

The last of the typical ML problems is regression. In this case, the model has to predict a numeric output for a given sample. A regression sample is made up of a set of features and a target, which is usually encoded as a continuous number, although it may also be discrete. Let's see an example with the Trump approval rating.

In [18]:
from river import datasets

dataset = datasets.TrumpApproval()
print(dataset)

So, we have a dataset with 6 features and we have to predict a continuous value, the approval rating. To do so, we are going to use a regression model, in this case, a [KNN](https://riverml.xyz/0.14.0/api/neighbors/KNNRegressor/) adapted to perform regression, which is already implemented in the library.

It must be noted that regression models do not have the <code>predict_proba_one()</code> method, since they do not calculate class probabilities.

In [19]:
from river import neighbors

data_stream = iter(dataset)
sample, target = next(data_stream)

model = neighbors.KNNRegressor()
print(model.predict_one(sample))

As can be seen, the model has not been trained yet and, therefore, the default output is $0.0$. Now, we are going to train the model and repeat the prediction.

In [20]:
# learn_one updates the model in place, so its return value is not reassigned
model.learn_one(sample, target)
print(model.predict_one(sample))
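To give an intuition of how a nearest-neighbour regressor can work on a stream, the sketch below keeps only a sliding window of recent samples and averages the targets of the closest ones. `WindowKNNRegressor`, its parameters and the uniform averaging are assumptions made for this illustration; River's `KNNRegressor` may store and weight neighbours differently.

```python
import math
from collections import deque


class WindowKNNRegressor:
    """Hypothetical sketch: k-nearest neighbours over a sliding window."""

    def __init__(self, n_neighbors: int = 3, window_size: int = 100) -> None:
        self.n_neighbors = n_neighbors
        # Old samples fall out automatically once the window is full.
        self.window = deque(maxlen=window_size)

    def learn_one(self, x, y):
        self.window.append((dict(x), y))

    def predict_one(self, x):
        if not self.window:
            return 0.0  # nothing learned yet, so fall back to a default output

        def distance(item):
            stored, _ = item
            names = set(stored) | set(x)
            return math.sqrt(
                sum((stored.get(n, 0.0) - x.get(n, 0.0)) ** 2 for n in names)
            )

        nearest = sorted(self.window, key=distance)[: self.n_neighbors]
        return sum(target for _, target in nearest) / len(nearest)


toy_knn = WindowKNNRegressor(n_neighbors=2)
untrained_output = toy_knn.predict_one({"a": 1.0})

toy_knn.learn_one({"a": 1.0}, 10.0)
single_output = toy_knn.predict_one({"a": 1.0})

toy_knn.learn_one({"a": 2.0}, 20.0)
toy_knn.learn_one({"a": 10.0}, 100.0)
neighbour_output = toy_knn.predict_one({"a": 1.5})

print(untrained_output, single_output, neighbour_output)
```

The window is what keeps memory bounded on an infinite stream, and it also lets the model forget old behaviour.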

Following the **progressive validation** scheme of the previous cases, we find the same loop of prediction, evaluation and training that we have seen before.

In [21]:
from river import metrics

model = neighbors.KNNRegressor()

metric = metrics.MAE()

for sample, target in dataset:
    prediction = model.predict_one(sample)
    metric.update(target, prediction)
    model.learn_one(sample, target)

print(metric)

Or, in the compact notation:

In [22]:
from river import evaluate

model = neighbors.KNNRegressor()
metric = metrics.MAE()

evaluate.progressive_val_score(dataset, model, metric)

MAE: 0.31039
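The metric itself also has to be computed without storing the past errors. A minimal sketch of a mean absolute error maintained as a running average follows; `RunningMAE` is a hypothetical class and the numbers are made up for the example.

```python
class RunningMAE:
    """Hypothetical sketch: mean absolute error updated one prediction at a time."""

    def __init__(self) -> None:
        self.n = 0
        self.value = 0.0

    def update(self, y_true: float, y_pred: float) -> None:
        self.n += 1
        # Incremental mean: move the current value towards the new absolute error.
        self.value += (abs(y_true - y_pred) - self.value) / self.n


toy_mae = RunningMAE()
for y_true, y_pred in [(10.0, 9.0), (10.0, 13.0), (10.0, 8.0)]:
    toy_mae.update(y_true, y_pred)

# The absolute errors are 1, 3 and 2, so their mean is 2.
print(toy_mae.value)
```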