# Streaming anomaly detection

Anomaly detection is a common machine learning task. Here we consider the case of streaming tabular data.

## Streaming a dataset

As an example, we'll use a credit card transactions dataset.

In [1]:
from river import datasets

dataset = datasets.CreditCard()
dataset

Credit card frauds.

The datasets contains transactions made by credit cards in September 2013 by european
cardholders. This dataset presents transactions that occurred in two days, where we have 492
frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class
(frauds) account for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation.
Unfortunately, due to confidentiality issues, we cannot provide the original features and more
background information about the data. Features V1, V2, ... V28 are the principal components
obtained with PCA, the only features which have not been transformed with PCA are 'Time' and
'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first
transaction in the dataset. The feature 'Amount' is the transaction Amount, this feature can be
used for example-dependant cost-senstive learning. Feature 'Class' is the response variable and
it tak

**Question 🤔: in production, what would be the output of an anomaly detector on this dataset?**

**Question 🤔: how would humans and the model interact with each other?**

**Question 🤔: how could you exploit human feedback?**

In [14]:
type(dataset)

river.datasets.credit_card.CreditCard

The dataset is special in that it isn't loaded in memory. When you loop over it with `for`, it will stream the dataset from the disk, one row at a time.

In [12]:
for transaction, is_fraud in dataset.take(1):
    ...

transaction

{'Time': 0.0,
 'V1': -1.3598071336738,
 'V2': -0.0727811733098497,
 'V3': 2.53634673796914,
 'V4': 1.37815522427443,
 'V5': -0.338320769942518,
 'V6': 0.462387777762292,
 'V7': 0.239598554061257,
 'V8': 0.0986979012610507,
 'V9': 0.363786969611213,
 'V10': 0.0907941719789316,
 'V11': -0.551599533260813,
 'V12': -0.617800855762348,
 'V13': -0.991389847235408,
 'V14': -0.311169353699879,
 'V15': 1.46817697209427,
 'V16': -0.470400525259478,
 'V17': 0.207971241929242,
 'V18': 0.0257905801985591,
 'V19': 0.403992960255733,
 'V20': 0.251412098239705,
 'V21': -0.018306777944153,
 'V22': 0.277837575558899,
 'V23': -0.110473910188767,
 'V24': 0.0669280749146731,
 'V25': 0.128539358273528,
 'V26': -0.189114843888824,
 'V27': 0.133558376740387,
 'V28': -0.0210530534538215,
 'Amount': 149.62}

In [13]:
is_fraud

0
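To make the idea of streaming from disk concrete, here is a minimal sketch of a generator that yields one row at a time. It is not River's implementation: `stream_rows` is a hypothetical helper, and the in-memory file with made-up values stands in for a CSV file on disk.

```python
import csv
import io


def stream_rows(file_obj, target):
    """Yield (features, label) pairs one row at a time, never loading the whole file."""
    reader = csv.DictReader(file_obj)
    for row in reader:
        label = int(row.pop(target))
        features = {name: float(value) for name, value in row.items()}
        yield features, label


# A tiny in-memory stand-in for a CSV file on disk (made-up values).
fake_file = io.StringIO(
    "Time,Amount,Class\n"
    "0.0,149.62,0\n"
    "1.0,2.69,0\n"
    "2.0,378.66,1\n"
)

# Running counts are all we need to keep in memory.
n_rows = 0
n_frauds = 0
for features, label in stream_rows(fake_file, target="Class"):
    n_rows += 1
    n_frauds += label
```

Only the current row and the two counters are held in memory, which is why this pattern scales to datasets that don't fit in RAM.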

**Question 🤔: what is the fraud rate?**

## Progressive validation

In [23]:
from river import anomaly
from river import compose
from river import metrics
from river import preprocessing

model = compose.Pipeline(
    preprocessing.MinMaxScaler(),
    anomaly.HalfSpaceTrees(seed=42)
)

metric = metrics.ROCAUC()

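for x, y in dataset.take(100_000):
    score = model.score_one(x)   # score first, before the model has seen x
    model.learn_one(x)           # then learn; the model is updated in place
    metric.update(y, score)      # the metric is updated in place as well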

metric

ROCAUC: 91.49%

**Question 🤔: what do you think of this way of evaluating a model?**

Normally, an anomaly detection task is tackled with an unsupervised model due to a lack of labels. Here, we have labels, which allows us to evaluate the model's performance. It also means we can train a supervised model and see if it performs any better.

In [24]:
from river import linear_model
from river import preprocessing

model = compose.Pipeline(
    preprocessing.StandardScaler(),
    linear_model.LogisticRegression()
)

metric = metrics.ROCAUC()

for x, y in dataset.take(100_000):
    score = model.predict_proba_one(x)[True]  # predict before seeing the label
    model.learn_one(x, y)                     # then learn; updates the model in place
    metric.update(y, score)                   # the metric is updated in place as well

metric

ROCAUC: 89.20%

**Question 🤔: why do you think the performance is worse?**

River also has an `evaluate` module with a `progressive_val_score` function.

In [25]:
from river import evaluate

evaluate.progressive_val_score(
    dataset.take(100_000),
    model=compose.Pipeline(
        preprocessing.StandardScaler(),
        linear_model.LogisticRegression()
    ),
    metric=metrics.ROCAUC(),
    print_every=10_000,
    show_time=True,
    show_memory=True
)

[10,000] ROCAUC: 94.57% – 00:00:00 – 10.3 KB
[20,000] ROCAUC: 89.21% – 00:00:01 – 10.3 KB
[30,000] ROCAUC: 87.08% – 00:00:01 – 10.3 KB
[40,000] ROCAUC: 87.39% – 00:00:02 – 10.3 KB
[50,000] ROCAUC: 90.46% – 00:00:03 – 10.3 KB
[60,000] ROCAUC: 89.19% – 00:00:03 – 10.3 KB
[70,000] ROCAUC: 89.08% – 00:00:04 – 10.3 KB
[80,000] ROCAUC: 89.23% – 00:00:05 – 10.3 KB
[90,000] ROCAUC: 89.76% – 00:00:05 – 10.3 KB
[100,000] ROCAUC: 89.20% – 00:00:06 – 10.3 KB


ROCAUC: 89.20%
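Conceptually, progressive validation is a test-then-train loop: each sample is first used to evaluate the model, and only then to update it. The sketch below illustrates that ordering with a toy model and plain accuracy. `RunningMajority` and `progressive_validation` are hypothetical names for illustration, not River's API.

```python
class RunningMajority:
    """Toy classifier: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {}

    def predict_one(self, x):
        if not self.counts:
            return None  # nothing learned yet
        return max(self.counts, key=self.counts.get)

    def learn_one(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def progressive_validation(stream, model):
    """Test-then-train: predict on each sample before learning from it."""
    n_correct = 0
    n_total = 0
    for x, y in stream:
        y_pred = model.predict_one(x)  # 1. predict before seeing the label
        n_correct += int(y_pred == y)  # 2. update the metric
        n_total += 1
        model.learn_one(x, y)          # 3. only then learn from the sample
    return n_correct / n_total


# Made-up stream of (features, label) pairs; features are unused by the toy model.
stream = [({}, 0), ({}, 0), ({}, 1), ({}, 0)]
accuracy = progressive_validation(stream, RunningMajority())
```

Because every prediction is made on a sample the model has not yet learned from, the whole stream serves as both training and test data without leakage.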

## Improving the supervised approach

In an anomaly detection task, the number of positive cases is usually much lower than the number of negatives. This penalizes many supervised classification models, because they are often based on the assumption that the data is somewhat balanced. In the case of logistic regression, it's possible to adjust the loss function to increase the importance of positive samples in the learning process.

In [26]:
from river import optim

evaluate.progressive_val_score(
    dataset.take(100_000),
    model=compose.Pipeline(
        preprocessing.StandardScaler(),
        linear_model.LogisticRegression(
            loss=optim.losses.Log(weight_pos=5)
        )
    ),
    metric=metrics.ROCAUC(),
    print_every=10_000,
    show_time=True,
    show_memory=True
)

[10,000] ROCAUC: 95.90% – 00:00:00 – 10.26 KB
[20,000] ROCAUC: 92.71% – 00:00:01 – 10.26 KB
[30,000] ROCAUC: 91.84% – 00:00:01 – 10.26 KB
[40,000] ROCAUC: 92.17% – 00:00:02 – 10.26 KB
[50,000] ROCAUC: 94.16% – 00:00:03 – 10.26 KB
[60,000] ROCAUC: 92.55% – 00:00:03 – 10.26 KB
[70,000] ROCAUC: 92.21% – 00:00:04 – 10.26 KB
[80,000] ROCAUC: 92.28% – 00:00:05 – 10.26 KB
[90,000] ROCAUC: 92.59% – 00:00:06 – 10.26 KB
[100,000] ROCAUC: 91.87% – 00:00:06 – 10.26 KB


ROCAUC: 91.87%
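To get a feel for what weighting the positive class does, here is a minimal sketch of a class-weighted log loss for a single sample. It assumes that `weight_pos` simply multiplies the loss of positive samples; `weighted_log_loss` is a hypothetical helper written for illustration, not River's code.

```python
import math


def weighted_log_loss(y_true, p_pred, weight_pos=1.0, weight_neg=1.0):
    """Log loss for one sample, with a separate weight per class (assumed behaviour)."""
    if y_true == 1:
        return -weight_pos * math.log(p_pred)
    return -weight_neg * math.log(1.0 - p_pred)


# Predicting a fraud probability of 0.1 for an actual fraud is a costly mistake.
plain = weighted_log_loss(1, 0.1)
weighted = weighted_log_loss(1, 0.1, weight_pos=5)

# Negative samples are unaffected by weight_pos.
negative = weighted_log_loss(0, 0.1, weight_pos=5)
```

Under this assumption, the gradient for positive samples is scaled by the same factor, so each fraud moves the weights five times as much as it would with the unweighted loss.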

An alternative is to under-sample the majority class. The idea is that the model is being drowned in negative examples, so adjusting the class distribution it learns from can help. Note that one could also over-sample the minority class. However, under-sampling has the advantage of reducing the processing time, because fewer samples have to be processed.

In [27]:
from river import imblearn

evaluate.progressive_val_score(
    dataset.take(100_000),
    model=compose.Pipeline(
        preprocessing.StandardScaler(),
        imblearn.RandomUnderSampler(
            classifier=linear_model.LogisticRegression(),
            desired_dist={0: .8, 1: .2},
            seed=42
        )
    ),
    metric=metrics.ROCAUC(),
    print_every=10_000,
    show_time=True,
    show_memory=True
)

[10,000] ROCAUC: 94.55% – 00:00:00 – 14.33 KB
[20,000] ROCAUC: 95.59% – 00:00:01 – 14.33 KB
[30,000] ROCAUC: 95.40% – 00:00:01 – 14.33 KB
[40,000] ROCAUC: 95.34% – 00:00:02 – 14.33 KB
[50,000] ROCAUC: 96.72% – 00:00:02 – 14.33 KB
[60,000] ROCAUC: 95.42% – 00:00:03 – 14.33 KB
[70,000] ROCAUC: 95.14% – 00:00:03 – 14.33 KB
[80,000] ROCAUC: 95.38% – 00:00:04 – 14.33 KB
[90,000] ROCAUC: 95.72% – 00:00:05 – 14.33 KB
[100,000] ROCAUC: 95.26% – 00:00:05 – 14.33 KB


ROCAUC: 95.26%
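One way to under-sample a stream is rejection sampling: track the class frequencies observed so far, and keep each sample with a probability chosen so that the kept samples approach the desired distribution. The sketch below is a simplified illustration of that idea, not River's actual implementation. The `UnderSampler` class and the synthetic stream with roughly 1% positives are assumptions made for the example.

```python
import random
from collections import Counter


class UnderSampler:
    """Rejection sampling that thins out over-represented classes (illustrative only)."""

    def __init__(self, desired_dist, seed=None):
        self.desired_dist = desired_dist
        self.seen = Counter()
        self.rng = random.Random(seed)

    def keep(self, y):
        """Return True if a sample with label y should be passed on to the model."""
        self.seen[y] += 1
        total = sum(self.seen.values())
        # Ratio of desired to observed frequency, for each class seen so far.
        ratios = {
            c: self.desired_dist[c] / (n / total)
            for c, n in self.seen.items()
        }
        # The most under-represented class is always kept; the others are thinned.
        keep_prob = ratios[y] / max(ratios.values())
        return self.rng.random() < keep_prob


# Synthetic, made-up label stream with roughly 1% positives.
rng = random.Random(0)
stream = [int(rng.random() < 0.01) for _ in range(100_000)]

sampler = UnderSampler(desired_dist={0: 0.8, 1: 0.2}, seed=42)
kept = Counter(y for y in stream if sampler.keep(y))
kept_pos_rate = kept[1] / sum(kept.values())
```

In this sketch, nearly all positives are kept while most negatives are dropped, so `kept_pos_rate` should land near the desired 20% and the model sees only a small fraction of the stream.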

Nothing prevents us from combining the two approaches.

In [28]:
from river import imblearn

evaluate.progressive_val_score(
    dataset.take(100_000),
    model=compose.Pipeline(
        preprocessing.StandardScaler(),
        imblearn.RandomUnderSampler(
            classifier=linear_model.LogisticRegression(
                loss=optim.losses.Log(weight_pos=5)
            ),
            desired_dist={0: .8, 1: .2},
            seed=42
        )
    ),
    metric=metrics.ROCAUC(),
    print_every=10_000,
    show_time=True,
    show_memory=True
)

[10,000] ROCAUC: 94.23% – 00:00:00 – 14.28 KB
[20,000] ROCAUC: 96.77% – 00:00:01 – 14.28 KB
[30,000] ROCAUC: 96.86% – 00:00:01 – 14.28 KB
[40,000] ROCAUC: 96.54% – 00:00:02 – 14.28 KB
[50,000] ROCAUC: 97.54% – 00:00:02 – 14.28 KB
[60,000] ROCAUC: 97.15% – 00:00:03 – 14.28 KB
[70,000] ROCAUC: 96.83% – 00:00:03 – 14.28 KB
[80,000] ROCAUC: 96.77% – 00:00:04 – 14.28 KB
[90,000] ROCAUC: 96.97% – 00:00:05 – 14.28 KB
[100,000] ROCAUC: 96.49% – 00:00:05 – 14.28 KB


ROCAUC: 96.49%

## Going further: active learning

We started off with an unsupervised approach, because we assumed we had no labels to train a supervised model. Next, we trained a supervised model, which performed better after some tuning. In a real setup, labels wouldn't be available at first. One way to proceed would be to run both models side by side.

The first model would be unsupervised and rank samples according to their anomaly score. Humans would label the samples according to this ranking. These labels would then feed into the second model. A great way to prioritize this labelling effort is to use active learning. See a demo [here](https://next.databutton.com/v/13lkg6b6), with explanations [here](https://maxhalford.github.io/blog/online-active-learning-river-databutton/).
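The ranking step described above can be sketched as follows: given anomaly scores from the unsupervised model and a fixed labelling budget, pick the highest-scoring samples for human review. The function name `pick_for_labelling`, the transaction ids, and the scores are all made up for illustration. This covers only the ranking step, not a full active learning loop.

```python
import heapq


def pick_for_labelling(scored_samples, budget):
    """Return the `budget` samples with the highest anomaly scores, highest first."""
    return heapq.nlargest(budget, scored_samples, key=lambda pair: pair[1])


# Made-up (transaction_id, anomaly_score) pairs from the unsupervised model.
scored = [("t1", 0.12), ("t2", 0.91), ("t3", 0.47), ("t4", 0.88), ("t5", 0.05)]

# With a budget of two reviews, the two most anomalous transactions are selected.
to_label = pick_for_labelling(scored, budget=2)
```

The labels collected for `to_label` would then be passed to the supervised model's `learn_one`, so that it gradually gets trained on the samples humans have reviewed.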

**Question 🤔: if two models are running side by side, how do you determine which one's outputs should be used?**