<img src="./img/sktime-logo-text-horizontal.jpg" alt="sktime logo" style="width: 100%; max-width: 600px;">

<span style="font-size: 5em;"> `skchange` </span>
<img src="./img/NR_logo_hvit.png" alt="NR logo" style="width: 30%; max-width: 400px;"> 

## Agenda

1. **General introduction** to scikit-learn-like packages

    * `sklearn`
    * `sktime`
    * `skchange`

2. **Detection**

    * Change detection and segmentation
    * Segment anomaly detection
    * Detector API
    * Algorithm framework
    * Costs and scores

3. **Use cases**

    * Fault detection - heating and ventilation system

## Running the notebooks

See the public repository
[https://github.com/NorskRegnesentral/skchange-tutorial-hydro](https://github.com/NorskRegnesentral/skchange-tutorial-hydro)


# Introduction to scikit-learn-like packages


## What is `sklearn`?

Perhaps the most widely used package for traditional machine learning in Python.

- Unified interface for estimators
- Modular design
- Composable
- Simple parameter interface

=> 

- Easy to swap one estimator for another.
- Easy to build more advanced models from existing components.
- Enables generalised tools for model tuning, evaluation, etc.

[sklearn website](https://scikit-learn.org/stable/)
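As a rough sketch of the last two points, the cell below composes a scaler and a classifier into a single estimator and tunes it with a generic search tool. The step names (`"scaler"`, `"svc"`), the parameter grid and `random_state=0` are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compose two components into one estimator with the usual fit/predict interface.
pipeline = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Generic tuning tool: parameters of a step are addressed as "<step>__<param>".
# The grid values are arbitrary illustrative choices.
param_grid = {"svc__C": [0.1, 1.0, 10.0], "svc__gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print(best_params, round(test_accuracy, 3))
```

The same `GridSearchCV` call would work unchanged if the `"svc"` step were replaced by any other classifier, as long as the parameter names in the grid are updated.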

### `sklearn` unified interface

`sklearn` provides a unified interface to multiple learning tasks, including classification and regression.

1. **Instantiate** your model of choice, with parameter settings
2. **Fit** the instance of your model
3. Use the fitted instance to **predict** on new data!

<!-- <img src="./img/estimator-conceptual-model.jpg" alt="Estimator conceptual model" style="width: 100%; max-width: 1200px;"> -->

In [1]:
import warnings

warnings.filterwarnings("ignore")

In [2]:
# get data to use the model on
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [3]:
X_train.head()

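|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|----:|------------------:|-----------------:|------------------:|-----------------:|
| 130 | 7.4 | 2.8 | 6.1 | 1.9 |
| 128 | 6.4 | 2.8 | 5.6 | 2.1 |
| 71  | 6.1 | 2.8 | 4.0 | 1.3 |
| 8   | 4.4 | 2.9 | 1.4 | 0.2 |
| 121 | 5.6 | 2.8 | 4.9 | 2.0 |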


In [4]:
y_train.head()

130    2
128    2
71     1
8      0
121    2
Name: target, dtype: int64

In [5]:
X_test.head()

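|     | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|----:|------------------:|-----------------:|------------------:|-----------------:|
| 24  | 4.8 | 3.4 | 1.9 | 0.2 |
| 32  | 5.2 | 4.1 | 1.5 | 0.1 |
| 83  | 6.0 | 2.7 | 5.1 | 1.6 |
| 94  | 5.6 | 2.7 | 4.2 | 1.3 |
| 77  | 6.7 | 3.0 | 5.0 | 1.7 |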


In [6]:
from sklearn.svm import SVC

# 1. Instantiate SVC with parameters gamma, C
classifier = SVC(gamma=0.001, C=100.0)

# 2. Fit classifier to training data
classifier.fit(X_train, y_train)

# 3. Predict labels on test data
y_test_pred = classifier.predict(X_test)

y_test_pred

array([0, 0, 2, 1, 2, 1, 1, 0, 0, 1, 2, 1, 0, 1, 2, 2, 0, 2, 1, 2, 2, 2,
       2, 2, 2, 0, 2, 2, 2, 0, 0, 1, 0, 1, 0, 2, 2, 1])

IMPORTANT: To use another classifier, only step 1, the specification line, changes!

`SVC` could have been `RandomForestClassifier`; steps 2 and 3 remain the same thanks to the unified interface:

In [7]:
from sklearn.ensemble import RandomForestClassifier

# 1. Instantiate RandomForestClassifier with parameter n_estimators
classifier = RandomForestClassifier(n_estimators=100)

# 2. Fit classifier to training data
classifier.fit(X_train, y_train)

# 3. Predict labels on test data
y_test_pred = classifier.predict(X_test)

y_test_pred

array([0, 0, 2, 1, 2, 1, 1, 0, 0, 1, 2, 1, 0, 1, 2, 2, 0, 2, 1, 2, 2, 2,
       2, 2, 2, 0, 2, 2, 2, 0, 0, 1, 0, 1, 0, 2, 2, 1])

In object-oriented design terminology, this is called the **"strategy pattern"**

= different estimators can be switched out without changing the interface

= like a power plug adapter, it's plug and play if it conforms to the interface
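A minimal sketch of what this buys in practice: one helper function, written once against the `fit`/`score` interface, can evaluate any classifier. The helper `evaluate`, the choice of candidates and their parameter values are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


def evaluate(classifier):
    """Fit any sklearn-style classifier and return its test accuracy."""
    classifier.fit(X_train, y_train)
    return classifier.score(X_test, y_test)


# Interchangeable "strategies": only the specification differs.
candidates = {
    "svc": SVC(gamma=0.001, C=100.0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logistic": LogisticRegression(max_iter=1000),
}

accuracies = {name: evaluate(clf) for name, clf in candidates.items()}
for name, accuracy in accuracies.items():
    print(f"{name}: {accuracy:.3f}")
```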

Parameters can be accessed and set via `get_params` and `set_params`:

In [8]:
classifier.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
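A short sketch of the setter side, assuming the same `RandomForestClassifier` as above. The new parameter values are arbitrary. `clone` builds a fresh, unfitted estimator with identical parameters, which is what tuning and evaluation tools rely on.

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators=100)

# Update parameters in place; set_params returns the estimator itself.
classifier.set_params(n_estimators=10, max_depth=3)
params = classifier.get_params()
print(params["n_estimators"], params["max_depth"])

# clone creates a new unfitted estimator with the same parameters.
classifier_copy = clone(classifier)
```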

## `sktime`: `sklearn` for time series

Richer space of time series tasks, compared to "tabular":

- **Classification** - predict a class label
- **Regression** - predict a continuous value
- **Clustering** - group similar samples
- **Forecasting** - predict the future based on the past
- **Detection and segmentation** - identify jumps, anomalies, segments or other events in a data stream

`sktime`: `sklearn`-like interfaces for these tasks, modular and composable.

[sktime website](https://www.sktime.net/en/latest/index.html)


### Example of change detection
**Change detection** - identify points in a time series where properties of the data change.

The terms "change detection" and "change point detection" are used interchangeably.

In [9]:
from skchange.datasets.generate import generate_alternating_data
from utils import plot_multivariate_time_series, add_changepoint_vlines

from skchange.change_detectors.moving_window import MovingWindow

df = generate_alternating_data(n_segments=10, segment_length=50, mean=5, random_state=1)

detector = MovingWindow(bandwidth=30)  # 1) Instantiate
detector.fit(df)                       # 2) Fit to training data
cpts = detector.predict(df)            # 3) Predict changepoints

cpt_fig = plot_multivariate_time_series(df)
cpt_fig = add_changepoint_vlines(cpt_fig, cpts)
cpt_fig.update_layout(showlegend=False, xaxis_title=None).show()
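To give some intuition for what a moving window detector computes, the sketch below scores each time point by the absolute difference between the means of the windows to its left and right. This is a simplified illustration in plain NumPy, not the `skchange` implementation; the function name `moving_window_scores` and the simulated data are assumptions made for the example.

```python
import numpy as np


def moving_window_scores(x, bandwidth):
    """Score each point by |mean(right window) - mean(left window)|."""
    n = len(x)
    scores = np.zeros(n)
    for t in range(bandwidth, n - bandwidth + 1):
        left = x[t - bandwidth : t]
        right = x[t : t + bandwidth]
        scores[t] = abs(right.mean() - left.mean())
    return scores


# Simulated series with a single mean shift from 0 to 5 at index 100.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])

scores = moving_window_scores(x, bandwidth=30)
estimated_changepoint = int(np.argmax(scores))
print(estimated_changepoint)
```

The score peaks where the two windows straddle the change. A full detector would also need a threshold and a rule for picking one peak per change, which this sketch leaves out.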

## `skchange`: `sktime`-compatible change and anomaly detection

A 2nd party extension to `sktime`'s maturing detection module.

Main focus: Statistical methods for change detection and segment anomaly detection.

* **Fast**: Numba is used for performance.
* **Easy to use**: Follows the conventions of sktime and scikit-learn.
* **Easy to extend**: 

  - Make your own detectors by inheriting from the base class templates.
  - Composable detection algorithms.
  - Create custom detection scores and cost function components.
* **Segment anomaly detection**: Detect intervals of anomalous behaviour in time series data.
* **Subset anomaly detection**: Detect intervals of anomalous behaviour in time series data, and infer the subset of variables that cause the anomaly.
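As a conceptual sketch of how a cost function can be turned into a change score, the cell below uses the squared-error cost of a segment and measures how much the total cost drops when the series is split in two. The names `l2_cost` and `change_score` are hypothetical and do not follow the `skchange` API; they only illustrate the idea.

```python
import numpy as np


def l2_cost(x):
    """Cost of one segment: sum of squared deviations from the segment mean."""
    return float(np.sum((x - x.mean()) ** 2))


def change_score(x, split):
    """Cost reduction from splitting x into x[:split] and x[split:]."""
    return l2_cost(x) - l2_cost(x[:split]) - l2_cost(x[split:])


# Simulated series with a mean shift from 0 to 3 at index 50.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])

# Evaluate every candidate split that leaves at least 5 points on each side.
splits = np.arange(5, len(x) - 4)
scores = np.array([change_score(x, s) for s in splits])
best_split = int(splits[np.argmax(scores)])
print(best_split)
```

Swapping `l2_cost` for another cost, for example one that is sensitive to variance changes, would change what kind of change the score detects without touching `change_score`.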

## Summary

- `sklearn`: Unified interface, modular, composable, easy specification language
- `sktime`: Extends the interface to time series learning tasks
- `skchange`: Extends `sktime` with fast and up-to-date change and anomaly detection methods


## Next:

* In-depth `skchange` and detection tasks (50 min)
* Use case (10 min)

---
## Credits

Notebook creation: tveten

Many vignettes based on existing `sktime` tutorials, credit: fkiraly, miraep8, marrov

General credit also to `sklearn`, `sktime` and `skchange` contributors