# Getting started

This section helps you get started with Skchange. It briefly covers the fundamental concepts of the library.

## Installation
```bash
pip install skchange
```

To make full use of the library, you can install the optional Numba dependency. This will speed up the computation of the algorithms in Skchange, often by as much as 10-100 times.

```bash
pip install skchange[numba]
```

## Change detection basics

Change detection is the task of identifying abrupt changes in the distribution of a time series. The goal is to estimate the time points at which the distribution changes. These points are called change points (or change-points or changepoints).

Here is an example of two changes in the mean of a Gaussian time series with unit variance.

![](../_static/images/changepoint_illustration.png)
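A series like the one illustrated can be simulated with plain NumPy. This is only a minimal sketch: the change point locations, segment means, and seed below are assumptions chosen for illustration, not the values behind the figure.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Hypothetical layout: three segments separated by two change points.
n = 300
change_points = [100, 200]
segment_means = [0.0, 3.0, -1.0]

# Build the piecewise-constant mean, then add unit-variance Gaussian noise.
bounds = [0, *change_points, n]
mean = np.concatenate(
    [
        np.full(end - start, m)
        for start, end, m in zip(bounds[:-1], bounds[1:], segment_means)
    ]
)
series = mean + rng.standard_normal(n)
```

The goal of a change detector is to recover `change_points` from `series` alone.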

Changes may occur in much more complex ways. For example, changes can affect:

- Variance.
- Shape of the distribution.
- Auto-correlation.
- Relationships between variables in multivariate time series.
- An unknown, small portion of variables in a high-dimensional time series.

Skchange supports detecting changes in all of these scenarios, amongst others.

## Change detection in Skchange
Skchange follows the familiar scikit-learn-style API and is compatible with Sktime.

Here's an example of a change detector:

In [98]:
from skchange.change_detectors import MovingWindow
from skchange.change_scores import CUSUM
from skchange.penalties import BICPenalty

detector = MovingWindow(
    change_score=CUSUM(),
    # penalty=BICPenalty(),
)
detector





Let us look at each part of the detector in more detail:

1. `change_score`: Represents the choice of feature to detect changes in. `CUSUM` is a popular choice for detecting changes in the mean of a time series.
2. `penalty`: Used to control the complexity of the change point model. The higher the penalty, the fewer change points will be detected. The BIC penalty is a standard choice and serves as a default in all detectors.
3. `detector`: The search algorithm for detecting change points. It governs which data intervals the change score is evaluated on and how the results are compiled to a final set of detected change points.

In Skchange, all detectors follow the same pattern. They are composed of some kind of score to be evaluated on data intervals, and a penalty. We will soon get back to the meaning of "some kind of score".

### `fit`
After initialising your detector of choice, you need to fit it to training data before you can use it to detect change points.

Here is some 3-dimensional Gaussian toy data consisting of four segments with different mean vectors.

In [41]:
import numpy as np

from skchange.datasets.generate import generate_changing_data

# Generate data
n = 300
cpts = [100, 120, 250]
means = [
    np.array([0.0, 0.0, 0.0]),
    np.array([8.0, 0.0, 0.0]),
    np.array([0.0, 0.0, 0.0]),
    np.array([2.0, 3.0, 5.0]),
]
x = generate_changing_data(n, changepoints=cpts, means=means)
x.columns = ["var0", "var1", "var2"]
x.index.name = "time"
x

| time | var0 | var1 | var2 |
|------|------|------|------|
| 0 | -2.752893 | -0.515398 | 0.454008 |
| 1 | -0.151696 | -0.134083 | -0.314193 |
| 2 | 0.009432 | 1.919497 | -0.267541 |
| 3 | -0.894207 | 0.822849 | -0.329668 |
| 4 | 0.311975 | -1.104357 | -0.466815 |
| ... | ... | ... | ... |
| 295 | 0.321262 | 2.870586 | 4.281460 |
| 296 | 3.571222 | 4.022462 | 5.988896 |
| 297 | 3.677609 | 2.953683 | 3.788493 |
| 298 | 1.405786 | 2.901411 | 5.387476 |


Here is how the data looks:

In [42]:
import plotly.express as px

px.line(x)

As in scikit-learn, the role of `fit` is to learn the "fittable" parameters of the detector before it can be used for detection tasks on test data. In Skchange, the primary parameter to fit is the `penalty`. For example, the BIC penalty is `BIC = (n_params + 1) * log(n_samples)`, where `n_params` is the total number of parameters per segment, obtained from the change score, and `n_samples = x.shape[0]`.

In [101]:
detector.fit(x)

In [102]:
detector.get_fitted_params()

{'penalty': BICPenalty(),
 'penalty__n': 300,
 'penalty__n_params_per_variable': 1,
 'penalty__n_params_total': 3,
 'penalty__p': 3}
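As a quick sanity check, the fitted values above can be plugged into the BIC formula by hand. This sketch assumes the formula from the text is applied verbatim with `n_params` equal to `penalty__n_params_total`; the value Skchange computes internally may be scaled differently.

```python
import numpy as np

# Values copied from the fitted parameters shown above.
n_samples = 300  # penalty__n
n_params = 3  # penalty__n_params_total (assumed to play the role of n_params)

bic_penalty = (n_params + 1) * np.log(n_samples)
print(round(float(bic_penalty), 2))  # roughly 22.82
```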

### `predict`

### `transform`

<!-- Most change detection algorithms follow the same structure. They consist of three parts:

1. Score (in a wide sense)
2. Penalty
3. Search algorithm

In `skchange`, all detectors are composable with respect to these parts.
The detector specifies the search algorithm, and it is composed of an interval scorer and a penalty. -->

## Interval scores
The choice of interval score represents the choice of distributional feature(s) to detect changes in.

Interval scores are not primarily meant to be used directly, but understanding them as building blocks is important for making full use of the library.

The most basic type of interval score in Skchange is the *cost*.
A cost measures the cost/loss/error of a model fit to a data interval `X[s:e]`.

In [None]:
import numpy as np

from skchange.costs import GaussianCost

X = np.random.rand(100)

cost = GaussianCost()  # Cost for a Gaussian model with constant mean and variance.
cost.fit(X)  # Set up the cost for the given data.
cost.evaluate([0, 10])  # Evaluate the cost for the given interval, X[0:10].

array([[0.86044396]])

Another type of interval score is the *change score*. A change score measures the degree of change between two adjacent intervals `X[s:k]` and `X[k:e]`. Change scores can be statistical tests, time series distances, or any other measure of difference.

In [16]:
from skchange.change_scores import CUSUM

score = CUSUM()  # CUSUM score for a change in mean.
score.fit(X)  # Set up the score for the given data.
score.evaluate([0, 5, 10])  # Evaluate the change score between X[0:5] and X[5:10].

array([[0.18612366]])

We can also compute several interval scores at once.

In [14]:
score.evaluate([[0, 5, 10], [10, 12, 30], [60, 69, 71]])

array([[0.18612366],
       [0.76651348],
       [0.52551144]])
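For intuition about what a mean-change score measures, here is a minimal NumPy sketch of the textbook CUSUM statistic for a single split. The function name `cusum_mean_score` is hypothetical, and Skchange's `CUSUM` may differ in scaling or in how it handles multiple variables.

```python
import numpy as np


def cusum_mean_score(x, start, split, end):
    """Textbook CUSUM statistic for a mean change at `split` within x[start:end]."""
    left, right = x[start:split], x[split:end]
    n_left, n_right = len(left), len(right)
    # Weight the mean difference by the effective sample size of the split.
    weight = np.sqrt(n_left * n_right / (n_left + n_right))
    return weight * abs(left.mean() - right.mean())


x = np.concatenate([np.zeros(5), np.ones(5)])
with_change = cusum_mean_score(x, 0, 5, 10)  # sqrt(2.5) * |0 - 1|
without_change = cusum_mean_score(x, 0, 2, 5)  # both sides have mean 0
```

The score is large when the means on each side of the split differ and zero when they coincide.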

The computational bottleneck of change detection algorithms is to evaluate an interval score over a large number of intervals and possible splits. In Skchange, this is solved as follows:

- In `fit`, relevant quantities are precomputed to speed up the cost evaluations.
- In `evaluate`, `numba` is leveraged to efficiently evaluate many interval-split-pairs in one call.
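The precomputation idea can be sketched with cumulative sums. The class below is a hypothetical illustration, not Skchange's implementation: it uses a squared-error cost around the interval mean and NumPy vectorisation in place of `numba`.

```python
import numpy as np


class PrefixSumL2Cost:
    """Hypothetical helper: squared-error cost around the interval mean."""

    def fit(self, x):
        x = np.asarray(x, dtype=float)
        # Precompute cumulative sums once, with a leading zero so that
        # sum(x[s:e]) == csum_[e] - csum_[s].
        self.csum_ = np.concatenate(([0.0], np.cumsum(x)))
        self.csum2_ = np.concatenate(([0.0], np.cumsum(x**2)))
        return self

    def evaluate(self, intervals):
        intervals = np.atleast_2d(intervals)
        starts, ends = intervals[:, 0], intervals[:, 1]
        n = ends - starts
        s1 = self.csum_[ends] - self.csum_[starts]
        s2 = self.csum2_[ends] - self.csum2_[starts]
        # sum((x - mean)^2) == sum(x^2) - sum(x)^2 / n, in O(1) per interval.
        return s2 - s1**2 / n


x = np.random.default_rng(1).normal(size=100)
cost = PrefixSumL2Cost().fit(x)
fast = cost.evaluate([[0, 10], [10, 40]])

# Direct computation for comparison; this is O(interval length) per interval.
slow = np.array(
    [np.sum((x[s:e] - x[s:e].mean()) ** 2) for s, e in [(0, 10), (10, 40)]]
)
```

After `fit`, each interval costs a constant number of array lookups regardless of its length, which is what makes evaluating many intervals cheap.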

Moreover, costs can always be used to construct a change score by the following formula:
```
score.evaluate([start, split, end]) = cost.evaluate([start, end]) - (cost.evaluate([start, split]) + cost.evaluate([split, end]))
```
You can read this formula as "score = cost of the interval without a change point - cost of the interval with a single change point".

This means that you can always pass a cost to a change detector, even one that expects change scores, because the cost is converted to a change score internally.
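The conversion formula can be checked on a tiny example. This is a sketch under stated assumptions: `l2_cost` and `score_from_cost` are hypothetical helpers using a squared-error cost, not functions from Skchange.

```python
import numpy as np


def l2_cost(x, start, end):
    """Squared error of x[start:end] around its own mean."""
    segment = x[start:end]
    return float(np.sum((segment - segment.mean()) ** 2))


def score_from_cost(x, start, split, end):
    """Cost without a change point minus cost with one change point at `split`."""
    return l2_cost(x, start, end) - (
        l2_cost(x, start, split) + l2_cost(x, split, end)
    )


x = np.concatenate([np.zeros(5), np.ones(5)])
score_at_change = score_from_cost(x, 0, 5, 10)  # 2.5 - (0 + 0)
score_no_change = score_from_cost(x, 0, 2, 5)  # all costs are 0
```

Splitting at the true change removes all of the fitting error, so the score equals the full cost of the unsplit interval. Where there is no change, splitting gains nothing and the score is zero.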

At the same time, Skchange also supports change scores that cannot be reduced to costs. This is different from, e.g., the `ruptures` library. Quite a few important scores cannot be reduced to costs, such as the Mann-Whitney U test, the Kolmogorov-Smirnov test, and scores for sparse change detection.

## Penalties
A penalty is a function that penalizes the number of change points. The higher the penalty, the fewer change points are detected.

## Change detection
A change detection method in Skchange is a search method for finding the optimal change points given an interval scorer and a penalty.
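To make the interplay between search method, cost, and penalty concrete, here is a minimal sketch of textbook penalised optimal partitioning with a squared-error cost. It is not one of Skchange's detectors; the function names, the toy data, and the penalty values are all assumptions made for illustration.

```python
import numpy as np


def l2_cost(csum, csum2, start, end):
    """Squared error of x[start:end] around its mean, from cumulative sums."""
    total = csum[end] - csum[start]
    return (csum2[end] - csum2[start]) - total**2 / (end - start)


def optimal_partitioning(x, penalty):
    """Minimise the sum of segment costs plus penalty * number of change points."""
    n = len(x)
    csum = np.concatenate(([0.0], np.cumsum(x)))
    csum2 = np.concatenate(([0.0], np.cumsum(x**2)))
    best = np.full(n + 1, np.inf)
    best[0] = -penalty  # so that the first segment is not penalised
    last = np.zeros(n + 1, dtype=int)  # start of the final segment ending at e
    for e in range(1, n + 1):
        for s in range(e):
            candidate = best[s] + l2_cost(csum, csum2, s, e) + penalty
            if candidate < best[e]:
                best[e], last[e] = candidate, s
    # Backtrack through the segment starts to recover the change points.
    change_points, e = [], n
    while e > 0:
        s = int(last[e])
        if s > 0:
            change_points.append(s)
        e = s
    return sorted(change_points)


rng = np.random.default_rng(0)
segment_means = np.repeat([0.0, 5.0, 0.0], 50)  # true changes at 50 and 100
x = segment_means + rng.standard_normal(150)

penalties = [1.0, 2 * np.log(len(x)), 1e6]
results = {p: optimal_partitioning(x, p) for p in penalties}
counts = [len(results[p]) for p in penalties]
```

Inspecting `counts` shows the effect of the penalty: a very small penalty tends to split on noise, a moderate one keeps only well-supported changes, and a very large one returns no change points at all.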

## Segment anomaly detection
