# Variance and autocorrelation insights

### Approach for identifying and labeling time series

<hr>

**NB :** In this notebook, we wanted to evaluate if variance and autocorrelation were good indicators of a imminent transition (like a **step** in the data). This study analyses where the **steps / brutal changes** were in the data and uses the signals before **to try to predict these steps**. We use variance and autocorrelation as features for our regression.

<hr>

**NB :** However from later discussion with the EDF team, we realized that these abrupt changes in the data didn't really correspond to predictable patterns because they were related to the random action of external agents, like the people responsible for the regulation of the pump. 

<hr>

We **segment** a univariate time series data according to the timestamps at which something "goes wrong". From the **intervals** (different sizes) obtained we decide the following:

- Since at the end of each interval, something goes wrong, we can **further segment** our time series data and **label as `bad` (1)** the final portion each interval.
- Since there is **no problem** occuring **during** each interval, we can **further segment** the beginning of these intervals and **label them as `good` (2)**.

We are therefore left with a set of healthy/unhealthy (shorter) time series.

In [2]:
from observation import Observation

PATH = "../../Data/GMPP_IRSDI/"

fnames = ["A1-DEB1-1.txt"] 
tags = ["deb1_1"] 
obs_deb = Observation(PATH, fnames, tags, format="%Y-%m-%dT%H:%M:%S.000Z", ncol=2)

healthy_ts, unhealthy_ts = obs_deb.split_healthy_unhealthy()

Loading in memory 1 observations...
Analysing intervals with low diversity
Analysing intervals with bad level
Analysing intervals with low sampling rate


### The resulting dataset

healthy_ts contains a list of healthy time series, whereas unhealthy_ts contains the others.
You can look in more details at how the cut is made in the `split_healthy_unhealthy` function; the default parameters are to take a `portion` from each interval, between 0.1 and 0.3 for the healthy (beginning of interval) and 0.7 and 0.95 for the unhealthy (end of interval).

**Important note:** 
The time series obtained have different length !

In [3]:
print("Healthy time series length :\nFirst : %i\nSecond : %i\nThird : %i"%(len(healthy_ts[0]),len(healthy_ts[1]),len(healthy_ts[2])))

Healthy time series length :
First : 69
Second : 90
Third : 546


### Scoring the time series (sliding variance)

We can study the variance of each time series. Theoretically, unhealthy series should have higher variance than the healthy ones.

**PB :** the length is different, so instead of brutally taking the variance, we could take the **mean of the variance for a sliding window**.

We therefore have a score for each unhealthy/healthy intervals.

In [4]:
from exploitation.potential import sliding_variance, sliding_autocovariance, summary_variance_autocovariance
import numpy as np
w_length = 300

unhealthy_scores = np.array([np.nanmean(sliding_variance(ts.values.ravel(),w_length)) for ts in unhealthy_ts if len(ts)>w_length])
healthy_scores = np.array([np.nanmean(sliding_variance(ts.values.ravel(),w_length)) for ts in healthy_ts if len(ts)>w_length])

### Averaging the scores

We can then average that score, or take the median.

In our case, the median variance are very close, but the **mean variance of the unhealthy intervals is higher**, suggesting that **unhealthy intervals show high variance** more often, but that it is not enough to discriminate.

In [5]:
print("Mean :\nUnhealthy : %2.3f\nHealthy : %2.3f\n"%(np.mean(unhealthy_scores),np.mean(healthy_scores)))
print("Median :\nUnhealthy : %2.3f\nHealthy : %2.3f"%(np.median(unhealthy_scores),np.median(healthy_scores)))

Mean :
Unhealthy : 497.781
Healthy : 274.403

Median :
Unhealthy : 194.343
Healthy : 205.216


In [6]:
unhealthy_scores = np.array([np.nanmean(sliding_autocovariance(ts.values.ravel(),w_length,lag = 5, autocorrelation=True)) for ts in unhealthy_ts if len(ts)>w_length])
healthy_scores = np.array([np.nanmean(sliding_autocovariance(ts.values.ravel(),w_length,lag = 5,autocorrelation=True)) for ts in healthy_ts if len(ts)>w_length])

### Sliding autocorrelation

Doing the same with autocorrelation, we obtain results that are clearly not discriminative.

In [7]:
print("Mean :\nUnhealthy : %2.3f\nHealthy : %2.3f\n"%(np.mean(unhealthy_scores),np.mean(healthy_scores)))
print("Median :\nUnhealthy : %2.3f\nHealthy : %2.3f"%(np.median(unhealthy_scores),np.median(healthy_scores)))

Mean :
Unhealthy : 0.328
Healthy : 0.331

Median :
Unhealthy : 0.305
Healthy : 0.348


In [8]:
fnames = ["A1-DEB1-1.txt","A1-DEB1-2.txt","A1-DEB1-3.txt","A1-DEB1-4.txt","A1-DEB2-1.txt","A1-DEB2-2.txt","A1-DEB2-3.txt","A1-DEB2-4.txt"]
tags = ["deb1_1","deb1_2","deb1_3","deb1_4","deb2_1","deb2_2","deb2_3","deb2_4"]
obs_deb = Observation(PATH, fnames, tags, format="%Y-%m-%dT%H:%M:%S.000Z", ncol=2)

healthy_ts, unhealthy_ts = obs_deb.split_healthy_unhealthy()

Loading in memory 8 observations...
Analysing intervals with low diversity
Analysing intervals with bad level
Analysing intervals with low sampling rate


### Summing it up

The code is the following: 

`Mn = Mean` | `Md = Median` | `Va = Variance` | `Au = Autocorrelation`

Such that:

- `MnVa Uh` corresponds to the mean of the sliding variance mean for the unhealthy series
- `MdVa He` corresponds to the median of the sliding variance mean for the healthy series
- `MdAu He` corresponds to the median of the sliding aucovariance mean for the healthy series

In [9]:
summary_variance_autocovariance(unhealthy_ts, healthy_ts,w_length=30)

Unnamed: 0,MnVa Un,MnVa He,MdVa Un,MdVa He,MnAu Un,MnAu He,MdAu Un,MdAu He
value_deb1_1,85.014446,77.690494,44.619974,44.017577,0.124188,0.106168,0.086526,0.06128
value_deb1_2,137.057562,142.008833,87.113336,89.99883,0.138661,0.143803,0.115637,0.087313
value_deb1_3,66.232754,57.500758,28.244668,27.289998,0.097651,0.101713,0.092118,0.060404
value_deb1_4,71.274655,56.547674,21.191374,21.150701,0.073139,0.092394,0.049192,0.067783
value_deb2_1,80.805961,317.763865,37.27416,39.210761,0.098128,0.09295,0.084083,0.061391
value_deb2_2,562.364696,880.116231,141.299953,151.927434,0.11181,0.138909,0.094858,0.10016
value_deb2_3,66.190932,296.645971,17.216122,15.935294,0.125423,0.103173,0.092706,0.08602
value_deb2_4,91.625287,69.841164,23.903837,20.405364,0.093869,0.092335,0.087815,0.056647


A simpler approach would simply take the variance of each interval instead of the sliding variance mean, but both approaches give similar results.

**NB:** the difference between `value_deb_1` results as obtained in the array are coming from the fact that:

- w_length is chosen at 30, instead of default 300.
- The **multivariate** time series is more segemented, since they are more sensors (and therefore more "wrong" values 2^15 - 1). 

## Considering bad intervals independently

Sensors are now considered individually 

In [10]:
fnames = ["A1-DEB1-1.txt","A1-DEB1-2.txt","A1-DEB1-3.txt","A1-DEB1-4.txt","A1-DEB2-1.txt","A1-DEB2-2.txt","A1-DEB2-3.txt","A1-DEB2-4.txt"]
tags = ["deb1_1","deb1_2","deb1_3","deb1_4","deb2_1","deb2_2","deb2_3","deb2_4"]
obs_deb_independent = [Observation(PATH, [fname], [tag], format="%Y-%m-%dT%H:%M:%S.000Z", ncol=2) for (fname,tag) in zip(fnames,tags)]

healthy_unhealthy_ts = [obs_deb.split_healthy_unhealthy() for obs_deb in obs_deb_independent]

Loading in memory 1 observations...
Loading in memory 1 observations...
Loading in memory 1 observations...
Loading in memory 1 observations...
Loading in memory 1 observations...
Loading in memory 1 observations...
Loading in memory 1 observations...
Loading in memory 1 observations...
Analysing intervals with low diversity
Analysing intervals with bad level
Analysing intervals with low sampling rate
Analysing intervals with low diversity
Analysing intervals with bad level
Analysing intervals with low sampling rate
Analysing intervals with low diversity
Analysing intervals with bad level
Analysing intervals with low sampling rate
Analysing intervals with low diversity
Analysing intervals with bad level
Analysing intervals with low sampling rate
Analysing intervals with low diversity
Analysing intervals with bad level
Analysing intervals with low sampling rate
Analysing intervals with low diversity
Analysing intervals with bad level
Analysing intervals with low sampling rate
Analysing 

In [11]:
individual_summaries = [summary_variance_autocovariance(unhealthy_ts, healthy_ts,w_length=300) for (unhealthy_ts,healthy_ts) in healthy_unhealthy_ts]

In [12]:
import pandas as pd;
pd.concat(individual_summaries)

Unnamed: 0,MnVa Un,MnVa He,MdVa Un,MdVa He,MnAu Un,MnAu He,MdAu Un,MdAu He
value_deb1_1,497.781283,274.403183,194.343281,205.216367,0.327776,0.330894,0.30476,0.348437
value_deb1_2,483.004389,546.049994,358.413635,340.57884,0.329355,0.321955,0.30198,0.274779
value_deb1_3,342.51114,289.770148,131.432713,148.380462,0.341275,0.348843,0.311398,0.320853
value_deb1_4,369.983005,523.644873,155.017745,190.768511,0.356273,0.365544,0.351421,0.313815
value_deb2_1,214.958133,201.051602,169.74127,163.225567,0.361032,0.355475,0.358758,0.376932
value_deb2_2,1956.268914,78.313843,1180.523877,28.284324,0.684616,0.518288,0.792264,0.600535
value_deb2_3,442.110617,109.662015,82.739349,88.049069,0.501263,0.455404,0.507349,0.476506
value_deb2_4,398.168499,150.562461,110.38355,114.122371,0.421578,0.411641,0.395644,0.390659


## Logistic classifier using variance - autocovariance based features

In [16]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
reg = LogisticRegression()

summaryRegression = pd.DataFrame(0, index= tags,columns=["N_samples","Score"])
w_length=300;lag=5

for i in range(len(healthy_unhealthy_ts)):
    # For each sensor individually, compute a train test dataset, with variance autocovariance features.
    healthy_ts,unhealthy_ts = healthy_unhealthy_ts[i]
    
    unhealthy_scores_variance = np.array([np.nanmean(sliding_variance(ts.values.ravel(),w_length)) for ts in unhealthy_ts if len(ts)>w_length+lag])
    healthy_scores_variance = np.array([np.nanmean(sliding_variance(ts.values.ravel(),w_length)) for ts in healthy_ts if len(ts)>w_length+lag])
    unhealthy_scores_autocovariance = np.array([np.nanmean(sliding_autocovariance(ts.values.ravel(),w_length,lag = lag, autocorrelation=True)) for ts in unhealthy_ts if len(ts)>w_length+lag])
    healthy_scores_autocovariance = np.array([np.nanmean(sliding_autocovariance(ts.values.ravel(),w_length,lag = lag,autocorrelation=True)) for ts in healthy_ts if len(ts)>w_length+lag])
    n_samples_unhealthy = len(unhealthy_scores_variance)
    n_samples_healthy = len(healthy_scores_variance)
    unhealthy = np.concatenate((unhealthy_scores_variance[:,np.newaxis],unhealthy_scores_autocovariance[:,np.newaxis],np.ones((n_samples_unhealthy,1))),axis=1)
    healthy = np.concatenate((healthy_scores_variance[:,np.newaxis],healthy_scores_autocovariance[:,np.newaxis],np.zeros((n_samples_healthy,1))),axis=1)
    dataset = np.concatenate((healthy,unhealthy),axis=0)
    n_samples = len(healthy_ts)
    
    X_train, X_test, y_train, y_test = train_test_split(dataset[:,:2], dataset[:,[2]], test_size=0.20, random_state=42)
    reg.fit(X_train,y_train.ravel())
    summaryRegression.ix[i,:]=[n_samples_unhealthy+n_samples_healthy,reg.score(X_test,y_test.ravel())]

### Summary of the regression scores for sensors (considered independenty)

In [17]:
summaryRegression

Unnamed: 0,N_samples,Score
deb1_1,231,0.382979
deb1_2,236,0.583333
deb1_3,240,0.541667
deb1_4,234,0.617021
deb2_1,168,0.411765
deb2_2,8,1.0
deb2_3,152,0.483871
deb2_4,138,0.535714


For the **deb2_2** sensor we see that we only have very few samples. 
This is because only 8 time intervals are longer than 300 (which is our sliding window length).
So **only for deb2_2**, let's take a window of length 60. 

In [24]:
summaryRegression

Unnamed: 0,N_samples,Score
deb1_1,231,0.382979
deb1_2,236,0.583333
deb1_3,240,0.541667
deb1_4,234,0.617021
deb2_1,168,0.411765
deb2_2,214,0.651163
deb2_3,152,0.483871
deb2_4,138,0.535714


The classification score is relatively bad if we only take **variance/autocovariance** features.