Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Data Exploration

In this lab, we will explore and visualize our telemetry data.  You will learn how to calculate metrics on top of your raw time series to gain deeper insights into your data.

For example, sometimes the raw data looks just like noise, but if you run a [short-time Fourier transform](https://en.wikipedia.org/wiki/Short-time_Fourier_transform) you are able to detect that there are systematic oscillations in your data.

In this lab, you will:
- Get to know your dataset better by visualizing it
- Learn how to visualize time series data
- Become familiar with a set of standard metrics that can be defined on time series data
- Understand when to use which metric

### Prerequisites:

You will need the following Python packages to run this lab:
- rstl
- scipy
- seaborn

We will guide you through the process of installing these packages.

## Load and visualize/explore your data

In [None]:
%matplotlib inline 

# let's set up your environment, and define some global variables

import os
from rstl import STL
import pandas as pd
import random
import matplotlib.pyplot as plt
from scipy.stats import norm
import seaborn
import numpy as np


base_path = '..'
data_subdir = 'data'
data_filename = 'telemetry.csv'
data_path = os.path.join(base_path, data_subdir, data_filename)

# adjust this based on your screen's resolution
fig_panel = (18, 16)
wide_fig = (16, 4)
dpi=80 

In [None]:
# next, we load the telemetry data

print("Reading data ... ", end="")
df = pd.read_csv(data_path)
print("Done.")

print("Parsing datetime...", end="")
df['datetime'] = pd.to_datetime(df['datetime'], format="%m/%d/%Y %I:%M:%S %p")
print("Done.")

# recent pandas versions raise a TypeError when both a positional mapper and `columns` are passed
df = df.rename(columns={'datetime': 'timestamp'})

In [None]:
# let's define some useful variables
sensors = df.columns[2:].tolist() # a list containing the names of the sensors
machines = df['machineID'].unique().tolist() # a list of our machine ids

n_sensors = len(sensors)
n_machines = len(machines)
print("We have %d sensors: %s for each of %d machines." % (n_sensors, sensors, n_machines))

In [None]:
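# let's pick a machine to explore (fixed here so that results are reproducible;
# use random.choice(machines) to pick one at random)
random_machine = 67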

df_s = df.loc[df['machineID'] == random_machine, :]

In [None]:
# let's get some info about the time domain
df_s['timestamp'].describe()

**Question**: At which frequency do we receive sensor data?
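
One way to answer this kind of question empirically is to look at the differences between consecutive timestamps. The sketch below uses made-up timestamps with a 15-minute spacing as a stand-in; with the lab data you would pass `df_s['timestamp']` instead. The variable names are illustrative only.

```python
import pandas as pd

# Synthetic stand-in for df_s['timestamp']: one reading every 15 minutes (made-up data).
timestamps = pd.Series(pd.date_range("2015-01-01", periods=96, freq="15min"))

# Time difference between consecutive readings; the first entry is NaT and is dropped.
deltas = timestamps.sort_values().diff().dropna()

# The most common difference is the nominal sampling interval.
sampling_interval = deltas.mode()[0]

# Any other difference points to gaps or duplicated readings.
n_irregular = int((deltas != sampling_interval).sum())

print("Sampling interval:", sampling_interval)
print("Irregular steps:", n_irregular)
```

If `n_irregular` is not zero, it is worth inspecting those rows before computing any rolling metric, since window sizes are expressed in samples rather than in time.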

In [None]:
# create a table of descriptive statistics for our data set
df_s.describe()

Let's do some time-series-specific exploration of the data.

In [None]:
n_samples = 24*14 # we look at the first 14 days of sensor data

fig, ax = plt.subplots(2, 2, figsize=fig_panel, dpi=dpi) # create 2x2 panel of figures
for s, sensor in enumerate(sensors):
    c = s%2 # column of figure panel
    r = int(s/2) # row of figure panel
    ax[r,c].plot(df_s['timestamp'][:n_samples], df_s[sensor][:n_samples])
    ax[r,c].set_title(sensor)

Next, we plot histograms to understand how these data are distributed, and overlay a fitted normal distribution for comparison.

In [None]:
n_bins=200

fig, ax = plt.subplots(2,2,figsize=fig_panel, dpi=dpi)
for s, sensor in enumerate(sensors):
    c = s%2
    r = int(s/2)
    # fit normal distribution to data
    (mu, sigma) = norm.fit(df_s[sensor])
    
    # the histogram of the data
    n, bins, patches = ax[r,c].hist(df_s[sensor], density=True, bins=n_bins) 
    
    # add a 'best fit' line
    y = norm.pdf(bins, mu, sigma)
    l = ax[r,c].plot(bins, y, linewidth=2)
    ax[r,c].set_title(sensor)

# Useful metrics for time series data

## Bollinger Bands

[Bollinger Bands](https://en.wikipedia.org/wiki/Bollinger_Bands) are a type of statistical chart characterizing the prices and volatility over time of a financial instrument or commodity, using a formulaic method propounded by John Bollinger in the 1980s. Financial traders employ these charts as a methodical tool to inform trading decisions, control automated trading systems, or as a component of technical analysis. 

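Bollinger Bands place an upper and a lower band a fixed number of standard deviations (typically two) around a moving average. They can be computed very quickly with pandas: its built-in `ewm` method weights past samples with an exponential decay and can be combined with standard statistical functions, such as `mean` or `std`. Note that the classic definition uses a simple moving average (`rolling` in pandas); the code below uses the exponentially weighted variant.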

Of course, rolling means, standard deviations, etc. can be useful on their own, without using them to create Bollinger Bands.
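
To make the definition concrete, here is a minimal sketch on synthetic data that computes both the classic bands (simple moving average) and an exponentially weighted variant. The signal, the window of 12 samples, and the band width `k = 2` are illustrative assumptions, not values taken from the lab data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic hourly signal (made-up data standing in for one sensor column).
values = pd.Series(100 + rng.normal(0, 5, size=24 * 14))

window = 12  # number of samples per window (hypothetical choice)
k = 2        # band half-width in standard deviations

# Classic variant: simple moving average and moving standard deviation.
sma = values.rolling(window).mean()
sd = values.rolling(window).std()
upper, lower = sma + k * sd, sma - k * sd

# Exponentially weighted variant; span=window corresponds to alpha = 2 / (window + 1).
ew_mean = values.ewm(span=window).mean()
ew_sd = values.ewm(span=window).std()
ew_upper, ew_lower = ew_mean + k * ew_sd, ew_mean - k * ew_sd

# Share of samples that fall outside the classic bands.
valid = sma.notna()
outside = ((values > upper) | (values < lower))[valid].mean()
print("Share of samples outside the bands: %.3f" % outside)
```

The first `window - 1` values of the rolling statistics are `NaN` because the window is not yet full, whereas `ewm` produces a mean from the first sample onwards.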

In [None]:
window_size = 12 # decay parameter passed to ewm; its first argument is the center of mass (com), not a hard window length
sample_size = 24 * 7 * 2 # let's only look at two weeks of data
x = df_s['timestamp']

fig, ax = plt.subplots(2, 2, figsize=fig_panel, dpi=dpi)
for s, sensor in enumerate(sensors):
    c = s%2
    r = int(s/2)
    rstd = df_s[sensor].ewm(com=window_size).std() # exponentially weighted standard deviation
    rm = df_s[sensor].ewm(com=window_size).mean() # exponentially weighted mean
    ax[r,c].plot(x[window_size:sample_size], df_s[sensor][window_size:sample_size], color='blue', alpha=.2)
    ax[r,c].plot(x[window_size:sample_size], rm[window_size:sample_size] - 2 * rstd[window_size:sample_size], color='grey')
    ax[r,c].plot(x[window_size:sample_size], rm[window_size:sample_size] + 2 * rstd[window_size:sample_size], color='grey')
    ax[r,c].plot(x[window_size:sample_size], rm[window_size:sample_size], color='black')
    ax[r,c].set_title(sensor)

## Lag features

Lag features can be very useful in machine learning approaches dealing with time series.  For example, if you want to train a model to predict whether a machine is going to fail the next day, you can shift your failure logs by one day, so that each day's feature data is aligned with the failure label (i.e. the target) of the following day.

Luckily, pandas has a built-in `shift` function for doing this. 
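
As a rough sketch of the alignment described above, the following example builds two lag features and a next-day target on a small made-up daily table. The column names (`vibration`, `failure`, `fails_tomorrow`) are hypothetical and do not come from the lab data.

```python
import pandas as pd

# Made-up daily data: one sensor reading and a failure flag per day.
daily = pd.DataFrame(
    {
        "vibration": [40, 41, 39, 45, 52, 38],
        "failure": [0, 0, 0, 0, 1, 0],
    },
    index=pd.date_range("2015-01-01", periods=6, freq="D"),
)

# Lag features: shift(k) with k > 0 moves past values onto the current row.
daily["vibration_lag1"] = daily["vibration"].shift(1)
daily["vibration_lag2"] = daily["vibration"].shift(2)

# Target: shift(-1) moves tomorrow's failure flag onto today's row.
daily["fails_tomorrow"] = daily["failure"].shift(-1)

# Rows at the edges have no past or no future value, so drop them.
train = daily.dropna()
print(train)
```

Note the sign convention: positive shifts bring in the past (safe to use as features), while negative shifts bring in the future and should only be used for the target.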

In [None]:
sample_size = 24 * 2 # let's only look at first two days
x = df_s['timestamp']

fig, ax = plt.subplots(2, 2, figsize=fig_panel, dpi=dpi)
for s, sensor in enumerate(sensors):
    c = s%2
    r = int(s/2)
    ax[r,c].plot(x[:sample_size], df_s[sensor][:sample_size], color='black', alpha=1, label='orig')
    ax[r,c].plot(x[:sample_size], df_s[sensor][:sample_size].shift(-1), color='blue', alpha=1, label='-1h') # shift by 1 hour
    ax[r,c].plot(x[:sample_size], df_s[sensor][:sample_size].shift(-2), color='blue', alpha=.5, label='-2h') # shift by 2 hours
    ax[r,c].plot(x[:sample_size], df_s[sensor][:sample_size].shift(-3), color='blue', alpha=.2, label='-3h') # shift by 3 hours
    ax[r,c].set_title(sensor)
    ax[r,c].legend() # show the legend in every panel, not only the last one

## Rolling entropy

Depending on your use case, entropy can also be a useful metric, as it gives you an idea of how evenly your measurements are distributed in a specific range. For more information, see the Wikipedia article on [entropy in information theory](https://en.wikipedia.org/wiki/Entropy_(information_theory)).

In [None]:
from scipy.stats import entropy

sample_size = 24*7*4 # use the first four weeks of hourly data

sensor = 'volt'
sensor_data = df_s[sensor]
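# note: scipy's entropy normalises the 12 raw values in each window to sum to 1
# and treats them as probabilities
rolling_entropy = sensor_data.rolling(12).apply(entropy)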

fig, ax = plt.subplots(2,1, figsize=wide_fig)
ax[0].plot(x[:sample_size], sensor_data[:sample_size])
ax[1].plot(x[:sample_size], rolling_entropy[:sample_size])
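
A common alternative, sketched here under stated assumptions, is to compute the entropy of the distribution of values within each window: bin the values of the window into a histogram and take the entropy of the bin counts. The synthetic signal, the 10 bins, and the window of 24 samples are illustrative choices, and `window_entropy` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from scipy.stats import entropy

rng = np.random.default_rng(1)
# Made-up signal: nearly constant for 100 samples, then widely spread for 100 samples.
values = pd.Series(np.concatenate([
    rng.normal(172, 0.5, size=100),
    rng.uniform(150, 190, size=100),
]))

# Fixed bin edges over the whole range, so that windows are comparable with each other.
bin_edges = np.linspace(values.min(), values.max(), 11)  # 10 bins (hypothetical choice)

def window_entropy(window):
    # Count how many samples of the window fall into each bin ...
    counts, _ = np.histogram(window, bins=bin_edges)
    # ... and let scipy normalise the counts to probabilities.
    return entropy(counts, base=2)

rolling_entropy = values.rolling(24).apply(window_entropy, raw=True)

print("Mean entropy, narrow part: %.2f bits" % rolling_entropy.iloc[23:100].mean())
print("Mean entropy, spread part: %.2f bits" % rolling_entropy.iloc[123:].mean())
```

With this definition, a window whose values all fall into one bin has an entropy of zero, and a window spread evenly over all 10 bins approaches log2(10), or about 3.3 bits.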

## Short-time Fourier Transform (STFT)

STFTs can be useful if you want to see whether there are oscillations in your data.  They can be used as a way of quantifying the change of a non-stationary signal’s frequency and phase content over time, which makes them very useful for detecting oscillations in a specific frequency range.

In [None]:
from scipy import signal

# Assuming hourly readings (as in the 24*14 = 14 days window above), use
# 24 samples per day as the sampling rate.  This puts the time axis in days
# and the frequency axis in cycles per day.
samples_per_day = 24

fig, ax = plt.subplots(2, 2, figsize=fig_panel, dpi=dpi)
for s, sensor in enumerate(sensors):
    c = s%2 # column in figure panel
    r = int(s/2) # row in figure panel
    # nperseg=48 with the default 50% overlap yields one spectrum per day
    f, t, Zxx = signal.stft(df_s[sensor], fs=samples_per_day, nperseg=48)
    ax[r,c].pcolormesh(t, f, np.abs(Zxx))
    ax[r,c].set_title(sensor)
    ax[r,c].set_ylabel('Frequency [cycles/day]')
    ax[r,c].set_xlabel('Time (days)')

Depending on your use case, you may only be interested in the results within certain frequency ranges.  For example, you may only be interested in changes at the lowest frequency range.

In [None]:
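freq_comp = 0 # index of the frequency bin to extract (0 = lowest frequency)
freq_ts = np.abs(Zxx[freq_comp,:]) # magnitude of that bin over time

# Zxx still holds the STFT of the last sensor plotted above, so
# compare this to the bottom/right panel of the previous figure
plt.plot(t/t[1], freq_ts)
plt.xlabel('Time (days)')
plt.ylabel('Magnitude')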
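
To see how an STFT can reveal an oscillation that is hard to spot by eye, the sketch below builds a made-up hourly signal in which a daily cycle appears only in the second half, then tracks the frequency bin closest to one cycle per day. The signal, the segment length of 96 samples, and all variable names are assumptions for illustration.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(2)
samples_per_day = 24  # hourly readings
n_days = 60
hours = np.arange(n_days * samples_per_day)

# Made-up signal: noise throughout, plus a daily oscillation in the second half only.
noise = rng.normal(0, 1.0, size=hours.size)
daily_cycle = 1.0 * np.sin(2 * np.pi * hours / samples_per_day)
daily_cycle[: hours.size // 2] = 0.0
values = noise + daily_cycle

# With fs in samples per day, t is in days and f is in cycles per day.
f, t, Zxx = signal.stft(values, fs=samples_per_day, nperseg=96)
magnitude = np.abs(Zxx)

# Pick the frequency bin closest to one cycle per day instead of hard-coding an index.
daily_bin = np.argmin(np.abs(f - 1.0))
daily_ts = magnitude[daily_bin, :]

# Compare the two halves, leaving out the segments that straddle the change point.
first_half = daily_ts[t < n_days / 2 - 2].mean()
second_half = daily_ts[t > n_days / 2 + 2].mean()
print("Magnitude at %.2f cycles/day: %.3f (first half) vs %.3f (second half)"
      % (f[daily_bin], first_half, second_half))
```

Selecting the bin by its frequency rather than by its index keeps the code valid if `nperseg` or the sampling rate changes.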

## Other useful metrics

There are various other useful metrics for time series data.  Keep them in the back of your mind for when you are dealing with another scenario.

- Rolling median, min, max, mode etc. statistics
- Rolling entropy, or rolling majority, for categorical features
- Rolling text statistics for text features
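
As a brief sketch of the first two items, the example below computes rolling order statistics on a made-up numeric series and a rolling majority on a made-up categorical series. The data and names (`pressure`, `state`) are hypothetical. Because `rolling` works on numeric data, the categories are encoded as integer codes first.

```python
import pandas as pd

# Made-up numeric readings with a single outlier.
pressure = pd.Series([100, 101, 99, 250, 100, 102, 98, 101])

# Rolling order statistics; the median is far less sensitive to the outlier than the mean.
rolled = pd.DataFrame({
    "mean": pressure.rolling(3).mean(),
    "median": pressure.rolling(3).median(),
    "min": pressure.rolling(3).min(),
    "max": pressure.rolling(3).max(),
})

# Made-up categorical feature: the operating state reported at each reading.
state = pd.Series(["idle", "idle", "run", "run", "run", "idle", "run", "run"])

# Encode the categories as integer codes, take the most frequent code per window,
# then map the codes back to their labels.
codes, labels = pd.factorize(state)
rolling_majority = (
    pd.Series(codes)
    .rolling(3)
    .apply(lambda w: pd.Series(w).mode().iloc[0], raw=True)
    .dropna()
    .astype(int)
    .map(lambda code: labels[code])
)

print(rolled)
print(rolling_majority)
```

When a window contains a tie, `mode()` returns the tied values in sorted order, so this sketch picks the lowest code; a real application may need a more deliberate tie-breaking rule.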

## Quiz

The big question is when to use which metric for your use case.

Here are a couple of sample scenarios. Can you recommend which one of the above metrics to use in each case?
1. You are developing a fitness application for mobile phones that have an [accelerometer](https://en.wikipedia.org/wiki/Accelerometer). You want to be able to measure how much time a user spends sitting, walking, and running over the course of a day. Which metric would you use to identify the different activities?
2. You want to get rich on the stock market, but you hate volatility.  Which metric would you use to measure volatility?
3. You are in charge of a server farm.  You are looking for a way to detect denial of service attacks on your servers.  You don't want to constantly look at individual amounts of traffic at all of the servers at the same time.  However, you know that all of the servers typically get a constant amount of traffic.  Which metric could you use to determine that things have shifted, such as when some servers seem to be getting a lot more traffic than the other servers?
