Time series data:
- Sequences of data points collected or recorded at specific time intervals. Clustering this data involves grouping sequences that exhibit similar patterns or behaviours over time.
- Unlike traditional clustering, time series clustering must account for temporal dependencies and potential shifts in time.

Key Concepts in Time Series Clustering: Similarity Measures
- A crucial aspect of time series clustering is the similarity measure used to compare different time series. Common similarity measures include:
  - Euclidean Distance => Treats two equal-length series as points in multidimensional space and measures the straight-line distance between them, comparing values time step by time step. While it's simple, it is not invariant to time shifts.
  - Dynamic Time Warping (DTW) => Aligns sequences by warping the time axis to minimize the distance between them. DTW is robust to time shifts and varying speeds, but the standard algorithm is more expensive to compute (quadratic in the series length).
  - Correlation-based measures => Evaluate the correlation between time series, focusing on the similarity of their shapes rather than their exact values.
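
To make the differences between these measures concrete, here is a minimal sketch using only NumPy. The helper names (`euclidean_distance`, `dtw_distance`, `correlation_distance`) and the sine-wave test data are illustrative assumptions, not part of any library; the DTW routine is the textbook dynamic-programming version without any window constraint.

```python
import numpy as np

def euclidean_distance(a, b):
    """Point-by-point straight-line distance between two equal-length series."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) DTW with squared point-wise costs."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of the three allowed predecessor alignments
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))

def correlation_distance(a, b):
    """1 - Pearson correlation: near 0 for identical shapes, up to 2 for mirrored ones."""
    return float(1.0 - np.corrcoef(a, b)[0, 1])

t = np.linspace(0, 2 * np.pi, 60)
base = np.sin(t)
shifted = np.sin(t - 0.5)        # same shape, delayed in time
scaled = 3.0 * np.sin(t) + 2.0   # same shape, different amplitude and offset

print("Euclidean(base, shifted):  ", euclidean_distance(base, shifted))
print("DTW(base, shifted):        ", dtw_distance(base, shifted))
print("Correlation(base, scaled): ", correlation_distance(base, scaled))
```

On the shifted pair, DTW should report a smaller distance than Euclidean because it can realign the delayed samples, while the correlation distance on the rescaled pair is essentially zero because only the shape matters.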

Time Series Clustering Techniques:
- Shape-Based Clustering:
  - Focuses on the shape of the time series, either by comparing series directly with a shape-aware distance such as DTW or by using features like autocorrelation, partial autocorrelation, and cepstral coefficients
  - Clustering algorithms like k-means or hierarchical clustering can then be applied to these distances or features
- Feature-Based Clustering:
  - Extracts features from the time series, such as trend, seasonality and frequency components
  - Common feature extraction techniques include Fourier transforms, wavelets and singular value decomposition (SVD)
  - Clustering algorithms are then applied to the extracted features
- Model-Based Clustering:
  - Assumes the time series are generated from a mixture of underlying probability distributions
  - Gaussian Mixture Models (GMMs) are commonly used to model these
  - The Expectation-Maximization (EM) algorithm is used to estimate the parameters of the GMMs
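
As an illustration of the feature-based approach, the sketch below summarises each series by two features (a linear trend slope and the dominant Fourier frequency of the detrended series) and then clusters the feature vectors. Everything here is an assumption made for the example: the helper names `extract_features` and `kmeans`, the choice of features, the synthetic data, and the bare-bones k-means with farthest-point seeding (a stand-in for a library implementation such as scikit-learn's `KMeans`).

```python
import numpy as np

def extract_features(series):
    """Summarise one series as [linear trend slope, dominant frequency bin]."""
    series = np.asarray(series, dtype=float)
    steps = np.arange(len(series))
    coeffs = np.polyfit(steps, series, deg=1)        # least-squares line
    slope = coeffs[0]
    detrended = series - np.polyval(coeffs, steps)   # remove the trend first
    spectrum = np.abs(np.fft.rfft(detrended))
    dominant = int(np.argmax(spectrum[1:]) + 1)      # skip the DC term (bin 0)
    return np.array([slope, dominant], dtype=float)

def kmeans(points, k, n_iter=20):
    """Minimal Lloyd's algorithm with deterministic farthest-point seeding."""
    centers = [points[0]]
    for _ in range(k - 1):
        dists = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for c in range(k):
            if np.any(labels == c):                  # leave empty clusters unchanged
                centers[c] = points[labels == c].mean(axis=0)
    return labels

# Illustrative data: 5 slow oscillations, then 5 fast oscillations with an upward trend
rng = np.random.default_rng(0)
steps = np.arange(100)
slow = [np.sin(2 * np.pi * 2 * steps / 100) + 0.1 * rng.standard_normal(100)
        for _ in range(5)]
fast = [np.sin(2 * np.pi * 8 * steps / 100) + 0.02 * steps + 0.1 * rng.standard_normal(100)
        for _ in range(5)]
series = np.array(slow + fast)

features = np.array([extract_features(s) for s in series])
# Put both features on a comparable scale before clustering
features_scaled = (features - features.mean(axis=0)) / features.std(axis=0)
labels = kmeans(features_scaled, k=2)
print(labels)
```

Because the clustering runs on two features per series rather than on all 100 raw values, the same pattern scales to long series, at the cost of having to pick features that actually separate the groups.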

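Examples:
Example 1 - Whole Time Series Clustering with K-Means:
- Each complete time series is treated as one sample (one row, with the time steps as features) and clustered with standard k-means.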

In [3]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

In [2]:
from sklearn.metrics import silhouette_score

In [1]:
from tslearn.utils import to_time_series_dataset
from tslearn.clustering import TimeSeriesKMeans

In [15]:
from tslearn.preprocessing import TimeSeriesScalerMeanVariance

In [2]:
# Generating synthetic time series data
np.random.seed(0)
time_series_data = np.random.randn(100,50) # 100 time series, each of length 50

In [4]:
# Standardizing the data (StandardScaler scales each column, i.e. each time step across all series)
scaler = StandardScaler()
time_series_data_scaled = scaler.fit_transform(time_series_data)
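
`StandardScaler` standardizes each column, which for this layout means each time step across all series. For time series it is also common to z-normalize each series on its own instead. A minimal NumPy sketch of the difference, using illustrative variable names (`X`, `X_by_column`, `X_by_series`) that are not used elsewhere in this notebook:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))  # 100 series, each of length 50 (illustrative data)

# Column-wise: the same computation StandardScaler performs (per time step, across series)
X_by_column = (X - X.mean(axis=0)) / X.std(axis=0)

# Row-wise: each series gets zero mean and unit variance on its own
X_by_series = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

print(np.allclose(X_by_column.mean(axis=0), 0.0), np.allclose(X_by_column.std(axis=0), 1.0))
print(np.allclose(X_by_series.mean(axis=1), 0.0), np.allclose(X_by_series.std(axis=1), 1.0))
```

Which variant is appropriate depends on the question: row-wise normalization removes differences in level and amplitude between series, so the clustering is driven by shape alone.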

In [5]:
# clustering using k-means
kmeans = KMeans(n_clusters=3, random_state=0)
labels = kmeans.fit_predict(time_series_data_scaled)

In [6]:
print(labels)

[2 1 1 2 2 1 2 0 2 0 2 1 2 0 1 2 0 1 2 2 2 0 0 1 2 0 2 0 1 1 1 1 1 1 1 1 2
 2 1 1 1 0 1 2 1 2 2 1 0 2 2 1 1 2 2 1 1 2 1 1 2 0 2 1 1 2 1 1 2 1 2 2 2 2
 0 1 2 2 1 2 0 2 1 1 1 2 0 0 1 0 1 1 1 2 0 0 1 2 2 0]


In [8]:
silhouette_score(time_series_data_scaled, labels)

0.022768212978088172

Example 2 - Subsequence Clustering with K-Means:
- This method extracts overlapping fixed-length subsequences (sliding windows) from the time series data and then applies k-means clustering to these subsequences. This approach captures local patterns within the time series.

In [4]:
# Generating synthetic time series data
np.random.seed(0)
time_series_data = np.random.randn(10,100)

In [5]:
# extracting subsequences
window_size = 20
subsequences = [time_series_data[i,j:j+window_size]
                for i in range(time_series_data.shape[0])
                for j in range(time_series_data.shape[1] - window_size + 1)]
subsequences = np.array(subsequences)

In [6]:
subsequences.shape

(810, 20)
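
The same windows can be produced without the nested Python loop. The sketch below assumes NumPy 1.20 or newer, where `sliding_window_view` is available, and uses its own variable names (`series`, `looped`, `vectorized`) so it does not overwrite anything in the notebook.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

np.random.seed(0)
series = np.random.randn(10, 100)   # same shape as the data above
window_size = 20

# Loop-based extraction, as in the cell above
looped = np.array([series[i, j:j + window_size]
                   for i in range(series.shape[0])
                   for j in range(series.shape[1] - window_size + 1)])

# Vectorized: windows has shape (10, 81, 20); flatten the first two axes
windows = sliding_window_view(series, window_size, axis=1)
vectorized = windows.reshape(-1, window_size)

print(vectorized.shape)                     # (810, 20)
print(np.array_equal(looped, vectorized))   # True
```

Each of the 10 series yields 100 - 20 + 1 = 81 windows, which is where the 810 rows come from.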

In [7]:
# standardising the subsequences
scaler = StandardScaler()
subsequences_scaled = scaler.fit_transform(subsequences)

In [10]:
# clustering using k-means
kmeans = KMeans(n_clusters=3, random_state=0)
labels = kmeans.fit_predict(subsequences_scaled)

In [11]:
silhouette_score(subsequences_scaled,labels)

0.03588982681330601

Example 3: Shape Based Clustering with Dynamic Time Warping
- This method uses Dynamic Time Warping (DTW) as the distance measure to cluster time series based on their shapes. DTW aligns sequences by warping the time axis to minimize the distance between them, which makes the comparison robust to time shifts.

In [12]:
# generating synthetic time series data
np.random.seed(0)
time_series_data = np.random.randn(20,50) # 20 time series, each of length 50

In [13]:
# converting to a time series dataset
time_series_dataset = to_time_series_dataset(time_series_data)

In [16]:
# standardizing the data
scaler = TimeSeriesScalerMeanVariance()
time_series_dataset_scaled = scaler.fit_transform(time_series_dataset)

In [17]:
# clustering the timeseries using DTW metric
model = TimeSeriesKMeans(n_clusters=3, metric="dtw", random_state=0)
labels = model.fit_predict(time_series_dataset_scaled)

In [19]:
type(time_series_dataset_scaled)

numpy.ndarray

In [21]:
time_series_dataset_scaled.shape

(20, 50, 1)

Example 4: Clustering Time Series data using DTW and evaluating with the silhouette score

In [None]:
# Generate an example time series dataset
time = np.arange(0, 10, 0.1)
values = np.sin(time)
data = np.array