## Timeseries clustering

Time series clustering partitions time series into groups based on a similarity or distance measure, so that series in the same cluster are more similar to each other than to series in other clusters.

Methodology:
* Use a Variational Recurrent AutoEncoder (VRAE) to reduce the dimensionality of the time series
* Use PCA and t-SNE to visualize the clusters

Paper:
https://arxiv.org/pdf/1412.6581.pdf

#### Contents

0. [Load data and preprocess](#Load-data-and-preprocess)
1. [Initialize VRAE object](#Initialize-VRAE-object)
2. [Fit the model onto dataset](#Fit-the-model-onto-dataset)
3. [Transform the input timeseries to encoded latent vectors](#Transform-the-input-timeseries-to-encoded-latent-vectors)
4. [Save the model to be fetched later](#Save-the-model-to-be-fetched-later)
5. [Visualize using PCA and tSNE](#Visualize-using-PCA-and-tSNE)

In [None]:
from IPython.display import HTML
HTML('''<script>
code_show=true; 
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

### Import required modules

In [None]:
from vrae.vrae import VRAE
from vrae.utils import *
import numpy as np
import torch

import plotly
from torch.utils.data import DataLoader, TensorDataset
plotly.offline.init_notebook_mode()

### Input parameters

In [None]:
dload = './model_dir' #download directory

### Hyper parameters

In [None]:
seq_len = 30
hidden_size = 256
hidden_layer_depth = 3
latent_length = 64
batch_size = 10
learning_rate = 0.00002
n_epochs = 10000
dropout_rate = 0.0
optimizer = 'Adam' # options: Adam, SGD
cuda = True # options: True, False
print_every=30
clip = True # options: True, False
max_grad_norm=5
loss = 'MSELoss' # options: SmoothL1Loss, MSELoss
block = 'LSTM' # options: LSTM, GRU
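
The hyperparameters above can also be grouped into a single validated object. The sketch below is only an illustration: `VRAEConfig` is a hypothetical helper that is not part of the `vrae` package, and the allowed values for `loss` and `block` are taken from the option comments in the cell above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class VRAEConfig:
    """Hypothetical container for the hyperparameters used in this notebook."""
    hidden_size: int = 256
    hidden_layer_depth: int = 3
    latent_length: int = 64
    batch_size: int = 10
    learning_rate: float = 2e-5
    n_epochs: int = 10000
    dropout_rate: float = 0.0
    optimizer: str = "Adam"
    cuda: bool = True
    print_every: int = 30
    clip: bool = True
    max_grad_norm: float = 5
    loss: str = "MSELoss"
    block: str = "LSTM"

    def __post_init__(self):
        # Reject values outside the options listed in the notebook comments.
        if self.loss not in {"SmoothL1Loss", "MSELoss"}:
            raise ValueError(f"unsupported loss: {self.loss!r}")
        if self.block not in {"LSTM", "GRU"}:
            raise ValueError(f"unsupported block: {self.block!r}")
        if not 0.0 <= self.dropout_rate < 1.0:
            raise ValueError("dropout_rate must be in [0, 1)")


config = VRAEConfig()
vrae_kwargs = asdict(config)  # plain dict of keyword arguments
print(vrae_kwargs["latent_length"], vrae_kwargs["block"])
```

The resulting dictionary uses the same keyword names as the `VRAE(...)` call in this notebook, so it could be unpacked alongside the data-dependent arguments (`sequence_length`, `number_of_features`, `dload`).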

### Load data and preprocess

In [None]:
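X_train, X_val, y_train, y_val = open_data('data', ratio_train=1.0, dataset='ECG5000')

num_classes = len(np.unique(y_train))
base = np.min(y_train)  # Smallest label; a non-zero value means labels are not 0-based
if base != 0:
    y_train -= base  # Shift labels so that they start at 0
# With ratio_train=1.0 all data is in the training split, so reuse its labels
y_val = y_train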

In [None]:
# Data is sequential, so we need to reshape it to add a time dimension
# X_train = X_train.reshape(X_train.shape[0], seq_len, -1)[:-6]

# We only learn embeddings here, so overfitting is not a concern: validate on the training data
X_val = X_train

train_dataset = TensorDataset(torch.from_numpy(X_train))
test_dataset = TensorDataset(torch.from_numpy(X_train))

**Fetch `sequence_length` from dataset**

In [None]:
sequence_length = X_train.shape[1]

**Fetch `number_of_features` from dataset**

This value is the number of input features per time step.

In [None]:
number_of_features = X_train.shape[2]
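
Both values above are read from a 3-D array laid out as `(n_samples, sequence_length, number_of_features)`. The following sketch illustrates that layout with synthetic data. The array sizes are made up for the example, and adding the feature axis with `np.expand_dims` is an assumption about how a univariate dataset could be brought into this shape.

```python
import numpy as np

# Synthetic stand-in for a univariate dataset: 500 series of 140 time steps.
flat = np.random.default_rng(0).normal(size=(500, 140)).astype(np.float32)

# Add a trailing feature axis: (n_samples, sequence_length, number_of_features).
X = np.expand_dims(flat, axis=-1)

n_samples, sequence_length, number_of_features = X.shape
print(n_samples, sequence_length, number_of_features)  # 500 140 1
```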

### Initialize VRAE object

`VRAE` inherits from `sklearn.base.BaseEstimator` and provides `fit`, `transform`, and `fit_transform` methods, so it is used like a scikit-learn estimator.
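
For readers unfamiliar with that convention, here is a minimal sketch of the `fit` / `transform` / `fit_transform` pattern. `MeanCenterer` is a hypothetical toy class written for this illustration. It is unrelated to the VRAE implementation and does not depend on scikit-learn.

```python
class MeanCenterer:
    """Toy transformer: learns column means in fit, subtracts them in transform."""

    def fit(self, X):
        n_rows = len(X)
        # Learned state is stored on the instance, as scikit-learn estimators do.
        self.mean_ = [sum(column) / n_rows for column in zip(*X)]
        return self  # returning self allows chaining

    def transform(self, X):
        return [[value - mean for value, mean in zip(row, self.mean_)] for row in X]

    def fit_transform(self, X):
        return self.fit(X).transform(X)


data = [[1.0, 10.0], [3.0, 30.0]]
centered = MeanCenterer().fit_transform(data)
print(centered)  # [[-1.0, -10.0], [1.0, 10.0]]
```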

In [None]:
vrae = VRAE(sequence_length=sequence_length,
            number_of_features = number_of_features,
            hidden_size = hidden_size, 
            hidden_layer_depth = hidden_layer_depth,
            latent_length = latent_length,
            batch_size = batch_size,
            learning_rate = learning_rate,
            n_epochs = n_epochs,
            dropout_rate = dropout_rate,
            optimizer = optimizer, 
            cuda = cuda,
            print_every=print_every, 
            clip=clip, 
            max_grad_norm=max_grad_norm,
            loss = loss,
            block = block,
            dload = dload)

### Fit the model onto dataset

In [None]:
vrae.fit(train_dataset)

# To save the model together with its learned parameters, use:
# vrae.fit(train_dataset, save = True)

### Transform the input timeseries to encoded latent vectors

In [None]:
# To also save the latent vectors, pass `save = True`
z_run = vrae.transform(train_dataset, save = True)
z_run.shape
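
When inspecting `z_run.shape`, it is worth checking whether the number of rows matches the dataset size. If the data loader used inside `transform` drops the final incomplete batch (an assumption about the VRAE implementation that this notebook does not confirm), fewer rows come back than were passed in, and the labels need to be truncated to match before plotting. The helper name and the sizes below are hypothetical.

```python
def rows_after_drop_last(n_samples: int, batch_size: int) -> int:
    """Number of rows left when the last incomplete batch is discarded."""
    return (n_samples // batch_size) * batch_size


n_samples, batch_size = 1234, 10          # made-up sizes for illustration
n_encoded = rows_after_drop_last(n_samples, batch_size)

labels = list(range(n_samples))           # stand-in for y_train
aligned_labels = labels[:n_encoded]       # keep only labels that have an embedding

print(n_encoded, len(aligned_labels))     # 1230 1230
```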

### Save the model to be fetched later

In [None]:
vrae.save('./vrae.pth')

# To load a presaved model, execute:
# vrae.load('vrae.pth')

In [None]:
# Get some reconstructions
reconstructions = vrae.reconstruct(train_dataset)
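
One way to judge reconstructions is a per-sequence mean squared error. The sketch below assumes that the inputs and reconstructions have been arranged into arrays of identical shape `(n_samples, sequence_length, number_of_features)`. The actual layout returned by `reconstruct` is not stated in this notebook, so a transpose may be needed first. `per_sequence_mse` is a hypothetical helper.

```python
import numpy as np


def per_sequence_mse(original, reconstructed):
    """Mean squared error per sample, averaged over time steps and features."""
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    if original.shape != reconstructed.shape:
        raise ValueError("original and reconstructed must have the same shape")
    return ((original - reconstructed) ** 2).mean(axis=(1, 2))


# Tiny synthetic example: the second sequence is off by 2 at every step.
x = np.zeros((2, 3, 1))
x_hat = np.zeros((2, 3, 1))
x_hat[1] = 2.0

errors = per_sequence_mse(x, x_hat)
print(errors)  # [0. 4.]
```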

### Visualize using PCA and tSNE

In [None]:
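# %matplotlib notebook
import scipy.io
# Explicit import; otherwise TSNE is only available if the wildcard import of vrae.utils exposes it
from sklearn.manifold import TSNE

# Note: newer scikit-learn releases rename `n_iter` to `max_iter`
z_run_tsne = TSNE(perplexity=80, min_grad_norm=1E-12, n_iter=3000).fit_transform(z_run)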
# scipy.io.savemat('tsne_vae_embeddings_20210618_Pop_Cage_001.mat', {'data': z_run_tsne.T})

# plot_clustering(z_run, y_train, engine='matplotlib', download = False)
# To render with plotly instead, uncomment the line below
#plot_clustering(z_run, y_val, engine='plotly', download = False)
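
The cell above computes only the t-SNE embedding. For the PCA view, a 2-D projection of the latent vectors can be sketched with plain NumPy as follows. `pca_project` is a hypothetical helper, and the random matrix is a stand-in for `z_run`. This is one standard way to compute PCA (via SVD of the centered data) and is not necessarily what `plot_clustering` does internally.

```python
import numpy as np


def pca_project(z, n_components=2):
    """Project rows of z onto their top principal components."""
    z = np.asarray(z, dtype=np.float64)
    centered = z - z.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
z_demo = rng.normal(size=(200, 64))   # stand-in for z_run: 200 vectors of length 64
z_pca = pca_project(z_demo)
print(z_pca.shape)  # (200, 2)
```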

In [None]:
# # Create clusters.annot
# from sklearn.mixture import GaussianMixture

# # Predict cluster assignments
# gm = GaussianMixture(n_components=5, random_state=0).fit(z_run)
# clusters = gm.predict(z_run)

# # Number of seconds in each sequence
# filt_time_step = 0.025
# num_secs_seq = sequence_length * filt_time_step
# end_time = len(z_run) * num_secs_seq + num_secs_seq

# # Print head of the file
# f = open ('Pop01-06_18_2021.annot','w')
# # write the header--------------------
# f.write('Bento annotation file\n')
# f.write('Movie file(s): {}\n\n'.format('Pop_20210618_cage_C1_01.avi'))
# f.write('{0} {1}\n'.format('Stimulus name:',''))
# f.write('{0} {1}\n'.format('Annotation start frame:',1))
# f.write('{0} {1}\n'.format('Annotation stop frame:', 26994))
# f.write('{0} {1}\n'.format('Annotation framerate:', 30))

# f.write('\n{0}\n'.format('List of channels:'))
# channels = ['cluster_num']
# for item in channels:
#         f.write('{0}\n'.format(item))
# f.write('\n');

# f.write('{0}\n'.format('List of annotations:'))
# clust_names = ['cluster_{}'.format(str(num)) for num in set(clusters)]
# labels = clust_names
# # labels = [item.replace(' ','_') for item in labels]
# for item in labels:
#     f.write('{0}\n'.format(item))
# f.write('\n')

# # now write the contents---------------
# for ch in channels:
#     f.write('{0}----------\n'.format(ch))
#     for beh in labels:
#         f.write('>{0}\n'.format(beh))
#         f.write('{0}\t {1}\t {2} \n'.format('Start','Stop','Duration'))

#         idxs = np.where(clusters == int(beh.split('_')[-1]))[0]
#         for hit in idxs:
#             start_time = hit * num_secs_seq/2
#             end_time = start_time + num_secs_seq
#             f.write('{0}\t{1}\t{2}\n'.format(start_time, end_time, num_secs_seq))
#         f.write('\n')
#     f.write('\n')

# f.close()
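
The timing arithmetic in the commented-out cell above can be isolated into a small function. The sketch below assumes that consecutive windows overlap by half, which is inferred from the `num_secs_seq/2` factor and not stated anywhere in the notebook. `window_interval`, the example cluster list, and the sequence length of 40 are hypothetical.

```python
def window_interval(index, sequence_length, time_step, overlap=0.5):
    """Return (start, stop, duration) in seconds for the window at `index`."""
    duration = sequence_length * time_step
    stride = duration * (1.0 - overlap)   # time between consecutive window starts
    start = index * stride
    return start, start + duration, duration


# Group window intervals by cluster label, as the annotation writer does.
clusters = [0, 1, 1, 0]                   # made-up cluster assignments
intervals = {}
for idx, label in enumerate(clusters):
    intervals.setdefault(label, []).append(window_interval(idx, 40, 0.025))

for label, spans in sorted(intervals.items()):
    print(f"cluster_{label}: {len(spans)} windows")
```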