# Data Compression and Dynamic Time Warping Distance Notebook
Christopher Marais

gmarais@ufl.edu

2023/12/14

<br>

This notebook takes the data that was extracted and decompresses the timestamps of each event.

It then compresses the data in a different way before calculating the Dynamic Time Warping (DTW) distance between the event time series.

The inputs are the CSV file of label data and the pickle file containing the list of event timestamps.

The output is a pairwise DTW distance matrix between the compressed time series of the events, saved as a .npy file.

##### WARNING!: This notebook can take anywhere between 3 and 12 hours to run. Make sure you need to run it before starting.

<br>

Input: `/02_Clean_data/00_recording_event_times_labels.csv`, `00_event_timestamps_arrays_list.pkl` 

↓

Process: `</03_Scripts/01_Data_Compression_DTW_Calculation.ipynb>`

↓

Output: `/02_Clean_data/01_dtw_distance_matrix.npy`

<br>

-----------

### Import Packages

In [10]:
import os
import pickle
import math
import numpy as np
import pandas as pd # use pandas for more functionality
from dtaidistance import dtw

### Define Working Directory

In [11]:
# get working directory as parent directory of current directory
cwd = os.getcwd()
pwd = os.path.dirname(cwd)

### Define functions to decompress and recompress data

#### Decompression function

In [12]:
def create_binary_stream(row):
    # Unpack indices and stream length from the row
    indices, stream_length = row['event_timestamps'], row['event_length']

    # Initialize a NumPy array of zeros
    binary_stream = np.zeros(int(stream_length), dtype=int)

    if len(indices) != 0:

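        # Keep numeric, non-NaN entries and cast them to integers
        # (np.integer/np.floating cover NumPy scalars such as np.int64, which are not Python ints)
        indices = [int(i) for i in indices
                   if isinstance(i, (int, float, np.integer, np.floating)) and not np.isnan(i)]

        # Convert indices to an integer NumPy array (an empty list would otherwise
        # become a float array, which cannot be used for indexing) and filter out-of-bound indices
        indices = np.array(indices, dtype=int)
        valid_indices = indices[(0 <= indices) & (indices < int(stream_length))]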

        # Set the specified indices to 1
        binary_stream[valid_indices] = 1

    return binary_stream
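
To make the decompression step concrete, here is a minimal, self-contained sketch on toy data. The helper `to_binary_stream` and the toy row are hypothetical stand-ins written for this illustration; they mirror the logic of `create_binary_stream` (drop NaN values, drop out-of-bound indices, set the remaining positions to 1) but are not the function used in the rest of the notebook.

```python
import numpy as np

def to_binary_stream(indices, stream_length):
    """Toy re-implementation of the decompression step (illustration only)."""
    stream = np.zeros(int(stream_length), dtype=int)
    idx = np.asarray(indices, dtype=float)
    idx = idx[~np.isnan(idx)].astype(int)        # drop NaN values, cast to int
    idx = idx[(idx >= 0) & (idx < len(stream))]  # drop out-of-bound indices
    stream[idx] = 1                              # mark event positions
    return stream

# Hypothetical row: events at positions 1 and 4, one NaN, one index past the end
toy_row = {"event_timestamps": [1, 4, 4.0, np.nan, 12], "event_length": 8}
toy_stream = to_binary_stream(toy_row["event_timestamps"], toy_row["event_length"])
print(toy_stream)  # [0 1 0 0 1 0 0 0]
```

The NaN entry and the out-of-bound index 12 are silently ignored, and the duplicated index 4 sets the same position only once.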

#### Compression function

In [13]:
def scale_zeros(binary_vector, scaling_factor=10):
    if scaling_factor <= 0:
        raise ValueError("Scaling factor must be greater than 0")

    scaled_vector = []
    zero_count = 0

    for bit in binary_vector:
        if bit == 1:
            if zero_count > 0:
                # Scale the number of zeros and add them to the new vector
                scaled_count = max(1, int(math.ceil(zero_count / scaling_factor)))
                scaled_vector.extend([0] * scaled_count)
                zero_count = 0
            scaled_vector.append(1)
        else:
            zero_count += 1

    # Handle trailing zeros
    if zero_count > 0:
        scaled_count = max(1, int(math.ceil(zero_count / scaling_factor)))
        scaled_vector.extend([0] * scaled_count)

    # Convert the list to a NumPy array with a double type
    return np.array(scaled_vector, dtype=np.double)
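
The effect of the scaling factor is easiest to see on a small vector. The sketch below uses a hypothetical helper, `compress_gaps`, that applies the same rule as `scale_zeros` (each run of zeros of length n is replaced by ceil(n / scaling_factor) zeros, and every 1 is kept); the toy vector is invented for the example.

```python
import math
import numpy as np

def compress_gaps(binary_vector, scaling_factor=10):
    """Shrink each run of zeros to ceil(run / scaling_factor) zeros; keep every 1."""
    out, run = [], 0
    for bit in binary_vector:
        if bit == 1:
            out.extend([0.0] * math.ceil(run / scaling_factor))
            out.append(1.0)
            run = 0
        else:
            run += 1
    out.extend([0.0] * math.ceil(run / scaling_factor))  # trailing zeros
    return np.array(out, dtype=np.double)

# Toy vector: 25 zeros, a 1, 3 zeros, a 1, 10 zeros
toy = np.array([0] * 25 + [1] + [0] * 3 + [1] + [0] * 10)
compressed = compress_gaps(toy, scaling_factor=10)
print(len(toy), "->", len(compressed))  # 40 -> 7
print(compressed)                       # [0. 0. 0. 1. 0. 1. 0.]
```

The gaps of 25, 3, and 10 zeros become 3, 1, and 1 zeros respectively, so a larger scaling factor shortens the series more while the order and count of events stay the same.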

### Import data

In [15]:
# import csv as a dataframe
data_df = pd.read_csv(pwd + "/02_Clean_data/00_recording_event_times_labels.csv")
# import pickle file with list of arrays
with open(pwd + "/02_Clean_data/00_event_timestamps_arrays_list.pkl", 'rb') as file:
    event_timestamps_arrays_list = pickle.load(file)

### Decompress data

In [16]:
# add arrays to dataframe
data_df['event_timestamps'] = event_timestamps_arrays_list
# get the length of each array
data_df['event_length'] = data_df['event_timestamps'].apply(lambda x: len(x))
# get the binary stream for each row
data_df['binary_stream'] = data_df.apply(create_binary_stream, axis=1)

### Compress data

In [None]:
# create compressed binary representation
# Scale factor: the length of each run of zeros is divided by this value
# larger = more compressed
scale_factor = 100
# Apply the function to each element of the column
data_df['scaled_arrays'] = data_df['binary_stream'].apply(lambda x: scale_zeros(x, scale_factor))

data_df

### Dynamic Time Warping (DTW) distance calculation

In [None]:
# collect the compressed time series into a list to calculate the distance matrix from
scaled_timeseries_lst = data_df['scaled_arrays'].tolist()

# calculate the distance matrix
# function docs: https://dtaidistance.readthedocs.io/en/latest/modules/dtw.html?highlight=parallel#dtaidistance.dtw.distance_matrix_fast
dtw_distance_matrix = dtw.distance_matrix_fast(scaled_timeseries_lst) # use distance_matrix() if C/OpenMP not working on device
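
For readers unfamiliar with DTW, the following pure-NumPy sketch shows the kind of computation behind each entry of the matrix. It is not the `dtaidistance` implementation. The functions `dtw_distance` and `pairwise_dtw` and the toy series are hypothetical, and the choice of accumulating squared differences with a final square root is an assumption about the library's default behaviour, so values may not match it exactly.

```python
import numpy as np

def dtw_distance(s, t):
    """Textbook DTW via dynamic programming (assumed: squared cost, final sqrt)."""
    n, m = len(s), len(t)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (s[i - 1] - t[j - 1]) ** 2
            # extend the cheapest of the three allowed predecessor alignments
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))

def pairwise_dtw(series_list):
    """Symmetric pairwise distance matrix with a zero diagonal."""
    k = len(series_list)
    dist = np.zeros((k, k))
    for a in range(k):
        for b in range(a + 1, k):
            dist[a, b] = dist[b, a] = dtw_distance(series_list[a], series_list[b])
    return dist

# Toy series: the second is a time-stretched version of the first
toy_series = [
    np.array([0, 1, 0, 1], dtype=np.double),
    np.array([0, 0, 1, 0, 0, 1], dtype=np.double),
    np.array([1, 1, 1], dtype=np.double),
]
toy_matrix = pairwise_dtw(toy_series)
print(toy_matrix)
```

The first two toy series contain the same event pattern with different gap lengths, so their DTW distance is 0. The number of pairs grows quadratically with the number of events, and each pair costs time proportional to the product of the two series lengths, which is why compressing the series first matters for the runtime.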

### Save data to disk

In [None]:
# save distance matrix for further clustering analysis
# Save to .npy file
np.save(pwd + "/02_Clean_data/01_dtw_distance_matrix.npy", dtw_distance_matrix)
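
Because the run is long, it can be worth confirming that the saved file reloads cleanly before moving on to clustering. The sketch below is a hypothetical round-trip check on a small made-up matrix written to a temporary directory; the file name and the three checks are illustrative assumptions, not part of the original pipeline.

```python
import os
import tempfile
import numpy as np

# Small made-up distance matrix standing in for the real DTW output
toy_matrix = np.array([[0.0, 1.5, 2.0],
                       [1.5, 0.0, 0.5],
                       [2.0, 0.5, 0.0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_dtw_distance_matrix.npy")
    np.save(toy_path, toy_matrix)  # same call as used for the real matrix
    reloaded = np.load(toy_path)

# Basic sanity checks a clustering step would typically rely on
is_square = reloaded.ndim == 2 and reloaded.shape[0] == reloaded.shape[1]
is_symmetric = np.allclose(reloaded, reloaded.T)
has_zero_diagonal = np.allclose(np.diag(reloaded), 0.0)
print(is_square, is_symmetric, has_zero_diagonal)  # True True True
```

If the symmetry check fails on the real matrix, the installed `dtaidistance` version may be returning only one triangle of the matrix; its documentation describes the output layout.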