# Pattern Matching on Sensor Data

<div class="alert alert-block alert-info">
<b>Tip:</b> We have new features for highly optimized time series analytics.

See documentation and notebooks on ***Temporal Similarity Search (TSS):***

- Transformed TSS: [Documentation](https://code.kx.com/kdbai/reference/transformed-tss.html)

- Non-Transformed TSS: [Documentation](https://code.kx.com/kdbai/use/non-transformed-tss.html) 

</div>

##### Note: This example requires a KDB.AI endpoint and API key. Sign up for a free [KDB.AI account](https://kdb.ai/get-started).

This example explores the process of conducting pattern matching on time series manufacturing data using *Transformed Temporal Similarity Search* in KDB.AI.

Our goal is to identify and retrieve historical time series that exhibit specific patterns. This matching capability is instrumental in a wide array of manufacturing scenarios, including quality control, process optimization, and predictive maintenance. For instance, imagine a scenario where we have time series data representing machinery performance, and we need to pinpoint instances of unusual behaviour, such as spikes, drops, or recurring trends.

We will guide you through a straightforward approach that leverages the raw time series data directly, without the need for complex modeling or domain-specific expertise. This approach is particularly attractive because it doesn't require additional resources for model creation. The sample will demonstrate that this simple method can yield satisfactory results.

### Aim

This tutorial walks through generating time series vector embeddings, storing them in a vector database, and using KDB.AI to find patterns that match an input query pattern. We will cover the following topics:

1. Load Sensor Data
1. Create Sensor Vector Embeddings
1. Store Embeddings in KDB.AI
1. Search For Similar Sequences To A Target Sensor Sequence
1. Delete the KDB.AI Table

---

## 0. Setup

### Install dependencies

In order to successfully run this sample, note the following steps depending on where you are running this notebook:

- ***Run Locally / Private Environment:*** The [Setup](https://github.com/KxSystems/kdbai-samples/blob/main/README.md#setup) steps in the repository's `README.md` will guide you on prerequisites and how to run this with Jupyter.

- ***Colab / Hosted Environment:*** Open this notebook in Colab and run through the cells.

In [1]:
!pip install kdbai_client
!pip install matplotlib

In [2]:
### !!! Only run this cell if you need to download the data into your environment, for example in Colab
### This downloads sensor data
!mkdir -p ./data
!wget -P ./data https://raw.githubusercontent.com/KxSystems/kdbai-samples/main/pattern_matching/data/archive.zip

### Import Packages

In [3]:
# read data
from zipfile import ZipFile
import pandas as pd

In [4]:
# plotting
import matplotlib.pyplot as plt

In [5]:
# vector DB
import os
import kdbai_client as kdbai
from getpass import getpass
import time

### Ignore Warning

In [6]:
import warnings

warnings.simplefilter("ignore", UserWarning)

### Define Helper Functions

In [7]:
def show_df(df: pd.DataFrame) -> pd.DataFrame:
    print(df.shape)
    return df.head()

## 1. Load Sensor Data

### Dataset Overview

The dataset used in this example is the [Water Pump Sensor Dataset](https://www.kaggle.com/datasets/nphantawee/pump-sensor-data) available on Kaggle. The dataset consists of a `sensor.csv` file containing raw values from 52 sensors on a town water pump.

As the `sensor.csv` file is larger than 100 MB, we cannot host it on GitHub directly. Instead, it is stored as a zip archive that must be extracted locally.

### Extract the Data From a ZipFile

In [8]:
def extract_zip(file_name):
    with ZipFile(file_name, "r") as zipf:
        zipf.extractall("data")

In [9]:
extract_zip("data/archive.zip")

You should now have a `sensor.csv` file in the `data` directory.

### Read In The Sensor Data From The CSV

In [10]:
raw_sensors_df = pd.read_csv("data/sensor.csv")

In [11]:
show_df(raw_sensors_df)

### Pre-process The Data

Let's clean up the dataset. We will remove duplicates, drop irrelevant columns, and handle missing data.

In [12]:
# Drop duplicates
sensors_df = raw_sensors_df.drop_duplicates()

In [13]:
# Remove columns that are unnecessary/bad data
sensors_df = sensors_df.drop(["Unnamed: 0", "sensor_15", "sensor_50"], axis=1)

In [14]:
# convert timestamp to datetime format
sensors_df["timestamp"] = pd.to_datetime(sensors_df["timestamp"])

In [15]:
# Removes rows with any NaN values
sensors_df = sensors_df.dropna()

In [16]:
# Reset the index
sensors_df = sensors_df.reset_index(drop=True)

In [17]:
show_df(sensors_df)

The original dataset has 52 sensor columns (50 remain after the cleanup above). For simplicity, this example uses only the first one, `sensor_00`.

### Explore The Data For One Sensor

In [18]:
# Extract the readings from the BROKEN state of the pump
broken_sensors_df = sensors_df[sensors_df["machine_status"] == "BROKEN"]

In [19]:
# Plot the sensor_00 time series, marking readings in the BROKEN state with red X markers
plt.figure(figsize=(18, 3))
plt.plot(
    broken_sensors_df["timestamp"],
    broken_sensors_df["sensor_00"],
    linestyle="none",
    marker="X",
    color="red",
    markersize=12,
)
plt.plot(sensors_df["timestamp"], sensors_df["sensor_00"], color="blue")
plt.show()

We can see above that the sensor values generally stay around 2.5, with a few sharp drops. The rows where `machine_status` is `BROKEN` are plotted in red. Many of them coincide with these drops, which suggests that pump failures are the reason for them.

## 2. Create Sensor Vector Embeddings

Next, let's create embeddings for these values. We have chosen a simple approach that leverages the raw time series data directly, without the need for complex modelling or domain-specific expertise.

### Extract One Sensor's Values

In [20]:
sensor0_df = sensors_df[["timestamp", "sensor_00"]]
sensor0_df = sensor0_df.reset_index(drop=True).reset_index()

In [21]:
# This is our sensor data to be ingested into KDB.AI
sensor0_df.head()

In [22]:
sensor0_df.shape

### Group The Sensor0 Values into Time Windows

The code below divides the original time series data into overlapping windows, with each window containing a specified number of rows and a step size determining how they are shifted along the timeline. It also extracts a timestamp from each window as we will want to store this as metadata.

In [23]:
# Set the window size (number of rows in each window)
window_size = 100
step_size = 1

In [24]:
# define windows
windows = [
    sensor0_df.iloc[i : i + window_size]
    for i in range(0, len(sensor0_df) - window_size + 1, step_size)
]
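
To make the windowing step concrete, here is a minimal sketch of the same idea on a tiny plain-Python list instead of the full DataFrame. The values and the helper name `sliding_windows` are made up for illustration and are not part of the notebook.

```python
# Toy series standing in for the sensor_00 column (values are made up).
values = [2.4, 2.5, 2.5, 2.3, 0.1, 0.0, 2.4, 2.5, 2.5, 2.4]


def sliding_windows(series, window_size, step_size):
    """Return every full window of `window_size` items, advancing by `step_size`."""
    last_start = len(series) - window_size
    return [
        series[start : start + window_size]
        for start in range(0, last_start + 1, step_size)
    ]


toy_windows = sliding_windows(values, window_size=4, step_size=2)

# Windows start at positions 0, 2, 4 and 6, so neighbouring windows share two values.
print(len(toy_windows))
print(toy_windows[0])
print(toy_windows[1])
```

With `step_size = 1`, as used in this notebook, the number of windows is `len(series) - window_size + 1`, so there is almost one window per row and consecutive windows differ by a single value.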

In [25]:
# Iterate through the windows & extract column values
start_times = [w["timestamp"].iloc[0] for w in windows]
end_times = [w["timestamp"].iloc[-1] for w in windows]
sensor0_values = [w["sensor_00"].tolist() for w in windows]

In [26]:
# Create a new DataFrame from the collected data
embedding_df = pd.DataFrame(
    {"timestamp": start_times, "sensor_00": sensor0_values}
)

In [27]:
embedding_df = embedding_df.reset_index(drop=True).reset_index()

In [28]:
# Show the resulting DataFrame
show_df(embedding_df)

In [29]:
# When is the first time a sensor is 'broken'?
broken_sensors_df["timestamp"]

## 3. Store Embeddings in KDB.AI

With the embeddings created, we need to store them in a vector database to enable efficient searching.

### Define KDB.AI Session

KDB.AI comes in two offerings:

1. [KDB.AI Cloud](https://trykdb.kx.com/kdbai/signup/) - For experimenting with smaller generative AI projects with a vector database in our cloud.
2. [KDB.AI Server](https://trykdb.kx.com/kdbaiserver/signup/) - For evaluating large scale generative AI applications on-premises or on your own cloud provider.

Depending on which you use there will be different setup steps and connection details required.

##### Option 1. KDB.AI Cloud

To use KDB.AI Cloud, you will need two session details - a URL endpoint and an API key.
To get these you can sign up for free [here](https://trykdb.kx.com/kdbai/signup).

You can connect to a KDB.AI Cloud session using `kdbai.Session` and passing the session URL endpoint and API key details from your KDB.AI Cloud portal.

If the environment variables `KDBAI_ENDPOINT` and `KDBAI_API_KEY` exist on your system and contain your KDB.AI Cloud portal details, they will be used to connect automatically.
If they do not exist, you will be prompted to enter your KDB.AI Cloud portal session URL endpoint and API key details.

In [30]:
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)

In [31]:
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

##### Option 2. KDB.AI Server

To use KDB.AI Server, you will need to download and run your own container.
To do this, you will first need to sign up for free [here](https://trykdb.kx.com/kdbaiserver/signup/).

You will receive an email with the required license file and bearer token needed to download your instance.
Follow instructions in the signup email to get your session up and running.

Once the [setup steps](https://code.kx.com/kdbai/gettingStarted/kdb-ai-server-setup.html) are complete you can then connect to your KDB.AI Server session using `kdbai.Session` and passing your local endpoint.

In [32]:
# session = kdbai.Session(endpoint="http://localhost:8082")

### Define Vector DB Table Schema

The next step is to define a schema for our KDB.AI table where we will store our embeddings. Our table will have three columns: `index`, `timestamp`, and `sensor_00`. The `sensor_00` column is where the time series embeddings will be stored and searched using Transformed Temporal Similarity Search.

The key point is that each 100-dimension window is compressed to 8 dimensions with Transformed TSS, which makes search much faster and significantly reduces the memory footprint.
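
As a rough back-of-the-envelope sketch of that footprint reduction, the arithmetic below compares 100-dimension and 8-dimension vectors. It assumes every value occupies 8 bytes (one float64) and uses a hypothetical row count. Both are assumptions made for illustration, not documented KDB.AI storage details.

```python
BYTES_PER_VALUE = 8  # assumption: one float64 per dimension


def vector_bytes(dims, n_vectors):
    """Bytes needed to hold n_vectors vectors of `dims` values each."""
    return dims * BYTES_PER_VALUE * n_vectors


n_vectors = 200_000  # hypothetical number of windows, not taken from the dataset

raw_bytes = vector_bytes(100, n_vectors)
compressed_bytes = vector_bytes(8, n_vectors)

print(f"raw windows:        {raw_bytes / 1e6:.1f} MB")
print(f"compressed windows: {compressed_bytes / 1e6:.1f} MB")
print(f"reduction factor:   {raw_bytes / compressed_bytes}")
```

Under these assumptions the reduction factor is simply 100 / 8, whatever the row count.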

In [47]:
# Set up the schema and indexes for the KDB.AI table: a flat index on the sensor_00 column using Euclidean (L2) distance over 8 dimensions
sensor_schema = [
    {"name": "index", "type": "int64"},
    {"name": "timestamp", "type": "datetime64[ns]"},
    {"name": "sensor_00", "type": "float64s"}
]

indexes = [
    {
        "name": "flat_index",
        "type": "flat",
        "column": "sensor_00",
        "params": {"dims": 8, "metric": "L2"},
    }
]

embedding_conf = {'sensor_00': {"dims": 8, "type": "tsc", "on_insert_error": "reject_all" }}
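
The index above uses the `L2` metric, which is Euclidean distance. As a minimal sketch of what that metric measures, the snippet below compares short made-up windows in plain Python. The helper `l2_distance` and the toy values are illustrative only. In the notebook, the distance is computed by KDB.AI on the compressed 8-dimension representation rather than on raw values like these.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up windows: the query contains a drop, as does `similar`, while `flat` does not.
query = [2.5, 2.4, 0.2, 0.1]
similar = [2.5, 2.5, 0.3, 0.1]
flat = [2.5, 2.5, 2.4, 2.5]

print(l2_distance(query, similar))
print(l2_distance(query, flat))
```

A smaller distance means a closer match, so the window with the same drop ranks ahead of the flat one.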



### Create Vector DB Table

Use the KDB.AI `create_table` function to create a table that matches the defined schema in the vector database.

In [49]:
# get the database connection. Default database name is 'default'
database = session.database('default')

# First ensure the table does not already exist
try:
    database.table("sensors").drop()
    time.sleep(5)
except kdbai.KDBAIException:
    pass

In [50]:
# Create the table called "sensors"
table = database.create_table("sensors",
                              schema = sensor_schema,
                              indexes = indexes,
                              embedding_configurations = embedding_conf)


In [51]:
table.query()

### Add Embedded Data to KDB.AI Table

When adding larger amounts of data, you may need to insert it into the table in chunks. It helps to first check how large the dataset to insert is.

In [52]:
from tqdm import tqdm
n = 1000  # number of rows per batch

for i in tqdm(range(0, embedding_df.shape[0], n)):
    table.insert(embedding_df[i:i+n].reset_index(drop=True))
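
To see how the chunked insert above splits the data, here is a small sketch that computes the row range of each batch. The helper `batch_bounds` and the row count are hypothetical and only illustrate the slicing.

```python
import math


def batch_bounds(total_rows, batch_size):
    """Yield (start, stop) row ranges that cover total_rows in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total_rows, batch_size):
        # The final batch is shorter when total_rows is not a multiple of batch_size.
        yield start, min(start + batch_size, total_rows)


total_rows = 2_500  # hypothetical row count
batch_size = 1_000

bounds = list(batch_bounds(total_rows, batch_size))
n_batches = math.ceil(total_rows / batch_size)

print(n_batches)
print(bounds)
```

The notebook loop does not need the explicit `min` because slicing a DataFrame past its last row simply returns the rows that exist.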

### Verify Data Has Been Inserted

Running `table.query()` should show us that data has been added.

Note that the query result shows only the three columns, including the original 100-dimension time series window. Behind the scenes, each window has also been compressed to 8 dimensions, and it is these compressed windows that are used for similarity search.

In [53]:
show_df(table.query())

## 4. Search For Similar Sequences To A Target Sensor Sequence

Now that our data is loaded successfully, we can perform pattern matching on our historical sensor data using KDB.AI `search`.

### Define an Example Pattern to Query

The first step is to select a pattern that will be used to query.

We chose this by picking a window by its position in `embedding_df` (here window 17100, which starts just before the first failure) and storing its values in a variable called `q`. Any pattern could be selected here.

The resulting query pattern is also displayed as a line plot for visual inspection and analysis.

In [54]:
broken_sensors_df["timestamp"]

In [None]:
## This is our query vector, using the 17100th sliding window as an example (this is just before the first instance when the sensor is in a failed state)
q = embedding_df['sensor_00'][17100]
#q

In [56]:
# Visualise the query pattern
plt.figure(figsize=(10, 6))
plt.plot(embedding_df['sensor_00'][17100], marker="o", linestyle="-")
plt.xlabel("Position in window")
plt.ylabel("Value")
plt.title("Query Pattern")
plt.grid(True)
plt.xticks(rotation=45)  # Rotate x-axis labels for readability
plt.show()

#### Let's return the top 100 matches to this query vector to see if we can identify the other instances of a failed state

In [68]:
nn1_result = table.search(vectors={'flat_index': [q]}, n=100, filter=[(">","index", 18000)])
nn1_result[0]

#### Since every row holds a 100-dimension window and the windows overlap, many matches sit close to one another and match the same 'anomaly' pattern. To return only distinct matches, we walk through the results in the order returned and drop any match that lies within 100 points either side (a 200-point range) of a match we have already kept. This ensures each potential failed state is captured only once in the final results:

In [69]:
def filter_results(df, range_size=200):

    final_results = []
    removed_indices = set()

    for _, row in df.iterrows():
        current_index = row['index']

        # If this index hasn't been removed
        if current_index not in removed_indices:
            final_results.append(row)

            # Mark indices within range for removal
            lower_bound = max(0, current_index - range_size // 2)
            upper_bound = current_index + range_size // 2
            removed_indices.update(range(lower_bound, upper_bound + 1))

    # Create a new dataframe from the final results
    final_df = pd.DataFrame(final_results)

    return final_df

filtered_df = filter_results(nn1_result[0])

# Display the filtered results
print(filtered_df)
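
To see what this filtering does without a live table, here is a minimal sketch of the same logic on a plain list of indices. The list is made up and is assumed to be ordered best match first, and `dedupe_nearby` is a hypothetical name used only for this illustration.

```python
def dedupe_nearby(indices, range_size=200):
    """Keep an index only if no earlier kept index lies within range_size // 2 of it."""
    kept = []
    blocked = set()
    half = range_size // 2
    for idx in indices:
        if idx in blocked:
            continue
        kept.append(idx)
        # Block every index within `half` points either side of the kept one.
        blocked.update(range(max(0, idx - half), idx + half + 1))
    return kept


# Made-up match indices, best match first.
ranked = [17100, 17150, 17201, 69300, 17000]
print(dedupe_nearby(ranked))
```

Here 17150 and 17000 are dropped because they fall within 100 points of 17100. By contrast, 17201 is kept because it is 101 points away, so kept matches can be as little as `range_size // 2 + 1` points apart.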

#### For our reference, these are all of the times when the sensor returns a failed state:

In [70]:
broken_sensors_df["timestamp"]

### Results:

We see that our final results closely capture each of the timestamps of the failed states within a few indexes. There is only one captured pattern that does not fall within the failed states: 110667. If you go back to the plot near the beginning of this notebook, you will see a large drop in the signal near this index. This could be a period worth investigating as a potential missed failed state.
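
One way to put a number on "within a few indexes" is to measure, for each kept match, the distance to the nearest known failure row. The sketch below is a minimal illustration using made-up indices. The helper `nearest_gap` and the `tolerance` value are hypothetical and not part of the notebook. In the notebook itself, `filtered_df['index']` and `broken_sensors_df.index` would be natural inputs, assuming both refer to the same row numbering.

```python
from bisect import bisect_left


def nearest_gap(match_index, event_indices):
    """Distance from match_index to the closest value in sorted event_indices."""
    pos = bisect_left(event_indices, match_index)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = event_indices[max(0, pos - 1) : pos + 1]
    return min(abs(match_index - event) for event in candidates)


# Made-up example: sorted failure rows and the matches kept after filtering.
failure_rows = [100, 500, 900]
matches = [95, 520, 700]
tolerance = 100  # hypothetical threshold for "close to a known failure"

gaps = {m: nearest_gap(m, failure_rows) for m in matches}
unexplained = [m for m, gap in gaps.items() if gap > tolerance]

print(gaps)
print(unexplained)
```

Matches that end up in `unexplained` are the ones that, like index 110667 above, do not line up with a recorded failure and may deserve a closer look.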

### Visualize the Matching Patterns of Other 'Failed' States

In [71]:
for i in filtered_df['index']:
    plt.plot(embedding_df['sensor_00'][i], marker="o", linestyle="-")
plt.xlabel("Position in window")
plt.ylabel("Value")
plt.title("Similar Patterns")
plt.grid(True)
plt.xticks(rotation=45)  # Rotate x-axis labels for readability
plt.show()

## 5. Delete the KDB.AI Table

Once finished with the table, it is best practice to drop it.

In [None]:
table.drop()

## Take Our Survey

We hope you found this sample helpful! Your feedback is important to us, and we would appreciate it if you could take a moment to fill out our brief survey. Your input helps us improve our content.

[**Take the Survey**](https://delighted.com/t/go0ElNsJ)