# Fuzzy Filtering on Metadata with KDB.AI Vector Database

##### Note: This example requires a KDB.AI endpoint and API key. Sign up for a free [KDB.AI account](https://kdb.ai/get-started).

#### In this example, we will show how to use metadata filtering along with fuzzy filtering in a KDB.AI vector database to increase the speed and accuracy of vector similarity searches.

#### NOTE! KDB.AI also has 'fuzzy filter' capabilities on metadata columns.
Data often contains errors such as typos, misspellings, or international spelling variations, which can hinder the accuracy of search results. Fuzzy filters address this issue by enabling the retrieval of documents that contain terms and metadata entries similar to the specified query term and filters, even if there are slight variations.

Fuzzy filtering supports several distance metrics. It defaults to Levenshtein distance, but you can choose from a variety of options: Levenshtein, Damerau-Levenshtein, Hamming, Indel, Jaro, JaroWinkler, Longest Common Subsequence, Optimal String Alignment (OSA), Prefix, or Postfix.
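To make the difference between metrics concrete, the sketch below computes two of them in plain Python: Levenshtein distance, and Optimal String Alignment (OSA), which additionally counts a swap of two adjacent characters as a single edit. This is an illustrative implementation only, not the code KDB.AI runs, and the function name `edit_distance` is made up for this example.

```python
def edit_distance(a: str, b: str, transpositions: bool = False) -> int:
    """Levenshtein distance; with transpositions=True, OSA distance."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # OSA only: two adjacent characters swapped count as one edit
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("Goerge Lucas", "George Lucas"))                       # 2
print(edit_distance("Goerge Lucas", "George Lucas", transpositions=True))  # 1
print(edit_distance("ficton", "fiction"))                                  # 1
```

The swapped letters in "Goerge" cost two substitutions under Levenshtein but only one transposition under OSA, which is why the choice of metric can change which rows fall within a given distance threshold.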

#### Agenda:
1. Set Up
2. Data Import and Understanding
3. Set Up KDB.AI Vector Database
4. Insert Movie Data into the KDB.AI table
5. Run Filtered & Fuzzy Similarity Searches on our KDB.AI vector database

Movie Dataset Source: https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots

## 1. Set Up
#### Installs, imports, and API Key setup

In order to successfully run this sample, note the following steps depending on where you are running this notebook:

- ***Run Locally / Private Environment:*** The [Setup](https://github.com/KxSystems/kdbai-samples/blob/main/README.md#setup) steps in the repository's `README.md` will guide you on prerequisites and how to run this with Jupyter.

- ***Colab / Hosted Environment:*** Open this notebook in Colab and run through the cells.

In [None]:
!pip install kdbai_client sentence_transformers

In [2]:
import pandas as pd
import os
from getpass import getpass

In [3]:
### !!! Only run this cell if you need to download the data into your environment, for example in Colab
### This downloads movie data
if not os.path.exists("./data/filtered_embedded_movies.pkl"):
  !mkdir -p ./data
  !wget -P ./data https://raw.githubusercontent.com/KxSystems/kdbai-samples/main/metadata_filtering/data/filtered_embedded_movies.pkl

--2024-09-16 19:51:48--  https://raw.githubusercontent.com/KxSystems/kdbai-samples/main/metadata_filtering/data/filtered_embedded_movies.pkl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 76970270 (73M) [application/octet-stream]
Saving to: ‘./data/filtered_embedded_movies.pkl’


2024-09-16 19:51:49 (232 MB/s) - ‘./data/filtered_embedded_movies.pkl’ saved [76970270/76970270]



## 2. Data Import and Understanding
### Import movies dataframe

In [None]:
# Read in the Movies dataframe
df = pd.read_pickle("./data/filtered_embedded_movies.pkl")

### Initial data exploration: Let's understand the data!

In [None]:
#How many rows do we have?
print(df.shape[0])

19161


In [None]:
#What columns do we have?
for column in df.columns:
    print(column)

ReleaseYear
Title
Origin
Director
Cast
Genre
Plot
embeddings


In [None]:
#Let us inspect the dataframe
df.head()

Unnamed: 0,ReleaseYear,Title,Origin,Director,Cast,Genre,Plot,embeddings
0,1975,The Candy Tangerine Man,American,Matt Cimber,John Daniels Eli Haines Tom Hankason,action,A successful Los Angeles-based businessperson ...,"[-0.06835174, -0.013138616, -0.12417501, 0.002..."
1,1975,Capone,American,Steve Carver,Ben Gazzara Susan Blakely John Cassavetes Sylv...,crime drama,The story is of the rise and fall of the Chica...,"[-0.01411798, 0.040705115, -0.0014280609, 0.00..."
2,1975,Cleopatra Jones and the Casino of Gold,American,Charles Bail,Tamara Dobson Stella Stevens,action,The story begins with two government agents Ma...,"[-0.0925895, 0.01188509, -0.08999529, -0.01541..."
3,1975,Conduct Unbecoming,American,Michael Anderson,Stacy Keach Richard Attenborough Christopher P...,drama,Around 1880 two young British officers arrive ...,"[-0.07435084, -0.06386179, 0.017042944, 0.0288..."
4,1975,Cooley High,American,Michael Schultz,Lawrence Hilton-Jacobs Glynn Turman Garrett Mo...,comedy,Set in 1964 Chicago Preach an aspiring playwri...,"[-0.041632336, 0.037923656, -0.072276264, -0.0..."


## 3. Set up KDB.AI Vector Database
Now that we understand our dataset, we can set up our vector database.



In [None]:
# vector DB
import os
from getpass import getpass
import kdbai_client as kdbai
import time

##### Option 1. KDB.AI Cloud

To use KDB.AI Cloud, you will need two session details - a URL endpoint and an API key.
To get these you can sign up for free [here](https://trykdb.kx.com/kdbai/signup).

You can connect to a KDB.AI Cloud session using `kdbai.Session` and passing the session URL endpoint and API key details from your KDB.AI Cloud portal.

If the environment variables `KDBAI_ENDPOINT` and `KDBAI_API_KEY` exist on your system containing your KDB.AI Cloud portal details, these variables will automatically be used to connect.
If these do not exist, it will prompt you to enter your KDB.AI Cloud portal session URL endpoint and API key details.

In [None]:
#Set up KDB.AI endpoint and API key
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)

KDB.AI endpoint: your_kdbai_endpoint
KDB.AI API key: ········


In [None]:
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

##### Option 2. KDB.AI Server

To use KDB.AI Server, you will need to download and run your own container.
To do this, you will first need to sign up for free [here](https://trykdb.kx.com/kdbaiserver/signup/).

You will receive an email with the required license file and bearer token needed to download your instance.
Follow instructions in the signup email to get your session up and running.

Once the [setup steps](https://code.kx.com/kdbai/gettingStarted/kdb-ai-server-setup.html) are complete you can then connect to your KDB.AI Server session using `kdbai.Session` and passing your local endpoint.

In [None]:
session = kdbai.Session()

### Set up the table schema
Define a table column for each column in the dataframe, including the 'embeddings' column that holds the movie description embeddings.

In [None]:
#Set up the schema for KDB.AI table, specifying embeddings column with 384 dimensions, Euclidean Distance, and flat index
table_schema = {
    "columns": [
        {"name": "ReleaseYear", "pytype": "int64"},
        {"name": "Title", "pytype": "str"},
        {"name": "Origin", "pytype": "str"},
        {"name": "Director", "pytype": "str"},
        {"name": "Cast", "pytype": "str"},
        {"name": "Genre", "pytype": "str"},
        {"name": "Plot", "pytype": "str"},
        {
            "name": "embeddings",
            "pytype": "float64",
            "vectorIndex": {"dims": 384, "metric": "L2", "type": "flat"},
        },
    ]
}
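Before inserting, it can help to confirm that the data actually lines up with the schema. The sketch below is a plain-Python check, assuming rows are available as dictionaries; the helper name `check_rows` and the reduced two-column schema are hypothetical and only serve the illustration. It reports missing columns and embedding vectors whose length differs from the `dims` declared in `vectorIndex`.

```python
def check_rows(schema: dict, rows: list) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for n, row in enumerate(rows):
        for col in schema["columns"]:
            name = col["name"]
            if name not in row:
                problems.append(f"row {n}: missing column {name!r}")
            elif "vectorIndex" in col:
                expected = col["vectorIndex"]["dims"]
                actual = len(row[name])
                if actual != expected:
                    problems.append(
                        f"row {n}: {name!r} has {actual} dims, expected {expected}"
                    )
    return problems


# Reduced schema for illustration (same layout as the schema dictionary above)
demo_schema = {
    "columns": [
        {"name": "Title", "pytype": "str"},
        {
            "name": "embeddings",
            "pytype": "float64",
            "vectorIndex": {"dims": 384, "metric": "L2", "type": "flat"},
        },
    ]
}

demo_rows = [
    {"Title": "Capone", "embeddings": [0.0] * 384},   # fine
    {"Title": "Cooley High", "embeddings": [0.0] * 3},  # wrong vector length
    {"embeddings": [0.0] * 384},                       # Title missing
]

for problem in check_rows(demo_schema, demo_rows):
    print(problem)
```

With the notebook's dataframe, the rows could be produced with `df.head(100).to_dict("records")` and checked against `table_schema`.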

### Create a table called "metadata_demo"
First drop the table if it already exists, then create a new table with the table schema from above.

In [None]:
# First ensure the table does not already exist
try:
    session.table("metadata_demo").drop()
    time.sleep(5)
except kdbai.KDBAIException:
    pass

In [None]:
#Create the table called "metadata_demo"
table = session.create_table("metadata_demo", table_schema)

## 4. Insert Movie Data into the KDB.AI table

In [None]:
#Insert the data into the table, split into 2000 row batches
from tqdm import tqdm
n = 2000  # chunk row size

for i in tqdm(range(0, df.shape[0], n)):
    table.insert(df[i:i+n].reset_index(drop=True))

100%|██████████████████████████████████████████████████████████████████████████████████| 10/10 [00:40<00:00,  4.00s/it]
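The batching arithmetic can be checked without a database. The sketch below uses a hypothetical helper, `batch_bounds`, to list the row ranges that a loop over `range(0, total_rows, batch_size)` visits, using the row count and batch size from this notebook.

```python
def batch_bounds(total_rows: int, batch_size: int) -> list:
    """Return (start, end) row ranges, end exclusive, covering all rows."""
    return [
        (start, min(start + batch_size, total_rows))
        for start in range(0, total_rows, batch_size)
    ]


bounds = batch_bounds(19161, 2000)
print(len(bounds))                     # 10
print(bounds[0])                       # (0, 2000)
print(bounds[-1])                      # (18000, 19161)
print(bounds[-1][1] - bounds[-1][0])   # 1161
```

This matches the 10 iterations shown in the progress bar: nine full batches of 2000 rows and a final batch of 1161. The insert loop does not need the `min(...)` clamp because slicing a dataframe past its end simply returns the remaining rows.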


In [None]:
#function to view the dataframe within the table
def show_df(df: pd.DataFrame) -> pd.DataFrame:
    print(df.shape)
    return df.head()

In [None]:
#View contents of the table
show_df(table.query())

(19161, 8)


Unnamed: 0,ReleaseYear,Title,Origin,Director,Cast,Genre,Plot,embeddings
0,1975,The Candy Tangerine Man,American,Matt Cimber,John Daniels Eli Haines Tom Hankason,action,A successful Los Angeles-based businessperson ...,"[-0.06835174, -0.013138616, -0.12417501, 0.002..."
1,1975,Capone,American,Steve Carver,Ben Gazzara Susan Blakely John Cassavetes Sylv...,crime drama,The story is of the rise and fall of the Chica...,"[-0.01411798, 0.040705115, -0.0014280609, 0.00..."
2,1975,Cleopatra Jones and the Casino of Gold,American,Charles Bail,Tamara Dobson Stella Stevens,action,The story begins with two government agents Ma...,"[-0.0925895, 0.01188509, -0.08999529, -0.01541..."
3,1975,Conduct Unbecoming,American,Michael Anderson,Stacy Keach Richard Attenborough Christopher P...,drama,Around 1880 two young British officers arrive ...,"[-0.07435084, -0.06386179, 0.017042944, 0.0288..."
4,1975,Cooley High,American,Michael Schultz,Lawrence Hilton-Jacobs Glynn Turman Garrett Mo...,comedy,Set in 1964 Chicago Preach an aspiring playwri...,"[-0.041632336, 0.037923656, -0.072276264, -0.0..."


## 5. Run Filtered & Fuzzy Similarity Searches on our KDB.AI Vector Database

#### Set up embedding model to embed our natural language queries

In [None]:
# embedding model to be used to embed user input query
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

#### Create a query vector by using the embedding model to embed a natural language query

In [None]:
#Embed a query
query_vector = [embedding_model.encode('star wars Luke Skywalker').tolist()]

#### Run vector similarity search, return the top-3 similar movies

In [None]:
#Search vector db to find most relevant movies
print(table.search(query_vector, n=3))

[   ReleaseYear                                             Title    Origin  \
0         1983                                Return of the Jedi  American   
1         1977  Star Wars Episode IV: A New Hope (aka Star Wars)  American   
2         1980                           The Empire Strikes Back  American   

           Director                                               Cast  \
0  Richard Marquand  Mark Hamill Harrison Ford Carrie Fisher Billy ...   
1      George Lucas  Mark Hamill Harrison Ford Carrie Fisher Alec G...   
2    Irvin Kershner  Carrie Fisher Harrison Ford Mark Hamill Billy ...   

             Genre                                               Plot  \
0  science fiction  Luke Skywalker initiates a plan to rescue Han ...   
1  science fiction  The galaxy is in the midst of a civil war. Spi...   
2  science fiction  Three years after the destruction of the Death...   

                                          embeddings  __nn_distance  
0  [-0.047360003, -0.08337
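The table was created with a flat index and the L2 metric, which means the search compares the query against every stored vector and keeps the closest ones. The sketch below shows that idea in plain Python on toy two-dimensional vectors. The function name `flat_search` is hypothetical, and how KDB.AI computes and reports distances internally is not covered here.

```python
import heapq
import math


def flat_search(vectors: list, query: list, n: int = 3) -> list:
    """Exhaustive nearest-neighbour search: return the n (distance, index) pairs
    with the smallest Euclidean (L2) distance to the query."""
    scored = ((math.dist(vec, query), idx) for idx, vec in enumerate(vectors))
    return heapq.nsmallest(n, scored)


toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]]
print(flat_search(toy_vectors, [0.0, 0.0], n=3))
# [(0.0, 0), (1.0, 1), (3.0, 2)]
```

Because every vector is scored, a flat search is exact, and its cost grows with the number of rows it has to scan. That is why narrowing the candidate rows with metadata filters can speed up a search.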

#### Repeat the search with metadata filters to narrow the search space

In [None]:
print(table.search(query_vector, n=3, filter=[("like", "Director", "George Lucas"),("=", "ReleaseYear", 1977)]))

[   ReleaseYear                                             Title    Origin  \
0         1977  Star Wars Episode IV: A New Hope (aka Star Wars)  American   

       Director                                               Cast  \
0  George Lucas  Mark Hamill Harrison Ford Carrie Fisher Alec G...   

             Genre                                               Plot  \
0  science fiction  The galaxy is in the midst of a civil war. Spi...   

                                          embeddings  __nn_distance  
0  [-0.100305825, 0.008335104, 0.03792797, -0.038...       0.910225  ]
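A filter list combines conditions of the form `(operator, column, value)`, and a row must satisfy all of them to remain a search candidate. The sketch below mimics that behaviour in plain Python for the three operators used in this notebook. It is an assumption-laden illustration, not KDB.AI's implementation: `like` is modelled with shell-style `*` wildcards via `fnmatchcase`, `within` is assumed to be an inclusive range, and the helper names are hypothetical.

```python
from fnmatch import fnmatchcase


def matches(row: dict, condition: tuple) -> bool:
    """Evaluate one (operator, column, value) condition against a row."""
    op, column, value = condition
    cell = row[column]
    if op == "=":
        return cell == value
    if op == "like":
        return fnmatchcase(cell, value)  # assumption: '*' matches any run of characters
    if op == "within":
        low, high = value
        return low <= cell <= high       # assumption: both bounds inclusive
    raise ValueError(f"unsupported operator: {op}")


def apply_filters(rows: list, conditions: list) -> list:
    """Keep only the rows that satisfy every condition."""
    return [row for row in rows if all(matches(row, c) for c in conditions)]


movies = [
    {"Title": "Return of the Jedi", "Director": "Richard Marquand", "ReleaseYear": 1983},
    {"Title": "Star Wars Episode IV: A New Hope (aka Star Wars)",
     "Director": "George Lucas", "ReleaseYear": 1977},
    {"Title": "The Empire Strikes Back", "Director": "Irvin Kershner", "ReleaseYear": 1980},
]

kept = apply_filters(movies, [("like", "Director", "George Lucas"), ("=", "ReleaseYear", 1977)])
print([m["Title"] for m in kept])
# ['Star Wars Episode IV: A New Hope (aka Star Wars)']
```

Only one of the three unfiltered results survives both conditions, which matches the single row returned by the filtered search above.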


### Fuzzy Filtering
What if there are some spelling mistakes?

In [None]:
# Fuzzy filter with a misspelled name
print(table.search(query_vector, n=3, filter=[['fuzzy','Director',[["Goerge Lucas",2]]]]))

In [None]:
# Fuzzy filter with a misspelled genre, explicitly choosing the distance metric ('jaro' here; the default is Levenshtein)
print(table.search(query_vector, n=3, filter=[['fuzzy','Genre',[["ficton",2,"jaro"]]]]))
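Conceptually, a fuzzy filter keeps a row when its metadata value is within a maximum distance of the search term. The sketch below illustrates that with Levenshtein distance in plain Python. It rests on assumptions that the source does not confirm: that the whole column value is compared with the term and that the threshold is inclusive. The helper names `levenshtein` and `fuzzy_filter` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn a into b (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]


def fuzzy_filter(rows: list, column: str, term: str, max_distance: int) -> list:
    """Keep rows whose column value is within max_distance edits of term."""
    return [row for row in rows if levenshtein(row[column], term) <= max_distance]


movies = [
    {"Title": "Return of the Jedi", "Director": "Richard Marquand"},
    {"Title": "Star Wars Episode IV: A New Hope (aka Star Wars)", "Director": "George Lucas"},
    {"Title": "The Empire Strikes Back", "Director": "Irvin Kershner"},
]

exact = [m["Title"] for m in movies if m["Director"] == "Goerge Lucas"]
fuzzy = [m["Title"] for m in fuzzy_filter(movies, "Director", "Goerge Lucas", 2)]
print(exact)  # []
print(fuzzy)  # ['Star Wars Episode IV: A New Hope (aka Star Wars)']
```

An exact match on the misspelled name returns nothing, while a distance threshold of 2 recovers the intended director.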

#### More Examples

In [None]:
#Another query
query_vector = [embedding_model.encode('conspiracy theories involving art').tolist()]

In [None]:
#Another filtered search example
print(table.search(query_vector, n=3, filter=[("like", "Genre", "*thriller*"),("like","Cast","*Tom Hanks*")]))

[   ReleaseYear               Title    Origin          Director  \
0         2006   The Da Vinci Code  American        Ron Howard   
1         2017          The Circle  American    James Ponsoldt   
2         2017            The Post  American  Steven Spielberg   

                                                Cast  \
0  Tom Hanks Audrey Tautou Ian McKellen Alfred Mo...   
1  James Ponsoldt (director/screenplay); Tom Hank...   
2  Steven Spielberg (director); Liz Hannah Josh S...   

                              Genre  \
0                          thriller   
1             sci-fi drama thriller   
2  biography drama history thriller   

                                                Plot  \
0  Jacques Sauni¨re the Louvres curator is pursue...   
1  When her car breaks down Mae Holland contacts ...   
2  In 1966 Vietnam State Department military anal...   

                                          embeddings  __nn_distance  
0  [-0.11887315, -0.049770635, -0.022621859, -0.0...     

In [None]:
#Another query
query_vector = [embedding_model.encode('middle earth fantasy adventure in the Shire').tolist()]

In [None]:
#Another filtered search example
print(table.search(query_vector, n=3, filter=[("within","ReleaseYear",[2000,2010])]))

[   ReleaseYear                                              Title    Origin  \
0         2001  The Lord of the Rings: The Fellowship of the Ring  American   
1         2002              The Lord of the Rings: The Two Towers  American   
2         2003      The Lord of the Rings: The Return of the King  American   

        Director                                               Cast  \
0  Peter Jackson  Elijah Wood Ian McKellen Liv Tyler Sean Astin ...   
1  Peter Jackson  Elijah Wood Ian McKellen Liv Tyler Viggo Morte...   
2  Peter Jackson  Elijah Wood Ian McKellen Liv Tyler Sean Astin ...   

               Genre                                               Plot  \
0            fantasy  In the Second Age of Middle-earth the lords of...   
1  adventure fantasy  After awakening from a dream of Gandalf the Gr...   
2  adventure fantasy  Many years ago two Hobbits Smeagol and Dagol a...   

                                          embeddings  __nn_distance  
0  [-0.047063936, 0.036902

#### Another Fuzzy Filtering Example on the Genre metadata column

In [None]:
#Another filtered search example, with a fuzzy filter on the Genre column to tolerate a typo
print(table.search(query_vector, n=3, filter=[("within","ReleaseYear",[2000,2010]),['fuzzy','Genre',[["fantesy",2,"damerau-levenshtein"]]]]))

## Delete the KDB.AI Table
Once finished with the table, it is best practice to drop it.

In [None]:
table.drop()

#### Take Our Survey
We hope you found this sample helpful! Your feedback is important to us, and we would appreciate it if you could take a moment to fill out our brief survey. Your input helps us improve our content.

Take the [Survey](https://delighted.com/t/wtS7T4Lg)
