# Metadata Filtering with KDB.AI Vector Database

##### Note: This example requires KDB.AI server. Sign up for a free [KDB.AI account](https://kdb.ai/get-started).

#### In this example, we will show how to use metadata filtering in a KDB.AI vector database to increase the speed and accuracy of vector similarity searches.

#### Agenda:
1. Set Up
2. Data Import and Understanding 
3. Set Up KDB.AI Vector Database
4. Insert Movie Data into the KDB.AI table
5. Run Filtered Similarity Searches on our KDB.AI vector database

Movie Dataset Source: https://www.kaggle.com/datasets/jrobischon/wikipedia-movie-plots

## 1. Set Up
#### Installs and imports

In order to successfully run this sample, note the following steps depending on where you are running this notebook:

- ***Run Locally / Private Environment:*** The [Setup](https://github.com/KxSystems/kdbai-samples/blob/main/README.md#setup) steps in the repository's `README.md` will guide you on prerequisites and how to run this with Jupyter.

- ***Colab / Hosted Environment:*** Open this notebook in Colab and run through the cells.

In [None]:
!pip install kdbai_client
!pip install sentence_transformers

In [None]:
### !!! Only run this cell if you need to download the data into your environment, for example in Colab
### This downloads movie data
!mkdir ./data 
!wget -P ./data https://raw.githubusercontent.com/KxSystems/kdbai-samples/main/metadata_filtering/data/filtered_embedded_movies.pkl

In [53]:
import pandas as pd
import os
from getpass import getpass

## 2. Data Import and Understanding
### Import movies dataframe

In [54]:
# Read in the Movies dataframe
df = pd.read_pickle("./data/filtered_embedded_movies.pkl")

### Initial data exploration: Let's understand the data!

In [55]:
#How many rows do we have?
print(df.shape[0])

19161


In [56]:
#What columns do we have?
for column in df.columns:
    print(column)

ReleaseYear
Title
Origin
Director
Cast
Genre
Plot
embeddings


In [57]:
#Let us inspect the dataframe
df.head()

Unnamed: 0,ReleaseYear,Title,Origin,Director,Cast,Genre,Plot,embeddings
0,1975,The Candy Tangerine Man,American,Matt Cimber,John Daniels Eli Haines Tom Hankason,action,A successful Los Angeles-based businessperson ...,"[-0.06835174, -0.013138616, -0.12417501, 0.002..."
1,1975,Capone,American,Steve Carver,Ben Gazzara Susan Blakely John Cassavetes Sylv...,crime drama,The story is of the rise and fall of the Chica...,"[-0.01411798, 0.040705115, -0.0014280609, 0.00..."
2,1975,Cleopatra Jones and the Casino of Gold,American,Charles Bail,Tamara Dobson Stella Stevens,action,The story begins with two government agents Ma...,"[-0.0925895, 0.01188509, -0.08999529, -0.01541..."
3,1975,Conduct Unbecoming,American,Michael Anderson,Stacy Keach Richard Attenborough Christopher P...,drama,Around 1880 two young British officers arrive ...,"[-0.07435084, -0.06386179, 0.017042944, 0.0288..."
4,1975,Cooley High,American,Michael Schultz,Lawrence Hilton-Jacobs Glynn Turman Garrett Mo...,comedy,Set in 1964 Chicago Preach an aspiring playwri...,"[-0.041632336, 0.037923656, -0.072276264, -0.0..."


## 3. Set up KDB.AI Vector Database
Now that we understand our dataset, we can set up our vector database.



In [58]:
# vector DB
import os
from getpass import getpass
import kdbai_client as kdbai
import time

To use KDB.AI Server, you will need to download and run your own container.
To do this, first sign up for free [here](https://trykdb.kx.com/kdbaiserver/signup/).

You will receive an email with the required license file and bearer token needed to download your instance.
Follow instructions in the signup email to get your session up and running.

Once the [setup steps](https://code.kx.com/kdbai/gettingStarted/kdb-ai-server-setup.html) are complete you can then connect to your KDB.AI Server session using `kdbai.Session` and passing your local endpoint.


In [None]:
#Set up KDB.AI server endpoint 
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else "http://localhost:8082"
)

#connect to KDB.AI Server, default mode is qipc
session = kdbai.Session(endpoint=KDBAI_ENDPOINT)
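
The endpoint lookup above is equivalent to a dictionary lookup with a default. A minimal sketch of that pattern follows; `resolve_endpoint` and `DEFAULT_ENDPOINT` are illustrative names and are not part of `kdbai_client`.

```python
import os

# Same local fallback as the cell above.
DEFAULT_ENDPOINT = "http://localhost:8082"

def resolve_endpoint(env=None):
    """Return KDBAI_ENDPOINT from the given mapping, or the local default."""
    env = os.environ if env is None else env
    return env.get("KDBAI_ENDPOINT", DEFAULT_ENDPOINT)

print(resolve_endpoint({}))                                        # http://localhost:8082
print(resolve_endpoint({"KDBAI_ENDPOINT": "http://my-host:8082"}))  # http://my-host:8082
```

Passing the environment as an argument makes the fallback easy to check without modifying `os.environ`.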


### Set up the table schema
Define one table column for each column in the dataframe, including the 'embeddings' column that holds the movie plot embeddings.

In [62]:
#Set up the schema and indexes for KDB.AI table, specifying embeddings column with 384 dimensions, Euclidean Distance, and flat index
table_schema = [
    {"name": "ReleaseYear", "type": "int64"},
    {"name": "Title", "type": "bytes"},
    {"name": "Origin", "type": "str"},
    {"name": "Director", "type": "bytes"},
    {"name": "Cast", "type": "bytes"},
    {"name": "Genre", "type": "str"},
    {"name": "Plot", "type": "bytes"},
    {"name": "embeddings", "type": "float64s"}
]

indexes = [
    {
        "name": "flat_index",
        "type": "flat",
        "column": "embeddings",
        "params": {"dims": 384, "metric": "L2"},
    }
]
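
Before creating the table, it can help to confirm that the index definition and the schema agree. The sketch below is a plain-Python sanity check, not a KDB.AI feature: `find_schema_problems` and `vector_fits_index` are hypothetical helper names, and the schema is a trimmed copy of the one above.

```python
# Trimmed copies of the schema and index definitions from the cell above.
table_schema = [
    {"name": "ReleaseYear", "type": "int64"},
    {"name": "Genre", "type": "str"},
    {"name": "embeddings", "type": "float64s"},
]
indexes = [
    {
        "name": "flat_index",
        "type": "flat",
        "column": "embeddings",
        "params": {"dims": 384, "metric": "L2"},
    }
]

def find_schema_problems(schema, indexes):
    """Return a list of problems; an empty list means none were found."""
    columns = {col["name"] for col in schema}
    problems = []
    for index in indexes:
        if index["column"] not in columns:
            problems.append(
                f"index {index['name']!r} targets unknown column {index['column']!r}"
            )
    return problems

def vector_fits_index(vector, index):
    """True when the vector length equals the index's declared dimensionality."""
    return len(vector) == index["params"]["dims"]

print(find_schema_problems(table_schema, indexes))  # []
print(vector_fits_index([0.0] * 384, indexes[0]))   # True
print(vector_fits_index([0.0] * 768, indexes[0]))   # False
```

The 384 in `dims` has to match the output size of the embedding model used later for queries; a mismatch is easier to catch here than at insert or search time.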

### Create a table called "metadata_demo"
First drop the table if it already exists, then create a new table with the schema and indexes defined above.

In [63]:
# get the database connection. Default database name is 'default'
database = session.database('default')
# First ensure the table does not already exist
try:
    database.table("metadata_demo").drop()
    time.sleep(5)
except kdbai.KDBAIException:
    pass

In [64]:
#Create the table called "metadata_demo"
table = database.create_table("metadata_demo", schema = table_schema, indexes = indexes)

## 4. Insert Movie Data into the KDB.AI table 

In [65]:
#Insert the data into the table, split into 2000 row batches
from tqdm import tqdm 
n = 2000  # chunk row size

# Cast values of length 1 are treated as empty; replace them with the string 'None' for the backend.
df.loc[df['Cast'].str.len() == 1, 'Cast'] = 'None'
    
for i in tqdm(range(0, df.shape[0], n)):
    data = df[i:i+n].reset_index(drop=True)
    # change data types as per table schema
    data['Title'] = data['Title'].str.encode('utf-8')
    data['Director'] = data['Director'].str.encode('utf-8')
    data['Cast'] = data['Cast'].str.encode('utf-8')
    data['Plot'] = data['Plot'].str.encode('utf-8')
    table.insert(data)

100%|██████████| 10/10 [00:01<00:00,  9.12it/s]
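
The progress bar reports 10 batches because 19,161 rows split into chunks of 2,000 leave a shorter final chunk. A small sketch of the same slicing arithmetic, using a hypothetical helper `batch_bounds` that is not part of the notebook or of `kdbai_client`:

```python
def batch_bounds(total_rows, batch_size):
    """Yield (start, stop) row offsets that cover total_rows in batch_size chunks."""
    for start in range(0, total_rows, batch_size):
        yield start, min(start + batch_size, total_rows)

bounds = list(batch_bounds(19161, 2000))
print(len(bounds))  # 10
print(bounds[0])    # (0, 2000)
print(bounds[-1])   # (18000, 19161)
```

The last batch holds the remaining 1,161 rows. The `df[i:i+n]` slice in the cell above handles this implicitly, since pandas slicing stops at the end of the dataframe.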


In [66]:
#function to view the dataframe within the table
def show_df(df: pd.DataFrame) -> pd.DataFrame:
    print(df.shape)
    return df.head()

In [67]:
#View contents of the table
show_df(table.query())

(19161, 8)


Unnamed: 0,ReleaseYear,Title,Origin,Director,Cast,Genre,Plot,embeddings
0,1975,b'The Candy Tangerine Man',American,b'Matt Cimber',b'John Daniels Eli Haines Tom Hankason',action,b'A successful Los Angeles-based businessperso...,"[-0.06835173815488815, -0.01313861645758152, -..."
1,1975,b'Capone',American,b'Steve Carver',b'Ben Gazzara Susan Blakely John Cassavetes Sy...,crime drama,b'The story is of the rise and fall of the Chi...,"[-0.014117980375885963, 0.0407051146030426, -0..."
2,1975,b'Cleopatra Jones and the Casino of Gold',American,b'Charles Bail',b'Tamara Dobson Stella Stevens',action,b'The story begins with two government agents ...,"[-0.09258949756622314, 0.011885089799761772, -..."
3,1975,b'Conduct Unbecoming',American,b'Michael Anderson',b'Stacy Keach Richard Attenborough Christopher...,drama,b'Around 1880 two young British officers arriv...,"[-0.07435084134340286, -0.06386178731918335, 0..."
4,1975,b'Cooley High',American,b'Michael Schultz',b'Lawrence Hilton-Jacobs Glynn Turman Garrett ...,comedy,b'Set in 1964 Chicago Preach an aspiring playw...,"[-0.041632335633039474, 0.0379236564040184, -0..."


## 5. Run Filtered Similarity Searches on our KDB.AI Vector Database

#### Set up embedding model to embed our natural language queries

In [None]:
# embedding model to be used to embed user input query
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

#### Create a query vector by using the embedding model to embed a natural language query

In [69]:
#Embed a query
query_vector = {'flat_index' : [embedding_model.encode('star wars Luke Skywalker').tolist()]}

#### Run vector similarity search, return the top-3 similar movies

In [70]:
#Search vector db to find most relevant movies
print(table.search(vectors=query_vector, n=3))

[   __nn_distance  ReleaseYear  \
0       0.748475         1983   
1       0.910225         1977   
2       0.942763         1980   

                                               Title    Origin  \
0                              b'Return of the Jedi'  American   
1  b'Star Wars Episode IV: A New Hope (aka Star W...  American   
2                         b'The Empire Strikes Back'  American   

              Director                                               Cast  \
0  b'Richard Marquand'  b'Mark Hamill Harrison Ford Carrie Fisher Bill...   
1      b'George Lucas'  b'Mark Hamill Harrison Ford Carrie Fisher Alec...   
2    b'Irvin Kershner'  b'Carrie Fisher Harrison Ford Mark Hamill Bill...   

             Genre                                               Plot  \
0  science fiction  b'Luke Skywalker initiates a plan to rescue Ha...   
1  science fiction  b'The galaxy is in the midst of a civil war. S...   
2  science fiction  b'Three years after the destruction of the Dea...   


#### Repeat the search with metadata filters to narrow the search space

In [71]:
print(table.search(vectors=query_vector, n=3, filter=[("like", "Director", "George Lucas"),("=", "ReleaseYear", 1977)]))

[   __nn_distance  ReleaseYear  \
0       0.910225         1977   

                                               Title    Origin  \
0  b'Star Wars Episode IV: A New Hope (aka Star W...  American   

          Director                                               Cast  \
0  b'George Lucas'  b'Mark Hamill Harrison Ford Carrie Fisher Alec...   

             Genre                                               Plot  \
0  science fiction  b'The galaxy is in the midst of a civil war. S...   

                                          embeddings  
0  [-0.10030582547187805, 0.008335104212164879, 0...  ]
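
Conceptually, a filtered search with a flat index only has to rank the rows that pass the metadata conditions. The sketch below illustrates that idea in plain Python on toy 2-D vectors. It is not how KDB.AI implements filtering internally, and the titles, vectors, and the `filtered_search` helper are all made up for illustration.

```python
import math

# Toy rows: hypothetical titles with 2-D "embeddings" instead of 384-D ones.
movies = [
    {"Title": "Film A", "Director": "George Lucas", "ReleaseYear": 1977, "embedding": [0.0, 1.0]},
    {"Title": "Film B", "Director": "George Lucas", "ReleaseYear": 1973, "embedding": [0.1, 0.9]},
    {"Title": "Film C", "Director": "Irvin Kershner", "ReleaseYear": 1980, "embedding": [0.3, 0.7]},
    {"Title": "Film D", "Director": "Ron Howard", "ReleaseYear": 2006, "embedding": [1.0, 0.0]},
]

def filtered_search(rows, query, n, predicate=lambda row: True):
    """Keep rows that satisfy predicate, then return the n closest by Euclidean distance."""
    candidates = [row for row in rows if predicate(row)]
    ranked = sorted(candidates, key=lambda row: math.dist(row["embedding"], query))
    return ranked[:n]

query = [0.1, 0.9]

unfiltered = filtered_search(movies, query, n=2)
print([row["Title"] for row in unfiltered])  # ['Film B', 'Film A']

filtered = filtered_search(
    movies,
    query,
    n=2,
    predicate=lambda row: row["Director"] == "George Lucas" and row["ReleaseYear"] == 1977,
)
print([row["Title"] for row in filtered])  # ['Film A']
```

As in the search above, asking for `n` results returns fewer rows when fewer than `n` rows satisfy the filter.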


#### More Examples

In [72]:
# Another query
query_vector = {'flat_index' : [embedding_model.encode('conspiracy theories involving art').tolist()]}

In [73]:
# Another filtered search example
print(table.search(vectors=query_vector, n=3, filter=[("like", "Genre", "*thriller*"),("like","Cast","*Tom Hanks*")]))

[   __nn_distance  ReleaseYear                  Title    Origin  \
0       1.276896         2006  b' The Da Vinci Code'  American   
1       1.395944         2017          b'The Circle'  American   
2       1.607655         2017            b'The Post'  American   

              Director                                               Cast  \
0        b'Ron Howard'  b'Tom Hanks Audrey Tautou Ian McKellen Alfred ...   
1    b'James Ponsoldt'  b'James Ponsoldt (director/screenplay); Tom Ha...   
2  b'Steven Spielberg'  b'Steven Spielberg (director); Liz Hannah Josh...   

                              Genre  \
0                          thriller   
1             sci-fi drama thriller   
2  biography drama history thriller   

                                                Plot  \
0  b'Jacques Sauni\xc2\xa8re the Louvres curator ...   
1  b'When her car breaks down Mae Holland contact...   
2  b'In 1966 Vietnam State Department military an...   

                                          e
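
The `*` wildcards in the `like` filters are what allow a match anywhere inside a multi-genre string or a long cast list. As a rough local approximation, Python's `fnmatch.fnmatchcase` treats `*` in a comparable way; this is an assumption made for illustration, and the exact pattern rules of KDB.AI's `like` operator should be checked in its documentation.

```python
from fnmatch import fnmatchcase

genres = [
    "thriller",
    "sci-fi drama thriller",
    "comedy",
    "biography drama history thriller",
]

# With wildcards, the pattern matches "thriller" anywhere in the string.
matches = [genre for genre in genres if fnmatchcase(genre, "*thriller*")]
print(matches)
# ['thriller', 'sci-fi drama thriller', 'biography drama history thriller']

# Without wildcards, only the exact string matches.
exact = [genre for genre in genres if fnmatchcase(genre, "thriller")]
print(exact)  # ['thriller']
```

The same reasoning applies to the earlier `("like", "Director", "George Lucas")` filter: without wildcards it behaves like an exact match on the full director string.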

In [74]:
# Another query
query_vector = {'flat_index' : [embedding_model.encode('middle earth fantasy adventure in the Shire').tolist()]}

In [75]:
# Another filtered search example
print(table.search(vectors=query_vector, n=3, filter=[("within","ReleaseYear",[2000,2010])]))

[   __nn_distance  ReleaseYear  \
0       1.014505         2001   
1       1.099138         2002   
2       1.153412         2003   

                                               Title    Origin  \
0  b'The Lord of the Rings: The Fellowship of the...  American   
1           b'The Lord of the Rings: The Two Towers'  American   
2   b'The Lord of the Rings: The Return of the King'  American   

           Director                                               Cast  \
0  b'Peter Jackson'  b'Elijah Wood Ian McKellen Liv Tyler Sean Asti...   
1  b'Peter Jackson'  b'Elijah Wood Ian McKellen Liv Tyler Viggo Mor...   
2  b'Peter Jackson'  b'Elijah Wood Ian McKellen Liv Tyler Sean Asti...   

               Genre                                               Plot  \
0            fantasy  b'In the Second Age of Middle-earth the lords ...   
1  adventure fantasy  b'After awakening from a dream of Gandalf the ...   
2  adventure fantasy  b'Many years ago two Hobbits Smeagol and Dagol...   
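
The `within` filter restricts results to a range of values. The sketch below assumes both bounds are inclusive, as with the `within` keyword in q; the `within` function here is a local stand-in written for illustration, not the KDB.AI operator itself.

```python
def within(value, bounds):
    """True when value lies in the closed interval [low, high]."""
    low, high = bounds
    return low <= value <= high

years = [1999, 2000, 2003, 2010, 2011]
print([year for year in years if within(year, [2000, 2010])])  # [2000, 2003, 2010]
```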

   

## Delete the KDB.AI Table
Once finished with the table, it is best practice to drop it.

In [76]:
table.drop()

#### Take Our Survey
We hope you found this sample helpful! Your feedback is important to us, and we would appreciate it if you could take a moment to fill out our brief survey. Your input helps us improve our content.

Take the [Survey](https://delighted.com/t/wtS7T4Lg)
