# Quickstart: Hello, KDB.AI

##### Note: This example requires a KDB.AI endpoint and API key. Sign up for a free [KDB.AI account](https://kdb.ai/get-started).

This quickstart shows how to get started with the KDB.AI vector database and gives you a quick taste of KDB.AI in about 10 minutes.

You will learn how to:

1. Connect to KDB.AI
1. Create a KDB.AI Table
1. Add Data to the KDB.AI Table
1. Query the Table
1. Perform Similarity Search
1. Delete the KDB.AI Table

## 0. Setup

### Install dependencies

How you run this sample depends on where you are running this notebook:

- ***Run Locally / Private Environment:*** The [Setup](https://github.com/KxSystems/kdbai-samples/blob/main/README.md#setup) steps in the repository's `README.md` will guide you on prerequisites and how to run this with Jupyter.
- ***Colab / Hosted Environment:*** Open this notebook in Colab and run through the cells.

#### Embedding Library

To generate embeddings, we will be using FastEmbed, a fast, lightweight alternative to Sentence Transformers.

It supports a variety of popular text models and is built for efficiency and accuracy. In this notebook, we will use FastEmbed to generate embeddings for company descriptions, which we will then store in a KDB.AI table and use for similarity search.

In [None]:
!pip install kdbai_client fastembed onnxruntime==1.19.2

### Import Packages

In [2]:
# vector DB
import os
from getpass import getpass
import kdbai_client as kdbai
from fastembed import TextEmbedding
import time

In [3]:
import numpy as np
import pandas as pd

## 1. Connect to KDB.AI

### Define KDB.AI Session

To use KDB.AI Server, you will need to download and run your own container.
To do this, first [sign up for free](https://trykdb.kx.com/kdbaiserver/signup/).

You will receive an email with the required license file and bearer token needed to download your instance.
Follow instructions in the signup email to get your session up and running.

Once the [setup steps](https://code.kx.com/kdbai/gettingStarted/kdb-ai-server-setup.html) are complete you can then connect to your KDB.AI Server session using `kdbai.Session` and passing your local endpoint.


In [None]:
#Set up KDB.AI server endpoint 
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else "http://localhost:8082"
)

#connect to KDB.AI Server, default mode is qipc
session = kdbai.Session(endpoint=KDBAI_ENDPOINT)
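
The endpoint lookup above can also be written with `dict.get`. The following is a minimal sketch of that pattern on its own, without connecting to anything; the helper name `resolve_endpoint` is hypothetical and only used for illustration.

```python
import os

DEFAULT_ENDPOINT = "http://localhost:8082"  # same fallback as the cell above


def resolve_endpoint(env=None):
    """Return KDBAI_ENDPOINT from the given mapping, or the local default."""
    env = os.environ if env is None else env
    return env.get("KDBAI_ENDPOINT", DEFAULT_ENDPOINT)


# With no variable set, the local default is used.
print(resolve_endpoint({}))
# With the variable set, its value wins.
print(resolve_endpoint({"KDBAI_ENDPOINT": "http://example-host:8082"}))
```

Passing a plain dictionary instead of `os.environ` makes the fallback easy to check without changing your real environment.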


<div class="alert alert-block alert-info">
<b>Need help understanding a function?</b><br/>
Add ? before or after any function name in KDB.AI to bring up the documentation for that function along with sample code and arguments.
</div>

In [None]:
?kdbai.Session

### Verify Defined Databases

We can check our connection using the `session.databases()` function.
This will return a list of all the databases we have defined in our vector database thus far.
This should return a "default" database along with any other databases you have already created.

In [8]:
session.databases()

[KDBAI database "default"]

### Verify Defined Tables

We can list the tables in a database using `session.database('default').tables`.
This returns all the tables we have defined in that database so far.
After dropping any existing `company_data` table, this should return an empty list.

In [9]:
# ensure no table called "company_data" exists
try:
    for t in session.database('default').tables:
        if t.name == 'company_data':
            t.drop()
    time.sleep(5)
except kdbai.KDBAIException:
    pass

In [10]:
session.database('default').tables

[]

## 2. Create a KDB.AI Table

To create a table we can use `create_table`. In this example we pass it three arguments: the table name, the schema, and the index definitions.

The schema must meet the following criteria:
- It must contain a list of columns.
- Each column must have a `name` and either a `type` or a `qtype` specified.
- The column of vector embeddings is an array of floats (`float64s` in this example). The configuration of its index for similarity search is passed separately through `indexes`.

Run `?session.database('default').create_table` for more details and sample code.
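
The criteria above can be checked locally before calling `create_table`. The sketch below is not part of the KDB.AI client; `validate_schema` is a hypothetical helper, and it assumes the stricter reading used in this notebook, where every column (including the vector column) carries a `type` or `qtype`.

```python
def validate_schema(schema):
    """Return a list of problems found in a schema; an empty list means none."""
    if not isinstance(schema, list) or not schema:
        return ["schema must be a non-empty list of column definitions"]
    problems = []
    for position, column in enumerate(schema):
        name = column.get("name")
        if not name:
            problems.append(f"column {position} has no name")
        if "type" not in column and "qtype" not in column:
            problems.append(f"column {position} has neither type nor qtype")
    return problems


# The schema used in this notebook.
schema = [
    {"name": "company_name", "type": "str"},
    {"name": "company_description", "type": "str"},
    {"name": "vectors", "type": "float64s"},
]

print(validate_schema(schema))                    # no problems expected
print(validate_schema([{"name": "untyped_col"}]))  # reports the missing type
```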

In [11]:
database = session.database('default')
?database.create_table

Signature:
database.create_table(
    table: 'str',
    schema: 'Optional[List[Dict[str, Any]]]' = None,
    indexes: 'List[Dict[str, Any]]' = [],
    partition_column: 'Optional[str]' = None,
    embedding_configurations: 'Optional[Dict[str, Any]]' = None,
    external_data_references: 'Optional[List[Dict[str, Any]]]' = None,
    default_result_type: 'str' = 'pd',

### Define Schema

Our table will have three columns. The first two, `company_name` and `company_description`, contain company names and descriptions; the third, `vectors`, holds the vector embeddings we will use for similarity search later on in this example.



In [12]:
schema = [
        {"name": "company_name", "type": "str"},
        {"name": "company_description", "type": "str"},
        {"name": "vectors", "type": "float64s"}
    ]

### Define the indexes
We define the dimensionality, similarity metric, and index type in the `indexes` list that is passed to `create_table`. For this example we chose:

- type = flat : Other index types are available, such as HNSW, qHNSW, IVFPQ, and qFlat. As with metrics, the one you choose depends on your data and your overall performance requirements.
- name = vectorIndex : This is a custom name you give your index.
- column = vectors : This is the column where the embeddings are stored.
- params:
  - dims = 384 : In the next section, we generate embeddings that are 384-dimensional to match this. The number of dimensions should mirror the output dimensions of your embedding model.
  - metric = CS : We chose Cosine Similarity. Other metrics are available, such as L2/Euclidean Distance and IP/Inner Product; the one you choose depends on the specific context and nature of your data.

Note: it is possible to define multiple indexes within a table.
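
To make the metric choice more concrete, here is a small sketch of the textbook formulas behind Cosine Similarity, Inner Product, and Euclidean (L2) distance, written in plain Python on toy two-dimensional vectors. It illustrates the maths only and is not a description of how KDB.AI computes them internally.

```python
import math


def inner_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Inner product of the two vectors after scaling each to unit length."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


def l2_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, twice the length
c = [0.0, 1.0]  # orthogonal to a

print("a vs b:", cosine_similarity(a, b), inner_product(a, b), l2_distance(a, b))
print("a vs c:", cosine_similarity(a, c), inner_product(a, c), l2_distance(a, c))
```

Cosine similarity ignores vector length, so `a` and `b` count as identical in direction, while inner product and L2 distance both change when a vector is scaled.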

In [13]:
# Define the index
indexes = [
            {
                "name": "vectorIndex", "type": "flat", 
                "params": {"dims": 384, "metric": "CS"},
                "column": "vectors"
            }
        ]

### Create Table

In [14]:
database = session.database('default')
table = database.create_table("company_data", schema, indexes=indexes)

## 3. Add Data to the KDB.AI Table

First, let's define a list of companies and their descriptions:

In [15]:
company_data = [
    ("Apple", "A technology company known for its iPhones, MacBooks, and innovative designs"),
    ("Google", "A search engine giant that also specializes in advertising, cloud computing, and AI"),
    ("Brave", "A privacy-focused search engine and browser."),
    ("Perplexity", "An answer engine that searches the internet and uses a large language model to summarize web data."),
    ("Amazon", "An e-commerce leader that offers a wide range of products and services, including AWS"),
    ("Microsoft", "A technology company known for its software products like Windows and Office"),
    ("Facebook", "A social media platform that connects people worldwide and owns Instagram and WhatsApp"),
    ("Tesla", "An electric vehicle manufacturer known for its innovative and sustainable energy solutions"),
    ("Rivian", "An electric vehicle company focusing on adventure-oriented trucks and SUVs"),
    ("Lucid Motors", "A company specializing in high-performance electric luxury vehicles"),
    ("Netflix", "A streaming service that offers a wide variety of TV shows, movies, and original content"),
    ("Hulu", "A streaming platform providing a wide range of TV shows, movies, and original content"),
    ("Disney+", "A streaming service offering movies, TV shows, and original content from Disney"),
    ("Uber", "A ride-sharing company that also offers food delivery and freight services"),
    ("Lyft", "A ride-sharing platform connecting passengers with drivers"),
    ("Didi", "A Chinese ride-sharing company offering various transportation services"),
    ("Airbnb", "A platform that allows people to rent out their homes or find lodging worldwide"),
    ("Vrbo", "A vacation rental online marketplace where homeowners list their properties for short-term rentals"),
    ("Booking.com", "An online travel agency offering lodging reservations and other travel products"),
    ("Spotify", "A music streaming service offering a wide range of songs, albums, and podcasts"),
    ("Apple Music", "A music and video streaming service developed by Apple Inc."),
    ("YouTube Music", "A music streaming service developed by YouTube"),
    ("Twitter", "A social media platform for sharing short messages and real-time updates"),
    ("Instagram", "A photo and video sharing social networking service"),
    ("Snapchat", "A multimedia messaging app known for its disappearing messages"),
    ("LinkedIn", "A professional networking platform for job seekers and employers"),
    ("Slack", "A collaboration platform for team communication and project management"),
    ("Microsoft Teams", "A collaboration platform for team communication and project management"),
    ("Zoom", "A video conferencing platform used for virtual meetings and webinars")
]

Now let's define an embedding model. Here we are using the default embedding model in FastEmbed, BAAI/bge-small-en-v1.5, which has 384 dimensions.

In [None]:
embedding_model = TextEmbedding()

Let's generate an embedding for each company. Because `embedding_model.embed` returns a generator, we turn it into a list.

In [17]:
# Embed every company description and check the embedding dimension
embeddings = list(embedding_model.embed([desc for _, desc in company_data]))
len(embeddings[0])

384
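
The reason for wrapping the call in `list(...)` is that a generator can only be consumed once. The sketch below shows this with `fake_embed`, a hypothetical stand-in that yields toy vectors; it is not FastEmbed and produces no real embeddings.

```python
def fake_embed(texts):
    """Hypothetical stand-in for an embed call: yields one toy vector per text."""
    for text in texts:
        yield [float(len(text)), 0.0, 0.0]


gen = fake_embed(["first description", "second"])

first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing is left to yield

print(len(first_pass), len(second_pass))  # 2 0
```

Materialising the result once and reusing the list avoids silently ending up with an empty collection on a second pass.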

##### Create a dataframe from our company data

In [20]:
names = [company for company, _ in company_data]
descriptions = [description for _, description in company_data]

# column names/types matching the schema
embeddings_df = pd.DataFrame({"company_name": names, "company_description": descriptions, "vectors": list(embeddings)})
embeddings_df.head()

| | company_name | company_description | vectors |
|---|---|---|---|
| 0 | Apple | A technology company known for its iPhones, Ma... | [-0.034876905, 0.032589626, -0.002934602, -0.0... |
| 1 | Google | A search engine giant that also specializes in... | [-0.017866805, -0.057211027, -0.028582964, 0.0... |
| 2 | Brave | A privacy-focused search engine and browser. | [-0.017717587, -0.020544883, -0.024149919, -0.... |
| 3 | Perplexity | An answer engine that searches the internet an... | [-0.043428417, 0.0026718834, -0.014964736, 0.0... |
| 4 | Amazon | An e-commerce leader that offers a wide range ... | [-0.04374583, -0.05412757, 0.03689078, -0.0363... |
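
Since the number of dimensions in each vector has to match the `dims` value of the index, it can be worth checking the vectors before inserting them. The following is a minimal sketch on toy data; `find_bad_rows` is a hypothetical helper, and `DIMS` is set to 4 only to keep the example short (this notebook uses 384).

```python
def find_bad_rows(vectors, dims):
    """Return the positions of vectors whose length differs from dims."""
    return [i for i, vector in enumerate(vectors) if len(vector) != dims]


DIMS = 4  # toy value; the index in this notebook uses 384

vectors = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7],       # one value short
    [0.9, 1.0, 1.1, 1.2],
]

print(find_bad_rows(vectors, DIMS))  # [1]
```

With the dataframe from this notebook, the same check could be run on `embeddings_df["vectors"]` with `dims=384`.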


We can now add data to our KDB.AI table using `insert`.

In [21]:
table.insert(embeddings_df)

{'rowsInserted': 29}

## 4. Query the Table

We can use `query` to query data from the table.

In [22]:
table.query()

| | company_name | company_description | vectors |
|---|---|---|---|
| 0 | Apple | A technology company known for its iPhones, Ma... | [-0.034876905381679535, 0.0325896255671978, -0... |
| 1 | Google | A search engine giant that also specializes in... | [-0.01786680519580841, -0.057211026549339294, ... |
| 2 | Brave | A privacy-focused search engine and browser. | [-0.01771758683025837, -0.020544882863759995, ... |
| 3 | Perplexity | An answer engine that searches the internet an... | [-0.043428417295217514, 0.0026718834415078163,... |
| 4 | Amazon | An e-commerce leader that offers a wide range ... | [-0.04374583065509796, -0.05412757024168968, 0... |
| 5 | Microsoft | A technology company known for its software pr... | [-0.03227386251091957, 0.018396837636828423, 0... |
| 6 | Facebook | A social media platform that connects people w... | [0.001751603209413588, -0.07014184445142746, 0... |
| 7 | Tesla | An electric vehicle manufacturer known for its... | [-0.0014115246012806892, 0.07673310488462448, ... |
| 8 | Rivian | An electric vehicle company focusing on advent... | [-0.004529878031462431, 0.05161484703421593, 0... |
| 9 | Lucid Motors | A company specializing in high-performance ele... | [0.003517858451232314, 0.07283230125904083, 0.... |


The `query` function accepts a wide range of arguments to make it easy to filter, aggregate, and sort.
Run `?table.query` to see them all.

Let's filter for companies starting with the letter 'A' using the 'like' operator. Four rows are returned as expected.

In [23]:
table.query(filter=[("like", "company_name", "A*")])

| | company_name | company_description | vectors |
|---|---|---|---|
| 0 | Apple | A technology company known for its iPhones, Ma... | [-0.034876905381679535, 0.0325896255671978, -0... |
| 1 | Amazon | An e-commerce leader that offers a wide range ... | [-0.04374583065509796, -0.05412757024168968, 0... |
| 2 | Airbnb | A platform that allows people to rent out thei... | [-0.0020053349435329437, -0.0806913897395134, ... |
| 3 | Apple Music | A music and video streaming service developed ... | [-0.03607852756977081, -0.07780104875564575, -... |
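
To see why four rows come back, the same `A*` pattern can be tried locally against the company names. The sketch below uses Python's standard `fnmatch` module as a client-side analogy for the wildcard; it is an assumption made for illustration and does not show how KDB.AI evaluates the `like` filter.

```python
from fnmatch import fnmatchcase

# Company names from the data defined earlier in this notebook.
company_names = [
    "Apple", "Google", "Brave", "Perplexity", "Amazon", "Microsoft",
    "Facebook", "Tesla", "Rivian", "Lucid Motors", "Netflix", "Hulu",
    "Disney+", "Uber", "Lyft", "Didi", "Airbnb", "Vrbo", "Booking.com",
    "Spotify", "Apple Music", "YouTube Music", "Twitter", "Instagram",
    "Snapchat", "LinkedIn", "Slack", "Microsoft Teams", "Zoom",
]

# "A*" means: a capital A followed by any characters.
matches = [name for name in company_names if fnmatchcase(name, "A*")]

print(matches)  # ['Apple', 'Amazon', 'Airbnb', 'Apple Music']
```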


## 5. Perform Similarity Search

Finally, let's perform similarity search on the table. We do this using the `search` function.

In [24]:
?table.search

Signature:
table.search(
    vectors: 'Dict[str, Any]',
    n: 'int' = 1,
    *,
    type: 'Optional[str]' = None,
    index_params: 'Optional[Dict[str, Any]]' = None,
    options: 'Optional[Dict[str, Any]]' = None,
    filter: 'Optional[List[List[Any]]]' = None,
    sort_columns: 'Optional[List[str]]' = None,

In [26]:
query = "A company that helps facilitate meetings"
query_vector = list(embedding_model.embed([query]))[0].tolist()
table.search(vectors={'vectorIndex': [query_vector]})[0]

| | __nn_distance | company_name | company_description | vectors |
|---|---|---|---|---|
| 0 | 0.730767 | Zoom | A video conferencing platform used for virtual... | [-0.047964468598365784, -0.001985185779631138,... |


<div class="alert alert-block alert-warning">
<b>Note:</b> The dimension of input query vectors must match the vector embedding dimensions in the table, set by <code>dims</code> in the index definition above.
</div>

<div class="alert alert-block alert-warning">
<b>Note:</b> The output was a list of length one, matching the number of vectors we input to the search. This can be indexed on position [0] to extract the dataframe corresponding to the single input vector.
</div>

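The closest matching neighbor for the query vector is returned along with its score in the `__nn_distance` column. Because the index was created with `metric` set to `CS`, this score is the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity), so higher values indicate closer matches.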

We can also rerun the same query for more neighbors.

In [27]:
# Find 3 closest neighbours of a single query vector
table.search(vectors={'vectorIndex': [query_vector]}, n=3)[0]

| | __nn_distance | company_name | company_description | vectors |
|---|---|---|---|---|
| 0 | 0.730767 | Zoom | A video conferencing platform used for virtual... | [-0.047964468598365784, -0.001985185779631138,... |
| 1 | 0.714641 | Booking.com | An online travel agency offering lodging reser... | [-0.011170933023095131, -0.01784929819405079, ... |
| 2 | 0.714121 | Microsoft Teams | A collaboration platform for team communicatio... | [-0.04067962244153023, 0.02870851941406727, -0... |
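
Conceptually, a flat index compares the query against every stored vector and keeps the best `n`. The sketch below is a brute-force illustration of that idea in plain Python with made-up three-dimensional vectors; `flat_search` is a hypothetical helper and not the KDB.AI implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def flat_search(rows, query, n=1):
    """Score every row against the query and return the n best (name, score) pairs."""
    scored = [(name, cosine_similarity(vector, query)) for name, vector in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return scored[:n]


# Toy data: labels and vectors are invented for this example.
rows = [
    ("video calls", [0.9, 0.1, 0.0]),
    ("team chat", [0.6, 0.4, 0.0]),
    ("electric cars", [0.0, 0.1, 0.9]),
]
query = [1.0, 0.0, 0.0]

top = flat_search(rows, query, n=2)
print([name for name, _ in top])  # ['video calls', 'team chat']
```

Raising `n` only changes how many of the already-ranked rows are returned, which matches the behaviour seen when the search above is rerun with `n=3`.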


We can also apply a filter to the search. Here we use the '<>' filter, which keeps rows whose value is not equal to the one given.

In [28]:
# Find 3 closest neighbours of a single query vector, excluding Booking.com
table.search(
    vectors={'vectorIndex': [query_vector]},
    n=3,
    filter=[("<>", "company_name", "Booking.com")],
)[0]

| | __nn_distance | company_name | company_description | vectors |
|---|---|---|---|---|
| 0 | 0.730767 | Zoom | A video conferencing platform used for virtual... | [-0.047964468598365784, -0.001985185779631138,... |
| 1 | 0.714121 | Microsoft Teams | A collaboration platform for team communicatio... | [-0.04067962244153023, 0.02870851941406727, -0... |
| 2 | 0.714121 | Slack | A collaboration platform for team communicatio... | [-0.04067962244153023, 0.02870851941406727, -0... |


We can also search with more than one query vector at a time.

In [29]:
query1 = "A company with a music-related product"
query2 = "A social media company"

query1_vector = list(embedding_model.embed([query1]))[0].tolist()
query2_vector = list(embedding_model.embed([query2]))[0].tolist()

table.search(
    vectors={'vectorIndex': [
        query1_vector,
        query2_vector,
    ]},
    n=3,
    aggs={'Company Name': 'company_name'}
)

[    Company Name
 0        Spotify
 1    Apple Music
 2  YouTube Music,
   Company Name
 0     Facebook
 1      Twitter
 2    Instagram]
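
A multi-vector search returns one result per query vector. Assuming the results come back in the order the vectors were passed, as the output above suggests, they can be paired with the query text using `zip`. The sketch below uses plain lists copied from that output rather than a live search.

```python
queries = [
    "A company with a music-related product",
    "A social media company",
]

# Company names copied from the search output above, one list per query.
results = [
    ["Spotify", "Apple Music", "YouTube Music"],
    ["Facebook", "Twitter", "Instagram"],
]

# Pair each query with its own result list by position.
by_query = dict(zip(queries, results))

for query_text, names in by_query.items():
    print(f"{query_text}: {', '.join(names)}")
```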

## 6. Delete the KDB.AI Table

We can use `table.drop()` to delete a table.

In [30]:
for t in session.database('default').tables:
    if t.name == 'company_data':
        t.drop()

<div class="alert alert-block alert-warning">
<b>Warning:</b> Once you drop a table, you cannot use it again.
</div>

## Next Steps

Now that you’re successfully making indexes with KDB.AI, you can start inserting your own data or view more examples:
- [PDF Document Search](../document_search)
- [MRI Image Search](../image_search)
- [Music Recommendation System](../music_recommendation)
- [Sensor Pattern Matching](../pattern_matching)
- [Retrieval Augmented Generation with LangChain](../retrieval_augmented_generation)
- [Sentiment Analysis of Reviews](../sentiment_analysis)