# Quickstart: Hello, KDB.AI

##### Note: This example requires a KDB.AI endpoint and API key. Sign up for a free [KDB.AI account](https://kdb.ai/get-started).

This quickstart shows how to get started with the KDB.AI vector database, giving you a quick taste of KDB.AI in about 10 minutes.

You will learn how to:

1. Connect to KDB.AI
1. Create a KDB.AI Table
1. Add Data to the KDB.AI Table
1. Query the Table
1. Perform Similarity Search
1. Delete the KDB.AI Table

## 0. Setup

### Install dependencies

To run this sample successfully, follow the steps that match where you are running this notebook:

- ***Run Locally / Private Environment:*** The [Setup](https://github.com/KxSystems/kdbai-samples/blob/main/README.md#setup) steps in the repository's `README.md` guide you through the prerequisites and how to run this with Jupyter.
- ***Colab / Hosted Environment:*** Open this notebook in Colab and run through the cells.

#### Embedding Library

To generate embeddings, we will be using FastEmbed, a fast, lightweight alternative to Sentence Transformers.

It supports a variety of popular text models and is built for efficiency and accuracy. In this notebook, we will use FastEmbed to generate embeddings for company descriptions, which we will then store in a KDB.AI table and use for similarity search.

In [None]:
!pip install kdbai_client fastembed

Collecting kdbai_client
  Downloading kdbai_client-1.2.1-py3-none-any.whl (28 kB)
Collecting fastembed
  Downloading fastembed-0.3.4-py3-none-any.whl (55 kB)
Collecting pykx<3.0.0,>=2.1.1 (from kdbai_client)
  Downloading pykx-2.5.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.7 MB)
Collecting PyStemmer<3.0.0,>=2.2.0 (from fastembed)
  Downloading PyStemmer-2.2.0.1.tar.gz (303 kB)
  Preparing metadata (setup.py) ... done
Collecting loguru<0.8.0,>=0.7.2 (from fastembed)
  Downloading loguru-0.7.2-py3-none-any.whl (62 kB)

### Import Packages

In [None]:
# vector DB
import os
from getpass import getpass
import kdbai_client as kdbai
from fastembed import TextEmbedding
import time

In [None]:
import numpy as np
import pandas as pd

## 1. Connect to KDB.AI

### Define KDB.AI Session

KDB.AI comes in two offerings:

1. [KDB.AI Cloud](https://trykdb.kx.com/kdbai/signup/) - For experimenting with smaller generative AI projects with a vector database in our cloud.
2. [KDB.AI Server](https://trykdb.kx.com/kdbaiserver/signup/) - For evaluating large scale generative AI applications on-premises or on your own cloud provider.

Depending on which you use there will be different setup steps and connection details required.

##### Option 1. KDB.AI Cloud

To use KDB.AI Cloud, you will need two session details - a URL endpoint and an API key.
To get these you can sign up for free [here](https://trykdb.kx.com/kdbai/signup).

You can connect to a KDB.AI Cloud session using `kdbai.Session` and passing the session URL endpoint and API key details from your KDB.AI Cloud portal.

If the environment variables `KDBAI_ENDPOINT` and `KDBAI_API_KEY` exist on your system and contain your KDB.AI Cloud portal details, they will be used automatically to connect.
If these do not exist, it will prompt you to enter your KDB.AI Cloud portal session URL endpoint and API key details.

In [None]:
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)

KDB.AI endpoint: https://cloud.kdb.ai/instance/4wqv2o7ppm
KDB.AI API key: ··········


In [None]:
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

##### Option 2. KDB.AI Server

To use KDB.AI Server, you will need to download and run your own container.
To do this, you will first need to sign up for free [here](https://trykdb.kx.com/kdbaiserver/signup/).

You will receive an email with the required license file and bearer token needed to download your instance.
Follow instructions in the signup email to get your session up and running.

Once the [setup steps](https://code.kx.com/kdbai/gettingStarted/kdb-ai-server-setup.html) are complete you can then connect to your KDB.AI Server session using `kdbai.Session` and passing your local endpoint.

In [None]:
# session = kdbai.Session(endpoint="http://localhost:8082")

<div class="alert alert-block alert-info">
<b>Need help understanding a function?</b><br/>
Add ? before or after any function name in KDB.AI to bring up the documentation for that function along with sample code and arguments.
</div>

In [None]:
?kdbai.Session

### Verify Defined Tables

We can check our connection using the `session.list()` function.
This will return a list of all the tables we have defined in our vector database thus far.
This should return an empty list.

In [None]:
# ensure no table called "company_data" exists
try:
    session.table("company_data").drop()
    time.sleep(5)
except kdbai.KDBAIException:
    pass

In [None]:
session.list()

[]

## 2. Create a KDB.AI Table

To create a table we can use `create_table`. This function takes two arguments: the name and the schema of the table.

This schema must meet the following criteria:
- It must contain a list of columns.
- All columns must have either a `pytype` or a `qtype` specified, except the column of vectors.
- One column of vector embeddings may also have a `vectorIndex` attribute with the configuration of the index for similarity search - this column is implicitly an array of `float32`.
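The criteria above can be checked before calling `create_table`. The following is a minimal sketch only: `validate_schema` is a hypothetical helper written for this walkthrough, not part of `kdbai_client`, and it reads the third rule as "at most one column carries a `vectorIndex`", which is an assumption.

```python
def validate_schema(candidate):
    """Return a list of problems found in a schema dict (empty if none).

    Hypothetical helper: it encodes only the three criteria listed above
    and is not part of kdbai_client.
    """
    columns = candidate.get("columns")
    if not isinstance(columns, list):
        return ["schema must contain a list under 'columns'"]

    problems = []
    vector_columns = 0
    for col in columns:
        name = col.get("name", "<unnamed>")
        if "vectorIndex" in col:
            # The vector column is exempt from the pytype/qtype rule.
            vector_columns += 1
        elif "pytype" not in col and "qtype" not in col:
            problems.append(f"column '{name}' needs a pytype or a qtype")
    if vector_columns > 1:
        # Assumption: the third rule means at most one indexed vector column.
        problems.append("only one column may carry a vectorIndex")
    return problems


good_schema = {
    "columns": [
        {"name": "company_name", "pytype": "str"},
        {"name": "embeddings", "vectorIndex": {"dims": 384, "metric": "CS", "type": "flat"}},
    ]
}
bad_schema = {"columns": [{"name": "company_name"}]}

print(validate_schema(good_schema))  # []
print(validate_schema(bad_schema))   # ["column 'company_name' needs a pytype or a qtype"]
```

A check like this catches a missing type locally, before the request reaches the database.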

Run `?session.create_table` for more details and sample code.

In [None]:
?session.create_table

### Define Schema

Our table will have three columns: `company_name` and `company_description`, which hold plain strings, and `embeddings`, which holds the vector embeddings we will use for similarity search later in this example.

We will define our dimensionality, similarity metric and index type with the `vectorIndex` attribute. For this example we chose:
- `dims = 384` : In the next section, we generate embeddings using FastEmbed that are 384-dimensional to match this. You can choose any value here, as long as it matches the dimensionality of the embeddings you insert.
- `metric = CS` : We chose [CS/Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity). Our data is high dimensional, which Cosine Similarity is suitable for. You can use other metrics here, such as [IP/Inner Product](https://en.wikipedia.org/wiki/Inner_product_space) and [L2/Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance). The one you choose depends on the specific context and nature of your data.
- `type = flat` : We use a [Flat index](https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexFlat.html) here because our data is simple, so this is more than adequate. You can use other indexes here, such as [HNSW](https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexHNSW.html) and [IVFPQ](https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexIVFPQ.html). As with metrics, the one you choose depends on your data and your overall performance requirements.
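To make the three metrics concrete, here is a small worked example in plain Python. It is an illustration of the textbook definitions on toy 2-dimensional vectors, not KDB.AI's implementation; the function names are chosen for this sketch.

```python
import math


def inner_product(a, b):
    """IP: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """L2: straight-line (Euclidean) distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """CS: inner product of the vectors after normalising their lengths."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


vec_a = [1.0, 0.0]
vec_b = [2.0, 0.0]  # same direction as vec_a, twice the length
vec_c = [0.0, 1.0]  # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))  # 1.0
print(inner_product(vec_a, vec_b))      # 2.0
print(l2_distance(vec_a, vec_b))        # 1.0
print(cosine_similarity(vec_a, vec_c))  # 0.0
```

Cosine similarity ignores vector length and compares direction only, which is why `vec_a` and `vec_b` score 1.0 even though their L2 distance is not zero.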


In [None]:
schema = {
    "columns": [
        {"name": "company_name", "pytype": "str"},
        {"name": "company_description", "pytype": "str"},
        {"name": "embeddings", "vectorIndex": {"dims": 384, "metric": "CS", "type": "flat"}},
    ]
}

### Create Table

In [None]:
table = session.create_table("company_data", schema)

## 3. Add Data to the KDB.AI Table

First, let's define a list of companies and their descriptions:

In [None]:
company_data = [
    ("Apple", "A technology company known for its iPhones, MacBooks, and innovative designs"),
    ("Google", "A search engine giant that also specializes in advertising, cloud computing, and AI"),
    ("Brave", "A privacy-focused search engine and browser."),
    ("Perplexity", "An answer engine that searches the internet and uses a large language model to summarize web data."),
    ("Amazon", "An e-commerce leader that offers a wide range of products and services, including AWS"),
    ("Microsoft", "A technology company known for its software products like Windows and Office"),
    ("Facebook", "A social media platform that connects people worldwide and owns Instagram and WhatsApp"),
    ("Tesla", "An electric vehicle manufacturer known for its innovative and sustainable energy solutions"),
    ("Rivian", "An electric vehicle company focusing on adventure-oriented trucks and SUVs"),
    ("Lucid Motors", "A company specializing in high-performance electric luxury vehicles"),
    ("Netflix", "A streaming service that offers a wide variety of TV shows, movies, and original content"),
    ("Hulu", "A streaming platform providing a wide range of TV shows, movies, and original content"),
    ("Disney+", "A streaming service offering movies, TV shows, and original content from Disney"),
    ("Uber", "A ride-sharing company that also offers food delivery and freight services"),
    ("Lyft", "A ride-sharing platform connecting passengers with drivers"),
    ("Didi", "A Chinese ride-sharing company offering various transportation services"),
    ("Airbnb", "A platform that allows people to rent out their homes or find lodging worldwide"),
    ("Vrbo", "A vacation rental online marketplace where homeowners list their properties for short-term rentals"),
    ("Booking.com", "An online travel agency offering lodging reservations and other travel products"),
    ("Spotify", "A music streaming service offering a wide range of songs, albums, and podcasts"),
    ("Apple Music", "A music and video streaming service developed by Apple Inc."),
    ("YouTube Music", "A music streaming service developed by YouTube"),
    ("Twitter", "A social media platform for sharing short messages and real-time updates"),
    ("Instagram", "A photo and video sharing social networking service"),
    ("Snapchat", "A multimedia messaging app known for its disappearing messages"),
    ("LinkedIn", "A professional networking platform for job seekers and employers"),
    ("Slack", "A collaboration platform for team communication and project management"),
    ("Microsoft Teams", "A collaboration platform for team communication and project management"),
    ("Zoom", "A video conferencing platform used for virtual meetings and webinars")
]

Now let's define an embedding model. Here we are using the default embedding model in FastEmbed, BAAI/bge-small-en-v1.5, which has 384 dimensions.

In [None]:
embedding_model = TextEmbedding()

Let's generate an embedding for each company. Because `embedding_model.embed` returns a generator, we turn it into a list.

In [None]:
# Embed every company description; embed() returns a generator, so wrap it in list()
embeddings = list(embedding_model.embed([desc for _, desc in company_data]))
len(embeddings[0])

384
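The reason for wrapping the call in `list` is that a generator can be iterated only once. A minimal sketch with a stand-in generator shows the effect; `fake_embed` is hypothetical and yields dummy vectors, it is not FastEmbed.

```python
def fake_embed(texts):
    """Hypothetical stand-in for an embedding call: yields one dummy vector per text."""
    for text in texts:
        yield [float(len(text)), 0.0]


gen = fake_embed(["abc", "de"])
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing left to yield

print(first_pass)   # [[3.0, 0.0], [2.0, 0.0]]
print(second_pass)  # []
```

Materialising the result once means the vectors can be reused, for example to check their length and then to build a dataframe.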

##### Create a dataframe from our company data

In [None]:
names = [company for company, _ in company_data]
descriptions = [description for _, description in company_data]

# column names/types matching the schema
embeddings_df = pd.DataFrame({"company_name": names, "company_description": descriptions, "embeddings": list(embeddings)})
embeddings_df.head()

| | company_name | company_description | embeddings |
|---|---|---|---|
| 0 | Apple | A technology company known for its iPhones, Ma... | [-0.034876905, 0.032589626, -0.002934602, -0.0... |
| 1 | Google | A search engine giant that also specializes in... | [-0.017866805, -0.057211027, -0.028582964, 0.0... |
| 2 | Brave | A privacy-focused search engine and browser. | [-0.017717587, -0.020544883, -0.024149919, -0.... |
| 3 | Perplexity | An answer engine that searches the internet an... | [-0.043428417, 0.0026718834, -0.014964736, 0.0... |
| 4 | Amazon | An e-commerce leader that offers a wide range ... | [-0.04374583, -0.05412757, 0.03689078, -0.0363... |
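Before inserting, it can help to confirm that the data lines up with the schema. The sketch below works on a plain dict of columns rather than a dataframe so that it has no dependencies; `check_columns` and the 3-dimensional `toy_schema` are hypothetical and exist only for this illustration.

```python
def check_columns(schema_dict, data):
    """Compare a dict of column -> list of values against a schema dict.

    Hypothetical pre-insert check, not part of kdbai_client.
    """
    expected = {col["name"] for col in schema_dict["columns"]}
    problems = []
    if set(data) != expected:
        problems.append(f"expected columns {sorted(expected)}, got {sorted(data)}")
    for col in schema_dict["columns"]:
        index = col.get("vectorIndex")
        if index is None or col["name"] not in data:
            continue
        # Every vector must have exactly the dimensionality declared in the schema.
        for row, vec in enumerate(data[col["name"]]):
            if len(vec) != index["dims"]:
                problems.append(
                    f"row {row}: vector has {len(vec)} dims, expected {index['dims']}"
                )
    return problems


toy_schema = {
    "columns": [
        {"name": "company_name", "pytype": "str"},
        {"name": "embeddings", "vectorIndex": {"dims": 3, "metric": "CS", "type": "flat"}},
    ]
}
ok_data = {"company_name": ["Apple"], "embeddings": [[0.1, 0.2, 0.3]]}
short_data = {"company_name": ["Apple"], "embeddings": [[0.1, 0.2]]}

print(check_columns(toy_schema, ok_data))     # []
print(check_columns(toy_schema, short_data))  # ['row 0: vector has 2 dims, expected 3']
```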


We can now add data to our KDB.AI table using `insert`.

In [None]:
table.insert(embeddings_df)

'Insert successful'

## 4. Query the Table

We can use `query` to query data from the table.

In [None]:
table.query()

| | company_name | company_description | embeddings |
|---|---|---|---|
| 0 | Apple | A technology company known for its iPhones, Ma... | [-0.034876905, 0.032589626, -0.002934602, -0.0... |
| 1 | Google | A search engine giant that also specializes in... | [-0.017866805, -0.057211027, -0.028582964, 0.0... |
| 2 | Brave | A privacy-focused search engine and browser. | [-0.017717587, -0.020544883, -0.024149919, -0.... |
| 3 | Perplexity | An answer engine that searches the internet an... | [-0.043428417, 0.0026718834, -0.014964736, 0.0... |
| 4 | Amazon | An e-commerce leader that offers a wide range ... | [-0.04374583, -0.05412757, 0.03689078, -0.0363... |
| 5 | Microsoft | A technology company known for its software pr... | [-0.032273863, 0.018396838, 0.039459627, -0.05... |
| 6 | Facebook | A social media platform that connects people w... | [0.0017516032, -0.070141844, 0.01608348, -0.04... |
| 7 | Tesla | An electric vehicle manufacturer known for its... | [-0.0014115246, 0.076733105, 0.04761862, 0.016... |
| 8 | Rivian | An electric vehicle company focusing on advent... | [-0.004529878, 0.051614847, 0.054322973, -0.01... |
| 9 | Lucid Motors | A company specializing in high-performance ele... | [0.0035178585, 0.0728323, 0.01885113, 0.002544... |


The `query` function accepts a wide range of arguments to make it easy to filter, aggregate, and sort.
Run `?table.query` to see them all.

Let's filter for companies starting with the letter 'A' using the 'like' operator. Four rows are returned as expected.

In [None]:
table.query(filter=[("like", "company_name", "A*")])

| | company_name | company_description | embeddings |
|---|---|---|---|
| 0 | Apple | A technology company known for its iPhones, Ma... | [-0.034876905, 0.032589626, -0.002934602, -0.0... |
| 1 | Amazon | An e-commerce leader that offers a wide range ... | [-0.04374583, -0.05412757, 0.03689078, -0.0363... |
| 2 | Airbnb | A platform that allows people to rent out thei... | [-0.002005335, -0.08069139, 0.0041435347, 0.00... |
| 3 | Apple Music | A music and video streaming service developed ... | [-0.036078528, -0.07780105, -0.008587321, -0.0... |
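To see why exactly these four rows match, the `A*` pattern can be reproduced locally with Python's standard `fnmatch` module. This is only an analogy for the trailing `*` wildcard used here; it is an assumption that the two agree for this simple pattern, and `fnmatch` should not be read as a description of how KDB.AI's `like` operator handles other pattern features.

```python
from fnmatch import fnmatchcase

# A handful of names from the company list, in their original order
sample_names = ["Apple", "Google", "Amazon", "Airbnb", "Apple Music", "Zoom"]

# "A*" means: starts with a capital A, followed by anything (including spaces)
matches = [name for name in sample_names if fnmatchcase(name, "A*")]
print(matches)  # ['Apple', 'Amazon', 'Airbnb', 'Apple Music']
```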


## 5. Perform Similarity Search

Finally, let's perform similarity search on the table. We do this using the `search` function.

In [None]:
?table.search

In [None]:
query = "A company that helps facilitate meetings"
query_vector = list(embedding_model.embed([query]))[0].tolist()
table.search(vectors=[query_vector])[0]

| | __nn_distance | company_name | company_description | embeddings |
|---|---|---|---|---|
| 0 | 0.730767 | Zoom | A video conferencing platform used for virtual... | [-0.04796447, -0.0019851858, -0.029626451, -0.... |
| 1 | 0.714641 | Booking.com | An online travel agency offering lodging reser... | [-0.011170933, -0.017849298, 0.025415434, 0.01... |
| 2 | 0.714121 | Microsoft Teams | A collaboration platform for team communicatio... | [-0.040679622, 0.02870852, -0.011716248, -0.06... |
| 3 | 0.714121 | Slack | A collaboration platform for team communicatio... | [-0.040679622, 0.02870852, -0.011716248, -0.06... |
| 4 | 0.704085 | Microsoft | A technology company known for its software pr... | [-0.032273863, 0.018396838, 0.039459627, -0.05... |


<div class="alert alert-block alert-warning">
<b>Note:</b> The dimension of input query vectors must match the vector embedding dimensions in the table, defined in schema above.
</div>

<div class="alert alert-block alert-warning">
<b>Note:</b> The output was a list of length one, matching the number of vectors we input to the search. This can be indexed on position [0] to extract the dataframe corresponding to the single input vector.
</div>

The closest matching neighbors for the query vector are returned, and the `__nn_distance` column holds the [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) score, the `CS` metric defined in the schema. Higher values indicate a closer match.
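As a rough illustration of what a brute-force (flat) search does, and of why the result is a list with one entry per query vector, here is a minimal sketch in plain Python. It uses cosine similarity to match the `CS` metric in the schema. `flat_search` and the toy 2-dimensional rows are hypothetical; this is not how `table.search` is implemented.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def flat_search(rows, vectors, n=1):
    """Brute-force search: score every row, return one ranked list per query vector."""
    results = []
    for query_vec in vectors:
        scored = [(cosine_similarity(query_vec, vec), name) for name, vec in rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
        results.append(scored[:n])
    return results


toy_rows = [
    ("north", [0.0, 1.0]),
    ("east", [1.0, 0.0]),
    ("north-east", [1.0, 1.0]),
]

hits = flat_search(toy_rows, vectors=[[0.0, 2.0]], n=2)
print(len(hits))                      # 1
print([name for _, name in hits[0]])  # ['north', 'north-east']
```

Passing a single query vector still yields a list of length one, which is why the notebook indexes the search result with `[0]`.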

We can also rerun the same search and set the number of neighbors to return with `n`.

In [None]:
# Find 3 closest neighbours of a single query vector
table.search(vectors=[query_vector], n=3)[0]

| | __nn_distance | company_name | company_description | embeddings |
|---|---|---|---|---|
| 0 | 0.730767 | Zoom | A video conferencing platform used for virtual... | [-0.04796447, -0.0019851858, -0.029626451, -0.... |
| 1 | 0.714641 | Booking.com | An online travel agency offering lodging reser... | [-0.011170933, -0.017849298, 0.025415434, 0.01... |
| 2 | 0.714121 | Microsoft Teams | A collaboration platform for team communicatio... | [-0.040679622, 0.02870852, -0.011716248, -0.06... |


And we can apply a filter to the search results. Here we use the '<>' filter, which keeps data that is not equal to a value.

In [None]:
# Find the 3 closest neighbours of a single query vector, excluding Booking.com
table.search(
    vectors=[query_vector],
    n=3,
    filter=[("<>", "company_name", "Booking.com")],
)[0]

| | __nn_distance | company_name | company_description | embeddings |
|---|---|---|---|---|
| 0 | 0.730767 | Zoom | A video conferencing platform used for virtual... | [-0.04796447, -0.0019851858, -0.029626451, -0.... |
| 1 | 0.714121 | Microsoft Teams | A collaboration platform for team communicatio... | [-0.040679622, 0.02870852, -0.011716248, -0.06... |
| 2 | 0.714121 | Slack | A collaboration platform for team communicatio... | [-0.040679622, 0.02870852, -0.011716248, -0.06... |
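The effect of the `<>` filter on this result can be reproduced by hand from the scores shown in the first, unfiltered search output in this section. The sketch below only explains why Slack moves into the top three; it makes no claim about whether KDB.AI applies filters before or after ranking internally.

```python
# (similarity, company_name) pairs copied from the unfiltered search output,
# already ordered from most to least similar
unfiltered = [
    (0.730767, "Zoom"),
    (0.714641, "Booking.com"),
    (0.714121, "Microsoft Teams"),
    (0.714121, "Slack"),
    (0.704085, "Microsoft"),
]

# Keep every row whose name is not equal to "Booking.com", then take the top 3
kept = [(score, name) for score, name in unfiltered if name != "Booking.com"]
top3 = [name for _, name in kept[:3]]
print(top3)  # ['Zoom', 'Microsoft Teams', 'Slack']
```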


We can also search with more than one query vector at a time.

In [None]:
query1 = "A company with a music-related product"
query2 = "A social media company"

query1_vector = list(embedding_model.embed([query1]))[0].tolist()
query2_vector = list(embedding_model.embed([query2]))[0].tolist()

# Find the 3 closest neighbours of 2 query vectors
table.search(
    vectors=[
        query1_vector,
        query2_vector,
    ],
    n=3,
    aggs=["company_name"] # we can use an aggregation to return a subset of the columns
)

[    company_name
 0        Spotify
 1    Apple Music
 2  YouTube Music,
   company_name
 0     Facebook
 1      Twitter
 2    Instagram]

## 6. Delete the KDB.AI Table

We can use `table.drop()` to delete a table.

In [None]:
table.drop()

True

<div class="alert alert-block alert-warning">
<b>Warning:</b> Once you drop a table, you cannot use it again.
</div>

## Next Steps

Now that you’re successfully making indexes with KDB.AI, you can start inserting your own data or view more examples:
- [PDF Document Search](../document_search)
- [MRI Image Search](../image_search)
- [Music Recommendation System](../music_recommendation)
- [Sensor Pattern Matching](../pattern_matching)
- [Retrieval Augmented Generation with LangChain](../retrieval_augmented_generation)
- [Sentiment Analysis of Reviews](../sentiment_analysis)