## Hybrid Search

##### Note: This example requires a KDB.AI endpoint and API key. Sign up for a free [KDB.AI account](https://kdb.ai/get-started).

Hybrid search in KDB.AI is a similarity search method that increases the relevance of results retrieved from the vector database. It combines two search methods: sparse vector search and dense vector search.

Sparse vector search uses the BM25 algorithm to find the most relevant keyword matches, while dense vector search finds the most semantically relevant matches.

In KDB.AI, users can run sparse or dense search independently, or run a hybrid search, which runs both the sparse and dense vector searches and then re-ranks the combined results based on a user-defined "weight" value.
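To make the idea of weighted re-ranking concrete, here is a minimal sketch of one common fusion approach: rescale each result set's scores to [0, 1], then take a weighted sum. This is an illustrative assumption, not a description of how KDB.AI combines results internally. The function names and the scores are hypothetical, and every score is assumed to be "higher is better".

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_fusion(sparse_scores, dense_scores, sparse_weight=0.5, dense_weight=0.5):
    """Rank doc ids by a weighted sum of normalised sparse and dense scores."""
    sparse_n = min_max(sparse_scores)
    dense_n = min_max(dense_scores)
    fused = {}
    for doc in set(sparse_n) | set(dense_n):
        # A document missing from one result set contributes 0 for that search
        fused[doc] = (sparse_weight * sparse_n.get(doc, 0.0)
                      + dense_weight * dense_n.get(doc, 0.0))
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical scores for three chunks (higher means more relevant)
sparse_scores = {"a": 4.0, "b": 1.0, "c": 0.0}
dense_scores = {"a": 0.2, "b": 0.9, "c": 0.8}

print(weighted_fusion(sparse_scores, dense_scores, 0.9, 0.1))  # ['a', 'b', 'c']
print(weighted_fusion(sparse_scores, dense_scores, 0.1, 0.9))  # ['b', 'c', 'a']
```

Shifting the weight toward one search pulls the final ranking toward that search's ordering. The weighted searches in section 8 show the same effect.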

In this sample we will use hybrid search over a Federal Reserve speech to extract the chunks of the speech that are most similar to a user's prompt. We will split the document into smaller chunks, create sparse and dense vectors for each chunk, store the vectors in the KDB.AI vector database, and then run dense, sparse, and hybrid searches to retrieve the chunks most relevant to a user's query.

Agenda:
1. Dependencies, Imports & Setup
2. Ingest & Chunk Data
3. Generate Sparse & Dense Vectors for Each Chunk
4. Define KDB.AI Session and Create Database
5. Create KDB.AI Schema & Table
6. Insert Data into KDB.AI Table
7. Create Sparse and Dense Query Vectors
8. Run Sparse, Dense, and Hybrid Searches

[Inflation: Progress and the Path Ahead](https://www.federalreserve.gov/newsevents/speech/powell20230825a.htm)

## 1. Dependencies, Imports & Setup

In order to successfully run this sample, note the following steps depending on where you are running this notebook:

- ***Run Locally / Private Environment:*** The [Setup](https://github.com/KxSystems/kdbai-samples/blob/main/README.md#setup) steps in the repository's `README.md` will guide you on prerequisites and how to run this with Jupyter.

- ***Colab / Hosted Environment:*** Open this notebook in Colab and run through the cells.

In [None]:
!pip install kdbai_client

In [None]:
!pip install sentence-transformers langchain langchain-community

In [6]:
import pandas as pd
import numpy as np
import os
from getpass import getpass
import kdbai_client as kdbai
import time
from transformers import BertTokenizerFast
from collections import Counter

# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")

from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [7]:
### !!! Only run this cell if you need to download the data into your environment, for example in Colab
### This downloads Federal Reserve Inflation speech data
if not os.path.exists("./data/inflation.txt"):
  !mkdir -p ./data
  !wget -P ./data https://raw.githubusercontent.com/KxSystems/kdbai-samples/main/hybrid_search/data/inflation.txt

## 2. Ingest & Chunk Data
Data is from Federal Reserve Chair Jerome H. Powell:

[Inflation: Progress and the Path Ahead](https://www.federalreserve.gov/newsevents/speech/powell20230825a.htm)

In [8]:
### Load the documents we want to prompt an LLM about
doc = TextLoader("data/inflation.txt").load()

In [9]:
### Chunk the documents into 500 character chunks using langchain's text splitter "RecursiveCharacterTextSplitter"
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)

In [10]:
### split_documents produces a list of all the chunks created
pages = [p.page_content for p in text_splitter.split_documents(doc)]
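As a rough illustration of what `chunk_size` and `chunk_overlap` mean, the sketch below splits text into fixed-width character windows. It is a simplified stand-in: `RecursiveCharacterTextSplitter` prefers to break at separators such as paragraph breaks, newlines, and spaces, so its chunks are often shorter than the limit. The `chunk_text` helper is hypothetical and not part of LangChain.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4, chunk_overlap=0))  # ['abcd', 'efgh', 'ij']
print(chunk_text(sample, chunk_size=4, chunk_overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With `chunk_overlap=0`, as used in this notebook, each character of the speech lands in exactly one chunk.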

In [11]:
### Create a blank dataframe to store chunks and vectors in before insertion
data = {
    'ID':[],
    'chunk': [],
    'dense': [],
    'sparse': []
}

# Create the DataFrame
df = pd.DataFrame(data)

## 3. Generate Sparse & Dense Vectors for Each Chunk

In [None]:
### Tokenizer to create sparse vectors
token = BertTokenizerFast.from_pretrained('bert-base-uncased')

### Embedding model used to embed both the chunks and the user's query
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

In [13]:
### Create sparse and dense vectors of each chunk, append to the dataframe

chunk_id = 0
for chunk in pages:
    ### Create the dense vector for this chunk
    dense_chunk = [embedding_model.encode(chunk).tolist()]

    ### Create the sparse vector for this chunk: a mapping of token id -> count
    sparse_chunk = [dict(Counter(y)) for y in token([chunk], padding=True, max_length=None)['input_ids']]
    ### Remove the [CLS] (101) and [SEP] (102) special tokens added by the tokenizer
    sparse_chunk[0].pop(101); sparse_chunk[0].pop(102)

    new_row_df = pd.DataFrame([{"ID": str(chunk_id), "chunk": chunk, "dense": dense_chunk[0], "sparse": sparse_chunk[0]}])
    df = pd.concat([df, new_row_df], ignore_index=True)
    chunk_id += 1
df.head()

Unnamed: 0,ID,chunk,dense,sparse
0,0,"At last year's Jackson Hole symposium, I deliv...","[-0.022856269031763077, -0.02936530113220215, ...","{2012: 2, 2197: 1, 2095: 3, 1005: 2, 1055: 2, ..."
1,1,are confident that inflation is moving sustain...,"[0.011283153668045998, -0.030178586021065712, ...","{2024: 1, 9657: 1, 2008: 1, 14200: 1, 2003: 1,..."
2,2,Today I will review our progress so far and di...,"[-0.03170400112867355, 0.01769343577325344, 0....","{2651: 1, 1045: 2, 2097: 2, 3319: 1, 2256: 2, ..."
3,3,The Decline in Inflation So Far,"[0.003466668538749218, 0.007666163146495819, -...","{1996: 1, 6689: 1, 1999: 1, 14200: 1, 2061: 1,..."
4,4,The ongoing episode of high inflation initiall...,"[-0.02072943188250065, -0.055148035287857056, ...","{1996: 7, 7552: 1, 2792: 1, 1997: 4, 2152: 1, ..."
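The sparse column above is a mapping from token id to the number of times that token occurs in the chunk. The sketch below reproduces that step on its own, without loading a tokenizer. The `to_sparse` helper and the id sequence are made up for illustration. Ids 101 and 102 are the `[CLS]` and `[SEP]` special tokens in the `bert-base-uncased` vocabulary.

```python
from collections import Counter

CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] in the bert-base-uncased vocabulary


def to_sparse(input_ids, special_ids=(CLS_ID, SEP_ID)):
    """Count token ids and drop the special tokens, giving {token_id: count}."""
    counts = Counter(input_ids)
    for tok in special_ids:
        # pop with a default avoids a KeyError if a special token is absent
        counts.pop(tok, None)
    return dict(counts)


# Hypothetical tokenizer output for one short chunk
ids = [101, 14200, 2003, 14200, 2152, 102]
print(to_sparse(ids))  # {14200: 2, 2003: 1, 2152: 1}
```

Using `pop(tok, None)` is a small defensive choice: it behaves the same as the notebook cell when the special tokens are present and does not fail when they are not.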


## 4. Define KDB.AI Session & Database
KDB.AI comes in two offerings:

- KDB.AI Cloud: for experimenting with smaller generative AI projects with a vector database in our cloud.
- KDB.AI Server: for evaluating large-scale generative AI applications on-premises or on your own cloud provider.

Depending on which you use, there will be different setup steps and connection details required.

### Option 1. KDB.AI Cloud

To use KDB.AI Cloud, you will need two session details: a URL endpoint and an API key. To get these, sign up for a free [KDB.AI account](https://kdb.ai/get-started).

You can connect to a KDB.AI Cloud session using `kdbai.Session`, passing the session URL endpoint and API key details from your KDB.AI Cloud portal.

If the environment variables `KDBAI_ENDPOINT` and `KDBAI_API_KEY` exist on your system and contain your KDB.AI Cloud portal details, they will be used automatically to connect. If they do not exist, you will be prompted to enter your KDB.AI Cloud portal session URL endpoint and API key.

In [None]:
#Set up KDB.AI endpoint and API key
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)

In [None]:
### Start Session with KDB.AI Cloud
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

### Option 2. KDB.AI Server
To use KDB.AI Server, you will need to download and run your own container. To do this, you will first need to sign up for free.

You will receive an email with the required license file and bearer token needed to download your instance. Follow instructions in the signup email to get your session up and running.

Once the setup steps are complete you can then connect to your KDB.AI Server session using kdbai.Session and passing your local endpoint.

In [None]:
### start session with KDB.AI Server
#session = kdbai.Session(endpoint="http://localhost:8082")

### Verify Defined Databases

We can check our connection using the `session.databases()` function.
This will return a list of all the databases we have defined in our vector database thus far.
This should return a "default" database along with any other databases you have already created.

In [None]:
session.databases()

### Create a Database called "myDatabase"

In [152]:
# ensure no database called "myDatabase" exists
try:
    session.database("myDatabase").drop()
except kdbai.KDBAIException:
    pass

In [153]:
# Create the database
db = session.create_database("myDatabase")

## 5. Create Schema, Indexes and KDB.AI Table

Now, let us define the schema that will be used to create the KDB.AI table.

The "ID" and "chunk" columns will hold the unique identifier and raw text of each chunk.

The "sparse" and "dense" columns will hold the respective sparse and dense vectors.

In [154]:
schema = [
  {"name": "ID", "type": "str"},
  {"name": "chunk", "type": "str"},
  {
      "name":"sparse",
      "type":"general",
  },
  {
      "name":"dense",
      "type":"float64s",
  },
]

### Define the indexes
In this example, we have two indexes: one for dense search and one for sparse search (BM25).

- dense_index: uses a flat index type, with 384 dimensions and the Euclidean distance (L2) search metric

- sparse_index: uses the bm25 index type. We also define the "b" and "k" parameters, which control the impact of document length and term saturation on relevance. These parameters can be overridden at search time for hyperparameter tuning, as shown in the Sparse Search Hyperparameter Optimization section of this notebook.

In [155]:
# Define the indexes
indexes = [
    {
        'type': 'flat',
        'name': 'dense_index',
        'column': 'dense',
        'params': {'dims': 384, 'metric': "L2"},
    },
    {
        'type': 'bm25',
        'name': 'sparse_index',
        'column': 'sparse',
        'params': {'k': 1.25, 'b': 0.75},
    },
]
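Conceptually, a flat index with the L2 metric performs an exhaustive scan: it computes the Euclidean distance from the query to every stored vector and keeps the closest `n`. The sketch below shows that idea in plain Python with toy two-dimensional vectors. It is an illustration only and not KDB.AI's implementation. The `l2_distance` and `flat_search` names are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(query, vectors, n=5):
    """Return the n (distance, doc_id) pairs nearest to query by exhaustive scan."""
    scored = [(l2_distance(query, vec), doc_id) for doc_id, vec in vectors.items()]
    return sorted(scored)[:n]


# Toy 2-dimensional vectors; the notebook's dense vectors have 384 dimensions
vectors = {"0": [0.0, 0.0], "1": [1.0, 0.0], "2": [3.0, 4.0]}
nearest = flat_search([0.9, 0.1], vectors, n=2)
print([doc_id for _, doc_id in nearest])  # ['1', '0']
```

An exhaustive scan gives exact nearest neighbours, which is a reasonable fit for a small table like the 43 chunks inserted in this notebook.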

In [None]:
# List all of the tables in the db
db.tables

In [157]:
# First ensure the table does not already exist
try:
    db.table("inflation").drop()
except kdbai.KDBAIException:
    pass

In [158]:
# Create the table with the defined schema and indexes from above
table = db.create_table(table="inflation", schema=schema, indexes=indexes)

In [159]:
db.tables

[KDBAI table "inflation"]

In [160]:
table.indexes

[{'name': 'dense_index',
  'type': 'flat',
  'column': 'dense',
  'params': {'metric': 'L2', 'dims': 384}},
 {'name': 'sparse_index',
  'type': 'bm25',
  'column': 'sparse',
  'params': {'sparse': True, 'k': 1.25, 'b': 0.75}}]

## 6. Insert data into the KDB.AI Table

In [161]:
### Insert the dataframe into the KDB.AI table
table.insert(df)

{'rowsInserted': 43}

In [None]:
table.query()

## 7. Create Sparse and Dense Query Vectors

In [163]:
query = '12-month basis'

### Create the dense query vector
dense_query = [embedding_model.encode(query).tolist()]

### Create the sparse query vector
sparse_query = [dict(Counter(y)) for y in token([query], padding=True,max_length=None)['input_ids']]
sparse_query[0].pop(101);sparse_query[0].pop(102);

## 8. Run Sparse, Dense, and Hybrid Searches

In [164]:
### Adjust display settings so we can see full output
pd.set_option('display.max_colwidth', None)

In [165]:
### Type 1 - dense search
table.search(vectors={"dense_index":dense_query}, n=5)[0][['ID','chunk']]

Unnamed: 0,ID,chunk
0,9,"coming quarters. Twelve-month core inflation is still elevated, and there is substantial further ground to cover to get back to price stability."
1,35,"That assessment is further complicated by uncertainty about the duration of the lags with which monetary tightening affects economic activity and especially inflation. Since the symposium a year ago, the Committee has raised the policy rate by 300 basis points, including 100 basis points over the past seven months. And we have substantially reduced the size of our securities holdings. The wide range of estimates of these lags suggests that there may be significant further drag in the pipeline."
2,29,"Total hours worked has been flat over the past six months, and the average workweek has declined to the lower end of its pre-pandemic range, reflecting a gradual normalization in labor market conditions (figure 5)."
3,23,"Restrictive monetary policy has tightened financial conditions, supporting the expectation of below-trend growth.5 Since last year's symposium, the two-year real yield is up about 250 basis points, and longer-term real yields are higher as well—by nearly 150 basis points.6 Beyond changes in interest rates, bank lending standards have tightened, and loan growth has slowed sharply.7 Such a tightening of broad financial conditions typically contributes to a slowing in the growth of economic"
4,24,"activity, and there is evidence of that in this cycle as well. For example, growth in industrial production has slowed, and the amount spent on residential investment has declined in each of the past five quarters (figure 4)."


In [166]:
### Type 2 - sparse search
table.search(vectors={"sparse_index":sparse_query}, n=5)[0][['ID','chunk']]

Unnamed: 0,ID,chunk
0,14,"Similar dynamics are playing out for core goods inflation overall. As they do, the effects of monetary restraint should show through more fully over time. Core goods prices fell the past two months, but on a 12-month basis, core goods inflation remains well above its pre-pandemic level. Sustained progress is needed, and restrictive monetary policy is called for to achieve that progress."
1,8,"On a 12-month basis, core PCE inflation peaked at 5.4 percent in February 2022 and declined gradually to 4.3 percent in July (figure 1, panel B). The lower monthly readings for core inflation in June and July were welcome, but two months of good data are only the beginning of what it will take to build confidence that inflation is moving down sustainably toward our goal. We can't yet know the extent to which these lower readings will continue or where underlying inflation will settle over"
2,6,"On a 12-month basis, U.S. total, or ""headline,"" PCE (personal consumption expenditures) inflation peaked at 7 percent in June 2022 and declined to 3.3 percent as of July, following a trajectory roughly in line with global trends (figure 1, panel A).1 The effects of Russia's war against Ukraine have been a primary driver of the changes in headline inflation around the world since early 2022. Headline inflation is what households and businesses experience most directly, so this decline is very"
3,9,"coming quarters. Twelve-month core inflation is still elevated, and there is substantial further ground to cover to get back to price stability."
4,23,"Restrictive monetary policy has tightened financial conditions, supporting the expectation of below-trend growth.5 Since last year's symposium, the two-year real yield is up about 250 basis points, and longer-term real yields are higher as well—by nearly 150 basis points.6 Beyond changes in interest rates, bank lending standards have tightened, and loan growth has slowed sharply.7 Such a tightening of broad financial conditions typically contributes to a slowing in the growth of economic"


**After comparing the sparse and dense search results for the query "12-month basis", we see that while both return relevant results, the sparse search returns several chunks that contain the exact phrase "12-month basis".**

**This search example shows the advantage of having a sparse search when interested in specific terms.**
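A quick way to quantify the difference is to compare the IDs returned by each search. The lists below are copied from the dense and sparse outputs above. The variable names are only for illustration.

```python
# IDs taken from the dense and sparse search outputs above, in ranked order
dense_ids = ["9", "35", "29", "23", "24"]
sparse_ids = ["14", "8", "6", "9", "23"]

# Chunks returned by both searches, and chunks only the sparse search found
shared = [i for i in dense_ids if i in sparse_ids]
only_sparse = [i for i in sparse_ids if i not in dense_ids]

print(shared)       # ['9', '23']
print(only_sparse)  # ['14', '8', '6']
```

The three sparse-only chunks (14, 8, and 6) are exactly the ones whose text contains the literal phrase "12-month basis".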

Let's run a hybrid search to combine the results:

In [167]:
table.search(
    vectors={"sparse_index": sparse_query,"dense_index": dense_query},
    index_params={"sparse_index":{'weight':0.5} ,"dense_index":{'weight':0.5}},
    n=5
)[0][['ID','chunk']]

Unnamed: 0,ID,chunk
0,9,"coming quarters. Twelve-month core inflation is still elevated, and there is substantial further ground to cover to get back to price stability."
1,14,"Similar dynamics are playing out for core goods inflation overall. As they do, the effects of monetary restraint should show through more fully over time. Core goods prices fell the past two months, but on a 12-month basis, core goods inflation remains well above its pre-pandemic level. Sustained progress is needed, and restrictive monetary policy is called for to achieve that progress."
2,23,"Restrictive monetary policy has tightened financial conditions, supporting the expectation of below-trend growth.5 Since last year's symposium, the two-year real yield is up about 250 basis points, and longer-term real yields are higher as well—by nearly 150 basis points.6 Beyond changes in interest rates, bank lending standards have tightened, and loan growth has slowed sharply.7 Such a tightening of broad financial conditions typically contributes to a slowing in the growth of economic"
3,8,"On a 12-month basis, core PCE inflation peaked at 5.4 percent in February 2022 and declined gradually to 4.3 percent in July (figure 1, panel B). The lower monthly readings for core inflation in June and July were welcome, but two months of good data are only the beginning of what it will take to build confidence that inflation is moving down sustainably toward our goal. We can't yet know the extent to which these lower readings will continue or where underlying inflation will settle over"
4,35,"That assessment is further complicated by uncertainty about the duration of the lags with which monetary tightening affects economic activity and especially inflation. Since the symposium a year ago, the Committee has raised the policy rate by 300 basis points, including 100 basis points over the past seven months. And we have substantially reduced the size of our securities holdings. The wide range of estimates of these lags suggests that there may be significant further drag in the pipeline."


##### Hybrid Search with Sparse Bias: Sparse 'weight = 0.9'

In [168]:
table.search(
    vectors={"sparse_index": sparse_query,"dense_index": dense_query},
    index_params={"sparse_index":{'weight':0.9} ,"dense_index":{'weight':0.1}},
    n=5
)[0][['ID','chunk']]

Unnamed: 0,ID,chunk
0,14,"Similar dynamics are playing out for core goods inflation overall. As they do, the effects of monetary restraint should show through more fully over time. Core goods prices fell the past two months, but on a 12-month basis, core goods inflation remains well above its pre-pandemic level. Sustained progress is needed, and restrictive monetary policy is called for to achieve that progress."
1,8,"On a 12-month basis, core PCE inflation peaked at 5.4 percent in February 2022 and declined gradually to 4.3 percent in July (figure 1, panel B). The lower monthly readings for core inflation in June and July were welcome, but two months of good data are only the beginning of what it will take to build confidence that inflation is moving down sustainably toward our goal. We can't yet know the extent to which these lower readings will continue or where underlying inflation will settle over"
2,9,"coming quarters. Twelve-month core inflation is still elevated, and there is substantial further ground to cover to get back to price stability."
3,6,"On a 12-month basis, U.S. total, or ""headline,"" PCE (personal consumption expenditures) inflation peaked at 7 percent in June 2022 and declined to 3.3 percent as of July, following a trajectory roughly in line with global trends (figure 1, panel A).1 The effects of Russia's war against Ukraine have been a primary driver of the changes in headline inflation around the world since early 2022. Headline inflation is what households and businesses experience most directly, so this decline is very"
4,23,"Restrictive monetary policy has tightened financial conditions, supporting the expectation of below-trend growth.5 Since last year's symposium, the two-year real yield is up about 250 basis points, and longer-term real yields are higher as well—by nearly 150 basis points.6 Beyond changes in interest rates, bank lending standards have tightened, and loan growth has slowed sharply.7 Such a tightening of broad financial conditions typically contributes to a slowing in the growth of economic"


##### Hybrid Search with Dense Bias: Dense 'weight = 0.9'

In [169]:
table.search(
    vectors={"sparse_index": sparse_query,"dense_index": dense_query},
    index_params={"sparse_index":{'weight':0.1} ,"dense_index":{'weight':0.9}},
    n=5
)[0][['ID','chunk']]

Unnamed: 0,ID,chunk
0,9,"coming quarters. Twelve-month core inflation is still elevated, and there is substantial further ground to cover to get back to price stability."
1,35,"That assessment is further complicated by uncertainty about the duration of the lags with which monetary tightening affects economic activity and especially inflation. Since the symposium a year ago, the Committee has raised the policy rate by 300 basis points, including 100 basis points over the past seven months. And we have substantially reduced the size of our securities holdings. The wide range of estimates of these lags suggests that there may be significant further drag in the pipeline."
2,29,"Total hours worked has been flat over the past six months, and the average workweek has declined to the lower end of its pre-pandemic range, reflecting a gradual normalization in labor market conditions (figure 5)."
3,23,"Restrictive monetary policy has tightened financial conditions, supporting the expectation of below-trend growth.5 Since last year's symposium, the two-year real yield is up about 250 basis points, and longer-term real yields are higher as well—by nearly 150 basis points.6 Beyond changes in interest rates, bank lending standards have tightened, and loan growth has slowed sharply.7 Such a tightening of broad financial conditions typically contributes to a slowing in the growth of economic"
4,24,"activity, and there is evidence of that in this cycle as well. For example, growth in industrial production has slowed, and the amount spent on residential investment has declined in each of the past five quarters (figure 4)."


### Sparse Search Hyperparameter Optimization
#### Dynamic Testing / Override of b and k within index_params

Depending on the use-case, it could be beneficial to tune the underlying parameters of sparse search in order to increase relevancy of retrieved data. KDB.AI offers developers the ability to customize the 'b' and 'k' parameters during runtime, ensuring flexibility in sparse search implementation.

**b: (values 0 to 1, defaults to 0.75)**
<br>Document length impact on relevance
<br>In general, the more specific a document is, the less likely its length will detrimentally impact relevance, so b should be low. For general documents that cover multiple topics at a high level, consider using a higher value of b.
<br>
<br>**k: (values 0 to 3, defaults to 1.2)**
<br>Term saturation
<br>How much more relevant additional instances of a term make a document. The lower k is, the faster term saturation occurs (i.e. additional occurrences of a term do not count as much).
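To see how b and k act numerically, the sketch below evaluates the term-frequency component of the textbook Okapi BM25 formula. This is the standard formulation and is assumed here for illustration; KDB.AI's internal scoring may differ in its details. The `bm25_tf` function name and the document lengths are hypothetical.

```python
def bm25_tf(tf, doc_len, avg_doc_len, k=1.2, b=0.75):
    """Term-frequency component of Okapi BM25 (the IDF factor is omitted).

    tf is the term's count in the document. b scales the document-length
    penalty and k controls how quickly repeated terms stop adding score.
    """
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return tf * (k + 1) / (tf + k * length_norm)


# Term saturation: gain from a second occurrence in an average-length document
print(bm25_tf(2, 100, 100, k=1.2) - bm25_tf(1, 100, 100, k=1.2))  # smaller gain
print(bm25_tf(2, 100, 100, k=3.0) - bm25_tf(1, 100, 100, k=3.0))  # larger gain

# Length impact: one occurrence in a document twice the average length
print(bm25_tf(1, 200, 100, b=0.75))  # penalised more
print(bm25_tf(1, 200, 100, b=0.1))   # penalised less
```

In this formulation a lower k gives a smaller gain for the second occurrence (faster saturation), and a lower b reduces the penalty applied to long documents.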

In [174]:
table.search(
    vectors={"sparse_index": sparse_query,"dense_index": dense_query},
    index_params={"sparse_index":{'weight':0.1,'b':0.1, 'k':3} ,"dense_index":{'weight':0.9}},
    n=5
)[0][['ID','chunk']]

Unnamed: 0,ID,chunk
0,9,"coming quarters. Twelve-month core inflation is still elevated, and there is substantial further ground to cover to get back to price stability."
1,35,"That assessment is further complicated by uncertainty about the duration of the lags with which monetary tightening affects economic activity and especially inflation. Since the symposium a year ago, the Committee has raised the policy rate by 300 basis points, including 100 basis points over the past seven months. And we have substantially reduced the size of our securities holdings. The wide range of estimates of these lags suggests that there may be significant further drag in the pipeline."
2,29,"Total hours worked has been flat over the past six months, and the average workweek has declined to the lower end of its pre-pandemic range, reflecting a gradual normalization in labor market conditions (figure 5)."
3,23,"Restrictive monetary policy has tightened financial conditions, supporting the expectation of below-trend growth.5 Since last year's symposium, the two-year real yield is up about 250 basis points, and longer-term real yields are higher as well—by nearly 150 basis points.6 Beyond changes in interest rates, bank lending standards have tightened, and loan growth has slowed sharply.7 Such a tightening of broad financial conditions typically contributes to a slowing in the growth of economic"
4,24,"activity, and there is evidence of that in this cycle as well. For example, growth in industrial production has slowed, and the amount spent on residential investment has declined in each of the past five quarters (figure 4)."


**Additionally, when new data is inserted into the KDB.AI table, all underlying BM25 statistics are updated. This means the BM25 scoring reflects all of the sparse data in the table whenever a sparse search is run.**
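The sketch below shows why corpus statistics need to be refreshed on insert, using inverse document frequency (IDF) as an example. It uses one common IDF variant and hypothetical sparse vectors, and is an assumption for illustration rather than KDB.AI's exact formula.

```python
import math


def idf(term, docs):
    """One common BM25 IDF variant: log((N - df + 0.5) / (df + 0.5) + 1)."""
    n = len(docs)
    df = sum(1 for d in docs if term in d)  # number of documents containing the term
    return math.log((n - df + 0.5) / (df + 0.5) + 1)


# Hypothetical sparse vectors ({token_id: count}) already in the table
docs = [{2003: 1, 14200: 2}, {2152: 1}]
before = idf(14200, docs)

# Inserting a new document that also contains token 14200 changes the statistic
docs.append({14200: 1, 2003: 1})
after = idf(14200, docs)

print(before > after)  # True: the term is now more common, so it carries less weight
```

If the statistics were not refreshed, scores for existing rows would be computed against an outdated view of how common each term is.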

### Delete the KDB.AI Table and Database
Once finished with the table and database, it is best practice to drop them.

In [None]:
table.drop()
db.drop()


