# BabyDragon Indexes

The `indexes` submodule of the BabyDragon package provides indexing and search strategies for various data types.
Its main class is `MemoryIndex`, a wrapper around a Faiss index that simplifies managing the index and its associated data. A `MemoryIndex` can be created from scratch, loaded from a file, or initialized from a pandas DataFrame. The class also provides methods for adding and removing items, querying, saving and loading, and pruning the index based on constraints such as token length or a regular expression.

## Table of Contents

1. [MemoryIndex](#usage)
   - [Initializing a MemoryIndex](#initializing-a-memoryindex)
   - [Adding and Removing Items](#adding-and-removing-items)
   - [Querying the Index](#querying-the-index)
   - [Saving and Loading](#saving-and-loading)
   - [Pruning the Index](#pruning-the-index)
   - [Multithreading](#multithreading)
2. [Examples](#examples)

## Usage

### Initializing a MemoryIndex

A `MemoryIndex` object can be initialized in several ways:

1. Create a new empty index from scratch:

In [1]:
from babydragon.memory.indexes.memory_index import MemoryIndex

index = MemoryIndex()

Creating a new index


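Before adding values, set your OpenAI API key. Read it from the environment rather than hard-coding it in source:

In [2]:
import os
import openai

# Never commit a real key; export OPENAI_API_KEY in your shell instead
openai.api_key = os.environ["OPENAI_API_KEY"]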

2. Create a new index from a list of values using the default OpenAI ada-002 embedder:

In [3]:
values = ["apple", "banana", "cherry"]

index = MemoryIndex(values=values)

Creating a new index from a list of values
Embedding value  0  of  3
Embedding value  0  took  0.21761608123779297  seconds
Embedding value  1  of  3
Embedding value  1  took  0.7341010570526123  seconds
Embedding value  2  of  3
Embedding value  2  took  0.20077300071716309  seconds


We can now search the index through the underlying Faiss index by calling the `faiss_query` method:

In [4]:
results, scores, indices = index.faiss_query("apple", k=3)
for result, score in zip(results, scores):
    print(result, score)

apple 0.9999985
banana 0.90329254
cherry 0.8461201


3. Create a new index from a list of values and their embeddings:

In [5]:
from babydragon.models.embedders.ada2 import OpenAiEmbedder
embedder = OpenAiEmbedder()

embeddings = []
for value in values:
    embeddings.append(embedder.embed(value))

index = MemoryIndex(name="precomputed_index", values=values, embeddings=embeddings)

results, scores, indices = index.faiss_query("apple", k=3)
for result, score in zip(results, scores):
    print(result, score)

Creating a new index from a list of embeddings and values
apple 0.9999985
banana 0.90340185
cherry 0.84610236


4. Load an existing index from a file:

In [6]:
index = MemoryIndex(load=True, name="precomputed_index")

Loading index from storage/precomputed_index


5. Initialize a `MemoryIndex` object from a pandas DataFrame:

In [7]:
import pandas as pd

data_frame = pd.DataFrame({
    "values": values,
    "embeddings": embeddings  # list of embeddings corresponding to the values
})

index = MemoryIndex.from_pandas(data_frame=data_frame, columns="values", embeddings_col="embeddings")

results, scores, indices = index.faiss_query("apple", k=3)
for result, score in zip(results, scores):
    print(result, score)

Loading the DataFrame
Creating a new index from a list of embeddings and values
apple 0.9999981
banana 0.90342164
cherry 0.8461188
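
The initialization examples above all end with the same `faiss_query` call, which ranks stored values by how similar their embeddings are to the query's embedding. The sketch below illustrates that ranking idea in plain Python. It assumes the scores are cosine similarities (the near-1.0 score for an exact match suggests this, but the source does not state the metric). The helper names `cosine_similarity` and `top_k` and the three-dimensional vectors are made up for illustration; this is not BabyDragon's or Faiss's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, values, vectors, k):
    """Return the k (value, score) pairs most similar to query_vec, best first."""
    scored = [(v, cosine_similarity(query_vec, vec)) for v, vec in zip(values, vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Made-up 3-dimensional "embeddings"; real ada-002 vectors have 1536 dimensions.
toy_values = ["apple", "banana", "cherry"]
toy_vectors = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.6, 0.8]]

ranked = top_k([1.0, 0.0, 0.0], toy_values, toy_vectors, k=2)
for value, score in ranked:
    print(value, round(score, 3))
```

A brute-force scan like this is linear in the number of stored vectors; a library such as Faiss exists to make the same kind of lookup fast on large collections.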



### Adding and Removing Items
You can add items to the index by calling the `add_to_index` method:

In [8]:
index.add_to_index(value="orange")

results, scores, indices = index.faiss_query("apple", k=4)
for result, score in zip(results, scores):
    print(result, score)

apple 0.9999981
banana 0.90342164
orange 0.86520016
cherry 0.8461188


You can also remove items from the index by calling the `remove_from_index` method:

In [9]:
index.remove_from_index(value="banana")

results, scores, indices = index.faiss_query("apple", k=4)
for result, score in zip(results, scores):
    print(result, score)

apple 0.9999981
orange 0.86520016
cherry 0.8461188
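
Adding and removing by value implies some bookkeeping: each value has to stay paired with its embedding. The toy class below sketches one way to keep the two aligned, using parallel lists. `ToyIndex` and its methods are hypothetical names for illustration only, and the two-dimensional vectors are made up; how `MemoryIndex` handles this internally is not shown in the source.

```python
class ToyIndex:
    """Hypothetical stand-in for an index wrapper; not BabyDragon's implementation."""

    def __init__(self):
        self.values = []
        self.vectors = []

    def add(self, value, vector):
        # Append to both lists so position i always refers to the same item.
        self.values.append(value)
        self.vectors.append(vector)

    def remove(self, value):
        # Find the position once, then drop the value and its vector together.
        position = self.values.index(value)  # raises ValueError if absent
        del self.values[position]
        del self.vectors[position]

toy = ToyIndex()
for name, vec in [("apple", [1.0, 0.0]), ("banana", [0.8, 0.6]), ("cherry", [0.0, 1.0])]:
    toy.add(name, vec)

toy.add("orange", [0.9, 0.4])
toy.remove("banana")
print(toy.values)
```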


### Querying the Index
To query the index, use the `faiss_query` method or the `token_bound_query` method, which also accepts a `max_tokens` limit:

In [10]:
# Query the 5 most similar values, subject to a maximum token constraint

results, scores, indices = index.token_bound_query(query="fruit", k=5, max_tokens=4)
for result, score in zip(results, scores):
    print(result, score)

apple 0.9044406
orange 0.8771416
cherry 0.860623
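
The source does not spell out how `max_tokens` is applied. One plausible reading is that results are taken in rank order until the token budget is used up. The sketch below implements that reading only, as an assumption. `count_tokens` is a whitespace-based stand-in (the pruning log later in this document mentions the `cl100k_base` encoding, which would count differently), and `bound_by_tokens` is a hypothetical helper, not a BabyDragon function.

```python
def count_tokens(text):
    """Stand-in tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def bound_by_tokens(ranked_values, max_tokens):
    """Keep ranked values, best first, until adding one more would exceed max_tokens."""
    kept, used = [], 0
    for value in ranked_values:
        cost = count_tokens(value)
        if used + cost > max_tokens:
            break
        kept.append(value)
        used += cost
    return kept

# Already sorted by similarity, most similar first (made-up ranking).
ranked_values = ["apple", "orange", "cherry", "green banana", "kiwi"]
within_budget = bound_by_tokens(ranked_values, max_tokens=4)
print(within_budget)
```

Stopping at the first value that does not fit keeps the result a strict prefix of the ranking; skipping it and continuing would pack the budget more tightly at the cost of returning a lower-ranked value ahead of a higher-ranked one.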


### Saving and Loading
After modifying the index, you can save it to disk by calling the `save` method. The location and name of the file are controlled by the `index.save_path` and `index.name` attributes:

In [11]:
index.name = "precomputed_index"
index.save()

You can load a saved index by passing `load=True` and its name to the `MemoryIndex` constructor:

In [12]:
index = MemoryIndex(load=True, name="precomputed_index")

Loading index from storage/precomputed_index


### Pruning the Index
To prune the index based on certain constraints, use the `prune` method, which returns a new index. To prune by the number of tokens in each value, set the pruning method to `"length"` and pass an inclusive `length_constraint` range:

In [28]:
pruned_index = index.prune("length", length_constraint=(0, 1))
results, scores, indices = pruned_index.token_bound_query(query="fruit", k=5, max_tokens=4)
for result, score in zip(results, scores):
    print(result, score)

Pruning by length
Length constraint:  (0, 1)
Number of values:  3
tokenizer:  <Encoding 'cl100k_base'>
value apple is in range (0, 1)
value banana is in range (0, 1)
Creating a new index from a list of embeddings and values
banana 0.9200244
apple 0.9044848
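
The length rule can be sketched in a few lines: keep a value only if its token count falls inside the inclusive range. The helpers below are hypothetical and use a whitespace word count as a stand-in tokenizer, so the counts will differ from those of the `cl100k_base` encoding shown in the log above.

```python
def whitespace_tokens(text):
    """Stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())

def prune_by_length(values, length_constraint, count_tokens):
    """Keep values whose token count lies within the inclusive (low, high) range."""
    low, high = length_constraint
    return [v for v in values if low <= count_tokens(v) <= high]

candidates = ["apple", "banana", "fresh cherry pie"]
short_values = prune_by_length(candidates, (0, 1), whitespace_tokens)
print(short_values)
```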


Alternatively, prune with a regular expression that a value must match to be kept. The example below filters out `apple` with the regex `^(?!.*apple).*$`:

In [29]:
# Negative lookahead: keep only values that do not contain "apple"
pruned_index = index.prune("regex", regex_pattern="^(?!.*apple).*$")
results, scores, indices = pruned_index.token_bound_query(query="fruit", k=5, max_tokens=4)
for result, score in zip(results, scores):
    print(result, score)

Creating a new index from a list of embeddings and values
banana 0.9199276
cherry 0.860623
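
To see why this pattern excludes `apple`, it helps to run it on its own with Python's standard `re` module. The negative lookahead `(?!.*apple)` fails at the start of any string that contains `apple` anywhere, so only strings without it match. The candidate list below is made up, and whether `prune` applies the pattern with `re.match` or another call is not shown in the source.

```python
import re

# Matches a whole string only if "apple" appears nowhere in it.
pattern = re.compile(r"^(?!.*apple).*$")

candidates = ["apple", "banana", "cherry", "pineapple juice"]
kept = [value for value in candidates if pattern.match(value)]
print(kept)
```

Note that the match is on substrings, so `pineapple juice` is excluded too.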


### Multithreading
To speed up embedding with multithreading, set the `max_workers` parameter to a value greater than 1. The current rate-limit parameters are conservative to avoid OpenAI API errors:

In [17]:
index = MemoryIndex(values=values, max_workers=8)

Embedding 3 values
setting up savepath
Executing task memory_index_embedding_task using 8 workers.
RateLimiter: This is the first call, no wait required.
RateLimiter: Waiting for 0.04 seconds before next call.
RateLimiter: Waiting for 0.04 seconds before next call.
Sub-task 0 executed in 0.21 seconds.
Sub-task 0 results saved in 0.00 seconds.
Sub-task 1 executed in 0.17 seconds.
Sub-task 1 results saved in 0.00 seconds.
Sub-task 2 executed in 0.05 seconds.
Sub-task 2 results saved in 0.00 seconds.
Task execution completed.
Creating a new index from a list of embeddings and values
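
The log above suggests a pool of workers sharing a rate limiter that spaces calls 0.04 seconds apart. The sketch below reproduces that general pattern with the standard library only. `MinIntervalLimiter`, `fake_embed`, and `embed_all` are hypothetical names, and `fake_embed` returns made-up numbers instead of calling an embedding API; BabyDragon's actual task runner may work differently.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class MinIntervalLimiter:
    """Blocks callers so that calls start at least min_interval seconds apart."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self._lock:
            now = time.monotonic()
            delay = max(0.0, self._next_allowed - now)
            self._next_allowed = max(now, self._next_allowed) + self.min_interval
        if delay:
            time.sleep(delay)

def fake_embed(text):
    """Placeholder for a network call to an embedding API."""
    return [float(len(text)), float(text.count("a"))]

def embed_all(texts, max_workers, min_interval=0.04):
    limiter = MinIntervalLimiter(min_interval)

    def task(text):
        limiter.wait()
        return fake_embed(text)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in input order, whatever order the tasks finish in.
        return list(pool.map(task, texts))

vectors = embed_all(["apple", "banana", "cherry"], max_workers=8)
print(vectors)
```

Because the limiter spaces out call starts rather than completions, extra workers help most when each request spends its time waiting on the network.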
