[WIP] feat: add support for LanceDB as vector db provider #210

Closed
wants to merge 21 commits into from
Changes from 8 commits
Commits
fdb8475
fix: upate the basic example client to communiate with server on the …
sp6370 Mar 25, 2024
a694567
Merge branch 'SciPhi-AI:main' into main
sp6370 Mar 25, 2024
da8f46c
chore: add lanceDB as optional dependency
sp6370 Mar 26, 2024
1fdf426
chore: add lanceDB as optional dependency part 2
sp6370 Mar 26, 2024
0d085e0
feat: Add support for LanceDB as Vector DB provider
sp6370 Mar 26, 2024
53a3dd3
chore: added pyarrow as the dependency
sp6370 Mar 26, 2024
081acc5
chore: added implementation for lancedb initialize_collection
sp6370 Mar 26, 2024
7f63a6f
chore: updated the database schema for lancedb
sp6370 Mar 26, 2024
e378820
chore: update dependency for pyarrow
sp6370 Mar 26, 2024
a76bf16
chore: support lancedb selection from config.json
sp6370 Apr 4, 2024
9c87c77
Merge branch 'main' into main
sp6370 Apr 4, 2024
5c0f8f0
Update factory.py
sp6370 Apr 4, 2024
9c276cf
chore: add skeleton code for lancedb provider support
sp6370 Apr 4, 2024
78f6fb9
Merge branch 'main' of https://github.com/sp6370/R2R
sp6370 Apr 4, 2024
b3f9208
Merge branch 'SciPhi-AI:main' into main
sp6370 Apr 4, 2024
ddad47e
Merge branch 'SciPhi-AI:main' into main
sp6370 Apr 8, 2024
b371e69
chore: update .env.example for lancedb
sp6370 Apr 8, 2024
92fb033
chore: update lancedb implementation to set db uri from env
sp6370 Apr 8, 2024
a8b0fe6
feat: update lancedb implementation to support upsert_entries and search
sp6370 Apr 8, 2024
da55720
Merge branch 'main' of https://github.com/sp6370/R2R
sp6370 Apr 8, 2024
024c981
feat: update lancedb implementation to support lancedb cloud
sp6370 Apr 8, 2024
29 changes: 16 additions & 13 deletions pyproject.toml
@@ -30,18 +30,20 @@ uvicorn = "^0.27.0.post1"
vecs = "^0.4.0"

# Optional dependencies
-bs4 = {version = "^0.0.2", optional = true}
-pypdf = {version = "^4.0.2", optional = true}
-tiktoken = {version = "^0.5.2", optional = true}
-datasets = {version = "^2.16.1", optional = true}
-qdrant_client = {version = "^1.7.0", optional = true}
-psycopg2-binary = {version = "^2.9.9", optional = true}
-numpy = {version = "^1.26.4", optional = true}
-scikit-learn = {version = "^1.4.1.post1", optional = true}
-sentry-sdk = {version = "^1.40.4", optional = true}
-deepeval = {version ="^0.20.88", optional = true}
-parea-ai = {version = "^0.2.86", optional = true}
-ionic-api-sdk = {version = "0.9.3", optional = true}
+bs4 = { version = "^0.0.2", optional = true }
+pypdf = { version = "^4.0.2", optional = true }
+tiktoken = { version = "^0.5.2", optional = true }
+datasets = { version = "^2.16.1", optional = true }
+qdrant_client = { version = "^1.7.0", optional = true }
+psycopg2-binary = { version = "^2.9.9", optional = true }
+numpy = { version = "^1.26.4", optional = true }
+scikit-learn = { version = "^1.4.1.post1", optional = true }
+sentry-sdk = { version = "^1.40.4", optional = true }
+deepeval = { version = "^0.20.88", optional = true }
+parea-ai = { version = "^0.2.86", optional = true }
+ionic-api-sdk = { version = "0.9.3", optional = true }
+lancedb = { version = "^0.6.5", optional = true }
+pyarrow = { version = "^15.0.2", optional = true }

[tool.poetry.extras]
parsing = ["bs4", "pypdf"]
@@ -53,7 +55,8 @@ local_vectordb = ["numpy", "scikit-learn"]
monitoring = ["sentry-sdk"]
eval = ["deepeval", "parea-ai"]
ionic = ["ionic-api-sdk"]
-all = ["bs4", "pypdf", "tiktoken", "datasets", "qdrant_client", "psycopg2-binary", "numpy", "scikit-learn", "sentry-sdk", "protobuf", "deepeval", "parea-ai", "ionic"]
+lancedb = ["lancedb"]
+all = ["bs4", "pypdf", "tiktoken", "datasets", "qdrant_client", "psycopg2-binary", "numpy", "scikit-learn", "sentry-sdk", "protobuf", "deepeval", "parea-ai", "ionic", "lancedb"]

[tool.poetry.group.dev.dependencies]
black = "^23.3.0"
2 changes: 1 addition & 1 deletion r2r/core/providers/vector_db.py
@@ -64,7 +64,7 @@ def to_dict(self) -> dict:


class VectorDBProvider(ABC):
-    supported_providers = ["local", "pgvector", "qdrant"]
+    supported_providers = ["local", "pgvector", "qdrant", "lancedb"]

    def __init__(self, provider: str):
        if provider not in VectorDBProvider.supported_providers:
89 changes: 89 additions & 0 deletions r2r/vector_dbs/lancedb/base.py
@@ -0,0 +1,89 @@
import logging
from typing import Optional

from r2r.core import VectorDBProvider, VectorEntry

logger = logging.getLogger(__name__)


class LanceDB(VectorDBProvider):
    # TODO enable LanceDB provider to support lanceDB Cloud
    def __init__(
        self, provider: str = "lancedb", db_path: Optional[str] = None
    ) -> None:
        logger.info("Initializing `LanceDB` to store and retrieve embeddings.")

        super().__init__(provider)

        if provider != "lancedb":
            raise ValueError(
                "LanceDB must be initialized with provider `lancedb`."
            )

        try:
            import lancedb
        except ImportError:
            raise ValueError(
                f"Error, `lancedb` is not installed. Please install it using `pip install lancedb`."
            )

        self.db_path = db_path

        try:
            self.client = lancedb.connect(uri=self.db_path)
        except Exception as e:
            raise ValueError(
                f"Error {e} occurred while attempting to connect to the lancedb provider."
            )
        self.collection_name: Optional[str] = None

    def initialize_collection(
        self, collection_name: str, dimension: int
    ) -> None:
        self.collection_name = collection_name

        try:
            import pyarrow
        except ImportError:
            raise ValueError(
                f"Error, `pyarrow` is not installed. Please install it using `pip install pyarrow`."
            )

        table_schema = pyarrow.schema(
            [
                pyarrow.field("id", pyarrow.string()),
                pyarrow.field(
                    "vector", pyarrow.list_(pyarrow.float32(), dimension)
                ),
                # TODO Handle storing metadata
@sp6370 (Contributor, Author) commented on Apr 8, 2024:

@AyushExel I need a column in the table to store metadata information associated with the vector.

Metadata has the following type:

MetadataValues = Union[str, int, float, bool, List[str]]
Metadata = Dict[str, MetadataValues]

How can I achieve this with PyArrow/LanceDB?
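
One possible approach, offered as a sketch rather than anything confirmed in this thread: serialize the metadata dict to a JSON string and store it in a plain string column, for example by adding `pyarrow.field("metadata", pyarrow.string())` to the schema. The helper names `encode_metadata` and `decode_metadata` below are hypothetical and not part of R2R or LanceDB. The trade-off is that individual metadata keys are then not available as typed columns.

```python
import json
from typing import Dict, List, Union

MetadataValues = Union[str, int, float, bool, List[str]]
Metadata = Dict[str, MetadataValues]


def encode_metadata(metadata: Metadata) -> str:
    """Serialize metadata to a JSON string suitable for a string column."""
    # sort_keys gives a deterministic encoding for equal dicts.
    return json.dumps(metadata, sort_keys=True)


def decode_metadata(raw: str) -> Metadata:
    """Rebuild the metadata dict from the stored JSON string."""
    return json.loads(raw)


# Round trip over every value type allowed by MetadataValues.
example: Metadata = {
    "title": "doc",
    "page": 3,
    "score": 0.5,
    "public": True,
    "tags": ["a", "b"],
}
stored = encode_metadata(example)
restored = decode_metadata(stored)
```

With this in place, `upsert` could write `encode_metadata(entry.metadata)` into the extra column, assuming `VectorEntry` exposes its metadata under that attribute name.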

            ]
        )

        try:
            self.client.create_table(
                name=f"{collection_name}",
                on_bad_vectors="error",
                schema=table_schema,
            )
        except Exception as e:
            # TODO - Handle more appropriately - create collection fails when it already exists
            pass
Review comment from a contributor:

The exception handling here could lead to silent failures which are hard to debug. Instead of simply passing when an exception occurs, consider logging the exception or re-raising it after logging.
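
A minimal sketch of the suggestion above, under stated assumptions: `FakeClient` is a hypothetical stand-in for the database client, and `create_table_logged` is an illustrative name, not code from this PR. The idea is to treat "table already exists" as an expected case handled explicitly, and to log and re-raise anything else instead of swallowing it.

```python
import logging

logger = logging.getLogger(__name__)


class FakeClient:
    """Hypothetical stand-in for a database client, used only in this sketch."""

    def __init__(self) -> None:
        self._tables = set()

    def table_names(self) -> list:
        return sorted(self._tables)

    def create_table(self, name: str) -> None:
        if name in self._tables:
            raise ValueError(f"Table {name} already exists")
        self._tables.add(name)


def create_table_logged(client, name: str) -> bool:
    """Create a table; return False if it already existed, re-raise other errors."""
    if name in client.table_names():
        logger.info("Table %s already exists; reusing it.", name)
        return False
    try:
        client.create_table(name)
    except Exception:
        # Log with the traceback, then re-raise so the failure is not silent.
        logger.exception("Failed to create table %s", name)
        raise
    return True
```

Whether the real client offers an equivalent existence check should be verified against the LanceDB documentation before adapting this.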


    def copy(self, entry: VectorEntry, commit=True) -> None:
        raise NotImplementedError(
            "LanceDB does not support the `copy` method."
        )

    def upsert(self, entry: VectorEntry, commit=True) -> None:
        if self.collection_name is None:
            raise ValueError(
                "Please call `initialize_collection` before attempting to run `upsert`."
            )
        self.client.open_table(self.collection_name).add(
            {
                "vector": entry.vector,
                "id": entry.id,
                # TODO ADD metadata storage
            },
            mode="overwrite",
        )