<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/vector_stores/LanceDBIndexDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LanceDB Vector Store
In this notebook we show how to use [LanceDB](https://www.lancedb.com) to perform vector searches in LlamaIndex.

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [None]:
%pip install llama-index-vector-stores-lancedb

In [None]:
import logging
import sys

# Uncomment to see debug logs
# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
# logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from llama_index.core import SimpleDirectoryReader, Document, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.lancedb import LanceDBVectorStore
import textwrap

### Setup OpenAI
The first step is to configure the OpenAI API key. It will be used to create embeddings for the documents loaded into the index.

In [None]:
import openai

openai.api_key = "sk-"
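
Hardcoding a key in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key has been exported as the `OPENAI_API_KEY` environment variable; the `load_openai_key` helper is hypothetical and is not part of LlamaIndex or the `openai` package:

```python
import os


def load_openai_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in `env_var`, or raise if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage in the notebook (requires the variable to be set beforehand):
# openai.api_key = load_openai_key()
```

Failing early with a clear message is usually easier to debug than an authentication error raised later, while the index is being built.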

### Download Data

In [None]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-05-10 04:07:51--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2024-05-10 04:07:51 (4.41 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



### Loading documents
Load the documents stored in `data/paul_graham/` using `SimpleDirectoryReader`.

In [None]:
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
print("Document ID:", documents[0].doc_id, "Document Hash:", documents[0].hash)

Document ID: 90925731-4de0-43d8-aadd-db0c8246fa91 Document Hash: b3096f4aea5bc17b1b8f93d6630142871b20a3c3cfd4ed04614d17ae10373254


### Create the index
Here we create an index backed by LanceDB using the documents loaded previously. `LanceDBVectorStore` takes a few arguments:
- uri (str, required): Location where LanceDB will store its files.
- table_name (str, optional): Name of the table where the embeddings will be stored. Defaults to "vectors".
- nprobes (int, optional): The number of probes used. A higher number makes search more accurate but also slower. Defaults to 20.
- refine_factor (int, optional): Refine the results by reading extra elements and re-ranking them in memory. Defaults to None.

More details can be found in the [LanceDB docs](https://lancedb.github.io/lancedb/ann_indexes).
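
The accuracy/speed trade-off behind `nprobes` can be illustrated with a toy partitioned search in plain Python. This is only a sketch of the general idea (search the few partitions closest to the query instead of every vector); it is not LanceDB's implementation, the centroids are picked at random rather than trained, and all function names are hypothetical:

```python
import math
import random


def nearest(query, points, k):
    """Return the k points closest to `query` (brute force, Euclidean)."""
    return sorted(points, key=lambda p: math.dist(query, p))[:k]


def build_partitions(points, centroids):
    """Assign every point to its closest centroid."""
    partitions = {i: [] for i in range(len(centroids))}
    for p in points:
        idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        partitions[idx].append(p)
    return partitions


def probe_search(query, centroids, partitions, k, nprobes):
    """Search only the `nprobes` partitions whose centroids are closest to `query`."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [p for i in order[:nprobes] for p in partitions[i]]
    return nearest(query, candidates, k), len(candidates)


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(1000)]
centroids = rng.sample(points, 16)  # crude stand-in for trained centroids
partitions = build_partitions(points, centroids)
query = (0.5, 0.5)

exact = nearest(query, points, k=5)
for nprobes in (1, 4, 16):
    found, scanned = probe_search(query, centroids, partitions, k=5, nprobes=nprobes)
    recall = len(set(found) & set(exact)) / len(exact)
    print(f"nprobes={nprobes:2d} scanned={scanned:4d} recall={recall:.2f}")
```

Probing more partitions scans more candidates, which costs time but makes it less likely that a true nearest neighbour is missed; probing all of them is equivalent to a brute-force search.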

In [None]:
vector_store = LanceDBVectorStore(uri="./lancedb")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

### Query the index
We can now ask questions using our index. Results can be filtered either with `MetadataFilters` or with a native LanceDB `where` clause.

In [None]:
from llama_index.core.vector_stores import (
    MetadataFilters,
    FilterOperator,
    FilterCondition,
    MetadataFilter,
)

query_filters = MetadataFilters(
    filters=[
        MetadataFilter(
            key="creation_date", operator=FilterOperator.EQ, value="2024-05-10"
        ),
        MetadataFilter(
            key="file_size", value=75040, operator=FilterOperator.GT
        ),
    ],
    condition=FilterCondition.AND,
)

In [None]:
query_engine = index.as_query_engine(filters=query_filters)
response = query_engine.query("How much did Viaweb charge per month?")

In [None]:
response.metadata

{'60e0e1b9-e1c3-404d-96ce-04479399bb84': {'file_path': '/Users/raghavdixit/Desktop/open_source/llama_index_lance/docs/docs/examples/vector_stores/data/paul_graham/paul_graham_essay.txt',
  'file_name': 'paul_graham_essay.txt',
  'file_type': 'text/plain',
  'file_size': 75042,
  'creation_date': '2024-05-10',
  'last_modified_date': '2024-05-10'}}
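
The returned metadata satisfies both filter conditions: the creation date equals `2024-05-10` and the file size (75042) is greater than 75040. A plain-Python sketch of how an `AND` combination of such filters evaluates; the actual filtering is performed by the vector store, and `OPS` and `matches_all` are hypothetical names used only for illustration:

```python
import operator

# Operators used in the filters above, mapped to Python comparisons.
OPS = {"==": operator.eq, ">": operator.gt}


def matches_all(metadata: dict, filters: list) -> bool:
    """Return True when every (key, op, value) triple holds for `metadata`."""
    return all(
        key in metadata and OPS[op](metadata[key], value)
        for key, op, value in filters
    )


filters = [
    ("creation_date", "==", "2024-05-10"),
    ("file_size", ">", 75040),
]
node_metadata = {"file_size": 75042, "creation_date": "2024-05-10"}

print(matches_all(node_metadata, filters))  # True: both conditions hold
print(matches_all({**node_metadata, "file_size": 75000}, filters))  # False
```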

You can also use LanceDB's SQL-like filters directly.

In [None]:
lance_filter = "metadata.file_name = 'paul_graham_essay.txt' "
retriever = index.as_retriever(where=lance_filter)
response = retriever.retrieve("How much did Viaweb charge per month?")

In [None]:
response[0].metadata

{'file_path': '/Users/raghavdixit/Desktop/open_source/llama_index_lance/docs/docs/examples/vector_stores/data/paul_graham/paul_graham_essay.txt',
 'file_name': 'paul_graham_essay.txt',
 'file_type': 'text/plain',
 'file_size': 75042,
 'creation_date': '2024-05-10',
 'last_modified_date': '2024-05-10'}
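
When the filter value comes from user input or a variable, building the clause by hand is error-prone. A small sketch of a helper that produces the same kind of `metadata.<field> = '<value>'` string as above; `metadata_eq` is a hypothetical name, and doubling single quotes is the standard SQL escaping convention, which is assumed (not verified here) to be accepted by LanceDB's filter parser:

```python
def metadata_eq(field: str, value: str) -> str:
    """Build a `metadata.<field> = '<value>'` clause, doubling single quotes."""
    escaped = value.replace("'", "''")
    return f"metadata.{field} = '{escaped}'"


lance_filter = metadata_eq("file_name", "paul_graham_essay.txt")
print(lance_filter)  # metadata.file_name = 'paul_graham_essay.txt'
```

The resulting string can then be passed as `where=lance_filter`, as in the cell above.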

In [None]:
response = query_engine.query("What did the author do growing up?")

In [None]:
print(textwrap.fill(str(response), 100))

The author worked on writing and programming before college.


### Appending data
You can also add data to an existing index.

In [None]:
del index

index = VectorStoreIndex.from_documents(
    [Document(text="The sky is purple in Portland, Maine")],
    uri="/tmp/new_dataset",
)

In [None]:
query_engine = index.as_query_engine()
response = query_engine.query("Where is the sky purple?")
print(textwrap.fill(str(response), 100))

Portland, Maine


You can also create an index from an existing table.

In [None]:
del index

vec_store = LanceDBVectorStore.from_table(vector_store._table)
index = VectorStoreIndex.from_vector_store(vec_store)

In [None]:
query_engine = index.as_query_engine()
response = query_engine.query("What companies did the author start?")
print(textwrap.fill(str(response), 100))

The author started Viaweb and Aspra.
