We start with high-precision, high-recall retrieval methods, because LightRAG helps you optimize the later stage of your search/retrieval pipeline. The first stage is often provided by cloud database providers through their built-in search and filter support.

## LLMRetriever

Indexing consists of forming the prompt from the target documents and setting the ``top_k`` parameter.
``retrieve`` runs the ``generator`` and parses the response into the standard ``RetrieverOutputType``, which is a list of
``RetrieverOutput``. Each ``RetrieverOutput`` contains the document indices and, when available, the scores.
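
The parsing step can be illustrated with a minimal sketch. This is not the ``LLMRetriever`` implementation: ``SimpleRetrieverOutput`` and ``parse_indices`` are hypothetical names, and the sketch assumes the model replies with a bracketed list of integers such as ``[8, 10]``.

```python
import ast
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SimpleRetrieverOutput:
    """Hypothetical stand-in for RetrieverOutput, for illustration only."""

    doc_indices: List[int]
    doc_scores: Optional[List[float]] = None
    query: Optional[str] = None


def parse_indices(response: str, top_k: int, num_docs: int) -> List[int]:
    """Parse a reply such as "[8, 10]" into at most top_k valid document indices."""
    start, end = response.find("["), response.rfind("]")
    if start == -1 or end <= start:
        return []  # no bracketed list in the reply
    try:
        parsed = ast.literal_eval(response[start : end + 1])
    except (ValueError, SyntaxError, TypeError):
        return []  # the reply was not a valid Python literal
    indices: List[int] = []
    for item in parsed:
        # keep only in-range integers, dropping duplicates
        is_int = isinstance(item, int) and not isinstance(item, bool)
        if is_int and 0 <= item < num_docs and item not in indices:
            indices.append(item)
    return indices[:top_k]


query = "What happened at Viaweb and Interleaf?"
reply = "[8, 10]"  # example of a raw model reply
output = [SimpleRetrieverOutput(doc_indices=parse_indices(reply, top_k=2, num_docs=20), query=query)]
print(output)
```

Validating the indices against the number of indexed documents guards against a model that returns an index outside the prompt.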

In [1]:
# prepare the document
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-06-16 13:13:07--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2024-06-16 13:13:07 (3.36 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



In [2]:
# use fsspec to read the document
!pip install fsspec



In [44]:
import fsspec
import os
import time
def get_local_file_metadata(file_path: str):
    stat = os.stat(file_path)
    return {
        'size': stat.st_size,  # File size in bytes
        'creation_date': time.ctime(stat.st_ctime),  # st_ctime: creation time on Windows, last metadata change on Unix
        'last_modified_date': time.ctime(stat.st_mtime)  # Last modification time
    }


def load_text_file(file_path: str) -> str:
    """
    Loads a text file from the specified path using fsspec.

    Args:
        file_path (str): The path to the text file. This can be a local path or a URL for a supported file system.

    # Example usage with a local file
    local_file_path = 'file:///path/to/localfile.txt'
    print(load_text_file(local_file_path))

    # Example usage with an S3 file
    s3_file_path = 's3://mybucket/myfile.txt'
    print(load_text_file(s3_file_path))

    # Example usage with a GCS file
    gcs_file_path = 'gcs://mybucket/myfile.txt'
    print(load_text_file(gcs_file_path))

    # Example usage with an HTTP file
    http_file_path = 'https://example.com/myfile.txt'
    print(load_text_file(http_file_path))

    Returns:
        str: The content of the text file.
    """
    with fsspec.open(file_path, 'r') as file:
        content = file.read()
    return content


In [46]:
text = load_text_file('data/paul_graham/paul_graham_essay.txt')
file_metadata = get_local_file_metadata('data/paul_graham/paul_graham_essay.txt')
print(text[:1000])
print(file_metadata)



What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in t

In [48]:
# split the documents

from lightrag.core.document_splitter import DocumentSplitter
from lightrag.core.types import Document

# sentence splitting is confusing, the length needs to be smaller
metadata = {"title": "Paul Graham's essay", "path": "data/paul_graham/paul_graham_essay.txt"}
metadata.update(file_metadata)
documents = [Document(text = text, meta_data = metadata)]
splitter = DocumentSplitter(split_by="word", split_length=800, split_overlap=200)

print(documents)
print(splitter)

[Document(id=9ea48f71-62f3-4251-840a-8702a1938f67, text=

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I trie ..., meta_data={'title': "Paul Graham's essay", 'path': 'data/paul_graham/paul_graham_essay.txt', 'size': 75042, 'creation_date': 'Sun Jun 16 13:13:07 2024', 'last_modified_date': 'Sun Jun 16 13:13:07 2024'}, estimated_num_tokens=16534)]
DocumentSplitter(split_by=word, split_length=800, split_overlap=200)


In [6]:
token_limit = 16385

# maximum number of split documents (split_length=800, split_overlap=200) that fit in the token limit
# the splitter currently produces 28 sub-documents in total

token_limit // 800

20

From the document structure, we can see ``estimated_num_tokens=16534``. This exceeds the token limit, so it tells us
how to adapt our retriever.
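
The budget arithmetic can be written as a small helper. This is a rough sketch: ``max_chunks_in_budget`` and ``reserved_tokens`` are hypothetical names, and it assumes each chunk costs about as many tokens as its ``split_length``, which is only an estimate.

```python
def max_chunks_in_budget(token_limit: int, tokens_per_chunk: int, reserved_tokens: int = 0) -> int:
    """Return how many chunks fit in the context window.

    reserved_tokens is an assumed allowance for the prompt template,
    the query, and the model's answer.
    """
    if tokens_per_chunk <= 0:
        raise ValueError("tokens_per_chunk must be positive")
    available = max(token_limit - reserved_tokens, 0)
    return available // tokens_per_chunk


token_limit = 16385
print(max_chunks_in_budget(token_limit, 800))  # same arithmetic as 16385 // 800
print(max_chunks_in_budget(token_limit, 800, reserved_tokens=500))
```

Reserving part of the budget gives a lower chunk count than the plain division, which is the safer number to use when the prompt itself is long.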

In [7]:
# split the document
splitted_documents = splitter(documents = documents)
print(splitted_documents[0], len(splitted_documents))

Splitting documents: 100%|██████████| 1/1 [00:00<00:00, 32.44it/s]

['what', ' i', ' worked', ' on', ' fe', 'bruary', ' ', '202', '1', ' before', ' college', ' the', ' two', ' main', ' things', ' i', ' worked', ' on', ' outside', ' of', ' school', ' were', ' writing', ' and', ' programming', ' i', ' didnt', ' write', ' essays', ' i', ' wrote', ' what', ' beginning', ' writers', ' were', ' supposed', ' to', ' write', ' then', ' and', ' probably', ' still', ' are', ' short', ' stories', ' my', ' stories', ' were', ' awful', ' they', ' had', ' hardly', ' any', ' plot', ' just', ' characters', ' with', ' strong', ' feelings', ' which', ' i', ' imagined', ' made', ' them', ' deep', ' the', ' first', ' programs', ' i', ' tried', ' writing', ' were', ' on', ' the', ' ib', 'm', ' ', '140', '1', ' that', ' our', ' school', ' district', ' used', ' for', ' what', ' was', ' then', ' called', ' data', ' processing', ' this', ' was', ' in', ' ', '9', 'th', ' grade', ' so', ' i', ' was', ' ', '13', ' or', ' ', '14', ' the', ' school', ' districts', ' ', '140', '1', '




In [8]:
from lightrag.components.retriever import LLMRetriever
from lightrag.components.model_client import OpenAIClient

from lightrag.tracing import trace_generator_call

from lightrag.utils import setup_env

# Set up tracing of failed generator calls; the retriever exposes a ``generator`` attribute

@trace_generator_call(save_dir="developer_notes/traces")
class LoggedLLMRetriever(LLMRetriever):
    pass


top_k = 2
retriever = LoggedLLMRetriever(
    top_k = top_k, model_client=OpenAIClient(), model_kwargs={"model": "gpt-3.5-turbo"}
)

retriever.build_index_from_documents(documents=[doc.text for doc in splitted_documents[0:20]])

print(retriever)
retriever.generator.print_prompt()

LoggedLLMRetriever(
  (generator): Generator(
    model_kwargs={'model': 'gpt-3.5-turbo'}, 
    (prompt): Prompt(
      template: <SYS>
      Your are a retriever. Given a list of documents in the context, \
      you will retrieve a list of {{top_k}} indices(int) of the documents that are most relevant to the query. You will output a list as follows:
      [<id from the most relevent with top_k options>]
      <Documents>
      {% for doc in documents %}
      ```{{ loop.index - 1}}. {{doc}}```
      {% endfor %}
      </Documents>
      </SYS>
      Query: {{input_str}}
      You:
      , preset_prompt_kwargs: {'top_k': 2, 'documents': ['what i worked on february 2021 before college the two main things i worked on outside of school were writing and programming i didnt write essays i wrote what beginning writers were supposed to write then and probably still are short stories my stories were awful they had hardly any plot just characters with strong feelings which i imagined made them

Note: To evaluate the retriever we need the ground truth. You can save the split documents and then label the data.

Here we did that, the ground truth is (indices)

In [9]:
query = "What happened at Viaweb and Interleaf?"
output = retriever(input=query)
print(output)

[RetrieverOutput(doc_indices=[8, 10], doc_scores=None, query=None, documents=None)]
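
Once labeled indices exist, retrieved indices can be compared against them. The sketch below uses made-up toy values, not the real labels for this essay, and ``precision_recall`` is a hypothetical helper.

```python
from typing import Iterable, Tuple


def precision_recall(retrieved: Iterable[int], relevant: Iterable[int]) -> Tuple[float, float]:
    """Compute precision and recall of retrieved indices against labeled indices."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# toy values for illustration only
toy_retrieved = [1, 4]
toy_labels = [4, 5, 6]
precision, recall = precision_recall(toy_retrieved, toy_labels)
print(f"precision={precision:.2f} recall={recall:.2f}")
```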


In [10]:
# output[0].documents = [splitted_documents[idx] for idx in output[0].doc_indices]
for per_query_output in output:
    per_query_output.documents = [splitted_documents[idx] for idx in per_query_output.doc_indices]
print("output.documents", output[0].documents)
len(output)

output.documents [Document(id=e3300e8c-3f5e-4cef-a8b4-c7e7239821dc, text=ases 4 to 5 feet on a side one day in late 1994 as i was stretching one of these monsters there was something on the radio about a famous fund manager he wasnt that much older than me and was super rich the thought suddenly occurred to me why dont i become rich then ill be able to work on whatever i want meanwhile id been hearing more and more about this new thing called the world wide web robert  ..., meta_data={'title': "Paul Graham's essay", 'path': 'data/paul_graham/paul_graham_essay.txt'}, estimated_num_tokens=800, parent_doc_id=c760fb7f-a348-461b-9ace-aaedbca2acd2), Document(id=0eaf474e-1f9a-4ba3-8cac-ce7702f0cbaf, text=minded people i know and in completely different ways if you could see inside rtms brain it would look like a colonial new england church and if you could see inside trevors it would look like the worst excesses of austrian rococo we opened for business with 6 stores in january 1996 it was ju

1

In [11]:
# check the first document
print(output[0].documents[0].text)
print("interleaf" in output[0].documents[0].text.lower())
print("viaweb" in output[0].documents[0].text.lower())

ases 4 to 5 feet on a side one day in late 1994 as i was stretching one of these monsters there was something on the radio about a famous fund manager he wasnt that much older than me and was super rich the thought suddenly occurred to me why dont i become rich then ill be able to work on whatever i want meanwhile id been hearing more and more about this new thing called the world wide web robert morris showed it to me when i visited him in cambridge where he was now in grad school at harvard it seemed to me that the web would be a big deal id seen what graphical user interfaces had done for the popularity of microcomputers it seemed like the web would do the same for the internet if i wanted to get rich here was the next train leaving the station i was right about that part what i got wrong was the idea i decided we should start a company to put art galleries online i cant honestly say after reading so many y combinator applications that this was the worst startup idea ever but it was

In [12]:
# check the second document
print(output[0].documents[1].text)
print("interleaf" in output[0].documents[1].text.lower())
print("viaweb" in output[0].documents[1].text.lower())

minded people i know and in completely different ways if you could see inside rtms brain it would look like a colonial new england church and if you could see inside trevors it would look like the worst excesses of austrian rococo we opened for business with 6 stores in january 1996 it was just as well we waited a few months because although we worried we were late we were actually almost fatally early there was a lot of talk in the press then about ecommerce but not many people actually wanted online stores 8 there were three main parts to the software the editor which people used to build sites and which i wrote the shopping cart which robert wrote and the manager which kept track of orders and statistics and which trevor wrote in its time the editor was one of the best generalpurpose site builders i kept the code tight and didnt have to integrate with any other software except roberts and trevors so it was quite fun to work on if all id had to do was work on this software the next 3

## Reranker


In [13]:
# from lightrag.components.retriever import RerankerRetriever

# query = "Li"
# strings = ["Li", "text2"]

# retriever = RerankerRetriever(top_k=1)
# print(retriever)
# retriever.build_index_from_documents(documents=documents)
# print(retriever.documents)
# output = retriever.retrieve(query)
# print(output)

In [14]:
# retriever.build_index_from_documents(documents=strings)

In [15]:
# output = retriever.retrieve(query)

## FAISSRetriever

Semantic search usually requires splitting the text (for example with a text splitter) and computing embeddings. This preprocessing is use-case specific, so it is better done by users in a data transformation stage. The resulting embeddings can then be treated as the input documents.

In this case, the real index is the set of split documents together with their embeddings. We will use ``LocalDB`` to handle the data transformation and the storage of the index.
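
The splitting step can be pictured as a sliding window over words. The function below is a simplified sketch and not ``DocumentSplitter`` itself; ``split_by_word`` is a hypothetical name, and the real splitter may handle whitespace and the final window differently.

```python
from typing import List


def split_by_word(text: str, split_length: int, split_overlap: int) -> List[str]:
    """Split text into windows of split_length words that share split_overlap words."""
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")
    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + split_length]))
        if start + split_length >= len(words):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_by_word("a b c d e f g", split_length=4, split_overlap=2)
print(chunks)
```

A larger overlap reduces the chance that a relevant passage is cut in half, at the cost of more chunks to embed and store.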



In [16]:
from lightrag.core.db import LocalDB

db = LocalDB()
db.load_documents(documents)
len(db.documents)

1

Let us see how to create the data transformers using only the component config:

In [17]:
# create data transformer
data_transformer_config = {  # attribute and its config to recreate the component
        "embedder":{
            "component_name": "Embedder",
            "component_config": {
                "model_client": {
                    "component_name": "OpenAIClient",
                    "component_config": {},
                },
                "model_kwargs": {
                    "model": "text-embedding-3-small",
                    "dimensions": 256,
                    "encoding_format": "float",
                },
            },
        },
        "document_splitter": {
            "component_name": "DocumentSplitter",
            "component_config": {
                "split_by": "word",
                "split_length": 400,
                "split_overlap": 200,
            },
        },
        "to_embeddings": {
            "component_name": "ToEmbeddings",
            "component_config": {
                "vectorizer": {
                    "component_name": "Embedder",
                    "component_config": {
                        "model_client": {
                            "component_name": "OpenAIClient",
                            "component_config": {},
                        },
                        "model_kwargs": {
                            "model": "text-embedding-3-small",
                            "dimensions": 256,
                            "encoding_format": "float",
                        },
                    },
                    # the other config is to instantiate the entity (class and function) with the given config as arguments
                    # "entity_state": "storage/embedder.pkl", # this will load back the state of the entity
                },
                "batch_size": 100,
            },
        },
    }

In [18]:
from lightrag.utils.config import new_components_from_config

components = new_components_from_config(data_transformer_config)
print(components)

{'embedder': Embedder(
  model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
  (model_client): OpenAIClient()
), 'document_splitter': DocumentSplitter(split_by=word, split_length=400, split_overlap=200), 'to_embeddings': ToEmbeddings(
  batch_size=100
  (vectorizer): Embedder(
    model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
    (model_client): OpenAIClient()
  )
  (batch_embedder): BatchEmbedder(
    (embedder): Embedder(
      model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
      (model_client): OpenAIClient()
    )
  )
)}


In [19]:
from lightrag.core.component import Sequential

data_transformer = Sequential(components["document_splitter"], components["to_embeddings"])
data_transformer

Sequential(
  (0): DocumentSplitter(split_by=word, split_length=400, split_overlap=200)
  (1): ToEmbeddings(
    batch_size=100
    (vectorizer): Embedder(
      model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
      (model_client): OpenAIClient()
    )
    (batch_embedder): BatchEmbedder(
      (embedder): Embedder(
        model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
        (model_client): OpenAIClient()
      )
    )
  )
)

The code above is equivalent to building the components by hand from a settings config:

```python

# excerpt from a class method; ``self`` holds the settings loaded from the config below
vectorizer = Embedder(
    model_client=OpenAIClient(),
    # batch_size=self.vectorizer_settings["batch_size"],
    model_kwargs=self.vectorizer_settings["model_kwargs"],
)
# TODO: check document splitter, how to process the parent and order of the chunks
text_splitter = DocumentSplitter(
    split_by=self.text_splitter_settings["split_by"],
    split_length=self.text_splitter_settings["chunk_size"],
    split_overlap=self.text_splitter_settings["chunk_overlap"],
)
self.data_transformer = Sequential(
    text_splitter,
    ToEmbeddings(
        vectorizer=vectorizer,
        batch_size=self.vectorizer_settings["batch_size"],
    ),
)
```

Config:

```yaml
vectorizer:
  batch_size: 100
  model_kwargs:
    model: text-embedding-3-small
    dimensions: 256
    encoding_format: float

retriever:
  top_k: 2

generator:
  model: gpt-3.5-turbo
  temperature: 0.3
  stream: false

text_splitter:
  split_by: word
  chunk_size: 400
  chunk_overlap: 200
```
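
To give a feel for how a nested ``component_name`` / ``component_config`` dict such as ``data_transformer_config`` can be turned into objects, here is a toy sketch. It is not how ``new_components_from_config`` is implemented: the registry, the ``Toy*`` classes, and ``build_component`` are all hypothetical stand-ins.

```python
from typing import Any, Dict


class ToyClient:
    """Hypothetical stand-in for a model client."""


class ToyEmbedder:
    """Hypothetical stand-in for an embedder that owns a client."""

    def __init__(self, model_client: ToyClient, model_kwargs: Dict[str, Any]):
        self.model_client = model_client
        self.model_kwargs = model_kwargs


class ToySplitter:
    """Hypothetical stand-in for a document splitter."""

    def __init__(self, split_by: str, split_length: int, split_overlap: int):
        self.split_by = split_by
        self.split_length = split_length
        self.split_overlap = split_overlap


# assumed registry mapping config names to classes
REGISTRY = {"ToyClient": ToyClient, "ToyEmbedder": ToyEmbedder, "ToySplitter": ToySplitter}


def is_component_spec(value: Any) -> bool:
    return isinstance(value, dict) and "component_name" in value and "component_config" in value


def build_component(spec: Dict[str, Any]) -> Any:
    """Instantiate a component, recursing into nested component specs."""
    cls = REGISTRY[spec["component_name"]]
    kwargs = {
        key: build_component(value) if is_component_spec(value) else value
        for key, value in spec["component_config"].items()
    }
    return cls(**kwargs)


toy_config = {
    "embedder": {
        "component_name": "ToyEmbedder",
        "component_config": {
            "model_client": {"component_name": "ToyClient", "component_config": {}},
            "model_kwargs": {"model": "text-embedding-3-small", "dimensions": 256},
        },
    },
    "document_splitter": {
        "component_name": "ToySplitter",
        "component_config": {"split_by": "word", "split_length": 400, "split_overlap": 200},
    },
}
toy_components = {name: build_component(spec) for name, spec in toy_config.items()}
print(sorted(toy_components))
```

Keeping construction driven by plain dicts means the same pipeline can be recreated from a YAML or JSON file without touching code.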

In [20]:
# test using only the document splitter
text_split = components["document_splitter"](documents)
print(text_split)


Splitting documents: 100%|██████████| 1/1 [00:00<00:00,  9.05it/s]

[Document(id=a9596a31-04f7-4c14-a64e-4dc306ca3231, text=

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I trie ..., meta_data={'title': "Paul Graham's essay", 'path': 'data/paul_graham/paul_graham_essay.txt'}, estimated_num_tokens=499, parent_doc_id=c760fb7f-a348-461b-9ace-aaedbca2acd2), Document(id=2720e1a4-070f-42c6-b4fc-14108757a4ce, text=spectacularly loud printer.

I was puzzled by the 1401. I couldn't figure out what to do with it. And in retrospect there's not much I could have done with it. The only form of input to programs was data stored on punched cards, and I didn't have any data stored on punched cards. The only other option was to 




In [21]:
# test the whole data transformer
embeddings = data_transformer(documents)
print(embeddings)

Splitting documents: 100%|██████████| 1/1 [00:00<00:00, 30.77it/s]
Batch embedding documents: 100%|██████████| 1/1 [00:01<00:00,  1.16s/it]
Adding embeddings to documents from batch: 1it [00:00, 6853.44it/s]

[Document(id=b9817744-a6e6-4b1b-b942-b2bf12a0b625, text=

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I trie ..., meta_data={'title': "Paul Graham's essay", 'path': 'data/paul_graham/paul_graham_essay.txt'}, estimated_num_tokens=499, vector=[-0.07907507, 0.038137976, -0.00067343825, -0.019498399, 0.12735985, -0.03413015, 0.012397228, 0.11991674, -0.072140895, 0.09001707]..., parent_doc_id=c760fb7f-a348-461b-9ace-aaedbca2acd2), Document(id=ae353d9c-bfbb-4a03-b7c3-f49916ca3841, text=spectacularly loud printer.

I was puzzled by the 1401. I couldn't figure out what to do with it. And in retrospect there's not much I could have done with it. The on




In [22]:
db.register_transformer(data_transformer)
db.transformer_setups

{'Sequential__0_1_1.vectorizer_1.vectorizer.model_client_1.batch_embedder_': Sequential(
   (0): DocumentSplitter(split_by=word, split_length=400, split_overlap=200)
   (1): ToEmbeddings(
     batch_size=100
     (vectorizer): Embedder(
       model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
       (model_client): OpenAIClient()
     )
     (batch_embedder): BatchEmbedder(
       (embedder): Embedder(
         model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
         (model_client): OpenAIClient()
       )
     )
   )
 )}

In [23]:
db.transform_data(transformer=data_transformer)

Splitting documents: 100%|██████████| 1/1 [00:00<00:00, 50.26it/s]
Batch embedding documents: 100%|██████████| 1/1 [00:00<00:00,  1.23it/s]
Adding embeddings to documents from batch: 1it [00:00, 27060.03it/s]


'Sequential__0_1_1.vectorizer_1.vectorizer.model_client_1.batch_embedder_'

In [24]:
keys = list(db.transformed_documents.keys())
documents = db.transformed_documents[keys[0]]
vectors = [doc.vector for doc in documents]
print(len(vectors), type(vectors), vectors[0][0:10])

# check if all embeddings are the same length
dimensions = set([len(vector) for vector in vectors])
dimensions

68 <class 'list'> [-0.07907507, 0.038137976, -0.00067343825, -0.019498399, 0.12735985, -0.03413015, 0.012397228, 0.11991674, -0.072140895, 0.09001707]


{256}

In [25]:
# check the estimated token counts of all split documents
lengths = set([doc.estimated_num_tokens for doc in documents])
lengths

{316,
 466,
 467,
 468,
 469,
 471,
 472,
 474,
 476,
 477,
 478,
 479,
 480,
 481,
 482,
 483,
 484,
 485,
 486,
 487,
 488,
 489,
 490,
 492,
 493,
 494,
 496,
 497,
 498,
 499,
 500,
 502,
 509,
 510,
 531}

In [26]:
total = 0
for doc in documents:
    if len(doc.vector) != 256:
        print(doc)
        total+=1
print(total)

0


In [27]:
# save the db states, including the original documents with len 1, and transformed documents
db.save_state("developer_notes/db_states.pkl")

In [28]:
# restore the db from the saved state

restored_db = LocalDB.load_state("developer_notes/db_states.pkl")
restored_db

LocalDocumentDB(documents=[Document(id=c760fb7f-a348-461b-9ace-aaedbca2acd2, text=

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I trie ..., meta_data={'title': "Paul Graham's essay", 'path': 'data/paul_graham/paul_graham_essay.txt'}, estimated_num_tokens=16534)], transformed_documents={'Sequential__0_1_1.vectorizer_1.vectorizer.model_client_1.batch_embedder_': [Document(id=a2144305-f0cb-4312-a025-691a6676bcdb, text=

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still

In [29]:
len_documents=len(restored_db.documents)
keys = list(restored_db.transformed_documents.keys())
len_transformed_documents=len(restored_db.transformed_documents[keys[0]])
print(len_documents, len_transformed_documents, keys)

1 68 ['Sequential__0_1_1.vectorizer_1.vectorizer.model_client_1.batch_embedder_']


In [30]:
# let's print out part of the vector
restored_db.transformed_documents[keys[0]][0].vector[0:10]


[-0.07907507,
 0.038137976,
 -0.00067343825,
 -0.019498399,
 0.12735985,
 -0.03413015,
 0.012397228,
 0.11991674,
 -0.072140895,
 0.09001707]

We have now prepared the embeddings that ``FAISSRetriever`` needs. ``FAISSRetriever`` is a thin wrapper around the FAISS library, which provides an efficient way to search for nearest neighbors in the embedding space.
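
Conceptually, nearest-neighbor search ranks the stored vectors by similarity to the query vector. The brute-force sketch below uses plain Python and cosine similarity instead of FAISS, purely for illustration; the function names are hypothetical, and the scores are not on the same scale as the ``metric=prob`` scores that ``FAISSRetriever`` reports.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, or 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_neighbors(
    query: Sequence[float], vectors: Sequence[Sequence[float]], top_k: int
) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs for the top_k most similar vectors."""
    scored = [(idx, cosine_similarity(query, vec)) for idx, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = top_k_neighbors([1.0, 0.1], toy_vectors, top_k=2)
print([idx for idx, _ in neighbors])
```

A brute-force scan like this costs one similarity computation per stored vector, which is the work a dedicated library is designed to speed up on large collections.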

In [31]:

from lightrag.components.retriever import FAISSRetriever



retriever = FAISSRetriever(embedder=components["embedder"], top_k=5)
print(retriever)

FAISSRetriever(
  top_k=5, metric=prob
  (embedder): Embedder(
    model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
    (model_client): OpenAIClient()
  )
)


In [32]:
documents = restored_db.transformed_documents[keys[0]]
vectors = [doc.vector for doc in documents]
print(len(vectors), type(vectors), vectors[0][0:10])

# check if all embeddings are the same length
dimensions = set([len(vector) for vector in vectors])
dimensions

68 <class 'list'> [-0.07907507, 0.038137976, -0.00067343825, -0.019498399, 0.12735985, -0.03413015, 0.012397228, 0.11991674, -0.072140895, 0.09001707]


{256}

In [33]:
# convert vectors to numpy array
import numpy as np
vectors_np = np.array(vectors, dtype=np.float32)

In [34]:
retriever.build_index_from_documents(documents=vectors)

In [35]:
# retrieve for multiple queries in one call
query = "What happened at Viaweb and Interleaf?"
second_query = "What company did Paul Graham co-found?"

output = retriever(input=[query, second_query])
output

[RetrieverOutput(doc_indices=[24, 25, 17, 32, 38], doc_scores=[0.7670000195503235, 0.7459999918937683, 0.734000027179718, 0.7329999804496765, 0.7200000286102295], query='What happened at Viaweb and Interleaf?', documents=None),
 RetrieverOutput(doc_indices=[47, 44, 49, 45, 46], doc_scores=[0.800000011920929, 0.7900000214576721, 0.7879999876022339, 0.7799999713897705, 0.7749999761581421], query='What company did Paul Graham co-found?', documents=None)]

In [36]:
# attach the retrieved documents to each output
for per_query_output in output:
    per_query_output.documents = [documents[idx] for idx in per_query_output.doc_indices]

output

[RetrieverOutput(doc_indices=[24, 25, 17, 32, 38], doc_scores=[0.7670000195503235, 0.7459999918937683, 0.734000027179718, 0.7329999804496765, 0.7200000286102295], query='What happened at Viaweb and Interleaf?', documents=[Document(id=0c16c813-4274-41fc-8a7c-6798503ef87e, text=online stores. At first this was going to be normal desktop software, which in those days meant Windows software. That was an alarming prospect, because neither of us knew how to write Windows software or wanted to learn. We lived in the Unix world. But we decided we'd at least try writing a prototype store builder on Unix. Robert wrote a shopping cart, and I wrote a new site generator for stores  ..., meta_data={'title': "Paul Graham's essay", 'path': 'data/paul_graham/paul_graham_essay.txt'}, estimated_num_tokens=469, vector=[-0.044771, 0.077939205, -0.043221936, 0.016963774, 0.0947663, -0.09500929, -0.009089364, 0.051574733, -0.09397658, 0.06463547]..., parent_doc_id=c760fb7f-a348-461b-9ace-aaedbca2acd2), Docum

In the RAG notes, we will combine this with ``Generator`` to get the end-to-end response.

## BM25Retriever


In [37]:
from lightrag.components.retriever import InMemoryBM25Retriever

index_strings = [doc.text for doc in documents]

retriever = InMemoryBM25Retriever(documents=index_strings)

# retriever.build_index_from_documents(documents=index_strings)

output = retriever(input=[query, second_query])
output

[RetrieverOutput(doc_indices=[63, 25, 26, 7, 64], doc_scores=[7.0489595887523615, 6.594032462786561, 6.424140827741723, 6.1951843542008325, 5.972384167287728], query='What happened at Viaweb and Interleaf?', documents=None),
 RetrieverOutput(doc_indices=[60, 59, 37, 38, 30], doc_scores=[7.731069752214729, 7.5611781171698915, 3.3894780534098707, 3.00588461790695, 2.9752086523619306], query='What company did Paul Graham co-found?', documents=None)]

In [38]:
retriever = InMemoryBM25Retriever(top_k=1)
retriever.build_index_from_documents(["hello world", "world is beautiful", "today is a good day"])
output = retriever.retrieve("hello")
output

[RetrieverOutput(doc_indices=[0], doc_scores=[0.6229580777634034], query='hello', documents=None)]

In [40]:
# save the index

path = "developer_notes/bm25_index.json"
retriever.save_to_file(path)

In [41]:
retriever_loaded = InMemoryBM25Retriever.load_from_file(path)

In [42]:
# test the loaded index
output = retriever_loaded.retrieve("hello", top_k=1)
output

[RetrieverOutput(doc_indices=[0], doc_scores=[0.6229580777634034], query='hello', documents=None)]

In [43]:
retriever_loaded

InMemoryBM25Retriever(
  top_k=1, k1=1.5, b=0.75, epsilon=0.25
  Number of documents: 3
)
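
The ``k1``, ``b``, and ``epsilon`` parameters shown above are the usual Okapi BM25 knobs. The sketch below is a from-scratch version of the classic formula, not the ``InMemoryBM25Retriever`` code. Whitespace tokenization and the ``epsilon`` floor for negative IDF values (a convention found in some BM25 implementations) are assumptions, and ``bm25_scores`` is a hypothetical name.

```python
import math
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # assumed tokenizer: lowercase and split on whitespace
    return text.lower().split()


def bm25_scores(
    query: str, documents: List[str], k1: float = 1.5, b: float = 0.75, epsilon: float = 0.25
) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    corpus = [tokenize(doc) for doc in documents]
    total_tokens = sum(len(doc) for doc in corpus)
    if total_tokens == 0:
        raise ValueError("documents must contain at least one token")
    num_docs = len(corpus)
    avg_len = total_tokens / num_docs

    # document frequency: in how many documents each term appears
    doc_freq: Counter = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))

    idf = {term: math.log((num_docs - df + 0.5) / (df + 0.5)) for term, df in doc_freq.items()}
    # assumed convention: replace negative IDF values with epsilon * average IDF
    floor = epsilon * sum(idf.values()) / len(idf)
    idf = {term: value if value >= 0 else floor for term, value in idf.items()}

    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        length_norm = 1 - b + b * len(doc) / avg_len  # penalizes longer documents
        score = 0.0
        for term in tokenize(query):
            tf = term_freq.get(term, 0)
            if tf:
                score += idf[term] * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


toy_docs = ["hello world", "world is beautiful", "today is a good day"]
toy_scores = bm25_scores("hello", toy_docs)
print(toy_scores)
```

For the query ``hello`` on these three documents, the formula gives roughly 0.623 for the first document and 0 for the others, in line with the score in the output above.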