We start with high-precision, high-recall retrieval methods, because LightRAG helps you optimize the later stages of your search/retrieval pipeline. The first stage often comes with cloud database providers and their built-in search and filter support.

## LLMRetriever

Indexing consists of forming the prompt from the target documents and setting the ``top_k`` parameter.
``retrieve`` runs the ``generator`` and parses the response into the standard ``RetrieverOutputType``, which is a list of
``RetrieverOutput``. Each ``RetrieverOutput`` contains the document indices and, optionally, their scores.
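
The two steps above can be sketched without calling any model. This is a minimal illustration, not LightRAG's implementation: ``build_prompt`` and ``parse_indices`` are hypothetical helpers, and the prompt wording is an assumption. It shows how documents are numbered in a prompt and how a raw model reply such as ``[6, 9]`` can be parsed defensively into a list of valid indices.

```python
import json
import re
from typing import List, Sequence


def build_prompt(documents: Sequence[str], query: str, top_k: int) -> str:
    """Number each document so the model can answer with indices (hypothetical helper)."""
    lines = [
        f"Retrieve the {top_k} document indices most relevant to the query.",
        "Answer with a list such as [0, 1].",
        "<Documents>",
    ]
    lines += [f"{i}. {doc}" for i, doc in enumerate(documents)]
    lines += ["</Documents>", f"Query: {query}"]
    return "\n".join(lines)


def parse_indices(response: str, num_documents: int, top_k: int) -> List[int]:
    """Extract the first bracketed list from the reply and keep valid, unique indices."""
    match = re.search(r"\[[^\[\]]*\]", response)
    if match is None:
        return []
    try:
        values = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    indices: List[int] = []
    for value in values:
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if is_int and 0 <= value < num_documents and value not in indices:
            indices.append(value)
    return indices[:top_k]


prompt = build_prompt(["first chunk", "second chunk"], "an example query", top_k=1)
print(prompt)
print(parse_indices("[6, 9]", num_documents=20, top_k=2))
```

Filtering out-of-range values matters because the model can return indices that do not exist in the prompt.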

In [1]:
# prepare the document
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-06-13 12:51:39--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2024-06-13 12:51:39 (3.36 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



In [1]:
# use fsspec to read the document
!pip install fsspec



In [2]:
import fsspec

def load_text_file(file_path: str) -> str:
    """
    Loads a text file from the specified path using fsspec.

    Args:
        file_path (str): The path to the text file. This can be a local path or a URL for a supported file system.

    Returns:
        str: The content of the text file.

    Examples:
        # Example usage with a local file
        local_file_path = 'file:///path/to/localfile.txt'
        print(load_text_file(local_file_path))

        # Example usage with an S3 file
        s3_file_path = 's3://mybucket/myfile.txt'
        print(load_text_file(s3_file_path))

        # Example usage with a GCS file
        gcs_file_path = 'gcs://mybucket/myfile.txt'
        print(load_text_file(gcs_file_path))

        # Example usage with an HTTP file
        http_file_path = 'https://example.com/myfile.txt'
        print(load_text_file(http_file_path))
    """
    with fsspec.open(file_path, 'r') as file:
        content = file.read()
    return content


In [4]:
text = load_text_file('data/paul_graham/paul_graham_essay.txt')
print(text[:1000])



What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in t

In [5]:
# split the documents

from lightrag.core.document_splitter import DocumentSplitter
from lightrag.core.types import Document

# sentence splitting is confusing, the length needs to be smaller
documents = [Document(text = text, meta_data = {"title": "Paul Graham's essay", "path": "data/paul_graham/paul_graham_essay.txt"})]
splitter = DocumentSplitter(split_by="token", split_length=800, split_overlap=200)

print(documents)
print(splitter)

[Document(id=cbbc5062-7c75-4e3a-aed1-c523ab3a914a, text=

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I trie ..., meta_data={'title': "Paul Graham's essay", 'path': 'data/paul_graham/paul_graham_essay.txt'}, estimated_num_tokens=16534)]
DocumentSplitter(split_by=token, split_length=800, split_overlap=200)


In [6]:
token_limit = 16385

# compute the maximum number of 800-token chunks that fit within the token limit
# (with split_length=800 and split_overlap=200 the splitter yields 28 subdocuments in total)

token_limit // 800

20

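From the document structure we can see ``estimated_num_tokens=16534``. This number helps us
adapt our retriever: the whole document does not fit within the token limit, so only part of the chunks can go into a single prompt.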
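
The arithmetic behind these numbers can be sketched as follows. This assumes a plain sliding window where each chunk starts ``split_length - split_overlap`` tokens after the previous one; ``estimate_num_chunks`` is a hypothetical helper, and the real ``DocumentSplitter`` may count tokens differently.

```python
import math


def estimate_num_chunks(num_tokens: int, split_length: int, split_overlap: int) -> int:
    """Estimate the chunk count of a sliding window with overlap (hypothetical helper)."""
    if num_tokens <= split_length:
        return 1
    stride = split_length - split_overlap  # tokens advanced per chunk
    return math.ceil((num_tokens - split_overlap) / stride)


estimated_num_tokens = 16534  # taken from the Document printed above
token_limit = 16385

num_chunks = estimate_num_chunks(estimated_num_tokens, split_length=800, split_overlap=200)
max_chunks_in_prompt = token_limit // 800

print(num_chunks)            # 28
print(max_chunks_in_prompt)  # 20
```

Under these assumptions the estimate agrees with the 28 subdocuments noted above, and at most 20 full chunks fit within the token limit, before counting the prompt instructions and the query.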

In [7]:
# split the document
splitted_documents = splitter(documents = documents)
print(splitted_documents[0], len(splitted_documents))

Splitting documents: 100%|██████████| 1/1 [00:00<00:00, 21.89it/s]

['\n\n', 'what', ' i', ' worked', ' on', '\n\n', 'fe', 'bruary', ' ', '202', '1', '\n\n', 'before', ' college', ' the', ' two', ' main', ' things', ' i', ' worked', ' on', ',', ' outside', ' of', ' school', ',', ' were', ' writing', ' and', ' programming', '.', ' i', ' didn', "'t", ' write', ' essays', '.', ' i', ' wrote', ' what', ' beginning', ' writers', ' were', ' supposed', ' to', ' write', ' then', ',', ' and', ' probably', ' still', ' are', ':', ' short', ' stories', '.', ' my', ' stories', ' were', ' awful', '.', ' they', ' had', ' hardly', ' any', ' plot', ',', ' just', ' characters', ' with', ' strong', ' feelings', ',', ' which', ' i', ' imagined', ' made', ' them', ' deep', '.\n\n', 'the', ' first', ' programs', ' i', ' tried', ' writing', ' were', ' on', ' the', ' ib', 'm', ' ', '140', '1', ' that', ' our', ' school', ' district', ' used', ' for', ' what', ' was', ' then', ' called', ' "', 'data', ' processing', '."', ' this', ' was', ' in', ' ', '9', 'th', ' grade', ',', 




In [8]:
from lightrag.components.retriever import LLMRetriever
from lightrag.components.model_client import OpenAIClient

from lightrag.tracing import trace_generator_call

from lightrag.utils import setup_env

# 1. set up the tracing for failed call as the retriever has generator attribute

@trace_generator_call(save_dir="developer_notes/traces")
class LoggedLLMRetriever(LLMRetriever):
    pass


top_k = 2
retriever = LoggedLLMRetriever(
    top_k=top_k, model_client=OpenAIClient(), model_kwargs={"model": "gpt-3.5-turbo"}
)

# 2. index only the first 20 chunks so that the prompt stays within the token limit
retriever.build_index_from_documents(documents=[doc.text for doc in splitted_documents[0:20]])

print(retriever)

LoggedLLMRetriever(
  (generator): Generator(
    model_kwargs={'model': 'gpt-3.5-turbo'}, 
    (prompt): Prompt(
      template: <SYS>
      Your are a retriever. Given a list of documents in the context, \
      you will retrieve a list of {{top_k}} indices(int) of the documents that are most relevant to the query. You will output a list as follows:
      [<id from the most relevent with top_k options>]
      <Documents>
      {% for doc in documents %}
      ```{{ loop.index - 1}}. {{doc}}```
      {% endfor %}
      </Documents>
      </SYS>
      Query: {{input_str}}
      You:
      , preset_prompt_kwargs: {'top_k': 2, 'documents': ['\n\nwhat i worked on\n\nfebruary 2021\n\nbefore college the two main things i worked on, outside of school, were writing and programming. i didn\'t write essays. i wrote what beginning writers were supposed to write then, and probably still are: short stories. my stories were awful. they had hardly any plot, just characters with strong feelings, whic

Note: To judge the retriever, we need to know the ground truth. You can save the split documents and then label the data.

Here we did that; the ground truth is expressed as document indices.
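
One way to bootstrap such labels is a keyword pass followed by manual review. The sketch below is an illustration under stated assumptions: ``keyword_candidates`` and ``recall`` are hypothetical helpers, and ``toy_chunks`` is made-up placeholder text, not the labeled data from this notebook.

```python
from typing import List, Sequence


def keyword_candidates(chunks: Sequence[str], keywords: Sequence[str]) -> List[int]:
    """Return indices of chunks that mention at least one keyword (case-insensitive)."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        index
        for index, chunk in enumerate(chunks)
        if any(keyword in chunk.lower() for keyword in lowered)
    ]


def recall(retrieved: Sequence[int], relevant: Sequence[int]) -> float:
    """Fraction of relevant indices that appear among the retrieved indices."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    return len(relevant_set & set(retrieved)) / len(relevant_set)


# Placeholder chunks for illustration only.
toy_chunks = [
    "we painted still lifes in class",
    "i got a job at interleaf",
    "we started viaweb in 1995",
]

candidates = keyword_candidates(toy_chunks, ["Viaweb", "Interleaf"])
score = recall(retrieved=[0, 2], relevant=candidates)
print(candidates)  # [1, 2]
print(score)       # 0.5
```

Keyword matches are only candidates; a chunk can mention a name without answering the query, so the final labels still need a human check.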

In [7]:
query = "What happened at Viaweb and Interleaf?"
output = retriever(input=query)
print(output)

[RetrieverOutput(doc_indices=[6, 9], doc_scores=None, query=None, documents=None)]


In [8]:
retriever.generator.print_prompt()

Prompt:
<SYS>
Your are a retriever. Given a list of documents in the context, \
you will retrieve a list of 2 indices(int) of the documents that are most relevant to the query. You will output a list as follows:
[<id from the most relevent with top_k options>]
<Documents>
```0. 

what i worked on

february 2021

before college the two main things i worked on, outside of school, were writing and programming. i didn't write essays. i wrote what beginning writers were supposed to write then, and probably still are: short stories. my stories were awful. they had hardly any plot, just characters with strong feelings, which i imagined made them deep.

the first programs i tried writing were on the ibm 1401 that our school district used for what was then called "data processing." this was in 9th grade, so i was 13 or 14. the school district's 1401 happened to be in the basement of our junior high school, and my friend rich draves and i got permission to use it. it was like a mini bond villain

In [9]:
# attach the retrieved Document objects to each retriever output
for retriever_output in output:
    retriever_output.documents = [splitted_documents[idx] for idx in retriever_output.doc_indices]
print("output.documents", output[0].documents)

output.documents [Document(id=78ed01b8-1afc-4d18-818d-2ff92b69bb06, text=. most visual perception is handled by low-level processes that merely tell your brain "that's a water droplet" without telling you details like where the lightest and darkest points are, or "that's a bush" without telling you the shape and position of every leaf. this is a feature of brains, not a bug. in everyday life it would be distracting to notice every leaf on every bush. but when you have  ..., meta_data={'title': "Paul Graham's essay", 'path': 'data/paul_graham/paul_graham_essay.txt'}, estimated_num_tokens=800, parent_doc_id=23ed9efb-80e7-4eec-9bb3-bafc3ebc6811), Document(id=40b950a7-571d-4846-934d-0a3c9992c3b6, text= i moved to new york i became her de facto studio assistant.

she liked to paint on big, square canvases, 4 to 5 feet on a side. one day in late 1994 as i was stretching one of these monsters there was something on the radio about a famous fund manager. he wasn't that much older than me, and 

In [10]:
print(output[0].documents[0].text)
print("interleaf" in output[0].documents[0].text.lower())
print("viaweb" in output[0].documents[0].text.lower())

. most visual perception is handled by low-level processes that merely tell your brain "that's a water droplet" without telling you details like where the lightest and darkest points are, or "that's a bush" without telling you the shape and position of every leaf. this is a feature of brains, not a bug. in everyday life it would be distracting to notice every leaf on every bush. but when you have to paint something, you have to look more closely, and when you do there's a lot to see. you can still be noticing new things after days of trying to paint something people usually take for granted, just as you can after days of trying to write an essay about something people usually take for granted.

this is not the only way to paint. i'm not 100% sure it's even a good way to paint. but it seemed a good enough bet to be worth trying.

our teacher, professor ulivi, was a nice guy. he could see i worked hard, and gave me a good grade, which he wrote down in a sort of passport each student had.

In [11]:
print(output[0].documents[1].text)
print("interleaf" in output[0].documents[1].text.lower())
print("viaweb" in output[0].documents[1].text.lower())

 i moved to new york i became her de facto studio assistant.

she liked to paint on big, square canvases, 4 to 5 feet on a side. one day in late 1994 as i was stretching one of these monsters there was something on the radio about a famous fund manager. he wasn't that much older than me, and was super rich. the thought suddenly occurred to me: why don't i become rich? then i'll be able to work on whatever i want.

meanwhile i'd been hearing more and more about this new thing called the world wide web. robert morris showed it to me when i visited him in cambridge, where he was now in grad school at harvard. it seemed to me that the web would be a big deal. i'd seen what graphical user interfaces had done for the popularity of microcomputers. it seemed like the web would do the same for the internet.

if i wanted to get rich, here was the next train leaving the station. i was right about that part. what i got wrong was the idea. i decided we should start a company to put art galleries on

## Reranker


In [9]:
from lightrag.components.retriever import RerankerRetriever

query = "Li"
documents = ["Li", "text2"]

retriever = RerankerRetriever(top_k=1)
print(retriever)
# retriever.build_index_from_documents(documents=documents)
# print(retriever.documents)
# output = retriever.retrieve(query)
# print(output)

RerankerRetriever(
  (model_client): TransformersClient()
)


In [10]:
retriever.build_index_from_documents(documents=documents)

In [11]:
# output = retriever.retrieve(query)

api_kwargs: {'model': 'BAAI/bge-reranker-base', 'input': [['Li', 'Li'], ['Li', 'text2']]}    
init reranker client
Loading model BAAI/bge-reranker-base
get tokenizer: XLMRobertaTokenizerFast(name_or_path='BAAI/bge-reranker-base', vocab_size=250002, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	250001: AddedToken("<mask>"


## FAISSRetriever

To use semantic search, we very likely need to split the text with ``TextSplitter`` and compute the embeddings. This preprocessing is use-case specific and is better done by users in the data transformation stage. We can then treat these embeddings as the input documents.

In this case, the real index is the split documents along with their embeddings. We will use ``LocalDocumentDB`` to handle the data transformation and the storage of the index.
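
To make the idea of "embeddings as the index" concrete, here is a minimal sketch of semantic search. It is a brute-force cosine-similarity search in pure Python, not FAISS and not the ``FAISSRetriever`` API. The two-dimensional vectors are toy values standing in for real embeddings, and ``top_k_by_cosine`` is a hypothetical helper.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_by_cosine(
    query_embedding: Sequence[float],
    doc_embeddings: Sequence[Sequence[float]],
    top_k: int,
) -> List[Tuple[int, float]]:
    """Score every document against the query and return the top_k (index, score) pairs."""
    scored = [
        (index, cosine(query_embedding, embedding))
        for index, embedding in enumerate(doc_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings, one per split document (assumed values for illustration).
toy_doc_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_query_embedding = [1.0, 0.1]

results = top_k_by_cosine(toy_query_embedding, toy_doc_embeddings, top_k=2)
print([index for index, _ in results])  # [0, 2]
```

A library such as FAISS replaces the exhaustive loop with an optimized index, but the inputs (document embeddings) and outputs (indices and scores) keep the same shape.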

