We start with high-precision, high-recall retrieval methods, because AdalFlow helps you optimize the later stages of your search/retrieval pipeline. The first stage often comes from a cloud database provider, with its built-in search and filter support.

## FAISS retriever (simple)

We mainly use this retriever to show quickly how semantic search can be implemented as a retriever.
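To make the idea concrete before calling the library, here is a minimal sketch of what a semantic retriever does at its core: compare a query vector with every document vector using cosine similarity and keep the best `k`. It uses hand-made 3-dimensional toy vectors instead of real embeddings, and the helper names (`cosine_similarity`, `top_k_by_cosine`) are illustrative only, not part of AdalFlow.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_by_cosine(
    query_vec: List[float], doc_vecs: List[List[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (doc_index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# toy "embeddings": hypothetical values chosen by hand, not model output
toy_doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query_vec = [1.0, 0.05, 0.0]

toy_hits = top_k_by_cosine(toy_query_vec, toy_doc_vecs, k=2)
print(toy_hits)
```

A vector index such as FAISS performs the same comparison, only much faster and over far more vectors.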

In [1]:
# decide a meaningful query and a list of documents
query_1 = "What are the benefits of renewable energy?" # gt is [0, 3]
query_2 = "How do solar panels impact the environment?" # gt is [1, 2]

documents = [
    {
        "title": "The Impact of Renewable Energy on the Economy",
        "content": "Renewable energy technologies not only help in reducing greenhouse gas emissions but also contribute significantly to the economy by creating jobs in the manufacturing and installation sectors. The growth in renewable energy usage boosts local economies through increased investment in technology and infrastructure."
    },
    {
        "title": "Understanding Solar Panels",
        "content": "Solar panels convert sunlight into electricity by allowing photons, or light particles, to knock electrons free from atoms, generating a flow of electricity. Solar panels are a type of renewable energy technology that has been found to have a significant positive effect on the environment by reducing the reliance on fossil fuels."
    },
    {
        "title": "Pros and Cons of Solar Energy",
        "content": "While solar energy offers substantial environmental benefits, such as reducing carbon footprints and pollution, it also has downsides. The production of solar panels can lead to hazardous waste, and large solar farms require significant land, which can disrupt local ecosystems."
    },
    {
        "title":  "Renewable Energy and Its Effects",
        "content": "Renewable energy sources like wind, solar, and hydro power play a crucial role in combating climate change. They do not produce greenhouse gases during operation, making them essential for sustainable development. However, the initial setup and material sourcing for these technologies can still have environmental impacts."
    }
]

In [2]:
# create an embedder
from adalflow.core.embedder import Embedder 
from adalflow.core.types import ModelClientType


model_kwargs = {
    "model": "text-embedding-3-small",
    "dimensions": 256,
    "encoding_format": "float",
}

embedder = Embedder(model_client=ModelClientType.OPENAI(), model_kwargs=model_kwargs)
embedder

Embedder(
  model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
  (model_client): OpenAIClient()
)

In [3]:
# the documents fit into a single batch, so the simple embedder is enough
# the embedder takes a list of strings; we pass only the content of each document
output = embedder(input=[doc["content"] for doc in documents])
print(output.embedding_dim, output.length, output)

256 4 EmbedderOutput(data=[Embedding(embedding=[0.006102133, 0.07962484, 0.14928514, 0.041595064, 0.062026925, 0.04285206, -0.016078092, 0.06906609, 0.05352508, 0.06513513, 0.025711235, 0.015723849, -0.12478519, 0.041229393, 0.04548032, -0.005096538, -0.06028999, 0.0075248214, 0.027653862, 0.06924892, -0.03780123, 0.07505395, 0.005767887, -0.012604219, -0.012112848, 0.027813843, -0.047262963, 0.025574109, 0.054027874, -0.02331152, 0.073225595, -0.030579228, -0.05503347, -0.07971626, 0.13621241, 0.0521081, -0.02326581, -0.0907778, 0.031219153, 0.0018483521, 0.026374014, -0.067237735, -0.10357628, -0.031744804, -0.028408058, -0.012524229, -0.052839443, -0.066186436, 0.07345414, 0.077567935, 0.07724798, -0.081773154, 0.05174243, -0.22360775, -0.03604144, -0.05334224, -0.010495897, 0.048999898, -0.093794584, -0.005770744, 0.16729443, -0.012489947, 0.074094065, 0.12405385, -0.057181787, 0.08625262, -0.011507206, 0.022260215, -0.053707913, -0.0017255095, 0.0026925376, -0.049594115, 0.0571360

In [4]:
# prepare the retriever

from adalflow.components.retriever import FAISSRetriever

# pass the documents in the initialization 
documents_embeddings = [x.embedding for x in output.data]
retriever = FAISSRetriever(top_k=2, embedder=embedder, documents=documents_embeddings)
retriever

FAISSRetriever(
  top_k=2, metric=prob, dimensions=256, total_documents=4
  (embedder): Embedder(
    model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
    (model_client): OpenAIClient()
  )
)

In [5]:
# execute the retriever
output_1 = retriever(input=query_1)
output_2 = retriever(input=query_2)
output_3 = retriever(input = [query_1, query_2])
print(output_1)
print(output_2)
print(output_3)

[RetrieverOutput(doc_indices=[0, 3], doc_scores=[0.8119999766349792, 0.7749999761581421], query='What are the benefits of renewable energy?', documents=None)]
[RetrieverOutput(doc_indices=[2, 1], doc_scores=[0.8169999718666077, 0.8109999895095825], query='How do solar panels impact the environment?', documents=None)]
[RetrieverOutput(doc_indices=[0, 3], doc_scores=[0.8119999766349792, 0.7749999761581421], query='What are the benefits of renewable energy?', documents=None), RetrieverOutput(doc_indices=[2, 1], doc_scores=[0.8169999718666077, 0.8109999895095825], query='How do solar panels impact the environment?', documents=None)]
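The `# gt is [...]` comments next to the queries give the ground-truth indices, so the retrieved `doc_indices` can be scored directly. A minimal sketch, using a hypothetical helper `recall_at_k` (not an AdalFlow function) and the indices copied from the output above:

```python
from typing import List


def recall_at_k(retrieved: List[int], ground_truth: List[int]) -> float:
    """Fraction of ground-truth documents that appear in the retrieved list."""
    if not ground_truth:
        return 0.0
    hits = len(set(retrieved) & set(ground_truth))
    return hits / len(ground_truth)


# doc_indices copied from the output above; ground truth from the query comments
print(recall_at_k([0, 3], [0, 3]))  # query_1
print(recall_at_k([2, 1], [1, 2]))  # query_2
```

The order of the retrieved indices does not matter for recall, which is why `[2, 1]` still counts as a full match for a ground truth of `[1, 2]`.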


In [6]:
# second, we don't pass documents at init; instead we pass them with build_index_from_documents

retriever_1 = FAISSRetriever(top_k=2, embedder=embedder)
print(retriever_1)
retriever_1.build_index_from_documents(documents=documents_embeddings)
print(retriever_1)

output_1 = retriever_1(input=query_1)
output_2 = retriever_1(input=query_2)
output_3 = retriever_1(input = [query_1, query_2])
print(output_1)
print(output_2)
print(output_3)

FAISSRetriever(
  top_k=2, metric=prob
  (embedder): Embedder(
    model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
    (model_client): OpenAIClient()
  )
)
FAISSRetriever(
  top_k=2, metric=prob, dimensions=256, total_documents=4
  (embedder): Embedder(
    model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
    (model_client): OpenAIClient()
  )
)
[RetrieverOutput(doc_indices=[0, 3], doc_scores=[0.8119999766349792, 0.7749999761581421], query='What are the benefits of renewable energy?', documents=None)]
[RetrieverOutput(doc_indices=[2, 1], doc_scores=[0.8169999718666077, 0.8109999895095825], query='How do solar panels impact the environment?', documents=None)]
[RetrieverOutput(doc_indices=[0, 3], doc_scores=[0.8119999766349792, 0.7749999761581421], query='What are the benefits of renewable energy?', documents=None), RetrieverOutput(doc_indices=[2, 1], doc_scores=[0.8169999718666077, 0.81099

## BM25 retriever (simple)

In [7]:
from adalflow.components.retriever import BM25Retriever

document_map_func = lambda x: x["content"]

bm25_retriever = BM25Retriever(top_k=2, documents=documents, document_map_func=document_map_func)
print(bm25_retriever)

InMemoryBM25Retriever(top_k=2, k1=1.5, b=0.75, epsilon=0.25, use_tokenizer=True, total_documents=4)
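The `k1` and `b` values in the repr are the usual Okapi BM25 parameters: `k1` controls how quickly repeated terms saturate and `b` controls document-length normalization. The sketch below is a from-scratch illustration of the scoring formula, not the library's implementation; it uses a non-negative (Lucene-style) idf and does not model the `epsilon` parameter. The function name `bm25_scores` and the toy corpus are made up for this example.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query_tokens: List[str],
    corpus_tokens: List[List[str]],
    k1: float = 1.5,
    b: float = 0.75,
) -> List[float]:
    """Score every document in the corpus against the query with Okapi BM25."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # "+ 1" inside the log keeps idf non-negative (Lucene-style variant)
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = 1 - b + b * len(doc) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


toy_corpus = [
    "solar panels convert sunlight into electricity".split(),
    "wind turbines convert wind into electricity".split(),
    "renewable energy creates jobs".split(),
]
toy_scores = bm25_scores("solar panels".split(), toy_corpus)
print(toy_scores)
```

Only the first toy document contains the query terms, so it is the only one with a non-zero score. BM25 matches exact terms, which is why the choice of splitter matters.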


In [8]:
# show how a word splitter and a token splitter differ

from adalflow.components.retriever.bm25_retriever import split_text_by_word_fn_then_lower_tokenized, split_text_by_word_fn

query_1_words = split_text_by_word_fn(query_1)
query_1_tokens = split_text_by_word_fn_then_lower_tokenized(query_1)

print(query_1_words)
print(query_1_tokens)

['what', 'are', 'the', 'benefits', 'of', 'renewable', 'energy?']
['what', 'are', 'the', 'benef', 'its', 'of', 're', 'new', 'able', 'energy', '?']
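In the first output, `energy?` stays a single term, so it would not match the plain word `energy` in a document. In the second, the sub-word tokenizer separates the punctuation. The sketch below illustrates the same effect with two plain-Python stand-ins; they are not the AdalFlow functions. `whitespace_split` happens to reproduce the first output line for this query, while `word_and_punct_split` is a simple regex alternative rather than a sub-word tokenizer.

```python
import re
from typing import List


def whitespace_split(text: str) -> List[str]:
    """Lowercase and split on whitespace; punctuation stays glued to words."""
    return text.lower().split()


def word_and_punct_split(text: str) -> List[str]:
    """Lowercase and emit words and punctuation marks as separate terms."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


sample = "What are the benefits of renewable energy?"
print(whitespace_split(sample))
print(word_and_punct_split(sample))
```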


In [9]:
output_1 = bm25_retriever(input=query_1)
output_2 = bm25_retriever(input=query_2)
output_3 = bm25_retriever(input = [query_1, query_2])
print(output_1)
print(output_2)
print(output_3)

[RetrieverOutput(doc_indices=[2, 1], doc_scores=[2.151683837681807, 1.6294762236217233], query='What are the benefits of renewable energy?', documents=None)]
[RetrieverOutput(doc_indices=[3, 2], doc_scores=[1.5166601493236314, 0.7790170272403408], query='How do solar panels impact the environment?', documents=None)]
[RetrieverOutput(doc_indices=[2, 1], doc_scores=[2.151683837681807, 1.6294762236217233], query='What are the benefits of renewable energy?', documents=None), RetrieverOutput(doc_indices=[3, 2], doc_scores=[1.5166601493236314, 0.7790170272403408], query='How do solar panels impact the environment?', documents=None)]


In [10]:
states = bm25_retriever.to_dict()
print(states)

{'type': 'InMemoryBM25Retriever', 'data': {'_components': {'_ordered_dict': True, 'data': []}, '_parameters': {'_ordered_dict': True, 'data': []}, 'training': False, '_init_args': {'b': 0.75, 'document_map_func': None, 'documents': None, 'epsilon': 0.25, 'k1': 1.5, 'top_k': 5, 'use_tokenizer': True}, 'k1': 1.5, 'b': 0.75, 'epsilon': 0.25, 'top_k': 2, '_use_tokenizer': True, '_split_function': <function split_text_by_word_fn_then_lower_tokenized at 0x10f919f80>, 'indexed': True, 'index_keys': ['avgdl', 'b', 'doc_len', 'epsilon', 'idf', 'indexed', 'k1', 'nd', 't2d', 'top_k', 'total_documents', 'use_tokenizer'], 't2d': [{'.': 2, 'able': 2, 'also': 1, 'and': 2, 'ased': 1, 'boost': 1, 'but': 1, 'by': 1, 'cing': 1, 'con': 1, 'conom': 1, 'conomy': 1, 'creating': 1, 'du': 1, 'e': 2, 'em': 1, 'energy': 2, 'gas': 1, 'green': 1, 'growth': 1, 'help': 1, 'house': 1, 'ies': 1, 'ificantly': 1, 'in': 4, 'incre': 1, 'inf': 1, 'installation': 1, 'investment': 1, 'issions': 1, 'jobs': 1, 'local': 1, 'man

In [11]:
# short queries perform slightly better here

query_1_short = "renewable energy?"  # gt is [0, 3]
query_2_short = "solar panels?"  # gt is [1, 2]

output_1 = bm25_retriever(input=query_1_short)
output_2 = bm25_retriever(input=query_2_short)
output_3 = bm25_retriever(input = [query_1_short, query_2_short])
print(output_1)
print(output_2)
print(output_3)

[RetrieverOutput(doc_indices=[0, 1], doc_scores=[0.8490398606823998, 0.6288584231026185], query='renewable energy?', documents=None)]
[RetrieverOutput(doc_indices=[2, 1], doc_scores=[0.49343959021478, 0.38733491458639285], query='solar panels?', documents=None)]
[RetrieverOutput(doc_indices=[0, 1], doc_scores=[0.8490398606823998, 0.6288584231026185], query='renewable energy?', documents=None), RetrieverOutput(doc_indices=[2, 1], doc_scores=[0.49343959021478, 0.38733491458639285], query='solar panels?', documents=None)]


In [12]:
# use both title and content
document_map_func = lambda x: x["title"] + " " + x["content"]

print(documents)
bm25_retriever.build_index_from_documents(documents=documents, document_map_func=document_map_func)

output_1 = bm25_retriever(input=query_1_short)
output_2 = bm25_retriever(input=query_2_short)
output_3 = bm25_retriever(input = [query_1_short, query_2_short])
print(output_1)
print(output_2)
print(output_3)

[{'title': 'The Impact of Renewable Energy on the Economy', 'content': 'Renewable energy technologies not only help in reducing greenhouse gas emissions but also contribute significantly to the economy by creating jobs in the manufacturing and installation sectors. The growth in renewable energy usage boosts local economies through increased investment in technology and infrastructure.'}, {'title': 'Understanding Solar Panels', 'content': 'Solar panels convert sunlight into electricity by allowing photons, or light particles, to knock electrons free from atoms, generating a flow of electricity. Solar panels are a type of renewable energy technology that has been found to have a significant positive effect on the environment by reducing the reliance on fossil fuels.'}, {'title': 'Pros and Cons of Solar Energy', 'content': 'While solar energy offers substantial environmental benefits, such as reducing carbon footprints and pollution, it also has downsides. The production of solar panels 

## Reranker (simple)

In [13]:
# !poetry add cohere --group dev

In [4]:
from adalflow.components.retriever import RerankerRetriever

model_client = ModelClientType.COHERE()
model_kwargs = {"model": "rerank-english-v3.0"}


reranker = RerankerRetriever(
    top_k=2, model_client=model_client, model_kwargs=model_kwargs
)
print(reranker)

RerankerRetriever(
  top_k=2, model_kwargs={'model': 'rerank-english-v3.0'}, model_client=CohereAPIClient(), total_documents=0
  (model_client): CohereAPIClient()
)


In [5]:
# build index and run queries
document_map_func = lambda x: x["content"]
reranker.build_index_from_documents(documents=documents, document_map_func=document_map_func)

print(reranker)

RerankerRetriever(
  top_k=2, model_kwargs={'model': 'rerank-english-v3.0', 'documents': ['Renewable energy technologies not only help in reducing greenhouse gas emissions but also contribute significantly to the economy by creating jobs in the manufacturing and installation sectors. The growth in renewable energy usage boosts local economies through increased investment in technology and infrastructure.', 'Solar panels convert sunlight into electricity by allowing photons, or light particles, to knock electrons free from atoms, generating a flow of electricity. Solar panels are a type of renewable energy technology that has been found to have a significant positive effect on the environment by reducing the reliance on fossil fuels.', 'While solar energy offers substantial environmental benefits, such as reducing carbon footprints and pollution, it also has downsides. The production of solar panels can lead to hazardous waste, and large solar farms require significant land, which can d

In [6]:
# run queries
output_1 = reranker(input=query_1)
output_2 = reranker(input=query_2)
output_3 = reranker(input = [query_1, query_2])
print(output_1)
print(output_2)
print(output_3)

[RetrieverOutput(doc_indices=[0, 3], doc_scores=[0.99520767, 0.9696708], query='What are the benefits of renewable energy?', documents=None)]
[RetrieverOutput(doc_indices=[1, 2], doc_scores=[0.98742366, 0.9701269], query='How do solar panels impact the environment?', documents=None)]
[RetrieverOutput(doc_indices=[0, 3], doc_scores=[0.99520767, 0.9696708], query='What are the benefits of renewable energy?', documents=None), RetrieverOutput(doc_indices=[1, 2], doc_scores=[0.98742366, 0.9701269], query='How do solar panels impact the environment?', documents=None)]


In [6]:
# use transformer client

model_client = ModelClientType.TRANSFORMERS()
model_kwargs = {"model": "BAAI/bge-reranker-base"}

reranker = RerankerRetriever(
    top_k=2,
    model_client=model_client,
    model_kwargs=model_kwargs,
    documents=documents,
    document_map_func=document_map_func,
)
print(reranker)


RerankerRetriever(
  top_k=2, model_kwargs={'model': 'BAAI/bge-reranker-base', 'documents': ['Renewable energy technologies not only help in reducing greenhouse gas emissions but also contribute significantly to the economy by creating jobs in the manufacturing and installation sectors. The growth in renewable energy usage boosts local economies through increased investment in technology and infrastructure.', 'Solar panels convert sunlight into electricity by allowing photons, or light particles, to knock electrons free from atoms, generating a flow of electricity. Solar panels are a type of renewable energy technology that has been found to have a significant positive effect on the environment by reducing the reliance on fossil fuels.', 'While solar energy offers substantial environmental benefits, such as reducing carbon footprints and pollution, it also has downsides. The production of solar panels can lead to hazardous waste, and large solar farms require significant land, which ca

In [18]:
# run queries
import torch
# Set the number of threads for PyTorch to avoid a segmentation fault
torch.set_num_threads(1)
torch.set_num_interop_threads(1)

In [19]:


output_1 = reranker(input=query_1)
output_2 = reranker(input=query_2)
output_3 = reranker(input = [query_1, query_2])
print(output_1)
print(output_2)
print(output_3)

[RetrieverOutput(doc_indices=[0, 3], doc_scores=[0.9996004700660706, 0.9950029253959656], query='What are the benefits of renewable energy?', documents=None)]
[RetrieverOutput(doc_indices=[2, 0], doc_scores=[0.9994490742683411, 0.9994476437568665], query='How do solar panels impact the environment?', documents=None)]
[RetrieverOutput(doc_indices=[0, 3], doc_scores=[0.9996004700660706, 0.9950029253959656], query='What are the benefits of renewable energy?', documents=None), RetrieverOutput(doc_indices=[2, 0], doc_scores=[0.9994490742683411, 0.9994476437568665], query='How do solar panels impact the environment?', documents=None)]


As we can see, the second query misses one ground-truth document: it returns [2, 0] instead of [1, 2]. Semantically, though, these documents may be close.
If we use top_k = 3, the generator might be able to filter out the irrelevant one and still produce the right final response.

In [22]:
# try to use title this time
document_map_func = lambda x: x["title"] + " " + x["content"]

reranker.build_index_from_documents(documents=documents, document_map_func=document_map_func)

# run queries
output_1 = reranker(input=query_1)
output_2 = reranker(input=query_2)
output_3 = reranker(input = [query_1, query_2])
print(output_1)
print(output_2)
print(output_3)


[RetrieverOutput(doc_indices=[0, 3], doc_scores=[0.9844216108322144, 0.8057923913002014], query='What are the benefits of renewable energy?', documents=None)]
[RetrieverOutput(doc_indices=[1, 2], doc_scores=[0.9824342131614685, 0.9368231892585754], query='How do solar panels impact the environment?', documents=None)]
[RetrieverOutput(doc_indices=[0, 3], doc_scores=[0.9844216108322144, 0.8057923913002014], query='What are the benefits of renewable energy?', documents=None), RetrieverOutput(doc_indices=[1, 2], doc_scores=[0.9824342131614685, 0.9368231892585754], query='How do solar panels impact the environment?', documents=None)]


## LLM as retriever

(1) Directly return the doc_indices from the LLM model.

In [7]:
from adalflow.components.retriever import LLMRetriever

model_client = ModelClientType.OPENAI()
model_kwargs = {
    "model": "gpt-4o",
}
document_map_func = lambda x: x["content"]
llm_retriever = LLMRetriever(
        top_k=2, 
        model_client=model_client, 
        model_kwargs=model_kwargs, 
        documents=documents, 
        document_map_func=document_map_func
    )
print(llm_retriever)

LLMRetriever(
  top_k=2, total_documents=4,
  (generator): Generator(
    model_kwargs={'model': 'gpt-4o'}, 
    (prompt): Prompt(
      template: <SYS>
      You are a retriever. Given a list of documents, you will retrieve the top_k {{top_k}} most relevant documents and output the indices (int) as a list:
      [<index of the most relevant top_k options>]
      <Documents>
      {% for doc in documents %}
      ```Index {{ loop.index - 1 }}. {{ doc }}```
      ______________
      {% endfor %}
      </Documents>
      </SYS>
      Query: {{ input_str }}
      You:
      , preset_prompt_kwargs: {'top_k': 2, 'documents': ['Renewable energy technologies not only help in reducing greenhouse gas emissions but also contribute significantly to the economy by creating jobs in the manufacturing and installation sectors. The growth in renewable energy usage boosts local economies through increased investment in technology and infrastructure.', 'Solar panels convert sunlight into electricity by

In [8]:
# run queries
output_1 = llm_retriever(input=query_1)
output_2 = llm_retriever(input=query_2)
output_3 = llm_retriever(input = [query_1, query_2])
print(output_1)
print(output_2)
print(output_3)

[RetrieverOutput(doc_indices=[0, 3], doc_scores=None, query='What are the benefits of renewable energy?', documents=None)]
[RetrieverOutput(doc_indices=[1, 2], doc_scores=None, query='How do solar panels impact the environment?', documents=None)]
[RetrieverOutput(doc_indices=[0, 3], doc_scores=None, query='What are the benefits of renewable energy?', documents=None), RetrieverOutput(doc_indices=[1, 2], doc_scores=None, query='How do solar panels impact the environment?', documents=None)]


In [9]:
# you should try both gpt-3.5-turbo and gpt-4o
# you can use a different model without reinitializing the retriever
model_kwargs = {
    "model": "gpt-3.5-turbo",
}
output_1 = llm_retriever(model_kwargs=model_kwargs, input=query_1)
output_2 = llm_retriever(model_kwargs=model_kwargs, input=query_2)
output_3 = llm_retriever(model_kwargs=model_kwargs, input = [query_1, query_2])
print(output_1)
print(output_2)
print(output_3)

[RetrieverOutput(doc_indices=[0, 1], doc_scores=None, query='What are the benefits of renewable energy?', documents=None)]
[RetrieverOutput(doc_indices=[1, 2], doc_scores=None, query='How do solar panels impact the environment?', documents=None)]
[RetrieverOutput(doc_indices=[0, 1], doc_scores=None, query='What are the benefits of renewable energy?', documents=None), RetrieverOutput(doc_indices=[1, 2], doc_scores=None, query='How do solar panels impact the environment?', documents=None)]


## LLMRetriever

Indexing consists of forming the prompt from the target documents and setting the ``top_k`` parameter.
``retrieve`` runs the ``generator`` and parses the response into the standard ``RetrieverOutputType``, which is a list of 
``RetrieverOutput``. Each ``RetrieverOutput`` contains the document indices and, when the retriever provides them, the scores.
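The parsing step is where an LLM-based retriever is most fragile, since the model may wrap the list in extra text or return indices that do not exist. The following is a minimal sketch of a defensive parser; `parse_doc_indices` is a hypothetical helper written for illustration and is not how AdalFlow necessarily parses the response.

```python
import re
from typing import List


def parse_doc_indices(response: str, total_documents: int, top_k: int) -> List[int]:
    """Extract up to top_k valid, de-duplicated document indices from raw LLM text."""
    match = re.search(r"\[([^\]]*)\]", response)
    if match is None:
        return []
    indices: List[int] = []
    for token in match.group(1).split(","):
        token = token.strip()
        if not re.fullmatch(r"\d+", token):
            continue  # skip anything that is not a plain non-negative integer
        idx = int(token)
        if idx < total_documents and idx not in indices:
            indices.append(idx)
    return indices[:top_k]


print(parse_doc_indices("[0, 3]", total_documents=4, top_k=2))
print(parse_doc_indices("The most relevant are [1, 7, 2].", total_documents=4, top_k=2))
```

In the second call the out-of-range index `7` is dropped, so the result still contains two usable indices.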

In [16]:
# prepare the document
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-06-19 19:53:19--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2024-06-19 19:53:19 (4.51 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



In [None]:
# use fsspec to read the document
!pip install fsspec

In [18]:
import fsspec
import os
import time
def get_local_file_metadata(file_path: str):
    stat = os.stat(file_path)
    return {
        'size': stat.st_size,  # File size in bytes
        'creation_date': time.ctime(stat.st_ctime),  # Creation time
        'last_modified_date': time.ctime(stat.st_mtime)  # Last modification time
    }


def load_text_file(file_path: str) -> str:
    """
    Loads a text file from the specified path using fsspec.

    Args:
        file_path (str): The path to the text file. This can be a local path or a URL for a supported file system.

    # Example usage with a local file
    local_file_path = 'file:///path/to/localfile.txt'
    print(load_text_file(local_file_path))

    # Example usage with an S3 file
    s3_file_path = 's3://mybucket/myfile.txt'
    print(load_text_file(s3_file_path))

    # Example usage with a GCS file
    gcs_file_path = 'gcs://mybucket/myfile.txt'
    print(load_text_file(gcs_file_path))

    # Example usage with an HTTP file
    http_file_path = 'https://example.com/myfile.txt'
    print(load_text_file(http_file_path))

    Returns:
        str: The content of the text file.
    """
    with fsspec.open(file_path, 'r') as file:
        content = file.read()
    return content


In [19]:
text = load_text_file('data/paul_graham/paul_graham_essay.txt')
file_metadata = get_local_file_metadata('data/paul_graham/paul_graham_essay.txt')
print(text[:1000])
print(file_metadata)



What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in t

In [20]:
# split the documents

from adalflow.components.data_process import DocumentSplitter
from adalflow.core.types import Document

# sentence splitting is confusing, the length needs to be smaller
metadata = {"title": "Paul Graham's essay", "path": "data/paul_graham/paul_graham_essay.txt"}
metadata.update(file_metadata)
documents = [Document(text = text, meta_data = metadata)]
splitter = DocumentSplitter(split_by="word", split_length=800, split_overlap=200)

print(documents)
print(splitter)

[Document(id=4e32f7b4-a82d-415b-a1e8-e63ef07e9ac6, text=

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I trie ..., meta_data={'title': "Paul Graham's essay", 'path': 'data/paul_graham/paul_graham_essay.txt', 'size': 75042, 'creation_date': 'Sun Jun 16 13:13:07 2024', 'last_modified_date': 'Sun Jun 16 13:13:07 2024'}, estimated_num_tokens=16534)]
DocumentSplitter(split_by=word, split_length=800, split_overlap=200)
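To show what `split_length` and `split_overlap` mean for word-based splitting, here is a small sliding-window sketch. It is an independent illustration, not the `DocumentSplitter` implementation, which may treat whitespace and the final chunk differently; `split_by_word` is a made-up name.

```python
from typing import List


def split_by_word(text: str, split_length: int, split_overlap: int) -> List[str]:
    """Split text into chunks of split_length words that share split_overlap words."""
    if split_overlap >= split_length:
        raise ValueError("split_overlap must be smaller than split_length")
    words = text.split()
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + split_length]))
        if start + split_length >= len(words):
            break  # the last window already reached the end of the text
    return chunks


toy_text = " ".join(f"w{i}" for i in range(10))
print(split_by_word(toy_text, split_length=4, split_overlap=2))
```

Each new chunk starts `split_length - split_overlap` words after the previous one, so with 800 and 200 the window advances by 600 words at a time.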


In [21]:
token_limit = 16385

# rough upper bound on how many 800-word chunks fit within the token limit,
# treating one word as one token (the split below yields 23 chunks in total)

16385 // 800

20

From the document structure, we can see ``estimated_num_tokens=16534``, which exceeds the token limit of 16385. This will help us
adapt our retriever.

In [22]:
# split the document
splitted_documents = splitter(documents = documents)
print(splitted_documents[0], len(splitted_documents))

Splitting documents:   0%|          | 0/1 [00:00<?, ?it/s]

Splitting documents: 100%|██████████| 1/1 [00:00<00:00, 28.08it/s]

Document(id=9e3a3e94-0fb1-4aaa-a9cf-006325349358, text=

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I trie ..., meta_data={'title': "Paul Graham's essay", 'path': 'data/paul_graham/paul_graham_essay.txt', 'size': 75042, 'creation_date': 'Sun Jun 16 13:13:07 2024', 'last_modified_date': 'Sun Jun 16 13:13:07 2024'}, estimated_num_tokens=985, parent_doc_id=4e32f7b4-a82d-415b-a1e8-e63ef07e9ac6) 23
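The earlier estimate `16385 // 800` counts words, but the first chunk reports `estimated_num_tokens=985` for 800 words, so fewer chunks actually fit into the context window. A minimal sketch of the budget calculation, assuming every chunk is about as long as the first one; `max_chunks_in_context` and its `reserved_tokens` parameter are hypothetical:

```python
def max_chunks_in_context(
    token_limit: int, tokens_per_chunk: int, reserved_tokens: int = 0
) -> int:
    """How many equally sized chunks fit after reserving room for the prompt and query."""
    usable = token_limit - reserved_tokens
    return max(usable // tokens_per_chunk, 0)


print(max_chunks_in_context(16385, 800))  # word-count estimate: 20
print(max_chunks_in_context(16385, 985))  # token-based estimate: 16
```

The token-based figure is the safer one to use when deciding how many chunks to place in a single prompt.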





In [23]:
from adalflow.components.retriever import LLMRetriever
from adalflow.components.model_client import OpenAIClient

from adalflow.tracing import trace_generator_call

from adalflow.utils import setup_env

# 1. set up tracing for failed calls; this works because the retriever has a generator attribute

@trace_generator_call(save_dir="tutorials/traces")
class LoggedLLMRetriever(LLMRetriever):
    pass
top_k = 2
retriever = LoggedLLMRetriever(
    top_k = top_k, model_client=OpenAIClient(), model_kwargs={"model": "gpt-3.5-turbo"}
)

retriever.build_index_from_documents(documents=[doc.text for doc in splitted_documents[0:16]])

print(retriever)
retriever.generator.print_prompt()

LoggedLLMRetriever(
  (generator): Generator(
    model_kwargs={'model': 'gpt-3.5-turbo'}, 
    (prompt): Prompt(
      template: <SYS>
      Your are a retriever. Given a list of documents in the context, \
      you will retrieve a list of {{top_k}} indices(int) of the documents that are most relevant to the query. You will output a list as follows:
      [<id from the most relevent with top_k options>]
      <Documents>
      {% for doc in documents %}
      ```{{ loop.index - 1}}. {{doc}}```
      {% endfor %}
      </Documents>
      </SYS>
      Query: {{input_str}}
      You:
      , preset_prompt_kwargs: {'top_k': 2, 'documents': ['\n\nWhat I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, whic

Note: We need to know the ground truth. You can save the split documents and then label the data.

Here we did that, the ground truth is (indices)

In [24]:
query = "What happened at Viaweb and Interleaf?"
output = retriever(input=query)
print(output)

[RetrieverOutput(doc_indices=[1], doc_scores=None, query='What happened at Viaweb and Interleaf?', documents=None)]


In [25]:
# output[0].documents = [splitted_documents[idx] for idx in output[0].doc_indices]
for per_query_output in output:
    per_query_output.documents = [splitted_documents[idx] for idx in per_query_output.doc_indices]
print("output.documents", output[0].documents)
len(output)

output.documents [Document(id=619bb230-3a9b-4258-b7c9-8846aea0db55, text=was that the other fields took up so much of the space of ideas that there wasn't much left for these supposed ultimate truths. All that seemed left for philosophy were edge cases that people in other fields felt could safely be ignored.

I couldn't have put this into words when I was 18. All I knew at the time was that I kept taking philosophy courses and they kept being boring. So I decided to s ..., meta_data={'title': "Paul Graham's essay", 'path': 'data/paul_graham/paul_graham_essay.txt', 'size': 75042, 'creation_date': 'Sun Jun 16 13:13:07 2024', 'last_modified_date': 'Sun Jun 16 13:13:07 2024'}, estimated_num_tokens=958, parent_doc_id=4e32f7b4-a82d-415b-a1e8-e63ef07e9ac6)]


1

In [26]:
# check the first document
print(output[0].documents[0].text)
print("interleaf" in output[0].documents[0].text.lower())
print("viaweb" in output[0].documents[0].text.lower())

was that the other fields took up so much of the space of ideas that there wasn't much left for these supposed ultimate truths. All that seemed left for philosophy were edge cases that people in other fields felt could safely be ignored.

I couldn't have put this into words when I was 18. All I knew at the time was that I kept taking philosophy courses and they kept being boring. So I decided to switch to AI.

AI was in the air in the mid 1980s, but there were two things especially that made me want to work on it: a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. I haven't tried rereading The Moon is a Harsh Mistress, so I don't know how well it has aged, but when I read it I was drawn entirely into its world. It seemed only a matter of time before we'd have Mike, and when I saw Winograd using SHRDLU, it seemed like that time would be a few years at most. All you ha

In [27]:
# check the second document
print(output[0].documents[1].text)
print("interleaf" in output[0].documents[1].text.lower())
print("viaweb" in output[0].documents[1].text.lower())

IndexError: list index out of range
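The `IndexError` occurs because the retriever returned a single index (`doc_indices=[1]`) even though `top_k=2`, so there is no second document to inspect. Iterating over whatever was returned avoids hard-coding positions. A sketch with a stand-in string instead of `Document` objects; `keyword_report` is a hypothetical helper:

```python
from typing import Dict, List


def keyword_report(texts: List[str], keywords: List[str]) -> List[Dict[str, bool]]:
    """For each retrieved text, record which keywords occur (case-insensitive)."""
    report = []
    for text in texts:
        lowered = text.lower()
        report.append({kw: kw.lower() in lowered for kw in keywords})
    return report


# stand-in for [doc.text for doc in output[0].documents]; only one document came back
retrieved_texts = ["So I decided to switch to AI."]
for i, hits in enumerate(keyword_report(retrieved_texts, ["interleaf", "viaweb"])):
    print(i, hits)
```

With the real output, the same loop would run once and report that neither keyword appears in the returned chunk.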

## Reranker


In [None]:
# from adalflow.components.retriever import RerankerRetriever

# query = "Li"
# strings = ["Li", "text2"]

# retriever = RerankerRetriever(top_k=1)
# print(retriever)
# retriever.build_index_from_documents(documents=documents)
# print(retriever.documents)
# output = retriever.retrieve(query)
# print(output)

In [None]:
# retriever.build_index_from_documents(documents=strings)

In [None]:
# output = retriever.retrieve(query)

## FAISSRetriever

To use semantic search, we will most likely need a text splitter and an embedder to compute the embeddings. This preprocessing is use-case specific and is better done by users in a data transformation stage. We can then treat the resulting embeddings as the input documents for the retriever.

In this case, the real index is the set of split documents along with their embeddings. We will use ``LocalDB`` to handle the data transformation and the storage of the index.



In [None]:
from adalflow.core.db import LocalDB

db = LocalDB()
db.load_documents(documents)
len(db.documents)

1

Let us see how to create data transformers using only the component config.

In [None]:
# create data transformer
data_transformer_config = {  # attribute and its config to recreate the component
        "embedder":{
            "component_name": "Embedder",
            "component_config": {
                "model_client": {
                    "component_name": "OpenAIClient",
                    "component_config": {},
                },
                "model_kwargs": {
                    "model": "text-embedding-3-small",
                    "dimensions": 256,
                    "encoding_format": "float",
                },
            },
        },
        "document_splitter": {
            "component_name": "DocumentSplitter",
            "component_config": {
                "split_by": "word",
                "split_length": 400,
                "split_overlap": 200,
            },
        },
        "to_embeddings": {
            "component_name": "ToEmbeddings",
            "component_config": {
                "vectorizer": {
                    "component_name": "Embedder",
                    "component_config": {
                        "model_client": {
                            "component_name": "OpenAIClient",
                            "component_config": {},
                        },
                        "model_kwargs": {
                            "model": "text-embedding-3-small",
                            "dimensions": 256,
                            "encoding_format": "float",
                        },
                    },
                    # the other config is to instantiate the entity (class and function) with the given config as arguments
                    # "entity_state": "storage/embedder.pkl", # this will load back the state of the entity
                },
                "batch_size": 100,
            },
        },
    }

In [None]:
from adalflow.utils.config import new_components_from_config

components = new_components_from_config(data_transformer_config)
print(components)

{'embedder': Embedder(
  model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
  (model_client): OpenAIClient()
), 'document_splitter': DocumentSplitter(split_by=word, split_length=400, split_overlap=200), 'to_embeddings': ToEmbeddings(
  batch_size=100
  (vectorizer): Embedder(
    model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
    (model_client): OpenAIClient()
  )
  (batch_embedder): BatchEmbedder(
    (embedder): Embedder(
      model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
      (model_client): OpenAIClient()
    )
  )
)}
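As a rough mental model for what a config-driven factory like this does, here is a minimal sketch of recursively building objects from nested `component_name` / `component_config` entries. It is not AdalFlow's implementation: the `REGISTRY`, `register`, `build_component`, and `build_components` names and the `ToyClient` / `ToyEmbedder` classes are all hypothetical stand-ins.

```python
from typing import Any, Callable, Dict

# Hypothetical registry mapping a component name to a callable that builds it.
REGISTRY: Dict[str, Callable[..., Any]] = {}


def register(name: str, factory: Callable[..., Any]) -> None:
    REGISTRY[name] = factory


def build_component(spec: Dict[str, Any]) -> Any:
    """Build one object from a {"component_name", "component_config"} spec."""
    factory = REGISTRY[spec["component_name"]]
    kwargs = {}
    for key, value in spec.get("component_config", {}).items():
        # A nested dict that carries "component_name" is itself a component spec.
        if isinstance(value, dict) and "component_name" in value:
            kwargs[key] = build_component(value)
        else:
            kwargs[key] = value  # plain values (e.g. model_kwargs) pass through
    return factory(**kwargs)


def build_components(config: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    return {name: build_component(spec) for name, spec in config.items()}


# Toy stand-ins for real components (assumptions for illustration only).
class ToyClient:
    pass


class ToyEmbedder:
    def __init__(self, model_client: ToyClient, model_kwargs: Dict[str, Any]):
        self.model_client = model_client
        self.model_kwargs = model_kwargs


register("ToyClient", ToyClient)
register("ToyEmbedder", ToyEmbedder)

toy_config = {
    "embedder": {
        "component_name": "ToyEmbedder",
        "component_config": {
            "model_client": {"component_name": "ToyClient", "component_config": {}},
            "model_kwargs": {"model": "toy-model", "dimensions": 4},
        },
    },
}
built = build_components(toy_config)
```

The point of the pattern is that the whole pipeline can be described as plain data (a dict or a YAML file) and recreated later without touching the construction code.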


In [None]:
from adalflow.core.component import Sequential

data_transformer = Sequential(components["document_splitter"], components["to_embeddings"])
data_transformer

Sequential(
  (0): DocumentSplitter(split_by=word, split_length=400, split_overlap=200)
  (1): ToEmbeddings(
    batch_size=100
    (vectorizer): Embedder(
      model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
      (model_client): OpenAIClient()
    )
    (batch_embedder): BatchEmbedder(
      (embedder): Embedder(
        model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
        (model_client): OpenAIClient()
      )
    )
  )
)

The code above is equivalent to the following code, which reads its settings from a config:

```python
vectorizer = Embedder(
    model_client=OpenAIClient(),
    # batch_size=self.vectorizer_settings["batch_size"],
    model_kwargs=self.vectorizer_settings["model_kwargs"],
)
# TODO: check document splitter, how to process the parent and order of the chunks
text_splitter = DocumentSplitter(
    split_by=self.text_splitter_settings["split_by"],
    split_length=self.text_splitter_settings["chunk_size"],
    split_overlap=self.text_splitter_settings["chunk_overlap"],
)
self.data_transformer = Sequential(
    text_splitter,
    ToEmbeddings(
        vectorizer=vectorizer,
        batch_size=self.vectorizer_settings["batch_size"],
    ),
)
```

Config:

```yaml
vectorizer:
  batch_size: 100
  model_kwargs:
    model: text-embedding-3-small
    dimensions: 256
    encoding_format: float

retriever:
  top_k: 2

generator:
  model: gpt-3.5-turbo
  temperature: 0.3
  stream: false

text_splitter:
  split_by: word
  chunk_size: 400
  chunk_overlap: 200
```
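To make the `text_splitter` settings concrete, the sketch below shows one plausible reading of word-based splitting with overlap: a window of `split_length` words that advances by `split_length - split_overlap` words each step. The `split_by_word` function is hypothetical, and the actual `DocumentSplitter` may treat whitespace and the final chunk differently.

```python
from typing import List


def split_by_word(text: str, split_length: int, split_overlap: int) -> List[str]:
    """Sliding-window split on whitespace-separated words (illustrative only)."""
    if split_length <= 0:
        raise ValueError("split_length must be positive")
    if not 0 <= split_overlap < split_length:
        raise ValueError("split_overlap must be in [0, split_length)")

    words = text.split()
    step = split_length - split_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + split_length]))
        # Stop once the current window already reaches the end of the text.
        if start + split_length >= len(words):
            break
    return chunks


toy_text = " ".join(f"w{i}" for i in range(10))
toy_chunks = split_by_word(toy_text, split_length=4, split_overlap=2)
for chunk in toy_chunks:
    print(chunk)
```

With an overlap of half the window, as in the config above, every word except those at the very start and end of the text appears in two consecutive chunks, which reduces the chance that a relevant passage is cut in half at a chunk boundary.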

In [None]:
# test using only the document splitter
text_split = components["document_splitter"](documents)
print(text_split)


Splitting documents: 100%|██████████| 1/1 [00:00<00:00,  8.55it/s]

[Document(id=2a1cc40e-4172-471d-b741-d44af5548277, text=

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I trie ..., meta_data={'title': "Paul Graham's essay", 'path': 'data/paul_graham/paul_graham_essay.txt', 'size': 75042, 'creation_date': 'Sun Jun 16 13:13:07 2024', 'last_modified_date': 'Sun Jun 16 13:13:07 2024'}, estimated_num_tokens=499, parent_doc_id=024146ce-84e5-43a2-a1fe-e5385ad9aab5), Document(id=3923c220-13ba-4886-a363-18320a006e6d, text=spectacularly loud printer.

I was puzzled by the 1401. I couldn't figure out what to do with it. And in retrospect there's not much I could have done with it. The only form of input to programs was d




In [None]:
# test the whole data transformer
embeddings = data_transformer(documents)
print(embeddings)

Splitting documents: 100%|██████████| 1/1 [00:00<00:00, 19.23it/s]
Batch embedding documents: 100%|██████████| 1/1 [00:02<00:00,  2.89s/it]
Adding embeddings to documents from batch: 1it [00:00, 2794.34it/s]

[Document(id=e45f82df-1dd9-4d0b-87eb-70ebbc07f229, text=

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I trie ..., meta_data={'title': "Paul Graham's essay", 'path': 'data/paul_graham/paul_graham_essay.txt', 'size': 75042, 'creation_date': 'Sun Jun 16 13:13:07 2024', 'last_modified_date': 'Sun Jun 16 13:13:07 2024'}, estimated_num_tokens=499, vector=[-0.07907507, 0.038137976, -0.00067343825, -0.019498399, 0.12735985, -0.03413015, 0.012397228, 0.11991674, -0.072140895, 0.09001707]..., parent_doc_id=024146ce-84e5-43a2-a1fe-e5385ad9aab5), Document(id=ba2beda6-43c4-4186-a678-4d9089899542, text=spectacularly loud printer.

I was puzzled by the 1401. 




In [None]:
db.register_transformer(data_transformer)
db.transformer_setups

{'Sequential__0_1_1.vectorizer_1.vectorizer.model_client_1.batch_embedder_': Sequential(
   (0): DocumentSplitter(split_by=word, split_length=400, split_overlap=200)
   (1): ToEmbeddings(
     batch_size=100
     (vectorizer): Embedder(
       model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
       (model_client): OpenAIClient()
     )
     (batch_embedder): BatchEmbedder(
       (embedder): Embedder(
         model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
         (model_client): OpenAIClient()
       )
     )
   )
 )}

In [None]:
db.transform_data(transformer=data_transformer)

Splitting documents: 100%|██████████| 1/1 [00:00<00:00, 20.65it/s]
Batch embedding documents: 100%|██████████| 1/1 [00:00<00:00,  1.13it/s]
Adding embeddings to documents from batch: 1it [00:00, 15887.52it/s]


'Sequential__0_1_1.vectorizer_1.vectorizer.model_client_1.batch_embedder_'

In [None]:
keys = list(db.transformed_documents.keys())
documents = db.transformed_documents[keys[0]]
vectors = [doc.vector for doc in documents]
print(len(vectors), type(vectors), vectors[0][0:10])

# check if all embeddings are the same length
dimensions = set([len(vector) for vector in vectors])
dimensions

68 <class 'list'> [-0.07920883, 0.038013875, -0.00059247564, -0.01969087, 0.1272431, -0.03413296, 0.012310769, 0.11960851, -0.07208321, 0.089897245]


{256}

In [None]:
# check the estimated token counts of all document texts
lengths = set([doc.estimated_num_tokens for doc in documents])
print(lengths)

{531, 316, 466, 467, 468, 469, 471, 472, 474, 476, 477, 478, 479, 480, 481, 482, 483, 484, 485, 486, 487, 488, 489, 490, 492, 493, 494, 496, 497, 498, 499, 500, 502, 509, 510}


In [None]:
total = 0
for doc in documents:
    if len(doc.vector) != 256:
        print(doc)
        total+=1
print(total)

0


In [None]:
# save the db state, including the original documents (length 1) and the transformed documents
db.save_state("tutorials/db_states.pkl")

In [None]:
# restore the db from the saved state

restored_db = LocalDB.load_state("tutorials/db_states.pkl")
restored_db

LocalDB(documents=[Document(id=024146ce-84e5-43a2-a1fe-e5385ad9aab5, text=

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I trie ..., meta_data={'title': "Paul Graham's essay", 'path': 'data/paul_graham/paul_graham_essay.txt', 'size': 75042, 'creation_date': 'Sun Jun 16 13:13:07 2024', 'last_modified_date': 'Sun Jun 16 13:13:07 2024'}, estimated_num_tokens=16534)], transformed_documents={'Sequential__0_1_1.vectorizer_1.vectorizer.model_client_1.batch_embedder_': [Document(id=e215f93d-9f3d-41de-8052-24bff73be6b2, text=

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming.

In [None]:
len_documents=len(restored_db.documents)
keys = list(restored_db.transformed_documents.keys())
len_transformed_documents=len(restored_db.transformed_documents[keys[0]])
print(len_documents, len_transformed_documents, keys)

1 68 ['Sequential__0_1_1.vectorizer_1.vectorizer.model_client_1.batch_embedder_']


In [None]:
# let's print out part of the vector
restored_db.transformed_documents[keys[0]][0].vector[0:10]


[-0.07920883,
 0.038013875,
 -0.00059247564,
 -0.01969087,
 0.1272431,
 -0.03413296,
 0.012310769,
 0.11960851,
 -0.07208321,
 0.089897245]

Now we have prepared the embeddings which can be used in ``FAISSRetriever``. The ``FAISSRetriever`` is a simple wrapper around the FAISS library. It is a simple and efficient way to search for the nearest neighbors in the embedding space.
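For intuition, nearest-neighbor search in the embedding space can be sketched as a brute-force scan: score every document vector against the query vector and keep the `k` best. The helper names below (`cosine_similarity`, `top_k_search`) are hypothetical, and the raw cosine scores here are an assumption for illustration; the library may use a different index type or score scaling.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    # Assumes both vectors are non-zero and have the same length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_search(
    doc_vectors: Sequence[Sequence[float]],
    query_vector: Sequence[float],
    k: int,
) -> Tuple[List[int], List[float]]:
    """Return the indices and scores of the k most similar document vectors."""
    scores = [cosine_similarity(vec, query_vector) for vec in doc_vectors]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return ranked, [scores[i] for i in ranked]


toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query_vector = [1.0, 0.1]
toy_indices, toy_scores = top_k_search(toy_vectors, toy_query_vector, k=2)
print(toy_indices, toy_scores)
```

A scan like this costs one similarity computation per document, which is fine for the 68 chunks used here; a dedicated library such as FAISS becomes worthwhile as the number of vectors grows.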

In [None]:

from adalflow.components.retriever import FAISSRetriever



retriever = FAISSRetriever(embedder=components["embedder"], top_k=5)
print(retriever)

FAISSRetriever(
  top_k=5, metric=prob
  (embedder): Embedder(
    model_kwargs={'model': 'text-embedding-3-small', 'dimensions': 256, 'encoding_format': 'float'}, 
    (model_client): OpenAIClient()
  )
)


In [None]:
documents = restored_db.transformed_documents[keys[0]]
vectors = [doc.vector for doc in documents]
print(len(vectors), type(vectors), vectors[0][0:10])

# check if all embeddings are the same length
dimensions = set([len(vector) for vector in vectors])
dimensions

68 <class 'list'> [-0.07920883, 0.038013875, -0.00059247564, -0.01969087, 0.1272431, -0.03413296, 0.012310769, 0.11960851, -0.07208321, 0.089897245]


{256}

In [None]:
# convert vectors to numpy array
import numpy as np
vectors_np = np.array(vectors, dtype=np.float32)

In [None]:
retriever.build_index_from_documents(documents=vectors)

In [None]:
# retrieve for two queries in one call
query = "What happened at Viaweb and Interleaf?"
second_query = "What company did Paul Graham co-found?"

output = retriever(input=[query, second_query])
output

[RetrieverOutput(doc_indices=[24, 25, 17, 32, 38], doc_scores=[0.7670000195503235, 0.7459999918937683, 0.734000027179718, 0.7329999804496765, 0.7200000286102295], query='What happened at Viaweb and Interleaf?', documents=None),
 RetrieverOutput(doc_indices=[47, 44, 49, 45, 46], doc_scores=[0.800000011920929, 0.7900000214576721, 0.7879999876022339, 0.7799999713897705, 0.7749999761581421], query='What company did Paul Graham co-found?', documents=None)]

In [None]:
# attach the retrieved documents to each output using the returned indices
for per_query_output in output:
    per_query_output.documents = [documents[idx] for idx in per_query_output.doc_indices]

output

[RetrieverOutput(doc_indices=[24, 25, 17, 32, 38], doc_scores=[0.7670000195503235, 0.7459999918937683, 0.734000027179718, 0.7329999804496765, 0.7200000286102295], query='What happened at Viaweb and Interleaf?', documents=[Document(id=10dda6e6-caa5-4961-a8c0-50f4dce70def, text=online stores. At first this was going to be normal desktop software, which in those days meant Windows software. That was an alarming prospect, because neither of us knew how to write Windows software or wanted to learn. We lived in the Unix world. But we decided we'd at least try writing a prototype store builder on Unix. Robert wrote a shopping cart, and I wrote a new site generator for stores  ..., meta_data={'title': "Paul Graham's essay", 'path': 'data/paul_graham/paul_graham_essay.txt', 'size': 75042, 'creation_date': 'Sun Jun 16 13:13:07 2024', 'last_modified_date': 'Sun Jun 16 13:13:07 2024'}, estimated_num_tokens=469, vector=[-0.044771, 0.077939205, -0.043221936, 0.016963774, 0.0947663, -0.09500929, -0.0

In the RAG notes, we will combine this with Generator to get the end-to-end response.

## BM25Retriever


In [None]:
from adalflow.components.retriever import BM25Retriever

index_strings = [doc.text for doc in documents]

retriever = BM25Retriever(documents=index_strings)

# retriever.build_index_from_documents(documents=index_strings)

output = retriever(input=[query, second_query])
output

[RetrieverOutput(doc_indices=[38, 17, 39, 25, 63], doc_scores=[10.1114636573896, 9.649273483751845, 9.355255256567045, 9.283456223615227, 9.064007693239802], query='What happened at Viaweb and Interleaf?', documents=None),
 RetrieverOutput(doc_indices=[60, 59, 38, 39, 14], doc_scores=[14.466181996644409, 14.269841646528711, 12.763172741824395, 10.693920475917215, 5.3626785714999], query='What company did Paul Graham co-found?', documents=None)]

In [None]:
retriever = BM25Retriever(top_k=1)
retriever.build_index_from_documents(["hello world", "world is beautiful", "today is a good day"])
output = retriever.retrieve("hello")
output

In [None]:
# save the index

path = "tutorials/bm25_index.json"
retriever.save_to_file(path)

In [None]:
retriever_loaded = BM25Retriever.load_from_file(path)

In [None]:
# test the loaded index
output = retriever_loaded.retrieve("hello", top_k=1)
output

[RetrieverOutput(doc_indices=[0], doc_scores=[0.6229580777634034], query='hello', documents=None)]

In [None]:
retriever_loaded

InMemoryBM25Retriever(
  top_k=1, k1=1.5, b=0.75, epsilon=0.25
  Number of documents: 3
)
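The `k1`, `b`, and `epsilon` values in the repr above are the usual Okapi BM25 parameters. The sketch below shows a common formulation, in which negative IDF values are floored at `epsilon` times the average IDF. That the retriever uses exactly this variant is an assumption, as are the hypothetical `bm25_scores` function and the lowercase whitespace tokenization. It does give the same score for the toy query "hello" as the output shown earlier.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: str,
    corpus: List[str],
    k1: float = 1.5,
    b: float = 0.75,
    epsilon: float = 0.25,
) -> List[float]:
    """Score every document in a non-empty corpus against the query."""
    # Assumption: lowercase whitespace tokenization.
    docs = [doc.lower().split() for doc in corpus]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))

    # Okapi IDF; negative values are floored at epsilon * average IDF.
    idf = {t: math.log((n_docs - n + 0.5) / (n + 0.5)) for t, n in doc_freq.items()}
    floor = epsilon * (sum(idf.values()) / len(idf))
    idf = {t: (v if v >= 0 else floor) for t, v in idf.items()}

    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query.lower().split():
            f = term_freq.get(term, 0)
            if f == 0:
                continue
            # k1 controls term-frequency saturation, b controls length normalization.
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf[term] * f * (k1 + 1) / (f + length_norm)
        scores.append(score)
    return scores


toy_corpus = ["hello world", "world is beautiful", "today is a good day"]
toy_bm25 = bm25_scores("hello", toy_corpus)
print(toy_bm25)
```

Unlike the semantic retriever, this scoring depends only on exact term overlap, which is why BM25 and embedding-based retrieval can rank the same chunks quite differently for the same query.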