# ADVANCED CUSTOMIZATION OPTIONS - METHODS - EXAMPLES


## 1. Custom Token Parsing (with SentenceWindowNodeParser):

Splits long documents into sentence-level nodes while keeping the surrounding context available. `SentenceWindowNodeParser` creates one node per sentence and stores a window of neighboring sentences (a fixed number on each side) in the node's metadata. Retrieval can then match on the precise sentence, while the LLM receives the wider window, which keeps the context coherent when processing large texts.
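The windowing idea can be illustrated without any library. The sketch below is not LlamaIndex's implementation: `build_sentence_windows` and its regex-based sentence splitter are hypothetical stand-ins, and the dictionary keys simply mirror the metadata keys used in the example that follows.

```python
import re


def build_sentence_windows(text, window_size=3):
    """Create one record per sentence, each carrying its surrounding sentences."""
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        # Take up to `window_size` sentences on each side of sentence i
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {"original_text": sentence, "window": " ".join(sentences[start:end])}
        )
    return nodes


demo_nodes = build_sentence_windows("A one. B two. C three. D four. E five.", window_size=1)
for node in demo_nodes:
    print(node["original_text"], "|", node["window"])
```

Each record keeps the single sentence for matching and the wider window for answering, which is the trade-off the parser is designed around.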

> Example:


In [33]:
!pip install -q llama-index

In [34]:
from llama_index.core import VectorStoreIndex, Document
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

In [35]:
# Generate a sample text
sample_text = """
Artificial intelligence (AI) is the simulation of human intelligence in machines.
Machine learning (ML) is a subset of AI, which enables systems to automatically learn and improve from experience.
Natural language processing (NLP) allows machines to understand and interpret human language.
Deep learning (DL), a subset of ML, uses neural networks to model complex patterns in data.
"""

# Wrap the sample text in LlamaIndex's Document structure
documents = [Document(text=sample_text)]

# Create the sentence window parser
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # number of sentences captured on each side of a sentence
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# Extract the set of nodes that will be stored in the vectorindex (parsing)
nodes = node_parser.get_nodes_from_documents(documents)

# Build the vector index from the parsed nodes
index = VectorStoreIndex(nodes)

# Create a query engine with post-processing for metadata replacement
# (replace the sentence in each node with its surrounding context)
query_engine = index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)

# Run a query and show the answer
question = "What is artificial intelligence?"
window_response = query_engine.query(question)
print(window_response)


Artificial intelligence is the simulation of human intelligence in machines.
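The metadata replacement step can also be pictured in plain Python. This is only a sketch under assumptions: nodes are modeled as dictionaries, and `replace_with_window` is a hypothetical helper rather than the post-processor's actual code.

```python
def replace_with_window(retrieved_nodes, target_metadata_key="window"):
    """Return copies of the nodes whose text is swapped for the stored window."""
    replaced = []
    for node in retrieved_nodes:
        window = node["metadata"].get(target_metadata_key)
        # Fall back to the original text when no window was stored
        new_text = window if window is not None else node["text"]
        replaced.append({**node, "text": new_text})
    return replaced


# The retriever matched a single sentence; the LLM should see its window instead
retrieved = [{"text": "B two.", "metadata": {"window": "A one. B two. C three."}}]
context = replace_with_window(retrieved)
print(context[0]["text"])
```

The input list is left untouched, so the original sentence remains available for inspection after the swap.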


## 2. Custom Retrieval Configurations (using Retriever with Metadata filter):

Lets you control which documents a retriever considers by attaching a metadata filter. Only documents whose metadata matches the given conditions (e.g., author, date, type) are eligible for retrieval, which keeps irrelevant sources out of the results and can improve the relevance and accuracy of the answers the LLM produces.
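Conceptually, a metadata filter narrows the candidate set before similarity ranking takes place. The following library-free sketch illustrates that order of operations; the record layout, the two-dimensional toy vectors, and the `filtered_top_k` helper are all assumptions made for illustration, not part of any vector store's API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_top_k(records, query_vec, filters, k=1):
    """Keep records whose metadata matches every filter, then rank by similarity."""
    candidates = [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda r: cosine(r["vector"], query_vec), reverse=True)
    return ranked[:k]


records = [
    {"id": "file1-chunk", "vector": [0.9, 0.1], "metadata": {"publication_year": "2022"}},
    {"id": "file2-chunk", "vector": [1.0, 0.0], "metadata": {"publication_year": "2011"}},
]

# Without the filter the 2011 chunk is the closest match; with it, only 2022 chunks compete
hits = filtered_top_k(records, [1.0, 0.0], {"publication_year": "2022"}, k=1)
print(hits[0]["id"])
```

Note that the filter value is compared as a string here, since that is how the year would be stored in the metadata.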

> Example:

In [36]:
!pip install -q llama-index llama-index-vector-stores-chroma

In [37]:
from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
import chromadb
from llama_index.llms.openai import OpenAI
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.schema import Node

In [38]:
# Set your API key
import openai
openai.api_key = "<YOUR_API_KEY>"

In [39]:
# Create the directory (if it does not exist yet) and download the PDFs from the internet
!mkdir -p documents
!wget https://wwfint.awsassets.panda.org/downloads/wwf_impacts_of_plastic_pollution_on_biodiversity.pdf -O ./documents/file1.pdf
!wget https://www.thegef.org/sites/default/files/publications/STAP_MarineDebris_-_website_1.pdf -O ./documents/file2.pdf

# Load the documents
documents = SimpleDirectoryReader(input_dir="./documents").load_data()

# Metadata mapping for each file
metadata_map = {
    "file1.pdf": {
        "author": "Mine B. Tekman",
        "document_type": "Summary of a study for wwf",
        "publication_year": "2022",
    },
    "file2.pdf": {
        "author": "Richard C. Thompson",
        "document_type": "Scientific information paper",
        "publication_year": "2011",
    }
}

# Add the metadata to documents based on file names
for doc in documents:
    file_name = doc.metadata.get("file_name")  # Set by SimpleDirectoryReader
    if file_name in metadata_map:
        # Merge the custom fields into the existing metadata instead of overwriting it
        doc.metadata.update(metadata_map[file_name])

# Create a vector store and embedding model
client = chromadb.PersistentClient(path="./metadata_filtered_db")
collection = client.get_or_create_collection("multi_doc_collection")
vector_store = ChromaVectorStore(chroma_collection=collection)
embedding_model = OpenAIEmbedding(model="text-embedding-ada-002")


In [40]:
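from llama_index.core import StorageContext
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Create the index, storing the embeddings in the Chroma collection
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    embed_model=embedding_model,
)

# Define a metadata filter on the publication year
# (the value is a string because that is how it is stored in metadata_map)
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="publication_year", value="2022")]
)

# Create a retriever that only considers nodes matching the filter
retriever = index.as_retriever(
    similarity_top_k=1,
    filters=filters,
)

# Build a query engine on top of the filtered retriever
query_engine = RetrieverQueryEngine.from_args(
    retriever=retriever,
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.1)
)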

You can try to change the publication year to 2011 (the publication year of the second file), then run the same query and check the differences between the two answers.

In [41]:
# Run a query
query = "Why do we have to give importance to the oceans when talking about global environmental problems?"
response = query_engine.query(query)
print(response)

Oceans are crucial to global ecological services and play a significant role in maintaining the balance of the Earth's ecosystems. They are highly sensitive environments that provide essential services such as regulating the climate, supporting biodiversity, and producing oxygen. Therefore, addressing issues like marine debris in the oceans is important as it directly impacts the health of these vital ecosystems and the overall well-being of the planet.


## 3. FLARE:

FLARE (Forward-Looking Active Retrieval) interleaves generation and retrieval instead of retrieving only once before answering. At each step the engine drafts a lookahead response; wherever the model signals that it needs more information (a `[Search(...)]` tag in the verbose output), the engine queries the index and fills in the retrieved answer before continuing. Grounding each part of the answer in retrieved text helps to reduce hallucinations.
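As a rough illustration of the iterative lookahead-and-retrieve loop, here is a minimal sketch in plain Python. It is not the library's implementation: `flare_loop`, `make_toy_lookahead`, and the `knowledge` dictionary are hypothetical stand-ins for the query engine, the LLM, and the index, and only the `[Search(...)]` tag format is taken from the verbose output of the example.

```python
import re

SEARCH_TAG = re.compile(r"\[Search\((.*?)\)\]")


def flare_loop(question, generate_lookahead, retrieve, max_iterations=4):
    """Alternate lookahead generation and retrieval until nothing is left to add."""
    response = ""
    for _ in range(max_iterations):
        lookahead = generate_lookahead(question, response)
        if not lookahead:  # The generator signals that the answer is complete
            break
        # Replace every [Search(query)] tag with the text retrieved for that query
        filled = SEARCH_TAG.sub(lambda m: retrieve(m.group(1)), lookahead)
        response = (response + " " + filled).strip()
    return response


def make_toy_lookahead(planned_steps):
    """Stand-in for an LLM: emits one pre-planned lookahead per call, then stops."""
    steps = iter(planned_steps)
    return lambda question, response_so_far: next(steps, "")


# Stand-in for an index: maps search queries to retrieved answers
knowledge = {
    "What is ML?": "Machine learning is a subset of AI.",
    "What is DL?": "Deep learning is a subset of ML.",
}

answer = flare_loop(
    "How are ML and DL related?",
    make_toy_lookahead(["[Search(What is ML?)]", "[Search(What is DL?)]"]),
    lambda query: knowledge.get(query, ""),
)
print(answer)
```

The `max_iterations` cap plays the same role as in the query engine configured in the example: it bounds the number of generate-then-retrieve rounds.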

> Example:

In [42]:
!pip install -q llama-index

In [43]:
from llama_index.core import VectorStoreIndex, Document
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.query_engine import FLAREInstructQueryEngine

In [44]:
# Set your API key
import openai
openai.api_key = "<YOUR_API_KEY>"

In [45]:
# Create the sample text to be indexed
sample_text = """
Artificial intelligence (AI) is the simulation of human intelligence in machines.
Machine learning (ML) is a subset of AI, which enables systems to automatically learn and improve from experience.
Natural language processing (NLP) allows machines to understand and interpret human language.
Deep learning (DL), a subset of ML, uses neural networks to model complex patterns in data.
"""

# Create a list of Document objects (each document is treated as separate text data)
documents = [Document(text=sample_text)]

# Initialize a simple node parser and use it to split the documents into nodes
node_parser = SimpleNodeParser.from_defaults()
nodes = node_parser.get_nodes_from_documents(documents)

# Build the vector index from the parsed nodes
vector_index = VectorStoreIndex(nodes)

# Create a simple query engine based on the vector index
index_query_engine = vector_index.as_query_engine()

# Set up the FLARE query engine
flare_query_engine = FLAREInstructQueryEngine(
    query_engine=index_query_engine,
    max_iterations=4,  # Maximum number of reasoning iterations
    verbose=True,  # Enables detailed output to understand the reasoning steps
)

In [46]:
# Run a query
question = "What is the relationship between artificial intelligence and machine learning?"
response = flare_query_engine.query(question)

# Print the result
print(response)

Query: What is the relationship between artificial intelligence and machine learning?
Current response:
Lookahead response: [Search(What is the relationship between artificial intelligence and machine learning?)]
Updated lookahead response: Machine learning is a subset of artificial intelligence, enabling systems to automatically learn and improve from experience.
Current response: Machine learning is a subset of artificial intelligence, enabling systems to automatically learn and improve from experience.
Lookahead response: [Search(How do artificial intelligence and machine learning differ?)]
Updated lookahead response: Artificial intelligence encompasses a broader scope of simulating human intelligence in machines, while machine learning is a specific subset of AI that focuses on enabling systems to learn and improve from experience automatically.
Current resp

## 4. RAPTOR:

RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval) is another advanced RAG technique, available in LlamaIndex through the `llama-index-packs-raptor` package. It builds a tree over the document: text chunks are clustered and summarized, and the summaries are clustered and summarized again, so that the tree captures both granular details and overarching narratives. Retrieving across these abstraction levels can improve question answering over extensive documents, particularly for questions that require multi-step reasoning.
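The tree construction can be sketched in a few lines of plain Python. This is a simplified illustration under loud assumptions: the real technique groups chunks by embedding similarity and summarizes them with an LLM, whereas the hypothetical `build_raptor_tree` below groups neighboring chunks and accepts any `summarize` callable (here a trivial string join).

```python
def build_raptor_tree(chunks, summarize, group_size=2):
    """Return a list of levels: level 0 holds the chunks, higher levels hold summaries."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so that each level shrinks")
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        # Assumption: group neighbors instead of clustering by embedding similarity
        groups = [current[i:i + group_size] for i in range(0, len(current), group_size)]
        levels.append([summarize(group) for group in groups])
    return levels


def collapse(levels):
    """Flatten all levels into one candidate pool, so details and summaries compete."""
    return [node for level in levels for node in level]


# Stand-in summarizer (assumption): an LLM call would go here
tree = build_raptor_tree(["a", "b", "c", "d"], summarize=lambda group: " + ".join(group))
for depth, level in enumerate(tree):
    print(depth, level)
print(len(collapse(tree)), "candidates across all levels")
```

Searching the flattened pool lets a broad question match a high-level summary while a specific question can still match an individual chunk.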

> Example:


In [47]:
!pip install -q llama-index llama-index-packs-raptor llama-index-vector-stores-chroma


In [48]:
from llama_index.core import SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import VectorStoreIndex
import chromadb


In [49]:
# Using !wget to download the document from the internet
# (we chose a random file about "fruit and vegetable production in the Federated States of Micronesia")
!wget https://openknowledge.fao.org/server/api/core/bitstreams/aac462ae-90d2-422c-b9e6-3e5336b18b52/content -O ./crop_production_and_cultivation_document.pdf

# Load the document
documents = SimpleDirectoryReader(input_files=["./crop_production_and_cultivation_document.pdf"]).load_data()

# Setup the vector store and embeddings
client = chromadb.PersistentClient(path="./crop_production_and_cultivation_db")
collection = client.get_or_create_collection("crop_production_and_cultivation")
vector_store = ChromaVectorStore(chroma_collection=collection)

embedding_model = OpenAIEmbedding(model="text-embedding-ada-002")
llm_model = OpenAI(model="gpt-3.5-turbo", temperature=0.1)

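# Create an index with the documents, storing the embeddings in the Chroma collection
from llama_index.core import StorageContext

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, embed_model=embedding_model
)

# Create a query engine from the index, using the LLM configured above
query_engine = index.as_query_engine(llm=llm_model)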





In [50]:
# Run a query
query = "What are the best practices for fruit and vegetable production and cultivation?"
response = query_engine.query(query)

# Show the answer
print(response)

Sunlight, good soil, water source, and wind shelter are crucial factors for successful fruit and vegetable production. It is important to ensure that plants receive at least six hours of sunlight daily, have access to good soil that can be improved with minerals and manure if needed, are located near a water source but not in areas prone to waterlogging, and are protected from strong winds. When choosing crops to plant, consider factors like ease of growth, suitability for the local region, and maximizing space efficiency. Opt for vegetables that are easy to grow such as Chinese cabbage, tomatoes, squash, lettuce, and root vegetables like taro. Additionally, it is beneficial to grow crops that are commonly grown in the local area to receive support and guidance from experienced farmers.
