# RAG using LlamaIndex

## Settings

In [None]:
# download your own data
# !git clone https://github.com/unclecode/crawl4ai.git data/crawl4ai

Cloning into 'data/crawl4ai'...


In [None]:
# change these variables to your own
input_dir = "data/crawl4ai/docs/md_v2"
index_name = "crawl4ai"

EMBEDDING_MODEL = "models/text-embedding-004"   # or "BAAI/bge-base-en-v1.5"
GEMINI_API_KEY = ""
LLM_MODEL = "gemini/gemini-2.5-flash-lite"   # or "ollama/qwen3:8b"

file_types = [".md", ".mdx"]
vector_store_path = "output/chromadb"


## Step 1: Load Data

In [None]:
# imports
import os
import sys
from pathlib import Path
import subprocess

from dotenv import load_dotenv
import chromadb
from IPython.display import display, Markdown

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.litellm import LiteLLM
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.google_genai import GoogleGenAIEmbedding
from llama_index.embeddings.fastembed import FastEmbedEmbedding
from llama_index.core.node_parser import MarkdownNodeParser, CodeSplitter, SentenceSplitter
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.extractors import SummaryExtractor, TitleExtractor, KeywordExtractor, DocumentContextExtractor
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core.schema import MetadataMode


# make the project's custom components importable
sys.path.append('../core')
from custom_components.custom_extractors import CustomDocumentContextExtractor


load_dotenv()



True

In [3]:
# load chromadb vector store (or another vector store such as Qdrant or Pinecone)
chroma_process = subprocess.Popen(["chroma", "run", "--path", vector_store_path])
chroma_client = chromadb.HttpClient()
chroma_collection = chroma_client.get_or_create_collection(index_name)
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
print('Vector Store Count:', chroma_collection.count())

Vector Store Count: 0
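
The cell above starts the Chroma server with `subprocess.Popen` and connects to it right away, so the first request can race the server's startup. A minimal sketch of one way to guard against that, using only the standard library: poll the TCP port until it accepts connections. The helper name `wait_for_port` is hypothetical, and `localhost:8000` is an assumption about the address `chroma run` and `chromadb.HttpClient()` use by default.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # a successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # not listening yet (or unreachable): wait and retry
            time.sleep(interval)
    return False
```

It could be called between `subprocess.Popen(...)` and `chromadb.HttpClient()`, for example `wait_for_port("localhost", 8000)`, raising an error if it returns `False`.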


In [4]:
# load documents
documents = SimpleDirectoryReader(input_dir=input_dir, exclude=[], recursive=True, filename_as_id=True,
                                       required_exts=file_types).load_data()
# Note: For PDFs, first convert to text/markdown using specialized pdf converters

# use each file's path relative to input_dir as its doc_id for easy identification
for document in documents:
    document.doc_id = Path(document.metadata['file_path']).relative_to(Path(input_dir).absolute()).as_posix()
    document.metadata['file_path'] = document.doc_id

# create docstore - can be used to pass original docs to some Components
docstore = SimpleDocumentStore()
docstore.add_documents(documents)

print('Document Count', len(documents))

Document Count 67
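
The doc-id step above turns an absolute file path into a path relative to `input_dir`. A small standalone sketch of the same idea, using `PurePosixPath` so it behaves identically on any OS; the helper name `relative_doc_id` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def relative_doc_id(file_path, root):
    """Return file_path relative to root, in POSIX form.

    Raises ValueError if file_path is not located under root.
    """
    return PurePosixPath(file_path).relative_to(PurePosixPath(root)).as_posix()


doc_id = relative_doc_id(
    "/home/user/data/docs/advanced/multi-url-crawling.md",
    "/home/user/data/docs",
)
```

Relative IDs stay stable when the repository is cloned to a different location, which absolute paths would not.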


In [19]:
# Process documents one at a time (batch processing is also possible); here we pick a single document
document = documents[7]
print(document.doc_id)

advanced/multi-url-crawling.md


In [None]:
# preview document
display(Markdown(document.text[:500]))

# Advanced Multi-URL Crawling with Dispatchers

> **Heads Up**: Crawl4AI supports advanced dispatchers for **parallel** or **throttled** crawling, providing dynamic rate limiting and memory usage checks. The built-in `arun_many()` function uses these dispatchers to handle concurrency efficiently.

## 1. Introduction

When crawling many URLs:

- **Basic**: Use `arun()` in a loop (simple but less efficient)
- **Better**: Use `arun_many()`, which efficiently handles multiple URLs with prop

## Step 2: Chunking

In [None]:
# extract nodes from documents (i.e. split document into chunks)
# experiment with different node parsers

file_extension = Path(document.metadata['file_name']).suffix.lower()
if file_extension in ('.md', '.mdx'):
    node_parser = MarkdownNodeParser()
elif file_extension == '.txt':
    node_parser = SentenceSplitter(chunk_size=1024, chunk_overlap=20)
elif file_extension == '.py':
    node_parser = CodeSplitter(language='python', chunk_lines=70, chunk_lines_overlap=20, max_chars=2000)
else:
    raise ValueError(f"File type not supported: {file_extension}")

nodes = node_parser.get_nodes_from_documents([document])

# assign predictable node_id to nodes
for node in nodes:
    node.node_id = f"{node.ref_doc_id}-{node.hash}"

print('Node Count', len(nodes))

Node Count 16
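
To make the chunking step more concrete, here is a simplified sketch of header-based splitting. It is not `MarkdownNodeParser`'s implementation: the function `split_markdown_by_headers` and the exact `header_path` format are assumptions chosen to resemble the metadata shown later in this notebook. It starts a new chunk at every markdown header, ignores `#` lines inside fenced code blocks, and records the enclosing headers of each chunk.

```python
def split_markdown_by_headers(text):
    """Split markdown into chunks at header lines, tracking each chunk's parent headers."""
    chunks = []
    stack = []        # (level, title) of the headers enclosing the current position
    current = []      # lines of the chunk being built
    path = "/"        # header path of the chunk being built
    in_fence = False  # True while inside a ``` code fence

    def flush():
        body = "\n".join(current).strip()
        if body:
            chunks.append({"header_path": path, "text": body})

    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        is_header = (
            not in_fence
            and line.startswith("#")
            and line.lstrip("#").startswith(" ")
        )
        if is_header:
            flush()
            level = len(line) - len(line.lstrip("#"))
            # headers at the same or a deeper level are no longer ancestors
            while stack and stack[-1][0] >= level:
                stack.pop()
            path = "/" + "/".join(title for _, title in stack)
            stack.append((level, line.lstrip("#").strip()))
            current = [line]
        else:
            current.append(line)
    flush()
    return chunks
```

Splitting at headers keeps each chunk topically coherent, and the recorded path gives the embedding model some context about where the chunk sits in the document.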


## Step 3: Extract Metadata (Optional)

In [24]:
# load embedding model

embed_model = GoogleGenAIEmbedding(model_name=EMBEDDING_MODEL, api_key=GEMINI_API_KEY)
# embed_model = FastEmbedEmbedding(model_name=EMBEDDING_MODEL)

# load LLM (only required if extracting metadata, which is optional)
llm = LiteLLM(model=LLM_MODEL, max_tokens=8192, max_retries=6)
# llm = Ollama(model=LLM_MODEL, request_timeout=120.0, context_window=8192)

In [25]:
# extract metadata for nodes (uses LLMs) - Optional
# metadata can improve embedding quality and can also help during retrieval

# Warning - this can consume a lot of LLM tokens

title_extractor = TitleExtractor(llm=llm, show_progress=False)
keyword_extractor = KeywordExtractor(llm=llm, show_progress=False)

# context extractor - see Anthropic's blog post on Contextual Retrieval
context_extractor = CustomDocumentContextExtractor(
                # these 2 are mandatory
                docstore=docstore,
                max_context_length=8192,
                # below are optional
                llm=llm,  # defaults to Settings.llm
                oversized_document_strategy="warn",
                # max_output_tokens=100,
                key="context",
                prompt=CustomDocumentContextExtractor.ORIGINAL_CONTEXT_PROMPT,
                show_progress=False
            )

extractors = [keyword_extractor]  # include the ones that you want to use

In [26]:
# Warning: Can be costly if you have a lot of nodes
for extractor in extractors:
    extractor_results = await extractor.aextract(nodes)
    for node, result in zip(nodes, extractor_results):
        node.metadata.update(result)

In [27]:
nodes[0].metadata

{'file_path': 'advanced/multi-url-crawling.md',
 'file_name': 'multi-url-crawling.md',
 'file_size': 15208,
 'creation_date': '2025-07-24',
 'last_modified_date': '2025-07-24',
 'header_path': '/',
 'excerpt_keywords': 'Crawling, Dispatchers, Parallel, Throttled, Concurrency'}

## Step 4: Generate Embeddings

In [None]:
# get embeddings for nodes
# Embeddings are numerical representations of text, such that similar texts have similar embeddings
# Embeddings are generated from text + metadata

# MetadataMode controls which content (text and which metadata) a node returns
node_texts = [node.get_content(metadata_mode=MetadataMode.EMBED) for node in nodes]
embeddings = await embed_model.aget_text_embedding_batch(node_texts)

In [29]:
print(nodes[1].get_content(metadata_mode=MetadataMode.EMBED))

file_path: advanced/multi-url-crawling.md
header_path: /Advanced Multi-URL Crawling with Dispatchers
excerpt_keywords: Crawling, Dispatchers, Concurrency, Rate Limiting, Memory Management

## 1. Introduction

When crawling many URLs:

- **Basic**: Use `arun()` in a loop (simple but less efficient)
- **Better**: Use `arun_many()`, which efficiently handles multiple URLs with proper concurrency control
- **Best**: Customize dispatcher behavior for your specific needs (memory management, rate limits, etc.)

**Why Dispatchers?**  

- **Adaptive**: Memory-based dispatchers can pause or slow down based on system resources
- **Rate-limiting**: Built-in rate limiting with exponential backoff for 429/503 responses
- **Real-time Monitoring**: Live dashboard of ongoing tasks, memory usage, and performance
- **Flexibility**: Choose between memory-adaptive or semaphore-based concurrency

---
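
The output above shows only three of the metadata keys seen earlier on `nodes[0]` (`file_name`, `file_size` and the dates are missing), followed by a blank line and the chunk text. A rough sketch of how such an embedding input can be assembled is below. The function `build_embed_text` is hypothetical, and the `key: value` layout is inferred from the printed output rather than taken from the library's source.

```python
def build_embed_text(text, metadata, excluded_keys=()):
    """Prefix the chunk text with 'key: value' metadata lines, skipping excluded keys."""
    lines = [
        f"{key}: {value}"
        for key, value in metadata.items()
        if key not in excluded_keys
    ]
    metadata_str = "\n".join(lines)
    # metadata block, blank line, then the chunk text
    return f"{metadata_str}\n\n{text}" if metadata_str else text


example = build_embed_text(
    "## 1. Introduction",
    {"file_path": "advanced/multi-url-crawling.md", "file_size": 15208},
    excluded_keys=("file_size",),
)
```

Excluding keys such as file size or dates keeps values that say nothing about the content out of the embedding.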


In [30]:
print('Embedding dimensions', len(embeddings[0]))

Embedding dimensions 768
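
To illustrate what "similar texts have similar embeddings" means, here is a toy sketch using cosine similarity, one common way to compare embedding vectors. The 3-dimensional vectors are made up for illustration and stand in for the 768-dimensional embeddings above; they are not real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity is undefined for zero vectors; treat as unrelated
    return dot / (norm_a * norm_b)


# hypothetical embeddings: two related topics and one unrelated topic
crawling = [0.9, 0.1, 0.0]
dispatchers = [0.8, 0.2, 0.1]
cooking = [0.0, 0.1, 0.9]

related = cosine_similarity(crawling, dispatchers)
unrelated = cosine_similarity(crawling, cooking)
```

With these toy values, `related` comes out much larger than `unrelated`, which is the property retrieval relies on.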


In [31]:
# add embeddings to nodes:
for node, embedding in zip(nodes, embeddings):
    node.embedding = embedding

In [32]:
# add nodes (with embeddings) to Vector Store
node_ids = await vector_store.async_add(nodes)

In [33]:
print('Vector Store Count:', chroma_collection.count())

Vector Store Count: 16
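
The predictable `node_id` assigned in Step 2 matters once this notebook is run more than once. The sketch below shows the idea with a plain dict standing in for the vector store: IDs derived from the content let a re-run overwrite existing entries instead of duplicating them. The helpers `make_node_id` and `upsert` are hypothetical, hashing the text with SHA-256 is an assumption (the notebook uses LlamaIndex's own `node.hash`), and whether a real vector store overwrites or rejects an existing ID depends on that store's add/upsert semantics.

```python
import hashlib


def make_node_id(doc_id, text):
    """Build a deterministic ID from the document ID and a hash of the chunk text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{doc_id}-{digest}"


def upsert(store, doc_id, chunks):
    """Insert or overwrite chunks keyed by their deterministic ID; return the store size."""
    for text in chunks:
        store[make_node_id(doc_id, text)] = text
    return len(store)


store = {}
first_run = upsert(store, "advanced/multi-url-crawling.md", ["chunk one", "chunk two"])
second_run = upsert(store, "advanced/multi-url-crawling.md", ["chunk one", "chunk two"])
```

Here `second_run` equals `first_run`, because identical content maps to identical IDs; only changed chunks would add new entries.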


In [34]:
# stop the chromadb server
chroma_process.terminate()

## Retrieval

In [35]:
from llama_index.core.response_synthesizers import get_response_synthesizer
from llama_index.core.response_synthesizers.type import ResponseMode
from llama_index.core.schema import QueryBundle

In [None]:
query = "how to extract multiple urls with low memory usage?"
top_k = 5  # max relevant nodes to retrieve

In [37]:
# load chroma vector store
chroma_process = subprocess.Popen(["chroma", "run", "--path", vector_store_path])
chroma_client = chromadb.HttpClient()
chroma_collection = chroma_client.get_or_create_collection(index_name)
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

In [38]:
# load the retriever
vector_index = VectorStoreIndex.from_vector_store(vector_store, embed_model=embed_model)
vector_retriever = vector_index.as_retriever(similarity_top_k=top_k)

# retrieve nodes
retrieved_nodes = await vector_retriever.aretrieve(query)

In [None]:
# print the score and text of the top retrieved node
print('Node score:', retrieved_nodes[0].score)
display(Markdown(retrieved_nodes[0].node.text))

Node score: 0.4948026761151271


### 4.1 Batch Processing (Default)

```python
async def crawl_batch():
    browser_config = BrowserConfig(headless=True, verbose=False)
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,
        stream=False  # Default: get all results at once
    )
    
    dispatcher = MemoryAdaptiveDispatcher(
        memory_threshold_percent=70.0,
        check_interval=1.0,
        max_session_permit=10,
        monitor=CrawlerMonitor(
            display_mode=DisplayMode.DETAILED
        )
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        # Get all results at once
        results = await crawler.arun_many(
            urls=urls,
            config=run_config,
            dispatcher=dispatcher
        )
        
        # Process all results after completion
        for result in results:
            if result.success:
                await process_result(result)
            else:
                print(f"Failed to crawl {result.url}: {result.error_message}")
```

**Review:**  
- **Purpose:** Executes a batch crawl with all URLs processed together after crawling is complete.  
- **Dispatcher:** Uses `MemoryAdaptiveDispatcher` to manage concurrency and system memory.  
- **Stream:** Disabled (`stream=False`), so all results are collected at once for post-processing.  
- **Best Use Case:** When you need to analyze results in bulk rather than individually during the crawl.

---
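
Conceptually, the retriever embeds the query, scores it against every stored node embedding, and returns the `top_k` best matches. A brute-force sketch of that ranking step with a dict as the store is shown below. The function `top_k_nodes` and the toy vectors are hypothetical, and cosine similarity is an assumption: the metric actually used depends on how the Chroma collection is configured, and real vector stores use approximate indexes rather than a full scan.

```python
import heapq
import math


def top_k_nodes(query_vec, store, k):
    """Return the k highest-scoring (score, node_id) pairs, best first."""

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    scored = ((cosine(query_vec, vec), node_id) for node_id, vec in store.items())
    return heapq.nlargest(k, scored)


# hypothetical node embeddings keyed by node ID
toy_store = {
    "batch-processing": [1.0, 0.0],
    "summary": [0.0, 1.0],
    "introduction": [0.7, 0.7],
}
results = top_k_nodes([1.0, 0.0], toy_store, k=2)
```

`results` holds the two closest nodes in descending score order, mirroring what `similarity_top_k` controls in the retriever above.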

## Answer Generation

In [43]:
# # setup langfuse for Observability - Optional

# # Enter your Langfuse Keys
# LANGFUSE_SECRET_KEY = ""
# LANGFUSE_PUBLIC_KEY = ""
# LANGFUSE_HOST = "https://cloud.langfuse.com"

# from langfuse import get_client
# from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

# def initialize_langfuse():
#     langfuse = get_client()
#     langfuse_available = False

#     # Verify langfuse connection
#     if langfuse.auth_check():
#         langfuse_available = True
#         LlamaIndexInstrumentor().instrument()
#         print("Langfuse client is authenticated and ready!")
#         return langfuse, langfuse_available
#     else:
#         print("Authentication failed. Please check your credentials and host.")
#         return langfuse, langfuse_available
    

In [44]:
langfuse, langfuse_available = None, False
# langfuse, langfuse_available = initialize_langfuse()

In [46]:
query_bundle = QueryBundle(query)

In [47]:
# synthesize response from retrieved nodes
response_synthesizer = get_response_synthesizer(llm=llm, response_mode=ResponseMode.COMPACT)

In [48]:
if not langfuse_available:
    response = response_synthesizer.synthesize(query_bundle, retrieved_nodes)
else:
    with langfuse.start_as_current_span(name="Vector DB Query"):
        response = response_synthesizer.synthesize(query_bundle, retrieved_nodes)
    langfuse.flush()

In [53]:
display(Markdown(response.response))

To extract multiple URLs with low memory usage, you can utilize the `MemoryAdaptiveDispatcher`. This dispatcher dynamically adjusts concurrency based on system memory, pausing or slowing down crawling when memory resources are constrained. When using this dispatcher, you can configure parameters such as `memory_threshold_percent`, `check_interval`, and `max_session_permit` to fine-tune its behavior. The `arun_many()` function, which supports these dispatchers, is recommended for efficient handling of multiple URLs with proper concurrency control.
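
`ResponseMode.COMPACT` concatenates the retrieved chunks into as few prompts as will fit in the LLM's context window, which reduces the number of LLM calls. A rough sketch of that packing step is below. The function `pack_chunks` is hypothetical and budgets by character count for simplicity, whereas a real implementation would count tokens and reserve room for the prompt template.

```python
def pack_chunks(chunks, max_chars, separator="\n\n"):
    """Greedily merge consecutive chunks into as few prompts as fit within max_chars."""
    packed = []
    current = ""
    for chunk in chunks:
        candidate = chunk if not current else current + separator + chunk
        if len(candidate) <= max_chars or not current:
            # fits, or is a single oversized chunk that must go in its own prompt
            current = candidate
        else:
            packed.append(current)
            current = chunk
    if current:
        packed.append(current)
    return packed


prompts = pack_chunks(["aaaa", "bbbb", "cc"], max_chars=10)
```

With a budget of 10 characters the first two chunks share one prompt and the third starts a new one, so three chunks need two LLM calls instead of three.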

In [None]:
# check the source nodes used for generating the response
print(response.get_formatted_sources())

> Source (Doc id: advanced/multi-url-crawling.md-4a280cb47538cc51d2edb2504467bdb1b7985e497b1786019361e74eac2c8929): ### 4.1 Batch Processing (Default)

```python
async def crawl_batch():
    browser_config = B...

> Source (Doc id: advanced/multi-url-crawling.md-503f24d9b00dadf6469b8b8fb5cec8fa6b337d62ef3e5a38f41d76b9f85547b1): ## 1. Introduction

When crawling many URLs:

- **Basic**: Use `arun()` in a loop (simple but...

> Source (Doc id: advanced/multi-url-crawling.md-ebd3e0e1f5e67c69c4bca6b4b773f2aa1de176f28796bd899259861fb50b3164): ## 5. Dispatch Results

Each crawl result includes dispatch information:

```python
@datacla...

> Source (Doc id: advanced/multi-url-crawling.md-d6265503231a80d6823982dc7df4424c336a83759be172d75108aa71129d0ee1): # Advanced Multi-URL Crawling with Dispatchers

> **Heads Up**: Crawl4AI supports advanced disp...

> Source (Doc id: advanced/multi-url-crawling.md-1caae7d6b6db998a7720d209dd266d8a2ccdf68de71530263c9dfb4443866f54): ## 6. Summary

1. **Two Dis