# 🧠 Building Effective Workflow Agents

Welcome to our workflow agents lab! In this module, we'll learn how to build sophisticated workflow agents using several advanced patterns:

- **Prompt chaining**: Breaking tasks into sequential steps where each output feeds into the next
- **Routing**: Directing requests to different specialized agents based on query type
- **Parallelization**: Running multiple agent tasks simultaneously for efficiency
- **Orchestrators**: Coordinating complex workflows between multiple agents
- **Evaluator-optimizer patterns**: Continuously improving agent performance

In the next few notebooks, we'll apply these techniques to build a "deep research" OpenSearch Question Answering workflow agent that leverages OpenSearch's public documentation.
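
As a rough preview, the first two patterns in the list above can be sketched with plain Python callables standing in for model calls. The names `run_chain`, `route`, and the toy handlers are illustrative assumptions, not part of this lab's code; a real agent would call an LLM where the lambdas are.

```python
from typing import Callable, Dict, List

# Stand-in for a model call: a real agent would send the prompt to an LLM.
LLMCall = Callable[[str], str]


def run_chain(steps: List[LLMCall], user_input: str) -> str:
    """Prompt chaining: feed each step's output into the next step."""
    result = user_input
    for step in steps:
        result = step(result)
    return result


def route(query: str, handlers: Dict[str, LLMCall], classify: Callable[[str], str]) -> str:
    """Routing: pick one specialized handler based on the classified query type."""
    label = classify(query)
    handler = handlers.get(label, handlers["default"])
    return handler(query)


# Toy steps and handlers so the sketch runs without any model access.
chain_output = run_chain(
    [str.strip, str.lower, lambda s: f"summary of: {s}"],
    "  What Is OpenSearch?  ",
)

handlers = {
    "search": lambda q: f"[search agent] {q}",
    "default": lambda q: f"[general agent] {q}",
}
classify = lambda q: "search" if "query" in q.lower() else "other"
routed_output = route("How do I write a match query?", handlers, classify)

print(chain_output)
print(routed_output)
```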

### 💾 Setup: ChromaDB Vector Store

In this notebook, we'll set up our ChromaDB vector store for use across these different patterns. This involves:

1. Downloading OpenSearch documentation
2. Processing and chunking the text
3. Creating embeddings
4. Storing them in ChromaDB

**Note**: Building the ChromaDB cache from the public documentation takes about 2 minutes. If doing the lab on your own, you'll need to run this section. If participating in an AWS lab environment, we've pre-created the ChromaDB store for you, and you can skip this section.

Let's begin! 🚀

## Step 1: Download our source data
First, we'll clone the OpenSearch documentation repository, which is licensed under Apache 2.0. This will be our "base knowledge". We use the `%%bash` cell magic to run the entire cell as a bash script.

In [None]:
%%bash
mkdir -p ../../data/opensearch-docs
cd ../../data/opensearch-docs
git clone https://github.com/opensearch-project/documentation-website.git .
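
Before moving on, it can be worth confirming that the clone produced Markdown files. The helper below is a minimal sketch; `count_markdown_files` is a hypothetical name, and the directory is the one used in the cell above.

```python
from pathlib import Path


def count_markdown_files(root: str) -> int:
    """Count .md files under root, recursively."""
    return sum(1 for _ in Path(root).rglob("*.md"))


docs_dir = "../../data/opensearch-docs"
if Path(docs_dir).is_dir():
    print(f"Found {count_markdown_files(docs_dir)} markdown files in {docs_dir}")
else:
    print(f"{docs_dir} does not exist yet; run the clone cell first.")
```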

## Step 2: Create RAG Pipeline
We'll modify the helper functions from module 1 to ingest Markdown documents. To speed things up, we'll also use threading when adding the chunks to the local ChromaDB collection.

In [None]:
import chromadb
import boto3
from chromadb.config import Settings

# Initialize Chroma client from our persisted store
chroma_client = chromadb.PersistentClient(path="../../data/chroma")

# Initialize the Bedrock client
session = boto3.Session()
bedrock = session.client(service_name='bedrock-runtime')

print("✅ Client setup complete!")

Next, we'll modify the LlamaIndex ingestion pipeline to load and chunk the Markdown files.

In [None]:
from typing import List, Dict, Any
from pydantic import BaseModel
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import Node
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline

import re

# Create a class to use instead of LlamaIndex nodes. This decouples our Chroma collections from LlamaIndex.
class RAGChunk(BaseModel):
    id_: str
    text: str
    metadata: Dict[str, Any] = {}


class SentenceSplitterChunkingStrategy:
    def __init__(self, input_dir: str, chunk_size: int = 256, chunk_overlap: int = 128):
        self.input_dir = input_dir
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.pipeline = self._create_pipeline()

        # Helper to get regex pattern for normalizing relative file paths.
        self.relative_path_pattern = rf"{re.escape(input_dir)}(/.*)"

    def _extract_relative_path(self, full_path):
        # Get Regex pattern
        pattern = self.relative_path_pattern
        match = re.search(pattern, full_path)
        if match:
            return match.group(1).lstrip('/')
        return None

    def _create_pipeline(self) -> IngestionPipeline:
        transformations = [
            SentenceSplitter(chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap),
        ]
        return IngestionPipeline(transformations=transformations)

    def load_documents(self) -> List[Document]:
        # If you're using a different type of file besides md, you'll want to change this. 
        return SimpleDirectoryReader(
            input_dir=self.input_dir, 
            recursive=True,
            required_exts=['.md']
        ).load_data()

    def to_ragchunks(self, nodes: List[Node]) -> List[RAGChunk]:
        return [
            RAGChunk(
                id_=node.node_id,
                text=node.text,
                metadata={
                    **node.metadata,
                    'relative_path': self._extract_relative_path(node.metadata['file_path'])
                }
            )
            for node in nodes
        ]

    def process(self) -> List[RAGChunk]:
        documents = self.load_documents()
        nodes = self.pipeline.run(documents=documents)
        rag_chunks = self.to_ragchunks(nodes)
        
        print(f"Processing complete. Created {len(rag_chunks)} chunks.")
        return rag_chunks
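
To make the path normalization in `_extract_relative_path` concrete, here is a standalone sketch of the same regex. The function name and the example paths are hypothetical; only the pattern itself comes from the class above.

```python
import re
from typing import Optional


def extract_relative_path(input_dir: str, full_path: str) -> Optional[str]:
    """Return the part of full_path that follows input_dir, or None if absent."""
    pattern = rf"{re.escape(input_dir)}(/.*)"
    match = re.search(pattern, full_path)
    return match.group(1).lstrip("/") if match else None


# Hypothetical file paths, used only for illustration.
example = "/home/user/lab/data/opensearch-docs/_search-plugins/knn/index.md"
print(extract_relative_path("data/opensearch-docs", example))
print(extract_relative_path("data/opensearch-docs", "/tmp/other/file.md"))
```

Note that the pattern only matches when `input_dir` appears verbatim inside the file path; otherwise the function returns `None`, so it is worth spot-checking the `relative_path` metadata on a few chunks.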

In [None]:
# These values were chosen based on the outputs of this notebook: https://github.com/aws-samples/genai-system-evaluation/blob/main/example-notebooks/1_Embedding_And_Chunking_Validation.ipynb
chunking_strategy = SentenceSplitterChunkingStrategy(
    input_dir="../../data/opensearch-docs",
    chunk_size=2048,
    chunk_overlap=128
)

# Run the pipeline and collect the resulting chunks.
chunks: List[RAGChunk] = chunking_strategy.process()
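
To build intuition for how `chunk_size` and `chunk_overlap` interact, the sketch below applies a plain sliding window to a list of tokens. This is a simplification and an assumption for illustration only: LlamaIndex's `SentenceSplitter` also tries to respect sentence boundaries, so its real output will differ.

```python
from typing import List


def sliding_window_chunks(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = [f"t{i}" for i in range(10)]
for chunk in sliding_window_chunks(tokens, chunk_size=4, chunk_overlap=2):
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so a larger overlap yields more chunks and more repeated text in the vector store.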

Let's create a wrapper around ChromaDB like we did in module 1. This time we'll add some parallelization to speed things up, because we have a lot of documents.

In [None]:
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List

from chromadb.utils.embedding_functions import AmazonBedrockEmbeddingFunction
from pydantic import BaseModel


class RetrievalResult(BaseModel):
    id: str
    document: str
    embedding: List[float]
    distance: float
    metadata: Dict = {}


# Example of a concrete implementation
class ChromaDBWrapperClient:

    def __init__(self, chroma_client, collection_name: str, embedding_function: AmazonBedrockEmbeddingFunction):
        self.client = chroma_client
        self.collection_name = collection_name
        self.embedding_function = embedding_function

        # Create the collection
        self.collection = self._create_collection()

    def _create_collection(self):
        return self.client.get_or_create_collection(
            name=self.collection_name,
            embedding_function=self.embedding_function
        )

    def add_chunks_to_collection(self, chunks: List[RAGChunk], batch_size: int = 20, num_workers: int = 10):
        batches = [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]
        
        with ThreadPoolExecutor(max_workers=num_workers) as executor:
            futures = [executor.submit(self._add_batch, batch) for batch in batches]
            for future in as_completed(futures):
                future.result()  # This will raise an exception if one occurred during the execution
        print('Finished Ingesting Chunks Into Collection')

    def _add_batch(self, batch: List[RAGChunk]):
        self.collection.add(
            ids=[chunk.id_ for chunk in batch],
            documents=[chunk.text for chunk in batch],
            metadatas=[chunk.metadata for chunk in batch]
        )

    def retrieve(self, query_text: str, n_results: int = 5) -> List[RetrievalResult]:
        # Query the collection
        results = self.collection.query(
            query_texts=[query_text],
            n_results=n_results,
            include=['embeddings', 'documents', 'metadatas', 'distances']
        )

        # Transform the results into RetrievalResult objects
        retrieval_results = []
        for i in range(len(results['ids'][0])):
            retrieval_results.append(RetrievalResult(
                id=results['ids'][0][i],
                document=results['documents'][0][i],
                embedding=results['embeddings'][0][i],
                distance=results['distances'][0][i],
                metadata=results['metadatas'][0][i] if results['metadatas'][0] else {}
            ))

        return retrieval_results
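
The batching and threading in `add_chunks_to_collection` can be tried in isolation. The sketch below mirrors that logic with a stand-in worker instead of `collection.add()`; `make_batches` and `process_in_parallel` are hypothetical names introduced for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def make_batches(items: Sequence[T], batch_size: int) -> List[Sequence[T]]:
    """Slice items into consecutive batches of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_in_parallel(batches, worker: Callable, num_workers: int = 4) -> int:
    """Submit every batch to a thread pool and re-raise the first worker error."""
    completed = 0
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(worker, batch) for batch in batches]
        for future in as_completed(futures):
            future.result()  # re-raises any exception from the worker
            completed += 1
    return completed


# Stand-in for collection.add(): records the batch sizes it receives.
seen_sizes = []
batches = make_batches(list(range(45)), batch_size=20)
done = process_in_parallel(batches, lambda batch: seen_sizes.append(len(batch)))
print(done, sorted(seen_sizes))
```

Threads are a reasonable fit here on the assumption that most of the time per batch is spent waiting on the embedding service over the network rather than on local computation.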

Now we'll bulk upload the chunks produced by our LlamaIndex ingestion pipeline to Chroma. This creates a persistent database under data/chroma.

In [None]:
from chromadb.utils.embedding_functions import AmazonBedrockEmbeddingFunction

# Define some experiment variables
EMBEDDING_MODEL_ID: str = 'amazon.titan-embed-text-v2:0'
COLLECTION_NAME: str = 'opensearch-docs-rag'

# This is a handy function Chroma implemented for calling Bedrock. Let's use it!
embedding_function = AmazonBedrockEmbeddingFunction(
    session=session,
    model_name=EMBEDDING_MODEL_ID
)

# Create our retrieval task. All retrieval tasks in this tutorial implement BaseRetrievalTask, which has the method retrieve().
# If you'd like to extend this to a different retrieval configuration, all you have to do is create a class that implements
# this abstract class, and the rest is the same!
chroma_os_docs_collection: ChromaDBWrapperClient = ChromaDBWrapperClient(
    chroma_client = chroma_client, 
    collection_name = COLLECTION_NAME,
    embedding_function = embedding_function
)

chroma_os_docs_collection.add_chunks_to_collection(chunks)
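
For context, the sketch below shows roughly what requesting a single embedding from Bedrock looks like without the Chroma helper. It assumes the Titan text embedding request shape (`inputText` in, `embedding` out) and is not a copy of Chroma's implementation; `embed_text` is a hypothetical helper, and calling it requires AWS credentials with access to the model.

```python
import json
from typing import Any, List


def embed_text(
    bedrock_client: Any,
    text: str,
    model_id: str = "amazon.titan-embed-text-v2:0",
) -> List[float]:
    """Request one embedding from a Titan text embedding model via invoke_model."""
    response = bedrock_client.invoke_model(
        modelId=model_id,
        body=json.dumps({"inputText": text}),
        accept="application/json",
        contentType="application/json",
    )
    # The response body is a stream containing a JSON document.
    payload = json.loads(response["body"].read())
    return payload["embedding"]
```

With the `bedrock` client created earlier in this notebook, `embed_text(bedrock, "What is OpenSearch?")` should return a list of floats.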

Now let's run some queries and see what comes back.

In [None]:
chroma_os_docs_collection.retrieve("What is OpenSearch?")
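
The raw return value includes full embedding vectors, which makes it hard to read. One option is a small formatter like the sketch below. `Result` is a hypothetical stand-in that mirrors only the `RetrievalResult` fields used here, and the sample data is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Result:
    """Hypothetical stand-in with the same fields we read from RetrievalResult."""
    document: str
    distance: float
    metadata: Dict = field(default_factory=dict)


def format_results(results: List[Result], max_chars: int = 80) -> str:
    """Render one line per result: rank, distance, source path, text preview."""
    lines = []
    for rank, result in enumerate(results, start=1):
        source = result.metadata.get("relative_path", "unknown")
        preview = result.document[:max_chars].replace("\n", " ")
        lines.append(f"{rank}. [{result.distance:.3f}] {source}: {preview}")
    return "\n".join(lines)


# Made-up sample results; real ones come from retrieve().
sample = [
    Result("OpenSearch is a search and analytics suite.", 0.412, {"relative_path": "_about/index.md"}),
    Result("A second chunk\nwith a newline.", 0.537),
]
print(format_results(sample))
```

Because the function only reads `document`, `distance`, and `metadata`, the list returned by `retrieve()` should work with it directly.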


# Conclusion
In this setup, we downloaded the OpenSearch documentation and ingested it into a local vector database that we'll reuse across the next couple of labs. Feel free to move on to the next lab in the module.