A complete Retrieval-Augmented Generation (RAG) knowledge base ingestion pipeline built on AWS Step Functions. It covers document intake from S3, Confluence, Notion, and SharePoint; text extraction and semantic chunking; embedding generation via OpenAI or Amazon Bedrock; vector store upsert with deduplication; retrieval quality validation using MRR and NDCG@5; knowledge catalog versioning; and downstream application cache invalidation.
File: `statemachines/rag-ingestion-pipeline.asl.json`
Event: `events/event-document.json`
Processes a single document through the full RAG ingestion pipeline, with parallel catalog update and notification after retrieval quality validation passes.
IngestDocument
├─ ExtractTextContent
├─ GenerateEmbeddings
├─ IndexToVectorStore
├─ ValidateRetrievalQuality
├─ IsRetrievalQualitySufficient (Choice)
│ └─ false ──► IngestionFailed
└─ [Parallel] UpdateKnowledgeCatalog + NotifyIngestionStatus
  └─ IngestionComplete
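
The quality gate formed by `ValidateRetrievalQuality` and `IsRetrievalQualitySufficient` can be illustrated with a minimal sketch of the two metrics. This is not the code in `validate-retrieval-quality.js`: the function names, the `ranked`/`relevant` test-case shape, the binary relevance model, and the default thresholds are all assumptions made for illustration.

```javascript
// Hypothetical sketch: MRR and NDCG@5 over a set of test queries.
// Each case lists retrieved chunk ids in rank order (assumed unique)
// plus the set of ids judged relevant for that query.

// Reciprocal rank of the first relevant result (0 when none is retrieved).
function reciprocalRank(ranked, relevant) {
  const idx = ranked.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// NDCG@k with binary relevance: DCG of the actual ranking divided by
// the DCG of an ideal ranking that puts every relevant id first.
function ndcgAtK(ranked, relevant, k = 5) {
  let dcg = 0;
  ranked.slice(0, k).forEach((id, i) => {
    if (relevant.has(id)) dcg += 1 / Math.log2(i + 2);
  });
  let idcg = 0;
  for (let i = 0; i < Math.min(relevant.size, k); i++) {
    idcg += 1 / Math.log2(i + 2);
  }
  return idcg === 0 ? 0 : dcg / idcg;
}

// Averages both metrics across cases and applies the (assumed) thresholds.
function evaluateRetrieval(cases, thresholds = { mrr: 0.7, ndcgAt5: 0.6 }) {
  if (cases.length === 0) return { mrr: 0, ndcgAt5: 0, passed: false };
  const mean = (fn) => cases.reduce((sum, c) => sum + fn(c), 0) / cases.length;
  const mrr = mean((c) => reciprocalRank(c.ranked, c.relevant));
  const ndcgAt5 = mean((c) => ndcgAtK(c.ranked, c.relevant, 5));
  const passed = mrr >= thresholds.mrr && ndcgAt5 >= thresholds.ndcgAt5;
  return { mrr, ndcgAt5, passed };
}

// Example: the first query hits at rank 1, the second at rank 2.
const sampleCases = [
  { ranked: ["a", "b", "c"], relevant: new Set(["a"]) },
  { ranked: ["x", "y", "z"], relevant: new Set(["y"]) },
];
const report = evaluateRetrieval(sampleCases);
console.log(report);
```

In this sketch a Choice state would branch on `passed`, with `false` leading to `IngestionFailed`. Graded relevance judgments would need `ndcgAtK` to take per-result gains instead of a set.
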
File: `statemachines/bulk-document-ingestion.asl.json`
Event: `events/event-batch.json`
Ingests multiple documents concurrently using a Map state. Supports scheduled full-corpus re-ingestion after major content updates or new namespace imports.
[Map: MaxConcurrency 3 — $.documents]
├─ BatchIngestDocument
├─ BatchExtractText
├─ BatchGenerateEmbeddings
├─ BatchIndexToVectorStore
├─ BatchValidateRetrieval
└─ [Parallel] BatchUpdateCatalog + BatchNotifyStatus
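
To make the effect of `MaxConcurrency 3` concrete, here is a plain JavaScript analogy: at most three documents are in flight at any moment, and results keep the input order. This is only a sketch of the scheduling idea, not how Step Functions implements the Map state; `mapWithConcurrency` and the `worker` callback are hypothetical names.

```javascript
// Hypothetical sketch: run `worker` over `items` with a cap on how many
// calls are in flight, similar in spirit to a Map state's MaxConcurrency.
async function mapWithConcurrency(items, maxConcurrency, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item to claim

  // Each lane repeatedly claims the next unprocessed item until none remain.
  async function runLane() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  const laneCount = Math.min(maxConcurrency, items.length);
  const lanes = Array.from({ length: laneCount }, () => runLane());
  await Promise.all(lanes); // rejects as soon as any worker call rejects
  return results;
}

// Example: process seven placeholder documents, three at a time.
const sampleDocs = ["d1", "d2", "d3", "d4", "d5", "d6", "d7"];
mapWithConcurrency(sampleDocs, 3, async (doc) => `${doc}:indexed`)
  .then((out) => console.log(out));
```

A lower cap eases pressure on embedding API rate limits and vector store write throughput at the cost of longer wall-clock time for a full-corpus run.
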
File: `statemachines/knowledge-base-refresh-escalation.asl.json`
Event: `events/event-refresh.json`
Handles escalated refresh for stale or degraded-quality documents detected by the quality monitoring job. Applies a review hold, re-extracts content, then runs parallel embedding regeneration and vector store cleanup before re-indexing and routing to resolution.
SetRefreshEscalationContext (escalationReason: STALE_CONTENT_DETECTED)
├─ ReIngestDocument
├─ ReExtractText
├─ WaitForContentReview (5s)
└─ [Parallel] EscalatedEmbeddingGeneration + EscalatedVectorStoreCleanup
  └─ EscalatedIndexing
    └─ EscalatedCatalogUpdate
      └─ NotifyRefreshOutcome
        └─ RouteRefreshOutcome (Choice)
          ├─ passes ──► RefreshEscalationResolved
          └─ else ──► RefreshEscalationFailed
File: `statemachines/express-snippet-indexing.asl.json`
Event: `events/event-snippet.json`
Fast-path indexing for short text snippets such as support FAQs, product descriptions, or changelog entries. Skips the OCR extraction step and goes directly from ingestion to embedding generation.
SetSnippetContext (snippetMode: EXPRESS, skipOcr: true)
├─ ExpressIngestSnippet
├─ ExpressGenerateSnippetEmbeddings
├─ ExpressIndexSnippet
├─ WaitForIndexPropagation (2s)
└─ RouteByIndexHealth (Choice)
  ├─ queryTestPassed ──► ExpressUpdateSnippetCatalog
  │ └─ NotifySnippetIndexed
  │   └─ SnippetIndexingComplete
  └─ else ──► SnippetIndexingFailed
| Component | Name | Description |
|---|---|---|
| Lambda | ingest-document | Ingests PDF, Word, HTML, Markdown, Confluence, Notion, and SharePoint docs with namespace-specific chunking config and embedding model selection |
| Lambda | extract-text-content | Extracts and chunks text using TEXTRACT, DOCX_PARSER, HTML_PARSER, or Markdown/Confluence/Notion parsers with post-processing pipelines |
| Lambda | generate-embeddings | Generates dense vector embeddings using text-embedding-3-large, text-embedding-3-small, Amazon Titan v2, Cohere Embed v3, or Voyage Large 2 |
| Lambda | index-to-vector-store | Upserts embeddings to Pinecone or OpenSearch Serverless with deduplication, stale vector deletion, and query health check |
| Lambda | validate-retrieval-quality | Runs MRR and NDCG@5 quality tests against a set of representative queries and routes on a configurable threshold |
| Lambda | update-knowledge-catalog | Versions the catalog entry, invalidates downstream RAG application caches, and computes the next scheduled refresh date |
| Lambda | notify-ingestion-status | Notifies the content owner (email), RAG platform ops (Slack), and downstream applications (webhook) on completion or quality failure |
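
The deduplication and stale vector deletion described for `index-to-vector-store` could be approached with content-derived chunk ids, as in the sketch below. This is one possible scheme rather than the Lambda's actual logic: `chunkId`, `planIndexUpdate`, the whitespace normalization, and the 32-character id length are assumptions, and the calls to Pinecone or OpenSearch Serverless are deliberately left out.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical id scheme: hash the document id together with the
// whitespace-normalized chunk text, so re-ingesting unchanged content
// yields the same id and leaves the existing vector in place.
function chunkId(documentId, text) {
  const normalized = text.normalize("NFC").replace(/\s+/g, " ").trim();
  return createHash("sha256")
    .update(`${documentId}\n${normalized}`)
    .digest("hex")
    .slice(0, 32);
}

// Compares freshly produced chunks with the ids already stored for the
// document and returns what to upsert and which stale ids to delete.
function planIndexUpdate(documentId, chunks, existingIds) {
  const incoming = new Map();
  for (const text of chunks) {
    const id = chunkId(documentId, text);
    if (!incoming.has(id)) incoming.set(id, text); // drop in-document duplicates
  }
  const existing = new Set(existingIds);
  const toUpsert = [...incoming]
    .filter(([id]) => !existing.has(id))
    .map(([id, text]) => ({ id, text }));
  const toDelete = [...existing].filter((id) => !incoming.has(id));
  return { toUpsert, toDelete };
}

// Example: one chunk is unchanged, one is new, one stored id is stale.
const samplePlan = planIndexUpdate(
  "doc-1",
  ["Hello  world", "Hello world", "New text"],
  [chunkId("doc-1", "Hello world"), "stale-id"]
);
console.log(samplePlan);
```

Here only `"New text"` would be embedded and upserted, and `"stale-id"` would be deleted. Skipping unchanged chunks before embedding generation would also avoid paying for embeddings that already exist.
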
rag-knowledge-base-workflow/
├── template.yaml
├── README.md
├── lambdas/
│ ├── ingest-document.js
│ ├── extract-text-content.js
│ ├── generate-embeddings.js
│ ├── index-to-vector-store.js
│ ├── validate-retrieval-quality.js
│ ├── update-knowledge-catalog.js
│ └── notify-ingestion-status.js
├── statemachines/
│ ├── rag-ingestion-pipeline.asl.json
│ ├── bulk-document-ingestion.asl.json
│ ├── knowledge-base-refresh-escalation.asl.json
│ └── express-snippet-indexing.asl.json
└── events/
  ├── event-document.json
  ├── event-batch.json
  ├── event-refresh.json
  └── event-snippet.json
Run and iterate on the workflow using Thrubit:
- Open Thrubit
- Import `template.yaml`
- Load any event from `/events`
- Execute the workflow locally
Thrubit enables fast, cost-free local execution of Step Functions with full visibility into state transitions.