AI — RAG Knowledge Base Workflow

A complete Retrieval-Augmented Generation (RAG) knowledge base ingestion pipeline on AWS Step Functions. Covers document intake from S3, Confluence, Notion, and SharePoint; text extraction and semantic chunking; embedding generation via OpenAI or Amazon Bedrock; vector store upsert with deduplication; retrieval quality validation using MRR and NDCG@5; knowledge catalog versioning; and downstream application cache invalidation.

Workflows

1. `RagIngestionPipeline`

File: statemachines/rag-ingestion-pipeline.asl.json Event: events/event-document.json

Processes a single document through the full RAG ingestion pipeline, with parallel catalog update and notification after retrieval quality validation passes.

IngestDocument
  ├─ ExtractTextContent
      ├─ GenerateEmbeddings
          ├─ IndexToVectorStore
              ├─ ValidateRetrievalQuality
                  ├─ IsRetrievalQualitySufficient (Choice)
                  │    └─ false ──► IngestionFailed
                  └─ [Parallel] UpdateKnowledgeCatalog + NotifyIngestionStatus
                       └─ IngestionComplete
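
The `ValidateRetrievalQuality` step and the Choice that follows it can be illustrated with a small metric sketch. This is not the code in `validate-retrieval-quality.js`. The input shape (a ranked list of document IDs plus a set of relevant IDs per test query), the use of binary relevance, and the `0.7` threshold are all assumptions chosen for illustration.

```javascript
// Hypothetical sketch: binary-relevance MRR and NDCG@k over a set of test queries.
// Assumed input shape per query: { ranked: string[], relevant: Set<string> }.

// Reciprocal rank of the first relevant hit, or 0 if none was retrieved.
function reciprocalRank(ranked, relevant) {
  const idx = ranked.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

function ndcgAtK(ranked, relevant, k = 5) {
  // DCG with binary gains: each hit at 1-based rank r contributes 1 / log2(r + 1).
  const dcg = ranked
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevant.has(id) ? 1 / Math.log2(i + 2) : 0), 0);

  // Ideal DCG places every relevant document (up to k) at the top of the list.
  const idealHits = Math.min(relevant.size, k);
  let idcg = 0;
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2);

  return idcg === 0 ? 0 : dcg / idcg;
}

// Averages both metrics across queries and applies one assumed threshold to each.
function evaluateRetrieval(results, threshold = 0.7) {
  const mean = (xs) => (xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length);
  const mrr = mean(results.map((r) => reciprocalRank(r.ranked, r.relevant)));
  const ndcg5 = mean(results.map((r) => ndcgAtK(r.ranked, r.relevant, 5)));
  return { mrr, ndcg5, passed: mrr >= threshold && ndcg5 >= threshold };
}
```

In this sketch the `passed` flag plays the role of the value that `IsRetrievalQualitySufficient` would branch on. The actual field name and threshold logic in the state machine may differ.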

2. `BulkDocumentIngestion`

File: statemachines/bulk-document-ingestion.asl.json Event: events/event-batch.json

Ingests multiple documents concurrently using a Map state. Supports scheduled full-corpus re-ingestion after major content updates or new namespace imports.

[Map: MaxConcurrency 3 — $.documents]
  ├─ BatchIngestDocument
  ├─ BatchExtractText
  ├─ BatchGenerateEmbeddings
  ├─ BatchIndexToVectorStore
  ├─ BatchValidateRetrieval
  └─ [Parallel] BatchUpdateCatalog + BatchNotifyStatus
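
To make the effect of `MaxConcurrency 3` concrete, the following is a minimal sketch of the same bounded-concurrency pattern in plain JavaScript. It is not Step Functions code and does not come from the repository. The helper name `mapWithConcurrency` and the `worker` callback are hypothetical.

```javascript
// Hypothetical sketch: run an async worker over items with at most `limit` in flight,
// loosely mirroring a Map state with MaxConcurrency 3.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until the list is exhausted.
  // JavaScript is single-threaded, so `next++` cannot race between runners.
  async function runner() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  // Start at most `limit` runners; results keep the input order.
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}
```

A call such as `mapWithConcurrency(event.documents, 3, processDocument)` would process the batch three documents at a time, where `processDocument` stands in for the per-document steps listed above. In this sketch a rejected worker rejects the overall promise while the other runners keep going, which is not necessarily how the deployed state machine handles failures.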

3. `KnowledgeBaseRefreshEscalation`

File: statemachines/knowledge-base-refresh-escalation.asl.json Event: events/event-refresh.json

Handles escalated refresh for stale or degraded-quality documents detected by the quality monitoring job. Applies a review hold, re-extracts content, then runs parallel embedding regeneration and vector store cleanup before re-indexing and routing to resolution.

SetRefreshEscalationContext (escalationReason: STALE_CONTENT_DETECTED)
  ├─ ReIngestDocument
  ├─ ReExtractText
  ├─ WaitForContentReview (5s)
  └─ [Parallel] EscalatedEmbeddingGeneration + EscalatedVectorStoreCleanup
       └─ EscalatedIndexing
            └─ EscalatedCatalogUpdate
                 └─ NotifyRefreshOutcome
                      └─ RouteRefreshOutcome (Choice)
                           ├─ passes ──► RefreshEscalationResolved
                           └─ else ──► RefreshEscalationFailed
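
The ordering in this flow (sequential re-extraction, two parallel branches, then indexing and routing) can be sketched as follows. The step functions are injected stand-ins rather than the repository's Lambda handlers, the `passes` field is an assumed name for whatever `RouteRefreshOutcome` actually inspects, and the wait, catalog update, and notification steps are omitted for brevity.

```javascript
// Hypothetical sketch of the escalation flow's ordering; `steps` holds stand-in
// async functions, not the real handlers.
async function runEscalatedRefresh(doc, steps) {
  const extracted = await steps.reExtractText(doc);

  // Both branches start together and must both finish before indexing,
  // like the branches of a Parallel state.
  const [embeddings, cleanup] = await Promise.all([
    steps.generateEmbeddings(extracted),
    steps.cleanupVectorStore(doc),
  ]);

  const indexed = await steps.index({ doc, embeddings, cleanup });

  // Route to one of the two terminal states named in the diagram.
  return indexed.passes ? "RefreshEscalationResolved" : "RefreshEscalationFailed";
}
```

Running cleanup alongside embedding regeneration means stale vectors are removed while the new embeddings are still being computed, so indexing starts only once both results are available.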

4. `ExpressSnippetIndexing`

File: statemachines/express-snippet-indexing.asl.json Event: events/event-snippet.json

Fast-path indexing for short text snippets such as support FAQs, product descriptions, or changelog entries. Skips the OCR extraction step and goes directly from ingestion to embedding generation.

SetSnippetContext (snippetMode: EXPRESS, skipOcr: true)
  ├─ ExpressIngestSnippet
  ├─ ExpressGenerateSnippetEmbeddings
  ├─ ExpressIndexSnippet
  ├─ WaitForIndexPropagation (2s)
  └─ RouteByIndexHealth (Choice)
       ├─ queryTestPassed ──► ExpressUpdateSnippetCatalog
       │                           └─ NotifySnippetIndexed
       │                                └─ SnippetIndexingComplete
       └─ else ──► SnippetIndexingFailed

Included Components

| Component | Name | Description |
| --- | --- | --- |
| Lambda | `ingest-document` | Ingests PDF, Word, HTML, Markdown, Confluence, Notion, and SharePoint documents with namespace-specific chunking configuration and embedding model selection |
| Lambda | `extract-text-content` | Extracts and chunks text using TEXTRACT, DOCX_PARSER, HTML_PARSER, or Markdown/Confluence/Notion parsers with post-processing pipelines |
| Lambda | `generate-embeddings` | Generates dense vector embeddings using `text-embedding-3-large`, `text-embedding-3-small`, Amazon Titan v2, Cohere Embed v3, or Voyage Large 2 |
| Lambda | `index-to-vector-store` | Upserts embeddings to Pinecone or OpenSearch Serverless with deduplication, stale vector deletion, and a query health check |
| Lambda | `validate-retrieval-quality` | Runs MRR and NDCG@5 quality tests against a set of representative queries and routes on a configurable threshold |
| Lambda | `update-knowledge-catalog` | Versions the catalog entry, invalidates downstream RAG application caches, and computes the next scheduled refresh date |
| Lambda | `notify-ingestion-status` | Notifies the content owner (email), RAG platform ops (Slack), and downstream applications (webhook) on completion or quality failure |
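
One common way to get upsert deduplication and stale vector deletion, as described for `index-to-vector-store`, is to derive vector IDs from chunk content. The sketch below illustrates that idea only. It is an assumption about the approach, not the repository's implementation, and `chunkId`, `planUpsert`, and the `existingIds` input are hypothetical names.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical sketch: derive a stable vector ID from namespace + chunk text so that
// re-ingesting unchanged content maps to the same ID instead of creating a duplicate.
function chunkId(namespace, text) {
  const normalized = text.trim().replace(/\s+/g, " ");
  return createHash("sha256").update(`${namespace}\n${normalized}`).digest("hex");
}

// Splits work into chunks to upsert and IDs to delete, given the IDs already stored
// for this document (assumed to be available from the catalog or the vector store).
function planUpsert(namespace, chunks, existingIds) {
  const seen = new Set();
  const toUpsert = [];

  for (const text of chunks) {
    const id = chunkId(namespace, text);
    if (seen.has(id)) continue; // duplicate chunk within this document
    seen.add(id);
    if (!existingIds.has(id)) toUpsert.push({ id, text }); // only new content is embedded
  }

  // Anything stored earlier that no longer appears in the document is stale.
  const toDelete = [...existingIds].filter((id) => !seen.has(id));
  return { toUpsert, toDelete };
}
```

Because the ID depends only on the namespace and normalized text, unchanged chunks are skipped on re-ingestion, which also avoids paying for embeddings that already exist.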

Directory Structure

rag-knowledge-base-workflow/
├── template.yaml
├── README.md
├── lambdas/
│   ├── ingest-document.js
│   ├── extract-text-content.js
│   ├── generate-embeddings.js
│   ├── index-to-vector-store.js
│   ├── validate-retrieval-quality.js
│   ├── update-knowledge-catalog.js
│   └── notify-ingestion-status.js
├── statemachines/
│   ├── rag-ingestion-pipeline.asl.json
│   ├── bulk-document-ingestion.asl.json
│   ├── knowledge-base-refresh-escalation.asl.json
│   └── express-snippet-indexing.asl.json
└── events/
    ├── event-document.json
    ├── event-batch.json
    ├── event-refresh.json
    └── event-snippet.json

Local Testing

Run and iterate on the workflow using Thrubit:

1. Open Thrubit.
2. Import `template.yaml`.
3. Load any event from `events/`.
4. Execute the workflow locally.

About Thrubit

Thrubit enables fast, cost-free local execution of Step Functions with full visibility into state transitions.
