AI — RAG Knowledge Base Workflow

A complete Retrieval-Augmented Generation (RAG) knowledge base ingestion pipeline built on AWS Step Functions. It covers document intake from S3, Confluence, Notion, and SharePoint; text extraction and semantic chunking; embedding generation via OpenAI or Amazon Bedrock; vector store upsert with deduplication; retrieval quality validation using MRR and NDCG@5; knowledge catalog versioning; and cache invalidation for downstream applications.


Workflows

1. RagIngestionPipeline

File: statemachines/rag-ingestion-pipeline.asl.json
Event: events/event-document.json

Processes a single document through the full RAG ingestion pipeline, with parallel catalog update and notification after retrieval quality validation passes.

IngestDocument
  ├─ ExtractTextContent
      ├─ GenerateEmbeddings
          ├─ IndexToVectorStore
              ├─ ValidateRetrievalQuality
                  ├─ IsRetrievalQualitySufficient (Choice)
                  │    └─ false ──► IngestionFailed
                  └─ [Parallel] UpdateKnowledgeCatalog + NotifyIngestionStatus
                       └─ IngestionComplete
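
The quality gate between ValidateRetrievalQuality and IsRetrievalQualitySufficient can be illustrated with a minimal sketch of the two metrics named above. This is not the repository's implementation: the function names, the binary-relevance assumption, the query shape (`ranked`, `relevant`), and the 0.7 thresholds are all hypothetical choices made for illustration.

```javascript
// Illustrative sketch only; names, query shape, and thresholds are assumptions.

// Reciprocal rank of the first relevant result for one query (0 if none is found).
function reciprocalRank(rankedIds, relevantIds) {
  const idx = rankedIds.findIndex((id) => relevantIds.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// NDCG@k with binary relevance: each relevant hit at 0-based rank i
// contributes 1 / log2(i + 2), normalized by the best achievable score.
function ndcgAtK(rankedIds, relevantIds, k = 5) {
  const dcg = rankedIds
    .slice(0, k)
    .reduce((sum, id, i) => sum + (relevantIds.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  const idealHits = Math.min(relevantIds.size, k);
  let idcg = 0;
  for (let i = 0; i < idealHits; i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}

// Averages both metrics over the test queries and applies the thresholds.
// queries: [{ ranked: string[], relevant: Set<string> }]
function evaluateRetrieval(queries, thresholds = { mrr: 0.7, ndcg5: 0.7 }) {
  if (queries.length === 0) return { mrr: 0, ndcg5: 0, passed: false };
  const mean = (fn) => queries.reduce((sum, q) => sum + fn(q), 0) / queries.length;
  const mrr = mean((q) => reciprocalRank(q.ranked, q.relevant));
  const ndcg5 = mean((q) => ndcgAtK(q.ranked, q.relevant, 5));
  return { mrr, ndcg5, passed: mrr >= thresholds.mrr && ndcg5 >= thresholds.ndcg5 };
}
```

A Choice state could then branch on a boolean such as `passed`, sending failures to IngestionFailed and successes to the parallel catalog update and notification.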

2. BulkDocumentIngestion

File: statemachines/bulk-document-ingestion.asl.json
Event: events/event-batch.json

Ingests multiple documents concurrently using a Map state. Supports scheduled full-corpus re-ingestion after major content updates or new namespace imports.

[Map: MaxConcurrency 3 — $.documents]
  ├─ BatchIngestDocument
  ├─ BatchExtractText
  ├─ BatchGenerateEmbeddings
  ├─ BatchIndexToVectorStore
  ├─ BatchValidateRetrieval
  └─ [Parallel] BatchUpdateCatalog + BatchNotifyStatus
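
To build intuition for what a concurrency limit of 3 means, the sketch below processes a list with at most three items in flight at a time. It only mimics the bounded-concurrency idea in plain JavaScript; `mapWithConcurrency` and its `worker` callback are hypothetical names, and the sketch does not reproduce Step Functions error handling or retries.

```javascript
// Illustrative sketch only; mapWithConcurrency is a hypothetical helper.

// Runs worker(item, index) over items with at most maxConcurrency calls in flight.
// Results are returned in input order, regardless of completion order.
async function mapWithConcurrency(items, worker, maxConcurrency = 3) {
  const results = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  // Each lane repeatedly claims the next item until none remain.
  async function runLane() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.min(maxConcurrency, items.length);
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

Here each `worker` call would stand in for one document's full pass through the per-item steps listed in the diagram.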

3. KnowledgeBaseRefreshEscalation

File: statemachines/knowledge-base-refresh-escalation.asl.json
Event: events/event-refresh.json

Handles escalated refresh for stale or degraded-quality documents detected by the quality monitoring job. Applies a review hold, re-extracts content, then runs parallel embedding regeneration and vector store cleanup before re-indexing and routing to resolution.

SetRefreshEscalationContext (escalationReason: STALE_CONTENT_DETECTED)
  ├─ ReIngestDocument
  ├─ ReExtractText
  ├─ WaitForContentReview (5s)
  └─ [Parallel] EscalatedEmbeddingGeneration + EscalatedVectorStoreCleanup
       └─ EscalatedIndexing
            └─ EscalatedCatalogUpdate
                 └─ NotifyRefreshOutcome
                      └─ RouteRefreshOutcome (Choice)
                           ├─ passes ──► RefreshEscalationResolved
                           └─ else ──► RefreshEscalationFailed
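
The parallel step in this flow can be pictured with a small sketch: embedding regeneration and vector store cleanup run concurrently, and re-indexing starts only after both finish, much as a Parallel state waits for all of its branches. The `steps` object, its method names, and the result fields (`deleted`, `count`, `qualityPassed`) are hypothetical; the condition that RouteRefreshOutcome actually tests is not stated in this README.

```javascript
// Illustrative sketch only; the steps interface and result fields are assumptions.

async function runEscalatedRefresh(doc, steps) {
  // Both branches start together; Promise.all resolves once both complete
  // and preserves branch order in its result array.
  const [embeddings, cleanup] = await Promise.all([
    steps.regenerateEmbeddings(doc),
    steps.cleanupVectorStore(doc),
  ]);

  // Re-index only after stale vectors are gone and fresh embeddings exist.
  const indexed = await steps.index(doc, embeddings);

  return {
    documentId: doc.documentId,
    escalationReason: 'STALE_CONTENT_DETECTED',
    vectorsDeleted: cleanup.deleted,
    vectorsIndexed: indexed.count,
    // Assumed routing rule: resolve when the re-indexed document passes its check.
    outcome: indexed.qualityPassed ? 'RefreshEscalationResolved' : 'RefreshEscalationFailed',
  };
}
```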

4. ExpressSnippetIndexing

File: statemachines/express-snippet-indexing.asl.json
Event: events/event-snippet.json

Fast-path indexing for short text snippets such as support FAQs, product descriptions, or changelog entries. Skips the OCR extraction step and goes directly from ingestion to embedding generation.

SetSnippetContext (snippetMode: EXPRESS, skipOcr: true)
  ├─ ExpressIngestSnippet
  ├─ ExpressGenerateSnippetEmbeddings
  ├─ ExpressIndexSnippet
  ├─ WaitForIndexPropagation (2s)
  └─ RouteByIndexHealth (Choice)
       ├─ queryTestPassed ──► ExpressUpdateSnippetCatalog
       │                           └─ NotifySnippetIndexed
       │                                └─ SnippetIndexingComplete
       └─ else ──► SnippetIndexingFailed
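
The wait-then-route tail of this flow can be sketched as a short pause followed by a test query. In this sketch the health check passes when the newly indexed snippet appears among the results of a query for it; that rule, the `probeSnippetIndex` name, and the `queryTopIds` callback are assumptions, since the README names the `queryTestPassed` flag but not how it is computed.

```javascript
// Illustrative sketch only; probeSnippetIndex and queryTopIds are hypothetical.

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Waits for the index to propagate, runs one test query, and picks the next state.
// queryTopIds(snippetId) is expected to resolve to an array of result ids.
async function probeSnippetIndex(snippetId, queryTopIds, waitMs = 2000) {
  await sleep(waitMs);
  const topIds = await queryTopIds(snippetId);
  const queryTestPassed = topIds.includes(snippetId);
  return {
    snippetId,
    queryTestPassed,
    next: queryTestPassed ? 'ExpressUpdateSnippetCatalog' : 'SnippetIndexingFailed',
  };
}
```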

Included Components

| Component | Name | Description |
| --- | --- | --- |
| Lambda | ingest-document | Ingests PDF, Word, HTML, Markdown, Confluence, Notion, and SharePoint docs with namespace-specific chunking config and embedding model selection |
| Lambda | extract-text-content | Extracts and chunks text using TEXTRACT, DOCX_PARSER, HTML_PARSER, or Markdown/Confluence/Notion parsers with post-processing pipelines |
| Lambda | generate-embeddings | Generates dense vector embeddings using text-embedding-3-large, text-embedding-3-small, Amazon Titan v2, Cohere Embed v3, or Voyage Large 2 |
| Lambda | index-to-vector-store | Upserts embeddings to Pinecone or OpenSearch Serverless with deduplication, stale vector deletion, and query health check |
| Lambda | validate-retrieval-quality | Runs MRR and NDCG@5 quality tests against a set of representative queries and routes on configurable threshold |
| Lambda | update-knowledge-catalog | Versions the catalog entry, invalidates downstream RAG application caches, and computes next scheduled refresh date |
| Lambda | notify-ingestion-status | Notifies content owner (email), RAG platform ops (Slack), and downstream applications (webhook) on completion or quality failure |
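
One common way to get deduplication and stale vector deletion, as described for index-to-vector-store, is to derive each vector's id from a hash of its chunk text. The sketch below shows that idea under stated assumptions: `chunkId` and `planUpsert` are hypothetical helpers, the whitespace normalization is an arbitrary choice, and nothing here is confirmed as the repository's actual approach. It uses only Node's built-in `crypto` module and makes no calls to Pinecone or OpenSearch.

```javascript
// Illustrative sketch only; chunkId and planUpsert are hypothetical helpers.
const { createHash } = require('node:crypto');

// Deterministic id: identical text in the same namespace always maps to the same id.
function chunkId(namespace, text) {
  const normalized = text.trim().replace(/\s+/g, ' ');
  return createHash('sha256').update(`${namespace}\n${normalized}`).digest('hex');
}

// Compares incoming chunks with ids already stored for the document.
// Returns the chunks that still need embedding and upsert, plus stale ids to delete.
function planUpsert(namespace, chunks, existingIds) {
  const incoming = new Map(); // id -> text; duplicate chunks collapse to one entry
  for (const text of chunks) incoming.set(chunkId(namespace, text), text);

  const toUpsert = [...incoming]
    .filter(([id]) => !existingIds.has(id))
    .map(([id, text]) => ({ id, text }));
  const toDelete = [...existingIds].filter((id) => !incoming.has(id));
  return { toUpsert, toDelete };
}
```

Planning the upsert before calling the embedding model means unchanged chunks are never re-embedded, which can matter for cost on large re-ingestion runs.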

Directory Structure

rag-knowledge-base-workflow/
├── template.yaml
├── README.md
├── lambdas/
│   ├── ingest-document.js
│   ├── extract-text-content.js
│   ├── generate-embeddings.js
│   ├── index-to-vector-store.js
│   ├── validate-retrieval-quality.js
│   ├── update-knowledge-catalog.js
│   └── notify-ingestion-status.js
├── statemachines/
│   ├── rag-ingestion-pipeline.asl.json
│   ├── bulk-document-ingestion.asl.json
│   ├── knowledge-base-refresh-escalation.asl.json
│   └── express-snippet-indexing.asl.json
└── events/
    ├── event-document.json
    ├── event-batch.json
    ├── event-refresh.json
    └── event-snippet.json

Local Testing

Run and iterate on the workflow using Thrubit:

  1. Open Thrubit
  2. Import template.yaml
  3. Load any event from /events
  4. Execute the workflow locally

About Thrubit

Thrubit enables fast, cost-free local execution of Step Functions with full visibility into state transitions.

