# Internal Knowledge Assistant – Technical Design Document

## 1. Overview

This document describes an internal “Glean-inspired” knowledge application that connects to enterprise sources (Confluence, Google Docs, SharePoint, Slack, and Jira Epic details looked up by ID). Users ask natural-language questions and the system returns an answer **with citations** (links, snippets, and metadata) sourced **only from internal content**.

Key goals:

* Unified search + question answering over multiple enterprise systems
* Strong security: least privilege, auditability, tenant boundaries
* Accurate answers with verifiable **source citations**
* Fast retrieval at scale (thousands → millions of docs/messages)
* Governance and compliance (retention, PII handling, access controls)

Non-goals (initially):

* External/public web search
* Fully autonomous actions (e.g., closing Jira tickets) unless explicitly added later

## 2. Requirements

### 2.1 Functional

1. Connectors

   * Confluence spaces/pages/attachments
   * Google Docs / Drive (Docs, Sheets as text snapshots, PDFs)
   * SharePoint / OneDrive documents and folders
   * Slack channels/threads/messages (allowed channels)
   * Jira: Epic + linked issues by Epic ID / Jira ID

2. Ingestion & Indexing

   * Scheduled sync + incremental updates (webhooks where possible)
   * Parse content and extract metadata (title, owner, timestamps, permissions)
   * Chunk documents for retrieval (structure-aware chunking)
   * Store embeddings + keyword index

3. Query & Answering

   * Natural-language query
   * Retrieve relevant chunks respecting permissions
   * Generate answer with **citations** (source URL + snippet + confidence)
   * Provide follow-ups: “show more”, “open source”, “refine scope”

4. Source Attribution

   * Each answer includes citations:

     * Document title
     * Source system (Confluence/Slack/etc.)
     * Link to the original item
     * Snippet / highlighted text used

5. Admin & Observability

   * Connector health dashboards
   * Ingestion job status
   * Query analytics (latency, top queries)
   * Audit logs for access and retrieval

## 3. High-Level Architecture

### 3.1 Components

* Web App / UI
  * Search bar, chat interface, filters (source, time range, space/channel)
  * Answer view with citations + “open source”
* API Gateway / Backend
  * Auth middleware (SSO)
  * Query orchestration
  * Permission checks and policy enforcement
* Connector Services (per source)
  * Confluence Connector
  * Google Drive/Docs Connector
  * SharePoint Connector
  * Slack Connector
  * Jira Connector
* Ingestion Pipeline
  * Scheduler/event triggers
  * Fetch → normalize → parse → chunk → enrich → index
* Indexing Layer
  * Keyword index for exact match and filtering
  * Vector database for semantic retrieval
  * Metadata store for permissions, ACL, document lineage
* RAG / Answering Service
  * Query rewrite (optional)
  * Hybrid retrieval (BM25 + embeddings)
  * Reranking
  * LLM response generation with citation mapping
* Security & Governance
  * Central policy engine for ACL
  * Audit logs
  * Data retention/deletion workflows
* Observability
  * Logs (Splunk)

## 4. Data Model

### 4.1 Canonical Document Schema

All source items are normalized into a single schema.

Document:

* doc_id (global UUID)
* source_system (confluence|gdoc|sharepoint|slack|jira)
* source_native_id (pageId, fileId, messageTs, issueKey)
* title
* url
* body_text (normalized plaintext)
* content_type (page|doc|pdf|message|issue|comment|attachment)
* created_at, updated_at
* author, owner
* container (space, drive folder, slack channel, jira project)
* labels/tags
* acl (access control list)
* hash (content hash for idempotency)

Chunk:

* chunk_id
* doc_id
* chunk_index
* text
* token_count
* embedding_vector
* chunk_metadata (heading, section path, message thread id)
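As a minimal sketch, the two record types above could be expressed as Python dataclasses. The field names follow the schema; the concrete types, the dictionary shape of `acl`, and the `doc_id:chunk_index` format for `chunk_id` are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import uuid


@dataclass
class Document:
    source_system: str      # confluence | gdoc | sharepoint | slack | jira
    source_native_id: str   # pageId, fileId, messageTs, issueKey
    title: str
    url: str
    body_text: str          # normalized plaintext
    content_type: str       # page | doc | pdf | message | issue | comment | attachment
    created_at: datetime
    updated_at: datetime
    author: str
    owner: str
    container: str          # space, drive folder, slack channel, jira project
    labels: list[str] = field(default_factory=list)
    acl: dict = field(default_factory=dict)  # assumed shape, e.g. {"allow": [...], "deny": [...]}
    hash: str = ""          # content hash for idempotency
    doc_id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class Chunk:
    doc_id: str
    chunk_index: int
    text: str
    token_count: int
    embedding_vector: list[float] = field(default_factory=list)
    chunk_metadata: dict = field(default_factory=dict)  # heading, section path, thread id
    chunk_id: str = ""

    def __post_init__(self):
        # Assumption: derive a stable chunk_id from the parent doc and position.
        if not self.chunk_id:
            self.chunk_id = f"{self.doc_id}:{self.chunk_index}"


now = datetime.now(timezone.utc)
doc = Document(
    source_system="confluence", source_native_id="12345", title="Example page",
    url="https://wiki.example.com/pages/12345", body_text="Example body",
    content_type="page", created_at=now, updated_at=now,
    author="alice", owner="alice", container="ENG",
)
chunk = Chunk(doc_id=doc.doc_id, chunk_index=0, text="Example body", token_count=2)
```

Deriving `chunk_id` from `doc_id` and `chunk_index` keeps re-indexing deterministic, but a random UUID per chunk would work as well.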

### 4.2 ACL Model (Critical)

Use a common permission representation:

* Users, groups, roles
* Allow/deny lists
* Inheritance rules (e.g., Confluence page inherits space permissions)

Store ACL at both doc and (optionally) chunk levels.
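One possible way to evaluate this representation is sketched below. The `Principal` and `Acl` types, the `user:`/`group:`/`role:` identity prefixes, and the precedence rules (deny wins, then allow, then fall back to the parent container) are assumptions for illustration, not a statement of how each source system resolves permissions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Principal:
    user_id: str
    groups: frozenset = frozenset()
    roles: frozenset = frozenset()

    def identities(self) -> set:
        # Flatten the user, their groups, and their roles into comparable strings.
        ids = {f"user:{self.user_id}"}
        ids |= {f"group:{g}" for g in self.groups}
        ids |= {f"role:{r}" for r in self.roles}
        return ids


@dataclass
class Acl:
    allow: set = field(default_factory=set)
    deny: set = field(default_factory=set)
    inherit: bool = True  # consult the parent ACL when no entry matches


def can_read(principal: Principal, acl: Acl, parent: Optional[Acl] = None) -> bool:
    ids = principal.identities()
    if ids & acl.deny:      # an explicit deny always wins
        return False
    if ids & acl.allow:     # an explicit allow grants access
        return True
    if acl.inherit and parent is not None:
        return can_read(principal, parent)  # e.g. page inherits from space
    return False            # default deny


space_acl = Acl(allow={"group:eng"})
page_acl = Acl(deny={"user:bob"})
alice = Principal("alice", groups=frozenset({"eng"}))
bob = Principal("bob", groups=frozenset({"eng"}))
print(can_read(alice, page_acl, space_acl), can_read(bob, page_acl, space_acl))
```

Defaulting to deny when nothing matches is the conservative choice for a system whose answers must never leak content the user cannot open at the source.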

## 5. Ingestion & Sync Design

### 5.1 Sync Modes

* Initial bulk sync: full crawl for all configured scopes
* Incremental sync: fetch changes since last cursor
* Near-real-time: webhooks where supported (Slack events, Google Drive push notifications)

### 5.2 Idempotency & Change Detection

* Compute hash = SHA256(normalized_text + key_metadata)
* If hash unchanged → skip re-index
* Keep sync_cursor per connector scope
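The hash comparison above might look like the following sketch. It assumes `key_metadata` is a JSON-serializable dictionary and adds a newline separator between text and metadata; both choices, and the helper names `content_hash` and `needs_reindex`, are illustrative.

```python
import hashlib
import json
from typing import Optional


def content_hash(normalized_text: str, key_metadata: dict) -> str:
    # Serialize metadata with sorted keys so key order does not change the hash.
    meta = json.dumps(key_metadata, sort_keys=True, separators=(",", ":"))
    payload = normalized_text + "\n" + meta
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_reindex(stored_hash: Optional[str], normalized_text: str, key_metadata: dict) -> bool:
    # A missing stored hash (new document) always triggers indexing.
    return stored_hash != content_hash(normalized_text, key_metadata)


stored = content_hash("Release checklist", {"title": "Checklist", "owner": "alice"})
print(needs_reindex(stored, "Release checklist", {"owner": "alice", "title": "Checklist"}))
print(needs_reindex(stored, "Release checklist v2", {"owner": "alice", "title": "Checklist"}))
```

Including permission-relevant metadata in `key_metadata` would make an ACL change trigger a re-index even when the body text is unchanged.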

### 5.3 Parsing / Normalization

* HTML → text (Confluence pages)
* Google Docs → exported text/HTML → text
* SharePoint docs → text extraction (docx/pdf)
* Slack messages → plain text + thread context
* Jira issues → fields: summary, description, comments

### 5.4 Chunking Strategy

Use structure-aware chunking:

* Documents: split by headings, paragraphs, lists
* Slack: message-level chunks + thread rollups
* Jira: field-based chunks (Summary, Description, Comments)

Target chunk size:

* 400–800 tokens per chunk (tune per model)
* Overlap 50–150 tokens for continuity
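The size and overlap targets can be illustrated with a sliding window over tokens. This sketch covers only the windowing step and would be applied to a section after the structural split; it uses whitespace-separated words as a stand-in for model tokens, which is an assumption, since real token counts depend on the embedding model's tokenizer.

```python
def chunk_tokens(tokens: list, max_tokens: int = 600, overlap: int = 100) -> list:
    """Split a token list into windows of max_tokens that share `overlap` tokens."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("require max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


def chunk_text(text: str, max_tokens: int = 600, overlap: int = 100) -> list:
    # Whitespace splitting is a placeholder for a real tokenizer.
    return [" ".join(window) for window in chunk_tokens(text.split(), max_tokens, overlap)]


print(chunk_text("a b c d e", max_tokens=3, overlap=1))
```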

### 5.5 Enrichment

* Entity extraction: Jira keys, project names, teams, service names
* Language detection
* Timestamp normalization
* Optional: classification tags (confidential/public-internal)
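Two of these enrichment steps are simple enough to sketch with the standard library. The regular expression is an assumed approximation of Jira issue keys (uppercase project key, hyphen, number) and will also match look-alikes such as `UTF-8`, so a production version would likely validate against known project keys. Treating naive timestamps as UTC is likewise an assumption.

```python
import re
from datetime import datetime, timezone

# Assumed pattern: uppercase letter, then uppercase letters/digits, hyphen, digits.
JIRA_KEY_RE = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def extract_jira_keys(text: str) -> list:
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(JIRA_KEY_RE.findall(text)))


def normalize_timestamp(value: str) -> str:
    """Convert an ISO 8601 timestamp to UTC; naive values are assumed to be UTC."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


print(extract_jira_keys("See C1AHB-6567 and EPIC-123, then EPIC-123 again"))
print(normalize_timestamp("2024-05-01T10:00:00+02:00"))
```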

## 6. Indexing & Retrieval

### 6.1 Hybrid Retrieval

Combine:

* Keyword/BM25 (fast exact matches; IDs like EPIC-123)
* Vector search (semantic matches)

Process:

1. Generate candidates from BM25 + vector (top K each)
2. Merge + dedupe
3. Apply ACL filter (must happen before model sees text)
4. Rerank with a cross-encoder or LLM reranker
5. Select top N chunks for context
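The process above can be sketched as a small pipeline. The document does not say how the two candidate lists are merged, so this example uses reciprocal rank fusion (RRF) as one common option; that choice, the constant `k=60`, and the `is_allowed` and `rerank` callables are assumptions standing in for the real ACL store and reranker.

```python
from collections import defaultdict
from typing import Callable, Iterable, Optional


def rrf_merge(rankings: Iterable, k: int = 60) -> list:
    """Merge ranked ID lists; an ID appearing in several lists is counted once."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def select_context(
    bm25_ids: list,
    vector_ids: list,
    is_allowed: Callable[[str], bool],
    rerank: Optional[Callable[[list], list]] = None,
    top_n: int = 5,
) -> list:
    candidates = rrf_merge([bm25_ids, vector_ids])               # merge + dedupe
    permitted = [cid for cid in candidates if is_allowed(cid)]   # ACL filter first
    if rerank is not None:
        permitted = rerank(permitted)                            # reranker sees permitted IDs only
    return permitted[:top_n]


result = select_context(
    bm25_ids=["a", "b", "c"],
    vector_ids=["b", "d", "a"],
    is_allowed=lambda cid: cid != "d",
    top_n=2,
)
print(result)
```

Filtering before reranking keeps restricted text away from both the reranker and the generation model, at the cost of possibly returning fewer than `top_n` chunks.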

### 6.2 Storage Choices

* Vector DB: Pinecone
* Metadata + ACL store: Postgres for doc registry + permissions

### 6.3 Citation Mapping

Each retrieved chunk carries doc_id, url, section_path, and snippet_start/end. During generation, maintain a mapping from answer sentences → source chunks.

Output format:

* Inline citations like: [1], [2]
* Citation list includes title + source + link + snippet
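A minimal sketch of this output format follows. It assumes the sentence-to-chunk mapping is already available as a list of `(sentence, [citation indices])` pairs; the `Citation` type and `render_answer` helper are hypothetical names, and citations are numbered in order of first use.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    title: str
    source_system: str
    url: str
    snippet: str


def render_answer(sentences: list, citations: list) -> tuple:
    """Return (answer text with inline [n] markers, numbered citation list)."""
    order = []   # citation indices in order of first use
    parts = []
    for text, cited in sentences:
        markers = ""
        for idx in cited:
            if idx not in order:
                order.append(idx)
            markers += f"[{order.index(idx) + 1}]"
        parts.append(f"{text} {markers}" if markers else text)
    refs = []
    for number, idx in enumerate(order, start=1):
        c = citations[idx]
        refs.append(f'[{number}] {c.title} ({c.source_system}) {c.url} "{c.snippet}"')
    return " ".join(parts), refs


cites = [
    Citation("Doc A", "confluence", "https://wiki.example.com/a", "alpha"),
    Citation("Thread B", "slack", "https://slack.example.com/b", "beta"),
]
body, refs = render_answer(
    [("First claim.", [1]), ("Second claim.", [0, 1]), ("Uncited.", [])], cites
)
print(body)
print("\n".join(refs))
```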

## 7. RAG Process

### 7.1 End-to-End Flow (Python + LangChain + LangGraph)

##### Step 1 — User Authentication (SSO)

##### Step 2 — User Submits Question

Example:<br>
POST /query<br>
{<br>
  "question": "I have this JIRA: C1AHB-6567", <br>
  "user_context": "Can you provide the requirement about pagination"<br>
}<br>
    

##### Step 3 — LangGraph Entry Node (Query Router)

##### Step 4 — Tool Selection (LangChain Tools)
The following LangChain tools are registered:

* confluence_search_tool: content is embedded and stored in Pinecone
* slack_search_tool: content is embedded and stored in Pinecone
* gdoc_search_tool: content is embedded and stored in Pinecone
* sharepoint_search_tool
* jira_fetch_tool: API call

##### Step 5 — Jira Epic Fetch (Conditional Path)

If a Jira ID is detected:

* Call the Jira REST API
* Fetch:
  * Epic summary
  * Description
  * Linked issues
  * Comments
* Output is normalized into the internal document format.
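The normalization step might look like the sketch below. It assumes the issue payload follows the shape of a Jira REST v2 issue response with plain-text `description` and comment `body` fields (newer API versions return rich-text structures that would need flattening first), and that issues are browsable at `<base_url>/browse/<key>`. The function name, the sample field values, and `jira.example.com` are hypothetical; the HTTP call itself is omitted.

```python
def normalize_jira_issue(issue: dict, base_url: str) -> dict:
    """Map a Jira issue payload onto the internal document fields."""
    fields = issue.get("fields", {})
    key = issue["key"]

    comments = [
        c.get("body", "")
        for c in (fields.get("comment") or {}).get("comments", [])
    ]

    linked = []
    for link in fields.get("issuelinks", []):
        # Each link carries either an outward or an inward issue reference.
        other = link.get("outwardIssue") or link.get("inwardIssue")
        if other:
            linked.append(other["key"])

    summary = fields.get("summary", "")
    body_parts = [summary, fields.get("description") or "", *comments]
    return {
        "source_system": "jira",
        "source_native_id": key,
        "title": summary,
        "url": f"{base_url.rstrip('/')}/browse/{key}",
        "body_text": "\n\n".join(part for part in body_parts if part),
        "content_type": "issue",
        "linked_issues": linked,
    }


sample = {
    "key": "C1AHB-6567",
    "fields": {
        "summary": "Sample summary",
        "description": "Sample description",
        "comment": {"comments": [{"body": "Sample comment"}]},
        "issuelinks": [
            {"outwardIssue": {"key": "C1AHB-1"}},
            {"inwardIssue": {"key": "C1AHB-2"}},
        ],
    },
}
doc = normalize_jira_issue(sample, "https://jira.example.com/")
print(doc["url"], doc["linked_issues"])
```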

##### Step 6 — Hybrid Retrieval (Search Path)

##### Step 7 — Merge & Deduplicate Results

* Merge BM25 + vector results
* Remove duplicates by doc_id
* Keep top-K candidates

##### Step 8 — ACL / Permission Filtering (Critical Step)

##### Step 9 — Reranking (Relevance Boost)

##### Step 10 — Context Builder

##### Step 11 — Grounded Prompt Construction
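Steps 10 and 11 are not specified in detail, so the following is only one plausible sketch: pack ranked, permission-filtered chunks into a token budget, then build a prompt that restricts the model to numbered sources. The budget of 3000 tokens, the dictionary keys, and the prompt wording are all assumptions.

```python
def build_context(ranked_chunks: list, max_tokens: int = 3000) -> list:
    """Take chunks in rank order until the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        if used + chunk["token_count"] > max_tokens:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        used += chunk["token_count"]
    return selected


def build_prompt(question: str, chunks: list) -> str:
    # Number the sources so the model can cite them inline as [n].
    sources = "\n\n".join(
        f"[{i}] {c['title']} ({c['url']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources inline as [n]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )


chunks = [
    {"title": "Doc A", "url": "https://wiki.example.com/a", "text": "alpha", "token_count": 5},
    {"title": "Doc B", "url": "https://wiki.example.com/b", "text": "beta", "token_count": 10},
    {"title": "Doc C", "url": "https://wiki.example.com/c", "text": "gamma", "token_count": 3},
]
context = build_context(chunks, max_tokens=8)
prompt = build_prompt("What is alpha?", context)
print(prompt)
```

Stopping at the first chunk that does not fit preserves rank order; skipping it and continuing would use the budget more fully but could promote weaker chunks.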

##### Step 12 — LLM Answer Generation

##### Step 13 — Citation Mapping

##### Step 14 — LangGraph Exit Node (Response Assembly)

##### Step 15 — UI Rendering

User → Auth → Query API → LangGraph → Route Query → Tools (Jira / Search) → Hybrid Retrieval → ACL Filter → Rerank → Context Builder → LLM (LangChain) → Citation Mapper → Response
