lightrag-go


Re-implementation of LightRAG in Go -- graph-based Retrieval-Augmented Generation with parallel extraction, fuzzy entity matching, and zero heavy dependencies.


Why This Exists

LightRAG (30K+ stars) demonstrated that knowledge graph-based retrieval outperforms both naive vector RAG and Microsoft's GraphRAG, achieving a 52.8% win rate and 73.6% better answer diversity while using under 100 tokens per retrieval (versus GraphRAG's ~610K tokens).

However, the Python implementation has critical weaknesses:

  1. Entity merging is exact-match only -- "Donald Trump" and "Trump" create separate disconnected nodes (#1323)
  2. Embedding calls are sequential during merge -- ingesting 140K entities takes 20+ minutes (#1957)
  3. No hallucination detection -- LLM-extracted entities like "Noah Carter" from a sewing machine manual go unvalidated (#2333)
  4. Heavy dependency tree -- requires pandas, numpy, networkx, tiktoken, pydantic, and google-genai as core deps

lightrag-go reimplements the core algorithm in Go, fixing all four issues while maintaining algorithmic fidelity.


How It Works

                    INSERT PIPELINE
                    ==============
Documents --> [Token-Based Chunking] --> [Parallel LLM Extraction]
                                               |
                                     [Parse Delimited Output]
                                               |
                                  [Fuzzy Entity Merging + Dedup]
                                     /         |          \
                               [Graph]   [Vector Index]   [Validation]
                               (nodes,   (entity &        (hallucination
                                edges)   relation         detection)
                                         embeddings)

                    QUERY PIPELINE
                    ==============
Query --> [LLM Extract Keywords]
              |
     [high_level]  [low_level]
         |              |
   [Vector match    [Vector match
    relations]       entities]
         |              |
         +--- Merge ----+
              |
     [1-hop Graph Traversal]
              |
     [Token Budget Truncation]
              |
     [LLM Generate Answer]
              |
          Response

Core Algorithm

  1. Chunking: Documents are split into token-bounded chunks with configurable overlap (default: 1200 tokens, 100 overlap)

  2. Entity/Relation Extraction: Each chunk is processed by an LLM to extract entities and relationships using a structured delimiter format:

    ("entity"<|#|>ENTITY_NAME<|#|>ENTITY_TYPE<|#|>DESCRIPTION)
    ("relationship"<|#|>SOURCE<|#|>TARGET<|#|>DESCRIPTION<|#|>KEYWORDS<|#|>WEIGHT)
    
  3. Knowledge Graph Construction: Extracted entities and relations are merged into an in-memory graph with:

    • Fuzzy matching via Levenshtein distance (configurable threshold)
    • Case-insensitive deduplication
    • Description consolidation via LLM summarization
    • Source chunk tracking for provenance
  4. Dual-Mode Retrieval:

    • Local: Query keywords match entities via vector similarity, then 1-hop graph traversal gathers structural context
    • Global: Query themes match relations via vector similarity and keyword search for broader conceptual context
    • Hybrid: Combines both modes for comprehensive retrieval
  5. Answer Generation: Retrieved context (entities + relations + descriptions) is assembled within a token budget and passed to the LLM for answer generation
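
The delimiter format shown in step 2 can be parsed with plain string splitting. Here is a minimal sketch of that parsing; `parseEntity` and its signature are illustrative, not the package's actual API:

```go
package main

import (
	"fmt"
	"strings"
)

// parseEntity parses one extractor output line of the form
//   ("entity"<|#|>NAME<|#|>TYPE<|#|>DESCRIPTION)
// and returns its fields. Sketch only -- the real parser also
// handles relationship records and malformed LLM output.
func parseEntity(line string) (name, typ, desc string, ok bool) {
	line = strings.TrimSpace(line)
	line = strings.TrimPrefix(strings.TrimSuffix(line, ")"), "(")
	fields := strings.Split(line, "<|#|>")
	if len(fields) != 4 || strings.Trim(fields[0], `"`) != "entity" {
		return "", "", "", false
	}
	return fields[1], fields[2], fields[3], true
}

func main() {
	name, typ, desc, ok := parseEntity(`("entity"<|#|>Rob Pike<|#|>PERSON<|#|>Co-creator of Go)`)
	fmt.Println(name, typ, desc, ok) // Rob Pike PERSON Co-creator of Go true
}
```

A delimiter like `<|#|>` is unlikely to occur in natural text, which is why such formats survive LLM round-trips better than JSON with nested quoting.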


Improvements Over Original LightRAG

| Feature | LightRAG (Python) | lightrag-go |
|---|---|---|
| Entity merging | Exact name match only | Fuzzy matching (Levenshtein distance + lowercasing) |
| Extraction | Sequential per chunk | Parallel goroutines with configurable concurrency |
| Embedding during merge | Single-text calls (1-by-1) | Batch embedding API calls |
| Hallucination detection | None | Validates entity names against source text |
| Dependencies | pandas, numpy, networkx, tiktoken, pydantic | Go stdlib + minimal deps |
| Distribution | Python package + deps | Single binary |
| Thread safety | Semaphore-based | sync.RWMutex with fine-grained locking |
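
The parallel-extraction row above boils down to a standard Go pattern: fan chunks out to goroutines, bounded by a channel semaphore. A self-contained sketch (the `extractFn` parameter stands in for the real per-chunk LLM call):

```go
package main

import (
	"fmt"
	"sync"
)

// extractAll processes chunks with at most `concurrency` goroutines
// running at once. Results are written by index, so chunk order is
// preserved without extra bookkeeping.
func extractAll(chunks []string, concurrency int, extractFn func(string) string) []string {
	results := make([]string, len(chunks))
	sem := make(chan struct{}, concurrency) // bounded-concurrency semaphore
	var wg sync.WaitGroup
	for i, c := range chunks {
		wg.Add(1)
		go func(i int, c string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a worker slot
			defer func() { <-sem }() // release on completion
			results[i] = extractFn(c)
		}(i, c)
	}
	wg.Wait()
	return results
}

func main() {
	out := extractAll([]string{"a", "b", "c"}, 2, func(s string) string { return s + "!" })
	fmt.Println(out) // [a! b! c!]
}
```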

Installation

go install github.com/JSLEEKR/lightrag-go/cmd/lightrag-go@latest

Or build from source:

git clone https://github.com/JSLEEKR/lightrag-go.git
cd lightrag-go
go build -o lightrag-go ./cmd/lightrag-go

Usage

Environment Setup

# Required: LLM provider configuration
export LIGHTRAG_PROVIDER=openai          # or "anthropic"
export LIGHTRAG_API_KEY=your-api-key
export LIGHTRAG_MODEL=gpt-4o-mini        # optional, has defaults
export LIGHTRAG_EMBED_MODEL=text-embedding-3-small  # optional
export LIGHTRAG_GRAPH_PATH=./my_graph.json  # optional, default: lightrag_graph.json

CLI Commands

Index a Document

# From file
lightrag-go index -file document.txt

# From text
lightrag-go index -text "Go is a programming language created at Google by Rob Pike, Ken Thompson, and Robert Griesemer."

Query the Knowledge Graph

# Hybrid mode (default) -- combines local + global retrieval
lightrag-go query -q "Who created Go?" -mode hybrid

# Local mode -- entity-focused retrieval
lightrag-go query -q "Tell me about Rob Pike" -mode local

# Global mode -- theme-focused retrieval
lightrag-go query -q "What are the major programming languages?" -mode global

# Naive mode -- direct LLM call (no graph retrieval)
lightrag-go query -q "What is Go?" -mode naive

Graph Statistics

lightrag-go graph-stats
# Output:
# Knowledge Graph Statistics
#   Nodes (entities): 42
#   Edges (relations): 67
#   Entity types:
#     PERSON: 12
#     TECHNOLOGY: 18
#     ORGANIZATION: 8
#     CONCEPT: 4

Export Graph

# To stdout
lightrag-go export

# To file
lightrag-go export -o graph_export.json

Programmatic Usage

package main

import (
    "context"
    "fmt"

    "github.com/JSLEEKR/lightrag-go/pkg/core"
    "github.com/JSLEEKR/lightrag-go/pkg/llm"
    "github.com/JSLEEKR/lightrag-go/pkg/types"
)

func main() {
    // Configure LLM provider
    cfg := llm.Config{
        Provider: "openai",
        APIKey:   "your-api-key",
        Model:    "gpt-4o-mini",
    }

    provider := llm.NewOpenAIProvider(cfg)
    embedder := llm.NewOpenAIEmbedder(cfg)

    // Create engine
    engine := core.New(provider, embedder, core.DefaultConfig())
    ctx := context.Background()

    // Index documents
    engine.Insert(ctx, "Go is a programming language...", nil)

    // Query with graph-based retrieval
    result, _ := engine.Query(ctx, "Who created Go?", types.QueryModeHybrid)
    fmt.Println(result.Answer)

    // Save graph for persistence
    engine.SaveGraph("knowledge_graph.json")
}

Architecture

lightrag-go/
  cmd/lightrag-go/      CLI entry point
  pkg/
    core/               LightRAG orchestrator (coordinates all components)
    chunk/              Token-aware document chunking with overlap
    extract/            LLM-based entity/relation extraction + parsing
    graph/              In-memory knowledge graph with fuzzy matching
    vector/             In-memory vector index with cosine similarity
    merge/              Entity/relation merging with description consolidation
    query/              Dual-mode retrieval engine (local/global/hybrid)
    llm/                LLM provider interface + implementations
      openai.go         OpenAI API integration
      anthropic.go      Anthropic API integration
      mock.go           Deterministic mock for testing
    prompt/             LLM prompt templates
    types/              Shared data structures
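
The `pkg/vector` index ranks candidates by cosine similarity. The scoring itself is a few lines; this is an illustrative sketch, not the package's actual code:

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors:
// dot(a, b) / (|a| * |b|), with 0 returned for zero-norm inputs.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	fmt.Println(cosine([]float64{1, 0}, []float64{1, 0})) // identical vectors: 1
	fmt.Println(cosine([]float64{1, 0}, []float64{0, 1})) // orthogonal vectors: 0
}
```

A brute-force scan with this function is O(n) per query, which is fine for an in-memory index at the graph sizes a single binary typically handles.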

Key Design Decisions

  1. Interface-based LLM integration: Provider and EmbeddingProvider interfaces allow swapping LLM backends without changing core logic

  2. In-memory everything: Graph and vector index are in-memory for simplicity and speed. Graph can be persisted to JSON files.

  3. Goroutine-based parallelism: Chunk extraction runs in parallel with a configurable concurrency semaphore (default: 4 workers)

  4. Fuzzy entity matching: Levenshtein distance-based matching during merge phase catches name variations that exact matching misses

  5. Source text validation: Extracted entities are cross-referenced against the source chunk text. Entities whose names don't appear in the source are flagged as potential hallucinations and filtered out.
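
Decision 4 combines lowercasing with a Levenshtein threshold. A minimal sketch of that match, assuming a simple dynamic-programming edit distance (`sameEntity` is a hypothetical helper name, not the package's API):

```go
package main

import (
	"fmt"
	"strings"
)

// levenshtein computes the edit distance between two strings using
// two rolling rows of the classic DP table.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		cur := make([]int, len(rb)+1)
		cur[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0 // characters match: no substitution needed
			}
			cur[j] = prev[j-1] + cost // substitution (or match)
			if prev[j]+1 < cur[j] {
				cur[j] = prev[j] + 1 // deletion
			}
			if cur[j-1]+1 < cur[j] {
				cur[j] = cur[j-1] + 1 // insertion
			}
		}
		prev = cur
	}
	return prev[len(rb)]
}

// sameEntity lowercases both names, then merges them if they are
// within the configured edit-distance threshold (default 2).
func sameEntity(a, b string, threshold int) bool {
	return levenshtein(strings.ToLower(a), strings.ToLower(b)) <= threshold
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting"))     // 3
	fmt.Println(sameEntity("Rob Pike", "rob pike", 2)) // true
}
```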


Configuration

All configuration is available through Go structs:

cfg := core.Config{
    Chunk: chunk.Config{
        ChunkSize:    1200,  // Target tokens per chunk
        Overlap:      100,   // Overlap between chunks
        MinChunkSize: 50,    // Minimum tokens for a valid chunk
    },
    Extract: extract.Config{
        MaxContinuations:      1,     // Continue-extraction rounds
        Concurrency:           4,     // Parallel extraction workers
        ValidateAgainstSource: true,  // Hallucination detection
    },
    Merge: merge.Config{
        FuzzyMatchDistance:   2,    // Max Levenshtein distance for fuzzy matching
        MaxDescriptionLength: 500,  // Triggers LLM summarization
    },
    Query: query.Config{
        TopK:              10,    // Results per vector search
        MaxEntityTokens:   4000,  // Entity context budget
        MaxRelationTokens: 4000,  // Relation context budget
        MaxTotalTokens:    8000,  // Total context budget
        MaxNeighborHops:   1,     // Graph traversal depth
    },
}

Testing

All LLM calls are mocked in tests using deterministic fixtures:

go test ./... -v
# 141 tests passing across 10 packages

The MockProvider generates deterministic responses, and MockEmbedder produces normalized vectors from character frequency analysis, ensuring reproducible test results without API calls.


Query Modes Explained

| Mode | How It Works | Best For |
|---|---|---|
| local | Finds entities similar to the query via vector search, then traverses 1-hop graph neighbors for structural context | Specific entity questions: "Who is Rob Pike?" |
| global | Matches query themes against relation keywords via vector and text search | Broad conceptual questions: "What are the trends in systems programming?" |
| hybrid | Combines local + global retrieval, deduplicates, then generates the answer | General questions that need both specific and thematic context |
| naive | Bypasses the graph entirely, sends the query directly to the LLM | Baseline comparison, simple questions |

Inspired By

This project is a Go reimplementation of LightRAG by the HKU Data Science Lab.

Paper: LightRAG: Simple and Fast Retrieval-Augmented Generation

Key findings from the paper:

  • Knowledge graph-based retrieval outperforms community-based approaches (GraphRAG)
  • Dual-mode keyword extraction (high-level themes + low-level entities) enables effective query routing
  • Graph structure captures relationships that vector similarity alone misses
  • LightRAG uses <100 tokens for retrieval vs GraphRAG's ~610K tokens

License

MIT
