MukundaKatta/DocGraph

DocGraph

License: MIT | Python 3.11+

Lightweight document knowledge graph builder.

DocGraph takes its cue from the knowledge graphs used in retrieval-augmented generation (RAG) pipelines but stays deliberately small. It extracts entities and relationships from unstructured text and lets you query them through a simple Python API, with no heavy NLP frameworks required.

Architecture

graph TD
    A[Document Text] --> B[Entity Extractor]
    B --> C[Relationship Detector]
    C --> D[KnowledgeGraph]
    D --> E[Query API]
    D --> F[Path Finding]
    D --> G[Subgraph Extraction]
    D --> H[JSON Export]

    subgraph "Entity Extraction"
        B --> B1[Persons]
        B --> B2[Organizations]
        B --> B3[Locations]
        B --> B4[Dates / Emails / URLs]
    end

    subgraph "Graph Algorithms"
        F --> F1[BFS]
        F --> F2[Shortest Path]
    end

Features

  • Regex-based entity extraction -- recognizes persons, organizations, locations, dates, emails, URLs, and monetary amounts
  • Co-occurrence relationship detection -- links entities that appear close to each other in the text (see co_occurrence_window under Configuration)
  • Custom lightweight graph -- BFS traversal, shortest path, subgraph extraction with zero external graph dependencies
  • Document tracking -- every entity and relationship is traced back to its source document
  • JSON export -- serialize your entire knowledge graph for downstream use
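
To make the first two features concrete, here is a minimal sketch of regex-based extraction followed by co-occurrence counting. It is not DocGraph's implementation: the patterns, the function names `extract_entities` and `co_occurrences`, and the choice to measure the window in characters are all assumptions made for illustration. The real patterns live in utils.py and may differ.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical patterns for illustration only; not taken from DocGraph's source.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "url": re.compile(r"https?://\S+"),
    "money": re.compile(r"\$\d[\d,]*(?:\.\d+)?"),
    # Two or more consecutive capitalized words: a crude stand-in for
    # person, organization, and location names.
    "name": re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b"),
}


def extract_entities(text):
    """Return (label, surface text, start offset) tuples sorted by offset."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group(), match.start()))
    return sorted(found, key=lambda entity: entity[2])


def co_occurrences(entities, window=100):
    """Count pairs of distinct entities whose start offsets lie within
    `window` characters of each other (assumed unit: characters)."""
    pairs = Counter()
    for (_, a, pos_a), (_, b, pos_b) in combinations(entities, 2):
        if a != b and abs(pos_a - pos_b) <= window:
            pairs[tuple(sorted((a, b)))] += 1
    return pairs


text = "Alice Johnson is the CEO of Acme Corp, headquartered in New York."
entities = extract_entities(text)
print([surface for _, surface, _ in entities])
print(dict(co_occurrences(entities, window=30)))
```

With a 30-character window, "Acme Corp" is linked to both "Alice Johnson" and "New York", while the two outer entities are too far apart to be linked directly. A larger window trades precision for recall.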

Quickstart

Installation

pip install -e .

Build a Knowledge Graph

from docgraph import KnowledgeGraph

kg = KnowledgeGraph()

# Add documents
kg.add_document(
    "Alice Johnson is the CEO of Acme Corp, headquartered in New York.",
    doc_id="press-release-1"
)
kg.add_document(
    "Bob Smith joined Acme Corp as CTO. He previously worked with Alice Johnson at Global Technologies.",
    doc_id="press-release-2"
)

# Query an entity
result = kg.query("Alice Johnson")
print(result["neighbors"])  # ['Acme Corp', 'New York', ...]

# Find shortest path
path = kg.find_path("Bob Smith", "New York")
print(path)  # ['Bob Smith', 'Acme Corp', 'New York']

# Get neighborhood subgraph
subgraph = kg.get_subgraph("Acme Corp", depth=1)

# Export full graph
json_str = kg.export_json()

# Graph statistics
print(kg.stats())
# {'nodes': 6, 'edges': 5, 'density': 0.3333, 'documents': 2}
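
For readers curious how a path query like the one above can work without a graph library, the following sketch runs breadth-first search over a plain adjacency dictionary. The edge list is hypothetical and only mirrors the quickstart names; the edges DocGraph actually derives may differ. `build_adjacency`, `shortest_path`, and `density` are illustrative helpers, not part of the DocGraph API. The density helper uses the standard undirected formula 2E / (N(N - 1)), which happens to agree with the sample stats output (6 nodes, 5 edges, 0.3333); whether the library computes it this way is an assumption.

```python
from collections import deque

# Hypothetical undirected edges for illustration only.
EDGES = [
    ("Alice Johnson", "Acme Corp"),
    ("Acme Corp", "New York"),
    ("Bob Smith", "Acme Corp"),
    ("Bob Smith", "Global Technologies"),
    ("Alice Johnson", "Global Technologies"),
]


def build_adjacency(edges):
    """Map each node to the set of its neighbors (undirected)."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns a fewest-hops path or None."""
    if start not in adjacency or goal not in adjacency:
        return None
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(adjacency[node]):  # sorted for determinism
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


def density(num_nodes, num_edges):
    """Undirected graph density: 2E / (N * (N - 1))."""
    if num_nodes < 2:
        return 0.0
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


adjacency = build_adjacency(EDGES)
print(shortest_path(adjacency, "Bob Smith", "New York"))
print(round(density(6, 5), 4))
```

Because BFS explores nodes in order of hop count, the first time it reaches the goal it has found a path with the fewest edges, which is sufficient when edges are unweighted.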

Configuration

from docgraph import DocGraphConfig, KnowledgeGraph

config = DocGraphConfig(
    max_entities=5000,
    co_occurrence_window=100,
    min_relationship_weight=0.2,
)
kg = KnowledgeGraph(config=config)

Alternatively, set options through environment variables (see .env.example):

export MAX_ENTITIES=10000
export LOG_LEVEL=DEBUG
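
As a rough picture of how such variables can be mapped onto a configuration object, the sketch below reads them into a small dataclass. `EnvConfig` and `from_env` are hypothetical names and the defaults are illustrative; how DocGraphConfig actually loads its environment is not shown in this README.

```python
import os
from dataclasses import dataclass


@dataclass
class EnvConfig:
    # Hypothetical stand-in for DocGraphConfig; defaults are illustrative.
    max_entities: int = 5000
    log_level: str = "INFO"

    @classmethod
    def from_env(cls, environ=None):
        """Build a config from a mapping, defaulting to os.environ."""
        env = os.environ if environ is None else environ
        return cls(
            max_entities=int(env.get("MAX_ENTITIES", cls.max_entities)),
            log_level=str(env.get("LOG_LEVEL", cls.log_level)).upper(),
        )


config = EnvConfig.from_env({"MAX_ENTITIES": "10000", "LOG_LEVEL": "debug"})
print(config)
```

Passing the mapping explicitly keeps the loader easy to test, since no real environment variables need to be set.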

Development

# Install dev dependencies
make dev

# Run tests
make test

# Lint + typecheck + test
make all

See CONTRIBUTING.md for full guidelines.

Project Structure

src/docgraph/
  __init__.py    # Public API
  config.py      # Configuration
  core.py        # Entity, Relationship, KnowledgeGraph
  utils.py       # Text processing, regex patterns, graph algorithms
tests/
  test_core.py   # Unit tests
docs/
  ARCHITECTURE.md

License

MIT -- see LICENSE.


Built by Officethree Technologies | Made with ❤️ and AI
