v0.1.2 - Auto Entity Extraction + Security Patch

Pre-release

@Hashevolution Hashevolution released this 05 May 09:50
· 749 commits to main since this release
a8aaa1f

v0.1.2 - Auto Entity Extraction + Security Patch

Pre-release closing the STEP 7 (real-data validation) cycle and the first
wave of v0.2 patches.

Highlights

Automatic entity extraction is now the default upload behavior.
Before this release, every file upload created exactly one document
entity regardless of content (the process_document_for_entities method
was being called but did not exist, so the call fell through to a fallback).
Uploads now run an LLM extraction pass with ontology normalization and Memory
Trust validation, producing typed entities (person / org / concept)
with relations.

Validated on 30 real-world PDFs: 161 entities (concept 62 / org 57 /
person 11 / document 31) and 263 relations. By language, 57 entities are
Korean and 104 are English, and the two mix cleanly. Average response time
dropped from 25.7 s to 23.8 s (-1.9 s) after the alias matching fix.
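
The reported totals are internally consistent, which a few lines of arithmetic can confirm. The figures below are copied from this note; the variable names are illustrative only and do not come from the codebase.

```python
# Figures copied from the release note; names are illustrative.
by_type = {"concept": 62, "org": 57, "person": 11, "document": 31}
by_language = {"korean": 57, "english": 104}

total_by_type = sum(by_type.values())
total_by_language = sum(by_language.values())

# Both breakdowns should account for all 161 entities.
assert total_by_type == 161
assert total_by_language == 161

# Average response time before and after the alias matching fix, in seconds.
before_s, after_s = 25.7, 23.8
delta_s = round(after_s - before_s, 1)
print(total_by_type, total_by_language, delta_s)  # 161 161 -1.9
```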

All 4 open Dependabot alerts are closed. PyPDF2 was not actually
imported anywhere in the codebase and was dropped; cryptography was
bumped past the buffer-overflow advisory.
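
One way to confirm that a dependency is never imported is to walk each file's syntax tree instead of grepping for the name. The sketch below uses only the standard-library `ast` module; `imported_modules` is a hypothetical helper, not something shipped in this release, and it only sees static `import` statements (not `importlib` calls).

```python
import ast

def imported_modules(source: str) -> set[str]:
    """Return the top-level module names imported by a piece of Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b" counts as a dependency on "a".
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Absolute "from a.b import c"; relative imports are skipped.
            names.add(node.module.split(".")[0])
    return names

sample = "import os\nfrom cryptography.fernet import Fernet\n"
print("PyPDF2" in imported_modules(sample))  # False
```

Against a real checkout, the same function could be applied to the text of every file found by `pathlib.Path(".").rglob("*.py")`.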

Merged PRs

  • #9 chore(deps) drop unused PyPDF2; bump cryptography 46.0.6 → 47.0.0
  • #10 fix(graph) expand entity aliases on creation; backfill 161
    existing entities (closes #7)
  • #12 chore(logging) silence pdfminer FontBBox warnings on PDF upload
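
For the logging change in #12, a minimal sketch of one common approach is to raise the level on the library's parent logger. This is not necessarily what the PR does. It assumes the PDF library logs under names that start with `pdfminer` (for example `pdfminer.pdffont`); `silence_pdfminer_warnings` is a hypothetical helper name.

```python
import logging

def silence_pdfminer_warnings(level: int = logging.ERROR) -> None:
    """Raise the threshold on the parent 'pdfminer' logger so WARNING records are dropped."""
    # Assumption: the library's loggers are children of "pdfminer".
    logging.getLogger("pdfminer").setLevel(level)

silence_pdfminer_warnings()

# Child loggers with no level of their own inherit the parent's effective level.
child = logging.getLogger("pdfminer.pdffont")
print(child.isEnabledFor(logging.WARNING), child.isEnabledFor(logging.ERROR))
```

Setting the level on the parent keeps genuine errors visible while dropping warning-level noise.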

New code

  • core/wiki_generator.py::process_document_for_entities - LLM-based
    entity & relation extraction with ontology + Memory Trust (~200 LOC,
    shipped in c9c604e)
  • core/wiki_generator.py::_expand_alias_candidates - alias expansion
    for "X (Y)" patterns (full-width and half-width parentheses)
  • scripts/step7_query_test.py - reproducible 12-query benchmark
  • scripts/migrate_aliases.py - one-shot backfill for pre-fix entities
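
The alias expansion for "X (Y)" patterns can be sketched roughly as follows. This is an illustration, not the shipped `_expand_alias_candidates`: the function name, the regular expression, and the sample inputs are all assumptions made for this example.

```python
import re

# Matches "X (Y)" with either half-width () or full-width （） parentheses.
_PAREN_PATTERN = re.compile(r"^\s*(.+?)\s*[(（]\s*(.+?)\s*[)）]\s*$")

def expand_alias_candidates(name: str) -> list[str]:
    """Return the name plus the X and Y parts of an "X (Y)" pattern, deduplicated."""
    candidates = [name.strip()]
    match = _PAREN_PATTERN.match(name)
    if match:
        candidates.extend(match.groups())
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return [c for c in dict.fromkeys(candidates) if c]

print(expand_alias_candidates("Bitcoin (BTC)"))
# ['Bitcoin (BTC)', 'Bitcoin', 'BTC']
print(expand_alias_candidates("비트코인（BTC）"))
# ['비트코인（BTC）', '비트코인', 'BTC']
```

Storing all three candidates as aliases lets a query for either the long or the short form resolve to the same entity.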

Verified behaviors

  • Injection isolation: "Ignore previous instructions..." is blocked at 0.0s
  • Memory Trust: 161/161 entities pass
  • PII masking: [REDACTED] triggers automatically inside answers
  • Hallucination resistance: BTC vs 비트코인 (Bitcoin) returns
    "관련 자료 없음" ("no related material") rather than fabricating a
    connection
  • Multilingual: Korean / English / mixed queries all answered in the
    appropriate language

Open follow-up issues (next cycle)

#    Title                                          Priority
#11  Entity-level relations field is sparse         high
#13  Route ALL LLM calls through llm/router         high
#3   Entity dedup (BTC vs 비트코인)                  high
#14  Upload progress UI (XHR migration)             medium
#15  Wire admin LLM selection to actual inference   medium
#5   Entity type accuracy (products as org)         medium
#6   Relation label distribution skewed             medium
#4   metadata fallback writes LLM error string      medium
#8   Risky-coding-request policy                    medium
#2   wiki_reset.py Windows CP949 crash              medium

The v0.2.0 release will land after #11 and #13 are resolved; these are the
two items most likely to push graph utilization above 60%.

Known limitations

  • 8B local model (gemma4:e4b) is the practical floor for response time.
    p50 is ~24 s; bringing it below 15 s likely needs streaming or
    task-specific smaller models (#13 unblocks this).
  • Risky-coding requests (e.g. "give me the command to delete this
    folder") are answered with a warning prepended; some operators may want
    a hard refusal instead. Tracked in #8.