Problem
When using the regex fallback (no GLiNER/spaCy/LLM), extraction produces junk entities from OCR artifacts and document boilerplate. These pollute the graph and waste crossref API calls.
Examples from a 32-page Clearview AI contract:
To: — picked up as organization
Bill To: — picked up as organization
APPROVED — picked up as organization
EC .Orchestrating a or g-?te' 22 — OCR garbage picked up as organization
CID SES Criminal lnte — truncated OCR artifact
Justification of request: Clcorwcw will assist the department with identifying suspects through facial recognition. — entire sentence picked up as person
Sgt. Lakca Gaither, 09/94/2019, For FMU Use Only — name + metadata jammed together
Clearvicn Al. Inc. — OCR misspelling of "Clearview AI"
Fixes needed
- Blocklist — common boilerplate words that aren't entities: "To:", "From:", "Bill To:", "APPROVED", "RE:", "CC:", etc.
- Length filter — skip entities shorter than 3 chars or longer than 80 chars (full sentences aren't entities)
- Character ratio — if >30% special characters or digits, probably OCR garbage
- Fuzzy dedup — "Clearview AI" and "Clearvicn Al" should merge (Levenshtein distance)
- Sentence detection — if the "entity" contains a verb, it's probably a sentence not a name
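A minimal sketch of how the blocklist, length filter, character ratio check, and fuzzy dedup might fit together, using only the standard library. All names here (`BLOCKLIST`, `normalize`, `is_junk`, `levenshtein`, `dedupe`) are hypothetical and not taken from the existing code. The length and ratio thresholds are the ones proposed above; the relative edit-distance cutoff of 0.3 and the "keep the first spelling seen" policy are assumptions. Sentence detection is left out because it needs a part-of-speech signal.

```python
import re

# Hypothetical names; length and ratio thresholds are the ones proposed in this issue.
BLOCKLIST = {"to", "from", "bill to", "approved", "re", "cc"}
MIN_LEN, MAX_LEN = 3, 80
MAX_NOISE_RATIO = 0.30


def normalize(text: str) -> str:
    """Collapse whitespace, trim surrounding punctuation, and lowercase."""
    return re.sub(r"\s+", " ", text).strip(" \t:;.,-").lower()


def is_junk(text: str) -> bool:
    """Return True if a candidate fails the blocklist, length, or ratio check."""
    stripped = text.strip()
    if normalize(stripped) in BLOCKLIST:
        return True
    if not (MIN_LEN <= len(stripped) <= MAX_LEN):
        return True
    # Assumption: anything that is neither a letter nor whitespace counts as noise.
    noise = sum(1 for ch in stripped if not (ch.isalpha() or ch.isspace()))
    return noise / len(stripped) > MAX_NOISE_RATIO


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def dedupe(names: list[str], max_ratio: float = 0.3) -> list[str]:
    """Keep the first spelling seen; drop later names that are close to a kept one."""
    kept: list[str] = []
    for name in names:
        key = normalize(name)
        is_duplicate = False
        for other in kept:
            other_key = normalize(other)
            longest = max(len(key), len(other_key), 1)  # avoid dividing by zero
            if levenshtein(key, other_key) / longest <= max_ratio:
                is_duplicate = True
                break
        if not is_duplicate:
            kept.append(name)
    return kept


candidates = ["To:", "Bill To:", "APPROVED", "Clearview AI", "Clearvicn Al"]
clean = dedupe([c for c in candidates if not is_junk(c)])
```

With these thresholds, some of the examples above would still get through: the "Sgt. Lakca Gaither, ..." string is under 80 characters and under 30% noise, and "Clearvicn Al. Inc." is too far from "Clearview AI" to merge unless suffixes like "Inc." are stripped first. The cutoffs would likely need tuning against real output, and keeping the first spelling seen means an OCR misspelling wins if it appears before the correct form.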
Context
GLiNER and spaCy handle these cases correctly, so this only affects the regex fallback tier. Regex is the safety net that is always available, though, so its output needs to be cleaner.