DeepNSM: COCA 5K vocabulary + 16Kbit fingerprint (47 tests) by AdaWorldAPI · Pull Request #150 · AdaWorldAPI/lance-graph

AdaWorldAPI · 2026-04-07T07:12:55Z

COCA Vocabulary Migrated to DeepNSM

Word frequency data from AdaWorldAPI/DeepNSM → crates/deepnsm/word_frequency/:

word_rank_lookup.csv — vocabulary.rs load source (4096 words)
lemmas_5k.csv — 5051 COCA lemmas with PoS + frequency
forms_5k.csv, word_forms.csv, compact variants

Binary files gitignored. ~2 MB CSVs in git.

16Kbit VSA Fingerprint (from previous merge)

Already in deepnsm/src/fingerprint16k.rs:

16384 bits = 256×u64, AVX512-aligned
8/8 tests passing
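
As a rough illustration of the layout described above, a 16384-bit fingerprint can be held in 256 `u64` words with 64-byte alignment. This is a minimal sketch, not the contents of `fingerprint16k.rs`: the type name `Fingerprint16k`, the `from_seed` constructor (SplitMix64), and the method names are all assumptions made for the example.

```rust
/// Number of 64-bit words in one fingerprint: 16384 bits / 64 = 256.
pub const WORDS: usize = 256;

/// Hypothetical 16,384-bit fingerprint. The 64-byte alignment means each
/// run of eight `u64` words starts on a 512-bit (AVX-512 register) boundary.
#[repr(C, align(64))]
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Fingerprint16k {
    pub words: [u64; WORDS],
}

impl Fingerprint16k {
    /// All-zero fingerprint (the identity element for XOR binding).
    pub fn zero() -> Self {
        Self { words: [0; WORDS] }
    }

    /// Deterministic pseudo-random fingerprint from a seed, using SplitMix64.
    /// The seeding scheme is an assumption for this sketch only.
    pub fn from_seed(seed: u64) -> Self {
        let mut state = seed;
        let mut words = [0u64; WORDS];
        for w in words.iter_mut() {
            state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
            let mut z = state;
            z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
            *w = z ^ (z >> 31);
        }
        Self { words }
    }

    /// XOR bind. Binding twice with the same operand restores the input.
    pub fn bind(&self, other: &Self) -> Self {
        let mut out = Self::zero();
        for i in 0..WORDS {
            out.words[i] = self.words[i] ^ other.words[i];
        }
        out
    }

    /// Hamming distance: number of bit positions where the two differ.
    pub fn hamming(&self, other: &Self) -> u32 {
        self.words
            .iter()
            .zip(other.words.iter())
            .map(|(a, b)| (a ^ b).count_ones())
            .sum()
    }
}
```

A plain loop over the words is used here on purpose; with the alignment in place, the compiler is free to vectorize it, and explicit intrinsics are left out of the sketch.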

DeepNSM = Complete Grammar+Vocabulary+Fingerprint+SPO Crate

vocabulary.rs    → COCA 4096 tokenizer
pos.rs           → Part-of-Speech
parser.rs        → PoS FSM (N→V→N)
spo.rs           → SpoTriple 36-bit
encoder.rs       → XOR bind + bundle
fingerprint16k.rs → 16Kbit VSA
word_frequency/  → COCA data files
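
One way a 36-bit triple could be laid out, given a 4096-word vocabulary, is three 12-bit word ranks packed side by side (4096 = 2^12, and 3 × 12 = 36). The sketch below assumes that layout; the field order, the accessor names, and the use of a `u64` backing integer are guesses for illustration and may not match `spo.rs`.

```rust
/// Bits per vocabulary index, assuming a 4096-word vocabulary (2^12).
const INDEX_BITS: u32 = 12;
/// Mask selecting one 12-bit index.
const INDEX_MASK: u64 = (1 << INDEX_BITS) - 1;

/// Hypothetical packed subject/predicate/object triple.
/// Assumed layout: subject in bits 24..36, predicate in 12..24, object in 0..12.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SpoTriple(u64);

impl SpoTriple {
    /// Packs three word ranks; returns `None` if any rank needs more than 12 bits.
    pub fn new(subject: u16, predicate: u16, object: u16) -> Option<Self> {
        if [subject, predicate, object]
            .iter()
            .any(|&x| u64::from(x) > INDEX_MASK)
        {
            return None;
        }
        Some(Self(
            (u64::from(subject) << (2 * INDEX_BITS))
                | (u64::from(predicate) << INDEX_BITS)
                | u64::from(object),
        ))
    }

    pub fn subject(self) -> u16 {
        ((self.0 >> (2 * INDEX_BITS)) & INDEX_MASK) as u16
    }

    pub fn predicate(self) -> u16 {
        ((self.0 >> INDEX_BITS) & INDEX_MASK) as u16
    }

    pub fn object(self) -> u16 {
        (self.0 & INDEX_MASK) as u16
    }

    /// Raw packed value; only the low 36 bits are ever set.
    pub fn bits(self) -> u64 {
        self.0
    }
}
```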

47 tests passing. No separate grammar crate needed.
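
The N→V→N pattern named for `parser.rs` can be pictured as a three-state machine that scans part-of-speech tags and picks out the first noun, the next verb, and the next noun after that. The following is a simplified sketch under that reading; the `Pos` enum, the `State` enum, and `extract_svo` are hypothetical names, and the real parser presumably handles more tags and patterns.

```rust
/// Coarse part-of-speech tag (hypothetical; the real tag set is richer).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Pos {
    Noun,
    Verb,
    Other,
}

/// FSM state, carrying the token positions matched so far.
#[derive(Clone, Copy)]
enum State {
    WantSubject,
    WantVerb(usize),
    WantObject(usize, usize),
}

/// Returns the token indices of (subject, verb, object) for the first
/// N→V→N match, or `None` if the tag sequence never completes the pattern.
pub fn extract_svo(tags: &[Pos]) -> Option<(usize, usize, usize)> {
    let mut state = State::WantSubject;
    for (i, &tag) in tags.iter().enumerate() {
        state = match (state, tag) {
            (State::WantSubject, Pos::Noun) => State::WantVerb(i),
            (State::WantVerb(s), Pos::Verb) => State::WantObject(s, i),
            (State::WantObject(s, v), Pos::Noun) => return Some((s, v, i)),
            // Any tag that does not advance the pattern is skipped.
            (unchanged, _) => unchanged,
        };
    }
    None
}
```

For a sentence tagged as determiner, noun, verb, determiner, noun ("the dog chased a cat"), this returns the positions of "dog", "chased", and "cat".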

https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A

Word frequency files from AdaWorldAPI/DeepNSM:

- word_rank_lookup.csv (101 KB): vocabulary.rs load source
- lemmas_5k.csv (670 KB): 5051 COCA lemmas with PoS + frequency
- forms_5k.csv (586 KB): word forms
- lemmas_compact.csv (156 KB): compact format
- forms_compact.csv (145 KB): compact forms
- word_forms.csv (378 KB): all forms

Binary files gitignored (*.bin, *.json, subgenres_5k.csv).

47 DeepNSM tests passing (including 8 fingerprint16k).

https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A

chatgpt-codex-connector · 2026-04-07T07:13:01Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

…eam #150

## Stale artifact removal (182 files, 3 MB)

`AdaWorldAPI-lance-graph-d9df43b/` was a committed snapshot of an older upstream version (48 .rs files vs our 98). Full audit confirmed:

- ZERO files exist only in the artifact (every file has a counterpart)
- Every differing file: ours >= artifact in LOC (ours is strictly ahead)
- All upstream features (#125 parameter_substitution, #140 lance_vector_search) are already in our src tree

The directory created GitHub path confusion: duplicate navigation paths for datafusion_planner, spo, blasgraph, neighborhood, arigraph. Removing it eliminates that confusion with zero content loss.

## Cherry-pick: spark_dialect.rs from upstream PR #150

The ONE file upstream has that we didn't:

- `crates/lance-graph/src/spark_dialect.rs` (107 LOC)
  Spark SQL dialect for DataFusion unparser: backtick quoting, STRING type casting, EXTRACT for dates, BIGINT/INT types, LENGTH(), derived table aliases.
- `crates/lance-graph/tests/test_to_spark_sql.rs` (293 LOC)
  Full test suite for Spark SQL output.
- `pub mod spark_dialect;` added to lib.rs

Adapted from upstream's DF 50.3 to our DF 51: same API surface, no changes needed.

## Upstream audit result (for the record)

Upstream (lance-format/lance-graph) is at v0.5.4. Our fork is at v0.5.3 with newer deps (arrow 57 vs 56.2, datafusion 51 vs 50.3). Other than spark_dialect, every upstream feature and fix is already present in our source tree: parameter_substitution (#125), lance_vector_search (#140), complex RETURN clauses (#142), duplicate columns fix (#128) are all in `crates/lance-graph/src/`.

Their deleted `simple_executor` was a prototype cold-path executor we never had. Our `ExecutionStrategy::DataFusion` path (6K LOC planner) + `ExecutionStrategy::BlasGraph` (semiring algebra) subsume it. The user has flagged adding a deliberate `ExecutionStrategy::Simple` cold path as a 4th strategy for trivial queries; that's a separate PR per the documented matrix of execution strategies.
https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj

chore: remove stale upstream snapshot + port spark_dialect from upstream #150

## Summary

- Add `SqlDialect` enum (`Default`, `Spark`, `PostgreSql`, `MySql`, `Sqlite`) and `SparkDialect` implementation using DataFusion's unparser `Dialect` trait
- Refactor `to_sql()` to accept an optional `dialect` parameter instead of a separate method per dialect
- Add Python API support: `query.to_sql(datasets, dialect="spark")`

### Spark SQL dialect differences

- Backtick identifier quoting
- `STRING` type instead of `VARCHAR`
- `EXTRACT(field FROM expr)` for date parts
- `LENGTH()` instead of `CHARACTER_LENGTH()`
- `TIMESTAMP` without timezone info
- Subqueries in FROM require aliases

### Usage

**Rust:**

```rust
use lance_graph::{CypherQuery, SqlDialect};

let sql = query.to_sql(datasets, Some(SqlDialect::Spark)).await?;
```

**Python:**

```python
sql = query.to_sql(datasets, dialect="spark")
```

## Test plan

- [x] 7 new Spark SQL integration tests (backtick quoting, filters, relationships, complex queries, dialect comparison, PostgreSQL dialect)
- [x] 5 unit tests for SparkDialect trait implementation
- [x] 12 existing `to_sql` tests updated and passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Yu Chen <yu.chen@databricks.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
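
To make the "backtick identifier quoting" difference concrete, here is a small standalone sketch of per-dialect identifier quoting. It does not use or reproduce the DataFusion `Dialect` trait or the actual `SparkDialect` code; the enum below only mirrors the variant names listed in the summary, and `parse`, `quote_char`, and `quote_identifier` are hypothetical helpers. Only the `"spark"` string is confirmed by the Python usage above; the other accepted strings are assumptions.

```rust
/// Stand-in for the dialect enum named in the summary (variant names only).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SqlDialect {
    Default,
    Spark,
    PostgreSql,
    MySql,
    Sqlite,
}

impl SqlDialect {
    /// Parses a dialect name case-insensitively. Only "spark" appears in the
    /// PR text; the remaining spellings are assumed for this sketch.
    pub fn parse(name: &str) -> Option<Self> {
        match name.to_ascii_lowercase().as_str() {
            "default" => Some(Self::Default),
            "spark" => Some(Self::Spark),
            "postgresql" | "postgres" => Some(Self::PostgreSql),
            "mysql" => Some(Self::MySql),
            "sqlite" => Some(Self::Sqlite),
            _ => None,
        }
    }

    /// Identifier quote character: backtick for Spark and MySQL,
    /// double quote otherwise.
    pub fn quote_char(self) -> char {
        match self {
            Self::Spark | Self::MySql => '`',
            Self::Default | Self::PostgreSql | Self::Sqlite => '"',
        }
    }
}

/// Wraps an identifier in the dialect's quote character, doubling any
/// embedded quote character so the result stays a single identifier.
pub fn quote_identifier(dialect: SqlDialect, ident: &str) -> String {
    let q = dialect.quote_char();
    let mut out = String::with_capacity(ident.len() + 2);
    out.push(q);
    for c in ident.chars() {
        if c == q {
            out.push(q);
        }
        out.push(c);
    }
    out.push(q);
    out
}
```

With this sketch, a column named `person.name` would be emitted as `` `person.name` `` for Spark and as `"person.name"` for PostgreSQL.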

AdaWorldAPI merged commit ba0f03d into main Apr 7, 2026

AdaWorldAPI mentioned this pull request Apr 17, 2026

chore: remove stale upstream snapshot + port spark_dialect from upstream #150 #188

Merged

4 tasks

AdaWorldAPI added a commit that referenced this pull request Apr 17, 2026

Merge pull request #188 from AdaWorldAPI/claude/codec-rnd-bench

56e56a7

chore: remove stale upstream snapshot + port spark_dialect from upstream #150

AdaWorldAPI mentioned this pull request Apr 17, 2026

feat: R&D codec bench framework — upstream sync, probes P5/P7, InferenceBackend, measurement model #189

Merged

7 tasks

AdaWorldAPI mentioned this pull request May 13, 2026

feat(sprint-9): close PR #355 Tier-A deferred backlog (4 items) + correct misdiagnosed hpc-extras issue #369

Merged

7 tasks

AdaWorldAPI merged 1 commit into main from claude/risc-thought-engine-TCZw7
