DeepNSM: COCA 5K vocabulary + 16Kbit fingerprint (47 tests)#150

Merged
AdaWorldAPI merged 1 commit into main from claude/risc-thought-engine-TCZw7 on Apr 7, 2026
@AdaWorldAPI
Owner

COCA Vocabulary Migrated to DeepNSM

Word frequency data from AdaWorldAPI/DeepNSM → crates/deepnsm/word_frequency/:

  • word_rank_lookup.csv — vocabulary.rs load source (4096 words)
  • lemmas_5k.csv — 5051 COCA lemmas with PoS + frequency
  • forms_5k.csv, word_forms.csv, compact variants

Binary files gitignored. ~2 MB CSVs in git.
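
As a rough illustration of how a rank lookup like `word_rank_lookup.csv` could be loaded, here is a minimal sketch. The column layout (`word,rank`), the header row, zero-based ranks, and the `Vocabulary` type are all assumptions made for this example; the actual file format and the real `vocabulary.rs` API are not shown in this PR.

```rust
use std::collections::HashMap;

/// Hypothetical vocabulary table; not the real type from vocabulary.rs.
pub struct Vocabulary {
    rank_of: HashMap<String, u16>,
}

impl Vocabulary {
    /// The PR states the tokenizer vocabulary holds 4096 words.
    pub const MAX_WORDS: u16 = 4096;

    /// Parse CSV text with ASSUMED columns `word,rank`.
    /// Lines whose second column is not a number (e.g. a header) are skipped.
    pub fn from_csv(text: &str) -> Self {
        let mut rank_of = HashMap::new();
        for line in text.lines() {
            let mut cols = line.split(',');
            let (Some(word), Some(rank)) = (cols.next(), cols.next()) else {
                continue;
            };
            let Ok(rank) = rank.trim().parse::<u16>() else {
                continue;
            };
            // Keep only ranks that fit the 4096-word vocabulary.
            if rank < Self::MAX_WORDS {
                rank_of.insert(word.trim().to_lowercase(), rank);
            }
        }
        Self { rank_of }
    }

    /// Case-insensitive lookup of a word's rank.
    pub fn rank(&self, word: &str) -> Option<u16> {
        self.rank_of.get(&word.to_lowercase()).copied()
    }

    /// Number of words loaded.
    pub fn len(&self) -> usize {
        self.rank_of.len()
    }
}
```

Under these assumptions, `Vocabulary::from_csv(text).rank("the")` returns the rank stored for that word, and any word ranked outside the first 4096 is treated as out of vocabulary.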

16Kbit VSA Fingerprint (from previous merge)

Already in deepnsm/src/fingerprint16k.rs:

  • 16384 bits = 256×u64, AVX512-aligned
  • 8/8 tests passing
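
The layout described above (256 `u64` words, 64-byte alignment to match a 512-bit AVX-512 register) can be sketched as follows. This is an illustrative stand-in, not the code in `fingerprint16k.rs`: the type name, the SplitMix64 seeding, and the `bind`/`hamming` methods are assumptions chosen to show typical VSA operations on such a layout.

```rust
/// Hypothetical 16384-bit fingerprint; names are illustrative only.
/// 64-byte alignment matches the width of one AVX-512 register.
#[repr(C, align(64))]
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Fingerprint16k {
    words: [u64; 256], // 256 x 64 = 16384 bits
}

impl Fingerprint16k {
    pub const BITS: usize = 256 * 64;

    /// All-zero fingerprint.
    pub fn zero() -> Self {
        Self { words: [0u64; 256] }
    }

    /// Deterministic pseudo-random fingerprint from a seed (SplitMix64).
    pub fn from_seed(seed: u64) -> Self {
        let mut state = seed;
        let mut words = [0u64; 256];
        for w in words.iter_mut() {
            state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
            let mut z = state;
            z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
            *w = z ^ (z >> 31);
        }
        Self { words }
    }

    /// XOR binding. XOR is self-inverse, so binding twice with the
    /// same key recovers the original fingerprint.
    pub fn bind(&self, other: &Self) -> Self {
        let mut words = [0u64; 256];
        for i in 0..256 {
            words[i] = self.words[i] ^ other.words[i];
        }
        Self { words }
    }

    /// Hamming distance: number of differing bits.
    pub fn hamming(&self, other: &Self) -> u32 {
        self.words
            .iter()
            .zip(other.words.iter())
            .map(|(a, b)| (a ^ b).count_ones())
            .sum()
    }
}
```

The struct occupies 2048 bytes, so a plain loop over the words is something the compiler can auto-vectorize; explicit SIMD intrinsics are left out of the sketch.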

DeepNSM = Complete Grammar+Vocabulary+Fingerprint+SPO Crate

vocabulary.rs     → COCA 4096 tokenizer
pos.rs            → Part-of-Speech
parser.rs         → PoS FSM (N→V→N)
spo.rs            → SpoTriple 36-bit
encoder.rs        → XOR bind + bundle
fingerprint16k.rs → 16Kbit VSA
word_frequency/   → COCA data files

47 tests passing. No separate grammar crate needed.
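
To make the parser and triple entries in the module map concrete, here is a minimal sketch of an N→V→N state machine that emits a packed triple. It assumes the 36 bits are three 12-bit vocabulary ranks (4096 = 2^12), which the PR does not state explicitly; the `Pos` enum, `SpoTriple` layout, and `parse_spo` function are hypothetical and are not the crate's actual API.

```rust
/// Hypothetical coarse part-of-speech tag.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Pos {
    Noun,
    Verb,
    Other,
}

/// ASSUMED layout: subject, predicate, object as 12-bit ranks
/// packed into the low 36 bits of a u64.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SpoTriple(u64);

impl SpoTriple {
    const MASK: u64 = 0xFFF;

    /// Returns None if any rank does not fit in 12 bits.
    pub fn new(s: u16, p: u16, o: u16) -> Option<Self> {
        if s >= 4096 || p >= 4096 || o >= 4096 {
            return None;
        }
        Some(Self(((s as u64) << 24) | ((p as u64) << 12) | (o as u64)))
    }

    pub fn subject(self) -> u16 {
        ((self.0 >> 24) & Self::MASK) as u16
    }

    pub fn predicate(self) -> u16 {
        ((self.0 >> 12) & Self::MASK) as u16
    }

    pub fn object(self) -> u16 {
        (self.0 & Self::MASK) as u16
    }

    pub fn bits(self) -> u64 {
        self.0
    }
}

/// Tiny FSM: wait for a noun, then a verb, then a noun.
/// Tokens that do not advance the machine are skipped.
pub fn parse_spo(tokens: &[(u16, Pos)]) -> Option<SpoTriple> {
    let mut subject: Option<u16> = None;
    let mut predicate: Option<u16> = None;
    for &(rank, pos) in tokens {
        match (subject, predicate, pos) {
            (None, _, Pos::Noun) => subject = Some(rank),
            (Some(_), None, Pos::Verb) => predicate = Some(rank),
            (Some(s), Some(p), Pos::Noun) => return SpoTriple::new(s, p, rank),
            _ => {}
        }
    }
    None
}
```

With this layout a sentence tagged as Other, Noun, Verb, Other, Noun yields one triple from the three content words, and the determiners are ignored.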

https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A

Word frequency files from AdaWorldAPI/DeepNSM:
  word_rank_lookup.csv (101 KB) — vocabulary.rs load source
  lemmas_5k.csv (670 KB) — 5051 COCA lemmas with PoS + frequency
  forms_5k.csv (586 KB) — word forms
  lemmas_compact.csv (156 KB) — compact format
  forms_compact.csv (145 KB) — compact forms
  word_forms.csv (378 KB) — all forms

Binary files gitignored (*.bin, *.json, subgenres_5k.csv).
47 DeepNSM tests passing (including 8 fingerprint16k).

https://claude.ai/code/session_019RzHP8tpJu55ESTxhfUy1A
@chatgpt-codex-connector

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, you can upgrade your account or add credits to your account and enable them for code reviews in your settings.

@AdaWorldAPI AdaWorldAPI merged commit ba0f03d into main Apr 7, 2026
AdaWorldAPI pushed a commit that referenced this pull request Apr 17, 2026
…eam #150

## Stale artifact removal (182 files, 3 MB)

`AdaWorldAPI-lance-graph-d9df43b/` was a committed snapshot of an older
upstream version (48 .rs files vs our 98). Full audit confirmed:
  - ZERO files exist only in the artifact (every file has a counterpart)
  - Every differing file: ours >= artifact in LOC (ours is never behind)
  - All upstream features (#125 parameter_substitution, #140 lance_vector_search)
    are already in our src tree
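
The first two audit criteria above can be expressed as a small check. This is only a sketch of the idea: the `audit` function, the `AuditReport` type, and the use of path-to-LOC maps as input are hypothetical, and the actual audit procedure used for this commit is not documented here.

```rust
use std::collections::BTreeMap;

/// Hypothetical audit result for comparing a snapshot against the live tree.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct AuditReport {
    /// Files present in the artifact with no counterpart in our tree.
    pub only_in_artifact: Vec<String>,
    /// Files where our copy has fewer lines than the artifact's copy.
    pub ours_behind: Vec<String>,
}

impl AuditReport {
    /// Deleting the artifact loses nothing only if both lists are empty.
    pub fn safe_to_delete(&self) -> bool {
        self.only_in_artifact.is_empty() && self.ours_behind.is_empty()
    }
}

/// Compare relative-path -> line-count maps for the two trees.
pub fn audit(
    ours: &BTreeMap<String, usize>,
    artifact: &BTreeMap<String, usize>,
) -> AuditReport {
    let mut report = AuditReport::default();
    for (path, &artifact_loc) in artifact {
        match ours.get(path) {
            None => report.only_in_artifact.push(path.clone()),
            Some(&our_loc) if our_loc < artifact_loc => {
                report.ours_behind.push(path.clone())
            }
            Some(_) => {}
        }
    }
    report
}
```

Line counts are a coarse proxy; a stricter check would also diff the contents of files that differ.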

The directory created GitHub path confusion — duplicate navigation paths
for datafusion_planner, spo, blasgraph, neighborhood, arigraph. Removing
it eliminates that confusion with zero content loss.

## Cherry-pick: spark_dialect.rs from upstream PR #150

The one upstream addition we did not have, together with its tests:
  - `crates/lance-graph/src/spark_dialect.rs` (107 LOC)
    Spark SQL dialect for DataFusion unparser: backtick quoting, STRING
    type casting, EXTRACT for dates, BIGINT/INT types, LENGTH(), derived
    table aliases.
  - `crates/lance-graph/tests/test_to_spark_sql.rs` (293 LOC)
    Full test suite for Spark SQL output.
  - `pub mod spark_dialect;` added to lib.rs

Ported from upstream's DataFusion 50.3 to our DataFusion 51. The API
surface is the same, so no code changes were needed.

## Upstream audit result (for the record)

Upstream (lance-format/lance-graph) is at v0.5.4. Our fork is at v0.5.3
with newer deps (arrow 57 vs 56.2, datafusion 51 vs 50.3). Other than
spark_dialect, every upstream feature and fix is already present in our
source tree — parameter_substitution (#125), lance_vector_search (#140),
complex RETURN clauses (#142), duplicate columns fix (#128) are all in
`crates/lance-graph/src/`.

Their deleted `simple_executor` was a prototype cold-path executor we
never had. Our `ExecutionStrategy::DataFusion` path (6K LOC planner)
+ `ExecutionStrategy::BlasGraph` (semiring algebra) subsume it. The
user has flagged adding a deliberate `ExecutionStrategy::Simple` cold
path as a 4th strategy for trivial queries — that's a separate PR per
the documented matrix of execution strategies.

https://claude.ai/code/session_01NYGrxVopyszZYgLBxe4hgj
AdaWorldAPI added a commit that referenced this pull request Apr 17, 2026
chore: remove stale upstream snapshot + port spark_dialect from upstream #150
AdaWorldAPI pushed a commit that referenced this pull request Apr 23, 2026
## Summary
- Add `SqlDialect` enum (`Default`, `Spark`, `PostgreSql`, `MySql`,
`Sqlite`) and `SparkDialect` implementation using DataFusion's unparser
`Dialect` trait
- Refactor `to_sql()` to accept an optional `dialect` parameter instead
of a separate method per dialect
- Add Python API support: `query.to_sql(datasets, dialect="spark")`

### Spark SQL dialect differences
- Backtick identifier quoting
- `STRING` type instead of `VARCHAR`
- `EXTRACT(field FROM expr)` for date parts
- `LENGTH()` instead of `CHARACTER_LENGTH()`
- `TIMESTAMP` without timezone info
- Subqueries in FROM require aliases
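
A few of the differences listed above can be illustrated with a standalone sketch. It deliberately does not use DataFusion's unparser `Dialect` trait; the `DialectSketch` enum and its methods are hypothetical, and the double-quote behaviour shown for the default dialect is an assumption based on ANSI SQL conventions.

```rust
/// Hypothetical, simplified stand-in for a dialect switch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum DialectSketch {
    Default,
    Spark,
}

impl DialectSketch {
    /// Spark quotes identifiers with backticks; the default here
    /// is ASSUMED to use ANSI double quotes.
    pub fn quote_char(self) -> char {
        match self {
            DialectSketch::Default => '"',
            DialectSketch::Spark => '`',
        }
    }

    /// Quote an identifier, escaping an embedded quote by doubling it.
    pub fn quote_ident(self, ident: &str) -> String {
        let q = self.quote_char();
        let mut out = String::with_capacity(ident.len() + 2);
        out.push(q);
        for c in ident.chars() {
            if c == q {
                out.push(q);
            }
            out.push(c);
        }
        out.push(q);
        out
    }

    /// Spark spells the string type STRING instead of VARCHAR.
    pub fn string_type(self) -> &'static str {
        match self {
            DialectSketch::Default => "VARCHAR",
            DialectSketch::Spark => "STRING",
        }
    }

    /// Spark uses LENGTH() instead of CHARACTER_LENGTH().
    pub fn char_length(self, expr: &str) -> String {
        let name = match self {
            DialectSketch::Default => "CHARACTER_LENGTH",
            DialectSketch::Spark => "LENGTH",
        };
        format!("{}({})", name, expr)
    }
}
```

Keeping each difference behind a small method mirrors how a dialect object can be passed into one `to_sql()` call instead of maintaining a separate method per dialect.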

### Usage

**Rust:**
```rust
use lance_graph::{CypherQuery, SqlDialect};

let sql = query.to_sql(datasets, Some(SqlDialect::Spark)).await?;
```

**Python:**
```python
sql = query.to_sql(datasets, dialect="spark")
```

## Test plan
- [x] 7 new Spark SQL integration tests (backtick quoting, filters,
relationships, complex queries, dialect comparison, PostgreSQL dialect)
- [x] 5 unit tests for SparkDialect trait implementation
- [x] 12 existing `to_sql` tests updated and passing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Yu Chen <yu.chen@databricks.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>