Syntax and Parameters

^{ngrep matching a paragraph from the book Flatland: A Romance of Many Dimensions}

ngrep is an experimental PoC for finding text by meaning rather than literal matching. The core question is simple: what happens when regular expressions are enriched with a small bit of semantics via Word Embeddings?

ngrep explores that idea with a familiar grep interface, extending regex with a new neural operator ~ while keeping the rest of the language intact. It supports the major regex features you already use (including lookarounds like negative lookahead), so you can combine semantic and literal patterns in one expression, built on top of the fantastic 🦀 fancy-regex.

The `~` operator

The ~ operator matches by semantic similarity, using neural Word Embeddings (yes, 2010s nostalgia). For example, ~(fruit)+ matches any token whose embedding is close to that of fruit:

Syntax and Parameters

You can refine the search using the following syntax:

~(word) e.g: ~(car) Match word using the default similarity threshold (from --threshold or the model config).
~(word;threshold) e.g: ~(car;0.3) Overrides the threshold for this specific match.
~(word1::word2) e.g: ~(car::bike) Uses the average embedding of word1 and word2 as the target.

~() matches only a single token, use ~()+ to capture full words or phrases
Word Embeddings are inherently imprecise, expect to tune the threshold and combine multiple ~ operators to get stable results

Install

From the repository root:

cargo install --path .
ngrep --help

Build (no install)

If you prefer to build locally without installing, run:

cargo build --release
./target/release/ngrep --help

Then use ./target/release/ngrep in the commands below instead of ngrep.

After ngrep is available you have to import some Word Embeddings model to start matching. Follow these steps to download and import the English FastText embeddings:

wget -qO- https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz | gunzip > cc.en.300.vec
ngrep import cc.en.300.vec ften

Match with:

echo 'a standard example is: hello world' | ngrep '~(hey)+ ~(planet)+'

Alternatively you can import any embeddings in the txt format and configure the default model with ngrep config. You can import multiple models and switch between them with ngrep config or --model, but only one model is used at a time for matching:

A note on performance

ngrep's current focus is exploration, not performance. It does not preload or cache vectors, performs frequent disk access, and ~ matches are not compiled into standard regex. This is a deliberate choice to keep the implementation simple while the idea is being explored.

To give you a glimpse of the current performance, it takes about ~18s to find the most common ways to refer to a big animal in the book Moby-Dick on MacBook Pro M4 (1.2MB of text, 22K lines, English FastText 300d):

wget -q https://raw.githubusercontent.com/massimo-nazaria/bash-textgen/refs/heads/main/moby-dick.txt
time ngrep -o '~(big)+ \b~(animal;0.35)+\b' moby-dick.txt | sort | uniq -c | sort -n
   1 big whale
   1 enormous creature
   1 enormous creatures
   1 gigantic creature
   1 gigantic fish
   1 great dromedary
   1 great hunting
   1 huge elephant
   1 huge reptile
   1 large herd
   1 little cannibal
   1 small cub
   1 small fish
   1 tremendous whale
   2 great fish
   3 great monster
   3 large whale
   5 great whales
   7 great whale

real	0m17.291s
user	0m16.577s
sys	  0m0.726s

Name		Name	Last commit message	Last commit date
Latest commit History 73 Commits
assets		assets
embeddings		embeddings
fancy-regex-neural		fancy-regex-neural
src		src
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

The `~` operator

Syntax and Parameters

Install

Build (no install)

A note on performance

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

The ~ operator

Syntax and Parameters

Install

Build (no install)

A note on performance

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

The `~` operator

Packages