ngrep is an experimental PoC for finding text by meaning rather than literal matching. The core question is simple: what happens when regular expressions are enriched with a small bit of semantics via Word Embeddings?
ngrep explores that idea with a familiar grep interface, extending regex with a new neural operator ~ while keeping the rest of the language intact. It supports the major regex features you already use (including lookarounds like negative lookahead), so you can combine semantic and literal patterns in one expression, built on top of the fantastic 🦀 fancy-regex.
The ~ operator matches by semantic similarity, using neural Word Embeddings (yes, 2010s nostalgia).
For example, ~(fruit)+ matches any token whose embedding is close to that of fruit:
You can refine the search using the following syntax:
~(word)e.g:~(car)Match word using the default similarity threshold (from--thresholdor the model config).~(word;threshold)e.g:~(car;0.3)Overrides the threshold for this specific match.~(word1::word2)e.g:~(car::bike)Uses the average embedding ofword1andword2as the target.
~()matches only a single token, use~()+to capture full words or phrases- Word Embeddings are inherently imprecise, expect to tune the threshold and combine multiple
~operators to get stable results
From the repository root:
cargo install --path .
ngrep --helpIf you prefer to build locally without installing, run:
cargo build --release
./target/release/ngrep --helpThen use ./target/release/ngrep in the commands below instead of ngrep.
After ngrep is available you have to import some Word Embeddings model to start matching.
Follow these steps to download and import the English FastText embeddings:
wget -qO- https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz | gunzip > cc.en.300.vec
ngrep import cc.en.300.vec ftenMatch with:
echo 'a standard example is: hello world' | ngrep '~(hey)+ ~(planet)+'Alternatively you can import any embeddings in the txt format and configure the default model with ngrep config. You can import multiple models and switch between them with ngrep config or --model, but only one model is used at a time for matching:
- FastText Word vectors for 157 languages
- Wikipedia2Vec with ENTITY vectors
- GloVe: Global Vectors for Word Representation
ngrep's current focus is exploration, not performance. It does not preload or cache vectors, performs frequent disk access, and ~ matches are not compiled into standard regex. This is a deliberate choice to keep the implementation simple while the idea is being explored.
To give you a glimpse of the current performance, it takes about ~18s to find the most common ways to refer to a big animal in the book Moby-Dick on MacBook Pro M4 (1.2MB of text, 22K lines, English FastText 300d):
wget -q https://raw.githubusercontent.com/massimo-nazaria/bash-textgen/refs/heads/main/moby-dick.txt
time ngrep -o '~(big)+ \b~(animal;0.35)+\b' moby-dick.txt | sort | uniq -c | sort -n
1 big whale
1 enormous creature
1 enormous creatures
1 gigantic creature
1 gigantic fish
1 great dromedary
1 great hunting
1 huge elephant
1 huge reptile
1 large herd
1 little cannibal
1 small cub
1 small fish
1 tremendous whale
2 great fish
3 great monster
3 large whale
5 great whales
7 great whale
real 0m17.291s
user 0m16.577s
sys 0m0.726s