Skip to content

0xNaN/ngrep

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

73 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ngrep

ngrep matching a paragraph from the book Flatland: A Romance of Many Dimensions

ngrep is an experimental PoC for finding text by meaning rather than literal matching. The core question is simple: what happens when regular expressions are enriched with a small bit of semantics via Word Embeddings?

ngrep explores that idea with a familiar grep interface, extending regex with a new neural operator ~ while keeping the rest of the language intact. It supports the major regex features you already use (including lookarounds like negative lookahead), so you can combine semantic and literal patterns in one expression, built on top of the fantastic 🦀 fancy-regex.

The ~ operator

The ~ operator matches by semantic similarity, using neural Word Embeddings (yes, 2010s nostalgia). For example, ~(fruit)+ matches any token whose embedding is close to that of fruit:

fruits

Syntax and Parameters

You can refine the search using the following syntax:

  • ~(word) e.g: ~(car) Match word using the default similarity threshold (from --threshold or the model config).
  • ~(word;threshold) e.g: ~(car;0.3) Overrides the threshold for this specific match.
  • ~(word1::word2) e.g: ~(car::bike) Uses the average embedding of word1 and word2 as the target.
  • ~() matches only a single token, use ~()+ to capture full words or phrases
  • Word Embeddings are inherently imprecise, expect to tune the threshold and combine multiple ~ operators to get stable results

Install

From the repository root:

cargo install --path .
ngrep --help

Build (no install)

If you prefer to build locally without installing, run:

cargo build --release
./target/release/ngrep --help

Then use ./target/release/ngrep in the commands below instead of ngrep.

After ngrep is available you have to import some Word Embeddings model to start matching. Follow these steps to download and import the English FastText embeddings:

wget -qO- https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.vec.gz | gunzip > cc.en.300.vec
ngrep import cc.en.300.vec ften

Match with:

echo 'a standard example is: hello world' | ngrep '~(hey)+ ~(planet)+'

Alternatively you can import any embeddings in the txt format and configure the default model with ngrep config. You can import multiple models and switch between them with ngrep config or --model, but only one model is used at a time for matching:

A note on performance

ngrep's current focus is exploration, not performance. It does not preload or cache vectors, performs frequent disk access, and ~ matches are not compiled into standard regex. This is a deliberate choice to keep the implementation simple while the idea is being explored.

To give you a glimpse of the current performance, it takes about ~18s to find the most common ways to refer to a big animal in the book Moby-Dick on MacBook Pro M4 (1.2MB of text, 22K lines, English FastText 300d):

wget -q https://raw.githubusercontent.com/massimo-nazaria/bash-textgen/refs/heads/main/moby-dick.txt
time ngrep -o '~(big)+ \b~(animal;0.35)+\b' moby-dick.txt | sort | uniq -c | sort -n
   1 big whale
   1 enormous creature
   1 enormous creatures
   1 gigantic creature
   1 gigantic fish
   1 great dromedary
   1 great hunting
   1 huge elephant
   1 huge reptile
   1 large herd
   1 little cannibal
   1 small cub
   1 small fish
   1 tremendous whale
   2 great fish
   3 great monster
   3 large whale
   5 great whales
   7 great whale

real	0m17.291s
user	0m16.577s
sys	  0m0.726s

About

grep plus word embeddings

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages