# corpa

Blazing-fast text analysis for the command line, Python, and the browser.

A unified tool for corpus-level NLP statistics — n-gram frequencies, readability scores, entropy analysis, language detection, BPE token counting, and more — written in Rust for performance, with bindings for Python and JavaScript/WASM.

Installation · Quick Start · Commands · Documentation · Contributing
## Features

- **High performance** — Parallel processing via `rayon`; analyzes multi-GB corpora in seconds.
- **Composable** — Unix-friendly design with structured output (JSON, CSV, table); pipes seamlessly with `jq`, `awk`, and standard tooling.
- **Comprehensive** — Eight analysis commands covering vocabulary statistics, n-gram frequencies, readability indices, Shannon entropy, Zipf's law, language model perplexity, language detection, and BPE tokenization.
- **Multi-platform** — Available as a native CLI binary, a Python package via PyO3, and an npm/WASM module for browser and Node.js environments.
- **Streaming** — Process unbounded stdin streams with incremental chunk-based output for `stats`, `ngrams`, and `entropy`.
## Installation

### CLI (Rust)

```sh
cargo install corpa
```

Or build from source:

```sh
git clone https://github.com/Flurry13/corpa
cd corpa
cargo build --release
```

### Python

```sh
pip install corpa
```

### JavaScript

npm support coming soon.
## Quick Start

### CLI

```sh
corpa stats corpus.txt                           # vocabulary statistics
corpa ngrams -n 2 --top 20 corpus.txt            # top 20 bigrams
corpa readability essay.txt                      # readability indices
corpa entropy corpus.txt                         # Shannon entropy
corpa perplexity corpus.txt --smoothing laplace  # n-gram LM perplexity
corpa lang mystery.txt                           # language detection
corpa tokens corpus.txt --model gpt4             # BPE token counts
corpa zipf corpus.txt --top 10                   # Zipf rank-frequency fit
```

All commands accept file paths, directories (with `--recursive`), or stdin. Output format is controlled with `--format` (`table`, `json`, `csv`).
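For context on the `--smoothing laplace` flag in the `perplexity` command above: a bigram model's perplexity is the inverse geometric mean of the per-token conditional probabilities, and Laplace (add-one) smoothing keeps unseen bigrams from sending it to infinity. The following is an illustrative sketch of the math, not corpa's implementation:

```python
import math
from collections import Counter

def bigram_perplexity(train_tokens, test_tokens):
    """Laplace-smoothed bigram perplexity: PP = 2^(-1/N * sum log2 P(w_i | w_{i-1}))."""
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))
    unigrams = Counter(train_tokens)
    vocab_size = len(set(train_tokens))
    pairs = list(zip(test_tokens, test_tokens[1:]))
    log_prob = 0.0
    for prev, word in pairs:
        # Add-one smoothing: (count(prev, word) + 1) / (count(prev) + V)
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log2(p)
    return 2 ** (-log_prob / len(pairs))

train = "a b a b a b".split()
print(round(bigram_perplexity(train, "a b a b".split()), 2))  # → 1.38
```

Lower is better: a perplexity of k means the model is, on average, as uncertain as a uniform choice among k words.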
### Python

```python
import corpa

corpa.stats(text="The quick brown fox jumps over the lazy dog.")
# {'tokens': 9, 'types': 8, 'sentences': 1, 'type_token_ratio': 0.8889, ...}

corpa.ngrams("corpus.txt", n=2, top=10)
# [{'ngram': 'of the', 'frequency': 4521, 'relative_pct': 2.09}, ...]

corpa.lang(text="Bonjour le monde")
# {'language': 'Français', 'code': 'fra', 'script': 'Latin', 'confidence': 0.99}
```

All functions accept a file path as the first argument or a `text=` keyword argument for direct string input.

### JavaScript

npm support coming soon.
## Commands

| Command | Description |
|---|---|
| `stats` | Token, type, sentence counts; type-token ratio; hapax legomena; average sentence length |
| `ngrams` | N-gram frequency analysis with configurable N, top-K, minimum frequency, case folding, stopword filtering |
| `tokens` | Whitespace, sentence, and character tokenization; BPE token counts for GPT-3, GPT-4, and GPT-4o |
| `readability` | Flesch-Kincaid Grade, Flesch Reading Ease, Coleman-Liau Index, Gunning Fog Index, SMOG Index |
| `entropy` | Unigram, bigram, and trigram Shannon entropy; entropy rate; vocabulary redundancy |
| `perplexity` | N-gram language model perplexity with Laplace smoothing and Stupid Backoff |
| `lang` | Language and script detection with confidence scoring |
| `zipf` | Zipf's law rank-frequency distribution with exponent fitting and terminal sparkline plotting |
| `completions` | Shell completion generation for bash, zsh, and fish |
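As a reference point for the `entropy` row above: unigram Shannon entropy is H = -Σ p(w) log₂ p(w) over the empirical token distribution, measured in bits per token. A self-contained sketch of the formula (illustrative only, not corpa's implementation):

```python
import math
from collections import Counter

def unigram_entropy(tokens):
    """Shannon entropy in bits per token: H = -sum p(w) * log2 p(w)."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Four equiprobable types give exactly 2 bits per token
print(unigram_entropy(["a", "b", "c", "d"]))  # → 2.0
```

Higher-order (bigram, trigram) entropies apply the same formula to the distribution over n-grams.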
### Example output

```text
$ corpa stats prose.txt
corpa · prose.txt
┌─────────────────────┬────────────┐
│ Metric              ┆ Value      │
╞═════════════════════╪════════════╡
│ Tokens (words)      ┆ 175        │
│ Types (unique)      ┆ 95         │
│ Characters          ┆ 805        │
│ Sentences           ┆ 6          │
│ Type-Token Ratio    ┆ 0.5429     │
│ Hapax Legomena      ┆ 70 (73.7%) │
│ Avg Sentence Length ┆ 29.2 words │
└─────────────────────┴────────────┘
```
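The vocabulary metrics in this table follow their standard definitions: the type-token ratio is unique types divided by total tokens, and hapax legomena are types that occur exactly once. A rough sketch using naive whitespace tokenization (corpa's own tokenization rules may differ):

```python
from collections import Counter

def vocab_stats(text):
    tokens = text.lower().split()  # naive whitespace tokenization
    counts = Counter(tokens)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "type_token_ratio": round(len(counts) / len(tokens), 4),
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
    }

print(vocab_stats("the quick brown fox jumps over the lazy dog"))
# {'tokens': 9, 'types': 8, 'type_token_ratio': 0.8889, 'hapax_legomena': 7}
```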
```text
$ corpa readability prose.txt
corpa · prose.txt
┌──────────────────────┬───────┬─────────────┐
│ Metric               ┆ Score ┆ Grade       │
╞══════════════════════╪═══════╪═════════════╡
│ Flesch-Kincaid Grade ┆ 12.73 ┆ High School │
│ Flesch Reading Ease  ┆ 41.16 ┆ Difficult   │
│ Coleman-Liau Index   ┆ 13.82 ┆ College     │
│ Gunning Fog Index    ┆ 16.97 ┆ College     │
│ SMOG Index           ┆ 14.62 ┆ College     │
└──────────────────────┴───────┴─────────────┘
```
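As background on the first row: the Flesch-Kincaid Grade is 0.39 · (words/sentences) + 11.8 · (syllables/words) - 15.59. The sketch below uses a crude vowel-group syllable heuristic, unlike corpa's Unicode-aware counter, so its scores are only approximate:

```python
import re

def count_syllables(word):
    """Crude heuristic: one syllable per vowel group."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(words_per_sentence, syllables_per_word):
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59

text = "The quick brown fox jumps over the lazy dog."
words = re.findall(r"[A-Za-z]+", text)
syllables = sum(count_syllables(w) for w in words)
grade = flesch_kincaid_grade(len(words) / 1, syllables / len(words))  # 1 sentence
print(round(grade, 2))  # → 2.34
```

The other indices in the table combine the same word, sentence, syllable, and character counts with different fitted coefficients.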
```text
$ corpa tokens prose.txt --model all
corpa · prose.txt
┌──────────────┬────────┐
│ Tokenizer    ┆ Tokens │
╞══════════════╪════════╡
│ Whitespace   ┆ 126    │
│ Sentences    ┆ 6      │
│ Characters   ┆ 805    │
│ BPE (GPT-4)  ┆ 150    │
│ BPE (GPT-4o) ┆ 148    │
│ BPE (GPT-3)  ┆ 151    │
└──────────────┴────────┘
```
## Streaming

The `--stream` flag enables incremental processing of unbounded stdin, emitting cumulative results after each chunk. Chunk size is configurable with `--chunk-lines` (default: 1000).

```sh
cat huge_corpus.txt | corpa stats --stream --chunk-lines 500 --format json
```

Supported commands: `stats`, `ngrams`, `entropy`.

| Format | Behavior |
|---|---|
| `json` | JSON Lines — one object per chunk |
| `csv` | Header row once, data rows per chunk |
| `table` | Table per chunk with chunk number in title |
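The JSON Lines behavior above can be illustrated in a few lines: consume input in fixed-size line chunks and emit one cumulative record per chunk. This is a sketch of the contract, not corpa's implementation, and it runs over an in-memory list rather than stdin:

```python
import json
from collections import Counter
from itertools import islice

def stream_stats(lines, chunk_lines=1000):
    """Yield one cumulative JSON Lines record per chunk of input."""
    it = iter(lines)
    counts, tokens, chunk = Counter(), 0, 0
    while True:
        batch = list(islice(it, chunk_lines))
        if not batch:
            break
        for line in batch:
            words = line.split()
            counts.update(words)
            tokens += len(words)
        chunk += 1
        yield json.dumps({"chunk": chunk, "tokens": tokens, "types": len(counts)})

for record in stream_stats(["a b c", "a b", "d"], chunk_lines=2):
    print(record)
# {"chunk": 1, "tokens": 5, "types": 3}
# {"chunk": 2, "tokens": 6, "types": 4}
```

Because each record is cumulative, the last line always reflects the whole stream seen so far.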
## Global Flags

| Flag | Description |
|---|---|
| `--format <fmt>` | Output format: `table` (default), `json`, `csv` |
| `--recursive` | Process directories recursively |
| `--stream` | Process stdin incrementally, emitting results per chunk |
| `--chunk-lines <N>` | Lines per chunk in streaming mode (default: 1000) |
## Benchmarks

Benchmarked on a ~1 GB generated English text corpus (Apple M2, 8 cores):

| Command | corpa | Python | Speedup |
|---|---|---|---|
| Word count | 1.9s | 11.5s | 6x |
| Bigram frequency | 3.4s | 53.9s | 16x |
| Readability | 5.4s | 107.9s | 20x |
## Documentation

| Resource | Description |
|---|---|
| CLI Commands | Full command reference with options and examples |
| Streaming | Incremental stdin processing for large-scale analysis |
| Python API | PyO3 bindings — all commands as native Python functions |
| JavaScript API | WASM bindings for browser and Node.js environments |
## Roadmap

- **v0.1.0 — Core CLI**: `stats`, `ngrams`, `tokens`, JSON/CSV/table output, stdin and file input, recursive directories
- **v0.2.0 — Analysis**: `readability`, `entropy`, `zipf`, stopword filtering, case folding, parallel processing
- **v0.3.0 — Language Models**: `perplexity` with Laplace/Stupid Backoff, `lang` detection, BPE token counting
- **v0.4.0 — Ecosystem**: Python bindings (PyO3), WASM/npm package, streaming mode, shell completions
- **v0.5.0 — Robustness & Output Quality**: typed JSON output, improved sentence detection (abbreviations, ellipsis collapsing), input validation hardening, Unicode-aware syllable counting, streaming entropy rewrite
- **v0.6.0 — Bindings Parity**: Python/WASM parameter parity (`stopwords`, `min_freq`, `case_insensitive`), proper error propagation, Python `.pyi` type stubs and docstrings, WASM TypeScript definitions, fixed `package.json`, version synchronization
### Planned features

- Concordance / KWIC (keyword in context) search
- Diff mode for comparing two corpora — side-by-side statistics, vocabulary overlap, divergence metrics
- Collocation analysis (PMI, log-likelihood, chi-squared)
- Custom vocabulary and dictionary support
- Sentiment lexicon scoring (AFINN, VADER-style)
- Topic segmentation and keyword extraction (TF-IDF)
- Text complexity profiling — combined readability + entropy + vocabulary richness report
- Configurable sentence tokenizer (regex-based or rule-based)
- Language-specific stopword lists (bundled for top 10 languages)
- Plugin system for custom analysis modules
- Interactive TUI mode with live statistics
- Parallel streaming for multi-file batch processing
- Wasm streaming API for browser-based incremental analysis
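Of the planned analyses, collocation scoring by pointwise mutual information is simple to state: PMI(x, y) = log₂(p(x, y) / (p(x) p(y))), with high values marking word pairs that co-occur more often than their individual frequencies predict. A hypothetical sketch over adjacent token pairs (not a preview of corpa's eventual API):

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ) over adjacent token pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    return {
        pair: math.log2(
            (c / n_bi) / ((unigrams[pair[0]] / n_uni) * (unigrams[pair[1]] / n_uni))
        )
        for pair, c in bigrams.items()
    }

scores = bigram_pmi("new york new york and new jersey".split())
print(round(scores[("new", "york")], 3))
```

Raw PMI overweights rare pairs on small corpora, which is why the roadmap also lists log-likelihood and chi-squared as alternative association measures.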
## Contributing

Contributions are welcome. Please open an issue to discuss proposed changes before submitting a pull request.

```sh
cargo test    # Run test suite
cargo clippy  # Lint
cargo bench   # Run benchmarks
```

## License

This project is dual-licensed under MIT or Apache-2.0, at your option.
## Built With

- rayon — Data parallelism
- clap — CLI argument parsing
- comfy-table — Terminal table rendering
- whatlang — Language detection
- tiktoken-rs — BPE tokenization for GPT models
- PyO3 — Rust bindings for Python
- wasm-bindgen — Rust/WebAssembly interop