
corpa

Blazing-fast text analysis for the command line, Python, and the browser.

A unified tool for corpus-level NLP statistics — n-gram frequencies, readability scores, entropy analysis, language detection, BPE token counting, and more — written in Rust for performance, with bindings for Python and JavaScript/WASM.

Installation · Quick Start · Commands · Documentation · Contributing


Highlights

  • High performance — Parallel processing via rayon. Analyzes multi-GB corpora in seconds.
  • Composable — Unix-friendly design with structured output (JSON, CSV, table). Pipes seamlessly with jq, awk, and standard tooling.
  • Comprehensive — Eight analysis commands covering vocabulary statistics, n-gram frequencies, readability indices, Shannon entropy, Zipf's law, language model perplexity, language detection, and BPE tokenization.
  • Multi-platform — Available as a native CLI binary, a Python package via PyO3, and an npm/WASM module for browser and Node.js environments.
  • Streaming — Process unbounded stdin streams with incremental chunk-based output for stats, ngrams, and entropy.

Installation

CLI

cargo install corpa

Or build from source:

git clone https://github.com/Flurry13/corpa
cd corpa
cargo build --release

Python

pip install corpa

JavaScript / WASM

npm support coming soon.


Quick Start

CLI

corpa stats corpus.txt
corpa ngrams -n 2 --top 20 corpus.txt
corpa readability essay.txt
corpa entropy corpus.txt
corpa perplexity corpus.txt --smoothing laplace
corpa lang mystery.txt
corpa tokens corpus.txt --model gpt4
corpa zipf corpus.txt --top 10

All commands accept file paths, directories (with --recursive), or stdin. Output format is controlled with --format (table, json, csv).

Python

import corpa

corpa.stats(text="The quick brown fox jumps over the lazy dog.")
# {'tokens': 9, 'types': 8, 'sentences': 1, 'type_token_ratio': 0.8889, ...}

corpa.ngrams("corpus.txt", n=2, top=10)
# [{'ngram': 'of the', 'frequency': 4521, 'relative_pct': 2.09}, ...]

corpa.lang(text="Bonjour le monde")
# {'language': 'Français', 'code': 'fra', 'script': 'Latin', 'confidence': 0.99}

All functions accept a file path as the first argument or a text= keyword argument for direct string input.

JavaScript / WASM

npm/WASM support coming soon.


Commands

Command       Description
stats         Token, type, sentence counts, type-token ratio, hapax legomena, average sentence length
ngrams        N-gram frequency analysis with configurable N, top-K, minimum frequency, case folding, stopword filtering
tokens        Whitespace, sentence, and character tokenization; BPE token counts for GPT-3, GPT-4, and GPT-4o
readability   Flesch-Kincaid Grade, Flesch Reading Ease, Coleman-Liau Index, Gunning Fog Index, SMOG Index
entropy       Unigram, bigram, and trigram Shannon entropy; entropy rate; vocabulary redundancy
perplexity    N-gram language model perplexity with Laplace smoothing and Stupid Backoff
lang          Language and script detection with confidence scoring
zipf          Zipf's law rank-frequency distribution with exponent fitting and terminal sparkline plotting
completions   Shell completion generation for bash, zsh, and fish
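As an illustration of what entropy reports, unigram Shannon entropy is H = -Σ p(w) log2 p(w) over the empirical word distribution. A minimal plain-Python sketch (naive whitespace tokenization here; corpa's own tokenizer is more careful):

```python
import math
from collections import Counter

def unigram_entropy(text: str) -> float:
    """Shannon entropy in bits per token: H = -sum p(w) * log2 p(w)."""
    tokens = text.lower().split()  # naive whitespace tokenization
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(round(unigram_entropy("the cat sat on the mat"), 4))
```

Entropy is 0 when every token is identical and grows toward log2(vocabulary size) as the distribution flattens, which is why it pairs naturally with the redundancy figure entropy reports.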

Example Output

$ corpa stats prose.txt

  corpa · prose.txt
┌─────────────────────┬────────────┐
│ Metric              ┆      Value │
╞═════════════════════╪════════════╡
│ Tokens (words)      ┆        175 │
│ Types (unique)      ┆         95 │
│ Characters          ┆        805 │
│ Sentences           ┆          6 │
│ Type-Token Ratio    ┆     0.5429 │
│ Hapax Legomena      ┆ 70 (73.7%) │
│ Avg Sentence Length ┆ 29.2 words │
└─────────────────────┴────────────┘
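The metrics in this table derive directly from token and type counts; a plain-Python sketch of the same quantities (naive whitespace/period splitting here, unlike corpa's real tokenizer, which handles punctuation and abbreviations):

```python
from collections import Counter

def basic_stats(text: str) -> dict:
    """Corpus statistics in the spirit of `corpa stats` (simplified tokenization)."""
    words = text.lower().replace(".", "").split()           # naive word tokenization
    counts = Counter(words)
    sentences = [s for s in text.split(".") if s.strip()]   # naive sentence split
    hapax = sum(1 for c in counts.values() if c == 1)       # types occurring exactly once
    return {
        "tokens": len(words),
        "types": len(counts),
        "sentences": len(sentences),
        "type_token_ratio": round(len(counts) / len(words), 4),
        "hapax_legomena": hapax,
        "avg_sentence_length": round(len(words) / len(sentences), 1),
    }

print(basic_stats("The cat sat. The dog ran. A cat ran."))
```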
$ corpa readability prose.txt

  corpa · prose.txt
┌──────────────────────┬───────┬─────────────┐
│ Metric               ┆ Score ┆       Grade │
╞══════════════════════╪═══════╪═════════════╡
│ Flesch-Kincaid Grade ┆ 12.73 ┆ High School │
│ Flesch Reading Ease  ┆ 41.16 ┆   Difficult │
│ Coleman-Liau Index   ┆ 13.82 ┆     College │
│ Gunning Fog Index    ┆ 16.97 ┆     College │
│ SMOG Index           ┆ 14.62 ┆     College │
└──────────────────────┴───────┴─────────────┘
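These grade-level scores follow published formulas; Flesch-Kincaid Grade, for example, is 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59. A rough sketch with a naive vowel-group syllable counter (corpa's Unicode-aware syllable counting is more accurate):

```python
import re

def count_syllables(word: str) -> int:
    """Very rough syllable estimate: count vowel groups, minimum 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """Standard Flesch-Kincaid Grade Level formula."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return round(0.39 * len(words) / len(sentences)
                 + 11.8 * syllables / len(words) - 15.59, 2)
```

Because all of these indices reduce to sentence length and word length (in syllables or characters), the quality of sentence and syllable detection dominates their accuracy.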
$ corpa tokens prose.txt --model all

  corpa · prose.txt
┌──────────────┬────────┐
│ Tokenizer    ┆ Tokens │
╞══════════════╪════════╡
│ Whitespace   ┆    126 │
│ Sentences    ┆      6 │
│ Characters   ┆    805 │
│ BPE (GPT-4)  ┆    150 │
│ BPE (GPT-4o) ┆    148 │
│ BPE (GPT-3)  ┆    151 │
└──────────────┴────────┘
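Beyond counting, perplexity scores text under a smoothed n-gram model. A minimal Laplace-smoothed unigram version, as a sketch of the idea (corpa's implementation supports higher-order n-grams and Stupid Backoff):

```python
import math
from collections import Counter

def laplace_unigram_perplexity(train: str, test: str) -> float:
    """Perplexity of `test` under a unigram model with add-one (Laplace) smoothing."""
    counts = Counter(train.lower().split())
    n = sum(counts.values())
    vocab = len(counts) + 1                     # +1 slot for unseen words
    test_tokens = test.lower().split()
    log_prob = 0.0
    for w in test_tokens:
        p = (counts[w] + 1) / (n + vocab)       # add-one smoothed probability
        log_prob += math.log(p)
    return math.exp(-log_prob / len(test_tokens))
```

Lower perplexity means the test text is less surprising under the model; smoothing keeps unseen words from driving the score to infinity.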

Streaming

The --stream flag enables incremental processing of unbounded stdin, emitting cumulative results after each chunk. Chunk size is configurable with --chunk-lines (default: 1000).

cat huge_corpus.txt | corpa stats --stream --chunk-lines 500 --format json

Supported commands: stats, ngrams, entropy.

Format   Behavior
json     JSON Lines — one object per chunk
csv      Header row once, data rows per chunk
table    Table per chunk with chunk number in title
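Because json mode emits JSON Lines, downstream consumers can parse the stream one record at a time rather than buffering the whole output. A minimal Python consumer (the field names here are illustrative, not corpa's exact schema):

```python
import json

def read_chunks(stream):
    """Yield one parsed object per JSON Lines record, skipping blank lines."""
    for line in stream:
        if line.strip():
            yield json.loads(line)

# Illustrative records; real keys depend on the command being streamed.
sample = ['{"chunk": 1, "tokens": 500}\n', '{"chunk": 2, "tokens": 1000}\n']
totals = [c["tokens"] for c in read_chunks(sample)]
```

The same loop works unchanged over `sys.stdin` or a subprocess pipe, which is the point of the one-object-per-line contract.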

Global Options

Flag                Description
--format <fmt>      Output format: table (default), json, csv
--recursive         Process directories recursively
--stream            Process stdin incrementally, emitting results per chunk
--chunk-lines <N>   Lines per chunk in streaming mode (default: 1000)

Performance

Benchmarks on a 1GB English text corpus (Apple M2, 8 cores):

Command            corpa   Python    Speedup
Word count         1.9s    11.5s     6x
Bigram frequency   3.4s    53.9s     16x
Readability        5.4s    107.9s    20x



Documentation

Resource         Description
CLI Commands     Full command reference with options and examples
Streaming        Incremental stdin processing for large-scale analysis
Python API       PyO3 bindings — all commands as native Python functions
JavaScript API   WASM bindings for browser and Node.js environments

Roadmap

Completed

  • v0.1.0 — Core CLI: stats, ngrams, tokens, JSON/CSV/table output, stdin and file input, recursive directories
  • v0.2.0 — Analysis: readability, entropy, zipf, stopword filtering, case folding, parallel processing
  • v0.3.0 — Language Models: perplexity with Laplace/Stupid Backoff, lang detection, BPE token counting
  • v0.4.0 — Ecosystem: Python bindings (PyO3), WASM/npm package, streaming mode, shell completions
  • v0.5.0 — Robustness & Output Quality: Typed JSON output, improved sentence detection (abbreviations, ellipsis collapsing), input validation hardening, Unicode-aware syllable counting, streaming entropy rewrite
  • v0.6.0 — Bindings Parity: Python/WASM parameter parity (stopwords, min_freq, case_insensitive), proper error propagation, Python .pyi type stubs and docstrings, WASM TypeScript definitions, fixed package.json, version synchronization

Planned

v0.7.0 — Corpus Comparison & Search

  • Concordance / KWIC (keyword in context) search
  • Diff mode for comparing two corpora — side-by-side statistics, vocabulary overlap, divergence metrics
  • Collocation analysis (PMI, log-likelihood, chi-squared)
  • Custom vocabulary and dictionary support

v0.8.0 — Advanced Analysis

  • Sentiment lexicon scoring (AFINN, VADER-style)
  • Topic segmentation and keyword extraction (TF-IDF)
  • Text complexity profiling — combined readability + entropy + vocabulary richness report
  • Configurable sentence tokenizer (regex-based or rule-based)

Future

  • Language-specific stopword lists (bundled for top 10 languages)
  • Plugin system for custom analysis modules
  • Interactive TUI mode with live statistics
  • Parallel streaming for multi-file batch processing
  • Wasm streaming API for browser-based incremental analysis

Contributing

Contributions are welcome. Please open an issue to discuss proposed changes before submitting a pull request.

cargo test            # Run test suite
cargo clippy          # Lint
cargo bench           # Run benchmarks

License

This project is dual-licensed under MIT or Apache-2.0, at your option.

