
corpa

Blazing-fast text analysis for the command line, Python, and the browser.

A unified tool for corpus-level NLP statistics — n-gram frequencies, readability scores, entropy analysis, language detection, BPE token counting, and more — written in Rust for performance, with bindings for Python and JavaScript/WASM.

Installation · Quick Start · Commands · Documentation · Contributing


Highlights

  • High performance — Parallel processing via rayon. Analyzes multi-GB corpora in seconds.
  • Composable — Unix-friendly design with structured output (JSON, CSV, table). Pipes seamlessly with jq, awk, and standard tooling.
  • Comprehensive — Eight analysis commands covering vocabulary statistics, n-gram frequencies, readability indices, Shannon entropy, Zipf's law, language model perplexity, language detection, and BPE tokenization.
  • Multi-platform — Available as a native CLI binary, a Python package via PyO3, and an npm/WASM module for browser and Node.js environments.
  • Streaming — Process unbounded stdin streams with incremental chunk-based output for stats, ngrams, and entropy.

Installation

CLI

cargo install corpa

Or build from source:

git clone https://github.com/Flurry13/corpa
cd corpa
cargo build --release

Python

pip install corpa

JavaScript / WASM

npm support coming soon.


Quick Start

CLI

corpa stats corpus.txt
corpa ngrams -n 2 --top 20 corpus.txt
corpa readability essay.txt
corpa entropy corpus.txt
corpa perplexity corpus.txt --smoothing laplace
corpa lang mystery.txt
corpa tokens corpus.txt --model gpt4
corpa zipf corpus.txt --top 10

All commands accept file paths, directories (with --recursive), or stdin. Output format is controlled with --format (table, json, csv).

Python

import corpa

corpa.stats(text="The quick brown fox jumps over the lazy dog.")
# {'tokens': 9, 'types': 8, 'sentences': 1, 'type_token_ratio': 0.8889, ...}

corpa.ngrams("corpus.txt", n=2, top=10)
# [{'ngram': 'of the', 'frequency': 4521, 'relative_pct': 2.09}, ...]

corpa.lang(text="Bonjour le monde")
# {'language': 'Français', 'code': 'fra', 'script': 'Latin', 'confidence': 0.99}

All functions accept a file path as the first argument or a text= keyword argument for direct string input.

JavaScript / WASM

npm/WASM support coming soon.


Commands

Command       Description
stats         Token, type, sentence counts, type-token ratio, hapax legomena, average sentence length
ngrams        N-gram frequency analysis with configurable N, top-K, minimum frequency, case folding, stopword filtering
tokens        Whitespace, sentence, and character tokenization; BPE token counts for GPT-3, GPT-4, and GPT-4o
readability   Flesch-Kincaid Grade, Flesch Reading Ease, Coleman-Liau Index, Gunning Fog Index, SMOG Index
entropy       Unigram, bigram, and trigram Shannon entropy; entropy rate; vocabulary redundancy
perplexity    N-gram language model perplexity with Laplace smoothing and Stupid Backoff
lang          Language and script detection with confidence scoring
zipf          Zipf's law rank-frequency distribution with exponent fitting and terminal sparkline plotting
completions   Shell completion generation for bash, zsh, and fish
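As an illustration of what entropy reports, unigram Shannon entropy is H = -Σ p(w) log2 p(w) over the empirical word distribution. A minimal plain-Python sketch (naive whitespace tokenization here; corpa's own tokenizer is more careful):

```python
import math
from collections import Counter

def unigram_entropy(text: str) -> float:
    """Shannon entropy in bits per token: H = -sum p(w) * log2 p(w)."""
    tokens = text.lower().split()  # naive whitespace tokenization
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(round(unigram_entropy("the cat sat on the mat"), 4))
```

Entropy is 0 when every token is identical and grows toward log2(vocabulary size) as the distribution flattens, which is why it pairs naturally with the redundancy figure entropy reports.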

Example Output

$ corpa stats prose.txt

  corpa · prose.txt
┌─────────────────────┬────────────┐
│ Metric              ┆      Value │
╞═════════════════════╪════════════╡
│ Tokens (words)      ┆        175 │
│ Types (unique)      ┆         95 │
│ Characters          ┆        805 │
│ Sentences           ┆          6 │
│ Type-Token Ratio    ┆     0.5429 │
│ Hapax Legomena      ┆ 70 (73.7%) │
│ Avg Sentence Length ┆ 29.2 words │
└─────────────────────┴────────────┘
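The metrics in this table derive directly from token and type counts; a plain-Python sketch of the same quantities (naive whitespace/period splitting here, unlike corpa's real tokenizer, which handles punctuation and abbreviations):

```python
from collections import Counter

def basic_stats(text: str) -> dict:
    """Corpus statistics in the spirit of `corpa stats` (simplified tokenization)."""
    words = text.lower().replace(".", "").split()           # naive word tokenization
    counts = Counter(words)
    sentences = [s for s in text.split(".") if s.strip()]   # naive sentence split
    hapax = sum(1 for c in counts.values() if c == 1)       # types occurring exactly once
    return {
        "tokens": len(words),
        "types": len(counts),
        "sentences": len(sentences),
        "type_token_ratio": round(len(counts) / len(words), 4),
        "hapax_legomena": hapax,
        "avg_sentence_length": round(len(words) / len(sentences), 1),
    }

print(basic_stats("The cat sat. The dog ran. A cat ran."))
```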
$ corpa readability prose.txt

  corpa · prose.txt
┌──────────────────────┬───────┬─────────────┐
│ Metric               ┆ Score ┆       Grade │
╞══════════════════════╪═══════╪═════════════╡
│ Flesch-Kincaid Grade ┆ 12.73 ┆ High School │
│ Flesch Reading Ease  ┆ 41.16 ┆   Difficult │
│ Coleman-Liau Index   ┆ 13.82 ┆     College │
│ Gunning Fog Index    ┆ 16.97 ┆     College │
│ SMOG Index           ┆ 14.62 ┆     College │
└──────────────────────┴───────┴─────────────┘
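These grade-level scores follow published formulas; Flesch-Kincaid Grade, for example, is 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59. A rough sketch with a naive vowel-group syllable counter (corpa's Unicode-aware syllable counting is more accurate):

```python
import re

def count_syllables(word: str) -> int:
    """Very rough syllable estimate: count vowel groups, minimum 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    """Standard Flesch-Kincaid Grade Level formula."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return round(0.39 * len(words) / len(sentences)
                 + 11.8 * syllables / len(words) - 15.59, 2)
```

Because all of these indices reduce to sentence length and word length (in syllables or characters), the quality of sentence and syllable detection dominates their accuracy.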
$ corpa tokens prose.txt --model all

  corpa · prose.txt
┌──────────────┬────────┐
│ Tokenizer    ┆ Tokens │
╞══════════════╪════════╡
│ Whitespace   ┆    126 │
│ Sentences    ┆      6 │
│ Characters   ┆    805 │
│ BPE (GPT-4)  ┆    150 │
│ BPE (GPT-4o) ┆    148 │
│ BPE (GPT-3)  ┆    151 │
└──────────────┴────────┘
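Beyond counting, perplexity scores text under a smoothed n-gram model. A minimal Laplace-smoothed unigram version, as a sketch of the idea (corpa's implementation supports higher-order n-grams and Stupid Backoff):

```python
import math
from collections import Counter

def laplace_unigram_perplexity(train: str, test: str) -> float:
    """Perplexity of `test` under a unigram model with add-one (Laplace) smoothing."""
    counts = Counter(train.lower().split())
    n = sum(counts.values())
    vocab = len(counts) + 1                     # +1 slot for unseen words
    test_tokens = test.lower().split()
    log_prob = 0.0
    for w in test_tokens:
        p = (counts[w] + 1) / (n + vocab)       # add-one smoothed probability
        log_prob += math.log(p)
    return math.exp(-log_prob / len(test_tokens))
```

Lower perplexity means the test text is less surprising under the model; smoothing keeps unseen words from driving the score to infinity.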

Streaming

The --stream flag enables incremental processing of unbounded stdin, emitting cumulative results after each chunk. Chunk size is configurable with --chunk-lines (default: 1000).

cat huge_corpus.txt | corpa stats --stream --chunk-lines 500 --format json

Supported commands: stats, ngrams, entropy.

Format   Behavior
json     JSON Lines — one object per chunk
csv      Header row once, data rows per chunk
table    Table per chunk with chunk number in title
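Because json mode emits JSON Lines, downstream consumers can parse the stream one record at a time rather than buffering the whole output. A minimal Python consumer (the field names here are illustrative, not corpa's exact schema):

```python
import json

def read_chunks(stream):
    """Yield one parsed object per JSON Lines record, skipping blank lines."""
    for line in stream:
        if line.strip():
            yield json.loads(line)

# Illustrative records; real keys depend on the command being streamed.
sample = ['{"chunk": 1, "tokens": 500}\n', '{"chunk": 2, "tokens": 1000}\n']
totals = [c["tokens"] for c in read_chunks(sample)]
```

The same loop works unchanged over `sys.stdin` or a subprocess pipe, which is the point of the one-object-per-line contract.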

Global Options

Flag                Description
--format <fmt>      Output format: table (default), json, csv
--recursive         Process directories recursively
--stream            Process stdin incrementally, emitting results per chunk
--chunk-lines <N>   Lines per chunk in streaming mode (default: 1000)

Performance

Benchmarks on a 1GB English text corpus (Apple M2, 8 cores):

Command            corpa   Python    Speedup
Word count         1.9s    11.5s     6x
Bigram frequency   3.4s    53.9s     16x
Readability        5.4s    107.9s    20x



Documentation

Resource         Description
CLI Commands     Full command reference with options and examples
Streaming        Incremental stdin processing for large-scale analysis
Python API       PyO3 bindings — all commands as native Python functions
JavaScript API   WASM bindings for browser and Node.js environments

Roadmap

Completed

  • v0.1.0 — Core CLI: stats, ngrams, tokens, JSON/CSV/table output, stdin and file input, recursive directories
  • v0.2.0 — Analysis: readability, entropy, zipf, stopword filtering, case folding, parallel processing
  • v0.3.0 — Language Models: perplexity with Laplace/Stupid Backoff, lang detection, BPE token counting
  • v0.4.0 — Ecosystem: Python bindings (PyO3), WASM/npm package, streaming mode, shell completions
  • v0.5.0 — Robustness & Output Quality: Typed JSON output, improved sentence detection (abbreviations, ellipsis collapsing), input validation hardening, Unicode-aware syllable counting, streaming entropy rewrite
  • v0.6.0 — Bindings Parity: Python/WASM parameter parity (stopwords, min_freq, case_insensitive), proper error propagation, Python .pyi type stubs and docstrings, WASM TypeScript definitions, fixed package.json, version synchronization

Planned

v0.7.0 — Corpus Comparison & Search

  • Concordance / KWIC (keyword in context) search
  • Diff mode for comparing two corpora — side-by-side statistics, vocabulary overlap, divergence metrics
  • Collocation analysis (PMI, log-likelihood, chi-squared)
  • Custom vocabulary and dictionary support

v0.8.0 — Advanced Analysis

  • Sentiment lexicon scoring (AFINN, VADER-style)
  • Topic segmentation and keyword extraction (TF-IDF)
  • Text complexity profiling — combined readability + entropy + vocabulary richness report
  • Configurable sentence tokenizer (regex-based or rule-based)

Future

  • Language-specific stopword lists (bundled for top 10 languages)
  • Plugin system for custom analysis modules
  • Interactive TUI mode with live statistics
  • Parallel streaming for multi-file batch processing
  • Wasm streaming API for browser-based incremental analysis

Contributing

Contributions are welcome. Please open an issue to discuss proposed changes before submitting a pull request.

cargo test            # Run test suite
cargo clippy          # Lint
cargo bench           # Run benchmarks

License

This project is dual-licensed under MIT or Apache-2.0, at your option.

