MathTok

A Hybrid Canonicalized AST-Based Tokenization Framework for Mathematical Language Modeling

Overview

MathTok is a research-grade tokenizer pipeline that converts raw mathematical expressions (LaTeX or ASCII) into a structured, semantically rich token stream. Unlike standard BPE or SentencePiece tokenizers, MathTok is structure-aware: it builds an Abstract Syntax Tree (AST) from each expression and serializes it with a depth-first (DFS) preorder traversal, so the operator hierarchy of the expression is preserved in the token order.

Raw Mathematical Expression
          ↓
Canonicalization Layer       (SymPy: simplify, expand, normalize)
          ↓
Hybrid Mathematical Lexer    (split TEXT / MATH spans)
          ↓
AST Generator                (SymPy tree → typed ASTNode tree)
          ↓
Operator-Aware Semantic Encoder  (rich metadata per operator)
          ↓
Structural Serialization     (DFS preorder → flat token stream)
          ↓
Structural Attention Metadata (per-token tree context)
          ↓
Vocabulary Mapping + BPE     (fixed math vocab + HF BPE for text)
          ↓
Compressed Token Stream

Quick Start

# Install dependencies and package in editable mode
pip install -e ".[eval,dev]"

# Tokenize an expression using the CLI pipeline
python -m mathtok.pipeline "The derivative of sin(x^2) + 3x"

# Run the full test suite (110+ tests)
pytest tests/ -v

# Run the 4-way comparative tokenizer evaluation benchmark
# (MathTok vs GPT-2 BPE vs SentencePiece Unigram vs Char-level)
python -m evaluation.comparison

# Generate visual plots and the unified metrics dashboard
python -m evaluation.visualize

Python API

from mathtok import MathTokPipeline

pipeline = MathTokPipeline()

# Encode mixed text + math (supporting LaTeX or ASCII syntax)
out = pipeline.encode("The derivative of $\\sin(x^2)$ is $2x\\cos(x^2)$.")
print(out.tokens)      # ['[MATH_START]', 'FUNC_SIN', 'OP_POW', 'VAR_X', 'CONST_2', '[MATH_END]', ...]
print(out.sexp)        # (FUNC_SIN (OP_POW VAR_X CONST_2))
print(out.input_ids)   # [4, 27, 10, 45, 12, 5, ...]

# Access structural metadata (for tree-aware attention masking)
for meta in out.metadata:
    print(meta.token, meta.depth, meta.tree_position_key)

# Pure math expression serialization
out = pipeline.encode_math_only("(x+1)^2")
print(out.sexp)        # (OP_POW (OP_ADD VAR_X CONST_1) CONST_2)

# HuggingFace-compatible tokenizer export
hf_tok = pipeline.get_hf_tokenizer()
hf_tok.save_pretrained("./mathtok-tokenizer")
result = hf_tok("x^2 + 2*x + 1", return_tensors="pt")

Research Contributions

1. Hybrid Lexer

Separates natural language from mathematical content using LaTeX delimiter detection ($...$, $$...$$, \[...\]) and ASCII math heuristics.
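The delimiter-detection half of this step can be illustrated with a minimal sketch. It assumes the delimiters are `$...$`, `$$...$$`, and `\[...\]`; the `MATH_SPAN` pattern and `split_spans` function are hypothetical names for illustration, not MathTok's actual lexer, and the ASCII math heuristics are not covered.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: $$...$$ is tried before $...$ so that display math
# is not mistaken for two adjacent inline spans.
MATH_SPAN = re.compile(r"\$\$(.+?)\$\$|\\\[(.+?)\\\]|\$(.+?)\$", re.DOTALL)


def split_spans(text: str) -> List[Tuple[str, str]]:
    """Return (kind, content) pairs, where kind is 'TEXT' or 'MATH'."""
    spans: List[Tuple[str, str]] = []
    cursor = 0
    for match in MATH_SPAN.finditer(text):
        if match.start() > cursor:
            spans.append(("TEXT", text[cursor:match.start()]))
        # Exactly one of the three capture groups is set for any match.
        body = next(g for g in match.groups() if g is not None)
        spans.append(("MATH", body.strip()))
        cursor = match.end()
    if cursor < len(text):
        spans.append(("TEXT", text[cursor:]))
    return spans


for kind, content in split_spans("The derivative of $\\sin(x^2)$ is $2x\\cos(x^2)$."):
    print(kind, repr(content))
```

A regex split like this does not handle escaped dollar signs or nested delimiters; a production lexer would need a small state machine for those cases.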

2. Canonicalization Engine

Normalizes mathematically equivalent expressions to a common form using SymPy's simplify() and expand() together with SymPy's internal representation, in which subtraction becomes addition of a negated term and division becomes multiplication by a reciprocal.
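To make the two rewrites concrete, here is a toy sketch on nested tuples. It does not call SymPy and is not MathTok's canonicalizer; the tuple encoding, the `canonicalize` name, and the sort-by-`repr` ordering for commutative operators are all assumptions chosen for illustration.

```python
from typing import Tuple, Union

# An expression is a leaf (symbol name or integer) or (op, arg, arg, ...).
Expr = Union[str, int, Tuple]

COMMUTATIVE = {"add", "mul"}


def canonicalize(expr: Expr) -> Expr:
    """Rewrite sub/div away and put commutative operands in a fixed order."""
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [canonicalize(a) for a in args]
    if op == "sub":  # a - b  ->  a + (-1 * b)
        return canonicalize(("add", args[0], ("mul", -1, args[1])))
    if op == "div":  # a / b  ->  a * b**-1
        return canonicalize(("mul", args[0], ("pow", args[1], -1)))
    if op in COMMUTATIVE:
        # Flatten nested applications of the same operator, then sort by
        # repr so that operand order no longer depends on the input order.
        flat = []
        for a in args:
            if isinstance(a, tuple) and a[0] == op:
                flat.extend(a[1:])
            else:
                flat.append(a)
        args = sorted(flat, key=repr)
    return (op, *args)


# Two spellings of x - y end up as the same tree.
print(canonicalize(("sub", "x", "y")) == canonicalize(("add", ("mul", -1, "y"), "x")))
```

The point of the design is that downstream stages only ever see addition, multiplication, and powers, so equivalent inputs that differ in surface form produce the same token stream.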

3. AST-Based Structural Serialization

Maps SymPy's expression tree to a typed token vocabulary with semantic metadata per operator. Serializes via DFS preorder traversal.
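The serialization step can be sketched on a hand-built tree. The nested-tuple `Node` type and the function names below are assumptions for illustration; the token names and the S-expression for `(x+1)^2` are taken from the Python API example in this README.

```python
from typing import List, Tuple, Union

# A node is a leaf token, or (operator_token, child, child, ...).
Node = Union[str, Tuple]


def preorder_tokens(node: Node) -> List[str]:
    """Emit the node's own token first, then each subtree left to right."""
    if isinstance(node, str):
        return [node]
    head, *children = node
    tokens = [head]
    for child in children:
        tokens.extend(preorder_tokens(child))
    return tokens


def to_sexp(node: Node) -> str:
    """Render the same traversal with explicit parentheses."""
    if isinstance(node, str):
        return node
    head, *children = node
    return "(" + " ".join([head] + [to_sexp(c) for c in children]) + ")"


tree = ("OP_POW", ("OP_ADD", "VAR_X", "CONST_1"), "CONST_2")
print(preorder_tokens(tree))  # ['OP_POW', 'OP_ADD', 'VAR_X', 'CONST_1', 'CONST_2']
print(to_sexp(tree))          # (OP_POW (OP_ADD VAR_X CONST_1) CONST_2)
```

A flat preorder stream without parentheses can be decoded back into a tree only when the number of children of each operator is known, which is one reason to keep arity alongside each operator token.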

4. Operator Semantic Registry

Every operator and function carries an explicit metadata record: arity, precedence, associativity, semantic_role. This is the primary novelty over standard tokenization.
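One possible shape for such a record is sketched below. Only the four field names come from the description above; the `OperatorInfo` class, the `REGISTRY` dictionary, and every concrete value in it (precedence numbers, role labels) are hypothetical and may differ from what `operator_registry.py` contains.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class OperatorInfo:
    arity: int            # number of operands
    precedence: int       # larger binds tighter (illustrative scale)
    associativity: str    # "left", "right", or "none"
    semantic_role: str    # coarse label for what the operator does


# Hypothetical entries; the values are placeholders, not MathTok's registry.
REGISTRY: Dict[str, OperatorInfo] = {
    "OP_ADD": OperatorInfo(2, 10, "left", "additive"),
    "OP_MUL": OperatorInfo(2, 20, "left", "multiplicative"),
    "OP_POW": OperatorInfo(2, 30, "right", "exponentiation"),
    "FUNC_SIN": OperatorInfo(1, 40, "none", "trigonometric"),
}


def binds_tighter(a: str, b: str) -> bool:
    """True if operator token `a` has higher precedence than `b`."""
    return REGISTRY[a].precedence > REGISTRY[b].precedence


print(REGISTRY["OP_POW"])
print(binds_tighter("OP_MUL", "OP_ADD"))  # True
```

Freezing the dataclass keeps registry entries immutable, which suits a lookup table that every stage of the pipeline shares.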

5. Structural Attention Metadata

Per-token records encode depth, parent_id, children_ids, tree_position_key, and sibling_count, which makes future structure-aware attention possible.
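The sketch below shows how such records could be produced during the same preorder walk that emits the tokens. The field names follow the list above, but the `TokenMeta` class, the dotted `tree_position_key` format, and the choice to count only the other children of the same parent in `sibling_count` are assumptions, not MathTok's definitions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

Node = Union[str, Tuple]  # leaf token, or (operator_token, child, ...)


@dataclass
class TokenMeta:
    token: str
    depth: int
    parent_id: Optional[int]
    children_ids: List[int] = field(default_factory=list)
    tree_position_key: str = "0"   # assumed format: path of child indices
    sibling_count: int = 0         # assumed: other children of the same parent


def build_metadata(tree: Node) -> List[TokenMeta]:
    """Return one record per token, in DFS preorder (index == token position)."""
    records: List[TokenMeta] = []

    def visit(node: Node, depth: int, parent_id: Optional[int], key: str, siblings: int) -> None:
        head, children = (node, ()) if isinstance(node, str) else (node[0], node[1:])
        my_id = len(records)
        records.append(TokenMeta(head, depth, parent_id, [], key, siblings))
        if parent_id is not None:
            records[parent_id].children_ids.append(my_id)
        for i, child in enumerate(children):
            visit(child, depth + 1, my_id, f"{key}.{i}", len(children) - 1)

    visit(tree, 0, None, "0", 0)
    return records


for meta in build_metadata(("OP_POW", ("OP_ADD", "VAR_X", "CONST_1"), "CONST_2")):
    print(meta.token, meta.depth, meta.parent_id, meta.tree_position_key)
```

Because record indices coincide with token positions, `parent_id` and `children_ids` can be used directly to build an attention mask over the flat token stream.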

6. Two-Tier Vocabulary

Fixed math vocabulary: deterministic IDs for all operators, functions, variables, constants.
BPE text vocabulary: HuggingFace tokenizers BPE for natural language spans.
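A rough sketch of how the two tiers could share one ID space follows. Everything here is an assumption for illustration: the `TwoTierVocab` class, the token lists, the offsetting scheme, and the resulting IDs are not MathTok's, and `byte_fallback` is a stand-in for a trained BPE encoder rather than a call into the HuggingFace tokenizers library.

```python
from typing import Callable, Dict, List

SPECIAL = ["[PAD]", "[UNK]", "[MATH_START]", "[MATH_END]"]
MATH_TOKENS = ["OP_ADD", "OP_MUL", "OP_POW", "FUNC_SIN", "VAR_X", "CONST_1", "CONST_2"]


def build_math_vocab() -> Dict[str, int]:
    """Sorting before enumeration makes the IDs identical on every run."""
    return {tok: i for i, tok in enumerate(SPECIAL + sorted(MATH_TOKENS))}


def byte_fallback(text: str) -> List[int]:
    """Stand-in for a BPE encoder: one ID per UTF-8 byte."""
    return list(text.encode("utf-8"))


class TwoTierVocab:
    def __init__(self, text_encode: Callable[[str], List[int]]):
        self.math = build_math_vocab()
        self.offset = len(self.math)  # text IDs start after the math block
        self.text_encode = text_encode

    def encode_math(self, tokens: List[str]) -> List[int]:
        unk = self.math["[UNK]"]
        return [self.math.get(tok, unk) for tok in tokens]

    def encode_text(self, text: str) -> List[int]:
        return [self.offset + i for i in self.text_encode(text)]


vocab = TwoTierVocab(byte_fallback)
print(vocab.encode_math(["OP_POW", "VAR_X", "CONST_2"]))
print(vocab.encode_text("hi"))
```

Reserving the low ID range for math tokens means the text tokenizer can be retrained or swapped without shifting any math token ID.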

Evaluation Metrics & Benchmarks

Core Metrics

Metric	Symbol	Formula	Meaning
Semantic Compression Ratio	SCR	`structural_score / token_count`	Density of parsed semantic content per token (higher is better)
Semantic Density	SD	`math_tokens / total_tokens`	Share of tokens that carry mathematical content
Structural Efficiency	SE	`parent_child_relations / token_count`	Parent-child (hierarchy) relations encoded per token
Token Stability	TS	`1 - CoV(token count across rewritings)`	Consistency of token count across equivalent rewritings of an expression; CoV is the coefficient of variation
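The four formulas in the table translate directly into code. The sketch below is not the contents of `evaluation/metrics.py`; the function names are hypothetical, how `structural_score` is obtained is left to the caller, and using the population standard deviation for the coefficient of variation is an assumption.

```python
from statistics import mean, pstdev
from typing import Sequence


def semantic_compression_ratio(structural_score: float, token_count: int) -> float:
    """SCR = structural_score / token_count."""
    return structural_score / token_count


def semantic_density(math_tokens: int, total_tokens: int) -> float:
    """SD = math_tokens / total_tokens."""
    return math_tokens / total_tokens


def structural_efficiency(parent_child_relations: int, token_count: int) -> float:
    """SE = parent_child_relations / token_count."""
    return parent_child_relations / token_count


def token_stability(token_counts: Sequence[int]) -> float:
    """TS = 1 - CoV, where CoV = standard deviation / mean of the token
    counts observed for equivalent rewritings of one expression."""
    return 1.0 - pstdev(token_counts) / mean(token_counts)


# A 5-token tree has 4 parent-child edges.
print(structural_efficiency(4, 5))
# Identical token counts across rewritings give perfect stability.
print(token_stability([5, 5, 5]))
```

All four functions divide by a count, so callers should guard against empty token streams before calling them.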

Empirical Benchmarks (4-Way Comparison)

Below are the empirical averages computed over our comprehensive suite of 70 mathematical test expressions:

Tokenizer	Mean SCR (↑ Better)	Semantic Density (↑ Better)	Structural Efficiency (↑ Better)
MathTok (Ours)	0.8501	0.5285	0.2339
GPT-2 BPE	0.4251	0.1838	0.1491
SentencePiece Unigram	0.3696	0.1499	0.1403
Character-Level	0.3708	0.1518	0.1518

Note

MathTok's mean SCR is 2.30x that of SentencePiece Unigram (0.8501 vs 0.3696).
MathTok's semantic density is about 3.53x that of SentencePiece Unigram (0.5285 vs 0.1499).
MathTok's structural efficiency is 1.67x that of SentencePiece Unigram (0.2339 vs 0.1403), meaning more hierarchical AST relationships are encoded per token.
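The ratios in this note can be reproduced from the benchmark table. The snippet below only re-derives them from the rounded table values; the `RESULTS` dictionary and `ratio` helper are illustrative names, and small differences from a quoted figure can arise if the original ratio was computed from unrounded data.

```python
# Values copied from the 4-way comparison table: (SCR, SD, SE).
RESULTS = {
    "MathTok": (0.8501, 0.5285, 0.2339),
    "GPT-2 BPE": (0.4251, 0.1838, 0.1491),
    "SentencePiece Unigram": (0.3696, 0.1499, 0.1403),
    "Character-Level": (0.3708, 0.1518, 0.1518),
}
METRICS = ("SCR", "SD", "SE")


def ratio(metric: str, baseline: str, system: str = "MathTok") -> float:
    """How many times larger the system's score is than the baseline's."""
    idx = METRICS.index(metric)
    return RESULTS[system][idx] / RESULTS[baseline][idx]


for metric in METRICS:
    print(metric, round(ratio(metric, "SentencePiece Unigram"), 2))
```

Passing a different `baseline` gives the corresponding ratios against GPT-2 BPE or the character-level tokenizer.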

High-Impact Visualizations

Running python -m evaluation.visualize generates the following plots and writes them to evaluation/results/:

Unified Evaluation Dashboard (metrics_dashboard.png): 3-panel side-by-side display of SCR, Semantic Density, and Structural Efficiency.
Overall SCR Comparison (scr_comparison.png): Comparative summary bar chart.
Category-Level Breakdowns (scr_by_category.png): SCR analyzed by nested/standard categories.
Semantic Density Summary (semantic_density_comparison.png): Ratio of math structure to total tokens.

Project Structure

math_token/
├── mathtok/
│   ├── canonicalizer.py      # Layer 1: Canonicalization Engine
│   ├── lexer.py              # Layer 2: Hybrid Mathematical Lexer
│   ├── ast_generator.py      # Layer 3: AST Generator
│   ├── operator_registry.py  # Layer 4: Operator Semantic Registry
│   ├── serializer.py         # Layer 5: Structural Traversal & Serialization
│   ├── metadata.py           # Layer 6: Structural Attention Metadata
│   ├── vocabulary.py         # Layer 7: Two-Tier Vocabulary
│   └── pipeline.py           # Orchestrator Pipeline
├── evaluation/
│   ├── metrics.py            # Definition of core evaluation metrics
│   ├── benchmark.py          # Quick benchmarking scripts
│   ├── comparison.py         # Full 4-way comparative framework (SentencePiece integrated)
│   ├── visualize.py          # Custom dashboard visualization engine
│   └── results/              # JSON/JSONL reports & visual plots
└── tests/                    # 110+ passing unit tests

Citation

@article{mathtok2024,
  title   = {MathTok: A Hybrid Canonicalized AST-Based Tokenization Framework
             for Mathematical Language Modeling},
  author  = {Anonymous},
  year    = {2024},
  note    = {Under review}
}
