A Hybrid Canonicalized AST-Based Tokenization Framework for Mathematical Language Modeling
MathTok is a research-grade tokenizer pipeline that converts raw mathematical expressions (LaTeX or ASCII) into a structured, semantically rich token stream. Unlike standard BPE or SentencePiece tokenizers, MathTok is structure-aware: it builds an Abstract Syntax Tree (AST) from each expression and serializes it via DFS preorder traversal, so the operator hierarchy of the expression is preserved in the token stream.
Raw Mathematical Expression
↓
Canonicalization Layer (SymPy: simplify, expand, normalize)
↓
Hybrid Mathematical Lexer (split TEXT / MATH spans)
↓
AST Generator (SymPy tree → typed ASTNode tree)
↓
Operator-Aware Semantic Encoder (rich metadata per operator)
↓
Structural Serialization (DFS preorder → flat token stream)
↓
Structural Attention Metadata (per-token tree context)
↓
Vocabulary Mapping + BPE (fixed math vocab + HF BPE for text)
↓
Compressed Token Stream
# Install dependencies and package in editable mode
pip install -e ".[eval,dev]"
# Tokenize an expression using the CLI pipeline
python -m mathtok.pipeline "The derivative of sin(x^2) + 3x"
# Run the test suite (110+ tests)
pytest tests/ -v
# Run the 4-way comparative tokenizer evaluation benchmark
# (MathTok vs GPT-2 BPE vs SentencePiece Unigram vs Char-level)
python -m evaluation.comparison
# Generate visual plots and the unified metrics dashboard
python -m evaluation.visualize

from mathtok import MathTokPipeline
pipeline = MathTokPipeline()
# Encode mixed text + math (supporting LaTeX or ASCII syntax)
out = pipeline.encode("The derivative of $\\sin(x^2)$ is $2x\\cos(x^2)$.")
print(out.tokens) # ['[MATH_START]', 'FUNC_SIN', 'OP_POW', 'VAR_X', 'CONST_2', '[MATH_END]', ...]
print(out.sexp) # (FUNC_SIN (OP_POW VAR_X CONST_2))
print(out.input_ids) # [4, 27, 10, 45, 12, 5, ...]
# Access structural metadata (for tree-aware attention masking)
for meta in out.metadata:
    print(meta.token, meta.depth, meta.tree_position_key)
# Pure math expression serialization
out = pipeline.encode_math_only("(x+1)^2")
print(out.sexp) # (OP_POW (OP_ADD VAR_X CONST_1) CONST_2)
# HuggingFace-compatible tokenizer export
hf_tok = pipeline.get_hf_tokenizer()
hf_tok.save_pretrained("./mathtok-tokenizer")
result = hf_tok("x^2 + 2*x + 1", return_tensors="pt")

The Hybrid Mathematical Lexer separates natural language from mathematical content using LaTeX delimiter detection ($...$, \(...\), \[...\]) and ASCII math heuristics.
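The delimiter-detection part of the lexer can be illustrated with a small regex-based splitter. This is a minimal sketch, not MathTok's lexer: the names `MATH_SPAN` and `split_spans` are hypothetical, and the ASCII math heuristics mentioned above are left out entirely.

```python
import re

# Hypothetical span splitter (not part of MathTok). It recognizes only the
# three LaTeX delimiter styles; ASCII-math heuristics are out of scope.
MATH_SPAN = re.compile(
    r"\$(?P<dollar>[^$]+)\$"        # $...$
    r"|\\\((?P<paren>.+?)\\\)"      # \(...\)
    r"|\\\[(?P<bracket>.+?)\\\]",   # \[...\]
    re.DOTALL,
)

def split_spans(source: str) -> list[tuple[str, str]]:
    """Return (kind, content) pairs where kind is 'TEXT' or 'MATH'."""
    spans = []
    cursor = 0
    for match in MATH_SPAN.finditer(source):
        if match.start() > cursor:
            # Everything between two math spans is natural language.
            spans.append(("TEXT", source[cursor:match.start()]))
        # lastgroup names whichever delimiter alternative matched.
        spans.append(("MATH", match.group(match.lastgroup).strip()))
        cursor = match.end()
    if cursor < len(source):
        spans.append(("TEXT", source[cursor:]))
    return spans

spans = split_spans(r"The derivative of $\sin(x^2)$ is \(2x\cos(x^2)\).")
for kind, content in spans:
    print(kind, repr(content))
```

Each MATH span would then go through canonicalization and AST generation, while TEXT spans would be handed to the BPE tier.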
The Canonicalization Layer normalizes mathematically equivalent expressions via SymPy's simplify(), expand(), and internal representation (subtraction → addition + negation, division → multiplication + reciprocal).
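The two rewrites named above (subtraction as addition of a negated term, division as multiplication by a reciprocal) can be shown on a toy tree without SymPy. The sketch below is an assumption-laden illustration: the tuple encoding, the `SUB`/`DIV`/`ADD`/`MUL`/`POW` labels, and the operand ordering rule are all invented here, and it does not reproduce what simplify() or expand() do.

```python
# Toy expression trees as nested tuples: (operator, operand, operand, ...).
# Leaves are strings (symbols) or ints. The rewrite rules mirror how SymPy
# stores subtraction and division; this is not MathTok's canonicalizer.

def _order_key(arg):
    # Hypothetical ordering rule: integers first, then everything else by repr.
    return (not isinstance(arg, int), repr(arg))

def canonicalize(expr):
    """Rewrite SUB/DIV into ADD/MUL form and sort commutative operands."""
    if not isinstance(expr, tuple):
        return expr                       # leaf: symbol or integer
    op, *args = expr
    args = [canonicalize(a) for a in args]
    if op == "SUB":                       # a - b  ->  a + (-1 * b)
        return canonicalize(("ADD", args[0], ("MUL", -1, args[1])))
    if op == "DIV":                       # a / b  ->  a * b**-1
        return canonicalize(("MUL", args[0], ("POW", args[1], -1)))
    if op in ("ADD", "MUL"):              # commutative: fix operand order
        args = sorted(args, key=_order_key)
    return (op, *args)

print(canonicalize(("SUB", "x", "y")))   # ('ADD', 'x', ('MUL', -1, 'y'))
print(canonicalize(("DIV", "x", "y")))   # ('MUL', 'x', ('POW', 'y', -1))
```

Because commutative operands are sorted, `("ADD", "b", "a")` and `("ADD", "a", "b")` end up as the same tree, which is the property a canonicalizer needs so that equivalent inputs yield the same token stream.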
The AST Generator maps SymPy's expression tree to a typed token vocabulary with semantic metadata per operator. The tree is then serialized via DFS preorder traversal.
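DFS preorder serialization can be sketched on a hand-built tree for (x+1)^2. The `ASTNode` class below is a hypothetical stand-in with only a token and children, not the typed ASTNode that MathTok defines, and the tree is constructed by hand rather than from SymPy.

```python
from dataclasses import dataclass, field

@dataclass
class ASTNode:
    # Hypothetical minimal node: the real ASTNode carries more fields.
    token: str                                   # e.g. "OP_POW", "VAR_X"
    children: list["ASTNode"] = field(default_factory=list)

def preorder_tokens(node: ASTNode) -> list[str]:
    """Visit the node first, then each child from left to right."""
    tokens = [node.token]
    for child in node.children:
        tokens.extend(preorder_tokens(child))
    return tokens

def to_sexp(node: ASTNode) -> str:
    """Parenthesized form that keeps the nesting explicit."""
    if not node.children:
        return node.token
    inner = " ".join(to_sexp(child) for child in node.children)
    return f"({node.token} {inner})"

# (x + 1)^2, built by hand
tree = ASTNode("OP_POW", [
    ASTNode("OP_ADD", [ASTNode("VAR_X"), ASTNode("CONST_1")]),
    ASTNode("CONST_2"),
])

print(preorder_tokens(tree))  # ['OP_POW', 'OP_ADD', 'VAR_X', 'CONST_1', 'CONST_2']
print(to_sexp(tree))          # (OP_POW (OP_ADD VAR_X CONST_1) CONST_2)
```

The flat token list drops the parentheses, so recovering the tree from it requires knowing each operator's arity.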
In the Operator-Aware Semantic Encoder, every operator and function carries an explicit metadata record: arity, precedence, associativity, semantic_role. This is the primary novelty over standard tokenization.
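One way to picture such a metadata record is a frozen dataclass keyed by token name. Only the four field names come from the description above; the class name `OperatorInfo`, the registry layout, and every value in it are illustrative assumptions, not the contents of operator_registry.py.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OperatorInfo:
    arity: int            # number of operands; -1 is used here for variadic
    precedence: int       # higher binds tighter
    associativity: str    # "left", "right", or "none"
    semantic_role: str    # free-form label

# Hypothetical entries: the actual values are not given in this README.
OPERATOR_REGISTRY = {
    "OP_ADD": OperatorInfo(-1, 10, "left", "additive"),
    "OP_MUL": OperatorInfo(-1, 20, "left", "multiplicative"),
    "OP_POW": OperatorInfo(2, 30, "right", "exponentiation"),
    "FUNC_SIN": OperatorInfo(1, 40, "none", "trigonometric"),
}

def binds_tighter(a: str, b: str) -> bool:
    """True if operator a has higher precedence than operator b."""
    return OPERATOR_REGISTRY[a].precedence > OPERATOR_REGISTRY[b].precedence

print(binds_tighter("OP_POW", "OP_ADD"))       # True
print(OPERATOR_REGISTRY["OP_POW"].associativity)  # right
```

Keeping the record immutable means a token's semantics cannot drift between pipeline stages.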
Structural Attention Metadata consists of per-token records encoding depth, parent_id, children_ids, tree_position_key, and sibling_count, which enables future structure-aware attention.
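These records can be filled in during the same preorder walk that produces the token stream. The sketch below makes several assumptions that the README does not confirm: IDs are preorder positions, tree_position_key is a dotted path of child indices starting at "0" for the root, and sibling_count excludes the node itself. `TokenMeta` and `build_metadata` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TokenMeta:
    token: str
    depth: int
    parent_id: Optional[int]       # None for the root
    tree_position_key: str         # assumed format: dotted child-index path
    sibling_count: int             # assumed: siblings excluding this node
    children_ids: list[int] = field(default_factory=list)

def build_metadata(tree) -> list[TokenMeta]:
    """tree is (token, [children]); IDs are positions in DFS preorder."""
    records: list[TokenMeta] = []

    def visit(node, depth, parent_id, key, sibling_count):
        token, children = node
        node_id = len(records)     # next free preorder position
        records.append(TokenMeta(token, depth, parent_id, key, sibling_count))
        if parent_id is not None:
            records[parent_id].children_ids.append(node_id)
        for index, child in enumerate(children):
            visit(child, depth + 1, node_id, f"{key}.{index}", len(children) - 1)

    visit(tree, 0, None, "0", 0)
    return records

# (x + 1)^2 as nested (token, children) pairs
tree = ("OP_POW", [("OP_ADD", [("VAR_X", []), ("CONST_1", [])]), ("CONST_2", [])])
meta = build_metadata(tree)
for m in meta:
    print(m.token, m.depth, m.parent_id, m.tree_position_key, m.children_ids)
```

With parent_id and depth available per token, an attention mask could, for example, restrict each token to its ancestors and siblings.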
The Vocabulary Mapping layer uses two tiers:
- Fixed math vocabulary: deterministic IDs for all operators, functions, variables, constants.
- BPE text vocabulary: BPE from the HuggingFace tokenizers library for natural language spans.
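The two-tier idea can be sketched as a fixed ID range for math tokens followed by an offset range for text tokens. Everything here is an assumption for illustration: the special-token list, the sorted-order ID assignment, and the ID values do not match MathTok's actual vocabulary, and a whitespace word lookup stands in for real BPE.

```python
# Hypothetical token inventories; the real vocabulary is larger.
SPECIAL = ["[PAD]", "[UNK]", "[MATH_START]", "[MATH_END]"]
MATH_TOKENS = ["OP_ADD", "OP_POW", "FUNC_SIN", "VAR_X", "CONST_1", "CONST_2"]

# Tier 1: sorting makes the ID assignment independent of insertion order.
FIXED_VOCAB = {tok: i for i, tok in enumerate(SPECIAL + sorted(MATH_TOKENS))}
TEXT_OFFSET = len(FIXED_VOCAB)    # tier 2 IDs start after the fixed range

def encode_math(tokens: list[str]) -> list[int]:
    """Map math tokens to fixed IDs, falling back to [UNK]."""
    unk = FIXED_VOCAB["[UNK]"]
    return [FIXED_VOCAB.get(tok, unk) for tok in tokens]

def encode_text(text: str, text_vocab: dict[str, int]) -> list[int]:
    """Stand-in for BPE: whitespace words looked up in a growing dict."""
    ids = []
    for word in text.split():
        local_id = text_vocab.setdefault(word, len(text_vocab))
        ids.append(TEXT_OFFSET + local_id)
    return ids

print(encode_math(["OP_POW", "VAR_X", "CONST_2"]))  # [8, 9, 5]
print(encode_text("the derivative the", {}))        # [10, 11, 10]
```

Keeping the math range fixed means a math token's ID never changes when the text vocabulary is retrained.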
| Metric | Symbol | Meaning |
|---|---|---|
| Semantic Compression Ratio | SCR | structural_score / token_count (Higher is better — measures parsed semantic content density) |
| Semantic Density | SD | math_tokens / total_tokens (Ratio of high-value math tokens, measures information density) |
| Structural Efficiency | SE | parent_child_relations / token_count (Ratio of hierarchy relationships encoded per token) |
| Token Stability | TS | 1 - CoV(token count across rewritings) (Fidelity and stability across representations) |
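The four definitions in the table translate directly into small functions. This is a sketch under stated assumptions: structural_score is taken as an input because its computation is not defined here, CoV is taken to be the population standard deviation divided by the mean, and the function names are hypothetical rather than those in evaluation/metrics.py.

```python
from statistics import mean, pstdev

def semantic_compression_ratio(structural_score: float, token_count: int) -> float:
    """SCR = structural_score / token_count."""
    return structural_score / token_count

def semantic_density(math_tokens: int, total_tokens: int) -> float:
    """SD = math_tokens / total_tokens."""
    return math_tokens / total_tokens

def structural_efficiency(parent_child_relations: int, token_count: int) -> float:
    """SE = parent_child_relations / token_count."""
    return parent_child_relations / token_count

def token_stability(token_counts: list[int]) -> float:
    """TS = 1 - CoV, with CoV assumed to be pstdev / mean."""
    return 1.0 - pstdev(token_counts) / mean(token_counts)

# A 5-node tree has 4 parent-child edges.
print(structural_efficiency(4, 5))   # 0.8
# Identical token counts across rewritings give perfect stability.
print(token_stability([5, 5, 5]))    # 1.0
```

For example, two rewritings of the same expression that tokenize to 4 and 6 tokens have mean 5 and population standard deviation 1, so TS is about 0.8.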
The table below reports averages over the suite of 70 mathematical test expressions:
| Tokenizer | Mean SCR (↑ Better) | Semantic Density (↑ Better) | Structural Efficiency (↑ Better) |
|---|---|---|---|
| MathTok (Ours) | 0.8501 | 0.5285 | 0.2339 |
| GPT-2 BPE | 0.4251 | 0.1838 | 0.1491 |
| SentencePiece Unigram | 0.3696 | 0.1499 | 0.1403 |
| Character-Level | 0.3708 | 0.1518 | 0.1518 |
Note
- MathTok's mean SCR is 2.30x that of SentencePiece Unigram (0.8501 vs 0.3696).
- MathTok's Semantic Density is 3.53x that of SentencePiece Unigram (0.5285 vs 0.1499), meaning a much larger share of its tokens are math tokens.
- MathTok's Structural Efficiency is 1.67x that of SentencePiece Unigram (0.2339 vs 0.1403), meaning it encodes more AST parent-child relations per token.
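The ratios in the note can be re-derived from the results table. The snippet below only divides the published averages; the dictionary layout and the `ratio` helper are hypothetical conveniences, not part of the evaluation package.

```python
# Averages copied from the results table above.
SCORES = {
    "MathTok": {"scr": 0.8501, "sd": 0.5285, "se": 0.2339},
    "GPT-2 BPE": {"scr": 0.4251, "sd": 0.1838, "se": 0.1491},
    "SentencePiece Unigram": {"scr": 0.3696, "sd": 0.1499, "se": 0.1403},
}

def ratio(metric: str, baseline: str) -> float:
    """MathTok's score divided by the baseline's score, rounded to 2 places."""
    return round(SCORES["MathTok"][metric] / SCORES[baseline][metric], 2)

for metric in ("scr", "sd", "se"):
    print(metric, ratio(metric, "SentencePiece Unigram"), ratio(metric, "GPT-2 BPE"))
```

The same helper works for any baseline row in the table, which makes it easy to check the comparison against GPT-2 BPE as well.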
The visualization system runs via python -m evaluation.visualize and writes the following plots to evaluation/results/:
- Unified Evaluation Dashboard (metrics_dashboard.png): 3-panel side-by-side display of SCR, Semantic Density, and Structural Efficiency.
- Overall SCR Comparison (scr_comparison.png): Comparative summary bar chart.
- Category-Level Breakdowns (scr_by_category.png): SCR analyzed by nested/standard categories.
- Semantic Density Summary (semantic_density_comparison.png): Ratio of math structure to total tokens.
math_token/
├── mathtok/
│ ├── canonicalizer.py # Layer 1: Canonicalization Engine
│ ├── lexer.py # Layer 2: Hybrid Mathematical Lexer
│ ├── ast_generator.py # Layer 3: AST Generator
│ ├── operator_registry.py # Layer 4: Operator Semantic Registry
│ ├── serializer.py # Layer 5: Structural Traversal & Serialization
│ ├── metadata.py # Layer 6: Structural Attention Metadata
│ ├── vocabulary.py # Layer 7: Two-Tier Vocabulary
│ └── pipeline.py # Orchestrator Pipeline
├── evaluation/
│ ├── metrics.py # Definition of core evaluation metrics
│ ├── benchmark.py # Quick benchmarking scripts
│ ├── comparison.py # Full 4-way comparative framework (SentencePiece integrated)
│ ├── visualize.py # Custom dashboard visualization engine
│ └── results/ # JSON/JSONL reports & visual plots
└── tests/ # 110+ passing unit tests
@article{mathtok2024,
title = {MathTok: A Hybrid Canonicalized AST-Based Tokenization Framework
for Mathematical Language Modeling},
author = {Anonymous},
year = {2024},
note = {Under review}
}