SurweeshSP/mathtok

MathTok

A Hybrid Canonicalized AST-Based Tokenization Framework for Mathematical Language Modeling


Overview

MathTok is a research tokenizer pipeline that converts raw mathematical expressions (LaTeX or ASCII) into a structured token stream. Unlike standard BPE or SentencePiece tokenizers, MathTok is structure-aware: it builds an Abstract Syntax Tree (AST) from each expression and serializes it via depth-first (DFS) preorder traversal, so the operator/operand hierarchy of the expression is carried into the token stream.

Raw Mathematical Expression
          ↓
Canonicalization Layer       (sympy: simplify, expand, normalize)
          ↓
Hybrid Mathematical Lexer    (split TEXT / MATH spans)
          ↓
AST Generator                (SymPy tree → typed ASTNode tree)
          ↓
Operator-Aware Semantic Encoder  (rich metadata per operator)
          ↓
Structural Serialization     (DFS preorder → flat token stream)
          ↓
Structural Attention Metadata (per-token tree context)
          ↓
Vocabulary Mapping + BPE     (fixed math vocab + HF BPE for text)
          ↓
Compressed Token Stream

Quick Start

# Install dependencies and package in editable mode
pip install -e ".[eval,dev]"

# Tokenize an expression using the CLI pipeline
python -m mathtok.pipeline "The derivative of sin(x^2) + 3x"

# Run the test suite (110+ tests)
pytest tests/ -v

# Run the 4-way comparative tokenizer evaluation benchmark
# (MathTok vs GPT-2 BPE vs SentencePiece Unigram vs Char-level)
python -m evaluation.comparison

# Generate visual plots and the unified metrics dashboard
python -m evaluation.visualize

Python API

from mathtok import MathTokPipeline

pipeline = MathTokPipeline()

# Encode mixed text + math (supporting LaTeX or ASCII syntax)
out = pipeline.encode("The derivative of $\\sin(x^2)$ is $2x\\cos(x^2)$.")
print(out.tokens)      # ['[MATH_START]', 'FUNC_SIN', 'OP_POW', 'VAR_X', 'CONST_2', '[MATH_END]', ...]
print(out.sexp)        # (FUNC_SIN (OP_POW VAR_X CONST_2))
print(out.input_ids)   # [4, 27, 10, 45, 12, 5, ...]

# Access structural metadata (for tree-aware attention masking)
for meta in out.metadata:
    print(meta.token, meta.depth, meta.tree_position_key)

# Pure math expression serialization
out = pipeline.encode_math_only("(x+1)^2")
print(out.sexp)        # (OP_POW (OP_ADD VAR_X CONST_1) CONST_2)

# HuggingFace-compatible tokenizer export
hf_tok = pipeline.get_hf_tokenizer()
hf_tok.save_pretrained("./mathtok-tokenizer")
result = hf_tok("x^2 + 2*x + 1", return_tensors="pt")

Research Contributions

1. Hybrid Lexer

Separates natural language from mathematical content using LaTeX delimiter detection ($...$, \(...\), \[...\]) and ASCII math heuristics.
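
A minimal sketch of the delimiter-detection half of this step is shown below. It is not MathTok's lexer: the regular expression and the `split_spans` helper are hypothetical names introduced for illustration, and the ASCII math heuristics are left out entirely.

```python
import re
from typing import List, Tuple

# Hypothetical pattern covering the three delimiter styles: $...$, \(...\), \[...\]
MATH_SPAN = re.compile(r"\$(.+?)\$|\\\((.+?)\\\)|\\\[(.+?)\\\]", re.DOTALL)

def split_spans(text: str) -> List[Tuple[str, str]]:
    """Split text into ("TEXT", s) and ("MATH", s) spans using LaTeX delimiters."""
    spans: List[Tuple[str, str]] = []
    pos = 0
    for m in MATH_SPAN.finditer(text):
        if m.start() > pos:
            spans.append(("TEXT", text[pos:m.start()]))
        # Exactly one of the three capture groups matched; take it.
        body = next(g for g in m.groups() if g is not None)
        spans.append(("MATH", body))
        pos = m.end()
    if pos < len(text):
        spans.append(("TEXT", text[pos:]))
    return spans

spans = split_spans(r"The derivative of $\sin(x^2)$ is \(2x\cos(x^2)\).")
for kind, body in spans:
    print(kind, repr(body))
```

Non-greedy matching keeps two math spans on the same line from being merged into one.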

2. Canonicalization Engine

Normalizes mathematically equivalent expressions via SymPy's simplify() and expand(), and relies on SymPy's internal representation, in which subtraction is stored as addition of a negated term and division as multiplication by a reciprocal.
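
The two rewrites named above can be illustrated without SymPy on a small tuple tree. This is a stand-in sketch, not the project's canonicalizer: the `SUB`/`DIV`/`ADD`/`MUL`/`POW` labels and the `canonicalize` function are hypothetical, and sorting commutative operands by `repr` is an arbitrary choice made only to get a deterministic order.

```python
def canonicalize(node):
    """Rewrite SUB and DIV into ADD/MUL forms and fix the order of commutative operands."""
    if not isinstance(node, tuple):
        return node  # leaf: a variable name or a number
    op, *args = node
    args = [canonicalize(a) for a in args]
    if op == "SUB":  # a - b  ->  a + (-1 * b)
        a, b = args
        return canonicalize(("ADD", a, ("MUL", -1, b)))
    if op == "DIV":  # a / b  ->  a * b^(-1)
        a, b = args
        return canonicalize(("MUL", a, ("POW", b, -1)))
    if op in ("ADD", "MUL"):  # commutative: impose a deterministic operand order
        args = sorted(args, key=repr)
    return (op, *args)

# Two spellings of x - y reduce to the same tree.
left = canonicalize(("SUB", "x", "y"))
right = canonicalize(("ADD", ("MUL", -1, "y"), "x"))
print(left == right)
```

Because equivalent inputs collapse to one tree, everything downstream (serialization, vocabulary mapping) sees a single form per expression.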

3. AST-Based Structural Serialization

Maps SymPy's expression tree to a typed token vocabulary with semantic metadata per operator. Serializes via DFS preorder traversal.
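
DFS preorder means each operator is emitted before its operands, left to right. The sketch below applies that to the `(x+1)^2` example from the Python API section, using plain tuples in place of the project's ASTNode type; `preorder_tokens` and `to_sexp` are illustrative names, not MathTok functions.

```python
def preorder_tokens(node):
    """Flatten a (label, *children) tree in DFS preorder: parent first, then children."""
    if isinstance(node, str):
        return [node]
    label, *children = node
    tokens = [label]
    for child in children:
        tokens.extend(preorder_tokens(child))
    return tokens

def to_sexp(node):
    """Render the same tree as a parenthesized S-expression."""
    if isinstance(node, str):
        return node
    label, *children = node
    return "(" + " ".join([label] + [to_sexp(c) for c in children]) + ")"

tree = ("OP_POW", ("OP_ADD", "VAR_X", "CONST_1"), "CONST_2")
print(preorder_tokens(tree))  # ['OP_POW', 'OP_ADD', 'VAR_X', 'CONST_1', 'CONST_2']
print(to_sexp(tree))          # (OP_POW (OP_ADD VAR_X CONST_1) CONST_2)
```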

4. Operator Semantic Registry

Every operator and function carries an explicit metadata record: arity, precedence, associativity, semantic_role. This is the primary novelty over standard tokenization.
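
One plausible shape for such a registry is a frozen dataclass keyed by token name. The four field names come from the description above; the class name, the `lookup` helper, and every value in the table are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass(frozen=True)
class OperatorInfo:
    arity: Optional[int]   # None marks a variadic operator (assumed convention)
    precedence: int        # larger binds tighter (assumed convention)
    associativity: str     # "left", "right", or "none"
    semantic_role: str

# Hypothetical entries; the real registry's values are not given in this README.
REGISTRY: Dict[str, OperatorInfo] = {
    "OP_ADD": OperatorInfo(None, 10, "left", "additive_combination"),
    "OP_MUL": OperatorInfo(None, 20, "left", "multiplicative_combination"),
    "OP_POW": OperatorInfo(2, 30, "right", "exponentiation"),
    "FUNC_SIN": OperatorInfo(1, 40, "none", "trigonometric_function"),
}

def lookup(token: str) -> OperatorInfo:
    """Return the metadata record for an operator or function token."""
    return REGISTRY[token]

info = lookup("OP_POW")
print(info.arity, info.associativity, info.semantic_role)
```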

5. Structural Attention Metadata

Per-token records encoding depth, parent_id, children_ids, tree_position_key, and sibling_count — enabling future structure-aware attention.
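
A sketch of how these records could be filled in during the preorder walk follows. The field names match the list above, but the `TokenMeta` class, the dotted-path format for `tree_position_key`, and counting `sibling_count` as "other children of the same parent" are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TokenMeta:
    token: str
    depth: int
    parent_id: Optional[int]
    tree_position_key: str   # assumed format: dotted path of child indices
    sibling_count: int       # assumed meaning: other children of the same parent
    children_ids: List[int] = field(default_factory=list)

def build_metadata(tree) -> List[TokenMeta]:
    """Walk a (label, *children) tree in preorder and record per-token context."""
    records: List[TokenMeta] = []

    def visit(node, depth, parent_id, key, siblings):
        if isinstance(node, str):
            label, children = node, []
        else:
            label, children = node[0], list(node[1:])
        my_id = len(records)  # the preorder index doubles as the token id
        records.append(TokenMeta(label, depth, parent_id, key, siblings))
        if parent_id is not None:
            records[parent_id].children_ids.append(my_id)
        for i, child in enumerate(children):
            visit(child, depth + 1, my_id, f"{key}.{i}", len(children) - 1)

    visit(tree, 0, None, "0", 0)
    return records

records = build_metadata(("FUNC_SIN", ("OP_POW", "VAR_X", "CONST_2")))
for r in records:
    print(r.token, r.depth, r.parent_id, r.tree_position_key, r.children_ids)
```

A tree-aware attention mask could then be derived from `parent_id` and `depth` alone, without re-parsing the expression.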

6. Two-Tier Vocabulary

  • Fixed math vocabulary: deterministic IDs for all operators, functions, variables, constants.
  • BPE text vocabulary: HuggingFace tokenizers BPE for natural language spans.
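
The split between the two tiers can be sketched as a lookup with a fallback. The six math IDs below are read off the example output in the Python API section; the `TEXT_ID_OFFSET` value, the `map_ids` helper, and the plain dict standing in for a trained BPE model are assumptions.

```python
from typing import Dict, List

# IDs taken from the example output above; a full table would cover every math token.
MATH_VOCAB: Dict[str, int] = {
    "[MATH_START]": 4, "[MATH_END]": 5,
    "OP_POW": 10, "CONST_2": 12, "FUNC_SIN": 27, "VAR_X": 45,
}
TEXT_ID_OFFSET = 1000  # hypothetical: text-tier IDs start past the math range

def map_ids(tokens: List[str], text_vocab: Dict[str, int]) -> List[int]:
    """Map math tokens to fixed IDs and everything else through the text tier."""
    ids = []
    for tok in tokens:
        if tok in MATH_VOCAB:
            ids.append(MATH_VOCAB[tok])  # deterministic, independent of training data
        else:
            ids.append(TEXT_ID_OFFSET + text_vocab[tok])  # stand-in for BPE lookup
    return ids

math_ids = map_ids(
    ["[MATH_START]", "FUNC_SIN", "OP_POW", "VAR_X", "CONST_2", "[MATH_END]"], {}
)
print(math_ids)  # [4, 27, 10, 45, 12, 5]
```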

Evaluation Metrics & Benchmarks

Core Metrics

| Metric | Symbol | Meaning |
|---|---|---|
| Semantic Compression Ratio | SCR | structural_score / token_count (higher is better; measures parsed semantic content density) |
| Semantic Density | SD | math_tokens / total_tokens (ratio of high-value math tokens; measures information density) |
| Structural Efficiency | SE | parent_child_relations / token_count (ratio of hierarchy relationships encoded per token) |
| Token Stability | TS | 1 - CoV(token count across rewritings) (fidelity and stability across representations) |
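
The four formulas translate directly into code. In the sketch below the function names are illustrative, `structural_score` is taken as an input because its definition is not given here, and CoV is read as the coefficient of variation (standard deviation divided by mean) using the population standard deviation, which is an assumption.

```python
from statistics import mean, pstdev
from typing import Sequence

def semantic_compression_ratio(structural_score: float, token_count: int) -> float:
    return structural_score / token_count

def semantic_density(math_tokens: int, total_tokens: int) -> float:
    return math_tokens / total_tokens

def structural_efficiency(parent_child_relations: int, token_count: int) -> float:
    return parent_child_relations / token_count

def token_stability(token_counts: Sequence[int]) -> float:
    """1 - CoV, where CoV = population std / mean of the token counts."""
    return 1.0 - pstdev(token_counts) / mean(token_counts)

# Token counts for several rewritings of one expression (made-up numbers).
print(token_stability([5, 5, 5]))   # identical counts give a stability of 1.0
print(token_stability([6, 6, 7]))   # any spread pulls the score below 1.0
print(semantic_density(3, 6))       # 0.5
```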

Empirical Benchmarks (4-Way Comparison)

The averages below are computed over the suite of 70 mathematical test expressions:

| Tokenizer | Mean SCR (↑ Better) | Semantic Density (↑ Better) | Structural Efficiency (↑ Better) |
|---|---|---|---|
| MathTok (Ours) | 0.8501 | 0.5285 | 0.2339 |
| GPT-2 BPE | 0.4251 | 0.1838 | 0.1491 |
| SentencePiece Unigram | 0.3696 | 0.1499 | 0.1403 |
| Character-Level | 0.3708 | 0.1518 | 0.1518 |

Note

  • MathTok achieves a 2.30x higher mean SCR than SentencePiece Unigram (0.8501 vs 0.3696).
  • MathTok's semantic density is 3.53x that of SentencePiece Unigram (0.5285 vs 0.1499), meaning a much larger share of its tokens are math tokens.
  • MathTok is 1.67x more efficient at encoding hierarchical AST relationships directly into token structures (0.2339 vs 0.1403).
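
The ratios against SentencePiece Unigram follow from the table by simple division. A short check, using only the values reported in the table (the dictionary layout and the `ratio_vs` helper are illustrative):

```python
# Mean scores copied from the benchmark table: (SCR, semantic density, structural efficiency)
scores = {
    "MathTok": (0.8501, 0.5285, 0.2339),
    "GPT-2 BPE": (0.4251, 0.1838, 0.1491),
    "SentencePiece Unigram": (0.3696, 0.1499, 0.1403),
    "Character-Level": (0.3708, 0.1518, 0.1518),
}

def ratio_vs(baseline: str) -> tuple:
    """MathTok's score divided by the baseline's, per metric, rounded to 2 places."""
    return tuple(
        round(ours / theirs, 2)
        for ours, theirs in zip(scores["MathTok"], scores[baseline])
    )

for name in scores:
    if name != "MathTok":
        print(name, ratio_vs(name))
```

Running the same division against the other two baselines gives the corresponding ratios for GPT-2 BPE and the character-level tokenizer.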

High-Impact Visualizations

Running python -m evaluation.visualize writes the following plots to evaluation/results/:

  • Unified Evaluation Dashboard (metrics_dashboard.png): 3-panel side-by-side display of SCR, Semantic Density, and Structural Efficiency.
  • Overall SCR Comparison (scr_comparison.png): Comparative summary bar chart.
  • Category-Level Breakdowns (scr_by_category.png): SCR analyzed by nested/standard categories.
  • Semantic Density Summary (semantic_density_comparison.png): Ratio of math structure to total tokens.

Project Structure

math_token/
├── mathtok/
│   ├── canonicalizer.py      # Layer 1: Canonicalization Engine
│   ├── lexer.py              # Layer 2: Hybrid Mathematical Lexer
│   ├── ast_generator.py      # Layer 3: AST Generator
│   ├── operator_registry.py  # Layer 4: Operator Semantic Registry
│   ├── serializer.py         # Layer 5: Structural Traversal & Serialization
│   ├── metadata.py           # Layer 6: Structural Attention Metadata
│   ├── vocabulary.py         # Layer 7: Two-Tier Vocabulary
│   └── pipeline.py           # Orchestrator Pipeline
├── evaluation/
│   ├── metrics.py            # Definition of core evaluation metrics
│   ├── benchmark.py          # Quick benchmarking scripts
│   ├── comparison.py         # Full 4-way comparative framework (SentencePiece integrated)
│   ├── visualize.py          # Custom dashboard visualization engine
│   └── results/              # JSON/JSONL reports & visual plots
└── tests/                    # 110+ passing unit tests

Citation

@article{mathtok2024,
  title   = {MathTok: A Hybrid Canonicalized AST-Based Tokenization Framework
             for Mathematical Language Modeling},
  author  = {Anonymous},
  year    = {2024},
  note    = {Under review}
}

About

Mathematical intelligence tokenizer framework for symbolic reasoning and LLM optimization.
