Built end-to-end: Data Pipeline → SFT → DPO → Tool Augmentation → Self-Consistency → Benchmarking → Deployment
MathBrain is a specialized math-solving LLM that combines modern training techniques with tool-augmented reasoning to solve mathematical problems with structured, verifiable step-by-step solutions.
Unlike typical LLM projects that simply call an API, MathBrain is built from the ground up, with every component of the pipeline hand-built:
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  5 Datasets  │────▶│  SFT Train   │────▶│  DPO Train   │────▶│ Tool Router  │
│ 75K examples │     │  QLoRA 7B    │     │  Preference  │     │ SymPy/Py/WA  │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
                                                                      │
┌──────────────┐     ┌──────────────┐     ┌──────────────┐            │
│  FastAPI +   │◀────│  Benchmark   │◀────│    Self-     │◀───────────┘
│  React UI    │     │  Ablation    │     │ Consistency  │
└──────────────┘     └──────────────┘     └──────────────┘
```
Example model output:

```xml
<think>
<step>Identify that this is a quadratic equation</step>
<step>Apply the quadratic formula: x = (-b ± √(b²-4ac)) / 2a</step>
<tool>sympy
from sympy import symbols, solve
x = symbols('x')
result = solve(x**2 - 5*x + 6, x)
print(result)</tool>
<tool_result>[2, 3]</tool_result>
<step>The solutions are x = 2 and x = 3</step>
</think>
<answer>x = 2, x = 3</answer>
```
Loads, formats, and deduplicates 5 math datasets into a unified training format:
| Dataset | Examples | What It Teaches |
|---|---|---|
| GSM8K | 7,473 | Grade school word problem reasoning |
| MATH (7 subjects) | 7,500 | Competition-level mathematical thinking |
| NuminaMath-CoT | 50,000 | Diverse chain-of-thought patterns |
| Orca-Math | 30,000 | GPT-4 quality step-by-step explanations |
| DeepMind Mathematics | 15,000 | Computational accuracy and drilling |
Each example is reformatted with custom reasoning tokens (`<think>`, `<step>`, `<answer>`) and deduplicated using MinHash LSH, which removes ~15% overlap between datasets.
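To make the deduplication step concrete, here is a simplified sketch of MinHash-based near-duplicate filtering. The repo uses `datasketch`'s MinHash LSH for scalable bucketing; this stand-in compares signatures pairwise (O(n²), no LSH banding) and the function names are illustrative, not the repo's actual API.

```python
import hashlib

def minhash_signature(text: str, num_perm: int = 64, k: int = 3) -> list[int]:
    """One min-hash per seeded hash function over the text's character shingles."""
    shingles = {text[i:i + k] for i in range(max(1, len(text) - k + 1))}
    return [
        min(int(hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).hexdigest(), 16)
            for s in shingles)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedupe(texts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a text only if it is not a near-duplicate of anything kept so far."""
    kept, signatures = [], []
    for text in texts:
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, s) < threshold for s in signatures):
            kept.append(text)
            signatures.append(sig)
    return kept
```

The real pipeline buckets signatures with LSH so each example is only compared against likely matches, which is what makes 75K-example deduplication tractable.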
Training: QLoRA on Qwen2.5-Math-7B
- 4-bit quantization (NF4): fits on a single A100
- LoRA rank 64, alpha 128, targeting all linear layers
- Cosine LR schedule, gradient checkpointing
- 75K examples, 2 epochs
The model learns to produce structured `<think>`/`<step>`/`<answer>` reasoning chains instead of free-form text.
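The QLoRA setup listed above can be sketched with the Hugging Face PEFT/bitsandbytes stack named in the tech table below. This is a minimal configuration sketch, not the repo's actual `scripts/train_sft.py`; exact hyperparameters live in `configs/`.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization so the frozen 7B base fits on a single A100
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Math-7B", quantization_config=bnb
)
model = prepare_model_for_kbit_training(model)  # also sets up grad checkpointing hooks

# LoRA adapters on all linear projections (rank 64, alpha 128)
lora = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)  # only adapter params are trainable
```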
Most projects stop at SFT. MathBrain goes further:
- Sample 8 completions per problem from the SFT model
- Classify: correct final answer → ✓ chosen | wrong answer → ✗ rejected
- Train the model to prefer correct reasoning chains over incorrect ones
This teaches the model how to reason correctly, not just what format to use.
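The pair-construction step above can be sketched in a few lines of plain Python. The helper names (`build_dpo_pairs`, `extract_answer`) are illustrative, not the repo's actual API; the real script samples 8 completions per problem from the SFT model.

```python
from itertools import product

def build_dpo_pairs(prompt, completions, gold_answer, extract_answer):
    """Split sampled completions into correct (chosen) and incorrect (rejected),
    then cross them into DPO preference records."""
    chosen = [c for c in completions if extract_answer(c) == gold_answer]
    rejected = [c for c in completions if extract_answer(c) != gold_answer]
    return [
        {"prompt": prompt, "chosen": ch, "rejected": rj}
        for ch, rj in product(chosen, rejected)
    ]
```

Each record then trains the policy to assign higher likelihood to the chosen chain than the rejected one under the DPO loss.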
The model learns when and how to call external tools during reasoning:
| Tool | What It Does | Example |
|---|---|---|
| SymPy | Symbolic algebra, calculus, equations | `solve(2*x + 5 - 15, x)` → `[5]` |
| Python | Sandboxed numerical computation | `sum(range(1, 101))` → `5050` |
| Wolfram Alpha | Complex queries, verification | `"integral of sin(x)*cos(x)"` |
| Matplotlib | Graph generation | Plots returned as base64 PNG |
The agentic generation loop: generate → detect `<tool>` → pause → execute → inject `<tool_result>` → continue generating.
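The loop can be sketched as follows. The model and tool executors are passed in as callables (stubbed here); function names are illustrative, and the real implementation in `src/tools/router.py` may differ.

```python
import re

# Matches "<tool>NAME\nCODE</tool>" blocks emitted by the model
TOOL_RE = re.compile(r"<tool>(\w+)\n(.*?)</tool>", re.DOTALL)

def run_agentic_loop(generate, execute_tool, prompt, max_rounds=4):
    """Generate until a tool call appears, execute it, splice the result
    back in as <tool_result>, and resume generation."""
    text = generate(prompt)
    for _ in range(max_rounds):
        calls = TOOL_RE.findall(text)
        answered = text.count("<tool_result>")
        if len(calls) <= answered:
            return text  # no pending tool call: generation is complete
        name, code = calls[answered]  # oldest unanswered call
        text += f"<tool_result>{execute_tool(name, code)}</tool_result>"
        text = generate(text)  # model continues past the injected result
    return text
```

Capping the rounds prevents a model that keeps emitting tool calls from looping forever.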
Generate N solutions per problem, extract final answers, majority vote:
```
# 3 solutions for "What is the integral of x²?"
Solution 1: x³/3 + C   ✓
Solution 2: x³/3 + C   ✓  ← majority vote winner
Solution 3: x³/2 + C   ✗
# Confidence: 2/3 = 67%
```

- FastAPI backend with 3 solve modes: `/solve` (fast), `/solve/verified` (tools), `/solve/consistent` (majority vote)
- SSE streaming for real-time token output
- React + KaTeX + TailwindCSS frontend with step visualization
- Supabase for query logging and user feedback
- Deployed on HuggingFace Spaces (model) + Vercel (frontend)
Comparisons at similar scale:
With tools and self-consistency, MathBrain approaches the accuracy of models 2-4x its size.
```bash
git clone https://github.com/Praneeth1636/MathBrain.git
cd MathBrain
pip install -r requirements.txt

# Download all 5 datasets, format with reasoning tokens, deduplicate
python scripts/prepare_data.py

# SFT fine-tuning with QLoRA
python scripts/train_sft.py

# Generate DPO preference pairs from your SFT model
python scripts/generate_dpo_pairs.py

# DPO training
python scripts/train_dpo.py

# Run full ablation study
python scripts/evaluate.py --model-path ./checkpoints/dpo/final --ablation

# Launch the API
uvicorn src.api.main:app --host 0.0.0.0 --port 8000

# Or deploy to HuggingFace Spaces
docker build -t mathbrain .
```

Or open `notebooks/01_setup_data_sft.ipynb` in Colab, set the runtime to A100, and run. Training saves directly to Google Drive (disconnect-proof).
```
MathBrain/
│
├── src/
│   ├── model/
│   │   ├── loader.py      # Model loading, 4-bit quantization, QLoRA setup
│   │   └── generate.py    # Inference, streaming, multi-sample generation
│   │
│   ├── data/
│   │   └── pipeline.py    # 5-dataset loader, formatter, MinHash deduplication
│   │
│   ├── training/
│   │   ├── sft.py         # Supervised fine-tuning with TRL
│   │   └── dpo.py         # DPO preference pair generation + training
│   │
│   ├── tools/
│   │   └── router.py      # Tool router, SymPy/Python/Wolfram/Plot executors,
│   │                      #   agentic generation loop, self-consistency voting
│   │
│   ├── evaluation/
│   │   └── harness.py     # Benchmark suite, answer extraction, error taxonomy
│   │
│   └── api/
│       └── main.py        # FastAPI with 3 solve modes + SSE streaming
│
├── configs/               # YAML configs for model, data, SFT, DPO, eval
├── scripts/               # CLI entry points for each pipeline stage
├── notebooks/             # Colab notebooks (ready to run on A100)
├── tests/                 # Unit tests for answer extraction, tools, comparison
├── Dockerfile             # HuggingFace Spaces deployment
├── requirements.txt
└── pyproject.toml
```
Why Qwen2.5-Math-7B? It's pre-trained on math corpora and already has the exact architecture from our spec (RoPE, GQA, SwiGLU, RMSNorm). Starting from a math-specialized base means our SFT and DPO training can focus on reasoning structure rather than basic math knowledge.
Why QLoRA instead of full fine-tuning? Training all 7.6B parameters would require 4x A100s and risk catastrophic forgetting. QLoRA freezes the base model and trains ~83M adapter parameters (1.1%), so it fits on a single A100, preserves pre-trained knowledge, and achieves 95%+ of full fine-tuning quality.
Why custom reasoning tokens? Standard models output free-form text that's hard to parse reliably. Our <think>/<step>/<answer>/<tool> tokens give the model a structured vocabulary for reasoning. The code can reliably extract answers, detect tool calls, and visualize step-by-step solutions.
Why DPO over RLHF? DPO achieves comparable results to PPO-based RLHF without needing a separate reward model or the instability of RL training. We generate preference pairs directly from the SFT model: correct solutions are "chosen", incorrect ones are "rejected". Simpler, cheaper, more stable.
Why tool augmentation? A 7B model with SymPy can outperform a 70B model on computation-heavy problems. The model learns when to use tools (symbolic algebra, numerical computation, verification) rather than trying to do everything in its head. This is the same approach used in production systems like ChatGPT and Gemini.
Why self-consistency? Generating 3 solutions and taking a majority vote costs 3x inference but typically adds 3-5% accuracy. More importantly, it provides a confidence score β if all 3 agree, the answer is likely correct. If they disagree, the problem might need human review.
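The vote-plus-confidence scheme described above reduces to a few lines of plain Python (the function name is illustrative; the repo's version sits in `src/tools/router.py`):

```python
from collections import Counter

def self_consistency_vote(answers: list[str]) -> tuple[str, float]:
    """Majority vote over extracted final answers, with agreement as confidence."""
    tally = Counter(answers)
    best, votes = tally.most_common(1)[0]
    return best, votes / len(answers)

answer, confidence = self_consistency_vote(["x³/3 + C", "x³/3 + C", "x³/2 + C"])
# answer == "x³/3 + C", confidence == 2/3
```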
| Layer | Technology |
|---|---|
| Training | PyTorch, HuggingFace Transformers, PEFT (QLoRA), TRL (SFT/DPO), bitsandbytes |
| Tools | SymPy, RestrictedPython, Wolfram Alpha API, Matplotlib |
| Data | HuggingFace Datasets, datasketch (MinHash LSH), pandas |
| API | FastAPI, Server-Sent Events (SSE) |
| Frontend | React, TailwindCSS, KaTeX (LaTeX rendering) |
| Database | Supabase (query logging, feedback) |
| Tracking | Weights & Biases |
| Deploy | Docker, HuggingFace Spaces, Vercel |
SFT Training (1 epoch, Qwen2.5-Math-7B, A100):
```
Step 500   │ Train Loss: 0.981 │ Val Loss: 1.000
Step 1000  │ Train Loss: 0.975 │ Val Loss: 0.992
Step 1500  │ Train Loss: 0.973 │ Val Loss: 0.985
Step 2000  │ Train Loss: 0.955 │ Val Loss: 0.983  ← converging
Step 2340  │ Training complete │ Saved to Drive
```
- Phase 0: Base model + custom tokenizer + QLoRA
- Phase 1: 5-dataset pipeline with dedup (75K examples)
- Phase 2: SFT training on A100
- Phase 3: DPO training with on-policy preference pairs
- Phase 4: Tool augmentation integration
- Phase 5: Full benchmark ablation study
- Phase 6: FastAPI + React deployment
- Phase 7: Lean 4 formal verification (stretch)
MIT β see LICENSE for details.
- Fork the repo
- Create a feature branch (`git checkout -b feature/amazing-thing`)
- Run tests (`python -m pytest tests/`)
- Commit and push
- Open a PR
Built with ❤️ by Praneeth Kadem

If this project helped you, consider giving it a ⭐