A minimal full-stack cybersecurity-focused language model, from pretraining to chat.
Mesosfer is inspired by nanoGPT and follows Andrej Karpathy's educational approach to LLM training: clean, readable code that actually works. It is specifically optimized for cybersecurity and secure coding domains.
- Pretraining: Train base models from scratch using scaling laws
- Tokenizer Training: Custom BPE tokenizer (64K vocab, cybersec-aware)
- Supervised Fine-Tuning (SFT): Instruction tuning with cybersecurity datasets
- Reinforcement Learning (RL): GRPO-style RL on cybersecurity tasks
- RLHF Data Collection: Human feedback via thumbs up/down UI, stored to data/rlhf/
- Evaluation: CORE benchmark + cybersecurity domain probes
- Chat Interface: CLI and WebUI for model interaction
- Syntax-highlighted code blocks (Python, Rust, JSON, etc.)
- Markdown rendering (headings, lists, bold/italic, inline code)
- Welcome screen with centered input on new conversation
| Feature | Implementation |
|---|---|
| Attention | Grouped-Query Attention (GQA) with Flash Attention 2/3 |
| Positional Encoding | Rotary embeddings (RoPE) |
| Activation | ReLU² |
| Normalization | RMSNorm (no learnable parameters) |
| Optimizer | MuonAdamW (Muon for matrices, AdamW for embeddings) |
| Training Precision | BF16 (all GPUs) / FP8 (Hopper/H100 only) |
| Value Embeddings | ResFormer-style alternating layers |
| Vendor | GPUs | Backend |
|---|---|---|
| NVIDIA | H100, H200, B200, A100, L40S, L4, RTX 4090, T4 | CUDA |
| AMD | MI355X, MI350X, MI325X, MI300X, MI250X, MI210 | ROCm 7.0+ |
| Apple | M1/M2/M3 Pro/Max | MPS |
| Intel/x86 | CPU | CPU |
For detailed setup instructions, see:
- RUN_NVIDIA.md: NVIDIA GPU setup
- RUN_AMD.md: AMD GPU setup (ROCm 7.0 pre-built image)
# Clone the repository
git clone https://github.com/your-repo/mesosfer.git
cd mesosfer
# Install uv (if needed)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create virtual environment
uv venv --python 3.12
source .venv/bin/activate
# NVIDIA CUDA
uv sync --extra gpu
# AMD ROCm 7.0 (pre-built image, PyTorch already installed)
# uv sync --extra rocm --no-build-isolation
# pip install flash-attn --no-build-isolation

Data preparation must be completed before training. Run these steps in order:
# Download 170 shards of ClimbMix-400B (enough for depth 24 at ratio 10)
python -m mesosfer.data.dataset -n 170

Downloads and interleaves CVE feeds, HuggingFace cybersec datasets, and local files.
Output: ~/.cache/mesosfer/base_data_cybersecurity/
python -m scripts.data.prepare_data
# Check progress
python -m scripts.data.prepare_data --status
# Dry-run to preview sources
python -m scripts.data.prepare_data --dry-run

Steps 2a and 2b can run in parallel in separate terminals.
Converts data/log/ and data/cloud/ files to NL narratives (prevents loss spikes).
Output: data/log_nl/ and data/cloud_nl/
python -m scripts.data.convert_logs_to_nl
# Preview first 3 documents without writing
python -m scripts.data.convert_logs_to_nl --dry-run

Step 2c can run in parallel with 2a and 2b. It only reads from data/ in the repo.
Train a 64K BPE tokenizer on the prepared data. Requires Step 2a to be complete.
python -m scripts.train.tok_train
# Evaluate tokenizer compression ratio vs GPT-2 and GPT-4
python -m scripts.eval.tok_eval

Requires Steps 2a, 2b, 2c, and 3 to be complete.
# Depth 24: recommended config for MI300X / single GPU
python -m scripts.train.base_train \
--depth=24 \
--target-param-data-ratio=10 \
--device-batch-size=32 \
--warmup-steps=200 \
--window-pattern=L \
--save-every=1000 \
--core-metric-every=5000 \
--run=d24_run
# Or use the full pipeline script (handles setup + tokenizer + pretrain + SFT)
WANDB_RUN=my_run bash runs/speedrun.sh

AMD ROCm note: Use --window-pattern=L (full attention). Sliding window (SSL, SSSL) is not yet supported in the ROCm FA2 Triton backward pass.
python -m scripts.eval.base_eval \
--model-tag d24 \
--device-batch-size 32

Requires Step 4 to be complete. Automatically includes cybersecurity SFT datasets from data/sft/.
python -m scripts.chat.chat_sft \
--device-batch-size=32 \
--run=sft_run
# Disable cybersec SFT (for ablation)
# python -m scripts.chat.chat_sft --disable-cybersec-sft

# CLI chat
python -m scripts.chat.chat_cli -p "Explain CVE-2021-44228 (Log4Shell)"
# Web UI (opens http://localhost:8000)
python -m scripts.chat.chat_web
# Web UI, multi-GPU (4 workers)
python -m scripts.chat.chat_web --num-gpus 4
# Web UI, load specific checkpoint
python -m scripts.chat.chat_web --model-tag d24 --step 14000

The Web UI includes:
- Welcome screen: centered input shown before the first message
- Syntax-highlighted code blocks: Python, Rust, JSON, Bash, and more, with a one-click copy button
- Markdown rendering: headings, lists, bold/italic, inline code
- Thumbs up / down feedback: per-response rating that saves to data/rlhf/feedback.jsonl
Human preference data is collected automatically while chatting via the Web UI.
Each 👍 or 👎 click on an assistant response appends a record to data/rlhf/feedback.jsonl.
Use this data to train a reward model or run DPO fine-tuning in a future step.
The Web UI server (scripts/chat/chat_web.py) exposes the following endpoints:
| Method | Path | Description |
|---|---|---|
| GET | / | Serve the chat UI (mesosfer/ui.html) |
| GET | /logo.svg | Serve the Mesosfer logo |
| GET | /interface/* | Serve static assets (CSS, JS) |
| POST | /chat/completions | Streaming chat completion (SSE) |
| POST | /feedback | Submit thumbs up/down feedback (saved to data/rlhf/feedback.jsonl) |
| GET | /health | Health check + worker pool status |
| GET | /stats | Worker pool statistics and GPU utilization |
POST /feedback
{
"message_index": 1,
"rating": "negative",
"reason": "factually_incorrect",
"comment": "Optional free-text",
"conversation": [
{ "role": "user", "content": "..." },
{ "role": "assistant", "content": "..." }
]
}

Valid reason values: inappropriate_response, continuous_repetition, factually_incorrect, too_verbose, formatting_issues, other.
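As a rough illustration of the request body above, the sketch below checks a payload before it is sent to POST /feedback. The `validate_feedback` helper is hypothetical and not part of the repository. It assumes that `rating` is either "positive" or "negative" (only "negative" appears above) and that `message_index` is a zero-based index into `conversation`; the server may enforce different rules.

```python
# Hypothetical client-side check; the server's actual validation may differ.
VALID_REASONS = {
    "inappropriate_response", "continuous_repetition", "factually_incorrect",
    "too_verbose", "formatting_issues", "other",
}
VALID_RATINGS = {"positive", "negative"}  # assumption: only "negative" is documented


def validate_feedback(payload: dict) -> list[str]:
    """Return a list of problems found in a feedback payload (empty if none)."""
    problems = []
    if payload.get("rating") not in VALID_RATINGS:
        problems.append("rating must be 'positive' or 'negative'")
    reason = payload.get("reason")
    if reason is not None and reason not in VALID_REASONS:
        problems.append(f"unknown reason: {reason!r}")
    conversation = payload.get("conversation") or []
    index = payload.get("message_index")
    # Assumption: message_index is a zero-based position in the conversation list.
    if not isinstance(index, int) or not 0 <= index < len(conversation):
        problems.append("message_index must point into conversation")
    elif conversation[index].get("role") != "assistant":
        problems.append("message_index must refer to an assistant message")
    return problems
```

An empty list means the payload passed every check in this sketch; it does not guarantee that the server will accept it.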
Step 2a: mesosfer.data.dataset -n 170     ─┐
Step 2b: scripts.data.prepare_data        ─┤ can run in parallel
Step 2c: scripts.data.convert_logs_to_nl  ─┘
    ↓
Step 3: scripts.train.tok_train
        scripts.eval.tok_eval
    ↓
Step 4: scripts.train.base_train
    ↓
Step 5: scripts.eval.base_eval
    ↓
Step 6: scripts.chat.chat_sft
    ↓
Step 7: scripts.chat.chat_cli / chat_web
    ↓
Step 8: Collect RLHF feedback via Web UI → data/rlhf/feedback.jsonl
        (future: reward model training / DPO)
mesosfer/
├── mesosfer/                    # Library (importable)
│   ├── model/                   # GPT model, attention, optimization
│   ├── data/                    # Dataset download, dataloader, tokenizer
│   ├── eval/                    # CORE eval, BPB, engine
│   ├── utils/                   # Common utilities, checkpointing, reporting
│   ├── interface/               # Web UI static assets
│   │   ├── style.css            # All UI styles (chat, code blocks, feedback, empty state)
│   │   ├── chat.js              # Chat logic, streaming, slash commands, markdown rendering
│   │   ├── feedback.js          # Thumbs up/down, feedback modal, POST /feedback
│   │   └── markdown.js          # Markdown parser + syntax-highlighted code block renderer
│   └── ui.html                  # HTML shell (loads interface/ assets)
├── scripts/
│   ├── train/                   # base_train.py, tok_train.py
│   ├── chat/                    # chat_sft.py, chat_rl.py, chat_cli.py, chat_web.py
│   ├── eval/                    # base_eval.py, tok_eval.py
│   └── data/                    # prepare_data.py, convert_logs_to_nl.py
├── tasks/                       # Eval tasks (MMLU, GSM8K, cybersec_sft, cybersec_rl)
├── data/
│   ├── log/                     # Raw security logs (input to convert_logs_to_nl)
│   ├── cloud/                   # Raw cloud audit logs (input to convert_logs_to_nl)
│   ├── log_nl/                  # NL narratives from logs (output, used in training)
│   ├── cloud_nl/                # NL narratives from cloud logs (output, used in training)
│   ├── sft/                     # Cybersecurity SFT conversations
│   ├── rlhf/                    # Human feedback collected via Web UI
│   │   └── feedback.jsonl       # Appended at runtime (gitignored)
│   ├── synthetic-ir/            # Synthetic incident response data
│   ├── synthetic-soc/           # Synthetic SOC analyst data
│   └── reverse-engineering/     # RE/exploitation analysis data
├── runs/                        # Training run scripts (speedrun, miniseries, etc.)
└── tests/                       # Unit tests
| Script | Description | Hardware |
|---|---|---|
| runs/speedrun.sh | Full pipeline: pretrain + SFT + eval | 8x H100/A100 or 1x MI300X |
| runs/scaling_laws.sh | Research: optimal model configs | 8x GPU |
| runs/miniseries.sh | Train multiple depths (12-26) | 8x GPU |
| runs/runcpu.sh | Demo on CPU/MacBook | CPU/MPS |
| Argument | Default | Description |
|---|---|---|
| --depth | 20 | Transformer depth (24 recommended) |
| --aspect-ratio | 64 | Model dimension = depth × 64 |
| --head-dim | 128 | Attention head dimension |
| --max-seq-len | 2048 | Maximum context length |
| --device-batch-size | 32 | Batch size per GPU |
| --target-param-data-ratio | 12 | Data-to-parameters ratio (10 recommended) |
| --warmup-steps | 200 | LR warmup steps (200 for depth 20+) |
| --window-pattern | L | Attention window: L=full, S=sliding (L required for ROCm) |
| --save-every | 1000 | Checkpoint every N steps |
| --core-metric-every | 5000 | CORE eval every N steps |
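For readers who want to script around these flags, the sketch below declares the same arguments and defaults with `argparse`. It is only an illustration of the table above: `build_parser` is a hypothetical name, the argument types are assumptions, and the real `scripts.train.base_train` may define its options differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented base_train flags and defaults (types are assumed)."""
    parser = argparse.ArgumentParser(description="Sketch of the documented training flags")
    parser.add_argument("--depth", type=int, default=20, help="Transformer depth")
    parser.add_argument("--aspect-ratio", type=int, default=64, help="model_dim = depth * aspect_ratio")
    parser.add_argument("--head-dim", type=int, default=128, help="Attention head dimension")
    parser.add_argument("--max-seq-len", type=int, default=2048, help="Maximum context length")
    parser.add_argument("--device-batch-size", type=int, default=32, help="Batch size per GPU")
    parser.add_argument("--target-param-data-ratio", type=float, default=12, help="Tokens per scaling parameter")
    parser.add_argument("--warmup-steps", type=int, default=200, help="LR warmup steps")
    parser.add_argument("--window-pattern", type=str, default="L", help="L=full, S=sliding")
    parser.add_argument("--save-every", type=int, default=1000, help="Checkpoint every N steps")
    parser.add_argument("--core-metric-every", type=int, default=5000, help="CORE eval every N steps")
    return parser


# Example: the recommended depth-24 settings override two of the defaults.
args = build_parser().parse_args(["--depth=24", "--target-param-data-ratio=10"])
print(args.depth, args.target_param_data_ratio, args.window_pattern)
```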
# Backend selection (auto-detected if not set)
export mesosfer_TORCH_BACKEND=cuda # cuda, rocm, cpu
# Compute dtype override
export mesosfer_DTYPE=bfloat16 # bfloat16, float16, float32
# Cache directory (datasets, checkpoints, tokenizer)
export mesosfer_BASE_DIR="$HOME/.cache/mesosfer"
# Wandb logging
export WANDB_RUN=my_training_run

See DATASET.md for dataset sources, sampling weights, and token budgets. See DATASET2.md for advanced configuration and dynamic source definitions.
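The environment variables listed above can also be inspected from Python, for example in a launcher script. The following is a minimal sketch: `read_settings` is a hypothetical helper, and the fallback values for the dtype and cache directory are assumptions taken from the example exports rather than confirmed defaults.

```python
import os


def read_settings(env=os.environ) -> dict:
    """Collect the documented environment variables into one dictionary."""
    return {
        # None means "not set"; the project auto-detects the backend in that case.
        "backend": env.get("mesosfer_TORCH_BACKEND"),
        # Assumed fallback: bfloat16, as in the example export.
        "dtype": env.get("mesosfer_DTYPE", "bfloat16"),
        # Assumed fallback: the cache path used in the example export.
        "base_dir": env.get("mesosfer_BASE_DIR", os.path.expanduser("~/.cache/mesosfer")),
        "wandb_run": env.get("WANDB_RUN"),
    }


print(read_settings({"mesosfer_DTYPE": "float16"}))
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test.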
- Python 3.12+
- PyTorch 2.6.0+ (ROCm 7.0) or 2.9.1+ (CUDA 12.8)
- NVIDIA GPU (CUDA 12.8+) or AMD GPU (ROCm 7.0+)
- 16GB+ GPU VRAM (32GB+ recommended for depth 24)
# Install dev dependencies
uv sync --dev
# Run tests
pytest tests/
# Format code
ruff format .
# Lint
ruff check .

| Model | Parameters | Tokens | Validation BPB | CORE Score |
|---|---|---|---|---|
| GPT-2 (reference) | ~124M | ~5B | ~0.97 | ~25 |
| mesosfer d24 (ratio=10) | ~1.38B | ~7.3B | 0.7337 | 0.2541 |
CORE score: average of MMLU (5-shot), GSM8K (CoT), ARC-C, and HumanEval (pass@1). Results are for the base model only.
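Validation BPB is bits per byte: the model's cross-entropy on the validation text, converted from nats to bits and divided by the number of UTF-8 bytes rather than the number of tokens, which makes it comparable across tokenizers. The sketch below shows only this standard conversion; `bits_per_byte` is a hypothetical helper and the repository's own evaluation code may compute it differently.

```python
import math


def bits_per_byte(total_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) over a text into bits per byte."""
    return total_nats / (math.log(2) * total_bytes)


# Illustrative numbers only: mean loss of 2.5 nats/token over 1,000 tokens
# that together cover 4,800 bytes of text.
mean_loss, num_tokens, num_bytes = 2.5, 1_000, 4_800
print(f"{bits_per_byte(mean_loss * num_tokens, num_bytes):.4f} bits per byte")
```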
The architecture is designed to scale without fundamental changes. Model size is controlled by two arguments:
model_dim = depth × aspect_ratio (rounded up to nearest multiple of head_dim)
num_heads = model_dim / head_dim
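The two formulas above translate directly into code. This is a minimal sketch with a hypothetical function name, `model_shape`; the default values are the documented defaults for `--aspect-ratio` and `--head-dim`.

```python
def model_shape(depth: int, aspect_ratio: int = 64, head_dim: int = 128):
    """Return (model_dim, num_heads) following the two formulas above."""
    raw_dim = depth * aspect_ratio
    # Round up to the nearest multiple of head_dim (ceiling division).
    model_dim = -(-raw_dim // head_dim) * head_dim
    num_heads = model_dim // head_dim
    return model_dim, num_heads


print(model_shape(24))       # (1536, 12)
print(model_shape(32, 128))  # (4096, 32)
print(model_shape(13))       # 13 * 64 = 832, rounded up to (896, 7)
```

With the default aspect ratio of 64 and head dimension of 128, rounding only changes the result for odd depths.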
| Depth | Aspect Ratio | Model Dim | ~Total Params | Dataset needed (ratio=10) | Covered by ~8.5B dataset? |
|---|---|---|---|---|---|
| 16 | 64 | 1024 | ~0.9B | ~2.7B | Yes |
| 18 | 64 | 1152 | ~1.2B | ~3.6B | Yes |
| 20 | 64 | 1280 | ~1.5B | ~4.8B | Yes |
| 22 | 64 | 1408 | ~1.8B | ~6.2B | Yes |
| 24 | 64 | 1536 | ~2.2B | ~7.8B | Yes |
| 28 | 64 | 1792 | ~3.1B | ~12.0B | No, need more data |
| 32 | 128 | 4096 | ~11.5B | ~67B | No, need more data |
| 36 | 128 | 4608 | ~15.5B | ~95B | No, need more data |
| 40 | 128 | 5120 | ~20.3B | ~129B | No, need more data |
| 44 | 128 | 5632 | ~26.0B | ~171B | No, need more data |
| 48 | 128 | 6144 | ~32.6B | ~222B | No, need more data |
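The "Dataset needed" column can be approximated with a short calculation. The sketch below is an estimate, not the project's own parameter counter: it assumes 12 × model_dim² weights per layer (4 × d² for attention, 8 × d² for an MLP with 4x expansion), an untied lm_head, and a vocabulary of 65,536 tokens. With GQA the real attention matrices are somewhat smaller, but under these assumptions the result matches the table for depths 16 to 28.

```python
VOCAB_SIZE = 65_536  # assumption: "64K vocab" means 2**16 tokens


def scaling_params(depth: int, model_dim: int, vocab_size: int = VOCAB_SIZE) -> int:
    """Approximate transformer matrices + lm_head (embeddings excluded)."""
    per_layer = 12 * model_dim ** 2  # 4*d^2 attention + 8*d^2 MLP (4x expansion)
    lm_head = vocab_size * model_dim
    return depth * per_layer + lm_head


def tokens_needed(depth: int, model_dim: int, ratio: float = 10) -> float:
    """Training tokens = scaling parameters * data-to-parameters ratio."""
    return scaling_params(depth, model_dim) * ratio


for depth, dim in [(16, 1024), (24, 1536), (28, 1792)]:
    print(f"depth {depth}: ~{tokens_needed(depth, dim) / 1e9:.1f}B tokens at ratio 10")
```

Passing `ratio=20` gives the Chinchilla-style budget, which is exactly twice the ratio-10 figure.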
"Dataset needed" = scaling params Γ ratio=10. Scaling params = transformer matrices + lm_head (excludes embeddings). Current dataset (~8.5B tokens) fully covers depth 16β24. For depth 28+, additional data sources are required. For Chinchilla-optimal training (ratio=20), double the token requirements above.
To train a larger model, simply pass the desired depth and aspect ratio:
# ~3B model (depth 28, aspect-ratio 64)
python -m scripts.train.base_train \
--depth=28 \
--aspect-ratio=64 \
--head-dim=128 \
--target-param-data-ratio=10 \
--run=d28_run
# ~7B model (depth 40, aspect-ratio 64)
python -m scripts.train.base_train \
--depth=40 \
--aspect-ratio=64 \
--head-dim=128 \
--target-param-data-ratio=10 \
--run=d40_run

Features like GQA, Flash Attention 2/3, RoPE, RMSNorm, and BF16/FP8 are already implemented and designed for large-scale training. The main practical constraints for 7B+ models are data volume (the current ~8.5B-token dataset covers up to ~850M scaling params at ratio=10) and hardware (a 7B model in BF16 needs ~14GB for weights alone, plus optimizer state; 2-4× A100/H100 80GB or 1× MI300X is recommended).
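The ~14GB figure follows from BF16 storing each parameter in 2 bytes. A minimal sketch of that arithmetic is shown below; `weight_memory_gb` is a hypothetical helper, and it deliberately leaves out optimizer state, gradients, and activations, whose size depends on the optimizer and batch configuration.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone; BF16 stores each parameter in 2 bytes."""
    return num_params * bytes_per_param / 1e9


print(weight_memory_gb(7e9))     # 14.0 (BF16)
print(weight_memory_gb(7e9, 4))  # 28.0 (FP32, for comparison)
```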
- Andrej Karpathy: nanoGPT inspiration
- PyTorch: Deep learning framework
- FlashAttention: Efficient attention
- Muon optimizer: Matrix parameter optimization
MIT License
Example record in data/rlhf/feedback.jsonl:
{ "timestamp": "2026-05-20T10:00:00+00:00", "message_index": 1, "rating": "negative", "reason": "factually_incorrect", "comment": "The CVE number is wrong.", "conversation": [ { "role": "user", "content": "..." }, { "role": "assistant", "content": "..." } ] }
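Because data/rlhf/feedback.jsonl holds one JSON object per line, the collected feedback can be summarized with the standard library alone. The sketch below assumes every line follows the record shape shown in this README; `summarize_feedback` is a hypothetical helper, not a script shipped with the repository.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_feedback(path):
    """Count ratings and reasons across all records in a JSONL feedback file."""
    ratings, reasons = Counter(), Counter()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        ratings[record.get("rating")] += 1
        if record.get("reason"):
            reasons[record["reason"]] += 1
    return ratings, reasons
```

Calling `summarize_feedback("data/rlhf/feedback.jsonl")` returns two `Counter` objects, which is a quick way to see how much preference data has accumulated before training a reward model or running DPO.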