Skip to content

PabloRaka/mesosfer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

76 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Mesosfer

A minimal full-stack cybersecurity-focused language model β€” from pretraining to chat.

Mesosfer is inspired by nanoGPT and follows Andrej Karpathy's educational approach to LLM training: clean, readable code that actually works. It is specifically optimized for cybersecurity and secure coding domains.

Features

  • Pretraining β€” Train base models from scratch using scaling laws
  • Tokenizer Training β€” Custom BPE tokenizer (64K vocab, cybersec-aware)
  • Supervised Fine-Tuning (SFT) β€” Instruction tuning with cybersecurity datasets
  • Reinforcement Learning (RL) β€” GRPO-style RL on cybersecurity tasks
  • RLHF Data Collection β€” Human feedback via thumbs up/down UI, stored to data/rlhf/
  • Evaluation β€” CORE benchmark + cybersecurity domain probes
  • Chat Interface β€” CLI and WebUI for model interaction
    • Syntax-highlighted code blocks (Python, Rust, JSON, etc.)
    • Markdown rendering (headings, lists, bold/italic, inline code)
    • Welcome screen with centered input on new conversation

Architecture Highlights

Feature Implementation
Attention Group-Query Attention (GQA) with Flash Attention 2/3
Positional Encoding Rotary embeddings (RoPE)
Activation ReLUΒ²
Normalization RMSNorm (no learnable parameters)
Optimizer MuonAdamW (Muon for matrices, AdamW for embeddings)
Training Precision BF16 (all GPUs) / FP8 (Hopper/H100 only)
Value Embeddings ResFormer-style alternating layers

Hardware Support

Vendor GPUs Backend
NVIDIA H100, H200, B200, A100, L40S, L4, RTX 4090, T4 CUDA
AMD MI355X, MI350X, MI325X, MI300X, MI250X, MI210 ROCm 7.0+
Apple M1/M2/M3 Pro/Max MPS
Intel/x86 CPU CPU

For detailed setup instructions, see:


Quick Start

1. Install Dependencies

# Clone the repository
git clone https://github.com/your-repo/mesosfer.git
cd mesosfer

# Install uv (if needed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment
uv venv --python 3.12
source .venv/bin/activate

# NVIDIA CUDA
uv sync --extra gpu

# AMD ROCm 7.0 (pre-built image β€” PyTorch already installed)
# uv sync --extra rocm --no-build-isolation
# pip install flash-attn --no-build-isolation

2. Prepare Data

Data preparation must be completed before training. Run these steps in order:

Step 2a β€” Download ClimbMix general pretraining data (~17 GB)

# Download 170 shards of ClimbMix-400B (enough for depth 24 at ratio 10)
python -m mesosfer.data.dataset -n 170

Step 2b β€” Prepare cybersecurity dataset

Downloads and interleaves CVE feeds, HuggingFace cybersec datasets, and local files. Output: ~/.cache/mesosfer/base_data_cybersecurity/

python -m scripts.data.prepare_data

# Check progress
python -m scripts.data.prepare_data --status

# Dry-run to preview sources
python -m scripts.data.prepare_data --dry-run

Steps 2a and 2b can run in parallel in separate terminals.

Step 2c β€” Convert raw security logs to natural language

Converts data/log/ and data/cloud/ files to NL narratives (prevents loss spikes). Output: data/log_nl/ and data/cloud_nl/

python -m scripts.data.convert_logs_to_nl

# Preview first 3 documents without writing
python -m scripts.data.convert_logs_to_nl --dry-run

Step 2c can run in parallel with 2a and 2b. It only reads from data/ in the repo.


3. Train Tokenizer

Train a 64K BPE tokenizer on the prepared data. Requires Step 2a to be complete.

python -m scripts.train.tok_train

# Evaluate tokenizer compression ratio vs GPT-2 and GPT-4
python -m scripts.eval.tok_eval

4. Pretrain Base Model

Requires Steps 2a, 2b, 2c, and 3 to be complete.

# Depth 24 β€” recommended config for MI300X / single GPU
python -m scripts.train.base_train \
    --depth=24 \
    --target-param-data-ratio=10 \
    --device-batch-size=32 \
    --warmup-steps=200 \
    --window-pattern=L \
    --save-every=1000 \
    --core-metric-every=5000 \
    --run=d24_run

# Or use the full pipeline script (handles setup + tokenizer + pretrain + SFT)
WANDB_RUN=my_run bash runs/speedrun.sh

AMD ROCm note: Use --window-pattern=L (full attention). Sliding window (SSL, SSSL) is not yet supported in the ROCm FA2 Triton backward pass.


5. Evaluate Base Model

python -m scripts.eval.base_eval \
    --model-tag d24 \
    --device-batch-size 32

6. Supervised Fine-Tuning (SFT)

Requires Step 4 to be complete. Automatically includes cybersecurity SFT datasets from data/sft/.

python -m scripts.chat.chat_sft \
    --device-batch-size=32 \
    --run=sft_run

# Disable cybersec SFT (for ablation)
# python -m scripts.chat.chat_sft --disable-cybersec-sft

7. Chat with Your Model

# CLI chat
python -m scripts.chat.chat_cli -p "Explain CVE-2021-44228 (Log4Shell)"

# Web UI (opens http://localhost:8000)
python -m scripts.chat.chat_web

# Web UI β€” multi-GPU (4 workers)
python -m scripts.chat.chat_web --num-gpus 4

# Web UI β€” load specific checkpoint
python -m scripts.chat.chat_web --model-tag d24 --step 14000

The Web UI includes:

  • Welcome screen β€” centered input shown before the first message
  • Syntax-highlighted code blocks β€” Python, Rust, JSON, Bash, and more, with a one-click copy button
  • Markdown rendering β€” headings, lists, bold/italic, inline code
  • Thumbs up / down feedback β€” per-response rating that saves to data/rlhf/feedback.jsonl

8. Collect RLHF Feedback

Human preference data is collected automatically while chatting via the Web UI. Each πŸ‘ or πŸ‘Ž click on an assistant response appends a record to data/rlhf/feedback.jsonl.

{
  "timestamp": "2026-05-20T10:00:00+00:00",
  "message_index": 1,
  "rating": "negative",
  "reason": "factually_incorrect",
  "comment": "The CVE number is wrong.",
  "conversation": [
    { "role": "user",      "content": "..." },
    { "role": "assistant", "content": "..." }
  ]
}

Use this data to train a reward model or run DPO fine-tuning in a future step.


Web UI API Endpoints

The Web UI server (scripts/chat/chat_web.py) exposes the following endpoints:

Method Path Description
GET / Serve the chat UI (mesosfer/ui.html)
GET /logo.svg Serve the Mesosfer logo
GET /interface/* Serve static assets (CSS, JS)
POST /chat/completions Streaming chat completion (SSE)
POST /feedback Submit thumbs up/down feedback (saved to data/rlhf/feedback.jsonl)
GET /health Health check + worker pool status
GET /stats Worker pool statistics and GPU utilization

Feedback endpoint payload

POST /feedback
{
  "message_index": 1,
  "rating": "negative",
  "reason": "factually_incorrect",
  "comment": "Optional free-text",
  "conversation": [
    { "role": "user",      "content": "..." },
    { "role": "assistant", "content": "..." }
  ]
}

Valid reason values: inappropriate_response, continuous_repetition, factually_incorrect, too_verbose, formatting_issues, other.


Full Pipeline Summary

Step 2a: mesosfer.data.dataset -n 170     ─┐
Step 2b: scripts.data.prepare_data         β”œβ”€ can run in parallel
Step 2c: scripts.data.convert_logs_to_nl  β”€β”˜
         ↓
Step 3:  scripts.train.tok_train
         scripts.eval.tok_eval
         ↓
Step 4:  scripts.train.base_train
         ↓
Step 5:  scripts.eval.base_eval
         ↓
Step 6:  scripts.chat.chat_sft
         ↓
Step 7:  scripts.chat.chat_cli / chat_web
         ↓
Step 8:  Collect RLHF feedback via Web UI  β†’  data/rlhf/feedback.jsonl
         (future: reward model training / DPO)

Project Structure

mesosfer/
β”œβ”€β”€ mesosfer/               # Library (importable)
β”‚   β”œβ”€β”€ model/              # GPT model, attention, optimization
β”‚   β”œβ”€β”€ data/               # Dataset download, dataloader, tokenizer
β”‚   β”œβ”€β”€ eval/               # CORE eval, BPB, engine
β”‚   β”œβ”€β”€ utils/              # Common utilities, checkpointing, reporting
β”‚   β”œβ”€β”€ interface/          # Web UI static assets
β”‚   β”‚   β”œβ”€β”€ style.css       # All UI styles (chat, code blocks, feedback, empty state)
β”‚   β”‚   β”œβ”€β”€ chat.js         # Chat logic, streaming, slash commands, markdown rendering
β”‚   β”‚   β”œβ”€β”€ feedback.js     # Thumbs up/down, feedback modal, POST /feedback
β”‚   β”‚   └── markdown.js     # Markdown parser + syntax-highlighted code block renderer
β”‚   └── ui.html             # HTML shell (loads interface/ assets)
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ train/              # base_train.py, tok_train.py
β”‚   β”œβ”€β”€ chat/               # chat_sft.py, chat_rl.py, chat_cli.py, chat_web.py
β”‚   β”œβ”€β”€ eval/               # base_eval.py, tok_eval.py
β”‚   └── data/               # prepare_data.py, convert_logs_to_nl.py
β”œβ”€β”€ tasks/                  # Eval tasks (MMLU, GSM8K, cybersec_sft, cybersec_rl)
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ log/                # Raw security logs (input to convert_logs_to_nl)
β”‚   β”œβ”€β”€ cloud/              # Raw cloud audit logs (input to convert_logs_to_nl)
β”‚   β”œβ”€β”€ log_nl/             # NL narratives from logs (output, used in training)
β”‚   β”œβ”€β”€ cloud_nl/           # NL narratives from cloud logs (output, used in training)
β”‚   β”œβ”€β”€ sft/                # Cybersecurity SFT conversations
β”‚   β”œβ”€β”€ rlhf/               # Human feedback collected via Web UI
β”‚   β”‚   └── feedback.jsonl  # Appended at runtime (gitignored)
β”‚   β”œβ”€β”€ synthetic-ir/       # Synthetic incident response data
β”‚   β”œβ”€β”€ synthetic-soc/      # Synthetic SOC analyst data
β”‚   └── reverse-engineering/ # RE/exploitation analysis data
β”œβ”€β”€ runs/                   # Training run scripts (speedrun, miniseries, etc.)
└── tests/                  # Unit tests

Training Scripts

Script Description Hardware
runs/speedrun.sh Full pipeline: pretrain + SFT + eval 8x H100/A100 or 1x MI300X
runs/scaling_laws.sh Research: optimal model configs 8x GPU
runs/miniseries.sh Train multiple depths (12-26) 8x GPU
runs/runcpu.sh Demo on CPU/MacBook CPU/MPS

Key Training Parameters

Argument Default Description
--depth 20 Transformer depth (24 recommended)
--aspect-ratio 64 Model dimension = depth Γ— 64
--head-dim 128 Attention head dimension
--max-seq-len 2048 Maximum context length
--device-batch-size 32 Batch size per GPU
--target-param-data-ratio 12 Data-to-parameters ratio (10 recommended)
--warmup-steps 200 LR warmup steps (200 for depth 20+)
--window-pattern L Attention window: L=full, S=sliding (L required for ROCm)
--save-every 1000 Checkpoint every N steps
--core-metric-every 5000 CORE eval every N steps

Configuration

Environment Variables

# Backend selection (auto-detected if not set)
export mesosfer_TORCH_BACKEND=cuda    # cuda, rocm, cpu

# Compute dtype override
export mesosfer_DTYPE=bfloat16        # bfloat16, float16, float32

# Cache directory (datasets, checkpoints, tokenizer)
export mesosfer_BASE_DIR="$HOME/.cache/mesosfer"

# Wandb logging
export WANDB_RUN=my_training_run

Dataset Configuration

See DATASET.md for dataset sources, sampling weights, and token budgets. See DATASET2.md for advanced configuration and dynamic source definitions.


Requirements

  • Python 3.12+
  • PyTorch 2.6.0+ (ROCm 7.0) or 2.9.1+ (CUDA 12.8)
  • NVIDIA GPU (CUDA 12.8+) or AMD GPU (ROCm 7.0+)
  • 16GB+ GPU VRAM (32GB+ recommended for depth 24)

Development

# Install dev dependencies
uv sync --dev

# Run tests
pytest tests/

# Format code
ruff format .

# Lint
ruff check .

Model Performance

Model Parameters Tokens Validation BPB CORE Score
GPT-2 (reference) ~124M ~5B ~0.97 ~25
mesosfer d24 (ratio=10) ~1.38B ~7.3B 0.7337 0.2541

CORE score: average of MMLU (5-shot), GSM8K (COT), ARC-C, HumanEval (pass@1), Only base model


Scaling to Larger Models

The architecture is designed to scale without fundamental changes. Model size is controlled by two arguments:

model_dim = depth Γ— aspect_ratio  (rounded up to nearest multiple of head_dim)
num_heads = model_dim / head_dim

Depth Γ— Aspect Ratio β†’ Parameter Count

Depth Aspect Ratio Model Dim ~Total Params Dataset needed (ratio=10) Covered by ~8.5B dataset?
16 64 1024 ~0.9B ~2.7B βœ…
18 64 1152 ~1.2B ~3.6B βœ…
20 64 1280 ~1.5B ~4.8B βœ…
22 64 1408 ~1.8B ~6.2B βœ…
24 64 1536 ~2.2B ~7.8B βœ…
28 64 1792 ~3.1B ~12.0B ❌ need more data
32 128 4096 ~11.5B ~67B ❌ need more data
36 128 4608 ~15.5B ~95B ❌ need more data
40 128 5120 ~20.3B ~129B ❌ need more data
44 128 5632 ~26.0B ~171B ❌ need more data
48 128 6144 ~32.6B ~222B ❌ need more data

"Dataset needed" = scaling params Γ— ratio=10. Scaling params = transformer matrices + lm_head (excludes embeddings). Current dataset (~8.5B tokens) fully covers depth 16–24. For depth 28+, additional data sources are required. For Chinchilla-optimal training (ratio=20), double the token requirements above.

To train a larger model, simply pass the desired depth and aspect ratio:

# ~3B model (depth 28, aspect-ratio 64)
python -m scripts.train.base_train \
    --depth=28 \
    --aspect-ratio=64 \
    --head-dim=128 \
    --target-param-data-ratio=10 \
    --run=d28_run

# ~7B model (depth 40, aspect-ratio 64)
python -m scripts.train.base_train \
    --depth=40 \
    --aspect-ratio=64 \
    --head-dim=128 \
    --target-param-data-ratio=10 \
    --run=d40_run

Features like GQA, Flash Attention 2/3, RoPE, RMSNorm, and BF16/FP8 are already implemented and designed for large-scale training. The main practical constraints for 7B+ models are data volume (current dataset ~8.5B tokens covers up to ~850M scaling params at ratio=10) and hardware (7B in BF16 requires ~14GB weights + optimizer state, recommend 2–4Γ— A100/H100 80GB or 1Γ— MI300X).


Acknowledgments

License

MIT License

About

Mesosfer: A minimal full-stack ChatGPT clone specializing in cybersecurity and secure coding. Multi-backend LLM training (NVIDIA CUDA / AMD ROCm / Apple MPS).

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors