2sophia/skylar

SKYLAR

🧠 A from-scratch LLM training framework: 6M to 128B parameters, one codebase.

Python PyTorch HuggingFace License


Built from first principles. No black boxes. Every layer, every rotation, every gradient is explicit.

by A. Ivanovitch, CEO MwSpace · CTO Sophia AI


Skylar is a research-grade, production-ready decoder-only Transformer framework that implements the same architectural blueprint used by LLaMA 3, Mistral, and Qwen3, written from scratch in PyTorch with no intermediate abstraction layers. Every component is auditable, every parameter traceable, every design choice documented.


⚡ Quick Start

# ── Clone & install ──────────────────────────────────────
git clone https://github.com/mwspace/skylar.git && cd skylar
python -m venv .venv && source .venv/bin/activate
pip install -e .

# ── Train a model from scratch ───────────────────────────
python training/bin.pretrain.py --preset small --bf16

# ── Chat with your model ─────────────────────────────────
python inference/bin.chat.py --model checkpoints_sft/best

🏗️ Architecture at a Glance

┌─────────────────────────────────────────────────────────────┐
│                        SKYLAR v3.0                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Input Tokens ──► Token Embedding ──► Dropout               │
│                        │                                    │
│                        ▼                                    │
│           ┌────────────────────────┐                        │
│           │ × L Transformer Blocks │                        │
│           │  ┌──────────────────┐  │                        │
│           │  │ RMSNorm (Pre)    │  │                        │
│           │  │ GQA + RoPE + QKN │──┤◄── FlexAttention /     │
│           │  │ Residual Add     │  │    SDPA / Causal       │
│           │  ├──────────────────┤  │                        │
│           │  │ RMSNorm (Pre)    │  │                        │
│           │  │ SwiGLU FFN       │  │                        │
│           │  │ Residual Add     │  │                        │
│           │  └──────────────────┘  │                        │
│           └────────────────────────┘                        │
│                        │                                    │
│                        ▼                                    │
│              Final RMSNorm ──► LM Head ──► Logits           │
│                                    │                        │
│                              (optional µP                   │
│                               logit scaling)                │
│                                                             │
└─────────────────────────────────────────────────────────────┘
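
The second half of each block in the diagram (RMSNorm, SwiGLU FFN, residual add) can be sketched on plain Python lists. This is a minimal illustration only: the function names, the toy weights, and the bias-free projections are assumptions made for this example, not Skylar's actual modules.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by a learned gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def silu(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(matrix, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """down(silu(gate(x)) * up(x)), with no bias terms (an assumption)."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

def ffn_sublayer(x, norm_weight, w_gate, w_up, w_down):
    """Pre-norm residual: x + FFN(RMSNorm(x))."""
    update = swiglu_ffn(rms_norm(x, norm_weight), w_gate, w_up, w_down)
    return [a + b for a, b in zip(x, update)]

# Toy sizes (d_model = 2, d_ff = 3); the weights are arbitrary illustrative values.
x = [3.0, 4.0]
normed = rms_norm(x, [1.0, 1.0])
w_gate = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
w_up = [[0.2, 0.1], [-0.4, 0.3], [0.6, 0.0]]
w_down = [[0.5, -0.2, 0.1], [0.0, 0.3, 0.7]]
out = ffn_sublayer(x, [1.0, 1.0], w_gate, w_up, w_down)
print(normed, out)
```

The attention sub-layer follows the same pattern: normalize first, apply the sub-layer, then add the result back to the residual stream.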

🔬 Core Components

| Component | Implementation | Why |
|---|---|---|
| 🧊 Normalization | RMSNorm (pre-norm residual) | Faster than LayerNorm, no mean computation |
| 🌀 Positions | RoPE with configurable θ | Relative encoding, natural extrapolation |
| 🔗 Attention | GQA + QK-Norm + explicit d_head | Qwen3-identical, reduced KV-cache |
| ⚡ FFN | SwiGLU (swish(gate) × up) | +1-2% over GELU at same FLOPs |
| 🎭 Masking | FlexAttention block-sparse | O(T) memory for packed sequences |
| 📐 Scaling | µP (Maximal Update Param.) | Tune on 50M → transfer to 4B+ |
| 💾 KV-Cache | Full validation + GQA expansion | Past keys/values are reused; only the new token is projected at each generation step |
| 🤗 Interface | HuggingFace PreTrainedModel | Native save/load/hub integration |
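
To make the "reduced KV-cache" entry concrete, here is a rough sizing sketch. It assumes 2-byte (bf16) cache entries, batch size 1, and "32K" meaning 32,768 tokens; `kv_cache_bytes` is a hypothetical helper, not part of Skylar. The shape values come from the 8b preset row (36 layers, 32 query heads, 8 KV heads, d_head 128).

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2, batch=1):
    """Bytes needed to cache keys and values for every layer and position."""
    # The leading 2 accounts for the two cached tensors per layer (K and V).
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value * batch

seq_len = 32 * 1024
gqa = kv_cache_bytes(n_layers=36, n_kv_heads=8, d_head=128, seq_len=seq_len)
mha = kv_cache_bytes(n_layers=36, n_kv_heads=32, d_head=128, seq_len=seq_len)

print(f"GQA (8 KV heads):  {gqa / 2**30:.1f} GiB")   # 4.5 GiB
print(f"MHA (32 KV heads): {mha / 2**30:.1f} GiB")   # 18.0 GiB
```

Under these assumptions, sharing each KV head across four query heads cuts the cache to a quarter of the full multi-head size.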

📊 Model Family: 15 Presets, Over 4 Orders of Magnitude

  6M ──────────────────────────────────────────────────── 128B
  │                                                        │
 test  small  medium  medium+  gold  large  1B  4B  8B  14B  32B  64B  96B  128B
  │      │       │        │      │     │    │   │   │    │    │    │    │     │
 CPU  RTX4090 RTX4090 RTX4090 RTX4090 H100 H100 DGX DGX  DGX  DGX  DGX  DGX  DGX
📋 Full Preset Table

| Preset | Params | d_model | Heads | KV Heads | d_head | Layers | d_ff | Context | θ_rope | Qwen3 Match |
|---|---|---|---|---|---|---|---|---|---|---|
| test | ~6M | 128 | 4 | 4 | 32 | 4 | 256 | 4K | 100K | - |
| small | ~40M | 512 | 8 | 4 | 64 | 8 | 1024 | 8K | 500K | - |
| small+ | ~70M | 640 | 10 | 5 | 64 | 10 | 1792 | 16K | 1M | - |
| medium | ~107M | 768 | 12 | 4 | 64 | 12 | 2048 | 16K | 1M | - |
| medium+ | ~236M | 1024 | 16 | 4 | 64 | 18 | 2816 | 16K | 1M | ⭐ prod |
| large | ~358M | 1024 | 16 | 4 | 64 | 28 | 2816 | 16K | 5M | - |
| gold | ~393M | 1280 | 10 | 2 | 128 | 20 | 3456 | 16K | 1M | - |
| 1B | ~1.0B | 1536 | 16 | 4 | 96 | 32 | 5120 | 32K | 5M | - |
| 4b | ~3.6B | 2560 | 32 | 8 | 128 | 36 | 9728 | 32K | 1M | ✅ Qwen3-4B |
| 8b | ~7.3B | 4096 | 32 | 8 | 128 | 36 | 12288 | 32K | 1M | ✅ Qwen3-8B |
| 14b | ~13.2B | 5120 | 40 | 8 | 128 | 40 | 17408 | 32K | 1M | ✅ Qwen3-14B |
| 32b | ~29.6B | 5120 | 64 | 8 | 128 | 64 | 25600 | 128K | 1M | ✅ Qwen3-32B |
| 64b | ~61B | 8192 | 64 | 8 | 128 | 72 | 28672 | 128K | 1M | - |
| 96b | ~99B | 10240 | 80 | 8 | 128 | 80 | 32768 | 128K | 1M | - |
| 128b | ~128B | 11264 | 88 | 8 | 128 | 92 | 32768 | 128K | 1M | - |

💡 The 4b through 32b presets are architecturally identical to the official Qwen3 configs (verified against HuggingFace config.json). Only vocab_size differs (40,960 vs 151,936).
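
The Params column can be roughly cross-checked from the other columns. The sketch below is an estimate under stated assumptions, not Skylar's own counting code: bias-free linear layers, two RMSNorm gains per block plus QK-Norm gains of size d_head, and tied input/output embeddings. None of these are confirmed by this README, and `estimate_params` is a hypothetical helper.

```python
def estimate_params(vocab_size, d_model, n_heads, n_kv_heads, d_head,
                    n_layers, d_ff, tie_embeddings=True, qk_norm=True):
    """Approximate parameter count of a GQA + SwiGLU decoder (assumptions above)."""
    q_dim = n_heads * d_head
    kv_dim = n_kv_heads * d_head
    # Q and O projections use q_dim; K and V projections use the smaller kv_dim.
    attn = d_model * q_dim + 2 * d_model * kv_dim + q_dim * d_model
    # SwiGLU has three matrices: gate, up, down.
    ffn = 3 * d_model * d_ff
    norms = 2 * d_model + (2 * d_head if qk_norm else 0)
    embed = vocab_size * d_model
    lm_head = 0 if tie_embeddings else embed
    # The extra d_model term is the final RMSNorm gain.
    return n_layers * (attn + ffn + norms) + d_model + embed + lm_head

# "medium" row with the default 40,960 vocabulary.
medium = estimate_params(40_960, 768, 12, 4, 64, 12, 2048)
# "medium+" row with the 32,768 vocabulary mentioned for the released model.
medium_plus = estimate_params(32_768, 1024, 16, 4, 64, 18, 2816)

print(f"medium  ~{medium / 1e6:.0f}M")       # ~107M
print(f"medium+ ~{medium_plus / 1e6:.0f}M")  # ~236M
```

Larger presets may use untied embeddings or a different vocabulary, so this estimate can drift from the table there.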


🔄 Training Pipeline: one base, many products

One pretrained decoder backbone feeds an entire post-training suite. All discriminative/retrieval heads reuse the same weights via from_decoder(), with no re-pretraining.

                          ┌──────────────────────┐
                          │  📚 PRETRAIN (base)  │  causal LM on raw text
                          └───────────┬──────────┘
              ┌───────────────┬───────┴───────┬────────────────┐
              ▼               ▼               ▼                ▼
      ┌──────────────┐ ┌─────────────┐ ┌──────────────┐ ┌───────────────┐
      │ 💬 SFT (chat)│ │ 🔎 EMBEDDER │ │ 🧮 SPARSE    │ │ 🏷️ CLASSIFIER │
      │  ChatML      │ │  dense      │ │  SPLADE      │ │  BERT-style   │
      │  + ORPO/SimPO│ │ (InfoNCE)   │ │ (FLOPS reg)  │ │  (CE)         │
      └──────────────┘ └─────────────┘ └──────────────┘ └───────────────┘
        generative       dense vec       sparse vec       class logits
                          └────── hybrid retrieval ──────┘ (Qdrant)

| Stage | Script | Output |
|---|---|---|
| Pretrain | training/bin.pretrain.py | causal-LM base |
| SFT | training/bin.sft.py | ChatML chat model (assistant-only loss) |
| Preference | training/bin.preference.py | ORPO / SimPO (reference-free, replaces DPO) |
| Dense embedder | training/bin.contrastive.py | SkylarEmbedder (InfoNCE) |
| Sparse retriever | training/bin.sparse.py | SkylarSparseEncoder (SPLADE) |
| Classifier | training/bin.classify.py | SkylarClassifier (sequence classification) |

See docs/POSTTRAIN.md for the full recipe.
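
As a rough illustration of the InfoNCE objective named for the dense embedder, the sketch below uses in-batch negatives on plain Python lists. The cosine similarity, the temperature of 0.05, and the function names are assumptions for this example; the actual recipe is in docs/POSTTRAIN.md and may differ.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(queries, passages, temperature=0.05):
    """Mean cross-entropy where passages[i] is the positive for queries[i]
    and every other passage in the batch acts as a negative."""
    q = [l2_normalize(v) for v in queries]
    p = [l2_normalize(v) for v in passages]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, pj) / temperature for pj in p]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(q)

queries = [[1.0, 0.1], [0.1, 1.0]]
aligned = [[0.9, 0.0], [0.0, 0.8]]      # positives point the same way as their queries
misaligned = [[0.0, 0.8], [0.9, 0.0]]   # positives swapped

print(info_nce(queries, aligned), info_nce(queries, misaligned))
```

The loss is small when each query is closest to its own passage and large when the pairing is wrong, which is the signal the contrastive stage optimizes.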


📈 Validated Results: medium_plus (236M)

Trained from scratch on 1.12B tokens of Italian legal/normative text (4 epochs, ~19h on a single RTX 4090), then the full post-training suite. All four products come from the same 236M weights.

| Product | Metric |
|---|---|
| Base LM | val loss 2.16 · health-check perplexity 15.4 |
| Chat (SFT) | grounded tasks 6/6 correct (answer-from-context, classify, extract-JSON, refuse-when-absent) · clean stop 6/6 grounded, 12/13 full battery |
| Dense embedder | SQuAD-it test R@1 0.55 · nDCG@10 0.71 (open-domain) · R@1 0.93 in-domain |
| Sparse (SPLADE) | Recall@1 1.000 · ~5 non-zeros/query · interpretable lexical weights (in-domain) |
| Classifier | intent accuracy 1.00 (5-way banking, held-out) |

Public Italian generative benchmarks (likelihood-based MC, eval/bench_ita.py):

| Benchmark | medium (100M) | medium_plus (236M) | random |
|---|---|---|---|
| XCOPA-it (causal commonsense) | 0.546 | 0.562 ⭐ | 0.50 |
| HellaSwag-it | 0.279 | 0.292 | 0.25 |
| Belebele-it | 0.244 | 0.267 | 0.25 |
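
"Likelihood-based MC" generally means scoring each answer option by the log-probability the model assigns to it and picking the best. The sketch below shows that selection step on precomputed per-token log-probabilities. Whether eval/bench_ita.py length-normalizes the scores is not stated here, so `length_normalize` is an assumption, and `pick_choice` is a hypothetical helper.

```python
def pick_choice(option_token_logprobs, length_normalize=True):
    """Return the index of the option with the highest (mean or summed) log-likelihood."""
    scores = []
    for logprobs in option_token_logprobs:
        total = sum(logprobs)
        scores.append(total / len(logprobs) if length_normalize else total)
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical per-token log-probs for two continuations of the same prompt.
options = [
    [-1.0, -1.2],                     # short option: sum -2.2, mean -1.10
    [-0.4, -0.5, -0.6, -0.7, -0.9],   # long option:  sum -3.1, mean -0.62
]

print(pick_choice(options, length_normalize=True))   # 1
print(pick_choice(options, length_normalize=False))  # 0
```

The two settings can disagree, as here, because summed log-likelihood favors shorter options.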

Retrieval vs off-the-shelf SOTA (eval/bench_retrieval.py), SQuAD-it test (7609 queries / 1988 contexts, identical pool & metrics for every model). The Skylar embedder is the 236M base + a cheap contrastive fine-tune on Italian QA; bge-m3 and e5 are evaluated zero-shot:

| Model | Params | R@1 | R@5 | nDCG@10 |
|---|---|---|---|---|
| Skylar-embed (IT-QA fine-tune) | 236M | 0.55 | 0.81 | 0.71 |
| intfloat/multilingual-e5-base | 278M | 0.71 | 0.91 | 0.83 |
| BAAI/bge-m3 | 568M | 0.70 | 0.90 | 0.83 |
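
For readers unfamiliar with the column headers, here is a minimal sketch of Recall@k and nDCG@k. It assumes exactly one relevant context per query, so the ideal DCG is 1; that is an assumption about the benchmark setup. The ranks below are made-up illustrative values, not benchmark data.

```python
import math

def recall_at_k(gold_ranks, k):
    """Fraction of queries whose gold context appears in the top k (ranks are 1-based)."""
    return sum(1 for r in gold_ranks if r <= k) / len(gold_ranks)

def ndcg_at_k(gold_ranks, k):
    """nDCG@k with a single relevant item per query: gain 1 / log2(1 + rank)."""
    return sum(1.0 / math.log2(1 + r) for r in gold_ranks if r <= k) / len(gold_ranks)

# Hypothetical 1-based rank of the gold context for four queries.
gold_ranks = [1, 3, 2, 12]

print(recall_at_k(gold_ranks, 1))   # 0.25
print(recall_at_k(gold_ranks, 5))   # 0.75
print(round(ndcg_at_k(gold_ranks, 10), 4))
```

nDCG rewards placing the gold context near the top even when it is not first, which is why it sits above R@1 in every row of the table.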

Honest scope: what's real and declarable. The 236M base is a grounded Italian RAG model, not a factual oracle: in its intended role it scores 6/6 (answer/extract/classify/refuse-from-context) with clean stopping, but it hallucinates open-domain facts, and the knowledge-heavy benchmarks sit near random, as expected for the size. It is an Italian specialist; English generation is not fluent (Italian-only corpus). The from-scratch retriever reaches ~78% of the R@1 and ~86% of the nDCG@10 of bge-m3 (which is 2.4× larger) while running fully local/offline from the same base. It does not beat the multilingual SOTA on accuracy; its edge is size, locality and a one-base gen+dense+sparse+classifier stack. The retrieval gap traces to the narrow 1.12B-token pretrain, not the contrastive recipe.

Stage 1: Pre-Training

python training/bin.pretrain.py \
  --preset medium \
  --data_dir .datasets/pretokenized \
  --bf16 \
  --batch_size 32 \
  --grad_accum 4 \
  --lr 3e-4 \
  --warmup_steps 2000

| Feature | Detail |
|---|---|
| 📦 Data format | Sharded uint32 .bin files (1GB each) with SHA256 checksums |
| ☁️ Storage | Local disk or AWS S3 streaming (async, non-blocking) |
| 📈 Scheduler | WSD (Warmup-Stable-Decay) or cosine annealing |
| 💾 Checkpoints | HuggingFace format + async S3 upload via ThreadPoolExecutor |
| 🔄 Resume | Full state recovery (model, optimizer, scaler, step, best loss) |
| 🎛️ µP | Train proxy at 50M, transfer HPs directly to 4B+ |
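
A Warmup-Stable-Decay schedule can be sketched in a few lines. The peak LR (3e-4) and warmup length (2000 steps) come from the command above; the stable and decay lengths, the linear decay shape, and the name `wsd_lr` are assumptions for illustration, and Skylar's scheduler may use a different decay curve.

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps, min_lr=0.0):
    """Learning rate at a 0-based optimizer step under a WSD schedule."""
    if step < warmup_steps:
        # Linear warmup that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < warmup_steps + stable_steps:
        return peak_lr
    # Linear decay from peak_lr to min_lr, then hold at min_lr.
    done = step - warmup_steps - stable_steps
    frac = min(done / decay_steps, 1.0)
    return peak_lr + (min_lr - peak_lr) * frac

schedule = dict(peak_lr=3e-4, warmup_steps=2000, stable_steps=10_000, decay_steps=2000)
for step in (0, 1999, 6000, 13_000, 14_000):
    print(step, wsd_lr(step, **schedule))
```

Unlike cosine annealing, the long flat phase lets a run be extended or branched before committing to a decay.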

Stage 2: Supervised Fine-Tuning (SFT)

python training/bin.sft.py \
  --base_model checkpoints/final \
  --data sft_data.jsonl \
  --epochs 3 \
  --lr 2e-5

| Feature | Detail |
|---|---|
| 🏷️ Format | ChatML (<\|im_start\|>role\n...<\|im_end\|>) |
| 🎯 Loss masking | Gradient only on assistant tokens (system/user masked to -100) |
| 🔤 Special tokens | <\|im_end\|> loss boost for clean turn termination |
| 📊 Logging | W&B integration with per-step metrics |
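
The assistant-only loss mask can be illustrated with a toy encoder. In this sketch `toy_encode` is a character-level stand-in for the real BPE tokenizer, `build_example` is a hypothetical helper, and masking the role header and trailing newline is an assumption; the real logic lives in utils/chatML.py and may differ.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips

def toy_encode(text):
    """Character-level stand-in for the BPE tokenizer (illustration only)."""
    return [ord(c) for c in text]

def build_example(messages, encode=toy_encode):
    """Return (input_ids, labels) with loss only on assistant content and <|im_end|>."""
    input_ids, labels = [], []
    for msg in messages:
        header = encode(f"<|im_start|>{msg['role']}\n")
        body = encode(msg["content"] + "<|im_end|>")
        sep = encode("\n")
        input_ids += header + body + sep
        if msg["role"] == "assistant":
            # Train on the reply and its end-of-turn marker; mask the header.
            labels += [IGNORE_INDEX] * len(header) + body + [IGNORE_INDEX] * len(sep)
        else:
            labels += [IGNORE_INDEX] * (len(header) + len(body) + len(sep))
    return input_ids, labels

messages = [
    {"role": "system", "content": "Rispondi in breve."},
    {"role": "user", "content": "2+2?"},
    {"role": "assistant", "content": "4"},
]
input_ids, labels = build_example(messages)
trained = [t for t in labels if t != IGNORE_INDEX]
print(len(input_ids), len(trained))
```

Keeping <|im_end|> inside the trained span is what lets the model learn to stop its own turn.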

🗂️ Project Structure

skylar/
├── 🧠 models/                     # Model architecture
│   ├── config.py                  #   NanoTransformerConfig + 15 presets
│   ├── decoder.py                 #   NanoTransformer (GPT decoder)
│   ├── embedder.py                #   SkylarEmbedder (bidirectional)
│   ├── heads.py                   #   Classification + Reward heads
│   └── layers/
│       ├── attention.py           #     GQA + RoPE + FlexAttention + KV-Cache
│       ├── block.py               #     TransformerBlock (pre-norm residual)
│       ├── ffn.py                 #     SwiGLU FFN
│       ├── norm.py                #     RMSNorm
│       ├── rope.py                #     Rotary Position Embeddings
│       └── kv_cache.py            #     KV-cache utilities
│
├── 📦 data/                       # Data pipeline
│   ├── bin.tokenizer.py           #   BPE tokenizer training + sharding
│   ├── bin.sft_data_to_jsonl.py   #   SFT data converter
│   ├── bin.sft_data_shaffle.py    #   Shuffle utility
│   ├── bin.sft_synthetic_data.py  #   Agentic synthetic data generator
│   └── pretrain-pipeline/         #   Pre-training corpus builder
│       ├── bin.pretrain_builder.py #     CLI: extract → clean → filter → dedup → shuffle
│       ├── extractors.py          #     PDF (Docling GPU) / HTML / JSON / TXT / XML
│       ├── cleaners.py            #     MinimalNormalizer, ColumnMergeCleaner
│       ├── prose_filter.py        #     Heuristic + GPU perplexity scoring
│       └── pipeline.py            #     PII (Presidio NER), Quality, Spam, Dedup
│
├── 🏋️ training/                    # Training loops
│   ├── bin.pretrain.py            #   Pre-training (Stage 1)
│   ├── bin.sft.py                 #   SFT (Stage 2)
│   ├── bin.dpo.py                 #   DPO (Stage 3, planned)
│   └── train.runpod.sh            #   Remote GPU deployment
│
├── 📊 eval/                       # Evaluation & diagnostics
│   ├── bin.eval_base_model.py     #   Base model eval
│   ├── bin.eval_sft_model_*.py    #   SFT model eval
│   └── bin.diagnose_sft*.py       #   SFT debugging tools
│
├── 💬 inference/                   # Generation & chat
│   ├── bin.chat.py                #   Interactive streaming REPL
│   ├── bin.generate.py            #   Text generation
│   └── bin.embed.py               #   Embedding extraction (planned)
│
├── 🛠️ utils/                      # Utilities
│   ├── chatML.py                  #   ChatML encoding + loss mask
│   └── bin.download_aws_*.py      #   S3 checkpoint download
│
└── 📄 docs/                       # Documentation
    ├── PAPER.md                   #   Technical paper (Skylar v3.0)
    └── GUIDE.md                   #   Training guide

🎙️ Interactive Chat

Skylar ships with a rich terminal chat interface powered by the rich library:

python inference/bin.chat.py --model checkpoints_sft/best
╭─────────────────── 🧠 Skylar Chat ────────────────────╮
│  Model: checkpoints_sft/best                          │
│  Params: 107M │ Context: 16K │ Device: cuda           │
╰───────────────────────────────────────────────────────╯

 You ► Spiegami cos'è il RoPE in modo semplice

 Skylar ► Il RoPE (Rotary Position Embeddings) è un modo
 elegante per dire al modello "dove" si trova ogni parola
 nella frase. Invece di aggiungere numeri alla posizione,
 ruota i vettori nello spazio...

| Command | Description |
|---|---|
| /system <msg> | Set system prompt |
| /temp <0.0-2.0> | Adjust temperature |
| /topk <n> | Set top-k sampling |
| /topp <0.0-1.0> | Set nucleus sampling |
| /rep <1.0-2.0> | Repetition penalty |
| /max <n> | Max generation tokens |
| /clear | Clear conversation |
| /config | Show current config |
| /help | Show all commands |
| /quit | Exit |
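
The /temp, /topk and /topp settings correspond to standard sampling filters. The sketch below shows one common way to combine them on a list of logits. It is illustrative only: the function name, the order of the filters, and the omission of the repetition penalty are assumptions, not a description of bin.chat.py.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample a token index using temperature, then top-k, then nucleus filtering."""
    if temperature <= 0:
        return max(range(len(logits)), key=logits.__getitem__)  # greedy decoding
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    z = sum(weights)
    # Sort (probability, index) pairs from most to least likely.
    ranked = sorted(((w / z, i) for i, w in enumerate(weights)), reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    # Nucleus: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break
    # Draw from the kept tokens in proportion to their probabilities.
    r = rng.random() * sum(p for p, _ in kept)
    for p, i in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][1]

logits = [2.0, 1.0, 0.1]
rng = random.Random(0)
print(sample_next(logits, temperature=0.8, top_k=2, top_p=0.95, rng=rng))
```

Lower temperature and smaller top-k/top-p make output more deterministic; setting top_k=1 or temperature=0 reduces to greedy decoding.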

🔬 What Makes Skylar Different

🎯 Transparent by Design

Every component (RMSNorm math, RoPE frequency tables, GQA head expansion, FlexAttention mask functions, µP scaling rules) is a standalone, readable PyTorch module. No library-level black boxes.
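
As one example of how small these components are, here is a minimal RoPE sketch on plain lists. It rotates adjacent pairs of dimensions (the "interleaved" layout), which is an assumption: Skylar's rope.py may instead pair the first and second halves of the vector. The helper names are hypothetical.

```python
import math

def rope_rotate(vec, position, theta=1_000_000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by position * theta^(-i/d)."""
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        angle = position * theta ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The attention score depends only on the relative offset (2 in both cases).
near = dot(rope_rotate(q, 5), rope_rotate(k, 3))
far = dot(rope_rotate(q, 12), rope_rotate(k, 10))
print(near, far)
```

Both scores agree up to floating-point error, which is the "relative encoding" property listed in the components table.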

🧬 Qwen3 Architectural Parity

The 4b through 32b presets use the same architecture hyperparameters as Qwen3's published configs (only vocab_size differs). Techniques developed here transfer directly to/from the Qwen3 family.

⚡ Three-Path Attention Dispatch

The attention layer auto-selects the optimal path:

  1. FlexAttention + block mask (PyTorch ≥ 2.5)
  2. SDPA + dense mask (fallback)
  3. Causal SDPA with is_causal=True (generation)
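
The selection among the three paths listed above might look roughly like the following. This is a guess at the decision order, written as plain Python with no PyTorch dependency; the function names, the string labels, and the exact conditions are all assumptions, not Skylar's attention.py.

```python
def flex_supported(torch_version):
    """True if a version string such as '2.5.1+cu121' is at least 2.5."""
    parts = torch_version.split("+")[0].split(".")
    return (int(parts[0]), int(parts[1])) >= (2, 5)

def select_attention_path(torch_version, has_document_mask, is_generating):
    """Pick one of the three dispatch paths (assumed priority order)."""
    if is_generating or not has_document_mask:
        # Plain causal attention needs no explicit mask tensor.
        return "sdpa_is_causal"
    if flex_supported(torch_version):
        # Packed sequences: block-sparse document mask.
        return "flex_attention_block_mask"
    # Older PyTorch: materialize a dense mask and pass it to SDPA.
    return "sdpa_dense_mask"

print(select_attention_path("2.5.1+cu121", has_document_mask=True, is_generating=False))
print(select_attention_path("2.1.0", has_document_mask=True, is_generating=False))
print(select_attention_path("2.5.1", has_document_mask=False, is_generating=True))
```

The dense-mask fallback is the path most likely to matter on the minimum supported PyTorch version, since FlexAttention requires 2.5 or newer.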

📐 µP: Tune Small, Scale Big

Full Maximal Update Parameterization: width-scaled init, per-group LR, 1/d_head attention scaling, output logit scaling. Tune HPs on ~50M params, transfer to 4B+ for free.
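
The scaling rules named above can be summarized numerically. The sketch below follows the commonly cited µP rules for Adam-style optimizers (hidden-matrix LR and output logits scaled by 1/width multiplier, hidden init variance by 1/width multiplier, attention by 1/d_head). The base LR and init std are illustrative values, `mup_scales` is a hypothetical helper, and Skylar's exact per-group rules may differ.

```python
import math

def mup_scales(d_model, base_d_model, d_head, base_lr, base_init_std=0.02):
    """Width-dependent multipliers relative to a proxy model of width base_d_model."""
    m = d_model / base_d_model  # width multiplier
    return {
        "width_mult": m,
        "hidden_init_std": base_init_std / math.sqrt(m),  # variance shrinks as 1/m
        "hidden_lr": base_lr / m,        # matrix-like (hidden) weights
        "embedding_lr": base_lr,         # vector-like weights keep the base LR
        "attn_scale": 1.0 / d_head,      # replaces the usual 1/sqrt(d_head)
        "logit_scale": 1.0 / m,          # output logit multiplier
    }

# Values from the configuration example: d_model=768 with a 256-width proxy.
scales = mup_scales(d_model=768, base_d_model=256, d_head=64, base_lr=3e-4)
for name, value in scales.items():
    print(f"{name:16s} {value:.6g}")
```

With these multipliers in place, the learning rate tuned on the proxy is reused unchanged as `base_lr` for the wider model.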

🤖 Agentic Data Synthesis

Multi-agent pipeline: Analyst → Turn Builder → Validator → Dedup → Shuffle. Supports OpenAI, Anthropic, and self-hosted vLLM endpoints as teacher models.

🌍 Italian-specialist

Configurable ByteLevel BPE (the released medium_plus uses a 32,768 vocab trained on Italian legal/normative text). Special tokens for chat (<|im_start|>, <|im_end|>), thinking (<think>), and tool use (<tool_call>). The byte-level vocabulary can encode English text, but a model trained on an Italian-only corpus is an Italian specialist: English generation is not fluent unless English data is added to pretraining.


🖥️ Hardware Requirements

| Preset | VRAM | Device | Throughput | Time for Chinchilla |
|---|---|---|---|---|
| test | <1 GB | CPU / any GPU | - | minutes |
| small | ~2 GB | RTX 4090 | ~100K tok/s | ~5 hours ✅ |
| medium | ~6 GB | RTX 4090 | ~51K tok/s | ~10 hours ✅ |
| large | ~12 GB | RTX 4090 | ~19K tok/s | ~4 days |
| 1B | ~48 GB | RTX PRO 6000 | ~12K tok/s | ~19 days |
| 4b | ~96 GB | RTX PRO 6000 | ~3.5K tok/s | ~9 months |
| 4b | 5× DGX H100 | 40 GPUs | ~240K tok/s | ~8 days |
| 8b | 5× DGX H100 | 40 GPUs | ~170K tok/s | ~20 days |
| 14b | 10× DGX H100 | 80 GPUs | ~150K tok/s | ~44 days |
| 32b | 10× DGX H100 | 80 GPUs | ~55K tok/s | ~9 months |
| 128b | 512× H100 | cluster | ~40K tok/s | ~24 months |

✅ = measured benchmarks on actual hardware
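
The "Time for Chinchilla" column can be approximated from parameter count and throughput. The sketch assumes the common rule of thumb of about 20 training tokens per parameter. The table's own token budgets are not stated, so other rows may not reproduce exactly, and `chinchilla_days` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400

def chinchilla_days(n_params, tokens_per_second, tokens_per_param=20):
    """Days needed to train on tokens_per_param * n_params tokens at a fixed throughput."""
    total_tokens = n_params * tokens_per_param
    return total_tokens / tokens_per_second / SECONDS_PER_DAY

one_b = chinchilla_days(1.0e9, 12_000)    # 1B row: ~12K tok/s
big = chinchilla_days(128e9, 40_000)      # 128b row: ~40K tok/s

print(f"1B:   {one_b:.1f} days")          # 19.3 days
print(f"128b: {big / 30.44:.1f} months")  # 24.3 months
```

Both results line up with the ~19 days and ~24 months listed for those rows.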


🔧 Configuration

from models.config import get_config, NanoTransformerConfig

# Use a preset
config = get_config("medium")

# Customize
config = get_config("medium", vocab_size=40960, dropout=0.05, max_seq_len=32768)

# Full manual config
config = NanoTransformerConfig(
    vocab_size=40960,
    d_model=768,
    n_heads=12,
    n_kv_heads=4,
    n_layers=12,
    d_ff=2048,
    max_seq_len=16384,
    qk_norm=True,
    rope_theta=1_000_000.0,
    mup_base_d_model=256,  # enable µP with 256-width proxy
)

🗃️ Tokenizer

python data/bin.tokenizer.py \
  --data corpus/ \
  --vocab_size 40960 \
  --output .datasets/tokenizer

| Feature | Detail |
|---|---|
| Algorithm | ByteLevel BPE (HuggingFace tokenizers, Rust-backed) |
| Vocab | configurable (default 40,960; medium_plus uses 32,768) |
| Output | Sharded uint32 .bin files (1GB each) |
| Checksums | SHA256 per shard + JSON metadata |
| Upload | Auto S3 multipart transfer |
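
The shard format described in the table (uint32 token ids plus a SHA256 checksum and JSON metadata) can be sketched with the standard library. The little-endian byte order, the metadata field names, and the file naming are assumptions for this example; bin.tokenizer.py may lay things out differently.

```python
import hashlib
import json
import struct
import tempfile
from pathlib import Path

def write_shard(token_ids, shard_path):
    """Write token ids as little-endian uint32 and a sidecar JSON with the checksum."""
    payload = struct.pack(f"<{len(token_ids)}I", *token_ids)
    shard_path.write_bytes(payload)
    meta = {
        "file": shard_path.name,
        "num_tokens": len(token_ids),
        "sha256": hashlib.sha256(payload).hexdigest(),
    }
    shard_path.with_suffix(".json").write_text(json.dumps(meta))
    return meta

def read_shard(shard_path):
    """Verify the checksum, then decode the shard back into a list of ints."""
    meta = json.loads(shard_path.with_suffix(".json").read_text())
    payload = shard_path.read_bytes()
    if hashlib.sha256(payload).hexdigest() != meta["sha256"]:
        raise ValueError(f"checksum mismatch for {shard_path.name}")
    return list(struct.unpack(f"<{len(payload) // 4}I", payload))

with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "shard_00000.bin"
    tokens = [1, 40_959, 70_000, 2]
    write_shard(tokens, shard)
    restored = read_shard(shard)

    # Flip one byte to show that corruption is detected on load.
    data = bytearray(shard.read_bytes())
    data[0] ^= 0xFF
    shard.write_bytes(bytes(data))
    try:
        read_shard(shard)
        corruption_detected = False
    except ValueError:
        corruption_detected = True

print(restored, corruption_detected)
```

Checking the hash before decoding lets a partial download from S3 fail loudly instead of silently feeding bad tokens into training.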

Special tokens:

<pad>  <bos>  <eos>  <|im_start|>  <|im_end|>
<think>  </think>  <tool_call>  </tool_call>
<tool_response>  </tool_response>

🛡️ Pre-Training Data Pipeline

The data/pretrain-pipeline/ module transforms raw documents (PDF, HTML, JSON, TXT) into a clean, shuffled, deduplicated corpus ready for tokenization. Built for Italian legal/institutional text (EUR-Lex, Gazzetta Ufficiale, Banca d'Italia).

python data/pretrain-pipeline/bin.pretrain_builder.py /path/to/raw_documents \
  -o .datasets/pretokenized \
  --max-gb 5 \
  --gpu-perplexity
 Raw files (PDF/HTML/JSON/TXT)
      │
      ├──── TXT (chat templates) ──── bypass all filters ───┐
      │     Curated ChatML data included as-is to teach     │
      │     the model <|im_start|>/<|im_end|> structure     │
      │     from the earliest pre-training steps.           │
      │                                                     │
      ▼                                                     │
 1. Extraction        Docling (GPU) / trafilatura / JSON    │
      │                                                     │
      ▼                                                     │
 2. Normalization     MinimalNormalizer: ftfy, NFKC         │
      │                                                     │
      ▼                                                     │
 3. Column Merge      Broken two-column PDF detection       │
      │                                                     │
      ▼                                                     │
 4. Prose Filter      Semantic chunking + scoring + PPL     │
      │                                                     │
      ▼                                                     │
 5. PII Redaction     Presidio NER (Italian)                │
      │                 Names → <PERSONA>                   │
      │                 Email → <EMAIL>  Phone → <TELEFONO> │
      │                 CF → <CF>  IBAN → <IBAN>  CC → <CC> │
      │                                                     │
      ▼                                                     │
 6. Quality + Spam + Dedup                                  │
      │                                                     │
      ▼                                                     ▼
 7. Shuffle + Write   Global shuffle → <bos>doc<eos> output chunks

| Component | Technology | Purpose |
|---|---|---|
| PDF Extraction | Docling (GPU) + pymupdf fallback | Layout-aware reading order |
| PII Detection | Microsoft Presidio + spaCy NER | Context-aware PII redaction |
| Perplexity | facebook/xglm-564M (optional GPU) | Filter incoherent / repetitive text |
| Near-Dedup | MinHash LSH (datasketch) | Fuzzy duplicate removal |
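
To show the idea behind the near-dedup step, here is a from-scratch MinHash sketch using only the standard library. It does not use or imitate the datasketch API, and it omits the LSH banding that makes lookup fast at corpus scale. The shingle size, the number of hash functions, and the helper names are assumptions.

```python
import hashlib

def shingles(text, n=3):
    """Set of overlapping word n-grams, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(items, num_hashes=64):
    """For each salted hash function, keep the minimum hash value over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "Il presente decreto entra in vigore il giorno successivo alla sua pubblicazione nella Gazzetta Ufficiale"
doc_b = doc_a + " italiana"
doc_c = "La Banca centrale pubblica ogni anno una relazione sulla stabilita finanziaria"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
near = estimated_jaccard(sig_a, sig_b)
unrelated = estimated_jaccard(sig_a, sig_c)
print(near, unrelated)
```

A pipeline would drop one document from each pair whose estimated similarity exceeds a chosen threshold.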

🤗 HuggingFace Integration

Despite being built from scratch, Skylar is fully HuggingFace-native:

from models.decoder import NanoTransformer

# Save (`model` is assumed to be an existing NanoTransformer instance)
model.save_pretrained("my-skylar-model")

# Load
model = NanoTransformer.from_pretrained("my-skylar-model")

# Push to Hub (planned)
model.push_to_hub("username/skylar-medium")

☁️ Remote Training (RunPod)

bash training/train.runpod.sh --host <IP> --port <PORT>

The script handles everything:

  1. 🔌 SSH connection + GPU verification (nvidia-smi)
  2. 📤 Source code upload via scp
  3. 📦 Dependency installation
  4. 🔑 AWS credential export
  5. 🚀 Background training launch with nohup
  6. ☁️ Async checkpoint upload to S3

🗺️ Roadmap

  • 🧠 Full decoder-only Transformer architecture
  • 🌀 RoPE + GQA + QK-Norm + SwiGLU
  • ⚡ FlexAttention document masking
  • 📐 µP (Maximal Update Parameterization)
  • 📚 Pre-training pipeline with packed sequences
  • 🛡️ Pre-training corpus builder (Docling + Presidio PII + dedup)
  • 💬 SFT with ChatML + loss masking
  • 🎙️ Interactive streaming chat REPL
  • 🤖 Agentic synthetic data generation
  • 🔍 SkylarEmbedder (bidirectional dense retrieval)
  • 🔗 Contrastive training for the embedder (InfoNCE)
  • 🧮 Sparse retrieval: SkylarSparseEncoder (SPLADE-style)
  • 🏷️ SkylarClassifier (BERT-style sequence classification)
  • 🎯 Preference optimization: ORPO / SimPO (reference-free; supersedes DPO)
  • 🧪 Public Italian benchmarks (XCOPA / HellaSwag / Belebele)
  • 📊 Evaluation & diagnostic tools
  • 🌐 FastAPI/vLLM inference server
  • 📦 Push-to-Hub support
  • 🔭 Long-context: sliding-window attention + YaRN (for 8K+)
  • 🌊 Streaming/memmap dataset (required for 4B/8B scale)
  • 🎯 Classic DPO (optional; bin.preference.py covers the reference-free variants)

📚 Documentation

| Document | Description |
|---|---|
| docs/PAPER.md | Full technical paper: Skylar v3.0 architecture |
| docs/RESULTS.md | Validated results: public benchmarks + retrieval vs bge-m3 / e5 |
| docs/POSTTRAIN.md | Post-training suite: chat, embeddings, sparse, classifier |
| docs/GUIDE.md | Step-by-step training guide |
| docs/STRUCTURE.md | Codebase structure reference |
| docs/DISTILLATION.md | Synthetic data distillation guide |
| docs/TRAINING_SET.md | Dataset preparation reference |

📄 Tech Stack

┌──────────────────────────────────────────────────┐
│  🐍 Python 3.10+                                 │
├──────────────────────────────────────────────────┤
│  🔥 PyTorch ≥ 2.1      │  Core ML engine         │
│  🤗 Transformers       │  HF compatibility layer │
│  🔤 Tokenizers         │  Rust-backed BPE        │
│  🚀 Accelerate         │  Multi-GPU (DDP/FSDP)   │
│  📊 W&B                │  Experiment tracking    │
│  ☁️  Boto3             │  AWS S3 storage         │
│  🤖 OpenAI + Anthropic │  Teacher model APIs     │
│  🛡️  Presidio + spaCy  │  PII detection (NER)    │
│  📄 Docling            │  GPU PDF extraction     │
│  📋 Pydantic           │  Data validation        │
│  🎨 Rich               │  Terminal UI            │
│  ⚙️  Typer             │  CLI framework          │
└──────────────────────────────────────────────────┘

📜 License

Skylar is licensed under the Apache License 2.0.

Copyright © 2026 Aleksandr Ivanovitch, who holds all intellectual property rights in this software in his capacity as Chief Technology Officer (CTO) of Sophia AI S.r.l. See the LICENSE and NOTICE files for the full terms and attribution.

You are free to use, modify, and distribute this software under the terms of the Apache 2.0 license, provided you retain the copyright, patent, trademark, and attribution notices.



┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│  "The best way to understand a Transformer is to build one."    │
│                                                                 │
│                              - Skylar Project Philosophy        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Built with 🔥 by Aleksandr Ivanovitch, CTO, Sophia AI S.r.l.

Making frontier AI transparent, auditable, and reproducible.

Copyright © 2026 Aleksandr Ivanovitch, CTO, Sophia AI S.r.l. · Licensed under the Apache License 2.0.

About

A Modern NanoTransformer Encoder/Decoder Framework
