SKYLAR
Built from first principles. No black boxes. Every layer, every rotation, every gradient is explicit.
by A. Ivanovitch, CEO MwSpace · CTO Sophia AI
Skylar is a research-grade, production-ready decoder-only Transformer framework that implements the same architectural blueprint used by LLaMA 3, Mistral, and Qwen3, written entirely from scratch in PyTorch with zero abstraction layers. Every component is auditable, every parameter traceable, every design choice documented.
# Clone & install
git clone https://github.com/mwspace/skylar.git && cd skylar
python -m venv .venv && source .venv/bin/activate
pip install -e .
# Train a model from scratch
python training/bin.pretrain.py --preset small --bf16
# Chat with your model
python inference/bin.chat.py --model checkpoints_sft/best
SKYLAR v3.0
Input Tokens ──► Token Embedding ──► Dropout
      │
      ▼
× L Transformer Blocks
┌────────────────────┐
│ RMSNorm (Pre)      │
│ GQA + RoPE + QKN   │◄── FlexAttention / SDPA / Causal
│ Residual Add       │
├────────────────────┤
│ RMSNorm (Pre)      │
│ SwiGLU FFN         │
│ Residual Add       │
└────────────────────┘
      │
      ▼
Final RMSNorm ──► LM Head ──► Logits (optional µP logit scaling)
| Component | Implementation | Why |
|---|---|---|
| Normalization | RMSNorm (pre-norm residual) | Faster than LayerNorm, no mean computation |
| Positions | RoPE with configurable θ | Relative encoding, natural extrapolation |
| Attention | GQA + QK-Norm + explicit d_head | Qwen3-identical, reduced KV-cache |
| FFN | SwiGLU (swish(gate) × up) | +1-2% over GELU at same FLOPs |
| Masking | FlexAttention block-sparse | O(T) memory for packed sequences |
| Scaling | µP (Maximal Update Param.) | Tune on 50M → transfer to 4B+ |
| KV-Cache | Full validation + GQA expansion | Past K/V are reused, so each generation step processes only the new token |
| Interface | HuggingFace PreTrainedModel | Native save/load/hub integration |
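The "relative encoding" property of RoPE in the table above can be checked numerically. The following is a minimal dependency-free sketch, not Skylar's `rope.py`: the function name `rope_rotate`, the interleaved (even, odd) pairing, and the default `theta` are assumptions made for illustration. It shows that the attention score between a rotated query and key depends only on their position offset.

```python
import math

def rope_rotate(vec, position, theta=10_000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles.

    Illustrative only: the pairing convention and default theta are assumptions.
    """
    d = len(vec)
    assert d % 2 == 0, "head dimension must be even"
    out = []
    for i in range(0, d, 2):
        freq = theta ** (-i / d)      # earlier pairs rotate faster
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (3) at two different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 105), rope_rotate(k, 102))
print(score_a, score_b)  # the two scores agree up to floating-point error
```

Because each pair is rotated by an angle proportional to the position, the dot product only sees the angle difference, which is why no separate position embedding table is needed.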
Preset scale, from 6M to 128B parameters:
| Presets | Target hardware |
|---|---|
| test | CPU |
| small, medium, medium+, gold | RTX 4090 |
| large, 1B | H100 |
| 4B, 8B, 14B, 32B, 64B, 96B, 128B | DGX |
Full preset table:
| Preset | Params | d_model | Heads | KV Heads | d_head | Layers | d_ff | Context | θ_rope | Qwen3 Match |
|---|---|---|---|---|---|---|---|---|---|---|
| `test` | ~6M | 128 | 4 | 4 | 32 | 4 | 256 | 4K | 100K | - |
| `small` | ~40M | 512 | 8 | 4 | 64 | 8 | 1024 | 8K | 500K | - |
| `small+` | ~70M | 640 | 10 | 5 | 64 | 10 | 1792 | 16K | 1M | - |
| `medium` | ~107M | 768 | 12 | 4 | 64 | 12 | 2048 | 16K | 1M | - |
| `medium+` | ~236M | 1024 | 16 | 4 | 64 | 18 | 2816 | 16K | 1M | ⭐ prod |
| `large` | ~358M | 1024 | 16 | 4 | 64 | 28 | 2816 | 16K | 5M | - |
| `gold` | ~393M | 1280 | 10 | 2 | 128 | 20 | 3456 | 16K | 1M | - |
| `1B` | ~1.0B | 1536 | 16 | 4 | 96 | 32 | 5120 | 32K | 5M | - |
| `4b` | ~3.6B | 2560 | 32 | 8 | 128 | 36 | 9728 | 32K | 1M | Qwen3-4B |
| `8b` | ~7.3B | 4096 | 32 | 8 | 128 | 36 | 12288 | 32K | 1M | Qwen3-8B |
| `14b` | ~13.2B | 5120 | 40 | 8 | 128 | 40 | 17408 | 32K | 1M | Qwen3-14B |
| `32b` | ~29.6B | 5120 | 64 | 8 | 128 | 64 | 25600 | 128K | 1M | Qwen3-32B |
| `64b` | ~61B | 8192 | 64 | 8 | 128 | 72 | 28672 | 128K | 1M | - |
| `96b` | ~99B | 10240 | 80 | 8 | 128 | 80 | 32768 | 128K | 1M | - |
| `128b` | ~128B | 11264 | 88 | 8 | 128 | 92 | 32768 | 128K | 1M | - |
Note: the `4b` through `32b` presets are architecturally identical to the official Qwen3 configs (verified against the HuggingFace `config.json`). Only `vocab_size` differs (40,960 vs 151,936).
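The "Params" column can be sanity-checked from the other columns. The sketch below is a back-of-the-envelope estimate rather than Skylar's own counting code: it assumes bias-free linear layers, tied input/output embeddings, a 40,960-token vocabulary, and one weight vector per RMSNorm and QK-Norm. The function name `estimate_params` is hypothetical. Under those assumptions it lands on the `small` and `medium` figures; larger presets may count embeddings differently.

```python
def estimate_params(vocab_size, d_model, n_heads, n_kv_heads, d_head,
                    n_layers, d_ff, tie_embeddings=True):
    """Rough parameter count for a GQA + SwiGLU decoder (assumptions in the text above)."""
    q_proj = d_model * n_heads * d_head
    kv_proj = 2 * d_model * n_kv_heads * d_head   # K and V share the reduced head count
    o_proj = n_heads * d_head * d_model
    ffn = 3 * d_model * d_ff                      # gate, up, and down projections
    norms = 2 * d_model + 2 * d_head              # two RMSNorms + QK-Norm weights
    per_layer = q_proj + kv_proj + o_proj + ffn + norms
    embeddings = vocab_size * d_model * (1 if tie_embeddings else 2)
    final_norm = d_model
    return n_layers * per_layer + embeddings + final_norm

small = estimate_params(40_960, 512, 8, 4, 64, 8, 1024)
medium = estimate_params(40_960, 768, 12, 4, 64, 12, 2048)
print(f"small  ~{small / 1e6:.1f}M")   # table lists ~40M
print(f"medium ~{medium / 1e6:.1f}M")  # table lists ~107M
```

The GQA saving is visible in `kv_proj`: with 4 KV heads instead of 12, the `medium` preset's K/V projections are a third of the size of a full multi-head layout.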
One pretrained decoder backbone feeds an entire post-training suite. All
discriminative/retrieval heads reuse the same weights via `from_decoder()`, with no
re-pretraining.
                     PRETRAIN (base)   causal LM on raw text
                            │
    ┌───────────────┬───────┴───────┬───────────────┐
    ▼               ▼               ▼               ▼
SFT (chat)      EMBEDDER        SPARSE          CLASSIFIER
ChatML          dense           SPLADE          BERT-style
+ ORPO/SimPO    (InfoNCE)       (FLOPS reg)     (CE)
generative      dense vec       sparse vec      class logits
                └──── hybrid retrieval ────┘ (Qdrant)
| Stage | Script | Output |
|---|---|---|
| Pretrain | `training/bin.pretrain.py` | causal-LM base |
| SFT | `training/bin.sft.py` | ChatML chat model (assistant-only loss) |
| Preference | `training/bin.preference.py` | ORPO / SimPO (reference-free, replaces DPO) |
| Dense embedder | `training/bin.contrastive.py` | SkylarEmbedder (InfoNCE) |
| Sparse retriever | `training/bin.sparse.py` | SkylarSparseEncoder (SPLADE) |
| Classifier | `training/bin.classify.py` | SkylarClassifier (sequence classification) |
See docs/POSTTRAIN.md for the full recipe.
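The dense embedder stage is trained with InfoNCE. As a rough illustration of that objective (not the code in `training/bin.contrastive.py`), the sketch below computes in-batch InfoNCE from a query-by-passage similarity matrix in plain Python. The function name `info_nce` and the temperature value are assumptions.

```python
import math

def info_nce(sim, temperature=0.05):
    """Mean in-batch InfoNCE loss.

    sim[i][j] is the similarity between query i and passage j;
    the matching passage for query i is assumed to sit at index i.
    """
    losses = []
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_z - logits[i])  # equals -log softmax(logits)[i]
    return sum(losses) / len(losses)

aligned = [[0.90, 0.10, 0.00], [0.20, 0.80, 0.10], [0.00, 0.10, 0.95]]
shuffled = [[0.10, 0.90, 0.00], [0.80, 0.20, 0.10], [0.00, 0.95, 0.10]]

loss_aligned = info_nce(aligned)
loss_shuffled = info_nce(shuffled)
print(loss_aligned, loss_shuffled)  # the aligned batch has the lower loss
```

Every other passage in the batch acts as a negative, so larger batches give a harder contrastive task without extra labelling.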
Trained from scratch on 1.12B tokens of Italian legal/normative text (4 epochs, ~19h on a single RTX 4090), then the full post-training suite. All four products come from the same 236M weights.
| Product | Metric |
|---|---|
| Base LM | val loss 2.16 · health-check perplexity 15.4 |
| Chat (SFT) | grounded tasks 6/6 correct (answer-from-context, classify, extract-JSON, refuse-when-absent) · clean stop 6/6 grounded, 12/13 full battery |
| Dense embedder | SQuAD-it test R@1 0.55 · nDCG@10 0.71 (open-domain) · R@1 0.93 in-domain |
| Sparse (SPLADE) | Recall@1 1.000 · ~5 non-zeros/query · interpretable lexical weights (in-domain) |
| Classifier | intent accuracy 1.00 (5-way banking, held-out) |
Public Italian generative benchmarks (likelihood-based MC, eval/bench_ita.py):
| Benchmark | medium (100M) | medium_plus (236M) | random |
|---|---|---|---|
| XCOPA-it (causal commonsense) | 0.546 | 0.562 ⭐ | 0.50 |
| HellaSwag-it | 0.279 | 0.292 | 0.25 |
| Belebele-it | 0.244 | 0.267 | 0.25 |
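"Likelihood-based MC" means each answer option is scored by the log-probability the model assigns to it, and the highest-scoring option is the prediction. The sketch below shows that selection rule on hypothetical per-token log-probabilities; it is not `eval/bench_ita.py`, and whether that script length-normalizes scores is an assumption exposed here as a flag.

```python
def pick_option(option_logprobs, length_normalize=True):
    """Return the index of the option with the highest (mean or summed) log-probability."""
    scores = []
    for token_logprobs in option_logprobs:
        total = sum(token_logprobs)
        scores.append(total / len(token_logprobs) if length_normalize else total)
    return max(range(len(scores)), key=scores.__getitem__)

def accuracy(items):
    """items: list of (per-option token log-probs, gold option index)."""
    correct = sum(pick_option(logprobs) == gold for logprobs, gold in items)
    return correct / len(items)

# Hypothetical log-probabilities for two 2-option questions (XCOPA-style).
items = [
    ([[-1.2, -0.8], [-2.5, -2.0]], 0),
    ([[-3.0, -2.5, -2.8], [-1.0, -1.5, -0.9]], 1),
]
print(accuracy(items))

# Length normalization can change the choice when options differ in length.
uneven = [[-0.5, -0.4], [-0.6, -0.3, -0.2, -0.1]]
print(pick_option(uneven), pick_option(uneven, length_normalize=False))
```

With two options per question the random baseline is 0.50, and with four it is 0.25, which matches the "random" column above.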
Retrieval vs off-the-shelf SOTA (`eval/bench_retrieval.py`), SQuAD-it test (7609 queries / 1988 contexts,
identical pool & metrics for every model). The Skylar embedder is the 236M base + a cheap contrastive
fine-tune on Italian QA; bge-m3 and e5 are evaluated zero-shot:
| Model | Params | R@1 | R@5 | nDCG@10 |
|---|---|---|---|---|
| Skylar-embed (IT-QA fine-tune) | 236M | 0.55 | 0.81 | 0.71 |
| `intfloat/multilingual-e5-base` | 278M | 0.71 | 0.91 | 0.83 |
| `BAAI/bge-m3` | 568M | 0.70 | 0.90 | 0.83 |
Honest scope: what's real and declarable. The 236M base is a grounded Italian RAG model, not a factual oracle. In its intended role it scores 6/6 (answer/extract/classify/refuse-from-context) with clean stopping, but it hallucinates open-domain facts, and the knowledge-heavy benchmarks sit near random, as expected for the size. It is an Italian specialist: English generation is not fluent (Italian-only corpus). The from-scratch retriever reaches ~78% of the R@1 and ~86% of the nDCG@10 of `bge-m3` (which is 2.4× larger) while running fully local/offline from the same base. It does not beat the multilingual SOTA on accuracy; its edge is size, locality, and a one-base gen+dense+sparse+classifier stack. The retrieval gap traces to the narrow 1.12B-token pretrain, not the contrastive recipe.
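For readers unfamiliar with the retrieval metrics quoted above, here is a minimal sketch of Recall@k and nDCG@k. It assumes exactly one relevant context per query, which is an assumption about the benchmark setup rather than something taken from `eval/bench_retrieval.py`; the document ids are invented.

```python
import math

def recall_at_k(ranked_ids, gold_id, k):
    """1.0 if the gold context appears in the top k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0

def ndcg_at_k(ranked_ids, gold_id, k=10):
    """nDCG with a single relevant item: 1/log2(rank + 1), or 0 outside the top k."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == gold_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical ranked results for three queries: (ranking, gold context id).
runs = [
    (["c7", "c2", "c9"], "c7"),  # gold at rank 1
    (["c4", "c1", "c3"], "c1"),  # gold at rank 2
    (["c5", "c6", "c8"], "c0"),  # gold not retrieved
]

r_at_1 = mean(recall_at_k(r, g, 1) for r, g in runs)
r_at_5 = mean(recall_at_k(r, g, 5) for r, g in runs)
ndcg_10 = mean(ndcg_at_k(r, g, 10) for r, g in runs)
print(r_at_1, r_at_5, ndcg_10)
```

nDCG rewards near-misses that Recall@1 ignores, which is why a model can trail by a wider margin on R@1 than on nDCG@10.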
python training/bin.pretrain.py \
--preset medium \
--data_dir .datasets/pretokenized \
--bf16 \
--batch_size 32 \
--grad_accum 4 \
--lr 3e-4 \
--warmup_steps 2000
| Feature | Detail |
|---|---|
| Data format | Sharded uint32 .bin files (1GB each) with SHA256 checksums |
| Storage | Local disk or AWS S3 streaming (async, non-blocking) |
| Scheduler | WSD (Warmup-Stable-Decay) or cosine annealing |
| Checkpoints | HuggingFace format + async S3 upload via ThreadPoolExecutor |
| Resume | Full state recovery (model, optimizer, scaler, step, best loss) |
| µP | Train proxy at 50M, transfer HPs directly to 4B+ |
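The WSD (Warmup-Stable-Decay) schedule can be sketched as a pure function of the step. The version below reuses the peak learning rate and warmup length from the command above, while `total_steps`, `decay_fraction`, the linear decay shape, and the name `wsd_lr` are assumptions for illustration and may differ from the scheduler in `training/bin.pretrain.py`.

```python
def wsd_lr(step, peak_lr=3e-4, warmup_steps=2000, total_steps=20_000,
           decay_fraction=0.1, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, flat plateau, linear decay at the end."""
    decay_steps = int(total_steps * decay_fraction)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear warmup
    if step < decay_start:
        return peak_lr                              # stable plateau
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr + (min_lr - peak_lr) * progress  # linear decay

for step in (0, 1_999, 10_000, 19_000, 20_000):
    print(step, wsd_lr(step))
```

Unlike cosine annealing, the plateau does not depend on the total step count, so a run can be extended and only the final decay phase needs to be rerun. With `--batch_size 32` and `--grad_accum 4`, each optimizer step covers 128 sequences per device.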
python training/bin.sft.py \
--base_model checkpoints/final \
--data sft_data.jsonl \
--epochs 3 \
--lr 2e-5
| Feature | Detail |
|---|---|
| Format | ChatML (<|im_start|>role\n...<|im_end|>) |
| Loss masking | Gradient only on assistant tokens (system/user masked to -100) |
| Special tokens | <|im_end|> loss boost for clean turn termination |
| Logging | W&B integration with per-step metrics |
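The assistant-only loss masking described in the table can be illustrated with a toy example. This is a sketch of the idea, not `utils/chatML.py`: the helper names and the character-level `toy_encode` tokenizer are invented for the example, and the `<|im_end|>` loss boost is not modelled.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def build_labels(turns, encode):
    """Encode ChatML turns and mask everything except assistant content.

    turns: list of (role, text) pairs; encode: str -> list[int].
    Labels are unshifted; the next-token shift is assumed to happen in the loss.
    """
    input_ids, labels = [], []
    for role, text in turns:
        header = encode(f"<|im_start|>{role}\n")
        body = encode(f"{text}<|im_end|>")
        sep = encode("\n")
        input_ids += header + body + sep
        labels += [IGNORE_INDEX] * len(header)
        # Supervise assistant text and its <|im_end|>, so the model learns to stop.
        labels += body if role == "assistant" else [IGNORE_INDEX] * len(body)
        labels += [IGNORE_INDEX] * len(sep)
    return input_ids, labels

def toy_encode(text):
    """Toy tokenizer for the demo: one id per character."""
    return [ord(ch) for ch in text]

turns = [("system", "Be brief."), ("user", "Hi"), ("assistant", "Hello!")]
input_ids, labels = build_labels(turns, toy_encode)
supervised = "".join(chr(t) for t in labels if t != IGNORE_INDEX)
print(supervised)
```

Only the assistant reply and its end-of-turn marker contribute gradient; the system and user turns still provide context through `input_ids`.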
skylar/
├── models/                          # Model architecture
│   ├── config.py                    # NanoTransformerConfig + 12 presets
│   ├── decoder.py                   # NanoTransformer (GPT decoder)
│   ├── embedder.py                  # SkylarEmbedder (bidirectional)
│   ├── heads.py                     # Classification + Reward heads
│   └── layers/
│       ├── attention.py             # GQA + RoPE + FlexAttention + KV-Cache
│       ├── block.py                 # TransformerBlock (pre-norm residual)
│       ├── ffn.py                   # SwiGLU FFN
│       ├── norm.py                  # RMSNorm
│       ├── rope.py                  # Rotary Position Embeddings
│       └── kv_cache.py              # KV-cache utilities
│
├── data/                            # Data pipeline
│   ├── bin.tokenizer.py             # BPE tokenizer training + sharding
│   ├── bin.sft_data_to_jsonl.py     # SFT data converter
│   ├── bin.sft_data_shaffle.py      # Shuffle utility
│   ├── bin.sft_synthetic_data.py    # Agentic synthetic data generator
│   └── pretrain-pipeline/           # Pre-training corpus builder
│       ├── bin.pretrain_builder.py  # CLI: extract → clean → filter → dedup → shuffle
│       ├── extractors.py            # PDF (Docling GPU) / HTML / JSON / TXT / XML
│       ├── cleaners.py              # MinimalNormalizer, ColumnMergeCleaner
│       ├── prose_filter.py          # Heuristic + GPU perplexity scoring
│       └── pipeline.py              # PII (Presidio NER), Quality, Spam, Dedup
│
├── training/                        # Training loops
│   ├── bin.pretrain.py              # Pre-training (Stage 1)
│   ├── bin.sft.py                   # SFT (Stage 2)
│   ├── bin.dpo.py                   # DPO (Stage 3, planned)
│   └── train.runpod.sh              # Remote GPU deployment
│
├── eval/                            # Evaluation & diagnostics
│   ├── bin.eval_base_model.py       # Base model eval
│   ├── bin.eval_sft_model_*.py      # SFT model eval
│   └── bin.diagnose_sft*.py         # SFT debugging tools
│
├── inference/                       # Generation & chat
│   ├── bin.chat.py                  # Interactive streaming REPL
│   ├── bin.generate.py              # Text generation
│   └── bin.embed.py                 # Embedding extraction (planned)
│
├── utils/                           # Utilities
│   ├── chatML.py                    # ChatML encoding + loss mask
│   └── bin.download_aws_*.py        # S3 checkpoint download
│
└── docs/                            # Documentation
    ├── PAPER.md                     # Technical paper (Skylar v3.0)
    └── GUIDE.md                     # Training guide
Skylar ships with a rich terminal chat interface powered by the rich library:
python inference/bin.chat.py --model checkpoints_sft/best
╭─────────────── Skylar Chat ────────────────╮
│ Model: checkpoints_sft/best                │
│ Params: 107M │ Context: 16K │ Device: cuda │
╰────────────────────────────────────────────╯
You > Spiegami cos'è il RoPE in modo semplice
Skylar > Il RoPE (Rotary Position Embeddings) è un modo
elegante per dire al modello "dove" si trova ogni parola
nella frase. Invece di aggiungere numeri alla posizione,
ruota i vettori nello spazio...
| Command | Description |
|---|---|
| `/system <msg>` | Set system prompt |
| `/temp <0.0-2.0>` | Adjust temperature |
| `/topk <n>` | Set top-k sampling |
| `/topp <0.0-1.0>` | Set nucleus sampling |
| `/rep <1.0-2.0>` | Repetition penalty |
| `/max <n>` | Max generation tokens |
| `/clear` | Clear conversation |
| `/config` | Show current config |
| `/help` | Show all commands |
| `/quit` | Exit |
Every component (RMSNorm math, RoPE frequency tables, GQA head expansion, FlexAttention mask functions, µP scaling rules) is a standalone, readable PyTorch module. No library-level black boxes.
The 4B to 32B presets are architecturally identical to Qwen3's published configurations; only `vocab_size` differs. Techniques developed here transfer directly to and from the Qwen3 family.
The attention layer auto-selects the optimal path:
Full Maximal Update Parameterization: width-scaled init, per-group LR, 1/d_head attention scaling, output logit scaling. Tune HPs on ~50M params, transfer to 4B+ for free.
Multi-agent pipeline: Analyst → Turn Builder → Validator → Dedup → Shuffle. Supports OpenAI, Anthropic, and self-hosted vLLM endpoints as teacher models.
Configurable ByteLevel BPE (the released
| Preset | VRAM | Device | Throughput | Time for Chinchilla |
|---|---|---|---|---|
| `test` | <1 GB | CPU / any GPU | - | minutes |
| `small` | ~2 GB | RTX 4090 | ~100K tok/s | ~5 hours ✓ |
| `medium` | ~6 GB | RTX 4090 | ~51K tok/s | ~10 hours ✓ |
| `large` | ~12 GB | RTX 4090 | ~19K tok/s | ~4 days |
| `1B` | ~48 GB | RTX PRO 6000 | ~12K tok/s | ~19 days |
| `4b` | ~96 GB | RTX PRO 6000 | ~3.5K tok/s | ~9 months |
| `4b` | 5× DGX H100 | 40 GPUs | ~240K tok/s | ~8 days |
| `8b` | 5× DGX H100 | 40 GPUs | ~170K tok/s | ~20 days |
| `14b` | 10× DGX H100 | 80 GPUs | ~150K tok/s | ~44 days |
| `32b` | 10× DGX H100 | 80 GPUs | ~55K tok/s | ~9 months |
| `128b` | 512× H100 | cluster | ~40K tok/s | ~24 months |
✓ = measured benchmarks on actual hardware
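A "time for Chinchilla" estimate can be derived from parameter count and throughput. The sketch below assumes the common rule of thumb of about 20 training tokens per parameter; that ratio and the function name `chinchilla_days` are assumptions, not values taken from the Skylar code. It lands close to the `large` and `1B` rows; other rows may reflect different token budgets or measured runs.

```python
SECONDS_PER_DAY = 86_400

def chinchilla_days(params, tokens_per_second, tokens_per_param=20):
    """Days needed to train on a compute-optimal token budget at a given throughput."""
    tokens = params * tokens_per_param
    return tokens / tokens_per_second / SECONDS_PER_DAY

days_large = chinchilla_days(358e6, 19_000)  # `large` preset on one RTX 4090
days_1b = chinchilla_days(1.0e9, 12_000)     # `1B` preset on one RTX PRO 6000
print(f"large: {days_large:.1f} days, 1B: {days_1b:.1f} days")
```

Because the token budget grows linearly with parameters while per-GPU throughput falls, single-GPU training time grows faster than model size, which is why the larger presets are listed only on multi-node hardware.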
from models.config import get_config, NanoTransformerConfig
# Use a preset
config = get_config("medium")
# Customize
config = get_config("medium", vocab_size=40960, dropout=0.05, max_seq_len=32768)
# Full manual config
config = NanoTransformerConfig(
vocab_size=40960,
d_model=768,
n_heads=12,
n_kv_heads=4,
n_layers=12,
d_ff=2048,
max_seq_len=16384,
qk_norm=True,
rope_theta=1_000_000.0,
mup_base_d_model=256, # enable µP with 256-width proxy
)
python data/bin.tokenizer.py \
--data corpus/ \
--vocab_size 40960 \
--output .datasets/tokenizer
| Feature | Detail |
|---|---|
| Algorithm | ByteLevel BPE (HuggingFace tokenizers, Rust-backed) |
| Vocab | configurable (default 40,960; medium_plus uses 32,768) |
| Output | Sharded uint32 .bin files (1GB each) |
| Checksums | SHA256 per shard + JSON metadata |
| Upload | Auto S3 multipart transfer |
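To make the shard format concrete, the sketch below writes token ids as raw uint32 values, records a SHA256 checksum in a JSON sidecar, and verifies the checksum on read. It uses only the standard library; the little-endian byte order, the file names, and the metadata fields are assumptions, not the layout produced by `data/bin.tokenizer.py`.

```python
import hashlib
import json
import struct
import tempfile
from pathlib import Path

def write_shard(token_ids, path):
    """Write token ids as little-endian uint32 and return the SHA256 hex digest."""
    raw = struct.pack(f"<{len(token_ids)}I", *token_ids)
    Path(path).write_bytes(raw)
    return hashlib.sha256(raw).hexdigest()

def read_shard(path, expected_sha256):
    """Read a shard back, refusing to decode it if the checksum does not match."""
    raw = Path(path).read_bytes()
    if hashlib.sha256(raw).hexdigest() != expected_sha256:
        raise ValueError(f"checksum mismatch for {path}")
    return list(struct.unpack(f"<{len(raw) // 4}I", raw))

with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "shard_00000.bin"  # hypothetical file name
    digest = write_shard([1, 40_959, 7, 2], shard)
    meta = {"file": shard.name, "sha256": digest, "dtype": "uint32", "n_tokens": 4}
    (Path(tmp) / "shard_00000.json").write_text(json.dumps(meta))
    restored = read_shard(shard, digest)

print(restored)
```

Four bytes per token leaves headroom for vocabularies above 65,535 entries, such as the 151,936-token Qwen3 vocabulary mentioned earlier.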
Special tokens:
<pad> <bos> <eos> <|im_start|> <|im_end|>
<think> </think> <tool_call> </tool_call>
<tool_response> </tool_response>
The data/pretrain-pipeline/ module transforms raw documents (PDF, HTML, JSON, TXT) into a clean, shuffled, deduplicated corpus ready for tokenization. Built for Italian legal/institutional text (EUR-Lex, Gazzetta Ufficiale, Banca d'Italia).
python data/pretrain-pipeline/bin.pretrain_builder.py /path/to/raw_documents \
-o .datasets/pretokenized \
--max-gb 5 \
--gpu-perplexity
Raw files (PDF/HTML/JSON/TXT)
│
├── TXT (chat templates) ── bypass all filters ──► step 7
│     Curated ChatML data included as-is to teach
│     the model <|im_start|>/<|im_end|> structure
│     from the earliest pre-training steps.
▼
1. Extraction        Docling (GPU) / trafilatura / JSON
▼
2. Normalization     MinimalNormalizer: ftfy, NFKC
▼
3. Column Merge      Broken two-column PDF detection
▼
4. Prose Filter      Semantic chunking + scoring + PPL
▼
5. PII Redaction     Presidio NER (Italian)
│                    Names → <PERSONA>
│                    Email → <EMAIL>   Phone → <TELEFONO>
│                    CF → <CF>   IBAN → <IBAN>   CC → <CC>
▼
6. Quality + Spam + Dedup
▼
7. Shuffle + Write   Global shuffle → <bos>doc<eos> output chunks
| Component | Technology | Purpose |
|---|---|---|
| PDF Extraction | Docling (GPU) + pymupdf fallback | Layout-aware reading order |
| PII Detection | Microsoft Presidio + spaCy NER | Context-aware PII redaction |
| Perplexity | `facebook/xglm-564M` (optional GPU) | Filter incoherent / repetitive text |
| Near-Dedup | MinHash LSH (datasketch) | Fuzzy duplicate removal |
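The near-duplicate step relies on MinHash signatures. The sketch below illustrates the underlying estimate with the standard library only; it is not the `datasketch` implementation the pipeline uses and it omits the LSH banding that makes lookups scale. The shingle size, the number of permutations, and the function names are assumptions.

```python
import hashlib

def shingles(text, n=3):
    """Set of word n-grams used as the document's features."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(items, num_perm=128):
    """One minimum per salted hash function; the salt plays the role of a permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching minima, an unbiased estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "this decree enters into force on the day following its publication in the gazette"
doc_b = "this decree enters into force on the day following its publication in the journal"
doc_c = "the central bank publishes a yearly report about financial stability risks"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b))  # high: the documents differ by one word
print(estimated_jaccard(sig_a, sig_c))  # zero: no shared shingles
```

A dedup pass would drop one document of any pair whose estimate exceeds a chosen threshold; the threshold itself is a tuning decision not specified here.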
Despite being built from scratch, Skylar is fully HuggingFace-native:
from models.decoder import NanoTransformer
# Save
model.save_pretrained("my-skylar-model")
# Load
model = NanoTransformer.from_pretrained("my-skylar-model")
# Push to Hub (planned)
model.push_to_hub("username/skylar-medium")
bash training/train.runpod.sh --host <IP> --port <PORT>
The script handles everything:
- SSH connection + GPU verification (`nvidia-smi`)
- Source code upload via `scp`
- Dependency installation
- AWS credential export
- Background training launch with `nohup`
- Async checkpoint upload to S3
- Full decoder-only Transformer architecture
- RoPE + GQA + QK-Norm + SwiGLU
- FlexAttention document masking
- µP (Maximal Update Parameterization)
- Pre-training pipeline with packed sequences
- Pre-training corpus builder (Docling + Presidio PII + dedup)
- SFT with ChatML + loss masking
- Interactive streaming chat REPL
- Agentic synthetic data generation
- SkylarEmbedder (bidirectional dense retrieval)
- Contrastive training for the embedder (InfoNCE)
- Sparse retrieval: SkylarSparseEncoder (SPLADE-style)
- SkylarClassifier (BERT-style sequence classification)
- Preference optimization: ORPO / SimPO (reference-free; supersedes DPO)
- Public Italian benchmarks (XCOPA / HellaSwag / Belebele)
- Evaluation & diagnostic tools
- FastAPI/vLLM inference server
- Push-to-Hub support
- Long-context: sliding-window attention + YaRN (for 8K+)
- Streaming/memmap dataset (required for 4B/8B scale)
- Classic DPO (optional; `bin.preference.py` covers the reference-free variants)
| Document | Description |
|---|---|
| `docs/PAPER.md` | Full technical paper: Skylar v3.0 architecture |
| `docs/RESULTS.md` | Validated results: public benchmarks + retrieval vs bge-m3 / e5 |
| `docs/POSTTRAIN.md` | Post-training suite: chat, embeddings, sparse, classifier |
| `docs/GUIDE.md` | Step-by-step training guide |
| `docs/STRUCTURE.md` | Codebase structure reference |
| `docs/DISTILLATION.md` | Synthetic data distillation guide |
| `docs/TRAINING_SET.md` | Dataset preparation reference |
Python 3.10+
| Dependency | Role |
|---|---|
| PyTorch ≥ 2.1 | Core ML engine |
| Transformers | HF compatibility layer |
| Tokenizers | Rust-backed BPE |
| Accelerate | Multi-GPU (DDP/FSDP) |
| W&B | Experiment tracking |
| Boto3 | AWS S3 storage |
| OpenAI + Anthropic | Teacher model APIs |
| Presidio + spaCy | PII detection (NER) |
| Docling | GPU PDF extraction |
| Pydantic | Data validation |
| Rich | Terminal UI |
| Typer | CLI framework |
Skylar is licensed under the Apache License 2.0.
Copyright © 2026 Aleksandr Ivanovitch, who holds all intellectual property
rights in this software in his capacity as Chief Technology Officer (CTO) of
Sophia AI S.r.l. See the LICENSE and NOTICE files
for the full terms and attribution.
You are free to use, modify, and distribute this software under the terms of the Apache 2.0 license, provided you retain the copyright, patent, trademark, and attribution notices.
"The best way to understand a Transformer is to build one." (Skylar Project Philosophy)
Built with 🔥 by Aleksandr Ivanovitch, CTO, Sophia AI S.r.l.
Making frontier AI transparent, auditable, and reproducible.
Copyright © 2026 Aleksandr Ivanovitch, CTO, Sophia AI S.r.l. · Licensed under the Apache License 2.0.