# Notebook Course: Training a Custom Tiny Qwen3 MoE LLM from Scratch



Welcome to this hands-on course, where we build and train a Large Language Model (LLM) from scratch, entirely on Apple Silicon, using MLX-LM and my package MLX-LM-LoRA.

This is not another “fine-tune GPT” course — we’re creating a custom MoE model architecture, pretraining it on web-scale data, finetuning it using supervised data, and optimizing it with preference learning. All of it runs locally on your Mac.

---

📦 Minimum Requirements

To run this notebook locally, you’ll need:
- An Apple Silicon Mac (M1 Pro / M2 Max / M3, etc.)
- At least 32 GB of unified memory (less can work if you shrink the parameters to fit; more is better)
- Python ≥ 3.12

---

📚 Course Overview

This course covers the entire LLM training stack:
1.	Model Design: Define a Tiny Qwen3 MoE-style model
2.	Pretraining: Unsupervised training on `mlx-community/fineweb-200k`
3.	Supervised Finetuning (SFT): Using a subset of `digitalpipelines/wizard_vicuna_70k_uncensored`
4.	Preference Optimization: With `mlx-community/Human-Like-DPO`
5.	Evaluation + Inference: Test generations and evaluation

---

🤖 Model Architecture: Tiny Qwen3 MoE

We’ll build a custom minimal MoE version of the new Qwen3. The model uses:
- Transformer decoder-only blocks
- Sparse Mixture-of-Experts layers
- Rotary embeddings (RoPE)

We aim to keep it light enough to fully train on-device.
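
To make the rotary-embedding bullet more concrete, here is a minimal pure-Python sketch of RoPE applied to a single head vector. The helper names `rope_rotate` and `dot`, the toy vectors, and the half-split pairing (dimension `i` rotates together with dimension `i + d/2`) are illustrative assumptions; MLX ships its own optimized implementation, and this is not that code.

```python
import math


def rope_rotate(vec, position, theta=1000.0):
    """Rotate one head vector according to its token position (RoPE sketch)."""
    d = len(vec)
    assert d % 2 == 0, "head dimension must be even"
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        # Pair i rotates with frequency theta^(-2i/d).
        angle = position * theta ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x1 * s + x2 * c
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query/key vectors for a head of dimension 4 (made-up values).
q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.1, 0.7, 0.4]

# Both pairs of positions are 3 tokens apart, so the scores should match.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 13), rope_rotate(k, 10))
```

The point of the sketch is the last two lines: after rotation, the query-key dot product depends only on the distance between the two positions, not on the absolute positions themselves.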

---

🧪 Pretraining Dataset: mlx-community/fineweb-200k

We’ll use a subset of the FineWeb dataset hosted on Hugging Face:
mlx-community/fineweb-200k

It is:
- Cleaned, deduplicated web content
- Ideal for unsupervised autoregressive training
- Efficiently streamed on Apple Silicon
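
As a small illustration of what "unsupervised autoregressive training" means, the sketch below builds next-token prediction pairs from a token sequence and computes the cross-entropy loss for them. The function names, the toy token IDs, and the 8-token vocabulary are hypothetical; the real training loop in this course runs through `train_sft`, not this code.

```python
import math


def next_token_pairs(token_ids):
    """Inputs are all tokens but the last; targets are shifted by one."""
    return token_ids[:-1], token_ids[1:]


def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


tokens = [5, 2, 7, 1]
inputs, targets = next_token_pairs(tokens)

# A model that knows nothing predicts every token with equal probability.
vocab_size = 8
uniform_logits = [0.0] * vocab_size
loss = sum(cross_entropy(uniform_logits, t) for t in targets) / len(targets)
```

With uniform predictions the loss equals `ln(vocab_size)`. For a vocabulary of 151,936 tokens that is roughly 11.9, which is a useful sanity check for the first reported loss of a freshly initialized model, assuming its initial predictions are close to uniform.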

---

🧾 Prompt Templates

We’ll use a custom prompt template to go with Qwen3.
For finetuning, we adopt the following ChatLM-style template:

```text
<|im_start|>system description
This is a conversation between Josie, a helpful AI assistant and a human user.<|im_end|>
<|im_start|>user turn
{prompt}<|im_end|>
<|im_start|>assistant turn, name = 'Josie'
{answer}<|im_end|>
```

This helps align with conversational-tuned models and enables easy preference optimization later.
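
The template can be rendered with a few lines of string formatting. The following is a minimal sketch; `render_chat`, the `(role, text)` tuple format, and the example system prompt are assumptions made for illustration, not part of MLX-LM-LoRA.

```python
IM_START, IM_END = "<|im_start|>", "<|im_end|>"


def render_chat(system, turns, assistant_name="Josie"):
    """Render a system prompt plus (role, text) turns into the template above."""
    parts = [f"{IM_START}system description\n{system}{IM_END}"]
    for role, text in turns:
        if role == "user":
            header = "user turn"
        elif role == "assistant":
            header = f"assistant turn, name = '{assistant_name}'"
        else:
            raise ValueError(f"unknown role: {role!r}")
        parts.append(f"{IM_START}{header}\n{text}{IM_END}")
    return "\n".join(parts)


rendered = render_chat(
    "You are Josie.",  # hypothetical short system prompt
    [("user", "Hi!"), ("assistant", "Hello, how can I help?")],
)
print(rendered)
```

Because every turn is closed with `<|im_end|>`, the same function handles multi-turn conversations by simply passing more `(role, text)` pairs.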

---

🧠 SFT Dataset: digitalpipelines/wizard_vicuna_70k_uncensored

We’ll use the Wizard Vicuna dataset, known for:
- Rich instruction-following data
- Clean conversational structure
- Multi turn conversations
- Aligned with assistant-style prompting

This stage teaches the model what a conversation looks like.

---

❤️ Preference Optimization: mlx-community/Human-Like-DPO

To refine the model’s behavior further, we’ll apply Odds Ratio Preference Optimization (ORPO), a monolithic method that needs no separate reference model, using:

mlx-community/Human-Like-DPO

This dataset includes:
- Ranked preference pairs
- Human-like instructions and completions
- Fine-grained reward signal for better alignment
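
To give a feel for how preference pairs turn into a training signal, here is a sketch of the ORPO objective as described in the ORPO paper: the usual negative log-likelihood on the chosen answer plus a weighted odds-ratio term. The function names are hypothetical, the inputs are length-averaged log-probabilities, and treating the `beta` training argument as the weight of the odds-ratio term is an assumption; the actual implementation in MLX-LM-LoRA may differ in its details.

```python
import math


def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_avg_logp, rejected_avg_logp, beta=0.1):
    """SFT loss on the chosen answer plus a weighted odds-ratio penalty."""
    nll = -chosen_avg_logp
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    # -log(sigmoid(z)) written as log(1 + exp(-z)).
    odds_term = math.log1p(math.exp(-log_odds_ratio))
    return nll + beta * odds_term


# Made-up average log-probabilities for illustration.
tie = orpo_loss(-0.5, -0.5)           # model has no preference yet
small_margin = orpo_loss(-0.5, -0.6)  # chosen slightly more likely
large_margin = orpo_loss(-0.5, -2.0)  # chosen much more likely
```

The odds-ratio term shrinks as the chosen answer becomes more likely than the rejected one, so `large_margin` comes out lower than `small_margin`, which in turn is lower than `tie`.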

---

🧰 Tools Used
- MLX-LM: Apple-native LLM framework leveraging MLX runtime
- MLX-LM-LoRA: My package for full-precision and LoRA training, supporting DPO, ORPO, GRPO and more
- Hugging Face Datasets + Transformers for data loading and formatting

---

🏁 By The End Of This Course…

You’ll have:
- Designed and built a custom Qwen3 MoE model
- Pretrained it on web-scale text
- Supervised-finetuned it on instruction data
- Aligned it using preference optimization
- Deployed and evaluated it entirely on your Mac

Let’s get started! 🚀

# ⚙️ Installation: Set Up the Environment

Before we begin training, let’s install all necessary dependencies.

We’ll use:
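- `mlx-lm-lora` – includes mlx-lm, datasets
- `mlx-optimizers` – provides the Muon optimizer imported below
- `huggingface_hub` – for uploading our models
- `wandb` – for optional experiment tracking

In [None]:
!pip install mlx-lm-lora mlx-optimizers huggingface_hub wandb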

# 📦 Imports and Setup

Before building and training our Qwen3-style MoE model, we import everything needed:
- Core Python utilities (os, Path, dataclass)
- MLX-LM-LORA tools for defining models and saving configs
- HF token/tokenizer setup
- Model class and utility functions

In [None]:
from mlx_lm_lora.trainer.datasets import CacheDataset, ORPODataset, TextDataset
from mlx_lm_lora.trainer.orpo_trainer import ORPOTrainingArgs, train_orpo
from mlx_lm_lora.trainer.sft_trainer import SFTTrainingArgs, train_sft

from datasets import load_dataset
from huggingface_hub import create_repo, HfApi

from mlx_lm.tuner.utils import linear_to_lora_layers, print_trainable_parameters, get_total_parameters
from mlx_lm.utils import load, load_model, load_tokenizer, save_config, save_model
from mlx_lm.models.qwen3_moe import Model, BaseModelArgs
from mlx_lm.tuner.callbacks import TrainingCallback
from mlx_lm import generate

import mlx.optimizers as optim
from mlx_optimizers import Muon

from dataclasses import dataclass, field
from pathlib import Path
import math
import os

def calculate_iters(train_set, batch_size, epochs) -> int:
    num_samples = len(train_set)
    batches_per_epoch = math.ceil(num_samples / batch_size)
    iters = epochs * batches_per_epoch
    print(f"[INFO] Calculated {iters} iterations from {epochs} epochs (dataset size: {num_samples}, batch size: {batch_size})")
    return iters
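
As a quick worked example of the iteration arithmetic, the sketch below applies the same formula to the three stages of this course. It assumes that a 1% validation split holds out `ceil(0.01 * n)` examples; `iters_for` and `train_size` are stand-in names used only here.

```python
import math


def iters_for(num_samples, batch_size, epochs):
    """Same formula as calculate_iters, without the logging."""
    return epochs * math.ceil(num_samples / batch_size)


def train_size(total_samples, test_fraction=0.01):
    """Training examples left after holding out the validation split (assumed rounding)."""
    return total_samples - math.ceil(total_samples * test_fraction)


pretraining_iters = iters_for(train_size(2000), batch_size=4, epochs=1)
finetuning_iters = iters_for(train_size(1000), batch_size=1, epochs=1)
preference_iters = iters_for(train_size(100), batch_size=4, epochs=2)

print(pretraining_iters, finetuning_iters, preference_iters)
```

Note the rounding up in the preference stage: 99 training pairs at batch size 4 give 25 batches per epoch, so the last batch of each epoch is only partially filled.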

# 📌 Setup: Define HF Token and Model Metadata

Before training or uploading, we set up:
- Your Hugging Face token for authentication
- The model name (used for saving)
- The user or organization name
- The local path where everything will be stored

In [3]:
hf_token = "" # <-- Add your HF token here

new_model_name = "qwen3_tiny_moe"
user_name = "mlx-community"
author = "Gökdeniz Gülmez"

folder = "/Users/gokdenizgulmez/Desktop/mlx-lm-lora/examples/"

pretraining_dataset_name = "mlx-community/fineweb-200k"
pretraining_dataset_samples = 2000

finetuning_dataset_name = "digitalpipelines/wizard_vicuna_70k_uncensored"
finetuning_dataset_samples = 1000

preference_dataset_name = "mlx-community/Human-Like-DPO"
preference_dataset_samples = 100

# 📁 Prepare Target Directory for Model & Tokenizer

We define the full target path for saving the model and tokenizer.

If it doesn’t already exist, we:
- Clone the custom tokenizer repo from Hugging Face
- Place it inside the model folder

Otherwise, we skip cloning.

In [None]:
target_dir = os.path.join(folder, new_model_name)
if not os.path.exists(target_dir):
    !git clone https://huggingface.co/Goekdeniz-Guelmez/qwen3_tokenizer "{target_dir}"
else:
    print(f"Tokenizer already exists at: {target_dir}")

# 🧱 Define Model Configuration

We subclass the BaseModelArgs to create a tiny Qwen3 MoE architecture:
- Shallow network with only 2 layers
- Small hidden size (128) for fast local training
- 4 experts, 1 active per token (simple MoE routing)
- Tokenizer and embedding sizes that match Qwen3

This keeps the model small enough to:
- Fit on local Apple Silicon
- Still support pretraining and ORPO finetuning
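
The "4 experts, 1 active per token" routing can be sketched in plain Python. This is a toy illustration under stated assumptions: `route_token` and `moe_forward` are hypothetical names, the experts are simple scalar functions instead of MLP blocks, and the router logits are made up. The real layer in `mlx_lm.models.qwen3_moe` operates on batched tensors.

```python
import math


def route_token(router_logits, top_k=1, norm_topk_prob=True):
    """Return [(expert_index, weight), ...] for the top-k experts of one token."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    weights = [probs[i] for i in top]
    if norm_topk_prob:
        # Renormalize so the selected weights sum to 1.
        s = sum(weights)
        weights = [w / s for w in weights]
    return list(zip(top, weights))


def moe_forward(x, router_logits, experts, top_k=1):
    """Weighted sum of the outputs of the selected experts only."""
    out = 0.0
    for index, weight in route_token(router_logits, top_k):
        out += weight * experts[index](x)
    return out


# Four toy "experts" and made-up router logits for a single token.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
logits = [0.1, 2.0, -1.0, 0.5]

routing = route_token(logits)          # only expert 1 is selected
y = moe_forward(3.0, logits, experts)  # 1.0 * (2.0 * 3.0)
```

With a single active expert and renormalization switched on, the selected weight is always exactly 1, so the layer output is simply the output of the chosen expert. The other three experts are never evaluated for this token, which is where the compute savings of sparse MoE come from.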

In [15]:
@dataclass
class NewModelArgs(BaseModelArgs):
    mlp_only_layers: list[int] = field(default_factory=list)  # Default: no MLP-only layers

    model_type: str = "qwen3_moe"
    
    hidden_size: int = 128                   # Tiny, but enough for basic functionality
    num_hidden_layers: int = 2               # Very shallow
    intermediate_size: int = 256             # Typically 2x hidden size
    moe_intermediate_size: int = 256         # Same as above, or slightly larger
    num_attention_heads: int = 2             # Must divide hidden_size
    num_key_value_heads: int = 1             # Often set to 1 for tiny models
    head_dim: int = field(init=False)        # Will be computed post-init
    num_experts: int = 4                     # Small MoE with minimal experts
    num_experts_per_tok: int = 1             # One expert per token
    decoder_sparse_step: int = 1             # Typical setting

    rms_norm_eps: float = 1e-6               # Standard value
    vocab_size: int = 151936                 # Matches Qwen3 tokenizer size
    rope_theta: float = 1000.0               # RoPE base frequency (Qwen3 itself uses 1e6)

    tie_word_embeddings: bool = True         # Save params, good default
    max_position_embeddings: int = 1028      # Maximum context length in tokens
    norm_topk_prob: bool = True              # Renormalize the selected top-k routing weights to sum to 1

    def __post_init__(self):
        self.head_dim = self.hidden_size // self.num_attention_heads  # Auto-calculated

args = NewModelArgs()

# 🧪 Initialize the Model

We instantiate the model using the configuration from the previous cell. This builds the full architecture in memory, ready for pretraining.

In [None]:
untrained_model = Model(args)

# 🔍 Inspect the Model

We print:
- The full architecture and parameter count

In [None]:
print(untrained_model)
print(f"{get_total_parameters(untrained_model) / 1e6:.3f}M total parameters.")
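
For intuition about where the parameters go, here is a back-of-the-envelope estimate in plain Python. It is a sketch under several assumptions: no bias terms, a per-head RMSNorm on queries and keys, three weight matrices per expert (gate, up, down), every layer being an MoE layer, and tied input/output embeddings. The function `estimate_params` is hypothetical, and the number reported by `get_total_parameters` may differ if the real model deviates from these assumptions.

```python
def estimate_params(hidden=128, layers=2, heads=2, kv_heads=1, head_dim=64,
                    experts=4, moe_inter=256, vocab=151936, tied=True):
    """Rough parameter count for the tiny MoE config (assumptions in the text above)."""
    embed = vocab * hidden
    attn = (hidden * heads * head_dim            # q_proj
            + 2 * hidden * kv_heads * head_dim   # k_proj and v_proj
            + heads * head_dim * hidden)         # o_proj
    qk_norm = 2 * head_dim                       # RMSNorm on queries and keys
    router = hidden * experts                    # gating layer
    expert_mlps = experts * 3 * hidden * moe_inter
    layer_norms = 2 * hidden                     # pre-attention and pre-MLP norms
    per_layer = attn + qk_norm + router + expert_mlps + layer_norms
    total = embed + layers * per_layer + hidden  # + final norm
    if not tied:
        total += vocab * hidden                  # separate output projection
    return embed, per_layer, total


embed, per_layer, total = estimate_params()
print(f"embedding: {embed:,}  per layer: {per_layer:,}  total: {total:,}")
print(f"embedding share: {embed / total:.1%}")
```

Under these assumptions the embedding table accounts for the large majority of the parameters, because the vocabulary is huge compared with the hidden size. The transformer layers themselves are well under one million parameters in total.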

# 💾 Save Model & Config to Disk

We store:
- The full args configuration as config.json

The model weights are written with save_model(...) once pretraining has finished. We also load the tokenizer from the target directory here.

This gives us a reproducible starting point for training.

In [8]:
save_config(vars(args), f"{target_dir}/config.json")

In [9]:
tokenizer = load_tokenizer(Path(target_dir))

# Start Pretraining

In [11]:
pretrain_path = os.path.join(target_dir, "pretrained")
pretrain_folder = Path(pretrain_path)
pretrain_folder.mkdir(parents=True, exist_ok=True)

In [12]:
pretraining_opt = Muon(learning_rate=1e-5)

In [13]:
pretraining_dataset = load_dataset(pretraining_dataset_name)["train"]

if pretraining_dataset_samples is not None:
    pretraining_dataset = pretraining_dataset.select(range(pretraining_dataset_samples))

pretraining_train_dataset, pretraining_valid_dataset = pretraining_dataset.train_test_split(test_size=0.01, seed=42).values()

In [14]:
pretraining_train_set = TextDataset(pretraining_train_dataset, tokenizer, text_key='text')
pretraining_valid_set = TextDataset(pretraining_valid_dataset, tokenizer, text_key='text')

In [None]:
batch_size = 4
epochs = 1

train_sft(
    model=untrained_model,
    args=SFTTrainingArgs(
        batch_size=batch_size,
        iters=calculate_iters(train_set=pretraining_train_set, batch_size=batch_size, epochs=epochs),
        val_batches=1,
        steps_per_report=20,
        steps_per_eval=50,
        steps_per_save=-1,
        max_seq_length=untrained_model.args.max_position_embeddings,
        grad_checkpoint=True,
    ),
    optimizer=pretraining_opt,
    train_dataset=CacheDataset(pretraining_train_set),
    val_dataset=CacheDataset(pretraining_valid_set),
    training_callback=TrainingCallback()
)

In [20]:
pretrained_new_model_name = f"pretrained_{new_model_name}"

readme_file = f"""---
tags:
- mlx
- text-generation
pipeline_tag: text-generation
---

# Fully Pretrained Model: `{user_name}/{pretrained_new_model_name}`

This model was **pretrained from scratch** using the [`mlx-lm-lora`](https://github.com/Goekdeniz-Guelmez/mlx-lm-lora) training package on Apple Silicon via MLX.  
It is part of the **"Creating LLMs from Scratch"** notebook series by Gökdeniz Gülmez.

The model was trained on a subset of **{pretraining_dataset_name}**, using **{pretraining_dataset_samples}** samples.

---

## 🧾 Model Details

| Field                | Value                                   |
|---------------------|-----------------------------------------|
| **Model name**       | `{pretrained_new_model_name}`                     |
| **Pretraining type** | Full from scratch                      |
| **Training package** | [`mlx-lm-lora`](https://github.com/Goekdeniz-Guelmez/mlx-lm-lora) |
| **Model type**       | {untrained_model.args.model_type}      |
| **Author**           | {author}                               |

---

## 📦 Usage Example (Python)

```python
from mlx_lm.utils import load_model, load_tokenizer
from mlx_lm import generate

tokenizer = load_tokenizer("Goekdeniz-Guelmez/qwen3_tokenizer")
model = load_model("{user_name}/{pretrained_new_model_name}")

generate(model, tokenizer, "What is the meaning of life?")
```
"""

new_readme_path = f"{pretrain_path}/README.md"
with open(new_readme_path, "w") as new_readme_file:
    new_readme_file.write(readme_file)

save_model(pretrain_path, untrained_model)
save_config(vars(args), f"{pretrain_path}/config.json")

In [None]:
api = HfApi(token=hf_token)
create_repo(
  repo_id = f"{user_name}/{pretrained_new_model_name}",
  repo_type="model",
  exist_ok=True,
  token=hf_token,
  private=True
)
api.upload_folder(
  folder_path=pretrain_path,
  repo_id=f"{user_name}/{pretrained_new_model_name}",
  token=hf_token,
  commit_message="Initial Commit"
)

# Delete model from RAM

You can restart the notebook instead, but make sure you keep the pretrained model path (`pretrain_path`) before you do.

In [None]:
del untrained_model, args, pretrain_folder, pretraining_opt, pretraining_dataset, pretraining_train_dataset, pretraining_valid_dataset, pretraining_train_set, pretraining_valid_set, readme_file
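
One caveat about `del`: it only removes a name, and the memory is released once no other reference to the object remains. The sketch below demonstrates this with a stand-in class; `FakeModel` and the variable names are hypothetical and only serve the illustration.

```python
import gc
import weakref


class FakeModel:
    """Stand-in for a large model object."""

    def __init__(self):
        self.weights = [0.0] * 1000


model = FakeModel()
alias = model                # a second reference, e.g. stored in another variable
probe = weakref.ref(model)   # lets us check whether the object still exists

del model
still_alive = probe() is not None  # the alias keeps the object alive

del alias
gc.collect()
freed = probe() is None            # no references left, so the object is gone
```

In a notebook, output history variables can hold such extra references, which is one reason a kernel restart is the more reliable way to get memory back.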

# Start LoRA SFT

In [None]:
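# load_model expects a pathlib.Path; depending on the installed mlx-lm version it
# returns either the model or a (model, config) tuple, so handle both.
loaded = load_model(Path(pretrain_path))
pretrained_model = loaded[0] if isinstance(loaded, tuple) else loaded

max_seq_length = 512
num_lora_layers = 2  # the model only has 2 layers (num_hidden_layers)
lora_parameters = {"rank": 8, "dropout": 0.0, "scale": 10.0}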

In [None]:
finetune_path = os.path.join(target_dir, "finetuned")
finetune_folder = Path(finetune_path)
finetune_folder.mkdir(parents=True, exist_ok=True)

In [None]:
pretrained_model.freeze()

linear_to_lora_layers(
    model=pretrained_model,
    num_layers=num_lora_layers,
    config=lora_parameters,
    use_dora=False,
)

print_trainable_parameters(pretrained_model)
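
To illustrate what the LoRA conversion does, here is a pure-Python sketch of a LoRA linear layer: the frozen weight `W` is left untouched and a low-rank update `scale * (x @ A) @ B` is added to its output. The helper names and the tiny matrices are made up for illustration, and applying `scale` as a direct multiplier is an assumption about the convention used here.

```python
def vecmat(x, m):
    """Row vector times matrix: x is length n, m is n x k."""
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]


def lora_forward(x, w, lora_a, lora_b, scale):
    """Frozen base projection plus a scaled low-rank update."""
    base = vecmat(x, w)
    update = vecmat(vecmat(x, lora_a), lora_b)
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_count(in_dim, out_dim, rank):
    """Trainable parameters of one adapter: A is in x r, B is r x out."""
    return rank * (in_dim + out_dim)


x = [1.0, 2.0]
w = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2 x 2 weight (identity for readability)
lora_a = [[1.0], [1.0]]        # 2 x 1, rank 1
lora_b = [[0.5, -0.5]]         # 1 x 2

y = lora_forward(x, w, lora_a, lora_b, scale=10.0)
adapter_params = lora_param_count(128, 128, rank=8)
full_params = 128 * 128
```

For a 128 x 128 projection, a rank-8 adapter trains 2,048 parameters instead of 16,384. When `B` is all zeros, the layer output is identical to the frozen base layer.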

In [None]:
lora_args = {
    "lora_parameters": lora_parameters,
    "num_layers": num_lora_layers,
}

finetune_adapter_path = Path(os.path.join(finetune_path, "adapters"))
finetune_adapter_path.mkdir(parents=True, exist_ok=True)
# lora_args is already a dict, so it is passed directly (vars() would fail on a dict).
save_config(lora_args, finetune_adapter_path / "adapter_config.json")

In [None]:
system_prompt = """This is a conversation between Josie, a helpful AI assistant and a human user."""

EOS_TOKEN = tokenizer.eos_token

full_prompt_format = """<|im_start|>system description
{}<|im_end|>
<|im_start|>user turn
{}<|im_end|>
<|im_start|>assistant turn, name = 'Josie'
{}"""

system_turn = """<|im_start|>system description
{}"""

user_turn = """
<|im_start|>user turn
{}"""

assistant_turn = """
<|im_start|>assistant turn, name = 'Josie'
{}"""

def format_prompts_func(sample):
    this_conversation = sample["conversations"]

    # Always start with the system turn so `conversation` is defined
    # even when the sample does not hold a list of turns.
    conversation = system_turn.format(system_prompt + EOS_TOKEN)
    if isinstance(this_conversation, list):
        for turn in this_conversation:
            if turn["from"] == "human":
                conversation += user_turn.format(turn['value'] + EOS_TOKEN)
            elif turn["from"] == "gpt":
                conversation += assistant_turn.format(turn['value'] + EOS_TOKEN)

    sample["text"] = conversation
    return sample

finetuning_dataset = load_dataset(finetuning_dataset_name)["train"]

if finetuning_dataset_samples is not None:
    finetuning_dataset = finetuning_dataset.select(range(finetuning_dataset_samples))

finetuning_dataset = finetuning_dataset.map(format_prompts_func,)
finetuning_train_dataset, finetuning_valid_dataset = finetuning_dataset.train_test_split(test_size=0.01, seed=42).values()

In [None]:
finetuning_train_set = TextDataset(finetuning_train_dataset, tokenizer, text_key='text')
finetuning_valid_set = TextDataset(finetuning_valid_dataset, tokenizer, text_key='text')

In [None]:
finetuning_opt = optim.AdamW(learning_rate=1e-5)

In [None]:
batch_size = 1
epochs = 1

train_sft(
    model=pretrained_model,
    args=SFTTrainingArgs(
        batch_size=batch_size,
        iters=calculate_iters(train_set=finetuning_train_set, batch_size=batch_size, epochs=epochs),
        val_batches=1,
        steps_per_report=20,
        steps_per_eval=50,
        steps_per_save=50,
        adapter_file=str(finetune_adapter_path / "adapters.safetensors"),  # a file path, not the directory
        max_seq_length=max_seq_length,
        grad_checkpoint=True,
    ),
    optimizer=finetuning_opt,
    train_dataset=CacheDataset(finetuning_train_set),
    val_dataset=CacheDataset(finetuning_valid_set),
    training_callback=TrainingCallback()
)

In [None]:
finetuned_new_model_name = f"finetuned_{new_model_name}"

readme_file = f"""---
tags:
- mlx
- text-generation
pipeline_tag: text-generation
---

# Finetuned Model: `{user_name}/{finetuned_new_model_name}`

This model was **finetuned** using the [`mlx-lm-lora`](https://github.com/Goekdeniz-Guelmez/mlx-lm-lora) training package on Apple Silicon via MLX.  
It builds upon the base model {pretrained_new_model_name} that was previously **pretrained from scratch** and is part of the **"Creating LLMs from Scratch"** notebook series by Gökdeniz Gülmez.

The fine-tuning was performed on a subset of **{finetuning_dataset_name}**, using **{finetuning_dataset_samples}** samples.

---

## 🧾 Model Details

| Field                | Value                                   |
|---------------------|-----------------------------------------|
| **Model name**       | `{finetuned_new_model_name}`                     |
| **Finetuning type**  | Supervised fine-tuning (SFT)            |
| **Training package** | [`mlx-lm-lora`](https://github.com/Goekdeniz-Guelmez/mlx-lm-lora) |
| **Base model**       | {pretrained_model.args.model_type}      |
| **Author**           | {author}                                |

---

## 📦 Usage Example (Python)

```python
from mlx_lm.utils import load_model, load_tokenizer
from mlx_lm import generate

tokenizer = load_tokenizer("Goekdeniz-Guelmez/qwen3_tokenizer")
model = load_model("{user_name}/{finetuned_new_model_name}")

generate(model, tokenizer, "What is the meaning of life?")
```
"""

new_readme_path = f"{finetune_path}/README.md"
with open(new_readme_path, "w") as new_readme_file:
    new_readme_file.write(readme_file)

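# Merge the LoRA adapters into the base weights so the checkpoint reloads as a plain model
# (same approach as the mlx-lm fuse script), then save the model object, not the path.
from mlx.utils import tree_unflatten

fused_layers = [(name, module.fuse()) for name, module in pretrained_model.named_modules() if hasattr(module, "fuse")]
if fused_layers:
    pretrained_model.update_modules(tree_unflatten(fused_layers))

save_model(finetune_path, pretrained_model)
# `args` was deleted after pretraining, so take the config from the loaded model.
save_config(vars(pretrained_model.args), f"{finetune_path}/config.json")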

In [None]:
api = HfApi(token=hf_token)
create_repo(
  repo_id = f"{user_name}/{finetuned_new_model_name}",
  repo_type="model",
  exist_ok=True,
  token=hf_token,
  private=True
)
api.upload_folder(
  folder_path=finetune_path,
  repo_id=f"{user_name}/{finetuned_new_model_name}",
  token=hf_token,
  commit_message="Initial Commit"
)

In [None]:
del pretrained_model, finetune_folder, finetune_adapter_path, finetuning_opt, finetuning_dataset, finetuning_train_dataset, finetuning_valid_dataset, finetuning_train_set, finetuning_valid_set, readme_file

# Start Preference Optimization

In [None]:
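# load_model expects a pathlib.Path; some mlx-lm versions return (model, config).
loaded = load_model(Path(finetune_path))
finetuned_model = loaded[0] if isinstance(loaded, tuple) else loaded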

max_seq_length = finetuned_model.args.max_position_embeddings
num_lora_layers = -1
lora_parameters = {"rank": 8, "dropout": 0.0, "scale": 10.0}

In [None]:
preference_path = os.path.join(target_dir, "preference_optimized")
preference_folder = Path(preference_path)
preference_folder.mkdir(parents=True, exist_ok=True)

In [None]:
finetuned_model.freeze()

linear_to_lora_layers(
    model=finetuned_model,
    num_layers=num_lora_layers,
    config=lora_parameters,
    use_dora=False,
)

print_trainable_parameters(finetuned_model)

In [None]:
lora_args = {
    "lora_parameters": lora_parameters,
    "num_layers": num_lora_layers,
}

# lora_args is already a dict, so it is passed directly. The adapter config for this
# stage goes into the preference folder rather than the SFT adapter folder.
save_config(lora_args, Path(preference_path) / "adapter_config.json")

In [None]:
system_prompt = """This is a conversation between Josie, a helpful AI assistant and a human user."""

EOS_TOKEN = tokenizer.eos_token

full_prompt_format = """<|im_start|>system description
{}<|im_end|>
<|im_start|>user turn
{}<|im_end|>
<|im_start|>assistant turn, name = 'Josie'
{}"""

system_turn = """<|im_start|>system description
{}"""

user_turn = """
<|im_start|>user turn
{}"""

assistant_turn = """
<|im_start|>assistant turn, name = 'Josie'
{}"""

def format_prompts_func(sample):
    prompt = sample["prompt"]
    chosen = sample["chosen"]
    rejected = sample["rejected"]

    chosen_conversation = full_prompt_format.format(system_prompt, prompt, chosen)
    rejected_conversation = full_prompt_format.format(system_prompt, prompt, rejected)

    sample["rejected"] = rejected_conversation
    sample["chosen"] = chosen_conversation
    return sample

preference_dataset = load_dataset(preference_dataset_name)["train"]

if preference_dataset_samples is not None:
    preference_dataset = preference_dataset.select(range(preference_dataset_samples))

preference_dataset = preference_dataset.map(format_prompts_func,)
preference_train_dataset, preference_valid_dataset = preference_dataset.train_test_split(test_size=0.01, seed=42).values()

In [None]:
preference_train_set = ORPODataset(preference_train_dataset, tokenizer)
preference_valid_set = ORPODataset(preference_valid_dataset, tokenizer)

In [None]:
preference_opt = optim.AdamW(learning_rate=1e-5)

In [None]:
batch_size = 4
epochs = 2

train_orpo(
    model=finetuned_model,
    args=ORPOTrainingArgs(
        batch_size=batch_size,
        iters=calculate_iters(train_set=preference_train_set, batch_size=batch_size, epochs=epochs),
        val_batches=1,
        steps_per_report=20,
        steps_per_eval=50,
        steps_per_save=50,
        adapter_file=os.path.join(preference_path, "adapters.safetensors"),  # a file path, not the directory
        max_seq_length=max_seq_length,
        grad_checkpoint=True,
        beta=0.1,
        reward_scaling=0.6
    ),
    optimizer=preference_opt,
    train_dataset=CacheDataset(preference_train_set),
    val_dataset=CacheDataset(preference_valid_set),
    training_callback=TrainingCallback()
)

In [None]:
preference_new_model_name = f"preference_optimized_{new_model_name}"

readme_file = f"""---
tags:
- mlx
- text-generation
pipeline_tag: text-generation
---

# Preference-Optimized Model: `{user_name}/{preference_new_model_name}`

This model was optimized via **preference-based alignment** using the [`mlx-lm-lora`](https://github.com/Goekdeniz-Guelmez/mlx-lm-lora) training package on Apple Silicon via MLX.  
It builds upon a model that was **pretrained from scratch** and then **fine-tuned** with supervised learning, forming part of the **"Creating LLMs from Scratch"** notebook series by Gökdeniz Gülmez.

The preference optimization phase used **{preference_dataset_name}** with **{preference_dataset_samples}** ranked prompt pairs to align model outputs with human preferences.

---

## 🧾 Model Details

| Field                    | Value                                          |
|-------------------------|------------------------------------------------|
| **Model name**           | `{preference_new_model_name}`                           |
| **Alignment type**       | Preference Optimization (ORPO)                 |
| **Training package**     | [`mlx-lm-lora`](https://github.com/Goekdeniz-Guelmez/mlx-lm-lora) |
| **Base model**           | {finetuned_model.args.model_type}             |
| **Author**               | {author}                                      |

---

## 📦 Usage Example (Python)

```python
from mlx_lm.utils import load_model, load_tokenizer
from mlx_lm import generate

tokenizer = load_tokenizer("Goekdeniz-Guelmez/qwen3_tokenizer")
model = load_model("{user_name}/{preference_new_model_name}")

generate(model, tokenizer, "What is the meaning of life?")
```
"""

new_readme_path = f"{preference_path}/README.md"
with open(new_readme_path, "w") as new_readme_file:
    new_readme_file.write(readme_file)

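# Merge the LoRA adapters into the base weights before saving (same approach as the mlx-lm fuse script).
from mlx.utils import tree_unflatten

fused_layers = [(name, module.fuse()) for name, module in finetuned_model.named_modules() if hasattr(module, "fuse")]
if fused_layers:
    finetuned_model.update_modules(tree_unflatten(fused_layers))

save_model(preference_path, finetuned_model)
# `args` was deleted after pretraining, so take the config from the loaded model.
save_config(vars(finetuned_model.args), f"{preference_path}/config.json")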

In [None]:
api = HfApi(token=hf_token)
create_repo(
  repo_id = f"{user_name}/{preference_new_model_name}",
  repo_type="model",
  exist_ok=True,
  token=hf_token,
  private=True
)
api.upload_folder(
  folder_path=preference_path,
  repo_id=f"{user_name}/{preference_new_model_name}",
  token=hf_token,
  commit_message="Initial Commit"
)