# Environment Setup: Package Installation

This cell installs the pinned Python packages for the training and quantization pipeline.  
- **unsloth** (2025.12.8): Core framework for efficient QLoRA training and inference.
- **torch** (2.9.1), **xformers** (0.0.33.post2): Tensor backend and memory-efficient attention kernels.
- **transformers** (4.57.3), **accelerate** (1.12.0): Model loading, tokenization, and hardware acceleration.
- **numpy** (1.26.4): Array operations for preprocessing and metric computation.
- **llmcompressor** (0.6.0): Used exclusively for FP8 conversion and quantization prior to downstream inference (not required for core training). The install command in this cell does not include it, so install it separately before the quantization step.
- **scipy** (1.16.3): Required only for post-training statistical analysis (e.g., Wilcoxon tests).

> **Note:**  
> This cell is only required on a fresh environment or the first run. If all dependencies are already installed, it can be safely skipped.  
>  
> **llmcompressor** and **scipy** are *not* required for core model training. **scipy** is pinned here to avoid potential version mismatch issues.


In [None]:
# Install dependencies
!pip install torch==2.9.1 unsloth==2025.12.8 transformers==4.57.3 accelerate==1.12.0 numpy==1.26.4 xformers==0.0.33.post2 scipy==1.16.3
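
To decide whether the install cell can be skipped, the installed versions can be compared against the pins. The following is a minimal sketch using only the standard library; `EXPECTED` and `find_mismatches` are illustrative names, not part of the original notebook.

```python
from importlib import metadata

# Pins copied from the install command above.
EXPECTED = {
    "torch": "2.9.1",
    "unsloth": "2025.12.8",
    "transformers": "4.57.3",
    "accelerate": "1.12.0",
    "numpy": "1.26.4",
    "xformers": "0.0.33.post2",
    "scipy": "1.16.3",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every unmet pin."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        # Ignore local build tags such as "+cu128" when comparing.
        if installed is None or installed.split("+")[0] != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


for name, (wanted, installed) in find_mismatches(EXPECTED).items():
    print(f"{name}: expected {wanted}, found {installed}")
```

If nothing is printed, every pin is satisfied and the install cell can be skipped.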

# Notebook Structure and Usage Guide

This notebook is organized into **five sections**:
1. **Preliminary Setup** (imports, random seeds, model selection)
2. **Phase 1 Training**
3. **Phase 2 Training**
4. **Phase 3 Training**
5. **Phase 4 (Full-Batch Alignment)**

**Execution Guidance:**
- Each phase can be run independently in separate sessions or kernel instances, *provided that the output artifacts (models/checkpoints) from all prior phases are available* in the same workspace.
- If preferred, the full notebook can be executed sequentially (“Run All”), running the entire pipeline start to finish in one kernel instance.
- The **Preliminary Setup** section must always be run at the start of every session, as it initializes the environment, imports required libraries, and sets random seeds.

> *Tip:*  
> If you are resuming after an interruption or switching kernel, re-run the **Preliminary Setup** and then proceed directly to the phase of interest, as long as previous phases have already been completed and their outputs saved.
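
Before jumping to a later phase, it can help to confirm that the earlier merged models are on disk. The sketch below assumes every phase follows the naming used by the Phase 1 and Phase 2 cells (`model_name + "phase1"`, `"Qwen/Qwen3-14Bphase2"`); whether Phases 3 and 4 follow the same convention is an assumption, and `phase_output_dir` and `missing_prior_phases` are hypothetical helpers.

```python
from pathlib import Path

MODEL_NAME = "Qwen/Qwen3-14B"  # same value as `model_name` in this notebook


def phase_output_dir(phase, model_name=MODEL_NAME):
    """Directory of the merged model for a phase, e.g. Qwen/Qwen3-14Bphase1."""
    # Plain string concatenation, so "Qwen" becomes a parent directory.
    return Path(model_name + f"phase{phase}")


def missing_prior_phases(target_phase, root=Path(".")):
    """List the prior phases whose merged-model directory is absent under root."""
    return [
        phase
        for phase in range(1, target_phase)
        if not (root / phase_output_dir(phase)).is_dir()
    ]
```

For example, `missing_prior_phases(3)` returns an empty list only when both the Phase 1 and Phase 2 directories exist in the current workspace.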


# Preliminary Setup

### Environment and Random Seed Initialization

This cell imports all necessary libraries, sets up deterministic random seeds for full reproducibility, and configures PyTorch CUDA settings (if available).

- `unsloth`, `transformers`, `trl`, and `peft` provide the core fine-tuning and adapter infrastructure.
- `numpy` and Python's `random` standard library are used for explicit seed management.
- `Path` is used for consistent file operations.
- All seeds are set to the same fixed value (`3407`) so that runs are as repeatable as possible across sessions and environments.

*You must run this cell at the start of every notebook session or after resetting the kernel.*


In [1]:
import torch
import os

dtype = torch.float16
n_gpus = torch.cuda.device_count()

GPU_BUFFERS = tuple([torch.empty(2*256*2048, dtype = dtype, device = f"cuda:{i}") for i in range(n_gpus)])

import gc
import json
from datetime import datetime, timezone

from unsloth import FastLanguageModel
import torch
from datasets import Dataset
from peft import PeftModel
from transformers import set_seed, AutoTokenizer, AutoModelForCausalLM, TextStreamer
from trl import SFTTrainer, SFTConfig
import random
import numpy as np
from pathlib import Path

# Set random seed for reproducibility
seed = 3407
set_seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)
random.seed(seed)
np.random.seed(seed)



🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


### Model Selection and Data Loading Utilities

- `model_name` sets the base model checkpoint to use (e.g., `"Qwen/Qwen3-14B"`). You can adjust this to select among supported Qwen3 sizes.
- The `load_split()` utility function loads training and test data splits from local JSON files (expects UTF-8 encoding).
- This modular loader will be used by each phase to ingest its own split of curriculum/alignment data.

*Adjust `model_name` if running on a different scale of Qwen3, and ensure the required data files are present in your workspace.*


In [2]:
model_name = "Qwen/Qwen3-14B"

# Data loader
def load_split(train_path, test_path):
    """Load the train and test splits from two UTF-8 encoded JSON files."""
    with open(train_path, 'r', encoding='utf-8') as f:
        train_data = json.load(f)
    with open(test_path, 'r', encoding='utf-8') as f:
        test_data = json.load(f)
    return train_data, test_data

# Phase 1: Single-Turn TIME Interventions (Structured Pre-Reasoning)

This phase introduces models to structured temporal inputs and expected outputs within **single-turn interactions**. The purpose is to *prime* the model toward sensitivity to:

- Temporal tags (e.g., `<time>`)
- Multiple reasoning block slots (`<think>...</think>`)

This aims to encourage surface imitation before deeper semantic reasoning is introduced.

> The actual outputs from the author's training runs are preserved in the notebook for transparency and auditability.

### Load Phase 1 Dataset

This cell loads the preprocessed JSON files for Phase 1 (designed for single-turn conversations only) into memory.

- Each entry is a fully-formed prompt-response example, structured as chat messages.
- The files must be named `phase1_train.json` and `phase1_test.json`, and must already be formatted with roles (`user`, `assistant`) and timestamps.

The print output confirms successful loading and gives visibility into the dataset size.

In [3]:
# Load phase 1 data
train_data, test_data = load_split('phase1_train.json', 'phase1_test.json')

print(f"Loaded {len(train_data)} train and {len(test_data)} test conversations.")

Loaded 2188 train and 387 test conversations.
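
A quick structural check can catch malformed entries before the chat template is applied. The sketch below assumes each conversation is a list of message dictionaries with `role` and `content` keys, plus a `timestamp` string on user messages (the field the chat template reads). The exact schema of the JSON files is not shown in this notebook, so treat these checks as assumptions; `find_format_problems` is a hypothetical helper.

```python
ALLOWED_ROLES = {"system", "user", "assistant", "tool"}


def find_format_problems(conversations):
    """Return a list of (conversation_index, message_index, problem) tuples."""
    problems = []
    for ci, messages in enumerate(conversations):
        if not isinstance(messages, list) or not messages:
            problems.append((ci, None, "conversation is not a non-empty list"))
            continue
        for mi, msg in enumerate(messages):
            if not isinstance(msg, dict):
                problems.append((ci, mi, "message is not a dict"))
                continue
            role = msg.get("role")
            if role not in ALLOWED_ROLES:
                problems.append((ci, mi, f"unexpected role: {role!r}"))
            if not isinstance(msg.get("content"), str):
                problems.append((ci, mi, "content is not a string"))
            # A user message without a timestamp would be rendered without <time>.
            if role == "user" and not isinstance(msg.get("timestamp"), str):
                problems.append((ci, mi, "user message has no string timestamp"))
    return problems
```

Calling `find_format_problems(train_data)` and `find_format_problems(test_data)` should return empty lists if the files match the assumed layout.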


### Initialize Tokenizer and Chat Template

This cell initializes the tokenizer from the selected base model (e.g., `Qwen/Qwen3-14B`) and configures a custom `chat_template` that emulates the expected format during fine-tuning and downstream inference.

- The template ensures that timestamps, assistant formatting and responses are in line with our framework.
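- The structure mirrors the Qwen3 convention with TIME-specific modifications: a `<time>` line is emitted from each user message's `timestamp` field, and assistant content, including its `<think>...</think>` blocks, is passed through unchanged.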

In [4]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.chat_template = """{%- if tools %}\n    {{- \'<|im_start|>system\\n\' }}\n    {%- if messages[0].role == \'system\' %}\n        {{- messages[0].content + \'\\n\\n\' }}\n    {%- endif %}\n    {{- "# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>" }}\n    {%- for tool in tools %}\n        {{- "\\n" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- "\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\"name\\": <function-name>, \\"arguments\\": <args-json-object>}\\n</tool_call><|im_end|>\\n" }}\n{%- else %}\n    {%- if messages[0].role == \'system\' %}\n        {{- \'<|im_start|>system\\n\' + messages[0].content + \'<|im_end|>\\n\' }}\n    {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n    {%- set index = (messages|length - 1) - loop.index0 %}\n    {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith(\'<tool_response>\') and message.content.endswith(\'</tool_response>\')) %}\n        {%- set ns.multi_step_tool = false %}\n        {%- set ns.last_query_index = index %}\n    {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n    {%- if message.content is string %}\n        {%- set content = message.content %}\n    {%- else %}\n        {%- set content = \'\' %}\n    {%- endif %}\n    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}\n        {{- \'<|im_start|>\' + message.role + \'\\n\'}}\n        {%- if message.role == "user" and message.timestamp is defined %}\n            {{- \'<time>\' + message.timestamp + \'</time>\\n\' }}\n        {%- endif %}\n        {{- content + \'<|im_end|>\\n\' }}\n    {%- elif message.role 
== "assistant" %}\n        {{- \'<|im_start|>\' + message.role + \'\\n\' + content + \'<|im_end|>\\n\' }}\n        {%- if message.tool_calls %}\n            {%- for tool_call in message.tool_calls %}\n                {%- if (loop.first and content) or (not loop.first) %}\n                    {{- \'\\n\' }}\n                {%- endif %}\n                {%- if tool_call.function %}\n                    {%- set tool_call = tool_call.function %}\n                {%- endif %}\n                {{- \'<tool_call>\\n{"name": "\' }}\n                {{- tool_call.name }}\n                {{- \'", "arguments": \' }}\n                {%- if tool_call.arguments is string %}\n                    {{- tool_call.arguments }}\n                {%- else %}\n                    {{- tool_call.arguments | tojson }}\n                {%- endif %}\n                {{- \'}\\n</tool_call>\' }}\n            {%- endfor %}\n        {%- endif %}\n    {%- elif message.role == "tool" %}\n        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}\n            {{- \'<|im_start|>user\' }}\n        {%- endif %}\n        {{- \'\\n<tool_response>\\n\' }}\n        {{- content }}\n        {{- \'\\n</tool_response>\' }}\n        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}\n            {{- \'<|im_end|>\\n\' }}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- \'<|im_start|>assistant\\n\' }}\n{%- endif %}"""
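
The template is hard to read as a single escaped string, so here is a simplified pure-Python approximation of its behavior for conversations without tools. It is a reading aid, not a replacement for `apply_chat_template`: `render_simplified` is a hypothetical helper that ignores tool calls and tool responses entirely.

```python
def render_simplified(messages, add_generation_prompt=False):
    """Approximate the template output for system/user/assistant messages only."""
    parts = []
    for msg in messages:
        role = msg["role"]
        content = msg.get("content", "")
        if role == "assistant":
            parts.append(f"<|im_start|>assistant\n{content}<|im_end|>\n")
            continue
        if role not in ("system", "user"):
            raise ValueError("tool messages are outside the scope of this sketch")
        parts.append(f"<|im_start|>{role}\n")
        # Only user turns that carry a timestamp get a <time> line.
        if role == "user" and "timestamp" in msg:
            parts.append(f"<time>{msg['timestamp']}</time>\n")
        parts.append(f"{content}<|im_end|>\n")
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example = [
    {
        "role": "user",
        "timestamp": "2023-09-01T00:00:00",
        "content": "How many days passed in August 2023?",
    },
]
print(render_simplified(example, add_generation_prompt=True))
```

The timestamp lives in its own message field rather than in `content`, so the same data can be rendered with or without the `<time>` line by changing only the template.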

### Apply Chat Template to Dataset

This applies the above `chat_template` to both training and testing splits, generating **raw string prompts** (without tokenization yet).

- This lets us visually inspect how system, user, and assistant roles are represented.
- The outputs printed in this cell are the *first 500 characters* of the first training and test samples, giving visibility into whether formatting is preserved.


In [5]:
# Apply chat template to both train and test data, do not tokenize yet
train_texts = tokenizer.apply_chat_template(
    train_data,
    tokenize=False,
)

test_texts = tokenizer.apply_chat_template(
    test_data,
    tokenize=False,
)

print(f"First formatted train example:\n{train_texts[0][:500]}")
print(f"First formatted test example:\n{test_texts[0][:500]}")

First formatted train example:
<|im_start|>system
You are an AI assistant. Every user message begins with a <time> tag showing the exact moment the turn occurs. Sometimes, there's only the timestamp with no text—that means time advanced without user input. Use <think>...</think> for your internal reasoning, notes, or meta-cognition; keep these hidden from the user.<|im_end|>
<|im_start|>user
<time>2023-11-01T10:56:00</time>
I'm boarding a plane in New York to fly to Sydney. The flight is in 10 minutes and the estimated flight
First formatted test example:
<|im_start|>user
<time>2023-09-01T00:00:00</time>
How many days passed in August 2023?<|im_end|>
<|im_start|>assistant
Let's determine how many days of August 2023 have already passed.
<think>
The current timestamp is September 1, 2023, 00:00:00. This means that all of August has just concluded. The model needs to recall the number of days in August.
</think>
Today is **September 1, 2023**.
<think>
August is a month that always has 

### Analyze Sequence Lengths Before Tokenization

This cell tokenizes the preformatted prompts (without truncation or special tokens) to estimate sequence lengths.

- Reports max, mean, and 90th percentile lengths for both train and test splits.
- This helps determine appropriate training sequence length selection strategies later.

We do **not yet tokenize for training**; this step is purely for exploratory analysis and debugging.


In [6]:
# Tokenize to find sequence lengths without truncation
train_lengths = [len(tokenizer(t, add_special_tokens=False)['input_ids']) for t in train_texts]
test_lengths = [len(tokenizer(t, add_special_tokens=False)['input_ids']) for t in test_texts]

max_train_len = max(train_lengths)
max_test_len = max(test_lengths)

print(f"Max train sequence length: {max_train_len}")
print(f"Max test sequence length: {max_test_len}")

# Get some stats about the distribution
print(f"Train: Mean = {sum(train_lengths)/len(train_lengths):.1f}, 90th percentile = {sorted(train_lengths)[int(0.9*len(train_lengths))]}")
print(f"Test: Mean = {sum(test_lengths)/len(test_lengths):.1f}, 90th percentile = {sorted(test_lengths)[int(0.9*len(test_lengths))]}")


Max train sequence length: 2057
Max test sequence length: 1504
Train: Mean = 310.8, 90th percentile = 527
Test: Mean = 311.1, 90th percentile = 544
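
The gap between the maximum (2057) and the 90th percentile (527) suggests that a few long examples dominate the required sequence length. The sketch below repeats the statistics with the same index rule as the cell above and adds a hypothetical helper, `fraction_longer_than`, for estimating how many examples a given limit would truncate.

```python
def length_stats(lengths):
    """Max, mean, and 90th percentile, using the index rule sorted(x)[int(0.9 * n)]."""
    ordered = sorted(lengths)
    return {
        "max": ordered[-1],
        "mean": sum(ordered) / len(ordered),
        "p90": ordered[int(0.9 * len(ordered))],
    }


def fraction_longer_than(lengths, limit):
    """Share of sequences that a max_seq_length of `limit` tokens would truncate."""
    return sum(1 for n in lengths if n > limit) / len(lengths)
```

For example, `fraction_longer_than(train_lengths, 1024)` gives the share of training examples that would be cut at 1024 tokens. This notebook avoids truncation altogether by using the maximum training length.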


### Convert Texts into TRL-Compatible Format

For use with `trl.SFTTrainer`, the data must be wrapped as a list of dictionaries with a single `"text"` field.

- This format will later be passed into HuggingFace `datasets.Dataset`.
- The printed sample verifies that each item has the right structure.

In [7]:
# Format for TRL SFTTrainer: list of dicts with a 'text' field
train_dataset = [{"text": x} for x in train_texts]
eval_dataset = [{"text": x} for x in test_texts]

print(f"First train sample: {train_dataset[0]}")
print(f"First eval sample: {eval_dataset[0]}")

First train sample: {'text': "<|im_start|>system\nYou are an AI assistant. Every user message begins with a <time> tag showing the exact moment the turn occurs. Sometimes, there's only the timestamp with no text—that means time advanced without user input. Use <think>...</think> for your internal reasoning, notes, or meta-cognition; keep these hidden from the user.<|im_end|>\n<|im_start|>user\n<time>2023-11-01T10:56:00</time>\nI'm boarding a plane in New York to fly to Sydney. The flight is in 10 minutes and the estimated flight duration is 21 hours. What's the local time in Sydney when I land?<|im_end|>\n<|im_start|>assistant\nLet's calculate your arrival time in Sydney considering the time difference and flight duration.\n<think>\nIt's 10:56 AM on November 1st, 2023 in New York.\n</think>\nCurrent local time in New York is **10:56 AM**.\n<think>\nTakeoff is in 10 minutes, making it 11:06 AM.\n</think>\nYou'll be airborne at **11:06 AM** New York time.\n<think>\nWith a flight durati

### Wrap as HuggingFace Dataset Objects

The lists of formatted examples are now converted into `datasets.Dataset` objects to be compatible with `SFTTrainer`.

- These map-style objects support batching and shuffling; streaming would require converting them to an iterable dataset.
- The printout confirms successful instantiation.


In [8]:
# Wrap the data in HuggingFace Datasets objects
train_dataset = Dataset.from_list(train_dataset)
eval_dataset = Dataset.from_list(eval_dataset)

print(train_dataset, eval_dataset)

Dataset({
    features: ['text'],
    num_rows: 2188
}) Dataset({
    features: ['text'],
    num_rows: 387
})


### Load Base Model with `FastLanguageModel`

We load the Qwen3-14B base model using **Unsloth's `FastLanguageModel`** with QLoRA-style 4-bit quantization:

- `load_in_4bit=True` enables QLoRA (memory-efficient fine-tuning).
- `full_finetuning=False` ensures adapter-based training (LoRA).
- `max_seq_length` is set to the maximum sequence length seen in the training set (determined earlier).

> Note: The `chat_template` must be manually re-injected due to tokenizer reload.


In [9]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_train_len,       # Highest sequence length in train dataset
    load_in_4bit = True,                  # QLoRA
    load_in_8bit = False,                 # We are using 4-bit
    full_finetuning = False,              # QLoRA/PEFT not full FT
)

tokenizer.chat_template = """{%- if tools %}\n    {{- \'<|im_start|>system\\n\' }}\n    {%- if messages[0].role == \'system\' %}\n        {{- messages[0].content + \'\\n\\n\' }}\n    {%- endif %}\n    {{- "# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>" }}\n    {%- for tool in tools %}\n        {{- "\\n" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- "\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\"name\\": <function-name>, \\"arguments\\": <args-json-object>}\\n</tool_call><|im_end|>\\n" }}\n{%- else %}\n    {%- if messages[0].role == \'system\' %}\n        {{- \'<|im_start|>system\\n\' + messages[0].content + \'<|im_end|>\\n\' }}\n    {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n    {%- set index = (messages|length - 1) - loop.index0 %}\n    {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith(\'<tool_response>\') and message.content.endswith(\'</tool_response>\')) %}\n        {%- set ns.multi_step_tool = false %}\n        {%- set ns.last_query_index = index %}\n    {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n    {%- if message.content is string %}\n        {%- set content = message.content %}\n    {%- else %}\n        {%- set content = \'\' %}\n    {%- endif %}\n    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}\n        {{- \'<|im_start|>\' + message.role + \'\\n\'}}\n        {%- if message.role == "user" and message.timestamp is defined %}\n            {{- \'<time>\' + message.timestamp + \'</time>\\n\' }}\n        {%- endif %}\n        {{- content + \'<|im_end|>\\n\' }}\n    {%- elif message.role 
== "assistant" %}\n        {{- \'<|im_start|>\' + message.role + \'\\n\' + content + \'<|im_end|>\\n\' }}\n        {%- if message.tool_calls %}\n            {%- for tool_call in message.tool_calls %}\n                {%- if (loop.first and content) or (not loop.first) %}\n                    {{- \'\\n\' }}\n                {%- endif %}\n                {%- if tool_call.function %}\n                    {%- set tool_call = tool_call.function %}\n                {%- endif %}\n                {{- \'<tool_call>\\n{"name": "\' }}\n                {{- tool_call.name }}\n                {{- \'", "arguments": \' }}\n                {%- if tool_call.arguments is string %}\n                    {{- tool_call.arguments }}\n                {%- else %}\n                    {{- tool_call.arguments | tojson }}\n                {%- endif %}\n                {{- \'}\\n</tool_call>\' }}\n            {%- endfor %}\n        {%- endif %}\n    {%- elif message.role == "tool" %}\n        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}\n            {{- \'<|im_start|>user\' }}\n        {%- endif %}\n        {{- \'\\n<tool_response>\\n\' }}\n        {{- content }}\n        {{- \'\\n</tool_response>\' }}\n        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}\n            {{- \'<|im_end|>\\n\' }}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- \'<|im_start|>assistant\\n\' }}\n{%- endif %}"""

==((====))==  Unsloth 2025.12.8: Fast Qwen3 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition. Num GPUs = 1. Max memory: 94.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

### Inject LoRA Adapters into the Model

We apply **LoRA adapters** using Unsloth's wrapper:

- Target modules include all major projection layers in MLP and attention blocks (`q_proj`, `v_proj`, `gate_proj`, etc.).
- Adapter rank is set to `r=32` with α=32.
- Dropout (`lora_dropout=0.05`) is kept for regularization, but it disables Unsloth's fast patching of the LoRA layers.
- Gradient checkpointing is used to reduce memory usage at a slight compute cost.

> ⚠️ **Note**: Unsloth emits a warning about the non-zero dropout, which is **benign**. This trade-off was accepted to preserve regularization via dropout. It does not affect correctness or convergence, only training speed.


In [10]:
model = FastLanguageModel.get_peft_model(
    model,
    r=32,              # Rank of LoRA
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha=32,     # Standard
    lora_dropout=0.05, # Standard
    bias="none",       # Standard
    random_state = seed,
    use_gradient_checkpointing=True  # Saves memory, a bit slower
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.12.8 patched 40 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
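
The number of trainable parameters follows directly from the rank and the shapes of the targeted projections: each adapted linear layer adds a matrix `A` of shape `r × in_features` and a matrix `B` of shape `out_features × r`. The sketch below works through the arithmetic. The Qwen3-14B dimensions are assumptions taken from the model's published configuration rather than from this notebook, and they reproduce the 128,450,560 trainable parameters reported in the Phase 1 training log.

```python
# Assumed Qwen3-14B dimensions (from the model config, not from this notebook).
HIDDEN = 5120
INTERMEDIATE = 17408
NUM_LAYERS = 40
NUM_HEADS, NUM_KV_HEADS, HEAD_DIM = 40, 8, 128

# (in_features, out_features) of every targeted projection in one decoder layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, NUM_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "o_proj": (NUM_HEADS * HEAD_DIM, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank, projections=PROJECTIONS, num_layers=NUM_LAYERS):
    """Each adapted layer adds A (rank x in) and B (out x rank) parameters."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in projections.values())
    return per_layer * num_layers


trainable = lora_param_count(rank=32)
total = 14_896_757_760  # total parameter count reported in the training log
print(f"{trainable:,} trainable ({100 * trainable / total:.2f}% of {total:,})")
```

Because the count scales linearly with `rank`, halving `r` to 16 would halve the number of trainable parameters under the same assumptions.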


### Configure and Initialize `SFTTrainer`

We use HuggingFace's `SFTTrainer` for instruction tuning:

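- Batch size = 8 with gradient accumulation of 4 → effective batch size = 32.
- Optimizer: 8-bit AdamW.
- 3 epochs with a linear LR schedule (peak learning rate `2e-5`) and 100 warmup steps.
- Training loss is logged every 10 steps. No in-training evaluation strategy is set, so the test split is only scored by the explicit `trainer.evaluate()` call after training.
- No external experiment tracker is used (`report_to = "none"`), and the same fixed seed is passed to the trainer.

> This trainer is set up for fine-tuning the model using the Phase 1 single-turn conversations.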


In [11]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,            # Use test set for evaluation
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 8,    # Sequences per device per step
        gradient_accumulation_steps = 4,    # Effective batch size = 32
        warmup_steps = 100,                 # Linear warmup before the decay starts
        num_train_epochs = 3,               # Standard
        learning_rate = 2e-5,               # Standard
        logging_steps = 10,                 # Log every 10 steps
        optim = "adamw_8bit",               # Standard for QLoRA
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = seed,
        report_to = "none",                 # Disable external experiment trackers
        max_grad_norm = 1.0,                # Standard
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=36):   0%|          | 0/2188 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=36):   0%|          | 0/387 [00:00<?, ? examples/s]

🦥 Unsloth: Padding-free auto-enabled, enabling faster training.
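
The configuration above determines both the number of optimizer steps and the shape of the learning-rate curve. The sketch below is a rough reconstruction: the rounding convention in `total_optimizer_steps` is an assumption, chosen because it reproduces the 207 total steps reported in the Phase 1 training log, and `linear_lr` is the usual linear warmup followed by linear decay to zero. Both function names are illustrative.

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Optimizer updates over the whole run, rounding partial batches up."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return updates_per_epoch * epochs


def linear_lr(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * max(0, total_steps - step) / (total_steps - warmup_steps)


total = total_optimizer_steps(num_examples=2188, per_device_batch=8, grad_accum=4, epochs=3)
print(f"Total optimizer steps: {total}")
for step in (0, 50, 100, 150, total):
    print(f"step {step:>3}: lr = {linear_lr(step, 2e-5, 100, total):.2e}")
```

With 100 warmup steps out of 207, the learning rate spends almost half of the run ramping up, which is worth keeping in mind when reading the early loss values.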


### Inspect GPU Device and Record Initial Memory Usage

Before starting training, we record:

- The name of the active GPU.
- Total available GPU memory.
- Reserved memory at the beginning of the run.

> This helps contextualize memory usage during training.


In [12]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved at start.")

GPU = NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition. Max memory = 94.557 GB.
13.334 GB of memory reserved at start.


### Train the Model and Track Peak GPU Memory Usage

The actual fine-tuning begins using `trainer.train()`.

- After completion, we log the **peak GPU memory** reserved during the training process (in GB).
- This value reflects memory requirements under 4-bit QLoRA with checkpointing.

Training stats (loss, etc.) are stored in `trainer_stats`.

In [13]:
# Begin training, collect stats
trainer_stats = trainer.train()

# After training, report peak memory reserved
peak_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
print(f"Peak GPU memory used during training: {peak_gpu_memory} GB.")

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,188 | Num Epochs = 3 | Total steps = 207
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 4 x 1) = 32
 "-____-"     Trainable parameters = 128,450,560 of 14,896,757,760 (0.86% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,1.8436
20,1.9193
30,1.8905
40,1.7803
50,1.425
60,1.2158
70,1.0492
80,0.913
90,0.745
100,0.6715


Peak GPU memory used during training: 17.695 GB.


### Evaluate the Fine-Tuned Model on Held-Out Phase 1 Test Set

To assess the model after training:

- We switch to `gradient_accumulation_steps = 1` for evaluation.
- `trainer.evaluate()` is run on the Phase 1 test data.

Results include loss and other metrics; these are printed for transparency.


In [14]:
trainer.args.gradient_accumulation_steps = 1
eval_metrics = trainer.evaluate()
print("Eval metrics:", eval_metrics)

Eval metrics: {'eval_loss': 0.5381345152854919, 'eval_runtime': 26.1575, 'eval_samples_per_second': 14.795, 'eval_steps_per_second': 3.708, 'epoch': 3.0}
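
The evaluation loss is easier to interpret as a perplexity. The sketch below assumes `eval_loss` is a mean token-level cross-entropy in nats, which is the usual convention for causal language model training but is not stated in this notebook. It also cross-checks the reported throughput against the 387 test conversations.

```python
import math

# Values copied from the recorded output above.
recorded = {
    "eval_loss": 0.5381345152854919,
    "eval_runtime": 26.1575,
    "eval_samples_per_second": 14.795,
}
num_eval_examples = 387

# Perplexity is the exponential of the mean cross-entropy.
perplexity = math.exp(recorded["eval_loss"])
print(f"Perplexity: {perplexity:.3f}")

# Throughput should match examples divided by runtime.
throughput = num_eval_examples / recorded["eval_runtime"]
print(f"Samples per second: {throughput:.3f}")
```

Perplexities are only comparable between models that share a tokenizer and chat template, so this number is most useful for tracking progress across the phases of this notebook.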


### Save LoRA Adapter and Tokenizer After Training

After training, we save:

- The **LoRA adapter** weights using `model.save_pretrained`.
- The tokenizer configuration using `tokenizer.save_pretrained`.

> These saved adapters are necessary to resume or merge LoRA layers later.

In [15]:
model.save_pretrained(model_name+"phase1_adapter")
tokenizer.save_pretrained(model_name+"phase1_adapter")

('Qwen/Qwen3-14Bphase1_adapter/tokenizer_config.json',
 'Qwen/Qwen3-14Bphase1_adapter/special_tokens_map.json',
 'Qwen/Qwen3-14Bphase1_adapter/chat_template.jinja',
 'Qwen/Qwen3-14Bphase1_adapter/vocab.json',
 'Qwen/Qwen3-14Bphase1_adapter/merges.txt',
 'Qwen/Qwen3-14Bphase1_adapter/added_tokens.json',
 'Qwen/Qwen3-14Bphase1_adapter/tokenizer.json')

### Free GPU and Host Memory (Intermediate Cleanup)

We delete the training artifacts:

- `model`, `tokenizer`, and `trainer` are explicitly deleted.
- Python's garbage collector is triggered.
- GPU memory is cleared using `torch.cuda.empty_cache()`.

> This prepares the environment for the final merge step without exceeding memory.

In [16]:
del model
del tokenizer
del trainer

gc.collect()
torch.cuda.empty_cache()

### Reload Base Model and Phase 1 Tokenizer (for Merging)

We now reload:

- The **original base model** (Qwen3-14B) in unquantized `bfloat16`, so the adapter is merged into 16-bit weights rather than the 4-bit weights used during training.
- The **tokenizer** saved with the LoRA adapter.

> This prepares us to merge adapter weights into the base model.

In [17]:
tokenizer = AutoTokenizer.from_pretrained(model_name+"phase1_adapter")
# `torch_dtype` still works in this transformers version but is deprecated in favor of `dtype`.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/8 [00:00<?, ?it/s]

### Merge Adapter Weights and Save Final Phase 1 Model

This step:

- Wraps the base model with PEFT's `PeftModel` and loads the adapter.
- Merges LoRA weights into the base model (`merge_and_unload()`).
- Saves the **final merged model** and tokenizer to the `Qwen/Qwen3-14Bphase1` directory (`model_name + "phase1"`).

> This produces a single checkpoint with integrated parameters for downstream training or inference.


In [18]:
model = PeftModel.from_pretrained(model, model_name+"phase1_adapter")
model = model.merge_and_unload()
model.save_pretrained(model_name+"phase1")
tokenizer.save_pretrained(model_name+"phase1")

('Qwen/Qwen3-14Bphase1/tokenizer_config.json',
 'Qwen/Qwen3-14Bphase1/special_tokens_map.json',
 'Qwen/Qwen3-14Bphase1/chat_template.jinja',
 'Qwen/Qwen3-14Bphase1/vocab.json',
 'Qwen/Qwen3-14Bphase1/merges.txt',
 'Qwen/Qwen3-14Bphase1/added_tokens.json',
 'Qwen/Qwen3-14Bphase1/tokenizer.json')
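
Conceptually, merging folds each adapter into its base weight as `W' = W + (lora_alpha / r) · B · A`, after which the adapter matrices are no longer needed. The toy example below illustrates this with plain Python lists. It sketches the arithmetic only, not PEFT's implementation, and `matmul` and `merge_lora` are illustrative helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha):
    """Return W + (lora_alpha / r) * B @ A, where r is the number of rows of A."""
    rank = len(lora_a)
    scale = lora_alpha / rank
    delta = matmul(lora_b, lora_a)  # same shape as the base weight
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy shapes: W is 2x2, A is 1x2 (rank 1), B is 2x1.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]
B = [[0.5], [-1.0]]
merged = merge_lora(W, A, B, lora_alpha=2)
print(merged)
```

In this notebook `lora_alpha` and `r` are both 32, so the scaling factor is 1 and the merged weight is simply `W + B · A`.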

### Final Cleanup

We delete the model and tokenizer objects again to free memory before proceeding to the next phase.

In [19]:
del model
del tokenizer

# Phase 2: Two-Turn Temporal Scenarios

Phase 2 curriculum introduces **structured two-turn conversations**, primarily designed to develop **temporal awareness** across contexts. These examples encourage the model to reason about elapsed time, memory continuity, and changes in user intent, across two turns.

⚠️ **Implementation Note**: The training code and workflow are structurally identical to Phase 1. All notebook cells are reused with minor changes like:

- `load_split('phase2_train.json', 'phase2_test.json')`
- Final model saved as `"Qwen/Qwen3-14Bphase2"`

If Phase 1 has already been run, you can begin directly by executing the notebook cells of Phase 2.

In [20]:
# Load phase 2 data
train_data, test_data = load_split('phase2_train.json', 'phase2_test.json')

print(f"Loaded {len(train_data)} train and {len(test_data)} test conversations.")

Loaded 5291 train and 935 test conversations.


In [21]:
tokenizer = AutoTokenizer.from_pretrained(model_name+"phase1")
tokenizer.chat_template = """{%- if tools %}\n    {{- \'<|im_start|>system\\n\' }}\n    {%- if messages[0].role == \'system\' %}\n        {{- messages[0].content + \'\\n\\n\' }}\n    {%- endif %}\n    {{- "# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>" }}\n    {%- for tool in tools %}\n        {{- "\\n" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- "\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\"name\\": <function-name>, \\"arguments\\": <args-json-object>}\\n</tool_call><|im_end|>\\n" }}\n{%- else %}\n    {%- if messages[0].role == \'system\' %}\n        {{- \'<|im_start|>system\\n\' + messages[0].content + \'<|im_end|>\\n\' }}\n    {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n    {%- set index = (messages|length - 1) - loop.index0 %}\n    {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith(\'<tool_response>\') and message.content.endswith(\'</tool_response>\')) %}\n        {%- set ns.multi_step_tool = false %}\n        {%- set ns.last_query_index = index %}\n    {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n    {%- if message.content is string %}\n        {%- set content = message.content %}\n    {%- else %}\n        {%- set content = \'\' %}\n    {%- endif %}\n    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}\n        {{- \'<|im_start|>\' + message.role + \'\\n\'}}\n        {%- if message.role == "user" and message.timestamp is defined %}\n            {{- \'<time>\' + message.timestamp + \'</time>\\n\' }}\n        {%- endif %}\n        {{- content + \'<|im_end|>\\n\' }}\n    {%- elif message.role 
== "assistant" %}\n        {{- \'<|im_start|>\' + message.role + \'\\n\' + content + \'<|im_end|>\\n\' }}\n        {%- if message.tool_calls %}\n            {%- for tool_call in message.tool_calls %}\n                {%- if (loop.first and content) or (not loop.first) %}\n                    {{- \'\\n\' }}\n                {%- endif %}\n                {%- if tool_call.function %}\n                    {%- set tool_call = tool_call.function %}\n                {%- endif %}\n                {{- \'<tool_call>\\n{"name": "\' }}\n                {{- tool_call.name }}\n                {{- \'", "arguments": \' }}\n                {%- if tool_call.arguments is string %}\n                    {{- tool_call.arguments }}\n                {%- else %}\n                    {{- tool_call.arguments | tojson }}\n                {%- endif %}\n                {{- \'}\\n</tool_call>\' }}\n            {%- endfor %}\n        {%- endif %}\n    {%- elif message.role == "tool" %}\n        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}\n            {{- \'<|im_start|>user\' }}\n        {%- endif %}\n        {{- \'\\n<tool_response>\\n\' }}\n        {{- content }}\n        {{- \'\\n</tool_response>\' }}\n        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}\n            {{- \'<|im_end|>\\n\' }}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- \'<|im_start|>assistant\\n\' }}\n{%- endif %}"""

In [22]:
# Apply chat template to both train and test data, do not tokenize yet
train_texts = tokenizer.apply_chat_template(
    train_data,
    tokenize=False,
)

test_texts = tokenizer.apply_chat_template(
    test_data,
    tokenize=False,
)

print(f"First formatted train example:\n{train_texts[0][:500]}")
print(f"First formatted test example:\n{test_texts[0][:500]}")

First formatted train example:
<|im_start|>user
<time>2024-05-10T16:19:14</time>
I'm running an experiment that finishes at 5 PM, and I need to understand the role of catalysts in chemical reactions. Can you explain how they work as I monitor the results?<|im_end|>
<|im_start|>assistant
Absolutely, let's cover catalysts while you oversee your experiment.
<think>
The experiment concludes at 5 PM, so there’s about 40 minutes available for this explanation. I’ll focus on the essentials and practical examples.
</think>
A **cataly
First formatted test example:
<|im_start|>user
<time>2023-07-25T18:00:00</time>
My friend is flying from London to New York. The flight departs London at 10:00 AM local time on July 26th and the flight duration is 7 hours. New York is 5 hours behind London. What time should I expect them to land in New York?<|im_end|>
<|im_start|>assistant
Let's calculate your friend's arrival time in New York, accounting for flight duration and the time difference.
<think>
Th
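
The formatted examples above show the effect of the custom template branch that injects a `<time>` tag whenever a user message carries a `timestamp` field. As a rough illustration only, the plain-Python sketch below mirrors that branch for a single user turn. `render_user_turn` is a hypothetical helper, not part of the tokenizer API; the real rendering is done by the Jinja template.

```python
from typing import Optional

IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_user_turn(content: str, timestamp: Optional[str] = None) -> str:
    """Mimic the template's user branch: header, optional <time> tag, content."""
    parts = [f"{IM_START}user\n"]
    if timestamp is not None:
        # The template emits the timestamp verbatim, so it must already be a string.
        parts.append(f"<time>{timestamp}</time>\n")
    parts.append(f"{content}{IM_END}\n")
    return "".join(parts)


rendered = render_user_turn("How many days till Christmas?", "2025-12-22T09:05:23")
print(rendered)
```

Messages without a `timestamp` key render exactly as they would under the stock template, which is why the same template can serve timestamped and plain conversations.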

In [23]:
# Tokenize to find sequence lengths without truncation
train_lengths = [len(tokenizer(t, add_special_tokens=False)['input_ids']) for t in train_texts]
test_lengths = [len(tokenizer(t, add_special_tokens=False)['input_ids']) for t in test_texts]

max_train_len = max(train_lengths)
max_test_len = max(test_lengths)

print(f"Max train sequence length: {max_train_len}")
print(f"Max test sequence length: {max_test_len}")

# Get some stats about the distribution
print(f"Train: Mean = {sum(train_lengths)/len(train_lengths):.1f}, 90th percentile = {sorted(train_lengths)[int(0.9*len(train_lengths))]}")
print(f"Test: Mean = {sum(test_lengths)/len(test_lengths):.1f}, 90th percentile = {sorted(test_lengths)[int(0.9*len(test_lengths))]}")


Max train sequence length: 3795
Max test sequence length: 2548
Train: Mean = 483.4, 90th percentile = 904
Test: Mean = 490.3, 90th percentile = 903
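
The 90th percentile above is read off with the index `int(0.9 * n)` into the sorted lengths, which is a simple order-statistic estimate rather than an interpolated quantile. The sketch below, on made-up lengths, shows that rule next to `statistics.quantiles` from the standard library. The helper name and the sample values are illustrative assumptions.

```python
import statistics


def p90_by_index(lengths):
    """Same rule as the cell above: element at index int(0.9 * n) of the sorted list."""
    ordered = sorted(lengths)
    return ordered[int(0.9 * len(ordered))]


# Made-up token lengths, only for illustration.
sample_lengths = [120, 340, 410, 455, 480, 510, 620, 790, 904, 3795]

print("index rule:", p90_by_index(sample_lengths))
# statistics.quantiles with n=10 returns 9 cut points; the last is the 90th percentile.
print("interpolated:", statistics.quantiles(sample_lengths, n=10)[-1])
```

On datasets with thousands of rows the two estimates are usually close; on very small samples the index rule can land on the maximum.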


In [24]:
# Format for TRL SFTTrainer: list of dicts with a 'text' field
train_dataset = [{"text": x} for x in train_texts]
eval_dataset = [{"text": x} for x in test_texts]

print(f"First train sample: {train_dataset[0]}")
print(f"First eval sample: {eval_dataset[0]}")

First train sample: {'text': "<|im_start|>user\n<time>2024-05-10T16:19:14</time>\nI'm running an experiment that finishes at 5 PM, and I need to understand the role of catalysts in chemical reactions. Can you explain how they work as I monitor the results?<|im_end|>\n<|im_start|>assistant\nAbsolutely, let's cover catalysts while you oversee your experiment.\n<think>\nThe experiment concludes at 5 PM, so there’s about 40 minutes available for this explanation. I’ll focus on the essentials and practical examples.\n</think>\nA **catalyst** is a substance that speeds up a chemical reaction without being consumed in the process. Here’s how it works:\n\n- **Lower Activation Energy**: Catalysts provide an alternative pathway for the reaction with a lower activation energy. This means that more collisions between reacting molecules have enough energy to overcome this barrier, increasing the reaction rate.\n- **Reversible**: Catalysts are not permanently altered by the reaction. They part

In [25]:
# Wrap the data in HuggingFace Datasets objects
train_dataset = Dataset.from_list(train_dataset)
eval_dataset = Dataset.from_list(eval_dataset)

print(train_dataset, eval_dataset)

Dataset({
    features: ['text'],
    num_rows: 5291
}) Dataset({
    features: ['text'],
    num_rows: 935
})


In [26]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name+"phase1",
    max_seq_length = max_train_len,       # Highest sequence length in train dataset
    load_in_4bit = True,                  # QLoRA
    load_in_8bit = False,                 # We are using 4-bit
    full_finetuning = False,              # QLoRA/PEFT not full FT
)

==((====))==  Unsloth 2025.12.8: Fast Qwen3 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition. Num GPUs = 1. Max memory: 94.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

In [27]:
model = FastLanguageModel.get_peft_model(
    model,
    r=32,              # Rank of LoRA
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha=32,     # Standard
    lora_dropout=0.05, # Standard
    bias="none",       # Standard
    random_state = seed,
    use_gradient_checkpointing=True  # Saves memory, a bit slower
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.12.8 patched 40 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
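
The banner printed when `trainer.train()` starts reports 128,450,560 trainable parameters. That figure can be reproduced by hand: a rank-`r` LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below applies this to the seven target modules. The layer dimensions are assumptions taken from the published Qwen3-14B configuration (hidden size 5120, MLP size 17408, 40 query heads and 8 key/value heads of dimension 128, 40 layers), not values printed in this notebook.

```python
RANK = 32
NUM_LAYERS = 40  # assumed from the Qwen3-14B config

HIDDEN = 5120          # assumed hidden size
INTERMEDIATE = 17408   # assumed MLP size
Q_OUT = 40 * 128       # assumed: 40 query heads x head_dim 128
KV_OUT = 8 * 128       # assumed: 8 key/value heads x head_dim 128

# (d_in, d_out) for every LoRA target module in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(shapes, rank, num_layers):
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted module."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


total = lora_param_count(TARGET_SHAPES, RANK, NUM_LAYERS)
print(f"{total:,}")  # 128,450,560
```

Under these assumed shapes, the MLP projections account for roughly two thirds of the adapter weights, so dropping them from `target_modules` would shrink the adapter considerably.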


In [28]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,            # Use test set for evaluation
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 8,    # Sequences per device per forward pass
        gradient_accumulation_steps = 4,    # Effective batch size = 8 x 4 = 32
        warmup_steps = 100,                 # Linear LR warmup over the first 100 optimizer steps
        num_train_epochs = 3,               # Standard
        learning_rate = 2e-5,               # Standard
        logging_steps = 10,                 # Log every 10 steps
        optim = "adamw_8bit",               # Standard for QLoRA
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = seed,
        report_to = "none",                 
        max_grad_norm = 1.0,                # Standard
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=36):   0%|          | 0/5291 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=36):   0%|          | 0/935 [00:00<?, ? examples/s]

🦥 Unsloth: Padding-free auto-enabled, enabling faster training.


In [29]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved at start.")

GPU = NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition. Max memory = 94.557 GB.
13.328 GB of memory reserved at start.


In [30]:
# Begin training, collect stats
trainer_stats = trainer.train()

# After training, report peak memory reserved
peak_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
print(f"Peak GPU memory used during training: {peak_gpu_memory} GB.")

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,291 | Num Epochs = 3 | Total steps = 498
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 4 x 1) = 32
 "-____-"     Trainable parameters = 128,450,560 of 14,896,757,760 (0.86% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
10,1.0912
20,1.137
30,1.1098
40,1.0721
50,1.0343
60,0.9753
70,0.9209
80,0.8874
90,0.8434
100,0.842


Peak GPU memory used during training: 23.104 GB.
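
The step count in the banner follows from the batch settings: 5,291 examples at 8 per device give 662 micro-batches per epoch, 4-step accumulation turns those into 166 optimizer steps, and 3 epochs give 498. A small sketch of that arithmetic, assuming the final partial accumulation window counts as a full step (which is what matches the logged total here):

```python
import math


def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Optimizer steps for a run, rounding partial batches and windows up."""
    micro_batches = math.ceil(num_examples / (per_device_batch * num_devices))
    steps_per_epoch = math.ceil(micro_batches / grad_accum)
    return steps_per_epoch * epochs


steps = total_optimizer_steps(5291, per_device_batch=8, grad_accum=4, epochs=3)
print(steps)  # 498
print(f"warmup share: {100 / steps:.1%}")  # warmup_steps = 100
```

With `warmup_steps = 100`, about a fifth of the run is spent ramping the learning rate up, which is worth keeping in mind when reading the early loss values.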


In [31]:
trainer.args.gradient_accumulation_steps = 1
eval_metrics = trainer.evaluate()
print("Eval metrics:", eval_metrics)

Eval metrics: {'eval_loss': 0.7074225544929504, 'eval_runtime': 91.4907, 'eval_samples_per_second': 10.22, 'eval_steps_per_second': 2.558, 'epoch': 3.0}
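
Because `eval_loss` is an average per-token cross-entropy (in nats), exponentiating it gives an evaluation perplexity, which is often easier to compare across phases. A minimal sketch using the value printed above:

```python
import math

eval_metrics = {"eval_loss": 0.7074225544929504}  # copied from the output above

perplexity = math.exp(eval_metrics["eval_loss"])
print(f"Eval perplexity: {perplexity:.3f}")
```

Perplexities are only comparable between runs that share a tokenizer and an evaluation set, so values from different phases should be read with that in mind.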


In [32]:
model.save_pretrained(model_name+"phase2_adapter")
tokenizer.save_pretrained(model_name+"phase2_adapter")

('Qwen/Qwen3-14Bphase2_adapter/tokenizer_config.json',
 'Qwen/Qwen3-14Bphase2_adapter/special_tokens_map.json',
 'Qwen/Qwen3-14Bphase2_adapter/chat_template.jinja',
 'Qwen/Qwen3-14Bphase2_adapter/vocab.json',
 'Qwen/Qwen3-14Bphase2_adapter/merges.txt',
 'Qwen/Qwen3-14Bphase2_adapter/added_tokens.json',
 'Qwen/Qwen3-14Bphase2_adapter/tokenizer.json')

In [33]:
del model
del tokenizer
del trainer

gc.collect()
torch.cuda.empty_cache()

In [34]:
tokenizer = AutoTokenizer.from_pretrained(model_name+"phase2_adapter")
model = AutoModelForCausalLM.from_pretrained(model_name+"phase1", torch_dtype=torch.bfloat16)

Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

In [35]:
model = PeftModel.from_pretrained(model, model_name+"phase2_adapter")
model = model.merge_and_unload()
model.save_pretrained(model_name+"phase2")
tokenizer.save_pretrained(model_name+"phase2")

('Qwen/Qwen3-14Bphase2/tokenizer_config.json',
 'Qwen/Qwen3-14Bphase2/special_tokens_map.json',
 'Qwen/Qwen3-14Bphase2/chat_template.jinja',
 'Qwen/Qwen3-14Bphase2/vocab.json',
 'Qwen/Qwen3-14Bphase2/merges.txt',
 'Qwen/Qwen3-14Bphase2/added_tokens.json',
 'Qwen/Qwen3-14Bphase2/tokenizer.json')
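
`merge_and_unload()` folds each adapter into its base weight, so the saved Phase 2 checkpoint is a plain model with no PEFT dependency. For standard LoRA the merged weight is `W + (lora_alpha / r) * B @ A`; with `lora_alpha = r = 32` as configured above, the scale is 1. The toy sketch below shows that update on tiny hand-written matrices in pure Python. It illustrates the formula only and is not how PEFT implements the merge.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def merge_lora(W, A, B, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A) for W: d_out x d_in, B: d_out x r, A: r x d_in."""
    scale = lora_alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


# Toy shapes: d_out = 2, d_in = 3, r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B = [[0.5], [-1.0]]

print(merge_lora(W, A, B, lora_alpha=1, r=1))
```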

In [36]:
del model
del tokenizer

# Phase 3: Multi-Turn Generalization and Reasoning Modulation

Phase 3 expands to **three-turn and longer conversations**, generalizing beyond strictly temporal scenarios.

⚠️ **Implementation Note**: The training infrastructure is the same as in Phase 2, with minor changes such as:

- `load_split('phase3_train.json', 'phase3_test.json')`
- Save path updated to `"Qwen/Qwen3-14Bphase3"`

Phase 3 can be run independently as long as prior phases have generated the expected model weights.


In [37]:
# Load phase 3 data
train_data, test_data = load_split('phase3_train.json', 'phase3_test.json')

print(f"Loaded {len(train_data)} train and {len(test_data)} test conversations.")

Loaded 5878 train and 1039 test conversations.


In [38]:
tokenizer = AutoTokenizer.from_pretrained(model_name+"phase2")
tokenizer.chat_template = """{%- if tools %}\n    {{- \'<|im_start|>system\\n\' }}\n    {%- if messages[0].role == \'system\' %}\n        {{- messages[0].content + \'\\n\\n\' }}\n    {%- endif %}\n    {{- "# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>" }}\n    {%- for tool in tools %}\n        {{- "\\n" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- "\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\"name\\": <function-name>, \\"arguments\\": <args-json-object>}\\n</tool_call><|im_end|>\\n" }}\n{%- else %}\n    {%- if messages[0].role == \'system\' %}\n        {{- \'<|im_start|>system\\n\' + messages[0].content + \'<|im_end|>\\n\' }}\n    {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n    {%- set index = (messages|length - 1) - loop.index0 %}\n    {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith(\'<tool_response>\') and message.content.endswith(\'</tool_response>\')) %}\n        {%- set ns.multi_step_tool = false %}\n        {%- set ns.last_query_index = index %}\n    {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n    {%- if message.content is string %}\n        {%- set content = message.content %}\n    {%- else %}\n        {%- set content = \'\' %}\n    {%- endif %}\n    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}\n        {{- \'<|im_start|>\' + message.role + \'\\n\'}}\n        {%- if message.role == "user" and message.timestamp is defined %}\n            {{- \'<time>\' + message.timestamp + \'</time>\\n\' }}\n        {%- endif %}\n        {{- content + \'<|im_end|>\\n\' }}\n    {%- elif message.role 
== "assistant" %}\n        {{- \'<|im_start|>\' + message.role + \'\\n\' + content + \'<|im_end|>\\n\' }}\n        {%- if message.tool_calls %}\n            {%- for tool_call in message.tool_calls %}\n                {%- if (loop.first and content) or (not loop.first) %}\n                    {{- \'\\n\' }}\n                {%- endif %}\n                {%- if tool_call.function %}\n                    {%- set tool_call = tool_call.function %}\n                {%- endif %}\n                {{- \'<tool_call>\\n{"name": "\' }}\n                {{- tool_call.name }}\n                {{- \'", "arguments": \' }}\n                {%- if tool_call.arguments is string %}\n                    {{- tool_call.arguments }}\n                {%- else %}\n                    {{- tool_call.arguments | tojson }}\n                {%- endif %}\n                {{- \'}\\n</tool_call>\' }}\n            {%- endfor %}\n        {%- endif %}\n    {%- elif message.role == "tool" %}\n        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}\n            {{- \'<|im_start|>user\' }}\n        {%- endif %}\n        {{- \'\\n<tool_response>\\n\' }}\n        {{- content }}\n        {{- \'\\n</tool_response>\' }}\n        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}\n            {{- \'<|im_end|>\\n\' }}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- \'<|im_start|>assistant\\n\' }}\n{%- endif %}"""

In [39]:
# Apply chat template to both train and test data, do not tokenize yet
train_texts = tokenizer.apply_chat_template(
    train_data,
    tokenize=False,
)

test_texts = tokenizer.apply_chat_template(
    test_data,
    tokenize=False,
)

print(f"First formatted train example:\n{train_texts[0][:500]}")
print(f"First formatted test example:\n{test_texts[0][:500]}")

First formatted train example:
<|im_start|>user
<time>2023-06-14T09:00:00</time>
I’m about to start a 10-hour work marathon. How can I stay productive?<|im_end|>
<|im_start|>assistant
A long day ahead! Let's maximize your productivity.

<think>
9:00 AM start — 10 hours means they’ll finish by 7:00 PM. Solid block.
</think>

Suggestions:
- **Breaks:** Take a 5-minute break every hour. Recharge often!
- **Hydration:** Keep water at your desk.
- **Prioritize tasks:** List critical items first.
- **Stretch:** Every couple of hour
First formatted test example:
<|im_start|>user
<time>2023-03-15T09:37:18</time>
I just got promoted to team lead! What should I focus on in my first 30 days?<|im_end|>
<|im_start|>assistant
Congratulations on your promotion! As you step into this new role, here are some key areas to focus on during your first 30 days:

- **Understand team dynamics**: Spend time getting to know each team member's strengths and areas for improvement.
- **Set clear goals**: Ali

In [40]:
# Tokenize to find sequence lengths without truncation
train_lengths = [len(tokenizer(t, add_special_tokens=False)['input_ids']) for t in train_texts]
test_lengths = [len(tokenizer(t, add_special_tokens=False)['input_ids']) for t in test_texts]

max_train_len = max(train_lengths)
max_test_len = max(test_lengths)

print(f"Max train sequence length: {max_train_len}")
print(f"Max test sequence length: {max_test_len}")

# Get some stats about the distribution
print(f"Train: Mean = {sum(train_lengths)/len(train_lengths):.1f}, 90th percentile = {sorted(train_lengths)[int(0.9*len(train_lengths))]}")
print(f"Test: Mean = {sum(test_lengths)/len(test_lengths):.1f}, 90th percentile = {sorted(test_lengths)[int(0.9*len(test_lengths))]}")

Max train sequence length: 3795
Max test sequence length: 3041
Train: Mean = 496.0, 90th percentile = 855
Test: Mean = 464.7, 90th percentile = 752


In [41]:
# Format for TRL SFTTrainer: list of dicts with a 'text' field
train_dataset = [{"text": x} for x in train_texts]
eval_dataset = [{"text": x} for x in test_texts]

print(f"First train sample: {train_dataset[0]}")
print(f"First eval sample: {eval_dataset[0]}")

First train sample: {'text': "<|im_start|>user\n<time>2023-06-14T09:00:00</time>\nI’m about to start a 10-hour work marathon. How can I stay productive?<|im_end|>\n<|im_start|>assistant\nA long day ahead! Let's maximize your productivity.\n\n<think>\n9:00 AM start — 10 hours means they’ll finish by 7:00 PM. Solid block.\n</think>\n\nSuggestions:\n- **Breaks:** Take a 5-minute break every hour. Recharge often!\n- **Hydration:** Keep water at your desk.\n- **Prioritize tasks:** List critical items first.\n- **Stretch:** Every couple of hours, move around.\n- **Healthy snacks:** Keep energy levels stable.\n\nYou’ve got this!<|im_end|>\n<|im_start|>user\n<time>2023-06-14T19:00:00</time>\nI just wrapped up. Any tips on how to unwind after such a day?<|im_end|>\n<|im_start|>assistant\nCongrats on finishing!\n\n<think>\n7:00 PM — the user completed the marathon. They need effective unwinding.\n</think>\n\nSuggestions:\n- **Decompress:** A warm shower to relax muscles.\n- **Light mea

In [42]:
# Wrap the data in HuggingFace Datasets objects
train_dataset = Dataset.from_list(train_dataset)
eval_dataset = Dataset.from_list(eval_dataset)

print(train_dataset, eval_dataset)

Dataset({
    features: ['text'],
    num_rows: 5878
}) Dataset({
    features: ['text'],
    num_rows: 1039
})


In [43]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name+"phase2",
    max_seq_length = max_train_len,       # Highest sequence length in train dataset
    load_in_4bit = True,                  # QLoRA
    load_in_8bit = False,                 # We are using 4-bit
    full_finetuning = False,              # QLoRA/PEFT not full FT
)

==((====))==  Unsloth 2025.12.8: Fast Qwen3 patching. Transformers: 4.57.3.
   \\   /|    NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition. Num GPUs = 1. Max memory: 94.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

In [44]:
model = FastLanguageModel.get_peft_model(
    model,
    r=32,              # Rank of LoRA
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha=32,     # Standard
    lora_dropout=0.05, # Standard
    bias="none",       # Standard
    random_state = seed,
    use_gradient_checkpointing=True  # Saves memory, a bit slower
)

In [45]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,            # Use test set for evaluation
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 8,    # Sequences per device per forward pass
        gradient_accumulation_steps = 4,    # Effective batch size = 8 x 4 = 32
        warmup_steps = 100,                 # Linear LR warmup over the first 100 optimizer steps
        num_train_epochs = 3,               # Standard
        learning_rate = 2e-5,               # Standard
        logging_steps = 10,                 # Log every 10 steps
        optim = "adamw_8bit",               # Standard for QLoRA
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = seed,
        report_to = "none",                 
        max_grad_norm = 1.0,                # Standard
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=36):   0%|          | 0/5878 [00:00<?, ? examples/s]

Unsloth: Tokenizing ["text"] (num_proc=36):   0%|          | 0/1039 [00:00<?, ? examples/s]

🦥 Unsloth: Padding-free auto-enabled, enabling faster training.


In [46]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved at start.")

GPU = NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition. Max memory = 94.557 GB.
23.104 GB of memory reserved at start.
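
The 23.104 GB reported "at start" matches the Phase 2 peak, most likely because `torch.cuda.max_memory_reserved()` is a high-water mark for the whole process rather than a reading of current usage. If a per-phase peak is wanted, the counter can be reset before training. A hedged sketch follows; it assumes PyTorch is installed, only touches CUDA when a GPU is visible, and `to_gib` is an illustrative helper.

```python
def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB, rounded like the cells above."""
    return round(num_bytes / 1024 / 1024 / 1024, 3)


try:
    import torch
except ImportError:  # keeps the sketch importable without PyTorch
    torch = None

if torch is not None and torch.cuda.is_available():
    # Drop the process-wide high-water marks so the next reading is phase-local.
    torch.cuda.reset_peak_memory_stats()
    print(f"{to_gib(torch.cuda.memory_reserved())} GB currently reserved.")
    # ... run trainer.train() here ...
    print(f"{to_gib(torch.cuda.max_memory_reserved())} GB peak since reset.")
else:
    print("No CUDA device visible; skipping memory report.")
```

Running each phase in a fresh kernel has the same effect, since the counters start from zero with the process.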


In [47]:
# Begin training, collect stats
trainer_stats = trainer.train()

# After training, report peak memory reserved
peak_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
print(f"Peak GPU memory used during training: {peak_gpu_memory} GB.")

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 5,878 | Num Epochs = 3 | Total steps = 552
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 4 x 1) = 32
 "-____-"     Trainable parameters = 128,450,560 of 14,896,757,760 (0.86% trained)


Step,Training Loss
10,0.9105
20,0.911
30,0.8933
40,0.8893
50,0.8787
60,0.8684
70,0.8699
80,0.8437
90,0.8689
100,0.8098


Peak GPU memory used during training: 33.475 GB.


In [48]:
trainer.args.gradient_accumulation_steps = 1
eval_metrics = trainer.evaluate()
print("Eval metrics:", eval_metrics)

Eval metrics: {'eval_loss': 0.7413343787193298, 'eval_runtime': 96.4224, 'eval_samples_per_second': 10.776, 'eval_steps_per_second': 2.696, 'epoch': 3.0}


### Manual Sanity Check: Inference with Chat Template

This cell performs a quick **manual test** to ensure:

1. The **chat template** is correctly applied in inference.
2. The **trained model** can produce a sensible response using a realistic prompt and timestamp.

We use a simple date-based query ("How many days till Christmas?") with a fresh ISO 8601 timestamp injected. The prompt is formatted with the same chat template applied during training, to confirm consistency between the training and inference pipelines.

⚠️ This is not part of the evaluation; it is only a live check that tokenizer formatting, device placement, and sampling are functioning as expected after training.


In [49]:
messages = [
    {
        "role": "user",
        "content": "How many days till Christmas?",
        # ISO 8601 timestamp truncated to whole seconds, e.g. 2025-12-22T09:05:23
        "timestamp": datetime.now().isoformat(timespec="seconds"),
    }
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize = False,
    add_generation_prompt = True,  # End the prompt with the assistant header for inference
)

_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    max_new_tokens = 256, # Increase for longer outputs!
    temperature = 0.6, top_p = 0.95, top_k = 20, # Qwen3's suggested thinking-mode sampling values
    streamer = TextStreamer(tokenizer, skip_prompt = False),
)

<|im_start|>user
<time>2025-12-22T09:05:23</time>
How many days till Christmas?<|im_end|>
<|im_start|>assistant
Let's calculate the number of days until Christmas.

### Current Date and Target Date
- **Today's Date:** December 22, 2025
- **Christmas Date:** December 25, 2025

<think>
The user is asking about the days between December 22 and December 25, 2025.
</think>

### Calculation
- **December 22 to December 23:** 1 day
- **December 23 to December 24:** 1 day
- **December 24 to December 25:** 1 day

<think>
Summing up the days: 1 + 1 + 1 = 3 days.
</think>

There are **3 days** left until Christmas.<|im_end|>
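
The model's answer can be checked against the injected timestamp with a few lines of standard-library code. This is only a sketch for spot checks like the one above; `days_until_christmas` is a hypothetical helper, and it counts calendar days, ignoring the time of day.

```python
from datetime import date, datetime


def days_until_christmas(timestamp: str) -> int:
    """Calendar days from an ISO 8601 timestamp to the next December 25."""
    today = datetime.fromisoformat(timestamp).date()
    christmas = date(today.year, 12, 25)
    if today > christmas:
        # Already past this year's Christmas, so target next year's.
        christmas = date(today.year + 1, 12, 25)
    return (christmas - today).days


print(days_until_christmas("2025-12-22T09:05:23"))  # 3
```

For the timestamp shown in the output, the reference value agrees with the model's answer of 3 days.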


In [50]:
model.save_pretrained(model_name+"phase3_adapter")
tokenizer.save_pretrained(model_name+"phase3_adapter")

('Qwen/Qwen3-14Bphase3_adapter/tokenizer_config.json',
 'Qwen/Qwen3-14Bphase3_adapter/special_tokens_map.json',
 'Qwen/Qwen3-14Bphase3_adapter/chat_template.jinja',
 'Qwen/Qwen3-14Bphase3_adapter/vocab.json',
 'Qwen/Qwen3-14Bphase3_adapter/merges.txt',
 'Qwen/Qwen3-14Bphase3_adapter/added_tokens.json',
 'Qwen/Qwen3-14Bphase3_adapter/tokenizer.json')

In [51]:
del model
del tokenizer
del trainer

gc.collect()
torch.cuda.empty_cache()

In [52]:
tokenizer = AutoTokenizer.from_pretrained(model_name+"phase3_adapter")
model = AutoModelForCausalLM.from_pretrained(model_name+"phase2", torch_dtype=torch.bfloat16)

Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

In [53]:
model = PeftModel.from_pretrained(model, model_name+"phase3_adapter")
model = model.merge_and_unload()
model.save_pretrained(model_name+"phase3")
tokenizer.save_pretrained(model_name+"phase3")

('Qwen/Qwen3-14Bphase3/tokenizer_config.json',
 'Qwen/Qwen3-14Bphase3/special_tokens_map.json',
 'Qwen/Qwen3-14Bphase3/chat_template.jinja',
 'Qwen/Qwen3-14Bphase3/vocab.json',
 'Qwen/Qwen3-14Bphase3/merges.txt',
 'Qwen/Qwen3-14Bphase3/added_tokens.json',
 'Qwen/Qwen3-14Bphase3/tokenizer.json')

In [54]:
del model
del tokenizer

# Phase 4: Maximal Diversity, Full-Batch Alignment

This phase implements the decisive **gradient-aligned convergence** step using a **hand-curated, ultra-diverse batch** of just 128 conversations. Each one is structurally and stylistically distinct; they are unified only by their reliance on temporal and contextual meta-reasoning.

Key differences from earlier phases:

- **No test split**: all 128 samples are for alignment (pure SFT, no replay, no validation set).
- **Model initialization**: load the *Phase 3* checkpoint as the base.
- **High learning rate** (`1.5e-4`) with **large effective batch size of 128** and **very few steps**.
- **Trainer config**: a single optimizer step per epoch. With a per-device batch size of 8 and `gradient_accumulation_steps=16`, each step processes all 128 examples (8 × 16). Because every step covers the full dataset, no examples are sampled out and the batch structure stays deterministic.
- **Save every epoch**: Each checkpoint is saved for later manual behavioral selection, since degenerate collapse can occur beyond a threshold.

This phase is where **all structural entropy is injected**: contradictory scenarios, formats, and tones ensure that the **only consistently learnable signal is temporal, meta-cognitive alignment**. The workflow otherwise mirrors previous phases, but with these critical modifications.
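
The "full-batch" property comes down to simple arithmetic: when the per-device batch size times the accumulation steps equals the dataset size, every optimizer step sees all examples and each epoch is exactly one step. A small sketch of that check, using the 8 and 16 values quoted in this notebook (the function name is illustrative):

```python
import math


def full_batch_plan(num_examples: int, per_device_batch: int, grad_accum: int) -> dict:
    """Summarize how a dataset maps onto optimizer steps on a single device."""
    effective_batch = per_device_batch * grad_accum
    micro_batches = math.ceil(num_examples / per_device_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": math.ceil(micro_batches / grad_accum),
        "is_full_batch": effective_batch >= num_examples,
    }


print(full_batch_plan(num_examples=128, per_device_batch=8, grad_accum=16))
```

Any factorization of 128 gives the same effective batch; a smaller per-device batch with more accumulation steps trades speed for lower peak memory.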


In [55]:
# Load phase 4 data
with open('phase4.json', 'r', encoding='utf-8') as f:
    train_data = json.load(f)

print(f"Loaded {len(train_data)} train conversations.")

Loaded 128 train conversations.


In [56]:
tokenizer = AutoTokenizer.from_pretrained(model_name+"phase3")
tokenizer.chat_template = """{%- if tools %}\n    {{- \'<|im_start|>system\\n\' }}\n    {%- if messages[0].role == \'system\' %}\n        {{- messages[0].content + \'\\n\\n\' }}\n    {%- endif %}\n    {{- "# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>" }}\n    {%- for tool in tools %}\n        {{- "\\n" }}\n        {{- tool | tojson }}\n    {%- endfor %}\n    {{- "\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\"name\\": <function-name>, \\"arguments\\": <args-json-object>}\\n</tool_call><|im_end|>\\n" }}\n{%- else %}\n    {%- if messages[0].role == \'system\' %}\n        {{- \'<|im_start|>system\\n\' + messages[0].content + \'<|im_end|>\\n\' }}\n    {%- endif %}\n{%- endif %}\n{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}\n{%- for message in messages[::-1] %}\n    {%- set index = (messages|length - 1) - loop.index0 %}\n    {%- if ns.multi_step_tool and message.role == "user" and message.content is string and not(message.content.startswith(\'<tool_response>\') and message.content.endswith(\'</tool_response>\')) %}\n        {%- set ns.multi_step_tool = false %}\n        {%- set ns.last_query_index = index %}\n    {%- endif %}\n{%- endfor %}\n{%- for message in messages %}\n    {%- if message.content is string %}\n        {%- set content = message.content %}\n    {%- else %}\n        {%- set content = \'\' %}\n    {%- endif %}\n    {%- if (message.role == "user") or (message.role == "system" and not loop.first) %}\n        {{- \'<|im_start|>\' + message.role + \'\\n\'}}\n        {%- if message.role == "user" and message.timestamp is defined %}\n            {{- \'<time>\' + message.timestamp + \'</time>\\n\' }}\n        {%- endif %}\n        {{- content + \'<|im_end|>\\n\' }}\n    {%- elif message.role 
== "assistant" %}\n        {{- \'<|im_start|>\' + message.role + \'\\n\' + content + \'<|im_end|>\\n\' }}\n        {%- if message.tool_calls %}\n            {%- for tool_call in message.tool_calls %}\n                {%- if (loop.first and content) or (not loop.first) %}\n                    {{- \'\\n\' }}\n                {%- endif %}\n                {%- if tool_call.function %}\n                    {%- set tool_call = tool_call.function %}\n                {%- endif %}\n                {{- \'<tool_call>\\n{"name": "\' }}\n                {{- tool_call.name }}\n                {{- \'", "arguments": \' }}\n                {%- if tool_call.arguments is string %}\n                    {{- tool_call.arguments }}\n                {%- else %}\n                    {{- tool_call.arguments | tojson }}\n                {%- endif %}\n                {{- \'}\\n</tool_call>\' }}\n            {%- endfor %}\n        {%- endif %}\n    {%- elif message.role == "tool" %}\n        {%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}\n            {{- \'<|im_start|>user\' }}\n        {%- endif %}\n        {{- \'\\n<tool_response>\\n\' }}\n        {{- content }}\n        {{- \'\\n</tool_response>\' }}\n        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}\n            {{- \'<|im_end|>\\n\' }}\n        {%- endif %}\n    {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- \'<|im_start|>assistant\\n\' }}\n{%- endif %}"""

In [57]:
# Apply chat template to training data, do not tokenize yet
train_texts = tokenizer.apply_chat_template(
    train_data,
    tokenize=False,
)

print(f"First formatted train example:\n{train_texts[0][:500]}")

First formatted train example:
<|im_start|>user
<time>2025-06-08T03:45:13</time>
pls fast how do i convert like a 60 page word doc to latex latex i need latex i have to submit in the morning plz help<|im_end|>
<|im_start|>assistant
I can help you convert your Word document to LaTeX quickly! Don’t panic—you can do this, even at 3:45 AM.
<think>
It's very late at night (3:45 AM) and the user is clearly stressed and facing a tight deadline. Their grammar and typing suggest panic. This is likely a thesis or major assignment. My j


In [58]:
# Tokenize to find sequence lengths without truncation
train_lengths = [len(tokenizer(t, add_special_tokens=False)['input_ids']) for t in train_texts]

max_train_len = max(train_lengths)

print(f"Max train sequence length: {max_train_len}")

# Get some stats about the distribution
print(f"Train: Mean = {sum(train_lengths)/len(train_lengths):.1f}, 90th percentile = {sorted(train_lengths)[int(0.9*len(train_lengths))]}")

Max train sequence length: 1379
Train: Mean = 735.8, 90th percentile = 981


In [59]:
# Format for TRL SFTTrainer: list of dicts with a 'text' field
train_dataset = [{"text": x} for x in train_texts]

print(f"First train sample: {train_dataset[0]}")

First train sample: {'text': "<|im_start|>user\n<time>2025-06-08T03:45:13</time>\npls fast how do i convert like a 60 page word doc to latex latex i need latex i have to submit in the morning plz help<|im_end|>\n<|im_start|>assistant\nI can help you convert your Word document to LaTeX quickly! Don’t panic—you can do this, even at 3:45 AM.\n<think>\nIt's very late at night (3:45 AM) and the user is clearly stressed and facing a tight deadline. Their grammar and typing suggest panic. This is likely a thesis or major assignment. My job is to calm them, give fast actionable steps, and avoid overwhelming details.\n</think>\nHere’s the fastest way to convert a large Word document (like 60 pages) to LaTeX:\n\n### 1. **Use an Online Converter (Fastest for Emergencies)**\nTry [Pandoc](https://pandoc.org/)—it’s a free tool that can convert `.docx` to `.tex` files in one command.\n\n#### If you can use the command line:\n```bash\npandoc your_document.docx -o your_document.tex\n```\n- Yo

In [60]:
# Wrap the data in HuggingFace Datasets objects
train_dataset = Dataset.from_list(train_dataset)

print(train_dataset)

Dataset({
    features: ['text'],
    num_rows: 128
})


In [61]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name+"phase3",
    max_seq_length = max_train_len,       # Highest sequence length in train dataset
    load_in_4bit = True,                  # QLoRA
    load_in_8bit = False,                 # We are using 4-bit
    full_finetuning = False,              # QLoRA/PEFT not full FT
)

==((====))==  Unsloth 2025.12.8: Fast Qwen3 patching. Transformers: 4.57.3. vLLM: 0.13.0.
   \\   /|    NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition. Num GPUs = 1. Max memory: 95.592 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu128. CUDA: 12.0. CUDA Toolkit: 12.8. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

In [62]:
model = FastLanguageModel.get_peft_model(
    model,
    r=32,                             # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",   # Attention projections
                    "gate_proj", "up_proj", "down_proj"],     # MLP projections
    lora_alpha=32,                    # Scaling factor = lora_alpha / r = 1
    lora_dropout=0.05,                # Non-zero dropout: Unsloth skips fast patching of the LoRA matrices
    bias="none",                      # Do not train bias terms
    random_state=seed,
    use_gradient_checkpointing=True,  # Saves memory, a bit slower
)

Unsloth: Dropout = 0 is supported for fast patching. You are using dropout = 0.05.
Unsloth will patch all other layers, except LoRA matrices, causing a performance hit.
Unsloth 2025.12.8 patched 40 layers with 0 QKV layers, 0 O layers and 0 MLP layers.
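
As a sanity check on the adapter size, the number of trainable LoRA parameters can be estimated from the layer shapes: each adapted linear layer with `in_features` inputs and `out_features` outputs contributes `r * (in_features + out_features)` parameters. The sketch below assumes the published Qwen3-14B dimensions (hidden size 5120, MLP size 17408, 40 query heads and 8 key/value heads of dimension 128, 40 decoder layers). These values are not stated in this notebook and should be checked against the model's `config.json`; `lora_params` is a hypothetical helper.

```python
# Rough LoRA parameter count. The shapes below are ASSUMPTIONS taken from the
# public Qwen3-14B config, not values read from this notebook.
r = 32
hidden, intermediate = 5120, 17408
q_out = 40 * 128    # query heads * head_dim
kv_out = 8 * 128    # key/value heads * head_dim
num_layers = 40

# (in_features, out_features) of each target module in one decoder layer
shapes = {
    "q_proj": (hidden, q_out),
    "k_proj": (hidden, kv_out),
    "v_proj": (hidden, kv_out),
    "o_proj": (q_out, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

def lora_params(in_features, out_features, rank):
    # A has shape (rank, in_features), B has shape (out_features, rank)
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, r) for i, o in shapes.values())
total = per_layer * num_layers
print(f"{total:,}")
```

Under these assumptions the total agrees with the `Trainable parameters = 128,450,560` figure reported in the training log.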


### Trainer Setup and Epoch-Level Checkpoint Strategy

We configure the trainer to operate in a high-intensity alignment regime:

- **Small batch (8) with large gradient accumulation (16)** yields an effective batch size of 128. Because the training set has exactly 128 examples, each optimizer step is one full epoch.
- **Max steps** is empirically tuned per model size to ensure that one **epoch checkpoint** lands in the narrow **loss band of 1.045–1.05**, which we found indicative of optimal alignment without collapse.
- **Learning rate (1.5e-4)** is raised to compensate for the short training horizon.
- **Epoch-level saving** (`save_strategy="epoch"`) enables **manual checkpoint selection**, as automated validation is disabled and the risk of overshooting this narrow zone is high.
- **No evaluation set** is used in this phase; we choose the first checkpoint whose training loss falls in the **1.045–1.05 range**.

This configuration allows fine-grained control over the **transition to minimal reasoning**, letting us recover the one “just right” checkpoint that is neither underfit nor behaviorally degenerate.
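
The batch arithmetic behind these settings can be checked directly. The sketch below uses the values from this section (128 training examples, batch size 8, accumulation 16, 36 steps, one GPU); the variable names are illustrative only.

```python
import math

num_examples = 128       # rows in train_dataset
per_device_batch = 8
grad_accum = 16
num_gpus = 1
max_steps = 36

# Samples consumed per optimizer step
effective_batch = per_device_batch * grad_accum * num_gpus
# Optimizer steps needed to see the whole dataset once
steps_per_epoch = math.ceil(num_examples / effective_batch)
# Epochs covered by the step budget
num_epochs = math.ceil(max_steps / steps_per_epoch)

print(effective_batch, steps_per_epoch, num_epochs)  # 128 1 36
```

Since one optimizer step covers the whole dataset, every epoch checkpoint corresponds to exactly one training step, so `checkpoint-N` holds the adapter after step N.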

In [63]:
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    args = SFTConfig(
        dataset_text_field = "text",
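        per_device_train_batch_size = 8,    # Micro-batch size per GPU
        gradient_accumulation_steps = 16,   # Effective batch size = 8 x 16 = 128
        warmup_steps = 6,                   # Linear warmup over the first 6 steps
        max_steps = 36,                     # 36 steps = 36 epochs (one step per epoch)
        learning_rate = 1.5e-4,             # Higher LR because we have far fewer steps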
        logging_steps = 1,                  # Log every step
        optim = "adamw_8bit",               # Standard for QLoRA
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = seed,
        report_to = "none",                 
        max_grad_norm = 1.0,                # Standard
        save_strategy = "epoch",            # Saving every epoch so we can test each checkpoint manually
    ),
)

Unsloth: Tokenizing ["text"] (num_proc=36):   0%|          | 0/128 [00:00<?, ? examples/s]

🦥 Unsloth: Padding-free auto-enabled, enabling faster training.
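
For reference, the learning rate implied by `warmup_steps = 6`, `max_steps = 36`, and `lr_scheduler_type = "linear"` can be sketched as follows. This mirrors the usual linear warmup and linear decay rule; it is an illustration rather than the trainer's own code, and `linear_lr` is a hypothetical helper.

```python
def linear_lr(step, base_lr=1.5e-4, warmup_steps=6, max_steps=36):
    """Learning rate after `step` scheduler updates (0 = before the first one)."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at max_steps
    remaining = max_steps - step
    return base_lr * max(0.0, remaining / max(1, max_steps - warmup_steps))

for s in (0, 3, 6, 21, 36):
    print(s, f"{linear_lr(s):.2e}")
```

With only 36 steps in total, the warmup occupies one sixth of training, and the rate is back to half of its peak by step 21.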


In [64]:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved at start.")

GPU = NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition. Max memory = 95.592 GB.
13.334 GB of memory reserved at start.


In [65]:
# Begin training, collect stats
trainer_stats = trainer.train()

# After training, report peak memory reserved
peak_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
print(f"Peak GPU memory used during training: {peak_gpu_memory} GB.")

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 128 | Num Epochs = 36 | Total steps = 36
O^O/ \_/ \    Batch size per device = 8 | Gradient accumulation steps = 16
\        /    Data Parallel GPUs = 1 | Total batch size (8 x 16 x 1) = 128
 "-____-"     Trainable parameters = 128,450,560 of 14,896,757,760 (0.86% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.3566
2,1.3566
3,1.355
4,1.348
5,1.3259
6,1.2926
7,1.2686
8,1.2524
9,1.2402
10,1.2274


Peak GPU memory used during training: 26.354 GB.


In [66]:
del model
del tokenizer
del trainer

gc.collect()
torch.cuda.empty_cache()

### Rename Output Directory for Checkpoint Organization

After training, we rename the default `trainer_output` directory to reflect the model size (e.g., `trainer_output14B`). This ensures that checkpoints from different model variants are isolated and clearly labeled for downstream merging and evaluation.

In [67]:
Path("trainer_output").rename("trainer_output"+model_name[11:])

PosixPath('trainer_output14B')
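
The slices `model_name[11:]` and `model_name[10:]` depend on the exact length of the model-name prefix. A less position-dependent alternative is sketched below. Both `size_suffix` and the example name are hypothetical; the real `model_name` is set in the Preliminary Setup section.

```python
def size_suffix(model_name: str) -> str:
    """Return the text after the last '-' in the final path component."""
    final_component = model_name.rsplit("/", 1)[-1]
    return final_component.rsplit("-", 1)[-1]

example_name = "Qwen/Qwen3-14B"  # hypothetical value, used for illustration only
suffix = size_suffix(example_name)

print("trainer_output" + suffix)  # trainer_output14B
print("TIME-" + suffix)           # TIME-14B
```

This keeps the directory names stable if the prefix of the model name changes, at the cost of assuming the size tag always follows the last hyphen.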

### Checkpoint Selection Strategy and Model Merging

The final checkpoint used for producing the **TIME** model is selected with a *loss-based heuristic* rather than an arbitrary epoch limit or a fixed training duration. Specifically, we monitor the training loss and pick the **first epoch checkpoint** whose loss falls within a narrow, empirically determined “sweet spot” of **1.045 to 1.050**.

This range has been established across multiple model sizes and curriculum phases as the inflection point that balances sufficient learning and structure acquisition against overfitting and collapse. Checkpoints before this range often show underdeveloped behavior (e.g., weaker temporal behavior and output formatting), while those after it tend to fall into **degenerate modes**, marked by a sharp increase in format bleed (markdown leaking into think blocks, think-block content appearing outside think blocks), infinite repetition, and similar failures.

This approach provides two key benefits:
1. **Behavioral Stability**: The checkpoint selected within this band consistently shows high structural fidelity (clean reasoning boundaries, minimal format bleed), compact and economical reasoning, and robust temporal understanding across sizes.
2. **Reproducibility**: Due to fixed seeds (`set_seed`, `random.seed`, `numpy.random.seed`) and deterministic inputs in the batch, **the exact same checkpoint is consistently produced** for a given model size and configuration, regardless of when or where it is trained (as long as training hyperparameters are unchanged). Despite underlying hardware non-determinism (e.g., CUDA kernels), the global training trajectory is sufficiently stable for the sweet spot checkpoint to remain invariant.

Checkpoints are saved at the end of each epoch (`save_strategy = "epoch"`). The training loss shows that the 24th checkpoint (the epoch ending at training step 24) has a loss of **1.0496**, which falls within the sweet-spot range. The step may differ for other model sizes, but the procedure remains the same.
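
The selection rule can be expressed as a small helper that scans the logged losses for the first value inside the band. This is a minimal sketch, assuming the log is a list of dictionaries with `step` and `loss` keys, which is the shape of `trainer.state.log_history` in the Hugging Face `Trainer`. `first_step_in_band` is a hypothetical name, and the toy log holds made-up values rather than results from this run.

```python
def first_step_in_band(log_history, low=1.045, high=1.050):
    """Return the first logged step whose training loss lies in [low, high], else None."""
    for entry in log_history:
        loss = entry.get("loss")
        if loss is not None and low <= loss <= high:
            return entry["step"]
    return None

# Toy log with made-up values, for illustration only
toy_log = [
    {"step": 1, "loss": 1.30},
    {"step": 2, "loss": 1.10},
    {"step": 3, "loss": 1.048},
    {"step": 4, "loss": 1.02},
    {"train_runtime": 12.3},  # summary entries carry no "loss" key
]

print(first_step_in_band(toy_log))  # 3
```

In the notebook, this would need to be called with `trainer.state.log_history` before the trainer object is deleted; the returned step then names the `checkpoint-<step>` directory to merge.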

After selection, we:
- Reload the adapter and base model.
- Merge LoRA adapters using `PeftModel.merge_and_unload()`.
- Save the fully merged model under the `TIME` naming convention (e.g., `TIME-14B`).

This model is then ready for downstream inference and benchmarking.
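
Conceptually, merging folds each low-rank update into its base weight, so that the merged layer gives the same output as the base layer plus adapter. The toy NumPy sketch below illustrates that identity with the standard LoRA scaling `lora_alpha / r`. It is not PEFT's implementation, the matrix sizes are arbitrary, and dropout is ignored because it is inactive at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, lora_alpha = 16, 12, 4, 4   # toy sizes, not the real model's

W = rng.normal(size=(d_out, d_in))   # frozen base weight
A = rng.normal(size=(r, d_in))       # LoRA down-projection
B = rng.normal(size=(d_out, r))      # LoRA up-projection
scale = lora_alpha / r               # equals 1 when lora_alpha == r, as in this notebook

x = rng.normal(size=(d_in,))

# Adapter form: base path plus scaled low-rank path
y_adapter = W @ x + scale * (B @ (A @ x))

# Merged form: fold the update into a single weight matrix
W_merged = W + scale * (B @ A)
y_merged = W_merged @ x

print(np.allclose(y_adapter, y_merged))  # True
```

After merging, the adapter matrices are no longer needed at inference time, which is why the saved `TIME` model can be loaded like an ordinary checkpoint.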


In [68]:
tokenizer = AutoTokenizer.from_pretrained(model_name+"phase3")
model = AutoModelForCausalLM.from_pretrained(model_name+"phase3", dtype=torch.bfloat16)


Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

In [69]:
model = PeftModel.from_pretrained(model, "trainer_output"+model_name[11:]+"/checkpoint-24")
model = model.merge_and_unload()
model.save_pretrained("TIME"+model_name[10:])
tokenizer.save_pretrained("TIME"+model_name[10:])

('TIME-14B/tokenizer_config.json',
 'TIME-14B/special_tokens_map.json',
 'TIME-14B/chat_template.jinja',
 'TIME-14B/vocab.json',
 'TIME-14B/merges.txt',
 'TIME-14B/added_tokens.json',
 'TIME-14B/tokenizer.json')

In [70]:
del model
del tokenizer