# Colab 5: Continued Pretraining
## Domain Adaptation through Continued Pretraining

This notebook demonstrates **Continued Pretraining** - teaching a pre-trained model new domain knowledge.

### What is Continued Pretraining?
- **Continued Pretraining**: Further training a pre-trained model on domain-specific raw text
- **Purpose**: Adapt the model to a specific domain (code, medical, legal, etc.)
- **Different from Finetuning**: Uses raw text, not instruction-response pairs
- **Goal**: Expand model's knowledge in a particular area

### Key Differences:
| Method | Data Format | Goal |
|--------|-------------|------|
| **Pretraining** | Raw text | Learn language |
| **Continued Pretraining** | Domain-specific raw text | Learn domain knowledge |
| **Finetuning** | Instruction-response pairs | Learn to follow instructions |
| **LoRA Finetuning** | Instruction-response pairs | Efficient instruction following |
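
To make the "Data Format" column concrete, here is a minimal sketch contrasting one continued-pretraining record with one finetuning record. The field names (`text`, `instruction`, `response`), the sample strings, and the `### Instruction` template are illustrative assumptions, not taken from a specific dataset or library.

```python
# Illustrative records only; field names and contents are assumptions.

# Continued pretraining: one raw document per record, no roles or templates.
cpt_record = {
    "text": "def add(a, b):\n    \"\"\"Return the sum of a and b.\"\"\"\n    return a + b\n"
}

# Instruction finetuning: an explicit prompt paired with a target response.
sft_record = {
    "instruction": "Write a Python function that adds two numbers.",
    "response": "def add(a, b):\n    return a + b\n",
}

def training_text(record):
    """Return the string the language model would be trained on."""
    if "text" in record:
        # Raw text is used as-is.
        return record["text"]
    # Instruction data is first rendered through a template (a hypothetical one here).
    return f"### Instruction:\n{record['instruction']}\n\n### Response:\n{record['response']}"

print(training_text(cpt_record))
print(training_text(sft_record))
```

The only structural difference is the template step: raw text goes to the model unchanged, while instruction data is wrapped before tokenization.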

### Use Cases:
1. **Code Models**: Train on GitHub code repositories
2. **Medical Models**: Train on medical literature and journals
3. **Legal Models**: Train on legal documents and case law
4. **Multilingual**: Add new language capabilities
5. **Recent Events**: Update model with new information

### In This Notebook:
We'll adapt SmolLM2 to become better at Python programming by continuing pretraining on Python code from the `code_search_net` dataset.

## Step 1: Install Unsloth

**Important**: We need to use `datasets==4.3.0` to avoid recursion errors with Unsloth.

In [None]:
%%capture
!pip install unsloth
# Use datasets==4.3.0 to avoid recursion errors
!pip install datasets==4.3.0
!pip install --upgrade transformers accelerate

## Step 2: Import Libraries

In [None]:
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
import torch
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments, TextStreamer

print("All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## Step 3: Configure Model Parameters

### Continued Pretraining Settings:
- **Learning rate**: 1e-4 to 5e-4 (comparable to the 2e-4 typically used for LoRA finetuning)
- **Longer training**: More epochs to learn domain knowledge
- **Packing**: Enable for efficiency with varying text lengths
- **No instruction template**: We use raw text, not instruction-response format
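
The packing setting above can be pictured as concatenating tokenized documents into one stream and cutting it into fixed-size blocks. The sketch below is a simplified illustration of that idea with toy token IDs; it is not the exact algorithm used by `SFTTrainer`, and `pack_sequences` is a hypothetical helper.

```python
def pack_sequences(tokenized_docs, block_size, eos_id):
    """Concatenate tokenized documents (each followed by EOS) and cut fixed-size blocks."""
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(eos_id)  # EOS marks the document boundary inside a block
    # Drop the trailing remainder that does not fill a whole block.
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Toy token IDs; 0 plays the role of the EOS token.
docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13, 14]]
blocks = pack_sequences(docs, block_size=4, eos_id=0)
print(blocks)
```

Because every block is full, no compute is spent on padding tokens, which is where the efficiency gain comes from when document lengths vary.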

In [None]:
# Model configuration
max_seq_length = 2048
dtype = None  # Auto-detect
load_in_4bit = True  # Use 4bit quantization

# Using SmolLM2 135M for faster training
model_name = "unsloth/SmolLM2-135M-Instruct"

print(f"Configuration:")
print(f"  Model: {model_name}")
print(f"  Max Sequence Length: {max_seq_length}")
print(f"  4-bit Quantization: {load_in_4bit}")
print(f"  Training Mode: CONTINUED PRETRAINING")

## Step 4: Load Model and Tokenizer

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

print("Model loaded successfully!")
print(f"Model type: {type(model)}")
print(f"Tokenizer vocab size: {len(tokenizer)}")

## Step 5: Configure for Continued Pretraining

### Options:

**Option A: Full Continued Pretraining (All Parameters)**
- Updates ALL 135M parameters
- Best for major domain adaptation
- More memory intensive

**Option B: LoRA Continued Pretraining**
- Updates only adapter parameters
- Memory efficient
- Good for moderate domain adaptation

We'll use **Option B (LoRA)** for efficiency:

In [None]:
# Configure LoRA for continued pretraining
model = FastLanguageModel.get_peft_model(
    model,
    r=32,  # Higher rank for pretraining (16-64)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,  # Typically equals r for pretraining
    lora_dropout=0.05,  # Small dropout
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None,
)

print("Model configured for CONTINUED PRETRAINING!")
print("Using LoRA adapters for memory efficiency.")
print(f"LoRA rank: 32 (higher than finetuning for better adaptation)")
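
To see why LoRA is the memory-efficient option, it helps to count trainable weights for a single linear layer. This is a rough sketch: the layer shape below is a hypothetical example chosen for round numbers, not read from the SmolLM2 config.

```python
def lora_param_count(d_in, d_out, r):
    """LoRA adds two matrices per adapted layer: A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r

def full_param_count(d_in, d_out):
    """A dense linear layer without bias holds d_in * d_out weights."""
    return d_in * d_out

# Hypothetical layer shape chosen for illustration, not read from the model config.
d_in, d_out, r = 1024, 1024, 32
lora = lora_param_count(d_in, d_out, r)
full = full_param_count(d_in, d_out)
print(f"LoRA params: {lora:,}  full params: {full:,}  ratio: {lora / full:.2%}")
```

Raising `r` grows the adapter linearly, which is the trade-off behind choosing a higher rank for continued pretraining.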

## Step 6: Load Domain-Specific Dataset

### Dataset: CodeSearchNet (Python)
- Contains Python code snippets and documentation
- Raw code text (not instruction-response pairs)
- Perfect for teaching coding abilities

### Alternative Datasets:
- **Medical**: PubMed abstracts, medical journals
- **Legal**: Legal documents, case law
- **News**: Recent news articles
- **Books**: Domain-specific literature
- **Wikipedia**: General knowledge updates

In [None]:
# Load CodeSearchNet dataset (Python subset)
dataset = load_dataset(
    "code_search_net",
    "python",
    split="train[:5000]",  # Using 5000 samples for demo
    trust_remote_code=True
)

print(f"Dataset loaded: {len(dataset)} samples")
print(f"\nSample data point:")
print(dataset[0])
print(f"\nColumns: {dataset.column_names}")

## Step 7: Prepare Raw Text for Pretraining

### Key Differences from Finetuning:
1. **No chat template**: Just raw text
2. **No instruction format**: Not question-answer pairs
3. **Use function + docstring**: Complete code context
4. **Add EOS token**: Mark end of each document

In [None]:
# Format dataset for continued pretraining
EOS_TOKEN = tokenizer.eos_token

def formatting_func(examples):
    """Format code examples for continued pretraining."""
    texts = []
    
    for func_name, docstring, code in zip(
        examples["func_name"],
        examples["func_documentation_string"],
        examples["whole_func_string"]
    ):
        # Create a complete code snippet with docstring
        # This teaches the model both code and documentation
        if docstring and code:
            text = f"# Python Code Example\n\n{code}\n\n# Documentation:\n{docstring}{EOS_TOKEN}"
        elif code:
            text = f"# Python Code Example\n\n{code}{EOS_TOKEN}"
        else:
            continue
        
        texts.append(text)
    
    return {"text": texts}

# Apply formatting
dataset = dataset.map(
    formatting_func,
    batched=True,
    remove_columns=dataset.column_names  # Remove original columns
)

# Drop very short entries (50 characters or fewer)
dataset = dataset.filter(lambda x: len(x["text"]) > 50)

print(f"Dataset formatted: {len(dataset)} samples")
print(f"\nExample formatted text:")
print(dataset[0]["text"][:500] + "...")

## Step 8: Configure Training Arguments

### Continued Pretraining Settings:
- **Learning rate**: `1e-4` (finetuning typically uses `2e-4`, so this run is slightly more conservative)
- **More epochs**: `3` epochs to learn domain knowledge
- **Enable packing**: Efficiently use sequence length
- **Longer training**: More steps than finetuning

### Why Different from Finetuning?
- Learning new knowledge requires more training
- Raw text is harder to learn than instruction pairs
- Need to update model's internal representations

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,  # Enable packing for efficiency
    args=TrainingArguments(
        per_device_train_batch_size=4,  # Can use larger batch for pretraining
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,  # Multiple epochs for domain adaptation
        learning_rate=1e-4,  # Slightly below the 2e-4 typically used for finetuning
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",  # Cosine decay for pretraining
        seed=3407,
        output_dir="outputs_pretraining",
        report_to="none",
    ),
)

print("Trainer configured for CONTINUED PRETRAINING!")
print(f"\nTraining Configuration:")
print(f"  Batch size: 4")
print(f"  Gradient accumulation: 4")
print(f"  Effective batch size: 16")
print(f"  Epochs: 3")
print(f"  Learning rate: 1e-4")
print(f"  Packing: Enabled")
print(f"  Total steps: at most ~{len(dataset) * 3 // 16} (packing reduces this)")
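
The step count printed above divides the number of documents by the effective batch size. With packing enabled, several short documents share one sequence, so the real count depends on total tokens instead. The sketch below gives a rough estimate under an assumed average document length; `estimate_steps` is a hypothetical helper and the 200-token average is an illustrative guess, not a measurement of this dataset.

```python
import math

def estimate_steps(total_tokens, max_seq_length, per_device_batch, grad_accum, epochs):
    """Rough optimizer-step estimate when documents are packed into full-length sequences."""
    packed_sequences = math.ceil(total_tokens / max_seq_length)
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(packed_sequences / effective_batch)
    return steps_per_epoch * epochs

# Assumed corpus size: 5,000 documents averaging 200 tokens each (hypothetical figure).
total_tokens = 5000 * 200
steps = estimate_steps(total_tokens, max_seq_length=2048,
                       per_device_batch=4, grad_accum=4, epochs=3)
print(f"Estimated optimizer steps: {steps}")
```

To get a real figure, tokenize the dataset and sum the lengths, then pass that total in place of the assumed one.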

## Step 9: Start Continued Pretraining!

This will adapt the model to the Python programming domain.
Training time: ~15-30 minutes on a T4 GPU (longer than finetuning).

In [None]:
# Show GPU memory before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

# Start training
print("\nStarting continued pretraining on Python code...")
trainer_stats = trainer.train()

# Show GPU memory after training
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_training = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
training_time = trainer_stats.metrics['train_runtime']

print(f"\n{'='*60}")
print(f"Continued pretraining completed successfully!")
print(f"{'='*60}")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Memory used for training = {used_memory_for_training} GB.")
print(f"Percentage of max memory used = {used_percentage}%")
print(f"Training time = {training_time:.2f} seconds ({training_time/60:.1f} minutes)")
print(f"{'='*60}")

## Step 10: Test the Domain-Adapted Model

Let's test if the model improved at Python coding!

In [None]:
# Enable inference mode
FastLanguageModel.for_inference(model)

# Test prompts - Python coding tasks
test_prompts = [
    "def fibonacci(n):",
    "def merge_sort(arr):",
    "class BinaryTree:",
    "# Function to reverse a string\ndef reverse_string(s):",
]

print("Testing the domain-adapted model on Python code...\n")

for i, prompt in enumerate(test_prompts, 1):
    inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
    
    print(f"{'='*60}")
    print(f"Test {i}: {prompt}")
    print(f"{'='*60}")
    
    text_streamer = TextStreamer(tokenizer, skip_prompt=True)
    output = model.generate(
        **inputs,
        streamer=text_streamer,
        max_new_tokens=150,
        use_cache=True,
        do_sample=True,  # Sampling must be on for temperature/top_p to take effect
        temperature=0.3,  # Lower temperature for code
        top_p=0.9,
        repetition_penalty=1.1,
    )
    print(f"\n")
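
The generation call above uses `temperature=0.3`. As a minimal sketch of what that parameter does when sampling is enabled, the function below divides toy logits by the temperature before the softmax. The logits are made-up values for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy next-token scores
for t in (1.0, 0.3):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A lower temperature concentrates probability on the top-scoring token, which tends to suit code, where small deviations easily break syntax.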

## Step 11: Compare Before vs After

### What Changed?
- **Before**: Model knows basic Python syntax
- **After**: Model should show better command of Python patterns, libraries, and idioms (gains from 5,000 samples on a 135M model are likely modest)

### How to Verify Improvement:
1. **Perplexity**: Lower perplexity on Python code
2. **Code Quality**: Better function implementations
3. **Domain Knowledge**: Uses appropriate libraries and patterns
4. **Consistency**: More coherent code generation
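
Of the checks above, perplexity is the easiest to compute: it is the exponential of the mean per-token negative log-likelihood. The sketch below uses hypothetical loss values to show the calculation; in practice the losses would come from evaluating the model on held-out Python code.

```python
import math

def perplexity(token_losses):
    """Perplexity is exp of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(sum(token_losses) / len(token_losses))

# Hypothetical per-token losses on held-out Python code, before and after training.
losses_before = [2.1, 1.9, 2.4, 2.0]
losses_after = [1.4, 1.2, 1.6, 1.3]

print(f"Perplexity before: {perplexity(losses_before):.2f}")
print(f"Perplexity after:  {perplexity(losses_after):.2f}")
```

Use the same held-out text for both measurements, and keep it out of the training set, so the comparison reflects adaptation instead of memorization.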

## Step 12: Save the Domain-Adapted Model

In [None]:
# Save the LoRA adapter weights and tokenizer (the base model is not merged in)
model.save_pretrained("smollm2_135m_python_adapted")
tokenizer.save_pretrained("smollm2_135m_python_adapted")

print("Domain-adapted model saved to 'smollm2_135m_python_adapted/'")
print("\nNext steps:")
print("1. Use this as a base model for Python-specific finetuning")
print("2. Continue pretraining on more Python code")
print("3. Finetune on instruction-following tasks (Colab 1 or 2)")
print("4. Apply DPO for preference alignment (Colab 3)")

## Step 13: (Optional) Further Finetuning

### Typical Workflow:
```
1. Continued Pretraining (this notebook)
   ↓
2. Instruction Finetuning (Colab 1 or 2)
   ↓
3. Preference Alignment (Colab 3 - DPO)
   ↓
4. Reasoning Enhancement (Colab 4 - GRPO)
```

### Why This Order?
- **Pretraining**: Builds domain knowledge foundation
- **Finetuning**: Teaches instruction following
- **DPO**: Aligns with human preferences
- **GRPO**: Enhances reasoning abilities

In [None]:
# You can now load this model for further finetuning:
# model, tokenizer = FastLanguageModel.from_pretrained(
#     model_name="smollm2_135m_python_adapted",
#     max_seq_length=2048,
#     dtype=None,
#     load_in_4bit=True,
# )

print("Ready for next stage of training!")

## Summary & Key Takeaways

### What We Did:
1. ✅ Loaded SmolLM2 135M model
2. ✅ Configured LoRA for continued pretraining
3. ✅ Loaded CodeSearchNet Python dataset
4. ✅ Formatted raw code text (no instruction template)
5. ✅ Trained on domain-specific data (3 epochs)
6. ✅ Tested Python code generation
7. ✅ Saved domain-adapted model

### Continued Pretraining vs Other Methods:
| Method | Data Format | Learning Rate | Epochs | Purpose |
|--------|-------------|---------------|--------|----------|
| **Continued Pretraining** | Raw text | 1e-4 | 3-5 | Domain knowledge |
| **Full Finetuning** | Instruction pairs | 2e-4 | 1-3 | Instruction following |
| **LoRA Finetuning** | Instruction pairs | 2e-4 | 1-3 | Efficient instruction following |
| **DPO** | Preference pairs | 5e-5 (low) | 1-2 | Alignment |
| **GRPO** | Prompts only | 5e-6 (lower) | Many | Reasoning |

### When to Use Continued Pretraining:
1. **New Domain**: Medical, legal, scientific domains
2. **New Language**: Adding language support
3. **Specialized Knowledge**: Finance, chemistry, etc.
4. **Recent Data**: Updating with new information
5. **Code Languages**: New programming languages

### Dataset Requirements:
Good data for continued pretraining:
- Large corpus of domain text (100K+ examples)
- High quality, well-formatted text
- Representative of target domain
- Raw text, not Q&A pairs

### Tips for Better Results:
1. **More Data**: Use 50K-1M examples for production
2. **Multiple Epochs**: Train for 3-5 epochs
3. **Higher Rank**: Use r=32-64 for LoRA
4. **Enable Packing**: Improves efficiency
5. **Monitor Loss**: Watch for convergence
6. **Quality Over Quantity**: Clean, relevant data is key

### Comparison with Other Colabs:
- **vs Colab 1 (Full Finetuning)**: Uses raw text, not instructions
- **vs Colab 2 (LoRA)**: Higher rank, more epochs, raw text
- **vs Colab 3 (DPO)**: No preference pairs, just domain text
- **vs Colab 4 (GRPO)**: No reward function, just language modeling

### Next Steps:
1. Try other domains (medical, legal, etc.)
2. Increase dataset size for better results
3. Combine with instruction finetuning
4. Apply DPO for alignment
5. Use GRPO for reasoning tasks

# Colab 5: Continued Pretraining
## Teaching an LLM New Knowledge and Languages

This notebook demonstrates **Continued Pretraining (CPT)** - teaching an existing LLM completely new knowledge, languages, or domains.

### What is Continued Pretraining?
- **CPT**: Further pretraining on domain-specific or new data
- **Unlike Finetuning**: Learns new facts, vocabulary, and patterns
- **Expands Knowledge**: Doesn't just format existing knowledge
- **Use Cases**: New languages, specialized domains, recent events

### Key Differences:

| Training Type | Purpose | Data Format | Example |
|--------------|---------|-------------|----------|
| **Pretraining** | Learn language | Raw text | Wikipedia |
| **Continued Pretraining** | Learn new domain | Domain text | Medical papers |
| **Finetuning (SFT)** | Learn format | Q&A pairs | Instruction pairs |
| **Alignment (DPO)** | Learn preferences | Chosen/Rejected | Human feedback |

### What You'll Learn:
1. How to prepare raw text for continued pretraining
2. Training settings for learning new knowledge
3. Teaching a model a new language (e.g., code, domain jargon)
4. Evaluation of new capabilities

### Example Use Cases:
- 🏥 **Medical LLM**: Train on medical literature
- ⚖️ **Legal LLM**: Train on legal documents
- 💻 **Code LLM**: Train on code repositories
- 🌍 **Multilingual**: Add new languages
- 📰 **Current Events**: Update with recent information

## Step 1: Install Unsloth

In [None]:
%%capture
!pip install unsloth
!pip install --upgrade datasets transformers accelerate

## Step 2: Import Libraries

In [None]:
from unsloth import FastLanguageModel
from unsloth import is_bfloat16_supported
import torch
from datasets import load_dataset, Dataset
from trl import SFTTrainer
from transformers import TrainingArguments, TextStreamer, DataCollatorForLanguageModeling

print("All libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## Step 3: Load Base Model

We'll use a small model and teach it new knowledge.

In [None]:
max_seq_length = 2048
dtype = None
load_in_4bit = True

# Using SmolLM2 135M for demonstration
model_name = "unsloth/SmolLM2-135M-Instruct"

print(f"Loading model: {model_name}")
print("We'll teach this model new domain knowledge!")

In [None]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

print("Model loaded successfully!")
print(f"Tokenizer vocab size: {len(tokenizer)}")

## Step 4: Add LoRA Adapters

Even for continued pretraining, LoRA is efficient!

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r=64,  # Higher rank for learning new knowledge
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=64,
    lora_dropout=0.05,  # Small dropout for regularization
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

print("LoRA adapters configured!")
print("Using r=64 for maximum learning capacity.")

## Step 5: Choose Your Domain

### Option 1: Python Programming
Teach the model Python code patterns

### Option 2: Medical/Scientific Text
Teach medical or scientific knowledge

### Option 3: Custom Domain
Your own specialized knowledge

For this demo, we'll use **Python code** as our new domain.

## Step 6: Load Domain-Specific Data

### Data Format for Continued Pretraining:
- **Raw text** (not Q&A pairs!)
- **Large corpus** of domain knowledge
- **Natural format** as it appears in the domain

We'll use a Python code dataset.

In [None]:
# Option 1: Load Python code dataset
print("Loading Python code dataset for continued pretraining...")

# Using a subset of the CodeSearchNet Python dataset
dataset = load_dataset(
    "code_search_net",
    "python",
    split="train[:5000]"  # Using 5000 samples
)

print(f"Dataset loaded: {len(dataset)} code samples")
print(f"\nSample code:")
print(dataset[0]['whole_func_string'][:300] + "...")

## Step 7: Prepare Text for Pretraining

### Key Differences from Finetuning:
1. **No templates**: Just raw text
2. **Longer sequences**: More context
3. **Next token prediction**: Standard language modeling
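
"Next token prediction" means every position in a sequence is trained to predict the token that follows it. A minimal sketch with toy string tokens in place of integer IDs (`next_token_pairs` is a hypothetical helper):

```python
def next_token_pairs(token_ids):
    """Pair each position with the token that follows it (the prediction target)."""
    inputs = token_ids[:-1]
    targets = token_ids[1:]
    return list(zip(inputs, targets))

tokens = ["def", "add", "(", "a", ",", "b", ")", ":"]  # toy tokens instead of integer IDs
for current, target in next_token_pairs(tokens):
    print(f"{current!r} -> {target!r}")
```

In a real model each prediction conditions on all earlier tokens, not only the current one; the pairs here just show how inputs and labels are offset by one position.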

In [None]:
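EOS_TOKEN = tokenizer.eos_token  # Marks the end of each document

def prepare_for_pretraining(examples):
    """
    Prepare raw text for continued pretraining.
    For code: include the function with its docstring.
    """
    texts = []
    
    for func_string, docstring in zip(examples['whole_func_string'], examples['func_documentation_string']):
        # Prefix every docstring line with "# " so multi-line docstrings stay valid comments
        comment = "\n".join(f"# {line}" for line in (docstring or "").splitlines())
        # Format: docstring comment + code, ending with EOS so packed documents stay separated
        text = f"{comment}\n{func_string}{EOS_TOKEN}"
        texts.append(text)
    
    return {"text": texts}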

# Apply formatting
dataset = dataset.map(
    prepare_for_pretraining,
    batched=True,
    remove_columns=dataset.column_names
)

print("Dataset prepared for continued pretraining!")
print(f"\nExample formatted text:")
print(dataset[0]['text'][:400] + "...")

## Alternative: Create Custom Domain Dataset

You can also create your own domain-specific dataset.

In [None]:
# Example: Custom medical/scientific text
# Uncomment to use:

# custom_texts = [
#     """Machine learning is a subset of artificial intelligence that focuses on 
#     building systems that can learn from data. Neural networks are composed of 
#     interconnected nodes that process information...""",
#     
#     """Deep learning uses multiple layers of neural networks to progressively 
#     extract higher-level features from raw input. This hierarchical approach...""",
#     
#     # Add more domain texts...
# ]
# 
# custom_dataset = Dataset.from_dict({"text": custom_texts})
# dataset = custom_dataset

print("To use custom domain text, uncomment and modify the cell above.")

## Step 8: Test Model BEFORE Continued Pretraining

Let's see what the model knows about Python before training.

In [None]:
FastLanguageModel.for_inference(model)

print("Testing model BEFORE continued pretraining on Python code...\n")
print("="*60)

test_prompt = """Write a Python function to calculate the factorial of a number:

def factorial(n):"""

inputs = tokenizer([test_prompt], return_tensors="pt").to("cuda")

print("Prompt:")
print(test_prompt)
print("\nModel's completion (BEFORE training):")

text_streamer = TextStreamer(tokenizer, skip_prompt=True)
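outputs_before = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=100,
    do_sample=True,  # Sampling must be on for temperature to take effect
    temperature=0.7,
)

print("\n" + "="*60)

# for_inference() switched the model to inference mode; switch back before training
FastLanguageModel.for_training(model)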

## Step 9: Configure Training for Continued Pretraining

### Important Settings:
- **Higher learning rate**: 5e-5 to 2e-4 (learning new knowledge)
- **More epochs/steps**: Need more exposure to new domain
- **Longer sequences**: More context helps learning
- **No response template**: Just raw text

In [None]:
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=True,  # Pack sequences for efficiency
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,  # More epochs for deeper learning
        learning_rate=1e-4,  # Higher learning rate for new knowledge
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="cosine",  # Cosine scheduler for CPT
        seed=3407,
        output_dir="outputs",
        report_to="none",
    ),
)

print("Trainer configured for continued pretraining!")
print(f"\nConfiguration:")
print(f"  Training mode: CONTINUED PRETRAINING")
print(f"  Learning rate: 1e-4 (higher for new knowledge)")
print(f"  Epochs: 3")
print(f"  Packing: True (efficiency)")
print(f"  Scheduler: Cosine")
print(f"  Domain: Python Programming")

## Step 10: Start Continued Pretraining!

The model will learn Python patterns, syntax, and idioms.

In [None]:
# Show GPU memory before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

# Start training
print("\nStarting continued pretraining on Python code...")
print("The model is learning Python! 🐍\n")

trainer_stats = trainer.train()

# Show statistics
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_training = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
training_time = trainer_stats.metrics['train_runtime']

print(f"\n{'='*60}")
print(f"Continued Pretraining completed successfully!")
print(f"{'='*60}")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Memory used for training = {used_memory_for_training} GB.")
print(f"Percentage of max memory used = {used_percentage}%")
print(f"Training time = {training_time:.2f} seconds")
print(f"{'='*60}")

## Step 11: Test Model AFTER Continued Pretraining

Let's see if the model improved at writing Python code!

In [None]:
FastLanguageModel.for_inference(model)

print("Testing model AFTER continued pretraining on Python code...\n")
print("="*60)

test_prompts = [
    """Write a Python function to calculate the factorial of a number:

def factorial(n):""",
    
    """Write a Python function to check if a number is prime:

def is_prime(n):""",
    
    """Write a Python function to reverse a string:

def reverse_string(s):"""
]

for i, test_prompt in enumerate(test_prompts, 1):
    print(f"\nTest {i}:")
    print("Prompt:")
    print(test_prompt)
    print("\nModel's completion (AFTER training):")
    
    inputs = tokenizer([test_prompt], return_tensors="pt").to("cuda")
    
    text_streamer = TextStreamer(tokenizer, skip_prompt=True)
    outputs = model.generate(
        **inputs,
        streamer=text_streamer,
        max_new_tokens=150,
        do_sample=True,  # Sampling must be on for temperature to take effect
        temperature=0.3,
    )
    
    print("\n" + "="*60)

## Step 12: Compare Before vs After

### What Changed?
After continued pretraining, you should see:
- ✅ Better Python syntax
- ✅ More idiomatic code
- ✅ Proper function structure
- ✅ Better understanding of Python patterns

In [None]:
print("\n💡 Key Improvements After Continued Pretraining:")
print("="*60)
print("BEFORE: Generic or incorrect code")
print("AFTER: Proper Python with correct patterns")
print("="*60)
print("\nWhat the model learned:")
print("  • Python syntax and structure")
print("  • Common programming patterns")
print("  • Idiomatic Python style")
print("  • Function documentation")
print("  • Code organization")
print("="*60)

## Step 13: Save the Domain-Adapted Model

In [None]:
model.save_pretrained("smollm2_python_pretrained")
tokenizer.save_pretrained("smollm2_python_pretrained")

print("Domain-adapted model saved to 'smollm2_python_pretrained/'")
print("\nThis model has now been adapted toward Python code.")

## Step 14: (Optional) Further Finetuning

After continued pretraining, you can do instruction finetuning for even better results!

In [None]:
print("\n🎯 Recommended Next Steps:")
print("="*60)
print("1. Continued Pretraining (Done!) → Learn Python")
print("2. Instruction Finetuning (Next) → Learn to follow instructions")
print("3. DPO Alignment (Optional) → Align with preferences")
print("="*60)
print("\nThis creates a powerful domain-specific model!")

## Alternative Domains to Explore

You can adapt this notebook for any domain!

In [None]:
print("\n🌟 Other Domains to Try:")
print("="*60)
print("\n1. 🏥 Medical:")
print("   Dataset: PubMed abstracts, medical papers")
print("   Use: Medical diagnosis, literature review")

print("\n2. ⚖️ Legal:")
print("   Dataset: Legal documents, case law")
print("   Use: Legal research, document analysis")

print("\n3. 💻 Code (Other Languages):")
print("   Dataset: JavaScript, Java, C++ repositories")
print("   Use: Multi-language code generation")

print("\n4. 🌍 New Natural Language:")
print("   Dataset: Text in target language")
print("   Use: Multilingual capabilities")

print("\n5. 🔬 Scientific:")
print("   Dataset: Research papers, arXiv")
print("   Use: Scientific reasoning, literature")

print("\n6. 📰 News/Current Events:")
print("   Dataset: Recent news articles")
print("   Use: Up-to-date knowledge")

print("\n7. 🎮 Gaming:")
print("   Dataset: Game wikis, strategy guides")
print("   Use: Game AI, strategy assistant")

print("="*60)

## Summary & Key Concepts

### What is Continued Pretraining?
**Continued Pretraining (CPT)** is the process of:
- Taking an existing pretrained model
- Training it further on domain-specific raw text
- Teaching it new vocabulary, facts, and patterns
- Expanding its knowledge beyond original training

### Training Pipeline:
```
1. Base Model (e.g., SmolLM2)
   ↓
2. Continued Pretraining (Domain Knowledge)
   ↓
3. Instruction Finetuning (Format)
   ↓
4. Alignment (DPO/RLHF)
   ↓
5. Domain Expert Model!
```

### Key Differences:

| Aspect | Pretraining | Continued Pretraining | Finetuning |
|--------|-------------|----------------------|------------|
| **Data** | General text | Domain text | Q&A pairs |
| **Format** | Raw text | Raw text | Structured |
| **Goal** | Learn language | Learn domain | Learn format |
| **Scale** | Trillions | Millions-Billions | Thousands |
| **LR** | 1e-4 to 3e-4 | 5e-5 to 2e-4 | 1e-5 to 5e-5 |

### Data Preparation:

**For Continued Pretraining**:
```python
# Just raw text, no templates!
texts = [
    "Python is a high-level programming language...",
    "Machine learning involves training models...",
    # More raw domain text
]
```

**NOT like finetuning**:
```python
# Don't use this format for CPT:
# {"instruction": "...", "output": "..."}
```

### Training Settings:

**Learning Rate**: 
- Higher than finetuning: 5e-5 to 2e-4
- Lower than the upper end used for initial pretraining (about 3e-4)

**Epochs/Steps**:
- More than finetuning: 3-10 epochs
- Need repeated exposure to new knowledge

**Sequence Length**:
- Longer is better: 2048-4096 tokens
- More context helps learning

**Batch Size**:
- Larger if possible: 8-64 effective batch size
- Stable gradient estimates

**Scheduler**:
- Cosine decay works well
- Gradual learning rate reduction
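
The cosine schedule with warmup can be written in a few lines. This sketch follows the general shape (linear warmup to the peak, then half a cosine down to zero) and is not necessarily identical to the scheduler implementation used by the trainer; `lr_at_step` and the values below are illustrative.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total_steps, peak_lr, warmup_steps = 100, 1e-4, 10  # illustrative values
for step in (0, 5, 10, 55, 100):
    print(step, f"{lr_at_step(step, total_steps, peak_lr, warmup_steps):.2e}")
```

The rate falls slowly at first and fastest in the middle of training, so most of the early steps still run close to the peak value.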

### Use Cases:

**1. Domain Adaptation**:
- Medical, legal, financial
- Specialized vocabulary
- Domain-specific patterns

**2. Language Addition**:
- Add new natural languages
- Programming languages
- Domain-specific jargon

**3. Knowledge Updates**:
- Recent events
- New discoveries
- Updated information

**4. Private Data**:
- Company documents
- Internal knowledge bases
- Proprietary information

### Benefits:

- ✅ **Deep Knowledge**: True understanding, not just formatting
- ✅ **Vocabulary**: Learns new terms and concepts
- ✅ **Patterns**: Understands domain-specific structures
- ✅ **Generalization**: Better than just memorizing examples

### When to Use CPT:

**Use CPT when**:
- Model lacks domain knowledge
- Need deep expertise
- Have large corpus of domain text
- Want to add new language

**Skip CPT when**:
- Model already knows the domain
- Only need formatting
- Limited domain data
- Quick task-specific adaptation

### Best Practices:

**1. Data Quality**:
- High-quality domain text
- Clean and well-formatted
- Representative of domain

**2. Data Quantity**:
- At least 10MB of text
- Ideally 100MB-1GB+
- More data = better results

**3. Progressive Training**:
```
Stage 1: Continued Pretraining (raw text)
Stage 2: Instruction Finetuning (Q&A)
Stage 3: Alignment (preferences)
```

**4. Evaluation**:
- Test on domain tasks
- Compare before/after
- Use domain experts

**5. Monitoring**:
- Watch for catastrophic forgetting
- Test general capabilities
- Balance old and new knowledge
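
One common way to balance old and new knowledge is to mix a small share of general-domain text back into the domain corpus (often called replay). The sketch below is one possible way to build such a mix; the notebook itself does not do this, and `mix_with_replay`, the 10% fraction, and the placeholder documents are all assumptions for illustration.

```python
import random

def mix_with_replay(domain_texts, general_texts, replay_fraction, seed=0):
    """Build a shuffled corpus where about replay_fraction of documents are general-domain."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Number of general documents needed so they make up replay_fraction of the mix.
    n_replay = round(len(domain_texts) * replay_fraction / (1.0 - replay_fraction))
    n_replay = min(n_replay, len(general_texts))
    mixed = list(domain_texts) + rng.sample(general_texts, n_replay)
    rng.shuffle(mixed)
    return mixed

domain = [f"python snippet {i}" for i in range(90)]   # placeholder domain documents
general = [f"general text {i}" for i in range(50)]    # placeholder general documents
corpus = mix_with_replay(domain, general, replay_fraction=0.1)
print(len(corpus), sum(t.startswith("general") for t in corpus))
```

The right fraction depends on the model and domain, so treat it as a setting to tune while testing general capabilities before and after training.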

### Common Mistakes:

- ❌ **Using templates**: CPT needs raw text
- ❌ **Too little data**: Need substantial corpus
- ❌ **Too low LR**: Won't learn new knowledge
- ❌ **Too few epochs**: Need multiple passes
- ❌ **Skipping validation**: Test domain knowledge

### Real-World Examples:

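**BloombergGPT**:
- Domain-focused pretraining from scratch (not continued pretraining in the strict sense)
- 363B tokens of financial text mixed with 345B tokens of general text
- Reported to outperform comparable general models on financial tasks

**BioGPT**:
- Pretrained from scratch on PubMed, another domain-pretraining reference point
- 15M PubMed abstracts
- Strong results on biomedical NLP benchmarks at release

**CodeLlama**:
- Continued pretraining of Llama 2 on code
- 500B tokens of code-heavy data
- Strong code generation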

### Measuring Success:

**Qualitative**:
- Better domain terminology
- More accurate facts
- Appropriate style

**Quantitative**:
- Domain-specific benchmarks
- Perplexity on domain text
- Task accuracy

### Next Steps:

1. **Gather Domain Data**:
   - Collect 100MB+ of quality text
   - Clean and preprocess

2. **Run CPT**:
   - Use this notebook as template
   - Train for 3-10 epochs
   - Monitor loss

3. **Evaluate**:
   - Test domain knowledge
   - Compare to base model

4. **Instruction Tune**:
   - Use Colab 1 or 2
   - Teach instruction following

5. **Deploy**:
   - Export to GGUF
   - Use with Ollama
   - Production ready!

### Resources:
- [Unsloth CPT Guide](https://docs.unsloth.ai/basics/continued-pretraining)
- [Domain Adaptation Papers](https://arxiv.org/)
- [Code Datasets](https://huggingface.co/datasets?task_categories=code)
- [Domain-Specific Datasets](https://huggingface.co/datasets)

### Conclusion:
Continued Pretraining is essential for:
- ✅ Domain expertise
- ✅ New language learning
- ✅ Knowledge expansion
- ✅ Specialized applications

**CPT + Finetuning + Alignment = Domain Expert AI! 🚀**

# Colab 5: Continued Pretraining (New Language / Domain)

This notebook shows the steps to perform continued pretraining on a plain-text corpus (for example to teach a model a new language or domain). Continued pretraining typically uses a language-modeling objective over large text corpora.

In [None]:
# Install packages
!pip install --upgrade pip
!pip install unsloth datasets transformers accelerate --quiet
print('Upload your continued-pretraining corpus as plain text (datasets/continued_pretraining.txt)')
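
Once the plain-text corpus is uploaded, it has to be split into documents before training. The sketch below shows one simple approach: group blank-line-separated paragraphs into chunks under a character limit. `split_into_documents` and the 2,000-character default are hypothetical choices, not part of any library.

```python
def split_into_documents(raw_text, max_chars=2000):
    """Group blank-line-separated paragraphs into documents of at most max_chars characters."""
    paragraphs = [p.strip() for p in raw_text.split("\n\n") if p.strip()]
    documents, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Keep growing the current document (an oversized single paragraph is kept whole).
            current = candidate
        else:
            documents.append(current)
            current = para
    if current:
        documents.append(current)
    return [{"text": doc} for doc in documents]

sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
records = split_into_documents(sample, max_chars=35)
print(records)
```

The resulting list of `{"text": ...}` records has the shape that `Dataset.from_list` in the `datasets` library accepts, so it could feed a trainer that reads a `text` field.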