# 🚀 Qwen3-8B Turkish 200K Training - Google Colab Version

This notebook fine-tunes a Qwen model on a Turkish dataset in Google Colab. The default configuration below loads `Qwen/Qwen2.5-7B`.

**Features:**
- ✅ Tiktoken tokenizer support
- ✅ LoRA fine-tuning
- ✅ Memory optimization
- ✅ Curriculum learning
- ✅ Auto batch size calculation
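
Curriculum learning is listed above, but the cells in this notebook do not spell out a schedule. As a minimal sketch of one common variant (short texts first, long texts last), the hypothetical helper `curriculum_order` below splits samples into stages by character length. The function name, the `num_stages` parameter, and the length-based difficulty measure are all assumptions for illustration, not part of the notebook.

```python
from typing import Dict, List


def curriculum_order(samples: List[Dict[str, str]], num_stages: int = 3) -> List[List[Dict[str, str]]]:
    """Split samples into stages of increasing text length (shortest first)."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(samples, key=lambda s: len(s["text"]))
    # Ceiling division so every sample lands in some stage
    stage_size = max(1, -(-len(ordered) // num_stages))
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


# Tiny illustrative input; real data would come from the dataset's 'text' column
samples = [
    {"text": "Bu oldukça uzun bir örnek cümledir."},
    {"text": "Merhaba"},
    {"text": "Kısa bir cümle."},
]
stages = curriculum_order(samples, num_stages=3)
for index, stage in enumerate(stages, start=1):
    print(index, [s["text"] for s in stage])
```

Training on the stages in order (or concatenating them without shuffling) would give an easy-to-hard progression. Whether length is a good proxy for difficulty depends on the data.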

## 1️⃣ GPU Check and System Info

In [None]:
# GPU check
!nvidia-smi

import torch
print(f"\nPyTorch Version: {torch.__version__}")
print(f"CUDA Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

## 2️⃣ Installing Required Libraries

In [None]:
%%capture
# %%capture keeps the installation quiet (it also hides the final print in this cell)

# Core libraries
!pip install -q --upgrade pip
!pip install -q torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Transformers and related libraries
!pip install -q transformers==4.44.0
!pip install -q datasets==2.14.0
!pip install -q accelerate==0.32.0
!pip install -q peft==0.11.1
!pip install -q bitsandbytes==0.43.1

# Tiktoken for Qwen3
!pip install -q tiktoken

# Other helper libraries
!pip install -q tqdm
!pip install -q psutil
!pip install -q sentencepiece

print("✅ All libraries installed!")

## 3️⃣ Google Drive Connection (Optional)

In [None]:
# Mount Google Drive (for storing the dataset and the model)
from google.colab import drive
drive.mount('/content/drive')

# Create the working directory
import os
WORK_DIR = '/content/drive/MyDrive/qwen3_training'
os.makedirs(WORK_DIR, exist_ok=True)
os.chdir(WORK_DIR)
print(f"📁 Working directory: {WORK_DIR}")


## 4️⃣ Loading the Dataset

In [None]:
# Download or load the dataset
from datasets import load_dataset, Dataset, load_from_disk
import os

DATASET_PATH = "turkish_200k_dataset"

if os.path.exists(DATASET_PATH):
    print("📂 Loading existing dataset...")
    dataset = load_from_disk(DATASET_PATH)
else:
    print("📥 Loading dataset from Hugging Face...")
    
    # Load the Huseyin/turkish-200k-dataset dataset
    dataset = load_dataset("Huseyin/turkish-200k-dataset", split="train")
    
    # Save the dataset locally
    dataset.save_to_disk(DATASET_PATH)
    print("💾 Dataset saved locally")
    
print(f"✅ Dataset ready: {len(dataset)} samples")
print(f"📊 Dataset columns: {dataset.column_names}")

## 5️⃣ Tiktoken Tokenizer Setup

In [None]:
# Qwen3 Tiktoken Tokenizer Wrapper
import json
from pathlib import Path
from typing import List, Optional, Dict, Any, Union
import tiktoken
import torch

class Qwen3TiktokenTokenizer:
    """Tiktoken-based tokenizer wrapper for Qwen3.

    Caution: cl100k_base is an OpenAI encoding, not the vocabulary the Qwen
    checkpoints were trained with, so its token IDs do not correspond to the
    model's embedding rows.
    """
    
    def __init__(self, max_length: int = 512):
        self.encoding = tiktoken.get_encoding("cl100k_base")
        self.pad_token = "<|endoftext|>"
        self.eos_token = "<|endoftext|>"
        self.pad_token_id = 100257  # ID of <|endoftext|> in cl100k_base
        self.eos_token_id = 100257
        self.model_max_length = max_length
        self.padding_side = "left"
        print(f"✅ Tiktoken tokenizer loaded (vocab size: {self.encoding.n_vocab})")
    
    def __call__(self, 
                 text: Union[str, List[str]], 
                 padding: bool = True,
                 truncation: bool = True,
                 max_length: Optional[int] = None,
                 return_tensors: Optional[str] = None,
                 **kwargs) -> Dict[str, Any]:
        """Tokenize text"""
        
        if isinstance(text, str):
            texts = [text]
        else:
            texts = text
        
        max_len = max_length or self.model_max_length
        
        all_input_ids = []
        all_attention_masks = []
        
        for txt in texts:
            tokens = self.encoding.encode(txt)
            
            if truncation and len(tokens) > max_len:
                tokens = tokens[:max_len]
            
            if padding:
                original_length = len(tokens)
                if self.padding_side == "left":
                    pad_length = max_len - original_length
                    tokens = [self.pad_token_id] * pad_length + tokens
                    attention_mask = [0] * pad_length + [1] * original_length
                else:
                    tokens = tokens + [self.pad_token_id] * (max_len - original_length)
                    attention_mask = [1] * original_length + [0] * (max_len - original_length)
            else:
                attention_mask = [1] * len(tokens)
            
            all_input_ids.append(tokens)
            all_attention_masks.append(attention_mask)
        
        result = {
            'input_ids': all_input_ids[0] if isinstance(text, str) else all_input_ids,
            'attention_mask': all_attention_masks[0] if isinstance(text, str) else all_attention_masks
        }
        
        if return_tensors == "pt":
            result['input_ids'] = torch.tensor(result['input_ids'])
            result['attention_mask'] = torch.tensor(result['attention_mask'])
        
        return result
    
    def encode(self, text: str, **kwargs) -> List[int]:
        return self.encoding.encode(text)
    
    def decode(self, token_ids, skip_special_tokens: bool = True, **kwargs) -> str:
        if hasattr(token_ids, 'tolist'):
            token_ids = token_ids.tolist()
        if isinstance(token_ids, int):
            token_ids = [token_ids]
        if isinstance(token_ids, list) and len(token_ids) > 0 and isinstance(token_ids[0], list):
            token_ids = token_ids[0]
        if skip_special_tokens and isinstance(token_ids, list):
            token_ids = [t for t in token_ids if t not in [self.pad_token_id, self.eos_token_id]]
        return self.encoding.decode(token_ids)
    
    def __len__(self):
        return self.encoding.n_vocab

# Test tokenizer
tokenizer = Qwen3TiktokenTokenizer(max_length=512)
test_text = "Merhaba, bu bir test metnidir."
tokens = tokenizer(test_text, return_tensors="pt")
print(f"Test tokenization: {tokens['input_ids'].shape}")
# A single string yields a 1-D tensor, so decode the whole sequence rather than element [0]
decoded = tokenizer.decode(tokens['input_ids'])
print(f"Decoded: {decoded[:50]}...")

## 6️⃣ Optimized Training Configuration

In [None]:
from dataclasses import dataclass, field
from typing import List, Optional, Dict
import torch
import os

@dataclass
class TrainingConfig:
    """Training configuration tuned for Colab"""
    
    # Model
    # The title mentions Qwen3-8B, but this is the checkpoint actually loaded;
    # Qwen3 checkpoints need a newer transformers release than the 4.44.0 pinned above.
    model_name: str = "Qwen/Qwen2.5-7B"
    
    # Data
    train_size: int = 50000  # Reduced for Colab
    test_size: int = 1000
    max_length: int = 256  # Reduced to save memory
    
    # LoRA
    lora_r: int = 32  # Tuned for Colab
    lora_alpha: int = 64
    lora_dropout: float = 0.05
    lora_target_modules: List[str] = field(default_factory=lambda: [
        "q_proj", "k_proj", "v_proj", "o_proj"
    ])
    
    # Training
    learning_rate: float = 5e-5
    batch_size: int = 2  # For Colab GPUs
    gradient_accumulation_steps: int = 8
    num_epochs: int = 1
    
    # Optimization
    use_8bit: bool = True  # 8-bit quantization
    use_gradient_checkpointing: bool = True
    use_bf16: bool = False  # False on Colab T4 (no bf16 support)
    use_fp16: bool = True  # FP16 for T4
    
    # Paths
    output_dir: str = "./outputs"
    cache_dir: str = "./cache"
    
    # Evaluation
    eval_steps: int = 500
    save_steps: int = 1000
    logging_steps: int = 50
    
    def __post_init__(self):
        # Auto-adjust for GPU
        if torch.cuda.is_available():
            vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
            
            if vram_gb < 16:  # T4 GPU (15GB)
                self.batch_size = 1
                self.gradient_accumulation_steps = 16
                self.max_length = 256
                self.lora_r = 16
                print(f"⚙️ T4 GPU detected ({vram_gb:.1f}GB) - Using conservative settings")
            elif vram_gb < 25:  # e.g. L4 (24 GB); an A100 40GB falls through to the else branch
                self.batch_size = 2
                self.gradient_accumulation_steps = 8
                self.max_length = 384
                print(f"⚙️ Mid-range GPU detected ({vram_gb:.1f}GB)")
            else:
                self.batch_size = 4
                self.gradient_accumulation_steps = 4
                self.max_length = 512
                print(f"⚙️ High-end GPU detected ({vram_gb:.1f}GB)")
        
        # Create directories
        os.makedirs(self.output_dir, exist_ok=True)
        os.makedirs(self.cache_dir, exist_ok=True)

# Initialize config
config = TrainingConfig()
print(f"\n📋 Configuration:")
print(f"  • Batch size: {config.batch_size}")
print(f"  • Gradient accumulation: {config.gradient_accumulation_steps}")
print(f"  • Max length: {config.max_length}")
print(f"  • LoRA rank: {config.lora_r}")

## 7️⃣ Model Loading with LoRA

In [None]:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
import torch
import gc

def load_model_with_lora(config: TrainingConfig):
    """Load the model and attach LoRA adapters"""
    
    print("🔄 Loading model...")
    
    # Memory cleanup
    gc.collect()
    torch.cuda.empty_cache()
    
    # Quantization config
    bnb_config = BitsAndBytesConfig(
        load_in_8bit=config.use_8bit,
        load_in_4bit=not config.use_8bit,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16 if config.use_fp16 else torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    
    # Load model
    try:
        model = AutoModelForCausalLM.from_pretrained(
            config.model_name,
            quantization_config=bnb_config,
            device_map="auto",
            trust_remote_code=True,
            torch_dtype=torch.float16 if config.use_fp16 else torch.bfloat16,
            cache_dir=config.cache_dir,
            use_cache=False,
            low_cpu_mem_usage=True
        )
    except Exception as e:
        print(f"⚠️ Could not load {config.model_name}: {e}")
        print("🔄 Trying an alternative model...")
        
        # Fallback to smaller model
        config.model_name = "microsoft/phi-2"
        model = AutoModelForCausalLM.from_pretrained(
            config.model_name,
            device_map="auto",
            trust_remote_code=True,
            torch_dtype=torch.float16,
            cache_dir=config.cache_dir,
            use_cache=False
        )
    
    # Enable gradient checkpointing
    if config.use_gradient_checkpointing:
        model.gradient_checkpointing_enable()
        model.enable_input_require_grads()
    
    # Prepare for LoRA
    model = prepare_model_for_kbit_training(
        model,
        use_gradient_checkpointing=config.use_gradient_checkpointing
    )
    
    # LoRA configuration
    lora_config = LoraConfig(
        r=config.lora_r,
        lora_alpha=config.lora_alpha,
        target_modules=config.lora_target_modules,
        lora_dropout=config.lora_dropout,
        bias="none",
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False
    )
    
    # Apply LoRA
    model = get_peft_model(model, lora_config)
    
    # Print model info
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"✅ Model loaded: {config.model_name}")
    print(f"📊 Trainable params: {trainable:,} ({100*trainable/total:.2f}%)")
    
    return model

# Load model
model = load_model_with_lora(config)

## 8️⃣ Data Processing

In [None]:
from datasets import Dataset
import numpy as np
from tqdm.auto import tqdm

def prepare_dataset(dataset, tokenizer, config):
    """Prepare and tokenize the dataset"""
    
    print("📝 Preparing dataset...")
    
    # Shuffle and select
    dataset = dataset.shuffle(seed=42)
    
    if len(dataset) > config.train_size + config.test_size:
        dataset = dataset.select(range(config.train_size + config.test_size))
    
    # Train/test split
    split = dataset.train_test_split(
        test_size=config.test_size,
        seed=42
    )
    
    train_data = split['train']
    test_data = split['test']
    
    print(f"  Train: {len(train_data)} samples")
    print(f"  Test: {len(test_data)} samples")
    
    # Tokenize function
    def tokenize_function(examples):
        outputs = tokenizer(
            examples['text'],
            truncation=True,
            max_length=config.max_length,
            padding='max_length',
            return_tensors=None
        )
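        # Labels mirror input_ids; padded positions get -100 so the loss ignores them
        outputs['labels'] = [
            [tok if keep else -100 for tok, keep in zip(ids, mask)]
            for ids, mask in zip(outputs['input_ids'], outputs['attention_mask'])
        ]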
        return outputs
    
    # Tokenize datasets
    print("🔄 Tokenizing...")
    
    tokenized_train = train_data.map(
        tokenize_function,
        batched=True,
        batch_size=1000,
        num_proc=2,  # 2 processes for Colab
        remove_columns=train_data.column_names,
        desc="Tokenizing train"
    )
    
    tokenized_test = test_data.map(
        tokenize_function,
        batched=True,
        batch_size=1000,
        num_proc=2,
        remove_columns=test_data.column_names,
        desc="Tokenizing test"
    )
    
    # Set format
    tokenized_train.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
    tokenized_test.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
    
    print("✅ Dataset ready!")
    
    return tokenized_train, tokenized_test

# Prepare datasets
train_dataset, test_dataset = prepare_dataset(dataset, tokenizer, config)

## 9️⃣ Training

In [None]:
from transformers import (
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    TrainerCallback
)
import math

# Custom callback for better logging
class CustomCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs and 'loss' in logs:
            logs['perplexity'] = min(math.exp(logs['loss']), 1000)

# Training arguments
training_args = TrainingArguments(
    output_dir=config.output_dir,
    
    # Training
    num_train_epochs=config.num_epochs,
    per_device_train_batch_size=config.batch_size,
    per_device_eval_batch_size=config.batch_size * 2,
    gradient_accumulation_steps=config.gradient_accumulation_steps,
    
    # Learning rate
    learning_rate=config.learning_rate,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    
    # Optimization
    optim="paged_adamw_8bit" if config.use_8bit else "adamw_torch",
    fp16=config.use_fp16,
    bf16=config.use_bf16,
    gradient_checkpointing=config.use_gradient_checkpointing,
    
    # Evaluation
    eval_strategy="steps",
    eval_steps=config.eval_steps,
    save_steps=config.save_steps,
    save_total_limit=2,
    
    # Logging
    logging_steps=config.logging_steps,
    logging_first_step=True,
    report_to="none",  # wandb is not used on Colab
    
    # Best model
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    
    # Performance
    dataloader_num_workers=2,
    dataloader_pin_memory=True,
    
    # Other
    seed=42,
    run_name="qwen3_turkish",
    push_to_hub=False,
    remove_unused_columns=False
)

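# Data collator
# The samples are already padded to max_length and carry labels, so the default
# collator only stacks them; DataCollatorForLanguageModeling would call
# tokenizer.pad(), which the tiktoken wrapper does not implement.
from transformers import default_data_collator
data_collator = default_data_collator

# Create trainer
# No tokenizer is passed: Trainer would call tokenizer.save_pretrained() when saving checkpoints
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    data_collator=data_collator,
    callbacks=[CustomCallback(), EarlyStoppingCallback(early_stopping_patience=3)]
)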

print("🚀 Training is starting...")
print(f"  Total steps: {len(train_dataset) // (config.batch_size * config.gradient_accumulation_steps) * config.num_epochs}")
print(f"  Effective batch size: {config.batch_size * config.gradient_accumulation_steps}")

In [None]:
# Start training
import time

start_time = time.time()

try:
    # Train
    train_result = trainer.train()
    
    # Evaluate
    eval_result = trainer.evaluate()
    
    # Print results
    elapsed_time = (time.time() - start_time) / 60
    
    print("\n" + "="*50)
    print("✅ TRAINING COMPLETED!")
    print("="*50)
    print(f"⏱️ Total time: {elapsed_time:.1f} minutes")
    print(f"📉 Final train loss: {train_result.training_loss:.4f}")
    print(f"📉 Final eval loss: {eval_result['eval_loss']:.4f}")
    print(f"📊 Perplexity: {min(math.exp(eval_result['eval_loss']), 1000):.2f}")
    
except KeyboardInterrupt:
    print("\n⚠️ Training interrupted by user")
except Exception as e:
    print(f"\n❌ Training error: {e}")
    import traceback
    traceback.print_exc()

## 🔟 Save Model

In [None]:
# Save the final model
print("💾 Saving model...")

# Save to local
output_dir = f"{config.output_dir}/final_model"
trainer.save_model(output_dir)
# The tiktoken wrapper defines no save_pretrained(), so call it only when available
if hasattr(tokenizer, "save_pretrained"):
    tokenizer.save_pretrained(output_dir)

print(f"✅ Model saved: {output_dir}")

# If using Google Drive
if '/content/drive' in os.getcwd():
    print("📁 Model saved to Google Drive")

## 1️⃣1️⃣ Test Model

In [None]:
# Test the model
def generate_text(prompt, max_length=100):
    """Generate text with the trained model"""
    
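    # Tokenize as a batch of one, without padding, so input_ids has shape (1, seq_len)
    inputs = tokenizer([prompt], padding=False, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    
    # Generate (switch to eval mode so dropout is disabled)
    model.eval()
    with torch.no_grad():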
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_length,
            temperature=0.7,
            do_sample=True,
            top_p=0.95,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id
        )
    
    # Decode
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text

# Test examples
test_prompts = [
    "Türkiye'nin başkenti",
    "Yapay zeka teknolojisi",
    "Bugün hava çok güzel"
]

print("🧪 Model Test Results:\n")
for prompt in test_prompts:
    result = generate_text(prompt, max_length=50)
    print(f"📝 Prompt: {prompt}")
    print(f"🤖 Generated: {result}\n")
    print("-" * 50)

## 1️⃣2️⃣ Memory Cleanup

In [None]:
# Clean up memory
import gc

# Delete large objects
del model
del trainer
del train_dataset
del test_dataset

# Garbage collection
gc.collect()

# Clear CUDA cache
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print(f"✅ GPU memory cleared")
    print(f"📊 Current GPU memory: {torch.cuda.memory_allocated()/1e9:.2f} GB")

## 📌 Notes and Tips

### Colab Free GPU Limits:
- **T4 GPU**: ~15GB VRAM
- **Session limit**: 12 hours
- **Idle timeout**: 90 minutes

### Performance Tips:
1. **Batch size**: Reduce it if GPU memory fills up
2. **Max length**: Drop it to 128 if you hit memory problems
3. **LoRA rank**: Use 8-16 for fewer trainable parameters
4. **Gradient accumulation**: Increase it when the batch size is small
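
To make the LoRA rank tip concrete: a LoRA adapter on a linear layer adds two small matrices, `A` (rank × d_in) and `B` (d_out × rank), so the number of added parameters grows linearly with the rank. The sketch below is only an estimate helper. The function name and the layer width of 4096 are illustrative assumptions and are not read from any model configuration.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


HIDDEN = 4096  # illustrative layer width, not taken from a real checkpoint

for rank in (8, 16, 32):
    per_projection = lora_trainable_params(HIDDEN, HIDDEN, rank)
    print(f"rank={rank:>2}: {per_projection:,} parameters per adapted projection")
```

Halving the rank halves the adapter parameters, and with them the optimizer state kept for those parameters. The frozen base weights are unaffected.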

### Troubleshooting:
- **CUDA out of memory**: Reduce batch size or max_length
- **Tokenizer error**: Use a Hugging Face tokenizer instead of tiktoken (GPT-2, or preferably the one published with the model)
- **Model loading error**: Try a smaller model (phi-2, pythia-1.4b)
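
One way to act on an out-of-memory error without changing the optimization behaviour too much is to halve the per-device batch size and raise gradient accumulation so the effective batch size stays at least as large. The helper below is a hypothetical sketch of that arithmetic and is not part of the notebook. Its steps (4×4, 2×8, 1×16) happen to match the presets in `TrainingConfig.__post_init__`.

```python
from typing import Tuple


def back_off(batch_size: int, grad_accum: int) -> Tuple[int, int]:
    """Halve the batch size and raise accumulation so the effective batch does not shrink."""
    if batch_size <= 1:
        # Nothing left to halve; the next lever would be a shorter max_length
        return batch_size, grad_accum
    effective = batch_size * grad_accum
    new_batch = batch_size // 2
    # Ceiling division keeps new_batch * new_accum >= effective
    new_accum = -(-effective // new_batch)
    return new_batch, new_accum


settings = (4, 4)
for _ in range(3):
    print(f"batch_size={settings[0]}, gradient_accumulation_steps={settings[1]}")
    settings = back_off(*settings)
```

Each step keeps 16 samples per optimizer update while roughly halving activation memory per forward pass.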

### Saving the Model:
- Remember to save to Google Drive
- If the session ends, the model is lost!
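
If training wrote its output to the local Colab disk rather than to the mounted Drive folder, the saved directory can be copied over before the session ends. The sketch below uses only the standard library. `backup_dir` is a hypothetical helper, and the demo runs on temporary directories with a placeholder file instead of real Colab paths.

```python
import shutil
import tempfile
from pathlib import Path


def backup_dir(src: str, dst_root: str) -> Path:
    """Copy src into dst_root/<src name>, overwriting files from an older copy."""
    src_path = Path(src)
    if not src_path.is_dir():
        raise FileNotFoundError(f"No such directory: {src_path}")
    dst_path = Path(dst_root) / src_path.name
    # dirs_exist_ok (Python 3.8+) allows repeated backups into the same folder
    shutil.copytree(src_path, dst_path, dirs_exist_ok=True)
    return dst_path


# Demo on temporary directories; on Colab, dst_root would be a folder under /content/drive
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "final_model"
    model_dir.mkdir()
    (model_dir / "placeholder.json").write_text("{}")
    copied = backup_dir(str(model_dir), str(Path(tmp) / "drive_backup"))
    print(sorted(p.name for p in copied.iterdir()))
```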

### Dataset:
- You can load your own dataset in CSV or JSON format
- You can use ready-made Turkish datasets from Hugging Face
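
The tokenization step in this notebook reads a `text` column, so a custom file needs that column too. The `datasets` library can read such files directly with `load_dataset("csv", data_files=...)` or `load_dataset("json", data_files=...)`. The standard-library sketch below only illustrates the expected shape and checks for the column before training. `load_text_records` and the file names are hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def load_text_records(path: str, text_column: str = "text") -> List[Dict[str, str]]:
    """Read a .csv or .jsonl file and return non-empty rows as {'text': ...} records."""
    file_path = Path(path)
    suffix = file_path.suffix.lower()
    with file_path.open(encoding="utf-8", newline="") as handle:
        if suffix == ".csv":
            rows = list(csv.DictReader(handle))
        elif suffix == ".jsonl":
            rows = [json.loads(line) for line in handle if line.strip()]
        else:
            raise ValueError(f"Unsupported file type: {suffix}")
    if rows and text_column not in rows[0]:
        raise KeyError(f"Column '{text_column}' not found in {file_path.name}")
    return [{"text": str(row[text_column])} for row in rows if row.get(text_column)]


# Demo with a temporary CSV file standing in for a user-provided dataset
with tempfile.TemporaryDirectory() as tmp:
    sample_file = Path(tmp) / "my_data.csv"
    sample_file.write_text("text\nMerhaba dünya\nİkinci örnek\n", encoding="utf-8")
    records = load_text_records(str(sample_file))
    print(len(records), records[0])
```

The resulting list of records could then be wrapped with `datasets.Dataset.from_list(records)` and passed to `prepare_dataset` in place of the Hugging Face dataset.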