# QLoRA Training - Optimized for MacBook Pro M1 16GB

This notebook is tuned specifically for:
- **Hardware:** MacBook Pro M1, 16GB RAM
- **Model:** Mistral-7B, quantized (INT4)
- **Method:** QLoRA (Quantized Low-Rank Adaptation)
- **Dataset:** Farense (943 examples, 90/10 split)

**Estimated Training Time:** 2-3 hours

---

## Section 1: System Setup and Verification

Check the hardware, GPU, and dependencies.

In [None]:
# 1.1 Basic imports
import os
import json
import time
import gc
import random
from pathlib import Path
from datetime import datetime

import numpy as np

print("✓ Basic imports OK")

In [None]:
# 1.2 Check MLX and the Metal GPU
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim
from mlx_lm import load, generate

print("✓ MLX imports OK")

# Select the GPU device; fall back to the CPU if that fails
try:
    mx.set_default_device(mx.gpu)
    device = mx.default_device()
    print(f"✓ Metal GPU enabled: {device}")
except Exception as e:
    print(f"⚠ Warning: {e}")
    mx.set_default_device(mx.cpu)
    print(f"✓ CPU mode: {mx.default_device()}")

print(f"✓ MLX version: {mx.__version__}")

In [None]:
# 1.3 Check available memory
import psutil

GIB = 1024 ** 3  # report in binary gigabytes so a 16GB machine prints as 16

memory = psutil.virtual_memory()
print("\n📊 SYSTEM MEMORY:")
print(f"  Total: {memory.total / GIB:.1f} GB")
print(f"  Available: {memory.available / GIB:.1f} GB")
print(f"  Used: {memory.used / GIB:.1f} GB ({memory.percent}%)")
print(f"\n✓ System with {memory.total / GIB:.0f}GB RAM detected")
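
To put the 16GB figure in context, the sketch below gives a rough estimate of how much memory the 4-bit weights alone occupy. It assumes about 7.24 billion parameters for Mistral-7B and one 16-bit scale plus one 16-bit bias per quantization group of 64 weights; `estimate_quantized_weights_gib` is a hypothetical helper, not part of MLX. Activations, optimizer state, and the operating system need memory on top of this.

```python
def estimate_quantized_weights_gib(n_params: float, bits: int = 4, group_size: int = 64) -> float:
    """Approximate size in GiB of group-quantized weights.

    Assumes every group of `group_size` weights also stores a 16-bit scale
    and a 16-bit bias, and that all parameters are quantized.
    """
    overhead_bits = 32 / group_size          # scale + bias, spread over the group
    total_bits = n_params * (bits + overhead_bits)
    return total_bits / 8 / 1024 ** 3


weights_gib = estimate_quantized_weights_gib(7.24e9)
print(f"Approximate 4-bit weight footprint: {weights_gib:.1f} GiB")
```

The estimate is a lower bound in practice, since some layers may be kept at higher precision.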

---

## Section 2: Training Configuration (IMPORTANT)

### ⚙️ SETTINGS OPTIMIZED FOR M1 16GB

These are the recommended settings for a MacBook Pro M1 16GB:

In [None]:
# ============================================
# SETTINGS FOR MACBOOK PRO M1 16GB
# ============================================

class Config:
    """Training settings tuned for an M1 with 16GB of RAM"""

    # ===== PATHS =====
    project_root = Path("/Users/f.nuno/Desktop/chatbot_2.0/LLM_training")
    data_dir = project_root / "data"
    model_dir = project_root / "models" / "mistral-7b-4bit"
    checkpoint_dir = project_root / "checkpoints_qlora"
    output_dir = project_root / "output" / "mistral-7b-farense-qlora"

    # ===== DATA =====
    train_file = data_dir / "train.jsonl"
    valid_file = data_dir / "valid.jsonl"

    # ===== TRAINING HYPERPARAMETERS (M1 16GB OPTIMIZED) =====
    # Batch size: 4 should be safe for an M1 16GB with QLoRA
    # With gradient_accumulation_steps=2, effective batch size = 8
    batch_size = 4                          # ← IMPORTANT: batch size for M1 16GB
    gradient_accumulation_steps = 2         # Accumulate gradients (effective batch = 8)
    learning_rate = 2e-4                    # Common learning rate for LoRA
    num_epochs = 3                          # 3 epochs recommended
    warmup_steps = 100                      # Learning-rate warmup
    max_seq_length = 512                    # Maximum sequence length

    # ===== VALIDATION & CHECKPOINTING =====
    eval_steps = 200                        # Validate every 200 steps
    save_steps = 200                        # Save a checkpoint every 200 steps
    log_steps = 10                          # Log every 10 steps

    # ===== LoRA CONFIGURATION =====
    lora_rank = 8                           # LoRA decomposition rank
    lora_scale = 16                         # Scaling factor
    lora_dropout = 0.0                      # No dropout for a small dataset
    target_modules = [
        "q_proj", "v_proj", "k_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ]                                       # Modules that receive LoRA adapters

    # ===== QUANTIZATION =====
    quantization_bits = 4                   # INT4 quantization
    group_size = 64                         # Quantization group size

    # ===== SYSTEM =====
    seed = 42                               # Reproducibility
    device = "gpu"                          # Metal GPU

# Create the directories if they do not exist
Config.checkpoint_dir.mkdir(parents=True, exist_ok=True)
Config.output_dir.mkdir(parents=True, exist_ok=True)

print("\n" + "="*70)
print("⚙️  TRAINING SETTINGS - MACBOOK PRO M1 16GB")
print("="*70)
print(f"""
🎯 TRAINING PARAMETERS:
  • Batch Size:                 {Config.batch_size}
  • Gradient Accumulation:      {Config.gradient_accumulation_steps}
  • Effective Batch Size:       {Config.batch_size * Config.gradient_accumulation_steps}
  • Learning Rate:              {Config.learning_rate}
  • Number of Epochs:           {Config.num_epochs}
  • Max Sequence Length:        {Config.max_seq_length}
  • Warmup Steps:               {Config.warmup_steps}

📊 CHECKPOINTING & VALIDATION:
  • Save Checkpoint Every:      {Config.save_steps} steps
  • Evaluate Every:             {Config.eval_steps} steps
  • Log Every:                  {Config.log_steps} steps

🎛️  LoRA CONFIGURATION:
  • LoRA Rank:                  {Config.lora_rank}
  • LoRA Scale:                 {Config.lora_scale}
  • LoRA Dropout:               {Config.lora_dropout}
  • Target Modules:             {len(Config.target_modules)} (q,v,k,o,gate,up,down)

💾 QUANTIZATION:
  • Bits:                       {Config.quantization_bits}-bit INT
  • Group Size:                 {Config.group_size}

📂 PATHS:
  • Train Data:                 {Config.train_file.name}
  • Valid Data:                 {Config.valid_file.name}
  • Model:                      {Config.model_dir.name}
  • Checkpoints:                {Config.checkpoint_dir.name}
  • Output:                     {Config.output_dir.name}
""")
print("="*70)

# Summary
print("\n✅ Settings loaded")
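
The step-based settings above are easier to judge once they are converted into counts per epoch. The sketch below does that arithmetic; `training_schedule` is a hypothetical helper, and the training-set size assumes that 90% of the 943 examples end up in `train.jsonl`.

```python
import math


def training_schedule(n_train: int, batch_size: int, accum_steps: int, epochs: int) -> dict:
    """Derive batch and update counts from the basic training settings."""
    micro_batches = math.ceil(n_train / batch_size)      # forward/backward passes per epoch
    updates = math.ceil(micro_batches / accum_steps)     # optimizer updates per epoch
    return {
        "effective_batch_size": batch_size * accum_steps,
        "micro_batches_per_epoch": micro_batches,
        "optimizer_updates_per_epoch": updates,
        "total_optimizer_updates": updates * epochs,
    }


# Assumption: a 90/10 split of 943 examples gives about 849 training examples
schedule = training_schedule(n_train=round(943 * 0.9), batch_size=4, accum_steps=2, epochs=3)
for name, value in schedule.items():
    print(f"{name}: {value}")
```

Under these assumptions, a 200-step evaluation interval counted in batches fires once per epoch, and 100 warmup steps counted in optimizer updates cover roughly a third of training.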

---

## Section 3: Data Loading

In [None]:
# 3.1 Load the training data
print(f"[INFO] Loading training data from {Config.train_file}...")

train_data = []
with open(Config.train_file, 'r', encoding='utf-8') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        try:
            record = json.loads(line)
            train_data.append(record)
        except json.JSONDecodeError as e:
            print(f"[WARN] Invalid line skipped: {e}")

print(f"✓ {len(train_data)} training examples loaded")

In [None]:
# 3.2 Load the validation data
print(f"[INFO] Loading validation data from {Config.valid_file}...")

valid_data = []
with open(Config.valid_file, 'r', encoding='utf-8') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        try:
            record = json.loads(line)
            valid_data.append(record)
        except json.JSONDecodeError as e:
            print(f"[WARN] Invalid line skipped: {e}")

print(f"✓ {len(valid_data)} validation examples loaded")

In [None]:
# 3.3 Show a data sample
print("\n📌 DATA SAMPLE:")
print("="*70)
sample = train_data[0]
metadata = sample.get('metadata', {})  # tolerate records without metadata
print(f"\nQuestion: {sample['prompt']}")
print(f"\nAnswer: {sample['completion'][:100]}...")
print(f"\nType: {metadata.get('tipo', 'unknown')}")
print(f"Season: {metadata.get('epoca', 'unknown')}")
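
The loading cells only skip lines that are not valid JSON. A slightly stricter loader can also report records that lack the fields used later. The sketch below assumes that `prompt` and `completion` are the required keys, as in the sample shown above; `load_jsonl` is a hypothetical helper and the inline records are made up for illustration.

```python
import io
import json

REQUIRED_KEYS = ("prompt", "completion")


def load_jsonl(stream):
    """Return (records, problems); problems holds (line_number, reason) pairs."""
    records, problems = [], []
    for line_no, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # blank lines are not an error
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((line_no, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((line_no, "record is not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((line_no, f"missing keys: {missing}"))
            continue
        records.append(record)
    return records, problems


# Made-up records standing in for train.jsonl
sample_stream = io.StringIO(
    '{"prompt": "Q1", "completion": "A1", "metadata": {"tipo": "demo"}}\n'
    '\n'
    '{"prompt": "Q2"}\n'
    'not json\n'
)
records, problems = load_jsonl(sample_stream)
print(f"{len(records)} valid record(s), {len(problems)} problem(s)")
```

Passing an open file handle instead of the `StringIO` object would apply the same checks to the real data files.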

---

## Section 4: Load the Base Model

In [None]:
# 4.1 Load the model and tokenizer
print(f"\n[INFO] Loading model from {Config.model_dir}...")
print("[INFO] This can take 1-2 minutes the first time...")

start_load = time.time()

try:
    model, tokenizer = load(str(Config.model_dir))
    elapsed = time.time() - start_load
    print(f"✓ Model loaded in {elapsed:.1f}s")
except Exception as e:
    print(f"❌ Error loading model: {e}")
    print("Trying to load from Hugging Face...")
    # NOTE: this repository holds the unquantized weights, which are unlikely to
    # fit in 16GB; prefer a 4-bit MLX conversion if the local copy is missing.
    model, tokenizer = load("mistralai/Mistral-7B-v0.1")
    print(f"✓ Model loaded from HF in {time.time() - start_load:.1f}s")

In [None]:
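# 4.2 Inspect the model
from mlx.utils import tree_flatten

# model.parameters() is a nested dict, so flatten it before counting.
# For a quantized model this counts packed storage elements, not logical weights.
n_params = sum(v.size for _, v in tree_flatten(model.parameters()))

print(f"""
✓ Model loaded successfully:
  • Type: {type(model).__name__}
  • Parameters: {n_params:,}
  • Tokenizer: {type(tokenizer).__name__}
  • Vocab Size: {tokenizer.vocab_size}
""")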

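For comparison with the parameter count printed above, the number of weights that LoRA actually trains can be worked out by hand: an adapter of rank r on a linear layer with d_in inputs and d_out outputs adds r × (d_in + d_out) parameters. The sketch below applies this to the seven target modules, assuming the standard Mistral-7B shapes (hidden size 4096, key/value width 1024, MLP width 14336, 32 layers); `lora_param_count` is a hypothetical helper. At rank 8 this comes to about 21 million parameters, well under 1% of the base model.

```python
# Assumed Mistral-7B projection shapes: (in_features, out_features)
MISTRAL_7B_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32


def lora_param_count(shapes: dict, rank: int, num_layers: int) -> int:
    """Total LoRA parameters: rank * (d_in + d_out) per adapted layer."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


trainable = lora_param_count(MISTRAL_7B_SHAPES, rank=8, num_layers=NUM_LAYERS)
print(f"LoRA parameters at rank 8: {trainable:,}")
```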
---

## Section 5: Data Preparation (Tokenization)

In [None]:
# 5.1 Tokenization function
def tokenize_example(example, tokenizer, max_length):
    """
    Tokenize one example by combining its prompt and completion.
    Format: "### Pergunta:\n{prompt}\n\n### Resposta:\n{completion}"
    """
    prompt = example.get('prompt', '')
    completion = example.get('completion', '')

    # Build the instruction text (the template stays in Portuguese to match the data)
    text = f"### Pergunta:\n{prompt}\n\n### Resposta:\n{completion}"

    # Tokenize
    tokens = tokenizer.encode(text)

    # Truncate if needed (this cuts from the end, i.e. from the completion)
    if len(tokens) > max_length:
        tokens = tokens[:max_length]

    return tokens

# 5.2 Tokenize the training data
print("[INFO] Tokenizing training data...")
train_tokens = []
skipped = 0

for i, example in enumerate(train_data):
    try:
        tokens = tokenize_example(example, tokenizer, Config.max_seq_length)
        if len(tokens) >= 8:  # Filter out very short examples
            train_tokens.append(tokens)
        else:
            skipped += 1
    except Exception as e:
        skipped += 1
        if i < 3:
            print(f"[WARN] Tokenization error in example {i}: {e}")

print(f"✓ {len(train_tokens)} training examples tokenized (skipped: {skipped})")

# 5.3 Tokenize the validation data
print("[INFO] Tokenizing validation data...")
valid_tokens = []
skipped = 0

for i, example in enumerate(valid_data):
    try:
        tokens = tokenize_example(example, tokenizer, Config.max_seq_length)
        if len(tokens) >= 8:
            valid_tokens.append(tokens)
        else:
            skipped += 1
    except Exception:
        skipped += 1

print(f"✓ {len(valid_tokens)} validation examples tokenized (skipped: {skipped})")

# Show statistics
if train_tokens:
    lengths = [len(t) for t in train_tokens]
    print("\n📊 Token Statistics (Training):")
    print(f"  • Minimum: {min(lengths)} tokens")
    print(f"  • Maximum: {max(lengths)} tokens")
    print(f"  • Mean: {sum(lengths)/len(lengths):.0f} tokens")
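
Because truncation removes tokens from the end of each sequence, it shortens the completion rather than the prompt, so it helps to know how often the length limit is actually hit. The sketch below summarizes that for a list of untruncated token counts; `truncation_report` is a hypothetical helper and the lengths are invented for illustration.

```python
def truncation_report(lengths, max_length):
    """Summarize how many sequences exceed max_length and how many tokens are cut."""
    overflow = [n - max_length for n in lengths if n > max_length]
    return {
        "truncated_examples": len(overflow),
        "truncated_fraction": len(overflow) / len(lengths),
        "tokens_dropped": sum(overflow),
    }


# Hypothetical untruncated token counts, for illustration only
example_lengths = [120, 340, 512, 600, 1024]
report = truncation_report(example_lengths, max_length=512)
print(report)
```

To use it on real data, the lengths would need to be measured before the truncation step in `tokenize_example`.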

---

## Section 6: Metrics System

In [None]:
# 6.1 MetricsTracker class
import json
from typing import Optional

class MetricsTracker:
    """Tracks training metrics and persists them as JSON and CSV"""

    def __init__(self, output_dir: Path):
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)

        self.metrics_file = self.output_dir / "training_metrics.json"
        self.csv_file = self.output_dir / "training_metrics.csv"
        self.state_file = self.output_dir / "training_state.json"

        self.metrics = []
        self.best_loss = float('inf')
        self.best_checkpoint = None

        self._load_state()

    def _load_state(self):
        """Load the previous state if one exists"""
        if self.state_file.exists():
            try:
                with open(self.state_file) as f:
                    state = json.load(f)
                    self.best_loss = state.get('best_loss', float('inf'))
                    print(f"[INFO] Previous state loaded (best_loss: {self.best_loss:.4f})")
            except (OSError, ValueError):
                # Unreadable or corrupt state file: start from a clean state
                pass
    
    def log_step(self, epoch: int, step: int, loss: float, 
                 val_loss: Optional[float] = None, learning_rate: float = 0):
        """Log a single step"""
        metric = {
            'timestamp': datetime.now().isoformat(),
            'epoch': epoch,
            'step': step,
            'loss': float(loss),
            'learning_rate': float(learning_rate)
        }

        # Record the validation loss and track the best value seen so far
        if val_loss is not None:
            metric['val_loss'] = float(val_loss)
            if val_loss < self.best_loss:
                self.best_loss = val_loss

        self.metrics.append(metric)
    
    def save(self):
        """Save the metrics as JSON and CSV"""
        with open(self.metrics_file, 'w') as f:
            json.dump(self.metrics, f, indent=2)

        # Also save as CSV. Rows logged without validation have no 'val_loss',
        # so use a fixed column list and leave missing values blank.
        import csv
        if self.metrics:
            keys = ['timestamp', 'epoch', 'step', 'loss', 'learning_rate', 'val_loss']
            with open(self.csv_file, 'w', newline='') as f:
                writer = csv.DictWriter(f, fieldnames=keys, restval='')
                writer.writeheader()
                writer.writerows(self.metrics)
    
    def save_state(self, epoch: int, step: int):
        """Save the state needed to resume"""
        state = {
            'epoch': epoch,
            'step': step,
            'best_loss': self.best_loss,
            'timestamp': datetime.now().isoformat()
        }
        with open(self.state_file, 'w') as f:
            json.dump(state, f, indent=2)

# Initialize the tracker
tracker = MetricsTracker(Config.checkpoint_dir)
print("✓ Metrics tracker initialized")
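
One detail worth knowing when writing metrics to CSV: `csv.DictWriter` raises `ValueError` if a row contains a key that is not in `fieldnames`, which matters here because only some rows carry `val_loss`. The sketch below shows one way to handle rows with differing keys, using the union of keys as the header and `restval` for the gaps. The metric values are invented for illustration.

```python
import csv
import io

# Invented metric rows: only the second one has a validation loss
rows = [
    {"step": 10, "loss": 2.31},
    {"step": 200, "loss": 1.87, "val_loss": 1.95},
]

# Union of all keys, in first-seen order
fieldnames = list(dict.fromkeys(key for row in rows for key in row))

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Rows without a validation loss simply get an empty cell in the `val_loss` column.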

---

## Section 7: Training Loop (MAIN)

In [None]:
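# 7.1 Attach LoRA adapters and configure the optimizer
# Without this step the LoRA settings in Config are never applied and training
# would target the quantized base weights directly.
# Assumption: mlx_lm.tuner.utils.linear_to_lora_layers(model, num_layers, config).
# Its signature has changed between mlx-lm releases, so check the installed version.
# Config.target_modules is not passed, because the expected key format also varies.
from mlx_lm.tuner.utils import linear_to_lora_layers

print("\n[INFO] Attaching LoRA adapters...")
model.freeze()  # keep the quantized base weights fixed
linear_to_lora_layers(
    model,
    len(model.layers),  # adapt every transformer block
    {"rank": Config.lora_rank, "scale": float(Config.lora_scale), "dropout": Config.lora_dropout},
)

print("[INFO] Configuring optimizer...")
# Linear warmup from 0 to the target learning rate, constant afterwards
lr_schedule = optim.linear_schedule(0.0, Config.learning_rate, Config.warmup_steps)
optimizer = optim.Adam(learning_rate=lr_schedule)
print("✓ Adam optimizer configured")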

# 7.2 Loss function
def compute_loss(model, input_ids, labels):
    """Compute the next-token cross-entropy loss.

    Note: padded positions are not masked, so they contribute to the mean.
    """
    logits = model(input_ids)

    # Shift logits and labels for language modeling:
    # the logits at position t are scored against the token at position t + 1
    loss = nn.losses.cross_entropy(
        logits[:, :-1, :],  # Predictions: all but the last position
        labels[:, 1:],      # Targets: all but the first token
        reduction="mean"
    )
    return loss

print("✓ Loss function defined")
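
When sequences in a batch are padded to a common length, a common refinement is to exclude the padded positions from the loss. The NumPy sketch below illustrates the idea independently of MLX: it shifts logits and targets by one position, computes the per-token negative log-likelihood, and averages only over real tokens. `masked_next_token_loss` is a hypothetical helper and the inputs are random placeholders, not model outputs.

```python
import numpy as np


def masked_next_token_loss(logits, tokens, lengths):
    """Mean next-token cross-entropy over non-padding positions.

    logits: (batch, seq, vocab) floats; tokens: (batch, seq) ints;
    lengths: true (unpadded) length of each sequence.
    """
    logits = logits[:, :-1, :]          # predictions for positions 0..seq-2
    targets = tokens[:, 1:]             # tokens at positions 1..seq-1

    # Numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]

    # Target index j corresponds to token j + 1, which is real only if j + 1 < length
    positions = np.arange(targets.shape[1])[None, :]
    mask = positions < (np.asarray(lengths)[:, None] - 1)
    return (token_nll * mask).sum() / mask.sum()


rng = np.random.default_rng(0)
tokens = np.array([[1, 3, 4, 2, 2],     # true length 3, then two padding tokens
                   [1, 4, 3, 0, 2]])    # true length 5, no padding
lengths = [3, 5]
logits = rng.normal(size=(2, 5, 6))     # placeholder logits for a vocabulary of 6
loss = masked_next_token_loss(logits, tokens, lengths)
print(f"masked loss: {loss:.4f}")
```

With this masking, changing the logits at padded positions leaves the loss unchanged, which is not the case for an unmasked mean.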

In [None]:
# 7.3 MAIN TRAINING LOOP
from mlx.utils import tree_map

print("\n" + "="*70)
print("🚀 STARTING TRAINING")
print("="*70)

# Seed the RNGs so the shuffle order is reproducible
random.seed(Config.seed)
mx.random.seed(Config.seed)

# nn.value_and_grad differentiates with respect to the model's trainable parameters.
# This assumes the base weights are frozen and only the adapter weights are trainable.
loss_and_grad = nn.value_and_grad(model, compute_loss)
accum_steps = Config.gradient_accumulation_steps


def make_batch(token_lists):
    """Right-pad token sequences with EOS to a common length and return an MLX array."""
    max_len = max(len(t) for t in token_lists)
    return mx.array([t + [tokenizer.eos_token_id] * (max_len - len(t)) for t in token_lists])


training_start = time.time()

for epoch in range(Config.num_epochs):
    print(f"\n📍 EPOCH {epoch + 1}/{Config.num_epochs}")
    print("-" * 70)

    # Shuffle the data
    indices = list(range(len(train_tokens)))
    random.shuffle(indices)

    epoch_loss = 0
    epoch_steps = 0
    accum_grads = None
    epoch_start = time.time()

    # Training loop over micro-batches
    for batch_idx in range(0, len(indices), Config.batch_size):
        batch_indices = indices[batch_idx:batch_idx + Config.batch_size]
        input_ids = make_batch([train_tokens[i] for i in batch_indices])

        # Loss and gradients for this micro-batch (labels are the inputs, shifted in compute_loss)
        loss_val, grads = loss_and_grad(model, input_ids, input_ids)

        # Accumulate gradients across micro-batches
        if accum_grads is None:
            accum_grads = grads
        else:
            accum_grads = tree_map(lambda a, b: a + b, accum_grads, grads)
        mx.eval(loss_val, accum_grads)

        epoch_loss += loss_val.item()
        epoch_steps += 1

        # Apply one optimizer update every accum_steps micro-batches
        if epoch_steps % accum_steps == 0:
            optimizer.update(model, tree_map(lambda g: g / accum_steps, accum_grads))
            mx.eval(model.parameters(), optimizer.state)
            accum_grads = None

        current_lr = optimizer.learning_rate.item()

        # Log the running average loss
        if epoch_steps % Config.log_steps == 0:
            avg_loss = epoch_loss / epoch_steps
            print(f"  Step {epoch_steps:3d} | Loss: {avg_loss:.4f}")
            tracker.log_step(epoch, epoch_steps, avg_loss, learning_rate=current_lr)
            tracker.save()
            tracker.save_state(epoch, epoch_steps)

        # Validation
        if epoch_steps % Config.eval_steps == 0:
            print("\n  [INFO] Evaluating on the validation set...")
            val_loss = 0
            val_steps = 0

            for val_idx in range(0, len(valid_tokens), Config.batch_size):
                val_ids = make_batch(valid_tokens[val_idx:val_idx + Config.batch_size])
                val_loss += compute_loss(model, val_ids, val_ids).item()
                val_steps += 1

            avg_val_loss = val_loss / max(val_steps, 1)
            print(f"  ✓ Val Loss: {avg_val_loss:.4f}")
            tracker.log_step(epoch, epoch_steps, epoch_loss / epoch_steps, 
                             val_loss=avg_val_loss, learning_rate=current_lr)
            tracker.save()

    # Apply any gradients left over from an incomplete accumulation window
    if accum_grads is not None:
        leftover = epoch_steps % accum_steps
        optimizer.update(model, tree_map(lambda g: g / leftover, accum_grads))
        mx.eval(model.parameters(), optimizer.state)

    # End of epoch
    epoch_duration = time.time() - epoch_start
    avg_epoch_loss = epoch_loss / max(epoch_steps, 1)
    print(f"\n✓ Epoch {epoch + 1} finished in {epoch_duration:.1f}s")
    print(f"  Mean loss: {avg_epoch_loss:.4f}")

total_duration = time.time() - training_start
print("\n" + "="*70)
print(f"✅ TRAINING COMPLETE in {total_duration/3600:.1f} hours ({total_duration/60:.0f} minutes)")
print("="*70)
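
The idea behind gradient accumulation, where several small batches stand in for one larger batch, can be checked on a toy problem. The NumPy sketch below uses a linear least-squares model, not the language model, and shows that averaging the gradients of two micro-batches of 4 matches the gradient of a single batch of 8. All names and data here are illustrative.

```python
import numpy as np


def mse_grad(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    residual = X @ w - y
    return X.T @ residual / len(y)


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = np.zeros(3)

# Gradient over the full batch of 8 examples
full_grad = mse_grad(w, X, y)

# Two micro-batches of 4, averaged before the (single) parameter update
micro_grads = [mse_grad(w, X[i:i + 4], y[i:i + 4]) for i in range(0, 8, 4)]
accumulated_grad = sum(micro_grads) / len(micro_grads)

print(np.allclose(full_grad, accumulated_grad))
```

The match is exact only when every micro-batch contributes the same number of terms to the mean. With a per-token mean over sequences of different lengths, accumulation is a close approximation rather than an identity.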

---

## Section 8: Generation Tests (Qualitative Validation)

In [None]:
# 8.1 Simple generation test
print("\n[INFO] Testing text generation...")

# Prompts stay in Portuguese to match the training data
test_prompts = [
    "Qual foi o resultado do Farense",
    "O Farense ganhou",
    "Qual foi a melhor classificação"
]

print("\n" + "="*70)
print("🧪 GENERATION TEST")
print("="*70)

for prompt in test_prompts:
    print(f"\n📝 Question: {prompt}")
    # Wrap the question in the same template used for training
    formatted_prompt = f"### Pergunta:\n{prompt}\n\n### Resposta:\n"
    try:
        response = generate(
            model, 
            tokenizer, 
            prompt=formatted_prompt,
            max_tokens=50,
            verbose=False
        )
        print(f"✓ Answer: {response}")
    except Exception as e:
        print(f"⚠ Error: {e}")
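
Since the model is trained on the `### Pergunta:` / `### Resposta:` template, generated text may run on into a new question block after the answer. The sketch below shows one way to build the prompt and trim such output with plain string handling. `build_inference_prompt` and `extract_answer` are hypothetical helpers, and the generated text is a made-up example.

```python
PROMPT_TEMPLATE = "### Pergunta:\n{prompt}\n\n### Resposta:\n"


def build_inference_prompt(question: str) -> str:
    """Wrap a question in the template used during training."""
    return PROMPT_TEMPLATE.format(prompt=question.strip())


def extract_answer(generated: str) -> str:
    """Keep only the text before the model starts a new '### Pergunta:' block."""
    return generated.split("### Pergunta:")[0].strip()


prompt_text = build_inference_prompt("  O Farense ganhou  ")
answer = extract_answer("Sim.\n\n### Pergunta:\nOutra pergunta")  # made-up model output
print(repr(prompt_text))
print(repr(answer))
```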

---

## Section 9: Save the Final Model

In [None]:
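# 9.1 Save the trained weights and the training summary
from mlx.utils import tree_flatten

# Assumption: store the trainable parameters as "adapters.safetensors" in output_dir.
# If no parameters were frozen before training, this writes every model weight.
adapter_file = Config.output_dir / "adapters.safetensors"
mx.save_safetensors(str(adapter_file), dict(tree_flatten(model.trainable_parameters())))
print(f"✓ Trained weights saved to {adapter_file}")

print("\n[INFO] Saving training summary...")

summary = {
    'training_completed_at': datetime.now().isoformat(),
    'total_duration_seconds': total_duration,
    'total_duration_hours': total_duration / 3600,
    'epochs': Config.num_epochs,
    'batch_size': Config.batch_size,
    'learning_rate': Config.learning_rate,
    'train_examples': len(train_data),
    'valid_examples': len(valid_data),
    'best_validation_loss': tracker.best_loss
}

summary_file = Config.checkpoint_dir / "training_summary.json"
with open(summary_file, 'w') as f:
    json.dump(summary, f, indent=2)

print(f"✓ Summary saved to {summary_file.name}")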

# 9.2 Show the summary
print(f"""
📊 TRAINING SUMMARY:
  • Total Duration: {total_duration/3600:.1f}h ({total_duration/60:.0f}m)
  • Epochs: {Config.num_epochs}
  • Training Examples: {len(train_data)}
  • Validation Examples: {len(valid_data)}
  • Batch Size: {Config.batch_size}
  • Learning Rate: {Config.learning_rate}
  • Best Validation Loss: {tracker.best_loss:.4f}
""")

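A validation loss is often easier to interpret as perplexity, the exponential of the mean per-token cross-entropy in nats. The sketch below shows the conversion; `perplexity` is a hypothetical helper and the loss values are illustrative, not results from this notebook. Because the loss here also averages over padded positions, the resulting figure should be read as approximate.

```python
import math


def perplexity(mean_token_loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy measured in nats."""
    return math.exp(mean_token_loss)


# Illustrative loss values only
for example_loss in (2.0, 1.5, 1.0):
    print(f"loss {example_loss:.2f} -> perplexity {perplexity(example_loss):.2f}")
```
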
---

## Section 10: Next Steps

In [None]:
# An f-string is needed so the placeholders below are filled in
print(f"""
✅ TRAINING COMPLETED SUCCESSFULLY!

📂 Generated Files:
  ✓ checkpoints_qlora/training_metrics.json     (metrics, updated during training)
  ✓ checkpoints_qlora/training_metrics.csv      (CSV format)
  ✓ checkpoints_qlora/training_state.json       (last logged step and best loss)
  ✓ checkpoints_qlora/training_summary.json     (final summary)

🎯 NEXT STEPS:

1. VISUALIZE RESULTS:
   python3 scripts/visualization.py --report

2. TEST THE MODEL:
   python3 scripts/inference_qlora.py "Qual foi a melhor classificação do Farense?"

3. COMPARE MODELS:
   python3 scripts/compare_models.py

4. ANALYZE METRICS:
   Open checkpoints_qlora/training_metrics.json

📊 KEY METRICS:
   • Best Validation Loss: {tracker.best_loss:.4f}
   • Duration: {total_duration/3600:.1f}h
   • Batch Size: {Config.batch_size}
   • Learning Rate: {Config.learning_rate}

💡 TIPS:
   • If val_loss rises while the training loss keeps falling: increase dropout or regularization
   • If the loss does not decrease: try a different learning rate or more epochs
   • If you hit a memory error: reduce batch_size from {Config.batch_size} to {max(1, Config.batch_size // 2)}

🚀 READY FOR PRODUCTION!
""")