# 💰 Vertex AI Best Practices: Running DeepSeek Economically

This notebook presents practical strategies for reducing costs when running LLMs on Google Cloud Vertex AI, with a focus on DeepSeek models.

## 🎯 Main Cost-Saving Strategies:
- **Spot Instances (Preemptible)**: Up to 70% discount
- **Automatic Management**: Shut instances down when not in use
- **Cost Monitoring**: Real-time alerts
- **Local Cache**: Avoid repeated downloads
- **CPU Optimization**: GGML/GGUF format for performance
- **Concurrency Control**: Limit simultaneous requests

## 📊 Savings Goal: Keep costs below R$ 300/month
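
To make the goal concrete, it can help to convert it into instance-hours. The sketch below is a rough illustration, not a pricing tool: the hourly rates (US$ 0.096 for a preemptible CPU VM, US$ 0.35 for a preemptible T4 VM) and the 5.2 BRL/USD exchange rate are the illustrative figures used in the cost comparison in section 6, and `hours_within_budget` is a hypothetical helper.

```python
def hours_within_budget(budget_brl: float, rate_usd_per_hour: float,
                        brl_per_usd: float = 5.2) -> float:
    """Return how many instance-hours fit in a monthly budget."""
    if rate_usd_per_hour <= 0 or brl_per_usd <= 0:
        raise ValueError("rates must be positive")
    return budget_brl / (rate_usd_per_hour * brl_per_usd)


HOURS_IN_MONTH = 24 * 30  # same 30-day month used in section 6

# Illustrative rates only; check current pricing before relying on them
for label, rate in [("CPU (preemptible)", 0.096), ("T4 GPU (preemptible)", 0.35)]:
    hours = hours_within_budget(300.0, rate)
    always_on = hours >= HOURS_IN_MONTH
    print(f"{label}: {hours:.0f} h/month within budget, 24/7 affordable: {always_on}")
```

Under these assumed rates, the budget covers roughly 600 CPU hours or 165 GPU hours, so neither option can run around the clock within the goal. That is why the automatic shutdown in section 3 matters.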

## 🔧 1. Setup and Authentication

First, let's set up the required libraries and authentication.

In [None]:
# Install required dependencies
!pip install google-cloud-aiplatform google-cloud-billing google-cloud-compute google-auth --quiet

# Required imports
import os
import json
import time
import asyncio
from datetime import datetime, timedelta
from google.cloud import aiplatform
from google.cloud import billing_v1
from google.cloud import compute_v1
from google.auth import default
import requests

# Project settings
PROJECT_ID = "seu-projeto-gcp"  # ⚠️ CHANGE THIS
REGION = "us-central1"
ZONE = "us-central1-a"

# Check authentication
try:
    credentials, project = default()
    print(f"✅ Authenticated in project: {project}")
    if project != PROJECT_ID:
        print(f"⚠️ Configured project: {PROJECT_ID}, but authenticated in: {project}")
except Exception as e:
    print(f"❌ Authentication error: {e}")
    print("💡 Run: gcloud auth application-default login")

# Initialize Vertex AI
aiplatform.init(project=PROJECT_ID, location=REGION)
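
Hard-coding the project ID works for a quick test, but reading it from the environment keeps the notebook reusable across projects. A minimal sketch: `GOOGLE_CLOUD_PROJECT` is a commonly used variable name, while `VERTEX_REGION`, `VERTEX_ZONE`, `load_settings` and the fallback values are assumptions made for illustration.

```python
import os


def load_settings(env=None) -> dict:
    """Read project settings from environment variables, with fallbacks."""
    env = os.environ if env is None else env
    region = env.get("VERTEX_REGION", "us-central1")
    return {
        "project_id": env.get("GOOGLE_CLOUD_PROJECT", "seu-projeto-gcp"),
        "region": region,
        # Default to the "-a" zone of the chosen region (an assumption)
        "zone": env.get("VERTEX_ZONE", f"{region}-a"),
    }


# Pass a plain dict to try it without touching the real environment
settings = load_settings({"GOOGLE_CLOUD_PROJECT": "demo-project",
                          "VERTEX_REGION": "europe-west4"})
print(settings)
```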

## 🎯 2. Configuring Spot Instances (Up to 70% Savings!)

Spot (preemptible) instances are one of the most effective ways to save. They suit AI inference workloads that can tolerate interruption, since Spot capacity can be reclaimed at any time.

In [None]:
# Function that builds a Custom Job spec using Spot instances
def create_spot_training_job(
    display_name: str = "deepseek-spot-inference",
    machine_type: str = "n1-standard-4",
    accelerator_type: str = "NVIDIA_TESLA_T4",
    accelerator_count: int = 1
):
    """Build a Custom Job spec that uses Spot instances to save money."""

    # Worker pool configuration
    worker_pool_specs = [
        {
            "machine_spec": {
                "machine_type": machine_type,
                "accelerator_type": accelerator_type,
                "accelerator_count": accelerator_count,
            },
            "replica_count": 1,
            "container_spec": {
                "image_uri": "ollama/ollama:latest",
                "command": ["ollama", "serve"],
                "env": [
                    {"name": "OLLAMA_HOST", "value": "0.0.0.0"},
                    {"name": "OLLAMA_MAX_LOADED_MODELS", "value": "1"},  # Saves memory
                    {"name": "OLLAMA_NUM_PARALLEL", "value": "1"},       # Lower concurrency
                ]
            },
            "disk_spec": {
                "boot_disk_type": "pd-ssd",
                "boot_disk_size_gb": 100
            },
        }
    ]

    job_spec = {
        "display_name": display_name,
        "job_spec": {
            "worker_pool_specs": worker_pool_specs,
            "scheduling": {
                # 🎯 KEY: request Spot capacity. Without this field the job
                # runs on regular on-demand VMs.
                "strategy": "SPOT",
                "restart_job_on_worker_restart": True,
                "timeout": "7200s"  # 2 hours max
            },
            "service_account": f"vertex-ai@{PROJECT_ID}.iam.gserviceaccount.com"
        }
    }

    return job_spec

# Usage example
spot_job_config = create_spot_training_job()
print("🎯 Spot instance configuration created!")
print("💰 Expected savings: up to 70% compared with on-demand instances")
print(json.dumps(spot_job_config, indent=2))
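
Because Spot capacity can be reclaimed at any time, whatever launches the job should expect interruptions. Here is a minimal retry-with-backoff sketch. `PreemptedError`, `run_with_retry` and the `task` callable are hypothetical names for illustration; in practice `task` would submit the job and raise when it ends because a worker was preempted.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class PreemptedError(RuntimeError):
    """Hypothetical error raised when a Spot worker is reclaimed."""


def run_with_retry(task: Callable[[], T], max_attempts: int = 3,
                   base_delay_s: float = 5.0,
                   sleep: Callable[[float], None] = time.sleep) -> T:
    """Run task, retrying with exponential backoff after a preemption."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except PreemptedError:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller decide what to do
            delay = base_delay_s * 2 ** (attempt - 1)
            print(f"Preempted on attempt {attempt}; retrying in {delay:.0f}s")
            sleep(delay)
    raise AssertionError("unreachable")


# Simulated job that is preempted twice before succeeding
attempts = {"n": 0}

def flaky_job() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise PreemptedError("worker reclaimed")
    return "done"

delays = []
result = run_with_retry(flaky_job, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Injecting `sleep` keeps the example fast and makes the backoff schedule easy to inspect.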

## ⏰ 3. Automatic Instance Management

Scripts that start and stop instances automatically to avoid unnecessary costs.

In [None]:
import subprocess
from datetime import datetime, timedelta

class InstanceManager:
    """Automatic instance manager for cost savings."""

    def __init__(self, project_id: str, zone: str):
        self.project_id = project_id
        self.zone = zone
        self.compute_client = compute_v1.InstancesClient()

    def stop_instance_if_idle(self, instance_name: str, max_idle_minutes: int = 30):
        """Stop the instance if it has been idle for more than max_idle_minutes."""
        try:
            # Fetch the current instance state
            request = compute_v1.GetInstanceRequest(
                project=self.project_id,
                zone=self.zone,
                instance=instance_name
            )
            instance = self.compute_client.get(request=request)

            # Check whether it is running
            if instance.status != "RUNNING":
                print(f"🔴 Instance {instance_name} is not running (status: {instance.status})")
                return

            # NOTE: the idle check is still a placeholder, so a running instance
            # is always stopped. Add CPU/GPU usage checks here (for example via
            # the Cloud Monitoring API) and compare them with max_idle_minutes.

            print(f"⏹️ Stopping instance {instance_name} to save money")
            stop_request = compute_v1.StopInstanceRequest(
                project=self.project_id,
                zone=self.zone,
                instance=instance_name
            )
            operation = self.compute_client.stop(request=stop_request)
            print(f"✅ Operation started: {operation.name}")

        except Exception as e:
            print(f"❌ Error stopping instance: {e}")

    def start_instance_if_needed(self, instance_name: str):
        """Start the instance if it is stopped."""
        try:
            request = compute_v1.GetInstanceRequest(
                project=self.project_id,
                zone=self.zone,
                instance=instance_name
            )
            instance = self.compute_client.get(request=request)

            if instance.status == "RUNNING":
                print(f"✅ Instance {instance_name} is already running")
                return

            print(f"🟢 Starting instance {instance_name}")
            start_request = compute_v1.StartInstanceRequest(
                project=self.project_id,
                zone=self.zone,
                instance=instance_name
            )
            operation = self.compute_client.start(request=start_request)
            print(f"✅ Operation started: {operation.name}")

        except Exception as e:
            print(f"❌ Error starting instance: {e}")

    def schedule_auto_stop(self, instance_name: str, stop_time: str = "22:00"):
        """Build a cron entry that stops the instance daily at stop_time (HH:MM)."""
        print(f"📅 Scheduling automatic stop of {instance_name} at {stop_time}")

        # Derive the cron fields from stop_time instead of hard-coding 22:00
        hour, minute = (int(part) for part in stop_time.split(":"))

        # Example for crontab (Linux); on Windows, use Task Scheduler instead
        cron_command = f"{minute} {hour} * * * gcloud compute instances stop {instance_name} --zone={self.zone} --project={self.project_id}"
        print(f"🔧 Cron command: {cron_command}")
        return cron_command

# Usage example
manager = InstanceManager(PROJECT_ID, ZONE)

# Simulate management
instance_name = "ollama-deepseek-vm"
print("🚀 Instance manager started")
print("💡 Use this code in a scheduled script for automatic savings")
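
`stop_instance_if_idle` leaves the idleness test as a placeholder. One way to fill it is to decide from recent CPU samples. The sketch below assumes the samples were already fetched (for example from Cloud Monitoring) as `(timestamp, cpu_percent)` pairs; `is_idle` and the 5% threshold are illustrative assumptions, not part of any Google API.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple


def is_idle(samples: Iterable[Tuple[datetime, float]], now: datetime,
            max_idle_minutes: int = 30, cpu_threshold: float = 5.0) -> bool:
    """True if every sample in the last max_idle_minutes is below the threshold.

    Returns False when the window has no samples, so that missing
    metrics never trigger a shutdown.
    """
    window_start = now - timedelta(minutes=max_idle_minutes)
    recent = [cpu for ts, cpu in samples if window_start <= ts <= now]
    return bool(recent) and max(recent) < cpu_threshold


now = datetime(2024, 1, 1, 12, 0)
quiet = [(now - timedelta(minutes=m), 1.5) for m in range(0, 30, 5)]
busy = quiet + [(now - timedelta(minutes=10), 42.0)]

print(is_idle(quiet, now))  # all samples below 5%
print(is_idle(busy, now))   # one spike inside the window
print(is_idle([], now))     # no data: do not stop
```

Treating "no data" as "not idle" is a deliberate choice here: a monitoring outage should not shut down a busy instance.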

## 📊 4. Cost Monitoring and Alerts

Set up alerts so you don't burn through the R$ 1,900 budget!

In [None]:
class CostMonitor:
    """Cost monitor for real-time alerts."""

    def __init__(self, project_id: str):
        self.project_id = project_id
        self.billing_client = billing_v1.CloudBillingClient()

    def get_current_month_cost(self):
        """Return the cost for the current month."""
        try:
            # This is a simplified example: the value below is hard-coded.
            # In practice, replace it with real billing data.

            from datetime import datetime
            now = datetime.now()
            month_start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)

            # Simulated cost (replace with real billing data)
            estimated_cost_usd = 85.50  # Example
            estimated_cost_brl = estimated_cost_usd * 5.2  # Approximate conversion

            return {
                "cost_usd": estimated_cost_usd,
                "cost_brl": estimated_cost_brl,
                "period": f"{month_start.strftime('%Y-%m')}"
            }
        except Exception as e:
            print(f"❌ Error fetching costs: {e}")
            return None

    def check_budget_alert(self, max_budget_brl: float = 300.0):
        """Check how close spending is to the budget and return an alert level."""
        cost_data = self.get_current_month_cost()

        if not cost_data:
            return

        current_cost = cost_data["cost_brl"]
        percentage = (current_cost / max_budget_brl) * 100

        print(f"💰 Current cost: R$ {current_cost:.2f}")
        print(f"🎯 Budget: R$ {max_budget_brl:.2f}")
        print(f"📊 Used: {percentage:.1f}%")

        if percentage >= 90:
            print("🚨 CRITICAL ALERT: 90%+ of the budget used!")
            return "CRITICAL"
        elif percentage >= 70:
            print("⚠️ ALERT: 70%+ of the budget used!")
            return "WARNING"
        elif percentage >= 50:
            print("📢 NOTICE: 50%+ of the budget used!")
            return "INFO"
        else:
            print("✅ Budget on track")
            return "OK"

    def create_budget_alert_webhook(self, webhook_url: str, budget_brl: float):
        """Return a function that posts budget alerts to a webhook."""

        import requests

        def send_alert(message: str, level: str):
            payload = {
                "text": f"🚨 GCP Budget Alert: {message}",
                "level": level,
                "project": self.project_id,
                "timestamp": datetime.now().isoformat()
            }

            try:
                response = requests.post(webhook_url, json=payload, timeout=5)
                print(f"📨 Alert sent: {response.status_code}")
            except Exception as e:
                print(f"❌ Error sending alert: {e}")

        return send_alert

# Cost monitoring
monitor = CostMonitor(PROJECT_ID)

# Check the current budget
alert_level = monitor.check_budget_alert(max_budget_brl=300.0)

# Savings tips based on the alert level
if alert_level in ["WARNING", "CRITICAL"]:
    print("\n🔧 IMMEDIATE SAVINGS TIPS:")
    print("1. Pause unused instances")
    print("2. Use only DeepSeek 1.3B for testing")
    print("3. Enable Spot mode on all instances")
    print("4. Reduce max_concurrent_requests to 1")
    print("5. Configure auto-stop at 18:00")

# URL for the cost dashboard
print(f"\n🌐 Cost dashboard: https://console.cloud.google.com/billing/projects/{PROJECT_ID}")
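
Threshold alerts react to money that has already been spent. A simple linear projection can warn earlier. This is a sketch under the assumption that spending is roughly uniform across the month; `project_month_end` and the sample figures are hypothetical.

```python
import calendar
from datetime import date


def project_month_end(cost_to_date: float, today: date) -> float:
    """Linearly extrapolate month-to-date spend to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return cost_to_date / today.day * days_in_month


# Example: R$ 120 spent by June 10 (June has 30 days)
projected = project_month_end(120.0, date(2024, 6, 10))
print(f"Projected month-end cost: R$ {projected:.2f}")
if projected > 300.0:
    print("On track to exceed the R$ 300 goal")
```

Here only 40% of the budget is used on day 10, which would pass the threshold checks, yet the projection of R$ 360 already exceeds the goal.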

## 💾 5. Local Storage and Model Caching

Avoid repeated downloads and save bandwidth!

In [None]:
import os
import shutil
import subprocess
from pathlib import Path

class ModelCacheManager:
    """Local cache manager for models."""

    def __init__(self, cache_dir: str = "/opt/models"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def is_model_cached(self, model_name: str) -> bool:
        """Check whether the model is already in the local cache."""
        return self.get_model_size(model_name) > 0

    def get_model_size(self, model_name: str) -> int:
        """Return the model size in bytes (works for files and directories)."""
        model_path = self.cache_dir / model_name
        if model_path.is_dir():
            return sum(f.stat().st_size for f in model_path.rglob("*") if f.is_file())
        if model_path.exists():
            return model_path.stat().st_size
        return 0

    def cache_model_locally(self, model_name: str, source_url: str = None):
        """Download the model and store it locally."""
        model_path = self.cache_dir / model_name

        if self.is_model_cached(model_name):
            size_mb = self.get_model_size(model_name) / (1024 * 1024)
            print(f"✅ Model {model_name} is already cached ({size_mb:.1f} MB)")
            return str(model_path)

        print(f"📥 Downloading {model_name} to the local cache...")

        # For Ollama, use `ollama pull`. Note that Ollama stores pulled models
        # in its own directory (~/.ollama by default), not in cache_dir.
        if "deepseek" in model_name.lower():
            # Argument list instead of a shell string avoids shell injection
            subprocess.run(["ollama", "pull", model_name], check=True)
            print(f"✅ {model_name} stored in the Ollama cache")

        return str(model_path)

    def cleanup_old_models(self, keep_latest: int = 2):
        """Remove old models to save disk space."""
        model_files = list(self.cache_dir.glob("*"))
        model_files.sort(key=lambda x: x.stat().st_mtime, reverse=True)

        if len(model_files) > keep_latest:
            for old_model in model_files[keep_latest:]:
                print(f"🗑️ Removing old model: {old_model.name}")
                if old_model.is_file():
                    old_model.unlink()
                elif old_model.is_dir():
                    shutil.rmtree(old_model)

    def get_cache_stats(self):
        """Cache statistics."""
        total_size = sum(f.stat().st_size for f in self.cache_dir.rglob('*') if f.is_file())
        model_count = len(list(self.cache_dir.iterdir()))

        return {
            "total_size_gb": total_size / (1024**3),
            "model_count": model_count,
            "cache_dir": str(self.cache_dir)
        }

# Cache initialization script
def setup_persistent_storage():
    """Set up persistent storage for models."""

    # Create the initialization script
    init_script = '''#!/bin/bash
# Initialization script for the model cache
set -e

echo "🚀 Setting up persistent model cache..."

# Create cache directories
mkdir -p /opt/models
mkdir -p /root/.ollama

# If there is a backup in Google Cloud Storage, restore it
if gsutil ls gs://seu-bucket-models/backup/ 2>/dev/null; then
    echo "📦 Restoring models from backup..."
    gsutil -m cp -r gs://seu-bucket-models/backup/* /root/.ollama/
fi

# Preload essential models
echo "📥 Preloading DeepSeek 1.3B..."
ollama pull deepseek-coder:1.3b

# Periodic backup (optional); rsync copies only changed files and keeps
# the bucket layout identical to /root/.ollama so the restore above works
cat > /opt/backup_models.sh << 'EOF'
#!/bin/bash
echo "💾 Backing up models..."
gsutil -m rsync -r /root/.ollama gs://seu-bucket-models/backup
echo "✅ Backup complete"
EOF

chmod +x /opt/backup_models.sh

echo "✅ Cache configured successfully!"
'''

    with open("/tmp/init_models.sh", "w") as f:
        f.write(init_script)

    print("📝 Initialization script created at /tmp/init_models.sh")
    print("🔧 Run with: chmod +x /tmp/init_models.sh && /tmp/init_models.sh")

# Usage example
cache_manager = ModelCacheManager()

# Check the models in the cache
stats = cache_manager.get_cache_stats()
print(f"📊 Current cache: {stats['model_count']} models, {stats['total_size_gb']:.2f} GB")

# Create the initialization script
setup_persistent_storage()

print("\n💡 SAVINGS TIPS:")
print("1. Use a PVC (Persistent Volume Claim) for models")
print("2. Configure incremental backup to GCS")
print("3. Limit the cache to the 2-3 most used models")
print("4. Use compression to reduce size")
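
`cleanup_old_models` keeps a fixed number of entries. If disk size is the real constraint, an alternative is to evict the least recently used models until the cache fits a byte budget. A minimal sketch follows, with the `CachedModel` record, the `select_evictions` helper and the model sizes as illustrative assumptions; it only decides what to remove and touches no files.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CachedModel:
    name: str
    size_bytes: int
    last_used: float  # Unix timestamp of the last access


def select_evictions(models: List[CachedModel], max_bytes: int) -> List[str]:
    """Return names to delete, least recently used first, until the rest fits."""
    total = sum(m.size_bytes for m in models)
    evicted = []
    for model in sorted(models, key=lambda m: m.last_used):
        if total <= max_bytes:
            break
        evicted.append(model.name)
        total -= model.size_bytes
    return evicted


GB = 1024 ** 3
cache = [
    CachedModel("deepseek-coder:1.3b", 1 * GB, last_used=300.0),
    CachedModel("deepseek-coder:6.7b", 4 * GB, last_used=100.0),
    CachedModel("deepseek-llm:7b", 4 * GB, last_used=200.0),
]
to_remove = select_evictions(cache, max_bytes=6 * GB)
print(to_remove)
```

Separating the decision from the deletion makes the policy easy to test before pointing it at a real cache directory.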

## 🚀 6. Optimizing with GGML/GGUF (CPU Performance)

Use the GGUF format (the successor to GGML in llama.cpp) to run on CPU with optimized performance!

In [None]:
# Configuration for an optimized CPU deployment with GGUF
def create_cpu_optimized_deployment():
    """Create a CPU-optimized deployment based on llama.cpp and GGUF."""

    dockerfile_content = '''
FROM ubuntu:22.04

# Install dependencies
RUN apt-get update && apt-get install -y \\
    build-essential \\
    cmake \\
    git \\
    python3 \\
    python3-pip \\
    curl \\
    libopenblas-dev \\
    pkg-config \\
    && rm -rf /var/lib/apt/lists/*

# Get llama.cpp (GGUF runtime)
WORKDIR /opt
RUN git clone https://github.com/ggerganov/llama.cpp.git
WORKDIR /opt/llama.cpp

# Build with OpenBLAS and AVX2. Recent llama.cpp checkouts build with CMake;
# the older `make LLAMA_OPENBLAS=1 LLAMA_AVX2=1 -j4` only works on old checkouts.
RUN cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS \\
        -DGGML_NATIVE=OFF -DGGML_AVX2=ON -DLLAMA_CURL=OFF \\
    && cmake --build build --config Release -j4

# Install Python bindings
RUN pip3 install llama-cpp-python

# Server script
COPY server.py /opt/server.py

# Environment variables for tuning
ENV OMP_NUM_THREADS=4
ENV LLAMA_CPP_PARALLEL=1
ENV LLAMA_CPP_BATCH_SIZE=512

EXPOSE 8000
CMD ["python3", "/opt/server.py"]
'''
    
    server_code = '''
from llama_cpp import Llama
from fastapi import FastAPI, HTTPException
import asyncio
import uvicorn

app = FastAPI()

# Load the optimized GGUF model
llm = Llama(
    model_path="/opt/models/deepseek-coder-1.3b.gguf",
    n_ctx=2048,        # Smaller context = less memory
    n_batch=512,       # Tuned batch size
    n_threads=4,       # CPU threads
    verbose=False,
    use_mlock=True,    # Lock the model in memory for performance
    use_mmap=True,     # Memory mapping
    n_gpu_layers=0     # CPU only: no layers offloaded to a GPU
)

@app.post("/v1/chat/completions")
async def chat_completions(request: dict):
    try:
        # Only the last message is used as the prompt
        messages = request.get("messages", [])
        prompt = messages[-1]["content"] if messages else ""

        # Run inference
        response = llm(
            prompt,
            max_tokens=request.get("max_tokens", 256),
            temperature=request.get("temperature", 0.7),
            top_p=0.9,
            stop=["\\n\\n", "User:", "Assistant:"]
        )

        return {
            "choices": [{
                "message": {
                    "role": "assistant",
                    "content": response["choices"][0]["text"]
                }
            }],
            "model": "deepseek-gguf-cpu",
            "backend": "llama.cpp-cpu"
        }

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
'''

    return dockerfile_content, server_code

# CPU-optimized machine configuration
def get_cpu_optimized_machine_spec():
    """Machine specification optimized for CPU inference."""

    return {
        "machine_type": "c2-standard-8",  # Compute-optimized
        "boot_disk_type": "pd-ssd",
        "boot_disk_size_gb": 50,
        "preemptible": True,  # Up to 70% savings
        "metadata": {
            "items": [
                {
                    "key": "startup-script",
                    "value": '''#!/bin/bash
                    # CPU tuning script
                    echo 'performance' | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
                    echo 1 > /proc/sys/vm/swappiness
                    echo 3 > /proc/sys/vm/drop_caches
                    '''
                }
            ]
        }
    }

# CPU vs GPU cost comparison (illustrative prices, not a measured benchmark)
def benchmark_cpu_vs_gpu():
    """Compare CPU and GPU costs."""

    costs = {
        "cpu_optimized": {
            "machine": "c2-standard-8 (preemptible)",
            "cost_hour_usd": 0.096,  # Preemptible
            "cost_month_brl": 0.096 * 24 * 30 * 5.2,  # ~360 BRL
            "performance": "Good for inference",
            "pros": ["Much cheaper", "No GPU limits", "Optimized GGUF"]
        },
        "gpu_optimized": {
            "machine": "n1-standard-4 + T4 (preemptible)",
            "cost_hour_usd": 0.35,   # Preemptible
            "cost_month_brl": 0.35 * 24 * 30 * 5.2,  # ~1300 BRL
            "performance": "Excellent for inference",
            "pros": ["Faster", "Better for large models"]
        }
    }

    print("💰 COST COMPARISON:")
    print("="*50)

    for config, details in costs.items():
        print(f"\n🖥️ {config.upper()}:")
        print(f"   Machine: {details['machine']}")
        print(f"   Cost/hour: ${details['cost_hour_usd']}")
        print(f"   Cost/month: R$ {details['cost_month_brl']:.0f}")
        print(f"   Performance: {details['performance']}")
        print(f"   Advantages: {', '.join(details['pros'])}")

    print("\n💡 RECOMMENDATION:")
    print("   For maximum savings: use CPU + GGUF")
    print("   For performance: use GPU only when necessary")

# Run the comparison
dockerfile, server = create_cpu_optimized_deployment()
machine_spec = get_cpu_optimized_machine_spec()
benchmark_cpu_vs_gpu()

print("\n🔧 Generated artifacts:")
print("- CPU-optimized Dockerfile")
print("- FastAPI server with llama.cpp")
print("- Economical machine specification")
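
When sizing a CPU machine, a rough memory estimate for the quantized weights helps. A common back-of-the-envelope rule is parameters × bits per weight ÷ 8. The sketch below applies it; the bits-per-weight values are approximations assumed for illustration (real GGUF files mix quantization types and add metadata), and `estimate_weights_gb` is a hypothetical helper.

```python
def estimate_weights_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# A 1.3B-parameter model at three assumed precisions
for label, bits in [("16-bit", 16.0), ("8-bit", 8.0), ("~4.5-bit", 4.5)]:
    size_gb = estimate_weights_gb(1.3e9, bits)
    print(f"1.3B params @ {label}: ~{size_gb:.2f} GB")
```

The context (KV cache) and runtime overhead come on top of this figure, so treat it as a lower bound when picking a machine type.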

## ⚡ 7. Concurrency Control in FastAPI

Limit simultaneous requests to avoid saturation and save resources.

In [None]:
import asyncio
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException, BackgroundTasks
from fastapi.responses import JSONResponse
import time

class ResourceManager:
    """Resource manager for concurrency control."""

    def __init__(self, max_concurrent: int = 1, max_queue: int = 10):
        self.max_concurrent = max_concurrent
        self.max_queue = max_queue
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.queue_semaphore = asyncio.Semaphore(max_queue)
        self.active_requests = 0
        self.queued_requests = 0
        self.total_requests = 0

    @asynccontextmanager
    async def acquire_resource(self):
        """Context manager that waits in the queue, then holds a processing slot."""

        # Take a queue slot; it is held only while waiting
        async with self.queue_semaphore:
            self.queued_requests += 1
            try:
                # Wait for a free processing slot
                await self.semaphore.acquire()
            finally:
                self.queued_requests -= 1

        # Counters change only after the slot is acquired, so a request
        # cancelled while waiting never decrements active_requests
        self.active_requests += 1
        self.total_requests += 1
        print(f"🚀 Processing request (active: {self.active_requests})")
        try:
            yield
        finally:
            self.active_requests -= 1
            self.semaphore.release()
            print(f"✅ Request finished (active: {self.active_requests})")

    def get_stats(self):
        """Manager statistics."""
        return {
            "active": self.active_requests,
            "queued": self.queued_requests,
            "total_processed": self.total_requests,
            "available_slots": self.max_concurrent - self.active_requests
        }

# Configuration tuned for savings
resource_manager = ResourceManager(max_concurrent=1, max_queue=5)  # At most 1 at a time

# FastAPI with resource control
app_optimized = FastAPI(title="Economical DeepSeek")

@app_optimized.middleware("http")
async def resource_middleware(request, call_next):
    """Middleware for resource control."""

    start_time = time.time()

    # Reject early when the queue is already full
    stats = resource_manager.get_stats()
    if stats["queued"] >= resource_manager.max_queue:
        # Middleware must return a response; a returned HTTPException is not one
        return JSONResponse(
            status_code=429,
            content={"detail": "Server busy. Try again in a few seconds."}
        )

    response = await call_next(request)

    process_time = time.time() - start_time
    response.headers["X-Process-Time"] = str(process_time)
    response.headers["X-Active-Requests"] = str(stats["active"])

    return response

@app_optimized.post("/v1/chat/completions")
async def economical_chat(request: dict, background_tasks: BackgroundTasks):
    """Endpoint with strict resource control."""

    async with resource_manager.acquire_resource():
        # Extract the prompt from the last message
        messages = request.get("messages", [])
        prompt = messages[-1]["content"] if messages else ""

        # Cap tokens to save resources
        max_tokens = min(request.get("max_tokens", 100), 200)  # At most 200 tokens

        # The real model call (DeepSeek, Ollama, etc.) would go here
        await asyncio.sleep(2)  # Simulate processing

        # Placeholder response; token counts are rough word-based estimates
        response = {
            "id": f"chatcmpl-{int(time.time())}",
            "object": "chat.completion",
            "choices": [{
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": f"Economical response to: {prompt[:50]}..."
                },
                "finish_reason": "stop"
            }],
            "usage": {
                "prompt_tokens": len(prompt.split()),
                "completion_tokens": max_tokens,
                "total_tokens": len(prompt.split()) + max_tokens
            },
            "model": "deepseek-economico",
            "backend": "resource-controlled"
        }

        # Background task for statistics
        background_tasks.add_task(log_usage, resource_manager.get_stats())

        return response

@app_optimized.get("/health")
async def health_check():
    """Health check with resource information."""
    stats = resource_manager.get_stats()

    return {
        "status": "healthy",
        "resources": stats,
        "memory_optimized": True,
        "cost_mode": "economy"
    }

async def log_usage(stats: dict):
    """Log usage for monitoring."""
    print(f"📊 Stats: {stats}")
# Economical server configuration
def create_economical_server_config():
    """Economical server configuration."""

    return {
        "host": "0.0.0.0",
        "port": 8000,
        "workers": 1,              # Single worker to save resources
        "worker_class": "uvicorn.workers.UvicornWorker",
        "worker_connections": 10,   # Few connections
        "max_requests": 100,       # Restart the worker every 100 requests
        "timeout": 30,             # Low timeout
        "keepalive": 2,            # Low keep-alive
        "preload": True,           # Preload to save memory
    }

# Dockerfile tuned for low cost
economical_dockerfile = '''
FROM python:3.9-slim

# Install only the essentials
RUN pip install fastapi uvicorn[standard] --no-cache-dir

# Copy only the required files
COPY server.py /app/server.py
WORKDIR /app

# Economy settings
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENV MAX_CONCURRENT=1
ENV MAX_QUEUE=5

# Resource limits
ENV MALLOC_ARENA_MAX=2
ENV PYTHONMALLOC=malloc

EXPOSE 8000

# Optimized command
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]
'''

print("⚙️ ECONOMICAL CONFIGURATION CREATED:")
print("- At most 1 simultaneous request")
print("- Queue limited to 5 requests")
print("- Low timeout (30s)")
print("- Capped tokens (max 200)")
print("- Single worker to save resources")

server_config = create_economical_server_config()
print(f"\n🔧 Server config: {server_config}")

print("\n💡 ESTIMATED SAVINGS:")
print("- CPU: ~60% less usage")
print("- Memory: ~40% less usage")
print("- Latency: Controlled")
print("- Cost: ~50% reduction")
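
The pattern used in this section, one processing slot plus a bounded waiting queue, can be tried without FastAPI. Below is a minimal standard-library sketch; `BoundedWorker` and its counters are hypothetical names, and `asyncio.sleep` stands in for model inference.

```python
import asyncio


class BoundedWorker:
    """Runs at most max_concurrent jobs; rejects when the wait queue is full."""

    def __init__(self, max_concurrent: int = 1, max_queue: int = 2):
        self._slots = asyncio.Semaphore(max_concurrent)
        self.max_queue = max_queue
        self.waiting = 0
        self.active = 0
        self.peak_active = 0

    async def submit(self, job_id: int, work_seconds: float = 0.01) -> str:
        if self.waiting >= self.max_queue:
            return f"rejected-{job_id}"  # a web server would answer HTTP 429 here
        self.waiting += 1
        try:
            await self._slots.acquire()
        finally:
            self.waiting -= 1
        self.active += 1
        self.peak_active = max(self.peak_active, self.active)
        try:
            await asyncio.sleep(work_seconds)  # stand-in for model inference
            return f"ok-{job_id}"
        finally:
            self.active -= 1
            self._slots.release()


async def main():
    worker = BoundedWorker(max_concurrent=1, max_queue=2)
    # Five jobs arrive at once: one runs, two wait, two are rejected
    results = await asyncio.gather(*(worker.submit(i) for i in range(5)))
    return results, worker.peak_active


results, peak = asyncio.run(main())
print(results, "peak concurrency:", peak)
```

Inside a notebook, where an event loop is already running, call `await main()` instead of `asyncio.run(main())`.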

## 📊 8. Real-Time Resource Monitoring

Monitor CPU, GPU, and memory to keep optimizing costs.
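
A bounded history and a rolling average are the core of the recommendations produced in this section. The sketch below shows that idea in isolation with `collections.deque`; the `RollingUsage` class and the 30% threshold are illustrative assumptions rather than tuned values.

```python
from collections import deque


class RollingUsage:
    """Keeps the last maxlen CPU readings and averages the most recent ones."""

    def __init__(self, maxlen: int = 100):
        self.samples = deque(maxlen=maxlen)  # old readings drop off automatically

    def add(self, cpu_percent: float) -> None:
        self.samples.append(cpu_percent)

    def recent_average(self, n: int = 10) -> float:
        recent = list(self.samples)[-n:]
        if not recent:
            raise ValueError("no samples recorded yet")
        return sum(recent) / len(recent)


usage = RollingUsage(maxlen=100)
for reading in [12.0, 18.0, 9.0, 25.0]:
    usage.add(reading)

avg = usage.recent_average()
print(f"Average CPU: {avg:.1f}%")
if avg < 30:
    print("CPU underused: a smaller instance may be enough")
```

A `deque` with `maxlen` discards the oldest reading in constant time, which avoids the manual trimming a plain list needs.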

In [None]:
# Requires: pip install psutil gputil
import psutil
import GPUtil
import threading
import json
import time
from datetime import datetime

class ResourceMonitor:
    """Real-time resource monitor."""

    def __init__(self, alert_threshold: dict = None):
        self.alert_threshold = alert_threshold or {
            "cpu_percent": 80,
            "memory_percent": 85,
            "gpu_percent": 90
        }
        self.monitoring = False
        self.stats_history = []

    def get_current_usage(self):
        """Return current resource usage."""

        # CPU and memory
        cpu_percent = psutil.cpu_percent(interval=1)
        memory = psutil.virtual_memory()

        stats = {
            "timestamp": datetime.now().isoformat(),
            "cpu": {
                "percent": cpu_percent,
                "count": psutil.cpu_count(),
                "freq_mhz": psutil.cpu_freq().current if psutil.cpu_freq() else 0
            },
            "memory": {
                "percent": memory.percent,
                "used_gb": memory.used / (1024**3),
                "available_gb": memory.available / (1024**3),
                "total_gb": memory.total / (1024**3)
            },
            "disk": {
                "percent": psutil.disk_usage('/').percent,
                "free_gb": psutil.disk_usage('/').free / (1024**3)
            }
        }

        # GPU (if available)
        try:
            gpus = GPUtil.getGPUs()
            if gpus:
                gpu = gpus[0]  # First GPU
                stats["gpu"] = {
                    "percent": gpu.load * 100,
                    "memory_used_gb": gpu.memoryUsed / 1024,
                    "memory_total_gb": gpu.memoryTotal / 1024,
                    "temperature": gpu.temperature,
                    "name": gpu.name
                }
        except Exception:
            # No GPU or no driver available: report CPU-only stats
            stats["gpu"] = None

        return stats
    
    def check_alerts(self, stats: dict):
        """Check resource alerts."""
        alerts = []

        # CPU alert
        if stats["cpu"]["percent"] > self.alert_threshold["cpu_percent"]:
            alerts.append({
                "type": "CPU_HIGH",
                "message": f"CPU at {stats['cpu']['percent']:.1f}%",
                "action": "Consider reducing concurrency"
            })

        # Memory alert
        if stats["memory"]["percent"] > self.alert_threshold["memory_percent"]:
            alerts.append({
                "type": "MEMORY_HIGH",
                "message": f"Memory at {stats['memory']['percent']:.1f}%",
                "action": "Clear the cache or reduce the batch size"
            })

        # GPU alert
        if stats.get("gpu") and stats["gpu"]["percent"] > self.alert_threshold["gpu_percent"]:
            alerts.append({
                "type": "GPU_HIGH",
                "message": f"GPU at {stats['gpu']['percent']:.1f}%",
                "action": "Tune the model parameters"
            })

        return alerts

    def start_monitoring(self, interval: int = 10):
        """Start continuous monitoring in a background thread."""
        self.monitoring = True

        def monitor_loop():
            while self.monitoring:
                try:
                    stats = self.get_current_usage()
                    alerts = self.check_alerts(stats)

                    # Add to the history
                    self.stats_history.append(stats)

                    # Keep only the last 100 measurements
                    if len(self.stats_history) > 100:
                        self.stats_history.pop(0)

                    # Show alerts
                    if alerts:
                        print("🚨 RESOURCE ALERTS:")
                        for alert in alerts:
                            print(f"   {alert['type']}: {alert['message']}")
                            print(f"   Action: {alert['action']}")

                    time.sleep(interval)

                except Exception as e:
                    print(f"❌ Monitoring error: {e}")
                    time.sleep(interval)

        # Start the monitoring thread
        monitor_thread = threading.Thread(target=monitor_loop, daemon=True)
        monitor_thread.start()

        print(f"📊 Monitoring started (interval: {interval}s)")

    def stop_monitoring(self):
        """Stop monitoring."""
        self.monitoring = False
        print("⏹️ Monitoring stopped")
    
    def get_optimization_recommendations(self):
        """Return recommendations based on the usage history."""
        if not self.stats_history:
            return ["No historical data available"]
        
        # Analyze the 10 most recent records
        recent_stats = self.stats_history[-10:]
        
        avg_cpu = sum(s["cpu"]["percent"] for s in recent_stats) / len(recent_stats)
        avg_memory = sum(s["memory"]["percent"] for s in recent_stats) / len(recent_stats)
        
        recommendations = []
        
        # Usage-based recommendations
        if avg_cpu < 30:
            recommendations.append("💡 CPU underutilized - consider a smaller instance")
            recommendations.append("💰 Potential saving: 30-50% of the cost")
        
        if avg_memory < 50:
            recommendations.append("💡 Memory underutilized - reduce the configuration")
            recommendations.append("💰 Potential saving: 20-40% of the cost")
        
        if avg_cpu > 80:
            recommendations.append("⚠️ CPU overloaded - limit concurrency")
            recommendations.append("🔧 Set max_concurrent=1")
        
        if avg_memory > 85:
            recommendations.append("⚠️ High memory usage - implement cleanup")
            recommendations.append("🔧 Enable aggressive garbage collection")
        
        return recommendations

# Configure cost-conscious monitoring
monitor = ResourceMonitor(alert_threshold={
    "cpu_percent": 70,    # Alert earlier to save costs
    "memory_percent": 80,
    "gpu_percent": 85
})

# Current statistics
current_stats = monitor.get_current_usage()
print("📊 CURRENT STATISTICS:")
print(f"CPU: {current_stats['cpu']['percent']:.1f}%")
print(f"Memory: {current_stats['memory']['percent']:.1f}% ({current_stats['memory']['used_gb']:.1f}GB)")
print(f"Disk: {current_stats['disk']['percent']:.1f}%")

if current_stats.get("gpu"):
    print(f"GPU: {current_stats['gpu']['percent']:.1f}% ({current_stats['gpu']['name']})")
else:
    print("GPU: not detected (CPU mode)")

# Automatic optimization script
def create_auto_optimizer():
    """Return the text of an automatic optimization shell script."""
    
    optimizer_script = '''#!/bin/bash
# Automatic optimization script for cost savings.
# Writing to /sys and /proc below requires root.

# Tune CPU settings
optimize_cpu() {
    echo "🔧 Optimizing CPU..."
    
    # Set the powersave governor
    echo 'powersave' | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    
    # Reduce threads when CPU usage is low
    # (the export only affects this script and the processes it launches)
    CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | sed 's/%us,//')
    if (( $(echo "$CPU_USAGE < 30" | bc -l) )); then
        echo "💡 CPU underutilized, reducing threads"
        export OMP_NUM_THREADS=2
    fi
}

# Tune memory settings
optimize_memory() {
    echo "🔧 Optimizing memory..."
    
    # Drop the page cache, dentries and inodes
    sync && echo 3 > /proc/sys/vm/drop_caches
    
    # Lower swappiness so the kernel prefers RAM over swap
    echo 10 > /proc/sys/vm/swappiness
    
    # Python garbage collection must run inside the serving process
    # (gc.collect()); a separate python3 invocation would not free its memory.
}

# Run the optimizations
optimize_cpu
optimize_memory

echo "✅ Optimization finished"
'''
    
    return optimizer_script

optimizer = create_auto_optimizer()
print("\n🔧 Optimization script created")
print("💡 Run it periodically to keep costs down")

# Start monitoring (uncomment to use)
# monitor.start_monitoring(interval=30)
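
The cell above only builds the optimizer script as a string; nothing writes it to disk. The following is a minimal sketch of saving such a script and marking it executable. The helper name `save_script` and the file name `optimize.sh` are illustrative assumptions and do not come from the notebook.

```python
import stat
import tempfile
from pathlib import Path


def save_script(script_text: str, path: Path) -> Path:
    """Write a shell script to `path` and mark it executable for the owner."""
    path.write_text(script_text, encoding="utf-8")
    path.chmod(path.stat().st_mode | stat.S_IXUSR)
    return path


# Stand-in script for the demo; in the notebook you would pass `optimizer`.
demo_script = "#!/bin/bash\necho 'optimization finished'\n"
demo_path = save_script(demo_script, Path(tempfile.mkdtemp()) / "optimize.sh")
print(f"Script written to {demo_path}")
```

Because the script writes to `/proc` and `/sys`, it would have to be run as root to take effect.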

## 🎯 Summary and Action Plan

### 💰 **Savings Goal: Reduce costs to < R$ 300/month**

| Strategy | Estimated savings | Implementation time | Priority |
|----------|-------------------|---------------------|----------|
| Spot Instances | Up to 70% | Immediate | 🔥 High |
| CPU + GGML | 60% | 1-2 days | 🔥 High |
| Auto-Stop | 40% | Immediate | 🟡 Medium |
| Concurrency=1 | 50% | Immediate | 🟡 Medium |
| Local Cache | 20% | 1 day | 🟢 Low |
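
The percentages in the table cannot simply be added together, since the sum would exceed 100%. If each strategy is assumed to act independently on whatever cost remains after the others, the combined saving is 1 minus the product of the (1 - saving) terms. The sketch below applies that assumption to two rows of the table; the baseline of R$ 1,000/month is a made-up figure for illustration, and real strategies may overlap, so treat the result as a rough estimate.

```python
from math import prod


def combined_saving(savings):
    """Combined fractional saving, assuming each strategy acts
    independently on the cost left over after the others."""
    return 1 - prod(1 - s for s in savings)


baseline = 1000.0             # hypothetical monthly cost in R$
spot, auto_stop = 0.70, 0.40  # Spot Instances and Auto-Stop rows of the table

total = combined_saving([spot, auto_stop])
remaining = baseline * (1 - total)
print(f"Combined saving: {total:.0%}, remaining cost: R$ {remaining:.2f}")
```

With these inputs the combined saving is 82%, not 110%, which leaves R$ 180 of the hypothetical R$ 1,000.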

### 📋 **Implementation Checklist:**

#### ✅ **Implement Immediately:**
- [ ] Enable Spot instances on all VMs
- [ ] Configure auto-stop at 22:00
- [ ] Limit concurrency to 1 request
- [ ] Configure budget alerts
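
The concurrency limit in the checklist above can be enforced on the client side with a semaphore. This is a minimal sketch under stated assumptions: `run_limited` and `fake_inference` are hypothetical names, and the sleep stands in for a real model call.

```python
import asyncio


async def run_limited(jobs, max_concurrent=1):
    """Run coroutine factories with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)  # record the highest concurrency seen
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_inference(i):
    """Stand-in for a model call; sleeps briefly and returns its input."""
    await asyncio.sleep(0.01)
    return i


# In a plain script; inside Jupyter use `await run_limited(...)` instead,
# because an event loop is already running there.
results, peak = asyncio.run(
    run_limited([lambda i=i: fake_inference(i) for i in range(5)])
)
print(f"results={results}, peak concurrency={peak}")
```

The semaphore queues the extra requests rather than rejecting them, so throughput drops but no request is lost.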

#### 🔧 **Implement in 1-2 Days:**
- [ ] Migrate to CPU + GGML
- [ ] Configure a persistent model cache
- [ ] Implement resource monitoring
- [ ] Optimize the Dockerfile for cost

#### 📊 **Monitor Continuously:**
- [ ] Daily cost dashboard
- [ ] CPU/memory alerts
- [ ] Performance vs. cost
- [ ] Usage patterns
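
One way to make the performance-versus-cost item concrete is to track the cost per 1,000 generated tokens. The sketch below assumes the instance is fully busy for the whole hour; the hourly prices and throughput figures are hypothetical and are not measurements from this notebook.

```python
def cost_per_1k_tokens(hourly_cost, tokens_per_second):
    """Cost of generating 1,000 tokens on an instance that is fully busy."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1000


# Hypothetical figures, for illustration only.
cpu = cost_per_1k_tokens(hourly_cost=0.50, tokens_per_second=5)
gpu = cost_per_1k_tokens(hourly_cost=3.00, tokens_per_second=60)
print(f"CPU: R$ {cpu:.4f} per 1k tokens")
print(f"GPU: R$ {gpu:.4f} per 1k tokens")
```

In this made-up comparison the pricier instance is cheaper per token, which is why the hourly price alone is not enough to judge a configuration.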

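### 🚨 **Emergency Commands (if costs are high):**

```bash
# Stop all running instances in the zone
gcloud compute instances stop \
  $(gcloud compute instances list --zones=us-central1-a --filter="status=RUNNING" --format="value(name)") \
  --zone=us-central1-a

# List configured budgets (replace BILLING_ACCOUNT_ID with your billing account)
gcloud billing budgets list --billing-account=BILLING_ACCOUNT_ID

# Scale the ollama-deepseek deployment down to zero replicas
kubectl scale deployment ollama-deepseek --replicas=0
```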