# **Sistema de RAG para el modelo Gemma2-9b**



Describimos brevemente el sistema de RAG, fallido, propuesto por DeepSeek.

Comenzamos con la instalaci√≥n de nuevas librer√≠as, algo que dio ya much√≠simos problemas, y que llev√≥ al LLM de Colab cerca de una hora solucionar.

In [1]:
import warnings
warnings.filterwarnings("ignore")

"""
SISTEMA RAG COMPLETO PARA GEMMA-2-9B - AN√ÅLISIS ELECTORAL
Optimizado para Google Colab Pro con manejo eficiente de memoria
"""

# ============================================================================
# CELDA 1: INSTALACI√ìN DE DEPENDENCIAS PARA GEMMA
# ============================================================================
print("üöÄ CONFIGURANDO SISTEMA RAG PARA GEMMA-2-9B")
print("=" * 70)

# Core dependencies including numpy and faiss-cpu, with a runtime restart to ensure compatibility
!pip install -q numpy==1.26.4 --force-reinstall
!pip install -q faiss-cpu

# Other RAG dependencies
!pip install -q chromadb==0.4.18  # Vector database
!pip install -q sentence-transformers==2.5.0  # Embeddings - Updated to a version compatible with newer huggingface_hub
!pip install -q rank_bm25==0.2.2  # BM25 retrieval
!pip install -q pypdf==4.2.0  # Para manejar datasets

# These packages are typically managed by Unsloth, so we let Unsloth install its preferred versions later.
# !pip install -q transformers==4.36.0
# !pip install -q accelerate==0.25.0
# !pip install -q bitsandbytes==0.41.3
# !pip install -q sentencepiece==0.1.99
# !pip install -q protobuf==3.20.3
# !pip install -q huggingface_hub==0.10.0 # Pinning for sentence-transformers compatibility (before 0.25.0) - Removed explicit pinning
# !pip install -q datasets==2.15.0

print("‚úÖ Dependencias instaladas. Por favor, REINICIA EL ENTORNO DE EJECUCI√ìN (Runtime -> Restart runtime) y luego vuelve a ejecutar todas las celdas.")

# import os
# os.kill(os.getpid(), 9) # This is a hard restart, can be used for automation, but manual is safer for diagnosis

üöÄ CONFIGURANDO SISTEMA RAG PARA GEMMA-2-9B
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
opencv-python 4.13.0.90 requires numpy>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
pytensor 2.37.0 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.
rasterio 1.5.0 requires numpy>=2, but you have numpy 1.26.4 which is incompatible.
opencv-python-headless 4.13.0.90 requires numpy>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
opencv-contrib-python 4.13.0.90 requires numpy>=2; python_version >= "3.9", but you have numpy 1.26.4 which is incompatible.
tobler 0.13.0 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.
shap 0.50.0 requires numpy>=2, but you have numpy 1.26.4 which is incompatible.
jax 0.7.2 requires numpy>=2.0, but you have numpy 1.26.4 which is incompa

In [2]:
import torch
import gc
from datetime import datetime

Esta celda de carga sencillamente no funcion√≥, y tuve que cargar despu√©s el modelo y el tokenizador con Unsloth.

In [3]:
# ============================================================================
# CELDA 2: CONFIGURACI√ìN Y CARGA DE MODELO GEMMA OPTIMIZADA
# ============================================================================
import torch
import gc
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

def setup_gemma_model(model_path, quantization="16bit"):
    """
    Carga Gemma-2-9b optimizada para Colab
    """
    print(f"\n‚Ññ CARGANDO GEMMA-2-9B DESDE: {model_path}")
    print("=" * 50)

    from transformers import AutoTokenizer, AutoModelForCausalLM

    # Verificar GPU
    if not torch.cuda.is_available():
        print("‚ö†Ô∏è  ¬°ADVERTENCIA! No hay GPU disponible")
        print("   Gemma-2-9b requiere GPU con al menos 16GB VRAM")
        return None, None

    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"‚úÖ GPU: {gpu_name}")
    print(f"≈∏¬® Memoria GPU: {gpu_memory:.2f} GB")

    # Configuraci√≥n seg√πn memoria disponible
    if gpu_memory < 16:
        print("‚ùå Memoria insuficiente para Gemma-2-9b (se requieren 16GB+)")
        print("≈∏¬® Sugerencia: Usa Google Colab Pro+ con A100 o V100")
        return None, None

    # Configurar carga optimizada
    load_config = {
        "torch_dtype": torch.float16,
        "device_map": "auto",
        "trust_remote_code": True,
    }

    # A√±adir quantizaci√≥n si se especifica
    if quantization == "8bit":
        load_config["load_in_8bit"] = True
        print("≈∏¬® Usando quantizaci√≥n 8-bit")
    elif quantization == "4bit":
        load_config["load_in_4bit"] = True
        load_config["bnb_4bit_compute_dtype"] = torch.float16
        load_config["bnb_4bit_quant_type"] = "nf4"
        print("≈∏¬® Usando quantizaci√≥n 4-bit")
    else:
        print("≈∏¬® Usando precisi√≥n 16-bit")

    try:
        # Limpiar memoria primero
        torch.cuda.empty_cache()
        gc.collect()

        print("‚è≥ Cargando tokenizer...")
        tokenizer = AutoTokenizer.from_pretrained(
            model_path,
            trust_remote_code=True
        )

        # Configurar padding token si no existe
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token

        print("‚è≥ Cargando modelo... (esto puede tomar 1-2 minutos)")
        model = AutoModelForCausalLM.from_pretrained(
            model_path,
            **load_config
        )

        print(f"‚úÖ Modelo cargado exitosamente")
        print(f"   Dispositivo: {model.device}")
        print(f"   dtype: {model.dtype}")

        # Estad√≠sticas de memoria
        allocated = torch.cuda.memory_allocated() / 1e9
        print(f"≈∏¬® Memoria utilizada: {allocated:.2f} GB / {gpu_memory:.2f} GB")

        return model, tokenizer

    except Exception as e:
        print(f"‚ùå Error cargando modelo: {e}")
        return None, None

# Ruta a tu modelo Gemma fine-tuned
GEMMA_MODEL_PATH = "/content/drive/MyDrive/Practica_LLM_Engineering_25/modelo-Gemma_2-9b-finetuned-4bit"  # Cambia esto

# Cargar modelo (commented out to rely on unsloth model loading)
# model, tokenizer = setup_gemma_model(GEMMA_MODEL_PATH, quantization="16bit")

# if model is None:
#     print("\n‚ö†Ô∏è  Intentando cargar modelo base como fallback...")
#     model, tokenizer = setup_gemma_model("google/gemma-2-9b", quantization="4bit")

Intalamos Unsloth, que creo una incompatibilidad entre √©l y faiss, por la versi√≥n de numpy.

In [4]:
!pip install unsloth
!pip install --no-deps xformers trl peft accelerate bitsandbytes



Continuos problemas... esta es una celda que cre√≥ el propio Colab en uno de los pasos del arreglo, es solo un ejemplo.

In [None]:
# ============================================================================
# CELDA 1.5: INSTALACI√ìN DE FAISS (despu√©s de Unsloth para compatibilidad con NumPy)
# ============================================================================
# This cell is commented out, as faiss-cpu and numpy installation is now handled at the very beginning in cell ud-a7gem6-15
# print("‚è≥ Instalando numpy 1.26.4 y faiss-cpu compatible...")
# !pip install -q numpy==1.26.4 --force-reinstall # Ensure numpy 1.x for faiss compatibility
# !pip install -q faiss-cpu # Instalar la √∫ltima versi√≥n de faiss-cpu, esperando compatibilidad con NumPy 1.x or auto-adapting
# print("‚úÖ faiss-cpu instalado")

Cargamos el modelo y el tokenizer finalmente con Unsloth

In [5]:
from unsloth import FastLanguageModel
import torch

# 1. Definir la ruta de la carpeta en Drive donde est√°n los 4 ficheros
model_path = "/content/drive/MyDrive/Practica_LLM_Engineering_25/modelo-Gemma_2-9b-finetuned-4bit"

# 2. Cargar el modelo y el tokenizador
# Unsloth detectar√° autom√°ticamente los shards (los 4 ficheros)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_path, # Apuntamos a la CARPETA, no a un archivo
    max_seq_length = 2048,   # El mismo que usaste en el entrenamiento
    dtype = None,            # Auto-detecci√≥n
    load_in_4bit = True,     # Cargamos en 4 bits para ahorrar memoria
)

# 3. Poner el modelo en modo inferencia (m√°s r√°pido)
FastLanguageModel.for_inference(model)

print("¬°Modelo cargado correctamente desde los shards de Google Drive!")

ü¶• Unsloth: Will patch your computer to enable 2x faster free finetuning.
ü¶• Unsloth Zoo will now patch everything to make training faster!
Unsloth: If you want to finetune Gemma 2, install flash-attn to make it faster!
To install flash-attn, do the below:

pip install --no-deps --upgrade "flash-attn>=2.6.3"
==((====))==  Unsloth 2026.1.4: Fast Gemma2 patching. Transformers: 4.57.6.
   \\   /|    NVIDIA L4. Num GPUs = 1. Max memory: 22.161 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.10.0+cu128. CUDA: 8.9. CUDA Toolkit: 12.8. Triton: 3.6.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.34. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

¬°Modelo cargado correctamente desde los shards de Google Drive!


Aqu√≠ es donde se crea el sistema de embeddings y el vector store.

In [6]:
# ============================================================================
# CELDA 3: SISTEMA DE EMBEDDINGS Y VECTOR STORE
# ============================================================================
print("\n" + "=" * 70)
print("üìö CONSTRUYENDO BASE DE CONOCIMIENTO VECTORIAL")
print("=" * 70)

import pandas as pd
import numpy as np
from typing import List, Dict, Tuple, Optional
import json
import pickle
import os
from tqdm.auto import tqdm

class ElectoralKnowledgeBase:
    """
    Base de conocimiento vectorial para datos electorales
    """

    def __init__(self, embedding_model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"):
        self.embedding_model_name = embedding_model_name
        self.documents = []
        self.metadata = []
        self.embeddings = None

        # Inicializar modelo de embeddings
        print(f"‚è≥ Cargando modelo de embeddings: {embedding_model_name}")
        from sentence_transformers import SentenceTransformer
        self.embedder = SentenceTransformer(embedding_model_name)

        print(f"‚úÖ Embedder cargado (dim: {self.embedder.get_sentence_embedding_dimension()})")

    def load_electoral_datasets(self, dataset_paths: Dict[str, str]):
        """
        Carga datasets electorales y los prepara para embedding
        """
        print("\nüìÇ CARGANDO DATASETS ELECTORALES...")

        all_data = []

        for name, path in dataset_paths.items():
            print(f"  Procesando {name}...")

            try:
                # Cargar dataset (soporta CSV y JSONL)
                if path.endswith('.csv'):
                    df = pd.read_csv(path, encoding='utf-8')
                elif path.endswith('.jsonl'):
                    with open(path, 'r', encoding='utf-8') as f:
                        lines = f.readlines()
                    df = pd.DataFrame([json.loads(line) for line in lines])
                else:
                    print(f"    ‚ö†Ô∏è  Formato no soportado: {path}")
                    continue

                print(f"    ‚úÖ {len(df)} filas cargadas")

                # Crear documentos textuales para embedding
                for idx, row in df.iterrows():
                    # Construir texto descriptivo
                    doc_text = self._row_to_text(row, name)

                    # Metadatos
                    meta = {
                        "dataset": name,
                        "row_index": idx,
                        "columns": list(row.dropna().index),
                        "year": self._extract_year_from_row(row),
                        "municipality": row.get('Municipio', '') if 'Municipio' in row else '',
                        "province": row.get('Provincia', '') if 'Provincia' in row else ''
                    }

                    all_data.append((doc_text, meta))

                    # Limitar para pruebas
                    if len(all_data) >= 10000:  # M√°ximo 10k documentos para Colab
                        print(f"    ‚ö†Ô∏è  L√≠mite de 10k documentos alcanzado")
                        break

            except Exception as e:
                print(f"    ‚ùå Error cargando {name}: {e}")

        # Separar documentos y metadatos
        self.documents = [d[0] for d in all_data]
        self.metadata = [d[1] for d in all_data]

        print(f"\nüìä BASE DE CONOCIMIENTO:")
        print(f"   ‚Ä¢ Total documentos: {len(self.documents):,}")
        print(f"   ‚Ä¢ A√±os cubiertos: {len(set(m.get('year', '') for m in self.metadata))}")
        print(f"   ‚Ä¢ Municipios √∫nicos: {len(set(m.get('municipality', '') for m in self.metadata if m.get('municipality')))}")

        return self

    def _row_to_text(self, row: pd.Series, dataset_name: str) -> str:
        """Convierte una fila de datos a texto para embedding"""

        parts = [f"Datos electorales de {dataset_name}:"]

        # Informaci√≥n geogr√°fica
        if 'Municipio' in row and pd.notna(row['Municipio']):
            parts.append(f"Municipio: {row['Municipio']}")
        if 'Provincia' in row and pd.notna(row['Provincia']):
            parts.append(f"Provincia: {row['Provincia']}")
        if 'CCAA' in row and pd.notna(row['CCAA']):
            parts.append(f"Comunidad Aut√≥noma: {row['CCAA']}")

        # Resultados electorales
        electoral_cols = [col for col in row.index if col.startswith('%')]
        for col in electoral_cols[:10]:  # Limitar a 10 partidos
            if pd.notna(row[col]):
                parts.append(f"{col}: {row[col]:.1f}%")

        # Datos de participaci√≥n
        if 'Participaci√≥n' in row and pd.notna(row['Participaci√≥n']):
            parts.append(f"Participaci√≥n electoral: {row['Participaci√≥n']:.1f}%")

        # Datos socioecon√≥micos
        socio_cols = ['Renta persona', 'Renta hogar', 'Renta Salarios', 'Renta Pensiones']
        for pattern in socio_cols:
            matching = [col for col in row.index if pattern in col]
            for col in matching[:2]:  # Limitar
                if pd.notna(row[col]):
                    parts.append(f"{col}: {row[col]:.0f} ‚Ç¨")

        # A√±o si est√° disponible
        year = self._extract_year_from_row(row)
        if year:
            parts.append(f"A√±o electoral: {year}")

        return " | ".join(parts)

    def _extract_year_from_row(self, row: pd.Series) -> Optional[str]:
        """Extrae el a√±o de una fila de datos"""
        # Buscar columna de a√±o
        year_cols = [col for col in row.index if 'a√±o' in col.lower() or 'year' in col.lower()]
        for col in year_cols:
            if pd.notna(row[col]):
                return str(row[col])[:4]

        # Intentar extraer de nombre de dataset
        return None

    def build_vector_index(self, batch_size: int = 32):
        """
        Construye √≠ndice vectorial FAISS para b√∫squeda eficiente
        """
        print("\nüî® CONSTRUYENDO √çNDICE VECTORIAL...")

        if not self.documents:
            print("‚ö†Ô∏è  No hay documentos para indexar")
            return self

        # Ensure numpy 1.x and faiss-cpu are installed right before import faiss
        print("  Garantizando compatibilidad con NumPy y FAISS...")
        # We are removing the explicit numpy downgrade here as it is handled at the beginning of the notebook.
        # !pip install -q numpy==1.26.4 --force-reinstall # Force reinstall numpy for faiss compatibility
        # !pip install -q faiss-cpu # Instalar la √∫ltima versi√≥n de faiss-cpu, esperando compatibilidad con NumPy 1.x or auto-adapting

        # Generar embeddings en lotes
        print(f"  Generando embeddings para {len(self.documents)} documentos...")

        all_embeddings = []
        for i in tqdm(range(0, len(self.documents), batch_size), desc="Embedding"):
            batch = self.documents[i:i + batch_size]
            batch_embeddings = self.embedder.encode(batch, show_progress_bar=False)
            all_embeddings.append(batch_embeddings)

        self.embeddings = np.vstack(all_embeddings)

        # Diagnostic print
        # print(f"  NumPy version before FAISS import: {np.__version__}")

        # Construir √≠ndice FAISS
        print("  Construyendo √≠ndice FAISS...")
        import faiss

        dimension = self.embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dimension)  # Producto interno para similitud coseno
        self.index.add(self.embeddings.astype('float32'))

        print(f"‚úÖ √çndice construido:")
        print(f"   ‚Ä¢ Dimensiones: {dimension}")
        print(f"   ‚Ä¢ Documentos indexados: {self.index.ntotal}")

        return self

    def search_similar(self, query: str, k: int = 10) -> List[Dict]:
        """
        Busca documentos similares a la consulta
        """
        if self.embeddings is None:
            raise ValueError("√çndice no construido. Llama a build_vector_index() primero.")

        # Embedding de la consulta
        query_embedding = self.embedder.encode([query])

        # B√∫squeda en FAISS
        distances, indices = self.index.search(query_embedding.astype('float32'), k)

        # Formatear resultados
        results = []
        for dist, idx in zip(distances[0], indices[0]):
            if idx < len(self.documents):
                results.append({
                    "document": self.documents[idx],
                    "metadata": self.metadata[idx],
                    "score": float(dist),
                    "similarity": self._cosine_to_percentage(dist)
                })

        return results

    def _cosine_to_percentage(self, cosine_sim: float) -> float:
        """Convierte similitud coseno a porcentaje (0-100)"""
        # Normalizar de [-1, 1] a [0, 100]
        return ((cosine_sim + 1) / 2) * 100

    def save_index(self, path: str):
        """
        Guarda el √≠ndice para uso futuro
        """
        import pickle
        import faiss # Import faiss here

        data = {
            'documents': self.documents,
            'metadata': self.metadata,
            'embeddings': self.embeddings,
            'embedding_model': self.embedding_model_name
        }

        with open(path, 'wb') as f:
            pickle.dump(data, f)

        # Guardar √≠ndice FAISS por separado
        faiss.write_index(self.index, f"{path}.faiss")

        print(f"‚úÖ √çndice guardado en: {path}")

    def load_index(self, path: str):
        """
        Carga un √≠ndice guardado
        """
        import pickle
        import faiss # Import faiss here

        with open(path, 'rb') as f:
            data = pickle.load(f)

        self.documents = data['documents']
        self.metadata = data['metadata']
        self.embeddings = data['embeddings']
        self.embedding_model_name = data['embedding_model']

        # Cargar √≠ndice FAISS
        self.index = faiss.read_index(f"{path}.faiss")

        print(f"‚úÖ √çndice cargado: {len(self.documents)} documentos")

        return self


üìö CONSTRUYENDO BASE DE CONOCIMIENTO VECTORIAL


Este es el procesador de consultas inteligente

In [7]:
# ============================================================================
# CELDA 4: PROCESADOR DE CONSULTAS INTELIGENTE
# ============================================================================
print("\n" + "=" * 70)
print("üß† PROCESADOR INTELIGENTE DE CONSULTAS")
print("=" * 70)

import re
from collections import defaultdict

class ElectoralQueryProcessor:
    """
    Procesa y entiende consultas en lenguaje natural sobre datos electorales
    """

    def __init__(self, knowledge_base: ElectoralKnowledgeBase):
        self.kb = knowledge_base

        # Patrones para extraer informaci√≥n
        self.patterns = {
            'municipio': [
                r'en el municipio de (\w+(?:\s+\w+)*)',
                r'en (\w+(?:\s+\w+)*)\s+\(municipio\)',
                r'municipio de (\w+(?:\s+\w+)*)',
            ],
            'provincia': [
                r'en la provincia de (\w+(?:\s+\w+)*)',
                r'en (\w+(?:\s+\w+)*)\s+\(provincia\)',
                r'provincia de (\w+(?:\s+\w+)*)',
            ],
            'a√±o': [
                r'en (\d{4})',
                r'de (\d{4})',
                r'a√±o (\d{4})',
                r'las elecciones de (\d{4})',
            ],
            'partido': [
                r'al (PP|PSOE|VOX|Cs|UP|IU|Podemos|Ciudadanos)',
                r'el partido (PP|PSOE|VOX|Cs|UP|IU|Podemos|Ciudadanos)',
                r'voto al (PP|PSOE|VOX|Cs|UP|IU)',
                r'porcentaje del (PP|PSOE|VOX|Cs|UP|IU)',
            ],
            'tipo_analisis': [
                (r'evolucion[√≥o]|cambio|entre \d{4} y \d{4}', 'evolucion'),
                (r'correlaci[√≥o]n|relaci[√≥o]n|asociaci[√≥o]n', 'correlacion'),
                (r'promedio|media|promedio|valor medio', 'estadistica'),
                (r'm[√°a]ximo|m[√≠i]nimo|rango', 'extremos'),
                (r'comparar|comparaci[√≥o]n|diferencias?', 'comparacion'),
                (r'tendencia|tendencial|progresi[√≥o]n', 'tendencia'),
                (r'qu[√©e] porcentaje|cu[√°a]nto|cu[√°a]ntos', 'simple_query'),
            ]
        }

        # Lista de municipios y provincias conocidos (se cargan din√°micamente)
        self.known_municipalities = set()
        self.known_provinces = set()
        self._load_geographic_data()

    def _load_geographic_data(self):
        """Carga listas de municipios y provincias conocidos"""
        if self.kb.metadata:
            for meta in self.kb.metadata:
                if 'municipality' in meta and meta['municipality']:
                    self.known_municipalities.add(meta['municipality'].lower())
                if 'province' in meta and meta['province']:
                    self.known_provinces.add(meta['province'].lower())

    def parse_query(self, query: str) -> Dict:
        """
        Parsea una consulta en lenguaje natural
        """
        query_lower = query.lower()
        parsed = {
            'original': query,
            'entities': defaultdict(list),
            'analysis_type': 'general',
            'geographic_scope': None,
            'temporal_scope': None,
            'confidence': 1.0
        }

        # 1. Extraer entidades con patrones
        for entity_type, pattern_list in self.patterns.items():
            if entity_type == 'tipo_analisis':
                continue

            for pattern in pattern_list:
                matches = re.findall(pattern, query_lower, re.IGNORECASE)
                for match in matches:
                    if isinstance(match, tuple):
                        match = match[0]
                    parsed['entities'][entity_type].append(match)

        # 2. Determinar tipo de an√°lisis
        for pattern, analysis_type in self.patterns['tipo_analisis']:
            if re.search(pattern, query_lower, re.IGNORECASE):
                parsed['analysis_type'] = analysis_type
                break

        # 3. Desambiguar municipios/provincias
        self._disambiguate_entities(parsed, query_lower)

        # 4. Determinar √°mbito geogr√°fico
        self._determine_geographic_scope(parsed)

        # 5. Determinar √°mbito temporal
        self._determine_temporal_scope(parsed)

        # 6. Calcular confianza
        parsed['confidence'] = self._calculate_confidence(parsed)

        return parsed

    def _disambiguate_entities(self, parsed: Dict, query_lower: str):
        """Desambigua entre municipios y provincias"""
        # Primero, verificar si los nombres extra√≠dos existen en nuestras listas
        for entity in list(parsed['entities']['municipio'] + parsed['entities']['provincia']):
            entity_lower = entity.lower()

            is_municipality = entity_lower in self.known_municipalities
            is_province = entity_lower in self.known_provinces

            # Si est√° en ambas listas, usar contexto para decidir
            if is_municipality and is_province:
                # Palabras clave que sugieren municipio
                municipality_hints = ['ciudad', 'localidad', 'ayuntamiento', 'alcalde', 'municipal']
                province_hints = ['provincia', 'diputaci√≥n', 'territorio', 'regional']

                if any(hint in query_lower for hint in municipality_hints):
                    parsed['entities']['municipio'].append(entity)
                    if entity in parsed['entities']['provincia']:
                        parsed['entities']['provincia'].remove(entity)
                elif any(hint in query_lower for hint in province_hints):
                    parsed['entities']['provincia'].append(entity)
                    if entity in parsed['entities']['municipio']:
                        parsed['entities']['municipio'].remove(entity)
            elif is_municipality:
                parsed['entities']['municipio'].append(entity)
            elif is_province:
                parsed['entities']['provincia'].append(entity)

    def _determine_geographic_scope(self, parsed: Dict):
        """Determina el √°mbito geogr√°fico de la consulta"""
        if parsed['entities']['municipio']:
            parsed['geographic_scope'] = 'municipal'
            parsed['location'] = parsed['entities']['municipio'][0]
        elif parsed['entities']['provincia']:
            parsed['geographic_scope'] = 'provincial'
            parsed['location'] = parsed['entities']['provincia'][0]
        else:
            parsed['geographic_scope'] = 'nacional'
            parsed['location'] = 'Espa√±a'

    def _determine_temporal_scope(self, parsed: Dict):
        """Determina el √°mbito temporal"""
        if parsed['entities']['a√±o']:
            if len(parsed['entities']['a√±o']) >= 2:
                parsed['temporal_scope'] = 'comparativo'
                parsed['years'] = sorted(parsed['entities']['a√±o'])[:2]  # Tomar los 2 primeros
            else:
                parsed['temporal_scope'] = 'puntual'
                parsed['years'] = parsed['entities']['a√±o']
        else:
            parsed['temporal_scope'] = 'general'
            parsed['years'] = []

    def _calculate_confidence(self, parsed: Dict) -> float:
        """Calcula confianza en el parsing"""
        confidence = 1.0

        # Penalizar si no se encontr√≥ ubicaci√≥n pero la consulta parece necesitarla
        location_indicators = ['en madrid', 'en barcelona', 'en sevilla', 'municipio', 'provincia']
        if any(indicator in parsed['original'].lower() for indicator in location_indicators):
            if not parsed['entities']['municipio'] and not parsed['entities']['provincia']:
                confidence *= 0.7

        # Penalizar si no se encontr√≥ a√±o para consultas temporales
        temporal_indicators = ['en 2019', 'en 2016', 'a√±o', 'elecciones']
        if any(indicator in parsed['original'].lower() for indicator in temporal_indicators):
            if not parsed['entities']['a√±o']:
                confidence *= 0.8

        return round(confidence, 2)

    def create_search_queries(self, parsed_query: Dict) -> List[str]:
        """
        Crea m√∫ltiples consultas de b√∫squeda a partir del parsing
        """
        base_query = parsed_query['original']

        # Consultas alternativas para mejorar recall
        alternative_queries = [base_query]

        # A√±adir sin√≥nimos para tipo de an√°lisis
        analysis_synonyms = {
            'evolucion': ['cambio', 'variaci√≥n', 'diferencia'],
            'correlacion': ['relaci√≥n', 'asociaci√≥n', 'vinculo'],
            'estadistica': ['promedio', 'media', 'valor medio'],
        }

        analysis_type = parsed_query['analysis_type']
        if analysis_type in analysis_synonyms:
            for synonym in analysis_synonyms[analysis_type]:
                alt_query = base_query.replace(
                    analysis_type, synonym
                ) if analysis_type in base_query.lower() else f"{base_query} {synonym}"
                alternative_queries.append(alt_query)

        # A√±adir consultas espec√≠ficas por ubicaci√≥n
        if parsed_query['geographic_scope'] == 'municipal' and parsed_query.get('location'):
            location = parsed_query['location']
            alt_query = f"datos electorales municipio {location} {base_query}"
            alternative_queries.append(alt_query)

        # A√±adir consultas espec√≠ficas por a√±o
        if parsed_query['years']:
            for year in parsed_query['years']:
                alt_query = f"{base_query} a√±o {year}"
                alternative_queries.append(alt_query)

        # Eliminar duplicados
        unique_queries = []
        seen = set()
        for q in alternative_queries:
            if q not in seen:
                seen.add(q)
                unique_queries.append(q)

        return unique_queries[:5]  # Limitar a 5 consultas


üß† PROCESADOR INTELIGENTE DE CONSULTAS


Aqu√≠ construye los prompts para el modelo Gemma.

In [8]:
# ============================================================================
# CELDA 5: CONSTRUCTOR DE PROMPTS PARA GEMMA
# ============================================================================
print("\n" + "=" * 70)
print("‚úçÔ∏è CONSTRUCTOR DE PROMPTS PARA GEMMA-2-9B")
print("=" * 70)

class GemmaPromptBuilder:
    """
    Construye prompts optimizados para Gemma-2-9b
    """

    def __init__(self, tokenizer, max_context_length: int = 8192):
        self.tokenizer = tokenizer
        self.max_context_length = max_context_length
        self.max_response_tokens = 1024

        # Templates espec√≠ficos para Gemma
        self.templates = {
            'analisis_complejo': """Eres un experto analista pol√≠tico especializado en datos electorales espa√±oles.

CONTEXTO Y DATOS DISPONIBLES:
{context}

INSTRUCCIONES:
1. Analiza la siguiente consulta usando EXCLUSIVAMENTE los datos proporcionados
2. Si los datos no contienen informaci√≥n espec√≠fica, ind√≠calo claramente
3. Incluye c√°lculos num√©ricos cuando sean relevantes
4. Proporciona interpretaci√≥n contextual
5. Mant√©n un tono profesional y objetivo

CONSULTA DEL USUARIO:
{query}

RESPUESTA (estructura):
- Resumen ejecutivo
- An√°lisis cuantitativo con datos espec√≠ficos
- Interpretaci√≥n y contexto pol√≠tico
- Limitaciones del an√°lisis basado en datos disponibles

AN√ÅLISIS:""",

            'consulta_simple': """Eres un asistente especializado en datos electorales.

DATOS RELEVANTES:
{context}

PREGUNTA: {query}

Respuesta precisa basada solo en los datos anteriores:""",

            'evolucion_temporal': """Eres un analista de tendencias electorales.

DATOS HIST√ìRICOS:
{context}

AN√ÅLISIS SOLICITADO: {query}

Realiza un an√°lisis de evoluci√≥n temporal que incluya:
1. Comparativa cuantitativa entre periodos
2. Tasa de cambio y significancia
3. Contexto hist√≥rico relevante
4. Posibles factores explicativos

AN√ÅLISIS TEMPORAL:""",

            'correlacion': """Eres un estadista electoral.

DATOS PARA AN√ÅLISIS:
{context}

CONSULTA DE CORRELACI√ìN: {query}

Analiza las relaciones estad√≠sticas considerando:
1. Patrones observables en los datos
2. Posibles correlaciones (sin inferir causalidad)
3. Factores de confusi√≥n a considerar
4. Limitaciones metodol√≥gicas

AN√ÅLISIS ESTAD√çSTICO:"""
        }

    def build_prompt(self, query: str, context_docs: List[Dict],
                    analysis_type: str = 'analisis_complejo') -> Dict:
        """
        Construye un prompt optimizado para Gemma
        """
        # Seleccionar template
        template = self.templates.get(analysis_type, self.templates['analisis_complejo'])

        # Construir contexto formateado
        context_text = self._format_context(context_docs)

        # Construir prompt completo
        full_prompt = template.format(context=context_text, query=query)

        # Verificar longitud
        token_count = len(self.tokenizer.encode(full_prompt))

        # Si es muy largo, truncar contexto
        if token_count > self.max_context_length - self.max_response_tokens:
            full_prompt = self._truncate_context(full_prompt, context_docs)

        return {
            'prompt': full_prompt,
            'token_count': len(self.tokenizer.encode(full_prompt)),
            'context_docs_count': len(context_docs),
            'analysis_type': analysis_type
        }

    def _format_context(self, context_docs: List[Dict]) -> str:
        """Formatea documentos de contexto"""
        if not context_docs:
            return "No hay datos espec√≠ficos disponibles para esta consulta."

        formatted = []

        for i, doc in enumerate(context_docs[:10]):  # Limitar a 10 documentos
            doc_text = f"[Documento {i+1}] {doc['document']}"
            if doc.get('similarity'):
                doc_text += f" (Relevancia: {doc['similarity']:.1f}%)"

            # A√±adir metadatos √∫tiles
            meta = doc.get('metadata', {})
            if meta.get('year'):
                doc_text += f" | A√±o: {meta['year']}"
            if meta.get('municipality'):
                doc_text += f" | Municipio: {meta['municipality']}"
            if meta.get('province'):
                doc_text += f" | Provincia: {meta['province']}"

            formatted.append(doc_text)

        return "\n\n".join(formatted)

    def _truncate_context(self, prompt: str, context_docs: List[Dict]) -> str:
        """Trunca contexto si el prompt es muy largo"""
        # Primero intentar con menos documentos
        for num_docs in [8, 6, 4, 2]:
            if num_docs < len(context_docs):
                truncated_context = self._format_context(context_docs[:num_docs])
                template = self.templates['analisis_complejo']
                truncated_prompt = template.format(
                    context=truncated_context,
                    query=prompt.split("CONSULTA DEL USUARIO:")[-1].split("RESPUESTA")[0]
                )

                if len(self.tokenizer.encode(truncated_prompt)) < self.max_context_length - self.max_response_tokens:
                    return truncated_prompt

        # Si sigue siendo muy largo, usar solo los m√°s relevantes
        sorted_docs = sorted(context_docs, key=lambda x: x.get('similarity', 0), reverse=True)
        return self._format_context(sorted_docs[:2])



‚úçÔ∏è CONSTRUCTOR DE PROMPTS PARA GEMMA-2-9B


En esta celda organiza el RAG

In [9]:
# ============================================================================
# CELDA 6: SISTEMA RAG COMPLETO
# ============================================================================
print("\n" + "=" * 70)
print("ü§ñ SISTEMA RAG COMPLETO PARA GEMMA")
print("=" * 70)

class ElectoralRAGSystem:
    """
    Sistema RAG completo para an√°lisis electoral con Gemma-2-9b
    """

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.knowledge_base = None
        self.query_processor = None
        self.prompt_builder = None

        # Inicializar componentes
        self._initialize_components()

        # Estad√≠sticas
        self.stats = {
            'queries_processed': 0,
            'avg_retrieval_time': 0,
            'avg_generation_time': 0,
            'cache_hits': 0
        }

        # Cache para respuestas similares
        self.response_cache = {}

    def _initialize_components(self):
        """Inicializa todos los componentes del sistema (without loading KB)"""
        print("Inicializando componentes RAG...")

        # 1. Base de conocimiento
        self.knowledge_base = ElectoralKnowledgeBase()

        # 2. Procesador de consultas
        self.query_processor = ElectoralQueryProcessor(self.knowledge_base)

        # 3. Constructor de prompts
        self.prompt_builder = GemmaPromptBuilder(self.tokenizer)

        print("‚úÖ Sistema RAG inicializado")

    def build_knowledge_base(self, dataset_paths: Dict[str, str], save_path: str = None):
        """
        Construye la base de conocimiento desde datasets
        """
        print("\nüî® CONSTRUYENDO BASE DE CONOCIMIENTO...")

        # Cargar datos
        self.knowledge_base.load_electoral_datasets(dataset_paths)

        # Construir √≠ndice
        self.knowledge_base.build_vector_index()

        # Actualizar procesador de consultas
        self.query_processor = ElectoralQueryProcessor(self.knowledge_base)

        # Guardar si se especific√≥ ruta
        if save_path:
            self.knowledge_base.save_index(save_path)
            print(f"‚úÖ Base de conocimiento guardada en: {save_path}")

        return self

    def answer_query(self, query: str, max_retrieved_docs: int = 8) -> Dict:
        """
        Procesa una consulta y genera respuesta
        """
        import time
        start_time = time.time()

        self.stats['queries_processed'] += 1

        # Paso 1: Verificar cache
        cache_key = hash(query)
        if cache_key in self.response_cache:
            self.stats['cache_hits'] += 1
            print("‚úÖ Respuesta obtenida de cache")
            return self.response_cache[cache_key]

        # Paso 2: Parsear consulta
        parsed_query = self.query_processor.parse_query(query)

        retrieval_start = time.time()

        # Paso 3: Buscar documentos relevantes
        search_queries = self.query_processor.create_search_queries(parsed_query)

        all_results = []
        for search_query in search_queries:
            results = self.knowledge_base.search_similar(search_query, k=max_retrieved_docs)
            all_results.extend(results)

        # Eliminar duplicados y ordenar por relevancia
        unique_results = []
        seen_docs = set()
        for result in sorted(all_results, key=lambda x: x.get('similarity', 0), reverse=True):
            doc_hash = hash(result['document'])
            if doc_hash not in seen_docs:
                seen_docs.add(doc_hash)
                unique_results.append(result)

        retrieval_time = time.time() - retrieval_start

        # Paso 4: Construir prompt
        prompt_info = self.prompt_builder.build_prompt(
            query=query,
            context_docs=unique_results[:max_retrieved_docs],
            analysis_type=parsed_query['analysis_type']
        )

        generation_start = time.time()

        # Paso 5: Generar respuesta con Gemma
        response = self._generate_with_gemma(prompt_info['prompt'], parsed_query)

        generation_time = time.time() - generation_start

        # Paso 6: Formatear respuesta
        final_response = self._format_response(
            query=query,
            response=response,
            parsed_query=parsed_query,
            retrieved_docs=unique_results[:max_retrieved_docs],
            prompt_info=prompt_info,
            retrieval_time=retrieval_time,
            generation_time=generation_time
        )

        total_time = time.time() - start_time

        # Actualizar estad√≠sticas
        self.stats['avg_retrieval_time'] = (
            self.stats['avg_retrieval_time'] * (self.stats['queries_processed'] - 1) + retrieval_time
        ) / self.stats['queries_processed']

        self.stats['avg_generation_time'] = (
            self.stats['avg_generation_time'] * (self.stats['queries_processed'] - 1) + generation_time
        ) / self.stats['queries_processed']

        # Guardar en cache
        self.response_cache[cache_key] = final_response

        return final_response

    def _generate_with_gemma(self, prompt: str, parsed_query: Dict) -> str:
        """Genera respuesta usando Gemma-2-9b"""
        try:
            # Tokenizar prompt
            inputs = self.tokenizer(
                prompt,
                return_tensors="pt",
                truncation=True,
                max_length=8192
            ).to(self.model.device)

            # Configurar generaci√≥n seg√∫n tipo de an√°lisis
            generation_config = self._get_generation_config(parsed_query)

            # Generar
            with torch.no_grad():
                outputs = self.model.generate(
                    **inputs,
                    **generation_config
                )

            # Decodificar
            full_response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)

            # Extraer solo la parte nueva (despu√©s del prompt)
            prompt_length = len(self.tokenizer.decode(inputs.input_ids[0], skip_special_tokens=True))
            response_only = full_response[prompt_length:].strip()

            return response_only

        except Exception as e:
            print(f"‚ùå Error en generaci√≥n: {e}")
            return f"Error generando respuesta: {str(e)}"

    def _get_generation_config(self, parsed_query: Dict) -> Dict:
        """Obtiene configuraci√≥n de generaci√≥n seg√∫n tipo de consulta"""
        base_config = {
            "max_new_tokens": 1024,
            "temperature": 0.7,
            "do_sample": True,
            "top_p": 0.9,
            "repetition_penalty": 1.1,
        }

        # Ajustes por tipo de an√°lisis
        analysis_type = parsed_query.get('analysis_type', 'general')

        if analysis_type in ['evolucion', 'correlacion', 'estadistica']:
            base_config.update({
                "temperature": 0.5,  # M√°s determinista para an√°lisis
                "top_p": 0.8,
                "num_beams": 3,  # Beam search para mayor coherencia
            })
        elif analysis_type == 'simple_query':
            base_config.update({
                "temperature": 0.3,  # Muy determinista para respuestas simples
                "do_sample": False,  # Sampling desactivado
                "num_beams": 2,
            })

        return base_config

    def _format_response(self, query: str, response: str, parsed_query: Dict,
                        retrieved_docs: List[Dict], prompt_info: Dict,
                        retrieval_time: float, generation_time: float) -> Dict:
        """Formatea la respuesta final"""

        # Extraer n√∫meros de la respuesta para verificaci√≥n
        import re
        numbers_in_response = re.findall(r'\d+\.?\d*', response)

        return {
            "query": query,
            "response": response,
            "parsed_query": parsed_query,
            "metadata": {
                "retrieved_documents": len(retrieved_docs),
                "most_relevant_doc": retrieved_docs[0]['similarity'] if retrieved_docs else 0,
                "prompt_tokens": prompt_info['token_count'],
                "retrieval_time_seconds": round(retrieval_time, 2),
                "generation_time_seconds": round(generation_time, 2),
                "total_time_seconds": round(retrieval_time + generation_time, 2),
                "numbers_in_response": numbers_in_response,
                "analysis_type": parsed_query['analysis_type'],
                "confidence": parsed_query['confidence']
            },
            "retrieval_stats": {
                "top_documents": [
                    {
                        "preview": doc['document'][:100] + "...",
                        "similarity": doc.get('similarity', 0),
                        "metadata": doc.get('metadata', {})
                    }
                    for doc in retrieved_docs[:3]
                ]
            },
            "system_stats": self.stats.copy()
        }

    def interactive_session(self):
        """Sesi√≥n interactiva con el sistema"""
        print("\n" + "=" * 70)
        print("üí¨ SESI√ìN INTERACTIVA - AN√ÅLISIS ELECTORAL")
        print("=" * 70)

        print("\nüìã EJEMPLOS DE PREGUNTAS:")
        examples = [
            "¬øC√≥mo evolucion√≥ el voto al PP en Madrid entre 2016 y 2019?",
            "¬øQu√© correlaci√≥n hay entre renta personal y participaci√≥n electoral en Barcelona?",
            "¬øCu√°l fue el partido m√°s votado en Sevilla en las elecciones de 2019?",
            "¬øC√≥mo ha cambiado la participaci√≥n electoral en municipios costeros vs interiores?",
            "¬øQu√© porcentaje de votos obtuvo VOX en Valencia en las √∫ltimas elecciones?"
        ]

        for i, example in enumerate(examples, 1):
            print(f"  {i}. {example}")

        print("\nüí° Puedes hacer preguntas sobre:")
        print("  ‚Ä¢ Resultados por municipio/provincia")
        print("  ‚Ä¢ Evoluci√≥n temporal")
        print("  ‚Ä¢ Correlaciones entre variables")
        print("  ‚Ä¢ Comparativas entre partidos/regiones")
        print("  ‚Ä¢ An√°lisis estad√≠sticos")

        print("\n" + "=" * 70)
        print("‚ùì ESCRIBE TU PREGUNTA (o 'salir' para terminar):")

        while True:
            try:
                query = input("\nüîç Tu pregunta: ").strip()

                if query.lower() in ['salir', 'exit', 'quit', 'q']:
                    print("\nüëã ¬°Hasta pronto!")
                    break

                if not query:
                    continue

                print("\n‚è≥ Procesando...")

                # Obtener respuesta
                result = self.answer_query(query)

                # Mostrar respuesta
                print("\n" + "=" * 70)
                print("ü§ñ RESPUESTA:")
                print("=" * 70)
                print(result["response"])

                # Mostrar metadatos (opcional)
                show_details = input("\nüìä ¬øVer detalles del an√°lisis? (s/n): ").lower()
                if show_details == 's':
                    print(f"\nüìà METADATOS:")
                    print(f"  ‚Ä¢ Tipo de an√°lisis: {result['parsed_query']['analysis_type']}")
                    print(f"  ‚Ä¢ Confianza en parsing: {result['parsed_query']['confidence']}")
                    print(f"  ‚Ä¢ Documentos recuperados: {result['metadata']['retrieved_documents']}")
                    print(f"  ‚Ä¢ Relevancia del mejor documento: {result['metadata']['most_relevant_doc']:.1f}%")
                    print(f"  ‚Ä¢ Tiempo recuperaci√≥n: {result['metadata']['retrieval_time_seconds']}s")
                    print(f"  ‚Ä¢ Tiempo generaci√≥n: {result['metadata']['generation_time_seconds']}s")

                    if result['metadata']['numbers_in_response']:
                        print(f"  ‚Ä¢ N√∫meros en respuesta: {result['metadata']['numbers_in_response']}")

                # Preguntar si fue √∫til
                feedback = input("\nüëç ¬øLa respuesta fue √∫til? (s/n): ").lower()
                if feedback == 'n':
                    print("‚ö†Ô∏è  Gracias por el feedback. ¬øQu√© podr√≠a mejorar?")
                    feedback_text = input("   Sugerencia: ")
                    print("‚úÖ Feedback registrado")

            except KeyboardInterrupt:
                print("\n\n‚ö†Ô∏è  Sesi√≥n interrumpida")
                break
            except Exception as e:
                print(f"\n‚ùå Error: {e}")


ü§ñ SISTEMA RAG COMPLETO PARA GEMMA


Esta celda se ocupa de la configuracion y la ejecuci√≥n.

In [10]:
# ============================================================================
# CELDA 7: CONFIGURACI√ìN Y EJECUCI√ìN COMPLETA
# ============================================================================
def setup_complete_rag_system():
    """
    Configuraci√≥n completa del sistema RAG
    """
    print("üöÄ CONFIGURANDO SISTEMA RAG COMPLETO")
    print("=" * 70)

    # 1. Verificar que el modelo est√° cargado
    if model is None or tokenizer is None:
        print("‚ùå No se pudo cargar el modelo Gemma")
        return None

    # 2. Definir rutas de datasets
    dataset_paths = {
        "elecciones_2019": "/content/drive/MyDrive/Practica_LLM_Engineering_25/df_elecciones_19.csv",  # Ajusta estas rutas
    }

    # 3. Ruta para guardar/la base de conocimiento
    kb_save_path = "/content/drive/MyDrive/Practica_LLM_Engineering_25/electoral_knowledge_base.pkl"

    # 4. Crear sistema RAG (sin intentar cargar KB a√∫n)
    rag_system = ElectoralRAGSystem(model, tokenizer)

    # 5. Construir o cargar base de conocimiento
    faiss_index_path = f"{kb_save_path}.faiss"
    if os.path.exists(kb_save_path) and os.path.exists(faiss_index_path):
        try:
            print("‚úÖ Base de conocimiento ya existe y completa, cargando...")
            rag_system.knowledge_base.load_index(kb_save_path)
        except Exception as e:
            print(f"‚ö†Ô∏è  Error cargando base de conocimiento existente ({e}). Reconstruyendo...")
            rag_system.build_knowledge_base(dataset_paths, kb_save_path)
    else:
        print("üìö Construyendo base de conocimiento desde datasets...")

        # Verificar que los datasets existan
        existing_datasets = {}
        for name, path in dataset_paths.items():
            if os.path.exists(path):
                existing_datasets[name] = path
            else:
                print(f"‚ö†Ô∏è  Dataset no encontrado: {path}")

        if existing_datasets:
            rag_system.build_knowledge_base(existing_datasets, kb_save_path)
        else:
            print("‚ö†Ô∏è  No se encontraron datasets. Usando datos de ejemplo...")
            # Crear datos de ejemplo si no hay datasets reales
            create_sample_data()

    print("\n" + "=" * 70)
    print("üéØ SISTEMA RAG LISTO PARA USAR")
    print("=" * 70)

    return rag_system

def create_sample_data():
    """Crea datos de ejemplo si no hay datasets reales"""
    print("Creando datos de ejemplo...")

    sample_data = {
        "Madrid_2019": {
            "Municipio": "Madrid",
            "Provincia": "Madrid",
            "% PP": 22.5,
            "% PSOE": 28.3,
            "% VOX": 15.2,
            "Participaci√≥n": 75.2,
            "Renta persona 2017": 32000,
            "A√±o": 2019
        },
        "Barcelona_2019": {
            "Municipio": "Barcelona",
            "Provincia": "Barcelona",
            "% PP": 18.3,
            "% PSOE": 25.6,
            "% VOX": 11.4,
            "Participaci√≥n": 72.8,
            "Renta persona 2017": 31000,
            "A√±o": 2019
        },
        "Madrid_2016": {
            "Municipio": "Madrid",
            "Provincia": "Madrid",
            "% PP": 33.2,
            "% PSOE": 25.1,
            "% VOX": 1.2,
            "Participaci√≥n": 73.5,
            "Renta persona 2015": 30500,
            "A√±o": 2016
        }
    }

    # Guardar como CSV
    import pandas as pd
    df = pd.DataFrame.from_dict(sample_data, orient='index')
    sample_path = "/content/drive/MyDrive/Practica_LLM_Engineering_25/sample_electoral_data.csv"
    df.to_csv(sample_path, index=False)

    return { "sample_data": sample_path }

Esta es la llamada prueba r√°pida, donde se le pasan al modelo preguntas que no ha visto.

In [11]:
# ============================================================================
# CELDA 8: PRUEBA R√ÅPIDA DEL SISTEMA
# ============================================================================
def quick_test():
    """Prueba r√°pida del sistema RAG"""
    print("\n" + "=" * 70)
    print("üß™ PRUEBA R√ÅPIDA DEL SISTEMA")
    print("=" * 70)

    # Configurar sistema
    rag_system = setup_complete_rag_system()

    if rag_system is None:
        print("‚ùå No se pudo configurar el sistema")
        return

    # Preguntas de prueba
    test_queries = [
        "¬øQu√© porcentaje de votos obtuvo el PSOE en el municipio de Santander en las elecciones de Abril 2019?",
        "¬øC√≥mo cambi√≥ la participaci√≥n electoral en el municipio de Madrid entre las elecciones de Abril 2019 y Noviembre 2019?",
        "¬øQu√© relaci√≥n hay entre renta personal y voto al PP en el municipio de Sevilla en las elecciones de Abril 2019?"
    ]

    for query in test_queries:
        print(f"\n‚ùì Pregunta: {query}")
        print("‚è≥ Procesando...")

        result = rag_system.answer_query(query)

        print(f"\nü§ñ Respuesta: {result['response'][:200]}...")
        print(f"üìä Documentos recuperados: {result['metadata']['retrieved_documents']}")
        print(f"‚è±Ô∏è  Tiempo total: {result['metadata']['total_time_seconds']}s")
        print("-" * 50)

    # Opci√≥n para sesi√≥n interactiva
    start_interactive = input("\nüéÆ ¬øIniciar sesi√≥n interactiva? (s/n): ").lower()
    if start_interactive == 's':
        rag_system.interactive_session()

Aqu√≠ se asegura del guardado del estado del sistema.

In [12]:
# ============================================================================
# CELDA 9: GUARDADO Y EXPORTACI√ìN
# ============================================================================
def save_system_state(rag_system, save_dir="/content/drive/MyDrive/Practica_LLM_Engineering_25/electoral_rag_system"):
    """Guarda el estado completo del sistema"""
    import pickle
    import os

    print(f"\nüíæ GUARDANDO ESTADO DEL SISTEMA EN: {save_dir}")

    os.makedirs(save_dir, exist_ok=True)

    # Guardar configuraci√≥n
    config = {
        "model_path": GEMMA_MODEL_PATH,
        "knowledge_base_path": "/content/drive/MyDrive/Practica_LLM_Engineering_25/electoral_knowledge_base.pkl",
        "stats": rag_system.stats,
        "save_date": datetime.now().isoformat()
    }

    with open(f"{save_dir}/config.json", "w") as f:
        json.dump(config, f, indent=2)

    # Crear script de carga
    load_script = f"""
# Script para cargar el sistema RAG electoral
import sys
sys.path.append('{save_dir}')

from electoral_rag_system import ElectoralRAGSystem
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def load_electoral_rag_system():
    '''Carga el sistema RAG para an√°lisis electoral'''

    # Cargar modelo
    model = AutoModelForCausalLM.from_pretrained(
        "{GEMMA_MODEL_PATH}",
        torch_dtype=torch.float16,
        device_map="auto"
    )

    # Cargar tokenizer
    tokenizer = AutoTokenizer.from_pretrained("{GEMMA_MODEL_PATH}")

    # Crear sistema
    rag_system = ElectoralRAGSystem(model, tokenizer, "{save_dir}/knowledge_base.pkl")

    print("‚úÖ Sistema RAG cargado exitosamente")
    return rag_system

# Uso:
# rag_system = load_electoral_rag_system()
# respuesta = rag_system.answer_query("Tu pregunta aqu√≠")
"""

    with open(f"{save_dir}/load_system.py", "w") as f:
        f.write(load_script)

    print("‚úÖ Sistema guardado exitosamente")


Aqu√≠ ejecuta el sistema, bastante nefastamente.

In [13]:
# ============================================================================
# EJECUCI√ìN PRINCIPAL
# ============================================================================
if __name__ == "__main__":
    print("\n" + "=" * 70)
    print("üéØ SISTEMA RAG PARA AN√ÅLISIS ELECTORAL CON GEMMA-2-9B")
    print("=" * 70)

    # The numpy force-reinstall is now handled in cell L5mCFnpi-txr
    # !pip install -q numpy==1.26.4 --force-reinstall

    # Ejecutar prueba r√°pida
    quick_test()

    # Opci√≥n para guardar sistema
    save_option = input("\nüíæ ¬øGuardar estado del sistema? (s/n): ").lower()
    if save_option == 's':
        rag_system = setup_complete_rag_system()
        if rag_system:
            save_system_state(rag_system)

    print("\n" + "=" * 70)
    print("üéä ¬°SISTEMA CONFIGURADO EXITOSAMENTE!")
    print("=" * 70)


üéØ SISTEMA RAG PARA AN√ÅLISIS ELECTORAL CON GEMMA-2-9B

üß™ PRUEBA R√ÅPIDA DEL SISTEMA
üöÄ CONFIGURANDO SISTEMA RAG COMPLETO
Inicializando componentes RAG...
‚è≥ Cargando modelo de embeddings: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2


modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/645 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/471M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/526 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.08M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

‚úÖ Embedder cargado (dim: 384)
‚úÖ Sistema RAG inicializado
‚úÖ Base de conocimiento ya existe y completa, cargando...
‚úÖ √çndice cargado: 10000 documentos

üéØ SISTEMA RAG LISTO PARA USAR

‚ùì Pregunta: ¬øQu√© porcentaje de votos obtuvo el PSOE en el municipio de Santander en las elecciones de Abril 2019?
‚è≥ Procesando...


Unsloth: Input IDs of shape torch.Size([2, 2823]) with length 2823 > the model's max sequence length of 2048.
We shall truncate it ourselves. It's imperative if you correct this issue first.
Unsloth: Input IDs of shape torch.Size([1, 2776]) with length 2776 > the model's max sequence length of 2048.
We shall truncate it ourselves. It's imperative if you correct this issue first.


‚ùå Error en generaci√≥n: 'tuple' object has no attribute 'reorder_cache'

ü§ñ Respuesta: Error generando respuesta: 'tuple' object has no attribute 'reorder_cache'...
üìä Documentos recuperados: 8
‚è±Ô∏è  Tiempo total: 24.9s
--------------------------------------------------

‚ùì Pregunta: ¬øC√≥mo cambi√≥ la participaci√≥n electoral en el municipio de Madrid entre las elecciones de Abril 2019 y Noviembre 2019?
‚è≥ Procesando...


Unsloth: Input IDs of shape torch.Size([3, 2705]) with length 2705 > the model's max sequence length of 2048.
We shall truncate it ourselves. It's imperative if you correct this issue first.



ü§ñ Respuesta: % PP: 0.2% | % PSOE: 0.5% | % UP: 0.1% | % VOX: 0.1% | % Cs: 0.1% | % IU: 0.0% | % mayores 65 a√±os: 0.4% | % 20-64 a√±os: 0.5% | % menores 19 a√±os: 0.1% | % Afiliados SS aut√≥nomos: 0.3% | Participaci√≥n...
üìä Documentos recuperados: 8
‚è±Ô∏è  Tiempo total: 70.6s
--------------------------------------------------

‚ùì Pregunta: ¬øQu√© relaci√≥n hay entre renta personal y voto al PP en el municipio de Sevilla en las elecciones de Abril 2019?
‚è≥ Procesando...
‚ùå Error en generaci√≥n: 'tuple' object has no attribute 'reorder_cache'

ü§ñ Respuesta: Error generando respuesta: 'tuple' object has no attribute 'reorder_cache'...
üìä Documentos recuperados: 8
‚è±Ô∏è  Tiempo total: 3.13s
--------------------------------------------------

üéÆ ¬øIniciar sesi√≥n interactiva? (s/n): n

üíæ ¬øGuardar estado del sistema? (s/n): n

üéä ¬°SISTEMA CONFIGURADO EXITOSAMENTE!


Esta es una peque√±a muestra de los problemas que me encontr√© a la hora de intentar ejecutar el c√≥digo de DeepSeek.

# Task
The `NameError: name 'faiss' is not defined` indicates that the `faiss` module was not successfully imported. This is a direct consequence of the `ImportError: numpy.core.multiarray failed to import` which occurred earlier.

The core problem is an incompatibility between the `faiss` library and the NumPy version being used at the time of its import. The error message explicitly states: "A module that was compiled using NumPy 1.x cannot be run in NumPy 2.0.2".

Although the notebook attempts to install `numpy==1.26.4` (which is a NumPy 1.x version) early on, it appears that the Python runtime in Colab might default back to a newer NumPy 2.x version or load it from elsewhere during the session, especially after other installations (like Unsloth) and without a full runtime restart.

**To resolve this, please follow these steps:**

1.  **Restart the Colab Runtime:** Go to the Colab menu: `Runtime` -> `Restart runtime`.
2.  **Run all cells:** After the runtime has restarted, go to `Runtime` -> `Run all`.

This will ensure that `numpy==1.26.4` is correctly installed and recognized across the entire environment before `faiss` is imported, resolving the incompatibility issue.

## Explain the error

### Subtask:
Explain the `NameError: name 'faiss' is not defined` by tracing it back to the `ImportError: numpy.core.multiarray failed to import` and clarifying the NumPy/Faiss incompatibility due to the missing runtime restart.


### Explaining the Error: `NameError: name 'faiss' is not defined`

The `NameError: name 'faiss' is not defined` typically occurs when the `faiss` library fails to import correctly, leading to the `faiss` object not being available in the current scope.

Tracing the error back, we see an `ImportError: numpy.core.multiarray failed to import` which is the fundamental issue. The traceback explicitly mentions:

"A module that was compiled using NumPy 1.x cannot be run in NumPy 2.0.2 as it may crash. To support both 1.x and 2.x versions of NumPy, modules must be compiled with NumPy 2.0."

This indicates a version incompatibility. The Colab environment, by default, often includes a newer version of NumPy (like 2.x). While the notebook attempts to install `numpy==1.26.4` (a NumPy 1.x version) early on (in cell `ud-a7gem6-15`), the warning message after its installation `ERROR: pip's dependency resolver does not currently take into account all the packages that are installed... jax 0.7.2 requires numpy>=2.0, but you have numpy 1.26.4 which is incompatible.` highlights potential conflicts.

Crucially, the initial installation step in `ud-a7gem6-15` explicitly states:

`‚úÖ Dependencias instaladas. Por favor, REINICIA EL ENTORNO DE EJECUCI√ìN (Runtime -> Restart runtime) y luego vuelve a ejecutar todas las celdas.`

This runtime restart is vital. Without it, even if `numpy==1.26.4` is installed, the Python kernel might still be using the pre-installed NumPy 2.x, or other packages loaded later (like Unsloth and its dependencies) could re-introduce or conflict with the NumPy version, causing `faiss` (which was likely compiled against a specific NumPy 1.x ABI) to fail when it tries to load with a NumPy 2.x environment. The `ImportError` for `numpy.core.multiarray` is a strong indicator that `faiss` could not properly interface with the available NumPy.

**To resolve this, ensure a runtime restart is performed immediately after installing the initial dependencies in cell `ud-a7gem6-15` and before running any subsequent cells that rely on specific NumPy or `faiss` versions.**

## Instruct user to restart runtime and re-execute all cells

### Subtask:
Clearly instruct the user to manually restart the Colab runtime and then run all cells from the beginning to ensure the correct NumPy version is loaded for Faiss.


## Summary:

### Q&A
The `NameError: name 'faiss' is not defined` occurred because the `faiss` module failed to import correctly. This failure was due to an `ImportError: numpy.core.multiarray failed to import`, indicating a version incompatibility. Specifically, `faiss`, which was compiled against NumPy 1.x, was attempting to run in an environment where NumPy 2.0.2 was active. Even though `numpy==1.26.4` (a NumPy 1.x version) was installed, a crucial runtime restart was missed, leading to the Python kernel still using an incompatible NumPy 2.x version or conflicting with other loaded packages.

### Data Analysis Key Findings
*   The `NameError: name 'faiss' is not defined` was a direct result of an underlying `ImportError: numpy.core.multiarray failed to import`.
*   The `ImportError` stemmed from an incompatibility between `faiss`, which was compiled with NumPy 1.x, and the NumPy 2.0.2 version detected in the Colab environment.
*   Despite installing `numpy==1.26.4`, the absence of a required runtime restart after installation meant the environment did not consistently use the intended NumPy version, thus preventing `faiss` from loading correctly.

### Insights or Next Steps
*   Always perform a full runtime restart in Colab immediately after installing specific dependency versions (e.g., NumPy) to ensure the new versions are correctly loaded and recognized by the Python kernel across all subsequent operations.
