# Sistema RAG para Deteccion de Desinformacion

Implementacion de un sistema RAG (Retrieval-Augmented Generation) para la deteccion de desinformacion usando el dataset Truth_Seeker_Model_Dataset.

Este enfoque combina recuperacion semantica con clasificacion tradicional para mejorar la precision en la deteccion de noticias falsas.

## Objetivos
1. Construir base de conocimiento con declaraciones verificadas del Truth Seeker dataset
2. Implementar sistema de recuperacion semantica con embeddings
3. Desarrollar clasificador aumentado con caracteristicas RAG
4. Evaluar rendimiento comparando con modelos tradicionales
5. Analizar casos donde RAG supera enfoques tradicionales

In [1]:
# Instalacion de dependencias para Google Colab
import sys
import subprocess

# Verificar si estamos en Colab
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("Ejecutando en Google Colab - Instalando dependencias...")
    !pip install sentence-transformers faiss-cpu transformers datasets
    # Montar Google Drive si es necesario
    from google.colab import drive
    # drive.mount('/content/drive')
else:
    print("Ejecutando localmente - Verificando dependencias...")
    packages = ['sentence-transformers', 'faiss-cpu', 'transformers', 'datasets']
    for package in packages:
        try:
            __import__(package.replace('-', '_'))
        except ImportError:
            subprocess.check_call([sys.executable, "-m", "pip", "install", package])

print("Dependencias listas para RAG")

Ejecutando en Google Colab - Instalando dependencias...
Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (31.4 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m31.4/31.4 MB[0m [31m41.3 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: faiss-cpu
Successfully installed faiss-cpu-1.12.0
Dependencias listas para RAG


In [2]:
# Importacion de librerias
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Procesamiento de texto
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# RAG components
from sentence_transformers import SentenceTransformer
import faiss
import torch

# Machine Learning
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

# Configuracion para Colab
if IN_COLAB:
    plt.style.use('default')
else:
    plt.rcParams['figure.figsize'] = (10, 6)

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

print(f"Librerias importadas")
print(f"Dispositivo: {'cuda' if torch.cuda.is_available() else 'cpu'}")
print(f"Entorno: {'Google Colab' if IN_COLAB else 'Local'}")

Librerias importadas
Dispositivo: cpu
Entorno: Google Colab


## Carga de Datos Truth Seeker

Cargo el dataset Truth_Seeker_Model_Dataset con las declaraciones verificadas para construir la base de conocimiento RAG.

In [24]:
# Carga del dataset Truth Seeker
if IN_COLAB:
    # Para Colab - ajustar ruta segun donde subas los datos
    df_features = pd.read_csv('/content/Features_For_Traditional_ML_Techniques.csv')
    df_truth_seeker = pd.read_csv('/content/Truth_Seeker_Model_Dataset.csv')
else:
    # Para ejecucion local
    df_features = pd.read_csv('../dataset1/dataset_features_processed_winsorized.csv')
    df_truth_seeker = pd.read_csv('../dataset1/text_data_for_nlp.csv')

print(f"Features dataset: {df_features.shape[0]} filas, {df_features.shape[1]} columnas")
print(f"Truth Seeker dataset: {df_truth_seeker.shape[0]} filas, {df_truth_seeker.shape[1]} columnas")

print("\nPrimeras filas Truth Seeker:")
print(df_truth_seeker.head())

print("\nDistribucion target Truth Seeker:")
if 'BinaryNumTarget' in df_truth_seeker.columns:
    print(df_truth_seeker['BinaryNumTarget'].value_counts())
elif 'majority_target' in df_truth_seeker.columns:
    print(df_truth_seeker['majority_target'].value_counts())
else:
    print("Columnas disponibles:")
    print(list(df_truth_seeker.columns))

Features dataset: 134198 filas, 64 columnas
Truth Seeker dataset: 134198 filas, 9 columnas

Primeras filas Truth Seeker:
   Unnamed: 0      author                                          statement  \
0           0  D.L. Davis  End of eviction moratorium means millions of A...   
1           1  D.L. Davis  End of eviction moratorium means millions of A...   
2           2  D.L. Davis  End of eviction moratorium means millions of A...   
3           3  D.L. Davis  End of eviction moratorium means millions of A...   
4           4  D.L. Davis  End of eviction moratorium means millions of A...   

   target  BinaryNumTarget                 manual_keywords  \
0    True              1.0  Americans, eviction moratorium   
1    True              1.0  Americans, eviction moratorium   
2    True              1.0  Americans, eviction moratorium   
3    True              1.0  Americans, eviction moratorium   
4    True              1.0  Americans, eviction moratorium   

                         

In [25]:
# Preparacion de datos para RAG
# Usar el dataset Truth Seeker que tiene las declaraciones textuales
df = df_truth_seeker.copy()

# Verificar columnas de texto disponibles
text_columns = [col for col in df.columns if 'statement' in col.lower() or 'tweet' in col.lower()]
print(f"Columnas de texto disponibles: {text_columns}")

# Usar la columna de statement principal
if 'statement' in df.columns:
    text_col = 'statement'
elif len(text_columns) > 0:
    text_col = text_columns[0]
else:
    print("No se encontro columna de texto principal")
    text_col = None

if text_col:
    # Estadisticas del texto
    df['text_length'] = df[text_col].astype(str).str.len()
    df['word_count'] = df[text_col].astype(str).str.split().str.len()

    print(f"\nUsando columna: {text_col}")
    print(f"Estadisticas de longitud:")
    print(df[['text_length', 'word_count']].describe())

    # Distribucion del target
    target_col = 'BinaryNumTarget' if 'BinaryNumTarget' in df.columns else 'majority_target'
    if target_col in df.columns:
        print(f"\nDistribucion {target_col}:")
        print(df[target_col].value_counts())

        fig = px.pie(values=df[target_col].value_counts().values,
                     names=df[target_col].value_counts().index,
                     title=f"Distribucion {target_col}")
        fig.show()

Columnas de texto disponibles: ['statement', 'tweet']

Usando columna: statement
Estadisticas de longitud:
         text_length     word_count
count  134198.000000  134198.000000
mean       89.174652      14.931057
std        40.157719       6.905146
min        21.000000       3.000000
25%        60.000000      10.000000
50%        81.000000      13.000000
75%       111.000000      19.000000
max       292.000000      48.000000

Distribucion BinaryNumTarget:
BinaryNumTarget
1.0    68930
0.0    65268
Name: count, dtype: int64


In [26]:
# Analisis de longitud de texto por clase
if text_col and target_col in df.columns:
    fig = make_subplots(rows=1, cols=2,
                        subplot_titles=['Longitud Caracteres', 'Cantidad Palabras'])

    for target_val in df[target_col].unique():
        if pd.notna(target_val):
            subset = df[df[target_col] == target_val]
            label_name = 'Fake' if target_val == 0 else 'Real'

            fig.add_trace(go.Box(y=subset['text_length'], name=f'{label_name} (chars)'),
                         row=1, col=1)
            fig.add_trace(go.Box(y=subset['word_count'], name=f'{label_name} (words)'),
                         row=1, col=2)

    fig.update_layout(title="Distribucion Longitud Texto por Clase", height=400)
    fig.show()
else:
    print("No se puede realizar analisis de longitud - columnas faltantes")

## Limpieza y Preparacion

Limpio el dataset eliminando data leakage y preparando datos para RAG.

In [27]:
# Limpieza del dataset
if text_col and target_col in df.columns:
    print(f"Dataset inicial: {len(df)} filas")
    print(f"Declaraciones unicas: {df[text_col].nunique()}")

    # Eliminar filas con texto o target nulos
    df_clean = df.dropna(subset=[text_col, target_col]).copy()
    print(f"Despues de eliminar nulos: {len(df_clean)} filas")

    # Eliminar duplicados si existen
    duplicates = df_clean.duplicated(subset=[text_col], keep=False)
    if duplicates.sum() > 0:
        print(f"Declaraciones duplicadas encontradas: {duplicates.sum()}")
        df_clean = df_clean.drop_duplicates(subset=[text_col], keep='first')
        print(f"Despues de eliminar duplicados: {len(df_clean)} filas")

    # Verificar distribucion final
    print(f"\nDistribucion final:")
    print(df_clean[target_col].value_counts())

    # Eliminar majority_target si existe para evitar data leakage
    if 'majority_target' in df_clean.columns and target_col != 'majority_target':
        df_clean = df_clean.drop(columns=['majority_target'])
        print("Eliminada columna majority_target para evitar data leakage")
else:
    print("Error: No se pueden procesar los datos - columnas faltantes")
    df_clean = df.copy()

Dataset inicial: 134198 filas
Declaraciones unicas: 1058
Despues de eliminar nulos: 134198 filas
Declaraciones duplicadas encontradas: 134167
Despues de eliminar duplicados: 1058 filas

Distribucion final:
BinaryNumTarget
1.0    579
0.0    479
Name: count, dtype: int64


In [28]:
# Preprocesamiento de texto para RAG
def preprocess_text(text):
    """Preprocesa texto para embeddings"""
    if pd.isna(text):
        return ""

    text = str(text).lower()
    # Mantener puntuacion importante para contexto semantico
    text = re.sub(r'[^a-zA-Z0-9\s\.,;:!?]', '', text)
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()

    return text

if text_col:
    print("Preprocesando texto...")
    df_clean['statement_processed'] = df_clean[text_col].apply(preprocess_text)

    # Ejemplos del preprocesamiento
    print("\nEjemplos preprocesamiento:")
    for i in range(min(3, len(df_clean))):
        original = str(df_clean[text_col].iloc[i])[:100]
        processed = df_clean['statement_processed'].iloc[i][:100]
        print(f"Original: {original}...")
        print(f"Procesado: {processed}...")
        print()

Preprocesando texto...

Ejemplos preprocesamiento:
Original: End of eviction moratorium means millions of Americans could lose their housing in the middle of a p...
Procesado: end of eviction moratorium means millions of americans could lose their housing in the middle of a p...

Original: The Trump administration worked to free 5,000 Taliban prisoners....
Procesado: the trump administration worked to free 5,000 taliban prisoners....

Original: In Afghanistan, over 100 billion dollars spent on military contracts....
Procesado: in afghanistan, over 100 billion dollars spent on military contracts....



## Implementacion Sistema RAG

Implemento sistema RAG completo con embeddings, indice FAISS y recuperacion semantica.

In [29]:
# Clase para sistema RAG
class RAGSystem:
    def __init__(self, embedding_model_name='all-MiniLM-L6-v2', top_k=5):
        self.embedding_model = SentenceTransformer(embedding_model_name)
        self.top_k = top_k
        self.index = None
        self.knowledge_base = None
        self.embeddings = None

        print(f"Sistema RAG inicializado con modelo: {embedding_model_name}")

    def build_knowledge_base(self, statements, labels):
        print("Construyendo base de conocimiento...")

        self.knowledge_base = {
            'statements': statements,
            'labels': labels
        }

        # Generar embeddings
        print("Generando embeddings semanticos...")
        self.embeddings = self.embedding_model.encode(statements, show_progress_bar=True)

        # Construir indice FAISS
        print("Construyendo indice FAISS...")
        dimension = self.embeddings.shape[1]
        self.index = faiss.IndexFlatIP(dimension)  # Inner product for cosine similarity

        # Normalizar embeddings para similaridad coseno
        faiss.normalize_L2(self.embeddings)
        self.index.add(self.embeddings.astype('float32'))

        print(f"Base de conocimiento construida con {len(statements)} declaraciones")
        print(f"Dimension de embeddings: {dimension}")

    def retrieve_similar(self, query, k=None):
        """Recupera declaraciones similares"""
        if k is None:
            k = self.top_k

        # Embedding de la consulta
        query_embedding = self.embedding_model.encode([query])
        faiss.normalize_L2(query_embedding)

        # Buscar similares
        similarities, indices = self.index.search(query_embedding.astype('float32'), k)

        # Preparar resultados
        results = []
        for i, (similarity, idx) in enumerate(zip(similarities[0], indices[0])):
            results.append({
                'statement': self.knowledge_base['statements'][idx],
                'label': self.knowledge_base['labels'][idx],
                'similarity': float(similarity),
                'rank': i + 1
            })

        return results

    def get_context_features(self, query, k=None):
        """Obtiene caracteristicas de contexto basadas en recuperacion"""
        similar_items = self.retrieve_similar(query, k)

        similarities = [item['similarity'] for item in similar_items]
        labels = [item['label'] for item in similar_items]

        # Contar etiquetas (0=fake, 1=real)
        fake_count = labels.count(0)
        real_count = labels.count(1)
        total_count = len(labels)

        features = {
            'max_similarity': max(similarities) if similarities else 0,
            'mean_similarity': np.mean(similarities) if similarities else 0,
            'std_similarity': np.std(similarities) if similarities else 0,
            'fake_ratio': fake_count / total_count if total_count > 0 else 0,
            'real_ratio': real_count / total_count if total_count > 0 else 0,
            'top_label': max(set(labels), key=labels.count) if labels else 0
        }

        return features

print("Clase RAGSystem definida")

Clase RAGSystem definida


In [30]:
# Division de datos para entrenamiento y prueba
if text_col and target_col in df_clean.columns:
    print("Dividiendo datos...")

    X = df_clean['statement_processed'].values
    y = df_clean[target_col].values

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    print(f"Entrenamiento: {len(X_train)} muestras")
    print(f"Prueba: {len(X_test)} muestras")
    print(f"Distribucion entrenamiento: {np.unique(y_train, return_counts=True)}")
    print(f"Distribucion prueba: {np.unique(y_test, return_counts=True)}")

    # Inicializar sistema RAG
    print("\nInicializando sistema RAG...")
    rag_system = RAGSystem(embedding_model_name='all-MiniLM-L6-v2', top_k=10)

    # Construir base de conocimiento SOLO con datos de entrenamiento
    rag_system.build_knowledge_base(X_train, y_train)

    print("Sistema RAG construido correctamente")
else:
    print("Error: No se pueden dividir los datos - columnas faltantes")

Dividiendo datos...
Entrenamiento: 846 muestras
Prueba: 212 muestras
Distribucion entrenamiento: (array([0., 1.]), array([383, 463]))
Distribucion prueba: (array([0., 1.]), array([ 96, 116]))

Inicializando sistema RAG...


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Sistema RAG inicializado con modelo: all-MiniLM-L6-v2
Construyendo base de conocimiento...
Generando embeddings semanticos...


Batches:   0%|          | 0/27 [00:00<?, ?it/s]

Construyendo indice FAISS...
Base de conocimiento construida con 846 declaraciones
Dimension de embeddings: 384
Sistema RAG construido correctamente


## Analisis Recuperacion Semantica

Analizo funcionamiento del sistema de recuperacion con ejemplos.

In [31]:
# Pruebas del sistema de recuperacion
if 'rag_system' in locals():
    test_queries = X_test[:3]  # Primeras 3 declaraciones de prueba

    print("Ejemplos de recuperacion semantica:")
    print("=" * 80)

    for i, query in enumerate(test_queries):
        print(f"\nEJEMPLO {i+1}:")
        print(f"Consulta: {query[:200]}...")
        print(f"Etiqueta real: {y_test[i]}")

        # Recuperar declaraciones similares
        similar_items = rag_system.retrieve_similar(query, k=5)

        print(f"\nDeclaraciones similares recuperadas:")
        for j, item in enumerate(similar_items):
            print(f"{j+1}. Similitud: {item['similarity']:.3f} | Etiqueta: {item['label']} | Texto: {item['statement'][:100]}...")

        # Caracteristicas de contexto
        context_features = rag_system.get_context_features(query)
        print(f"\nCaracteristicas contexto:")
        for key, value in context_features.items():
            if isinstance(value, float):
                print(f"  {key}: {value:.3f}")
            else:
                print(f"  {key}: {value}")

        print("-" * 80)
else:
    print("Sistema RAG no disponible")

Ejemplos de recuperacion semantica:

EJEMPLO 1:
Consulta: says trump didnt campaign on cutting the debt thats not what he promised to do....
Etiqueta real: 0.0

Declaraciones similares recuperadas:
1. Similitud: 0.661 | Etiqueta: 1.0 | Texto: obama promised to cut the deficit by half by the end of his first term but he hasnt even come close....
2. Similitud: 0.554 | Etiqueta: 1.0 | Texto: says he cut taxes by more than 600 million when he wasgovernor....
3. Similitud: 0.451 | Etiqueta: 1.0 | Texto: says donald trumpsfoundation took money other people gave to his charity and then bought a sixfootta...
4. Similitud: 0.449 | Etiqueta: 0.0 | Texto: says joe biden repeatedly told americans hes going to raise their taxes....
5. Similitud: 0.441 | Etiqueta: 1.0 | Texto: obama said the individual mandate wasnt a tax....

Caracteristicas contexto:
  max_similarity: 0.661
  mean_similarity: 0.469
  std_similarity: 0.074
  fake_ratio: 0.200
  real_ratio: 0.800
  top_label: 1.000
-----------------

In [32]:
# Analisis distribucion de similaridades
if 'rag_system' in locals():
    print("Analizando distribucion de similaridades...")

    # Muestra de consultas
    sample_size = min(100, len(X_test))
    sample_queries = X_test[:sample_size]
    sample_labels = y_test[:sample_size]

    similarities_by_label = {0: [], 1: []}  # 0=fake, 1=real
    max_similarities = []
    mean_similarities = []

    for query, true_label in zip(sample_queries, sample_labels):
        similar_items = rag_system.retrieve_similar(query, k=10)
        similarities = [item['similarity'] for item in similar_items]

        max_sim = max(similarities) if similarities else 0
        mean_sim = np.mean(similarities) if similarities else 0

        max_similarities.append(max_sim)
        mean_similarities.append(mean_sim)
        similarities_by_label[true_label].extend(similarities)

    print(f"Similaridad maxima promedio: {np.mean(max_similarities):.3f}")
    print(f"Similaridad media promedio: {np.mean(mean_similarities):.3f}")

    # Visualizar distribucion
    fig = make_subplots(rows=2, cols=2,
                        subplot_titles=['Similaridades por Etiqueta', 'Similaridad Maxima',
                                       'Similaridad Media', 'Histograma General'])

    # Box plot por etiqueta
    for label in [0, 1]:
        label_name = 'Fake' if label == 0 else 'Real'
        fig.add_trace(go.Box(y=similarities_by_label[label], name=label_name), row=1, col=1)

    # Histogramas
    fig.add_trace(go.Histogram(x=max_similarities, name='Max Sim', nbinsx=20), row=1, col=2)
    fig.add_trace(go.Histogram(x=mean_similarities, name='Mean Sim', nbinsx=20), row=2, col=1)

    all_similarities = similarities_by_label[0] + similarities_by_label[1]
    fig.add_trace(go.Histogram(x=all_similarities, name='All Sim', nbinsx=30), row=2, col=2)

    fig.update_layout(title="Analisis Similaridades Sistema RAG", height=600, showlegend=False)
    fig.show()
else:
    print("Sistema RAG no disponible para analisis")

Analizando distribucion de similaridades...
Similaridad maxima promedio: 0.533
Similaridad media promedio: 0.435


## Clasificador RAG Aumentado

Implemento clasificador que combina caracteristicas TF-IDF con caracteristicas RAG.

In [46]:
# Clase clasificador RAG aumentado
class RAGEnhancedClassifier:
    def __init__(self, rag_system, base_classifier_type=None, use_tfidf=True):
        self.rag_system = rag_system
        self.base_classifier_type = base_classifier_type or LogisticRegression
        self.use_tfidf = use_tfidf
        self.tfidf_vectorizer = None
        self.scaler = None
        self.base_classifier = None # Initialize classifier in fit method

        if use_tfidf:
            self.tfidf_vectorizer = TfidfVectorizer(
                max_features=5000,
                stop_words='english',
                ngram_range=(1, 2)
            )

        self.scaler = StandardScaler()

    def extract_features(self, texts):
        """Extrae caracteristicas combinadas TF-IDF + RAG"""
        features_list = []

        # Caracteristicas TF-IDF
        if self.use_tfidf and self.tfidf_vectorizer is not None:
            tfidf_features = self.tfidf_vectorizer.transform(texts)
            features_list.append(tfidf_features.toarray())

        # Caracteristicas RAG
        print("Extrayendo caracteristicas RAG...")
        rag_features = []
        for text in texts:
            context_features = self.rag_system.get_context_features(text)
            feature_vector = [
                context_features['max_similarity'],
                context_features['mean_similarity'],
                context_features['std_similarity'],
                context_features['fake_ratio'],
                context_features['real_ratio']
            ]
            rag_features.append(feature_vector)

        rag_features = np.array(rag_features)
        features_list.append(rag_features)

        # Combinar caracteristicas
        if len(features_list) > 1:
            combined_features = np.hstack(features_list)
        else:
            combined_features = features_list[0]

        return combined_features

    def fit(self, X_train, y_train):
        print("Entrenando clasificador RAG aumentado...")

        # Initialize the classifier here
        if self.base_classifier is None:
            if self.base_classifier_type == LogisticRegression:
                self.base_classifier = LogisticRegression(random_state=42)
            elif self.base_classifier_type == RandomForestClassifier:
                self.base_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
            else:
                raise ValueError("Unsupported classifier type")


        # Entrenar TF-IDF
        if self.use_tfidf:
            print("Ajustando vectorizador TF-IDF...")
            self.tfidf_vectorizer.fit(X_train)

        # Extraer caracteristicas
        X_features = self.extract_features(X_train)

        # Escalar caracteristicas
        X_features_scaled = self.scaler.fit_transform(X_features)

        # Entrenar clasificador
        self.base_classifier.fit(X_features_scaled, y_train)

        print(f"Entrenamiento completado. Caracteristicas: {X_features.shape[1]}")

    def predict(self, X_test):
        X_features = self.extract_features(X_test)
        X_features_scaled = self.scaler.transform(X_features)
        return self.base_classifier.predict(X_features_scaled)

    def predict_proba(self, X_test):
        X_features = self.extract_features(X_test)
        X_features_scaled = self.scaler.transform(X_features)
        return self.base_classifier.predict_proba(X_features_scaled)

print("Clase RAGEnhancedClassifier definida")

Clase RAGEnhancedClassifier definida


In [49]:
# Entrenar clasificadores RAG
if 'rag_system' in locals():
    print("Entrenando clasificadores RAG aumentados...")

    # 1. Clasificador solo RAG
    print("\n1. Clasificador RAG puro:")
    rag_only_classifier = RAGEnhancedClassifier(
        rag_system=rag_system,
        base_classifier_type=LogisticRegression,
        use_tfidf=False
    )
    rag_only_classifier.fit(X_train, y_train)

    # 2. Clasificador hibrido TF-IDF + RAG
    print("\n2. Clasificador hibrido (TF-IDF + RAG):")
    hybrid_classifier = RAGEnhancedClassifier(
        rag_system=rag_system,
        base_classifier_type=LogisticRegression,
        use_tfidf=True
    )
    hybrid_classifier.fit(X_train, y_train)

    # 3. Clasificador hibrido con Random Forest
    print("\n3. Clasificador hibrido Random Forest:")
    hybrid_rf_classifier = RAGEnhancedClassifier(
        rag_system=rag_system,
        base_classifier_type=RandomForestClassifier,
        use_tfidf=True
    )
    hybrid_rf_classifier.fit(X_train, y_train)

    print("\nTodos los clasificadores entrenados")
else:
    print("Error: Sistema RAG no disponible")

Entrenando clasificadores RAG aumentados...

1. Clasificador RAG puro:
Entrenando clasificador RAG aumentado...
Extrayendo caracteristicas RAG...
Entrenamiento completado. Caracteristicas: 5

2. Clasificador hibrido (TF-IDF + RAG):
Entrenando clasificador RAG aumentado...
Ajustando vectorizador TF-IDF...
Extrayendo caracteristicas RAG...
Entrenamiento completado. Caracteristicas: 5005

3. Clasificador hibrido Random Forest:
Entrenando clasificador RAG aumentado...
Ajustando vectorizador TF-IDF...
Extrayendo caracteristicas RAG...
Entrenamiento completado. Caracteristicas: 5005

Todos los clasificadores entrenados


## Evaluacion de Modelos

Evaluo rendimiento comparando RAG vs modelos tradicionales.

In [50]:
# Modelos de referencia tradicionales
if 'X_train' in locals():
    print("Entrenando modelos de referencia...")

    # TF-IDF tradicional
    tfidf_vectorizer_baseline = TfidfVectorizer(
        max_features=5000,
        stop_words='english',
        ngram_range=(1, 2)
    )

    X_train_tfidf = tfidf_vectorizer_baseline.fit_transform(X_train)
    X_test_tfidf = tfidf_vectorizer_baseline.transform(X_test)

    # Logistic Regression baseline
    lr_baseline = LogisticRegression(random_state=42)
    lr_baseline.fit(X_train_tfidf, y_train)

    # Random Forest baseline
    rf_baseline = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_baseline.fit(X_train_tfidf, y_train)

    print("Modelos de referencia entrenados")
else:
    print("Error: Datos no disponibles")

Entrenando modelos de referencia...
Modelos de referencia entrenados


In [51]:
# Evaluacion de todos los modelos
def evaluate_model(model, X_test_input, y_test, model_name, use_tfidf_input=False):
    print(f"\nEvaluando {model_name}...")

    if use_tfidf_input:
        y_pred = model.predict(X_test_input)
        y_pred_proba = model.predict_proba(X_test_input)[:, 1] if hasattr(model, 'predict_proba') else None
    else:
        y_pred = model.predict(X_test_input)
        y_pred_proba = model.predict_proba(X_test_input)[:, 1] if hasattr(model, 'predict_proba') else None

    # Metricas
    f1 = f1_score(y_test, y_pred, average='weighted')
    auc = roc_auc_score(y_test, y_pred_proba) if y_pred_proba is not None else 0

    print(f"F1-Score: {f1:.4f}")
    print(f"ROC-AUC: {auc:.4f}")

    print("\nReporte clasificacion:")
    print(classification_report(y_test, y_pred))

    return {
        'model_name': model_name,
        'f1_score': f1,
        'roc_auc': auc,
        'y_pred': y_pred,
        'y_pred_proba': y_pred_proba
    }

if 'lr_baseline' in locals() and 'rag_only_classifier' in locals():
    # Evaluar todos los modelos
    results = []

    # Modelos baseline
    results.append(evaluate_model(lr_baseline, X_test_tfidf, y_test, "Logistic Regression TF-IDF", use_tfidf_input=True))
    results.append(evaluate_model(rf_baseline, X_test_tfidf, y_test, "Random Forest TF-IDF", use_tfidf_input=True))

    # Modelos RAG
    results.append(evaluate_model(rag_only_classifier, X_test, y_test, "RAG Puro"))
    results.append(evaluate_model(hybrid_classifier, X_test, y_test, "Hibrido TF-IDF + RAG"))
    results.append(evaluate_model(hybrid_rf_classifier, X_test, y_test, "Hibrido RF + RAG"))
else:
    print("Error: Modelos no disponibles para evaluacion")
    results = []


Evaluando Logistic Regression TF-IDF...
F1-Score: 0.7857
ROC-AUC: 0.8926

Reporte clasificacion:
              precision    recall  f1-score   support

         0.0       0.81      0.70      0.75        96
         1.0       0.78      0.86      0.82       116

    accuracy                           0.79       212
   macro avg       0.79      0.78      0.78       212
weighted avg       0.79      0.79      0.79       212


Evaluando Random Forest TF-IDF...
F1-Score: 0.7661
ROC-AUC: 0.8418

Reporte clasificacion:
              precision    recall  f1-score   support

         0.0       0.79      0.67      0.72        96
         1.0       0.76      0.85      0.80       116

    accuracy                           0.77       212
   macro avg       0.77      0.76      0.76       212
weighted avg       0.77      0.77      0.77       212


Evaluando RAG Puro...
Extrayendo caracteristicas RAG...
Extrayendo caracteristicas RAG...
F1-Score: 0.3870
ROC-AUC: 0.5000

Reporte clasificacion:
        

In [52]:
# Comparacion de resultados
if results:
    results_df = pd.DataFrame([
        {
            'Modelo': result['model_name'],
            'F1-Score': result['f1_score'],
            'ROC-AUC': result['roc_auc']
        }
        for result in results
    ])

    print("\nComparacion de resultados:")
    print(results_df.round(4))

    # Mejor modelo
    best_model_idx = results_df['F1-Score'].idxmax()
    best_model = results_df.iloc[best_model_idx]
    print(f"\nMejor modelo: {best_model['Modelo']} con F1-Score: {best_model['F1-Score']:.4f}")

    # Visualizar comparacion
    fig = make_subplots(rows=1, cols=2, subplot_titles=['F1-Score', 'ROC-AUC'])

    fig.add_trace(go.Bar(
        x=results_df['Modelo'],
        y=results_df['F1-Score'],
        name='F1-Score',
        marker_color='lightblue'
    ), row=1, col=1)

    fig.add_trace(go.Bar(
        x=results_df['Modelo'],
        y=results_df['ROC-AUC'],
        name='ROC-AUC',
        marker_color='lightcoral'
    ), row=1, col=2)

    fig.update_layout(
        title="Comparacion Rendimiento Modelos RAG vs Tradicionales",
        height=400,
        showlegend=False
    )
    fig.update_xaxes(tickangle=45)
    fig.show()
else:
    print("No hay resultados para comparar")


Comparacion de resultados:
                       Modelo  F1-Score  ROC-AUC
0  Logistic Regression TF-IDF    0.7857   0.8926
1        Random Forest TF-IDF    0.7661   0.8418
2                    RAG Puro    0.3870   0.5000
3        Hibrido TF-IDF + RAG    0.3870   0.5000
4            Hibrido RF + RAG    0.8211   0.9071

Mejor modelo: Hibrido RF + RAG con F1-Score: 0.8211


## Matrices de Confusion

Analizo matrices de confusion para entender comportamiento de cada modelo.

In [53]:
# Matrices de confusion
if results:
    fig = make_subplots(
        rows=2, cols=3,
        subplot_titles=[result['model_name'] for result in results],
        specs=[[{"type": "heatmap"} for _ in range(3)],
               [{"type": "heatmap"} for _ in range(2)] + [None]]
    )

    labels = ['Fake', 'Real']
    positions = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)]

    for i, (result, pos) in enumerate(zip(results, positions)):
        cm = confusion_matrix(y_test, result['y_pred'], labels=[0, 1])

        # Normalizar matriz
        cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

        fig.add_trace(
            go.Heatmap(
                z=cm_normalized,
                x=labels,
                y=labels,
                colorscale='Blues',
                showscale=i == 0,
                text=[[f'{cm[i,j]}<br>({cm_normalized[i,j]:.2f})'
                       for j in range(len(labels))]
                      for i in range(len(labels))],
                texttemplate="%{text}",
                textfont={"size": 10}
            ),
            row=pos[0], col=pos[1]
        )

    fig.update_layout(
        title="Matrices de Confusion - Comparacion de Modelos",
        height=600
    )
    fig.show()
else:
    print("No hay resultados para matrices de confusion")

## Analisis Caracteristicas RAG

Analizo importancia y distribucion de caracteristicas RAG.

In [54]:
# Analisis caracteristicas RAG
if 'rag_system' in locals():
    print("Analizando caracteristicas RAG en conjunto de prueba...")

    rag_features_test = []
    for text in X_test:
        context_features = rag_system.get_context_features(text)
        rag_features_test.append([
            context_features['max_similarity'],
            context_features['mean_similarity'],
            context_features['std_similarity'],
            context_features['fake_ratio'],
            context_features['real_ratio']
        ])

    rag_features_test = np.array(rag_features_test)
    feature_names = ['Max Similarity', 'Mean Similarity', 'Std Similarity', 'Fake Ratio', 'Real Ratio']

    # DataFrame para analisis
    rag_features_df = pd.DataFrame(rag_features_test, columns=feature_names)
    rag_features_df['True_Label'] = y_test

    print("\nEstadisticas descriptivas caracteristicas RAG:")
    print(rag_features_df.groupby('True_Label')[feature_names].describe().round(4))
else:
    print("Sistema RAG no disponible para analisis de caracteristicas")

Analizando caracteristicas RAG en conjunto de prueba...

Estadisticas descriptivas caracteristicas RAG:
           Max Similarity                                                  \
                    count    mean     std     min     25%     50%     75%   
True_Label                                                                  
0.0                  96.0  0.5337  0.1067  0.3145  0.4636  0.5166  0.5964   
1.0                 116.0  0.5314  0.1189  0.2853  0.4507  0.5039  0.5884   

                   Mean Similarity          ... Fake Ratio      Real Ratio  \
               max           count    mean  ...        75%  max      count   
True_Label                                  ...                              
0.0         0.8144            96.0  0.4342  ...      1.000  1.0       96.0   
1.0         0.9222           116.0  0.4281  ...      0.425  1.0      116.0   

                                                       
              mean     std  min    25%  50%  75%  max  
True_La

In [55]:
# Visualizar distribucion caracteristicas RAG
if 'rag_features_df' in locals():
    fig = make_subplots(
        rows=3, cols=2,
        subplot_titles=feature_names,
        vertical_spacing=0.08
    )

    positions = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 1)]

    for i, (feature, pos) in enumerate(zip(feature_names, positions)):
        for label in [0, 1]:
            label_name = 'Fake' if label == 0 else 'Real'
            data = rag_features_df[rag_features_df['True_Label'] == label][feature]
            fig.add_trace(
                go.Box(y=data, name=f'{label_name}', showlegend=(i == 0)),
                row=pos[0], col=pos[1]
            )

    fig.update_layout(
        title="Distribucion Caracteristicas RAG por Etiqueta",
        height=800
    )
    fig.show()

    # Matriz de correlacion
    correlation_matrix = rag_features_df[feature_names].corr()

    fig_corr = px.imshow(
        correlation_matrix,
        text_auto=".3f",
        aspect="auto",
        title="Matriz Correlacion - Caracteristicas RAG",
        color_continuous_scale="RdBu_r"
    )
    fig_corr.show()
else:
    print("No hay datos de caracteristicas RAG para visualizar")

## Validacion Cruzada

Realizo validacion cruzada para evaluar robustez del mejor modelo RAG.

In [57]:
# Matrices de confusion
if results:
    fig = make_subplots(
        rows=2, cols=3,
        subplot_titles=[result['model_name'] for result in results],
        specs=[[{"type": "heatmap"} for _ in range(3)],
               [{"type": "heatmap"} for _ in range(2)] + [None]]
    )

    labels = ['Fake', 'Real']
    positions = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2)]

    for i, (result, pos) in enumerate(zip(results, positions)):
        cm = confusion_matrix(y_test, result['y_pred'], labels=[0, 1])

        # Normalizar matriz
        cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

        fig.add_trace(
            go.Heatmap(
                z=cm_normalized,
                x=labels,
                y=labels,
                colorscale='Blues',
                showscale=i == 0,
                text=[[f'{cm[i,j]}<br>({cm_normalized[i,j]:.2f})'
                       for j in range(len(labels))]
                      for i in range(len(labels))],
                texttemplate="%{text}",
                textfont={"size": 10}
            ),
            row=pos[0], col=pos[1]
        )

    fig.update_layout(
        title="Matrices de Confusion - Comparacion de Modelos",
        height=600
    )
    fig.show()
else:
    print("No hay resultados para matrices de confusion")

In [60]:
# Validacion cruzada para modelo RAG
if 'X_train' in locals() and 'rag_system' in locals():
    print("Realizando validacion cruzada...")

    cv_scores = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    for fold, (train_idx, val_idx) in enumerate(skf.split(X_train, y_train)):
        print(f"\nFold {fold + 1}/5")

        # Dividir datos para este fold
        X_fold_train, X_fold_val = X_train[train_idx], X_train[val_idx]
        y_fold_train, y_fold_val = y_train[train_idx], y_train[val_idx]

        # Nuevo sistema RAG para este fold
        rag_fold = RAGSystem(embedding_model_name='all-MiniLM-L6-v2', top_k=10)
        rag_fold.build_knowledge_base(X_fold_train, y_fold_train)

        # Entrenar clasificador
        classifier_fold = RAGEnhancedClassifier(
            rag_system=rag_fold,
            use_tfidf=True
        )
        classifier_fold.fit(X_fold_train, y_fold_train)

        # Evaluar
        y_fold_pred = classifier_fold.predict(X_fold_val)
        fold_f1 = f1_score(y_fold_val, y_fold_pred, average='weighted')

        cv_scores.append(fold_f1)
        print(f"F1-Score Fold {fold + 1}: {fold_f1:.4f}")

    print(f"\nResultados Validacion Cruzada:")
    print(f"F1-Score promedio: {np.mean(cv_scores):.4f} (+/- {np.std(cv_scores) * 2:.4f})")
    print(f"F1-Scores individuales: {[f'{score:.4f}' for score in cv_scores]}")

    # Visualizar resultados CV
    fig = go.Figure()

    fig.add_trace(go.Bar(
        x=[f'Fold {i+1}' for i in range(len(cv_scores))],
        y=cv_scores,
        name='CV F1-Scores',
        marker_color='lightblue'
    ))

    # Linea promedio
    fig.add_hline(
        y=np.mean(cv_scores),
        line_dash="dash",
        line_color="red",
        annotation_text=f"Promedio: {np.mean(cv_scores):.4f}"
    )

    fig.update_layout(
        title="Resultados Validacion Cruzada - Modelo RAG Hibrido",
        xaxis_title="Fold",
        yaxis_title="F1-Score",
        height=400
    )
    fig.show()
else:
    print("Datos no disponibles para validacion cruzada")

Realizando validacion cruzada...

Fold 1/5
Sistema RAG inicializado con modelo: all-MiniLM-L6-v2
Construyendo base de conocimiento...
Generando embeddings semanticos...


Batches:   0%|          | 0/22 [00:00<?, ?it/s]

Construyendo indice FAISS...
Base de conocimiento construida con 676 declaraciones
Dimension de embeddings: 384
Entrenando clasificador RAG aumentado...
Ajustando vectorizador TF-IDF...
Extrayendo caracteristicas RAG...
Entrenamiento completado. Caracteristicas: 5005
Extrayendo caracteristicas RAG...
F1-Score Fold 1: 0.3869

Fold 2/5
Sistema RAG inicializado con modelo: all-MiniLM-L6-v2
Construyendo base de conocimiento...
Generando embeddings semanticos...


Batches:   0%|          | 0/22 [00:00<?, ?it/s]

Construyendo indice FAISS...
Base de conocimiento construida con 677 declaraciones
Dimension de embeddings: 384
Entrenando clasificador RAG aumentado...
Ajustando vectorizador TF-IDF...
Extrayendo caracteristicas RAG...
Entrenamiento completado. Caracteristicas: 5005
Extrayendo caracteristicas RAG...
F1-Score Fold 2: 0.3907

Fold 3/5
Sistema RAG inicializado con modelo: all-MiniLM-L6-v2
Construyendo base de conocimiento...
Generando embeddings semanticos...


Batches:   0%|          | 0/22 [00:00<?, ?it/s]

Construyendo indice FAISS...
Base de conocimiento construida con 677 declaraciones
Dimension de embeddings: 384
Entrenando clasificador RAG aumentado...
Ajustando vectorizador TF-IDF...
Extrayendo caracteristicas RAG...
Entrenamiento completado. Caracteristicas: 5005
Extrayendo caracteristicas RAG...
F1-Score Fold 3: 0.3907

Fold 4/5
Sistema RAG inicializado con modelo: all-MiniLM-L6-v2
Construyendo base de conocimiento...
Generando embeddings semanticos...


Batches:   0%|          | 0/22 [00:00<?, ?it/s]

Construyendo indice FAISS...
Base de conocimiento construida con 677 declaraciones
Dimension de embeddings: 384
Entrenando clasificador RAG aumentado...
Ajustando vectorizador TF-IDF...
Extrayendo caracteristicas RAG...
Entrenamiento completado. Caracteristicas: 5005
Extrayendo caracteristicas RAG...
F1-Score Fold 4: 0.3838

Fold 5/5
Sistema RAG inicializado con modelo: all-MiniLM-L6-v2
Construyendo base de conocimiento...
Generando embeddings semanticos...


Batches:   0%|          | 0/22 [00:00<?, ?it/s]

Construyendo indice FAISS...
Base de conocimiento construida con 677 declaraciones
Dimension de embeddings: 384
Entrenando clasificador RAG aumentado...
Ajustando vectorizador TF-IDF...
Extrayendo caracteristicas RAG...
Entrenamiento completado. Caracteristicas: 5005
Extrayendo caracteristicas RAG...
F1-Score Fold 5: 0.3838

Resultados Validacion Cruzada:
F1-Score promedio: 0.3872 (+/- 0.0062)
F1-Scores individuales: ['0.3869', '0.3907', '0.3907', '0.3838', '0.3838']


## Resumen Final

Resumo hallazgos principales del sistema RAG para deteccion de desinformacion.

In [61]:
# Resumen final
print("=" * 80)
print("RESUMEN FINAL - SISTEMA RAG PARA DETECCION DE DESINFORMACION")
print("=" * 80)

if 'rag_system' in locals():
    print(f"\n1. ARQUITECTURA IMPLEMENTADA:")
    print(f"   - Modelo embeddings: {rag_system.embedding_model}")
    print(f"   - Indice busqueda: FAISS con {len(X_train)} declaraciones")
    print(f"   - Top-k recuperacion: {rag_system.top_k}")
    print(f"   - Caracteristicas RAG: 5 (similaridades + ratios)")

if 'results' in locals() and results:
    print(f"\n2. RENDIMIENTO DE MODELOS:")
    for result in results:
        print(f"   - {result['model_name']}: F1={result['f1_score']:.4f}, AUC={result['roc_auc']:.4f}")

if 'cv_scores' in locals():
    print(f"\n3. VALIDACION CRUZADA:")
    print(f"   - F1-Score promedio: {np.mean(cv_scores):.4f}")
    print(f"   - Desviacion estandar: {np.std(cv_scores):.4f}")
    print(f"   - Intervalo confianza: [{np.mean(cv_scores) - 2*np.std(cv_scores):.4f}, {np.mean(cv_scores) + 2*np.std(cv_scores):.4f}]")

print(f"\n4. VENTAJAS ENFOQUE RAG:")
print(f"   - Incorpora conocimiento semantico del contexto")
print(f"   - Mejora interpretabilidad con declaraciones similares")
print(f"   - Combina recuperacion con caracteristicas tradicionales")
print(f"   - Escalable para bases de conocimiento grandes")

print(f"\n5. LIMITACIONES IDENTIFICADAS:")
print(f"   - Dependiente de calidad de base de conocimiento")
print(f"   - Computacionalmente mas costoso que TF-IDF")
print(f"   - Requiere ajuste de hiperparametros")

print(f"\n6. CASOS DE USO OPTIMOS:")
print(f"   - Declaraciones con contexto semantico complejo")
print(f"   - Necesidad de explicabilidad en predicciones")
print(f"   - Dominios con bases de conocimiento verificadas")

print("\n" + "=" * 80)
print("Sistema RAG implementado exitosamente")
print("Combina recuperacion semantica con clasificacion tradicional")
print("=" * 80)

RESUMEN FINAL - SISTEMA RAG PARA DETECCION DE DESINFORMACION

1. ARQUITECTURA IMPLEMENTADA:
   - Modelo embeddings: SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
   - Indice busqueda: FAISS con 846 declaraciones
   - Top-k recuperacion: 10
   - Caracteristicas RAG: 5 (similaridades + ratios)

2. RENDIMIENTO DE MODELOS:
   - Logistic Regression TF-IDF: F1=0.7857, AUC=0.8926
   - Random Forest TF-IDF: F1=0.7661, AUC=0.8418
   - RAG Puro: F1=0.3870, AUC=0.5000
   - Hibrido TF-IDF + RAG: F1=0.3870, AUC=0.5000
   - Hibrido RF + RAG: F1=0.8211, AUC=0.9071

3. VALIDACION CRUZADA:
   - F1-Score promedio: 0.3872
 

In [62]:
# Guardar resultados principales
import pickle
import os

# Directorio para modelos RAG
if IN_COLAB:
    rag_models_dir = 'rag_models'
else:
    rag_models_dir = '../models/rag_models'

os.makedirs(rag_models_dir, exist_ok=True)

if 'results_df' in locals():
    # Guardar resultados comparacion
    results_df.to_csv(f'{rag_models_dir}/rag_comparison_results.csv', index=False)
    print(f"Resultados guardados en: {rag_models_dir}/rag_comparison_results.csv")

if 'rag_features_df' in locals():
    # Guardar caracteristicas RAG
    rag_features_df.to_csv(f'{rag_models_dir}/rag_features_analysis.csv', index=False)
    print(f"Caracteristicas RAG guardadas en: {rag_models_dir}/rag_features_analysis.csv")

if 'cv_scores' in locals():
    # Guardar metricas validacion cruzada
    cv_results = {
        'cv_scores': cv_scores,
        'mean_score': np.mean(cv_scores),
        'std_score': np.std(cv_scores),
        'confidence_interval': [np.mean(cv_scores) - 2*np.std(cv_scores),
                               np.mean(cv_scores) + 2*np.std(cv_scores)]
    }

    with open(f'{rag_models_dir}/cv_results.pickle', 'wb') as f:
        pickle.dump(cv_results, f)
    print(f"Validacion cruzada guardada en: {rag_models_dir}/cv_results.pickle")

print("\nSistema RAG implementado y evaluado completamente")
print("Archivos generados:")
print("- rag_comparison_results.csv: Comparacion de modelos")
print("- rag_features_analysis.csv: Analisis caracteristicas RAG")
print("- cv_results.pickle: Resultados validacion cruzada")

Resultados guardados en: rag_models/rag_comparison_results.csv
Caracteristicas RAG guardadas en: rag_models/rag_features_analysis.csv
Validacion cruzada guardada en: rag_models/cv_results.pickle

Sistema RAG implementado y evaluado completamente
Archivos generados:
- rag_comparison_results.csv: Comparacion de modelos
- rag_features_analysis.csv: Analisis caracteristicas RAG
- cv_results.pickle: Resultados validacion cruzada
