# RAG SYSTEM
This code detects and justifies the presence of DISARM TTPs in an input news article, grounding its reasoning in information retrieved from external sources.

Before running it, install the required dependencies (requirements.txt).

You also need a HuggingFace token and an Event Registry API key.

The code is organized into 3 phases: Indexing, Retrieval and Generation.

## 1. INDEXING
In this phase, keywords are extracted from the input article and used to fetch related news through the Event Registry API. The retrieved articles are split into chunks (splits), which are then converted into vectors (embeddings) and stored in a vector database.
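
As a rough illustration of the chunking step, the sketch below cuts a text into overlapping windows. It is a simplification: it counts whitespace-separated words, whereas the splitter configured in section 1.3 counts tokens, and `chunk_words` is a hypothetical helper that is not part of this notebook.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into windows of chunk_size words; consecutive windows share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Toy input: 20 placeholder words (w0 ... w19)
sample = " ".join(f"w{i}" for i in range(20))
for chunk in chunk_words(sample):
    print(chunk)
```

The overlap means the last words of one chunk reappear at the start of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk.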

### 1.1. Keyword extractor

In [None]:
from keybert import KeyBERT
from nltk.corpus import stopwords
import string
from collections import Counter
import nltk
from difflib import SequenceMatcher

# Download the stopwords (only needed the first time)
nltk.download('stopwords')


# Return True if a and b are similar enough (i.e. their ratio exceeds threshold)
def similar(a, b, threshold=0.7):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold

    
    
# Keyword extractor built on the KeyBERT model
class KeywordExtractor:
    def __init__(self):
        self.kw_model = KeyBERT()
        self.stop_words = set(stopwords.words('english'))  # All input articles are in English
        self.punctuation = set(string.punctuation)
        self.min_word_length = 3

        
    # Clean and normalize a keyword
    def clean_keyword(self, kw):
        kw = kw.lower().strip()
        kw = ''.join([c for c in kw if c not in self.punctuation])
        return kw if (kw and kw not in self.stop_words 
                     and len(kw) >= self.min_word_length) else None

        
    # Extract top_n keywords using KeyBERT plus conventional filtering
    def extract_keywords(self, text, top_n=5):
        # Initial extraction with KeyBERT
        keywords = self.kw_model.extract_keywords(
            text,
            keyphrase_ngram_range=(1, 2),  # Allow single words and bigrams
            stop_words='english',
            top_n=top_n*3  # Extract extra candidates to filter afterwards
        )
        
        # Processing and cleaning
        cleaned = []
        for kw, score in keywords:
            kw_clean = self.clean_keyword(kw)
            if kw_clean:
                # Split bigrams into single words for better filtering
                cleaned.extend(kw_clean.split())
        
        # Term frequency and selection
        word_counts = Counter(cleaned)
        sorted_words = sorted(word_counts.items(), key=lambda x: (-x[1], -len(x[0])))

        # Diversified selection
        selected = []
        for word, count in sorted_words:
            if not any(similar(word, s) for s in selected):  # Skip words that are too similar
                selected.append(word)
                if len(selected) >= top_n:
                    break
        
        return selected[:top_n]


### 1.2. Fetching similar news
You need to register on Event Registry's newsapi website and obtain an API key in order to call the service.
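
One way to keep the key out of the notebook is to read it from an environment variable. The sketch below assumes a variable named `ER_API_KEY`; both that name and the `load_api_key` helper are hypothetical choices for illustration, not something Event Registry or this notebook defines.

```python
import os


def load_api_key(env_var="ER_API_KEY"):
    """Read the API key from an environment variable and fail early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Event Registry API key"
        )
    return key


# Possible usage in the cell below (assumes the variable has been exported beforehand):
# er_key = load_api_key()
```

Failing early with a clear message tends to be easier to diagnose than an authentication error raised later by the API client.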

In [None]:
from eventregistry import EventRegistry, QueryArticlesIter, QueryItems, ReturnInfo, ArticleInfoFlags
import requests
import re

er_key = ""  # Event Registry API key (fill in your own)
er = EventRegistry(apiKey=er_key)

# Extract the title and body, assuming the format "TITLE: ... BODY: ..."
def extraer_titulo_y_cuerpo(texto):
    titulo_match = re.search(r"TITLE:\s*(.*?)\s*BODY:", texto, re.DOTALL | re.IGNORECASE)
    cuerpo_match = re.search(r"BODY:\s*(.*)", texto, re.DOTALL | re.IGNORECASE)

    titulo = titulo_match.group(1).strip() if titulo_match else ""
    cuerpo = cuerpo_match.group(1).strip() if cuerpo_match else ""
    return titulo, cuerpo
        
# Used to discard API results that are the same article as the input
def mismo_articulo(art, noticia):
    titulo_original, cuerpo_original = extraer_titulo_y_cuerpo(noticia)

    titulo_art = art["title"].strip()
    cuerpo_art = art.get("body", "").strip()

    titulo_sim = similar(titulo_art, titulo_original, threshold=0.7)
    cuerpo_sim = similar(cuerpo_art[:500], cuerpo_original[:500], threshold=0.7)

    es_mismo = titulo_sim or cuerpo_sim
    return es_mismo

# Query Event Registry with the keywords
def buscar_noticias_parecidas(texto, max_noticias=100, lang="eng"):
    # Extract the keywords
    extractor = KeywordExtractor()
    keywords = extractor.extract_keywords(texto)
    print(f"Selected keywords: {keywords}")
    
    if not keywords:
        return []

    # Build the query parameters
    query_params = {
        "keywords": QueryItems.OR(keywords),
        "lang": lang
    }

    # Build the query
    q = QueryArticlesIter(**query_params)
    ret_info = ReturnInfo(articleInfo=ArticleInfoFlags(bodyLen=-1))  # bodyLen=-1 requests the full article body

    # Call the API
    try:
        articles = list(q.execQuery(er, returnInfo=ret_info, maxItems=max_noticias))
        filtrados = [art for art in articles if not mismo_articulo(art, texto)]
        return filtrados
    except Exception as e:
        print(f"Search error: {e}")
        return []

### 1.3. Split

In [None]:
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document
from langchain_community.vectorstores.utils import filter_complex_metadata

# Define the splitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300,  # Maximum chunk size, in tokens
    chunk_overlap=50)  # Overlap that avoids losing context between chunks

# Convert an article into a Document so it can be split
def convertir_a_documento(art):
    body = art["body"][:5000]  # Up to 5000 characters to avoid generating too many tokens
    # Flat metadata only; values that may be None (e.g. a missing eventUri) are filtered out in hacer_splits
    metadata = {
        "title"        : art["title"],
        "published"    : art["dateTimePub"],
        "source_name"  : art["source"]["title"],
        "source_domain": art["source"]["uri"],
        "url"          : art["url"],
        "authors"      : ", ".join(a["name"] for a in art["authors"]),
        "event_uri"    : art.get("eventUri"),
        "sentiment"    : art.get("sentiment"),
        "body_len"     : len(art["body"]),
    }
    doc = Document(
        page_content = body,
        metadata     = metadata
    )
    return doc

# Split the articles into chunks
def hacer_splits(articulos):
    documentos = [convertir_a_documento(articulo) for articulo in articulos]
    # Drop metadata values that are not str, int, float or bool, which the vector store cannot hold
    documentos = filter_complex_metadata(documentos)
    splits = text_splitter.split_documents(documentos)
    return splits

### 1.4. Storing in the vector database

In [None]:
# Log in to HuggingFace beforehand to access the models
from huggingface_hub import login

login(token="")

In [None]:
import chromadb
from chromadb.config import Settings
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# Load the model used to embed the chunks
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Store the splits in the vector database
def almacenar_splits(splits):
    # Initialize the Chroma client
    client = chromadb.Client(Settings())
    collection_name = "temporal_context"

    # Delete the collection if it already existed
    try:
        client.delete_collection(collection_name)
    except Exception:
        pass  # Nothing to do if it did not exist


    # Create a collection that uses cosine similarity
    collection = client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}  # Set the distance metric
    )

    # Integrate with LangChain
    vectorstore = Chroma(
        client=client,
        collection_name=collection_name,
        embedding_function=embeddings  # Used to embed the chunks
    )

    # Add the documents to the collection
    vectorstore.add_documents(splits)
    return vectorstore

## 2. RETRIEVAL
In this phase, the chunks most relevant to the input article are retrieved from the vector database built during Indexing.
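
To make "most relevant" concrete, the sketch below ranks a few toy vectors by cosine similarity to a query vector and keeps the top k. It only illustrates the ranking idea: the vectors are invented three-dimensional examples, `cosine_similarity` and `top_k` are hypothetical helpers, and the real pipeline delegates this search to the vector database.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=4):
    """Return the names of the k chunks whose vectors are closest to the query."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Invented embeddings, for illustration only
toy_chunks = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.7, 0.7, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
    "chunk_d": [0.0, 0.0, 1.0],
}
print(top_k([1.0, 0.05, 0.0], toy_chunks, k=2))
```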

In [None]:
# Retrieve the 4 most relevant chunks from the database
def obtener_splits_relevantes_bbdd(splits, noticia):
    vectorstore = almacenar_splits(splits)
    retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

    docs = retriever.invoke(noticia)  # invoke replaces get_relevant_documents, which is deprecated
    print("Number of relevant splits:", len(docs))
    context_docs = "\n\n".join(serialize_doc(d) for d in docs)
    
    return context_docs


def serialize_doc(doc):
    meta = doc.metadata
    return (
        f"<<TITLE>>\n"
        f"{meta['title']}\n"
        #f"source={meta['source_name']} ({meta['source_domain']})\n"
        f"<<TEXT>>\n"
        f"{doc.page_content}\n"
        f"<<END>>"
    )

## 3. GENERATION
In this final stage, the LLM is loaded and given an input article. The most relevant chunks of related news are retrieved for that article, and the DISARM TTPs that the model can detect are loaded.

The system returns a natural-language response with the detected TTPs and a justification for choosing them.
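
Because the response is free text, downstream code may want to pull the technique identifiers out of it. The sketch below does this with a regular expression. It assumes IDs shaped like `T0066`, optionally followed by a sub-technique suffix such as `.001`; that pattern and the `extract_ttp_ids` helper are assumptions for illustration, and the example response is made up.

```python
import re

# Assumed ID shape: "T" plus four digits, optionally ".NNN" for a sub-technique
TTP_ID_PATTERN = re.compile(r"\bT\d{4}(?:\.\d{3})?\b")


def extract_ttp_ids(response_text):
    """Return the distinct TTP IDs found in the text, in order of first appearance."""
    seen = []
    for ttp_id in TTP_ID_PATTERN.findall(response_text):
        if ttp_id not in seen:
            seen.append(ttp_id)
    return seen


# Made-up response that follows the fields requested in the prompt
example_response = (
    '"ttps_detected": ["T0066-Degrade Adversary", '
    '"T0015-Create Hashtags and Search Artefacts"], '
    '"justification": "The article relies on T0066 throughout."'
)
print(extract_ttp_ids(example_response))
```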

### 3.1. Loading the LLM

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "mistralai/Mistral-7B-Instruct-v0.3"

# Load the LLM and the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Generate the response
def llm_generate(prompt):
    # Tokenize the input; return_tensors="pt" because AutoModelForCausalLM is PyTorch-based and expects PyTorch tensors
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Generate the response
    output = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id
    )
    # Decode only the newly generated tokens: the decoded text may not reproduce the prompt
    # character for character, so slicing it by len(prompt) can cut in the wrong place
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

### 3.2. Building the prompt

In [None]:
import json

# Load the DISARM TTPs
def cargar_ttps(path):
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)

    bloques = []
    for obj in data["objects"]:
        if obj["type"] == "tactic":
            tactic_name = obj["name"]
            for tech in obj.get("techniques", []):
                id_ = tech["id"]
                name = tech["name"]
                desc = tech["description"].strip()
                bloques.append(f"Technique: {id_} - {name} (Tactic: {tactic_name})\nTechnique Description: {desc}")
    return "\n\n".join(bloques)
    

# Get the relevant context by calling the Indexing and Retrieval functions
def obtener_contexto(noticia):
    articulos = buscar_noticias_parecidas(noticia)
    splits = hacer_splits(articulos)
    docs = obtener_splits_relevantes_bbdd(splits, noticia)
    return docs

    
# Prompt definition
def construir_prompt(news_text):
    ttps_text = cargar_ttps("DISARMfiltrado.json")
    news_context = obtener_contexto(news_text)
    
    return f"""
        You are an expert in communication, disinformation, and discourse analysis. You will now receive one news article to analyze, a list of known manipulation and disinformation techniques (DISARM TTPs), and a few other news articles with related context.
        
        Your task is to determine whether the FIRST ARTICLE that you receive contains any patterns of manipulation or disinformation using the CONTEXT ARTICLES as background evidence.

        Use the TTPs section below as the reference list of patterns to detect.

        Respond with the following fields:
       "ttps_detected": [string], // TTP names detected e.g. ["T0066-Degrade Adversary", "T0015-Create Hashtags and Search Artefacts"]
       "key_phrases": [string], // Phrases or evidence that support your findings
       "justification": string // A concise explanation of your reasoning
       
        
        If no clear technique is detected, respond:
        
        "ttps_detected": [],
        "key_phrases": [],
        "justification": string // A concise explanation of your reasoning.

        
        FIRST ARTICLE to analyze:
        {news_text}

        CONTEXT ARTICLES:
        {news_context}

        TTPs (DISARM framework):
        {ttps_text}

        RESPONSE:
        ===
    """

### 3.3. Generating the response

In [None]:
# ARTICLE TO ANALYZE, in the format "TITLE: ... BODY: ..."
noticia = (
    "TITLE: BUSTED: Leaked Documents Prove Trump Took Laundered Money From Russian Bank. BODY: Convincing President Trump to release his tax returns is proving slightly more difficult than we initially anticipated, but that doesn t mean there haven t been any signs of success from taking the longer route. Take, for example, a 98-page document recently released by the United States Office of Government Ethics.The document, available in its entirety here, clearly shows that not only is Donald Trump outright profiting from the presidency, a direct violation of the Emoluments Clause of the United States Constitution but also that he is in debt to several banks, both domestic and foreign.Although shocking news, none of this particularly comes as a surprise, more or less just confirms what most of us already suspected. However, it s when you delve into the details that you discover the true significance of the President s possible under-the-table actions.German-based Deutsche Bank was served with $630 million in penalties back in January for a $10 billion Russian money laundering scheme involving it s Moscow and New York branches among others. Deutsche Bank also gave Trump four questionable long-term  loans,  potentially Russian money that s been handed to Trump, the final loan given just before the commencement of the presidential election and used to help fund the Trump International Hotel that opened in Washington DC last year.Obviously, this interaction between Trump and Deutsche Bank should raise more than just a few red flags, at least confirming that there is a possibility that the Russians laundered money to Trump as he began his campaign and just prior to hiring senior advisors with ties to the Russian government.Furthermore, Deutsche Bank is gaining somewhat of a reputation for their shady business dealings as well. "
    "Not only were they caught in the Russian laundering scheme, the bank also struck a $7.2 billion deal with the US government last December to settle claims for toxic mortgages they packaged and sold between 2005 and 2007, as well as paying $2.5 billion in April 2015 to settle charges it conspired to manipulate global interest rate benchmarks.It s interesting, yet not at all surprising, how corrupt people and organizations just seem to gravitate toward each other."
)

prompt = construir_prompt(noticia)
result = llm_generate(prompt)

# LLM RESPONSE
print(result)