# Basic Large Language Model (LLM) concepts and their application

To build an application with LLMs, we need to become familiar with several basic concepts first. We will start with text processing.

## Loading PDF documents

Initially, the texts we wish to analyze will be stored in some file format. Ideally this would be plain text (`.txt`), but that is seldom the case: texts are far more commonly stored as PDF. Luckily, there are tools built specifically for this situation.

Keep in mind, however, that although these tools are designed to load and process PDF documents, the format has intrinsic limitations that make the task harder. The most relevant ones for our purposes concern images, mathematical expressions, and tables: PDF text loaders ignore images, and they flatten tables and mathematical expressions into raw text. Newer approaches, such as processing PDFs with AI models, aim to correct these issues, but they are currently less flexible and less compatible with other tools, and they are outside the scope of this session. For now, we will use the simplest approach and keep its limitations in mind.

As a working example, we will use the article ["Learning from Shared News: When Abundant Information Leads to Belief Polarization" by Bowen, Dmitriev and Galperti (2022)](https://www.nber.org/system/files/working_papers/w28465/w28465.pdf). We will use the `pypdf` package. We start by downloading the article with `requests` and saving it in the current working directory. As a technical side note, Colab runs inside a container that gives us access to a Linux file system; our default working directory is `/content`.
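The download-and-load step described above could be sketched roughly as follows. This is a minimal sketch, not the session's actual code: the helper names `local_name`, `download_pdf`, and `extract_pages` are hypothetical, and `requests` and `pypdf` are imported lazily inside the functions that need them, so the file-name helper runs even when they are not installed.

```python
from pathlib import Path
from urllib.parse import urlparse

PDF_URL = "https://www.nber.org/system/files/working_papers/w28465/w28465.pdf"


def local_name(url: str) -> str:
    """Return the file name at the end of a URL path."""
    return Path(urlparse(url).path).name


def download_pdf(url: str, dest_dir: str = ".") -> Path:
    """Download `url` into `dest_dir` and return the saved path (hypothetical helper)."""
    import requests  # lazy import: only needed when a download is actually requested

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    dest = Path(dest_dir) / local_name(url)
    dest.write_bytes(response.content)
    return dest


def extract_pages(pdf_path) -> list:
    """Return the raw text of each page; images are not part of the extracted text."""
    from pypdf import PdfReader  # lazy import, same reason as above

    reader = PdfReader(str(pdf_path))
    return [page.extract_text() or "" for page in reader.pages]


print(local_name(PDF_URL))  # → w28465.pdf
```

With network access and both packages installed, `pages = extract_pages(download_pdf(PDF_URL))` would give one string per page. Tables and formulas in those strings arrive as flattened text, as noted above.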

In [7]:
%pip install -r ../requirements.txt

ERROR: Could not find a version that satisfies the requirement openai==1.13 (from versions: 0.0.2, 0.1.0, ..., 1.12.0, 1.13.3, 1.13.4, 1.14.0, 1.14.1, 1.14.2, 1.14.3, 1.16.0, 1.1

In [2]:
# --- Import the required libraries ---
import feedparser        # Reads the RPP RSS feed
import pandas as pd      # Holds the data in a DataFrame
import tiktoken          # Counts tokens
import chromadb          # Vector database
from chromadb.config import Settings

In [3]:
# --- Read the RPP RSS feed ---
RSS_URL = "https://rpp.pe/rss"
feed = feedparser.parse(RSS_URL)

# --- Keep only the first 50 news items ---
items = feed["items"][:50]

# --- Extract the relevant fields ---
news = []
for it in items:
    news.append({
        "title": it.get("title", ""),             # Headline
        "description": it.get("summary", ""),     # Summary or description
        "link": it.get("link", ""),               # Original link
        "published": it.get("published", "")      # Publication date
    })

# --- Build the DataFrame ---
df_news = pd.DataFrame(news)

# --- Report the row count and show the first 5 rows ---
print("Number of news items downloaded:", len(df_news))
df_news.head()


Number of news items downloaded: 50


Unnamed: 0,title,description,link,published
0,Preocupación en Alianza: Quevedo dejó el parti...,Kevin Quevedo recibió un duro golpe apenas com...,https://rpp.pe/futbol/descentralizado/video-al...,"Thu, 16 Oct 2025 20:35:29 -0500"
1,Ministerio Público inicia investigación por ho...,"En horas de la tarde, el Comandante general de...",https://rpp.pe/politica/actualidad/eduardo-rui...,"Thu, 16 Oct 2025 20:27:42 -0500"
2,Alianza Lima vs. Sport Boys EN VIVO vía L1MAX:...,Alianza Lima juega en el Alejandro Villanueva ...,https://rpp.pe/futbol/descentralizado/alianza-...,"Thu, 16 Oct 2025 20:45:59 -0500"
3,Jesús María: pacientes con cáncer recibieron s...,"Rotafono de RPP | Manuel Villena Ibáñez, quien...",https://rpp.pe/rotafono/servicios-publicos/jes...,"Thu, 16 Oct 2025 20:37:08 -0500"
4,Congresista Luis Cordero Jon Tay presentó proy...,Según la propuesta del parlamentario de Alianz...,https://rpp.pe/politica/congreso/congresista-l...,"Thu, 16 Oct 2025 20:23:11 -0500"
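To make the field mapping in the cell above more concrete, here is a small sketch of the same extraction on a hand-written RSS snippet. It deliberately swaps `feedparser` for the standard library's `xml.etree.ElementTree` so it runs offline. The feed content and the `parse_items` helper are invented for illustration. In RSS 2.0 the `<description>` element is what `feedparser` exposes as `summary`, and `<pubDate>` is what it exposes as `published`.

```python
import xml.etree.ElementTree as ET

# Hand-written sample feed (hypothetical content, not taken from RPP)
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <description>Short summary of the first story.</description>
      <link>https://example.com/first</link>
      <pubDate>Thu, 16 Oct 2025 20:35:29 -0500</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text: str, limit: int = 50) -> list:
    """Return up to `limit` items as dicts, using "" for missing fields."""
    root = ET.fromstring(rss_text)
    rows = []
    for item in root.findall("./channel/item")[:limit]:
        rows.append({
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return rows


rows = parse_items(SAMPLE_RSS)
print(len(rows))               # → 2
print(rows[1]["description"])  # prints an empty line: the second item has no description
```

Defaulting missing fields to an empty string mirrors the `it.get(..., "")` calls in the notebook cell, so every row ends up with the same four keys.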


In [4]:
# Create the encoder for cl100k_base (the encoding used by GPT-3.5 and GPT-4)
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Return the number of tokens in a text."""
    return len(enc.encode(text))

# Apply the function to the "description" column
df_news["n_tokens"] = df_news["description"].apply(count_tokens)

# Show the average and the longest descriptions
print("Average tokens per description:", df_news["n_tokens"].mean())
df_news.sort_values("n_tokens", ascending=False).head(5)

Average tokens per description: 48.46


Unnamed: 0,title,description,link,published,n_tokens
28,Abogado de Dina Boluarte tras rechazo a impedi...,El Poder Judicial declaró infundado el pedido ...,https://rpp.pe/politica/judiciales/abogado-de-...,"Thu, 16 Oct 2025 17:00:41 -0500",87
32,SKY mira al 2026 con nuevos destinos de largo ...,SKY consolida su expansión en Perú apostando p...,https://rpp.pe/economia/economia/schneider-ele...,"Thu, 16 Oct 2025 17:30:42 -0500",77
26,Schneider Electric impulsa la sostenibilidad e...,"Vanessa Moreno, country manager de Schneider E...",https://rpp.pe/economia/economia/schneider-ele...,"Thu, 16 Oct 2025 18:00:03 -0500",72
30,PNP informó que suboficial Luis Magallanes ha ...,El comandante general de la Policía Nacional d...,https://rpp.pe/videos/actualidad/pnp-confirma-...,"Thu, 16 Oct 2025 17:40:51 -0500",71
17,PNP confirma que suboficial Luis Magallanes di...,El comandante general de la Policía Nacional d...,https://rpp.pe/politica/actualidad/pnp-confirm...,"Thu, 16 Oct 2025 17:06:06 -0500",69
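One practical use of these counts is checking whether any text is likely to exceed the input limit of the model it will be sent to. The sketch below uses the five largest `n_tokens` values from the table above. The `over_budget` helper is hypothetical, and the limit of 256 is an illustrative assumption to be replaced with the value documented for the model in use. The counts come from `cl100k_base`, while an embedding model applies its own tokenizer, so the comparison is only approximate.

```python
def over_budget(token_counts: dict, limit: int) -> list:
    """Return the row ids whose token count exceeds `limit`."""
    return [row_id for row_id, n in token_counts.items() if n > limit]


# Row id -> n_tokens, copied from the five longest descriptions shown above
longest = {28: 87, 32: 77, 26: 72, 30: 71, 17: 69}

# ASSUMPTION: 256 is an illustrative budget, not a documented limit
MAX_TOKENS = 256

print(over_budget(longest, MAX_TOKENS))  # → []
print(over_budget(longest, 70))          # → [28, 32, 26, 30]
```

Since even the longest description is well under the assumed budget, truncation is unlikely to matter for this dataset. Longer documents, such as full articles, would need to be split into chunks first.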


In [21]:
%pip install sentence-transformers

Collecting tokenizers<0.22,>=0.21 (from transformers<5.0.0,>=4.41.0->sentence-transformers)
  Downloading tokenizers-0.21.4-cp39-abi3-macosx_11_0_arm64.whl.metadata (6.7 kB)
Downloading tokenizers-0.21.4-cp39-abi3-macosx_11_0_arm64.whl (2.7 MB)
Installing collected packages: tokenizers
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.22.1
    Uninstalling tokenizers-0.22.1:
      Successfully uninstalled tokenizers-0.22.1
Successfully installed tokenizers-0.21.4
Note: you may need to restart the kernel to use updated packages.


In [5]:
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed each description
df_news["embedding"] = df_news["description"].apply(lambda x: model.encode(x))

# Check the dimension of the first embedding
print("Embedding dimension:", len(df_news["embedding"][0]))
df_news.head(2)

Embedding dimension: 384


Unnamed: 0,title,description,link,published,n_tokens,embedding
0,Preocupación en Alianza: Quevedo dejó el parti...,Kevin Quevedo recibió un duro golpe apenas com...,https://rpp.pe/futbol/descentralizado/video-al...,"Thu, 16 Oct 2025 20:35:29 -0500",37,"[0.031080773, 0.050071213, -0.04250635, -0.039..."
1,Ministerio Público inicia investigación por ho...,"En horas de la tarde, el Comandante general de...",https://rpp.pe/politica/actualidad/eduardo-rui...,"Thu, 16 Oct 2025 20:27:42 -0500",49,"[-0.011027413, 0.008596375, -0.04342265, -0.02..."


In [6]:
import chromadb
from chromadb.config import Settings

# Initialize the Chroma client
client = chromadb.Client(Settings())

# Create or connect to a collection (similar to a vector "table")
collection = client.get_or_create_collection("rpp_news")

# Prepare the data to add
ids = [str(i) for i in range(len(df_news))]  # Unique IDs
documents = df_news["description"].tolist()
metadatas = df_news[["title", "link", "published"]].astype(str).to_dict(orient="records")
embeddings = df_news["embedding"].tolist()

# Add everything to the collection
collection.add(
    ids=ids,
    documents=documents,
    metadatas=metadatas,
    embeddings=embeddings
)

print(f"Saved {len(embeddings)} news items to the 'rpp_news' collection")

Saved 50 news items to the 'rpp_news' collection


In [7]:
def search_news(query: str, model, collection, k: int = 5) -> pd.DataFrame:
    """
    Return the k news items closest to the query text, ranked by the
    collection's embedding distance (lower means more similar). Chroma
    uses squared L2 distance unless the collection is configured otherwise.
    """
    # Embed the search text and convert it to plain lists for Chroma
    query_emb = model.encode([query]).tolist()

    # Query the collection
    res = collection.query(
        query_embeddings=query_emb,
        n_results=k
    )

    # Convert the results to a DataFrame
    results = []
    for meta, doc, dist in zip(res["metadatas"][0], res["documents"][0], res["distances"][0]):
        results.append({
            "title": meta.get("title"),
            "description": doc[:200] + ("..." if len(doc) > 200 else ""),  # trim long text
            "link": meta.get("link"),
            "published": meta.get("published"),
            "distance": round(dist, 4)
        })

    return pd.DataFrame(results)

In [8]:
query = "últimas noticias de economía de Peru"
df_results = search_news(query, model, collection, k=5)
df_results

Unnamed: 0,title,description,link,published,distance
0,PNP confirma que suboficial Luis Magallanes di...,El comandante general de la Policía Nacional d...,https://rpp.pe/politica/actualidad/pnp-confirm...,"Thu, 16 Oct 2025 17:06:06 -0500",0.7028
1,PNP informó que suboficial Luis Magallanes ha ...,El comandante general de la Policía Nacional d...,https://rpp.pe/videos/actualidad/pnp-confirma-...,"Thu, 16 Oct 2025 17:40:51 -0500",0.7096
2,Ministerio Público inicia investigación por ho...,"En horas de la tarde, el Comandante general de...",https://rpp.pe/politica/actualidad/eduardo-rui...,"Thu, 16 Oct 2025 20:27:42 -0500",0.7856
3,SKY mira al 2026 con nuevos destinos de largo ...,SKY consolida su expansión en Perú apostando p...,https://rpp.pe/economia/economia/schneider-ele...,"Thu, 16 Oct 2025 17:30:42 -0500",0.7886
4,"""Lo mataron por marchar"": Gian Marco reacciona...","Luego de postergar su gira en Perú, Gian Marco...",https://rpp.pe/famosos/farandula/gian-marco-la...,"Thu, 16 Oct 2025 20:19:48 -0500",0.8058
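To help read the `distance` column, the sketch below relates squared L2 distance to cosine similarity on toy vectors. It assumes the collection uses Chroma's default metric (squared L2, since no other metric was set when the collection was created) and that the embeddings are unit length, which is worth verifying for the model in use. The 3-dimensional vectors are invented for illustration; the real embeddings here have 384 dimensions.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors (hypothetical), normalized to unit length
a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 2.0])

cos = cosine_similarity(a, b)
d2 = squared_l2(a, b)

# For unit vectors: squared L2 distance = 2 - 2 * cosine similarity
print(math.isclose(d2, 2 - 2 * cos))  # → True
```

Under those assumptions, ranking by squared L2 distance gives the same order as ranking by cosine similarity. A distance of about 0.70 would correspond to a cosine similarity of about 0.65. The fact that policing stories rank above economic ones for an economics query suggests the match is weak, which is plausible given that the queries and documents are in Spanish.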


In [9]:
df_news.to_csv("data/rpp_news.csv", index=False, encoding="utf-8")
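The `to_csv` call above fails if the `data` directory does not exist yet. One way to guard against that, sketched here with a hypothetical `ensure_parent` helper, is to create the parent directory before writing:

```python
from pathlib import Path


def ensure_parent(path: str) -> Path:
    """Create the parent directory of `path` if needed and return it as a Path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


# Usage with the DataFrame from the cells above (not run here):
# df_news.to_csv(ensure_parent("data/rpp_news.csv"), index=False, encoding="utf-8")
```

Because `exist_ok=True` is set, calling the helper repeatedly is harmless, so the same line can be reused each time the file is saved again.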

In [10]:
from transformers import pipeline
import sentencepiece  # Not used directly; confirms the tokenizer dependency of M2M100 is installed


# Create the Spanish-to-English translation pipeline
translator = pipeline("translation", model="facebook/m2m100_418M")

# Translate the description column, skipping empty texts
translated = []
for text in df_news["description"].fillna("").tolist():
    if text.strip() == "":
        translated.append("")
    else:
        result = translator(text, src_lang="es", tgt_lang="en", max_length=512)
        translated.append(result[0]["translation_text"])

# Add a new column with the translation
df_news["description_en"] = translated

# Check a few examples
df_news[["description", "description_en"]].head(3)

Device set to use cpu


Unnamed: 0,description,description_en
0,Kevin Quevedo recibió un duro golpe apenas com...,Kevin Quevedo received a hard blow just starte...
1,"En horas de la tarde, el Comandante general de...","In the afternoon, the General Commander of the..."
2,Alianza Lima juega en el Alejandro Villanueva ...,Alianza Lima plays in Alejandro Villanueva aga...


In [11]:
df_news.to_csv("data/rpp_news.csv", index=False, encoding="utf-8")

In [13]:
from transformers import pipeline
# Take the first 50 news items in English
rpp_texts = df_news["description_en"].fillna("").head(50).tolist()

# Create the zero-shot classification pipeline
llm_sim = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Categories of the AG News dataset
categories = ["World", "Sports", "Business", "Sci/Tech"]

# Classify each text
llm_labels = []
for text in rpp_texts:
    result = llm_sim(text, candidate_labels=categories)
    llm_labels.append(result["labels"][0])  # the category with the highest score

# Store the labels in the DataFrame
df_news["LLM_label"] = llm_labels

df_news[["description_en", "LLM_label"]].head(5)

Device set to use cpu


Unnamed: 0,description_en,LLM_label
0,Kevin Quevedo received a hard blow just starte...,Sports
1,"In the afternoon, the General Commander of the...",World
2,Alianza Lima plays in Alejandro Villanueva aga...,Sports
3,"Rotaphon of RPP, Manuel Villena Ibáñez, who ha...",Sci/Tech
4,According to the proposal of the Parliamentary...,World
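Taking `result["labels"][0]` always returns some category, even when none fits well, as with row 3 above, where a story about cancer patients is labelled Sci/Tech. The zero-shot pipeline also returns a `scores` list aligned with `labels` and sorted from highest to lowest, so one option is to accept the top label only above a threshold. The sketch below uses hand-made result dictionaries with invented scores, and both the `top_label` helper and the 0.5 threshold are illustrative assumptions.

```python
def top_label(result: dict, min_score: float = 0.5, fallback: str = "Uncertain") -> str:
    """Return the best label, or `fallback` when its score is below `min_score`."""
    label, score = result["labels"][0], result["scores"][0]
    return label if score >= min_score else fallback


# Hand-made dictionaries mimicking the pipeline's output shape; scores are invented
confident = {
    "labels": ["Sports", "World", "Business", "Sci/Tech"],
    "scores": [0.91, 0.05, 0.03, 0.01],
}
ambiguous = {
    "labels": ["Sci/Tech", "World", "Business", "Sports"],
    "scores": [0.34, 0.31, 0.20, 0.15],
}

print(top_label(confident))  # → Sports
print(top_label(ambiguous))  # → Uncertain
```

In the notebook loop this would replace `result["labels"][0]` with `top_label(result)`. A suitable threshold depends on the data and would need to be checked against a few manually labelled examples.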


In [None]:
# Select only the columns we need
df_llm = df_news[["description_en", "LLM_label"]]

# Save the clean CSV
df_llm.to_csv("data/rpp_llm_labels.csv", index=False, encoding="utf-8")


In [15]:
df_llm.head(7)

Unnamed: 0,description_en,LLM_label
0,Kevin Quevedo received a hard blow just starte...,Sports
1,"In the afternoon, the General Commander of the...",World
2,Alianza Lima plays in Alejandro Villanueva aga...,Sports
3,"Rotaphon of RPP, Manuel Villena Ibáñez, who ha...",Sci/Tech
4,According to the proposal of the Parliamentary...,World
5,"After delaying his tour in Peru, Gian Marco re...",World
6,"In statements to RPP, Marcela Ríos, regional d...",Sci/Tech
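A quick sanity check on the labels is to count how often each category appears. The sketch below tallies only the seven labels displayed above, using the standard library, so it says nothing about the remaining 43 rows.

```python
from collections import Counter

# Labels copied from the seven rows shown above
labels = ["Sports", "World", "Sports", "Sci/Tech", "World", "World", "Sci/Tech"]

counts = Counter(labels)
print(counts.most_common())  # → [('World', 3), ('Sports', 2), ('Sci/Tech', 2)]

# On the full DataFrame, the pandas equivalent would be (not run here):
# df_llm["LLM_label"].value_counts()
```

None of these seven rows is labelled Business, and the AG News categories have no dedicated slot for domestic politics or health stories, which may explain why such items land in World or Sci/Tech.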
