
# News Intelligence Laboratory — Task 1 (RPP RSS → Retrieval + Embeddings)

**Goal:** Ingest the latest news items from the **RPP Perú** RSS feed, embed them with **SentenceTransformers**, and build a **retrieval system** on **ChromaDB**, orchestrated with **LangChain**.

**What you'll do**
0) Load Data from RSS (feedparser)  
1) Tokenize a sample (tiktoken) and decide if chunking is needed  
2) Generate embeddings (`sentence-transformers/all-MiniLM-L6-v2`)  
3) Create/Upsert a **Chroma** collection and configure a retriever  
4) Query: _"Últimas noticias de economía"_ and show results in a DataFrame  
5) Orchestrate an end-to-end pipeline with **LangChain** (LCEL)


In [None]:
!pip install feedparser pandas tiktoken sentence-transformers chromadb langchain langchain-community langchain-text-splitters pydantic numpy tqdm


In [None]:

# --- Optional: install libs in the notebook runtime (uncomment as needed) ---
# %pip install -U feedparser pandas tiktoken sentence-transformers chromadb \
#                 langchain langchain-community langchain-text-splitters \
#                 pydantic numpy tqdm python-dotenv

import os
import uuid
from dataclasses import dataclass
from typing import List, Dict, Any

import feedparser
import pandas as pd
import numpy as np

# Tokenization
import tiktoken

# LangChain + Chroma
from langchain_community.vectorstores import Chroma
from langchain_community.docstore.document import Document
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

from langchain_core.runnables import RunnableLambda

# For display
from tqdm import tqdm

PERSIST_DIR = os.getenv("CHROMA_DB_DIR", "./chroma_rpp")
COLLECTION_NAME = os.getenv("CHROMA_COLLECTION", "rpp_news")
EMBED_MODEL_NAME = os.getenv("EMBED_MODEL_NAME", "sentence-transformers/all-MiniLM-L6-v2")
RSS_URL = os.getenv("RPP_RSS", "https://rpp.pe/rss")
MAX_ITEMS = int(os.getenv("MAX_RSS_ITEMS", "50"))
SEED = 42

os.makedirs(PERSIST_DIR, exist_ok=True)
np.random.seed(SEED)


## 0) Load latest RPP news via RSS

In [None]:

def fetch_rss_items(rss_url: str = RSS_URL, limit: int = MAX_ITEMS) -> List[Dict[str, Any]]:
    # Fetch RSS items using feedparser and return latest `limit` entries.
    # Each record includes: title, description, link, published (date).
    feed = feedparser.parse(rss_url)
    items = []
    for entry in feed.get("entries", [])[:limit]:
        items.append({
            "id": str(uuid.uuid4()),
            "title": entry.get("title", ""),
            "description": entry.get("summary", ""),
            "link": entry.get("link", ""),
            "published": entry.get("published", entry.get("updated", "")),
        })
    return items

rss_items = fetch_rss_items()
df_raw = pd.DataFrame(rss_items)
df_raw


Unnamed: 0,id,title,description,link,published
0,5f4884c8-d843-453f-879d-7c58fbcc4c5f,Flamengo vs. Racing Club EN VIVO vía ESPN: par...,"En el Maracaná, Flamengo y Racing chocarán en ...",https://rpp.pe/futbol/copa-libertadores/flamen...,"Wed, 22 Oct 2025 17:45:11 -0500"
1,01bc3718-86ec-40fe-8328-58664e8face0,JNJ determina que Delia Espinoza no retornará ...,La JNJ enfatizó se mantiene vigente la medida ...,https://rpp.pe/politica/judiciales/delia-espin...,"Wed, 22 Oct 2025 16:58:30 -0500"
2,7218ce28-918d-4297-bb0d-bc62c37364e0,Myriam Hernández en Lima: setlist de canciones...,La 'Baladista de América' llega a Lima para do...,https://rpp.pe/musica/conciertos/myriam-hernan...,"Wed, 22 Oct 2025 17:00:29 -0500"
3,e0738e6a-1fda-46d5-9ea9-88c0dc8268c8,Ala Este de la Casa Blanca será demolida total...,"De acuerdo con el diario The New York Times, e...",https://rpp.pe/mundo/estados-unidos/ala-este-d...,"Wed, 22 Oct 2025 17:15:10 -0500"
4,5a34ae95-4c90-45d7-8874-3cfc5065b766,¡Ya es madre! Valeria Flórez dio a luz a su pr...,"Tatiana Calmell, Camila Escribens, entre otras...",https://rpp.pe/famosos/farandula/valeria-flore...,"Wed, 22 Oct 2025 17:14:58 -0500"
5,83fc61bc-9cf0-47a1-965c-0539759c27bb,Moquegua: pobladores piden nuevo tamizaje para...,"Rotafono de RPP | En el 2023, la Dirección Reg...",https://rpp.pe/rotafono/servicios-publicos/moq...,"Wed, 22 Oct 2025 09:35:30 -0500"
6,7a75a445-ad1d-49a3-b1e8-fe0bea24fb05,Temblor en Chile hoy 22 de octubre: Epicentro ...,¿Cuál fue el último Temblor en Chile hoy 22 de...,https://rpp.pe/mundo/chile/temblor-en-chile-ho...,"Wed, 22 Oct 2025 02:09:45 -0500"
7,bb5c87bd-ea41-4b5e-acdf-bd6bc63e81a3,Isabel Preysler publica íntimas cartas de amor...,Isabel Preysler publicó su libro 'Mi verdadera...,https://rpp.pe/famosos/celebridades/isabel-pre...,"Wed, 22 Oct 2025 08:47:59 -0500"
8,38c43445-6a65-4a48-be67-0403a9fe1fd4,"""Seguimos persiguiendo sueños"": Grupo 5 brilla...","Por primera vez, la orquesta de Monsefú partic...",https://rpp.pe/musica/nacional/grupo-5-brilla-...,"Wed, 22 Oct 2025 17:06:57 -0500"
9,3a9f672f-3c26-451c-9814-0a3ed324a669,Soda Stereo en Lima: precios y cómo comprar en...,"El trío argentino regresa en 2026, seis años d...",https://rpp.pe/musica/conciertos/soda-stereo-e...,"Wed, 22 Oct 2025 17:00:37 -0500"
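The `id` column above comes from `uuid.uuid4()`, so the same article receives a different ID on every fetch. If stable IDs are wanted (for example, to deduplicate across runs), one option is to derive them from the article link. This is a minimal sketch, assuming the link uniquely identifies an article; `stable_id` is a hypothetical helper, not part of the notebook.

```python
import uuid

def stable_id(link: str) -> str:
    """Derive a deterministic UUID from an article URL (hypothetical helper)."""
    # uuid5 hashes namespace + name, so the same link always maps to the same UUID.
    return str(uuid.uuid5(uuid.NAMESPACE_URL, link))

link = "https://rpp.pe/economia"  # illustrative URL, not taken from the feed
print(stable_id(link) == stable_id(link))         # True
print(stable_id(link) == stable_id(link + "/x"))  # False
```

Swapping this in for `str(uuid.uuid4())` inside `fetch_rss_items` would be a design change: entries with an empty `link` would all collapse to one ID, so they would need separate handling.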


## 1) Tokenization (tiktoken) — decide if chunking is needed

In [None]:
import tiktoken

# A) Token count (approximate, via tiktoken)
def num_tokens_from_text(text: str, encoding_name: str = "cl100k_base") -> int:
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text or ""))

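# B) Token-based chunking (approximate) with overlap
def chunk_by_tokens_tiktoken(text: str, max_tokens: int = 256, overlap: int = 40, encoding_name: str = "cl100k_base") -> list[str]:
    # Guard: the window only advances when overlap < max_tokens (otherwise the loop never ends)
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("require max_tokens > 0 and 0 <= overlap < max_tokens")
    enc = tiktoken.get_encoding(encoding_name)
    ids = enc.encode(text or "")
    if not ids:
        return []
    chunks = []
    start = 0
    while start < len(ids):
        end = min(start + max_tokens, len(ids))
        chunk_ids = ids[start:end]
        chunks.append(enc.decode(chunk_ids))
        if end == len(ids):
            break
        start = max(0, end - overlap)  # step back so consecutive chunks share `overlap` tokens
    return chunks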

# Sample on one article: title + description
sample_text = (df_raw.iloc[0]["title"] + "\n\n" + df_raw.iloc[0]["description"]) if not df_raw.empty else ""
ntoks = num_tokens_from_text(sample_text)
print(f"Sample tokens (approx.): {ntoks}")

# Operating limit for MiniLM: by default the model truncates input beyond 256 word pieces.
# tiktoken's cl100k_base is not the model's own tokenizer, so the count is only an estimate.
TOKEN_THRESHOLD = 256
NEED_CHUNKING = ntoks > TOKEN_THRESHOLD
print("Needs chunking?", NEED_CHUNKING)


Sample tokens (approx.): 68
Needs chunking? False
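To see how `max_tokens` and `overlap` interact, the sliding window can be traced on token offsets alone, without a tokenizer. The sketch below mirrors the loop in `chunk_by_tokens_tiktoken`; `window_bounds` is a hypothetical helper added only for illustration.

```python
def window_bounds(n_tokens: int, max_tokens: int = 256, overlap: int = 40) -> list[tuple[int, int]]:
    """Return the (start, end) token offsets of each chunk (hypothetical helper)."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("require max_tokens > 0 and 0 <= overlap < max_tokens")
    bounds = []
    start = 0
    while start < n_tokens:
        end = min(start + max_tokens, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:
            break
        # Step back by `overlap` so neighbouring chunks share context.
        start = end - overlap
    return bounds

print(window_bounds(68))   # [(0, 68)]
print(window_bounds(600))  # [(0, 256), (216, 472), (432, 600)]
```

A 68-token sample like the one above fits in a single window, which is why no chunking is needed for typical RSS titles and summaries.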


In [None]:
from langchain_community.docstore.document import Document

def build_documents_from_df(df: pd.DataFrame,
                            max_tokens: int = 256,
                            overlap: int = 40) -> list[Document]:
    docs: list[Document] = []
    for _, row in df.iterrows():
        title = row["title"] or ""
        desc  = row["description"] or ""
        base_text = f"{title}\n\n{desc}".strip()

        # Chunk by tokens when the text exceeds the limit
        pieces = chunk_by_tokens_tiktoken(base_text, max_tokens=max_tokens, overlap=overlap)
        if not pieces:
            continue

        for i, piece in enumerate(pieces):
            docs.append(
                Document(
                    page_content=piece,
                    metadata={
                        "source": "RPP",
                        "id": row["id"],
                        "title": title,
                        "link": row["link"],
                        "published": row["published"],
                        "chunk_idx": i
                    }
                )
            )
    return docs

docs = build_documents_from_df(df_raw, max_tokens=256, overlap=40)
len(docs), docs[0].metadata


(50,
 {'source': 'RPP',
  'id': '995c7c1c-522e-4742-8dbc-aa6c64425eb2',
  'title': 'Flamengo vs. Racing Club EN VIVO vía ESPN: partidazo por la semifinal de Copa Libertadores en el Maracaná',
  'link': 'https://rpp.pe/futbol/copa-libertadores/flamengo-vs-racing-club-en-vivo-ver-espn-transmision-gratis-desde-maracana-ida-semifinal-copa-libertadores-2025-link-stream-partidos-de-hoy-noticia-1660341',
  'published': 'Wed, 22 Oct 2025 17:45:11 -0500',
  'chunk_idx': 0})
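Each document carries both the article `id` and its `chunk_idx`, which together identify a chunk. One way to turn that pair into a per-chunk key is sketched below; `chunk_ids` is a hypothetical helper and the metadata records are made up for the example.

```python
def chunk_ids(metadatas: list[dict]) -> list[str]:
    """Build one ID per chunk as '<article id>:<chunk index>' (hypothetical helper)."""
    return [f"{m['id']}:{m['chunk_idx']}" for m in metadatas]

# Illustrative metadata records shaped like the ones built above
metas = [
    {"id": "a1", "chunk_idx": 0},
    {"id": "a1", "chunk_idx": 1},
    {"id": "b2", "chunk_idx": 0},
]
ids = chunk_ids(metas)
print(ids)  # ['a1:0', 'a1:1', 'b2:0']
assert len(set(ids)) == len(ids)  # IDs must be unique within a batch
```

With LangChain's Chroma wrapper, such a list could be passed as `vectorstore.add_documents(docs, ids=chunk_ids([d.metadata for d in docs]))`. This only deduplicates across runs if the article `id` itself is stable, which is not the case for the `uuid4` values used in this notebook.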

## 2) Embeddings

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings

EMBED_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(model_name=EMBED_MODEL_NAME)



## 3) Create or Upsert Chroma Collection



In [None]:
from langchain_community.vectorstores import Chroma

PERSIST_DIR = "./chroma_rpp"
COLLECTION_NAME = "rpp_news"

# Create (or reuse) the vector store
vectorstore = Chroma(
    collection_name=COLLECTION_NAME,
    embedding_function=embeddings,
    persist_directory=PERSIST_DIR
)

# Upsert (add documents; if you run this several times, you could deduplicate by id + chunk index)
_ = vectorstore.add_documents(docs)
# Chroma >= 0.4 persists automatically when persist_directory is set,
# so the deprecated vectorstore.persist() call is omitted.

# Retriever (similarity search; k=8 as an example)
retriever = vectorstore.as_retriever(search_kwargs={"k": 8})
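The retriever ranks stored chunks by how close their vectors are to the query embedding. Which metric is used depends on how the collection is configured. The toy sketch below uses cosine similarity on hand-written 3-dimensional vectors (the real MiniLM embeddings have 384 dimensions); the vectors and the `top_k` helper are made up for illustration.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the names of the k vectors most similar to the query."""
    scored = sorted(
        ((cosine_similarity(query_vec, vec), name) for name, vec in vectors.items()),
        reverse=True,
    )
    return [name for _, name in scored[:k]]

# Toy "embeddings": axis 0 loosely stands for an economy-related direction
toy_vectors = {
    "economy": [0.9, 0.1, 0.0],
    "sports": [0.0, 0.2, 0.9],
    "politics": [0.6, 0.6, 0.1],
}
print(top_k([1.0, 0.0, 0.0], toy_vectors, k=2))  # ['economy', 'politics']
```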



## 4) Query Results

In [None]:
def search_news(query: str, k: int = 8) -> pd.DataFrame:
    hits = retriever.invoke(query)
    rows = []
    # Deduplicate by title so the same article is not repeated across multiple chunks
    seen = set()
    for d in hits:
        title = d.metadata.get("title", "")
        if title in seen:
            continue
        seen.add(title)
        rows.append({
            "title": title,
            "description": d.page_content[:300] + ("..." if len(d.page_content) > 300 else ""),
            "link": d.metadata.get("link", ""),
            "date_published": d.metadata.get("published", "")
        })
        if len(rows) >= k:
            break
    return pd.DataFrame(rows, columns=["title", "description", "link", "date_published"])

df_results = search_news("Últimas noticias de economía", k=8)
df_results


Unnamed: 0,title,description,link,date_published
0,Consejo Fiscal pide que el TC revise las más d...,Consejo Fiscal pide que el TC revise las más d...,https://rpp.pe/economia/economia/consejo-fisca...,"Wed, 22 Oct 2025 10:15:26 -0500"
1,"Estados Unidos anunciará un ""aumento sustancia...","Estados Unidos anunciará un ""aumento sustancia...",https://rpp.pe/mundo/estados-unidos/estados-un...,"Wed, 22 Oct 2025 16:33:24 -0500"
2,JNJ determina que Delia Espinoza no retornará ...,JNJ determina que Delia Espinoza no retornará ...,https://rpp.pe/politica/judiciales/delia-espin...,"Wed, 22 Oct 2025 16:58:30 -0500"
3,Policía en presunto estado de ebriedad es dete...,Policía en presunto estado de ebriedad es dete...,https://rpp.pe/peru/junin/huancayo-detienen-a-...,"Wed, 22 Oct 2025 16:05:27 -0500"
4,Agua Marina se retira temporalmente los escena...,Agua Marina se retira temporalmente los escena...,https://rpp.pe/musica/nacional/agua-marina-se-...,"Wed, 22 Oct 2025 11:59:24 -0500"
5,Isabel Preysler publica íntimas cartas de amor...,Isabel Preysler publica íntimas cartas de amor...,https://rpp.pe/famosos/celebridades/isabel-pre...,"Wed, 22 Oct 2025 08:47:59 -0500"
6,Ala Este de la Casa Blanca será demolida total...,Ala Este de la Casa Blanca será demolida total...,https://rpp.pe/mundo/estados-unidos/ala-este-d...,"Wed, 22 Oct 2025 17:15:10 -0500"
7,¿Será necesario prorrogar el estado de emergen...,¿Será necesario prorrogar el estado de emergen...,https://rpp.pe/lima/seguridad/estado-de-emerge...,"Wed, 22 Oct 2025 13:34:50 -0500"
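The query asks for the *latest* news, but similarity search does not consider publication time. Since `date_published` holds RFC 822 style strings, the hits could be re-ordered by recency after retrieval. This is a minimal sketch using only the standard library; `sort_by_recency` is a hypothetical helper and the rows reuse three dates from the table above with placeholder titles.

```python
from email.utils import parsedate_to_datetime

def sort_by_recency(rows: list[dict], date_key: str = "date_published") -> list[dict]:
    """Sort result rows newest-first by their RFC 822 date string (hypothetical helper)."""
    return sorted(rows, key=lambda r: parsedate_to_datetime(r[date_key]), reverse=True)

rows = [
    {"title": "A", "date_published": "Wed, 22 Oct 2025 10:15:26 -0500"},
    {"title": "B", "date_published": "Wed, 22 Oct 2025 16:33:24 -0500"},
    {"title": "C", "date_published": "Wed, 22 Oct 2025 08:47:59 -0500"},
]
print([r["title"] for r in sort_by_recency(rows)])  # ['B', 'A', 'C']
```

Rows with empty or malformed dates would make `parsedate_to_datetime` raise, so a real pipeline would need to filter or default them first.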


## 5) Orchestrate with LangChain

In [None]:
from dataclasses import dataclass
from langchain_core.runnables import RunnableLambda

@dataclass
class PipelineConfig:
    rss_url: str = RSS_URL
    limit: int = 50
    max_tokens: int = 256
    overlap: int = 40
    model_name: str = EMBED_MODEL_NAME
    persist_dir: str = PERSIST_DIR
    collection_name: str = COLLECTION_NAME

CFG = PipelineConfig()

def step_load(_: object) -> pd.DataFrame:
    items = fetch_rss_items(rss_url=CFG.rss_url, limit=CFG.limit)
    return pd.DataFrame(items)

def step_docs(df: pd.DataFrame) -> list[Document]:
    return build_documents_from_df(df, max_tokens=CFG.max_tokens, overlap=CFG.overlap)

def step_upsert(docs: list[Document]) -> Chroma:
    # Create a fresh embeddings object here to avoid stale state
    embs = HuggingFaceEmbeddings(model_name=CFG.model_name)
    vs = Chroma(
        collection_name=CFG.collection_name,
        embedding_function=embs,
        persist_directory=CFG.persist_dir,
    )
    # Always pass list[Document] here
    if not isinstance(docs, list) or (docs and not isinstance(docs[0], Document)):
        raise TypeError(f"step_upsert expected list[Document], got: {type(docs)}")
    vs.add_documents(docs)
    # Chroma >= 0.4 persists automatically; the deprecated vs.persist() call is omitted
    return vs

def step_retriever(vs: Chroma):
    if not isinstance(vs, Chroma):
        raise TypeError(f"step_retriever expected Chroma, got: {type(vs)}")
    return vs.as_retriever(search_kwargs={"k": 8})

def step_query(ret, query: str) -> pd.DataFrame:
    hits = ret.invoke(query)
    rows, seen = [], set()
    for d in hits:
        title = d.metadata.get("title", "")
        if title in seen:
            continue
        seen.add(title)
        rows.append({
            "title": title,
            "description": d.page_content[:300] + ("..." if len(d.page_content) > 300 else ""),
            "link": d.metadata.get("link", ""),
            "date_published": d.metadata.get("published", "")
        })
        if len(rows) >= 8:
            break
    return pd.DataFrame(rows, columns=["title","description","link","date_published"])

# ---------- LCEL with explicit wiring ----------
load_chain      = RunnableLambda(lambda _: {"df": step_load(None)})
docs_chain      = RunnableLambda(lambda d: {"df": d["df"], "docs": step_docs(d["df"])})
upsert_chain    = RunnableLambda(lambda d: {"df": d["df"], "docs": d["docs"], "vs": step_upsert(d["docs"])})
retriever_chain = RunnableLambda(lambda d: {"retriever": step_retriever(d["vs"])})

pipeline = load_chain | docs_chain | upsert_chain | retriever_chain

out = pipeline.invoke(None)
retriever2 = out["retriever"]

final_df = step_query(retriever2, "Últimas noticias de economía")
final_df


Unnamed: 0,title,description,link,date_published
0,Consejo Fiscal pide que el TC revise las más d...,Consejo Fiscal pide que el TC revise las más d...,https://rpp.pe/economia/economia/consejo-fisca...,"Wed, 22 Oct 2025 10:15:26 -0500"
1,"Estados Unidos anunciará un ""aumento sustancia...","Estados Unidos anunciará un ""aumento sustancia...",https://rpp.pe/mundo/estados-unidos/estados-un...,"Wed, 22 Oct 2025 16:33:24 -0500"
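The `|` operator in the pipeline above chains steps so that each step's output becomes the next step's input, with a dict carrying the accumulated state. The toy class below imitates that behaviour in plain Python to make the data flow visible. It is an illustration only, not LangChain's implementation, and `Step` and `pipeline_demo` are hypothetical names.

```python
from typing import Any, Callable

class Step:
    """Tiny stand-in for a runnable (illustration only, not LangChain's code)."""

    def __init__(self, fn: Callable[[Any], Any]):
        self.fn = fn

    def invoke(self, value: Any) -> Any:
        return self.fn(value)

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that runs a, then feeds its result to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

# Each step receives the dict built so far and returns an extended copy.
load = Step(lambda _: {"df": ["row1", "row2"]})
count = Step(lambda d: {**d, "n_docs": len(d["df"])})

pipeline_demo = load | count
print(pipeline_demo.invoke(None))  # {'df': ['row1', 'row2'], 'n_docs': 2}
```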
