**Exercise 9: Using the Google Gemini API**

**Name:** Aarón Yumancela

In [31]:
# Load the API key and configure the client
from google import genai

with open("api_key.txt", "r", encoding="utf-8") as f:
    api_key = f.read().strip()

client = genai.Client(api_key=api_key)
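
Reading the key from a file works, but `api_key.txt` is easy to commit by accident. A minimal sketch, assuming a hypothetical helper `load_api_key` and an environment variable named `GEMINI_API_KEY` (both are illustrative choices, not part of the original notebook), checks the environment first and falls back to the file:

```python
import os
from pathlib import Path


def load_api_key(env_var: str = "GEMINI_API_KEY", path: str = "api_key.txt") -> str:
    """Return the API key from the environment, falling back to a text file.

    The helper and the default variable name are assumptions made for
    illustration; adapt them to your own setup.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    file = Path(path)
    if file.is_file():
        key = file.read_text(encoding="utf-8").strip()
    if not key:
        raise RuntimeError(f"No API key found in ${env_var} or {path}")
    return key
```

With this in place the client could be created as `genai.Client(api_key=load_api_key())`.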


In [32]:
# Load the corpus of documents to analyze
from sklearn.datasets import fetch_20newsgroups
import pandas as pd

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
docs = newsgroups.data
df = pd.DataFrame(docs, columns=['doc'])
df.head(10)

Unnamed: 0,doc
0,\n\nI am sure some bashers of Pens fans are pr...
1,My brother is in the market for a high-perform...
2,\n\n\n\n\tFinally you said what you dream abou...
3,\nThink!\n\nIt's the SCSI card doing the DMA t...
4,1) I have an old Jasmine drive which I cann...
5,\n\nBack in high school I worked as a lab assi...
6,\n\nAE is in Dallas...try 214/241-6060 or 214/...
7,"\n[stuff deleted]\n\nOk, here's the solution t..."
8,"\n\n\nYeah, it's the second one. And I believ..."
9,\nIf a Christian means someone who believes in...


In [33]:
# Normalize and clean the text, then inspect the result
import numpy as np
from tqdm.auto import tqdm
import re

df = df.dropna(subset=["doc"]).reset_index(drop=True)

def clean_text(text: str) -> str:
    text = text.replace("\n", " ")
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\s{2,}", " ", text)
    return text.strip()

df["text_norm"] = df["doc"].astype(str).map(clean_text)

df.head()

Unnamed: 0,doc,text_norm
0,\n\nI am sure some bashers of Pens fans are pr...,I am sure some bashers of Pens fans are pretty...
1,My brother is in the market for a high-perform...,My brother is in the market for a high-perform...
2,\n\n\n\n\tFinally you said what you dream abou...,Finally you said what you dream about. Mediter...
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,Think! It's the SCSI card doing the DMA transf...
4,1) I have an old Jasmine drive which I cann...,1) I have an old Jasmine drive which I cannot ...


In [34]:
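# Split each document into overlapping segments to make retrieval easier

def split_into_segments(text: str, size: int = 700, stride: int = 150):
    # Note: `stride` is the number of characters shared by consecutive
    # segments (the overlap); the window advances by `size - stride`.
    if not 0 <= stride < size:
        # Otherwise `pos` would never advance and the loop would not end
        raise ValueError("stride must satisfy 0 <= stride < size")

    segments = []
    pos = 0
    length = len(text)

    while pos < length:
        segment = text[pos:pos + size].strip()
        if segment:
            segments.append(segment)
        pos += size - stride

    return segments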


records = []
for i, row in df.iterrows():
    chunks = split_into_segments(row["text_norm"], size=700, stride=150)
    for j, ch in enumerate(chunks):
        records.append({
            "doc_id": int(i),
            "chunk_id": j,
            "text": ch
        })

chunks_df = pd.DataFrame(records)
chunks_df.head(), len(chunks_df)

(   doc_id  chunk_id                                               text
 0       0         0  I am sure some bashers of Pens fans are pretty...
 1       0         1  s are going to beat the pulp out of Jersey any...
 2       1         0  My brother is in the market for a high-perform...
 3       2         0  Finally you said what you dream about. Mediter...
 4       2         1  ENEVA CONVENTION"??????? YOU FACIST!!!!! Ohhh ...,
 48187)
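
The second row of this output starts mid-word ("s are going to beat...") because the splitter cuts at fixed character offsets, and the `stride` argument of `split_into_segments` acts as the overlap between neighbouring segments. A minimal sketch on a toy string, restating the same windowing logic under the hypothetical name `windows` and without the whitespace stripping, makes the shared characters visible:

```python
def windows(text: str, size: int, overlap: int) -> list[str]:
    """Fixed-size character windows; consecutive windows share `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # how far the window start advances each time
    return [text[pos:pos + size] for pos in range(0, len(text), step)]


segments = windows("abcdefghijkl", size=5, overlap=2)
print(segments)  # ['abcde', 'defgh', 'ghijk', 'jkl']
```

Each window repeats the last two characters of the previous one, which is the behaviour the notebook relies on with `size=700` and an overlap of 150 characters.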

In [35]:
# Generate vector representations (embeddings) of the segments

from sentence_transformers import SentenceTransformer
import torch

MODEL_NAME = "intfloat/e5-base-v2"

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Usando dispositivo:", device)

model = SentenceTransformer(MODEL_NAME, device=device)

# E5 models expect a "passage: " prefix on documents and "query: " on queries
passages = ["passage: " + t for t in chunks_df["text"].tolist()]


embeddings = model.encode(
    passages,
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True
).astype("float32")

embeddings.shape, embeddings.dtype

Usando dispositivo: cuda


Batches:   0%|          | 0/753 [00:00<?, ?it/s]

((48187, 768), dtype('float32'))

In [36]:
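# Encode the query with the same model, using the "query: " prefix
# (the query asks: "What are the retrieved texts about and what are their main ideas?")
query = "¿De qué tratan los textos recuperados y cuáles son las ideas principales?"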


query_emb = model.encode(
    ["query: " + query],
    convert_to_numpy=True,
    normalize_embeddings=True
).astype("float32")

In [37]:
# Install FAISS into the environment of the running kernel
%pip install faiss-cpu




In [38]:
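import faiss
import numpy as np

# Exact (brute-force) index; it reports squared L2 distances
index = faiss.IndexFlatL2(embeddings.shape[1])
index.add(embeddings)

# Retrieve the 10 segments nearest to the query
D, I = index.search(query_emb, k=10)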

In [39]:
# Print the first 500 characters of each retrieved segment
for idx in I.flatten():
    print("----")
    print(chunks_df.iloc[idx]["text"][:500])


----
replies would be appreciated, to here or to rrr@ideas.com Thanks. [RICHR]
----
point at which the texts are perceived as texts. They may be added to (and in some situations, such as the end of Mark, material is lost), but for the most part there are no substantial changes to the existing text. You're basically trying to make a mountain out of a molehill. Some people like to use the game of "telephone" as a metaphor for the transmission of the texts. This clearly wrong. The texts are transmitted accurately.
----
e are translated into thirty languages and provide themes for reflexion for the following year. NOTE: Discussion on the creation of this newsgroup will take place in news.groups. For any further information contact: Brother.Roy@almac.co.uk brother.roy@almac.co.uk
----
at seems an unreasonably pessimistic assumption to me and I want to know if someone has significantly improved on that. I have some ideas of my own on how to approach this problem, but before I spend to much t

In [42]:
# Retrieve the segments most relevant to the query
top_k = 10
D, I = index.search(query_emb, k=top_k)

top_idxs = I.flatten().tolist()
top_passages = chunks_df.iloc[top_idxs].copy()
top_passages["distance"] = D.flatten()

# Inspect the segments that will be used in the summary
top_passages[["doc_id", "chunk_id", "distance", "text"]].head(10)


Unnamed: 0,doc_id,chunk_id,distance,text
13466,5162,1,0.402834,"replies would be appreciated, to here or to rr..."
13957,5311,1,0.423401,point at which the texts are perceived as text...
34796,13371,10,0.423496,e are translated into thirty languages and pro...
32243,12387,1,0.42912,at seems an unreasonably pessimistic assumptio...
46243,18078,1,0.43216,"of religious beliefs), which are by their very..."
39529,15261,0,0.433698,Since I repost this message again for the seco...
42451,16510,2,0.435956,==============================================...
23389,9025,2,0.438042,elcome to disagree (as I know many do) and lit...
32277,12409,2,0.4385,ame and instituion in the body of text.
36396,14042,1,0.438848,hrasing a reasonable argument on this topic. T...
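
Because the embeddings were L2-normalized, the `distance` column can be read in terms of cosine similarity: `IndexFlatL2` reports squared Euclidean distances, and for unit vectors the squared distance equals 2 - 2·cos(a, b). A minimal sketch in plain Python (the helper names are illustrative, not part of the notebook) checks that identity on two toy vectors and converts the top distance from the table:

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_from_squared_l2(d2):
    """Valid only when both vectors have unit length."""
    return 1.0 - d2 / 2.0


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])
dot = sum(x * y for x, y in zip(a, b))
assert math.isclose(cosine_from_squared_l2(squared_l2(a, b)), dot)

# Top hit in the table: squared distance 0.402834
print(round(cosine_from_squared_l2(0.402834), 4))  # 0.7986
```

All ten distances fall between roughly 0.40 and 0.44, so the corresponding cosine similarities sit in a narrow band around 0.78 to 0.80.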


In [43]:
# Build the context block for the model, one labeled entry per segment
passages_text = []
for i, row in enumerate(top_passages.itertuples(index=False), 1):
    passages_text.append(f"Fragmento {i}:\n{row.text}")

context_block = "\n\n".join(passages_text)
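
The context block labels passages only by rank. A possible variation, sketched here with a hypothetical `build_context` helper and a made-up character budget, keeps `doc_id` and `chunk_id` in each label so an answer can be traced back to its source chunks, and stops adding passages once the budget is spent:

```python
def build_context(rows, max_chars: int = 6000) -> str:
    """Join passages into one prompt block, tagging each with its origin.

    `rows` is any iterable of (doc_id, chunk_id, text) tuples; `max_chars`
    is an illustrative budget, not a limit imposed by Gemini.
    """
    parts, used = [], 0
    for rank, (doc_id, chunk_id, text) in enumerate(rows, 1):
        block = f"Fragment {rank} (doc {doc_id}, chunk {chunk_id}):\n{text}"
        # Always keep the first passage; stop once the budget would be exceeded
        if parts and used + len(block) > max_chars:
            break
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)


demo = [(5162, 1, "first passage"), (5311, 1, "second passage")]
print(build_context(demo))
```

With the notebook's data, the rows could come from `top_passages[["doc_id", "chunk_id", "text"]].itertuples(index=False, name=None)`.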


In [44]:
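# Prompt (kept in Spanish, as run): asks for a 2-3 line summary of each fragment,
# then a 5-7 line global summary, using only the information in the passages
prompt = f"""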
Eres un asistente que analiza textos.

PREGUNTA DEL USUARIO:
{query}

PASAJES (top {top_k}):
{context_block}

TAREA:
1) Analiza cada fragmento de manera independiente.
Para cada uno, describe brevemente su idea central (2–3 líneas).
Posteriormente, redacta una síntesis comparativa.
2) Después, escribe un resumen global (5–7 líneas) que sintetice las ideas comunes o diferencias entre los pasajes.
3) No uses información externa a los pasajes.
"""

# Call Gemini
response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents=prompt
)

print(response.text)


Aquí tienes el análisis de los fragmentos proporcionados:

### 1) Análisis independiente y síntesis comparativa

*   **Fragmento 1:** El autor solicita respuestas a su mensaje, proporcionando una dirección de correo electrónico específica para recibir retroalimentación. Es una petición directa de comunicación.
*   **Fragmento 2:** Defiende la precisión en la transmisión de textos históricos, rechazando la metáfora del "teléfono escacharrado" y argumentando que, salvo cambios menores, los textos se mantienen fieles a su origen.
*   **Fragmento 3:** Informa sobre la traducción de materiales a treinta idiomas y la organización de temas de reflexión anuales, incluyendo datos de contacto para un grupo de noticias.
*   **Fragmento 4:** Expresa el deseo de encontrar soluciones ya existentes a un problema para evitar duplicar trabajo, detallando además la logística de publicación en diferentes grupos de discusión.
*   **Fragmento 5:** Aconseja informarse sobre debates previos, especialmente en