# Exercise 9: Using the Google Gemini API

In this exercise we learn how to use the Google Gemini API.

## 1. Basic usage

The following code sets up a basic connection to the Google Gemini API.

In [40]:
# Import the Python SDK
import google.generativeai as genai
# Used to read the API key stored securely in Colab secrets
from google.colab import userdata

# Read the Google API key from Colab secrets.
# If you do not have one, create it in Google AI Studio and save it in Colab secrets as 'GOOGLE_API_KEY'.
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

# Initialize the Gemini model.
# Another model that supports generateContent can be used instead, for example 'models/gemini-2.5-pro'.
gemini_model = genai.GenerativeModel('models/gemini-2.5-flash')

print("Model ready:", gemini_model.model_name)
print("Google Gemini API configured and model initialized.")

Model ready: models/gemini-2.5-flash
Google Gemini API configured and model initialized.
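
Outside Colab, `google.colab.userdata` is not available. The sketch below shows one possible alternative, assuming the key has been exported as an environment variable named `GOOGLE_API_KEY`. The helper `load_api_key` is hypothetical and not part of any SDK.

```python
import os

def load_api_key(name: str = "GOOGLE_API_KEY", env=None) -> str:
    """Return the API key stored under `name`, or raise a clear error."""
    # `env` defaults to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {name!r} is not set or is empty.")
    return key

# Demonstration with an explicit mapping instead of the real environment.
demo_env = {"GOOGLE_API_KEY": "  not-a-real-key  "}
print(load_api_key(env=demo_env))
```

The returned string could then be passed to `genai.configure(api_key=...)` exactly as in the cell above.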


In [39]:
import google.generativeai as genai

models = list(genai.list_models())
for m in models:
    # Keep only models that support generateContent (the ones usable for summarization)
    if "generateContent" in getattr(m, "supported_generation_methods", []):
        print(m.name)

models/gemini-2.5-flash
models/gemini-2.5-pro
models/gemini-2.0-flash-exp
models/gemini-2.0-flash
models/gemini-2.0-flash-001
models/gemini-2.0-flash-exp-image-generation
models/gemini-2.0-flash-lite-001
models/gemini-2.0-flash-lite
models/gemini-2.0-flash-lite-preview-02-05
models/gemini-2.0-flash-lite-preview
models/gemini-exp-1206
models/gemini-2.5-flash-preview-tts
models/gemini-2.5-pro-preview-tts
models/gemma-3-1b-it
models/gemma-3-4b-it
models/gemma-3-12b-it
models/gemma-3-27b-it
models/gemma-3n-e4b-it
models/gemma-3n-e2b-it
models/gemini-flash-latest
models/gemini-flash-lite-latest
models/gemini-pro-latest
models/gemini-2.5-flash-lite
models/gemini-2.5-flash-image-preview
models/gemini-2.5-flash-image
models/gemini-2.5-flash-preview-09-2025
models/gemini-2.5-flash-lite-preview-09-2025
models/gemini-3-pro-preview
models/gemini-3-flash-preview
models/gemini-3-pro-image-preview
models/nano-banana-pro-preview
models/gemini-robotics-er-1.5-preview
models/gemini-2.5-computer-use-prev
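
To make the filter in the cell above easier to follow without calling the API, here is a small sketch that applies the same `getattr` test to stand-in objects. The model names and the `names_supporting` helper are hypothetical, chosen only for illustration.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the objects returned by genai.list_models().
fake_models = [
    SimpleNamespace(name="models/example-text",
                    supported_generation_methods=["generateContent", "countTokens"]),
    SimpleNamespace(name="models/example-embedding",
                    supported_generation_methods=["embedContent"]),
    SimpleNamespace(name="models/example-legacy"),  # attribute missing on purpose
]

def names_supporting(models, method: str = "generateContent"):
    """Return the names of the models whose method list contains `method`."""
    # getattr with a default of [] keeps objects without the attribute from raising.
    return [m.name for m in models
            if method in getattr(m, "supported_generation_methods", [])]

print(names_supporting(fake_models))
```

The default value in `getattr` is what lets the third stand-in pass through the loop silently instead of raising `AttributeError`.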

## 2. Retrieval

### 2.1 Loading the 20 Newsgroups corpus

In [2]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data

In [3]:
import pandas as pd
df = pd.DataFrame(newsgroupsdocs, columns=['text'])
df

Unnamed: 0,text
0,\n\nI am sure some bashers of Pens fans are pr...
1,My brother is in the market for a high-perform...
2,\n\n\n\n\tFinally you said what you dream abou...
3,\nThink!\n\nIt's the SCSI card doing the DMA t...
4,1) I have an old Jasmine drive which I cann...
...,...
18841,DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD...
18842,\nNot in isolated ground recepticles (usually ...
18843,I just installed a DX2-66 CPU in a clone mothe...
18844,\nWouldn't this require a hyper-sphere. In 3-...


### 2.2 Converting the text to embeddings

In [4]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm
import re

df = df.dropna(subset=["text"]).reset_index(drop=True)

# Basic cleanup: collapse every run of whitespace into a single space
def normalize_text(s: str) -> str:
    s = re.sub(r"\s+", " ", s).strip()
    return s

df["text_norm"] = df["text"].astype(str).map(normalize_text)

df.head()

Unnamed: 0,text,text_norm
0,\n\nI am sure some bashers of Pens fans are pr...,I am sure some bashers of Pens fans are pretty...
1,My brother is in the market for a high-perform...,My brother is in the market for a high-perform...
2,\n\n\n\n\tFinally you said what you dream abou...,Finally you said what you dream about. Mediter...
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,Think! It's the SCSI card doing the DMA transf...
4,1) I have an old Jasmine drive which I cann...,1) I have an old Jasmine drive which I cannot ...
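
As a quick check of what `normalize_text` does, the following sketch runs the same regular expression on a short made-up string (the sample text is illustrative, not taken from the corpus).

```python
import re

def normalize_text(s: str) -> str:
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space,
    # then drop leading and trailing spaces.
    return re.sub(r"\s+", " ", s).strip()

raw = "\n\n\tFinally   you said\nwhat you dream about. "
print(repr(normalize_text(raw)))
```

Newlines and tabs disappear, which matches the difference between the `text` and `text_norm` columns shown above.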


In [5]:
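def chunk_text(text: str, max_chars: int = 800, overlap: int = 100):
    """
    Character-based chunking.
    A max_chars of roughly 600-1000 usually works well.
    The overlap helps avoid cutting an idea in half.
    Requires overlap < max_chars; otherwise the loop below could run forever.
    """
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")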
    chunks = []
    start = 0
    n = len(text)
    while start < n:
        end = min(start + max_chars, n)
        chunk = text[start:end]
        chunk = chunk.strip()
        if len(chunk) > 0:
            chunks.append(chunk)
        if end == n:
            break
        start = max(0, end - overlap)
    return chunks

records = []
for i, row in df.iterrows():
    chunks = chunk_text(row["text_norm"], max_chars=800, overlap=100)
    for j, ch in enumerate(chunks):
        records.append({
            "doc_id": int(i),
            "chunk_id": j,
            "text": ch
        })

chunks_df = pd.DataFrame(records)
chunks_df.head(), len(chunks_df)

(   doc_id  chunk_id                                               text
 0       0         0  I am sure some bashers of Pens fans are pretty...
 1       1         0  My brother is in the market for a high-perform...
 2       2         0  Finally you said what you dream about. Mediter...
 3       2         1  urds and Turks once upon a time! Ohhhh so swed...
 4       3         0  Think! It's the SCSI card doing the DMA transf...,
 38871)
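
To see how the window and the overlap interact, here is a toy run of the same chunking logic with small, illustrative parameters (`max_chars=10`, `overlap=3`) on a 25-character string. Each window starts `max_chars - overlap` characters after the previous one.

```python
def chunk_text(text: str, max_chars: int = 800, overlap: int = 100):
    """Character-based chunking with a fixed overlap between neighbours."""
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    chunks, start, n = [], 0, len(text)
    while start < n:
        end = min(start + max_chars, n)
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end == n:
            break
        start = end - overlap  # the next window re-reads the last `overlap` chars
    return chunks

toy = "abcdefghijklmnopqrstuvwxy"  # 25 characters
for c in chunk_text(toy, max_chars=10, overlap=3):
    print(c)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxy
```

Because the cut positions ignore word boundaries, a chunk can start mid-word, as in the `urds and Turks...` row of the output above.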

In [6]:
from sentence_transformers import SentenceTransformer

MODEL_NAME = "intfloat/e5-base-v2"   # recommended for retrieval
model = SentenceTransformer(MODEL_NAME)

# Texts to index; E5 models expect the "passage: " prefix on indexed text
passages = ["passage: " + t for t in chunks_df["text"].tolist()]



In [7]:
# Embeddings (N x D)
# normalize_embeddings=True yields unit-length vectors, so the inner product equals cosine similarity
embeddings = model.encode(
    passages,
    batch_size=16,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True
).astype("float32")


In [8]:
print(embeddings.shape, embeddings.dtype)

(38871, 768) float32
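
The reason for `normalize_embeddings=True` can be verified with a tiny pure-Python sketch: once two vectors are scaled to unit length, their inner product equals their cosine similarity. The vectors and helper names below are illustrative only.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [4.0, 3.0]
print(cosine(a, b))                           # cosine on raw vectors
print(dot(l2_normalize(a), l2_normalize(b)))  # inner product on unit vectors
```

Both prints give the same value up to floating-point rounding, which is why an inner-product index is enough for cosine ranking here.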


### 2.3 Building a query and running the search

In [9]:
def embed_query(query: str) -> np.ndarray:
    q = "query: " + query
    vec = model.encode(
        [q],
        convert_to_numpy=True,
        normalize_embeddings=True
    ).astype("float32")
    return vec

query_text = "Mars mission launch"

query_vec = embed_query(query_text)
query_vec.shape

(1, 768)

Next, I retrieve the 5 chunks most similar to the query.

In [10]:
!pip -q install faiss-cpu

In [11]:
import faiss
import numpy as np

# Embedding dimension (768 for e5-base-v2)
d = embeddings.shape[1]

# Inner-product index; with normalized embeddings this ranks by cosine similarity
index = faiss.IndexFlatIP(d)

print("FAISS index created. Dimension:", d)

FAISS index created. Dimension: 768


In [12]:
# Ensure float32 (FAISS requires it); embeddings is already float32 here, so this only makes a copy
embeddings_f32 = embeddings.astype("float32")

index.add(embeddings_f32)

print("Embeddings added to the index.")
print("Total vectors in the index:", index.ntotal)

Embeddings added to the index.
Total vectors in the index: 38871
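
`IndexFlatIP` performs an exact, exhaustive search by inner product. The pure-Python sketch below mimics that behaviour on four toy vectors to show what the index computes. It is an illustration of the idea, not FAISS's implementation, and `top_k_inner_product` is a hypothetical helper.

```python
def top_k_inner_product(query, vectors, k):
    """Score every vector against the query and return (scores, indices) of the k best."""
    scored = [(sum(q * x for q, x in zip(query, v)), i) for i, v in enumerate(vectors)]
    # Highest inner product first; sort is stable, so ties keep insertion order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    return [s for s, _ in top], [i for _, i in top]

# Unit-length toy vectors, so the inner product is also the cosine similarity.
toy_vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
toy_query = [0.8, 0.6]

demo_scores, demo_idxs = top_k_inner_product(toy_query, toy_vectors, k=2)
print(demo_idxs, demo_scores)
```

Vector 1 points closest to the query, so it ranks first, followed by vector 0.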


In [21]:
# embed_query() is reused from the earlier cell; it prepends the "query: " prefix that E5 expects.
query_text = "Mars mission launch"  # search query
q_emb = embed_query(query_text)

k = 5
scores, idxs = index.search(q_emb, k)

print("Query:", query_text)
print("Top-k:", k)

Query: Mars mission launch
Top-k: 5


In [36]:
top_rows = []
for rank, (score, idx) in enumerate(zip(scores[0], idxs[0]), start=1):
    row = chunks_df.iloc[int(idx)]
    top_rows.append({
        "rank": rank,
        "score": float(score),
        "doc_id": int(row["doc_id"]),
        "chunk_id": int(row["chunk_id"]),
        "text": row["text"]
    })

top_df = pd.DataFrame(top_rows)
top_df[["rank", "score", "doc_id", "chunk_id"]]

Unnamed: 0,rank,score,doc_id,chunk_id
0,1,0.858486,6631,7
1,2,0.848824,6631,4
2,3,0.842103,6631,5
3,4,0.833091,6631,6
4,5,0.831467,6631,3
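
All five hits in the table above come from the same document (`doc_id` 6631). If one result per document is preferred, a possible post-processing step is to keep only the best-scoring chunk for each `doc_id`. This is a sketch of that idea, not part of the original exercise. `best_chunk_per_doc` is a hypothetical helper, and the scores are copied from the table.

```python
def best_chunk_per_doc(hits):
    """Keep only the highest-scoring hit for each doc_id, best score first."""
    seen, kept = set(), []
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        if hit["doc_id"] not in seen:
            seen.add(hit["doc_id"])
            kept.append(hit)
    return kept

# Scores and ids copied from the table above.
hits = [
    {"score": 0.858486, "doc_id": 6631, "chunk_id": 7},
    {"score": 0.848824, "doc_id": 6631, "chunk_id": 4},
    {"score": 0.842103, "doc_id": 6631, "chunk_id": 5},
    {"score": 0.833091, "doc_id": 6631, "chunk_id": 6},
    {"score": 0.831467, "doc_id": 6631, "chunk_id": 3},
]
print(best_chunk_per_doc(hits))
```

With these hits only chunk 7 survives, so in practice one would search with a larger `k` before deduplicating.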


In [43]:
# Show the FULL TEXT of the top-5 retrieved chunks
print("Query:", query_text)
print("Top-5 results (full text):\n")

for i, idx in enumerate(idxs[0][:5], start=1):
    row = chunks_df.iloc[int(idx)]
    print("="*100)
    print(f"RANK #{i} | score={scores[0][i-1]:.4f} | doc_id={int(row['doc_id'])} | chunk_id={int(row['chunk_id'])}")
    print("-"*100)
    print(row["text"])
    print()


Query: Mars mission launch
Top-5 results (full text):

RANK #1 | score=0.8585 | doc_id=6631 | chunk_id=7
----------------------------------------------------------------------------------------------------
h Anniversary, Mariner 7 Launch (Mars Flyby Mission) Mar 29 - 20th Anniversary, Mariner 10, 1st Mercury Flyby * Mar 31 - Galaxy 1R Delta 2 Launch

RANK #2 | score=0.8488 | doc_id=6631 | chunk_id=4
----------------------------------------------------------------------------------------------------
ch Oct ?? - SLV-1 Pegasus Launch Oct ?? - Telstar 4 Atlas Launch Oct 01 - SeaWIFS Launch Oct 22 - Orionid Meteor Shower (Maximum: 00:00 UT, Solar Longitude 208.7 degrees) November 1993 Nov ?? - Solidaridad/MOP-3 Ariane Launch Nov 03 - 20th Anniversary, Mariner 10 Launch (Mercury & Venus Flyby Mission) Nov 03 - S. Taurid Meteor Shower Nov 04 - Galileo Exits Asteroid Belt Nov 06 - Mercury Transits Across the Sun, Visible from Asia, Australia, and the South Pacific * Nov 08 - Mars Obser

In [42]:
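# --- Summarize with Gemini: print the original text next to its summary to show the API in use ---
# The prompt and the printed labels are kept in Spanish, matching the recorded output below.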

def summarize_with_gemini(text: str, query: str) -> str:
    prompt = f"""
Eres un asistente que resume texto recuperado por un sistema de búsqueda semántica.

Instrucciones:
- Resume el TEXTO en 2 a 4 líneas.
- Mantén solo lo esencial.
- No inventes datos.
- Si el texto está incompleto o parece fragmento, indícalo brevemente.

Consulta original: {query}

TEXTO:
{text}
"""
    resp = gemini_model.generate_content(prompt)
    return (resp.text or "").strip()

summaries = []

for _, r in top_df.iterrows():
    original = r["text"]

    try:
        resumen = summarize_with_gemini(original, query_text)
    except Exception as e:
        resumen = f"[ERROR resumiendo con Gemini] {e}"

    summaries.append(resumen)

    print("=" * 110)
    print(f"RANK #{int(r['rank'])} | score={r['score']:.4f} | doc_id={int(r['doc_id'])} | chunk_id={int(r['chunk_id'])}")
    print("-" * 110)
    print("TEXTO ORIGINAL:\n")
    print(original)
    print("\n" + "-" * 110)
    print("RESUMEN (Gemini):\n")
    print(resumen)
    print()

top_df["gemini_summary"] = summaries
top_df[["rank", "score", "doc_id", "chunk_id", "gemini_summary"]]

RANK #1 | score=0.8585 | doc_id=6631 | chunk_id=7
--------------------------------------------------------------------------------------------------------------
TEXTO ORIGINAL:

h Anniversary, Mariner 7 Launch (Mars Flyby Mission) Mar 29 - 20th Anniversary, Mariner 10, 1st Mercury Flyby * Mar 31 - Galaxy 1R Delta 2 Launch

--------------------------------------------------------------------------------------------------------------
RESUMEN (Gemini):

El texto menciona el aniversario de la misión Mariner 7 (un sobrevuelo a Marte). Además, celebra el 20º aniversario del Mariner 10, la primera misión de sobrevuelo a Mercurio (29 de marzo), y el lanzamiento del Galaxy 1R Delta 2 (31 de marzo).

Este texto parece ser un fragmento de una lista de eventos o un calendario.

RANK #2 | score=0.8488 | doc_id=6631 | chunk_id=4
--------------------------------------------------------------------------------------------------------------
TEXTO ORIGINAL:

ch Oct ?? - SLV-1 Pegasus Launch Oct ?? - Tel

Unnamed: 0,rank,score,doc_id,chunk_id,gemini_summary
0,1,0.858486,6631,7,El texto menciona el aniversario de la misión ...
1,2,0.848824,6631,4,Este fragmento de texto describe eventos espac...
2,3,0.842103,6631,5,"Este fragmento de texto, incompleto, lista eve..."
3,4,0.833091,6631,6,El texto es un fragmento que lista eventos ast...
4,5,0.831467,6631,3,"Este texto, que parece un listado de eventos e..."
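
The summarization loop above turns any exception into an error string. If transient failures are a concern (an assumption, since the recorded run shows none), one option is to retry the call with exponential backoff before giving up. The sketch below is generic and does not call Gemini. `call_with_retries` and `flaky` are hypothetical names, and the `sleep` parameter exists only so the demo runs without waiting.

```python
import time

def call_with_retries(fn, *args, attempts: int = 3, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(*args); after a failure wait base_delay * 2**i seconds and try again."""
    for i in range(attempts):
        try:
            return fn(*args)
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** i)

# Demonstration with a stand-in that fails twice before succeeding.
calls = {"n": 0}

def flaky(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return text.upper()

delays = []
print(call_with_retries(flaky, "ok", sleep=delays.append))  # delays are recorded, not slept
print(delays)
```

In the notebook, a call such as `call_with_retries(summarize_with_gemini, original, query_text)` could replace the direct call inside the existing `try` block.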
