# Exercise 9: Using the Google Gemini API

In this exercise we will learn how to use the Google Gemini API.

## 1. Basic usage

The following code connects to the Google Gemini API in its most basic form.

In [2]:
from dotenv import load_dotenv
import os

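# Load GEMINI_API_KEY from a local .env file into the environment
load_dotenv()
api_key = os.getenv('GEMINI_API_KEY')

from google import genai

# Pass the key explicitly so the variable loaded above is actually used
client = genai.Client(api_key=api_key)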

response = client.models.generate_content(
    model="gemini-3-flash-preview",
    contents="Explain how AI works in a few words",
)

print(response.text)

AI learns patterns from data to make predictions.
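
Since `os.getenv` returns `None` without complaint when a variable is missing, it can help to fail fast before creating the client. The following is a minimal sketch under that assumption; `require_env` is a hypothetical helper (not part of `python-dotenv` or the Gemini SDK), and the variable name and value in the demo are placeholders.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; it is not part of any library.
    """
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; check your .env file."
        )
    return value


# Demo with a placeholder name and value so the sketch runs on its own.
os.environ["DEMO_API_KEY"] = "not-a-real-key"
print(require_env("DEMO_API_KEY"))
```

In the notebook, the same check could be applied to `GEMINI_API_KEY` right after `load_dotenv()`.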


## 2. Retrieval

### 2.1 Load the 20 Newsgroups corpus

In [3]:
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data

In [4]:
import pandas as pd
df = pd.DataFrame(newsgroupsdocs, columns=['text'])
df

Unnamed: 0,text
0,\n\nI am sure some bashers of Pens fans are pr...
1,My brother is in the market for a high-perform...
2,\n\n\n\n\tFinally you said what you dream abou...
3,\nThink!\n\nIt's the SCSI card doing the DMA t...
4,1) I have an old Jasmine drive which I cann...
...,...
18841,DN> From: nyeda@cnsvax.uwec.edu (David Nye)\nD...
18842,\nNot in isolated ground recepticles (usually ...
18843,I just installed a DX2-66 CPU in a clone mothe...
18844,\nWouldn't this require a hyper-sphere. In 3-...


In [5]:
import numpy as np
from tqdm.auto import tqdm
import re

df = df.dropna(subset=["text"]).reset_index(drop=True)

# Basic cleanup: collapse every run of whitespace into a single space
def normalize_text(s: str) -> str:
    s = re.sub(r"\s+", " ", s).strip()
    return s

df["text_norm"] = df["text"].astype(str).map(normalize_text)

df.head()

Unnamed: 0,text,text_norm
0,\n\nI am sure some bashers of Pens fans are pr...,I am sure some bashers of Pens fans are pretty...
1,My brother is in the market for a high-perform...,My brother is in the market for a high-perform...
2,\n\n\n\n\tFinally you said what you dream abou...,Finally you said what you dream about. Mediter...
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,Think! It's the SCSI card doing the DMA transf...
4,1) I have an old Jasmine drive which I cann...,1) I have an old Jasmine drive which I cannot ...
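
To see what the cleanup does in isolation, here is a small sketch that applies the same regular expression to a made-up string (the sample text is an assumption for illustration, not a row from the corpus). Note that paragraph breaks are lost as well, since newlines count as whitespace.

```python
import re


def normalize_text(s: str) -> str:
    # Same rule as in the cell above: any whitespace run becomes one space,
    # then leading and trailing spaces are removed.
    return re.sub(r"\s+", " ", s).strip()


# Made-up sample with newlines, a tab, and repeated spaces.
sample = "\n\n\tFinally   you said\nwhat you dream about.  "
print(repr(normalize_text(sample)))
```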


### 2.2 Transform into embeddings

In [6]:
def chunk_text(text: str, max_chars: int = 800, overlap: int = 100):
    """
    Character-based chunking.
    A max_chars of roughly 600-1000 usually works well.
    The overlap helps avoid cutting ideas in half.
    """
    chunks = []
    start = 0
    n = len(text)
    while start < n:
        end = min(start + max_chars, n)
        chunk = text[start:end]
        chunk = chunk.strip()
        if len(chunk) > 0:
            chunks.append(chunk)
        if end == n:
            break
        start = max(0, end - overlap)
    return chunks

records = []
for i, row in df.iterrows():
    chunks = chunk_text(row["text_norm"], max_chars=800, overlap=100)
    for j, ch in enumerate(chunks):
        records.append({
            "doc_id": int(i),
            "chunk_id": j,
            "text": ch
        })

chunks_df = pd.DataFrame(records)
chunks_df.head(), len(chunks_df)

(   doc_id  chunk_id                                               text
 0       0         0  I am sure some bashers of Pens fans are pretty...
 1       1         0  My brother is in the market for a high-perform...
 2       2         0  Finally you said what you dream about. Mediter...
 3       2         1  urds and Turks once upon a time! Ohhhh so swed...
 4       3         0  Think! It's the SCSI card doing the DMA transf...,
 38871)
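
The sliding window is easier to follow on a tiny input. The sketch below repeats the chunking logic with small, made-up parameters (`max_chars=10`, `overlap=3`) so that the shared characters between consecutive chunks are visible. The `ValueError` guard is an addition of this sketch, not part of the cell above: without it, an `overlap` greater than or equal to `max_chars` would keep `start` from advancing.

```python
def chunk_text(text: str, max_chars: int = 800, overlap: int = 100) -> list:
    # Guard added in this sketch: the window must move forward on every step.
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks = []
    start = 0
    n = len(text)
    while start < n:
        end = min(start + max_chars, n)
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        if end == n:
            break
        # Step back by `overlap` characters so neighbouring chunks share context.
        start = end - overlap
    return chunks


text = "abcdefghijklmnopqrstuvwxy"  # 25 characters, no spaces
chunks = chunk_text(text, max_chars=10, overlap=3)
for i, chunk in enumerate(chunks):
    print(i, chunk)
```

Each chunk starts with the last three characters of the previous one, which is the same mechanism that makes chunk 1 of document 2 in the output above begin mid-word.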

### 2.3 Create a query and run the search

In [7]:
from sentence_transformers import SentenceTransformer

MODEL_NAME = "intfloat/e5-base-v2"   # recommended for retrieval
model = SentenceTransformer(MODEL_NAME)

# Texts to index (passages); E5 models expect the "passage: " prefix on indexed text
passages = ["passage: " + t for t in chunks_df["text"].tolist()]

In [8]:
# Embeddings (N x D)
# normalize_embeddings=True returns unit-length vectors, so a dot product equals cosine similarity
embeddings = model.encode(
    passages[:5000],
    batch_size=16,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True
).astype("float32")


In [9]:
print(embeddings.shape, embeddings.dtype)

(5000, 768) float32
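
The reason for `normalize_embeddings=True` can be checked with plain arithmetic: once two vectors have unit length, their dot product is their cosine similarity. The sketch below uses two made-up 3-dimensional vectors and only the standard library; the helper names are hypothetical and are not part of `sentence_transformers`.

```python
import math


def dot(a, b):
    # Sum of element-wise products.
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    # Scale the vector so that its Euclidean norm is 1.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    # Textbook definition: dot product divided by the product of the norms.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]  # norm 5
b = [1.0, 2.0, 2.0]  # norm 3

a_unit, b_unit = l2_normalize(a), l2_normalize(b)
print(round(cosine_similarity(a, b), 6))
print(round(dot(a_unit, b_unit), 6))
```

Both lines print the same value (11/15), which is why a single matrix product against the normalized embedding matrix is enough to score every passage.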


In [10]:
def embed_query(query: str) -> np.ndarray:
    q = "query: " + query
    vec = model.encode(
        [q],
        convert_to_numpy=True,
        normalize_embeddings=True
    ).astype("float32")
    return vec

query_text = "My brother is in the market"

query_vec = embed_query(query_text)
query_vec.shape

(1, 768)

Retrieve the 5 chunks most similar to the query.

In [11]:
# Cosine similarity between the query and every embedding
# (a plain dot product, since all vectors are L2-normalized)
similarities = np.dot(embeddings, query_vec.T).flatten()

top_k = 5
top_indices = np.argsort(similarities)[::-1][:top_k]

print("="*80)
print(f"Query: '{query_text}'")
print(f"Top {top_k} most similar documents:\n")
print("="*80)

for rank, idx in enumerate(top_indices, 1):
    score = similarities[idx]
    text = chunks_df.iloc[idx]["text"]
    doc_id = chunks_df.iloc[idx]["doc_id"]
    chunk_id = chunks_df.iloc[idx]["chunk_id"]

    print(f"Rank {rank} (Score: {score:.4f})")
    print(f"Doc ID: {doc_id}, Chunk ID: {chunk_id}")
    print(f"Text: {text[:200]}...")
    print("-" * 80)

Query: 'My brother is in the market'
Top 5 most similar documents:

Rank 1 (Score: 0.7779)
Doc ID: 2205, Chunk ID: 0
Text: I hope you're not going to flame him. Please give him the same coutesy you' ve given me....
--------------------------------------------------------------------------------
Rank 2 (Score: 0.7743)
Doc ID: 1749, Chunk ID: 0
Text: I have the following items for sale. The highest bid for each to arrive in my email box by 5:00 pm EDT Wednesday April 21, 1993 gets the item. 1] Skillcraft Senior Chemlab Set 4581 Safe for Ages 10 an...
--------------------------------------------------------------------------------
Rank 3 (Score: 0.7739)
Doc ID: 2539, Chunk ID: 0
Text: Well, I've been informed that the price on the whole thing I'm selling is now less than the price I'm selling it for. That will teach me to wait that long before getting rid of electronic equipment. N...
--------------------------------------------------------------------------------
Rank 4 (Score: 0.7726)