# Exercise 9: Using the OpenAI API

In this exercise we learn how to use the OpenAI API.

## 1. Basic usage

The following code shows the simplest way to connect to the OpenAI API: create a client with an API key and send a single request.

In [5]:
#pip install openai

In [None]:
from openai import OpenAI
API = "APIKEY"
client = OpenAI(api_key=API)
response = client.responses.create(
    model="gpt-4o",
    input="Write a one-sentence bedtime story about a unicorn."
)

print(response.output_text)

Under the silver glow of the moon, the whimsical unicorn danced through a forest of twinkling stars, leaving trails of laughter for dreamers everywhere.
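
Hard-coding the key in the notebook, as with the `API = "APIKEY"` placeholder above, makes it easy to leak. A minimal sketch of reading it from the environment instead, assuming the key is stored in a variable named `OPENAI_API_KEY` (the name the `openai` client also looks for by default). The helper `read_api_key` is hypothetical and not part of the `openai` package.

```python
import os


def read_api_key(name="OPENAI_API_KEY", environ=None):
    """Return the API key stored in an environment variable.

    Hypothetical helper for this exercise: it only reads the variable
    and fails loudly when it is missing or blank.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key


# Usage (assumption: the variable was exported before starting Jupyter):
# client = OpenAI(api_key=read_api_key())
```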


## 2. Retrieval
### 2.1 Loading the 20 Newsgroups corpus

In [7]:
from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
newsgroupsdocs = newsgroups.data[:5000]
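
Removing headers, footers and quotes can leave some posts empty or made only of whitespace, and those would still receive an embedding. A small sketch of filtering them out before encoding; `drop_empty_docs` and its `min_chars` threshold are assumptions for illustration, and the cells that follow use the unfiltered list.

```python
def drop_empty_docs(docs, min_chars=1):
    """Keep documents that still have at least `min_chars` characters after trimming."""
    return [doc for doc in docs if len(doc.strip()) >= min_chars]


# Toy stand-ins for newsgroup posts (not taken from the corpus).
sample = ["\n\n", "My brother is in the market for a drive.", "   ", "Think!"]
print(drop_empty_docs(sample))
```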


In [8]:
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd



In [9]:
newsgroups_df = pd.DataFrame(newsgroupsdocs, columns=['raw'])
newsgroups_df

Unnamed: 0,raw
0,\n\nI am sure some bashers of Pens fans are pr...
1,My brother is in the market for a high-perform...
2,\n\n\n\n\tFinally you said what you dream abou...
3,\nThink!\n\nIt's the SCSI card doing the DMA t...
4,1) I have an old Jasmine drive which I cann...
...,...
4995,The following are my thoughts on a meeting tha...
4996,David posts a good translation of a post by Su...
4997,"Note: I am cross-posting (actually, emailing) ..."
4998,\nThen don't complain (maybe it wasn't you) th...


In [10]:
# Load the SBERT model
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(newsgroupsdocs)

In [11]:
embeddings

array([[ 0.002078  ,  0.02345043,  0.02480886, ...,  0.0014359 ,
         0.01510752,  0.05287581],
       [ 0.05006031,  0.02698093, -0.00886484, ..., -0.00887169,
        -0.06737083,  0.05656362],
       [ 0.01640475,  0.08100051, -0.04953596, ..., -0.04184629,
        -0.07800221, -0.03130953],
       ...,
       [-0.01800568,  0.03764157,  0.0190541 , ..., -0.02706385,
         0.06655327, -0.0518438 ],
       [-0.02670172, -0.01402128,  0.00013569, ..., -0.02312811,
        -0.02680116, -0.03862799],
       [ 0.00509074, -0.01956049, -0.05848601, ...,  0.12941848,
        -0.06473386, -0.01300098]], dtype=float32)

In [12]:
newsgroups_df['embeddings'] = embeddings.tolist()
newsgroups_df

Unnamed: 0,raw,embeddings
0,\n\nI am sure some bashers of Pens fans are pr...,"[0.0020780046470463276, 0.02345043234527111, 0..."
1,My brother is in the market for a high-perform...,"[0.05006030574440956, 0.0269809328019619, -0.0..."
2,\n\n\n\n\tFinally you said what you dream abou...,"[0.016404753550887108, 0.08100050687789917, -0..."
3,\nThink!\n\nIt's the SCSI card doing the DMA t...,"[-0.01939147524535656, 0.011494365520775318, -..."
4,1) I have an old Jasmine drive which I cann...,"[-0.03928707540035248, -0.05540286749601364, -..."
...,...,...
4995,The following are my thoughts on a meeting tha...,"[-0.04432957246899605, 0.042666345834732056, -..."
4996,David posts a good translation of a post by Su...,"[0.01835048384964466, 0.1009376272559166, 0.00..."
4997,"Note: I am cross-posting (actually, emailing) ...","[-0.01800568215548992, 0.03764156997203827, 0...."
4998,\nThen don't complain (maybe it wasn't you) th...,"[-0.02670172043144703, -0.014021284878253937, ..."


In [13]:
from sentence_transformers.util import cos_sim
import torch

In [14]:
# 1. Generate the query embedding
#query = "What technology did the US benefit from after WW2?"
query = "What this document explain about United States?"
query_embedding = model.encode(query, convert_to_tensor=False)
print(query_embedding)

[-4.40027751e-02  6.76465631e-02  1.81450676e-02  2.71494035e-02
  3.97355147e-02  2.86677741e-02  1.70837287e-02 -5.11704050e-02
 -7.08375424e-02 -5.84035367e-03 -2.68768277e-02  9.36001390e-02
 -5.38061373e-03 -6.05956353e-02 -7.20592961e-02  4.52284105e-02
 -1.30038504e-02 -4.73392345e-02 -2.32526548e-02  4.68500927e-02
  1.05721936e-01  4.48408648e-02  1.61421206e-02  6.92704171e-02
 -3.10235526e-02  6.19343705e-02  5.20551577e-03 -3.54517475e-02
  1.06270779e-02 -3.20176929e-02 -9.43932403e-03 -9.78435203e-03
  5.84899709e-02  1.47197843e-02 -1.96531881e-02 -1.94037799e-02
  1.74149409e-01 -1.98645946e-02  2.77989581e-02 -1.10337818e-02
  1.38298701e-02 -1.01271942e-01  1.29697900e-02  8.35139528e-02
 -7.40730716e-03  1.14514142e-01 -4.31395769e-02  2.95909438e-02
 -3.40581052e-02  1.38381766e-02 -1.20901493e-02  8.85839462e-02
 -3.82152312e-02 -1.93621740e-02  7.27589726e-02  7.54610524e-02
 -1.64738670e-02  1.54060200e-02 -2.94414982e-02 -7.66353235e-02
  6.28036121e-03 -2.38865

In [15]:
# 2. Compute the cosine similarity between the query and every document
cosine_scores = cos_sim(query_embedding, embeddings)[0]
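
`cos_sim` scores each document by the cosine of the angle between its vector and the query vector: the dot product divided by the product of the two norms. A dependency-free sketch of that formula on toy vectors; the function name `cosine_similarity` is ours, while the notebook itself relies on `sentence_transformers.util.cos_sim`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Parallel vectors score close to 1, orthogonal vectors score 0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```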

Next, retrieve the 5 documents most similar to the query.

In [16]:
# 3. Get the 5 most relevant documents (top-k)
k = 5
top_results = torch.topk(cosine_scores, k=k)
top_results

torch.return_types.topk(
values=tensor([0.4317, 0.4317, 0.4253, 0.4242, 0.4124]),
indices=tensor([2120, 4262, 1685, 1599, 2501]))
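
`torch.topk` returns the k largest scores together with their positions, best first. The same selection can be sketched in plain Python with `heapq.nlargest`; the helper name `top_k` is an assumption for illustration.

```python
import heapq


def top_k(scores, k=5):
    """Return (score, index) pairs for the k largest scores, best first."""
    return heapq.nlargest(k, ((score, idx) for idx, score in enumerate(scores)))


# Toy scores, not taken from the corpus.
print(top_k([0.1, 0.9, 0.4, 0.7], k=2))
```

Keeping the index next to each score is what lets the next cell map the results back to the original documents.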

In [17]:
# 4. Build the context in a structured way
# Instead of concatenating the texts directly, separate them with a delimiter
# This helps the LLM understand that they are distinct fragments
context_separator = "\n\n---\n\n"
retrieved_docs = [newsgroupsdocs[idx] for idx in top_results.indices]
context = context_separator.join(retrieved_docs)

print("--- CONTEXTO RECUPERADO ---")
print(context)
print("--------------FIN DOCUMENTO------------\n")

--- CONTEXTO RECUPERADO ---
From: Center for Policy Research <cpr>
Subject: Unconventional peace proposal


A unconventional proposal for peace in the Middle-East.
---------------------------------------------------------- by
			  Elias Davidsson

The following proposal is based on the following assumptions:

1.      Fundamental human rights, such as the right to life, to
education, to establish a family and have children, to human
dignity, the right to free movement, to free expression, etc. are
more important to human existence that the rights of states.

2.      In the event of a conflict between basic human rights and
rights of collectivities, basic human rights should prevail.

3.      Between the collectivities defining themselves as
Jewish-Israeli and Palestinian-Arab, however labelled, an
unresolved conflict exists.

4.      This conflict has caused great sufferings for millions of
people. It moreover poisons relations between communities, peoples
and nations.

5.      Each yea
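
The retrieved posts can be long, and the two best scores above are identical (0.4317), which may indicate a duplicated post. A sketch, under those assumptions, of a context builder that skips exact duplicates and truncates each document; `build_context` and the `max_chars=1500` limit are illustrative choices, not values used in this notebook.

```python
def build_context(docs, separator="\n\n---\n\n", max_chars=1500):
    """Join retrieved documents, skipping exact duplicates and truncating long ones."""
    seen = set()
    parts = []
    for doc in docs:
        text = doc.strip()
        if not text or text in seen:
            continue  # ignore empty posts and exact repeats
        seen.add(text)
        parts.append(text[:max_chars])
    return separator.join(parts)


print(build_context(["first post", "first post", "  second post  "]))
```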

In [18]:
# 5. Prompt design

# CONTEXT SYNTHESIS
# In this step the model condenses all retrieved fragments into a single
# coherent summary. This filters out noise and makes the retrieved information easier to use.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": """
 Eres un asistente experto en síntesis de información. Tu trabajo es responder a la pregunta del usuario resumiendo los puntos clave encontrados en un conjunto de documentos de texto.
 Reglas:
 1. Lee la pregunta general del usuario.
 2. Examina TODOS los documentos proporcionados en el contexto.
 3. En lugar de buscar una respuesta directa, tu objetivo es construir un resumen coherente sobre el tema de la pregunta basándote en los diferentes puntos mencionados en los textos.
 4. Basa tu respuesta 100% en la información de los documentos. No inventes nada.
 5. Si los documentos no contienen absolutamente ninguna mención relevante al tema de la pregunta, responde únicamente: "No se encontraron menciones relevantes en el corpus."
 """
        },
        {
            "role": "user",
            "content": f"""
 PREGUNTA DEL USUARIO:
 "{query}"

 ---
DOCUMENTOS DE CONTEXTO PARA SINTETIZAR:
 {context}
 ---

Ahora, resume lo que estos documentos dicen sobre el tema de la pregunta del usuario.
"""
        }
    ],
    temperature=0.5
)
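
The request above separates the fixed instructions (system message) from the per-query material (user message). A sketch of wrapping that structure in a function so the same layout can be reused for other queries; `build_messages` is a hypothetical helper and the English labels inside the user message are placeholders, not the wording used in the cell above.

```python
def build_messages(query, context, system_prompt):
    """Return chat messages for a retrieval-augmented request."""
    user_content = (
        f'USER QUESTION:\n"{query}"\n\n'
        "---\n"
        f"CONTEXT DOCUMENTS:\n{context}\n"
        "---\n"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]


messages = build_messages(
    query="What do these documents say about the United States?",
    context="doc A\n\n---\n\ndoc B",
    system_prompt="Answer only from the provided documents.",
)
print(messages[1]["content"])
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create`.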

In [19]:
print("\n--- La respuesta de la pregunta --", query, "-- es:  \n")
print(response.choices[0].message.content)


--- La respuesta de la pregunta -- What this document explain about United States? -- es:  

Los documentos proporcionados no abordan directamente el tema de los Estados Unidos de manera exhaustiva, pero hay algunas menciones relevantes:

1. **Gasto en el Medio Oriente**: Se menciona que Estados Unidos gasta miles de millones de dólares en ayuda económica y militar a las partes en conflicto en el Medio Oriente, específicamente en el contexto del conflicto israelí-palestino. Además, se sugiere que una parte de estos fondos podría usarse para fomentar matrimonios mixtos entre israelíes y palestinos como una forma de promover la paz.

2. **Constitución de los Estados Unidos**: Hay una discusión sobre la interpretación de la Constitución de los Estados Unidos, específicamente sobre el derecho a portar armas y el papel del Congreso en la recaudación de impuestos y otras funciones gubernamentales. Se destaca el poder del Congreso para regular el comercio, establecer oficinas de correos, dec

## Part 2: Example query over a PDF, using the chunks with the highest cosine similarity as context

In [20]:
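#pip install PyMuPDF
import fitz  # PyMuPDF
import pandas as pd
import re
import numpy as np
import nltk
import torch  # used below for torch.topk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize
from sentence_transformers import SentenceTransformer, util
# One-time NLTK resources, if not already installed (newer NLTK releases may also need 'punkt_tab'):
#nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')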

In [21]:
# 1: Load and process the PDF
pdf_path = 'irbook.pdf'
doc = fitz.open(pdf_path)
# Extract all the text from the PDF
full_text = ""
for page_num in range(len(doc)):
    page = doc.load_page(page_num)
    full_text += page.get_text() + "\n"

print(f"Total de páginas del PDF: {len(doc)}")

Total de páginas del PDF: 581


In [22]:
# 2: Basic text cleaning
# Specific patterns to remove
patterns_to_remove = [
    r'Online edition \(c\) \d{4} Cambridge UP',
    r'DRAFT!.*Feedback welcome\.',
    r'Cambridge University Press\. Feedback welcome\.',
    r'Figure \d+\.\d+:.*',
    r'Table \d+\.\d+:.*',
    r'\n\d+\n'  # Page numbers
]
clean_text = full_text
for pattern in patterns_to_remove:
    clean_text = re.sub(pattern, '', clean_text, flags=re.IGNORECASE)
# Normalize whitespace
clean_text = re.sub(r'\s+', ' ', clean_text).strip()
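
PDF extraction often breaks running headers across lines. The `raw` preview later in this notebook shows `Online edition (c)\n2009 Cambridge UP`, which the first pattern above does not match because it expects single spaces, and the `clean_text` preview still contains that header. A sketch of a whitespace-tolerant variant, assuming the header has the shape seen in that preview; it also replaces matches with a space so that neighbouring words are not glued together. `strip_headers` is a hypothetical helper.

```python
import re

# Assumption: the running header reads "Online edition (c) 2009 Cambridge UP",
# possibly with line breaks between its words.
HEADER = re.compile(r"Online\s+edition\s+\(c\)\s+\d{4}\s+Cambridge\s+UP", re.IGNORECASE)
PAGE_NUMBER = re.compile(r"\n\d+\n")  # a line that contains only digits


def strip_headers(text):
    """Remove running headers and page-number lines, then collapse whitespace."""
    text = HEADER.sub(" ", text)
    text = PAGE_NUMBER.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


sample = "Online edition (c)\n2009 Cambridge UP\nBoolean retrieval\n12\nis simple."
print(strip_headers(sample))
```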

In [23]:
# 3: Create chunks with a simple heuristic: a fixed number of sentences
sentences = sent_tokenize(clean_text)
# Group sentences into chunks of 5 sentences each
chunks = []
current_chunk = []
chunk_size = 5  # Number of sentences per chunk
for sentence in sentences:
    current_chunk.append(sentence)
    if len(current_chunk) >= chunk_size:
        chunks.append(" ".join(current_chunk))
        current_chunk = []
# Add the last chunk if it has any content
if current_chunk:
    chunks.append(" ".join(current_chunk))
print(f"Total chunks creados: {len(chunks)}")


Total chunks creados: 2489
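
Fixed-size chunks cut the text at arbitrary points, so an explanation that spans two chunks can be split between them. A common mitigation is to let consecutive chunks share a few sentences. The sketch below generalises the loop above with an `overlap` parameter; the overlap is an assumption for illustration and is not used in this notebook (with `overlap=0` it reproduces the grouping above).

```python
def chunk_sentences(sentences, size=5, overlap=1):
    """Group sentences into chunks of `size`, sharing `overlap` sentences between neighbours."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break  # the last sentence has been covered
    return chunks


toy = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]
print(chunk_sentences(toy, size=3, overlap=1))
```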


In [24]:
# 4: Preprocessing with NLTK
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
# Preprocess each chunk
processed_chunks = []
for chunk in chunks:
    # Tokenize
    tokens = word_tokenize(chunk.lower())
    # Keep alphabetic, non-stop-word tokens and lemmatize them
    filtered_tokens = [
        lemmatizer.lemmatize(token)
        for token in tokens
        if token.isalpha() and token not in stop_words
    ]
    processed_chunks.append(" ".join(filtered_tokens))
# Build the DataFrame. Note: `full_text` is a single string, so pandas repeats
# the whole PDF text in every row of the 'raw' column.
pdf_df = pd.DataFrame({'raw': full_text,
    'clean_text': chunks,
    'processed_text': processed_chunks
})
# Filter out chunks whose processed text has 10 characters or fewer
pdf_df = pdf_df[pdf_df['processed_text'].str.len() > 10]
print(f"Chunks válidos: {len(pdf_df)}")
pdf_df

Chunks válidos: 2473


Unnamed: 0,raw,clean_text,processed_text
0,Online edition (c)\n2009 Cambridge UP\nAn\nInt...,Online edition (c) 2009 Cambridge UP An Introd...,online edition c cambridge introduction inform...
1,Online edition (c)\n2009 Cambridge UP\nAn\nInt...,"The unrounded values are: 806,791 documents, 2...",unrounded value document token per document di...
2,Online edition (c)\n2009 Cambridge UP\nAn\nInt...,We performed stemming with the Porter stemmer ...,performed stemming porter stemmer chapter page...
3,Online edition (c)\n2009 Cambridge UP\nAn\nInt...,Commas in γ codes are for readability only and...,comma γ code readability part actual index dic...
4,Online edition (c)\n2009 Cambridge UP\nAn\nInt...,So for X2 > 6.63 the assumption of independenc...,assumption independence rejected ten largest c...
...,...,...,...
2484,Online edition (c)\n2009 Cambridge UP\nAn\nInt...,"(2001) Zhai: Lafferty and Zhai (2001), Laffert...",zhai lafferty zhai lafferty zhai tao et al zha...
2485,Online edition (c)\n2009 Cambridge UP\nAn\nInt...,(2001b) Zien: Chapelle et al. (2006) Zipf: Zip...,zien chapelle et al zipf zipf ziviani badue et...
2486,Online edition (c)\n2009 Cambridge UP\nAn\nInt...,"(2002), Heinz and Zobel (2003), Heinz et al. (...",heinz zobel heinz et al kaszkiel zobel lester ...
2487,Online edition (c)\n2009 Cambridge UP\nAn\nInt...,"(2002), Williams and Zobel (2005), Williams et...",williams zobel williams et al zobel zobel dart...


In [25]:
# 5: Generate embeddings
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = embedding_model.encode(
    pdf_df['processed_text'].tolist(), 
    show_progress_bar=True,
    convert_to_tensor=True
)

Batches: 100%|██████████| 78/78 [00:21<00:00,  3.66it/s]


In [None]:
# 6: Process the query
query = "¿Qué es el modelo de espacio vectorial (vector space model) y en qué capítulo se explica?"
#query = "Who is Cristobal Colon based to the documents?"
# Preprocess the query with the same steps applied to the chunks
query_tokens = word_tokenize(query.lower())
processed_query_tokens = [
    lemmatizer.lemmatize(token) 
    for token in query_tokens 
    if token.isalpha() and token not in stop_words
]
processed_query = " ".join(processed_query_tokens)

# Generate the query embedding
query_embedding = embedding_model.encode(processed_query, convert_to_tensor=True)
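
The query has to go through the same preprocessing as the chunks, otherwise the two embeddings are computed from differently shaped text. One way to guarantee that is a single function used for both. The sketch below swaps NLTK's tokenizer and lemmatizer for a plain regular expression and a tiny hand-written stop-word set so that it runs without downloads; both are stand-ins, not the pipeline used in this notebook. Note also that an English stop-word list will not remove most Spanish function words, so a Spanish query is filtered less than the English chunks.

```python
import re

# Assumption: a tiny illustrative stop-word set; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "in", "what", "which"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    return " ".join(token for token in tokens if token not in stop_words)


chunk = "The vector space model is explained in Chapter 6."
query = "What is the vector space model?"
print(preprocess(chunk))
print(preprocess(query))
```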

In [34]:
# Compute cosine similarity
cos_scores = util.cos_sim(query_embedding, embeddings)[0]
cos_scores

tensor([0.1009, 0.2660, 0.1792,  ..., 0.0837, 0.1768, 0.0811])

In [35]:
# Get the 5 most relevant chunks
top_k = 5
top_results = torch.topk(cos_scores, k=top_k)
top_results

torch.return_types.topk(
values=tensor([0.4825, 0.4611, 0.4603, 0.4521, 0.4456]),
indices=tensor([2026, 2162, 2154, 1915, 1828]))
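
Top-k selection always returns k chunks, even when none of them is a close match to the query. A minimum-similarity filter lets the pipeline pass fewer chunks, or none, to the model. A sketch using the scores printed above; `filter_by_score` is a hypothetical helper, and the 0.46 cut-off is an arbitrary value chosen only to show the effect.

```python
def filter_by_score(scores, indices, min_score=0.5):
    """Keep (score, index) pairs whose similarity reaches `min_score`."""
    return [(score, idx) for score, idx in zip(scores, indices) if score >= min_score]


scores = [0.4825, 0.4611, 0.4603, 0.4521, 0.4456]
indices = [2026, 2162, 2154, 1915, 1828]
print(filter_by_score(scores, indices, min_score=0.46))
```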

In [36]:
# Build the context from the most relevant chunks
context_parts = []
print("\nChunks más relevantes:")
for i, (score, idx) in enumerate(zip(top_results.values, top_results.indices)):
    chunk_idx = idx.item()
    print(f"Chunk {i+1} (similitud: {score:.4f}):")
    print(pdf_df.iloc[chunk_idx]['clean_text'][:500] + "...")
    print("-" * 80)
    
    context_parts.append(
        f"FRAGMENTO {i+1} (Relevancia: {score:.2f}):\n"
        f"{pdf_df.iloc[chunk_idx]['clean_text']}\n"
    )
context = "\n".join(context_parts)


Chunks más relevantes:
Chunk 1 (similitud: 0.4825):
URL: citeseer.ist.psu.edu/article/kumar00web.html. 441, 526, 529, 531,Kupiec, Julian, Jan Pedersen, and Francine Chen. 1995. A trainable document sum- marizer. In Proc....
--------------------------------------------------------------------------------
Chunk 2 (similitud: 0.4611):
The author-topic model for authors and documents. In Proc. UAI, pp. 487–494. 418, 523, 530, 531 Ross, Sheldon....
--------------------------------------------------------------------------------
Chunk 3 (similitud: 0.4603):
In Proc. WWW, pp. 707–715. 348, 520, 529 Online edition (c) 2009 Cambridge UP BibliographyRiezler, Stefan, Alexander Vasserman, Ioannis Tsochantaridis, Vibhu Mittal, and Yi Liu. 2007....
--------------------------------------------------------------------------------
Chunk 4 (similitud: 0.4521):
373, 522, 523 Han, Eui-Hong, and George Karypis. 2000. Centroid-based document classiﬁcation: Analysis and experimental results. In Proc. PKDD, 

In [37]:

# 7: Build and send the prompt
prompt = f"""
Eres un experto en Recuperación de Información. Basado EXCLUSIVAMENTE en los siguientes fragmentos del libro, 
responde la pregunta del usuario. Si la información no está en los fragmentos, di claramente que no tienes datos suficientes.

PREGUNTA: {query}

FRAGMENTOS RELEVANTES "Contexto":
{context}

INSTRUCCIONES:
1. Responde en español
2. Menciona el capítulo o sección relevante si aparece en los fragmentos
3. Si la información está incompleta, sugiere buscar en capítulos específicos
4. Sé preciso y conciso

RESPUESTA:
"""
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Eres un asistente académico especializado en Recuperación de Información."},
        {"role": "user", "content": prompt}
    ],
    temperature=0.3
)

## Part 2 result

In [38]:
# 8: Show the results
print("Respuesta a la query ", query , " :")
print("=" * 80)
print(response.choices[0].message.content)

Respuesta a la query  Who is Enrique Mafla based to the documents?  :
No tengo datos suficientes en los fragmentos proporcionados para determinar quién es Enrique Mafla. Te sugiero buscar en capítulos o secciones adicionales del libro que puedan contener información sobre él.


In [32]:
# Show the fragments used as context
print("\nFragmentos utilizados en el contexto:")
for i, part in enumerate(context_parts):
    print(f"\nFragmento {i+1}:")
    print(part[:500] + "...\n")


Fragmentos utilizados en el contexto:

Fragmento 1:
FRAGMENTO 1 (Relevancia: 0.41):
3. In Section 6.3 we show that by viewing each document as a vector of such weights, we can compute a score between a query and each document. This view is known as vector space scoring. Section 6.4 develops several variants of term-weighting for the vector space model. Chapter 7 develops computational aspects of vector space scoring, and related topics.
...


Fragmento 2:
FRAGMENTO 2 (Relevancia: 0.40):
The representa- tion of a set of documents as vectors in a common vector space is known as the vector space model and is fundamental to a host of information retrieval op- VECTOR SPACE MODEL erations ranging from scoring documents on a query, document classiﬁcation and document clustering. We ﬁrst develop the basic ideas underlying vector space scoring; a pivotal step in this development is the view (Section 6.3.2) of queries as vectors in the same vector space as...


Fragmento 3:
FRAGMENTO 3 (Relevan