### **DELIVERABLE 2 - TRD**

The goal of this deliverable was to design and implement a system that **extracts key information** from pharmacological scientific studies in PDF format by querying an LLM, and presents it in a structured format such as a table. The variables of interest were:

- **Title**
- **Authors**
- **Publication Date**
- **Objective of the Study**
- **Methodology Used**
- **Study Sample**
- **Results**
- **Conclusions**
- **Clinical Relevance**

The development of the script was an iterative process, marked by several technical difficulties, adjustments, and important decisions. This notebook documents **the entire process**, from the initial attempts to the final solution, explaining each stage.

The extraction process was carried out in two phases, initially attempting to use the **Hugging Face API**, and ultimately switching to the **OpenAI API**.

### **First Phase: Attempt with Hugging Face Models**

#### **Initial Implementation**
Initially, I decided to use models available on the Hugging Face platform, such as:

- `bert-base-uncased`
- `deepset/roberta-base-squad2`
- `bigscience/bloomz-7b1`
- `mistralai/Mistral-7B-Instruct`

**Reason**: These models can be queried through the Hugging Face API, which allows running inference without hosting them locally.

#### **Problems Encountered**
1. **503 Errors (Service Unavailable):** Most models returned availability errors.
2. **Excess of Tokens:** Many PDFs had extensive content that exceeded the model's context limits.
3. **Imprecise Answers:** The model responses were not concise or structured.

**Conclusion**: The Hugging Face models were not a viable solution due to these limitations. Next, I opted to switch to the **OpenAI API**, which offers greater robustness and processing capacity.


### **Second Phase: Using the OpenAI API**

#### **Configuring the OpenAI API**
Initially, I encountered issues when using the latest version of the `openai` Python library. The `openai.ChatCompletion` interface used in this notebook was removed in version 1.0 of the library, so calls to it raised errors.

**Solution**: I pinned the library to an older version (0.28) in which this interface still works:

In [194]:
pip install openai==0.28


Note: you may need to restart the kernel to use updated packages.
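
Because the rest of the notebook depends on the pre-1.0 interface, it can help to check the installed version before running any queries. The following is a minimal sketch, not part of the original notebook: the helper name `is_legacy_openai` is hypothetical, and it assumes the version string starts with a numeric major component.

```python
def is_legacy_openai(version: str) -> bool:
    """Return True when the version string denotes a pre-1.0 release."""
    major = int(version.split(".")[0])
    return major < 1


# In the notebook, the installed version could be read with
# importlib.metadata.version("openai"); literal strings are used here
# so that the sketch runs without the library installed.
print(is_legacy_openai("0.28.0"))
print(is_legacy_openai("1.3.5"))
```

If the check returns `False`, the `ChatCompletion` calls further down would be expected to fail, so stopping at this point gives a clearer error than a failure in the middle of processing.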


### **Configuring Libraries and OpenAI API Key**

#### **Description**
This initial code snippet configures the necessary libraries and defines the API key to interact with **OpenAI**. Special emphasis is placed on **hiding** the API key using environment variables, a recommended practice to protect sensitive information.

---

#### **Initial Problem**
- In the early versions of the code, the API key was written directly in the script, which poses a **security risk** if the code is shared publicly or uploaded to platforms like **GitHub**.
- It is important to handle API keys securely to prevent misuse.

---

#### **Decision Made**
1. **Use of Environment Variables**:
   - Instead of writing the key directly in the script, it is stored in an environment variable (`Documents_OpenAI`).
   - This variable is accessed using the `os` library with `os.getenv()`.

2. **Benefit**:
   - Protects the API key by not exposing it directly in the code.
   - Facilitates key management in different environments (development, production, etc.).
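
One weakness of `os.getenv()` is that it silently returns `None` when the variable is missing, so the problem only surfaces later as an authentication error. A minimal sketch of a fail-fast loader is shown below. The function name `load_api_key` and its `environ` parameter are assumptions made for illustration; only the variable name `Documents_OpenAI` comes from this notebook.

```python
import os


def load_api_key(env_var="Documents_OpenAI", environ=None):
    """Read the API key from the environment and fail fast when it is absent."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_var)
    if not key:
        raise RuntimeError(f"Environment variable {env_var!r} is not set or is empty.")
    return key


# Demonstration with a stand-in mapping instead of the real environment;
# the value below is a placeholder, not a real key.
print(load_api_key(environ={"Documents_OpenAI": "placeholder-key"}))
```

In the notebook, the result could be assigned to `openai.api_key` in place of the direct `os.getenv()` call.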

In [197]:
import openai  # To interact with the OpenAI API
import pdfplumber  # To read and extract text from PDFs
import pandas as pd  # To structure and manipulate data
import os  # To handle environment variables

In [199]:
# Set the OpenAI API key from an environment variable so the key is not exposed in the code
openai.api_key = os.getenv("Documents_OpenAI")

In [201]:
# Variables of interest to extract
variables_of_interest = [
    "Title",
    "Authors",
    "Publication Date",
    "Objective of the Study",
    "Methodology Used",
    "Study Sample",
    "Results",
    "Conclusions",
    "Clinical Relevance"
]

### **Decisions Made to Handle Token Overflow**

#### **Initial Problem**
Once I started testing with the OpenAI API, I encountered a recurring problem when working with long texts extracted from PDFs. OpenAI limits the number of tokens per request: for example, the `gpt-3.5-turbo` model used here has a context window of 4096 tokens, shared between the prompt and the generated response. When I tried to send complete documents or extensive texts as context, the API returned errors like:

> `This model's maximum context length is 4096 tokens. However, your messages resulted in [X] tokens.`

This problem not only interrupted the flow of the analysis but also forced me to stop processing large documents or manually split the content, which was inefficient and impractical.
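
Since the prompt and the response share the same window, the space available for document text is smaller than the headline limit. The arithmetic can be sketched as follows. The function name `prompt_token_budget` is hypothetical, and the 50-token overhead for the system message and the question is an illustrative assumption, not a measured value.

```python
def prompt_token_budget(context_window=4096, completion_tokens=100, overhead_tokens=50):
    """Tokens left for document text after reserving room for the reply
    and for the fixed parts of the prompt (system message, question)."""
    budget = context_window - completion_tokens - overhead_tokens
    if budget <= 0:
        raise ValueError("Nothing is left for the document text.")
    return budget


print(prompt_token_budget())  # 4096 - 100 - 50 = 3946
```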

---

#### **Decision Made**
To solve this problem, I implemented a solution that automates the division of long texts into smaller fragments, ensuring that each fragment did not exceed the token limit allowed by the model. This was achieved with a function that:
1. Splits the text into words.
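2. Groups the words into fragments of bounded length that can be sent to the API independently. The length is counted in characters (500 by default), which serves as a conservative stand-in for the token count.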
3. Assembles the responses from these fragments, when necessary, to build the final response.

---

#### **Reasoning**
1. **Avoid Token Errors:** Sending small fragments allows working within the model's limits and ensures that all texts can be processed, even the most extensive ones.
2. **Maintain Efficiency:** This automatic division eliminates the need for manual interventions, speeding up the processing of PDFs.
3. **Flexibility:** Adjusting the maximum size of the fragments (`max_tokens`) allows optimizing the balance between context and response according to the needs of the analysis.

---

#### **Result**
With this strategy, the problems related to token overflow were solved. Now I can process complete PDFs, even those with a large amount of text, without token-limit errors. This approach also allowed me to maintain the scalability of the project, making it possible to process multiple documents automatically.

---

#### **Lesson Learned**
Token limits are an important constraint when working with LLMs. Identifying this problem and making decisions aimed at fragmenting the texts was key to continuing the development of the project without interruptions. This experience highlighted the importance of adjusting queries to the technical constraints of the model used.


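In [204]:
# Split text into fragments of bounded length.
# Note: despite its name, `max_tokens` is a character budget (word lengths plus spaces),
# used here as a conservative stand-in for the model's token limit.
def split_text_into_chunks(text, max_tokens=500):  # Reduced to smaller fragments
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0
    for word in words:
        current_length += len(word) + 1  # Account for the separating space
        if current_length > max_tokens and current_chunk:  # Never emit an empty fragment
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_length = len(word) + 1
        current_chunk.append(word)
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks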
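
The function above measures fragment size in characters. A variant that works with an estimated token count is sketched below as an alternative, not as the method used in this notebook. The names `estimate_tokens` and `split_by_estimated_tokens` are hypothetical, and the ratio of four characters per token is only a rough rule of thumb for English text. A real tokenizer (for example, one from the `tiktoken` package) could be supplied through the `count_tokens` parameter instead.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate; chars_per_token=4 is an assumed rule of thumb."""
    return -(-len(text) // chars_per_token)  # ceiling division


def split_by_estimated_tokens(text, max_tokens=500, count_tokens=estimate_tokens):
    """Group words into chunks whose estimated token count stays within max_tokens."""
    chunks, current = [], []
    for word in text.split():
        candidate = " ".join(current + [word])
        # Close the current chunk when adding the word would exceed the budget
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample = " ".join(["aaaa"] * 10)
for chunk in split_by_estimated_tokens(sample, max_tokens=5):
    print(estimate_tokens(chunk), chunk)
```

Rebuilding the candidate string for every word keeps the sketch short but is quadratic in the chunk length, which is acceptable for small budgets like the one used here.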

### **Adjustments in Queries for Direct Answers and Handling Rate Limit Errors**

#### **Initial Problem**
During development, I faced several challenges related to interacting with the OpenAI API:
1. **Overly Long Responses:** Initially, the queries returned lengthy responses with unnecessary introductions. For example, when asking for the title, the response included phrases like *"The title of the study is..."*, which added noise to the table cells.
2. **Rate Limit Errors:** The API quickly reached the token limits per minute, especially when processing multiple PDFs sequentially. This resulted in frequent interruptions and required manually retrying failed queries.

---

#### **Decisions Made**
1. **Optimization of Questions**:
   - For direct variables like *Title*, *Authors*, and *Publication Date*, specific questions were formulated with clear instructions to make the responses concise:
     - *Title:* "Provide only the title."
     - *Authors:* "List the names only."
     - *Date:* "Please answer only DD-MM-YYYY."
   - For the other variables, a brief summary was requested using the instruction: *"Please summarize the response briefly."*

2. **Automatic Management of Rate Limit Errors**:
   - Exception handling was implemented for rate limit errors (`RateLimitError`).
   - The function automatically retries the query after a brief pause, eliminating the need for manual intervention.

3. **Limits on Response Length**:
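   - The `max_tokens=100` parameter was configured to limit the length of responses, reducing unnecessary token usage. This cap truncates a longer answer rather than making the model write a shorter one, so the concise-answer instructions above remain necessary to keep the responses in the desired format.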
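
The question selection described in the first decision can be expressed as a lookup table with a fallback. The sketch below is one possible arrangement: the names `DIRECT_QUESTIONS` and `build_question` are hypothetical, while the question texts are the ones used in this notebook.

```python
# Variable-specific questions for fields that need a short, literal answer
DIRECT_QUESTIONS = {
    "Title": "What is the title of the study? Provide only the title.",
    "Authors": "Who are the authors of the study? List the names only.",
    "Publication Date": "What is the publication date? Please answer only DD-MM-YYYY.",
}


def build_question(var):
    """Return the direct question for var, or a generic summary request."""
    return DIRECT_QUESTIONS.get(
        var, f"What is the {var}? Please summarize the response briefly."
    )


print(build_question("Title"))
print(build_question("Study Sample"))
```

Keeping the questions in a dictionary means a new direct variable can be added without touching the function body.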

---

#### **Result**
With these adjustments, the responses are now precise and aligned with the project's needs. Additionally, the function runs robustly, automatically handling rate limit errors and reducing the time lost in interruptions.

---

#### **Lesson Learned**
Optimizing the questions asked and setting limits on the length of responses was key to avoiding redundancies and keeping the project efficient. Moreover, implementing a retry strategy for rate limit errors significantly improved the reliability of the code when processing large volumes of data.

---


In [207]:
import time  # Needed for the pause between retries

# Query the OpenAI API, requesting direct answers for certain variables.
# The `question` argument is kept for compatibility; the prompt is chosen from `var`.
def query_openai_api_with_rate_limit(context, question, var, max_retries=5):
    if var == "Title":
        question = "What is the title of the study? Provide only the title."
    elif var == "Authors":
        question = "Who are the authors of the study? List the names only."
    elif var == "Publication Date":
        question = "What is the publication date? Please answer only DD-MM-YYYY."
    else:
        question = f"What is the {var}? Please summarize the response briefly."

    messages = [
        {"role": "system", "content": "You are an expert in pharmacology and healthcare. Provide concise and accurate answers to pharmaceutical-related questions."},
        {"role": "user", "content": f"Context: {context}\nQuestion: {question}"}
    ]

    # Retry on rate limit errors, up to `max_retries` attempts, instead of recursing without a bound
    for attempt in range(max_retries):
        try:
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo",
                messages=messages,
                max_tokens=100,
                temperature=0.2
            )
            return response["choices"][0]["message"]["content"].strip()
        except openai.error.RateLimitError as e:
            print(f"Rate limit reached: {e}. Retrying after a short delay.")
            time.sleep(1)
        except openai.error.OpenAIError as e:
            print(f"Error querying the OpenAI API: {e}")
            return None  # Return None if there is an error
    print(f"Giving up after {max_retries} rate-limited attempts.")
    return None
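
The function above waits a fixed second between attempts. When rate limits persist, a common alternative is exponential backoff, where each wait is twice as long as the previous one. The sketch below illustrates the idea with a generic wrapper. Everything in it is hypothetical: `call_with_backoff`, `TransientError` (a stand-in for an error such as `openai.error.RateLimitError`), and the injectable `sleep` parameter, which only exists so the example runs instantly.

```python
import time


class TransientError(Exception):
    """Stand-in for a rate limit error raised by an API client."""


def call_with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep,
                      retry_on=(TransientError,)):
    """Call func, doubling the wait after each transient failure."""
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # Out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demonstration: a fake call that fails twice and then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("rate limit")
    return "ok"


delays = []  # Recorded waits instead of real sleeping
print(call_with_backoff(flaky, sleep=delays.append), delays)
```

With the real client, `func` could be a `lambda` wrapping the `ChatCompletion` call and `retry_on` could be set to the client's rate limit exception.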

### **Text Extraction from PDF Files**

#### **Context**
To analyze documents in PDF format using language models like OpenAI GPT, the first step is to extract the full text from each file. However, at the beginning of development, I encountered several challenges related to text extraction:
1. **Quality of Extracted Content:** Depending on how the PDF is structured (for example, if it includes images or scanned text), the extraction could fail or return incomplete content.
2. **Unexpected Errors:** Some PDFs produced errors during reading, which interrupted the processing flow.

---

#### **Decision Made**
I implemented a dedicated function (`extract_text_from_pdf`) that uses the `pdfplumber` library to read PDFs and extract text. The function:
1. Reads all pages of the document.
2. Verifies if each page contains text before including it in the result.
3. Handles exceptions, logging specific errors without interrupting the processing of other documents.

---

#### **Code Details**
- **Parameters**:
  - `pdf_path`: Path to the PDF file to be processed.
- **Return**:
  - Concatenated text from all pages, or an empty string (`""`) if an error occurs.
- **Error Handling**:
  - If the file cannot be read (for example, corrupt format), a message indicating the error is printed, and an empty string is returned.

---

In [210]:
# Function: extract_text_from_pdf
# Reads a PDF and extracts the full text of all its pages.
# Parameters:
# - pdf_path: Path to the PDF file.
# Returns:
# - Text extracted from the PDF, or an empty string if the file cannot be read.
def extract_text_from_pdf(pdf_path):
    try:
        with pdfplumber.open(pdf_path) as pdf:
            # Extract each page only once
            page_texts = [page.extract_text() for page in pdf.pages]
        # Skip pages that yielded no text
        return " ".join(t for t in page_texts if t)
    except Exception as e:
        print(f"Error reading {pdf_path}: {e}")
        return ""
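
The page-filtering step can be examined separately from `pdfplumber`. The sketch below joins per-page text and also counts the pages that yielded nothing, which can hint that a PDF is scanned or image-based. The helper name `summarize_extraction` is hypothetical, and the input list stands in for the values that `page.extract_text()` would return, which may be `None` or an empty string for pages without a text layer.

```python
def summarize_extraction(page_texts, separator=" "):
    """Join the non-empty page texts and count the pages without text."""
    kept = [t for t in page_texts if t]
    return separator.join(kept), len(page_texts) - len(kept)


# Synthetic page contents: two pages with text, two without
text, empty_pages = summarize_extraction(["Intro", None, "", "Methods"])
print(repr(text), empty_pages)
```

A high count of empty pages relative to the total would suggest that the document needs OCR before it can be queried.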



### **Information Extraction Process from PDFs: `process_pdfs`**

#### **Context**
The purpose of this function is to process a list of PDFs and extract key information defined by the **variables of interest**. This includes aspects such as title, authors, publication date, methodology, among others. Initially, I encountered several challenges when processing multiple documents:
1. **Lack of Extractable Text:** Some PDFs did not contain legible text, which generated errors or empty results.
2. **Repetitive or Unnecessary Responses:** When splitting long texts into fragments, the responses could include redundancies if the processing was not limited.
3. **PDF Identification:** It was difficult to track which document the results belonged to, especially when handling large volumes of data.

---

#### **Decisions Made**
1. **Handling PDFs Without Text**:
   - For PDFs without legible text, the function adds a record with `None` values for all variables and the PDF name, ensuring that no document is omitted in the final result.

2. **Splitting into Fragments**:
   - The function splits the full text of each PDF into manageable fragments using the `split_text_into_chunks` function, ensuring that the queries do not exceed the token limits of the OpenAI model.

3. **Optimization of Responses**:
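   - Once a non-empty response for a variable is obtained from a fragment, the remaining fragments are not queried for that variable, reducing the time and cost of queries. Because any non-empty reply counts as valid, the answer in practice comes from the first fragment that the API answers, which is usually the first one.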

4. **Document Identification**:
   - The PDF file name is added as an additional key (`PDF Name`) in each record of the table, allowing tracking of the source of the extracted data.
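
The early-stopping rule in the third decision can be isolated into a small helper, which also makes it possible to plug in a stricter definition of a valid answer. The sketch below is an illustration under assumptions: `first_valid_answer` and `looks_informative` are hypothetical names, the rejected phrases are examples rather than an exhaustive list, and the replies are synthetic strings instead of real API output.

```python
def first_valid_answer(chunks, ask, is_valid=bool):
    """Query chunks in order and return the first answer accepted by is_valid."""
    for chunk in chunks:
        answer = ask(chunk)
        if is_valid(answer):
            return answer
    return None


def looks_informative(answer):
    """Reject empty answers and replies saying the context lacks the information."""
    if not answer:
        return False
    lowered = answer.lower()
    return not any(p in lowered for p in ("not provided", "not mentioned", "not specified"))


# Synthetic replies keyed by chunk, standing in for API calls
fake_replies = {
    "chunk A": "The sample is not mentioned in the context.",
    "chunk B": "answer from chunk B",
}
print(first_valid_answer(["chunk A", "chunk B"], fake_replies.get, looks_informative))
```

With the default `is_valid=bool`, the helper reproduces the behaviour described above: the first non-empty reply wins, even if it only states that the information is missing.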

---

In [213]:
# Function: process_pdfs
# Processes multiple PDFs to extract information from each of them.
# Parameters:
# - pdf_paths: List of paths to the PDFs.
# Returns:
# - List of dictionaries, where each dictionary represents a document and its extracted variables.
def process_pdfs(pdf_paths):
    data = []
    for pdf_path in pdf_paths:
        print(f"Procesando {pdf_path}...")
        document_text = extract_text_from_pdf(pdf_path)
        if not document_text:
            data.append({"PDF Name": os.path.basename(pdf_path), **{var: None for var in variables_of_interest}})
            continue
        chunks = split_text_into_chunks(document_text)
        row = {"PDF Name": os.path.basename(pdf_path)}
        for var in variables_of_interest:
            answers = []
            for chunk in chunks:
                if answers:
                    break  # If there is already an answer, do not process more fragments
                answer = query_openai_api_with_rate_limit(chunk, f"What is the {var}?", var)
                if answer:
                    answers.append(answer)
            row[var] = answers[0] if answers else None
            print(f"Variable: {var}, Respuesta: {row.get(var)}")
        data.append(row)
    return data


In [215]:
# List of paths to the PDFs to be processed.
# Make sure these files exist in the working directory.
pdf_paths = [
    "fpsyg-08-00308.pdf",
    "fpsyg-09-01240.pdf",
    "fpsyg-11-01354.pdf",
    "fpsyg-12-639236.pdf",
    "fpsyt-14-1301143.pdf"
]


After reading and processing the PDFs, we generated the output files containing the extracted and organized information. To keep results from different runs apart and avoid overwriting, we used the `datetime` module to add a timestamp to each filename, as shown in the following snippet:

In [218]:
import datetime  # To handle the date and time

# Get the current date and time in the desired format
current_time = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M")

# Process the PDFs and store the extracted data in a list of dictionaries
data = process_pdfs(pdf_paths)
df = pd.DataFrame(data)

# Build filenames that include the date and time
csv_filename = f"extracted_pharmaceutical_data_{current_time}.csv"
excel_filename = f"extracted_pharmaceutical_data_{current_time}.xlsx"

# Save the table to a CSV file and to an Excel file
df.to_csv(csv_filename, index=False)
df.to_excel(excel_filename, index=False)  # Requires an Excel writer such as openpyxl

# Display the generated table
print("Tabla generada:")
print(df)
print(f"Archivos generados:\n- {csv_filename}\n- {excel_filename}")



Procesando fpsyg-08-00308.pdf...
Variable: Title, Respuesta: Placebo and Nocebo Effects: The Advantage of Measuring Expectations and Psychological Factors
Variable: Authors, Respuesta: Nicole Corsi and Luana Colloca.
Variable: Publication Date, Respuesta: 06-03-2017
Variable: Objective of the Study, Respuesta: The objective of the study was to highlight the importance of measuring expectations and psychological factors in understanding and harnessing the placebo and nocebo effects in healthcare and pharmacology.
Variable: Methodology Used, Respuesta: The methodology used in the study involved measuring expectations and psychological factors related to placebo and nocebo effects. Researchers likely conducted experiments or surveys to assess participants' beliefs, attitudes, and psychological states in relation to the study's objectives.
Variable: Study Sample, Respuesta: The study sample in the research consisted of individuals who were exposed to placebo and nocebo effects. The researc
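
The timestamped naming used in the cell above can be wrapped in a small helper so that the format is defined in one place. This is a minimal sketch: the function name `timestamped_filename` and its `now` parameter are hypothetical, while the prefix and the format string are the ones used in this notebook.

```python
import datetime


def timestamped_filename(prefix, extension, now=None):
    """Build a filename such as prefix_YYYY-MM-DD_HH-MM.extension."""
    now = now or datetime.datetime.now()
    return f"{prefix}_{now.strftime('%Y-%m-%d_%H-%M')}.{extension}"


# A fixed date keeps the demonstration reproducible
fixed = datetime.datetime(2024, 1, 31, 9, 5)
print(timestamped_filename("extracted_pharmaceutical_data", "csv", now=fixed))
# extracted_pharmaceutical_data_2024-01-31_09-05.csv
```

Since the format stops at minutes, two runs within the same minute would still write to the same file; adding `%S` to the format string would be one way to avoid that.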

## **Conclusion of the Deliverable Development**

The development of this project presented numerous challenges and learnings throughout its implementation. From processing PDFs to extracting information using language models like GPT-3.5-turbo, the approach evolved to address performance, organization, and optimization issues.

### **Main Challenges Faced:**
1. **Choice of LLM Model:**
   - Initially, we tried to use models available on the Hugging Face API. However, the results did not meet expectations in terms of accuracy and availability, leading us to opt for the OpenAI API.

2. **Compatibility and Versions:**
   - The latest version of the `openai` library presented initial issues, requiring the use of an older version (0.28). This ensured the functionality of the code but involved an adjustment in the implementation.

3. **Token Limits:**
   - During the first iterations, the queries to the models exceeded the token limits, generating errors and incomplete responses. To solve this, we split long texts into manageable fragments using the `split_text_into_chunks` function, optimizing the interaction with the API.

4. **Lengthy and Redundant Responses:**
   - The initial responses were too long and contained unnecessary introductions. The queries to the API were adjusted to make the responses more direct and relevant, especially for specific variables like title, authors, and publication date.

5. **Organization of Results:**
   - To avoid overwriting previous files and improve traceability, we implemented dynamic filenames with timestamps using the `datetime` module. This ensured that each script execution generated unique and easily identifiable results.

### **Key Achievements:**
- **Efficient PDF Processing:**
   - We implemented a robust solution that allows extracting text from multiple PDFs and splitting it into fragments for efficient API queries.

- **Query Optimization:**
   - The queries were adapted to generate precise and brief responses, tailored to the specific needs of each variable of interest.

- **Result Generation:**
   - The results were structured in a clear and organized table, exported in CSV and Excel formats with dynamic names that facilitate
