# GPT-4 Powered by Dynamic RAG with Google API: Detailed Workflow

## Project Overview

This document details the workflow of the "GPT-4 Powered by Dynamic RAG with Google API" project, which aims to improve the accuracy and depth of GPT-4's responses through dynamic retrieval-augmented generation (RAG). For each user question, the system retrieves live Google search results through SerpAPI, ranks and scrapes the most relevant pages, and passes their content to GPT-4 as context for the final answer.


## Workflow Steps

1. **User Question Input and Site Specification**:
   - Users input a technical question along with a list of relevant websites that might contain useful information.

2. **Question Reformulation**:
   - GPT-4 condenses the input question into a phrase of at most 12 words that captures its core problem. This phrase is combined with the specified sites to form a targeted search query.

3. **Optimized Search via SerpAPI**:
   - The reformulated query is sent to Google through SerpAPI, requesting up to 100 results sorted by date and restricted to the specified sites via `site:` filters.

4. **Semantic Analysis and Document Ranking**:
   - The longer of the original and abstracted questions (by word count) is embedded with the BAAI/bge-base-en-v1.5 model and used as the query for semantic similarity analysis.
   - FAISS first retrieves the search results whose cleaned title and snippet are most similar to that query. If this yields fewer documents than required, a keyword-matching fallback fills the gap.

5. **Scraping and Cleaning Web Content**:
   - The top-ranked pages are downloaded and their text is extracted. Scripts, styles, headers, footers, and navigation elements are removed, and the remaining text is stripped of punctuation, stopwords, and very short words.

6. **Context Integration and RAG Setup**:
   - A prompt is created incorporating the original question, the more contextually rich version of the question, and the content from the top-ranked links.
   - This enriched context is fed into GPT-4 to generate a response that leverages both the model's internal knowledge and the newly acquired external information.

## Testing and Results

- **Revalida Test Case**: The system was tested using multiple-choice questions from the Revalida examination.

### Comparison Setup
- **Evaluation of Dynamic RAG Effectiveness**: To assess the impact of the Dynamic RAG setup on answer accuracy, we conducted a controlled comparison using 12 technical questions from the Revalida examination:
  - **With Dynamic RAG**: The system processed all 12 questions, using external data to supplement GPT-4's responses. In this configuration, it answered 12 out of 12 correctly.
  - **Without RAG (Baseline GPT-4)**: The same 12 questions were evaluated using only GPT-4, without the aid of external data. In this baseline scenario, GPT-4 accurately answered 11 out of the 12 questions, indicating a slightly lower performance compared to the Dynamic RAG-enhanced setup.
- On this small sample, the comparison suggests that Dynamic RAG can improve the accuracy of GPT-4's answers by supplying real-time, relevant information from the specified external sources.

- **Preliminary Analysis**: The sample is too small to establish statistical significance. The RAG-enhanced setup answered one more question correctly than baseline GPT-4, which is consistent with, but does not prove, a benefit from contextual enrichment.
- **Limitations**: API credit limits prevented more extensive testing. The primary goal, building a functional application with these technologies, was nonetheless achieved.
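
The reported scores can be put in perspective with a short calculation. The sketch below is not part of the original notebook: it computes both accuracies and an exact McNemar test on the paired results. Because the RAG run answered all 12 questions correctly and the baseline missed one, exactly one question differs between the two runs. The helper name `exact_mcnemar_p` is hypothetical.

```python
from math import comb

def exact_mcnemar_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value for b and c discordant pairs."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of a split at least this lopsided under a fair coin, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

total = 12
rag_correct, baseline_correct = 12, 11

# RAG was right on every question, so the single baseline miss is the only
# discordant pair: (RAG right, baseline wrong) = 1 and the reverse = 0.
rag_only = total - baseline_correct
baseline_only = total - rag_correct

rag_accuracy = rag_correct / total
baseline_accuracy = baseline_correct / total
p_value = exact_mcnemar_p(rag_only, baseline_only)

print(f"RAG accuracy:      {rag_accuracy:.1%}")       # 100.0%
print(f"Baseline accuracy: {baseline_accuracy:.1%}")  # 91.7%
print(f"Exact McNemar p:   {p_value:.2f}")            # 1.00
```

A p-value of 1.00 means that a one-question difference on 12 items is fully compatible with chance, so a larger question set would be needed to measure any real effect.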

## Conclusion

This workflow shows how retrieval-augmented generation can be combined with GPT-4, using live search results, semantic ranking, and web scraping to supply context for each answer. The preliminary results are promising but limited, and the approach warrants further testing and optimization on larger question sets.


## Modular RAG Pipeline: Function Reference

### 1. `preprocess_and_structure_title_description(title: str, description: str) -> str`
**Functionality**: This function cleans and structures titles and descriptions from search results. It removes HTML tags, accents, URLs, non-alphanumeric characters, and stopwords, and it lemmatizes the remaining words to standardize the content, making it easier to process and compare later.

### 2. `clean_scraped_text_en(text: str, min_word_len: int = 3) -> str`
**Functionality**: Cleans up the text scraped from web pages by removing non-word characters and extra spaces, and filters out short and common words to ensure that only relevant and meaningful text is used in further processing.
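
To make the effect of this cleaning concrete, here is a minimal standalone sketch of the same idea. It swaps NLTK's stopword list for a tiny hand-written set (`DEMO_STOPWORDS`, an assumption for illustration only) so that it runs without downloads; the function name `clean_text_demo` is likewise hypothetical.

```python
import re

# Tiny illustrative stopword set; the notebook uses NLTK's English list instead.
DEMO_STOPWORDS = {"the", "and", "with", "for", "that", "this", "are", "was"}

def clean_text_demo(text: str, min_word_len: int = 3) -> str:
    # Drop punctuation and other non-word characters, then collapse whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    # Keep only words that are long enough and are not stopwords.
    words = re.findall(r"\b\w+\b", text)
    return " ".join(
        w for w in words
        if w.lower() not in DEMO_STOPWORDS and len(w) >= min_word_len
    )

sample = "The patient, aged 42, was treated with NSAIDs   for chronic pain!"
print(clean_text_demo(sample))  # patient aged treated NSAIDs chronic pain
```

Note that short tokens such as `42` are dropped along with the stopwords, so numeric details like ages or dosages may not survive this step when `min_word_len` is 3.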

### 3. `get_text_from_url(url: str, timeout: int = 10, max_chars: int = 5000) -> str`
**Functionality**: Retrieves text from a specified URL, stripping away unnecessary parts like scripts and styles. This function limits the text to a maximum character count to focus on the most relevant content for processing.
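
The tag-stripping step can be illustrated without any network access. The sketch below is a simplified stand-in that uses the standard library's `html.parser` instead of BeautifulSoup and works on an inline HTML string; `VisibleTextParser`, `visible_text`, and the sample page are hypothetical names and data chosen for the example.

```python
from html.parser import HTMLParser

# Tags whose content is treated as non-essential and skipped entirely.
SKIPPED_TAGS = {"script", "style", "header", "footer", "nav", "aside"}

class VisibleTextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.lines.append(text)

def visible_text(html: str, max_chars: int = 5000) -> str:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Join non-empty lines and truncate to the character budget.
    return "\n".join(parser.lines)[:max_chars]

page = """
<html><head><style>p {color: red;}</style></head>
<body><nav>Home | About</nav>
<p>Peptic ulcers can perforate.</p>
<script>track();</script>
<footer>Copyright</footer></body></html>
"""
print(visible_text(page))  # Peptic ulcers can perforate.
```

Truncating by characters is simple but blunt: the cut can land mid-sentence, which is usually acceptable when the text only serves as background context for the model.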

### 4. `reformulate_question(question, openai_key, model="gpt-4")`
**Functionality**: Reformulates a user's natural language question into a concise, search-optimized query using the GPT-4 model. This is crucial for improving the search results by making the query more precise and targeted.
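
The model's output still needs some post-processing before it can be used. The sketch below shows that step in isolation, with no API call: it trims the rewritten text to a word limit and keeps the longer of the two texts for ranking. The function name `postprocess_rewrite` and the sample question are made up for illustration.

```python
def postprocess_rewrite(question: str, rewritten: str, max_words: int = 12) -> dict:
    # Remove surrounding whitespace and quotes the model may have added.
    terms = rewritten.strip().strip('"').split()
    # Enforce the word limit regardless of what the model returned.
    search_query = " ".join(terms[:max_words])
    # The longer text (by word count) is kept for ranking and answer generation.
    if len(question.split()) > len(search_query.split()):
        semantic_query = question
    else:
        semantic_query = search_query
    return {"search_query": search_query, "semantic_query": semantic_query}

question = (
    "My Python script that reads a very large CSV file with pandas "
    "keeps crashing with a memory error, how can I fix it?"
)
rewritten = '"How to fix pandas memory error when reading large CSV files"'

result = postprocess_rewrite(question, rewritten)
print(result["search_query"])    # the trimmed, unquoted rewrite
print(result["semantic_query"])  # the original question, since it is longer
```

Keeping two queries separates concerns: the short one suits a search engine, while the longer one carries more detail for embedding-based ranking and for the final prompt.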

### 5. `run_serpapi_search(reformulated_question, sites, serpapi_key)`
**Functionality**: Conducts a Google search using SerpAPI based on the optimized query and returns the results in a structured DataFrame. This automates the retrieval of relevant search results from specified sites, streamlining data collection and filtering.
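
The site restriction relies on Google's `site:` operator. As a rough sketch of how such a query string can be assembled (the helper name `build_site_query` and the example domains are assumptions, and no request is sent):

```python
def build_site_query(question: str, sites: list[str]) -> str:
    # Join one site: filter per domain with OR, then append the question.
    site_filter = " OR ".join(f"site:{site}" for site in sites)
    # strip() keeps the query clean when no sites are given.
    return f"{site_filter} {question}".strip()

query = build_site_query(
    "perforated gastric ulcer diagnosis",
    ["example.org", "example.com"],
)
print(query)
# site:example.org OR site:example.com perforated gastric ulcer diagnosis
```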

### 6. `rank_documents(df, reformulated_question, embedding_model, top_k=5, max_k_faiss=30)`
**Functionality**: Ranks retrieved documents by their semantic similarity to the question, using sentence embeddings and a FAISS index. If fewer than `top_k` unique links are found this way, a keyword-matching fallback fills the remaining slots.
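
The two-stage idea (semantic ranking first, keyword fallback second) can be sketched in plain Python. This is a simplified illustration, not the notebook's implementation: hand-written 3-dimensional vectors stand in for real embeddings, a sorted list stands in for the FAISS index, and the fallback matches whole words. All names and data below are hypothetical.

```python
import math
import re

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_with_fallback(query_vec, query_text, docs, top_k=3, max_k_semantic=2):
    # Semantic pass: take the nearest neighbours by cosine similarity.
    nearest = sorted(docs, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    links = []
    for doc in nearest[:max_k_semantic]:
        if doc["link"] not in links:
            links.append(doc["link"])
        if len(links) >= top_k:
            break
    # Fallback pass: fill remaining slots with documents sharing a query keyword.
    if len(links) < top_k:
        keywords = set(re.findall(r"\w+", query_text.lower()))
        for doc in docs:
            words = set(re.findall(r"\w+", doc["text"].lower()))
            if doc["link"] not in links and keywords & words:
                links.append(doc["link"])
            if len(links) >= top_k:
                break
    return links

docs = [
    {"link": "https://example.org/a", "text": "perforated gastric ulcer", "vec": [0.9, 0.1, 0.0]},
    {"link": "https://example.org/b", "text": "acute pancreatitis overview", "vec": [0.7, 0.7, 0.0]},
    {"link": "https://example.org/c", "text": "ulcer treatment guidelines", "vec": [0.0, 0.2, 0.9]},
    {"link": "https://example.org/d", "text": "football results", "vec": [0.0, 0.0, 1.0]},
]
ranked = rank_with_fallback([1.0, 0.0, 0.0], "gastric ulcer", docs)
print(ranked)
```

Here the semantic pass contributes the first two links and the fallback adds the third because its text shares the word "ulcer" with the query; the unrelated fourth document is never selected.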

### 7. `scrape_links(links, top_k)`
**Functionality**: Scrapes and cleans the content from selected links. This function ensures that the text used to enrich the answer generation is relevant and cleanly formatted, extracting only the most pertinent content.

### 8. `ask_with_real_context_from_links(question, sites, serpapi_key, openai_api_key, top_k=5, max_k_faiss=30, llm_model_name="gpt-4", embedding_model_name="BAAI/bge-base-en-v1.5")`
**Functionality**: Manages the entire flow of the RAG pipeline, from question reformulation to final answer generation. This function integrates all steps to process a question and produce a well-informed response, leveraging enriched content from the web.
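
One detail of this orchestration is that the scraped context has to fit a token budget before it is added to the prompt. The sketch below shows one way to apply such a budget, keeping the best-ranked blocks first. It is an illustration under stated assumptions: a whitespace word count stands in for the tiktoken encoder, and `select_blocks` is a hypothetical helper.

```python
def select_blocks(blocks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep blocks in order until adding the next one would exceed the budget."""
    selected, used = [], 0
    for block in blocks:  # blocks are assumed to be in ranked order, best first
        cost = count_tokens(block)
        if used + cost > max_tokens:
            break
        selected.append(block)
        used += cost
    return selected, used

blocks = ["alpha beta gamma", "delta epsilon", "zeta eta theta iota"]
selected, used = select_blocks(blocks, max_tokens=5)
print(selected, used)  # ['alpha beta gamma', 'delta epsilon'] 5
```

Stopping at the first block that does not fit keeps the logic simple and preserves ranking order; a variant could skip an oversized block and continue with smaller ones, at the cost of a less predictable context.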

In [1]:
# --- 🔧 PIP INSTALLS AND IMPORTS ---
!pip install -q google-search-results pandas unidecode langchain faiss-cpu sentence-transformers tiktoken langchain-community

from google.colab import drive
import pandas as pd
import requests
import re
import nltk
import spacy
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from unidecode import unidecode
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from langchain_community.chat_models import ChatOpenAI
from tiktoken import get_encoding

# --- 🔐 API KEYS ---
serpapi_key = ""
openai_api_key = ""

# --- 📁 CONFIG PATH ---
#drive.mount('/content/drive')
path = "/content/drive/MyDrive/Projetos/Agente IA News/"

In [2]:
# 🔧 Modular RAG Pipeline (complete, including NLP and final prints)

import spacy
import nltk
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
from unidecode import unidecode
from nltk.corpus import stopwords
from langchain_community.chat_models import ChatOpenAI
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
from tiktoken import get_encoding

# --- ✨ NLP PREPROCESS ---
# Download English stopwords from the NLTK library quietly without verbose output.
nltk.download("stopwords", quiet=True)

# Create a set of English stopwords for filtering out common, less meaningful words.
STOPWORDS_EN = set(stopwords.words("english"))

# Load the English language model from spaCy, disabling Named Entity Recognition (NER)
# and parsing to speed up processing since they are not needed for this task.
nlp = spacy.load("en_core_web_sm", disable=["ner", "parser"])

def preprocess_and_structure_title_description(title: str, description: str) -> str:
    """
    Cleans and structures the title and description text from search results.

    Args:
    title (str): The title text to be cleaned and structured.
    description (str): The description text to be cleaned and structured.

    Returns:
    str: A string combining cleaned title and description, prefixed with labels.

    This function applies several text preprocessing steps:
    - Parse and strip HTML content using BeautifulSoup.
    - Replace hyphens with spaces to avoid compound words being misinterpreted.
    - Remove diacritics (accents) and convert text to lowercase with unidecode.
    - Remove URLs and non-alphanumeric characters.
    - Tokenize, lemmatize, and filter out stopwords and short tokens using spaCy.
    """
    def _clean(text: str) -> str:
        # Use BeautifulSoup to parse HTML and get text, avoiding any HTML tags.
        text = BeautifulSoup(text or "", "html.parser").get_text()
        # Replace hyphens with spaces to handle compound words.
        text = text.replace("-", " ")
        # Normalize text by removing accents and converting to lowercase.
        text = unidecode(text).lower()
        # Remove URLs and any non-alphanumeric characters.
        text = re.sub(r'http\S+|[^a-zA-Z0-9\s]', '', text)
        # Process text with spaCy to tokenize and lemmatize.
        doc = nlp(text)
        # Collect lemmatized tokens that are not stopwords and are longer than 2 characters.
        tokens = [token.lemma_ for token in doc if not token.is_stop and len(token) > 2]
        # Join tokens into a single string, replacing multiple spaces with a single space.
        return re.sub(r'\s+', ' ', ' '.join(tokens)).strip()

    # Return the cleaned title and description, each labeled appropriately.
    return f"Title: {_clean(title)}\nDescription: {_clean(description)}"


# --- 🧽 SCRAPER + CLEANER ---
def clean_scraped_text_en(text: str, min_word_len: int = 3) -> str:
    """
    Cleans the text scraped from web pages, making it suitable for further processing.

    Args:
    text (str): The raw text to be cleaned.
    min_word_len (int): Minimum length of words to keep.

    Returns:
    str: The cleaned text, with unnecessary characters removed and filtered by word length and stopwords.

    This function performs several cleaning steps:
    - Regular expressions remove non-word characters and extra spaces, simplifying text formatting.
    - The text is split into words, and only words that are not common English stopwords and meet the minimum length criteria are kept.
    """
    # Remove all non-word characters and extra spaces from the text to simplify it.
    text = re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text))
    # Find all word boundaries and extract words from the cleaned text.
    words = re.findall(r"\b\w+\b", text)
    # Filter out words that are in the stopwords list or shorter than the minimum length.
    return " ".join([w for w in words if w.lower() not in STOPWORDS_EN and len(w) >= min_word_len])


def get_text_from_url(url: str, timeout: int = 10, max_chars: int = 5000) -> str:
    """
    Fetches and cleans the text content from a specified URL.

    Args:
    url (str): The URL from which to scrape text.
    timeout (int): The timeout in seconds for the network request.
    max_chars (int): The maximum number of characters to return from the cleaned text.

    Returns:
    str: A cleaned string containing the text extracted from the URL, or an empty string if an error occurs.

    This function performs the following operations:
    - Makes a HTTP GET request to the URL with a specified timeout and a common user-agent header to mimic a browser request.
    - Parses the HTML content to remove scripts, styles, and other non-essential sections.
    - Extracts and cleans the visible text, limiting it to the most relevant content up to a specified character limit.
    """
    try:
        # Make a GET request to the URL with a timeout and user-agent header.
        response = requests.get(url, timeout=timeout, headers={"User-Agent": "Mozilla/5.0"})
        # Raise an exception if the request was unsuccessful (e.g., 404, 500 errors).
        response.raise_for_status()
    except requests.RequestException:
        # Return an empty string if there's an error during the request.
        return ""

    # Parse the HTML using BeautifulSoup to navigate and clean the content.
    soup = BeautifulSoup(response.text, "html.parser")
    # Decompose (remove) unnecessary tags like scripts and styles to focus on main content.
    for tag in soup(["script", "style", "header", "footer", "nav", "aside"]):
        tag.decompose()

    # Extract clean text from the HTML, separate lines by newlines and strip unnecessary whitespace.
    text = soup.get_text(separator="\n", strip=True)
    # Return only the relevant cleaned text up to the specified maximum characters, removing empty lines.
    return "\n".join([line.strip() for line in text.splitlines() if line.strip()])[:max_chars]


def reformulate_question(question, openai_key, model="gpt-4"):
    """
    Reformulates a user's question into a concise, search-optimized query using an AI model.

    Args:
    question (str): The original user question to be optimized.
    openai_key (str): API key for accessing OpenAI's services.
    model (str): The name of the GPT model to use, default is "gpt-4".

    Returns:
    dict: A dictionary containing the optimized search query and a semantically detailed version of the question.

    This function utilizes the OpenAI API to transform a verbose or complex user question into a concise query
    optimized for search engines. The goal is to maximize the retrieval of relevant technical references.
    """
    # Initialize the ChatOpenAI model with the specified model name and API key.
    llm = ChatOpenAI(model_name=model, openai_api_key=openai_key, temperature=0)

    # Define the prompt for the AI model, instructing it to condense the question into a concise form.
    prompt = (
        "You are a technical research assistant specialized in improving search engine queries.\n\n"
        "Transform the user's original input into a single, abstracted question with up to 12 words,\n"
        "that captures the core technical problem or diagnostic challenge.\n\n"
        "Your goal is to maximize the number of relevant technical references retrievable from Google.\n"
        "Avoid filler, keep it concise, precise, and rich in meaning.\n\n"
        f"Original question:\n{question}\n\n"
        "Optimized abstracted question (max 12 words):"
    )
    # Invoke the model with the prompt and strip any extraneous characters from the response.
    rewritten = llm.invoke(prompt).content.strip()
    rewritten = rewritten.strip('"')  # Remove quotes if included in the model's response.

    # Post-processing to ensure the response is no longer than 12 words.
    terms = rewritten.split()
    trimmed_terms = " ".join(terms[:12])
    if len(terms) > 12:
        print(f"⚠️ Prompt returned {len(terms)} words. It has been reduced to 12.")

    # Determine a more detailed version of the question based on length comparison.
    detailed_question = question if len(question.split()) > len(trimmed_terms.split()) else trimmed_terms

    # Return both the concise query for search and a semantically detailed question for ranking and response generation.
    return {
        "search_query": trimmed_terms,         # used in search
        "semantic_query": detailed_question    # used for ranking and generating a response
    }


def run_serpapi_search(reformulated_question, sites, serpapi_key):
    """
    Conducts a search using SerpAPI with a reformulated question and specified sites to target specific content.

    Args:
    reformulated_question (str): The optimized query to use for the search.
    sites (list of str): A list of sites to specifically search within.
    serpapi_key (str): The API key for accessing SerpAPI services.

    Returns:
    DataFrame: A pandas DataFrame containing the unique search results with additional cleaned text.

    This function constructs a query that combines specified site filters with the reformulated question
    to target relevant results more effectively.
    """
    # Build the query string by combining site filters with the reformulated question.
    sites_query = " OR ".join([f"site:{site}" for site in sites])
    full_query = f"{sites_query} {reformulated_question}"
    print(f"🌐 Google Search Query: {full_query}")

    # Define the parameters for the SerpAPI request.
    params = {
        "engine": "google",       # Define the search engine to use.
        "q": full_query,          # The complete search query.
        "api_key": serpapi_key,   # API key for authentication.
        "num": 100,               # Number of results to retrieve.
        "tbs": "sbd:1"            # Parameter to sort results by date.
    }
    # Make the request to SerpAPI, fail fast on HTTP errors, and parse the JSON response.
    response = requests.get("https://serpapi.com/search", params=params, timeout=30)
    response.raise_for_status()
    data = response.json()

    # Extract the relevant fields from the search results.
    results = [
        {
            "title": res.get("title", ""),
            "link": res.get("link", ""),
            "description": res.get("snippet", "")
        }
        for res in data.get("organic_results", [])
    ]
    # Convert the results to a DataFrame with fixed columns, so that an empty
    # result set does not raise a KeyError, and drop duplicate links.
    df = pd.DataFrame(results, columns=["title", "link", "description"])
    df = df.drop_duplicates(subset="link").reset_index(drop=True)
    print(f"✅ Total unique results: {len(df)}")

    # Clean and structure the title and description of each result.
    df["clean_text"] = [
        preprocess_and_structure_title_description(title, description)
        for title, description in zip(df["title"], df["description"])
    ]
    return df


def rank_documents(df, reformulated_question, embedding_model, top_k=5, max_k_faiss=30):
    """
    Ranks documents based on semantic similarity using FAISS embeddings and includes a fallback mechanism.

    Args:
    df (DataFrame): DataFrame containing the search results with preprocessed text.
    reformulated_question (str): The search query used to guide the ranking process.
    embedding_model (str): The name of the embedding model to use for document vectorization.
    top_k (int): The number of top results to return.
    max_k_faiss (int): The maximum number of documents to retrieve in FAISS search.

    Returns:
    list: A list of tuples (title, link) representing the top ranked documents.

    This function utilizes HuggingFace embeddings to vectorize documents and FAISS for efficient similarity search.
    It aims to return the most relevant documents based on the semantic content related to the reformulated question.
    """
    # Initialize embeddings using the specified model, ensuring the vectors are normalized.
    embeddings = HuggingFaceEmbeddings(
        model_name=embedding_model,
        encode_kwargs={"normalize_embeddings": True}
    )
    # Prepare documents by creating Document objects for each row that contains meaningful text.
    docs = [
        Document(page_content=row["clean_text"], metadata={"title": row["title"], "link": row["link"]})
        for _, row in df.iterrows() if row["clean_text"].strip()
    ]
    # Nothing to rank if the search returned no usable text.
    if not docs:
        return []
    # Create a FAISS retriever over the documents. The number of results must be
    # passed through search_kwargs to take effect.
    retriever = FAISS.from_documents(docs, embeddings).as_retriever(
        search_type="similarity", search_kwargs={"k": max_k_faiss}
    )
    # Prefix the question with the query instruction documented for BGE English models,
    # then retrieve the most similar documents.
    faiss_docs = retriever.invoke("Represent this sentence for searching relevant passages: " + reformulated_question)

    # Track seen links to avoid duplicates and gather the top results.
    seen_links, links = set(), []
    for doc in faiss_docs:
        link = doc.metadata.get("link", "").strip()
        title = doc.metadata.get("title", "").strip()
        # Add only unique links that start with "http://" or "https://".
        if link and link not in seen_links and link.startswith("http"):
            seen_links.add(link)
            links.append((title, link))
        if len(links) >= top_k:
            break

    # If the number of results is less than top_k, use a fallback mechanism.
    if len(links) < top_k:
        keywords = re.findall(r'\w+', reformulated_question.lower())
        pattern = "|".join(map(re.escape, keywords))
        # Filter the original DataFrame for any remaining relevant documents.
        fallback_df = df[df["link"].notnull() & df["clean_text"].str.contains(pattern, case=False, na=False)]
        for _, row in fallback_df.iterrows():
            link = str(row["link"]).strip()
            title = str(row.get("title", "")).strip()
            if link and link not in seen_links and link.startswith("http"):
                seen_links.add(link)
                links.append((title, link))
            if len(links) >= top_k:
                break

    print(f"🔁 Total links after FAISS + fallback: {len(links)}")
    return links


def scrape_links(links, top_k):
    """
    Scrapes web content from specified links and collects relevant textual content.

    Args:
    links (list of tuple): A list of tuples containing (title, link) to scrape.
    top_k (int): The maximum number of links to process for content extraction.

    Returns:
    tuple: A tuple containing a list of context blocks and a list of valid links.

    This function performs web scraping on the provided links, extracting and cleaning text to be used for further processing.
    It stops when the number of valid links reaches the specified top_k limit.
    """
    # Initialize lists to store context blocks and valid links.
    context_blocks = []
    valid_links = []

    # Iterate over each link provided.
    for title, link in links:
        # Scrape and clean text from the URL using predefined functions.
        content = clean_scraped_text_en(get_text_from_url(link))
        # Print the length of the scraped content for debugging and monitoring.
        print(f"📄 {link} → {len(content)} chars")

        # If there is meaningful content, append it to the context blocks and valid links lists.
        if content:
            context_blocks.append(f"\nSource: {title or 'Link'} - {link}\n{content}")
            valid_links.append((title, link))

        # Stop processing if the number of valid links meets the top_k criterion.
        if len(valid_links) >= top_k:
            break

    # Return the context blocks and a list of valid links (excluding titles for the final list).
    return context_blocks, [link for _, link in valid_links]


def ask_with_real_context_from_links(question, sites, serpapi_key, openai_api_key, top_k=5, max_k_faiss=30,
                                     llm_model_name="gpt-4", embedding_model_name="BAAI/bge-base-en-v1.5"):
    """
    Processes a user question through a complete RAG pipeline to generate an answer supported by scraped web content.

    Args:
    question (str): The user's original question.
    sites (list): List of specific sites to search.
    serpapi_key (str): API key for SerpAPI.
    openai_api_key (str): API key for OpenAI services.
    top_k (int): Maximum number of top documents to consider.
    max_k_faiss (int): Maximum number of documents for FAISS to retrieve.
    llm_model_name (str): Name of the OpenAI model used.
    embedding_model_name (str): Name of the embedding model used.

    Returns:
    dict: A dictionary containing the original prompt, final answer, reformulated and semantic queries, selected context documents, and links used.

    This function ties together the steps of reformulating a question, searching for relevant documents, ranking and scraping these documents, and finally generating an answer using an AI language model.
    """
    # Reformulate the question for better search results.
    queries = reformulate_question(question, openai_api_key, llm_model_name)
    # Perform a search with the reformulated question to find relevant documents.
    df = run_serpapi_search(queries["search_query"], sites, serpapi_key)
    # Rank the documents based on relevance to the question.
    links = rank_documents(df, queries["semantic_query"], embedding_model_name, top_k, max_k_faiss)
    # Scrape the top-ranked links and collect their content.
    context_blocks, final_links = scrape_links(links, top_k)

    # Count tokens per block so the combined context stays within a fixed budget.
    enc = get_encoding("cl100k_base")
    total_tokens = 0
    selected_blocks = []
    # Add blocks in ranked order (best first) until the 9500-token budget is reached,
    # so that the top-ranked sources are the last to be dropped. Note that the base
    # gpt-4 model has an 8,192-token context window, so this budget may need lowering.
    for block in context_blocks:
        block_tokens = len(enc.encode(block))
        if total_tokens + block_tokens > 9500:
            break
        selected_blocks.append(block)
        total_tokens += block_tokens

    # Combine the context blocks into a final context string.
    final_context = "\n\n".join(selected_blocks) or "No relevant content was found."

    # Initialize the language model for generating the answer.
    llm = ChatOpenAI(model_name=llm_model_name, openai_api_key=openai_api_key, temperature=0)
    final_prompt = (
        "You are a technical expert in the subject matter of the user's question. Your task is to generate a coherent, high-quality, "
        "and contextually accurate answer based on:\n\n"
        "- Your own expert knowledge\n"
        "- The retrieved content from a RAG (Retrieval-Augmented Generation) system using FAISS\n\n"
        "Follow these instructions:\n"
        "- Use the retrieved content to enrich and support your answer.\n"
        "- Ensure the answer is coherent and technically sound.\n"
        "- Respond in the same language as the original user question.\n"
        "- If there are multiple perspectives or approaches in the retrieved content, synthesize them into a unified, thoughtful response.\n"
        "- Avoid simply copying text; instead, elaborate and explain using your own words when appropriate.\n\n"
        f"Original user question:\n{question}\n\n"
        f"Rewritten, context-rich version of the question:\n{queries['semantic_query']}\n\n"
        f"Retrieved content from {len(selected_blocks)} relevant documents:\n{final_context}\n\n"
        "Final answer:"
    )

    # Generate the answer using the LLM.
    answer = llm.invoke(final_prompt).content

    # Debugging and monitoring outputs.
    print("\n🗨️ ORIGINAL USER QUESTION\n" + "="*40)
    print(question, "\n")
    print("🧠 QUESTION REWRITTEN FOR SEARCH\n" + "="*40)
    print(queries['search_query'], "\n")
    print("🧠 QUESTION FOR RAG\n" + "="*40)
    print(queries['semantic_query'], "\n")
    print("📚 LINKS USED AS CONTEXT\n" + "="*40)
    for i, link in enumerate(final_links, 1):
        print(f"{i}. {link}")
    print("\n✅ FINAL GPT-4 ANSWER\n" + "="*40)
    print(answer)

    return {
        "prompt": final_prompt,
        "answer": answer,
        "reformulated_question": queries['search_query'],
        "semantic_question": queries['semantic_query'],
        "context_docs": selected_blocks,
        "links": final_links
    }



def print_long_text(text, chunk_size=150):
    """
    Prints long text in manageable chunks to make it easier to read in console outputs.

    Args:
    text (str): The text string to be printed.
    chunk_size (int): The number of characters in each chunk of text to be printed.

    This function iterates over the input text and prints it in specified chunk sizes,
    adding a newline between chunks for better readability. This is particularly useful
    for displaying long strings in a more readable format in environments like command
    line or logs where long continuous text can be hard to follow.
    """
    # Iterate over the text in increments of `chunk_size`.
    for i in range(0, len(text), chunk_size):
        # Print a slice of the text from the current position to `chunk_size` characters ahead.
        # `end="\n\n"` adds two new lines after each chunk for clear separation.
        print(text[i:i + chunk_size], end="\n\n")


In [3]:
def gerar_lista_de_questoes(df):
    """
    Formats questions and their answer choices from a DataFrame into a list of readable strings.

    Args:
    df (DataFrame): A pandas DataFrame containing question data with columns for the question statement
                    and answer choices labeled 'Enunciado', 'Alternativa_A', 'Alternativa_B', 'Alternativa_C', and 'Alternativa_D'.

    Returns:
    list: A list of formatted question strings, where each question includes its multiple choice options.

    This function processes each row in the DataFrame, extracting the question statement and multiple-choice
    answers, and formats them into a single string per question, which includes newlines for readability.
    """
    # Initialize an empty list to store formatted questions.
    questoes_formatadas = []

    # Iterate over each row in the DataFrame.
    for _, row in df.iterrows():
        # Start building the question string with the question statement followed by a newline for separation.
        texto = f"{row['Enunciado']}\n\n"
        # Append each multiple choice answer, prefixed with its corresponding label ('A', 'B', 'C', 'D').
        texto += f"A) {row['Alternativa_A']}\n"
        texto += f"B) {row['Alternativa_B']}\n"
        texto += f"C) {row['Alternativa_C']}\n"
        texto += f"D) {row['Alternativa_D']}"
        # Add the fully formatted question string to the list.
        questoes_formatadas.append(texto)

    # Return the list of formatted question strings.
    return questoes_formatadas

In [4]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Path to the file on Google Drive
csv_path = "/content/drive/My Drive/Projetos/Agente IA News/revalida_50_questoes.csv"

# Load Revalida questions from CSV to test the Retrieval-Augmented Generation (RAG) pipeline.
revalida_questions = pd.read_csv(csv_path)


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
revalida_questions.head(2)

Unnamed: 0,Questao,Enunciado,Alternativa_A,Alternativa_B,Alternativa_C,Alternativa_D
0,2,"Homem de 42 anos, em uso crônico de anti-infla...",Úlcera gástrica perfurada.,Pancreatite aguda.,Colecistite aguda.,Diverticulite aguda.
1,3,Menina de 7 anos e 6 meses é encaminhada ao am...,"Puberdade precoce central, pois o crescimento ...","Telarca isolada precoce, pois a diferença entr...","Telarca isolada precoce, pois o crescimento da...","Puberdade precoce central, pois, além da telar..."


In [6]:
questoes = gerar_lista_de_questoes(revalida_questions)

In [12]:
def processar_enunciados_compilados(questoes, sites, serpapi_key, openai_api_key):
    """
    Run each compiled question through the RAG pipeline and collect the results.

    Every question is passed to ask_with_real_context_from_links. If that call
    raises, the error is recorded and processing continues with the next question.

    Args:
        questoes (list): Questions to process.
        sites (list): Websites to target during the search.
        serpapi_key (str): API key for SerpAPI.
        openai_api_key (str): API key for the OpenAI language models.

    Returns:
        list: One dictionary per question with the final answer, the
        reformulated question, the semantic question, the prompt, the context
        documents, the links, and any error message.
    """
    # Initialize a list to store responses for each question.
    respostas = []

    # Loop through each question in the list.
    for idx, pergunta in enumerate(questoes):
        print(f"\n🧪 Processando questão {idx + 1}/{len(questoes)}")

        try:
            # Attempt to generate an answer using the RAG pipeline with specified parameters.
            resposta = ask_with_real_context_from_links(
                question=pergunta,
                sites=sites,
                serpapi_key=serpapi_key,
                openai_api_key=openai_api_key,
                top_k=5,
                max_k_faiss=30,
                llm_model_name="gpt-4",
                embedding_model_name="BAAI/bge-base-en-v1.5"
            )
        except Exception as e:
            # Handle any exceptions by logging the error and continuing with default values.
            print(f"❌ Erro ao processar a questão {idx}: {e}")
            resposta = {
                "prompt": None,
                "answer": None,
                "reformulated_question": None,
                "semantic_question": None,
                "context_docs": [],
                "links": [],
                "error": str(e)
            }

        # Store details of the response in the results list.
        respostas.append({
            "index": idx,
            "pergunta_compilada": pergunta,
            "resposta_gpt": resposta.get("answer"),
            "pergunta_reformulada": resposta.get("reformulated_question"),
            "pergunta_semantica": resposta.get("semantic_question"),
            "prompt_usado": resposta.get("prompt"),
            "contexto_utilizado": resposta.get("context_docs"),
            "links": resposta.get("links"),
            "erro": resposta.get("error", None)
        })

    # Return the list of all responses.
    return respostas
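
Because failures are recorded rather than raised, it can help to summarize a run before inspecting individual answers. A minimal sketch, assuming the result dictionaries carry the keys built above (`summarize_responses` is a hypothetical helper, and the sample records are made up for illustration):

```python
def summarize_responses(respostas):
    """Count answered and failed questions in the pipeline results."""
    # Questions whose pipeline call raised and stored an error message.
    failed = [r["index"] for r in respostas if r.get("erro")]
    # Questions that ran without error but had no context documents attached.
    no_context = [
        r["index"]
        for r in respostas
        if not r.get("erro") and not r.get("contexto_utilizado")
    ]
    return {
        "total": len(respostas),
        "answered": len(respostas) - len(failed),
        "failed_indices": failed,
        "no_context_indices": no_context,
    }


# Illustrative records only; real entries also hold the prompt, links, etc.
exemplo = [
    {"index": 0, "resposta_gpt": "A", "contexto_utilizado": ["doc"], "erro": None},
    {"index": 1, "resposta_gpt": None, "contexto_utilizado": [], "erro": "timeout"},
    {"index": 2, "resposta_gpt": "C", "contexto_utilizado": [], "erro": None},
]
resumo = summarize_responses(exemplo)
print(resumo)
```

The `failed_indices` list can then be used to re-run only the questions that errored, instead of repeating the whole batch.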


In [21]:
# Run every question through the RAG pipeline and collect the contextualized answers.
respostas_processadas = processar_enunciados_compilados(
    questoes=questoes,  # Formatted questions to process.
    sites=[
        "pubmed.ncbi.nlm.nih.gov",  # PubMed: references and abstracts in life sciences and biomedicine (primarily MEDLINE).
        "medscape.com",             # Medscape: clinical news and point-of-care reference.
        "https://www.researchgate.net/",  # ResearchGate: research network and publication index. Note: a full URL, unlike the bare domains above.
    ],
    serpapi_key=serpapi_key,  # API key for SerpAPI.
    openai_api_key=openai_api_key  # API key for OpenAI.
)

# respostas_processadas is a list of dictionaries, one per question, holding the generated answer,
# the reformulated question, the prompt used, and any error encountered.



🧪 Processando questão 1/50
🌐 Google Search Query: pubmed.ncbi.nlm.nih.gov OR medscape.com OR https://www.researchgate.net/ Diagnosis for epigastric pain, intraperitoneal fluid, and air in hepatophrenic recess?
✅ Total de resultados únicos: 3
🔁 Total de links após FAISS + fallback: 3
📄 https://emedicine.medscape.com/article/1790777-reference → 4216 chars
📄 https://pmc.ncbi.nlm.nih.gov/articles/PMC11684536/ → 4101 chars
📄 https://emedicine.medscape.com/article/1980980-overview → 4216 chars

🗨️ PERGUNTA ORIGINAL DO USUÁRIO
Homem de 42 anos, em uso crônico de anti-inflamatório não esteroide por doença reumática, dá entrada no pronto-socorro com 6 horas de evolução de dor epigástrica de forte intensidade. Sinais vitais: Frequência cardíaca 110 bpm, Pressão arterial 90 x 50 mmHg, Frequência respiratória 22 irpm, Temperatura axilar 36,5 oC. Ao exame físico, abdome tenso, com descompressão brusca dolorosa nos quatro quadrantes. O hemograma ap
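
The logged query joins the entries of `sites` with `OR` and passes them verbatim, including the full ResearchGate URL, without Google's `site:` operator. In effect the sites act as search hints rather than a strict filter. If a strict restriction were wanted, one option is sketched below; `normalize_site` and `build_site_query` are hypothetical helpers, not functions from this notebook:

```python
from urllib.parse import urlparse


def normalize_site(entry: str) -> str:
    """Reduce a site entry (bare domain or full URL) to a bare host name."""
    entry = entry.strip()
    # urlparse only fills in netloc when a scheme is present, so add one if missing.
    parsed = urlparse(entry if "://" in entry else f"https://{entry}")
    host = parsed.netloc.lower()
    return host[4:] if host.startswith("www.") else host


def build_site_query(question: str, sites: list[str]) -> str:
    """Prefix the question with an OR-group of `site:` filters."""
    hosts = [normalize_site(s) for s in sites]
    if not hosts:
        return question
    site_filter = " OR ".join(f"site:{h}" for h in hosts)
    return f"({site_filter}) {question}"


query = build_site_query(
    "Diagnosis for epigastric pain, intraperitoneal fluid, and air in hepatophrenic recess?",
    ["pubmed.ncbi.nlm.nih.gov", "medscape.com", "https://www.researchgate.net/"],
)
print(query)
```

Whether this is desirable depends on the goal: the workflow description mentions results "from the specified sites and beyond", which a strict `site:` filter would rule out.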


If you meant to use Beautiful Soup to parse the web page found at a certain URL, then something has gone wrong. You should use an Python package like 'requests' to fetch the content behind the URL. Once you have the content as a string, you can feed that string into Beautiful Soup.



    
  text = BeautifulSoup(text or "", "html.parser").get_text()
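
Beautiful Soup emits this warning when the string it receives looks like a URL rather than markup, which suggests that some `text` values reaching this line are bare links or plain strings. A minimal sketch of a guard that parses only when tags are actually present. It uses the standard-library `HTMLParser` as a stand-in for Beautiful Soup so that it runs without extra dependencies; `strip_markup` is a hypothetical helper, and the same check could instead wrap the existing `BeautifulSoup(...)` call:

```python
import re
from html.parser import HTMLParser

# Rough heuristic for "contains at least one HTML tag".
_TAG_RE = re.compile(r"<[a-zA-Z/!][^>]*>")


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags around them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(text):
    """Return plain text, invoking the parser only when tags are present."""
    text = text or ""
    if not _TAG_RE.search(text):
        # Plain strings, including bare URLs, are returned without parsing.
        return text.strip()
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts).strip()


print(strip_markup("https://pmc.ncbi.nlm.nih.gov/articles/PMC9099726/"))
print(strip_markup("<p>Perforated <b>ulcer</b></p>"))
```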


🔁 Total de links após FAISS + fallback: 5
📄 https://www.uptodate.com/contents/whats-new-in-family-medicine → 8 chars
📄 https://repository.poltekkes-kaltim.ac.id/1178/1/19.%20Clinical%20Case%20Studies%20for%20the%20Family%20Nurse%20Practitioner.pdf → 0 chars
📄 https://pmc.ncbi.nlm.nih.gov/articles/PMC9099726/ → 4147 chars
📄 https://sciendo.com/pdf/10.2478/prilozi-2021-0007 → 1638 chars
📄 https://pmc.ncbi.nlm.nih.gov/articles/PMC4872377/ → 3946 chars
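
Two of the five links above yielded 0 and 8 characters of text, so they contribute nothing to the prompt while apparently still counting toward `top_k`. A minimal sketch of a length filter applied after scraping, assuming the documents are available as `(url, text)` pairs. `filter_scraped_docs` and the 200-character threshold are illustrative assumptions, and the sample texts are placeholders with the logged lengths:

```python
def filter_scraped_docs(docs, min_chars=200):
    """Split (url, text) pairs into usable documents and rejected ones."""
    kept, dropped = [], []
    for url, text in docs:
        cleaned = (text or "").strip()
        if len(cleaned) >= min_chars:
            kept.append((url, cleaned))
        else:
            # Keep the URL and length so the rejection can be logged.
            dropped.append((url, len(cleaned)))
    return kept, dropped


# Placeholder texts: only their lengths mirror the log above.
docs = [
    ("https://www.uptodate.com/contents/whats-new-in-family-medicine", "x" * 8),
    ("https://pmc.ncbi.nlm.nih.gov/articles/PMC9099726/", "x" * 4147),
    ("https://sciendo.com/pdf/10.2478/prilozi-2021-0007", "x" * 1638),
]
kept, dropped = filter_scraped_docs(docs)
print(len(kept), dropped)
```

Dropped entries could be replaced by the next-ranked links so that the context still holds `top_k` usable documents.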

🗨️ PERGUNTA ORIGINAL DO USUÁRIO
Mulher de 38 anos, com deficiência congênita de IgA, é atendida em ambulatório de clínica médica devido a insucesso terapêutico no tratamento de infecção por Helicobacter pylori. Apresentava diagnóstico de úlcera duodenal, tendo sido prescrito omeprazol, amoxicilina e claritromicina. Apesar da melhora clínica, observou-se persistência da infecção em teste respiratório com C13. Atribuiu-se o insucesso terapêutico ao uso recorrente de macrolídeos e fluoroquin




🔁 Total de links após FAISS + fallback: 5
📄 https://pmc.ncbi.nlm.nih.gov/articles/PMC4437263/ → 4010 chars
📄 https://natmedlib.uz/fm/?sitemap/file/9p8vkXmg&view=OXFORD%20Handbooks%20%26%20Textbooks/Oxford_Assess_and_Progress_Clinical_Specialties_Etheridge.pdf → 2959 chars
📄 https://www.book.bsmi.uz/web/kitoblar/152372397.pdf → 0 chars
📄 https://emergency-medicine.ecu.edu/wp-content/pv-uploads/sites/151/2023-ECU-EMS-Policy-Protocol-Proc-Med-guide-110123pdf.pdf → 1726 chars
📄 https://www.ncbi.nlm.nih.gov/books/NBK525974/ → 4042 chars

🗨️ PERGUNTA ORIGINAL DO USUÁRIO
Homem de 28 anos foi admitido em hospital após 30 minutos de acidente motociclístico. Apresentava múltiplas e graves lesões em face, mandíbula e cavidade oral, sem lesões cervicais. Exame físico do tórax e do abdome e ultrassonografia focada no abdome no trauma (FAST) sem alterações. Sinais vitais: frequência cardíaca 92 bpm; pressão arterial 130 x 80 mmHg; saturação periféri

In [None]:
# sites = [
#     "pubmed.ncbi.nlm.nih.gov",
#     "uptodate.com"
# ]

# resposta = ask_with_real_context_from_links(
#     question=test[1],
#     sites=sites,
#     serpapi_key=serpapi_key,
#     openai_api_key=openai_api_key,
#     top_k=5,
#     max_k_faiss=30,
#     llm_model_name="gpt-4",
#     embedding_model_name="BAAI/bge-base-en-v1.5"
# )
