## Building a Bring Your Own Browser (BYOB) Tool for Web Browsing and Summarization

**Disclaimer: This cookbook is for educational purposes only. Ensure that you comply with all applicable laws and service terms when using web search and scraping technologies. To illustrate the concepts, this cookbook is designed to restrict searches to the openai.com domain and retrieve only public information.**

Large Language Models (LLMs) such as GPT-4o have a knowledge cutoff date, which means they lack information about events that occurred after that point. In scenarios where the most recent data is essential, it's necessary to provide LLMs with access to current web information to ensure accurate and relevant responses.

In this guide, we will build a Bring Your Own Browser (BYOB) tool using Python to overcome this limitation. Our goal is to create a system that provides up-to-date answers in your application, including the most recent developments such as the latest product launches by OpenAI. By integrating web search capabilities with an LLM, we'll enable the model to generate responses based on the latest information available online.

While you can use any publicly available search APIs, we'll utilize Google's Custom Search API to perform web searches. The retrieved information from the search results will be processed and passed to the LLM to generate the final response through Retrieval-Augmented Generation (RAG).

**Bring Your Own Browser (BYOB)** tools allow users to perform web browsing tasks programmatically. In this notebook, we'll build a BYOB tool in three core steps:

**#1. Set Up a Search Engine:** Use a public search API, such as Google's Custom Search API, to perform web searches and obtain a list of relevant search results.  

**#2. Build a Search Dictionary:** Collect the title, URL, and a summary of each web page from the search results to create a structured dictionary of information.  

**#3. Generate a RAG Response:** Implement Retrieval-Augmented Generation (RAG) by passing the gathered information to the LLM, which then generates a final response to the user's query.



In [44]:

# search_systemPrompt = """
# **System Instruction:**
# You are an expert API documentation assistant. Your task is to assist users in extracting API endpoint information from the provided search results, synthesize the relevant details from online sources, and generate a YAML configuration for the given API endpoint.

# **User Input Example:**
# "Find details about the Figma API endpoint most likely to get_team_projects"

# **Expected Output:**
# 1. Perform a search query to find reliable and updated documentation for the specified API endpoint.
# 2. Extract the relevant information, including endpoint details, HTTP method, parameters, and usage examples.
# 3. Generate a YAML configuration based on the following template:
# """

# userPrompt = "The user will provide a dictionary of search results in JSON format for search query {search_term}. Based on the search results provided by the user, provide a detailed response to this query: **'{search_query}'**. Make sure to cite all the sources at the end of your answer."

systemPrompt_final = """
**System Instruction:**
You are an expert API documentation assistant. Your task is to assist users in finding API endpoint information using provided search terms, synthesize the relevant details from online sources, and generate a YAML configuration for the given API endpoint. The YAML should include standard fields such as name, description, method, endpoint URL, parameters, and any other necessary details.

**User Input Example:**
"Find details about the Figma API endpoint most likely to get_team_projects and generate a YAML configuration."

**Expected Output:**
1. Perform a search query to find reliable and updated documentation for the specified API endpoint.
2. Extract the relevant information, including endpoint details, HTTP method, parameters, and usage examples.
3. Generate a YAML configuration based on the following template:

```yaml
name: <API Endpoint Name Always start with the service provider name, like figma_get_team_projects, serpapi_google_search>
servers:
  - url: <API Service Provider Documentation URL>
    description: Optional server description, e.g. Main (production) server
description: <Brief Description of the API Endpoint>
method: <HTTP Method>
endpoint: <Full and Complete API Endpoint URL, like https://api.server.com/v1/projects>
parameters:
  - name: <Parameter Name>
    type: <Data Type>
    required: <true/false>
    description: <Description>
example_request: |
  <Example cURL or HTTP Request>
example_response: |
  <Example API Response>
```

### Explanation:
- Replace `<placeholders>` with the extracted API information.
- Ensure the YAML structure is complete and adheres to standard YAML formatting rules.

**Response Example:**
Here is the YAML configuration for the Twitter API "search tweets" endpoint:

```yaml
name: Twitter API - Search Tweets
description: Allows querying Twitter's recent tweets based on search terms.
method: GET
endpoint: https://api.twitter.com/2/tweets/search/recent
parameters:
  - name: query
    type: string
    required: true
    description: The search query to run against tweets.
  - name: max_results
    type: integer
    required: false
    description: Maximum number of results to return (10–100).
  - name: tweet.fields
    type: string
    required: false
    description: A comma-separated list of additional fields to include in the response.
example_request: |
  curl -X GET "https://api.twitter.com/2/tweets/search/recent?query=chatgpt&max_results=5" \
  -H "Authorization: Bearer YOUR_ACCESS_TOKEN"
example_response: |
  {
    "data": [
      {
        "id": "1234567890",
        "text": "Example tweet content here."
      }
    ],
    "meta": {
      "result_count": 1
    }
  }
```

**Instructions for User Testing:**
- Test the generated YAML in a live application or API tool to ensure correctness.
- Validate the endpoint and parameter descriptions against the official API documentation.
"""


### Use Case 
In this cookbook, we'll take the example of a user who wants to list recent product launches by OpenAI in chronological order. Because GPT-4o has a knowledge cutoff date, the model is not expected to know about recent launches such as the o1-preview model released in September 2024.


In [6]:
import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv('.env')

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

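tool_name = "confluence_search_page_content"

# NOTE: `search_query` is used in Step 1b but is never defined in this notebook.
# The wording below is an assumption, added only so the later cells can run.
search_query = f"Find details about the API endpoint most likely to {tool_name}"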


Given the knowledge cutoff, as expected the model does not know about the recent product launches by OpenAI.

### Setting up a BYOB tool
To provide the model with recent events information, we'll follow these steps:

##### Step 1: Set Up a Search Engine to Provide Web Search Results
##### Step 2: Build a Search Dictionary with Titles, URLs, and Summaries of Web Pages
##### Step 3: Pass the information to the model to generate a RAG Response to the User Query  


Before we begin, make sure you have **Python 3.12 or later** installed, along with a Google Custom Search API key and a Custom Search Engine ID (CSE ID). Install the Python packages used in this notebook (`requests`, `beautifulsoup4`, `openai`, and `python-dotenv`), and set `OPENAI_API_KEY` as an environment variable. The cells below also read `GOOGLE_API_KEY` and `CSE_ID` from the environment or from a `.env` file.
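As a quick pre-flight check, the sketch below reports which of the expected environment variables are missing. The helper name `missing_env_vars` is hypothetical; the variable names are the ones the cells in this notebook read.

```python
import os

# Variable names read by the cells in this notebook
REQUIRED_VARS = ("OPENAI_API_KEY", "GOOGLE_API_KEY", "CSE_ID")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print(f"Missing environment variables: {', '.join(missing)}")
    else:
        print("All required environment variables are set.")
```

If you keep the keys in a `.env` file, run this after `load_dotenv('.env')` so the values are already in the environment.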

#### Step 1: Set Up a Search Engine to Provide Web Search Results
You can use any publicly available web search APIs to perform this task. We will configure a custom search engine using Google's Custom Search API. This engine will fetch a list of relevant web pages based on the user's query, focusing on obtaining the most recent and pertinent results.  

**a. Configure Search API key and Function:** Acquire a Google API key and a Custom Search Engine ID (CSE ID) from the Google Developers Console. You can navigate to this [Programmable Search Engine Link](https://developers.google.com/custom-search/v1/overview) to set up an API key as well as Custom Search Engine ID (CSE ID). 

The `search` function below takes the search term, the API key and CSE ID, and the number of search results to return. It also accepts an optional `site_filter` parameter that keeps only results whose link contains a given string, such as `openai.com`.
  

In [5]:
import requests  # For making HTTP requests to APIs and websites

def search(search_item, api_key, cse_id, search_depth=10, site_filter=None):
    """Query the Google Custom Search JSON API and return a list of result items."""
    service_url = 'https://www.googleapis.com/customsearch/v1'

    params = {
        'q': search_item,
        'key': api_key,
        'cx': cse_id,
        'num': search_depth  # The API accepts values from 1 to 10
    }

    try:
        response = requests.get(service_url, params=params, timeout=10)
        response.raise_for_status()
        results = response.json()
    except requests.exceptions.RequestException as e:
        print(f"An error occurred during the search: {e}")
        return []

    # 'items' is absent when the search returns no results
    items = results.get('items', [])
    if not items:
        print("No search results found.")
        return []

    if site_filter is None:
        return items

    # Keep only results whose link contains site_filter
    filtered_results = [item for item in items if site_filter in item['link']]
    if not filtered_results:
        print(f"No results with {site_filter} found.")
    return filtered_results
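The `site_filter` check in `search` is a substring test on the link, so it also matches URLs that only mention the site in a path or query string. A stricter variant, sketched below with the standard library, compares the hostname instead. The helper names `link_matches_site` and `filter_by_site` and the sample links are hypothetical.

```python
from urllib.parse import urlparse

def link_matches_site(link, site):
    """Return True if the link's hostname is `site` or one of its subdomains."""
    hostname = (urlparse(link).hostname or "").lower()
    site = site.lower()
    return hostname == site or hostname.endswith("." + site)

def filter_by_site(items, site):
    """Keep only result items whose 'link' is hosted on `site`."""
    return [item for item in items if link_matches_site(item.get("link", ""), site)]

# Illustrative items shaped like the search results (only 'link' is needed here)
sample_items = [
    {"link": "https://openai.com/index/hello"},
    {"link": "https://platform.openai.com/docs"},
    {"link": "https://example.org/?ref=openai.com"},
    {"link": "https://notopenai.com/blog"},
]
kept = filter_by_site(sample_items, "openai.com")
print([item["link"] for item in kept])
```

Only the first two sample links are kept: the third mentions the site only in its query string, and the fourth is a different domain that happens to end with the same characters.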


**b. Identify the search terms for the search engine:** Before we can retrieve specific results from a third-party API, we may need to use query expansion to identify the terms our search API should look for. **Query expansion** is a process where we broaden the original user query by adding related terms, synonyms, or variations. This technique is useful because search engines, like Google's Custom Search API, are often better at matching a range of related terms than at matching the natural-language prompt a user typed.

For example, searching with only the raw query `"List the latest OpenAI product launches in chronological order from latest to oldest in the past 2 years"` may return fewer and less relevant results than a more specific and direct search on a succinct phrase such as `"Latest OpenAI product launches"`. In the code below, we will use the user's original `search_query` to produce a more specific search term to use with the Google API to retrieve the results. 

In [18]:
search_term = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are an expert API documentation assistant. Provide a google search term to find the target API endpoint documentation based on search query provided below in 4-6 words"},
        {"role": "user", "content": search_query}]
).choices[0].message.content

print(search_term)

Confluence API search page content
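The model's reply is used directly as the search term, so it can help to tidy it first. The sketch below is one possible cleanup and not part of the original notebook: it assumes the reply may come wrapped in quotes or followed by extra lines, and the helper name `normalize_search_term` is hypothetical. The six-word cap mirrors the "4-6 words" instruction in the system message above.

```python
import re

def normalize_search_term(raw, max_words=6):
    """Strip wrapping quotes and whitespace, then cap the term at max_words words."""
    # Use the first non-empty line in case the model adds an explanation after the term
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    first_line = lines[0] if lines else ""
    # Remove surrounding quotes or backticks
    cleaned = first_line.strip("\"'`")
    # Collapse runs of whitespace and keep at most max_words words
    words = re.split(r"\s+", cleaned.strip())
    return " ".join(w for w in words[:max_words] if w)

print(normalize_search_term('"Confluence API search page content"\n'))
```

Capping the word count can drop meaningful words from a long reply, so treat `max_words` as a tunable assumption.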


**c. Invoke the search function:** Now that we have the search term, we will invoke the search function to retrieve the results from Google search API. The results only have the link of the web page and a snippet at this point. In the next step, we will retrieve more information from the webpage and summarize it in a dictionary to pass to the model.

In [19]:
from dotenv import load_dotenv
import os

load_dotenv('.env')

api_key = os.getenv('GOOGLE_API_KEY')
cse_id = os.getenv('CSE_ID')

search_items = search(search_item=search_term, api_key=api_key, cse_id=cse_id, search_depth=6)


In [20]:
for item in search_items:
    print(f"Link: {item['link']}")
    print(f"Snippet: {item['snippet']}\n")

Link: https://developer.atlassian.com/cloud/confluence/rest/v1/api-group-search/
Snippet: Searches for content using the Confluence Query Language (CQL). Note that CQL input queries submitted through the /wiki/rest/api/search endpoint no longer ...

Link: https://community.atlassian.com/t5/Confluence-questions/Rest-API-call-to-search-content-in-a-page-and-it-s-children/qaq-p/999497
Snippet: ... search all pages of the parent based on your text string in the search. Here is what an example would look like: http://localhost:8080/confluence/rest/api/ ...

Link: https://community.developer.atlassian.com/t/how-to-search-confluence-pages-by-content-state/61595
Snippet: Sep 15, 2022 ... The search URL in my case is something like {confluenceBaseURL}/wiki/rest/api/search?cql=content.property%5Bcontent-state-published%5D ...

Link: https://community.atlassian.com/t5/Confluence-questions/Searching-for-Page-by-Content/qaq-p/2594304
Snippet: Jan 31, 2024 ... ... search API for retrieving the page 

#### Step 2: Build a Search Dictionary with Titles, URLs, and Summaries of Web Pages
After obtaining the search results, we'll extract and organize the relevant information, so it can be passed to the LLM for final output. 

**a. Scrape Web Page Content:** For each URL in the search results, retrieve the web page to extract textual content while filtering out non-relevant data like scripts and advertisements as demonstrated in function `retrieve_content`. 

**b. Summarize Content:** Use an LLM to generate concise summaries of the scraped content, focusing on information pertinent to the user's query. The model can be given the original search term so that it tailors the summary to the search intent, as outlined in the function `summarize_content`.
  
**c. Create a Structured Dictionary:** Organize the data into a dictionary or a DataFrame containing the title, link, and summary for each web page. This structure can be passed on to the LLM to generate the summary with the appropriate citations.    


In [21]:
import requests
from bs4 import BeautifulSoup

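TRUNCATE_SCRAPED_TEXT = 30000  # Approximate token budget per page; adjust based on your model's context window
SEARCH_DEPTH = 5  # Not used below; the search call above passes search_depth=6 directly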

def retrieve_content(url, max_tokens=TRUNCATE_SCRAPED_TEXT):
    """Fetch a page and return its visible text, truncated to roughly max_tokens tokens."""
    try:
        headers = {'User-Agent': 'Mozilla/5.0'}
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        soup = BeautifulSoup(response.content, 'html.parser')
        # Drop script and style elements so only visible text remains
        for script_or_style in soup(['script', 'style']):
            script_or_style.decompose()

        text = soup.get_text(separator=' ', strip=True)
        characters = max_tokens * 4  # Approximate conversion: about 4 characters per token
        text = text[:characters]
        return text
    except requests.exceptions.RequestException as e:
        print(f"Failed to retrieve {url}: {e}")
        return None

def summarize_content(content, search_term, character_limit=500):
    """Ask the model for a short summary of the content, focused on search_term."""
    prompt = (
        f"You are an AI assistant tasked with summarizing content relevant to '{search_term}'. "
        f"Please provide a concise summary in {character_limit} characters or less."
    )
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": content}]
        )
        summary = response.choices[0].message.content
        return summary
    except Exception as e:
        print(f"An error occurred during summarization: {e}")
        return None

def get_search_results(search_items, character_limit=500):
    # Generate a summary of search results for the given search term
    results_list = []
    for idx, item in enumerate(search_items, start=1):
        url = item.get('link')
        
        snippet = item.get('snippet', '')
        web_content = retrieve_content(url, TRUNCATE_SCRAPED_TEXT)
        
        if web_content is None:
            print(f"Error: skipped URL: {url}")
        else:
            summary = summarize_content(web_content, search_term, character_limit)
            result_dict = {
                'order': idx,
                'link': url,
                'title': snippet,
                'Summary': summary
            }
            results_list.append(result_dict)
    return results_list
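To make the shape of the search dictionary easier to see, here is a minimal sketch of the same assembly logic with the network and LLM calls passed in as plain functions. The name `build_results` and the sample data are hypothetical stand-ins; the keys match those used by `get_search_results`.

```python
def build_results(search_items, fetch, summarize):
    """Assemble result dictionaries, skipping pages that could not be fetched.

    `fetch(url)` returns page text or None; `summarize(text)` returns a summary string.
    Both are passed in so the assembly logic can run without network access.
    """
    results = []
    for idx, item in enumerate(search_items, start=1):
        url = item.get("link")
        text = fetch(url)
        if text is None:
            continue  # The 'order' value still reflects the original search rank
        results.append({
            "order": idx,
            "link": url,
            "title": item.get("snippet", ""),
            "Summary": summarize(text),
        })
    return results

# Stand-in data and functions (hypothetical, for illustration only)
fake_pages = {"https://a.example/docs": "Endpoint documentation text", "https://b.example/post": None}
fake_items = [
    {"link": "https://b.example/post", "snippet": "Forum post"},
    {"link": "https://a.example/docs", "snippet": "API reference"},
]
demo = build_results(fake_items, fetch=fake_pages.get, summarize=lambda text: text[:20])
print(demo)
```

Because the first page is skipped here, the only entry has `order` 2 but sits at list position 0. Any later lookup should therefore match on the `order` field, not on the list position.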

In [22]:
results = get_search_results(search_items)

for result in results:
    print(f"Search order: {result['order']}")
    print(f"Link: {result['link']}")
    print(f"Snippet: {result['title']}")
    print(f"Summary: {result['Summary']}")
    print('-' * 80)

Search order: 1
Link: https://developer.atlassian.com/cloud/confluence/rest/v1/api-group-search/
Snippet: Searches for content using the Confluence Query Language (CQL). Note that CQL input queries submitted through the /wiki/rest/api/search endpoint no longer ...
Summary: The Confluence Cloud REST API provides a comprehensive suite of tools for developers, including authentication, content management (attachments, comments, labels, properties, permissions), space management, and user management. It supports both REST and GraphQL APIs, allowing for dynamic interactions with content and settings. The API documentation includes guides, changelogs, and support resources. Users can rate the utility of the page, access system status updates, and review privacy policies. Overall, the Confluence API facilitates tailored content management and integration capabilities.
--------------------------------------------------------------------------------
Search order: 2
Link: https://community.atlas

We retrieved the most recent results. (Note these will vary depending on when you execute this script.) 

#### Step 3: Pass the information to the model to generate a RAG Response to Narrow down the target webpage
With the search data organized in a JSON data structure, we will pass this information to the LLM so that it can select the most relevant webpage.


In [23]:
import json 

# final_prompt = (
#     f"The user will provide a dictionary of search results in JSON format for search query {search_term}. Based on the search results provided by the user, provide a detailed response to this query: **'{search_query}'**. Make sure to cite all the sources at the end of your answer."
# )

prompt = f"""
**System Instruction**
You are an expert API documentation assistant. Your task is to understand the user's query about a specific API endpoint, review the webpages in the given online search results, and select the most relevant webpage that has information on the given API endpoint and its YAML configuration, so that a Python request to the API can be written.

**Expected Output:**
SELECTED ORDER: <order>
URL: <URL>

**Explanation:**
- Replace `<order>` with the selected order number from the search results and `<URL>` with the selected URL from the search results.
- Set <order> to -1 if the search results do not provide enough information on the user input API endpoint.

**User Input:**
The user endpoint query is: {tool_name}
The google search query for this endpoint is: {search_term}
The search results are: {results}
"""


response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": prompt},
        {"role": "user", "content": json.dumps(results)}],
    temperature=0

)
summary = response.choices[0].message.content

print(summary)

SELECTED ORDER: 1
URL: https://developer.atlassian.com/cloud/confluence/rest/v1/api-group-search/


In [28]:
# Parse the response to get the selected order
import re

match = re.search(r"SELECTED ORDER:\s*(-?\d+)", summary)
selected_order = int(match.group(1)) if match else -1  # -1 means no suitable page was selected

print(f"Selected order: {selected_order}")

Selected order: 1
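Since the selection comes from a model, it can be worth checking it against the candidates before using it. The sketch below parses both lines of the expected output format and rejects a selection whose order or URL does not correspond to an actual search result. The names `Selection` and `parse_selection` and the sample candidates are hypothetical.

```python
import re
from typing import NamedTuple, Optional

class Selection(NamedTuple):
    order: int
    url: Optional[str]

def parse_selection(text, results):
    """Parse the 'SELECTED ORDER' and 'URL' lines and check them against the candidates."""
    order_match = re.search(r"SELECTED ORDER:\s*(-?\d+)", text)
    url_match = re.search(r"URL:\s*(\S+)", text)
    if not order_match:
        return Selection(-1, None)
    order = int(order_match.group(1))
    # Look the result up by its 'order' field, not by list position
    by_order = {r["order"]: r for r in results}
    chosen = by_order.get(order)
    if chosen is None:
        return Selection(-1, None)
    # Reject the selection if the model's URL does not match the candidate's link
    if url_match and url_match.group(1) != chosen["link"]:
        return Selection(-1, None)
    return Selection(order, chosen["link"])

candidates = [
    {"order": 1, "link": "https://docs.example/api/search"},
    {"order": 3, "link": "https://forum.example/thread/42"},
]
print(parse_selection("SELECTED ORDER: 3\nURL: https://forum.example/thread/42", candidates))
```

Returning `-1` for any mismatch reuses the convention from the prompt, where `-1` signals that no search result was suitable.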


#### Step 4: Visit the related search results and retrieve the content

In [32]:
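import requests
from bs4 import BeautifulSoup

# Look the result up by its 'order' value: list positions shift when a page was skipped in Step 2
selected_result = next((r for r in results if r['order'] == selected_order), None)
if selected_result is None:
    raise ValueError("No suitable search result was selected.")

url = selected_result['link']
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
main_content = None  # Stays None if the request fails

try:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # Raise HTTPError for bad responses
    soup = BeautifulSoup(response.text, "html.parser")

    # <body> is present on most pages, so the later fallbacks are rarely reached
    main_content = soup.find('body') or soup.find('div', class_='content') or soup.find('article')

    if main_content:
        print(len(main_content.get_text(strip=True)))
        # print(main_content.get_text(strip=True))
    else:
        print("Body content not found")

except requests.exceptions.RequestException as e:
    print(f"Error fetching the URL: {e}")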


828
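The length printed above (828 characters) is small for a full API reference page. One possible explanation, which this notebook does not confirm, is that the page builds most of its content with JavaScript, which `requests` does not execute. A simple guard like the sketch below can flag thin pages before they are passed to the model. The threshold and the name `looks_too_short` are assumptions.

```python
MIN_CONTENT_CHARS = 2000  # Hypothetical threshold; tune for the sites you target

def looks_too_short(text, min_chars=MIN_CONTENT_CHARS):
    """Return True when extracted text is likely too thin to contain an API spec."""
    return text is None or len(text.strip()) < min_chars

print(looks_too_short("x" * 828))    # True
print(looks_too_short("x" * 5000))   # False
```

When the guard fires, one option is to fall back to the next-best search result; another is to fetch the page with a tool that renders JavaScript.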


#### Step 5: Extract the API specs from the webpage

In [45]:
def generate_api_spec(webpage_content, tool_name, search_term):
    userPrompt = f"""
    The user endpoint query is: {tool_name}
    The google search query for this endpoint is: {search_term}
    The first search result is: {webpage_content}
    """
    
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": systemPrompt_final},
                {"role": "user", "content": userPrompt}
            ],
            temperature=0.1
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error generating YAML config: {e}")
        return None

# Get the webpage content
webpage_content = main_content.get_text(strip=True)

# Extract YAML configuration
yaml_config = generate_api_spec(webpage_content, tool_name, search_term)
print("\nGenerated YAML Configuration:")
print(yaml_config)
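The template in `systemPrompt_final` asks for the YAML inside a fenced block, and the model may add prose around it. To use the configuration downstream, the fenced part has to be pulled out of the response first. The sketch below does this with a regular expression; the helper name `extract_yaml_block` and the sample response are hypothetical.

```python
import re

FENCE = "`" * 3  # Built this way to avoid writing a literal fence inside this example

def extract_yaml_block(response_text):
    """Return the contents of the first fenced yaml block, or None if there is none."""
    pattern = re.compile(FENCE + r"(?:yaml|yml)?\s*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(response_text)
    return match.group(1).strip() if match else None

sample_response = (
    "Here is the YAML configuration:\n"
    + FENCE + "yaml\nname: example_get_items\nmethod: GET\n" + FENCE
    + "\nValidate it against the official docs."
)
print(extract_yaml_block(sample_response))
```

A response that was cut off before its closing fence returns `None`, which is a useful signal to retry. The extracted text can then be checked with a YAML parser such as PyYAML's `yaml.safe_load`, if that package is installed.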


Generated YAML Configuration:
Based on the search query "Confluence API search page content," I will provide a YAML configuration for the Confluence API endpoint that allows searching page content. Here is the synthesized information and YAML configuration:

```yaml
name: confluence_search_page_content
servers:
  - url: https://developer.atlassian.com/cloud/confluence/rest/api-group-search/
    description: Main server for Confluence Cloud REST API
description: Searches for content in Confluence pages using CQL (Confluence Query Language).
method: GET
endpoint: https://your-domain.atlassian.net/wiki/rest/api/content/search
parameters:
  - name: cql
    type: string
    required: true
    description: The CQL query to execute for searching content.
  - name: limit
    type: integer
    required: false
    description: The maximum number of results to return.
  - name: start
    type: integer
    required: false
    description: The starting index of the returned results.
  - name: expa
```

In [49]:
def generate_api_spec_o1(webpage_content, tool_name, search_term):
    userPrompt = f"""
    The user endpoint query is: {tool_name}
    The google search query for this endpoint is: {search_term}
    The first search result is: {webpage_content}
    """
    
    try:
        response = client.chat.completions.create(
            model="o1-mini-2024-09-12",
            messages=[
                {"role": "user", "content": systemPrompt_final + userPrompt}
            ],
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error generating YAML config: {e}")
        return None

# Get the webpage content
webpage_content = main_content.get_text(strip=True)

# Extract YAML configuration
yaml_config = generate_api_spec_o1(webpage_content, tool_name, search_term)
print("\nGenerated YAML Configuration (o1):")
print(yaml_config)


Generated YAML Configuration (o1):
Here is the YAML configuration for the Confluence API "search page content" endpoint:

```yaml
name: Confluence_API_Search_Page_Content
servers:
  - url: https://developer.atlassian.com/cloud/confluence/rest/api-group-content/
    description: Confluence REST API Documentation
description: |
  Searches for page content in Confluence using Confluence Query Language (CQL).
method: GET
endpoint: https://api.atlassian.com/ex/confluence/{cloudId}/wiki/rest/api/content/search
parameters:
  - name: cql
    type: string
    required: true
    description: The CQL query to search for content.
  - name: cqlContext
    type: string
    required: false
    description: The context in which the CQL query is executed.
  - name: limit
    type: integer
    required: false
    description: The maximum number of results to return.
  - name: start
    type: integer
    required: false
    description: The starting index of the first result.
  - name: expand
    type: 
```

### Conclusion
 
Large Language Models (LLMs) have a knowledge cutoff and may not be aware of recent events. To provide them with the latest information, you can build a Bring Your Own Browser (BYOB) tool using Python. This tool retrieves current web data and feeds it to the LLM, enabling up-to-date responses.

The core process involves three main steps, which this notebook extends with two more: fetching the selected page (Step 4) and extracting an API spec from it (Step 5).

**#1. Set Up a Search Engine:** Use a public search API, like Google's Custom Search API, to perform web searches and obtain a list of relevant search results.  

**#2. Build a Search Dictionary:** Collect the title, URL, and a summary of each web page from the search results to create a structured dictionary of information.  

**#3. Generate a RAG Response:** Implement Retrieval-Augmented Generation (RAG) by passing the gathered information to the LLM, which then generates a final response to the user's query.

By following these steps, you enhance the LLM's ability to provide up-to-date answers in your application that include the most recent developments, such as the latest product launches by OpenAI.