## Building a Bring Your Own Browser (BYOB) Tool for Web Browsing and Summarization

**Disclaimer: This cookbook is for educational purposes only. Ensure that you comply with all applicable laws and service terms when using web search and scraping technologies. To illustrate the concepts, this cookbook restricts the search to the openai.com domain and retrieves only public information.**

Large Language Models (LLMs) such as GPT-4o have a knowledge cutoff date, which means they lack information about events that occurred after that point. In scenarios where the most recent data is essential, it's necessary to provide LLMs with access to current web information to ensure accurate and relevant responses.

In this guide, we will build a Bring Your Own Browser (BYOB) tool using Python to overcome this limitation. Our goal is to create a system that provides up-to-date answers in your application, including the most recent developments such as the latest product launches by OpenAI. By integrating web search capabilities with an LLM, we'll enable the model to generate responses based on the latest information available online.

While you can use any publicly available search API, we'll use Google's Custom Search API to perform web searches. The information retrieved from the search results will be processed and passed to the LLM, which generates the final response through Retrieval-Augmented Generation (RAG).

**Bring Your Own Browser (BYOB)** tools allow users to perform web browsing tasks programmatically. In this notebook, we'll create a BYOB tool in three steps:

**#1. Set Up a Search Engine:** Use a public search API, such as Google's Custom Search API, to perform web searches and obtain a list of relevant search results.  

**#2. Build a Search Dictionary:** Collect the title, URL, and a summary of each web page from the search results to create a structured dictionary of information.  

**#3. Generate a RAG Response:** Implement Retrieval-Augmented Generation (RAG) by passing the gathered information to the LLM, which then generates a final response to the user's query.

------
**For our own use case**

- Add additional steps to select the relevant webpages
- Parse the relevant webpages with BeautifulSoup
- Generate a YAML file

-> The follow-up *generate function from YAML* process is the same as the data processing pipeline on the existing DB (`openapis`)


In [1]:
systemPrompt_byob = """
**System Instruction:**
You are an expert API documentation assistant. Your task is to assist users in finding API endpoint information using provided search terms, synthesize the relevant details from online sources, and generate a YAML configuration for the given API endpoint. The YAML should include standard fields such as name, description, method, endpoint URL, parameters, and any other necessary details.

**Expected Output:**
1. Understand user endpoint search query, and find reliable and updated documentation for the specified API endpoint based on the google search term and its search result.
2. Extract the relevant information, including endpoint details, HTTP method, parameters, and usage examples.
3. Generate a YAML configuration based on openapi 3.1.0 looking like this:

```yaml
openapi: 3.1.0
info:
  title: <Sample API Title>
  description: <Optional multiline or single-line description in [CommonMark](http://commonmark.org/help/) or HTML>
  version: <0.1.9>
servers:
  - url: <API Service Provider Documentation URL, like http://api.example.com/v1>
    description: <Optional server description, e.g. Main (production) server>
paths:
  /users:
    get:
      operationId: getUsers
      summary: Returns a list of users.
      description: Optional extended description in CommonMark or HTML.
      responses:
        '200':
          description: A JSON array of user names
          content:
            application/json:
              schema:
                type: array
                items:
                  type: string
    post:
      operationId: createUser
      summary: Creates a user.
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              properties:
                username:
                  type: string
      responses:
        '201':
          description: Created
```

**Explanation:**
- Replace `<placeholders>` with the extracted API information.
- Replace `paths` part based on the valid endpoint you found, do not miss operationId
- Ensure the YAML structure is complete and adheres to standard YAML formatting rules.

**Additional Guidelines:**
- Validate the endpoint and parameter descriptions against the official API documentation.
- Always ensure the output is tailored to the user’s use case.
"""


### Use Case 
In this cookbook, we'll take the example of a user who wants to list recent product launches by OpenAI in chronological order. Because the current GPT-4o model has a knowledge cutoff date, it is not expected that the model will know about recent product launches such as the o1-preview model launched in September 2024. 


In [2]:
import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv('.env')

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

# tool_name = "confluence_search_page_content"
tool_name = "reverse_geocoding_api"


Given the knowledge cutoff, as expected the model does not know about the recent product launches by OpenAI.

### Setting up a BYOB tool
To provide the model with recent events information, we'll follow these steps:

##### Step 1: Set Up a Search Engine to Provide Web Search Results
##### Step 2: Build a Search Dictionary with Titles, URLs, and Summaries of Web Pages
##### Step 3: Pass the information to the model to generate a RAG Response to the User Query  


Before we begin, ensure you have the following:

- **Python 3.12 or later** installed on your machine.
- A Google Custom Search API key and a Custom Search Engine ID (CSE ID).
- The necessary Python packages installed: `requests`, `beautifulsoup4`, and `openai`, plus `python-dotenv` and `pyyaml`, which the code cells below import.
- The `OPENAI_API_KEY` set up as an environment variable.
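A quick way to confirm the setup is to check that the expected environment variables are present before running anything else. This is a minimal sketch: `missing_env_vars` is a hypothetical helper, and the variable names are the ones read by the cells in this notebook (`OPENAI_API_KEY`, `GOOGLE_API_KEY`, `CSE_ID`); adjust them if your `.env` file uses different keys.

```python
import os

# Names read by the cells in this notebook; adjust if your .env uses different keys.
REQUIRED_ENV_VARS = ["OPENAI_API_KEY", "GOOGLE_API_KEY", "CSE_ID"]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_ENV_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

If you load the keys from a `.env` file, run this check after `load_dotenv` has been called.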

#### Step 1: Set Up a Search Engine to Provide Web Search Results
You can use any publicly available web search APIs to perform this task. We will configure a custom search engine using Google's Custom Search API. This engine will fetch a list of relevant web pages based on the user's query, focusing on obtaining the most recent and pertinent results.  

**a. Configure Search API key and Function:** Acquire a Google API key and a Custom Search Engine ID (CSE ID) from the Google Developers Console. You can navigate to this [Programmable Search Engine Link](https://developers.google.com/custom-search/v1/overview) to set up an API key as well as Custom Search Engine ID (CSE ID). 

The `search` function below sets up the search based on search term, the API and CSE ID keys, as well as number of search results to return. We'll introduce a parameter `site_filter` to restrict the output to only `openai.com`
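One detail about `site_filter`: the function below applies it as a substring check on each result link, so a URL that merely mentions the domain in its path or query string also passes. If you need a stricter check, a sketch like the following compares the hostname instead. `matches_site` is a hypothetical helper and not part of the original function.

```python
from urllib.parse import urlparse

def matches_site(link, site):
    """True if the link's hostname is `site` or a subdomain of it."""
    host = (urlparse(link).hostname or "").lower()
    site = site.lower()
    return host == site or host.endswith("." + site)

print(matches_site("https://platform.openai.com/docs", "openai.com"))     # True
print(matches_site("https://example.com/?ref=openai.com", "openai.com"))  # False
```

The substring check is simpler and is enough for illustration; the hostname check is worth considering when the filter is used as a guardrail.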
  

In [3]:
import requests  # For making HTTP requests to APIs and websites

def search(search_item, api_key, cse_id, search_depth=10, site_filter=None):
    service_url = 'https://www.googleapis.com/customsearch/v1'

    params = {
        'q': search_item,
        'key': api_key,
        'cx': cse_id,
        'num': search_depth  # The API accepts at most 10 results per request
    }

    try:
        response = requests.get(service_url, params=params, timeout=10)
        response.raise_for_status()
        results = response.json()
    except requests.exceptions.RequestException as e:
        print(f"An error occurred during the search: {e}")
        return []

    # 'items' is absent when the search returns no results
    items = results.get('items', [])
    if not items:
        print("No search results found.")
        return []

    if site_filter is None:
        return items

    # Keep only results whose link contains site_filter
    filtered_results = [result for result in items if site_filter in result['link']]
    if not filtered_results:
        print(f"No results with {site_filter} found.")
    return filtered_results


**b. Identify the search terms for search engine:** Before we can retrieve specific results from a 3rd Party API, we may need to use Query Expansion to identify specific terms our browser search API should retrieve. **Query expansion** is a process where we broaden the original user query by adding related terms, synonyms, or variations. This technique is essential because search engines, like Google's Custom Search API, are often better at matching a range of related terms rather than just the natural language prompt used by a user. 

For example, searching with only the raw query `"List the latest OpenAI product launches in chronological order from latest to oldest in the past 2 years"` may return fewer and less relevant results than a more specific and direct search on a succinct phrase such as `"Latest OpenAI product launches"`. In the code below, we will use the user's original `search_query` to produce a more specific search term to use with the Google API to retrieve the results. 

In [4]:
search_term = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are an expert API documentation assistant. Provide a google search term to find the target API endpoint documentation based on search query provided below in 4-6 words"},
        {"role": "user", "content": tool_name}]
).choices[0].message.content

print(search_term)

InternalServerError: Error code: 500 - {'error': {'message': 'Timed out generating response. Please try again with a shorter prompt or with `max_tokens` set to a lower value.', 'type': 'internal_error', 'param': None, 'code': 'request_timeout'}}
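Server-side timeouts like the one above are often transient, so retrying with exponential backoff is a common way to make this step more robust. The helper below is a sketch under that assumption; `with_retries` is a hypothetical name and not part of the OpenAI SDK.

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**i seconds and try again.

    The last exception is re-raised once all attempts are used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            sleep(base_delay * 2 ** i)
```

It could wrap the request as `with_retries(lambda: client.chat.completions.create(...))`, using the same arguments as in the cell above.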

**c. Invoke the search function:** Now that we have the search term, we will invoke the search function to retrieve the results from Google search API. The results only have the link of the web page and a snippet at this point. In the next step, we will retrieve more information from the webpage and summarize it in a dictionary to pass to the model.

In [6]:
from dotenv import load_dotenv
import os

load_dotenv('.env')

api_key = os.getenv('GOOGLE_API_KEY')
cse_id = os.getenv('CSE_ID')

search_items = search(search_item=search_term, api_key=api_key, cse_id=cse_id, search_depth=6)


In [7]:
for item in search_items:
    print(f"Link: {item['link']}")
    print(f"Snippet: {item['snippet']}\n")

Link: https://developer.atlassian.com/cloud/confluence/rest/v1/api-group-search/
Snippet: Searches for content using the Confluence Query Language (CQL). Note that CQL input queries submitted through the /wiki/rest/api/search endpoint no longer ...

Link: https://community.developer.atlassian.com/t/how-to-search-confluence-pages-by-content-state/61595
Snippet: Sep 15, 2022 ... ... content properties of a page using the endpoint {confluenceBaseURL}/wiki/rest/api/content/{pageId}/property. So, as it is said in the Content ...

Link: https://community.atlassian.com/t5/Confluence-questions/Rest-API-call-to-search-content-in-a-page-and-it-s-children/qaq-p/999497
Snippet: The endpoint I would suggest you use is GET /rest/api/search. With the ... http://localhost:8080/confluence/rest/api/search?cql=parent=123+and+text ...

Link: https://community.developer.atlassian.com/t/problems-with-cql-search-and-expands/62812
Snippet: Oct 23, 2022 ... I currently try to search (Custom Content) and want t

#### Step 2: Build a Search Dictionary with Titles, URLs, and Summaries of Web Pages
After obtaining the search results, we'll extract and organize the relevant information, so it can be passed to the LLM for final output. 

**a. Scrape Web Page Content:** For each URL in the search results, retrieve the web page to extract textual content while filtering out non-relevant data like scripts and advertisements as demonstrated in function `retrieve_content`. 

**b. Summarize Content:** Use an LLM to generate concise summaries of the scraped content, focusing on information pertinent to the user's query. The model can be given the original search text so that it summarizes the content with the search intent in mind, as outlined in the function `summarize_content`.
  
**c. Create a Structured Dictionary:** Organize the data into a dictionary or a DataFrame containing the title, link, and summary for each web page. This structure can be passed on to the LLM to generate the summary with the appropriate citations.    
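Step (a) has to cap how much scraped text reaches the model. The sketch below shows one way to do that, assuming roughly four characters per token (a heuristic, not an exact tokenizer count) and cutting at a word boundary where possible. `truncate_to_token_budget` is a hypothetical helper.

```python
def truncate_to_token_budget(text, max_tokens, chars_per_token=4):
    """Cut text to roughly max_tokens, assuming ~chars_per_token characters per token."""
    limit = max_tokens * chars_per_token
    if len(text) <= limit:
        return text
    # Prefer cutting at the last whitespace so a word is not split in half
    cut = text.rfind(" ", 0, limit)
    return text[:cut] if cut > 0 else text[:limit]

print(truncate_to_token_budget("alpha beta gamma", max_tokens=2))  # alpha
```

For an exact budget you would count tokens with the model's tokenizer instead of estimating from characters.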


In [8]:
import requests
from bs4 import BeautifulSoup

TRUNCATE_SCRAPED_TEXT = 30000  # Adjust based on your model's context window
SEARCH_DEPTH = 5

def retrieve_content(url, max_tokens=TRUNCATE_SCRAPED_TEXT):
    try:
        headers = {'User-Agent': 'Mozilla/5.0'}
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        soup = BeautifulSoup(response.content, 'html.parser')
        for script_or_style in soup(['script', 'style']):
            script_or_style.decompose()

        text = soup.get_text(separator=' ', strip=True)
        characters = max_tokens * 4  # Approximate conversion
        text = text[:characters]
        return text
    except requests.exceptions.RequestException as e:
        print(f"Failed to retrieve {url}: {e}")
        return None

def summarize_content(content, search_term, character_limit=500):
    prompt = (
        f"You are an AI assistant tasked with summarizing content relevant to '{search_term}'. "
        f"Please provide a concise summary in {character_limit} characters or less."
    )
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": content}]
        )
        summary = response.choices[0].message.content
        return summary
    except Exception as e:
        print(f"An error occurred during summarization: {e}")
        return None

def get_search_results(search_items, character_limit=500):
    # Generate a summary of search results for the given search term
    results_list = []
    for idx, item in enumerate(search_items, start=1):
        url = item.get('link')
        
        snippet = item.get('snippet', '')
        web_content = retrieve_content(url, TRUNCATE_SCRAPED_TEXT)
        
        if web_content is None:
            print(f"Error: skipped URL: {url}")
        else:
            summary = summarize_content(web_content, search_term, character_limit)
            result_dict = {
                'order': idx,
                'link': url,
                'title': snippet,
                'Summary': summary
            }
            results_list.append(result_dict)
    return results_list

In [9]:
results = get_search_results(search_items)

for result in results:
    print(f"Search order: {result['order']}")
    print(f"Link: {result['link']}")
    print(f"Snippet: {result['title']}")
    print(f"Summary: {result['Summary']}")
    print('-' * 80)

Search order: 1
Link: https://developer.atlassian.com/cloud/confluence/rest/v1/api-group-search/
Snippet: Searches for content using the Confluence Query Language (CQL). Note that CQL input queries submitted through the /wiki/rest/api/search endpoint no longer ...
Summary: The Confluence REST API provides a Search Content endpoint that allows users to query and retrieve content from Confluence using the Confluence Query Language (CQL). This functionality enables efficient searching across various content types, including pages, blog posts, and attachments. Users can refine searches with parameters such as labels, content status, and space restrictions. Detailed documentation includes usage guidelines, authentication methods, and status codes relevant to the API.
--------------------------------------------------------------------------------
Search order: 2
Link: https://community.developer.atlassian.com/t/how-to-search-confluence-pages-by-content-state/61595
Snippet: Sep 15, 2022 ... 

We retrieved the most recent results. (Note these will vary depending on when you execute this script.) 

#### Step 3: Pass the information to the model to generate a RAG Response that narrows down the target webpage
With the search data organized in a JSON data structure, we will pass this information to the LLM so that it selects the most relevant webpage.


In [10]:
import json 

# final_prompt = (
#     f"The user will provide a dictionary of search results in JSON format for search query {search_term} Based on on the search results provided by the user, provide a detailed response to this query: **'{search_query}'**. Make sure to cite all the sources at the end of your answer."
# )

prompt = f"""
**System Instruction**
You are an expert API documentation assistant. Your task is to understand the user's query on specific API endpoint, synthesize the relevant webpages from a given online search result, and select the most relevant webpage that have information on the given API endpoint and its YAML configuration in order to write a python calling request to the API.

**Expected Output:**
SELECTED ORDER: <order>
URL: <URL>

**Explanation:**
- Replace `<order>` with the selected order number from the search results and `<URL>` with the selected URL from the search results.
- Set <order> to -1 if the search results do not provide enough information on the user input API endpoint.

**User Input:**
The user endpoint query is: {tool_name}
The google search query for this endpoint is: {search_term}
The search results are: {results}
"""


response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": prompt},
        {"role": "user", "content": json.dumps(results)}],
    temperature=0

)
summary = response.choices[0].message.content

print(summary)

SELECTED ORDER: 1
URL: https://developer.atlassian.com/cloud/confluence/rest/v1/api-group-search/

Explanation: The first search result provides detailed documentation on the Confluence REST API's Search Content endpoint, which allows users to query and retrieve content using the Confluence Query Language (CQL). This result includes information on usage guidelines, authentication methods, and status codes relevant to the API, making it the most relevant source for understanding the `confluence_search_page_content` endpoint and its configuration.


In [11]:
# Parse the response to get the selected order and URL

selected_order = int(summary.split('\n')[0].split(': ')[1])

print(f"Selected order: {selected_order}")

Selected order: 1
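Splitting on the first line works as long as the model puts `SELECTED ORDER:` at the very top of its reply. A more tolerant sketch uses regular expressions and also returns the URL. `parse_selection` is a hypothetical helper, and the sample text mirrors the response format requested in the prompt above.

```python
import re

def parse_selection(text):
    """Return (order, url) from a reply containing 'SELECTED ORDER: n' and 'URL: ...'.

    Either element is None when the corresponding line is missing.
    """
    order_match = re.search(r"SELECTED ORDER:\s*(-?\d+)", text)
    url_match = re.search(r"^URL:\s*(\S+)", text, flags=re.MULTILINE)
    order = int(order_match.group(1)) if order_match else None
    url = url_match.group(1) if url_match else None
    return order, url

sample = (
    "SELECTED ORDER: 1\n"
    "URL: https://developer.atlassian.com/cloud/confluence/rest/v1/api-group-search/\n\n"
    "Explanation: ..."
)
print(parse_selection(sample))
```

Returning `None` instead of raising lets the caller decide how to handle a reply that ignores the requested format.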


#### Step 4: Visit the related search results and retrieve the content

In [12]:
import requests
from bs4 import BeautifulSoup

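# Look up the page by its 'order' field; list positions shift when a URL was skipped in Step 2
url = next(r['link'] for r in results if r['order'] == selected_order)
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

try:
    response = requests.get(url, headers=headers, timeout=10)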
    response.raise_for_status()  # Raise HTTPError for bad responses
    soup = BeautifulSoup(response.text, "html.parser")
    
    main_content = soup.find('body') or soup.find('div', class_='content') or soup.find('article')
    
    if main_content:
        print(len(main_content.get_text(strip=True)))
        # print(main_content.get_text(strip=True))
    else:
        print("Body content not found")

except requests.exceptions.RequestException as e:
    print(f"Error fetching the URL: {e}")


828
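Resolving the selected page can also be made defensive against a `-1` selection (the prompt's signal that no result was good enough) or an order number that is missing because its URL was skipped. A minimal sketch, with `pick_result` as a hypothetical helper:

```python
def pick_result(results, selected_order):
    """Return the result dict whose 'order' equals selected_order, or None."""
    if selected_order is None or selected_order < 1:
        return None  # the model signalled that no page was suitable
    for result in results:
        if result.get("order") == selected_order:
            return result
    return None

example_results = [{"order": 1, "link": "https://a.example"}, {"order": 3, "link": "https://c.example"}]
print(pick_result(example_results, 3))   # {'order': 3, 'link': 'https://c.example'}
print(pick_result(example_results, -1))  # None
```

The example URLs are placeholders; in the notebook you would pass `results` and `selected_order` from the earlier cells and stop if the helper returns `None`.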


#### Step 5: Extract the API specs from the webpage

In [14]:
def generate_api_spec(webpage_content, tool_name, search_term):
    userPrompt = f"""
    The user endpoint query: {tool_name}
    The google search query for this endpoint: {search_term}
    Related search result (parsed html): {webpage_content}
    """
    
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": systemPrompt_byob},
                {"role": "user", "content": userPrompt}
            ],
            temperature=0.1
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error generating YAML config: {e}")
        return None

# Get the webpage content
webpage_content = main_content.get_text(strip=True)

# Extract YAML configuration
yaml_config_4o = generate_api_spec(webpage_content, tool_name, search_term)
print("\nGenerated YAML Configuration:")
print(yaml_config_4o)


Generated YAML Configuration:
Based on the search query and the parsed HTML content, it seems you are looking for the Confluence REST API endpoint related to searching page content. The Confluence REST API provides a way to interact with Confluence programmatically, and the search functionality is typically accessed via the CQL (Confluence Query Language) search endpoint.

Here's a YAML configuration for the Confluence REST API search content endpoint:

```yaml
openapi: 3.1.0
info:
  title: Confluence REST API
  description: The Confluence REST API provides a way to interact with Confluence programmatically, including searching for content using CQL.
  version: 1.0.0
servers:
  - url: https://your-domain.atlassian.net/wiki/rest/api
    description: Main Confluence Cloud server
paths:
  /content/search:
    get:
      operationId: searchContent
      summary: Search for content in Confluence using CQL.
      description: |
        This endpoint allows you to search for content in Confl

In [17]:
def generate_api_spec_o1(webpage_content, tool_name, search_term):
    userPrompt = f"""
    The user endpoint query: {tool_name}
    The google search query for this endpoint: {search_term}
    The first search result: {webpage_content}
    """
    
    try:
        response = client.chat.completions.create(
            model="o1-mini-2024-09-12",
            messages=[
                {"role": "user", "content": systemPrompt_byob + userPrompt}
            ],
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error generating YAML config: {e}")
        return None

# Get the webpage content
webpage_content = main_content.get_text(strip=True)

# Extract YAML configuration
yaml_config_o1 = generate_api_spec_o1(webpage_content, tool_name, search_term)
print("\nGenerated YAML Configuration (o1):")
print(yaml_config_o1)


Generated YAML Configuration (o1):
```yaml
openapi: 3.1.0
info:
  title: Confluence REST API
  description: |
    The Confluence REST API provides access to Confluence Cloud resources.
    You can use it to create, read, update, and delete content in your Confluence instance.
  version: 1.0.0
servers:
  - url: https://api.atlassian.com/ex/confluence/{cloudId}/wiki/rest/api
    description: Confluence Cloud API Server
    variables:
      cloudId:
        description: The unique identifier for your Confluence Cloud site.
        default: your-cloud-id
paths:
  /content/search:
    get:
      operationId: searchContent
      summary: Search for content in Confluence.
      description: |
        Searches for content within Confluence using Confluence Query Language (CQL).
        You can refine your search using various query parameters.
      parameters:
        - name: cql
          in: query
          description: |
            Confluence Query Language (CQL) statement to query for c

In [18]:
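# Extract YAML content between ```yaml and ``` markers
fence = "```yaml"
yaml_start = yaml_config_o1.find(fence)
yaml_end = yaml_config_o1.find("```", yaml_start + len(fence)) if yaml_start != -1 else -1
# str.find returns -1 instead of raising, so check for missing markers explicitly
if yaml_start == -1 or yaml_end == -1:
    raise ValueError("Cannot extract YAML from model response")
yaml_content = yaml_config_o1[yaml_start + len(fence):yaml_end].strip()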

In [19]:
yaml_content

'openapi: 3.1.0\ninfo:\n  title: Confluence REST API\n  description: |\n    The Confluence REST API provides access to Confluence Cloud resources.\n    You can use it to create, read, update, and delete content in your Confluence instance.\n  version: 1.0.0\nservers:\n  - url: https://api.atlassian.com/ex/confluence/{cloudId}/wiki/rest/api\n    description: Confluence Cloud API Server\n    variables:\n      cloudId:\n        description: The unique identifier for your Confluence Cloud site.\n        default: your-cloud-id\npaths:\n  /content/search:\n    get:\n      operationId: searchContent\n      summary: Search for content in Confluence.\n      description: |\n        Searches for content within Confluence using Confluence Query Language (CQL).\n        You can refine your search using various query parameters.\n      parameters:\n        - name: cql\n          in: query\n          description: |\n            Confluence Query Language (CQL) statement to query for content.\n        
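Before passing the extracted YAML on, it can be worth checking that it has the shape the system prompt asked for. The sketch below works on an already parsed spec (for example the dict returned by `yaml.safe_load(yaml_content)`) and reports missing top-level keys and operations without an `operationId`. `check_spec` is a hypothetical helper, and the key list reflects the template used in this notebook rather than the full OpenAPI rules.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}

def check_spec(spec):
    """Return a list of human-readable problems found in a parsed OpenAPI dict."""
    problems = [
        f"missing top-level key: {key}"
        for key in ("openapi", "info", "paths")
        if key not in spec
    ]
    for path, item in (spec.get("paths") or {}).items():
        for method, operation in (item or {}).items():
            # Path items may also hold non-operation keys such as 'parameters'
            if method in HTTP_METHODS and "operationId" not in (operation or {}):
                problems.append(f"{method.upper()} {path}: missing operationId")
    return problems

example_spec = {
    "openapi": "3.1.0",
    "info": {"title": "Example", "version": "1.0.0"},
    "paths": {"/content/search": {"get": {"summary": "Search"}}},
}
print(check_spec(example_spec))  # ['GET /content/search: missing operationId']
```

An empty list means the spec passed these checks; it does not mean the spec is a valid OpenAPI document.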

### Generate tool function etc. from YAML
To go on and generate the tool definition, the Python function, and so on, adapt some of the functions in `genAPIDscp.py`.

In [81]:
import yaml
from prompts import genDscpFromYaml_withNoExec

def genDscpFromYamlStr(endpoint: str, source_yaml_path: str = None, yaml_content: str = None):
    load_dotenv()
    # Use the same environment variable as the client created at the top of the notebook
    client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

    if yaml_content:
        source_yaml = yaml.safe_load(yaml_content)
    elif source_yaml_path:
        with open(source_yaml_path, 'r') as f:
            source_yaml = yaml.safe_load(f)
    else:
        raise ValueError("Provide either yaml_content or source_yaml_path")

    prompt = genDscpFromYaml_withNoExec.replace('{source_yaml}', str(source_yaml)).replace('{target_endpoint}', str(endpoint))

    try:
        completion = client.chat.completions.create(
            model="o1-mini-2024-09-12",
            # model="gpt-4o-2024-11-20",
            messages=[
                # NOTE: `structuredResponse` is not defined in this notebook; it must be provided by the project's prompt definitions
                {"role": "user", "content": prompt + structuredResponse}
            ]
        )
        return completion.choices[0].message.content

    except Exception as e:
        # `endpoint` is a path string here, so print it directly
        print(f"Error generating description for {endpoint}: {e}")
        return None

In [88]:
# Take the first path defined in the generated spec
endpoint_path = next(iter(yaml.safe_load(yaml_content)['paths']))

In [82]:
response = genDscpFromYamlStr(endpoint = endpoint_path, yaml_content=yaml_content)
response

'```json\n{\n    "instruction": "Refer to the Confluence Cloud REST API documentation at https://developer.atlassian.com/cloud/confluence/rest/ to understand how to use the search endpoint for retrieving page content using Confluence Query Language (CQL).",\n    "tool_definition": {\n        "type": "function",\n        "function": {\n            "name": "confluence_search_page_content",\n            "description": "Search for page content in Confluence using Confluence Query Language (CQL). Use this function to retrieve pages based on CQL queries, with options to limit the number of results, paginate through results, and expand additional content properties.",\n            "parameters": {\n                "type": "object",\n                "properties": {\n                    "api_key": {\n                        "type": "string",\n                        "description": "Your Confluence API authorization token."\n                    },\n                    "cql": {\n                  
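Both the YAML spec and this JSON response arrive wrapped in Markdown code fences, so one generic extractor can serve both steps. The following is a sketch with a hypothetical helper name, `extract_fenced`; it returns `None` when no fence with the given language tag is found.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

def extract_fenced(text, lang):
    """Return the body of the first fenced block tagged with `lang`, or None if absent."""
    pattern = re.escape(FENCE + lang) + r"[ \t]*\n(.*?)" + re.escape(FENCE)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

reply = "Here you go:\n" + FENCE + "json\n{\"name\": \"demo\"}\n" + FENCE + "\nDone."
print(extract_fenced(reply, "json"))  # {"name": "demo"}
```

Unlike a check on the start and end of the string, this also handles replies where the model adds text before or after the fenced block.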

Response preview

In [90]:
import json
from genAPIDscp import responseFormatCheck

OUTPUT_DIR = '.'
                        
if response.startswith('```json') and response.endswith('```'):
    r = response[7:-3].strip('\n')

    try:
        endpoint_name = json.loads(r).get('tool_definition', {}).get('function', {}).get('name', 'unnamed')
    except (json.JSONDecodeError, AttributeError) as e:
        # `endpoint_path` comes from the cell above; fall back to a generic file name
        print(f"Error parsing response for {endpoint_path}: {e}")
        print(r)
        endpoint_name = 'unnamed'

    if responseFormatCheck(r):
        print(f"Valid response format for {endpoint_name}")
    else:
        print(f"Invalid response format for {endpoint_name}")
        print(r)
        raise ValueError(f"Invalid response format for {endpoint_name}")

    # Open with 'w' so re-running the cell does not append a second JSON document
    with open(OUTPUT_DIR + '/' + endpoint_name + '.json', 'w') as f:
        f.write(r)

Valid response format for confluence_search_page_content


### Postprocess
If Jira AA-44 works, this step does not need to be processed on the server!

Parse the tool function string and add decorators, like `process_func_str` in `AAPI/aapi/utils.py`.
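As an illustration of what such a postprocessing step might look like, the sketch below inserts a decorator line above the first top-level function in a source string, using `ast` to locate it and to reject invalid code. This is not the project's `process_func_str`; `add_decorator` and the `tool` decorator name are hypothetical.

```python
import ast

def add_decorator(func_src, decorator):
    """Insert '@decorator' above the first top-level function definition in func_src.

    Raises SyntaxError for invalid source and ValueError if no function is found.
    """
    tree = ast.parse(func_src)
    funcs = [n for n in tree.body if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    if not funcs:
        raise ValueError("no top-level function definition found")
    func = funcs[0]
    # If the function already has decorators, insert above the topmost one
    first_line = min([func.lineno] + [d.lineno for d in func.decorator_list])
    lines = func_src.splitlines()
    lines.insert(first_line - 1, f"@{decorator}")
    return "\n".join(lines)

source = "def search_pages(cql):\n    return cql\n"
print(add_decorator(source, "tool"))
```

Working on the parsed tree rather than on a text search for `def` avoids matching the word inside strings or comments.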
