## Introduction

Superpowered AI is a Knowledge Base as a Service (KBaaS) that provides a robust and efficient way to create, store, and query knowledge bases. In this tutorial, we will demonstrate how the `query_passages` endpoint can be used to search a webpage for information relevant to a specific query with a single API call.

Autonomous agents like AutoGPT can navigate the web, gather relevant information, and perform actions. One challenge with building this kind of agent is that many webpages have too much text to fit into the context window of LLMs like GPT-3 and GPT-4. Agents can only see what's in their context window, so any text that doesn't fit is invisible to them. How can we give an agent access to content that doesn't fit in context? The answer is semantic search.

Generally, semantic search is done by breaking the text into chunks, creating a vector embedding for each chunk, uploading those vectors to a vector database, and then querying the database with the embedding of the query. For an application like web browsing, where you may never need to search that exact page again, building and storing an index for every page is inefficient. We've built a more efficient and cost-effective way to do this. You simply provide the query along with the content you want to search in a single API call, and we return the most relevant text snippets along with an LLM-generated summary.
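To make the conventional pipeline concrete, here is a minimal sketch of the chunk, embed, index, and query steps described above. It is only an illustration: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and all function names are hypothetical rather than part of any SDK.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 40) -> list:
    """Split text into chunks of at most `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, text: str, top_k: int = 2, chunk_size: int = 40) -> list:
    """Return the `top_k` chunks of `text` most similar to `query`."""
    chunks = chunk_text(text, chunk_size)
    # The list of (chunk, vector) pairs plays the role of the vector database
    index = [(chunk, embed(chunk)) for chunk in chunks]
    query_vector = embed(query)
    ranked = sorted(index, key=lambda item: cosine_similarity(query_vector, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

Every page visited this way pays the cost of chunking, embedding, and indexing, even if the index is thrown away after a single query.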

## Prerequisites

Before we start, ensure you have the following:

1. A Superpowered AI account with API keys (Sign up [here](https://superpowered.ai) for free access).
2. Python 3 installed on your computer.
3. `beautifulsoup4` and `requests` libraries installed (you can install them using `pip install beautifulsoup4 requests`).

## Step-by-Step Tutorial

### Step 1: Set up the environment

First, we will set up the environment by importing the required libraries and initializing the Superpowered AI SDK with our API keys.

In [1]:
from superpowered import query_passages
from bs4 import BeautifulSoup
import os
import requests

In [None]:
# set API keys
os.environ["SUPERPOWERED_API_KEY_ID"] = "YOUR_API_KEY_ID"
os.environ["SUPERPOWERED_API_KEY_SECRET"] = "YOUR_API_KEY_SECRET"

Replace `"YOUR_API_KEY_ID"` and `"YOUR_API_KEY_SECRET"` with your actual API keys.
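Forgetting to replace the placeholders is an easy mistake to make, so a small check before any API call can help. The following is a minimal sketch; `missing_api_keys` is a hypothetical helper, and treating any value that starts with `YOUR_` as a placeholder is an assumption based on the strings used in this tutorial.

```python
import os

# Variable names taken from the cell above
REQUIRED_VARS = ("SUPERPOWERED_API_KEY_ID", "SUPERPOWERED_API_KEY_SECRET")


def missing_api_keys(environ=os.environ) -> list:
    """Return the names of required variables that are unset or still placeholders."""
    return [
        name for name in REQUIRED_VARS
        if not environ.get(name) or environ[name].startswith("YOUR_")
    ]
```

Calling `missing_api_keys()` returns an empty list once both keys are set to real values.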

### Step 2: Scrape the web page

Next, we will scrape the content of a web page using the `requests` library and `BeautifulSoup`.

In [None]:
url = "https://en.wikipedia.org/wiki/Mont_Blanc"

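# scrape the URL (fail fast on network problems or an HTTP error status)
response = requests.get(url, timeout=30)
response.raise_for_status()

# parse the HTML and keep only the text inside <body>
soup = BeautifulSoup(response.content, "html.parser")
body = soup.find("body")
content = body.text

print(content)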
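Text extracted from a page body usually contains long runs of blank lines and indentation left over from the HTML layout. An optional cleanup pass keeps the content compact. This is a minimal sketch using only the standard library; `normalize_whitespace` is a hypothetical helper, not part of BeautifulSoup or the Superpowered SDK.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces and tabs, and drop empty lines from scraped text."""
    lines = (re.sub(r"[ \t\u00a0]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

It can be applied to the scraped text with `content = normalize_whitespace(content)`.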

Or, if that doesn't work for the web page you want to scrape, you can use Selenium, which is slower but works for a much wider variety of websites.

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from contextlib import contextmanager
import time

url = "https://superpoweredai.notion.site"

@contextmanager
def get_chrome_driver(options):
    """
    context manager to ensure `driver.quit()` is called after execution
    to avoid memory leaks and zombie processes
    """
    # Selenium 4 takes the chromedriver path through a Service object
    driver = webdriver.Chrome(service=Service("/opt/chromedriver"), options=options)
    try:
        yield driver
    finally:
        driver.quit()

def get_title_and_content_from_url(url: str) -> tuple:
    """
    get the human-readable text from the website

    This will require us to first render the page using a headless browser
    and then get the text from the page
    """

    print('attempting to get title and content from url: ', url)

    # Use a headless Chrome browser for rendering
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')

    with get_chrome_driver(options) as browser:
        browser.set_page_load_timeout(20)

        # Navigate to the URL and wait for the page to render
        browser.get(url)
        time.sleep(10)

        # Get the HTML code of the page
        html = browser.page_source

    # Parse the HTML with BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.title.string if soup.title is not None else url
    content = soup.get_text()
    print('length of content: ', len(content))

    return title, content

title, content = get_title_and_content_from_url(url)

print (title)
print (content)
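The two approaches can be combined so that the slower browser-based scraper only runs when the fast one fails or returns almost nothing. The sketch below is one possible way to do that; `scrape_with_fallback`, its parameter names, and the `min_length` threshold of 200 characters are all illustrative assumptions.

```python
def scrape_with_fallback(url, fast_scraper, slow_scraper, min_length=200):
    """Try `fast_scraper` first; use `slow_scraper` if it fails or returns too little text."""
    try:
        content = fast_scraper(url)
    except Exception as error:  # network errors, parse errors, etc.
        print(f"fast scraper failed ({error}); falling back")
        content = ""
    # Pages rendered by JavaScript often yield almost no text without a browser
    if len(content.strip()) >= min_length:
        return content
    return slow_scraper(url)
```

Here `fast_scraper` could wrap the `requests` code from the start of this step, and `slow_scraper` could be `lambda u: get_title_and_content_from_url(u)[1]`.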

### Step 3: Query the scraped content

Now, we will query the scraped content using the `query_passages` endpoint. We will ask the question, "What is the name of the highest mountain in the Alps?"

In [9]:
query = "What is the name of the highest mountain in the Alps?"
response = query_passages(query=query, passages=[content], top_k=5, summarize_results=True)

The `query_passages()` function takes the following parameters:

- `query`: The question we want to ask.
- `passages`: A list containing the content we want to search.
- `top_k`: The number of top results to return.
- `summarize_results`: Whether to extract and summarize the top results.
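Because `passages` is a list, it may be possible to search several scraped pages in one call. The sketch below assumes the endpoint accepts more than one passage, which this tutorial does not demonstrate. `search_pages` is a hypothetical wrapper, and the query function is passed in as `query_fn` so the helper can be tried with a stub before spending API calls.

```python
def search_pages(query, pages, query_fn, top_k=5, summarize_results=True):
    """Search several scraped pages in one call; `pages` maps URL -> page text."""
    # Skip pages that scraped to nothing so they don't take up a passage slot
    passages = [text for text in pages.values() if text and text.strip()]
    if not passages:
        raise ValueError("no non-empty pages to search")
    return query_fn(
        query=query,
        passages=passages,
        top_k=top_k,
        summarize_results=summarize_results,
    )
```

With the import from Step 1, a call would look like `search_pages(query, {url: content}, query_passages)`.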

### Step 4: Print the results

Finally, we will print the summarized response and the top results.

In [10]:
print (response["summary"])

Mont Blanc [1] is the highest mountain in the Alps and Western Europe, standing at 4,807.81 m (15,774 ft) above sea level. It is located on the French-Italian border and is the second-most prominent mountain in Europe [4]. The summit of Mont Blanc is a permanent ice cap, with temperatures around −20 °C (−4 °F) [3]. The mountain and its surrounding peaks can create their own weather patterns, with the summit being prone to strong winds and sudden weather changes [3].


In [None]:
for i, result in enumerate(response["ranked_results"], start=1):
    print(f"Result {i}:")
    print(result["content"])
    print()
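Full passages can be long, so for logging or for feeding results back to an agent it can help to build a compact report instead. The following is a minimal sketch that relies only on the `summary`, `ranked_results`, and `content` keys used above; `format_results` and the `max_chars` limit are illustrative assumptions.

```python
import textwrap


def format_results(response: dict, max_chars: int = 300) -> str:
    """Build a printable report from the summary and the ranked results."""
    lines = ["Summary:", response.get("summary", ""), ""]
    for i, result in enumerate(response.get("ranked_results", []), start=1):
        # shorten() collapses whitespace and truncates on a word boundary
        snippet = textwrap.shorten(result["content"], width=max_chars, placeholder=" ...")
        lines.append(f"[{i}] {snippet}")
    return "\n".join(lines)
```

Printing `format_results(response)` shows the summary followed by one truncated line per result, numbered by rank.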