# Website information retrieval

In this demo, we will put together a website crawler to retrieve information from a particular website. 
In particular, we will collect information from NREL's Annual Technology Baseline (ATB) website: https://atb.nrel.gov/.
The information collected using this technique can be used in downstream models, such as a chatbot with a knowledge base attuned to the ATB.

## Website Crawling

The backend technology used for web crawling in ELM is [Crawl4AI](https://github.com/unclecode/crawl4ai). For full control and customization, consider using this library directly. 
Otherwise, if you choose to use the convenience wrapper set up in ELM, you can control the following options:

1) Which URLs from the webpage to follow and which ones to ignore.
2) Which URLs should be prioritized during the crawl.
3) Processing for the page content as it is crawled.
4) Crawl termination condition, either in the form of a conditional based on the content you have already collected or the total number of pages crawled (or both).


## 1. Basic: 2024 ATB Crawl

Let's begin by putting together the simplest possible crawler:

- No prioritization of URLs
- Stay within the ATB website
- Collect all information from all pages visited
- No extra processing of content
- Terminate after crawling the whole website


We'll start with a few imports:

In [1]:
from crawl4ai.deep_crawling.filters import URLFilter
from elm.web.website_crawl import ELMWebsiteCrawler
from rex import init_logger

We'll set the ``elm`` logger to ``DEBUG`` to get a better idea of what is happening as the website is crawled:

In [2]:
init_logger("elm", log_level="DEBUG")

<Logger elm (DEBUG)>

Next, we'll need to set up a few utility functions:

- ``empty_link_scorer``: 
    An async scoring function that takes a list of dictionaries (each of 
    which contains information about a link found on the website) and adds
    a ``"score"`` key that determines the link's relevance to our search 
    (the higher the score, the more relevant we think the link is). 
    Since we want to collect all the information from the ATB website, 
    we'll simply return ``1`` as the score for all links. Alternatively, 
    this is the callback we could use to make our search as focused as 
    possible for our use-case.
- ``empty_validator``: 
    An async validator function that takes an ``elm.web.document.Document``
    instance and returns whether or not to save this document to the crawl 
    output. Since we want to collect all the information from the ATB website, 
    we'll simply return ``True`` always. Alternatively, this is the callback 
    we could use to decide whether the information from the crawled 
    webpage/document is relevant to our use-case.
- ``empty_terminator``: 
    An async terminator function that takes a list of ``elm.web.document.Document``
    instances and returns whether or not to stop the crawl. Since we want to 
    collect all the information from the ATB website, we'll simply return 
    ``False`` always (continue until no more webpages are found). Alternatively, 
    this is the callback we could use to decide whether we have found enough 
    relevant information from the webpage for our use-case.


In [None]:
async def empty_link_scorer(links):
    """Since we want to collect all the information from the ATB website,
    this scoring function sets the score for all links to 1.
    """
    for link in links:
        link_title_text = link.get("text", "")
        link_url = link.get("href", "")
        # ... compute score ...
        link["score"] = 1

    return links

async def empty_validator(doc):
    """Since we want to collect all the information from the ATB website,
    this validator always returns True.
    """
    return True

async def empty_terminator(docs):
    """Since we want to collect all the information from the ATB website,
    this terminator always returns ``False``.
    """
    return False

To keep this crawling example focused, we'll only consider the 2024 ATB for now. To do this, we have to set up a URL filter, which will do the following:

- Keep URLs contained to the relevant subdomain (https://atb.nrel.gov) - we don't want to crawl the entire `nrel.gov` domain!
- If the URL seems to have a year in it (i.e. a ``/20XX``), make sure that we stick with ``/2024``, which excludes previous (or future) years.
- A quirk of the ATB website is that it uses section links within a single web page. This makes the crawler think it is visiting two separate pages, when in reality it is looking at the same page (just at a different location). To get around this, we'll discard any URLs that contain the ``#`` character, which indicates that a section within the page is being linked.

These rules are all combined in an ``apply`` method in a subclass of Crawl4AI's ``URLFilter``:

In [None]:
class ATBFilter(URLFilter):
    """Helper class to filter URLs down to the ATB website"""

    def apply(self, url):
        """Check if the URL is part of the ATB website.

        Specifically, we check for the full subdomain (https://atb.nrel.gov)
        in the URL. We also check that if a year is "detected" in the URL,
        that it is the 2024 ATB year. Finally, we check that the URL
        does not contain a section link (i.e., it does not contain a `#`).
        For example, these links

        https://atb.nrel.gov/electricity/2024/index/definitions#financialcases
        https://atb.nrel.gov/electricity/2024/index/definitions#r&donlycase

        are just the same link but pointing to different sections of the
        page. Thus, checking for "#" helps avoid collecting redundant
        information.
        """
        correct_subdomain = url.casefold().startswith("https://atb.nrel.gov")
        correct_year = True if "/20" not in url else "/2024" in url
        not_subsection_link = "#" not in url
        return correct_subdomain and correct_year and not_subsection_link
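
Because the filter rules are plain string checks, they can be sanity-checked without launching a crawl. The sketch below copies the three checks into a standalone function and runs it on a few sample URLs. The name `passes_atb_filter` is hypothetical and used only for this illustration, and the 2023 URL is an assumed example of a non-2024 page.

```python
def passes_atb_filter(url):
    """Standalone copy of the three checks in ``ATBFilter.apply``."""
    correct_subdomain = url.casefold().startswith("https://atb.nrel.gov")
    correct_year = True if "/20" not in url else "/2024" in url
    not_subsection_link = "#" not in url
    return correct_subdomain and correct_year and not_subsection_link


# Sample URLs for illustration; the 2023 link is an assumed example
sample_urls = [
    "https://atb.nrel.gov/electricity/2024/index",
    "https://atb.nrel.gov/electricity/2023/index",
    "https://atb.nrel.gov/electricity/2024/index/definitions#financialcases",
    "https://www.nrel.gov/",
    "https://atb.nrel.gov/contact",
]
filter_results = {url: passes_atb_filter(url) for url in sample_urls}
for url, keep in filter_results.items():
    print(f"{'keep' if keep else 'skip'}: {url}")
```

Only the 2024 index page and the contact page should be kept. The other three fail the year, section-link, and subdomain checks, respectively.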

Now we instantiate the crawler using our ``empty_link_scorer`` and 
``empty_validator`` functions. We'll also specify ``{"verify_ssl": False}`` 
as ``file_loader_kwargs`` to avoid some SSL issues over the NREL VPN and 
also set ``include_external=False`` (which is actually already the default) 
to prevent the crawl from collecting information from external domains.
We'll add our own filter that we implemented above using ``extra_url_filters=[ATBFilter()]``,
and we'll specify that we want the crawl to stop gathering extra links after 
a recursion depth of 3 using the ``crawl_strategy_kwargs={"max_depth": 3}`` input.
Finally, we will set the ``max_pages`` input to be large enough that we 
can confidently crawl the entire 2024 ATB website.

<div class="alert alert-block alert-info">
<b>Tip:</b> Try adding <code>browser_config_kwargs={"headless": False}</code> to the <code>ELMWebsiteCrawler</code> initializer below to visualize the crawl process
</div>

In [None]:
crawler = ELMWebsiteCrawler(empty_validator,
                            url_scorer=empty_link_scorer,
                            file_loader_kwargs={"verify_ssl": False},
                            include_external=False,
                            extra_url_filters=[ATBFilter()],
                            crawl_strategy_kwargs={"max_depth": 3},
                            max_pages=1000)

Now we can pass the ``empty_terminator`` callback as well as our website URL to the ``run`` method 
and watch the logs to track the crawl progress!

<div class="alert alert-block alert-warning">
<b>Note:</b> Despite the limitations we imposed, the crawl can still take some time to run (~10 minutes)
</div>

In [6]:
atb_info = await crawler.run("https://atb.nrel.gov/",
                             termination_callback=empty_terminator)

DEBUG - 2025-07-02 23:19:14,482 [website_crawl.py:518] : Crawled https://atb.nrel.gov/
DEBUG - 2025-07-02 23:19:14,487 [website_crawl.py:533] : Document passed validation check
DEBUG - 2025-07-02 23:19:16,047 [website_crawl.py:518] : Crawled https://atb.nrel.gov/archive
DEBUG - 2025-07-02 23:19:16,051 [website_crawl.py:533] : Document passed validation check
DEBUG - 2025-07-02 23:19:25,061 [website_crawl.py:518] : Crawled https://atb.nrel.gov/contact
DEBUG - 2025-07-02 23:19:25,070 [website_crawl.py:533] : Document passed validation check
DEBUG - 2025-07-02 23:19:25,321 [website_crawl.py:518] : Crawled https://atb.nrel.gov/transportation/2024/data
DEBUG - 2025-07-02 23:19:25,326 [website_crawl.py:533] : Document passed validation check
DEBUG - 2025-07-02 23:19:25,530 [website_crawl.py:518] : Crawled https://atb.nrel.gov/search
DEBUG - 2025-07-02 23:19:25,532 [website_crawl.py:533] : Document passed validation check
DEBUG - 2025-07-02 23:19:25,700 [website_crawl.py:518] : Crawled https:

Now that the crawl is finished, we can examine the documents we extracted:

In [9]:
len(atb_info)

324

In [8]:
for doc in atb_info[:5]:
    print(doc)

HTMLDocument with 1 pages
Attrs:
                        source:	https://atb.nrel.gov/archive
  website_link_relevance_score:	1
HTMLDocument with 1 pages
Attrs:
                        source:	https://atb.nrel.gov/contact
  website_link_relevance_score:	1
HTMLDocument with 1 pages
Attrs:
                        source:	https://atb.nrel.gov/transportation/2024/data
  website_link_relevance_score:	1
HTMLDocument with 1 pages
Attrs:
                        source:	https://atb.nrel.gov/search
  website_link_relevance_score:	1
HTMLDocument with 1 pages
Attrs:
                        source:	https://atb.nrel.gov/electricity/2024/data
  website_link_relevance_score:	1


In [22]:
print(atb_info[20].attrs["source"])
print(atb_info[20].text)

https://atb.nrel.gov/electricity/2024/distributed_wind
[Skip to main content](https://atb.nrel.gov/electricity/2024/distributed_wind#content) Contribute to enhancing the Electricity ATB! Share your feedback in this 5-minute [ questionnaire](https://forms.office.com/g/j9Ua4Hsayx), or signup for general [ email updates regarding the ATB](https://atb.nrel.gov/register). [ Sign up for general email updates regarding the ATB ](https://atb.nrel.gov/register) The 2024 Electricity ATB is live! Join the webinar to learn what's new. [ Register to attend](https://nrel.zoomgov.com/webinar/register/WN_mDY0z3ahTyyG_Y86xnZIaA) or [ sign up for general email updates](https://atb.nrel.gov/register). [ ](http://www.nrel.gov "Visit the National Renewable Energy Lab Homepage") [Annual Technology Baseline](https://atb.nrel.gov/) Menu * [ Electricity](https://atb.nrel.gov/electricity/2024/distributed_wind) * [ Electricity](https://atb.nrel.gov/electricity/2024/index) * [About](https://atb.nrel.gov/electrici

### Practice
Here are a few things you can do to expand upon this tutorial and gain some practice setting up a website crawler:

- Exclude duplicated content (either URL filter or document filter)
- Filter out documents with "Page not found" text
- Search only electricity ATB
- Prioritize links that go into definitions, documentation, and methods
- Post-process documents (filter out markdown syntax, chunk text, compute embeddings, store in database, etc)
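
As a possible starting point for two of these exercises, here is a minimal sketch of a validator that rejects "Page not found" documents and a link scorer that favors definitions, documentation, and methods pages. The names `page_found_validator`, `keyword_link_scorer`, and `PRIORITY_KEYWORDS` are hypothetical, and the `SimpleNamespace` objects are stand-ins for real ELM documents (only the ``text`` attribute is assumed).

```python
import asyncio
from types import SimpleNamespace

# Hypothetical keyword list; tune it to your own use-case
PRIORITY_KEYWORDS = ("definitions", "documentation", "methods")


async def page_found_validator(doc):
    """Reject documents whose text contains 'Page not found'."""
    return "page not found" not in doc.text.casefold()


async def keyword_link_scorer(links):
    """Give a higher score to links that mention a priority keyword."""
    for link in links:
        searchable = f"{link.get('text', '')} {link.get('href', '')}".casefold()
        has_keyword = any(word in searchable for word in PRIORITY_KEYWORDS)
        link["score"] = 2 if has_keyword else 1
    return links


# Stand-in inputs for illustration only; a real crawl supplies ELM
# documents and the link dictionaries it finds on each page.
missing_page = SimpleNamespace(text="Page not found")
real_page = SimpleNamespace(text="Annual Technology Baseline: Definitions")
sample_links = [
    {"text": "Definitions",
     "href": "https://atb.nrel.gov/electricity/2024/index/definitions"},
    {"text": "Contact", "href": "https://atb.nrel.gov/contact"},
]

validator_results = [asyncio.run(page_found_validator(missing_page)),
                     asyncio.run(page_found_validator(real_page))]
link_scores = [link["score"]
               for link in asyncio.run(keyword_link_scorer(sample_links))]
print(validator_results, link_scores)
```

Inside a notebook, where an event loop is already running, replace each `asyncio.run(...)` call with a plain `await`.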

<br>


## 2. Advanced: Targeted 2024 ATB Crawl

In this example, we'll build on the crawler we set up in the first part of this tutorial, but make its search more targeted. In particular, we'll leverage LLMs to parse the website content and check for specific information about geothermal well flow rates.

We'll start by setting up the URL filter as before:

In [40]:
class ATBFilter(URLFilter):
    """Helper class to filter URLs down to the ATB website"""

    def apply(self, url):
        """Check if the URL is part of the ATB website.

        Specifically, we check for the full subdomain (https://atb.nrel.gov)
        in the URL. We also check that if a year is "detected" in the URL,
        that it is the 2024 ATB year. Finally, we check that the URL
        does not contain a section link (i.e., it does not contain a `#`).
        For example, these links

        https://atb.nrel.gov/electricity/2024/index/definitions#financialcases
        https://atb.nrel.gov/electricity/2024/index/definitions#r&donlycase

        are just the same link but pointing to different sections of the
        page. Thus, checking for "#" helps avoid collecting redundant
        information.
        """
        correct_subdomain = url.casefold().startswith("https://atb.nrel.gov")
        correct_year = True if "/20" not in url else "/2024" in url
        not_subsection_link = "#" not in url
        return correct_subdomain and correct_year and not_subsection_link

However, for the document validator, we'll apply an LLM to examine the website text:

In [None]:
from elm.ords.llm import LLMCaller
from elm.ords.services.openai import OpenAIService

SYSTEM_MESSAGE = (
    "You examine text scraped from the NREL ATB website to determine if it "
    "contains information relevant to the user's query. Base your responses "
    "only off of the content of the webpage and **not** on any auxiliary "
    "information."
)
INSTRUCTIONS = (
    "Does the text from this website give specifics about the flow rate per "
    "production well used in a geothermal 2024 ATB scenario? "
    "Begin your response with 'Yes' or 'No', briefly explain your answer, "
    "and end your response with the production well flow rate value you found."
)
LLM = LLMCaller(llm_service=OpenAIService, model="gpt-4o-mini")


async def llm_geothermal_validator(doc):
    """Apply an LLM to check for flow rate text."""
    question = f"{INSTRUCTIONS}\n\nText:\n\n{doc.text}"
    llm_response = await LLM.call(sys_msg=SYSTEM_MESSAGE, content=question)
    return llm_response.casefold().startswith("yes")
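
The validator above only uses the leading "Yes"/"No", but the instructions also ask the LLM to end its response with the flow rate value. A minimal sketch of how that trailing value could be pulled out is shown below. The helper name `parse_flow_rate_response` is hypothetical, the regular expression assumes the value is the last number in the response (optionally followed by a unit), and the sample response is a made-up placeholder rather than real ATB data.

```python
import re


def parse_flow_rate_response(llm_response):
    """Split a 'Yes/No ... value' style response into (is_relevant, value)."""
    text = llm_response.strip()
    if not text.casefold().startswith("yes"):
        return False, None
    # Assumption: the value is the last number in the response, optionally
    # followed by a unit made of letters and slashes (e.g. "kg/s")
    matches = re.findall(r"\d+(?:\.\d+)?\s*[A-Za-z/]*", text)
    value = matches[-1].strip() if matches else None
    return True, value


# Placeholder responses for illustration; the number is NOT real ATB data
yes_result = parse_flow_rate_response(
    "Yes. The page gives a per-well flow rate. 123 kg/s"
)
no_result = parse_flow_rate_response("No. The page does not mention flow.")
print(yes_result, no_result)
```

LLM output is free-form, so treat the extracted value as a hint to verify against the source page rather than a final answer.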


We'll also assume that once a single document is found, we have the info we need so we can stop the search:

In [51]:
async def single_doc_terminator(docs):
    """Assume that once a document passes validation, we can stop the crawl."""
    return len(docs) > 0
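
If one document is not enough for your use-case, the same pattern generalizes to any threshold. The sketch below builds a terminator that stops once a chosen number of documents has been collected. `make_min_docs_terminator` is a hypothetical helper, and the strings stand in for the document instances the crawler would pass.

```python
import asyncio


def make_min_docs_terminator(min_docs):
    """Build a terminator that stops once ``min_docs`` documents are kept."""

    async def terminator(docs):
        return len(docs) >= min_docs

    return terminator


three_doc_terminator = make_min_docs_terminator(3)

# Strings stand in for documents here; only the list length matters
stop_at_two = asyncio.run(three_doc_terminator(["doc1", "doc2"]))
stop_at_three = asyncio.run(three_doc_terminator(["doc1", "doc2", "doc3"]))
print(stop_at_two, stop_at_three)
```

The returned function has the same shape as `single_doc_terminator`, so it could be passed as the `termination_callback` in the same way.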

To try to minimize LLM costs, let's also prioritize any links that mention geothermal:

In [None]:
async def geothermal_link_scorer(links):
    """Prioritize any web links that mention 'geothermal'."""
    for link in links:
        link_title_text = link.get("text", "")
        link_url = link.get("href", "")
        base_score = 1
        if ("geothermal" in link_title_text.casefold()
            or "geothermal" in link_url.casefold()):
            base_score = 2

        link["score"] = base_score

    return links


Now we can set up crawling as before (note that we use the new validator and link scorer):

In [47]:
crawler = ELMWebsiteCrawler(llm_geothermal_validator,
                            file_loader_kwargs={"verify_ssl": False},
                            extra_url_filters=[ATBFilter()],
                            url_scorer=geothermal_link_scorer,
                            crawl_strategy_kwargs={"max_depth": 3},
                            max_pages=1000)

The only difference this time is that we have to launch the LLM service before we execute the crawl:

In [None]:
import openai
from elm.utilities import validate_azure_api_params
from elm.ords.services.provider import RunningAsyncServices

# func below assumes you have API params set as ENV variables
azure_api_key, azure_version, azure_endpoint = validate_azure_api_params()
client = openai.AsyncAzureOpenAI(
    api_key=azure_api_key,
    api_version=azure_version,
    azure_endpoint=azure_endpoint,
)
services = [OpenAIService(client, rate_limit=500_000)]
async with RunningAsyncServices(services):
    geo_info = await crawler.run("https://atb.nrel.gov/",
                                 termination_callback=single_doc_terminator)

DEBUG - 2025-07-03 00:02:54,377 [provider.py:137] : Initializing Service: OpenAIService


DEBUG - 2025-07-03 00:02:57,569 [website_crawl.py:518] : Crawled https://atb.nrel.gov/
DEBUG - 2025-07-03 00:03:00,471 [website_crawl.py:518] : Crawled https://atb.nrel.gov/archive
DEBUG - 2025-07-03 00:03:08,155 [website_crawl.py:518] : Crawled https://atb.nrel.gov/contact
DEBUG - 2025-07-03 00:03:10,733 [website_crawl.py:518] : Crawled https://atb.nrel.gov/electricity/2024/index
DEBUG - 2025-07-03 00:03:12,260 [website_crawl.py:518] : Crawled https://atb.nrel.gov/transportation/2024/data
DEBUG - 2025-07-03 00:03:13,532 [website_crawl.py:518] : Crawled https://atb.nrel.gov/electricity/2024/data
DEBUG - 2025-07-03 00:03:14,594 [website_crawl.py:518] : Crawled https://atb.nrel.gov/transportation/2024/about
DEBUG - 2025-07-03 00:03:15,675 [website_crawl.py:518] : Crawled https://atb.nrel.gov/electricity/2024/technologies
DEBUG - 2025-07-03 00:03:16,669 [website_crawl.py:518] : Crawled https://atb.nrel.gov/search
DEBUG - 2025-07-03 00:03:17,547 [website_crawl.py:518] : Crawled https://atb

In [49]:
geo_info

[HTMLDocument with 1 pages
 Attrs:
                         source:	https://atb.nrel.gov/electricity/2024/geothermal
   website_link_relevance_score:	2]

### Practice
Here are a few things you can do to expand upon this tutorial and gain some practice setting up a more advanced website crawler:

- Set up a decision tree with more specific content filtering using the LLM
- Score the links using an LLM
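
For the second exercise, one possible shape is a scorer factory that takes the LLM call as an argument, which keeps the scoring logic testable offline. Everything below is a sketch under stated assumptions: `make_llm_link_scorer`, `LINK_PROMPT`, and `ask_llm` are hypothetical names, `ask_llm` is assumed to be an async callable that takes a prompt string and returns the reply as a string, and `fake_llm` is a stand-in so the example runs without an API key.

```python
import asyncio

# Hypothetical prompt; adapt the wording to your own search target
LINK_PROMPT = (
    "Rate from 0 to 10 how likely this link is to lead to the information "
    "we need. Reply with a number only.\n\n"
    "Link text: {text}\nLink URL: {href}"
)


def make_llm_link_scorer(ask_llm):
    """Build a link scorer around ``ask_llm`` (hypothetical async callable)."""

    async def scorer(links):
        for link in links:
            prompt = LINK_PROMPT.format(text=link.get("text", ""),
                                        href=link.get("href", ""))
            reply = await ask_llm(prompt)
            try:
                link["score"] = float(reply.strip())
            except ValueError:
                # Fall back to the lowest score when the reply is not numeric
                link["score"] = 0.0
        return links

    return scorer


async def fake_llm(prompt):
    """Stand-in for a real LLM call so the sketch runs offline."""
    return "8" if "geothermal" in prompt.casefold() else "not sure"


sample_links = [
    {"text": "Geothermal",
     "href": "https://atb.nrel.gov/electricity/2024/geothermal"},
    {"text": "Contact", "href": "https://atb.nrel.gov/contact"},
]
llm_scorer = make_llm_link_scorer(fake_llm)
llm_scores = [link["score"] for link in asyncio.run(llm_scorer(sample_links))]
print(llm_scores)
```

In this tutorial, `ask_llm` could be a thin wrapper around `LLM.call` with a suitable system message. Keep in mind that this makes one LLM call per link, so it trades higher scoring cost for a more focused crawl.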

<br>
