# Custom web information retrieval using search engines

In this demo, we will build a custom search-engine-based information extraction pipeline modeled on the ordinance web scraping tool.
See the [ordinance example](https://github.com/NREL/elm/tree/main/examples/ordinance_gpt) if you are interested in that particular
example. For this exercise, we will set up a pipeline to extract the name of the current director of NREL based on Wikipedia articles. 

## General pipeline structure

We will follow the structure of the ordinance extraction pipeline, which can generally be summarized into these major steps:

1) Collect the text from the top `N` Google links over some set of pre-determined queries.
2) Filter results down based on content and/or metadata.
3) Extract relevant text from the webpage or PDF file.
4) Extract structured information from the relevant text.

Let's dissect each portion one at a time!

## Scraping text from Google search results

We will begin by setting up the Google search. To do this, we must come up with one or more relevant queries. 
Since we are interested in looking up the director of NREL using Wikipedia articles about NREL, we can use the following search queries:

In [1]:
QUERIES = ["NREL wiki", "National Renewable Energy Laboratory director"]

We used only two search queries for this example, but you can use as many as you'd like. Try to differentiate them as much as possible to diversify the set of search results Google returns (while staying as on-topic as possible). What would you type into Google to find an answer to the question you are asking?
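
If you end up maintaining many queries, one option is to generate them from a few building blocks. The sketch below is only an illustration: `build_queries` and its parameters are hypothetical helpers, not part of ELM. It pairs each name for the subject with each topic keyword and drops case-insensitive duplicates:

```python
def build_queries(subject, aliases=(), topics=("director", "wiki")):
    """Pair every name for the subject with every topic keyword."""
    names = [subject, *aliases]
    seen = set()
    queries = []
    for name in names:
        for topic in topics:
            query = f"{name} {topic}"
            key = query.lower()
            if key not in seen:  # skip case-insensitive duplicates
                seen.add(key)
                queries.append(query)
    return queries


queries = build_queries(
    "National Renewable Energy Laboratory", aliases=["NREL", "nrel"]
)
for query in queries:
    print(query)
```

The resulting list has the same shape as `QUERIES` above, so it could be used in its place for the search step.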

Once we have a set of queries we are happy with, we can use the `web_search_links_as_docs` function in ELM to perform the Google search and return each Google search result as an ELM `Document`.


<div class="alert alert-block alert-info">
<b>Tip:</b> Try adding <code>pw_launch_kwargs={"headless": False, "slow_mo": 1000}</code> to the call below to visualize the search process
</div>

In [None]:
from elm.web.search import web_search_links_as_docs

docs = await web_search_links_as_docs(
    QUERIES,
    pdf_read_kwargs={"verbose": False},
    ignore_url_parts={"openei.org"},
)

We can check `docs` to see that we indeed got some Google search results:

In [3]:
docs

[<elm.web.document.HTMLDocument at 0x7f8ce47f7b30>,
 <elm.web.document.HTMLDocument at 0x7f8ce4406a20>,
 <elm.web.document.HTMLDocument at 0x7f8ce41eb770>,
 <elm.web.document.HTMLDocument at 0x7f8ce416a540>]

The `__repr__` for the `Document` class isn't particularly helpful (except to tell us that all links were parsed as HTML).
Instead, we can look at the `attrs` dictionary of each document, which will show us the source for each document:

In [None]:
for d in docs:
    print(d.attrs)

{'source': 'https://en.wikipedia.org/?title=National_Renewable_Energy_Lab&redirect=no'}
{'source': 'https://www.energy.gov/person/dr-martin-keller#:~:text=Martin%20Keller-,Dr.,Alliance%20for%20Sustainable%20Energy%2C%20LLC.'}
{'source': 'https://www2.nrel.gov/about/leadership'}
{'source': 'https://www.linkedin.com/in/martin-keller-a09b016'}


Excellent, it looks like we have at least one Wikipedia page about NREL among the results.

However, we also have some documents from links in which we are not interested (e.g., we would like to ignore the official `nrel.gov` page for this exercise). This will generally be true for every analysis, since Google search results can vary broadly. What we need to do next, then, is filter the results down to only include the sources we are interested in.


## Filtering results

The next step is to define some criteria for the sources we are interested in. For our purposes, we would like to limit the results to only include Wikipedia articles. We can accomplish this quite simply - just check to see if "wikipedia" is in the source URL!

Let's implement this basic check in the form of an async function (i.e., a coroutine) that takes a document instance as input and returns a boolean that labels whether the document source is a Wikipedia article: 

In [6]:
async def url_is_wiki(doc):
    return "wikipedia" in doc.attrs.get("source", "")

Easy enough! In practice, this filtering logic can be as complex as you want it to be. It can even include calls to an LLM to parse the content of the document to determine if it contains the information you are interested in. Indeed, this is exactly what the ordinance parsing pipeline does. Check out the [`CountyValidator` implementation](https://nrel.github.io/elm/_modules/elm/ords/validation/location.html#CountyValidator) for an example.
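
As a small step up in strictness, a substring check on the whole URL can be swapped for a check on the URL's hostname. The sketch below is one possible variant, not something ELM provides: `host_is_wikipedia` is a hypothetical name, and the `SimpleNamespace` objects are stand-ins for ELM documents that only mimic the `attrs` dictionary printed earlier.

```python
import asyncio
from types import SimpleNamespace
from urllib.parse import urlparse


async def host_is_wikipedia(doc):
    """Return True only if the source URL is hosted on wikipedia.org."""
    host = urlparse(doc.attrs.get("source", "")).hostname or ""
    return host == "wikipedia.org" or host.endswith(".wikipedia.org")


# Stand-in documents (assumption: only the `attrs` dictionary is needed)
wiki_doc = SimpleNamespace(
    attrs={
        "source": (
            "https://en.wikipedia.org/wiki/"
            "National_Renewable_Energy_Laboratory"
        )
    }
)
blog_doc = SimpleNamespace(
    attrs={"source": "https://example.com/why-wikipedia-is-useful"}
)

# In a notebook, use `await host_is_wikipedia(doc)` instead of asyncio.run
wiki_ok = asyncio.run(host_is_wikipedia(wiki_doc))
blog_ok = asyncio.run(host_is_wikipedia(blog_doc))
print(wiki_ok, blog_ok)
```

The second URL contains the word "wikipedia" in its path, so a plain substring check would accept it, while the hostname check does not.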

For now, let's get back to applying our simple example. To use the coroutine we just defined, we pass it, along with an initial list of documents, to the appropriately named `filter_documents` function:

In [None]:
from elm.web.utilities import filter_documents

docs = await filter_documents(docs, url_is_wiki)
for d in docs:
    print(d.attrs)

{'source': 'https://en.wikipedia.org/wiki/National_Renewable_Energy_Laboratory'}


Much better!

As mentioned before, you can get as complex as necessary with the filtering step. You can even perform repeated (chained) calls to `filter_documents` to apply multiple levels of filtering to get down to exactly the kind of source you are interested in.
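
Chained calls are one way to layer filters. Another option, sketched below under stated assumptions, is to combine several checks into a single coroutine before filtering. The names `all_of`, `source_is_wiki`, and `has_enough_text` are hypothetical, the length threshold is arbitrary, and the `SimpleNamespace` objects only mimic the `attrs` and `text` attributes of an ELM document.

```python
import asyncio
from types import SimpleNamespace


def all_of(*predicates):
    """Build one coroutine that passes only if every predicate passes."""

    async def combined(doc):
        for predicate in predicates:
            if not await predicate(doc):
                return False  # stop at the first failed check
        return True

    return combined


async def source_is_wiki(doc):
    return "wikipedia" in doc.attrs.get("source", "")


async def has_enough_text(doc):
    return len(doc.text) >= 50  # arbitrary threshold for illustration


wiki_and_long = all_of(source_is_wiki, has_enough_text)

# Stand-in documents (assumption: real documents expose `attrs` and `text`)
source = "https://en.wikipedia.org/wiki/National_Renewable_Energy_Laboratory"
long_wiki = SimpleNamespace(attrs={"source": source}, text="x" * 200)
short_wiki = SimpleNamespace(attrs={"source": source}, text="stub")

# In a notebook, use `await wiki_and_long(doc)` instead of asyncio.run
results = [asyncio.run(wiki_and_long(d)) for d in (long_wiki, short_wiki)]
print(results)
```

Because `wiki_and_long` takes a document and returns a boolean, it has the same shape as `url_is_wiki`.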

Once you have a curated set of documents, it's time to extract some values!

## Extracting relevant text
Unfortunately, just because we filtered down to the documents we are interested in does not usually mean we can dive right into extracting values. Often, the documents we are examining contain a **lot** of text, most of which is **not** relevant to the question at hand (e.g., ordinance documents can be hundreds of pages long, and often we are just interested in the information found in one small section). 

To get around this, we leverage the LLM to parse the text and extract only the text we are interested in. Let's write another function to call the LLM on chunks of the document text and determine whether the text contains information about NREL's director.
After parsing all of the text chunks, we will stitch back together the relevant chunks to give us only the relevant text:

In [None]:
import asyncio

SYSTEM_MESSAGE = (
    "You extract one or more direct excerpts from a given text based on "
    "the user's request. Maintain all original formatting and characters "
    "without any paraphrasing. If the relevant text is inside of a "
    "space-delimited table, return the entire table with the original "
    "space-delimited formatting. Never paraphrase! Only return portions "
    "of the original text directly."
)
INSTRUCTIONS = (
    "Extract one or more direct text excerpts related to leadership at NREL. "
    "Be sure to include any relevant names and position titles. Include "
    "section headers (if any) for the text excerpts. If there is no text "
    "related to leadership at NREL, simply say: "
    '"No relevant text."'
)

async def extract_relevant_info(doc, text_splitter, llm):
    text_chunks = text_splitter.split_text(doc.text)
    summaries = [
        asyncio.create_task(
            llm.call(
                sys_msg=SYSTEM_MESSAGE,
                content=f"Text:\n{chunk}\n{INSTRUCTIONS}",
            ),
        )
        for chunk in text_chunks
    ]
    summary_chunks = await asyncio.gather(*summaries)
    summary_chunks = [
        chunk for chunk in summary_chunks
        if chunk  # chunk not empty string
        and "no relevant text" not in chunk.lower()  # LLM found relevant info
        and len(chunk) > 20  # chunk is long enough to contain relevant info
    ]
    relevant_text = "\n".join(summary_chunks)
    doc.attrs["relevant_text"] = relevant_text  # store in doc's metadata
    return doc

Before we can call this function, we have to perform some additional setup. Let's start by setting the parameters for our text splitting strategy. You may need to update `model` to match your endpoint:

In [8]:
from functools import partial
from elm import ApiBase
from langchain.text_splitter import RecursiveCharacterTextSplitter
from elm.ords.utilities import RTS_SEPARATORS

model = "gpt-4"
text_splitter = RecursiveCharacterTextSplitter(
    RTS_SEPARATORS,  # or your own custom set of separators
    chunk_size=3000,  # or your own custom chunk size
    chunk_overlap=300,  # or your own custom chunk overlap
    length_function=partial(ApiBase.count_tokens, model=model),
)
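
To build some intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class also tries to split on the given separators); `sliding_chunks` is a hypothetical helper that operates on a plain list of tokens.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Split tokens into windows of chunk_size sharing chunk_overlap items."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and < chunk_size")
    if not tokens:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tokens are covered by the previous window
    last_start = max(len(tokens) - chunk_overlap, 1)
    return [tokens[i:i + chunk_size] for i in range(0, last_start, step)]


tokens = list(range(10))  # stand-in for ten tokens of text
chunks = sliding_chunks(tokens, chunk_size=4, chunk_overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each chunk starts with the last token of the previous one, so text near a chunk boundary is seen in both chunks.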

We also have to configure the connection with the Azure OpenAI API:

In [9]:
import openai
from elm.utilities import validate_azure_api_params

# func below assumes you have API params set as ENV variables
azure_api_key, azure_version, azure_endpoint = validate_azure_api_params()
client = openai.AsyncAzureOpenAI(
    api_key=azure_api_key,
    api_version=azure_version,
    azure_endpoint=azure_endpoint,
)

Finally, we set up an `LLMCaller`, which is an ELM convenience class for querying an LLM. We also have to make our function call inside the `RunningAsyncServices` context, which runs the ELM services that handle convenient tasks for you, such as rate-limiting queries, tracking token usage, and re-submitting failed queries. A full discussion of ELM services is beyond the scope of this demo; all we need to know is that the call to our `extract_relevant_info` coroutine has to happen inside that context:

In [10]:
from elm.ords.llm import LLMCaller
from elm.ords.services.openai import OpenAIService
from elm.ords.services.provider import RunningAsyncServices


llm = LLMCaller(llm_service=OpenAIService, model=model)
services = [OpenAIService(client, rate_limit=40000)]

async with RunningAsyncServices(services):
    tasks = [
        asyncio.create_task(extract_relevant_info(doc, text_splitter, llm))
        for doc in docs
    ]
    docs = await asyncio.gather(*tasks)

Once processing is complete, we can take a look at the relevant text that the LLM extracted:

In [None]:
for d in docs:
    print("SOURCE:", d.attrs["source"])
    print("================================")
    print(d.attrs["relevant_text"])
    print()

SOURCE: https://en.wikipedia.org/wiki/National_Renewable_Energy_Laboratory
## History

[edit]

Martin Keller became NREL's ninth director in November 2015,[10] and currently
serves as both the director of the laboratory and the president of its
operating contractor, Alliance for Sustainable Energy, LLC.[11] He succeeded
Dan Arvizu, who retired in September 2015 after 10 years in those roles.[12]
"Dr. Martin Keller Named Director of National Renewable Energy Laboratory". _National Renewable Energy Laboratory_. Retrieved June 27, 2017.

"Dr. Martin Keller – Laboratory Director". Retrieved January 30, 2017.

SOURCE: https://en.wikipedia.org/wiki/United_States_Department_of_Energy_National_Laboratories
"National Renewable Energy Laboratory (NREL)

Golden, Colorado, 1977

Operating organization:

Alliance for Sustainable Energy, LLC (since 2008)[11]

Number of employees/ Annual budget (FY2021):

2685  
US$393,000,000"



It's not perfect, but it does contain the info we'll ultimately want to use. You may want to tune the system message and/or instructions to get the best possible result. 
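
While tuning, it may help to test the post-filtering rules from `extract_relevant_info` on their own, without calling the LLM. The sketch below pulls the same three checks into a hypothetical helper, `keep_relevant`, with the sentinel phrase and length threshold exposed as parameters; the sample responses are illustrative, not real LLM output.

```python
def keep_relevant(responses, sentinel="no relevant text", min_length=20):
    """Drop empty, sentinel-bearing, or very short LLM responses."""
    return [
        response for response in responses
        if response  # not an empty string
        and sentinel not in response.lower()  # LLM found something
        and len(response) > min_length  # long enough to be useful
    ]


sample_responses = [
    "",
    '"No relevant text."',
    "Director: TBD",
    "Martin Keller became NREL's ninth director in November 2015.",
]
kept = keep_relevant(sample_responses)
print(kept)
```

Only the last sample survives: the first is empty, the second contains the sentinel phrase, and the third is shorter than the threshold.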

## Extracting values from the text

Finally, we are ready to extract structured information from the relevant text we have collected thus far. To do so, we will use a decision tree framework, which can help guide the LLM through the reasoning steps required to extract the information we are interested in.

Our example task is rather straightforward, so the example graph set up in the code below is likely overkill. Still, it demonstrates the fundamentals required for setting up your own custom decision tree:

In [12]:
import networkx as nx


def setup_decision_tree_graph(text, chat_llm_caller):
    G = nx.DiGraph(text=text, chat_llm_caller=chat_llm_caller)
    G.add_node(
        "init",
        prompt=(
            "Does the following text mention the National Renewable Energy "
            "Laboratory (NREL)?  Begin your response with either 'Yes' or "
            "'No' and justify your answer."
            '\n\n"""\n{text}\n"""'
        ),
    )
    G.add_edge(
        "init", "leadership", condition=lambda x: x.lower().startswith("yes")
    )
    # Can add a branch for the "No" response if we want, but not required
    # since we catch `RuntimeErrors` below.
    G.add_node(
        "leadership",
        prompt=(
            "Does the text mention who the current director of the National "
            "Renewable Energy Laboratory (NREL) is? Begin your response with "
            "either 'Yes' or 'No' and justify your answer."
        ),
    )
    G.add_edge(
        "leadership", "name", condition=lambda x: x.lower().startswith("yes")
    )

    G.add_node(
        "name",
        prompt=(
            "Based on the text, who is the current director of the National "
            "Renewable Energy Laboratory (NREL)?"
        ),
    )
    G.add_edge("name", "final")  # no condition - always go to the end
    G.add_node(
        "final",
        prompt=(
            "Respond based on our entire conversation so far. Return your "
            "answer in JSON format (not markdown). Your JSON file must "
            'include exactly two keys. The keys are "director" and '
            '"explanation". The value of the "director" key should '
            "be a string containing the name of the current director of NREL "
            'as mentioned in the text. The value of the "explanation" '
            "key should be a string containing a short explanation for your "
            "answer."
        ),
    )
    return G
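
To see how the edge conditions steer the conversation, here is a toy traversal over a similar structure, using plain dictionaries and canned answers instead of a graph library and an LLM. It is only a sketch of the idea and not ELM's implementation: `walk`, `NODES`, `EDGES`, and the canned responses are all hypothetical.

```python
def starts_with_yes(response):
    return response.lower().startswith("yes")


# node name -> prompt (shortened stand-ins for the prompts above)
NODES = {
    "init": "Does the text mention NREL?",
    "leadership": "Does the text mention the current director?",
    "name": "Who is the current director?",
    "final": "Return the answer as JSON.",
}
# node name -> list of (next node, condition); None means "always follow"
EDGES = {
    "init": [("leadership", starts_with_yes)],
    "leadership": [("name", starts_with_yes)],
    "name": [("final", None)],
}


def walk(nodes, edges, ask, start="init"):
    """Ask each node's prompt; follow the first edge whose condition holds."""
    node = start
    while True:
        response = ask(nodes[node])
        out_edges = edges.get(node, [])
        if not out_edges:
            return response  # leaf node: this is the final answer
        for next_node, condition in out_edges:
            if condition is None or condition(response):
                node = next_node
                break
        else:
            raise RuntimeError(f"No edge condition from {node!r} was satisfied")


# Canned answers keyed by prompt (assumption: stands in for the LLM)
canned = {
    NODES["init"]: "Yes, NREL is mentioned.",
    NODES["leadership"]: "Yes, a director is named.",
    NODES["name"]: "Martin Keller",
    NODES["final"]: '{"director": "Martin Keller", "explanation": "Stated."}',
}
answer = walk(NODES, EDGES, canned.get)
print(answer)
```

If the canned answer for the `leadership` prompt started with "No" instead, `walk` would raise a `RuntimeError`, which is the situation the comment in `setup_decision_tree_graph` refers to.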

We are almost done! All we have to do now is set up an LLM to parse the relevant text in the document using the tree we just configured. To do this, we implement a short `extract_final_values` coroutine that sets up the tree and executes it. The LLM response is parsed from JSON to a Python dictionary.

One small caveat is that we have to use a `ChatLLMCaller` instead of an `LLMCaller`, since the decision tree requires the former (which tracks the LLM's responses as it traverses the decision tree):

In [None]:
from elm.ords.utilities import llm_response_as_json
from elm.ords.extraction.tree import AsyncDecisionTree
from elm.ords.llm import ChatLLMCaller


CHAT_SYSTEM_MESSAGE = (
    "You are a researcher extracting information from wikipedia articles. "
    "Always answer based off of the given text, and never use prior knowledge."
)

async def extract_final_values(doc, model):

    chat_llm = ChatLLMCaller(
        llm_service=OpenAIService,
        system_message=CHAT_SYSTEM_MESSAGE,
        model=model
    )

    G = setup_decision_tree_graph(
        text=doc.attrs["relevant_text"], chat_llm_caller=chat_llm
    )
    tree = AsyncDecisionTree(G)

    try:
        response = await tree.async_run()
    except RuntimeError:  # raised if the tree "condition" is not met
        response = None
    response = llm_response_as_json(response) if response else {}
    response.update(doc.attrs)
    return response

Now we can call our function! As before, we have to put our function call under the `RunningAsyncServices` context:

In [14]:
async with RunningAsyncServices(services):
    tasks = [
        asyncio.create_task(extract_final_values(doc, model)) for doc in docs
    ]
    info_dicts = await asyncio.gather(*tasks)

info_dicts

None of the edge conditions from "leadership" were satisfied: [{'condition': <function setup_decision_tree_graph.<locals>.<lambda> at 0x1387f2020>}]
Ran into an exception when traversing tree. Last message from LLM is printed below. See debug logs for more detail. 
Last message: 
"""
No, the text does not mention who the current director of the National Renewable Energy Laboratory (NREL) is. It only provides information on the operating organization, location, establishment year, number of employees, and annual budget for FY2021.
"""
None of the edge conditions from "leadership" were satisfied: [{'condition': <function setup_decision_tree_graph.<locals>.<lambda> at 0x1387f2020>}]
Traceback (most recent call last):
  File "/Users/rolson2/GitHub/rolson2/elm/elm/ords/extraction/tree.py", line 109, in async_run
    out = await self.async_call_node(node0)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/rolson2/GitHub/rolson2/elm/elm/ords/extraction/tree.py", line 89, in async_cal

[{'director': 'Martin Keller',
  'explanation': 'The text indicates that Martin Keller became the ninth director of the National Renewable Energy Laboratory (NREL) in November 2015 and continues to serve in that capacity.',
  'source': 'https://en.wikipedia.org/wiki/National_Renewable_Energy_Laboratory',
  'relevant_text': '## History\n\n[edit]\n\nMartin Keller became NREL\'s ninth director in November 2015,[10] and currently\nserves as both the director of the laboratory and the president of its\noperating contractor, Alliance for Sustainable Energy, LLC.[11] He succeeded\nDan Arvizu, who retired in September 2015 after 10 years in those roles.[12]\n"Dr. Martin Keller Named Director of National Renewable Energy Laboratory". _National Renewable Energy Laboratory_. Retrieved June 27, 2017.\n\n"Dr. Martin Keller – Laboratory Director". Retrieved January 30, 2017.'},
 {'source': 'https://en.wikipedia.org/wiki/United_States_Department_of_Energy_National_Laboratories',
  'relevant_text': '"

Excellent! Now we have our data! 

All that is left to do is convert the output into a pandas DataFrame (if desired):

In [15]:
import pandas as pd

pd.DataFrame(info_dicts)

| | director | explanation | source | relevant_text |
|---|---|---|---|---|
| 0 | Martin Keller | The text indicates that Martin Keller became t... | https://en.wikipedia.org/wiki/National_Renewab... | ## History\n\n[edit]\n\nMartin Keller became N... |
| 1 | | | https://en.wikipedia.org/wiki/United_States_De... | "National Renewable Energy Laboratory (NREL)\n... |


Now you know how to set up your own custom web scraping and information extraction pipeline!

## Next steps

There are several ways you can build on this demo to get practice:

- Filter outputs to give exactly one answer (either filter Google search results or final output)
- Update the pipeline to accept any national laboratory as input and look up its director
- Extract more than one piece of information at a time (e.g., laboratory location? research focus?)
- Add protection against the non-deterministic nature of the pipeline (i.e., expand the Google search to be as broad as possible, add heuristics to check document content, consider re-running the decision tree if you get a "No" answer from the LLM, or even re-run the end-to-end pipeline if no director name is found)
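
As a starting point for the first item, one simple way to reduce several results to a single answer is a majority vote over the extracted names. This is a sketch under the assumption that the final output is a list of dictionaries like `info_dicts` above, where some entries may lack a `director` key; `most_common_director` is a hypothetical helper and the sample data is abbreviated.

```python
from collections import Counter


def most_common_director(info_dicts):
    """Return the most frequently extracted director name, or None."""
    names = [
        info["director"].strip()
        for info in info_dicts
        if isinstance(info.get("director"), str) and info["director"].strip()
    ]
    if not names:
        return None  # no document yielded a name
    return Counter(names).most_common(1)[0][0]


# Abbreviated stand-ins for the dictionaries returned by the pipeline
sample_info = [
    {"director": "Martin Keller", "source": "https://en.wikipedia.org/..."},
    {"source": "https://en.wikipedia.org/..."},  # no director was found
]
print(most_common_director(sample_info))
```

A vote like this only helps when several sources agree, so it pairs naturally with broadening the search as suggested in the last item.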

By now, you should be equipped with the tools to create your own custom web scraping and information extraction pipeline. Feel free to reference the [existing ordinance extraction methods](https://nrel.github.io/elm/_modules/elm/ords/process.html#process_counties_with_openai) for a more in-depth example.