# CMR Query ReACT Agent

#### Using Langchain Agent Library

This notebook demonstrates how to use the LangChain agent library to query the CMR (Common Metadata Repository) API. It uses the ReACT prompt pattern, in which a language model chains multiple external tools (e.g. geocoding) together to generate a valid CMR query from natural language.

## Install necessary libraries

In [1]:
!pip install langchain opencage utils httpx python-dotenv langchain-openai tiktoken pydantic==1.10.8


## Load the required libraries

In [2]:
import os
import re
import datetime

import dotenv
import httpx
import tiktoken
from langchain.agents import Tool
from langchain.tools import BaseTool
from opencage.geocoder import OpenCageGeocode
from langchain import LLMChain, PromptTemplate
from typing import List, Union
from langchain.agents import (
    AgentExecutor,
    AgentOutputParser,
    LLMSingleActionAgent,
)
from langchain.chat_models import AzureChatOpenAI, ChatOpenAI
from langchain.prompts import BaseChatPromptTemplate
from langchain.schema import AgentAction, AgentFinish, HumanMessage


dotenv.load_dotenv()

%load_ext autoreload
%autoreload 2
import sys
sys.path.append("../code")
import pprint
pprinter = pprint.PrettyPrinter(indent=4, width=120, depth=2)

## ReACT Agent

The ReACT (reasoning and acting) agent pattern combines the reasoning capabilities of large language models (LLMs) with the ability to take actions. The model alternates between reasoning about the task, calling a tool, and observing the result, and repeats this cycle until it has enough information to respond.

The main components of the ReACT agent are:
- Chain of Thought - ReACT Prompt
- Tools for LLMs to use - ReACT Actions
- Helper functions to control and route the Agent's actions - ReACT Controllers
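
As a rough illustration of how these three components fit together, the sketch below runs a ReACT-style loop in plain Python. The `fake_llm` function, the `lookup_bbox` tool, and their canned responses are hypothetical stand-ins (no LLM or geocoder is called); only the Thought/Action/Observation control flow mirrors the pattern described above.

```python
import re
from typing import Callable, Dict

BASE_URL = "https://cmr.earthdata.nasa.gov/search/collections?"


def lookup_bbox(place: str) -> str:
    """Hypothetical tool: returns a canned bounding box instead of geocoding."""
    return "bounding_box[]=94.77,-11.21,141.02,6.27" if place == "Indonesia" else "unknown"


def fake_llm(prompt: str) -> str:
    """Hypothetical LLM stand-in: acts first, answers once an observation exists."""
    if "Observation:" not in prompt:
        return "Thought: I need a bounding box.\nAction: lookup_bbox\nAction Input: Indonesia"
    observation = prompt.rsplit("Observation: ", 1)[1].splitlines()[0]
    return f"Thought: I have everything.\nFinal Answer: {BASE_URL}{observation}"


def react_loop(query: str, tools: Dict[str, Callable[[str], str]], max_steps: int = 5) -> str:
    scratchpad = ""  # accumulated Thought/Action/Observation history
    for _ in range(max_steps):
        output = fake_llm(f"Query: {query}\n{scratchpad}")
        # Controller: stop when the model emits a final answer.
        if "Final Answer:" in output:
            return output.split("Final Answer:")[-1].strip()
        # Controller: otherwise route the requested action to a tool.
        match = re.search(r"Action:(.*?)\nAction Input:(.*)", output, re.DOTALL)
        if match is None:
            raise ValueError(f"Could not parse: {output}")
        tool_name, tool_input = match.group(1).strip(), match.group(2).strip()
        observation = tools[tool_name](tool_input)
        scratchpad += f"{output}\nObservation: {observation}\n"
    return "No answer within step limit"


answer = react_loop("precipitation over Indonesia", {"lookup_bbox": lookup_bbox})
print(answer)
```

The `max_steps` cap is a design choice for the sketch: it keeps a misbehaving model from looping forever.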

## Prompts

In [3]:
#react template
cmr_template = """
You are responsible to provide a CMR Link to the User's Query. 
Use 'keyword=' to build CMR Query. DO NOT USE 'science_keywords[]='
You have access to the following tools to build the CMR Link:
Use this Base URL: "https://cmr.earthdata.nasa.gov/search/collections?"
If the temporal or spatial information is not available, provide the link without it.
{tools}

Use the following format:

Query: the input question you must answer
Thought: you should always think about what to do. If the Query is not a valid CMR query, Provide Final Answer asking the user to provide the correct CMR query. Any other question to the user will need to be given as Final Answer.
Action: the action to take, should be one of [{tool_names}].
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times until you have all the information to build the CMR Link)
Thought: I now have all the information for CMR query
Final Answer: Provide the link. If a link is not available, suggest the user how to modify the query to get the link.

Begin Loop:

Query: {input}
{agent_scratchpad}"""


datetime_template = """
            convert time string: {datetime} into start and end datetime formatted as: 'temporal[]=yyyy-MM-ddTHH:mm:ssZ,yyyy-MM-ddTHH:mm:ssZ'
            """

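The `temporal[]` string that `datetime_template` asks the model to produce can also be built deterministically, which is handy for checking the model's answer. The sketch below shows the target format for a "past N years" range. The helper names `temporal_param` and `past_years`, and the choice of a fixed reference date, are assumptions for illustration and are not part of the notebook.

```python
import datetime


def temporal_param(start: datetime.datetime, end: datetime.datetime) -> str:
    """Format a start/end pair as 'temporal[]=yyyy-MM-ddTHH:mm:ssZ,yyyy-MM-ddTHH:mm:ssZ'."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return f"temporal[]={start.strftime(fmt)},{end.strftime(fmt)}"


def past_years(reference: datetime.datetime, years: int) -> str:
    """Hypothetical helper: range covering the `years` years before `reference`.

    Note: replace(year=...) raises ValueError for Feb 29 in a non-leap target year.
    """
    start = reference.replace(year=reference.year - years)
    return temporal_param(start, reference)


example = past_years(datetime.datetime(2024, 3, 1, 0, 0, 0), 3)
print(example)
```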

## Tools
Tools are external functions that can be used by the agent to perform specific tasks. 

External tools used by the CMR agent:

- Datetime identifier and formatting
- Keyword extraction
- Geo-location extraction
- Bounding box formatting for geo-location
- CMR API formatting
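
For example, the bounding box formatting step amounts to reordering a geocoder's bounds into `min lon, min lat, max lon, max lat` order. The sketch below assumes a bounds dictionary with `southwest`/`northeast` keys, as used by the tool code in this notebook. The function name `bounds_to_cmr_bbox` is hypothetical, and the sample coordinates are the Indonesia values that appear in the agent output later in the notebook.

```python
from typing import Dict


def bounds_to_cmr_bbox(bounds: Dict[str, Dict[str, float]]) -> str:
    """Convert southwest/northeast corner bounds into a CMR bounding_box[] parameter."""
    sw, ne = bounds["southwest"], bounds["northeast"]
    return f"bounding_box[]={sw['lng']},{sw['lat']},{ne['lng']},{ne['lat']}"


sample_bounds = {
    "southwest": {"lng": 94.7717124, "lat": -11.2085669},
    "northeast": {"lng": 141.0194444, "lat": 6.2744496},
}
bbox = bounds_to_cmr_bbox(sample_bounds)
print(bbox)
```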

In [4]:
opencage_geocoder = OpenCageGeocode(os.environ["OPENCAGE_API_KEY"])

class DatetimeChain(LLMChain):
    """Find datetime for a given time string or a range of time strings. e.g (between 2010 and 2020)"""

    def __init__(self, *args, **kwargs):
        today = datetime.date.today()
        today_string = (
            f"Assume the current year and month is {today.year} and {today.month}."
        )
        # Separate the instruction from the date hint with a space
        template = datetime_template.strip() + " " + today_string
        prompt = PromptTemplate(
            template=template,
            input_variables=["datetime"],
        )
        # super().__init__(prompt=prompt, llm=OpenAI(temperature=0), *args, **kwargs)
        super().__init__(prompt=prompt, llm=AzureChatOpenAI(temperature=0), *args, **kwargs)

    def _run(self, timestring: str) -> str:
        """Find datetime for a given time string"""
        return self.predict(datetime=timestring)

    async def _arun(self, timestring: str) -> str:
        """asynchronous call to find datetime for a given time string"""
        return await self.apredict(datetime=timestring)


class BoundingBoxFinderTool(BaseTool):
    name = "bounding_box_finder"
    description = "useful to find bounding box in min Longitude, min Latitude, max Longitude, max Latitude format for a given location, region, or a landmark. The output is formatted for CMR API."
    geocoder = opencage_geocoder

    def _run(self, tool_input: str) -> str:
        """Geocode a query (location, region, landmark)"""
        response = self.geocoder.geocode(tool_input, no_annotations="1")
        if response:
            bounds = response[0]["bounds"]
            # convert to bbox
            bbox = "{},{},{},{}".format(
                bounds["southwest"]["lng"],
                bounds["southwest"]["lat"],
                bounds["northeast"]["lng"],
                bounds["northeast"]["lat"],
            )
            return f"bounding_box[]={bbox}"
        return "Cannot parse the query"

    async def _arun(self, tool_input: str) -> str:
        """asynchronous call to Geocode a query"""
        # The geocoder call used here is synchronous, so reuse the sync path
        # instead of opening an unused async HTTP client.
        return self._run(tool_input)


def geocode(text: str) -> str:
    """Geocode a query (location, region, or landmark)"""
    response = opencage_geocoder.geocode(text, no_annotations="1")
    if response:
        bounds = response[0]["bounds"]
        # convert to bbox
        bbox = "{},{},{},{}".format(
            bounds["southwest"]["lng"],
            bounds["southwest"]["lat"],
            bounds["northeast"]["lng"],
            bounds["northeast"]["lat"],
        )
        return f"bounding_box[]={bbox}"
    # Match the tool's behaviour when the geocoder returns no result
    return "Cannot parse the query"


class CMRQueryTool(BaseTool):
    name = "cmr_query_api"
    description = "useful for Querying CMR API based on previous Observations. input is query parameters string"
    base_url = "https://cmr.earthdata.nasa.gov/search/collections?"

    def _run(self, tool_input: str) -> str:
        """Build the full CMR URL from a query parameter string"""
        # Strip the base URL if the model already included it, then prepend it once
        if self.base_url in tool_input:
            tool_input = tool_input.replace(self.base_url, "")
        return self.base_url + tool_input

    async def _arun(self, tool_input: str) -> str:
        """asynchronous call to build the full CMR URL"""
        return self._run(tool_input)

def num_tokens_from_string(string: str, model_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.encoding_for_model(model_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens


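Note: this cell raises `KeyError: 'OPENCAGE_API_KEY'` when the OpenCage API key is not set in the environment (for example via the `.env` file loaded by `dotenv.load_dotenv()`).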

## Langchain

Langchain is a library that provides tools for interacting with language models, such as GPT-3, and for building agents that reason about a task and call external tools to complete it. Below is a simple example of how to use the library to implement a ReACT agent.

Components used:
- CustomPromptTemplate - ReACT Prompt class
- CustomOutputParser - ReACT Output controller class for routing LLM actions
- CMRQueryAgent - ReACT Agent class
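
To make the routing role of the output parser concrete, the sketch below applies the same regular expression used in `CustomOutputParser` to sample model responses, without any LangChain dependency. The `route` function and its tuple return value are hypothetical simplifications of the `AgentAction`/`AgentFinish` objects.

```python
import re
from typing import Tuple

ACTION_REGEX = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"


def route(llm_output: str) -> Tuple[str, str]:
    """Return ("finish", answer) or (tool_name, tool_input)."""
    # A final answer ends the loop.
    if "Final Answer:" in llm_output:
        return "finish", llm_output.split("Final Answer:")[-1].strip()
    # Otherwise extract the tool name and its input.
    match = re.search(ACTION_REGEX, llm_output, re.DOTALL)
    if not match:
        raise ValueError(f"Could not parse LLM output: `{llm_output}`")
    return match.group(1).strip(), match.group(2).strip(" ").strip('"')


step = route(
    "Thought: I need a bounding box.\nAction: bounding_box_finder\nAction Input: Indonesia"
)
done = route(
    "Thought: I now have all the information\n"
    "Final Answer: https://cmr.earthdata.nasa.gov/search/collections?keyword=precipitation"
)
print(step)
print(done)
```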

In [None]:
class CustomPromptTemplate(BaseChatPromptTemplate):
    """
    This is a custom prompt template that uses the `cmr_template` defined above
    """

    template: str
    tools: List[Tool]

    def format_messages(self, **kwargs) -> List[HumanMessage]:
        # Get the intermediate steps (AgentAction, Observation tuples)
        # Format them in a particular way
        intermediate_steps = kwargs.pop("intermediate_steps")
        thoughts = ""
        for action, observation in intermediate_steps:
            thoughts += action.log
            thoughts += f"\nObservation: {observation}\nThought: "
        kwargs["agent_scratchpad"] = thoughts
        # Create a tools variable from the list of tools provided
        kwargs["tools"] = "\n".join(
            [f"{tool.name}: {tool.description}" for tool in self.tools]
        )
        # Create a list of tool names for the tools provided
        kwargs["tool_names"] = ", ".join([tool.name for tool in self.tools])
        formatted = self.template.format(**kwargs)
        return [HumanMessage(content=formatted)]


class CustomOutputParser(AgentOutputParser):
    """
    This is a custom output parser that parses the output of the LLM agent
    """

    def parse(self, llm_output: str) -> Union[AgentAction, AgentFinish]:
        # Check if agent should finish
        if "Final Answer:" in llm_output:
            return AgentFinish(
                return_values={"output": llm_output.split("Final Answer:")[-1].strip()},
                log=llm_output,
            )
        
        regex = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"
        match = re.search(regex, llm_output, re.DOTALL)
        if not match:
            raise ValueError(f"Could not parse LLM output: `{llm_output}`")
        action = match.group(1).strip()
        action_input = match.group(2)
        # Return the action and action input
        return AgentAction(
            tool=action, tool_input=action_input.strip(" ").strip('"'), log=llm_output
        )


class CMRQueryAgent:
    """
    This is a custom agent that uses the `CustomPromptTemplate` and `CustomOutputParser`
    """

    def __init__(self):
        self.create_tools()
        self.tool_names = [tool.name for tool in self.tools]
        self.prompt = CustomPromptTemplate(
            template=cmr_template,
            tools=self.tools,
            # This omits the `agent_scratchpad`, `tools`, and `tool_names` variables because those are generated dynamically
            # This includes the `intermediate_steps` variable because that is needed
            input_variables=["input", "intermediate_steps"],
        )
        self.output_parser = CustomOutputParser()
        # openai.api_type = "openai"
        # self.llm_chain = LLMChain(
        #     llm=ChatOpenAI(model_name="gpt-4-1106-preview", temperature=0),
        #     prompt=self.prompt,
        # )
        self.llm_chain = LLMChain(
            llm=AzureChatOpenAI(model_name="gpt-4-1106-preview", temperature=0),
            prompt=self.prompt,
        )
        self.create_agent()

    def create_tools(self):
        """create tools for the agent"""
        # Instantiate the bounding box tool once and reuse it
        bbox_tool = BoundingBoxFinderTool()
        self.tools = [
            Tool(
                name=bbox_tool.name,
                description=bbox_tool.description,
                func=bbox_tool.run,
            ),
            Tool(
                name="DateTime Extractor",
                description="Extracts time string and converts it to a datetime format",
                func=DatetimeChain().run,
            ),
        ]

    def create_agent(self):
        self.agent = LLMSingleActionAgent(
            llm_chain=self.llm_chain,
            output_parser=self.output_parser,
            stop=["\nObservation:"],
            allowed_tools=self.tool_names,
        )
        self.agent_executor = AgentExecutor.from_agent_and_tools(
            agent=self.agent, tools=self.tools, verbose=True
        )

    def run_query(self, input_text: str):
        return self.agent_executor.run(input_text)

query_agent = CMRQueryAgent()

In [None]:
query = "I want high resolution precipitation data for the past three years over Indonesia"
print(
    query_agent.run_query(query)
)




> Entering new AgentExecutor chain...
Thought: I need to find the bounding box for Indonesia and convert the past three years into a datetime range for the CMR query.
Action: bounding_box_finder
Action Input: Indonesia

Observation: bounding_box[]=94.7717124,-11.2085669,141.0194444,6.2744496
I now have the bounding box for Indonesia. Next, I need to convert "the past three years" into a datetime range.
Action: DateTime Extractor
Action Input: the past three years

Observation: Given the current year and month as March 2024, and the request to convert "the past three years" into a start and end datetime format, we first identify the exact dates.

- End date: Since we're considering "the past three years" up to March 2024, the end date would be the last moment of February 2024, to include the full span of three years back from March 2024.
- Start date: To cover the past three years, we go back to March 2021.

Therefo

In [None]:
query = "What is osmosis?"
print(
    query_agent.run_query(query)
)



> Entering new AgentExecutor chain...
Thought: This is not a valid CMR query. CMR (Collection Metadata Repository) queries are related to Earth science data collections, not general knowledge or definitions.

Final Answer: Please provide a specific Earth science data collection query for CMR. For example, you can ask for satellite data related to "sea surface temperature" or "air quality measurements".

> Finished chain.
Please provide a specific Earth science data collection query for CMR. For example, you can ask for satellite data related to "sea surface temperature" or "air quality measurements".


## Deployed Apps

The chain is also deployed on AWS. The cell below sends the same natural-language query to the deployed endpoint over HTTP.
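
Because the query is free text containing spaces, it can help to URL-encode it explicitly before appending it to the endpoint. The sketch below only builds the request URL and does not send it. `https://example.invalid/` is a placeholder endpoint, and the `query` parameter name follows the request cell in this section.

```python
from urllib.parse import urlencode

endpoint = "https://example.invalid/"  # placeholder, not the deployed URL
query = "I want high resolution precipitation data for the past three years over Indonesia"

# urlencode escapes spaces and other reserved characters in the query text
request_url = f"{endpoint}?{urlencode({'query': query})}"
print(request_url)
```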

In [6]:
import requests

query = "I want high resolution precipitation data for the past three years over Indonesia"
url = "https://xp775tpyna5dvvsn4q3jkl5fju0hsiup.lambda-url.us-west-2.on.aws/?query="

response = requests.get(f"{url}{query}")
print(response.text)

"Here is the CMR link for high-resolution precipitation data for the past three years over Indonesia:\n\n[https://cmr.earthdata.nasa.gov/search/collections?bounding_box[]=94.7717124,-11.2085669,141.0194444,6.2744496&temporal[]=2021-01-01T00:00:00Z,2024-03-31T23:59:59Z&keyword=precipitation](https://cmr.earthdata.nasa.gov/search/collections?bounding_box[]=94.7717124,-11.2085669,141.0194444,6.2744496&temporal[]=2021-01-01T00:00:00Z,2024-03-31T23:59:59Z&keyword=precipitation)"

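The returned link can be checked by splitting it back into its query parameters. The sketch below does this with the standard library for the URL shown in the output above. It makes no request to CMR.

```python
from urllib.parse import parse_qs, urlsplit

link = (
    "https://cmr.earthdata.nasa.gov/search/collections?"
    "bounding_box[]=94.7717124,-11.2085669,141.0194444,6.2744496"
    "&temporal[]=2021-01-01T00:00:00Z,2024-03-31T23:59:59Z"
    "&keyword=precipitation"
)

# parse_qs maps each parameter name to a list of its values
params = parse_qs(urlsplit(link).query)
for name, values in params.items():
    print(name, "=", values[0])
```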