# CMR Query ReACT Agent

#### Using the LangChain Agent Library

In [1]:
# !pip install langchain opencage utils httpx
%load_ext autoreload
%autoreload 2
import sys
sys.path.append("../code")
import pprint
pprinter = pprint.PrettyPrinter(indent=4, width=120, depth=2)

## ReACT Agent

The ReACT (Reasoning and Acting) agent model is a framework that couples the reasoning of large language models (LLMs) with the ability to act. The model reasons about a query, takes an action by calling a tool, observes the result, and repeats this cycle until it has enough information to respond.

The main components of the ReACT agent are:
- Chain of Thought - ReACT Prompt
- Tools for LLMs to use - ReACT Actions
- Helper functions to control and route the Agent's actions - ReACT Controllers
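
To make the three components concrete, here is a minimal, library-free sketch of the Thought/Action/Observation loop. It is an illustration only, not the agent built later in this notebook: `make_scripted_llm`, `react_loop`, and the `upper` tool are hypothetical names, and the "LLM" simply replays canned replies.

```python
import re
from typing import Callable, Dict, List

# Simplified pattern for "Action: <tool>" followed by "Action Input: <input>"
ACTION_RE = re.compile(r"Action:\s*(.*?)\nAction Input:\s*(.*)", re.DOTALL)


def make_scripted_llm(replies: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: replays scripted replies in order."""
    remaining = iter(replies)
    return lambda prompt: next(remaining)


def react_loop(
    query: str,
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 5,
) -> str:
    """Repeat Thought/Action/Observation until the model gives a Final Answer."""
    scratchpad = ""
    for _ in range(max_steps):
        # ReACT Prompt: the query plus everything observed so far
        reply = llm(f"Query: {query}\n{scratchpad}")
        # ReACT Controller: stop when the model declares a final answer
        if "Final Answer:" in reply:
            return reply.split("Final Answer:")[-1].strip()
        match = ACTION_RE.search(reply)
        if match is None:
            raise ValueError(f"Could not parse LLM output: {reply!r}")
        tool_name = match.group(1).strip()
        tool_input = match.group(2).strip()
        # ReACT Action: run the chosen tool and record what it returned
        observation = tools[tool_name](tool_input)
        scratchpad += f"{reply}\nObservation: {observation}\nThought: "
    raise RuntimeError("No final answer within max_steps")


tools = {"upper": lambda text: text.upper()}
llm = make_scripted_llm(
    [
        "Thought: I should call the tool.\nAction: upper\nAction Input: rain",
        "I now have all the information.\nFinal Answer: RAIN",
    ]
)
answer = react_loop("shout rain", llm, tools)
print(answer)
```

The scratchpad grows with each step, so the model sees its earlier actions and observations on the next call.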

## Prompts

In [2]:
#react template
cmr_template = """
You are responsible for providing a CMR link for the user's query.
Use 'keyword=' to build CMR Query. DO NOT USE 'science_keywords[]='
Use this Base URL: "https://cmr.earthdata.nasa.gov/search/collections?"
If the temporal or spatial information is not available, provide the link without it.
You have access to the following tools to build the CMR Link:
{tools}

Use the following format:

Query: the input question you must answer
Thought: you should always think about what to do. If the Query is not a valid CMR query, Provide Final Answer asking the user to provide the correct CMR query. Any other question to the user will need to be given as Final Answer.
Action: the action to take, should be one of [{tool_names}].
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times until you have all the information to build the CMR Link)
Thought: I now have all the information for CMR query
Final Answer: Provide the link. If a link is not available, suggest the user how to modify the query to get the link.

Begin Loop:

Query: {input}
{agent_scratchpad}"""


datetime_template = """
            convert time string: {datetime} into start and end datetime formatted as: 'temporal[]=yyyy-MM-ddTHH:mm:ssZ,yyyy-MM-ddTHH:mm:ssZ'
            """


## Tools

External Tools used:

- Datetime identifier and formatting
- Keyword extraction
- Geo-location extraction
- Bounding Box formatting for geo-location
- CMR API formatting
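
As a rough sketch of how the outputs of these tools fit together, the snippet below pulls a `temporal[]` parameter out of a free-text reply and joins it with a keyword and a bounding box onto the base URL. The helper names `extract_temporal` and `build_cmr_link` are hypothetical and are not part of this notebook or of any library. The parameter names (`keyword=`, `temporal[]=`, `bounding_box[]=`) and the base URL are the ones used elsewhere in this notebook.

```python
import re
from typing import Optional
from urllib.parse import quote

BASE_URL = "https://cmr.earthdata.nasa.gov/search/collections?"
# Matches timestamps formatted as yyyy-MM-ddTHH:mm:ssZ
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def extract_temporal(llm_reply: str) -> Optional[str]:
    """Return 'temporal[]=start,end' built from the first two timestamps found."""
    stamps = TIMESTAMP.findall(llm_reply)
    if len(stamps) < 2:
        return None
    return f"temporal[]={stamps[0]},{stamps[1]}"


def build_cmr_link(
    keyword: str,
    temporal: Optional[str] = None,
    bounding_box: Optional[str] = None,
) -> str:
    """Join already-formatted parameters onto the base URL, skipping missing ones."""
    parts = [f"keyword={quote(keyword)}"]
    parts += [param for param in (temporal, bounding_box) if param]
    return BASE_URL + "&".join(parts)


# A reply that wraps the timestamps in extra text
reply = "start = '2014-04-01T00:00:00Z'\nend = '2024-03-31T23:59:59Z'"
temporal = extract_temporal(reply)
link = build_cmr_link(
    "precipitation",
    temporal=temporal,
    bounding_box="bounding_box[]=94.7717124,-11.2085669,141.0194444,6.2744496",
)
print(link)
```

When no timestamps or bounding box are available, the corresponding parameter is left out, which mirrors the prompt's instruction to provide the link without missing temporal or spatial information.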

In [3]:
import datetime
import json
import os
import re

import dotenv
import httpx
import numpy as np
import openai
import tiktoken
import utils
from langchain import LLMChain, PromptTemplate
from langchain.agents import Tool, tool
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.tools import BaseTool
from langchain_openai import ChatOpenAI
from opencage.geocoder import OpenCageGeocode

# Load environment variables (e.g. OPENAI_API_KEY, OPENCAGE_API_KEY) from a .env file, if present
dotenv.load_dotenv()

opencage_geocoder = OpenCageGeocode(os.environ["OPENCAGE_API_KEY"])

class DatetimeChain(LLMChain):
    """Find datetime for a given time string or a range of time strings. e.g (between 2010 and 2020)"""

    def __init__(self, *args, **kwargs):
        today = datetime.date.today()
        today_string = (
            f"Assume the current year and month is {today.year} and {today.month}."
        )
        # Separate the template from the date hint so the two sentences do not run together
        template = datetime_template.strip() + " " + today_string
        prompt = PromptTemplate(
            template=template,
            input_variables=["datetime"],
        )
        super().__init__(prompt=prompt, llm=OpenAI(temperature=0), *args, **kwargs)

    def _run(self, timestring: str) -> str:
        """Find datetime for a given time string"""
        return self.predict(datetime=timestring)

    async def _arun(self, timestring: str) -> str:
        """asynchronous call to find datetime for a given time string"""
        return await self.apredict(datetime=timestring)


class KeywordChain(LLMChain):
    """Find science keyword to search CMR in a given input"""
    description: str = "useful for extracting text science keywords from the user query"

    def __init__(self, *args, **kwargs):
        # Plain string, not an f-string: PromptTemplate fills in {query} at run time
        template = """You are a NASA CMR keyword extractor bot, your task is to extract the science keywords from the following user query: {query}.
        Provide ONLY the earth science keywords such as phenomena and observables as the output. Do not provide dates and locations. Do not provide generic keywords such as 'data' and 'information'."""

        prompt = PromptTemplate(
            template=template,
            input_variables=["query"],
        )
        super().__init__(prompt=prompt, llm=ChatOpenAI(model="gpt-4-0613", temperature=0), *args, **kwargs)

    def _run(self, query: str) -> str:
        """Extract science keywords from the user query"""
        return self.predict(query=query)

    async def _arun(self, query: str) -> str:
        """asynchronous call to extract science keywords from the user query"""
        return await self.apredict(query=query)

class CMRLinkCheckChain(LLMChain):
    """Check the correctness of a CMR link - and provide a correct link"""
    description: str = "useful for checking the correctness of a CMR link - and provide a correct link"

    def __init__(self, *args, **kwargs):
        # Plain string, not an f-string: PromptTemplate fills in {link} at run time
        template = "You are a CMR link checker bot, your task is to check the correctness of the following CMR Link: {link}. Finally, provide the correct CMR link if the provided link is not correct. If the link is correct, provide the link as the final answer."

        prompt = PromptTemplate(
            template=template,
            input_variables=["link"],
        )
        super().__init__(prompt=prompt, llm=OpenAI(temperature=0), *args, **kwargs)

    def _run(self, query: str) -> str:
        """Check a CMR link and return a corrected link if needed"""
        return self.predict(link=query)

    async def _arun(self, query: str) -> str:
        """asynchronous call to check a CMR link"""
        return await self.apredict(link=query)

class GPTEmbedder:
    """Embedder for keywords using GPT-3 embeddings"""

    def __init__(
        self,
        embeddings_file="/Users/mramasub/work/cmr-prompt-chain/data/keyword_embeddings.npy",
    ):
        openai.api_key = os.getenv("OPENAI_API_KEY")
        self.embedder = OpenAIEmbeddings(openai_api_key=openai.api_key)
        self.text_to_add = "Hierarchy Path: "
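        keywords_file = "../data/keywords.json"
        if not os.path.exists(keywords_file):
            # Fail early: without keywords, `self.kws` would be undefined below
            raise FileNotFoundError(f"Keywords file not found: {keywords_file}")
        self.kws = self.read_kws(keywords_file)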
        self.model = "text-embedding-ada-002"
        self.embeddings = (
            self.load_from_pkl(embeddings_file)
            if os.path.exists(embeddings_file)
            else self.create_embeddings()
        )

        assert len(self.kws) == len(self.embeddings)

    def create_embeddings(self):
        """Create embeddings for all keywords"""
        kws_to_embed = self.kws
        embeddings = self._embed_langchain(kws_to_embed)
        return np.array(embeddings)

    def _embed_langchain(self, texts, use_text_to_add=True):
        text_to_add = ""
        if use_text_to_add:
            text_to_add = self.text_to_add
        return self.embedder.embed_documents([text_to_add + text for text in texts])

    def _embed(self, texts, use_text_to_add=True):
        text_to_add = ""
        if use_text_to_add:
            text_to_add = self.text_to_add
        return openai.Embedding.create(
            input=[text_to_add + kw for kw in texts], model=self.model
        )

    def find_nearest_kw(self, keyword, top_n=1):
        """Find the nearest keyword to the given keyword"""
        embedding = self._embed_langchain([keyword], use_text_to_add=False)
        embedding = np.array(embedding)
        distances = np.linalg.norm(self.embeddings - embedding, axis=1)

        return [self.kws[i] for i in np.argsort(distances)[:top_n]]

    def read_kws(
        self,
        file,
    ):
        """Read keywords from file"""
        kws = []

        with open(file, "r", encoding="utf-8") as f:
            kws = json.load(f)
        return [kw["keyword_string"] for kw in kws]

    def load_from_pkl(self, file):
        """Load embeddings from a NumPy .npy file (despite the method name, not a pickle)"""
        return np.load(file)

class BoundingBoxFinderTool(BaseTool):
    name = "bounding_box_finder"
    description = "useful to find bounding box in min Longitude, min Latitude, max Longitude, max Latitude format for a given location, region, or a landmark. The output is formatted for CMR API."
    geocoder = opencage_geocoder

    def _run(self, tool_input: str) -> str:
        """Geocode a query (location, region, landmark)"""
        response = self.geocoder.geocode(tool_input, no_annotations="1")
        if response:
            bounds = response[0]["bounds"]
            # convert to bbox
            bbox = "{},{},{},{}".format(
                bounds["southwest"]["lng"],
                bounds["southwest"]["lat"],
                bounds["northeast"]["lng"],
                bounds["northeast"]["lat"],
            )
            return f"bounding_box[]={bbox}"
        return "Cannot parse the query"

    async def _arun(self, tool_input: str) -> str:
        """asynchronous call to Geocode a query"""
        # The OpenCage client call is synchronous, so reuse `_run` rather than
        # opening an unused httpx client and duplicating the formatting logic.
        return self._run(tool_input)


def geocode(text: str) -> str:
    """Geocode a query (location, region, or landmark)"""
    response = opencage_geocoder.geocode(text, no_annotations="1")
    if response:
        bounds = response[0]["bounds"]
        # convert to bbox
        bbox = "{},{},{},{}".format(
            bounds["southwest"]["lng"],
            bounds["southwest"]["lat"],
            bounds["northeast"]["lng"],
            bounds["northeast"]["lat"],
        )
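        return f"bounding_box[]={bbox}"
    return "Cannot parse the query"


@tool
async def calculate(expression):
    """Calculate an expression"""
    # WARNING: eval() executes arbitrary code; only pass trusted input.
    return eval(expression)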


class CMRQueryTool(BaseTool):
    name = "cmr_query_api"
    description = "useful for Querying CMR API based on previous Observations. input is query parameters string"
    base_url = "https://cmr.earthdata.nasa.gov/search/collections?"

    def _run(self, tool_input: str) -> str:
        """Build a full CMR link from a query parameters string"""
        # Drop the base URL if the agent already included it, then prepend it once
        if self.base_url in tool_input:
            tool_input = tool_input.replace(self.base_url, "")
        return self.base_url + tool_input

    async def _arun(self, tool_input: str) -> str:
        """asynchronous call to build a CMR link"""
        return self._run(tool_input)

# gpt_embedder = GPTEmbedder(embeddings_file="../data/keyword_embeddings.npy")

# class GCMDKeywordSearchTool(BaseTool):
#     """Search for science keyword in GCMD science keyword database. only earth science keywords are allowed, no other keywords are allowed"""
#     name = "gcmd_keyword_search"
#     description = "useful to search for earth science keyword in GCMD database. only earth science keywords and phenomena are allowed as inputs, no other keywords are allowed. The output is formatted for CMR API."

#     def _run(self, tool_input: str) -> str:
#         """Search for a keyword in GCMD"""
    
#         return self.get_formatted_science_kws(tool_input)
    
#     @staticmethod
#     def get_formatted_science_kws(tool_input: str, top_n=5) -> str:
#         """Search for a keyword in GCMD"""
#         keywords = [
#             GCMDKeywordSearchTool().cmr_science_keyword(kw, keyword_pos)
#             for keyword_pos, kw in enumerate(
#                 gpt_embedder.find_nearest_kw(tool_input, top_n=top_n)
#             )
#         ]
#         if isinstance(keywords, list):
#             return keywords[0]
#         return keywords

#     @staticmethod
#     def get_science_kws(tool_input: str, top_n=5) -> str:
#         """Search for a keyword in GCMD"""
#         return [kw for kw in gpt_embedder.find_nearest_kw(tool_input, top_n=top_n)]

#     async def _arun(self, tool_input: str) -> str:
#         """Search for a keyword in GCMD"""
#         return self.get_formatted_science_kws(tool_input)

#     @staticmethod
#     def cmr_science_keyword(keyword_string, keyword_pos):
#         level_list = [
#             "category",
#             "topic",
#             "term",
#             "variable-level-1",
#             "variable-level-2",
#             "variable-level-3",
#             "detailed-variable",
#         ]
#         keyword_list = [key.strip() for key in keyword_string.split(">")]

#         return "&".join(
#             [
#                 rf"science_keywords[{keyword_pos}][{level_list[i]}]={keyword_list[i].replace(' ', '%20')}"
#                 for i in range(len(keyword_list))
#             ]
#         )

def num_tokens_from_string(string: str, model_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.encoding_for_model(model_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens


In [4]:
# ext = KeywordChain()
# ext.predict(query="I want to find the data for the temperature of the earth")

date_time = DatetimeChain()
date_time.predict(datetime="last 10 years")





"\n\nstart = '2014-04-01T00:00:00Z'\nend = '2024-03-31T23:59:59Z'\n\ntemporal[] = '2014-04-01T00:00:00Z,2024-03-31T23:59:59Z'"

## LangChain

LangChain is a library for working with language models such as GPT-3 and for building agents that reason about a query, call tools, and return a response. Below is a simple example of how to use the library to implement a ReACT agent.

Components used:
- CustomPromptTemplate - ReACT Prompt class
- CustomOutputParser - ReACT Output controller class for routing LLM actions
- CMRQueryAgent - ReACT Agent class
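
The routing done by the output controller can be tried out without LangChain. The sketch below reuses the regular expression from `CustomOutputParser` but returns plain named tuples. `Action`, `Finish`, and `route` are hypothetical stand-ins for illustration, not LangChain's `AgentAction` and `AgentFinish` classes.

```python
import re
from typing import NamedTuple, Union


class Action(NamedTuple):
    """Hypothetical stand-in for an agent action (tool call)."""
    tool: str
    tool_input: str


class Finish(NamedTuple):
    """Hypothetical stand-in for the agent's final answer."""
    output: str


# Same pattern as in CustomOutputParser: tolerates optional step numbers
ACTION_REGEX = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"


def route(llm_output: str) -> Union[Action, Finish]:
    """Decide whether the LLM output is a final answer or a tool call."""
    if "Final Answer:" in llm_output:
        return Finish(llm_output.split("Final Answer:")[-1].strip())
    match = re.search(ACTION_REGEX, llm_output, re.DOTALL)
    if not match:
        raise ValueError(f"Could not parse LLM output: `{llm_output}`")
    tool = match.group(1).strip()
    tool_input = match.group(2).strip(" ").strip('"')
    return Action(tool, tool_input)


step = route(
    "Thought: I need the bounding box.\n\n"
    "Action: bounding_box_finder\nAction Input: Indonesia"
)
done = route(
    "Thought: I now have all the information for CMR query\n"
    "Final Answer: https://cmr.earthdata.nasa.gov/search/collections?keyword=precipitation"
)
print(step)
print(done)
```

Checking for `Final Answer:` first means a reply that contains both an action and a final answer is treated as finished.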

In [5]:
import re
from typing import List, Union

from langchain import LLMChain, OpenAI
from langchain.agents import (
    AgentExecutor,
    AgentOutputParser,
    LLMSingleActionAgent,
    Tool,
)
from langchain_openai import ChatOpenAI
from langchain.prompts import BaseChatPromptTemplate
from langchain.schema import AgentAction, AgentFinish, HumanMessage

class CustomPromptTemplate(BaseChatPromptTemplate):
    """
    This is a custom prompt template that fills in the `cmr_template` defined in the Prompts section
    """

    template: str
    tools: List[Tool]

    def format_messages(self, **kwargs) -> List[HumanMessage]:
        # Get the intermediate steps (AgentAction, Observation tuples)
        # Format them in a particular way
        intermediate_steps = kwargs.pop("intermediate_steps")
        thoughts = ""
        for action, observation in intermediate_steps:
            thoughts += action.log
            thoughts += f"\nObservation: {observation}\nThought: "
        kwargs["agent_scratchpad"] = thoughts
        # Create a tools variable from the list of tools provided
        kwargs["tools"] = "\n".join(
            [f"{tool.name}: {tool.description}" for tool in self.tools]
        )
        # Create a list of tool names for the tools provided
        kwargs["tool_names"] = ", ".join([tool.name for tool in self.tools])
        formatted = self.template.format(**kwargs)
        return [HumanMessage(content=formatted)]


class CustomOutputParser(AgentOutputParser):
    """
    This is a custom output parser that parses the output of the LLM agent
    """

    def parse(self, llm_output: str) -> Union[AgentAction, AgentFinish]:
        # Check if agent should finish
        if "Final Answer:" in llm_output:
            return AgentFinish(
                return_values={"output": llm_output.split("Final Answer:")[-1].strip()},
                log=llm_output,
            )
        
        regex = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"
        match = re.search(regex, llm_output, re.DOTALL)
        if not match:
            raise ValueError(f"Could not parse LLM output: `{llm_output}`")
        action = match.group(1).strip()
        action_input = match.group(2)
        # Return the action and action input
        return AgentAction(
            tool=action, tool_input=action_input.strip(" ").strip('"'), log=llm_output
        )


class CMRQueryAgent:
    """
    This is a custom agent that uses the `CustomPromptTemplate` and `CustomOutputParser`
    """

    def __init__(self):
        self.create_tools()
        self.tool_names = [tool.name for tool in self.tools]
        self.prompt = CustomPromptTemplate(
            template=cmr_template,
            tools=self.tools,
            # This omits the `agent_scratchpad`, `tools`, and `tool_names` variables because those are generated dynamically
            # This includes the `intermediate_steps` variable because that is needed
            input_variables=["input", "intermediate_steps"],
        )
        self.output_parser = CustomOutputParser()
        self.llm_chain = LLMChain(
            llm=ChatOpenAI(model_name="gpt-4-1106-preview", temperature=0),
            prompt=self.prompt,
        )
        self.create_agent()

    def create_tools(self):
        """create tools for the agent"""
        # Instantiate the bounding box tool once and reuse it
        bbox_tool = BoundingBoxFinderTool()
        self.tools = [
            Tool(
                name=bbox_tool.name,
                description=bbox_tool.description,
                func=bbox_tool.run,
            ),
            Tool(
                name="DateTime Extractor",
                description="Extracts time string and converts it to a datetime format",
                func=DatetimeChain().run,
            ),
            # Tool(
            #     name="Keyword Extractor",
            #     description="Find science keyword to search CMR in a given input",
            #     func=KeywordChain().run,
            # ),
            # Tool(
            #     name="Link Checker",
            #     description=CMRLinkCheckChain().description,
            #     func=CMRLinkCheckChain().run,
            # ),
            # Tool(
            #     name=GCMDKeywordSearchTool().name,
            #     description=GCMDKeywordSearchTool().description,
            #     func=GCMDKeywordSearchTool().run,
            # ),
            # Tool(
            #     description=CMRQueryTool().description,
            #     func=CMRQueryTool().run,
            # ),
        ]

    def create_agent(self):
        self.agent = LLMSingleActionAgent(
            llm_chain=self.llm_chain,
            output_parser=self.output_parser,
            stop=["\nObservation:"],
            allowed_tools=self.tool_names,
        )
        self.agent_executor = AgentExecutor.from_agent_and_tools(
            agent=self.agent, tools=self.tools, verbose=True
        )

    def run_query(self, input_text: str):
        return self.agent_executor.run(input_text)

query_agent = CMRQueryAgent()



In [9]:
query = "I want high resolution precipitation data for the past three years over Indonesia"
print(
    query_agent.run_query(query)
)




> Entering new AgentExecutor chain...
Thought: The user is asking for high-resolution precipitation data over Indonesia for the past three years. I need to find the bounding box for Indonesia and the date range for the past three years.

Action: bounding_box_finder
Action Input: Indonesia

Observation: bounding_box[]=94.7717124,-11.2085669,141.0194444,6.2744496
I have the bounding box for Indonesia. Now I need to determine the date range for the past three years from the current year, which is 2023.

Action: DateTime Extractor
Action Input: three years ago from 2023

Observation:

Start datetime: 2020-01-01T00:00:00Z
End datetime: 2020-12-31T23:59:59Z
The date range provided is only for the year 2020, which does not cover the past three years. I need to adjust the date range to include 2020, 2021, and 2022.

Action: DateTime Extractor
Action Input: three years ago from 2023

Observation:

In [14]:
query = "What is osmosis?"
print(
    query_agent.run_query(query)
)



> Entering new AgentExecutor chain...
Thought: The query "What is osmosis?" is not a valid CMR query. It is a general question about a scientific concept, not related to the Earth data or satellite collections that the CMR would index. The user needs to provide a query that can be used to search the CMR database, such as a request for satellite data about a specific region, time period, or environmental parameter.

Final Answer: The query "What is osmosis?" does not correspond to a CMR search query. If you are looking for satellite data or environmental research related to osmosis or any other topic, please provide specific details such as the location, time frame, or the type of data you are interested in. For example, you could ask for satellite data on water salinity levels in a particular ocean region during a specific time period.

> Finished chain.
The query "What is osmosis?" does not correspond to a CMR search query. If you are looking for sat