# **Agentic Climate AI Notebook**

### 1.1 Downloading Dependencies


In [None]:
!pip install awscli langchain xarray matplotlib requests zarr cftime langchain_community duckduckgo_search scikit-learn
%pip install \
  boto3==1.40.29 \
  langchain \
  langchain-core \
  langchain-text-splitters \
  langsmith \
  pydantic \
  SQLAlchemy \
  PyYAML \
  botocore==1.40.29 \
  s3transfer \
  requests \
  xarray \
  s3fs==0.4.2 \
  numpy \
  pandas \
  matplotlib \
  cftime \
  zarr \
  dask \
  netCDF4 \
  polars
!pip install --upgrade numpy pandas bottleneck numexpr
!pip install cartopy geopandas


### 1.2 Configure AWS

1. Enter your AWS credentials when prompted: the AWS Access Key ID, Secret Access Key, region name, and output format (`json`).

2. We use AWS for three things:

    ① Open Data (S3) via anonymous or credentialed access;

    ② Neptune/Knowledge Graph (private VPC endpoint);

    ③ Bedrock (optional LLM).



In [None]:
!pip show boto3 botocore s3fs
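
The same check can be done from Python, which makes it easy to compare against the pins in the install cell. This is a minimal sketch; the helper name `installed_versions` is hypothetical, and it relies only on the standard-library `importlib.metadata` module.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Pins taken from the install cell in section 1.1
expected = {"boto3": "1.40.29", "botocore": "1.40.29", "s3fs": "0.4.2"}
for name, got in installed_versions(expected).items():
    status = "ok" if got == expected[name] else "MISMATCH"
    print(f"{name}: installed={got} expected={expected[name]} [{status}]")
```

A `MISMATCH` usually means a later `pip install --upgrade` replaced a pinned package, in which case restarting the runtime after reinstalling is worth trying.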

In [None]:
!aws configure
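
`aws configure` prompts interactively. As an alternative, credentials can be supplied through the standard AWS environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_DEFAULT_REGION`), which both the AWS CLI and boto3 read. The sketch below assumes that route; the helper name `set_aws_env` is hypothetical and the values shown in the usage comment are placeholders.

```python
import os


def set_aws_env(access_key_id, secret_access_key, region, env=os.environ):
    """Hypothetical helper: export credentials via the standard AWS variables."""
    env["AWS_ACCESS_KEY_ID"] = access_key_id
    env["AWS_SECRET_ACCESS_KEY"] = secret_access_key
    env["AWS_DEFAULT_REGION"] = region


# Usage (prompting keeps the secrets out of the saved notebook):
#   from getpass import getpass
#   set_aws_env(getpass("Access key ID: "), getpass("Secret access key: "), "us-east-1")
```

Credentials set this way last only for the current runtime session, so nothing is written to disk.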



### 1.3 Mount Your Drive

In [None]:
from google.colab import drive

# Mount Google Drive at /content/drive to access your project files
drive.mount('/content/drive/')

In [5]:
import sys

# Add the path of your agent files to Python search path
# Update the folder_path if your Agents directory is elsewhere
folder_path = '/content/drive/MyDrive/AutoClimDS/Agents'
sys.path.append(folder_path)

In [6]:
from dotenv import load_dotenv
import os

# Load environment variables from the .env file
# Ensure the .env file contains keys such as GRAPH_ID, NEPTUNE_REGION, etc.
dotenv_path = "/content/drive/MyDrive/AutoClimDS/Agents/.env"
load_dotenv(dotenv_path=dotenv_path, override=True)

True
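
`load_dotenv` returns `True` as soon as the file is found and at least one variable is set, so it does not confirm that every key the agents need is present. A minimal sketch of an explicit check follows; the helper name `missing_env_keys` is hypothetical, and the key list contains only the two names mentioned in the comment above (the real list may be longer).

```python
import os


def missing_env_keys(required, env=None):
    """Return the required keys that are unset or empty in the environment."""
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key)]


# Assumption: only the keys named in the cell above; extend as needed
required_keys = ["GRAPH_ID", "NEPTUNE_REGION"]
missing = missing_env_keys(required_keys)
if missing:
    print("Missing in .env:", ", ".join(missing))
else:
    print("All required keys are set.")
```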

### 1.4 Import Functions

In [None]:

# Import the factory function for each agent from the Agents folder added to sys.path above
from nasa_cmr_data_acquisition_agent import get_nasa_cmr_agent
from cesm_verification_agent import create_verification_agent
from knowledge_graph_agent_bedrock import get_knowledge_graph_agent
from cesm_lens_langchain_agent import get_cesm_lens_agent
from climate_research_orchestrator import create_climate_research_orchestrator

# Instantiate all five agents
knowledge_graph_agent = get_knowledge_graph_agent()
nasa_cmr_data_agent = get_nasa_cmr_agent()
cesm_lens_agent = get_cesm_lens_agent()
cesm_verif_agent = create_verification_agent()
orchestrator = create_climate_research_orchestrator()

# TODO: accept GRAPH_ID as a parameter in a future version.

## 2. Agentic AI


### 2.1 Summary
This notebook provides five agents. Each serves a distinct role in data acquisition, validation, reasoning, or orchestration.


### 2.2 Knowledge Graph Agent
The Knowledge Graph Agent links datasets, metadata, and domains together.  
Use it to **search datasets by location, variable, or theme** (e.g., precipitation, flooding, sea level).

In [None]:
# Example: Query the Knowledge Graph Agent
# Replace the prompt with your own query.
prompt_knowledge_graph = "Find me datasets in NYC for rainfall and flooding"

# Invoke the agent
response_knowledge_graph = knowledge_graph_agent.invoke({"input": prompt_knowledge_graph})

# Show result
print(response_knowledge_graph)
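
Printing the raw response shows the whole return value. LangChain agent executors typically return a dict with an `"output"` key, but whether these agents do is an assumption, so the sketch below falls back to the full response when that key is absent. The helper name `extract_output` is hypothetical.

```python
def extract_output(response):
    """Return the agent's answer text, whatever shape the response has."""
    if isinstance(response, dict):
        # Assumed key: "output" (common for LangChain agent executors)
        return str(response.get("output", response))
    return str(response)


# Usage with the cell above:
#   print(extract_output(response_knowledge_graph))
```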

### 2.3 NASA CMR Data Acquisition Agent
The NASA CMR Agent searches, downloads, and preprocesses datasets from NASA and NOAA repositories.  
Use it to **relate observations (e.g., rainfall, flooding, sea level)** with available data products.

In [None]:
# Example: Query the NASA CMR Agent
prompt_nasa_cmr = "What is the relationship between NYC rainfall and flooding using the existing data?"
response_nasa_cmr = nasa_cmr_data_agent.invoke({"input": prompt_nasa_cmr})
print(response_nasa_cmr)

### 2.4 CESM LENS Langchain Agent
The CESM LENS Agent interfaces with the CESM Large Ensemble dataset.  
Use it to **run climate simulations** and extract ensemble-based insights.

In [None]:
# Example: Query the CESM LENS Agent
prompt_cesm_lens = "Climate simulations for rainfall nyc."
response_cesm_lens = cesm_lens_agent.invoke({"input": prompt_cesm_lens})
print(response_cesm_lens)

### 2.5 CESM Verification Agent
The CESM Verification Agent validates CESM outputs against observations.  
Use it to **check CESM projections vs. ground truth** (e.g., Arctic sea ice).

In [None]:
# Example: Query the CESM Verification Agent
prompt_cesm_verif = "Verify the CESM model's simulation of Arctic sea ice extent."
response_cesm_verif = cesm_verif_agent.invoke({"input": prompt_cesm_verif})
print(response_cesm_verif)

### 2.6 Climate Research Orchestrator
The Orchestrator combines multiple agents for multi-step reasoning.  
Use it to **coordinate across NASA, CESM, and KG agents** for complex research questions.

In [None]:
# Example: Query the Orchestrator
prompt_orchestrator = "What are the potential impacts of a 2 degree Celsius global temperature increase?"
response_orchestrator = orchestrator.invoke({"input": prompt_orchestrator})
print(response_orchestrator)
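
When comparing agents on the same question, it can help to send one prompt to several of them and keep going if one fails (for example, when Neptune is unreachable). This is a minimal sketch; the helper name `ask_agents` is hypothetical and it assumes only the `.invoke({"input": ...})` interface used throughout this section.

```python
def ask_agents(agents, prompt):
    """Send one prompt to each agent; record the error text if an agent fails."""
    results = {}
    for name, agent in agents.items():
        try:
            results[name] = agent.invoke({"input": prompt})
        except Exception as exc:  # keep the loop alive so other agents still run
            results[name] = f"ERROR: {type(exc).__name__}: {exc}"
    return results


# Usage with the agents created in section 1.4:
#   answers = ask_agents(
#       {"kg": knowledge_graph_agent, "cmr": nasa_cmr_data_agent},
#       "Find me datasets in NYC for rainfall and flooding",
#   )
```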

## 3. Explore Local Database

### 3.1 Observational Datasets
Inspect the **local KG cache** (`climate_knowledge_graph.db`).  
We list tables and preview `stored_datasets`, where `dataset_properties` is JSON (may include links/metadata).


In [None]:
import sqlite3
import pandas as pd

# Connect to local KG cache
conn = sqlite3.connect('climate_knowledge_graph.db')

# List tables to confirm the DB is populated
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table';"
).fetchall()
print("Tables found:", [t[0] for t in tables])

# Preview the main table of discovered datasets
if any(t[0] == "stored_datasets" for t in tables):
    df = pd.read_sql_query("SELECT * FROM stored_datasets", conn)
    display(df.head(10))  # show a small preview
else:
    print("No 'stored_datasets' table found — run the KG/NasaCMR discovery first.")

conn.close()

Tables found: ['stored_datasets', 'dataset_relationships', 'sqlite_sequence']


| | dataset_id | title | short_name | dataset_properties | dataset_labels | total_relationships | relationship_types | links | created_at | updated_at |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | dataset_UNH_WWRDII_WATBAL_16263 | Gridded Fields of Major Water Balance Componen... | UNH_WWRDII_WATBAL | `{"science_keywords": "Category EARTH SCIENCE T...` | `["Dataset"]` | 18 | `["hasCESMVariable", "hasConsortium", "hasConta...` | `[{"link_type": "HTTP", "hreflang": "enUS", "li...` | 2025-09-21T15:32:54.507844 | 2025-09-21T15:32:54.507844 |
| 1 | dataset_comp_runoff_monthly_xdeg_994_18480 | ISLSCP II UNHGRDC Composite Monthly Runoff | comp_runoff_monthly_xdeg_994 | `{"science_keywords": "Category EARTH SCIENCE T...` | `["Dataset"]` | 19 | `["hasConsortium", "hasContact", "hasDataCatego...` | `[{"link_type": "Earthdata", "hreflang": "enUS"...` | 2025-09-21T15:33:12.193605 | 2025-09-21T15:33:12.193605 |
| 2 | dataset_UNH_GRDC_GCRDS_47417 | UNH GRDC Global Composite Runoff Data Set v10 | UNH_GRDC_GCRDS | `{"science_keywords": "Category EARTH SCIENCE T...` | `["Dataset"]` | 14 | `["hasCESMVariable", "hasConsortium", "hasConta...` | `[{"link_type": "HTTP", "hreflang": "enUS", "li...` | 2025-09-21T15:33:26.066863 | 2025-09-21T15:33:26.066863 |
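
The preview shows a `links` column holding a JSON list of dicts, but the display truncates it before the key that carries the URL. The sketch below therefore makes no assumption about key names: it walks the decoded JSON and collects every string that starts with an HTTP or S3 scheme. The helper name `urls_from_links` is hypothetical.

```python
import json

URL_PREFIXES = ("http://", "https://", "s3://")


def urls_from_links(links_json):
    """Collect every URL-like string found anywhere in a `links` value."""
    try:
        data = json.loads(links_json) if isinstance(links_json, str) else links_json
    except json.JSONDecodeError:
        return []

    found = []

    def walk(node):
        # Recurse through dicts and lists; keep strings that start with a URL scheme
        if isinstance(node, dict):
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)
        elif isinstance(node, str) and node.lower().startswith(URL_PREFIXES):
            found.append(node)

    walk(data)
    return found


# Usage with the DataFrame from the cell above:
#   df["urls"] = df["links"].apply(urls_from_links)
```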


### 3.2 Climate Simulation Datasets
Inspect the **CESM registry** (`cesm_data_registry.db`).  
`cesm_data_paths` stores Zarr/S3 paths and related descriptors collected by the CESM agent.

In [None]:

# Connect to CESM registry
conn = sqlite3.connect('cesm_data_registry.db')

# List tables to verify registry status
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table';"
).fetchall()
print("Tables found:", [t[0] for t in tables])

# Preview the path catalog
if any(t[0] == "cesm_data_paths" for t in tables):
    df = pd.read_sql_query("SELECT * FROM cesm_data_paths", conn)
    display(df.head(10))
else:
    print("No 'cesm_data_paths' table found — populate via the CESM agent.")

conn.close()
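
Sections 3.1 and 3.2 repeat the same connect, list, preview, close steps. A minimal sketch of a shared helper is shown below; the name `preview_table` is hypothetical. It uses only `sqlite3`, returns plain dicts, and closes the connection even if the query fails.

```python
import sqlite3


def preview_table(db_path, table, limit=10):
    """Return up to `limit` rows of `table` as dicts, or None if the table is absent."""
    conn = sqlite3.connect(db_path)
    try:
        conn.row_factory = sqlite3.Row
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"
        )]
        if table not in names:
            return None
        # The name was checked against sqlite_master, so interpolating it is safe
        rows = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,)).fetchall()
        return [dict(row) for row in rows]
    finally:
        conn.close()


# Usage:
#   rows = preview_table("cesm_data_registry.db", "cesm_data_paths")
#   pd.DataFrame(rows) if rows is not None else print("Table not found")
```

Note that `sqlite3.connect` creates an empty file when the path does not exist, so a `None` result can also mean the notebook is running from the wrong working directory.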

### 3.3 Debug Links (where are the URLs?)
The KG→NasaCMR hand-off depends on **where URLs are stored**.  
This helper inspects `stored_datasets.dataset_properties` (JSON) and the `dataset_relationships` table to find URL-like fields (HTTP/Earthdata/S3).


In [None]:
import sqlite3
import json

def debug_links_data(db_path='climate_knowledge_graph.db', sample=10):
    """Inspect where URL-like info is stored in the KG cache.

    - Looks into stored_datasets.dataset_properties (JSON)
    - Prints the type/value of the 'links' field if present
    - Scans other JSON fields for URL/S3 patterns
    - Peeks into dataset_relationships for URL-like attributes
    """
    print("=== DEBUGGING LINKS DATA STRUCTURE ===")

    try:
        conn = sqlite3.connect(db_path)
        cur  = conn.cursor()

        # Sample a few datasets to examine their JSON structure
        cur.execute("""
            SELECT dataset_id, title, dataset_properties
            FROM stored_datasets
            LIMIT ?
        """, (sample,))
        rows = cur.fetchall()

        for dataset_id, title, props_json in rows:
            print(f"\nDataset: {dataset_id}")
            print(f"Title  : {(title or '')[:80]}")  # title may be NULL in the DB

            if not props_json or props_json == 'None':
                print("  No properties JSON")
                continue

            try:
                props = json.loads(props_json)
            except json.JSONDecodeError as e:
                print("  JSON parse error:", e)
                continue

            # 1) Check canonical 'links' field
            if 'links' in props:
                links_value = props['links']
                print("  links.type :", type(links_value).__name__)
                print("  links.value:", str(links_value)[:200])

            # 2) Hunt for URL-like strings in other fields
            def looks_like_url(s: str) -> bool:
                s_low = s.lower()
                return ('http' in s_low) or ('earthdata' in s_low) or ('s3://' in s_low)

            for k, v in props.items():
                if not isinstance(v, str) or not looks_like_url(v):
                    continue
                if v.lower().startswith(('http://', 'https://', 's3://')):
                    # The whole value is a URL
                    print(f"  URL in '{k}': {v[:200]}")
                else:
                    # A longer string that embeds a URL or an Earthdata reference
                    print(f"  Possible URL in '{k}': {v[:200]}")

            print("  fields:", list(props.keys()))

        # 3) Relationships table may carry external links/IDs
        print("\nCHECKING dataset_relationships …")
        cur.execute("PRAGMA table_info(dataset_relationships)")
        rel_cols = [c[1] for c in cur.fetchall()]
        print("  columns:", rel_cols)

        cur.execute("SELECT * FROM dataset_relationships LIMIT 5")
        for row in cur.fetchall():
            rel = dict(zip(rel_cols, row))
            # Print only compact view + URL-like hints
            url_keys = [k for k, v in rel.items() if isinstance(v, str) and any(p in v.lower() for p in ('http', 'earthdata', 's3://'))]
            print("  relationship:", {k: rel[k] for k in (rel_cols[0], rel_cols[-1]) if k in rel})
            for k in url_keys:
                print(f"    URL in {k}: {rel[k][:200]}")

        conn.close()

    except Exception as e:
        print("Error:", e)

    print("\n🏁 Debug complete.")

# Run the debug analysis
debug_links_data()
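
The helper above prints a sample of datasets. To decide which field the KG→NasaCMR hand-off should read, an aggregate view across all rows may be more useful. The sketch below counts, per `dataset_properties` key, how many datasets hold a URL-like string there. The function name `url_field_counts` is hypothetical; the table and column names are those used above.

```python
import json
import sqlite3
from collections import Counter

URL_HINTS = ("http", "earthdata", "s3://")


def url_field_counts(db_path="climate_knowledge_graph.db"):
    """Count, per JSON key, the datasets whose value looks like or embeds a URL."""
    counts = Counter()
    conn = sqlite3.connect(db_path)
    try:
        query = "SELECT dataset_properties FROM stored_datasets"
        for (props_json,) in conn.execute(query):
            try:
                props = json.loads(props_json) if props_json else {}
            except json.JSONDecodeError:
                continue  # skip rows such as the literal string 'None'
            if not isinstance(props, dict):
                continue
            for key, value in props.items():
                if isinstance(value, str) and any(h in value.lower() for h in URL_HINTS):
                    counts[key] += 1
    finally:
        conn.close()
    return counts


# Usage:
#   for key, n in url_field_counts().most_common():
#       print(f"{key}: {n} datasets")
```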
