# Environmental Rights Violations Chatbot
![env-rights](https://www.coe.int/documents/365513/276439533/hrc-healthy-environment-870x489.jpg/e8822dbb-ebb4-5708-d3cf-2356020454db?t=1734513740985)
## What is it?
The term environmental rights means any proclamation of a human right to environmental conditions of a specified quality. Human rights and the environment are intertwined: human rights cannot be enjoyed without a safe, clean and healthy environment, and sustainable environmental governance cannot exist without the establishment of and respect for human rights. This relationship is increasingly recognized, as the right to a healthy environment is enshrined in over 100 constitutions. Learn more at the United Nations Environment Programme (UNEP) page on [Environmental Rights](https://www.unep.org/explore-topics/environmental-rights-and-governance/what-we-do/advancing-environmental-rights/what

The Environmental Rights Violations (ERV) Chatbot queries a SQL database containing the mean, maximum, minimum, median, and standard deviation of air quality (PM2.5 concentration) by country (and, by extension, by continent). Its answers are often reasonable but not consistently accurate.

For example, some user questions that the chatbot may answer are: 
1. What is the average and standard deviation of air quality in Indonesia?
2. Which five countries have the worst air quality?
3. What is the average air quality on each continent?
4. In how many countries did maximum air quality surpass 100 micrograms per cubic meter?
5. What is the average air quality among countries that start with the letter D?
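
To make the link between such questions and SQL concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `air`, its column names, and the three rows of values are invented for illustration; they are not taken from the database built later in this notebook.

```python
import sqlite3

# Toy in-memory table; all names and numbers are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE air (country TEXT, continent TEXT, mean_pm25 REAL, max_pm25 REAL)"
)
conn.executemany(
    "INSERT INTO air VALUES (?, ?, ?, ?)",
    [
        ("Country A", "Europe", 8.0, 20.0),
        ("Country B", "Asia", 55.0, 130.0),
        ("Country C", "Africa", 40.0, 95.0),
    ],
)

# Question 2 style: which countries have the worst (highest) mean PM2.5?
worst = conn.execute(
    "SELECT country FROM air ORDER BY mean_pm25 DESC LIMIT 2"
).fetchall()

# Question 4 style: in how many countries did the maximum exceed 100?
n_over_100 = conn.execute(
    "SELECT COUNT(*) FROM air WHERE max_pm25 > 100"
).fetchone()[0]

print(worst)
print(n_over_100)
conn.close()
```

The chatbot's job is to write queries like these on its own from the plain-English question.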

## Data sources

Data                     | Source                                | Path (`data/input/...`)                     | Description                                                              | Units      | Sp. Res. | Temp. Res.
-------------------------|---------------------------------------|---------------------------------------------|--------------------------------------------------------------------------|------------|----------| --------- 
Global Air Quality       | [WUSTL](https://sites.wustl.edu/acag/datasets/surface-pm2-5-archive/) | `airQuality.tif`                            | Ground-level fine particulate matter (PM2.5)                             | μg/m<sup>3</sup> | ~1 km    | Single mean of years 2014-2019
Global Heat Stress       | [CHIRTS](https://data.chc.ucsb.edu/products/CHIRTSdaily/v1.0/global_tifs_p05/)               | `heatstress103.tif`                         | Number of days in year where heat index was greater than 103 °F          | #          | ~5 km    | Year 2016 (hottest up to 2020)
Global Sanitation Access | [IHME](https://ghdx.healthdata.org/record/ihme-data/lmic-wash-access-geospatial-estimates-2000-2017)                 | `sanitation_access.tif`<sup>[1](#fn1)</sup> | Coverage of any improved sanitation facility access in 90 LMICs          | %          | ~5 km    | Single mean of years 2012-2017<sup>[3](#fn3)</sup>         
Global Stunting          | [IHME](https://ghdx.healthdata.org/record/ihme-data/lmic-child-growth-failure-geospatial-estimates-2000-2017)                 | `stunting.tif`<sup>[2](#fn2)</sup>          | Prevalence of stunting for children under 5 years of age for 105 LMICs   | %          | ~5 km    | Single mean of years 2012-2017<sup>[3](#fn3)</sup>          
Countries                | [World Bank](https://datacatalog.worldbank.org/search/dataset/0038272/World-Bank-Official-Boundaries)                     | `WB_countries_Admin0_10m`                   | World Bank-approved administrative boundaries (Admin 0), i.e. countries  | NA         | NA       | Last updated Mar 19, 2020        

<a name="fn1">1</a>: Previously named `w_access`

<a name="fn2">2</a>: Previously named `IHME_LMIC_CGF_2000_2017_STUNTING_PREV_MEAN_2017_Y2020M01D08.tif`

## Software

This chatbot relies on the LangChain interface and the Llama 3.1 large language model (LLM). We'll use Ollama to download and serve the model in our environment.

## Outline
1. Setup environment
1. Load data into SQL database
1. Define unit tests
1. Build LLM
1. Run LLM on possible queries
1. Assignment




## Setup Environment
Let's run this notebook with the GPU accelerator.

In [None]:
# Spatial packages
!pip install rasterio==1.4.0 rioxarray==0.17.0 geopandas==1.0.1 shapely==2.0.6 rasterstats==0.20.0 --quiet

# LangChain
!pip install langchain_ollama==0.2.0 langchain_core==0.3.9 langchain_community==0.3.1 langchain_experimental==0.3.2 langgraph --quiet

# Visualization
!pip install folium matplotlib mapclassify --quiet
print("installed")

In [None]:
import os
from pathlib import Path
import pandas as pd
import numpy as np
import geopandas as gpd
import rasterio as rio
from rasterstats import zonal_stats
from langchain_community.utilities import SQLDatabase
from sqlalchemy import create_engine
import folium
import matplotlib.pyplot as plt

data_path = '/kaggle/input/erv-chatbot/data'
data_path

## Visualize the Data Layers
The source data is in [GeoTIFF](https://www.ogc.org/publications/standard/geotiff/) format. Let's visualize a layer interactively. First let's open up the tif file with rasterio and look at some of its metadata.

In [None]:
# Open the GeoTIFF file
tif_filepath = "/input/airQuality.tif"
src = rio.open(f'{data_path}{tif_filepath}')
metadata = src.meta
bounds = src.bounds
array = src.read(1)  # read the first band once and reuse it
data_range = (array.min(), array.max())

print("Metadata:", metadata)
print("Bounds: ", bounds)
print("Data Range:", data_range)
print("Shape:", array.shape)

plt.figure(figsize=(12,6))
plt.imshow(array, cmap='inferno')
plt.show() 

# close the image
src.close()

Let's try to plot it interactively with Folium.

In [None]:
# File path
tif_filepath = f'{data_path}/input/airQuality.tif'
png_filepath = 'airQualityNorm.png'

# Open the GeoTIFF file
with rio.open(tif_filepath) as src:
    data = src.read(1).astype(np.float32)  # Read the first band
    nodata_value = src.nodata
    bounds = [[src.bounds.bottom, src.bounds.left], [src.bounds.top, src.bounds.right]]

# Replace NoData values with NaN
data[data == nodata_value] = np.nan

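# Normalize the data for visualization (0 to 255); NaN pixels are set to 0
# explicitly, because casting NaN to an unsigned integer is undefined
vmin, vmax = np.nanmin(data), np.nanmax(data)
data = np.nan_to_num(255 * (data - vmin) / (vmax - vmin), nan=0.0).astype(np.uint8)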

# Convert array to an image
plt.imsave(png_filepath, data, cmap='inferno', format='png')

# Create a Folium map
m = folium.Map(location=[(bounds[0][0] + bounds[1][0]) / 2, 
                         (bounds[0][1] + bounds[1][1]) / 2], zoom_start=2)

# Add image overlay
img_overlay = folium.raster_layers.ImageOverlay(
    name="Air Quality",
    image=png_filepath,
    bounds=[[-50, -180], [57, 180]],
    opacity=0.9,
    interactive=True
)
img_overlay.add_to(m)

# Add layer control
folium.LayerControl().add_to(m)

# Display map
m

Something is off: the country boundaries don't quite line up. That's because Folium uses the Web Mercator projection (EPSG:3857), while this image uses EPSG:4326.
**Bonus: reproject the tif into Mercator with `rasterio` per [this example](https://rasterio.readthedocs.io/en/stable/topics/reproject.html#reprojecting-a-geotiff-dataset) in the docs.**
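
To build some intuition for the mismatch before attempting the bonus, the sketch below compares how far from the equator a given latitude sits in each projection. It uses the standard spherical Web Mercator formula; the function names are our own and the sample latitudes are arbitrary. As we understand it, the map stretches an image overlay evenly between its corner coordinates in Mercator space, so an image whose rows are evenly spaced in latitude ends up shifted north or south.

```python
import math

def equirect_y(lat_deg: float) -> float:
    """Vertical position in an EPSG:4326-style grid: linear in latitude."""
    return lat_deg / 90.0

def web_mercator_y(lat_deg: float) -> float:
    """Vertical position in spherical Web Mercator, scaled so ~85.0511 deg maps to 1."""
    lat = math.radians(lat_deg)
    return math.log(math.tan(math.pi / 4 + lat / 2)) / math.pi

# Compare the two vertical scales at a few sample latitudes.
for lat in (0, 30, 60):
    print(lat, round(equirect_y(lat), 3), round(web_mercator_y(lat), 3))

# How many times farther from the equator is 60 deg than 30 deg?
print(equirect_y(60) / equirect_y(30), web_mercator_y(60) / web_mercator_y(30))
```

In the linear grid, 60° is exactly twice as far from the equator as 30°; in Web Mercator it is about 2.4 times as far, and the gap widens toward the poles.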

In [None]:
# reproject the image into mercator and revisualize


## Load data into SQL database

To interact with this data through a natural language model, we want it in a SQL database. That way, the LLM can turn our questions into SQL queries and then interpret the returned data to answer them. The functions defined below take a tif, calculate the per-country statistics, and write them to a SQL database. Documentation: https://python.langchain.com/docs/how_to/sql_csv/#sql.
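
Before looking at the real function, here is a toy illustration of what "per-country statistics" (zonal statistics) means: group the raster pixels by the zone they fall in, then summarize each group. The grid values and zone labels are made up, and the sketch uses only the standard library rather than `rasterstats`. It uses the population standard deviation (`pstdev`), on the assumption that this matches the `std` statistic; check the `rasterstats` documentation if the distinction matters for your analysis.

```python
import statistics

# Toy 3x4 "raster" of PM2.5 values and a matching grid of zone labels.
raster = [
    [10.0, 12.0, 30.0, 32.0],
    [11.0, 13.0, 31.0, 35.0],
    [9.0, 14.0, 29.0, 33.0],
]
zones = [
    ["A", "A", "B", "B"],
    ["A", "A", "B", "B"],
    ["A", "A", "B", "B"],
]

# Collect the pixel values that fall inside each zone.
pixels_by_zone = {}
for raster_row, zone_row in zip(raster, zones):
    for value, zone in zip(raster_row, zone_row):
        pixels_by_zone.setdefault(zone, []).append(value)

# Summarize each zone with the same five statistics used in this notebook.
zonal = {
    zone: {
        "mean": statistics.fmean(values),
        "max": max(values),
        "min": min(values),
        "median": statistics.median(values),
        "std": statistics.pstdev(values),
    }
    for zone, values in pixels_by_zone.items()
}
print(zonal["A"])
print(zonal["B"])
```

In the real data, the zones are country polygons rather than a label grid, so the library first works out which pixels each polygon covers.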

In [None]:
# Calculate zonal statistics in every polygon
def calculate_zonal_stats(list_of_stats: list, geodata_path: Path, raster_path: Path, save_csv_path: Path = None):

    # Load data
    polygons = gpd.read_file(geodata_path)
    with rio.open(raster_path) as src:
        if src.crs != polygons.crs:
            polygons = polygons.to_crs(src.crs)  # align CRS, if applicable
        raster = src.read(1)
        affine = src.transform
        nodata_value = src.nodata

    # Calculate
    stats = zonal_stats(polygons, raster,
                        stats=list_of_stats,
                        affine=affine, nodata=nodata_value)

    # Reattach to original data
    polygon_x_raster = pd.concat([
        pd.DataFrame(polygons.drop(columns='geometry')),  # drop geometry to reduce size of dataset
        pd.DataFrame(stats)],
        axis=1)

    # Export, if requested
    if save_csv_path is not None:
        polygon_x_raster.to_csv(save_csv_path, index=False)

    return polygon_x_raster

# Load data to SQL with LangChain
def data_to_sql(df, name_of_database: str, if_exists_replace: bool = False):
    database_path = f'{data_path}/{name_of_database}.db'
    engine = create_engine(f'sqlite:///{database_path}')
    if not Path(database_path).exists() or if_exists_replace is True:
        df.to_sql(name_of_database, engine, if_exists='replace')
    db = SQLDatabase(engine=engine)

    return db

print('function defined')

In [None]:
# calculate zonal stats per country and write to a SQL database
stats = ['mean', 'max', 'min', 'median', 'std']
df = calculate_zonal_stats(
    list_of_stats=stats,
    geodata_path=Path(f'{data_path}/input/WB_countries_Admin0_10m/'),
    raster_path=Path(f'{data_path}/input/airQuality.tif')
)
df = df[['WB_NAME', 'CONTINENT'] + stats].dropna()
db = data_to_sql(df = df, name_of_database = 'airxcntry', if_exists_replace = False)
db

In [None]:
# inspect the data
df

## Define unit tests

In [None]:
# Potential queries
q_simple = "What is the average and standard deviation of air quality in Indonesia?"
q_relative = "Which five countries have the worst air quality?"
q_summarize = "What is the average air quality on each continent?"
q_filter = "In how many countries did maximum air quality surpass 100 micrograms per cubic meter?"

print(q_simple)

In [None]:
# Manually find answers from dataframe
answer_key = {}

# 1) "What is the average and standard deviation of air quality in Indonesia?"
answer_key = answer_key | {
    'q_simple': df[df['WB_NAME'] == "Indonesia"][['mean', 'std']].round(2).to_dict('records')
}

# 2) "Which five countries have the worst air quality?"
answer_key = answer_key | {
    'q_relative': df.sort_values('mean', ascending=False).head(5)['WB_NAME'].values.tolist()
}

# 3) "What is the average air quality on each continent?"
answer_key = answer_key | {
    'q_summarize': df.groupby('CONTINENT')['mean'].mean().round(2).to_dict()
}

# 4) "In how many countries did maximum air quality surpass 100 micrograms per cubic meter?"
answer_key = answer_key | {
    'q_filter': df[df['max'] > 100]['WB_NAME'].count()
}

In [None]:
answer_key

## Build LLM

We will eventually chain together agents that a) turn a question into a SQL query, b) run that SQL query, and c) return the SQL result as a natural language response. First, let's explore each agent in turn.
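
The three stages can be sketched without any LLM at all. In the toy version below, stages a) and c) are hard-coded stubs standing in for the model, and only stage b) does real work against a tiny in-memory SQLite table. All function names and values are hypothetical; the point is the shape of the pipeline, not the implementation.

```python
import sqlite3

# Tiny stand-in database (values invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airxcntry (WB_NAME TEXT, mean REAL)")
conn.execute("INSERT INTO airxcntry VALUES ('Country A', 12.5)")

def write_query(question: str) -> str:
    """a) Stub for the LLM that writes SQL; a real model would generate this text."""
    return "SELECT mean FROM airxcntry WHERE WB_NAME = 'Country A'"

def run_query(query: str) -> list:
    """b) Execute the SQL against the database and return the rows."""
    return conn.execute(query).fetchall()

def phrase_answer(question: str, query: str, result: list) -> str:
    """c) Stub for the LLM that turns the rows into a sentence."""
    return f"Question: {question} | SQL result: {result}"

def pipeline(question: str) -> str:
    """Chain the three stages: question -> SQL -> rows -> answer."""
    query = write_query(question)
    result = run_query(query)
    return phrase_answer(question, query, result)

print(pipeline("What is the average air quality in Country A?"))
```

Each stage only sees the output of the previous one, so a mistake in the generated SQL carries through to the final answer.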

Resources:
* Documentation: https://python.langchain.com/docs/tutorials/sql_qa
* Example Notebook for Running Ollama in Kaggle: https://www.kaggle.com/code/sumanmichael/ollama-langchain-in-kaggle-gpu



In [None]:
# download and install ollama
!curl -fsSL https://ollama.com/install.sh | sh

In [None]:
# to support background processes in Kaggle
import os
get_ipython().system = os.system
print('done')

In [None]:
# start ollama in the background
!ollama serve &

In [None]:
# use ollama to pull the Llama 3.1 model
!ollama pull llama3.1

In [None]:
from langchain_ollama import ChatOllama
from langchain.chains import create_sql_query_chain
from langchain_core.prompts import PromptTemplate
from langchain_community.tools.sql_database.tool import QuerySQLDataBaseTool
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

print('imported')

In [None]:
# Load model
model = ChatOllama(model="llama3.1", seed=123)
model

#### a) Generate SQL query

In [None]:
# a) Generate SQL query (default)
sql_agent_default = create_sql_query_chain(model, db)
sql_code_default = sql_agent_default.invoke({"question": q_simple})
print(sql_code_default)

**Question for you: Is this SQL query going to run successfully? Why or why not?**

**Your answer:**

In [None]:
db.run(sql_code_default)

Oops! It hit an error. Something is wrong with our query. Let's see if we can do some prompt engineering to improve the model's output.

In [None]:
# Default prompt
sql_prompt_default = sql_agent_default.get_prompts()[0]
sql_prompt_default.pretty_print()

##### Prompt engineering

Remove the following instructions, which may be confusing the model: 
* Pay attention to use date('now') function to get the current date, if the question involves "today".
* Wrap each column name in double quotes (") to denote them as delimited identifiers.

And add the following instructions based on how we see the model interpreting our request. 
* Return ONLY the SQLQuery with no other context given. Do not include the question in the output. Do not include the title 'SQLQuery' in the output.
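
Prompt instructions alone may not always keep stray text out of the model's reply. As a complementary safeguard, one option is to clean the reply before running it. The helper below is a hypothetical sketch: the name `extract_sql` and the patterns it strips are assumptions about typical LLM output, not part of LangChain.

```python
import re

FENCE = "`" * 3  # three backticks, built this way to keep this example readable

def extract_sql(raw: str) -> str:
    """Best-effort cleanup of an LLM reply so that only the SQL statement remains."""
    text = raw.strip()
    # Prefer the contents of a fenced code block, if the reply contains one.
    fenced = re.search(
        FENCE + r"(?:sql)?\s*(.*?)" + FENCE, text, flags=re.DOTALL | re.IGNORECASE
    )
    if fenced:
        text = fenced.group(1).strip()
    # Drop a leading "SQLQuery:" label.
    text = re.sub(r"^\s*SQLQuery:\s*", "", text, flags=re.IGNORECASE)
    # Keep from the first SELECT up to the first semicolon (or the end of the text).
    match = re.search(r"(SELECT\b.*?)(;|$)", text, flags=re.DOTALL | re.IGNORECASE)
    return match.group(1).strip() if match else text

labelled = "SQLQuery: SELECT mean FROM airxcntry WHERE WB_NAME = 'Indonesia';"
print(extract_sql(labelled))
```

This is deliberately simple: it only handles `SELECT` statements and would cut a query short if a semicolon appeared inside a string literal.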


In [None]:
# Improved prompt
sql_prompt_text = '''
    You are a SQLite expert. Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer to the input question.
    Unless the user specifies in the question a specific number of examples to obtain, query for at most {top_k} results using the LIMIT clause as per SQLite. You can order the results to return the most informative data in the database.
    Never query for all columns from a table. You must query only the columns that are needed to answer the question. 
    Pay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.
    Return ONLY the SQLQuery with no other context given. Do not include the question in the output. Do not include the title 'SQLQuery' in the output.
    
    Use the following format:
    Question: Question here
    SQLQuery: SQL Query to run
    SQLResult: Result of the SQLQuery
    Answer: Final answer here
    
    Only use the following tables:
    {table_info}
    
    Question: {input}
'''

sql_prompt = PromptTemplate(
    input_variables = ["input", "table_info", "top_k", "dialect"],
    template=sql_prompt_text
)

sql_prompt

In [None]:
# New agent based on improved prompt
sql_agent = create_sql_query_chain(model, db, prompt=sql_prompt)
sql_agent

Now let's see what our new agent returns when asked to generate SQL code.


In [None]:
sql_code = sql_agent.invoke({"question": q_simple})
print(sql_code)
print(db.run(sql_code))

print("\nAnswer key:")
answer_key['q_simple']

Yay! So far we have run the generated query by hand with `db.run`. Let's now build an agent that executes the SQL query and chain it to the agent that writes the SQL code.


#### b) Execute SQL query

In [None]:
# b) Execute SQL query 
execute_sql_query = QuerySQLDataBaseTool(db=db)
write_sql_query = create_sql_query_chain(model, db, prompt=sql_prompt)
sql_chain = write_sql_query | execute_sql_query
sql_chain.invoke({"question": q_simple})

#### c) Natural language response

In [None]:
# Prompt for agent that will interact with user
answer_prompt = PromptTemplate.from_template(
    """Given the following user question, corresponding SQL query, and SQL result, answer the user question.

    Question: {question}
    SQL Query: {query}
    SQL Result: {result}
    Answer:
    
    """
)
answer_prompt

In [None]:
# Chain all agents together
chain = (
    RunnablePassthrough.assign(query=write_sql_query).assign(
        result=itemgetter("query") | execute_sql_query
    )
    | answer_prompt | model | StrOutputParser()
)
chain

In [None]:
# Test chain
final_response = chain.invoke({"question": q_simple})
print(final_response)

## Answers to other queries
Let's test out our Chatbot now on the test questions we set up before. After each one, we'll evaluate how it does. 
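
Eyeballing each response against the answer key works for a handful of questions. One possible way to make the check repeatable is a small helper that tests whether every expected item appears in the response text. The function name `contains_all` and the sample strings are hypothetical, and a substring match like this is only a rough proxy for correctness.

```python
def contains_all(response: str, expected_items) -> bool:
    """Return True if every expected item appears (case-insensitively) in the response."""
    text = response.lower()
    return all(str(item).lower() in text for item in expected_items)

# Made-up response and expected answers, purely for illustration.
sample_response = "The countries with the worst air quality are Country B and Country C."
print(contains_all(sample_response, ["Country B", "Country C"]))
print(contains_all(sample_response, ["Country B", "Country D"]))
```

With the variables defined in this notebook, a call such as `contains_all(r_relative, answer_key['q_relative'])` would flag responses that omit one of the expected countries. Numeric answers need more care, since the model may round differently from the answer key.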

In [None]:
# q_relative 
# q_relative = "Which five countries have the worst average air quality?"

print(q_relative, "\n")

r_relative = chain.invoke({"question": q_relative})
print(r_relative)

print("\nAnswer key:", answer_key['q_relative'])

print("\nSQL code written:", sql_agent.invoke({"question": q_relative}))


**Question:** How did the Chatbot do at answering this question? What would you suggest could be done to improve the answer?

**Your Answer:**

In [None]:
# q_summarize

print(q_summarize, "\n")

r_summarize = chain.invoke({"question": q_summarize})
print(r_summarize)

print("\nAnswer key:", answer_key['q_summarize'])

print("\nSQL code written:", sql_agent.invoke({"question": q_summarize}))


**Question:** How did the Chatbot do at answering this question? Was the answer correct? If not, where did the answer come from? What would you suggest could be done to improve the answer?

**Your Answer:**

In [None]:
# q_filter

print(q_filter, "\n")

r_filter = chain.invoke({"question": q_filter})
print(r_filter)

print("\nAnswer key:", answer_key['q_filter'])

print("\nSQL code written:", sql_agent.invoke({"question": q_filter}))


**Question:** How did the Chatbot do at answering this question? Was the answer correct? Did the SQL query seem appropriate? What would you suggest could be done to improve the answer?

**Your Answer:**

## Assignment
**Part 1 (easier):** Spend some time tweaking the prompts for the ERV Chatbot above to see if you can improve the quality of responses to one of the test questions where the Chatbot did poorly. What did you change? Did it help?

**Part 2 (harder):** Pick another one of the input data layers (the one you were assigned) and repeat the steps in this notebook to make it queryable by the ERV Chatbot (i.e., visualize the data, compute the zonal stats, convert to a SQL database, engineer the prompt, and set up 2-3 test questions). Write the code below.

## Bonus Assignment
**Part 1 (easy):** What could we do to improve this notebook?

**Part 2 (medium):** Here we used Llama 3.1 from Meta. Check out other LLMs in the Ollama library, find one that you think might perform better, and pull it into this notebook to see how it does.

**Part 3 (very hard):** Adapt the ERV Chatbot to be able to answer queries about multiple types of Environmental Rights Violations (i.e., air quality AND sanitation access).

## Acknowledgements
Thank you to [Zia Mehrabi](https://www.colorado.edu/envs/zia-mehrabi) for ideating a chatbot that can help people discover human rights violations. Thank you to [Naia Ormaza Zulueta](https://www.colorado.edu/envs/naia-ormaza-zulueta) for compiling the datasets and thank you to [Katie Fankhauser](https://www.linkedin.com/in/katie-fankhauser/) for writing most of the code. [Isaiah Lyons-Galante](https://www.colorado.edu/envs/isaiah-lyons-galante) adapted the notebook for Kaggle and for the AI for Good course. All contributions were made as members of the Better Planet Lab.