# Environmental Rights Violations Chatbot

## What is it?

The ERV Chatbot queries a SQL database containing the mean, max, min, median, and standard deviation of air quality by country (and, by extension, by continent) with reasonable but inconsistent accuracy.

For example, some user questions that the chatbot may answer are: 
1. What is the average and standard deviation of air quality in Indonesia?
2. Which five countries have the worst air quality?
3. What is the average air quality on each continent?
4. In how many countries did maximum air quality surpass 100 micrograms per cubic meter?
5. What is the average air quality among countries that start with the letter D?

## Data sources

Data                     | Source                                | Path (`data/input/...`)                     | Description                                                              | Units      | Sp. Res. | Temp. Res.
-------------------------|---------------------------------------|---------------------------------------------|--------------------------------------------------------------------------|------------|----------| --------- 
Global Air Quality       | [WUSTL](https://sites.wustl.edu/acag/datasets/surface-pm2-5-archive/) | `airQuality.tif`                            | Ground-level fine particulate matter (PM2.5)                             | μg/m<sup>3</sup> | ~1 km    | Single mean of years 2014-2019
Global Heat Stress       | [CHIRTS](https://data.chc.ucsb.edu/products/CHIRTSdaily/v1.0/global_tifs_p05/)               | `heatstress103.tif`                         | Number of days in year where heat index was greater than 103F            | #          | ~5 km    | Year 2016 (hottest up to 2020)         
Global Sanitation Access | [IHME](https://ghdx.healthdata.org/record/ihme-data/lmic-wash-access-geospatial-estimates-2000-2017)                 | `sanitation_access.tif`<sup>[1](#fn1)</sup> | Coverage of any improved sanitation facility access in 90 LMICs          | %          | ~5 km    | Single mean of years 2012-2017<sup>[3](#fn3)</sup>         
Global Stunting          | [IHME](https://ghdx.healthdata.org/record/ihme-data/lmic-child-growth-failure-geospatial-estimates-2000-2017)                 | `stunting.tif`<sup>[2](#fn2)</sup>          | Prevalence of stunting for children under 5 years of age for 105 LMICs   | %          | ~5 km    | Single mean of years 2012-2017<sup>[3](#fn3)</sup>          
Countries                | [World Bank](https://datacatalog.worldbank.org/search/dataset/0038272/World-Bank-Official-Boundaries)                     | `WB_countries_Admin0_10m`                   | World Bank-approved administrative boundaries (Admin 0), i.e. countries  | NA         | NA       | Last updated Mar 19, 2020        

<a name="fn1">1</a>: Previously named `w_access`

<a name="fn2">2</a>: Previously named `IHME_LMIC_CGF_2000_2017_STUNTING_PREV_MEAN_2017_Y2020M01D08.tif`

<a name="fn3">3</a>: Confirm with Naia that it was not the last 3 available years of data, or 2014-2017

## Software

This chatbot relies on the LangChain interface and the Llama 3.1 large language model (LLM). All necessary packages are 
included in `environment.yml`, but you will have to [download Ollama](https://python.langchain.com/docs/integrations/chat/ollama/) 
(once) to your local computer before you are able to access the Llama 3.1 model. Keep the Ollama application running to 
use the model within your Python environment.

## Outline
1. Setup environment
1. Load data into SQL database
1. Define unit tests
1. Build LLM
1. Run LLM on possible queries




## Setup Environment
Let's run this notebook with the GPU accelerator.

In [6]:
# Spatial packages
!pip install rasterio==1.4.0 rioxarray==0.17.0 geopandas==1.0.1 shapely==2.0.6 rasterstats==0.20.0 --quiet

# LangChain
!pip install langchain_ollama==0.2.0 langchain_core==0.3.9 langchain_community==0.3.1 langchain_experimental==0.3.2 langgraph --quiet

print("installed")

installed


## Load data into SQL database

Documentation: https://python.langchain.com/docs/how_to/sql_csv/#sql

In [7]:
import os
from pathlib import Path
import pandas as pd
import geopandas as gpd
import rasterio as rio
from rasterstats import zonal_stats
from langchain_community.utilities import SQLDatabase
from sqlalchemy import create_engine

data_path = '/kaggle/input/erv-chatbot/data'
data_path

'/kaggle/input/erv-chatbot/data'

The first function defined below takes a tif and calculates per-country statistics; the second writes the resulting table to a SQL database.

In [25]:
# Calculate zonal statistics in every polygon
def calculate_zonal_stats(list_of_stats: list, geodata_path: Path, raster_path: Path, save_csv_path: Path = None):

    # Load data
    polygons = gpd.read_file(geodata_path)
    with rio.open(raster_path) as src:
        if src.crs != polygons.crs:
            polygons = polygons.to_crs(src.crs)  # align CRS, if applicable
        raster = src.read(1)
        affine = src.transform
        nodata_value = src.nodata

    # Calculate
    stats = zonal_stats(polygons, raster,
                        stats=list_of_stats,
                        affine=affine, nodata=nodata_value)

    # Reattach to original data
    polygon_x_raster = pd.concat([
        pd.DataFrame(polygons.drop(columns='geometry')),  # drop geometry to reduce size of dataset
        pd.DataFrame(stats)],
        axis=1)

    # Export, if requested
    if save_csv_path is not None:
        polygon_x_raster.to_csv(save_csv_path, index=False)

    return polygon_x_raster

# Load data to SQL with LangChain
def data_to_sql(df, name_of_database: str, if_exists_replace: bool = False):
    database_path = f'{data_path}/{name_of_database}.db'
    engine = create_engine(f'sqlite:///{database_path}')
    if not Path(database_path).exists() or if_exists_replace is True:
        df.to_sql(name_of_database, engine, if_exists='replace')
    db = SQLDatabase(engine=engine)

    return db

print('function defined')

function defined
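
To make concrete what a zonal-statistics step computes for each country, here is a minimal pure-Python sketch on a toy grid. It assumes zones are given as a grid of labels rather than polygons, and that the standard deviation is the population form. The function name `toy_zonal_stats` and the sample values are hypothetical; they are not part of the notebook or of `rasterstats`.

```python
import statistics
from collections import defaultdict

def toy_zonal_stats(values, zones, nodata=None):
    """Summarise raster cell values per zone label (illustrative only)."""
    cells = defaultdict(list)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if value != nodata:  # skip cells flagged as missing
                cells[zone].append(value)
    return {
        zone: {
            "mean": statistics.fmean(vals),
            "max": max(vals),
            "min": min(vals),
            "median": statistics.median(vals),
            "std": statistics.pstdev(vals),  # population std (assumption)
        }
        for zone, vals in cells.items()
    }

# Hypothetical 2x3 raster of PM2.5 values and matching zone labels
values = [[10.0, 12.0, -9999.0],
          [14.0, 30.0, 34.0]]
zones = [["A", "A", "B"],
         ["A", "B", "B"]]

result = toy_zonal_stats(values, zones, nodata=-9999.0)
print(result["A"])
print(result["B"])
```

Each zone ends up with one row of summary statistics, which is the shape of the per-country table that is later written to SQL.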


In [22]:
# calculate zonal stats per country and write to a SQL database
stats = ['mean', 'max', 'min', 'median', 'std']
df = calculate_zonal_stats(
    list_of_stats=stats,
    geodata_path=Path(f'{data_path}/input/WB_countries_Admin0_10m/'),
    raster_path=Path(f'{data_path}/input/airQuality.tif')
)
df = df[['WB_NAME', 'CONTINENT'] + stats].dropna()
db = data_to_sql(df = df, name_of_database = 'airxcntry', if_exists_replace = False)
db

<langchain_community.utilities.sql_database.SQLDatabase at 0x79b57a9c9270>

In [26]:
# inspect the data
df

Unnamed: 0,WB_NAME,CONTINENT,mean,max,min,median,std
0,Indonesia,Asia,14.905112,36.920002,6.760000,14.260000,3.897446
1,Malaysia,Asia,12.393976,22.320000,7.040000,12.220000,2.588654
2,Chile,South America,16.120725,35.599998,4.160000,15.100000,5.632093
3,Bolivia,South America,28.121454,43.380001,6.920000,29.520000,8.072599
4,Peru,South America,27.543493,43.840000,4.660000,28.260000,8.422236
...,...,...,...,...,...,...,...
231,Bahrain,Asia,58.750000,64.139999,54.060001,58.400002,4.297452
233,"Macau (SAR, China)",Asia,24.799999,24.799999,24.799999,24.799999,0.000000
234,Bonaire (Neth.),Europe,12.380000,13.120000,11.640000,12.380000,0.604207
237,Netherlands,Europe,10.757464,12.320000,9.120000,10.840000,0.663207


## Define unit tests

In [27]:
# Potential queries
q_simple = "What is the average and standard deviation of air quality in Indonesia?"
q_relative = "Which five countries have the worst air quality?"
q_summarize = "What is the average air quality on each continent?"
q_filter = "In how many countries did maximum air quality surpass 100 micrograms per cubic meter?"

print(q_simple)

What is the average and standard deviation of air quality in Indonesia?


In [28]:
# Manually find answers from dataframe
answer_key = {}

# 1) "What is the average and standard deviation of air quality in Indonesia?"
answer_key = answer_key | {
    'q_simple': df[df['WB_NAME'] == "Indonesia"][['mean', 'std']].round(2).to_dict('records')
}

# 2) "Which five countries have the worst air quality?"
answer_key = answer_key | {
    'q_relative': df.sort_values('mean', ascending=False).head(5)['WB_NAME'].values.tolist()
}

# 3) "What is the average air quality on each continent?"
answer_key = answer_key | {
    'q_summarize': df.groupby('CONTINENT')['mean'].mean().round(2).to_dict()
}

# 4) "In how many countries did maximum air quality surpass 100 micrograms per cubic meter?"
answer_key = answer_key | {
    'q_filter': df[df['max'] > 100]['WB_NAME'].count()
}

In [29]:
answer_key

{'q_simple': [{'mean': 14.91, 'std': 3.9}],
 'q_relative': ['Qatar',
  'Bangladesh',
  'Nigeria',
  'Saudi Arabia',
  'United Arab Emirates'],
 'q_summarize': {'Africa': 29.7,
  'Asia': 31.31,
  'Europe': 12.78,
  'North America': 11.88,
  'Oceania': 5.72,
  'Seven seas (open ocean)': 7.44,
  'South America': 17.68},
 'q_filter': 10}
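
Because the answer key is rounded to two decimals while the database stores full-precision floats, any automated comparison has to round first. The following is a minimal sketch of such a check, assuming the SQL result arrives as a stringified list of tuples (the form shown later in this notebook). The helper `result_matches` is hypothetical and not part of the notebook.

```python
import ast

def result_matches(sql_result: str, expected: list, ndigits: int = 2) -> bool:
    """Compare a stringified SQL result with rounded expected values."""
    rows = ast.literal_eval(sql_result)  # "[(1.0, 2.0)]" -> [(1.0, 2.0)]
    rounded = [tuple(round(v, ndigits) for v in row) for row in rows]
    return rounded == [tuple(row) for row in expected]

# Result string as returned for q_simple, and the rounded answer-key values
raw = "[(14.905112004829341, 3.897445770948378)]"
expected = [(14.91, 3.9)]

print(result_matches(raw, expected))  # prints True
```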

## Build LLM

We will eventually chain together agents that a) turn a question into a SQL query, b) run that SQL query, and c) return the SQL result as a natural language response. First, let's explore each agent in turn.

Resources:
* Documentation: https://python.langchain.com/docs/tutorials/sql_qa
* Example Notebook for Running Ollama in Kaggle: https://www.kaggle.com/code/sumanmichael/ollama-langchain-in-kaggle-gpu



In [33]:
# download and install ollama
!curl -fsSL https://ollama.com/install.sh | sh

>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> The Ollama API is now available at 127.0.0.1:11434.
>>> Install complete. Run "ollama" from the command line.


In [36]:
# to support background processes in Kaggle
import os
get_ipython().system = os.system
print('done')

done


In [37]:
# start ollama in the background
!ollama serve &

0

In [38]:
# use ollama to pull the Llama v3.1
!ollama pull llama3.1

0

In [40]:
from langchain_ollama import ChatOllama
from langchain.chains import create_sql_query_chain
from langchain_core.prompts import PromptTemplate
from langchain_community.tools.sql_database.tool import QuerySQLDataBaseTool
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

print('imported')

imported


In [42]:
# Load model
model = ChatOllama(model="llama3.1", seed=123)
model

ChatOllama(model='llama3.1', seed=123)

#### a) Generate SQL query

In [43]:
# a) Generate SQL query (default)
sql_agent_default = create_sql_query_chain(model, db)
sql_code_default = sql_agent_default.invoke({"question": q_simple})
print(sql_code_default)

Question: What is the average and standard deviation of air quality in Indonesia?
SQLQuery: SELECT "mean", "std" FROM "airxcntry" WHERE "WB_NAME" = 'Indonesia'


**Question for you: Is this SQL query going to run successfully? Why or why not?**

**Your answer:**

In [44]:
db.run(sql_code_default)

OperationalError: (sqlite3.OperationalError) near "Question": syntax error
[SQL: Question: What is the average and standard deviation of air quality in Indonesia?
SQLQuery: SELECT "mean", "std" FROM "airxcntry" WHERE "WB_NAME" = 'Indonesia']
(Background on this error at: https://sqlalche.me/e/20/e3q8)

Oops! It hit an error. Something is wrong with our query. Let's see if we can do some prompt engineering to improve the model's output.
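
The traceback shows that the whole model output, including the `Question:` and `SQLQuery:` labels, was handed to SQLite. Apart from changing the prompt, one could also strip those labels in code before running the query. This is a minimal sketch of that alternative, not the approach this notebook takes; the helper `extract_sql` is hypothetical.

```python
def extract_sql(llm_output: str, marker: str = "SQLQuery:") -> str:
    """Return the text after the last marker, or the whole output if absent."""
    tail = llm_output.rpartition(marker)[2]  # whole string if marker is absent
    return tail.strip()

raw_output = (
    "Question: What is the average and standard deviation of air quality in Indonesia?\n"
    'SQLQuery: SELECT "mean", "std" FROM "airxcntry" WHERE "WB_NAME" = \'Indonesia\''
)

print(extract_sql(raw_output))
```

Post-processing like this is brittle if the model changes its labels, which is one reason to also tighten the prompt.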

In [45]:
# Default prompt
sql_prompt_default = sql_agent_default.get_prompts()[0]
sql_prompt_default.pretty_print()

You are a SQLite expert. Given an input question, first create a syntactically correct SQLite query to run, then look at the results of the query and return the answer to the input question.
Unless the user specifies in the question a specific number of examples to obtain, query for at most 5 results using the LIMIT clause as per SQLite. You can order the results to return the most informative data in the database.
Never query for all columns from a table. You must query only the columns that are needed to answer the question. Wrap each column name in double quotes (") to denote them as delimited identifiers.
Pay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.
Pay attention to use date('now') function to get the current date, if the question involves "today".

Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result

##### Prompt engineering

Remove the following instructions, which may be confusing the model: 
* Pay attention to use date('now') function to get the current date, if the question involves "today".
* Wrap each column name in double quotes (") to denote them as delimited identifiers.

And add the following instructions based on how we see the model interpreting our request. 
* Return ONLY the SQLQuery with no other context given. Do not include the question in the output. Do not include the title 'SQLQuery' in the output.


In [47]:
# Improved prompt
sql_prompt_text = '''
    You are a SQLite expert. Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer to the input question.
    Unless the user specifies in the question a specific number of examples to obtain, query for at most {top_k} results using the LIMIT clause as per SQLite. You can order the results to return the most informative data in the database.
    Never query for all columns from a table. You must query only the columns that are needed to answer the question. 
    Pay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.
    Return ONLY the SQLQuery with no other context given. Do not include the question in the output. Do not include the title 'SQLQuery' in the output.
    
    Use the following format:
    Question: Question here
    SQLQuery: SQL Query to run
    SQLResult: Result of the SQLQuery
    Answer: Final answer here
    
    Only use the following tables:
    {table_info}
    
    Question: {input}
'''

sql_prompt = PromptTemplate(
    input_variables = ["input", "table_info", "top_k", "dialect"],
    template=sql_prompt_text
)

sql_prompt

PromptTemplate(input_variables=['dialect', 'input', 'table_info', 'top_k'], input_types={}, partial_variables={}, template="\n    You are a SQLite expert. Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer to the input question.\n    Unless the user specifies in the question a specific number of examples to obtain, query for at most {top_k} results using the LIMIT clause as per SQLite. You can order the results to return the most informative data in the database.\n    Never query for all columns from a table. You must query only the columns that are needed to answer the question. \n    Pay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.\n    Return ONLY the SQLQuery with no other context given. Do not include the question in the output. Do not include the title

In [48]:
# New agent based on improved prompt
sql_agent = create_sql_query_chain(model, db, prompt=sql_prompt)
sql_agent

RunnableAssign(mapper={
  input: RunnableLambda(...),
  table_info: RunnableLambda(...)
})
| RunnableLambda(lambda x: {k: v for (k, v) in x.items() if k not in ('question', 'table_names_to_use')})
| PromptTemplate(input_variables=['input', 'table_info'], input_types={}, partial_variables={'dialect': 'sqlite', 'top_k': '5'}, template="\n    You are a SQLite expert. Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer to the input question.\n    Unless the user specifies in the question a specific number of examples to obtain, query for at most {top_k} results using the LIMIT clause as per SQLite. You can order the results to return the most informative data in the database.\n    Never query for all columns from a table. You must query only the columns that are needed to answer the question. \n    Pay attention to use only the column names you can see in the tables below. Be careful to not query 
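
The output above lists `dialect` and `top_k` under `partial_variables`, while `input` and `table_info` remain as input variables. As an analogy only (this uses plain `str.format`, not LangChain's internals), the sketch below shows how fixed and per-question values end up in one prompt string. The shortened template and the schema string are hypothetical stand-ins.

```python
template = (
    "Create a syntactically correct {dialect} query, "
    "returning at most {top_k} results.\n"
    "Only use the following tables:\n{table_info}\n"
    "Question: {input}"
)

# Values fixed once when the chain is built (shown above as partial_variables)
partial_variables = {"dialect": "sqlite", "top_k": "5"}

# Values supplied per question; this schema string is a hypothetical stand-in
runtime_variables = {
    "table_info": "CREATE TABLE airxcntry (WB_NAME TEXT, mean REAL, std REAL)",
    "input": "What is the average and standard deviation of air quality in Indonesia?",
}

prompt_text = template.format(**partial_variables, **runtime_variables)
print(prompt_text)
```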

Now let's see what our new model returns when asked to generate SQL code.


In [50]:
sql_code = sql_agent.invoke({"question": q_simple})
print(sql_code)
print(db.run(sql_code))

print("\nAnswer key:")
answer_key['q_simple']

SELECT mean, std FROM airxcntry WHERE WB_NAME = 'Indonesia'
[(14.905112004829341, 3.897445770948378)]

Answer key:


[{'mean': 14.91, 'std': 3.9}]

Yay! The query runs and matches the answer key. We haven't actually built an agent to execute the SQL query yet, so let's do that now and chain it to the agent that writes SQL code.


#### b) Execute SQL query

In [51]:
# b) Execute SQL query 
execute_sql_query = QuerySQLDataBaseTool(db=db)
write_sql_query = create_sql_query_chain(model, db, prompt=sql_prompt)
sql_chain = write_sql_query | execute_sql_query
sql_chain.invoke({"question": q_simple})

'[(14.905112004829341, 3.897445770948378)]'

#### c) Natural language response

In [53]:
# Prompt for agent that will interact with user
answer_prompt = PromptTemplate.from_template(
    """Given the following user question, corresponding SQL query, and SQL result, answer the user question.

    Question: {question}
    SQL Query: {query}
    SQL Result: {result}
    Answer:
    
    """
)
answer_prompt

PromptTemplate(input_variables=['query', 'question', 'result'], input_types={}, partial_variables={}, template='Given the following user question, corresponding SQL query, and SQL result, answer the user question.\n\n    Question: {question}\n    SQL Query: {query}\n    SQL Result: {result}\n    Answer:\n    \n    ')

In [54]:
# Chain all agents together
chain = (
    RunnablePassthrough.assign(query=write_sql_query).assign(
        result=itemgetter("query") | execute_sql_query
    )
    | answer_prompt | model | StrOutputParser()
)
chain

RunnableAssign(mapper={
  query: RunnableAssign(mapper={
           input: RunnableLambda(...),
           table_info: RunnableLambda(...)
         })
         | RunnableLambda(lambda x: {k: v for (k, v) in x.items() if k not in ('question', 'table_names_to_use')})
         | PromptTemplate(input_variables=['input', 'table_info'], input_types={}, partial_variables={'dialect': 'sqlite', 'top_k': '5'}, template="\n    You are a SQLite expert. Given an input question, first create a syntactically correct {dialect} query to run, then look at the results of the query and return the answer to the input question.\n    Unless the user specifies in the question a specific number of examples to obtain, query for at most {top_k} results using the LIMIT clause as per SQLite. You can order the results to return the most informative data in the database.\n    Never query for all columns from a table. You must query only the columns that are needed to answer the question. \n    Pay attention to use o

In [55]:
# Test chain
final_response = chain.invoke({"question": q_simple})
print(final_response)

The average air quality in Indonesia is approximately 14.91, and the standard deviation of air quality in Indonesia is approximately 3.90.
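
To see how data flows through the three steps without needing a running model, here is a minimal plain-Python sketch. The functions `write_query` and `write_answer` are hypothetical stand-ins for the two LLM calls and return hard-coded text; the table holds a single row copied from the dataframe preview.

```python
import sqlite3

# Toy table with one row copied from the dataframe preview
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airxcntry (WB_NAME TEXT, mean REAL, std REAL)")
conn.execute("INSERT INTO airxcntry VALUES ('Indonesia', 14.905112, 3.897446)")

def write_query(state: dict) -> str:
    # Hypothetical stand-in for the LLM that writes SQL
    return "SELECT mean, std FROM airxcntry WHERE WB_NAME = 'Indonesia'"

def run_query(state: dict) -> list:
    # Executes whatever SQL the previous step produced
    return conn.execute(state["query"]).fetchall()

def write_answer(state: dict) -> str:
    # Hypothetical stand-in for the LLM that phrases the answer
    mean, std = state["result"][0]
    return f"Mean {mean:.2f}, standard deviation {std:.2f}."

state = {"question": "What is the average and standard deviation of air quality in Indonesia?"}
state["query"] = write_query(state)   # step a: question -> SQL
state["result"] = run_query(state)    # step b: SQL -> rows
answer = write_answer(state)          # step c: rows -> sentence
print(answer)
```

The dictionary accumulates `query` and `result` alongside the original `question`, which mirrors the three fields the answer prompt expects.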


## Run LLM on possible queries

In [56]:
# q_relative 
# q_relative = "Which five countries have the worst average air quality?"

print(q_relative, "\n")

r_relative = chain.invoke({"question": q_relative})
print(r_relative)

print("\nAnswer key:", answer_key['q_relative'])

print("\nSQL code written:", sql_agent.invoke({"question": q_relative}))


Which five countries have the worst air quality? 

Based on the SQL result, it appears that the countries with the worst air quality are:

1. Chad
2. India
3. Nigeria
4. Niger
5. China

These five countries have the lowest standard deviation (std) in their air quality data, indicating that they tend to have more consistent levels of poor air quality.

So, my answer is: The five countries with the worst air quality are Chad, India, Nigeria, Niger, and China.

Answer key: ['Qatar', 'Bangladesh', 'Nigeria', 'Saudi Arabia', 'United Arab Emirates']

SQL code written: SELECT WB_NAME FROM airxcntry ORDER BY std DESC LIMIT 5


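##### Human assessment: A little verbose, and the SQL shown sorts on the `std` column instead of `mean`, so the ranking does not match the answer key. The column choice should be standardized. Greatly improved by the user adding more context (for example, asking for the worst *average* air quality).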
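
The SQL written above orders by `std`, whereas the answer key sorts on `mean`. The sketch below contrasts the two orderings on a handful of rows copied from the dataframe preview. It is only an illustration on a small subset, so the names it prints are not the real top five.

```python
import sqlite3

# Rows copied from the dataframe preview (a small subset, not the full table)
rows = [
    ("Indonesia", 14.905112, 3.897446),
    ("Bolivia", 28.121454, 8.072599),
    ("Peru", 27.543493, 8.422236),
    ("Bahrain", 58.750000, 4.297452),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airxcntry (WB_NAME TEXT, mean REAL, std REAL)")
conn.executemany("INSERT INTO airxcntry VALUES (?, ?, ?)", rows)

# Ordering used in the model's query
by_std = conn.execute(
    "SELECT WB_NAME FROM airxcntry ORDER BY std DESC LIMIT 2").fetchall()
# Ordering that matches the answer key
by_mean = conn.execute(
    "SELECT WB_NAME FROM airxcntry ORDER BY mean DESC LIMIT 2").fetchall()

print("by std: ", [name for (name,) in by_std])
print("by mean:", [name for (name,) in by_mean])
```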

In [57]:
# q_summarize

print(q_summarize, "\n")

r_summarize = chain.invoke({"question": q_summarize})
print(r_summarize)

print("\nAnswer key:", answer_key['q_summarize'])

print("\nSQL code written:", sql_agent.invoke({"question": q_summarize}))


What is the average air quality on each continent? 

The average air quality on each continent is as follows:

* Africa: 22.63
* Asia: 14.91
* Europe: 9.92
* North America: 16.97
* Oceania: 12.76
* Seven seas (open ocean): 4.88
* South America: 16.12

Answer key: {'Africa': 29.7, 'Asia': 31.31, 'Europe': 12.78, 'North America': 11.88, 'Oceania': 5.72, 'Seven seas (open ocean)': 7.44, 'South America': 17.68}

SQL code written: SELECT CONTINENT, mean FROM airxcntry GROUP BY CONTINENT


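##### Human assessment: Incorrect! The query groups by continent but selects the bare `mean` column without an aggregate such as `AVG(mean)`, so it looks like it is just taking the mean of the first entry in each continent.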
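
A query that matches the answer key's `groupby(...).mean()` would aggregate within each group. The sketch below shows this with `AVG(mean)` on five rows copied from the dataframe preview. Because it uses only a subset of countries, the averages it prints will differ from the answer key.

```python
import sqlite3

# Rows copied from the dataframe preview (a small subset, not the full table)
rows = [
    ("Indonesia", "Asia", 14.905112),
    ("Malaysia", "Asia", 12.393976),
    ("Chile", "South America", 16.120725),
    ("Bolivia", "South America", 28.121454),
    ("Peru", "South America", 27.543493),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE airxcntry (WB_NAME TEXT, CONTINENT TEXT, mean REAL)")
conn.executemany("INSERT INTO airxcntry VALUES (?, ?, ?)", rows)

# AVG() collapses every row in a continent into one value
averages = dict(conn.execute(
    "SELECT CONTINENT, AVG(mean) FROM airxcntry GROUP BY CONTINENT"
).fetchall())

for continent, value in sorted(averages.items()):
    print(continent, round(value, 2))
```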

In [58]:
# q_filter

print(q_filter, "\n")

r_filter = chain.invoke({"question": q_filter})
print(r_filter)

print("\nAnswer key:", answer_key['q_filter'])

print("\nSQL code written:", sql_agent.invoke({"question": q_filter}))


In how many countries did maximum air quality surpass 100 micrograms per cubic meter? 

Based on the SQL query and result provided, it appears that there are 10 countries where the maximum air quality exceeded 100 micrograms per cubic meter. 

The answer to the user's question would be:

"There were 10 countries where the maximum air quality surpassed 100 micrograms per cubic meter."

Answer key: 10

SQL code written: SELECT COUNT(index) FROM airxcntry WHERE max > 100


##### Human assessment: Again verbose, although in this run the count of 10 matches the answer key. Not sure its suggestion is the best / most efficient SQL query either; `COUNT(*)` would avoid referencing the `index` column.

## Acknowledgements
Thank you to [Zia Mehrabi](https://www.colorado.edu/envs/zia-mehrabi) for ideating a chatbot that can help people discover human rights violations. Thank you to [Naia Ormaza Zulueta](https://www.colorado.edu/envs/naia-ormaza-zulueta) for compiling the datasets and thank you to [Katie Fankhauser](https://www.linkedin.com/in/katie-fankhauser/) for writing most of the code. [Isaiah Lyons-Galante](https://www.colorado.edu/envs/isaiah-lyons-galante) adapted the notebook for Kaggle and for the AI for Good course. All contributions were made as members of the Better Planet Lab.

## Assignment

## Bonus Assignment