Biodiversity report generator
---
This notebook shows the inner workings of the report generator. It combines "greeneryDataForNeighborhoodV2" and "retrieveGreeneryGoals" to produce an overview of a greenery goal and the progress made towards it.


# Step 0: Imports


In [74]:
from langchain.chains.sequential import SequentialChain
from langchain.chains.retrieval_qa.base import RetrievalQA
from langchain.chains.llm import LLMChain
from langchain.chains import TransformChain

from langchain_openai import AzureChatOpenAI
from langchain_ollama import ChatOllama
from langchain_huggingface import HuggingFaceEmbeddings

from langchain.memory import SimpleMemory
from langchain_community.agent_toolkits import create_sql_agent
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

from langchain_core.prompts import PromptTemplate
from langchain_community.utilities.sql_database import SQLDatabase
from langchain_community.agent_toolkits.sql.toolkit import SQLDatabaseToolkit
from langchain_community.docstore import InMemoryDocstore

from sqlalchemy import create_engine
from sqlalchemy.pool import StaticPool
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from uuid import uuid4

import requests
import os
import sqlite3
import faiss

# Step 1: Choose a large language model and embedding model
First, the LLM and the embedding model have to be chosen. There are many options; a few have been tested here, and you can switch between them by running the corresponding code block.

## LLM
Choose one of the implemented LLMs: Llama3, o3-mini or gpt-4o-mini.


Option A: Llama3 (via Ollama, running locally)

In [16]:
chosen_llm = ChatOllama(base_url='http://localhost:11434', model="llama3")

Option B: o3-mini (via Microsoft Azure, running in the cloud)

In [17]:
load_dotenv()

chosen_llm = AzureChatOpenAI(model="o3-mini", api_version="2025-01-01-preview",
                             azure_endpoint="https://56948-m9bdjgpg-eastus2.cognitiveservices.azure.com/openai/deployments/o3-mini/chat/completions?api-version=2025-01-01-preview",
                             api_key=os.environ.get("AZURE_OPENAI_API_KEY"))

Option C: gpt-4o-mini (via Microsoft Azure, running in the cloud)

In [18]:
load_dotenv()

chosen_llm = AzureChatOpenAI(model="gpt-4o-mini", api_version="2025-01-01-preview",
                             azure_endpoint="https://56948-m9bdjgpg-eastus2.cognitiveservices.azure.com/openai/deployments/gpt-4o-mini/chat/completions?api-version=2025-01-01-preview",
                             api_key=os.environ.get("AZURE_OPENAI_API_KEY"))

## Embedding model
Choose one of the implemented models: all-MiniLM-L6-v2 or all-mpnet-base-v2

Option A: all-MiniLM-L6-v2 (faster, but lower embedding quality)

In [19]:
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

Option B: all-mpnet-base-v2 (slower, but higher embedding quality)

In [20]:
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Step 2: Create the chains
The second step is to create the chains that will retrieve the information that is needed to generate biodiversity reports. This information consists of a greenery goal and a greenery percentage of a specific municipality.

## Chain A: Retrieve greenery goal
The first chain retrieves a greenery goal using a Q&A chain that applies retrieval-augmented generation (RAG) to multiple document chunks. The data is prepared first so that the LLM can analyse it more easily; the chain that will (later) analyse the prepared data is created afterwards.

### Data preparation: save document chunks in a vector database
First, a PDF that contains a greenery goal is loaded. Its text is extracted and split into smaller chunks. Then a vector database is created and the chunks are stored in it.

In [21]:
# Load file
file_path = "./GroenvisieSchiedam.pdf"
doc_reader = PdfReader(file_path)

# Extract page content
raw_text = ''
for page in doc_reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text

print("Amount of characters raw text: " + str(len(raw_text)))

# Split the page content into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap  = 200,
)
chunks = text_splitter.split_text(raw_text)

print("Amount of characters first chunk: " + str(len(chunks[0])))

# Create vector store
index = faiss.IndexFlatL2(len(embedding_model.embed_query(chunks[0]))) # Calculates the amount of dimensions the chunk's vector has

vector_store = FAISS(
    embedding_function=embedding_model,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

# Give every chunk its own UUID
uuids = [str(uuid4()) for _ in chunks]

# Create embeddings of chunks and store them in the vector database
vector_store.add_texts(texts=chunks, ids=uuids)

print("\n UUIDs of items stored in vector database: " + str(uuids))

Amount of characters raw text: 9039
Amount of characters first chunk: 627

 UUIDs of items stored in vector database: ['6880a5a3-f171-461e-8c25-6241165b0963', 'ecd9e122-2b13-43f8-b289-3580c545b124', '97a0fc78-6ef8-4b46-9293-2668f3c31916', '73e1e2b4-a5dc-4eef-8405-b2848643dbcc', 'f574cb77-e762-4f19-b9ba-83ebd12e9efd', 'c124a020-b328-4fed-81d0-fcd84520384b', '8ea17db0-c9dd-4f93-95b4-a0300d858e4e', '19fdbbe2-71cf-47b3-8acf-a5dc5d9a0e05', 'b913340d-b782-4d70-81c0-fe52474fc2e6', '35a55909-4e4f-4cf1-b8e6-c0898efe44e7', '52d9d229-69c1-4f36-b227-aa9140347235', 'a79d772f-86b8-479b-b6a4-04d1f6eab5ba']
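
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified, standard-library-only sketch. It uses plain fixed-size windows, which is an assumption made for illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs, lines and spaces, so its chunks are usually shorter than `chunk_size` (the first chunk above has 627 characters, not 1000). The helper name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window moves forward
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
```

Each piece starts with the last four characters of the previous one, so a sentence that straddles a chunk boundary still appears in full in at least one chunk.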


### Create Q&A chain
Create a QA chain that is able to do a similarity search for the 10 most relevant chunks and analyze these chunks to find the percentage asked.

In [120]:
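# Create prompt template (in Dutch, like the source document). In English:
# "You are a document analyst. Use the supplied documents to find the percentage
# that answers the question. Answer with the percentage only, nothing else."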
template = """
Je bent een documentenanalist. Gebruik de aangeleverde documenten om het percentage te vinden dat de vraag beantwoord. Geef als antwoord alleen het percentage, verder niks.

Vraag: {question}
Aangeleverde documenten: {context}
"""
prompt_template = PromptTemplate(input_variables=['context', 'question'], template=template)

# Create Q&A chain
qa_chain = RetrievalQA.from_chain_type(
    llm=chosen_llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 10}), # Retrieves the 10 most similar chunks and inserts them into the {context} variable
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt_template},
    output_key="greeneryGoal"
)

print("Prompt die is meegegeven aan chain: ", prompt_template.format(question="[hier komt de ingevulde vraag te staan]", context="[hier komen de aangeleverde document chunks te staan]"))

Prompt die is meegegeven aan chain:  
Je bent een documentenanalist. Gebruik de aangeleverde documenten om het percentage te vinden dat de vraag beantwoord. Geef als antwoord alleen het percentage, verder niks.

Vraag: [hier komt de ingevulde vraag te staan]
Aangeleverde documenten: [hier komen de aangeleverde document chunks te staan]
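
The retrieval step can be sketched without FAISS to show what "the 10 most relevant chunks" means. Because the index was created as `IndexFlatL2`, the search is an exhaustive comparison by Euclidean (L2) distance. The sketch below is an illustration under stated assumptions: the two-dimensional toy vectors and the helper name `top_k_l2` are hypothetical, whereas real embeddings have hundreds of dimensions.

```python
import math


def top_k_l2(query, vectors, k):
    """Return the indices of the k vectors closest to `query` by Euclidean distance."""
    distances = [(math.dist(query, vec), idx) for idx, vec in enumerate(vectors)]
    return [idx for _, idx in sorted(distances)[:k]]


# Toy "embeddings"; each tuple stands in for one embedded chunk
toy_vectors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.0)]
nearest = top_k_l2((0.0, 0.1), toy_vectors, k=2)
print(nearest)
```

With only 12 chunks in the store, `k=10` passes almost the whole document to the LLM, so the retrieval step filters very little in this notebook.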



## Chain B: Retrieve greenery percentage
The second chain retrieves the greenery percentage of a specific municipality using a SQL-agent chain. As with chain A, the data is prepared first, and the chain that will (later) analyse it is created afterwards.

### Data preparation: store greenery percentages in SQLite database
First, a SQLite database is built, together with a "database engine" object that is used to create the agent's toolkit. The dataset is then retrieved via an API request and stored in the database.

In [89]:
# Create database
con = sqlite3.connect("greeneryPercentages.db", check_same_thread=False)
cur = con.cursor()

# Create municipalities table in database
try:
    cur.execute("CREATE TABLE municipalities(name varchar(255), greeneryPercentage float)")
    print("Table created successfully.")

except sqlite3.OperationalError:
    # Raised when the table already exists: drop it and start again with an empty table
    cur.execute("DROP TABLE municipalities")
    cur.execute("CREATE TABLE municipalities(name varchar(255), greeneryPercentage float)")
    print("Table has been reset, because it already existed.")

# Create database engine object
engine = create_engine(
    "sqlite://",
    creator=lambda: con,
    poolclass=StaticPool,
    connect_args={"check_same_thread": False},
)
db = SQLDatabase(engine)

# Retrieve dataset
dataset = requests.get('https://data.rivm.nl/geo/ank/ows?service=WFS&request=GetFeature&typeName=rivm_2022_groenpercentage_kaart_per_gemeente&propertyName=gm_naam,_mean&outputFormat=json').json()

# Extract (name, percentage) pairs from the features and put them into a list
properties = []
for feature in dataset["features"]:
    props = feature["properties"]  # named `props` to avoid shadowing the built-in `property`
    properties.append((props["gm_naam"], props["_mean"]))

# Save the items stored in the list into the database
cur.executemany("INSERT INTO municipalities VALUES (?,?)", properties)
con.commit()

Table has been reset, because it already existed.
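
The extraction step can be tried without a network connection. The sketch below is a minimal illustration, assuming a hand-written response in the same shape as the one the cell above reads (`features` → `properties` → `gm_naam` / `_mean`). The municipality names, the values, the `extract_rows` helper and the decision to skip features with missing values are all assumptions, not part of the original pipeline.

```python
import sqlite3

# Hand-written sample in the shape read by the cell above; the values are placeholders
sample_response = {
    "features": [
        {"properties": {"gm_naam": "Gemeente A", "_mean": 41.5}},
        {"properties": {"gm_naam": "Gemeente B", "_mean": None}},
        {"properties": {"gm_naam": "Gemeente C", "_mean": 63.0}},
    ]
}


def extract_rows(response):
    """Return (name, percentage) tuples, skipping features with missing values."""
    rows = []
    for feature in response.get("features", []):
        props = feature.get("properties", {})
        name, mean = props.get("gm_naam"), props.get("_mean")
        if name is not None and mean is not None:
            rows.append((name, float(mean)))
    return rows


rows = extract_rows(sample_response)

# Store the rows in a throwaway in-memory database with the same table layout
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE municipalities(name varchar(255), greeneryPercentage float)")
demo_con.executemany("INSERT INTO municipalities VALUES (?, ?)", rows)
demo_con.commit()

stored = demo_con.execute("SELECT COUNT(*) FROM municipalities").fetchone()[0]
print(stored)
```

Skipping incomplete features keeps `NULL` percentages out of the table, which may make the agent's answers easier to interpret.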


### Create SQL-agent chain
Create a SQL-agent chain that inspects the created database, writes a SQL query for the question, and executes that query to answer it.

In [126]:
# Create toolkit:
toolkit = SQLDatabaseToolkit(db=db,llm=chosen_llm)

# Create agent chain
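sql_agent_chain = create_sql_agent(
    llm=chosen_llm,
    toolkit=toolkit,
    verbose=False,
    output_key="greeneryPercentage",
    # Python spelling of the AgentExecutor option (handleParsingErrors is the JavaScript spelling)
    agent_executor_kwargs={"handle_parsing_errors": True}
)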

# Transform function that returns the result of the agent in a dictionary (making it compatible with the other chains)
def run_sql_agent(inputs):
    result = sql_agent_chain.run(inputs["greeneryPercentageInputPrompt"])
    return {"greeneryPercentage": result}

# Transform Chain that calls the transform function
greeneryPercentageChain = TransformChain(
    input_variables=["greeneryPercentageInputPrompt"],
    output_variables=["greeneryPercentage"],
    transform=run_sql_agent
)
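
For a prompt such as "Vind het groenpercentage van gemeente Schiedam", the agent has to arrive at a lookup on the `municipalities` table. The notebook does not show the SQL the agent actually generates, so the query below is an assumption about its general shape, run against a throwaway in-memory database. The Schiedam value is taken from the report output in this notebook; "Gemeente X" and the helper `lookup_percentage` are hypothetical.

```python
import sqlite3

# Throwaway in-memory database with the same table layout as greeneryPercentages.db
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE municipalities(name varchar(255), greeneryPercentage float)")
demo_con.executemany(
    "INSERT INTO municipalities VALUES (?, ?)",
    [("Schiedam", 46.52), ("Gemeente X", 30.0)],  # second row is a placeholder
)


def lookup_percentage(connection, municipality):
    """Return the stored greenery percentage, or None when the name is unknown."""
    row = connection.execute(
        "SELECT greeneryPercentage FROM municipalities WHERE name = ? LIMIT 1",
        (municipality,),
    ).fetchone()
    return row[0] if row else None


print(lookup_percentage(demo_con, "Schiedam"))
```

A fixed, parameterised query like this is cheaper and more predictable than an agent. The agent becomes useful when the questions vary or the schema is not known in advance.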

## Chain C: Combine greenery goal and greenery percentage
The third chain will analyse a greenery goal and percentage and write a few lines about the progress towards the goal.


In [125]:
# Create prompt template (in Dutch). It asks for a formal summary of at most 150 words
# that lists the municipality, the goal and the current greenery percentage
template = """
Je bent een biodiversiteitsanalist. Je krijgt een groenpercentage van een gemeente en de doelstelling van deze gemeente, aan jou de taak om deze informatie te tonen aan de gebruiker en te vertellen hoever de doelstelling behaald is. Beschrijf de resultaten in maximaal 150 woorden en zorg dat je een formele schrijfstijl gebruikt. Het format hiervoor is als volgt:
'Gemeente: <de gemeente waar je het over hebt>
Doelstelling: <de groendoelstelling>
Huidige hoeveelheid groen: <het groenpercentage>

<jouw resultatenbeschrijving>'


Gemeente: {municipality}
Groenpercentage: {greeneryPercentage}
Groendoelstelling: {greeneryGoal}
"""
prompt_template = PromptTemplate(input_variables=["municipality", "greeneryPercentage", "greeneryGoal"], template=template)

analysis_chain = LLMChain(
    llm=chosen_llm,
    prompt=prompt_template,
    output_key="greeneryAnalysis"
)

print("Prompt die is meegegeven aan chain: ", prompt_template.format(municipality="[hier komt de gemeente te staan die is ingevoerd bij het starten van de sequential chain]", greeneryPercentage="[hier komt het huidige groenpercentage te staan dat gevonden is in chain B]", greeneryGoal="[hier komt de groendoelstelling te staan dat gevonden is in chain A]"))

Prompt die is meegegeven aan chain:  
Je bent een biodiversiteitsanalist. Je krijgt een groenpercentage van een gemeente en de doelstelling van deze gemeente, aan jou de taak om deze informatie te tonen aan de gebruiker en te vertellen hoever de doelstelling behaald is. Beschrijf de resultaten in maximaal 150 woorden en zorg dat je een formele schrijfstijl gebruikt. Het format hiervoor is als volgt:
'Gemeente: <de gemeente waar je het over hebt>
Doelstelling: <de groendoelstelling>
Huidige hoeveelheid groen: <het groenpercentage>

<jouw resultatenbeschrijving>'


Gemeente: [hier komt de gemeente te staan die is ingevoerd bij het starten van de sequential chain]
Groenpercentage: [hier komt het huidige groenpercentage te staan dat gevonden is in chain B]
Groendoelstelling: [hier komt de groendoelstelling te staan dat gevonden is in chain A]
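
The comparison that chain C asks the LLM to describe can also be computed directly, which gives a way to check the figures in the generated text. The sketch below is an illustration only: the helper names `parse_percentage` and `goal_progress` are hypothetical, and it assumes that chains A and B return strings containing a single number such as "40%" or "46.52%".

```python
import re


def parse_percentage(text):
    """Extract the first number from strings such as '40%' or '46,52 %'."""
    match = re.search(r"(\d+(?:[.,]\d+)?)", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    return float(match.group(1).replace(",", "."))


def goal_progress(goal_text, current_text):
    """Compare the current greenery percentage with the goal."""
    goal = parse_percentage(goal_text)
    current = parse_percentage(current_text)
    return {
        "goal": goal,
        "current": current,
        "difference_pp": round(current - goal, 2),  # in percentage points
        "achieved": current >= goal,
    }


progress = goal_progress("40%", "46.52%")
print(progress)
```

The difference is expressed in percentage points: 46.52% against a goal of 40% is 6.52 percentage points above the goal, not 6.52 percent more greenery.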



# Step 3: Combining the percentage and goal
The third step is to create a sequential chain that combines the SQL-agent chain, the Q&A chain and the analysis chain. The sequential chain passes the output of each chain on to the next one through the input variables of the prompt templates.

## Create sequential chain
Create the sequential chain that combines the three chains created earlier.

In [128]:
# Create combined chain
greeneryChain = SequentialChain(
    memory=SimpleMemory(),
    chains=[greeneryPercentageChain, qa_chain, analysis_chain],
    input_variables=["greeneryPercentageInputPrompt", "municipality", "query"],
    output_variables=["greeneryAnalysis"],
    verbose=True
)

municipality = "Schiedam"

# Execute combined chain
result = greeneryChain.invoke({
    "greeneryPercentageInputPrompt": "Vind het groenpercentage van gemeente " + municipality,
    "municipality": municipality,
    "query": "Wat is het percentage dat de gemeente wil realiseren met een natuurlijke opbouw?",
})
print(result['greeneryAnalysis'])



> Entering new SequentialChain chain...

> Finished chain.
Gemeente: Schiedam  
Doelstelling: 40%  
Huidige hoeveelheid groen: 46.52%  

De gemeente Schiedam heeft haar groendoelstelling van 40% ruimschoots behaald, met een huidig groenpercentage van 46.52%. Dit vertegenwoordigt een overschot van 6.52% ten opzichte van de gestelde doelstelling. De aanzienlijke aanwezigheid van groen in Schiedam weerspiegelt een succesvolle inzet voor biodiversiteit en milieubehoud. Het behalen van deze doelstelling biedt een solide basis voor toekomstig beleid, gericht op het verbeteren van de leefomgeving en het bevorderen van flora en fauna. Dit resultaat bevestigt de effectiviteit van de genomen maatregelen tot nu toe, en biedt tegelijkertijd een stimulans voor het continu verbeteren van de groene infrastructuur in de gemeente.
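
The data flow of the sequential chain can be sketched with plain functions. This is a simplified model, not LangChain's implementation: every step reads from a shared dictionary and adds its own output key, which is how `greeneryPercentage` and `greeneryGoal` become available to the analysis prompt. The `fake_*` steps are hypothetical stand-ins that return fixed strings instead of calling an LLM.

```python
def run_sequential(steps, inputs):
    """Run the steps in order; each step reads the shared state and returns new keys."""
    state = dict(inputs)
    for step in steps:
        state.update(step(state))
    return state


# Stand-ins for the three chains (no LLM, database or vector store involved)
def fake_percentage_step(state):
    return {"greeneryPercentage": "46.52%"}


def fake_goal_step(state):
    return {"greeneryGoal": "40%"}


def fake_analysis_step(state):
    summary = f"{state['municipality']}: {state['greeneryPercentage']} vs goal {state['greeneryGoal']}"
    return {"greeneryAnalysis": summary}


final_state = run_sequential(
    [fake_percentage_step, fake_goal_step, fake_analysis_step],
    {"municipality": "Schiedam"},
)
print(final_state["greeneryAnalysis"])
```

In this model the first two steps are independent of each other, so their order does not matter. Only the analysis step has to come last, because it needs both earlier outputs.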
