Biodiversity report generator
---
This notebook shows the inner workings of the report generator. It combines the "greeneryDataForNeighborhoodV2" and "retrieveGreeneryGoals" to create an overview of a greenery goal and the progress that has been made to achieve this goal.


# Step 0: Imports


In [105]:
from langchain.chains.sequential import SequentialChain
from langchain.chains.retrieval_qa.base import RetrievalQA
from langchain.chains.llm import LLMChain
from langchain.chains import TransformChain

from langchain_openai import AzureChatOpenAI
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_ollama import ChatOllama
from langchain_huggingface import HuggingFaceEmbeddings

from langchain.memory import SimpleMemory
from langchain.agents import create_sql_agent
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.callbacks.base import BaseCallbackHandler

from langchain_core.prompts import PromptTemplate
from langchain_community.utilities.sql_database import SQLDatabase
from langchain_community.agent_toolkits.sql.toolkit import SQLDatabaseToolkit
from langchain_community.docstore import InMemoryDocstore

from google import genai
from huggingface_hub import login
from transformers import AutoTokenizer
from sqlalchemy import create_engine
from sqlalchemy.pool import StaticPool
from dotenv import load_dotenv
from PyPDF2 import PdfReader
from uuid import uuid4

import tiktoken
import requests
import os
import sqlite3
import faiss

# Step 1: Choose a large language model and embedding model
First the LLM and embedding model has to be chosen. There are many options, but a few have been tested and can easily be changed by runing it's specific codeblock.

## LLM
Choose one of the implemented LLMs: gemma3 (local), qwen3 (local), o3-mini (cloud), gpt-4o-mini (cloud) or Gemini Flash 2.5 (cloud)


Option A: gemma3:4b (via Ollama, running locally)

In [141]:
chosen_llm = ChatOllama(base_url='http://localhost:11434', model="gemma3:4b")
model_name = "google/gemma-3-4b-it"
huggingFaceToken = os.environ.get("HF_token")



login(token = huggingFaceToken)

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

Option B: qwen3:8b (via Ollama, running locally)

In [134]:
chosen_llm = ChatOllama(base_url='http://localhost:11434', model="qwen3:8b")
model_name = "Qwen/Qwen3-8B"

Option C: o3-mini (via Microsoft Azure, running in cloud):

In [147]:
load_dotenv()

chosen_llm = AzureChatOpenAI(model ="o3-mini", api_version="2025-01-01-preview",azure_endpoint="https://56948-m9bdjgpg-eastus2.cognitiveservices.azure.com/openai/deployments/o3-mini/chat/completions?api-version=2025-01-01-preview", api_key=os.environ.get("AZURE_OPENAI_API_KEY"))
model_name = "o3-mini"

Option D: gpt-4o-mini (via Microsoft Azure, running in cloud):

In [167]:
load_dotenv()

chosen_llm = AzureChatOpenAI(model="gpt-4o-mini", api_version="2025-01-01-preview",
                             azure_endpoint="https://56948-m9bdjgpg-eastus2.cognitiveservices.azure.com/openai/deployments/gpt-4o-mini/chat/completions?api-version=2025-01-01-preview",
                             api_key=os.environ.get("AZURE_OPENAI_API_KEY"))
model_name = "gpt-4o-mini"

Option E: Gemini 2.5 Flash (via Google AI platform, running in cloud):

In [161]:
load_dotenv()

chosen_llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash-preview-04-17",
                                    api_key=os.environ.get("GOOGLE_API_KEY"))
model_name = "gemini-2.5-flash-preview-04-17"

## Embedding model
Choose one of the implemented models: all-MiniLM-L6-v2 or all-mpnet-base-v2

Option A: all-MiniLM-L6-v2 (faster, but worse)

In [22]:
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

Option B: all-mpnet-base-v2 (slower, but better)

In [23]:
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

## Step 2: initialize token counters
In this step the token counters for all chains will be initialized. Every model has a different tokenizer, so it is important to select the correct one for each model.

In [168]:
# Choose corresponding tokenizer
if model_name=="Qwen/Qwen3-8B":
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer_func = tokenizer.encode
elif model_name=="google/gemma-3-4b-it":
    tokenizer = AutoTokenizer.from_pretrained(model_name, token=huggingFaceToken)
    tokenizer_func = tokenizer.encode
elif model_name=="o3-mini" or model_name=="gpt-4o-mini":
    tokenizer_func = tiktoken.encoding_for_model(model_name).encode
elif model_name=="gemini-2.5-flash-preview-04-17":
    tokenizer_func = genai.Client().models.count_tokens

# Token counter greenery goal chain
def greeneryGoalTokenCounter(tokenizer_function, prompt):
    countedTokens = 0
    # Count the tokens for the prompt(s)
    if model_name=="gemini-2.5-flash-preview-04-17":
        tokens = tokenizer_function(model=model_name, contents=prompt)
        countedTokens += tokens.total_tokens
    else:
        countedTokens += len(tokenizer_func(prompt))

    return countedTokens

# Token counter callback greenery percentage chain
class TokenCountingHandler(BaseCallbackHandler):
    def __init__(self, tokenizer_function):
        self.tokenizer_function = tokenizer_function
        self.countedTokens = 0

    def on_llm_start(self, serialized, prompts, **kwargs):
        # Count the tokens for the prompt(s)
        for prompt in prompts:
            if model_name=="gemini-2.5-flash-preview-04-17":
                tokens = self.tokenizer_function(model=model_name, contents=prompt)
                self.countedTokens += tokens.total_tokens
            else:
                self.countedTokens += len(self.tokenizer_function(prompt))

    def on_llm_end(self, response, **kwargs):
        # Count the tokens for the respons(es)
        if 'generations' in response:
            for generation in response['generations']:
                text = generation[0].text
                self.countedTokens += len(self.tokenizer_function(text))

# Step 3: Create the chains
The second step is to create the chains that will retrieve the information that is needed to generate biodiversity reports. This information consists of a greenery goal and a greenery percentage of a specific municipality.

## Chain A: Retrieve greenery goal
The first chain is able to retrieve a greenery goal. This goal will be retrieved with a Q&A chain that uses RAG to analyse multiple document chunks. The data that the chain needs will first be prepared to help the LLM analyse the information better and after that the chain will be created that will (later) analyse the prepared data.

### Data preparation: save document chunks in a vector database
First a pdf that contains a greenery goal will be loaded. The text of this document will be extracted and will then be chunked into smaller pieces. After that a vector database will be created and the chunks will be saved into it.

In [25]:
# Load file
file_path = "./GroenvisieSchiedam.pdf"
doc_reader = PdfReader(file_path)

# Extract page content
raw_text = ''
for page in doc_reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text

print("Amount of characters raw text: " + str(len(raw_text)))

# Split the page content into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap  = 200,
)
chunks = text_splitter.split_text(raw_text)

print("Amount of characters first chunk: " + str(len(chunks[0])))

# Create vector store
index = faiss.IndexFlatL2(len(embedding_model.embed_query(chunks[0]))) # Calculates the amount of dimensions the chunk's vector has

vector_store = FAISS(
    embedding_function=embedding_model,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

# Give chunks UUIDS
uuids = []
for chunk in range(len(chunks)):
    uuids.append(str(uuid4()))

# Create embeddings of chunks and store them in the vector database
vector_store.add_texts(texts=chunks, ids=uuids)

print("\n UUIDs of items stored in vector database: " + str(uuids))

Amount of characters raw text: 9039
Amount of characters first chunk: 627

 UUIDs of items stored in vector database: ['8625428e-9ac8-4bba-82a2-1895ecd0d9ec', 'ee29895c-89ea-42a5-a668-10b6d889ccc0', '180fef72-7f39-4684-a3b9-d1d2b5162830', '12cc7d36-ddb6-4959-acb8-12f73a14901c', '09c80bba-bb38-49a7-a973-217d4d80555d', 'c077625d-297c-462e-baf8-3e5864157bf1', '2881772f-415c-4f50-b703-8cfc0812ba6c', '70d566ff-32f1-4ad6-a3e5-e9717d71794b', 'f5c2ff4f-7e37-4970-a823-c2f30e710452', '2c52cd16-110e-4fa1-bf7c-c7b463b5af69', 'd7236400-2caf-44ef-93b3-24c5bd1c24a3', '37ee0f05-28dc-4e06-9373-c5a68081c9b8']


### Create Q&A chain
Create a QA chain that is able to do a similarity search for the 10 most relevant chunks and analyze these chunks to find the percentage asked.

In [169]:
# Create prompt template
template = """
Je bent een documentenanalist. Gebruik de aangeleverde documenten om het percentage te vinden dat de vraag beantwoord. Geef als antwoord alleen het percentage, verder niks.

Vraag: {question}
Aangeleverde documenten: {context}
"""
prompt_template = PromptTemplate(input_variables=['context', 'question'], template=template)

# Create Q&A chain
qa_chain = RetrievalQA.from_chain_type(
    llm=chosen_llm,
    chain_type="stuff",
    retriever=vector_store.as_retriever(search_kwargs={"k": 10}), # This will do the similarity search for the 10 most relevant chunks and adds them to the {context} variable
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt_template},
    output_key="greeneryGoal"
)

# Transform function that returns the result of the agent in a dictionary (making it compatible with the other chains)
def run_qa_chain(inputs):
    # Calculate the tokens of the input
    countedTokensInput = greeneryGoalTokenCounter(tokenizer_func, inputs["query"])

    # Run the chain
    result = qa_chain.invoke(inputs["query"])

    # Put all document chunks in one string
    document_contents = ""
    for document in result["source_documents"]:
        document_contents += document.page_content

    # Calculate the tokens of the documents and output
    countedTokensDocuments = greeneryGoalTokenCounter(tokenizer_func, document_contents)
    countedTokensOutput = greeneryGoalTokenCounter(tokenizer_func, result['greeneryGoal'])

    # Calculate the amount of tokens used
    totalTokens = countedTokensInput + countedTokensDocuments + countedTokensOutput
    print(f"Tokens used for retrieving greenery goal: {totalTokens}")

    return {
        "greeneryGoal": result['greeneryGoal'],
        "tokenCount": inputs["tokenCount"] + totalTokens
    }

# Transform Chain that calls the transform function
greeneryGoalChain = TransformChain(
    input_variables=["query", "tokenCount"],
    output_variables=["greeneryGoal"],
    transform=run_qa_chain
)

## Chain B: Retrieve greenery percentage
The second chain is able retreive the greenery percentage of a specific municipality. This is done with a SQL-agent chain. The data that the agent needs will first be prepared to help the agent analyse the information better and after that the chain will be created that will (later) analyse the prepared data.

### Data preparation: store greenery percentages in SQlite database
First the database will be created by building a SQlite database and creating a "database engine" object that will be used for creating the agent's toolkit. As soon as that is done the dataset will be retrieved via an API-request and will then be stored in the database.

In [150]:
# Create database
con = sqlite3.connect("greeneryPercentages.db", check_same_thread=False)
cur = con.cursor()

# Create municipalities table in database
try:
    cur.execute("CREATE TABLE municipalities(name varchar(255), greeneryPercentage float)")
    print("Table created successfully.")

except:
    cur.execute("DROP TABLE municipalities")
    cur.execute("CREATE TABLE municipalities(name varchar(255), greeneryPercentage float)")
    print("Table has been reset, because it already existed.")

# Create database engine object
engine = create_engine(
        "sqlite://",
        creator=lambda: con,
        poolclass=StaticPool,
        connect_args={"check_same_thread": False},
    )
db = SQLDatabase(engine)

# Retrieve dataset
dataset = requests.get('https://data.rivm.nl/geo/ank/ows?service=WFS&request=GetFeature&typeName=rivm_2022_groenpercentage_kaart_per_gemeente&propertyName=gm_naam,_mean&outputFormat=json').json()

# Extract properties and put them into a list
properties = []
for feature in dataset["features"]:
    property = feature["properties"]
    properties.append((property["gm_naam"], property["_mean"]))

# Save the items stored in the list into the database
i=0
while i < len(properties):
    cur.execute("INSERT INTO municipalities VALUES (?,?)", properties[i])
    i = i + 1
con.commit()

Table has been reset, because it already existed.


### Create SQL-agent chain
Create a SQL-agent chain that is able to analyse the created database and create a SQL-query for it to answer the question when it is executed.

In [170]:
# Create toolkit:
toolkit = SQLDatabaseToolkit(db=db,llm=chosen_llm)

# Create agent chain
sql_agent_chain = create_sql_agent(
    llm=chosen_llm,
    toolkit=toolkit,
    verbose=False,
    output_key="greeneryPercentage",
    handleParsingErrors=True
)

# Transform function that returns the result and used tokens in a dictionary (making it compatible with the other chains)
def run_sql_agent(inputs):
    handler = TokenCountingHandler(tokenizer_func)

    result = sql_agent_chain.run(
        inputs["greeneryPercentageInputPrompt"],
        callbacks=[handler]
    )

    print(f"Tokens used for retrieving greenery percentage: {handler.countedTokens}")
    return {
        "greeneryPercentage": result,
        "tokenCount": inputs["tokenCount"] + handler.countedTokens
    }

# Transform Chain that calls the transform function
greeneryPercentageChain = TransformChain(
    input_variables=["greeneryPercentageInputPrompt"],
    output_variables=["greeneryPercentage"],
    transform=run_sql_agent
)

## Chain C: Combine greenery goal and greenery percentage
The third chain will analyse a greenery goal and percentage and write a few lines about the progress towards the goal.


In [172]:
# Create prompt template
template = """
Je bent een biodiversiteit analyst. Je krijgt een groenpercentage van een gemeente en de doelstelling van deze gemeente, aan jou de taak om deze informatie te tonen aan de gebruiker en te vertellen hoever de doelstelling behaald is. Jouw analyse komt als tekstje op een infographic te staan, beschrijf daarom de resultaten in maximaal 150 woorden. Zorg ervoor dat je een formele schrijfstijl gebruikt. Het format hiervoor is als volgt:
'Gemeente: <de gemeente waar je het over hebt>
Doelstelling: <het groenpercentage>
Huidige hoeveelheid groen: <de groendoelstelling>

<jouw resultatenbeschrijving>



Gemeente: {municipality}
Groenpercentage: {greeneryPercentage}
Groendoelstelling: {greeneryGoal}
"""
prompt_template = PromptTemplate(input_variables=["municipality", "greeneryPercentage", "greeneryGoal"], template=template)

analysis_chain = LLMChain(
    llm=chosen_llm,
    prompt=prompt_template,
    output_key="greeneryAnalysis"
)

# Transform function that returns the result of the agent in a dictionary (making it compatible with the other chains)
def run_analysis_chain(inputs):
    # Calculate the tokens of the input
    countedTokensInput = greeneryGoalTokenCounter(tokenizer_func, prompt_template.format(municipality=inputs["municipality"], greeneryPercentage=inputs["greeneryPercentage"], greeneryGoal=inputs["greeneryGoal"]))

    # Run the chain
    result = analysis_chain.invoke({
    "municipality": inputs["municipality"],
    "greeneryPercentage": inputs["greeneryPercentage"],
    "greeneryGoal": inputs["greeneryGoal"]
})

    # Calculate the tokens of the output
    countedTokensOutput = greeneryGoalTokenCounter(tokenizer_func, result['greeneryAnalysis'])

    # Calculate the amount of tokens used
    totalTokensAnalysis = countedTokensInput + countedTokensOutput
    print(f"Tokens used for generating analysis: {totalTokensAnalysis}")
    total_tokens = inputs["tokenCount"] + totalTokensAnalysis
    print(f"Total amount of tokens used: {total_tokens}")
    return {
        "greeneryAnalysis": result['greeneryAnalysis'],
        "tokenCount": inputs["tokenCount"] + totalTokensAnalysis
    }

# Transform Chain that calls the transform function
analysisChain = TransformChain(
    input_variables=["municipality", "greeneryPercentage", "greeneryGoal", "tokenCount", "query"],
    output_variables=["greeneryAnalysis"],
    transform=run_analysis_chain
)

# Step 4: Combining the percentage and goal
The third step is create a sequential chain that combines the Q&A chain and the SQL-agent chain. The sequential chain will parse the result of a previous chain to the next one with the use of prompt templates.

## Create sequential chain
Create the sequential chain that combines the two chains created earlier.

In [174]:
# Create combined chain
greeneryChain = SequentialChain(
    memory=SimpleMemory(),
    chains=[greeneryPercentageChain, greeneryGoalChain, analysisChain],
    input_variables=["greeneryPercentageInputPrompt", "municipality", "query", "tokenCount"],
    output_variables=["greeneryAnalysis"],
    verbose=True
)

municipality = "Schiedam"

# Execute combined chain
result = greeneryChain.invoke({"greeneryPercentageInputPrompt": "Vind het groenpercentage van gemeente " + municipality, "municipality": municipality, "query": "Wat is het percentage dat de gemeente wil realiseren met een natuurlijke opbouw?", "tokenCount":0})
print(result["greeneryAnalysis"])



[1m> Entering new SequentialChain chain...[0m
Tokens used for retrieving greenery percentage: 2604
Tokens used for retrieving greenery goal: 2026
Tokens used for generating analysis: 346
Total amount of tokens used: 4976

[1m> Finished chain.[0m
Gemeente: Schiedam  
Doelstelling: 40%  
Huidige hoeveelheid groen: 46.52%  

De gemeente Schiedam heeft zijn groenpercentage vastgesteld op 46.52%, wat aanzienlijk boven de gestelde doelstelling van 40% ligt. Dit resultaat toont aan dat Schiedam met succes heeft bijgedragen aan het behoud en de uitbreiding van zijn groene ruimte. De huidige groene ruimte draagt niet alleen bij aan de biodiversiteit, maar verhoogt ook de kwaliteit van de leefomgeving voor de inwoners. De gemeente kan zich nu richten op de verdere verbetering van de kwaliteit van het groen en het stimuleren van biodiversiteitsprojecten, om zo de ecologische duurzaamheid te waarborgen en een gezondere leefomgeving te creëren. De overschrijding van de doelstelling weerspiege