Retrieving (unstructured) greenery goals
---
This notebook loads a document that contains greenery goals, splits it into chunks, and stores the chunks in a vector database. It then runs a similarity search to find the chunks most relevant to a question and passes them to an LLM to answer that question.


Imports
---

In [1]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.chains.retrieval_qa.base import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

from langchain_community.docstore import InMemoryDocstore

from langchain_openai import AzureChatOpenAI
from langchain_ollama import ChatOllama

from dotenv import load_dotenv
from PyPDF2 import PdfReader
from uuid import uuid4

import os
import faiss
import urllib.request
import requests

Load a large language model
----------
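LangChain makes it easy to switch between LLMs. Llama 3 shows that the data can be analysed with a locally running open-source model, but it is very slow. To speed things up, o3-mini and gpt-4o-mini (both via Azure) are provided as alternatives. Run only one of the three cells below: each one assigns chosen_llm, so the last cell executed determines which model is used.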



Load Llama3:

In [2]:
chosen_llm = ChatOllama(base_url='http://localhost:11434', model="llama3")

Load o3-mini (via Azure):

In [3]:
load_dotenv()

chosen_llm = AzureChatOpenAI(model="o3-mini", api_version="2025-01-01-preview",
                             azure_endpoint="https://56948-m9bdjgpg-eastus2.cognitiveservices.azure.com/openai/deployments/o3-mini/chat/completions?api-version=2025-01-01-preview",
                             api_key=os.environ.get("AZURE_OPENAI_API_KEY"))

Load gpt-4o-mini (via Azure):

In [2]:
load_dotenv()

chosen_llm = AzureChatOpenAI(model="gpt-4o-mini", api_version="2025-01-01-preview",
                             azure_endpoint="https://56948-m9bdjgpg-eastus2.cognitiveservices.azure.com/openai/deployments/gpt-4o-mini/chat/completions?api-version=2025-01-01-preview",
                             api_key=os.environ.get("AZURE_OPENAI_API_KEY"))
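
Both Azure cells pass the full chat-completions URL as azure_endpoint. A common alternative, assuming AzureChatOpenAI is given the base resource URL together with separate azure_deployment and api_version arguments, is to split that URL first. The helper below is a hypothetical sketch that uses only the standard library; split_azure_url is not part of LangChain.

```python
from urllib.parse import urlsplit, parse_qs


def split_azure_url(full_url):
    """Split a full Azure chat-completions URL into its parts (hypothetical helper)."""
    parts = urlsplit(full_url)
    # The path is expected to look like /openai/deployments/<name>/chat/completions
    segments = [segment for segment in parts.path.split("/") if segment]
    deployment = segments[segments.index("deployments") + 1]
    # The API version is carried in the "api-version" query parameter, if present
    api_version = parse_qs(parts.query).get("api-version", [None])[0]
    return {
        "azure_endpoint": f"{parts.scheme}://{parts.netloc}",
        "azure_deployment": deployment,
        "api_version": api_version,
    }


azure_settings = split_azure_url(
    "https://56948-m9bdjgpg-eastus2.cognitiveservices.azure.com/openai/deployments/"
    "gpt-4o-mini/chat/completions?api-version=2025-01-01-preview"
)
print(azure_settings)
```

The resulting dictionary could then be unpacked into the constructor alongside api_key, which keeps the deployment name and API version in one place instead of repeating them inside the URL.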

Choose embedding model
---


Choose all-MiniLM-L6-v2 as the embedding model (384-dimensional embeddings; lower quality, but faster):

In [3]:
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")



Choose all-mpnet-base-v2 as the embedding model (768-dimensional embeddings; higher quality, but slower):

In [3]:
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")



Define the question
---
Write the question that needs to be answered by the LLM.

In [9]:
# Dutch; roughly: "Find the municipality's interweaving ('verweving') percentage?"
query = "Vind het verweving percentage van de gemeente?"

Load and chunk the document
---
This cell loads the PDF, extracts its text, and splits the text into chunks of at most 1000 characters with an overlap of 200 characters.

In [5]:
# Optional: download a PDF instead of using a local file
# url = 'https://omgevingsvisie.utrecht.nl/fileadmin/uploads/documenten/zz-omgevingsvisie/thematisch-beleid/groen/2007-05-groenstructuurplan.pdf'
# file_name = 'groenstructuurplanUtrecht.pdf'
#
# with open(file_name, "wb") as file:
#     response = requests.get(url)
#     file.write(response.content)

# Load file
file_path = "./GroenbeleidsplanEindhoven.pdf"
doc_reader = PdfReader(file_path)

# Extract page content, skipping pages for which no text is returned
raw_text = ''
for page in doc_reader.pages:
    text = page.extract_text()
    if text:
        raw_text += text

print("Amount of characters raw text: " + str(len(raw_text)))

# Split the page content into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = text_splitter.split_text(raw_text)

print("Amount of characters first chunk: " + str(len(chunks[0])))


Amount of characters raw text: 260257
Amount of characters first chunk: 914
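
To make the roles of chunk_size and chunk_overlap concrete, here is a minimal sketch of overlapping chunking. It is a simplified fixed-size sliding window, not the separator-aware algorithm that RecursiveCharacterTextSplitter uses (which prefers to break on paragraph, line, and word boundaries, and is likely why the first chunk above has 914 rather than 1000 characters). The function name sliding_window_chunks is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size chunks where consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.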


Generate embeddings and save them in the vector database
---
Create a vector database, generate embeddings for the chunks and save the embeddings in the newly created vector database.

In [6]:
# Determine the embedding dimension by embedding one chunk, then create a flat (exact) L2 index
embedding_dim = len(embedding_model.embed_query(chunks[0]))
index = faiss.IndexFlatL2(embedding_dim)

vector_store = FAISS(
    embedding_function=embedding_model,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)

# Generate one unique ID per chunk
uuids = [str(uuid4()) for _ in chunks]

vector_store.add_texts(texts=chunks, ids=uuids)

print("UUIDs of items stored in vector database: " + str(uuids))


UUIDs of items stored in vector database: ['01e56c31-d3aa-4f99-91df-75545cd8af93', '51e5a2dd-dfe3-4155-93f0-09dd291c5df1', '5505550d-e7ef-48cc-8464-11d2b5a2aa48', '1632a1be-655f-4705-b783-fd5e02f322cf', '4ef9a232-14df-4fc9-8847-577a2d0ca8fe', 'b8ac78ca-0932-414d-8ab1-c0a451725b5f', 'b56f926a-045c-4445-850b-b7cd76b4dd3c', '089e05ca-a533-4fa8-bd8e-f33f70469a35', 'c7818a0b-cfae-4a45-ab4f-9517bf7e4f69', '3c2ec21d-bff4-4206-8c10-c1e57c72649c', 'b91281bd-d6ba-4e7b-8381-aec05ed3b2ac', 'c9fb6c48-36b3-436b-899d-eb4a990c4010', 'abf7d5b0-a1ac-408b-97b8-aafb3aeadcfa', '2968e73f-f343-4e8e-8707-57e5a3761f30', '1499d58c-4023-4bec-8035-1da47378eaee', 'ab084780-c0d7-42dd-b507-4e52d4ff29a3', '5a3a399b-1f48-4cb0-b10c-9fe0fa6e9a29', '0285f91b-0fd0-4bf2-8920-ddf3b66d2aa2', '956a05f3-ffce-4cf1-abba-c44b2b58e196', 'b886b29a-47dc-432b-9bee-6232e9194e35', '4a5a9800-ffa1-4a6e-9d13-eb940972324e', 'b7627552-dcd6-4a61-9ea4-674fe6045a1a', 'f48586ef-1d37-4164-be98-91097caaffe1', '9c72d304-b92e-4c4c-b8d7-0d264b5e78dc
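
The vector store above is built from three pieces: the index (vectors stored by position), index_to_docstore_id (position to ID), and the docstore (ID to text). The toy code below mimics that bookkeeping with plain lists and dictionaries. It only illustrates the idea and is not LangChain's implementation; every name in it is hypothetical.

```python
from uuid import uuid4


def add_texts_toy(texts, vectors, index, index_to_id, docstore):
    """Store each vector by position and link that position to the text via a UUID."""
    ids = [str(uuid4()) for _ in texts]
    for text, vector, doc_id in zip(texts, vectors, ids):
        position = len(index)            # next free slot in the index
        index.append(vector)             # the index only knows vectors and positions
        index_to_id[position] = doc_id   # position -> ID
        docstore[doc_id] = text          # ID -> original text
    return ids


toy_index, toy_index_to_id, toy_docstore = [], {}, {}
toy_ids = add_texts_toy(
    ["chunk A", "chunk B"],
    [[0.0, 1.0], [1.0, 0.0]],
    toy_index, toy_index_to_id, toy_docstore,
)

# A search returns positions; the two lookups turn a position back into text
print(toy_docstore[toy_index_to_id[1]])  # chunk B
```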

Gather relevant documents
---
Run a similarity search between the query and the vector database to find the 10 most relevant chunks.

In [10]:
relevant_documents = vector_store.similarity_search(query, k=10)

print(f"Vraag: {query}")
print("Relevante documenten: ")

for document in relevant_documents:
    print(f" \n-->   {document.page_content} \n")

Vraag: Vind het verweving percentage van de gemeente?
Relevante documenten: 
 
-->   het type water of groen. Naar schatting zijn kopers bereid gemiddeld 7% meer te betalen voor hun woning als deze direct 
grenst aan openbaar groen of water. Een vrij uitzicht op de open ruimte leidt tot 12% prijsverhoging, terwijl de aanwezigheid van aantrekkelijke natuur in de buurt van de woonplaats een waardestijging oplevert van 5% tot 10%. Een bijzonder geval zijn huizen met tuinen grenzend aan water dat in verbinding staat met een recreatieplas; voor deze woningen kan de waarde­stijging oplopen tot bijna 30%. 
De baten van groen worden steeds meer duidelijk. Toch komen concrete projecten soms nog moeilijk van de grond. Een 
reden daarvoor is dat de waarde van groen nog maar beperkt kwantitatief is gemaakt. Dit maakt het lastig om groen onder ­
deel te laten zijn van een business case. In het kader van TEEB ­s tad (The Economics of Ecosystems and Biodiversity) en 

 
-->   vuilconcentratie.
In ied
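
The search itself ranks stored vectors by their distance to the query vector. A flat L2 index such as the one created earlier does this exhaustively with squared Euclidean distance. The sketch below reproduces that ranking in plain Python on made-up two-dimensional vectors; squared_l2 and top_k are hypothetical helper names, and real embeddings have hundreds of dimensions.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vector, vectors, k):
    """Return (position, distance) pairs for the k vectors closest to the query."""
    scored = sorted((squared_l2(query_vector, v), i) for i, v in enumerate(vectors))
    return [(i, distance) for distance, i in scored[:k]]


toy_vectors = [[0.0, 0.0], [1.0, 2.0], [1.0, 0.5], [5.0, 5.0]]
nearest = top_k([1.0, 0.0], toy_vectors, k=2)
print(nearest)  # [(2, 0.25), (0, 1.0)]
```

A smaller distance means a closer match, so the results are ordered from most to least similar.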

Analyze the relevant documents and answer the question
---
The last step is to build a question-answering chain in which the chosen LLM answers the question using the retrieved chunks.

In [11]:
qa_chain = RetrievalQA.from_chain_type(
    llm=chosen_llm,
    chain_type="stuff",  # or "map_reduce" for large documents
    retriever=vector_store.as_retriever(search_kwargs={"k": 10}),
    return_source_documents=True
)

result = qa_chain.invoke({"query": query})

print("Vraag:", query)
print("Antwoord:", result['result'])

Vraag: Vind het verweving percentage van de gemeente?
Antwoord: Het verweving percentage van de gemeente is 60% groen en 40% rood.
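
With chain_type="stuff", all retrieved chunks are placed into a single prompt together with the question. The sketch below shows that idea in plain Python. The prompt wording is an assumption made for illustration and is not LangChain's actual template; build_stuff_prompt is a hypothetical name.

```python
def build_stuff_prompt(question, documents):
    """Concatenate all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


toy_prompt = build_stuff_prompt(
    "Vind het verweving percentage van de gemeente?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(toy_prompt)
```

With k=10 and chunks of at most 1000 characters, the context sent to the model is roughly 10,000 characters at most, which is why "stuff" works here. For much larger contexts a strategy such as "map_reduce" processes the chunks separately before combining the results.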
