# Vereijken vector store builder

In [10]:
import os
import glob
import openai
from dotenv import load_dotenv
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import SeleniumURLLoader
from langchain_community.document_loaders import UnstructuredPDFLoader, UnstructuredPowerPointLoader

To call OpenAI models we need an API key, which you can request on the OpenAI website. Keep the key out of the notebook itself by storing it in a hidden `.env` file.

In [11]:
# Load the environment
load_dotenv()
openai.api_key = os.environ["OPENAI_API_KEY"]
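
If the key is missing, `os.environ[...]` raises a bare `KeyError`. A minimal sketch of a fail-fast check with a clearer message, using only the standard library; the helper name `require_env` is hypothetical and not part of this notebook:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a descriptive error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in the shell."
        )
    return value


# Hypothetical usage, after load_dotenv() has run:
# openai.api_key = require_env("OPENAI_API_KEY")
```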

Initialize an empty list to hold the documents and specify the input directory.

In [12]:
# Initialize pages
pages = []

# Define the directory
directory = '../Vereijken_input'

Loop over the directory and convert each file to documents with the loader that matches its file type. Files with other extensions are skipped.

In [13]:
# Loop over all files in the directory
for filename in glob.glob(os.path.join(directory, '*')):
    # Check the file extension
    if filename.endswith('.pdf'):
        loader = UnstructuredPDFLoader(filename)
        print(f"Processing: {filename}")
    elif filename.endswith('.pptx'):
        loader = UnstructuredPowerPointLoader(filename)
        print(f"Processing: {filename}")
    else:
        continue  # Skip files with other extensions

    # Load and split the file, then add to pages
    pages += loader.load_and_split()

Processing: ../Vereijken_input\20221202 Proposal Vereijken V4.pptx
Processing: ../Vereijken_input\20221202 Proposal Vereijken V5.pptx
Processing: ../Vereijken_input\20230302_Vereijken Impact story slide.pptx
Processing: ../Vereijken_input\20230317_Optimalisatiemodel energieverbruik.pptx
Processing: ../Vereijken_input\20230329_BrightCape_project_Vereijken.pdf
Processing: ../Vereijken_input\Vereijken_dutch_and_english_13062023.pptx
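
The extension check in the loop is case-sensitive, so a file named `report.PDF` would be skipped. A minimal sketch of a case-insensitive variant that looks up the loader by suffix in a dictionary; `LOADER_BY_SUFFIX` and `pick_loader_name` are hypothetical names, and the values are plain strings standing in for the loader classes so the example runs without LangChain installed:

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from lower-case suffix to the loader that handles it.
# In the notebook the values would be the loader classes themselves.
LOADER_BY_SUFFIX = {
    ".pdf": "UnstructuredPDFLoader",
    ".pptx": "UnstructuredPowerPointLoader",
}


def pick_loader_name(filename: str) -> Optional[str]:
    """Return the loader name for a file, or None if the type is unsupported."""
    return LOADER_BY_SUFFIX.get(Path(filename).suffix.lower())


for name in ["proposal.pptx", "report.PDF", "notes.txt"]:
    print(name, "->", pick_loader_name(name))
```

Adding support for another file type then only requires a new dictionary entry.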


In [14]:
print(f"Length of pages: {len(pages)}")

Length of pages: 44


We can also read data from a website. We will add the Vereijken home page here.

In [15]:
# Add the website
urls = [
    "https://vereijkenkwekerijen.nl/?lang=en",
]

# collect data using selenium url loader
loader = SeleniumURLLoader(urls=urls)
pages += loader.load_and_split()

print(f"Length of pages after URL loader: {len(pages)}")

Length of pages after URL loader: 45


We can check out the website content, which was appended as the last page.

In [16]:
pages[44].page_content

'Home\n\nTomatoes\n\nCultivation process\n\nVacancies\n\nContact\n\n\tBranches\n\nlogin portal\n\n\n\nHome\n\nTomatoes\n\nCultivation process\n\nVacancies\n\nContact\n\n\tBranches\n\nlogin portal\n\n\n\nFrom greenhouse to fresh on the shelf\n\nWe grow our vine tomatoes on more than 50 hectares, divided over 6 grow locations in the province of Noord-Brabant and the Westland area. Thanks to 40 hectares of illuminated cultivation, customers can count on us all year round. The tomatoes are packed and prepared for the customer at Vereijken Logistics and Triomaas Logistics, our packing locations.\n\nIf you want to find out more about the different steps until the tomatoes are on the supermarket shelf, click on ‘More information’ to view the cultivation process.\n\nMore information\n\nVacancies\n\nWant to be part of a family business and work in a progressive and result-oriented organisation with high-quality products? View our current vacancies and apply today!\n\nUnsolicited application\nWa

We will replace all tab and newline characters with spaces to clean up the text a bit.

In [17]:
# Light preprocessing
for page in pages:
    page.page_content = str(page.page_content).replace("\\n", " ")\
                                              .replace("\\t", " ")\
                                              .replace("\n", " ")\
                                              .replace("\t", " ")

print(f"Length of pages after preprocessing: {len(pages)}")

Length of pages after preprocessing: 45
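
Replacing each newline or tab with a space leaves runs of consecutive spaces, as the retrieved text further down shows. A minimal sketch of an alternative that collapses every run of whitespace into a single space with a regular expression; `normalize_whitespace` is a hypothetical helper, not part of this notebook:

```python
import re


def normalize_whitespace(text: str) -> str:
    """Replace newlines, tabs, and repeated spaces with single spaces."""
    # The notebook also replaces the escaped two-character sequences
    # (backslash + n, backslash + t), so the sketch does the same.
    text = text.replace("\\n", " ").replace("\\t", " ")
    # \s+ matches any run of spaces, tabs, and newlines.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_whitespace("Home\n\nTomatoes\n\n\tBranches"))
```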


Now we convert this to a vectorstore with OpenAI embeddings.

In [18]:
# Create the vectorstore
vectorstore = FAISS.from_documents(pages, OpenAIEmbeddings())

# Save the vectorstore
vectorstore.save_local('../vector_stores/vereijken.faiss')
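
Conceptually, a vector store keeps one embedding vector per document and answers a query by returning the documents whose vectors lie closest to the query vector. A toy sketch of that idea with hand-made 3-dimensional vectors and cosine similarity; the vectors and the `nearest` helper are made up for illustration and say nothing about how FAISS or the OpenAI embeddings work internally:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest(query_vec, store, k=1):
    """Return the k document names whose vectors are most similar to query_vec."""
    ranked = sorted(store, key=lambda name: cosine(query_vec, store[name]), reverse=True)
    return ranked[:k]


# Made-up embeddings: one vector per document.
toy_store = {
    "energy model slide": [0.9, 0.1, 0.0],
    "vacancies page": [0.0, 0.2, 0.9],
    "tomato cultivation": [0.1, 0.9, 0.2],
}

print(nearest([0.8, 0.2, 0.1], toy_store))
```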

Convert the vector store into a retriever.

In [19]:
# Create retriever
retriever = vectorstore.as_retriever(search_type='mmr',
                                     search_kwargs={'k': 1, "score_threshold": 0.0})
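
With `search_type='mmr'` the retriever uses maximal marginal relevance, which prefers results that are similar to the query but dissimilar to results already chosen. A simplified pure-Python sketch of that selection rule on toy vectors; the function `mmr_select`, the vectors, and the default `lambda_mult=0.5` are illustrative assumptions, and LangChain's actual implementation may differ in detail:

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query, candidates, k=2, lambda_mult=0.5):
    """Greedy MMR: return indices of k candidates balancing relevance and diversity."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
# Candidates 0 and 1 are near-duplicates; candidate 2 points elsewhere.
candidates = [[1.0, 0.1], [1.0, 0.12], [0.6, -0.8]]
print(mmr_select(query, candidates, k=2))
```

In this sketch, `k=1` means only the first step runs, so the result is simply the most relevant candidate; the diversity term only matters from the second pick onward.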

Let's use the retriever as a stand-alone object.

In [20]:
found_documents = retriever.invoke("What kind of model is used for this optimization problem?")

When we ask which model is used for the optimization problem, the retriever returns a page that describes the model. The output also shows what the cleaned-up text in the document looks like.

In [21]:
found_documents

[Document(page_content='Oplossing    Een geautomatiseerd optimalisatiemodel dat het optimale gebruik van de WKK-eenheden vindt, op basis van de interne energievraag en de marktprijzen voor energie afname en energie terug levering.  Het optimalisatiemodel geeft inzicht in de prestaties van beslissingen voor het gebruik van WKK-eenheden in het verleden.  Uiteindelijk zal het optimalisatiemodel worden gebruikt om het optimale gebruik van de WKK-eenheden te voorspellen, op basis van voorspelde marktprijzen.  10-15% kosten besparing*  in energie verbruik  *Preliminair  Inzicht in prestatie  Van voorgaande beslissingen m.b.t. energie productie en verbruik   6    Overzicht van energiekosten besparing  Nederlandse glastuinbouw  10-15% kosten besparing*  in energie verbruik  *Preliminair  Inzicht in prestatie  Van voorgaande beslissingen m.b.t. energie productie en verbruik   7    Resultaat  Genereren van output (excel)  Totale kosten per vestiging  Close-up per vestiging  Kosten  Gas en apx pr

Posing the query in Dutch returns the same document.

In [22]:
found_documents_nederlands = retriever.invoke("Welk soort model wordt gebruikt voor dit optimalisatieprobleem?")

In [23]:
found_documents_nederlands

[Document(page_content='Oplossing    Een geautomatiseerd optimalisatiemodel dat het optimale gebruik van de WKK-eenheden vindt, op basis van de interne energievraag en de marktprijzen voor energie afname en energie terug levering.  Het optimalisatiemodel geeft inzicht in de prestaties van beslissingen voor het gebruik van WKK-eenheden in het verleden.  Uiteindelijk zal het optimalisatiemodel worden gebruikt om het optimale gebruik van de WKK-eenheden te voorspellen, op basis van voorspelde marktprijzen.  10-15% kosten besparing*  in energie verbruik  *Preliminair  Inzicht in prestatie  Van voorgaande beslissingen m.b.t. energie productie en verbruik   6    Overzicht van energiekosten besparing  Nederlandse glastuinbouw  10-15% kosten besparing*  in energie verbruik  *Preliminair  Inzicht in prestatie  Van voorgaande beslissingen m.b.t. energie productie en verbruik   7    Resultaat  Genereren van output (excel)  Totale kosten per vestiging  Close-up per vestiging  Kosten  Gas en apx pr
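
Rather than comparing the two printed outputs by eye, the match can be checked programmatically. A minimal sketch, assuming each result is a list of objects with a `page_content` attribute as in the outputs above; `same_top_result` is a hypothetical helper, and `SimpleNamespace` stands in for the `Document` class so the example runs on its own:

```python
from types import SimpleNamespace


def same_top_result(results_a, results_b) -> bool:
    """True if both result lists are non-empty and their first items have identical text."""
    if not results_a or not results_b:
        return False
    return results_a[0].page_content == results_b[0].page_content


# Stand-in objects; in the notebook these would be found_documents
# and found_documents_nederlands.
english = [SimpleNamespace(page_content="same page text")]
dutch = [SimpleNamespace(page_content="same page text")]
print(same_top_result(english, dutch))
```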

Next, we take an example case study from the website (in this case Covolt) and split it into sections.

In [24]:
# Sections of example case study
sections = [
"""
Client profile
Client: Covolt
Industry: Solar energy
Process: Asset management

Covolt B.V. is a company in the renewable energy sector, where they employ their intelligent energy management system.
This system optimizes the production of solar parks and automatically offers the energy to the market. This results in
a substantial improvement in both efficiency and reliability of these parks.
""",

"""
The problem
As part of their energy management solution, Covolt aims to predict future energy production for one day-ahead and
intraday periods. This predictive capability would enable them to refine their supply bids on the market,
as the effectiveness of energy trading heavily relies on accurate predictions of the production. Within this framework,
the primary challenges lie in training forecasting models with limited historical data (± 2 years) and generating
predictions for entirely new locations that lack any historical data.
""",

"""
Approach
Bright Cape has trained multiple machine learning models to forecast energy production, utilizing both internal and
external data sources. Since weather data was indicated as the primary forecasting driver, a comparison study of
various external weather sources was conducted to select the most accurate weather APIs. Subsequently,
Bright Cape assisted in creating the foundational data infrastructure to retrieve and integrate these diverse
data sources.

Next, this infrastructure was extended by setting up dedicated training, validation, and forecasting pipelines to
ensure continuous operability on the Microsoft Azure cloud platform. These pipelines were designed with modularity
in mind, facilitating the simultaneous development, evaluation, and deployment of both one day-ahead and intraday
forecasting models. Furthermore, a roll-out strategy was implemented to enable the scalability of the models across
locations throughout the Netherlands, including entirely new sites without sufficient available data.

Feature importance techniques were used to assess the significance of each variable and to determine the most
significant predictors, including factors like irradiation and time of the year. These techniques provide valuable
insights for data scientists and business users into what the model has learned from the data and improves the
model’s transparency.
""",

"""
Solutions and added value
Bright Cape assisted in the design, development, and deployment of the forecasting pipelines and the underlying data
infrastructure. The trained models significantly outperformed both the baseline market prediction strategies for
the pilot locations. Forecasting errors were cut in halve, where hourly day-ahead forecasting errors were reduced from
3.3% to 1.5%, and 15-minute intraday forecasting errors improving from 4.7% to 2.3%. Consequently, this led to more
effective energy trading on the market and increased profit margins. Additionally, the project yielded a return on
investment in under a year as it enabled the acquisition of new clientele.

Results
ROI < 1 year
as the project enabled the acquisition of new clientele

From 3.3 to 1.5%
reduced hourly day-ahead baseline forecasting error

From 4.7 to 2.3%
reduced 15-min intra-day baseline forecasting error
"""
]


We do some light preprocessing again and create a vector store from texts instead of documents.

In [25]:
# Light preprocessing
for i in range(len(sections)):
    sections[i] = sections[i].replace("\\n", " ") \
                             .replace("\\t", " ") \
                             .replace("\n", " ") \
                             .replace("\t", " ")

# Example vector store
example_vector_store = FAISS.from_texts(sections, OpenAIEmbeddings())
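
Plain strings carry no metadata, so a retrieved section cannot be traced back to its heading. `FAISS.from_texts` accepts an optional `metadatas` list with one dictionary per text. A minimal sketch of building such a list from the first non-empty line of each section; the helper `section_title` and the `"section"` key are hypothetical choices, and the sample strings are shortened stand-ins. Note that the titles have to be extracted before the newlines are removed:

```python
def section_title(text: str) -> str:
    """Return the first non-empty line of a section, used here as its title."""
    for line in text.splitlines():
        if line.strip():
            return line.strip()
    return ""


# Shortened stand-ins for the raw sections, before newlines are removed.
raw_sections = [
    "\nClient profile\nClient: Covolt\n",
    "\nThe problem\nAs part of their energy management solution ...\n",
]

metadatas = [{"section": section_title(s)} for s in raw_sections]
print(metadatas)

# Assumed usage in the notebook (not run here):
# FAISS.from_texts(sections, OpenAIEmbeddings(), metadatas=metadatas)
```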

We can also test out the example retriever.

In [26]:
# Create retriever
example_retriever = example_vector_store.as_retriever(search_type='mmr',
                                                      search_kwargs={'k': 1, "score_threshold": 0.0})

In [27]:
found_doc = example_retriever.invoke("Client profile")

In [28]:
found_doc

[Document(page_content=' Client profile Client: Covolt Industry: Solar energy Process: Asset management  Covolt B.V. is a company in the renewable energy sector, where they employ their intelligent energy management system. This system optimizes the production of solar parks and automatically offers the energy to the market. This results in a substantial improvement in both efficiency and reliability of these parks. ')]

Finally, we save the example vector store.

In [29]:
# Save the vectorstore
example_vector_store.save_local('../vector_stores/covolt_case_study_example.faiss')