
wildfly-langchain4j experimentation

This project is an experiment with the wildfly-ai feature pack. It aims to study how to leverage the (impressive) capabilities of LLMs in Java applications.

This example is based on the example Jean-François James used to illustrate his talks at the Chateau 2024 unconference and ParisJUG. It is itself based on a simplified car booking application inspired by the "Java meets AI" talk from Lize Raes at Devoxx Belgium 2023. The car booking company is called "Miles of Smiles" and the application exposes two AI services:

  1. a chat service to freely discuss with a customer assistant

  2. a fraud service to determine whether a customer is a fraudster.

The tools are provided via a Model Context Protocol (aka MCP) server, which is also provided by the wildfly-ai feature pack. For the sake of simplicity, there is no database interaction; the application is standalone and can be used "as is". Of course, thanks to WildFly, it can easily be extended according to your needs.

Warning: you must first configure the application to connect to an LLM that supports Function Calling (see Environment Variables below).

First, build and run the car-booking-mcp project. This provides the tools used by the chat bot application.

mvn clean install
./target/server/bin/standalone.sh -Djboss.socket.binding.port-offset=10

Second, build and run the car-booking project. This is the chat bot application itself, which consumes the tools exposed by the MCP server.

mvn clean install
./target/server/bin/standalone.sh

When the application is launched, you have two pages:

  • fraud: to detect a fraud for a customer

  • chat-bot: to interact with the chat bot.

Technical context

The project has been developed and tested with:

  1. Java 21 (Temurin OpenJDK distro)

  2. WildFly 35.0.1.Final

  3. Apache Maven 3.9.9

  4. GPT-3.5-turbo, GPT-4 and GPT-4o on a dedicated Azure instance in my company (to be customized according to your context).

Environment variables (configuration)

The following environment variables have been set up to run the application and must be customized according to your environment (they can also be defined in application.properties):

  • MISTRAL_AI_API_KEY
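For example, on Linux/macOS the key can be exported in the shell before starting the server (the value below is a placeholder):

export MISTRAL_AI_API_KEY=<your-api-key>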

Programming model

There are two AI interfaces (annotated with @RegisterAiService):

  1. ChatAiService: to chat with the "Miles of Smiles" customer support

  2. FraudAiService: to detect whether a customer is a fraudster.
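As a rough illustration, such an interface could look like the following minimal sketch (the @SystemMessage and @UserMessage annotations come from langchain4j; the system prompt and method signature are illustrative, not the project's actual code):

import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;

// @RegisterAiService is provided by the CDI integration used by the
// wildfly-ai feature pack; its import path depends on the integration version.
@RegisterAiService
public interface ChatAiService {

    // Illustrative system prompt; the real service defines its own
    @SystemMessage("You are the customer assistant of the 'Miles of Smiles' car booking company.")
    String chat(@UserMessage String question);
}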

There are two important features when it comes to using LLMs in applications:

  1. RAG (Retrieval Augmented Generation)

  2. Function Calling.

Let’s investigate them…

RAG

RAG consists of augmenting the LLM with specific knowledge coming from different data sources: text, PDF, CSV, etc.

Documents must first be ingested: their content is turned into a vector representation (using an embedding model) and stored in a vector database (in-memory, PostgreSQL, Chroma, etc.). This process is usually done in batch mode.

Upon each request, text segments are selected according to their semantic proximity to the request and appended to the user message.

In this project:

  • 4 text documents are ingested (from the docs-for-rag directory). One of them is an apple pie recipe, which has nothing to do with car booking; it is useful for testing the retrieval selectivity

  • Ingestion is performed at application startup (see the DocRagIngestor class and the sketch after this list)

  • Retrieval is configured on ChatAiService (see retrievalAugmentor parameter on @RegisterAiService annotation), not on FraudAiService

  • Retrieval is done upon each request (see DocRagAugmentor class)

  • AllMiniLmL6V2QuantizedEmbeddingModel is a "small" model used for local embedding (ingestion). Quantization is a technique that reduces the size (and the accuracy) of a model, allowing it to run on less powerful hardware.
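For illustration, the ingestion step typically boils down to the following sketch of the plain langchain4j APIs (the actual DocRagIngestor may differ; the splitter parameters are illustrative and the embedding model import path varies across langchain4j versions):

import java.nio.file.Path;
import java.util.List;

import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.document.splitter.DocumentSplitters;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

// Load the text documents from the RAG directory
List<Document> documents = FileSystemDocumentLoader.loadDocuments(Path.of("docs-for-rag"));

// Split the documents, embed each segment with the quantized MiniLM model
// and store the embeddings in an in-memory vector store
EmbeddingStoreIngestor.builder()
        .documentSplitter(DocumentSplitters.recursive(500, 50))
        .embeddingModel(new AllMiniLmL6V2QuantizedEmbeddingModel())
        .embeddingStore(new InMemoryEmbeddingStore<TextSegment>())
        .build()
        .ingest(documents);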

A key parameter to optimize segment retrieval is MIN_SCORE when building the EmbeddingStoreContentRetriever:

// From 0 (low selectivity) to 1 (high selectivity)
private static final double MIN_SCORE = 0.6;

// Other code

// retriever build
this.retriever = EmbeddingStoreContentRetriever.builder()
                            .embeddingModel(model)
                            .embeddingStore(store)
                            .maxResults(MAX_RESULTS)
                            .minScore(MIN_SCORE)
                            .build();

During my tests, it appeared that the default value (0.5) is not selective enough: the apple pie recipe was systematically selected whatever the user question. Setting it to 0.6 provided more relevant results.
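For completeness, such a retriever is typically wrapped into a RetrievalAugmentor, which is what the retrievalAugmentor parameter of @RegisterAiService points to. A minimal sketch with the plain langchain4j API (the project's DocRagAugmentor may differ):

import dev.langchain4j.rag.DefaultRetrievalAugmentor;
import dev.langchain4j.rag.RetrievalAugmentor;

// Wrap the content retriever built above into a retrieval augmentor
RetrievalAugmentor augmentor = DefaultRetrievalAugmentor.builder()
        .contentRetriever(retriever)
        .build();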

Function calling

Function calling consists of asking the LLM to call our business logic… which is both powerful and dangerous! It is not supported by all models.

Declaring callable functions

With wildfly-mcp, a callable function is a Java method annotated with @Tool:

@Tool("Get booking details for booking number {bookingNumber} and customer {name} {surname}")
public Booking getBookingDetails(String bookingNumber, String name, String surname) { ... }

The AI service is then wired to the tool provider in the @RegisterAiService annotation:

@RegisterAiService(toolProvider = mcp)
public interface CustomerSupportAgent { ... }

In this project, all callable functions are implemented by BookingService.
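As a hypothetical illustration (the method name is taken from the LLM exchange shown below; the real BookingService implements more tools and real business logic), such a class could look like:

import java.util.List;

import dev.langchain4j.agent.tool.Tool;

public class BookingService {

    // Illustrative type; the project defines its own Booking class
    public record Booking(String bookingNumber, String customer) {}

    @Tool("Get all bookings for customer {name} {surname}")
    public List<Booking> getBookingsForCustomer(String name, String surname) {
        // Hypothetical in-memory lookup standing in for real business logic
        return List.of(new Booking("123-456", name + " " + surname));
    }
}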

But how does it work under the hood?

Request

At runtime, a JSON descriptor of each callable function is added to the request:

{
   "name":"getBookingDetails",
   "description":"Get booking details for booking number {bookingNumber} and customer {name} {surname}",
   "parameters":{
      "type":"object",
      "properties":{
         "surname":{
            "type":"string"
         },
         "name":{
            "type":"string"
         },
         "bookingNumber":{
            "type":"string"
         }
      },
      "required":[
         "bookingNumber",
         "name",
         "surname"
      ]
   }
}

Response

If the LLM decides to call a function, it answers by describing how to call it:

"choices":[
   {
      "finish_reason":"function_call",
      "index":0,
      "message":{
         "role":"assistant",
         "content":null,
         "function_call":{
            "name":"getBookingsForCustomer",
            "arguments":"{\"surname\":\"Bond\",\"name\":\"James\"}"
         }
      },
      "content_filter_results":{

      }
   }
]

This process can be very slow: a single user message can lead to multiple LLM round-trips.

Parallel function calling seems possible with OpenAI (see the OpenAI function calling documentation), but I have not tested it yet.

Effective function calling

Of course, the LLM can’t call the function on its own. It is the responsibility of quarkus-langchain4j to turn the LLM response into a local function call.

Due to the non-deterministic nature of the LLM, function calling must be used with care, especially for critical business logic: payments, cancellations, etc.

To influence the behavior of the LLM, it is recommended to set the temperature and top-p in the configuration:

# Set the model temperature for deterministic responses
quarkus.langchain4j.azure-openai.chat-model.temperature=0.1
# An alternative (or complement) to temperature: 0.1 means only tokens within the top 10% of probability mass are considered
quarkus.langchain4j.azure-openai.chat-model.top-p=0.1

Playing with the application

To test the application, you can enter the following user messages (see ChatAiServiceTest and FraudAiServiceTest for inspiration):

  • Hello, how can you help me?

  • What is your list of cars?

  • What is your cancellation policy?

  • What is your fleet size? Be short please.

  • How many electric cars do you have?

  • My name is James Bond, please list my bookings

  • Is my booking 123-456 cancelable?

  • Is my booking 234-567 cancelable?

  • Can you check the duration please?

  • I’m James Bond, can I cancel my booking 345-678?

  • Can you provide the details of all my bookings?

  • fraud James Bond

  • fraud Emilio Largo

To understand the dynamics of the application, some key log messages are prefixed with DEMO.

Feedback

According to my tests, GPT-3.5-turbo is fast but not reliable and accurate enough to properly act as customer support. It is subject to hallucinations and doesn’t hesitate to invent responses. For instance, when asked about SUVs and diesel cars, it answers positively although it has no information about them.

GPT-4 is more reliable and accurate but about 6 times slower.

GPT-4o is even more reliable than GPT-4 and as fast as GPT-3.5-turbo. So clearly, it is technically the best option.

Warning: pricing must also be considered when choosing a model in production.

Next steps

To be investigated:

  • Security: how to secure Function Callings

  • Testability: how to go beyond just "true or false"

  • Observability: how to monitor the LLM behavior/quality

  • Auditability

  • Parallel function calling

  • Production readiness

  • Quarkus native mode (currently not working)
