
[ai-agents] Introduce the re-rank agent with MMR ranking #502

Merged: 12 commits into main from impl/re-ranking, Sep 29, 2023

Conversation

eolivelli
Member

@eolivelli eolivelli commented Sep 28, 2023

Summary

  • new agent that performs re-ranking on the results of a query
  • only one algorithm is implemented at the moment: MMR, using BM25 and cosine similarity
  • updated the docker-chatbot and the webcrawler-source examples to use MMR (and to query more documents than fit in the prompt)
  • added the first integration test with a local Vector Database (using JDBC/HerdDB)
  • there is a small breaking change in the "query" and "query-vector-db" agents: they now return a "List<Map<String,Object>>" instead of a "List<Map<String, String>>", which makes it possible to read the vector of floats from the query results
  • the integration tests can now load JDBC drivers that are not on the classpath but in a directory under target

Description of MMR

In order to perform MMR you need two functions: one to compute "relevance" and one to compute "diversity".
We use BM25 + IDF to compute "relevance", and the average "cosine similarity" to compute "diversity".
There are a few parameters to tune the algorithms.
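The greedy selection loop behind MMR can be sketched as follows: each remaining candidate is scored as lambda times its relevance to the query, minus (1 - lambda) times its average similarity to the documents already picked. This is a minimal Python illustration only, not the agent's actual Java implementation; the function names and the assumption that relevance scores (e.g. from BM25) are precomputed are mine:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_rerank(relevance, embeddings, lam=0.5, max_results=10):
    """Greedy MMR selection (illustrative sketch).

    relevance:   precomputed relevance score per document (e.g. BM25).
    embeddings:  embedding vector per document.
    lam:         trade-off: 1.0 = pure relevance, 0.0 = pure diversity.
    Returns the indices of the selected documents, in selection order.
    """
    selected = []
    candidates = list(range(len(embeddings)))
    while candidates and len(selected) < max_results:
        def mmr_score(i):
            if not selected:
                diversity_penalty = 0.0
            else:
                # average cosine similarity to the documents already selected
                diversity_penalty = sum(
                    cosine_similarity(embeddings[i], embeddings[j])
                    for j in selected
                ) / len(selected)
            return lam * relevance[i] - (1 - lam) * diversity_penalty
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With lam=0.5, a near-duplicate of an already-selected document is heavily penalized, so a less relevant but more diverse document can win the next slot.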

How do I use this feature?

You need to provide both a "query" and a set of "documents", and for each of them you must also provide the "embeddings" (for the query and for each document).
This is easy in the standard chatbot pipeline: we compute the embeddings of the query before performing the vector search, and the vector search returns both the text and the embeddings stored in the database.

 pipeline:
      - name: "Re-rank query results"
        id: step1
        type: "re-rank"
        input: "input-topic"
        output: "output-topic"
        configuration:
            max: 10
            field: "value.query_results"
            output-field: "value.reranked_results"
            query-text: "value.query"
            query-embeddings: "value.query_embeddings"
            text-field: "record.text"
            embeddings-field: "record.embeddings"
            algorithm: "MMR"
            lambda: 0.5
            b: 2
            k1: 1.2

With the "max" parameter you can limit the number of documents, this is pretty useful in case you want to get many documents from the vector database but then you can use only fewer of them to build the prompt.

Parameters

  • max: maximum number of documents to keep
  • field: the field that contains the documents to sort
  • output-field: the field that will hold the results; it can be the same as "field" to overwrite it
  • query-text: the field that contains the "query" (usually the question to the chatbot)
  • query-embeddings: the field that contains the "embeddings" for the "query"; they must be precomputed
  • text-field: the field in the result set that contains the "text"; you have to use the "record.xxx" syntax
  • embeddings-field: the field in the result set that contains the "embeddings" for the text; you have to use the "record.xxx" syntax
  • algorithm: "MMR" or "none"
  • lambda: the trade-off parameter for the MMR algorithm
  • b and k1: parameters for the BM25 algorithm
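To illustrate what b and k1 control, here is a minimal BM25 sketch in Python. This is an illustration only, not the agent's implementation, and it assumes the documents are already tokenized; k1 caps how much repeated occurrences of a query term can add, and b controls how strongly long documents are penalized:

```python
import math

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """BM25 relevance of one document to a query (illustrative sketch).

    corpus:  list of tokenized documents, used for IDF and average length.
    k1:      term-frequency saturation (higher = repeats count for more).
    b:       document-length normalization (0 = off, 1 = full).
    """
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query_terms:
        # IDF: rare terms in the corpus contribute more.
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        tf = doc_terms.count(term)
        denom = tf + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf * (k1 + 1) / denom if denom else 0.0
    return score
```

A document that never mentions a query term contributes nothing for that term, while repeated mentions saturate instead of growing linearly.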

Notes

You must pre-compute the embeddings for the query and for all of the documents, but this is usually easy: you typically perform a vector search before this step, so you already have both the text and the embeddings vector for each document.

@cdbartholomew
Member

Using this agent, is it possible to, say, query the vector store for 100 semantically similar results but only return the 10 most diverse from that larger set?

@eolivelli
Member Author

eolivelli commented Sep 28, 2023

Using this agent, is it possible to, say, query the vector store for 100 semantically similar results but only return the 10 most diverse from that larger set?

I had the same thought; let me add this feature and configure a "max" property.

@eolivelli eolivelli marked this pull request as ready for review September 29, 2023 10:05
@eolivelli eolivelli merged commit 83aeb7d into main Sep 29, 2023
8 checks passed
@eolivelli eolivelli deleted the impl/re-ranking branch September 29, 2023 13:02
@eolivelli eolivelli mentioned this pull request Sep 29, 2023
@acantarero

This comment is maybe a little late, but here is a question I had looking at this:

I haven't had time to dig into if there's a best practice of which similarity metrics to use for the relevance and diversity.

However, I wonder if we want to allow the user to optionally choose between cosine and bm25.

My thought process here:
bm25 usually involves some text preprocessing steps (stop-word removal, stemming/lemmatization, word normalization, etc.) that are often not done on text data being used in vector search.

Given that our primary use case is genAI and most of our customers aren't doing those preprocessing steps, bm25 may not actually work as well as it should and they may be better off using cosine similarity for relevance.

@eolivelli
Member Author

@acantarero

Currently for MMR we need both BM25 and cosine similarity.
It is a further step, after the vector search.
In fact, if you send a query to the vector database and ask for the "most similar documents", you already have them.

With this new agent you can retrieve more documents from the database and keep only a selection that is "diverse enough", but still "relevant"

additional note:

With LangStream it is pretty easy to pre-process the data before inserting the documents in the vector database and you can also apply the same preprocessing to normalise the "query" in your chat completion pipeline.

Let's follow up on Slack, or maybe you can open a "Discussion" or a GH ticket; this PR has been closed, so nobody will find it easily.

@acantarero

could you explain more about why we need both?

I read the original MMR paper and it says you can use the same similarity function for both the relevancy and diversity.

@eolivelli
Member Author

Cosine similarity is not enough to remove redundant documents. With IDF we can ensure that we are not passing duplicate content to the LLM, as tokens have a cost.

benfrank241 pushed a commit to vectorize-io/langstream that referenced this pull request May 2, 2024

3 participants