OA Retrieval System Proposal #3058

Closed
melvinebenezer opened this issue May 6, 2023 · 14 comments

@melvinebenezer (Collaborator) commented May 6, 2023

High Level OA Retrieval System

  • Goal of this system
  • Options available
  • Design or Workflow for First Version
  • Other Design thoughts
  • Open Questions
  • Timeline for First Version

Goal

Options available

  • Allow the LM to decide, as in the case of plugins
  • Use an index design (mostly everyone is inclined towards this approach):
    use a professional vector DB in which we index documents based on embeddings, for example all of Wikipedia
    1. Segment the data into chunks (sentences/paragraphs)

    2. Generate embeddings for each chunk

    3. Store the embeddings for retrieval (FAISS, etc.)

    4. When presented with a query, retrieve related chunks from the DB using some
      metric, for example cosine similarity

    5. Prompt the LLM with query + retrieved chunks to generate the answer
      (a minimal sketch of steps 1-5 follows this list)

    Related resources and open points:
    • Dataset/benchmark: https://paperswithcode.com/dataset/beir
    • Is LangChain being considered?
    • LlamaIndex?
    • VectorDB(s) under consideration
    • Benchmarks: http://ann-benchmarks.com/ ?
    • Drawbacks:

      • VectorDB(s) fail on semantically distant info.
      • multi-step reasoning may be required, or that semantic info should already exist as a vector in the DB. Need to explore this in detail.
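A minimal sketch of steps 1-5, assuming sentence-transformers and FAISS purely as placeholder choices (the actual embedding model, chunking strategy, and vector store are still open questions):

```python
import faiss
from sentence_transformers import SentenceTransformer

def chunk(text: str, max_words: int = 100) -> list[str]:
    # 1. Segment the data into chunks (here: naive fixed-size word windows).
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

documents = ["..."]  # placeholder corpus, e.g. Wikipedia articles
chunks = [c for doc in documents for c in chunk(doc)]

# 2. Generate embeddings for each chunk (model choice is an assumption).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(chunks, normalize_embeddings=True)

# 3. Store the embeddings for retrieval; inner product on normalized vectors
#    is equivalent to cosine similarity.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# 4. When presented with a query, retrieve related chunks from the DB.
query = "Who designed the Eiffel Tower?"
query_emb = encoder.encode([query], normalize_embeddings=True)
scores, ids = index.search(query_emb, 5)
retrieved = [chunks[i] for i in ids[0] if i != -1]

# 5. Prompt the LLM with query + retrieved chunks; `llm_generate` is a
#    stand-in for whatever OA inference call ends up being used.
prompt = "Context:\n" + "\n".join(retrieved) + f"\n\nQuestion: {query}\nAnswer:"
# answer = llm_generate(prompt)
```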

Design or Workflow

Overall there are some similarities between retrieval and OA plugins (i.e. in the simplest case retrieval could be a plugin). However, the retrieval system will be more closely integrated with the inference system, so that the assistant's knowledge can be updated easily.

Need to come to a consensus on the workflow

  • the point where the user input becomes available
    • UI changes required
  • how do we decide that the retrieval system should be activated
    • use an invisible plugin that would connect to the VectorDB?
    • how do we decide when to query?
  • how is the query generated
    • we need to figure out whether LLaMA 30B is already well calibrated (i.e. can it answer questions about its own knowledge?)
    • how are the DB query results then processed and fed into the LLM for output generation?
    • how do we decide which embeddings to use for the queries?
  • how are the results processed
    • how is the assistant updated?
    • will this be multi-step reasoning to retrieve semantically distant chunks?
  • how is the response presented to the user

Other design thoughts

There are 2 schools of thought for this system:

  1. Retrieval-based models are mostly useful in knowledge-seeking mode, for example QA; in creative mode it doesn't make sense to use a retrieval-based model.
    Vs
  2. Most artists & writers are very deep into reference materials, inspirations, etc.

The main use-case of retrieval-based models is mostly knowledge-seeking mode.

Open questions

  • how are the DB query results then processed and fed into the LLM for output generation?
  • What are the possible changes required on the website?

Timeline for First Version

TBD

@umbra-scientia (Contributor)

We can learn each of these:

  • how do we decide that retrieval system should be activated
  • how is the query generated
  • how are the results processed

By using the loss/reward score already obtained when optimizing:

  • response presented to the user

If we add a few new tokens and 1 cross attention layer like this:
[attached image: proposed architecture with a few new tokens and one cross-attention layer]

We can learn the optimal points to execute retrieval, and the lookup matching function jointly.
We can finally have the URLs embedded into the text be pleasant things, instead of dreadful sources of hallucination.
If you're feeling like an outrageous gambler, I am suspicious that if the green boxes were frozen with pre-trained weights, it might not break training or destroy performance.
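A minimal sketch of how such a block could look, assuming a gated cross-attention layer over retrieved-chunk embeddings; the module names, shapes, and gating are assumptions, not taken from the diagram above:

```python
import torch
import torch.nn as nn

class RetrievalCrossAttention(nn.Module):
    """One trainable cross-attention block between frozen LM hidden states
    and retrieved-chunk embeddings (all names/shapes are assumptions)."""

    def __init__(self, d_model: int, d_retrieval: int, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(d_retrieval, d_model)  # map retrieval embeddings into model space
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: block starts as a no-op

    def forward(self, hidden: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        # hidden:    (batch, seq_len, d_model) from the frozen LM
        # retrieved: (batch, n_chunks, d_retrieval) from the retrieval encoder
        kv = self.proj(retrieved)
        attn_out, _ = self.attn(self.norm(hidden), kv, kv)
        # Gated residual: with the gate at zero the pre-trained behaviour is preserved.
        return hidden + torch.tanh(self.gate) * attn_out
```

The zero-initialized gate is one possible way to keep the frozen ("green box") behaviour intact at the start of training.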

@melvinebenezer (Collaborator, Author) commented May 6, 2023

@kenhktsui @kpoeppel

@kenhktsui (Collaborator)

Also added a POC I had done: REALM-encoded Wikipedia data

@draganjovanovich (Collaborator)

Hey, just reposting the demo from Discord where I tried out @kenhktsui's POC, also as a POC with plugins
[screenshot of the plugin demo]
using this as the retrieval store: https://github.com/kenhktsui/open-information-retrieval

@kpoeppel (Contributor) commented May 6, 2023

I can also contribute here.

@jordiclive (Collaborator)

I can contribute here.

@github-project-automation github-project-automation bot moved this to 📫 Triage in Open-Assistant May 6, 2023
@andreaskoepf andreaskoepf moved this from 📫 Triage to ⚙ In Progress in Open-Assistant May 6, 2023
@andreaskoepf (Collaborator)

(I have unassigned myself in favor of umbra-scientia since GitHub has a 10-person assignment limit.)

@jvdgoltz commented May 6, 2023

  • how do we decide that the retrieval system should be activated
    • use an invisible plugin that would connect to the VectorDB?
    • how do we decide when to query?
  • how is the query generated
    • we need to figure out whether LLaMA 30B is already well calibrated (i.e. can it answer questions about its own knowledge?)
    • how are the DB query results then processed and fed into the LLM for output generation?
    • how do we decide which embeddings to use for the queries?

I think the LLM can decide if it wants to use a retrieval query. To train the model to use it, we could add some flags in the data labeling interface, like "requires information retrieval". The disadvantage of that is that it has to be clear to the labelers what knowledge can be expected from the model and what it should look up.

I think the query may very well be just the original user message; a good vector DB with good embeddings should already return the text snippets that are most relevant for that query, ranked by similarity. So I would say use as many retrieved text snippets as the context length allows, leaving some room for the reply. Obviously that information wouldn't be included in the chat context, only the final answer.
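A minimal sketch of that packing step, assuming a generic tokenizer with an encode method; the context and reply budgets are illustrative only, and in practice the user message and system prompt would also need to be subtracted:

```python
def pack_snippets(snippets: list[str], tokenizer,
                  context_len: int = 2048, reply_budget: int = 512) -> list[str]:
    """Take snippets already ranked by similarity and keep adding them until
    the context budget, minus room for the reply, is used up."""
    budget = context_len - reply_budget
    packed, used = [], 0
    for snippet in snippets:
        n_tokens = len(tokenizer.encode(snippet))
        if used + n_tokens > budget:
            break
        packed.append(snippet)
        used += n_tokens
    return packed
```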

I have some questions about the general plan:

  • is OpenAssistant going to host the vector DB?
  • if yes, what data is going to be in there? Vectorized Wikipedia / The Pile / RedPajama?
  • Who decides, and how is it decided, what goes in there?
  • Does it even make sense to use the training data in the vector DB? Or would that defeat the purpose of giving the model access to knowledge outside of its training data? (It could also be used as a way to avoid hallucinations, in which case it makes sense to duplicate the training data into the vector DB.)

Anyway I have some experience setting up things like this. I'd be honored to contribute!

@kenhktsui (Collaborator)

Adding one more consideration here: there are (at least) three ways of incorporating retrieval into an LLM, with different degrees of coupling.

  1. The embedding used for retrieval is trained jointly with the LLM. Examples include REALM.
    Pros:
  • joint training results in the best performance in general
    Cons:
  • coupling of model and retrieval, which makes SFT slower and more difficult
  • training will involve asynchronous index refresh because the index is learned on the fly, presenting an engineering challenge and potentially slow training speed
  2. The instruction dataset incorporates retrieval as a tool. Examples include the Toolformer implementation (an illustrative sketch of this option follows the list).
    Pros:
  • the LLM learns when to do retrieval because of the instruction data
  • decoupling of model and retrieval
    Cons:
  • construction of instruction data is the most challenging among all options, especially chaining usage of tools
  3. The embedding used for retrieval is independent of the LLM. Prompt engineering is heavily used. This is the approach most open-source frameworks use (LangChain, LlamaIndex).
    Pros:
  • development of retrieval can be independent of the LLM
    Cons:
  • we count on the LLM to decide when to do retrieval, so it usually requires the LLM to be very strong, or heavy prompt engineering has to be done
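For option 2, a purely illustrative example of what a training record with an inline retrieval call might look like; the call syntax and any special tokens for OA are not decided and are invented here:

```python
# Hypothetical training record for option 2: the completion contains an inline
# retrieval call that the model learns to emit itself.
example = {
    "prompt": "When was the Eiffel Tower completed?",
    "completion": (
        '<retrieve query="Eiffel Tower completion date">'
        "The Eiffel Tower ... was completed in 1889 ..."
        "</retrieve> It was completed in 1889."
    ),
}
# At inference time, generation pauses when the model emits the retrieval call,
# the retrieved text is inserted in place of the result span, and generation
# then continues with that text in context.
```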

@kpoeppel (Contributor) commented May 6, 2023

I think there is a 4th line, kind of a blend of 1 and 3, as presented in the RETRO paper (https://arxiv.org/abs/2112.04426):
4. The index is created from a pre-trained encoder (e.g. BERT, Contriever, ...), whereas the ingestion encoder is actively trained.
Pros:

  • retrieval can be independent from the LLM
  • no retrieval decision needed
  • a pre-trained decoder can be used (see RETRO paper)
Cons:
  • still some training needed
  • the amount of retrieval is high and fixed

@melvinebenezer (Collaborator, Author)

Meeting minutes:

  • Discussed the major approaches that look probable

    1. Prompt injection approach (LangChain)
    2. Toolformer
    3. Model manipulation (RETRO, etc.)
  • Discussed use cases like

    1. reducing hallucination
    2. semantic search over private documents
    3. dynamic knowledge updates

We all agreed to spend another week on paper reading, Discord chats, and exploring small tasks.
We will meet in a week and will probably be able to decide on the approach.

@kpoeppel will share some papers on retrieval

Please comment if I missed something

@ash3n commented May 17, 2023

Another way to incorporate retrieval, sort of an upgrade to 3, is to take a pretrained non-retrieval LLM and fine-tune it with retrieval augmentation, simply adding retrieved documents into the input. You can either use a pretrained retriever as RETRO does, co-train a retriever like REALM does during this fine-tuning stage, or use a nonparametric retriever like BM25, which works surprisingly well (sketched below).
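A small sketch of the BM25 variant, using the rank_bm25 package as one possible nonparametric retriever; the corpus, query, and input format are placeholders:

```python
from rank_bm25 import BM25Okapi

corpus = ["..."]  # placeholder document chunks, e.g. Wikipedia paragraphs
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "When was the Eiffel Tower completed?"
top_docs = bm25.get_top_n(query.lower().split(), corpus, n=3)

# Fine-tuning input: retrieved documents are simply prepended to the user message.
model_input = "\n\n".join(top_docs) + "\n\n" + query
```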

This method was introduced by a very recent paper, though I'm blanking out on the name. Hopefully someone will be able to identify this particular paper.

@melvinebenezer (Collaborator, Author)

Meeting Minutes

Embedding Method Team

  • @kenhktsui has been experimenting with Toolformer
  • @kpoeppel is confident about experimenting with LLaMA 30B as a retrieval system.
  • @kpoeppel will check with @umbra-scientia on the status of the experiments done so far with the pre-trained encoder/decoder
  • Separate proposals (issues) will be created to track these experiments

Prompt-Injection Team

No updates yet

@kpoeppel (Contributor) commented May 19, 2023

Some clarification:
I would not use LLaMA 30B as a retrieval system, but rather inject retrieved information, encoded with a pre-trained encoder, into a pre-trained LLaMA 7B model, learning only the cross-attention part (a rough sketch of the freezing scheme is below). One could also try out LoRA at first and then fuse those adapters. All of this will of course be coordinated with @umbra-scientia.

After those first experiments we can later extrapolate to LLaMA 30B.
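A rough sketch of that freezing scheme, assuming a Hugging Face LLaMA checkpoint and that the new cross-attention blocks can be identified by a module name; the checkpoint path and module name used here are hypothetical:

```python
from transformers import AutoModelForCausalLM

# Load the pre-trained base model (path is a placeholder).
model = AutoModelForCausalLM.from_pretrained("path/to/llama-7b")

# Freeze every pre-trained parameter ...
for param in model.parameters():
    param.requires_grad = False

# ... then unfreeze only the newly added retrieval cross-attention blocks
# (assumes such blocks have been added and carry this hypothetical name).
for name, param in model.named_parameters():
    if "retrieval_cross_attn" in name:
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```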

@LAION-AI LAION-AI locked and limited conversation to collaborators Jun 14, 2023
@andreaskoepf andreaskoepf converted this issue into discussion #3462 Jun 14, 2023
