
First version of file-search tool for assistant-api #20

Closed · RobinQu opened this issue May 21, 2024 · 1 comment
RobinQu commented May 21, 2024

file search

search pipeline

https://platform.openai.com/docs/assistants/tools/file-search/how-it-works

The file_search tool implements several retrieval best practices out of the box to help you extract the right data from your files and augment the model’s responses. The file_search tool:

  • Rewrites user queries to optimize them for search.
  • Breaks down complex user queries into multiple searches it can run in parallel.
  • Runs both keyword and semantic searches across both assistant and thread vector stores.
  • Reranks search results to pick the most relevant ones before generating the final response.
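
A minimal sketch of such a pipeline, assuming hypothetical interfaces (llm.acomplete, retriever.asearch and reranker.rerank are illustrative names, not anything from the OpenAI docs): rewrite the query, fan the rewritten queries out over keyword and semantic retrievers for both vector stores in parallel, then rerank the merged hits.

import asyncio
from typing import List

async def rewrite_query(llm, user_query: str) -> List[str]:
    # Ask the model to rewrite/split the question into one or more search queries.
    prompt = f"Rewrite the question into short search queries, one per line:\n{user_query}"
    return [q.strip() for q in (await llm.acomplete(prompt)).splitlines() if q.strip()]

async def file_search(llm, retrievers, reranker, user_query: str, top_k: int = 5):
    queries = await rewrite_query(llm, user_query)
    # Keyword and semantic retrievers for assistant and thread stores, queried in parallel.
    tasks = [r.asearch(q) for q in queries for r in retrievers]
    hits = [doc for docs in await asyncio.gather(*tasks) for doc in docs]
    # Keep only the most relevant chunks for answer generation.
    return reranker.rerank(user_query, hits)[:top_k]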

online search sources

https://platform.openai.com/docs/assistants/tools/file-search/vector-stores

Each vector_store can hold up to 10,000 files.
Today, you can attach at most one vector store to an assistant and at most one vector store to a thread.

vector store sources:

  • tool_resources on the assistant object -> vector_store_id
  • tool_resources on the thread object -> vector_store_id
  • attachments on a user message -> file_id -> create a new vector store, or insert into this thread's vector store?
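
A rough sketch of merging those three sources when a run starts; the object shapes below are assumptions that loosely mirror the Assistants API, and create_or_extend_thread_store is a hypothetical helper for the open question above.

def resolve_vector_store_ids(assistant: dict, thread: dict, messages: list) -> list:
    """Collect vector store ids from the assistant, the thread, and message attachments."""
    ids = []
    for obj in (assistant, thread):
        fs = obj.get("tool_resources", {}).get("file_search", {})
        ids += fs.get("vector_store_ids", [])
    # Files attached to user messages: either create a new store for the thread
    # or insert them into the thread's existing store.
    attached_file_ids = [
        att["file_id"]
        for msg in messages
        for att in msg.get("attachments", [])
    ]
    if attached_file_ids:
        ids.append(create_or_extend_thread_store(thread, attached_file_ids))  # hypothetical helper
    return ids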

tool choices

Does it always trigger file search when a vector store is configured? It seems it no longer does.

Read users' complaints after the V2 release.

My guess is that an internal agent decides whether a file-search call is necessary.

Another discussion about how the file search tool works:
https://community.openai.com/t/how-knowledge-base-files-are-handled-assistants-api/601721/14
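
If that guess is right, the equivalent on our side is to expose file_search as an ordinary function tool and let the model decide per turn whether to call it. The tool schema below is only illustrative.

FILE_SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "file_search",
        "description": "Search the attached files for passages relevant to the query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

def build_tool_list(vector_store_ids: list) -> list:
    # Offer the tool only when at least one vector store is configured;
    # the model is still free not to call it.
    return [FILE_SEARCH_TOOL] if vector_store_ids else []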

data expiration

https://platform.openai.com/docs/assistants/tools/file-search/managing-costs-with-expiration-policies

Vector stores created using thread helpers (like tool_resources.file_search.vector_stores in Threads or message.attachments in Messages) have a default expiration policy of 7 days after they were last active (defined as the last time the vector store was part of a run).
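
A minimal sketch of enforcing that policy ourselves, assuming we track a last_active_at timestamp that is refreshed whenever the store takes part in a run (field names are assumptions mirroring the quoted docs):

from datetime import datetime, timedelta, timezone

DEFAULT_TTL = timedelta(days=7)

def is_expired(vector_store: dict, now: datetime | None = None) -> bool:
    """Expired when the store has not been part of a run for longer than its TTL."""
    now = now or datetime.now(timezone.utc)
    last_active = vector_store["last_active_at"]  # refreshed on every run that uses the store
    return now - last_active > vector_store.get("ttl", DEFAULT_TTL)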

data deletion

  • Deleting the vector store file object or,
  • By deleting the underlying file object (which removes the file from all vector_store and code_interpreter configurations across all assistants and threads in your organization)
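
If we mirror that behaviour, deleting the underlying file has to cascade across every configuration that references it; a sketch over an assumed in-memory data model:

def delete_file(file_id: str, vector_stores: list, assistants: list) -> None:
    # Detach the file from every vector store ...
    for vs in vector_stores:
        vs["file_ids"] = [fid for fid in vs["file_ids"] if fid != file_id]
    # ... and from every code_interpreter configuration that references it.
    for a in assistants:
        ci = a.get("tool_resources", {}).get("code_interpreter", {})
        if "file_ids" in ci:
            ci["file_ids"] = [fid for fid in ci["file_ids"] if fid != file_id]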

RobinQu commented Jun 8, 2024

More implementation details

Content annotation in file search

Existing methods

Example 1: langchain

https://python.langchain.com/v0.2/docs/how_to/qa_citations/#setup

  • Direct prompting: explicitly return structured citation data using a specific prompt.
  • Retrieval post-processing: return compressed docs together with the answer.
  • Generation post-processing: ask the model to give explanations for its citations.
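
For the direct-prompting route, the cited how-to binds a citation schema to a tool-calling model; a condensed sketch along those lines (the schema and the model name are examples, and depending on the installed langchain version the pydantic import may differ):

from typing import List
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI

class CitedAnswer(BaseModel):
    """Answer the question using only the given sources, citing them by id."""
    answer: str = Field(description="Answer to the user question")
    citations: List[int] = Field(description="Ids of the source chunks that justify the answer")

structured_llm = ChatOpenAI(model="gpt-4o-mini").with_structured_output(CitedAnswer)

# Toy "retrieved context"; in the real pipeline this comes from the vector store.
formatted_docs = "Source 1: The cheetah is capable of running at 93 to 104 km/h (58 to 65 mph)."
result = structured_llm.invoke(f"{formatted_docs}\n\nQuestion: How fast are cheetahs?")
print(result.answer, result.citations)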

Example 2: llmware

https://medium.com/@darrenoberst/using-llmware-for-rag-evidence-verification-8611abf2dbeb
https://github.com/llmware-ai/llmware

Uses retrieval-based methods to provide evidence for a more reliable RAG pipeline.
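
The underlying idea, independent of llmware's own API, is to check that every quoted piece of evidence actually appears in the retrieved chunks; a generic sketch using fuzzy matching:

import difflib

def verify_evidence(quotes: list, chunks: list, threshold: float = 0.8) -> dict:
    """Map each quoted sentence to the best-matching source chunk, or None if unverified."""
    report = {}
    for quote in quotes:
        best_score, best_chunk = max(
            ((difflib.SequenceMatcher(None, quote, c).ratio(), c) for c in chunks),
            default=(0.0, None),
        )
        report[quote] = best_chunk if best_score >= threshold else None
    return report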

Example 3: A more elaborate prompt for guided citations

https://medium.com/@yotamabraham/in-text-citing-with-langchain-question-answering-e19a24d81e39

from IPython.display import Markdown, display

# `qdocs` holds the concatenated retrieved document chunks; `llm` is a LangChain chat model.
response = llm.call_as_llm(
    f"""{qdocs} Question: Please answer the question with citations to the paragraphs.
For every sentence you write, cite the book name and paragraph number as <id_x_x>.

At the end of your commentary:
1. Add key words from the book paragraphs.
2. Suggest a further question that can be answered by the paragraphs provided.
3. Create a sources list of book names, paragraph numbers, author names, and a link for each book you cited."""
)

display(Markdown(response))

According to the book The International Journalism Handbook by 
Rodrigo Zamith, technology has played a significant role in shaping todays 
journalistic work (s_225_100). The development of the printing press 
allowed for the mass distribution of journalism, although it also imposed 
limitations on the formats that journalistic products could take (s_225_100). 
The telegraph enabled the development of newswire services and facilitated 
quick transmission of reports from remote locations (s_225_100). 
On the other hand, the proliferation of the telephone allowed reporters to 
conduct more reporting from within the newsroom by directly contacting 
their sources (s_225_100).

Technological actants have also influenced the way news audiences and 
journalists communicate with each other (s_274_100). 
Platforms like Twitter have made it easier for audience members to 
provide immediate and public feedback to journalists, leading to more 
meaningful and direct audience participation (s_274_100). However, this 
can also result in negative forms of participation, such as brigading
and strategic harassment of journalists (s_274_100).

In recent times, journalists are more likely to work in teams, 
collaborate across organizations, and involve their audiences in 
various aspects of news production (s_914_100). 
This shift has moved away from the historical practice of journalists 
working in a more solitary fashion (s_914_100).

The accessibility of news content and sources has increased significantly, 
allowing news audiences to have access to a wide range of options (s_268_100). 
This has made it challenging for a single journalistic outlet to gain a 
near-monopoly on audiences (s_268_100). However, a few large organizations 
with strong brand recognition can still capture substantial audiences, 
while smaller journalistic outlets cater to niche audiences and are often 
considered interchangeable by users (s_268_100).

Keywords: technology, printing press, telegraph, telephone, 
audience participation, news production, news content accessibility, 
journalistic outlets.

Further question: How has the evolution of technology impacted the 
credibility and trustworthiness of journalistic outlets?

Sources:

Book: The International Journalism Handbook
Paragraph numbers: s_225_100, s_274_100, s_914_100, s_268_100
Author: Rodrigo Zamith
Link: https://books.rodrigozamith.com/the-international-journalism-handbook/

Problems

  • It is easy to trace the information needed for the file_citation field in annotations, but the text field, which acts as a special marker in the response, is unclear to reproduce.
  • In Example 3 we get very promising results, with text anchors in the answer and citations at the end. But this requires a completely different prompt to generate the answer, which is intrusive to the current design that allows arbitrary prompts from agent executors.
  • start_index and end_index are apparently implementation dependent, and thus impossible to reproduce unless OpenAI discloses more details.
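
For reference, the shape we would need to reproduce looks roughly like the dataclass below (field names follow the OpenAI annotation object; the dataclass itself is just a sketch):

from dataclasses import dataclass

@dataclass
class FileCitationAnnotation:
    text: str         # the marker substring inserted into the answer text
    file_id: str      # the cited file; easy to trace from retrieval results
    start_index: int  # position of the marker in the answer; implementation dependent
    end_index: int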

Some tests

Direct prompting with mixtral-8x7b-q6-gguf

rag_with_citation.txt

 Cheetahs are capable of running at speeds between 93 to 104 kilometers per hour (58 to 65 miles per hour) (id.1).

Despite their impressive speed, cheetahs only score at 16 body lengths per second, which is lower than Anna's hummingbird's length-specific velocity (id.3).

1. Cheetah speed
2. Running speed of cheetahs

Quotations from context information:

* "The cheetah is capable of running at 93 to 104 km/h (58 to 65 mph)" (id.1)
* "it has evolved specialized adaptations for speed, including a light build, long thin legs and a long tail" (id.1)
* "Anna's hummingbird has the highest known length-specific velocity attained by any vertebrate" (id.3)
* "The cheetah, the fastest land mammal, scores at only 16 body lengths per second" (id.3)

Generation post-processing

generation_post_processing.txt

Proposed solution

Use an LLM to post-process the final answer:

input_parser -(prompt)-> agent_executor [file_search and other tool uses] -(answer)-> annotator -> (final answer with annotations)
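
A sketch of that annotator step, assuming it receives the raw answer plus the retrieved chunks and asks an LLM to insert citation markers (the prompt, the marker format and llm.complete are all illustrative):

def annotate_answer(llm, answer: str, cited_chunks: list) -> str:
    """Post-process the agent's answer: add <cite:FILE_ID> markers without changing the wording."""
    sources = "\n".join(f"[{c['file_id']}] {c['text']}" for c in cited_chunks)
    prompt = (
        "Rewrite the answer unchanged, but append <cite:FILE_ID> after every sentence "
        "that is supported by one of the sources below.\n"
        f"Sources:\n{sources}\n\nAnswer:\n{answer}"
    )
    return llm.complete(prompt)  # hypothetical completion call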

RobinQu closed this as completed Jun 14, 2024