
First version of file-search tool for assistant-api #20

Closed · RobinQu opened this issue May 21, 2024 · 1 comment
RobinQu commented May 21, 2024

file search

search pipeline

https://platform.openai.com/docs/assistants/tools/file-search/how-it-works

The file_search tool implements several retrieval best practices out of the box to help you extract the right data from your files and augment the model’s responses. The file_search tool:

  • Rewrites user queries to optimize them for search.
  • Breaks down complex user queries into multiple searches it can run in parallel.
  • Runs both keyword and semantic searches across both assistant and thread vector stores.
  • Reranks search results to pick the most relevant ones before generating the final response.
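
A minimal sketch of such a pipeline, assuming hypothetical interfaces (llm.acomplete, retriever.asearch and reranker.rerank are illustrative names, not anything from the OpenAI docs): rewrite the query, fan the rewritten queries out over keyword and semantic retrievers for both vector stores in parallel, then rerank the merged hits.

import asyncio
from typing import List

async def rewrite_query(llm, user_query: str) -> List[str]:
    # Ask the model to rewrite/split the question into one or more search queries.
    prompt = f"Rewrite the question into short search queries, one per line:\n{user_query}"
    return [q.strip() for q in (await llm.acomplete(prompt)).splitlines() if q.strip()]

async def file_search(llm, retrievers, reranker, user_query: str, top_k: int = 5):
    queries = await rewrite_query(llm, user_query)
    # Keyword and semantic retrievers for assistant and thread stores, queried in parallel.
    tasks = [r.asearch(q) for q in queries for r in retrievers]
    hits = [doc for docs in await asyncio.gather(*tasks) for doc in docs]
    # Keep only the most relevant chunks for answer generation.
    return reranker.rerank(user_query, hits)[:top_k]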

online search sources

https://platform.openai.com/docs/assistants/tools/file-search/vector-stores

Each vector_store can hold up to 10,000 files.
Today, you can attach at most one vector store to an assistant and at most one vector store to a thread.

vector store sources:

  • tool_resources on the assistant object -> vector_store_id
  • tool_resources on the thread object -> vector_store_id
  • attachments on a user message -> file_id -> create a new vector store, or insert into this thread's vector store?
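
A rough sketch of merging those three sources when a run starts; the object shapes below are assumptions that loosely mirror the Assistants API, and create_or_extend_thread_store is a hypothetical helper for the open question above.

def resolve_vector_store_ids(assistant: dict, thread: dict, messages: list) -> list:
    """Collect vector store ids from the assistant, the thread, and message attachments."""
    ids = []
    for obj in (assistant, thread):
        fs = obj.get("tool_resources", {}).get("file_search", {})
        ids += fs.get("vector_store_ids", [])
    # Files attached to user messages: either create a new store for the thread
    # or insert them into the thread's existing store.
    attached_file_ids = [
        att["file_id"]
        for msg in messages
        for att in msg.get("attachments", [])
    ]
    if attached_file_ids:
        ids.append(create_or_extend_thread_store(thread, attached_file_ids))  # hypothetical helper
    return ids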

tool choices

Does it always trigger file search when a vector store is configured? It seems it no longer does.

Read users' complaints after the V2 release.

My guess is that an internal agent decides whether a file-search call is necessary.

Another discussion about how the file search tool works:
https://community.openai.com/t/how-knowledge-base-files-are-handled-assistants-api/601721/14
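
If that guess is right, the equivalent on our side is to expose file_search as an ordinary function tool and let the model decide per turn whether to call it. The tool schema below is only illustrative.

FILE_SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "file_search",
        "description": "Search the attached files for passages relevant to the query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

def build_tool_list(vector_store_ids: list) -> list:
    # Offer the tool only when at least one vector store is configured;
    # the model is still free not to call it.
    return [FILE_SEARCH_TOOL] if vector_store_ids else []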

data expiration

https://platform.openai.com/docs/assistants/tools/file-search/managing-costs-with-expiration-policies

Vector stores created using thread helpers (like tool_resources.file_search.vector_stores in Threads or message.attachments in Messages) have a default expiration policy of 7 days after they were last active (defined as the last time the vector store was part of a run).
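
A minimal sketch of enforcing that policy ourselves, assuming we track a last_active_at timestamp that is refreshed whenever the store takes part in a run (field names are assumptions mirroring the quoted docs):

from datetime import datetime, timedelta, timezone

DEFAULT_TTL = timedelta(days=7)

def is_expired(vector_store: dict, now: datetime | None = None) -> bool:
    """Expired when the store has not been part of a run for longer than its TTL."""
    now = now or datetime.now(timezone.utc)
    last_active = vector_store["last_active_at"]  # refreshed on every run that uses the store
    return now - last_active > vector_store.get("ttl", DEFAULT_TTL)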

data deletion

  • Deleting the vector store file object or,
  • By deleting the underlying file object (which removes the file from all vector_store and code_interpreter configurations across all assistants and threads in your organization)
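
If we mirror that behaviour, deleting the underlying file has to cascade across every configuration that references it; a sketch over an assumed in-memory data model:

def delete_file(file_id: str, vector_stores: list, assistants: list) -> None:
    # Detach the file from every vector store ...
    for vs in vector_stores:
        vs["file_ids"] = [fid for fid in vs["file_ids"] if fid != file_id]
    # ... and from every code_interpreter configuration that references it.
    for a in assistants:
        ci = a.get("tool_resources", {}).get("code_interpreter", {})
        if "file_ids" in ci:
            ci["file_ids"] = [fid for fid in ci["file_ids"] if fid != file_id]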

RobinQu commented Jun 8, 2024

More implementation details

Content annotation in file search

Existing methods

Example 1: langchain

https://python.langchain.com/v0.2/docs/how_to/qa_citations/#setup

  • Direct prompting: explicitly return structured citation data using a specific prompt.
  • Retrieval post-processing: return compressed docs together with the answer.
  • Generation post-processing: ask the model to give explanations for its citations.
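
For the direct-prompting route, the cited how-to binds a citation schema to a tool-calling model; a condensed sketch along those lines (the schema and the model name are examples, and depending on the installed langchain version the pydantic import may differ):

from typing import List
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI

class CitedAnswer(BaseModel):
    """Answer the question using only the given sources, citing them by id."""
    answer: str = Field(description="Answer to the user question")
    citations: List[int] = Field(description="Ids of the source chunks that justify the answer")

structured_llm = ChatOpenAI(model="gpt-4o-mini").with_structured_output(CitedAnswer)

# Toy "retrieved context"; in the real pipeline this comes from the vector store.
formatted_docs = "Source 1: The cheetah is capable of running at 93 to 104 km/h (58 to 65 mph)."
result = structured_llm.invoke(f"{formatted_docs}\n\nQuestion: How fast are cheetahs?")
print(result.answer, result.citations)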

Example 2: llmware

https://medium.com/@darrenoberst/using-llmware-for-rag-evidence-verification-8611abf2dbeb
https://github.com/llmware-ai/llmware

Uses retrieval-based methods to provide evidence for a more reliable RAG pipeline.
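
The underlying idea, independent of llmware's own API, is to check that every quoted piece of evidence actually appears in the retrieved chunks; a generic sketch using fuzzy matching:

import difflib

def verify_evidence(quotes: list, chunks: list, threshold: float = 0.8) -> dict:
    """Map each quoted sentence to the best-matching source chunk, or None if unverified."""
    report = {}
    for quote in quotes:
        best_score, best_chunk = max(
            ((difflib.SequenceMatcher(None, quote, c).ratio(), c) for c in chunks),
            default=(0.0, None),
        )
        report[quote] = best_chunk if best_score >= threshold else None
    return report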

Example 3: A more elaborate prompt for guided citations

https://medium.com/@yotamabraham/in-text-citing-with-langchain-question-answering-e19a24d81e39

from IPython.display import Markdown, display

# `qdocs` holds the concatenated retrieved document chunks; `llm` is a LangChain chat model.
response = llm.call_as_llm(
    f"""{qdocs} Question: Please answer the question with citations to the paragraphs.
For every sentence you write, cite the book name and paragraph number as <id_x_x>.

At the end of your commentary:
1. Add key words from the book paragraphs.
2. Suggest a further question that can be answered by the paragraphs provided.
3. Create a sources list of book names, paragraph numbers, author names, and a link for each book you cited."""
)

display(Markdown(response))

According to the book The International Journalism Handbook by 
Rodrigo Zamith, technology has played a significant role in shaping todays 
journalistic work (s_225_100). The development of the printing press 
allowed for the mass distribution of journalism, although it also imposed 
limitations on the formats that journalistic products could take (s_225_100). 
The telegraph enabled the development of newswire services and facilitated 
quick transmission of reports from remote locations (s_225_100). 
On the other hand, the proliferation of the telephone allowed reporters to 
conduct more reporting from within the newsroom by directly contacting 
their sources (s_225_100).

Technological actants have also influenced the way news audiences and 
journalists communicate with each other (s_274_100). 
Platforms like Twitter have made it easier for audience members to 
provide immediate and public feedback to journalists, leading to more 
meaningful and direct audience participation (s_274_100). However, this 
can also result in negative forms of participation, such as brigading
and strategic harassment of journalists (s_274_100).

In recent times, journalists are more likely to work in teams, 
collaborate across organizations, and involve their audiences in 
various aspects of news production (s_914_100). 
This shift has moved away from the historical practice of journalists 
working in a more solitary fashion (s_914_100).

The accessibility of news content and sources has increased significantly, 
allowing news audiences to have access to a wide range of options (s_268_100). 
This has made it challenging for a single journalistic outlet to gain a 
near-monopoly on audiences (s_268_100). However, a few large organizations 
with strong brand recognition can still capture substantial audiences, 
while smaller journalistic outlets cater to niche audiences and are often 
considered interchangeable by users (s_268_100).

Keywords: technology, printing press, telegraph, telephone, 
audience participation, news production, news content accessibility, 
journalistic outlets.

Further question: How has the evolution of technology impacted the 
credibility and trustworthiness of journalistic outlets?

Sources:

Book: The International Journalism Handbook
Paragraph numbers: s_225_100, s_274_100, s_914_100, s_268_100
Author: Rodrigo Zamith
Link: https://books.rodrigozamith.com/the-international-journalism-handbook/

Problems

  • It is easy to trace the information needed for the file_citation field in annotations, but the text field, which acts as a special marker in the response, is unclear to reproduce.
  • In Example 3 we get very promising results, with text anchors in the answer and citations at the end. But this requires a completely different prompt to generate the answer, which is intrusive to the current design that allows arbitrary prompts from agent executors.
  • start_index and end_index are apparently implementation dependent, and thus impossible to reproduce unless OpenAI discloses more details.
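
For reference, the shape we would need to reproduce looks roughly like the dataclass below (field names follow the OpenAI annotation object; the dataclass itself is just a sketch):

from dataclasses import dataclass

@dataclass
class FileCitationAnnotation:
    text: str         # the marker substring inserted into the answer text
    file_id: str      # the cited file; easy to trace from retrieval results
    start_index: int  # position of the marker in the answer; implementation dependent
    end_index: int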

Some tests

Direct prompting with mixtral-8x7b-q6-gguf

rag_with_citation.txt

 Cheetahs are capable of running at speeds between 93 to 104 kilometers per hour (58 to 65 miles per hour) (id.1).

Despite their impressive speed, cheetahs only score at 16 body lengths per second, which is lower than Anna's hummingbird's length-specific velocity (id.3).

1. Cheetah speed
2. Running speed of cheetahs

Quotations from context information:

* "The cheetah is capable of running at 93 to 104 km/h (58 to 65 mph)" (id.1)
* "it has evolved specialized adaptations for speed, including a light build, long thin legs and a long tail" (id.1)
* "Anna's hummingbird has the highest known length-specific velocity attained by any vertebrate" (id.3)
* "The cheetah, the fastest land mammal, scores at only 16 body lengths per second" (id.3)

Generation post-processing

generation_post_processing.txt

Proposed solution

Use an LLM to post-process the final answer:

input_parser -(prompt)-> agent_executor [file_search and other tool uses] -(answer)-> annotator -> (final answer with annotations)
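
A sketch of that annotator step, assuming it receives the raw answer plus the retrieved chunks and asks an LLM to insert citation markers (the prompt, the marker format and llm.complete are all illustrative):

def annotate_answer(llm, answer: str, cited_chunks: list) -> str:
    """Post-process the agent's answer: add <cite:FILE_ID> markers without changing the wording."""
    sources = "\n".join(f"[{c['file_id']}] {c['text']}" for c in cited_chunks)
    prompt = (
        "Rewrite the answer unchanged, but append <cite:FILE_ID> after every sentence "
        "that is supported by one of the sources below.\n"
        f"Sources:\n{sources}\n\nAnswer:\n{answer}"
    )
    return llm.complete(prompt)  # hypothetical completion call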

RobinQu closed this as completed Jun 14, 2024