[BUG]: Confluence connector only save partial document body in JSON #1501

Open
jazelly opened this issue May 23, 2024 · 5 comments
Assignees: shatfield4
Labels: core-team-only; investigating (Core team or maintainer will or is currently looking into this issue); possible bug (Bug was reported but is not confirmed or is unable to be replicated)

Comments

@jazelly
Contributor

jazelly commented May 23, 2024

How are you running AnythingLLM?

Local development

What happened?

After the Confluence connector scrapes Confluence documents, the document bodies are not fully saved in the JSON files under storage.

After embedding, the documents do not provide the expected information. For example, we have a Confluence doc containing some code snippets and would like to ask questions that retrieve them. The code snippets are lost during scraping, however, so the LLM responds with only basic info.

I am not sure whether this is a limitation of the Atlassian API, but users would surely expect more than just basic info from their Confluence documents.

Are there known steps to reproduce?

No response

@jazelly added the possible bug label May 23, 2024
@shatfield4 self-assigned this May 23, 2024
@timothycarambat
Member

@jazelly is the pageContent of the associated document empty?

@timothycarambat added the investigating and core-team-only labels May 23, 2024
@jazelly
Contributor Author

jazelly commented May 23, 2024

@timothycarambat the pageContent is not empty. It has content, but it does not include the script content, e.g.

VIEW ALL\nsql\nASSIGN TO AN ACCOUNT\nThe account must already exist.\nsql\n

Notice the sql in the pageContent, where a SQL command is supposed to be. The LLM makes up answers when we ask a question related to it, since the prompt contains no reference to the real command.
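To illustrate one way this kind of loss can happen, here is a minimal sketch. It assumes the connector reads the page's storage-format body, where Confluence keeps a code block inside a `code` macro with the text in a CDATA `ac:plain-text-body`. A plain tag-stripping pass can then keep the `language` parameter (`sql`) and drop the code itself. The sample page and its SQL statement are hypothetical, and `storage_to_text` is an illustrative helper, not the connector's or LangChain's actual code.

```python
import re
from html import unescape

# Opening tag of a Confluence "code" macro, capturing everything inside it.
CODE_MACRO = re.compile(
    r'<ac:structured-macro\b[^>]*\bac:name="code"[^>]*>(.*?)</ac:structured-macro>',
    re.DOTALL,
)
LANGUAGE = re.compile(
    r'<ac:parameter\b[^>]*\bac:name="language"[^>]*>(.*?)</ac:parameter>', re.DOTALL
)
CDATA_BODY = re.compile(
    r'<ac:plain-text-body>\s*<!\[CDATA\[(.*?)\]\]>\s*</ac:plain-text-body>', re.DOTALL
)
TAG = re.compile(r'<[^>]+>')


def storage_to_text(storage_html):
    """Convert storage-format markup to plain text while keeping code macro bodies."""
    blocks = []

    def stash(match):
        # Pull the language label and the CDATA code out of the macro, then
        # leave a placeholder so tag stripping cannot mangle '<' or '>' in the code.
        inner = match.group(1)
        lang = LANGUAGE.search(inner)
        body = CDATA_BODY.search(inner)
        label = lang.group(1).strip() if lang else "code"
        code = body.group(1).strip() if body else ""
        blocks.append(f"{label}:\n{code}")
        return f"\n@@CODE{len(blocks) - 1}@@\n"

    text = CODE_MACRO.sub(stash, storage_html)
    text = unescape(TAG.sub("\n", text))
    text = "\n".join(ln.strip() for ln in text.splitlines() if ln.strip())
    for i, block in enumerate(blocks):
        text = text.replace(f"@@CODE{i}@@", block)
    return text


# Hypothetical page fragment; the SQL statement is invented for illustration.
SAMPLE = (
    '<h2>VIEW ALL</h2>'
    '<ac:structured-macro ac:name="code">'
    '<ac:parameter ac:name="language">sql</ac:parameter>'
    '<ac:plain-text-body><![CDATA[SELECT * FROM accounts WHERE id < 10;]]></ac:plain-text-body>'
    '</ac:structured-macro>'
    '<p>The account must already exist.</p>'
)

if __name__ == "__main__":
    print(storage_to_text(SAMPLE))
```

With this approach the text handed to the embedder would contain the statement itself rather than only the word sql. Whether the real connector loses the code at this step is something to confirm against its source.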

@jainpradeep

Issue faced with local deployment as well. LLM responses are poor.

@timothycarambat
Member

> Issue faced with local deployment as well. LLM responses are poor.

This has nothing to do with the deployment method or RAG structure. The RAG results are bad because the scraper is returning poor information from the documents. As @jazelly mentions, it seems some non-text blocks are not returned or parsed by the LangChain parser, which is where the problem lies.

@jazelly
Contributor Author

jazelly commented May 30, 2024

This might be an issue better suited to the LangChain community.

For us, the only current solution is to write our own scraper to download these documents and upload them to anything-llm via its APIs.
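A rough sketch of that workaround follows, under stated assumptions. It fetches one page's storage-format body through the Confluence Cloud REST API (`/wiki/rest/api/content/{id}?expand=body.storage`) and pushes the extracted text to AnythingLLM. The AnythingLLM endpoint path (`/api/v1/document/raw-text`) and its payload shape are assumptions to check against your instance's API docs. The function names, hosts, and credentials are hypothetical. The functions only build the requests, so nothing is sent until you call `urllib.request.urlopen`.

```python
import base64
import json
import urllib.parse
import urllib.request


def confluence_page_request(base_url, page_id, email, api_token):
    """Build a GET request for one page's storage-format body."""
    query = urllib.parse.urlencode({"expand": "body.storage"})
    url = f"{base_url.rstrip('/')}/wiki/rest/api/content/{page_id}?{query}"
    # Confluence Cloud accepts HTTP Basic auth with an account email and API token.
    basic = base64.b64encode(f"{email}:{api_token}".encode()).decode()
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {basic}", "Accept": "application/json"},
    )


def anythingllm_upload_request(base_url, api_key, title, text):
    """Build a POST request that submits raw text as a document.

    ASSUMPTION: the endpoint path and JSON fields below are not confirmed by
    this thread; verify them against your AnythingLLM instance's API docs.
    """
    payload = json.dumps({"textContent": text, "metadata": {"title": title}}).encode()
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1/document/raw-text",
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

To use it, open the Confluence request with `urllib.request.urlopen`, read `["body"]["storage"]["value"]` from the JSON response, convert that markup to text with your own extractor so code blocks are kept, and then send the upload request.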
