Change chunking/splitting method from SentenceWindowNodeParser to SentenceSplitter or SemanticSplitterNodeParser #1917
Comments
There is no need to pass the SentenceSplitter dynamically, as it is present by default in the ServiceContext class definition. Below is the configuration in the ServiceContext:

node_parser = (

If you don't pass any splitting method in `transformations`, it falls through to `_get_default_node_parser`, which uses SentenceSplitter. Follow these steps: go to ingest_service.py and remove the node parser from `transformations`.

FYI, if you don't want to use a sentence-based splitting method, use TokenTextSplitter. You need to make the changes below in ingest_service.py:

from llama_index.node_parser import SentenceWindowNodeParser
from engine_gpt.components.embedding.embedding_component import EmbeddingComponent

logger = logging.getLogger(__name__)

@singleton
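The fallback this comment describes can be modeled with a small sketch. This is an assumption about the control flow, not privateGPT's or llama-index's actual code; the dicts stand in for real parser objects:

```python
# Simplified model of the described behavior: when no node parser
# appears in `transformations`, a default sentence splitter is used.
# The dicts here are hypothetical stand-ins, not llama-index objects.

def default_node_parser():
    # Stands in for llama-index's _get_default_node_parser, which the
    # comment above says returns a SentenceSplitter-based parser.
    return {"kind": "SentenceSplitter"}

def resolve_transformations(transformations=None):
    # If the caller supplied no splitting method, fall back to the default.
    if not transformations:
        return [default_node_parser()]
    return transformations
```

So removing the node parser from `transformations` (rather than passing SentenceWindowNodeParser explicitly) is enough to get SentenceSplitter behavior, on this reading of the fallback.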
I am getting very bad responses from the LLM when chatting with the file I uploaded. I am using Qdrant as a vector DB and I checked the contents in the database and realized that all the chunks are only one sentence long.
This is because by default privategpt uses SentenceWindowNodeParser to split the text in
ingest_service.py
node_parser = SentenceWindowNodeParser.from_defaults()
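This would explain the one-sentence chunks: a sentence-window parser stores one sentence per node and keeps the surrounding sentences only as metadata. A minimal sketch of that behavior (a simplified model, not the library's implementation; the sentence regex and the window_size value are assumptions):

```python
import re

def sentence_window_nodes(text, window_size=3):
    """Simplified model of sentence-window parsing: each node's text is a
    single sentence, and neighboring sentences are kept only as a
    "window" metadata field. The embedded/stored text stays one sentence
    long, which matches what the vector store shows."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({
            "text": sent,                           # one sentence per node
            "window": " ".join(sentences[lo:hi]),   # context kept as metadata
        })
    return nodes

nodes = sentence_window_nodes("One. Two. Three. Four.")
print(len(nodes))        # 4 nodes, one per sentence
print(nodes[0]["text"])  # "One."
```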
Instead of the default, I tried implementing SentenceSplitter like this:
node_parser = SentenceSplitter.from_defaults(chunk_size=1024, chunk_overlap=200)
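For contrast, these parameters mean roughly "chunks of up to 1024 characters, each repeating the last 200 characters of the previous chunk." A rough sketch of the chunk_size/chunk_overlap semantics (llama-index's SentenceSplitter additionally respects sentence boundaries, so this fixed-size version is only an approximation):

```python
def split_fixed(text, chunk_size=1024, chunk_overlap=200):
    """Illustrative fixed-size splitter, not llama-index's SentenceSplitter:
    each chunk holds at most chunk_size characters and starts with the
    last chunk_overlap characters of the previous chunk, so context is
    shared across chunk boundaries."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_fixed("x" * 3000, chunk_size=1024, chunk_overlap=200)
print(len(chunks))     # 4 chunks for 3000 characters with step 824
print(len(chunks[0]))  # 1024
```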
I also tried implementing SemanticSplitterNodeParser like this:
Both implementations work and the embedding happens without errors, but when I check the database the file is still split by sentences instead of 1024 characters in case of SentenceSplitter or by meaning in case of SemanticSplitterNodeParser.
Should I make some other changes and what am I missing? Why is splitting by sentence the default if it's so ineffective?
I run privategpt with:
PGPT_PROFILES=vllm make run
Thanks in advance for any help!