
Change chunking/splitting method from SentenceWindowNodeParser to SentenceSplitter or SemanticSplitterNodeParser #1917

Open
lenartgolob opened this issue May 8, 2024 · 2 comments

Comments

lenartgolob commented May 8, 2024

I am getting very poor responses from the LLM when chatting with the file I uploaded. I am using Qdrant as a vector DB, and when I checked the contents of the database I realized that all the chunks are only one sentence long.
This is because, by default, privateGPT uses SentenceWindowNodeParser to split the text in ingest_service.py:
node_parser = SentenceWindowNodeParser.from_defaults()

Instead of the default, I tried implementing SentenceSplitter like this:
node_parser = SentenceSplitter.from_defaults(chunk_size=1024, chunk_overlap=200)

I also tried implementing SemanticSplitterNodeParser like this:

ollama_embedding = OllamaEmbedding(
    model_name="nomic-embed-text:latest",
    base_url="http://localhost:11434",
    ollama_additional_kwargs={"mirostat": 0},
)

node_parser = SemanticSplitterNodeParser(buffer_size=5, embed_model=ollama_embedding)

Both implementations run and the embedding completes without errors, but when I check the database the file is still split by sentences instead of into 1024-token chunks (in the case of SentenceSplitter) or by meaning (in the case of SemanticSplitterNodeParser).

Should I make some other changes? What am I missing? And why is splitting by sentence the default if it's so ineffective?
I run privategpt with:
PGPT_PROFILES=vllm make run
Thanks in advance for any help!
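For reference, here is a toy, pure-Python sketch (not privateGPT or LlamaIndex code; all names are illustrative) of what `chunk_size`/`chunk_overlap` semantics like those of `SentenceSplitter(chunk_size=1024, chunk_overlap=200)` are expected to produce: fixed-size chunks where consecutive chunks share an overlap. Note that the real splitter counts tokens, approximated here by whitespace-separated words.

```python
# Illustrative only: a toy fixed-size splitter with overlap, mimicking the
# chunk_size/chunk_overlap semantics described above. "Tokens" are
# approximated by whitespace-separated words for simplicity.
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    tokens = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last chunk reached the end of the text
    return chunks

text = " ".join(f"w{i}" for i in range(10))
print(split_with_overlap(text, chunk_size=4, chunk_overlap=2))
# → ['w0 w1 w2 w3', 'w2 w3 w4 w5', 'w4 w5 w6 w7', 'w6 w7 w8 w9']
```

With this behavior, a correctly applied `SentenceSplitter(chunk_size=1024, chunk_overlap=200)` should produce chunks far longer than one sentence, which is how you can tell from the database whether the override took effect.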

@lenartgolob lenartgolob changed the title Change chunking method from SentenceWindowNodeParser to SentenceSplitter or SemanticSplitterNodeParser Change chunking/splitting method from SentenceWindowNodeParser to SentenceSplitter or SemanticSplitterNodeParser May 8, 2024
@AlexPerkin
Try, for example:
node_parser = SentenceWindowNodeParser.from_defaults(window_size=20)
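Note that this may not change what you see in the database: a sentence-window parser stores one sentence per node and attaches the surrounding window of sentences as metadata, so `window_size` enlarges the retrieval context, not the stored chunk text. A toy sketch of that expected behavior (illustrative only, not LlamaIndex's actual `SentenceWindowNodeParser`):

```python
# Illustrative only: each node's text is a single sentence; the surrounding
# window of sentences is carried as metadata. This is why chunks in the
# vector DB look one sentence long regardless of window_size.
def sentence_window_nodes(sentences: list[str], window_size: int) -> list[dict]:
    nodes = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({"text": sent, "window": " ".join(sentences[lo:hi])})
    return nodes

sents = ["S1.", "S2.", "S3.", "S4."]
print(sentence_window_nodes(sents, window_size=1)[1])
# → {'text': 'S2.', 'window': 'S1. S2. S3.'}
```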

@sonaliverma

sonaliverma commented May 25, 2024

There is no need to pass a SentenceSplitter explicitly, since it is already the default in the ServiceContext class definition.

Below is the relevant configuration in ServiceContext:

node_parser = (
    text_splitter  # text splitter extends node parser
    or node_parser
    or _get_default_node_parser(
        chunk_size=chunk_size or DEFAULT_CHUNK_SIZE,
        chunk_overlap=chunk_overlap or SENTENCE_CHUNK_OVERLAP,
        callback_manager=callback_manager,
    )
)
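The `or` fall-through above can be illustrated with a small stand-in (hypothetical names, not the actual ServiceContext code): whichever of `text_splitter` or `node_parser` is supplied first wins, and only if both are absent does the default apply.

```python
# Toy stand-in for the ServiceContext selection chain quoted above.
# "SentenceSplitter" here is a placeholder string for what
# _get_default_node_parser would return in the real code.
def pick_parser(text_splitter=None, node_parser=None):
    return text_splitter or node_parser or "SentenceSplitter"

print(pick_parser())                                  # default applies
print(pick_parser(node_parser="MyParser"))            # explicit parser wins
print(pick_parser(text_splitter="TokenTextSplitter",
                  node_parser="MyParser"))            # text splitter wins over both
```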

If you don't pass any splitting method in the transformations, it falls through to _get_default_node_parser, which uses SentenceSplitter. Follow the steps below:

Go to ingest_service.py and remove the node parser from transformations:
transformations=[node_parser, embedding_component.embedding_model]

FYI, if you don't want the sentence-splitting method, use TokenTextSplitter instead. You need to make the changes below in ingest_service.py:

import logging

from injector import inject, singleton
from llama_index import ServiceContext, StorageContext
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.node_parser import TokenTextSplitter

from engine_gpt.components.embedding.embedding_component import EmbeddingComponent
from engine_gpt.components.ingest.ingest_component import get_ingestion_component
from engine_gpt.components.llm.llm_component import LLMComponent
from engine_gpt.components.node_store.node_store_component import NodeStoreComponent
from engine_gpt.components.vector_store.vector_store_component import (
    VectorStoreComponent,
)
from engine_gpt.server.ingest.model import IngestedDoc
from engine_gpt.settings.settings import settings

logger = logging.getLogger(__name__)


@singleton
class IngestService:
    @inject
    def __init__(
        self,
        llm_component: LLMComponent,
        vector_store_component: VectorStoreComponent,
        embedding_component: EmbeddingComponent,
        node_store_component: NodeStoreComponent,
    ) -> None:
        self.llm_service = llm_component
        self.storage_context = StorageContext.from_defaults(
            vector_store=vector_store_component.vector_store,
            docstore=node_store_component.doc_store,
            index_store=node_store_component.index_store,
        )
        # node_parser = SentenceWindowNodeParser.from_defaults()
        text_splitter = TokenTextSplitter()
        self.ingest_service_context = ServiceContext.from_defaults(
            llm=self.llm_service.llm,
            embed_model=embedding_component.embedding_model,
            # node_parser=node_parser,
            text_splitter=text_splitter,
            # Embeddings are done early in the node transformation pipeline,
            # right after node parsing
            transformations=[text_splitter, embedding_component.embedding_model],
        )
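After switching splitters, it is worth verifying the result directly. A small illustrative helper (not part of privateGPT; all names are made up) that reports size statistics for the chunk texts you read back out of the vector DB, so you can confirm the new splitter actually took effect:

```python
# Illustrative only: summarize chunk sizes (in whitespace-separated words)
# for chunk texts retrieved from the vector DB. One-sentence chunks will
# show a small mean; a token-based splitter should show much larger chunks.
from statistics import mean

def chunk_stats(chunks: list[str]) -> dict:
    sizes = [len(c.split()) for c in chunks]
    return {"count": len(sizes), "min": min(sizes), "max": max(sizes), "mean": mean(sizes)}

print(chunk_stats(["one two three", "four five", "six seven eight nine"]))
# → {'count': 3, 'min': 2, 'max': 4, 'mean': 3}
```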
