
Change chunking/splitting method from SentenceWindowNodeParser to SentenceSplitter or SemanticSplitterNodeParser #1917

Open
lenartgolob opened this issue May 8, 2024 · 2 comments

Comments

lenartgolob commented May 8, 2024

I am getting very poor responses from the LLM when chatting with the file I uploaded. I am using Qdrant as a vector DB, and when I checked the contents of the database I realized that all the chunks are only one sentence long.
This is because, by default, privateGPT uses SentenceWindowNodeParser to split the text in ingest_service.py:
node_parser = SentenceWindowNodeParser.from_defaults()

Instead of the default, I tried implementing SentenceSplitter like this:
node_parser = SentenceSplitter.from_defaults(chunk_size=1024, chunk_overlap=200)

I also tried implementing SemanticSplitterNodeParser like this:

ollama_embedding = OllamaEmbedding(
    model_name="nomic-embed-text:latest",
    base_url="http://localhost:11434",
    ollama_additional_kwargs={"mirostat": 0},
)

node_parser = SemanticSplitterNodeParser(buffer_size=5, embed_model=ollama_embedding)

Both implementations run and the embedding completes without errors, but when I check the database the file is still split by sentences instead of into 1024-token chunks (in the case of SentenceSplitter) or by meaning (in the case of SemanticSplitterNodeParser).

Should I make some other changes? What am I missing? And why is splitting by sentence the default if it's so ineffective?
I run privategpt with:
PGPT_PROFILES=vllm make run
Thanks in advance for any help!
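For reference, here is a toy, pure-Python sketch (not privateGPT or LlamaIndex code; all names are illustrative) of what `chunk_size`/`chunk_overlap` semantics like those of `SentenceSplitter(chunk_size=1024, chunk_overlap=200)` are expected to produce: fixed-size chunks where consecutive chunks share an overlap. Note that the real splitter counts tokens, approximated here by whitespace-separated words.

```python
# Illustrative only: a toy fixed-size splitter with overlap, mimicking the
# chunk_size/chunk_overlap semantics described above. "Tokens" are
# approximated by whitespace-separated words for simplicity.
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    tokens = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last chunk reached the end of the text
    return chunks

text = " ".join(f"w{i}" for i in range(10))
print(split_with_overlap(text, chunk_size=4, chunk_overlap=2))
# → ['w0 w1 w2 w3', 'w2 w3 w4 w5', 'w4 w5 w6 w7', 'w6 w7 w8 w9']
```

With this behavior, a correctly applied `SentenceSplitter(chunk_size=1024, chunk_overlap=200)` should produce chunks far longer than one sentence, which is how you can tell from the database whether the override took effect.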

@lenartgolob lenartgolob changed the title Change chunking method from SentenceWindowNodeParser to SentenceSplitter or SemanticSplitterNodeParser Change chunking/splitting method from SentenceWindowNodeParser to SentenceSplitter or SemanticSplitterNodeParser May 8, 2024
@AlexPerkin
Try, for example:
node_parser = SentenceWindowNodeParser.from_defaults(window_size=20)
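Note that this may not change what you see in the database: a sentence-window parser stores one sentence per node and attaches the surrounding window of sentences as metadata, so `window_size` enlarges the retrieval context, not the stored chunk text. A toy sketch of that expected behavior (illustrative only, not LlamaIndex's actual `SentenceWindowNodeParser`):

```python
# Illustrative only: each node's text is a single sentence; the surrounding
# window of sentences is carried as metadata. This is why chunks in the
# vector DB look one sentence long regardless of window_size.
def sentence_window_nodes(sentences: list[str], window_size: int) -> list[dict]:
    nodes = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({"text": sent, "window": " ".join(sentences[lo:hi])})
    return nodes

sents = ["S1.", "S2.", "S3.", "S4."]
print(sentence_window_nodes(sents, window_size=1)[1])
# → {'text': 'S2.', 'window': 'S1. S2. S3.'}
```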

@sonaliverma

sonaliverma commented May 25, 2024

There is no need to pass a SentenceSplitter explicitly, since it is already the default in the ServiceContext class definition.

Below is the relevant configuration in ServiceContext:

node_parser = (
    text_splitter  # text splitter extends node parser
    or node_parser
    or _get_default_node_parser(
        chunk_size=chunk_size or DEFAULT_CHUNK_SIZE,
        chunk_overlap=chunk_overlap or SENTENCE_CHUNK_OVERLAP,
        callback_manager=callback_manager,
    )
)
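The `or` fall-through above can be illustrated with a small stand-in (hypothetical names, not the actual ServiceContext code): whichever of `text_splitter` or `node_parser` is supplied first wins, and only if both are absent does the default apply.

```python
# Toy stand-in for the ServiceContext selection chain quoted above.
# "SentenceSplitter" here is a placeholder string for what
# _get_default_node_parser would return in the real code.
def pick_parser(text_splitter=None, node_parser=None):
    return text_splitter or node_parser or "SentenceSplitter"

print(pick_parser())                                  # default applies
print(pick_parser(node_parser="MyParser"))            # explicit parser wins
print(pick_parser(text_splitter="TokenTextSplitter",
                  node_parser="MyParser"))            # text splitter wins over both
```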

If you don't pass any splitting method in the transformations, it falls through to _get_default_node_parser, which uses SentenceSplitter. Follow the steps below:

Go to ingest_service.py and remove the node parser from transformations:
transformations=[node_parser, embedding_component.embedding_model]

FYI, if you don't want the sentence-splitting method, use TokenTextSplitter instead. You need to make the changes below in ingest_service.py:

import logging

from injector import inject, singleton
from llama_index import ServiceContext, StorageContext
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.node_parser import TokenTextSplitter

from engine_gpt.components.embedding.embedding_component import EmbeddingComponent
from engine_gpt.components.ingest.ingest_component import get_ingestion_component
from engine_gpt.components.llm.llm_component import LLMComponent
from engine_gpt.components.node_store.node_store_component import NodeStoreComponent
from engine_gpt.components.vector_store.vector_store_component import (
    VectorStoreComponent,
)
from engine_gpt.server.ingest.model import IngestedDoc
from engine_gpt.settings.settings import settings

logger = logging.getLogger(__name__)


@singleton
class IngestService:
    @inject
    def __init__(
        self,
        llm_component: LLMComponent,
        vector_store_component: VectorStoreComponent,
        embedding_component: EmbeddingComponent,
        node_store_component: NodeStoreComponent,
    ) -> None:
        self.llm_service = llm_component
        self.storage_context = StorageContext.from_defaults(
            vector_store=vector_store_component.vector_store,
            docstore=node_store_component.doc_store,
            index_store=node_store_component.index_store,
        )
        # node_parser = SentenceWindowNodeParser.from_defaults()
        text_splitter = TokenTextSplitter()
        self.ingest_service_context = ServiceContext.from_defaults(
            llm=self.llm_service.llm,
            embed_model=embedding_component.embedding_model,
            # node_parser=node_parser,
            text_splitter=text_splitter,
            # Embeddings are done early in the node transformation pipeline,
            # right after node parsing
            transformations=[text_splitter, embedding_component.embedding_model],
        )
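After switching splitters, it is worth verifying the result directly. A small illustrative helper (not part of privateGPT; all names are made up) that reports size statistics for the chunk texts you read back out of the vector DB, so you can confirm the new splitter actually took effect:

```python
# Illustrative only: summarize chunk sizes (in whitespace-separated words)
# for chunk texts retrieved from the vector DB. One-sentence chunks will
# show a small mean; a token-based splitter should show much larger chunks.
from statistics import mean

def chunk_stats(chunks: list[str]) -> dict:
    sizes = [len(c.split()) for c in chunks]
    return {"count": len(sizes), "min": min(sizes), "max": max(sizes), "mean": mean(sizes)}

print(chunk_stats(["one two three", "four five", "six seven eight nine"]))
# → {'count': 3, 'min': 2, 'max': 4, 'mean': 3}
```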
