Here we investigate how to split text along semantically meaningful boundaries, such as:
- Paragraph
- Sentence

The `NLTKTextSplitter` is one of the text splitters provided by `langchain`. It is ML-based: it relies on NLTK's sentence tokenizer, which uses learned features of the target language to decide where sentences end.
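
To see why a tokenizer with learned features helps, compare it with a purely rule-based splitter. The sketch below uses a hypothetical `naive_sentence_split` helper (not part of NLTK or `langchain`) that breaks after every `.`, `!` or `?` followed by whitespace. It wrongly splits after an abbreviation such as `Dr.`, which is the kind of case a trained sentence tokenizer is designed to handle.

```python
import re

def naive_sentence_split(text: str) -> list[str]:
    """Split after ., ! or ? followed by whitespace (illustrative only)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Made-up sample text; the abbreviation "Dr." triggers a wrong split.
sample = "Dr. Smith arrived late. He sat down!"
print(naive_sentence_split(sample))
# ['Dr.', 'Smith arrived late.', 'He sat down!']
```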

In [2]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import NLTKTextSplitter

### Dataset


In [12]:
with open("data/sample.txt", encoding='utf-8') as f:
    text = f.read()
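
If the file starts with a UTF-8 byte order mark (BOM), reading it with `encoding='utf-8'` keeps the mark as a leading `\ufeff` character, which is what the first chunk in the output further down shows. A minimal sketch of the difference, using a temporary file with made-up content rather than `data/sample.txt`:

```python
import os
import tempfile

# Write a small file that starts with a UTF-8 byte order mark (BOM).
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbfUnfortunately, Hansel and Gretel overheard.")
    path = tmp.name

with open(path, encoding="utf-8") as f:
    with_bom = f.read()       # the BOM survives as '\ufeff'

with open(path, encoding="utf-8-sig") as f:
    without_bom = f.read()    # the BOM is stripped on read

os.remove(path)
print(repr(with_bom[:5]), repr(without_bom[:5]))
```

Reading with `utf-8-sig` is harmless for files that have no BOM, so it can be a safe default when the origin of a text file is unknown.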

### Split Docs
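We will set the separator to the period (`.`). Although other characters can mark the end of a sentence, this is the most common one. Note that NLTK itself detects the sentence boundaries; in `NLTKTextSplitter` the `separator` is the string used to join sentences when several of them fit into one chunk. We will also set an experimental chunk size (< 100 characters) with no overlap. Some sentences will exceed the configured chunk size, in which case the splitter emits a warning.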

In [17]:
text_splitter = NLTKTextSplitter(separator='.', chunk_size=64, chunk_overlap=0)

In [18]:
docs = text_splitter.split_text(text)

Created a chunk of size 91, which is longer than the specified 64
Created a chunk of size 100, which is longer than the specified 64
Created a chunk of size 101, which is longer than the specified 64
Created a chunk of size 126, which is longer than the specified 64
Created a chunk of size 126, which is longer than the specified 64
Created a chunk of size 153, which is longer than the specified 64
Created a chunk of size 97, which is longer than the specified 64
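
The warnings above appear because a single sentence can be longer than `chunk_size`, and a sentence is never cut in the middle. The sketch below uses a simplified, hypothetical `merge_sentences` helper to mimic this behaviour: it greedily joins sentences with a separator until the next one would exceed the limit, and reports any sentence that is too long on its own. It is an illustration under those assumptions, not `langchain`'s actual implementation.

```python
def merge_sentences(sentences, chunk_size, separator=" "):
    """Greedily pack sentences into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = sentence if not current else current + separator + sentence
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(sentence) > chunk_size:
            # A single sentence is kept whole even when it is too long.
            print(f"Created a chunk of size {len(sentence)}, "
                  f"which is longer than the specified {chunk_size}")
        current = sentence
    if current:
        chunks.append(current)
    return chunks

# Made-up input: two short sentences and one that exceeds the limit.
sentences = ["Short one.", "Another short one.", "x" * 70 + "."]
print(merge_sentences(sentences, chunk_size=64))
```

With a limit of 64, the two short sentences share one chunk, while the 71-character sentence becomes its own oversized chunk and triggers the message.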


### Evaluation

From an initial visual inspection, the text was split into intact sentences, even where a sentence ends with a terminator other than the period (for example `!`).

In [20]:
docs

["\ufeffUnfortunately, Hansel and Gretel overheard their stepmother's words, making them very sad.",
 'Hansel decided to creep outside and found some white pebbles, which shone brightly in the moonlight.',
 'Hansel filled his pockets with as many of the white pebbles as he could manage and crept back inside.',
 'The very next morning the family went out for a walk in the woods, as they walked Hansel dropped the white pebbles behind him.',
 'Further and further they walked into the forest, Hansel and Gretel grew tired, their father made a fire and told them to rest.',
 'When Hansel and Gretel awoke they were all alone, fortunately, Hansel’s plan had been a success, and they could follow the white stones all the way home!',
 'When they arrived back home their father was overjoyed, however, their stepmother was very cross.',
 'Later that night she told Hansel and Gretel’s father that he must get rid of them again!']
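
Visual inspection can be complemented with a quick programmatic check. A minimal sketch, assuming a chunk counts as an intact sentence when it ends with `.`, `!` or `?`; the `ends_like_sentence` helper is hypothetical, and the two sample chunks are copied from the output above:

```python
TERMINATORS = (".", "!", "?")

def ends_like_sentence(chunk: str) -> bool:
    """Return True if the chunk ends with sentence-final punctuation."""
    return chunk.rstrip().endswith(TERMINATORS)

sample_chunks = [
    "Hansel decided to creep outside and found some white pebbles, which shone brightly in the moonlight.",
    "Later that night she told Hansel and Gretel’s father that he must get rid of them again!",
]
incomplete = [c for c in sample_chunks if not ends_like_sentence(c)]
print(f"{len(incomplete)} of {len(sample_chunks)} chunks lack a terminator")
```

In the notebook, the same check can be run over `docs` instead of `sample_chunks`.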

### LLM evaluation

In [25]:
import os
import google.generativeai as palm

In [34]:
context = """
Does this text segment constitute a complete sentence? \
Use the presence of the separators . ! or ? to make your judgment. \
Return `True` if it is a complete sentence or `False` otherwise. \
"""

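The prompt asks for `True` or `False`, but a chat model may reply in free-form text. A minimal sketch of normalising a reply to a boolean, assuming replies begin with a word such as `Yes`, `No`, `True` or `False`; the `reply_to_bool` helper is hypothetical and returns `None` when the reply cannot be interpreted:

```python
import re
from typing import Optional

def reply_to_bool(reply: str) -> Optional[bool]:
    """Map a reply beginning with yes/true or no/false to a boolean."""
    match = re.match(r"\W*(yes|true|no|false)\b", reply, flags=re.IGNORECASE)
    if match is None:
        return None  # reply does not start with a recognisable verdict
    return match.group(1).lower() in ("yes", "true")

print(reply_to_bool("Yes, this is a complete sentence."))
print(reply_to_bool("`False`"))
print(reply_to_bool("It depends."))
```
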
In [35]:
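for doc in docs:
    # Send the current chunk. Passing docs[0] here instead would evaluate the
    # first sentence on every iteration, as in the responses recorded below.
    res = palm.chat(
        context=context,
        messages=doc
    )

    print(res.last)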

Yes, this is a complete sentence. It has a subject (Hansel and Gretel), a verb (overheard), and an object (their stepmother's words). It also has a complete thought.
Yes, this is a complete sentence. It has a subject (Hansel and Gretel), a verb (overheard), and an object (their stepmother's words). It also has a complete thought.
Yes, the text segment "Unfortunately, Hansel and Gretel overheard their stepmother's words, making them very sad." constitutes a complete sentence. It has a subject ("Hansel and Gretel"), a verb ("overheard"), and an object ("their stepmother's words"). It also has a complete thought, which is that Hansel and Gretel were sad when they overheard their stepmother's words.

The sentence is also grammatically correct. It has a subject-verb agreement, and the verb is in the past tense. The sentence also has a correct punctuation, with a comma after "unfortunately" and a period at the end.
Yes, this is a complete sentence. It has a subject (Hansel and Gretel), a ver