<a href="https://colab.research.google.com/github/QiaoLin22/MASTER-LLM-DL/blob/main/chunking_strategies.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
!pip install git+https://github.com/brandonstarxel/chunking_evaluation.git

In [None]:
# Main Chunking Functions
from chunking_evaluation.chunking import (
    ClusterSemanticChunker,
    LLMSemanticChunker,
    FixedTokenChunker,
    RecursiveTokenChunker,
    KamradtModifiedChunker
)
# Additional Dependencies
import tiktoken
from chromadb.utils import embedding_functions
from chunking_evaluation.utils import openai_token_count
import os

In [6]:
with open("./pride_and_prejudice.txt", 'r', encoding='utf-8') as file:
    document = file.read()

print("First 1000 Characters: ", document[:1000])

First 1000 Characters:  ﻿The Project Gutenberg eBook of Pride and Prejudice
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: Pride and Prejudice

Author: Jane Austen

Release date: June 1, 1998 [eBook #1342]
                Most recently updated: October 29, 2024

Language: English

Credits: Chuck Greif and the Online Distributed Proofreading Team at http://www.pgdp.net (This file was produced from images available at The Internet Archive)


*** START OF THE PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***
                            [Illustration:

                             GEORGE AL

In [7]:
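def analyze_chunks(chunks, use_tokens=False):
    """Print the chunk count, the 200th and 201st chunks, and their overlap.

    Overlap is measured in tokens when use_tokens is True, otherwise in characters.
    """
    print("\nNumber of Chunks:", len(chunks))
    if len(chunks) < 201:
        print("\nFewer than 201 chunks; nothing further to inspect")
        return

    # Inspect a pair of adjacent chunks from the middle of the document
    chunk1, chunk2 = chunks[199], chunks[200]
    print("\n", "="*50, "200th Chunk", "="*50,"\n", chunk1)
    print("\n", "="*50, "201st Chunk", "="*50,"\n", chunk2)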

    if use_tokens:
        encoding = tiktoken.get_encoding("cl100k_base")
        tokens1 = encoding.encode(chunk1)
        tokens2 = encoding.encode(chunk2)

        # Find overlapping tokens
        for i in range(len(tokens1), 0, -1):
            if tokens1[-i:] == tokens2[:i]:
                overlap = encoding.decode(tokens1[-i:])
                print("\n", "="*50, f"\nOverlapping text ({i} tokens):", overlap)
                return
        print("\nNo token overlap found")
    else:
        # Find overlapping characters
        for i in range(min(len(chunk1), len(chunk2)), 0, -1):
            if chunk1[-i:] == chunk2[:i]:
                print("\n", "="*50, f"\nOverlapping text ({i} chars):", chunk1[-i:])
                return
        print("\nNo character overlap found")
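
The overlap check inside `analyze_chunks` looks for the longest suffix of the first chunk that is also a prefix of the second. As a minimal sketch, the same idea can be isolated in a small helper and tried on inputs where the answer is easy to verify by eye. The name `longest_overlap` is hypothetical and not part of the notebook or of `chunking_evaluation`.

```python
def longest_overlap(left, right):
    """Return the longest suffix of `left` that is also a prefix of `right`.

    Works on any sliceable sequence, so it covers both strings (characters)
    and lists of token ids.
    """
    for size in range(min(len(left), len(right)), 0, -1):
        if left[-size:] == right[:size]:
            return left[-size:]
    return left[:0]  # empty sequence of the same type as `left`


print(longest_overlap("the quick brown", "brown fox"))  # prints: brown
print(longest_overlap([1, 2, 3, 4], [3, 4, 5]))         # prints: [3, 4]
```

Searching from the largest candidate size downwards means the first match found is the longest one, which is the same order `analyze_chunks` uses.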


# Character Text Splitting
The simplest form of chunking is to split the text every fixed number of characters, optionally letting consecutive chunks share some characters of overlap.

In [8]:
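def chunk_text(document, chunk_size, overlap):
    """Split document into chunk_size-character chunks sharing `overlap` characters."""
    # A stride of zero or less would never advance and loop forever
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    chunks = []
    stride = chunk_size - overlap  # Distance between consecutive chunk starts
    current_idx = 0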

    while current_idx < len(document):
        # Take chunk_size characters starting from current_idx
        chunk = document[current_idx:current_idx + chunk_size]
        if not chunk:  # Break if we're out of text
            break
        chunks.append(chunk)
        current_idx += stride  # Move forward by stride

    return chunks
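
As a sanity check on the sliding-window arithmetic: a window that advances by `stride = chunk_size - overlap` starts a new chunk at every multiple of the stride, so the number of chunks should be `ceil(len(document) / stride)`. The sketch below re-implements the same windowing in compact form and checks that formula on a synthetic string. The name `sliding_chunks` and the test string are illustrative assumptions, not part of the notebook.

```python
import math


def sliding_chunks(text, chunk_size, overlap):
    """Compact version of character chunking with a fixed stride."""
    stride = chunk_size - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    return [text[i:i + chunk_size] for i in range(0, len(text), stride)]


text = "abcdefghij" * 100  # synthetic 1000-character document

no_overlap = sliding_chunks(text, chunk_size=400, overlap=0)
half_overlap = sliding_chunks(text, chunk_size=800, overlap=400)

# Both configurations advance by 400 characters, so the counts match.
print(len(no_overlap), len(half_overlap))  # prints: 3 3
print(math.ceil(len(text) / 400))          # prints: 3

# With overlap, the tail of one chunk is the head of the next.
print(half_overlap[0][-400:] == half_overlap[1][:400])  # prints: True
```

This is why the two configurations used in this notebook (400 characters with no overlap, and 800 characters with an overlap of 400) report the same number of chunks: both advance by 400 characters per step.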

In [9]:
character_chunks = chunk_text(document, chunk_size=400, overlap=0)

analyze_chunks(character_chunks)


Number of Chunks: 1871

 y to their aunt, and
to a milliner’s shop just over the way. The two youngest of the family,
Catherine and Lydia, were particularly frequent in these attentions:
their minds were more vacant than their sisters’, and when nothing
better offered, a walk to Meryton was necessary to amuse their morning
hours and furnish conversation for the evening; and, however bare of
news the country in general mig

 ht be, they always contrived to learn
some from their aunt. At present, indeed, they were well supplied both
with news and happiness by the recent arrival of a militia regiment in
the neighbourhood; it was to remain the whole winter, and Meryton was
the head-quarters.

Their visits to Mrs. Philips were now productive of the most interesting
intelligence. Every day added something to their knowledg

No character overlap found


In [10]:

character_overlap_chunks = chunk_text(document, chunk_size=800, overlap=400)

analyze_chunks(character_overlap_chunks)


Number of Chunks: 1871

 y to their aunt, and
to a milliner’s shop just over the way. The two youngest of the family,
Catherine and Lydia, were particularly frequent in these attentions:
their minds were more vacant than their sisters’, and when nothing
better offered, a walk to Meryton was necessary to amuse their morning
hours and furnish conversation for the evening; and, however bare of
news the country in general might be, they always contrived to learn
some from their aunt. At present, indeed, they were well supplied both
with news and happiness by the recent arrival of a militia regiment in
the neighbourhood; it was to remain the whole winter, and Meryton was
the head-quarters.

Their visits to Mrs. Philips were now productive of the most interesting
intelligence. Every day added something to their knowledg

 ht be, they always contrived to learn
some from their aunt. At present, indeed, they were well supplied both
with news and happiness by the recent arrival of a militia reg

In [11]:
import tiktoken

# Loading cl100k_base tokenizer
encoder = tiktoken.get_encoding("cl100k_base")

# Text Example
text = "hamburger"
tokens = encoder.encode(text)

print("Tokens:", tokens)

Tokens: [71, 47775]


In [12]:
for i in range(len(tokens)):
    print(f"Token {i+1}:", encoder.decode([tokens[i]]))

print("Full Decoding: ", encoder.decode(tokens))

Token 1: h
Token 2: amburger
Full Decoding:  hamburger


In [13]:
def count_tokens(text, encoding_name="cl100k_base"):
    """Print the number of tokens in a text string using tiktoken."""
    encoder = tiktoken.get_encoding(encoding_name)
    print(f"Number of tokens: {len(encoder.encode(text))}")

In [14]:
fixed_token_chunker = FixedTokenChunker(
    chunk_size=400,
    chunk_overlap=0,
    encoding_name="cl100k_base"
)

token_chunks = fixed_token_chunker.split_text(document)

analyze_chunks(token_chunks, use_tokens=True)


Number of Chunks: 440

  as to meet Miss Bennet. The
subject was pursued no further, and the gentlemen soon afterwards went
away.




[Illustration:

“At Church”
]




CHAPTER XXXI.


[Illustration]

Colonel Fitzwilliam’s manners were very much admired at the Parsonage,
and the ladies all felt that he must add considerably to the pleasure of
their engagements at Rosings. It was some days, however, before they
received any invitation thither, for while there were visitors in the
house they could not be necessary; and it was not till Easter-day,
almost a week after the gentlemen’s arrival, that they were honoured by
such an attention, and then they were merely asked on leaving church to
come there in the evening. For the last week they had seen very little
of either Lady Catherine or her daughter. Colonel Fitzwilliam had called
at the Parsonage more than once during the time, but Mr. Darcy they had
only seen at church.

The invitation was accepted, of course, and at a proper hour they j

In [15]:
count_tokens(token_chunks[0])

Number of tokens: 400
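
Conceptually, fixed-token chunking is the same sliding window as before, applied to a sequence of tokens rather than characters. The sketch below illustrates that idea under two loudly stated assumptions: a whitespace split stands in for a real tokenizer such as tiktoken, and the loop stops as soon as a window reaches the end of the sequence. It is not the implementation of `FixedTokenChunker`, and `token_windows` is a hypothetical name.

```python
def token_windows(tokens, chunk_size, overlap):
    """Slide a window of `chunk_size` tokens, advancing by chunk_size - overlap."""
    stride = chunk_size - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the sequence
        start += stride
    return windows


# Assumption: whitespace splitting stands in for a real tokenizer.
tokens = "it is a truth universally acknowledged that a single man".split()

windows = token_windows(tokens, chunk_size=4, overlap=2)
chunks = [" ".join(window) for window in windows]
for chunk in chunks:
    print(chunk)
```

Halving the stride roughly doubles the number of windows, which is consistent with the counts in this notebook: 440 chunks with no overlap and 878 chunks with an overlap of 200 tokens.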


In [16]:
fixed_token_chunker = FixedTokenChunker(
    chunk_size=400,
    chunk_overlap=200,
    encoding_name="cl100k_base"
)

token_overlap_chunks = fixed_token_chunker.split_text(document)

analyze_chunks(token_overlap_chunks, use_tokens=True)


Number of Chunks: 878

  _heard_ nothing of his going away when I
was at Netherfield. I hope your plans in favour of the ----shire will
not be affected by his being in the neighbourhood.”

“Oh no--it is not for _me_ to be driven away by Mr. Darcy. If _he_
wishes to avoid seeing _me_ he must go. We are not on friendly terms,
and it always gives me pain to meet him, but I have no reason for
avoiding _him_ but what I might proclaim to all the world--a sense of
very great ill-usage, and most painful regrets at his being what he is.
His father, Miss Bennet, the late Mr. Darcy, was one of the best men
that ever breathed, and the truest friend I ever had; and I can never be
in company with this Mr. Darcy without being grieved to the soul by a
thousand tender recollections. His behaviour to myself has been
scandalous; but I verily believe I could forgive him anything and
everything, rather than his disappointing the hopes and disgracing the
memory of his father.”

Elizabeth found the interest

In [17]:
recursive_character_chunker = RecursiveTokenChunker(
    chunk_size=800,  # Character Length
    chunk_overlap=0,  # Overlap
    length_function=len,  # Character length with len()
    separators=["\n\n", "\n", ".", "?", "!", " ", ""] # According to Research
)

recursive_character_chunks = recursive_character_chunker.split_text(document)
analyze_chunks(recursive_character_chunks, use_tokens=False)


Number of Chunks: 1270

 When tea was over Mr. Hurst reminded his sister-in-law of the
card-table--but in vain. She had obtained private intelligence that Mr.
Darcy did not wish for cards, and Mr. Hurst soon found even his open
petition rejected. She assured him that no one intended to play, and the
silence of the whole party on the subject seemed to justify her. Mr.
Hurst had, therefore, nothing to do but to stretch himself on one of the
sofas and go to sleep. Darcy took up a book. Miss Bingley did the same;
and Mrs. Hurst, principally occupied in playing with her bracelets and
rings, joined now and then in her brother’s conversation with Miss
Bennet.

 Miss Bingley’s attention was quite as much engaged in watching Mr.
Darcy’s progress through _his_ book, as in reading her own; and she was
perpetually either making some inquiry, or looking at his page. She
could not win him, however, to any conversation; he merely answered her
question and read on. At length, quite exhausted by the a

In [18]:
len(recursive_character_chunks[199]) # Chunk 200

635
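
The 635-character chunk above is shorter than the 800-character limit because recursive splitting ends chunks at separator boundaries instead of cutting at an exact count. A minimal sketch of the idea follows: try the coarsest separator first, merge adjacent pieces while they still fit, and fall back to finer separators only for pieces that are too long. This is an illustration, not the implementation of `RecursiveTokenChunker`; it ignores overlap and whitespace stripping, and the names `recursive_split` and `hard_cut` are hypothetical.

```python
def hard_cut(text, chunk_size):
    """Last resort: cut at an exact character count."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def recursive_split(text, chunk_size, separators):
    """Split `text` into pieces of at most `chunk_size` characters,
    preferring the earliest separator in `separators` that occurs in the text."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        return hard_cut(text, chunk_size)

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    # Keep each separator attached to the piece before it so no text is lost.
    parts = text.split(sep)
    pieces = [part + sep for part in parts[:-1]] + [parts[-1]]

    chunks, current = [], ""
    for piece in pieces:
        if len(current) + len(piece) <= chunk_size:
            current += piece  # piece still fits in the chunk being built
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too long on its own: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = (
    "First paragraph. It has two sentences.\n\n"
    "Second paragraph is a little longer and will not fit in one chunk."
)
separators = ["\n\n", "\n", ".", "?", "!", " ", ""]
chunks = recursive_split(text, chunk_size=40, separators=separators)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

In this example the first paragraph fits in one chunk, while the second is too long and has no usable sentence break, so it is split at a space. Every chunk stays within the limit, and joining the chunks reproduces the original text.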

In [19]:
recursive_character_chunker = RecursiveTokenChunker(
    chunk_size=800,  # Character Length
    chunk_overlap=400,  # Overlap
    length_function=len,  # Character length with len()
    separators=["\n\n", "\n", ".", "?", "!", " ", ""] # According to Research
)

recursive_character_overlap_chunks = recursive_character_chunker.split_text(document)
analyze_chunks(recursive_character_overlap_chunks, use_tokens=False)


Number of Chunks: 1536

 “Yes; and they have another, who lives somewhere near Cheapside.”

“That is capital,” added her sister; and they both laughed heartily.

“If they had uncles enough to fill _all_ Cheapside,” cried Bingley, “it
would not make them one jot less agreeable.”

“But it must very materially lessen their chance of marrying men of any
consideration in the world,” replied Darcy.

To this speech Bingley made no answer; but his sisters gave it their
hearty assent, and indulged their mirth for some time at the expense of
their dear friend’s vulgar relations.

 With a renewal of tenderness, however, they repaired to her room on
leaving the dining-parlour, and sat with her till summoned to coffee.
She was still very poorly, and Elizabeth would not quit her at all, till
late in the evening, when she had the comfort of seeing her asleep, and
when it appeared to her rather right than pleasant that she should go
down stairs herself. On entering the drawing-room, she found the who

In [20]:
recursive_token_chunker = RecursiveTokenChunker(
    chunk_size=800,  # Token length (measured by openai_token_count)
    chunk_overlap=0,  # Overlap
    length_function=openai_token_count,
    separators=["\n\n", "\n", ".", "?", "!", " ", ""] # According to Research
)

recursive_token_chunks = recursive_token_chunker.split_text(document)

analyze_chunks(recursive_token_chunks, use_tokens=True)


Number of Chunks: 253

 Elizabeth was now most heartily sorry that she had, from the distress of
the moment, been led to make Mr. Darcy acquainted with their fears for
her sister; for since her marriage would so shortly give the proper
termination to the elopement, they might hope to conceal its
unfavourable beginning from all those who were not immediately on the
spot.

She had no fear of its spreading farther, through his means. There were
few people on whose secrecy she would have more confidently depended;
but at the same time there was no one whose knowledge of a sister’s
frailty would have mortified her so much. Not, however, from any fear of
disadvantage from it individually to herself; for at any rate there
seemed a gulf impassable between them. Had Lydia’s marriage been
concluded on the most honourable terms, it was not to be supposed that
Mr. Darcy would connect himself with a family, where to every other
objection would now be added an alliance and relationship of the neare

In [21]:
recursive_token_chunker = RecursiveTokenChunker(
    chunk_size=800,  # Token length (measured by openai_token_count)
    chunk_overlap=400,  # Overlap
    length_function=openai_token_count,
    separators=["\n\n", "\n", ".", "?", "!", " ", ""] # According to Research
)

recursive_token_overlap_chunks = recursive_token_chunker.split_text(document)

analyze_chunks(recursive_token_overlap_chunks, use_tokens=True)


Number of Chunks: 427

 “I do not mean to say that a woman may not be settled too near her
family. The far and the near must be relative, and depend on many
varying circumstances. Where there is fortune to make the expense of
travelling unimportant, distance becomes no evil. But that is not the
case _here_. Mr. and Mrs. Collins have a comfortable income, but not
such a one as will allow of frequent journeys--and I am persuaded my
friend would not call herself _near_ her family under less than _half_
the present distance.”

Mr. Darcy drew his chair a little towards her, and said, “_You_ cannot
have a right to such very strong local attachment. _You_ cannot have
been always at Longbourn.”

Elizabeth looked surprised. The gentleman experienced some change of
feeling; he drew back his chair, took a newspaper from the table, and,
glancing over it, said, in a colder voice,--

“Are you pleased with Kent?”

A short dialogue on the subject of the country ensued, on either side
calm and concise

In [25]:
from openai import OpenAI

from google.colab import userdata


# Set OpenAI API key
api_key = userdata.get('OPENAI_API_KEY')

# Initialize OpenAI client
client = OpenAI(api_key=api_key)
embedding_function = embedding_functions.OpenAIEmbeddingFunction(api_key=api_key, model_name="text-embedding-3-large")

In [29]:
!pip install langchain_experimental
!pip install langchain_openai
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

lc_semantic_chunker = SemanticChunker(OpenAIEmbeddings(api_key=api_key))

lc_semantic_chunks = lc_semantic_chunker.create_documents([document])
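
The general idea behind embedding-based chunkers of this kind is to embed consecutive sentences, measure the distance between neighbours, and start a new chunk wherever that distance is unusually large. The sketch below shows the idea with hand-made 2-D vectors in place of real embeddings and a nearest-rank percentile threshold. The vectors, the 75th-percentile setting, and the name `split_on_distance` are all assumptions chosen for illustration; this is not LangChain's implementation or its default configuration.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def split_on_distance(sentences, vectors, percentile=75):
    """Start a new chunk wherever the distance to the previous sentence
    exceeds the given percentile of all neighbour distances."""
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    ranked = sorted(distances)
    rank = max(1, math.ceil(percentile / 100 * len(ranked)))  # nearest rank
    threshold = ranked[rank - 1]

    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "The cat sat on the mat.",
    "It purred quietly.",
    "Then it fell asleep.",
    "Interest rates rose in March.",
    "Bond prices fell in response.",
]
# Toy "embeddings": the first three point one way, the last two another.
vectors = [(1.0, 0.0), (0.9, 0.1), (0.95, 0.05), (0.0, 1.0), (0.1, 0.9)]

semantic_chunks = split_on_distance(sentences, vectors)
for chunk in semantic_chunks:
    print(chunk)
```

Only the jump between the third and fourth sentences exceeds the threshold, so the toy text is split into two chunks, one per topic. With real embeddings the threshold choice controls how many chunks are produced.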



In [30]:
print("# of Chunks:", len(lc_semantic_chunks), "\n")
print(lc_semantic_chunks[199].page_content)
print("\n\n", "="*50, "\n\n")
print(lc_semantic_chunks[200].page_content)
print("\n\n", "="*50, "\n\n")

count_tokens(lc_semantic_chunks[199].page_content)
count_tokens(lc_semantic_chunks[200].page_content)

# of Chunks: 305 

“John told us Mr. Darcy was
here when you sent for us;--was it so?”

“Yes; and I told him we should not be able to keep our engagement. _That_ is all settled.”

“What is all settled?” repeated the other, as she ran into her room to
prepare. “And are they upon such terms as for her to disclose the real
truth? Oh, that I knew how it was!”

But wishes were vain; or, at best, could serve only to amuse her in the
hurry and confusion of the following hour. Had Elizabeth been at leisure
to be idle, she would have remained certain that all employment was
impossible to one so wretched as herself; but she had her share of
business as well as her aunt, and amongst the rest there were notes to
be written to all their friends at Lambton, with false excuses for their
sudden departure. An hour, however, saw the whole completed; and Mr. Gardiner, meanwhile, having settled his account at the inn, nothing
remained to be done but to go; and Elizabeth, after all the misery of
the mornin

In [31]:
kamradt_chunker = KamradtModifiedChunker(
    avg_chunk_size=400,      # Target size in tokens
    min_chunk_size=50,       # Initial split size
    embedding_function=embedding_function  # Pass your embedding function
)

# Split your text
modified_kamradt_chunks = kamradt_chunker.split_text(document)

In [32]:
analyze_chunks(modified_kamradt_chunks, use_tokens=True)
print("\n\n", "="*50, "\n\n")
count_tokens(modified_kamradt_chunks[200])


Number of Chunks: 432

 brother, and that she had often heard him speak so affectionately of his
sister as to prove him capable of some amiable feeling;--that had his
actions been what Wickham represented them, so gross a violation of everything right could hardly have been concealed from the world; and
that friendship between a person capable of it and such an amiable man
as Mr. Bingley was incomprehensible. She grew absolutely ashamed of herself. Of neither Darcy nor Wickham
could she think, without feeling that she had been blind, partial,
prejudiced, absurd. “How despicably have I acted!” she cried. “I, who have prided myself on
my discernment! I, who have valued myself on my abilities! who have often disdained the generous candour of my sister, and gratified my
vanity in useless or blameless distrust. How humiliating is this discovery! Yet, how just a humiliation! Had I been in love, I could not
have been more wretchedly blind. But vanity, not love, has been my folly. Pleased wit

In [33]:
cluster_chunker = ClusterSemanticChunker(
    embedding_function=embedding_function,
    max_chunk_size=400,
    length_function=openai_token_count
)

cluster_chunker_chunks = cluster_chunker.split_text(document)

analyze_chunks(cluster_chunker_chunks, use_tokens=True)


Number of Chunks: 991

 But if you will listen to his letter, you may, perhaps, be a little
softened by his manner of expressing himself.” “No, that I am sure I shall not: and I think it was very impertinent of
him to write to you at all, and very hypocritical. I hate such false friends. Why could not he keep on quarrelling with you, as his father
did before him?” “Why, indeed, he does seem to have had some filial scruples on that
head, as you will hear.”

     /* RIGHT “Hunsford, near Westerham, Kent, _15th October_. */ “Dear Sir, “The disagreement subsisting between yourself and my late honoured
     father always gave me much uneasiness; and, since I have had the
     misfortune to lose him, I have frequently wished to heal the breach: but, for some time, I was kept back by my own doubts,
     fearing lest it might seem disrespectful to his memory for me to be
     on good terms with anyone with whom it had always pleased him to be at variance.”--‘There, Mrs. Bennet.’--“My mind, ho

In [36]:
llm_chunker = LLMSemanticChunker(
    organisation="openai",
    model_name="gpt-4o",
    api_key=api_key
)

llm_chunker_chunks = llm_chunker.split_text(document)

analyze_chunks(llm_chunker_chunks, use_tokens=True)

Processing chunks: 100%|█████████▉| 4869/4871 [13:36<00:00,  5.96it/s]


Number of Chunks: 670

 “When I do myself the honour of speaking to you next on the subject, I
shall hope to receive a more favourable answer than you have now given
me; though I am far from accusing you of cruelty at present, because I know it to be the established custom of your sex to reject a man on the
first application, and, perhaps, you have even now said as much to
encourage my suit as would be consistent with the true delicacy of the female character.” “Really, Mr. Collins,” cried Elizabeth, with some warmth, “you puzzle me
exceedingly. If what I have hitherto said can appear to you in the form of encouragement, I know not how to express my refusal in such a way as
may convince you of its being one.” “You must give me leave to flatter myself, my dear cousin, that your
refusal of my addresses are merely words of course. My reasons for believing it are briefly these:--It does not appear to me that my hand
is unworthy of your acceptance, or that the establishment I can offer
wou


