<center><h1>Exploring Semantic Chunking Methods</h1></center>

This notebook explores three semantic chunking techniques for breaking text into meaningful units. We implement each technique and compare the resulting chunks to assess their performance and suitability for text generation.

### **Table of Contents**

* [Step 1. Install Libraries](#section-one)
* [Step 2. Import Libraries](#section-two)
* [Step 3. Load Data](#section-three)
* [Step 4. Load Embedding Model](#section-four)
* [Step 5. Testing Different Chunking Techniques](#section-five)
    * [Statistical Chunking](#section-six)
    * [Consecutive Chunking](#section-seven)
    * [Cumulative Chunking](#section-eight)
* [Step 6. Conclusion](#section-nine)

## **Step 1. Install Libraries** <a id="section-one"></a>

In [1]:
%%capture
!pip install -qU semantic-chunkers
!pip install -qU datasets==2.19.1
!pip install -qU langchain 
!pip install -qU pypdf
!pip install -qU langchain-community

## **Step 2. Import Libraries** <a id="section-two"></a>

In [2]:
from datasets import load_dataset
from semantic_router.encoders import HuggingFaceEncoder
from semantic_chunkers import StatisticalChunker
from semantic_chunkers import ConsecutiveChunker
from semantic_chunkers import CumulativeChunker
from langchain_community.document_loaders import PyPDFLoader

## **Step 3. Load Data** <a id="section-three"></a>

In this notebook, we use the [2024 NVIDIA annual report](https://investor.nvidia.com/financial-info/annual-reports-and-proxies/default.aspx) in PDF format and parse it with LangChain's `PyPDFLoader` class. The text in the report is straightforward to extract and does not require OCR. Our focus here is on testing different semantic chunking techniques rather than on exploring data extraction methods.

In [3]:
file_path = (
    "/kaggle/input/nvidia-annual-report/NVIDIA-2024-Annual-Report.pdf"
)
loader = PyPDFLoader(file_path)
pages = loader.load_and_split()

pages[0]

Document(metadata={'source': '/kaggle/input/nvidia-annual-report/NVIDIA-2024-Annual-Report.pdf', 'page': 0}, page_content='2024  \nNVIDIA Corporation\nAnnual Review  \n \nNotice of Annual Meeting \nProxy Statement  \nForm 10-K')

In [12]:
pages[0:2]

[Document(metadata={'source': '/kaggle/input/nvidia-annual-report/NVIDIA-2024-Annual-Report.pdf', 'page': 0}, page_content='2024  \nNVIDIA Corporation\nAnnual Review  \n \nNotice of Annual Meeting \nProxy Statement  \nForm 10-K'),
 Document(metadata={'source': '/kaggle/input/nvidia-annual-report/NVIDIA-2024-Annual-Report.pdf', 'page': 2}, page_content='“The sum of all that \nNVIDIA’s doing \nwill indeed create \nthe next industrial \nrevolution”\nCNBC\nAccelerated computing is sustainable \ncomputing.  Every data center in the world \nneeds to be accelerated to reclaim power, \nachieve sustainability, and realize net-zero \nemissions. Accelerated data centers could save \nan incredible 19 terawatt-hours of electricity \nannually if run on GPU and DPU accelerators vs \nCPUs. That’s about the same energy as a year’s \nworth of trips by 2.9 million passenger cars. \nThe efficiency of accelerated computing \npaved the way for generative AI. The most \ncritical computing platform of our gen

In [14]:
# Concatenate the text of every page into one string
# (no separator is inserted between pages).
content = ''.join(page.page_content for page in pages)
len(content)

656803

In [20]:
type(content)

str

In [16]:
# Keep only the first 20,000 characters for the chunking experiments.
content = content[:20_000]

In [17]:
len(content)

20000
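
The cells above join pages with an empty separator and slice at a fixed character count. As a result, the last word of one page can fuse with the first word of the next (as in `Form 10-K“The sum` in the chunker output later in this notebook), and the slice can end mid-word. Below is a minimal sketch of one alternative. `join_pages` and `truncate_at_whitespace` are hypothetical helper names, not part of LangChain, and using them would change the character counts shown above.

```python
def join_pages(page_texts, separator="\n"):
    """Join page texts with an explicit separator so page-edge words stay apart."""
    return separator.join(text.strip() for text in page_texts)


def truncate_at_whitespace(text, max_chars):
    """Return at most max_chars characters, cutting at the last whitespace."""
    if len(text) <= max_chars:
        return text
    window = text[: max_chars + 1]
    # Position of the last whitespace character inside the window, or -1.
    cut = max(window.rfind(ch) for ch in (" ", "\n", "\t"))
    if cut <= 0:
        return text[:max_chars]  # no usable boundary: fall back to a hard cut
    return window[:cut].rstrip()


# Toy stand-ins for the `page.page_content` values (assumed data).
page_texts = ["Form 10-K", "The sum of all that"]
joined = join_pages(page_texts)
sample = truncate_at_whitespace(joined, 15)
print(repr(sample))
```

In the notebook, `page_texts` would be `[page.page_content for page in pages]` and `max_chars` would be `20_000`.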

## **Step 4. Load Embedding Model** <a id="section-four"></a>

In [18]:
encoder = HuggingFaceEncoder(name="sentence-transformers/all-MiniLM-L6-v2")

## **Step 5. Testing Different Chunking Techniques** <a id="section-five"></a>

### **Statistical Chunking** <a id="section-six"></a>

Statistical Chunking calculates the score threshold automatically. It is the most robust of the three methods: it uses a dynamic similarity threshold to identify local similarity splits, balancing accuracy and efficiency. Because `StatisticalChunker` determines a suitable threshold value itself, it needs less customization than the other chunking methods. It still allows parameter adjustments, such as setting the minimum and maximum number of tokens for each split.
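
To make the idea of a dynamic threshold concrete, here is a simplified sketch that derives the threshold from the distribution of similarity scores instead of fixing it by hand. It is an illustration only and is not the procedure `StatisticalChunker` implements. The rule (mean minus `k` standard deviations), the function names, and the score values are all assumptions made for this example.

```python
import statistics


def dynamic_threshold(scores, k=1.0):
    """Threshold = mean of the scores minus k population standard deviations."""
    return statistics.fmean(scores) - k * statistics.pstdev(scores)


def split_points(scores, threshold):
    """Indices i where the score between sentence i and i + 1 is below threshold."""
    return [i for i, score in enumerate(scores) if score < threshold]


# Made-up similarity scores between neighboring sentences (assumed data).
scores = [0.62, 0.58, 0.14, 0.55, 0.60, 0.12, 0.57]
threshold = dynamic_threshold(scores)
print(round(threshold, 3), split_points(scores, threshold))
```

Only the two unusually low scores fall below the derived threshold, so a split is placed after the third and sixth sentences without anyone choosing a cutoff in advance.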

In [52]:
chunker = StatisticalChunker(
    encoder=encoder,
    # Optional bounds on the number of tokens per split:
    # min_split_tokens=200,
    # max_split_tokens=500,
)

In [53]:
chunks = chunker(docs=[content])

2024-07-10 19:42:26 INFO semantic_chunkers.utils.logger Single document exceeds the maximum token limit of 300. Splitting to sentences before semantically merging.

In [54]:
chunker.print(chunks[0])

Split 1, tokens 127, triggered by: 0.14
2024 NVIDIA Corporation Annual Review Notice of Annual Meeting Proxy Statement Form 10-K“The sum of all that NVIDIA’s doing will indeed create the next industrial revolution” CNBC Accelerated computing is sustainable computing. Every data center in the world needs to be accelerated to reclaim power, achieve sustainability, and realize net-zero emissions. Accelerated data centers could save an incredible 19 terawatt-hours of electricity annually if run on GPU and DPU accelerators vs CPUs. That’s about the same energy as a year’s worth of trips by 2.9 million passenger cars. The efficiency of accelerated computing
----------------------------------------------------------------------------------------


Split 2, tokens 125, triggered by: 0.12
paved the way for generative AI. The most critical computing platform of our generation, generative AI will reshape the world’s largest industries and create an entirely new one. NVIDIA, the pion

### **Consecutive Chunking** <a id="section-seven"></a>

For Consecutive Chunking, we first split the text into sentences and then merge them into larger chunks. Merging continues while the similarity score between neighboring sentences stays at or above a predefined threshold, `score_threshold`. When the score drops below the threshold, merging stops: the sentences merged up to that point form a single chunk, and the next chunk starts with the remaining sentences. This approach groups only similar sentences together, improving the coherence of each chunk.
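
The merging rule described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the `ConsecutiveChunker` source: `embed` is a hypothetical bag-of-words stand-in for a real sentence encoder, and the sentences are invented.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consecutive_chunks(sentences, score_threshold):
    """Start a new chunk whenever two neighboring sentences are dissimilar."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for previous, current in zip(sentences, sentences[1:]):
        if cosine(embed(previous), embed(current)) < score_threshold:
            chunks.append([current])    # similarity dropped: open a new chunk
        else:
            chunks[-1].append(current)  # still similar: keep merging
    return chunks


sentences = [
    "gpus accelerate data centers",
    "data centers need gpus",
    "the annual meeting is in june",
    "the meeting is virtual",
]
for chunk in consecutive_chunks(sentences, score_threshold=0.2):
    print(chunk)
```

Each comparison looks only at one pair of neighbors, so a single short or unusual sentence can trigger a split even when the surrounding text stays on topic.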

In [46]:
chunker = ConsecutiveChunker(
    encoder=encoder,
    score_threshold=0.2,  # a similarity score below this value ends the current chunk
)

In [47]:
chunks = chunker(docs=[content])

In [48]:
chunker.print(chunks[0])

Split 1, tokens None, triggered by: 0.10
2024
----------------------------------------------------------------------------------------


Split 2, tokens None, triggered by: 0.14
NVIDIA Corporation Annual Review Notice of Annual Meeting
----------------------------------------------------------------------------------------


Split 3, tokens None, triggered by: 0.08
Proxy Statement
----------------------------------------------------------------------------------------


Split 4, tokens None, triggered by: 0.08
Form 10-K“The sum of all that
----------------------------------------------------------------------------------------


Split 5, tokens None, triggered by: 0.17
NVIDIA’s doing
----------------------------------------------------------------------------------------


Split 6, tokens None, triggered by: 0.04
will indeed create the next industrial revolution”
----------------------------------------------------------------------

### **Cumulative Chunking** <a id="section-eight"></a>

Cumulative Chunking compares a growing chunk against the split that follows it. We take the embedding of the first $n$ splits combined and compare it to the embedding of split $n+1$. If the two are similar, split $n+1$ joins the chunk. If they are not similar, the first $n$ splits are defined as a chunk, and we repeat the process with the subsequent splits. For example:

- The embedding of split 1 is similar to the embedding of split 2.
- The embedding of splits 1 and 2 combined is similar to the embedding of split 3.
- The embedding of splits 1, 2, and 3 combined is not similar to the embedding of split 4.

In this case, splits 1, 2, and 3 are defined as a chunk, and we start the process again from split 4. This method keeps the content of each chunk highly similar, but because the growing chunk has to be embedded again at every step, it takes longer to run than the other chunking techniques.
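
The example above can be reproduced with a small sketch. As before, this is a toy illustration rather than the `CumulativeChunker` source: `embed` is a hypothetical bag-of-words stand-in for a real encoder, and the sentences are invented so that the first three merge and the fourth starts a new chunk.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cumulative_chunks(sentences, score_threshold):
    """Compare the whole chunk built so far with the next sentence."""
    chunks, current = [], []
    for sentence in sentences:
        if current:
            # The accumulated chunk is re-embedded on every step.
            score = cosine(embed(" ".join(current)), embed(sentence))
            if score < score_threshold:
                chunks.append(current)
                current = []
        current.append(sentence)
    if current:
        chunks.append(current)
    return chunks


sentences = [
    "gpus accelerate data centers",
    "data centers need power",
    "power costs shape gpus demand",
    "the annual meeting is in june",
]
for chunk in cumulative_chunks(sentences, score_threshold=0.2):
    print(chunk)
```

Re-embedding the accumulated text inside the loop is what makes this variant slower: the encoder runs on an ever longer input for every new sentence.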

In [49]:
chunker = CumulativeChunker(
    encoder=encoder, 
    score_threshold=0.2
)

In [50]:
chunks = chunker(docs=[content])

In [51]:
chunker.print(chunks[0])

Split 1, tokens None, triggered by: 0.10
2024
----------------------------------------------------------------------------------------


Split 2, tokens None, triggered by: 0.06
NVIDIA Corporation Annual Review Notice of Annual Meeting
----------------------------------------------------------------------------------------


Split 3, tokens None, triggered by: 0.08
Proxy Statement
----------------------------------------------------------------------------------------


Split 4, tokens None, triggered by: 0.08
Form 10-K“The sum of all that
----------------------------------------------------------------------------------------


Split 5, tokens None, triggered by: 0.17
NVIDIA’s doing
----------------------------------------------------------------------------------------


Split 6, tokens None, triggered by: 0.19
will indeed create the next industrial
----------------------------------------------------------------------------------

## **Step 6. Conclusion** <a id="section-nine"></a>

<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:110%;
           font-family:Roboto;
           letter-spacing:0.5px">

<p style="padding: 10px;
              color:white;">
              In conclusion, this notebook applied three semantic chunking techniques to the text of the 2024 NVIDIA annual report. Among them, statistical chunking was the most effective: it produced coherent and informative chunks, whereas consecutive and cumulative chunking with a score threshold of 0.2 broke the opening text into splits of only a few words. Statistical chunking also delivered results in a reasonable timeframe compared to the other techniques, and its dynamically adjusted similarity threshold produced meaningful chunks without excessive processing time.
</p>
</div>
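
The comparison above rests on chunk quality and run time, but the notebook does not record timings or chunk sizes. A minimal sketch for measuring both is given here. `time_call`, `summarize_chunks`, and `naive_chunker` are hypothetical names introduced for illustration; `naive_chunker` is a stand-in that simply splits on `". "`.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def summarize_chunks(chunk_texts):
    """Basic size statistics for a non-empty list of chunk strings."""
    lengths = [len(text) for text in chunk_texts]
    return {
        "count": len(lengths),
        "min_chars": min(lengths),
        "max_chars": max(lengths),
        "mean_chars": sum(lengths) / len(lengths),
    }


def naive_chunker(text):
    """Stand-in chunker (assumption): replace with a real chunker call."""
    return text.split(". ")


chunk_texts, seconds = time_call(naive_chunker, "a. bb. ccc")
stats = summarize_chunks(chunk_texts)
print(stats, f"{seconds:.6f}s")
```

To compare the three methods, each `chunker(docs=[content])` call from the notebook could be wrapped in `time_call`, with the chunk texts extracted from the result before they are passed to `summarize_chunks`.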
