<a href="https://colab.research.google.com/github/SARA3SAEED/LLM-2/blob/main/39_Building_Advanced_LLM_Applications_Module_5_Semantic_Chunking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Credits: James Briggs**

In [None]:
!pip install -qU \
    semantic-chunkers \
    datasets==2.19.1

# An Intro to Semantic Chunkers

Semantic chunkers allow us to build more context-aware chunks of information. We can use them for RAG, splitting video and audio, and much more.

Here we will stick with a simple RAG-focused example. We will learn about the three built-in chunkers available to us: `StatisticalChunker`, `ConsecutiveChunker`, and `CumulativeChunker`. To begin, we need some data.

**Note:** by using the [async methods](https://github.com/aurelio-labs/semantic-chunkers/blob/main/docs/02-chunkers-async.ipynb), docs can be processed *40x* faster.

In [None]:
from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv2", split="train")
data

Dataset({
    features: ['id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'content', 'references'],
    num_rows: 2673
})

In [None]:
content = data[3]["content"]
print(content[:1000])

# Mamba: Linear-Time Sequence Modeling with Selective State Spaces
# Albert Gu*1 and Tri Dao*2
1Machine Learning Department, Carnegie Mellon University 2Department of Computer Science, Princeton University agu@cs.cmu.edu, tri@tridao.me
# Abstract
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities

We will keep only the first 20,000 characters of the content to speed up the examples and limit API cost.

In [None]:
content = content[:20_000]

We will experiment with different semantic chunking methods on the above text. Every chunker requires an _encoder_, which we can choose from open-source encoders via `HuggingFaceEncoder` or `FastEmbedEncoder`, or from proprietary API encoders like `OpenAIEncoder` or `CohereEncoder`.

We will use the `OpenAIEncoder` with `text-embedding-3-small`:

In [None]:
import os
from getpass import getpass
from semantic_router.encoders import OpenAIEncoder

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass(
    "OpenAI API key: "
)

encoder = OpenAIEncoder(name="text-embedding-3-small")


## Statistical Chunking

Statistical chunking is our most robust chunking method. It uses a varying similarity threshold to identify more dynamic, local similarity splits. It offers a good balance between accuracy and efficiency _but_ can only be used for text documents (unlike the multi-modal `ConsecutiveChunker`).

The `StatisticalChunker` can automatically identify a good threshold value to use while chunking our text, so it tends to require less customization than our other chunkers.
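As a rough intuition for what a data-dependent threshold means, the sketch below derives a cutoff from the distribution of similarity scores instead of using a fixed value. This is not the library's implementation: `auto_threshold`, `split_points`, the parameter `k`, and the example scores are all hypothetical and exist only for illustration.

```python
import statistics


def auto_threshold(similarities, k=1.0):
    """Hypothetical rule: cutoff = mean - k * standard deviation of the scores."""
    mean = statistics.mean(similarities)
    std = statistics.pstdev(similarities)
    return mean - k * std


def split_points(similarities, k=1.0):
    """Return the indices of sentences that would start a new chunk.

    similarities[i] is the similarity between sentence i and sentence i + 1,
    so a low score at position i means sentence i + 1 opens a new chunk.
    """
    threshold = auto_threshold(similarities, k)
    return [i + 1 for i, score in enumerate(similarities) if score < threshold]


# Made-up similarity scores between neighbouring sentences.
sims = [0.82, 0.79, 0.31, 0.85, 0.80, 0.28, 0.77]
print(split_points(sims))
```

Because the cutoff adapts to the scores it sees, the same code behaves sensibly on documents whose sentences are uniformly similar or uniformly dissimilar, which is the property a fixed `score_threshold` lacks.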

In [None]:
from semantic_chunkers import StatisticalChunker

chunker = StatisticalChunker(encoder=encoder)

In [None]:
chunks = chunker(docs=[content])

2024-08-10 09:36:40 INFO semantic_chunkers.utils.logger Single document exceeds the maximum token limit of 300. Splitting to sentences before semantically merging.


Print the chunks to inspect where the splits fall.

In [None]:
chunker.print(chunks[0])

Split 1, tokens 300, triggered by: token limit
# Mamba: Linear-Time Sequence Modeling with Selective State Spaces # Albert Gu*1 and Tri Dao*2 1Machine Learning Department, Carnegie Mellon University 2Department of Computer Science, Princeton University agu@cs.cmu.edu, tri@tridao.me # Abstract Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input

## Consecutive Chunking

Consecutive chunking is the simplest version of semantic chunking.
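The idea can be sketched in a few lines: embed each sentence, compare every sentence with its immediate neighbour, and start a new chunk whenever the similarity drops below `score_threshold`. The helper names and the two-dimensional toy embeddings below are assumptions made for illustration; this is a minimal sketch of the technique, not the `ConsecutiveChunker` source.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consecutive_chunks(sentences, embeddings, score_threshold=0.3):
    """Start a new chunk when neighbouring sentences are dissimilar."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        # Only the previous sentence is consulted, never the whole chunk.
        if cosine(embeddings[i - 1], embeddings[i]) < score_threshold:
            chunks.append([])
        chunks[-1].append(sentences[i])
    return chunks


# Toy embeddings (made up): the first two point one way, the last two another.
sentences = ["Cats purr.", "Cats nap a lot.", "GPUs run matmuls.", "GPUs need memory."]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(consecutive_chunks(sentences, embeddings))
```

Each sentence is embedded once, which keeps the method cheap, but a single noisy sentence can trigger a split on its own.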

In [None]:
from semantic_chunkers import ConsecutiveChunker

chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.3)

In [None]:
chunks = chunker(docs=[content])

In [None]:
chunker.print(chunks[0])

Split 1, tokens None, triggered by: 0.09
# Mamba:
----------------------------------------------------------------------------------------


Split 2, tokens None, triggered by: 0.10
Linear-Time Sequence Modeling with Selective State Spaces
----------------------------------------------------------------------------------------


Split 3, tokens None, triggered by: 0.25
# Albert Gu*1 and Tri Dao*2 1Machine Learning Department, Carnegie Mellon University 2Department of Computer Science, Princeton University agu@cs.cmu.edu, tri@tridao.me
----------------------------------------------------------------------------------------


Split 4, tokens None, triggered by: 0.22
# Abstract
----------------------------------------------------------------------------------------


Split 5, tokens None, triggered by: 0.30
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architectur

## Cumulative Chunking

Cumulative chunking is a more compute-intensive process, but it can often provide more stable results because it is more resistant to noise. However, it is _very expensive_ in both time and (if using APIs) money.
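The cost comes from re-embedding the growing chunk at every step. A minimal sketch of that loop follows, assuming a toy bag-of-words encoder (`toy_encode`, hypothetical) in place of a real embedding model; the function names and the call counter are illustrative and not taken from `CumulativeChunker`.

```python
import math
from collections import Counter


def toy_encode(text):
    """Hypothetical stand-in for a real encoder: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cumulative_chunks(sentences, encode=toy_encode, score_threshold=0.3):
    """Compare each new sentence with the embedding of the whole current chunk."""
    chunks, current, encode_calls = [], [], 0
    for sentence in sentences:
        if current:
            # The accumulated chunk is re-encoded on every step.
            chunk_vec = encode(" ".join(current))
            next_vec = encode(sentence)
            encode_calls += 2
            if cosine(chunk_vec, next_vec) < score_threshold:
                chunks.append(current)
                current = []
        current.append(sentence)
    if current:
        chunks.append(current)
    return chunks, encode_calls


sentences = [
    "cats purr loudly",
    "cats nap often",
    "cats chase mice",
    "gpus run kernels",
    "gpus need memory",
]
chunks, calls = cumulative_chunks(sentences)
print(chunks, calls)
```

Comparing against the whole chunk smooths out a single off-topic sentence, but the encoder is called about twice per sentence on ever-longer inputs, which is where the time and API cost go.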

In [None]:
from semantic_chunkers import CumulativeChunker

chunker = CumulativeChunker(encoder=encoder, score_threshold=0.3)

In [None]:
chunks = chunker(docs=[content])

In [None]:
chunker.print(chunks[0])

Split 1, tokens None, triggered by: 0.09
# Mamba:
----------------------------------------------------------------------------------------


Split 2, tokens None, triggered by: 0.10
Linear-Time Sequence Modeling with Selective State Spaces
----------------------------------------------------------------------------------------


Split 3, tokens None, triggered by: 0.28
# Albert Gu*1 and Tri Dao*2 1Machine Learning Department, Carnegie Mellon University 2Department of Computer Science, Princeton University agu@cs.cmu.edu, tri@tridao.me
----------------------------------------------------------------------------------------


Split 4, tokens None, triggered by: 0.22
# Abstract
----------------------------------------------------------------------------------------


Split 5, tokens None, triggered by: 0.23
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architectur

---