#### Different Chunking Techniques

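Document text can be chunked in several ways, and the right choice depends on the intended use case, such as natural language processing (NLP), summarization, or document analysis. The cells below implement three approaches: fixed-size chunking (by words or by characters), whitespace-aware recursive chunking, and semantic chunking based on sentence embeddings.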

In [13]:
import warnings
warnings.filterwarnings('ignore') 
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
nltk.download('punkt_tab', quiet=True)

True

In [2]:
text = """ 
An ANN consists of connected units or nodes called artificial neurons, which loosely model the neurons in the brain. 
These are connected by edges, which model the synapses in the brain. Each artificial neuron receives signals from connected neurons, 
then processes them and sends a signal to other connected neurons. The "signal" is a real number, and the output of each neuron is 
computed by some non-linear function of the sum of its inputs, called the activation function. The strength of the signal at each 
connection is determined by a weight, which adjusts during the learning process.
"""

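# Reassigning `text` replaces the ANN passage above; every cell below chunks this rock-music passage.
text = """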
Rock is a broad genre of popular music that originated as rock and roll in the United States in the late 1940s and early 1950s, developing into a range of different styles from the mid-1960s, particularly in the United States and the United Kingdom. It has its roots in rock and roll, a style that drew directly from the genres of blues, rhythm and blues, and country music. Rock also drew strongly from genres such as electric blues and folk, and incorporated influences from jazz and other musical styles. For instrumentation, rock is typically centered on the electric guitar, usually as part of a rock group with electric bass guitar, drums, and one or more singers. Usually, rock is song-based music with a 4 4 time signature and utilizing a verse–chorus form, but the genre has become extremely diverse. Like pop music, lyrics often stress romantic love but also address a wide variety of other themes that are frequently social or political. Rock was the most popular genre of music in the U.S. and much of the Western world from the 1950s to the 2010s. Rock musicians in the mid-1960s began to advance the album ahead of the single as the dominant form of recorded music expression and consumption, with the Beatles at the forefront of this development. Their contributions lent the genre a cultural legitimacy in the mainstream and initiated a rock-informed album era in the music industry for the next several decades. By the late 1960s "classic rock"[3] period, a few distinct rock music subgenres had emerged, including hybrids like blues rock, folk rock, country rock, Southern rock, raga rock, and jazz rock, which contributed to the development of psychedelic rock, influenced by the countercultural psychedelic and hippie scene. New genres that emerged included progressive rock, which extended artistic elements, heavy metal, which emphasized an aggressive thick sound, and glam rock, which highlighted showmanship and visual style. 
In the second half of the 1970s, punk rock reacted by producing stripped-down, energetic social and political critiques. Punk was an influence in the 1980s on new wave, post-punk and eventually alternative rock. From the 1990s, alternative rock began to dominate rock music and break into the mainstream in the form of grunge, Britpop, and indie rock. Further fusion subgenres have since emerged, including pop-punk, electronic rock, rap rock, and rap metal. Some movements were conscious attempts to revisit rock's history, including the garage rock/post-punk revival in the 2000s. Since the 2010s, rock has lost its position as the pre-eminent popular music genre in world culture, but remains commercially successful. The increased influence of hip-hop and electronic dance music can be seen in rock music, notably in the techno-pop scene of the early 2010s and the pop-punk-hip-hop revival of the 2020s.
"""

In [3]:
def fixed_size_chunking(text, chunk_size, overlap, char=False):
    """
    Splits the input text into chunks of a fixed size with optional overlap.

    Parameters:
    text (str): The input text to be chunked.
    chunk_size (int): The size of each chunk, in characters or words.
    overlap (int): The number of elements shared by consecutive chunks; must be in [0, chunk_size).
    char (bool): If True, chunk by characters. If False, chunk by words. Default is False.

    Returns:
    list: Strings when char=True; lists of words when char=False.
    """
    if not 0 <= overlap < chunk_size:
        # A step of zero or less would raise in range() or silently return no chunks.
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # distance between consecutive chunk starts
    # Slice characters directly, or slice the list of whitespace-separated words.
    units = text if char else text.split()
    return [units[i:i + chunk_size] for i in range(0, len(units), step)]
    

def recursive_chunking(text, max_size):
    """
    Splits the input text into chunks of at most max_size characters, cutting at the
    last space before the limit so that words are kept whole where possible.

    Args:
        text (str): The input text to be chunked.
        max_size (int): The maximum number of characters in each chunk.
    Returns:
        list of str: A list of text chunks, each with a length up to max_size.
        A run of max_size characters without a space is cut mid-word.
    """
    if len(text) <= max_size:
        return [text]
    
    split_idx = max_size
    while split_idx > 0 and text[split_idx] != ' ':
        split_idx -= 1

    if split_idx == 0:           # No space found
        split_idx = max_size

    return [text[:split_idx].strip()] + recursive_chunking(text[split_idx:].strip(), max_size)
    


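def semantic_chunking(text, num_chunks=3):
    """
    Splits text into semantically coherent chunks by clustering sentence embeddings.

    Sentences are grouped by similarity, not by position, so a chunk may combine
    sentences that are not adjacent in the source text.

    Parameters:
    - text: The text to be chunked.
    - num_chunks: The number of semantic chunks (at most the number of sentences).

    Returns:
    - A list of chunks, each a semantically grouped portion of the text.
    """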
    # Load a pre-trained Sentence Transformer model
    model = SentenceTransformer('all-MiniLM-L6-v2')

    # Tokenize the text into sentences
    sentences = sent_tokenize(text)

    # Compute sentence embeddings
    embeddings = model.encode(sentences)

    # Cluster the sentences using Agglomerative Clustering
    clustering = AgglomerativeClustering(n_clusters=num_chunks, metric='cosine', linkage='average')
    clusters = clustering.fit_predict(embeddings)

    # Group sentences by their cluster labels
    grouped_sentences = [[] for _ in range(num_chunks)]
    for sentence, label in zip(sentences, clusters):
        grouped_sentences[label].append(sentence)

    # Combine sentences in each cluster to form chunks
    chunks = [" ".join(group) for group in grouped_sentences]
    return chunks

def print_chunks(chunks):
    for i, chunk in enumerate(chunks):
        print(f"Chunk {i + 1}:")
        print(chunk)
        print("=====")

In [4]:
chunks = fixed_size_chunking(text, 100, 20)
print_chunks(chunks)

Chunk 1:
['Rock', 'is', 'a', 'broad', 'genre', 'of', 'popular', 'music', 'that', 'originated', 'as', 'rock', 'and', 'roll', 'in', 'the', 'United', 'States', 'in', 'the', 'late', '1940s', 'and', 'early', '1950s,', 'developing', 'into', 'a', 'range', 'of', 'different', 'styles', 'from', 'the', 'mid-1960s,', 'particularly', 'in', 'the', 'United', 'States', 'and', 'the', 'United', 'Kingdom.', 'It', 'has', 'its', 'roots', 'in', 'rock', 'and', 'roll,', 'a', 'style', 'that', 'drew', 'directly', 'from', 'the', 'genres', 'of', 'blues,', 'rhythm', 'and', 'blues,', 'and', 'country', 'music.', 'Rock', 'also', 'drew', 'strongly', 'from', 'genres', 'such', 'as', 'electric', 'blues', 'and', 'folk,', 'and', 'incorporated', 'influences', 'from', 'jazz', 'and', 'other', 'musical', 'styles.', 'For', 'instrumentation,', 'rock', 'is', 'typically', 'centered', 'on', 'the', 'electric', 'guitar,', 'usually']
=====
Chunk 2:
['and', 'incorporated', 'influences', 'from', 'jazz', 'and', 'other', 'musical', 'style

In [5]:

chunks = fixed_size_chunking(text, 100, 20, char=True)

print_chunks(chunks)

Chunk 1:

Rock is a broad genre of popular music that originated as rock and roll in the United States in the
=====
Chunk 2:
United States in the late 1940s and early 1950s, developing into a range of different styles from th
=====
Chunk 3:
erent styles from the mid-1960s, particularly in the United States and the United Kingdom. It has it
=====
Chunk 4:
d Kingdom. It has its roots in rock and roll, a style that drew directly from the genres of blues, r
=====
Chunk 5:
e genres of blues, rhythm and blues, and country music. Rock also drew strongly from genres such as 
=====
Chunk 6:
from genres such as electric blues and folk, and incorporated influences from jazz and other musical
=====
Chunk 7:
zz and other musical styles. For instrumentation, rock is typically centered on the electric guitar,
=====
Chunk 8:
the electric guitar, usually as part of a rock group with electric bass guitar, drums, and one or mo
=====
Chunk 9:
drums, and one or more singers. Usually, rock is song-based mus
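
The overlap in the character-based output above comes from the step passed to `range`: each chunk starts `chunk_size - overlap` positions after the previous one, which is why Chunk 2 begins with "United States in the", the last 20 characters of Chunk 1. A minimal sketch of that arithmetic on a toy string follows; `chunk_starts` and `sample` are illustrative names that do not appear in the notebook.

```python
def chunk_starts(length, chunk_size, overlap):
    """Return the start offset of every chunk (hypothetical helper for illustration)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    # Consecutive chunks start chunk_size - overlap positions apart.
    return list(range(0, length, chunk_size - overlap))


sample = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
starts = chunk_starts(len(sample), chunk_size=10, overlap=4)
chunks = [sample[s:s + 10] for s in starts]

print(starts)  # [0, 6, 12, 18, 24]
print(chunks)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz', 'yz']
```

The final chunk here (`yz`) is already contained in the chunk before it. The same trailing redundancy can occur with `fixed_size_chunking` whenever the text left after the last full step is no longer than the overlap.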

In [6]:
chunks = recursive_chunking(text, 100)  # split text into chunks of up to 100 characters
print_chunks(chunks)

Chunk 1:
Rock is a broad genre of popular music that originated as rock and roll in the United States in the
=====
Chunk 2:
late 1940s and early 1950s, developing into a range of different styles from the mid-1960s,
=====
Chunk 3:
particularly in the United States and the United Kingdom. It has its roots in rock and roll, a style
=====
Chunk 4:
that drew directly from the genres of blues, rhythm and blues, and country music. Rock also drew
=====
Chunk 5:
strongly from genres such as electric blues and folk, and incorporated influences from jazz and
=====
Chunk 6:
other musical styles. For instrumentation, rock is typically centered on the electric guitar,
=====
Chunk 7:
usually as part of a rock group with electric bass guitar, drums, and one or more singers. Usually,
=====
Chunk 8:
rock is song-based music with a 4 4 time signature and utilizing a verse–chorus form, but the genre
=====
Chunk 9:
has become extremely diverse. Like pop music, lyrics often stress romantic love but also ad
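
`recursive_chunking` makes one nested call per chunk, so a very long document can exceed Python's recursion limit (1000 frames by default in CPython). The sketch below applies the same cut rule in a loop instead. `iterative_chunking` is a hypothetical name, and the function is an illustration under that assumption, not part of the notebook.

```python
def iterative_chunking(text, max_size):
    """Greedy, loop-based variant: cut at the last space at or before max_size."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    chunks = []
    rest = text.strip()
    while len(rest) > max_size:
        # Search indices 0..max_size for the last space.
        split_idx = rest.rfind(" ", 0, max_size + 1)
        if split_idx <= 0:
            split_idx = max_size  # no space found: fall back to a hard cut
        chunks.append(rest[:split_idx].strip())
        rest = rest[split_idx:].strip()
    if rest:
        chunks.append(rest)
    return chunks


sample = "one two three four five six seven"
print(iterative_chunking(sample, 10))
# ['one two', 'three four', 'five six', 'seven']
```

Because the loop keeps no call stack, the number of chunks is limited only by memory.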

In [7]:
chunks = semantic_chunking(text, num_chunks=6)
print_chunks(chunks)

Chunk 1:

Rock is a broad genre of popular music that originated as rock and roll in the United States in the late 1940s and early 1950s, developing into a range of different styles from the mid-1960s, particularly in the United States and the United Kingdom. It has its roots in rock and roll, a style that drew directly from the genres of blues, rhythm and blues, and country music. Rock also drew strongly from genres such as electric blues and folk, and incorporated influences from jazz and other musical styles. For instrumentation, rock is typically centered on the electric guitar, usually as part of a rock group with electric bass guitar, drums, and one or more singers. Usually, rock is song-based music with a 4 4 time signature and utilizing a verse–chorus form, but the genre has become extremely diverse. Rock was the most popular genre of music in the U.S. and much of the Western world from the 1950s to the 2010s. By the late 1960s "classic rock"[3] period, a few distinct rock musi
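
Clustering groups sentences by similarity regardless of where they occur, so the chunk above stitches together sentences that are not adjacent in the source. One alternative that preserves reading order is to start a new chunk wherever the similarity between consecutive sentences drops below a threshold. The sketch below illustrates the idea only: toy word-count vectors stand in for the `SentenceTransformer` embeddings so that it runs without downloading a model, the threshold of 0.2 is arbitrary, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def bow_vector(sentence):
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def adjacent_similarity_chunking(sentences, threshold=0.2):
    """Start a new chunk when consecutive sentences are less similar than threshold."""
    if not sentences:
        return []
    vectors = [bow_vector(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])    # similarity dropped: open a new chunk
        else:
            chunks[-1].append(sentence)  # still on topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


demo_sentences = [
    "Neural networks learn weights.",
    "The weights of neural networks change during training.",
    "Rock music uses electric guitars.",
    "Electric guitars define rock music.",
]
for chunk in adjacent_similarity_chunking(demo_sentences):
    print(chunk)
```

With real embeddings, both `bow_vector` and `cosine` would need to be replaced by the model's encoder and a vector cosine similarity, and the threshold would need tuning for that model.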

#### Pros and Cons of Different Chunking Techniques


| **Chunking Technique**        | **Description**                                                 | **Pros**                                                                                 | **Cons**                                                                                     |
|--------------------------------|-----------------------------------------------------------------|-----------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|
| **Fixed Size (Characters)**   | Splits text into chunks of a fixed number of characters.        | Easy to implement; consistent chunk size.                                               | May break words or sentences; lacks semantic context.                                       |
| **Fixed Size (Words)**        | Splits text into chunks of a fixed number of words.             | Respects word boundaries; simple implementation.                                         | May split sentences awkwardly; ignores semantics.                                           |
| **Fixed Size (Sentences)**    | Splits text into chunks of a fixed number of sentences.         | Maintains sentence integrity; better readability.                                        | May still ignore semantic coherence across sentences.                                       |
| **Sentence Embedding (Semantic)** | Uses clustering of sentence embeddings to group sentences semantically. | Captures semantic relationships; highly adaptable.                                       | Computationally intensive; depends on the quality of the embeddings and clustering.         |
| **Recursive-Based**           | Recursively splits text to ensure chunk size limits are met.    | Guarantees chunk size adherence; flexible for various text lengths.                     | May split sentences awkwardly; less efficient for large documents.                         |
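
The table lists a sentence-based fixed-size technique that the cells above do not implement. A minimal sketch is given below, assuming a naive regular-expression splitter so that the example needs only the standard library. The name `sentence_chunking` and its parameters are hypothetical. The regex breaks after every `.`, `!` or `?` followed by whitespace, so it would wrongly split after an abbreviation such as "U.S." in the sample passage; the `sent_tokenize` function already imported in this notebook is likely the better choice for real text.

```python
import re


def sentence_chunking(text, sentences_per_chunk, overlap=0):
    """Group consecutive sentences into chunks of a fixed sentence count."""
    if sentences_per_chunk <= 0:
        raise ValueError("sentences_per_chunk must be positive")
    if not 0 <= overlap < sentences_per_chunk:
        raise ValueError("overlap must satisfy 0 <= overlap < sentences_per_chunk")
    # Naive splitter: break on whitespace that follows ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [p.strip() for p in parts if p.strip()]
    step = sentences_per_chunk - overlap
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), step)
    ]


demo = "A one. B two! C three? D four. E five."
print(sentence_chunking(demo, 2))
# ['A one. B two!', 'C three? D four.', 'E five.']
```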
