# Chunking Strategy

## Installing Dependencies

In [1]:
%pip install -qU PyPDF2 langchain langchain-community langchain-groq langchain-huggingface faiss-cpu langgraph

Note: you may need to restart the kernel to use updated packages.


## Extracting the text from the file

In [2]:
# Load the PDF and extract its text page by page
import PyPDF2

pdf_path = "AI and the Future.pdf"

full_text = ""

with open(pdf_path, "rb") as file:
    reader = PyPDF2.PdfReader(file)
    for page in reader.pages:
        text = page.extract_text()
        if text:
            full_text += text + "\n"  # add a newline between pages

In [None]:
print(full_text)  

AI and the Future  
Artificial Intelligence (AI) stands at the forefront of a technological revolution, poised to 
redefine the contours of human existence. As we navigate the complexities of the 21st 
century, AI’s ability to process vast datasets, learn from patterns, and mak e autonomous 
decisions is transforming industries, societies, and individual lives. This essay explores the 
trajectory of AI, its potential to shape a future of unprecedented opportunity, the challenges it 
presents, and the ethical frameworks needed to ens ure it serves humanity’s best interests. By 
examining its current state, future applications, and societal implications, we can envision a 
path where AI amplifies human potential while addressing global challenges.  
The Evolution and Current State of AI  
AI has evolved from a speculative concept to a cornerstone of modern technology. Early AI 
systems, rooted in rule -based programming, have given way to sophisticated machine 
learning models, neural ne

## Chunking
There are several chunking strategies available; we will use LangChain for all of them.
Text splitters split documents into smaller chunks for use in downstream applications.
1. `Character Text Splitter:` Splits the document on a single separator and packs the pieces into chunks according to a chunk size and overlap.
2. `Recursive Character Text Splitter:` Also driven by chunk size and overlap, but it tries a list of separators in order (paragraph breaks, then `\n` newlines, then spaces), so chunks are not cut in the middle of sentences or words unless necessary.
3. `Document Specific Splitter:` Splits according to the document type. A Markdown file is split on its headers (`#`, `##`, `###`, etc.), and a Python file is split so that function and class boundaries are respected.
4. `Semantic Chunking:` Splits text not by fixed size, but at meaningful semantic boundaries, determined via embeddings.
    * Splits text into sentences.
    * Groups sentences in sets of (e.g.) three.
    * Computes embeddings for each group.
    * Merges sentence groups whose embeddings are close together (i.e., semantically similar).
    * Creates chunks that maintain semantic coherence, so each chunk holds a whole idea or topic rather than an arbitrary character span.
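
The semantic chunking steps above can be sketched in plain Python. This is only an illustration under loud assumptions: `toy_embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, sentences are compared one at a time rather than in groups of three, and the `threshold` value is arbitrary. It is not how LangChain implements semantic chunking.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts (NOT a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text: str, threshold: float = 0.2) -> list[str]:
    """Keep adjacent sentences together while their embeddings stay similar."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(curr)) >= threshold:
            chunks[-1].append(curr)   # similar enough: extend the current chunk
        else:
            chunks.append([curr])     # topic shift: start a new chunk
    return [" ".join(group) for group in chunks]


sample = (
    "Cats are small pets. Cats and dogs are popular pets. "
    "Stock markets fell sharply today. Markets fell as stock prices dropped."
)
for chunk in semantic_chunks(sample):
    print(repr(chunk))
```

With a real embedding model the comparison would capture meaning rather than shared words, but the control flow (embed, compare neighbours, break where similarity drops) stays the same.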
    

### 1. Character Text Splitter
Two Types:
* *Token-based:* Splits text based on the number of tokens, which is useful when working with language models.
* *Character-based:* Splits text based on the number of characters, which can be more consistent across different types of text.

NOTE: For both variants we pass `separator="\n"` for this text. The default separator is `"\n\n"`, and if the text contains no such sequence, the splitter returns the entire text as a single chunk.

#### Character-based
This is the simplest method. This splits based on a given character sequence, which defaults to "\n\n". Chunk length is measured by number of characters.

* How the text is split: by a single separator string.
* How the chunk size is measured: by number of characters.

To obtain the string content directly, use `.split_text`.

To create LangChain Document objects (e.g., for use in downstream tasks), use `.create_documents`.
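
To make the mechanics concrete, here is a minimal pure-Python sketch of the split-then-merge idea with overlap. It is an illustration, not LangChain's implementation: the function name `split_by_separator`, the sample text, and the small `chunk_size` and `chunk_overlap` values are all assumptions chosen to keep the example readable.

```python
def split_by_separator(text, separator="\n\n", chunk_size=25, chunk_overlap=12):
    """Split on one separator, then pack pieces into overlapping chunks."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        if current and len(separator.join(current + [piece])) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading pieces until the kept tail fits the overlap budget
            # and leaves room for the incoming piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


text = "alpha beta\ngamma delta\nepsilon zeta\neta theta"

print(split_by_separator(text, separator="\n"))    # splits on single newlines
print(split_by_separator(text, separator="\n\n"))  # separator never occurs in text
```

The second call shows why the separator matters: when the separator never occurs, there is nothing to split on, so the whole text comes back as one oversized chunk.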

In [4]:
%pip install -qU langchain-text-splitters


In [None]:
from langchain_text_splitters import CharacterTextSplitter
# Create a CharacterTextSplitter instance
text_splitter = CharacterTextSplitter(
    separator="\n",  # Split by single newlines
    chunk_size=1000,  # Each chunk will be at most 1000 characters
    chunk_overlap=200,  # Allow for some overlap between chunks
)

# Split the text into chunks
chunks = text_splitter.create_documents([full_text])
# Print the number of chunks
print(f"Number of chunks: {len(chunks)}")

Number of chunks: 12


In [19]:
for doc in chunks:
    print(doc.page_content)
    print('\n\n')

AI and the Future  
Artificial Intelligence (AI) stands at the forefront of a technological revolution, poised to 
redefine the contours of human existence. As we navigate the complexities of the 21st 
century, AI’s ability to process vast datasets, learn from patterns, and mak e autonomous 
decisions is transforming industries, societies, and individual lives. This essay explores the 
trajectory of AI, its potential to shape a future of unprecedented opportunity, the challenges it 
presents, and the ethical frameworks needed to ens ure it serves humanity’s best interests. By 
examining its current state, future applications, and societal implications, we can envision a 
path where AI amplifies human potential while addressing global challenges.  
The Evolution and Current State of AI  
AI has evolved from a speculative concept to a cornerstone of modern technology. Early AI 
systems, rooted in rule -based programming, have given way to sophisticated machine



AI has evolved from a sp

#### Token-based
Here we use `tiktoken` to measure chunk size in tokens rather than characters.
Because `tiktoken` implements the tokenizers used by OpenAI models, its counts will probably be most accurate for those models and only an estimate for others.

* How the text is split: by the separator passed in.
* How the chunk size is measured: by the `tiktoken` tokenizer.
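
The only thing that changes between the character-based and token-based variants is how chunk length is measured. The sketch below illustrates that with a pluggable length function. It is a simplified illustration: `count_tokens` is a hypothetical stand-in that counts words and punctuation marks (it is not `tiktoken`), `merge_by_length` is an invented helper, and overlap is left out for brevity.

```python
import re


def count_tokens(text: str) -> int:
    """Stand-in tokenizer: counts words and punctuation marks (NOT tiktoken)."""
    return len(re.findall(r"\w+|[^\w\s]", text))


def merge_by_length(pieces, max_size, length_function):
    """Greedily pack pieces into chunks whose measured length stays within max_size."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}\n{piece}" if current else piece
        if current and length_function(candidate) > max_size:
            chunks.append(current)   # current chunk is full: emit it
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


lines = [
    "AI is transforming industries.",
    "It learns from patterns.",
    "Ethics matter too.",
]

# Same pieces, same packing logic, different units for the size limit.
by_chars = merge_by_length(lines, max_size=60, length_function=len)
by_tokens = merge_by_length(lines, max_size=8, length_function=count_tokens)

print(len(by_chars), "chunks when measured in characters")
print(len(by_tokens), "chunks when measured in stand-in tokens")
```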


In [7]:
%pip install --upgrade --quiet langchain-text-splitters tiktoken


In [8]:
import tiktoken

# Count the tokens in the full text using tiktoken's cl100k_base encoding
encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode(full_text)
print("Number of tokens:", len(tokens))


Number of tokens: 1783


In [22]:
from langchain_text_splitters import CharacterTextSplitter

# Create a CharacterTextSplitter that measures chunk size with tiktoken
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    separator="\n",  # Split by single newlines
    encoding_name="cl100k_base",
    chunk_size=300,  # Measured in tokens, not characters
    chunk_overlap=50,  # Also measured in tokens
)
chunk_texts = text_splitter.split_text(full_text)

In [23]:
print(len(chunk_texts))  # Print the number of chunks

8


### 2. Recursive Character Text Splitter
This text splitter is the recommended one for generic text. It is parameterized by a list of characters. It tries to split on them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of trying to keep all paragraphs (and then sentences, and then words) together as long as possible, as those would generically seem to be the strongest semantically related pieces of text.

* How the text is split: by list of characters.
* How the chunk size is measured: by number of characters.
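
The "try each separator in order" behaviour can be sketched in plain Python as follows. This is a simplified illustration, not LangChain's code: the names `recursive_split` and `merge` are invented for the example, chunk overlap is omitted, and the tiny `chunk_size` is chosen only so the effect is visible on a short string.

```python
def merge(pieces, sep, chunk_size):
    """Greedily join small pieces with sep while the result fits in chunk_size."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator; recurse with finer ones on oversized parts."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    chunks, small = [], []
    for part in text.split(sep):
        if not part:
            continue
        if len(part) <= chunk_size:
            small.append(part)
        else:
            # Flush the small parts collected so far, then split the big one finer.
            chunks.extend(merge(small, sep, chunk_size))
            small = []
            chunks.extend(recursive_split(part, chunk_size, finer))
    chunks.extend(merge(small, sep, chunk_size))
    return chunks


text = (
    "Short intro.\n\n"
    "This paragraph is longer than the limit so it is split on spaces.\n\n"
    "End."
)
for chunk in recursive_split(text, chunk_size=30):
    print(repr(chunk))
```

Paragraphs that already fit are kept whole; only the oversized middle paragraph falls through to the finer separators and ends up split on spaces.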

In [30]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

documents = text_splitter.create_documents([full_text])

In [31]:
print(f"Number of chunks after recursive splitting: {len(documents)}")

Number of chunks after recursive splitting: 12


In [32]:
for doc in documents:
    print(doc.page_content)
    print('\n\n')

AI and the Future  
Artificial Intelligence (AI) stands at the forefront of a technological revolution, poised to 
redefine the contours of human existence. As we navigate the complexities of the 21st 
century, AI’s ability to process vast datasets, learn from patterns, and mak e autonomous 
decisions is transforming industries, societies, and individual lives. This essay explores the 
trajectory of AI, its potential to shape a future of unprecedented opportunity, the challenges it 
presents, and the ethical frameworks needed to ens ure it serves humanity’s best interests. By 
examining its current state, future applications, and societal implications, we can envision a 
path where AI amplifies human potential while addressing global challenges.  
The Evolution and Current State of AI  
AI has evolved from a speculative concept to a cornerstone of modern technology. Early AI 
systems, rooted in rule -based programming, have given way to sophisticated machine



AI has evolved from a sp