<a href="https://colab.research.google.com/github/Hamza-Atiq/Internship-at-neurooceans-ai/blob/main/rag_from_langchain_day4_splitters_neurooceansai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Text Splitters**

# Why split documents?

1- Handling non-uniform document lengths

2- Overcoming model limitations

3- Improving representation quality

4- Enhancing retrieval precision

5- Optimizing computational resources

# **Approaches**

# 1- **Length-based**

This approach ensures that no chunk exceeds a specified size limit.

-> Benefits of length-based splitting:

1- Straightforward implementation

2- Consistent chunk sizes

3- Easily adaptable to different model requirements

# Types of length-based splitting

1- Token based

Splits text based on the number of tokens, which is useful when working
with language models.

2- Character-based

Splits text based on the number of characters, which can be more consistent across different types of text.
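A minimal sketch of character-based splitting, using only the standard library. The function name `split_by_characters` and the sample sentence are illustrative and not part of LangChain. A token-based variant would apply the same windowing to a list of token ids instead of characters.

```python
def split_by_characters(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into windows of at most chunk_size characters (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "Text splitters break long documents into smaller chunks."
chunks = split_by_characters(sample, chunk_size=20, chunk_overlap=5)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Every chunk stays within `chunk_size`, and each chunk repeats the last `chunk_overlap` characters of the previous one so that context is not lost at the boundary.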

In [None]:
! pip install -q langchain_community

# PDF loader

In [None]:
%pip install -q pypdf

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('/content/National AI Policy Consultation Draft V1.pdf')

# The async comprehension works at the top level of a notebook cell (IPython autoawait)
pages = [page async for page in loader.alazy_load()]

# How to split text by tokens

# **tiktoken**

Installing libraries

In [None]:
%pip install -qU langchain-text-splitters tiktoken

## **Character Text Splitter**

In [None]:
from langchain_text_splitters import CharacterTextSplitter

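Note: `CharacterTextSplitter.from_tiktoken_encoder` splits the text only on the character separator; the tiktoken tokenizer is used just to measure and merge the resulting pieces. A chunk can therefore be larger than `chunk_size` tokens when a piece contains no separator.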
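The split-then-merge behaviour can be illustrated without LangChain. This is a simplified sketch of the idea, not the library's implementation: `count_tokens` is a stand-in that counts whitespace-separated words instead of calling tiktoken, and `split_then_merge` is a hypothetical helper.

```python
def count_tokens(text: str) -> int:
    """Stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def split_then_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on one separator, then greedily merge pieces up to chunk_size tokens."""
    pieces = [piece for piece in text.split(separator) if piece.strip()]
    chunks: list[str] = []
    current: list[str] = []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and count_tokens(candidate) > chunk_size:
            # Adding this piece would overflow, so close the current chunk
            chunks.append(separator.join(current))
            current = [piece]
        else:
            current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


text = "one two\n\nthree four\n\nfive six seven eight nine ten eleven twelve"
chunks = split_then_merge(text, separator="\n\n", chunk_size=5)
for chunk in chunks:
    print(count_tokens(chunk), repr(chunk))
```

The last piece holds eight words and no `"\n\n"`, so it is emitted whole even though the limit is five. The same effect explains the oversized chunks in the PDF output.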

In [None]:
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name = 'cl100k_base',
    chunk_size = 50,
    chunk_overlap = 0,
)

# Iterate through pages and split text

texts = []
for page in pages:
  page_texts = page.page_content
  texts.extend(text_splitter.split_text(page_texts))

# Display the split texts (optional)
for text in texts[:5]:  # Display first 5 chunks
    print(text)

i
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
                        Draft
National
 
Artificial Intelligence Policy
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Government of Pakistan
Ministry of Information Technology & Telecommunication
https://moitt.gov.pk
ii 
 
Acknowledgments 
The Government of Pakistan, Ministry of IT & Telecom , pays its gratitude to all the officials  and consultants, 
particularly RSM Pakistan and GlowBug Technologies (Pvt.) Ltd. , facilitators, developers, and stakeholders who 
rigorously and relentlessly participated in the revi ew, drafting, harmonizing, and ratification of the National 
Artificial Intelligence Policy – 2022, helping the Ministry with an all -inclusive user -centric, evidence -based, 
forward-looking, and agile policy framework for enabling Pakistan towards a digital economy and society.
iii 
 
Table of Contents 
1 Executive Summary ....................................................................................................

## **Recursive Text Splitter**

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [None]:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size = 50,
    chunk_overlap = 0,
)

# Iterate through pages and split text

texts = []
for page in pages:
  page_texts = page.page_content
  texts.extend(text_splitter.split_text(page_texts))

# Display the split texts (optional)
for text in texts[:5]:  # Display first 5 chunks
    print(text)

i
Draft
National
 
Artificial Intelligence Policy
Government of Pakistan
Ministry of Information Technology & Telecommunication
https://moitt.gov.pk
ii 
 
Acknowledgments 
The Government of Pakistan, Ministry of IT & Telecom , pays its gratitude to all the officials  and consultants,


In [None]:
texts[:5]

['i',
 'Draft\nNational\n \nArtificial Intelligence Policy',
 'Government of Pakistan\nMinistry of Information Technology & Telecommunication',
 'https://moitt.gov.pk',
 'ii \n \nAcknowledgments \nThe Government of Pakistan, Ministry of IT & Telecom , pays its gratitude to all the officials  and consultants,']

# How to recursively split text by characters

This text splitter is the recommended one for generic text.

It is parameterized by a list of characters.

It tries to split on them in order until the chunks are small enough.

The default list is ["\n\n", "\n", " ", ""].
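The "try each separator in order" idea can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: it ignores `chunk_overlap`, drops whitespace-only pieces, and the name `recursive_split` is hypothetical.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " ", ""),
) -> list[str]:
    """Split on the first separator; recurse with the next one on pieces still too long."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in (p for p in text.split(separator) if p.strip()):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators
            chunks.extend(recursive_split(piece, chunk_size, remaining))
            current = ""
    if current:
        chunks.append(current)
    return chunks


text = (
    "Draft National\n\nArtificial Intelligence Policy\n\n"
    "Government of Pakistan\n"
    "Ministry of Information Technology & Telecommunication"
)
chunks = recursive_split(text, chunk_size=30)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Paragraph breaks are tried first, then line breaks, then spaces, so a chunk is cut inside a word only when nothing coarser fits.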

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 50,
    chunk_overlap = 10,
    length_function = len,
    is_separator_regex = False,
)

texts = []
for page in pages:
  page_texts = page.page_content
  texts.extend(text_splitter.create_documents([page_texts]))

# Display the split texts (optional)
for text in texts[:5]:  # Display first 5 chunks
    print(text)

page_content='i'
page_content='Draft'
page_content='National
 
Artificial Intelligence Policy'
page_content='Government of Pakistan'
page_content='Ministry of Information Technology &'


In [None]:
texts = []
for page in pages:
  page_texts = page.page_content
  texts.extend(text_splitter.split_text(page_texts))

# Display the split texts (optional)
for text in texts[:5]:  # Display first 5 chunks
    print(text)

i
Draft
National
 
Artificial Intelligence Policy
Government of Pakistan
Ministry of Information Technology &


# Splitting text from languages without word boundaries

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    separators=[
        "\n\n",
        "\n",
        " ",
        ".",
        ",",
        "\u200b",  # Zero-width space
        "\uff0c",  # Fullwidth comma
        "\u3001",  # Ideographic comma
        "\uff0e",  # Fullwidth full stop
        "\u3002",  # Ideographic full stop
        "",
    ],
    # Other arguments (chunk_size, chunk_overlap, ...) go here
)
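Why the extra separators matter can be shown with a short standard-library sketch. It assumes the splitter uses the first separator in the list that actually occurs in the text; `first_usable_separator` is a hypothetical helper, and the Japanese sample sentence is made up for illustration.

```python
DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]
EXTENDED_SEPARATORS = [
    "\n\n", "\n", " ", ".", ",",
    "\u200b",  # Zero-width space
    "\uff0c",  # Fullwidth comma
    "\u3001",  # Ideographic comma
    "\uff0e",  # Fullwidth full stop
    "\u3002",  # Ideographic full stop
    "",
]


def first_usable_separator(text: str, separators: list[str]) -> str:
    """Return the first separator that occurs in the text ("" always matches)."""
    for separator in separators:
        if separator == "" or separator in text:
            return separator
    return ""


# Japanese sample with no spaces:
# "This is the first sentence. This is the second sentence."
text = "これは最初の文です。これは二番目の文です。"

print(repr(first_usable_separator(text, DEFAULT_SEPARATORS)))
print(repr(first_usable_separator(text, EXTENDED_SEPARATORS)))

# Splitting on the ideographic full stop keeps whole sentences together
sentences = [part + "\u3002" for part in text.split("\u3002") if part]
print(sentences)
```

With the default list, nothing matches before the empty string, so the text would be cut at arbitrary character positions. With the extended list, the ideographic full stop is found and the text splits at sentence ends.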

# How to split Markdown by Headers

In [None]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

In [None]:
# Define headers to split on
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

# Initialize the Markdown header text splitter
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# Read the Markdown file content
with open('/content/README.md', 'r', encoding='utf-8') as f:
    markdown_content = f.read()

# Split the Markdown content
md_headers_split = markdown_splitter.split_text(markdown_content)

md_headers_split

[Document(metadata={'Header 1': 'Chapter 6: Python Dictionary Exercises'}, page_content='This repository contains a series of Python exercises from Chapter 6, showcasing practical usage of dictionaries, list operations, and loops. These exercises are organized to build familiarity with dictionary structures and their various applications. Below, each exercise is explained in detail with the associated code and comments.  \n---'),
 Document(metadata={'Header 1': 'Chapter 6: Python Dictionary Exercises', 'Header 2': 'Exercise 6.1: Person'}, page_content='This exercise creates a dictionary containing personal information, then prints the formatted details using dictionary keys.  \n```python\n# 6.1 Person\nperson : dict[str , all] = {\'first_name\' : "Hamza" , \'last_name\' : \'Atiq\' , \'age\' : 27 , \'city\' : "Rawalpindi"}\nprint(f"Person first name is {person[\'first_name\'].upper()} last name is {person[\'last_name\'].upper()}")\nprint(f"Person age is {person[\'age\']} and he lives in
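To make the header metadata in the output easier to follow, here is a rough standard-library sketch of header-based splitting. It is not how `MarkdownHeaderTextSplitter` is implemented: `split_markdown_by_headers` is a hypothetical helper, and the sample Markdown is made up.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker


def split_markdown_by_headers(markdown: str, max_level: int = 3) -> list[dict]:
    """Group lines under the headers that precede them (hypothetical helper)."""
    header_pattern = re.compile(r"^(#{1,6})\s+(.*)$")
    sections: list[dict] = []
    current_headers: dict[str, str] = {}
    current_lines: list[str] = []
    in_code_fence = False

    def flush() -> None:
        content = "\n".join(current_lines).strip()
        if content:
            sections.append({"metadata": dict(current_headers), "content": content})
        current_lines.clear()

    for line in markdown.splitlines():
        if line.strip().startswith(FENCE):
            in_code_fence = not in_code_fence
        # Lines such as "# comment" inside a code fence are not headers
        match = None if in_code_fence else header_pattern.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level
            for key in [k for k in current_headers if int(k.split()[1]) >= level]:
                del current_headers[key]
            current_headers[f"Header {level}"] = match.group(2).strip()
        else:
            current_lines.append(line)
    flush()
    return sections


sample = "\n".join([
    "# Chapter 6",
    "Intro text.",
    "## Exercise 6.1",
    "Creates a dictionary.",
    "## Exercise 6.2",
    "Loops over a dictionary.",
])
sections = split_markdown_by_headers(sample)
for section in sections:
    print(section)
```

Each section carries the full header path above it, which is why every `Document` in the output repeats `Header 1` next to its own `Header 2`.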

# Token Text Splitter

In [None]:
from langchain_text_splitters import TokenTextSplitter

In [None]:
text_splitter = TokenTextSplitter(chunk_size=50, chunk_overlap=0)

texts = []
for page in pages:
  page_texts = page.page_content
  texts.extend(text_splitter.split_text(page_texts))

# Display the text
for text in texts[:5]:  # Display first 5 chunks
    print(text)

 
 
i
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 
 
                        Draft
National
 
Artificial Intelligence Policy
 
 
 
 
 
 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Government of Pakistan
Ministry of Information Technology & Telecommunication
https
://moitt.gov.pk
 
 
 

 
 
ii 
 
Acknowledgments 
The Government of Pakistan, Ministry of IT & Telecom , pays its gratitude to all the officials  and consultants, 
particularly RSM Pakistan and GlowBug Technologies (Pvt.) Ltd


# **spaCy**

Installing libraries

In [None]:
#%pip install -q numpy==1.24.3


Collecting numpy==1.24.3
  Downloading numpy-1.24.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Downloading numpy-1.24.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 2.0.2
    Uninstalling numpy-2.0.2:
      Successfully uninstalled numpy-2.0.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
albucore 0.0.19 requires numpy>=1.24.4, but you have numpy 1.24.3 which is incompatible.
albumentations 1.4.20 requires numpy>=1.24.4, but you have numpy 1.24.3 which is incompatible.
blis 1.0.2 requires numpy<3.0.0,>=2.0.0, but you have numpy 1.24.3 which is incompatible.
thinc 8.3.2 r

In [None]:

%pip install -qU spacy

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cupy-cuda12x 12.2.0 requires numpy<1.27,>=1.20, but you have numpy 2.0.2 which is incompatible.
gensim 4.3.3 requires numpy<2.0,>=1.18.5, but you have numpy 2.0.2 which is incompatible.
langchain 0.3.11 requires numpy<2,>=1.22.4; python_version < "3.12", but you have numpy 2.0.2 which is incompatible.
langchain-community 0.3.11 requires numpy<2,>=1.22.4; python_version < "3.12", but you have numpy 2.0.2 which is incompatible.
matplotlib 3.8.0 requires numpy<2,>=1.21, but you have numpy 2.0.2 which is incompatible.
pytensor 2.26.4 requires numpy<2,>=1.17.0, but you have numpy 2.0.2 which is incompatible.
tensorflow 2.17.1 requires numpy<2.0.0,>=1.23.5; python_version <= "3.11", but you have numpy 2.0.2 which is incompatible.

In [None]:
from langchain_text_splitters import SpacyTextSplitter

text_splitter = SpacyTextSplitter(chunk_size=1000)

texts = []
for page in pages:
  page_texts = page.page_content
  texts.extend(text_splitter.split_text(page_texts))

# Display the text
for text in texts[:5]:  # Display first 5 chunks
    print(text)



i
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
                        Draft
National
 
Artificial Intelligence Policy
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Government of Pakistan
Ministry of Information Technology & Telecommunication
https://moitt.gov.pk
ii 
 
Acknowledgments 
The Government of Pakistan, Ministry of IT & Telecom , pays its gratitude to all the officials  and consultants, 
particularly RSM Pakistan and GlowBug Technologies (Pvt.)

Ltd. , facilitators, developers, and stakeholders who 
rigorously and relentlessly participated in the revi ew, drafting, harmonizing, and ratification of the National 
Artificial Intelligence Policy – 2022, helping the Ministry with an all -inclusive user -centric, evidence -based, 
forward-looking, and agile policy framework for enabling Pakistan towards a digital economy and society.
iii 
 
Table of Contents 
1 Executive Summary ...................................................................................................

# Sentence Transformers

Importing libraries

In [None]:
from langchain_text_splitters import SentenceTransformersTokenTextSplitter


In [None]:
# Initialize the SentenceTransformersTokenTextSplitter with no overlap
splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)

# The tokenizer adds a start and a stop token to every encoded text
count_start_and_stop_tokens = 2

# Token count of each page, excluding the start and stop tokens
tokens = []

# Process each page in the input
for page in pages:
    # Use the page text itself; str(page) would also include the metadata repr
    page_text = page.page_content

    # Calculate the token count for the page
    text_token_count = splitter.count_tokens(text=page_text) - count_start_and_stop_tokens
    tokens.append(text_token_count)

    if text_token_count <= 0:
        # Skip empty pages to avoid dividing by zero below
        continue

    # Repeat the page enough times that it no longer fits in a single chunk
    token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1
    text_to_split = page_text * token_multiplier
    print(f"Tokens in text to split: {splitter.count_tokens(text=text_to_split)}")

    # Split the extended text into chunks
    text_chunks = splitter.split_text(text=text_to_split)

    # Output a sample chunk for verification
    print("Sample Chunk:", text_chunks[1] if len(text_chunks) > 1 else "Only one chunk available.")

# Output all token counts
print("Token counts per page:", tokens)


  from tqdm.autonotebook import tqdm, trange
  if dtype.type == np.bool:


TypeError: 

In [None]:
dir(splitter)

In [None]:
# Initialize the SentenceTransformersTokenTextSplitter with no overlap
splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)
splitter.maximum_tokens_per_chunk

In [None]:
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)


A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.0.2 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.

Traceback (most recent call last):  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/colab_kernel_launcher.py", line 37, in <module>
    ColabKernelApp.launch_instance()
  File "/usr/local/lib/python3.10/dist-packages/traitlets/config/application.py", line 992, in launch_instance
    app.start()
  File "/usr/local/lib/python3.10/dist-packages/ipykernel/ke

AttributeError: _ARRAY_API not found

RuntimeError: Failed to import transformers.integrations.integration_utils because of the following error (look up to see its traceback):
Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
initialization of _pywrap_checkpoint_reader raised unreported exception

In [None]:
print(splitter.maximum_tokens_per_chunk)

In [None]:
text = "hamza"

count_start_and_stop_tokens = 2
text_token_count = splitter.count_tokens(text=text) - count_start_and_stop_tokens
print(text_token_count)

In [None]:
print(splitter.count_tokens(text=text))

In [None]:
token_multiplier = splitter.maximum_tokens_per_chunk // text_token_count + 1

print(token_multiplier)

# `text_to_split` does not fit in a single chunk
text_to_split = text * token_multiplier

print(text_to_split)

print(f"tokens in text to split: {splitter.count_tokens(text=text_to_split)}")

In [None]:
384 // 3
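The arithmetic behind `token_multiplier` can be checked by hand. The sketch below takes 384 and 3 from the cell above. It assumes 384 is the chunk limit of the loaded model and 3 is the token count of the sample text, and that the token count grows linearly when the text is repeated, which a real tokenizer need not honour exactly.

```python
maximum_tokens_per_chunk = 384  # assumed chunk limit; depends on the model
text_token_count = 3            # assumed token count of the sample text

# Integer division gives the number of repetitions that still fit;
# adding 1 pushes the repeated text just past the limit.
token_multiplier = maximum_tokens_per_chunk // text_token_count + 1
repeated_token_count = text_token_count * token_multiplier

print(token_multiplier)
print(repeated_token_count)
print(repeated_token_count > maximum_tokens_per_chunk)
```

Under these assumptions, 129 repetitions give 387 tokens, which exceeds the limit, so `split_text` has to return more than one chunk.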