# TokenTextSplitter

- Author: [Ilgyun Jeong](https://github.com/johnny9210)
- Design: 
- Peer Review:
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain-academy/blob/main/module-4/sub-graph.ipynb) [![Open in LangChain Academy](https://cdn.prod.website-files.com/65b8cd72835ceeacd4449a53/66e9eba12c7b7688aa3dbb5e_LCA-badge-green.svg)](https://academy.langchain.com/courses/take/intro-to-langgraph/lessons/58239937-lesson-2-sub-graphs)

## Overview

Language models operate within token limits, making it crucial to manage text within these constraints. 

TokenTextSplitter serves as an effective tool for segmenting text into manageable chunks based on token count, ensuring compliance with these limitations.

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Basic Usage of tiktoken](#basic-usage-of-tiktoken)
- [Basic Usage of TokenTextSplitter](#basic-usage-of-tokentextsplitter)
- [Basic Usage of spaCy](#basic-usage-of-spacy)
- [Basic Usage of SentenceTransformers](#basic-usage-of-sentencetransformers)
- [Basic Usage of NLTK](#basic-usage-of-nltk)
- [Basic Usage of KoNLPy](#basic-usage-of-konlpy)
- [Basic Usage of Hugging Face tokenizer](#basic-usage-of-hugging-face-tokenizer)

### References

- [LangChain TokenTextSplitter](https://python.langchain.com/api_reference/text_splitters/base/langchain_text_splitters.base.TokenTextSplitter.html)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for the tutorials.
- See the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [None]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_text_splitters",
        "tiktoken",
        "spacy",
        "sentence-transformers",
        "nltk",
        "konlpy",
    ]
)

In [None]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "TokenTextSplitter",
    }
)

Alternatively, you can set `OPENAI_API_KEY` in a `.env` file and load it.

**[Note]** This is not necessary if you have already set `OPENAI_API_KEY` in the previous steps.

In [None]:
from dotenv import load_dotenv

load_dotenv()

## Basic Usage of tiktoken

`tiktoken` is a fast BPE tokenizer created by OpenAI.

- Open the file `./data/(eng)appendix-keywords.txt` and read its contents.
- Store the contents in the `file` variable.

In [5]:
# Open the file data/(eng)appendix-keywords.txt and create a file object named f.
with open("./data/(eng)appendix-keywords.txt") as f:
    file = (
        f.read()
    )  # Read the contents of the file and store them in the file variable.

Print a portion of the content read from the file.

In [None]:
# Print a portion of the content read from the file.
print(file[:500])

Use the `CharacterTextSplitter` to split the text.

- Initialize the text splitter with the `from_tiktoken_encoder` method, which measures chunk length with the `tiktoken` encoder.

In [7]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    # Set the chunk size to 300.
    chunk_size=300,
    # Ensure there is no overlap between chunks.
    chunk_overlap=0,
)
# Split the file text into chunks.
texts = text_splitter.split_text(file)

Print the number of divided chunks.

In [None]:
print(len(texts))  # Output the number of divided chunks.

Print the first element of the texts list.

In [None]:
# Print the first element of the texts list.
print(texts[0])

Reference
- When using `CharacterTextSplitter.from_tiktoken_encoder`, the text is split only by the `CharacterTextSplitter` separator; the `tiktoken` tokenizer is used only to measure and merge the resulting pieces. A piece that contains no separator is kept whole, so a chunk can exceed the chunk size as measured by `tiktoken`.
- When using `RecursiveCharacterTextSplitter.from_tiktoken_encoder`, any piece that exceeds the chunk size is split again recursively so that chunks fit within the chunk size. Alternatively, you can use `TokenTextSplitter`, which works with `tiktoken` directly and guarantees that each chunk is smaller than the chunk size.
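
The difference between the two approaches can be illustrated with a minimal, library-free sketch. This is not LangChain's implementation: `count_tokens` (a whitespace word counter standing in for `tiktoken`), `merge_pieces`, `split_once`, and `split_recursive` are hypothetical names introduced only for this illustration.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Assumption: one token per whitespace-separated word (a stand-in for tiktoken).
    return len(text.split())


def merge_pieces(pieces: List[str], sep: str, chunk_size: int,
                 length: Callable[[str], int] = count_tokens) -> List[str]:
    # Greedily join pieces while the joined chunk stays within chunk_size.
    chunks: List[str] = []
    current: List[str] = []
    for piece in pieces:
        if current and length(sep.join(current + [piece])) > chunk_size:
            chunks.append(sep.join(current))
            current = []
        current.append(piece)  # an oversized single piece is kept whole
    if current:
        chunks.append(sep.join(current))
    return chunks


def split_once(text: str, sep: str, chunk_size: int) -> List[str]:
    # Split on one separator only, then merge; the tokenizer only measures.
    return merge_pieces([p for p in text.split(sep) if p], sep, chunk_size)


def split_recursive(text: str, seps: List[str], chunk_size: int) -> List[str]:
    # Re-split any oversized piece with the next, finer separator.
    if count_tokens(text) <= chunk_size or not seps:
        return [text]
    sep, finer = seps[0], seps[1:]
    chunks: List[str] = []
    fitting: List[str] = []
    for part in text.split(sep):
        if not part:
            continue
        if count_tokens(part) <= chunk_size:
            fitting.append(part)
        else:
            chunks.extend(merge_pieces(fitting, sep, chunk_size))
            fitting = []
            chunks.extend(split_recursive(part, finer, chunk_size))
    chunks.extend(merge_pieces(fitting, sep, chunk_size))
    return chunks


text = "alpha beta\n\ngamma delta epsilon zeta eta theta\n\niota"
once = split_once(text, "\n\n", chunk_size=4)
recursive = split_recursive(text, ["\n\n", " "], chunk_size=4)
print([count_tokens(c) for c in once])       # [2, 6, 1]
print([count_tokens(c) for c in recursive])  # [2, 4, 2, 1]
```

In the single-separator case the middle paragraph has six toy tokens and survives as one oversized chunk; the recursive variant breaks it down until every chunk fits.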

## Basic Usage of TokenTextSplitter

Use the `TokenTextSplitter` class to split the text into token-based chunks.

In [None]:
from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(
    chunk_size=200,  # Set the chunk size to 200 tokens.
    chunk_overlap=0,  # Set the overlap between chunks to 0.
)

# Split the file text into chunks.
texts = text_splitter.split_text(file)
print(texts[0])  # Print the first chunk of the divided text.
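
To make the idea of token-based chunking concrete, here is a minimal sketch of the encode, window, and decode steps. It assumes a toy tokenizer in which each whitespace-separated word is one token; `build_vocab` and `split_by_tokens` are hypothetical helpers, not part of `langchain_text_splitters`, and a real BPE tokenizer such as `tiktoken` produces sub-word tokens instead.

```python
from typing import Dict, List


def build_vocab(text: str) -> Dict[str, int]:
    # Assumption: every distinct whitespace-separated word gets its own token id.
    vocab: Dict[str, int] = {}
    for word in text.split():
        vocab.setdefault(word, len(vocab))
    return vocab


def split_by_tokens(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    vocab = build_vocab(text)
    id_to_word = {token_id: word for word, token_id in vocab.items()}
    ids = [vocab[word] for word in text.split()]  # "encode"

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(ids), step):
        window = ids[start:start + chunk_size]
        chunks.append(" ".join(id_to_word[i] for i in window))  # "decode"
        if start + chunk_size >= len(ids):
            break  # the last window already reached the end of the text
    return chunks


sample = "one two three four five six seven"
no_overlap = split_by_tokens(sample, chunk_size=3, chunk_overlap=0)
with_overlap = split_by_tokens(sample, chunk_size=3, chunk_overlap=1)
print(no_overlap)    # ['one two three', 'four five six', 'seven']
print(with_overlap)  # ['one two three', 'three four five', 'five six seven']
```

Because the windows are cut over token ids rather than characters, no chunk can hold more than `chunk_size` tokens, at the cost of possibly cutting in the middle of a sentence.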

## Basic Usage of spaCy

spaCy is an open-source library for advanced natural language processing, written in Python and Cython.

The spaCy tokenizer can be used as an alternative to NLTK, which is covered later in this tutorial.

1. How the text is divided: The text is split using the spaCy tokenizer.
2. How the chunk size is measured: It is measured by the number of characters.

Download the en_core_web_sm model.

In [None]:
!python -m spacy download en_core_web_sm --quiet

Open the `appendix-keywords.txt` file and read its contents.

In [12]:
# Open the file data/(eng)appendix-keywords.txt and create a file object named f.
with open("./data/(eng)appendix-keywords.txt") as f:
    file = (
        f.read()
    )  # Read the contents of the file and store them in the file variable.

Verify by printing a portion of the content.

In [None]:
# Print a portion of the content read from the file.
print(file[:350])

Create a text splitter using the `SpacyTextSplitter` class.


In [14]:
import warnings
from langchain_text_splitters import SpacyTextSplitter

# Ignore warning messages.
warnings.filterwarnings("ignore")

# Create the SpacyTextSplitter.
text_splitter = SpacyTextSplitter(
    chunk_size=200,  # Set the chunk size to 200.
    chunk_overlap=50,  # Set the overlap between chunks to 50.
)

Use the `split_text` method of the `text_splitter` object to split the `file` text.

In [None]:
# Split the file text using the text_splitter.
texts = text_splitter.split_text(file)
print(texts[0])  # Print the first element of the split text.
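
The two points above (split by a linguistic tokenizer, measure by characters) can be sketched without spaCy. This is only an illustration under stated assumptions: `split_sentences` is a naive regex stand-in for a real sentence segmenter, and `merge_by_characters` is a hypothetical helper, not the logic `SpacyTextSplitter` actually runs.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def merge_by_characters(sentences: List[str], chunk_size: int,
                        chunk_overlap: int, sep: str = " ") -> List[str]:
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        if current and len(sep.join(current + [sentence])) > chunk_size:
            chunks.append(sep.join(current))
            # Keep only a tail of at most chunk_overlap characters that
            # still leaves room for the incoming sentence.
            while current and (
                len(sep.join(current)) > chunk_overlap
                or len(sep.join(current + [sentence])) > chunk_size
            ):
                current.pop(0)
        current.append(sentence)
    if current:
        chunks.append(sep.join(current))
    return chunks


sample = "Cats sleep. Dogs bark. Birds sing. Fish swim."
chunks = merge_by_characters(split_sentences(sample), chunk_size=25, chunk_overlap=12)
for chunk in chunks:
    print(len(chunk), chunk)
```

Each chunk is built from whole sentences, its size is checked with `len()` (characters, not tokens), and the last sentence of one chunk is repeated at the start of the next because it fits within the overlap budget.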

## Basic Usage of SentenceTransformers

`SentenceTransformersTokenTextSplitter` is a text splitter specialized for `sentence-transformers` models.

Its default behavior is to split text into chunks that fit within the token window of the sentence-transformer model being used.


In [16]:
from langchain_text_splitters import SentenceTransformersTokenTextSplitter

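# Create a sentence splitter and set the overlap between chunks to 0.
# Note: to the best of our knowledge, the per-chunk token limit of this splitter is
# controlled by `tokens_per_chunk` (defaulting to the model's maximum sequence length),
# while `chunk_size` is only forwarded to the base class.
splitter = SentenceTransformersTokenTextSplitter(chunk_size=200, chunk_overlap=0)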

Check the sample text.

In [None]:
# Open the data/(eng)appendix-keywords.txt file and create a file object named f.
with open("./data/(eng)appendix-keywords.txt") as f:
    file = f.read()  # Read the file content and store it in the variable file.

# Print a portion of the content read from the file.
print(file[:350])

The following code counts the number of tokens in the text stored in the `file` variable, excluding the start and stop tokens, and prints the result.

In [None]:
count_start_and_stop_tokens = 2  # Set the number of start and stop tokens to 2.

# Subtract the count of start and stop tokens from the total number of tokens in the text.
text_token_count = splitter.count_tokens(text=file) - count_start_and_stop_tokens
print(text_token_count)  # Print the calculated number of tokens in the text.
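
The subtraction of 2 can be illustrated with a toy encoder. This sketch assumes a tokenizer that wraps every encoded text in one start token and one stop token; `toy_encode`, `START_ID`, and `STOP_ID` are hypothetical names, and the ids are made up.

```python
from typing import List

START_ID, STOP_ID = 0, 1  # hypothetical ids for the start and stop tokens


def toy_encode(text: str) -> List[int]:
    # Assumption: one id per whitespace-separated word, wrapped in start/stop ids.
    word_ids = [2 + i for i, _ in enumerate(text.split())]
    return [START_ID, *word_ids, STOP_ID]


def count_tokens(text: str) -> int:
    # Counts every id, including the two special tokens.
    return len(toy_encode(text))


sample = "token limits matter"
total = count_tokens(sample)
content_only = total - 2  # remove the start and stop tokens
print(total, content_only)  # 5 3
```

The raw count includes the two special tokens, so subtracting them gives the number of tokens contributed by the text itself.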

Use the `splitter.split_text()` method to split the text stored in the `file` variable into chunks.

In [19]:
text_chunks = splitter.split_text(text=file)  # Split the text into chunks.

Print one of the resulting chunks.


In [None]:
# Print the chunk at index 1.
print(text_chunks[1])  # Print the second chunk from the divided text chunks.

## Basic Usage of NLTK

The Natural Language Toolkit (NLTK) is a library and a collection of programs for English natural language processing (NLP), written in Python. It supports various NLP tasks such as text preprocessing, tokenization, morphological analysis, and part-of-speech tagging.

Instead of simply splitting on `"\n\n"`, you can split text with NLTK tokenizers.

1. Text splitting method: The text is split using the NLTK tokenizer.
2. Chunk size measurement: The size is measured by the number of characters.

Before using NLTK, you need to run `nltk.download('punkt_tab')`.

This command downloads the data files that NLTK requires for tokenizing text.

Specifically, `punkt_tab` contains the pretrained data for the Punkt tokenizer, which splits text into sentences for multiple languages, including English.

In [None]:
import nltk

nltk.download("punkt_tab")

Verify the sample text.


In [None]:
# Open the data/(eng)appendix-keywords.txt file and create a file object named f.
with open("./data/(eng)appendix-keywords.txt") as f:
    file = (
        f.read()
    )  # Read the contents of the file and store them in the file variable.

# Print a portion of the content read from the file.
print(file[:350])

- Create a text splitter using the `NLTKTextSplitter` class.
- Set the `chunk_size` parameter to 300 to split the text into chunks of about 300 characters.

In [22]:
from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(
    chunk_size=300,  # Set the chunk size to 300 characters.
    chunk_overlap=0,  # Set the overlap between chunks to 0.
)

Use the `split_text` method of the `text_splitter` object to split the `file` text.

In [None]:
# Split the file text using the text_splitter.
texts = text_splitter.split_text(file)
print(texts[0])  # Print the first element of the split text.
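
To see why sentence-level splitting can help compared with splitting on `"\n\n"` alone, consider the following minimal sketch. It does not use NLTK: `split_sentences` is a naive regex stand-in for a trained sentence tokenizer, and both helper names are hypothetical.

```python
import re
from typing import List


def split_paragraphs(text: str) -> List[str]:
    # Split only where a blank line separates paragraphs.
    return [p for p in text.split("\n\n") if p]


def split_sentences(text: str) -> List[str]:
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


sample = "First point. Second point.\n\nThird point."
limit = 15  # chunk size measured in characters

paragraphs = split_paragraphs(sample)
sentences = split_sentences(sample)
print([len(p) for p in paragraphs])  # [26, 12]
print([len(s) for s in sentences])   # [12, 13, 12]

oversized_paragraphs = [p for p in paragraphs if len(p) > limit]
oversized_sentences = [s for s in sentences if len(s) > limit]
```

With a 15-character limit, the first paragraph is too long to fit in a chunk, while every sentence fits. Finer pieces give the splitter more freedom to pack chunks close to the requested size.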

## Basic Usage of KoNLPy

KoNLPy (Korean NLP in Python) is a Python package for Korean Natural Language Processing (NLP).

### Tokenization

Tokenization is the process of dividing text into smaller, more manageable units called tokens.

These tokens often represent meaningful elements such as words, phrases, symbols, or other components crucial for further processing and analysis.

In languages like English, tokenization typically involves separating words based on spaces and punctuation.

The effectiveness of tokenization largely depends on the tokenizer's understanding of the language structure, ensuring the generation of meaningful tokens.

Tokenizers designed for English do not model the distinct linguistic structure of other languages such as Korean, so they are poorly suited to Korean text processing.
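
The following minimal sketch illustrates the point with a space-and-punctuation tokenizer. `simple_tokenize` is a hypothetical helper written for this example; it is not how KoNLPy or any production tokenizer works.

```python
import re
from typing import List


def simple_tokenize(text: str) -> List[str]:
    # Runs of word characters become tokens; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)


english = simple_tokenize("I go to school.")
korean = simple_tokenize("나는 학교에 갑니다.")

print(english)  # ['I', 'go', 'to', 'school', '.']
print(korean)   # ['나는', '학교에', '갑니다', '.']
```

For the English sentence, the result lines up with individual words. For the Korean sentence, each space-delimited unit still fuses a stem with a particle or ending (for example, `학교에` combines `학교`, "school", with the particle `에`, "to"), which is why a morphological analyzer is needed to obtain meaningful tokens.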

### Korean Tokenization Using KoNLPy’s Kkma Analyzer

For Korean text, KoNLPy includes a morphological analyzer called Kkma (Korean Knowledge Morpheme Analyzer).

Kkma provides detailed morphological analysis for Korean text.
It breaks sentences into words and further decomposes words into their morphemes while identifying the part of speech for each token.
It can also split text blocks into individual sentences, which is particularly useful for processing lengthy texts.

### Considerations When Using Kkma

Kkma is known for its detailed analysis. However, this precision can affect processing speed.
Therefore, Kkma is best suited for applications that prioritize analytical depth over rapid text processing.

More broadly, KoNLPy offers features such as morphological analysis, part-of-speech tagging, and syntactic parsing.

Verify the sample text.

In [None]:
# Open the data/(eng)appendix-keywords.txt file and create a file object named f.
with open("./data/(eng)appendix-keywords.txt") as f:
    file = f.read()  # Read the file content and store it in the variable file.

# Print a portion of the content read from the file.
print(file[:350])

The following example creates a `KonlpyTextSplitter`, which is designed for splitting Korean text.

In [25]:
from langchain_text_splitters import KonlpyTextSplitter

# Create a text splitter object using KonlpyTextSplitter.
text_splitter = KonlpyTextSplitter()

Use the `text_splitter` to split the `file` content into sentences.

In [None]:
texts = text_splitter.split_text(file)  # Split the file content into sentences.
print(texts[0])  # Print the first sentence from the divided text.

## Basic Usage of Hugging Face tokenizer

Hugging Face provides various tokenizers.

This example uses one of Hugging Face's tokenizers, `GPT2TokenizerFast`, to calculate the token length of text.

The text splitting approach is as follows:

- The text is split on the character separator passed to the splitter.

The chunk size measurement is determined as follows:

- It is based on the number of tokens calculated by the Hugging Face tokenizer.

The setup steps are as follows:

- Create a tokenizer object, `hf_tokenizer`, using the `GPT2TokenizerFast` class.
- Call the `from_pretrained` method to load the pre-trained `"gpt2"` tokenizer.

In [27]:
from transformers import GPT2TokenizerFast

# Load the GPT-2 tokenizer.
hf_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

In [None]:
# Open the data/(eng)appendix-keywords.txt file and create a file object named f.
with open("./data/(eng)appendix-keywords.txt") as f:
    file = f.read()  # Read the file content and store it in the variable file.

# Print a portion of the content read from the file.
print(file[:350])

The `from_huggingface_tokenizer` method initializes a text splitter with a Hugging Face tokenizer (`hf_tokenizer`).

In [29]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    # Use the Hugging Face tokenizer to create a CharacterTextSplitter object.
    hf_tokenizer,
    chunk_size=300,
    chunk_overlap=50,
)
# Split the file text into chunks.
texts = text_splitter.split_text(file)

Check the split result by printing the second element.

In [None]:
print(texts[1])  # Print the second element (index 1) of the texts list.