# Text Splitters

Text splitters divide long pieces of text into smaller chunks, ideally along semantically meaningful boundaries, so that each chunk is small enough to process while related content stays together. This notebook gives an overview of several text splitters and their customization options. The following text is used throughout the examples:

In [1]:
text = """
Large Language Models (LLMs) are a class of deep learning models designed to process and understand vast amounts of natural language data. 
They are built on neural network architectures, particularly the transformer architecture, which allows them to capture complex language 
patterns and relationships between words or phrases in large-scale text datasets. 

As a matter of fact, LLM can also be understood as variants of transformer.  The transformer architecture relies on a mechanism called self-attention,
which allows the model to weigh the importance of different words or phrases in a given context. This has proven to be particularly effective in capturing 
long-range dependencies and understanding the nuances of natural language. 

Recall that the transformer architecture represents the neural network model for natural language processing tasks based on encoder-decoder architecture, 
which was introduced in the paper “Attention Is All You Need” by Vaswani et al. in 2017. The key component of the transformer architecture is 
the self-attention mechanism, which enables the model to attend to different parts of the input sequence to compute a representation for each position. 
The transformer consists of two main components: the encoder network and the decoder network. 
The encoder network takes an input sequence and produces a sequence of hidden states, while the decoder network takes a target sequence and uses the encoder’s 
output to generate a sequence of predictions. Both the encoder and decoder are composed of multiple layers of self-attention and feedforward neural networks. 
The picture given below represents the original transformer architecture.
"""

The Character Text Splitter splits text on a single separator (by default a blank line, `"\n\n"`) and then merges the resulting pieces into chunks of at most `chunk_size` characters. It is the simplest splitter and is useful when working with models or systems that have character-based limits. Note that a piece containing no separator is never cut further, so chunks can exceed `chunk_size`; this is what the warnings in the output below indicate.

In [2]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=0)
texts = text_splitter.split_text(text)
texts

Created a chunk of size 361, which is longer than the specified 100
Created a chunk of size 382, which is longer than the specified 100


['Large Language Models (LLMs) are a class of deep learning models designed to process and understand vast amounts of natural language data. \nThey are built on neural network architectures, particularly the transformer architecture, which allows them to capture complex language \npatterns and relationships between words or phrases in large-scale text datasets.',
 'As a matter of fact, LLM can also be understood as variants of transformer.  The transformer architecture relies on a mechanism called self-attention,\nwhich allows the model to weigh the importance of different words or phrases in a given context. This has proven to be particularly effective in capturing \nlong-range dependencies and understanding the nuances of natural language.',
 'Recall that the transformer architecture represents the neural network model for natural language processing tasks based on encoder-decoder architecture, \nwhich was introduced in the paper “Attention Is All You Need” by Vaswani et al. in 2017.

In [3]:
[len(chunk) for chunk in texts]

[359, 381, 938]
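
To make the oversized chunks above easier to reason about, here is a minimal pure-Python sketch of the split-then-merge idea. It is a simplified illustration, not the library's implementation; the function name `split_on_separator` and the sample string are hypothetical.

```python
def split_on_separator(text, separator="\n\n", chunk_size=100):
    """Split on one separator, then greedily merge pieces up to chunk_size."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece is never cut further, so it may exceed chunk_size.
            current = piece
    if current:
        chunks.append(current)
    return chunks


# One long "paragraph" followed by two short ones (illustrative data).
sample = "a" * 150 + "\n\n" + "b" * 30 + "\n\n" + "c" * 40
chunks = split_on_separator(sample, chunk_size=100)
lengths = [len(c) for c in chunks]
print(lengths)
```

The first paragraph has no blank line inside it, so it stays whole even though it is longer than the limit, while the two short paragraphs are merged into one chunk.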

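The PythonCodeTextSplitter is designed for Python source code. It splits on code-specific boundaries such as class and function definitions before falling back to generic separators like blank lines. The snippet below is short enough to fit within the default chunk size, so it comes back as a single chunk.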


In [4]:
from langchain.text_splitter import PythonCodeTextSplitter

PYTHON_CODE = """
# Proprietary LLM from e.g. OpenAI
# pip install openai
from langchain.llms import OpenAI
llm = OpenAI(model_name="text-davinci-003")

# Alternatively, open-source LLM hosted on Hugging Face
# pip install huggingface_hub
from langchain import HuggingFaceHub
llm = HuggingFaceHub(repo_id = "google/flan-t5-xl")

# The LLM takes a prompt as an input and outputs a completion
prompt = "Alice has a parrot. What animal is Alice's pet?"
completion = llm(prompt)
"""


text_splitter = PythonCodeTextSplitter()
texts = text_splitter.split_text(PYTHON_CODE)

texts

['# Proprietary LLM from e.g. OpenAI\n# pip install openai\nfrom langchain.llms import OpenAI\nllm = OpenAI(model_name="text-davinci-003")\n\n# Alternatively, open-source LLM hosted on Hugging Face\n# pip install huggingface_hub\nfrom langchain import HuggingFaceHub\nllm = HuggingFaceHub(repo_id = "google/flan-t5-xl")\n\n# The LLM takes a prompt as an input and outputs a completion\nprompt = "Alice has a parrot. What animal is Alice\'s pet?"\ncompletion = llm(prompt)']
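
Because the snippet above contains no functions or classes and fits in one chunk, the effect of code-aware splitting is not visible in the output. The sketch below illustrates the general idea of cutting at top-level `def` and `class` boundaries with a regular expression. It is an assumption-laden stand-in, not the splitter's actual separator logic; `split_python_source` and the sample source are hypothetical.

```python
import re


def split_python_source(source):
    """Split before each top-level 'def' or 'class' line, keeping the keyword with its block."""
    # The lookahead keeps 'def'/'class' at the start of the following part.
    parts = re.split(r"\n(?=(?:def|class) )", source)
    return [p.strip("\n") for p in parts if p.strip()]


source = '''import math

def area(r):
    return math.pi * r * r

class Circle:
    def __init__(self, r):
        self.r = r
'''

blocks = split_python_source(source)
for block in blocks:
    print(repr(block))
```

The indented `def __init__` is not treated as a boundary, so the method stays inside the `Circle` block.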

NLTK (Natural Language Toolkit) is a popular library for natural language processing that provides tokenizers among other tools. The NLTK Text Splitter uses NLTK's sentence tokenizer to break the text into sentences and then merges them, separated by blank lines, into chunks up to the chunk size. Use it when chunk boundaries should fall on sentence boundaries rather than at arbitrary character positions.

In [5]:
import nltk
from langchain.text_splitter import NLTKTextSplitter

nltk.download('punkt')

text_splitter = NLTKTextSplitter()
texts = text_splitter.split_text(text)
texts

[nltk_data] Downloading package punkt to /Users/award40/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['Large Language Models (LLMs) are a class of deep learning models designed to process and understand vast amounts of natural language data.\n\nThey are built on neural network architectures, particularly the transformer architecture, which allows them to capture complex language \npatterns and relationships between words or phrases in large-scale text datasets.\n\nAs a matter of fact, LLM can also be understood as variants of transformer.\n\nThe transformer architecture relies on a mechanism called self-attention,\nwhich allows the model to weigh the importance of different words or phrases in a given context.\n\nThis has proven to be particularly effective in capturing \nlong-range dependencies and understanding the nuances of natural language.\n\nRecall that the transformer architecture represents the neural network model for natural language processing tasks based on encoder-decoder architecture, \nwhich was introduced in the paper “Attention Is All You Need” by Vaswani et al.\n\ni
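
The sentence-then-merge behaviour seen above can be sketched without NLTK. The example below swaps in a naive regular expression for the sentence tokenizer, which is far less robust than NLTK's; the function names and the sample text are hypothetical, and the code is only meant to show the shape of the approach.

```python
import re


def naive_sentences(text):
    """Rough stand-in for a sentence tokenizer: split after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def merge_sentences(sentences, chunk_size, joiner="\n\n"):
    """Greedily join sentences with `joiner` until adding one more would exceed chunk_size."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = sentence if not current else current + joiner + sentence
        if current and len(candidate) > chunk_size:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "LLMs process text. They use transformers. Self-attention is the key idea."
sentences = naive_sentences(sample)
chunks = merge_sentences(sentences, chunk_size=45)
print(chunks)
```

Like the output above, a naive rule of this kind also breaks after abbreviations such as "et al.", which is one reason to inspect the chunks before relying on them.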

The Recursive Character Text Splitter splits text using an ordered list of separators: by default paragraph breaks, then line breaks, then spaces, then individual characters. It tries each separator in turn and only moves to a finer one for pieces that are still longer than `chunk_size`, so paragraphs, sentences, and words stay together where possible. Use it when the semantic structure of the text matters, for example in summarization or topic modeling. The `chunk_overlap` argument repeats up to that many characters from the end of one chunk at the start of the next, which is visible in the output below.

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
texts = text_splitter.split_text(text)
texts

['Large Language Models (LLMs) are a class of deep learning models designed to process and understand',
 'and understand vast amounts of natural language data.',
 'They are built on neural network architectures, particularly the transformer architecture, which',
 'architecture, which allows them to capture complex language',
 'patterns and relationships between words or phrases in large-scale text datasets.',
 'As a matter of fact, LLM can also be understood as variants of transformer.  The transformer',
 'The transformer architecture relies on a mechanism called self-attention,',
 'which allows the model to weigh the importance of different words or phrases in a given context.',
 'in a given context. This has proven to be particularly effective in capturing',
 'long-range dependencies and understanding the nuances of natural language.',
 'Recall that the transformer architecture represents the neural network model for natural language',
 'natural language processing tasks based on enc
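
The recursive strategy can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the library's code: it ignores `chunk_overlap` entirely, and the function name `recursive_split` and the sample text are hypothetical.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split with the coarsest separator first, recursing with finer ones on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        # The candidate is too long: flush what has been collected so far.
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long, so retry it with the next separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph is short.\n\n"
    "Second paragraph is a bit longer and will need to be split on spaces."
)
chunks = recursive_split(sample, chunk_size=40)
print(chunks)
```

The short first paragraph survives intact, while the second one falls through to the space separator and is cut at word boundaries.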

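The same splitter can be configured with language-specific separators through `from_language`, shown here for Python with a chunk size of 200 characters.

In [7]: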
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language


PYTHON_CODE = """
# Proprietary LLM from e.g. OpenAI
# pip install openai
from langchain.llms import OpenAI
llm = OpenAI(model_name="text-davinci-003")

# Alternatively, open-source LLM hosted on Hugging Face
# pip install huggingface_hub
from langchain import HuggingFaceHub
llm = HuggingFaceHub(repo_id = "google/flan-t5-xl")

# The LLM takes a prompt as an input and outputs a completion
prompt = "Alice has a parrot. What animal is Alice's pet?"
completion = llm(prompt)
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=200, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

[Document(page_content='# Proprietary LLM from e.g. OpenAI\n# pip install openai\nfrom langchain.llms import OpenAI\nllm = OpenAI(model_name="text-davinci-003")', metadata={}),
 Document(page_content='# Alternatively, open-source LLM hosted on Hugging Face\n# pip install huggingface_hub\nfrom langchain import HuggingFaceHub\nllm = HuggingFaceHub(repo_id = "google/flan-t5-xl")', metadata={}),
 Document(page_content='# The LLM takes a prompt as an input and outputs a completion\nprompt = "Alice has a parrot. What animal is Alice\'s pet?"\ncompletion = llm(prompt)', metadata={})]

spaCy is a popular open-source library for advanced natural language processing. The spaCy Text Splitter uses a spaCy pipeline to segment the text into sentences and then merges them into chunks up to a specified chunk size.

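🔴 This cell is left commented out because loading the spaCy module kept raising an error in this environment. Two things are worth checking: the `en_core_web_sm` model must be installed, and `SpacyTextSplitter` takes the pipeline name through its `pipeline` argument rather than a loaded `nlp` object as its first positional argument.

In [8]:
# Requires: pip install spacy, then: python -m spacy download en_core_web_sm
# from langchain.text_splitter import SpacyTextSplitter

# The splitter loads the pipeline itself from the given name.
# text_splitter = SpacyTextSplitter(pipeline="en_core_web_sm", chunk_size=1000)
# texts = text_splitter.split_text(text)
# texts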

Hugging Face is a popular platform for natural language processing, providing a wide range of pre-trained models and tokenizers. With `from_huggingface_tokenizer`, the splitter still splits on its character separator, but measures chunk size in tokens as counted by the given tokenizer, such as GPT2TokenizerFast. Use this when chunks must fit the token budget of a transformer-based model like GPT-2 or BERT. As in the first example, pieces that contain no separator are not cut further, so the warnings below report chunks of 67 and 74 tokens against a limit of 60.

In [18]:
from transformers import GPT2TokenizerFast
from langchain.text_splitter import CharacterTextSplitter

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=60, chunk_overlap=0)
texts = text_splitter.split_text(text)
texts

Created a chunk of size 67, which is longer than the specified 60
Created a chunk of size 74, which is longer than the specified 60


['Large Language Models (LLMs) are a class of deep learning models designed to process and understand vast amounts of natural language data. \nThey are built on neural network architectures, particularly the transformer architecture, which allows them to capture complex language \npatterns and relationships between words or phrases in large-scale text datasets.',
 'As a matter of fact, LLM can also be understood as variants of transformer.  The transformer architecture relies on a mechanism called self-attention,\nwhich allows the model to weigh the importance of different words or phrases in a given context. This has proven to be particularly effective in capturing \nlong-range dependencies and understanding the nuances of natural language.',
 'Recall that the transformer architecture represents the neural network model for natural language processing tasks based on encoder-decoder architecture, \nwhich was introduced in the paper “Attention Is All You Need” by Vaswani et al. in 2017.
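
The difference between counting characters and counting tokens comes down to the length function used while merging. The sketch below shows this with a whitespace word count standing in for a real tokenizer; in an actual setup one might use something like `len(tokenizer.encode(text))` instead. The helper names and sample pieces are hypothetical.

```python
def merge_with_length(pieces, chunk_size, length_function=len, joiner="\n\n"):
    """Greedily merge pieces, measuring size with a pluggable length function."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + joiner + piece
        if current and length_function(candidate) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def count_words(text):
    # Stand-in for a tokenizer-based count; real tokenizers usually yield more tokens than words.
    return len(text.split())


pieces = ["one two three", "four five", "six seven eight nine"]

# The same numeric limit means very different things depending on the unit.
by_chars = merge_with_length(pieces, chunk_size=6)
by_tokens = merge_with_length(pieces, chunk_size=6, length_function=count_words)
print(len(by_chars), len(by_tokens))
```

With a limit of 6 characters nothing can be merged, whereas a limit of 6 word-tokens lets the first two pieces share a chunk.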