In [1]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [15]:
toy_text = "Welcome back to Code with Prince, I love coding"

In [22]:
toy_text_two = "thisissomegibirishtextjljlajflajlfsjalfjsl"

In [16]:
chunk_size = 20
chunk_overlap = 3

## Recursive Splitter

In [17]:
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

In [18]:
recursive_splitter.split_text(toy_text)

['Welcome back to Code', 'with Prince, I love', 'coding']

You can see the overlap in the output below: `hte` ends the first chunk and starts the second, because `chunk_overlap = 3`. Note that spaces also count as characters when splitting text.

In [23]:
recursive_splitter.split_text(toy_text_two)

['thisissomegibirishte', 'htextjljlajflajlfsja', 'sjalfjsl']
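
The overlap in this output can be reproduced with a plain sliding window. This is only a simplified sketch, not LangChain's implementation: `sliding_window_chunks` is a hypothetical helper, and it mirrors the splitter only when the text has no separators to split on, as with `toy_text_two`.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


toy_text_two = "thisissomegibirishtextjljlajflajlfsjalfjsl"
chunks = sliding_window_chunks(toy_text_two, chunk_size=20, chunk_overlap=3)
print(chunks)
# ['thisissomegibirishte', 'htextjljlajflajlfsja', 'sjalfjsl']
```

Each window starts 17 characters after the previous one (20 minus 3), which is why the last three characters of one chunk reappear at the start of the next.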

## Character Splitter

In [46]:
character_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

In [47]:
character_splitter.split_text(toy_text)

['Welcome back to Code with Prince, I love coding']

You can see the text is not split. `CharacterTextSplitter` splits on a single separator, which defaults to a blank line (`"\n\n"`), and `toy_text` does not contain one.

In [50]:
character_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = "\n"
)

In [53]:
toy_text_three = """Welcome back to Code with Prince, 
I love coding"""

In [54]:
character_splitter.split_text(toy_text_three)

Created a chunk of size 34, which is longer than the specified 20


['Welcome back to Code with Prince,', 'I love coding']
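
The warning about a chunk of size 34 appears because a single-separator splitter cannot break a piece any further once the separator runs out. Here is a minimal sketch of that idea, assuming a greedy merge and ignoring overlap; `split_on_separator` is a hypothetical helper, not LangChain's code.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on one separator, then greedily merge pieces up to chunk_size."""
    pieces = [p.strip() for p in text.split(separator) if p.strip()]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece longer than chunk_size is kept whole:
            # there is no smaller separator to fall back on.
            current = piece
    if current:
        chunks.append(current)
    return chunks


toy_text = "Welcome back to Code with Prince, I love coding"
toy_text_three = "Welcome back to Code with Prince, \nI love coding"

no_split = split_on_separator(toy_text, "\n\n", chunk_size=20)
two_lines = split_on_separator(toy_text_three, "\n", chunk_size=20)
print(no_split)
print(two_lines)
# ['Welcome back to Code with Prince, I love coding']
# ['Welcome back to Code with Prince,', 'I love coding']
```

Both results contain chunks longer than 20 characters, which is the situation the warning reports.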

## Application On Larger Text

In [66]:
larget_toy_text = """Welcome back to "Code with Prince"! In this video, the fourth installment of \
our TypeScript tutorial series, we'll dive into the world of arrays in TypeScript. Arrays are essential \
data structures that allow you to store and manipulate collections of values efficiently.\n\n
Welcome back to "Code with Prince"! In this video, the second installment of our TypeScript tutorial \
series, we'll dive deep into the world of data types in TypeScript. Understanding data types is crucial \
for writing robust and maintainable code, and TypeScript provides powerful mechanisms to enforce type \
safety in your applications."""

In [67]:
recursive_splitter.split_text(larget_toy_text)

['Welcome back to "Code with Prince"! In this video, the fourth installment of our TypeScript tutorial series, we\'ll dive into the world of arrays in TypeScript. Arrays are essential data structures that allow you to store and manipulate collections of values efficiently.',
 'Welcome back to "Code with Prince"! In this video, the second installment of our TypeScript tutorial series, we\'ll dive deep into the world of data types in TypeScript',
 '. Understanding data types is crucial for writing robust and maintainable code, and TypeScript provides powerful mechanisms to enforce type safety in your applications.']

In [68]:
chunk_size=300
chunk_overlap=3

In [69]:
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=["\n\n", "\n", r"\. ", " ", ""]  # raw string avoids an invalid-escape warning
)

In [70]:
splits = recursive_splitter.split_text(larget_toy_text)

In [71]:
splits

['Welcome back to "Code with Prince"! In this video, the fourth installment of our TypeScript tutorial series, we\'ll dive into the world of arrays in TypeScript. Arrays are essential data structures that allow you to store and manipulate collections of values efficiently.',
 'Welcome back to "Code with Prince"! In this video, the second installment of our TypeScript tutorial series, we\'ll dive deep into the world of data types in TypeScript',
 '. Understanding data types is crucial for writing robust and maintainable code, and TypeScript provides powerful mechanisms to enforce type safety in your applications.']

In [72]:
len(splits)

3

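Notice that the text is now split on sentences, but the `.` is in the wrong place: it starts the next chunk instead of ending the previous one. This happens because the `\. ` separator is kept and attached to the start of the split that follows it. Let's fix this with a lookbehind regex, which matches the position right after `. ` without consuming it.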

In [73]:
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]  # lookbehind: split right after ". "
)

In [74]:
splits = recursive_splitter.split_text(larget_toy_text)

In [75]:
splits

['Welcome back to "Code with Prince"! In this video, the fourth installment of our TypeScript tutorial series, we\'ll dive into the world of arrays in TypeScript. Arrays are essential data structures that allow you to store and manipulate collections of values efficiently.',
 'Welcome back to "Code with Prince"! In this video, the second installment of our TypeScript tutorial series, we\'ll dive deep into the world of data types in TypeScript.',
 'Understanding data types is crucial for writing robust and maintainable code, and TypeScript provides powerful mechanisms to enforce type safety in your applications.']

In [76]:
len(splits)

3
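
The difference between the two separators can be seen with Python's `re` module alone. This is an illustration of the regex behaviour, not LangChain's code; the sample sentence is made up, and gluing the separator onto the following piece is an assumption that mimics the output above.

```python
import re

text = "TypeScript has types. Understanding types is crucial."

# Capturing split: the separator ". " becomes its own list element...
parts = re.split(r"(\. )", text)
# ...and attaching each separator to the piece after it reproduces the misplaced ". "
glued = [parts[0]] + [parts[i] + parts[i + 1] for i in range(1, len(parts), 2)]
glued = [p.strip() for p in glued]

# Lookbehind split: a zero-width match just after ". ", so nothing is consumed
# (splitting on empty matches needs Python 3.7 or newer)
lookbehind = [p.strip() for p in re.split(r"(?<=\. )", text)]

print(glued)
print(lookbehind)
# ['TypeScript has types', '. Understanding types is crucial.']
# ['TypeScript has types.', 'Understanding types is crucial.']
```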

You can also use a character splitter if you want to.

In [81]:
chunk_size=300
chunk_overlap=3

In [82]:
character_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator = "\n"
)

In [83]:
splits = character_splitter.split_text(larget_toy_text)

In [84]:
splits

['Welcome back to "Code with Prince"! In this video, the fourth installment of our TypeScript tutorial series, we\'ll dive into the world of arrays in TypeScript. Arrays are essential data structures that allow you to store and manipulate collections of values efficiently.',
 'Welcome back to "Code with Prince"! In this video, the second installment of our TypeScript tutorial series, we\'ll dive deep into the world of data types in TypeScript. Understanding data types is crucial for writing robust and maintainable code, and TypeScript provides powerful mechanisms to enforce type safety in your applications.']

In [80]:
len(splits)

2

## RecursiveCharacterTextSplitter

Here is how the official documentation chatbot describes it:

"The RecursiveCharacterTextSplitter is a text splitter module that is recommended for `generic text`. It splits the text based on a list of characters provided as parameters. The default list includes ["\n\n", "\n", " ", ""] which tries to keep paragraphs, sentences, and words together as long as possible. The chunk size is measured by the number of characters."

In [85]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [86]:
rc_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
)

In [88]:
splits = rc_splitter.split_text(larget_toy_text)

In [89]:
splits

['Welcome back to "Code with Prince"! In this video, the fourth installment of our TypeScript tutorial series, we\'ll dive into the world of arrays in TypeScript. Arrays are essential data structures that allow you to store and manipulate collections of values efficiently.',
 'Welcome back to "Code with Prince"! In this video, the second installment of our TypeScript tutorial series, we\'ll dive deep into the world of data types in TypeScript. Understanding data types is crucial for writing robust and maintainable code, and TypeScript provides powerful mechanisms to',
 'to enforce type safety in your applications.']

In [90]:
len(splits)

3
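
The recursive strategy described in this section can be sketched in plain Python. This is a simplified illustration under stated assumptions (no overlap, literal separators, no whitespace stripping), and `recursive_split` is a hypothetical helper rather than LangChain's implementation.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Try the first separator; recurse with the remaining ones on oversized pieces."""
    if len(text) <= chunk_size or not separators:
        return [text] if text else []

    separator, remaining = separators[0], separators[1:]
    # An empty separator means "split into individual characters"
    pieces = list(text) if separator == "" else text.split(separator)

    chunks = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging while the chunk still fits
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too big: fall back to the next, finer separator
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


default_separators = ["\n\n", "\n", " ", ""]
toy_text = "Welcome back to Code with Prince, I love coding"
toy_chunks = recursive_split(toy_text, default_separators, chunk_size=20)
print(toy_chunks)
# ['Welcome back to Code', 'with Prince, I love', 'coding']
```

The text has no `"\n\n"` or `"\n"`, so the sketch falls through to `" "` and merges whole words until adding another would exceed 20 characters.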

### Token Splitting

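Token-based splitting is useful because LLM context windows are measured in tokens, not characters. As a rough rule of thumb, one token is about **4** characters of English text, though the exact length varies from token to token.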

In the context of LLMs (large language models), the **"context window"** is the span of preceding tokens that the model can take into account when predicting the next token in a sequence; its size is a maximum measured in tokens. The context window lets the model use the information in the preceding tokens to make more accurate predictions.

For example, in a sentence like **"The cat is sitting on the _ _"**, the context window would include the words **"The," "cat," "is," "sitting," "on," and "the."** The model uses this context to predict the most likely word to complete the sentence, such as "mat" or "chair."

The size of the context window varies from model to model. A larger context window gives the model more contextual information but also requires more computational resources. The best size depends on factors like the complexity of the language being modeled and the specific task or application.
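
To get a feel for how chunk length relates to a context window, here is a rough sketch. It assumes the common approximation of about 4 characters per token for English text; `estimate_tokens` and `fits_in_context` are hypothetical helpers, and a real tokenizer such as tiktoken should be used when exact counts matter.

```python
import math

CHARS_PER_TOKEN = 4  # assumption: rough average for English text, not an exact value


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(text: str, context_window: int, reserved_for_output: int = 0) -> bool:
    """Check whether the estimated tokens fit, leaving room for the model's reply."""
    return estimate_tokens(text) <= context_window - reserved_for_output


toy_text = "Welcome back to Code with Prince, I love coding"
estimate = estimate_tokens(toy_text)
print(estimate)  # 12, from 47 characters
print(fits_in_context(toy_text, context_window=16, reserved_for_output=8))  # False
```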

In [94]:
!pip install tiktoken

Collecting tiktoken
  Downloading tiktoken-0.4.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Collecting regex>=2022.1.18 (from tiktoken)
  Downloading regex-2023.6.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (781 kB)
Installing collected packages: regex, tiktoken
Successfully installed regex-2023.6.3 tiktoken-0.4.0


In [91]:
from langchain.text_splitter import TokenTextSplitter

In [92]:
chunk_size=1
chunk_overlap=0

In [98]:
token_splitter = TokenTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
)

In [99]:
splits = token_splitter.split_text(toy_text)

In [100]:
splits

['Welcome',
 ' back',
 ' to',
 ' Code',
 ' with',
 ' Prince',
 ',',
 ' I',
 ' love',
 ' coding']
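
The idea behind token splitting (tokenize, group the tokens into windows, then join each window back into text) can be sketched without any dependencies. The regex tokenizer here is a stand-in that happens to match the output above for this sentence; it is not a real BPE tokenizer like the one tiktoken provides, and both function names are hypothetical.

```python
import re


def toy_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer: a word with its leading space, or one punctuation mark."""
    return re.findall(r" ?\w+|[^\w\s]", text)


def split_by_tokens(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Group tokens into windows of chunk_size and join each window back into text."""
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = toy_tokenize(text)
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append("".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already covers the final token
    return chunks


toy_text = "Welcome back to Code with Prince, I love coding"
one_per_chunk = split_by_tokens(toy_text, chunk_size=1)
four_per_chunk = split_by_tokens(toy_text, chunk_size=4)
print(one_per_chunk)
print(four_per_chunk)
# ['Welcome', ' back', ' to', ' Code', ' with', ' Prince', ',', ' I', ' love', ' coding']
# ['Welcome back to Code', ' with Prince, I', ' love coding']
```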

### Splitting Documents

In [101]:
from langchain.document_loaders import PyPDFLoader

In [117]:
loader = PyPDFLoader("./datasets/example_doc.pdf")
pages = loader.load()

In [118]:
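# Note: chunk_size is still 1 and chunk_overlap is still 0 from the token example,
# so every chunk below is a single character. Reset them (e.g. chunk_size=300) for real use.
rc_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
)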

In [119]:
splits = rc_splitter.split_documents(pages)

In [120]:
splits

[Document(page_content='M', metadata={'source': './datasets/example_doc.pdf', 'page': 0}),
 Document(page_content='a', metadata={'source': './datasets/example_doc.pdf', 'page': 0}),
 Document(page_content='s', metadata={'source': './datasets/example_doc.pdf', 'page': 0}),
 Document(page_content='t', metadata={'source': './datasets/example_doc.pdf', 'page': 0}),
 Document(page_content='e', metadata={'source': './datasets/example_doc.pdf', 'page': 0}),
 Document(page_content='r', metadata={'source': './datasets/example_doc.pdf', 'page': 0}),
 Document(page_content='i', metadata={'source': './datasets/example_doc.pdf', 'page': 0}),
 Document(page_content='n', metadata={'source': './datasets/example_doc.pdf', 'page': 0}),
 Document(page_content='g', metadata={'source': './datasets/example_doc.pdf', 'page': 0}),
 Document(page_content='\n', metadata={'source': './datasets/example_doc.pdf', 'page': 0}),
 Document(page_content='F', metadata={'source': './datasets/example_doc.pdf', 'page': 0})
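
The single-character documents above come from the chunk size, and every chunk carries a copy of its page's metadata. Here is a minimal sketch of that behaviour, assuming plain fixed-size slicing: `Doc` is a hypothetical stand-in for LangChain's `Document`, and the page text and `example.pdf` source are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs(docs: list[Doc], chunk_size: int) -> list[Doc]:
    """Slice each document into fixed-size chunks, copying metadata onto each chunk."""
    result = []
    for doc in docs:
        for start in range(0, len(doc.page_content), chunk_size):
            piece = doc.page_content[start:start + chunk_size]
            result.append(Doc(piece, dict(doc.metadata)))
    return result


page = Doc("Mastering\nFunctions", {"source": "example.pdf", "page": 0})

tiny = split_docs([page], chunk_size=1)     # one character per chunk
larger = split_docs([page], chunk_size=10)  # readable chunks

print([d.page_content for d in tiny][:11])
print([d.page_content for d in larger])
# ['M', 'a', 's', 't', 'e', 'r', 'i', 'n', 'g', '\n', 'F']
# ['Mastering\n', 'Functions']
```

With a chunk size suited to the document, the same call returns far fewer and far more useful chunks.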

In [112]:
!pip install unstructured


In [113]:
from langchain.document_loaders import UnstructuredPDFLoader

In [114]:
loader = UnstructuredPDFLoader("./datasets/Mastering Functions in TypeScript_ A Comprehensive Guide _ Code with Prince.pdf")
pages = loader.load()

[nltk_data] Downloading package punkt to /home/prince/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/prince/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [115]:
splits = rc_splitter.split_documents(pages)

In [116]:
splits

[Document(page_content='M', metadata={'source': './datasets/Mastering Functions in TypeScript_ A Comprehensive Guide _ Code with Prince.pdf'}),
 Document(page_content='a', metadata={'source': './datasets/Mastering Functions in TypeScript_ A Comprehensive Guide _ Code with Prince.pdf'}),
 Document(page_content='s', metadata={'source': './datasets/Mastering Functions in TypeScript_ A Comprehensive Guide _ Code with Prince.pdf'}),
 Document(page_content='t', metadata={'source': './datasets/Mastering Functions in TypeScript_ A Comprehensive Guide _ Code with Prince.pdf'}),
 Document(page_content='e', metadata={'source': './datasets/Mastering Functions in TypeScript_ A Comprehensive Guide _ Code with Prince.pdf'}),
 Document(page_content='r', metadata={'source': './datasets/Mastering Functions in TypeScript_ A Comprehensive Guide _ Code with Prince.pdf'}),
 Document(page_content='i', metadata={'source': './datasets/Mastering Functions in TypeScript_ A Comprehensive Guide _ Code with Prince.