## Level 1: Character Splitting

Character splitting is the most basic form of splitting up your text. It is the process of simply dividing your text into N-character sized chunks regardless of their content or form.

This method isn't recommended for real applications, but it's a great starting point for understanding the basics.

In [1]:
text = "This is the text I would like to chunk up. It is the example text for this exercise"

In [2]:
chunks = []

chunk_size = 35 # Characters

# Step through the text in increments of chunk_size and slice out one chunk per step
for i in range(0, len(text), chunk_size):
    chunk = text[i:i + chunk_size]
    chunks.append(chunk)
chunks

['This is the text I would like to ch',
 'unk up. It is the example text for ',
 'this exercise']

### Same thing using LangChain

In [3]:
from langchain.text_splitter import CharacterTextSplitter

In [5]:
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=0, separator='', strip_whitespace=False)


Then we can split our text with `create_documents`. Note: `create_documents` expects a list of texts, so if you just have a string (like we do) you'll need to wrap it in `[]`.

In [6]:
text_splitter.create_documents([text])

[Document(page_content='This is the text I would like to ch'),
 Document(page_content='unk up. It is the example text for '),
 Document(page_content='this exercise')]

#### Using the overlap and separator

In [8]:
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap = 4, separator='')
text_splitter.create_documents([text])

[Document(page_content='This is the text I would like to ch'),
 Document(page_content='o chunk up. It is the example text'),
 Document(page_content='ext for this exercise')]
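To see where the repeated characters come from, here is a minimal plain-Python sketch of chunking with overlap. It is not LangChain's implementation: `split_with_overlap` is a hypothetical helper, and it assumes overlap simply means each chunk starts `chunk_size - chunk_overlap` characters after the previous one. Unlike the LangChain call above, it does not strip whitespace from the chunks.

```python
def split_with_overlap(text, chunk_size, chunk_overlap=0):
    """Fixed-size character chunks where neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end, so we don't emit a tail
        # that is already fully contained in the previous chunk.
        if start + chunk_size >= len(text):
            break
    return chunks


text = "This is the text I would like to chunk up. It is the example text for this exercise"

for chunk in split_with_overlap(text, chunk_size=35, chunk_overlap=4):
    print(repr(chunk))
```

The last four characters of each chunk are the first four characters of the next one, which is the same pattern visible in the LangChain output above.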

In [9]:
text_splitter = CharacterTextSplitter(chunk_size = 35, chunk_overlap=0, separator='ch')
text_splitter.create_documents([text])


[Document(page_content='This is the text I would like to'),
 Document(page_content='unk up. It is the example text for this exercise')]
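Note that the second chunk above is longer than `chunk_size`. With a non-empty separator, the text is only cut where the separator occurs, and a piece that is still too long is not broken up any further. The plain-Python sketch below imitates that behaviour with `str.split`; it is an approximation for illustration, not what LangChain runs internally.

```python
text = "This is the text I would like to chunk up. It is the example text for this exercise"

# Split only where "ch" occurs; like str.split, the separator itself is dropped.
pieces = [piece.strip() for piece in text.split("ch")]

for piece in pieces:
    # The second piece is longer than 35 characters because there is
    # no further "ch" to split on.
    print(len(piece), repr(piece))
```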

## Level 2: Recursive Character Text Splitting

With recursive character text splitting, we specify a series of separators which will be used, in order, to split our docs.

You can see the default separators for LangChain below.

- "\n\n" - Double new line, or most commonly paragraph breaks
- "\n" - New lines
- " " - Spaces
- "" - Characters

In [15]:
from langchain.text_splitter import RecursiveCharacterTextSplitter


In [16]:
text = """
One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear.

Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor's, you don't get half as many customers. You get no customers, and you go out of business.

It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]
"""

In [17]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 65, chunk_overlap=0)


In [18]:
text_splitter.create_documents([text])


[Document(page_content="One of the most important things I didn't understand about the"),
 Document(page_content='world when I was a child is the degree to which the returns for'),
 Document(page_content='performance are superlinear.'),
 Document(page_content='Teachers and coaches implicitly told us the returns were linear.'),
 Document(page_content='"You get out," I heard a thousand times, "what you put in." They'),
 Document(page_content='meant well, but this is rarely true. If your product is only'),
 Document(page_content="half as good as your competitor's, you don't get half as many"),
 Document(page_content='customers. You get no customers, and you go out of business.'),
 Document(page_content="It's obviously true that the returns for performance are"),
 Document(page_content='superlinear in business. Some think this is a flaw of'),
 Document(page_content='capitalism, and that if we changed the rules it would stop being'),
 Document(page_content='true. But superlinear returns for

The splitter first looks for double new lines (paragraph breaks).

Once the paragraphs are split, it checks each chunk against the chunk size. If a chunk is too big, it splits that chunk by the next separator. If the result is still too big, it moves on to the separator after that, and so on.
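The procedure described above can be sketched in plain Python. This is a simplified illustration, not LangChain's actual code: `recursive_split` is a hypothetical helper, it ignores overlap, and its merging rules are an assumption, so its chunks may differ slightly from what `RecursiveCharacterTextSplitter` returns.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text with the first separator, recursing into pieces that are too big."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= chunk_size:
        return [text]
    if not separators or separators[0] == "":
        # Last resort: plain fixed-size character chunks.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(separator):
        # Try to merge the piece into the chunk we are currently building.
        candidate = current + separator + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is too big on its own: fall back to the next separator.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
            current = ""
    if current:
        chunks.append(current)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


sample = "First paragraph here.\n\nSecond one is a bit longer than the first."
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

The first paragraph fits in one chunk, so it is kept whole. The second paragraph is too long, so the sketch falls through to splitting on spaces and packs as many words as fit into each chunk.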

In [19]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 450, chunk_overlap=0)
text_splitter.create_documents([text])

[Document(page_content="One of the most important things I didn't understand about the world when I was a child is the degree to which the returns for performance are superlinear."),
 Document(page_content='Teachers and coaches implicitly told us the returns were linear. "You get out," I heard a thousand times, "what you put in." They meant well, but this is rarely true. If your product is only half as good as your competitor\'s, you don\'t get half as many customers. You get no customers, and you go out of business.'),
 Document(page_content="It's obviously true that the returns for performance are superlinear in business. Some think this is a flaw of capitalism, and that if we changed the rules it would stop being true. But superlinear returns for performance are a feature of the world, not an artifact of rules we've invented. We see the same pattern in fame, power, military victories, knowledge, and even benefit to humanity. In all of these, the rich get richer. [1]")]

## Level 3: Document Specific Splitting

The next level deals with content that isn't plain prose, i.e. pictures, PDFs, or code snippets.

This level is all about making a chunking strategy to fit your different data formats.

The Markdown, Python, and JS splitters will basically be similar to Recursive Character, but with different separators.

### Markdown
Separators:

- `\n#{1,6}` - Split by new lines followed by a header (H1 through H6)
- ` ```\n` - Code blocks
- `\n\\*\\*\\*+\n` - Horizontal Lines
- `\n---+\n` - Horizontal Lines
- `\n___+\n` - Horizontal Lines
- `\n\n` - Double new lines
- `\n` - New line
- `" "` - Spaces
- `""` - Character

In [20]:
from langchain.text_splitter import MarkdownTextSplitter

In [21]:
splitter = MarkdownTextSplitter(chunk_size = 40, chunk_overlap=0)

In [22]:
markdown_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

In [23]:
splitter.create_documents([markdown_text])

[Document(page_content='# Fun in California\n\n## Driving'),
 Document(page_content='Try driving on the 1 down to San Diego'),
 Document(page_content='### Food'),
 Document(page_content="Make sure to eat a burrito while you're"),
 Document(page_content='there'),
 Document(page_content='## Hiking\n\nGo to Yosemite')]
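To make the header separator from the list above concrete, here is a small standard-library sketch that splits Markdown only at header lines. It is an illustration under simplifying assumptions (headers are `#` characters followed by a space, and no chunk size is enforced), and `split_on_markdown_headers` is a hypothetical helper rather than part of LangChain.

```python
import re


def split_on_markdown_headers(markdown):
    """Return one section per header, keeping each header with its body."""
    # Split on a newline that is immediately followed by 1-6 '#' and a space.
    # The lookahead (?=...) keeps the header text in the section it starts.
    sections = re.split(r"\n(?=#{1,6} )", markdown.strip())
    return [section.strip() for section in sections if section.strip()]


markdown_text = """
# Fun in California

## Driving

Try driving on the 1 down to San Diego

### Food

Make sure to eat a burrito while you're there

## Hiking

Go to Yosemite
"""

for section in split_on_markdown_headers(markdown_text):
    print(repr(section))
```

Unlike `MarkdownTextSplitter`, this never merges small sections or breaks up large ones. It only shows where the header separator would cut.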

### Python
Separators:

- `\nclass` - Classes first
- `\ndef` - Functions next
- `\n\tdef` - Indented functions
- `\n\n` - Double New lines
- `\n` - New Lines
- `" "` - Spaces
- `""` - Characters

In [24]:
from langchain.text_splitter import PythonCodeTextSplitter

In [25]:
python_text = """
class Person:
  def __init__(self, name, age):
    self.name = name
    self.age = age

p1 = Person("John", 36)

for i in range(10):
    print (i)
"""

In [26]:
python_splitter = PythonCodeTextSplitter(chunk_size=100, chunk_overlap=0)


In [27]:
python_splitter.create_documents([python_text])


[Document(page_content='class Person:\n  def __init__(self, name, age):\n    self.name = name\n    self.age = age'),
 Document(page_content='p1 = Person("John", 36)\n\nfor i in range(10):\n    print (i)')]

### JS
Very similar to Python.

Separators:

- `\nfunction` - Indicates the beginning of a function declaration
- `\nconst` - Used for declaring constant variables
- `\nlet` - Used for declaring block-scoped variables
- `\nvar` - Used for declaring a variable
- `\nclass` - Indicates the start of a class definition
- `\nif` - Indicates the beginning of an if statement
- `\nfor` - Used for for-loops
- `\nwhile` - Used for while-loops
- `\nswitch` - Used for switch statements
- `\ncase` - Used within switch statements
- `\ndefault` - Also used within switch statements
- `\n\n` - Indicates a larger separation in text or code
- `\n` - Separates lines of code or text
- `" "` - Separates words or tokens in the code
- `""` - Makes every character a separate element

In [28]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, Language


In [29]:
javascript_text = """
// Function is called, the return value will end up in x
let x = myFunction(4, 3);

function myFunction(a, b) {
// Function returns the product of a and b
  return a * b;
}
"""

In [30]:
js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=65, chunk_overlap=0
)
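As a rough picture of what the keyword separators do, the sketch below cuts JavaScript source in front of lines that start with a declaration keyword. It uses only the standard library and a subset of the separators listed above. `split_before_keywords` and `JS_KEYWORDS` are hypothetical names, and the sketch does not enforce a chunk size or fall back to finer separators the way the LangChain splitter does.

```python
import re

# A subset of the separators listed above, without the leading "\n".
JS_KEYWORDS = ["function", "const", "let", "var", "class"]


def split_before_keywords(source, keywords):
    """Split source code in front of every line that starts with a keyword."""
    alternation = "|".join(re.escape(keyword) for keyword in keywords)
    # Match a newline followed by one of the keywords as a whole word.
    pattern = re.compile(rf"\n(?=(?:{alternation})\b)")
    return [part.strip() for part in pattern.split(source.strip()) if part.strip()]


javascript_text = """
// Function is called, the return value will end up in x
let x = myFunction(4, 3);

function myFunction(a, b) {
// Function returns the product of a and b
  return a * b;
}
"""

for part in split_before_keywords(javascript_text, JS_KEYWORDS):
    print(repr(part))
```

The function body stays in one piece because none of its lines start with a keyword from the list. The same helper works with the Python separators if you pass `["class", "def"]` instead.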

### PDFs w/ tables

A very convenient way to do this is with Unstructured, a library dedicated to making your data LLM ready.

In [3]:
import os
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import elements_to_json



In [6]:
filename = 'static\\salesForceReport.pdf'

In [7]:
# Extracts the elements from the PDF
elements = partition_pdf(
    filename=filename,

    # Unstructured Helpers
    strategy="hi_res", 
    infer_table_structure=True, 
    model_name="yolox"
)

PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?
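The error message suggests that the Poppler command-line tools could not be found, rather than a problem with the `partition_pdf` call itself. A quick way to check is to look for Poppler's `pdfinfo` executable on your `PATH` before running the cell. The helper name `poppler_available` is made up for this sketch.

```python
import shutil


def poppler_available():
    """Return True if Poppler's pdfinfo executable can be found on PATH."""
    return shutil.which("pdfinfo") is not None


if poppler_available():
    print("Poppler found at:", shutil.which("pdfinfo"))
else:
    print("Poppler not found. Install it and make sure its bin folder is on PATH.")
```

On Windows (which the backslash in `filename` suggests), this usually means downloading a Poppler build and adding its `bin` directory to `PATH`, then restarting the notebook kernel so the change is picked up.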