# Document Splitting
- splitting into smaller chunks can be tricky
- retaining meaningful relationships is important --> chunks should be semantically relevant and coherent
    - e.g., not desirable if an important piece of information is split across 2 chunks
    - we can use *chunk overlaps* to tackle this problem (like a sliding window)
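
The sliding-window idea behind chunk overlap can be illustrated with a minimal, character-level sketch. `sliding_chunks` is a hypothetical helper written for this note, not a LangChain function; the real splitters additionally try to break at separators.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size chars; neighbours share chunk_overlap chars."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(sliding_chunks("0123456789", chunk_size=4, chunk_overlap=1))
# ['0123', '3456', '6789']
```

Each chunk starts with the last character of the previous one, so information sitting on a chunk boundary appears in both neighbours.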

## EITHER: get your [OpenAI API Key](https://platform.openai.com/account/api-keys)

In [None]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']

## OR: use [LocalAI as an OpenAI replacement](https://localai.io/howtos/easy-request-openai/)

In [41]:
import os
import openai

# Specify the port your LocalAI docker container runs on
openai.api_base = "http://localhost:8080/v1"  # default
# openai.api_base = "http://localhost:9095/v1"  # for lunademo
openai.api_key = "sx-xxx"  # not needed for LocalAI (dummy)
OPENAI_API_KEY = "sx-xxx"
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

In [42]:
# Specify the model you are using
# llm_model = ""
# llm_model = "lunademo"

**We import the most popular text splitters here**
- but there are many different splitters depending on the task
    - e.g., *SpacyTextSplitter*, which uses spaCy to split on sentences, or *Language* for programming and markup languages (CPP, Python, Ruby, Markdown, etc.)


In [43]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

**For demonstration, we set very small sizes for chunk and overlap**

In [44]:
chunk_size = 26
chunk_overlap = 4

In [45]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

Why doesn't this split the string below?

In [6]:
text1 = 'abcdefghijklmnopqrstuvwxyz'  # 26 chars, exactly chunk_size, so it fits into a single chunk

In [7]:
r_splitter.split_text(text1)

['abcdefghijklmnopqrstuvwxyz']

In [8]:
text2 = 'abcdefghijklmnopqrstuvwxyzabcdefg'

In [9]:
r_splitter.split_text(text2)  # 2 chunks: the first has 26 chars, the second holds the rest plus 4 chars of overlap

['abcdefghijklmnopqrstuvwxyz', 'wxyzabcdefg']

- try a more complex string here (with spaces between characters)

In [10]:
text3 = "a b c d e f g h i j k l m n o p q r s t u v w x y z"

In [11]:
r_splitter.split_text(text3)  # whitespace chars are counted as well

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']

In [12]:
c_splitter.split_text(text3)  # no split: the default separator is "\n\n", which does not occur in text3

['a b c d e f g h i j k l m n o p q r s t u v w x y z']

**We can specify a separator for splitting, here `' '` whitespace**

In [46]:
c_splitter = CharacterTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap,
    separator=' '
)
c_splitter.split_text(text3)

['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
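
The output above can be reproduced with a simplified sketch of "split on the separator, then merge pieces until the chunk is full, keeping a small overlap". This is an illustration under that assumption, not LangChain's actual implementation; `merge_pieces` is a hypothetical name.

```python
def merge_pieces(pieces, separator, chunk_size, chunk_overlap):
    """Greedily join pieces into chunks of at most chunk_size characters."""
    chunks = []
    window = []  # pieces that make up the chunk currently being built
    for piece in pieces:
        if window and len(separator.join(window + [piece])) > chunk_size:
            chunks.append(separator.join(window))
            # Drop pieces from the front until what remains fits into
            # chunk_overlap and leaves room for the new piece.
            while window and (
                len(separator.join(window)) > chunk_overlap
                or len(separator.join(window + [piece])) > chunk_size
            ):
                window.pop(0)
        window.append(piece)
    if window:
        chunks.append(separator.join(window))
    return chunks


text = "a b c d e f g h i j k l m n o p q r s t u v w x y z"
print(merge_pieces(text.split(" "), " ", chunk_size=26, chunk_overlap=4))
# ['a b c d e f g h i j k l m', 'l m n o p q r s t u v w x', 'w x y z']
```

The overlap is measured in characters, so with `chunk_overlap=4` only `'l m'` (3 characters) is carried over; `'k l m'` would already be 5.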

Try your own examples!

## Recursive splitting details

`RecursiveCharacterTextSplitter` is recommended for generic text. 

In [47]:
some_text = """When writing documents, writers will use document structure to group content. \
This can convey to the reader, which idea's are related. For example, closely related ideas \
are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n  \
Paragraphs are often delimited with a carriage return or two carriage returns. \
Carriage returns are the "backslash n" you see embedded in this string. \
Sentences have a period at the end, but also, have a space.\
and words are separated by space."""

In [48]:
len(some_text)

496

**Let's define a new character splitter**
- this time no chunk overlap
- we use whitespace character `' '` as separator

In [49]:
c_splitter = CharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0,
    separator=' '
)

**And a new recursive splitter**
- again no chunk overlap
- this time we pass a list of separators (these are the defaults anyway, spelled out here for clarity)
    - the splitter tries the first separator first and only moves on to the next one if a piece is still larger than the chunk size
    - here: the splitter splits on *double newlines* `\n\n` first; if more splitting is needed, on *single newlines* `\n`, then on *spaces* `" "`, and finally between *single characters* `""` as a last resort

In [50]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=0, 
    separators=["\n\n", "\n", " ", ""]
)

In [51]:
c_splitter.split_text(some_text)

['When writing documents, writers will use document structure to group content. This can convey to the reader, which idea\'s are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document. \n\n Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also,',
 'have a space.and words are separated by space.']

In [52]:
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example, closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.",
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
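
The separator fallback can be sketched in a few lines of plain Python. This is a simplified illustration (no overlap, no whitespace stripping), not the library's code; `recursive_split` is a hypothetical name.

```python
def recursive_split(text, separators, chunk_size):
    """Aim for chunks of at most chunk_size chars, trying coarse separators first."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    # "" is the last resort: split between single characters
    pieces = list(text) if sep == "" else text.split(sep)
    chunks, current = [], ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # the piece is too long on its own: retry with the finer separators
            chunks.extend(recursive_split(piece, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd eee", ["\n\n", "\n", " ", ""], chunk_size=8))
# ['aaa bbb', 'ccc ddd', 'eee']
```

The first paragraph fits as a whole, so it is kept intact; only the second paragraph, which is too long, is broken down further at spaces.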

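**Let's reduce the chunk size a bit and add a period followed by a space, `"\. "`, to our separators**
- the goal is to split between sentences
- caveat: when the separator is applied as a *regex*, the matched period can end up at the start of the next chunk instead of at the end of its sentence (we address this in the next example)
- in the output shown below, the chunks break at spaces rather than at sentence ends, which suggests the separator was matched literally in this run; newer LangChain releases appear to require `is_separator_regex=True` for regex separators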

In [53]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"\. ", " ", ""]  # raw string avoids an invalid-escape warning for "\."
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']

**Here we use a lookbehind regex so that the period stays with its sentence:** `"(?<=\. )"`

In [54]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=0,
    separators=["\n\n", "\n", r"(?<=\. )", " ", ""]  # raw string avoids an invalid-escape warning for "\."
)
r_splitter.split_text(some_text)

["When writing documents, writers will use document structure to group content. This can convey to the reader, which idea's are related. For example,",
 'closely related ideas are in sentances. Similar ideas are in paragraphs. Paragraphs form a document.',
 'Paragraphs are often delimited with a carriage return or two carriage returns. Carriage returns are the "backslash n" you see embedded in this',
 'string. Sentences have a period at the end, but also, have a space.and words are separated by space.']
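
To see what the two period patterns do on their own, independent of LangChain, they can be tried with Python's `re.split` (Python 3.7 or newer, which splits on empty matches). The sample sentence is made up for this illustration.

```python
import re

text = "One. Two. Three."

# Plain pattern: the matched ". " is consumed, so the periods disappear
plain = re.split(r"\. ", text)
print(plain)  # ['One', 'Two', 'Three.']

# Lookbehind: matches the empty position right after ". ",
# so nothing is consumed and each period stays with its sentence
kept = re.split(r"(?<=\. )", text)
print(kept)   # ['One. ', 'Two. ', 'Three.']
```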

### A more realistic example: splitting a PDF

In [55]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("MachineLearning-Lecture01.pdf")
pages = loader.load()

**Here we also specify the function used to measure length (`len` is already the default)**

In [56]:
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=150,
    length_function=len
)

In [57]:
docs = text_splitter.split_documents(pages)

In [58]:
len(docs)

77

In [59]:
len(pages)

22

**Same approach with Notion DB**
- does not work locally here

In [None]:
from langchain.document_loaders import NotionDirectoryLoader
loader = NotionDirectoryLoader("docs/Notion_DB")
notion_db = loader.load()

In [None]:
docs = text_splitter.split_documents(notion_db)

In [None]:
len(notion_db)

In [None]:
len(docs)

## Token splitting

We can also split on token count explicitly, if we want.

This can be useful because LLMs often have context windows designated in tokens.

As a rule of thumb, a token is about 4 characters long.

- **Note:** Llama models do not use OpenAI's `tiktoken` tokenizer; use `LlamaTokenizer` instead

In [27]:
from langchain.text_splitter import TokenTextSplitter

In [28]:
text_splitter = TokenTextSplitter(chunk_size=1, chunk_overlap=0)

In [29]:
text1 = "foo bar bazzyfoo"

In [30]:
text_splitter.split_text(text1)

['foo', ' bar', ' b', 'az', 'zy', 'foo']
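
The "about 4 characters per token" rule of thumb can be compared against the output above. `estimate_tokens` is a hypothetical helper for this note, and the token list is copied from the cell output rather than computed here.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate from the character count (heuristic only)."""
    return math.ceil(len(text) / chars_per_token)


text = "foo bar bazzyfoo"
actual_tokens = ['foo', ' bar', ' b', 'az', 'zy', 'foo']  # copied from the output above

print(len(text))              # 16
print(estimate_tokens(text))  # 4
print(len(actual_tokens))     # 6
```

Unusual words such as `bazzyfoo` break into several short tokens, so the heuristic underestimates here; it is only meant for rough sizing.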

In [31]:
text_splitter = TokenTextSplitter(chunk_size=10, chunk_overlap=0)

In [32]:
docs = text_splitter.split_documents(pages)

In [33]:
docs[0]

Document(page_content='MachineLearning-Lecture01  \n', metadata={'source': 'MachineLearning-Lecture01.pdf', 'page': 0})

In [34]:
pages[0].metadata

{'source': 'MachineLearning-Lecture01.pdf', 'page': 0}
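
As seen above, the chunk keeps the metadata of the page it came from. A minimal sketch of that idea, using a hypothetical `Doc` class and `split_docs` helper instead of LangChain's own classes (the file name `example.pdf` is made up):

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document object: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_docs(docs, split_text):
    """Split each document's text and copy its metadata onto every chunk."""
    chunks = []
    for doc in docs:
        for piece in split_text(doc.page_content):
            chunks.append(Doc(piece, dict(doc.metadata)))  # copy, do not share
    return chunks


example_pages = [
    Doc("first page text", {"source": "example.pdf", "page": 0}),
    Doc("second page", {"source": "example.pdf", "page": 1}),
]
chunks = split_docs(example_pages, lambda text: text.split(" "))
print(len(chunks))         # 5
print(chunks[3].metadata)  # {'source': 'example.pdf', 'page': 1}
```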

## Context aware splitting

Chunking aims to keep text with common context together.

Text splitting often uses sentences or other delimiters to keep related text together, but many documents (such as Markdown) have structure (headers) that can be used explicitly in splitting.

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.
- this adds information to the original metadata!

In [35]:
from langchain.document_loaders import NotionDirectoryLoader
from langchain.text_splitter import MarkdownHeaderTextSplitter

In [36]:
markdown_document = """# Title\n\n \
## Chapter 1\n\n \
Hi this is Jim\n\n Hi this is Joe\n\n \
### Section \n\n \
Hi this is Lance \n\n 
## Chapter 2\n\n \
Hi this is Molly"""

In [37]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [38]:
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)
md_header_splits = markdown_splitter.split_text(markdown_document)

In [39]:
md_header_splits[0]

Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1'})

In [40]:
md_header_splits[1]

Document(page_content='Hi this is Lance', metadata={'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'})
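
How header metadata ends up on each chunk can be sketched in plain Python. This is a simplified illustration of the idea, not the implementation of `MarkdownHeaderTextSplitter`; `split_markdown_by_headers` is a hypothetical name and the chunks are plain dictionaries.

```python
def split_markdown_by_headers(markdown, headers_to_split_on):
    """Group body lines under the headers that are active above them."""
    # Check longer prefixes first so "###" is not mistaken for "#"
    prefixes = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    active = {}  # header level -> (metadata key, header title)
    chunks, body = [], []

    def flush():
        text = "\n".join(body).strip()
        if text:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            chunks.append({"content": text, "metadata": metadata})
        body.clear()

    for raw in markdown.splitlines():
        line = raw.strip()
        for prefix, name in prefixes:
            if line.startswith(prefix + " "):
                flush()  # text collected so far belongs to the old headers
                level = len(prefix)
                # a new header closes all headers at the same or a deeper level
                for old in [lvl for lvl in active if lvl >= level]:
                    del active[old]
                active[level] = (name, line[len(prefix):].strip())
                break
        else:
            if line:
                body.append(line)
    flush()
    return chunks


example_markdown = """# Title

## Chapter 1

Hi this is Jim

### Section

Hi this is Lance

## Chapter 2

Hi this is Molly"""

example_headers = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
chunks = split_markdown_by_headers(example_markdown, example_headers)
print(chunks[1])
# {'content': 'Hi this is Lance', 'metadata': {'Header 1': 'Title', 'Header 2': 'Chapter 1', 'Header 3': 'Section'}}
```

When `## Chapter 2` starts, the `### Section` entry is dropped, so the last chunk only carries `Header 1` and `Header 2`.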

Try on a real Markdown file, like a Notion database.

In [None]:
loader = NotionDirectoryLoader("docs/Notion_DB")
docs = loader.load()
txt = ' '.join([d.page_content for d in docs])

In [None]:
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on
)

In [None]:
md_header_splits = markdown_splitter.split_text(txt)

In [None]:
md_header_splits[0]