In [14]:
from langchain_community.document_loaders import PyPDFLoader, TextLoader

In [4]:
loader = PyPDFLoader('Attention.pdf')
doc = loader.load()

# RecursiveCharacterTextSplitter

#### Splitting text from documents: RecursiveCharacterTextSplitter
This text splitter is the recommended one for generic text. It is parameterized by a list of separator strings and tries them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. The effect is to keep paragraphs together for as long as possible, then lines, then words, since those tend to be the most strongly semantically related pieces of text.

- How the text is split: by list of characters.
- How the chunk size is measured: by number of characters.
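
To make the "try each separator in order" idea concrete, here is a minimal pure-Python sketch. It is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, separators are dropped rather than kept, and chunk overlap is not modelled.

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=100):
    """Simplified sketch: split on the first separator, merge small pieces,
    and recurse with the remaining separators on pieces that are too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # No separators left (or the empty separator): fall back to a hard cut.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in (p for p in text.split(sep) if p):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # piece still fits in the running chunk
            continue
        if current:
            chunks.append(current)       # close the running chunk
            current = ""
        if len(piece) <= chunk_size:
            current = piece              # piece starts a new chunk
        else:
            # Piece is too long on its own: retry with finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb ccc\n\nddd eee", chunk_size=7))
# ['aaa bbb', 'ccc', 'ddd eee']
```

The paragraph break is tried first, so "ddd eee" stays whole; only the first paragraph, which is too long, is split further on spaces.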


In [32]:
from langchain_text_splitters import RecursiveCharacterTextSplitter, CharacterTextSplitter, HTMLHeaderTextSplitter

# chunk_size and chunk_overlap are both measured in characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)

In [12]:
doc_split = text_splitter.split_documents(doc)
len(doc_split)

453

In [27]:
loader = TextLoader('speech.txt')
text = loader.load()

text_splitter.split_documents(text)

[Document(metadata={'source': 'speech.txt'}, page_content='Time management is the practice of planning and controlling how you spend your time to be more'),
 Document(metadata={'source': 'speech.txt'}, page_content='time to be more productive and efficient. It can help you get more done in less time, and can also'),
 Document(metadata={'source': 'speech.txt'}, page_content='time, and can also improve the quality of your work, reduce stress, and give you more time for'),
 Document(metadata={'source': 'speech.txt'}, page_content='you more time for creative projects.'),
 Document(metadata={'source': 'speech.txt'}, page_content='Here are some tips for better time management: \nPrioritize'),
 Document(metadata={'source': 'speech.txt'}, page_content='Make a to-do list and prioritize tasks so you can complete important or urgent tasks first.'),
 Document(metadata={'source': 'speech.txt'}, page_content='Set boundaries'),
 Document(metadata={'source': 'speech.txt'}, page_content="Let people kno

In [31]:
with open('speech.txt') as file:
    speech = file.read()
    
text_splitter.create_documents([speech])

[Document(metadata={}, page_content='Time management is the practice of planning and controlling how you spend your time to be more'),
 Document(metadata={}, page_content='time to be more productive and efficient. It can help you get more done in less time, and can also'),
 Document(metadata={}, page_content='time, and can also improve the quality of your work, reduce stress, and give you more time for'),
 Document(metadata={}, page_content='you more time for creative projects.'),
 Document(metadata={}, page_content='Here are some tips for better time management: \nPrioritize'),
 Document(metadata={}, page_content='Make a to-do list and prioritize tasks so you can complete important or urgent tasks first.'),
 Document(metadata={}, page_content='Set boundaries'),
 Document(metadata={}, page_content="Let people know when you're not available and set boundaries for yourself. You can set your phone"),
 Document(metadata={}, page_content='can set your phone to do-not-disturb during certain 

# CharacterTextSplitter

#### How to split by character: CharacterTextSplitter
This is the simplest method. It splits on a single separator string, which defaults to "\n\n", and measures chunk length in characters.

1. How the text is split: by a single separator string.
2. How the chunk size is measured: by number of characters.
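
The behaviour described above can be sketched in a few lines of plain Python. This is an illustrative approximation under stated assumptions, not the library's code: `split_by_separator` is a hypothetical name, overlap is ignored, and a piece longer than `chunk_size` is emitted as an oversized chunk because there is no finer separator to fall back on.

```python
def split_by_separator(text, separator="\n\n", chunk_size=100):
    """Split on one separator, then merge neighbouring pieces up to chunk_size."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue                      # skip empty pieces from repeated separators
        candidate = piece if not current else current + separator + piece
        if not current or len(candidate) <= chunk_size:
            current = candidate           # may exceed chunk_size if piece alone is too long
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "one\n\ntwo\n\n" + "x" * 30 + "\n\nthree"
for chunk in split_by_separator(sample, chunk_size=10):
    print(len(chunk), repr(chunk))
```

The 30-character paragraph comes out as a single 30-character chunk even though `chunk_size` is 10, which is the main practical difference from the recursive splitter.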


# HTMLHeaderTextSplitter

#### How to split by HTML header
HTMLHeaderTextSplitter is a "structure-aware" chunker that splits text at the HTML element level and adds metadata for each header "relevant" to any given chunk. It can return chunks element by element or combine elements with the same metadata, with the objectives of (a) keeping related text grouped (more or less) semantically and (b) preserving context-rich information encoded in document structures. It can be used with other text splitters as part of a chunking pipeline.
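
As a rough illustration of what "adds metadata for each header relevant to a chunk" means, the sketch below uses only the standard library's `html.parser`. The names `HeaderChunker` and `split_html` are hypothetical, the tracked tags are assumed to be `h1` to `h6`, and the output is plain dicts rather than `Document` objects, so this is a simplification and not how HTMLHeaderTextSplitter is implemented.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Collect body text between headers and tag it with the headers in scope."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        self.header_names = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.active = {}       # tag -> title for headers currently in scope
        self.in_header = None  # tag name while inside a tracked header
        self.header_text = []
        self.body = []
        self.chunks = []

    def flush(self):
        text = " ".join(" ".join(self.body).split())   # normalise whitespace
        if text:
            metadata = {self.header_names[t]: v for t, v in self.active.items()}
            self.chunks.append({"metadata": metadata, "page_content": text})
        self.body = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_names:
            self.flush()                 # a new header closes the current chunk
            self.in_header = tag
            self.header_text = []

    def handle_endtag(self, tag):
        if tag == self.in_header:
            title = " ".join("".join(self.header_text).split())
            level = int(tag[1:])
            # A new header replaces headers at the same or a deeper level.
            self.active = {t: v for t, v in self.active.items() if int(t[1:]) < level}
            self.active[tag] = title
            self.in_header = None

    def handle_data(self, data):
        (self.header_text if self.in_header else self.body).append(data)


def split_html(html, headers_to_split_on):
    parser = HeaderChunker(headers_to_split_on)
    parser.feed(html)
    parser.close()
    parser.flush()                       # emit the trailing chunk
    return parser.chunks


page = "<h1>Intro</h1><p>First para.</p><h2>Details</h2><p>More text.</p>"
for chunk in split_html(page, [("h1", "Header 1"), ("h2", "Header 2")]):
    print(chunk)
```

Each chunk carries the titles of every header above it, so "More text." is tagged with both "Intro" and "Details".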


In [34]:
from langchain_text_splitters import HTMLHeaderTextSplitter


url = 'https://plato.stanford.edu/entries/goedel-incompleteness/'
header_to_split_on = [
    ('h1', 'Header 1'),
    ('h2', 'Header 2'),
    ('h3', 'Header 3'),
    ('h4', 'Header 4'),
]

html_splitter = HTMLHeaderTextSplitter(header_to_split_on)
html_header_split = html_splitter.split_text_from_url(url)

html_header_split

[Document(metadata={}, page_content="Stanford Encyclopedia of Philosophy  \nMenu  \nBrowse About Support SEP  \nTable of Contents What's New Random Entry Chronological Archives  \nEditorial Information About the SEP Editorial Board How to Cite the SEP Special Characters Advanced Tools Contact  \nSupport the SEP PDFs for SEP Friends Make a Donation SEPIA for Libraries  \nEntry Navigation  \nEntry Contents Bibliography Academic Tools Friends PDF Preview Author and Citation Info Back to Top  \nGödel’s Incompleteness Theorems"),
 Document(metadata={'Header 1': 'Gödel’s Incompleteness Theorems'}, page_content="First published Mon Nov 11, 2013; substantive revision Thu Apr 2, 2020  \nGödel’s two incompleteness theorems are among the most important results in modern logic, and have deep implications for various issues. They concern the limits of provability in formal axiomatic theories. The first incompleteness theorem states that in any consistent formal system \\(F\\) within which a certain

# RecursiveJsonSplitter

#### How to split JSON data
This splitter breaks up JSON data while allowing control over chunk sizes. It traverses the data depth first and builds smaller JSON chunks. It attempts to keep nested JSON objects whole, but will split them if needed to keep chunks between min_chunk_size and max_chunk_size.

If a value is not nested JSON but a very large string, the string will not be split. If you need a hard cap on the chunk size, consider composing this splitter with a RecursiveCharacterTextSplitter on those chunks. There is an optional pre-processing step to split lists by first converting them to JSON objects (dicts) and then splitting them as such.

- How the text is split: by JSON value.
- How the chunk size is measured: by number of characters.
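
The depth-first traversal can be sketched as follows. This is a simplified illustration rather than RecursiveJsonSplitter's actual algorithm: `split_json` and `set_nested` here are hypothetical helpers, only `max_chunk_size` is modelled (no minimum size), lists are treated as single values, and size is taken as the length of `json.dumps` output.

```python
import copy
import json


def set_nested(target, path, value):
    """Write value into target at the given key path, creating dicts on the way."""
    for key in path[:-1]:
        target = target.setdefault(key, {})
    target[path[-1]] = value


def split_json(data, max_chunk_size=100):
    if not isinstance(data, dict):
        raise TypeError("this sketch only handles a top-level JSON object (dict)")
    chunks = [{}]

    def walk(node, path):
        # Descend into non-empty dicts; everything else is treated as a leaf.
        if isinstance(node, dict) and (node or not path):
            for key, value in node.items():
                walk(value, path + [key])
            return
        trial = copy.deepcopy(chunks[-1])
        set_nested(trial, path, node)
        if chunks[-1] and len(json.dumps(trial)) > max_chunk_size:
            fresh = {}                   # leaf does not fit: start a new chunk
            set_nested(fresh, path, node)
            chunks.append(fresh)
        else:
            chunks[-1] = trial           # leaf fits (or chunk was empty)

    walk(data, [])
    return [c for c in chunks if c]


example = {"a": {"b": "x" * 10, "c": "y" * 10}, "d": "z" * 10}
for chunk in split_json(example, max_chunk_size=50):
    print(chunk)
```

Every chunk repeats the full key path down to its values, which is why the real splitter's output keeps the `paths` prefix in each chunk. A single oversized string still ends up in one chunk, matching the caveat above about large string values.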


In [36]:
import json
import requests

json_data = requests.get('https://api.smith.langchain.com/openapi.json').json()
json_data

{'openapi': '3.1.0',
 'info': {'title': 'LangSmith', 'version': '0.1.0'},
 'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions'],
    'summary': 'Read Tracer Session',
    'description': 'Get a specific session.',
    'operationId': 'read_tracer_session_api_v1_sessions__session_id__get',
    'security': [{'API Key': []}, {'Tenant ID': []}, {'Bearer Auth': []}],
    'parameters': [{'name': 'session_id',
      'in': 'path',
      'required': True,
      'schema': {'type': 'string', 'format': 'uuid', 'title': 'Session Id'}},
     {'name': 'include_stats',
      'in': 'query',
      'required': False,
      'schema': {'type': 'boolean',
       'default': False,
       'title': 'Include Stats'}},
     {'name': 'accept',
      'in': 'header',
      'required': False,
      'schema': {'anyOf': [{'type': 'string'}, {'type': 'null'}],
       'title': 'Accept'}}],
    'responses': {'200': {'description': 'Successful Response',
      'content': {'application/json': {'sch

In [40]:
from langchain_text_splitters import RecursiveJsonSplitter

json_splitter = RecursiveJsonSplitter(max_chunk_size=100)
json_split_data = json_splitter.split_json(json_data)

In [43]:
for chunk in json_split_data[:3]:
    print(chunk)

{'openapi': '3.1.0', 'info': {'title': 'LangSmith', 'version': '0.1.0'}}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'tags': ['tracer-sessions']}}}}
{'paths': {'/api/v1/sessions/{session_id}': {'get': {'summary': 'Read Tracer Session'}}}}


In [46]:
# The splitter can also output documents
docs = json_splitter.create_documents(texts=[json_data])

In [48]:
for chunk in docs[:3]:
    print(chunk)

page_content='{"openapi": "3.1.0", "info": {"title": "LangSmith", "version": "0.1.0"}}'
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"tags": ["tracer-sessions"]}}}}'
page_content='{"paths": {"/api/v1/sessions/{session_id}": {"get": {"summary": "Read Tracer Session"}}}}'
