![Document Loaders and Splitters](../assets/document-loaders-and-splitters.png)
---


### Learning objective:
By the end of this lesson, you will be able to manage documents with document loaders and splitters. 


### About:  
Documents often need to be loaded and formatted before they can be used in prompts. In this lesson, we will use loaders and splitters to prepare documents from file directories, CSV files, and websites.


### Prerequisites:
- Python (required) 
- Intro to LangChain and prior prompt engineering lessons (required)
- Visual Studio Code (recommended)
- GitHub Copilot lessons (recommended) 

### Contents
1. [Imports](#imports)
1. [Document Loaders](#loaders)
1. [Text Splitters](#splitters)

### Activities
1. [Lab](#lab)


## Installs

You may need to install the following packages:

- `%pip install --upgrade tiktoken`
- `%pip install --upgrade "unstructured[all-docs]"`

<a id='imports'></a>
## Imports

In [1]:
from langchain_openai import ChatOpenAI #openai chatbot
from langchain_core.prompts import ChatPromptTemplate #template for chat prompts
from langchain_core.output_parsers import StrOutputParser #output parser for string output 

<a id='loaders'></a>
## Document Loaders
Document loaders allow you to bring in text from another source, such as a file on your computer, data from a website, or even YouTube video transcripts. 
LangChain offers functionality to load and store those documents so that they can be used by the language model.

[LangChain Docs](https://python.langchain.com/docs/modules/data_connection/document_loaders/)
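To make the idea concrete, here is a minimal sketch of what a loader does, written with only the standard library. `SimpleDocument` and `load_text_directory` are hypothetical names used for illustration and are not LangChain classes; they only mirror the `page_content` and `metadata` fields that appear in the loader outputs in this lesson.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_directory(directory: str) -> list[SimpleDocument]:
    """Read every .txt file in a directory into a SimpleDocument."""
    documents = []
    # sorted() gives a stable order for this sketch.
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        # Record where the text came from so it can be traced later.
        documents.append(SimpleDocument(text, {"source": str(path)}))
    return documents


# Demo on a temporary folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "notes.txt").write_text("Meeting Notes", encoding="utf-8")
    docs = load_text_directory(tmp)
    print(docs[0].page_content, docs[0].metadata)
```

Keeping the source path in the metadata is what later lets you tell which file a chunk came from.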

### Supported loaders:
- CSV
- File Directory
- HTML
- JSON
- Markdown
- PDF


Additionally, LangChain offers multiple third-party integrations to help with loading documents of different formats and from common sites (like YouTube and Hacker News).

[Integrations](https://python.langchain.com/docs/integrations/document_loaders/)


We will work with several documents and load them in for this lab. 

#### Data sources stored in assets folder:
- **CSV Source:** Fanaee-T, Hadi, and Gama, Joao, "Event labeling combining ensemble detectors and background knowledge", Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.
- **Webloader Source:** [GPT-4 article](https://arxiv.org/html/2303.08774v4)
- **Markdown Source:** [OpenAI Github README.md example](https://github.com/openai/openai-cookbook?tab=readme-ov-file)
- **File Directory Source:** Sample meeting-notes text files generated by GPT and stored in the "my_docs" folder

In each example below, you will see the import, the loader, and a print statement to view the loaded documents.

In [2]:
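# Optional workaround: uncomment this cell if the NLTK data download fails with an SSL certificate error.
# import nltk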
# import ssl

# try:
#     _create_unverified_https_context = ssl._create_unverified_context
# except AttributeError:
#     pass
# else:
#     ssl._create_default_https_context = _create_unverified_https_context

# nltk.download('all')

In [3]:
# File Directory
## Note: You'll see this code again in the summarization lab! 
from langchain.document_loaders import DirectoryLoader
loader = DirectoryLoader('assets/my_docs')
docs = loader.load()
print(docs)

[Document(page_content='Meeting Notes - July 20, 2021\n\nDiscussed monthly sales targets and progress towards meeting them\n\nReviewed marketing campaigns and analyzed their effectiveness\n\nIdentified potential leads and discussed strategies for converting them into customers\n\nDiscussed customer complaints and identified areas for improvement in customer service\n\nAgreed on action items and assigned responsibilities for addressing the issues raised', metadata={'source': 'assets/my_docs/document_3.txt'}), Document(page_content='Meeting Notes - May 5, 2021\n\nReviewed customer feedback and identified areas for improvement\n\nBrainstormed ideas for new product features and enhancements\n\nAssigned development tasks to team members and set deadlines\n\nDiscussed marketing strategy to promote the new features\n\nAgreed on next steps and scheduled a demo session for the updated product', metadata={'source': 'assets/my_docs/document_2.txt'}), Document(page_content='Meeting Notes - March 1

In [5]:
# WebBaseLoader 
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://arxiv.org/html/2303.08774v4")
docs = loader.load()
print(docs)


[Document(page_content='\n\n\n\nGPT-4 Technical Report\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n1 Introduction\n2 Scope and Limitations of this Technical Report\n\n3 Predictable Scaling\n\n3.1 Loss Prediction\n3.2 Scaling of Capabilities on HumanEval\n\n\n\n4 Capabilities\n\n4.1 Visual Inputs\n\n\n5 Limitations\n6 Risks & mitigations\n7 Conclusion\n\nA Exam Benchmark Methodology\n\nA.1 Sourcing.\nA.2 Prompting: multiple-choice\nA.3 Prompting: free-response\nA.4 Images\nA.5 Scoring\nA.6 Codeforces rating\nA.7 Model snapshot details\n\nA.8 Example few-shot prompts\n\nExample prompt for a multiple choice exam\nExample prompt for a free-response question\n\n\n\n\nB Impact of RLHF on capability\nC Contamination on professional and academic exams\nD Contamination on academic benchmarks\nE GSM-8K in GPT-4 training\nF Multilingual MMLU\nG Examples of GPT-4 Visual Input\nH System Card\n\n\n\n\n\n\n\n\n\n\n\nHTML conversions sometimes display errors due to content that did not convert correctly from the sou

In [6]:
# Markdown File 
from langchain.document_loaders import UnstructuredMarkdownLoader
markdown_path = "assets/markdown_example.md"
loader = UnstructuredMarkdownLoader(markdown_path)
data = loader.load()
print(data)

[Document(page_content="✨ Navigate at cookbook.openai.com\n\nExample code and guides for accomplishing common tasks with the OpenAI API. To run these examples, you'll need an OpenAI account and associated API key (create a free account here).\n\nMost code examples are written in Python, though the concepts can be applied in any language.\n\nFor other useful tools, guides and courses, check out these related resources from around the web.\n\nContributing\n\nThe OpenAI Cookbook is a community-driven resource. Whether you're submitting an idea, fixing a typo, adding a new guide, or improving an existing one, your contributions are greatly appreciated!\n\nBefore contributing, read through the existing issues and pull requests to see if someone else is already working on something similar. That way you can avoid duplicating efforts.\n\nIf there are examples or guides you'd like to see, feel free to suggest them on the issues page.\n\nIf you'd like to contribute new content, make sure to rea
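For CSV data, a common approach is to turn each row into its own document. LangChain provides a `CSVLoader` for this; the sketch below does not call it and only illustrates the row-per-document idea with the standard library. The function name `csv_rows_to_documents`, the sample columns, and the file name `sample.csv` are all hypothetical.

```python
import csv
import io


def csv_rows_to_documents(csv_text: str, source: str) -> list[dict]:
    """Turn each CSV row into one document-like dict."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader):
        # Render the row as "column: value" lines so a model can read it as text.
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_number}}
        )
    return documents


# Hypothetical two-row sample; the column names are made up for illustration.
sample_csv = "city,temperature\nLisbon,21\nOslo,9\n"
csv_docs = csv_rows_to_documents(sample_csv, source="sample.csv")
print(csv_docs[0]["page_content"])
print(csv_docs[1]["metadata"])
```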

<a id='splitters'></a>
## Text Splitters

Large documents (or collections of documents) often need to be split, or chunked, into pieces that are meaningful for your application and manageable for the language model you are using. **Reminder:** language models limit the size of what you can pass to them and get back from them, e.g., GPT-4 Turbo has a context window of 128k tokens, which covers both the prompt and the response.

[docs](https://python.langchain.com/docs/modules/data_connection/document_transformers/)
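The budget arithmetic behind that reminder can be sketched in a few lines. The helper names and the token counts in the example calls are assumptions for illustration, and the chunk estimate assumes fixed-size chunks; splitters that break on separators will usually produce somewhat more chunks than this.

```python
import math

CONTEXT_WINDOW = 128_000  # tokens, the GPT-4 Turbo figure quoted above


def fits_in_context(prompt_tokens: int, response_tokens: int,
                    context_window: int = CONTEXT_WINDOW) -> bool:
    """The prompt and the response share one token budget."""
    return prompt_tokens + response_tokens <= context_window


def chunks_needed(total_tokens: int, chunk_size: int, chunk_overlap: int = 0) -> int:
    """Estimate how many fixed-size chunks cover a text of total_tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if total_tokens <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # new tokens contributed by each later chunk
    return 1 + math.ceil((total_tokens - chunk_size) / step)


# Hypothetical numbers: a 127k-token prompt leaves too little room for a 4k-token reply.
print(fits_in_context(prompt_tokens=127_000, response_tokens=4_000))
print(chunks_needed(total_tokens=2_500, chunk_size=1_000, chunk_overlap=100))
```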


### How text splitters work
- Split the text into small, semantically meaningful pieces. Sentences are a common choice.
- Combine those pieces into larger chunks until a defined size is reached, as measured by a length function of your choice.
- Create a new “document” from that chunk and continue through the text. When a chunk overlap is set, each chunk can share some text with the previous chunk and the following chunk.

### You control
- What text is split on 
- How chunks are created 
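The steps and the two controls listed above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: `split_with_overlap` is a hypothetical helper, it uses a single separator, and it does not break up a piece that is longer than `chunk_size` on its own.

```python
from typing import Callable


def split_with_overlap(
    text: str,
    separator: str = "\n",
    chunk_size: int = 75,
    chunk_overlap: int = 20,
    length_function: Callable[[str], int] = len,
) -> list[str]:
    """Split on a separator, then greedily merge pieces into overlapping chunks."""
    # Step 1: split into small pieces (what the text is split on).
    pieces = [piece for piece in text.split(separator) if piece.strip()]
    chunks: list[str] = []
    current: list[str] = []  # pieces in the chunk being built

    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and length_function(candidate) > chunk_size:
            # Step 2: the chunk is full, so emit it.
            chunks.append(separator.join(current))
            # Step 3: keep only trailing pieces that fit in the overlap budget
            # and still leave room for the next piece.
            while current and (
                length_function(separator.join(current)) > chunk_overlap
                or length_function(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "alpha\nbeta\ngamma\ndelta"
result = split_with_overlap(sample, chunk_size=11, chunk_overlap=5)
print(result)
```

Passing a different `separator` changes what the text is split on, and passing a different `length_function` (for example, one that counts tokens instead of characters) changes how chunk size is measured.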

### LangChain Supported splitters
- [docs](https://python.langchain.com/docs/modules/data_connection/document_transformers/)
- We will focus on the [recursive splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter), which is recommended for generic text
- We will also use a [token splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/split_by_token)




### Recursive splitter (recommended for generic text)

In [7]:
nursery_rhyme = """
Hey, diddle, diddle,
The cat and the fiddle,
The cow jumped over the moon;
The little dog laughed
To see such sport,
And the dish ran away with the spoon.
"""

In [8]:
# Recursive splitter (recommended for generic text)
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=75,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

In [9]:
# examine split documents
texts = text_splitter.create_documents([nursery_rhyme])
print(texts[0])
print(texts[1])


page_content='Hey, diddle, diddle,\nThe cat and the fiddle,\nThe cow jumped over the moon;' metadata={}
page_content='The little dog laughed\nTo see such sport,' metadata={}


In [10]:
text_splitter.split_text(nursery_rhyme)

['Hey, diddle, diddle,\nThe cat and the fiddle,\nThe cow jumped over the moon;',
 'The little dog laughed\nTo see such sport,',
 'To see such sport,\nAnd the dish ran away with the spoon.']
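The output above shows how `chunk_size=75` and `chunk_overlap=20` interact. The short check below works directly on the three chunks printed above; `shared_lines` is a hypothetical helper written for this illustration.

```python
chunks = [
    "Hey, diddle, diddle,\nThe cat and the fiddle,\nThe cow jumped over the moon;",
    "The little dog laughed\nTo see such sport,",
    "To see such sport,\nAnd the dish ran away with the spoon.",
]


def shared_lines(previous: str, following: str) -> list[str]:
    """Return the trailing lines of `previous` that also start `following`."""
    prev_lines = previous.split("\n")
    next_lines = following.split("\n")
    # Try the longest possible overlap first, then shrink.
    for size in range(min(len(prev_lines), len(next_lines)), 0, -1):
        if prev_lines[-size:] == next_lines[:size]:
            return prev_lines[-size:]
    return []


# Every chunk stays within the 75-character limit.
for index, chunk in enumerate(chunks):
    print(index, len(chunk))

# Compare each chunk with the one that follows it.
overlaps = [shared_lines(a, b) for a, b in zip(chunks, chunks[1:])]
print(overlaps)
```

Only the second and third chunks share text. This is consistent with overlap being built from whole lines: "To see such sport," is 18 characters and fits within the 20-character overlap, while the last line of the first chunk is longer than 20 characters and is not carried over.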

### Token splitter 
Language models have token limits, so it can be helpful to split on tokens when planning to pass docs to an LLM. [docs](https://python.langchain.com/docs/modules/data_connection/document_transformers/split_by_token)

Note: This code is also used in the summarization lab from the Advanced Prompting module. 
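One simple way to split on tokens is a sliding window over the token sequence. The sketch below shows that general idea only; it is not a description of what `CharacterTextSplitter.from_tiktoken_encoder` does internally. Whitespace-separated words stand in for model tokens here (an assumption made so the example runs without a tokenizer), and `split_by_token_window` is a hypothetical helper.

```python
def split_by_token_window(
    tokens: list[str], chunk_size: int, chunk_overlap: int = 0
) -> list[list[str]]:
    """Slide a fixed-size window over a token list, stepping by size minus overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end of the token list.
        if start + chunk_size >= len(tokens):
            break
    return windows


# Stand-in tokenizer: split on whitespace instead of using a real model tokenizer.
tokens = "the cow jumped over the moon".split()
windows = split_by_token_window(tokens, chunk_size=4, chunk_overlap=1)
print(windows)
```

With a real tokenizer, the same windowing would run over token IDs, and each window would be decoded back to text before being passed to the model.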

In [11]:
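# Token splitter example: CharacterTextSplitter with chunk size measured in tiktoken tokens
# (`docs` holds the documents loaded earlier in this lesson)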
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=0
)
split_docs = text_splitter.split_documents(docs)
# print(split_docs[1])

<a id='lab'></a>
## Lab

1. Load the documents from the `assets/my_docs` folder and split them with a token splitter
2. Load a document from Wikipedia, "https://en.wikipedia.org/wiki/Ancient_Rome", and split it with the recursive text splitter

In [12]:
#1a loader 
loader = DirectoryLoader('assets/my_docs')
docs = loader.load()
len(docs)

5

In [13]:
#1b splitter 
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=10
)
split_docs = text_splitter.split_documents(docs)
print(split_docs)

[Document(page_content='Meeting Notes - July 20, 2021\n\nDiscussed monthly sales targets and progress towards meeting them\n\nReviewed marketing campaigns and analyzed their effectiveness\n\nIdentified potential leads and discussed strategies for converting them into customers\n\nDiscussed customer complaints and identified areas for improvement in customer service\n\nAgreed on action items and assigned responsibilities for addressing the issues raised', metadata={'source': 'assets/my_docs/document_3.txt'}), Document(page_content='Meeting Notes - May 5, 2021\n\nReviewed customer feedback and identified areas for improvement\n\nBrainstormed ideas for new product features and enhancements\n\nAssigned development tasks to team members and set deadlines\n\nDiscussed marketing strategy to promote the new features\n\nAgreed on next steps and scheduled a demo session for the updated product', metadata={'source': 'assets/my_docs/document_2.txt'}), Document(page_content='Meeting Notes - March 1

In [15]:
#2a web loader
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://en.wikipedia.org/wiki/Ancient_Rome")
pages = loader.load()
print(pages)

[Document(page_content='\n\n\n\nAncient Rome - Wikipedia\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJump to content\n\n\n\n\n\n\n\nMain menu\n\n\n\n\n\nMain menu\nmove to sidebar\nhide\n\n\n\n\t\tNavigation\n\t\n\n\nMain pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate\n\n\n\n\n\n\t\tContribute\n\t\n\n\nHelpLearn to editCommunity portalRecent changesUpload file\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCreate account\n\nLog in\n\n\n\n\n\n\n\n\nPersonal tools\n\n\n\n\n\n Create account Log in\n\n\n\n\n\n\t\tPages for logged out editors learn more\n\n\n\nContributionsTalk\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nContents\nmove to sidebar\nhide\n\n\n\n\n(Top)\n\n\n\n\n\n1Early Italy and the founding of Rome\n\n\n\n\n\n\n\n2Kingdom\n\n\n\n\n\n\n\n3Republic\n\n\n\nToggle Republic subsection\n\n\n\n\n\n3.1Punic Wars\n\n\n\n\n\n\n\n\n\

In [16]:
#2b splitter 
text_splitter = RecursiveCharacterTextSplitter(
    # A larger chunk size suits a full web page.
    chunk_size=1000,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
split_docs = text_splitter.split_documents(pages)
print(split_docs)

[Document(page_content='Ancient Rome - Wikipedia\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJump to content\n\n\n\n\n\n\n\nMain menu\n\n\n\n\n\nMain menu\nmove to sidebar\nhide\n\n\n\n\t\tNavigation\n\t\n\n\nMain pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate\n\n\n\n\n\n\t\tContribute\n\t\n\n\nHelpLearn to editCommunity portalRecent changesUpload file\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCreate account\n\nLog in\n\n\n\n\n\n\n\n\nPersonal tools\n\n\n\n\n\n Create account Log in\n\n\n\n\n\n\t\tPages for logged out editors learn more\n\n\n\nContributionsTalk\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nContents\nmove to sidebar\nhide\n\n\n\n\n(Top)\n\n\n\n\n\n1Early Italy and the founding of Rome\n\n\n\n\n\n\n\n2Kingdom\n\n\n\n\n\n\n\n3Republic\n\n\n\nToggle Republic subsection\n\n\n\n\n\n3.1Punic Wars\n\n\n\n\n\n\n\n\n\n4Late R