![Document Loaders and Splitters](../assets/document-loaders-and-splitters.png)
---


### Learning objective:
By the end of this lesson, you will be able to manage documents with document loaders and splitters. 


### About:  
Documents often need to be loaded and formatted before they can be used in prompts. In this lesson, we will use loaders and splitters to prepare documents from file directories, CSV files, and websites.


### Prerequisites:
- Python (required) 
- Intro to LangChain and prior prompt engineering lessons (required)
- Visual Studio Code (recommended)
- GitHub Copilot lessons (recommended) 

### Contents
1. [Imports](#imports)
1. [Document Loaders](#loaders)
1. [Text Splitters](#splitters)

### Activities
1. [Lab](#lab)


## Installs

You may need to install the following packages:


- %pip install --upgrade tiktoken
- %pip install --upgrade "unstructured[all-docs]"

<a id='imports'></a>
## Imports

In [None]:
from langchain_openai import ChatOpenAI  # OpenAI chat model
from langchain_core.prompts import ChatPromptTemplate  # template for chat prompts
from langchain_core.output_parsers import StrOutputParser  # parses model output into a string

<a id='loaders'></a>
## Document Loaders
Document loaders allow you to bring in text from another source, such as a file on your computer, data from a website, or even YouTube video transcripts. 
LangChain offers functionality to load and store those documents so that they can be used by the language model.

[LangChain Docs](https://python.langchain.com/docs/modules/data_connection/document_loaders/)
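
To make the idea of a loader concrete, here is a minimal standard-library sketch of what one does: read a source and return a list of document objects, each holding text plus metadata. `SimpleDocument`, `load_csv_rows`, and the sample data are hypothetical names and values made up for illustration. This is not LangChain's implementation, only a rough picture of the "one document per CSV row" pattern.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_csv_rows(csv_text: str, source: str) -> list[SimpleDocument]:
    """Turn each CSV row into one document made of 'column: value' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            SimpleDocument(
                page_content=content,
                metadata={"source": source, "row": row_number},
            )
        )
    return documents


# Made-up sample data, used only for illustration.
sample_csv = "day,riders\nMonday,120\nTuesday,98\n"
docs_sketch = load_csv_rows(sample_csv, source="sample.csv")
for doc in docs_sketch:
    print(doc.metadata, repr(doc.page_content))
```

Keeping the source and row number in the metadata makes it possible to trace a piece of text back to where it came from after it has been split.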

### Supported loaders:
- CSV
- File Directory
- HTML
- JSON
- Markdown
- PDF


Additionally, LangChain offers many third-party integrations to help with loading documents in other formats and from common sites (such as YouTube and Hacker News).

[Integrations](https://python.langchain.com/docs/integrations/document_loaders/)


In this lab, we will load several documents from different sources.

#### Data sources stored in assets folder:
- **CSV Source:** Fanaee-T, Hadi, and Gama, Joao, "Event labeling combining ensemble detectors and background knowledge", Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg, doi:10.1007/s13748-013-0040-3.
- **Webloader Source:** [GPT-4 article](https://arxiv.org/html/2303.08774v4)
- **Markdown Source:** [OpenAI GitHub README.md example](https://github.com/openai/openai-cookbook?tab=readme-ov-file)
- **File Directory Source:** Sample Meeting text files generated by GPT and stored in "my_docs" folder

Each example below shows the import, the loader, and a print statement to view the loaded documents.

In [None]:
# File Directory
## Note: You'll see this code again in the summarization lab! 
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader('assets/my_docs')
docs = loader.load()
print(docs)

In [None]:
# WebBaseLoader 
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://arxiv.org/html/2303.08774v4")
docs = loader.load()
print(docs)


In [None]:
# Markdown File 
from langchain_community.document_loaders import UnstructuredMarkdownLoader
markdown_path = "assets/markdown_example.md"
loader = UnstructuredMarkdownLoader(markdown_path)
data = loader.load()
print(data)

<a id='splitters'></a>
## Text Splitters

Large documents (or collections of documents) often need to be split, or chunked, into pieces that are meaningful for your application and manageable for the language model you are using. **Reminder:** language models limit the size of what you can pass to them and get back from them, e.g., GPT-4 Turbo has a context window of 128k tokens, which covers both the prompt and the response.
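
As a rough illustration of why this matters, the sketch below estimates how many separate calls a document would need when the prompt and response share one context window. The 128k figure comes from the reminder above. `RESERVED_FOR_RESPONSE` and `PROMPT_OVERHEAD` are assumed values chosen only for the example, not limits of any particular model.

```python
import math

CONTEXT_WINDOW = 128_000        # total tokens shared by prompt and response
RESERVED_FOR_RESPONSE = 4_000   # assumption: tokens set aside for the answer
PROMPT_OVERHEAD = 500           # assumption: tokens used by instructions


def chunks_needed(document_tokens: int) -> int:
    """Estimate how many chunks a document must be split into to fit the window."""
    budget = CONTEXT_WINDOW - RESERVED_FOR_RESPONSE - PROMPT_OVERHEAD
    return math.ceil(document_tokens / budget)


print(chunks_needed(50_000))    # small enough for a single call
print(chunks_needed(300_000))   # must be split across several calls
```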

[docs](https://python.langchain.com/docs/modules/data_connection/document_transformers/)


### How text splitters work
- Split into small, semantically meaningful pieces. Sentences are common. 
- Combine those into meaningful chunks as defined by a function of your choice with a defined size.
- Create a new “document” from that chunk and continue through the text. When an overlap is configured, each chunk shares some text with the previous chunk and the following chunk.
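
The three steps above can be sketched in plain Python. This is a simplified illustration, not LangChain's algorithm: `split_with_overlap` is a hypothetical helper, it measures size in characters, and it expresses overlap as a number of repeated sentences rather than a number of characters.

```python
import re


def split_with_overlap(text: str, chunk_size: int, overlap_sentences: int = 1) -> list[str]:
    """Greedy sketch: pack sentences into chunks, repeating the last few as overlap."""
    # Step 1: split into small pieces (here: sentences ending in ., !, or ?).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        # Step 2: close the chunk once adding a sentence would exceed chunk_size.
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Step 3: start the next chunk with the tail of the previous one.
            # Carried-over sentences count toward the next chunk, so a chunk
            # can still exceed chunk_size when sentences are long.
            current = current[-overlap_sentences:] if overlap_sentences else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


sample_text = (
    "Loaders read files. Splitters cut text. "
    "Chunks feed models. Overlap keeps context."
)
sentence_chunks = split_with_overlap(sample_text, chunk_size=45)
for chunk in sentence_chunks:
    print(repr(chunk))
```

With one sentence of overlap, each chunk begins with the sentence that ended the previous chunk, which helps preserve context across chunk boundaries.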

### You control
- What text is split on 
- How chunks are created 

### LangChain Supported splitters
- [docs](https://python.langchain.com/docs/modules/data_connection/document_transformers/)
- In this lab, we will focus on the [recursive splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter), which is recommended for generic text
- We will also use a [token splitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/split_by_token)
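
The idea behind a recursive splitter is to try a list of separators from coarsest to finest, falling back to a finer one only when a piece is still too long. The sketch below illustrates that idea in plain Python. `recursive_split` and its separator list are assumptions made for this example, and the sketch omits features of the real splitter such as chunk overlap.

```python
def recursive_split(text: str, chunk_size: int, separators=("\n\n", "\n", " ")) -> list[str]:
    """Simplified sketch: split on the coarsest separator, recurse on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to hard cuts every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    separator, remaining = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: try the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


rhyme = (
    "Hey, diddle, diddle,\n"
    "The cat and the fiddle,\n"
    "The cow jumped over the moon;\n"
    "The little dog laughed\n"
    "To see such sport,\n"
    "And the dish ran away with the spoon."
)
rhyme_chunks = recursive_split(rhyme, chunk_size=75)
for chunk in rhyme_chunks:
    print(len(chunk), repr(chunk))
```

Because line breaks are tried before spaces, the rhyme is cut between lines rather than in the middle of one.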




### Recursive splitter recommended for generic text

In [None]:
nursery_rhyme = """
Hey, diddle, diddle,
The cat and the fiddle,
The cow jumped over the moon;
The little dog laughed
To see such sport,
And the dish ran away with the spoon.
"""

In [None]:
# Recursive splitter recommended for generic text
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=75,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

In [None]:
# examine split documents
texts = text_splitter.create_documents([nursery_rhyme])
print(texts[0])
print(texts[1])


In [None]:
text_splitter.split_text(nursery_rhyme)[:]

### Token splitter 
Language models have token limits, so it can be helpful to split on tokens when planning to pass docs to an LLM. [docs](https://python.langchain.com/docs/modules/data_connection/document_transformers/split_by_token)

Note: This code is also used in the summarization lab from the Advanced Prompting module. 

In [None]:
# Character text splitter that measures chunk size in tokens (via tiktoken)
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=0
)
# `docs` is the list of documents returned by one of the loader cells above
split_docs = text_splitter.split_documents(docs)
# print(split_docs[1])
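
To show what measuring chunk size in tokens means, here is a standard-library sketch that cuts text into windows of a fixed number of tokens. It is a different and simpler strategy than the cell above. `toy_tokenize` is a hypothetical tokenizer that treats words and punctuation marks as tokens, whereas real models use subword tokenizers such as tiktoken, so the counts here are only illustrative.

```python
import re


def toy_tokenize(text: str) -> list[str]:
    """Hypothetical tokenizer: each word or punctuation mark is one token."""
    return re.findall(r"\w+|[^\w\s]", text)


def split_by_tokens(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Slide a window of chunk_size tokens over the text, stepping by size minus overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = toy_tokenize(text)
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        # Joining with spaces is lossy: original whitespace is not preserved.
        chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the text
    return chunks


sample = "The cow jumped over the moon; the little dog laughed."
token_chunks = split_by_tokens(sample, chunk_size=5, chunk_overlap=1)
for chunk in token_chunks:
    print(len(toy_tokenize(chunk)), repr(chunk))
```

Counting tokens instead of characters gives a closer estimate of how much of the model's context window each chunk will use.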

<a id='lab'></a>
## Lab

1. Load the documents from the "my_docs" folder and split them with a token splitter 
2. Load a document from Wikipedia, "https://en.wikipedia.org/wiki/Ancient_Rome", and split it with the recursive text splitter 

In [None]:
#1a loader 


In [None]:
#1b splitter 


In [None]:
#2a web loader


In [None]:
#2b splitter 
