## Set Up the Environment

In [1]:
%run setup.ipynb

## Document Loaders

Document loaders are used to import data from various sources into LangChain as `Document` objects. A `Document` typically includes a piece of text along with its associated metadata.

### Examples of Document Loaders:

- **Text File Loader:** Loads data from a simple `.txt` file.
- **Web Page Loader:** Retrieves the text content from any web page.
- **YouTube Video Transcript Loader:** Loads transcripts from YouTube videos.

### Functionality:

- **Load Method:** Each document loader has a `load` method that enables the loading of data as documents from a pre-configured source.
- **Lazy Load Option:** Some loaders also support a "lazy load" feature, which allows data to be loaded into memory gradually as needed.
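
The difference between `load` and a lazy load can be sketched without any LangChain dependency. The block below is a minimal illustration only: `Doc` and `LineLoader` are hypothetical stand-ins written for this example, not LangChain classes, and the one-document-per-line behavior is an assumption chosen to keep the sketch short.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Iterator
import tempfile


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class LineLoader:
    """Hypothetical loader that emits one Doc per non-empty line of a text file."""

    def __init__(self, path: str) -> None:
        self.path = path  # the pre-configured source

    def lazy_load(self) -> Iterator[Doc]:
        # Yield documents one at a time instead of building a full list.
        with open(self.path, encoding="utf-8") as fh:
            for line_no, line in enumerate(fh, start=1):
                if line.strip():
                    yield Doc(line.strip(), {"source": self.path, "line": line_no})

    def load(self) -> list[Doc]:
        # Eager variant: materialize everything lazy_load would yield.
        return list(self.lazy_load())


# Usage: write a small sample file, then read it both ways.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("first line\n\nsecond line\n", encoding="utf-8")
    loader = LineLoader(str(sample))

    all_docs = loader.load()      # every document at once

    gen = loader.lazy_load()      # documents on demand
    first_doc = next(gen)
    gen.close()                   # release the file handle early
```

Building `load` on top of `lazy_load` keeps the two paths consistent: the eager call is simply the lazy iterator collected into a list.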

For more detailed information, visit [LangChain's document loader documentation](https://python.langchain.com/docs/modules/data_connection/document_loaders/).


### Markdown Loader

* Markdown is a lightweight markup language for creating formatted text in a plain-text editor.

* This section shows how to load Markdown files as LangChain `Document` objects that can be used in pipelines and chains.

* `UnstructuredMarkdownLoader` can return the whole file as a single `Document` (`mode='single'`) or split it into one `Document` per element (`mode="elements"`).


#### Download NLTK packages if needed

In [3]:
import nltk
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/sourav.banerjee/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/sourav.banerjee/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


True

### Load the Document as a Single Section

In [4]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader

# mode='single' returns the whole file as one Document
loader = UnstructuredMarkdownLoader("../../docs/README.md", mode='single')
docs = loader.load()

In [7]:
print(f"\nThe number of documents : {len(docs)}\n")


The number of documents : 1



In [8]:
print(f"\nType of the first document : {type(docs[0])}\n")


Type of the first document : <class 'langchain_core.documents.base.Document'>



In [9]:
type(docs[0])

langchain_core.documents.base.Document

In [10]:
print(docs[0].metadata)

{'source': '../../docs/README.md'}


In [11]:
print(docs[0].page_content[:100])

🦜️🔗 LangChain

⚡ Build context-aware reasoning applications ⚡

Release Notes

CI

PyPI - License

Py


### Load the Document and Split It into Elements

In [12]:
loader = UnstructuredMarkdownLoader("../../docs/README.md", mode="elements")
docs = loader.load()
print(f"\nThe number of documents : {len(docs)}\n")


The number of documents : 75



In [13]:
docs[:10]

[Document(metadata={'source': '../../docs/README.md', 'category_depth': 0, 'languages': ['eng'], 'file_directory': '../../docs', 'filename': 'README.md', 'filetype': 'text/markdown', 'last_modified': '2025-05-30T10:16:46', 'category': 'Title', 'element_id': '200b8a7d0dd03f66e4f13456566d2b3a'}, page_content='🦜️🔗 LangChain'),
 Document(metadata={'source': '../../docs/README.md', 'languages': ['eng'], 'file_directory': '../../docs', 'filename': 'README.md', 'filetype': 'text/markdown', 'last_modified': '2025-05-30T10:16:46', 'parent_id': '200b8a7d0dd03f66e4f13456566d2b3a', 'category': 'NarrativeText', 'element_id': '80d06543c0c2b75ca147f3509e518a47'}, page_content='⚡ Build context-aware reasoning applications ⚡'),
 Document(metadata={'source': '../../docs/README.md', 'image_url': 'https://img.shields.io/github/release/langchain-ai/langchain?style=flat-square', 'link_texts': ['Release Notes'], 'link_urls': ['https://github.com/langchain-ai/langchain/releases'], 'languages': ['eng'], 'file_

In [14]:
from collections import Counter
Counter([doc.metadata['category'] for doc in docs])

Counter({'ListItem': 26,
         'NarrativeText': 17,
         'Title': 13,
         'Image': 12,
         'UncategorizedText': 7})
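
The `category` key counted above can also be used to group or filter elements, for example to keep only titles. The following is a minimal sketch under stated assumptions: `Doc` is a hypothetical stand-in with the same two attributes (`page_content`, `metadata`) as the documents in this notebook, `group_by_category` is a helper name invented for illustration, and the three sample entries copy the first three elements shown in the outputs of this notebook.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in exposing page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_category(docs):
    """Map each metadata['category'] value to the texts in that category."""
    groups = defaultdict(list)
    for doc in docs:
        # Elements without a category are collected under "Unknown".
        groups[doc.metadata.get("category", "Unknown")].append(doc.page_content)
    return dict(groups)


sample_docs = [
    Doc("🦜️🔗 LangChain", {"category": "Title"}),
    Doc("⚡ Build context-aware reasoning applications ⚡", {"category": "NarrativeText"}),
    Doc("Release Notes", {"category": "Image"}),
]

grouped = group_by_category(sample_docs)
titles = grouped.get("Title", [])
```

Because the helper only reads `.metadata` and `.page_content`, it should work unchanged on the `docs` list loaded with `mode="elements"`.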

In [15]:
docs[0].metadata

{'source': '../../docs/README.md',
 'category_depth': 0,
 'languages': ['eng'],
 'file_directory': '../../docs',
 'filename': 'README.md',
 'filetype': 'text/markdown',
 'last_modified': '2025-05-30T10:16:46',
 'category': 'Title',
 'element_id': '200b8a7d0dd03f66e4f13456566d2b3a'}

In [16]:
docs[0].page_content

'🦜️🔗 LangChain'

In [17]:
docs[1].metadata

{'source': '../../docs/README.md',
 'languages': ['eng'],
 'file_directory': '../../docs',
 'filename': 'README.md',
 'filetype': 'text/markdown',
 'last_modified': '2025-05-30T10:16:46',
 'parent_id': '200b8a7d0dd03f66e4f13456566d2b3a',
 'category': 'NarrativeText',
 'element_id': '80d06543c0c2b75ca147f3509e518a47'}

In [18]:
docs[1].page_content

'⚡ Build context-aware reasoning applications ⚡'

### Comparing Unstructured.io loaders vs LangChain wrapper API

In [1]:
from unstructured.partition.md import partition_md

docs = partition_md(filename="../../docs/README.md")

In [2]:
len(docs)

75

As shown below, the results are no longer LangChain `Document` objects. They are Unstructured element objects such as `Title`, `NarrativeText`, and `Image`.

In [3]:
docs[:10]

[<unstructured.documents.elements.Title at 0x137b5e350>,
 <unstructured.documents.elements.NarrativeText at 0x13c0cc990>,
 <unstructured.documents.elements.Image at 0x13c0dd8d0>,
 <unstructured.documents.elements.Image at 0x13c0ca550>,
 <unstructured.documents.elements.Image at 0x13df3b510>,
 <unstructured.documents.elements.Image at 0x13df3b490>,
 <unstructured.documents.elements.Image at 0x13df3b450>,
 <unstructured.documents.elements.Image at 0x13df3b2d0>,
 <unstructured.documents.elements.Image at 0x13df3b550>,
 <unstructured.documents.elements.Image at 0x13df3b610>]

In [4]:
docs[0].to_dict()

{'type': 'Title',
 'element_id': '200b8a7d0dd03f66e4f13456566d2b3a',
 'text': '🦜️🔗 LangChain',
 'metadata': {'category_depth': 0,
  'languages': ['eng'],
  'file_directory': '../../docs',
  'filename': 'README.md',
  'filetype': 'text/markdown',
  'last_modified': '2025-05-30T10:16:46'}}

In [5]:
docs[1].to_dict()

{'type': 'NarrativeText',
 'element_id': '80d06543c0c2b75ca147f3509e518a47',
 'text': '⚡ Build context-aware reasoning applications ⚡',
 'metadata': {'languages': ['eng'],
  'file_directory': '../../docs',
  'filename': 'README.md',
  'filetype': 'text/markdown',
  'last_modified': '2025-05-30T10:16:46',
  'parent_id': '200b8a7d0dd03f66e4f13456566d2b3a'}}

In [11]:
# Print the text and metadata of each Unstructured element
for i, doc in enumerate(docs, start=1):
    print(f"Document {i}:")
    print(f"Text: {doc.text}")
    print(f"Metadata: {doc.metadata.to_dict()}")


Document 1:
Text: 🦜️🔗 LangChain
Metadata: {'category_depth': 0, 'languages': ['eng'], 'file_directory': '../../docs', 'filename': 'README.md', 'filetype': 'text/markdown', 'last_modified': '2025-05-30T10:16:46'}
Document 2:
Text: ⚡ Build context-aware reasoning applications ⚡
Metadata: {'languages': ['eng'], 'file_directory': '../../docs', 'filename': 'README.md', 'filetype': 'text/markdown', 'last_modified': '2025-05-30T10:16:46', 'parent_id': '200b8a7d0dd03f66e4f13456566d2b3a'}
Document 3:
Text: Release Notes
Metadata: {'image_url': 'https://img.shields.io/github/release/langchain-ai/langchain?style=flat-square', 'link_texts': ['Release Notes'], 'link_urls': ['https://github.com/langchain-ai/langchain/releases'], 'languages': ['eng'], 'file_directory': '../../docs', 'filename': 'README.md', 'filetype': 'text/markdown', 'last_modified': '2025-05-30T10:16:46'}
Document 4:
Text: CI
Metadata: {'image_url': 'https://github.com/langchain-ai/langchain/actions/workflows/check_diffs.yml/badge

#### Convert Unstructured Elements to LangChain Documents

In [12]:
from langchain_core.documents import Document

# Wrap each Unstructured element in a LangChain Document
lc_docs = [
    Document(page_content=doc.text, metadata=doc.metadata.to_dict())
    for doc in docs
]
lc_docs[:10]

[Document(metadata={'category_depth': 0, 'languages': ['eng'], 'file_directory': '../../docs', 'filename': 'README.md', 'filetype': 'text/markdown', 'last_modified': '2025-05-30T10:16:46'}, page_content='🦜️🔗 LangChain'),
 Document(metadata={'languages': ['eng'], 'file_directory': '../../docs', 'filename': 'README.md', 'filetype': 'text/markdown', 'last_modified': '2025-05-30T10:16:46', 'parent_id': '200b8a7d0dd03f66e4f13456566d2b3a'}, page_content='⚡ Build context-aware reasoning applications ⚡'),
 Document(metadata={'image_url': 'https://img.shields.io/github/release/langchain-ai/langchain?style=flat-square', 'link_texts': ['Release Notes'], 'link_urls': ['https://github.com/langchain-ai/langchain/releases'], 'languages': ['eng'], 'file_directory': '../../docs', 'filename': 'README.md', 'filetype': 'text/markdown', 'last_modified': '2025-05-30T10:16:46'}, page_content='Release Notes'),
 Document(metadata={'image_url': 'https://github.com/langchain-ai/langchain/actions/workflows/check_
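
Comparing the outputs above, the metadata produced by `UnstructuredMarkdownLoader` carries three keys that this manual conversion omits: `source`, `category`, and `element_id`. One way to recover them is sketched below. It assumes the element has first been turned into a plain dictionary with `to_dict()`, in the shape printed earlier; `to_wrapper_style` is a hypothetical helper name, and the sample dictionary copies the first element's `to_dict()` output from this notebook.

```python
def to_wrapper_style(element_dict: dict, source: str) -> tuple[str, dict]:
    """Return (page_content, metadata) shaped like the wrapper's output.

    Assumes element_dict has the keys 'type', 'element_id', 'text', and
    'metadata', as in the to_dict() output shown in this notebook.
    """
    metadata = {
        "source": source,                          # path given to the loader
        **element_dict["metadata"],                # Unstructured's own metadata
        "category": element_dict["type"],          # e.g. 'Title'
        "element_id": element_dict["element_id"],
    }
    return element_dict["text"], metadata


# First element, copied from the docs[0].to_dict() output above.
title_element = {
    "type": "Title",
    "element_id": "200b8a7d0dd03f66e4f13456566d2b3a",
    "text": "🦜️🔗 LangChain",
    "metadata": {
        "category_depth": 0,
        "languages": ["eng"],
        "file_directory": "../../docs",
        "filename": "README.md",
        "filetype": "text/markdown",
        "last_modified": "2025-05-30T10:16:46",
    },
}

page_content, metadata = to_wrapper_style(title_element, "../../docs/README.md")
```

With real elements, the pair returned for `doc.to_dict()` could be passed on as `Document(page_content=page_content, metadata=metadata)`.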