## Document loaders
Document loaders in LangChain load documents from a variety of structured and unstructured sources into Document objects.

In [1]:
import os
# Disable pip version check
os.environ['PIP_DISABLE_PIP_VERSION_CHECK'] = '1'
import warnings
warnings.filterwarnings('ignore')

In [2]:
from dotenv import load_dotenv, dotenv_values
import google.generativeai as genai
from IPython.display import Markdown, display
load_dotenv()  # read variables from a local .env file into the environment
my_api_key = os.getenv("GOOGLE_API_KEY")
genai.configure(api_key=my_api_key)

In [3]:
from langchain_google_genai.chat_models import ChatGoogleGenerativeAI
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0.6)

#### Loading CSVs

In [4]:
from langchain_community.document_loaders.csv_loader import CSVLoader

file_path = 'Data/iris.csv'


loader = CSVLoader(file_path=file_path)
data = loader.load()

for record in data[:2]:
    print(record)

page_content='setosa: 1.4\nversicolor: 0.2\nvirginica: 0' metadata={'source': 'Data/iris.csv', 'row': 0}
page_content='setosa: 1.4\nversicolor: 0.2\nvirginica: 0' metadata={'source': 'Data/iris.csv', 'row': 1}


In [5]:
# Customizing parsing: csv_args is forwarded to Python's csv.DictReader

loader = CSVLoader(
    file_path=file_path,
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
        "fieldnames": ["setosa", "versicolor", "virginica"],
    },
)

data = loader.load()
for record in data[:2]:
    print(record)


page_content='setosa: setosa\nversicolor: versicolor\nvirginica: virginica' metadata={'source': 'Data/iris.csv', 'row': 0}
page_content='setosa: 1.4\nversicolor: 0.2\nvirginica: 0' metadata={'source': 'Data/iris.csv', 'row': 1}
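Row 0 in the output above is the header line itself. A minimal sketch of why, using only the standard library and assuming that `csv_args` is handed to `csv.DictReader`: once `fieldnames` is supplied, the reader no longer treats the first line as a header. The two-line `sample` string is made up for illustration.

```python
import csv
from io import StringIO

# Hypothetical two-line CSV: one header line, one data line.
sample = "setosa,versicolor,virginica\n1.4,0.2,0\n"
names = ["setosa", "versicolor", "virginica"]

# No fieldnames: the first line is consumed as the header.
inferred = list(csv.DictReader(StringIO(sample)))

# Explicit fieldnames: the first line comes back as an ordinary row.
explicit = list(csv.DictReader(StringIO(sample), fieldnames=names))

print(len(inferred), len(explicit))
print(explicit[0])
```

If the file already carries a header, one option is to leave `fieldnames` out; another is to drop the first loaded document.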


##### Specifying a column as document source
The "source" key in each Document's metadata can be taken from a column of the CSV. Use the source_column argument to name that column; its value in each row becomes the source of the document created from that row. If source_column is not set, file_path is used as the source for every document created from the CSV file.
This is useful when documents loaded from CSV files feed chains that answer questions and cite their sources.

In [6]:
loader = CSVLoader(file_path=file_path, source_column="virginica")

data = loader.load()
for record in data[:2]:
    print(record)

page_content='setosa: 1.4\nversicolor: 0.2\nvirginica: 0' metadata={'source': '0', 'row': 0}
page_content='setosa: 1.4\nversicolor: 0.2\nvirginica: 0' metadata={'source': '0', 'row': 1}
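The source selection described above can be sketched with the standard library. This illustrates the idea only and is not LangChain's implementation; `rows_to_records`, the file name `links.csv` and the sample data are hypothetical.

```python
import csv
from io import StringIO

def rows_to_records(text, file_path, source_column=None):
    """Turn CSV text into dicts shaped like the documents printed above."""
    records = []
    for i, row in enumerate(csv.DictReader(StringIO(text))):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        # Use the chosen column as the source, else fall back to the file path.
        source = row[source_column] if source_column else file_path
        records.append({"page_content": content,
                        "metadata": {"source": source, "row": i}})
    return records

# Hypothetical data with a column that identifies where each row came from.
sample = "name,url\nalpha,https://example.com/a\nbeta,https://example.com/b\n"
default = rows_to_records(sample, "links.csv")
by_url = rows_to_records(sample, "links.csv", source_column="url")

print(default[0]["metadata"])
print(by_url[0]["metadata"])
```

A column holding a URL or record identifier usually makes a more useful source than a numeric measurement, because a question-answering chain can cite it directly.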


In [7]:
# Python's tempfile module can be used when working with CSV strings directly.
import tempfile
from io import StringIO

string_data = """
"Team", "Payroll (millions)", "Wins"
"Nationals",     81.34, 98
"Reds",          82.20, 97
"Yankees",      197.96, 95
"Giants",       117.62, 94
""".strip()

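# delete=False keeps the file on disk after the with block so that CSVLoader
# can reopen it by path; remove it with os.remove(temp_file_path) when done.
with tempfile.NamedTemporaryFile(delete=False, mode="w+") as temp_file: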
    temp_file.write(string_data)
    temp_file_path = temp_file.name

loader = CSVLoader(file_path=temp_file_path)
data = loader.load()
for record in data[:2]:
    print(record)




page_content='Team: Nationals\n"Payroll (millions)": 81.34\n"Wins": 98' metadata={'source': 'C:\\Users\\dpokh\\AppData\\Local\\Temp\\tmpkevgitpw', 'row': 0}
page_content='Team: Reds\n"Payroll (millions)": 82.20\n"Wins": 97' metadata={'source': 'C:\\Users\\dpokh\\AppData\\Local\\Temp\\tmpkevgitpw', 'row': 1}
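The stray quotes in keys such as `"Payroll (millions)"` above come from the space after each comma: Python's csv module only treats a quote as quoting when it is the first character of a field. The sketch below shows the effect of `skipinitialspace=True` with the standard library. Passing the same option to `CSVLoader` through `csv_args` should behave the same way, but that is an assumption and was not run here.

```python
import csv
from io import StringIO

string_data = '''
"Team", "Payroll (millions)", "Wins"
"Nationals",     81.34, 98
"Reds",          82.20, 97
'''.strip()

# Default dialect: the space after the comma becomes part of the field,
# so the quote that follows is kept as a literal character.
plain = list(csv.DictReader(StringIO(string_data)))

# skipinitialspace drops whitespace right after each delimiter,
# which lets the quotes be recognised and removed.
cleaned = list(csv.DictReader(StringIO(string_data), skipinitialspace=True))

print(list(plain[0].keys()))
print(cleaned[0])
```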


#### Loading from a directory
LangChain's DirectoryLoader reads files from disk into LangChain Document objects. This section demonstrates:

- How to load from a filesystem, including use of wildcard patterns;
- How to use multithreading for file I/O;
- How to use custom loader classes to parse specific file types (e.g., code);
- How to handle errors, such as those due to decoding.

To support all document types, install the Python SDK with <b>pip install "unstructured[all-docs]"</b>.

Plain text files, HTML, XML, JSON and emails need no extra dependencies, so <b>pip install unstructured</b> is enough for them.

To process other document types, install only the extras those types require, for example <b>pip install "unstructured[docx,pptx]"</b>.

Install the following system dependencies if they are not already available on your system. Depending on which document types you parse, you may not need all of them.

libmagic-dev (filetype detection)

poppler-utils (images and PDFs)

tesseract-ocr (images and PDFs, install tesseract-lang for additional language support)

libreoffice (MS Office docs)

pandoc (EPUBs, RTFs and Open Office docs). Please note that to handle RTF files, you need version 2.14.2 or newer. Running either make install-pandoc or ./scripts/install-pandoc.sh will install the correct version for you.
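Before installing anything, it can help to see which of the tools listed above are already on the PATH. The sketch below uses `shutil.which`; the executable names are assumptions about how each package is usually exposed and may differ on your platform. libmagic is a shared library rather than a command, so it is left out.

```python
import shutil

# Executable names are assumptions about how each package is typically
# exposed on PATH; adjust them for your platform.
TOOLS = {
    "poppler-utils": "pdftotext",
    "tesseract-ocr": "tesseract",
    "libreoffice": "soffice",
    "pandoc": "pandoc",
}

def find_tools(tools=TOOLS):
    """Map each package name to the resolved executable path, or None."""
    return {package: shutil.which(exe) for package, exe in tools.items()}

for package, path in find_tools().items():
    print(f"{package:15} {'found at ' + path if path else 'not found'}")
```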

In [8]:
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader("Markdown_Dir", glob="**/*.md")
docs = loader.load()
len(docs)

3

In [9]:
print(docs[0].page_content[:100])

(markdown)=

Markdown

Introduction

In this chapter, you'll meet the lightweight markup language ca
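The `glob="**/*.md"` pattern used above can be tried out on its own. The sketch assumes the loader's pattern follows the same rules as `pathlib.Path.glob`; the directory layout is created in a temporary folder purely for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "notes").mkdir()
    (root / "README.md").write_text("# top level", encoding="utf-8")
    (root / "notes" / "intro.md").write_text("# nested", encoding="utf-8")
    (root / "notes" / "data.txt").write_text("not markdown", encoding="utf-8")

    # "**/*.md" matches Markdown files in root and in every subdirectory.
    recursive = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.md"))
    # "*.md" only matches files directly inside root.
    top_only = sorted(p.relative_to(root).as_posix() for p in root.glob("*.md"))

print(recursive)
print(top_only)
```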


##### Show a progress bar
By default a progress bar will not be shown. To show a progress bar, install the tqdm library (e.g. pip install tqdm), and set the show_progress parameter to True.

##### Use multithreading
By default, loading happens in a single thread. To use several threads, set the use_multithreading flag to True.

In [10]:
!pip install tqdm -q

In [11]:
loader = DirectoryLoader("../", glob="**/*.md", show_progress=True, use_multithreading=True)  # Load all the MD files
docs = loader.load()
len(docs)

100%|████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 11.34it/s]


4
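Threads help here because reading files is I/O-bound. The following is a standard-library sketch of the same idea, not the loader's own code; `read_file` and `load_concurrently` are hypothetical helper names.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def read_file(path):
    """Load one file and return (source, text)."""
    return str(path), path.read_text(encoding="utf-8")

def load_concurrently(paths, max_workers=4):
    # Each worker thread reads one file at a time while the others wait on disk.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_file, paths))  # results keep input order

with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(4):
        path = Path(tmp) / f"doc{i}.md"
        path.write_text(f"# Document {i}", encoding="utf-8")
        paths.append(path)
    loaded = load_concurrently(paths)

print(len(loaded))
```

For a handful of small files the thread pool adds little; the benefit tends to show with many files or slow storage.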

##### Change loader class
By default, DirectoryLoader uses the UnstructuredFileLoader class. To use a different loader, pass the class in the loader_cls keyword argument. The example below uses TextLoader:

In [12]:
from langchain_community.document_loaders import TextLoader

loader = DirectoryLoader("../", glob="**/*.md", loader_cls=TextLoader)
docs = loader.load()
print(docs[0].page_content[:100])

# langchain-gemini
Gemini 101: Code along with Langchain to explore Google's AI magic.



If you need to load Python source code files, use the PythonLoader:

```python
from langchain_community.document_loaders import PythonLoader

loader = DirectoryLoader("../../../../../", glob="**/*.py", loader_cls=PythonLoader)
```

##### Silent fail
We can pass silent_errors=True to the DirectoryLoader to skip files that could not be loaded and continue the load process.

##### Auto detect encodings
We can also ask TextLoader to auto-detect the file encoding before failing by passing autodetect_encoding=True to the loader class through loader_kwargs.

In [13]:
path = "Data/"
text_loader_kwargs = {"autodetect_encoding": True}
loader = DirectoryLoader(
    path,
    glob="**/*.txt",
    loader_cls=TextLoader,
    silent_errors=True,
    loader_kwargs=text_loader_kwargs,
)
docs = loader.load()
doc_sources = [doc.metadata["source"] for doc in docs]
doc_sources


['Data\\Api.txt']
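The two options above combine naturally: try to decode each file, fall back to another encoding, and skip the file if nothing works. The sketch below is a simplified stand-in that walks a fixed list of candidate encodings instead of detecting them; `read_with_fallback` and `load_all` are hypothetical names.

```python
import tempfile
from pathlib import Path

def read_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Try each encoding in turn; raise the last error if none fits."""
    last_error = None
    for encoding in encodings:
        try:
            return path.read_text(encoding=encoding)
        except UnicodeDecodeError as exc:
            last_error = exc
    raise last_error

def load_all(paths, encodings=("utf-8", "latin-1"), silent_errors=True):
    texts, skipped = {}, []
    for path in paths:
        try:
            texts[path.name] = read_with_fallback(path, encodings)
        except (UnicodeDecodeError, OSError):
            if not silent_errors:
                raise
            skipped.append(path.name)  # note the failure and keep going
    return texts, skipped

with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "utf8.txt"
    good.write_text("plain text", encoding="utf-8")
    legacy = Path(tmp) / "latin1.txt"
    legacy.write_bytes("café".encode("latin-1"))  # not valid UTF-8
    texts, skipped = load_all([good, legacy])
    strict, strict_skipped = load_all([good, legacy], encodings=("utf-8",))

print(texts, skipped)
print(strict, strict_skipped)
```

With the fallback encoding both files load; with UTF-8 only, the legacy file is skipped instead of aborting the whole run.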

#### Parsing/Loading HTML

In [None]:
# Loading HTML with Unstructured
from langchain_community.document_loaders import UnstructuredHTMLLoader
htmlpath = "Data/mdguide.html"
loader = UnstructuredHTMLLoader(htmlpath)  # load a single HTML file
data = loader.load()
print(len(data))
print(data[:10])

In [15]:
# Loading HTML with BeautifulSoup4

In [16]:
!pip install beautifulsoup4 -q

In [None]:
from langchain_community.document_loaders import BSHTMLLoader

loader = BSHTMLLoader(htmlpath, open_encoding="utf-8")
data = loader.load()

print(data)
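Since the HTML cells above show no output, here is a rough standard-library sketch of the kind of record an HTML loader produces: the page text as content and the page title in the metadata. It is not how either loader is implemented; `TextAndTitle`, the sample markup and the `example.html` source name are hypothetical.

```python
from html.parser import HTMLParser

class TextAndTitle(HTMLParser):
    """Collect visible text and the <title>, skipping script/style bodies."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._stack = []  # currently open tags

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop up to and including the matching open tag.
            while self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        current = self._stack[-1] if self._stack else ""
        if current == "title":
            self.title = text
        elif current not in ("script", "style"):
            self.chunks.append(text)

html = ("<html><head><title>Guide</title><style>p{}</style></head>"
        "<body><h1>Markdown</h1><p>Basic syntax.</p></body></html>")
parser = TextAndTitle()
parser.feed(html)
record = {
    "page_content": "\n".join(parser.chunks),
    "metadata": {"source": "example.html", "title": parser.title},
}
print(record)
```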

#### Loading JSON

JSON (JavaScript Object Notation) is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values).

JSON Lines is a file format where each line is a valid JSON value.
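As a quick illustration of the JSON Lines format, each line can be parsed independently with the standard library. The `jsonl_text` payload is made up for this sketch.

```python
import json

# Hypothetical JSON Lines payload: one complete JSON value per line.
jsonl_text = '{"id": 1, "text": "first"}\n{"id": 2, "text": "second"}\n'

# Parse line by line, ignoring blank lines.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

print(records)
```

Because every line stands alone, a JSON Lines file can be processed as a stream without loading the whole file first.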

LangChain implements a JSONLoader to convert JSON and JSONL data into LangChain Document objects. It uses a specified jq schema to parse the JSON files, allowing for the extraction of specific fields into the content and metadata of the LangChain Document.

It uses the jq Python package. See the jq manual for detailed documentation of the jq syntax.

!pip install jq -q

!pip install pyjq -q

On Windows, pip install jq may fail when no prebuilt wheel exists for your Python version. In that case, download the matching .whl file from https://jeffreyknockel.com/jq/ and install it with pip:

```python
!python -m pip install "C:\Users\dpokh\Downloads\jq-1.4.0-cp311-cp311-win_amd64.whl"
```

In [18]:
from langchain_community.document_loaders import JSONLoader
import json
from pathlib import Path
from pprint import pprint


json_path="Data/json_sample.json"
data = json.loads(Path(json_path).read_text())
pprint(data)

{'is_still_participant': True,
 'messages': [{'content': "That's rough, buddy.",
               'sender_name': 'Zuko',
               'timestamp_ms': 1579137191303,
               'type': 'Generic'},
              {'content': 'My first girlfriend turned into the moon',
               'sender_name': 'Sokka',
               'timestamp_ms': 1579137103044,
               'type': 'Generic'},
              {'content': "Everyone in the Fire Nation thinks I'm a traitor, I "
                          "couldn't drag her into it.",
               'sender_name': 'Zuko',
               'timestamp_ms': 1579137078312,
               'type': 'Generic'},
              {'content': 'Yeah',
               'sender_name': 'Zuko',
               'timestamp_ms': 1579136858575,
               'type': 'Generic'},
              {'content': 'That gloomy girl who sighs a lot?!',
               'sender_name': 'Sokka',
               'timestamp_ms': 1579136847743,
               'type': 'Generic'},
              {'c

The values under the content field within the messages key of the JSON data can be extracted with JSONLoader as shown below.

In [19]:
loader = JSONLoader(
    file_path=json_path,
    jq_schema='.messages[].content',
    text_content=False,
)

data = loader.load()

In [20]:
pprint(data)

[Document(page_content="That's rough, buddy.", metadata={'source': 'C:\\Users\\dpokh\\langchain-gemini\\langchain-gemini\\Gemini_Langchain_Intro\\Data\\json_sample.json', 'seq_num': 1}),
 Document(page_content='My first girlfriend turned into the moon', metadata={'source': 'C:\\Users\\dpokh\\langchain-gemini\\langchain-gemini\\Gemini_Langchain_Intro\\Data\\json_sample.json', 'seq_num': 2}),
 Document(page_content="Everyone in the Fire Nation thinks I'm a traitor, I couldn't drag her into it.", metadata={'source': 'C:\\Users\\dpokh\\langchain-gemini\\langchain-gemini\\Gemini_Langchain_Intro\\Data\\json_sample.json', 'seq_num': 3}),
 Document(page_content='Yeah', metadata={'source': 'C:\\Users\\dpokh\\langchain-gemini\\langchain-gemini\\Gemini_Langchain_Intro\\Data\\json_sample.json', 'seq_num': 4}),
 Document(page_content='That gloomy girl who sighs a lot?!', metadata={'source': 'C:\\Users\\dpokh\\langchain-gemini\\langchain-gemini\\Gemini_Langchain_Intro\\Data\\json_sample.json', 'seq_
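For readers new to jq, the expression `.messages[].content` used above reads as "for every item in messages, take its content". A plain-Python equivalent on a made-up document of the same shape looks like this; the `raw` payload is hypothetical.

```python
import json

# Hypothetical document with the same shape as the sample file above.
raw = ('{"messages": [{"sender_name": "A", "content": "first"}, '
       '{"sender_name": "B", "content": "second"}]}')
data = json.loads(raw)

# Plain-Python equivalent of the jq expression .messages[].content
contents = [message["content"] for message in data["messages"]]

# seq_num starts at 1, matching the metadata in the output above.
records = [
    {"page_content": text, "metadata": {"seq_num": i}}
    for i, text in enumerate(contents, start=1)
]
print(records)
```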

#### Loading Markdown

In [21]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_core.documents import Document

markdown_path = "../README.md"
loader = UnstructuredMarkdownLoader(markdown_path)

data = loader.load()
assert len(data) == 1
assert isinstance(data[0], Document)
readme_content = data[0].page_content
print(readme_content[:250])

langchain-gemini

Gemini 101: Code along with Langchain to explore Google's AI magic.


In [22]:
loader = UnstructuredMarkdownLoader(markdown_path, mode="elements")

data = loader.load()
print(f"Number of documents: {len(data)}\n")

for document in data[:2]:
    print(f"{document}\n")

print(set(document.metadata["category"] for document in data))

Number of documents: 2

page_content='langchain-gemini' metadata={'source': '../README.md', 'category_depth': 0, 'last_modified': '2024-07-02T09:55:35', 'languages': ['eng'], 'filetype': 'text/markdown', 'file_directory': '..', 'filename': 'README.md', 'category': 'Title'}

page_content="Gemini 101: Code along with Langchain to explore Google's AI magic." metadata={'source': '../README.md', 'last_modified': '2024-07-02T09:55:35', 'languages': ['eng'], 'parent_id': '8e5c643ea375992feaa4406a0d53d353', 'filetype': 'text/markdown', 'file_directory': '..', 'filename': 'README.md', 'category': 'NarrativeText'}

{'NarrativeText', 'Title'}


#### Loading Portable Document Format (PDF)

In [23]:
!pip install --upgrade --quiet pypdf

In [24]:
from langchain_community.document_loaders import PyPDFLoader

file_path = (
  "Data/layout_parser.pdf"
)
loader = PyPDFLoader(file_path)
pages = loader.load_and_split()

pages[0]

Document(page_content='LayoutParser : A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1( \x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1Allen Institute for AI\nshannons@allenai.org\n2Brown University\nruochen zhang@brown.edu\n3Harvard University\n{melissadell,jacob carlson }@fas.harvard.edu\n4University of Washington\nbcgl@cs.washington.edu\n5University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model conﬁgurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\neﬀorts to improve reusability and simplify deep learning (DL) model\ndevelopment in 

##### Extract text from images

In [25]:
!pip install --upgrade --quiet rapidocr-onnxruntime

In [26]:
loader = PyPDFLoader(file_path, extract_images=True)
pages = loader.load()
pages[4].page_content

'LayoutParser : A Uniﬁed Toolkit for DL-Based DIA 5\nTable 1: Current layout detection models in the LayoutParser model zoo\nDataset Base Model1Large Model Notes\nPubLayNet [38] F / M M Layouts of modern scientiﬁc documents\nPRImA [3] M - Layouts of scanned modern magazines and scientiﬁc reports\nNewspaper [17] F - Layouts of scanned US newspapers from the 20th century\nTableBank [18] F F Table region on modern scientiﬁc and business document\nHJDataset [31] F / M - Layouts of history Japanese documents\n1For each dataset, we train several models of diﬀerent sizes for diﬀerent needs (the trade-oﬀ between accuracy\nvs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101\nbackbones [ 13], respectively. One can train models of diﬀerent architectures, like Faster R-CNN [ 28] (F) and Mask\nR-CNN [ 12] (M). For example, an F in the Large Model column indicates it has a Faster R-CNN model trained\nusing the ResNet 101 backbone. The platform i

##### Using PyMuPDF
PyMuPDF is optimized for speed, and contains detailed metadata about the PDF and its pages. It returns one document per page:

In [27]:
!pip install --upgrade --quiet pymupdf

In [28]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(file_path)
data = loader.load()
data[0]

Document(page_content='LayoutParser: A Uniﬁed Toolkit for Deep\nLearning Based Document Image Analysis\nZejiang Shen1 (\x00), Ruochen Zhang2, Melissa Dell3, Benjamin Charles Germain\nLee4, Jacob Carlson3, and Weining Li5\n1 Allen Institute for AI\nshannons@allenai.org\n2 Brown University\nruochen zhang@brown.edu\n3 Harvard University\n{melissadell,jacob carlson}@fas.harvard.edu\n4 University of Washington\nbcgl@cs.washington.edu\n5 University of Waterloo\nw422li@uwaterloo.ca\nAbstract. Recent advances in document image analysis (DIA) have been\nprimarily driven by the application of neural networks. Ideally, research\noutcomes could be easily deployed in production and extended for further\ninvestigation. However, various factors like loosely organized codebases\nand sophisticated model conﬁgurations complicate the easy reuse of im-\nportant innovations by a wide audience. Though there have been on-going\neﬀorts to improve reusability and simplify deep learning (DL) model\ndevelopment 

##### Using pdfminer

In [29]:
!pip install --upgrade --quiet pdfminer.six

In [None]:
from langchain_community.document_loaders import PDFMinerLoader

file_path = (
  "Data/layout_parser.pdf"
)
loader = PDFMinerLoader(file_path)
data = loader.load()
data[0]

###### Using PDFMiner to generate HTML text

In [None]:
from langchain_community.document_loaders import PDFMinerPDFasHTMLLoader
loader = PDFMinerPDFasHTMLLoader(file_path)
docs = loader.load()
docs[0]

##### Loading from PDF Directory

In [32]:
from langchain_community.document_loaders import PyPDFDirectoryLoader
loader = PyPDFDirectoryLoader("Data/")

docs = loader.load()

#data[0]
len(docs)

16

##### UnstructuredPDFLoader

In [None]:
# Optional extras that UnstructuredPDFLoader may need; uncomment as required.
# !pip install --quiet pillow-heif
# !pip install matplotlib
# !pip install unstructured-inference --user
# !pip install unstructured_pytesseract

In [None]:
from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader(file_path, mode="elements")
data = loader.load()
data[0]

In [None]:
from langchain_community.document_loaders import OnlinePDFLoader

loader = OnlinePDFLoader("https://arxiv.org/pdf/2302.03803.pdf")
data = loader.load()
data[0]