# LangChain Document Loaders

https://python.langchain.com/docs/modules/data_connection/document_loaders/html

### Check out the list of loaders

https://python.langchain.com/docs/integrations/document_loaders

### Document class

##### Document
https://api.python.langchain.com/en/stable/documents/langchain_core.documents.base.Document.html#langchain_core.documents.base.Document
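
Every loader in this notebook returns a list of `Document` objects, and the cells below only ever read two fields from them: `page_content` and `metadata`. The sketch below mirrors that shape with a plain dataclass so it runs without LangChain installed. `SimpleDocument` is a hypothetical stand-in for illustration, not the real `langchain_core` class.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SimpleDocument:
    """Hypothetical stand-in that mimics the two fields used in this notebook."""

    page_content: str  # the text extracted by a loader
    metadata: Dict[str, Any] = field(default_factory=dict)  # e.g. source, page, row


# Values taken from the PDF loader output shown later in this notebook.
doc = SimpleDocument(
    page_content="Chain-of-Thought Prompting Elicits Reasoning",
    metadata={"source": "https://arxiv.org/pdf/2201.11903.pdf", "page": 0},
)

print(doc.page_content)
print(doc.metadata["source"], doc.metadata["page"])
```

With LangChain installed, the real class can be imported with `from langchain_core.documents import Document` and constructed with the same two keyword arguments.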

## 1. PDF Loader

https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf

#### Dependency

Install the pypdf package.

pip install pypdf

#### API

##### PyPDFLoader
https://api.python.langchain.com/en/stable/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html#langchain_community.document_loaders.pdf.PyPDFLoader




In [1]:
# !pip install pypdf

### Load a PDF from arxiv.org

In [18]:
from langchain_community.document_loaders import PyPDFLoader

# Load from local file system or from a URL
# You may also use the PDF web loader

source_pdf = 'https://arxiv.org/pdf/2201.11903.pdf'

pdf_loader = PyPDFLoader(source_pdf)

# Read from local file system
# pdf_loader = PyPDFLoader('c:/Users/raj/Downloads/arxiv-cot-2201.11903.pdf')

# Loads the PDF into a list of documents, one per page
pdf_document = pdf_loader.load()

print("Document list size : ", len(pdf_document))
print("Metadata : ", pdf_document[0].metadata)
print("Page content : ")
print(pdf_document[0].page_content)

Document list size :  43
Metadata :  {'source': 'https://arxiv.org/pdf/2201.11903.pdf', 'page': 0}
Page content : 
Chain-of-Thought Prompting Elicits Reasoning
in Large Language Models
Jason Wei Xuezhi Wang Dale Schuurmans Maarten Bosma
Brian Ichter Fei Xia Ed H. Chi Quoc V . Le Denny Zhou
Google Research, Brain Team
{jasonwei,dennyzhou}@google.com
Abstract
We explore how generating a chain of thought —a series of intermediate reasoning
steps—signiﬁcantly improves the ability of large language models to perform
complex reasoning. In particular, we show how such reasoning abilities emerge
naturally in sufﬁciently large language models via a simple method called chain-of-
thought prompting , where a few chain of thought demonstrations are provided as
exemplars in prompting.
Experiments on three large language models show that chain-of-thought prompting
improves performance on a range of arithmetic, commonsense, and symbolic
reasoning tasks. The empirical gains can be striking. For instan

In [9]:
# load_and_split() loads the PDF using PyPDF and then runs a text splitter over it
# (RecursiveCharacterTextSplitter when none is passed). A long page may yield more
# than one chunk; each chunk keeps the page number of the page it came from.
pages = pdf_loader.load_and_split()

print("Number of chunks : ", len(pages))
print("Metadata : ", pages[10].metadata)

pages[0].page_content
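
To make the splitting step concrete, here is a minimal sketch of cutting page texts into overlapping chunks while copying each page's metadata onto its chunks. It is a deliberately simplified fixed-size character splitter, not the algorithm LangChain's text splitters use; `split_pages`, `chunk_size`, and `overlap` are hypothetical names, and documents are represented as plain dicts.

```python
from typing import Any, Dict, List


def split_pages(
    pages: List[Dict[str, Any]], chunk_size: int = 1000, overlap: int = 100
) -> List[Dict[str, Any]]:
    """Cut each page into fixed-size character chunks (simplified illustration)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[Dict[str, Any]] = []
    for page in pages:
        text = page["page_content"]
        for start in range(0, len(text), step):
            chunks.append(
                {
                    "page_content": text[start : start + chunk_size],
                    # Copy the metadata so every chunk remembers its page.
                    "metadata": dict(page["metadata"]),
                }
            )
            if start + chunk_size >= len(text):
                break  # this window already reached the end of the page
    return chunks


# Hypothetical two-page input: one long page, one short page.
sample_pages = [
    {"page_content": "a" * 25, "metadata": {"page": 0}},
    {"page_content": "b" * 5, "metadata": {"page": 1}},
]
sample_chunks = split_pages(sample_pages, chunk_size=10, overlap=2)
print([(c["metadata"]["page"], len(c["page_content"])) for c in sample_chunks])
```

The long page produces several chunks that all carry `page: 0`, which is why the length of the list returned by a splitter can exceed the number of pages in the PDF.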

## 2. CSV Loader

#### API

https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.csv_loader.CSVLoader.html#langchain_community.document_loaders.csv_loader.CSVLoader

#### EV data
Sample data for testing. Download the file from the link below:

https://catalog.data.gov/dataset/electric-vehicle-population-data

In [20]:
from langchain_community.document_loaders.csv_loader import CSVLoader

# CHANGE this
ev_file_path = 'c:/Users/raj/Downloads/Electric_Vehicle_Population_Data.csv'

csv_loader = CSVLoader(file_path=ev_file_path)

csv_data = csv_loader.load()

print("Number of rows : ", len(csv_data))
print("Metadata : ", csv_data[10].metadata)

csv_data[0].page_content

Number of rows :  177866
Metadata :  {'source': 'c:/Users/raj/Downloads/Electric_Vehicle_Population_Data.csv', 'row': 10}


'VIN (1-10): 5YJYGDEE1L\nCounty: King\nCity: Seattle\nState: WA\nPostal Code: 98122\nModel Year: 2020\nMake: TESLA\nModel: MODEL Y\nElectric Vehicle Type: Battery Electric Vehicle (BEV)\nClean Alternative Fuel Vehicle (CAFV) Eligibility: Clean Alternative Fuel Vehicle Eligible\nElectric Range: 291\nBase MSRP: 0\nLegislative District: 37\nDOL Vehicle ID: 125701579\nVehicle Location: POINT (-122.30839 47.610365)\nElectric Utility: CITY OF SEATTLE - (WA)|CITY OF TACOMA - (WA)\n2020 Census Tract: 53033007800'
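
The output above suggests the shape of each CSV document: one document per row, `column: value` pairs joined by newlines, and the file path plus a zero-based row index in the metadata. The sketch below reproduces that shape with the standard `csv` module. It is an illustration of the output format, not the loader's actual implementation; `rows_to_documents` and the three-column sample are hypothetical.

```python
import csv
import io
from typing import Any, Dict, List


def rows_to_documents(csv_text: str, source: str) -> List[Dict[str, Any]]:
    """Turn each CSV row into a dict shaped like the loader output above."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for i, row in enumerate(reader):
        # One "column: value" line per field, in header order.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append({"page_content": content, "metadata": {"source": source, "row": i}})
    return docs


# Hypothetical sample; the real file has many more columns.
sample_csv = "City,State,Make\nSeattle,WA,TESLA\nTacoma,WA,NISSAN\n"
sample_docs = rows_to_documents(sample_csv, "sample.csv")

print(len(sample_docs))
print(sample_docs[0]["page_content"])
print(sample_docs[1]["metadata"])
```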

## 3. URL Loader

https://python.langchain.com/docs/integrations/document_loaders/url

#### API

##### UnstructuredURLLoader

Use the unstructured partition function to detect the MIME type and route the file to the appropriate partitioner.

You can run the loader in one of two modes: “single” and “elements”. If you use “single” mode, the document will be returned as a single langchain Document object. If you use “elements” mode, the unstructured library will split the document into elements such as Title and NarrativeText. You can pass in additional unstructured kwargs after mode to apply different unstructured settings.
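
A minimal sketch of choosing between the two modes described above. The wrapper `load_url_documents` is a hypothetical helper; it assumes `langchain_community` and `unstructured` are installed and that the URLs are reachable. The import is deferred so that the mode check can run without those packages.

```python
from typing import List

VALID_MODES = ("single", "elements")


def load_url_documents(urls: List[str], mode: str = "single"):
    """Load the given URLs with UnstructuredURLLoader in the requested mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")

    # Deferred import: only needed when documents are actually loaded.
    from langchain_community.document_loaders import UnstructuredURLLoader

    loader = UnstructuredURLLoader(urls=urls, mode=mode)
    return loader.load()
```

Calling `load_url_documents(urls, mode="elements")` should return one `Document` per detected element, while the default `"single"` mode returns one `Document` per URL.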

https://api.python.langchain.com/en/stable/document_loaders/langchain_community.document_loaders.url.UnstructuredURLLoader.html#langchain_community.document_loaders.url.UnstructuredURLLoader

##### WebLoaders


This covers how to use WebBaseLoader to load all text from HTML webpages into a document format that we can use downstream. 

https://python.langchain.com/docs/integrations/document_loaders/web_base

#### API

https://sj-langchain.readthedocs.io/en/latest/document_loaders/langchain.document_loaders.web_base.WebBaseLoader.html




#### Requirement

##### URL Loader
!pip install unstructured libmagic 

In [8]:
from langchain_community.document_loaders import WebBaseLoader


urls = [
        'https://en.wikipedia.org/wiki/Large_language_model',
        'https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)'
       ]


web_loader = WebBaseLoader(urls)

# Returns a list of *Document* objects
# Metadata has the form {'source': '...', 'title': '...', 'language': '...'}
data = web_loader.load()

In [20]:
len(data)

2

In [21]:
data[0].metadata

{'source': 'https://en.wikipedia.org/wiki/Large_language_model',
 'title': 'Large language model - Wikipedia',
 'language': 'en'}
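
The three metadata keys above map onto things present in the page itself: the URL that was requested, the `<title>` element, and the `lang` attribute on `<html>`. The sketch below derives the same keys using only the standard library's `html.parser`. It is an illustration under that assumption and not how `WebBaseLoader` is implemented (that loader relies on BeautifulSoup); `PageSummary`, `html_to_document`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from typing import Any, Dict, List


class PageSummary(HTMLParser):
    """Collect the title, language, and visible text of an HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.language = ""
        self.text_parts: List[str] = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            self.language = dict(attrs).get("lang") or ""
        elif tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def html_to_document(html: str, source: str) -> Dict[str, Any]:
    """Build a dict with page_content and source/title/language metadata."""
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    return {
        "page_content": "\n".join(parser.text_parts),
        "metadata": {
            "source": source,
            "title": parser.title.strip(),
            "language": parser.language,
        },
    }


# Hypothetical, heavily shortened page.
sample_html = (
    '<html lang="en"><head><title>Large language model - Wikipedia</title>'
    "<style>p{}</style></head><body><h1>Large language model</h1>"
    "<p>An LLM is a language model.</p></body></html>"
)
sample_doc = html_to_document(
    sample_html, "https://en.wikipedia.org/wiki/Large_language_model"
)
print(sample_doc["metadata"])
```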

In [13]:
pages = web_loader.load_and_split()

In [23]:
len(pages)

46

## 4. Wikipedia Loader

https://python.langchain.com/docs/integrations/document_loaders/wikipedia

#### API

https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.wikipedia.WikipediaLoader.html#langchain_community.document_loaders.wikipedia.WikipediaLoader

* query (str) – The query string to search on Wikipedia.
* lang (str, optional) – The language code for the Wikipedia language edition. Defaults to “en”.
* load_max_docs (int, optional) – The maximum number of documents to load. Defaults to 100.
* load_all_available_meta (bool, optional) – Indicates whether to load all available metadata for each document. Defaults to False.
* doc_content_chars_max (int, optional) – The maximum number of characters for the document content. Defaults to 4000.
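
The two limits in the list above can be read as simple caps: `load_max_docs` bounds how many pages are returned, and `doc_content_chars_max` bounds how much text each one keeps. The sketch below applies both caps to in-memory data so the effect is easy to see. `apply_limits` and the sample pages are hypothetical, and nothing here calls Wikipedia or LangChain.

```python
from typing import Any, Dict, List


def apply_limits(
    pages: List[Dict[str, str]],
    load_max_docs: int = 2,
    doc_content_chars_max: int = 4000,
) -> List[Dict[str, Any]]:
    """Keep at most load_max_docs pages, each cut to doc_content_chars_max characters."""
    docs = []
    for page in pages[:load_max_docs]:
        docs.append(
            {
                "page_content": page["content"][:doc_content_chars_max],
                "metadata": {"title": page["title"]},
            }
        )
    return docs


# Hypothetical search results.
sample_results = [
    {"title": "Large language model", "content": "A large language model (LLM) is ..."},
    {"title": "Language model", "content": "A language model is ..."},
    {"title": "Transformer", "content": "A transformer is ..."},
]
limited = apply_limits(sample_results, load_max_docs=2, doc_content_chars_max=7)
print([(d["metadata"]["title"], d["page_content"]) for d in limited])
```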

#### Note
Requires the Python wikipedia package.


!pip install --upgrade --quiet wikipedia

In [2]:
from langchain_community.document_loaders import WikipediaLoader

wiki_loader = WikipediaLoader("Large Language Models", load_max_docs=2)

wiki_docs = wiki_loader.load()

In [3]:
print("Number of docs : ", len(wiki_docs))

Number of docs :  2


In [5]:
print(wiki_docs[0].metadata)

{'title': 'Large language model', 'summary': 'A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process. LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.LLMs are artificial neural networks. The largest and most capable, as of March 2024, are built with a decoder-only transformer-based architecture while some recent implementations are based on other architectures, such as recurrent neural network variants and Mamba (a state space model).Up to 2020, fine tuning was the only way a model could be adapted to be able to accomplish specific tasks. Larger sized models, such as GPT-3, however, can be prompt-

In [6]:
print(wiki_docs[0].page_content)

A large language model (LLM) is a language model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process. LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.LLMs are artificial neural networks. The largest and most capable, as of March 2024, are built with a decoder-only transformer-based architecture while some recent implementations are based on other architectures, such as recurrent neural network variants and Mamba (a state space model).Up to 2020, fine tuning was the only way a model could be adapted to be able to accomplish specific tasks. Larger sized models, such as GPT-3, however, can be prompt-engineered to achieve similar results. They ar