Document Loaders

In [1]:
## Data Ingestion
# Text loader
from langchain_community.document_loaders import TextLoader
loader = TextLoader("./data/speech.txt")
text_documents = loader.load()
text_documents

[Document(metadata={'source': './data/speech.txt'}, page_content='Hi hello sample')]

In [2]:
# CSV Loader

# The CSV loader turns each row of a CSV file into its own document. Each document carries the row's page content along with metadata (the source file and the row index).

from langchain_community.document_loaders import CSVLoader
loader = CSVLoader(file_path="./data/titanic.csv")
data = loader.load()
print(data[0])



page_content='PassengerId: 1
Survived: 0
Pclass: 3
Name: Braund, Mr. Owen Harris
Sex: male
Age: 22
SibSp: 1
Parch: 0
Ticket: A/5 21171
Fare: 7.25
Cabin: 
Embarked: S' metadata={'source': './data/titanic.csv', 'row': 0}
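
To make the one-document-per-row idea concrete, here is a minimal standard-library sketch that reproduces the layout of the output above. It is not the CSVLoader implementation; `rows_to_documents`, `sample_csv`, and the plain-dict document shape are hypothetical names chosen for illustration.

```python
import csv
import io


def rows_to_documents(csv_text, source):
    """Mimic the one-document-per-row layout shown above (illustrative only)."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_index, row in enumerate(reader):
        # Each column becomes a "name: value" line; lines are joined by newlines.
        page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {
                "page_content": page_content,
                # Assumed metadata shape, copied from the loader output above.
                "metadata": {"source": source, "row": row_index},
            }
        )
    return documents


# Hypothetical three-column sample, not the real Titanic file.
sample_csv = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"
docs = rows_to_documents(sample_csv, source="sample.csv")
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```

Plain dicts stand in for Document objects here so the sketch runs without LangChain installed.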


In [5]:
# HTML Loader

# We can load an HTML file into Document objects for use in downstream tasks. The loader follows the same pattern as the loaders above: construct it with a file path, then call load().

from langchain_community.document_loaders import UnstructuredHTMLLoader
loader = UnstructuredHTMLLoader('./data/openai.htm')
data = loader.load()
data



[Document(metadata={'source': './data/openai.htm'}, page_content='ChatGPT on your desktop\n\nChat about email, screenshots, files, and anything on your screen.\n\nDownload\n\nAvailable now on macOS.*\n\nSeamlessly integrates with how you work, write, and create\n\nFaster access to ChatGPT\n\nOption + Space opens ChatGPT from any screen on your desktop.\n\nSpeak with ChatGPT from your desktop\n\nStart a conversation. Practice a new language. Tap the headphone icon to begin.\n\nDo more on your desktop with ChatGPT\n\nDownload\n\nThe desktop app is only available for macOS 14+ with Apple Silicon (M1 or better). Coming to Windows later this year.')]
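
The output above is the visible text of the page with the markup stripped away. As a rough illustration of that idea only, the sketch below uses the standard-library `html.parser` module. It is not how UnstructuredHTMLLoader works internally; `VisibleTextExtractor`, `html_to_text`, and `sample_html` are hypothetical names.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is outside skipped tags.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    # Separate blocks with a blank line, similar to the loader output above.
    return "\n\n".join(parser.chunks)


# Hypothetical snippet, not the contents of ./data/openai.htm.
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>ChatGPT on your desktop</h1><p>Download</p></body></html>"
)
html_text = html_to_text(sample_html)
print(html_text)
```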

In [6]:
# Markdown Loader

from langchain_community.document_loaders import UnstructuredMarkdownLoader
loader = UnstructuredMarkdownLoader(file_path='./data/README.md')
data = loader.load()
print(data[0].page_content)



github-download

github-download downloads commit comments and select issues metadata, saving the raw JSON and writing summary .csv files.

Installing

Download the .jar file here. It includes all dependencies. You must have the Java Runtime Environment version 7 or above.

Usage

github-download can be run from the command line. It has three required flags:

repo. The full repository name, e.g., PovertyAction/github-download.

to. The directory in which to save the metadata. It will be created if it does not exist already.

token. The name of a text file that contains solely a GitHub OAuth token. GitHub will supply you a token, which is a single string. You must copy it to a text file, then specify the name of that file to -token.

All together:

java -jar github-download.jar -repo PovertyAction/github-download -token token.txt -to metadata

If the name of the .jar file is not github-download.jar, use the actual filename in the command above, or rename the file as github-download.jar.

In [7]:
# PDF Loader

# Here we use PyPDFLoader to load a PDF. It returns one document per page, and each document contains the page content plus metadata that includes the page number.

from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader('./data/attention.pdf')

data = loader.load()

print(data[0].page_content)



Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.comNoam Shazeer∗
Google Brain
noam@google.comNiki Parmar∗
Google Research
nikip@google.comJakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.comAidan N. Gomez∗†
University of Toronto
aidan@cs.toronto.eduŁukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experiments on two machine translation tasks show these models to
be superior in quality while being more parallelizable and requiring signiﬁcantly
less time to train. Our model achiev

In [8]:
# Wikipedia Loader

# WikipediaLoader fetches pages from Wikipedia. It takes three main arguments: query, the text used to find the topic in Wikipedia; lang, the optional language edition (default "en"); and load_max_docs, an optional limit on the number of documents downloaded (default 100).

from langchain_community.document_loaders import WikipediaLoader
loader = WikipediaLoader(query='LangChain', load_max_docs=1)
data = loader.load()
data[0].metadata


{'title': 'LangChain',
 'summary': "LangChain is a framework designed to simplify the creation of applications using large language models (LLMs). As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis.\n\n",
 'source': 'https://en.wikipedia.org/wiki/LangChain'}

In [9]:
# ArXiv Loader

# arXiv is an open-access archive of about 2 million scholarly articles. ArxivLoader can fetch the text and metadata of a paper. To use it, pass the paper's arXiv identifier, which appears in the paper's URL, as the query.
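
Since the identifier comes from the paper's URL, a small helper can pull it out before calling the loader. This is a hedged sketch, not part of LangChain: `arxiv_id_from_url` and `ARXIV_ID_PATTERN` are hypothetical names, and the pattern assumes new-style identifiers such as 1706.03762 in `arxiv.org/abs/...` or `arxiv.org/pdf/...` URLs. Older identifiers with a subject prefix are not handled.

```python
import re

# Assumption: new-style arXiv IDs (YYMM.NNNN or YYMM.NNNNN), optional version suffix.
ARXIV_ID_PATTERN = re.compile(
    r"arxiv\.org/(?:abs|pdf)/(\d{4}\.\d{4,5})(?:v\d+)?", re.IGNORECASE
)


def arxiv_id_from_url(url):
    """Return the arXiv identifier found in a paper URL, without its version."""
    match = ARXIV_ID_PATTERN.search(url)
    if match is None:
        raise ValueError(f"No arXiv identifier found in: {url}")
    return match.group(1)


paper_id = arxiv_id_from_url("https://arxiv.org/abs/1706.03762")
print(paper_id)
```

The resulting string can then be passed as the `query` argument, as in the cell below.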

from langchain_community.document_loaders import ArxivLoader
loader = ArxivLoader(query='1706.03762', load_max_docs=1)
data = loader.load()
data[0].metadata


{'Published': '2023-08-02',
 'Title': 'Attention Is All You Need',
 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin',
 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntranslation task, 