# Data Ingestion - Document Loaders

- Check out the LangChain document loader documentation:
- https://python.langchain.com/docs/integrations/document_loaders/

## Text Loader

In [None]:
from langchain_community.document_loaders import TextLoader
# Create a loader for the text file (nothing is read until load() is called)
loader = TextLoader("speech.txt")
loader

<langchain_community.document_loaders.text.TextLoader at 0x7f9ad0771c30>

### Load the entire speech.txt into one document

In [2]:
# Read the file and return a list containing a single Document
text_documents = loader.load()
text_documents

[Document(metadata={'source': 'speech.txt'}, page_content='A text file (sometimes spelled textfile; \nan old alternative name is flat file) is a kind of computer file \nthat is structured as a sequence of lines of electronic text. \nA text file exists stored as data within a computer file system.\n\nIn operating systems such as CP/M, where the operating system does \nnot keep track of the file size in bytes, the end of a text file is \ndenoted by placing one or more special characters, \nknown as an end-of-file (EOF) marker, as padding after the last \nline in a text file. In modern operating systems such as DOS, \nMicrosoft Windows and Unix-like systems, \ntext files do not contain any special EOF character, \nbecause file systems on those operating systems \nkeep track of the file size in bytes.\n\nSome operating systems, such as Multics, Unix-like systems, \nCP/M, DOS, the classic Mac OS, and Windows, \nstore text files as a sequence of bytes, \nwith an end-of-line delimiter at the 
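
As a rough mental model (not LangChain's actual implementation), the output above amounts to reading the whole file into one string and pairing it with a `source` entry in the metadata. The sketch below reproduces that shape with the standard library only. `load_text_as_single_document` is a hypothetical helper, and a plain dictionary stands in for the `Document` object.

```python
import tempfile
from pathlib import Path


def load_text_as_single_document(path: str, encoding: str = "utf-8") -> list[dict]:
    """Hypothetical stand-in: read one file into a single document-like dict."""
    text = Path(path).read_text(encoding=encoding)
    # One file -> one "document", with the path recorded as its source.
    return [{"page_content": text, "metadata": {"source": path}}]


# Demonstrate on a throwaway file so the example does not depend on speech.txt.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_text("first line\nsecond line\n", encoding="utf-8")
    documents = load_text_as_single_document(str(sample))

print(len(documents))
print(documents[0]["page_content"].splitlines())
```

The point of the sketch is the return shape: a list with exactly one entry, however long the file is. Splitting that text into smaller chunks is a separate step.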

## Reading a PDF File

In [3]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("Raptor.pdf")
docs = loader.load()
docs

[Document(metadata={'source': 'Raptor.pdf', 'page': 0}, page_content='Published as a conference paper at ICLR 2024\nRAPTOR: R ECURSIVE ABSTRACTIVE PROCESSING\nFOR TREE -ORGANIZED RETRIEVAL\nParth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, Christopher D. Manning\nStanford University\npsarthi@cs.stanford.edu\nABSTRACT\nRetrieval-augmented language models can better adapt to changes in world state\nand incorporate long-tail knowledge. However, most existing methods retrieve\nonly short contiguous chunks from a retrieval corpus, limiting holistic under-\nstanding of the overall document context. We introduce the novel approach of\nrecursively embedding, clustering, and summarizing chunks of text, constructing\na tree with differing levels of summarization from the bottom up. At inference\ntime, our RAPTOR model retrieves from this tree, integrating information across\nlengthy documents at different levels of abstraction. Controlled experiments show\nthat retrieval with

In [None]:
type(docs) # List of documents

list

In [6]:
type(docs[0]) # Document type

langchain_core.documents.base.Document
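
Because the PDF loader returns a list with one entry per page (note the `page` key in the metadata above), it is often handy to check page sizes or stitch the pages back together. The following is a minimal sketch, assuming only that each item exposes `page_content` and `metadata`. `summarize_pages` and `join_pages` are hypothetical helpers, and `SimpleNamespace` objects stand in for real documents so the example runs without a PDF.

```python
from types import SimpleNamespace


def summarize_pages(docs):
    """Return (page_number, character_count) for each page-level document."""
    return [(d.metadata.get("page"), len(d.page_content)) for d in docs]


def join_pages(docs, separator="\n\n"):
    """Concatenate page texts in page order into one string."""
    ordered = sorted(docs, key=lambda d: d.metadata.get("page", 0))
    return separator.join(d.page_content for d in ordered)


# Stand-in objects with the same two attributes as the documents above.
fake_docs = [
    SimpleNamespace(page_content="second page", metadata={"source": "demo.pdf", "page": 1}),
    SimpleNamespace(page_content="first page", metadata={"source": "demo.pdf", "page": 0}),
]

page_summary = summarize_pages(fake_docs)
full_text = join_pages(fake_docs)
print(page_summary)
print(full_text)
```

The same two functions should work unchanged on the `docs` list loaded from `Raptor.pdf`, since they rely only on those two attributes.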

## Web-Based Loader

In [None]:
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader(web_path=("https://lilianweng.github.io/posts/2023-06-23-agent/")) # One webpage

USER_AGENT environment variable not set, consider setting it to identify your requests.
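
The message above is a warning, not an error, and the loader still works without the variable. One way to address it, sketched below, is to set `USER_AGENT` in the process environment before the loader is imported. This assumes the variable is read early (at import or construction time), and the value shown is a made-up placeholder rather than a required format.

```python
import os

# Hypothetical identifier; replace it with something that describes your project.
# setdefault leaves any value that is already configured untouched.
os.environ.setdefault("USER_AGENT", "document-loader-notebook/0.1")

print(os.environ["USER_AGENT"])
```

In a notebook, this cell would need to run before the first `WebBaseLoader` import, so a kernel restart may be required for it to take effect.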


In [10]:
from langchain_community.document_loaders import WebBaseLoader
# web_paths takes a sequence of URLs, so several pages can be loaded in one call
loader = WebBaseLoader(web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",))

In [12]:
loader.load()

[Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'title': "LLM Powered Autonomous Agents | Lil'Log", 'description': 'Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final resu

## Extracting Specific Information from Web Pages


In [14]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Parse only the elements whose CSS class matches one of these names
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-title", "post-content", "post-header")
        )
    ),
)

In [15]:
loader.load()

[Document(metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistake
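
Even after filtering by class, the extracted `page_content` keeps layout whitespace from the HTML (leading blank lines, indented titles, runs of newlines). A small cleanup pass can make the text easier to read or split later. The sketch below uses only the standard library. `tidy_page_content` is a hypothetical helper, not part of LangChain, and the sample string is an abbreviated copy of the output above.

```python
import re


def tidy_page_content(text: str) -> str:
    """Collapse layout whitespace left over from HTML extraction."""
    # Trim spaces and tabs around each line.
    lines = [line.strip() for line in text.splitlines()]
    cleaned = "\n".join(lines)
    # Squeeze runs of three or more newlines down to a single blank line.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()


raw = "\n\n      LLM Powered Autonomous Agents\n    \nDate: June 23, 2023\n\n\nBuilding agents"
tidy = tidy_page_content(raw)
print(tidy)
```

Paragraph breaks are kept as single blank lines on purpose, since many text splitters use them as natural boundaries.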

## Handling arXiv (Research Papers Downloaded from arXiv)

In [16]:
from langchain_community.document_loaders import ArxivLoader

# Load the "Attention Is All You Need" paper by its arXiv ID
docs = ArxivLoader(query="1706.03762", load_max_docs=2).load()
len(docs)

1

In [17]:
docs

[Document(metadata={'Published': '2023-08-02', 'Title': 'Attention Is All You Need', 'Authors': 'Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin', 'Summary': 'The dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks in an encoder-decoder configuration. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transformer, based\nsolely on attention mechanisms, dispensing with recurrence and convolutions\nentirely. Experiments on two machine translation tasks show these models to be\nsuperior in quality while being more parallelizable and requiring significantly\nless time to train. Our model achieves 28.4 BLEU on the WMT 2014\nEnglish-to-German translation task, improving over the existing best results,\nincluding ensembles by over 2 BLEU. On the WMT 2014 English-to-French\ntr

In [18]:
# Supports all arguments of `ArxivAPIWrapper`
loader = ArxivLoader(
    query="reasoning",
    load_max_docs=2,
    # doc_content_chars_max=1000,
    # load_all_available_meta=False,
    # ...
)

In [19]:
docs = loader.load()
docs[0]



In [20]:
print(docs[0].metadata)

{'Published': '2024-10-16', 'Title': 'Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models', 'Authors': 'Linhao Luo, Zicheng Zhao, Chen Gong, Gholamreza Haffari, Shirui Pan', 'Summary': 'Large language models (LLMs) have demonstrated impressive reasoning\nabilities, but they still struggle with faithful reasoning due to knowledge\ngaps and hallucinations. To address these issues, knowledge graphs (KGs) have\nbeen utilized to enhance LLM reasoning through their structured knowledge.\nHowever, existing KG-enhanced methods, either retrieval-based or agent-based,\nencounter difficulties in accurately retrieving knowledge and efficiently\ntraversing KGs at scale. In this work, we introduce graph-constrained reasoning\n(GCR), a novel framework that bridges structured knowledge in KGs with\nunstructured reasoning in LLMs. To eliminate hallucinations, GCR ensures\nfaithful KG-grounded reasoning by integrating KG structure into the LLM\ndecoding process
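
The raw metadata dictionary is hard to scan, partly because the abstract arrives hard-wrapped with embedded newlines. Below is a minimal sketch of a formatter that relies only on the keys visible in the output above (`Published`, `Title`, `Authors`, `Summary`). `describe_paper` is a hypothetical helper, and the sample dictionary is an abbreviated stand-in for the real one.

```python
import textwrap


def describe_paper(metadata: dict, width: int = 80) -> str:
    """Render a short, readable summary from a metadata dictionary."""
    # The abstract arrives hard-wrapped; join it back into one paragraph first.
    summary = " ".join(metadata.get("Summary", "").split())
    lines = [
        f"Title:     {metadata.get('Title', 'unknown')}",
        f"Published: {metadata.get('Published', 'unknown')}",
        f"Authors:   {metadata.get('Authors', 'unknown')}",
        "",
        textwrap.fill(summary, width=width),
    ]
    return "\n".join(lines)


# Abbreviated stand-in for the dictionary printed above.
sample_metadata = {
    "Published": "2024-10-16",
    "Title": "Graph-constrained Reasoning: Faithful Reasoning on Knowledge Graphs with Large Language Models",
    "Authors": "Linhao Luo, Zicheng Zhao, Chen Gong, Gholamreza Haffari, Shirui Pan",
    "Summary": (
        "Large language models (LLMs) have demonstrated impressive reasoning\n"
        "abilities, but they still struggle with faithful reasoning due to knowledge\n"
        "gaps and hallucinations."
    ),
}

report = describe_paper(sample_metadata)
print(report)
```

With the loader from the earlier cell, the call would be `describe_paper(docs[0].metadata)`. Missing keys fall back to `unknown` rather than raising an error.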

## WikipediaLoader

In [22]:
from langchain_community.document_loaders import WikipediaLoader
docs = WikipediaLoader(query="Hunter x hunter", load_max_docs=2).load()
len(docs)

2

In [23]:
docs

[Document(metadata={'title': 'Hunter × Hunter', 'summary': 'Hunter × Hunter (pronounced "hunter hunter") is a Japanese manga series written and illustrated by Yoshihiro Togashi. It has been serialized in Shueisha\'s shōnen manga magazine Weekly Shōnen Jump since March 1998, although the manga has frequently gone on extended hiatuses since 2006. Its chapters have been collected in 38 tankōbon volumes as of September 2024. The story focuses on a young boy named Gon Freecss who discovers that his father, who left him at a young age, is actually a world-renowned Hunter, a licensed professional who specializes in fantastical pursuits such as locating rare or unidentified animal species, treasure hunting, surveying unexplored enclaves, or hunting down lawless individuals. Gon departs on a journey to become a Hunter and eventually find his father. Along the way, Gon meets various other Hunters and encounters the paranormal.\nHunter × Hunter was adapted into a 62-episode anime television serie