# Document Loader
### Docs: https://python.langchain.com/docs/integrations/document_loaders/
- Document loaders convert different types of sources (text files, PDFs, web pages, and so on) into the standard LangChain `Document` object
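
To make the idea of a "standard document object" concrete, here is a minimal sketch that does not use LangChain itself. `SimpleDocument` and `load_text` are hypothetical stand-ins written for illustration; they only mirror the two fields a LangChain `Document` exposes, `page_content` and `metadata`.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document (illustration only)."""
    page_content: str                              # the raw text of the source
    metadata: dict = field(default_factory=dict)   # e.g. where the text came from


def load_text(text, source):
    """Wrap raw text in a one-element list of documents, as a loader would."""
    return [SimpleDocument(page_content=text, metadata={"source": source})]


docs = load_text("Hello, loaders!", "speech.txt")
print(docs[0].page_content)
print(docs[0].metadata)
```

Whatever the source type, the loaders in this notebook return a list of objects with this same shape, which is why later steps can treat them uniformly.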

## Text Loader
- Loads a plain-text file such as a `.txt` file into a single document

In [None]:
# langchain_community.document_loaders provides many different kinds of document loaders

from langchain_community.document_loaders import TextLoader

loader=TextLoader('speech.txt')
loader



In [None]:
text_documents=loader.load() # reads the file and returns a list of Document objects
text_documents

In [None]:
type(text_documents)

## Loading a PDF File
- Every page of the PDF is converted into its own document
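
Because each page becomes a separate document, it is sometimes handy to stitch the pages back together. The sketch below uses plain dictionaries as stand-ins for documents, and assumes each one carries `source` and `page` entries in its metadata; `pages_to_documents` and `join_pages` are hypothetical helpers, not part of LangChain.

```python
def pages_to_documents(pages, source):
    """Build one stand-in document per page, numbering pages from 0."""
    return [
        {"page_content": text, "metadata": {"source": source, "page": i}}
        for i, text in enumerate(pages)
    ]


def join_pages(documents):
    """Concatenate page texts in page order to recover the full text."""
    ordered = sorted(documents, key=lambda d: d["metadata"]["page"])
    return "\n".join(d["page_content"] for d in ordered)


# Made-up page texts, used only to exercise the helpers
sample_pages = ["Page one text.", "Page two text.", "Page three text."]
page_docs = pages_to_documents(sample_pages, "attention.pdf")

print(len(page_docs))
print(page_docs[1]["metadata"])
print(join_pages(page_docs))
```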

In [None]:

from langchain_community.document_loaders import PyPDFLoader
loader=PyPDFLoader('attention.pdf')

docs=loader.load()
docs

In [None]:
print(type(docs))
type(docs[0])

## Web-Based Loader
- Given one or more website URLs, it fetches each page and reads the content into a document
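
The loader cell in this section restricts parsing to elements with certain CSS classes. To show what that kind of filtering does, here is a rough sketch using only the standard library's `html.parser` instead of BeautifulSoup. `ClassFilter` and `extract_text` are hypothetical names, the sample HTML is made up, and this is not how `SoupStrainer` is implemented.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ClassFilter(HTMLParser):
    """Collect text found inside elements whose class is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.depth = 0      # greater than 0 while inside a wanted element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Start counting at a wanted element, then track nested children
        if self.depth or self.wanted.intersection(classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, wanted):
    parser = ClassFilter(wanted)
    parser.feed(html)
    return parser.chunks


sample_html = (
    '<html><body><nav class="menu">Home</nav>'
    '<h1 class="post-title">Agents</h1>'
    '<div class="post-content"><p>Body <b>text</b></p></div>'
    '<footer>bye</footer></body></html>'
)
kept = extract_text(sample_html, ("post-title", "post-content", "post-header"))
print(kept)
```

Navigation and footer text is dropped, which is the main reason to filter by class: the resulting document contains the article rather than the page chrome.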

In [None]:
from langchain_community.document_loaders import WebBaseLoader
import bs4  # BeautifulSoup is used internally for parsing HTML

# You can pass multiple URLs as a tuple inside `web_paths`
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),

    # Using SoupStrainer to filter and parse only the specific classes
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-title", "post-content", "post-header")  # You can add more classes here
        )
    )
)

# Load the document(s)
docs = loader.load()
docs



In [None]:

# Print the content of the first document (or iterate over all)
print(docs[0].page_content)

## arXiv
- arXiv is a free distribution service and an open-access archive for nearly 2.4 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Materials on this site are not peer-reviewed by arXiv.

- The loader looks up papers matching the query (here, an arXiv ID) and converts each paper's text into a document

In [None]:

from langchain_community.document_loaders import ArxivLoader


docs = ArxivLoader(query="1706.03762", load_max_docs=2).load()
print(len(docs))
docs

In [None]:
docs[0].page_content

## Wikipedia Loader


In [None]:
from langchain_community.document_loaders import WikipediaLoader

docs = WikipediaLoader(query="full metal alchemist", load_max_docs=10).load()
print(len(docs))
print(docs)

In [None]:
# Print only the text content of each Wikipedia document

# Print number of documents
print(f"Number of documents: {len(docs)}\n")

# Access and print the content of each document
for i, doc in enumerate(docs):
    print(f"Document {i+1} content:\n")
    print(doc.page_content[:1000])  # show first 1000 characters only for readability
    print("\n" + "-"*80 + "\n")