# Data Ingestion with Document Loaders
This notebook demonstrates how to load raw data from different sources (text files,
PDFs, and web pages) using the `langchain_community` document loaders.
Each loader returns a list of `Document` objects, which can be processed further
in downstream pipelines (splitting, embedding, vector stores, etc.).

#### Text File Loader

In [6]:
from langchain_community.document_loaders import TextLoader

# Create a loader for a local text file
text_loader = TextLoader("../data/AI_and_society.txt")

# Load the file content into a list of Document objects
text_documents = text_loader.load()

# Each Document exposes its text via `page_content` and its origin via `metadata`
print(text_documents[0].page_content[:200])  # print the first 200 characters

Artificial Intelligence and Society

Artificial Intelligence (AI) is rapidly transforming the way we live, work, and interact with technology. 
Today, AI systems are able to recognize speech, translat
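
Before passing loaded documents downstream, it can help to look at more than the first few characters. The following is a minimal sketch of a summary helper. `summarize_documents` and `StandInDocument` are hypothetical names introduced here, and the stand-in class only mimics the `page_content` and `metadata` attributes so the snippet runs without the data file. A list returned by `text_loader.load()` should work the same way.

```python
from dataclasses import dataclass, field


@dataclass
class StandInDocument:
    """Hypothetical stand-in exposing the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_documents(documents, preview_chars=200):
    """Return one dict per document with its source, length, and a short preview."""
    summaries = []
    for doc in documents:
        summaries.append({
            "source": doc.metadata.get("source", "<unknown>"),
            "n_chars": len(doc.page_content),
            "preview": doc.page_content[:preview_chars],
        })
    return summaries


# Stand-in for the list returned by text_loader.load()
sample_docs = [
    StandInDocument(
        page_content="Artificial Intelligence and Society",
        metadata={"source": "../data/AI_and_society.txt"},
    )
]

for summary in summarize_documents(sample_docs, preview_chars=20):
    print(summary)
```
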


#### PDF File Loader

In [8]:
from langchain_community.document_loaders import PyPDFLoader

# Create a loader for a PDF file
pdf_loader = PyPDFLoader("../data/Attention is all you need.pdf")

# Load the PDF; each page becomes one Document
pdf_documents = pdf_loader.load()

print(f"Number of pages loaded: {len(pdf_documents)}")
print(type(pdf_documents[0]))  # confirm they are Document objects

Number of pages loaded: 15
<class 'langchain_core.documents.base.Document'>
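
Because each page is loaded as its own `Document`, page-level lookups become simple list operations. The sketch below finds the pages that mention a keyword. It assumes each document carries a zero-based `page` entry in its metadata, which `PyPDFLoader` typically provides. `pages_mentioning` and `StandInDocument` are hypothetical names, and the sample page texts are made-up placeholders, not the real PDF content.

```python
from dataclasses import dataclass, field


@dataclass
class StandInDocument:
    """Hypothetical stand-in exposing the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_mentioning(documents, keyword):
    """Return the `page` metadata of every document whose text contains the keyword."""
    needle = keyword.lower()
    return [
        doc.metadata.get("page")
        for doc in documents
        if needle in doc.page_content.lower()
    ]


# Placeholder pages standing in for pdf_loader.load(); the text is invented
sample_pages = [
    StandInDocument("Attention Is All You Need", {"page": 0}),
    StandInDocument("Scaled dot-product attention", {"page": 1}),
    StandInDocument("References", {"page": 2}),
]

print(pages_mentioning(sample_pages, "attention"))
```
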


#### Web-Based Loader

In [9]:
from langchain_community.document_loaders import WebBaseLoader

# Load a single web page as Documents
web_loader = WebBaseLoader(
    web_paths=("https://www.cloudflare.com/it-it/learning/ai/what-is-large-language-model/",)
)

web_documents = web_loader.load()
print(f"Documents loaded from web: {len(web_documents)}")


USER_AGENT environment variable not set, consider setting it to identify your requests.


Documents loaded from web: 1
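
The `USER_AGENT` warning in the output above can usually be avoided by setting the environment variable before the loader module is imported. This is a minimal sketch: the identifier string is a hypothetical placeholder, and the import order reflects the assumption that the variable is read at import time.

```python
import os

# Hypothetical identifier; replace it with something that describes your project.
# setdefault keeps any value that is already configured in the environment.
os.environ.setdefault("USER_AGENT", "document-loader-notebook/0.1")

# Import the loader only after the variable is set (assumed to be read at import time):
# from langchain_community.document_loaders import WebBaseLoader

print(os.environ["USER_AGENT"])
```

In a notebook that has already imported the loader, a kernel restart is probably needed before the change takes effect.
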


In [11]:
web_loader.load()

[Document(metadata={'source': 'https://www.cloudflare.com/it-it/learning/ai/what-is-large-language-model/', 'title': 'Just a moment...', 'language': 'en-US'}, page_content='Just a moment...Enable JavaScript and cookies to continue\n')]
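
The output above has the title `Just a moment...` and a body asking to enable JavaScript and cookies. This suggests the site returned an interstitial challenge page rather than the article, so the loaded document does not contain the intended text. A minimal sketch of a guard for this case follows. `looks_like_challenge_page`, `CHALLENGE_MARKERS`, and `StandInDocument` are hypothetical names, and the marker strings are taken only from the output shown here, so other sites may need different ones.

```python
from dataclasses import dataclass, field


@dataclass
class StandInDocument:
    """Hypothetical stand-in exposing the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Marker strings copied from the output above; an assumption, not an exhaustive list
CHALLENGE_MARKERS = (
    "just a moment",
    "enable javascript and cookies to continue",
)


def looks_like_challenge_page(doc):
    """Return True if the title or body contains a known challenge-page marker."""
    title = str(doc.metadata.get("title", "")).lower()
    body = doc.page_content.lower()
    return any(marker in title or marker in body for marker in CHALLENGE_MARKERS)


blocked = StandInDocument(
    page_content="Just a moment...Enable JavaScript and cookies to continue\n",
    metadata={"title": "Just a moment...", "language": "en-US"},
)
regular = StandInDocument(
    page_content="Artificial Intelligence and Society",
    metadata={"title": "AI and society"},
)

print(looks_like_challenge_page(blocked))
print(looks_like_challenge_page(regular))
```
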

#### Web Loader with BeautifulSoup Filters

In [12]:
from langchain_community.document_loaders import WebBaseLoader
import bs4

# Optionally restrict parsing to HTML elements with specific CSS classes
filtered_web_loader = WebBaseLoader(
    web_paths=("https://www.cloudflare.com/it-it/learning/ai/what-is-large-language-model/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-title", "post-content", "post-header")
        )
    ),
)

filtered_documents = filtered_web_loader.load()
print(f"Filtered documents from web: {len(filtered_documents)}")

Filtered documents from web: 1
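
A count of 1 only indicates that one URL was processed. It does not confirm that the class filter matched anything, since a filter that matches no elements would be expected to yield a document with empty text. The sketch below lists the sources whose content is empty after stripping whitespace. `empty_sources` and `StandInDocument` are hypothetical names, and the second sample URL is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class StandInDocument:
    """Hypothetical stand-in exposing the two attributes used below."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def empty_sources(documents):
    """Return the `source` of every document whose text is empty or whitespace only."""
    return [
        doc.metadata.get("source", f"<document {index}>")
        for index, doc in enumerate(documents)
        if not doc.page_content.strip()
    ]


# Stand-ins for filtered_web_loader.load(); example.com is a placeholder URL
sample_docs = [
    StandInDocument("   \n", {"source": "https://example.com/no-match"}),
    StandInDocument("Some extracted text", {"source": "https://example.com/match"}),
]

print(empty_sources(sample_docs))
```
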


In [20]:
filtered_web_loader

<langchain_community.document_loaders.web_base.WebBaseLoader at 0x24cb0dd06d0>