## Document Loading
### Loaders

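Loaders handle the details of accessing and converting data:

- Sources they can access:
    - Web sites
    - Databases
    - YouTube
    - arXiv (paper repository)
    - ...
- Supported data formats:
    - PDF files
    - HTML pages
    - JSON files
    - Office documents such as Word and PowerPoint

Each loader returns a list of `Document` objects.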

- https://python.langchain.com/docs/integrations/document_loaders/
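
To make the return shape concrete, here is a minimal sketch of what a loader produces. `SimpleDocument` and `ParagraphLoader` are hypothetical stand-ins written for this illustration, not LangChain classes; they only mirror the `page_content` and `metadata` fields and the `load()` call pattern.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


class ParagraphLoader:
    """Toy loader (assumption, not a real library class).

    Produces one document per blank-line-separated paragraph.
    """

    def __init__(self, text: str, source: str) -> None:
        self.text = text
        self.source = source

    def load(self) -> list[SimpleDocument]:
        # Drop empty chunks so stray blank lines do not create empty documents.
        paragraphs = [p.strip() for p in self.text.split("\n\n") if p.strip()]
        return [
            SimpleDocument(
                page_content=p,
                metadata={"source": self.source, "paragraph": i},
            )
            for i, p in enumerate(paragraphs)
        ]


docs = ParagraphLoader(
    "First paragraph.\n\nSecond paragraph.", source="example.txt"
).load()
print(len(docs))
print(docs[1].page_content)
print(docs[1].metadata)
```

Whatever the source or format, the calling code sees the same structure: a list whose items carry text and a metadata dictionary.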

### Retrieval augmented generation
In retrieval augmented generation (RAG), an LLM retrieves contextual documents from an external dataset as part of its execution.

This is useful when we want to ask questions about specific documents (e.g., our PDFs, a set of videos, etc.).
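
The retrieval step can be sketched without any LLM or vector store. This is a toy illustration under loud assumptions: it ranks documents by simple word overlap with the question (real RAG pipelines typically use embedding similarity instead), and `retrieve` and `build_prompt` are hypothetical helper names, not part of any library.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the question.

    Word overlap is an assumption made for brevity; it stands in for
    the similarity search a real retriever would perform.
    """
    question_words = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Place the retrieved context ahead of the question for the LLM."""
    context_block = "\n".join(f"- {c}" for c in context)
    return (
        f"Answer using only this context:\n{context_block}\n\n"
        f"Question: {question}"
    )


documents = [
    "Kubernetes manages containerized applications.",
    "arXiv hosts research papers.",
    "A pod is the smallest deployable unit in Kubernetes.",
]
question = "What is a pod in Kubernetes?"
top_docs = retrieve(question, documents, k=2)
prompt = build_prompt(question, top_docs)
print(prompt)
```

The resulting prompt is what would be sent to the LLM, so the model answers from the retrieved text rather than from its training data alone.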

In [2]:
from langchain_community.document_loaders import PyMuPDFLoader
# Each page is loaded as a Document, which holds text (page_content) and metadata.
loader = PyMuPDFLoader("./kubernetes-cheatsheet.pdf")
pages = loader.load()

In [5]:
page = pages[0]
print(page.page_content[0:250])

Kubernetes Cheatsheet
What is Kubernetes 
Kapsule and 
Kubernetes Kosmos?
Kubernetes is an open-source platform 
that enables developers to manage their 
containerized applications. Kapsule and 
Kosmo both provide a managed 
environment for creating,


In [6]:
print(page.metadata)

{'producer': 'PDFium', 'creator': 'PDFium', 'creationdate': 'D:20250518155345', 'source': './kubernetes-cheatsheet.pdf', 'file_path': './kubernetes-cheatsheet.pdf', 'total_pages': 1, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'moddate': '', 'trapped': '', 'modDate': '', 'creationDate': 'D:20250518155345', 'page': 0}
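
The metadata dictionary can be used to cite where a piece of text came from. A minimal sketch, assuming only the keys visible in the output above (`source`, `page`, `total_pages`, `creationdate`) and assuming `page` is zero-based, as the `'page': 0` value for a one-page file suggests. `format_citation` and `parse_pdf_date` are hypothetical helpers, not library functions.

```python
from datetime import datetime


def format_citation(metadata: dict) -> str:
    """Build a human-readable citation from loader metadata."""
    source = metadata.get("source", "unknown source")
    page_number = metadata.get("page", 0) + 1  # convert to 1-based
    total = metadata.get("total_pages")
    if total is None:
        return f"{source}, page {page_number}"
    return f"{source}, page {page_number} of {total}"


def parse_pdf_date(value: str) -> datetime:
    """Parse the leading 'D:YYYYMMDDHHMMSS' part of a PDF date string.

    Any timezone suffix after the seconds is ignored in this sketch.
    """
    digits = value[2:16] if value.startswith("D:") else value[:14]
    return datetime.strptime(digits, "%Y%m%d%H%M%S")


metadata = {
    "source": "./kubernetes-cheatsheet.pdf",
    "total_pages": 1,
    "page": 0,
    "creationdate": "D:20250518155345",
}
citation = format_citation(metadata)
created = parse_pdf_date(metadata["creationdate"])
print(citation)             # ./kubernetes-cheatsheet.pdf, page 1 of 1
print(created.isoformat())  # 2025-05-18T15:53:45
```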


### WebBaseLoader

In [2]:
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://kubernetes.io/")

In [5]:
docs = loader.load()
print(len(docs))

1


In [None]:
print(docs[0].page_content[:500])

Kubernetes
KubernetesDocumentationKubernetes BlogTrainingPartnersCommunityCase StudiesVersionsRelease Information
v1.33
v1.32
v1.31
v1.30
v1.29Englishবাংলা (Bengali)
中文 (Chinese)
Français (French)
Deutsch (German)
हिन्दी (Hindi)
Bahasa Indonesia (Indonesian)
Italiano (Italian)
日本語 (Japanese)
한국어 (Korean)
Polski (Polish)
Português (Portuguese)
Русский (Russian)
Español (Spanish)
Українська (Ukrainian)
Tiếng Việt (Vietnamese)
Production-Grade Container OrchestrationLearn Kubernetes Basics
Kubernet
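
Raw web text often needs light cleanup before further processing. A minimal sketch, assuming plain whitespace normalization is enough for the use case: `normalize_whitespace` is a hypothetical helper, and it does not separate labels that the HTML-to-text step has already fused together (such as the navigation items in the output above).

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and drop empty lines."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


raw = "Kubernetes\n\n\n   Documentation \t Blog\n\n"
cleaned = normalize_whitespace(raw)
print(cleaned)
```

In practice this would be applied to `docs[0].page_content` after loading.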
