# 1.1 Web Base Loaders (Component Document Loaders)

This section covers three loaders:

1. WebBaseLoader
2. UnstructuredURLLoader
3. SeleniumURLLoader


# 1. WebBaseLoader

What: A general-purpose loader that fetches a page's HTML and parses it into plain text with BeautifulSoup.

When to use: When the page is mostly static HTML and you only need its text, given a URL.

pip install -qU langchain-community beautifulsoup4

In [None]:
import os

# Set USER_AGENT before importing the loader. langchain_community reads the
# variable at import time, so setting it afterwards still triggers the
# "USER_AGENT environment variable not set" warning.
os.environ["USER_AGENT"] = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)

from langchain_community.document_loaders import WebBaseLoader

# Fetch the page and parse it into a list of Document objects
loader = WebBaseLoader("https://vtohal.medium.com/mastering-mcqs-the-ultimate-guide-26a180865cd2")
docs = loader.load()

# Each Document exposes the extracted text as page_content
print(docs[0].page_content)


Mastering MCQs: The Ultimate Guide | by Vishwajeet Ohal | MediumSitemapOpen in appSign upSign inMedium LogoWriteSign upSign inMastering MCQs: The Ultimate GuideVishwajeet Ohal6 min read·Apr 18, 2021--ListenSharePress enter or click to view image in full sizePhoto by Green Chameleon on UnsplashWith a massive shift of paradigm initiated by the global pandemic, many organizations and educational institutions have moved online and subsequently have adapted the Multiple Choice Question (MCQ) structure of examinations. That is why students of the online age undoubtedly need to adjust their way of preparing for their academic evaluation, and to do exactly that, I have constructed a comprehensive guide — using the knowledge, I have gained over 10 years of participation in Olympiads and other exams — to approach MCQs and ace your next examinations. This guide will not only help you for your University examinations, but also for competitive exams like GATE, GRE and GMAT among others.It is easier
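The run-together words in the output above (for example "MediumSitemapOpen in app") are what you get when a page's text nodes are joined with no separator. The following is a minimal, standard-library-only sketch of that effect. It is not WebBaseLoader's implementation; the `TextNodes` class, the `extract_text` helper, and the sample markup are all hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class TextNodes(HTMLParser):
    """Collect the visible text nodes of an HTML document (hypothetical helper)."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.nodes = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside <script> or <style>.
        text = data.strip()
        if text and not self._skip_depth:
            self.nodes.append(text)


def extract_text(html, separator=""):
    """Join all visible text nodes with the given separator."""
    parser = TextNodes()
    parser.feed(html)
    parser.close()
    return separator.join(parser.nodes)


# Made-up markup that mimics a navigation bar followed by a headline.
sample = (
    "<nav><a>Sitemap</a><a>Open in app</a><a>Sign up</a></nav>"
    "<h1>Mastering MCQs</h1>"
)

print(extract_text(sample))                   # SitemapOpen in appSign upMastering MCQs
print(extract_text(sample, separator=" | "))  # Sitemap | Open in app | Sign up | Mastering MCQs
```

If the jammed text is a problem downstream, recent langchain-community releases appear to let you pass a `bs_get_text_kwargs` dictionary (such as a `separator`) to WebBaseLoader, which is forwarded to BeautifulSoup's `get_text`. Treat that as an assumption and check it against your installed version.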

# 2. UnstructuredURLLoader

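What: A loader that uses the unstructured library to partition web pages into structured elements. It requires the unstructured package (pip install unstructured).

When to use: When you need richer text extraction that keeps elements such as headers, tables, and lists distinct.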
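To make "richer extraction" concrete, here is a small standard-library sketch of element-wise partitioning: instead of one blob of text, the page becomes a sequence of typed elements. This is not how the unstructured library is implemented. The `Element` and `ElementPartitioner` names, the tag-to-category mapping, and the sample HTML are assumptions made for illustration, and the category names are only loosely modeled on the ones unstructured reports.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Tag-to-category mapping chosen for illustration only.
CATEGORIES = {
    "h1": "Title",
    "h2": "Title",
    "h3": "Title",
    "p": "NarrativeText",
    "li": "ListItem",
}


@dataclass
class Element:
    category: str
    text: str


class ElementPartitioner(HTMLParser):
    """Split HTML into typed elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._category = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in CATEGORIES:
            self._flush()
            self._category = CATEGORIES[tag]

    def handle_endtag(self, tag):
        if tag in CATEGORIES:
            self._flush()

    def handle_data(self, data):
        # Only collect text while inside a recognized element.
        if self._category is not None:
            self._buffer.append(data)

    def _flush(self):
        # Normalize whitespace and emit the element if it has any text.
        text = " ".join("".join(self._buffer).split())
        if self._category is not None and text:
            self.elements.append(Element(self._category, text))
        self._category = None
        self._buffer = []


def partition(html):
    parser = ElementPartitioner()
    parser.feed(html)
    parser.close()
    parser._flush()
    return parser.elements


# Made-up sample page.
sample = """
<h1>10 Biggest Companies in the World</h1>
<p>Ranked by   revenue.</p>
<ul><li>Walmart</li><li>Amazon</li></ul>
"""

for element in partition(sample):
    print(f"{element.category}: {element.text}")
```

UnstructuredURLLoader should expose the same idea through its `mode` argument: the default returns one Document per URL, while `mode="elements"` is expected to return one Document per element with the category in its metadata. Verify this against the version you have installed.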

In [4]:
from langchain_community.document_loaders import UnstructuredURLLoader

urls = [
    "https://www.investopedia.com/articles/active-trading/111115/why-all-worlds-top-10-companies-are-american.asp"
]

# Use the loader this section describes (the cell previously used SeleniumURLLoader)
loader = UnstructuredURLLoader(urls=urls)
docs = loader.load()

print(f"Number of docs: {len(docs)}")
print(docs[0].page_content)


Number of docs: 1
Table of Contents

Table of Contents

Walmart

Amazon

PetroChina

China Petroleum & Chemical

UnitedHealth

Apple

Berkshire Hathaway

CVS

Volkswagen

Exxon Mobil

FAQs

The Bottom Line

10 Biggest Companies in the World

WMT, AMZ, and PCCYF top the list of the 10 biggest companies in the world by revenue

By

Nathan Reiff



Full Bio

Nathan Reiff has been writing expert articles and news about financial topics such as investing and trading, cryptocurrency, ETFs, and alternative investments on Investopedia since 2016.

Learn about our editorial policies

Updated November 19, 2024

Reviewed by Margaret James

Fact checked by

Jared Ecker

BW Photo

Fact checked by Jared Ecker

Full Bio

Jared Ecker is a researcher and fact-checker. He possesses over a decade of experience in the Nuclear and National Defense sectors resolving issues on platforms as varied as stealth bombers to UAVs. He holds an A.A.S. in Aviation Maintenance Technology, a B.A. in History, and a M.S. 

# 3. SeleniumURLLoader

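What: A loader that drives a real browser through Selenium, so dynamic (JavaScript-rendered) pages are fully loaded before their text is extracted. It requires the selenium and unstructured packages and a locally installed browser.

When to use: When the page's content only appears after JavaScript runs (for example, some news sites or dashboards).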
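One rough way to tell whether a page needs JavaScript rendering is to look at its raw HTML: if it contains scripts but almost no visible text, a plain HTTP loader will probably return very little. The sketch below uses only the standard library. The `PageStats` class, the `probably_needs_js` function, and the 200-character threshold are hypothetical choices for illustration, not part of LangChain or Selenium, and the heuristic can misclassify real pages.

```python
from html.parser import HTMLParser

HIDDEN_TAGS = ("script", "style", "noscript")


class PageStats(HTMLParser):
    """Count <script> tags and visible text characters (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.script_count = 0
        self.visible_chars = 0
        self._hidden_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Text inside script/style/noscript is not visible to a reader.
        if not self._hidden_depth:
            self.visible_chars += len(data.strip())


def probably_needs_js(html, min_visible_chars=200):
    """Heuristic: scripts are present but the static HTML has little text."""
    stats = PageStats()
    stats.feed(html)
    stats.close()
    return stats.script_count > 0 and stats.visible_chars < min_visible_chars


# Made-up examples: a server-rendered article and an empty single-page-app shell.
static_page = (
    "<html><body><article>"
    + "<p>Plain server-rendered text.</p>" * 20
    + "</article></body></html>"
)
js_shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'

print(probably_needs_js(static_page))  # False
print(probably_needs_js(js_shell))     # True
```

A check like this could run on the HTML fetched by a lightweight loader, falling back to SeleniumURLLoader only when it returns True, since starting a browser is much slower than a plain HTTP request.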

In [5]:
from langchain_community.document_loaders import SeleniumURLLoader

urls = ["https://python.langchain.com/docs/integrations/document_loaders/"]

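loader = SeleniumURLLoader(
    urls=urls,
    headless=True,     # run the browser without opening a window
    browser="chrome",  # requires a local Chrome installation
)

# Render each URL in the browser, then extract the page text
docs = loader.load()
print(docs[0].page_content)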

Open on GitHub

Document loaders

DocumentLoaders load data into the standard LangChain Document format.

Each DocumentLoader has its own specific parameters, but they can all be invoked in the same way with the .load method. An example use case is as follows:

from langchain_community.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(
    ...  # <-- Integration specific parameters here
)
data = loader.load()

Webpages​

The below document loaders allow you to load webpages.

See this guide for a starting point: How to: load web pages.

Document Loader Description Package/API Web Uses urllib and BeautifulSoup to load and parse HTML web pages Package Unstructured Uses Unstructured to load and parse web pages Package RecursiveURL Recursively scrapes all child links from a root URL Package Sitemap Scrapes all pages on a given sitemap Package Spider Crawler and scraper that returns LLM-ready data. API Firecrawl API service that can be deployed locally. API Docling Uses Docli