# WebBaseLoader

This guide covers how to use `WebBaseLoader` to load all text from HTML webpages into a document format that can be used downstream. For more customized webpage-loading logic, see child classes such as `IMSDbLoader`, `AZLyricsLoader`, and `CollegeConfidentialLoader`.

If you don't want to worry about website crawling, bypassing JS-blocking sites, and data cleaning, consider using `FireCrawlLoader` or the faster option `SpiderLoader`.

## Overview
### Integration details

| Class | Package | Local | Serializable | JS support |
| :--- | :--- | :---: | :---: | :---: |
| [WebBaseLoader](https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html) | [langchain_community](https://python.langchain.com/api_reference/community/index.html) | ✅ | ❌ | ❌ |

### Loader features

| Source | Document Lazy Loading | Native Async Support |
| :---: | :---: | :---: |
| WebBaseLoader | ✅ | ✅ |

## Setup

### Credentials

`WebBaseLoader` does not require any credentials.

### Installation

To use `WebBaseLoader`, first install the `langchain-community` Python package, along with `beautifulsoup4`, which the loader uses to parse pages.


In [1]:
%pip install -qU langchain_community beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


## Initialization

Now we can instantiate our loader object and load documents:

In [2]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://nd9fgiy0w0.feishu.cn/docx/JfjjdJgTuoc484x8HFGcdzAinNe")

USER_AGENT environment variable not set, consider setting it to identify your requests.
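
The warning above is printed because no `USER_AGENT` environment variable was found. A minimal sketch of setting one from Python follows. The agent string is a made-up placeholder, and running this before the `langchain_community` import is an assumption about when the variable is read.

```python
import os

# Hypothetical identifier; replace it with something that describes your own client.
# setdefault keeps any value that is already exported in the shell.
os.environ.setdefault("USER_AGENT", "my-docs-crawler/0.1")

print(os.environ["USER_AGENT"])
```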


To bypass SSL verification errors during fetching, set the `verify` option in the loader's `requests_kwargs`:

`loader.requests_kwargs = {"verify": False}`

### Initialization with multiple pages

You can also pass in a list of URLs to load.

In [3]:
loader_multiple_pages = WebBaseLoader(["https://www.espn.com/", "https://google.com"])

## Load

In [4]:
docs = loader.load()

docs[0]

Document(metadata={'source': 'https://nd9fgiy0w0.feishu.cn/docx/JfjjdJgTuoc484x8HFGcdzAinNe', 'title': 'Docs', 'language': 'No language found.'}, page_content=' Docs             Add ShortcutLast modified: 5 hours agoShareæ¸¯å�£ç²¾é€‰ç\xa0”æŠ¥ 10.2Type "/" for commandsæ¸¯å�£ç²¾é€‰ç\xa0”æŠ¥ 10.2â€‹1 ä¸\xadè‹±å›¾è¡¨ GS-GOAL Kickstart China catch-up  tracking cross-asset repricing in the bullish reversal 9.30â€‹1 è‹±æ–‡å›¾è¡¨ GS-GOAL Kickstart_ China catch-up - tracking cross-asset repricing in the bullish reversal 9.30.pdf2.91MB1 ä¸\xadæ–‡å›¾è¡¨ GS-GOAL Kickstart_ China catch-up - tracking cross-asset repricing in the bullish reversal.pdf6.64MBâ€‹2 è‹±æ–‡å›¾è¡¨ JPM-Commodity Market Positioning & Flows Global commodity open interest recovery continues as base metals are buoyed by China stimulus 9.30â€‹2 è‹±æ–‡å›¾è¡¨ JPM-Commodity Market Positioning & Flows Global commodity open interest recovery continues as base metals are buoyed by China stimulus 9.30.pdf6.45MB2 ä¸\xadæ–‡å›¾è¡¨ JPM-Commo

In [5]:
print(docs[0].metadata)

{'source': 'https://nd9fgiy0w0.feishu.cn/docx/JfjjdJgTuoc484x8HFGcdzAinNe', 'title': 'Docs', 'language': 'No language found.'}
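
The `page_content` above contains runs of accented Latin characters, a pattern that typically appears when UTF-8 bytes are decoded with a single-byte codec. The sketch below reproduces and reverses that effect on a sample string using only the standard library. That this is what happened to the page above is an assumption, and the sample text is made up.

```python
# Sample text standing in for a page that is served as UTF-8.
original = "中文图表"
raw_bytes = original.encode("utf-8")

# Decoding UTF-8 bytes as Latin-1 never fails, but yields unreadable text.
garbled = raw_bytes.decode("latin-1")

# Latin-1 maps every byte to exactly one character, so the mistake is reversible.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(repaired)
```

If the loader picks the wrong codec, the `encoding` constructor argument listed in the API reference may help; whether it fixes this particular page has not been verified here.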


### Load multiple URLs concurrently

You can speed up the scraping process by scraping and parsing multiple URLs concurrently.

Concurrent requests are limited by default to 2 per second. If you aren't concerned about being a good citizen, or you control the server you are scraping and don't care about load, you can raise the `requests_per_second` parameter to allow more concurrent requests. Note that while this speeds up scraping, it may cause the server to block you. Be careful!

In [6]:
%pip install -qU  nest_asyncio

# fixes a bug with asyncio and jupyter
import nest_asyncio

nest_asyncio.apply()

Note: you may need to restart the kernel to use updated packages.


In [7]:
loader = WebBaseLoader(["https://www.espn.com/", "https://google.com"])
loader.requests_per_second = 1
docs = loader.aload()
docs

Fetching pages: 100%|##########| 2/2 [00:01<00:00,  1.10it/s]


[Document(metadata={'source': 'https://www.espn.com/', 'title': 'ESPN - Serving Sports Fans. Anytime. Anywhere.', 'description': 'Visit ESPN for live scores, highlights and sports news. Stream exclusive games on ESPN+ and play fantasy sports.', 'language': 'en'}, page_content="\n\n\n\n\n\n\n\n\nESPN - Serving Sports Fans. Anytime. Anywhere.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n        Skip to main content\n    \n\n        Skip to navigation\n    \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n<\n\n>\n\n\n\n\n\n\n\n\n\nMenuESPN\n\n\n\n\n\nscores\n\n\n\n\nNEW! Find where to watch all of your favorite sports!\n\n\n\n\n\n\n\nNFLNBAMLBNCAAFNHLSoccerWNBAMore SportsBoxingCFLNCAACricketF1GolfHorseLLWSMMANASCARNBA G LeagueNBA Summer LeagueNCAAMNCAAWNWSLOlympicsPLLProfessional WrestlingRacingRN BBRN FBRugbySports BettingTennisX Ga
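
A cap on concurrent requests like the one described in this section can be pictured with an `asyncio.Semaphore`. The sketch below is an illustration rather than the loader's actual implementation: `fetch` and `fetch_all` are hypothetical helpers, and `asyncio.sleep` stands in for the network call.

```python
import asyncio


async def fetch(url: str, semaphore: asyncio.Semaphore, stats: dict) -> str:
    # Only `max_concurrent` coroutines can be inside this block at once.
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for the real HTTP request
        stats["active"] -= 1
        return f"<html>{url}</html>"


async def fetch_all(urls: list[str], max_concurrent: int = 2) -> tuple[list[str], int]:
    semaphore = asyncio.Semaphore(max_concurrent)
    stats = {"active": 0, "peak": 0}
    # gather returns results in the same order as the input URLs.
    pages = await asyncio.gather(*(fetch(u, semaphore, stats) for u in urls))
    return list(pages), stats["peak"]


# In a notebook, apply nest_asyncio first (as above) or use `await fetch_all(urls)`.
urls = [f"https://example.com/page{i}" for i in range(6)]
pages, peak = asyncio.run(fetch_all(urls))
print(len(pages), "pages fetched, peak concurrency:", peak)
```

Raising `max_concurrent` shortens the total run time but puts more simultaneous load on the server, which is the trade-off described above.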

### Loading an XML file, or using a different BeautifulSoup parser

Set `default_parser` to choose which BeautifulSoup parser the loader uses; the example below uses `"xml"`. `SitemapLoader`, which loads sitemap files, is another example of this feature.

In [8]:
loader = WebBaseLoader(
    "https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml"
)
loader.default_parser = "xml"
docs = loader.load()
docs

[Document(metadata={'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}, page_content='\n\n10\nEnergy\n3\n2018-01-01\n2018-01-01\nfalse\nUniform test method for the measurement of energy efficiency of commercial packaged boilers.\nÂ§ 431.86\nSection Â§ 431.86\n\nEnergy\nDEPARTMENT OF ENERGY\nENERGY CONSERVATION\nENERGY EFFICIENCY PROGRAM FOR CERTAIN COMMERCIAL AND INDUSTRIAL EQUIPMENT\nCommercial Packaged Boilers\nTest Procedures\n\n\n\n\n§\u2009431.86\nUniform test method for the measurement of energy efficiency of commercial packaged boilers.\n(a) Scope. This section provides test procedures, pursuant to the Energy Policy and Conservation Act (EPCA), as amended, which must be followed for measuring the combustion efficiency and/or thermal efficiency of a gas- or oil-fired commercial packaged boiler.\n(b) Testing and Calculations. Determine the thermal efficiency or combustion efficiency of commercial packaged boilers by condu

## Lazy Load

You can use lazy loading to only load one page at a time in order to minimize memory requirements.

In [9]:
pages = []
for doc in loader.lazy_load():
    pages.append(doc)

print(pages[0].page_content[:100])
print(pages[0].metadata)



10
Energy
3
2018-01-01
2018-01-01
false
Uniform test method for the measurement of energy efficien
{'source': 'https://www.govinfo.gov/content/pkg/CFR-2018-title10-vol3/xml/CFR-2018-title10-vol3-sec431-86.xml'}
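
Appending every document to a list, as in the cell above, still keeps all of them in memory. To benefit from lazy loading, each document can be processed as it is yielded. The sketch below shows that pattern with a stand-in generator; `Page` and `lazy_pages` are hypothetical substitutes for `Document` and `loader.lazy_load()` so that the example runs without network access.

```python
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Page:
    # Stand-in for a loaded document: text plus metadata.
    page_content: str
    metadata: dict = field(default_factory=dict)


def lazy_pages(urls: list[str]) -> Iterator[Page]:
    # A generator builds each page only when the consumer asks for it.
    for url in urls:
        yield Page(page_content=f"text of {url}", metadata={"source": url})


total_chars = 0
sources = []
for page in lazy_pages(["https://example.com/a", "https://example.com/b"]):
    # Process the page here (index it, write it to disk, ...), then let it go.
    total_chars += len(page.page_content)
    sources.append(page.metadata["source"])

print(total_chars, sources)
```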


## Using proxies

Sometimes you might need to use proxies to get around IP blocks. You can pass in a dictionary of proxies to the loader (and `requests` underneath) to use them.

In [10]:
loader = WebBaseLoader(
    "https://www.walmart.com/search?q=parrots",
    proxies={
        "http": "http://{username}:{password}@proxy.service.com:6666/",
        "https": "https://{username}:{password}@proxy.service.com:6666/",
    },
)
docs = loader.load()
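
Credentials that contain characters such as `@` or `:` need to be percent-encoded before they are embedded in a proxy URL. A minimal sketch using the standard library follows; `build_proxies` is a hypothetical helper, and the host and port are the placeholders from the example above.

```python
from urllib.parse import quote


def build_proxies(username: str, password: str, host: str, port: int) -> dict[str, str]:
    # safe="" makes quote() escape every reserved character, including "/".
    credentials = f"{quote(username, safe='')}:{quote(password, safe='')}"
    return {
        scheme: f"{scheme}://{credentials}@{host}:{port}/"
        for scheme in ("http", "https")
    }


proxies = build_proxies("user", "p@ss:word", "proxy.service.com", 6666)
print(proxies["http"])
```

The resulting dictionary can be passed as the `proxies` argument shown above.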


## API reference

For detailed documentation of all `WebBaseLoader` features and configurations head to the API reference: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html