# Recursive URL

We may want to load all URLs under a root directory.

For example, let's look at the [Python 3.9 Documentation](https://docs.python.org/3.9/).

This has many interesting child pages that we may want to read in bulk.

Of course, the `WebBaseLoader` can load a list of pages. 

The challenge is traversing the tree of child pages and actually assembling that list.
 
We do this using the `RecursiveUrlLoader`.

This also gives us the flexibility to exclude some children, customize the extractor, and more.
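To make the traversal problem concrete, here is a minimal standalone sketch of a breadth-first walk over child links. It is not how `RecursiveUrlLoader` is implemented; the in-memory `PAGES` dictionary, the `example.com` URLs, and the `collect_child_urls` helper are all made up for illustration, and no network access is involved.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory "site": URL -> raw HTML (stands in for real HTTP fetches).
PAGES = {
    "https://example.com/docs/": '<a href="a.html">A</a> <a href="b/">B</a>',
    "https://example.com/docs/a.html": '<a href="b/">B</a>',
    "https://example.com/docs/b/": (
        '<a href="../a.html">A</a> <a href="https://other.example/">out</a>'
    ),
}

HREF = re.compile(r'href="([^"]+)"')


def collect_child_urls(root, max_depth=2):
    """Breadth-first walk from root, returning each URL found under root once.

    The root is at depth 0; pages at depth >= max_depth are not expanded.
    """
    seen = [root]
    queue = deque([(root, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for href in HREF.findall(PAGES.get(url, "")):
            child = urljoin(url, href)  # resolve relative links
            # Stay under the root and skip pages we have already queued.
            if child.startswith(root) and child not in seen:
                seen.append(child)
                queue.append((child, depth + 1))
    return seen


urls = collect_child_urls("https://example.com/docs/")
print(urls)
```

The sketch shows the three concerns any recursive loader has to handle: resolving relative links, staying under the root, and avoiding repeat visits.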

## Basic Example

Let's run through a basic example of how to use the `RecursiveUrlLoader` on the [Python 3.9 Documentation](https://docs.python.org/3.9/).

### Library Installation

Before starting, let's make sure we have installed the libraries needed to run our code examples.

In [None]:
%pip install --upgrade --quiet langchain_community

In [1]:
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader

In [2]:
url = "https://docs.python.org/3.9/"
loader = RecursiveUrlLoader(
    url=url
)
docs = loader.load()


Let's examine the metadata of the first document we loaded:

In [4]:
docs[0].metadata

{'source': 'https://docs.python.org/3.9/',
 'content_type': 'text/html',
 'title': '3.9.19 Documentation',
 'language': None}

Great! That looks like the root page we started from. Let's look at the metadata of the next document:

In [5]:
docs[1].metadata

{'source': 'https://docs.python.org/3.9/c-api/index.html',
 'content_type': 'text/html',
 'title': 'Python/C API Reference Manual — Python 3.9.19 documentation',
 'language': None}

That URL looks like a child of our root page, which is great! Let's move on from metadata to examine the content of one of our documents:

In [8]:
docs[0].page_content[:250]

'\n<!DOCTYPE html>\n\n<html xmlns="http://www.w3.org/1999/xhtml">\n  <head>\n    <meta charset="utf-8" /><title>3.9.19 Documentation</title><meta name="viewport" content="width=device-width, initial-scale=1.0">\n    \n    <link rel="stylesheet" href="_static'

That certainly looks like HTML from the URL https://docs.python.org/3.9/, which is what we expected. Let's now look at some variations on our basic example that can be helpful in different situations.

## More Examples

### Adding an Extractor

In the basic example, our documents' page content was unfiltered, meaning that the loader returned raw HTML. In most cases, we would like to extract more useful information. To do this, we can pass an extractor to our `RecursiveUrlLoader` so that we keep only the information we want.

In [None]:
%pip install --upgrade --quiet beautifulsoup4

In [11]:
from bs4 import BeautifulSoup as Soup

loader = RecursiveUrlLoader(
    url=url, extractor=lambda x: Soup(x, "html.parser").text
)
docs = loader.load()


Now, let's take a look at the contents of the first document:

In [12]:
docs[0].page_content[:250]

'\n\n\n\n3.9.19 Documentation\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nDownload\nDownload these documents\nDocs by version\n\nPython 3.13 (in development)\nPython 3.12 (stable)\nPython 3.11 (security-fixes)\nPython 3.10 (security-fixes)\nPython 3.9 (security-fixe'

As desired, all the HTML tags have been removed and we are left with just the raw text from the page. You can write your own extractors to pull exactly the information you want from a page. In addition, you can pass a function to the `metadata_extractor` parameter; it takes the raw HTML, the URL, and the `requests.Response` as input and returns a dictionary that is used as the document's metadata.
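As a rough sketch of what custom extractors might look like, the block below defines a text extractor and a three-argument metadata extractor using only the standard library. The names `text_extractor`, `simple_metadata_extractor`, and the `length` metadata key are hypothetical, and the sample HTML is invented; the functions are exercised directly here rather than through the loader.

```python
from html.parser import HTMLParser


class _TitleAndText(HTMLParser):
    """Collects the <title> text and visible body text, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._in_title = False
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.chunks.append(data.strip())


def _parse(raw_html):
    parser = _TitleAndText()
    parser.feed(raw_html)
    parser.close()
    return parser


def text_extractor(raw_html):
    """Extractor: raw HTML in, newline-joined visible text out."""
    return "\n".join(_parse(raw_html).chunks)


def simple_metadata_extractor(raw_html, url, response):
    """Metadata extractor: (raw HTML, URL, response) in, metadata dict out.

    `response` is unused in this sketch.
    """
    return {
        "source": url,
        "title": _parse(raw_html).title.strip(),
        "length": len(raw_html),
    }


sample = (
    "<html><head><title>Demo</title><style>p{}</style></head>"
    "<body><h1>Hello</h1><p>World</p></body></html>"
)
print(text_extractor(sample))
print(simple_metadata_extractor(sample, "https://example.com/", None))
```

Functions like these could then be passed as `RecursiveUrlLoader(url=url, extractor=text_extractor, metadata_extractor=simple_metadata_extractor)`.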

### Lazy Loading

A helpful way to improve performance is to use lazy loading, which saves us from having to wait for all of our documents to load at once. It also reduces memory pressure by loading only one URL at a time. Let's take a look at how this works in practice.

In [19]:
loader = RecursiveUrlLoader(
    url=url
)
docs = loader.lazy_load()

In [20]:
first_doc = next(docs)
print(first_doc.metadata)

{'source': 'https://docs.python.org/3.9/', 'content_type': 'text/html', 'title': '3.9.19 Documentation', 'language': None}


Lazy loading is much quicker when we are only interested in the first document, or when our code only needs to look at one page at a time. Instead of storing all of the documents in memory, we store a single one at a time.
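The same idea extends to processing documents in small batches. The sketch below uses a stand-in generator, `fake_lazy_load`, in place of `loader.lazy_load()` so that it runs without network access; that generator, the `in_batches` helper, and the `example.com` URLs are hypothetical.

```python
from itertools import islice


def fake_lazy_load(urls, fetched):
    """Stand-in for loader.lazy_load(): yields one 'document' per URL on demand."""
    for url in urls:
        fetched.append(url)  # record the moment a page is actually "fetched"
        yield {"source": url}


def in_batches(iterator, size):
    """Yield lists of up to `size` items without materializing the whole iterator."""
    iterator = iter(iterator)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


fetched = []
docs = fake_lazy_load([f"https://example.com/page{i}" for i in range(5)], fetched)

# Only the pages needed for the first batch are fetched.
first_batch = next(in_batches(docs, 2))
print(len(first_batch), len(fetched))
```

Because `in_batches` accepts any iterator, the same helper should work unchanged on the generator returned by `lazy_load()`.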

## More Advanced Topics

These examples show just a few of the ways in which you can modify the default `RecursiveUrlLoader`; many more modifications can be made to fit your specific use case. The `link_regex` and `exclude_dirs` parameters can help you filter out unwanted URLs, and combining the `use_async` parameter with the `aload()` and `alazy_load()` methods allows you to use asynchronous loading for more advanced projects. For more information on these and other parameters, see the API reference.
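To illustrate the general idea of excluding directories, here is a small standalone sketch of prefix-based URL filtering. It does not call `RecursiveUrlLoader`; the `keep_url` helper and the candidate list are made up for illustration, and the exact way the loader interprets `exclude_dirs` and `link_regex` should be confirmed in the API reference.

```python
def keep_url(url, root, exclude_prefixes=()):
    """Return True if url is under root and not under any excluded prefix."""
    if not url.startswith(root):
        return False
    return not any(url.startswith(prefix) for prefix in exclude_prefixes)


root = "https://docs.python.org/3.9/"

# Illustrative candidates: two children of the root and one URL outside it.
candidates = [
    root + "c-api/index.html",
    root + "library/index.html",
    "https://docs.python.org/3.12/",
]

# Assumption for this sketch: exclusions are written as full URL prefixes.
kept = [u for u in candidates if keep_url(u, root, exclude_prefixes=[root + "c-api/"])]
print(kept)
```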