# Recursive URL Loader

We may want to load all URLs under a root directory.

For example, let's look at the [LangChain JS documentation](https://js.langchain.com/docs/).

This has many interesting child pages that we may want to read in bulk.

Of course, the `WebBaseLoader` can load a list of pages, but the challenge is traversing the tree of child pages to assemble that list in the first place.

The `RecursiveUrlLoader` does this traversal for us.

This also gives us the flexibility to exclude some children (e.g., the `api` directory, which has more than 800 child pages).
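To make the idea of "traversing the tree of child pages" concrete, here is a minimal sketch of such a crawl using only the standard library. It is not the `RecursiveUrlLoader` implementation; the `crawl` function, the `fetch` callable, and the in-memory `SITE` dictionary on `example.com` are all hypothetical stand-ins so the example runs without network access.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(root, fetch, exclude_dirs=()):
    """Return every URL under `root` reachable by following links.

    `fetch` is a hypothetical callable mapping a URL to its HTML text.
    """
    seen = set()
    stack = [root]
    while stack:
        url = stack.pop()
        if url in seen:
            continue
        seen.add(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            child, _ = urldefrag(urljoin(url, href))
            if not child.startswith(root):
                continue  # outside the root directory
            if any(child.startswith(d) for d in exclude_dirs):
                continue  # explicitly excluded subtree
            if child not in seen:
                stack.append(child)
    return sorted(seen)


# Tiny in-memory "site" standing in for real HTTP requests (assumption).
SITE = {
    "https://example.com/docs/": '<a href="a">A</a> <a href="api/x">X</a> <a href="/blog">B</a>',
    "https://example.com/docs/a": '<a href="b#top">B</a> <a href="./">home</a>',
    "https://example.com/docs/b": "",
}

urls = crawl(
    "https://example.com/docs/",
    fetch=lambda u: SITE.get(u, ""),
    exclude_dirs=["https://example.com/docs/api/"],
)
print(urls)
```

The `seen` set keeps the crawl from looping on pages that link back to each other, and the two prefix checks keep it inside the root while skipping excluded subtrees.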

In [1]:
from langchain.document_loaders import WebBaseLoader
from langchain.document_loaders.recursive_url_loader import RecursiveUrlLoader

Let's try a simple example.

In [2]:
url = "https://js.langchain.com/docs/modules/memory/examples/"
loader = RecursiveUrlLoader(url=url)
docs = loader.load()

In [3]:
len(docs)

12

In [4]:
docs[0].page_content

'\n\n\n\n\nMomento-Backed Chat Memory | 🦜️🔗 Langchain\n\n\n\n\n\nSkip to main content🦜️🔗 LangChainConceptsPython DocsJS/TS DocsGitHubCTRLKWelcome to LangChainGetting StartedSetup and InstallationQuickstart, using LLMsQuickstart, using Chat ModelsComponentsSchemaModelsPromptsIndexesMemoryExamplesBuffer MemoryUsing Buffer Memory with Chat ModelsBuffer Window MemoryConversation SummaryDynamoDB-Backed Chat MemoryEntity MemoryMomento-Backed Chat MemoryMotörhead MemoryRedis-Backed Chat MemoryUpstash Redis-Backed Chat MemoryVectorStore-Backed MemoryZep MemoryChainsAgentsUse CasesPersonal AssistantsQuestion AnsweringTabular Question AnsweringInteracting with APIsSummarizationAgent SimulationsAutonomous AgentsProductionEvents /\u200b CallbacksDeploymentTracingEcosystemAPI ReferenceComponentsMemoryExamplesMomento-Backed Chat MemoryMomento-Backed Chat MemoryFor distributed, serverless persistence across chat sessions, you can swap in a Momento-backed chat message history.\nBecause a Momento cache

In [5]:
docs[0].metadata

{'source': 'https://js.langchain.com/docs/modules/memory/examples/momento',
 'title': 'Momento-Backed Chat Memory | 🦜️🔗 Langchain',
 'description': 'For distributed, serverless persistence across chat sessions, you can swap in a Momento-backed chat message history.',
 'language': 'en'}

Now let's try a more extensive example: the `docs` root directory.

We will skip everything under `api`.

In [None]:
url = "https://js.langchain.com/docs/"
exclude_dirs = ["https://js.langchain.com/docs/api/"]
loader = RecursiveUrlLoader(url=url, exclude_dirs=exclude_dirs)

# Lazy load to yield docs while crawling:
# print each document as it arrives and keep it for later use.
docs = []
for doc in loader.lazy_load():
    print(doc)
    docs.append(doc)
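Because lazy loading yields documents one at a time, the caller can also stop early instead of waiting for the whole crawl. The sketch below illustrates the pattern with a hypothetical generator, `fake_lazy_load`, standing in for `loader.lazy_load()`; the page URLs are made up for the example.

```python
from itertools import islice

fetched = []  # records which pages were actually "fetched"


def fake_lazy_load():
    """Hypothetical stand-in for loader.lazy_load(): yields one page at a time."""
    for i in range(1000):
        fetched.append(i)  # a real loader would make an HTTP request here
        yield {"source": f"https://example.com/docs/page{i}"}


# Take only the first three documents; the remaining pages are never fetched.
first_three = list(islice(fake_lazy_load(), 3))
print(len(first_three), len(fetched))
```

Assuming the real generator behaves the same way, wrapping `loader.lazy_load()` in `islice` would give a quick preview of a large site without crawling all of it.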

In [12]:
docs[0].page_content[:50]

'\n\n\n\n\nSetup and Installation | 🦜️🔗 Langchain\n\n\n\n\n\nS'

In [13]:
docs[0].metadata

{'source': 'https://js.langchain.com/docs/getting-started/install',
 'title': 'Setup and Installation | 🦜️🔗 Langchain',
 'description': 'Updating from <0.0.52? See this section for instructions.',
 'language': 'en'}