# **WebRAG: A Retrieval-Augmented Generation (RAG) System for Web Content**

## **Project Overview**

WebRAG reads a target website's sitemap to recursively find all of its sub-pages, extracts and cleans their content, stores it in a vector database, and uses it in a Retrieval-Augmented Generation (RAG) pipeline powered by LangChain.

## **1. Web Crawling**

Extract page links from the site's `sitemap.xml`. The sitemap's location can also be declared in `robots.txt`.
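
When the sitemap URL is not known in advance, `robots.txt` may declare it through one or more `Sitemap:` lines. The sketch below shows one way to read those lines from text that has already been downloaded. `sitemaps_from_robots` is a hypothetical helper, not part of the original notebook, and the example URLs are placeholders.

```python
def sitemaps_from_robots(robots_txt: str) -> list[str]:
    """Return the sitemap URLs declared via `Sitemap:` lines in robots.txt text."""
    sitemap_urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, because the URL itself contains colons.
        field, sep, value = line.partition(":")
        # Field names are matched case-insensitively.
        if sep and field.strip().lower() == "sitemap" and value.strip():
            sitemap_urls.append(value.strip())
    return sitemap_urls


# Illustrative input only; example.com is a placeholder domain.
example_robots = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""
print(sitemaps_from_robots(example_robots))
```

In the notebook, the input text could come from `requests.get(".../robots.txt", timeout=10).text`, and each returned URL could then be passed to the sitemap parser.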

In [None]:
import requests
import xml.etree.ElementTree as ET

def get_sitemap_links(sitemap_url, filter_path="/doc/"):
    """Parse `sitemap.xml` and return the page links that match the filter."""
    try:
        response = requests.get(sitemap_url, timeout=10)
        if response.status_code != 200:
            print(f"⚠️ Failed to fetch sitemap: {sitemap_url}")
            return []

        sitemap_xml = response.text
        root = ET.fromstring(sitemap_xml)
        namespaces = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}

        # Collect every <loc> entry, skipping empty elements
        urls = [elem.text.strip() for elem in root.findall(".//ns:loc", namespaces) if elem.text]

        # Keep only the links that contain `filter_path`
        filtered_urls = [url for url in urls if filter_path in url]
        
        return filtered_urls
    except Exception as e:
        print(f"⚠️ Error parsing sitemap {sitemap_url}: {e}")
        return []

In [None]:
sitemap_url = "https://python.langchain.com/sitemap.xml"
filtered_pages = get_sitemap_links(sitemap_url, filter_path="https://python.langchain.com/docs/integrations/chat/")

print(f"\n🔗 Found {len(filtered_pages)} Pages in Sitemap")


🔗 Found 79 Pages in Sitemap
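
The sitemaps protocol also allows a sitemap index: a `<sitemapindex>` document whose `<loc>` entries point to further sitemaps rather than to pages. For such a file, a parser that only collects `<loc>` values returns sitemap URLs instead of page URLs. The following is a minimal sketch of how nested sitemaps could be followed. `split_sitemap_entries`, `collect_page_urls`, and the `fetch` callable are hypothetical names that do not appear in the original notebook.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"ns": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def split_sitemap_entries(sitemap_xml: str):
    """Return (page_urls, child_sitemap_urls) for one sitemap document."""
    root = ET.fromstring(sitemap_xml)
    locs = [
        elem.text.strip()
        for elem in root.findall(".//ns:loc", SITEMAP_NS)
        if elem.text and elem.text.strip()
    ]
    # A <sitemapindex> root lists other sitemaps; a <urlset> root lists pages.
    if root.tag.rsplit("}", 1)[-1] == "sitemapindex":
        return [], locs
    return locs, []


def collect_page_urls(sitemap_url: str, fetch, max_depth: int = 3):
    """Walk nested sitemaps; `fetch` maps a URL to that sitemap's XML text."""
    if max_depth < 0:
        # Stop descending to guard against very deep or cyclic indexes.
        return []
    pages, children = split_sitemap_entries(fetch(sitemap_url))
    for child_url in children:
        pages.extend(collect_page_urls(child_url, fetch, max_depth - 1))
    return pages
```

With `requests`, `fetch` could be as simple as `lambda url: requests.get(url, timeout=10).text`. The resulting list can then be filtered by path in the same way as `filtered_urls`.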


## **2. Data Preprocessing**

Clean and structure the raw webpage data for efficient retrieval. Remove unnecessary elements (JS, CSS, ads) using `BeautifulSoup`. Split the cleaned text into chunks with `LangChain`. Convert the chunks into vector embeddings. Store the vectors in `Pinecone` together with metadata (page URL, chunk ID, etc.) so that each retrieved chunk can be traced back to its source page.
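
To make the chunking and metadata steps concrete, here is a small pure-Python sketch. It is not LangChain's splitter: it uses fixed-size character windows with overlap as a simplified stand-in, and it only prepares records, without computing embeddings or calling Pinecone. `chunk_text`, `build_records`, the record layout, and the default sizes are assumptions made for illustration.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def build_records(page_url: str, text: str, **chunk_kwargs) -> list[dict]:
    """Attach the page URL and a chunk ID to every chunk of one page."""
    return [
        {
            "id": f"{page_url}#chunk-{i}",  # hypothetical ID scheme
            "text": chunk,
            "metadata": {"url": page_url, "chunk_id": i},
        }
        for i, chunk in enumerate(chunk_text(text, **chunk_kwargs))
    ]
```

The overlap keeps text that straddles a window boundary present in both neighbouring chunks. Keeping the URL and chunk ID next to each chunk is what later lets a retrieved vector be mapped back to its page.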

In [None]:
import getpass
import os
import time

from pinecone import Pinecone, ServerlessSpec

if not os.getenv("PINECONE_API_KEY"):
    os.environ["PINECONE_API_KEY"] = getpass.getpass("Enter your Pinecone API key: ")

pinecone_api_key = os.environ.get("PINECONE_API_KEY")

pc = Pinecone(api_key=pinecone_api_key)

Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/getpass.py", line 48, in unix_getpass
    fd = os.open('/dev/tty', os.O_RDWR|os.O_NOCTTY)
OSError: [Errno 6] Device not configured: '/dev/tty'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/getpass.py", line 59, in unix_getpass
    fd = sys.stdin.fileno()
io.UnsupportedOperation: fileno

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/zerenshen/.vscode/extensions/ms-python.python-2024.22.2-darwin-arm64/python_files/python_server.py", line 133, in exec_user_input
    retval = callable_(user_input, user_globals)
  File "<string>", line 8, in <module>
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Py