# RAG Data Pipeline: OpenAI Research Page

This notebook demonstrates how to use **Thordata's Universal Scraping API**
to fetch a dynamic web page, clean the HTML into LLM‑friendly text, and save
it as a Markdown file ready for vector database ingestion.

We support two modes:

- **Live mode** (`USE_LIVE_THORDATA = True`): calls Thordata APIs and consumes credits.
- **Offline mode** (`USE_LIVE_THORDATA = False`): loads data from local cache files.
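
One way to switch modes without editing the notebook is to read the flag from the environment. The following is a minimal sketch, assuming a hypothetical `USE_LIVE_THORDATA` environment variable and a hypothetical helper `env_flag`; neither is part of the original notebook or of the Thordata SDK.

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Hypothetical variable name (an assumption). Defaults to offline mode
# so that no credits are consumed by accident.
USE_LIVE_THORDATA = env_flag("USE_LIVE_THORDATA", default=False)
```

Defaulting to `False` keeps the credit-consuming path opt-in.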

In [1]:
import os
from datetime import datetime
from pathlib import Path
from typing import List

from bs4 import BeautifulSoup
from dotenv import load_dotenv
from thordata import ThordataClient

load_dotenv()  # Load THORDATA_* tokens from .env

True

In [2]:
# Resolve project root from the current working directory, which is
# assumed to be notebooks/rag: parents[0] = notebooks, parents[1] = repo root
ROOT_DIR = Path.cwd().parents[1]

# Toggle between live API calls and local cached data.
# Set to True only when you want to consume Thordata credits.
USE_LIVE_THORDATA = False

# Cache directory for raw HTML (under repo root)
CACHE_DIR = ROOT_DIR / "data"
RAW_HTML_PATH = CACHE_DIR / "openai_research_raw.html"

# Fallback: cleaned markdown generated by previous scripts or runs.
# This file is expected to be located in the project root.
CLEAN_MD_FALLBACK_PATH = ROOT_DIR / "knowledge_base_sample.md"

target_url = "https://openai.com/research/"

print("CWD:", os.getcwd())
print("ROOT_DIR:", ROOT_DIR)
print("USE_LIVE_THORDATA:", USE_LIVE_THORDATA)
print("RAW_HTML_PATH:", RAW_HTML_PATH)
print("CLEAN_MD_FALLBACK_PATH:", CLEAN_MD_FALLBACK_PATH)

CWD: D:\Thordata_Work\thordata-cookbook\notebooks\rag
ROOT_DIR: D:\Thordata_Work\thordata-cookbook
USE_LIVE_THORDATA: False
RAW_HTML_PATH: D:\Thordata_Work\thordata-cookbook\data\openai_research_raw.html
CLEAN_MD_FALLBACK_PATH: D:\Thordata_Work\thordata-cookbook\knowledge_base_sample.md
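
`Path.cwd().parents[1]` only resolves to the repo root when the kernel starts in `notebooks/rag`. A more tolerant lookup could walk upward until it finds a marker file. This is a sketch only: the helper name `find_repo_root` and the marker names (`.git`, `pyproject.toml`) are assumptions, not something the cookbook repo is known to rely on.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_repo_root(
    start: Optional[Path] = None,
    markers: Iterable[str] = (".git", "pyproject.toml"),  # assumed marker names
) -> Path:
    """Return the nearest ancestor of `start` (inclusive) that contains a marker."""
    start = (start or Path.cwd()).resolve()
    marker_names = tuple(markers)
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in marker_names):
            return candidate
    raise FileNotFoundError(f"No marker {marker_names} found at or above {start}")
```

With such a helper, `ROOT_DIR = find_repo_root()` would give the same result regardless of which subdirectory the notebook is launched from.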


In [3]:
def clean_html_to_markdown(html_content: str) -> str:
    """
    Convert messy HTML into Markdown-style text suitable for RAG / LLMs.

    - Remove scripts, styles, navigation, footers, iframes, etc.
    - Collect headings (h1–h3) and reasonably long paragraphs.
    """
    soup = BeautifulSoup(html_content, "html.parser")

    # Remove irrelevant tags
    for tag in soup(["script", "style", "nav", "footer", "iframe", "noscript"]):
        tag.decompose()

    markdown_lines: List[str] = []

    # Headings (H1–H3)
    for heading in soup.find_all(["h1", "h2", "h3"]):
        level = int(heading.name[1])
        prefix = "#" * level
        markdown_lines.append(f"\n{prefix} {heading.get_text(strip=True)}\n")

    # Paragraphs
    for p in soup.find_all("p"):
        text = p.get_text(strip=True)
        if len(text) > 20:
            markdown_lines.append(text)

    return "\n".join(markdown_lines)
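
The function above collects all headings first and all paragraphs afterwards, so the output does not follow the page's reading order. In addition, `get_text(strip=True)` joins adjacent inline elements without a space, which is how strings such as `ReleaseNov 12, 20258 min read` arise. The variant below is a sketch of one possible alternative, not the notebook's actual method. The name `clean_html_in_document_order` and the `min_paragraph_chars` parameter are hypothetical.

```python
from typing import List

from bs4 import BeautifulSoup


def clean_html_in_document_order(html_content: str, min_paragraph_chars: int = 20) -> str:
    """Convert HTML to Markdown-style text, keeping headings and paragraphs in page order."""
    soup = BeautifulSoup(html_content, "html.parser")

    # Remove irrelevant tags, as in the original function
    for tag in soup(["script", "style", "nav", "footer", "iframe", "noscript"]):
        tag.decompose()

    lines: List[str] = []
    # A single find_all over all wanted tag names yields matches in document order
    for element in soup.find_all(["h1", "h2", "h3", "p"]):
        # The " " separator keeps text from adjacent inline tags from fusing together
        text = element.get_text(" ", strip=True)
        if not text:
            continue
        if element.name == "p":
            if len(text) > min_paragraph_chars:
                lines.append(text)
        else:
            level = int(element.name[1])
            lines.append(f"\n{'#' * level} {text}\n")

    return "\n".join(lines)
```

Keeping each paragraph next to its heading tends to matter for RAG, because chunks are usually cut along heading boundaries.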

In [4]:
SCRAPER_TOKEN = os.getenv("THORDATA_SCRAPER_TOKEN")
PUBLIC_TOKEN = os.getenv("THORDATA_PUBLIC_TOKEN")
PUBLIC_KEY = os.getenv("THORDATA_PUBLIC_KEY")

# Only THORDATA_SCRAPER_TOKEN is checked here.
if not SCRAPER_TOKEN:
    raise ValueError("Please set THORDATA_SCRAPER_TOKEN in your .env file.")

client = ThordataClient(
    scraper_token=SCRAPER_TOKEN,
    public_token=PUBLIC_TOKEN,
    public_key=PUBLIC_KEY,
)

client

<thordata.client.ThordataClient at 0x1af083512b0>
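
The check above only looks at `THORDATA_SCRAPER_TOKEN`. To see at a glance which of the three variables are unset, a small helper like the following could be used. It is a sketch; `missing_env_vars` is a hypothetical name, and whether the public token and key are strictly required depends on the SDK calls you make.

```python
import os
from typing import List, Mapping, Optional, Sequence


def missing_env_vars(
    names: Sequence[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


THORDATA_VARS = [
    "THORDATA_SCRAPER_TOKEN",
    "THORDATA_PUBLIC_TOKEN",
    "THORDATA_PUBLIC_KEY",
]

missing = missing_env_vars(THORDATA_VARS)
if missing:
    # Report rather than raise, so offline mode can still proceed
    print("Missing in .env:", ", ".join(missing))
```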

In [5]:
if USE_LIVE_THORDATA:
    # Live mode: call Thordata Universal API and cache raw HTML
    print(f"Fetching page from Thordata: {target_url}")
    html = client.universal_scrape(
        url=target_url,
        js_render=True,
        output_format="HTML",
    )

    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(RAW_HTML_PATH, "w", encoding="utf-8") as f:
        f.write(html)
    print(f"Cached raw HTML to {RAW_HTML_PATH}")

    # Clean HTML into markdown
    markdown_content = clean_html_to_markdown(html)
else:
    # Offline mode:
    # 1) Prefer an existing cleaned markdown file if present
    if os.path.exists(CLEAN_MD_FALLBACK_PATH):
        print(f"Loading cleaned markdown directly from {CLEAN_MD_FALLBACK_PATH}")
        with open(CLEAN_MD_FALLBACK_PATH, "r", encoding="utf-8") as f:
            markdown_content = f.read()
    # 2) Otherwise, try to load cached raw HTML and clean it
    elif os.path.exists(RAW_HTML_PATH):
        print(f"Loading cached HTML from {RAW_HTML_PATH}")
        with open(RAW_HTML_PATH, "r", encoding="utf-8") as f:
            html = f.read()
        markdown_content = clean_html_to_markdown(html)
    else:
        raise FileNotFoundError(
            f"No cached HTML ({RAW_HTML_PATH}) or fallback markdown "
            f"({CLEAN_MD_FALLBACK_PATH}) found. "
            "Run in live mode once to create them."
        )

len(markdown_content)

Loading cleaned markdown directly from D:\Thordata_Work\thordata-cookbook\knowledge_base_sample.md


2795
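
The offline branch above follows a fixed order: cleaned Markdown first, then cached raw HTML, otherwise an error. That order can be factored into a reusable function. The sketch below uses `pathlib` throughout; the name `load_offline_markdown` and its parameters are hypothetical.

```python
from pathlib import Path
from typing import Callable


def load_offline_markdown(
    clean_md_path: Path,
    raw_html_path: Path,
    cleaner: Callable[[str], str],
) -> str:
    """Load Markdown from local files, preferring already-cleaned content."""
    # 1) Prefer an existing cleaned Markdown file
    if clean_md_path.exists():
        return clean_md_path.read_text(encoding="utf-8")
    # 2) Otherwise clean the cached raw HTML
    if raw_html_path.exists():
        return cleaner(raw_html_path.read_text(encoding="utf-8"))
    raise FileNotFoundError(
        f"No cached HTML ({raw_html_path}) or fallback markdown "
        f"({clean_md_path}) found. Run in live mode once to create them."
    )
```

In this notebook the call would look like `load_offline_markdown(CLEAN_MD_FALLBACK_PATH, RAW_HTML_PATH, clean_html_to_markdown)`.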

In [6]:
print(markdown_content[:1000])

Source: https://openai.com/research/

Source: https://openai.com/research/


# Pioneering research on the path to AGI


### We use Deep Learning to leverage large amounts of data and advanced reasoning to train AI systems for task completion.


### GPT


### o series


### Visual


### Audio


### Text


### Featured roles


### Research Engineer


### Research Engineer, Codex


### Research Engineer, Frontier Evals & Environments


### Research Scientist


## Explore all research

We believe our research will eventually lead to artificial general intelligence, a system that can solve human-level problems. Our mission is to ensure that AGI benefits all of humanity.
OpenAI’s GPT series models are fast, versatile, and cost-efficient AI systems designed to understand context, generate content, and reason across text, images, and more.
A smarter, more conversational ChatGPT
ReleaseNov 12, 20258 min read
Our smartest, fastest, and most useful model yet, with thinking built in
ReleaseAug 7, 

In [7]:
# Use a timestamp to generate a unique file name
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
output_file = CACHE_DIR / f"knowledge_base_{timestamp}.md"  # saved under data/

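CACHE_DIR.mkdir(parents=True, exist_ok=True)  # data/ may not exist yet in offline mode
with output_file.open("w", encoding="utf-8") as f:
    # The offline fallback file already starts with a "Source:" header; avoid repeating it
    if not markdown_content.startswith("Source:"):
        f.write(f"Source: {target_url}\n\n")
    f.write(markdown_content)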

print(f"Saved cleaned markdown to {output_file}")

Saved cleaned markdown to D:\Thordata_Work\thordata-cookbook\data\knowledge_base_20251128_100915.md


The timestamped file saved under `data/` (`knowledge_base_<timestamp>.md`) is now
ready to be embedded into a vector database (e.g. Pinecone, Weaviate, Chroma)
and used in a RAG pipeline.
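
Before embedding, the Markdown is typically split into smaller chunks. The sketch below shows one simple approach: cut at h1 to h3 headings, then cut any oversized section into fixed-size pieces. The name `split_markdown_by_headings` and the `max_chars` default are assumptions for illustration. Real pipelines often use token-based limits and overlapping chunks instead.

```python
import re
from typing import List


def split_markdown_by_headings(markdown: str, max_chars: int = 1000) -> List[str]:
    """Split Markdown into chunks that begin at headings and hold at most max_chars characters."""
    # Zero-width split before every line that starts with 1-3 '#' and a space
    sections = re.split(r"(?m)^(?=#{1,3} )", markdown)

    chunks: List[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Sections longer than max_chars are cut into fixed-size pieces
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks
```

Each returned string can then be passed to the embedding model and stored with the source URL as metadata.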