# RAG Data Pipeline: OpenAI Research Page

This notebook demonstrates how to use **Thordata's Universal Scraping API**
to fetch a JavaScript-rendered web page, reduce the HTML to LLM-friendly
Markdown-style text, and save it as a Markdown file ready for vector database
ingestion.

In [2]:
import os
from typing import List

from bs4 import BeautifulSoup
from dotenv import load_dotenv

from thordata import ThordataClient

load_dotenv()  # Load THORDATA_* tokens from .env

True

In [3]:
def clean_html_to_markdown(html_content: str) -> str:
    """
    Convert messy HTML into Markdown-style text suitable for RAG / LLMs.
    """
    soup = BeautifulSoup(html_content, "html.parser")

    # Remove irrelevant tags
    for tag in soup(["script", "style", "nav", "footer", "iframe", "noscript"]):
        tag.decompose()

    markdown_lines: List[str] = []

    # Headings (H1–H3): the tag name ("h2") gives the number of "#" to emit
    for heading in soup.find_all(["h1", "h2", "h3"]):
        level = int(heading.name[1])
        prefix = "#" * level
        markdown_lines.append(f"\n{prefix} {heading.get_text(strip=True)}\n")

    # Paragraphs: skip short fragments (20 characters or fewer)
    for p in soup.find_all("p"):
        text = p.get_text(strip=True)
        if len(text) > 20:
            markdown_lines.append(text)

    return "\n".join(markdown_lines)
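
The function above collects every heading before any paragraph, and `get_text(strip=True)` joins adjacent inline elements without a space. Where page order matters for retrieval, one possible variant is sketched below. The name `clean_html_in_document_order` and the `min_paragraph_chars` parameter are hypothetical and not part of the original notebook; the sketch relies on `find_all` returning matches in document order and on `get_text` accepting a separator string.

```python
from typing import List

from bs4 import BeautifulSoup


def clean_html_in_document_order(html_content: str, min_paragraph_chars: int = 20) -> str:
    """Hypothetical variant: emit headings and paragraphs in page order."""
    soup = BeautifulSoup(html_content, "html.parser")

    # Same boilerplate tags as clean_html_to_markdown
    for tag in soup(["script", "style", "nav", "footer", "iframe", "noscript"]):
        tag.decompose()

    blocks: List[str] = []
    # One find_all over both tag groups returns matches in document order
    for element in soup.find_all(["h1", "h2", "h3", "p"]):
        # A " " separator keeps a space between adjacent inline elements
        text = element.get_text(" ", strip=True)
        if not text:
            continue
        if element.name == "p":
            # Same length filter as the original paragraphs loop
            if len(text) > min_paragraph_chars:
                blocks.append(text)
        else:
            level = int(element.name[1])
            blocks.append(f"{'#' * level} {text}")

    # One blank line between blocks
    return "\n\n".join(blocks)
```

The space separator is a trade-off: it avoids fused text between sibling elements, but it can also introduce a space before punctuation that directly follows an inline tag.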

In [4]:
scraper_token = os.getenv("THORDATA_SCRAPER_TOKEN")
public_token = os.getenv("THORDATA_PUBLIC_TOKEN")
public_key = os.getenv("THORDATA_PUBLIC_KEY")

if not scraper_token:
    raise ValueError("Please set THORDATA_SCRAPER_TOKEN in your .env file.")

client = ThordataClient(scraper_token, public_token, public_key)

target_url = "https://openai.com/research/"

print(f"Fetching page: {target_url}")
html = client.universal_scrape(
    url=target_url,
    js_render=True,
    output_format="HTML",
)
len(html)

Fetching page: https://openai.com/research/


347665

In [5]:
markdown_content = clean_html_to_markdown(html)

print("Length of cleaned markdown:", len(markdown_content))
print("\nPreview:\n")
print(markdown_content[:1000])

Length of cleaned markdown: 2719

Preview:


# Pioneering research on the path to AGI


### We use Deep Learning to leverage large amounts of data and advanced reasoning to train AI systems for task completion.


### GPT


### o series


### Visual


### Audio


### Text


### Featured roles


### Research Engineer


### Research Engineer, Codex


### Research Engineer, Frontier Evals & Environments


### Research Scientist


## Explore all research

We believe our research will eventually lead to artificial general intelligence, a system that can solve human-level problems. Our mission is to ensure that AGI benefits all of humanity.
OpenAI’s GPT series models are fast, versatile, and cost-efficient AI systems designed to understand context, generate content, and reason across text, images, and more.
A smarter, more conversational ChatGPT
ReleaseNov 12, 20258 min read
Our smartest, fastest, and most useful model yet, with thinking built in
ReleaseAug 7, 202515 min read
Improved ability
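
The preview shows runs of blank lines around each heading, because every heading entry already carries its own leading and trailing newline before the entries are joined. A small post-processing step could normalize this before saving. The helper below is a sketch with a hypothetical name (`collapse_blank_lines`); it uses only the standard library.

```python
import re


def collapse_blank_lines(text: str) -> str:
    """Reduce any run of blank (or whitespace-only) lines to one blank line."""
    # Drop trailing spaces/tabs on each line so whitespace-only lines become empty
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Three or more consecutive newlines means two or more blank lines; keep one
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Trim the ends and finish with a single newline
    return text.strip() + "\n"
```

It could be applied as `markdown_content = collapse_blank_lines(markdown_content)` before the file is written.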

In [6]:
output_file = "knowledge_base_sample.md"

with open(output_file, "w", encoding="utf-8") as f:
    f.write(f"Source: {target_url}\n\n")
    f.write(markdown_content)

print(f"Saved to {output_file}")

Saved to knowledge_base_sample.md
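
Before ingestion into a vector database, the saved text is typically split into chunks that fit the embedding model's input size. The notebook does not show this step, so the following is only a minimal sketch under stated assumptions: the function name `chunk_lines` and the `max_chars` limit of 800 are illustrative, and chunking is done greedily by line rather than by tokens or semantic boundaries.

```python
from typing import List


def chunk_lines(text: str, max_chars: int = 800) -> List[str]:
    """Greedily pack non-empty lines into chunks of at most max_chars.

    A single line longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no content
        # +1 accounts for the newline that joins this line to the chunk
        added = len(line) + (1 if current else 0)
        if current and current_len + added > max_chars:
            # The chunk is full: flush it and start a new one with this line
            chunks.append("\n".join(current))
            current, current_len = [], 0
            added = len(line)
        current.append(line)
        current_len += added

    if current:
        chunks.append("\n".join(current))
    return chunks
```

For example, `chunk_lines(markdown_content)` would return a list of strings that could each be embedded and stored together with the `Source:` URL as metadata.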
