
# Data Preparation Overview

This document describes the preparation of data used to support the development and evaluation of algorithms for retrieving and labeling API documentation pages. The process includes collecting raw HTML documents, labeling relevant pages as endpoint-related, and storing all data in a structured format for later use.


## 1. Data Collection Objectives

In our case, the data consists of raw HTML pages from multiple API documentation websites. Each such HTML document is referred to here as an *API page*, and two tasks are performed on these pages. To make the tasks precise, we introduce one more distinction: the set of *API pages* consists of the *endpoint pages* (API pages that actually describe the endpoints that need to be implemented) and their complement, the *non-endpoint pages*, which we are not directly interested in. Note that some non-endpoint pages contain general information about the API that might be required to generate the endpoints properly; handling those pages is deferred to later development.


## 2. Tasks

The two main tasks in the project are defined as follows:

#### Task 1: Given the website link, parse the website and retrieve all API pages.

#### Task 2: Using the API pages, generate the Python code to interact with the endpoints.

The data preparation aims to make both tasks easier during development by providing a structured data infrastructure. This keeps development modular and saves time, because each part of the application can be worked on without relying on the other parts.

## 3. Benchmark Construction

Website parsing requires not only saving the pages but also a good algorithm that collects all the endpoint pages without too many irrelevant pages. That algorithm has to be designed.

For the algorithm design we chose a benchmark-driven approach: we construct a benchmark that, after the parsing process is run with a given algorithm, produces result statistics. To build this benchmark we reviewed API docs and labeled the endpoint pages.

Example object:
```
[
      {
          "name": "aNewSpring",
          "base_url": "https://support.anewspring.com",
          "domain_url": "/en/",
          "endpoint_pages" :
                [
                    "https://support.anewspring.com/en/articles/70415-api-calls-users",
                    "https://support.anewspring.com/en/articles/70416-api-calls-user-groups",
                    "https://support.anewspring.com/en/articles/70417-api-calls-subenvironments",
                    "https://support.anewspring.com/en/articles/70418-api-calls-subscriptions",
                    "https://support.anewspring.com/en/articles/70407-api-calls-templates-and-courses",
                    "https://support.anewspring.com/en/articles/70626-api-calls-bundles",
                    "https://support.anewspring.com/en/articles/70419-api-calls-calendar-items",
                    "https://support.anewspring.com/en/articles/70420-api-calls-access-codes",
                    "https://support.anewspring.com/en/articles/70421-api-calls-events",
                    "https://support.anewspring.com/en/articles/70588-api-calls-profile"
                ]
      }
]
```
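
The result statistics are not specified further here. One plausible choice is precision and recall of the crawled URL set against the labeled `endpoint_pages`. The following is a minimal sketch under that assumption; the function name `score_crawl` and the trailing-slash normalization are illustrative and not part of the project code.

```python
def score_crawl(retrieved_urls, endpoint_pages):
    """Compare crawled URLs with the labeled endpoint pages of one benchmark entry."""
    # Assumption: URLs that differ only by a trailing slash refer to the same page.
    retrieved = {u.rstrip("/") for u in retrieved_urls}
    expected = {u.rstrip("/") for u in endpoint_pages}
    hits = retrieved & expected

    # Guard against empty sets so the ratios are always defined.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(expected) if expected else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "missed": sorted(expected - retrieved),  # endpoint pages the crawl did not find
    }
```

High recall with low precision would indicate that the algorithm finds the endpoint pages but collects many irrelevant pages along the way.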

However, web parsing is time-consuming, and waiting for a full crawl every time an algorithm is evaluated is impractical. Saved pages are also needed for modular development of the code generation. We therefore created a database table for development.



## 4. Database Schema

To support modular and repeatable development, all scraped pages and their labels are stored in a local database. The table structure is as follows:

| Column | Type | Description|
|------|-----|----------------------------------------------|
| `url`| TEXT| Primary key. Full URL of the scraped page|
| `html`| TEXT| Raw HTML content of the page|
| `is_endpoint`| BOOLEAN| Whether the page is an endpoint page|


These functions are defined in `db_utils`; they are duplicated here for reference.

In [None]:
import json
from typing import Optional

from src.scraper.scraper import bfs_site
from src.scraper.eval_db.db_utils import get_connection


def init_db():
    """Create the `pages` table if it does not exist yet."""
    conn = get_connection()
    cur = conn.cursor()
    cur.execute('''
        create table if not exists pages (
            url TEXT primary key,
            html TEXT,
            is_endpoint BOOLEAN)
    ''')
    conn.commit()
    cur.close()
    conn.close()

def get_page_by_url(url: str) -> Optional[str]:
    """Return the stored HTML for an exact URL, or None if the page is not stored."""
    conn = get_connection()
    cur = conn.cursor()
    cur.execute("SELECT html FROM pages WHERE url = %s", (url,))
    row = cur.fetchone()
    cur.close()
    conn.close()
    return row[0] if row else None

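def get_api_pages_by_url(url: str) -> list:
    """Return the endpoint pages whose URL contains `url`."""
    conn = get_connection()
    cur = conn.cursor()
    cur.execute("SELECT html FROM pages WHERE url LIKE %s and is_endpoint = True", (f"%{url}%",))
    # Each row holds a single column, so the HTML of a row is row[0].
    rows = cur.fetchall()
    cur.close()
    conn.close()
    return rows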

def is_site_in_db(url: str) -> bool:
    conn = get_connection()
    cur = conn.cursor()
    cur.execute("SELECT 1 FROM pages WHERE url LIKE %s LIMIT 1", (f"%{url}%",))
    exists = cur.fetchone() is not None
    cur.close()
    conn.close()
    return exists

def store_page(url: str, html: str, is_endpoint: bool) -> None:
    conn = get_connection()
    cur = conn.cursor()
    cur.execute('''
        INSERT INTO pages (url, html, is_endpoint)
        VALUES (%s, %s, %s)
        on conflict do nothing
    ''', (url, html, is_endpoint))
    conn.commit()
    cur.close()
    conn.close()

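def add_api_to_db(url: str, domain_url: str) -> None:
    """Crawl one benchmark site and store every page together with its endpoint label."""
    # Look up the labeled endpoint pages for this site in the benchmark file.
    # If the site is not in the benchmark, no page is labeled as an endpoint.
    endpoint_pages = set()
    with open('../benchmark_data.json', 'r') as file:
        benchmark = json.load(file)
    for obj in benchmark:
        if obj.get("base_url") == url:
            endpoint_pages = set(obj.get("endpoint_pages", []))
            break

    # Accept every page: the predicate always returns True.
    scraped = bfs_site(url, lambda content: True, domain_url)

    # The accepted pages are read from the "endpoint_pages" key (URL -> HTML).
    all_scraped_pages = scraped.get("endpoint_pages", {})

    for page_url, html in all_scraped_pages.items():
        store_page(page_url, html, page_url in endpoint_pages)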

In the `add_api_to_db` function we call `bfs_site`. Since we want to store all the pages from the website, we pass the following predicate:

```python
lambda content: True
```

This predicate accepts every page, so all crawled pages are added to the result.
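
For comparison, a more selective predicate could be passed in place of the lambda. The sketch below is a hypothetical heuristic, not the project's filter. It assumes the predicate receives the page content as a string (as the lambda's parameter name suggests) and accepts a page when it mentions an HTTP method followed by a path.

```python
import re

# Hypothetical heuristic: an HTTP verb followed by a path, e.g. "GET /users/{id}".
ENDPOINT_PATTERN = re.compile(r"\b(GET|POST|PUT|PATCH|DELETE)\s+/[\w\-/{}:.]*")


def looks_like_endpoint(content: str) -> bool:
    """Return True when the page text contains something shaped like an endpoint."""
    return ENDPOINT_PATTERN.search(content) is not None
```

It would be passed the same way as the lambda, for example `bfs_site(url, looks_like_endpoint, domain_url)`, assuming `bfs_site` calls the predicate with the page content only.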

All collected data, including both the pages labeled as endpoint pages and all other pages, is saved. This allows the same data to be reused for different parsing strategies or model experiments without re-scraping.

This database serves as a stand-in for the internet: with a small adjustment to the `bfs_site` function, we could retrieve the HTML from the database by link using `get_page_by_url` and then use the `BeautifulSoup` library to work with the HTML just as if it had been received from a web request.
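
The idea of reading pages from the database instead of the web can be sketched as follows. To keep the example runnable without dependencies, it uses stand-ins: an in-memory `sqlite3` database (with `?` placeholders) in place of the project's `get_connection`, and the standard-library `HTMLParser` in place of `BeautifulSoup`. The names `fetch_from_db`, `links_on_page`, and `LinkCollector` are hypothetical.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute link targets from <a href=...> tags."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.page_url, href))


def fetch_from_db(conn, url):
    """Stand-in for an HTTP GET: return the stored HTML, or None if unknown."""
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None


def links_on_page(conn, url):
    """Return the outgoing links of a stored page, without any web request."""
    html = fetch_from_db(conn, url)
    if html is None:
        return []
    parser = LinkCollector(url)
    parser.feed(html)
    return parser.links


# Demo: an in-memory database holding one stored page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, html TEXT, is_endpoint BOOLEAN)")
conn.execute(
    "INSERT INTO pages VALUES (?, ?, ?)",
    ("https://docs.example.com/en/", '<a href="users">Users</a>', False),
)
print(links_on_page(conn, "https://docs.example.com/en/"))
```

Because the crawler only needs a function that maps a URL to HTML, swapping the web request for a database lookup leaves the rest of the crawl logic unchanged.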


## 5. Workflow and Labeling Logic

1. The benchmark JSON file is loaded.
2. The crawler executes a breadth-first crawl starting from the `base_url`, staying within the domain.
3. All reachable pages are saved with their HTML content.
4. Each page is checked against the benchmark `endpoint_pages`. If it is found there, it is labeled `is_endpoint = True`; otherwise it is labeled `is_endpoint = False`.
5. The complete set of pages and labels is stored in the database.
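
Steps 2 to 4 can be sketched as follows. This is not the project's `bfs_site`: the `fetch` and `extract_links` parameters are hypothetical hooks, and treating "within the domain" as a URL prefix check is an assumption made for illustration.

```python
from collections import deque


def crawl_and_label(start_url, domain_prefix, fetch, extract_links, endpoint_pages):
    """Breadth-first crawl within domain_prefix; returns {url: (html, is_endpoint)}."""
    endpoint_set = set(endpoint_pages)
    labeled = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved: skip it
            continue
        # Step 4: a page is an endpoint page iff the benchmark lists its URL.
        labeled[url] = (html, url in endpoint_set)
        for link in extract_links(html):
            # Assumption: "within the domain" means sharing the URL prefix.
            if link.startswith(domain_prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return labeled


# Toy site: each "page" is just a whitespace-separated list of outgoing links.
site = {
    "https://docs.example.com/en/": "https://docs.example.com/en/users https://other.example.org/",
    "https://docs.example.com/en/users": "https://docs.example.com/en/",
}
result = crawl_and_label(
    "https://docs.example.com/en/",
    "https://docs.example.com/en/",
    fetch=site.get,
    extract_links=str.split,
    endpoint_pages=["https://docs.example.com/en/users"],
)
```

In this toy run the external link is never visited, and only the `users` page receives the endpoint label. Storing `result` row by row would correspond to step 5.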

This process separates data collection from downstream tasks and enables reproducible evaluation of new crawling strategies and model variants.


## 6. Assumptions for Downstream Development

- Crawling and labeling are performed once and stored persistently.
- All downstream code assumes access to the database rather than repeating the crawling step.
- The endpoint pages are used for tasks like code generation or API schema inference.

This allows for fast iteration and independent development of later components.
