This script retrieves and structures annotation data from Hypothes.is, an open-source annotation platform widely used in education, where students collaboratively annotate and comment on digital texts such as articles, academic papers, and online resources. Through its API, Hypothes.is exposes the excerpts students highlight and the comments they attach to specific text fragments, which makes it a useful data source for analyzing student interactions (Hypothes.is, n.d.). Because these annotations record authentic engagement and reflection, they are well suited to building personalized knowledge graphs and other context-rich educational applications.

Reference:
Hypothes.is. (n.d.). Hypothes.is API documentation. Retrieved March 25, 2025, from https://h.readthedocs.io/en/latest/api/

Output JSON structure:

The script produces a nested JSON object organized by Hypothes.is users and the documents they annotated.
Each user key maps to a collection of documents (identified by their URI), and each document contains:
- its title,
- and a list of annotations made by that user on that document.

Each annotation object contains:
- `id`: the unique annotation ID from Hypothes.is
- `highlighted_text`: the exact text the user selected and annotated
- `comment`: the user's written comment (non-empty)
- `created`: the timestamp when the annotation was created
- `position`: start and end character offsets (if available)
- `context`: preceding and following text around the highlight (if available)
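
To make the shape concrete, here is a minimal sketch of one entry in the output. The key names follow the description above and the script below; every value (user name, URI, title, ID, text, timestamp, offsets) is a made-up placeholder, not real data.

```python
import json

# Hypothetical placeholder values; only the key names mirror the script's output.
example_output = {
    "example_user": {
        "https://example.com/article": {
            "title": "Example Article",
            "annotations": [
                {
                    "id": "abc123",
                    "highlighted_text": "an excerpt the user selected",
                    "comment": "the user's note on this excerpt",
                    "created": "2025-01-01T12:00:00+00:00",
                    "position": {"start_char": 120, "end_char": 148},
                    "context": {
                        "preceding_text": "text before ",
                        "following_text": " text after",
                    },
                }
            ],
        }
    }
}

print(json.dumps(example_output, indent=2))
```

When a selector is unavailable, the script leaves `position` or `context` as an empty object and `highlighted_text` as an empty string.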

Annotations are only included if:
1. The document has a title.
2. The comment is not empty.

In [4]:
import requests
import json
import time
from collections import defaultdict

HYPOTHESIS_API_URL = "https://api.hypothes.is/api/search"

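def fetch_annotations_paginated(total_annotations, per_page=200, sleep_between_calls=1):
    """Fetch up to `total_annotations` public annotations using offset-based pagination.

    Pauses `sleep_between_calls` seconds between requests to avoid hammering the API.
    """
    annotations = []
    offset = 0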

    while len(annotations) < total_annotations:
        params = {
            "limit": per_page,
            "offset": offset,
        }

        # A timeout keeps a stalled connection from hanging the loop indefinitely
        response = requests.get(HYPOTHESIS_API_URL, params=params, timeout=30)
        response.raise_for_status()
        batch = response.json().get('rows', [])

        if not batch:
            break

        annotations.extend(batch)
        offset += per_page

        print(f"Fetched {len(annotations)} annotations so far...")

        time.sleep(sleep_between_calls)

    return annotations[:total_annotations]

# Organize annotations with all required filters applied
def organize_annotations(annotations):
    data = defaultdict(lambda: defaultdict(lambda: {"title": "", "annotations": []}))

    for ann in annotations:
        doc_title = ann.get('document', {}).get('title', [])
        comment = ann.get('text', '').strip()

        # Skip annotations without document titles or empty comments
        if not doc_title or not doc_title[0].strip() or not comment:
            continue

        doc_title = doc_title[0].strip()
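        # User IDs look like "acct:name@authority"; fall back safely if the field is missing or malformed
        user_id = ann.get('user') or 'unknown_user'
        user = user_id.split(':', 1)[-1].split('@', 1)[0]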
        doc_uri = ann.get('uri', 'unknown_document')

        annotation = {
            "id": ann.get('id'),
            "highlighted_text": "",
            "comment": comment,
            "created": ann.get('created'),
            "position": {},
            "context": {}
        }

        target = ann.get('target', [])
        if target:
            selectors = target[0].get('selector', [])
            for selector in selectors:
                # Use .get() so a selector without a 'type' key does not raise KeyError
                selector_type = selector.get('type')
                if selector_type == 'TextQuoteSelector':
                    annotation["highlighted_text"] = selector.get('exact', '')
                    annotation["context"] = {
                        "preceding_text": selector.get('prefix', ''),
                        "following_text": selector.get('suffix', '')
                    }
                elif selector_type == 'TextPositionSelector':
                    annotation["position"] = {
                        "start_char": selector.get('start'),
                        "end_char": selector.get('end')
                    }

        doc_entry = data[user][doc_uri]
        doc_entry["title"] = doc_title
        doc_entry["annotations"].append(annotation)

    return data

# Main execution
if __name__ == "__main__":
    TOTAL_ANNOTATIONS = 10000
    annotations = fetch_annotations_paginated(total_annotations=TOTAL_ANNOTATIONS)

    print(f"\nSuccessfully fetched {len(annotations)} annotations.")

    structured_data = organize_annotations(annotations)

    # Save structured data to JSON
    with open("annotations_database.json", "w", encoding='utf-8') as f:
        json.dump(structured_data, f, indent=2, ensure_ascii=False)

    print("Annotations successfully organized and saved to annotations_database.json.")

Fetched 200 annotations so far...
Fetched 400 annotations so far...
Fetched 600 annotations so far...
Fetched 800 annotations so far...
Fetched 1000 annotations so far...
Fetched 1200 annotations so far...
Fetched 1400 annotations so far...
Fetched 1600 annotations so far...
Fetched 1800 annotations so far...
Fetched 2000 annotations so far...
Fetched 2200 annotations so far...
Fetched 2400 annotations so far...
Fetched 2600 annotations so far...
Fetched 2800 annotations so far...
Fetched 3000 annotations so far...
Fetched 3200 annotations so far...
Fetched 3400 annotations so far...
Fetched 3600 annotations so far...
Fetched 3800 annotations so far...
Fetched 4000 annotations so far...
Fetched 4200 annotations so far...
Fetched 4400 annotations so far...
Fetched 4600 annotations so far...
Fetched 4800 annotations so far...
Fetched 5000 annotations so far...
Fetched 5200 annotations so far...
Fetched 5400 annotations so far...
Fetched 5600 annotations so far...
Fetched 5800 annotations