# Data Annotation for Astronomical NER

## Objective

The goal of this notebook is to process raw NASA ADS abstract data into a structured format suitable for training a spaCy Named Entity Recognition (NER) model. This involves identifying astronomical source names (like "SN 2023ixf" or "Crab Nebula") in the text and recording the character offsets of each occurrence.

## Process Overview

The process consists of the following key steps:

1.  **Load Raw Data**: Ingest the downloaded JSON files containing abstracts and metadata from NASA ADS.
2.  **Extract Gold-Standard Entities**: Use the `keyword` field of each record to identify "gold-standard" astronomical entities. We keep only keywords that contain the `individual: ` marker, which tags a specific celestial object.
3.  **Locate Entities in Text**: Search for these extracted entities within the corresponding document's `title` and `abstract`.
4.  **Format for spaCy**: Structure the text and entity locations into the format spaCy uses for training: a list of examples, each pairing the text with a dictionary of entity spans, `(text, {"entities": [(start, end, label), ...]})`.
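
The four steps can be sketched on a single toy record. This is a minimal illustration, not the notebook's implementation: the record below is hypothetical (only its field names follow the notebook), and the end offset is kept inclusive to match the convention used in the functions that follow.

```python
import re

# Hypothetical record shaped like an ADS document (field names follow the notebook).
doc = {
    "title": ["Early spectra of SN 2023ixf"],
    "abstract": "We present observations of SN 2023ixf in M101.",
    "keyword": ["supernovae: individual: SN 2023ixf", "techniques: spectroscopic"],
}

# Step 2: keep only keywords that carry the "individual: " marker.
entities = [
    kw.split("individual: ")[-1].strip()
    for kw in doc["keyword"]
    if "individual: " in kw
]

# Step 3: locate each entity in the title with whole-word matching.
title = " ".join(doc["title"])
spans = []
for entity in entities:
    for m in re.finditer(r"\b" + re.escape(entity) + r"\b", title):
        # Inclusive end offset, as in this notebook.
        spans.append((m.start(), m.end() - 1, "ASTRO_OBJ"))

# Step 4: one training example in the (text, {"entities": [...]}) shape.
example = (title, {"entities": spans})
print(example)
```

The same search would be repeated on the abstract to produce a second example for the document.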

## Key Functions

-   `extract_astronomical_keyword_entities()`: Parses the `keyword` list from a document to pull out any terms identified as an "individual" celestial object.
-   `find_exact_matches()`: Uses a regular expression with word boundaries (`\b`) to find the start and end character offsets of every occurrence of an entity string in a body of text; the end offset is inclusive. The word boundaries prevent partial matches (e.g., finding "Norma" inside "34 Normae").
-   `build_data_model()`: The main orchestration function that takes a raw document, runs the extraction and search steps, and compiles the final structured output.
-   `build_spacy_ner_data()`: Formats the final text and entity locations into the `(text, {"entities": [...]})` tuple structure that spaCy expects for a single training example.
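
One caveat of `\b` is worth illustrating: it only matches between a word character and a non-word character, so an entity that starts or ends with punctuation, such as "(315898)", may never match. The sketch below compares `\b` with a lookaround-based alternative. The helper `find_whole_matches` and the sample sentence are hypothetical and are not part of the notebook.

```python
import re


def find_whole_matches(text: str, entity: str) -> list[tuple[int, int]]:
    """Hypothetical variant of find_exact_matches that uses lookarounds."""
    # The match must not be glued to a word character on either side.
    pattern = re.compile(r"(?<!\w)" + re.escape(entity) + r"(?!\w)")
    # Inclusive end offset, matching the notebook's convention.
    return [(m.start(), m.end() - 1) for m in pattern.finditer(text)]


text = "Asteroid (315898) was observed near 34 Normae."

# With \b the pattern needs a word character next to "(" and ")", so it finds nothing.
boundary_hits = re.findall(r"\b" + re.escape("(315898)") + r"\b", text)
print(boundary_hits)

# The lookaround variant finds the parenthesised designation...
print(find_whole_matches(text, "(315898)"))

# ...and still rejects a partial match inside a longer word.
print(find_whole_matches(text, "Norma"))
```

Whether this variant is preferable depends on how often the keyword entities begin or end with punctuation in the actual data.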

## Final Output

The primary output of this notebook is a list of data models, one per document, each holding its spaCy training examples under the `spacy_ner_data` key. Once saved, this data can feed a spaCy training pipeline to teach the model to recognize astronomical objects.
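
The spans in this notebook use an inclusive end offset, while spaCy's character offsets are end-exclusive, like Python slices. A small conversion step is therefore likely needed before training. The helper below is a hypothetical sketch of that step, not a function from the notebook.

```python
def to_exclusive(example):
    """Hypothetical helper: convert inclusive end offsets to end-exclusive ones."""
    text, annotations = example
    spans = [(start, end + 1, label) for start, end, label in annotations["entities"]]
    return text, {"entities": spans}


# Example in the notebook's format, with an inclusive end offset.
example = ["Some text about SN 2023ixf.", {"entities": [(16, 25, "ASTRO_OBJ")]}]

text, annotations = to_exclusive(example)
for start, end, label in annotations["entities"]:
    # With exclusive ends, slicing the text returns exactly the entity string.
    print(repr(text[start:end]), label)
```

Slicing the text with the converted offsets is also a cheap sanity check that every span lines up with the entity it is meant to mark.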


In [1]:
from pathlib import Path
import re
import json
from typing import Optional, Any

In [3]:
def extract_astronomical_keyword_entities(doc: dict) -> list[str]:
    """Extracts astronomical entities from document keywords.

    This function iterates through the keywords of a NASA ADS document record
    and extracts entities specifically marked with the "individual: " prefix.

    Args:
        doc: A dictionary representing a single document from the ADS API.

    Returns:
        A list of unique astronomical entity names.
    """
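    entities = []
    keywords = doc.get("keyword", [])
    for keyword in keywords:
        # Split case-insensitively so the test and the split always agree
        # (e.g. "Individual: X" would otherwise be kept whole).
        parts = re.split("individual: ", keyword, flags=re.IGNORECASE)
        if len(parts) > 1:
            entities.append(parts[-1].strip())
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(entities))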


def remove_overlapping_entities(entities: list[str]) -> list[str]:
    """Remove entities that are substrings of other entities.
    
    This prevents overlapping entity matches like 'HD 189733' and 'HD 189733 b'
    by keeping only the longest version of each entity.
    
    Args:
        entities: List of entity strings
        
    Returns:
        List of entities with overlapping substrings removed
    """
    if not entities:
        return entities
    
    # Sort by length (longest first) to prioritize longer matches
    sorted_entities = sorted(entities, key=len, reverse=True)
    filtered_entities = []
    
    for entity in sorted_entities:
        # Check if this entity is a substring of any already accepted entity
        is_substring = False
        for accepted_entity in filtered_entities:
            if entity != accepted_entity and entity in accepted_entity:
                is_substring = True
                break
        
        if not is_substring:
            filtered_entities.append(entity)
    
    return filtered_entities


def find_entities_in_text(
    text: str, entities: list[str]
) -> dict[str, list[tuple[int, int]]]:
    """Finds all occurrences of a list of entity strings in a text.

    Args:
        text: The text to search within.
        entities: A list of string entities to find.

    Returns:
        A dictionary where keys are the found entities and values are lists
        of (start, end) character offset tuples for each occurrence.
    """
    found_entities = {}
    for entity in entities:
        if locations := find_exact_matches(text, entity):
            found_entities[entity] = locations
    return found_entities


def find_exact_matches(text: str, search_string: str) -> list[tuple[int, int]]:
    """Finds occurrences of the exact string using word boundaries.

    This ensures that a shorter name such as "Norma" is not matched inside a
    longer word such as "Normae" (as in "34 Normae").

    Args:
        text: The string to search within.
        search_string: The exact string to search for.

    Returns:
        A list of tuples, where each tuple contains the (start, end) index
        of a match. The end index is inclusive.
    """
    if not search_string:
        return []

    # Use word boundaries to ensure whole-word matching.
    pattern = re.compile(r'\b' + re.escape(search_string) + r'\b')

    occurrences = []
    for match in pattern.finditer(text):
        start_index = match.start()
        end_index = match.end() - 1  # End index is inclusive
        occurrences.append((start_index, end_index))
    return occurrences


def resolve_overlapping_spans(entities_dict: dict[str, list[tuple[int, int]]]) -> list[tuple[int, int, str]]:
    """Resolve overlapping entity spans by keeping the longest match.
    
    When multiple entities overlap in the text, keep only the longest one.
    This prevents spaCy training errors from conflicting entity spans.
    
    Args:
        entities_dict: Dictionary mapping entity names to their span locations
        
    Returns:
        List of (start, end, label) tuples with overlaps resolved
    """
    # Flatten all entities into (start, end, entity_name) tuples
    all_spans = []
    for entity_name, locations in entities_dict.items():
        for start, end in locations:
            all_spans.append((start, end, entity_name))
    
    # Sort by start position, then by span length (longest first)
    all_spans.sort(key=lambda x: (x[0], -(x[1] - x[0])))
    
    # Remove overlapping spans, keeping longest ones
    resolved_spans = []
    for start, end, entity_name in all_spans:
        # Check if this span overlaps with any already accepted span
        overlaps = False
        for accepted_start, accepted_end, _ in resolved_spans:
            # Check for overlap: spans overlap if one starts before the other ends
            if not (end < accepted_start or start > accepted_end):
                overlaps = True
                break
        
        if not overlaps:
            resolved_spans.append((start, end, "ASTRO_OBJ"))
    
    return resolved_spans


def search_title_for_entities(
    doc: dict, entities: list[str]
) -> tuple[str, dict[str, list[tuple[int, int]]]]:
    """Searches the document title for specified entities.

    Args:
        doc: A dictionary representing a single document from the ADS API.
        entities: A list of string entities to find.

    Returns:
        A tuple containing the document title text and a dictionary of
        found entities with their character offsets.
    """
    text = " ".join(doc.get("title", []))
    found_entities = find_entities_in_text(text, entities)
    return text, found_entities


def search_abstract_for_entities(
    doc: dict, entities: list[str]
) -> tuple[str, dict[str, list[tuple[int, int]]]]:
    """Searches the document abstract for specified entities.

    Args:
        doc: A dictionary representing a single document from the ADS API.
        entities: A list of string entities to find.

    Returns:
        A tuple containing the document abstract text and a dictionary of
        found entities with their character offsets.
    """
    text = doc.get("abstract", "")
    found_entities = find_entities_in_text(text, entities)
    return text, found_entities

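def build_data_model(doc: dict) -> Optional[dict]:
    """Builds the data structure for training a NER model.

    Args:
        doc: A dictionary representing a single document from the ADS API.

    Returns:
        A dictionary holding the document fields, the matched entities, and
        the spaCy training examples, or None if the document has no
        "individual: " keywords.
    """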
    raw_entities = extract_astronomical_keyword_entities(doc)
    if not raw_entities:
        return None
    
    # Remove overlapping entities (e.g., keep "HD 189733 b" instead of both "HD 189733" and "HD 189733 b")
    entities = remove_overlapping_entities(raw_entities)
    
    data_model = {
        "doc_id": doc.get("id"),
        "title": doc.get("title", ""),
        "abstract": doc.get("abstract", ""),
        "keywords": doc.get("keyword", []),
        "objects": entities,
        "raw_objects": raw_entities,  # Keep original for debugging
    }
    
    title, title_entities = search_title_for_entities(doc, entities)
    abstract, abstract_entities = search_abstract_for_entities(doc, entities)

    data_model["title_entities"] = title_entities
    data_model["abstract_entities"] = abstract_entities

    spacy_ner_data = []
    if title_entities:
        # append (not extend) keeps each [text, {"entities": ...}] example intact
        spacy_ner_data.append(build_spacy_ner_data(title, title_entities))
    if abstract_entities:
        spacy_ner_data.append(build_spacy_ner_data(abstract, abstract_entities))
    data_model["spacy_ner_data"] = spacy_ner_data
    
    return data_model

def build_spacy_ner_data(text: str, entities: dict[str, list[tuple[int, int]]]) -> list:
    """Builds the data structure for training a spaCy NER model.

    This function converts a text and a dictionary of entity locations into the
    specific format required for a single spaCy training example. The format is
    a list containing the text and a dictionary with an "entities" key.
    The value of "entities" is a list of tuples, where each tuple represents
    a single entity with its start offset, end offset (inclusive), and label.

    Args:
        text: The source text containing the entities.
        entities: A dictionary where keys are entity names and values are lists
                  of (start, end) character offsets for each occurrence.

    Returns:
        A list containing the text and an entity dictionary, formatted for
        spaCy training. For example:
        ['Some text about SN 2023ixf.', {'entities': [(16, 25, 'ASTRO_OBJ')]}]
    """
    # Use the overlap resolution function to clean up conflicting spans
    resolved_entities = resolve_overlapping_spans(entities)
    
    ner_data = [text, {"entities": resolved_entities}]
    return ner_data

In [4]:
def read_abstracts(data_dir: Path) -> list[dict[str, Any]]:
    """
    Reads ADS abstracts from JSON files in the given directory.

    Args:
        data_dir: Path to the directory containing the JSON files.

    Returns:
        A list of dictionaries, where each dictionary represents an abstract.
    """
    abstract_files = list(data_dir.glob("*.json"))
    all_abstracts: list[dict[str, Any]] = []
    for file in abstract_files:
        with open(file, "r", encoding="utf-8") as f:
            # Fall back to an empty dict if the "response" key is missing.
            abstract_batch = json.load(f).get("response", {})
            docs = abstract_batch.get("docs", [])
            all_abstracts.extend(docs)
    return all_abstracts


def process_abstracts(abstracts: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """
    Processes a list of abstracts to build data models.

    Args:
        abstracts: A list of dictionaries, where each dictionary represents an abstract.

    Returns:
        A list of data models (dictionaries) built from the abstracts.
    """
    spacy_ner_data: list[dict[str, Any]] = []
    for doc in abstracts:
        data_model = build_data_model(doc)
        if data_model:
            spacy_ner_data.append(data_model)
    return spacy_ner_data


def write_data(data: list[dict[str, Any]], output_path: Path) -> None:
    """
    Writes the processed data to a JSON file.

    Args:
        data: A list of dictionaries to write to the file.
        output_path: The path to the output JSON file.
    """
    output_path.parent.mkdir(parents=True, exist_ok=True)  # Ensure directory exists
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(f"Saved spaCy NER data to {output_path}")


def main(data_root: Path, output_path: Path) -> None:
    """
    Main function to execute the data processing pipeline.

    Args:
        data_root: The root directory containing the ADS abstracts.
        output_path: The path to save the processed data.
    """
    ads_data = data_root / "ads_abstracts"
    abstracts = read_abstracts(ads_data)
    processed_data = process_abstracts(abstracts)
    write_data(processed_data, output_path)


if __name__ == "__main__":
    DATA_ROOT = Path("../data")
    OUTPATH = DATA_ROOT / "ner_data" / "spacy_ner_data.json"
    main(DATA_ROOT, OUTPATH)


Saved spaCy NER data to ../data/ner_data/spacy_ner_data.json


## Example implementation

The cells below test the functions interactively before they are wrapped in a more generic script.

### Test overlap detection and resolution
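
A plain substring test also removes names that merely share a prefix with a longer name, for example "M31" when "M31N 2008-12a" is present, which may or may not be desired. The sketch below shows one possible alternative that drops an entity only when it appears as a whole word inside a longer one. The function `remove_contained_entities` is hypothetical and is not used elsewhere in this notebook.

```python
import re


def remove_contained_entities(entities: list[str]) -> list[str]:
    """Hypothetical variant: drop an entity only if it is a whole word inside a longer one."""
    kept: list[str] = []
    # Longest first, so longer names are accepted before their shorter parts are checked.
    for entity in sorted(entities, key=len, reverse=True):
        pattern = re.compile(r"(?<!\w)" + re.escape(entity) + r"(?!\w)")
        if not any(pattern.search(longer) for longer in kept if longer != entity):
            kept.append(entity)
    return kept


result = remove_contained_entities(["M31", "M31N 2008-12a", "HD 189733", "HD 189733 b"])
print(result)
```

Here "HD 189733" is still dropped in favour of "HD 189733 b", while "M31" survives because it is followed by a letter inside "M31N 2008-12a".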

In [6]:
# Test overlap detection functions
test_entities = ["HD 189733", "HD 189733 b", "Fomalhaut", "Fomalhaut b", "NGC 4258", "Crab Nebula"]
print("Original entities:", test_entities)
cleaned_entities = remove_overlapping_entities(test_entities)
print("Cleaned entities:", cleaned_entities)

# Test overlap resolution in spans
test_text = "HD 189733 b orbits HD 189733 every 2.2 days"
test_entities_dict = {
    # Inclusive (start, end) offsets; the second "HD 189733" starts at index 19
    "HD 189733": [(0, 8), (19, 27)],
    "HD 189733 b": [(0, 10)]
}
print(f"\nTest text: '{test_text}'")
print("Entity spans before resolution:", test_entities_dict)
resolved_spans = resolve_overlapping_spans(test_entities_dict)
print("Resolved spans:", resolved_spans)

# Data inputs
DATA_ROOT = Path("../data")
ADS_DATA = DATA_ROOT / "ads_abstracts"

Original entities: ['HD 189733', 'HD 189733 b', 'Fomalhaut', 'Fomalhaut b', 'NGC 4258', 'Crab Nebula']
Cleaned entities: ['HD 189733 b', 'Fomalhaut b', 'Crab Nebula', 'NGC 4258']

Test text: 'HD 189733 b orbits HD 189733 every 2.2 days'
Entity spans before resolution: {'HD 189733': [(0, 8), (19, 27)], 'HD 189733 b': [(0, 10)]}
Resolved spans: [(0, 10, 'ASTRO_OBJ'), (19, 27, 'ASTRO_OBJ')]


In [7]:
# Process ADS abstracts with overlap detection
abstract_files = list(ADS_DATA.glob("*.json"))
spacy_ner_data = []
overlap_stats = {"total_docs": 0, "docs_with_overlaps": 0, "entities_removed": 0}

for file in abstract_files:
    with open(file, "r", encoding="utf-8") as f:
        # Fall back to an empty dict if the "response" key is missing.
        abstract_batch = json.load(f).get("response", {})
        docs = abstract_batch.get("docs", [])
    for doc in docs:
        overlap_stats["total_docs"] += 1
        
        # Get raw entities
        raw_entities = extract_astronomical_keyword_entities(doc)
        if raw_entities:
            # Clean overlapping entities
            cleaned_entities = remove_overlapping_entities(raw_entities)
            
            # Track statistics
            if len(cleaned_entities) < len(raw_entities):
                overlap_stats["docs_with_overlaps"] += 1
                overlap_stats["entities_removed"] += len(raw_entities) - len(cleaned_entities)
                print(f"Doc {doc.get('id', 'unknown')}: {raw_entities} -> {cleaned_entities}")
        
        data_model = build_data_model(doc)
        if data_model:
            spacy_ner_data.append(data_model)

print("\nOverlap cleaning statistics:")
print(f"Total documents processed: {overlap_stats['total_docs']}")
print(f"Documents with overlapping entities: {overlap_stats['docs_with_overlaps']}")
print(f"Total overlapping entities removed: {overlap_stats['entities_removed']}")

Doc 14921651: ['HD 189733 b', 'HD 189733'] -> ['HD 189733 b']
Doc 14917098: ['GRB 140506A', 'GRB 140506A host'] -> ['GRB 140506A host']
Doc 14916547: ['HD 97658', 'HD 97658 b'] -> ['HD 97658 b']
Doc 14946658: ['WASP 52b', 'WASP 52'] -> ['WASP 52b']
Doc 14946620: ['SWIFT J1753.5-0127', 'SWIFT J1753.5-0127 - X-rays: binaries'] -> ['SWIFT J1753.5-0127 - X-rays: binaries']
Doc 14926052: ['M31', 'M31N 2008-12a'] -> ['M31N 2008-12a']
Doc 14949374: ['Cyg X-1', 'Cyg X-1 - X-rays: binaries'] -> ['Cyg X-1 - X-rays: binaries']
Doc 14947993: ['4U 1820-30 - X-rays: stars', '4U 1820-30'] -> ['4U 1820-30 - X-rays: stars']
Doc 14921476: ['HAT-P-33', 'HAT-P-33b'] -> ['HAT-P-33b']
Doc 14946477: ['Fomalhaut b', 'Fomalhaut'] -> ['Fomalhaut b']
Doc 14946329: ['382004', '(315898)', '342842', '315898'] -> ['(315898)', '382004', '342842']
Doc 14926172: ['M31', 'M31N 2008-12a'] -> ['M31N 2008-12a']
Doc 14926048: ['FW Tau', 'FW Tau C'] -> ['FW Tau C']
Doc 14921883: ['HD 3167b', 'HD 3167'] -> ['HD 3167b']
Doc 14

In [8]:
# Write the data to disk
OUTPATH = DATA_ROOT / "ner_data" / "spacy_ner_data.json"
OUTPATH.parent.mkdir(parents=True, exist_ok=True)  # Ensure directory exists

with open(OUTPATH, "w", encoding="utf-8") as f:
    json.dump(spacy_ner_data, f, indent=2, ensure_ascii=False)

print(f"Saved spaCy NER data to {OUTPATH}")

Saved spaCy NER data to ../data/ner_data/spacy_ner_data.json
