# Data Annotation for Astronomical NER

## Objective

The goal of this notebook is to process raw NASA ADS abstract data into a structured format suitable for training a spaCy Named Entity Recognition (NER) model. This involves identifying astronomical source names (like "SN 2023ixf" or "Crab Nebula") in the text and marking their exact locations.

## Process Overview

The process consists of the following key steps:

1.  **Load Raw Data**: Ingest the downloaded JSON files containing abstracts and metadata from NASA ADS.
2.  **Extract Gold-Standard Entities**: Use the `keyword` field of each record to identify "gold-standard" astronomical entities. We specifically look for keywords containing the `individual:` marker, which reliably tags specific celestial objects.
3.  **Locate Entities in Text**: Search for these extracted entities within the corresponding document's `title` and `abstract`.
4.  **Format for spaCy**: Structure the text and entity locations into the specific format required by spaCy for training, which is a list of tuples, where each tuple contains the text and a dictionary of entity spans.
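
Steps 2 to 4 can be sketched on a single toy record. This is a minimal illustration, not the notebook's implementation: the record below is invented, the keyword string is only assumed to resemble what ADS returns, and the end offsets here are exclusive, so `text[start:end]` recovers the entity.

```python
import re

# Hypothetical ADS-style record, invented for illustration only.
doc = {
    "title": ["Early spectra of SN 2023ixf"],
    "abstract": "We present observations of SN 2023ixf in M101.",
    "keyword": ["supernovae: individual: SN 2023ixf", "galaxies: spiral"],
}

# Step 2: keep only the names tagged with the "individual:" marker.
names = [
    kw.split("individual:")[-1].strip()
    for kw in doc["keyword"]
    if "individual:" in kw
]

# Step 3: locate each name in the abstract using word boundaries.
text = doc["abstract"]
spans = [
    (m.start(), m.end(), "ASTRO_OBJ")  # m.end() is an exclusive offset
    for name in names
    for m in re.finditer(r"\b" + re.escape(name) + r"\b", text)
]

# Step 4: pair the text with its entity spans.
example = (text, {"entities": spans})

for start, end, label in spans:
    print(text[start:end], label)
```

Only names that appear in the keyword list are tagged, so "M101" is left unlabeled in this toy record even though it is an astronomical object.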

## Key Functions

-   `extract_astronomical_keyword_entities()`: Parses the `keyword` list from a document to pull out any terms identified as an "individual" celestial object.
-   `find_exact_matches()`: Uses a regular expression with word boundaries (`\b`) to find the start and end character offsets of every occurrence of an entity string in a body of text. This prevents partial matches (e.g., finding "Norma" inside "34 Normae").
-   `build_data_model()`: The main orchestration function that takes a raw document, runs the extraction and search steps, and compiles the final structured output.
-   `build_spacy_ner_data()`: Formats the final text and entity locations into the `(text, {"entities": [...]})` tuple structure that spaCy expects for a single training example.
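
The word-boundary behaviour described for `find_exact_matches()` can be checked in isolation. The sketch below is illustrative only: both helper names are hypothetical, and the lookaround variant is an alternative offered for comparison, not something the notebook uses. It shows that `\b` rejects the "Norma" inside "Normae", but also that `\b` cannot match after a trailing non-word character such as the `*` in "Sgr A*".

```python
import re


def find_with_word_boundaries(text: str, name: str) -> list[tuple[int, int]]:
    """Hypothetical helper: whole-word search using \\b (exclusive end)."""
    pattern = re.compile(r"\b" + re.escape(name) + r"\b")
    return [(m.start(), m.end()) for m in pattern.finditer(text)]


def find_with_lookarounds(text: str, name: str) -> list[tuple[int, int]]:
    """Hypothetical alternative: require no word character on either side."""
    pattern = re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)")
    return [(m.start(), m.end()) for m in pattern.finditer(text)]


# "Norma" is not matched inside the longer word "Normae".
print(find_with_word_boundaries("34 Normae", "Norma"))

# \b needs a word character next to it, so a name ending in "*" is missed
# when it is followed by a space.
print(find_with_word_boundaries("near Sgr A* in", "Sgr A*"))

# The lookaround form finds it while still rejecting partial matches.
print(find_with_lookarounds("near Sgr A* in", "Sgr A*"))
print(find_with_lookarounds("34 Normae", "Norma"))
```

Whether the lookaround form is worth adopting depends on how many target names start or end with punctuation in the actual keyword data.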

## Final Output

The primary output of this notebook is a list of spaCy training examples stored in the `spacy_ner_data` key of the generated data model. This data can be saved and used directly in a spaCy training pipeline to teach the model to recognize astronomical objects.
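
Before training, the per-document models usually need to be flattened into one list of examples. The sketch below is one possible way to do that, under stated assumptions: `collect_training_examples` is a hypothetical helper, each item under `spacy_ner_data` is assumed to be a `[text, {"entities": [...]}]` pair, and offsets are assumed to use an exclusive end. Because spaCy's NER does not accept overlapping entity spans, the helper keeps the first span and drops any later span that overlaps it.

```python
def collect_training_examples(data_models):
    """Hypothetical helper: flatten data models into (text, annotations) pairs."""
    examples = []
    for model in data_models:
        for text, annotations in model.get("spacy_ner_data", []):
            kept = []
            last_end = 0
            # Sort by start offset so overlaps can be detected in one pass.
            for start, end, label in sorted(annotations["entities"]):
                in_bounds = 0 <= start < end <= len(text)
                if in_bounds and start >= last_end:
                    kept.append((start, end, label))
                    last_end = end
            examples.append((text, {"entities": kept}))
    return examples


# Toy data model, invented for illustration only.
models = [
    {
        "spacy_ner_data": [
            [
                "Crab Nebula and M1",
                {
                    "entities": [
                        [0, 11, "ASTRO_OBJ"],
                        [5, 11, "ASTRO_OBJ"],  # overlaps the span above
                        [16, 18, "ASTRO_OBJ"],
                    ]
                },
            ]
        ]
    }
]

for text, annotations in collect_training_examples(models):
    print([text[s:e] for s, e, _ in annotations["entities"]])
```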


In [None]:
from pathlib import Path
import re
import json
from typing import Any, Dict, List, Optional

In [4]:
def extract_astronomical_keyword_entities(doc: dict) -> list[str]:
    """Extracts astronomical entities from document keywords.

    This function iterates through the keywords of a NASA ADS document record
    and extracts entities specifically marked with the "individual: " prefix.

    Args:
        doc: A dictionary representing a single document from the ADS API.

    Returns:
        A list of unique astronomical entity names.
    """
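    entities = []
    keywords = doc.get("keyword", [])
    for keyword in keywords:
        if "individual: " in keyword.lower():
            # Split case-insensitively so the split agrees with the check above.
            entity = re.split(r"individual: ", keyword, flags=re.IGNORECASE)[-1].strip()
            entities.append(entity)
    # Deduplicate and sort so the output order is reproducible across runs.
    return sorted(set(entities))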


def find_entities_in_text(
    text: str, entities: list[str]
) -> dict[str, list[tuple[int, int]]]:
    """Finds all occurrences of a list of entity strings in a text.

    Args:
        text: The text to search within.
        entities: A list of string entities to find.

    Returns:
        A dictionary where keys are the found entities and values are lists
        of (start, end) character offset tuples for each occurrence.
    """
    found_entities = {}
    for entity in entities:
        if locations := find_exact_matches(text, entity):
            found_entities[entity] = locations
    return found_entities


def find_exact_matches(text: str, search_string: str) -> list[tuple[int, int]]:
    """Finds occurrences of the exact string using word boundaries.

    This ensures that a search for "Norma" does not match inside a longer
    word such as "Normae" (as in "34 Normae").

    Args:
        text: The string to search within.
        search_string: The exact string to search for.

    Returns:
        A list of tuples, where each tuple contains the (start, end) index
        of a match. The end index is inclusive.
    """
    if not search_string:
        return []

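    # Use word boundaries to ensure whole-word matching. Note that \b only
    # matches next to a word character, so names that start or end with
    # punctuation (e.g. "Sgr A*") are usually missed by this pattern.
    pattern = re.compile(r'\b' + re.escape(search_string) + r'\b')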

    occurrences = []
    for match in pattern.finditer(text):
        start_index = match.start()
        end_index = match.end() - 1  # End index is inclusive
        occurrences.append((start_index, end_index))
    return occurrences


def search_title_for_entities(
    doc: dict, entities: list[str]
) -> tuple[str, dict[str, list[tuple[int, int]]]]:
    """Searches the document title for specified entities.

    Args:
        doc: A dictionary representing a single document from the ADS API.
        entities: A list of string entities to find.

    Returns:
        A tuple containing the document title text and a dictionary of
        found entities with their character offsets.
    """
    text = " ".join(doc.get("title", []))
    found_entities = find_entities_in_text(text, entities)
    return text, found_entities


def search_abstract_for_entities(
    doc: dict, entities: list[str]
) -> tuple[str, dict[str, list[tuple[int, int]]]]:
    """Searches the document abstract for specified entities.

    Args:
        doc: A dictionary representing a single document from the ADS API.
        entities: A list of string entities to find.

    Returns:
        A tuple containing the document abstract text and a dictionary of
        found entities with their character offsets.
    """
    text = doc.get("abstract", "")
    found_entities = find_entities_in_text(text, entities)
    return text, found_entities

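def build_data_model(doc: dict) -> Optional[dict[str, Any]]:
    """Builds the data structure for training a NER model.

    Args:
        doc: A dictionary representing a single document from the ADS API.

    Returns:
        A dictionary with the document fields, the extracted entities, their
        locations in the title and abstract, and the spaCy training examples,
        or None if the document has no "individual: " keywords.
    """
    entities = extract_astronomical_keyword_entities(doc)
    if not entities:
        return None

    data_model = {
        "doc_id": doc.get("id"),
        "title": doc.get("title", []),
        "abstract": doc.get("abstract", ""),
        "keywords": doc.get("keyword", []),
        "objects": entities,
    }

    title, title_entities = search_title_for_entities(doc, entities)
    abstract, abstract_entities = search_abstract_for_entities(doc, entities)

    data_model["title_entities"] = title_entities
    data_model["abstract_entities"] = abstract_entities

    # Append (not extend) so each training example stays a single
    # (text, annotations) pair instead of being flattened into the list.
    spacy_ner_data = []
    if title_entities:
        spacy_ner_data.append(build_spacy_ner_data(title, title_entities))
    if abstract_entities:
        spacy_ner_data.append(build_spacy_ner_data(abstract, abstract_entities))
    data_model["spacy_ner_data"] = spacy_ner_data

    return data_model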

def build_spacy_ner_data(
    text: str, entities: dict[str, list[tuple[int, int]]]
) -> tuple[str, dict[str, list[tuple[int, int, str]]]]:
    """Builds a single training example for a spaCy NER model.

    This function converts a text and a dictionary of entity locations into
    the `(text, {"entities": [...]})` pair used for a single spaCy training
    example. Each item in "entities" is a `(start, end, label)` tuple. spaCy
    character offsets use an exclusive end, so the inclusive end offsets
    produced by `find_exact_matches()` are shifted by one here.

    Args:
        text: The source text containing the entities.
        entities: A dictionary where keys are entity names and values are lists
                  of (start, end) character offsets for each occurrence, with
                  an inclusive end offset.

    Returns:
        A tuple containing the text and an entity dictionary, formatted for
        spaCy training. For example:
        ('Some text about SN 2023ixf.', {'entities': [(16, 26, 'ASTRO_OBJ')]})
    """
    spans = [
        (start, end + 1, "ASTRO_OBJ")  # Convert inclusive end to exclusive.
        for locs in entities.values()
        for start, end in locs
    ]
    return text, {"entities": spans}

In [None]:

def read_abstracts(data_dir: Path) -> list[dict[str, Any]]:
    """
    Reads ADS abstracts from JSON files in the given directory.

    Args:
        data_dir: Path to the directory containing the JSON files.

    Returns:
        A list of dictionaries, where each dictionary represents an abstract.
    """
    all_abstracts: list[dict[str, Any]] = []
    # Sort the file list so documents are read in a reproducible order.
    for file in sorted(data_dir.glob("*.json")):
        with open(file, "r", encoding="utf-8") as f:
            # Fall back to an empty dict if the "response" key is missing.
            abstract_batch = json.load(f).get("response", {})
        all_abstracts.extend(abstract_batch.get("docs", []))
    return all_abstracts
 
 
# build_data_model() is defined in the cell above and is reused here.
# Redefining it in this cell would shadow the real implementation.
 
 
def process_abstracts(abstracts: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """
    Processes a list of abstracts to build data models.

    Args:
        abstracts: A list of dictionaries, where each dictionary represents an abstract.

    Returns:
        A list of data models (dictionaries) built from the abstracts.
        Documents for which no data model can be built are skipped.
    """
    data_models: list[dict[str, Any]] = []
    for doc in abstracts:
        data_model = build_data_model(doc)
        if data_model:
            data_models.append(data_model)
    return data_models
 
 
def write_data(data: list[dict[str, Any]], output_path: Path) -> None:
    """
    Writes the processed data to a JSON file.

    Args:
        data: A list of dictionaries to write to the file.
        output_path: The path to the output JSON file.
    """
    output_path.parent.mkdir(parents=True, exist_ok=True)  # Ensure directory exists
    # Write UTF-8 explicitly because ensure_ascii=False keeps non-ASCII characters.
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    print(f"Saved spaCy NER data to {output_path}")
 
 
def main(data_root: Path, output_path: Path) -> None:
    """
    Main function to execute the data processing pipeline.

    Args:
        data_root: The root directory containing the ADS abstracts.
        output_path: The path to save the processed data.
    """
    ads_data = data_root / "ads_abstracts"
    abstracts = read_abstracts(ads_data)
    processed_data = process_abstracts(abstracts)
    write_data(processed_data, output_path)
 
 
if __name__ == "__main__":
    DATA_ROOT = Path("../data")
    OUTPATH = DATA_ROOT / "ner_data" / "spacy_ner_data.json"
    main(DATA_ROOT, OUTPATH)


## Example implementation

Test the functions on the downloaded ADS files before building a more generic script.

In [24]:
# Data inputs
DATA_ROOT = Path("../data")
ADS_DATA = DATA_ROOT / "ads_abstracts"

In [25]:
# Process ADS abstracts
abstract_files = list(ADS_DATA.glob("*.json"))
spacy_ner_data = []

for file in abstract_files:
    with open(file, "r", encoding="utf-8") as f:
        # Fall back to an empty dict if the "response" key is missing.
        abstract_batch = json.load(f).get("response", {})
        docs = abstract_batch.get("docs", [])
    for doc in docs:
        data_model = build_data_model(doc)
        if data_model:
            spacy_ner_data.append(data_model)

In [26]:
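# Output data to disk
OUTPATH = DATA_ROOT / "ner_data" / "spacy_ner_data.json"
OUTPATH.parent.mkdir(parents=True, exist_ok=True)  # Ensure the output directory exists

with open(OUTPATH, "w", encoding="utf-8") as f:
    json.dump(spacy_ner_data, f, indent=2, ensure_ascii=False)

print(f"Saved spaCy NER data to {OUTPATH}")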

Saved spaCy NER data to ../data/ner_data/spacy_ner_data.json
