# API Annotation Pipeline

This notebook performs the full annotation process described in Section 3.5.1 of the thesis.
It calls an external LLM-based API to obtain tokenization, lemmatization, and POS tags
for the AOI-level texts derived from the lemma-enriched materials.

**Input**
- `data/05_indexed_lemmas.csv` — lemma-enriched materials (from 03_data_prep.ipynb)

**Intermediate files created in this notebook**
- `data/05_aoi_dictionary_simplified.json` — normalized AOI sentences for API prompting

**Outputs (in order)**
1. `data/06_api_responses.csv` — raw API tokenization responses
2. `data/07_api_responses_with_lemmas.csv` — extended responses with lemmas and POS tags
3. `data/08_api_parsed_responses.csv` — structured token–lemma–POS table
4. `data/09_api_materials_parsed.csv` — AOI-aligned parsed materials
5. `data/10_api_materials_corrected.csv` — manually verified version
6. `data/11_api_materials_supervisor_corrected.csv` — supervisor-checked dataset
7. `data/12_api_materials_enriched.csv` — final enriched annotation dataset for merging


## Setup: Imports and File Paths

Import all required libraries and define standardized input and output paths for this notebook.
All file paths are centralized in a single dictionary (`PATHS`) to ensure reproducibility and
consistent file handling across the entire data preparation pipeline.


In [1]:
import os
import re
import csv
import json
import random
import pandas as pd
from pathlib import Path
import openai

# Define canonical data paths for this notebook
PATHS = {
    # Inputs
    "lemmas": "data/05_indexed_lemmas.csv",
    "aoi_dict_simplified": "data/05_aoi_dictionary_simplified.json",

    # API Annotation Outputs
    "responses": "data/06_api_responses.csv",
    "responses_with_lemmas": "data/07_api_responses_with_lemmas.csv",
    "parsed": "data/08_api_parsed_responses.csv",
    "materials_parsed": "data/09_api_materials_parsed.csv",
    "final": "data/10_materials_parsed_collapsed.csv"
}

# Ensure all required directories exist
for path in PATHS.values():
    Path(path).parent.mkdir(parents=True, exist_ok=True)

## Step 2: Create Simplified AOI Dictionary for API Annotation

This step constructs a simplified AOI dictionary from the lemma-enriched materials (`05_indexed_lemmas.csv`).
Each entry groups AOIs by their base key and stores:
- `ids`: list of `[Word_index, AOI_ann]` pairs
- `concat`: concatenated AOI text
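
As an illustration of this structure, the sketch below builds the same kind of dictionary from three made-up rows. The AOI IDs and words are hypothetical; `get_base_key` follows the definition used in the code cell of this step.

```python
import re

def get_base_key(aoi_id: str) -> str:
    """Strip trailing digits from an AOI ID (same logic as the notebook helper)."""
    match = re.search(r"(.*?)(\d+)$", aoi_id)
    return match.group(1) if match else aoi_id

# Hypothetical rows: (id.global.aoi, Word_index, AOI_ann)
rows = [
    ("text_a_1-1-1", 1, "Cas"),
    ("text_a_1-1-2", 2, "clinique."),
    ("text_b_1-1-1", 1, "Le"),
]

result = {}
for aoi_id, word_index, aoi_text in rows:
    # One entry per base key; words are appended in row order
    entry = result.setdefault(get_base_key(aoi_id), {"ids": [], "concat": ""})
    entry["ids"].append([word_index, aoi_text])
    entry["concat"] = (entry["concat"] + " " + aoi_text).strip()

print(result)
```

Note that the base key keeps the separator that preceded the stripped digits (here a trailing hyphen).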

This simplified version is used for efficient API prompting and can optionally be reduced further
to a slim subset (`slim_dict.json`) for testing or cost estimation.
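
The notebook does not show how the slim subset is produced. One possible approach, given here only as a sketch, is a seeded random sample of entries. The function name `make_slim_dict`, its parameters, and the toy data are assumptions made for illustration.

```python
import json
import random

def make_slim_dict(full_dict: dict, n_entries: int, seed: int = 0) -> dict:
    """Return a reproducible random subset of at most n_entries entries,
    keeping the original key order."""
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    keys = list(full_dict)
    chosen = set(rng.sample(keys, min(n_entries, len(keys))))
    return {k: full_dict[k] for k in keys if k in chosen}

# Toy dictionary with the same shape as the simplified AOI dictionary
toy = {f"text_{i}-": {"ids": [[1, "mot"]], "concat": "mot"} for i in range(10)}
slim = make_slim_dict(toy, n_entries=3)
print(json.dumps(slim, ensure_ascii=False, indent=2))

# To persist the subset, something like the following could be used:
# with open("slim_dict.json", "w", encoding="utf-8") as f:
#     json.dump(slim, f, ensure_ascii=False, indent=2)
```

Keeping the original key order matters here because later steps map tokens back to AOIs sequentially.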


In [2]:
df = pd.read_csv(
    PATHS["lemmas"],
    sep="\t",
    quoting=0,
    low_memory=False
)

def get_base_key(aoi_id: str) -> str:
    """Extract the base key from an AOI ID by stripping trailing digits."""
    match = re.search(r"(.*?)(\d+)$", aoi_id)
    return match.group(1) if match else aoi_id

def normalize_text(text: str) -> str:
    """Normalize typographic apostrophes to a straight apostrophe and collapse whitespace."""
    text = re.sub(r"[‘’´′]", "'", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

result_dict = {}

for _, row in df.iterrows():
    aoi_id = row["id.global.aoi"]
    word_id = row["Word_index"]
    aoi_content = normalize_text(row["AOI_ann"])

    base_key = get_base_key(aoi_id)

    if base_key not in result_dict:
        result_dict[base_key] = {"ids": [], "concat": ""}

    result_dict[base_key]["ids"].append([word_id, aoi_content])

    if result_dict[base_key]["concat"]:
        result_dict[base_key]["concat"] += " " + aoi_content
    else:
        result_dict[base_key]["concat"] = aoi_content

output_path = Path(PATHS["aoi_dict_simplified"])
with open(output_path, "w", encoding="utf-8") as f:
    json.dump(result_dict, f, ensure_ascii=False, indent=2)

print(f"Simplified AOI dictionary saved to {output_path} with {len(result_dict)} entries.")

Simplified AOI dictionary saved to data/05_aoi_dictionary_simplified.json with 5 entries.


## Step 3: Define Helper Functions for API Annotation

This section defines the reusable helper functions and prompts used for API-based
tokenization and lemmatization. These functions are shared across all subsequent steps.
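
To make the batching behaviour concrete, here is a condensed variant of the greedy batching idea, written so that an empty batch is never emitted. The name `batch_entries` and the toy dictionary are illustrative assumptions, not part of the notebook.

```python
def batch_entries(data, batch_size):
    """Greedily pack whole dictionary entries into batches of at most
    batch_size tokens (a single oversized entry still forms its own batch)."""
    token_batch, sentence_batch = [], []
    for value in data.values():
        tokens = [item[1] for item in value["ids"]]
        if token_batch and len(token_batch) + len(tokens) > batch_size:
            # Current batch would overflow: emit it and start a new one
            yield token_batch, sentence_batch
            token_batch, sentence_batch = [], []
        token_batch.extend(tokens)
        sentence_batch.append(value["concat"])
    if token_batch:
        yield token_batch, sentence_batch

toy = {
    "a-": {"ids": [[1, "Cas"], [2, "clinique."]], "concat": "Cas clinique."},
    "b-": {"ids": [[1, "Un"], [2, "patient"]], "concat": "Un patient"},
    "c-": {"ids": [[1, "Le"]], "concat": "Le"},
}
batches = list(batch_entries(toy, batch_size=3))
for tokens, sentences in batches:
    print(len(tokens), tokens, sentences)
```

Entries are never split across batches, so each sentence reaches the API together with all of its tokens.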

In [3]:
def batch_tokens_and_sentences(data, batch_size=120):
    """Create batches of tokens and sentences for API processing."""
    token_batch, sentence_batch = [], []
    for key, value in data.items():
        tokens = [item[1] for item in value["ids"]]
        sentence = value["concat"]
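        if token_batch and len(token_batch) + len(tokens) > batch_size:
            # Current batch is full: emit it and start a new one.
            # The `token_batch` guard avoids yielding an empty batch when the
            # very first entry is larger than batch_size.
            yield token_batch, sentence_batch
            token_batch, sentence_batch = [], []
        token_batch.extend(tokens)
        sentence_batch.append(sentence)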
    if token_batch:
        yield token_batch, sentence_batch


def extract_tokens(batch):
    """Extract tokens and assign an index within the batch."""
    return [(token, i) for i, token in enumerate(batch["tokens"])]


def format_tokens_for_prompt(token_id_list):
    """Format token–ID pairs into a string for inclusion in the API prompt."""
    return "\n".join([f"{token} {token_id}" for token, token_id in token_id_list])


class LLMCaller:
    """Wrapper for OpenAI ChatCompletion calls."""
    def __init__(self, api_key, model_name="gpt-4o"):
        self.api_key = api_key
        self.model_name = model_name
        openai.api_key = api_key

    def call_llm_openai(self, system_prompt, user_prompt):
        """Send a formatted system/user prompt to the API and return the model response."""
        response = openai.chat.completions.create(
            model=self.model_name,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt}
            ]
        )
        return response.choices[0].message.content


def remove_triple_backticks_with_newlines(text):
    """Remove leading/trailing triple backticks and adjacent newlines."""
    if not isinstance(text, str):
        return text
    text = text.strip()
    # Also drop an optional language tag after the opening fence (e.g. ```text)
    text = re.sub(r"^```(?:[\w+-]*\n)?", "", text)
    text = re.sub(r"(?:\n)?```$", "", text)
    return text


def simulate_tokenization_response(tokens):
    """
    Generate a realistic synthetic tokenization response that mimics
    LLM tokenization behavior, including punctuation splitting and contractions.
    Items within a batch are joined with single spaces (not tabs or newlines).
    """
    punct = [".", ",", ";", ":", "!", "?"]
    result = []
    word_id = 0

    for token in tokens:
        token = str(token).strip()

        # handle French contractions manually: e.g., "d’orthopédie", "l'os"
        if "’" in token or "'" in token:
            for apos in ["’", "'"]:
                if apos in token:
                    split_idx = token.index(apos)
                    prefix = token[:split_idx].strip()
                    rest = token[split_idx + 1 :].strip()
                    if prefix:
                        result.append(f"{prefix}|{word_id}")
                    result.append(f"'|{word_id}")
                    if rest:
                        result.append(f"{rest}|{word_id}")
                    break

        # simulate punctuation split (e.g. "clinique.")
        elif any(token.endswith(p) for p in punct):
            word = token[:-1].strip()
            mark = token[-1]
            if word:
                result.append(f"{word}|{word_id}")
            result.append(f"{mark}|{word_id}")

        # normal case
        else:
            result.append(f"{token}|{word_id}")

        word_id += 1

    # join tokens with spaces (safe for TSV export)
    return " ".join(result)



# Prompts for tokenization and lemmatization

system_prompt_tokenization = """You are an NLP model specialized in processing French medical texts. Tokenize this sentence into "Token|WordID" pairs, one per line.
Assign IDs to the token based on the given Word IDs.
 **ABSOLUTELY ALWAYS** break contractions, compound words, and punctuation into separate tokens.
Repeat the original ID for all split parts; do not concatenate tokens.
Output only valid "Token|WordID" pairs - no extra text, no missing or extra IDs.
"""

system_prompt_lemmatization = """You are an NLP model specialized in processing French medical text tokens. Lemmatize these token and annotate them with POS tags while preserving original word IDs.
Ensure lemmatization and POS tags are context-appropriate and original IDs are maintained. Output format: Token|ID|Lemma|POS. Output only the list of lemmatized and annotated items."""
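
Because the tokenization prompt demands "no missing or extra IDs", it can be useful to check a response before passing it on. The following is a minimal sketch of such a check; the function name `check_tokenization_response` and its return format are assumptions, and the notebook itself does not perform this validation.

```python
def check_tokenization_response(response: str, n_words: int) -> list:
    """Return a list of problems found in a "Token|WordID" response.
    An empty list means every ID in 0..n_words-1 occurs and nothing else does."""
    problems = []
    seen_ids = []
    for item in response.split():
        # Split on the LAST pipe so that a token containing "|" is tolerated
        token, sep, word_id = item.rpartition("|")
        if not sep or not token or not word_id.isdecimal():
            problems.append(f"malformed item: {item!r}")
            continue
        seen_ids.append(int(word_id))
    if seen_ids != sorted(seen_ids):
        problems.append("WordIDs are not in non-decreasing order")
    expected = set(range(n_words))
    missing = sorted(expected - set(seen_ids))
    extra = sorted(set(seen_ids) - expected)
    if missing:
        problems.append(f"missing WordIDs: {missing}")
    if extra:
        problems.append(f"unexpected WordIDs: {extra}")
    return problems

print(check_tokenization_response("Cas|0 clinique|1 .|1", n_words=2))
print(check_tokenization_response("Cas|0 .|2", n_words=2))
```

The check assumes IDs start at 0 within each batch, which matches how `extract_tokens` numbers them.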

## Step 4: Tokenization API Calls

This step sends batches of AOI text to the API for tokenization using the defined
`system_prompt_tokenization`. Each batch is constructed from the simplified AOI
dictionary (`05_aoi_dictionary_simplified.json`) and processed through the LLM API.
In debug mode, the notebook generates synthetic tokenization responses locally and prints the first one instead of sending requests.
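
For orientation, the sketch below shows what a user prompt for one small batch looks like. The template string is the one used in the tokenization cell; the three tokens and two sentences are made up.

```python
# Hypothetical batch content
tokens = ["Cas", "clinique.", "d’orthopédie"]
sentences = ["Cas clinique.", "d’orthopédie"]

# One "token index" pair per line, in the style of format_tokens_for_prompt
word_id_lines = "\n".join(f"{token} {i}" for i, token in enumerate(tokens))

user_prompt = "Sentences:\n{sentences}\n\nWordIDs:\n{tokens}".format(
    sentences=" ".join(sentences),
    tokens=word_id_lines,
)
print(user_prompt)
```

The word IDs are positions within the batch, not the `Word_index` values from the materials table.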


### API Key Configuration

The notebook can run in two modes:

- **Live mode:** Requires a valid OpenAI API key to send requests.
- **Debug mode:** Skips API calls and uses placeholder responses.

To run this notebook with live API calls, you must provide a valid OpenAI API key.
The recommended approach is to store the key in a local `.env` file (not under version control).

Create a file named `.env` in the project root with the following line:

`OPENAI_API_KEY=sk-your_api_key_here`

This notebook will automatically read the key from that file.
If no `.env` file is found (or it does not define the key), the notebook falls back to a plain `API_KEY` file
and then to the `OPENAI_API_KEY` environment variable.


If no key is found and `DEBUG_MODE` is `True`, the notebook will continue in debug mode automatically.
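
The loader in this notebook only recognizes lines that start exactly with `OPENAI_API_KEY=` and does not strip surrounding quotes. If a more tolerant reader is wanted, a sketch along the following lines could be used. The function name `parse_env_value` is hypothetical and the key values are placeholders.

```python
def parse_env_value(text: str, name: str = "OPENAI_API_KEY"):
    """Return the value assigned to `name` in .env-style text, or None if absent.
    Blank lines and '#' comments are skipped; surrounding quotes are removed."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == name:
            return value.strip().strip("'\"")
    return None

print(parse_env_value('# local secrets\nOPENAI_API_KEY = "sk-your_api_key_here"\n'))
```

This handles only simple `NAME=value` lines; it is not a full implementation of any `.env` specification.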

In [4]:
DEBUG_MODE = True  # Set to False for live API calls

def load_api_key(debug_mode: bool = False):
    """
    Load the OpenAI API key from a .env file, API_KEY file, or environment variable.
    If none is found and debug mode is enabled, continue without raising an error.
    """
    key = None
    env_path = Path(".env")

    # Option 1: .env file
    if env_path.exists():
        with open(env_path, "r", encoding="utf-8") as f:
            for line in f:
                if line.startswith("OPENAI_API_KEY="):
                    key = line.strip().split("=", 1)[1]
                    break

    # Option 2: API_KEY file
    if not key and Path("API_KEY").exists():
        with open("API_KEY", "r", encoding="utf-8") as f:
            key = f.read().strip()

    # Option 3: environment variable
    if not key:
        key = os.getenv("OPENAI_API_KEY")

    # Handle missing key
    if not key:
        if debug_mode:
            print("No API key found. Continuing in debug mode (no API calls will be made).")
            return None
        else:
            raise RuntimeError(
                "OpenAI API key not found. Please create a .env file or API_KEY file in the project root."
            )

    os.environ["OPENAI_API_KEY"] = key
    print("OpenAI API key loaded successfully.")
    return key


api_key = load_api_key(DEBUG_MODE)

No API key found. Continuing in debug mode (no API calls will be made).


In [5]:
# Load the simplified AOI dictionary
with open(PATHS["aoi_dict_simplified"], "r", encoding="utf-8") as f:
    sentence_dict = json.load(f)

print(f"Simplified AOI dictionary loaded with {len(sentence_dict)} entries.")

# Create batches for tokenization
batches = [
    {"tokens": tokens, "sentences": sentences}
    for tokens, sentences in batch_tokens_and_sentences(sentence_dict, batch_size=120)
]
print(f"Created {len(batches)} batches for API processing.")

# Initialize API caller (safe for both debug and live)
llm_caller = LLMCaller(api_key if api_key else "", model_name="gpt-4o")

responses = []

for i, batch in enumerate(batches):
    sentences_text = " ".join(batch["sentences"])
    token_id_list = extract_tokens(batch)
    formatted_tokens = format_tokens_for_prompt(token_id_list)

    user_prompt = "Sentences:\n{sentences}\n\nWordIDs:\n{tokens}"
    user_prompt_formatted = user_prompt.format(tokens=formatted_tokens, sentences=sentences_text)

    if DEBUG_MODE or not api_key:
        fake_response = simulate_tokenization_response(batch["tokens"])
        responses.append(fake_response)
        if i == 0:
            print("Debug Mode: Example synthetic tokenization response:\n")
            print(fake_response)
    else:
        response = llm_caller.call_llm_openai(system_prompt_tokenization, user_prompt_formatted)
        cleaned_response = remove_triple_backticks_with_newlines(response)
        responses.append(cleaned_response)

    if i % 10 == 0:
        print(f"Processed {i} batches")

# Combine batches and responses into a single DataFrame
df = pd.DataFrame(batches)
df["response"] = responses

# Save to TSV without altering the response text
df.to_csv(
    PATHS["responses"],
    sep="\t",
    index=False
)

print(f"Tokenization responses saved to {PATHS['responses']} (shape: {df.shape})")

Simplified AOI dictionary loaded with 5 entries.
Created 1 batches for API processing.
Debug Mode: Example synthetic tokenization response:

Cas|0 clinique|1 .|1 Un|2 patient|3 âgé|4 Le|5 patient|6 souffre|7 d|8 '|8 orthopédie|8 fracture|9 .|9
Processed 0 batches
Tokenization responses saved to data/06_api_responses.csv (shape: (1, 3))


## Step 5: Lemmatization and POS Tagging API Calls

This step processes the tokenized responses produced in Step 4 (`06_api_responses.csv`)
and sends them to the API for lemmatization and POS tagging using the
`system_prompt_lemmatization`.

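When `DEBUG_MODE` is enabled or no API key is available, the notebook skips the API
requests and generates placeholder annotations locally (lowercased tokens as lemmas,
partly random POS tags). These placeholders are not linguistically meaningful.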
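
The debug branch draws POS tags with `random.choice`, so its output changes between runs. If reproducible placeholders are preferred, a local seeded generator is one option. The sketch below is a simplified variant of the debug logic, not the notebook's exact code; `fake_annotation` and its `seed` parameter are assumptions.

```python
import random

PUNCT = {".", ",", ";", ":", "!", "?", "'", "’"}

def fake_annotation(tokenized: str, seed: int = 0) -> str:
    """Build a placeholder Token|ID|Lemma|POS string from "Token|ID" items.
    POS tags for non-punctuation tokens are random and carry no meaning."""
    rng = random.Random(seed)  # same seed, same sequence of tags
    items = []
    for item in tokenized.split():
        if "|" not in item:
            continue
        token, token_id = item.split("|", 1)
        if token in PUNCT:
            lemma, pos = token, "PUNCT"
        else:
            lemma, pos = token.lower(), rng.choice(["NOUN", "VERB", "ADJ", "DET"])
        items.append(f"{token}|{token_id}|{lemma}|{pos}")
    return "\t".join(items)

first = fake_annotation("Cas|0 clinique|1 .|1")
second = fake_annotation("Cas|0 clinique|1 .|1")
print(first == second)
```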


In [6]:
# Load tokenization responses
df_tokenized = pd.read_csv(PATHS["responses"], sep="\t", quoting=0, low_memory=False)
print(f"Loaded {PATHS['responses']} with {len(df_tokenized)} entries.")

lemma_responses = []

for i, response in enumerate(df_tokenized["response"]):
    if not isinstance(response, str) or not response.strip():
        lemma_responses.append(None)
        continue

    cleaned_response = remove_triple_backticks_with_newlines(response)

    if DEBUG_MODE or not api_key:
        lines = []
        for line in cleaned_response.split():  # split on any whitespace (spaces, tabs, newlines)
            if "|" not in line:
                continue
            token, token_id = line.split("|", 1)
            token = token.strip()
            lemma = token.lower()

            if token.strip() in [".", ",", ";", ":", "!", "?", "'", "’"]:
                pos = "PUNCT"
                lemma = token.strip()
            elif token.lower() in ["et", "ou", "mais"]:
                pos = "CCONJ"
            elif token.istitle():
                pos = "NOUN"
            else:
                pos = random.choice(["NOUN", "VERB", "ADJ", "DET"])

            lines.append(f"{token}|{token_id}|{lemma}|{pos}")

        fake_output = "\t".join(lines)
        lemma_responses.append(fake_output)

        if i == 0:
            print("Debug Mode: Example synthetic lemma/POS response:\n")
            print(fake_output)
    else:
        lemmatization_response = llm_caller.call_llm_openai(
            system_prompt_lemmatization, cleaned_response
        )
        cleaned_output = remove_triple_backticks_with_newlines(lemmatization_response)
        lemma_responses.append(cleaned_output)

    if i % 10 == 0:
        print(f"Processed {i} responses")

df_tokenized["lemma_response"] = lemma_responses

# Save lemmatization results
df_tokenized.to_csv(
    PATHS["responses_with_lemmas"],
    sep="\t",
    index=False,
    quoting=csv.QUOTE_ALL
)

print(f"Lemmatization responses saved to {PATHS['responses_with_lemmas']} (shape: {df_tokenized.shape})")

Loaded data/06_api_responses.csv with 1 entries.
Debug Mode: Example synthetic lemma/POS response:

Cas|0|cas|NOUN	clinique|1|clinique|DET	.|1|.|PUNCT	Un|2|un|NOUN	patient|3|patient|VERB	âgé|4|âgé|VERB	Le|5|le|NOUN	patient|6|patient|ADJ	souffre|7|souffre|DET	d|8|d|DET	'|8|'|PUNCT	orthopédie|8|orthopédie|VERB	fracture|9|fracture|VERB	.|9|.|PUNCT
Processed 0 responses
Lemmatization responses saved to data/07_api_responses_with_lemmas.csv (shape: (1, 4))


## Step 6: Parse Lemmatization Responses into Structured Format

This step parses the raw API responses from Step 5 (`07_api_responses_with_lemmas.csv`)
into a structured table with one row per token and explicit columns:
`ID`, `Token`, `Lemma`, and `POS`.

The parsed data are saved to `08_api_parsed_responses.csv`, which forms the basis
for later alignment with AOI annotations.
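
As a small worked example of this parsing, the sketch below turns one response string into row dictionaries and silently drops items that do not have four fields. The sample string is made up, and `parse_items` is an illustrative stand-in for the two functions in the code cell (which additionally go through an intermediate `ID|Token|Lemma|POS` form and keep the ID as text).

```python
import re

# Token|ID|Lemma|POS, where ID is an integer and no field contains "|"
ITEM = re.compile(r"^([^|]+)\|(\d+)\|([^|]+)\|([^|]+)$")

def parse_items(lemma_response: str) -> list:
    """Split a response on whitespace and keep only well-formed items."""
    rows = []
    for item in lemma_response.split():
        match = ITEM.match(item)
        if match:
            token, id_, lemma, pos = match.groups()
            rows.append({"ID": int(id_), "Token": token, "Lemma": lemma, "POS": pos})
    return rows

rows = parse_items("Cas|0|cas|NOUN\tclinique|1|clinique|ADJ\t.|1|.|PUNCT\tbroken|item")
for row in rows:
    print(row)
```

Because items are split on whitespace, a token or lemma that itself contains a space cannot be represented in this format and would be dropped.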


In [7]:
# Load lemmatization responses
df_lemmas = pd.read_csv(PATHS["responses_with_lemmas"], sep="\t", quoting=0, low_memory=False)
print(f"Loaded {PATHS['responses_with_lemmas']} with {len(df_lemmas)} entries.")

def reformat_lemma_response(lemma_response):
    """
    Reformats the lemma response string to switch the ID and word,
    such that the format is ID|Word|Lemma|POS.
    Handles both tab- and space-separated entries.
    """
    if not isinstance(lemma_response, str):
        return lemma_response

    parts = re.split(r"\s+", lemma_response.strip())  # split on any whitespace
    reformatted_parts = []
    for part in parts:
        fields = part.split("|")
        if len(fields) == 4:
            token, id_, lemma, pos = fields
            reformatted_parts.append(f"{id_}|{token}|{lemma}|{pos}")
    return "\n".join(reformatted_parts)


def parse_reformatted_response(df):
    """
    Parses the 'reformatted_lemma_response' column of the input DataFrame
    into a new DataFrame with columns ID, Token, Lemma, POS.
    Works regardless of whether entries are separated by tabs, spaces, or newlines.
    """
    data = []
    for _, row in df.iterrows():
        response = row.get("reformatted_lemma_response")
        if isinstance(response, str):
            # split on any whitespace
            lines = re.split(r"\s+", response.strip())
            for line in lines:
                match = re.match(r"(\d+)\|([^|]+)\|([^|]+)\|([^|]+)$", line)
                if match:
                    id_, token, lemma, pos = match.groups()
                    data.append({
                        "ID": id_.strip(),
                        "Token": token.strip(),
                        "Lemma": lemma.strip(),
                        "POS": pos.strip()
                    })
    return pd.DataFrame(data)


# Reformat and parse
df_lemmas["reformatted_lemma_response"] = df_lemmas["lemma_response"].apply(
    lambda x: reformat_lemma_response(x) if isinstance(x, str) else x
)
df_parsed = parse_reformatted_response(df_lemmas)

# Save structured output
df_parsed.to_csv(PATHS["parsed"], sep="\t", index=False, quoting=0)
print(f"Parsed responses saved to {PATHS['parsed']} (shape: {df_parsed.shape})")

Loaded data/07_api_responses_with_lemmas.csv with 1 entries.
Parsed responses saved to data/08_api_parsed_responses.csv (shape: (14, 4))


## Step 7: Align Parsed Tokens Back to AOIs

This step aligns the parsed token–lemma–POS table from the API
with the AOI entries stored in the simplified AOI dictionary.

**Inputs**
- `data/08_api_parsed_responses.csv` — structured token, lemma, and POS output
- `data/05_aoi_dictionary_simplified.json` — simplified AOI dictionary

**Process**
Tokens are sequentially mapped to AOIs in the order they appear in the dictionary.
This allows reconstruction of the full AOI-level text with linguistic annotations.

**Output**
- `data/09_api_materials_parsed.csv` — token–lemma–POS entries aligned to AOI annotations

**Note:**
Each AOI remains token-level at this stage.
Multi-token AOIs (e.g., `d’orthopédie`) are intentionally kept as multiple rows
sharing the same `aoi_id`.
The correction and collapsing steps (Steps 7.5 and 7.6) later combine these
entries into merged AOIs for the final annotated dataset.
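
The sequential mapping can be illustrated with a compact sketch that flattens the dictionary into one word sequence and advances only when the token ID changes. This is a simplified restatement of the idea under toy data, not the notebook's function; the name `align` and the example dictionary are assumptions.

```python
def align(parsed_ids, aoi_dict):
    """Map each parsed token ID to (base_key, word_index, aoi_text).
    The position in the flattened AOI sequence advances only when the ID changes."""
    flat = [
        (base, word_index, text)
        for base, entry in aoi_dict.items()
        for word_index, text in entry["ids"]
    ]
    aligned, position, previous = [], -1, None
    for token_id in parsed_ids:
        if token_id != previous:
            position += 1
        # Tokens beyond the end of the dictionary get no AOI
        aligned.append(flat[position] if position < len(flat) else (None, None, None))
        previous = token_id
    return aligned

aoi_dict = {
    "a-": {"ids": [[1, "Cas"], [2, "clinique."]], "concat": "Cas clinique."},
    "b-": {"ids": [[1, "d’orthopédie"]], "concat": "d’orthopédie"},
}
# Token IDs for the token sequence: Cas | clinique . | d ' orthopédie
result = align([0, 1, 1, 2, 2, 2], aoi_dict)
for row in result:
    print(row)
```

The approach relies on the API returning IDs in order and without gaps: a single skipped or duplicated ID shifts every later token to the wrong AOI.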


In [8]:
# Load parsed token–lemma–POS data
df_parsed = pd.read_csv(
    PATHS["parsed"],
    sep="\t",
    quoting=0,
    low_memory=False
)
print(f"Loaded parsed responses with {df_parsed.shape[0]} tokens.")

# Load simplified AOI dictionary
with open(PATHS["aoi_dict_simplified"], "r", encoding="utf-8") as f:
    aoi_dict = json.load(f)
print(f"Loaded AOI dictionary with {len(aoi_dict)} entries.")


def align_tokens_to_aois(df, aoi_dict):
    """
    Aligns parsed tokens to AOIs based on ID transitions in the DataFrame.
    The AOI dictionary is traversed sequentially, but repeated token IDs
    reuse the same AOI entry until the ID changes (mimicking the old logic).
    """

    aligned_rows = []
    dict_keys = list(aoi_dict.keys())
    key_idx = 0
    id_idx = 0
    prev_df_id = None

    for _, row in df.iterrows():
        token_id = row["ID"]
        token = row["Token"]
        lemma = row["Lemma"]
        pos = row["POS"]

        # On new ID → move to next AOI token in dictionary
        if prev_df_id is None or token_id != prev_df_id:
            # Default when the dictionary is exhausted (avoids reusing stale values)
            aoi_id, aoi_word = None, None
            # Skip AOI entries whose words are used up (or that are empty)
            while key_idx < len(dict_keys) and id_idx >= len(aoi_dict[dict_keys[key_idx]]["ids"]):
                key_idx += 1
                id_idx = 0
            if key_idx < len(dict_keys):
                aoi_id, aoi_word = aoi_dict[dict_keys[key_idx]]["ids"][id_idx]
                id_idx += 1
        else:
            # same ID → reuse previous AOI
            aoi_id, aoi_word = aligned_rows[-1]["aoi_id"], aligned_rows[-1]["aoi_text"]

        # Add aligned row
        aligned_rows.append({
            "aoi_base": dict_keys[key_idx] if key_idx < len(dict_keys) else None,
            "aoi_id": aoi_id,
            "aoi_text": aoi_word,
            "token_id": token_id,
            "token": token,
            "lemma": lemma,
            "pos": pos
        })

        prev_df_id = token_id

    return pd.DataFrame(aligned_rows)


# Run alignment
df_aligned = align_tokens_to_aois(df_parsed, aoi_dict)
print(f"Aligned {df_aligned.shape[0]} tokens to AOI entries.")

# Save aligned data
df_aligned.to_csv(
    PATHS["materials_parsed"],
    sep="\t",
    index=False,
    quoting=csv.QUOTE_ALL  # safe for special characters
)
print(f"Aligned AOI–token data saved to {PATHS['materials_parsed']} (shape: {df_aligned.shape})")

Loaded parsed responses with 14 tokens.
Loaded AOI dictionary with 5 entries.
Aligned 14 tokens to AOI entries.
Aligned AOI–token data saved to data/09_api_materials_parsed.csv (shape: (14, 7))


### Step 7.5: Manual and Supervisor Corrections

After the automatic API-based parsing, the file `09_api_materials_parsed.csv`
contained a machine-generated mapping of AOIs to their corresponding tokens,
lemmas, and POS tags. While structurally aligned, this dataset occasionally
contained multiple tokens per AOI or inconsistent lemmatizations.

#### Manual Correction Phase
A manually reviewed version (`corrected_parsed_materials.csv`) was created by
fixing parsing errors and ensuring that each `id.global.aoi` corresponded to
a linguistically valid annotation. This guaranteed that every AOI had a clean
token–lemma–POS mapping.

#### Supervisor-Corrected Integration
The project supervisor then integrated this manually corrected data with
`04_spacy_annotations.csv`, which contained additional linguistic metrics such as
word frequencies and syntactic dependency counts (`left_dependents`,
`right_dependents`). This integration produced the consolidated file
`ann_materials_with-lemmas_2025-04-22.csv`, combining the following sources:

- from **09_api_materials_parsed.csv**: core lexical annotations
  (`aoi_text`, `token`, `lemma`, `pos`);
- from **04_spacy_annotations.csv**: structural, frequency, and dependency-based
  features (`word_count_*`, `word_text_frequency`, `left_dependents`,
  `right_dependents`, etc.).

This supervisor-corrected dataset serves as the **final authoritative annotation
resource**, aligning token-level linguistic information with higher-level textual
and syntactic features. It is the reference whose structure the automatic collapsing
and enrichment process in Step 7.6 reproduces.

### Step 7.6: Automatic Collapsing and Enrichment of Parsed Materials

This step reproduces the structure of the supervisor-corrected dataset
(`ann_materials_with-lemmas_2025-04-22.csv`) by automatically combining
the API-based annotations (`09_api_materials_parsed.csv`) with the
linguistic and structural information from the spaCy-based file
(`04_spacy_annotations.csv`).

**Input**
- `09_api_materials_parsed.csv`
  (automatically parsed AOI-token–lemma–POS mappings)
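- `04_spacy_annotations.csv`
  (linguistic and structural features such as frequency counts and dependency metrics);
  note that the code cell below reads the materials table from `data/05_indexed_lemmas.csv` (`PATHS["lemmas"]`)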

**Process**
The function performs a deterministic AOI-level merge and collapse:
- uses `id.global.aoi` as the unique key,
- joins the lexical information from the API output (token, lemma, POS)
  with the syntactic and frequency information from the spaCy annotations,
- ensures that each AOI corresponds to exactly one row with unified
  linguistic and structural descriptors,
- reconstructs filename and indexing information to mirror the structure
  of the supervisor-corrected dataset.

This procedure yields a single enriched dataset that can stand in for the
unpublished supervisor-corrected file, maintaining the same column schema
and ready for downstream merging with the eye-tracking metrics.
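
The collapse part of this procedure can be sketched in plain Python as follows. The code cell uses pandas `groupby` for the same purpose; the rows, AOI IDs, and the helper name `concat_unique` here are made up for illustration.

```python
def concat_unique(values):
    """Join unique, non-empty values with ' | ', keeping first-occurrence order."""
    cleaned = [str(v).strip() for v in values if v is not None and str(v).strip()]
    return " | ".join(dict.fromkeys(cleaned))

# Hypothetical token-level rows (two tokens share the AOI "a-1")
rows = [
    {"id.global.aoi": "a-1", "token": "clinique", "lemma": "clinique", "pos": "ADJ"},
    {"id.global.aoi": "a-1", "token": ".", "lemma": ".", "pos": "PUNCT"},
    {"id.global.aoi": "a-2", "token": "patient", "lemma": "patient", "pos": "NOUN"},
]

# Group rows by AOI, then collapse each column to a single string
grouped = {}
for row in rows:
    grouped.setdefault(row["id.global.aoi"], []).append(row)

collapsed = {
    aoi: {col: concat_unique(r[col] for r in group) for col in ("token", "lemma", "pos")}
    for aoi, group in grouped.items()
}
for aoi, fields in collapsed.items():
    print(aoi, fields)
```

One consequence of de-duplicating within each column is that the collapsed fields are no longer guaranteed to line up position by position: two tokens with the same POS tag in one AOI yield a single tag next to two tokens.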

**Output**
- `data/10_materials_parsed_collapsed.csv`, as defined in `PATHS["final"]`
  (automatically enriched and collapsed AOI-level dataset)


In [4]:
# ---------------------------------------------------------------------
# 1. Load both datasets
# ---------------------------------------------------------------------
df09 = pd.read_csv(PATHS["materials_parsed"], sep="\t", low_memory=False)
df05 = pd.read_csv(PATHS["lemmas"], sep="\t", low_memory=False)

print(f"Loaded 09 data: {df09.shape}, 05 data: {df05.shape}")

# ---------------------------------------------------------------------
# 2. Rebuild and normalize AOI identifiers in 09 data
# ---------------------------------------------------------------------
df09["aoi_base"] = (
    df09["aoi_base"]
    .astype(str)
    .str.replace(r"[-_]+$", "", regex=True)
    .str.strip()
)

# Canonical AOI ID
df09["id.global.aoi"] = df09["aoi_base"].astype(str) + "-" + df09["aoi_id"].astype(str)

# ---------------------------------------------------------------------
# 3. Collapse multiple tokens per AOI (one row per unique AOI)
# ---------------------------------------------------------------------
def concat_nonempty(values):
    """Join unique non-empty strings with ' | ', preserving first-occurrence order."""
    vals = [str(v).strip() for v in values if pd.notna(v) and str(v).strip()]
    vals_unique = list(dict.fromkeys(vals))  # order-preserving de-duplication
    return " | ".join(vals_unique) if vals_unique else ""

collapse_cols = ["token", "token_id", "lemma", "pos", "aoi_text"]

df09_collapsed = (
    df09.groupby(["aoi_base", "aoi_id", "id.global.aoi"], as_index=False)
        .agg({col: concat_nonempty for col in collapse_cols})
)

print(f"Collapsed 09 data: {df09_collapsed.shape}")

# ---------------------------------------------------------------------
# 4. Merge collapsed linguistic info into the 05 (lemmas) dataset
# ---------------------------------------------------------------------
df_merged = pd.merge(
    df05,
    df09_collapsed,
    on="id.global.aoi",
    how="left",
    suffixes=("", "_from09")
)

# ---------------------------------------------------------------------
# 5. Add unified linguistic fields (merged)
# ---------------------------------------------------------------------
# AOI text (prefer collapsed 09 version if present)
aoi_text_col = "aoi_text_from09" if "aoi_text_from09" in df_merged.columns else "aoi_text"

if "AOI_ann" in df_merged.columns:
    if aoi_text_col in df_merged.columns:
        df_merged["AOI_ann"] = df_merged["AOI_ann"].fillna(df_merged[aoi_text_col])
    else:
        df_merged["AOI_ann"] = df_merged["AOI_ann"].fillna("")
else:
    df_merged["AOI_ann"] = df_merged[aoi_text_col] if aoi_text_col in df_merged.columns else ""


# Lemma/POS merged from collapsed 09 (fallback: original 05 columns).
# After the left merge, unmatched AOIs hold NaN rather than "", so both count as missing.
df_merged["lemma_merged"] = df_merged.get("lemma_from09", "")
if "lemma" in df_merged.columns:
    has_lemma = df_merged["lemma_merged"].notna() & df_merged["lemma_merged"].ne("")
    df_merged["lemma_merged"] = df_merged["lemma_merged"].where(has_lemma, df_merged["lemma"])

df_merged["pos_merged"] = df_merged.get("pos_from09", "")
if "pos" in df_merged.columns:
    has_pos = df_merged["pos_merged"].notna() & df_merged["pos_merged"].ne("")
    df_merged["pos_merged"] = df_merged["pos_merged"].where(has_pos, df_merged["pos"])

# token and token_id merged
if "token_from09" in df_merged.columns:
    df_merged["token_merged"] = df_merged["token_from09"]
else:
    df_merged["token_merged"] = df_merged.get("token", "")

if "token_id_from09" in df_merged.columns:
    df_merged["token_id_merged"] = df_merged["token_id_from09"]
else:
    df_merged["token_id_merged"] = df_merged.get("token_id", "")

# ---------------------------------------------------------------------
# 6. Drop helper columns (post-merge)
# ---------------------------------------------------------------------
df_merged.drop(columns=[c for c in df_merged.columns if c.endswith("_from09")], inplace=True, errors="ignore")

# ---------------------------------------------------------------------
# 7. Reorder and complete columns to match supervisor-style schema
# ---------------------------------------------------------------------
expected_cols = [
    "Media", "textid", "text.type", "text_version", "screenid",
    "Sentence_index", "Word_index", "word_id_screen", "word_id_text",
    "AOI_ann", "AOI_length", "is.in.bracket", "id.phrase.in.brackets",
    "AOI.that.in.fact.should.be", "tag", "tag.type", "tag.id",
    "annotated_text", "flag", "id.ann", "id.ann.global",
    "Sentence_id_current", "Sentence_id_match", "id.piece",
    "lemma_merged", "pos_merged", "token_merged", "token_id_merged",
    "is.hyphenated", "is.compound.hyphen",
    "word_count_text", "word_count_page", "total_word_count_text",
    "left_dependents", "right_dependents", "id.global.aoi",
    "AOI_unif_quot", "filename.ann", "id.line"
]

for col in expected_cols:
    if col not in df_merged.columns:
        df_merged[col] = ""

df_final = df_merged[expected_cols]

# ---------------------------------------------------------------------
# 8. Save final collapsed + enriched dataset
# ---------------------------------------------------------------------
Path(PATHS["final"]).parent.mkdir(parents=True, exist_ok=True)
df_final.to_csv(PATHS["final"], sep="\t", index=False, quoting=0)

print(f"\nEnriched materials dataset saved to {PATHS['final']} (shape: {df_final.shape})")
print("\nSample with linguistic features:")
print(df_final[[
    "id.global.aoi", "token_merged", "token_id_merged", "lemma_merged", "pos_merged"
]].head(10))

Loaded 09 data: (14, 7), 05 data: (10, 38)
Collapsed 09 data: (10, 8)

Enriched materials dataset saved to data/materials_parsed_collapsed.csv (shape: (10, 39))

Sample with linguistic features:
                  id.global.aoi        token_merged token_id_merged  \
0   clinical_001_original_1-1-1                 Cas               0   
1   clinical_001_original_1-1-2        clinique | .               1   
2   clinical_001_original_1-2-1                  Un               2   
3   clinical_001_original_1-2-2             patient               3   
4   clinical_001_original_2-2-3                 âgé               4   
5  medical_002_simplified_1-1-1                  Le               5   
6  medical_002_simplified_1-1-2             patient               6   
7  medical_002_simplified_1-1-3             souffre               7   
8  medical_002_simplified_1-2-1  d | ' | orthopédie               8   
9  medical_002_simplified_1-2-2        fracture | .               9   

         lemma_merged  

<div style="
  background-color:#2b3a42;
  color:#f1f1f1;
  border-left: 6px solid #5da3a3;
  padding: 14px;
  border-radius: 5px;
  line-height: 1.6;
">

<h3 style="color:#ffffff;">Step 8: Return to <code>data_prep.ipynb</code></h3>

The API annotation and automatic collapsing steps are now complete.
You have generated the reproducible, frequency-enriched annotation dataset:

<strong><code>materials_parsed_collapsed.csv</code></strong>

To continue the preprocessing workflow, switch back to the main data preparation notebook
(<code>data_prep.ipynb</code>) and proceed with:

<strong>Step&nbsp;7: Merge Enriched Materials with Eye-Tracking Features</strong>

This next step combines the API-enriched linguistic data with the experimental
eye-tracking dataset (<code>00_input_eye_tracking_data_with_metrics_dummy.csv</code>)
to form a unified corpus for subsequent aggregation and model training.

</div>