# Data Preparation Pipeline

This notebook documents the full preprocessing pipeline from externally provided experimental data
to the final machine-learning–ready feature sets (`ml_features_*.csv`).

**Note on data availability**
All data files used in this pipeline contain either (i) the experimental stimulus texts or (ii) eye-tracking data collected during the reading experiment.
These materials were collected and are currently being prepared for publication by Oksana Ivchenko.
Because they are not yet publicly released, I cannot redistribute them here.

For transparency, this notebook documents all preprocessing steps.
Synthetic placeholder files are provided in the repository so that the pipeline can still be executed to illustrate the workflow end-to-end.

---

## Inputs (external, not created by this repository)

- **`ETdata_oksana_for_constantin_2025-04-22-old.csv`**
  AOI-level eye-tracking feature export from the experiment (provided by Oksana Ivchenko).
  Contains participant and session metadata, text identifiers, and aggregated eye-tracking measures per Area of Interest (AOI).
  This file represents processed AOI features, not raw gaze streams.

- **`eye_tracking_data.csv`**
  Annotated stimulus texts and AOI metadata (provided by supervisor).
  Contains word and sentence indices, AOI surface forms, and annotation tags.
  Does not include participant information or eye-tracking features.

- **`ann_materials_with-lemmas_2025-04-22.csv`**
  Supervisor-corrected materials annotation.
  Derived from an earlier corrected version of the automatically parsed materials.

---

## Outputs (generated by this pipeline)

The following intermediate and final datasets are created from the external inputs.
Apart from the manually corrected files, they can be regenerated by running the steps in this notebook.

### Stimulus Texts Branch
- `et_data_indexed_freq.csv`
  Text materials with word and lemma frequency counts.
- `et_data_indexed_freq_deps_pos_lemmas.csv` / `et_data_indexed_freq_deps_pos_lemmas_v2.csv`
  Texts enriched with POS tags, lemmas, and dependency parses using spaCy.
- `et_data_lemmas_indexed.csv`
  Materials with unique lemma IDs and counts, used as the basis for AOI dictionaries.
- `aoi_dict.json`
  AOIs grouped by base key, storing concatenated text and AOI–ID tuples.
- `slim_dict.json`
  Simplified AOI dictionary used for API prompting.

### API Annotation Branch
- `responses.csv`
  Raw API responses from the tokenization step.
- `responses_with_lemmas.csv`
  API responses from the lemmatization and POS tagging step.
- `parsed_responses.csv`
  Structured token/lemma/POS table parsed from API outputs.
- `materials_parsed.csv` (=`parsed_materials.csv`)
  Materials aligned with AOIs after API annotation.
- `corrected_parsed_materials.csv`
  Manually corrected version of `materials_parsed.csv`.
- `ann_materials_with-lemmas_2025-04-22.csv`
  Supervisor-corrected materials.
- `ann_materials_with_lemmas_and_frequencies_2025_06_17.csv`
  Supervisor-corrected materials enriched with lemma frequency counts.

### Merged Dataset
- `et_data_merged_with_ann_materials_25_06_17.csv`
  Eye-tracking features merged with enriched annotated materials via `id.global.aoi`.
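
A merge on `id.global.aoi` of this kind could look like the following. This is a minimal sketch under assumptions, not the notebook's actual merging code: the toy frames, the `lemma` column, and the choice of a left join are illustrative only.

```python
import pandas as pd

# Toy eye-tracking table: several participants can share one AOI (illustrative values)
et = pd.DataFrame({
    "id.global.aoi": [
        "clinical_001_original_1-1-1",
        "clinical_001_original_1-1-1",
        "clinical_001_original_1-1-2",
    ],
    "Participant": ["Participant10", "Participant11", "Participant10"],
    "Total_duration_of_fixations": [310, 250, 420],
})

# Toy materials table: exactly one row per AOI (the `lemma` column is assumed)
materials = pd.DataFrame({
    "id.global.aoi": ["clinical_001_original_1-1-1", "clinical_001_original_1-1-2"],
    "lemma": ["cas", "clinique"],
})

# validate="many_to_one" raises if an AOI ID is duplicated in `materials`;
# indicator=True adds a `_merge` column that flags unmatched rows
merged = et.merge(
    materials,
    on="id.global.aoi",
    how="left",
    validate="many_to_one",
    indicator=True,
)
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))  # 3 0
```

Checking `validate` and `_merge` is one way to notice duplicated or missing AOI identifiers before they silently change the row count.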

### Aggregated Features for Machine Learning
- `ml_features_all_*.csv`
  Aggregated features across all AOIs.
- `ml_features_medical_*.csv`
  Aggregated features restricted to medical AOIs.
- `ml_features_non_medical_*.csv`
  Aggregated features restricted to non-medical AOIs.
- `ml_features_content_*.csv`
  Aggregated features restricted to content-word AOIs.


# Setup and Imports

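This section loads the core Python libraries required for data preparation:

- **pandas** and **numpy**: data manipulation, tabular processing, and random number generation
- **pathlib**: convenient and platform-independent file paths
- **re** and **json**: pattern matching on AOI identifiers and reading/writing the AOI dictionaries
- **random**: randomized placeholder values in the synthetic data
- **spacy**: linguistic annotation (POS tags, lemmas, dependency parses)

As the pipeline develops, additional imports (e.g. for machine learning or visualization) will be added here.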


In [1]:
import pandas as pd
import numpy as np
import random
from pathlib import Path
import re
import json
import spacy


## Synthetic Example Data for Reproducibility

Because the original experimental eye-tracking data and annotated materials
cannot be publicly distributed, this notebook generates a pair of **synthetic datasets**
that reproduce the structure, schema, and processing logic of the real inputs.
These files allow the entire preprocessing pipeline to be executed end-to-end
without access to restricted data.

### Files Produced

**1. `00_input_eye_tracking_data_dummy.csv`**
Simulates the structure of the original `eye_tracking_data.csv`.
Each row represents an annotated *Area of Interest* (AOI), corresponding to a word or short phrase
within clinical or medical reading materials.
The dataset covers both *original* and *simplified* text versions.

**2. `00_input_eye_tracking_data_with_metrics_dummy.csv`**
Extends the AOI dataset by adding **synthetic participant-level eye-tracking metrics**,
mimicking the original file `ETdata_oksana_for_constantin_2025-04-22-old.csv`.
Each AOI entry is replicated across multiple participants, and artificial
fixation, visit, and saccade measures are drawn at random from fixed ranges
intended to resemble the value ranges of the real experimental data.
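
The replication of every AOI across participants amounts to a cross join. The sketch below illustrates the idea on a two-AOI, three-participant toy example; it assumes pandas 1.2 or later (for `how="cross"`) and is not the loop-based implementation used in the cell further down.

```python
import pandas as pd

# Two AOIs taken from the synthetic materials
aoi = pd.DataFrame({
    "id.global.aoi": ["clinical_001_original_1-1-1", "clinical_001_original_1-1-2"],
    "AOI_ann": ["Cas", "clinique."],
})

# Three simulated readers with a subset of their metadata
participants = pd.DataFrame({
    "Participant": ["Participant10", "Participant11", "Participant12"],
    "is.expert": ["non-expert", "expert", "non-expert"],
})

# Cross join: every AOI row is paired with every participant row
replicated = aoi.merge(participants, how="cross")

# Participant-specific AOI identifier, built as in the synthetic metrics file
replicated["id.global.aoi.participant"] = (
    replicated["Participant"] + "-" + replicated["id.global.aoi"]
)
print(replicated.shape)  # (6, 5)
```

With the 10 AOIs and 3 participants of the synthetic data, the same operation yields the 30 rows reported in the cell output.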

### Design Characteristics
- Preserves the **column names and data types** of the original files.
- Includes French AOI content with:
  - contractions and apostrophes (*d’orthopédie*, *l’hôpital*),
  - diacritics (*âgé*, *orthopédie*),
  - punctuation within AOIs (*clinique.*, *fracture.*).
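- Simulates two text types and both versions: a **clinical** text in its *original* version and a **medical** text in its *simplified* version.
- Reuses a word (*patient*) across the two texts to exercise the frequency-counting logic; no word repeats within a single text, so all per-text counts equal 1.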
- Adds **participant metadata** (age, expertise, study background) for multiple simulated readers.
- Generates randomized but realistic numeric ranges for fixation and visit metrics.
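
The randomized metrics differ between runs unless the random number generator is seeded. One possible way to make them repeatable is to pass a seeded NumPy `Generator` into the sampling code. The helper `draw_fixation_metrics` and the seed value 42 below are hypothetical and only illustrate the idea; the value ranges are copied from the generation cell.

```python
import numpy as np


def draw_fixation_metrics(rng: np.random.Generator) -> dict:
    """Draw one set of synthetic fixation metrics from a given generator."""
    total = int(rng.integers(100, 800))              # upper bound is exclusive
    average = round(float(rng.uniform(100, total)), 2)
    return {
        "Total_duration_of_fixations": total,
        "Average_duration_of_fixations": average,
        "Number_of_fixations": int(rng.integers(1, 4)),
    }


# Two generators created with the same seed produce identical draws
first = draw_fixation_metrics(np.random.default_rng(42))
second = draw_fixation_metrics(np.random.default_rng(42))
print(first == second)  # True
```

Passing the generator explicitly, instead of relying on global state, keeps the synthetic files identical across runs and machines as long as the seed and NumPy version stay the same.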

### Purpose
These synthetic datasets serve as **reproducible stand-ins** for confidential experimental data.
They make it possible to:
- test and validate preprocessing functions,
- run all feature-extraction and merging steps, and
- publish the complete pipeline transparently while keeping participant data private.

The real experimental data remains unpublished and cannot be shared
until released by its original collector.


In [2]:
import pandas as pd
import numpy as np
import random
from pathlib import Path


def create_dummy_eye_tracking_data_french_realistic(output_dir="raw"):
    """
    Create a minimal synthetic dataset mimicking the structure of
    `eye_tracking_data.csv` used in the experiment.

    The dataset simulates French clinical and medical reading materials,
    including both "original" and "simplified" text versions. Each entry
    represents an annotated Area of Interest (AOI), corresponding to
    individual words or short word groups.

    Parameters
    ----------
    output_dir : str or Path, optional
        Directory where the synthetic dataset will be saved (default: 'raw/').

    Returns
    -------
    pd.DataFrame
        Synthetic AOI-level dataset preserving the schema of the original
        experimental input file.
    """
    # Ensure the output directory exists
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # Representative AOI data drawn from simulated original and simplified texts
    data = {
        "Media": [
            "clinical_001_original_1",
            "clinical_001_original_1",
            "clinical_001_original_1",
            "clinical_001_original_1",
            "clinical_001_original_2",
            "medical_002_simplified_1",
            "medical_002_simplified_1",
            "medical_002_simplified_1",
            "medical_002_simplified_1",
            "medical_002_simplified_1"
        ],
        "textid": [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        "text.type": ["clinical"] * 5 + ["medical"] * 5,
        "text_version": ["original"] * 5 + ["simplified"] * 5,
        "screenid": [1, 1, 1, 1, 2, 1, 1, 1, 1, 1],
        "Sentence_index": [1, 1, 2, 2, 2, 1, 1, 1, 2, 2],
        "Word_index": [1, 2, 1, 2, 3, 1, 2, 3, 1, 2],
        "word_id_screen": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
        "word_id_text": [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
        "AOI_ann": [
            "Cas", "clinique.", "Un", "patient", "âgé",
            "Le", "patient", "souffre", "d’orthopédie", "fracture."
        ],
        "AOI_nopunct": [
            "Cas", "clinique", "Un", "patient", "âgé",
            "Le", "patient", "souffre", "d’orthopédie", "fracture"
        ],
        "AOI_length": [3, 8, 2, 7, 3, 2, 7, 7, 11, 8],
        "is.in.bracket": [False] * 10,
        "id.phrase.in.brackets": [None] * 10,
        "AOI.that.in.fact.should.be": [None] * 10,
        # Medical-tagging placeholders left empty for later use
        "tag": [None] * 10,
        "tag.type": [1, None, None, None, None, 2, None, 5, 6, 7],
        "tag.id": [None] * 10,
        "annotated_text": [
            "NA", "NA", "NA", "NA", "NA",
            "NA", "NA", "NA", "orthopédie", "fracture"
        ],
        "Sentence_id_current": [1, 1, 2, 2, 2, 1, 1, 1, 2, 2],
        "Sentence_id_match": [1, 1, 2, 2, 2, 1, 1, 1, 2, 2],
        "id.piece": list(range(10)),
        "filename.ann": [
            "001_tagged_finished_words_original.csv"
        ] * 5 + [
            "002_tagged_finished_words_simplified.csv"
        ] * 5,
        "id.global.aoi": [
            "clinical_001_original_1-1-1",
            "clinical_001_original_1-1-2",
            "clinical_001_original_1-2-1",
            "clinical_001_original_1-2-2",
            "clinical_001_original_2-2-3",
            "medical_002_simplified_1-1-1",
            "medical_002_simplified_1-1-2",
            "medical_002_simplified_1-1-3",
            "medical_002_simplified_1-2-1",
            "medical_002_simplified_1-2-2"
        ]
    }

    # Construct the DataFrame
    df_dummy = pd.DataFrame(data)

    # Save the dataset to file
    output_file = output_path / "00_input_eye_tracking_data_dummy.csv"
    df_dummy.to_csv(output_file, sep="\t", index=False)

    print(f"Dummy AOI dataset created: {df_dummy.shape}")
    return df_dummy


def create_dummy_eye_tracking_data_from_aoi(
    input_file="raw/00_input_eye_tracking_data_dummy.csv",
    output_dir="raw"
):
    """
    Generate synthetic eye-tracking metrics aligned to a dummy AOI dataset.

    This function extends the AOI-level dummy data by assigning randomised
    fixation and saccade metrics, simulating the structure of the original
    experimental eye-tracking data. Each AOI entry is replicated for several
    participants, with metadata describing demographic and study background.

    Parameters
    ----------
    input_file : str or Path
        Path to the dummy AOI dataset.
    output_dir : str or Path
        Directory where the synthetic eye-tracking dataset will be saved.

    Returns
    -------
    pd.DataFrame
        Synthetic eye-tracking dataset with participant-level variation.
    """
    # Load AOI-level base data
    df_aoi = pd.read_csv(input_file, sep="\t", dtype=str).fillna("")

    # Ensure output directory exists
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # Define participant metadata
    participants_info = [
        {
            "Participant": "Participant10",
            "set1": "1_A",
            "set2": "A",
            "age": 22,
            "sex": "W",
            "Payment": "X",
            "is.expert": "non-expert",
            "educational_background": "Licence",
            "field_of_study": "3ème année de psychologie",
            "current_situation": "étudiante",
            "current_situation_old": "student",
            "Participant_commnts_new": "NA",
        },
        {
            "Participant": "Participant11",
            "set1": "1_B",
            "set2": "B",
            "age": 26,
            "sex": "M",
            "Payment": "X",
            "is.expert": "expert",
            "educational_background": "Master",
            "field_of_study": "5ème année de médecine",
            "current_situation": "interne à l’hôpital",
            "current_situation_old": "medical_resident",
            "Participant_commnts_new": "NA",
        },
        {
            "Participant": "Participant12",
            "set1": "2_A",
            "set2": "A",
            "age": 24,
            "sex": "W",
            "Payment": "Y",
            "is.expert": "non-expert",
            "educational_background": "Licence",
            "field_of_study": "sciences cognitives",
            "current_situation": "étudiante",
            "current_situation_old": "student",
            "Participant_commnts_new": "NA",
        },
    ]

    rows = []

    # Build full participant-level dataset
    for pinfo in participants_info:
        for _, row in df_aoi.iterrows():
            # Copy all original AOI columns
            base_row = row.to_dict()

            # Generate synthetic fixation metrics
            total_fix = np.random.randint(100, 800)
            avg_fix = round(np.random.uniform(100, total_fix), 2)
            min_fix = np.random.randint(50, int(avg_fix))
            max_fix = np.random.randint(int(avg_fix), total_fix + 100)
            num_fix = np.random.randint(1, 4)
            ttf = np.random.randint(8000, 16000)
            dur_first = np.random.randint(50, 400)
            pupil = round(np.random.uniform(2.2, 3.2), 5)
            num_visit = np.random.randint(1, 3)

            # --- Whole-fixation metrics ---
            total_whole_fix = total_fix + np.random.randint(50, 150)
            avg_whole_fix = round(np.random.uniform(80, total_whole_fix / num_fix), 2)
            min_whole_fix = np.random.randint(40, int(avg_whole_fix))
            max_whole_fix = np.random.randint(int(avg_whole_fix), total_whole_fix + 100)
            num_whole_fix = num_fix + np.random.randint(0, 2)
            ttf_whole = ttf + np.random.randint(-200, 200)
            dur_first_whole = np.random.randint(60, 400)
            pupil_whole = round(np.random.uniform(2.1, 3.3), 5)

            # Add participant-level and metric information
            base_row.update({
                "list": pinfo["set1"],
                "Participant": pinfo["Participant"],
                "Participant_unique": f"{pinfo['Participant']}-{pinfo['set1']}",
                "age": pinfo["age"],
                "sex": pinfo["sex"],
                "Payment": pinfo["Payment"],
                "is.expert": pinfo["is.expert"],
                "educational_background": pinfo["educational_background"],
                "field_of_study": pinfo["field_of_study"],
                "current_situation": pinfo["current_situation"],
                "current_situation_old": pinfo["current_situation_old"],
                "Participant_commnts_new": pinfo["Participant_commnts_new"],
                "Recording": f"Recording{random.randint(10,20)}",
                "situation_actuelle": pinfo["current_situation"],
                "Timeline": f"Timeline_{pinfo['set1']}",
                "Media_old": f"{row['Media'].split('_')[0]}_{row['textid']}_{row['text_version']}",
                "id.global.aoi.participant": f"{pinfo['Participant']}-{row['id.global.aoi']}",
                # Fixation metrics
                "Total_duration_of_fixations": total_fix,
                "Average_duration_of_fixations": avg_fix,
                "Minimum_duration_of_fixations": min_fix,
                "Maximum_duration_of_fixations": max_fix,
                "Number_of_fixations": num_fix,
                "Time_to_first_fixation": ttf,
                "Duration_of_first_fixation": dur_first,
                "Average_pupil_diameter": pupil,
                # Whole-fixation metrics
                "Total_duration_of_whole_fixations": total_whole_fix,
                "Average_duration_of_whole_fixations": avg_whole_fix,
                "Minimum_duration_of_whole_fixations": min_whole_fix,
                "Maximum_duration_of_whole_fixations": max_whole_fix,
                "Number_of_whole_fixations": num_whole_fix,
                "Time_to_first_whole_fixation": ttf_whole,
                "Duration_of_first_whole_fixation": dur_first_whole,
                "Average_whole-fixation_pupil_diameter": pupil_whole,
                # Visit metrics
                "Total_duration_of_Visit": total_fix,
                "Average_duration_of_Visit": avg_fix,
                "Minimum_duration_of_Visit": min_fix,
                "Maximum_duration_of_Visit": max_fix,
                "Number_of_Visits": num_visit,
                "Time_to_first_Visit": ttf,
                "Duration_of_first_Visit": dur_first,
                # Saccades and regressions
                "Number_of_saccades_in_AOI": np.random.randint(0, 3),
                "Time_to_entry_saccade": ttf - 40,
                "Time_to_exit_saccade": ttf + 50,
                "Peak_velocity_of_entry_saccade": round(np.random.uniform(180, 280), 2),
                "Peak_velocity_of_exit_saccade": round(np.random.uniform(180, 280), 2),
                "First-pass_first_fixation_duration": dur_first,
                "First-pass_duration": total_fix,
                "Regression-path_duration": 0,
                "Selective_regression-path_duration": 0,
                "First-pass_regression": 0,
                "Re-reading_duration": 0,
                # Eye openness
                "Average_eye_openness": round(np.random.uniform(2.4, 2.8), 3),
                "Average_whole-fixation_eye_openness": round(np.random.uniform(2.3, 2.7), 3),
                # Filename placeholder
                "filename.et": "eye_tracking_metrics_dummy.xlsx",
            })

            rows.append(base_row)

    # Assemble final DataFrame
    df_eye = pd.DataFrame(rows)

    # Export to file
    output_file = output_path / "00_input_eye_tracking_data_with_metrics_dummy.csv"
    df_eye.to_csv(output_file, sep="\t", index=False)
    print(f"Created dummy eye-tracking data aligned with AOIs: {df_eye.shape} rows")
    return df_eye


# Generate both synthetic datasets
dummy_df = create_dummy_eye_tracking_data_french_realistic()
dummy_eye_df = create_dummy_eye_tracking_data_from_aoi("raw/00_input_eye_tracking_data_dummy.csv")

Dummy AOI dataset created: (10, 24)
Created dummy eye-tracking data aligned with AOIs: (30, 78) rows


## Step 1: Build AOI dictionary

**Input**
`00_input_eye_tracking_data.csv` (external, provided by supervisor)
This file contains the annotated stimulus texts at the AOI (word) level,
with identifiers such as `id.global.aoi` and surface forms in `AOI_ann`.

**Process**
AOIs are grouped by their *base key* (sentence-level identifier, derived from `id.global.aoi`).
For each base key, the function collects:
- `ids`: the list of AOI identifiers (`id.global.aoi`)
- `concatenated_aoi`: the corresponding AOI words joined into a sentence-like string

This creates a sentence-level view of the data that is used later for:
- linguistic annotation with spaCy,
- alignment of eye-tracking data with parsed text,
- API-based annotation (slim dictionary).

**Output**
`01_aoi_dictionary.json`
A JSON file where each entry is keyed by a base AOI ID and contains:
- `ids`: list of `id.global.aoi` values for the AOIs in that sentence,
- `concatenated_aoi`: the concatenated AOI text string.
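
The grouping described above can be illustrated on three AOIs from the synthetic data. This is a minimal sketch of the idea rather than the function defined in the next cell; `base_key` is a hypothetical helper that strips the trailing word index.

```python
import re
from collections import defaultdict


def base_key(aoi_id: str) -> str:
    """Strip the trailing word index, keeping the trailing hyphen."""
    return re.sub(r"\d+$", "", aoi_id)


# (id.global.aoi, AOI_ann) pairs from the first clinical screen
rows = [
    ("clinical_001_original_1-1-1", "Cas"),
    ("clinical_001_original_1-1-2", "clinique."),
    ("clinical_001_original_1-2-1", "Un"),
]

# Collect IDs and words per sentence-level base key
grouped = defaultdict(lambda: {"ids": [], "words": []})
for aoi_id, word in rows:
    entry = grouped[base_key(aoi_id)]
    entry["ids"].append(aoi_id)
    entry["words"].append(word)

# Join the words of each group into a sentence-like string
aoi_dict = {
    key: {"ids": val["ids"], "concatenated_aoi": " ".join(val["words"])}
    for key, val in grouped.items()
}
print(aoi_dict["clinical_001_original_1-1-"]["concatenated_aoi"])  # Cas clinique.
```

Each base key ends in a hyphen because only the final digits of the identifier are removed.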


In [3]:
def get_base_key(aoi_id: str) -> str:
    """
    Extract the base key from an AOI ID by stripping trailing digits.
    Example:
        clinical_114_original_1-2-8 → clinical_114_original_1-2-
    """
    match = re.search(r"(.*?)(\d+)$", aoi_id)
    return match.group(1) if match else aoi_id


def build_aoi_dict(input_file: str, output_dir="raw", output_name="01_aoi_dictionary.json"):
    """
    Build an AOI dictionary directly from the eye_tracking_data (or lemma-indexed) file.

    Each entry stores:
        - ids: list of AOI IDs
        - concatenated_aoi: AOI words concatenated into a sentence-like string

    Parameters
    ----------
    input_file : str
        Path to the input TSV file (e.g., 'raw/00_input_eye_tracking_data.csv' or
        'raw/04_indexed_lemmas.csv').
    output_dir : str or Path, optional
        Directory where the output JSON file will be saved. Defaults to 'raw/'.
    output_name : str, optional
        Filename for the AOI dictionary JSON. Defaults to '01_aoi_dictionary.json'.

    Returns
    -------
    dict
        The AOI dictionary with base keys as entries.
    """
    input_path = Path(input_file)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    output_file = output_path / output_name

    df = pd.read_csv(input_path, sep="\t", quoting=0)
    result_dict = {}

    for _, row in df.iterrows():
        aoi_id = row["id.global.aoi"]
        aoi_content = str(row["AOI_ann"])
        base_key = get_base_key(aoi_id)

        if base_key not in result_dict:
            result_dict[base_key] = {"ids": [], "concatenated_aoi": ""}

        result_dict[base_key]["ids"].append(aoi_id)

        if result_dict[base_key]["concatenated_aoi"]:
            result_dict[base_key]["concatenated_aoi"] += " " + aoi_content
        else:
            result_dict[base_key]["concatenated_aoi"] = aoi_content

    with open(output_file, "w", encoding="utf-8") as f:
        json.dump(result_dict, f, ensure_ascii=False, indent=2)

    print(f"AOI dictionary saved to: {output_file}")
    print(f"Number of entries: {len(result_dict)}")

    return result_dict


# Example usage
aoi_dict = build_aoi_dict("raw/00_input_eye_tracking_data_dummy.csv")

AOI dictionary saved to: raw/01_aoi_dictionary.json
Number of entries: 5


## Step 2: Add unique word identifiers and frequency counts

**Input**
`00_input_eye_tracking_data.csv`
Text-level AOI annotation file with columns such as:
`['Media', 'textid', 'text.type', 'text_version', 'screenid',
 'Sentence_index', 'Word_index', 'word_id_screen', 'word_id_text',
 'AOI_ann', 'AOI_nopunct', 'AOI_length', 'is.in.bracket',
 'id.phrase.in.brackets', 'tag', 'tag.type', 'tag.id',
 'annotated_text', 'Sentence_id_current', 'Sentence_id_match',
 'id.piece', 'filename.ann', 'id.global.aoi']`

**Output**
`02_indexed_words.csv`
Same data, enriched with:
- `unique_word_text_id`: unique word IDs per text (case-insensitive, punctuation removed)
- `unique_word_screen_id`: unique word IDs per screen (within each text)
- `word_count_text`: frequency of each word in the text
- `word_count_screen`: frequency of each word on the screen
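
The per-text IDs and counts can also be expressed with pandas group operations. The following is a sketch of an alternative formulation on toy data (the texts `t1` and `t2` and the temporary `_key` column are illustrative), not the loop-based code in the cell below.

```python
import pandas as pd

df = pd.DataFrame({
    "filename.ann": ["t1", "t1", "t1", "t2", "t2"],
    "AOI_nopunct": ["Le", "patient", "le", "Un", "patient"],
})

# Case-insensitive matching key
df["_key"] = df["AOI_nopunct"].str.lower()

# IDs start at 1 and follow the order of first appearance within each text
df["unique_word_text_id"] = df.groupby("filename.ann")["_key"].transform(
    lambda s: pd.factorize(s)[0] + 1
)

# Number of occurrences of each word within its text
df["word_count_text"] = df.groupby(["filename.ann", "_key"])["_key"].transform("count")

df = df.drop(columns="_key")
print(df["unique_word_text_id"].tolist())  # [1, 2, 1, 1, 2]
print(df["word_count_text"].tolist())      # [2, 1, 2, 1, 1]
```

Here "Le" and "le" share ID 1 in text `t1`, while "patient" gets its own ID in each text because IDs are assigned per text.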


In [4]:
def assign_word_ids_by_group(group: pd.DataFrame, id_name: str) -> pd.DataFrame:
    """
    Assign unique IDs to words within a given group (e.g., per text or per screen).

    Words are matched case-insensitively based on 'AOI_nopunct'.
    Each unique lowercase word receives an incrementing integer ID.
    """
    word_to_id = {}
    ids = []
    id_counter = 1

    # Temporary lowercase representation for case-insensitive ID assignment
    group["temp_word"] = group["AOI_nopunct"].str.lower()

    for word in group["temp_word"]:
        if word not in word_to_id:
            word_to_id[word] = id_counter
            id_counter += 1
        ids.append(word_to_id[word])

    group[id_name] = ids
    group.drop(columns=["temp_word"], inplace=True)
    group[id_name] = group[id_name].astype(int)

    return group


def preprocess_materials(input_file: str, output_dir="raw", output_name="02_indexed_words.csv") -> pd.DataFrame:
    """
    Main preprocessing routine for annotated eye-tracking materials.

    Steps
    -----
    1. Assign unique word IDs per text and per screen.
    2. Count word occurrences in each text and on each screen.

    Parameters
    ----------
    input_file : str
        Path to the TSV file containing the AOI-level annotated materials
        (e.g., 'raw/00_input_eye_tracking_data.csv').
    output_dir : str or Path, optional
        Directory where the enriched dataset will be saved.
        Defaults to 'raw/'.
    output_name : str, optional
        Filename for the enriched output.
        Defaults to '02_indexed_words.csv'.

    Returns
    -------
    pd.DataFrame
        The enriched dataframe with word IDs and frequency annotations.
    """
    input_path = Path(input_file)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    output_file = output_path / output_name

    df = pd.read_csv(input_path, sep="\t", quoting=0)

    # Assign unique word IDs per text
    for text_file in df["filename.ann"].unique():
        text_group = df[df["filename.ann"] == text_file].copy()
        group = assign_word_ids_by_group(text_group, "unique_word_text_id")
        df.loc[df["filename.ann"] == text_file, "unique_word_text_id"] = group["unique_word_text_id"]

    # Assign unique word IDs per screen (within each text)
    for text_file in df["filename.ann"].unique():
        screens = df[df["filename.ann"] == text_file]["screenid"].unique()
        for screen in screens:
            screen_group = df[(df["filename.ann"] == text_file) & (df["screenid"] == screen)].copy()
            group = assign_word_ids_by_group(screen_group, "unique_word_screen_id")
            df.loc[(df["filename.ann"] == text_file) & (df["screenid"] == screen),
                   "unique_word_screen_id"] = group["unique_word_screen_id"]

    # Compute word frequency within text and screen scopes
    df["word_count_text"] = df.groupby(["filename.ann", "unique_word_text_id"])["unique_word_text_id"].transform("count")
    df["word_count_screen"] = df.groupby(["filename.ann", "screenid", "unique_word_screen_id"])["unique_word_screen_id"].transform("count")

    # Save enriched dataset
    df.to_csv(output_file, sep="\t", index=False)
    print(f"Preprocessed data saved to: {output_file}")
    print(f"Data shape: {df.shape}")

    return df

In [5]:
df_indexed = preprocess_materials(
    input_file="raw/00_input_eye_tracking_data_dummy.csv",
    output_dir="raw",
    output_name="02_indexed_words.csv"
)
df_indexed

Preprocessed data saved to: raw/02_indexed_words.csv
Data shape: (10, 28)


| | Media | textid | text.type | text_version | screenid | Sentence_index | Word_index | word_id_screen | word_id_text | AOI_ann | ... | annotated_text | Sentence_id_current | Sentence_id_match | id.piece | filename.ann | id.global.aoi | unique_word_text_id | unique_word_screen_id | word_count_text | word_count_screen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | clinical_001_original_1 | 1 | clinical | original | 1 | 1 | 1 | 1 | 1 | Cas | ... | | 1 | 1 | 0 | 001_tagged_finished_words_original.csv | clinical_001_original_1-1-1 | 1.0 | 1.0 | 1 | 1 |
| 1 | clinical_001_original_1 | 1 | clinical | original | 1 | 1 | 2 | 2 | 2 | clinique. | ... | | 1 | 1 | 1 | 001_tagged_finished_words_original.csv | clinical_001_original_1-1-2 | 2.0 | 2.0 | 1 | 1 |
| 2 | clinical_001_original_1 | 1 | clinical | original | 1 | 2 | 1 | 3 | 3 | Un | ... | | 2 | 2 | 2 | 001_tagged_finished_words_original.csv | clinical_001_original_1-2-1 | 3.0 | 3.0 | 1 | 1 |
| 3 | clinical_001_original_1 | 1 | clinical | original | 1 | 2 | 2 | 4 | 4 | patient | ... | | 2 | 2 | 3 | 001_tagged_finished_words_original.csv | clinical_001_original_1-2-2 | 4.0 | 4.0 | 1 | 1 |
| 4 | clinical_001_original_2 | 1 | clinical | original | 2 | 2 | 3 | 5 | 5 | âgé | ... | | 2 | 2 | 4 | 001_tagged_finished_words_original.csv | clinical_001_original_2-2-3 | 5.0 | 1.0 | 1 | 1 |
| 5 | medical_002_simplified_1 | 2 | medical | simplified | 1 | 1 | 1 | 1 | 1 | Le | ... | | 1 | 1 | 5 | 002_tagged_finished_words_simplified.csv | medical_002_simplified_1-1-1 | 1.0 | 1.0 | 1 | 1 |
| 6 | medical_002_simplified_1 | 2 | medical | simplified | 1 | 1 | 2 | 2 | 2 | patient | ... | | 1 | 1 | 6 | 002_tagged_finished_words_simplified.csv | medical_002_simplified_1-1-2 | 2.0 | 2.0 | 1 | 1 |
| 7 | medical_002_simplified_1 | 2 | medical | simplified | 1 | 1 | 3 | 3 | 3 | souffre | ... | | 1 | 1 | 7 | 002_tagged_finished_words_simplified.csv | medical_002_simplified_1-1-3 | 3.0 | 3.0 | 1 | 1 |
| 8 | medical_002_simplified_1 | 2 | medical | simplified | 1 | 2 | 1 | 4 | 4 | d’orthopédie | ... | orthopédie | 2 | 2 | 8 | 002_tagged_finished_words_simplified.csv | medical_002_simplified_1-2-1 | 4.0 | 4.0 | 1 | 1 |
| 9 | medical_002_simplified_1 | 2 | medical | simplified | 1 | 2 | 2 | 5 | 5 | fracture. | ... | fracture | 2 | 2 | 9 | 002_tagged_finished_words_simplified.csv | medical_002_simplified_1-2-2 | 5.0 | 5.0 | 1 | 1 |


## Step 3: Add relative word frequencies

**Input**
`02_indexed_words.csv`

**Output**
`03_word_frequencies.csv`
Enriched with:
- `total_word_count_text`: total number of words in each text, taken as the maximum `word_id_text` within the text
- `word_text_frequency`: relative frequency of each word within the text
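
As a worked example, the relative frequency is the per-text word count divided by the text length. The sketch below uses toy values and `transform("max")` instead of a separate merge; like the function in the next cell, it assumes that `word_id_text` runs from 1 to the number of words in each text, so that its maximum equals the text length.

```python
import pandas as pd

df = pd.DataFrame({
    "filename.ann": ["t1"] * 4 + ["t2"] * 2,
    "word_id_text": [1, 2, 3, 4, 1, 2],
    "word_count_text": [2, 1, 2, 1, 1, 1],
})

# Text length, broadcast to every row of the text
df["total_word_count_text"] = df.groupby("filename.ann")["word_id_text"].transform("max")

# Relative frequency of each word within its text
df["word_text_frequency"] = df["word_count_text"] / df["total_word_count_text"]
print(df["word_text_frequency"].tolist())  # [0.5, 0.25, 0.5, 0.25, 0.5, 0.5]
```

In the synthetic data every word occurs once in a five-word text, which gives the value 0.2 shown in the cell output.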


In [6]:
def compute_word_frequencies(
    input_file="raw/02_indexed_words.csv",
    output_dir="raw",
    output_name="03_word_frequencies.csv"
):
    """
    Compute total word counts and relative frequencies per text.

    Parameters
    ----------
    input_file : str
        Path to the indexed word dataset (from preprocess_materials).
    output_dir : str or Path, optional
        Directory where the output file will be saved. Defaults to 'raw/'.
    output_name : str, optional
        Output filename. Defaults to '03_word_frequencies.csv'.

    Returns
    -------
    pd.DataFrame
        DataFrame with additional columns:
            - total_word_count_text
            - word_text_frequency
    """
    input_path = Path(input_file)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    output_file = output_path / output_name

    df = pd.read_csv(input_path, sep="\t", quoting=0)

    # Compute total word count per text
    word_counts = df.groupby("filename.ann")["word_id_text"].max().reset_index()
    word_counts.columns = ["filename.ann", "total_word_count_text"]

    # Merge back into dataframe
    df = df.merge(word_counts, on="filename.ann")

    # Compute relative frequency
    df["word_text_frequency"] = df["word_count_text"] / df["total_word_count_text"]

    # Save enriched dataset
    df.to_csv(output_file, sep="\t", index=False)
    print(f"Word frequency data saved to: {output_file}")
    print(f"Data shape: {df.shape}")

    return df

In [7]:
df_freq = compute_word_frequencies(
    input_file="raw/02_indexed_words.csv",
    output_dir="raw",
    output_name="03_word_frequencies.csv"
)
df_freq.head()

Word frequency data saved to: raw/03_word_frequencies.csv
Data shape: (10, 30)


| | Media | textid | text.type | text_version | screenid | Sentence_index | Word_index | word_id_screen | word_id_text | AOI_ann | ... | Sentence_id_match | id.piece | filename.ann | id.global.aoi | unique_word_text_id | unique_word_screen_id | word_count_text | word_count_screen | total_word_count_text | word_text_frequency |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | clinical_001_original_1 | 1 | clinical | original | 1 | 1 | 1 | 1 | 1 | Cas | ... | 1 | 0 | 001_tagged_finished_words_original.csv | clinical_001_original_1-1-1 | 1.0 | 1.0 | 1 | 1 | 5 | 0.2 |
| 1 | clinical_001_original_1 | 1 | clinical | original | 1 | 1 | 2 | 2 | 2 | clinique. | ... | 1 | 1 | 001_tagged_finished_words_original.csv | clinical_001_original_1-1-2 | 2.0 | 2.0 | 1 | 1 | 5 | 0.2 |
| 2 | clinical_001_original_1 | 1 | clinical | original | 1 | 2 | 1 | 3 | 3 | Un | ... | 2 | 2 | 001_tagged_finished_words_original.csv | clinical_001_original_1-2-1 | 3.0 | 3.0 | 1 | 1 | 5 | 0.2 |
| 3 | clinical_001_original_1 | 1 | clinical | original | 1 | 2 | 2 | 4 | 4 | patient | ... | 2 | 3 | 001_tagged_finished_words_original.csv | clinical_001_original_1-2-2 | 4.0 | 4.0 | 1 | 1 | 5 | 0.2 |
| 4 | clinical_001_original_2 | 1 | clinical | original | 2 | 2 | 3 | 5 | 5 | âgé | ... | 2 | 4 | 001_tagged_finished_words_original.csv | clinical_001_original_2-2-3 | 5.0 | 1.0 | 1 | 1 | 5 | 0.2 |


## Step 4: Add POS, lemmas, and dependency features

**Input**
- `03_word_frequencies.csv`
- `01_aoi_dictionary.json` (AOIs grouped with concatenated text)

**Output**
`04_spacy_annotations.csv`
Enriched with:
- `pos` (part of speech from spaCy)
- `lemma` (canonical form)
- `left_dependents` / `right_dependents` (syntactic dependents from dependency parse)

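The alignment in this step compares AOI surface forms with tokens after normalization. The sketch below illustrates the idea on two AOIs from the dummy data. The token list is written out by hand, as an assumption about how a French tokenizer might split the text, so no spaCy model is needed to run it; the helper name `normalize` is illustrative.

```python
import re


def normalize(text: str) -> str:
    """Standardize apostrophes, drop other punctuation, collapse whitespace."""
    text = text.replace("’", "'")
    text = re.sub(r"[^\w\s']", "", text)
    return re.sub(r"\s+", " ", text).strip()


# AOI surface forms as shown on screen (punctuation still attached).
aois = ["clinique.", "d’orthopédie"]

# Hand-written token sequence (assumption: the tokenizer splits off the final
# period and separates the elided preposition from the noun).
tokens = ["clinique", ".", "d’", "orthopédie"]

matches = {}
for aoi in aois:
    target = normalize(aoi)
    for i, tok in enumerate(tokens):
        if normalize(tok) == target:
            matches[aoi] = tok  # direct match after normalization
            break
        # Elision: a prefix ending in an apostrophe plus the following token.
        if tok.endswith(("'", "’")) and i + 1 < len(tokens):
            if normalize(tok + tokens[i + 1]) == target:
                matches[aoi] = tokens[i + 1]  # keep the content word
                break

print(matches)
```

In the elision case the content word, not the prefix, supplies the annotation, which is also what the fallback branch of the pipeline does.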

In [8]:
from pathlib import Path
import pandas as pd
import spacy
import re

# Load French language model
try:
    nlp = spacy.load("fr_core_news_lg")
except OSError:
    raise RuntimeError(
        "The spaCy model 'fr_core_news_lg' is not installed. "
        "Install it via: python -m spacy download fr_core_news_lg"
    )


def normalize_text(text: str) -> str:
    """
    Remove punctuation (except apostrophes) and normalize spacing/apostrophes.

    Parameters
    ----------
    text : str
        Input string to normalize.

    Returns
    -------
    str
        Cleaned text with standardized apostrophes and single spacing.
    """
    text = text.replace("’", "'")
    text = re.sub(r"[^\w\s']", "", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text


def is_contraction_prefix(text: str) -> bool:
    """
    Check if token looks like a French contraction (prefix + apostrophe).

    Parameters
    ----------
    text : str
        Token text.

    Returns
    -------
    bool
        True if the token ends with an apostrophe, indicating a contraction.
    """
    return re.match(r"^.+[’']$", text) is not None


def process_french_words(df: pd.DataFrame, sentence_dict: dict) -> pd.DataFrame:
    """
    Align AOIs with spaCy tokens and add POS, lemma, and dependency features.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing AOI-level text with columns:
        'id.global.aoi' and 'AOI_ann'.
    sentence_dict : dict
        Dictionary of AOI sentence groups, e.g. from build_aoi_dict().

    Returns
    -------
    pd.DataFrame
        The input DataFrame with added columns:
        'pos', 'lemma', 'left_dependents', 'right_dependents'.
    """
    # Initialize columns to store annotations
    for col in ["left_dependents", "right_dependents", "pos", "lemma"]:
        if col not in df.columns:
            df[col] = None

    # Iterate through each AOI sentence group
    for base_key, data in sentence_dict.items():
        doc = nlp(data["concatenated_aoi"])
        id_globals = sorted(data["ids"])

        # Collect AOIs for this group
        aoi_matches = []
        for id_global in id_globals:
            matching_rows = df[df["id.global.aoi"] == id_global]
            for idx, row in matching_rows.iterrows():
                aoi_matches.append((row["AOI_ann"], idx))

        matched_tokens = [False] * len(doc)

        # Align each AOI with a spaCy token
        for aoi_word, idx in aoi_matches:
            aoi_norm = normalize_text(aoi_word)
            token_info = None

            # Exact or normalized match
            for i, token in enumerate(doc):
                if not matched_tokens[i] and (token.text == aoi_word or normalize_text(token.text) == aoi_norm):
                    token_info = token
                    matched_tokens[i] = True
                    break

            # Fallback: fuzzy match or contraction handling
            if token_info is None:
                for i, token in enumerate(doc):
                    token_norm = normalize_text(token.text)
                    if (aoi_norm == token_norm or aoi_norm in token_norm or token_norm in aoi_norm) and not matched_tokens[i]:
                        token_info = token
                        matched_tokens[i] = True
                        break
                    if i < len(doc) - 1 and is_contraction_prefix(token.text):
                        contraction = normalize_text(token.text + doc[i + 1].text)
                        if aoi_norm == contraction:
                            token_info = doc[i + 1]
                            matched_tokens[i + 1] = True
                            break

            # Assign extracted linguistic features
            if token_info is not None:
                df.at[idx, "left_dependents"] = len([c for c in token_info.children if c.i < token_info.i])
                df.at[idx, "right_dependents"] = len([c for c in token_info.children if c.i > token_info.i])
                df.at[idx, "pos"] = token_info.pos_
                df.at[idx, "lemma"] = token_info.lemma_

    return df

In [9]:
import json

# Load the required input files
df_freq = pd.read_csv("raw/03_word_frequencies.csv", sep="\t", quoting=0)
with open("raw/01_aoi_dictionary.json", "r", encoding="utf-8") as f:
    aoi_dict = json.load(f)

# Process with spaCy
df_spacy = process_french_words(df_freq, aoi_dict)

# Save enriched dataset
output_path = Path("raw/04_spacy_annotations.csv")
df_spacy.to_csv(output_path, sep="\t", index=False)
print(f"Saved {output_path.name} (shape: {df_spacy.shape})")

df_spacy.head()


Saved 04_spacy_annotations.csv (shape: (10, 34))


Unnamed: 0,Media,textid,text.type,text_version,screenid,Sentence_index,Word_index,word_id_screen,word_id_text,AOI_ann,...,unique_word_text_id,unique_word_screen_id,word_count_text,word_count_screen,total_word_count_text,word_text_frequency,left_dependents,right_dependents,pos,lemma
0,clinical_001_original_1,1,clinical,original,1,1,1,1,1,Cas,...,1.0,1.0,1,1,5,0.2,0,2,NOUN,cas
1,clinical_001_original_1,1,clinical,original,1,1,2,2,2,clinique.,...,2.0,2.0,1,1,5,0.2,0,0,ADJ,clinique
2,clinical_001_original_1,1,clinical,original,1,2,1,3,3,Un,...,3.0,3.0,1,1,5,0.2,0,0,DET,un
3,clinical_001_original_1,1,clinical,original,1,2,2,4,4,patient,...,4.0,4.0,1,1,5,0.2,1,0,NOUN,patient
4,clinical_001_original_2,1,clinical,original,2,2,3,5,5,âgé,...,5.0,1.0,1,1,5,0.2,0,0,ADJ,âgé


## Step 5: Assign lemma IDs and counts

**Input**
- `04_spacy_annotations.csv`

**Process**
- Assign unique lemma IDs per text (`unique_lemma_text_id`)
- Assign unique lemma IDs per screen (`unique_lemma_screen_id`)
- Count lemma occurrences in texts and on screens

**Output**
`05_indexed_lemmas.csv`
Enriched with:
- `unique_lemma_text_id`
- `unique_lemma_screen_id`
- `lemma_count_text`
- `lemma_count_screen`

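The first-occurrence numbering can be illustrated on a toy frame. This sketch uses `pd.factorize` inside a grouped `transform` as a compact alternative to the explicit loop used in this step; it is not the notebook's implementation. The lemma values are invented, and only the column names follow the pipeline.

```python
import pandas as pd

toy = pd.DataFrame({
    "filename.ann": ["t1", "t1", "t1", "t1", "t2", "t2"],
    "lemma": ["le", "patient", "le", "âgé", "patient", "le"],
})


def first_occurrence_ids(values: pd.Series) -> pd.Series:
    """Number distinct values 1, 2, 3, ... in order of first appearance."""
    codes, _ = pd.factorize(values)
    return pd.Series(codes + 1, index=values.index)


# IDs restart at 1 within every text.
toy["unique_lemma_text_id"] = (
    toy.groupby("filename.ann")["lemma"].transform(first_occurrence_ids)
)

# Number of occurrences of each lemma within its text.
toy["lemma_count_text"] = (
    toy.groupby(["filename.ann", "lemma"])["lemma"].transform("count")
)

print(toy)
```

One difference to keep in mind: `pd.factorize` codes missing values as -1, so a missing lemma would receive ID 0 in this sketch.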

In [10]:
from pathlib import Path
import pandas as pd

def assign_ids_by_group(group: pd.DataFrame, value_col: str, id_name: str) -> pd.DataFrame:
    """
    Assign unique IDs to values within a given group (e.g., lemmas per text).

    Parameters
    ----------
    group : pd.DataFrame
        Subset of the main DataFrame for a specific text or screen.
    value_col : str
        Column name containing the values to assign IDs to (e.g., 'lemma').
    id_name : str
        Name of the new column to store the assigned IDs.

    Returns
    -------
    pd.DataFrame
        The same group with a new integer ID column.
    """
    value_to_id = {}
    ids = []
    id_counter = 1

    for val in group[value_col]:
        if val not in value_to_id:
            value_to_id[val] = id_counter
            id_counter += 1
        ids.append(value_to_id[val])

    group[id_name] = ids
    return group


def preprocess_lemmas(input_file: str, output_dir="raw", output_name="05_indexed_lemmas.csv") -> pd.DataFrame:
    """
    Assign lemma IDs per text and screen, and count lemma frequencies.

    Parameters
    ----------
    input_file : str
        Path to the CSV file containing annotated AOI data with lemma information.
    output_dir : str or Path, optional
        Directory where the output will be saved. Defaults to 'raw/'.
    output_name : str, optional
        Output filename. Defaults to '05_indexed_lemmas.csv'.

    Returns
    -------
    pd.DataFrame
        The enriched DataFrame with added lemma ID and frequency columns.
    """
    input_path = Path(input_file)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    output_file = output_path / output_name

    df = pd.read_csv(input_path, sep="\t", quoting=0)

    # Assign unique lemma IDs per text
    for text in df["filename.ann"].unique():
        text_group = df[df["filename.ann"] == text].copy()
        group = assign_ids_by_group(text_group, "lemma", "unique_lemma_text_id")
        df.loc[df["filename.ann"] == text, "unique_lemma_text_id"] = group["unique_lemma_text_id"]

    # Assign unique lemma IDs per screen
    for text in df["filename.ann"].unique():
        for screen in df[df["filename.ann"] == text]["screenid"].unique():
            screen_group = df[(df["filename.ann"] == text) & (df["screenid"] == screen)].copy()
            group = assign_ids_by_group(screen_group, "lemma", "unique_lemma_screen_id")
            df.loc[(df["filename.ann"] == text) & (df["screenid"] == screen),
                   "unique_lemma_screen_id"] = group["unique_lemma_screen_id"]

    # Count lemma frequencies
    df["lemma_count_text"] = df.groupby(["filename.ann", "unique_lemma_text_id"])["unique_lemma_text_id"].transform("count")
    df["lemma_count_screen"] = df.groupby(["filename.ann", "screenid", "unique_lemma_screen_id"])["unique_lemma_screen_id"].transform("count")

    # Save enriched data
    df.to_csv(output_file, sep="\t", index=False)
    print(f"Saved {output_file} with shape {df.shape}")

    return df

In [11]:
df_lemmas = preprocess_lemmas(
    input_file="raw/04_spacy_annotations.csv",
    output_dir="raw",
    output_name="05_indexed_lemmas.csv"
)
df_lemmas.head()

Saved raw/05_indexed_lemmas.csv with shape (10, 38)


Unnamed: 0,Media,textid,text.type,text_version,screenid,Sentence_index,Word_index,word_id_screen,word_id_text,AOI_ann,...,total_word_count_text,word_text_frequency,left_dependents,right_dependents,pos,lemma,unique_lemma_text_id,unique_lemma_screen_id,lemma_count_text,lemma_count_screen
0,clinical_001_original_1,1,clinical,original,1,1,1,1,1,Cas,...,5,0.2,0,2,NOUN,cas,1.0,1.0,1,1
1,clinical_001_original_1,1,clinical,original,1,1,2,2,2,clinique.,...,5,0.2,0,0,ADJ,clinique,2.0,2.0,1,1
2,clinical_001_original_1,1,clinical,original,1,2,1,3,3,Un,...,5,0.2,0,0,DET,un,3.0,3.0,1,1
3,clinical_001_original_1,1,clinical,original,1,2,2,4,4,patient,...,5,0.2,1,0,NOUN,patient,4.0,4.0,1,1
4,clinical_001_original_2,1,clinical,original,2,2,3,5,5,âgé,...,5,0.2,0,0,ADJ,âgé,5.0,1.0,1,1


<div style="
  background-color:#2b3a42;
  color:#f1f1f1;
  border-left: 6px solid #5da3a3;
  padding: 14px;
  border-radius: 5px;
  line-height: 1.6;
">

<h2 style="color:#ffffff;">Step 6: Continue in External Notebook – <code>01_api_annotation_pipeline.ipynb</code></h2>

The lemma-indexed dataset produced above (<code>05_indexed_lemmas.csv</code>)
contains all automatically derived linguistic features from the local spaCy pipeline.

To further enrich the materials with context-aware linguistic annotations
(tokenization, lemmatization, and POS tagging), proceed to the dedicated notebook:

<strong><code>01_api_annotation_pipeline.ipynb</code></strong>

That notebook takes <code>aoi_dict.json</code> as input,
runs the API-based enrichment pipeline, and outputs the collapsed and
frequency-annotated materials file:

<strong><code>materials_parsed_collapsed.csv</code></strong>

Once the API notebook has completed, return here to continue with
<strong>Step 7: Compute Lemma Frequencies per Text and Screen.</strong>

</div>

## Step 7: Compute Lemma Frequencies per Text and Screen

**Input**
- `raw/10_materials_parsed_collapsed.csv`
  AOI-level annotated dataset containing merged tokens, lemmas, and POS tags.

**Process**
This step enriches the AOI-level dataset with lemma- and word-frequency statistics.
Each AOI may contain one or more lemmas (split by `" | "`), which are analyzed to compute:
- **Counts:** `lemma_count_text`, `lemma_count_screen`
- **Frequencies:** `lemma_frequency_text`, `lemma_frequency_screen`
- **Word-level frequencies:** `word_frequency_text`, `word_frequency_screen`

These measures describe how frequent each lemma or word is within its text (`filename.ann`)
and within its corresponding screen (`screenid`).

**Output**
- `raw/11_materials_parsed_enriched.csv`
  Enriched dataset containing all original AOI-level fields plus additional
  frequency statistics for lemmas and words.

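As a worked example of the arithmetic, assume one screen with three AOIs whose merged lemmas are `le`, `patient`, and `de | le` (invented values). Splitting on the separator yields four lemma occurrences, so `le` has a count of 2 and a screen frequency of 2/4. A minimal pure-Python sketch of that calculation:

```python
import re
from collections import Counter

# Invented merged-lemma strings for three AOIs on one screen.
aoi_lemmas = ["le", "patient", "de | le"]

# Split multi-lemma AOIs on the separator pattern used in this step.
split_lemmas = [re.split(r"\s*\|\s*", s) for s in aoi_lemmas]

# Count every individual lemma occurrence on the screen.
counts = Counter(lemma for parts in split_lemmas for lemma in parts)
total = sum(counts.values())

# Per-AOI lists of counts and relative frequencies, one entry per lemma.
per_aoi = []
for aoi, parts in zip(aoi_lemmas, split_lemmas):
    lemma_counts = [counts[p] for p in parts]
    lemma_freqs = [counts[p] / total for p in parts]
    per_aoi.append((aoi, lemma_counts, lemma_freqs))
    print(aoi, lemma_counts, lemma_freqs)
```
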
In [12]:
def add_lemma_frequencies(df: pd.DataFrame) -> pd.DataFrame:
    """
    Add lemma and word frequency columns directly to the provided dataframe.

    Expected columns:
      - filename.ann : text identifier
      - screenid     : screen identifier
      - lemma_merged : string, possibly with multiple lemmas separated by '|'
      - word_count_screen, word_count_text, total_word_count_text : numeric counts
    """
    df = df.copy()

    # Rename for easier internal reference
    df.rename(
        columns={
            "filename.ann": "text",
            "screenid": "screen",
            "lemma_merged": "lemma",
            "total_word_count_text": "total_words_per_text",
        },
        inplace=True,
    )

    # --- Expand multi-lemma AOIs -----------------------------------------
    df_exploded = df.assign(
        lemma_individual=df["lemma"].str.split(r"\s*\|\s*")
    ).explode("lemma_individual")

    # --- Lemma-level counts and frequencies -------------------------------
    df_exploded["lemma_count_text"] = df_exploded.groupby(
        ["text", "lemma_individual"]
    )["lemma_individual"].transform("count")

    df_exploded["lemma_count_screen"] = df_exploded.groupby(
        ["text", "screen", "lemma_individual"]
    )["lemma_individual"].transform("count")

    df_exploded["total_lemma_count_text"] = df_exploded.groupby("text")[
        "lemma_individual"
    ].transform("count")

    df_exploded["total_lemma_count_screen"] = df_exploded.groupby(
        ["text", "screen"]
    )["lemma_individual"].transform("count")

    df_exploded["lemma_frequency_text"] = (
        df_exploded["lemma_count_text"] / df_exploded["total_lemma_count_text"]
    )
    df_exploded["lemma_frequency_screen"] = (
        df_exploded["lemma_count_screen"] / df_exploded["total_lemma_count_screen"]
    )

    # --- Collapse lemma stats back to AOI level ---------------------------
    # explode() repeats the original row label for every lemma, so grouping on
    # the index rebuilds exactly one record per AOI. Grouping on the lemma string
    # instead would pool the lists of identical AOIs that share a screen.
    # Assumes the input index is unique (true for a freshly read CSV).
    lemma_stats = df_exploded.groupby(level=0).agg({
        "lemma_individual": list,
        "lemma_count_screen": list,
        "lemma_count_text": list,
        "total_lemma_count_screen": "first",
        "total_lemma_count_text": "first",
        "lemma_frequency_screen": list,
        "lemma_frequency_text": list,
    })

    # --- Merge stats back into the original dataframe ---------------------
    df_merged = df.merge(lemma_stats, left_index=True, right_index=True, how="left")

    # --- Word-level frequencies ------------------------------------------
    df_merged["word_frequency_text"] = (
        df_merged["word_count_text"] / df_merged["total_words_per_text"]
    )
    df_merged["total_word_count_screen"] = df_merged.groupby(
        ["text", "screen"]
    )["screen"].transform("count")
    df_merged["word_frequency_screen"] = (
        df_merged["word_count_screen"] / df_merged["total_word_count_screen"]
    )

    # Restore original column names
    df_merged.rename(
        columns={
            "text": "filename.ann",
            "screen": "screenid",
            "lemma": "lemma_merged",
            "total_words_per_text": "total_word_count_text",
        },
        inplace=True,
    )

    return df_merged

# ----------------------------------------------------------------------
# Apply to the collapsed materials file and save enriched dataset
# ----------------------------------------------------------------------
df_collapsed = pd.read_csv("raw/10_materials_parsed_collapsed.csv", sep="\t", low_memory=False)
df_enriched = add_lemma_frequencies(df_collapsed)

# Save the enriched dataset
output_path = "raw/11_materials_parsed_enriched.csv"
Path(output_path).parent.mkdir(parents=True, exist_ok=True)
df_enriched.to_csv(output_path, sep="\t", index=False)

print(f"Enriched dataset saved to {output_path} (shape: {df_enriched.shape})")

Enriched dataset saved to raw/11_materials_parsed_enriched.csv (shape: (10, 49))


## Step 8: Merge Eye-Tracking Data with Enriched Linguistic Annotations

**Overview**

This step integrates the linguistic annotation results with the eye-tracking dataset, producing a unified table that aligns fixation-based behavioral measures with lexical and syntactic information.
It corresponds to the stage in the original preprocessing workflow where `ETdata_oksana_for_constantin_2025-04-22-old.csv` was merged with the supervisor-corrected materials.
Here, the same structure is reproduced using the synthetic datasets.

**Input**
- `00_input_eye_tracking_data_with_metrics_dummy.csv`
  Synthetic eye-tracking dataset containing AOI-level fixation and visit metrics for several participants.
- `11_materials_parsed_enriched.csv`
  Collapsed and enriched materials dataset from Step 7, representing AOI-level linguistic annotations (tokens, lemmas, POS) together with the frequency statistics.

**Process**
Both datasets share a common identifier, `id.global.aoi`, which uniquely specifies each Area of Interest (AOI).
The merge attaches the linguistic features of an AOI to every gaze observation recorded on it (one row per participant and AOI):

- merges AOI-level gaze measures (e.g., fixation durations, saccades, regressions)
  with linguistic information (e.g., lemma, POS, token);
- performs a *left join* to ensure that every gaze observation is preserved,
  even when no linguistic annotation is available.

**Output**
- `et_data_merged_with_ann_materials_dummy.csv`
  Unified dataset combining eye-tracking and linguistic features.
  This file serves as the basis for subsequent feature aggregation, frequency analysis,
  and statistical modeling.

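The join described above can be sketched on invented data. The `indicator` option of `pd.merge` is an addition for illustration and is not used in the merge cell of this step; it shows one way to count gaze rows that found no annotation. Participant and AOI values are placeholders.

```python
import pandas as pd

# Several gaze rows may share one AOI (one row per participant).
et = pd.DataFrame({
    "Participant_unique": ["p1", "p2", "p1"],
    "id.global.aoi": ["a-1-1", "a-1-1", "a-1-2"],
    "Number_of_fixations": [2, 1, 3],
})

# Exactly one annotation row per AOI; "a-1-2" is deliberately missing.
materials = pd.DataFrame({
    "id.global.aoi": ["a-1-1"],
    "lemma_merged": ["cas"],
})

merged = pd.merge(
    et,
    materials,
    on="id.global.aoi",
    how="left",       # keep every gaze observation
    validate="m:1",   # raise if an AOI has more than one annotation row
    indicator=True,   # adds a "_merge" column: "both" or "left_only"
)

unmatched = int((merged["_merge"] == "left_only").sum())
print(merged)
print(f"Gaze rows without annotation: {unmatched}")
```
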

In [13]:
# ---------------------------------------------------------------------
# File paths
# ---------------------------------------------------------------------
ET_FILE = "raw/00_input_eye_tracking_data_with_metrics_dummy.csv"
MATERIALS_FILE = "raw/11_materials_parsed_enriched.csv"
OUTPUT_FILE = "raw/et_data_merged_with_ann_materials_dummy.csv"

# ---------------------------------------------------------------------
# Utility: Compare column structures
# ---------------------------------------------------------------------
def compare_columns(df1: pd.DataFrame, df2: pd.DataFrame, name1: str, name2: str):
    """Compare column sets between two DataFrames and print unique columns per file."""
    cols1, cols2 = set(df1.columns), set(df2.columns)
    print(f"\nColumns unique to {name1}: {sorted(list(cols1 - cols2))}")
    print(f"Columns unique to {name2}: {sorted(list(cols2 - cols1))}")
    print(f"Shared columns: {len(cols1 & cols2)}")

# ---------------------------------------------------------------------
# 1. Load input datasets
# ---------------------------------------------------------------------
et_df = pd.read_csv(ET_FILE, sep="\t", dtype=str, low_memory=False)
materials_df = pd.read_csv(MATERIALS_FILE, sep="\t", dtype=str, low_memory=False)

print(f"Loaded eye-tracking data: {et_df.shape}")
print(f"Loaded annotated materials: {materials_df.shape}")

# ---------------------------------------------------------------------
# 2. Inspect column structures (optional)
# ---------------------------------------------------------------------
compare_columns(et_df, materials_df, "eye_tracking_data", "materials_data")

# ---------------------------------------------------------------------
# 3. Remove redundant metadata columns from materials before merge
# ---------------------------------------------------------------------
redundant_cols = [
    "Media", "textid", "text.type", "text_version", "screenid",
    "Sentence_index", "Word_index"
]
materials_subset = materials_df.drop(columns=redundant_cols, errors="ignore")

# ---------------------------------------------------------------------
# 4. Merge on global AOI identifier with clear suffixes
# ---------------------------------------------------------------------
if "id.global.aoi" not in et_df.columns or "id.global.aoi" not in materials_subset.columns:
    raise KeyError("Both datasets must contain the column 'id.global.aoi' for merging.")

merged_df = pd.merge(
    et_df,
    materials_subset,
    on="id.global.aoi",
    how="left",
    suffixes=("_et", "_mat"),
    validate="m:1"  # ensures one annotation per AOI
)

print(f"\nMerged dataset created with shape: {merged_df.shape}")

# ---------------------------------------------------------------------
# 5. Resolve duplicate columns (keep material version if not empty)
# ---------------------------------------------------------------------
for col in merged_df.columns:
    if col.endswith("_et") and col[:-3] + "_mat" in merged_df.columns:
        base = col[:-3]
        et_col = col
        mat_col = base + "_mat"

        # Prefer non-empty material value; fallback to ET if missing
        merged_df[base] = merged_df[mat_col].where(
            merged_df[mat_col].notna() & (merged_df[mat_col] != ""),
            merged_df[et_col]
        )

# Drop all suffixed columns (_et and _mat)
merged_df = merged_df.drop(columns=[c for c in merged_df.columns if c.endswith(("_et", "_mat"))])

# ---------------------------------------------------------------------
# 6. Save merged output
# ---------------------------------------------------------------------
Path(OUTPUT_FILE).parent.mkdir(parents=True, exist_ok=True)
merged_df.to_csv(OUTPUT_FILE, sep="\t", index=False)

print(f"Merged dataset saved to {OUTPUT_FILE}")
print("\nPreview of merged data:")
display(merged_df.head(10))

# ---------------------------------------------------------------------
# 7. Integrity check: AOI overlap diagnostics
# ---------------------------------------------------------------------
et_ids = set(et_df["id.global.aoi"].unique())
mat_ids = set(materials_df["id.global.aoi"].unique())
common_ids = et_ids & mat_ids

print("\n--- Merge Key Diagnostics ---")
print(f"AOI IDs in ET data: {len(et_ids)}")
print(f"AOI IDs in materials: {len(mat_ids)}")
print(f"Common AOI IDs: {len(common_ids)}")

if len(common_ids) == 0:
    print("No matching AOI identifiers found — check ID construction in earlier steps.")
else:
    print("Example matched AOI IDs:", list(common_ids)[:5])

print("\nExample ET-only AOIs:", list(et_ids - mat_ids)[:5])
print("Example materials-only AOIs:", list(mat_ids - et_ids)[:5])


Loaded eye-tracking data: (30, 78)
Loaded annotated materials: (10, 49)

Columns unique to eye_tracking_data: ['AOI_nopunct', 'Average_duration_of_Visit', 'Average_duration_of_fixations', 'Average_duration_of_whole_fixations', 'Average_eye_openness', 'Average_pupil_diameter', 'Average_whole-fixation_eye_openness', 'Average_whole-fixation_pupil_diameter', 'Duration_of_first_Visit', 'Duration_of_first_fixation', 'Duration_of_first_whole_fixation', 'First-pass_duration', 'First-pass_first_fixation_duration', 'First-pass_regression', 'Maximum_duration_of_Visit', 'Maximum_duration_of_fixations', 'Maximum_duration_of_whole_fixations', 'Media_old', 'Minimum_duration_of_Visit', 'Minimum_duration_of_fixations', 'Minimum_duration_of_whole_fixations', 'Number_of_Visits', 'Number_of_fixations', 'Number_of_saccades_in_AOI', 'Number_of_whole_fixations', 'Participant', 'Participant_commnts_new', 'Participant_unique', 'Payment', 'Peak_velocity_of_entry_saccade', 'Peak_velocity_of_exit_saccade', 'Re-re

Unnamed: 0,Media,textid,text.type,text_version,screenid,Sentence_index,Word_index,AOI_nopunct,id.global.aoi,list,...,id.phrase.in.brackets,AOI.that.in.fact.should.be,tag,tag.type,tag.id,annotated_text,Sentence_id_current,Sentence_id_match,id.piece,filename.ann
0,clinical_001_original_1,1,clinical,original,1,1,1,Cas,clinical_001_original_1-1-1,1_A,...,,,,1.0,,,1,1,0,001_tagged_finished_words_original.csv
1,clinical_001_original_1,1,clinical,original,1,1,2,clinique,clinical_001_original_1-1-2,1_A,...,,,,,,,1,1,1,001_tagged_finished_words_original.csv
2,clinical_001_original_1,1,clinical,original,1,2,1,Un,clinical_001_original_1-2-1,1_A,...,,,,,,,2,2,2,001_tagged_finished_words_original.csv
3,clinical_001_original_1,1,clinical,original,1,2,2,patient,clinical_001_original_1-2-2,1_A,...,,,,,,,2,2,3,001_tagged_finished_words_original.csv
4,clinical_001_original_2,1,clinical,original,2,2,3,âgé,clinical_001_original_2-2-3,1_A,...,,,,,,,2,2,4,001_tagged_finished_words_original.csv
5,medical_002_simplified_1,2,medical,simplified,1,1,1,Le,medical_002_simplified_1-1-1,1_A,...,,,,2.0,,,1,1,5,002_tagged_finished_words_simplified.csv
6,medical_002_simplified_1,2,medical,simplified,1,1,2,patient,medical_002_simplified_1-1-2,1_A,...,,,,,,,1,1,6,002_tagged_finished_words_simplified.csv
7,medical_002_simplified_1,2,medical,simplified,1,1,3,souffre,medical_002_simplified_1-1-3,1_A,...,,,,5.0,,,1,1,7,002_tagged_finished_words_simplified.csv
8,medical_002_simplified_1,2,medical,simplified,1,2,1,d’orthopédie,medical_002_simplified_1-2-1,1_A,...,,,,6.0,,orthopédie,2,2,8,002_tagged_finished_words_simplified.csv
9,medical_002_simplified_1,2,medical,simplified,1,2,2,fracture,medical_002_simplified_1-2-2,1_A,...,,,,7.0,,fracture,2,2,9,002_tagged_finished_words_simplified.csv



--- Merge Key Diagnostics ---
AOI IDs in ET data: 10
AOI IDs in materials: 10
Common AOI IDs: 10
Example matched AOI IDs: ['clinical_001_original_1-1-1', 'clinical_001_original_1-1-2', 'medical_002_simplified_1-1-1', 'clinical_001_original_1-2-1', 'medical_002_simplified_1-1-3']

Example ET-only AOIs: []
Example materials-only AOIs: []


# Machine Learning Feature Aggregation Pipeline

**Overview**

This pipeline represents the final stage of preprocessing, transforming the merged
eye-tracking × linguistic dataset into structured input for machine-learning models.

It aggregates numerical eye-tracking and linguistic metrics per participant
and text subset, producing a compact representation of each participant’s reading behavior.

---

## Inputs
- `et_data_merged_with_ann_materials_dummy.csv`
  (originally `et_data_merged_with_ann_materials_25_06_17.csv`)
  Unified dataset containing both linguistic and eye-tracking features at the AOI level.
  The dummy file reproduces the same structure and column schema as the real dataset.

---

## Outputs
- `agg_data/ml_features_<scope>_<timestamp>.csv`
  Aggregated machine-learning feature tables.
  The `<scope>` token reflects the subset of features selected
  (e.g., `all`, `medical`, `non_medical`, `content`).

---

## Process Summary

1. **Load data**
   Reads the merged AOI-level dataset; if the file is missing, the error is reported and the pipeline stops.

2. **Automatic feature discovery**
   Identifies all numerical (`int64`/`float64`) columns not listed in the exclusion set,
   so newly added numeric columns are picked up without changing the configuration.

3. **Aggregation**
   Groups features by participant (`Participant_unique`) and text subset,
   computing summary statistics (`mean`, `std`, `min`, `max`).

4. **Subset-specific feature creation**
   Builds filtered subsets (e.g., *medical*, *non-medical*, *content words*),
   aggregates each independently, and merges them into the main feature table.

5. **Label assignment and imputation**
   Adds the target variable (`is.expert`) and fills missing feature values with zeros.

6. **Export**
   Saves a timestamped, tab-delimited file in `agg_data/` containing all aggregated features.

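The aggregation, column flattening, and zero imputation from the summary above can be sketched on a toy table. The column names `Participant_unique` and `Number_of_fixations` occur in the data; the values are invented, and the sketch is a simplified illustration, not the pipeline code itself.

```python
import pandas as pd

toy = pd.DataFrame({
    "Participant_unique": ["p1", "p1", "p2"],
    "Number_of_fixations": [2, 4, 3],
})

# One row per participant, one column per (feature, statistic) pair.
agg = toy.groupby("Participant_unique")[["Number_of_fixations"]].agg(
    ["mean", "std", "min", "max"]
)

# Flatten the two-level column index into names like "Number_of_fixations_mean".
agg.columns = ["_".join(map(str, col)).strip("_") for col in agg.columns]
agg = agg.reset_index()

# A participant with a single row has an undefined std (NaN); fill it with zero.
agg = agg.fillna(0)

print(agg)
```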
---

**Example scopes**
- `"feature_scope": "all"` → Aggregate all AOIs.
- `"feature_scope": "medical,non_medical"` → Generate medical and non-medical subsets.
- `"feature_scope": "content"` → Restrict aggregation to content-word AOIs (NOUN, VERB, ADJ).

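How a scope string is parsed, and how `exact` and `contains` filtering differ, can be sketched as follows. The POS values, including the merged tag `ADP | NOUN` for a multi-token AOI, are assumptions for illustration, and whitespace around scope names is stripped here as a precaution.

```python
import pandas as pd

# Scope string as it might be passed in the configuration.
feature_scope = "medical, content"
scope_parts = [part.strip() for part in feature_scope.split(",") if part.strip()]

# Toy AOI table; multi-token AOIs are assumed to carry merged POS tags.
toy = pd.DataFrame({
    "AOI": ["Le", "patient", "d'orthopédie", "souffre"],
    "pos": ["DET", "NOUN", "ADP | NOUN", "VERB"],
})

content_tags = ["NOUN", "VERB", "ADJ"]

# 'exact' matching misses the merged tag; 'contains' matching keeps it.
exact = toy[toy["pos"].isin(content_tags)]
contains = toy[toy["pos"].str.contains("|".join(content_tags), na=False)]

print(scope_parts)
print(exact["AOI"].tolist())
print(contains["AOI"].tolist())
```
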
In [14]:
import pandas as pd
from pathlib import Path
from datetime import datetime

def load_data(filepath):
    """Loads data from the specified filepath."""
    print(f"Loading data from '{filepath}'...")
    try:
        return pd.read_csv(filepath, sep="\t", low_memory=False)
    except FileNotFoundError:
        print(f"Error: The file '{filepath}' was not found.")
        return None

def discover_features(df, exclude_cols):
    """Automatically discovers numerical features to aggregate, excluding specified columns."""
    print("Automatically discovering numerical features...")
    numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns.tolist()
    features = [col for col in numerical_cols if col not in exclude_cols]
    print(f"Found {len(features)} features to aggregate.")
    return features

def aggregate_data(data, group_by_cols, feature_cols, agg_funcs):
    """Performs data aggregation."""
    print(f"Aggregating {len(feature_cols)} features by {group_by_cols}...")
    return data.groupby(group_by_cols)[feature_cols].agg(agg_funcs)

def process_feature_subsets(df, features_to_agg, agg_funcs, subset_configs):
    """Creates and returns a dictionary of aggregated feature subsets."""
    subset_dfs = {}
    if not subset_configs:
        return subset_dfs

    print("\nProcessing feature subsets...")
    for config in subset_configs:
        name = config['name']
        filter_col = config['filter_col']
        filter_vals = config['filter_values']
        group_by = config['group_by']
        # Default to 'exact' if method is not specified
        filter_method = config.get('filter_method', 'exact')

        if filter_col not in df.columns:
            print(f"Warning: Subset '{name}' requires column '{filter_col}', which was not found. Skipping.")
            continue

        print(f"--- Creating subset: '{name}' (using '{filter_method}' filter) ---")

        if filter_method == 'contains':
            # Create a regex pattern to find any of the values in the list
            # e.g., 'NOUN|VERB|ADJ'
            regex_pattern = '|'.join(filter_vals)
            # Filter rows where the column string contains any of the POS tags
            subset_data = df[df[filter_col].str.contains(regex_pattern, na=False)].copy()
        elif filter_method == 'exclude':
            # Select rows where the value in filter_col is NOT in the filter_values list.
            subset_data = df[~df[filter_col].isin(filter_vals)].copy()
        else: # Default to 'exact' matching
            subset_data = df[df[filter_col].isin(filter_vals)].copy()

        if subset_data.empty:
            print(f"Warning: No data found for subset '{name}'. No features will be added.")
            continue

        agg_df = aggregate_data(subset_data, group_by, features_to_agg, agg_funcs)
        agg_df.columns = ['_'.join(map(str, col)).strip('_') + f'_{name}' for col in agg_df.columns]
        subset_dfs[name] = agg_df.reset_index()
        print(f"Successfully created {len(agg_df.columns)} features for subset '{name}'.")

    return subset_dfs

def flatten_and_reshape(df, widen, group_by_cols):
    """Flattens columns and optionally reshapes the DataFrame to a wide format."""
    if widen:
        print("Reshaping data to wide format...")
        df = df.unstack(level=group_by_cols[1:])

    print("Flattening column names...")
    df.columns = ['_'.join(map(str, col)).strip('_') for col in df.columns]
    return df.reset_index()

def merge_dataframes(main_df, subset_dfs):
    """Merges the main aggregated DataFrame with all subset DataFrames."""
    if not subset_dfs:
        return main_df

    print("\nMerging data with subset features...")
    for name, df_to_merge in subset_dfs.items():
        merge_key = df_to_merge.columns[0]
        main_df = pd.merge(main_df, df_to_merge, on=merge_key, how='left')
        print(f"Merged '{name}' features.")
    return main_df

def save_data(df, output_path):
    """Saves the final DataFrame to a specified path."""
    print("\n--- Preprocessing Complete! ---")
    output_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(output_path, sep="\t", index=False)
    print(f"Final preprocessed data saved to: {output_path}")
    print(f"Shape of the final DataFrame: {df.shape}")

def main(config):
    """Main function to orchestrate the data aggregation pipeline."""
    # Unpack config for clarity
    paths = config['paths']
    params = config['parameters']
    columns = config['columns']
    funcs = config['aggregation_functions']
    subsets_config = config.get('feature_subsets', [])

    # Load data and identify numeric variables to aggregate
    raw_data = load_data(paths['input_filepath'])
    if raw_data is None:
        return # Stop execution if data loading failed
    features_to_agg = discover_features(raw_data, columns['exclude_from_aggregation'])

    # Determine which aggregation functions to use
    agg_funcs_to_use = funcs.get('full')
    if params.get('use_reduced_aggs', False):
        agg_funcs_to_use = funcs.get('reduced')
        print("Using reduced aggregation functions.")
    else:
        print("Using full aggregation functions.")

    # 1. Create a base DataFrame with unique participant IDs to merge everything into
    id_col = columns['id_column']
    merged_data = raw_data[[id_col]].drop_duplicates().reset_index(drop=True)
    print(f"\nCreated base DataFrame with {len(merged_data)} unique participants.")

    # Get the feature scope and split it into trimmed, non-empty parts so that
    # values such as "all, content" still match the subset names despite the space
    feature_scope = params.get('feature_scope', '')
    scope_parts = [part.strip() for part in feature_scope.split(',') if part.strip()]
    print(f"Parsed feature scope: {scope_parts}")

    # 2. Conditionally perform the main aggregation if "all" is in the scope parts
    if 'all' in scope_parts:
        print("\nPerforming main aggregation on all data...")
        main_agg = aggregate_data(raw_data, columns['main_group_by'], features_to_agg, agg_funcs_to_use)
        main_agg_flat = flatten_and_reshape(main_agg, params['widen_format'], columns['main_group_by'])
        merged_data = pd.merge(merged_data, main_agg_flat, on=id_col, how='left')
        print("Main aggregation features merged.")
    else:
        print("\nSkipping main aggregation as 'all' is not in feature_scope.")

    # 3. Process and merge special feature subsets that are mentioned in the scope parts
    active_subsets = [s for s in subsets_config if s['name'] in scope_parts]
    subset_dfs = process_feature_subsets(raw_data, features_to_agg, agg_funcs_to_use, active_subsets)
    merged_data = merge_dataframes(merged_data, subset_dfs)

    # 4. Merge expertise labels (if available) and impute missing values
    if columns['label_column'] in raw_data.columns:
        expertise_df = raw_data.groupby(id_col)[columns['label_column']].first().reset_index()
        merged_data = pd.merge(merged_data, expertise_df, on=id_col, how='left')
        print("\nExpertise labels merged.")
    else:
        print(f"\nWarning: Label column '{columns['label_column']}' not found. Skipping label merge.")

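    # Impute NaNs in feature columns only. The ID and label columns are skipped so
    # that a missing expertise label is not silently recoded as 0.
    non_feature_cols = {id_col, columns['label_column']}
    feature_cols_to_impute = [col for col in merged_data.columns if col not in non_feature_cols]
    merged_data[feature_cols_to_impute] = merged_data[feature_cols_to_impute].fillna(0)
    print(f"Missing values imputed. Remaining nulls: {merged_data.isnull().sum().sum()}")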
    final_data = merged_data

    # 5. Construct dynamic output filename and save
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    # shape_str = 'wide' if params['widen_format'] else 'long'
    # agg_str = 'reduced_aggs' if params['use_reduced_aggs'] else 'full_aggs'
    # output_filename = f"{paths['output_prefix']}_{feature_scope}_{shape_str}_{agg_str}_{timestamp}.csv"

    output_filename = f"{paths['output_prefix']}_{feature_scope}_{timestamp}.csv"
    output_path = Path(paths['output_directory']) / output_filename

    # Save the final aggregated dataset with a timestamped filename
    save_data(final_data, output_path)


config = {
    # --- File Paths ---
    "paths": {
        "input_filepath": "raw/et_data_merged_with_ann_materials_dummy.csv",
        "output_directory": "agg_data",
        "output_prefix": "ml_features"
    },

    # --- Pipeline Parameters ---
    "parameters": {
        "widen_format": True,       # True for wide format, False for long format
        "use_reduced_aggs": False,  # True for [mean, std], False for [mean, std, min, max]
        # Controls which features are generated by matching names from 'feature_subsets'
        # Examples:
        # "feature_scope": "medical"                --> will produce ['medical']
        # "feature_scope": "medical,non_medical"    --> will produce ['medical', 'non_medical']
        # "feature_scope": "all,content"            --> will produce ['all', 'content']
        "feature_scope": "non_medical"
    },

    # --- Column Definitions ---
    "columns": {
        "id_column": "Participant_unique",
        "label_column": "is.expert",
        "main_group_by": [
            "Participant_unique",
            "text.type",
            "text_version"
        ],
        # Columns to EXCLUDE from automatic feature discovery
        "exclude_from_aggregation": [
            "Participant_unique", "text.type", "text_version", "Participant",
            "list", "textid", "screenid", "Sentence_index", "Word_index",
            "is.expert", "age", "sex", "Payment", "educational_background",
            "field_of_study", "current_situation", "current_situation_old",
            "Participant_commnts_new", "Recording", "situation_actuelle",
            "Timeline", "Media", "Media_old", "tag.type"
        ]
    },

    # --- Aggregation Function Sets ---
    "aggregation_functions": {
        "full": ["mean", "std", "min", "max"],
        "reduced": ["mean", "std"]
    },

    # --- Feature Subset Extraction ---
    # Define all possible feature subsets here.
    # The pipeline will only generate the ones whose 'name' is in 'feature_scope'
    "feature_subsets": [
        {
            "name": "medical",
            "filter_col": "tag.type",
            "filter_values": [1, 2, 3, 4, 5, 7],
            "group_by": ["Participant_unique"],
            "filter_method": "exact" # Use exact matching for numerical tags
        },
        {
            "name": "non_medical",
            "filter_col": "tag.type",
            # Using the same list of medical tags, but to EXCLUDE them
            "filter_values": [1, 2, 3, 4, 5, 7],
            "group_by": ["Participant_unique"],
            "filter_method": "exclude" # Exclude rows with these tags
        },
        {
            "name": "content",
            "filter_col": "pos_merged",
            "filter_values": ["NOUN", "VERB", "ADJ"],
            "group_by": ["Participant_unique"],
            "filter_method": "contains" # Use 'contains' for concatenated strings
        }
    ]
}

# --- EXECUTE THE PIPELINE ---
main(config)

Loading data from 'raw/et_data_merged_with_ann_materials_dummy.csv'...
Automatically discovering numerical features...
Found 64 features to aggregate.
Using full aggregation functions.

Created base DataFrame with 3 unique participants.
Parsed feature scope: ['non_medical']

Skipping main aggregation as 'all' is not in feature_scope.

Processing feature subsets...
--- Creating subset: 'non_medical' (using 'exclude' filter) ---
Aggregating 64 features by ['Participant_unique']...
Successfully created 256 features for subset 'non_medical'.

Merging data with subset features...
Merged 'non_medical' features.

Expertise labels merged.
Missing values imputed. Remaining nulls: 0

--- Preprocessing Complete! ---
Final preprocessed data saved to: agg_data/ml_features_non_medical_20251020_173951.csv
Shape of the final DataFrame: (3, 258)
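
The final shape follows from the log above: 64 discovered features times 4 aggregation functions gives the 256 `non_medical` columns, and the `Participant_unique` ID plus the `is.expert` label bring the total to 258. The sketch below reproduces that arithmetic on a toy frame. The feature names `fix_dur` and `n_fix` and all values are invented for illustration, and the plain `groupby().agg()` call is only an assumed stand-in for `aggregate_data`.

```python
import pandas as pd

# Toy AOI-level data: 2 participants, 2 numeric features (names are made up).
toy = pd.DataFrame({
    "Participant_unique": ["P1", "P1", "P1", "P2", "P2", "P2"],
    "tag.type": [1, 0, 0, 7, 0, 0],
    "is.expert": [1, 1, 1, 0, 0, 0],
    "fix_dur": [200.0, 180.0, 220.0, 150.0, 170.0, 160.0],
    "n_fix": [2, 1, 3, 1, 2, 2],
})
features = ["fix_dur", "n_fix"]
agg_funcs = ["mean", "std", "min", "max"]
medical_tags = [1, 2, 3, 4, 5, 7]

# 'exclude' filter: keep rows whose tag is NOT in the medical tag list.
subset = toy[~toy["tag.type"].isin(medical_tags)]

# Aggregate per participant, then flatten the (feature, func) column pairs
# and append the subset name as a suffix.
agg = subset.groupby("Participant_unique")[features].agg(agg_funcs)
agg.columns = ["_".join(col) + "_non_medical" for col in agg.columns]
agg = agg.reset_index()

# Attach one label per participant, mirroring step 4 of main().
labels = toy.groupby("Participant_unique")["is.expert"].first().reset_index()
result = agg.merge(labels, on="Participant_unique", how="left")

expected_cols = len(features) * len(agg_funcs) + 2  # + ID column + label column
print(result.shape, expected_cols)
```

With 64 features in place of 2, the same formula gives 64 * 4 + 2 = 258 columns, which matches the reported shape.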


## Appendix – Diagnostic and Validation Utilities

This appendix includes two simple diagnostic tools for checking data quality and schema consistency
across the pipeline. They are not part of the main preprocessing workflow; they support
transparency, reproducibility, and data integrity checks before statistical modeling.

### 1. Compare Dataset Schemas

Prints the column sets of two datasets from different processing stages, together with their overlap and differences.
This check helps confirm that merging and enrichment steps preserve required variables.

In [15]:
## Utility: Compare column sets between two DataFrames

import pandas as pd

def compare_columns(df1: pd.DataFrame, name1: str, df2: pd.DataFrame, name2: str):
    """
    Compare the columns of two DataFrames and print overlaps and differences.

    Args:
        df1: First DataFrame.
        name1: Display name for df1.
        df2: Second DataFrame.
        name2: Display name for df2.
    """
    cols1 = list(df1.columns)
    cols2 = list(df2.columns)

    set1, set2 = set(cols1), set(cols2)

    only_in_1 = sorted(set1 - set2)
    only_in_2 = sorted(set2 - set1)
    common    = sorted(set1 & set2)

    print(f"Columns in {name1} ({len(cols1)}):")
    print(cols1)
    print()
    print(f"Columns in {name2} ({len(cols2)}):")
    print(cols2)
    print("\n— Differences —")
    print(f"Only in {name1} ({len(only_in_1)}): {only_in_1}")
    print(f"Only in {name2} ({len(only_in_2)}): {only_in_2}")
    print(f"Common ({len(common)}): {common}")

In [16]:
ann_materials = pd.read_csv('raw/10_materials_parsed_collapsed.csv', sep="\t", low_memory=False)
et_data       = pd.read_csv('raw/00_input_eye_tracking_data_with_metrics_dummy.csv', sep="\t", low_memory=False)

compare_columns(ann_materials, "ann_materials", et_data, "et_data")

Columns in ann_materials (39):
['Media', 'textid', 'text.type', 'text_version', 'screenid', 'Sentence_index', 'Word_index', 'word_id_screen', 'word_id_text', 'AOI_ann', 'AOI_length', 'is.in.bracket', 'id.phrase.in.brackets', 'AOI.that.in.fact.should.be', 'tag', 'tag.type', 'tag.id', 'annotated_text', 'flag', 'id.ann', 'id.ann.global', 'Sentence_id_current', 'Sentence_id_match', 'id.piece', 'lemma_merged', 'pos_merged', 'token_merged', 'token_id_merged', 'is.hyphenated', 'is.compound.hyphen', 'word_count_text', 'word_count_screen', 'total_word_count_text', 'left_dependents', 'right_dependents', 'id.global.aoi', 'AOI_unif_quot', 'filename.ann', 'id.line']

Columns in et_data (78):
['Media', 'textid', 'text.type', 'text_version', 'screenid', 'Sentence_index', 'Word_index', 'word_id_screen', 'word_id_text', 'AOI_ann', 'AOI_nopunct', 'AOI_length', 'is.in.bracket', 'id.phrase.in.brackets', 'AOI.that.in.fact.should.be', 'tag', 'tag.type', 'tag.id', 'annotated_text', 'Sentence_id_current', 'Se
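
Because `compare_columns` only prints its findings, a scripted run may benefit from a check that returns something testable. A minimal sketch, assuming a hypothetical helper `missing_columns` and an illustrative `required` list (neither is part of the pipeline):

```python
from typing import Iterable, List

def missing_columns(available: Iterable[str], required: Iterable[str]) -> List[str]:
    """Return the required column names that are absent from `available`, sorted."""
    return sorted(set(required) - set(available))

# Hypothetical requirement list; the real one depends on the downstream step.
required = ["Participant_unique", "textid", "tag.type", "pos_merged"]

# A shortened stand-in for ann_materials.columns (any iterable of names works,
# including a DataFrame's .columns).
ann_cols = ["Media", "textid", "text.type", "tag.type", "pos_merged"]

missing = missing_columns(ann_cols, required)
print(missing)  # the materials file carries no participant column
```

Returning a list rather than printing makes it possible to stop a pipeline step early, for example by raising an error when the list is non-empty.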

### 2. Check Average Data Length per Participant

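Counts the AOI-level rows per participant to assess dataset balance.
Despite the `average_data_length` column name, the value is a per-participant row count, not an average.
Similar counts across participants indicate that each contributed a comparable amount of valid data;
a markedly lower count can point to a missing or incomplete recording.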


In [17]:
def average_data_length_per_participant(df: pd.DataFrame, participant_col: str = "Participant_unique") -> pd.DataFrame:
    """
    Calculate the number of AOI-level entries (data length) per participant.

    Args:
        df (pd.DataFrame): Merged dataset containing participant data.
        participant_col (str): Column identifying participants.

    Returns:
        pd.DataFrame: Participant IDs with their respective AOI counts.
    """
    return df.groupby(participant_col).size().reset_index(name="average_data_length")

# Example usage on the merged ET–annotation dataset
merged_data = pd.read_csv('raw/et_data_merged_with_ann_materials_dummy.csv', sep="\t", low_memory=False)
participant_col = "Participant_unique"
average_length = average_data_length_per_participant(merged_data, participant_col)

# Display range of participant data lengths
min_participant = average_length.loc[average_length["average_data_length"].idxmin()]
max_participant = average_length.loc[average_length["average_data_length"].idxmax()]

print(
    f"{min_participant[participant_col]} has a minimum data length of {min_participant['average_data_length']}"
)
print(
    f"{max_participant[participant_col]} has a maximum data length of {max_participant['average_data_length']}"
)

Participant10-1_A has a minimum data length of 10
Participant10-1_A has a maximum data length of 10
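
When the minimum and maximum coincide, as in the dummy output above, every participant has the same number of rows. On real data the counts will likely differ, and a relative threshold can flag short recordings automatically. The sketch below is one possible approach; the helper name `flag_short_recordings` and the 0.8 cut-off are assumptions for illustration, not values taken from the study.

```python
import pandas as pd

def flag_short_recordings(df: pd.DataFrame,
                          participant_col: str = "Participant_unique",
                          min_fraction: float = 0.8) -> pd.Series:
    """Return row counts of participants with fewer than min_fraction * median rows."""
    counts = df.groupby(participant_col).size()
    threshold = min_fraction * counts.median()
    return counts[counts < threshold]

# Toy data: P1 and P2 have 10 rows each, P3 has only 4.
toy = pd.DataFrame({"Participant_unique": ["P1"] * 10 + ["P2"] * 10 + ["P3"] * 4})

flagged = flag_short_recordings(toy)
print(flagged.to_dict())
```

The median is used as the reference because a single very short or very long recording shifts it less than it would shift the mean.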
