# API Cost and Token Length Analysis

This notebook analyses the expected token usage and estimated costs of API-based
annotation (tokenization, lemmatization, and POS tagging) for the French medical
AOI texts used in the thesis.

It provides:
- descriptive statistics about AOI lengths,
- visualisation of their distribution,
- identification of the longest AOIs (worst-case prompt sizes),
- and token-based cost estimation for GPT-4o and GPT-4o-mini models.

These calculations informed the batching and budgeting strategy for the annotation
runs described in Section 3.5.1 of the thesis.


## Setup: Imports and Dependencies

Import libraries for data handling, analysis, visualization, and token-based API cost estimation.

In [None]:
import json
import pandas as pd
import numpy as np
import heapq
import matplotlib.pyplot as plt
import tiktoken

## Step 1: Load AOI dictionary

**Input**
`aoi_tuples_dict.json`
Generated previously in `00_annotation_preprocessing.ipynb`.
Contains AOI sentence groups with `id_aoi_tuples` (lists of `(AOI_ID, word)` pairs).

**Process**
Loads the JSON dictionary and counts the number of AOI sentence entries to verify input integrity.

**Output**
- In-memory `sentence_dict` (Python dictionary)
- Printed number of sentences for reference


In [None]:
with open("aoi_tuples_dict.json", "r", encoding="utf-8") as f:
    sentence_dict = json.load(f)

num_sentences = len(sentence_dict)
print(f"Number of sentences: {num_sentences}")
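
The integrity check described above only counts entries. A slightly stricter check, sketched below, also confirms that each entry carries a well-formed `id_aoi_tuples` list. The helper name `validate_aoi_dict` and the toy data are hypothetical, and the sketch assumes that each `(AOI_ID, word)` pair is stored as a two-element list, since JSON has no tuple type.

```python
def validate_aoi_dict(sentence_dict):
    """Return the keys whose entry lacks a well-formed 'id_aoi_tuples' list."""
    bad_keys = []
    for key, entry in sentence_dict.items():
        tuples = entry.get("id_aoi_tuples") if isinstance(entry, dict) else None
        # Assumption: each (AOI_ID, word) pair is a two-element list after JSON loading.
        if not isinstance(tuples, list) or any(
            not isinstance(pair, (list, tuple)) or len(pair) != 2 for pair in tuples
        ):
            bad_keys.append(key)
    return bad_keys


# Toy data for illustration only (not taken from the thesis dataset).
toy_dict = {
    "s1": {"id_aoi_tuples": [[1, "Le"], [2, "patient"]]},
    "s2": {"id_aoi_tuples": [[3, "fièvre"]]},
    "s3": {"tokens": []},  # missing the expected field
}

print(validate_aoi_dict(toy_dict))
```

An empty result would indicate that every entry has the expected shape.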


## Step 2: Compute AOI length statistics

**Input**
`sentence_dict` loaded in Step 1.

**Process**
For each AOI group, counts the number of `id_aoi_tuples` entries to determine:
- average AOI length,
- minimum and maximum AOI length.

**Output**
Console summary of the average, maximum, and minimum number of entries per AOI group.


In [None]:
num_entries_list = [len(v["id_aoi_tuples"]) for v in sentence_dict.values()]
average_entries = np.mean(num_entries_list)
max_entries = np.max(num_entries_list)
min_entries = np.min(num_entries_list)

print(f"Average entries per AOI: {average_entries:.2f}")
print(f"Maximum entries: {max_entries}")
print(f"Minimum entries: {min_entries}")
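
If the length distribution is skewed by a few long AOIs, the mean alone can be misleading. The sketch below adds the median and upper percentiles with NumPy. The list `toy_lengths` is hypothetical and stands in for `num_entries_list`.

```python
import numpy as np

# Hypothetical AOI lengths for illustration; substitute num_entries_list.
toy_lengths = [3, 5, 8, 12, 40]

median_entries = np.median(toy_lengths)
# np.percentile interpolates linearly between neighbouring values by default.
p90_entries, p95_entries = np.percentile(toy_lengths, [90, 95])

print(f"Median entries: {median_entries:.1f}")
print(f"90th percentile: {p90_entries:.1f}")
print(f"95th percentile: {p95_entries:.1f}")
```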


## Step 3: Identify longest AOIs

**Input**
`sentence_dict`

**Process**
Uses a heap-based ranking to extract the ten AOIs with the highest number of words (`id_aoi_tuples`).
This identifies potential outliers and large prompts that may exceed model limits.

**Output**
Console output listing the top 10 longest AOI entries and their lengths.


In [None]:
top_longest = heapq.nlargest(
    10, [(len(v["id_aoi_tuples"]), k) for k, v in sentence_dict.items()]
)

print("Top 10 longest AOIs:")
for length, key in top_longest:
    print(f"{key}: {length} entries")
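
The call above ranks `(length, key)` tuples, so AOIs of equal length are ordered by comparing their keys. An alternative, sketched below on hypothetical toy data, passes a `key` function to `heapq.nlargest` so that only the length is compared and no intermediate list of tuples is built.

```python
import heapq

# Toy data for illustration only (not taken from the thesis dataset).
toy_dict = {
    "s1": {"id_aoi_tuples": [[1, "Le"], [2, "patient"]]},
    "s2": {"id_aoi_tuples": [[3, "présente"], [4, "une"], [5, "fièvre"]]},
    "s3": {"id_aoi_tuples": [[6, "Oui"]]},
}

# Iterating over a dict yields its keys; the key function maps each one to its AOI length.
top_two = heapq.nlargest(
    2, toy_dict, key=lambda k: len(toy_dict[k]["id_aoi_tuples"])
)

for key in top_two:
    print(f"{key}: {len(toy_dict[key]['id_aoi_tuples'])} entries")
```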


## Step 4: Visualize AOI length distribution

**Input**
`sentence_dict`

**Process**
Plots a histogram showing the distribution of AOI lengths (number of word entries per sentence).
This helps determine whether the AOI set is balanced or if some entries are unusually long.

**Output**
Matplotlib histogram with labeled axes:
- X-axis: “Number of AOI entries”
- Y-axis: “Frequency”


In [None]:
def generate_sentence_length_histogram(sentence_dict):
    """
    Plot histogram of the number of AOI entries per text segment.

    Parameters
    ----------
    sentence_dict : dict
        Dictionary mapping sentence IDs to entries that contain an
        `id_aoi_tuples` list.
    """
    lengths = [len(v["id_aoi_tuples"]) for v in sentence_dict.values()]
    plt.hist(lengths, bins="auto", color="skyblue", edgecolor="black")
    plt.title("Distribution of AOI Lengths")
    plt.xlabel("Number of AOI entries")
    plt.ylabel("Frequency")
    plt.grid(axis="y", alpha=0.75)
    plt.show()

generate_sentence_length_histogram(sentence_dict)


## Step 5: Estimate API token usage and cost

**Input**
Representative prompt–response pair (copied from test annotations).
Model rates for GPT-4o and GPT-4o-mini.

**Process**
- Encodes input and output strings using `tiktoken` to simulate real API tokenization.
- Calculates approximate costs per request given OpenAI’s published rates.
- Allows toggling between models to compare pricing.

**Output**
Console output summarizing the selected model, the input and output token counts, and the estimated cost per request.


In [None]:
# Example prompt/response pair
example_prompt = """..."""
example_response = """..."""

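# GPT-4o and GPT-4o-mini share the same encoding (o200k_base),
# so a single encoder serves both models.
enc = tiktoken.encoding_for_model("gpt-4o")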
input_tokens = len(enc.encode(example_prompt))
output_tokens = len(enc.encode(example_response))

# Pricing (USD per 1M tokens)
RATES = {
    "gpt-4o": {"in": 2.5, "out": 10},
    "gpt-4o-mini": {"in": 0.15, "out": 0.6},
}

model = "gpt-4o"  # or "gpt-4o-mini"
input_cost = (input_tokens / 1e6) * RATES[model]["in"]
output_cost = (output_tokens / 1e6) * RATES[model]["out"]
total_cost = input_cost + output_cost

print(
    f"Model: {model}\n"
    f"Input tokens: {input_tokens}\n"
    f"Output tokens: {output_tokens}\n"
    f"Estimated cost: ${total_cost:.4f}"
)
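
The cell above prices a single request. To get a rough figure for a whole run, the per-request cost can be multiplied by the number of requests, as in the sketch below. The function `estimate_run_cost` and the token and request counts are hypothetical. The sketch assumes every request uses about as many tokens as the representative pair, which real AOIs of varying length will not match exactly.

```python
# Rates in USD per 1M tokens, copied from the cell above.
RATES = {
    "gpt-4o": {"in": 2.5, "out": 10},
    "gpt-4o-mini": {"in": 0.15, "out": 0.6},
}


def estimate_run_cost(input_tokens, output_tokens, n_requests, model):
    """Approximate cost in USD of n_requests identical requests."""
    rate = RATES[model]
    per_request = (input_tokens * rate["in"] + output_tokens * rate["out"]) / 1e6
    return per_request * n_requests


# Hypothetical figures: 1000 input tokens, 500 output tokens, 100 requests.
for model_name in RATES:
    cost = estimate_run_cost(1000, 500, 100, model_name)
    print(f"{model_name}: ${cost:.4f}")
```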


## Step 6: Interpretation

The histogram above shows the typical AOI segment lengths used for API annotation.
The token and cost analysis provides an upper bound for processing expenses.
In the production annotation pipeline (see `01_api_annotation_pipeline.ipynb`),
AOIs were batched to remain below the tested token limit and expected cost per run.
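
One simple way to keep batches under a token limit is greedy grouping, sketched below. This is an illustration only and not necessarily the logic used in `01_api_annotation_pipeline.ipynb`. The function name, the per-AOI token counts, and the budget of 800 tokens are all hypothetical.

```python
def batch_by_token_budget(token_counts, max_tokens):
    """Greedily group keys, in order, so each batch stays within max_tokens.

    A single item larger than max_tokens still gets a batch of its own.
    """
    batches, current, current_total = [], [], 0
    for key, n_tokens in token_counts.items():
        # Close the current batch if adding this item would exceed the budget.
        if current and current_total + n_tokens > max_tokens:
            batches.append(current)
            current, current_total = [], 0
        current.append(key)
        current_total += n_tokens
    if current:
        batches.append(current)
    return batches


# Hypothetical per-AOI token counts and budget.
toy_counts = {"s1": 400, "s2": 300, "s3": 500, "s4": 200}
print(batch_by_token_budget(toy_counts, max_tokens=800))
```

Greedy grouping preserves the original AOI order, which keeps the results easy to map back to their sentences, at the price of batches that are not always packed as tightly as possible.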

Our results indicated that annotating the entire dataset with GPT-4o, including
lemmatization and POS tagging, would cost less than 4 USD, while GPT-4o-mini would cost less than 0.25 USD.
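
As a consistency check on these two figures, using the per-million-token rates listed in Step 5: the input and output rates differ between the two models by the same factor, so the total cost scales by that factor whatever the mix of input and output tokens.

$$
\frac{0.15}{2.5} = \frac{0.6}{10} = 0.06, \qquad 4\ \text{USD} \times 0.06 = 0.24\ \text{USD} < 0.25\ \text{USD}
$$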
