# 🧠 Engineering Deep Dive: Understanding Token Usage & Reasoning

### 📄 Overview
For an LLM engineer, the `CompletionUsage` object is the primary dashboard for cost and performance. It reveals not just how much text was generated but also **how much hidden reasoning the model did** to get there. Understanding this structure is critical for optimizing latency and budget.

### 🔍 The Anatomy of Usage
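When you receive a response from the OpenAI API (or a compatible provider), it includes a `usage` field. In the Python SDK this is a `CompletionUsage` object rather than a plain dictionary, so its values are read as attributes: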

```python
CompletionUsage(
    prompt_tokens=14,
    completion_tokens=652,
    total_tokens=666,
    completion_tokens_details=CompletionTokensDetails(
        reasoning_tokens=576,
        accepted_prediction_tokens=0,
        rejected_prediction_tokens=0
    ),
    prompt_tokens_details=PromptTokensDetails(
        cached_tokens=0
    )
)
```

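### 🗝️ Key Metrics Explained

#### 1. The Basics (The Bill)
- `prompt_tokens`: The input you sent (system prompt + user message + history).
- `completion_tokens`: The total output generated by the model, including any hidden reasoning tokens.
- `total_tokens`: The sum of the above. Input and output tokens are usually priced at different rates, so compute cost from the two components rather than from this total.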
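Because providers usually price input and output tokens at different rates, cost is easier to reason about per component than from `total_tokens`. The following is a minimal sketch; the `PRICE_*` constants are hypothetical placeholders, not real rates, and `estimate_cost` is an illustrative helper name.

```python
# Hypothetical prices in USD per 1M tokens; substitute your provider's actual rates.
PRICE_INPUT_PER_M = 1.00
PRICE_OUTPUT_PER_M = 4.00


def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate request cost by pricing input and output tokens separately."""
    input_cost = prompt_tokens / 1_000_000 * PRICE_INPUT_PER_M
    output_cost = completion_tokens / 1_000_000 * PRICE_OUTPUT_PER_M
    return input_cost + output_cost


# Token counts taken from the sample usage object above.
cost = estimate_cost(prompt_tokens=14, completion_tokens=652)
print(f"Estimated cost: ${cost:.6f}")
```

With these placeholder rates, almost all of the cost comes from the completion side, which is where reasoning tokens are counted.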

#### 2. The "Hidden" Cost: Reasoning Tokens
Reasoning models (such as OpenAI o1 and o1-mini) run a chain-of-thought (CoT) process before producing the final answer.

- `reasoning_tokens`: Tokens generated during the model's internal scratchpad phase.
- Visibility: Hidden. They do not appear in the returned message content.
- Billing: Billable. You pay for them at the output-token rate.

The "Visible" Output: $$ \text{Visible Tokens} = \text{Completion Tokens} - \text{Reasoning Tokens} $$

⚠️ Engineering Trap: In the example above, the model generated 652 completion tokens, of which 576 were reasoning. You paid for 652 output tokens to get a 76-token answer.

Efficiency: ~12% of completion tokens were the result (76 / 652); ~88% were reasoning overhead (576 / 652).

Use Case: This is acceptable for complex logic but wasteful for simple greetings.
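The arithmetic behind this trap can be sketched in a few lines. `reasoning_overhead` is a hypothetical helper written for this note; it only restates the subtraction above and adds the ratio of output tokens paid for to output tokens actually seen.

```python
def reasoning_overhead(completion_tokens: int, reasoning_tokens: int) -> dict:
    """Split completion tokens into visible and hidden parts and derive ratios."""
    visible = completion_tokens - reasoning_tokens
    return {
        "visible_tokens": visible,
        # Share of the completion that reached the caller as text.
        "visible_share": visible / completion_tokens if completion_tokens else 0.0,
        # Output tokens paid for per visible output token.
        "paid_per_visible": completion_tokens / visible if visible else float("inf"),
    }


stats = reasoning_overhead(completion_tokens=652, reasoning_tokens=576)
print(f"Visible tokens:   {stats['visible_tokens']}")
print(f"Visible share:    {stats['visible_share']:.1%}")
print(f"Paid per visible: {stats['paid_per_visible']:.2f}x")
```

For the sample numbers, each visible token costs roughly eight to nine output tokens, which is the figure to weigh against the accuracy gained.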

#### 3. Prompt Caching
- `cached_tokens`: Prompt tokens served from the provider's cache because the beginning of the prompt matched a recent request (reported in `prompt_tokens_details`).
- Benefit: Cached tokens are billed at a discount (often around 50%, though the exact rate varies by model and provider) and reduce latency.
- Strategy: Put the static parts of the prompt (long system instructions, RAG context) first and the per-request content last to maximize cache hits.
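One way to apply this strategy is to build every request from the same static prefix and append only the changing question at the end. The sketch below is illustrative: the constants, `build_messages`, and `cache_hit_ratio` are hypothetical names, and real prompts generally need to be much longer than this toy text before a provider will cache them.

```python
# Hypothetical static content: identical on every request, so it forms a cacheable prefix.
SYSTEM_INSTRUCTIONS = "You are a support assistant. Answer from the reference text only."
REFERENCE_TEXT = "Refunds are accepted within 30 days. Shipping takes 3-5 business days."


def build_messages(user_question: str) -> list:
    """Put static content first and the per-request question last."""
    return [
        {"role": "system", "content": SYSTEM_INSTRUCTIONS},
        {"role": "system", "content": f"Reference:\n{REFERENCE_TEXT}"},
        {"role": "user", "content": user_question},
    ]


def cache_hit_ratio(prompt_tokens: int, cached_tokens: int) -> float:
    """Fraction of prompt tokens that were served from the cache."""
    return cached_tokens / prompt_tokens if prompt_tokens else 0.0


first = build_messages("What is the refund window?")
second = build_messages("How long does shipping take?")

# Everything except the final message is identical across the two requests.
print(first[:-1] == second[:-1])

# Illustrative token counts, not measured values.
print(f"Cache hit ratio: {cache_hit_ratio(prompt_tokens=2048, cached_tokens=1536):.0%}")
```

Tracking the hit ratio over time is a quick check that a prompt refactor did not accidentally move dynamic content into the prefix.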

### 🛠️ Python Utility: Usage Analyzer
A helper function to print a clean cost/performance report after every API call.

In [None]:
def analyze_usage(usage_object):
    """
    Parses the OpenAI Usage object to reveal hidden costs and efficiency.
    """
    # 1. Basic Extraction
    p_tokens = usage_object.prompt_tokens
    c_tokens = usage_object.completion_tokens
    t_tokens = usage_object.total_tokens
    
    # 2. Detail extraction (safe access: the detail objects or their fields may be missing or None)
    details = getattr(usage_object, 'completion_tokens_details', None)
    reasoning = getattr(details, 'reasoning_tokens', 0) or 0

    prompt_details = getattr(usage_object, 'prompt_tokens_details', None)
    cached = getattr(prompt_details, 'cached_tokens', 0) or 0
    
    # 3. Calculations
    visible_tokens = c_tokens - reasoning
    reasoning_ratio = (reasoning / c_tokens) * 100 if c_tokens > 0 else 0
    
    # 4. The Report
    print("📊 --- Tokenomics Report ---")
    print(f"Total Billable:   {t_tokens}")
    print(f"Input (Prompt):   {p_tokens} (Cached: {cached})")
    print(f"Output (Total):   {c_tokens}")
    print(f"  ├─ Visible:     {visible_tokens}")
    print(f"  └─ Reasoning:   {reasoning} ({reasoning_ratio:.1f}% of output)")

    if reasoning_ratio > 50:
        print("\n⚠️ NOTE: High reasoning overhead. Ensure this task requires deep logic.")

# Example usage with the sample data shown earlier.
# (In real code, pass the actual response.usage object here.)
from types import SimpleNamespace  # Only used to mock the usage object in this example

mock_usage = SimpleNamespace(
    prompt_tokens=14,
    completion_tokens=652,
    total_tokens=666,
    completion_tokens_details=SimpleNamespace(reasoning_tokens=576),
    prompt_tokens_details=SimpleNamespace(cached_tokens=0)
)

analyze_usage(mock_usage)
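A per-call report is most useful when the numbers are also accumulated across a session. The sketch below shows one possible shape for that; `UsageTotals` is a hypothetical helper, not part of the OpenAI SDK, and it reads only the fields already shown in this notebook.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class UsageTotals:
    """Running totals across several API calls."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    reasoning_tokens: int = 0

    def add(self, usage) -> None:
        """Add one usage object; missing or None detail fields count as zero."""
        self.prompt_tokens += usage.prompt_tokens
        self.completion_tokens += usage.completion_tokens
        details = getattr(usage, "completion_tokens_details", None)
        self.reasoning_tokens += getattr(details, "reasoning_tokens", 0) or 0


totals = UsageTotals()
for _ in range(3):
    # Mocked usage objects reusing the sample numbers; real code would pass response.usage.
    totals.add(SimpleNamespace(
        prompt_tokens=14,
        completion_tokens=652,
        completion_tokens_details=SimpleNamespace(reasoning_tokens=576),
    ))

print(totals)
```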

### 🧪 Lab Notes: Optimization Strategy

*Based on the analysis of Reasoning Tokens, I have established the following heuristic for model selection:*

| Task Type | Recommended Model Class | Why? |
| :--- | :--- | :--- |
| **Creative Writing / Chat** | Standard (GPT-4o, Llama 3) | No hidden reasoning tokens; you pay only for visible text. |
| **Complex Math / Coding** | Reasoning (o1, o1-mini) | The hidden `reasoning_tokens` are necessary to ensure accuracy. |
| **Simple Extraction** | Small/Fast (GPT-4o-mini) | Avoid paying for "thinking" when the task is trivial. |
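The heuristic in the table can be expressed as a small lookup. This is a sketch under assumptions: the task-type keys and the `pick_model` function are hypothetical, the model names are the ones listed in the table, and falling back to the small model for unknown task types is a design choice made for this example.

```python
# Hypothetical task-type keys mapped to the model classes from the table above.
MODEL_BY_TASK = {
    "creative": "gpt-4o",         # Standard: pay for visible text
    "reasoning": "o1-mini",       # Reasoning: hidden tokens buy accuracy
    "extraction": "gpt-4o-mini",  # Small/fast: trivial tasks
}

DEFAULT_MODEL = "gpt-4o-mini"


def pick_model(task_type: str) -> str:
    """Return the model for a task type, defaulting to the cheapest option."""
    return MODEL_BY_TASK.get(task_type, DEFAULT_MODEL)


for task in ("creative", "reasoning", "extraction", "unknown"):
    print(f"{task:>10} -> {pick_model(task)}")
```

Defaulting to the cheapest model keeps unclassified traffic from silently incurring reasoning costs; a team that values accuracy over cost might choose the opposite default.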

In [None]:
from openai import OpenAI

client = OpenAI()  # Reads OPENAI_API_KEY from the environment


def ask_optimized(question):
    """
    Demonstrates how to minimize token usage for simple queries.
    """

    # 🔴 OPTION A: The "Expensive" Way (described only, not run here)
    # Using a reasoning model or a chatty prompt for a trivial question.
    print(f"❓ Question: {question}")

    # 🟢 OPTION B: The "Optimized" Way
    # 1. Model: Use 'gpt-4o-mini' (cheaper, faster, no hidden reasoning tokens)
    # 2. System prompt: Enforce extreme brevity.
    # 3. max_tokens: Hard limit to prevent rambling.
    
    response = client.chat.completions.create(
        model="gpt-4o-mini", 
        messages=[
            {
                "role": "system", 
                # The "Concise" Instruction is the biggest token saver
                "content": "You are a precise database. Answer immediately. No filler words. No full sentences."
            },
            {"role": "user", "content": question}
        ],
        max_tokens=10,  # Hard cap. A simple fact should fit; anything longer is cut off mid-answer.
        temperature=0   # Minimize randomness. Output is near-deterministic, not guaranteed identical.
    )
    
    result = response.choices[0].message.content
    usage = response.usage
    
    print("\n--- ✅ Optimized Result ---")
    print(f"Answer: {result}")
    print(f"Total Tokens Billable: {usage.total_tokens}")
    print(f"  (Prompt: {usage.prompt_tokens}, Completion: {usage.completion_tokens})")

# Test it
ask_optimized("What is the capital of France?")
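A hard `max_tokens` cap can cut an answer off silently, so it helps to check why generation stopped. In the Chat Completions response, `finish_reason` is `"length"` when the token limit was hit. The sketch below uses a mocked response so it runs without an API key; `was_truncated` is a hypothetical helper name.

```python
from types import SimpleNamespace


def was_truncated(response) -> bool:
    """Return True if the first choice stopped because it hit the token limit."""
    return response.choices[0].finish_reason == "length"


# Mocked responses that mimic the shape of a chat completion.
cut_off = SimpleNamespace(choices=[SimpleNamespace(
    finish_reason="length",
    message=SimpleNamespace(content="The capital of France is"),
)])
complete = SimpleNamespace(choices=[SimpleNamespace(
    finish_reason="stop",
    message=SimpleNamespace(content="Paris"),
)])

for name, resp in (("cut_off", cut_off), ("complete", complete)):
    print(f"{name}: truncated={was_truncated(resp)}")
```

When truncation shows up often, the cap is too tight for the task or the brevity instruction is not being followed, and either is cheaper to fix than to retry.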