---
## 1. Dependencies

We need `tiktoken`, OpenAI's open-source tokenizer library, for accurate token counting.

In [38]:
# Install tiktoken if needed
# !pip install tiktoken

import json
from typing import Any, Dict
from dataclasses import dataclass
import tiktoken

print("✓ Dependencies imported")

✓ Dependencies imported


---
## 2. Exception Class

First, we need an exception for analysis errors:

In [39]:
class Json2ToonError(Exception):
    """Base exception for json2toon errors."""
    pass


class AnalysisError(Json2ToonError):
    """Raised when analysis or metrics operations fail."""
    pass


print("✓ Exception classes defined")

✓ Exception classes defined
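
To make the hierarchy concrete, here is a minimal sketch. It redefines the two classes so it runs on its own, and it uses a hypothetical `parse_count` helper (not part of json2toon) to show two properties: catching `Json2ToonError` also catches `AnalysisError`, and `raise ... from e` keeps the original exception available on `__cause__`.

```python
class Json2ToonError(Exception):
    """Base exception for json2toon errors."""


class AnalysisError(Json2ToonError):
    """Raised when analysis or metrics operations fail."""


def parse_count(raw: str) -> int:
    """Hypothetical helper: wrap a low-level failure in AnalysisError."""
    try:
        return int(raw)
    except ValueError as e:
        # 'from e' records the original error as __cause__
        raise AnalysisError(f"Failed to parse count: {e}") from e


try:
    parse_count("twelve")
except Json2ToonError as err:  # the base class catches the subclass too
    caught = err

print(type(caught).__name__)            # AnalysisError
print(type(caught.__cause__).__name__)  # ValueError
```

Callers can therefore handle every json2toon failure with one `except Json2ToonError` clause and still inspect the underlying error when debugging.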


---
## 3. ComparisonResult Dataclass

A dataclass to hold comparison results. Using `@dataclass` gives us:
- Automatic `__init__`, `__repr__`, `__eq__`
- Clear, typed structure
- Optional immutability via `@dataclass(frozen=True)` (not used here, so fields remain mutable)

In [40]:
@dataclass
class ComparisonResult:
    """
    Results of comparing JSON vs TOON formats.
    
    Attributes:
        json_tokens: Number of tokens in JSON format
        toon_tokens: Number of tokens in TOON format
        savings: Absolute token savings (json - toon)
        savings_percent: Percentage of tokens saved
    """
    json_tokens: int
    toon_tokens: int
    savings: int
    savings_percent: float


# Test the dataclass
result = ComparisonResult(
    json_tokens=100,
    toon_tokens=60,
    savings=40,
    savings_percent=40.0
)
print(f"ComparisonResult: {result}")
print(f"Savings: {result.savings} tokens ({result.savings_percent}%)")

ComparisonResult: ComparisonResult(json_tokens=100, toon_tokens=60, savings=40, savings_percent=40.0)
Savings: 40 tokens (40.0%)
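
Because `savings` and `savings_percent` are stored as plain fields, nothing stops a caller from passing values that disagree with the two token counts. One possible alternative, sketched below under the hypothetical name `DerivedComparison`, stores only the counts, derives the rest through properties, and uses `frozen=True` to reject later reassignment. This is an illustration, not how the json2toon module defines its result type.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class DerivedComparison:
    """Hypothetical variant: only the raw counts are stored."""
    json_tokens: int
    toon_tokens: int

    @property
    def savings(self) -> int:
        # Computed on demand, so it cannot drift from the counts
        return self.json_tokens - self.toon_tokens

    @property
    def savings_percent(self) -> float:
        # Guard against division by zero for an empty baseline
        if self.json_tokens == 0:
            return 0.0
        return self.savings / self.json_tokens * 100


derived = DerivedComparison(json_tokens=100, toon_tokens=60)
print(derived.savings, derived.savings_percent)  # 40 40.0

try:
    derived.json_tokens = 999  # frozen=True rejects reassignment
except FrozenInstanceError:
    print("fields are read-only")
```

The trade-off is that derived values no longer appear in `repr()` output, which the four-field version shows directly.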


---
## 4. Token Counting with tiktoken

### Understanding tiktoken

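tiktoken is OpenAI's tokenizer library. Different models use different encodings, and `tiktoken.encoding_for_model(model_name)` can look up the right one for a given model:
- `cl100k_base`: Used by GPT-4 and GPT-3.5-turbo (the default throughout this notebook)
- `o200k_base`: Used by GPT-4o
- `p50k_base`: Used by older models like text-davinci-003
- `r50k_base`: Used by older models like davinci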

### The count_tokens Function

In [41]:
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """
    Count tokens in text using tiktoken.
    
    Args:
        text: Text to count tokens in
        encoding_name: Tiktoken encoding to use (default: cl100k_base for GPT-4)
        
    Returns:
        Number of tokens in the text
        
    Raises:
        AnalysisError: If token counting fails
        
    How it works:
    1. Look up the encoding object by name
    2. Encode the text into a list of token IDs
    3. Return the length of that list
    """
    try:
        # Look up the encoding (tokenizer) by name
        encoding = tiktoken.get_encoding(encoding_name)
        
        # Encode returns a list of token IDs
        tokens = encoding.encode(text)
        
        # The length is the token count
        return len(tokens)
        
    except Exception as e:
        raise AnalysisError(f"Failed to count tokens: {e}") from e


print("✓ count_tokens function defined")

✓ count_tokens function defined


### Testing Token Counting

Let's see how different text lengths affect token counts:

In [42]:
# Test various strings
test_strings = [
    "Hello",
    "Hello, World!",
    "The quick brown fox jumps over the lazy dog.",
    '{"name": "Alice", "age": 30}',
    'name: Alice\nage: 30',
]

print("Token counts for various strings:")
print("-" * 60)
for s in test_strings:
    tokens = count_tokens(s)
    label = f"{s[:40]}..." if len(s) > 40 else s
    print(f"'{label}' → {tokens} tokens")

Token counts for various strings:
------------------------------------------------------------
'Hello' → 1 tokens
'Hello, World!' → 4 tokens
'The quick brown fox jumps over the lazy ...' → 10 tokens
'{"name": "Alice", "age": 30}' → 12 tokens
'name: Alice
age: 30' → 8 tokens
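
The last entry above wraps onto a second line because the string contains a literal newline. A small sketch of one way to keep each result on a single line: the hypothetical `preview` helper below truncates long strings and then uses `repr()` so control characters are shown escaped.

```python
def preview(text: str, limit: int = 40) -> str:
    """Hypothetical helper: one-line, length-capped view of a string."""
    shown = text if len(text) <= limit else text[:limit] + "..."
    # repr() adds quotes and escapes newlines, keeping output on one line
    return repr(shown)


print(preview("Hello"))                 # 'Hello'
print(preview("name: Alice\nage: 30"))  # 'name: Alice\nage: 30'
print(preview("x" * 50))
```

In the loop above, `print(f"{preview(s)} → {tokens} tokens")` would then replace the conditional expression.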


### Visualizing Tokens

Let's see what tokens actually look like:

In [43]:
def visualize_tokens(text: str, encoding_name: str = "cl100k_base"):
    """Show how text is tokenized."""
    encoding = tiktoken.get_encoding(encoding_name)
    token_ids = encoding.encode(text)
    
    print(f"Text: '{text}'")
    print(f"Token count: {len(token_ids)}")
    print(f"Token IDs: {token_ids}")
    print("Decoded tokens:")
    for tid in token_ids:
        decoded = encoding.decode([tid])
        print(f"  {tid} → '{decoded}'")
    print()


# Visualize JSON vs TOON
visualize_tokens('{"name": "Alice"}')
visualize_tokens('name: Alice')

Text: '{"name": "Alice"}'
Token count: 6
Token IDs: [5018, 609, 794, 330, 62786, 9388]
Decoded tokens:
  5018 → '{"'
  609 → 'name'
  794 → '":'
  330 → ' "'
  62786 → 'Alice'
  9388 → '"}'

Text: 'name: Alice'
Token count: 3
Token IDs: [609, 25, 30505]
Decoded tokens:
  609 → 'name'
  25 → ':'
  30505 → ' Alice'
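
The two listings show where JSON's extra tokens go: most of them carry only quotes, braces, and colons. The sketch below quantifies that using the decoded tokens copied from the output above, so it does not call tiktoken. The `is_syntax_only` rule (a token with no letters or digits counts as pure syntax) is an assumption made for illustration.

```python
from typing import List, Tuple

# Decoded tokens copied from the two visualize_tokens outputs above
json_pieces = ['{"', 'name', '":', ' "', 'Alice', '"}']
toon_pieces = ['name', ':', ' Alice']


def is_syntax_only(piece: str) -> bool:
    """Assumed heuristic: no letters or digits means pure syntax."""
    return not any(ch.isalnum() for ch in piece)


def syntax_share(pieces: List[str]) -> Tuple[int, int]:
    """Return (syntax-only tokens, total tokens)."""
    return sum(is_syntax_only(p) for p in pieces), len(pieces)


print(syntax_share(json_pieces))  # (4, 6)
print(syntax_share(toon_pieces))  # (1, 3)
```

Four of the six JSON tokens are structural, against one of three for TOON, while both strings carry the same two content tokens.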



---
## 5. Format Comparison

The `compare_formats` function compares token usage between JSON and TOON.

### Implementation

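First, let's create a minimal encoder for our examples. It is a stand-in for the full `ToonEncoder`, so its output (and the token counts derived from it) only approximates real TOON: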

In [44]:
class SimpleEncoder:
    """
    Minimal TOON encoder for metrics demonstration.
    (In the real module, we'd use the full ToonEncoder)
    """
    
    def encode(self, data: Any) -> str:
        """Encode data to TOON format."""
        return self._encode_value(data, 0)
    
    def _encode_value(self, value: Any, indent: int) -> str:
        if value is None:
            return "null"
        elif isinstance(value, bool):
            return "true" if value else "false"
        elif isinstance(value, (int, float)):
            return str(value)
        elif isinstance(value, str):
            # Quote if contains special chars or looks like number
            if any(c in value for c in ':[]{}\n') or self._looks_like_number(value):
                return f'"{value}"'
            return value
        elif isinstance(value, list):
            return self._encode_array(value, indent)
        elif isinstance(value, dict):
            return self._encode_object(value, indent)
        return str(value)
    
    def _looks_like_number(self, s: str) -> bool:
        try:
            float(s)
            return True
        except ValueError:
            return False
    
    def _encode_array(self, arr: list, indent: int) -> str:
        if not arr:
            return "[]"
        # Simple arrays use JSON-like format
        if all(isinstance(x, (int, float, str, bool, type(None))) for x in arr):
            items = [self._encode_value(x, indent) for x in arr]
            return f"[{', '.join(items)}]"
        # Complex arrays
        lines = []
        prefix = "  " * (indent + 1)
        for item in arr:
            encoded = self._encode_value(item, indent + 1)
            lines.append(f"{prefix}- {encoded}")
        return "\n" + "\n".join(lines)
    
    def _encode_object(self, obj: dict, indent: int) -> str:
        if not obj:
            return "{}"
        lines = []
        prefix = "  " * indent
        for key, value in obj.items():
            encoded = self._encode_value(value, indent + 1)
            if isinstance(value, dict) and value:
                lines.append(f"{prefix}{key}:")
                for line in encoded.split('\n'):
                    if line.strip():
                        lines.append(f"  {line}")
            else:
                lines.append(f"{prefix}{key}: {encoded}")
        return "\n".join(lines)


encoder = SimpleEncoder()
print("✓ SimpleEncoder created for demonstration")

✓ SimpleEncoder created for demonstration


### The compare_formats Function

In [45]:
def compare_formats(
    data: Any,
    encoder,
    encoding_name: str = "cl100k_base"
) -> ComparisonResult:
    """
    Compare token counts between JSON and TOON formats.
    
    Args:
        data: Python data structure to compare
        encoder: ToonEncoder instance (or compatible)
        encoding_name: Tiktoken encoding to use
        
    Returns:
        ComparisonResult with token counts and savings
        
    Raises:
        AnalysisError: If comparison fails
        
    How it works:
    1. Convert data to a JSON string (pretty-printed with indent=2)
    2. Count tokens in JSON string
    3. Convert data to TOON string using encoder
    4. Count tokens in TOON string
    5. Calculate savings (absolute and percentage)
    """
    try:
        # Step 1 & 2: Get JSON representation and count tokens
        json_str = json.dumps(data, indent=2)
        json_tokens = count_tokens(json_str, encoding_name)
        
        # Step 3 & 4: Get TOON representation and count tokens
        toon_str = encoder.encode(data)
        toon_tokens = count_tokens(toon_str, encoding_name)
        
        # Step 5: Calculate savings
        savings = json_tokens - toon_tokens
        
        # Calculate percentage (avoid division by zero)
        savings_percent = (
            (savings / json_tokens * 100) if json_tokens > 0 else 0.0
        )
        
        return ComparisonResult(
            json_tokens=json_tokens,
            toon_tokens=toon_tokens,
            savings=savings,
            savings_percent=savings_percent
        )
        
    except Exception as e:
        raise AnalysisError(f"Failed to compare formats: {e}") from e


print("✓ compare_formats function defined")

✓ compare_formats function defined
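
Note that `compare_formats` measures TOON against pretty-printed JSON (`indent=2`), whose indentation and newlines are themselves tokenized. Compact JSON is a stricter baseline. The sketch below compares only character lengths as a rough proxy and does not call tiktoken; `sample` is illustrative data. To get real token counts, pass each string to `count_tokens`.

```python
import json

sample = {"name": "Alice", "age": 30, "active": True}  # illustrative data

pretty = json.dumps(sample, indent=2)
# separators=(",", ":") drops the spaces json.dumps adds by default
compact = json.dumps(sample, separators=(",", ":"))

print(compact)  # {"name":"Alice","age":30,"active":true}
print(len(pretty), len(compact))

# Both strings decode to the same data; only the whitespace differs
assert json.loads(pretty) == json.loads(compact) == sample
```

Reporting savings against both baselines gives a fairer picture of what the format change alone contributes.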


### Testing Format Comparison

In [46]:
# Test with simple data
simple_data = {
    "name": "Alice",
    "age": 30,
    "active": True
}

result = compare_formats(simple_data, encoder)

print("Simple Data Comparison:")
print(f"  JSON tokens:  {result.json_tokens}")
print(f"  TOON tokens:  {result.toon_tokens}")
print(f"  Savings:      {result.savings} tokens ({result.savings_percent:.1f}%)")

Simple Data Comparison:
  JSON tokens:  22
  TOON tokens:  12
  Savings:      10 tokens (45.5%)
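
The savings arithmetic from step 5 can be checked in isolation. The sketch below uses a hypothetical `token_savings` helper with the same formula as `compare_formats`. The first pair reproduces the result above; the other two are made-up inputs showing that savings go negative when TOON is larger, and that an empty baseline yields 0.0 instead of a division error.

```python
from typing import Tuple


def token_savings(json_tokens: int, toon_tokens: int) -> Tuple[int, float]:
    """Return (absolute savings, percent savings relative to JSON)."""
    savings = json_tokens - toon_tokens
    # An empty JSON baseline has nothing to save; avoid dividing by zero
    percent = savings / json_tokens * 100 if json_tokens > 0 else 0.0
    return savings, percent


for json_t, toon_t in [(22, 12), (10, 12), (0, 0)]:
    saved, pct = token_savings(json_t, toon_t)
    print(f"{json_t} -> {toon_t}: {saved} tokens ({pct:.1f}%)")
# 22 -> 12: 10 tokens (45.5%)
# 10 -> 12: -2 tokens (-20.0%)
# 0 -> 0: 0 tokens (0.0%)
```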


In [47]:
# Test with more complex data
complex_data = {
    "users": [
        {"name": "Alice", "age": 30, "city": "New York"},
        {"name": "Bob", "age": 25, "city": "Los Angeles"},
        {"name": "Charlie", "age": 35, "city": "Chicago"}
    ],
    "metadata": {
        "total": 3,
        "page": 1,
        "hasMore": False
    }
}

result = compare_formats(complex_data, encoder)

print("Complex Data Comparison:")
print(f"  JSON tokens:  {result.json_tokens}")
print(f"  TOON tokens:  {result.toon_tokens}")
print(f"  Savings:      {result.savings} tokens ({result.savings_percent:.1f}%)")
print()

# Show the actual formats
print("JSON Format:")
print(json.dumps(complex_data, indent=2))
print()
print("TOON Format:")
print(encoder.encode(complex_data))

Complex Data Comparison:
  JSON tokens:  114
  TOON tokens:  78
  Savings:      36 tokens (31.6%)

JSON Format:
{
  "users": [
    {
      "name": "Alice",
      "age": 30,
      "city": "New York"
    },
    {
      "name": "Bob",
      "age": 25,
      "city": "Los Angeles"
    },
    {
      "name": "Charlie",
      "age": 35,
      "city": "Chicago"
    }
  ],
  "metadata": {
    "total": 3,
    "page": 1,
    "hasMore": false
  }
}

TOON Format:
users: 
    -     name: Alice
    age: 30
    city: New York
    -     name: Bob
    age: 25
    city: Los Angeles
    -     name: Charlie
    age: 35
    city: Chicago
metadata:
    total: 3
    page: 1
    hasMore: false
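
In the TOON output above, only the first field of each list item sits next to the dash; the remaining fields fall back to a shallower indentation, so the nesting is ambiguous. This happens because `_encode_array` prepends `- ` to a block that is already indented. One possible fix is sketched below with a hypothetical `format_list_item` helper that takes the item's encoding without leading indentation. The YAML-like layout is chosen for illustration and is not claimed to be the layout the full `ToonEncoder` produces.

```python
def format_list_item(encoded: str, indent: int) -> str:
    """Hypothetical helper: render one multi-line item under a '- ' marker.

    `encoded` is the item's encoding with no leading indentation.
    """
    pad = "  " * indent
    first, *rest = encoded.split("\n")
    lines = [f"{pad}- {first}"]
    # Continuation lines align with the text after the dash
    lines += [f"{pad}  {line}" for line in rest]
    return "\n".join(lines)


item = "name: Alice\nage: 30\ncity: New York"
print("users:")
print(format_list_item(item, indent=1))
# users:
#   - name: Alice
#     age: 30
#     city: New York
```

Changing the encoder this way would also change the whitespace it emits, so the token counts reported above would need to be measured again.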


---
## 6. Report Generation

The `generate_report` function creates formatted reports from comparison results.

### Supported Formats:
- **text**: Simple plain text report
- **json**: Machine-readable JSON format
- **markdown**: Formatted markdown with tables

In [48]:
def generate_report(
    comparison: ComparisonResult,
    output_format: str = "text"
) -> str:
    """
    Generate a comparison report in various formats.
    
    Args:
        comparison: ComparisonResult to report on
        output_format: Output format - 'text', 'json', or 'markdown'
        
    Returns:
        Formatted report string
        
    Raises:
        AnalysisError: If report generation fails
        
    Implementation Notes:
    - JSON format is useful for programmatic processing
    - Markdown format is great for documentation
    - Text format is the default, human-readable option
    """
    try:
        if output_format == "json":
            # JSON format - machine readable
            return json.dumps({
                "json_tokens": comparison.json_tokens,
                "toon_tokens": comparison.toon_tokens,
                "savings": comparison.savings,
                "savings_percent": round(comparison.savings_percent, 2)
            }, indent=2)
        
        elif output_format == "markdown":
            # Markdown format - with table
            return f"""# Token Comparison Report

| Format | Tokens |
|--------|--------|
| JSON   | {comparison.json_tokens} |
| TOON   | {comparison.toon_tokens} |

**Savings:** {comparison.savings} tokens ({comparison.savings_percent:.1f}%)
"""
        
        else:  # text (default; also used for any unrecognized format)
            # Plain text format
            return f"""Token Comparison Report:
- JSON tokens: {comparison.json_tokens}
- TOON tokens: {comparison.toon_tokens}
- Savings: {comparison.savings} tokens ({comparison.savings_percent:.1f}%)"""
    
    except Exception as e:
        raise AnalysisError(f"Failed to generate report: {e}") from e


print("✓ generate_report function defined")

✓ generate_report function defined


### Testing Report Generation

In [49]:
# Create a comparison result for testing
test_result = ComparisonResult(
    json_tokens=150,
    toon_tokens=95,
    savings=55,
    savings_percent=36.67
)

# Text format
print("=== TEXT FORMAT ===")
print(generate_report(test_result, "text"))
print()

=== TEXT FORMAT ===
Token Comparison Report:
- JSON tokens: 150
- TOON tokens: 95
- Savings: 55 tokens (36.7%)



In [50]:
# JSON format
print("=== JSON FORMAT ===")
print(generate_report(test_result, "json"))
print()

=== JSON FORMAT ===
{
  "json_tokens": 150,
  "toon_tokens": 95,
  "savings": 55,
  "savings_percent": 36.67
}



In [51]:
# Markdown format
print("=== MARKDOWN FORMAT ===")
print(generate_report(test_result, "markdown"))

=== MARKDOWN FORMAT ===
# Token Comparison Report

| Format | Tokens |
|--------|--------|
| JSON   | 150 |
| TOON   | 95 |

**Savings:** 55 tokens (36.7%)
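
As written, `generate_report` falls back to the text layout for any unrecognized `output_format`, so a typo such as `"markdwon"` goes unnoticed. A stricter variant could dispatch through a lookup table and reject unknown names. The sketch below is self-contained and hypothetical: `render_report`, `RENDERERS`, and the one-line text layout are illustrative names and choices, the markdown renderer is omitted for brevity, and it raises `ValueError` where the module would more likely raise `AnalysisError`.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Dict


@dataclass
class ComparisonResult:
    json_tokens: int
    toon_tokens: int
    savings: int
    savings_percent: float


def _as_text(c: ComparisonResult) -> str:
    return (
        f"JSON tokens: {c.json_tokens}, TOON tokens: {c.toon_tokens}, "
        f"savings: {c.savings} ({c.savings_percent:.1f}%)"
    )


def _as_json(c: ComparisonResult) -> str:
    # asdict() turns the dataclass into a plain dict for json.dumps
    return json.dumps(asdict(c), indent=2)


# Hypothetical registry: one renderer per supported format name
RENDERERS: Dict[str, Callable[[ComparisonResult], str]] = {
    "text": _as_text,
    "json": _as_json,
}


def render_report(c: ComparisonResult, output_format: str = "text") -> str:
    """Hypothetical strict variant of generate_report."""
    try:
        renderer = RENDERERS[output_format]
    except KeyError:
        supported = ", ".join(sorted(RENDERERS))
        raise ValueError(
            f"Unknown format {output_format!r}; expected one of: {supported}"
        ) from None
    return renderer(c)


demo_result = ComparisonResult(150, 95, 55, 36.67)
print(render_report(demo_result, "text"))
```

Adding a format then means registering one more function in `RENDERERS` instead of extending an `if`/`elif` chain.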



---
## 7. Complete Example

Let's put everything together with a realistic API response:

In [52]:
# Real-world API response data
api_response = {
    "status": "success",
    "code": 200,
    "data": {
        "products": [
            {"id": 1, "name": "Laptop", "price": 999.99, "inStock": True},
            {"id": 2, "name": "Mouse", "price": 29.99, "inStock": True},
            {"id": 3, "name": "Keyboard", "price": 79.99, "inStock": False},
            {"id": 4, "name": "Monitor", "price": 349.99, "inStock": True},
            {"id": 5, "name": "Headphones", "price": 149.99, "inStock": True}
        ],
        "pagination": {
            "page": 1,
            "pageSize": 5,
            "totalItems": 42,
            "totalPages": 9
        }
    },
    "timestamp": "2024-01-15T10:30:00Z"
}

# Ensure helpers are available even if earlier cells were skipped
try:
    compare_formats
    generate_report
except NameError:
    from json2toon.metrics import compare_formats, generate_report

# Compare formats
comparison = compare_formats(api_response, encoder)

# Generate and display report
print(generate_report(comparison, "markdown"))

# Token Comparison Report

| Format | Tokens |
|--------|--------|
| JSON   | 258 |
| TOON   | 196 |

**Savings:** 62 tokens (24.0%)



In [53]:
# Show the actual difference
print("=" * 60)
print("JSON FORMAT:")
print("=" * 60)
print(json.dumps(api_response, indent=2))
print()
print("=" * 60)
print("TOON FORMAT:")
print("=" * 60)
print(encoder.encode(api_response))

JSON FORMAT:
{
  "status": "success",
  "code": 200,
  "data": {
    "products": [
      {
        "id": 1,
        "name": "Laptop",
        "price": 999.99,
        "inStock": true
      },
      {
        "id": 2,
        "name": "Mouse",
        "price": 29.99,
        "inStock": true
      },
      {
        "id": 3,
        "name": "Keyboard",
        "price": 79.99,
        "inStock": false
      },
      {
        "id": 4,
        "name": "Monitor",
        "price": 349.99,
        "inStock": true
      },
      {
        "id": 5,
        "name": "Headphones",
        "price": 149.99,
        "inStock": true
      }
    ],
    "pagination": {
      "page": 1,
      "pageSize": 5,
      "totalItems": 42,
      "totalPages": 9
    }
  },
  "timestamp": "2024-01-15T10:30:00Z"
}

TOON FORMAT:
status: success
code: 200
data:
    products: 
        -       id: 1
        name: Laptop
        price: 999.99
        inStock: true
        -       id: 2
        name: Mouse
        price: 2

---
## 8. Summary

### Key Components Implemented:

1. **ComparisonResult** - Dataclass holding comparison metrics
2. **count_tokens()** - Token counting using tiktoken
3. **compare_formats()** - Compare JSON vs TOON token usage
4. **generate_report()** - Generate formatted reports

### Key Concepts:

- **tiktoken** is OpenAI's tokenizer library
- **cl100k_base** encoding is used by GPT-4/3.5
- In the examples above, TOON saved between 24.0% and 45.5% of tokens relative to pretty-printed (`indent=2`) JSON
- Savings vary based on data structure

### Why This Matters:

- LLM costs are based on token usage
- Context windows have token limits
- Smaller representations = more data per prompt
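
To put the last point in numbers, the sketch below takes the token counts from the complete example (258 for JSON, 196 for TOON) and asks how many payloads of that size fit in a fixed budget. The 8,192-token budget and the `payloads_per_budget` helper are assumptions for illustration; a real prompt also spends tokens on instructions and the model's reply.

```python
def payloads_per_budget(tokens_per_payload: int, budget: int) -> int:
    """How many equally sized payloads fit in a token budget."""
    if tokens_per_payload <= 0:
        raise ValueError("tokens_per_payload must be positive")
    return budget // tokens_per_payload


BUDGET = 8192  # assumed budget for illustration, not tied to a specific model

json_fit = payloads_per_budget(258, BUDGET)
toon_fit = payloads_per_budget(196, BUDGET)
print(json_fit, toon_fit)  # 31 41
```

Under these assumptions the same budget holds ten more responses in TOON than in JSON.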

In [54]:
print("✓ Module 6: Metrics - Complete!")
print()
print("Functions implemented:")
print("  - count_tokens(text, encoding_name)")
print("  - compare_formats(data, encoder, encoding_name)")
print("  - generate_report(comparison, output_format)")
print()
print("Classes implemented:")
print("  - ComparisonResult (dataclass)")
print("  - AnalysisError (exception)")

✓ Module 6: Metrics - Complete!

Functions implemented:
  - count_tokens(text, encoding_name)
  - compare_formats(data, encoder, encoding_name)
  - generate_report(comparison, output_format)

Classes implemented:
  - ComparisonResult (dataclass)
  - AnalysisError (exception)
