# JSON vs TOON-like: Token Efficiency Mini Experiment

**Author:** Cristina Varas Menadas  

This notebook explores how different input formats impact token usage for Large Language Models (LLMs), using a simple synthetic dataset of user records. The goal is not to perfectly replicate the TOON spec, but to get a **practical, data-driven feel** for:

- How much more verbose **pretty-printed JSON** is compared to other options.
- How much we can gain just by using **minified JSON**.
- How a **TOON-like, tabular format** can further reduce token count for uniform, structured data.

### What this notebook does

- Generates fake user data (id, name, country, purchases, segment).
- Encodes the same data in three formats:
  - JSON (pretty-printed, human-friendly).
  - JSON (minified, no extra whitespace).
  - A simple **TOON-like** format: `users[N]{cols}:\nvalues...`
- Uses the `cl100k_base` tokenizer (the encoding used by GPT-4 and GPT-3.5-turbo; newer OpenAI models such as GPT-4o use `o200k_base`) to measure:
  - Token counts per format.
  - Character counts per format.
- Compares token savings of:
  - TOON-like vs pretty JSON.
  - TOON-like vs minified JSON.

### Why this matters

There is a lot of buzz around new LLM-oriented formats like TOON that claim **30–60% token savings** compared to JSON.  
This notebook helps answer a more grounded question:

> *“How much do I really gain in practice, and how much can I already save just by cleaning up my JSON and prompts?”*

Use this as a small, reproducible experiment to support discussions or articles on **LLM cost optimization**, **prompt design**, and **data formats** for AI workloads.


## 1. Install dependencies

In [1]:
!pip install tiktoken pandas

Collecting tiktoken
  Downloading tiktoken-0.12.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (6.7 kB)
Downloading tiktoken-0.12.0-cp311-cp311-manylinux_2_28_x86_64.whl (1.2 MB)
Installing collected packages: tiktoken
Successfully installed tiktoken-0.12.0


## 2. Imports, tokenizer, and helper to count tokens

In [2]:
import json
import random
import tiktoken
import pandas as pd

# Use the cl100k_base encoding (the tokenizer used by GPT-4 and GPT-3.5-turbo)
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Count tokens using the cl100k_base tokenizer."""
    return len(enc.encode(text))


## 3. Generate sample user data

In [3]:
def generate_users(n_rows: int):
    """
    Generate a list of dictionaries with fake user data.
    This simulates a typical payload you might send to an LLM:
    users, purchases, country, segment, etc.
    """
    countries = ["ES", "FR", "DE", "IT", "UK", "PL"]
    segments = ["standard", "vip", "new", "churn_risk"]

    data = []
    for i in range(1, n_rows + 1):
        user = {
            "id": i,
            "name": f"User {i}",
            "country": random.choice(countries),
            "purchases": random.randint(0, 50),
            "segment": random.choice(segments),
        }
        data.append(user)
    return data

# Quick sanity check
sample_data = generate_users(3)
sample_data


[{'id': 1,
  'name': 'User 1',
  'country': 'DE',
  'purchases': 8,
  'segment': 'standard'},
 {'id': 2,
  'name': 'User 2',
  'country': 'FR',
  'purchases': 38,
  'segment': 'churn_risk'},
 {'id': 3,
  'name': 'User 3',
  'country': 'IT',
  'purchases': 27,
  'segment': 'new'}]
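
Because `generate_users` draws from the global `random` state, every run produces different rows and slightly different token counts. If exact reproducibility matters, one option is a seeded variant. The sketch below is an illustration, not part of the original experiment: the name `generate_users_seeded` and the default seed of 42 are assumptions.

```python
import random

def generate_users_seeded(n_rows: int, seed: int = 42):
    """Like generate_users, but draws from a private, seeded generator."""
    # A local Random instance leaves the global random state untouched.
    rng = random.Random(seed)
    countries = ["ES", "FR", "DE", "IT", "UK", "PL"]
    segments = ["standard", "vip", "new", "churn_risk"]
    return [
        {
            "id": i,
            "name": f"User {i}",
            "country": rng.choice(countries),
            "purchases": rng.randint(0, 50),
            "segment": rng.choice(segments),
        }
        for i in range(1, n_rows + 1)
    ]

# The same seed yields the same rows on every call.
first = generate_users_seeded(3, seed=42)
second = generate_users_seeded(3, seed=42)
print(first == second)
```

The recorded outputs in this notebook come from the unseeded version, so a seeded run will not match them row for row.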

## 4. JSON pretty, JSON minified, and TOON-like representation

In [4]:
def to_json_pretty(data):
    """Return JSON with indentation (human-readable)."""
    return json.dumps(data, indent=2, ensure_ascii=False)

def to_json_minified(data):
    """Return minified JSON (no unnecessary spaces or newlines)."""
    return json.dumps(data, separators=(",", ":"), ensure_ascii=False)

def to_toon_like(data, root_name="users"):
    """
    Create a simple TOON-like representation:

    users[N]{col1,col2,...}:
      v1,v2,...
      v1,v2,...

    Assumes a list of dicts with the same keys.
    This is not a full official TOON implementation, but it is
    close enough to illustrate the idea: a compact, tabular format
    for LLM prompts.
    """
    if not data:
        return ""

    keys = list(data[0].keys())
    header = f"{root_name}[{len(data)}]{{{','.join(keys)}}}:"

    lines = [header]

    for row in data:
        values = []
        for k in keys:
            v = row[k]
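            if isinstance(v, str):
                # CSV-style quoting: wrap the value in double quotes when it
                # contains a comma, quote, or newline, doubling embedded quotes.
                if any(ch in v for ch in ',"\n'):
                    v = '"' + v.replace('"', '""') + '"'
                values.append(v)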
            else:
                values.append(str(v))
        lines.append(",".join(values))

    return "\n".join(lines)

# Quick example
toon_example = to_toon_like(sample_data)
print(toon_example)


users[3]{id,name,country,purchases,segment}:
1,User 1,DE,8,standard
2,User 2,FR,38,churn_risk
3,User 3,IT,27,new
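
A compact format is only useful if the data can be recovered from it. As a rough check, the sketch below parses the TOON-like text back into a list of dicts. It is an illustration under stated assumptions: `from_toon_like` is a hypothetical helper, it expects the unindented layout printed above with CSV-style quoting, and it turns any all-digit value into an `int`, which would be wrong for string fields that merely look numeric.

```python
import csv
import io
import re

# Matches headers such as: users[3]{id,name,country,purchases,segment}:
HEADER_RE = re.compile(r"^(\w+)\[(\d+)\]\{([^}]*)\}:$")

def from_toon_like(text: str):
    """Parse TOON-like text into (root_name, list of dicts)."""
    header, _, body = text.strip().partition("\n")
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError(f"Unrecognized header: {header!r}")
    root_name = match.group(1)
    declared_rows = int(match.group(2))
    columns = match.group(3).split(",")

    # csv.reader handles quoted fields and doubled quotes.
    rows = list(csv.reader(io.StringIO(body)))
    if len(rows) != declared_rows:
        raise ValueError(f"Header declares {declared_rows} rows, found {len(rows)}")

    def convert(value: str):
        # Assumption: integer-looking values were integers in the source data.
        return int(value) if re.fullmatch(r"-?\d+", value) else value

    return root_name, [dict(zip(columns, map(convert, row))) for row in rows]

example = """users[3]{id,name,country,purchases,segment}:
1,User 1,DE,8,standard
2,User 2,FR,38,churn_risk
3,User 3,IT,27,new"""

root, parsed = from_toon_like(example)
print(root, parsed[0])
```

The row count in the header doubles as a cheap integrity check: a truncated payload fails loudly instead of silently losing records.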


## 5. Visual comparison: JSON vs TOON-like

In [5]:
json_pretty_example = to_json_pretty(sample_data)
json_min_example = to_json_minified(sample_data)

print("=== JSON pretty ===")
print(json_pretty_example[:400], "...\n")

print("=== JSON minified ===")
print(json_min_example[:400], "...\n")

print("=== TOON-like ===")
print(toon_example[:400], "...")


=== JSON pretty ===
[
  {
    "id": 1,
    "name": "User 1",
    "country": "DE",
    "purchases": 8,
    "segment": "standard"
  },
  {
    "id": 2,
    "name": "User 2",
    "country": "FR",
    "purchases": 38,
    "segment": "churn_risk"
  },
  {
    "id": 3,
    "name": "User 3",
    "country": "IT",
    "purchases": 27,
    "segment": "new"
  }
] ...

=== JSON minified ===
[{"id":1,"name":"User 1","country":"DE","purchases":8,"segment":"standard"},{"id":2,"name":"User 2","country":"FR","purchases":38,"segment":"churn_risk"},{"id":3,"name":"User 3","country":"IT","purchases":27,"segment":"new"}] ...

=== TOON-like ===
users[3]{id,name,country,purchases,segment}:
1,User 1,DE,8,standard
2,User 2,FR,38,churn_risk
3,User 3,IT,27,new ...
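
Comparing the three outputs side by side, the most visible difference is that JSON repeats every key in every record, while the tabular format names each column once in the header. The sketch below makes that concrete at the character level for the three sample rows shown above (hard-coded here so the snippet runs on its own). It counts characters, not tokens, so it only indicates where the overhead sits.

```python
import json

sample = [
    {"id": 1, "name": "User 1", "country": "DE", "purchases": 8, "segment": "standard"},
    {"id": 2, "name": "User 2", "country": "FR", "purchases": 38, "segment": "churn_risk"},
    {"id": 3, "name": "User 3", "country": "IT", "purchases": 27, "segment": "new"},
]

minified = json.dumps(sample, separators=(",", ":"), ensure_ascii=False)

# Characters spent on `"key":` across all records (quotes and colon included).
key_chars = sum(len(json.dumps(key)) + 1 for row in sample for key in row)

# The tabular header lists the column names once, whatever the row count.
header = f"users[{len(sample)}]{{{','.join(sample[0])}}}:"

print(f"minified JSON : {len(minified)} chars, {key_chars} of them repeated keys")
print(f"tabular header: {len(header)} chars, paid once")
```

The key overhead grows linearly with the number of rows, whereas the header cost stays fixed apart from the digits of the row count.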


## 6. Run the experiment: tokens and characters for different sizes

In [6]:
def run_experiment(row_counts=(10, 100, 1000)):
    """
    For each number of rows, generate fake users and compare:
      - JSON pretty
      - JSON minified
      - TOON-like

    We measure:
      - token counts
      - character counts
      - relative token savings of TOON-like vs JSON pretty and JSON minified
    """
    results = []

    for n in row_counts:
        data = generate_users(n)

        json_pretty = to_json_pretty(data)
        json_minified = to_json_minified(data)
        toon_text = to_toon_like(data)

        row_result = {
            "rows": n,
            "json_pretty_tokens": count_tokens(json_pretty),
            "json_minified_tokens": count_tokens(json_minified),
            "toon_tokens": count_tokens(toon_text),
            "json_pretty_chars": len(json_pretty),
            "json_minified_chars": len(json_minified),
            "toon_chars": len(toon_text),
        }
        results.append(row_result)

    df = pd.DataFrame(results)

    # Relative savings (percentage)
    df["toon_vs_pretty_%"] = (1 - df["toon_tokens"] / df["json_pretty_tokens"]) * 100
    df["toon_vs_minified_%"] = (1 - df["toon_tokens"] / df["json_minified_tokens"]) * 100

    return df

df_results = run_experiment(row_counts=(10, 100, 1000, 5000))
df_results


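|   | rows | json_pretty_tokens | json_minified_tokens | toon_tokens | json_pretty_chars | json_minified_chars | toon_chars | toon_vs_pretty_% | toon_vs_minified_% |
|---|------|--------------------|----------------------|-------------|-------------------|---------------------|------------|------------------|--------------------|
| 0 | 10 | 434 | 245 | 140 | 1115 | 754 | 278 | 67.741935 | 42.857143 |
| 1 | 100 | 4277 | 2378 | 1236 | 11165 | 7564 | 2409 | 71.101239 | 48.023549 |
| 2 | 1000 | 42808 | 23809 | 12282 | 113699 | 77698 | 25744 | 71.309101 | 48.414465 |
| 3 | 5000 | 221700 | 126701 | 69285 | 576868 | 396867 | 136913 | 68.748309 | 45.316138 |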
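
The results report TOON-like savings against both JSON variants, but not how much minification alone saves over pretty-printed JSON. That figure follows from the same formula, `(1 - candidate / baseline) * 100`. The sketch below applies it to the token counts recorded in the results above; the helper name `savings_pct` is introduced here for illustration.

```python
# Token counts copied from the recorded experiment results.
token_counts = {
    10: {"pretty": 434, "minified": 245, "toon": 140},
    100: {"pretty": 4277, "minified": 2378, "toon": 1236},
    1000: {"pretty": 42808, "minified": 23809, "toon": 12282},
    5000: {"pretty": 221700, "minified": 126701, "toon": 69285},
}

def savings_pct(baseline: int, candidate: int) -> float:
    """Percentage of baseline tokens saved by switching to candidate."""
    return (1 - candidate / baseline) * 100

# Savings from minification alone, with no change of format.
minify_savings = {
    rows: savings_pct(counts["pretty"], counts["minified"])
    for rows, counts in token_counts.items()
}

for rows, pct in minify_savings.items():
    print(f"{rows:>5} rows: minifying saves {pct:.1f}% of pretty-JSON tokens")
```

For this dataset, minification alone accounts for a large share of the gap between pretty-printed JSON and the TOON-like format.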


In [7]:
def print_summary(df):
    """
    Print a human-readable summary of the experiment
    that you can reuse in your article.
    """
    for _, row in df.iterrows():
        print(f"Rows: {int(row['rows'])}")
        # iterrows() upcasts mixed int/float rows to float; cast counts back to int
        print(f"  JSON pretty   : {int(row['json_pretty_tokens'])} tokens")
        print(f"  JSON minified : {int(row['json_minified_tokens'])} tokens")
        print(f"  TOON-like     : {int(row['toon_tokens'])} tokens")
        print(f"  TOON vs pretty   : {row['toon_vs_pretty_%']:.1f}% fewer tokens")
        print(f"  TOON vs minified : {row['toon_vs_minified_%']:.1f}% fewer tokens")
        print()

print_summary(df_results)


Rows: 10
  JSON pretty   : 434 tokens
  JSON minified : 245 tokens
  TOON-like     : 140 tokens
  TOON vs pretty   : 67.7% fewer tokens
  TOON vs minified : 42.9% fewer tokens

Rows: 100
  JSON pretty   : 4277 tokens
  JSON minified : 2378 tokens
  TOON-like     : 1236 tokens
  TOON vs pretty   : 71.1% fewer tokens
  TOON vs minified : 48.0% fewer tokens

Rows: 1000
  JSON pretty   : 42808 tokens
  JSON minified : 23809 tokens
  TOON-like     : 12282 tokens
  TOON vs pretty   : 71.3% fewer tokens
  TOON vs minified : 48.4% fewer tokens

Rows: 5000
  JSON pretty   : 221700 tokens
  JSON minified : 126701 tokens
  TOON-like     : 69285 tokens
  TOON vs pretty   : 68.7% fewer tokens
  TOON vs minified : 45.3% fewer tokens
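
Another way to read the summary is as an average token cost per record, which is easier to plug into a budget estimate than a percentage. The sketch below divides the recorded token counts by the row count for each format. It uses only numbers from the output above and assumes nothing about pricing.

```python
# (rows, json_pretty_tokens, json_minified_tokens, toon_tokens) from the recorded run.
recorded = [
    (10, 434, 245, 140),
    (100, 4277, 2378, 1236),
    (1000, 42808, 23809, 12282),
    (5000, 221700, 126701, 69285),
]

# Average tokens per record for each format and dataset size.
per_row = {
    rows: {
        "pretty": pretty / rows,
        "minified": minified / rows,
        "toon": toon / rows,
    }
    for rows, pretty, minified, toon in recorded
}

for rows, avg in per_row.items():
    print(
        f"{rows:>5} rows: pretty {avg['pretty']:.1f}, "
        f"minified {avg['minified']:.1f}, TOON-like {avg['toon']:.1f} tokens/row"
    )
```

Multiplying a per-row figure by an expected record count gives a quick estimate of prompt size for a given format, at least for data shaped like this synthetic set.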

