# Assignment 2: Counting Tokens and Estimating Cost

Example:
https://tiktokenizer.vercel.app/?model=gpt-4-1106-preview

## The completed tasks are at the end of the notebook

## Objective:
Write a Python program to count the number of tokens in a given text source (paragraph, text file, or PDF file) and calculate the cost of processing these tokens using OpenAI's GPT-4o pricing model.

## Requirements:
1. **Input**:
   - A paragraph as a string.
   - A text file containing the content.
   - A PDF file with the content.
2. **Output**:
   - The total number of tokens in the input.
   - The estimated cost of processing the input tokens using GPT-4o pricing ($2.50 per 1M tokens).
3. **Constraints**:
   - Use OpenAI's `tiktoken` library to tokenize the input.
   - Ensure compatibility with different file formats (text and PDF).
   - Handle invalid or empty inputs gracefully.
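
One way to meet the input requirements is a small loader that turns any of the three sources into plain text before tokenizing. The sketch below is an assumption about how that could look, not part of the assignment: `load_text` is a hypothetical helper that treats an existing path ending in `.pdf` as a PDF (read with `PyPDF2`, imported lazily), any other existing file as UTF-8 text, and anything else as the paragraph itself.

```python
import os


def load_text(source: str) -> str:
    """Return the text of a paragraph, a text file, or a PDF file.

    Hypothetical helper: `source` is treated as a file path if such a file
    exists, otherwise as the paragraph itself.
    """
    if not isinstance(source, str) or not source.strip():
        raise ValueError("Input must be a non-empty string or file path.")

    if os.path.isfile(source):
        if source.lower().endswith(".pdf"):
            # Imported lazily so plain-text use does not require PyPDF2.
            from PyPDF2 import PdfReader

            reader = PdfReader(source)
            # extract_text() may return None for pages without a text layer.
            text = "".join(page.extract_text() or "" for page in reader.pages)
        else:
            with open(source, "r", encoding="utf-8") as f:
                text = f.read()
    else:
        text = source

    if not text.strip():
        raise ValueError(f"No text could be read from: {source!r}")
    return text
```

Raising `ValueError` for empty input keeps the "handle gracefully" decision with the caller, who can catch it and print a friendly message.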

## Example:
### Input:
```plaintext
Paragraph: "Tiktoken is a tokenizer by OpenAI. It splits text into tokens."
```
### Output:
```plaintext
Total Tokens: 12
Estimated Cost: $0.00003
```

### Input:
```plaintext
Text File: "example.txt" (contains 500 tokens)
```
### Output:
```plaintext
Total Tokens: 500
Estimated Cost: $0.00125
```

## Extra Credit:
Whoever calculates the total number of tokens and the cost for all 6 **Harry Potter** books combined will receive an **assignment pass**, which can be used to skip any future assignment.

### Additional Notes:
- Use `tiktoken`'s `encode` method to calculate token counts.
- For PDF files, extract the text content first (e.g., using a library like `PyPDF2`).
- Costs should be calculated as `total_tokens / 1,000,000 * 2.50`.
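
As a quick check of the cost formula, the sketch below applies it to the two token counts from the examples above. `estimate_cost` is a hypothetical helper name, and the $2.50 default is simply the rate stated in this assignment.

```python
def estimate_cost(total_tokens: int, price_per_million: float = 2.50) -> float:
    """Estimated cost in USD for `total_tokens` at a per-million-token price."""
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    return total_tokens / 1_000_000 * price_per_million


print(f"${estimate_cost(500):.5f}")  # $0.00125
print(f"${estimate_cost(12):.5f}")   # $0.00003
```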


# How to count tokens with tiktoken

[`tiktoken`](https://github.com/openai/tiktoken/blob/main/README.md) is a fast open-source tokenizer by OpenAI.

Given a text string (e.g., `"tiktoken is great!"`) and an encoding (e.g., `"cl100k_base"`), a tokenizer can split the text string into a list of tokens (e.g., `["t", "ik", "token", " is", " great", "!"]`).

Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token).
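
For point (a), a length check can be as simple as comparing the token count against the model's limit. The sketch below is illustrative only: `fits_in_context` is a hypothetical helper, and the window size is a parameter because the real limit depends on the model and should be taken from its documentation.

```python
def fits_in_context(prompt_tokens: int, context_window: int, reserved_for_output: int = 0) -> bool:
    """True if the prompt plus the tokens reserved for the reply fit in the window."""
    if min(prompt_tokens, context_window, reserved_for_output) < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + reserved_for_output <= context_window


# Hypothetical numbers: an 8,000-token window with 1,000 tokens kept for the reply.
print(fits_in_context(6_500, context_window=8_000, reserved_for_output=1_000))  # True
print(fits_in_context(7_500, context_window=8_000, reserved_for_output=1_000))  # False
```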


## Encodings

Encodings specify how text is converted into tokens. Different models use different encodings.

`tiktoken` supports four encodings used by OpenAI models:

| Encoding name           | OpenAI models                                       |
|-------------------------|-----------------------------------------------------|
| `o200k_base`            | `gpt-4o`, `gpt-4o-mini`                             |
| `cl100k_base`           | `gpt-4-turbo`, `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002`, `text-embedding-3-small`, `text-embedding-3-large`  |
| `p50k_base`             | Codex models, `text-davinci-002`, `text-davinci-003`|
| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci`                         |

You can retrieve the encoding for a model using `tiktoken.encoding_for_model()` as follows:
```python
encoding = tiktoken.encoding_for_model('gpt-4o-mini')
```

Note that `p50k_base` overlaps substantially with `r50k_base`, and for non-code applications, they will usually give the same tokens.

## Tokenizer libraries by language

For `o200k_base`, `cl100k_base` and `p50k_base` encodings:
- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md)
- .NET / C#: [SharpToken](https://github.com/dmitry-brazhenko/SharpToken), [TiktokenSharp](https://github.com/aiqinxuancai/TiktokenSharp)
- Java: [jtokkit](https://github.com/knuddelsgmbh/jtokkit)
- Golang: [tiktoken-go](https://github.com/pkoukk/tiktoken-go)
- Rust: [tiktoken-rs](https://github.com/zurawiki/tiktoken-rs)

For `r50k_base` (`gpt2`) encodings, tokenizers are available in many languages.
- Python: [tiktoken](https://github.com/openai/tiktoken/blob/main/README.md) (or alternatively [GPT2TokenizerFast](https://huggingface.co/docs/transformers/model_doc/gpt2#transformers.GPT2TokenizerFast))
- JavaScript: [gpt-3-encoder](https://www.npmjs.com/package/gpt-3-encoder)
- .NET / C#: [GPT Tokenizer](https://github.com/dluc/openai-tools)
- Java: [gpt2-tokenizer-java](https://github.com/hyunwoongko/gpt2-tokenizer-java)
- PHP: [GPT-3-Encoder-PHP](https://github.com/CodeRevolutionPlugins/GPT-3-Encoder-PHP)
- Golang: [tiktoken-go](https://github.com/pkoukk/tiktoken-go)
- Rust: [tiktoken-rs](https://github.com/zurawiki/tiktoken-rs)

(OpenAI makes no endorsements or guarantees of third-party libraries.)


## How strings are typically tokenized

In English, tokens commonly range in length from one character to one word (e.g., `"t"` or `" great"`), though in some languages tokens can be shorter than one character or longer than one word. Spaces are usually grouped with the starts of words (e.g., `" is"` instead of `"is "` or `" "`+`"is"`). You can quickly check how a string is tokenized at the [OpenAI Tokenizer](https://beta.openai.com/tokenizer), or the third-party [Tiktokenizer](https://tiktokenizer.vercel.app/) webapp.

## 0. Install `tiktoken`

If needed, install `tiktoken` with `pip`:

In [1]:
# %pip install --upgrade tiktoken -q
# %pip install --upgrade openai -q

## 1. Import `tiktoken`

In [2]:
import tiktoken

## 2. Load an encoding

Use `tiktoken.get_encoding()` to load an encoding by name.

The first time this runs, it will require an internet connection to download. Later runs won't need an internet connection.

In [3]:
# encoding = tiktoken.get_encoding("cl100k_base")
encoding = tiktoken.get_encoding("o200k_base")

Use `tiktoken.encoding_for_model()` to automatically load the correct encoding for a given model name.

In [4]:
# encoding = tiktoken.encoding_for_model("gpt-4o-mini")
encoding = tiktoken.encoding_for_model("gpt-4o")

## 3. Turn text into tokens with `encoding.encode()`



The `.encode()` method converts a text string into a list of token integers.

In [5]:
encoding.encode("tiktoken is great!")

[83, 8251, 2488, 382, 2212, 0]

Count tokens by counting the length of the list returned by `.encode()`.

In [6]:
def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

In [7]:
num_tokens_from_string("tiktoken is great!", "o200k_base")

6

## 4. Turn tokens into text with `encoding.decode()`

`.decode()` converts a list of token integers to a string.

In [8]:
encoding.decode([83, 8251, 2488, 382, 2212, 0])

'tiktoken is great!'

Warning: although `.decode()` can be applied to single tokens, beware that it can be lossy for tokens that aren't on UTF-8 boundaries.

For single tokens, `.decode_single_token_bytes()` safely converts a single integer token to the bytes it represents.

In [9]:
[encoding.decode_single_token_bytes(token) for token in [83, 8251, 2488, 382, 2212, 0]]

[b't', b'ikt', b'oken', b' is', b' great', b'!']

(The `b` in front of the strings indicates that the strings are byte strings.)
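
To illustrate why decoding single tokens can be lossy, the sketch below uses plain Python bytes instead of `tiktoken`. It assumes a tokenizer has split one two-byte character across two tokens, and it decodes with `errors="replace"` to stand in for a decoder that substitutes invalid bytes rather than raising.

```python
# "ع" is a single character that UTF-8 stores as two bytes.
char_bytes = "ع".encode("utf-8")
assert char_bytes == b"\xd8\xb9"

# Pretend a tokenizer split the character across two tokens.
first, second = char_bytes[:1], char_bytes[1:]

# Decoding each piece alone yields two replacement characters ...
lossy = first.decode("utf-8", errors="replace") + second.decode("utf-8", errors="replace")
# ... while joining the bytes first and decoding once recovers the character.
exact = (first + second).decode("utf-8")

print(repr(lossy))  # '\ufffd\ufffd'
print(repr(exact))  # 'ع'
```

This is why working with the raw bytes of each token, and joining them before decoding, is the safer route when inspecting tokens one at a time.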

## 5. Comparing encodings

Different encodings vary in how they split words, group spaces, and handle non-English characters. Using the methods above, we can compare different encodings on a few example strings.

In [10]:
def compare_encodings(example_string: str) -> None:
    """Prints a comparison of four string encodings."""
    # print the example string
    print(f'\nExample string: "{example_string}"')
    # for each encoding, print the # of tokens, the token integers, and the token bytes
    for encoding_name in ["r50k_base", "p50k_base", "cl100k_base", "o200k_base"]:
        encoding = tiktoken.get_encoding(encoding_name)
        token_integers = encoding.encode(example_string)
        num_tokens = len(token_integers)
        token_bytes = [encoding.decode_single_token_bytes(token) for token in token_integers]
        print()
        print(f"{encoding_name}: {num_tokens} tokens")
        print(f"token integers: {token_integers}")
        print(f"token bytes: {token_bytes}")

In [11]:
compare_encodings("antidisestablishmentarianism")


Example string: "antidisestablishmentarianism"

r50k_base: 5 tokens
token integers: [415, 29207, 44390, 3699, 1042]
token bytes: [b'ant', b'idis', b'establishment', b'arian', b'ism']

p50k_base: 5 tokens
token integers: [415, 29207, 44390, 3699, 1042]
token bytes: [b'ant', b'idis', b'establishment', b'arian', b'ism']

cl100k_base: 6 tokens
token integers: [519, 85342, 34500, 479, 8997, 2191]
token bytes: [b'ant', b'idis', b'establish', b'ment', b'arian', b'ism']

o200k_base: 6 tokens
token integers: [493, 129901, 376, 160388, 21203, 2367]
token bytes: [b'ant', b'idis', b'est', b'ablishment', b'arian', b'ism']


In [12]:
compare_encodings("2 + 2 = 4")


Example string: "2 + 2 = 4"

r50k_base: 5 tokens
token integers: [17, 1343, 362, 796, 604]
token bytes: [b'2', b' +', b' 2', b' =', b' 4']

p50k_base: 5 tokens
token integers: [17, 1343, 362, 796, 604]
token bytes: [b'2', b' +', b' 2', b' =', b' 4']

cl100k_base: 7 tokens
token integers: [17, 489, 220, 17, 284, 220, 19]
token bytes: [b'2', b' +', b' ', b'2', b' =', b' ', b'4']

o200k_base: 7 tokens
token integers: [17, 659, 220, 17, 314, 220, 19]
token bytes: [b'2', b' +', b' ', b'2', b' =', b' ', b'4']


In [13]:
compare_encodings("السلام علیکم ، کیسے ہیں آپ؟")


Example string: "السلام علیکم ، کیسے ہیں آپ؟"

r50k_base: 36 tokens
token integers: [23525, 45692, 13862, 12919, 25405, 17550, 117, 13862, 151, 234, 150, 102, 25405, 17550, 234, 220, 150, 102, 151, 234, 45692, 151, 240, 220, 151, 223, 151, 234, 150, 118, 17550, 95, 149, 122, 148, 253]
token bytes: [b'\xd8\xa7\xd9\x84', b'\xd8\xb3', b'\xd9\x84', b'\xd8\xa7', b'\xd9\x85', b' \xd8', b'\xb9', b'\xd9\x84', b'\xdb', b'\x8c', b'\xda', b'\xa9', b'\xd9\x85', b' \xd8', b'\x8c', b' ', b'\xda', b'\xa9', b'\xdb', b'\x8c', b'\xd8\xb3', b'\xdb', b'\x92', b' ', b'\xdb', b'\x81', b'\xdb', b'\x8c', b'\xda', b'\xba', b' \xd8', b'\xa2', b'\xd9', b'\xbe', b'\xd8', b'\x9f']

p50k_base: 36 tokens
token integers: [23525, 45692, 13862, 12919, 25405, 17550, 117, 13862, 151, 234, 150, 102, 25405, 17550, 234, 220, 150, 102, 151, 234, 45692, 151, 240, 220, 151, 223, 151, 234, 150, 118, 17550, 95, 149, 122, 148, 253]
token bytes: [b'\xd8\xa7\xd9\x84', b'\xd8\xb3', b'\xd9\x84', b'\xd8\xa7', b'\xd9\x85', b' \xd8', ...

(output truncated)

## 6. Counting tokens for chat completions API calls

ChatGPT models like `gpt-4o-mini` and `gpt-4` use tokens in the same way as older completions models, but because of their message-based formatting, it's more difficult to count how many tokens will be used by a conversation.

Below is an example function for counting tokens for messages passed to `gpt-3.5-turbo`, `gpt-4`, `gpt-4o` and `gpt-4o-mini`.

Note that the exact way that tokens are counted from messages may change from model to model. Consider the counts from the function below an estimate, not a timeless guarantee.

In particular, requests that use the optional functions input will consume extra tokens on top of the estimates calculated below.

In [14]:
def num_tokens_from_messages(messages, model="gpt-4o-mini-2024-07-18"):
    """Return the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using o200k_base encoding.")
        encoding = tiktoken.get_encoding("o200k_base")
    if model in {
        "gpt-3.5-turbo-0125",
        "gpt-4-0314",
        "gpt-4-32k-0314",
        "gpt-4-0613",
        "gpt-4-32k-0613",
        "gpt-4o-mini-2024-07-18",
        "gpt-4o-2024-08-06"
        }:
        tokens_per_message = 3
        tokens_per_name = 1
    elif "gpt-3.5-turbo" in model:
        print("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0125.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0125")
    elif "gpt-4o-mini" in model:
        print("Warning: gpt-4o-mini may update over time. Returning num tokens assuming gpt-4o-mini-2024-07-18.")
        return num_tokens_from_messages(messages, model="gpt-4o-mini-2024-07-18")
    elif "gpt-4o" in model:
        print("Warning: gpt-4o may update over time. Returning num tokens assuming gpt-4o-2024-08-06.")
        return num_tokens_from_messages(messages, model="gpt-4o-2024-08-06")
    elif "gpt-4" in model:
        print("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
        return num_tokens_from_messages(messages, model="gpt-4-0613")
    else:
        raise NotImplementedError(
            f"""num_tokens_from_messages() is not implemented for model {model}."""
        )
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens


In [15]:
def count_tokens_and_cost_with_messages(messages, model="gpt-4o-mini-2024-07-18"):
    """
    Counts the total tokens and calculates the processing cost for a list of messages.

    Args:
    - messages (list): A list of message dictionaries, where each dictionary represents a message.
    - model (str): The model name used for token calculation (default: "gpt-4o-mini-2024-07-18").

    Returns:
    - dict: A dictionary containing the total tokens and estimated cost.
    """
    # Calculate the total tokens using the existing method
    total_tokens = num_tokens_from_messages(messages, model)

    # Calculate the cost at the GPT-4o rate used in this assignment, regardless of the `model` argument
    cost_per_million_tokens = 2.50  # $2.50 per 1M tokens
    cost = (total_tokens / 1_000_000) * cost_per_million_tokens

    # Return the result as a dictionary
    return {"Total Tokens": total_tokens, "Cost (USD)": f"{cost:.6f}"}

# Example usage
messages = [
    {"role": "user", "content": "What is the weather today in Karachi?"}
]

result = count_tokens_and_cost_with_messages(messages)
print(f"Total Tokens: {result['Total Tokens']}")
print(f"Estimated Cost: ${result['Cost (USD)']}")


Total Tokens: 15
Estimated Cost: $0.000038
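
To see where the per-message overhead in `num_tokens_from_messages` comes from, the sketch below repeats the same arithmetic with the tokenizer passed in as a parameter. `count_message_tokens` and `fake_encode` are hypothetical names, and the whitespace splitter is only a stand-in so the numbers can be checked by hand; it is not how `tiktoken` tokenizes.

```python
from typing import Callable, Dict, List, Sequence


def count_message_tokens(
    messages: List[Dict[str, str]],
    encode: Callable[[str], Sequence],
    tokens_per_message: int = 3,
    tokens_per_name: int = 1,
    reply_priming: int = 3,
) -> int:
    """Same arithmetic as num_tokens_from_messages, with a pluggable tokenizer."""
    total = reply_priming  # tokens that prime the assistant's reply
    for message in messages:
        total += tokens_per_message  # fixed formatting cost of each message
        for key, value in message.items():
            total += len(encode(value))  # role, content, and name are all encoded
            if key == "name":
                total += tokens_per_name
    return total


# Stand-in tokenizer for illustration only: one "token" per whitespace-separated word.
fake_encode = str.split

one_message = [{"role": "user", "content": "line one line two"}]
two_messages = [
    {"role": "user", "content": "line one"},
    {"role": "user", "content": "line two"},
]
print(count_message_tokens(one_message, fake_encode))   # 11
print(count_message_tokens(two_messages, fake_encode))  # 15
```

The same four words cost more when split across two messages, because each message adds its formatting tokens and its own role. Passing `tiktoken.get_encoding("o200k_base").encode` as `encode` should give real counts.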


### Assignment Begin

In [16]:
import os
from PyPDF2 import PdfReader

In [17]:
COST_PER_MILLION_TOKENS = 2.50  # $2.50 per 1M tokens

In [30]:
def parse_paragraphs(para: str, encoding_name: str = "cl100k_base"):
    """Returns the number of tokens and cost of reading the tokens in the given text."""
    # Note: cl100k_base is the GPT-4 / GPT-3.5 encoding; pass "o200k_base" to
    # count tokens the way GPT-4o does.
    encoding = tiktoken.get_encoding(encoding_name)
    tokens = encoding.encode(para)
    cost = len(tokens) / 1_000_000 * COST_PER_MILLION_TOKENS
    # return f"Number of tokens: {len(tokens)}, Cost: ${cost:.8f}"
    return (len(tokens), cost)

In [31]:
paragraph = """
V For Vendetta
Remember, remember The fifth of November The gunpowder treason and plot. I know of no reason Why the gunpowder treason Should ever be forgot." But what of the man? I know his name was Guy Fawkes, and I know that, in 1605, he attempted to blow up the houses of Parliament. But who was he really? What was he like? We are told to remember the idea, not the man, because a man can fail. He can be caught. He can be killed and forgotten. But four hundred years later an idea can still change the world. I've witnessed firsthand the power of ideas. I've seen people kill in the name of them; and die defending them. But you cannot kiss an idea, cannot touch it or hold it. Ideas do not bleed. They do not feel pain. They do not love. And it is not an idea that I miss, it is a man. A man that made me remember the fifth of November. A man that I will never forget.


Lord of the rings, Sam <3
FRODO: I can’t do this, Sam.

SAM: I know. It’s all wrong. By rights we shouldn’t even be here. But we are. It’s like in the great stories, Mr. Frodo. The ones that really mattered. Full of darkness and danger they were. And sometimes you didn’t want to know the end. Because how could the end be happy. How could the world go back to the way it was when so much bad had happened. But in the end, it’s only a passing thing, this shadow. Even darkness must pass. A new day will come. And when the sun shines it will shine out the clearer. Those were the stories that stayed with you. That meant something. Even if you were too small to understand why. But I think, Mr. Frodo, I do understand. I know now. Folk in those stories had lots of chances of turning back only they didn’t. Because they were holding on to something.

FRODO: What are we holding on to, Sam?

SAM: That there’s some good in this world, Mr. Frodo. And it’s worth fighting for.


Lord of the rings, Gandalf <3
Deserves it! I daresay he does. Many that live deserve death. And some that die deserve life. Can you give it to them? Then do not be too eager to deal out death in judgement. For even the very wise cannot see all ends.
"""

In [32]:
parse_paragraphs(paragraph)

(526, 0.001315)

In [49]:
def parse_text_file(filepath: str):
    """Reads and returns the tokens and cost for reading a text file"""
    if not os.path.isfile(filepath):
        raise FileNotFoundError(f"File not found: {filepath}")
    with open(filepath, 'r', encoding='utf-8') as f:
        text = f.read()
    
    # Keep non-empty lines; each becomes its own chat message below, so the
    # total includes per-message formatting tokens.
    lines = [line.strip() for line in text.split('\n') if line.strip()]

    messages = [{"role": "user", "content": line} for line in lines]
    res = count_tokens_and_cost_with_messages(messages)
    return res["Total Tokens"], float(res["Cost (USD)"])
    
    # return parse_paragraphs(text)

tokens, cost = parse_text_file("sample.txt")
print(f"Total Tokens: {tokens}")
print(f"Estimated Cost: ${cost:.6f}")

Total Tokens: 229
Estimated Cost: $0.000572


In [51]:
def parse_pdf_file(filepath: str):
    """Reads and returns the tokens and cost for reading a pdf file"""
    if not os.path.isfile(filepath):
        raise FileNotFoundError(f"File not found: {filepath}")
    # reader = PdfFileReader(filepath)
    reader = PdfReader(filepath)
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""
    
    # Keep non-empty lines; each becomes its own chat message below, so the
    # total includes per-message formatting tokens.
    lines = [line.strip() for line in text.split('\n') if line.strip()]

    messages = [{"role": "user", "content": line} for line in lines]
    res = count_tokens_and_cost_with_messages(messages)
    return res["Total Tokens"], float(res["Cost (USD)"])

tokens, cost = parse_pdf_file("sample.pdf")
print(f"Total Tokens: {tokens}")
print(f"Estimated Cost: ${cost:.6f}")

Total Tokens: 229
Estimated Cost: $0.000572


### Extra Credit - All 6 Harry Potter Books (7, including Deathly Hallows)

In [52]:
# hp_books = [
#     "HP1-SorcerersStone.pdf",
#     "HP2-ChamberOfSecrets.pdf",
#     "HP3-PrisonerOfAzkaban.pdf",
#     "HP4-GobletOfFire.pdf",
#     "HP5-OrderOfThePhoenix.pdf",
#     "HP6-HalfBloodPrince.pdf",
#     "HP7-DeathlyHallows.pdf"
# ]
hp = "harrypotter.pdf"

In [53]:
total_tokens = 0
total_cost = 0

# for book in hp_books:
#     count_tokens_and_cost_with_pdf_files(book)
#     tokens, cost = count_tokens_and_cost_with_pdf_files(book)
#     total_tokens += tokens
#     total_cost += cost
total_tokens, total_cost = parse_pdf_file(hp)

print(f"Total Tokens: {total_tokens}, Total Cost: ${total_cost:.8f}")

Total Tokens: 2682675, Total Cost: $6.70668800
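
If the books are kept as separate PDFs, as in the commented-out list above, the totals can be accumulated per book. The sketch below is one possible shape for that loop: `total_for_books` is a hypothetical helper, and the counting function is passed in so that `parse_pdf_file` (or any function returning a `(tokens, cost)` pair) can be used.

```python
from typing import Callable, Dict, Iterable, Tuple


def total_for_books(
    paths: Iterable[str],
    count_fn: Callable[[str], Tuple[int, float]],
) -> Tuple[int, float, Dict[str, Tuple[int, float]]]:
    """Sum tokens and cost over several files, keeping a per-file breakdown."""
    per_book: Dict[str, Tuple[int, float]] = {}
    total_tokens = 0
    total_cost = 0.0
    for path in paths:
        try:
            tokens, cost = count_fn(path)
        except FileNotFoundError as err:
            # Report the missing file and continue with the remaining books.
            print(f"Skipping {path}: {err}")
            continue
        per_book[path] = (tokens, cost)
        total_tokens += tokens
        total_cost += cost
    return total_tokens, total_cost, per_book


# Intended usage with the notebook's own names (not executed here):
# total_tokens, total_cost, per_book = total_for_books(hp_books, parse_pdf_file)
```

Keeping the per-book breakdown makes it easy to spot a file whose text failed to extract, since it would show an unexpectedly low token count.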
