# How to count tokens with PSOpenAI

PSOpenAI provides the `ConvertTo-Token` and `ConvertFrom-Token` commands for tokenization.

Given a text string (e.g., "PowerShell for every system!") and an encoding (e.g., "cl100k_base"), a tokenizer can split the text string into a list of tokens (e.g., ("Power", "Shell", " for", " every", " system", "!")).

Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token).
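
As a rough illustration of point (a), the sketch below counts the tokens in a string and compares the count against a limit. The value of `$maxTokens` is a made-up placeholder rather than the real limit of any particular model, and the example assumes the PSOpenAI module is installed and can be imported by name.

```PowerShell
# Assumption: PSOpenAI is installed and importable by name.
Import-Module PSOpenAI

# Hypothetical limit, for illustration only. Look up the real context size of your model.
$maxTokens = 4096

$text = 'PowerShell for every system!'
# @(...) keeps the result an array even when the text encodes to a single token.
$tokenCount = @($text | ConvertTo-Token -Encoding 'cl100k_base').Count

if ($tokenCount -gt $maxTokens) {
    Write-Warning "Text has $tokenCount tokens, which exceeds the assumed limit of $maxTokens."
}
else {
    "Text has $tokenCount tokens and fits within the assumed limit of $maxTokens."
}
```

The same count could feed a cost estimate by multiplying it by the per-token price of the model you call.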

## Encodings

Encodings specify how text is converted into tokens. Different models use different encodings.

PSOpenAI supports various encodings used by OpenAI models.

|Encoding name|OpenAI models|
|:----|:----|
|`cl100k_base`|`gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002`|
|`p50k_base`|Codex models, `text-davinci-002`, `text-davinci-003`|
|`p50k_edit`|`text-davinci-edit-001`|
|`r50k_base` (or `gpt2`)|GPT-3 models like `davinci`|

You can specify the encoding either by encoding name or by model name:

```PowerShell
ConvertTo-Token -Text "PowerShell for every system!" -Encoding cl100k_base
ConvertTo-Token -Text "PowerShell for every system!" -Model gpt-4
```

> Note: If you don't specify any encoding or model, it will use `cl100k_base` encoding.
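
One way to check this behavior on your own machine is to encode the same string three ways and compare the results. This is a minimal sketch that assumes the PSOpenAI module is installed and importable by name. The variable names are illustrative.

```PowerShell
# Assumption: PSOpenAI is installed and importable by name.
Import-Module PSOpenAI

$text = 'PowerShell for every system!'

# Encode with the default, with an explicit encoding name, and with a model name.
$byDefault  = @(ConvertTo-Token -Text $text)
$byEncoding = @(ConvertTo-Token -Text $text -Encoding 'cl100k_base')
$byModel    = @(ConvertTo-Token -Text $text -Model 'gpt-4')

# Joining the token IDs into strings gives an order-sensitive comparison.
$sameAsEncoding = ($byDefault -join ',') -eq ($byEncoding -join ',')
$sameAsModel    = ($byDefault -join ',') -eq ($byModel -join ',')

"Default matches cl100k_base: $sameAsEncoding"
"Default matches gpt-4:       $sameAsModel"
```

If the note and the table above hold for your installed version, both comparisons should report `True`.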


## 1. Import

The tokenizer runs on your local machine. No Internet connection or API key is required.

In [1]:
# imports
Import-Module ..\PSOpenAI.psd1 -Force

## 2. Turn text into tokens with `ConvertTo-Token`

`ConvertTo-Token` converts a text string into a list of token integers.

In [2]:
ConvertTo-Token -Text "PowerShell for every system!" -Encoding "cl100k_base"

15335
26354
369
1475
1887
0


`ConvertTo-Token` also accepts input from the pipeline:

In [3]:
"PowerShell for every system!" | ConvertTo-Token -Encoding "cl100k_base"

15335
26354
369
1475
1887
0


To count tokens, take the length of the list returned by `ConvertTo-Token`:

In [4]:
$tokens = "PowerShell for every system!" | ConvertTo-Token -Encoding "cl100k_base"
$tokens.Length

6
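
If you count tokens often, it may be convenient to wrap the call in a small helper. The sketch below defines `Get-TokenCount`, a hypothetical function name that is not part of PSOpenAI. It wraps the result in `@(...)` so the count is taken from an array even when the text encodes to a single token. It assumes the PSOpenAI module is installed and importable by name.

```PowerShell
# Assumption: PSOpenAI is installed and importable by name.
Import-Module PSOpenAI

# Get-TokenCount is a hypothetical helper, not a PSOpenAI command.
function Get-TokenCount {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [string]$Text,

        [string]$Encoding = 'cl100k_base'
    )
    process {
        # Emit one count per input string received from the pipeline.
        @(ConvertTo-Token -Text $Text -Encoding $Encoding).Count
    }
}

# Emits one token count per input string.
'PowerShell for every system!', 'PowerShell' | Get-TokenCount
```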


## 3. Turn tokens into text with `ConvertFrom-Token`

`ConvertFrom-Token` converts a list of token integers to a string.

In [5]:
ConvertFrom-Token -Token (15335, 26354, 369, 1475, 1887, 0) -Encoding "cl100k_base"

# You can also pass the tokens through the pipeline:
# (15335, 26354, 369, 1475, 1887, 0) | ConvertFrom-Token -Encoding "cl100k_base"

PowerShell for every system!


The `-AsArray` switch converts each token into its own string and returns the results as an array. This lets you see how the tokenizer splits the text.

In [6]:
ConvertFrom-Token -Token (15335, 26354, 369, 1475, 1887, 0) -Encoding "cl100k_base" -AsArray

Power
Shell
 for
 every
 system
!
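
A quick way to convince yourself that the two commands are inverses is a round trip: encode a string, decode it again, and compare. The sketch below does this for the example string used above. It assumes the PSOpenAI module is installed and importable by name.

```PowerShell
# Assumption: PSOpenAI is installed and importable by name.
Import-Module PSOpenAI

$text = 'PowerShell for every system!'

# Encode, then decode back into a single string.
$tokens  = $text | ConvertTo-Token -Encoding 'cl100k_base'
$decoded = $tokens | ConvertFrom-Token -Encoding 'cl100k_base'

# -ceq is a case-sensitive comparison.
"Round trip preserved the text: $($decoded -ceq $text)"

# With -AsArray, the per-token strings concatenate back to the original text.
$pieces = $tokens | ConvertFrom-Token -Encoding 'cl100k_base' -AsArray
"Pieces rejoin to the text:     $(($pieces -join '') -ceq $text)"
```

Note that rejoining the `-AsArray` pieces is only reliable for text like this ASCII example. A single character can span several tokens in other scripts, so individual pieces may not be complete characters.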


## 4. Comparing encodings

Different encodings vary in how they split words, group spaces, and handle non-English characters. You can compare different encodings on a few example strings.

In [7]:
function Compare-Encodings {
  param([string]$Text)
  Write-Output ('Example string: "{0}"' -f $Text)
  $encs = 'gpt2', 'p50k_base', 'cl100k_base'

  $encs | ForEach-Object {
    # Encoding
    $tokens = ($text | ConvertTo-Token -Encoding $_)
    # Decoding
    $words = ($tokens | ConvertFrom-Token -Encoding $_ -AsArray)
    # Display
    [pscustomobject]@{
        "Encoding"      = $_
        "Count"         = $tokens.Length
        "Token(int)"    = $tokens -join ', '
        "Token(string)" = $words -join ', '
    }
  }
}

In [8]:
Compare-Encodings "antidisestablishmentarianism"

Example string: "antidisestablishmentarianism"

Encoding    Count Token(int)                         Token(string)
--------    ----- ----------                         -------------
gpt2            5 415, 29207, 44390, 3699, 1042      ant, idis, establishment, arian, ism
p50k_base       5 415, 29207, 44390, 3699, 1042      ant, idis, establishment, arian, ism
cl100k_base     6 519, 85342, 34500, 479, 8997, 2191 ant, idis, establish, ment, arian, ism



In [9]:
Compare-Encodings "2 + 2 = 4"

Example string: "2 + 2 = 4"

Encoding    Count Token(int)                     Token(string)
--------    ----- ----------                     -------------
gpt2            5 17, 1343, 362, 796, 604        2,  +,  2,  =,  4
p50k_base       5 17, 1343, 362, 796, 604        2,  +,  2,  =,  4
cl100k_base     7 17, 489, 220, 17, 284, 220, 19 2,  +,  , 2,  =,  , 4



In [10]:
Compare-Encodings "お誕生日おめでとう🎉"

Example string: "お誕生日おめでとう🎉"

Encoding    Count Token(int)
--------    ----- ----------
gpt2           17 2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 295…
p50k_base      17 2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 295…
cl100k_base    12 33334, 45918, 243, 21990, 9080, 33334, 62004, 16556, 78699, 9468, 236, 231       



## 5. Counting tokens for chat API calls

ChatGPT models like `gpt-3.5-turbo` and `gpt-4` use tokens in the same way as older completions models, but because of their message-based formatting, it's more difficult to count how many tokens will be used by a conversation.

Below is an example function for counting tokens for messages passed to `gpt-3.5-turbo-0301` or `gpt-4-0314`.

Note that the exact way that tokens are counted from messages may change from model to model. Consider the counts from the function below an estimate, not a timeless guarantee.

In [11]:
function Measure-TokensFromMessages ($Messages, $Model) {
  # Returns the number of tokens used by a list of messages.
  # Note: this function is ported from openai-cookbook.
  switch -Wildcard ($Model) {
    'gpt-3.5-turbo*' { 
      $tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
      $tokens_per_name = -1  # if there's a name, the role is omitted
    }
    'gpt-4*' { 
      $tokens_per_message = 3
      $tokens_per_name = 1
    }
    Default {
      Write-Error "Not implemented for model $Model."
      return
    }
  }
  $num_tokens = 0
  foreach ($message in $Messages) {
    $num_tokens += $tokens_per_message
    foreach ($item in $message.GetEnumerator()) {
      $num_tokens += @(ConvertTo-Token -Text $item.Value -Model $Model).Count
      if ($item.Key -eq 'name') {
        $num_tokens += $tokens_per_name
      }
    }
  }
  $num_tokens += 3 # every reply is primed with <|start|>assistant<|message|>
  Write-Output $num_tokens
}

In [12]:
# let's verify the function above matches the OpenAI API response
$ExampleMessages = [pscustomobject]@{
  History = @(
    @{
      'role'    = 'system'
      'content' = 'You are a helpful, pattern-following assistant that translates corporate jargon into plain English.'
    },
    @{
      'role'    = 'system'
      'name'    = 'example_user'
      'content' = 'New synergies will help drive top-line growth.'
    },
    @{
      'role'    = 'system'
      'name'    = 'example_assistant'
      'content' = 'Things working well together will increase revenue.'
    },
    @{
      'role'    = 'system'
      'name'    = 'example_user'
      'content' = "Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage."
    },
    @{
      'role'    = 'system'
      'name'    = 'example_assistant'
      'content' = "Let's talk later when we're less busy about how to do better."
    }
  )
  Message = @{
    'role'    = 'user'
    'content' = "This late pivot means we don't have time to boil the ocean for the client deliverable."
  }
}

'gpt-3.5-turbo-0301', 'gpt-4-0314' | ForEach-Object {
  $model = $_
  echo $model
  # example token count from the function defined above
  "{0} prompt tokens counted by Measure-TokensFromMessages function." -f (Measure-TokensFromMessages -Messages ($ExampleMessages.History + $ExampleMessages.Message) -Model $model)
  # example token count from the OpenAI API
  $response = $ExampleMessages | Request-ChatGPT `
    -Model $model `
    -Message $ExampleMessages.Message.content `
    -Temperature 0 `
    -MaxTokens 1  # we're only counting input tokens here, so let's not waste tokens on the output
  "{0} prompt tokens counted by the OpenAI API.`r`n" -f $response.usage.prompt_tokens
} | Out-String



gpt-3.5-turbo-0301
127 prompt tokens counted by Measure-TokensFromMessages function.
127 prompt tokens counted by the OpenAI API.

gpt-4-0314
129 prompt tokens counted by Measure-TokensFromMessages function.
129 prompt tokens counted by the OpenAI API.


