<div style="width: 100%; overflow: hidden;">
    <div style="width: 150px; float: left;"> <img src="https://raw.githubusercontent.com/DataForScience/Networks/master/data/D4Sci_logo_ball.png" alt="Data For Science, Inc" align="left" border="0" width=150px> </div>
    <div style="float: left; margin-left: 10px;"> <h1>Generative AI with OpenAI API</h1>
<h1>Basic Concepts</h1>
        <p>Bruno Gonçalves<br/>
        <a href="http://www.data4sci.com/">www.data4sci.com</a><br/>
            @bgoncalves, @data4sci</p></div>
</div>

In [1]:
from collections import Counter, defaultdict
from pprint import pprint
import random

import pandas as pd
import numpy as np

import matplotlib
import matplotlib.pyplot as plt 

import openai
from openai import OpenAI

import tiktoken

import nltk
from nltk.corpus import reuters
from nltk import bigrams, trigrams

from ipywidgets import interact

import os
import gzip

import tqdm as tq
from tqdm.notebook import tqdm

import watermark

%load_ext watermark
%matplotlib inline

We start by printing out the versions of the libraries we're using, for future reference.

In [2]:
%watermark -n -v -m -g -iv

Python implementation: CPython
Python version       : 3.11.7
IPython version      : 8.12.3

Compiler    : Clang 14.0.6 
OS          : Darwin
Release     : 24.3.0
Machine     : arm64
Processor   : arm
CPU cores   : 16
Architecture: 64bit

Git hash: 3a7a9a8b6856eb5855cd2ac76a384e203382ab54

json      : 2.0.9
tiktoken  : 0.7.0
matplotlib: 3.8.0
nltk      : 3.8.1
watermark : 2.4.3
openai    : 1.30.5
numpy     : 1.26.4
tqdm      : 4.66.4
pandas    : 2.2.3



Load default figure style

In [3]:
plt.style.use('d4sci.mplstyle')
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']

# Encodings

tiktoken supports several types of encodings

In [4]:
tiktoken.list_encoding_names()

['gpt2', 'r50k_base', 'p50k_base', 'p50k_edit', 'cl100k_base', 'o200k_base']

Encodings can be loaded by name

In [5]:
encoding = tiktoken.get_encoding("cl100k_base")
encoding

<Encoding 'cl100k_base'>

or by specifying the name of the model we are using

In [6]:
encoding = tiktoken.encoding_for_model("gpt-4o")
encoding

<Encoding 'o200k_base'>

After loading, we can use the encoding object to tokenize text by using the __encode()__ method

In [8]:
encoded_text = encoding.encode("tiktoken is great!")
print(encoded_text)

[83, 8251, 2488, 382, 2212, 0]


which returns numerical IDs for each of the tokens. Numerical IDs can be converted back to the original text using __decode()__

In [9]:
encoding.decode(encoded_text)

'tiktoken is great!'

Tokens can be individual characters, word fragments, or even full words. To convert individual numerical IDs to tokens, we should use __decode_single_token_bytes()__

In [10]:
for token in encoded_text:
    print('%s\t->\t%s' % (token, encoding.decode_single_token_bytes(token)))

83	->	b't'
8251	->	b'ikt'
2488	->	b'oken'
382	->	b' is'
2212	->	b' great'
0	->	b'!'
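Conceptually, decoding is a table lookup followed by a byte join. The sketch below rebuilds that idea with only the six ID-to-bytes pairs printed above; `token_table` and `decode_ids` are illustrative names, not part of tiktoken, and a real encoding holds a far larger table.

```python
# ID -> bytes pairs copied from the output above (a tiny slice of the real vocabulary)
token_table = {
    83: b"t",
    8251: b"ikt",
    2488: b"oken",
    382: b" is",
    2212: b" great",
    0: b"!",
}

token_ids = [83, 8251, 2488, 382, 2212, 0]


def decode_ids(ids, table):
    """Look up the bytes of each token ID, join them, and decode as UTF-8."""
    return b"".join(table[i] for i in ids).decode("utf-8")


print(decode_ids(token_ids, token_table))
```

Note that spaces belong to the token that follows them (`b' is'`, `b' great'`), which is why joining the pieces needs no separator.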


The number of tokens generated depends on the specific encoding used. This is particularly noticeable in long words, so let's take one of the longest English words as an example

In [14]:
example_string = "pneumonoultramicroscopicsilicovolcanoconiosis"

for encoding_name in tiktoken.list_encoding_names():
    encoding = tiktoken.get_encoding(encoding_name)
    encoded_text = encoding.encode(example_string)
    num_tokens = len(encoded_text)
    token_text = [encoding.decode_single_token_bytes(token) for token in encoded_text]
    print()
    print(f"{encoding_name}: {num_tokens} tokens")
    print(f"Token IDs: {encoded_text}")
    print(f"Token Text: {token_text}")


gpt2: 15 tokens
Token IDs: [79, 25668, 261, 25955, 859, 2500, 1416, 404, 873, 41896, 709, 349, 5171, 36221, 42960]
Token Text: [b'p', b'neum', b'on', b'oult', b'ram', b'icro', b'sc', b'op', b'ics', b'ilic', b'ov', b'ol', b'can', b'ocon', b'iosis']

r50k_base: 15 tokens
Token IDs: [79, 25668, 261, 25955, 859, 2500, 1416, 404, 873, 41896, 709, 349, 5171, 36221, 42960]
Token Text: [b'p', b'neum', b'on', b'oult', b'ram', b'icro', b'sc', b'op', b'ics', b'ilic', b'ov', b'ol', b'can', b'ocon', b'iosis']

p50k_base: 15 tokens
Token IDs: [79, 25668, 261, 25955, 859, 2500, 1416, 404, 873, 41896, 709, 349, 5171, 36221, 42960]
Token Text: [b'p', b'neum', b'on', b'oult', b'ram', b'icro', b'sc', b'op', b'ics', b'ilic', b'ov', b'ol', b'can', b'ocon', b'iosis']

p50k_edit: 15 tokens
Token IDs: [79, 25668, 261, 25955, 859, 2500, 1416, 404, 873, 41896, 709, 349, 5171, 36221, 42960]
Token Text: [b'p', b'neum', b'on', b'oult', b'ram', b'icro', b'sc', b'op', b'ics', b'ilic', b'ov', b'ol', b'can', b'ocon',

Encodings are capable of handling a large number of languages and character sets. Let's take Japanese for example:

In [15]:
example_string = "お誕生日おめでとう"

for encoding_name in tiktoken.list_encoding_names():
    encoding = tiktoken.get_encoding(encoding_name)
    encoded_text = encoding.encode(example_string)
    num_tokens = len(encoded_text)
    token_text = [encoding.decode_single_token_bytes(token) for token in encoded_text]
    print()
    print(f"{encoding_name}: {num_tokens} tokens")
    print(f"Token IDs: {encoded_text}")
    print(f"Token Text: {token_text}")


gpt2: 14 tokens
Token IDs: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 29557]
Token Text: [b'\xe3\x81', b'\x8a', b'\xe8\xaa', b'\x95', b'\xe7\x94\x9f', b'\xe6\x97', b'\xa5', b'\xe3\x81', b'\x8a', b'\xe3\x82', b'\x81', b'\xe3\x81\xa7', b'\xe3\x81\xa8', b'\xe3\x81\x86']

r50k_base: 14 tokens
Token IDs: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 29557]
Token Text: [b'\xe3\x81', b'\x8a', b'\xe8\xaa', b'\x95', b'\xe7\x94\x9f', b'\xe6\x97', b'\xa5', b'\xe3\x81', b'\x8a', b'\xe3\x82', b'\x81', b'\xe3\x81\xa7', b'\xe3\x81\xa8', b'\xe3\x81\x86']

p50k_base: 14 tokens
Token IDs: [2515, 232, 45739, 243, 37955, 33768, 98, 2515, 232, 1792, 223, 30640, 30201, 29557]
Token Text: [b'\xe3\x81', b'\x8a', b'\xe8\xaa', b'\x95', b'\xe7\x94\x9f', b'\xe6\x97', b'\xa5', b'\xe3\x81', b'\x8a', b'\xe3\x82', b'\x81', b'\xe3\x81\xa7', b'\xe3\x81\xa8', b'\xe3\x81\x86']

p50k_edit: 14 tokens
Token IDs: [2515, 232, 45739, 243, 37955, 33768, 98, 251

Here we are seeing the raw UTF-8 bytes of each token rather than readable characters, but we can easily recover the original Kanji and Hiragana text by joining the bytes and decoding them

In [16]:
print(b"".join(token_text).decode())

お誕生日おめでとう
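The reason we join before decoding is that a token boundary can fall in the middle of a multi-byte UTF-8 character. The sketch below uses the gpt2 token bytes printed earlier (copied by hand, so it runs without tiktoken) to show that some tokens are not valid UTF-8 on their own; the variable names are illustrative.

```python
# Token bytes for "お誕生日おめでとう" as printed for the gpt2 encoding above
gpt2_token_bytes = [
    b"\xe3\x81", b"\x8a", b"\xe8\xaa", b"\x95", b"\xe7\x94\x9f",
    b"\xe6\x97", b"\xa5", b"\xe3\x81", b"\x8a", b"\xe3\x82",
    b"\x81", b"\xe3\x81\xa7", b"\xe3\x81\xa8", b"\xe3\x81\x86",
]

# Try to decode each token individually
undecodable = []
for token in gpt2_token_bytes:
    try:
        print(repr(token), "->", token.decode("utf-8"))
    except UnicodeDecodeError:
        undecodable.append(token)
        print(repr(token), "-> not a complete UTF-8 character")

# Joining all the bytes first restores complete characters
recovered = b"".join(gpt2_token_bytes).decode("utf-8")
print(recovered)
```

Each of the nine characters takes three bytes in UTF-8, so the 27 bytes end up spread over 14 tokens in this encoding.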


# Counting tokens for API calls

We can use tiktoken to count how many tokens our API calls are going to consume. Naturally, this depends on the language model used. The function below is based on https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

In [17]:
def num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613"):
    """Return the number of tokens used by a list of messages."""
    try:
        encoding = tiktoken.encoding_for_model(model)
    except KeyError:
        print("Warning: model not found. Using cl100k_base encoding.")
        encoding = tiktoken.get_encoding("cl100k_base")
    if model in {
        "gpt-3.5-turbo-0613",
        "gpt-3.5-turbo-16k-0613",
        "gpt-4-0314",
        "gpt-4-32k-0314",
        "gpt-4-0613",
        "gpt-4-32k-0613",
        }:
        tokens_per_message = 3
        tokens_per_name = 1
    elif model == "gpt-3.5-turbo-0301":
        tokens_per_message = 4  # every message follows <|start|>{role/name}\n{content}<|end|>\n
        tokens_per_name = -1  # if there's a name, the role is omitted
    elif "gpt-3.5-turbo" in model:
        print("Warning: gpt-3.5-turbo may update over time. Returning num tokens assuming gpt-3.5-turbo-0613.")
        return num_tokens_from_messages(messages, model="gpt-3.5-turbo-0613")
    elif "gpt-4" in model:
        print("Warning: gpt-4 may update over time. Returning num tokens assuming gpt-4-0613.")
        return num_tokens_from_messages(messages, model="gpt-4-0613")
    else:
        raise NotImplementedError(
            f"""num_tokens_from_messages() is not implemented for model {model}. See https://github.com/openai/openai-python/blob/main/chatml.md for information on how messages are converted to tokens."""
        )
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            n_tokens = len(encoding.encode(value))
            num_tokens += n_tokens
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with <|start|>assistant<|message|>
    return num_tokens
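To make the bookkeeping in the function above easier to follow, here is a minimal sketch of the same arithmetic with a stand-in tokenizer. `count_prompt_tokens` and `whitespace_encode` are hypothetical helpers, and splitting on whitespace is only an assumption that lets the example run without tiktoken; the real counts come from the model's encoding.

```python
def whitespace_encode(text):
    """Stand-in tokenizer (assumption): one token per whitespace-separated word."""
    return text.split()


def count_prompt_tokens(messages, encode, tokens_per_message=3, tokens_per_name=1):
    """Same accounting as above: per-message overhead + content tokens + reply priming."""
    total = 0
    for message in messages:
        total += tokens_per_message          # fixed overhead for every message
        for key, value in message.items():
            total += len(encode(value))      # tokens in the role, name and content fields
            if key == "name":
                total += tokens_per_name     # extra cost when a name is present
    return total + 3                         # every reply is primed with 3 tokens


toy_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "name": "example_user", "content": "Hello there"},
]

# Message 1: 3 + 1 (role) + 5 (content) = 9
# Message 2: 3 + 1 (role) + 1 (name) + 1 (name bonus) + 2 (content) = 8
# Reply priming: 3
print(count_prompt_tokens(toy_messages, whitespace_encode))  # 20
```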


Instantiate the client

In [18]:
client = OpenAI()

In [19]:
example_messages = [
    {
        "role": "system",
        "content": "You are a helpful, pattern-following assistant that translates corporate jargon into plain English.",
    },
    {
        "role": "system",
        "name": "example_user",
        "content": "New synergies will help drive top-line growth.",
    },
    {
        "role": "system",
        "name": "example_assistant",
        "content": "Things working well together will increase revenue.",
    },
    {
        "role": "system",
        "name": "example_user",
        "content": "Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage.",
    },
    {
        "role": "system",
        "name": "example_assistant",
        "content": "Let's talk later when we're less busy about how to do better.",
    },
    {
        "role": "user",
        "content": "This late pivot means we don't have time to boil the ocean for the client deliverable.",
    },
]

for model in [
    "gpt-3.5-turbo",
#     "gpt-3.5-turbo-0613",
    "gpt-4",
    "gpt-4o",
    ]:
    print(model)
    # example token count from the function defined above
    print(f"{num_tokens_from_messages(example_messages, model)} prompt tokens counted by num_tokens_from_messages().")
    # example token count from the OpenAI API
    response = client.chat.completions.create(
        model=model,
        messages=example_messages,
        temperature=0,
        max_tokens=1,  # we're only counting input tokens here, so let's not waste tokens on the output
    )
    print(f'{response.usage.prompt_tokens} prompt tokens counted by the OpenAI API.')
    print()


gpt-3.5-turbo
129 prompt tokens counted by num_tokens_from_messages().
129 prompt tokens counted by the OpenAI API.

gpt-4
129 prompt tokens counted by num_tokens_from_messages().
129 prompt tokens counted by the OpenAI API.

gpt-4o
129 prompt tokens counted by num_tokens_from_messages().
124 prompt tokens counted by the OpenAI API.
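A likely contributor to the gpt-4o mismatch is that num_tokens_from_messages() has no dedicated branch for that model. The substring test `"gpt-4" in model` also matches "gpt-4o", so the function recurses with gpt-4-0613 and tokenizes with that model's cl100k_base encoding, while encoding_for_model("gpt-4o") returned o200k_base earlier in this notebook. The sketch below reproduces only the dispatch logic; `PINNED_MODELS` and `resolve_counting_model` are hypothetical names introduced for illustration.

```python
PINNED_MODELS = {
    "gpt-3.5-turbo-0613",
    "gpt-3.5-turbo-16k-0613",
    "gpt-4-0314",
    "gpt-4-32k-0314",
    "gpt-4-0613",
    "gpt-4-32k-0613",
}


def resolve_counting_model(model):
    """Hypothetical helper: which snapshot's counting rules end up being applied."""
    if model in PINNED_MODELS or model == "gpt-3.5-turbo-0301":
        return model
    if "gpt-3.5-turbo" in model:
        return "gpt-3.5-turbo-0613"
    if "gpt-4" in model:  # substring test, so "gpt-4o" lands here too
        return "gpt-4-0613"
    raise NotImplementedError(model)


for name in ["gpt-3.5-turbo", "gpt-4", "gpt-4o"]:
    print(f"{name} -> {resolve_counting_model(name)}")
```

Adding an explicit branch for gpt-4o that keeps its own encoding would be one way to test this explanation.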



# "Small" Language Model

In [20]:
model = defaultdict(lambda: defaultdict(lambda: 0))

We start by counting the number of trigram co-occurrences

In [21]:
for sentence in tqdm(reuters.sents(), total=54_716):
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        bigram = (w1, w2)
        model[bigram][w3] += 1

We then normalize the counts for each bigram so that they become probabilities.

In [22]:
for bigram in model:
    total_count = float(sum(model[bigram].values()))

    for w3 in model[bigram]:
        model[bigram][w3] /= total_count
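The two steps above (count, then normalize) can be checked by hand on a tiny corpus. The sketch below is an assumption-laden stand-in: `toy_sentences` is made up, and `padded_trigrams` is a hypothetical pure-Python replacement for `trigrams(..., pad_left=True, pad_right=True)` that pads with two `None` values on each side, so it runs without NLTK or the Reuters corpus.

```python
from collections import defaultdict


def padded_trigrams(tokens):
    """Yield trigrams of a sentence padded with two None values on each side."""
    padded = [None, None, *tokens, None, None]
    return zip(padded, padded[1:], padded[2:])


toy_sentences = [
    ["the", "cat", "sat"],
    ["the", "cat", "ran"],
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
]

# Step 1: count how often each word follows each bigram
toy_model = defaultdict(lambda: defaultdict(float))
for sentence in toy_sentences:
    for w1, w2, w3 in padded_trigrams(sentence):
        toy_model[(w1, w2)][w3] += 1

# Step 2: divide by the total count of each bigram to get probabilities
for bigram, followers in toy_model.items():
    total_count = sum(followers.values())
    for w3 in followers:
        followers[w3] /= total_count

print(dict(toy_model[(None, "the")]))   # "cat" follows 3 times out of 4, "dog" once
print(dict(toy_model[("the", "cat")]))  # "sat" follows 2 times out of 3, "ran" once
```

The `None` padding is what lets the model learn how sentences start, via bigrams such as `(None, "the")`, and how they end, via a `None` next word.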

Our language model is just a weighted mapping between each bigram and the possible next words.

In [23]:
model[("the", "United")]

defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
            {'States': 0.880672268907563,
             'Kingdom': 0.011764705882352941,
             'Arab': 0.052100840336134456,
             'Permanent': 0.0016806722689075631,
             'Steelworkers': 0.0033613445378151263,
             'Nations': 0.025210084033613446,
             'Coconut': 0.0067226890756302525,
             'State': 0.0033613445378151263,
             'Democratic': 0.0016806722689075631,
             'Food': 0.008403361344537815,
             'Automobile': 0.0016806722689075631,
             'acquisition': 0.0016806722689075631,
             'Rubber': 0.0016806722689075631})

In [24]:
model[("United", "Kingdom")]

defaultdict(<function __main__.<lambda>.<locals>.<lambda>()>,
            {',': 0.21428571428571427,
             'and': 0.21428571428571427,
             'blender': 0.07142857142857142,
             ')': 0.14285714285714285,
             'company': 0.07142857142857142,
             'operations': 0.07142857142857142,
             'assets': 0.07142857142857142,
             'Ltd': 0.07142857142857142,
             '.': 0.07142857142857142})
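As a quick sanity check on the normalization, the probabilities shown for ("United", "Kingdom") should sum to one, and because they are ratios of small integer counts we can recover plausible counts from them. This is a sketch using only the values printed above; it assumes the denominators are small, and the total it recovers is the smallest one consistent with the probabilities (the true count could be a multiple of it).

```python
from fractions import Fraction
from math import lcm

# Probabilities copied from the output above
united_kingdom = {
    ",": 0.21428571428571427,
    "and": 0.21428571428571427,
    "blender": 0.07142857142857142,
    ")": 0.14285714285714285,
    "company": 0.07142857142857142,
    "operations": 0.07142857142857142,
    "assets": 0.07142857142857142,
    "Ltd": 0.07142857142857142,
    ".": 0.07142857142857142,
}

# Convert each float to the nearest fraction with a small denominator
as_fractions = {
    word: Fraction(prob).limit_denominator(1000)
    for word, prob in united_kingdom.items()
}

# The probabilities must add up to exactly one
print(sum(as_fractions.values()))

# Smallest total count consistent with every probability
smallest_total = lcm(*(f.denominator for f in as_fractions.values()))
counts = {word: int(f * smallest_total) for word, f in as_fractions.items()}
print(smallest_total, counts)
```

With so few observations behind each probability, the generated text for rarer bigrams is driven by a handful of corpus sentences.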

This is all we need to generate new text starting from a bigram prompt. We simply perform a random walk on this weighted graph, starting from an initial prompt:

In [25]:
def generate_sentence_from_prompt(prompt, zero_temperature=False):
    text = [*prompt]

    # Don't impose any fixed sentence length
    while True:
        # The current node we're in is just the one that accounts
        # for the last two words in the text
        bigram = tuple(text[-2:])

        # We extract the list of possible next words and their probabilities
        words = []
        probs = []

        for word, prob in model[bigram].items():
            words.append(word)
            probs.append(prob)

        # Choose the most likely word (temperature 0) or sample one in proportion to its probability (temperature 1)
        if zero_temperature:
            pos = np.argmax(probs) # Temperature = 0
        else:
            selection = np.random.multinomial(1, probs)
            pos = np.argmax(selection) # Temperature = 1

        # Check which one was chosen
        word = words[pos]

        # Append the new word to our running text
        text.append(word)

        # Stop when we hit two None tokens in a row, which represent the end of a sentence
        if text[-2:] == [None, None]:
            break
        
        # Make sure we don't run forever
        if zero_temperature and len(text) > 100:
            break
                
    return " ".join([t for t in text if t])
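The two branches in the function correspond to the endpoints temperature = 0 (always take the most likely word) and temperature = 1 (sample from the model's probabilities as they are). One common way to interpolate between them, which the function above does not implement, is to raise each probability to the power 1/T and renormalize. The sketch below shows that idea with the standard library only; `apply_temperature` and `sample_index` are hypothetical helpers.

```python
import random


def apply_temperature(probs, temperature):
    """Reshape a probability list: T=0 is greedy, T=1 leaves it unchanged."""
    if temperature == 0:
        # Put all the mass on the first most likely entry (same choice as argmax)
        best = max(range(len(probs)), key=probs.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(probs))]
    # p ** (1/T): T < 1 sharpens the distribution, T > 1 flattens it
    weights = [p ** (1.0 / temperature) for p in probs]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(probs, temperature, rng=random):
    """Draw one index according to the temperature-adjusted probabilities."""
    adjusted = apply_temperature(probs, temperature)
    return rng.choices(range(len(adjusted)), weights=adjusted, k=1)[0]


probs = [0.5, 0.25, 0.25]
for temperature in [0, 0.5, 1, 2]:
    print(temperature, apply_temperature(probs, temperature))
```

At low temperatures the walk keeps following the most common continuations, which is why the original function caps the text length only in the zero-temperature case, where a loop can repeat forever.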

In [30]:
generate_sentence_from_prompt(('United', 'States'))

'United States does not intend to back up our lost position will be allowed after such a bill tomorrow that would reduce tax preferences , advised a month earlier and 15 , 1987 , but Trump said the bank said .'

In [34]:
generate_sentence_from_prompt(('today', 'the'))

'today the pound .'

In [38]:
generate_sentence_from_prompt(('financial', 'markets'))

"financial markets ' political perceptions , which has had to be done to improve prices , with profits of 1 . 60 dlrs vs 1 . 50 dlrs / bbl , according to a 6 . 70 dlrs vs 1 , 547 , 000 Revs 401 . 8 mln Revs 63 . 9 mln vs 451 , 000 NOTE : Net excludes extraordinary gain of 150 dlrs a share vs a gain of 3 . 85 and 5 . 3 PCT IN 4TH QTR LOSS Shr loss nine cts vs 75 , 000 vs loss 14 cts Net 129 . 3 mln vs 12 . 96 last month ."

<center>
     <img src="https://raw.githubusercontent.com/DataForScience/Networks/master/data/D4Sci_logo_full.png" alt="Data For Science, Inc" align="center" border="0" width=300px> 
</center>