The code below demonstrates how to use the OpenAI API to create text embeddings that can be used to classify text without any labeled training data, an approach known as zero-shot classification.

In [None]:
import os
import json
from openai import OpenAI
import tiktoken
from numpy import dot
from numpy.linalg import norm

The code below assumes you have an environment variable named OPENAI_SECRET that holds the secret API key for your OpenAI account.

In a terminal on macOS, the command to create an environment variable is: export OPENAI_SECRET='enter your secret key between the quotes'

On Windows, search for environment variables and create an environment variable named OPENAI_SECRET containing your OpenAI secret key.
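
Because os.getenv returns None when a variable is missing, a misspelled or unset variable tends to surface only later as an authentication error. A minimal sketch of an early check, assuming a hypothetical helper named require_env that is not part of the original notebook:

```python
import os

def require_env(name, env=None):
    """Return the value of environment variable `name`; raise if unset or empty.

    `require_env` is a hypothetical helper used for illustration only.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Usage (commented out so the block runs even when no key is configured):
# api_key = require_env("OPENAI_SECRET")
```

The returned value could then be passed as api_key when the client is created.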

In [None]:
client = OpenAI(api_key=os.getenv("OPENAI_SECRET"))

The code below creates an OpenAI client object that can be used to access the OpenAI API endpoints.

This example uses the text-embedding-3-small model, which was priced at about 2 cents per million tokens when it was released.

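The tiktoken library tokenizes text into integer token IDs, each of which corresponds to a group of characters. The cl100k_base encoding is the one used by the text-embedding-3-small model, so the token IDs can be passed to it directly as input.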

In [None]:
client = OpenAI(api_key=os.getenv("OPENAI_SECRET"))
model = 'text-embedding-3-small'
encoding = tiktoken.get_encoding("cl100k_base")


The code below loads a file of newline delimited JSON records. The list variable out will hold records for the classification.

In [None]:
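# Read one JSON record per line, skipping blank lines; the with block closes the file.
with open('in/twtr2015orcl.json') as f:
    j = [json.loads(z) for z in f if z.strip()]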
out = []

The function below calculates cosine similarity, defined as the dot product of two vectors divided by the product of their Euclidean norms.

In [None]:
def cos_sim(a,b):
    return dot(a, b)/(norm(a)*norm(b))
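
As a quick sanity check of the formula, the sketch below recomputes cosine similarity with the standard library only, under the hypothetical name cos_sim_plain, and applies it to small hand-made vectors. Vectors pointing the same way should score 1, perpendicular vectors 0, and opposite vectors -1, up to floating-point rounding.

```python
import math

def cos_sim_plain(a, b):
    # Hypothetical stand-alone version of cos_sim:
    # dot product divided by the product of the Euclidean norms.
    dot_ab = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_ab / (norm_a * norm_b)

parallel = cos_sim_plain([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # same direction
orthogonal = cos_sim_plain([1.0, 0.0], [0.0, 1.0])          # right angle
opposite = cos_sim_plain([1.0, 2.0], [-1.0, -2.0])          # opposite direction
print(round(parallel, 6), round(orthogonal, 6), round(opposite, 6))
```

Note that the score depends only on direction, not on vector length, which is why [1, 2, 3] and [2, 4, 6] are treated as identical.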

The code below tokenizes the classification labels and retrieves embeddings for the classification categories.

In [None]:
labels = [encoding.encode(z) for z in ['technical analysis','no financial information','accounting information']]
labels_embeddings = [z.embedding for z in client.embeddings.create(input= labels, model = model).data]

The code below iterates through the dataset 10 records at a time and retrieves embeddings that represent the text of each tweet. It then calculates the cosine similarity between each classification label and the text of each tweet and saves the output to the out list variable.

In [None]:

for x in range(0, len(j), 10):
    batch = j[x:x+10]  # slicing keeps the final, possibly shorter, batch
    ids = [r['id']['$numberLong'] for r in batch]
    texts = [encoding.encode(r['text']) for r in batch]
    texts_embeddings = [z.embedding for z in client.embeddings.create(input=texts, model=model).data]
    for k1, t in enumerate(texts_embeddings):
        # One record per tweet: id -> similarities, in the same order as the labels
        o = {ids[k1]: []}
        for l in labels_embeddings:
            o[ids[k1]].append(cos_sim(l, t))
        out.append(o)
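
The batching step can be isolated and checked without calling the API. A minimal sketch, assuming a hypothetical generator named batches, that yields consecutive slices of at most 10 items and includes a shorter final slice when the dataset size is not a multiple of 10:

```python
def batches(items, size=10):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

records = list(range(25))  # stand-in for 25 tweets
sizes = [len(b) for b in batches(records)]
print(sizes)  # [10, 10, 5]
```

Slicing past the end of a list is safe in Python, so no special case is needed for the last batch.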

The code below saves the results to a tab-delimited text file, one row per tweet.

In [None]:
with open('out/twtr2015orcl_oneshot.txt','w') as f:
    f.write('id\ttechnical analysis\tno financial information\taccounting information\n')
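    # Each record maps one tweet id to its three similarity scores.
    # Distinct loop names avoid overwriting the dataset variable j.
    for record in out:
        for tweet_id, sims in record.items():
            f.write(f'{tweet_id}\t{sims[0]}\t{sims[1]}\t{sims[2]}\n')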
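
The similarity scores become a classification once each tweet is assigned the label with the highest score. A minimal sketch of that last step, assuming records shaped like the entries of out (one id mapped to three similarities in label order). The helper name best_label and the sample scores are made up for illustration and are not results from the dataset.

```python
LABELS = ['technical analysis', 'no financial information', 'accounting information']

def best_label(sims, labels=LABELS):
    """Return the label whose similarity score is highest."""
    best_index = max(range(len(sims)), key=lambda i: sims[i])
    return labels[best_index]

# Made-up scores in the same shape as the entries of `out`.
sample_out = [{'1001': [0.41, 0.22, 0.18]}, {'1002': [0.12, 0.35, 0.30]}]
predictions = {
    tweet_id: best_label(sims)
    for record in sample_out
    for tweet_id, sims in record.items()
}
print(predictions)
```

Taking the maximum is one simple decision rule; a minimum-score threshold could be added if tweets that match no label well should be left unclassified.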