# Technical guide #1: GLLM working example

This technical guide shows how to implement a textual analysis task using a GLLM model. 

The guide is a companion to the paper *"Generative LLMs and Textual Analysis in Accounting: (Chat)GPT as Research Assistant?"* ([SSRN](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4429658)).

**Author:** Ties de Kok    
**Last updated:** April 2023    
**Status:** *Early - work-in-progress*    

----
# Imports
----


All the dependencies required for this notebook are provided in the `environment.yml` file.

To install: `conda env create -f environment.yml` --> this creates the `gllm` environment.

I recommend using Python 3.9 or higher to avoid dependency conflicts.

**Python built-in libraries**

In [43]:
import os, sys, re, copy, random, json, time, datetime
from pathlib import Path
import getpass

**Libraries for interacting with the OpenAI API**

In [2]:
import requests
import openai
import tiktoken

**General helper libraries**

In [3]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

In [4]:
from sklearn.metrics import classification_report

### Settings

In [5]:
pd.options.mode.chained_assignment = None  # default='warn'
pd.set_option('display.max_columns', 150)
pd.set_option('display.max_rows', 150)

### Utility functions

In [6]:
## This function makes it easier to print rendered markdown through a code cell.

from IPython.display import Markdown, display

def mprint(text, *args, **kwargs):
    if 'end' in kwargs.keys():
        text += kwargs['end']
        
    display(Markdown(text))

-----
# Task description
----

Let's say the objective of our toy problem is as follows: 

> Identify earnings call sentences that contain a forward-looking statement.

In this notebook we will create a prediction pipeline to accomplish this using the OpenAI ChatGPT (*gpt-3.5-turbo*) API. 

An example of an FLS sentence would be:

> We anticipate that sales in our Europe segment will go up. 

An example of a non-FLS sentence would be:

> Our Europe segment is performing well and the sales numbers are up relative to last quarter. 

For example, a question that tries to elicit a FLS would be: 

**Disclaimer:**

As discussed in the paper, GLLMs are generally best suited to solve problems that would otherwise require a machine learning approach or manual coding. The objective of this toy example is to illustrate the basic coding steps to implement a GLLM pipeline, not to illustrate the best type of problem to solve using a GLLM model. This specific task can likely also be solved using traditional methods such as FLS keyword lists, but it makes for an example that is easier to follow and adapt. 

## Load data

In [7]:
with open(Path.cwd() / "data" / "statements.json", "r", encoding = "utf-8") as f:
    statement_list = json.load(f)

statement_df = pd.DataFrame(statement_list)

In [8]:
mprint(f"We have **{len(statement_list)}** statements.")

We have **60** statements.

In [9]:
statement_df.sample(5)

| | i | statement | contains_fls |
|---|---|---|---|
| 56 | 57 | We are excited about the possibilities that li... | 1 |
| 4 | 5 | In the past year, we have successfully reduced... | 0 |
| 17 | 18 | Our ongoing digital transformation efforts are... | 1 |
| 3 | 4 | We expect to see continued growth in the Asian... | 1 |
| 32 | 33 | The successful rollout of our new customer loy... | 0 |


### Select a demo example

#### An example with an FLS

In [10]:
fls_row = statement_df[statement_df.contains_fls == 1].iloc[0].to_dict()
fls_row

{'i': 2,
 'statement': 'We anticipate that our investments in R&D will lead to a 20% improvement in efficiency in the next two years.',
 'contains_fls': 1}

#### An example without an FLS

In [11]:
non_fls_row = statement_df[statement_df.contains_fls == 0].iloc[0].to_dict()
non_fls_row

{'i': 1,
 'statement': 'In the last quarter, we managed to increase our revenue by 15% due to the successful launch of our new product line.',
 'contains_fls': 0}

-------
# #1 - Understanding the requirements
-----

**What information is necessary to complete the task?**

In order to evaluate whether a statement is forward-looking, one likely needs to know the following:

1. The statement (obviously)  
2. The definition of what would constitute a forward-looking statement

This yields a few questions / discussion points:

- It likely won't matter if we include additional details, such as the company name, the managers' names, or the analysts' names. We could add these as additional context; however, that would eat into our token budget and might distract the model from our primary task. 
- It is not immediately clear whether it is necessary to provide the model with an explicit definition, or examples, of what constitutes a forward-looking statement. Doing so would require more tokens, so we can try it without an explicit definition first and adjust things from there. 


----
# #2 - Decide on approach and model
----

**What approach to try first?**

Based on our evaluation above, it is quite plausible that a larger GLLM model, such as ChatGPT (gpt-3.5-turbo) or GPT-4, could perform this task in a *zero-shot* fashion. I consider this plausible because (1) the amount of context required is small, (2) the task is reasonably easy to do for a human, and (3) ad-hoc experimentation with a few random examples in ChatGPT shows promising results. 

--> So let's try the *zero-shot* approach first. We can always switch to the *few-shot* or *fine-tuning* approaches later.

**What model?**

*Zero-shot* approaches work best with larger instruct-tuned models, such as ChatGPT/GPT-4. At the time of writing, the ChatGPT (`gpt-3.5-turbo`) API endpoints provided by OpenAI are both powerful and cost-efficient. The GPT-4 API would likely work better; however, it is significantly slower, more expensive, and invite-only. 

The ChatGPT API also does not use the prompts & completions for training purposes.

----
# #3 - Prompt engineering
----

There are a few things we need to achieve with our prompt template:

1. It needs to properly instruct the model to correctly identify FLS.  
2. It needs to yield a result that is consistent and easily parseable for us.  
3. It needs to use up as few tokens as possible. 

We can design our prompt in many different ways, for example we could write it like we would to a human:

In [12]:
prompt_template_v1 = """
You are a research assistant and your task is to classify whether the following statement from an earnings conference call possibly making a forward-looking statement (FLS). Return the results in valid JSON format using the format {{"fls" : 0 or 1}}. The statement is:
{statement}
""".strip()

## Note, we need to use double curly braces for the output format so that our "format" operation below does not expect us to fill it. 
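
The brace escaping can be checked in isolation. Below is a minimal sketch with a made-up template and statement (both are hypothetical and not part of the notebook's data): `{{` and `}}` come out as literal braces, while `{statement}` is filled in.

```python
# Hypothetical mini-template: doubled braces are literal, {statement} is a placeholder.
demo_template = 'Format: {{"fls" : 0 or 1}}\nStatement: {statement}'

# str.format fills the placeholder and turns {{ }} into single braces.
demo_filled = demo_template.format(statement="We expect sales to grow.")
print(demo_filled)
```

With single braces around the JSON example, `format` would instead try to treat it as a placeholder and raise an error.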

This prompt might work, however, we can improve it in several ways:

1. We can reduce the number of tokens.    
2. We can remove ambiguous language such as "possibly".    
3. We can make the model more likely to yield the desirable output.   
4. We can make it easier to add additional instructions by using a list.    

For example by creating a prompt like this:

In [13]:
prompt_template_v2 = """
Task: classify whether the statement below contains a forward looking statements (fls).
Rules:
- Answer using JSON in the following format: {{"contains_fls" : 0 or 1}}
Statement:
> {statement}
JSON =
""".strip()

### Let's fill these templates to see what they would look like:

In [14]:
input_dict = {
    "statement" : fls_row["statement"]
}

In [15]:
prompt_1 = prompt_template_v1.format(**input_dict)
print(prompt_1)

You are a research assistant and your task is to classify whether the following statement from an earnings conference call possibly making a forward-looking statement (FLS). Return the results in valid JSON format using the format {"fls" : 0 or 1}. The statement is:
We anticipate that our investments in R&D will lead to a 20% improvement in efficiency in the next two years.


In [16]:
prompt_2 = prompt_template_v2.format(**input_dict)
print(prompt_2)

Task: classify whether the statement below contains a forward looking statements (fls).
Rules:
- Answer using JSON in the following format: {"contains_fls" : 0 or 1}
Statement:
> We anticipate that our investments in R&D will lead to a 20% improvement in efficiency in the next two years.
JSON =


### How many tokens would each of these use?

#### Set up `tiktoken`

In [17]:
encoder = tiktoken.encoding_for_model("gpt-3.5-turbo")

#### Compare

In [18]:
f"The first prompt template requires {len(encoder.encode(prompt_1)):,} tokens"

'The first prompt template requires 79 tokens'

In [19]:
f"The second prompt template requires {len(encoder.encode(prompt_2)):,} tokens"

'The second prompt template requires 68 tokens'

In [20]:
## Remember: fewer tokens means lower costs and faster responses. Here we save about 14% (68 vs. 79 tokens).

---
## Test the prompt

### Set up OpenAI

There are roughly three ways to interact with the OpenAI API through Python:

- Directly using `requests`   
- Using the official `openai` Python library   
- Through a higher level library such as `langchain`

To use the OpenAI API you will need an API key. If you don't have one, follow these steps:   

1. Create an OpenAI account --> https://platform.openai.com   
2. Create an OpenAI API key --> https://platform.openai.com/account/api-keys   
3. You will get \$5 in free credits if you create a new account. If you've used that up, you will need to add a payment method. The code in this notebook will cost less than a dollar to run.

Once you have your OpenAI Key, you can set it as the `OPENAI_API_KEY` environment variable (recommended) or enter it directly below.

In [21]:
if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass(prompt='Enter your API key: ')
    
openai.api_key = os.environ['OPENAI_API_KEY']
    
## KEEP YOUR KEY SECURE, ANYONE WITH ACCESS TO IT CAN GENERATE COSTS ON YOUR ACCOUNT!

### Define API parameters

##### Basic parameters

In [22]:
model = "gpt-3.5-turbo"
temperature = 0 ## Setting this to 0 makes the generations, mostly, deterministic

##### Set the system message

The chat models, `gpt-3.5-turbo` and `gpt-4`, also accept a so-called system message. These models are specifically trained to follow the role that is explained to them in this system message. The `gpt-3.5-turbo` model does not always pay strong attention to this system message; GPT-4, however, does. 

The default system message is "You are a helpful assistant."

For more details, see: https://platform.openai.com/docs/guides/chat/introduction

In [23]:
system_message = "You are a serious research assistant who follows exact instructions and returns only valid JSON."
## Note, the system message also counts towards our token usage.

### Make a generation

#### The input

Let's first generate a prediction for the FLS example.

In [24]:
prompt = prompt_template_v2.format(**{
    "statement" : fls_row["statement"].strip()
})

print(prompt)

Task: classify whether the statement below contains a forward looking statements (fls).
Rules:
- Answer using JSON in the following format: {"contains_fls" : 0 or 1}
Statement:
> We anticipate that our investments in R&D will lead to a 20% improvement in efficiency in the next two years.
JSON =


#### The request

In [25]:
completion_fls_1 = openai.ChatCompletion.create(
    model = model,
    temperature = temperature,
    messages=[
        {"role": "system", "content" : system_message},
        {"role": "user", "content": prompt}
    ]
)

#### The result

In [26]:
completion_fls_1.keys()

dict_keys(['id', 'object', 'created', 'model', 'usage', 'choices'])

**How many tokens did we just use?**

In [27]:
dict(completion_fls_1.usage)

{'prompt_tokens': 97, 'completion_tokens': 9, 'total_tokens': 106}

Our request used up 106 tokens in total, 97 in the prompt and 9 in the completion.

This is higher than our `tiktoken` estimate of 68 tokens because the request also includes the system message, and the chat format adds a few special tokens to each message. 

We can use the April 2023 pricing to estimate how much our 60 observations will cost to predict:
- **gpt-3.5-turbo:** \$0.002 / 1K tokens

In [28]:
num_tokens = 108  ## Rough per-request total, slightly above the 106 we observed
num_obs = 60
price_per_1k_tokens = 0.002  ## The price is quoted per 1,000 tokens

cost = (num_tokens/1000) * price_per_1k_tokens * num_obs
mprint(f"""Running **{num_obs}** predictions will cost around **~${cost}**.""")

Running **60** predictions will cost around **~$0.01296**.
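
The same arithmetic can be wrapped in a small helper so it is easy to rerun with other sample sizes or prices. This is only a sketch; the function name `estimate_cost` and its parameter names are hypothetical, and the price must be the per-1,000-token price.

```python
def estimate_cost(num_obs, tokens_per_obs, price_per_1k_tokens):
    """Return the estimated total cost in dollars.

    Assumes every observation uses roughly the same number of tokens
    (prompt + completion) and that the price is quoted per 1,000 tokens.
    """
    total_tokens = num_obs * tokens_per_obs
    return total_tokens / 1000 * price_per_1k_tokens


# Same inputs as the cell above: 60 observations at about 108 tokens each.
demo_cost = estimate_cost(num_obs=60, tokens_per_obs=108, price_per_1k_tokens=0.002)
print(f"Estimated cost: ${demo_cost:.4f}")
```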

**Did it work?**

In [29]:
result = dict(completion_fls_1["choices"][0]["message"])
result

{'role': 'assistant', 'content': '{"contains_fls" : 1}'}

In [30]:
prediction = json.loads(result["content"])
prediction

{'contains_fls': 1}
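
The model will not always return clean JSON, so it can help to parse defensively. The sketch below is one possible approach, not part of the original pipeline: the helper name `parse_fls_response` is hypothetical, and it assumes the answer is a flat JSON object with a `contains_fls` key that is either 0 or 1.

```python
import json
import re


def parse_fls_response(content):
    """Extract {"contains_fls": 0 or 1} from a model response, or return None."""
    # Grab the first {...} span, so extra text around the JSON is ignored.
    match = re.search(r"\{.*?\}", content, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    value = parsed.get("contains_fls")
    # Only accept the integers 0 and 1 (not strings or booleans).
    if type(value) is not int or value not in (0, 1):
        return None
    return parsed


print(parse_fls_response('{"contains_fls" : 1}'))
print(parse_fls_response('Sure! Here is the JSON: {"contains_fls": 0}'))
print(parse_fls_response("I am not sure."))
```

Returning `None` instead of raising makes it easy to collect failed cases and retry or inspect them later.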

## Let's wrap that up into some functions

We can wrap the above logic up into a function so that we can scale things more easily.

In [31]:
def generate_prompt(data, prompt_template):
    prompt = prompt_template.format(**{
        "statement" : data["statement"].strip()
    })
    return prompt

In [32]:
def make_prediction(
    i, ## Adding an identifier will make things easier to track and match up later.
    prompt, 
    model = model,
    temperature = temperature,
    system_message = system_message
    ):
    
    completion = openai.ChatCompletion.create(
        model = model,
        temperature = temperature,
        stop = ["}"], ## This forces the model to stop when it hits a closing brace
        messages = [
                {"role": "system", "content" : system_message},
                {"role": "user", "content": prompt}
            ]
    )

    result = dict(completion["choices"][0]["message"])
    prediction = json.loads(result["content"] + "}") 
    ## The end is not included, so we add it back, hence the + "}"
    prediction["i"] = i
        
    return prediction

## Test

In [33]:
prompt = generate_prompt(fls_row, prompt_template_v2)
prediction = make_prediction(fls_row["i"], prompt)
print(prompt, end = "\n\n")
print(prediction)

Task: classify whether the statement below contains a forward looking statements (fls).
Rules:
- Answer using JSON in the following format: {"contains_fls" : 0 or 1}
Statement:
> We anticipate that our investments in R&D will lead to a 20% improvement in efficiency in the next two years.
JSON =

{'contains_fls': 1, 'i': 2}


In [34]:
prompt = generate_prompt(non_fls_row, prompt_template_v2)
prediction = make_prediction(non_fls_row["i"], prompt)
print(prompt, end = "\n\n")
print(prediction)

Task: classify whether the statement below contains a forward looking statements (fls).
Rules:
- Answer using JSON in the following format: {"contains_fls" : 0 or 1}
Statement:
> In the last quarter, we managed to increase our revenue by 15% due to the successful launch of our new product line.
JSON =

{'contains_fls': 0, 'i': 1}


### Let's run it for the full sample of 60 so that we can evaluate

Using GLLMs in a zero-shot fashion can create some variation in the prediction behavior (for example, output that is not valid JSON); the try/except statement below retries a failed prediction a few times before giving up. 

In [35]:
pred_list = []
num_tries = 3
for item in tqdm(statement_list):
    current_try = 0 
    while True:
        try:
            prompt = generate_prompt(item, prompt_template_v2)
            prediction = make_prediction(item["i"], prompt)
            pred_list.append(prediction)
            break
        except Exception as e:
            current_try += 1
            if current_try >= num_tries:  ## Give up after num_tries failed attempts
                print(f"There is an issue with {item['i']}:\n{e}")
                break
            else:
                time.sleep(1)

  0%|          | 0/60 [00:00<?, ?it/s]

This took about 90 seconds to complete for 60 observations. 
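
The loop above waits a fixed second between attempts. A common alternative is to double the wait after each failure (exponential backoff). The sketch below illustrates that idea with a stand-in function instead of a real API call; `call_with_retries` and `flaky_call` are hypothetical names, and the `sleep` argument exists only so the demo does not actually wait.

```python
import time


def call_with_retries(func, max_tries=3, base_delay=1.0, sleep=time.sleep):
    """Call func() up to max_tries times, doubling the wait after each failure."""
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_tries:
                raise  # Out of attempts: let the caller see the error.
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for an API call that fails twice before succeeding.
calls = {"n": 0}


def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return {"contains_fls": 1}


waits = []  # Record the requested waits instead of sleeping.
result = call_with_retries(flaky_call, max_tries=3, sleep=waits.append)
print(result, waits)
```

In the notebook, `func` could be something like `lambda: make_prediction(item["i"], prompt)`.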

In [36]:
pred_df = pd.DataFrame(pred_list)
pred_df.head()

| | contains_fls | i |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 2 |
| 2 | 0 | 3 |
| 3 | 1 | 4 |
| 4 | 0 | 5 |


------
# #4 - Evaluate performance
----

## Evaluation dataset

Our toy example dataset already has truth labels, so we will use that.

However, you will generally need to construct a ground-truth evaluation dataset yourself.

You can do so by drawing a random sample, storing it without any predictions, and adding the labels yourself. There are tools for this (e.g., `prodigy`), but for most cases Excel will be the easiest tool to use. 
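
As a rough illustration of that workflow, the sketch below draws a reproducible random sample and writes it to a CSV file with an empty `label` column that can be filled in by hand (for example in Excel). The function name, column names, file name, and demo rows are all hypothetical.

```python
import csv
import os
import random
import tempfile


def export_labeling_sample(rows, path, n, seed=0):
    """Write a random sample of rows to CSV with an empty 'label' column."""
    sample = random.Random(seed).sample(rows, n)  # Fixed seed: reproducible sample.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["i", "statement", "label"])
        writer.writeheader()
        for row in sample:
            # Predictions are deliberately left out so they cannot bias the labeler.
            writer.writerow({"i": row["i"], "statement": row["statement"], "label": ""})
    return sample


# Demo with made-up rows, written to a temporary folder.
demo_rows = [{"i": k, "statement": f"Statement {k}"} for k in range(1, 11)]
out_path = os.path.join(tempfile.mkdtemp(), "to_label.csv")
sample = export_labeling_sample(demo_rows, out_path, n=3)
print(f"Wrote {len(sample)} rows to {out_path}")
```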

### Add the predictions to our dataset

In [37]:
pred_df = pred_df.rename(columns = {"contains_fls" : "prediction"})

In [38]:
eval_df = pd.merge(statement_df, pred_df, on = "i", how = "left")
eval_df.sample(5)

| | i | statement | contains_fls | prediction |
|---|---|---|---|---|
| 20 | 21 | The successful integration of the acquired com... | 0 | 0 |
| 34 | 35 | Our focus on continuous improvement has enable... | 0 | 0 |
| 51 | 52 | The resilience of our business model has been ... | 0 | 0 |
| 38 | 39 | Our focus on cost optimization during the last... | 0 | 0 |
| 49 | 50 | Over the next year, we plan to increase our in... | 1 | 1 |


### Evaluate

Because we designed our prompt to return a simple 0 or 1, we can treat this like a supervised machine learning approach. For example, we can calculate a classification report using the `scikit-learn` library.

In [39]:
print(classification_report(
    eval_df["contains_fls"], 
    eval_df["prediction"]
))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98        30
           1       1.00      0.97      0.98        30

    accuracy                           0.98        60
   macro avg       0.98      0.98      0.98        60
weighted avg       0.98      0.98      0.98        60



Our prediction pipeline works! This is a simple task and ChatGPT is a powerful model, so this is not surprising. 
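
To make the numbers in the report more concrete, the sketch below computes precision, recall, and F1 for the FLS class by hand. The labels are synthetic: they are chosen to be consistent with the report above (30 non-FLS sentences all predicted correctly, and 1 of the 30 FLS sentences missed), not taken from the notebook's actual predictions. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Synthetic labels: 30 non-FLS (all correct) and 30 FLS (one predicted as 0).
y_true = [0] * 30 + [1] * 30
y_pred = [0] * 30 + [1] * 29 + [0]

metrics = binary_metrics(y_true, y_pred)
print({name: round(value, 2) for name, value in metrics.items()})
```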

----
# #5 - Execute
---

Once we are happy with our approach we can implement the remaining code to run it for all the observations. 

A few practical recommendations:

1. Always make sure to store all your raw predictions. 
2. Optimize your prompt to reduce the tokens and costs.    
3. Optimize your prediction pipeline to make things faster.  

I elaborate on these points below:

---
### Storing predictions

Below is an example of how one can store the raw predictions right after they are made.

#### First we create a place to store the data:
*Note:* I have a specific folder structure that I like to use to keep things organized. In that structure I would store these in the "store" pipeline folder. You can read more about that here:

https://medium.com/towards-data-science/how-to-keep-your-research-projects-organized-part-1-folder-structure-10bd56034d3a

In [40]:
store_loc = Path.cwd() / "storage"
if not store_loc.exists():
    os.makedirs(store_loc)

#### Next we adapt the function to store the data:

In [41]:
def make_prediction(
    i, 
    prompt, 
    model = model,
    temperature = temperature,
    system_message = system_message,
    store_loc = store_loc ## We add the storage location to our function
    ):
    
    completion = openai.ChatCompletion.create(
        model = model,
        temperature = temperature,
        stop = ["}"], 
        messages = [
                {"role": "system", "content" : system_message},
                {"role": "user", "content": prompt}
            ]
    )

    result = dict(completion["choices"][0]["message"])
    prediction = json.loads(result["content"] + "}") 
    prediction["i"] = i
    
    ## Before returning it, we store the data:
    
    ### We never want to overwrite our raw predictions, so we add the timestamp
    timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    with open(store_loc / f"{i}-=-{timestamp}.json", "w", encoding = "utf-8") as f:
        json.dump(prediction, f)
        
    return prediction

#### Now we always store our data!

In [44]:
prompt = generate_prompt(fls_row, prompt_template_v2)
prediction = make_prediction(fls_row["i"], prompt)

In [45]:
os.listdir(store_loc)

['2-=-20230424_184659_929113.json', '2-=-20230425_150752_851815.json']
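
Because every run adds a new timestamped file, we eventually need to read the files back and keep one prediction per identifier. The sketch below keeps the most recent file for each identifier, relying on the `{i}-=-{timestamp}.json` naming used above. The helper name `load_latest_predictions` is hypothetical, and the demo files contain made-up predictions.

```python
import json
import tempfile
from pathlib import Path


def load_latest_predictions(store_loc):
    """Return {identifier: prediction}, keeping the newest file per identifier."""
    latest = {}
    for path in sorted(Path(store_loc).glob("*-=-*.json")):
        identifier, timestamp = path.stem.split("-=-", 1)
        # The timestamp format sorts chronologically when compared as text.
        if identifier not in latest or timestamp > latest[identifier][0]:
            with open(path, "r", encoding="utf-8") as f:
                latest[identifier] = (timestamp, json.load(f))
    return {identifier: pred for identifier, (_, pred) in latest.items()}


# Demo with two made-up files for the same identifier.
demo_dir = Path(tempfile.mkdtemp())
demo_files = [
    ("2-=-20230424_184659_929113.json", {"contains_fls": 0, "i": 2}),
    ("2-=-20230425_150752_851815.json", {"contains_fls": 1, "i": 2}),
]
for name, pred in demo_files:
    (demo_dir / name).write_text(json.dumps(pred), encoding="utf-8")

latest = load_latest_predictions(demo_dir)
print(latest)
```

Note that the identifiers come back as strings because they are parsed from the file names.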

----
### How much would it cost to scale it?

How much would it cost to classify, let's say, 100,000 sentences?

In [46]:
num_tokens = 120 ## Rough estimate, slightly higher than what we've seen.
num_obs = 100_000
price_per_1k_tokens = 0.002  ## The price is quoted per 1,000 tokens

cost = (num_tokens/1000) * price_per_1k_tokens * num_obs
mprint(f"""Running **{num_obs:,}** predictions will cost around **~${cost}**.""")

Running **100,000** predictions will cost around **~$24.0**.

-----
### How much time would it take to scale it?

#### How much time would it take to run 100,000?

It took about 90 seconds to run 60 observations.

I.e., about 1.5 seconds per observation, which we round up to 2 seconds to be conservative. That is slow!

In [47]:
num_obs = 100_000
sec_per_obs = 2

total_seconds = sec_per_obs * num_obs
mprint(f"""Running **{num_obs:,}** predictions will take around **~{total_seconds/60/60:.2f}** hours.""")

Running **100,000** predictions will take around **~55.56** hours.

#### How do we make it faster?

Generating predictions for 100,000 observations would take about 2.3 days, that is really slow...

Here are a few tips and tricks to speed things up:

**1. Don't run predictions on data you don't need predictions for.**

Is there any filtering or pre-work that we can do to reduce the number of predictions that require a GLLM generation? In this case we could, for example, create a rough word list. If we make the word list over-inclusive and biased toward false positives we can then feed all the potential matches through the GLLM approach to clean it up.
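
A minimal sketch of such a pre-filter is shown below. The cue list is hypothetical and deliberately over-inclusive (for example, `will` also matches "willing"); it is meant to illustrate the idea, not to be a validated FLS dictionary.

```python
import re

# Hypothetical, over-inclusive list of forward-looking cue words.
FLS_CUES = ["anticipate", "expect", "will", "plan", "intend",
            "forecast", "outlook", "next", "going to", "aim"]

# Match at the start of a word only, so "expects" and "planned" also match.
FLS_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(cue) for cue in FLS_CUES) + r")",
    re.IGNORECASE,
)


def might_contain_fls(statement):
    """Return True if the statement contains any cue and should go to the GLLM."""
    return FLS_PATTERN.search(statement) is not None


demo_statements = [
    "We anticipate that our investments in R&D will lead to a 20% improvement in efficiency in the next two years.",
    "In the last quarter, we managed to increase our revenue by 15% due to the successful launch of our new product line.",
]
candidates = [s for s in demo_statements if might_contain_fls(s)]
print(f"{len(candidates)} of {len(demo_statements)} statements need a GLLM prediction")
```

Statements without any cue can be labeled as non-FLS directly; it is worth checking on a labeled sample how many true FLS sentences such a filter would miss.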

**2. We can combine multiple predictions into one.**

The speed of the generation scales linearly with the number of tokens, which includes the instructions. Asking the model to make multiple predictions using a single set of instructions will thus be cheaper and faster (although it might not always work).

To illustrate:

In [48]:
prompt_template_multiple = """
Task: classify whether the statements below contain a forward looking statements (fls).
Rules:
- Answer using JSON in the following format: [{{"contains_fls" : 0 or 1, "i" : i}}, ..]
Statements:
{statements}
JSON =
""".strip()

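## Note: the items in statement_list include the `contains_fls` label, so this demo also shows the label to the model.
## For a real task, pass only the identifier and the statement.
prompt = prompt_template_multiple.format(**{"statements" : str(statement_list[:3])})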
print(prompt, end = "\n\n")

completion = openai.ChatCompletion.create(
    model = "gpt-3.5-turbo",
    temperature = 0,
    stop = ["]"],
    messages = [
            {"role": "user", "content": prompt}
        ]
)

result = dict(completion["choices"][0]["message"])
prediction = json.loads(result["content"] + "]") 

prediction

Task: classify whether the statements below contain a forward looking statements (fls).
Rules:
- Answer using JSON in the following format: [{"contains_fls" : 0 or 1, "i" : i}, ..]
Statements:
[{'i': 1, 'statement': 'In the last quarter, we managed to increase our revenue by 15% due to the successful launch of our new product line.', 'contains_fls': 0}, {'i': 2, 'statement': 'We anticipate that our investments in R&D will lead to a 20% improvement in efficiency in the next two years.', 'contains_fls': 1}, {'i': 3, 'statement': 'Our recent acquisition of XYZ Company has already started to show positive results in terms of cost savings and market reach.', 'contains_fls': 0}]
JSON =



[{'contains_fls': 0, 'i': 1},
 {'contains_fls': 1, 'i': 2},
 {'contains_fls': 0, 'i': 3}]
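
To apply this to a full sample, the statements need to be split into batches and serialized for the prompt. The sketch below shows one way to do that; the helper names and the demo items are hypothetical. It only passes the identifier and the statement text, so any label stored alongside the statement is not sent to the model.

```python
import json


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[k:k + batch_size] for k in range(0, len(items), batch_size)]


def format_statements(batch):
    """Serialize only the identifier and the text of each statement."""
    return json.dumps([{"i": item["i"], "statement": item["statement"]} for item in batch])


# Demo with made-up items that also carry a label we do not want to send.
demo_items = [
    {"i": k, "statement": f"Statement {k}", "contains_fls": k % 2}
    for k in range(1, 8)
]
batches = make_batches(demo_items, batch_size=3)
print([len(batch) for batch in batches])
print(format_statements(batches[-1]))
```

Each serialized batch can then be filled into `prompt_template_multiple` as the `statements` value. Smaller batches are usually safer, since a single malformed response invalidates the whole batch.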

**3. We can saturate the rate limits by making async requests**

The rate limits for the ChatGPT API are (as of April 2023):

- 3,500 RPM (Requests-per-minute)
- 90,000 TPM (Tokens-per-minute)

**Theoretical throughput:** 90,000 / 120 tokens = 750 predictions per minute    
Versus   
**Sequential throughput:** 30 predictions per minute

So we can potentially speed things up by about 25x (750 / 30) with an async approach!

For an implementation example, see the OpenAI Cookbook:

- https://github.com/openai/openai-cookbook/blob/main/examples/How_to_handle_rate_limits.ipynb
- https://github.com/openai/openai-cookbook/blob/297c53430cad2d05ba763ab9dca64309cb5091e9/examples/api_request_parallel_processor.py
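
To give a feel for the async pattern, here is a minimal sketch using only the standard library. It uses a stand-in coroutine (`fake_predict`) instead of a real API call, and all names are hypothetical. The semaphore only caps the number of requests in flight; it does not enforce the RPM/TPM limits, which the cookbook script linked above handles more carefully.

```python
import asyncio


async def fake_predict(item):
    """Stand-in for an async API request; replace with a real call in practice."""
    await asyncio.sleep(0.01)  # Simulates network latency.
    return {"i": item["i"], "contains_fls": 0}


async def predict_all(items, predict, max_concurrent=10):
    """Run predict(item) for all items with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(item):
        async with semaphore:
            return await predict(item)

    # gather returns the results in the same order as the inputs.
    return await asyncio.gather(*(bounded(item) for item in items))


demo_items = [{"i": k} for k in range(1, 26)]
results = asyncio.run(predict_all(demo_items, fake_predict, max_concurrent=5))
print(f"Collected {len(results)} predictions")
```

Inside a Jupyter notebook an event loop is already running, so there you would use `results = await predict_all(...)` instead of `asyncio.run(...)`.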