# OpenAI - zero shot example

This guide is a companion to the paper *"Generative LLMs and Textual Analysis in Accounting: (Chat)GPT as Research Assistant?"* ([SSRN](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4429658)).

**Author:** [Ties de Kok](https://www.tiesdekok.com)    

----
# Imports
----


All the dependencies required for this notebook are provided in the `environment.yml` file.

To install: `conda env create -f environment.yml` --> this creates the `gllm` environment.

I recommend using Python 3.9 or higher to avoid dependency conflicts.

**Python built-in libraries**

In [1]:
import os, sys, re, copy, random, json, time, datetime
from pathlib import Path
import getpass

**Libraries for interacting with the OpenAI API**

In [2]:
import requests
import openai
import tiktoken

**General helper libraries**

In [3]:
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

### Settings

In [5]:
pd.options.mode.chained_assignment = None  # default='warn'
pd.set_option('display.max_columns', 150)
pd.set_option('display.max_rows', 150)

### Utility functions

In [6]:
## This function makes it easier to print rendered markdown through a code cell.

from IPython.display import Markdown, display  # explicit import so mprint also works outside a notebook namespace

def mprint(text, *args, **kwargs):
    if 'end' in kwargs.keys():
        text += kwargs['end']
        
    display(Markdown(text))

-----
# Toy example
----

I will use a hypothetical dataset of earnings call sentences and try to identify sentences with a forward-looking statement.

## Load data

In [7]:
with open(Path.cwd() / "data" / "statements.json", "r", encoding = "utf-8") as f:
    statement_list = json.load(f)

statement_df = pd.DataFrame(statement_list)

In [8]:
sentence_1 = statement_df.iloc[0].statement
print(sentence_1)

In the last quarter, we managed to increase our revenue by 15% due to the successful launch of our new product line.


----
## Prompt engineering
----

### Define prompt template

In [10]:
prompt_template = """
Task: classify whether the statement below contains a forward looking statements (fls).
Rules:
- Answer using JSON in the following format: {{"contains_fls" : 0 or 1}}
Statement:
> {statement}
JSON =
""".strip()

## Note: {statement} is the placeholder we fill in for each observation.
## The doubled braces {{ }} are escapes that become literal braces in the final prompt.
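To make the brace handling concrete, here is a minimal sketch of how `str.format` treats single versus doubled braces. The template and the example statement are hypothetical and exist only for illustration.

```python
# Doubled braces are emitted literally; single braces are placeholders.
demo_template = 'Format: {{"contains_fls" : 0 or 1}}\nStatement:\n> {statement}'

# Hypothetical statement, not taken from the dataset.
demo_prompt = demo_template.format(statement="We expect sales to grow next year.")
print(demo_prompt)
```

Forgetting to double the braces around the JSON example would make `format` try to look up a field named `"contains_fls" ` and raise a `KeyError`.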

### Create prompt

In [11]:
prompt = prompt_template.format(**{
    "statement" : sentence_1
})

In [12]:
print(prompt)

Task: classify whether the statement below contains a forward looking statements (fls).
Rules:
- Answer using JSON in the following format: {"contains_fls" : 0 or 1}
Statement:
> In the last quarter, we managed to increase our revenue by 15% due to the successful launch of our new product line.
JSON =


### How many tokens will this use?

In [13]:
encoder = tiktoken.encoding_for_model("gpt-3.5-turbo")

In [14]:
mprint(f"""
**Prompt:**
```
{prompt}
```
<br>

**Number of tokens:**
`{len(encoder.encode(prompt)):,}`
""".strip())

**Prompt:**
```
Task: classify whether the statement below contains a forward looking statements (fls).
Rules:
- Answer using JSON in the following format: {"contains_fls" : 0 or 1}
Statement:
> In the last quarter, we managed to increase our revenue by 15% due to the successful launch of our new product line.
JSON =
```
<br>

**Number of tokens:**
`70`

---
## Set up OpenAI
---

There are roughly three ways to interact with the OpenAI API through Python:

- Directly using `requests`   
- Using the official `openai` Python library   
- Through a higher level library such as `langchain`
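As a rough illustration of the first option, the sketch below builds the headers and JSON body for a chat completion request without sending it. The helper name `build_chat_request` and the placeholder key are assumptions made for this example; the rest of this notebook uses the `openai` library instead.

```python
import json

# Public chat completions endpoint of the OpenAI REST API.
CHAT_URL = "https://api.openai.com/v1/chat/completions"

def build_chat_request(api_key, model, messages, temperature=0):
    """Return the headers and JSON body for a chat completion request (hypothetical helper)."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    payload = {"model": model, "temperature": temperature, "messages": messages}
    return headers, payload

headers, payload = build_chat_request(
    api_key="sk-placeholder",  # placeholder, not a real key
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say hello."}],
)
print(json.dumps(payload, indent=2))

# To actually send it (requires a valid key and network access):
# import requests
# response = requests.post(CHAT_URL, headers=headers, json=payload, timeout=30)
# response.raise_for_status()
# content = response.json()["choices"][0]["message"]["content"]
```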

To use the OpenAI API you will need an API key. If you don't have one, follow these steps:   

1. Create an OpenAI account --> https://platform.openai.com   
2. Create an OpenAI API key --> https://platform.openai.com/account/api-keys   
3. You will get \\$5 in free credits if you create a new account. If you've used that up, you will need to add a payment method. The code in this notebook will cost less than a dollar to run.

Once you have your OpenAI API key, you can set it as the `OPENAI_API_KEY` environment variable (recommended) or enter it when prompted below.

In [15]:
if 'OPENAI_API_KEY' not in os.environ:
    os.environ['OPENAI_API_KEY'] = getpass.getpass(prompt='Enter your API key: ')
    
openai.api_key = os.environ['OPENAI_API_KEY']
    
## KEEP YOUR KEY SECURE, ANYONE WITH ACCESS TO IT CAN GENERATE COSTS ON YOUR ACCOUNT!

---
## Generate prediction
---

##### Basic parameters

In [16]:
model = "gpt-3.5-turbo"
temperature = 0 ## Setting this to 0 makes the generations, mostly, deterministic

##### Set the system message

The chat models, `gpt-3.5-turbo` and `gpt-4`, also accept a so-called system message. These models are specifically trained to follow the role that is explained to them in this system message. The `gpt-3.5-turbo` model does not always pay strong attention to the system message; GPT-4 does.

The default system message is "You are a helpful assistant."

For more details, see: https://platform.openai.com/docs/guides/chat/introduction

In [17]:
system_message = "You are a serious research assistant who follows exact instructions and returns only valid JSON."
## Note, the system message also counts towards our token usage.

### Make generation

In [18]:
completion = openai.ChatCompletion.create(
    model = model,
    temperature = temperature,
    messages=[
        {"role": "system", "content" : system_message},
        {"role": "user", "content": prompt}
    ]
)

### The result

In [19]:
completion.keys()

dict_keys(['id', 'object', 'created', 'model', 'choices', 'usage'])

**How many tokens did we just use?**

In [21]:
dict(completion.usage) ## Higher than our estimate because of chat token overhead

{'prompt_tokens': 97, 'completion_tokens': 9, 'total_tokens': 106}
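The gap between the earlier estimate and `prompt_tokens` comes from the system message plus a few formatting tokens that the chat format adds around each message. The sketch below estimates the total for a list of messages. The overhead values (3 tokens per message and 3 tokens to prime the reply) are assumptions based on OpenAI's published guidance for recent `gpt-3.5-turbo` versions and may differ for other model versions; `estimate_chat_tokens` is a hypothetical helper, and `encode` can be any callable that turns text into a list of tokens.

```python
def estimate_chat_tokens(messages, encode, tokens_per_message=3, reply_priming=3):
    """Estimate prompt tokens for a chat request (overhead values are assumptions)."""
    total = reply_priming  # tokens that prime the assistant's reply
    for message in messages:
        total += tokens_per_message  # formatting tokens wrapped around each message
        for value in message.values():
            total += len(encode(value))  # role and content are both encoded
    return total

# Stand-in encoder for demonstration only: splits on whitespace.
demo_messages = [
    {"role": "system", "content": "a b c"},
    {"role": "user", "content": "d e"},
]
demo_estimate = estimate_chat_tokens(demo_messages, str.split)
print(demo_estimate)

# In this notebook you would pass the tiktoken encoder instead:
# estimate_chat_tokens(
#     [{"role": "system", "content": system_message},
#      {"role": "user", "content": prompt}],
#     encoder.encode,
# )
```

Treat the result as an estimate; the `usage` field returned by the API remains the authoritative count.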

**Did it work?**

In [22]:
result = dict(completion["choices"][0]["message"])
result

{'role': 'assistant', 'content': '{"contains_fls" : 0}'}
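The `content` field is still a string, so it needs to be parsed before it can be used as a label. Below is a minimal sketch of a defensive parser that returns `None` when the model replies with anything other than the requested JSON. The helper name `parse_prediction` is hypothetical and not part of the original notebook.

```python
import json

def parse_prediction(content, key="contains_fls"):
    """Return 0 or 1 if `content` is valid JSON with the expected key, else None."""
    try:
        parsed = json.loads(content)
    except json.JSONDecodeError:
        return None  # the model returned something that is not JSON
    if not isinstance(parsed, dict):
        return None
    value = parsed.get(key)
    # Reject booleans explicitly, since True == 1 in Python.
    if isinstance(value, int) and not isinstance(value, bool) and value in (0, 1):
        return value
    return None

print(parse_prediction('{"contains_fls" : 0}'))
print(parse_prediction('Sure! {"contains_fls": 1}'))
```

Returning `None` instead of raising makes it easy to flag failed observations and retry or inspect them later.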

---
# Function demonstration
---

## Let's wrap that up into some functions

We can wrap the above logic up into functions so that we can scale things more easily.

In [23]:
def generate_prompt(data, prompt_template):
    prompt = prompt_template.format(**{
        "statement" : data["statement"].strip()
    })
    return prompt

In [24]:
def make_prediction(
    i, ## Adding an identifier will make things easier to track and match up later.
    prompt, 
    model = model,
    temperature = temperature,
    system_message = system_message
    ):
    
    completion = openai.ChatCompletion.create(
        model = model,
        temperature = temperature,
        stop = ["}"], ## This forces the model to stop when it hits a closing brace
        messages = [
                {"role": "system", "content" : system_message},
                {"role": "user", "content": prompt}
            ]
    )

    result = dict(completion["choices"][0]["message"])
    prediction = json.loads(result["content"] + "}") 
    ## The end is not included, so we add it back, hence the + "}"
    prediction["i"] = i
        
    return prediction
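When scaling to many observations, API calls will occasionally fail because of rate limits or connection problems. One possible way to handle this is a retry wrapper with exponential backoff, sketched below. The `with_retries` helper and the simulated `flaky_prediction` function are hypothetical; the sketch catches every `Exception` for simplicity, whereas in practice you would likely restrict it to transient errors.

```python
import time

def with_retries(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Wrap `func` so failed calls are retried with exponentially growing delays."""
    def wrapper(*args, **kwargs):
        for attempt in range(1, max_attempts + 1):
            try:
                return func(*args, **kwargs)
            except Exception:
                if attempt == max_attempts:
                    raise  # give up after the final attempt
                sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
    return wrapper

# Simulated function that fails twice before succeeding (stand-in for an API call).
calls = {"n": 0}

def flaky_prediction(i):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated transient error")
    return {"contains_fls": 1, "i": i}

delays = []  # record the delays instead of actually sleeping
safe_prediction = with_retries(flaky_prediction, sleep=delays.append)
print(safe_prediction(7))
```

In the notebook, the same wrapper could be applied as `with_retries(make_prediction)` and called with the same arguments as `make_prediction`.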

### Loop through first five

In [None]:
for _, row in statement_df.head().iterrows():  # the DataFrame index is unused; the identifier comes from row["i"]
    prompt = generate_prompt(row, prompt_template)
    prediction = make_prediction(row["i"], prompt) 
    mprint(f"""
**Item:** `{row["i"]}`
> {row["statement"]}    

*Prediction - contains FLS:* `{prediction["contains_fls"]}`
<br><br>
""".strip())

**Item:** `1`
> In the last quarter, we managed to increase our revenue by 15% due to the successful launch of our new product line.    

*Prediction - contains FLS:* `0`
<br><br>

**Item:** `2`
> We anticipate that our investments in R&D will lead to a 20% improvement in efficiency in the next two years.    

*Prediction - contains FLS:* `1`
<br><br>

**Item:** `3`
> Our recent acquisition of XYZ Company has already started to show positive results in terms of cost savings and market reach.    

*Prediction - contains FLS:* `0`
<br><br>

**Item:** `4`
> We expect to see continued growth in the Asian market, with a potential increase in revenue of 25% over the next three years.    

*Prediction - contains FLS:* `1`
<br><br>

**Item:** `5`
> In the past year, we have successfully reduced our operational costs by 10% through process improvements and better supply chain management.    

*Prediction - contains FLS:* `0`
<br><br>