# Demo script for doing Named Entity Recognition (NER) using ChatGPT

To run this script, you first need to create an account on the OpenAI developer platform by going to <https://platform.openai.com/> and clicking the `Sign up` button. As of 2024-02-07, usage is charged per token (1000 tokens is roughly 750 words) at `$0.0005` per 1000 tokens for `gpt-3.5-turbo`. New accounts start with a $5 credit; you can check your usage on your profile page. This credit expires after a certain number of months, so don't sign up until you are actually ready to experiment. Once the credit runs out, you will need to add to your credit balance using a credit card.
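
To get a feel for what an experiment might cost, the figures quoted above can be turned into a rough estimate. The following is only a sketch: the function name `estimate_cost_usd` is made up for illustration, and it assumes every token is billed at the single rate quoted above, which may not match how prompt and response tokens are actually priced. Check the OpenAI pricing page for current rates.

```python
# Rough cost estimate using the figures quoted above (as of 2024-02-07).
# Assumption: all tokens are billed at one flat rate.
TOKENS_PER_WORD = 1000 / 750   # about 1.33 tokens per English word
USD_PER_1K_TOKENS = 0.0005     # gpt-3.5-turbo rate quoted above


def estimate_cost_usd(word_count: int) -> float:
    """Return an approximate charge in US dollars for processing word_count words."""
    tokens = word_count * TOKENS_PER_WORD
    return tokens / 1000 * USD_PER_1K_TOKENS


# A 7,500-word document is roughly 10,000 tokens.
print(f"${estimate_cost_usd(7500):.4f}")
```

At this rate, the $5 starting credit covers a large amount of short-text experimentation.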

After creating your account, go to the [API keys](https://platform.openai.com/api-keys) option and click `Create new secret key`. If you are working on your own computer, you can allow "All" permissions. You will have one opportunity to copy the secret key once it is generated, so be prepared to either save it in a text file or add it to the code below. Note the warning about hard-coding the key within this script!


In [None]:
# The following line only needs to be run one time on your system to install the OpenAI Python module.
! pip install openai

This section of the notebook reads your API key and creates a client object that will be used in the rest of the script. You only need to run this cell once each time you load the notebook unless you change the base prompt.

Resources:
- <https://kili-technology.com/data-labeling/machine-learning/using-chatgpt-to-pre-annotate-named-entities-recognition-labeling-tasks>
- <https://github.com/openai/openai-python>
- <https://platform.openai.com/docs/guides/prompt-engineering/six-strategies-for-getting-better-results>


In [None]:
# openai_ner.ipynb, a Python script for using the ChatGPT API to do entity recognition.

# (c) 2024 Vanderbilt University. This program is released under a GNU General Public License v3.0 http://www.gnu.org/licenses/gpl-3.0
# Authors: Emily Yan and Steve Baskauf

script_version = '0.0.1'
version_modified = '2024-02-07'

# -------------------
# Imports
# -------------------

from openai import OpenAI
from pathlib import Path
import json

# -------------------
# Global variables
# -------------------

# create a base prompt that will be used for all questions
BASE_PROMPT = """From the text below, give me the list of:
- organization named entity
- location named entity
- person named entity
- miscellaneous named entity.
Format the output in json with the following keys:
- ORGANIZATION for organization named entity
- LOCATION for location named entity
- PERSON for person named entity
- MISCELLANEOUS for miscellaneous named entity.
Text below:
"""

# set model parameters
OPENAI_QUERY_PARAMS = {
    "model": "gpt-3.5-turbo", # gpt-3.5-turbo best suited for understanding + generating natural language
    "temperature": 0, # temperature affects the amount of "creativity" that the model will use to generate the response; 0 means no randomness
    "max_tokens": 1024 # Cap on the number of tokens the model may generate. A token is roughly three-quarters of a word.
}

# If you are ONLY going to use this notebook locally, you can hard code your API key here. However, that is really a bad practice if
# there is any chance that you will share this notebook or push it to a public repository. It is better to keep the key in a file that
# is in a separate location that is unrelated to where you are keeping the code. The code that follows the next line will read the key
# from a plain text file in your home directory.
#client = OpenAI(api_key='')

# If saving the API key in a file in your home directory, change the following string value to the filename that you used.
openai_key_filename = 'open_ai_api_key_text_analysis.txt'

# Read the API key from a file in the home directory. The file should contain only the key and no other text.
home = Path.home() # path to the home directory; works on Windows, macOS, and Linux
with open(home / openai_key_filename, 'r') as file:
    api_key_string = file.read().strip() # remove any leading or trailing white space or newlines

CLIENT = OpenAI(api_key=api_key_string)
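
Another way to keep the key out of the notebook is to store it in an environment variable. The sketch below is an assumption-laden illustration, not part of the original script: `load_api_key` is a hypothetical helper that checks an environment variable first and falls back to the key file in the home directory. `OPENAI_API_KEY` is the variable name that the `openai` library itself looks for when no `api_key` argument is given.

```python
import os
from pathlib import Path


def load_api_key(filename: str, env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment if set, otherwise from a file in the home directory.

    load_api_key is a hypothetical helper, not part of the openai library.
    """
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    # Fall back to a plain text file that contains only the key.
    key_path = Path.home() / filename
    return key_path.read_text().strip()
```

The returned string could then be passed to the client in the same way as `api_key_string` above, for example `OpenAI(api_key=load_api_key('open_ai_api_key_text_analysis.txt'))`.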



Notes about the request function:

- The function takes three parameters: `prompt` (the specific text you want to perform NER on), `base_prompt` (defaults to `BASE_PROMPT`; can be overridden with a different base prompt in the function call), and `openai_query_params` (defaults to `OPENAI_QUERY_PARAMS`)
- model: `gpt-3.5-turbo` is well suited to understanding and generating natural language
- temperature: sampling temperature between 0 and 2; a higher temperature gives more random output, a lower temperature gives more focused output
- max_tokens: cap on the number of tokens that can be generated in the chat completion
- messages: establish the preliminary dialogue and help prevent a user from significantly misguiding the model with a malicious input prompt; the “system” message primes the model for the specific task (NER), the “user” message is from the user to the model, and an “assistant” message would be from the model to the user
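
The notes above can be illustrated by assembling the request arguments without actually sending them. This is a minimal sketch under stated assumptions: `build_request` and `SYSTEM_PROMPT` are hypothetical names introduced here, and the system message is a shortened stand-in for the one used in the script. It shows how the messages are ordered and how individual query parameters can be overridden without editing the defaults.

```python
# Shortened stand-in for the system message used in the script (assumption).
SYSTEM_PROMPT = "You are a Named Entity Recognition (NER) system."


def build_request(text, base_prompt, params, overrides=None):
    """Return the keyword arguments for a chat completion request (hypothetical helper)."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},      # primes the model for the task
        {"role": "user", "content": base_prompt + text},   # base prompt followed by the text
    ]
    # Later entries win, so overrides replace the defaults without modifying them.
    return {"messages": messages, **params, **(overrides or {})}


params = {"model": "gpt-3.5-turbo", "temperature": 0, "max_tokens": 1024}
request = build_request("Some text.", "Text below:\n", params, {"temperature": 0.7})
print(request["temperature"], params["temperature"])
```

A dictionary built this way could be unpacked into `chat.completions.create(**request)`, which is the same mechanism the script uses with `**openai_query_params`.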



In [None]:
# -------------------
# Function definitions
# -------------------

def ask_openai(prompt: str, base_prompt=BASE_PROMPT, openai_query_params=OPENAI_QUERY_PARAMS) -> str:
    """Send a request to OpenAI's ChatGPT API to do entity recognition. The prompt should be a sentence or paragraph of text
    on which you want to perform NER.
    
    The function returns a JSON-formatted string with the named entities extracted from the input text.
    """
    response = CLIENT.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": "You are a smart and intelligent Named Entity Recognition (NER) system. I will provide you the definition of the entities you need to extract, the sentence from where you extract the entities and the output format."
            },
            {
                "role": "user",
                "content": base_prompt + prompt
            }
        ],
        **openai_query_params
    )
    
    return response.choices[0].message.content


The following example requests an analysis and directly prints the response.

In [None]:
# example 
example_text = 'Vanderbilt University is a private research university in Nashville, Tennessee. It was founded in 1873.'
print(ask_openai(example_text))

To experiment, replace the input text with your own text and run this cell. The response is parsed as JSON and stored in the `data` variable, which will be a dictionary of lists if the model follows the requested output format. It can then be used for further analysis in subsequent cells without re-running the query.

In [None]:
# custom input
input_text = '''The House committee investigating the Jan. 6 attack on the U.S. Capitol on Tuesday evening unanimously approved a criminal contempt report against Steve Bannon, an ally of former President Donald Trump's, for defying a subpoena from the panel.'''
#input_text = '''Pax, depicting the Crucifixion with the Virgin raising her hands, Saint John the Evangelist, and two angels'''
data_text = ask_openai(input_text)

# Interpret the response data as a JSON string and convert it to a Python data structure.
data = json.loads(data_text)

data
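
The call to `json.loads` above raises an error if the response is not pure JSON, for example if the model wraps its answer in a Markdown code fence. The following sketch shows one possible way to make the parsing step more forgiving. `parse_ner_response` is a hypothetical helper, and the assumption that a stray code fence is the most likely formatting problem is ours, not something the API documents.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker


def parse_ner_response(response_text: str) -> dict:
    """Parse a response string as JSON, returning an empty dict if that fails (hypothetical helper)."""
    cleaned = response_text.strip()
    if cleaned.startswith(FENCE):
        # Drop the opening fence line (possibly labeled "json") and the closing fence.
        cleaned = cleaned.split("\n", 1)[1] if "\n" in cleaned else ""
        cleaned = cleaned.rsplit(FENCE, 1)[0]
    try:
        result = json.loads(cleaned)
    except json.JSONDecodeError:
        return {}
    # Only a JSON object matches the format requested in the base prompt.
    return result if isinstance(result, dict) else {}
```

With a helper like this, `data = parse_ner_response(data_text)` could stand in for the `json.loads` call, at the cost of silently returning an empty dictionary when the response cannot be parsed.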

In [None]:
print('People mentioned in the text:')
for person in data['PERSON']:
    print(person)
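
The loop above assumes that the response contains a `PERSON` key. A slightly more defensive variant is sketched below. The `sample` dictionary is made-up illustration data, not actual model output, and `list_entities` is a hypothetical helper name.

```python
# Made-up data in the shape the base prompt asks for (not real model output).
sample = {
    "PERSON": ["Steve Bannon", "Donald Trump"],
    "LOCATION": ["U.S. Capitol"],
}


def list_entities(data: dict, category: str) -> list:
    """Return the entities for a category, or an empty list if the key is missing or null."""
    values = data.get(category) or []
    return [str(value) for value in values]


for category in ("PERSON", "LOCATION", "MISCELLANEOUS"):
    entities = list_entities(sample, category)
    print(category + ":", ", ".join(entities) if entities else "(none)")
```

Using `dict.get` with a fallback avoids a `KeyError` when the model omits a category that has no entities.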
