## Named Entity Recognition (NER)

Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that identifies and classifies named entities (NE) into predefined semantic categories such as persons, organizations, locations, events, time expressions, and quantities. It serves as a foundation for many NLP applications, such as semantic annotation, question answering, text summarization, and machine translation.

This notebook performs NER with [OpenAI ChatCompletion](https://platform.openai.com/docs/api-reference/chat).

### 1. Setup

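#### 1.1 Install the OpenAI Python package

The code in this notebook uses the `openai.ChatCompletion` interface, which was removed in version 1.0 of the package, so the install is pinned to an earlier release.

In [1]:
%pip install --upgrade "openai<1" --quiet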

Note: you may need to restart the kernel to use updated packages.


#### 1.2 Load packages and OPENAI_API_KEY

You can generate an API key in the OpenAI web interface. See https://platform.openai.com/account/api-keys for details.

In [2]:
import os
import json
import openai

OPENAI_MODEL = 'gpt-3.5-turbo'

openai.api_key = os.getenv("OPENAI_API_KEY")
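
`os.getenv` returns `None` when the variable is not set, and the problem then only surfaces as an authentication error on the first API call. A minimal sketch of an earlier check follows; `require_api_key` is a hypothetical helper introduced here for illustration, not part of the `openai` package:

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from `env`, or fail early with a clear message."""
    key = env.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return key
```

If this helper were adopted, the lookup could be written as `openai.api_key = require_api_key()`.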

### 2. Define the NER Categories to be Recognized

In [3]:
ner_labels = [
    "PERSON",      # People, including fictional characters
    "NORP",        # Nationalities or religious or political groups
    "FAC",         # Buildings, airports, highways, bridges
    "ORG",         # Organizations, including companies, agencies, institutions
    "GPE",         # Geopolitical entities like countries, cities, states
    "LOC",         # Non-GPE locations, mountain ranges, bodies of water
    "PRODUCT",     # Vehicles, foods, apparel, appliances, software, toys (not services)
    "EVENT",       # Named sports, scientific milestones, historical events, etc.
    "WORK_OF_ART", # Titles of books, songs, movies
    "LAW",         # Named laws, acts, or legislation
    "LANGUAGE",    # Any named language
    "DATE",        # Absolute or relative dates or periods
    "TIME",        # Times smaller than a day
    "PERCENT",     # Percentage (e.g., "twenty percent", "18%")
    "MONEY",       # Monetary values, including unit
    "QUANTITY",    # Measurements, e.g., weight or distance
]

### 3. Prepare Completion Messages

OpenAI chat completion models take a list of messages as input and return a model-generated message as output. Although the chat format is designed to make multi-turn conversations easy, it’s just as useful for single-turn tasks without any conversation. In our case, we will define three messages, one each for the system, assistant, and user roles.

The system message sets the behavior of the assistant by defining its persona and task:

In [4]:
def system_message(ner_labels):
    return f"""
            You are an expert in Natural Language Processing.
            You will be provided with a block of text. Your task is to identify and list common Named Entities (NER) by their type in the given text.
            The common Named Entities (NER) types are: ({", ".join(ner_labels)}).
            Provide your output as a dictionary with entity types as keys, and lists of corresponding entities as values.
            """

Assistant messages usually store previous assistant responses, but they can also be written by hand, as in our case, to give examples of the desired behavior:

In [5]:
def assisstant_message():
    return f"""
            EXAMPLE:
                Text: 'In Germany, in 1440, goldsmith Johannes Gutenberg invented the movable-type printing press, which started the Printing Revolution. 
                    Modelled on the design of existing screw presses, a single Renaissance movable-type printing press could produce up to 3,600 pages per workday'
                {{
                    "GPE": ["germany"],
                    "DATE": ["1440"],
                    "PERSON": ["johannes gutenberg"],
                    "PRODUCT": ["movable-type printing press"],
                    "EVENT": ["printing revolution","renaissance"],
                    "QUANTITY": ["3,600 pages"],
                    "TIME": ["workday"]
                }}
            """

The user message provides the request for the assistant to respond to:

In [6]:
def user_message(text):
    return f"""
            TASK:
                Text: {text}
            """

### 4. ChatCompletion for Named Entity Recognition

In [7]:
def named_entity_recognition(ner_labels, text):
    response = openai.ChatCompletion.create(
      model=OPENAI_MODEL, 
      messages=[
          {"role": "system", "content": system_message(ner_labels=ner_labels)},
          {"role": "assistant", "content": assisstant_message()},
          {"role": "user", "content": user_message(text=text)}
      ],
      n=1,
      max_tokens=256,
      temperature=0,
      frequency_penalty=0,
      presence_penalty=0
    )
    
    return response

#### 4.1 Format Response 

In [8]:
def convert_to_text(response):
    return response.choices[0].message.content

In [9]:
def convert_to_dict(response):
    text = response.choices[0].message.content
    return json.loads(text.strip())
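
`json.loads` raises an error if the model wraps the dictionary in extra text, such as an introductory sentence before the opening brace. A minimal sketch of a more tolerant parser follows, assuming the reply contains one top-level JSON object; `extract_json_object` is a hypothetical helper, not part of the `openai` package:

```python
import json


def extract_json_object(text):
    """Parse the first top-level JSON object found in `text`.

    Hypothetical helper: it assumes the reply contains one JSON object,
    possibly surrounded by extra prose.
    """
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in model output")
    # raw_decode parses one JSON value beginning at `start` and
    # ignores whatever text follows it.
    obj, _ = json.JSONDecoder().raw_decode(text, start)
    return obj
```

With this approach, `convert_to_dict` could call `extract_json_object(text)` in place of `json.loads(text.strip())`. Malformed JSON inside the braces still raises `json.JSONDecodeError`.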

### 5. Sample Texts

#### 5.1 OpenAI Text

In [10]:
sample_text = """OpenAI is an AI research and deployment company. Its mission is to ensure that artificial general intelligence benefits all of humanity.
                 OpenAI was founded in 2015 by Ilya Sutskever, Greg Brockman, Trevor Blackwell, Vicki Cheung, Andrej Karpathy, Durk Kingma, Jessica Livingston, 
                 John Schulman, Pamela Vagata, and Wojciech Zaremba, with Sam Altman and Elon Musk serving as the initial board members. 
                 Microsoft provided OpenAI with a $1 billion investment in 2019 and a $10 billion investment in 2023."""

response = named_entity_recognition(ner_labels=ner_labels, text=sample_text)
response.choices[0].finish_reason

'stop'
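
A `finish_reason` of `'stop'` means the model ended its reply on its own. A value of `'length'` means the reply was cut off at `max_tokens`, which would usually leave the JSON incomplete. A minimal sketch of a guard that could run before parsing follows; `check_finish_reason` is a hypothetical helper, not part of the `openai` package:

```python
def check_finish_reason(finish_reason):
    """Raise if the reply was truncated; return True when it ended normally."""
    if finish_reason == "length":
        raise ValueError(
            "reply was cut off at max_tokens; the JSON is probably incomplete"
        )
    # Any other value (for example a content filter stop) returns False.
    return finish_reason == "stop"
```

It would be called as `check_finish_reason(response.choices[0].finish_reason)`. For longer input texts, raising `max_tokens` above 256 is one way to avoid truncation.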

In [11]:
print(convert_to_text(response))

{
    "ORG": ["OpenAI", "Microsoft"],
    "DATE": ["2015", "2019", "2023"],
    "PERSON": ["Ilya Sutskever", "Greg Brockman", "Trevor Blackwell", "Vicki Cheung", "Andrej Karpathy", "Durk Kingma", "Jessica Livingston", "John Schulman", "Pamela Vagata", "Wojciech Zaremba", "Sam Altman", "Elon Musk"],
    "MONEY": ["$1 billion", "$10 billion"]
}


#### 5.2 Olympics Text

In [12]:
sample_text = """The 1896 Summer Olympics, officially known as the Games of the Olympiad, was an international multi-sport event.
                 It was celebrated in Athens, Greece, from 6 to 15 April 1896.
                 About 100,000 people attended for the opening of the games. The athletes came from 14 nations, with most coming from Greece."""

response = named_entity_recognition(ner_labels=ner_labels, text=sample_text)
response.choices[0].finish_reason

'stop'

In [13]:
print(convert_to_text(response))

{
    "DATE": ["1896", "6 to 15 April 1896"],
    "EVENT": ["summer olympics", "games of the olympiad"],
    "GPE": ["athens", "greece", "14 nations", "greece"],
    "QUANTITY": ["100,000 people"]
}


#### 5.3 Format Response as Dictionary

In [14]:
ner_dict = convert_to_dict(response)
print(json.dumps(ner_dict, indent=4))

{
    "DATE": [
        "1896",
        "6 to 15 April 1896"
    ],
    "EVENT": [
        "summer olympics",
        "games of the olympiad"
    ],
    "GPE": [
        "athens",
        "greece",
        "14 nations",
        "greece"
    ],
    "QUANTITY": [
        "100,000 people"
    ]
}
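
The output above lists `"greece"` twice under `GPE`, and nothing stops the model from returning a label outside the requested set. A minimal sketch of a post-processing step that drops unknown labels and repeated entities follows; `clean_entities` is a hypothetical helper, and case-insensitive matching is an assumption about what should count as a duplicate:

```python
def clean_entities(ner_dict, allowed_labels):
    """Drop unknown labels and repeated entities, keeping first-seen order."""
    cleaned = {}
    for label, entities in ner_dict.items():
        if label not in allowed_labels:
            continue  # the model returned a label outside the requested set
        seen = set()
        unique = []
        for entity in entities:
            key = entity.casefold()  # compare case-insensitively
            if key not in seen:
                seen.add(key)
                unique.append(entity)
        cleaned[label] = unique
    return cleaned


example = {"GPE": ["athens", "greece", "14 nations", "greece"], "ANIMAL": ["owl"]}
print(clean_entities(example, allowed_labels={"GPE", "DATE"}))
# {'GPE': ['athens', 'greece', '14 nations']}
```

In the notebook this would be applied as `clean_entities(ner_dict, ner_labels)`. It does not correct misclassified entities such as `"14 nations"`, which would need a change to the prompt or a manual review.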
