<a href="https://colab.research.google.com/github/kili-technology/kili-python-sdk/blob/main/recipes/ner_pre_annotations_openai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to import OpenAI NER pre-annotations

## Setup

Let's start this tutorial by installing the packages we will need later on.

## Data preparation

In this tutorial, we will use the CoNLL2003 dataset from the Hugging Face repository. This dataset contains more than 10,000 sentences annotated with named entities.

To speed up the process, we will use a limited number of samples. We will also remove sentences that do not contain enough words.
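
The filtering step described above could look roughly like the sketch below. It assumes each record is a dictionary with a `tokens` list, as in the Hugging Face version of CoNLL2003. The helper name `select_samples` and the thresholds (`max_samples`, `min_tokens`, `seed`) are illustrative assumptions, not values taken from this tutorial.

```python
import random


def select_samples(records, max_samples=50, min_tokens=5, seed=42):
    """Keep records with at least `min_tokens` tokens, then draw up to `max_samples` of them."""
    long_enough = [r for r in records if len(r["tokens"]) >= min_tokens]
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    rng.shuffle(long_enough)
    return long_enough[:max_samples]


# Hypothetical records that mimic the structure of CoNLL2003 rows.
records = [
    {"tokens": ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]},
    {"tokens": ["Peter", "Blackburn"]},
    {"tokens": ["BRUSSELS", "1996-08-22"]},
    {"tokens": ["The", "European", "Commission", "said", "on", "Thursday", "it", "disagreed", "."]},
]

samples = select_samples(records, max_samples=2, min_tokens=5)
print(len(samples))
```

Shuffling before truncating avoids always taking the first sentences of the dataset, which tend to come from the same documents.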

In [3]:
NER_TAGS_ONTOLOGY = {
    "O": 0,
    "B-PERSON": 1,
    "I-PERSON": 2,
    "B-ORGANIZATION": 3,
    "I-ORGANIZATION": 4,
    "B-LOCATION": 5,
    "I-LOCATION": 6,
    "B-MISCELLANEOUS": 7,
    "I-MISCELLANEOUS": 8,
}

`NER_TAGS_ONTOLOGY` is a dictionary that maps the named entity tags in the CoNLL2003 dataset to integer labels. Here is the meaning of each key-value pair in the dictionary:

- **O**: The token is outside any named entity.
- **B-PERSON**: The first token of a person entity.
- **I-PERSON**: A subsequent token inside a person entity.
- **B-ORGANIZATION**: The first token of an organization entity.
- **I-ORGANIZATION**: A subsequent token inside an organization entity.
- **B-LOCATION**: The first token of a location entity.
- **I-LOCATION**: A subsequent token inside a location entity.
- **B-MISCELLANEOUS**: The first token of a miscellaneous entity.
- **I-MISCELLANEOUS**: A subsequent token inside a miscellaneous entity.

During the training of a NER model, the entity names will be converted to integer labels using such a dictionary.
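
As a minimal sketch of that conversion, the snippet below maps a sequence of tag names to integer labels and back. The dictionary is repeated so the example is self-contained; the helper names `tags_to_ids` and `ids_to_tags` are hypothetical and not part of any library.

```python
NER_TAGS_ONTOLOGY = {
    "O": 0,
    "B-PERSON": 1,
    "I-PERSON": 2,
    "B-ORGANIZATION": 3,
    "I-ORGANIZATION": 4,
    "B-LOCATION": 5,
    "I-LOCATION": 6,
    "B-MISCELLANEOUS": 7,
    "I-MISCELLANEOUS": 8,
}

# Reverse mapping, used to turn model predictions back into tag names.
ID_TO_TAG = {label_id: tag for tag, label_id in NER_TAGS_ONTOLOGY.items()}


def tags_to_ids(tags):
    """Convert tag names to integer labels; raises KeyError on an unknown tag."""
    return [NER_TAGS_ONTOLOGY[tag] for tag in tags]


def ids_to_tags(ids):
    """Convert integer labels back to tag names."""
    return [ID_TO_TAG[label_id] for label_id in ids]


tags = ["B-PERSON", "I-PERSON", "O", "B-LOCATION"]
ids = tags_to_ids(tags)
print(ids)  # [1, 2, 0, 5]
```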

## Connect to the ChatGPT API

Let's use the OpenAI API to get the pre-annotations for our dataset.

In [3]:
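import os

# Never commit real credentials. Replace the endpoint with your own Azure OpenAI
# resource and supply your own key, ideally from your shell or a secrets manager.
os.environ["OPENAI_API_BASE"] = "https://testavinx.openai.azure.com/"
os.environ["OPENAI_API_KEY"] = "<your-azure-openai-api-key>"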



In [4]:
import os

import openai

# Azure OpenAI configuration. This is the legacy module-level interface,
# which requires openai<1.0.
openai.api_type = "azure"
openai.api_version = "2023-05-15"
openai.api_base = os.getenv("OPENAI_API_BASE")  # Your Azure OpenAI resource's endpoint value.
openai.api_key = os.getenv("OPENAI_API_KEY")


def get_response(prompt, user_input):
    """Send a system prompt and a user message, and return the model's reply text."""
    response = openai.ChatCompletion.create(
        engine="gpt-35-turbo",  # Name of your Azure deployment
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": user_input},
        ],
    )
    return response["choices"][0]["message"]["content"]


In [5]:
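# Sample call to check that the connection works.
# `user_input` is used instead of `input` to avoid shadowing the built-in.
prompt = "Assistant is a large language model trained by OpenAI."
user_input = "Tell me a joke."

print(get_response(prompt, user_input))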

Why don't scientists trust atoms? 
Because they make up everything.


## Prompt design

To get pre-annotations for our dataset, we need to create a prompt that tells the model what to do:

In [None]:
B-Age
B-History
B-Sex
B-Clinical_event
B-Sign_symptom
I-Sign_symptom
B-Duration
B-Sign_symptom
I-Sign_symptom
I-Clinical_event
B-Frequency
B-Sign_symptom
B-Biological_structure
I-Biological_structure
B-Detailed_description
I-Detailed_description
B-Lab_value
B-Biological_structure
I-Biological_structure
B-Detailed_description
I-Detailed_description
B-Diagnostic_procedure
B-Sign_symptom
B-Disease_disorder
B-CAUSE Arg1


In [6]:
base_prompt = """In the sentence below, extract the entities for:
- age named entity
- history named entity
- sex named entity
- clinical event named entity
- sign symptom named entity
- biological structure named entity
Format the output in JSON with the following keys:
- AGE for age named entity
- HISTORY for history named entity
- SEX for sex named entity
- CLINICAL EVENT for clinical event named entity
- SIGN SYMPTOM for sign symptom named entity
- BIOLOGICAL STRUCTURE for biological structure named entity
Use the BIO format for the entity tags.
Sentence below:
"""
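
Because this prompt asks for JSON and ends with "Sentence below:", the sentence has to be appended to it, and the reply has to be parsed defensively, since the model may wrap the JSON in extra text. The sketch below shows one possible way to do both. The helper names `build_prompt` and `parse_json_entities` are hypothetical, and the example reply is made up for illustration, not an actual model output.

```python
import json


def build_prompt(instructions, sentence):
    """Append the sentence to the instructions, separated by a single newline."""
    return f"{instructions.rstrip()}\n{sentence.strip()}"


def parse_json_entities(response_text):
    """Return the JSON object found in the reply, or {} if none can be parsed."""
    start = response_text.find("{")
    end = response_text.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        parsed = json.loads(response_text[start : end + 1])
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


full_prompt = build_prompt("Sentence below:\n", "  A 28-year-old man presented.  ")

# Made-up reply: the model added a sentence before the JSON object.
reply = 'Here is the result:\n{"AGE": ["28-year-old"], "SEX": ["man"]}'
entities = parse_json_entities(reply)
print(entities)
```

Returning an empty dictionary on failure keeps a batch loop running when a single reply is malformed; such replies can be logged and retried separately.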

In [22]:
prompt1 = """Extract the entities in IOB format for Named Entity Recognition:
              B-AGE for age entities,
              B-HISTORY for history entities,
              B-SEX for sex entities,
              B-CLINICAL EVENT for clinical event entities,
              B-SIGN SYMPTOM for sign symptom entities,
              B-BIOLOGICAL STRUCTURE for biological structure entities:
              """

Let's see if the model understands the prompt well on a simple example:

In [23]:
test_sentence = """CASE: A 28-year-old previously healthy man presented with a 6-week history of palpitations.
The symptoms occurred during rest, 2–3 times per week, lasted up to 30 minutes at a time and were associated with dyspnea.
Except for a grade 2/6 holosystolic tricuspid regurgitation murmur (best heard at the left sternal border with inspiratory accentuation), physical examination yielded unremarkable findings.
An electrocardiogram (ECG) revealed normal sinus rhythm and a Wolff– Parkinson– White pre-excitation pattern (Fig.1: Top), produced by a right-sided accessory pathway.
Transthoracic echocardiography demonstrated the presence of Ebstein's anomaly of the tricuspid valve, with apical displacement of the valve and formation of an “atrialized” right ventricle (a functional unit between the right atrium and the inlet [inflow] portion of the right ventricle) (Fig.2).
The anterior tricuspid valve leaflet was elongated (Fig.2C, arrow), whereas the septal leaflet was rudimentary (Fig.2C, arrowhead).
Contrast echocardiography using saline revealed a patent foramen ovale with right-to-left shunting and bubbles in the left atrium (Fig.2D).
The patient underwent an electrophysiologic study with mapping of the accessory pathway, followed by radiofrequency ablation (interruption of the pathway using the heat generated by electromagnetic waves at the tip of an ablation catheter).
His post-ablation ECG showed a prolonged PR interval and an odd “second” QRS complex in leads III, aVF and V2–V4 (Fig.1Bottom), a consequence of abnormal impulse conduction in the “atrialized” right ventricle.
The patient reported no recurrence of palpitations at follow-up 6 months after the ablation."""

In [24]:
print(get_response(prompt1, test_sentence))

B-AGE: 28-year-old
B-HISTORY: 6-week history of palpitations, occurred during rest, lasted up to 30 minutes, 2-3 times per week, associated with dyspnea.
B-SIGN SYMPTOM: palpitations, dyspnea, grade 2/6 holosystolic tricuspid regurgitation murmur, normal sinus rhythm, Wolff-Parkinson-White pre-excitation pattern
B-BIOLOGICAL STRUCTURE: tricuspid valve, right ventricle, anterior tricuspid valve leaflet, septal leaflet, foramen ovale
B-CLINICAL EVENT: electrophysiologic study, mapping of the accessory pathway, radiofrequency ablation, prolonged PR interval, abnormal impulse conduction
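
To turn a reply in this `LABEL: mention, mention` shape into pre-annotations, each mention has to be located in the source text. The following is a rough sketch under simplifying assumptions: mentions are separated by commas, only the first occurrence of each mention is located, and mentions that do not appear verbatim in the text are skipped. The helper names and the short example text and reply are hypothetical.

```python
def parse_label_lines(response_text):
    """Parse lines such as 'B-AGE: 28-year-old' into {label: [mentions]}."""
    entities = {}
    for line in response_text.splitlines():
        label, sep, values = line.partition(":")
        if not sep:
            continue  # not a 'LABEL: values' line
        mentions = [m.strip().rstrip(".") for m in values.split(",")]
        entities[label.strip()] = [m for m in mentions if m]
    return entities


def locate_mentions(text, entities):
    """Find the character offsets of each mention; skip mentions absent from the text."""
    spans = []
    for label, mentions in entities.items():
        for mention in mentions:
            start = text.find(mention)  # first occurrence only
            if start != -1:
                spans.append(
                    {"label": label, "text": mention, "start": start, "end": start + len(mention)}
                )
    return spans


# Shortened, made-up example in the same shape as the output above.
text = "A 28-year-old man presented with palpitations and dyspnea."
reply = "B-AGE: 28-year-old\nB-SIGN SYMPTOM: palpitations, dyspnea, chest pain"

spans = locate_mentions(text, parse_label_lines(reply))
for span in spans:
    print(span)
```

Skipping unmatched mentions matters here: in the output above, some mentions are paraphrased or reordered relative to the source text (for example in the `B-HISTORY` line), so they cannot be found by exact string search.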
