<a href="https://colab.research.google.com/github/kili-technology/kili-python-sdk/blob/main/recipes/ner_pre_annotations_openai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# How to import OpenAI NER pre-annotations

## Setup

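Let's start this tutorial by installing the packages we will need later on: `kili`, `datasets`, `openai`, `scikit-learn`, `numpy`, and `rich`. The OpenAI calls below use the legacy `openai.ChatCompletion` interface, which requires a version of the `openai` package older than 1.0.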

## Data preparation

In this tutorial, we will use the CoNLL2003 dataset from the Hugging Face Hub. This dataset contains more than 10,000 sentences annotated with named entities.

In [1]:
from datasets import load_dataset



To speed up the process, we will use a limited number of samples. We will also remove sentences that do not contain enough words.
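The selection step described above can be sketched as follows. This is a minimal illustration, not the notebook's original cell: the helper names `detokenize` and `select_samples`, the thresholds `max_samples=20` and `min_tokens=6`, and the punctuation heuristic are all assumptions. The toy rows stand in for a CoNLL2003 split, whose rows expose `id`, `tokens`, and `ner_tags` fields.

```python
import re


def detokenize(tokens):
    """Join tokens and drop the space before common punctuation (rough heuristic)."""
    text = " ".join(tokens)
    return re.sub(r"\s+([.,;:!?])", r"\1", text)


def select_samples(rows, max_samples=20, min_tokens=6):
    """Keep the first `max_samples` rows that have at least `min_tokens` tokens.

    Both thresholds are assumed values chosen for illustration.
    """
    selected = []
    for row in rows:
        if len(row["tokens"]) < min_tokens:
            continue  # sentence too short to be interesting
        selected.append({**row, "sentence": detokenize(row["tokens"])})
        if len(selected) == max_samples:
            break
    return selected


# Toy rows standing in for the real dataset split.
rows = [
    {
        "id": "0",
        "tokens": ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."],
        "ner_tags": [3, 0, 7, 0, 0, 0, 7, 0, 0],
    },
    {"id": "1", "tokens": ["Peter", "Blackburn"], "ner_tags": [1, 2]},
]
dataset_sample = select_samples(rows)
print(dataset_sample[0]["sentence"])
```

With the real data, the same function could be applied to the rows returned by `load_dataset`, since each row is a dictionary with the fields used here.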

In [9]:
NER_TAGS_ONTOLOGY = {
    "O": 0,
    "B-PERSON": 1,
    "I-PERSON": 2,
    "B-ORGANIZATION": 3,
    "I-ORGANIZATION": 4,
    "B-LOCATION": 5,
    "I-LOCATION": 6,
    "B-MISCELLANEOUS": 7,
    "I-MISCELLANEOUS": 8,
}

`NER_TAGS_ONTOLOGY` is a dictionary that maps the named entity tags in the CoNLL2003 dataset to integer labels. Here is the meaning of each key-value pair in the dictionary:

- **O**: The token is not part of a named entity.
- **B-PERSON**: The first token of a person entity.
- **I-PERSON**: A token inside a person entity.
- **B-ORGANIZATION**: The first token of an organization entity.
- **I-ORGANIZATION**: A token inside an organization entity.
- **B-LOCATION**: The first token of a location entity.
- **I-LOCATION**: A token inside a location entity.
- **B-MISCELLANEOUS**: The first token of a miscellaneous entity.
- **I-MISCELLANEOUS**: A token inside a miscellaneous entity.

When training an NER model, the entity names are converted to integer labels using a dictionary like this one.
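To make the mapping concrete, here is a small sketch of encoding tag names to integers and decoding them back. The helper names `encode_tags` and `decode_tags` and the reverse dictionary `ID_TO_TAG` are introduced for illustration only.

```python
NER_TAGS_ONTOLOGY = {
    "O": 0,
    "B-PERSON": 1,
    "I-PERSON": 2,
    "B-ORGANIZATION": 3,
    "I-ORGANIZATION": 4,
    "B-LOCATION": 5,
    "I-LOCATION": 6,
    "B-MISCELLANEOUS": 7,
    "I-MISCELLANEOUS": 8,
}

# Reverse mapping (hypothetical helper): integer label -> tag name.
ID_TO_TAG = {label: tag for tag, label in NER_TAGS_ONTOLOGY.items()}


def encode_tags(tags):
    """Convert a list of tag names to their integer labels."""
    return [NER_TAGS_ONTOLOGY[tag] for tag in tags]


def decode_tags(labels):
    """Convert a list of integer labels back to tag names."""
    return [ID_TO_TAG[label] for label in labels]


tags = ["B-PERSON", "I-PERSON", "O", "B-LOCATION"]
labels = encode_tags(tags)
print(labels)
print(decode_tags(labels))
```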

## Connect with ChatGPT API

Let's use the OpenAI API, accessed here through an Azure OpenAI resource, to get the pre-annotations for our dataset.

In [3]:
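import os

# Your Azure OpenAI resource's endpoint.
os.environ["OPENAI_API_BASE"] = "https://testavinx.openai.azure.com/"
# Never hard-code or commit a real key: replace this placeholder with your own key.
os.environ["OPENAI_API_KEY"] = "<your-azure-openai-api-key>"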



In [4]:
import os

import openai

# Legacy (openai<1.0) module-level configuration for an Azure OpenAI resource.
openai.api_type = "azure"
openai.api_version = "2023-05-15"
openai.api_base = os.getenv("OPENAI_API_BASE")  # Your Azure OpenAI resource's endpoint value.
openai.api_key = os.getenv("OPENAI_API_KEY")


def get_response(prompt, user_input):
    """Send a system prompt and a user message, and return the model's text reply."""
    response = openai.ChatCompletion.create(
        engine="gpt-35-turbo",  # name of the Azure deployment
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": user_input},
        ],
    )
    return response["choices"][0]["message"]["content"]


In [5]:
# Sample
prompt = "Assistant is a large language model trained by OpenAI."
user_input = "Tell me a joke."  # avoid shadowing the built-in `input`

print(get_response(prompt, user_input))

APIConnectionError: Error communicating with OpenAI: HTTPSConnectionPool(host='testavinx.openai.azure.com', port=443): Max retries exceeded with url: //openai/deployments/gpt-35-turbo/chat/completions?api-version=2023-05-15 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)')))

We can now define the parameters that will be used when querying the OpenAI model:

- **model**: the model that will be used to generate the pre-annotations. The full list is available under this [link](https://platform.openai.com/docs/models/overview).
- **temperature**: the sampling temperature, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. The Chat Completions API defaults to 1.
- **max_tokens**: the maximum number of tokens to generate. The upper bound depends on the model's context length, which the prompt and the completion share.
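As a sketch, these parameters could be collected in a dictionary and merged into the request. The name `openai_query_params`, the helper `build_chat_request`, and the values `temperature=0` and `max_tokens=1024` are assumptions chosen for illustration, not settings taken from this notebook's run.

```python
# Assumed values: a low temperature favors reproducible, well-formatted JSON.
openai_query_params = {
    "model": "gpt-35-turbo",
    "temperature": 0,
    "max_tokens": 1024,
}


def build_chat_request(system_prompt, user_input, params):
    """Build the keyword arguments for a chat completion call (hypothetical helper)."""
    return {
        "engine": params["model"],  # Azure deployments are addressed by `engine`
        "temperature": params["temperature"],
        "max_tokens": params["max_tokens"],
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
    }


request = build_chat_request("You extract named entities.", "EU rejects German call.", openai_query_params)
print(request["temperature"], request["max_tokens"])
```

With the legacy SDK used in this tutorial, the resulting dictionary could then be passed as `openai.ChatCompletion.create(**request)`.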

## Prompt design

To get pre-annotations for our dataset, we need to create a prompt that tells the model what to do:

In [52]:
base_prompt = """In the sentence below, give me the list of:
- organization named entity
- location named entity
- person named entity
- miscellaneous named entity.
Format the output in json with the following keys:
- ORGANIZATION for organization named entity
- LOCATION for location named entity
- PERSON for person named entity
- MISCELLANEOUS for miscellaneous named entity.
Sentence below:
"""

Let's see if the model understands the prompt well on a simple example:

In [54]:
test_sentence = (
    "Elon Musk is the CEO of Tesla and SpaceX. He was born in South Africa and now lives in the"
    " USA. He is one of the founders of OpenAI."
)

In [55]:
test_sentence

'Elon Musk is the CEO of Tesla and SpaceX. He was born in South Africa and now lives in the USA. He is one of the founders of OpenAI.'

In [56]:
print(get_response(base_prompt, test_sentence))

{
    "PERSON": [
        "Elon Musk"
    ],
    "ORGANIZATION": [
        "Tesla",
        "SpaceX",
        "OpenAI"
    ],
    "LOCATION": [
        "South Africa",
        "USA"
    ],
    "MISCELLANEOUS": []
}


Looks really good! Let's now process all sentences in our dataset with the previous prompt.

## Create the pre-annotations

In the code below, we will use the OpenAI API to get the pre-annotations for each sentence in our dataset.

In [60]:
import json

openai_answers = []
for datapoint in dataset:
    sentence = datapoint["sentence"]
    answer = get_response(base_prompt, sentence)
    try:
        answer_json = json.loads(answer)
    except json.JSONDecodeError:
        print(f"Wrong json formatting:\n{answer}")
        answer_json = {"ORGANIZATION": [], "LOCATION": [], "PERSON": [], "MISCELLANEOUS": []}
    openai_answers.append(answer_json)

Wrong json formatting:
Output: 
```
{
  "ORGANIZATION": [
    "Welsh National Farmers' Union (NFU)",
    "BBC"
  ],
  "LOCATION": [
    "Germany"
  ],
  "PERSON": [
    "John Lloyd Jones"
  ],
  "MISCELLANEOUS": []
}
```
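The rejected answer above is valid JSON wrapped in extra text and a code fence, so discarding it loses usable pre-annotations. One possible way to recover such answers is to extract the outermost `{...}` span before parsing. This is a sketch under that assumption; `parse_llm_json` is a hypothetical helper, not part of the notebook.

```python
import json
import re


def parse_llm_json(answer):
    """Parse the outermost {...} span found in `answer`.

    Returns the parsed dictionary, or None when no JSON object can be recovered.
    """
    match = re.search(r"\{.*\}", answer, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


# Rebuild an answer shaped like the failing one; the fence is built
# programmatically to keep this example readable.
FENCE = "`" * 3
wrapped_answer = (
    "Output: \n" + FENCE + "\n"
    '{"ORGANIZATION": ["BBC"], "LOCATION": ["Germany"], '
    '"PERSON": ["John Lloyd Jones"], "MISCELLANEOUS": []}\n' + FENCE
)
recovered = parse_llm_json(wrapped_answer)
print(recovered)
```

In the loop above, the empty fallback dictionary would then only be needed when this helper returns `None`.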


In [61]:
print(openai_answers[:3])

[{'ORGANIZATION': ['EU', 'British'], 'LOCATION': ['German'], 'PERSON': [], 'MISCELLANEOUS': ['lamb']}, {'ORGANIZATION': ['The European Commission', 'British'], 'LOCATION': ['German'], 'PERSON': [], 'MISCELLANEOUS': ['Thursday', 'mad cow disease', 'sheep']}, {'ORGANIZATION': ['European Union'], 'LOCATION': ['Germany', 'Britain'], 'PERSON': ['Werner Zwingmann'], 'MISCELLANEOUS': ['Wednesday']}]


We need to sanitize the JSON to make sure that every value is a list. Let's first look at the raw answers:

In [62]:
openai_answers

[{'ORGANIZATION': ['EU', 'British'],
  'LOCATION': ['German'],
  'PERSON': [],
  'MISCELLANEOUS': ['lamb']},
 {'ORGANIZATION': ['The European Commission', 'British'],
  'LOCATION': ['German'],
  'PERSON': [],
  'MISCELLANEOUS': ['Thursday', 'mad cow disease', 'sheep']},
 {'ORGANIZATION': ['European Union'],
  'LOCATION': ['Germany', 'Britain'],
  'PERSON': ['Werner Zwingmann'],
  'MISCELLANEOUS': ['Wednesday']},
 {'ORGANIZATION': ['Commission'],
  'LOCATION': [],
  'PERSON': ['Nikolaus van der Pas'],
  'MISCELLANEOUS': ['news briefing']},
 {'ORGANIZATION': ['European Union'],
  'LOCATION': [],
  'PERSON': [],
  'MISCELLANEOUS': ['scientific']},
 {'ORGANIZATION': ['EU Farm', 'Commissioner Franz Fischler'],
  'LOCATION': [],
  'PERSON': [],
  'MISCELLANEOUS': ['sheep brains',
   'spleens',
   'spinal cords',
   'human',
   'animal food chains']},
 {'ORGANIZATION': ['EU'],
  'LOCATION': ['Britain', 'France'],
  'PERSON': ['Fischler'],
  'MISCELLANEOUS': ['Bovine Spongiform Encephalopathy'

In [63]:
for json_dict in openai_answers:
    for category, value in json_dict.items():
        if isinstance(value, list):
            continue
        if isinstance(value, str):
            # A single entity returned as a bare string: wrap it in a list.
            json_dict[category] = [value]
        else:
            print(f"Unknown value type '{type(value).__name__}' for value '{value}'")
            json_dict[category] = []

{'ORGANIZATION': ['Florida restaurant'],
 'LOCATION': ['London'],
 'PERSON': ['Hendrix'],
 'MISCELLANEOUS': ['$', '10,925 pounds', '16,935']}
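The model may also omit a key entirely, which would raise a `KeyError` later when the answers are read category by category. A slightly stricter variant of the sanitization step is sketched below. `normalize_answer` and `CATEGORIES` are names introduced here for illustration.

```python
CATEGORIES = ("ORGANIZATION", "LOCATION", "PERSON", "MISCELLANEOUS")


def normalize_answer(answer):
    """Return a dictionary with every expected category mapped to a list of strings."""
    normalized = {}
    for category in CATEGORIES:
        value = answer.get(category, [])  # missing key -> empty list
        if isinstance(value, str):
            value = [value]  # bare string -> single-item list
        elif not isinstance(value, list):
            value = []  # unexpected type (None, number, dict, ...) -> dropped
        # Keep only string items so that later substring searches cannot fail.
        normalized[category] = [item for item in value if isinstance(item, str)]
    return normalized


raw_answer = {"PERSON": "Hendrix", "LOCATION": None, "ORGANIZATION": ["Florida restaurant", 42]}
clean_answer = normalize_answer(raw_answer)
print(clean_answer)
```

Unlike the in-place loop, this version returns a new dictionary and silently ignores unknown keys, which is a design choice rather than a requirement.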


## Import dataset and pre-annotations to Kili

Now that we have both the data and the pre-annotations, we can import them to a Kili project.

In [None]:
from kili.client import Kili

In [None]:
kili = Kili(
    # api_endpoint="https://cloud.kili-technology.com/api/label/v2/graphql",
    # the line above can be uncommented and changed if you are working with an on-premise version of Kili
)

Below, we define the ontology (json interface) of the project. We define the 4 classes as well as their corresponding colors:

In [None]:
COLORS = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]

ENTITY_TYPES = [
    ("PERSON", "Person"),
    ("ORGANIZATION", "Organization"),
    ("LOCATION", "Location"),
    ("MISCELLANEOUS", "Miscellaneous"),
]

ENTITY_TYPES_WITH_COLORS = [
    (entity_type[0], entity_type[1], color) for entity_type, color in zip(ENTITY_TYPES, COLORS)
]
print(ENTITY_TYPES_WITH_COLORS)

[('PERSON', 'Person', '#1f77b4'), ('ORGANIZATION', 'Organization', '#ff7f0e'), ('LOCATION', 'Location', '#2ca02c'), ('MISCELLANEOUS', 'Miscellaneous', '#d62728')]


In [None]:
json_interface = {
    "jobs": {
        "NAMED_ENTITIES_RECOGNITION_JOB": {
            "mlTask": "NAMED_ENTITIES_RECOGNITION",
            "content": {
                "categories": {
                    name: {"name": name_pretty, "children": [], "color": color}
                    for name, name_pretty, color in ENTITY_TYPES_WITH_COLORS
                },
                "input": "radio",
            },
            "instruction": "",
            "required": 1,
            "isChild": False,
        }
    },
}

Let's now create the project with its ontology:

In [None]:
project = kili.create_project(
    title="[Kili SDK Notebook]: CoNLL Named Entity Recognition with OpenAI pre-annotations",
    input_type="TEXT",
    json_interface=json_interface,
)
project_id = project["id"]

We now import the sentences to the project:

In [None]:
external_id_array = []
content_array = []
for datapoint in dataset:
    sentence = datapoint["sentence"]
    content_array.append(sentence)
    external_id_array.append(datapoint["id"])

print(content_array[:3])
print(external_id_array[:3])

['EU rejects German call to boycott British lamb.', 'The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep.', "Germany's representative to the European Union's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer."]
['0', '3', '4']


In [None]:
kili.append_many_to_dataset(
    project_id=project_id, content_array=content_array, external_id_array=external_id_array
)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:03<00:00,  5.73it/s]


{'id': 'clf14l26401or0jv4e0d7d9ge'}

If you go to the project page, you should be able to see your assets:

![image.png](attachment:60725f12-45bb-4ab5-a9c0-0cec01b7c84e.png)

And on the labeling interface, you will see the sentence and the ontology:

![image.png](attachment:65cfc764-d1ea-4483-ae2c-83e5107eacb8.png)

We can finally import our OpenAI-generated pre-annotations!

In [None]:
json_response_array = []

for datapoint, sentence_annotations in zip(dataset, openai_answers):
    full_sentence = datapoint["sentence"]
    annotations = []  # list of annotations for the sentence
    for category, _ in ENTITY_TYPES:
        sentence_annotations_cat = sentence_annotations[category]
        for content in sentence_annotations_cat:
            begin_offset = full_sentence.find(content)
            assert (
                begin_offset != -1
            ), f"Cannot find offset of '{content}' in sentence '{full_sentence}'"
            annotation = {
                "categories": [{"name": category}],
                "beginOffset": begin_offset,
                "content": content,
            }
            annotations.append(annotation)

    json_resp = {"NAMED_ENTITIES_RECOGNITION_JOB": {"annotations": annotations}}
    json_response_array.append(json_resp)

In [None]:
print(json_response_array[0])

{'NAMED_ENTITIES_RECOGNITION_JOB': {'annotations': [{'categories': [{'name': 'ORGANIZATION'}], 'beginOffset': 0, 'content': 'EU'}, {'categories': [{'name': 'ORGANIZATION'}], 'beginOffset': 11, 'content': 'German'}, {'categories': [{'name': 'LOCATION'}], 'beginOffset': 34, 'content': 'British'}, {'categories': [{'name': 'MISCELLANEOUS'}], 'beginOffset': 42, 'content': 'lamb'}]}}
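The loop above annotates only the first occurrence of each entity and stops with an `AssertionError` when the model returns text that is not in the sentence. A more tolerant variant is sketched here, under the assumption that annotating every occurrence and skipping unmatched entities is acceptable. `find_all_offsets` and `build_annotations` are hypothetical helpers.

```python
def find_all_offsets(sentence, entity):
    """Return the start offset of every occurrence of `entity` in `sentence`."""
    offsets = []
    start = sentence.find(entity)
    while start != -1:
        offsets.append(start)
        start = sentence.find(entity, start + 1)
    return offsets


def build_annotations(sentence, entities_by_category):
    """Build annotation dictionaries, skipping entities absent from the sentence."""
    annotations = []
    for category, entities in entities_by_category.items():
        for entity in entities:
            if not entity:
                continue  # an empty string would match at every position
            for begin_offset in find_all_offsets(sentence, entity):
                annotations.append(
                    {
                        "categories": [{"name": category}],
                        "beginOffset": begin_offset,
                        "content": entity,
                    }
                )
    return annotations


sentence = "EU rejects German call to boycott British lamb."
answer = {"ORGANIZATION": ["EU"], "LOCATION": ["Germany"], "MISCELLANEOUS": ["lamb"]}
annotations = build_annotations(sentence, answer)
print(annotations)
```

Here `"Germany"` does not appear verbatim in the sentence, so it is skipped instead of interrupting the whole import.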


We then import the annotations using the `kili.create_predictions()` method:

In [None]:
kili.create_predictions(
    project_id,
    external_id_array=external_id_array,
    json_response_array=json_response_array,
    model_name="gpt-35-turbo",  # the deployment queried in get_response
)

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 66.63it/s]


{'id': 'clf14l26401or0jv4e0d7d9ge'}

On the main project page, you should now be able to see that the assets have been pre-annotated with the model you chose before:

![image.png](attachment:782658a8-f8c4-4a4d-b986-1dcabea34a94.png)

On the labeling interface for a specific asset, you can see the pre-annotations:

![image.png](attachment:8ac0f050-872f-4b28-8a08-6958210393fe.png)

Great! We have successfully pre-annotated our dataset. Looks like this solution has the potential to save us a lot of time in future projects.

## Pre-annotations quality evaluation

Because OpenAI-generated pre-annotations are not perfect, it would be great to have a way to measure the model's accuracy.

Since the CoNLL2003 dataset already comes with ground-truth annotations, we can easily evaluate the quality of the pre-annotations generated by OpenAI.

In [None]:
from collections import defaultdict


def format_sentence_annotations(sentence_annotations):
    """
    Maps a token to its NER tag (B-ORGANIZATION, I-ORGANIZATION, etc.) class value.
    """
    ret = defaultdict(list)
    for category, _ in ENTITY_TYPES:
        sentence_annotations_cat = sentence_annotations[category]
        for content in sentence_annotations_cat:
            content_split = content.split(" ")
            for i, token in enumerate(content_split):
                if i == 0:
                    ret[token].append(NER_TAGS_ONTOLOGY[f"B-{category}"])
                else:
                    ret[token].append(NER_TAGS_ONTOLOGY[f"I-{category}"])
    return ret


references = []
predictions = []
for datapoint, sentence_annotations in zip(dataset, openai_answers):
    references.append(datapoint["ner_tags"])

    sentence_annotations = format_sentence_annotations(sentence_annotations)
    ner_tags_predicted = []
    for token in datapoint["tokens"]:
        if token in sentence_annotations and len(sentence_annotations[token]) > 0:
            ner_tags_predicted.append(sentence_annotations[token][0])
            del sentence_annotations[token][0]
        else:
            ner_tags_predicted.append(NER_TAGS_ONTOLOGY["O"])
    predictions.append(ner_tags_predicted)

In [None]:
print(dataset[0]["tokens"])
print(references[0])
print(predictions[0])
print(NER_TAGS_ONTOLOGY)

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
[3, 0, 7, 0, 0, 0, 7, 0, 0]
[3, 0, 3, 0, 0, 0, 5, 7, 0]
{'O': 0, 'B-PERSON': 1, 'I-PERSON': 2, 'B-ORGANIZATION': 3, 'I-ORGANIZATION': 4, 'B-LOCATION': 5, 'I-LOCATION': 6, 'B-MISCELLANEOUS': 7, 'I-MISCELLANEOUS': 8}


In [None]:
import numpy as np


def flatten_list(list_):
    ret = []
    for sublist in list_:
        ret.extend(sublist)
    return ret


references = flatten_list(references)
predictions = flatten_list(predictions)
references = np.array(references)
predictions = np.array(predictions)

In [None]:
from sklearn.metrics import f1_score

We will use the F1 score to report the results.
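For a given class, each token is treated as positive if it carries that class and negative otherwise, and the score is F1 = 2·TP / (2·TP + FP + FN). The sketch below computes it by hand on the first sentence's reference and predicted tags shown earlier. `binary_f1` is a helper written for illustration and only mirrors the one-vs-rest computation.

```python
def binary_f1(references, predictions, class_value):
    """One-vs-rest F1 for `class_value`, computed from token-level labels."""
    pairs = list(zip(references, predictions))
    tp = sum(1 for r, p in pairs if r == class_value and p == class_value)
    fp = sum(1 for r, p in pairs if r != class_value and p == class_value)
    fn = sum(1 for r, p in pairs if r == class_value and p != class_value)
    denominator = 2 * tp + fp + fn
    # Convention: return 0.0 when the class appears in neither list.
    return 2 * tp / denominator if denominator else 0.0


# Tags of the first sentence: "EU rejects German call to boycott British lamb."
sample_references = [3, 0, 7, 0, 0, 0, 7, 0, 0]
sample_predictions = [3, 0, 3, 0, 0, 0, 5, 7, 0]

f1_b_organization = binary_f1(sample_references, sample_predictions, 3)
f1_b_miscellaneous = binary_f1(sample_references, sample_predictions, 7)
print(f"B-ORGANIZATION: {f1_b_organization:.3f}")
print(f"B-MISCELLANEOUS: {f1_b_miscellaneous:.3f}")
```

For B-ORGANIZATION there is one true positive ("EU") and one false positive ("German"), giving 2/3. For B-MISCELLANEOUS no predicted token matches a reference token, so the score is 0.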

In [None]:
from rich.table import Table

table = Table(title="Results")

table.add_column("Class")
table.add_column("F1")
table.add_column("Nb samples", justify="center")

for class_name, class_value in NER_TAGS_ONTOLOGY.items():
    y_true = np.where(references == class_value, 1, 0)
    y_pred = np.where(predictions == class_value, 1, 0)
    table.add_row(
        class_name,
        f"{f1_score(y_true, y_pred) * 100:6.1f}%",
        f"{y_true.sum():3d}",
        end_section=True,
    )

# Group tokens regardless of their positions in the entities
NER_TAGS_ONTOLOGY_GROUPED = {
    "PERSON": (1, 2),
    "ORGANIZATION": (3, 4),
    "LOCATION": (5, 6),
    "MISCELLANEOUS": (7, 8),
}
for class_name, class_values in NER_TAGS_ONTOLOGY_GROUPED.items():
    y_true = np.where((references == class_values[0]) | (references == class_values[1]), 1, 0)
    y_pred = np.where((predictions == class_values[0]) | (predictions == class_values[1]), 1, 0)
    table.add_row(
        class_name,
        f"{f1_score(y_true, y_pred) * 100:6.1f}%",
        f"{y_true.sum():3d}",
        style="bold green",
        end_section=True,
    )


table.add_row(
    "All",
    f"{f1_score(references, predictions, average='weighted') * 100:6.1f}%",
    f"{len(references):3d}",
    style="bold bright_red",
)

In [None]:
from rich.console import Console

console = Console()
console.print(table)

Quite good!

As we can see, the pre-annotations are not perfect, but the LLM seems able to generate pre-annotations that are good enough to speed up the labeling process in future projects.

## Conclusion

In this tutorial, we have seen how to use the OpenAI API to generate pre-annotations for a dataset. We have also seen how to import the data and the pre-annotations to a Kili project, and how to evaluate the quality of these pre-annotations.