# Text Annotation Example

In [3]:
%pip install transformers==4.17.0 -qq
!git clone https://github.com/AMontgomerie/bulgarian-nlp
%cd bulgarian-nlp

Cloning into 'bulgarian-nlp'...
remote: Enumerating objects: 141, done.
remote: Counting objects: 100% (141/141), done.
remote: Compressing objects: 100% (126/126), done.
remote: Total 141 (delta 62), reused 9 (delta 2), pack-reused 0
Receiving objects: 100% (141/141), 70.39 KiB | 1.68 MiB/s, done.
Resolving deltas: 100% (62/62), done.
/content/bulgarian-nlp


First we create an instance of the annotator.

In [4]:
from annotation.annotators import TextAnnotator

annotator = TextAnnotator()

Next we create an example input (the Bulgarian sentence means "Bulgaria is a member of the EU.") and pass it as an argument to the annotator.

In [20]:
example_input = 'България е член на ЕС.'
annotations = annotator(example_input)
annotations

{'entities': [{'text': 'България', 'type': 'LOCATION'},
  {'text': 'ЕС', 'type': 'ORGANISATION'}],
 'tokens': [{'entity': 'B-LOC', 'pos': 'PROPN', 'text': 'България'},
  {'entity': 'O', 'pos': 'AUX', 'text': 'е'},
  {'entity': 'O', 'pos': 'NOUN', 'text': 'член'},
  {'entity': 'O', 'pos': 'ADP', 'text': 'на'},
  {'entity': 'B-ORG', 'pos': 'PROPN', 'text': 'ЕС'},
  {'entity': 'O', 'pos': 'PUNCT', 'text': '.'}]}

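The raw output is a dictionary with two keys: `entities`, a list of the named entities found in the sentence, and `tokens`, a list with one entry per token holding its text, POS tag, and entity tag. To make it more readable, let's display the token-level output as a dataframe.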

In [14]:
import pandas as pd

tokens = [t["text"] for t in annotations["tokens"]]
pos_tags = [t["pos"] for t in annotations["tokens"]]
entity_tags = [t["entity"] for t in annotations["tokens"]]
df = pd.DataFrame({"token": tokens, "pos": pos_tags, "entity": entity_tags})
df

|   | token | pos | entity |
|---|---|---|---|
| 0 | България | PROPN | B-LOC |
| 1 | е | AUX | O |
| 2 | член | NOUN | O |
| 3 | на | ADP | O |
| 4 | ЕС | PROPN | B-ORG |
| 5 | . | PUNCT | O |


For the meanings of the POS tags, see the Universal Dependencies documentation: https://universaldependencies.org/u/pos/
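For quick reference, the tags that occur in this example can be mapped to their Universal Dependencies names. The sketch below hard-codes the token list from the output above so it runs without downloading the models. `UPOS_NAMES` and `describe_pos` are illustrative names, not part of the repository, and the mapping covers only the five tags seen here.

```python
# Token-level output copied from the example above, so no model is needed.
tokens = [
    {"entity": "B-LOC", "pos": "PROPN", "text": "България"},
    {"entity": "O", "pos": "AUX", "text": "е"},
    {"entity": "O", "pos": "NOUN", "text": "член"},
    {"entity": "O", "pos": "ADP", "text": "на"},
    {"entity": "B-ORG", "pos": "PROPN", "text": "ЕС"},
    {"entity": "O", "pos": "PUNCT", "text": "."},
]

# Hypothetical lookup table: UD names for the tags in this example only.
UPOS_NAMES = {
    "PROPN": "proper noun",
    "AUX": "auxiliary",
    "NOUN": "noun",
    "ADP": "adposition",
    "PUNCT": "punctuation",
}


def describe_pos(tag):
    """Return a readable name for a UPOS tag, or the tag itself if unknown."""
    return UPOS_NAMES.get(tag, tag)


described = [(t["text"], describe_pos(t["pos"])) for t in tokens]
for text, name in described:
    print(f"{text}: {name}")
```

Tags missing from the table fall back to the raw tag, so the helper does not fail on sentences that contain other parts of speech.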


The sentence-level entities are also available in `annotations["entities"]`:

In [18]:
for entity in annotations["entities"]:
    print(f"{entity['text']}: {entity['type']}")

България: LOCATION
ЕС: ORGANISATION
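To see how the token-level tags relate to the sentence-level entities, the following sketch merges `B-`/`I-` tagged tokens into spans. It assumes the conventional BIO scheme and is not necessarily how the repository builds `annotations["entities"]`. `group_bio` is a hypothetical helper, and labels are left in their short form (`LOC`, `ORG`) because the output above only shows the long names for two types.

```python
# Token-level output copied from the example above, so no model is needed.
tokens = [
    {"entity": "B-LOC", "pos": "PROPN", "text": "България"},
    {"entity": "O", "pos": "AUX", "text": "е"},
    {"entity": "O", "pos": "NOUN", "text": "член"},
    {"entity": "O", "pos": "ADP", "text": "на"},
    {"entity": "B-ORG", "pos": "PROPN", "text": "ЕС"},
    {"entity": "O", "pos": "PUNCT", "text": "."},
]


def group_bio(tokens):
    """Merge B-/I- tagged tokens into (text, label) spans (assumes BIO tags)."""
    spans = []
    current_words, current_label = [], None
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        # An I- tag with the same label extends the span in progress.
        if prefix == "I" and label == current_label:
            current_words.append(tok["text"])
            continue
        # Any other tag closes the span in progress, if there is one.
        if current_words:
            spans.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
        # A B- tag opens a new span; O and stray I- tags are skipped.
        if prefix == "B":
            current_words, current_label = [tok["text"]], label
    if current_words:
        spans.append((" ".join(current_words), current_label))
    return spans


spans = group_bio(tokens)
print(spans)  # [('България', 'LOC'), ('ЕС', 'ORG')]
```

Joining multi-token spans with a single space is a simplification; it does not restore the original spacing around punctuation.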
