In [6]:
import re

# Entity marks ##

For flat entities:

I - Word is inside a phrase of type TYPE
B - If two phrases of the same type immediately follow each other, the first word of the second phrase will have tag B-TYPE 
O - Word is not part of a phrase
E - End ( E will not appear in a prefix-only partial match )
S - Single

For nested entities:

B - Begin: The first word of a multi-word entity.
I - Inside: Any non-initial word of a multi-word entity. (M - middle)
L - Last: The last word of a multi-word entity. (E - end)
U - Unit: A single-word entity.
O - Outside: A word that is not part of any entity. (S / Single)

BILOU or IOBES

In [ ]:
# Define BILOU tags
bilou_tags = ["B", "I", "L", "U", "O"]

## Dependencies ##

https://wiki.gucorpling.org/gum/dependencies

List of dependency function labels used in GUM

    acl
    acl:relcl
    advcl
    advcl:relcl
    advmod
    amod
    appos
    aux
    aux:pass
    case
    cc
    cc:preconj
    ccomp
    compound
    compound:prt
    conj
    cop
    csubj
    csubj:pass
    dep
    det
    det:predet
    discourse
    dislocated
    expl
    fixed
    flat
    goeswith
    iobj
    list
    mark
    nmod
    nmod:npmod
    nmod:tmod
    nmod:poss
    nsubj
    nsubj:pass
    nummod
    obj
    obl
    obl:agent
    obl:npmod
    obl:tmod
    orphan
    parataxis
    punct
    reparandum
    root
    vocative
    xcomp

## Entities types
 
https://wiki.gucorpling.org/gum/entities

There are 10 entity type:

    person - any person, including fictitious figures, groups of people, and semi-human entities (Pinocchio)
    place - a country (Iceland), region (Sahara)), or other place being referred to as a location (the factory - when used as a place, not to refer to the physical building)
    organization - a company, government, sports team and others
    object - a concrete tangible object
    event - includes reference to nouns ('War', 'the performance') and clauses that are referred back to ('that John came')
    time - dates, times of day, days, years...
    substance - water, mercury, gas, poison ... includes context-dependent substances, such as Skittles or baking chocolate
    animal - any animal, potentially including bacteria, aliens and others construed as animals
    plant - interpreted broadly to include fruits, seeds and other living plant parts, but not substances (e.g. 'wood' is not classified as a plant)
    abstract - abstract notions (luck), emotions (excitement) or intangible properties (predisposition)

## Entity salience

https://wiki.gucorpling.org/gum/entities

An entity is considered salient if and only if it appears in the summary of a document
Annotate the first mention of a salient entity as salient, there is no need to annotate subsequent mentions as salient

In the header: 

meta::salientEntities = 1, 2, 36, 41, 42, 46, 76, 99

## Coreference

https://wiki.gucorpling.org/gum/entities

In [115]:
import scene_desc as sd

In [116]:
TEXT_MARKER = "# text = "
TEXT_MARKER_LEN = len(TEXT_MARKER)
ENTITY_START_MARKER = "Entity="
ENTITY_STOP_MARKER = "|"

entity_marker_len = len(ENTITY_START_MARKER)
        
def extract_entity_info(text: str) -> list[(str,str)]:
    entity_matches = []
    entity_start_mark_index = text.find(ENTITY_START_MARKER)
    if entity_start_mark_index != -1:
        entity_stop_mark_index = text.find(ENTITY_STOP_MARKER)                
        entity_tag = text[entity_start_mark_index + entity_marker_len:entity_stop_mark_index] if entity_start_mark_index != -1 else '_'
        entity_matches = re.findall(r'(\(\d+-\w+-\w+)(?=-|\:)|(\d*?\))', entity_tag) #parts[9])
    return entity_matches
    
def convert_conllu(file_path) -> (str, sd.SceneDesc):    
    scene_desc = sd.SceneDesc()
    
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = file.readlines()      
        
        text = ""
        word_info_list = []
        
        # Iterate through each line in the file
        for line in lines:
            # collect text
            if line.startswith(TEXT_MARKER):
                text += line[TEXT_MARKER_LEN:] + " "
                
            # collect line info
            if line[0].isdigit():
                parts = line.split("\t")
                lemma = sd.Lemma(sentence_id=0, lemma_id=parts[0], lemma=parts[1],lemma_init=parts[2],
                                 pos_tag=parts[3], dep_type=parts[7], dep_parent_id=parts[6])                
                entity_matches = extract_entity_info(text=parts[9])     
                
                #word_info_list.append((word_number, word, entity_match))
        return text.strip(), word_info_list

In [108]:
entity_tag = 'Entity=(4-place-giv:act-cf3-1-coref-Russia)14)'
re.findall(r'(\(\d+-\w+-\w+)(?=-|\:)|(\d*?\))', entity_tag)

[('(4-place-giv', ''), ('', ')'), ('', '14)')]

In [113]:
text, word_info_list = convert_conllu('../datasets/gum/dep/GUM_fiction_error.conllu')
for word_number, word, entity_tag in word_info_list:
    print(f"Word Number: {word_number}, Word: {word}, Entity Tag: {entity_tag}")

Word Number: 10, Word: When, Entity Tag: []
Word Number: 10, Word: Tyler, Entity Tag: [('(1-person-new', ''), ('', ')')]
Word Number: 10, Word: was, Entity Tag: []
Word Number: 10, Word: very, Entity Tag: []
Word Number: 10, Word: young, Entity Tag: []
Word Number: 10, Word: ,, Entity Tag: []
Word Number: 10, Word: his, Entity Tag: []
Word Number: 10, Word: grandmother, Entity Tag: [('', '2)')]
Word Number: 10, Word: was, Entity Tag: []
Word Number: 10, Word: his, Entity Tag: [('(2-person-giv', ''), ('(1-person-giv', ''), ('', ')')]
Word Number: 10, Word: favorite, Entity Tag: []
Word Number: 10, Word: person, Entity Tag: [('', '2)')]
Word Number: 10, Word: in, Entity Tag: []
Word Number: 10, Word: the, Entity Tag: [('(3-place-acc', '')]
Word Number: 10, Word: world, Entity Tag: [('', '3)')]
Word Number: 10, Word: because, Entity Tag: []
Word Number: 10, Word: ,, Entity Tag: []
Word Number: 10, Word: unlike, Entity Tag: []
Word Number: 10, Word: his, Entity Tag: [('(4-person-new', ''),