# Entity marks ##

For flat entities:

I - Inside: Word is inside a phrase of type TYPE.
B - Begin: If two phrases of the same type immediately follow each other, the first word of the second phrase has the tag B-TYPE.
O - Outside: Word is not part of a phrase.
E - End: The last word of a phrase (E will not appear in a prefix-only partial match).
S - Single: A phrase consisting of a single word.

For nested entities:

B - Begin: The first word of a multi-word entity.
I - Inside: Any non-initial word of a multi-word entity. (M - middle)
L - Last: The last word of a multi-word entity. (E - end)
U - Unit: A single-word entity. (S - single)
O - Outside: A word that is not part of any entity.

This scheme is known as BILOU; IOBES is the same scheme with E in place of L and S in place of U.

In [1]:
# Define BILOU tags
bilou_tags = ["B", "I", "L", "U", "O"]
START_SENTENCE_TOKEN = '[CLS]'
SEP_SENTENCE_TOKEN = '[SEP]'
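
To make the tag meanings concrete, here is a minimal sketch that turns entity spans into BILOU tags. The function name `spans_to_bilou`, the `(start, end, type)` span format with an inclusive end index, and the example sentence are assumptions made for illustration; they are not taken from the converter used later in this notebook, and the sketch handles only flat (non-overlapping) spans.

```python
def spans_to_bilou(num_tokens, spans):
    """Convert non-overlapping (start, end_inclusive, type) spans to BILOU tags."""
    tags = ["O"] * num_tokens  # every token starts as Outside
    for start, end, etype in spans:
        if start == end:
            # Single-word entity
            tags[start] = f"U-{etype}"
        else:
            # Multi-word entity: Begin, zero or more Inside, Last
            tags[start] = f"B-{etype}"
            for i in range(start + 1, end):
                tags[i] = f"I-{etype}"
            tags[end] = f"L-{etype}"
    return tags


# Hypothetical example sentence and spans
tokens = ["Kim", "visited", "the", "Sahara", "desert", "yesterday"]
spans = [(0, 0, "person"), (2, 4, "place"), (5, 5, "time")]
print(spans_to_bilou(len(tokens), spans))
# ['U-person', 'O', 'B-place', 'I-place', 'L-place', 'U-time']
```

Nested entities would need one tag sequence per nesting level, which this sketch does not attempt.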

## Dependencies ##

https://wiki.gucorpling.org/gum/dependencies

List of dependency function labels used in GUM:

    acl
    acl:relcl
    advcl
    advcl:relcl
    advmod
    amod
    appos
    aux
    aux:pass
    case
    cc
    cc:preconj
    ccomp
    compound
    compound:prt
    conj
    cop
    csubj
    csubj:pass
    dep
    det
    det:predet
    discourse
    dislocated
    expl
    fixed
    flat
    goeswith
    iobj
    list
    mark
    nmod
    nmod:npmod
    nmod:tmod
    nmod:poss
    nsubj
    nsubj:pass
    nummod
    obj
    obl
    obl:agent
    obl:npmod
    obl:tmod
    orphan
    parataxis
    punct
    reparandum
    root
    vocative
    xcomp

## Entity types
 
https://wiki.gucorpling.org/gum/entities

There are 10 entity types:

    person - any person, including fictitious figures, groups of people, and semi-human entities (Pinocchio)
    place - a country (Iceland), region (Sahara), or other place being referred to as a location (the factory - when used as a place, not to refer to the physical building)
    organization - a company, government, sports team and others
    object - a concrete tangible object
    event - includes reference to nouns ('War', 'the performance') and clauses that are referred back to ('that John came')
    time - dates, times of day, days, years...
    substance - water, mercury, gas, poison ... includes context-dependent substances, such as Skittles or baking chocolate
    animal - any animal, potentially including bacteria, aliens and others construed as animals
    plant - interpreted broadly to include fruits, seeds and other living plant parts, but not substances (e.g. 'wood' is not classified as a plant)
    abstract - abstract notions (luck), emotions (excitement) or intangible properties (predisposition)
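
One way to use these types for sequence labelling is to combine each of them with the BILOU position tags into a single label vocabulary. The following is a sketch under that assumption; the names `ENTITY_TYPES`, `POSITION_TAGS`, `labels`, and `label2id` are illustrative and do not come from the GUM documentation or the converter.

```python
# The 10 GUM entity types listed above
ENTITY_TYPES = [
    "person", "place", "organization", "object", "event",
    "time", "substance", "animal", "plant", "abstract",
]
# BILOU position tags that attach to a type; "O" carries no type
POSITION_TAGS = ["B", "I", "L", "U"]

# "O" first, then every position/type combination such as "B-person"
labels = ["O"] + [f"{pos}-{etype}" for etype in ENTITY_TYPES for pos in POSITION_TAGS]
label2id = {label: idx for idx, label in enumerate(labels)}

print(len(labels))  # 41 = 1 + 4 * 10
```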

## Entity salience

https://wiki.gucorpling.org/gum/entities

An entity is considered salient if and only if it appears in the summary of the document.
Only the first mention of a salient entity is annotated as salient; subsequent mentions do not need the annotation.

Salient entities are listed in the document header:

meta::salientEntities = 1, 2, 36, 41, 42, 46, 76, 99
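
A header line like this can be read into a list of integers with a few lines of Python. This is a minimal sketch: the function name `parse_salient_entities` is hypothetical, and it assumes that the values are comma-separated integer entity IDs and that the line may optionally carry a leading `#` comment marker.

```python
def parse_salient_entities(header_line):
    """Return the integer IDs from a 'meta::salientEntities = ...' header line."""
    key, _, value = header_line.partition("=")
    # Tolerate an optional leading '#' and surrounding whitespace (assumption)
    if key.strip().lstrip("# ") != "meta::salientEntities":
        raise ValueError(f"not a salientEntities header: {header_line!r}")
    return [int(part) for part in value.split(",") if part.strip()]


salient_ids = parse_salient_entities("meta::salientEntities = 1, 2, 36, 41, 42, 46, 76, 99")
print(salient_ids)
# [1, 2, 36, 41, 42, 46, 76, 99]
```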

## Coreference

https://wiki.gucorpling.org/gum/entities

In [2]:
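import re

# Example Entity annotation with one opening bracket and two closing brackets
entity_tag = 'Entity=(6-abstract-giv:act-cf2-1-ana)4)'
# Group 1: opening bracket with entity id and type; group 2: closing bracket with an optional id
re.findall(r'(\(\d+-\w+)(?=-|\:)|(\d*?\))', entity_tag)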

[('(6-abstract', ''), ('', ')'), ('', '4)')]
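
The tuples returned by `re.findall` can be split into opening and closing events. The sketch below reuses the same pattern; the function name `split_openings_closings` is hypothetical, and representing a bare `)` as `None` (read here as closing the entity opened in the same annotation) is an assumption rather than documented behaviour.

```python
import re

# Same pattern as in the cell above
ENTITY_PATTERN = re.compile(r'(\(\d+-\w+)(?=-|\:)|(\d*?\))')


def split_openings_closings(entity_tag):
    """Return ([(id, type), ...] for openings, [id or None, ...] for closings)."""
    opened, closed = [], []
    for opening, closing in ENTITY_PATTERN.findall(entity_tag):
        if opening:
            # '(6-abstract' -> id 6, type 'abstract'
            ent_id, ent_type = opening[1:].split("-", 1)
            opened.append((int(ent_id), ent_type))
        else:
            # '4)' -> 4; a bare ')' has no id and is stored as None
            digits = closing[:-1]
            closed.append(int(digits) if digits else None)
    return opened, closed


print(split_openings_closings('Entity=(6-abstract-giv:act-cf2-1-ana)4)'))
# ([(6, 'abstract')], [None, 4])
```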

In [1]:
import conllu_converter

converter = conllu_converter.ConlluFilesConverter(max_level=3) 
converter.convert_and_save_folder(source_folder='../datasets/gum/dep', 
                                  target_text_folder='../datasets/gum_parsed/texts',
                                  coref_dict_folder='../datasets/gum_parsed/coref_dict',
                                  target_labels_folder='../datasets/gum_parsed/labels',
                                  files_mapping_file_path='../datasets/gum_parsed/files_mapping.txt'
                                  )

In [2]:
converter.convert_and_save_file(source_file_path='../datasets/gum/dep/GUM_fiction_error.conllu',
                                target_text_file_path='../datasets/gum_parsed/texts/1.txt',
                                target_labels_file_path='../datasets/gum_parsed/labels/1.txt',
                                coref_dict_file_path='../datasets/gum_parsed/coref_dict/1.txt')
converter.convert_and_save_file(source_file_path='../datasets/gum/dep/GUM_fiction_falling.conllu',
                                target_text_file_path='../datasets/gum_parsed/texts/2.txt',
                                target_labels_file_path='../datasets/gum_parsed/labels/2.txt',
                                coref_dict_file_path='../datasets/gum_parsed/coref_dict/2.txt')