module__TabularProjector

#org.bibliome.alvisnlp.modules.trie.TabularProjector

Synopsis

Projects a simple dictionary on sections.

Description

org.bibliome.alvisnlp.modules.trie.TabularProjector reads a list of entries from dictFile and searches for these entries in sections. The dictionary contains one entry per line. Each line is split into columns separated by tab characters, or by the character specified in separator. The column specified by keyIndex is the entry to search for, and the other columns are data associated with the entry.
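As a rough illustration of the dictionary format (not the module's actual code), the following Python sketch splits one dictionary line into its key and its columns. The function name `split_entry` and the sample values are hypothetical; `separator` and `key_index` mirror the separator and keyIndex parameters.

```python
def split_entry(line, separator="\t", key_index=0):
    """Split one dictionary line into (key, columns)."""
    # Drop the line terminator only; other whitespace is left untouched.
    columns = line.rstrip("\r\n").split(separator)
    # The key is the column at key_index; every column stays available as data.
    return columns[key_index], columns


key, columns = split_entry("amino acid\tE1\tchemical\n")
print(key)      # amino acid
print(columns)  # ['amino acid', 'E1', 'chemical']
```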

The parameters skipBlank, skipEmpty, strictColumnNumber, trimColumns, separator and multipleEntryBehaviour control the loading of the dictionary file.
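The effect of the loading parameters can be sketched as follows. This is a simplified model under stated assumptions, not the module's implementation: `load_dictionary` and `expected_columns` are hypothetical names, an empty line is assumed to also count as blank, and multipleEntryBehaviour is not modelled.

```python
def load_dictionary(lines, separator="\t", skip_empty=False, skip_blank=False,
                    trim_columns=False, expected_columns=None):
    """Return the list of column lists read from an iterable of lines."""
    entries = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if skip_empty and line == "":
            continue  # models skipEmpty
        if skip_blank and line.strip() == "":
            continue  # models skipBlank (assumption: empty lines count as blank)
        columns = line.split(separator)
        if trim_columns:
            # models trimColumns: strip leading and trailing whitespace
            columns = [column.strip() for column in columns]
        if expected_columns is not None and len(columns) != expected_columns:
            # models strictColumnNumber: reject lines with an unexpected width
            raise ValueError(
                f"line {number}: expected {expected_columns} columns, "
                f"got {len(columns)}"
            )
        entries.append(columns)
    return entries
```

For example, `load_dictionary(["a\t1\n", "\n", " b \t 2 \n"], skip_empty=True, trim_columns=True, expected_columns=2)` keeps two entries and trims the second one.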

The parameters allowJoined, allUpperCaseInsensitive, caseInsensitive, ignoreDiacritics, joinDash, matchStartCaseInsensitive, skipConsecutiveWhitespaces, skipWhitespace and wordStartCaseInsensitive control the matching between the section and the entry keys.

The subject parameter specifies which text of the section should be matched. There are two options:

the entries are matched on the contents of the section; in this case subject can also control whether match boundaries coincide with word delimiters;
the entries are matched on the feature values of the annotations of a given layer, joined with whitespace; in this way entries can be searched against word lemmas, for instance.
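The two options above can be pictured with a small Python sketch. It is only an approximation under assumptions: annotations are modelled as plain dictionaries, word delimiters are approximated by the regular-expression notion of a word character, and the function names are hypothetical.

```python
import re


def find_at_word_boundaries(contents, key):
    """Return (start, end) offsets of key where both ends fall on word delimiters."""
    pattern = re.compile(r"(?<!\w)" + re.escape(key) + r"(?!\w)")
    return [(m.start(), m.end()) for m in pattern.finditer(contents)]


def subject_from_features(annotations, feature):
    """Build the text to match by joining one feature value per annotation."""
    return " ".join(annotation[feature] for annotation in annotations)


# Option 1: match on the section contents, at word boundaries only.
print(find_at_word_boundaries("amino acid and diaminoacid amino", "amino"))
# [(0, 5), (27, 32)]

# Option 2: match on lemmas instead of surface forms.
words = [{"form": "amino", "lemma": "amino"}, {"form": "acids", "lemma": "acid"}]
print(subject_from_features(words, "lemma"))  # amino acid
```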

org.bibliome.alvisnlp.modules.trie.TabularProjector creates an annotation for each matched entry and adds these annotations to the layer named targetLayerName. The created annotations have features whose keys are given by entryFeatureNames and whose values are the data associated with the matched entry (the columns in the dictionary file). For instance, if entryFeatureNames is [a,b,c], then each annotation will have three features named a, b and c holding the values of the entry's first, second and third columns respectively. A feature name left blank in entryFeatureNames does not create a feature. Thus, in order not to keep the entry itself in the a feature, entryFeatureNames should be [,b,c]. In addition, the created annotations have the feature keys and values defined in constantAnnotationFeatures.
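The pairing of feature names and columns can be sketched as below. This is a hypothetical model, not the module's code: it assumes that names are paired with columns in order starting from the first column, and that constant features are added last (so a constant would overwrite a column feature of the same name).

```python
def build_features(columns, feature_names, constant_features=None):
    """Return the feature dictionary of an annotation created for one match."""
    features = {}
    for name, value in zip(feature_names, columns):
        if name:  # a blank name means the column is not kept
            features[name] = value
    # Assumption: constants are applied after the column features.
    features.update(constant_features or {})
    return features


print(build_features(["amino acid", "E1", "chemical"], ["", "b", "c"], {"source": "dict"}))
# {'b': 'E1', 'c': 'chemical', 'source': 'dict'}
```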

If trieSource is specified, org.bibliome.alvisnlp.modules.trie.TabularProjector assumes that it contains a compiled version of the dictionary, and dictFile is not read. If trieSink is specified, org.bibliome.alvisnlp.modules.trie.TabularProjector writes a compiled version of the dictionary to trieSink. Using a compiled dictionary may speed up processing for large dictionaries.
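To give an idea of what compiling a dictionary means, here is a minimal Python sketch of a character trie that is built once, written to a file, and read back later. Everything in it is an assumption for illustration: the nested-dictionary layout, the use of `pickle`, and the function names do not describe the file format actually used by trieSink and trieSource.

```python
import pickle


def build_trie(entries):
    """Build a trie from (key, data) pairs; the "" slot holds data for complete keys."""
    root = {}
    for key, data in entries:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node.setdefault("", []).append(data)
    return root


def lookup(trie, key):
    """Return the list of data attached to key, or [] if key is not an entry."""
    node = trie
    for ch in key:
        node = node.get(ch)
        if node is None:
            return []
    return node.get("", [])


def save_trie(trie, path):
    """Write the compiled structure (plays the role of trieSink)."""
    with open(path, "wb") as stream:
        pickle.dump(trie, stream)


def load_trie(path):
    """Read a compiled structure back (plays the role of trieSource)."""
    with open(path, "rb") as stream:
        return pickle.load(stream)
```

Loading the saved structure skips parsing the tabular file again, which is the kind of saving a compiled dictionary is meant to provide.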

Parameters

dictFile

Optional

Type: SourceStream

Source of the dictionary.

targetLayerName

Optional

Type: String

Name of the layer that contains the match annotations.

valueFeatures

Optional

Type: String[]

Target features in match annotations. The values are the columns in the matched entry line.

constantAnnotationFeatures

Optional

Type: Mapping

Constant features to add to each annotation created by this module.

trieSink

Optional

Type: OutputFile

If set, org.bibliome.alvisnlp.modules.trie.TabularProjector writes the compiled dictionary to the specified file.

trieSource

Optional

Type: InputFile

If set, read the compiled dictionary from the specified file. Compiled dictionaries are generally faster for large dictionaries.

allUpperCaseInsensitive

Default value: false

Type: Boolean

Whether the match allows case substitution on all characters in words that are all upper case.

allowJoined

Default value: false

Type: Boolean

Whether the match allows arbitrary suppression of whitespace characters in the subject. For instance, the contents aminoacid matches the entry amino acid.

caseInsensitive

Default value: false

Type: Boolean

Whether the match allows case substitutions on all characters.

documentFilter

Default value: true

Type: Expression

Only process documents that satisfy this filter.

ignoreDiacritics

Default value: false

Type: Boolean

Whether the match allows diacritic substitutions on all characters. For instance, the contents acide amine matches the entry acide aminé.
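One common way to compare text while ignoring diacritics is to decompose characters and drop the combining marks. The Python sketch below shows that technique; it is an illustration only, and the module may handle diacritics differently. The function name `strip_diacritics` is hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("acide aminé") == "acide amine")  # True
```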

joinDash

Default value: false

Type: Boolean

Whether to treat dash characters (-) as whitespace characters when allowJoined is true. For instance, the contents aminoacid matches the entry amino-acid.
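The combined effect of allowJoined and joinDash can be approximated with a regular expression in which the separators of the entry are optional. This Python sketch is a simplification under assumptions (only space and tab are treated as whitespace, and `joined_pattern` is a hypothetical helper); the module itself presumably works on its trie rather than on regular expressions.

```python
import re


def joined_pattern(entry, join_dash=False):
    """Compile a pattern in which the entry's separators may be absent."""
    droppable = " \t-" if join_dash else " \t"
    pieces = []
    for ch in entry:
        piece = re.escape(ch)
        if ch in droppable:
            piece = f"(?:{piece})?"  # this separator may be suppressed
        pieces.append(piece)
    return re.compile("".join(pieces))


print(bool(joined_pattern("amino acid").fullmatch("aminoacid")))                  # True
print(bool(joined_pattern("amino-acid").fullmatch("aminoacid")))                  # False
print(bool(joined_pattern("amino-acid", join_dash=True).fullmatch("aminoacid")))  # True
```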

keyIndex

Default value: 0

Type: Integer[]

Specifies the key column index (starting at 0).

matchStartCaseInsensitive

Default value: false

Type: Boolean

Whether the match allows case substitution on the first character of the entry key.

multipleEntryBehaviour

Default value: all

Type: MultipleEntryBehaviour

Specifies the behaviour of org.bibliome.alvisnlp.modules.trie.TabularProjector if dictFile contains several entries with the same key.

sectionFilter

Default value: true

Type: Expression

Process only sections that satisfy this filter.

separator

Default value: tab character

Type: Character

Specifies the character that separates columns in dictFile.

skipBlank

Default value: false

Type: Boolean

In dictFile, skip lines that contain only whitespace characters.

skipConsecutiveWhitespaces

Default value: false

Type: Boolean

Whether the match allows insertion of consecutive whitespace characters in the subject. For instance, the contents amino  acid (with two spaces) matches the entry amino acid.

skipEmpty

Default value: false

Type: Boolean

In dictFile, skip empty lines.

skipWhitespace

Default value: false

Type: Boolean

Whether the match allows arbitrary insertion of whitespace characters in the subject. For instance, the contents amino acid matches the entry aminoacid.
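The difference between skipConsecutiveWhitespaces and skipWhitespace can be pictured as two normalizations of the subject before comparison. The Python sketch below is only an approximation with hypothetical helper names; the module may apply these relaxations during the search itself rather than by rewriting the text.

```python
import re


def collapse_whitespace(text):
    """Model of skipConsecutiveWhitespaces: a run of whitespace counts as one space."""
    return re.sub(r"\s+", " ", text)


def remove_whitespace(text):
    """Model of skipWhitespace: whitespace in the subject is ignored altogether."""
    return re.sub(r"\s+", "", text)


print(collapse_whitespace("amino   acid") == "amino acid")  # True
print(remove_whitespace("amino acid") == "aminoacid")       # True
```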

strictColumnNumber

Default value: true

Type: Boolean

Whether to check that every line in dictFile has the same number of columns as the number of features specified in entryFeatureNames.

subject

Default value: WORD

Type: Subject

Specifies the contents to match.

trimColumns

Default value: false

Type: Boolean

Whether to trim leading and trailing whitespace characters from column values in dictFile.

wordStartCaseInsensitive

Default value: false

Type: Boolean

Whether the match allows case substitution on the first character of words.
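The distinction between matchStartCaseInsensitive (first character of the whole entry key) and wordStartCaseInsensitive (first character of each word) can be illustrated with the following Python sketch. It is a hypothetical model: the helper names are invented, words are assumed to be separated by single spaces, and the comparison is done on whole strings rather than inside a larger text.

```python
def equal_match_start(entry, text):
    """Equal if only the very first character may differ in case."""
    return entry[:1].lower() == text[:1].lower() and entry[1:] == text[1:]


def equal_word_start(entry, text):
    """Equal if the first character of each word may differ in case."""
    entry_words, text_words = entry.split(" "), text.split(" ")
    return len(entry_words) == len(text_words) and all(
        equal_match_start(e, t) for e, t in zip(entry_words, text_words)
    )


print(equal_match_start("amino acid", "Amino acid"))  # True
print(equal_match_start("amino acid", "Amino Acid"))  # False
print(equal_word_start("amino acid", "Amino Acid"))   # True
```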
