Element classifier recipes

Introduction

This page gives practical advice on using the generic Weka wrapper modules. For a complete description of their parameters, refer to the reference documentation of TrainingElementClassifier, TaggingElementClassifier and SelectingElementClassifier.

Overview

decide on the target elements
write a relation definition
select attributes with SelectingElementClassifier
train a classifier on training elements with TrainingElementClassifier
use the classifier to tag elements with TaggingElementClassifier

Target elements

The target elements are the elements you want to classify. They are specified by the example parameter in each of the three modules. This parameter is an Element Expression evaluated as a list of elements with the corpus as the context element. The resulting collection of elements is the training set in TrainingElementClassifier and SelectingElementClassifier, and the set of elements whose class is predicted in TaggingElementClassifier.

Target examples

Documents

documents

To restrict the target to only some documents, for instance the training set:

documents(set == "train")

This assumes that documents have a feature with key set and an appropriate value. For instance, this feature could have been added by the reader module that loaded some files into the corpus.

Annotations

documents.sections.layer:sentences

This assumes a layer named sentences that contains annotations representing sentences. For instance, this layer could have been filled by SeSMig.

To restrict the target to only some sentences, for instance those that contain at least two gene names:

documents.sections.layer:sentences(inside:genes >= 2)

This assumes a layer named genes containing all gene names acquired from previous modules.

For NER tasks, you may want to classify annotation n-grams. In that case, use the NGrams module:

<tokenize class="OgmiosTokenizer">
<tokenTypeFeature>type</tokenTypeFeature>
<separatorTokens>false</separatorTokens>
<targetLayerName>tokens</targetLayerName>
</tokenize>

<ngrams class="NGrams">
<targetLayerName>ngrams</targetLayerName>
<tokenLayerName>tokens</targetLayerName>
<maxNGramSize>3</maxNGramSize>
</ngrams>
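
With both modules in place, the n-grams can serve as target elements. A minimal sketch, assuming that NGrams stores its annotations in the ngrams layer named by targetLayerName above, and that the target is written as an example element, as the description of the example parameter suggests:

```xml
<!-- Target every n-gram annotation produced by the NGrams module above. -->
<!-- Writing the parameter as an <example> element is an assumption. -->
<example>documents.sections.layer:ngrams</example>
```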

Tuples

Why not?

documents.sections.relations:genePairs.tuples

This assumes a relation named genePairs. Note that all gene name pairs in a sentence can be generated with the module CartesianProductTuples like this:

<genePairs class="CartesianProductTuples">
<anchor>documents.sections.layer:sentences</anchor>
<relationName>genePairs</relationName>
<arguments>
<first>inside:genes</first>
<second>inside:genes</second>
</arguments>
</genePairs>

Of course, you need to adjust the target so that your classifier does not attempt to classify pairs of the same gene:

documents.sections.relations:genePairs.tuples(args:first != args:second)

Relation definition

Here, relation is used in the Weka sense; it does not refer to AlvisNLP/ML relations.

The relation definition is specified by the relationDefinition parameter in the three modules:

<relationDefinition>
<relation name="myrelation">
attribute and bag definitions
</relation>
</relationDefinition>

However, we recommend placing the relation subtree in a separate file and loading it like this:

<relationDefinition load="myfile.xml"/>

This matters because the three modules must use the same relation definition.

The relation name is optional and has no effect.
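
As a sketch of this sharing, the fragment below points the three modules at the same file. The module names select, train and tag are arbitrary, the parameter element names example and relationDefinition are taken from the descriptions on this page, and every other parameter is left out, so this is not a complete plan. See the reference documentation of each module for the full parameter list.

```xml
<!-- Illustrative fragment: parameters other than the two shown are omitted. -->
<select class="SelectingElementClassifier">
  <example>documents(set == "train")</example>
  <relationDefinition load="myfile.xml"/>
</select>

<train class="TrainingElementClassifier">
  <example>documents(set == "train")</example>
  <relationDefinition load="myfile.xml"/>
</train>

<tag class="TaggingElementClassifier">
  <!-- Assumes documents outside the training set carry another value of "set". -->
  <example>documents(set != "train")</example>
  <relationDefinition load="myfile.xml"/>
</tag>
```

Keeping the definition in one file means a change to an attribute or bag reaches attribute selection, training and tagging at the same time.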

Attributes

Each attribute is specified with an attribute tag:

<attribute
name="NAME"
type="TYPE"
class="CLASS">
EXPR
</attribute>

NAME is the name of the attribute. It is mandatory and must be unique in the relation.
TYPE is the type of the attribute and takes one of three values: bool, int or nominal. If the type is omitted, it defaults to bool. If the type is nominal, the attribute definition must also specify all possible values:

<attribute
name="NAME"
type="nominal"
class="CLASS"
value="EXPR">
<value>value1</value>
<value>value2</value>
...
</attribute>

Note that EXPR is given here through the value attribute instead of the element content.

CLASS is a boolean (allowed values: true, false, yes and no). It indicates whether the attribute is the class attribute, that is to say the attribute predicted by the classifier. If omitted, the attribute is not the class attribute. There must be exactly one class attribute in the relation definition.
EXPR is an expression that specifies the value of the attribute for a given example element. To compute the value of the attribute for a given element, AlvisNLP/ML evaluates EXPR with the element as the context element. The type of the evaluation depends on the type of the attribute:

Attribute type	Evaluation type
`bool`	boolean
`int`	number
`nominal`	string

If a nominal attribute evaluates to a string that is not one of the declared values, AlvisNLP/ML issues an error.

Attribute Examples

All-uppercase word

<attribute name="allcaps" type="bool">@form =~ "^[A-Z]+$"</attribute>

Number of words in sentence

<attribute name="wordcount" type="int">inside:words</attribute>

To exclude punctuation from the count:

<attribute name="wordcount" type="int">inside:words[@type != "punctuation"]</attribute>

This assumes that words have a feature type indicating the word type (see WoSMig annotationTypeFeature parameter).

POS category of word

<attribute name="pos" type="nominal" value='@pos =~ "^."'>
<value>N</value>
<value>V</value>
<value>J</value>
<value>R</value>
<value>D</value>
</attribute>
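
Class attribute

None of the examples above is a class attribute. A minimal sketch of one, assuming the target elements carry a hypothetical feature named label whose value is either relevant or irrelevant:

```xml
<!-- Class attribute: the attribute predicted by the classifier. -->
<!-- The feature name "label" and its two values are illustrative assumptions. -->
<attribute name="label" type="nominal" class="true" value="@label">
  <value>relevant</value>
  <value>irrelevant</value>
</attribute>
```

As with any nominal attribute, an element whose label evaluates to a string other than the two declared values makes AlvisNLP/ML issue an error.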

Bags

Bags are attribute generators mainly used to emulate bag-of-word representations.

<bag
prefix="PREFIX"
key="KEY"
count="COUNT"
loadValues="FILE">
EXPR
</bag>

PREFIX is the prefix of all generated attribute names. It is mandatory; choose it so that it does not clash with the names of other attributes.
KEY is a feature name.
COUNT is a boolean value that specifies the type of the generated attributes:

Value	Attribute type	Test
`false`	boolean	presence
`true`	number	count

FILE is the path to a file containing all the values of the bag. It is a UTF-8 encoded file with one value per line. AlvisNLP/ML generates one attribute for each entry.
EXPR is an expression evaluated as a list of elements with the example as the context element. For each element in the result, the value of feature KEY sets or increments the corresponding attribute (depending on COUNT).

Bag examples

Document word vector

<bag prefix="w__" key="lemma" count="yes" loadValues="words.txt">sections.layer:words</bag>

You may generate words.txt with AggregateValues:

<vocabulary class="AggregateValues">
<entries>documents.sections.layer:words</entries>
<key>@lemma</key>
<outFile>words.txt</outFile>
</vocabulary>

Syntactic dependency argument

<bag prefix="syn__" key="lemma" loadValues="words.txt">tuple:dependencies:head.args:dependent</bag>
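
Complete relation definition

Putting the pieces together, a relation file for sentence targets could look like the sketch below. It only uses the attribute and bag syntax described on this page. The feature name label, the attribute names and the use of inside:words as the bag expression are illustrative assumptions, not taken from an actual plan.

```xml
<!-- myfile.xml: one relation definition shared by the three modules. -->
<!-- Assumed target elements: documents.sections.layer:sentences -->
<relation name="sentences">
  <!-- Class attribute (bool by default); "label" is a hypothetical sentence feature. -->
  <attribute name="relevant" class="yes">@label == "relevant"</attribute>

  <!-- Number of non-punctuation words in the sentence. -->
  <attribute name="wordcount" type="int">inside:words[@type != "punctuation"]</attribute>

  <!-- One numeric attribute per line of words.txt, counting lemma occurrences. -->
  <bag prefix="w__" key="lemma" count="yes" loadValues="words.txt">inside:words</bag>
</relation>
```

The relation contains exactly one class attribute, and the w__ prefix keeps the generated bag attributes from clashing with relevant and wordcount.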
