Element classifier recipes
This page gives practical advice on using the generic Weka wrapper modules. For a complete description of their parameters, refer to the reference documentation of TrainingElementClassifier, TaggingElementClassifier and SelectingElementClassifier.
- decide the target elements
- write a relation definition
- select attributes with SelectingElementClassifier
- train a classifier on training elements with TrainingElementClassifier
- use the classifier to tag elements with TaggingElementClassifier
The target elements are the elements you want to classify. They are specified by the example parameter in each of the three modules. It is an Element Expression evaluated as a list of elements with the corpus as the context element. The resulting collection of elements is the training set in TrainingElementClassifier and SelectingElementClassifier, or the set of elements whose class is predicted in TaggingElementClassifier.
documents
To restrict the target to only some documents, for instance the training set:
documents(set == "train")
This assumes that documents have a feature with key set and an appropriate value. For instance, this feature could have been added by the reader module that loaded some files into the corpus.
documents.sections.layer:sentences
This assumes a layer named sentences that contains annotations representing sentences. For instance, this layer could have been filled with SeSMig.
To restrict the target to only some sentences, for instance those that contain at least two gene names:
documents.sections.layer:sentences(inside:genes >= 2)
This assumes a layer named genes containing all gene names acquired from previous modules.
For NER tasks, you may want to classify annotation n-grams. In that case, use the NGrams module:
<tokenize class="OgmiosTokenizer">
<tokenTypeFeature>type</tokenTypeFeature>
<separatorTokens>false</separatorTokens>
<targetLayerName>tokens</targetLayerName>
</tokenize>
<ngrams class="NGrams">
<targetLayerName>ngrams</targetLayerName>
<tokenLayerName>tokens</targetLayerName>
<maxNGramSize>3</maxNGramSize>
</ngrams>
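With these two modules in the plan, the n-grams end up in the ngrams layer, so the target can presumably be written in the same way as the sentence target. A sketch, using the example parameter as it is named on this page:

```xml
<!-- Assumes the ngrams layer produced by the NGrams module above -->
<example>documents.sections.layer:ngrams</example>
```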
You can also classify relation tuples:
documents.sections.relations:genePairs.tuples
This assumes a relation named genePairs. Note that all gene name pairs in a sentence can be generated with the module CartesianProductTuples like this:
<genePairs class="CartesianProductTuples">
<anchor>documents.sections.layer:sentences</anchor>
<relationName>genePairs</relationName>
<arguments>
<first>inside:genes</first>
<second>inside:genes</second>
</arguments>
</genePairs>
Of course, you need to adjust the target so that your classifier does not attempt to classify pairs of the same gene:
documents.sections.relations:genePairs.tuples(args:first != args:second)
Here, relation is used in the Weka sense; it does not refer to AlvisNLP/ML relations.
The relation definition is specified by the relationDefinition parameter in the three modules:
<relationDefinition>
<relation name="myrelation">
attribute and bag definitions
</relation>
</relationDefinition>
However, we recommend placing the relation subtree in a separate file and loading it like this:
<relationDefinition load="myfile.xml"/>
Indeed, it is important that you use the same relation definition in all three modules.
The relation name is optional and makes no difference.
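As a sketch of how the three modules can share one definition, the plan fragment below loads the same file in each of them. The element names select, train and tag are arbitrary, the example parameter is written as it is named on this page, and the other parameters each module needs (see the reference documentation) are left out.

```xml
<!-- Sketch only: element names are arbitrary, and parameters other than
     example and relationDefinition are omitted. -->
<select class="SelectingElementClassifier">
  <example>documents.sections.layer:sentences</example>
  <relationDefinition load="myfile.xml"/>
</select>
<train class="TrainingElementClassifier">
  <example>documents.sections.layer:sentences</example>
  <relationDefinition load="myfile.xml"/>
</train>
<tag class="TaggingElementClassifier">
  <example>documents.sections.layer:sentences</example>
  <relationDefinition load="myfile.xml"/>
</tag>
```

In practice the training and tagging steps would usually target different documents, so their example expressions would differ while the relation definition stays the same.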
Each attribute is specified with an attribute tag:
<attribute
name="NAME"
type="TYPE"
class="CLASS">
EXPR
</attribute>
- NAME is the name of the attribute. It is mandatory and must be unique in the relation.
- TYPE is the type of the attribute and takes one of three values: bool, int or nominal. If the type is omitted, it defaults to bool. If the type is nominal, the attribute definition must also specify all possible values:
<attribute
name="NAME"
type="nominal"
class="CLASS"
value="EXPR">
<value>value1</value>
<value>value2</value>
...
</attribute>
Note the alternative way to specify EXPR: here it is given in the value attribute of the tag instead of the tag content.
- CLASS is a boolean (allowed values: true, false, yes and no). It indicates whether the attribute is the class attribute, that is, the attribute predicted by the classifier. If omitted, the attribute is not the class attribute. There must be exactly one class attribute in the relation definition.
- EXPR is an expression that specifies the value of the attribute for a given example element. To compute the value of the attribute for a given element, AlvisNLP/ML evaluates EXPR with the element as the context element. The type of the evaluation depends on the type of the attribute:
Attribute type | Evaluation type |
---|---|
bool | boolean |
int | number |
nominal | string |
If a nominal attribute evaluates to a string that is not among the declared values, AlvisNLP/ML issues an error.
<attribute name="allcaps" type="bool">@form =~ "^[A-Z]+$"</attribute>
<attribute name="wordcount" type="int">inside:words</attribute>
To exclude punctuation from the count:
<attribute name="wordcount" type="int">inside:words[@type != "punctuation"]</attribute>
This assumes that words have a feature type indicating the word type (see the annotationTypeFeature parameter of WoSMig).
<attribute name="pos" type="nominal" value='@pos =~ "^."'>
<value>N</value>
<value>V</value>
<value>J</value>
<value>R</value>
<value>D</value>
</attribute>
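The examples above do not include a class attribute, although the relation definition needs exactly one. A minimal sketch for gene pair tuples, assuming each tuple carries a hypothetical feature named interaction whose value is "yes" for positive examples:

```xml
<!-- Assumption: the feature "interaction" and its value "yes" are
     illustrative; use whatever feature holds the gold label in your corpus. -->
<attribute name="interaction" type="bool" class="yes">@interaction == "yes"</attribute>
```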
Bags are attribute generators mainly used to emulate bag-of-word representations.
<bag
prefix="PREFIX"
key="KEY"
count="COUNT"
loadValues="FILE">
EXPR
</bag>
- PREFIX is the prefix of all generated attribute names. It is mandatory; choose it carefully so that it does not clash with the names of other attributes.
- KEY is a feature name.
- COUNT is a boolean value that specifies the type of the generated attributes:
Value | Attribute type | Test |
---|---|---|
false | boolean | presence |
true | number | count |
- FILE is the path to a file containing all forms of the bag. It is a UTF-8 encoded file with one value per line. AlvisNLP/ML generates one attribute for each entry.
- EXPR is an expression evaluated as a list of elements with the example as the context element. For each element in the result, the value of feature KEY sets or increments the corresponding attribute (depending on COUNT).
<bag prefix="w__" key="lemma" count="yes" loadValues="words.txt">sections.layer:words</bag>
You may generate words.txt with AggregateValues:
<vocabulary class="AggregateValues">
<entries>documents.sections.layer:words</entries>
<key>@lemma</key>
<outFile>words.txt</outFile>
</vocabulary>
<bag prefix="syn__" key="lemma" loadValues="words.txt">tuple:dependencies:head.args:dependent</bag>
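To see how the pieces fit together, here is a sketch of a complete relation definition file for sentence examples, built from the attribute and bag forms described on this page. The feature category and its labels are hypothetical, and the file assumes a words layer with type and lemma features as well as a words.txt vocabulary.

```xml
<!-- Sketch of a relation definition file (e.g. myfile.xml).
     Assumptions: examples are sentence annotations; the feature "category"
     and the labels below are illustrative, not part of AlvisNLP/ML. -->
<relation name="myrelation">
  <!-- the single class attribute, predicted by the classifier -->
  <attribute name="category" type="nominal" class="yes" value="@category">
    <value>positive</value>
    <value>negative</value>
  </attribute>
  <!-- number of non-punctuation words in the sentence -->
  <attribute name="wordcount" type="int">inside:words[@type != "punctuation"]</attribute>
  <!-- one counting attribute per lemma listed in words.txt -->
  <bag prefix="w__" key="lemma" count="yes" loadValues="words.txt">inside:words</bag>
</relation>
```

The w__ prefix keeps the generated attribute names apart from category and wordcount, which avoids name clashes.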