Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unit classification mode: Integrate TextClassificationUnit.setSuffix() with externalD #251

Open
daxenberger opened this issue Jul 10, 2015 · 26 comments
Assignees
Milestone

Comments

@daxenberger
Copy link
Member

In unit mode, the id2outcome report contains random IDs for units in documents with more than one unit. It would be much easier to interpret and analyze classification errors, if the IDs were interpretable.
To make this happen, we should introduce a new feature to the TextClassificationUnit type which holds (optionally) stores a user-specified string. This field can be set together with the label for this unit during reading. In ClassificationUnitCasMultiplier, where the units are "converted" into instances/CASes, this fields should be used to set the document ID of the CAS (which is later used by the id2outcome report). If no user-specified IDs are set, the current approach (increasing index for multiple units in a document) can be applied instead.

@zesch
Copy link
Member

zesch commented Jul 10, 2015

We already have something similar in the form of the "suffix" field in the TextClassificationUnit type.
This does not replace numerical ids, but as the name says is added as a suffix.

@daxenberger
Copy link
Member Author

True. However, AFAIK this suffix is only explicitely used in a feature (ID_FEATURE_NAME) and is not applied within id2outcome report. I guess, the best option is to use this (string) value to append it to the document ID in ClassificationUnitCasMultiplier, rather than adding a new one. The IDs do not need to be numeric.

@zesch
Copy link
Member

zesch commented Jul 10, 2015

They need to be unique. When only using a user-defined name, we cannot rely on that.

@daxenberger
Copy link
Member Author

We need to make sure that document IDs are unique as well. I think, there, we also rely on the user to set unique values, don't we? I guess uima(FIT) will complain if the IDs are not unique.

@zesch
Copy link
Member

zesch commented Jul 10, 2015

Yes, that is right. As document IDs are usually file names, the problem is less critical there as they need to be kind of unique already.
As we have multiple units per document, there is some danger that we will get CASes with the same name here.

However, just pointing out the problem. Whether we want to take action or leave it to the user is another question.

@reckart
Copy link
Member

reckart commented Jul 10, 2015

We need to make sure that document IDs are unique as well. I think, there, we also rely on the user to set unique values, don't we? I guess uima(FIT) will complain if the IDs are not unique.

I don't see why uima(FIT) should complain about any IDs not being unique. From the perspective of these frameworks, I'd say the IDs you are speaking about are "user-defined" and oblique to the framework.

@daxenberger
Copy link
Member Author

Within a single document, it is easy to ensure that user-specified IDs are unique by verifying in the ClassificationUnitCasMultiplier. If user-specified IDs are not unique within the same document, we can append numeric indices and output a warning.

@reckart
Copy link
Member

reckart commented Jul 10, 2015

Btw. external IDs are a very common request... a related issue is e.g. dkpro/dkpro-core#609

@daxenberger
Copy link
Member Author

The feature requested in dkpro/dkpro-core#609 will be bound to sentences or tokens, right? In DKPro TC we need to bind it to units (which can be sentences, but also many other things). If there is a way to combine these requests and to make it a common feature, that would be good; but at the moment I have no idea how this could be done.

@reckart
Copy link
Member

reckart commented Jul 10, 2015

Well... we don't have a common base-class for all DKPro annotations and I personally would prefer not to introduce one (gut feeling). If we had such a base-class, we could introduce all kinds of common features there and they would be automatically inherited. But we'd risk inheriting such common features where they are not actually needed and there is no way of "removing" features in subclasses.

Unfortunately, UIMA currently doesn't allow adding "interfaces" to JCas classes. One could imagine to add an interface "HasExternalId" to certain types which would then have to define a "externalId" field along with getters and setters. Cf. https://issues.apache.org/jira/browse/UIMA-3354

So currently, I tend towards introducing a set of "helper" classes, e.g.

ExternalIdHelper.getExternalId(annotation); 
ExternalIdHelper.setExternalId(annotation, id);

If the annotation type defines a field "externalId", the methods would work, otherwise they would throw an exception.

Btw. the same "helper" class solution would allow removing customized code from the generated JCas classes, e.g. from DocumentMetaData. That in turn would allow us to remove the JCas classes from the repository and automatically generate all of them during build. Currently, most JCas classes in DKPro Core are generated during build, but some are not due to customizations.

@reckart
Copy link
Member

reckart commented Jul 10, 2015

Btw. there are some considerations towards making JCas more dynamic in future versions of UIMA:

https://cwiki.apache.org/confluence/display/UIMA/Ideas+for+UIMAJ+v3#IdeasforUIMAJv3-AutomaticJCasGenofmergedtypesystem

@logological
Copy link
Member

Not only the instance IDs but the label IDs should be consistent from report to report. Is there already an issue for that?

@daxenberger
Copy link
Member Author

So currently, I tend towards introducing a set of "helper" classes, e.g.
ExternalIdHelper.getExternalId(annotation);
ExternalIdHelper.setExternalId(annotation, id);
If the annotation type defines a field "externalId", the methods would work, otherwise
they would throw an exception.

Sounds good to me. That would require us to agree upon a common name for that field.

@reckart
Copy link
Member

reckart commented Jul 10, 2015

It would merely require that you don't object to using "externalId" ;)

@reckart
Copy link
Member

reckart commented Jul 10, 2015

Mind that "externalId" is the name I have in mind for DKPro Core. It implies that a single annotation type cannot have more than one external id. In case you store labelId and instanceId in a single type, a different DKPro TC solution would be required.

... or we could maintain multiple external IDs, e.g.

ExternalIdHelper.getExternalId(annotation); 
ExternalIdHelper.setExternalId(annotation, idNamespace, id);

But I'm not sure... it seems a bit of an overkill. I believe in the majority of cases, only one external ID would be required.

@daxenberger
Copy link
Member Author

Not only the instance IDs but the label IDs should be consistent from report to report.
Is there already an issue for that?

I'm not exactly sure whether I understand that request. units (TextClassificationUnit) and label (TextClassificationOutcome) annotations need to span across the same text. So it should be enough to have unique IDs with the units.
Or do you have another application in mind?

@logological
Copy link
Member

Currently id2outcome.txt uses numeric IDs for the classification outcomes, but (at least for cross-validation experiments) these IDs are not consistent from file to file. For example, BrownPosDemo does a two-fold cross-validation. One of the id2outcome.txt file uses the following mapping of numeric IDs to the original labels:

0=NPg 2=JJ 1=(null) 3=RB 5=TO 4=PPS 6=RP 7=NP 8=NN 10=VBN 9=VB 11=pct 12=PPO 13=BE 14=MD 15=DTS 16=VBZ 17=AT 18=IN 19=CS 20=VBG 21=VBD 22=BEDZ 23=NNS 24=CC 25=CD 26=AP 27=PPg

The other id2outcome.txt file uses a slightly different mapping:

0=NPg 2=(null) 1=JJ 3=RB 5=PPS 4=TO 6=RP 7=NP 8=NN 10=VB 9=VBN 11=pct 12=PPO 13=BE 14=MD 15=DTS 16=VBZ 17=AT 18=IN 19=CS 20=VBG 21=VBD 22=BEDZ 23=NNS 24=CC 25=CD 26=AP 27=PPg

In order to get the raw classifications, not only do I need to combine the two files, but I first have to manually un-map all the numeric IDs to their original labels.

It would be better if id2outcome.txt didn't use numeric IDs at all, but rather used the original label IDs. If for some reason the mapping to numeric IDs is necessary, it would be helpful if the mapping were consistent across files.

@EmilyKJamison
Copy link
Member

@tristan: Whoops, we have two different discussions here. User-set unit
ids is a separate issue from maintaining continuity in numeric ids of
classifications in a CV task.

@johannes, CC:Tristan: Yes, TextClassificationUnit.setSuffix(String v) does
show up in id2outcome, along with the auto-generated numeric unit id.
AFAIK, the proposed issue is that it would still be desirable to remove the
auto-generated unit id.

On Fri, Jul 10, 2015 at 12:21 PM, Tristan Miller notifications@github.com
wrote:

Currently id2outcome.txt uses numeric IDs for the classification
outcomes, but (at least for cross-validation experiments) these IDs are not
consistent from file to file. For example, BrownPosDemo does a two-fold
cross-validation. One of the id2outcome.txt file uses the following
mapping of numeric IDs to the original labels:

0=NPg 2=JJ 1=(null) 3=RB 5=TO 4=PPS 6=RP 7=NP 8=NN 10=VBN 9=VB 11=pct 12=PPO 13=BE 14=MD 15=DTS 16=VBZ 17=AT 18=IN 19=CS 20=VBG 21=VBD 22=BEDZ 23=NNS 24=CC 25=CD 26=AP 27=PPg

The other id2outcome.txt file uses a slightly different mapping:

0=NPg 2=(null) 1=JJ 3=RB 5=PPS 4=TO 6=RP 7=NP 8=NN 10=VB 9=VBN 11=pct 12=PPO 13=BE 14=MD 15=DTS 16=VBZ 17=AT 18=IN 19=CS 20=VBG 21=VBD 22=BEDZ 23=NNS 24=CC 25=CD 26=AP 27=PPg

In order to get the raw classifications, not only do I need to combine the
two files, but I first have to manually un-map all the numeric IDs to their
original labels.

It would be better if id2outcome.txt didn't use numeric IDs at all, but
rather used the original label IDs. If for some reason the mapping to
numeric IDs is necessary, it would be helpful if the mapping were
consistent across files.


Reply to this email directly or view it on GitHub
#251 (comment).

@logological
Copy link
Member

OK, I've opened a separate issue for the outcome classification labels (Issue 252).

@daxenberger
Copy link
Member Author

Emily is right. These are two issues (the second to be tracked in Issue 252).
TextClassificationUnit.setSuffix(String v) does indeed show up in the id2outcome report. I'll add a line to the BrownCorpusReader to show its usage and to make it effective in the BrownUnitPOSDemo. This should solve the original request.

I'm leaving this issue open to integrate it with DKPro's externalId.

@daxenberger daxenberger changed the title Unit classification mode: Augment id2outcome report to contain customized ids Unit classification mode: Integrate TextClassificationUnit.setSuffix() with externalD Jul 13, 2015
@reckart reckart modified the milestone: 0.8.0 Aug 8, 2015
@Horsmann
Copy link
Member

integrate it with DKPro's externalId.
What is left to do for that?

@reckart
Copy link
Member

reckart commented Mar 27, 2016

Core didn't implement an "externalId" mechanism yet.

@Horsmann Horsmann modified the milestones: 0.9.0, 0.8.0 Mar 27, 2016
@Horsmann Horsmann modified the milestones: 1.0.0, 0.9.0 Oct 19, 2016
@Horsmann
Copy link
Member

Horsmann commented Feb 9, 2018

Any updates on this one?

@reckart
Copy link
Member

reckart commented Feb 9, 2018

Nope.

@Horsmann
Copy link
Member

Horsmann commented Feb 9, 2018

Ok, then I close this for now. Once the demand for this kind of change becomes bigger a new issue can be opened.

@Horsmann Horsmann closed this as completed Feb 9, 2018
@reckart
Copy link
Member

reckart commented Feb 9, 2018

Why not leave it open and push it to the next release? If it get's closed, its easier to miss past discussions on the topic.

@Horsmann Horsmann reopened this Feb 9, 2018
@Horsmann Horsmann modified the milestones: 1.0.0, 1.1.0 Apr 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants