MASCReader returns empty CASes #64

Open
logological opened this issue Jun 24, 2015 · 2 comments

Comments

@logological

Originally reported on Google Code with ID 65

MASCReader generates an empty CAS for some files in the corpus

Reported by MedKhemakhemFSEGS on 2014-12-03 18:54:40


- _Attachment: [mascModified.groovy](https://storage.googleapis.com/google-code-attachments/dkpro-wsd/issue-65/comment-0/mascModified.groovy)_
@logological

I reproduced this problem.

For the MASC sentence corpus, the MASCReader returns 1865 empty CASes and 13754 normal
CASes.

I copied a version of the MASCReader into a local project and ran the following pipeline:

    String patterns = "round*/*-v/*-wn.xml";
    SimplePipeline.runPipeline(
            createReaderDescription(
                    MascReader.class,
                    MascReader.PARAM_IGNORE_TIES, true,
                    MascReader.PARAM_SOURCE_LOCATION, MASCDirectory,
                    MascReader.PARAM_PATTERNS, new String[] {
                            ResourceCollectionReaderBase.INCLUDE_PREFIX + patterns
                    }),
            createEngineDescription(LanguageToolSegmenter.class),
            createEngineDescription(MascProblemFinder.class)
            //createEngineDescription(CasDumpWriter.class)
            );
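
MascProblemFinder itself is not shown in this report. As a rough illustration of the kind of tally that produces the two counts above, here is a minimal sketch in plain Java with no UIMA dependency. The class name `EmptyTextCounter` and its fields are hypothetical; in a real annotator the string passed to `process` would be whatever `jCas.getDocumentText()` returns for that CAS, and treating both null and zero-length text as "empty" is an assumption.

```java
/** Hypothetical tally of empty vs. non-empty document texts; not the actual MascProblemFinder. */
class EmptyTextCounter {
    int empty;     // documents whose text is null or has length zero
    int nonEmpty;  // documents with some text

    // Classify one document text and update the matching counter.
    void process(String documentText) {
        if (documentText == null || documentText.isEmpty()) {
            empty++;
        } else {
            nonEmpty++;
        }
    }

    public static void main(String[] args) {
        EmptyTextCounter counter = new EmptyTextCounter();
        // Made-up sample texts, not taken from the MASC corpus.
        for (String text : new String[] {"A sentence.", null, "Another one."}) {
            counter.process(text);
        }
        System.out.println("empty=" + counter.empty + ", nonEmpty=" + counter.nonEmpty);
    }
}
```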

I modified the MASCReader to return a placeholder sentence instead of an empty CAS. This
is where the problem is introduced:

        // if no tie between annotators is discovered
        if (documentText != null) {
            setDocumentMetadata(jCas, node);
            jCas.setDocumentText(documentText);
        }
        else {
            setDocumentMetadata(jCas, node);
            jCas.setDocumentText("This is an empty Cas.");

            //jCas.reset(); // TODO here the CAS is emptied
        }
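
If empty CASes for tied instances turn out to be unwanted, one possible alternative to a placeholder text is to skip those instances before they reach the pipeline, using a look-ahead in the reader's iteration. The following is a minimal sketch of that idea in plain Java, with no UIMA dependency. `SkippingTextIterator` and `SkippingDemo` are hypothetical names, and a null entry stands in for an instance whose `documentText` is null; this is not how MASCReader is currently written.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical look-ahead iterator that drops null texts (modelling tied instances). */
class SkippingTextIterator implements Iterator<String> {
    private final Iterator<String> source;
    private String next; // buffered non-null text, or null when the source is exhausted

    SkippingTextIterator(Iterable<String> texts) {
        this.source = texts.iterator();
        advance();
    }

    // Advance to the next non-null text, silently skipping null entries.
    private void advance() {
        next = null;
        while (next == null && source.hasNext()) {
            next = source.next();
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public String next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        String result = next;
        advance();
        return result;
    }
}

class SkippingDemo {
    // Collect every text the iterator yields, in order.
    static List<String> collect(List<String> texts) {
        List<String> out = new ArrayList<>();
        Iterator<String> it = new SkippingTextIterator(texts);
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [First sentence., Second sentence.]
        System.out.println(collect(Arrays.asList("First sentence.", null, "Second sentence.")));
    }
}
```

Because the look-ahead runs inside `hasNext()`'s bookkeeping, a consumer never sees a tied instance at all. The trade-off is that progress counts based on the raw number of corpus entries would no longer match the number of documents emitted.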


Reported by eckle.kohler on 2014-12-08 20:45:13

@logological

I don't recall much about the MASC corpus format, so I don't have much context to help
me interpret this problem report.  I take it from reading the code that the empty CAS
was returned only in those cases where there was a tie between the annotators.  Is
this perhaps the intended behaviour?  If not, is your modified code above intended
to fix the problem?

Reported by tristan.miller@nothingisreal.com on 2014-12-11 14:48:43
