
LuceneNGramMetaCollector throws exception when used with MultiThreadBatchTaskEngine #86

Open
mwunderlich opened this issue Sep 5, 2015 · 2 comments

@mwunderlich (Contributor) commented Sep 5, 2015

Hi,

I have just come across the following: I am using the MultiThreadBatchTaskEngine in a feature ablation experiment in DKPro TC. The batch task fails with the following runtime exception:

Details: de.tudarmstadt.ukp.dkpro.lab.engine.ExecutionException: java.lang.RuntimeException: 

de.tudarmstadt.ukp.dkpro.lab.storage.UnresolvedImportException: Unable to resolve import of task 
[de.tudarmstadt.ukp.dkpro.tc.core.task.ExtractFeaturesTask-Test-AIFdbClassificationFE_PRvCvNP_2015-09-04_18-05-38] 
pointing to [task-latest://de.tudarmstadt.ukp.dkpro.tc.core.task.ExtractFeaturesTask-Train-AIFdbClassificationFE_PRvCvNP_2015-09-04_18-05-38/output]; 
nested exception is de.tudarmstadt.ukp.dkpro.lab.storage.TaskContextNotFoundException: Task 
[de.tudarmstadt.ukp.dkpro.tc.core.task.ExtractFeaturesTask-Train-AIFdbClassificationFE_PRvCvNP_2015-09-04_18-05-38] has never been executed.
...

This in turn leads to more runtime exceptions about other tasks not having been executed.
When I look into the details, I find the following exception at an earlier stage:

2015-09-05 09:32:49 DEBUG PrimitiveAnalysisEngine_impl:347 - AnalysisEngine de.tudarmstadt.ukp.dkpro.tc.core.feature.UnitContextMetaCollector process begin
2015-09-05 09:32:49 DEBUG PrimitiveAnalysisEngine_impl:413 - AnalysisEngine de.tudarmstadt.ukp.dkpro.tc.core.feature.UnitContextMetaCollector process end
2015-09-05 09:32:49 DEBUG PrimitiveAnalysisEngine_impl:347 - AnalysisEngine de.tudarmstadt.ukp.dkpro.tc.features.ngram.meta.LuceneNGramMetaCollector process begin
2015-09-05 09:32:49 ERROR PrimitiveAnalysisEngine_impl:417 - Exception occurred
org.apache.uima.analysis_engine.AnalysisEngineProcessException: Annotator processing failed.    
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:401)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:308)
        at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:570)
        at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:412)
        at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:344)
        at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:265)
        at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:269)
        at de.tudarmstadt.ukp.dkpro.lab.uima.engine.simple.SimpleExecutionEngine.run(SimpleExecutionEngine.java:139)
        at de.tudarmstadt.ukp.dkpro.lab.engine.impl.MultiThreadBatchTaskEngine$ExecutionThread.run(MultiThreadBatchTaskEngine.java:274)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
        at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:614)
        at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:628)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1508)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1188)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1169)
        at de.tudarmstadt.ukp.dkpro.tc.features.ngram.meta.LuceneBasedMetaCollector.writeToIndex(LuceneBasedMetaCollector.java:165)
        at de.tudarmstadt.ukp.dkpro.tc.features.ngram.meta.LuceneBasedMetaCollector.process(LuceneBasedMetaCollector.java:136)
        at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:48)
        at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:385)
        ... 13 more

The experiment runs successfully when I use the single-threaded DefaultBatchTaskEngine, so I presume the error about the closed index writer is a side-effect of the multi-threading.

I don't have time right now to investigate this in detail or produce a minimal configuration where this error occurs, but I thought I'd report it anyway, in case others come across the same issue.

@zesch (Member) commented Sep 5, 2015

I guess the way we currently close the index writer is not really thread-safe ...
It worked single-threaded because, by the time collectionProcessComplete() is called, all Lucene-based meta collectors have already finished; not so in a multi-threaded scenario?

However, Lucene says that IndexWriter itself is completely thread-safe ...

    @Override
    public void collectionProcessComplete()
        throws AnalysisEngineProcessException
    {
        super.collectionProcessComplete();

        if (indexWriter != null) {
            try {
                indexWriter.commit();
                indexWriter.close();
                indexWriter = null;
            } catch (AlreadyClosedException e) {
                // ignore, as multiple meta collectors write in the same index 
                // and will all try to close the index
            } catch (CorruptIndexException e) {
                throw new AnalysisEngineProcessException(e);
            } catch (IOException e) {
                throw new AnalysisEngineProcessException(e);
            }
        }
    }
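One way to make the shared-writer shutdown robust, regardless of which collector thread finishes last, would be to reference-count the writer and let only the last collector actually commit and close it. The following is a minimal, self-contained sketch of that pattern (plain Java, not DKPro TC code; `SharedWriter` is a hypothetical stand-in for the shared Lucene IndexWriter):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: a shared writer that several collectors use concurrently.
// Each collector acquires a handle on setup and releases it when done;
// only the last release actually performs the close, so no collector
// can observe an "already closed" writer while others are still active.
class SharedWriter {
    private final AtomicInteger users = new AtomicInteger(0);
    private volatile boolean closed = false;
    private int documentsWritten = 0;

    void acquire() {
        users.incrementAndGet();
    }

    synchronized void addDocument(String doc) {
        if (closed) {
            // mirrors Lucene's AlreadyClosedException behavior
            throw new IllegalStateException("this writer is closed");
        }
        documentsWritten++; // stand-in for indexWriter.addDocument(...)
    }

    // Last one out closes the writer exactly once.
    synchronized void release() {
        if (users.decrementAndGet() == 0 && !closed) {
            closed = true; // stand-in for indexWriter.commit(); indexWriter.close();
        }
    }

    boolean isClosed() {
        return closed;
    }

    int count() {
        return documentsWritten;
    }
}
```

With this scheme, the catch-and-ignore of AlreadyClosedException in collectionProcessComplete() would become unnecessary, since the close happens exactly once after all users have released their handles.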

@reckart (Member) commented Sep 5, 2015

Multi-threading on the batch-task level should only parallelize complete task executions. Since each task execution (which includes, in particular, the UIMA pipeline) is then single-threaded and works on its own context (folder), there should be no problem.

Of course, if you use the CPE UIMA engine, that is a different matter: writers that are not thread-safe, or that must see all the data, must be derived from (J)CasConsumer_ImplBase or explicitly declare @OperationalProperties(multipleDeploymentAllowed = false).
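In uimaFIT terms, the declaration mentioned above could look like the following sketch (the class name `MyMetaCollector` is hypothetical; the annotation and base class are the real uimaFIT types):

```java
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.fit.component.JCasConsumer_ImplBase;
import org.apache.uima.fit.descriptor.OperationalProperties;
import org.apache.uima.jcas.JCas;

// multipleDeploymentAllowed = false tells UIMA (e.g. a CPE) that it must
// not replicate this component across threads; every CAS is routed
// through this single instance, so the component sees all the data.
@OperationalProperties(multipleDeploymentAllowed = false)
public class MyMetaCollector extends JCasConsumer_ImplBase {
    @Override
    public void process(JCas aJCas) throws AnalysisEngineProcessException {
        // collect meta data from the CAS here
    }
}
```

Deriving from JCasConsumer_ImplBase has the same effect, since consumers are single-deployment by default.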
