
could not find XmlIoBasicImgLoader implementation for format bdv.n5 #2

Closed

boazmohar opened this issue Dec 25, 2021 · 29 comments

@boazmohar

Hi @StephanPreibisch,

Tried running this on the Janelia cluster and got this error.

mpicbg.spim.data.SpimDataInstantiationException: could not find XmlIoBasicImgLoader implementation for format bdv.n5

I am using the latest spark-janelia from here

This is how I built the repo and ran it:
  [login1 - moharb@e05u15]~>git clone https://github.com/PreibischLab/BigStitcher-Spark.git
Cloning into 'BigStitcher-Spark'...
remote: Enumerating objects: 181, done.
remote: Counting objects: 100% (181/181), done.
remote: Compressing objects: 100% (104/104), done.
remote: Total 181 (delta 69), reused 108 (delta 22), pack-reused 0
Receiving objects: 100% (181/181), 35.31 KiB | 2.35 MiB/s, done.
Resolving deltas: 100% (69/69), done.
[login1 - moharb@e05u15]~>cd BigStitcher-Spark/
[login1 - moharb@e05u15]~/BigStitcher-Spark>~/apache-maven-3.8.4/bin/mvn clean package
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------< net.preibisch:BigStitcher-Spark >-------------------
[INFO] Building BigStitcher Spark 0.0.1-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
[INFO]
[INFO] --- maven-clean-plugin:3.1.0:clean (default-clean) @ BigStitcher-Spark ---
[INFO]
[INFO] --- maven-enforcer-plugin:3.0.0-M3:enforce (enforce-rules) @ BigStitcher-Spark ---
[INFO] Adding ignore: module-info
[INFO] Adding ignore: META-INF/versions/*/module-info
[INFO] Adding ignore: com.esotericsoftware.kryo.*
[INFO] Adding ignore: com.esotericsoftware.minlog.*
[INFO] Adding ignore: com.esotericsoftware.reflectasm.*
[INFO] Adding ignore: com.google.inject.*
[INFO] Adding ignore: jnr.ffi.*
[INFO] Adding ignore: org.apache.hadoop.yarn.*.package-info
[INFO] Adding ignore: org.apache.spark.unused.UnusedStubClass
[INFO] Adding ignore: org.hibernate.stat.ConcurrentStatisticsImpl
[INFO] Adding ignore: org.jetbrains.kotlin.daemon.common.*
[INFO] Adding ignore: org.junit.runner.Runner
[INFO] Adding ignore: module-info
[INFO] Adding ignore: module-info
[INFO]
[INFO] --- build-helper-maven-plugin:3.0.0:regex-property (sanitize-version) @ BigStitcher-Spark ---
[INFO]
[INFO] --- build-helper-maven-plugin:3.0.0:regex-property (guess-package) @ BigStitcher-Spark ---
[INFO]
[INFO] --- buildnumber-maven-plugin:1.4:create (default) @ BigStitcher-Spark ---
[INFO] Executing: /bin/sh -c cd '/groups/spruston/home/moharb/BigStitcher-Spark' && 'git' 'rev-parse' '--verify' 'HEAD'
[INFO] Working directory: /groups/spruston/home/moharb/BigStitcher-Spark
[INFO] Storing buildNumber: e2b676364526588195f16931e998a7a756ca778b at timestamp: 1640470297294
[INFO] Storing buildScmBranch: main
[INFO]
[INFO] --- scijava-maven-plugin:2.0.0:set-rootdir (set-rootdir) @ BigStitcher-Spark ---
[INFO] Setting rootdir: /groups/spruston/home/moharb/BigStitcher-Spark
[INFO]
[INFO] --- jacoco-maven-plugin:0.8.6:prepare-agent (jacoco-initialize) @ BigStitcher-Spark ---
[WARNING] The artifact xml-apis:xml-apis:jar:2.0.2 has been relocated to xml-apis:xml-apis:jar:1.0.b2
[INFO] argLine set to -javaagent:/groups/spruston/home/moharb/.m2/repository/org/jacoco/org.jacoco.agent/0.8.6/org.jacoco.agent-0.8.6-runtime.jar=destfile=/groups/spruston/home/moharb/BigStitcher-Spark/target/jacoco.exec
[INFO]
[INFO] --- maven-resources-plugin:3.1.0:resources (default-resources) @ BigStitcher-Spark ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /groups/spruston/home/moharb/BigStitcher-Spark/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.8.1:compile (default-compile) @ BigStitcher-Spark ---
[INFO] Compiling 6 source files to /groups/spruston/home/moharb/BigStitcher-Spark/target/classes
[INFO]
[INFO] --- maven-resources-plugin:3.1.0:testResources (default-testResources) @ BigStitcher-Spark ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /groups/spruston/home/moharb/BigStitcher-Spark/src/test/resources
[INFO]
[INFO] --- maven-compiler-plugin:3.8.1:testCompile (default-testCompile) @ BigStitcher-Spark ---
[INFO] No sources to compile
[INFO]
[INFO] --- maven-surefire-plugin:2.22.2:test (default-test) @ BigStitcher-Spark ---
[INFO] No tests to run.
[INFO]
[INFO] --- maven-jar-plugin:3.2.0:jar (default-jar) @ BigStitcher-Spark ---
[INFO] Building jar: /groups/spruston/home/moharb/BigStitcher-Spark/target/BigStitcher-Spark.jar
[INFO]
[INFO] >>> maven-source-plugin:3.2.1:jar (attach-sources-jar) > generate-sources @ BigStitcher-Spark >>>
[INFO]
[INFO] --- maven-enforcer-plugin:3.0.0-M3:enforce (enforce-rules) @ BigStitcher-Spark ---
[INFO] Adding ignore: module-info
[INFO] Adding ignore: META-INF/versions/*/module-info
[INFO] Adding ignore: com.esotericsoftware.kryo.*
[INFO] Adding ignore: com.esotericsoftware.minlog.*
[INFO] Adding ignore: com.esotericsoftware.reflectasm.*
[INFO] Adding ignore: com.google.inject.*
[INFO] Adding ignore: jnr.ffi.*
[INFO] Adding ignore: org.apache.hadoop.yarn.*.package-info
[INFO] Adding ignore: org.apache.spark.unused.UnusedStubClass
[INFO] Adding ignore: org.hibernate.stat.ConcurrentStatisticsImpl
[INFO] Adding ignore: org.jetbrains.kotlin.daemon.common.*
[INFO] Adding ignore: org.junit.runner.Runner
[INFO] Adding ignore: module-info
[INFO] Adding ignore: module-info
[INFO]
[INFO] --- build-helper-maven-plugin:3.0.0:regex-property (sanitize-version) @ BigStitcher-Spark ---
[INFO]
[INFO] --- build-helper-maven-plugin:3.0.0:regex-property (guess-package) @ BigStitcher-Spark ---
[INFO]
[INFO] --- buildnumber-maven-plugin:1.4:create (default) @ BigStitcher-Spark ---
[INFO]
[INFO] --- scijava-maven-plugin:2.0.0:set-rootdir (set-rootdir) @ BigStitcher-Spark ---
[INFO]
[INFO] --- jacoco-maven-plugin:0.8.6:prepare-agent (jacoco-initialize) @ BigStitcher-Spark ---
[INFO] argLine set to -javaagent:/groups/spruston/home/moharb/.m2/repository/org/jacoco/org.jacoco.agent/0.8.6/org.jacoco.agent-0.8.6-runtime.jar=destfile=/groups/spruston/home/moharb/BigStitcher-Spark/target/jacoco.exec
[INFO]
[INFO] <<< maven-source-plugin:3.2.1:jar (attach-sources-jar) < generate-sources @ BigStitcher-Spark <<<
[INFO]
[INFO]
[INFO] --- maven-source-plugin:3.2.1:jar (attach-sources-jar) @ BigStitcher-Spark ---
[INFO] Building jar: /groups/spruston/home/moharb/BigStitcher-Spark/target/BigStitcher-Spark-sources.jar
[INFO]
[INFO] --- jacoco-maven-plugin:0.8.6:report (jacoco-site) @ BigStitcher-Spark ---
[INFO] Skipping JaCoCo execution due to missing execution data file.
[INFO]
[INFO] --- maven-assembly-plugin:3.1.1:single (make-assembly) @ BigStitcher-Spark ---
[INFO] Building jar: /groups/spruston/home/moharb/BigStitcher-Spark/target/BigStitcher-Spark-jar-with-dependencies.jar
[INFO]
[INFO] --- maven-jar-plugin:3.2.0:test-jar (default) @ BigStitcher-Spark ---
[INFO] Skipping packaging of the test-jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  32.978 s
[INFO] Finished at: 2021-12-25T17:12:02-05:00
[INFO] ------------------------------------------------------------------------
[login1 - moharb@e05u15]~/BigStitcher-Spark>TERMINATE=1 RUNTIME=8:00 TMPDIR=~/tmp ~/spark-janelia/flintstone.sh 3 \
> ~/BigStitcher-Spark/target/BigStitcher-Spark-jar-with-dependencies.jar \
> net.preibisch.bigstitcher.spark.AffineFusion \
> -x '/nrs/svoboda/moharb/dataset.xml' \
> -o  '/nrs/svoboda/moharb/output3.n5' \
> -d '/GFP/s0' \
> --channelId 0 \
> --UINT8 \
> --minIntensity 1 \
> --maxIntensity 254

On e05u15.int.janelia.org with Python 3.6.10 :: Anaconda, Inc., running:

  /groups/spruston/home/moharb/spark-janelia/spark-janelia  --consolidate_logs --nnodes=3 --gb_per_slot=15 --driverslots=32 --worker_slots=32 --minworkers=1 --hard_runtime=8:00 --submitargs=" --verbose --conf spark.default.parallelism=270 --conf spark.executor.instances=6 --conf spark.executor.cores=5 --conf spark.executor.memory=75g --class net.preibisch.bigstitcher.spark.AffineFusion /groups/spruston/home/moharb/BigStitcher-Spark/target/BigStitcher-Spark-jar-with-dependencies.jar -x /nrs/svoboda/moharb/dataset.xml -o /nrs/svoboda/moharb/output3.n5 -d /GFP/s0 --channelId 0 --UINT8 --minIntensity 1 --maxIntensity 254" generate-and-launch-run


Created:
  /groups/spruston/home/moharb/.spark/20211225_171249/conf
  /groups/spruston/home/moharb/.spark/20211225_171249/logs
  /groups/spruston/home/moharb/.spark/20211225_171249/scripts

Running:
  /groups/spruston/home/moharb/.spark/20211225_171249/scripts/00-queue-lsf-jobs.sh

Sat Dec 25 17:12:49 EST 2021 [e05u15.int.janelia.org] submitting jobs to scheduler
This job will be billed to svoboda
Job <114095184> is submitted to default queue <local>.
This job will be billed to svoboda
Job <114095185> is submitted to default queue <short>.
This job will be billed to svoboda
Job <114095186> is submitted to default queue <local>.
This job will be billed to svoboda
Job <114095189> is submitted to default queue <local>.
This job will be billed to svoboda
Job <114095190> is submitted to default queue <short>.

Queued jobs are:
JOBID      USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
114095184  moharb  PEND  local      e05u15                  *171249_ma Dec 25 17:12
114095185  moharb  PEND  short      e05u15                  *171249_ur Dec 25 17:12
114095186  moharb  PEND  local      e05u15                  *249_wo[1] Dec 25 17:12
114095186  moharb  PEND  local      e05u15                  *249_wo[2] Dec 25 17:12
114095186  moharb  PEND  local      e05u15                  *249_wo[3] Dec 25 17:12
114095189  moharb  PEND  local      e05u15                  *171249_dr Dec 25 17:12
114095190  moharb  PEND  short      e05u15                  *171249_sd Dec 25 17:12


To get web user interface URL after master has started, run:
  grep "Bound MasterWebUI to" /groups/spruston/home/moharb/.spark/20211225_171249/logs/01-master.log
These are the driver logs with the error:
Sat Dec 25 17:13:34 EST 2021 [h07u13] running /misc/local/spark-3.0.1/bin/spark-submit --deploy-mode client --master spark://10.36.107.39:7077  --verbose --conf spark.default.parallelism=270 --conf spark.executor.instances=6 --conf spark.executor.cores=5 --conf spark.executor.memory=75g --class net.preibisch.bigstitcher.spark.AffineFusion /groups/spruston/home/moharb/BigStitcher-Spark/target/BigStitcher-Spark-jar-with-dependencies.jar -x /nrs/svoboda/moharb/dataset.xml -o /nrs/svoboda/moharb/output3.n5 -d /GFP/s0 --channelId 0 --UINT8 --minIntensity 1 --maxIntensity 254
Using properties file: /groups/spruston/home/moharb/.spark/20211225_171249/conf/spark-defaults.conf
Adding default property: spark.storage.blockManagerHeartBeatMs=30000
Adding default property: spark.driver.maxResultSize=0
Adding default property: spark.kryoserializer.buffer.max=1024m
Adding default property: spark.rpc.askTimeout=300s
Adding default property: spark.driver.memory=479g
Adding default property: spark.submit.deployMode=cluster
Adding default property: spark.rpc.retry.wait=30s
Adding default property: spark.core.connection.ack.wait.timeout=600s
Parsed arguments:
  master                  spark://10.36.107.39:7077
  deployMode              client
  executorMemory          75g
  executorCores           5
  totalExecutorCores      null
  propertiesFile          /groups/spruston/home/moharb/.spark/20211225_171249/conf/spark-defaults.conf
  driverMemory            479g
  driverCores             null
  driverExtraClassPath    null
  driverExtraLibraryPath  null
  driverExtraJavaOptions  null
  supervise               false
  queue                   null
  numExecutors            6
  files                   null
  pyFiles                 null
  archives                null
  mainClass               net.preibisch.bigstitcher.spark.AffineFusion
  primaryResource         file:/groups/spruston/home/moharb/BigStitcher-Spark/target/BigStitcher-Spark-jar-with-dependencies.jar
  name                    net.preibisch.bigstitcher.spark.AffineFusion
  childArgs               [-x /nrs/svoboda/moharb/dataset.xml -o /nrs/svoboda/moharb/output3.n5 -d /GFP/s0 --channelId 0 --UINT8 --minIntensity 1 --maxIntensity 254]
  jars                    null
  packages                null
  packagesExclusions      null
  repositories            null
  verbose                 true

Spark properties used, including those specified through
 --conf and those from the properties file /groups/spruston/home/moharb/.spark/20211225_171249/conf/spark-defaults.conf:
  (spark.default.parallelism,270)
  (spark.driver.memory,479g)
  (spark.executor.instances,6)
  (spark.executor.memory,75g)
  (spark.rpc.askTimeout,300s)
  (spark.storage.blockManagerHeartBeatMs,30000)
  (spark.kryoserializer.buffer.max,1024m)
  (spark.submit.deployMode,cluster)
  (spark.core.connection.ack.wait.timeout,600s)
  (spark.driver.maxResultSize,0)
  (spark.rpc.retry.wait,30s)
  (spark.executor.cores,5)

    
2021-12-25 17:13:36,473 [main] WARN [NativeCodeLoader]: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Main class:
net.preibisch.bigstitcher.spark.AffineFusion
Arguments:
-x
/nrs/svoboda/moharb/dataset.xml
-o
/nrs/svoboda/moharb/output3.n5
-d
/GFP/s0
--channelId
0
--UINT8
--minIntensity
1
--maxIntensity
254
Spark config:
(spark.storage.blockManagerHeartBeatMs,30000)
(spark.driver.maxResultSize,0)
(spark.jars,file:/groups/spruston/home/moharb/BigStitcher-Spark/target/BigStitcher-Spark-jar-with-dependencies.jar)
(spark.kryoserializer.buffer.max,1024m)
(spark.app.name,net.preibisch.bigstitcher.spark.AffineFusion)
(spark.rpc.askTimeout,300s)
(spark.driver.memory,479g)
(spark.executor.instances,6)
(spark.submit.pyFiles,)
(spark.default.parallelism,270)
(spark.submit.deployMode,client)
(spark.master,spark://10.36.107.39:7077)
(spark.rpc.retry.wait,30s)
(spark.executor.memory,75g)
(spark.executor.cores,5)
(spark.core.connection.ack.wait.timeout,600s)
Classpath elements:
file:/groups/spruston/home/moharb/BigStitcher-Spark/target/BigStitcher-Spark-jar-with-dependencies.jar


[-x, /nrs/svoboda/moharb/dataset.xml, -o, /nrs/svoboda/moharb/output3.n5, -d, /GFP/s0, --channelId, 0, --UINT8, --minIntensity, 1, --maxIntensity, 254]
mpicbg.spim.data.SpimDataInstantiationException: could not find XmlIoBasicImgLoader implementation for format bdv.n5
	at mpicbg.spim.data.generic.sequence.ImgLoaders.createXmlIoForFormat(ImgLoaders.java:72)
	at mpicbg.spim.data.generic.sequence.XmlIoAbstractSequenceDescription.fromXml(XmlIoAbstractSequenceDescription.java:110)
	at mpicbg.spim.data.generic.XmlIoAbstractSpimData.fromXml(XmlIoAbstractSpimData.java:153)
	at net.preibisch.mvrecon.fiji.spimdata.XmlIoSpimData2.fromXml(XmlIoSpimData2.java:164)
	at net.preibisch.mvrecon.fiji.spimdata.XmlIoSpimData2.fromXml(XmlIoSpimData2.java:52)
	at mpicbg.spim.data.generic.XmlIoAbstractSpimData.load(XmlIoAbstractSpimData.java:95)
	at net.preibisch.bigstitcher.spark.AffineFusion.call(AffineFusion.java:94)
	at net.preibisch.bigstitcher.spark.AffineFusion.call(AffineFusion.java:39)
	at picocli.CommandLine.executeUserObject(CommandLine.java:1853)
	at picocli.CommandLine.access$1100(CommandLine.java:145)
	at picocli.CommandLine$RunLast.executeUserObjectOfLastSubcommandWithSameParent(CommandLine.java:2255)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2249)
	at picocli.CommandLine$RunLast.handle(CommandLine.java:2213)
	at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:2080)
	at picocli.CommandLine.execute(CommandLine.java:1978)
	at net.preibisch.bigstitcher.spark.AffineFusion.main(AffineFusion.java:283)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2021-12-25 17:13:37,007 [Thread-1] INFO [ShutdownHookManager]: Shutdown hook called
2021-12-25 17:13:37,007 [Thread-1] INFO [ShutdownHookManager]: Deleting directory /tmp/spark-fcf876a8-e1e9-4de3-bffe-d05ee3fb0a4a

------------------------------------------------------------
Sender: LSF System <lsfadmin@h07u13>
Subject: Job 114095189: <spark_moharb_20211225_171249_dr> in cluster <Janelia> Exited

Job <spark_moharb_20211225_171249_dr> was submitted from host <e05u15> by user <moharb> in cluster <Janelia> at Sat Dec 25 17:12:49 2021
Job was executed on host(s) <32*h07u13>, in queue <local>, as user <moharb> in cluster <Janelia> at Sat Dec 25 17:13:31 2021
</groups/spruston/home/moharb> was used as the home directory.
</groups/spruston/home/moharb/BigStitcher-Spark> was used as the working directory.
Started at Sat Dec 25 17:13:31 2021
Terminated at Sat Dec 25 17:13:37 2021
Results reported at Sat Dec 25 17:13:37 2021

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/groups/spruston/home/moharb/.spark/20211225_171249/scripts/04-launch-driver.sh
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time :                                   3.96 sec.
    Max Memory :                                 11 MB
    Average Memory :                             11.00 MB
    Total Requested Memory :                     491520.00 MB
    Delta Memory :                               491509.00 MB
    Max Swap :                                   -
    Max Processes :                              4
    Max Threads :                                5
    Run time :                                   6 sec.
    Turnaround time :                            48 sec.

The output (if any) is above this job summary.

Same happens for nonRigid, am I doing something wrong?

Thanks!
Boaz

@StephanPreibisch
Contributor

Thanks @boazmohar ... @tpietzsch we have seen this behavior before, do you remember what the reason was? Thanks :)

@StephanPreibisch
Contributor

not your fault @boazmohar ... this is about the annotations that are used to find ImgLoader implementations at runtime ...
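The annotation-driven discovery mentioned here can be sketched in a self-contained way (illustrative only; the class and method names below are hypothetical stand-ins, not the actual SciJava/mpicbg mechanics): at build time an annotation processor writes the fully-qualified names of all annotated loader implementations into an index file on the classpath, and at runtime the library reads that file and instantiates the classes. If a fat-jar build drops or clobbers the index, the runtime lookup finds nothing and fails with exactly this kind of "could not find ... implementation" error.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of annotation-index discovery, not the real library code.
class IndexDiscovery {
    // Reads a classpath index file listing one implementation class name per line,
    // e.g. the names of all XmlIoBasicImgLoader implementations found at build time.
    static List<String> readIndex(InputStream index) throws Exception {
        List<String> classNames = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(index))) {
            String line;
            while ((line = r.readLine()) != null) {
                String name = line.trim();
                if (!name.isEmpty())
                    classNames.add(name); // each entry would be Class.forName()'d at runtime
            }
        }
        return classNames;
    }
}
```

Under this model, an empty or missing index (as a broken fat-jar merge can produce) yields an empty list, so no format string can ever resolve to a loader class.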

@StephanPreibisch
Contributor

Hi @boazmohar, I found how we solved the same problem before ... I updated the pom.xml

Can you please build with

mvn clean package -P fatjar

and use the BigStitcher-Spark-0.0.1-SNAPSHOT.jar that is created? It should work ...

@boazmohar
Author

@StephanPreibisch it changed the location of the error; it is now on the workers running the task:

2021-12-26 10:06:29,223 [task-result-getter-2] ERROR [TaskSetManager]: Task 35 in stage 0.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 35 in stage 0.0 failed 4 times, most recent failure: Lost task 35.3 in stage 0.0 (TID 129, 10.36.107.23, executor 6): mpicbg.spim.data.SpimDataInstantiationException: could not find XmlIoBasicImgLoader implementation for format bdv.n5
	at mpicbg.spim.data.generic.sequence.ImgLoaders.createXmlIoForFormat(ImgLoaders.java:72)
	at mpicbg.spim.data.generic.sequence.XmlIoAbstractSequenceDescription.fromXml(XmlIoAbstractSequenceDescription.java:110)

@boazmohar
Author

My bad, there was an issue with access to my home folder, putting the .jar on nrs solved it!

@StephanPreibisch
Contributor

Cool, did you already try it on a bigger dataset?

@boazmohar
Author

It is failing again, same error.
I submitted a help desk ticket and asked Tim Wang to try running it.

@StephanPreibisch
Contributor

Which same error? I thought it worked? If the same error persists we need to solve it …

@boazmohar
Author

Well, it worked on the example dataset ONCE:

Saved, e.g. view with './n5-view -i /nrs/svoboda/moharb/output3.n5 -d /GFP/s0'
done, took: 8220 ms.

Full log:
04-driver.log

But the worker stdout still has this error:
stdout.txt

Here is one where the dataset is bigger and it failed. Maybe it is running out of memory and the error is unrelated?

stdout_failed.txt

@StephanPreibisch
Contributor

Hi, there is still a bug when creating the FATJAR ... I will need to examine what is going on

@StephanPreibisch
Contributor

StephanPreibisch commented Dec 27, 2021

Hi @boazmohar, can you please pull the newest version, call mvn clean package -P fatjar and try:

java -cp target/BigStitcher-Spark-0.0.1-SNAPSHOT.jar net.preibisch.bigstitcher.spark.TestDependencies -x '/Users/spreibi/Documents/Microscopy/Stitching/Truman/standard/dataset.xml'

It should load and display one of the stacks ... works for me, meaning it can find the N5 imgloader

@StephanPreibisch
Contributor

of course adjust the XML path and maybe try it locally on your machine ... and then on an interactive session on the cluster. If that all works, so should the Spark resaving (I think).

@boazmohar
Author

boazmohar commented Dec 27, 2021

It is working locally and from NoMachine on the cluster.
It is still not working on Spark.
One thing I noticed is that it is launching the tasks before loading the jar (at least in the log file):

2021-12-27 11:34:41,447 [dispatcher-Executor] INFO [ResourceUtils]: ==============================================================
2021-12-27 11:34:41,467 [dispatcher-Executor] INFO [CoarseGrainedExecutorBackend]: Successfully registered with driver
2021-12-27 11:34:41,469 [dispatcher-Executor] INFO [Executor]: Starting executor ID 0 on host 10.36.107.13
2021-12-27 11:34:41,557 [dispatcher-Executor] INFO [Utils]: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 17913.
2021-12-27 11:34:41,558 [dispatcher-Executor] INFO [NettyBlockTransferService]: Server created on 10.36.107.13:17913
2021-12-27 11:34:41,559 [dispatcher-Executor] INFO [BlockManager]: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2021-12-27 11:34:41,565 [dispatcher-Executor] INFO [BlockManagerMaster]: Registering BlockManager BlockManagerId(0, 10.36.107.13, 17913, None)
2021-12-27 11:34:41,572 [dispatcher-Executor] INFO [BlockManagerMaster]: Registered BlockManager BlockManagerId(0, 10.36.107.13, 17913, None)
2021-12-27 11:34:41,573 [dispatcher-Executor] INFO [BlockManager]: Initialized BlockManager: BlockManagerId(0, 10.36.107.13, 17913, None)
2021-12-27 11:34:41,648 [dispatcher-Executor] INFO [CoarseGrainedExecutorBackend]: Got assigned task 20
2021-12-27 11:34:41,651 [dispatcher-Executor] INFO [CoarseGrainedExecutorBackend]: Got assigned task 21
2021-12-27 11:34:41,652 [dispatcher-Executor] INFO [CoarseGrainedExecutorBackend]: Got assigned task 22
2021-12-27 11:34:41,652 [dispatcher-Executor] INFO [CoarseGrainedExecutorBackend]: Got assigned task 23
2021-12-27 11:34:41,653 [dispatcher-Executor] INFO [CoarseGrainedExecutorBackend]: Got assigned task 24
2021-12-27 11:34:41,655 [Executor task launch worker for task 21] INFO [Executor]: Running task 21.0 in stage 0.0 (TID 21)
2021-12-27 11:34:41,655 [Executor task launch worker for task 22] INFO [Executor]: Running task 22.0 in stage 0.0 (TID 22)
2021-12-27 11:34:41,655 [Executor task launch worker for task 24] INFO [Executor]: Running task 24.0 in stage 0.0 (TID 24)
2021-12-27 11:34:41,655 [Executor task launch worker for task 23] INFO [Executor]: Running task 23.0 in stage 0.0 (TID 23)
2021-12-27 11:34:41,655 [Executor task launch worker for task 20] INFO [Executor]: Running task 20.0 in stage 0.0 (TID 20)
2021-12-27 11:34:41,675 [Executor task launch worker for task 23] INFO [Executor]: Fetching spark://h07u21:35021/jars/BigStitcher-Spark-0.0.1-SNAPSHOT.jar with timestamp 1640622879317
2021-12-27 11:34:41,698 [Executor task launch worker for task 23] INFO [TransportClientFactory]: Successfully created connection to h07u21/10.36.107.31:35021 after 1 ms (0 ms spent in bootstraps)
2021-12-27 11:34:41,699 [Executor task launch worker for task 23] INFO [Utils]: Fetching spark://h07u21:35021/jars/BigStitcher-Spark-0.0.1-SNAPSHOT.jar to /scratch/moharb/spark-114599123-1/spark-bb5a2dd5-fed1-4caa-83df-fff14cc90256/executor-160b31f0-6981-45c2-b399-7b8db7e3d19a/spark-8e944ff4-3dcf-44f5-84cf-c9773c878a38/fetchFileTemp2825303330217301914.tmp
2021-12-27 11:34:42,182 [Executor task launch worker for task 23] INFO [Utils]: Copying /scratch/moharb/spark-114599123-1/spark-bb5a2dd5-fed1-4caa-83df-fff14cc90256/executor-160b31f0-6981-45c2-b399-7b8db7e3d19a/spark-8e944ff4-3dcf-44f5-84cf-c9773c878a38/-3846531471640622879317_cache to /groups/spruston/home/moharb/.spark/20211227_113327/logs/worker-1-dir/app-20211227113439-0000/0/./BigStitcher-Spark-0.0.1-SNAPSHOT.jar
2021-12-27 11:34:42,597 [Executor task launch worker for task 23] INFO [Executor]: Adding file:/groups/spruston/home/moharb/.spark/20211227_113327/logs/worker-1-dir/app-20211227113439-0000/0/./BigStitcher-Spark-0.0.1-SNAPSHOT.jar to class loader
2021-12-27 11:34:42,773 [Executor task launch worker for task 23] INFO [TorrentBroadcast]: Started reading broadcast variable 0 with 1 pieces (estimated total size 4.0 MiB)
2021-12-27 11:34:42,806 [Executor task launch worker for task 23] INFO [TransportClientFactory]: Successfully created connection to h07u21/10.36.107.31:36127 after 1 ms (0 ms spent in bootstraps)
2021-12-27 11:34:42,825 [Executor task launch worker for task 23] INFO [MemoryStore]: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.3 KiB, free 39.8 GiB)
2021-12-27 11:34:42,831 [Executor task launch worker for task 23] INFO [TorrentBroadcast]: Reading broadcast variable 0 took 57 ms
2021-12-27 11:34:42,904 [Executor task launch worker for task 23] INFO [MemoryStore]: Block broadcast_0 stored as values in memory (estimated size 4.3 KiB, free 39.8 GiB)
2021-12-27 11:34:43,551 [Executor task launch worker for task 21] ERROR [Executor]: Exception in task 21.0 in stage 0.0 (TID 21)
mpicbg.spim.data.SpimDataInstantiationException: could not find XmlIoBasicImgLoader implementation for format bdv.n5
	at mpicbg.spim.data.generic.sequence.ImgLoaders.createXmlIoForFormat(ImgLoaders.java:72)
	at mpicbg.spim.data.generic.sequence.XmlIoAbstractSequenceDescription.fromXml(XmlIoAbstractSequenceDescription.java:110)
	at mpicbg.spim.data.generic.XmlIoAbstractSpimData.fromXml(XmlIoAbstractSpimData.java:153)
	at net.preibisch.mvrecon.fiji.spimdata.XmlIoSpimData2.fromXml(XmlIoSpimData2.java:164)
	at net.preibisch.mvrecon.fiji.spimdata.XmlIoSpimData2.fromXml(XmlIoSpimData2.java:52)

Maybe it is related to some race condition?
Maybe some of this discussion is relevant.

@StephanPreibisch
Contributor

@trautmane do you have any ideas? Thanks 😊

@mzouink
Member

mzouink commented Dec 27, 2021

I think it can be solved by adding this line, but it is a hack:
ImgLoaders.registerManually(XmlIoN5ImageLoader.class);
The XmlIoN5ImageLoader should be added automatically to the BDV loaders register.
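A standalone analogue of this manual-registration fallback shows why it sidesteps the discovery failure (illustrative only: apart from the `ImgLoaders.registerManually(...)` call quoted above, every name below is hypothetical and not the mpicbg API): the loader class is put into the registry explicitly, so the broken annotation index is never consulted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical registry sketch, not the real ImgLoaders class.
class ManualRegistration {
    private final Map<String, Class<?>> byFormat = new HashMap<>();

    // Explicit, discovery-free registration of a loader for a format string.
    void registerManually(String format, Class<?> loader) {
        byFormat.put(format, loader);
    }

    // Lookup that fails the same way the thread's error does when no entry exists.
    Class<?> lookup(String format) {
        Class<?> c = byFormat.get(format);
        if (c == null)
            throw new IllegalStateException(
                "could not find loader implementation for format " + format);
        return c;
    }
}
```

The real hack does the same thing one level up: it seeds the loader registry before the XML is parsed, so the bdv.n5 lookup can no longer miss.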

@StephanPreibisch
Contributor

Could you give this a try @mzouink? That would be awesome ...

@mzouink
Member

mzouink commented Dec 27, 2021

Done, @boazmohar, I added the code to a new branch, fix_bdv_n5.
Can you please try again with this branch: https://github.com/PreibischLab/BigStitcher-Spark/tree/fix_bdv_n5
If it works, then we will know the reason for the bug.

@boazmohar
Author

Sorry, it is still happening, but not in all workers: some of them are working and I see the right output, while some have the same issue loading the right class as before.
[screenshots]
I think that for the smaller dataset there are enough retries to get through.

@mzouink
Member

mzouink commented Dec 27, 2021

We are making progress; now the problem is inside the jobs.
Can you please pull and try again? I added a line: https://github.com/PreibischLab/BigStitcher-Spark/tree/fix_bdv_n5

@boazmohar
Author

boazmohar commented Dec 27, 2021

Still failing in some workers.
[screenshot]
Also found a new error:
[screenshot]
Not sure if it is a separate issue or not. This is a big dataset.

@trautmane
Member

Hi @StephanPreibisch and @boazmohar - I'm looking at all of this for the first time now under the assumption that the problem still exists. Please let me know if that assumption is incorrect.

@boazmohar
Author

boazmohar commented Dec 28, 2021

Yes, still an issue, thanks for looking into it!

@boazmohar - I just left a question for you on Slack ...

@boazmohar
Author

boazmohar commented Dec 29, 2021

I can confirm this solved the issue!
THANKS @trautmane
Here is the result of the non-rigid fusion on a large dataset:
[screenshots]

Still had some memory issues, but I used 5 cores per task and N_TASKS_PER_EXECUTOR_CORE=1000.
BTW, it took 3.5 h with 2011 cores but allocating only 202 tasks at a time.
I would be happy to hear if there is a more principled way to do this (I tried 10, 30, 100, 300, 1000 until it stopped failing more than 4 times per task).

@StephanPreibisch
Contributor

StephanPreibisch commented Dec 29, 2021

Awesome, thanks @trautmane! Could you please merge this into master?

The fix is a change to the spimdata repo and I don't have access to that, so I submitted this pull request. Someone with write access to spimdata ( @tpietzsch , @StephanPreibisch , or ??? ) needs to review and approve my changes. Once a new version of spimdata is deployed, we can update this repo to use it.

I still do not understand why it worked locally with Spark though, do you know?

I did not play with a local Spark version, so I can't definitively answer this - but my guess is that with the local Spark you are just more likely to get lucky and avoid concurrent access to ImgLoaders.build(). When I was testing/debugging, I needed to use multiple worker nodes to regularly encounter the issue. Some initial tests I ran with just one worker node on a small data set failed to reproduce the problem.
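The general shape of the concurrency problem diagnosed above can be sketched in a self-contained way (illustrative only; the class and method names below are hypothetical, not the actual spim_data code): a lazily built registry must publish its map only after the one-time build step completes; otherwise a second thread can look a format up in a half-filled map and hit exactly this kind of "could not find ... implementation for format" failure even though the implementation is on the classpath.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a thread-safe lazily built registry, not the real fix.
class LazyRegistry {
    private static volatile Map<String, String> formats; // format -> loader class name

    static String loaderFor(String format) {
        Map<String, String> m = formats;
        if (m == null) {
            synchronized (LazyRegistry.class) { // exactly one thread builds ...
                m = formats;
                if (m == null) {
                    m = discover();
                    formats = m;                // ... and publishes only when complete
                }
            }
        }
        String loader = m.get(format);
        if (loader == null)
            throw new IllegalStateException(
                "could not find XmlIoBasicImgLoader implementation for format " + format);
        return loader;
    }

    // Stand-in for the annotation-index scan the real library performs.
    private static Map<String, String> discover() {
        Map<String, String> m = new ConcurrentHashMap<>();
        m.put("bdv.n5", "XmlIoN5ImageLoader");
        m.put("bdv.hdf5", "XmlIoHdf5ImageLoader");
        return m;
    }
}
```

Without the synchronized block and the volatile publish, two Spark executor threads racing through the build step could each observe an incomplete registry, which matches the symptom in this thread: some tasks succeed while others on the same worker fail.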

thanks again!!

@StephanPreibisch
Contributor

You should post the image and statistics on Twitter @boazmohar 😊 Thanks so much for pushing this!!

Now only the cloud version is missing @mzouink 😊

@StephanPreibisch
Contributor

StephanPreibisch commented Dec 30, 2021

Thanks @trautmane, makes sense ... I didn't realize on my phone that it was in another repo, @tpietzsch has to approve it and cut a new release so we can distribute, I don't want to do it without him. It always creates this ripple-effect through many Fiji plugins ...

Yeah - that was a concern of mine too. @tpietzsch left a TODO comment in January 2021 about thread safety but didn't fix it then. Hopefully, he just didn't have time to fix it but it is also possible he ran into some other ripple-effect problem.

@tpietzsch
Collaborator

Yeah - that was a concern of mine too. @tpietzsch left a TODO comment in January 2021 about thread safety but didn't fix it then. Hopefully, he just didn't have time to fix it but it is also possible he ran into some other ripple-effect problem.

No, no ripple-effect expected... I just didn't have time to think it through and fix it...

@tpietzsch
Collaborator

This is now merged and released in spim_data-2.2.7

@trautmane
Member

This is fixed by 0110317.
