No check #62

Closed
wants to merge 12 commits into from


will-moore
Member

Just for testing, use --no-check to avoid connecting to IDR, so we can just focus on perf of getPlanes() on current server.

@will-moore
Member Author

will-moore commented Jan 31, 2024

On idr-next: lots of parallel jobs causes problems - #55 (comment)
but running on a single thread doesn't - #55 (comment)

Let's use a small number of parallel threads on just omeroreadonly-1...

[wmoore@prod120-proxy ~]$ 
[wmoore@prod120-proxy ~]$ cat nodes
omeroreadonly-1

$ screen -dmS cache parallel --eta --sshloginfile nodes -a ids_idr0016.txt -j10 '/opt/omero/server/OMERO.server/bin/omero login -s localhost -u public -w public && /opt/omero/server/venv3/bin/python /uod/idr/metadata/idr-utils/scripts/check_pixels.py --render >> /tmp/render_20240131.log'
screen -r

Computers / CPU cores / Max jobs to run
1:omeroreadonly-1 / 8 / 9

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
ETA: 0s Left: 413 AVG: 0.00s  omeroreadonly-1:9/0/100%/0.0s 

EDIT: after 14 mins, Blitz log shows gaps of no activity for several mins, e.g. 13:38:03 -> 13:40:01 when we'd expect rendering to be happening constantly...

[wmoore@prod120-omeroreadonly-1 ~]$ tail -f /opt/omero/server/OMERO.server/var/log/Blitz-0.log
...
2024-01-31 13:36:46,008 INFO  [        ome.services.util.ServiceHandler] (Server-155)  Rslt:	null
2024-01-31 13:36:46,008 INFO  [        ome.services.util.ServiceHandler] (Server-155)  Meth:	interface omeis.providers.re.RenderingEngine.renderCompressed
2024-01-31 13:36:46,008 INFO  [        ome.services.util.ServiceHandler] (Server-155)  Args:	[Type: XY, z=0, t=0, renderShapes=false, shapeIds=[]]
2024-01-31 13:36:46,008 INFO  [             omeis.providers.re.Renderer] (Server-155) Using: 'omeis.providers.re.HSBStrategy' rendering strategy.
2024-01-31 13:37:05,095 DEBUG [                   loci.formats.Memoizer] (Server-158) start[1706708176678] time[48417] tag[loci.formats.Memoizer.setId]
2024-01-31 13:37:05,096 INFO  [                ome.io.nio.PixelsService] (Server-158) Creating BfPixelBuffer: /data/OMERO/ManagedRepository/demo_2/2016-06/16/04-33-36.550_mkngff/45e4cbb8-7ac6-4060-aa32-0b8f975a2894.zarr/.zattrs Series: 1920
2024-01-31 13:38:03,947 INFO  [                 org.perf4j.TimingLogger] (Server-158) start[1706708166663] time[117284] tag[omero.call.success.ome.services.RenderingBean$12.doWork]
2024-01-31 13:40:01,104 INFO  [ome.services.sessions.state.SessionCache] (2-thread-1) Synchronizing session cache. Count = 4
2024-01-31 13:40:30,257 INFO  [ ome.services.blitz.fire.SessionManagerI] (2-thread-5) Performing requestHeartbeats
2024-01-31 13:40:10,886 INFO  [        ome.services.util.ServiceHandler] (Server-158)  Rslt:	ome.io.bioformats.BfPixelBuffer@199c2b33
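
To spot quiet periods like the 13:38:03 -> 13:40:01 one without eyeballing the log, something along these lines could help. It's only a sketch: `find_gaps` and the 60 second threshold are my own choices, and it assumes every log entry starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp as in the excerpt above. The sample lines are copied from that excerpt.

```python
import re
from datetime import datetime

# Matches the timestamp at the start of a Blitz log entry, e.g. "2024-01-31 13:38:03,947"
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def find_gaps(lines, min_gap_seconds=60.0):
    """Return (previous, current, seconds) for each quiet period over the threshold."""
    gaps = []
    previous = None
    for line in lines:
        match = TS_RE.match(line)
        if not match:
            continue  # lines without a leading timestamp are skipped
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if previous is not None:
            seconds = (current - previous).total_seconds()
            # Out-of-order entries give a negative difference and are ignored here.
            if seconds >= min_gap_seconds:
                gaps.append((previous, current, seconds))
        previous = current
    return gaps


SAMPLE = [
    "2024-01-31 13:37:05,095 DEBUG [                   loci.formats.Memoizer] (Server-158) start[1706708176678] time[48417] tag[loci.formats.Memoizer.setId]",
    "2024-01-31 13:38:03,947 INFO  [                 org.perf4j.TimingLogger] (Server-158) start[1706708166663] time[117284] tag[omero.call.success.ome.services.RenderingBean$12.doWork]",
    "2024-01-31 13:40:01,104 INFO  [ome.services.sessions.state.SessionCache] (2-thread-1) Synchronizing session cache. Count = 4",
]

for previous, current, seconds in find_gaps(SAMPLE):
    print(f"{previous:%H:%M:%S} -> {current:%H:%M:%S}  ({seconds:.0f}s idle)")
```

On the real file this would be called with the lines of Blitz-0.log, e.g. `find_gaps(open(path))`.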

Just seeing first errors:

[wmoore@prod120-omeroreadonly-1 ~]$ tail -f /tmp/render_20240131.log
160/2304 Render Image:2052592 24307 [Well B14, Field 4]
161/2304 Render Image:2052593 24307 [Well B14, Field 5]
162/2304 Render Image:2052594 24307 [Well B14, Field 6]
163/2304 Render Image:2052596 24307 [Well N17, Field 1]
164/2304 Render Image:2052597 24307 [Well N17, Field 2]
165/2304 Render Image:2052598 24307 [Well N17, Field 3]
166/2304 Render Image:2052599 24307 [Well N17, Field 4]
167/2304 Render Image:2052600 24307 [Well N17, Field 5]
168/2304 Render Image:2052601 24307 [Well N17, Field 6]
Error: RenderJpeg Image:2052601 24307 [Well N17, Field 6] catching classes that do not inherit from BaseException is not allowed
[wmoore@prod120-omeroreadonly-1 ~]$ grep Error /tmp/render_20240131.log
Error: RenderJpeg Image:2043403 24278 [Well I20, Field 5] exception ::Ice::UnknownLocalException
Error: RenderJpeg Image:2051658 24319 [Well O23, Field 3] exception ::Ice::UnknownLocalException
Error: RenderJpeg Image:2056263 24352 [Well B3, Field 6] exception ::Ice::UnknownLocalException
Error: RenderJpeg Image:2047360 24304 [Well A1, Field 1] catching classes that do not inherit from BaseException is not allowed
Error: RenderJpeg Image:2042440 24279 [Well I8, Field 1] exception ::Ice::UnknownLocalException
Error: RenderJpeg Image:2057293 24507 [Well N9, Field 1] catching classes that do not inherit from BaseException is not allowed
Error: RenderJpeg Image:2060877 24512 [Well M5, Field 6] catching classes that do not inherit from BaseException is not allowed
Error: RenderJpeg Image:2047046 24297 [Well H1, Field 5] exception ::Ice::UnknownLocalException
Error: RenderJpeg Image:2052601 24307 [Well N17, Field 6] catching classes that do not inherit from BaseException is not allowed
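
A quick way to tally the errors by message, as a rough sketch. It assumes each line has the form `Error: RenderJpeg Image:<id> <plate> [Well ..., Field ...] <message>` as above; `tally_errors` is just an illustrative helper, and the input is the grep output pasted above.

```python
from collections import Counter

ERROR_LINES = """\
Error: RenderJpeg Image:2043403 24278 [Well I20, Field 5] exception ::Ice::UnknownLocalException
Error: RenderJpeg Image:2051658 24319 [Well O23, Field 3] exception ::Ice::UnknownLocalException
Error: RenderJpeg Image:2056263 24352 [Well B3, Field 6] exception ::Ice::UnknownLocalException
Error: RenderJpeg Image:2047360 24304 [Well A1, Field 1] catching classes that do not inherit from BaseException is not allowed
Error: RenderJpeg Image:2042440 24279 [Well I8, Field 1] exception ::Ice::UnknownLocalException
Error: RenderJpeg Image:2057293 24507 [Well N9, Field 1] catching classes that do not inherit from BaseException is not allowed
Error: RenderJpeg Image:2060877 24512 [Well M5, Field 6] catching classes that do not inherit from BaseException is not allowed
Error: RenderJpeg Image:2047046 24297 [Well H1, Field 5] exception ::Ice::UnknownLocalException
Error: RenderJpeg Image:2052601 24307 [Well N17, Field 6] catching classes that do not inherit from BaseException is not allowed
"""


def tally_errors(text):
    """Count Error lines grouped by the message after the [Well ..., Field ...] label."""
    counts = Counter()
    for line in text.splitlines():
        if line.startswith("Error:"):
            _, _, message = line.partition("] ")
            counts[message] += 1
    return counts


for message, count in tally_errors(ERROR_LINES).most_common():
    print(count, message)
```

For this log that gives 5 UnknownLocalException errors and 4 of the BaseException kind.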

Those UnknownLocalExceptions show more info when the exception is raised instead of just printed:

Error: RenderJpeg Image:2047046 24297 [Well H1, Field 5] exception ::Ice::UnknownLocalException
{
    unknown = ConnectionI.cpp:2052: Ice::ConnectTimeoutException:
timeout while establishing a connection
}

But the "catching classes that do not inherit from BaseException is not allowed" errors don't have any other info in that log.
No errors from today in Blitz log:

[wmoore@prod120-omeroreadonly-1 ~]$ grep Error /opt/omero/server/OMERO.server/var/log/Blitz-0.log
2024-01-30 12:41:46,012 WARN  [            ome.services.blitz.fire.Ring] (      main) Error getting uuid from node ClusterNode/5a3099a1-3ea7-4cb8-a343-104dab25066b -t -e 1.1:tcp -h 10.35.199.43 -p 37012 -t 60000:tcp -h 192.168.120.132 -p 37012 -t 60000 -- removing.
2024-01-30 12:47:12,465 WARN  [            ome.services.blitz.fire.Ring] (      main) Error getting uuid from node ClusterNode/7ace86bd-3fe0-4aeb-8af2-e224eeefd894 -t -e 1.1:tcp -h 10.35.199.43 -p 37017 -t 60000:tcp -h 192.168.120.132 -p 37017 -t 60000 -- removing.
2024-01-30 14:00:02,974 ERROR [ ome.services.blitz.fire.SessionManagerI] (.Server-17) Error reaping session 7b8e7456-b98e-430d-a0b1-df092b0fdb6c from client b3bd52d6-df48-4c60-bdf0-5e665effbe78
2024-01-30 14:09:59,588 ERROR [ ome.services.blitz.fire.SessionManagerI] (.Server-32) Error while creating ServiceFactoryI
2024-01-30 14:13:32,712 ERROR [ ome.services.blitz.fire.SessionManagerI] (l.Server-5) Error while creating ServiceFactoryI
2024-01-30 14:13:32,718 ERROR [ ome.services.blitz.fire.SessionManagerI] (.Server-34) Error while creating ServiceFactoryI
2024-01-30 14:16:34,808 ERROR [ ome.services.blitz.fire.SessionManagerI] (.Server-19) Error while creating ServiceFactoryI
2024-01-30 14:17:14,805 ERROR [ ome.services.blitz.fire.SessionManagerI] (.Server-21) Error while creating ServiceFactoryI
2024-01-30 14:31:58,445 WARN  [            ome.services.blitz.fire.Ring] (2-thread-3) Error getting uuid from node ClusterNode/0d867a63-7138-4d8c-9741-5e9f87822010 -t -e 1.1:tcp -h 10.35.199.43 -p 42195 -t 60000:tcp -h 192.168.120.132 -p 42195 -t 60000 -- removing.
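
For reference, "catching classes that do not inherit from BaseException is not allowed" is the message of a Python TypeError. Python raises it when an exception reaches an `except` clause that names something which is not an exception class, and the TypeError then replaces the original error in the output. The sketch below reproduces that in isolation. `NotAnException` and `render` are made-up stand-ins, and I'm not claiming this is what happens inside check_pixels.py, only that it is one way the message can arise and why the log line carries no other detail.

```python
class NotAnException:
    """Stand-in for a name that does not refer to an exception class."""


def render():
    # Stand-in for the failing call; the real underlying error is unknown.
    raise ValueError("original failure")


def run():
    try:
        render()
    except NotAnException:  # only checked once an exception is actually in flight
        pass


try:
    run()
except TypeError as err:
    message = str(err)
    # The error that was being handled is still attached to the TypeError.
    original = err.__context__
    print(message)
    print(repr(original))
```

If that is what is going on, logging `ex.__context__` (or a full traceback) alongside the message should reveal the underlying error.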

@will-moore
Member Author

It seems that the --render script stopped at the last Error above (9 Errors in total).

[wmoore@prod120-omeroreadonly-1 ~]$ tail /tmp/render_20240131.log
...
168/2304 Render Image:2052601 24307 [Well N17, Field 6]
Error: RenderJpeg Image:2052601 24307 [Well N17, Field 6] catching classes that do not inherit from BaseException is not allowed

[wmoore@prod120-omeroreadonly-1 ~]$ grep Error /tmp/render_20240131.log | wc
      9     126     995

That would seem to correspond with the number of jobs running:

Computers / CPU cores / Max jobs to run
1:omeroreadonly-1 / 8 / 9

and is likely due to the fact that raising the Exception causes each job to stop after the first Error.
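
To illustrate the difference between stopping at the first Error and carrying on, here is a minimal sketch. `render_all`, `flaky_render` and the `stop_on_error` switch are hypothetical names for illustration and are not taken from check_pixels.py; the `break` stands in for re-raising, since both mean later images are never attempted.

```python
def render_all(image_ids, render, stop_on_error=True):
    """Call render(image_id) for each id; return (rendered, failed) lists."""
    rendered, failed = [], []
    for image_id in image_ids:
        try:
            render(image_id)
            rendered.append(image_id)
        except Exception as ex:
            print(f"Error: RenderJpeg Image:{image_id} {ex}")
            failed.append(image_id)
            if stop_on_error:
                break  # later ids are never tried, like a job dying on its first Error
    return rendered, failed


def flaky_render(image_id):
    # Hypothetical renderer that fails for one id, to exercise both modes.
    if image_id == 3:
        raise RuntimeError("simulated timeout")


print(render_all([1, 2, 3, 4, 5], flaky_render, stop_on_error=True))
print(render_all([1, 2, 3, 4, 5], flaky_render, stop_on_error=False))
```

With `stop_on_error=False` a single failure costs one image rather than the rest of the plate, at the price of possibly logging many repeated errors if the server is unreachable.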

Total number of Images rendered before all jobs failed is

[wmoore@prod120-omeroreadonly-1 ~]$ grep "Render Image" /tmp/render_20240131.log | wc
   1332   10670   73367

Dividing 1332 images between 9 jobs gives an average of 148 images per job before the Error.
We can see that's a pretty good estimate - actually around 166 for 8 jobs and 0 for one job:

[wmoore@prod120-omeroreadonly-1 ~]$ grep -B 1 Error /tmp/render_20240131.log
167/2304 Render Image:2043403 24278 [Well I20, Field 5]
Error: RenderJpeg Image:2043403 24278 [Well I20, Field 5] exception ::Ice::UnknownLocalException
--
165/2304 Render Image:2051658 24319 [Well O23, Field 3]
Error: RenderJpeg Image:2051658 24319 [Well O23, Field 3] exception ::Ice::UnknownLocalException
--
162/2304 Render Image:2056263 24352 [Well B3, Field 6]
Error: RenderJpeg Image:2056263 24352 [Well B3, Field 6] exception ::Ice::UnknownLocalException
--
0/2304 Render Image:2047360 24304 [Well A1, Field 1]
Error: RenderJpeg Image:2047360 24304 [Well A1, Field 1] catching classes that do not inherit from BaseException is not allowed
--
163/2304 Render Image:2042440 24279 [Well I8, Field 1]
Error: RenderJpeg Image:2042440 24279 [Well I8, Field 1] exception ::Ice::UnknownLocalException
--
169/2304 Render Image:2057293 24507 [Well N9, Field 1]
Error: RenderJpeg Image:2057293 24507 [Well N9, Field 1] catching classes that do not inherit from BaseException is not allowed
--
168/2304 Render Image:2060877 24512 [Well M5, Field 6]
Error: RenderJpeg Image:2060877 24512 [Well M5, Field 6] catching classes that do not inherit from BaseException is not allowed
--
161/2304 Render Image:2047046 24297 [Well H1, Field 5]
Error: RenderJpeg Image:2047046 24297 [Well H1, Field 5] exception ::Ice::UnknownLocalException
--
168/2304 Render Image:2052601 24307 [Well N17, Field 6]
Error: RenderJpeg Image:2052601 24307 [Well N17, Field 6] catching classes that do not inherit from BaseException is not allowed
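
The arithmetic can be checked from the grep output above. This sketch assumes the leading index on each "Render Image" line is zero-based, so a job whose last line is `N/2304` has logged N + 1 such lines. That is my inference from the `0/2304` line, and it is consistent with the total of 1332. `lines_logged_per_job` is an illustrative helper.

```python
import re

# Last "Render Image" line logged by each job before its Error (from grep -B 1).
LAST_LINES = """\
167/2304 Render Image:2043403 24278 [Well I20, Field 5]
165/2304 Render Image:2051658 24319 [Well O23, Field 3]
162/2304 Render Image:2056263 24352 [Well B3, Field 6]
0/2304 Render Image:2047360 24304 [Well A1, Field 1]
163/2304 Render Image:2042440 24279 [Well I8, Field 1]
169/2304 Render Image:2057293 24507 [Well N9, Field 1]
168/2304 Render Image:2060877 24512 [Well M5, Field 6]
161/2304 Render Image:2047046 24297 [Well H1, Field 5]
168/2304 Render Image:2052601 24307 [Well N17, Field 6]
"""


def lines_logged_per_job(text):
    """Return the number of 'Render Image' lines each job logged (index + 1)."""
    counts = []
    for line in text.splitlines():
        match = re.match(r"(\d+)/\d+ Render Image:", line)
        if match:
            counts.append(int(match.group(1)) + 1)
    return counts


counts = lines_logged_per_job(LAST_LINES)
total = sum(counts)
busy = [c for c in counts if c > 1]  # the 8 jobs that got past their first image

print("total:", total)
print("mean over all jobs:", total / len(counts))
print("mean over the 8 busy jobs:", sum(busy) / len(busy))
```

So the overall mean of 148 is pulled down by the one job that failed on its first image; the other 8 each got through roughly 166.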

Is this now at a useful state for adding to tests etc?

@will-moore will-moore closed this Feb 2, 2024