Performance analysis #54
Adds optional performance instrumentation that emits [PERF] lines on
stderr, guarded behind a new --perf CLI flag (off by default, zero
overhead when off). Useful for diagnosing where slider-tick time is
going on a given install — reader I/O on network filesystems, pipeline
filters, rendering, and Trame round-trips each get their own labels.
New module:
- utils/perf.py: enable/is_enabled/timed/log. The timed(label) context
manager checks the module-level bool once and does nothing further
when disabled. Safe to leave timers in production code paths.
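A minimal sketch of the shape of this module (names taken from the description above; the real file may differ in details):

```python
# utils/perf.py -- minimal sketch based on the description above;
# the actual implementation may differ.
import sys
import time
from contextlib import contextmanager

_ENABLED = False  # flipped once at startup via enable()


def enable(flag: bool) -> None:
    global _ENABLED
    _ENABLED = bool(flag)


def is_enabled() -> bool:
    return _ENABLED


def log(label: str, ms: float) -> None:
    sys.stderr.write(f"[PERF] {label}: {ms:.2f} ms\n")


@contextmanager
def timed(label: str):
    # Disabled path: one module-level bool check, then nothing else.
    if not _ENABLED:
        yield
        return
    start = time.perf_counter()
    try:
        yield
    finally:
        log(label, (time.perf_counter() - start) * 1000.0)
```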
CLI:
- utils/cli.py: --perf boolean flag.
- app.py: perf.enable(args.perf) at startup.
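The wiring is the usual argparse pattern; roughly (sketch, assuming an argparse-based CLI and a `quickview.utils` package path):

```python
# utils/cli.py (sketch): plain boolean flag, off by default.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--perf",
    action="store_true",
    help="emit [PERF] timing lines on stderr",
)

# app.py (sketch): flip the module-level switch once at startup.
from quickview.utils import perf  # assumed import path

args = parser.parse_args()
perf.enable(args.perf)
```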
Instrumented paths:
- app._on_slicing_change: total tick + pipeline/color_range/render/
extract_avgs sub-steps (label includes dim=N so runs are paired up).
- app._on_downstream_change: same 4-stage breakdown under
downstream_change.* for crop/projection changes.
- app._data_load_variables: initial load — LoadVariables, both
network_completion awaits, build_auto_layout, reset_camera.
- plugins/eam_reader: RequestData total + per-variable netcdf_read
and fill_value_scan. These are the I/O-hypothesis targets.
- plugins/eam_projection:
- EAMCenterMeridian.RequestData: cache_hit vs rebuild, plus
clip_left/clip_right/transform/append within rebuild.
- EAMExtract.RequestData: cache_hit vs rebuild_trim, plus
cell_centers, deep_copy_cells, ghost_mask, remove_ghost_cells.
- EAMProject.RequestData: cache_hit vs cache_miss, plus
deep_copy_points and pyproj_transform within the miss path.
- add_cell_arrays: slice_plan build + per-array pedigree_copy.
- view_manager.VariableView: per-view render, update_color_range,
nested get_range.
- utils/compute.extract_avgs: per-variable weighted-average cost.
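The call-site pattern is the same everywhere; e.g. for the slider tick it could look like this (hypothetical body and helper names, labels matching the list above):

```python
# app._on_slicing_change (sketch): nested timers produce the hierarchical
# labels. `dim`, `value`, and the helpers are hypothetical names here.
from quickview.utils import perf  # assumed import path

with perf.timed(f"tick.{dim}={value}.total"):
    with perf.timed("tick.pipeline"):
        update_pipeline()        # hypothetical helper
    with perf.timed("tick.color_range"):
        update_color_range()     # hypothetical helper
    with perf.timed("tick.render"):
        render_views()           # hypothetical helper
    with perf.timed("tick.extract_avgs"):
        extract_averages()       # hypothetical helper
```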
Plugins use a try/except fallback to a no-op shim for the import, so they
still work when loaded by the ParaView GUI with quickview not on sys.path.
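The guard could look like this (sketch; the shim only has to expose the same names as the real module):

```python
# Plugin import guard (sketch): fall back to a no-op shim when quickview
# isn't importable, e.g. when the ParaView GUI loads the plugin directly.
try:
    from quickview.utils import perf  # assumed import path
except ImportError:
    from contextlib import contextmanager

    class perf:  # no-op stand-in with the same surface
        @staticmethod
        def is_enabled() -> bool:
            return False

        @staticmethod
        @contextmanager
        def timed(label):
            yield

        @staticmethod
        def log(label, ms):
            pass
```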
Output format: `[PERF] <label>: NN.NN ms`. Hierarchical labels
(e.g. "tick.pipeline", "reader.netcdf_read.U") are easy to grep,
sort, and aggregate.
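For example, a captured log (`quickview --perf ... 2> perf.log`) can be rolled up per label with a few lines of Python (hypothetical helper script, not part of the change):

```python
# sum_perf.py (hypothetical): aggregate [PERF] lines by label.
# Usage: python sum_perf.py < perf.log
import re
import sys
from collections import defaultdict

totals, counts = defaultdict(float), defaultdict(int)
for line in sys.stdin:
    m = re.match(r"\[PERF\] (\S+): ([\d.]+) ms", line)
    if m:
        totals[m.group(1)] += float(m.group(2))
        counts[m.group(1)] += 1

for label, total in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{total:10.2f} ms  (n={counts[label]:4d})  {label}")
```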
The existing view.<var>.render timer wraps the ctx[name].update() call,
which is non-blocking — it only schedules the trame/RCA push. The real
VTK render happens later on the event loop, so the timer reported
~0.01 ms and left a gap in the per-tick picture.
Add Start/EndEvent observers on the render window in both view
managers. Each render (RCA-triggered, camera change, whatever) emits
[PERF] view.<var>.render_window: <ms> ms (default view manager)
[PERF] view.shared.render_window: <ms> ms (single-window --fast)
The observers only compute/log when perf.is_enabled() is true, so the
non-perf path pays only one bool check per render event.
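Roughly (sketch; `render_window` is the manager's vtkRenderWindow and `name` the view label, both assumed to exist in scope):

```python
# Render-window observers (sketch): bracket each actual VTK render.
import time

from quickview.utils import perf  # assumed import path

_render_start = [0.0]

def _on_render_start(obj, event):
    if perf.is_enabled():
        _render_start[0] = time.perf_counter()

def _on_render_end(obj, event):
    if perf.is_enabled():
        elapsed_ms = (time.perf_counter() - _render_start[0]) * 1000.0
        perf.log(f"view.{name}.render_window", elapsed_ms)

render_window.AddObserver("StartEvent", _on_render_start)
render_window.AddObserver("EndEvent", _on_render_end)
```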
JPEG encode + websocket push are still not captured; those happen in
the RCA/trame layer after VTK returns. If that turns out to dominate
we'll need to hook further into the RCA view handler.
Monkey-patch RcaEncoder.encode and RcaViewAdapter.push at import so the
two phases show up under --perf:
[PERF] rca.encode.<encoder>: <ms> ms (turbo-jpeg / jpeg / avif / etc.)
[PERF] rca.push.<area>: <ms> ms (one per view's websocket push)
With this plus the existing view.<var>.render_window timer, the full
server-side tick path is now accounted for: VTK render → JPEG encode
→ websocket push. On a NERSC-style trace, the difference between wall-
clock lag and the sum of these three labels is attributable to the
network-to-client transfer.
Wrappers check _ENABLED per-call so the non-perf path costs only a
bool test per encode/push.
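The wrapping pattern itself is generic (sketch; the exact trame-rca class and method signatures are assumptions here):

```python
# Monkey-patch pattern (sketch): time an existing method and log under a
# label derived from the instance. Class/attribute names are assumptions.
import time

from quickview.utils import perf  # assumed import path

def _timed_method(cls, method_name, label_fn):
    original = getattr(cls, method_name)

    def wrapper(self, *args, **kwargs):
        if not perf.is_enabled():  # non-perf path: one bool test
            return original(self, *args, **kwargs)
        start = time.perf_counter()
        try:
            return original(self, *args, **kwargs)
        finally:
            perf.log(label_fn(self), (time.perf_counter() - start) * 1000.0)

    setattr(cls, method_name, wrapper)
```

At import time this would be applied once to each of the two classes, with a label function that pulls the encoder type or view area off the instance (attribute names not shown, since they are trame-rca internals).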
|
@jourdain Can you deploy that on NERSC so that we can get more details on the 2 sec / frame? It needs to be run with --perf and outputs to stderr. |
|
The new version (v2.4.0) has been released. Now someone needs to update it on NERSC and run it with the --perf flag. @huiwanpnnl I will try to do that later today, but if you manage to try it sooner, please go ahead. Thx |
|
I updated the installation at NERSC, and updated the script |
|
I did a quick test on an exclusive GPU node and got the following in the terminal. Not sure if it's informative for @berkgeveci. I'm attending a NERSC training course today and hence am slow in responding. Sorry! |
|
@huiwanpnnl for doing the test with only 1 render window, you need to keep the --fast flag. |
|
|
I will make it the default at some point, but since we were running into graphics issues, I kept the old mode too. I may switch the default though. |
|
I see. I've updated |
|
Here is the AI analysis of perf results. Rendering is a concern.
@jourdain: There is a way of printing out what rendering backend ParaView is using, I think. Same thing that shows in the about panel. Can you print it out to make sure that we are using the GPU? An NVIDIA GPU should be much faster than my Mac.
@huiwanpnnl: Which file is this?
Where the time is going:
- Each reader.netcdf_read.* is ~2,000 ms on NERSC vs ~45 ms locally. That's 50× slower, …
- rca.push.* is 0.01–0.06 ms on NERSC. The websocket kernel write is not back-pressuring. If …
- Labels are per-view (view.LW.render_window + view.precip.render_window) rather than …
- Likely cause: NERSC compute instances typically run software GL (Mesa llvmpipe) — no GPU. 25M …
- Line 11: project.pyproj_transform: 614 ms (cold) vs ~2,500 ms locally. Our threaded pyproj is …
|
|
Hi @berkgeveci
Simulation file:
Connectivity: |
|
When I tested the GPU used at NERSC, I was getting the NVIDIA EGL of the A100. |
|
@huiwanpnnl Does that machine have access to $SCRATCH? If it does, would you mind copying the file to scratch, running again and sending the performance log? @jourdain Can you add `print(w.ReportCapabilities())` where `w` is the render window somewhere near the start?
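For reference, `ReportCapabilities()` is an existing vtkRenderWindow method; a minimal sketch of the suggestion (assuming `render_window` is one of the app's render windows):

```python
# Print the OpenGL vendor/renderer/version in use (sketch). Rendering once
# first makes sure a GL context exists before querying capabilities.
render_window.Render()
print(render_window.ReportCapabilities())
```
|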
|
I've copied the connectivity file and data files to scratch under |
|
|
Loading from the scratch path.
Starting up + loading (LW + precip):
Then moving time forward again and again: |
|
And when cropping so the rendering ends up "smaller" but with 4 variables. |
|
Significant improvement in IO. From ~2s / var to ~85ms / var. Once the VTK
changes I made land, the render should be > 2x faster. So we should be able
to get to < 1s / frame with 2 variables. When all is said and done, the
biggest part of each timestep will be the transformation and uploading of
the scalars to the GPU. There is a way to optimize that further in the
future. It requires moving some of the processing to the GPU. We do two
things on the CPU: apply scalar -> RGBA mapping and create a texture for
rendering cell scalars.
It would be great if you could test 3D variables too. Instead of directly
copying to scratch, please run the script I sent to create chunked HDF5-backed
NetCDF on the scratch. With that, I expect similar IO performance to
2D variables when changing time or lev.
I am happy :)
On Thu, Apr 23, 2026 at 5:48 PM Sebastien Jourdain (Kitware) wrote:
And when cropping so the rendering ends up "smaller" but with 4 variables.
Longitude = [ -50, 96 ]
Latitude = [ -66, 73 ]
[PERF] reader.netcdf_read.LW_flux_up_at_model_top: 78.85 ms
[PERF] reader.fill_value_scan.LW_flux_up_at_model_top: 10.57 ms
[PERF] reader.netcdf_read.precip_total_surf_mass_flux: 78.87 ms
[PERF] reader.fill_value_scan.precip_total_surf_mass_flux: 17.77 ms
[PERF] reader.RequestData: 186.86 ms
[PERF] add_cell_arrays.slice_plan: 0.01 ms
[PERF] add_cell_arrays.pedigree_copy.area: 23.09 ms
[PERF] add_cell_arrays.pedigree_copy.LW_flux_up_at_model_top: 22.80 ms
[PERF] add_cell_arrays.pedigree_copy.precip_total_surf_mass_flux: 22.44 ms
[PERF] add_cell_arrays.pedigree_copy.lat: 22.66 ms
[PERF] center_meridian.cache_hit: 92.18 ms
[PERF] center_meridian.RequestData: 92.28 ms
[PERF] add_cell_arrays.slice_plan: 0.00 ms
[PERF] add_cell_arrays.pedigree_copy.area: 8.01 ms
[PERF] add_cell_arrays.pedigree_copy.LW_flux_up_at_model_top: 9.85 ms
[PERF] add_cell_arrays.pedigree_copy.precip_total_surf_mass_flux: 9.88 ms
[PERF] add_cell_arrays.pedigree_copy.lat: 8.90 ms
[PERF] extract.cache_hit: 36.99 ms
[PERF] extract.RequestData: 37.09 ms
[PERF] project.cache_hit: 0.01 ms
[PERF] project.RequestData: 0.08 ms
[PERF] tick.pipeline: 352.27 ms
[PERF] tick.color_range: 50.81 ms
[PERF] tick.render: 0.01 ms
[PERF] extract_avgs.LW_flux_up_at_model_top: 23.69 ms
[PERF] extract_avgs.precip_total_surf_mass_flux: 23.73 ms
[PERF] extract_avgs.area: 25.26 ms
[PERF] extract_avgs.lat: 24.19 ms
[PERF] tick.extract_avgs: 97.27 ms
[PERF] tick.time=30.total: 500.44 ms
[PERF] view.shared.render_window: 1664.24 ms
[PERF] rca.encode.turbo-jpeg: 15.36 ms
[PERF] view.shared.render_window: 116.38 ms
[PERF] rca.push.rca_image_stream_1: 0.05 ms
[PERF] rca.encode.turbo-jpeg: 15.93 ms
[PERF] rca.push.rca_image_stream_1: 0.03 ms
[PERF] view.shared.render_window: 91.17 ms
[PERF] rca.encode.turbo-jpeg: 18.20 ms
[PERF] rca.push.rca_image_stream_1: 0.03 ms
|
|
Whoever gets to it first @huiwanpnnl |
|
On a login node (shared GPU/resources): |
|
On an exclusive GPU node, using 2.5.0. Start app, load file: |
|
Load one surface variable:
Manually stepping through some time steps: |
|
I also tried the "play" button. Given that we are looking at an ultra-high-resolution (3.5 km) global field, I'd say this is pretty good. Thanks! |
|
I just realized that if I crop a smaller (but still fairly large) lat-lon box, then stepping through slices appears impressively fast.
QuickView2.5.0_SCREAM_TOMLWup.mov |
|
You guys chunked the surface variables file. I want to test with a 3D file like output.scream.Cess.monthly_ne1024.AVERAGE.small.nc and vary the lev instead of the time. Chunking the surface fields is not useful and probably counter-productive. |
|
Yes. I verified that we don't want to chunk the surface variables. Just the 3D vars. Also, the head node is about 4x slower than the compute nodes in processing. |
|
Oops. Thanks, Berk! I chunked some 3D variables related to cloud fraction. Below is a real-time screen recording showing version 2.5.1's performance in stepping through the lev dimension.

```
module load conda
conda activate /global/common/software/m4359/conda-envs/quickview-family
quickview -p 0 --fast --perf -df output.scream.Cess.monthly_ne1024.AVERAGE.nmonths_x1.2019-09-01-00000.cldfrac.chunked.nc -cf connectivity_ne1024pg2_TEMPEST.scrip.nc
```

QuickView2.5.0_SCREAM_cldfrac_tot.mov |
|
Nice!!! |
Adds an opt-in `--perf` CLI flag that turns on `[PERF] <label>: <ms> ms` timing on stderr across the full server-side slider-tick path, plus a no-op shim so instrumented code paths are near-free (one bool check per call site) when the flag is off. Motivated by a user report of ~2 s/tick on a NERSC cloud instance — we want a way to confirm (or rule out) I/O, rendering, and the web-streaming layer without running a profiler.

Three commits:

- `2cc8f34` — new `utils/perf.py` module, `--perf` flag wired through CLI, timers around every logical tick stage: `tick.pipeline`/`tick.color_range`/`tick.render`/`tick.extract_avgs`, per-variable `reader.netcdf_read.<var>` + `reader.fill_value_scan.<var>`, all three Python filters (`center_meridian`/`extract`/`project`) with cache-hit vs. rebuild breakdowns, `add_cell_arrays` slice-plan + per-variable pedigree copy, per-view render + color-range, and per-variable `extract_avgs`.
- `cfee690` — the existing `view.<var>.render` timer only wrapped `ctx[name].update()` (a non-blocking trigger), so it reported ~0 ms. Added Start/EndEvent observers on the render window in both view managers that emit `view.<var>.render_window` (default) / `view.shared.render_window` (`--fast`) with the actual VTK render time.
- `16dd8c2` — monkey-patches `trame_rca.RcaEncoder.encode` and `RcaViewAdapter.push` to emit `rca.encode.<encoder>` (turbo-jpeg/jpeg/…) and `rca.push.<area>` per call. Together with the render-window timer, the server-side path is now end-to-end instrumented: pipeline → VTK render → JPEG encode → websocket write.

Local steady-state trace (25M-cell grid, 2 variables, `--fast`):

Known blind spot: `rca.push` measures the server-side handoff to the kernel; actual websocket transit to the browser isn't captured. For the NERSC case, `rca.push.*` in the tens/hundreds of ms would indicate a blocked websocket write (back-pressure from a slow client link); `rca.push.*` staying ~0 ms while user-perceived lag is high would point at network round-trip or client-side paint.