Long load times in directories with many files #1

Closed
low-sky opened this issue Jan 7, 2020 · 22 comments · Fixed by CARTAvis/carta-frontend#1699 or CARTAvis/carta-backend#962
@low-sky

low-sky commented Jan 7, 2020

CARTA (v1.2) takes a long time to load when I start it in a directory with a large number of files (>100), even though I'm calling it with a specific file from the command line, e.g.,

carta file0001.fits

The long load time appears to come from scanning all the files in the directory before loading. Is it possible to skip the scan when CARTA is called from the command line in this fashion?

@veggiesaurus

veggiesaurus commented Jan 8, 2020

@low-sky Does this directory have lots of image files? @pford is working on changes that should speed this scenario up dramatically: instead of opening each FITS file to detect the HDU list and HDU types, we will just read the magic number (SIMPLE = T) to determine whether each file is FITS.

You can track progress of this feature here (backend) and here (frontend)
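
For illustration, the magic-number check described above could look roughly like this. It is a minimal Python sketch, not CARTA's backend code: `looks_like_fits` and `FITS_MAGIC` are hypothetical names, and the sketch assumes only that a FITS file begins with the `SIMPLE` keyword padded to eight characters and followed by `=`.

```python
# Hypothetical helper, not part of CARTA: classify a file as FITS by
# reading only its first few bytes instead of parsing its HDUs.
FITS_MAGIC = b"SIMPLE  ="  # "SIMPLE" padded to 8 columns, then "="


def looks_like_fits(path):
    """Return True if the file starts with the FITS primary-header keyword."""
    try:
        with open(path, "rb") as f:
            return f.read(len(FITS_MAGIC)) == FITS_MAGIC
    except OSError:
        # Unreadable entries (permissions, directories) count as non-FITS.
        return False
```

Reading nine bytes per file avoids parsing any header beyond the first keyword, which is where the expected speed-up would come from.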

@veggiesaurus veggiesaurus added the enhancement New feature or request label Jan 8, 2020
@low-sky
Author

low-sky commented Jan 8, 2020

Sorry to be unclear: yes, this happens for directories with lots of image FITS files.

@veggiesaurus

Expected to be included in 1.4 release

@keflavich

This remains my main pain point; I've frequently run CARTA sessions into the ground by accidentally (or necessarily) navigating to a folder that contains too many files.

I see that there's a lot of progress on the linked PRs, but I don't understand the details. Could a dev perhaps provide a user-facing update on the status of the file listing improvements?

Note also that CARTAvis/carta-backend#431 is closely related.

@kswang1029

The improvements include:

  • check magic number for file type parsing
  • parse FITS HDUs at file info request level not file list request level
  • show progress report when encountering a folder with too many files
  • have the ability to cancel the file list request

Would these be sensible?
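
As a rough illustration of the last two items in the list (progress reporting and cancellation), here is a minimal Python sketch. It is not CARTA's implementation: `list_files`, `on_progress`, and `report_every` are hypothetical names, and the sketch assumes the listing runs in a worker thread while another thread may set a `threading.Event` to cancel it.

```python
import os
import threading


def list_files(directory, cancel_event, on_progress=None, report_every=1000):
    """List entry names in `directory`, stopping early if cancelled.

    Returns (names, completed); `completed` is False when the scan
    was interrupted by `cancel_event`.
    """
    names = []
    with os.scandir(directory) as entries:
        for count, entry in enumerate(entries, start=1):
            if cancel_event.is_set():
                return names, False
            names.append(entry.name)
            # Report progress every `report_every` entries.
            if on_progress is not None and count % report_every == 0:
                on_progress(count)
    return names, True
```

A caller would typically create the event with `threading.Event()`, pass it in, and call `set()` on it from the UI thread when the user cancels.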

@kswang1029

kswang1029 commented Feb 3, 2021

@keflavich Just curious, would you mind if we just showed all files without file type parsing at the file list request level? The actual type parsing would happen at the file info request level. This has implications for UX. 🤔

@keflavich

I would greatly prefer if we just saw the filenames and sizes (simple ls -lh output) rather than the metadata; I almost never use the metadata, and I certainly never look at the metadata for all files in a folder.

Is the UI you're proposing, say, list all files, then once a user clicks on the file, determine whether or not it can be opened? I'd be happy with that for sure.

I also liked the suggestion I saw in one thread of being able to select files by type. I would love to filter by suffix, e.g., show only image.tt0 or only .residual or only .psf files, say, especially if that meant the folder would load faster!
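
A suffix filter of the kind requested here can work on names alone. The following is a minimal Python sketch under that assumption; `filter_by_suffix` is a hypothetical name and does not correspond to any CARTA API.

```python
import os


def filter_by_suffix(directory, suffixes):
    """Return sorted entry names in `directory` ending with any suffix.

    Only names are compared; no entry is opened or stat'ed, so
    directory-based images are matched the same way as plain files.
    """
    suffixes = tuple(suffixes)  # str.endswith accepts a tuple of suffixes
    with os.scandir(directory) as entries:
        return sorted(e.name for e in entries if e.name.endswith(suffixes))
```

For example, `filter_by_suffix(".", [".image.tt0", ".residual", ".psf"])` would keep only entries with those endings.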

@veggiesaurus

> I would greatly prefer if we just saw the filenames and sizes (simple ls -lh output) rather than the metadata; I almost never use the metadata, and I certainly never look at the metadata for all files in a folder.

I think we should add a preference "show all files in file browser". This would make the file list very quick (although we'd still need to do some checks for folder-based image formats), but it would list all files rather than only the supported ones.

@keflavich

Right, yes, the difference between "folders that are files" and "folders that should be browsed to" makes the problem trickier than I implied! Still, just showing them all, then deciding later if they're images or not, would be nice.

@veggiesaurus

Fixed in the upcoming 3.0-beta.2 release.

@keflavich

I'm using 3.0-beta.2, and with file list set either to "All Files" or "Filter by extension", it is still very slow to show all files. I see that, independent of filtering technique, it still shows the size of all the files in GB - that means it must be doing some sort of file size inspection. Is there any way to turn that off? This is my main bottleneck in using CARTA right now; I have to wait ~30s-few minutes every time I want to load a new file.

@veggiesaurus

> I'm using 3.0-beta.2, and with file list set either to "All Files" or "Filter by extension", it is still very slow to show all files. I see that, independent of filtering technique, it still shows the size of all the files in GB - that means it must be doing some sort of file size inspection. Is there any way to turn that off? This is my main bottleneck in using CARTA right now; I have to wait ~30s-few minutes every time I want to load a new file.

This is curious. We're basically just using stat to get file info (file size, last modified, etc.) when filtering by extension or showing all files. I'm not really sure how to speed that up, or how it could be significantly slower than ls -lh. In most of our tests, 25K files on an old hard drive was no problem whatsoever.

Can you remind me again what sort of filesystem you're using?
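
To make the cost model concrete: a stat-based listing of the kind described above issues roughly one metadata request per entry. The Python sketch below shows that pattern under stated assumptions; `list_with_stat` is a hypothetical name, and this is not CARTA's backend code.

```python
import os


def list_with_stat(directory):
    """Return (name, size_bytes, mtime) per entry, one stat call each."""
    rows = []
    with os.scandir(directory) as entries:
        for entry in entries:
            try:
                info = entry.stat()  # may query the filesystem per entry
            except OSError:
                continue  # e.g. a broken symlink or a vanished file
            rows.append((entry.name, info.st_size, info.st_mtime))
    return rows
```

On a local disk these calls are usually cheap; on a network filesystem each one may involve a round trip to a metadata server, so the total time can grow with the number of entries.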

@keflavich

OK, that's interesting - stat * is nearly instantaneous in a ~500 image directory. But it took >30s on a 40,000-file directory.

That same ~500-image directory takes nearly a minute to load in CARTA.

The filesystem is lustre-based. It's not very high-performance, and the support team specifically encouraged me to avoid / limit the use of ls -lh (as opposed to ls) when possible.
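
One way to quantify the ls versus ls -lh difference on a given filesystem is to time a names-only pass against a pass that also requests metadata for every entry. This is a minimal Python sketch with a hypothetical function name (`time_listing`); it assumes the directory does not change while it runs, and the second run on the same directory may be faster because of caching.

```python
import os
import time


def time_listing(directory):
    """Return (seconds_names_only, seconds_with_stat) for one pass each."""
    t0 = time.perf_counter()
    names = os.listdir(directory)  # names only, comparable to `ls`
    t1 = time.perf_counter()
    for name in names:
        # Per-entry metadata lookup, comparable to `ls -l`.
        os.lstat(os.path.join(directory, name))
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1
```

A large gap between the two numbers would point at per-file metadata lookups, rather than the directory read itself, as the bottleneck.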

@veggiesaurus

> OK, that's interesting - stat * is nearly instantaneous in a ~500 image directory. But it took >30s on a 40,000-file directory.
>
> That same ~500-image directory takes nearly a minute to load in CARTA.
>
> The filesystem is lustre-based. It's not very high-performance, and the support team specifically encouraged me to avoid / limit the use of ls -lh (as opposed to ls) when possible.

OK. I think we'll have to do some additional benchmarking with Lustre (@ajm-asiaa perhaps you could do so?). I've just tested on our CephFS remote filesystem, and 25000 files showed up in CARTA within 1 second 🤔 A minute to list 500 files seems way out of the ordinary.

@Jordatious

I also experience this problem on ilifu. I just tested again on carta-testing.

@veggiesaurus

> I also experience this problem on ilifu. I just tested again on carta-testing.

Can you compare this to the time it takes to run ls -lh? Note that things might be cached after you ls them once.

@kswang1029

kswang1029 commented May 10, 2022

Could it be that the disk arrays were hibernating and took time to wake up? I did a quick test on ASIAA's Lustre with 25000 mixed files and folders, and the list showed up within a second.

@ajm-ska

ajm-ska commented May 10, 2022

We have a convenient test folder on our Lustre system containing 25012 CASA images of 106kB each. The CARTA Filebrowser took 4 minutes 35 seconds to process them. At least it was showing "Loading file list" progress, otherwise I would have thought it was frozen. The second time I opened the folder, it did indeed process faster, only taking about 30 seconds.

If I just use the terminal, ls -lh only takes ~5 seconds before all the files are listed. The stat * command is instantaneous, but it takes about 15 seconds for all the information to be printed on the screen.

@ajm-ska

ajm-ska commented May 10, 2022

The Lustre lfs getstripe command shows that the images in our test folder have a "stripe_count" of 1, as expected, since they are tiny files in this case. Each image has a different "stripe_offset", so I imagine they are being read from different OSTs. I do wonder, though, whether larger files composed of multiple stripes could be processed faster.

Our Lustre system seems to be functioning fine as the lfs check servers command shows all OSTs are active.

@veggiesaurus

> We have a convenient test folder on our Lustre system containing 25012 CASA images of 106kB each. The CARTA Filebrowser took 4 minutes 35 seconds to process them. At least it was showing "Loading file list" progress, otherwise I would have thought it was frozen. The second time I opened the folder, it did indeed process faster, only taking about 30 seconds.
>
> If I just use the terminal, ls -lh only takes ~5 seconds before all the files are listed. The stat * command is instantaneous, but it takes about 15 seconds for all the information to be printed on the screen.

Did you have your CARTA front-end set to filter by file extension rather than content type?

@ajm-ska

ajm-ska commented May 10, 2022

> Did you have your CARTA front-end set to filter by file extension rather than content type?

No. It was initially set to Filter by file content. After changing it to Filter by extension, it now takes 20 seconds (from the cached state). So it is about 10 seconds faster.

@ajm-ska

ajm-ska commented May 10, 2022

Our ASIAA CARTA public demo server originally mounted all its test images from our Lustre system. That has since changed and now only uses a small internal HDD array in ext4 format. We don't have space to put all the original test images there. But just as an experiment, I just copied over the folder with the 25012 test images (set_lotsFiles2). It takes about 5 seconds to process from the HDD array! So there definitely seems to be poor performance when using Lustre with CARTA. The strange thing is, I don't remember it being so slow with Lustre before.

10 participants