Memory issues in DataAverager and DirtyImager #224
Relevant also for #136.

The memory crash is also an issue for the NuFFT.

I think the solutions for the two could be shared.
Cool, the streaming algorithm looks good. For the NuFFT, I think we really should implement some solution internally. Otherwise the workaround for a user is potentially to split the loose visibilities into chunks themselves, run the NuFFT on each chunk, and stitch the results back together. That's too much to leave to them and is the kind of thing that would push me away from a code. Something like an internally chunked prediction routine would be better.
Ok. If the user is doing any kind of optimization loop, then I think they will want to use something like mini-batched (SGD-style) evaluation over subsets of the loose visibilities. If the user is trying to get all model visibilities at once, then they can use a prediction routine that builds the full result internally in chunks.

Since we can't hold all of the broadcast products in memory at once, those seem like the two use cases to cover. Am I missing anything?
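For concreteness, a minimal sketch of what an internally chunked prediction routine for use case 2 could look like. The `nufft` callable, its signature, and `chunk_size` are all illustrative assumptions, not MPoL's actual API:

```python
import torch

def predict_chunked(nufft, cube, uu, vv, chunk_size=100_000):
    """Evaluate model visibilities chunk by chunk so that peak memory is
    set by chunk_size, not by the full list of loose visibilities."""
    chunks = []
    with torch.no_grad():  # pure prediction, no optimization loop
        for start in range(0, len(uu), chunk_size):
            sl = slice(start, start + chunk_size)
            chunks.append(nufft(cube, uu[sl], vv[sl]))
    return torch.cat(chunks, dim=-1)
```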
Yes, I agree with all that. I assumed something like SGD would batch 'internally' as implemented in MPoL (i.e., it's done for the user when they use SGD). For use case 2, I'd just add "visualizing model (for comparison to the loose data) and residual visibilities directly, like with 1D plots". No other use cases come to mind.
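As an illustration of use case 1, a sketch of what "batching internally" in an optimization loop could mean. The `model` callable and the loss form are placeholders rather than MPoL's actual training API:

```python
import torch

def sgd_epoch(model, optimizer, uu, vv, data, weight, batch_size=50_000):
    """One epoch of SGD over random mini-batches of the loose visibilities,
    so only batch_size visibilities are broadcast at any one time."""
    perm = torch.randperm(len(uu))
    for start in range(0, len(uu), batch_size):
        idx = perm[start:start + batch_size]
        optimizer.zero_grad()
        model_vis = model(uu[idx], vv[idx])
        # weighted chi-squared (negative log-likelihood) on this batch only
        loss = torch.sum(weight[idx] * torch.abs(data[idx] - model_vis) ** 2)
        loss.backward()
        optimizer.step()
```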
Can you clarify what you mean by visualizing the model visibilities?
I just meant the use case for getting the loose model vis, in addition to the loose residual vis (unless I'm missing another way to get them).
Ah, right. For the NuFFT, I think this would just return the model vis, and the user would calculate residuals as data - model (the NuFFT never sees the data itself). But residuals could be calculated by whatever might be using the NuFFT as an intermediate layer.
Yes, that's what I had in mind.
@jeffjennings I'm curious about your initial bug report. How are you running either `DataAverager` or `DirtyImager` when you hit the crash?
I ran through the pipeline script; this is all on the CPU. I get a peak memory usage of about 12 GB. The raw .asdf file itself is 1.1 GB. I think the main thing at play with the memory issues is keeping references to various components and copies of the dataset. As you can see in the script, we load the full dataset and then keep the loaded arrays and several derived products alive alongside it.

The 12 GB is a moderate-to-large ask for memory, and this is a fairly large ALMA dataset in terms of number of (non-channel-averaged) visibilities. I would say most "data science" users probably have this much available on their machines. But it probably would exclude some people (e.g., the current MacBook Air base memory is 8 GB), so it's worth thinking about what could be done here. CASA cites a minimum memory requirement of 8 GB per core, with 16 GB or 32 GB preferable.

Note: getting rid of the references from loading the data puts peak memory usage at 10.41 GB (updated script). But simply storing the data within the gridder object still keeps a full copy alive. We could also experiment with #104, storing uu and vv in meters and calculating the channelized lambda values on the fly from the channel frequencies.
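To illustrate the #104 idea, a sketch of computing channelized spatial frequencies on the fly from baselines stored in meters; the function, array names, and kilolambda convention are assumptions for illustration, not MPoL's actual layout:

```python
import numpy as np

c = 2.99792458e8  # speed of light [m/s]

def uv_to_klambda(uu_m, vv_m, chan_freqs):
    """Yield (uu, vv) in kilolambda per channel, computed on the fly from
    baselines in meters, so the full (nchan, nvis) arrays never exist at once."""
    for freq in chan_freqs:
        # u [lambda] = u [m] * freq / c; the 1e-3 converts lambda -> kilolambda
        yield uu_m * freq / c * 1e-3, vv_m * freq / c * 1e-3
```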
Good point, I had the same thought.
Can you provide more info about the error messages and crashes, possibly with an MRE script?
Update: before, when the code crashed on the GPU, it was in the call to `GridderBase.estimate_cell_standard_deviation`.
Ok, I'm still not sure how that call produces the crash on its own. The GPU only has 8 GB of VRAM, though, so it may just be the broadcast intermediates exceeding that.
Yes, I don't mean to say that the crash itself is surprising given the memory available; the point is just that the routine's footprint is large enough to hit that limit.
OK, and this is entirely on the CPU? Are you on a local branch that's already implemented the lambda vs. klambda refactor, too?
Yes, entirely on CPU (all inputs to the routines are NumPy arrays on the CPU).
I can also run memray if that would help.
Ok, I will give it a quick try with memray now. But it's worth checking out yourself, too; it's fun. The reason I asked about the lambda refactor is that I don't have `1e3` in my code, so the only reason that factor should appear is if you've modified the broadcast routines.
Ah, sorry, yes, in my pipeline script I have everything in lambda (I convert u and v to lambda within my own script before passing them in).
Thanks. I ran the script successfully and also ran memray on my copy of the dataset (same length of uu, so I'm pretty sure it's the same). I got a peak memory usage of 8 GB, but at that point nothing's really happened inside either of the routines yet.

What is the workflow in which you need to keep all of these objects in scope at once?
Thank you for checking. I get 8.7 GB; I'm now digging into the memray output. Exiting scope is a good idea for a workaround, but I'd prefer to keep the objects available for later use in the script.
Fair enough, but I'm not sure if I'd class this as a workaround. When the dataset itself is very large (1.1 GB) and you have a cap of 8 GB RAM, there isn't a lot of room for keeping references to this data and derivative products from it. One could quickly run into the same sort of problem with pure numpy arrays if one took a large array and broadcast it to the wrong dimension, or kept references to copies of the data hanging around. I don't think that means numpy needs to anticipate memory management needs beyond its best effort; the Python runtime will catch the errors otherwise. I tend to think of the gridder objects the same way.

With respect to DirtyImager, I think the easiest thing we could do now is to implement the streaming algorithm. As currently written, the memory footprint scales with the full set of loose visibilities held in memory at once; building up the image over chunks would bound it instead.

This raises the question of what the default behavior should be.
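A sketch of the streaming idea for a single channel: accumulate the weighted visibilities onto the grid in chunks, then invert once at the end. Function and variable names are illustrative, not MPoL's internals:

```python
import numpy as np

def grid_in_chunks(uu, vv, data, weight, u_edges, v_edges, chunk_size=1_000_000):
    """Accumulate weighted loose visibilities onto a single-channel grid in
    chunks, so peak memory scales with chunk_size rather than nvis."""
    shape = (len(v_edges) - 1, len(u_edges) - 1)
    num = np.zeros(shape, dtype=np.complex128)  # sum of weight * data per cell
    den = np.zeros(shape)                       # sum of weight per cell
    for start in range(0, len(uu), chunk_size):
        sl = slice(start, start + chunk_size)
        re, _, _ = np.histogram2d(vv[sl], uu[sl], bins=(v_edges, u_edges),
                                  weights=weight[sl] * data[sl].real)
        im, _, _ = np.histogram2d(vv[sl], uu[sl], bins=(v_edges, u_edges),
                                  weights=weight[sl] * data[sl].imag)
        w, _, _ = np.histogram2d(vv[sl], uu[sl], bins=(v_edges, u_edges),
                                 weights=weight[sl])
        num += re + 1j * im
        den += w
    # the dirty image then follows from an inverse FFT of the weighted grid;
    # cells with den == 0 need care when normalizing
    return num, den
```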
That's a good point. Separately, I think being able to call a high-level routine that takes a user from data to an image is valuable.
Yes, I guess that's what we see differently, and it's a preference, I know. My perspective is that having a core library with very good tutorials (as MPoL does) is extremely useful. It's also a lot to approach for a user who just wants to get an image out of the code to compare to clean. Expecting every user to fully learn the model and code is ideal in principle, but I think kind of a lot to ask in practice, especially for a userbase that isn't necessarily strong in programming. And it can be a bit inconsiderate of a user's time, because most users would probably end up writing very similar scripts to do effectively the same thing with the code. With a built-in pipeline, they wouldn't have to. Of course, users who want more control can still work with the core API directly.

In that sense I'd disagree that the training and cross-validation routines are too specific - their purpose is simply to give the user an image! From the imaging software! I'm joking, but really that's all they're meant to do, and I think they're pretty general in doing it (but maybe I'm not considering something). If you want them to go in a separate package, I guess that's fine, although it would inevitably reduce their visibility, and I don't see the harm in having a pipeline internal to MPoL. The pipeline routines are all isolated from the 'core' code (they're in their own scripts), and installing a second package to run the first package in an automated way seems unnecessary. But it would be good to discuss more on Monday.
Sounds good!
Sure, yeah, as long as internally it can build up the image over chunks for a single channel.
Def agree with you about the basic use case of MPoL. But I expect that there is enough variation in what one might want to do in each type of analysis problem that, for many cases, it will be more direct for a new user to copy and adapt a script to do what they want than to work within the constraints of a general pipeline. @kadri-nizam had brought up PyTorch Lightning before as an interesting framework that might solve many of our boilerplate issues. It might be worth us taking a closer look.
I wholeheartedly agree about making the software as accessible as possible. But I would say that for any real-world analysis problem (i.e., one that is working with data to produce an image in support of peer-reviewed research), the user is looking for more than a black box. They should be approaching the problem with some training in interferometry and scientific computing. And I think the basics of PyTorch (i.e., as covered in our tutorial, its additional resources, and the Intro to RML optimization tutorial) are fair to expect as a prerequisite for using the package. It should only take a few hours to read through the tutorials, but I would argue that the knowledge gained pays massive dividends over the course of a research project, when one might be using the code again, and again, and again. I.e., "teach a person to fish." I think the same is true of exoplanet: as a prerequisite one needs to understand the basics of Bayesian analysis and MCMC, otherwise the software is pointless. That said, if we can get an MPoL application to run easily and reliably from the command line in addition to a rich API + tutorials, we should do it.
I agree there are some downsides to a separate package, too. Re: specificity, we haven't yet arrived at a satisfactory answer to the IM Lup dataset :-p. If and when we arrive at one and codify it in a pipeline, how well will this approach scale to other datasets? Perhaps it will be fine for the DSHARP disks. But what about the AK Sco dataset, with per-EB astrometric offsets and amplitudes that need to be modeled? Or a multi-channel CO dataset, where the regularizer strengths might need to be different per channel? Or a prior with the scattering transform (we might want to visualize the coefficients)? Or considering different ways the data might be split for validation (per-EB boundaries?)?

I think effectively treating these issues is key to producing images that are better than clean, which is the main reason users are interested in this package in the first place. There's enough variation in which diagnostics the user might want to see that each new analysis problem will start to demand its own custom pipeline, in which case there will be friction for the user actually using the standard pipeline. But obviously, this is just my conjecture for now, because we haven't finished analyzing the IM Lup dataset yet. But yeah, let's run through these points on Monday. As regards this issue, I'll try to implement the streaming algorithm.
Ok great, thanks for your thoughts, let's talk more Monday.
Describe the bug
Within `DataAverager` and `DirtyImager`, when `check_visibility_scatter` is True in `DataAverager.to_pytorch_dataset` or `DirtyImager.get_dirty_image`, `GridderBase._check_scatter_error` is called, and thus `GridderBase.estimate_cell_standard_deviation`. With large datasets, specifically the test dataset for IM Lup in .asdf, `estimate_cell_standard_deviation` causes a memory crash on a system with 8 GB of VRAM (I'm running things on the GPU). If running on a CPU (with 16 GB free RAM), the code doesn't crash (but is much slower).

Suggested fix
A quick look at this function suggests its operations could be batched, working on smaller chunks of the loose visibilities in sequence.
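A sketch of that suggestion, assuming each loose visibility has already been assigned a flat grid-cell index; the per-cell sums and sums of squares accumulate over chunks, and the names here are illustrative rather than the actual `GridderBase` internals:

```python
import numpy as np

def estimate_cell_std_chunked(cell_index, values, n_cells, chunk_size=1_000_000):
    """Per-cell standard deviation of loose visibility values, accumulated
    in chunks so the full broadcast arrays never need to exist at once."""
    s = np.zeros(n_cells)    # per-cell sum of values
    s2 = np.zeros(n_cells)   # per-cell sum of squared values
    n = np.zeros(n_cells)    # per-cell counts
    for start in range(0, len(values), chunk_size):
        sl = slice(start, start + chunk_size)
        np.add.at(s, cell_index[sl], values[sl])
        np.add.at(s2, cell_index[sl], values[sl] ** 2)
        np.add.at(n, cell_index[sl], 1)
    mean = np.divide(s, n, out=np.zeros_like(s), where=n > 0)
    var = np.divide(s2, n, out=np.zeros_like(s2), where=n > 0) - mean**2
    return np.sqrt(np.clip(var, 0.0, None))
```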