Memory issues in DataAverager and DirtyImager #224

Closed
jeffjennings opened this issue Dec 5, 2023 · 29 comments · Fixed by #230

@jeffjennings (Contributor) commented Dec 5, 2023

Describe the bug
Within DataAverager and DirtyImager, when check_visibility_scatter is True in DataAverager.to_pytorch_dataset or DirtyImager.get_dirty_image, GridderBase._check_scatter_error is called, which in turn calls GridderBase.estimate_cell_standard_deviation. With large datasets, specifically the IM Lup test dataset in .asdf format, estimate_cell_standard_deviation causes a memory crash on a system with 8 GB of VRAM (I'm running things on the GPU). When running on a CPU (with 16 GB of free RAM), the code doesn't crash, but it is much slower.

Suggested fix
A quick look at this function suggests its operations could be batched, working on smaller chunks of the loose visibilities in sequence.

Desktop (please complete the following information):

  • OS: WSL - Ubuntu
  • Python Version: 3.9.16
  • MPoL Version: 0.2.0 (+ current commits to main)
@jeffjennings (Contributor, Author) commented:

Relevant also for #136

@jeffjennings (Contributor, Author) commented:

The memory crash is also an issue for fourier.NuFFT when the dataset is sufficiently large.

@iancze (Collaborator) commented Dec 6, 2023

I think the solutions for GridderBase.estimate_cell_standard_deviation and fourier.NuFFT will be similar in principle (iterate over loose visibilities) but different in implementation.

GridderBase.estimate_cell_standard_deviation definitely warrants a fix; the routine is simply unusable for larger datasets. I think the solution is to find and implement a streaming standard deviation estimate (some links to algorithms here; I haven't found an implementation in a standard package yet, but I feel like it must exist in a dependency we already require).
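For illustration, a minimal sketch of such a streaming estimate, merging per-chunk statistics with the parallel variance update so that only one chunk of loose values is in memory at a time (the function name and the chunk iterable are hypothetical, not existing MPoL API):

import numpy as np

def streaming_std(chunks):
    """Combine per-chunk (count, mean, M2) statistics so the standard
    deviation is computed without holding all values in memory at once."""
    n, mean, m2 = 0, 0.0, 0.0
    for chunk in chunks:
        c = np.asarray(chunk, dtype=np.float64).ravel()
        nc = c.size
        if nc == 0:
            continue
        mean_c = c.mean()
        m2_c = np.sum((c - mean_c) ** 2)
        delta = mean_c - mean
        total = n + nc
        mean += delta * nc / total
        m2 += m2_c + delta**2 * n * nc / total
        n = total
    return np.sqrt(m2 / (n - 1)) if n > 1 else np.nan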

NuFFT, I think, requires a bit more thought about what makes sense for all use cases. Since it's a matter of calling a routine to predict visibilities at various (u,v) points, the user could just iterate through their list of (u,v) and accumulate their model visibilities (batching exterior to the routine). If we decide to put logic for memory-efficient batching inside the NuFFT.forward routine, then I worry this could make the NuFFT even more complicated than it already is and potentially cause problems for efficient autodiff. Rather than modifying NuFFT.forward, perhaps a solution here is to add another routine, NuFFT.forward_mem or something like that, which calls NuFFT.forward in batches purely for the purpose of getting all model visibilities, without the ability to run autodiff. But I think it's also fine to leave this logic up to the user script.
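As a rough sketch of the "batching exterior to the routine" option (prediction only, no autodiff): here predict_fn is a hypothetical callable wrapping the NuFFT forward pass for a given image cube and slice of (u,v) points; the call signature is assumed for illustration rather than taken from the current API.

import torch

def predict_in_batches(predict_fn, cube, uu, vv, batch_size=1_000_000):
    """Predict model visibilities slice-by-slice over the (u, v) points so
    that only one batch of outputs is materialized at a time."""
    chunks = []
    with torch.no_grad():  # prediction only; gradients intentionally disabled
        for i in range(0, len(uu), batch_size):
            chunks.append(predict_fn(cube, uu[i:i + batch_size], vv[i:i + batch_size]))
    # concatenate along the visibility axis (assumed to be the last dimension)
    return torch.cat(chunks, dim=-1)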

iancze self-assigned this on Dec 6, 2023
@jeffjennings (Contributor, Author) commented Dec 6, 2023

Cool, the streaming algorithm looks good.

For the NuFFT, I think we really should implement some solution internally. Otherwise the workaround for a user is potentially:

  • run code, process killed
  • realize why process was killed
  • find where it happened in their code (nufft may have been implicitly called by something like fourier.get_vis_residuals)
  • think of batching as a solution
  • implement batching, interfacing with the NuFFT which as you say isn't super straightforward

That's too much to leave to them and is the kind of thing that would push me away from a code. Something like NuFFT.forward_mem sounds good; the NuFFT could have a retrieve_vis argument that, if True, calls forward_mem. If it's instead False, then when the NuFFT is created it checks len(uu) and, if it's big enough, raises a warning that suggests using retrieve_vis.

@iancze (Collaborator) commented Dec 6, 2023

Ok, something like NuFFT.forward_mem sounds reasonable. I think we should sketch out use cases a bit more, though.

If the user is doing any kind of optimization loop, then I think they will want to use something like NuFFT.forward in its current form so that autodiff can propagate derivatives efficiently. If their dataset is too large to fit in one call, then they will need to batch externally. SGD takes care of this, taking mini-batches of the dataset for several iterations within one training epoch.
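As an example, an external mini-batching loop might look roughly like the following sketch (not MPoL API: model, nufft_forward, and the chi-squared-style loss are placeholders, and uu, vv, data, weight are assumed to be flattened 1D torch tensors):

import torch

def train_epoch(model, nufft_forward, uu, vv, data, weight, optimizer, batch_size=500_000):
    """One epoch of mini-batch SGD: each iteration materializes model
    visibilities only for one slice of the loose (u, v) points, while
    autodiff still flows through the NuFFT forward pass."""
    perm = torch.randperm(len(uu))
    for i in range(0, len(uu), batch_size):
        idx = perm[i:i + batch_size]
        optimizer.zero_grad()
        vis_model = nufft_forward(model(), uu[idx], vv[idx])  # assumed callable
        resid = data[idx] - vis_model
        loss = torch.sum(weight[idx] * torch.abs(resid) ** 2)  # chi^2-style data loss
        loss.backward()
        optimizer.step()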

If the user is trying to get all model visibilities at once, then they can use NuFFT.forward_mem. But they won't be able to do any kind of optimization or autodiff with these, since I think that would bring the memory issue right back into play. So the only use cases I can think of for these model visibilities are

  1. DirtyImager model and residuals
  2. visualizing residual visibilities directly, like with 1D plots

Since we can't use forward_mem for forward (in the sense of a back-propagating neural net layer), maybe it's better to call this method NuFFT.predict or something like that, and be clear in the docs that autodiff is disabled on this quantity.

Am I missing anything?

iancze changed the title from "Memory issues in DataAverager and DirtyImager" to "Memory issues in DataAverager and DirtyImager (and NuFFT)" on Dec 6, 2023
@jeffjennings (Contributor, Author) commented:

Yes I agree with all that. I assumed something like SGD would batch 'internally' as implemented in MPoL (i.e., it's done for the user when they use SGD).

For use case 2, I'd just add "visualizing model (for comparison to loose true) and residual visibilities directly, like with 1D plots". No other use cases come to mind.

NuFFT.predict is nice.

@iancze (Collaborator) commented Dec 6, 2023

Can you clarify what you mean by the following?

For use case 2, I'd just add "visualizing model (for comparison to loose true) and

@jeffjennings (Contributor, Author) commented:

I just meant the use case for getting the loose model vis, in addition to the loose residual vis. (unless I'm missing another way to get them)

@iancze (Collaborator) commented Dec 6, 2023

Ah, right. For the NuFFT, I think this would just return the model vis, and the user would calculate residuals as data - model (the NuFFT never sees data itself). But residuals could be calculated by whatever might be using NuFFT as an intermediate layer.

@jeffjennings (Contributor, Author) commented:

Yes, that's what fourier.get_vis_residuals does: it's an intermediate layer that retrieves the loose model visibilities using the NuFFT and differences them with the observed visibilities to get residuals. Using that routine was causing the memory overload for me. Anyway, yes, I just need NuFFT.forward_mem to get the loose model visibilities.

iancze changed the title from "Memory issues in DataAverager and DirtyImager (and NuFFT)" back to "Memory issues in DataAverager and DirtyImager" on Dec 6, 2023
@iancze (Collaborator) commented Dec 8, 2023

@jeffjennings I'm curious about your initial bug report. How are you running either DataAverager or DirtyImager on the GPU? Both of these routines take in numpy arrays, so I would think they would need to be doing all of their operations on the CPU.

@iancze (Collaborator) commented Dec 8, 2023

I ran through the IMLup.asdf file with this dirty imaging routine, basically initializing DirtyImager and then get_dirty_image using the defaults, which triggers the estimate of cell standard deviation as noted in this issue. I used the memray profiler to produce a flamegraph (html file sent via slack).

This is all on the CPU. I get a peak memory usage of about 12 GB. The raw .asdf file itself is 1.1 GB.

I think the main thing at play with the memory issues is keeping references to various components and copies of the dataset. As you can see in the script, we load uu, vv, data, etc., and then pass these to the DirtyImager object, which creates copies of them via self.uu etc. along with their Hermitian pairs, and initializes the cell_indices relative to the grid defined by GridCoords. So we already have 6 GB of memory allocated just by "loading" the data into the DirtyImager, before any gridding or averaging takes place.

The _estimate_cell_standard_deviation routine isn't what I'd call lightweight, but I think it's also doing the best it can given the task at hand. It takes 5.1 GB to process this dataset. Most of that is allocated to temporary products like the real and imaginary residuals. Though I mentioned streaming (online) methods of calculating standard deviation, I'm not actually sure that type of solution is applicable here. One possible solution is to go cell-by-cell, calculating these quantities. But I think that's a large enough re-design (major changes to nearly all GridderBase routines relying upon np.histogram) that it's probably cleaner to create an entirely separate GridderMemBase class. This is a big task (fraught with indexing pitfalls), so I'd like to get a better understanding of memory issues across the entire codebase, including the GPU, before prioritising this work (#136).
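For illustration, one way the chunking and cell-by-cell ideas could combine without a full redesign is to accumulate per-cell counts, sums, and sums of squares over chunks of the loose visibilities and only then form the per-cell scatter. A rough numpy sketch (the helper and its arguments are hypothetical, and the naive sum-of-squares form trades some numerical stability for simplicity):

import numpy as np

def cell_std_chunked(values, cell_index, n_cells, chunk_size=1_000_000):
    """Per-cell standard deviation of `values` (e.g. residual real or imaginary
    parts), where cell_index maps each loose visibility to a grid cell in
    [0, n_cells). Only one chunk of temporaries exists at a time."""
    count = np.zeros(n_cells)
    s1 = np.zeros(n_cells)
    s2 = np.zeros(n_cells)
    for i in range(0, len(values), chunk_size):
        v = values[i:i + chunk_size]
        idx = cell_index[i:i + chunk_size]
        count += np.bincount(idx, minlength=n_cells)
        s1 += np.bincount(idx, weights=v, minlength=n_cells)
        s2 += np.bincount(idx, weights=v * v, minlength=n_cells)
    mean = np.divide(s1, count, out=np.zeros(n_cells), where=count > 0)
    var = np.divide(s2, count, out=np.zeros(n_cells), where=count > 0) - mean**2
    return np.sqrt(np.clip(var, 0.0, None))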

12 GB is a moderate-to-large ask for memory, and this is a fairly large ALMA dataset in terms of the number of (non-channel-averaged) visibilities. I would say most "data science" users probably have this much available on their machines. But it probably would exclude some people (e.g., the current MacBook Air base memory is 8 GB), so it's worth thinking about what could be done here. CASA cites a minimum memory requirement of 8 GB per core, with 16 GB or 32 GB preferable.

Note: getting rid of the references from loading the data puts peak memory usage at 10.41 GB (updated script). But simply storing the data within DirtyImager still takes 4.4 GB because of the aforementioned Hermitian conjugates and cell indices.

We could also experiment with #104, storing uu and vv in meters and calculating the channelized lambda values on the fly via @property. We could also store the non-conjugated values and calculate the Hermitian values on the fly. This could be significant for datasets with a large number of channels.
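A toy sketch of the @property idea (a hypothetical class, not the actual DirtyImager implementation): only the non-conjugated arrays are stored, and the Hermitian-augmented versions are built on demand, trading memory for recomputation.

import numpy as np

class VisibilityStore:
    """Store only the loose (u, v) points and visibilities; expose the
    Hermitian-augmented arrays via properties instead of keeping copies."""

    def __init__(self, uu, vv, data):
        self._uu = uu
        self._vv = vv
        self._data = data  # complex visibilities

    @property
    def uu_full(self):
        # original points followed by their Hermitian pairs
        return np.concatenate([self._uu, -self._uu])

    @property
    def vv_full(self):
        return np.concatenate([self._vv, -self._vv])

    @property
    def data_full(self):
        return np.concatenate([self._data, np.conj(self._data)])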

iancze removed their assignment on Dec 8, 2023
@jeffjennings (Contributor, Author) commented:

@jeffjennings I'm curious about your initial bug report. How are you running either DataAverager or DirtyImager on the GPU? Both of these routines take in numpy arrays, so I would think they would need to be doing all of their operations on the CPU.

Good point, I had the DirtyImager on the CPU; now with that dataset I get a crash when first calling DirtyImager, so it's hard to trace. I have 16 GB of free memory and the call to DirtyImager still kills the process; I don't know exactly the criterion for killing it. In any case, 12 GB is a lot of free memory to ask of most users, I'd say. Even if you have 16 GB of RAM, background processes and a web browser are probably using about half of it.

@iancze (Collaborator) commented Dec 8, 2023

Can you provide more info about the error messages and crashes, possibly with an MRE script?

@jeffjennings (Contributor, Author) commented Dec 8, 2023

Update: before, when the code crashed on the GPU, it was in the call to training.train_to_dirty_image, where img, beam = imager.get_dirty_image sent the memory over the top on the GPU. I was passing the already large imager = DirtyImager (but on the CPU) into this routine, and in the GPU case I then did dirty_image = torch.tensor(img.copy()).to('cuda:0') because I was using the dirty_image in a loss function calculation - see here.

@iancze (Collaborator) commented Dec 8, 2023

Ok, I'm still not sure how DirtyImager is being run on a GPU, though. Internally, it's all numpy arrays.

The torch.tensor(img.copy()).to('cuda:0') will take the img output from DirtyImager, create a PyTorch tensor, and then put that to the GPU. But that should just be the size of a normal image.

@jeffjennings (Contributor, Author) commented Dec 8, 2023

Yes, I don't mean to say that DirtyImager is being run on a GPU. I'm not sure how memory was split between the CPU and GPU in my run, since now I have a crash right away on the instantiation of DirtyImager. To reproduce this current crash, it's just:

# imports assuming the MPoL v0.2 module layout
from mpol.coordinates import GridCoords
from mpol.gridding import DataAverager, DirtyImager

dfile = ... # path to .asdf
uu, vv, vis, weight = get_basic_data(dfile)
data_re, data_im = vis.real, vis.imag

cell_size, npix = 0.005103605411684434, 940
coords = GridCoords(cell_size=cell_size, npix=npix)

averager = DataAverager(
    coords=coords,
    uu=uu / 1e3,
    vv=vv / 1e3,
    weight=weight,
    data_re=data_re,
    data_im=data_im,
)

imager = DirtyImager(
    coords=coords,
    uu=uu / 1e3,
    vv=vv / 1e3,
    weight=weight,
    data_re=data_re,
    data_im=data_im,
)

@iancze (Collaborator) commented Dec 8, 2023

OK, and this is entirely on the CPU? Are you on a local branch that has already implemented the lambda vs. klambda change, too?

@jeffjennings (Contributor, Author) commented:

Yes, entirely on the CPU (all inputs to DirtyImager are on the CPU); for reference, uu.shape is (42421131,), and I'm on main. (I haven't come close to approaching the lambda refactor.)

@jeffjennings (Contributor, Author) commented:

I can also run memray if useful, though I don't expect it would be different from your result.

@iancze (Collaborator) commented Dec 8, 2023

Ok, I will give it a quick try with memray now. But it's worth checking out yourself, too; it's fun. The reason I asked about the lambda refactor is that I don't have the 1e3 in my code, so the only reason it should appear is if you've modified the broadcast routines.

@jeffjennings (Contributor, Author) commented:

Ah, sorry, yes in my pipeline script I have everything in lambda (I convert u and v to lambda within get_basic_data), and only convert to klambda in the calls to MPoL routines.

@jeffjennings (Contributor, Author) commented:

For reference,

import asdf
import numpy as np
import mpol.utils

def get_basic_data(dfile):
    """Load all of the data contained in the .asdf file and convert it to 1D arrays.
    Destroys channel information.

    Returns:
        uu, vv, data, weight as 1D arrays.
    """
    
    data_list = []
    weight_list = []
    uu_list = []
    vv_list = []

    with asdf.open(dfile) as af:
        # print(af.info())

        # obsids contained in measurement set
        obsids = af["obsids"].keys()
        for obsid in obsids:
            # ddids contained in each obsid
            ddids = af["obsids"][obsid]["ddids"].keys()
            for ddid in ddids:
                d = af["obsids"][obsid]["ddids"][ddid]
                # each ddid should have the following keys
                # "frequencies", "uu", "vv", "antenna1", "antenna2", 
                # "time", "data", "flag", "weight"

                # but we'll load only what we need for now
                # frequencies should be (nchan,)
                # uu and vv should each be (nvis,)
                # flag should be (nchan, nvis)
                # data should be (nchan, nvis)
                # weight should be (nvis,) assuming the same for all channels
                
                # convert uu, vv from meters to klambda
                # uu, vv = visread.process.broadcast_and_convert_baselines(d["uu"], d["vv"], d["frequencies"])
                uu, vv = mpol.utils.broadcast_and_convert_baselines(d["uu"], d["vv"], d["frequencies"])
                
                # broadcast weights
                # weight = visread.process.broadcast_weights(d["weight"], d["data"].shape)
                nchan = d['data'].shape[0]
                broadcast = np.ones((nchan, 1))
                weight = d['weight'][np.newaxis, :] * broadcast

                # flag relevant baselines, data, and weight
                flag = d["flag"]
                
                uu = uu[~flag] # keep the good ones
                vv = vv[~flag]

                data = d["data"][~flag]
                weight = weight[~flag]

                # destroy channel axis and concatenate
                uu_list.append(uu.flatten())
                vv_list.append(vv.flatten())
                data_list.append(data.flatten())
                weight_list.append(weight.flatten())

    # concatenate all files at the end
    uu = np.concatenate(uu_list)
    vv = np.concatenate(vv_list)
    data = np.concatenate(data_list)
    weight = np.concatenate(weight_list)

    # return in [lambda]
    uu *= 1e3
    vv *= 1e3
    
    return uu, vv, data, weight

@iancze (Collaborator) commented Dec 8, 2023

Thanks. I ran the script successfully and also ran memray on my copy of the dataset (same length of uu, so I'm pretty sure it's the same). I got a peak memory usage of 8 GB. But nothing has really happened inside either of the routines yet. For DirtyImager, the arrays were concatenated and added to self, so again it has that 4-5 GB just for existing.

DataAverager is a bit more efficient since it just stores the reference to the arrays (no concatenate needed).

What is the workflow in which you need DirtyImager and DataAverager simultaneously? Perhaps you could write an isolated routine that calls DirtyImager with what it needs and then exits scope, so that the garbage collector can free it up and you don't have that large block of memory continually allocated?
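For example, a sketch of that pattern (assuming the mpol.gridding import path; the helper name is made up): the imager only lives inside the function, so its internal copies become collectable once the images are returned.

from mpol.gridding import DirtyImager  # assumed import path

def make_dirty_image(coords, uu, vv, weight, data_re, data_im, **kwargs):
    imager = DirtyImager(coords=coords, uu=uu, vv=vv, weight=weight,
                         data_re=data_re, data_im=data_im)
    img, beam = imager.get_dirty_image(**kwargs)
    # imager goes out of scope when this function returns, so the garbage
    # collector can reclaim its internal copies and Hermitian pairs
    return img, beam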

@jeffjennings (Contributor, Author) commented Dec 8, 2023

Thank you for checking. I get 8.7 GB; I'm now digging into crossval.py and training.py to see where memory issues are arising there and will update here.

Exiting scope is a good idea for a workaround, but I'd prefer to keep DirtyImager and DataAverager in memory and see if there's a more user-friendly solution that comes along with implementing memory management in those routines. I don't want to expect users to manage memory carefully, and I think my use case would be common -- I'm holding DirtyImager and DataAverager in memory simultaneously because DirtyImager is used to create a crossval.CrossValidate class (it's passed internally to training.train_to_dirty_image), and DataAverager is then used for crossval.CrossValidate.run_crossval.

@iancze (Collaborator) commented Dec 8, 2023

Exiting scope is a good idea for a workaround, but I'd prefer to keep DirtyImager and DataAverager in memory and see if there's a more user-friendly solution that comes along with implementing memory management in those routines. I don't want to expect users to manage memory carefully, and I think my use case would be common

Fair enough, but I'm not sure I'd class this as a workaround. When the dataset itself is very large (1.1 GB) and you have a cap of 8 GB of RAM, there isn't a lot of room for keeping references to this data and derivative products of it. One could quickly run into the same sort of problem with pure numpy arrays if one took a large array and broadcast it to the wrong dimension, or kept references to copies of the data hanging around. I don't think that means numpy needs to anticipate memory management needs beyond its best effort; the Python runtime will catch the errors otherwise.

I tend to think of the MPoL-dev/mpol routines primarily as a core library (e.g., kind of like exoplanet) to enable RML inference, where a user will come along and use these as building blocks for their specific problem. I think it's smart design that exoplanet doesn't have a separate class for an astrometric fitter or an RV fitter or an RV + transit fitter, etc., but rather shows the user how they might construct these things from core primitives through well-documented tutorials. I think it's very hard to have an "RV + transit + ... fitter" that is flexible enough to work in new problems without requiring so many optional parameters for each case that just calling the routine becomes onerous (see CASA tclean calls). It's worth considering whether some routines in crossval and training are heading in this direction by trying to solve too general of a problem. Frequently, it becomes more direct to just write a new script for a new problem (as in the exoplanet tutorials). To that end (and in the spirit of reorganization spurred by visread), maybe it makes sense to move some of the routines in crossval and training into a value-added MPoL-dev/mpol_pipeline (or better-named) package? I guess it's too early to know what the best solution may be, since we still haven't fully solved the cross-validation problem in a satisfying manner. But I think there's significant developer and user efficiency to be gained by separating core library routines from broader analysis pipeline routines.


With respect to DirtyImager, I think the easiest thing we could do now is to implement the @property solution for conjugate visibilities discussed above. I think this would nearly halve the memory footprint of an instantiated DirtyImager, making a significant dent. I can attempt this sometime soon.

As currently written, the memory footprint of DirtyImager and its .get_dirty_image is about 5 times the size of the input dataset. With the @property revision, hopefully this would be reduced to < 3x. I think that's pretty reasonable for a standard routine.

This raises the question: what should DirtyImager do when the raw data is itself larger than the available RAM? This will be an issue for channel maps with many channels (e.g., a raw dataset > 10 GB). In that case I think we need to consider for which applications we really need on-demand access to dirty images for all channels simultaneously, because there's already the larger problem that doing an RML optimization loop on all of those channels simultaneously will immediately run into memory issues (as @briannazawadzki discovered). I think it's fine to have the user write a for loop over the batch of channels if they really need all of the dirty images simultaneously.

@jeffjennings (Contributor, Author) commented:

Exiting scope is a good idea for a workaround, but I'd prefer to keep DirtyImager and DataAverager in memory and see if there's a more user-friendly solution that comes along with implementing memory management in those routines. I don't want to expect users to manage memory carefully, and I think my use case would be common

Fair enough, but I'm not sure I'd class this as a workaround. When the dataset itself is very large (1.3 GB) and you have a cap of 8 GB of RAM, there isn't a lot of room for keeping references to this data and derivative products of it. One could quickly run into the same sort of problem with pure numpy arrays if one took a large array and broadcast it to the wrong dimension, or kept references to copies of the data hanging around. I don't think that means numpy needs to anticipate memory management needs beyond its best effort; the Python runtime will catch the errors otherwise.

That's a good point. Separately, I still think it would be useful for DirtyImager and DataAverager to handle big datasets internally through blocking/batching.

I tend to think of the MPoL-dev/mpol routines primarily as a core library (e.g., kind of like exoplanet) to enable RML inference, where a user will come along and use these as building blocks for their specific problem. I think it's smart design that exoplanet doesn't have a separate class for an astrometric fitter or an RV fitter or an RV + transit fitter, etc., but rather shows the user how they might construct these things from core primitives through well-documented tutorials. I think it's very hard to have an "RV + transit + ... fitter" that is flexible enough to work in new problems without requiring so many optional parameters for each case that just calling the routine becomes onerous (see CASA tclean calls). It's worth considering whether some routines in crossval and training are heading in this direction by trying to solve too general of a problem. Frequently, it becomes more direct to just write a new script for a new problem (as in the exoplanet tutorials). To that end (and in the spirit of reorganization spurred by visread), maybe it makes sense to move some of the routines in crossval and training into a value-added MPoL-dev/mpol_pipeline (or better-named) package? I guess it's too early to know what the best solution may be, since we still haven't fully solved the cross-validation problem in a satisfying manner. But I think there's significant developer and user efficiency to be gained by separating core library routines from broader analysis pipeline routines.

Yes, I guess that's what we see differently. And it's a preference, I know. My perspective is that having a core library with very good tutorials (as MPoL does) is extremely useful. It's also a lot to approach for a user who just wants to get an image out of the code to compare to clean. Expecting every user to fully learn the model and code is ideal in principle, but I think it's kind of a lot to ask in practice, especially for a userbase that isn't necessarily strong in programming. And it can be a bit inconsiderate of a user's time, because most users would probably end up writing very similar scripts to do effectively the same thing with the code. With frank I pushed for the ability to run the code from the terminal, in addition to tutorials that cover interfacing with it as a library, so that it would be more accessible (plus a lot easier to debug user issues). It's not that popular, but among those who use it, almost everyone I know of runs it through the terminal with parameter files. Those who want to modify it still can, but the most basic and common use case is covered by its fitting pipeline.

Of course frank is just doing one thing - fitting visibilities to produce I(r). MPoL is more general. But I think the most basic use case is still one thing - modeling the visibilities to get an image. That's what I've been trying to implement in a pipeline: train a model, get some metric of the model performance to compare to other priors/strengths, and plot some diagnostics/results, all to get an image out in the end. My view is that this is what the tutorials for producing a model image cover. So I think it's worth wrapping that up for users who aren't as advanced or don't have as much time to learn about MPoL in-depth. Experts can always use it as a library while less experienced users or coders like students can still learn a reasonable amount of the modeling approach from the tutorials, without having to fully understand all the theory and the Python implementation to generalize it themselves.

In that sense I'd disagree that the training and cross-validation routines are too specific - their purpose is simply to give the user an image! From the imaging software! I'm joking, but really that's all they're meant to do, and I think they're pretty general in doing it (but maybe I'm not considering something). If you want them to go in a separate package, I guess that's fine, although it would inevitably reduce their visibility, and I don't see the harm in having a pipeline internal to MPoL. The pipeline routines are all isolated from the 'core' code (they're in their own scripts), and installing a second package to run the first package in an automated way seems unnecessary. But it would be good to discuss more on Monday.

With respect to DirtyImager, I think the easiest thing we could do now is to implement the @property solution for conjugate visibilities discussed above. I think this would nearly halve the memory footprint of an instantiated DirtyImager, making a significant dent. I can attempt this sometime soon.

Sounds good!

As currently written, the memory footprint of DirtyImager and its .get_dirty_image is about 5 times the size of the input dataset. With the @property revision, hopefully this would be reduced to < 3x. I think that's pretty reasonable for a standard routine.

This raises the question: what should DirtyImager do when the raw data is itself larger than the available RAM? This will be an issue for channel maps with many channels (e.g., a raw dataset > 10 GB). In that case I think we need to consider for which applications we really need on-demand access to dirty images for all channels simultaneously, because there's already the larger problem that doing an RML optimization loop on all of those channels simultaneously will immediately run into memory issues (as @briannazawadzki discovered). I think it's fine to have the user write a for loop over the batch of channels if they really need all of the dirty images simultaneously.

Sure, yeah, as long as it can internally build up the image over chunks for a single channel.

@iancze (Collaborator) commented Dec 8, 2023

Of course frank is just doing one thing - fitting visibilities to produce I(r). MPoL is more general. But I think the most basic use case is still one thing - modeling the visibilities to get an image. That's what I've been trying to implement in a pipeline: train a model, get some metric of the model performance to compare to other priors/strengths, and plot some diagnostics/results, all to get an image out in the end. My view is that this is what the tutorials for producing a model image cover.

Def agree with you about the basic use case of MPoL. But I expect that there is enough variation in what one might want to do in each type of analysis problem that, for many cases, it will be more direct for a new user to copy and adapt a script to do what they want than to work within the constraints of a general pipeline. @kadri-nizam had brought up PyTorch Lightning before as an interesting framework that might solve many of our boilerplate issues. It might be worth us taking a closer look.

So I think it's worth wrapping that up for users who aren't as advanced or don't have as much time to learn about MPoL in-depth. Experts can always use it as a library while less experienced users or coders like students can still learn a reasonable amount of the modeling approach from the tutorials, without having to fully understand all the theory and the Python implementation to generalize it themselves.

I whole-heartedly agree about making the software as accessible as possible. But I would say that for any real-world analysis problem (i.e., one that is working with data to produce an image in support of peer-reviewed research), the user is looking for more than a black box. They should be approaching the problem with some training in interferometry and scientific computing. And I think it is fair to treat the basics of PyTorch (i.e., as covered in our tutorial, its additional resources, and the Intro to RML optimization tutorial) as a prerequisite for using the package. It should only take a few hours to read through the tutorials, but I would argue that the knowledge gained pays massive dividends over the course of a research project, when one might be using the code again, and again, and again. I.e., "teach a person to fish." I think the same is true of exoplanet: as a prerequisite one needs to understand the basics of Bayesian analysis and MCMC, otherwise the software is pointless.

That said, if we can get an MPoL application to be easily and reliably run from the command line, in addition to a rich API + tutorials, we should do it. frank clearly hits a sweet spot here by offering the best of both worlds. My sense so far is that the MPoL RML loop (with its many priors and output products) is a squishier problem and will defy a concise treatment.

In that sense I'd disagree that the training and cross-validation routines are too specific - their purpose is simply to give the user an image! From the imaging software! I'm joking, but really that's all they're meant to do, and I think they're pretty general in doing it (but maybe I'm not considering something). If you want them to go in a separate package, I guess that's fine, although it would inevitably reduce their visibility, and I don't see the harm in having a pipeline internal to MPoL. The pipeline routines are all isolated from the 'core' code (they're in their own scripts), and installing a second package to run the first package in an automated way seems unnecessary. But it would be good to discuss more on Monday.

I agree there are some downsides to a separate package, too.

Re: specificity, we haven't yet arrived at a satisfactory answer for the IM Lup dataset :-p . If and when we arrive at one and codify it in a pipeline, how well will this approach scale to other datasets? Perhaps it will be fine for the DSHARP disks. But what about the AK Sco dataset, with per-EB astrometric offsets and amplitudes that need to be modeled? Or a multi-channel CO dataset, where the regularizer strengths might need to be different per channel? Or a prior with the scattering transform (we might want to visualize the coefficients)? Or considering different ways the data might be split for validation (considering per-EB boundaries?)? I think effectively treating these issues is key to producing images that are better than clean, which is the main reason users are interested in this package in the first place. I think there's enough variation in which diagnostics the user might want to see that each new analysis problem will start to demand its own custom pipeline, in which case there will be friction for the user actually using the standard pipeline. But obviously, this is just my conjecture for now, because we haven't finished analyzing the IM Lup dataset yet. But yeah, let's run through these points on Monday.

As regards this issue, I'll try to implement the @property change in DirtyImager, and then perhaps we can consider the narrowest scope of this issue closed? We can re-raise the other points in other, more focused issues (e.g., ones tailored to multi-channel datasets or the train loop).

@jeffjennings (Contributor, Author) commented:

Ok great, thanks for your thoughts, let's talk more Monday.
