Add a parallel `%` (foreach-object) #3008

Closed
be5invis opened this issue Jan 16, 2017 · 17 comments

Comments

@be5invis

commented Jan 16, 2017

No description provided.

@RamblingCookieMonster

commented Jan 17, 2017

Thoughts from someone who finds parallelism handy:

I'd agree this is valuable given that...

  • Tools like PoshRSJob are quite popular (based on download count, article hits, meetup popularity, and the wealth of variations out there)
  • Sysadmins traditionally like re-inventing wheels rather than using existing libraries (not written by Microsoft) - i.e. if there were an official Microsoft implementation, it would likely get more use

On the other hand:

  • Given that tools like PoshRSJob exist, and are published to the gallery, personally I don't see this as a big priority, considering the various other issues out there.
  • If you add a new ParameterSet to Foreach-Object, rather than adding a new Cmdlet, you may run into confusion depending on how you implement this (i.e. which variables / modules / etc. are available).

Cheers!

@Jaykul

commented Jan 19, 2017

Does PoshRSJob work on linux?
Can we pick a winner and ship it "in the box" now that "the box" isn't Windows™️?

@dragonwolf83

commented Jan 19, 2017

We really need a port of Windows Workflow Foundation to .NET Core, since that is how PowerShell Workflows work, or a rewritten version of those concepts if the engine is too complex. Adding Parallel is not enough; you need the other activities, like Sequence, too. I haven't heard of any plans around that, though, so I hope someone from the PowerShell team can find out, since they added it as a core feature in Desktop Edition.

@RamblingCookieMonster

commented Jan 26, 2017

@Jaykul - Yes, PoshRSJob works across platforms. I don't think there are really any competitors, unless you're talking about one-off commands for ad hoc parallelization (invoke-parallel, foreach-parallel, foreach -parallel, etc.) - Personally I think it would be a great fit : )

@dragonwolf83 - That's a bit of scope creep : ) Guessing there's an issue covering Workflows (which, personally, are one of the few areas in PowerShell I actively try to avoid, given the various "oh, you're doing X? You can't do that in a workflow" pain points that don't always give you that hint...)

Cheers!

@powercode

Collaborator

commented Jan 26, 2017

This may actually require some more thinking. Since it involves running in other runspaces, we may want to use "$using:var" or something like using:function in some way to indicate what to import into the other runspaces.
There are also issues to solve regarding how to handle the output streams, and it may involve choosing between good progress reporting and fast initial processing.

My vote is clearly to get this in the box. I don't think we will need any workflow features for it, but I think the semantics should be the same as for the Invoke-Command scriptblocks.

We should also consider debugging of the parallel scriptblocks.
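
For readers unfamiliar with the `$using:` scope modifier mentioned above, here is a minimal sketch of how it already behaves today with `Start-Job`. This is not the proposed parallel `ForEach-Object`; it only illustrates the existing semantics for copying a caller's variable into a scriptblock that runs elsewhere. Note that `Start-Job` uses a child process rather than an in-process runspace, and the variable name `$prefix` is purely illustrative.

```powershell
# A local variable in the calling session.
$prefix = 'item'

# The scriptblock runs in a separate session, so the caller's variables
# are not visible there. $using: copies the value of $prefix across.
$job = Start-Job -ScriptBlock {
    $p = $using:prefix
    1..3 | ForEach-Object { "$p-$_" }
}

# Wait for the job, collect its output, and clean the job up.
$result = $job | Receive-Job -Wait -AutoRemoveJob
$result
```

Without the `$using:` modifier, `$prefix` would be empty inside the job, which is the kind of surprise an explicit import syntax is meant to avoid.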

@kittholland

Contributor

commented Jan 26, 2017

+1 to powercode. I would like to echo the ordering issues with returning different output streams (among a few other issues, it returns each channel serially rather than interleaved in chronological order). That, along with the lack of debugging, makes it difficult to troubleshoot.

@dragonwolf83

commented Jan 26, 2017

@RamblingCookieMonster I think the issue with Workflows is that it is not really PowerShell, which made it harder to use than it needed to be. The theory was good, though: being able to code a complex workflow and still see it in a designer for those less code-inclined.

I think the best course of action is to get this natively, like @powercode said. A native non-Workflow implementation of ForEach-Object -parallel, foreach -parallel, and Parallel scriptblocks would be great to have inbox to support so many scenarios. Then we can have a separate ticket for a native Sequence scriptblock, which I think is very important to get parallel right.

@iSazonov

Collaborator

commented Jan 27, 2017

+1 to powercode
Perhaps we should consider this issue more widely because any cmdlet with (for example) -ComputerName option is potentially a candidate for -parallel.

@alx9r

commented Aug 7, 2018

I've been experimenting with parallelizing powershell unit tests with some success. That use case is mostly CPU-bound and involves large numbers of invocations and importing large script modules. From what I can tell there are a number of non-trivial challenges to overcome to make Invoke-Parallel or parallel ForEach-Object intuitive, robust, and performant for general use.

Below is a summary of the main challenges I have noted during my experimentation.

(BTW, thanks @RamblingCookieMonster and @proxb for your blogs and repos on parallelization. Your work saved me a lot of time.)

Module Importing and Contention

Each concurrent powershell needs its own runspace. Each runspace must import its own modules. This means all the modules used by each runspace must be imported for each runspace. For script modules, importing modules for each scriptblock invocation can easily result in worse performance than a corresponding single-runspace implementation. This is simply because importing script modules takes a significant amount of processor time.

There seems to be contention that prevents runspaces from being opened using multiple cores in parallel (see also #7153 and #7035). I have not yet found a way to, in a single process, open multiple runspaces that import the same script module in a manner that performs better than single-threaded. I suspect this is possible with some changes to PowerShell, though, because importing the same script module in parallel in multiple instances of pwsh.exe seems to parallelize nicely.

Currently there are a couple of open issues related to reliably opening runspaces with imported modules (see #7377 and #7034). I have found some tentative workarounds, but they aren't exactly supported techniques and it's hard to predict how robust such a workaround is in the first place.

Note also that ResetRunspaceState() is limited to variables, so runspaces currently can't simply be re-used in a manner that is guaranteed to be side-effect free.

This all seems to lead to the need for some sort of runspace provider that is more sophisticated than the current RunspacePool. It seems to me like such a runspace provider should share at least compiled script modules between runspaces -- the current RunspacePool doesn't seem to perform in a manner consistent with such sharing. Parallelizing the CPU-bound portions of module importing where possible would also be an improvement.
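
As an illustration of the per-runspace import cost described above, here is a minimal sketch of the current approach: a `RunspacePool` built from an `InitialSessionState` that names a module to import. The module file `DemoParallel.psm1` and its function `Get-Square` are hypothetical and created only for this example. Every runspace the pool opens imports the module itself, which is exactly the repeated work in question; the sketch does nothing to work around the contention.

```powershell
# Hypothetical throwaway script module, written to a temp folder for illustration.
$modulePath = Join-Path ([System.IO.Path]::GetTempPath()) 'DemoParallel.psm1'
Set-Content -Path $modulePath -Value 'function Get-Square($n) { $n * $n }'

# Every runspace created from this session state imports the module on open.
$iss = [System.Management.Automation.Runspaces.InitialSessionState]::CreateDefault()
$iss.ImportPSModule($modulePath)

# Pool of 1 to 4 runspaces sharing that initial session state.
$pool = [runspacefactory]::CreateRunspacePool(1, 4, $iss, $Host)
$pool.Open()

# Queue one invocation per input; the pool decides which runspace runs it.
$tasks = foreach ($n in 1..4) {
    $ps = [powershell]::Create()
    $ps.RunspacePool = $pool
    [void]$ps.AddScript('param($n) Get-Square $n').AddArgument($n)
    [pscustomobject]@{ PowerShell = $ps; Handle = $ps.BeginInvoke() }
}

# Collect results in input order, then clean up.
$squares = foreach ($t in $tasks) {
    $t.PowerShell.EndInvoke($t.Handle)
    $t.PowerShell.Dispose()
}
$pool.Dispose()
Remove-Item $modulePath
$squares
```

With a trivial module the import cost is negligible; with a large script module, the same pattern pays that cost once per runspace.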

Output Behavior

Invoking runspaces in parallel means that you can have any combination of progressing, succeeding, and faulting powershells as a result of a single invocation of Invoke-Parallel or ForEach-Object. What, exactly, Invoke-Parallel and ForEach-Object should do with those results is a messy business. Should it throw AggregateException? If so, immediately? Or after all the runspaces are complete? Should the PSDataStreams of all the runspaces be output by Invoke-Parallel? If so, should the outputs be interleaved? Should outputs from the PSDataStreams of one runspace be kept together? From my experiments with some of these behaviors there are pros and cons to the different possible answers to these questions. For each of these questions there doesn't seem to be one good answer that works best for all uses.
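
To make the question concrete, here is a minimal sketch of one possible policy among those listed: wait for every invocation to finish, keep each runspace's output and errors grouped per input, and throw nothing. It is only an assumption for illustration, not a recommendation for how `Invoke-Parallel` or `ForEach-Object` should behave. The script text and the choice to fail on input 2 are made up for the example.

```powershell
$pool = [runspacefactory]::CreateRunspacePool(1, 3)
$pool.Open()

# Input 2 writes a non-terminating error; the other inputs succeed.
$script = 'param($n) if ($n -eq 2) { Write-Error "input $n failed" } else { "input $n ok" }'

$tasks = foreach ($n in 1..3) {
    $ps = [powershell]::Create()
    $ps.RunspacePool = $pool
    [void]$ps.AddScript($script).AddArgument($n)
    [pscustomobject]@{ PowerShell = $ps; Handle = $ps.BeginInvoke() }
}

# Policy: wait for everything, and keep each invocation's streams together.
$results = foreach ($t in $tasks) {
    [pscustomobject]@{
        Output = @($t.PowerShell.EndInvoke($t.Handle))
        Errors = @($t.PowerShell.Streams.Error)
    }
    $t.PowerShell.Dispose()
}
$pool.Dispose()

$results | ForEach-Object { "output: $($_.Output.Count), errors: $($_.Errors.Count)" }
```

Interleaving the streams chronologically, or surfacing the first error immediately, would require reading from the `PSDataStreams` while the invocations are still running, which is where the trade-offs described above come in.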

Functions Mentioned in Scriptblock

As @powercode pointed out, the functions mentioned in the scriptblock of parallel ForEach-Object might not be available because the scriptblock is being invoked in a different runspace. This is further complicated in the case where a large script module is auto-loaded when one of its exported functions is mentioned. This will usually cause all the runspaces to try to load that same module at the same time, and because of the contention mentioned above, that takes approximately the time of a single import multiplied by the number of runspaces. So when doing a large number of things in parallel you can easily be waiting minutes just to load modules before any real work begins.

@iSazonov

Collaborator

commented Aug 8, 2018

@alx9r Many thanks for the excellent work!

Currently we have experimental feature support implemented by @daxian-dbw. You could implement Invoke-Parallel as an experimental feature. This will simplify for all PowerShell fans the research on the problems that you listed.

@powercode

Collaborator

commented Aug 8, 2018

I think we should start with looking at what features Runspaces would need to make this possible to implement.
Maybe a Snapshot and ResetToSnapshot and CloneSnapshot.

An implementation would then start by making a snapshot of the current Runspace, CloneSnapshot for each processor, and reset for each new input.

Still many hard questions left...

@alx9r

commented Aug 9, 2018

@iSazonov

Currently we have experimental feature support implemented by @daxian-dbw.

Do you mean that @daxian-dbw has already implemented something like this? If so, could you point me to that work?

You could implement Invoke-Parallel as an experimental feature. This will simplify for all PowerShell fans the research on the problems that you listed.

I'm not sure how applicable the code I have for parallelizing unit testing is to the implementation of something general-purpose like Invoke-Parallel or parallel ForEach-Object. I'll take another look with that in mind though.

@alx9r

commented Aug 9, 2018

I think we should start with looking at what features Runspaces would need to make this possible to implement.
Maybe a Snapshot and ResetToSnapshot and CloneSnapshot.

@powercode I think you're right. Let me think about the specifics of this for a bit -- I think I have enough notes from my experiments to come up with a minimal set of improvements that would work well.

@alx9r

commented Aug 14, 2018

@powercode I have created #7524 to discuss the runspace features to support performant concurrency.

@BrucePay BrucePay self-assigned this Sep 6, 2018

@alx9r

commented Sep 21, 2018

#7626 would also need to be solved to achieve a robust implementation of Invoke-Parallel or parallel ForEach-Object.

@PaulHigin

Contributor

commented Jun 18, 2019

@PaulHigin PaulHigin closed this Jun 18, 2019
