
OutVariable Transparency #120


Closed
wants to merge 2 commits into from

Conversation

rjmholt
Contributor

@rjmholt rjmholt commented Mar 16, 2018

After some feedback on my pull request, I thought I should open an RFC on breaking PowerShell's OutVariable implementation.

Also see the original issue where this behaviour was discussed and reviewed by the PowerShell Committee.

## Specification

This table summarises the changes proposed in this RFC:
| Input | Non-OutVar Result Type | Old Result Type | New Result Type |


Table formatting needs to be redone to make it visible.

Contributor Author

Hmmmm, VS Code's preview seems to like it. I'll see what I can do


## Motivation

As a PowerShell user, I can use OutVariable just like pipeline output, so that my language experience is simpler and more consistent.
@markekraus markekraus Mar 16, 2018

I disagree with this motivation.
I specifically use this feature when I do NOT want the default pipeline behavior.

A multitude of cmdlets and functions return objects in an inconsistent way. Because -OutVariable always returns an ArrayList, I can ensure the result of the cmdlet/function has a consistent format.

Take the following example function as a stand-in for Get-ADUser and quite a few of the Exchange and Exchange Online cmdlets:

function Get-Result {
    # Randomly emits one of several inconsistent result shapes (or errors/throws).
    [CmdletBinding()]
    param ()
    end {
        :outer switch (1..9 | Get-Random) {
            1 { 'String'; break }
            2 { 1; break }
            3 { $null; break }
            4 { 1, 2, 3, 4; break }
            5 { ,@(1, 2, 3, 4) }
            6 { throw; break }
            7 { Write-Error 'SomeError'; break }
            8 { break outer }
            9 { break }
        }
    }
}

Now, run the following and see all the various issues you encounter:

1..20 | %{
    $_
    $Result = 'Stale'
    try {
        $Result = Get-Result
    }
    Catch {
        'Caught'
    }
    finally {
        'Stale? {0}' -f ('Stale' -eq $Result)
        $Result.gettype().name
        $Result.Count
        $Result[0]
    }
}

A string result means indexing with [0] gives 'S' (its first character) rather than the string, and a throw leaves $Result stale.

Compare that with

1..20 | %{
    $_
    $Result = 'Stale'
    try {
        $null = Get-Result -OutVariable Result
    }
    Catch {
        'Caught'
    }
    finally {
        'Stale? {0}' -f ('Stale' -eq $Result)
        $Result.gettype().name
        $Result.Count
        $Result[0]
    }
}

The ArrayList is always created fresh, so it is never stale even when the cmdlet errors or throws. It is always an ArrayList, so I don't get weird indexing issues. I can always rely on the methods and properties of an ArrayList to be there.

If this is taken away and changed, this will break what I feel is the most stable and reliable method for retrieving output from commands. I will be forced to write ridiculous logic around Microsoft's cmdlets in many cases to try and deal with the insanity of the results they provide.

@markekraus

Another thing this RFC needs to consider is how the change will reconcile with the usage of -OutVariable '+Result'. This is not an unheard-of pattern:

Get-Widget -Option1 -OutVariable '+Results' 
Get-Widget -Option2 -OutVariable '+Results' 
Get-Widget -Option3 -OutVariable '+Results' 
foreach ($Result in $Results) {
    <# do stuff #>
}

It also supports "bring your own collections":

$Results = [System.Collections.ArrayList]::new()
$Results.Add('before')
Write-Output 'a' -OutVariable '+Results'
$Results.Add('After')

How would this change handle a situation where it is now returning a string instead of an ArrayList?

$null = Write-Output 'a' -OutVariable 'Results'
$null = Write-Output 'b' -OutVariable '+Results'

@bergmeister

Although at first sight it seems better to make the behaviour more consistent, I think we might be steering in the wrong direction. One of the design mistakes in PowerShell (to me) was to use fixed-size array types like System.Array by default, which leads to bad performance. Using ArrayList would be much better, and if you also fixed the += operator to call the Add method on it, then the average user would not run into issues that easily.

@mklement0
Contributor

mklement0 commented Mar 17, 2018

@markekraus: The +Results concern is a non-issue, because this feature already converts a preexisting scalar value to a collection on demand:

> $Results = 'hi'  <# scalar #>; $null = Write-Output 'there' -ov +Results; $Results
hi
there

It also supports "bring your own collections":

It generally supports "bring any preexisting value, irrespective of how it was obtained".
Whatever type of input collection you provide as the preexisting value, the output is always a [System.Collections.ArrayList] instance; it is true, however, that a self-created [System.Collections.ArrayList] instance is efficiently extended in place.

@mklement0
Contributor

@bergmeister:

Generally, I don't think that efficiency concerns regarding the +Results behavior should be the deciding factor here.
The more fundamental issue is making PowerShell's default array-like data structure and the += operator more efficient - see PowerShell/PowerShell#5643 (comment) and PowerShell/PowerShell#5805. (I know you know about these, but I'm linking them for the benefit of others.)

Specifically, the reallocation of the array to store in the output variable would only need to happen once per pipeline, so it's less problematic than building up a large collection item by item.
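
(For context, a minimal timing sketch of the distinction being drawn - collecting results once when a statement completes versus growing an [object[]] item by item with +=; illustrative only, absolute numbers vary by machine:)

# Collected once, when the statement completes:
Measure-Command { $a = foreach ($i in 1..50000) { $i } }

# Reallocated on every iteration (the cost model to avoid):
Measure-Command { $b = @(); foreach ($i in 1..50000) { $b += $i } }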

@markekraus

The +Results concern is a non-issue, because this feature already converts a preexisting scalar value to a collection on demand:

Except it is not. In the PR that was submitted, that functionality was broken when a string was returned.

That's why I want it firmly spelled out in the RFC how that will be handled in all the same situations where a change is being made.

The output is always a [System.Collections.ArrayList]

Yes, as I have repeatedly said in this thread, the output is always an ArrayList. It is kind enough to take the elements of your existing collection and add them to an ArrayList. I'm not asking to change that, just pointing out that this would be broken with what this RFC suggests:

$Results = 1,2
Write-Output 'a' -OutVariable '+Results'

Because instead of being a collection, it would now be a single string per the spec.

It's also confusing... Does +Results always return an ArrayList? Does it overwrite an existing collection with a single string? Does it return a single string if the variable doesn't exist but an ArrayList if it does? How do you distinguish the intent of the user in this situation:

# Results does not yet exist in the session
Write-Output 'a' -OutVariable 'Results'
Write-Output 'b' -OutVariable '+Results'

vs:

$Results = 'a'
Write-Output 'b' -OutVariable '+Results'

Does it take the existing value and add it to an ArrayList? What about in the case of mismatching types:

$Results = 1
Write-Output 'b' -OutVariable '+Results'

I still say this doesn't need changing, and that there is enough documentation of - and reliance on - the existing behavior that breaking it just to make it more like the pipeline isn't justified... especially when it is most commonly used precisely when the default pipeline behavior is not desirable. But if this is going to be broken, then I want all the pieces spelled out.

@mklement0
Contributor

mklement0 commented Mar 17, 2018

@markekraus:

Except it is not. In the PR that was submitted, that functionality was broken when a string was returned.
That's why I want it firmly spelled out in the RFC how that will be handled in all the same situations where a change is being made.

The answer is: In the case of a preexisting scalar value, it should become the first element of the output [object[]] array; with a preexisting collection, its N elements should become the first N elements of the [object[]] output array (in input order).
To put it in existing PowerShell terms: it should be $Result = @($Result) + ...
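
(A minimal sketch of those append semantics in existing PowerShell terms - this is the hypothetical behavior proposed here, not what -OutVariable currently does; $existing and $new are placeholder values:)

# Hypothetical +Result semantics under this RFC: append via array concatenation.
$existing = 'scalar'             # preexisting scalar value of $Result
$new      = 'a', 'b'             # what the new pipeline outputs
$Result   = @($existing) + $new  # -> [object[]]: 'scalar', 'a', 'b'

$existing = 1, 2                 # preexisting collection
$Result   = @($existing) + $new  # -> [object[]]: 1, 2, 'a', 'b'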

just pointing out that this would be broken with what this RFC suggestes:

Yes, in-place extending would be broken, which is unavoidable, if the output type is changed to [object[]] / a scalar.

Does +Results always return an array list?

No, +Results should always return [object[]], just like Results will (as already implied by the RFC).

Does it overwrite an existing collection with a single string?

It would "append" to the collection - by (in effect) copying the existing collection elements followed by the single string to the [object[]] output array, as described above.

how do you distinguish the intent of the user in this situation:

There is no need to distinguish these two scenarios; how the preexisting value was obtained should be irrelevant.

What about in the case of mismatching types

[object[]] arrays accommodate disparate types.

Thanks for helping to clarify these issues.

@rjmholt: Do you agree with the +Results behavior proposed here?

| Input | Non-OutVar Result Type | Old Result Type | New Result Type |
| :----------------------------------------------: | :--------------------------: | :--------------------------: | :--------------------------: |
| `'Hello'` | System.String | System.Collections.ArrayList | System.String |
| `@(1, 2)` | object[] | System.Collections.ArrayList | object[] |


Why would we use Object[]? We should be moving away from Object[] in favor of List<Object>. I don't see why taking a step back from ArrayList to Object[] would be beneficial to anyone.

@mklement0
Contributor

On a meta note:

  • If the consensus turns out to be that this is too much of a breaking change, then that's that - we'll document and move on.

  • Either way, however, it's worth getting clarity on the issue and if/how the current behavior is misaligned with the fundamentals of PowerShell - even if only to prevent similar problems in the future.

  • @markekraus, please stop using incendiary language such as "ridiculous", "insanity" and (in comments on other issues) "silly"; such language adds nothing useful to the discussion and serves only to antagonize.


So, unencumbered by the history of the implementation, let's take a look:

The sparse existing documentation tells us merely "Stores output objects from the command in the specified variable [...]" and
To add the output to the variable, instead of replacing any output that might already be stored there, type a plus sign (+) before the variable name".

Thus, in the absence of information that would suggest deviating from PowerShell's normal behavior, it's reasonable and consistent to infer the following behavior:

<cmdlet> ... -OutVariable Result

is short for (leaving aside the ability to use -OutVariable while still sending output through the pipeline to later segments):

$Result = <cmdlet> ...  # capture
$Result                 # ... and output

Similarly,

<cmdlet> ... -OutVariable +Result

is short for:

$tmpResult = <cmdlet> ...  # capture
if ($null -ne $Result) {
  # ... append to existing $Result value with array concatenation
  $Result = @($Result) + $tmpResult   
} else {
  $Result = $tmpResult
}
$Result                    # ... and output

In a nutshell: -OutVariable is a convenient shortcut for capturing output in a variable while still sending that output through the pipeline, and there is no good reason for the capturing-in-a-variable part to work any differently than a direct assignment ($Result = <cmdlet> ...); the +Result syntax is an added convenience for appending output objects to existing results (to be treated as a collection on demand).

Assuming that @rjmholt agrees with my description of the +Result behavior, this is what this RFC proposes.


Now let's look at the current implementation:

Returning (a) an instance of [System.Collections.ArrayList] - not used by any other native PowerShell feature - and (b) not unwrapping single-item collections suggests a "leaky abstraction" (or "leaky implementation", if you prefer) - an implementation detail accidentally peeking from behind the curtain, similar (in the abstract) to PowerShell/PowerShell#4343

(Even if the implementation was by design, which seems unlikely, it is an awkward deviation from regular PowerShell behavior that would have deserved conspicuous documentation.)


Now let's look at the rationale for keeping the current behavior given by @markekraus:

Again, if the consensus turns out to be that too many people have relied on the implementation leak, the best we can do is to document the existing behavior and move on - but it's important that moving on means avoiding creating such "opportunities" again.

I disagree with this motivation.
I specifically use this feature when I do NOT want the default pipeline behavior.

It's important to be clear that having made productive use of an implementation leak for a different purpose than originally intended does not make that purpose a meaningful feature retroactively and it therefore shouldn't be framed as such.

A multitude of cmdlets and functions return objects in an inconsistent way. Because -OutVariable always returns an ArrayList, I can ensure the result of the cmdlet/function has a consistent format.

This suggests that you have an issue with a behavior at the very heart of PowerShell's collection handling: the default pipeline behavior of unwrapping a single-item collection to a scalar, given
that (, 1 | Write-Output) -is [int] is $True for instance.
In the vast majority of cases this - mostly beneficial - behavior is no longer a problem in PSv3+, due to the unified handling of scalars and collections (.Count property on scalars, ability to index).
For edge cases, there's still @(...) (e.g., $s = 'hi'; @($s)[0] to get hi rather than just h with $s[0]).

I feel is the most stable and reliable method for retrieving output from commands.

If it takes reliance on an accidental implementation leak for the most stable and reliable method for retrieving output from commands, using a construct that suggests nothing of the sort,
we clearly have a bigger problem to solve.

@markekraus

markekraus commented Mar 17, 2018

@mklement0

The answer is: In the case of a preexisting scalar value, it should become the first element of the output [object[]] array; with a preexisting collection, its N elements should become the first N elements of the [object[]] output array (in input order).

I don't see how it makes any sense to use an Object[] array. For pipeline parity? Not a compelling reason in my book. We should not be using Object[] there either, IMO. Changing from ArrayList to Object[] seems like a step backwards. Then, when you chain -OutVariable '+Result', you are having to create a new Object[] each time. That seems like a huge waste of memory and possibly a performance hit on large operations.

@markekraus

markekraus commented Mar 17, 2018

It's important to be clear that having made productive use of an implementation leak for a different purpose than originally intended does not make that purpose a meaningful feature retroactively and it therefore shouldn't be framed as such.

Is this an implementation leak, though? Or is it just that the documentation doesn't explicitly call out that it is always an ArrayList? The official documentation doesn't suggest one way or the other whether this was intended design, as far as I can tell.

I'm not really interested in the history of its original intent either. Use of, and reliance on, the current behavior is prolific. It is a recommended workaround for commands like Get-ADUser, which can produce real hardship when trying to determine whether the command was successful and how many results (if any) it returned. Whether or not the current behavior was intended doesn't matter at this point, because the behavior is relied upon. We need significant justification to break that behavior.

What I am interested in is why this needs to change at all. What problem is this change addressing? As far as I know, there are no issues with the current behavior other than a lack of official documentation on the always-ArrayList behavior. If it's just to gain parity with the pipeline, I want to know why we should break the existing behavior to gain that parity. What do we get from it? Is what we gain enough to justify breaking a behavior that has been consistent since its addition to the language? What are those points, exactly?

Those need to be present in this RFC. The change needs to be justified because the existing behavior is relied upon, regardless of the original intent. I believe this to be a Bucket-1 breaking change and as such, the level of detail in the RFC should be very high and justifications well documented.

It also seems to me that some of the design considerations for this are a step in the wrong direction. Object[] arrays are not friendly to work with. Yes, that is what you get from a pipeline, but should pipelines have been Object[] to begin with? Don't we have other RFCs, issues, and PRs debating exactly that and the possibility of moving to List<Object> or increasing the usage thereof? Let me be clear: I'm not saying "a step in the wrong direction" to be antagonistic. I mean it in the context of the larger discussion about default collections in PowerShell and easing/increasing the use thereof, and to align with .NET recommended practices of favoring collections over arrays.

This suggests that you have an issue with a behavior at the very heart of PowerShell's collection handling:

The issue is that cmdlets and functions written by Microsoft, vendors, and the community often return data in an inconsistent manner. A single cmdlet may return all of these: $null, a collection unrolled, a collection not unrolled, a single item, or nothing (AutomationNull), and may do so with any combination of throws, terminating errors, and non-terminating errors (this is what I was referring to as insanity). In a perfect world, a function/cmdlet would do-one-thing and return-one-type-of-object. The current way of reliably managing code that doesn't follow that paradigm is to use the reliable ArrayList capabilities of -OutVariable.

If it takes reliance on an accidental implementation leak for the most stable and reliable method for retrieving output from commands, using a construct that suggests nothing of the sort,
we clearly have a bigger problem to solve.

The real problem is with how developers/authors/scripters (not sure of a good catchall term here) sometimes choose to design their functions and cmdlets. I don't think PowerShell is at fault for being an expressive shell and scripting language. Getting people to make best practice choices in their code that does not abuse (either intentionally or not) the expressive nature of PowerShell is not something I think we can or should fix in the language itself.

This is the crux of my opposition to this change. The current behavior, intended or not, provides a best practice solution for dealing with non-best practice code. I, myself, learned about it in the context of dealing with such problems in the AD cmdlets. The "ridiculous" logic I was referring to before was in reference to my solutions before I learned about -OutVariable and its always-ArrayList nature. I had to do things I consider "ridiculous", like running the commands in multiple try/catch blocks and alternating between -EA Stop and -EA SilentlyContinue. All things very easily solved with a single try/catch and -OutVariable which always creates the ArrayList even in the event of a terminating error, or throw, or non-terminating errors half way through.

@markekraus

Another thing to consider in this RFC comes from @PetSerAl's comment here.

PowerShell currently supports this usage:

1..10 | echo -OutVariable a | % { "$a" }

The PR that was previously submitted broke that usage. I would like for this RFC to spell out how it will either maintain, change, or break this usage.
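
(For reference, a small sketch of that intra-pipeline behavior as it works today - each downstream iteration sees the items captured so far; the commented output is what the current implementation produces:)

1..3 | Write-Output -OutVariable a | ForEach-Object { "$a" }
# 1
# 1 2
# 1 2 3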

@markekraus

I know it's a bit pedantic, but I'd also like to see it explicitly spelled out what happens when there is no output. (It's conceptually the same as $null, but not really: a cmdlet could decide to do return $null, which is different from return without anything, or from simply being a "void"-type cmdlet.)

@Jaykul
Contributor

Jaykul commented Mar 17, 2018

I think it would be an extremely confusing mistake to change OutVariable unless we are on a campaign to change all uses of ArrayList to a generic List or something.

All of the common parameter capture variables except the implicitly singular PipelineVariable are and always have been implemented the same way: as ArrayLists. We have more than ten years worth of code, articles, and blog posts documenting the fact that these variables are ArrayLists.

  • ErrorVariable
  • WarningVariable
  • OutVariable
  • InformationVariable

And it's not just those variables, it's also $Error and $input and $foreach ...

In addition, it is completely misleading to suggest that the fact that -OutVariable is an ArrayList is just "an implementation leak" -- it has always been a deliberate choice for every variable like this, since PowerShell was called Monad -- and I believe it should continue to be until such time as PowerShell is ready to embrace modern generic collections.

Do a quick search on Google and you will find tens of thousands of articles documenting these variables and the fact that they are ArrayLists.

Finally, it is completely incorrect to suggest that the perf hit of using arrays would only happen once.

All of the common parameter capture variables are available within the pipeline for use as they are being filled. They are updated after each process step and available for use and modification by every other command in the pipeline. As a demonstration:

Get-ChildItem -File -PipelineVariable file -OutVariable out | ForEach-Object -WarningVariable warn { 
    Write-Warning $file.Fullname
   $PSItem
} | ForEach-Object { 
    Write-Host $Warn.Count $Out[-1].FullName 
}

That is, one can do something like this to get a list of directories in the current folder and the files within them:

gci -ov items -Directory | gci -ov +items -File | out-null
$items

But one can also modify the list by hand while collecting it:

gci -ov items -Directory | % -Process { $items.AddRange( @(dir $_ -File) ) } -End { $items }

@markekraus

I'm digging through my usages of OutVariable and I discovered another instance where this would break:

function Get-Widget {
    [cmdletbinding()]
    param()
    end{
        $random = 1..3 | Get-Random
        Write-Verbose ('Returning {0} Objects'  -f $random)
        1..$random | %{
            [PSCustomObject]@{
                Count = 4
            }
        }
    }
}

1..15 | %{
    '----------------'
    'Run {0}' -f $_
    $null = Get-Widget -OutVariable Results -Verbose
    $Results.Count
}

In instances where an object (which may or may not be a collection) has a Count property and I need to determine the number of results, I'm using the current behavior of -OutVariable since it always provides the results as items in an ArrayList. With this RFC, when a single object is returned that has a Count property, I have to know what the type is first... and this may not always work. For example, in the case where a function outputs a collection that is not unrolled:

function Get-Widget {
    [cmdletbinding()]
    param()
    end{
        $random = 1..3 | Get-Random
        Write-Verbose ('Returning {0} Objects'  -f $random)
        1..$random | %{
            ,@(1,2,3,4)
        }
    }
}

It's also further complicated if it happens to return one or more ArrayLists not unrolled, or an Object[]. The implementation proposed in this RFC is a bit confusing: I could get back an Object[], an Object[] of Object[], an ArrayList, or an Object[] of ArrayLists. I'm not sure that is approachable for someone trying to figure out OutVariable for the first time.

@ChrisLGardner

I agree with @markekraus that this needs more details on what benefits come out of this change and how those offset the breaking of so many existing scripts.

It seems the root issue is really that it's not thoroughly documented that the *Variable parameters produce an ArrayList rather than whatever the cmdlet would normally produce, so wouldn't a simpler fix be to just update the documentation with more details and helpful information about how they work? That lowers the barrier to entry and helps remove the confusion around what they do.

We also need to consider how many newer scripters are using -*Variable compared to the more experienced scripters who know what it does and how it works.

I'd also support moving to List<Object> rather than Object[]; the performance hit alone should be seen as a major downside to going to Object[].

@mklement0
Contributor

On a meta note:

There are two, not necessarily mutually exclusive, considerations:

(a) maintaining backward-compatibility
(b) thinking about problems with the existing behavior in order to:

  • properly document how existing behavior may be unexpected from a perspective of how it should / is intuitively expected to work.
  • inform future - potentially breaking - changes.

If (a) is all you care about, you have your answer, and there's no need to read on:
Make your voice heard to vote for the status quo - divorced from whether it makes sense or not - demonstrating how your legacy code will be affected.
(I won't personally weigh in on that debate, because the existing code I am responsible for is under my personal control, and I am not a veteran user.)

If you (also) want to participate in (b), be sure to avoid the following pitfalls:

  • (1) Don't champion existing de-facto behavior unquestioningly, just because "it's always been this way".

  • (2) Don't be betriebsblind - realize that you may no longer perceive "quirks" as such, because you've dealt with them so long, or that you're perhaps even actively taking advantage of them; instead, realize that they may be a barrier to newcomers.

  • (3) Don't use (1) and (2) to justify (a).


With that out of the way, let's look at @Jaykul 's arguments:

are and always have been implemented the same way: as ArrayLists
We have more than ten years worth of code, articles, and blog posts documenting the fact that these variables are ArrayLists.

See (a) and (3).

And it's not just those variables, it's also $Error and $input and $foreach ...

Among these, $Error is indeed a [System.Collections.ArrayList] instance; the others are not ($Input - at least in its primary use - and $ForEach are enumerators and documented as such, and, re ..., I'm not aware of other automatic variables that are implemented as [System.Collections.ArrayList]).

$Error is mentioned in the docs as an "array of objects", but in practice that makes little difference - treat it as if it were an array (knowing that the errors are recorded in reverse chronological order, as documented), and you'll be fine (although if you want to clear the collection, you do need to know about its .Clear() method).

it has always been a deliberate choice for every variable like this,

There is no evidence in the documentation or the v3 language spec (the latest version available) that the use of [System.Collections.ArrayList] was a deliberate choice for -*Variable common parameters; Get-Help ArrayList, in fact, returns nothing.

The Windows PowerShell Language Specification Version 3.0 - the most recent version available - even goes out of its way to indicate that the specific collection type chosen for $Error is implementation-defined: "Although the type of this collection is unspecified, it does support subscripting to get access to individual error records."; ditto for $Input and $ForEach: "The type of an enumerator is implementation defined".

In the absence of specific behavior prescribed for -OutVariable, the most sensible behavior is to collect output in the same manner as when assigning to a variable, as described in a previous comment - but see the first aside below.

The spec just says re -OutVariable: "Stores output objects from the command in the variable specified by VariableName. The leading + causes the errors to be appended to the variable content, instead of replacing it.", followed by a copy-paste error from the -ErrorVariable parameter: "Specific errors stored in the variable can be accessed by subscripting that variable."

Finally, it is completely incorrect to suggest that the perf hit of using arrays would only happen once.

That assumes that the proposal is to (re)create the [object[]] array for every object in the pipeline.
Obviously, that wouldn't be a good idea.

Just like captured pipeline output is converted to [object[]] / a scalar after termination of the pipeline for assignment to a variable, so should the value stored in the target variable of -OutVariable (extra logic is required to deal with the case where multiple pipeline segments target the same variable (one as-is, others with +); accessing the [System.Collections.ArrayList] instances intra-pipeline should be avoided, because it relies on an implementation detail.)

You do pay a memory and performance penalty for that, just as when using assignment to capture a pipeline's output in a variable - but see the first aside below.

Two asides:

@Jaykul
Contributor

Jaykul commented Mar 19, 2018

The primary value in these common parameters is that they are populated as the pipeline runs, not at the end. If all we wanted was to collect the output at the end, there would be no OutVariable -- that's what Tee-Object -Variable is for.

In fact, Tee-Object already has the behavior you're looking for: unlike -OutVariable, Tee-Object doesn't populate its variable until the end of the pipeline. Because of that, it's able to not output a collection when there's only one item, and it outputs an array (not an ArrayList) when there is more than one.
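
(A quick, illustrative way to compare the two capture mechanisms for a single-item command - the comments reflect the behavior described above; inspect the types yourself:)

# Capture the same single-item output both ways and compare the stored types.
Get-Item $PSHOME -OutVariable viaOutVar | Out-Null
Get-Item $PSHOME | Tee-Object -Variable viaTee | Out-Null

$viaOutVar.GetType().FullName  # System.Collections.ArrayList (current -OutVariable behavior)
$viaTee.GetType().FullName     # the single item itself, per the description above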

-OutVariable is not Tee-Object -Variable

The output collection common parameters are collection variables that are available to be read and modified by every command in the pipeline (that comes after the first creation of the variable). That's not merely an implementation detail -- it's the critical feature.

What you are (now) proposing is that in order to gain a tiny feature (the ability to assume the output variable is a single item when we know there's only one item), we must give up all of the functionality of -OutVariable that sets it apart from Tee-Object -Variable.

No thank you.

@Jaykul
Contributor

Jaykul commented Mar 19, 2018

As a separate point -- I don't appreciate the attempt to control and redefine the terms of the discussion.

First: it's never an either/or discussion when considering fixing problems by breaking compatibility with a previous release.

Second: it's rude to come to a product with 12 years of history and say things like:

Make your voice heard to vote for the status quo - divorced from whether it makes sense or not - demonstrating how your legacy code will be affected.

You shouldn't need to coerce people into your point of view by trying to make it sound like a vote for the status quo doesn't make sense.

Third: don't make the mistake of insulting people who don't want you to break their tools by suggesting that they are championing "existing de-facto behavior unquestioningly, just because 'it's always been this way'" -- some of us have been using this tool every day for a long time, and although we always want to fix things that are broken, we're sometimes going to look at suggested changes as fixing things that aren't broken.

You won't convince veterans that your way is better by suggesting they're stuck in a rut and trying to put the onus on them of justifying their experience -- you need to show them how your suggestion would improve things. I haven't actually seen anything in the RFC or in this conversation that convinced me it would be worth changing, even if it didn't break anything.

Honestly, I feel like you jumped in here in an effort to make something easier to use, but did so without fully understanding the feature and its many benefits (i.e. the fact that these are for use within the pipeline, not just at the end).

I don't think that any change is called for, but an improvement to the common parameter documentation would be great, specifically documenting the fact that these are IList collections.

@markekraus

@mklement0

maintaining backward-compatibility

That's not the only reason. It's the desire to maintain backwards compatibility because the current behavior provides a series of demonstrably valuable features that would be completely lost if changed, some without a reasonably simple way to re-implement in another way.

Yes, the fact that a breaking change would result in work to update and change scripts is a part of it. As myself and others have pointed out, we use this feature in our code in ways that would be broken by this change. That's not just bellyaching at the fact that we would need to do work. Just do a search of GitHub for PowerShell code where OutVariable is in use, and then every instance where .Add() and [] (the index accessor) are used on it. Those are potential broken code points. The number is quite large.

But I'm all for breaking changes that advance the language and provide value. I made quite a few breaking changes in the web cmdlets and I have been championing breaking changes in other areas of PowerShell. So this is not a fight to hold on to the way it's always been for the sake of holding on to the way it's always been. As such, I personally find your tone insulting, and it does not further this discussion.

The point is, this feature has been around, has prolific use, and has demonstrably valuable features that would be lost with the changes proposed in this RFC. At this point, the onus is on those who wish to make this breaking change to provide a significant body of evidence and justification as to how this breaking change benefits the language, divorced of any (perceived or real) initial intent of the feature.

Even taking all of that out of consideration, this RFC and the resulting changes are rife with issues. The behavior is confusing and unexpected, even with knowledge of the pipeline. There are several performance concerns which have been called into question, and the use of Object[] is not a direction we should go in.

In the absence of specific behavior prescribed for -OutVariable, the most sensible behavior is to collect output in the same manner as when assigning to a variable

Previously you called me out for my use of language which seems to only seek to antagonize. This is exactly what you are doing here. Please heed some of your own advice.

The current behavior of OutVariable can make perfect sense with the addition of "always returns an ArrayList" to the documentation. However, having to provide a chart of what output to expect under which behavior could be confusing for novice users.

accessing the [System.Collections.ArrayList] instances intra-pipeline should be avoided,

Why? This is one of the benefits of the current behavior (again, divorced from intent). The ability to retrieve, insert, sort, and remove items in these variables within the pipeline provides flexibility and value. What value do we get out of this change? Why should intra-pipeline access be avoided - is it due to implementation concerns? If so, we should not make this change, as existing value would be lost, if only due to complexity concerns.

Just like captured pipeline output is converted to [object[]] / a scalar after termination of the pipeline

Yes, but what about the next pipeline, and the one after that, where the + feature is used? What about when that behavior occurs in a large iteration loop, as it so often does in its current usage? Are you suggesting that there will not be a performance hit there compared to current behavior? The way this RFC is proposed seems to suggest there will definitely be a performance hit.

The crux of the issue here is not the use of [System.Collections.ArrayList] instead of [object[]], but the deviation from the regular pipeline behavior of unwrapping single-item output.

I would like for this RFC to list a compelling body of reason as to why that behavior is even desirable. One of the most common complaints I get from novice users is how hard it is to comprehend just how the pipeline produces values and why you sometimes get back an Object[] vs a String, vs others. As advanced users of the language, we take that behavior for granted and to some extent we may even like it and think it is desirable, But the pipeline is already a very sour experience and a high hurdle to jump to gain basic competency of the language.

I have seen (and personally used) the current behavior of OutVariable as a teaching aid to help novice users gain understanding of the pipeline. So this change would take away one tool that is helpful for understanding the very thing the change seeks to gain parity with.

What benefit do we gain by changing the behavior of OutVariable in an area which many users already struggle with, complain about, and generally dislike? IMO, "for the sake of parity with the pipeline" is not a convincing argument for this change from the perspective of someone new to the language.

@mklement0
Contributor

mklement0 commented Mar 19, 2018

More meta notes :)

As such, personally I find your tone insulting and does not further this discussion.
Previously you called me out for my use of language which seems to only seek to antagonize. This is exactly what you are doing here. Please heed some of your own advice.
You shouldn't need to coerce people into your point of view by trying to make it sound like a vote for the status quo doesn't make sense.
Second: it's rude to come to a product with 12 years of history and say things like:

(Below I'm referring to my previous post's meta note at the beginning; @markekraus, you may have misconstrued my note about "sensible" behavior: all I wanted to say there was that, divorced from its history, the name and description of -OutVariable to me suggest that it should be on par with regular variable assignment - simply restating the intent of this RFC. No insult intended there at all.)

It's a valid point, and I apologize - I reacted to what I perceived as haughty dismissals. I will say that I am no stranger to the pitfalls I've described myself, and as such they are a reminder to me as much as anyone else. In the particular context of PowerShell, it just so happens that I may need a reminder of the opposite - that, given my relative inexperience, I should be more mindful of history and real-world impact. I will do that in the future and defer to others for assessing backward-compatibility risks.

I absolutely do believe that a vote for the status quo makes sense, but it's important to have clarity why.
I think it is important to strive for a shared understanding of problematic existing behavior even if it is retained, so as to properly frame it in the documentation, potentially fix it down the line, and inform the design of related, future features.

In short: my (regrettable) tone aside, I wanted to have a separate discussion about the merits of the proposed change in the abstract, and, this particular discussion aside, I think the conversation actually has moved in that direction, thankfully, and let me try to continue in that spirit below.


Honestly, I feel like you jumped in here in an effort to make something easier to use, but did so without fully understanding the feature and its many benefits

Yes, I was coming from my personal use of the feature, which was like Tee-Object -Variable, as a more concise and convenient per-cmdlet alternative that also supports appending, and was then surprised to find that their behavior differed, motivating me to try to spare others the same surprise.

(i.e. the fact that these are for use within the pipeline, not just at the end).

They currently can be used that way, but it isn't obvious at all (except with the non-collecting, always-single-item -PipelineVariable).
This is evidenced by the fact that even the PowerShell committee did not consider it when they approved PowerShell/PowerShell#3154 and that expecting $var = ... and ... -OutVariable var to act the same is reasonable, based on naming and existing documentation.
Note how even the language spec. doesn't address intra-pipeline use at all.

If the consensus is that the existing behavior should be retained as-is (but see my suggestion below), the solution is therefore to improve the documentation (as @ChrisLGardner suggested):

  • Describe how collecting values in an *Variable target variable differs from regular variable assignment (and Tee-Object -Variable).

  • Motivate the difference by explaining the intra-pipeline use of -OutVariable and its benefits (and of the other *Variable parameters, potentially, though less likely).

  • While it is definitely worth mentioning that the collection type used is not a regular PowerShell array ([object[]]), that aspect of the behavior is the least problematic one in practice, given that you may not even notice that you're talking to a [System.Collections.ArrayList] instead of [object[]].

That said, even a documented surprise will remain a surprise - both to newcomers making intuitive assumptions and to seasoned users who forget about the difference.

One of the most common complaints I get from novice users is how hard it is to comprehend just how the pipeline produces values and why you sometimes get back an Object[] vs a String, vs others. As advanced users of the language, we take that behavior for granted and to some extent we may even like it and think it is desirable, But the pipeline is already a very sour experience and a high hurdle to jump to gain basic competency of the language.

I think what you describe was true until v2, but went away in v3 with the ability to index scalars and their having a .Count property with value 1 (or 0, in the case of $null).
There is one notable wrinkle, though: PowerShell/PowerShell#2798
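
(A few examples of that v3+ unified handling, plus the string-indexing wrinkle mentioned earlier in the thread:)

(1).Count      # 1    - scalars expose a Count property in PSv3+
(1)[0]         # 1    - scalars can be indexed
($null).Count  # 0
'hi'[0]        # 'h'  - strings index into characters (the edge case)
@('hi')[0]     # 'hi' - @(...) handles that edge case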

Understanding pipeline output may be nontrivial, but it is crucial, given PowerShell's pervasive use of pipelines. Once you have that understanding, it applies wherever pipelines are involved.
Selective deviation in places where users are likely to expect pipeline-output behavior (I certainly did) is problematic. (And there are certainly other teaching tools for explaining the pipeline).

IMO, "for the sake of parity with the pipeline" is not a convincing argument for this change from the perspective of someone new to the language.

I think the symmetry of $var = ... and ... -OutVariable var is important - you are collecting pipeline output after all (what the cmdlet at hand writes to the pipeline).

The simplest use case is to output to the screen while also capturing the output in a variable, for later use. That use case is currently broken.

(Similarly, the simplest use case for -ErrorVariable is to collect errors for later inspection; whether single-item unwrapping should occur there (and with -WarningVariable and -InformationVariable) is less clear-cut, but my vote is for consistency.)

That you can currently also modify this collection as it is being built is to me a secondary consideration (and quite possibly an accidental feature) that shouldn't violate the fundamental assumption of the capture behavior.

So it sounds like the primary philosophical difference is over how to fundamentally frame the primary purpose of the -*Variable parameters and what their perceived primary use is.

Note that even advanced uses such as gci -ov items -Directory | gci -ov +items -File do not require intra-pipeline access to the collection; it is only explicit manipulations in user code - based on knowledge of the collection type - that do, such as gci -ov items -Directory | % -Process { $items.AddRange( @(dir $_ -File) ) }.

And I do wonder if the latter is worth supporting, given that it gets in the way of the simpler capture-and-inspect-later use case.

If the consensus is to grandfather in this usage, perhaps the following compromise is an option - though it ain't too pretty (a sketch of the resulting behavior follows the list):

  • With multi-item output, keep the existing behavior; as stated, in terms of element access [System.Collections.ArrayList] behaves like [object[]], and users may not even notice.

  • With single-item output, unwrap it - yes, that means changing the type after the fact, but it solves the problem of the $var = ... / ... -OutVariable var symmetry.

    • New users and basic-use-case users get the behavior they expect.
    • Existing code that uses .Count and indexing to access the variable should continue to work.
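
To make the compromise concrete, here's a purely hypothetical sketch of the resulting behavior (not current PowerShell; Get-Item / Get-ChildItem are used only as convenient single- vs. multi-item producers):

# Hypothetical behavior under the compromise above:
Get-Item $PSHOME -OutVariable single | Out-Null
$single.GetType().Name   # would be the item's own type (single-item output unwrapped)

Get-ChildItem $PSHOME -OutVariable multi | Out-Null
$multi.GetType().Name    # would remain ArrayList (multi-item output keeps current behavior)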

@markekraus

Existing code that uses .Count and indexing to access the variable should continue to work.

Except in cases where a single item has a Count property. I should have included this in my list of things to look for on GitHub: any instance where the variable defined in OutVariable uses the Count property is a potential breaking change, as demonstrated by this example.

Without inspecting the type of that value returned from OutVariable, the Count property could be:

  1. the count on the ArrayList containing multiple output results from the cmdlet/function
  2. the Count Property on a single object that happens to contain a Count property (could be a collection or not)
  3. The single ArrayList is returned, as a single unenumerated ArrayList output by the function/cmdlet

In the case of the last one, how can the user know whether the ArrayList is the unenumerated output or the collection of results collected by OutVariable? That's a confusing implementation to work through just to determine the number of objects returned by the cmdlet/function. This is a feature that is heavily relied on now to determine the output counts at various stages of the pipeline.

Also, indexing on OutVariable would be broken when a single String is returned. It will also be broken for other types which support indexers when a function/cmdlet may return one or more of that type.

@mklement0
Contributor

mklement0 commented Mar 19, 2018

Except in cases where a single item has a Count property.
indexing on OutVariable would be broken when a single String is returned.

Yes, these are known (and unfortunate) edge cases that affect pipeline behavior fundamentally, but I don't think their avoidance in a particular scenario should be the driving factor.

@(...) is the way to handle these scenarios (which reverts to [object[]] in case of multi-item output, but that's what it arguably should have been all along).

The single ArrayList is returned, as a single unenumerated ArrayList output by the function/cmdlet
In the case of the last one, how can the user know whether the ArrayList is the unenumerated output or the collection of results collected by OutVariable

Yes, you wouldn't be able to distinguish the following two cases:

# output an explicitly constructed array list as a single item
% { , ([System.Collections.ArrayList] (1,2,3)) } -ov var  
# let -ov collect the pipeline-enumerated elements in an array list
% { 1,2,3 } -ov var 

However, again I think that handling such an obscure edge case should not be the driving factor and, again, the behavior is fundamental to the pipeline:

$var = % { , (1, 2, 3) } # output explicitly constructed [object[]] array as single item
$var = % { 1, 2, 3 }     # let the assignment collect the enumerated elements in an [object[]] array

$var receives the same value in both cases: a 3-element [object[]] array.

@markekraus

edge cases

A very high percentage of occurrences of -OutVariable on GitHub are using the .Count property and are potentially broken by the proposed change. This is not an edge case.

I think that handling such an obscure edge case should not be the driving factor

You are proposing a breaking change. You need to provide justification for breaking these. Making edge cases more confusing is not a compelling reason to change. The edge cases now are clear and work the same as non-edge cases. That is a reason to maintain the current behavior: this proposed change creates edge cases where none existed before.

@Jaykul
Contributor

Jaykul commented Mar 20, 2018

I think the symmetry of $var = ... and ... -OutVariable var is important - you are collecting pipeline output after all (what the cmdlet at hand writes to the pipeline).

I understand your desire to make this true, but I believe the symmetry to the other common parameters is more important:

People don't learn of just one common parameter: they're all documented together, and basically all behave the same way. These common parameters (including -PipelineVariable) all stream values while the pipeline is running. -PipelineVariable is not a collection, but like the others, it's still updated with every output (not merely with each call to ProcessRecord, but with each output from it), and available to all the other commands in the pipeline -- not just after the pipeline is finished.

I agree that the documentation is lacking.

Note that even advanced uses such as gci -ov items -Directory | gci -ov +items -File do not require intra-pipeline access to the collection

I don't consider this an "advanced" use. It's pretty basic usage -- if you're doing anything simpler than that, you don't need this feature at all 😉 And of course, you are accessing the collection within the pipeline: you're modifying it in both commands.

If you examine $Items afterward, you'll see that the output from the first and the second pipeline commands are interleaved. In this basic example, the second command is not accessing the collection except through the -OutVariable common parameter, but it's doing the exact same thing as the other example -- and the performance of this example requires that $items be an IList with the ability to add items at constant O(1) speed, not an array (nor an "optimized" single item that has to be converted back into a list).

I think it would be a more confusing experience if you were writing code where the -OutVariable was sometimes an IList and sometimes a single item, than the current experience where it's always an IList. It would also be more confusing than your desired experience where it's the same as pipeline assignments: either an array or a single item...

@mklement0
Contributor

mklement0 commented Mar 25, 2018

@markekraus:

To me, having a consistent experience whenever PowerShell collects output for you - i.e., a pipeline-like experience - outweighs the risk of breaking code, given that that risk is low in my estimation:

Re the potential .Count problem:

A scalar type having a .Count property is rare - for good reasons.
Cmdlets outputting collections as single pipeline items is rare - for good reasons. Whenever they do, confusion ensues (e.g., PowerShell/PowerShell#3424)

Re the potential [0] (indexing) problem:

A cmdlet outputting directly indexable scalars - notably [string] instances - is rare - again, for good reasons.
Get-Content does output collections of strings, but, again, that's a fundamental pitfall you should expect whenever you use it ('hi'>t.txt; (Get-Content t.txt)[0] # -> 'h' - subscript [0] was applied to the string scalar and returned its 1st character)

Arguably, even [string] shouldn't be directly indexable - a string is implemented as an array of characters, but it isn't one; PowerShell treating a System.Data.DataTable as an enumerable sows the same confusion - see PowerShell/PowerShell#6466; but that's a separate debate.

In short: the edge cases are rare and arise from scalars posing as collections.


Let's look at how -OutVariable is used in practice, based on GitHub repos - I can't speak to how representative they are of overall usage (including closed-source use):

I sampled about 240 uses, both for -OutVariable and -ov, including those in repositories owned by @Jaykul and in https://github.com/PoshCode/PoshCode (I couldn't find any uses in yours) and found the following:

The vast majority of cases:

  • Many instances of outright bugs / no-ops: -OutVariable $results instead of the intended -OutVariable results; in one creative case, -OutVariable $null;
    one instance of assuming that the result is an array and accessing the .Length property, which System.Collections.ArrayList instances don't have (something that switching to [object[]] would actually fix).

  • Many instances of collect-first-use-later use, with a single -OutVariable per pipeline (-OutVariable results) and later uses that would not break with the proposed changes:

    • Later collection use: $results.Count, $results[0], $results | .... - I have yet to see a single case of applying [0] to a cmdlet returning strings, where the indexing problem would surface.
    • Later scalar use; e.g., $results.FullName ... i.e., users were relying on member enumeration or just assumed that the result would be a scalar anyway

The exotic cases:

  • 1 use of multiple -OutVariable parameters, but with distinct variables

  • 1 use of adding to an output variable (+stdout), but in successive calls. (There were effective no-op variants: initializing $results to @() or neglecting to initialize it and then using -OutVariable +results once.)

In short:

  • I didn't find any uses that would break.

  • I didn't find any cases where a single variable is appended to by multiple -OutVariable parameters (along the lines of gci -ov items -Directory | gci -ov +items -File).

  • I didn't find any cases where a target variable is directly accessed or modified as part of the containing pipeline (along the lines of gci -ov items -Directory | % -Process { $items.AddRange( @(dir $_ -File) ) })

Given the above, I'm inclined to say that even the proposed on-pipeline-exit conversion to [object[]] for multiple outputs - also preferable for consistency - won't be a problem in practice.
On the plus side, it paves the way for later replacing [object[]] with an extensible list type as the default collection data structure.

@mklement0
Contributor

@Jaykul:

I understand your desire to make this true, but I believe the symmetry to the other common parameters is more important:

The -PipelineVariable is a different beast from the other ones: its purpose is indeed to be
(a) streaming and (b) scalar (1 output object at a time), and it is discussed as such in the documentation.

By contrast, -OutVariable, -ErrorVariable, -WarningVariable, and -InformationVariable should be considered as a group and act consistently; their purpose is to collect values:

  • Nothing suggests that they are streaming values in the documentation.

  • They aren't being used that way in practice, judging by the PowerShell source code available on GitHub (see my previous comment).

  • Instead, the primary use in practice is: collect first, use later.

You've discovered that they're implemented as streaming variables, and perhaps you've made use of that in practice.

This RFC wouldn't take that ability away from you, except that a single-item result would be unwrapped to a scalar and - possibly - the multi-item results converted to [object[]].

The unifying vision behind this RFC, at least in my mind, is to provide a consistent experience whenever PowerShell collects output for you.

Here's another example of how pervasive the pipeline logic is in PowerShell: (, @{ a = , 1 }).a.GetType().Name is Int32; that is, member enumeration unwrapped the single-element array to a scalar.

Being able to rely on this behavior - without having to worry about exceptions here and there - is important.

@markekraus

having a consistent experience whenever PowerShell collects output for you - i.e., a pipeline-like experience - outweighs the risk of breaking code,

The pipeline experience is painful, not ideal for all situations, and the current implementation provides a way to work around issues I have already outlined here.

The proposed change introduces breaking changes for the sake of parity with something it currently serves as an alternative to. That in itself is a strong reason this change is not a direction we should take.

The proposed code changes produce edge cases where none currently exist. This too is a very strong reason not to make the change. I have code that will definitely break with this change. I did some analysis of this as well, and I was able to find examples on GitHub that are definitely broken by the change and areas that may be subject to the edge cases. In any case, we should not introduce changes that create instability, unpredictability, or new edge cases.

Examples where code would be broken due to indexing into a string because the commands can return a single string:

@mklement0
Contributor

mklement0 commented Mar 25, 2018

@markekraus:

The pipeline experience is painful, not ideal for all situations

Please focus your efforts on fixing the pipeline, if you believe it to be painful.

for something it currently serves as an alternative to

Relying on an implementation detail that happens to bypass what you dislike about the pipeline is not a fix.
Conversely, the lack of consistency is a long-term headache and requires documentation, which is always a poor substitute for removing inconsistencies.

I was able to find examples on GitHub

  • Yes, the - one - repo you link to outputs strings in ForEach-Object script blocks on 3 occasions.
Yes, if that command happens to output just one string, the indexing will break - though the expected use case is clearly to have multiple strings, given the unfiltered Get-Mailbox, Get-DistributionGroup, and Get-MsolUser calls.

Again, to me the benefit of resolving the inconsistency outweighs such edge cases.

@markekraus

Please focus your efforts on fixing the pipeline, if you believe it to be painful.

There is nothing wrong with it. It's just painful in some situations.

Relying on an implementation detail that happens to bypass what you dislike about the pipeline is not a fix.

I still believe this is not an accident. Everything about the way -OutVariable works in conjunction with -ErrorVariable, -WarningVariable, and -InformationVariable seems pretty intentional. Nothing in the documentation seems to contradict that assumption, IMO. It was maybe a less than ideal design choice to use an ArrayList instead of List<T>, but that these are a collection type does not seem like an accident or an "implementation detail".

The fact is, the current implementation works, is convenient, and offers an alternative to the pipeline. It is often a go-to for when a user cannot use the pipeline or needs to standardize output. All of that holds regardless of the initial intent. That is how it is used, and the changes proposed here will break the current behavior.

While my individual preference to keep this behavior unchanged may seem insignificant, I am also not alone in that sentiment. If users like a current feature, that is a good reason to keep it unchanged.

Also, please re-read those examples I provided. They are working on string transformations emitted by ForEach-Object, not the objects emitted by the commands at the front of the pipeline. So yes, they are broken in instances where there is one output.

@mklement0
Contributor

@markekraus: Thanks for the correction re the samples you linked to - please see my updated comment above.

@mklement0
Contributor

@markekraus:

It was maybe a less than ideal design choice to use an ArrayList instead of List<T>

No, the issue is not what specific flavor of extensible collection is returned, but (a) that it's not an instance of PowerShell's default collection data type, [object[]] and (b) - more importantly - that a single-item collection is not unwrapped.
And that observation is independent of the speculation over what may or may not have been the original design intent (that an extensible collection type is used internally is perfectly sensible and also happens when assigning pipeline output to a variable, but in the end a conversion to PowerShell's native collection type, [object[]], is performed - again, this behavior is worth revisiting fundamentally, not as a fortuitous exception).

If users like a current feature, that is a good reason to keep it unchanged.

That makes sense to established users, familiar with the warts and all, but won't somebody, please, think of the children?

@mklement0
Contributor

I want to share helpful perspectives from @BrucePay from PowerShell/PowerShell#6512 with respect to when pipeline-like unwrapping is not appropriate:

  • Method output:

Methods don't stream so there is no ambiguity as to whether the result should be a collection or not.

  • Operator output:

Operators aren't commands. Operators have operator semantics, not command semantics. Operators don't stream.

That cleared things up for me in my quest for unified behavior: these are simple rules to remember, assuming they're implemented consistently.

In the case at hand, because -OutVariable and its ilk are command (cmdlet) features, I still think pipeline behavior is the way to go (except for -PipelineVariable, whose purpose is different, as discussed).

@Jaykul
Contributor

Jaykul commented Sep 12, 2018

I'm confused about where the RFC process for this is. It's been six months....

Typically, one or two months is allowed for comments, though this may be extended if a submission is particularly contentious or hasn't received enough feedback for the Committee to feel comfortable making a decision.

The reality is that we debated this for about a week and there basically haven't been any other comments since then.

It seems like the committee should make a call to either reject or progress to draft and experimental implementation, right?

@rjmholt
Contributor Author

rjmholt commented Dec 19, 2018

In the interests of clearing out my history, I'm going to close this. It looks too controversial to consider properly and, based on some of the discussion, would need to be worded better to convey the original intent. Thank you for the discussion, everyone.

@rjmholt rjmholt closed this Dec 19, 2018
@rjmholt rjmholt deleted the out-variable-transparency branch December 19, 2018 00:24