Get-Content is slow on large text files. Could it have a parameter to speed it up by not adding NoteProperties? #7537

Closed
HumanEquivalentUnit opened this issue Aug 16, 2018 · 26 comments
Labels
Issue-Enhancement the issue is more of a feature request than a bug Resolution-No Activity Issue has had no activity for 6 months or more Up-for-Grabs Up-for-grabs issues are not high priorities, and may be opportunities for external contributors WG-Cmdlets-Core cmdlets in the Microsoft.PowerShell.Core module

Comments

@HumanEquivalentUnit
Contributor

HumanEquivalentUnit commented Aug 16, 2018

Using Get-Content to read an example 170,000 line wordlist text file.

# Default use, slow.  Roughly 6 seconds. Over 100x longer than the alternatives.

$lines = Get-Content -Path '/path/to/bigfile.txt'



# Fast. Roughly 40ms - 90ms.

$lines = [system.io.file]::ReadAllLines('/path/to/bigfile.txt')



# Fast. Roughly 50-100ms. NB. the ReadCount has to be larger than the file line count,
# otherwise $lines is not a 1-dimensional array. i.e. you need to know the file line 
# count to be able to do this in one move.

$lines = Get-Content -Path '/path/to/bigfile.txt' -ReadCount 200kb



# Fastest. Roughly 30ms - 50ms.

$lines = Get-Content -Path '/path/to/bigfile.txt' -ReadCount 100 | foreach { $_ }
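For anyone who wants to reproduce the comparison, here is a rough sketch using Measure-Command. The temp file name and its generated contents are made up for illustration, and absolute numbers will differ by machine and PowerShell version, so treat the output as relative only.

```powershell
# Sketch only: the file name and contents are hypothetical stand-ins for a wordlist.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'gc-timing-demo.txt'
[string[]]$words = foreach ($i in 1..170000) { "word$i" }
[System.IO.File]::WriteAllLines($path, $words)

# Measure-Command returns a TimeSpan; keep the milliseconds for each approach.
$timings = [ordered]@{
    'Get-Content (default)'    = (Measure-Command { Get-Content -Path $path }).TotalMilliseconds
    'File.ReadAllLines'        = (Measure-Command { [System.IO.File]::ReadAllLines($path) }).TotalMilliseconds
    'Get-Content -ReadCount 0' = (Measure-Command { Get-Content -Path $path -ReadCount 0 }).TotalMilliseconds
}
$timings | Format-Table -AutoSize

Remove-Item -Path $path
```

A single run is noisy; repeating each measurement a few times and comparing medians would give a fairer picture.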

The reason for the slow version is explained here, apparently by Bruce Payette in 2006:

This is a known issue with the way Get-Content works. For each object
returned from the pipe, it adds a bunch of extra information to that object
in the form of NoteProperties.

These properties are being added for every object processed in the
pipeline. We do this to allow cmdlets to work more effectively together.
It's important because things like the Path property may vary across
different object types. In effect, we're doing "property name
normalization". Unfortunately, while this technique provides significant
benefits by making the system more consistent, it isn't free. It adds
significant overhead both in terms of processing time and memory space.
We're investigating ways to reduce these costs without losing the benefits
but in the end, we may need to add a way to suppress adding this extra
information.
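To see the decoration described in that quote, the sketch below compares a line returned by Get-Content with the same line read through .NET. The temp file is a made-up example; the property names in the comment are what I would expect from the FileSystem provider, so check them on your own version.

```powershell
# Sketch only: a tiny hypothetical file in the temp directory.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'gc-noteprops-demo.txt'
Set-Content -Path $path -Value 'first line', 'second line'

# One line as emitted by Get-Content, and the same line read directly via .NET.
$decorated = Get-Content -Path $path | Select-Object -First 1
$plain     = [System.IO.File]::ReadAllLines($path)[0]

# Collect the names of the NoteProperties attached to each string.
$decoratedProps = @($decorated.PSObject.Properties |
    Where-Object MemberType -eq 'NoteProperty' | ForEach-Object Name)
$plainProps = @($plain.PSObject.Properties |
    Where-Object MemberType -eq 'NoteProperty' | ForEach-Object Name)

# Expected to include names such as PSPath and ReadCount for the decorated line.
"Get-Content line : $($decoratedProps -join ', ')"
"ReadAllLines line: $($plainProps.Count) NoteProperties"

Remove-Item -Path $path
```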

I think it's a shame that the default usage of Get-Content is the slow version, but that's likely not going to change. But, 12 years on from this posting, is it time to add a way to suppress adding this extra information?

e.g. a parameter to Get-Content which switches off the NoteProperties. I have no good parameter name suggestion - ideally I would want it to communicate "this is faster" to people who see it in written code, or who read the documentation wondering how they can speed up Get-Content on large files.

@jcotton42
Contributor

jcotton42 commented Aug 16, 2018

I'm willing to take this if the PS team OKs it, although I'm unsure what to name the parameter. Someone suggested RawLines to me.

@jcotton42
Contributor

I have decided to go ahead and work on this issue. @powershell/powershell can I get an assignment?

@BrucePay BrucePay added Issue-Enhancement the issue is more of a feature request than a bug Up-for-Grabs Up-for-grabs issues are not high priorities, and may be opportunities for external contributors WG-Cmdlets-Core cmdlets in the Microsoft.PowerShell.Core module labels Aug 16, 2018
@BrucePay
Collaborator

I've marked it as an enhancement and up-for-grabs. You should just be able to assign it to yourself. We added -Raw a long while back to address the perf issue but it doesn't really do the right thing. Naming the new parameter -RawLines sounds ok but maybe a -ReadMode parameter that took lines, text, rawlines, rawtext etc. might be more flexible.

@powercode
Collaborator

#7481 will make this a bit better - I haven't tried it yet to see how much, and it will always be slower than just creating the strings.

@jcotton42
Contributor

@BrucePay maybe I'm just being daft but I don't see a way to assign this to myself

@Jaykul
Contributor

Jaykul commented Aug 16, 2018

See #7501

@jcotton42
Contributor

jcotton42 commented Aug 16, 2018

Someone just brought PR #7502 to my attention, would that make solving this issue unnecessary? It seems to solve the same issue of Get-Content being slow, but in a more elegant manner.

@SteveL-MSFT SteveL-MSFT self-assigned this Aug 16, 2018
@SteveL-MSFT
Member

Only individuals marked as Collaborators show up in the Assignees list. @jcotton42 I'll assign this to myself to avoid someone else duplicating the work. You can assume this is assigned to you.

The WIP PR is still under review as it is a breaking change. However, although that change will improve things if accepted, it may still make sense to add a parameter to Get-Content.

@HumanEquivalentUnit
Contributor Author

We added -Raw a long while back to address the perf issue but it doesn't really do the right thing

Ohhh I didn't imagine it had already been acted on. That's partly what I meant about "ideally I would want a parameter name to communicate "this is faster" to people who see it". I've used -Raw in other circumstances, not noticed it was faster, and not twigged it was related to this.

@jcotton42
Contributor

jcotton42 commented Aug 17, 2018

@SteveL-MSFT ok sounds good. Given that it looks like that PR will affect mine I will wait until it's merged.

@mklement0
Contributor

mklement0 commented Sep 5, 2018

As for what to name the parameter:

I suggest -Bare, which avoids the -Raw confusion (raw also has inapplicable connotations of reading raw bytes).

(Conversely, a more sensibly named parameter alias for -Raw should be introduced - see #7715)

Using -Bare - without including the term lines - also opens the door for implementing similar logic for other cmdlets (opting out of output-object decoration) that may be emitting different types of output objects - see #7713 (though there the "bare" objects happen to be lines too, except if combined with the proposed option to return only matching portions of a line (#7712)).

@ZackInMA

ZackInMA commented Jan 8, 2023

using -raw and it flies for me.

@mklement0
Contributor

mklement0 commented Jan 8, 2023

@ZackInMA, yes, -Raw is fast because it reads the file as a whole, into a single, multiline string, so there is only one object that needs decorating.

However, this won't help you if you want line-by-line streaming, which is the typical use case, and that is the one that's painfully slow.

If the individual lines are needed, there are two ways of speeding up the operation - both of which make the line output non-streaming, however:

# Read all lines into an array that is then output *as a whole* 
# To use this in a pipeline, enclose in (...) to force enumeration
Get-Content -ReadCount 0 file.txt

# Slower alternative, but still much faster than Get-Content with neither -ReadCount nor -Raw:
# Read into a single string, then split by newlines.
# Note: If the last line has a trailing newline, as is typical, 
#       the resulting array will have an empty last element.
(Get-Content -Raw file.txt) -split '\r?\n'

Bypassing the line-by-line streaming in itself speeds up these commands, but in both cases only one object is decorated with the NoteProperties: the array object as a whole with -ReadCount 0, and the single string with -Raw.
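A small sketch of both behaviours, using a made-up three-line temp file: how many objects reach the pipeline with -ReadCount 0, with and without the enclosing (...), and the empty last element produced by the -split approach. The final variant, which trims one trailing newline before splitting, is my own suggestion rather than something proposed in this thread.

```powershell
# Sketch only: a hypothetical three-line file that ends with a trailing newline.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'gc-readcount-demo.txt'
[System.IO.File]::WriteAllText($path, "one`ntwo`nthree`n")

# Without (...), the whole array travels down the pipeline as a single object.
$asWhole = Get-Content -Path $path -ReadCount 0 | Measure-Object

# With (...), the array is enumerated, so each line is a separate pipeline object.
$enumerated = (Get-Content -Path $path -ReadCount 0) | Measure-Object

# Splitting the raw text leaves an empty last element because of the trailing newline.
$split = (Get-Content -Raw -Path $path) -split '\r?\n'

# Assumption-labelled variant: strip one trailing newline before splitting.
$trimmed = (Get-Content -Raw -Path $path) -replace '\r?\n\z' -split '\r?\n'

"Objects without (...): $($asWhole.Count)"
"Objects with (...)   : $($enumerated.Count)"
"Elements after -split: $($split.Count); after trimming first: $($trimmed.Count)"

Remove-Item -Path $path
```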

@ZackInMA

ZackInMA commented Jan 8, 2023

Wow, thanks for taking the time man.

@SteveL-MSFT SteveL-MSFT removed the Up-for-Grabs Up-for-grabs issues are not high priorities, and may be opportunities for external contributors label Jan 9, 2023
@SteveL-MSFT
Member

Bringing this up to Cmdlets WG to discuss -Bare

@JamesWTruher
Member

The WG has reviewed this and believes that an appropriate approach may be to change the default value of -ReadCount to 0, which will essentially improve the performance for all users while possibly causing a small number of users to use $PSDefaultParameterValues['Get-Content:ReadCount'] = 1. We also believe this should be provided as an experimental feature.

@JamesWTruher JamesWTruher added the Up-for-Grabs Up-for-grabs issues are not high priorities, and may be opportunities for external contributors label Mar 1, 2023
@mklement0
Contributor

mklement0 commented Mar 1, 2023

The proposed change would be massively breaking:

  • Code that processes Get-Content output directly in the pipeline (which is typical) would break, because with -ReadCount 0 $_ then refers to the entire array of lines in ForEach-Object and Where-Object script blocks.

  • Code that happens to bypass this problem with an intermediate variable but relies on the presence of the NoteProperties on the individual lines would break.
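The first point can be illustrated with a short sketch (the temp file is hypothetical, and the default $OFS is assumed): the same ForEach-Object script block sees one line at a time by default, but the whole array once -ReadCount 0 is in effect.

```powershell
# Sketch only: a hypothetical three-line file in the temp directory.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'gc-breaking-demo.txt'
[System.IO.File]::WriteAllLines($path, [string[]]('one', 'two', 'three'))

# Default behaviour: $_ is a single line, so three strings come out.
$perLine = Get-Content -Path $path | ForEach-Object { "[$_]" }

# With -ReadCount 0: $_ is the whole array, so the block runs only once and
# string interpolation joins all lines with spaces (default $OFS).
$batched = Get-Content -Path $path -ReadCount 0 | ForEach-Object { "[$_]" }

# The type of $_ also changes from String to an array.
$batchedType = Get-Content -Path $path -ReadCount 0 | ForEach-Object { $_.GetType().Name }

"Default      : $($perLine -join ' ')"
"-ReadCount 0 : $batched (`$_ is $batchedType)"

Remove-Item -Path $path
```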

@iSazonov
Collaborator

iSazonov commented Mar 2, 2023

At first look, we can improve the cmdlet using a trick we use in FileSystemProvider - use cached NoteProperty object for all current outputs.

@jhoneill

@JamesWTruher when we discussed this in the WG I don't think anyone picked up that -ReadCount 0 outputs a single object.

@mklement0's "massively breaking" sounds like hyperbole, but it may not be in this case

PS > get-content .\profile.ps1 -ReadCount 0 | measure

Count             : 1
Average           :
Sum               : 
Maximum           :
Minimum           :
StandardDeviation :
Property          :


PS > get-content .\profile.ps1  | measure            

Count             : 563

@SteveL-MSFT SteveL-MSFT removed the Up-for-Grabs Up-for-grabs issues are not high priorities, and may be opportunities for external contributors label Mar 20, 2023
@SteveL-MSFT
Member

@jhoneill We should bring this back up to WG discussion. I believe the ask is for line-by-line reading, but no extra decoration. I'm thinking maybe just -NoNoteProperty, which is more self-describing than trying to explain the difference between -Bare and -Raw.

@SteveL-MSFT
Member

SteveL-MSFT commented Apr 5, 2023

@PowerShell/wg-powershell-cmdlets reviewed this and agree that the use case to have the string objects not have additional decoration makes sense. Considering that this parameter may be used by other cmdlets, we suggest a switch called -NoExtendedMember which may help discoverability and lead the user to learn about the PowerShell extended type system.
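Since the proposed switch does not exist at the time of this comment, one possible stopgap for line-by-line reading without decoration is the .NET method [System.IO.File]::ReadLines, which enumerates lines lazily. This is a sketch under assumptions, not a recommendation from the WG: the temp file is made up, and .NET resolves relative paths against the process working directory rather than the current PowerShell location, so a full path is used.

```powershell
# Sketch only: a hypothetical file in the temp directory, addressed by full path.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'gc-readlines-demo.txt'
[System.IO.File]::WriteAllLines($path, [string[]]('alpha', 'beta', 'gamma'))

# ReadLines yields plain strings one at a time; nothing is buffered up front.
$upper = foreach ($line in [System.IO.File]::ReadLines($path)) {
    $line.ToUpperInvariant()
}

# The strings carry no NoteProperties, unlike default Get-Content output.
$noteProps = @([System.IO.File]::ReadLines($path) | ForEach-Object {
    $_.PSObject.Properties | Where-Object MemberType -eq 'NoteProperty'
})

"Lines          : $($upper -join ', ')"
"NoteProperties : $($noteProps.Count)"

Remove-Item -Path $path
```

The trade-off is that this bypasses the provider model, so it only works for file system paths and does not honour Get-Content parameters such as -Tail or -Wait.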

@SteveL-MSFT SteveL-MSFT added the Up-for-Grabs Up-for-grabs issues are not high priorities, and may be opportunities for external contributors label Apr 5, 2023
Contributor

This issue has not had any activity in 6 months, if this is a bug please try to reproduce on the latest version of PowerShell and reopen a new issue and reference this issue if this is still a blocker for you.

@microsoft-github-policy-service microsoft-github-policy-service bot added Resolution-No Activity Issue has had no activity for 6 months or more labels Nov 16, 2023
Contributor

This issue has been marked as "No Activity" as there has been no activity for 6 months. It has been closed for housekeeping purposes.

@powercode
Collaborator

Ping to keep alive
