-
Notifications
You must be signed in to change notification settings - Fork 7.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improving performance of Import-CSV #7413
Conversation
@powercode Thanks for great PR! I see that most of the changes (over 95%!) are style. It would be nice if you left only meaningful changes. (I hope you understand the difference between reviewing 1300 and 50 lines of code.) |
@iSazonov Well, they are separated in different commits. But I can scrap the cleanup and just submit the core changes. The first two are purely style. |
@powercode Thanks for this contribution! I've fixed your PR description so that the checkboxes show up correctly. In the future, you can just leave the checkboxes empty then after you submit your PR, you can use your mouse to click them and it'll update the markdown correctly. |
Since the cleanup is in separate commits, it shouldn't be too difficult to review. In the future, it'll be better to have the cleanup as a separate PR, although I can certainly understand making the cleanup while making code changes in the same source files :) |
} | ||
} | ||
|
||
#endregion Member collection classes and its auxiliary classes | ||
} | ||
|
||
#pragma warning restore 56503 | ||
|
||
#pragma warning restore 56503 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing trailing newline
@@ -4894,6 +4902,7 @@ public void Dispose() | |||
} | |||
} | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra whitespace
_instanceMembers = new PSMemberInfoInternalCollection<PSMemberInfo>(instanceMemberCapacity); | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra whitespace
@SteveL-MSFT I did some additional perf work (profiled allocations). Is it ok to submit a few more commits in this PR? It is mainly about reducing allocations when creating parsing the CSV and creating PSObjects/PSNoteProperties. |
@powercode yes, feel free to submit more commits, will re-review |
Our code guidance demands to separate style and significant changes. Please remove the style commits from the PR - they very large and contain controversial changes. We can discuss them in another PR. I added WIP while we wait new commits. |
In the context the line
looks as unneeded although this may be necessary in other scenarios. |
@PowerShellTeam I'm really getting sick on this endless focus on formatting. This is a 10x perf improvement on Import-CSV, and I have spend about as much time on this that I'm willing to spend. Feel free to adopt the PR, or ditch it. |
@powercode I am very surprised by "I'm really getting sick ...". I absolutely can not accept your "I'm really getting sick". You pushed over 1300 line changes and only 30-50 of them are significant. This absolutely contradicts our Coding Guidelines. This forces reviewers to spend a lot of time looking at insignificant changes. It would be nice if you took care not only of your time. As I wrote earlier if you are a very busy person, then you can always ask a help from maintainers and other contributors and say "I will be busy the whole next month, feel free grab my commits or wait for me to return". This is more creative than writing "I'm really getting sick". In recent months, we have done a great job of correcting style issues in the repo in order to facilitate the work of the contributors. And we continue. We are configuring CodeFactor/StyleCop. We have started to use CodeFormatter (a tooling you ask). We have PR for adding I hope for your understanding, patience and help. |
Well, I beg to differ. I'm happy to run tools to keep a consistent formatting on files that already confirms to the standard. But it is a huge waste of peoples time to nitpick on things that could easily be fixed by tools. Especially where the choice is between following the coding guidelines and maintaining a consistent style with the rest of the file. |
@powercode The problem is that there is no tool that would fix everything we need. |
The best practice is to have separate PRs for style changes than functional changes. We should try to do that as it makes reviews easier. I also appreciate @iSazonov's efforts to help clean up the code base. @powercode I appreciate your contributions and hope you'll continue to contribute, but do try to have style changes as separate PRs next time. For this specific PR, I don't think it's worth the effort to cherry-pick the style changes to a new PR. CodeFactor shows 313 issues fixed and when I looked at those specific commits, the style changes all looked good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 4 files at r2, 1 of 2 files at r3.
Reviewable status: 0 of 5 files reviewed
@@ -3392,6 +3394,7 @@ internal TypeTable(bool isShared) | |||
{ | |||
this.isShared = isShared; | |||
_typeFileList = new List<string>(); | |||
_memberFactoryFunc = MemberFactory; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the filed _memberFactoryFunc
really needed? Why not directly use _consolidatedMembers.GetOrAdd(types.Key, MemberFactory, types)
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. You are trying to avoid constructing a Func<...>
object every time running this method. Nice thought!
There are 4 constructors for TypeTable
, and adding _memberFactoryFunc = MemberFactory
here only covers 2 of them, can you add this initialization to the other constructors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if would be helpful to add a comment to the _memberFactoryFunc
field to explain its purpose. If you cannot get to this PR in the short term, please let me know, I'm happy to make these minor changes for you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daxian-dbw Done. Yes, I profiled allocations and found that this was called so frequently, and generated lots of short lived delegates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch with the ctors - I misread the code and thought they all delegated to a single one. They do now.
Quoted from
It's OK to have a small set of style changes along with functional changes in a PR, as long as they are in separate commits. But for a large set of style changes, the best practice is to separate them into another PR. We have this guideline in place actually trying to make the life easier for both the author and reviewer.
That is the context why we have this guideline in place. @powercode Hope you can understand, and thanks for your understanding! :) However, on the other hand, I totally understand that when making changes in unfamiliar code, style changes may have to be made to understand the code before making changes. This often makes the functional changes depend on the style changes that have already been made, and thus makes it hard to separate the style changes to another PR, especially if the author didn't have the It's a blurry line there, and I think in such a case, as long as style changes are in separate commits, we can be more accepting of the PR while reminding the author to have the @iSazonov has spent tons of effort cleaning up the code base and unifying the style/formatting to make the code base more readable and maintainable (I really appreciate your work @iSazonov!). But we are not quite there yet, the tools, rules, and guidance are still work-in-progress and we are looking for input. It's an on-going learning experience for all of us maintainers. |
@daxian-dbw The first two style commits contain controversial changes and gaps. I vote to remove them from the PR. I can grab the commits and open new PR. |
@iSazonov Can you point out the changes that you are concerned with? You don't need to point out every one of them. One example per category would be good enough. I should mention that I do notice some minor issues in the style changes, for example, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daxian-dbw I can not comment first commit - GitHub returns an error. I see there some style issues too.
return this.name; | ||
} | ||
} | ||
public string Name => this.name; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't use the pattern in our code. Seems it was rejected previously.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a new language feature introduced in C# 7.0 - expression-bodied members. Why can't we use this new language feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main problem we face is that historically many files have different formatting and style.
This complicates both reading the code and changing it. We are working on fixing this.
If we continue to create different styles, we will never bring our code into good condition.
So if we want to use the new feature, we need to first add it to our codinig guidelines, and then fix the entire code base (in a few commits).
bool hasCycle; | ||
PSMemberInfo returnValue; | ||
LookupMember(name, new HashSet<string>(StringComparer.OrdinalIgnoreCase), out returnValue, out hasCycle); | ||
LookupMember(name, new HashSet<string>(StringComparer.OrdinalIgnoreCase), out var returnValue, out var hasCycle); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
var in out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I noticed those instances, and was planning on fixing them in a separate PR.
@@ -536,18 +546,19 @@ private void LookupMember(string name, HashSet<string> visitedAliases, out PSMem | |||
name); | |||
} | |||
|
|||
PSAliasProperty aliasMember = member as PSAliasProperty; | |||
if (aliasMember == null) | |||
if (!(member is PSAliasProperty aliasMember)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Readability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I also noticed this, and plan to fix it. (as we discussed in another PR)
try | ||
{ | ||
SetterCodeReference.Invoke(null, new object[2] { this.instance, value }); | ||
SetterCodeReference.Invoke(null, new object[] {this.instance, value}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bug or no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a bug. The length of the array can be inferred from the initializer following new object[]
. You can even do new [] { ... }
, though I think this style is less preferred (BTW, I also notice one instance of this style in the changes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daxian-dbw Why do you think that style is less preferred?
To me, this is the same discussion we had in the C++ community, where people thought auto
was bad. Herb Sutter wrote this:
https://herbsutter.com/2013/08/12/gotw-94-solution-aaa-style-almost-always-auto/
Try to come up with a situation where
SetterCodeReference.Invoke(null, new object[] {this.instance, value});
is easier to understand than
SetterCodeReference.Invoke(null, new [] {this.instance, value});
Types are for the compiler to verify that we are not messing up our logic.
The essense of both lines are that we are invoking a Setter for a code reference, passing in the parameters. We know that the compiler is happy with the call. We guess that the array is the instance to set the value on, and the actual value.
This comes from good variable naming.
I would argue that explicit types has fairly limited value for readablility, and gives a lot of churn when doing refactorings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are so many places is code where you don't see the type. You just don't notice it.
Take this example:
void Foo(Bar bar)
{
_member.GetInstance(out var instance);
DoWork(GetData(), _context.Position, bar.Size, instance);
}
Of the arguments to DoWork, why do you consider not seeing the type of instance
as a problem, and not the types of GetData()
, _context.Position
, and bar.Size
?
It makes no sence - In that case, the guideline would have to be to write this as
void Foo(Bar bar)
{
_member.GetInstance(out MyInstance instance);
SomeData someData = GetData();
XPosition position = _context.Position;
StrangeSize strangeSize = bar.Size;
DoWork( someData, position, strangeSize, instance);
}
Most of us would reject that as crazy.
Why do you make a different call for out var
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daxian-dbw I could not able comment in first commit so I added comments in full list. (Not all issues was marked as you ask)
@@ -4735,17 +4621,19 @@ private void EnsureReservedMemberIsLoaded(string name) | |||
foreach (CollectionEntry<T> collection in Collections) | |||
{ | |||
Diagnostics.Assert(delegateOwner != null, "all integrating collections with non empty collections have an associated PSObject"); | |||
T memberAsT = collection.GetMember((PSObject)delegateOwner, name); | |||
T memberAsT = collection.GetMember((PSObject) delegateOwner, name); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daxian-dbw Extra space.
if (memberAsT != null) | ||
{ | ||
if (collection.ShouldCloneWhenReturning) | ||
{ | ||
memberAsT = (T)memberAsT.Copy(); | ||
memberAsT = (T) memberAsT.Copy(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daxian-dbw Extra space.
@@ -416,22 +433,15 @@ internal PSMemberInfo ReferencedMember | |||
/// <returns>a new PSMemberInfo that is a copy of this PSMemberInfo</returns> | |||
public override PSMemberInfo Copy() | |||
{ | |||
PSAliasProperty alias = new PSAliasProperty(name, ReferencedMemberName); | |||
alias.ConversionType = ConversionType; | |||
PSAliasProperty alias = new PSAliasProperty(name, ReferencedMemberName) {ConversionType = ConversionType}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daxian-dbw Skipped spaces in initialization {}
@@ -1412,16 +1390,13 @@ public override string ToString() | |||
/// <param name="variable">The variable to wrap</param> | |||
/// <exception cref="ArgumentException">for an empty or null name</exception> | |||
public PSVariableProperty(PSVariable variable) | |||
: base(variable != null ? variable.Name : null, null) | |||
: base(variable?.Name, null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daxian-dbw Readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would argue that
base(variable?.Name, null)
is more readable than
base(variable != null ? variable.Name : null, null)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first line is read the Name of variable or null.
The second line has to be parsed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And from our coding guidelines:
Use of new C# language syntax is encouraged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The syntax was introduced to address just this scenario. Once you have become used to it, it reads a lot better than the old, verbose, syntax.
returnValue.Append("}"); | ||
return returnValue.ToString(); | ||
} | ||
|
||
private Nullable<PSLanguageMode> _languageMode; | ||
private string _getterScriptText; | ||
private readonly PSLanguageMode? _languageMode; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daxian-dbw Should we fix all Nullable in our code base?
if (typesXmlMembers[member.Name] != null) | ||
var typesXmlMembers = typeTable.GetMembers(_mshOwner.InternalTypeNames); | ||
var typesXmlMember = typesXmlMembers[member.Name]; | ||
if (typesXmlMember is T) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daxian-dbw var
is bad readable here. And we should exclude typesXmlMember
(used once).
@powercode Thanks for excluding transformation whole collection. Good catch!
PSMemberSet retVal = null; | ||
|
||
retVal = TypeTableGetMemberDelegate<PSMemberSet>(this, TypeTable.PSStandardMembers); | ||
var retVal = TypeTableGetMemberDelegate<PSMemberSet>(this, TypeTable.PSStandardMembers); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daxian-dbw Bad readable var
.
{ | ||
return null; | ||
} | ||
return targetType?.Value as Type; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@daxian-dbw Readability?
if (currentMemberAsMemberSet == null || typeMemberAsMemberSet == null || | ||
!typeMemberAsMemberSet.InheritMembers) | ||
retValue.Remove(typeMember.Name); | ||
retValue.Add(typeMember.Copy()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to exclude this copying if we could.
@@ -1278,9 +1279,12 @@ internal void ReadHeader() | |||
{ | |||
_alreadyWarnedUnspecifiedName = alreadyWriteOutWarning; | |||
ReadHeader(); | |||
var prevalidated = false; | |||
var values = new List<string>(16); | |||
var builder = new StringBuilder(256); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@powercode Please explain why we use this init constants? (Why not 32 and 512?) I believe we need to add a good comment.
Also above we use the same values. I believe we should use readonly const
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are just reasonable guestimates of number of columns and line length - just so that we don't have to resize and reallocate so much.
Agree that we can make constants of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that the same constants is used in ReadHeader() and Import().
After we read header we already know needed size of the values
(header column count) and could use it in Import().
ReadHeader() is called once - seems make sense use init value only if we get a header from Header parameter - in the case we know size of the header.
Given that the discussion has been more focused on coding styles instead of the functional improvement, I will reorganize this PR to separate the first two commits to another PR. |
@daxian-dbw :) I have no objections what-so-ever to people changing code I submit. Less work for me! I was under the naive impression that the changes was non-controversial, since they were only formatting changes, and simplifications that took advantage of more consise ways of expressing the same meaning introduced by newer language versions. In the world I work, a formatting only change is mostly ignored, and the focus is on the bits that affect the customers, so I didn't perceive it as something that would generate much more work for you guys. I get that the team is more worried about how code style affect readability than I am - so far I have always been able to read the code, even if the style vary widely, and I frankly don't consider it on the top 10 list of things to prioritize on the project. What I do care about is finding the real issues in the code - like the constructor bug I had introduced. I do believe it is good, from a workflow perspecive, to start with a refactoring pass when you are about to do a change to some code. By that I mean a pass where no functionality is changed, only formatting and the way things are expressed. It may be introduction of new methods, etc, but functionally the same. Having the requirement that that pass should be in it's own PR seems a bit process-heavy to me, but if that is how you want to roll, I'll adapt. Just my €.02. |
By avoiding an expensive copying of members just to check for the existance of one of them, perf is significantly increased, and allocations are reduced.
Reduce allocations and GC pressure by preallocating and reusing StringBuilders and List<string> for line parsing. Use List<string> instead of Collection<string> to get fewer virtual calls and better inlining.
…le.GetMembers GetMembers(ConsolidatedString types)
OK, I have separated the original first 2 commits to #7446 @powercode Please take a look just to make sure I didn't mess up anything :) |
@@ -1235,9 +1241,11 @@ internal void ReadHeader() | |||
TypeName = ReadTypeInformation(); | |||
} | |||
|
|||
var values = new List<string>(valueCountGuestimate); | |||
var builder = new StringBuilder(lineLengthGuestimate); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should exclude the constants here because we read a header once per file - make no sense do the optimization.
After we read the header or get it from Header parameter we know the size of the header and values
- so we could use the size in Import()
method below.
As for 256 const in StringBuilder I agree for the PR. Later I want add optimizations in the parser (if you already haven't such plans).
And sorry... lineLengthGuestimate
should start with upper case LineLengthGuestimate
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does it make more sense to do extra work when it can be avoided?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with one minor comment.
@powercode Thanks again for the great improvement! |
@powercode Thanks for the great and very useful improvement! I often process logs and I are happy to get a quick Import-Csv cmdlet. Sorry again for noisy style comments. @SteveL-MSFT This can be an interesting blog post. |
PR Summary
Fixes #7112
Speed up the creation of PSObjects in CSV cmdlets by
NoteProperty
members._instanceMembers
collection with an initial capacity.StringBuilders
andList<string>
for line parsing in CSV cmdlets.List<string>
instead ofCollection<string>
to get fewer virtual calls and better inlining.TypeTable.GetMembers(ConsolidatedString types)
.Linq.Any()
with aList.Count > 0
in binder code.The main gain is from taking advantage of the fact that all objects created by
Import-Csv
have the same shape (the same properties).PR Checklist
.h
,.cpp
,.cs
,.ps1
and.psm1
files have the correct copyright headerWIP:
to the beginning of the title and remove the prefix when the PR is ready.[feature]
if the change is significant or affects feature testsThis change is