Skip to content

csv parser used in import alias refactor #9091

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 95 commits into from

Conversation

SytzeAndr
Copy link
Contributor

@SytzeAndr SytzeAndr commented Mar 8, 2019

PR Summary

The old implementation used a for loop with lots of if statements to parse a csv string, which is used in the import-alias cmdlet. I think the old implementation is bug sensitive and hard to read. A new implementation can have a reduced complexity and uses the StringReader and StringBuilder to convert a string in csv format to a collection of elements, while keeping functionality intact. Experimental analysis shows this new implementation performs better, especially for large Strings.

PR Checklist

@msftclas
Copy link

msftclas commented Mar 8, 2019

CLA assistant check
All CLA requirements met.

@vexx32
Copy link
Collaborator

vexx32 commented Mar 8, 2019

@SytzeAndr the # case is for when the CSV contains a type hint like it used to in Windows PowerShell -- CSVs generated without the -NoTypeInformation would have a one-line header with something like this:

PS Z:\> [pscustomobject]@{a = 1} | ConvertTo-Csv
#TYPE System.Management.Automation.PSCustomObject
"a"
"1"

@SytzeAndr
Copy link
Contributor Author

SytzeAndr commented Mar 8, 2019

@SytzeAndr the # case is for when the CSV contains a type hint like it used to in Windows PowerShell -- CSVs generated without the -NoTypeInformation would have a one-line header with something like this:

PS Z:\> [pscustomobject]@{a = 1} | ConvertTo-Csv
#TYPE System.Management.Automation.PSCustomObject
"a"
"1"

Thanks for the comment, I'll then reuse the part from the old implementation for when to return an empty collection.

@iSazonov
Copy link
Collaborator

iSazonov commented Mar 9, 2019

@SytzeAndr Thanks for your contribution!

We are working on enhancing the cmdlet and have plans to improve performance.
I am concerned that your change affects performance and adds extra allocations. Did you measure?

@SytzeAndr
Copy link
Contributor Author

SytzeAndr commented Mar 9, 2019

@iSazonov I haven't tested it yet, but performance is one of the reasons I think refactoring this code would make sense. The old implementation is synchronous, but it should be possible to make this an asynchronous process instead of iterating over all the characters which should be faster. (edit: I realized this is only possible if the order in which the elements are in result is not important)

I haven't set up the testing environment locally (so therefore WIP), but when I do and the tests are passing I'll do a measure in terms of performance against the old implementation aswell.

@powercode
Copy link
Collaborator

Are you aware that this method is only called from the Import-Alias cmdlet?

The implementation of Import-Csv is in the file CSVCommands.cs, and the bulk of the work is done in the ImportCsvHelper class. That said, almost all of the cost of this operation is in creating the resulting objects. IIRC, the stream parsing was below 5%. And while the first rule of performance is to measure, in this case I would venture to speak from experience and say that using Regex is not the way to improve the performance.

@powercode
Copy link
Collaborator

@SytzeAndr If you want to improve performance, I would suggest a different approach. Start with a scenario you know is slow. Then run it under a profiler, and try to understand if there are any bottlenecks.

Then try to understand the code around the bottleneck - think about it, why is it written the way it is, how could it be done instead etc. Keep recordings of the original scenario. Try out changes, and measure them, while at the same time making sure that the tests that covers your changes still passes.

The area you have selected here is very hard to improve significantly upon, especially since it is not in the hot path.

@SytzeAndr
Copy link
Contributor Author

SytzeAndr commented Mar 12, 2019

some updates on this PR:

I've implemented the method by using the StringReader and StringBuilder. It currently passes all CI tests, so I assume my implementation is fine.

I assume use of the Stringbuilder has a positive impact on performance since the string class builds a new string at each operation while the stringbuilder maintains it.

I've reduced the complexity of this method according to CodeFactor (and is regarded as being a fixed issue to the repo). Although I still think it can be improved by splitting some things and might do so.

@powercode I used this PR aswell to get a bit used to contributing to powershell, and I chose to tackle an issue which could be fixed with some refactoring. Therefore, I didn't start with analyzing the bottlenecks since I had to set up things yet and wanted to start small by just refactoring some code. I agree with you in the sense that it would make more sense to analyze from a bit higher perspective to find the true bottlenecks.

So the next step for me is to test it for performance against the old implementation, after that I'll remove the WIP tag (assuming performance is not worse).
Is there some simple way of checking the performance/runtime by checking the CI runtime results for different branches, or is it better to do some local performance-based-tests myself of the different implementations and report them here?

@SytzeAndr
Copy link
Contributor Author

@Geweldig makes sense, thanks for the suggestions!

@SytzeAndr
Copy link
Contributor Author

Sorry for the delay. I managed to get rid of the try/catch and KeyValuePair in CreateItemOptions. Tests are passing locally. Can I get a new review?

@TravisEz13 TravisEz13 added the CL-General Indicates that a PR should be marked as a general cmdlet change in the Change Log label Jun 25, 2019
@anmenaga
Copy link

anmenaga commented Jul 8, 2019

@iSazonov @daxian-dbw do you see any pending issues with this PR?

Copy link

@anmenaga anmenaga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SytzeAndr Please fix Codacy issues in the new code: specifically about converting 'CreateItemOptions','ConstructAlias','IsValidParsedLine' to be static methods. Thank you.

@SytzeAndr
Copy link
Contributor Author

SytzeAndr commented Jul 13, 2019

Hi, sorry for the radio silence. I fixed the ReadOnlySpan trimming

@SytzeAndr Please fix Codacy issues in the new code: specifically about converting 'CreateItemOptions','ConstructAlias','IsValidParsedLine' to be static methods. Thank you.

Regarding static methods, I didn't made them static due to the following reasons. I'm not sure why codefactor doesn't pick these up. Let me know if you agree, if so I suspect it is fine for merging.

  • I couldn't change CreateItemOptions to static due to a call to the public method WriteError.
  • ConstructAlias uses Context at line 434, which is an internal of the abstract class InternalCommand.Context and thus can be derived by calling this.Context. I changed Contect to this.Context and now Codacy doesn't complain anymore.
  • IsValidParsedLine can't be made static due to a call to the public void ThrowTerminatingError.

while (i < csvTrimmed.Length && inQuotes)
{
i++;
nextChar = csvTrimmed[i];
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Boundary after i++ can be = csvTrimmed.Length:

 while (++i < csvTrimmed.Length && inQuotes)

Copy link
Contributor Author

@SytzeAndr SytzeAndr Jul 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't that introduce a bug? Replacing i with ++i would increment i when the conditional statement is checked, regardless whether inQuotes is true or false. In other words, when i is at the index of the second " (the escape quote), it will increment i one additional time, skipping the first character after we escape the quote sequence.

For example, a"b"cd would then be parsed to abd, which is incorrect as it should be parsed to abcd.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iSazonov , can you please looks at this again? thank you.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My comment is only about that there can be boundary overflow because of i++ without follow check.

Copy link
Contributor Author

@SytzeAndr SytzeAndr Jul 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true but would require some different solution, which is replacing line 147 with i + 1 < csvTrimmed.Length && inQuotes.

I just commited and pushed this change to this branch

@anmenaga
Copy link

@daxian-dbw Can you please review this again? thank you.


private static Collection<string> ParseCsvLine(string csv)
{
ReadOnlySpan<char> csvTrimmed = csv.Trim();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

csv should be ReadOnlySpan to avoid allocation here.

@ghost ghost added the Review - Needed The PR is being reviewed label May 27, 2020
@ghost
Copy link

ghost commented May 27, 2020

This pull request has been automatically marked as Review Needed because it has been there has not been any activity for 7 days.
Mainainer, Please provide feedback and/or mark it as Waiting on Author

@iSazonov
Copy link
Collaborator

Since there is no activity for a long time and this is too edge case, we can close the PR.

@iSazonov iSazonov closed this May 27, 2020
@iSazonov iSazonov removed the Review - Needed The PR is being reviewed label May 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CL-General Indicates that a PR should be marked as a general cmdlet change in the Change Log
Projects
None yet
Development

Successfully merging this pull request may close these issues.