Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The last cell of an empty column read by ConvertFrom-Csv is inconsistently $Null #17702

Closed
iRon7 opened this issue Jul 18, 2022 · 21 comments
Closed
Labels
Needs-Triage The issue is new and needs to be triaged by a work group. Resolution-No Activity Issue has had no activity for 6 months or more WG-Cmdlets-Utility cmdlets in the Microsoft.PowerShell.Utility module

Comments

@iRon7
Copy link

iRon7 commented Jul 18, 2022

If the last cell of an unquoted Csv file is empty, the cmdlets ConvertFrom-Csv and Import-Csv inconsistently returns a $Null rather than am empty string.

Steps to reproduce

$Data = ConvertFrom-Csv @'
Id,Name,Note
01,John,
02,Jack,
03,Ryan,Just a note
04,Luke,
05,Noah,
'@

$Data.Note.ForEach{ if ($Null -eq $_) { 'Null' } else { 'String' } }

Expected behavior

String
String
String
String
String

Actual behavior

String
String
String
String
Null

As shown above all other cells in the above Note column are interpreted as string, except for the last cell with appears to be $Null. This would have been correct if the last comma was omitted but that is not the case.

Error details

No response

Environment data

Name                           Value
----                           -----
PSVersion                      7.2.5
PSEdition                      Core
GitCommitId                    7.2.5
OS                             Microsoft Windows 10.0.22000
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

See also: how to filter empty values in a column in powershell

@iRon7 iRon7 added the Needs-Triage The issue is new and needs to be triaged by a work group. label Jul 18, 2022
@iRon7 iRon7 changed the title The last cell of an empty column read by ConvertFrom-Csv in inconsistently $Null The last cell of an empty column read by ConvertFrom-Csv is inconsistently $Null Jul 18, 2022
@237dmitry
Copy link

Not valid CSV?

$Data = ConvertFrom-Csv @'
Id,Name,Note
01,John
02,Jack
03,Ryan,Just a note
04,Luke
05,Noah
'@

$Data.Note.ForEach{ $_ ? 'String' : 'Null' }

<#
Null
Null
String
Null
Null
#>

@iRon7
Copy link
Author

iRon7 commented Jul 18, 2022

@237dmitry

Not valid CSV?

The example you show is in fact less "valid" as the one issued because it has less delimiters in the data area than the header.
Besides, if you round trip your suggestion, it returns the same csv file as the issued one:

ConvertFrom-Csv @'
Id,Name,Note
01,John
02,Jack
03,Ryan,Just a note
04,Luke
05,Noah
'@ |ConvertTo-Csv -UseQuotes AsNeeded
Id,Name,Note
01,John,
02,Jack,
03,Ryan,Just a note
04,Luke,
05,Noah,

@237dmitry
Copy link

rather than am empty string.

Maybe in an empty cell, on the contrary, there should be $null, and not a string?

ArcoLinux_2022-07-18_18-10-53

@iRon7
Copy link
Author

iRon7 commented Jul 18, 2022

@237dmitry

Maybe in an empty cell, on the contrary, there should be $null, and not a string?

$_.Note ? is not a synonym (or antonym) for $Null -eq $_.Note ? (in the original issue).
$_.Note ? falsifies if either $_ is null ($Null) or an empty string ('')!

@brianstringfellow
Copy link

The problem also shows up with Import-Csv, which uses the same internal class ImportCsvHelper. The case encountered was when a CSV was modified and removed the last newline from the file. Another case was when an editor (VS code extension) was used to edit the CSV and it also removed the last newline from the file. In the image below, the left side returned an unexpected $null value for the last row Note property while the right side returned an empty string '' as expected.

Example
image

@iSazonov
Copy link
Collaborator

A comment in the code explicitly says it is "by-design" (for all rows where no data is for tail columns):

// If no value was present in CSV file, we write null.
if (i < values.Count)
{
value = values[i];
}

Changing this would be a breaking change. So, it is question for WG should be change the design and accept the breaking change.

@iRon7
Copy link
Author

iRon7 commented Sep 22, 2022

@iSazonov,
That would be the issue if the last row would have less delimiters than the header (and the other rows) which is not the case. In other words, I don't think that the snippet you point out is the issue also knowing that all other lines (except for the last line) do correctly return an empty string (rather than $Null). As far as I can determine, the issue is that ReadChar throws an InvalidOperationException at the EOF which causes the very last value to be omitted rather than to be an empty string.

@dkaszews
Copy link
Contributor

@iSazonov Agreed with @iRon7 that this comment is not applicable here, as we see only an issue on the very last row, so it is likely caused by EOF

@iSazonov
Copy link
Collaborator

@dkaszews You say about edge case. I ask about common case - should we write null or empty string for absent data in any cell.
Demo (see line 2):

$Data = ConvertFrom-Csv @'
Id,Name,Note
01,John,
02,Jack
03,Ryan,Just a note
04,Luke,
'@

$Data.Note.ForEach{ if ($Null -eq $_) { 'Null' } else { 'String' } }

@dkaszews
Copy link
Contributor

@iSazonov From original example, it should be String, Null, String, String - default value if there is a comma but value is empty, null if there are not enough commas so some columns are completely missing. We are focusing on the last column, but this should apply to every one:

$Data = ConvertFrom-Csv @'
Id,Note,Name
01,,John
02,
03,Just a note,Ryan
04,,Luke
'@

The way I see it, first you split on the delimiter, which gives you array of potentially empty values, then pad with nulls to correct length. With the examples present, it somehow also depends on the very presence of another row after, so reordering rows or truncating the table can change types.

@iSazonov
Copy link
Collaborator

I tend to think that nulls make no sense at all for CSV.

$a = @{q = $null}
$a | convertTo-Csv
"q"

ConvertTo-Csv: Object reference not set to an instance of an object.

It is a bug we must fix. But no null literals are in CVS format. So we have to output empty string. Then we could conclude it makes no sense to designate nulls and empties on read too. This looks more consistent.

@dkaszews
Copy link
Contributor

no null literals are in CVS format.

Fair, but I'm afraid that changing those nulls may be a breaking change 😕 . That's why I would rather limit the fix to just consistency, so that shuffling, truncating or separating rows does not change their output.

@iSazonov
Copy link
Collaborator

Fair, but I'm afraid that changing those nulls may be a breaking change

That's what I originally thought too. But I can't think of a scenario where that would be appropriate. _The main thing here is that CSV has no explicit constant for null._If someone wants to reliably handle nulls, he'll obviously have to come up with such a constant for himself. Only for a specific application or script - it won't work anywhere else.

(So technically it's a breaking change bracket 3 - it is unbelievable that something will be destroyed.)

@dkaszews
Copy link
Contributor

I guess, it only comes up in malformed CSVs with inconsistent number of columns. So that's what we in C++ world called Undefined Behavior - program can do whatever it wants to. And if somebody has to deal with malformed CSVs, they should program it defensively. Source: used to do processing of malformed CSVs in my previous job.

@iRon7
Copy link
Author

iRon7 commented Sep 29, 2022

Fair, but I'm afraid that changing those nulls may be a breaking change

Then we could conclude it makes no sense to designate nulls and empties on read too

For reading, it might break a simple condition as below
(even for this initial issue, but I still think it should be consistent with the rest of the rows):

if ($Null -eq $Data[4].Note) { ...

or if $Data would be a single object (from a single csv row)

if ($Null -eq $Data.Note) { ...

Or (incorrectly) having $Null add the right hand side (member access enumeration):

if ($Data.Note -eq $Null) { ...

@iSazonov
Copy link
Collaborator

@iRon7 These are obvious code snippets. The right question is whether there are popular applications or services that create such corrupted files that could lead to code like yours. I don't know of any. And I can't think of a reason why they would be created. From this I conclude that we should get rid of nulls.

@iSazonov
Copy link
Collaborator

Source: used to do processing of malformed CSVs in my previous job.

@dkaszews Is the source of these files in common use? And what did you do with the nulls?

@dkaszews
Copy link
Contributor

dkaszews commented Sep 29, 2022

@iSazonov It was random data downloaded all over the internet, mostly how words are related to each other in different languages. Much of it was poorly exported spreadsheets. I had to safeguard against stuff like random mismatched quotes or u escaped commas and discard rows just to not corrupt everything else, otherwise we had a stuff like 10k rows parsed as a single value. But I was in a different situation because I could discard those rows I did not like, as the entire project was about statistical analysis.

@iSazonov
Copy link
Collaborator

@dkaszews Thanks! It is not a scenario we should care.

@microsoft-github-policy-service microsoft-github-policy-service bot added the Resolution-No Activity Issue has had no activity for 6 months or more label Nov 15, 2023
@brianstringfellow
Copy link

Still reproduces in:
PS>$PSVersionTable

Name Value


PSVersion 7.3.9
PSEdition Core
GitCommitId 7.3.9
OS Microsoft Windows 10.0.22621
Platform Win32NT
PSCompatibleVersions {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion 2.3
SerializationVersion 1.1.0.1
WSManStackVersion 3.0

Copy link
Contributor

This issue has been marked as "No Activity" as there has been no activity for 6 months. It has been closed for housekeeping purposes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs-Triage The issue is new and needs to be triaged by a work group. Resolution-No Activity Issue has had no activity for 6 months or more WG-Cmdlets-Utility cmdlets in the Microsoft.PowerShell.Utility module
Projects
None yet
Development

No branches or pull requests

5 participants