Feature Request: Add OutputType parameter to Import-Csv #8862

Closed
powercode opened this issue Feb 10, 2019 · 12 comments
Labels
Issue-Enhancement the issue is more of a feature request than a bug Resolution-No Activity Issue has had no activity for 6 months or more

Comments

@powercode
Collaborator

powercode commented Feb 10, 2019

Summary of the new feature/enhancement

By using a concrete type, instead of PSObject, both import speed and memory usage can be improved.

As it is today, Import-Csv is almost useless for larger datasets, since the overhead of our NoteProperties is 48 bytes, not counting the name and the value. When the imported values are integers, that is a blowup-factor of ~20.

These numbers are from my prototype:

Destination               Time    Memory used
PSObject                  01:34   19 GB
class with int props      01:02   490 MB
class with string props   00:10   4.2 GB

By keeping it as strings, the import speed is vastly improved. By converting to integers, the speed is still improved, and the memory requirements are vastly improved.
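For anyone who wants to get a rough feel for the per-value overhead on their own machine, here is a minimal sketch. It is not the prototype's benchmark: the in-memory data set, the column names A, B, C and the use of `[GC]::GetTotalMemory` as a (coarse) estimate are all assumptions made for illustration.

```powershell
# Build a small CSV in memory: a header plus 50,000 rows of three integers each.
$lines = @('A,B,C') + (1..50000 | ForEach-Object { "$_,$_,$_" })

# Force a collection before and after so the two readings are comparable.
$before = [GC]::GetTotalMemory($true)
$rows   = $lines | ConvertFrom-Csv      # one PSObject with three NoteProperties per row
$after  = [GC]::GetTotalMemory($true)

# Approximate managed bytes retained per imported value (an int would need 4).
$bytesPerValue = ($after - $before) / ($rows.Count * 3)
'{0:N0} bytes per value (rough estimate)' -f $bytesPerValue
```

The figure varies with the runtime and the data, so treat it as an order-of-magnitude check only.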

Proposed technical implementation details (optional)

See #8860.

The gist is to generate expression trees that either set the properties on an instance of the provided type or call its constructor.

Using the constructor allows for custom type conversion in cases where there is no language conversion from string to the property type.

The type needs to have members that match the names of the columns in the CSV.
Maybe we should provide a way of providing alternate headers to map to existing objects?

Data.csv

Text,Integer,Date
Hi,42,2016-12-24
Bye,4711,2016-12-25

class MyCsv {
   [string] $Text
   [int] $Integer
   [DateTime] $Date
}
$d = Import-Csv -OutputType ([MyCsv]) -Path .\data.csv
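Since -OutputType is only a proposal, here is a hedged sketch of how a similar result can be approximated with existing cmdlets, by casting a hashtable built from each row to the class. It still creates a temporary PSObject per row, so it presumably does not give the import speed-up of the prototype. The here-string stands in for data.csv so that the snippet is self-contained.

```powershell
class MyCsv {
    [string]   $Text
    [int]      $Integer
    [DateTime] $Date
}

# Stand-in for .\data.csv
$csv = @'
Text,Integer,Date
Hi,42,2016-12-24
Bye,4711,2016-12-25
'@

# Cast a hashtable to the class: PowerShell creates the instance and
# converts each string value to the declared property type.
$rows = $csv | ConvertFrom-Csv | ForEach-Object {
    [MyCsv]@{ Text = $_.Text; Integer = $_.Integer; Date = $_.Date }
}

$rows | Format-Table
```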

I also implemented constructor calls, which take precedence.

class MyCsv2 {
   # The constructor receives the column values and may transform them.
   MyCsv2([string] $text, [int] $integer, $date){
       $this.Name = $text
       $this.Number = $integer * 100
       $this.When = $date
   }
   [string] $Name
   [long] $Number
   [DateTime] $When
}
$e = Import-Csv -OutputType ([MyCsv2]) -Path .\data.csv

I would like to see a discussion about the feature set, error handling, names for parameters etc.

@powercode powercode added the Issue-Enhancement the issue is more of a feature request than a bug label Feb 10, 2019
@powercode
Collaborator Author

This really got a lot of traction :)
Seems like an area that comes up as problematic in the wild from time to time.

@mklement0
Contributor

mklement0 commented Nov 9, 2019

Only just saw this now - seems well worth doing.

@mklement0
Contributor

mklement0 commented Nov 9, 2019

problematic in the wild from time to time.

Just came across https://stackoverflow.com/q/58660818/45375, where an out-of-memory exception occurred even in a streaming scenario; that is, the objects weren't even collected in full in memory and instead just piped back to Export-Csv.

Is the problem in this case one of mounting memory pressure due to lack of garbage collections? Would it make sense to build periodic garbage collection into the command?

@mklement0
Contributor

mklement0 commented Feb 22, 2020

https://stackoverflow.com/a/60356120/45375 may give this issue a bit more exposure.

@iRon7

iRon7 commented Feb 23, 2020

Shouldn't this be done (or also be possible) via a calculated property, like:

$e = Import-Csv -Path .\data.csv -Property @{Name = 'Name'; Type = [string]},
    @{Name = 'Number'; Type = [long]},
    @{Name = 'When'; Type = [DateTime]}

Where the default type is a PSNoteProperty.

I guess that a calculated Type property attribute also makes sense for (some of) the existing cmdlets (e.g. Sort-Object) that support calculated properties via the -Property parameter.

On second thought, I think this isn't possible: the property types of a PSCustomObject can't be changed themselves, only the type of the object contained by the PSNoteProperty, and that wouldn't save any memory... 😒

@iRon7

iRon7 commented Feb 23, 2020

Yet another thought to consider:
A [DataTable] (with [string] type columns) appears to consume a little more memory than a class with string props (~4.5 GB) but might just require a simple -AsDataTable switch. Besides, a [DataTable] also easily converts into a [PSCustomObject[]].
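A rough sketch of what such a switch might do, written with existing cmdlets only (-AsDataTable itself is hypothetical, and the sample data is the one from the original post). Rows are streamed through ConvertFrom-Csv, so only the DataTable is retained:

```powershell
# Stand-in for .\data.csv
$csv = @'
Text,Integer,Date
Hi,42,2016-12-24
Bye,4711,2016-12-25
'@

$table = [System.Data.DataTable]::new('Data')

$csv | ConvertFrom-Csv | ForEach-Object {
    # Create one [string] column per CSV column when the first row arrives.
    if ($table.Columns.Count -eq 0) {
        foreach ($p in $_.PSObject.Properties) {
            [void] $table.Columns.Add($p.Name, [string])
        }
    }
    # Copy the row's values into a new DataRow.
    $row = $table.NewRow()
    foreach ($p in $_.PSObject.Properties) {
        $row[$p.Name] = $p.Value
    }
    $table.Rows.Add($row)
}

$table.Rows.Count
```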

@mklement0
Contributor

mklement0 commented Feb 23, 2020

Interesting ideas, @iRon7, but they are complementary to what is being proposed here, so I encourage you to create new issues:

  • Even with the current, [pscustomobject]-only output, being able to specify column (property) types could be helpful - possibly combined with the next proposal.

    • However, my sense is that concisely specifying types is more important than also being able to rename columns, so something like -ColumnType @{ Id = [long]; Date = [datetime] }, which would allow you to specify output types for a given subset of columns identified by name would make more sense to me; those columns not mentioned would remain [string]-typed.
  • Producing optimized standard-type output such as [DataTable] would also be handy, as a simpler alternative to creating a custom output type up front (as proposed here).
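To make the -ColumnType idea from the first bullet concrete, here is a sketch that emulates it with a pipeline function. The function name Convert-CsvColumnType is hypothetical, as is the parameter on a function rather than on Import-Csv; columns not listed in the hashtable stay [string]-typed.

```powershell
# Hypothetical helper: converts the named columns of each row to the given types.
function Convert-CsvColumnType {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory, ValueFromPipeline)] [psobject] $InputObject,
        [Parameter(Mandatory)] [hashtable] $ColumnType
    )
    process {
        foreach ($name in $ColumnType.Keys) {
            $type = $ColumnType[$name]
            # Throws if the value cannot be converted to the requested type.
            $InputObject.$name = [System.Management.Automation.LanguagePrimitives]::ConvertTo($InputObject.$name, $type)
        }
        $InputObject
    }
}

# Stand-in for .\data.csv
$csv = @'
Text,Integer,Date
Hi,42,2016-12-24
Bye,4711,2016-12-25
'@

$rows = $csv | ConvertFrom-Csv |
    Convert-CsvColumnType -ColumnType @{ Integer = [long]; Date = [datetime] }
```

Note that the output is still [pscustomobject], so this only addresses typing, not the per-property overhead discussed in this issue.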

Also, something that probably fits better into the context of this issue and the associated PR (#8860) in terms of implementation, is what @bergmeister has suggested before (emphasis added):

Do you think it would make sense to let the cmdlet create a default type of the ResultType based on the CSV header (i.e. create a default class with string properties for-each column)? This way the average consumer would still benefit from it without having to specify complex parameters and defining the custom class would then be an additional, optional optimisation on top of it?

@iRon7

iRon7 commented Jan 9, 2023

Despite my own -AsDataTable (#11941) proposal, I guess that one of the lightest tables possible (that can contain a CSV table) is a table that contains just column names and rows of (string) data...
🤔, wait, that definition is in fact very close to: a CSV table...
In other words, rather than trying to conserve memory by storing the table as something other than PowerShell objects, such as a lighter class or a DataTable, you could simply consider storing the CSV text itself...

Instead of:

$e = Import-Csv -Path .\data.csv
$e | Foreach-Object { <process your item> } | <output your results>

Keep your CSV data as it is:

$e = Get-Content -Path .\data.csv
$e | ConvertFrom-Csv | Foreach-Object { <process your item> } | <output your results>

Contributor

This issue has not had any activity in 6 months, if this is a bug please try to reproduce on the latest version of PowerShell and reopen a new issue and reference this issue if this is still a blocker for you.

@microsoft-github-policy-service microsoft-github-policy-service bot added Resolution-No Activity Issue has had no activity for 6 months or more labels Nov 16, 2023
Contributor

This issue has been marked as "No Activity" as there has been no activity for 6 months. It has been closed for housekeeping purposes.
