Specify the columns to read from file #154

daschw · 2018-01-18T14:58:56Z

I often have to deal with CSV files that where edited by someone in Excel resulting in something like
test.csv:

datacol1, datacol2, ,
0, 1, , [some annoying comment]

When I try to read such a file (CSV.read("test.csv")) I get

MethodError: Cannot `convert` an object of type WeakRefString{UInt8} to an object of type Missings.Missing
This may have arisen from a call to the constructor Missings.Missing(...),
since type constructors fall back to convert methods.
setindex!(::Array{Missings.Missing,1}, ::WeakRefString{UInt8}, ::Int64) at array.jl:583
streamto! at io.jl:303 [inlined]
macro expansion at DataStreams.jl:547 [inlined]
stream!(::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},Missings.Missing}, ::Type{DataStreams.Data.Field}, ::DataFrames.DataFrameStream{Tuple{Array{Int64,1},Array{Int64,1},Array{Missings.Missing,1},CategoricalArrays.CategoricalArray{String,1,UInt32,String,CategoricalArrays.CategoricalString{UInt32},Union{}}}}, ::DataStreams.Data.Schema{true,Tuple{Int64,Int64,Missings.Missing,CategoricalArrays.CategoricalValue{String,UInt32}}}, ::Int64, ::NTuple{4,Base.#identity}, ::DataStreams.Data.##15#16, ::Array{Any,1}, ::Type{Ref{(:datacol1, :datacol2, Symbol(""), Symbol(""))}}) at DataStreams.jl:614
#stream!#17(::Bool, ::Dict{Int64,Function}, ::Function, ::Array{Any,1}, ::Array{Any,1}, ::Function, ::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},Missings.Missing}, ::Type{DataFrames.DataFrame}) at DataStreams.jl:490
(::DataStreams.Data.#kw##stream!)(::Array{Any,1}, ::DataStreams.Data.#stream!, ::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},Missings.Missing}, ::Type{DataFrames.DataFrame}) at <missing>:0
#read#43(::Bool, ::Dict{Int64,Function}, ::Bool, ::Array{Any,1}, ::Function, ::String, ::Type{T} where T) at Source.jl:312
read(::String) at Source.jl:311
include_string(::String, ::String) at loading.jl:522
eval(::Module, ::Any) at boot.jl:235
(::Atom.##63#66)() at eval.jl:104
withpath(::Atom.##63#66, ::Void) at utils.jl:30
withpath(::Function, ::Void) at eval.jl:38
macro expansion at eval.jl:103 [inlined]
(::Atom.##62#65{Dict{String,Any}})() at task.jl:80

It would be nice to be able to specify the columns to read with a keyword like Pandas' usecols to be able to easily avoid such issues.

The text was updated successfully, but these errors were encountered:

quinnj · 2018-01-18T15:09:22Z

Coming soon; stay tuned.

Nosferican · 2018-03-25T22:11:13Z

R's readr uses col_types to specify which columns to read and how.
CSV.read(; types::Vector{DataType}) could use Nothing for skipping that column...

quinnj · 2018-09-11T04:00:42Z

On master, using CSV.File, you can now get this by iterating a File object, like:

f = CSV.File(file)
for row in f
    println("a=$(row.a), b=$(row.b)")
end

This will only parse the a and b columns in the file, skipping the others. Users should note that it's most performant to access columns in sequential order (i.e. it's optimized for the normal case where people just read the whole file), but it's completely valid to access, say, in reverse order if so desired.

Nosferican · 2018-09-11T05:32:24Z

Would it be able to use that through a keyword argument in CSV.read (e.g., col_types = "ccd_i")?

nalimilan · 2018-09-11T12:33:37Z

I agree, it sounds like a legitimate request to have a convenient way of getting a DataFrame (or another structure) with only the specified columns.

quinnj · 2018-09-13T03:59:42Z

Nah, I don't think putting this in the function is the right abstraction. For example, I have a lazy Select transformation object that operates on any Tables.jl table and implements itself the interface. So, a user would be able to do:

df = CSV.File("cool_file.csv") |> select(:a, :b) |> DataFrame

to select just the a and b columns from a csv file and materialize it in a DataFrame and the csv parsing will only read the a and b columns. I wouldn't want users to think this was functionality only for CSV.jl :)

I'll leave this issue open until we decide on what to do w/ things like Select; @piever, @joshday, @andyferris, and myself have been discussing it a bit.

nalimilan · 2018-09-13T09:32:09Z

Makes sense. Though if we keep the CSV.read convenience function, supporting this via an argument wouldn't be absurd.

Nosferican · 2018-09-13T09:36:22Z

I like the lazy pipeline, but agree on the convenience if read is still available. The other aspect is that the coltypes interface does not only subset columns, but also gives a simple way to specify the type it should be parsed as (e.g., zipcodes being parsed as Integer instead of String).

piever · 2018-09-13T10:07:58Z

Definitely the pipeline is appealing (using Tables and the related querying framework) but I agree that it's good to have a keyword argument as a shorthand. I wonder if it can be somehow combined with coltypes. For example CSV.read(fn, select = (:height => Float64, :name => String)) would mean "read only columns height and name, parsing the first as Float64 and the second as String" (just throwing out ideas here - haven't thought deeply about the implications of this design).

quinnj · 2018-09-18T12:27:07Z

The problem here is that trying to support select as a keyword argument is actually much harder than having it be an external construct. To try to support this internally, we'd have to essentially re-implement the external construct internally, by keeping track of the columns the user selected, while also keeping track of the full file's columns (in case they select out-of-order columns). So while I agree it could be nice to have a keyword argument, it just doesn't seem very practical; that's a lot of extra work for functionality that's already available.

nalimilan · 2018-09-18T12:32:28Z

When you say "internally", do you mean inside CSV.File or inside CSV.read? I would think it's easy to do for the latter since it's already "external" to the CSV.File main logic.

quinnj · 2018-09-18T12:34:41Z

Yeah, but then we get this weird inversion of dependencies: we'd have to have CSV.jl depend on some "TableOperations.jl" package that provided select functionality, just so users could call CSV.read(file; select=...) w/ a keyword argument instead of keeping everything composable and separate. That feels pretty icky to me.

andyferris · 2018-09-18T12:34:56Z

I tend to agree with Jacob here. My philosophy is that composable tools are beneficial for both library writers and users, and this is a great example of that where .csv files are loaded lazily and the columns can be picked later.

nalimilan · 2018-09-18T12:39:49Z

Yes, having CSV.jl depend on another data package isn't ideal. Let's see how the ecosystem evolves first then.

daschw · 2018-09-18T14:39:42Z

I just realized that the example I posted in the original issue description still is not working. If I try to do something like

f = CSV.File(file)
for row in f
    println("a=$(row.a), b=$(row.b)")
end

with a file test1.csv that looks like this:

a, b, ,
0, 1, , comment
12, 5, ,

I get

f = CSV.File("test1.csv")
ERROR: BoundsError: attempt to access ""
  at index [1]
Stacktrace:
 [1] checkbounds at .\strings\basic.jl:193 [inlined]
 [2] codeunit at .\strings\string.jl:87 [inlined]
 [3] getindex at .\strings\string.jl:206 [inlined]
 [4] normalizename(::String) at C:\Users\Daniel\.julia\packages\CSV\dpCnm\src\filedetection.jl:11
 [5] (::getfield(CSV, Symbol("##21#23")){Bool})(::Tuple{Int64,String}) at .\none:0
 [6] iterate at .\generator.jl:47 [inlined]
 [7] collect_to!(::Array{Symbol,1}, ::Base.Generator{Base.Iterators.Enumerate{Array{Union{Missing, String},1}},getfield(CSV, Symbol("##21#23")){Bool}}, ::Int64, ::Tuple{Int64,Int64}) at .\array.jl:656
 [8] collect_to_with_first! at .\array.jl:643 [inlined]
 [9] collect(::Base.Generator{Base.Iterators.Enumerate{Array{Union{Missing, String},1}},getfield(CSV, Symbol("##21#23")){Bool}}) at .\array.jl:624
 [10] _totuple at .\tuple.jl:261 [inlined]
 [11] Type at .\tuple.jl:243 [inlined]
 [12] datalayout(::Int64, ::Parsers.Delimited{false,Parsers.Quoted{Parsers.Strip{Parsers.Sentinel{typeof(Parsers.defaultparser),Parsers.Trie{0x00,false,missing,2,Tuple{}}}}},Parsers.Trie{0x00,false,missing,8,Tuple{Parsers.Trie{0x2c,true,missing,8,Tuple{}},Parsers.Trie{0x0a,true,missing,8,Tuple{}},Parsers.Trie{0x0d,true,missing,8,Tuple{Parsers.Trie{0x0a,true,missing,8,Tuple{}}}}}}}, ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int64, ::Bool) at C:\Users\Daniel\.julia\packages\CSV\dpCnm\src\filedetection.jl:123
 [13] #File#1(::Int64, ::Bool, ::Int64, ::Nothing, ::Int64, ::Nothing, ::Bool, ::Nothing, ::Bool, ::Array{String,1}, ::String, ::String, ::Bool, ::Char, ::Nothing, ::Nothing, ::Char, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Dict{Type,Type}, ::Symbol, ::Bool, ::Bool, ::Bool, ::Type, ::String) at C:\Users\Daniel\.julia\packages\CSV\dpCnm\src\CSV.jl:288
 [14] CSV.File(::String) at C:\Users\Daniel\.julia\packages\CSV\dpCnm\src\CSV.jl:263
 [15] top-level scope at none:0

So, selecting valid columns is only possible if all columns are valid?

…a manually iterating CSV.File or using Tables.select, but there was another issue where this file had an invalid column name that dies while trying to normalize the name.

quinnj · 2019-02-05T15:33:50Z

With CSV.jl current release, I can do the following:

julia> csv = """a,b,,
       0, 1, , comment
       12, 5, ,
       """
"a,b,,\n0, 1, , comment\n12, 5, ,\n"

julia> df = CSV.File(IOBuffer(csv)) |> Tables.select(:a, :b) |> DataFrame
2×2 DataFrame
│ Row │ a      │ b      │
│     │ Int64⍰ │ Int64⍰ │
├─────┼────────┼────────┤
│ 1   │ 0      │ 1      │
│ 2   │ 12     │ 5      │

Note that in my example, the header row looks like a,b,,, which produces column names a and b, while in the original example, the header row is a, b, , which produces column names like a, Symbol(" b"). This also caused the BoundsError @daschw saw in his most recent post. I've put up a PR (#382) to fix the BoundsError case, so in @daschw 's example, he would just need to do CSV.File(file; normalizenames=true). Then manually iterating the CSV.File and select the a and b columns or doing Tables.select should work for this.

daschw · 2019-02-06T11:05:29Z

Great, thanks a lot @quinnj !

skanskan · 2020-01-28T00:09:07Z

Then, can we use a "select=" keyword?
Reading the whole file and then selecting the columns it's not as efficient and can even be impossible if the file is larger than the available memory.

quinnj closed this as completed Sep 11, 2018

nalimilan reopened this Sep 11, 2018

quinnj mentioned this issue Oct 3, 2018

Handling of an explicit header #316

Closed

quinnj closed this as completed in da5bcbb Feb 5, 2019

anandijain mentioned this issue Oct 2, 2019

CSV-row filtering when reading #503

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Specify the columns to read from file #154

Specify the columns to read from file #154

daschw commented Jan 18, 2018

quinnj commented Jan 18, 2018

Nosferican commented Mar 25, 2018

quinnj commented Sep 11, 2018

Nosferican commented Sep 11, 2018 •

edited

nalimilan commented Sep 11, 2018

quinnj commented Sep 13, 2018

nalimilan commented Sep 13, 2018

Nosferican commented Sep 13, 2018

piever commented Sep 13, 2018

quinnj commented Sep 18, 2018

nalimilan commented Sep 18, 2018

quinnj commented Sep 18, 2018

andyferris commented Sep 18, 2018

nalimilan commented Sep 18, 2018

daschw commented Sep 18, 2018

quinnj commented Feb 5, 2019

daschw commented Feb 6, 2019

skanskan commented Jan 28, 2020

Specify the columns to read from file #154

Specify the columns to read from file #154

Comments

daschw commented Jan 18, 2018

quinnj commented Jan 18, 2018

Nosferican commented Mar 25, 2018

quinnj commented Sep 11, 2018

Nosferican commented Sep 11, 2018 • edited

nalimilan commented Sep 11, 2018

quinnj commented Sep 13, 2018

nalimilan commented Sep 13, 2018

Nosferican commented Sep 13, 2018

piever commented Sep 13, 2018

quinnj commented Sep 18, 2018

nalimilan commented Sep 18, 2018

quinnj commented Sep 18, 2018

andyferris commented Sep 18, 2018

nalimilan commented Sep 18, 2018

daschw commented Sep 18, 2018

quinnj commented Feb 5, 2019

daschw commented Feb 6, 2019

skanskan commented Jan 28, 2020

Nosferican commented Sep 11, 2018 •

edited