Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify the columns to read from file #154

Closed
daschw opened this issue Jan 18, 2018 · 18 comments
Closed

Specify the columns to read from file #154

daschw opened this issue Jan 18, 2018 · 18 comments

Comments

@daschw
Copy link

daschw commented Jan 18, 2018

I often have to deal with CSV files that where edited by someone in Excel resulting in something like
test.csv:

datacol1, datacol2, ,
0, 1, , [some annoying comment]

When I try to read such a file (CSV.read("test.csv")) I get

MethodError: Cannot `convert` an object of type WeakRefString{UInt8} to an object of type Missings.Missing
This may have arisen from a call to the constructor Missings.Missing(...),
since type constructors fall back to convert methods.
setindex!(::Array{Missings.Missing,1}, ::WeakRefString{UInt8}, ::Int64) at array.jl:583
streamto! at io.jl:303 [inlined]
macro expansion at DataStreams.jl:547 [inlined]
stream!(::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},Missings.Missing}, ::Type{DataStreams.Data.Field}, ::DataFrames.DataFrameStream{Tuple{Array{Int64,1},Array{Int64,1},Array{Missings.Missing,1},CategoricalArrays.CategoricalArray{String,1,UInt32,String,CategoricalArrays.CategoricalString{UInt32},Union{}}}}, ::DataStreams.Data.Schema{true,Tuple{Int64,Int64,Missings.Missing,CategoricalArrays.CategoricalValue{String,UInt32}}}, ::Int64, ::NTuple{4,Base.#identity}, ::DataStreams.Data.##15#16, ::Array{Any,1}, ::Type{Ref{(:datacol1, :datacol2, Symbol(""), Symbol(""))}}) at DataStreams.jl:614
#stream!#17(::Bool, ::Dict{Int64,Function}, ::Function, ::Array{Any,1}, ::Array{Any,1}, ::Function, ::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},Missings.Missing}, ::Type{DataFrames.DataFrame}) at DataStreams.jl:490
(::DataStreams.Data.#kw##stream!)(::Array{Any,1}, ::DataStreams.Data.#stream!, ::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}},Missings.Missing}, ::Type{DataFrames.DataFrame}) at <missing>:0
#read#43(::Bool, ::Dict{Int64,Function}, ::Bool, ::Array{Any,1}, ::Function, ::String, ::Type{T} where T) at Source.jl:312
read(::String) at Source.jl:311
include_string(::String, ::String) at loading.jl:522
eval(::Module, ::Any) at boot.jl:235
(::Atom.##63#66)() at eval.jl:104
withpath(::Atom.##63#66, ::Void) at utils.jl:30
withpath(::Function, ::Void) at eval.jl:38
macro expansion at eval.jl:103 [inlined]
(::Atom.##62#65{Dict{String,Any}})() at task.jl:80

It would be nice to be able to specify the columns to read with a keyword like Pandas' usecols to be able to easily avoid such issues.

@quinnj
Copy link
Member

quinnj commented Jan 18, 2018

Coming soon; stay tuned.

@Nosferican
Copy link
Contributor

R's readr uses col_types to specify which columns to read and how.
CSV.read(; types::Vector{DataType}) could use Nothing for skipping that column...

@quinnj
Copy link
Member

quinnj commented Sep 11, 2018

On master, using CSV.File, you can now get this by iterating a File object, like:

f = CSV.File(file)
for row in f
    println("a=$(row.a), b=$(row.b)")
end

This will only parse the a and b columns in the file, skipping the others. Users should note that it's most performant to access columns in sequential order (i.e. it's optimized for the normal case where people just read the whole file), but it's completely valid to access, say, in reverse order if so desired.

@quinnj quinnj closed this as completed Sep 11, 2018
@Nosferican
Copy link
Contributor

Nosferican commented Sep 11, 2018

Would it be able to use that through a keyword argument in CSV.read (e.g., col_types = "ccd_i")?

@nalimilan
Copy link
Member

I agree, it sounds like a legitimate request to have a convenient way of getting a DataFrame (or another structure) with only the specified columns.

@nalimilan nalimilan reopened this Sep 11, 2018
@quinnj
Copy link
Member

quinnj commented Sep 13, 2018

Nah, I don't think putting this in the function is the right abstraction. For example, I have a lazy Select transformation object that operates on any Tables.jl table and implements itself the interface. So, a user would be able to do:

df = CSV.File("cool_file.csv") |> select(:a, :b) |> DataFrame

to select just the a and b columns from a csv file and materialize it in a DataFrame and the csv parsing will only read the a and b columns. I wouldn't want users to think this was functionality only for CSV.jl :)

I'll leave this issue open until we decide on what to do w/ things like Select; @piever, @joshday, @andyferris, and myself have been discussing it a bit.

@nalimilan
Copy link
Member

Makes sense. Though if we keep the CSV.read convenience function, supporting this via an argument wouldn't be absurd.

@Nosferican
Copy link
Contributor

I like the lazy pipeline, but agree on the convenience if read is still available. The other aspect is that the coltypes interface does not only subset columns, but also gives a simple way to specify the type it should be parsed as (e.g., zipcodes being parsed as Integer instead of String).

@piever
Copy link

piever commented Sep 13, 2018

Definitely the pipeline is appealing (using Tables and the related querying framework) but I agree that it's good to have a keyword argument as a shorthand. I wonder if it can be somehow combined with coltypes. For example CSV.read(fn, select = (:height => Float64, :name => String)) would mean "read only columns height and name, parsing the first as Float64 and the second as String" (just throwing out ideas here - haven't thought deeply about the implications of this design).

@quinnj
Copy link
Member

quinnj commented Sep 18, 2018

The problem here is that trying to support select as a keyword argument is actually much harder than having it be an external construct. To try to support this internally, we'd have to essentially re-implement the external construct internally, by keeping track of the columns the user selected, while also keeping track of the full file's columns (in case they select out-of-order columns). So while I agree it could be nice to have a keyword argument, it just doesn't seem very practical; that's a lot of extra work for functionality that's already available.

@nalimilan
Copy link
Member

When you say "internally", do you mean inside CSV.File or inside CSV.read? I would think it's easy to do for the latter since it's already "external" to the CSV.File main logic.

@quinnj
Copy link
Member

quinnj commented Sep 18, 2018

Yeah, but then we get this weird inversion of dependencies: we'd have to have CSV.jl depend on some "TableOperations.jl" package that provided select functionality, just so users could call CSV.read(file; select=...) w/ a keyword argument instead of keeping everything composable and separate. That feels pretty icky to me.

@andyferris
Copy link
Member

I tend to agree with Jacob here. My philosophy is that composable tools are beneficial for both library writers and users, and this is a great example of that where .csv files are loaded lazily and the columns can be picked later.

@nalimilan
Copy link
Member

Yes, having CSV.jl depend on another data package isn't ideal. Let's see how the ecosystem evolves first then.

@daschw
Copy link
Author

daschw commented Sep 18, 2018

I just realized that the example I posted in the original issue description still is not working. If I try to do something like

f = CSV.File(file)
for row in f
    println("a=$(row.a), b=$(row.b)")
end

with a file test1.csv that looks like this:

a, b, ,
0, 1, , comment
12, 5, ,

I get

f = CSV.File("test1.csv")
ERROR: BoundsError: attempt to access ""
  at index [1]
Stacktrace:
 [1] checkbounds at .\strings\basic.jl:193 [inlined]
 [2] codeunit at .\strings\string.jl:87 [inlined]
 [3] getindex at .\strings\string.jl:206 [inlined]
 [4] normalizename(::String) at C:\Users\Daniel\.julia\packages\CSV\dpCnm\src\filedetection.jl:11
 [5] (::getfield(CSV, Symbol("##21#23")){Bool})(::Tuple{Int64,String}) at .\none:0
 [6] iterate at .\generator.jl:47 [inlined]
 [7] collect_to!(::Array{Symbol,1}, ::Base.Generator{Base.Iterators.Enumerate{Array{Union{Missing, String},1}},getfield(CSV, Symbol("##21#23")){Bool}}, ::Int64, ::Tuple{Int64,Int64}) at .\array.jl:656
 [8] collect_to_with_first! at .\array.jl:643 [inlined]
 [9] collect(::Base.Generator{Base.Iterators.Enumerate{Array{Union{Missing, String},1}},getfield(CSV, Symbol("##21#23")){Bool}}) at .\array.jl:624
 [10] _totuple at .\tuple.jl:261 [inlined]
 [11] Type at .\tuple.jl:243 [inlined]
 [12] datalayout(::Int64, ::Parsers.Delimited{false,Parsers.Quoted{Parsers.Strip{Parsers.Sentinel{typeof(Parsers.defaultparser),Parsers.Trie{0x00,false,missing,2,Tuple{}}}}},Parsers.Trie{0x00,false,missing,8,Tuple{Parsers.Trie{0x2c,true,missing,8,Tuple{}},Parsers.Trie{0x0a,true,missing,8,Tuple{}},Parsers.Trie{0x0d,true,missing,8,Tuple{Parsers.Trie{0x0a,true,missing,8,Tuple{}}}}}}}, ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int64, ::Bool) at C:\Users\Daniel\.julia\packages\CSV\dpCnm\src\filedetection.jl:123
 [13] #File#1(::Int64, ::Bool, ::Int64, ::Nothing, ::Int64, ::Nothing, ::Bool, ::Nothing, ::Bool, ::Array{String,1}, ::String, ::String, ::Bool, ::Char, ::Nothing, ::Nothing, ::Char, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Nothing, ::Dict{Type,Type}, ::Symbol, ::Bool, ::Bool, ::Bool, ::Type, ::String) at C:\Users\Daniel\.julia\packages\CSV\dpCnm\src\CSV.jl:288
 [14] CSV.File(::String) at C:\Users\Daniel\.julia\packages\CSV\dpCnm\src\CSV.jl:263
 [15] top-level scope at none:0

So, selecting valid columns is only possible if all columns are valid?

quinnj added a commit that referenced this issue Feb 5, 2019
…a manually iterating CSV.File or using Tables.select, but there was another issue where this file had an invalid column name that dies while trying to normalize the name.
@quinnj
Copy link
Member

quinnj commented Feb 5, 2019

With CSV.jl current release, I can do the following:

julia> csv = """a,b,,
       0, 1, , comment
       12, 5, ,
       """
"a,b,,\n0, 1, , comment\n12, 5, ,\n"

julia> df = CSV.File(IOBuffer(csv)) |> Tables.select(:a, :b) |> DataFrame
2×2 DataFrame
│ Row │ a      │ b      │
│     │ Int64⍰ │ Int64⍰ │
├─────┼────────┼────────┤
│ 101      │
│ 2125

Note that in my example, the header row looks like a,b,,, which produces column names a and b, while in the original example, the header row is a, b, , which produces column names like a, Symbol(" b"). This also caused the BoundsError @daschw saw in his most recent post. I've put up a PR (#382) to fix the BoundsError case, so in @daschw 's example, he would just need to do CSV.File(file; normalizenames=true). Then manually iterating the CSV.File and select the a and b columns or doing Tables.select should work for this.

@quinnj quinnj closed this as completed in da5bcbb Feb 5, 2019
@daschw
Copy link
Author

daschw commented Feb 6, 2019

Great, thanks a lot @quinnj !

@skanskan
Copy link

Then, can we use a "select=" keyword?
Reading the whole file and then selecting the columns it's not as efficient and can even be impossible if the file is larger than the available memory.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants