typemap: switch to IdDict (#1069)
* `typemap`: switch to IdDict

When using types as keys, IdDict is essentially always preferred:
- it uses `@nospecialize` on all operations, reducing latency
- it uses pointer comparisons to test for matches, which is
  much faster than subtyping (which is what `isequal` dispatches to); a short sketch below illustrates the difference.
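
A minimal sketch of the difference (illustrative only; this snippet is not part of the commit or of CSV.jl):

```julia
# Both dictionaries map Type keys to Type values, but they match keys differently:
# Dict hashes its keys and compares them with `isequal`, whereas IdDict compares
# keys with `===` (object identity), i.e. a pointer comparison for Type objects.
dict   = Dict{Type, Type}(Int64 => Float64)
iddict = IdDict{Type, Type}(Int64 => Float64)

get(dict, Int64, Int64)    # Float64, key matched via hash/isequal
get(iddict, Int64, Int64)  # Float64, key matched via ===
```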

* Lots of trailing whitespace fixes

Courtesy of my settings for vscode

* One more spot
timholy committed Jan 19, 2023
1 parent 94deaf4 commit cfb4ffb
Showing 9 changed files with 55 additions and 51 deletions.
70 changes: 35 additions & 35 deletions docs/src/examples.md
@@ -443,19 +443,19 @@ file = CSV.File(IOBuffer(data); delim="::")
```julia
using CSV

# This is an example of "fixed width" data, where each
# column is the same number of characters away from each
# other on each row. Fields are "padded" with extra
# delimiters (in this case `' '`) so that each column is
# the same number of characters each time
data = """
col1 col2 col3
123431 2 3421
2355 346 7543
"""
# In addition to our `delim`, we can pass
# `ignorerepeated=true`, which tells parsing that
# consecutive delimiters should be treated as a single
# delimiter.
file = CSV.File(IOBuffer(data); delim=' ', ignorerepeated=true)
```
@@ -488,12 +488,12 @@ file = CSV.File(IOBuffer(data); quoted=false)
```julia
using CSV

# In this data, we have a few "quoted" fields, which means the field's value starts and ends with `quotechar` (or
# `openquotechar` and `closequotechar`, respectively). Quoted fields allow the field to contain characters that would otherwise
# be significant to parsing, such as delimiters or newline characters. When quoted, parsing will ignore these otherwise
# significant characters until the closing quote character is found. For quoted fields that need to also include the quote
# character itself, an escape character is provided to tell parsing to ignore the next character when looking for a close quote
# character. In the syntax examples, the keyword arguments are passed explicitly, but these also happen to be the default
# values, so just doing `CSV.File(IOBuffer(data))` would result in successful parsing.
data = """
col1,col2
@@ -512,9 +512,9 @@ file = CSV.File(IOBuffer(data); openquotechar='"' closequotechar='"', escapechar
```julia
using CSV

# In this file, our `date` column has dates that are formatted like `yyyy/mm/dd`. We can pass just such a string to the
# `dateformat` keyword argument to tell parsing to use it when looking for `Date` or `DateTime` columns. Note that currently,
# only a single `dateformat` string can be passed to parsing, meaning multiple columns with different date formats cannot all
# be parsed as `Date`/`DateTime`.
data = """
code,date
@@ -531,7 +531,7 @@ file = CSV.File(IOBuffer(data); dateformat="yyyy/mm/dd")
using CSV

# In many places in the world, floating point number decimals are separated with a comma instead of a period (`3,14` vs. `3.14`).
# We can correctly parse these numbers by passing in the `decimal=','` keyword argument. Note that we probably need to
# explicitly pass `delim=';'` in this case, since the parser will probably think that it detected `','` as the delimiter.
data = """
col1;col2;col3
@@ -547,7 +547,7 @@ file = CSV.File(IOBuffer(data); delim=';', decimal=',')
```julia
using CSV

# By default, parsing only considers the string values `true` and `false` as valid `Bool` values. To consider alternative
# values, we can pass a `Vector{String}` to the `truestrings` and `falsestrings` keyword arguments.
data = """
id,paid,attended
@@ -565,8 +565,8 @@ file = CSV.File(IOBuffer(data); truestrings=["T", "TRUE"], falsestrings=["F", "F
```julia
using CSV

# This file contains a 3x3 identity matrix of `Float64`. By default, parsing will detect the delimiter and type, but we can
# also explicitly pass `delim=' '` and `types=Float64`, which tells parsing to explicitly treat each column as `Float64`,
# without having to guess the type on its own.
data = """
1.0 0.0 0.0
@@ -583,12 +583,12 @@ file = CSV.File(IOBuffer(data); header=false, delim=' ', types=Float64)
```julia
using CSV

# In this file, our 3rd column has an invalid value on the 2nd row `invalid`. Let's imagine we'd still like to treat it as an
# `Int` column, and ignore the `invalid` value. The syntax examples provide several ways we can tell parsing to treat the 3rd
# column as `Int`, by referring to column index `3`, or column name with `Symbol` or `String`. We can also provide an entire
# `Vector` of types for each column (and which needs to match the length of columns in the file). There are two additional
# keyword arguments that control parsing behavior; in the first 4 syntax examples, we would see a warning printed like
# `"warning: invalid Int64 value on row 2, column 3"`. In the fifth example, passing `silencewarnings=true` will suppress this
# warning printing. In the last syntax example, passing `strict=true` will result in an error being thrown during parsing.
data = """
col1,col2,col3
@@ -626,8 +626,8 @@ file = CSV.File(IOBuffer(data); types=Dict(:col1 => Bool, r"^col\d" => Int8))
```julia
using CSV

# In this file, we have U.S. zipcodes in the first column that we'd rather not treat as `Int`, but parsing will detect it as
-# such. In the first syntax example, we pass `typemap=Dict(Int => String)`, which tells parsing to treat any detected `Int`
+# such. In the first syntax example, we pass `typemap=IdDict(Int => String)`, which tells parsing to treat any detected `Int`
# columns as `String` instead. In the second syntax example, we alternatively set the `zipcode` column type manually.
data = """
zipcode,score
@@ -636,7 +636,7 @@ zipcode,score
84044,3.4
"""

-file = CSV.File(IOBuffer(data); typemap=Dict(Int => String))
+file = CSV.File(IOBuffer(data); typemap=IdDict(Int => String))
file = CSV.File(IOBuffer(data); types=Dict(:zipcode => String))
```

@@ -645,9 +645,9 @@ file = CSV.File(IOBuffer(data); types=Dict(:zipcode => String))
```julia
using CSV

# In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations
# like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt32`. By default,
# `pool=(0.2, 500)`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
# greater control: `pool=0.4` means that if 40% or less of a column's values are unique, then it will be pooled.
data = """
id,code
@@ -689,9 +689,9 @@ file = CSV.File(IOBuffer(data); pool=[true, false])
```julia
using CSV

# In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations
# like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt32`. By default,
# `pool=(0.2, 500)`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
# greater control: `pool=(0.5, 2)` means that if a column has 2 or fewer unique values _and_ the total number of unique values is less than 50% of all values, then it will be pooled.
data = """
id,code
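
The parse calls for the two pooling examples above are cut off by the diff view; as a rough sketch (with made-up data in the shape described), the `pool` keyword would be passed like this:

```julia
using CSV

# Made-up data matching the description above: an `id` column and a
# low-cardinality `code` column.
data = """
id,code
1,A
2,B
3,A
4,A
"""

# pool=0.4: pool a column when 40% or fewer of its values are unique
file = CSV.File(IOBuffer(data); pool=0.4)

# pool=(0.5, 2): pool a column when it has at most 2 unique values and
# fewer than 50% of its values are unique
file = CSV.File(IOBuffer(data); pool=(0.5, 2))
```
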
2 changes: 1 addition & 1 deletion docs/src/reading.md
@@ -189,7 +189,7 @@ Note that the default [stringtype](@ref stringtype) can be overridden by providi

## [`typemap`](@id typemap)

-A `Dict{Type, Type}` argument that allows replacing a non-`String` standard type with another type when a column's type is auto-detected. Most commonly, this would be used to force all numeric columns to be `Float64`, like `typemap=Dict(Int64 => Float64)`, which would cause any columns detected as `Int64` to be parsed as `Float64` instead. Another common case would be wanting all columns of a specific type to be parsed as strings instead, like `typemap=Dict(Date => String)`, which will cause any columns detected as `Date` to be parsed as `String` instead.
+An `AbstractDict{Type, Type}` argument that allows replacing a non-`String` standard type with another type when a column's type is auto-detected. Most commonly, this would be used to force all numeric columns to be `Float64`, like `typemap=IdDict(Int64 => Float64)`, which would cause any columns detected as `Int64` to be parsed as `Float64` instead. Another common case would be wanting all columns of a specific type to be parsed as strings instead, like `typemap=IdDict(Date => String)`, which will cause any columns detected as `Date` to be parsed as `String` instead.

### Examples
* [Typemap](@ref typemap_example)
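
A minimal usage sketch of the updated keyword (made-up data; `Dates` is only needed because a `Date` mapping is shown):

```julia
using CSV, Dates

data = """
id,joined,score
1,2020/01/15,3
2,2021/07/09,4
"""

# Any column detected as Int64 is parsed as Float64 instead, and any column
# detected as Date is kept as a String.
file = CSV.File(IOBuffer(data); dateformat="yyyy/mm/dd",
                typemap=IdDict(Int64 => Float64, Date => String))
```
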
4 changes: 2 additions & 2 deletions src/chunks.jl
@@ -9,7 +9,7 @@ end
Returns a file "chunk" iterator. Accepts all the same inputs and keyword arguments as [`CSV.File`](@ref),
see those docs for explanations of each keyword argument.
The `ntasks` keyword argument specifies how many chunks a file should be split up into, defaulting to
the # of threads available to Julia (i.e. `JULIA_NUM_THREADS` environment variable) or 8 if Julia is
run single-threaded.
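
A hedged usage sketch of the iterator (the path and the row-counting helper are illustrative, not from the package):

```julia
using CSV, Tables

# Count the rows of a large file by processing it one chunk at a time;
# "large_file.csv" is a placeholder path.
function count_rows(path; ntasks=4)
    n = 0
    for chunk in CSV.Chunks(path; ntasks)
        cols = Tables.columntable(chunk)             # materialize only this chunk
        n += isempty(cols) ? 0 : length(first(cols)) # rows in this chunk
    end
    return n
end

count_rows("large_file.csv")
```
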
@@ -64,7 +64,7 @@ function Chunks(source::ValidSources;
# type options
type=nothing,
types=nothing,
-typemap::Dict=Dict{Type, Type}(),
+typemap::AbstractDict=IdDict{Type, Type}(),
pool::Union{Bool, Real, AbstractVector, AbstractDict, Base.Callable, Tuple}=DEFAULT_POOL,
downcast::Bool=false,
lazystrings::Bool=false,
7 changes: 4 additions & 3 deletions src/context.jl
@@ -172,7 +172,7 @@ mutable struct Context
pool::Union{Float64, Tuple{Float64, Int}}
downcast::Bool
customtypes::Type
-typemap::Dict{Type, Type}
+typemap::IdDict{Type, Type}
stringtype::StringTypes
limit::Int
threaded::Bool
@@ -233,7 +233,7 @@ function Context(source::ValidSources;
# type options
type=nothing,
types=nothing,
-typemap::Dict=Dict{Type, Type}(),
+typemap::AbstractDict=IdDict{Type, Type}(),
pool=DEFAULT_POOL,
downcast::Bool=false,
lazystrings::Bool=false,
@@ -288,7 +288,7 @@ end
# type options
type::Union{Nothing, Type},
types::Union{Nothing, Type, AbstractVector, AbstractDict, Function},
-typemap::Dict,
+typemap::AbstractDict,
pool::Union{Bool, Real, AbstractVector, AbstractDict, Base.Callable, Tuple},
downcast::Bool,
lazystrings::Bool,
@@ -481,6 +481,7 @@ end
end
end
# check for nonstandard types in typemap
+typemap = convert(IdDict{Type, Type}, typemap)::IdDict{Type, Type}
for T in values(typemap)
if nonstandardtype(T) !== Union{}
customtypes = tupcat(customtypes, nonstandardtype(T))
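
The added `convert` call normalizes whatever `AbstractDict` the caller supplies into an `IdDict{Type, Type}`, so passing a plain `Dict` should presumably keep working; a quick sketch of that assumption:

```julia
using CSV

data = """
zipcode,score
03494,9.9
12345,6.2
"""

# Presumably equivalent after this change: the Dict is converted to an IdDict
# inside Context, while the IdDict is passed through unchanged.
file1 = CSV.File(IOBuffer(data); typemap=Dict(Int64 => String))
file2 = CSV.File(IOBuffer(data); typemap=IdDict(Int64 => String))
```
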
6 changes: 3 additions & 3 deletions src/file.jl
@@ -200,7 +200,7 @@ function File(source::ValidSources;
# type options
type=nothing,
types=nothing,
-typemap::Dict=Dict{Type, Type}(),
+typemap::AbstractDict=IdDict{Type, Type}(),
pool=DEFAULT_POOL,
downcast::Bool=false,
lazystrings::Bool=false,
@@ -215,7 +215,7 @@
# header=1;normalizenames=false;datarow=-1;skipto=-1;footerskip=0;transpose=false;comment=nothing;ignoreemptyrows=true;ignoreemptylines=nothing;
# select=nothing;drop=nothing;limit=nothing;threaded=nothing;ntasks=Threads.nthreads();tasks=nothing;rows_to_check=30;lines_to_check=nothing;missingstrings=String[];missingstring="";
# delim=nothing;ignorerepeated=false;quoted=true;quotechar='"';openquotechar=nothing;closequotechar=nothing;escapechar='"';dateformat=nothing;
-# dateformats=nothing;decimal=UInt8('.');truestrings=nothing;falsestrings=nothing;type=nothing;types=nothing;typemap=Dict{Type,Type}();
+# dateformats=nothing;decimal=UInt8('.');truestrings=nothing;falsestrings=nothing;type=nothing;types=nothing;typemap=IdDict{Type,Type}();
# pool=CSV.DEFAULT_POOL;downcast=false;lazystrings=false;stringtype=String;strict=false;silencewarnings=false;maxwarnings=100;debug=false;parsingdebug=false;buffer_in_memory=false
# @descend CSV.Context(CSV.Arg(source), CSV.Arg(header), CSV.Arg(normalizenames), CSV.Arg(datarow), CSV.Arg(skipto), CSV.Arg(footerskip), CSV.Arg(transpose), CSV.Arg(comment), CSV.Arg(ignoreemptyrows), CSV.Arg(ignoreemptylines), CSV.Arg(select), CSV.Arg(drop), CSV.Arg(limit), CSV.Arg(buffer_in_memory), CSV.Arg(threaded), CSV.Arg(ntasks), CSV.Arg(tasks), CSV.Arg(rows_to_check), CSV.Arg(lines_to_check), CSV.Arg(missingstrings), CSV.Arg(missingstring), CSV.Arg(delim), CSV.Arg(ignorerepeated), CSV.Arg(quoted), CSV.Arg(quotechar), CSV.Arg(openquotechar), CSV.Arg(closequotechar), CSV.Arg(escapechar), CSV.Arg(dateformat), CSV.Arg(dateformats), CSV.Arg(decimal), CSV.Arg(truestrings), CSV.Arg(falsestrings), CSV.Arg(type), CSV.Arg(types), CSV.Arg(typemap), CSV.Arg(pool), CSV.Arg(downcast), CSV.Arg(lazystrings), CSV.Arg(stringtype), CSV.Arg(strict), CSV.Arg(silencewarnings), CSV.Arg(maxwarnings), CSV.Arg(debug), CSV.Arg(parsingdebug), CSV.Arg(false))
ctx = @refargs Context(source, header, normalizenames, datarow, skipto, footerskip, transpose, comment, ignoreemptyrows, ignoreemptylines, select, drop, limit, buffer_in_memory, threaded, ntasks, tasks, rows_to_check, lines_to_check, missingstrings, missingstring, delim, ignorerepeated, quoted, quotechar, openquotechar, closequotechar, escapechar, dateformat, dateformats, decimal, truestrings, falsestrings, stripwhitespace, type, types, typemap, pool, downcast, lazystrings, stringtype, strict, silencewarnings, maxwarnings, debug, parsingdebug, validate, false)
@@ -435,7 +435,7 @@ function multithreadpostparse(ctx, ntasks, pertaskcolumns, rows, finalrows, j, c
# col.column is a PooledArray
elseif col.type === Int64
# we need to special-case Int here because while parsing, a default Int64 sentinel value is chosen to
# represent missing; if any chunk bumped into that sentinel value while parsing, then it cycled to a
# new sentinel value; this step ensures that each chunk has the same encoded sentinel value
# passing force=false means it will first check if all chunks already have the same sentinel and return
# immediately if so, which will be the case most often
