typemap: switch to IdDict (#1069)
* `typemap`: switch to IdDict

When using types as keys, IdDict is essentially always preferred:
- it uses `@nospecialize` on all operations, reducing latency
- it uses pointer comparisons to test for matches, which is
  much faster than subtyping (which is what `isequal` dispatches to); a short sketch below illustrates the difference.
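
A minimal sketch of the difference (illustrative only; this snippet is not part of the commit or of CSV.jl):

```julia
# Both dictionaries map Type keys to Type values, but they match keys differently:
# Dict hashes its keys and compares them with `isequal`, whereas IdDict compares
# keys with `===` (object identity), i.e. a pointer comparison for Type objects.
dict   = Dict{Type, Type}(Int64 => Float64)
iddict = IdDict{Type, Type}(Int64 => Float64)

get(dict, Int64, Int64)    # Float64, key matched via hash/isequal
get(iddict, Int64, Int64)  # Float64, key matched via ===
```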

* Lots of trailing whitespace fixes

Courtesy of my settings for vscode

* One more spot
timholy committed Jan 19, 2023
1 parent 94deaf4 commit cfb4ffb
Showing 9 changed files with 55 additions and 51 deletions.
70 changes: 35 additions & 35 deletions docs/src/examples.md
@@ -443,19 +443,19 @@ file = CSV.File(IOBuffer(data); delim="::")
```julia
using CSV

# This is an example of "fixed width" data, where each
# column is the same number of characters away from each
# other on each row. Fields are "padded" with extra
# delimiters (in this case `' '`) so that each column is
# the same number of characters each time
data = """
col1 col2 col3
123431 2 3421
2355 346 7543
"""
# In addition to our `delim`, we can pass
# `ignorerepeated=true`, which tells parsing that
# consecutive delimiters should be treated as a single
# delimiter.
file = CSV.File(IOBuffer(data); delim=' ', ignorerepeated=true)
```
@@ -488,12 +488,12 @@ file = CSV.File(IOBuffer(data); quoted=false)
```julia
using CSV

# In this data, we have a few "quoted" fields, which means the field's value starts and ends with `quotechar` (or
# `openquotechar` and `closequotechar`, respectively). Quoted fields allow the field to contain characters that would otherwise
# be significant to parsing, such as delimiters or newline characters. When quoted, parsing will ignore these otherwise
# significant characters until the closing quote character is found. For quoted fields that need to also include the quote
# character itself, an escape character is provided to tell parsing to ignore the next character when looking for a close quote
# character. In the syntax examples, the keyword arguments are passed explicitly, but these also happen to be the default
# values, so just doing `CSV.File(IOBuffer(data))` would result in successful parsing.
data = """
col1,col2
@@ -512,9 +512,9 @@ file = CSV.File(IOBuffer(data); openquotechar='"' closequotechar='"', escapechar
```julia
using CSV

# In this file, our `date` column has dates that are formatted like `yyyy/mm/dd`. We can pass just such a string to the
# `dateformat` keyword argument to tell parsing to use it when looking for `Date` or `DateTime` columns. Note that currently,
# only a single `dateformat` string can be passed to parsing, meaning multiple columns with different date formats cannot all
# be parsed as `Date`/`DateTime`.
data = """
code,date
@@ -531,7 +531,7 @@ file = CSV.File(IOBuffer(data); dateformat="yyyy/mm/dd")
using CSV

# In many places in the world, floating point number decimals are separated with a comma instead of a period (`3,14` vs. `3.14`).
# We can correctly parse these numbers by passing in the `decimal=','` keyword argument. Note that we probably need to
# explicitly pass `delim=';'` in this case, since the parser will probably think that it detected `','` as the delimiter.
data = """
col1;col2;col3
@@ -547,7 +547,7 @@ file = CSV.File(IOBuffer(data); delim=';', decimal=',')
```julia
using CSV

# By default, parsing only considers the string values `true` and `false` as valid `Bool` values. To consider alternative
# values, we can pass a `Vector{String}` to the `truestrings` and `falsestrings` keyword arguments.
data = """
id,paid,attended
@@ -565,8 +565,8 @@ file = CSV.File(IOBuffer(data); truestrings=["T", "TRUE"], falsestrings=["F", "F
```julia
using CSV

# This file contains a 3x3 identity matrix of `Float64`. By default, parsing will detect the delimiter and type, but we can
# also explicitly pass `delim=' '` and `types=Float64`, which tells parsing to explicitly treat each column as `Float64`,
# without having to guess the type on its own.
data = """
1.0 0.0 0.0
@@ -583,12 +583,12 @@ file = CSV.File(IOBuffer(data); header=false, delim=' ', types=Float64)
```julia
using CSV

# In this file, our 3rd column has an invalid value on the 2nd row `invalid`. Let's imagine we'd still like to treat it as an
# `Int` column, and ignore the `invalid` value. The syntax examples provide several ways we can tell parsing to treat the 3rd
# column as `Int`, by referring to column index `3`, or column name with `Symbol` or `String`. We can also provide an entire
# `Vector` of types for each column (and which needs to match the length of columns in the file). There are two additional
# keyword arguments that control parsing behavior; in the first 4 syntax examples, we would see a warning printed like
# `"warning: invalid Int64 value on row 2, column 3"`. In the fifth example, passing `silencewarnings=true` will suppress this
# warning printing. In the last syntax example, passing `strict=true` will result in an error being thrown during parsing.
data = """
col1,col2,col3
@@ -626,8 +626,8 @@ file = CSV.File(IOBuffer(data); types=Dict(:col1 => Bool, r"^col\d" => Int8))
```julia
using CSV

# In this file, we have U.S. zipcodes in the first column that we'd rather not treat as `Int`, but parsing will detect it as
-# such. In the first syntax example, we pass `typemap=Dict(Int => String)`, which tells parsing to treat any detected `Int`
+# such. In the first syntax example, we pass `typemap=IdDict(Int => String)`, which tells parsing to treat any detected `Int`
# columns as `String` instead. In the second syntax example, we alternatively set the `zipcode` column type manually.
data = """
zipcode,score
@@ -636,7 +636,7 @@ zipcode,score
84044,3.4
"""

-file = CSV.File(IOBuffer(data); typemap=Dict(Int => String))
+file = CSV.File(IOBuffer(data); typemap=IdDict(Int => String))
file = CSV.File(IOBuffer(data); types=Dict(:zipcode => String))
```

@@ -645,9 +645,9 @@ file = CSV.File(IOBuffer(data); types=Dict(:zipcode => String))
```julia
using CSV

# In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations
# like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt32`. By default,
# `pool=(0.2, 500)`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
# greater control: `pool=0.4` means that if 40% or less of a column's values are unique, then it will be pooled.
data = """
id,code
@@ -689,9 +689,9 @@ file = CSV.File(IOBuffer(data); pool=[true, false])
```julia
using CSV

# In this file, we have an `id` column and a `code` column. There can be advantages with various DataFrame/table operations
# like joining and grouping when `String` values are "pooled", meaning each unique value is mapped to a `UInt32`. By default,
# `pool=(0.2, 500)`, so string columns with low cardinality are pooled by default. Via the `pool` keyword argument, we can provide
# greater control: `pool=(0.5, 2)` means that if a column has 2 or fewer unique values _and_ the total number of unique values is less than 50% of all values, then it will be pooled.
data = """
id,code
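
The parse calls for the two pooling examples above are cut off by the diff view; as a rough sketch (with made-up data in the shape described), the `pool` keyword would be passed like this:

```julia
using CSV

# Made-up data matching the description above: an `id` column and a
# low-cardinality `code` column.
data = """
id,code
1,A
2,B
3,A
4,A
"""

# pool=0.4: pool a column when 40% or fewer of its values are unique
file = CSV.File(IOBuffer(data); pool=0.4)

# pool=(0.5, 2): pool a column when it has at most 2 unique values and
# fewer than 50% of its values are unique
file = CSV.File(IOBuffer(data); pool=(0.5, 2))
```
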
2 changes: 1 addition & 1 deletion docs/src/reading.md
@@ -189,7 +189,7 @@ Note that the default [stringtype](@ref stringtype) can be overridden by providi

## [`typemap`](@id typemap)

-A `Dict{Type, Type}` argument that allows replacing a non-`String` standard type with another type when a column's type is auto-detected. Most commonly, this would be used to force all numeric columns to be `Float64`, like `typemap=Dict(Int64 => Float64)`, which would cause any columns detected as `Int64` to be parsed as `Float64` instead. Another common case would be wanting all columns of a specific type to be parsed as strings instead, like `typemap=Dict(Date => String)`, which will cause any columns detected as `Date` to be parsed as `String` instead.
+An `AbstractDict{Type, Type}` argument that allows replacing a non-`String` standard type with another type when a column's type is auto-detected. Most commonly, this would be used to force all numeric columns to be `Float64`, like `typemap=IdDict(Int64 => Float64)`, which would cause any columns detected as `Int64` to be parsed as `Float64` instead. Another common case would be wanting all columns of a specific type to be parsed as strings instead, like `typemap=IdDict(Date => String)`, which will cause any columns detected as `Date` to be parsed as `String` instead.

### Examples
* [Typemap](@ref typemap_example)
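
A minimal usage sketch of the updated keyword (made-up data; `Dates` is only needed because a `Date` mapping is shown):

```julia
using CSV, Dates

data = """
id,joined,score
1,2020/01/15,3
2,2021/07/09,4
"""

# Any column detected as Int64 is parsed as Float64 instead, and any column
# detected as Date is kept as a String.
file = CSV.File(IOBuffer(data); dateformat="yyyy/mm/dd",
                typemap=IdDict(Int64 => Float64, Date => String))
```
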
4 changes: 2 additions & 2 deletions src/chunks.jl
@@ -9,7 +9,7 @@ end
Returns a file "chunk" iterator. Accepts all the same inputs and keyword arguments as [`CSV.File`](@ref),
see those docs for explanations of each keyword argument.
The `ntasks` keyword argument specifies how many chunks a file should be split up into, defaulting to
the # of threads available to Julia (i.e. `JULIA_NUM_THREADS` environment variable) or 8 if Julia is
run single-threaded.
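
A hedged usage sketch of the iterator (the path and the row-counting helper are illustrative, not from the package):

```julia
using CSV, Tables

# Count the rows of a large file by processing it one chunk at a time;
# "large_file.csv" is a placeholder path.
function count_rows(path; ntasks=4)
    n = 0
    for chunk in CSV.Chunks(path; ntasks)
        cols = Tables.columntable(chunk)             # materialize only this chunk
        n += isempty(cols) ? 0 : length(first(cols)) # rows in this chunk
    end
    return n
end

count_rows("large_file.csv")
```
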
@@ -64,7 +64,7 @@ function Chunks(source::ValidSources;
# type options
type=nothing,
types=nothing,
-typemap::Dict=Dict{Type, Type}(),
+typemap::AbstractDict=IdDict{Type, Type}(),
pool::Union{Bool, Real, AbstractVector, AbstractDict, Base.Callable, Tuple}=DEFAULT_POOL,
downcast::Bool=false,
lazystrings::Bool=false,
7 changes: 4 additions & 3 deletions src/context.jl
@@ -172,7 +172,7 @@ mutable struct Context
pool::Union{Float64, Tuple{Float64, Int}}
downcast::Bool
customtypes::Type
-typemap::Dict{Type, Type}
+typemap::IdDict{Type, Type}
stringtype::StringTypes
limit::Int
threaded::Bool
@@ -233,7 +233,7 @@ function Context(source::ValidSources;
# type options
type=nothing,
types=nothing,
-typemap::Dict=Dict{Type, Type}(),
+typemap::AbstractDict=IdDict{Type, Type}(),
pool=DEFAULT_POOL,
downcast::Bool=false,
lazystrings::Bool=false,
@@ -288,7 +288,7 @@ end
# type options
type::Union{Nothing, Type},
types::Union{Nothing, Type, AbstractVector, AbstractDict, Function},
-typemap::Dict,
+typemap::AbstractDict,
pool::Union{Bool, Real, AbstractVector, AbstractDict, Base.Callable, Tuple},
downcast::Bool,
lazystrings::Bool,
@@ -481,6 +481,7 @@ end
end
end
# check for nonstandard types in typemap
+typemap = convert(IdDict{Type, Type}, typemap)::IdDict{Type, Type}
for T in values(typemap)
if nonstandardtype(T) !== Union{}
customtypes = tupcat(customtypes, nonstandardtype(T))
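
The added `convert` call normalizes whatever `AbstractDict` the caller supplies into an `IdDict{Type, Type}`, so passing a plain `Dict` should presumably keep working; a quick sketch of that assumption:

```julia
using CSV

data = """
zipcode,score
03494,9.9
12345,6.2
"""

# Presumably equivalent after this change: the Dict is converted to an IdDict
# inside Context, while the IdDict is passed through unchanged.
file1 = CSV.File(IOBuffer(data); typemap=Dict(Int64 => String))
file2 = CSV.File(IOBuffer(data); typemap=IdDict(Int64 => String))
```
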
6 changes: 3 additions & 3 deletions src/file.jl
@@ -200,7 +200,7 @@ function File(source::ValidSources;
# type options
type=nothing,
types=nothing,
-typemap::Dict=Dict{Type, Type}(),
+typemap::AbstractDict=IdDict{Type, Type}(),
pool=DEFAULT_POOL,
downcast::Bool=false,
lazystrings::Bool=false,
@@ -215,7 +215,7 @@
# header=1;normalizenames=false;datarow=-1;skipto=-1;footerskip=0;transpose=false;comment=nothing;ignoreemptyrows=true;ignoreemptylines=nothing;
# select=nothing;drop=nothing;limit=nothing;threaded=nothing;ntasks=Threads.nthreads();tasks=nothing;rows_to_check=30;lines_to_check=nothing;missingstrings=String[];missingstring="";
# delim=nothing;ignorerepeated=false;quoted=true;quotechar='"';openquotechar=nothing;closequotechar=nothing;escapechar='"';dateformat=nothing;
-# dateformats=nothing;decimal=UInt8('.');truestrings=nothing;falsestrings=nothing;type=nothing;types=nothing;typemap=Dict{Type,Type}();
+# dateformats=nothing;decimal=UInt8('.');truestrings=nothing;falsestrings=nothing;type=nothing;types=nothing;typemap=IdDict{Type,Type}();
# pool=CSV.DEFAULT_POOL;downcast=false;lazystrings=false;stringtype=String;strict=false;silencewarnings=false;maxwarnings=100;debug=false;parsingdebug=false;buffer_in_memory=false
# @descend CSV.Context(CSV.Arg(source), CSV.Arg(header), CSV.Arg(normalizenames), CSV.Arg(datarow), CSV.Arg(skipto), CSV.Arg(footerskip), CSV.Arg(transpose), CSV.Arg(comment), CSV.Arg(ignoreemptyrows), CSV.Arg(ignoreemptylines), CSV.Arg(select), CSV.Arg(drop), CSV.Arg(limit), CSV.Arg(buffer_in_memory), CSV.Arg(threaded), CSV.Arg(ntasks), CSV.Arg(tasks), CSV.Arg(rows_to_check), CSV.Arg(lines_to_check), CSV.Arg(missingstrings), CSV.Arg(missingstring), CSV.Arg(delim), CSV.Arg(ignorerepeated), CSV.Arg(quoted), CSV.Arg(quotechar), CSV.Arg(openquotechar), CSV.Arg(closequotechar), CSV.Arg(escapechar), CSV.Arg(dateformat), CSV.Arg(dateformats), CSV.Arg(decimal), CSV.Arg(truestrings), CSV.Arg(falsestrings), CSV.Arg(type), CSV.Arg(types), CSV.Arg(typemap), CSV.Arg(pool), CSV.Arg(downcast), CSV.Arg(lazystrings), CSV.Arg(stringtype), CSV.Arg(strict), CSV.Arg(silencewarnings), CSV.Arg(maxwarnings), CSV.Arg(debug), CSV.Arg(parsingdebug), CSV.Arg(false))
ctx = @refargs Context(source, header, normalizenames, datarow, skipto, footerskip, transpose, comment, ignoreemptyrows, ignoreemptylines, select, drop, limit, buffer_in_memory, threaded, ntasks, tasks, rows_to_check, lines_to_check, missingstrings, missingstring, delim, ignorerepeated, quoted, quotechar, openquotechar, closequotechar, escapechar, dateformat, dateformats, decimal, truestrings, falsestrings, stripwhitespace, type, types, typemap, pool, downcast, lazystrings, stringtype, strict, silencewarnings, maxwarnings, debug, parsingdebug, validate, false)
@@ -435,7 +435,7 @@ function multithreadpostparse(ctx, ntasks, pertaskcolumns, rows, finalrows, j, c
# col.column is a PooledArray
elseif col.type === Int64
# we need to special-case Int here because while parsing, a default Int64 sentinel value is chosen to
# represent missing; if any chunk bumped into that sentinel value while parsing, then it cycled to a
# new sentinel value; this step ensures that each chunk has the same encoded sentinel value
# passing force=false means it will first check if all chunks already have the same sentinel and return
# immediately if so, which will be the case most often
