NA values are destroyed in sparse character columns in fread #737

adamkennedy · 2014-07-19T00:10:35Z

When fread inflates the type of a sparse column that is entirely NA within the initial sampling range, the inflation appears to destroy the NA values already read, leaving empty strings ("") in their place.

input <- '"Integer","Numeric","Logical","Character"
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
1,1.1,FALSE,"a"
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA'

table <- fread(input, verbose = TRUE)
# Input contains a \n (or is ""). Taking this to be text input (not a filename)
# Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
# Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
# Found 4 columns
# First row with 4 fields occurs on line 1 (either column names or first row of data)
# All the fields on line 1 are character fields. Treating as the column names.
# Count of eol after first data row: 34
# Subtracted 0 for last eol and any trailing empty lines, leaving 34 data rows
# Type codes: 1111 (first 5 rows)
# Type codes: 1111 (+middle 5 rows)
# Type codes: 1111 (+last 5 rows)
# Type codes: 1111 (after applying colClasses and integer64)
# Type codes: 1111 (after applying drop or select (if supplied)
# Allocating 4 column slots (4 - 0 NULL)
# Bumping column 2 from INT to INT64 on data row 26, field contains '1.1'
# Bumping column 2 from INT64 to REAL on data row 26, field contains '1.1'
# Bumping column 3 from INT to INT64 on data row 26, field contains 'FALSE'
# Bumping column 3 from INT64 to REAL on data row 26, field contains 'FALSE'
# Bumping column 3 from REAL to STR on data row 26, field contains 'FALSE'
# Bumping column 4 from INT to INT64 on data row 26, field contains '"a"'
# Bumping column 4 from INT64 to REAL on data row 26, field contains '"a"'
# Bumping column 4 from REAL to STR on data row 26, field contains '"a"'
#    0.000s (  3%) Memory map (rerun may be quicker)
#    0.000s (  5%) sep and header detection
#    0.000s (  1%) Count rows (wc -l)
#    0.001s ( 46%) Column type detection (first, middle and last 5 rows)
#    0.000s (  1%) Allocation of 34x4 result (xMB) in RAM
#    0.000s (  1%) Reading data
#    0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
#    0.001s ( 41%) Coercing data already read in type bumps (if any)
#    0.000s (  0%) Changing na.strings to NA
#    0.002s        Total

...
Warning messages:
1: In fread(input, verbose = TRUE) :
Bumped column 3 to type character on data row 26, field contains 'FALSE'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
2: In fread(input, verbose = TRUE) :
Bumped column 4 to type character on data row 26, field contains '"a"'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.

table$Character
 [1] ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  "a" NA  
     NA  NA  NA  NA  NA  NA  NA

Confirmed: this is still broken in 1.9.3.

Note that the report above was produced on 1.9.2, which is why the default type codes are 1111. A colleague has run the same input on 1.9.3 and confirms that, while the logical-detection part of the above is fixed, the character column still behaves the same way.
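
The warning text above suggests setting 'colClasses' to 'character' for the affected column. A minimal sketch of that workaround follows, assuming the same input as in the report (rebuilt here with rep() for brevity); the variable names dt, n_empty and n_na are illustrative only. It declares the Character column up front so no type bump is needed, then counts empty strings versus NA values, which is how the symptom shows up.

```r
library(data.table)

# Rebuild the input from the report:
# 25 all-NA rows, one populated row, then 8 more all-NA rows.
rows <- c('"Integer","Numeric","Logical","Character"',
          rep("NA,NA,NA,NA", 25),
          '1,1.1,FALSE,"a"',
          rep("NA,NA,NA,NA", 8))
input <- paste(rows, collapse = "\n")

# Declare the sparse column as character up front, as the warning advises,
# so fread does not have to bump its type partway through the file.
dt <- fread(input, colClasses = list(character = "Character"))

# Symptom check: the bug shows up as "" where NA was expected.
n_empty <- sum(dt$Character == "", na.rm = TRUE)
n_na    <- sum(is.na(dt$Character))
print(c(empty = n_empty, na = n_na))
```

If the NA values survive, the empty count should be zero and every row except the populated one should be NA. Whether this holds on an affected release was not verified here, so treat it as something to try rather than a confirmed fix.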

adamkennedy · 2014-07-19T00:57:56Z

If there's anything we can do to assist in getting this fixed and the new release with this and the other logical type fixes out, feel free to contact me at adam.kennedy@kaggle.com and let me know.

peterbecich · 2014-12-07T09:29:08Z

Could the problem be that the na.strings argument does not work properly in fread?

It seems to work partially. In my dataset, all columns but one are floats. Within these float columns, the character ? represents NA.

With this call,

fread("p53_new_2012/K9.data",header=FALSE, na.strings="?", verbose = TRUE)

every ? in the CSV is correctly read as NA in the resulting data.table, but any column containing a ? becomes a character column.

After removing every row that contains a ? with sed -i '/\?/d' K9.edited.data, the problem goes away: all float columns in the CSV become numeric columns in the data.table.
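
A small self-contained check of the behaviour described above, as a sketch: the three-row CSV string below is made up for illustration and is not the K9 dataset. It reads the text with na.strings="?" and prints the resulting column classes, so it is easy to see whether columns containing ? come back as numeric or as character on a given data.table version.

```r
library(data.table)

# Hypothetical stand-in for the real file: two float columns that use "?"
# for missing values, plus one genuine character column.
csv <- "1.5,?,x\n2.5,3.5,y\n?,4.5,z\n"

dt <- fread(csv, header = FALSE, na.strings = "?")

# Inspect the detected class of each column and the NA positions.
print(sapply(dt, class))
print(is.na(dt$V1))
```

On a release that includes the fix, V1 and V2 are expected to be numeric with NA where ? appeared. On an affected release they may come back as character, which matches the report above.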

arunsrinivasan · 2015-03-16T03:04:41Z

This was fixed by Matt a long time ago, in commit e15facd.

…lumns (colClasses=...), in order to remove a warning encountered when the '?' column value is read. The warning seems to be a bug in fread, documented here: Rdatatable/data.table#737

arunsrinivasan added the bug label Jul 19, 2014

arunsrinivasan changed the title ~~NA values are destroyed in sparse character columns~~ NA values are destroyed in sparse character columns in fread Aug 1, 2014

arunsrinivasan closed this as completed Mar 16, 2015

arunsrinivasan assigned arunsrinivasan and mattdowle and unassigned arunsrinivasan Mar 16, 2015

arunsrinivasan added this to the v1.9.6 milestone Mar 16, 2015

MichaelChirico added the fread label May 2, 2020
