NA values are destroyed in sparse character columns in fread #737

Closed
adamkennedy opened this issue Jul 19, 2014 · 3 comments
@adamkennedy

When fread has to bump the type of a sparse column that is entirely NA within the initial sampling range, the type bump appears to destroy the NA values already read, leaving empty strings where NA should be.

input <- '"Integer","Numeric","Logical","Character"
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
1,1.1,FALSE,"a"
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA
NA,NA,NA,NA'

table <- fread(input, verbose = TRUE)
# Input contains a \n (or is ""). Taking this to be text input (not a filename)
# Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
# Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
# Found 4 columns
# First row with 4 fields occurs on line 1 (either column names or first row of data)
# All the fields on line 1 are character fields. Treating as the column names.
# Count of eol after first data row: 34
# Subtracted 0 for last eol and any trailing empty lines, leaving 34 data rows
# Type codes: 1111 (first 5 rows)
# Type codes: 1111 (+middle 5 rows)
# Type codes: 1111 (+last 5 rows)
# Type codes: 1111 (after applying colClasses and integer64)
# Type codes: 1111 (after applying drop or select (if supplied)
# Allocating 4 column slots (4 - 0 NULL)
# Bumping column 2 from INT to INT64 on data row 26, field contains '1.1'
# Bumping column 2 from INT64 to REAL on data row 26, field contains '1.1'
# Bumping column 3 from INT to INT64 on data row 26, field contains 'FALSE'
# Bumping column 3 from INT64 to REAL on data row 26, field contains 'FALSE'
# Bumping column 3 from REAL to STR on data row 26, field contains 'FALSE'
# Bumping column 4 from INT to INT64 on data row 26, field contains '"a"'
# Bumping column 4 from INT64 to REAL on data row 26, field contains '"a"'
# Bumping column 4 from REAL to STR on data row 26, field contains '"a"'
#    0.000s (  3%) Memory map (rerun may be quicker)
#    0.000s (  5%) sep and header detection
#    0.000s (  1%) Count rows (wc -l)
#    0.001s ( 46%) Column type detection (first, middle and last 5 rows)
#    0.000s (  1%) Allocation of 34x4 result (xMB) in RAM
#    0.000s (  1%) Reading data
#    0.000s (  0%) Allocation for type bumps (if any), including gc time if triggered
#    0.001s ( 41%) Coercing data already read in type bumps (if any)
#    0.000s (  0%) Changing na.strings to NA
#    0.002s        Total

...
Warning messages:
1: In fread(input, verbose = TRUE) :
Bumped column 3 to type character on data row 26, field contains 'FALSE'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
2: In fread(input, verbose = TRUE) :
Bumped column 4 to type character on data row 26, field contains '"a"'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.

table$Character
 [1] ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  ""  "a" NA  
     NA  NA  NA  NA  NA  NA  NA 

Confirmed that this is also broken in 1.9.3.

Note that the report above was produced on 1.9.2, which is why the detected type codes default to 1111. A colleague has run the same input on 1.9.3 and confirms that, while the logical-detection part of the above is fixed, the character column still behaves the same way.
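
The warning text above suggests rerunning with 'colClasses' set for the affected column. A minimal sketch of that workaround follows, assuming a current data.table release; it rebuilds the same input (25 all-NA rows, one populated row, 8 all-NA rows) and declares every column type up front so that no column needs a mid-read bump. The name `dt` is only illustrative, and this has not been checked against 1.9.2 or 1.9.3.

```r
library(data.table)

# Same shape as the report: 25 all-NA rows, one populated row, 8 all-NA rows.
na_row <- "NA,NA,NA,NA"
input <- paste(
  c('"Integer","Numeric","Logical","Character"',
    rep(na_row, 25),
    '1,1.1,FALSE,"a"',
    rep(na_row, 8)),
  collapse = "\n"
)

# Declare the types up front so fread does not have to guess from the
# sampled rows and then coerce values it has already read.
dt <- fread(
  input,
  colClasses = c("integer", "numeric", "logical", "character")
)

# Count real NAs versus empty strings in the character column.
sum(is.na(dt$Character))
sum(dt$Character == "", na.rm = TRUE)
```

If the workaround holds, the first count should cover every NA row and the second should be zero, rather than the mix of "" and NA shown above.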

@adamkennedy
Author

If there's anything we can do to assist in getting this fixed and the new release with this and the other logical type fixes out, feel free to contact me at adam.kennedy@kaggle.com and let me know.

@arunsrinivasan changed the title from "NA values are destroyed in sparse character columns" to "NA values are destroyed in sparse character columns in fread" on Aug 1, 2014
@peterbecich

Could the problem be that the na.strings argument does not work properly in fread?

It seems to work partially. In my dataset, all columns but one are floats. Within these float columns, character ? represents NA.

With this call:

fread("p53_new_2012/K9.data",header=FALSE, na.strings="?", verbose = TRUE)

elements in the CSV that are the character ? are correctly coerced to NA in the output data.table, but any column containing a ? becomes a character column instead of numeric.

After deleting every line that contains a ? with sed -i '/\?/d' K9.edited.data, the problem goes away: all float columns in the CSV become numeric columns in the data.table.
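
Deleting rows also discards the non-missing values on those lines. As a hedged alternative, the sketch below keeps the rows and forces the column types with colClasses alongside na.strings. The three-column sample is hypothetical (it stands in for K9.data, whose contents are not shown here), and the behaviour is assumed for a current data.table release rather than verified on the version discussed in this thread.

```r
library(data.table)

# Hypothetical sample standing in for K9.data: all floats, "?" marks missing.
csv <- "1.5,?,3.0\n2.5,4.0,?\n?,6.0,7.5\n"

# Treat "?" as NA and state explicitly that every column is numeric,
# so a "?" cannot push a column to character.
dt <- fread(
  csv,
  header = FALSE,
  na.strings = "?",
  colClasses = rep("numeric", 3)
)

# Inspect the resulting column classes and the NA count per column.
sapply(dt, class)
colSums(is.na(dt))
```

For a real file, the length of the colClasses vector would need to match the actual number of columns, and any genuinely non-numeric column would need its own class.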

@arunsrinivasan
Member

This was fixed by Matt a long time ago, in commit e15facd.

@arunsrinivasan arunsrinivasan added this to the v1.9.6 milestone Mar 16, 2015
brigaldies pushed a commit to brigaldies/ExData_Plotting1 that referenced this issue Apr 11, 2015
…lumns (colClasses=...), in order to remove a warning encountered when the '?' column value is read. The warning seems to be a bug in fread, documented here: Rdatatable/data.table#737