fread fails on file with inconsistent # columns #2267

Closed
MichaelChirico opened this issue Jul 12, 2017 · 0 comments
@MichaelChirico MichaelChirico commented Jul 12, 2017

I have a file that I created and kept appending to. Unfortunately, at some point I switched from writing 15 columns to 14 and kept appending to the same file.

I expect two things when reading this file: 1) a plain fread call should fail, but with an error that reports the inconsistent number of columns; 2) with fill = TRUE, the read should succeed.

Unfortunately, neither is true:

library(data.table)
# data.table 1.10.5 IN DEVELOPMENT built 2017-07-11 18:43:20 UTC; travis

URL = paste0('https://gist.githubusercontent.com/MichaelChirico/',
             '0f1a9ae0d419160ad8ef5b7ac5469336/raw/',
             'db7936fafaf2602e03e657bbfc9e49dd526260af/bad_fill.csv')
x = fread(URL, verbose = TRUE)

# Input contains no \n. Taking this to be a filename to open
# [1] Check arguments
# Using 2 threads (omp_get_max_threads()=2, nth=2)
# NAstrings = [<<NA>>]
# None of the NAstrings look like numbers.
# [2] Opening the file
# Opening file /tmp/RtmpMqWBHa/filee366e213235
# File opened, size = 34.88MB (36578984 bytes).
# Memory mapping ... ok
# [3] Detect and skip BOM
# [4] Detect end-of-line character(s)
# Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
# [6] Skipping initial rows if needed
# Positioned on line 1 starting: <<train_set,delx,dely,alpha,eta,>>
#   [7] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=','  with 100 lines of 15 fields using quote rule 0
# Detected 15 columns on line 1. This line is either column names or first data row. Line starts as: <<train_set,delx,dely,alpha,eta,>>
#   Quote rule picked = 0
# [8] Determine column names
# All the fields on line 1 are character fields. Treating as the column names.
# [9] Detect column types
# Number of sampling jump points = 101 because (36578905 bytes from row 1 to eof) / (2 * 15974 jump0size) == 1144
# Type codes (jump 000)    : 655552525552255  Quote rule 0

Error in fread(URL, verbose = TRUE) : Could not find first good line start after jump point 73 when sampling.

x = fread(URL, fill = TRUE, verbose = TRUE)

# Input contains no \n. Taking this to be a filename to open
# [1] Check arguments
# Using 2 threads (omp_get_max_threads()=2, nth=2)
# NAstrings = [<<NA>>]
# None of the NAstrings look like numbers.
# [2] Opening the file
# Opening file /tmp/RtmpMqWBHa/filee361f04c03
# File opened, size = 34.88MB (36578984 bytes).
# Memory mapping ... ok
# [3] Detect and skip BOM
# [4] Detect end-of-line character(s)
# Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
# [6] Skipping initial rows if needed
# Positioned on line 1 starting: <<train_set,delx,dely,alpha,eta,>>
#   [7] Detect separator, quoting rule, and ncolumns
# Detecting sep ...
# sep=','  with 100 lines of 15 fields using quote rule 0
# Detected 15 columns on line 1. This line is either column names or first data row. Line starts as: <<train_set,delx,dely,alpha,eta,>>
#   Quote rule picked = 0
# fill=true and the most number of columns found is 15
# [8] Determine column names
# All the fields on line 1 are character fields. Treating as the column names.
# [9] Detect column types
# Number of sampling jump points = 101 because (36578905 bytes from row 1 to eof) / (2 * 15974 jump0size) == 1144
# Type codes (jump 000)    : 655552525552255  Quote rule 0

Error in fread(URL, fill = TRUE, verbose = TRUE) : Could not find first good line start after jump point 73 when sampling.

I was able to overcome the problem and fix my file by identifying the exact row where the switch occurred and doing:

x = fread('head -n 164161 ~/Desktop/fire_random_search.csv')
y = fread('tail -n +164162 ~/Desktop/fire_random_search.csv',
          col.names = names(x)[-ncol(x)])
z = rbind(x, y, fill = TRUE)

fwrite(z, '~/Desktop/fire_random_search.csv')
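
For anyone needing to locate the row where the column count changes, here is a minimal sketch using base R's count.fields on a toy file. The file contents and the names path, n_fields and first_bad are made up for illustration; this is not necessarily how the row number above was found, and count.fields scans the whole file, so it may be slow on large inputs.

```r
# Toy stand-in for the real file: a header plus three 3-field rows,
# followed by two rows that are one field short.
path <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3", "4,5,6", "7,8,9", "10,11", "12,13"), path)

# Number of comma-separated fields on each line (header included)
n_fields <- count.fields(path, sep = ",")

# First line whose field count differs from the header's
first_bad <- which(n_fields != n_fields[1])[1]

print(table(n_fields))
print(first_bad)
```

With a split point like this, first_bad - 1 would be the argument to head -n and first_bad the argument to tail -n + in a workaround like the one above.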

Also, this worked as expected in 1.10.4:

fread(URL, fill = TRUE)
#         train_set     delx     dely alpha      eta       lt     theta   k
#      1:  train_13 181.5100 620.9804     0 0.956508  7.00000 2.0617496 100
#      2:  train_13 181.5100 620.9804     0 0.956508  7.00000 2.0617496 100
#      3:  train_13 181.5100 620.9804     0 0.956508  7.00000 2.0617496 100
#      4:  train_13 181.5100 620.9804     0 0.956508  7.00000 2.0617496 100
#      5:  train_13 181.5100 620.9804     0 0.956508  7.00000 2.0617496 100
#     ---                                                                  
# 227516:  train_16 734.8028 406.3451     1 1.035084 23.08614 0.8077483  27
# 227517:  train_16 734.8028 406.3451     1 1.035084 23.08614 0.8077483  27
# 227518:  train_16 734.8028 406.3451     1 1.035084 23.08614 0.8077483  27
# 227519:  train_16 734.8028 406.3451     1 1.035084 23.08614 0.8077483  27
# 227520:  train_16 734.8028 406.3451     1 1.035084 23.08614 0.8077483  27
#            l1    l2   kde.bw kde.lags kde.win        pei      pai
#      1: 0.000 0e+00 615.3886        1      11 0.09523810 24.42448
#      2: 0.000 1e-05 615.3886        1      11 0.02380952  6.10612
#      3: 0.000 5e-05 615.3886        1      11 0.00000000  0.00000
#      4: 0.000 1e-04 615.3886        1      11 0.02380952  6.10612
#      5: 0.000 5e-04 615.3886        1      11 0.02380952  6.10612
#     ---                                                          
# 227516: 0.001 1e-05 501.8807        1      15 0.02857143       NA
# 227517: 0.001 5e-05 501.8807        1      15 0.02857143       NA
# 227518: 0.001 1e-04 501.8807        1      15 0.02857143       NA
# 227519: 0.001 5e-04 501.8807        1      15 0.02857143       NA
# 227520: 0.001 1e-03 501.8807        1      15 0.01904762       NA
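
The fill = TRUE behavior seen in the 1.10.4 output can be illustrated on a tiny input. This is only a sketch with made-up data, assuming a data.table version where fill = TRUE behaves as in that output; it is far too small to go through the jump-point sampling that triggers the error reported above, so it does not reproduce the bug.

```r
library(data.table)

# Made-up input: the header and first two rows have 3 fields,
# the last two rows have only 2 (as if a column stopped being written).
# A string containing \n is read as data rather than as a file name.
txt <- "a,b,c\n1,2,3\n4,5,6\n7,8\n9,10\n"

# With fill = TRUE the short rows are padded with NA in the missing column
dt <- fread(txt, fill = TRUE)
print(dt)
```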