Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ignoreemptylines should default to true #638

Closed
yakir12 opened this issue Jun 9, 2020 · 7 comments
Closed

ignoreemptylines should default to true #638

yakir12 opened this issue Jun 9, 2020 · 7 comments

Comments

@yakir12
Copy link

yakir12 commented Jun 9, 2020

This is related to #621 , right now ignoreemptylines's default is false. I think it should default to true.

I cannot imagine that the majority of people that want to parse a csv file intend to have missing in all the columns by leaving a line empty. Like, how would that even come to pass (as in, what program encodes missing in all columns by leaving an empty line)?
Empty lines should be flagged as they are now (there is a warning about it), but they should also be ignored since the majority of users will want to filter those away since they stem from some error (such as shown in #621 or some other source) and not an intent.

@yakir12
Copy link
Author

yakir12 commented Jun 9, 2020

Wow, I'm trying to fix this and after running the tests I realized that CSV assumes that ignoreemptylines = false a lot... I might therefore be wrong about my assumptions, or maybe this was just for convenience (you knew that it would ignore empty lines and so did the tests). I'll wait for your input before working on the tests and pushing my PR.

@yakir12
Copy link
Author

yakir12 commented Jun 9, 2020

(btw, got an error when testing on master:)

(@v1.5) pkg> test CSV
    Testing CSV
Status `/tmp/jl_gtSfjc/Project.toml`
  [336ed68f] CSV v0.6.2 `~/.julia/dev/CSV`
  [324d7699] CategoricalArrays v0.8.1
  [944b1d66] CodecZlib v0.7.0
  [a93c6f00] DataFrames v0.21.2
  [48062228] FilePathsBase v0.8.0
  [69de0a69] Parsers v1.0.5
  [2dfb63ee] PooledArrays v0.5.3
  [bd369af6] Tables v1.0.4
  [ea10d353] WeakRefStrings v0.6.2
  [ade2ca70] Dates
  [a63ad114] Mmap
  [9a3f8284] Random
  [8dfed614] Test
  [4ec0a83e] Unicode
Status `/tmp/jl_gtSfjc/Manifest.toml`
  [336ed68f] CSV v0.6.2 `~/.julia/dev/CSV`
  [324d7699] CategoricalArrays v0.8.1
  [944b1d66] CodecZlib v0.7.0
  [34da2185] Compat v3.11.0
  [9a962f9c] DataAPI v1.3.0
  [a93c6f00] DataFrames v0.21.2
  [864edb3b] DataStructures v0.17.17
  [e2d170a0] DataValueInterfaces v1.0.0
  [48062228] FilePathsBase v0.8.0
  [41ab1584] InvertedIndices v1.0.0
  [82899510] IteratorInterfaceExtensions v1.0.0
  [682c06a0] JSON v0.21.0
  [e1d29d7a] Missings v0.4.3
  [bac558e1] OrderedCollections v1.2.0
  [69de0a69] Parsers v1.0.5
  [2dfb63ee] PooledArrays v0.5.3
  [189a3867] Reexport v0.2.0
  [a2af1166] SortingAlgorithms v0.3.1
  [3783bdb8] TableTraits v1.0.0
  [bd369af6] Tables v1.0.4
  [3bb67fe8] TranscodingStreams v0.9.5
  [ea10d353] WeakRefStrings v0.6.2
  [83775a58] Zlib_jll v1.2.11+10
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8bb1440f] DelimitedFiles
  [8ba89e20] Distributed
  [9fa8497b] Future
  [b77e0a4c] InteractiveUtils
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [44cfe95a] Pkg
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA
  [9e88b42a] Serialization
  [1a1011a3] SharedArrays
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
thread = 1 warning: only found 7 / 8 columns on data row: 1. Filling remaining columns with `missing`
thread = 1 warning: error parsing Int64 on row = 1, col = 1: "abc
", error=INVALID: SENTINEL | NEWLINE | EOF | INVALID_DELIMITER 
CSV.File basics: Test Failed at /home/yakir/.julia/dev/CSV/test/basics.jl:367
  Expression: df.id.sentinel == 0x6690d6bed2a7da46
   Evaluated: 0x734baccd23d8b4d6 == 0x6690d6bed2a7da46
Stacktrace:
 [1] top-level scope at /home/yakir/.julia/dev/CSV/test/basics.jl:367
 [2] top-level scope at /home/yakir/julia/usr/share/julia/stdlib/v1.5/Test/src/Test.jl:1114
 [3] top-level scope at /home/yakir/.julia/dev/CSV/test/basics.jl:7
.
.
.
Test Summary:                                  | Pass  Fail  Total
CSV                                            | 1748     1   1749
  CSV.File                                     | 1555     1   1556
    CSV.File basics                            |  182     1    183
    CSV.File iteration                         |  366          366
  CSV.write                                    |   83           83
  CategoricalArray levels (including ordering) |    7            7
  PooledArrays                                 |   14           14
  CSV.Rows                                     |   37           37
  CSV.detect                                   |    8            8
  CSV.findrowstarts!                           |    4            4
  CSV.promote_typecode                         |    2            2
  CSV.File with select/drop                    |   38           38
ERROR: LoadError: Some tests did not pass: 1748 passed, 1 failed, 0 errored, 0 broken.
in expression starting at /home/yakir/.julia/dev/CSV/test/runtests.jl:13
ERROR: Package CSV errored during testing

@quinnj
Copy link
Member

quinnj commented Jun 19, 2020

Sorry for the slow response here; I changed the default and ran the tests to see what you saw. I think I'm ok w/ changing the default, but as you noticed there's a bunch of tests that rely on the current default, so it'd just be a matter of going through and making sure ignoreemptylines=false is passed in those cases.

This would be a breaking change, but fortunately, we're getting ready for a 1.0 release that could include a breaking change like this.

@nalimilan
Copy link
Member

FWIW I got a CSV file produced by LimeSurvey which included a final line break, and it's annoying to get a row full of missings because of that. So I support this change (or special-casing the last row, which is a bit weird).

@Moelf
Copy link
Contributor

Moelf commented Aug 9, 2020

@quinnj I've made a PR with one test failing no matter ignoreemptylines set to true or false, the test is this one

@test_throws ArgumentError CSV.File(IOBuffer("x\n1\n2\n3\n#4"), ignorerepeated=true)

everything else passes, since I didn't find a NEWS.md I only did the necessary changes in docstring.

I personally think this is a better default so I took the liberty to make a PR hope you won't mind @yakir12

@yakir12
Copy link
Author

yakir12 commented Aug 21, 2020

God no, I just got back fro a lengthy vacation and welcome your initiative, thanks!

@quinnj
Copy link
Member

quinnj commented Sep 3, 2020

Implemented; thanks @Moelf !

@quinnj quinnj closed this as completed Sep 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants