CSV processor #49509

probakowski · 2019-11-22T21:46:20Z

This change adds a new ingest processor that breaks a line from a CSV file into separate fields.
By default it conforms to RFC 4180, but it can be tweaked.

Closes #49113


elasticmachine · 2019-11-22T21:46:22Z

Pinging @elastic/es-core-features (:Core/Features/Ingest)

jasontedor · 2019-11-22T22:51:40Z

@probakowski Can you add some docs with this pull request? Some REST tests would be nice too.

probakowski · 2019-11-23T00:09:58Z

@jasontedor sure, I'll add both on Monday

I've also run a JMH benchmark to compare the version above with https://github.com/johtani/elasticsearch-ingest-csv, mentioned in #49113 (lower is better):

Benchmark             Mode  Score   Error    Units
1Thread johtani       avgt   2.502  ± 0.166  us/op
1Thread probakowski   avgt   1.730  ± 0.123  us/op
8Threads johtani      avgt  24.180  ± 0.563  us/op
8Threads probakowski  avgt   3.104  ± 0.294  us/op

(I've used version 7.4.2 of @johtani's library, which is thread-safe but suffers from synchronization overhead.)

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvProcessor.java

probakowski · 2019-11-25T23:47:20Z

@elasticmachine update branch

martijnvg

I left two more comments, otherwise LGTM.

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvProcessor.java

docs/reference/ingest/processors/csv.asciidoc

probakowski · 2019-11-29T23:22:37Z

@elasticmachine update branch

jbaiera

Some small things and a question regarding the trim setting when parsing quoted values.

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvParser.java

jbaiera · 2019-12-02T22:03:58Z

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvParser.java

+        boolean shouldSetField = true;
+        for (; currentIndex < length; currentIndex++) {
+            c = currentChar();
+            if (isWhitespace(c)) {


If the value we are parsing is a quoted string, are the spaces around it always trimmed?

I've taken a look at the RFC but I'm still not sure what the right path here is. It states "Spaces are considered part of a field and should not be ignored." In the implementation we have a trim option, which is a fine extension, but when reading a quoted string the spaces are trimmed regardless of whether the trim option is set to true.

probakowski · 2019-12-09

Spaces around quotes are always trimmed, since it's well defined where the data starts and ends. It's a deviation from the RFC, which allows no spaces at all before or after quotes. But it's common for CSV files in the wild to have them, so this change makes things easier for the user without introducing any ambiguity.
For unquoted fields the situation is different: there's no way to tell automatically where the data starts, so the user must decide whether to trim whitespace or not. That's why it's parametrized.
Now that I think of it, my gut feeling is that most use cases would trim leading/trailing whitespace, so I would even consider defaulting trim to true. What do you think? @martijnvg your opinion would be appreciated as well.
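To make the rule above concrete, here is a minimal Java sketch of the behaviour as described in this comment. It is not the `CsvParser` from this pull request: the class name `CsvTrimSketch`, the method `parseLine`, and the helper `isPadding` are hypothetical, and treating only spaces and tabs as padding is an assumption. Padding around a quoted field is always dropped, while an unquoted field is trimmed only when `trim` is true.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; not the parser added by this pull request. */
public final class CsvTrimSketch {

    // Assumption: only spaces and tabs count as padding, and never the separator itself.
    private static boolean isPadding(char c, char separator) {
        return (c == ' ' || c == '\t') && c != separator;
    }

    public static List<String> parseLine(String line, char separator, char quote, boolean trim) {
        List<String> fields = new ArrayList<>();
        int n = line.length();
        int i = 0;
        while (true) {
            // Look past leading padding to see whether this field is quoted.
            int j = i;
            while (j < n && isPadding(line.charAt(j), separator)) {
                j++;
            }
            if (j < n && line.charAt(j) == quote) {
                // Quoted field: the quotes mark where the data starts and ends,
                // so padding outside them is dropped regardless of `trim`.
                StringBuilder value = new StringBuilder();
                boolean closed = false;
                j++; // skip the opening quote
                while (j < n && !closed) {
                    char c = line.charAt(j);
                    if (c != quote) {
                        value.append(c);
                        j++;
                    } else if (j + 1 < n && line.charAt(j + 1) == quote) {
                        value.append(quote); // a doubled quote is a literal quote
                        j += 2;
                    } else {
                        closed = true; // closing quote
                        j++;
                    }
                }
                if (!closed) {
                    throw new IllegalArgumentException("unterminated quoted field");
                }
                while (j < n && isPadding(line.charAt(j), separator)) {
                    j++;
                }
                if (j < n && line.charAt(j) != separator) {
                    throw new IllegalArgumentException("unexpected character after closing quote at index " + j);
                }
                fields.add(value.toString());
                i = j;
            } else {
                // Unquoted field: the parser cannot know where the data starts,
                // so whitespace is kept unless the caller asked for trimming.
                int end = line.indexOf(separator, i);
                if (end < 0) {
                    end = n;
                }
                String raw = line.substring(i, end);
                fields.add(trim ? raw.trim() : raw);
                i = end;
            }
            if (i >= n) {
                return fields;
            }
            i++; // skip the separator
        }
    }
}
```

With the input `a, "b c" , d ` and `trim` set to false, this sketch keeps the spaces around the unquoted `d` but drops the ones around the quoted `b c`; with `trim` set to true, both are stripped.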

martijnvg · 2019-12-10

I tend to agree with trimming leading and trailing whitespace by default, because it seems more practical to me.

martijnvg · 2019-12-10

@droberts195 convinced me otherwise: #49509 (comment)

probakowski · 2019-12-10

Yep, me too; reverted back to false.


droberts195 · 2019-12-09T09:03:57Z

Thanks for changing the trim functionality 👍

probakowski · 2019-12-09T19:07:44Z

@elasticmachine update branch

droberts195 · 2019-12-10T10:10:20Z

Some reasons not to trim CSV by default are:

* It's not in the standard
* Excel doesn't quote values that start and end in spaces when saving CSV
* Google Sheets doesn't quote values that start and end in spaces when saving CSV
* macOS Numbers doesn't quote values that start and end in spaces when saving CSV

I agree it's quite common to find non-standard CSV that needs spaces trimming, so it's good to have the option, but it seems more defensible to me to default to the standard and make people with non-standard data customise an option.

probakowski · 2019-12-10T21:05:05Z

@droberts195 these are valid points, I'll revert the default value for trim back to false

jbaiera

Good points about the clear start and end for quoted content. LGTM!

probakowski · 2019-12-11T11:46:40Z

@elasticmachine update branch

probakowski · 2019-12-11T13:13:47Z

@elasticmachine run elasticsearch-ci/2

* CSV Processor for Ingest This change adds new ingest processor that breaks line from CSV file into separate fields. By default it conforms to RFC 4180 but can be tweaked. Closes elastic#49113

* CSV ingest processor (#49509) This change adds new ingest processor that breaks line from CSV file into separate fields. By default it conforms to RFC 4180 but can be tweaked. Closes #49113


dschneiter · 2020-09-11T13:31:41Z

@probakowski What's the intended behavior of this processor for quoted fields containing line breaks? It seems that the processor interprets a line break as the start of a new message, which is in contrast to how Excel and Numbers handle line breaks in quoted strings.

probakowski · 2020-09-11T13:52:51Z

Hi @dschneiter, this processor handles only a single line of CSV, so it doesn't treat line breaks specially at all (unless you set one as the separator), and it doesn't have any notion of a "new message".
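Since the processor only ever sees one value at a time, whatever feeds it would have to keep a multi-line quoted field together. The Java sketch below shows one way an upstream step might do that, purely as an assumption: the class `CsvRecordJoiner` and the method `joinRecords` are hypothetical and are not part of Elasticsearch or of this pull request. It groups physical lines into logical records by ending a record only once its quote characters are balanced.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; a hypothetical pre-processing step, not part of the processor. */
public final class CsvRecordJoiner {

    /**
     * Joins physical lines into logical CSV records. A line break that falls
     * inside an open quoted field is kept as data instead of ending the record.
     */
    public static List<String> joinRecords(List<String> physicalLines, char quote) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQuotes = false;
        for (String line : physicalLines) {
            if (insideQuotes) {
                // The previous line ended inside a quoted field: restore the line break.
                current.append('\n');
            }
            current.append(line);
            for (int i = 0; i < line.length(); i++) {
                // Every quote toggles the state; a doubled (escaped) quote toggles twice.
                if (line.charAt(i) == quote) {
                    insideQuotes = !insideQuotes;
                }
            }
            if (!insideQuotes) {
                records.add(current.toString());
                current.setLength(0);
            }
        }
        if (insideQuotes) {
            throw new IllegalArgumentException("unterminated quoted field");
        }
        return records;
    }
}
```

For the four physical lines `id,comment`, `1,"first line`, `second line"`, and `2,ok`, this yields three records, the second of which still contains the embedded line break. Whether the processor then keeps that line break inside the quoted field is an assumption based on the reply above and should be verified.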

probakowski added 2 commits November 22, 2019 22:30

CSV Processor for Ingest

67fde34

This change adds new ingest processor that breaks line from CSV file into separate fields. By default it conforms to RFC 4180 but can be tweaked. Closes elastic#49113

Javadoc fix

101cb4e

probakowski added the >feature, :Data Management/Ingest Node, v8.0.0, and v7.6.0 labels Nov 22, 2019

probakowski requested review from jbaiera and jakelandis November 22, 2019 21:46

probakowski self-assigned this Nov 22, 2019

martijnvg reviewed Nov 25, 2019

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvProcessor.java (outdated)

probakowski added 3 commits November 25, 2019 11:30

Add REST tests, make class final

16c0146

Documentation + required target_fields parameter

0614315

Fix javadoc

a060368

Merge branch 'master' into csv-processor

ea25312

martijnvg approved these changes Nov 26, 2019

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvProcessor.java (outdated)

modules/ingest-common/src/main/java/org/elasticsearch/ingest/common/CsvProcessor.java (outdated)

Review comments applied

2902941

droberts195 reviewed Nov 28, 2019

docs/reference/ingest/processors/csv.asciidoc (outdated)

Trim both leading and trailing spaces

508b307

Merge branch 'master' into csv-processor

02314b4

jbaiera requested changes Dec 2, 2019


probakowski and others added 5 commits December 6, 2019 21:07

Update modules/ingest-common/src/main/java/org/elasticsearch/ingest/c…

7c95d45

…ommon/CsvParser.java Co-Authored-By: James Baiera <james.baiera@gmail.com>

Review comments

cdbefc2

Merge branch 'master' into csv-processor

29e4b45

Javadoc update

e30e463

Javadoc update

ee2b00c

default trim to true

9adde62

Merge branch 'master' into csv-processor

069e4a1

Default trim to false

55335b5

probakowski requested a review from jbaiera December 10, 2019 21:08

jbaiera approved these changes Dec 10, 2019


Merge branch 'master' into csv-processor

5261182

probakowski merged commit 64e1a77 into elastic:master Dec 11, 2019

probakowski deleted the csv-processor branch December 11, 2019 13:52

probakowski added the backport pending label Dec 11, 2019

probakowski mentioned this pull request Dec 11, 2019

[7.x] CSV ingest processor (#49509) #50083

Merged

probakowski removed the backport pending label Dec 11, 2019

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
