Add basic JSON support #3

michaelmior · 2020-11-14T15:27:00Z

I'd like to be able to use Cleora for JSON data, so this is an attempt to implement support.
I'm not an experienced Rustacean, so there's definitely some cruft here to be removed, but it's basically working.
There is now a file type parameter that can be passed in to parse input data as JSON.
Note that I've removed any mention of CSV since it doesn't actually seem to be supported, but correct me if I'm wrong.
If this seems like something you may eventually want to merge, I'm happy to take suggestions on improvements.

piobab · 2020-11-15T19:59:16Z

@michaelmior thanks for the implementation - the JSON idea is very nice :) I have made some comments. Let me know if you need any help. If you'd rather not deal with it, I can take it over from you.

kodieg · 2020-11-17T14:07:43Z

@michaelmior I allowed myself to add my two cents here. Hope you don't mind :)

Pushed a commit that lets process_row accept both &str and String. This preserves the previous performance of CSV parsing.
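For readers following along, here is a minimal sketch of the general technique, not the code from the commit. The functions `count_non_empty` and `split_tsv` are hypothetical stand-ins; the only idea illustrated is a function generic over `AsRef<str>`, so that one code path can pass borrowed `&str` fields (no per-field allocation) and another can pass owned `String` values.

```rust
/// Hypothetical stand-in for a row-processing function: counts non-empty fields.
/// Generic over `AsRef<str>`, so callers may pass `&str` or `String` fields.
fn count_non_empty<S: AsRef<str>>(row: &[S]) -> usize {
    row.iter().filter(|field| !field.as_ref().is_empty()).count()
}

/// Delimited-text path (illustrative): fields borrow from the input line,
/// so no new `String` is allocated per field.
fn split_tsv(line: &str) -> Vec<&str> {
    line.split('\t').collect()
}
```

Calling `count_non_empty(&split_tsv(line))` works with borrowed fields, while a JSON-style path that has to build owned values can pass a `Vec<String>` to the same function.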

I also moved the file_type check out of the loop in case it could cause a regression. This adds some code duplication, so we should probably refactor this portion in the future, especially if we want to add another file type.

@piobab I would propose merging this PR after your final review.
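A rough sketch of what hoisting the check looks like, under assumed names: `FileType`, `parse_tsv_line`, `parse_json_line`, and `count_fields` are all hypothetical, and the JSON helper is a placeholder rather than a real parser. The branch on the file type happens once, and each arm runs its own loop, which is where the duplication mentioned above comes from.

```rust
#[derive(Clone, Copy)]
enum FileType {
    Tsv,
    Json,
}

/// Illustrative helper: number of tab-separated fields in a line.
fn parse_tsv_line(line: &str) -> usize {
    line.split('\t').count()
}

/// Placeholder only, NOT a JSON parser: counts ':' characters as a crude
/// proxy for the number of keys in a flat object.
fn parse_json_line(line: &str) -> usize {
    line.matches(':').count()
}

/// The `match` on `file_type` is evaluated once, outside the per-line loops.
fn count_fields(lines: &[&str], file_type: FileType) -> usize {
    match file_type {
        FileType::Tsv => lines.iter().map(|l| parse_tsv_line(l)).sum::<usize>(),
        FileType::Json => lines.iter().map(|l| parse_json_line(l)).sum::<usize>(),
    }
}
```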

michaelmior · 2020-11-17T15:26:32Z

@kodieg Thanks! I just rebased and made one minor change.

kodieg · 2020-11-17T19:48:31Z

Removed two new clippy warnings regarding &Vec<...> refs in function arguments. Replaced with slices.
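As a small illustration of this kind of change (the function `total_len` is hypothetical and not taken from the PR): clippy's `ptr_arg` lint flags parameters typed `&Vec<T>` because a slice `&[T]` accepts strictly more callers at no cost.

```rust
// Before (would trigger clippy::ptr_arg):
//     fn total_len(values: &Vec<String>) -> usize { ... }

/// After: taking a slice lets callers pass a whole Vec, a sub-slice, or an array.
fn total_len(values: &[String]) -> usize {
    values.iter().map(|v| v.len()).sum()
}
```

Existing call sites that pass `&vec` keep compiling, since `&Vec<T>` coerces to `&[T]`.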

michaelmior · 2020-11-18T12:39:07Z

If all else looks good, let me know if you want to rebase to clean up the messy history.

piobab · 2020-11-18T15:09:26Z

src/pipeline.rs

+        .map({
+            |c| {
+                if !c.complex {
+                    smallvec![parsed


We can support other types for non-complex columns as well (similar to how arrays are handled):

let elem = parsed.at_key(&c.name).unwrap();
let value = match elem.get_type() {
    dom::element::ElementType::String => elem.get_string().unwrap(),
    _ => elem.minify(),
};
smallvec![value]

After modification we can read such lines:

{"users": 1, "products": ["p1", "p2"], "brands": ["b1", "b2"]}
{"users": 2, "products": ["p2", "p3", "p4"], "brands": ["b1"]}

Okay, although personally I would lean towards disallowing array values here, since they should probably be treated as complex. This is why I only allowed strings initially. Other atomic values would certainly make sense here, though.

You are right, but the column config is here for "support". If there is an array, it is treated as a single entity and the user sees it in the output file. Otherwise the user should prepare a JSON file with just strings.
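The behavior discussed in this thread can be sketched without any JSON library. Everything below is an assumption made for illustration: the `Json` enum is a hypothetical, minimal value model, and `minify` here is a hand-written stand-in for the library call in the suggestion above (it does not escape special characters). The point is only the dispatch: string values are used as-is, and any other value, including an array, becomes a single entity via its compact JSON text.

```rust
/// Hypothetical, minimal JSON value model used only for this sketch.
enum Json {
    Str(String),
    Num(i64),
    Array(Vec<Json>),
}

/// Simplified compact serialization (no string escaping), standing in for
/// a library-provided minify step.
fn minify(value: &Json) -> String {
    match value {
        Json::Str(s) => format!("\"{}\"", s),
        Json::Num(n) => n.to_string(),
        Json::Array(items) => {
            let inner: Vec<String> = items.iter().map(minify).collect();
            format!("[{}]", inner.join(","))
        }
    }
}

/// Value for a non-complex column: strings are taken unquoted,
/// anything else is treated as one entity using its compact text.
fn entity_value(value: &Json) -> String {
    match value {
        Json::Str(s) => s.clone(),
        other => minify(other),
    }
}
```

With this sketch, a numeric `users` value of 1 yields the entity `1`, and an array placed in a non-complex column yields one entity such as `["b1","b2"]` rather than two.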

piobab · 2020-11-18T15:23:26Z

@michaelmior I ran a TSV benchmark on a very large file and saw no regression. We will add benchmarks to the CI pipeline in another PR.

I've finished my review. Please take a look. After your changes, I think we can merge.

michaelmior · 2020-11-18T19:21:33Z

I've addressed a couple final issues and rebased into a single commit so I think this is ready to be merged.

michaelmior force-pushed the json branch from 7aab534 to 2ecbb11 Compare November 14, 2020 15:28

kosciej reviewed Nov 14, 2020

piobab reviewed Nov 15, 2020

michaelmior force-pushed the json branch from 8ff5ca0 to 4b2f9f1 Compare November 17, 2020 15:24

michaelmior force-pushed the json branch from c48ebc2 to 4d8f000 Compare November 17, 2020 15:36

piobab reviewed Nov 18, 2020

michaelmior force-pushed the json branch from bd302ba to 03240e0 Compare November 18, 2020 19:18

michaelmior changed the title ~~Add basic JSON support (WIP)~~ Add basic JSON support Nov 18, 2020

Add support for JSON as an input format

b8584b1

michaelmior force-pushed the json branch from 03240e0 to b8584b1 Compare November 18, 2020 19:21

piobab approved these changes Nov 18, 2020

piobab merged commit 0ac407d into BaseModelAI:master Nov 18, 2020
