Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add basic JSON support #3

Merged
merged 1 commit into from
Nov 18, 2020
Merged

Add basic JSON support #3

merged 1 commit into from
Nov 18, 2020

Conversation

michaelmior
Copy link
Contributor

I'd like to be able to use Cleora for JSON data, so this is an attempt to implement support.
I'm not an experienced Rustacean, so there's definitely some cruft here to be removed, but it's basically working.
There is now a file type parameter that can be passed in to parse input data as JSON.
Note that I've removed any mention of CSV since it doesn't actually seem to be supported, but correct if I'm wrong.
If this seems like something you may eventually want to merge, happy to take suggestions on improvements

src/entity.rs Outdated Show resolved Hide resolved
src/pipeline.rs Outdated Show resolved Hide resolved
src/main.rs Show resolved Hide resolved
src/pipeline.rs Outdated Show resolved Hide resolved
src/pipeline.rs Outdated Show resolved Hide resolved
src/pipeline.rs Outdated Show resolved Hide resolved
src/pipeline.rs Outdated Show resolved Hide resolved
src/entity.rs Outdated Show resolved Hide resolved
src/entity.rs Outdated Show resolved Hide resolved
@piobab
Copy link
Contributor

piobab commented Nov 15, 2020

@michaelmior thanks for the implementation - the JSON idea is very nice :) I have made some comments. Let me know if you need any help. If you wouldn't like to deal with it, I can take it over from you.

src/pipeline.rs Outdated Show resolved Hide resolved
@kodieg
Copy link
Collaborator

kodieg commented Nov 17, 2020

@michaelmior I allowed myself to add my two cents here. Hope you don't mind :)

Pushed commit that makes process_row allow to accept both &str and String. This allows to keep previous performance of CSV parsing.

Also moved file_type test out of the loop just in case that could cause some regressions. It adds some code duplication, so possibly in the future we should refactor this portion. Especially, if we will want to add another file type.

@piobab I would propose merging this as this PR after your final review.

@michaelmior
Copy link
Contributor Author

@kodieg Thanks! I just rebased and made one minor change.

@kodieg
Copy link
Collaborator

kodieg commented Nov 17, 2020

Removed two new clippy warnings regarding &Vec<...> refs in function arguments. Replaced with slices.

@michaelmior
Copy link
Contributor Author

If all else looks good, let me know if you want to rebase to clean up the messy history.

src/pipeline.rs Outdated
.map({
|c| {
if !c.complex {
smallvec![parsed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can use other types for non-complex column as well (something similar as it is for arrays):

let elem = parsed.at_key(&c.name).unwrap();
let value = match elem.get_type() {
  dom::element::ElementType::String => elem.get_string().unwrap(),
  _ => elem.minify(),
};
smallvec![value]

After modification we can read such lines:

{"users": 1, "products": ["p1", "p2"], "brands": ["b1", "b2"]}
{"users": 2, "products": ["p2", "p3", "p4"], "brands": ["b1"]}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Although personally I would lean towards disallowing array values here since they should probably be treated as complex. This is why I only allowed strings initially. Although certainly other atomic values would make sense here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right but column config is here for "support". If there is some array it is treated as single entity and a. user see it in the output file. Otherwise the user should prepare json file with just strings.

src/pipeline.rs Outdated Show resolved Hide resolved
Cargo.toml Show resolved Hide resolved
src/main.rs Show resolved Hide resolved
@piobab
Copy link
Contributor

piobab commented Nov 18, 2020

@michaelmior I made a bench for TSV based on very big file - no regression. We will add benches to CI pipeline in other PR.

I've made a review. Please take a look. After your changes I think we can merge.

@michaelmior michaelmior changed the title Add basic JSON support (WIP) Add basic JSON support Nov 18, 2020
@michaelmior
Copy link
Contributor Author

I've addressed a couple final issues and rebased into a single commit so I think this is ready to be merged.

@piobab piobab merged commit 0ac407d into BaseModelAI:master Nov 18, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants