Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for NULL values #38

Open
ostrokach opened this issue Jun 15, 2016 · 10 comments
Open

Support for NULL values #38

ostrokach opened this issue Jun 15, 2016 · 10 comments

Comments

@ostrokach
Copy link

ostrokach commented Jun 15, 2016

First, thanks for the great tool!

I was wondering if it would be possible to add some rudimentary support for NULL values?

I want to use xsv to get the column types for a CSV file before loading it into a database. This way I would know the correct column types when creating the database table. While xsv is very fast at calculating file column statistics, my files often have int / float columns with NULL values (e.g. "-", ".'", "\N", "NA"), and xsv reports those as Unicode. It would be nice to have an option like --na-values, so that xsv could skip NULL values when calculating statistics, and could report whether a column contains at least one NULL value.

@ostrokach ostrokach changed the title Null values Support for NULL values Jun 16, 2016
@BurntSushi
Copy link
Owner

xsv does actually have support for null values. (Look at xsv stats --help.) The actual problem here is that there are uncountably many ways to represent null. xsv only interprets empty fields as NULL right now. You've listed a few already: -, ., \N, NA. I can imagine several other reasonable values there as well. How should xsv handle such a thing? Notably, values like . or - could otherwise be legitimate outside of any particular context, but probably could be safely assumed to represent NULL if every other type of value in the column were numeric.

A brutish way to solve this is to permit the user to specify a regex that determines what values constitute NULL, but I really dislike that for a number reasons (it'd be slow and I don't consider it user friendly at all).

A different way of thinking about this problem is: instead of teaching different parts of xsv about the different flavors of NULL, we could create a new sub-command that permits one to normalize NULL values in a data set to a canonical form. This indeed might permit specifying a regex or choosing from some predefined list of values. This might even fit into a broader "search and replace" style sub-command.

@ostrokach
Copy link
Author

I really like the idea of a new sub-command that would normalise NULL values!

One problem I often face when importing CSV files into MySQL is that the CSV files have different NULL values, which I have to convert to \N before importing, as \N is the only NULL character that MySQL can recognise. I wrote a Python script to convert CSVs to the appropriate format (using PyPy it's not too slow). But I'm sure something in Rust would run much faster. And this would solve my datatype inference problem as well.

@ostrokach
Copy link
Author

But then it becomes a question of what should be the canonical NULL character. \N would be ideal for loading data into MySQL, . would be best for VCF files, some programs like NA, NULL, etc... I agree that an empty field is the most intuitive representation for NULL, but it would be nice to have the option of using something else...

@BurntSushi
Copy link
Owner

I'm open to use friendly heuristics.

@danfowler
Copy link

Perhaps one option is to define the array of missing values for your CSV field in a schema format we're developing that xsv can then read: http://specs.frictionlessdata.io/json-table-schema/#missing-values

Support for reading a CSV schema might also have a carry-over effects for the other inference-related issues #22 and #28

@BurntSushi
Copy link
Owner

I'm not too keen on going down the schema route. That seems like a lot of added complexity. The point is that there's a certain point at which xsv becomes insufficient for your data analysis, and that's OK.

@icefoxen
Copy link

Merging in a request that's related, which is the ability to replace null values with something else.

Basically, I want to to be able to run xsv impute and have it replace blank values (or values with an out-of-bounds marker you can specify on the command line, such as "NA"). Obvious choices to fill in with would be a fixed value specified on the command line, or the same value as the previous cell in a row or column.

Value: It's hard to do this using sed or such, when a "blank value" could be the strings ,$ ,""$ ,, ,"", and so on.

Not sure how far one wants to go with replacement, since it'd be easy to feature creep "well why don't we add the option to fill in blanks with an average" "well why don't we add interpolation" "well why don't we let you put in an arbitrary equation" and so on. Which is obviously not something xsv should care about.

Notably though, for this use case just being able to specify a "null" to be a fixed string should be sufficient. If you have multiple different null values you can just have a pipeline of xsv calls that replaces each one with a fixed values. So if the "null" in your dataset could be an empty string, NA or na and you want to convert all nulls to \N, you could do xsv impute --null-value "na" --with "\N" | xsv impute --null-value "NA" --with "\N" | xsv impute --null-value "" --with "\N".

That way you don't need a "canonical" null character, you can translate any dataset to use whatever you consider canonical.

@BurntSushi
Copy link
Owner

@icefoxen Does this capture what you're after?

A different way of thinking about this problem is: instead of teaching different parts of xsv about the different flavors of NULL, we could create a new sub-command that permits one to normalize NULL values in a data set to a canonical form. This indeed might permit specifying a regex or choosing from some predefined list of values. This might even fit into a broader "search and replace" style sub-command.

(What I'm trying to do is understand whether the idea I had previously is enough to satisfy your use case. If it is, great. If not, then that means I need to think a bit more about this!)

@icefoxen
Copy link

@BurntSushi Yeah, that sounds about right. The only thing missing from that would be the ability to say "replace value X with the cell above/to the left of it", which might be a more specific task than you want xsv to handle.

Sorry for the repetition, obviously time for more coffee. :-)

@BurntSushi
Copy link
Owner

@icefoxen Haha no worries, just wanted to be super clear. The idea of replacing a value is interesting. There is something close to that in xsv today with the xsv select command, but it's not conditional on the value.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants