Support for NULL values #38

ostrokach · 2016-06-15T01:49:44Z

First, thanks for the great tool!

I was wondering if it would be possible to add some rudimentary support for NULL values?

I want to use xsv to get the column types for a CSV file before loading it into a database. This way I would know the correct column types when creating the database table. While xsv is very fast at calculating file column statistics, my files often have int / float columns with NULL values (e.g. "-", ".'", "\N", "NA"), and xsv reports those as Unicode. It would be nice to have an option like --na-values, so that xsv could skip NULL values when calculating statistics, and could report whether a column contains at least one NULL value.

The text was updated successfully, but these errors were encountered:

BurntSushi · 2016-07-31T16:40:16Z

xsv does actually have support for null values. (Look at xsv stats --help.) The actual problem here is that there are uncountably many ways to represent null. xsv only interprets empty fields as NULL right now. You've listed a few already: -, ., \N, NA. I can imagine several other reasonable values there as well. How should xsv handle such a thing? Notably, values like . or - could otherwise be legitimate outside of any particular context, but probably could be safely assumed to represent NULL if every other type of value in the column were numeric.

A brutish way to solve this is to permit the user to specify a regex that determines what values constitute NULL, but I really dislike that for a number reasons (it'd be slow and I don't consider it user friendly at all).

A different way of thinking about this problem is: instead of teaching different parts of xsv about the different flavors of NULL, we could create a new sub-command that permits one to normalize NULL values in a data set to a canonical form. This indeed might permit specifying a regex or choosing from some predefined list of values. This might even fit into a broader "search and replace" style sub-command.

ostrokach · 2016-07-31T22:40:10Z

I really like the idea of a new sub-command that would normalise NULL values!

One problem I often face when importing CSV files into MySQL is that the CSV files have different NULL values, which I have to convert to \N before importing, as \N is the only NULL character that MySQL can recognise. I wrote a Python script to convert CSVs to the appropriate format (using PyPy it's not too slow). But I'm sure something in Rust would run much faster. And this would solve my datatype inference problem as well.

ostrokach · 2016-07-31T22:49:09Z

But then it becomes a question of what should be the canonical NULL character. \N would be ideal for loading data into MySQL, . would be best for VCF files, some programs like NA, NULL, etc... I agree that an empty field is the most intuitive representation for NULL, but it would be nice to have the option of using something else...

BurntSushi · 2016-07-31T23:37:14Z

I'm open to use friendly heuristics.

danfowler · 2016-09-22T18:29:58Z

Perhaps one option is to define the array of missing values for your CSV field in a schema format we're developing that xsv can then read: http://specs.frictionlessdata.io/json-table-schema/#missing-values

Support for reading a CSV schema might also have a carry-over effects for the other inference-related issues #22 and #28

BurntSushi · 2016-09-23T10:42:53Z

I'm not too keen on going down the schema route. That seems like a lot of added complexity. The point is that there's a certain point at which xsv becomes insufficient for your data analysis, and that's OK.

icefoxen · 2017-03-30T15:01:37Z

Merging in a request that's related, which is the ability to replace null values with something else.

Basically, I want to to be able to run xsv impute and have it replace blank values (or values with an out-of-bounds marker you can specify on the command line, such as "NA"). Obvious choices to fill in with would be a fixed value specified on the command line, or the same value as the previous cell in a row or column.

Value: It's hard to do this using sed or such, when a "blank value" could be the strings ,$ ,""$ ,, ,"", and so on.

Not sure how far one wants to go with replacement, since it'd be easy to feature creep "well why don't we add the option to fill in blanks with an average" "well why don't we add interpolation" "well why don't we let you put in an arbitrary equation" and so on. Which is obviously not something xsv should care about.

Notably though, for this use case just being able to specify a "null" to be a fixed string should be sufficient. If you have multiple different null values you can just have a pipeline of xsv calls that replaces each one with a fixed values. So if the "null" in your dataset could be an empty string, NA or na and you want to convert all nulls to \N, you could do xsv impute --null-value "na" --with "\N" | xsv impute --null-value "NA" --with "\N" | xsv impute --null-value "" --with "\N".

That way you don't need a "canonical" null character, you can translate any dataset to use whatever you consider canonical.

BurntSushi · 2017-03-30T15:03:18Z

@icefoxen Does this capture what you're after?

A different way of thinking about this problem is: instead of teaching different parts of xsv about the different flavors of NULL, we could create a new sub-command that permits one to normalize NULL values in a data set to a canonical form. This indeed might permit specifying a regex or choosing from some predefined list of values. This might even fit into a broader "search and replace" style sub-command.

(What I'm trying to do is understand whether the idea I had previously is enough to satisfy your use case. If it is, great. If not, then that means I need to think a bit more about this!)

icefoxen · 2017-03-30T15:07:11Z

@BurntSushi Yeah, that sounds about right. The only thing missing from that would be the ability to say "replace value X with the cell above/to the left of it", which might be a more specific task than you want xsv to handle.

Sorry for the repetition, obviously time for more coffee. :-)

BurntSushi · 2017-03-30T15:12:17Z

@icefoxen Haha no worries, just wanted to be super clear. The idea of replacing a value is interesting. There is something close to that in xsv today with the xsv select command, but it's not conditional on the value.

ostrokach changed the title ~~Null values~~ Support for NULL values Jun 16, 2016

BurntSushi added the enhancement label Jul 31, 2016

ghuls mentioned this issue Jan 5, 2017

Allow setting "missing value" string when using join. #54

Open

icefoxen mentioned this issue Mar 30, 2017

Feature request: Filling in blanks #68

Closed

BurntSushi mentioned this issue Jun 26, 2017

Add count of empty/non empty values in stats #48

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for NULL values #38

Support for NULL values #38

ostrokach commented Jun 15, 2016 •

edited

Loading

BurntSushi commented Jul 31, 2016

ostrokach commented Jul 31, 2016

ostrokach commented Jul 31, 2016

BurntSushi commented Jul 31, 2016

danfowler commented Sep 22, 2016

BurntSushi commented Sep 23, 2016

icefoxen commented Mar 30, 2017

BurntSushi commented Mar 30, 2017

icefoxen commented Mar 30, 2017

BurntSushi commented Mar 30, 2017

Support for NULL values #38

Support for NULL values #38

Comments

ostrokach commented Jun 15, 2016 • edited Loading

BurntSushi commented Jul 31, 2016

ostrokach commented Jul 31, 2016

ostrokach commented Jul 31, 2016

BurntSushi commented Jul 31, 2016

danfowler commented Sep 22, 2016

BurntSushi commented Sep 23, 2016

icefoxen commented Mar 30, 2017

BurntSushi commented Mar 30, 2017

icefoxen commented Mar 30, 2017

BurntSushi commented Mar 30, 2017

ostrokach commented Jun 15, 2016 •

edited

Loading