Add count of empty/non empty values in stats #48

patricebellan · 2016-11-26T17:02:32Z

xsv stats gives a lot of useful information, and I use it a lot.

I'm currently missing one thing that may not be too much of a hassle to add.
Would you consider adding the count of empty/non empty values for each field?

BurntSushi · 2016-11-26T18:20:52Z

Isn't there already a count of null values? I know at least the frequency command should help here.

patricebellan · 2016-11-26T19:02:59Z

Unless I missed something, stats just tells whether there are nulls, but doesn't give the count.

xsv frequency -a -l1 does the trick though, since nulls come first

mratsim · 2017-02-27T13:10:11Z

Unless I missed something xsv frequency doesn't do "groupby" operations like mean, std and count.

As an input to a data science workflow, I'm working with EXIF metadata from 70k+ photos. The csv source file is about 1.1GB and very sparse. I would like to select only columns that are mostly used.

This is the output of xsv frequency

$  head -5 metadata_frequency.csv
field,value,count
SourceFile,../data/images/6928025/6928025_3fff3925431f80e0dfc83d48e7229a10.jpg,1
SourceFile,../data/images/7031649/7031649_dce502452d13649aa6682060126e2bfb.jpg,1
SourceFile,../data/images/7198981/7198981_6b7e5ef7587dab908ff8864b7c2c4b55.jpg,1
SourceFile,../data/images/7224251/7224251_b68dec5933ec49d3be9213513c7a8e43.jpg,1

This is the output of xsv frequency -a -l1

$  head -5 metadata_count.csv
field,value,count
SourceFile,../data/images/6969761/6969761_9b6c7e93ee82a10a406a34c44d900c60.jpg,1
About,uuid:73abd4a8-2f89-11df-9d3c-a53e4fead10b,1
AccelerationVector,-0.9773887673 -0.006948695972 -0.2358681423,1
Accentuation,0.0,73

It just took the first occurence of "SourceFile" but did not "groupby sum" on it.

BurntSushi · 2017-02-27T13:22:44Z

@mratsim Could you please create a new issue and describe the problem you're trying to solve? Your problem seems completely distinct from this issue. In particular, I don't understand what you expect to see and why you think "groupby sum" isn't happening.

mratsim · 2017-02-27T14:20:04Z

Sorry, I wasn't clear enough.

What I'm looking for is counting the number of empty, or non-empty values for each columns like OP.

Basically xsv stats is already doing min, max, mean, median to summarize on all the values appearing in a column, I would like to add the count of non-empty values in xsv stats.

Next paragraph is an explanation on groupby sum and may be skipped

Using the database query terms, we would do a count to get the number of non-empty items in a csv column.
However, if we queried the output xsv frequency we would have to do a groupby the "field" column and sum on the matching "count" column to get the count of non-empty items.

Those are equivalent: count if original csv, groupby sum if in the xsv frequency format

From your example in the README, it's turning:

$ xsv frequency worldcitiespop.csv --limit 5
field,value,count
Country,cn,238985
Country,ru,215938
Country,id,176546
Country,us,141989
Country,ir,123872
City,san jose,328
City,san antonio,320
City,santa rosa,296
City,santa cruz,282
City,san juan,255
AccentCity,San Antonio,317
AccentCity,Santa Rosa,296
AccentCity,Santa Cruz,281
AccentCity,San Juan,254
AccentCity,San Miguel,254

into

field,groupby(field).sum(count)
Country,897327
City,1481
AccentCity,1402

897327 being 238985+215938+176546+141986+123872
So we have the total count for each "field" column values.

Another way to look at it in term of functional programming/Rust is that we are folding (reducing) by summing the values associated to "Country, City, AccentCity" respectively

Given the format this should probably appear in xsv stats

BurntSushi · 2017-02-27T14:31:26Z

Ah, gotya, thanks, that's clearer. I can think of a couple of work-arounds, although I suppose it would be nice to provide it by default. For example, you could do xsv search -s Column '.' foo.csv | xsv count to get a count of non-NULL columns (and xsv count foo.csvto give you a count of all rows).

seamusabshere · 2017-06-26T15:01:23Z

I suppose it would be nice to provide it by default

👍

it would be very natural for xsv stats to return null counts for every column

emk · 2017-06-26T15:05:34Z

@BurntSushi @seamusabshere I would be happy to implement this and submit a PR that's at least as tidy as what I submitted for xsv partition. :-) Are there any thoughts you have on how this should work?

What would a good output column name be?
Are the any backwards compatibility concerns about adding a new output column?

BurntSushi · 2017-06-26T15:16:21Z

I would be happy to implement this and submit a PR that's at least as tidy as what I submitted for xsv partition. :-)

Awesome, thank you!

Note that there is a related but orthogonal issue: #38. I defer to you on whether you want to also try and solve that problem, but I just wanted to make sure you were aware of it.

What would a good output column name be?

I think my only real opinion here is that it shouldn't contain spaces. nullcount or something?

Are the any backwards compatibility concerns about adding a new output column?

If there are, we can just release a new minor version, but I don't think so. In particular, I think CSV data with headers can be expanded with additional columns (and probably even re-orderings of those columns too).

devinrsmith · 2018-09-26T13:51:01Z

A nullcount for the xsv stats command would be great!

emk · 2018-10-03T12:45:19Z

Unfortunately, this is currently about 647 items down my TODO list, at least until the next time I need it for something at work. :-( If somebody else wants to tackle this, please don't wait for me!

scpike mentioned this issue Dec 13, 2019

Add nullcount to stats command #198

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add count of empty/non empty values in stats #48

Add count of empty/non empty values in stats #48

patricebellan commented Nov 26, 2016

BurntSushi commented Nov 26, 2016

patricebellan commented Nov 26, 2016

mratsim commented Feb 27, 2017 •

edited

Loading

BurntSushi commented Feb 27, 2017

mratsim commented Feb 27, 2017 •

edited

Loading

BurntSushi commented Feb 27, 2017

seamusabshere commented Jun 26, 2017

emk commented Jun 26, 2017

BurntSushi commented Jun 26, 2017 •

edited

Loading

devinrsmith commented Sep 26, 2018

emk commented Oct 3, 2018

Add count of empty/non empty values in stats #48

Add count of empty/non empty values in stats #48

Comments

patricebellan commented Nov 26, 2016

BurntSushi commented Nov 26, 2016

patricebellan commented Nov 26, 2016

mratsim commented Feb 27, 2017 • edited Loading

BurntSushi commented Feb 27, 2017

mratsim commented Feb 27, 2017 • edited Loading

BurntSushi commented Feb 27, 2017

seamusabshere commented Jun 26, 2017

emk commented Jun 26, 2017

BurntSushi commented Jun 26, 2017 • edited Loading

devinrsmith commented Sep 26, 2018

emk commented Oct 3, 2018

mratsim commented Feb 27, 2017 •

edited

Loading

mratsim commented Feb 27, 2017 •

edited

Loading

BurntSushi commented Jun 26, 2017 •

edited

Loading