Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add count of empty/non empty values in stats #48

Open
patricebellan opened this issue Nov 26, 2016 · 11 comments
Open

Add count of empty/non empty values in stats #48

patricebellan opened this issue Nov 26, 2016 · 11 comments

Comments

@patricebellan
Copy link

xsv stats gives a lot of useful information, and I use it a lot.

I'm currently missing one thing that may not be too much of a hassle to add.
Would you consider adding the count of empty/non empty values for each field?

@BurntSushi
Copy link
Owner

Isn't there already a count of null values? I know at least the frequency command should help here.

@patricebellan
Copy link
Author

Unless I missed something, stats just tells whether there are nulls, but doesn't give the count.

xsv frequency -a -l1 does the trick though, since nulls come first

@mratsim
Copy link

mratsim commented Feb 27, 2017

Unless I missed something xsv frequency doesn't do "groupby" operations like mean, std and count.

As an input to a data science workflow, I'm working with EXIF metadata from 70k+ photos. The csv source file is about 1.1GB and very sparse. I would like to select only columns that are mostly used.

This is the output of xsv frequency

$  head -5 metadata_frequency.csv
field,value,count
SourceFile,../data/images/6928025/6928025_3fff3925431f80e0dfc83d48e7229a10.jpg,1
SourceFile,../data/images/7031649/7031649_dce502452d13649aa6682060126e2bfb.jpg,1
SourceFile,../data/images/7198981/7198981_6b7e5ef7587dab908ff8864b7c2c4b55.jpg,1
SourceFile,../data/images/7224251/7224251_b68dec5933ec49d3be9213513c7a8e43.jpg,1

This is the output of xsv frequency -a -l1

$  head -5 metadata_count.csv
field,value,count
SourceFile,../data/images/6969761/6969761_9b6c7e93ee82a10a406a34c44d900c60.jpg,1
About,uuid:73abd4a8-2f89-11df-9d3c-a53e4fead10b,1
AccelerationVector,-0.9773887673 -0.006948695972 -0.2358681423,1
Accentuation,0.0,73

It just took the first occurence of "SourceFile" but did not "groupby sum" on it.

@BurntSushi
Copy link
Owner

@mratsim Could you please create a new issue and describe the problem you're trying to solve? Your problem seems completely distinct from this issue. In particular, I don't understand what you expect to see and why you think "groupby sum" isn't happening.

@mratsim
Copy link

mratsim commented Feb 27, 2017

Sorry, I wasn't clear enough.

What I'm looking for is counting the number of empty, or non-empty values for each columns like OP.

Basically xsv stats is already doing min, max, mean, median to summarize on all the values appearing in a column, I would like to add the count of non-empty values in xsv stats.


Next paragraph is an explanation on groupby sum and may be skipped

Using the database query terms, we would do a count to get the number of non-empty items in a csv column.
However, if we queried the output xsv frequency we would have to do a groupby the "field" column and sum on the matching "count" column to get the count of non-empty items.

Those are equivalent: count if original csv, groupby sum if in the xsv frequency format

From your example in the README, it's turning:

$ xsv frequency worldcitiespop.csv --limit 5
field,value,count
Country,cn,238985
Country,ru,215938
Country,id,176546
Country,us,141989
Country,ir,123872
City,san jose,328
City,san antonio,320
City,santa rosa,296
City,santa cruz,282
City,san juan,255
AccentCity,San Antonio,317
AccentCity,Santa Rosa,296
AccentCity,Santa Cruz,281
AccentCity,San Juan,254
AccentCity,San Miguel,254

into

field,groupby(field).sum(count)
Country,897327
City,1481
AccentCity,1402

897327 being 238985+215938+176546+141986+123872
So we have the total count for each "field" column values.

Another way to look at it in term of functional programming/Rust is that we are folding (reducing) by summing the values associated to "Country, City, AccentCity" respectively

Given the format this should probably appear in xsv stats

@BurntSushi
Copy link
Owner

Ah, gotya, thanks, that's clearer. I can think of a couple of work-arounds, although I suppose it would be nice to provide it by default. For example, you could do xsv search -s Column '.' foo.csv | xsv count to get a count of non-NULL columns (and xsv count foo.csvto give you a count of all rows).

@seamusabshere
Copy link

I suppose it would be nice to provide it by default

👍

it would be very natural for xsv stats to return null counts for every column

@emk
Copy link
Contributor

emk commented Jun 26, 2017

@BurntSushi @seamusabshere I would be happy to implement this and submit a PR that's at least as tidy as what I submitted for xsv partition. :-) Are there any thoughts you have on how this should work?

  • What would a good output column name be?
  • Are the any backwards compatibility concerns about adding a new output column?

@BurntSushi
Copy link
Owner

BurntSushi commented Jun 26, 2017

I would be happy to implement this and submit a PR that's at least as tidy as what I submitted for xsv partition. :-)

Awesome, thank you!

Note that there is a related but orthogonal issue: #38. I defer to you on whether you want to also try and solve that problem, but I just wanted to make sure you were aware of it.

What would a good output column name be?

I think my only real opinion here is that it shouldn't contain spaces. nullcount or something?

Are the any backwards compatibility concerns about adding a new output column?

If there are, we can just release a new minor version, but I don't think so. In particular, I think CSV data with headers can be expanded with additional columns (and probably even re-orderings of those columns too).

@devinrsmith
Copy link

A nullcount for the xsv stats command would be great!

@emk
Copy link
Contributor

emk commented Oct 3, 2018

Unfortunately, this is currently about 647 items down my TODO list, at least until the next time I need it for something at work. :-( If somebody else wants to tackle this, please don't wait for me!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants