Include summary statistics for numeric columns in Facets #2001

Open
ostephens opened this issue Mar 29, 2019 · 7 comments
Labels
  • facets: Behaviour or rendering of facets in a project
  • Type: Feature Request: Identifies requests for new features or enhancements. These involve proposing new improvements.

Comments

@ostephens
Member

ostephens commented Mar 29, 2019

Is your feature request related to a problem or area of OpenRefine? Please describe.
For numeric cells in a column I would like to be able to easily find basic statistical information such as:

  • max
  • min
  • mean
  • median
  • sum

Describe the solution you'd like
Make it possible to see this information in a facet

Additional context
See discussion in #1340

@ettorerizza
Member

ettorerizza commented Mar 29, 2019

For the record, the statistics provided by the currently deprecated refine-stats extension are:

Count
Sum
Min
Max
Mean
Median
Mode (not very useful IMHO)
Standard deviation
Variance

The number of blank or null cells (NA) would also be very useful.
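As a rough illustration of the measures listed above, here is a minimal Python sketch. It is not the refine-stats implementation: the function name `summarize`, the treatment of `None` and `""` as blank, and the use of sample (n-1) rather than population variance are all assumptions made for the example.

```python
import statistics

def summarize(cells):
    """Summarize one column; `cells` is a list of raw cell values (hypothetical helper)."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    numbers = [c for c in cells
               if isinstance(c, (int, float)) and not isinstance(c, bool)]
    # Assumption: a blank cell is None or the empty string.
    blanks = sum(1 for c in cells if c is None or c == "")
    if not numbers:
        return {"count": 0, "blank": blanks}
    several = len(numbers) > 1  # sample stdev/variance need at least two values
    return {
        "count": len(numbers),
        "blank": blanks,
        "sum": sum(numbers),
        "min": min(numbers),
        "max": max(numbers),
        "mean": statistics.mean(numbers),
        "median": statistics.median(numbers),
        "mode": statistics.mode(numbers),
        # Assumption: sample (n-1) statistics; a real facet might use population ones.
        "stdev": statistics.stdev(numbers) if several else 0.0,
        "variance": statistics.variance(numbers) if several else 0.0,
    }

stats = summarize([4, 8, 8, 15, None, "", "n/a", 25])
print(stats)
```

Non-numeric strings such as `"n/a"` are neither counted as numbers nor as blanks here, which mirrors the separate "non-numeric" count a numeric facet already reports.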

@ostephens
Member Author

Is "Count" the number of cells with a numeric value?

@ettorerizza
Member

Yep, just the number of cells in the (filtered) column.

@ostephens
Member Author

OK - so I think we already get:

  • Count of Numeric
  • Count of blank (null or empty string)
  • Count of errors
  • Count of non-numeric (not including blanks or errors)

A range is shown, but it usually extends slightly beyond the actual Max/Min (i.e. the max shown can be slightly higher than the max value in the column and the min shown can be slightly lower than the min value in the column).
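One plausible reason for a displayed range that is wider than the data, offered here as an assumption rather than a description of OpenRefine's code, is that histogram bin edges get snapped to a round step size. The hypothetical `snapped_range` function below sketches that idea.

```python
import math

def snapped_range(values):
    """Round the data range outward to a power-of-ten step (illustrative only)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo, hi  # no spread, nothing to snap
    # Assumption: the step is the largest power of ten not exceeding the spread.
    step = 10 ** math.floor(math.log10(hi - lo))
    # Snap the lower edge down and the upper edge up to a multiple of the step.
    return math.floor(lo / step) * step, math.ceil(hi / step) * step

print(snapped_range([3.2, 47.9]))
```

With values between 3.2 and 47.9 the snapped range is 0 to 50, so both ends sit slightly outside the true min and max, which is why exact values would need to be reported separately.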

@ettorerizza
Member

ettorerizza commented Mar 29, 2019

I see that a tool like Talend Data Preparation displays this information as a Boxplot.

[Screenshot: Talend Data Preparation box plot]

In the logic of OpenRefine, the equivalent could be an enriched histogram (maybe with the median and the mean as vertical lines)

[Screenshot: OpenRefine histogram]

@wetneb added the "Type: Feature Request" and "facets" labels on Apr 1, 2019

@nanobrad

nanobrad commented Apr 8, 2019

I would suggest a separate "Statistical Facet" that allows the user to explore a column's quartiles, outliers, and sigmas in a more precise fashion than the current histogram. I would do a Gaussian (or Normal) mode and a quartile mode.

It would have the side effect of providing all the statistical measures as mentioned above.

Quartile mode would use the rules of the ggplot box plot to determine outliers (1.5*IQR).
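A minimal sketch of the 1.5*IQR rule, for illustration only. The function name `iqr_outliers` is hypothetical, and the choice of `method="inclusive"` is an assumption intended to approximate the default quantile definition used by R.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] together with those fences."""
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)

outliers, fences = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(outliers, fences)
```

In this sample Q1 is 3 and Q3 is 7, so the fences are -3 and 13 and only 100 is flagged.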

Gaussian mode would use sigmas:
[Image: mock-up of the proposed facet in Gaussian mode]

Give an "n choices"-style link that pops up a text box with all the statistics for easy copying. This is the blue "Stat:" link in the above mock-up.

Make it invertible as well.

To cleanse:

  1. Move sliders to the +/- 3 sigma marks
  2. Invert the selection
  3. Review the validity of the data
  4. Mass delete
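
The first two cleansing steps can be sketched as follows. This is only an illustration of the idea: `sigma_outliers` is a hypothetical name, and the sample standard deviation is an assumed choice.

```python
import statistics

def sigma_outliers(values, n_sigma=3.0):
    """Return the values lying outside mean +/- n_sigma standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample standard deviation
    # Step 1: place the "sliders" at the +/- n_sigma marks.
    low, high = mean - n_sigma * sd, mean + n_sigma * sd
    # Step 2: invert the selection, keeping only what falls outside the sliders.
    outside = [v for v in values if not (low <= v <= high)]
    return outside, (low, high)

# Nineteen ordinary readings and one suspicious one.
suspects, bounds = sigma_outliers([10] * 19 + [100])
print(suspects, bounds)
```

Note that in a small column a single extreme value inflates the standard deviation enough that it may never fall outside 3 sigma, so steps 3 and 4 (review, then delete) still need human judgement.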

Like a text facet, this facet would be responsive to the current facets and filters. The numeric facet, for example, only indicates the filtering through dimming of the bars. This facet would "zoom in" on the filtered data and the statistics would update.

@thadguidry
Member

@nanobrad Bradley THE ARTIST! Nicely done mockup! Thanks for this!

5 participants