Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create kwarg to normalize histogram such that sum of bin values equals 1 #1567

Closed
louist87 opened this issue Dec 6, 2012 · 22 comments
Closed

Comments

@louist87
Copy link

louist87 commented Dec 6, 2012

This StackOverflow question sums up the problem pretty well. The answer is straightforward but convoluted. This kind of functionality is widely used and requested, so it would be of great benefit for the hist function to handle it automatically.

Basically, setting the normed kwarg to a logically true value normalizes the histogram such that the integral is equal to 1. I suggest creating a new kwarg, named something like normedheight (or perhaps normedh), which implements the solution proposed in the StackOverflow post.

Here is the code, for convenience:

inp = normal(size=1000)
h = hist(inp)
bar(h[1][:-1], h[0]/float(len(inp)), diff(h[1]))  # normalized histogram
@piti118
Copy link
Contributor

piti118 commented Dec 13, 2012

how about having normed accept multiple types
-bool bahave like it used to
-callable with two argument accept h and bins as argument and compute the normalization
-number normalize area to that amount
-string reserved for special case like what's described

@louist87
Copy link
Author

@piti118 That's a good call. Might I suggest normed=height?

@WeatherGod
Copy link
Member

Usually, I am against such ideas where a parameter can accept multiple types, but this might be a case where the use-case is compelling enough to over-ride such concerns. In this situation, the requested feature is not orthogonal to the existing features, and so adding a new kwarg would only complicate things.

I would suggest keeping it as simple as possible, though, and accept only booleans and (positive) numbers, to start. This approach makes sense because True == 1. Strings and callables I would avoid for now until a really compelling use-case is put forth.

@louist87
Copy link
Author

@WeatherGod I humbly disagree with your suggestion to limit the kwarg values to booleans and positive integers. My concern is that it makes for very cryptic function calls because it's not immediately clear what the integers represent.

The original issue is that matplotlib normalizes the area under the curve to 1 instead of the height of the bars. In this case, it's not the numerical value that's of primary importance, but what that value represents. If we were to make it so that 2 represents the kind of height normalization I originally requested, the function call `hist(data, bins=20, normed=2) makes it seem as though we're normalizing to 2. Even if normalizing to 2 seems unlikely, the casual reader has to refer to the documentation to understand what's going on. This is not very pythonic.

Basically, I want to avoid the situation where I constantly have to refer to the documentation to remember which value to pass. A call such as hist(data, bins=20, normed='height') is much more explicit. In fact, if we decide to overload the normed kwarg, then I would suggest aliasing True with a string such as auc (area under the curve) or deriv (derivative). That way, passing True would still work as expected, but there's also a new way of writing more explicit code.

@piti118
Copy link
Contributor

piti118 commented Dec 14, 2012

@WeatherGod I'm just toying the idea and alternative around. Your expert comment is more than welcome.

But in this case there are two types of normalization we need to distinguish.

  1. Normalize by area like it is now
  2. Normalize by sum like what's being suggested.

If we only allow normed=1., I'm not sure which one it will be using.

I hate adding another kwarg too but we might need one in this case. Lets call it norm_mode which takes string either area or sum and do the appropriate normalization.

normed keyword in numpy histogram is being deprecated anyway.
http://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html

@WeatherGod
Copy link
Member

On Thu, Dec 13, 2012 at 8:22 PM, Piti Ongmongkolkul <
notifications@github.com> wrote:

@WeatherGod https://github.com/WeatherGod I'm just toying the idea and
alternative around. Your expert comment is more than welcome.

But in this case there are two types of normalization we need to
distinguish.

  1. Normalize by area like it is now
  2. Normalize by sum like what's being suggested.

If we only allow normed=1., I'm not sure which one it will be using.

I hate adding another kwarg too but we might need one in this case. Lets
call it norm_mode which takes string either area or sum and do the
appropriate normalization.

Actually, I like that sort of distinction, and it does make the issues that
I was confusing in my head nicely orthogonal.

normed keyword in numpy histogram is being deprecated anyway.
http://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html

Oy! Not again...

@louist87
Copy link
Author

I'd be willing to take a stab at fixing this, but I've never contributed anything significant before. If someone feels like doing it instead, the community might be better off!

In the meantime, is this the place to ask for clarification regarding the code in axes.hist? If not, is there a more appropriate place? A developer's IRC, perhaps?

@piti118
Copy link
Contributor

piti118 commented Dec 14, 2012

Contributing code is easy and fun. I have contributed to matplotlib couple times. I found the community to be super friendly.

Just make a pull request/issue and one of repo collab will help you.

@louist87
Copy link
Author

@piti118 Alright, then! I'll take a crack at it sometime this week. If I get stuck, I'll be sure to let someone know.

@ivanov
Copy link
Member

ivanov commented Dec 14, 2012

In the meantime, is this the place to ask for clarification regarding the code in axes.hist?

matplotlib-users list for help with using it, matplotlib-devel list for proposing changes. sign up here (note, the archives are broken on there, you can find the archives here)

@dmcdougall
Copy link
Member

Contributing code is easy and fun. I have contributed to matplotlib couple times. I found the community to be super friendly.

Just make a pull request/issue and one of repo collab will help you.

It is feedback like this that makes me happy to contribute to this open source project. Thank you.

@neggert
Copy link
Contributor

neggert commented Jan 26, 2013

How's progress on this coming? I'm doing some work on the hist function anyway, so I could probably implement this if no one has started working on it yet.

What would make sense to me would be to let normed take either a string or a bool. So normed='area' gives the current behavior and normed='sum' gives the new behavior. We make normed=True equivalent to normed='area' both to preserve backwards compatibility and because IMO that's what most users want when asking for a normalized histogram.

@louist87
Copy link
Author

@nic I haven't had a chance to get started -- have at it!

On Sat, Jan 26, 2013 at 4:27 PM, Nic Eggert notifications@github.comwrote:

How's progress on this coming? I'm doing some work on the hist function
anyway, so I could probably implement this if no one has started working on
it yet.

What would make sense to me would be to let normed take either a string
or a bool. So normed='area' gives the current behavior and normed='sum'gives the new behavior. We make
normed=True equivalent to normed='area' both to preserve backwards
compatibility and because IMO that's what most users want when asking for a
normalized histogram.


Reply to this email directly or view it on GitHubhttps://github.com//issues/1567#issuecomment-12736801.

Louis Thibault
+33 6 40 24 06 05

@phobson
Copy link
Member

phobson commented Mar 8, 2013

@neggert et al, think there's any value in a normed='max' option so that the tallest bar equals one? Just thinking out loud here

@tacaswell
Copy link
Member

At some point does it make sense to be able to pass hist an arbitrary normalization function instead of trying to guess every possible normalization?

The function would take a single argument of a numpy array (the raw bin counts), and returns an array of the exact same shape scaled how ever the user wants.

Does it make sense to try and leverage the Normalize classes that already exist in color?

@neggert
Copy link
Contributor

neggert commented Mar 16, 2013

Hmm, it does look like what Normalize does is conceptually similar, but it doesn't look like we could apply it as is. If we were going to go that route, I think the cleanest thing to do would be to pull Normalize out of colors.py and expand the class to handle arbitrary normalizations. Then color stuff and hist could both call it. That seems like a fairly major change for something this trivial.

@tacaswell
Copy link
Member

I never claimed it would be simple ;) , but there was recently discussion of developing a way to make Normalize easier to extend to arbitrary functions to avoid an explosion of built in functions, and that is where it looked like this was going. If that is going to get written, it should only be done once.

@neggert
Copy link
Contributor

neggert commented Mar 16, 2013

Ah, fair enough.

On Sat, Mar 16, 2013 at 3:43 PM, Thomas A Caswell
notifications@github.comwrote:

I never claimed it would be simple ;) , but there was recently discussion
of developing a way to make Normalize easier to extend to arbitrary
functions to avoid an explosion of built in functions, and that is where it
looked like this was going. If that is going to get written, it should only
be done once.


Reply to this email directly or view it on GitHubhttps://github.com//issues/1567#issuecomment-15010953
.

@tacaswell
Copy link
Member

@neggert @phobson Either of you want to take responsibility for getting this done? If not, I am going to close it.

@phobson
Copy link
Member

phobson commented Jan 17, 2014

I just can't commit to this before next month. Stretched pretty thin right now :/--
Paul Hobson
Sorry if this is unintelligible. I'm on my phone.

On Thu, Jan 16, 2014 at 2:57 PM, Thomas A Caswell
notifications@github.com wrote:

@neggert @phobson Either of you want to take responsibility for getting this done? If not, I am going to close it.

Reply to this email directly or view it on GitHub:
#1567 (comment)

@neggert
Copy link
Contributor

neggert commented Jan 17, 2014

To be honest, I still don't really see the appeal of this feature, and the hist signature is complicated enough as it is. I won't object if someone else wants to implement this, but I don't think this is a good use of my limited time to work on mpl.

Plus, I'm already behind on the PRs I have open.

@tacaswell
Copy link
Member

Fair enough. I am going to close this. If anyone is interested in implementing this feel free to re-open.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants