Non-parametric distributions #29

Merged 1 commit on Nov 28, 2014

@cmaumet commented Nov 21, 2014
Contributor

This is a proposal to add two terms to describe non-parametric distributions:

  • non-parametric distribution: "Probability distribution estimated empirically on the data without assumptions on the shape of the probability distribution."
  • non-parametric symmetric distribution: "Probability distribution estimated empirically on the data assuming only symmetry of the probability distribution."

This proposal was drawn up with @nicholst. The definitions were also discussed with @khelm at incf-nidash/nidm-specs#191.

A few comments:

  • I was unsure whether those terms should go under discrete probability distribution or continuous probability distribution. The estimated non-parametric distribution is discrete, but there is no assumption on the discreteness of the underlying true distribution. @nicholst: would you like to comment on this?
  • I did not know how to set a new numerical identifier, so I used STAT_XXXX and STATO_YYYY. I am happy to update these if you can let me know how to proceed (@agbeltran, @proccaserra)?

Any comment is very welcome! Thank you.

@nicholst

Thanks @cmaumet. Quick comment on discrete/continuous:

A nonparametric distribution can be discrete or continuous. Can there be 3 terms? "Discrete nonparametric" and "continuous nonparametric", with "nonparametric" as the parent of these two?

In my experience many nonparametric test procedures are agnostic to whether data are continuous or discrete (though there is a class of methods that work exclusively with discrete, i.e. categorical, data).

@proccaserra
Member

@cmaumet @nicholst: these terms could be added, but we need to watch out for one thing when developing STATO, namely avoiding an asserted multiple-parent hierarchy.
Then two things:

  • distribution symmetry: we could use the notion of 'skewness' to formally define the class and set the axiom.
  • non-parametric distribution: I was consulting http://reference.wolfram.com/language/guide/NonparametricStatisticalDistributions.html to clarify how non-parametric distributions are generated and how to define them. Is this representative of what you need?

Finally, @cmaumet, we need to get back to you regarding the issue of identifiers and possibly agree on a process, as diffs on OWL files can be painful.

Many thanks for the input!

@nicholst

About http://reference.wolfram.com/language/guide/NonparametricStatisticalDistributions.html, yes, that's a good description of what we're talking about.

On "multiple parent hierarchy" and "axioms", this is where I'm beyond my comfort zone :)


@nicholst

I realise I didn't respond to @proccaserra's comment on symmetry. Indeed, a symmetric distribution has zero skew (though zero skew on its own does not guarantee symmetry).

About asserting multiple parent hierarchies... does this mean that, from a bucket of concepts

  • discrete nonparametric
  • continuous nonparametric
  • symmetric nonparametric

you would model this as one concept, "nonparametric distribution", with different attributes?

@proccaserra
Member

@nicholst, this is indeed what we are looking at. We were also looking at how these 'non-parametric distributions' differ from the 'parametric' ones, and it seems that we could model the difference by stating that non-parametric distributions are computed/estimated from the data (all data, binned data, censored data, kernel) whereas the parametric ones are not.

@nicholst

@proccaserra... hmmm, well, in practice, parametric distributions are also estimated from the data; it's just that there are parameters that define the distribution. That is, there is no one "Gaussian" distribution: there is an infinite number of them, conveniently indexed by just two values, mean and variance.

But I don't want to turn this into a theoretical counting-the-number-of-angels-on-the-head-of-a-pin exercise. I see two reasons to represent distributions in STATO: Model assumptions and test statistic sampling distributions.

Every statistical procedure makes some sort of assumptions on the data. These assumptions take the form of an assertion that the data follow a given distribution (and also about the dependency, or lack thereof, of multiple observations; but that's a different issue). If that distribution can be described by a finite number of values, we call it "parametric"; if an uncountable or infinite number of values is needed to describe the distribution, we call it "nonparametric". Most models assume Gaussianity; models for count data typically assume Poisson, Binomial or Negative Binomial distributions.
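To make the finite-versus-infinite distinction concrete, here is a minimal Python sketch (standard library only). The helper names `fit_gaussian` and `empirical_cdf` and the sample values are made up for illustration and are not part of STATO or of this PR. The Gaussian fit is summarised by two numbers, whereas the empirical estimate has to keep every observation.

```python
import statistics
from bisect import bisect_right


def fit_gaussian(data):
    """Parametric description: two numbers (mean, variance) index the fitted Gaussian."""
    return statistics.mean(data), statistics.variance(data)


def empirical_cdf(data):
    """Nonparametric description: the estimate is built from all observations."""
    ordered = sorted(data)
    n = len(ordered)

    def cdf(x):
        # Fraction of observations less than or equal to x.
        return bisect_right(ordered, x) / n

    return cdf


# Illustrative values only.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.9]
mean, var = fit_gaussian(sample)
ecdf = empirical_cdf(sample)
```

Note that the empirical CDF is a step function (discrete) even when the underlying true distribution may be continuous, which is the point raised in the opening comment.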

A "Hypothesis Test" procedure produces a test statistic. That test statistic typically follows one of a small number of named distributions, like the standard Normal (aka Gaussian), which, for once, is just one single entity with no parameters (mean 0, variance 1), or a t distribution, which does have a special type of parameter, the "degrees of freedom".

A model makes assumptions on the data; when the data assumptions are satisfied, we can trust that the test statistic produced will follow the usual test statistic distribution. But the distribution of the data and the distribution of the test statistic are different. E.g. a two-sample t-test assumes the data are Normally distributed, independent and have a common variance; given those assumptions, the test statistic it produces will follow a Student's t distribution with n1+n2-2 degrees of freedom.
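As an illustration of the two-sample case, the following Python sketch computes the pooled-variance t statistic and its n1+n2-2 degrees of freedom. The function name `pooled_t_statistic` and the two groups of numbers are hypothetical; looking up a p-value from the Student's t distribution is left out because the standard library does not provide it.

```python
import math
import statistics


def pooled_t_statistic(x, y):
    """Two-sample t statistic under the common-variance assumption."""
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2  # degrees of freedom of the reference t distribution
    # Pooled estimate of the variance assumed to be shared by both groups.
    pooled_var = (
        (n1 - 1) * statistics.variance(x) + (n2 - 1) * statistics.variance(y)
    ) / df
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(x) - statistics.mean(y)) / standard_error
    return t, df


# Illustrative values only.
group_a = [5.1, 4.9, 6.2, 5.8, 5.5]
group_b = [4.2, 4.8, 3.9, 4.5]
t_stat, dof = pooled_t_statistic(group_a, group_b)
```

The data are assumed Normal, while the returned statistic is referred to a t distribution with `dof` degrees of freedom: two different distributions, as described above.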

A nonparametric two-sample permutation test assumes the data are independent and identically distributed from some unspecified (i.e. "nonparametric") distribution. The test statistic created has no particular sampling distribution, and thus is also nonparametric.
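A rough sketch of such a permutation test in Python follows. It is a Monte Carlo version that samples random relabellings rather than enumerating all of them; the function name `permutation_test`, the choice of the difference in means as the statistic, and the default of 5000 permutations are assumptions made for illustration. The sampling distribution of the statistic is estimated empirically from the relabelled data instead of being looked up in a named distribution.

```python
import random
import statistics


def permutation_test(x, y, n_permutations=5000, seed=0):
    """Two-sided Monte Carlo permutation test on the difference in means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = statistics.mean(x) - statistics.mean(y)
    pooled = list(x) + list(y)
    n1 = len(x)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable.
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:n1]) - statistics.mean(pooled[n1:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Count the observed labelling itself so the p-value is never zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Illustrative values only.
observed_diff, p = permutation_test(
    [5.1, 4.9, 6.2, 5.8, 5.5], [4.2, 4.8, 3.9, 4.5]
)
```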

Does this clarify or only muddle?

@agbeltran
Contributor

@nicholst thanks for all the explanations! It does clarify, thanks.
@cmaumet about the identifiers, we are planning to set up some automatic way to assign them (e.g. URIgen service), but until then, I hope it is OK if we assign the STATO ids.

So, I will merge this PR now, but I'll change the term non-parametric distribution from being a child of discrete probability distribution (http://purl.obolibrary.org/obo/STATO_0000117) (as in the commit/PR) to a child of probability distribution (http://purl.obolibrary.org/obo/STATO_0000225).

I will also assign the STATO identifiers next.

agbeltran added a commit that referenced this pull request Nov 28, 2014
@agbeltran agbeltran merged commit 84411b1 into ISA-tools:dev Nov 28, 2014
@agbeltran
Contributor

One more point!

@proccaserra suggests keeping non-parametric distribution and also adding symmetric distribution, but not the pre-coordinated term non-parametric symmetric distribution (as that term can be obtained by combining the other two).

@cmaumet @nicholst it would be great to discuss your use cases for these terms further.

So, I will assign a STATO ID for non-parametric distribution only. I will keep non-parametric symmetric distribution for now (without ID) until we discuss further. I will open another item for that discussion. Thanks!

@nicholst

... to keep non-parametric distribution and add also symmetric distribution, but not the pre-coordinated term non-parametric symmetric distribution.

Yup, this makes sense.
