
Calculating a p-value for a Pearson's correlation #40

Open
kovasap opened this issue Jan 29, 2022 · 6 comments

@kovasap

kovasap commented Jan 29, 2022

I'm interested in calculating a p-value for a correlation that I'm computing with kixi.stats.core/correlation. I'm curious whether a function to do this already exists in this library, or whether there are functions that would help. For example, based on the description at https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/how-to/correlation/methods-and-formulas/, I need to be able to generate a t distribution (I think?) in order to calculate this p-value.
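(For reference, the statistic on that page is t = r * sqrt(n - 2) / sqrt(1 - r^2), which is compared against a t distribution with n - 2 degrees of freedom.)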

Or maybe it's possible to calculate this p-value from the error functions that already exist in kixi.stats?

If this is not possible and/or if there is a better way to do this, I'm happy to hear about it!

@henrygarner
Collaborator

@kovasap this would be an excellent addition to the library. There are already a handful of significance tests in the kixi.stats.test namespace, but no correlation test yet.

I've sketched a solution (using kixi.stats.digest/sum-squares rather than kixi.stats.core/correlation because the latter doesn't return the count of rows we need for significance testing). Hopefully it illustrates some of the support you get from the kixi.stats.distribution and kixi.stats.test namespaces for this sort of use case.

(require '[kixi.stats.core :as kixi]
         '[kixi.stats.math :refer [sq sqrt]]
         '[kixi.stats.digest :as digest]
         '[kixi.stats.distribution :as dist]
         '[kixi.stats.test :as t])

(def data [{:x 1 :y 2} {:x 2 :y 2} {:x 3 :y 4} {:x 4 :y 4}])

(defn correlation-test
  [fx fy]
  (kixi/post-complete
   (digest/sum-squares fx fy)
   (fn [{:keys [n ss-x ss-y ss-xy]}]
     (let [denominator (sqrt (* ss-x ss-y))]
       ;; Guard against zero variance in either variable
       (when-not (zero? denominator)
         (let [r (/ ss-xy denominator)
               degrees-of-freedom (- n 2)
               ;; t = r * sqrt(n - 2) / sqrt(1 - r^2)
               t-stat (/ (* r (sqrt degrees-of-freedom))
                         (sqrt (- 1 (sq r))))]
           {:correlation r
            :test (t/test-result t-stat (dist/t {:v degrees-of-freedom}))}))))))

(def result
  (transduce identity (correlation-test :x :y) data))

;; Return the calculated correlation
(-> result :correlation)

;; Return the p-value from the test result
(-> result :test t/p-value)

;; Perform a two-tailed test of significance given 5% alpha
(-> result :test (t/significant? 0.05 :<>))
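For what it's worth, on this small dataset the sketch should yield r ≈ 0.894 and t ≈ 2.83 on 2 degrees of freedom, for a two-tailed p-value of roughly 0.106: suggestive, but not significant at a 5% alpha.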

@kovasap
Author

kovasap commented Jan 30, 2022

Thanks for the detailed response! By "count of rows", do you mean the number of data points used to compute the correlation ((count data) in your example above), which we need to feed into dist/t via the degrees-of-freedom var? Or is there additional logic required to do this kind of test for kixi.stats.core/correlation?

If/when I end up writing a test for kixi.stats.core/correlation, would you be interested in a pull request?

@henrygarner
Collaborator

henrygarner commented Jan 30, 2022

do you mean the number of data points used to compute the correlation ((count data) in your example above)?

That's right. The number of observations is required for calculating significance, and it manifests as the t distribution's degrees of freedom. That's why I haven't used kixi.stats.core/correlation above.

It would be good to consider a consistent strategy for optionally returning significance tests alongside statistics.
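For example, a rough sketch (not the library's API: redux.core/fuse comes from the redux dependency, and count-rf is defined inline rather than assumed to exist in kixi.stats) that computes the correlation and the count side by side:

(require '[redux.core :as redux])

;; A minimal counting reducing function, defined inline so nothing
;; extra is assumed of kixi.stats.core.
(def count-rf
  (fn ([] 0) ([acc _] (inc acc)) ([acc] acc)))

;; Fuse correlation with the count so both reach the completing fn.
(transduce identity
           (redux/fuse {:r (kixi/correlation :x :y)
                        :n count-rf})
           data)
;; => {:r 0.894..., :n 4}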

PRs welcome!

@kovasap
Author

kovasap commented Feb 1, 2022

Ok, got it. I'm in the process of getting this working for kixi.stats.core/correlation and am having some trouble understanding the code your example calls into:

(defrecord TestResult [statistic distribution h1]
  p/PTestResult
  (p-value [this]
    (p-value this h1))
  (p-value [this alternate]
    (when (and statistic distribution alternate)
      (case alternate
        :<> (clamp (* 2 (d/cdf distribution (- (abs statistic)))) 0.0 1.0)
        :<  (d/cdf distribution statistic)
        :>  (- 1 (d/cdf distribution statistic)))))
  (significant? [this alpha]
    (significant? this alpha h1))
  (significant? [this alpha alternate]
    (when (and statistic distribution alpha alternate)
      (let [critical (d/critical-value distribution alpha alternate)]
        (case alternate
          :<> (> (abs statistic) critical)
          :<  (< statistic critical)
          :>  (> statistic critical))))))

Specifically, I'm wondering what the difference is between the p-value calculation and the significant? calculation. Based on my (mediocre) understanding of statistics, I would expect significant? to just check whether the p-value is below the given alpha (0.05 in your case). Instead it seems to use the distribution differently (critical-value rather than cdf) from how the p-value calculation uses it?

@henrygarner
Collaborator

That's a fair question: it would be possible to leverage the p-value calculation to determine significance at a given alpha. Given a test statistic, a test distribution, and a desired alpha, we can determine significance in at least two ways:

  • Calculate the p-value of the test statistic, given the test distribution, and compare to the given alpha
  • Calculate the critical value of the test distribution at the given alpha and compare to the test statistic

It happens that the latter approach seems a little more straightforward to me, and I think this is borne out by the relative complexity of the p-value and significant? implementations.
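Concretely, both routes can be checked against result from the earlier sketch (a sketch, assuming the two-argument p-value arity shown in the defrecord above is exposed by kixi.stats.test):

;; Route 1: compute the two-tailed p-value and compare it to alpha
(< (t/p-value (:test result) :<>) 0.05)

;; Route 2: compare the test statistic to the critical value at alpha
(t/significant? (:test result) 0.05 :<>)

;; For a symmetric test distribution such as the t, these agree:
;; |t| exceeds the critical value exactly when the two-tailed p-value
;; falls below alpha.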

@kovasap
Author

kovasap commented Feb 2, 2022

Got it, that makes sense. I'm asking because I want to display p-values in a table I'm making, and I want to be sure it's accurate to say that p-values less than my desired alpha are significant (so I can color/sort them appropriately). I wanted to confirm that this gives the same results as using significant?, so I wouldn't hit strange discrepancies like a p-value greater than my alpha being classed as significant.
