
Calculating a p-value for a Pearson's correlation #40

Open
kovasap opened this issue Jan 29, 2022 · 6 comments

@kovasap

kovasap commented Jan 29, 2022

I'm interested in calculating a p-value for a correlation that I'm computing with kixi.stats.core/correlation. I'm curious whether a function to do this already exists in this library, or whether there are functions that would help. For example, based on the description at https://support.minitab.com/en-us/minitab-express/1/help-and-how-to/modeling-statistics/regression/how-to/correlation/methods-and-formulas/, I need to be able to generate a t distribution (I think?) in order to calculate this p-value.
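(For reference, the statistic on that page is t = r * sqrt(n - 2) / sqrt(1 - r^2), which is compared against a t distribution with n - 2 degrees of freedom.)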

Or maybe it's possible to calculate this p-value from the error functions that already exist in kixi.stats?

If this is not possible and/or if there is a better way to do this, I'm happy to hear about it!

@henrygarner
Collaborator

@kovasap this would be an excellent addition to the library. There are already a handful of significance tests in the kixi.stats.test namespace, but no correlation test yet.

I've sketched a solution (using kixi.stats.digest/sum-squares rather than kixi.stats.core/correlation because the latter doesn't return the count of rows we need for significance testing). Hopefully it illustrates some of the support you get from the kixi.stats.distribution and kixi.stats.test namespaces for this sort of use case.

(require '[kixi.stats.core :as kixi]
         '[kixi.stats.math :refer [sq sqrt]]
         '[kixi.stats.digest :as digest]
         '[kixi.stats.distribution :as dist]
         '[kixi.stats.test :as t])

(def data [{:x 1 :y 2} {:x 2 :y 2} {:x 3 :y 4} {:x 4 :y 4}])

(defn correlation-test
  [fx fy]
  (kixi/post-complete
   (digest/sum-squares fx fy)
   (fn [{:keys [n ss-x ss-y ss-xy]}]
     (let [denominator (sqrt (* ss-x ss-y))]
       ;; Guard against zero variance in either variable
       (when-not (zero? denominator)
         (let [r (/ ss-xy denominator)
               degrees-of-freedom (- n 2)
               ;; t = r * sqrt(n - 2) / sqrt(1 - r^2)
               t-stat (/ (* r (sqrt degrees-of-freedom))
                         (sqrt (- 1 (sq r))))]
           {:correlation r
            :test (t/test-result t-stat (dist/t {:v degrees-of-freedom}))}))))))

(def result
  (transduce identity (correlation-test :x :y) data))

;; Return the calculated correlation
(-> result :correlation)

;; Return the p-value from the test result
(-> result :test t/p-value)

;; Perform a two-tailed test of significance given 5% alpha
(-> result :test (t/significant? 0.05 :<>))
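For what it's worth, on this small dataset the sketch should yield r ≈ 0.894 and t ≈ 2.83 on 2 degrees of freedom, for a two-tailed p-value of roughly 0.106: suggestive, but not significant at a 5% alpha.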

@kovasap
Author

kovasap commented Jan 30, 2022

Thanks for the detailed response! By "count of rows", do you mean the number of data points used to compute the correlation ((count data) in your example above), which we need to feed into dist/t via the degrees-of-freedom var? Or is there additional logic required to do this kind of test for kixi.stats.core/correlation?

If/when I end up writing a test for kixi.stats.core/correlation, would you be interested in a pull request?

@henrygarner
Collaborator

henrygarner commented Jan 30, 2022

do you mean the number of data points used to compute the correlation ((count data) in your example above)?

That's right. The number of observations is required for calculating significance, and it manifests as the t distribution's degrees of freedom. That's why I haven't used kixi.stats.core/correlation above.

It would be good to consider a consistent strategy for optionally returning significance tests alongside statistics.
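For example, a rough sketch (not the library's API: redux.core/fuse comes from the redux dependency, and count-rf is defined inline rather than assumed to exist in kixi.stats) that computes the correlation and the count side by side:

(require '[redux.core :as redux])

;; A minimal counting reducing function, defined inline so nothing
;; extra is assumed of kixi.stats.core.
(def count-rf
  (fn ([] 0) ([acc _] (inc acc)) ([acc] acc)))

;; Fuse correlation with the count so both reach the completing fn.
(transduce identity
           (redux/fuse {:r (kixi/correlation :x :y)
                        :n count-rf})
           data)
;; => {:r 0.894..., :n 4}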

PRs welcome!

@kovasap
Author

kovasap commented Feb 1, 2022

Ok, got it. I'm in the process of getting this working for kixi.stats.core/correlation and am having some trouble understanding the code your example calls into:

(defrecord TestResult [statistic distribution h1]
  p/PTestResult
  (p-value [this]
    (p-value this h1))
  (p-value [this alternate]
    (when (and statistic distribution alternate)
      (case alternate
        :<> (clamp (* 2 (d/cdf distribution (- (abs statistic)))) 0.0 1.0)
        :<  (d/cdf distribution statistic)
        :>  (- 1 (d/cdf distribution statistic)))))
  (significant? [this alpha]
    (significant? this alpha h1))
  (significant? [this alpha alternate]
    (when (and statistic distribution alpha alternate)
      (let [critical (d/critical-value distribution alpha alternate)]
        (case alternate
          :<> (> (abs statistic) critical)
          :<  (< statistic critical)
          :>  (> statistic critical))))))

Specifically, I'm wondering what the difference is between the p-value calculation and the significant? calculation. Based on my (mediocre) understanding of statistics, I would expect significant? to just check whether the p-value is below the given alpha (0.05 in your case). Instead it seems to use the distribution differently (critical-value rather than cdf) from how the p-value calculation uses it?

@henrygarner
Collaborator

That's a fair question: it would be possible to leverage the p-value calculation to determine significance at a given alpha. Given a test statistic, a test distribution, and a desired alpha, we can determine significance in at least two ways:

  • Calculate the p-value of the test statistic, given the test distribution, and compare to the given alpha
  • Calculate the critical value of the test distribution at the given alpha and compare to the test statistic

It happens that the latter approach seems a little more straightforward to me, and I think this is borne out by the relative complexity of the p-value and significant? implementations.
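Concretely, both routes can be checked against result from the earlier sketch (a sketch, assuming the two-argument p-value arity shown in the defrecord above is exposed by kixi.stats.test):

;; Route 1: compute the two-tailed p-value and compare it to alpha
(< (t/p-value (:test result) :<>) 0.05)

;; Route 2: compare the test statistic to the critical value at alpha
(t/significant? (:test result) 0.05 :<>)

;; For a symmetric test distribution such as the t, these agree:
;; |t| exceeds the critical value exactly when the two-tailed p-value
;; falls below alpha.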

@kovasap
Author

kovasap commented Feb 2, 2022

Got it, that makes sense. I'm asking because I want to display p-values in a table I'm making, and I want to be sure it's accurate to say that p-values less than my desired alpha are significant (so I can color/sort them appropriately). I wanted to confirm that this gives the same results as using significant?, so I wouldn't hit strange discrepancies like a p-value greater than my alpha being classed as significant.
