This is a tool for calculating the binary Pearson's correlation coefficient between pairs of labels attached to events.
It accepts input from STDIN, where each line of input is of the form
    identifier1 identifier2 ... identifiern
These are required to be separated with spaces, not tabs.
Currently it is an error to repeat an identifier on a line, but this is not validated. Either validation or a sensible interpretation of repeats may be provided later.
We then view each identifier as a 0,1 valued random variable, where each line is taken to be a sample of these random variables (1 if present, 0 otherwise). This allows us to calculate the correlation between these random variables.
For example, given lines:
    foo
    foo bar baz
    bar baz bif
    foo bar
We get the following vectors of samples:
    foo: 1 1 0 1
    bar: 0 1 1 1
    baz: 0 1 1 0
    bif: 0 0 1 0
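A minimal sketch of how these vectors could be built, in Python. This is an illustration, not the tool's own code: the function name `indicator_vectors` is hypothetical, and collapsing a repeated identifier into a single 1 is an assumption about one possible interpretation of repeats.

```python
def indicator_vectors(lines):
    """Map each identifier to a 0/1 list with one entry per input line."""
    # Split on spaces only (tabs are not separators) and drop empty tokens.
    # Assumption: using a set means a repeated identifier counts once.
    rows = [{tok for tok in line.rstrip("\n").split(" ") if tok} for line in lines]
    identifiers = sorted(set().union(*rows))
    return {
        ident: [1 if ident in row else 0 for row in rows]
        for ident in identifiers
    }


vectors = indicator_vectors(["foo", "foo bar baz", "bar baz bif", "foo bar"])
for ident, samples in vectors.items():
    print(ident + ":", *samples)
```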
Pearson's correlation between e.g. foo and baz is now calculated as follows (for a 0/1 valued variable X we have X^2 = X, so the variance is E(X) - E(X)^2):
    E(foo * baz) = (1 * 0 + 1 * 1 + 0 * 1 + 1 * 0) / 4 = 0.25
    E(foo) = (1 + 1 + 0 + 1) / 4 = 0.75
    E(baz) = (0 + 1 + 1 + 0) / 4 = 0.5
    Sigma(foo) = Sqrt(0.75 - 0.75^2) = 0.433...
    Sigma(baz) = Sqrt(0.5 - 0.5^2) = 0.5

    So Pearsons(foo, baz) = (0.25 - 0.75 * 0.5) / (0.433... * 0.5) = -0.577...
i.e. foo and baz are fairly anticorrelated, which appears plausible from the data.
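The same arithmetic can be sketched in Python to check the numbers. The function name `binary_pearson` is hypothetical, and returning None when a variable has zero variance (an identifier present on every line or on none) is an assumption; this README does not say what the tool does in that case.

```python
import math


def binary_pearson(xs, ys):
    """Pearson's correlation of two equal-length 0/1 sample vectors."""
    n = len(xs)
    e_x = sum(xs) / n
    e_y = sum(ys) / n
    e_xy = sum(x * y for x, y in zip(xs, ys)) / n
    # For a 0/1 variable, E(X^2) = E(X), so Var(X) = E(X) - E(X)^2.
    sigma_x = math.sqrt(e_x - e_x ** 2)
    sigma_y = math.sqrt(e_y - e_y ** 2)
    if sigma_x == 0 or sigma_y == 0:
        return None  # assumption: undefined for a constant variable
    return (e_xy - e_x * e_y) / (sigma_x * sigma_y)


foo = [1, 1, 0, 1]
baz = [0, 1, 1, 0]
r = binary_pearson(foo, baz)
print(round(r, 3))  # -0.577
```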
If we were to output the correlation for every pair of identifiers, the amount of output would always be quadratic in the number of identifiers. It is desirable to avoid this, so a number of reductions are performed:
- the correlation is only output for labels which co-occur at least once. In the above example, therefore, no correlation is output between foo and bif.
- because Pearson's correlation is symmetric, and the correlation of an identifier with itself is 1, we only output the correlation between identifiers x and y if x < y.
- there is a cutoff value. The correlation is only output for pairs whose value is >= this cutoff. This may be specified via the -c flag. The default value is 0.
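The three reductions above can be sketched together in Python. This is an illustration under stated assumptions, not the tool's implementation: the name `correlated_pairs` is hypothetical, "x < y" is taken to mean ordinary string ordering, repeated identifiers on a line are counted once, and pairs involving an identifier that appears on every line (zero variance) are skipped.

```python
import math
from collections import Counter
from itertools import combinations


def correlated_pairs(lines, cutoff=0.0):
    """Return (x, y, r) for co-occurring pairs with x < y and r >= cutoff."""
    n = 0
    single = Counter()  # lines containing each identifier
    joint = Counter()   # lines containing both x and y, keyed (x, y), x < y
    for line in lines:
        idents = sorted({tok for tok in line.rstrip("\n").split(" ") if tok})
        n += 1
        single.update(idents)
        # combinations() over a sorted list only yields pairs with x < y,
        # and only pairs that co-occur on this line are ever counted.
        joint.update(combinations(idents, 2))

    results = []
    for (x, y), n_xy in sorted(joint.items()):
        e_x, e_y, e_xy = single[x] / n, single[y] / n, n_xy / n
        denom = math.sqrt((e_x - e_x ** 2) * (e_y - e_y ** 2))
        if denom == 0:
            continue  # assumption: skip identifiers present on every line
        r = (e_xy - e_x * e_y) / denom
        if r >= cutoff:
            results.append((x, y, r))
    return results


example = ["foo", "foo bar baz", "bar baz bif", "foo bar"]
for x, y, r in correlated_pairs(example):
    print(x, y, round(r, 3))
```

With the default cutoff of 0, the negatively correlated pairs (such as baz and foo) are dropped, and foo and bif never appear at any cutoff because they never co-occur.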