This is a tool for calculating Pearson's correlation coefficient between binary (present/absent) labels attached to a set of events.
It accepts input from STDIN, where each line of input is of the form:
    identifier1 identifier2 ... identifiern
Valid identifiers are of the form [A-Za-z0-9_]+, i.e. they contain only alphanumeric characters and underscores. Identifiers must be separated by spaces, not tabs.
It is currently an error for an identifier to appear more than once on a line, but this is not validated. Either validation or a sensible interpretation of repeats may be provided later.
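As an illustration of the input rules above, here is a minimal Python sketch of what per-line validation could look like. It is not part of the tool, which currently does no such validation. The function name `parse_line` is hypothetical, and the sketch assumes that runs of more than one space between identifiers are acceptable, which the text above does not specify.

```python
import re

# Pattern taken from the description above: alphanumerics and underscores only.
IDENTIFIER = re.compile(r"[A-Za-z0-9_]+")


def parse_line(line):
    """Split one input line into identifiers, raising ValueError on bad input.

    Assumption: one or more spaces may separate identifiers. A tab is never
    treated as a separator, so it ends up inside a token and is rejected.
    """
    tokens = [t for t in line.rstrip("\n").split(" ") if t]
    seen = set()
    for token in tokens:
        if not IDENTIFIER.fullmatch(token):
            raise ValueError(f"invalid identifier: {token!r}")
        if token in seen:
            raise ValueError(f"repeated identifier: {token!r}")
        seen.add(token)
    return tokens


print(parse_line("foo bar baz\n"))  # ['foo', 'bar', 'baz']
```

Rejecting repeats is only one possible policy. Silently collapsing them into a single occurrence would be the other obvious interpretation.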
We then view each identifier as a 0/1-valued random variable, where each line is taken to be one sample of all of these random variables (1 if the identifier is present on the line, 0 otherwise). This allows us to calculate the correlation between these random variables.
For example, given the lines:
    foo
    foo bar baz
    bar baz bif
    foo bar
We get the following vectors of samples:
    foo: 1 1 0 1
    bar: 0 1 1 1
    baz: 0 1 1 0
    bif: 0 0 1 0
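The mapping from lines to sample vectors can be sketched in a few lines of Python. This is an illustration only, not the tool's own code. The function name `sample_vectors` is made up for the example, and `str.split()` is more lenient about whitespace than the tool, which requires spaces.

```python
def sample_vectors(lines):
    """Map each identifier to its list of 0/1 samples, one sample per line."""
    # Note: split() accepts any whitespace; the real tool requires spaces.
    rows = [set(line.split()) for line in lines]
    identifiers = sorted(set().union(*rows))
    return {
        name: [1 if name in row else 0 for row in rows]
        for name in identifiers
    }


lines = ["foo", "foo bar baz", "bar baz bif", "foo bar"]
for name, samples in sample_vectors(lines).items():
    print(name, samples)
# bar [0, 1, 1, 1]
# baz [0, 1, 1, 0]
# bif [0, 0, 1, 0]
# foo [1, 1, 0, 1]
```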
The Pearson correlation between e.g. foo and baz is now calculated as follows (for a 0/1 variable x^2 = x, so the variance is simply E(x) - E(x)^2):
    E(foo * baz) = (1 * 0 + 1 * 1 + 0 * 1 + 1 * 0) / 4 = 0.25
    E(foo) = (1 + 1 + 0 + 1) / 4 = 0.75
    E(baz) = (0 + 1 + 1 + 0) / 4 = 0.5
    Sigma(foo) = Sqrt(0.75 - 0.75^2) = 0.433...
    Sigma(baz) = Sqrt(0.5 - 0.5^2) = 0.5
    Pearsons(foo, baz) = (0.25 - 0.75 * 0.5) / (0.433... * 0.5) = -0.577...
i.e. foo and baz are fairly anticorrelated, which appears plausible from the data.
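The same arithmetic can be written as a short Python function. This is a sketch of the formula used in the worked example, not the tool's implementation, and the name `pearson` is chosen just for this illustration. It raises ZeroDivisionError if either identifier appears on every line or on none, since the standard deviation is then 0. How the tool itself treats that case is not stated here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists of 0/1 samples."""
    n = len(x)
    e_x = sum(x) / n
    e_y = sum(y) / n
    e_xy = sum(a * b for a, b in zip(x, y)) / n
    # For 0/1 values x * x == x, so E(x^2) == E(x) and Var(x) = E(x) - E(x)^2.
    sigma_x = sqrt(e_x - e_x ** 2)
    sigma_y = sqrt(e_y - e_y ** 2)
    return (e_xy - e_x * e_y) / (sigma_x * sigma_y)


foo = [1, 1, 0, 1]
baz = [0, 1, 1, 0]
print(round(pearson(foo, baz), 3))  # -0.577
```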
If we were to output the Pearson coefficient for every pair, the amount of output would always be quadratic in the number of identifiers. It is desirable to avoid this. Consequently a number of reductions are performed:
- The correlation is only output for labels which co-occur at least once. Consequently in the above example no correlation is output between foo and bif.
- Because Pearson's coefficient is symmetric, and the coefficient of an identifier with itself is 1, we only output the coefficient between identifiers x and y if x < y.
- There is a cutoff value. A coefficient is only output if it is >= this value. The cutoff may be specified via the -c flag. The default value is 0.
Other things to be aware of:
- The install procedure for this program is, shall we say, somewhat non-existent. You can install the Ruby gem and you'll get the command along with it. I'll add a better one at some later date.
- Parts of this program are currently written in Java. Further, they are written in very bad Java. I had to do some really idiotic things in order to get this adequately fast: almost everything you'll see in this code that makes you go "What is this idiot doing?" shaved at least 5 seconds off the runtime on my sample dataset, and these added up very quickly. This has reinforced my exceedingly negative opinion of java.io. I will probably rewrite the Java parts in C at some point.
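To make the three output reductions described earlier (co-occurrence, x < y, and the cutoff) concrete, here is a small Python sketch. It is not the tool's implementation. The function name `correlated_pairs` is hypothetical, its `cutoff` parameter merely mirrors the -c flag, and it assumes that x < y means ordinary string comparison. Identifiers that appear on every line have zero variance and are skipped here, which is also an assumption.

```python
from itertools import combinations
from math import sqrt


def correlated_pairs(lines, cutoff=0.0):
    """Return (x, y, r) for co-occurring pairs with x < y and r >= cutoff."""
    rows = [set(line.split()) for line in lines]
    n = len(rows)
    counts = {}       # identifier -> number of lines containing it
    pair_counts = {}  # (x, y) with x < y -> number of lines containing both
    for row in rows:
        for name in row:
            counts[name] = counts.get(name, 0) + 1
        # Only pairs that share a line are ever recorded, and sorting the
        # row first means each pair is stored once, as (smaller, larger).
        for pair in combinations(sorted(row), 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1

    results = []
    for (x, y), both in sorted(pair_counts.items()):
        e_x, e_y = counts[x] / n, counts[y] / n
        denom = sqrt((e_x - e_x ** 2) * (e_y - e_y ** 2))
        if denom == 0:
            continue  # assumption: skip identifiers present on every line
        r = (both / n - e_x * e_y) / denom
        if r >= cutoff:
            results.append((x, y, r))
    return results


lines = ["foo", "foo bar baz", "bar baz bif", "foo bar"]
for x, y, r in correlated_pairs(lines):
    print(x, y, round(r, 3))
# bar baz 0.577
# bar bif 0.333
# baz bif 0.577
```

With the default cutoff of 0 the anticorrelated pairs (bar/foo and baz/foo) are dropped, and foo/bif never appears at any cutoff because the two never share a line.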