Adjusted Rand index inconsistency for large n #225

Closed
ynschy opened this issue Dec 21, 2021 · 0 comments

ynschy commented Dec 21, 2021

The adjusted Rand index returned by randindex is unexpectedly wrong when n is large (on the order of 100,000). Here is an example, with a comparison to an R implementation.

using Random
using Clustering
using RCall
Random.seed!(123);

n = 100_000;
a = rand(1:3,n);
b = rand(1:3,n);

randindex(a,b)[1]

only(R"library(mclust); adjustedRandIndex($a,$b)")

which gives, for Clustering.jl and for mclust respectively:

0.2933142400616828

-1.5731751561282826e-6

Since a and b are independent random labelings, the adjusted Rand index should be close to 0. For me, the discrepancy starts to appear around n = 83,000.

As a Julia comparison, my own implementation of the adjusted Rand index gives the same result as in R:

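function ari(a,b)
    # Contingency table of the two labelings and its marginal sums
    table = counts(a,b)
    acounts = sum(table,dims=1)
    bcounts = sum(table,dims=2)

    # Number of pairs within each cell and each marginal; `/` yields Float64,
    # so these sums are accumulated in floating point rather than Int64
    score = sum(x*(x-1)/2 for x in table)
    asum = sum(x*(x-1)/2 for x in acounts)
    bsum = sum(x*(x-1)/2 for x in bcounts)
    # Expected pair agreement under random labeling, and the maximum attainable value
    expected = asum*bsum/binomial(sum(table),2)
    total = (asum + bsum)/2

    if total == expected
        return 0.0
    else
        return (score-expected)/(total-expected)
    end
end;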
ari(a,b)
-1.5731751561282826e-6
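
For reference, the function above appears to follow the usual pair-counting form of the adjusted Rand index. The symbols here are chosen for illustration: n_ij are the entries of the contingency table, a_i and b_j are its marginal sums, and n is the total number of points.

```latex
\mathrm{ARI} =
\frac{\sum_{ij}\binom{n_{ij}}{2}
      - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right] \Big/ \binom{n}{2}}
     {\tfrac{1}{2}\left[\sum_i\binom{a_i}{2}+\sum_j\binom{b_j}{2}\right]
      - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right] \Big/ \binom{n}{2}}
```

In the code, `score` is the first term of the numerator, `expected` is the bracketed product divided by the total number of pairs, and `total` is the first term of the denominator.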

I use Clustering.jl 0.14.2, Julia 1.6.2.

wildart added a commit to wildart/Clustering.jl that referenced this issue Dec 25, 2021
- added pair confusion matrix (CM) calculations
- fixed ARI calculation using CM
alyst closed this as completed in a661450 Mar 19, 2023
alyst added a commit that referenced this issue Mar 19, 2023
alyst added a commit that referenced this issue Mar 20, 2023
for very large clusterings the agreement/disagreement counts are
very large, so we have to switch to float when multiplying them

fixes #225
enhances #227
alyst added the bug label Mar 20, 2023
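
The commit message above attributes the problem to very large counts being multiplied as integers. The following is a minimal sketch of that kind of overflow, not the actual Clustering.jl code: the variable names and the assumption of three balanced clusters are hypothetical, chosen only to give counts of the same magnitude as in the n = 100,000 example.

```julia
# Illustrative sketch only: these counts are hypothetical, not taken from Clustering.jl.
n = Int64(100_000)

# Sum of squared cluster sizes, assuming three balanced clusters of size n ÷ 3.
nis = 3 * (n ÷ 3)^2      # 3_333_266_667
njs = nis

# Int64 multiplication wraps around silently once the result exceeds typemax(Int64).
wrapped = nis * njs

# Converting to Float64 before multiplying keeps the correct magnitude.
as_float = float(nis) * float(njs)

# Checked arithmetic turns the silent wraparound into an exception.
overflowed = try
    Base.Checked.checked_mul(nis, njs)
    false
catch err
    err isa OverflowError
end

println("Int64 product:     ", wrapped)
println("Float64 product:   ", as_float)
println("overflow detected: ", overflowed)
```

Under these assumptions the product of the two counts is about 1.1e19, which is above typemax(Int64) (about 9.2e18). That would be consistent with the wrong value appearing only for large n, and with the fix of switching to floating point before multiplying.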