Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inclusion of Hclust into Clustering.jl #32

Merged
merged 1 commit into from Jun 29, 2015

Conversation

davidavdav
Copy link
Contributor

Hi,

Following the suggestions from the METATDATA maintainers, I made a patch for Clustering to absorb the package Hclust (Hierarchical clustering along the lines of R's hclust() and cutree()), sice this is also does a form of clustering. This keeps the number of packages down.

I did not further integrate the code, it seems to be reasonably independent of the existing code in Clustering.

---david


## functions to comput maximum, minimum, mean for just a slice of an array
## Only consider the upper tiangle, i.e., i<j
function Base.maximum{T<:Real}(d::Matrix{T}, cl1::Vector{Int}, cl2::Vector{Int})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should use different names for such functions, instead of extending the base functions here. They have different interfaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, isn't that the beauty of the dispatch mechanism, that we can define functions with new type of arguments that still do a similar thing to existing ones? The output interface still is the same: a single (maximum) value. The input interface obviously is different.

If maximum(::Array, ::Vector, ::Vector) would have a meaning in Base, I wouldn't want to overload this function.

To be honest, I had assumed that this overloaded function would not be available outside the module, since maximum is not exported. But I see that it is, anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The multi-dispatch mechanism was mainly to support different behaviours on different input types. However, we encourage that the API be consistent for each function name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point, and I can easily change these names, I'll add another commit shortly.

But I think there are many functions that have quite a different meaning depending on their arguments (zip(), Array(), sum(), similar(), etc)---and that is not just the first argument as in traditional object oriented languages.

I've always seen it as one of the cool things that I don't have to think of new function names all the time when I develop code in steps to operate on increasingly more complex objects, as in
mycoolfunc(a::Array) = map(mycoolfunc, a)

And there are already a lot of functions defined in Base (max and maximum are already taken, so where do I go? I find "mymaximum" lacking any fantasy.) I'll probably go for utmaximum().

@davidavdav
Copy link
Contributor Author

renamed maximum, minimum and mean to utmaximum, utminimum and utmean for upper triangle (and sliced rows/columns)

@lindahua lindahua mentioned this pull request Nov 6, 2014
@davidavdav
Copy link
Contributor Author

Any ideas on this PR?

@lindahua
Copy link
Contributor

Please add test/hclust.jl to test/runtests.jl and see what the travis says.

@davidavdav
Copy link
Contributor Author

OK, done.

@davidavdav
Copy link
Contributor Author

Anything else needed for considering this PR?

## merge clusters i and j
## update D(i,j) and N(i) accordingly
function hclust_minimum{T<:Real}(d::Matrix{T})
@assert size(d,1) == size(d,2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be the more idiomatic:
size(d,1) == size(d,2) || error("Distance matrix d is not square.")?
That error might be more descriptive than the output of the failed assert.

@jpfairbanks
Copy link
Contributor

I would like to see this merged, one thing that I can see is that you use built in function names as variable names. specifically you assign to min and new. It leads to a possible confusion.

The functions are also quite large. I think there are some pieces of functionality that would be better made into functions that modify their arguments that get called more than once. Refactoring it so that these blocks become reusable pieces of code will help review the PR.

For instance:

## For each 0 < i <= n compute Nearest Neighbor N[i]
    N = zeros(Int, nc)
    for k = 1:nc
        min = Inf
        mk = 0
        for i = 1:(k-1)
            if d[i,k] < min
                min = d[i,k]
                mk = i
            end
        end
        for j = (k+1):nc
            if d[k,j] < min
                min = d[k,j]
                mk = j
            end
        end
        N[k] = mk
    end

could be extracted into a function such as nearestneighbors!(N, d, nc).

## For each 0 < i <= n compute Nearest Neighbor N[i]
function nearestneighbors!(N,d,nc)
    for k = 1:nc
        min = Inf
        mk = 0
        for i = 1:(k-1)
            if d[i,k] < min
                min = d[i,k]
                mk = i
            end
        end
        for j = (k+1):nc
            if d[k,j] < min
                min = d[k,j]
                mk = j
            end
        end
        N[k] = mk
    end
end
#usage
N = zeros(Int, nc)
nearestneighbors!(N,d,nc)

@davidavdav
Copy link
Contributor Author

I renamed min and max, and got some common code out of the functions. I agree that the functions are quite long, especially hclust_minimum() (single linkage clustering). This is a bit frustrating, as the pseudocode is only 6 lines... I don't see a way to make re-usable code out of it.

@davidavdav
Copy link
Contributor Author

The interface now is

hclust(d::Symmetric, method::Symbol) ## method :single, :complete, or :average

with convenience functions

hclust(d::AbstractMatrix, method::Symbol) ## checks symmetry
hclust(d::AbstractMatrix, method::Symbol, uplo) ## use only upper/lower part of d

In the last function, uplo is a Char (for Julia v0.3) or a Symbol (for Julia v0.4).

I checked the speed of access to elements in a Symmetric, it appears to be as fast as a normal matrix, perhaps even faster because more elements fit in the cache.

@jpfairbanks
Copy link
Contributor

@davidavdav Why do you want to include hclust1 if it doesn't work? Is the intended algorithm supposed to be faster or have better statistical properties when coded correctly?

@davidavdav
Copy link
Contributor Author

Maybe someone else can pick it up. It may be a more efficient algorithm than hclust2()---that still is 4.5x as slow as R's implentation (where hclust_minimum() is faster than R).

But there is no harm in removing it.

@davidavdav
Copy link
Contributor Author

Does anything still need to be done for this PR?

@slundberg
Copy link
Contributor

Just checking on the status on this pull request. I want to release code that depends on HClust so this would effect how I incorporate the dependency.

@johnmyleswhite
Copy link
Member

Just to confirm: you've never read the code for R's implementation, right? Assuming that's true, I'd be ready to squash this and merge it.

@davidavdav
Copy link
Contributor Author

I did not look in R's code for hclust---I wouldn't even know where to find it. I used the paper quoted in the source (Olson). I couldn't get the Wikipedia algorithm description working.

@johnmyleswhite
Copy link
Member

Works for me. Thanks for writing this!

@johnmyleswhite
Copy link
Member

Can you do a quick squash? Then we'll merge this.

Made test/hclust.R consistent with test/hclust.jl
rename local definitions of maximum, minimum and mean to utmaximum etc.
Added test/hclust.jl in runtests.jl some refactoring (min, max), some extra functions (assertdistancematrix, rorder!)
Go for Symmetric arg type to hclust(), with AbstractMatrix fallback
Remove hclust1() implementation
@davidavdav
Copy link
Contributor Author

I don't think the travis 0.4 failure is related to this squashed commit.

johnmyleswhite added a commit that referenced this pull request Jun 29, 2015
Inclusion of Hclust into Clustering.jl
@johnmyleswhite johnmyleswhite merged commit 7e6ae49 into JuliaStats:master Jun 29, 2015
@johnmyleswhite
Copy link
Member

Sounds like the code is broken on 0.4. We can fix that.

@tlnagy
Copy link

tlnagy commented Feb 2, 2016

Any progress on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants