Inclusion of Hclust into Clustering.jl #32

davidavdav · 2014-10-11T15:17:41Z

Hi,

Following the suggestions from the METATDATA maintainers, I made a patch for Clustering to absorb the package Hclust (Hierarchical clustering along the lines of R's hclust() and cutree()), sice this is also does a form of clustering. This keeps the number of packages down.

I did not further integrate the code, it seems to be reasonably independent of the existing code in Clustering.

---david

lindahua · 2014-10-28T11:04:05Z

src/hclust.jl

+
+## functions to comput maximum, minimum, mean for just a slice of an array
+## Only consider the upper tiangle, i.e., i<j
+function Base.maximum{T<:Real}(d::Matrix{T}, cl1::Vector{Int}, cl2::Vector{Int})


I think you should use different names for such functions, instead of extending the base functions here. They have different interfaces.

Well, isn't that the beauty of the dispatch mechanism, that we can define functions with new type of arguments that still do a similar thing to existing ones? The output interface still is the same: a single (maximum) value. The input interface obviously is different.

If maximum(::Array, ::Vector, ::Vector) would have a meaning in Base, I wouldn't want to overload this function.

To be honest, I had assumed that this overloaded function would not be available outside the module, since maximum is not exported. But I see that it is, anyway.

The multi-dispatch mechanism was mainly to support different behaviours on different input types. However, we encourage that the API be consistent for each function name.

I see your point, and I can easily change these names, I'll add another commit shortly.

But I think there are many functions that have quite a different meaning depending on their arguments (zip(), Array(), sum(), similar(), etc)---and that is not just the first argument as in traditional object oriented languages.

I've always seen it as one of the cool things that I don't have to think of new function names all the time when I develop code in steps to operate on increasingly more complex objects, as in
mycoolfunc(a::Array) = map(mycoolfunc, a)

And there are already a lot of functions defined in Base (max and maximum are already taken, so where do I go? I find "mymaximum" lacking any fantasy.) I'll probably go for utmaximum().

davidavdav · 2014-10-28T13:02:29Z

renamed maximum, minimum and mean to utmaximum, utminimum and utmean for upper triangle (and sliced rows/columns)

davidavdav · 2014-11-19T15:17:31Z

Any ideas on this PR?

lindahua · 2014-11-20T00:52:56Z

Please add test/hclust.jl to test/runtests.jl and see what the travis says.

davidavdav · 2014-11-20T16:57:04Z

OK, done.

davidavdav · 2015-01-22T12:09:13Z

Anything else needed for considering this PR?

jpfairbanks · 2015-03-18T21:52:08Z

src/hclust.jl

+##   merge clusters i and j
+##   update D(i,j) and N(i) accordingly
+function hclust_minimum{T<:Real}(d::Matrix{T})
+    @assert size(d,1) == size(d,2)


Should this be the more idiomatic:
size(d,1) == size(d,2) || error("Distance matrix d is not square.")?
That error might be more descriptive than the output of the failed assert.

jpfairbanks · 2015-03-19T03:51:31Z

I would like to see this merged, one thing that I can see is that you use built in function names as variable names. specifically you assign to min and new. It leads to a possible confusion.

The functions are also quite large. I think there are some pieces of functionality that would be better made into functions that modify their arguments that get called more than once. Refactoring it so that these blocks become reusable pieces of code will help review the PR.

For instance:

## For each 0 < i <= n compute Nearest Neighbor N[i]
    N = zeros(Int, nc)
    for k = 1:nc
        min = Inf
        mk = 0
        for i = 1:(k-1)
            if d[i,k] < min
                min = d[i,k]
                mk = i
            end
        end
        for j = (k+1):nc
            if d[k,j] < min
                min = d[k,j]
                mk = j
            end
        end
        N[k] = mk
    end

could be extracted into a function such as nearestneighbors!(N, d, nc).

## For each 0 < i <= n compute Nearest Neighbor N[i]
function nearestneighbors!(N,d,nc)
    for k = 1:nc
        min = Inf
        mk = 0
        for i = 1:(k-1)
            if d[i,k] < min
                min = d[i,k]
                mk = i
            end
        end
        for j = (k+1):nc
            if d[k,j] < min
                min = d[k,j]
                mk = j
            end
        end
        N[k] = mk
    end
end
#usage
N = zeros(Int, nc)
nearestneighbors!(N,d,nc)

davidavdav · 2015-03-19T22:33:44Z

I renamed min and max, and got some common code out of the functions. I agree that the functions are quite long, especially hclust_minimum() (single linkage clustering). This is a bit frustrating, as the pseudocode is only 6 lines... I don't see a way to make re-usable code out of it.

davidavdav · 2015-03-20T09:18:17Z

The interface now is

hclust(d::Symmetric, method::Symbol) ## method :single, :complete, or :average

with convenience functions

hclust(d::AbstractMatrix, method::Symbol) ## checks symmetry
hclust(d::AbstractMatrix, method::Symbol, uplo) ## use only upper/lower part of d

In the last function, uplo is a Char (for Julia v0.3) or a Symbol (for Julia v0.4).

I checked the speed of access to elements in a Symmetric, it appears to be as fast as a normal matrix, perhaps even faster because more elements fit in the cache.

jpfairbanks · 2015-03-20T20:32:48Z

@davidavdav Why do you want to include hclust1 if it doesn't work? Is the intended algorithm supposed to be faster or have better statistical properties when coded correctly?

davidavdav · 2015-03-20T20:55:35Z

Maybe someone else can pick it up. It may be a more efficient algorithm than hclust2()---that still is 4.5x as slow as R's implentation (where hclust_minimum() is faster than R).

But there is no harm in removing it.

davidavdav · 2015-06-23T07:26:22Z

Does anything still need to be done for this PR?

slundberg · 2015-06-27T17:29:16Z

Just checking on the status on this pull request. I want to release code that depends on HClust so this would effect how I incorporate the dependency.

johnmyleswhite · 2015-06-27T21:45:28Z

Just to confirm: you've never read the code for R's implementation, right? Assuming that's true, I'd be ready to squash this and merge it.

davidavdav · 2015-06-28T15:50:41Z

I did not look in R's code for hclust---I wouldn't even know where to find it. I used the paper quoted in the source (Olson). I couldn't get the Wikipedia algorithm description working.

johnmyleswhite · 2015-06-28T15:51:21Z

Works for me. Thanks for writing this!

johnmyleswhite · 2015-06-28T15:51:42Z

Can you do a quick squash? Then we'll merge this.

Made test/hclust.R consistent with test/hclust.jl rename local definitions of maximum, minimum and mean to utmaximum etc. Added test/hclust.jl in runtests.jl some refactoring (min, max), some extra functions (assertdistancematrix, rorder!) Go for Symmetric arg type to hclust(), with AbstractMatrix fallback Remove hclust1() implementation

davidavdav · 2015-06-28T17:03:09Z

I don't think the travis 0.4 failure is related to this squashed commit.

Inclusion of Hclust into Clustering.jl

johnmyleswhite · 2015-06-29T20:44:15Z

Sounds like the code is broken on 0.4. We can fix that.

tlnagy · 2016-02-02T06:25:27Z

Any progress on this?

lindahua reviewed Oct 28, 2014
View reviewed changes

lindahua mentioned this pull request Nov 6, 2014

Hierarchical clustering? #4

Closed

jpfairbanks reviewed Mar 18, 2015
View reviewed changes

davidavdav mentioned this pull request Jun 23, 2015

Fixes name collision between hclust.jl and HClust.jl in OSX checkout process davidavdav/HClust.jl#1

Merged

davidavdav force-pushed the master branch from 0beaeab to 1595732 Compare June 28, 2015 16:10

johnmyleswhite added a commit that referenced this pull request Jun 29, 2015

Merge pull request #32 from davidavdav/master

7e6ae49

Inclusion of Hclust into Clustering.jl

johnmyleswhite merged commit 7e6ae49 into JuliaStats:master Jun 29, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inclusion of Hclust into Clustering.jl #32

Inclusion of Hclust into Clustering.jl #32

davidavdav commented Oct 11, 2014

lindahua Oct 28, 2014

davidavdav Oct 28, 2014

lindahua Oct 28, 2014

davidavdav Oct 28, 2014

davidavdav commented Oct 28, 2014

davidavdav commented Nov 19, 2014

lindahua commented Nov 20, 2014

davidavdav commented Nov 20, 2014

davidavdav commented Jan 22, 2015

jpfairbanks Mar 18, 2015

jpfairbanks commented Mar 19, 2015

davidavdav commented Mar 19, 2015

davidavdav commented Mar 20, 2015

jpfairbanks commented Mar 20, 2015

davidavdav commented Mar 20, 2015

davidavdav commented Jun 23, 2015

slundberg commented Jun 27, 2015

johnmyleswhite commented Jun 27, 2015

davidavdav commented Jun 28, 2015

johnmyleswhite commented Jun 28, 2015

johnmyleswhite commented Jun 28, 2015

davidavdav commented Jun 28, 2015

johnmyleswhite commented Jun 29, 2015

tlnagy commented Feb 2, 2016

Inclusion of Hclust into Clustering.jl #32

Inclusion of Hclust into Clustering.jl #32

Conversation

davidavdav commented Oct 11, 2014

lindahua Oct 28, 2014

Choose a reason for hiding this comment

davidavdav Oct 28, 2014

Choose a reason for hiding this comment

lindahua Oct 28, 2014

Choose a reason for hiding this comment

davidavdav Oct 28, 2014

Choose a reason for hiding this comment

davidavdav commented Oct 28, 2014

davidavdav commented Nov 19, 2014

lindahua commented Nov 20, 2014

davidavdav commented Nov 20, 2014

davidavdav commented Jan 22, 2015

jpfairbanks Mar 18, 2015

Choose a reason for hiding this comment

jpfairbanks commented Mar 19, 2015

davidavdav commented Mar 19, 2015

davidavdav commented Mar 20, 2015

jpfairbanks commented Mar 20, 2015

davidavdav commented Mar 20, 2015

davidavdav commented Jun 23, 2015

slundberg commented Jun 27, 2015

johnmyleswhite commented Jun 27, 2015

davidavdav commented Jun 28, 2015

johnmyleswhite commented Jun 28, 2015

johnmyleswhite commented Jun 28, 2015

davidavdav commented Jun 28, 2015

johnmyleswhite commented Jun 29, 2015

tlnagy commented Feb 2, 2016