New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inclusion of Hclust into Clustering.jl #32
Conversation
|
||
## functions to comput maximum, minimum, mean for just a slice of an array | ||
## Only consider the upper tiangle, i.e., i<j | ||
function Base.maximum{T<:Real}(d::Matrix{T}, cl1::Vector{Int}, cl2::Vector{Int}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should use different names for such functions, instead of extending the base functions here. They have different interfaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, isn't that the beauty of the dispatch mechanism, that we can define functions with new type of arguments that still do a similar thing to existing ones? The output interface still is the same: a single (maximum) value. The input interface obviously is different.
If maximum(::Array, ::Vector, ::Vector) would have a meaning in Base, I wouldn't want to overload this function.
To be honest, I had assumed that this overloaded function would not be available outside the module, since maximum is not exported. But I see that it is, anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The multi-dispatch mechanism was mainly to support different behaviours on different input types. However, we encourage that the API be consistent for each function name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point, and I can easily change these names, I'll add another commit shortly.
But I think there are many functions that have quite a different meaning depending on their arguments (zip(), Array(), sum(), similar(), etc)---and that is not just the first argument as in traditional object oriented languages.
I've always seen it as one of the cool things that I don't have to think of new function names all the time when I develop code in steps to operate on increasingly more complex objects, as in
mycoolfunc(a::Array) = map(mycoolfunc, a)
And there are already a lot of functions defined in Base (max and maximum are already taken, so where do I go? I find "mymaximum" lacking any fantasy.) I'll probably go for utmaximum().
renamed maximum, minimum and mean to utmaximum, utminimum and utmean for upper triangle (and sliced rows/columns) |
Any ideas on this PR? |
Please add |
OK, done. |
Anything else needed for considering this PR? |
## merge clusters i and j | ||
## update D(i,j) and N(i) accordingly | ||
function hclust_minimum{T<:Real}(d::Matrix{T}) | ||
@assert size(d,1) == size(d,2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be the more idiomatic:
size(d,1) == size(d,2) || error("Distance matrix d is not square.")
?
That error might be more descriptive than the output of the failed assert.
I would like to see this merged, one thing that I can see is that you use built in function names as variable names. specifically you assign to The functions are also quite large. I think there are some pieces of functionality that would be better made into functions that modify their arguments that get called more than once. Refactoring it so that these blocks become reusable pieces of code will help review the PR. For instance: ## For each 0 < i <= n compute Nearest Neighbor N[i]
N = zeros(Int, nc)
for k = 1:nc
min = Inf
mk = 0
for i = 1:(k-1)
if d[i,k] < min
min = d[i,k]
mk = i
end
end
for j = (k+1):nc
if d[k,j] < min
min = d[k,j]
mk = j
end
end
N[k] = mk
end could be extracted into a function such as ## For each 0 < i <= n compute Nearest Neighbor N[i]
function nearestneighbors!(N,d,nc)
for k = 1:nc
min = Inf
mk = 0
for i = 1:(k-1)
if d[i,k] < min
min = d[i,k]
mk = i
end
end
for j = (k+1):nc
if d[k,j] < min
min = d[k,j]
mk = j
end
end
N[k] = mk
end
end
#usage
N = zeros(Int, nc)
nearestneighbors!(N,d,nc) |
I renamed min and max, and got some common code out of the functions. I agree that the functions are quite long, especially hclust_minimum() (single linkage clustering). This is a bit frustrating, as the pseudocode is only 6 lines... I don't see a way to make re-usable code out of it. |
The interface now is hclust(d::Symmetric, method::Symbol) ## method :single, :complete, or :average with convenience functions hclust(d::AbstractMatrix, method::Symbol) ## checks symmetry
hclust(d::AbstractMatrix, method::Symbol, uplo) ## use only upper/lower part of d In the last function, I checked the speed of access to elements in a |
@davidavdav Why do you want to include hclust1 if it doesn't work? Is the intended algorithm supposed to be faster or have better statistical properties when coded correctly? |
Maybe someone else can pick it up. It may be a more efficient algorithm than hclust2()---that still is 4.5x as slow as R's implentation (where hclust_minimum() is faster than R). But there is no harm in removing it. |
Does anything still need to be done for this PR? |
Just checking on the status on this pull request. I want to release code that depends on HClust so this would effect how I incorporate the dependency. |
Just to confirm: you've never read the code for R's implementation, right? Assuming that's true, I'd be ready to squash this and merge it. |
I did not look in R's code for hclust---I wouldn't even know where to find it. I used the paper quoted in the source (Olson). I couldn't get the Wikipedia algorithm description working. |
Works for me. Thanks for writing this! |
Can you do a quick squash? Then we'll merge this. |
Made test/hclust.R consistent with test/hclust.jl rename local definitions of maximum, minimum and mean to utmaximum etc. Added test/hclust.jl in runtests.jl some refactoring (min, max), some extra functions (assertdistancematrix, rorder!) Go for Symmetric arg type to hclust(), with AbstractMatrix fallback Remove hclust1() implementation
I don't think the travis 0.4 failure is related to this squashed commit. |
Inclusion of Hclust into Clustering.jl
Sounds like the code is broken on 0.4. We can fix that. |
Any progress on this? |
Hi,
Following the suggestions from the METATDATA maintainers, I made a patch for Clustering to absorb the package Hclust (Hierarchical clustering along the lines of R's hclust() and cutree()), sice this is also does a form of clustering. This keeps the number of packages down.
I did not further integrate the code, it seems to be reasonably independent of the existing code in Clustering.
---david