
Potential type issues with fuzzy_cmeans #140

Open · tlienart opened this issue Jan 28, 2019 · 13 comments
@tlienart (Contributor) commented Jan 28, 2019

Following the explorations in #138, I had a look at the code for fuzzy_cmeans and tried:

julia> using Clustering, Random

julia> Random.seed!(51);

julia> X = rand(Int16, 100, 5);

julia> fuzzy_cmeans(X, 3, 1.2)
ERROR: InexactError: Int16(-9526.902803360517)

I believe this may stem from the fact that the algorithm accepts any subtype of Real, and therefore <:Integer; with integer input there may be type issues in how the centers are assigned and updated.

Two suggestions:

  1. Restrict the input type to AbstractFloat everywhere in Clustering. I believe that solves the problem, but maybe we would like clustering algorithms to work on ints...
  2. Cast X to float if it isn't already, which, I believe, is the expected behaviour from a maths perspective.
@nalimilan (Member)
Solution 1 isn't very interesting since it fixes the error by throwing another error. Better to choose the appropriate element type using float(eltype(X)) or something similar.
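
For instance (a minimal illustration of the idea; the array sizes here are made up):

julia> X = rand(Int16, 100, 5);

julia> T = float(eltype(X))
Float64

julia> centers = zeros(T, size(X, 1), 3);  # accumulate in floats without copying X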

@tlienart (Contributor, Author)
I agree. The disadvantage is that it requires converting the matrix to a floating-point one, so effectively copying the data.

With respect to float(eltype(X)), one thing I noted is that float of Int16, Int32, ... always gives Float64. That's not super intuitive to me. Wouldn't it be preferable to use a Float32 when we can, to speed up computations?

@alyst (Member) commented Jan 29, 2019

> The disadvantage is that it requires converting the matrix to a floating-point one, so effectively copying the data.

Why so? The centers are calculated in update_centers!(), and from that code it doesn't look like you need any explicit conversion of the data matrix or its elements.

> With respect to float(eltype(X)), one thing I noted is that float of Int16, Int32, ... always gives Float64. That's not super intuitive to me. Wouldn't it be preferable to use a Float32 when we can, to speed up computations?

float(<:Integer) returns the most reasonable type, which is the most precise one.
Note that for Int16 the maximal number is 32767, but maxintfloat(Float16) is 2048.0; not every larger integer can be represented exactly as a Float16.
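
A quick REPL check of that gap (assuming standard Julia semantics):

julia> typemax(Int16)
32767

julia> maxintfloat(Float16)
Float16(2048.0)

julia> Float64(Float16(32767))  # 32767 is not representable and rounds up
32768.0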
But maybe it would make sense -- similar to kmeans!() -- to have a fuzzy_cmeans!(data, centers) method, which would allow specifying the centers eltype.
And speaking of kmeans!(): actually, even after #138 it's still not possible for centers to have an eltype different from that of X.

@tlienart (Contributor, Author) commented Jan 30, 2019

In kmeans, these lines:

centers[i, cj] += X[i, j]

and

centers[i, cj] += X[i, j] * wj

are the ones causing issues if the types of centers, X and w are not set carefully. At the moment (and in the behaviour prior to #138 as well), if X has eltype T then centers is necessarily T too, but w can be anything, which can cause problems. And I agree that centers should be allowed to come with its own type, separate from X, but then more type-casting error possibilities arise.

Admittedly, you don't necessarily need to fully construct Xf = convert(Matrix{FloatXX}, X), but then you need to make sure these computations are done correctly and without loss. Casting individual computations instead of converting the whole thing is, IMO, a path to problems.

What I would suggest (sketched after this list):

  1. Write specialised functions that act on X::AbstractMatrix{<:Union{Integer, Rational, AbstractIrrational}} and convert to the nearest floating-point type with the least precision loss (Float64 for Int64, Rational and Irrational, possibly Float32 for Int16 and Int32, possibly Float16 for Bool).
  2. Write specialised functions that act on X::AbstractMatrix{<:AbstractFloat} (= the current code, but restricted to AbstractFloat instead of Real), used after the conversion done in (1).
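
A rough sketch of that split using dispatch (hypothetical function name, not actual Clustering.jl code):

function _kmeans(X::AbstractMatrix{<:AbstractFloat}, k::Int)
    # ... the current algorithm, restricted to floating-point input ...
end

function _kmeans(X::AbstractMatrix{<:Union{Integer,Rational,AbstractIrrational}}, k::Int)
    # convert once, up front, to the float type chosen in (1), then recurse
    _kmeans(convert(Matrix{float(eltype(X))}, X), k)
end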

PS:

float(Int16) # Float64
float(Bool) # Float64

@nalimilan (Member)
> But maybe it would make sense -- similar to kmeans!() -- to have a fuzzy_cmeans!(data, centers) method, which would allow specifying the centers eltype.

I agree with this. That sounds simpler and more reliable than trying to choose the floating-point type based on the input type. Indeed, depending on the situation, you may need more or less precision, independently of whether the data can be converted without loss. For example, Float16 is very imprecise, so even if your data is boolean or in a small range you probably don't want to use it. This is why Float64 is the default everywhere in Julia.

@tlienart (Contributor, Author) commented Jan 30, 2019

See #143; in short, what I implemented:

  • allows any type for centers;
  • converts X to Float64 if it is not a floating-point matrix;
  • converts centers to eltype(X) after this first conversion, so effectively Float64 unless X is given as a lower-precision float like Float32, in which case I don't think it makes much sense to keep the centers in, say, Float64 while updating them with Float32 operations (condensed in the sketch after this list);
  • (Probably Float16 should be treated like the integer types and just be converted to Float64; who uses Float16 in data analytics anyway, apart from neural nets maybe? I didn't do this and just kept the behaviour from before #138 (making kmeans take AbstractMatrix instead of Matrix) in place.)
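
Roughly, the behaviour above condenses to (a sketch, not the literal #143 code):

Xf = eltype(X) <: AbstractFloat ? X : convert(Matrix{Float64}, X)
centers = convert(Matrix{eltype(Xf)}, centers)  # centers follow the (possibly converted) X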

@wildart (Contributor) commented Jan 30, 2019

It does not make sense to use integer data as input to k-means without an initial conversion to floating point. The mean calculation results in a floating-point value; I would not expect it to be an integer.

I think it's better to limit the input type to subtypes of AbstractFloat rather than Real, instead of converting data to Float64.

Moreover, we should make sure that update_centers! is agnostic to the input data type. I can spot a problem where the center weights array is initialized with a Float64 zero instead of a value of the appropriate type:

cweights[to_update] .= 0.0

IMHO, there should be no silent data conversion: whatever input data type is provided, the same data type should be expected in the result. It's the user's responsibility to decide what kind of data to feed into the algorithm. We should just provide a clear description of parameter limitations.

@nalimilan (Member)
Do we really need to convert the input, though? Can't we just allocate the resulting vector in the appropriate type without converting the input? If so, it sounds wasteful to require the user to convert beforehand (especially given that it requires making a copy of the data).

> Moreover, we should make sure that update_centers! is agnostic to the input data type. I can spot a problem where the center weights array is initialized with a Float64 zero instead of a value of the appropriate type:

Assignment into an array will never change its type, so what happens here is that 0.0 is converted to eltype(cweights). Using .= 0 would have the same effect.
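
A quick illustration (array name hypothetical):

julia> cw = zeros(Float32, 3);

julia> cw[1:2] .= 0.0;  # the Float64 literal is converted on assignment

julia> eltype(cw)
Float32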

@tlienart (Contributor, Author) commented Jan 30, 2019

The proposal specifically uses .= zero(T); see

cweights[to_update] .= zero(T)

etc.

Re "silent data conversion": I'm in favour of emitting a warning and doing the conversion, rather than just throwing an InexactError.

If someone enters data that looks like:

1 1 1 2 1 2 2 2 1 1 50 55 60 50 65 50 65

you'd still want your kmeans to pick the obvious two clusters.
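
Something like this hypothetical warn-and-convert wrapper (assuming a float-only kmeans method underneath):

function kmeans_lenient(X::AbstractMatrix{<:Real}, k::Int)
    if !(eltype(X) <: AbstractFloat)
        @warn "Converting input from $(eltype(X)) to $(float(eltype(X)))"
        X = convert(Matrix{float(eltype(X))}, X)
    end
    return kmeans(X, k)  # the float-only method
end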

> Do we really need to convert the input, though? Can't we just allocate the resulting vector in the appropriate type without converting the input? If so, it sounds wasteful to require the user to convert beforehand (especially given that it requires making a copy of the data).

Maybe not, but then copyseeds needs to be modified so that the type of centers is not tied to that of X but rather to what the type of X would be after conversion. This may work; I can give it a shot.
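
For instance (a sketch of the idea, not the actual copyseeds code):

function copyseeds_sketch(X::AbstractMatrix{<:Real}, iseeds::AbstractVector{Int})
    T = float(eltype(X))  # the eltype X would have after conversion
    centers = Matrix{T}(undef, size(X, 1), length(iseeds))
    for (j, s) in enumerate(iseeds)
        centers[:, j] .= @view X[:, s]  # elementwise conversion to T; X stays untouched
    end
    return centers
end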

@wildart (Contributor) commented Jan 30, 2019

> Assignment into an array will never change its type, so what happens here is that 0.0 is converted to eltype(cweights). Using .= 0 would have the same effect.

Sorry, my bad. I spotted the assignment and acted on it.

> Re "silent data conversion": I'm in favour of emitting a warning and doing the conversion, rather than just throwing an InexactError.

Restricting the input to AbstractFloat will not allow the function to run with incompatible input data, leaving the user the choice of converting the data to an acceptable type.

> Do we really need to convert the input, though? Can't we just allocate the resulting vector in the appropriate type without converting the input?

I believe that's how the function works now - all internal structures and the output have the same type as the input, and that is how it should always be, unless a particular algorithm specifies otherwise.

@tlienart (Contributor, Author) commented Jan 30, 2019

> Restricting the input to AbstractFloat will not allow the function to run with incompatible input data, leaving the user the choice of converting the data to an acceptable type.

I think this is an odd restriction. In fact, the example on scikit-learn's own page uses integer data: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

Also, the previous behaviour, with which you may be familiar, had a few issues:

  • it required the types of X and centers to be the same in kmeans!, which is not necessary, though you may say that would be the case most of the time;
  • you could still enter data as Float16 and potentially get shafted by InexactErrors;
  • it oddly forced some things to be exactly Float64 when that was not necessarily required.

The attempt in #143 is to change this while offering better performance at the same time. I'm working on @nalimilan's suggestion, which should make it possible to avoid explicit conversion of the input data.

Also, I'm doing all this with a view to future uses of kmeans where the user potentially has no idea of the internals and may in fact try to feed a DataFrame or even a Table to kmeans. In that case a novice user may have no idea why it fails with integer input, and may not know how (or want) to convert it to float...

I think all these super-basic ML / data-analytics tools should be as robust to user input as possible... (especially when it's not very complicated to make them so)

@wildart (Contributor) commented Jan 30, 2019

> I think this is an odd restriction. In fact, the example on scikit-learn's own page uses integer data...

That results in a floating-point array of centers.

> the user potentially has no idea of the internals and may in fact try to feed a DataFrame or even a Table to kmeans,

I think users should be aware of the type of their data and of the limitations of the algorithms they use.

But I do see a reason for allowing a different type for the centers array, as long as it's requested explicitly.

@tlienart (Contributor, Author)
> That results in a floating-point array of centers.

The code as it currently stands in #143 does this as well, fwiw.
