Provide standardize API for 1D array #490

Merged: 20 commits merged on Sep 19, 2019


yuehhua
Contributor

@yuehhua yuehhua commented May 2, 2019

I found that the standardize API is only available for 2D arrays, not for 1D arrays.

This PR adds support for 1D arrays.

@codecov

codecov bot commented May 2, 2019

Codecov Report

Merging #490 into master will decrease coverage by 0.03%.
The diff coverage is 94.11%.


@@            Coverage Diff             @@
##           master     #490      +/-   ##
==========================================
- Coverage   83.99%   83.95%   -0.04%     
==========================================
  Files          21       21              
  Lines        2162     2163       +1     
==========================================
  Hits         1816     1816              
- Misses        346      347       +1
Impacted Files Coverage Δ
src/transformations.jl 91.39% <94.11%> (-1%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update aad1fe0...8beb43f.

@codecov

codecov bot commented May 2, 2019

Codecov Report

Merging #490 into master will increase coverage by 0.32%.
The diff coverage is 94.91%.


@@            Coverage Diff             @@
##           master     #490      +/-   ##
==========================================
+ Coverage   90.15%   90.47%   +0.32%     
==========================================
  Files          21       21              
  Lines        2031     2100      +69     
==========================================
+ Hits         1831     1900      +69     
  Misses        200      200
Impacted Files Coverage Δ
src/transformations.jl 95.45% <94.91%> (+2.04%) ⬆️
src/weights.jl 86.53% <0%> (+2.09%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b3f9d19...e9860b1.

Member

@nalimilan nalimilan left a comment


Thanks. @wildart, is this OK?

@yuehhua
Contributor Author

yuehhua commented May 4, 2019

Correct the implementation, add tests and update docs.

@nalimilan
Member

Can you also adjust fit, and revert all unrelated whitespace changes?

@yuehhua
Contributor Author

yuehhua commented May 4, 2019

Can you also adjust fit, and revert all unrelated whitespace changes?

Do you mean implementing a fit method which accepts AbstractArray{<:Real,1}?

Sorry for adding unrelated whitespace changes. My editor does that when I save.

And could I merge

transform!(t::AbstractDataTransform, x::AbstractArray{<:Real,1}) = transform!(x, t, x)
transform!(t::AbstractDataTransform, x::AbstractArray{<:Real,2}) = transform!(x, t, x)

into

transform!(t::AbstractDataTransform, x::AbstractVecOrMat{<:Real}) = transform!(x, t, x)

also?
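As background for the question above, here is a minimal sketch of why one method can replace the two: in Julia, AbstractVecOrMat{T} is the union of AbstractVector{T} and AbstractMatrix{T}. The function name describe_input is hypothetical and only illustrates dispatch; it is not part of StatsBase.

```julia
# `describe_input` is a hypothetical helper used only to illustrate dispatch.
# A single AbstractVecOrMat method accepts both 1-D and 2-D arrays.
describe_input(x::AbstractVecOrMat{<:Real}) = "array with $(ndims(x)) dimension(s)"

v = [1.0, 2.0, 3.0]        # 1-D array (Vector)
M = [1.0 2.0; 3.0 4.0]     # 2-D array (Matrix)

println(describe_input(v))
println(describe_input(M))

# Both concrete types are subtypes of the union alias.
println(typeof(v) <: AbstractVecOrMat{<:Real})
println(typeof(M) <: AbstractVecOrMat{<:Real})
```

A 3-D array would not match this method, so the merged signature does not widen the accepted inputs beyond vectors and matrices.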

Co-Authored-By: yuehhua <a504082002@gmail.com>
@wildart
Contributor

wildart commented May 13, 2019

Yes, you should do that.

@nalimilan
Member

Sorry, I hadn't realized that this is inconsistent with the fact that vectors are treated as column vectors in Julia. Can you change this?

I think this also illustrates a problem that I hadn't spotted when reviewing the original PR which added transformations: we shouldn't apply standardization to rows by default, as it's inconsistent with what e.g. cov does in Statistics. I think we should add a dims keyword argument and deprecate the current behavior, as I did in JuliaStats/Distances.jl#121.

@yuehhua
Contributor Author

yuehhua commented May 27, 2019

Great! OK, let's check our plan.

  1. Add a dims keyword argument, as cov does in Statistics.

The dims behavior of cov is different from what you did in JuliaStats/Distances.jl#121.

Suppose we have an m-by-n matrix X.

cov(X, dims=1) computes along the first dimension, i.e. column-wise, and gives an n-by-n covariance matrix.

cov(X, dims=2) computes along the second dimension, i.e. row-wise, and gives an m-by-m covariance matrix.

Similar to cov, we may have a standardization API as follows:

standardize(::Type{DT}, X::AbstractMatrix{<:Real}; kwargs...) with the dims=1 keyword argument standardizes X along the first dimension and gives a column-standardized matrix.

standardize(::Type{DT}, X::AbstractMatrix{<:Real}; kwargs...) with the dims=2 keyword argument standardizes X along the second dimension and gives a row-standardized matrix.

  2. Change the computation to a column-wise fashion for efficiency.

We could separate the API from the implementation. The API would follow the design above, while the implementation works column-wise for efficiency.

Thus, we provide an efficient implementation with a user-friendly API.

The details of the implementation:

For dims=1, the column-standardized matrix can be computed straightforwardly, column by column.

For dims=2, the matrix may be transposed before the column-wise computation to obtain the row-standardized matrix.

Adding the dims keyword argument requires minimal changes from users while the current behavior is deprecated.

  3. Provide the API and implementation for Vector, fulfilling the purpose of this PR.

May I change or redesign the behavior of the whole transformations.jl to achieve what is described above?

@nalimilan
Member

The dims behavior of cov is different from what you did in JuliaStats/Distances.jl#121.

Suppose we have an m-by-n matrix X.

cov(X, dims=1) computes along the first dimension, i.e. column-wise, and gives an n-by-n covariance matrix.

cov(X, dims=2) computes along the second dimension, i.e. row-wise, and gives an m-by-m covariance matrix.

Yes, cov and pairwise are different since the former computes pairwise covariances between variables, but the latter computes pairwise distances between observations. So when you have an observations×variables matrix, you use cov(X, dims=1) and pairwise(X, dims=1). (I guess a counter-argument to that decision is that in cov and sum, the dims argument indicates the dimension which will be collapsed and will therefore not appear in the resulting matrix, which isn't the case for pairwise. Ah, well... not sure which behavior is the most intuitive.)

Similar to cov, we may have a standardization API as follows:

standardize(::Type{DT}, X::AbstractMatrix{<:Real}; kwargs...) with the dims=1 keyword argument standardizes X along the first dimension and gives a column-standardized matrix.

standardize(::Type{DT}, X::AbstractMatrix{<:Real}; kwargs...) with the dims=2 keyword argument standardizes X along the second dimension and gives a row-standardized matrix.

Yes, your proposal sounds consistent with how cov and pairwise work: use dims=1 when observations are stored as rows, and dims=2 when they are stored as columns. Standardizing with dims=i will then be equivalent to (x .- mean(x, dims=i)) ./ std(x, dims=i).
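The dims semantics discussed above can be checked with the Statistics standard library alone. The following is a minimal sketch: zscore_along is a hypothetical helper that only spells out the formula quoted in this comment, and it is not the API that was eventually merged.

```julia
using Statistics

# 4 observations (rows) × 3 variables (columns); the values are arbitrary.
X = [1.0 2.0 0.5;
     2.0 1.0 3.0;
     4.0 6.0 2.0;
     3.0 5.0 4.5]

# cov collapses the dimension named by dims.
@show size(cov(X, dims=1))   # variables × variables
@show size(cov(X, dims=2))   # observations × observations

# Hypothetical helper: standardize along dimension i using the quoted formula.
zscore_along(x, i) = (x .- mean(x, dims=i)) ./ std(x, dims=i)

Z1 = zscore_along(X, 1)   # every column has mean 0 and standard deviation 1
Z2 = zscore_along(X, 2)   # every row has mean 0 and standard deviation 1

@show mean(Z1, dims=1)
@show mean(Z2, dims=2)
```

With observations stored as rows, dims=1 is the natural choice; with observations stored as columns, dims=2 plays the same role.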

2. Change the computation to a column-wise fashion for efficiency.

We could separate the API from the implementation. The API would follow the design above, while the implementation works column-wise for efficiency.

Thus, we provide an efficient implementation with a user-friendly API.

The details of the implementation:

For dims=1, the column-standardized matrix can be computed straightforwardly, column by column.

For dims=2, the matrix may be transposed before the column-wise computation to obtain the row-standardized matrix.

Yes, you can use transpose as I did at JuliaStats/Distances.jl#121. But note that's a lazy operation: no transposed copy is made, which means the memory is still not accessed in the most efficient pattern. But making a copy would probably not be much faster since that would add some overhead (and use memory, which the user may not want).
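To illustrate the point about laziness, here is a small sketch using only Base and the LinearAlgebra standard library. It shows that transpose returns a wrapper sharing memory with the original matrix, whereas permutedims makes an eager copy. The variable names are arbitrary.

```julia
using LinearAlgebra

X = [1.0 2.0 3.0;
     4.0 5.0 6.0]

Xt = transpose(X)            # lazy wrapper: no data is copied
println(Xt isa Transpose)
Xt[3, 1] = 30.0              # writes through to X[1, 3]
println(X[1, 3])

Xc = permutedims(X)          # eager copy: a new 3×2 Matrix with its own memory
Xc[1, 1] = -1.0              # does not affect X
println(X[1, 1])
```

Because the wrapper only remaps indices, iterating over its columns still walks the parent matrix along rows, which is the access-pattern cost mentioned above.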

Otherwise, the plan sounds good.

@yuehhua
Contributor Author

yuehhua commented May 27, 2019

Yes, cov and pairwise are different since the former computes pairwise covariances between variables, but the latter computes pairwise distances between observations. So when you have an observations×variables matrix, you use cov(X, dims=1) and pairwise(X, dims=1). (I guess a counter-argument to that decision is that in cov and sum, the dims argument indicates the dimension which will be collapsed and will therefore not appear in the resulting matrix, which isn't the case for pairwise. Ah, well... not sure which behavior is the most intuitive.)

An intuitive and user-friendly design is necessary; however, I am also considering consistency between APIs. So far, I have seen that the counter-arguments are also used in Flux.jl. I have no further thoughts on this. Any suggestions?

Yes, you can use transpose as I did at JuliaStats/Distances.jl#121. But note that's a lazy operation: no transposed copy is made, which means the memory is still not accessed in the most efficient pattern. But making a copy would probably not be much faster since that would add some overhead (and use memory, which the user may not want).

Otherwise, the plan sounds good.

I also considered the scenarios you mentioned above. Transpose doesn't copy the matrix, while a deep copy may not improve efficiency. So I decided to implement it in the straightforward way.

@nalimilan
Member

An intuitive and user-friendly design is necessary; however, I am also considering consistency between APIs. So far, I have seen that the counter-arguments are also used in Flux.jl. I have no further thoughts on this. Any suggestions?

What do you mean about Flux.jl? AFAIK the difference in conventions is only about defaults, which can be tackled separately from the meaning we assign to dims=1 and dims=2, right? It's very hard to decide what's the best default since that's mostly an opposition between stats (observations×variables) and machine learning (variables×observations) people. A possible rule is to be consistent with Statistics since that's in the stdlib, but that's just one possible choice.

@yuehhua
Contributor Author

yuehhua commented May 28, 2019

For your reference, FluxML/Flux.jl#563

@yuehhua
Contributor Author

yuehhua commented May 28, 2019

I think that the design of the arguments is not related to the domain (statistics/machine learning).
Z-score in MATLAB behaves the same way as in Julia.
The reason behind the design might be the memory layout, since Julia and MATLAB are both column-major languages.

@nalimilan
Member

I think that the design of the arguments is not related to the domain (statistics/machine learning).
Z-score in MATLAB behaves the same way as in Julia.
The reason behind the design might be the memory layout, since Julia and MATLAB are both column-major languages.

That's fine with me, but the fact that the current method chose a different default indicates that not everybody feels that way. Anyway as a first step we should add a dims argument, then we can decide later whether to change the default. Indeed in column-major languages like Julia it's faster to standardize columns.
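The column-major remark can be made concrete with a short sketch. The helper column_sums is hypothetical and is not taken from StatsBase; it only shows the loop order that visits memory contiguously in Julia.

```julia
# Julia stores matrices column by column, so vec lists the first column first.
A = [1 2;
     3 4]
println(vec(A))

# Hypothetical helper: per-column sums with the row index in the inner loop,
# so consecutive iterations touch adjacent memory locations.
function column_sums(X::AbstractMatrix{<:Real})
    sums = zeros(float(eltype(X)), size(X, 2))
    for j in axes(X, 2)        # outer loop over columns
        for i in axes(X, 1)    # inner loop over rows (contiguous in memory)
            sums[j] += X[i, j]
        end
    end
    return sums
end

println(column_sums(A))
```

A per-row statistic computed with the same loop nest would need the loops swapped or an accumulator per row, which is why operating on columns is the cheaper default in a column-major language.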

@yuehhua
Contributor Author

yuehhua commented Jun 6, 2019

Something like this?

Contributor

@wildart wildart left a comment


dims argument handling

if dims == 1
return transform!(x, t, x)
elseif dims == 2
return transform!(x', t, x')'
Member


Return x instead of calling ', which would return an Adjoint object wrapping it instead.

Member


This still applies:

Suggested change:
- return transform!(x', t, x')'
+ transform!(x', t, x')
+ return x

Same below. Tests should be adapted of course.
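The difference between returning the wrapper and returning x can be seen in a small sketch. The functions center_columns! and center! are hypothetical stand-ins for the in-place transform under review, not the StatsBase implementation; they only subtract means.

```julia
using LinearAlgebra, Statistics

# Hypothetical column-wise kernel: subtract each column's mean, writing into y.
function center_columns!(y::AbstractMatrix, x::AbstractMatrix)
    y .= x .- mean(x, dims=1)
    return y
end

# Hypothetical in-place wrapper with a dims keyword.
function center!(x::Matrix{Float64}; dims::Integer=1)
    if dims == 1
        center_columns!(x, x)
    elseif dims == 2
        xt = x'                  # lazy Adjoint wrapper sharing memory with x
        center_columns!(xt, xt)  # writes through the wrapper into x
    else
        throw(DomainError(dims, "dims should be either 1 or 2"))
    end
    return x                     # a plain Matrix, never an Adjoint wrapper
end

x = [1.0 2.0 3.0;
     4.0 5.0 6.0]
r = center!(x, dims=2)
println(typeof(r))
println(r === x)
```

Returning x keeps the result type stable for both values of dims, so callers do not need to handle a wrapper type in one branch only.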

T = eltype(X)
m, s = mean_and_std(X, 2)

return ZScoreTransform(d, (center ? vec(m) : zeros(T, 0)),
Member


dims must be stored in ZScoreTransform as @wildart suggested.

Member


Still an issue.

new transform API

new reconstruct API

Update src/transformations.jl

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
fix

fix

Update src/transformations.jl

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Update src/transformations.jl

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
fix

Update src/transformations.jl

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Update src/transformations.jl

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Update src/transformations.jl

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Update src/transformations.jl

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Fix

fix
@yuehhua
Contributor Author

yuehhua commented Jun 17, 2019

If we keep dims in the type, it could be confused with the original dim.
Is there a good name for it?

@nalimilan
Member

I guess dim could be renamed to len.

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Update src/transformations.jl

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Update src/transformations.jl

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Update src/transformations.jl

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Refactor
@yuehhua
Contributor Author

yuehhua commented Jul 12, 2019

@nalimilan Is it good?

Contributor

@wildart wildart left a comment


Round 2

@wildart
Contributor

wildart commented Aug 1, 2019

It is common for in-place update functions to take, as their first parameter, the variable in which the result of the operation will be stored. So, for transform!(y, t, x), the y argument is that variable. I do not see the point of creating "private" functions when transform! & reconstruct! already exist. You just need to add proper dimension checks related to the t.dim parameter.

@yuehhua
Contributor Author

yuehhua commented Aug 2, 2019

It is common for in-place update functions to take, as their first parameter, the variable in which the result of the operation will be stored. So, for transform!(y, t, x), the y argument is that variable. I do not see the point of creating "private" functions when transform! & reconstruct! already exist.

Yeah... I get your point. I know that is the common way to go.

You just need to add proper dimension checks related to the t.dim parameter.

If I move the dimension checks from transform to transform! like this:

function transform!(y, t, x)
    if t.dims == 1
        # (original part of the code)
    elseif t.dims == 2
        # (How can I implement this branch?)
    end
end

Implementing a similar algorithm row-wise is not practical.
If I use the recursive call transform!(y', t, x')', a StackOverflowError occurs: since t.dims is immutable, the recursion never ends.
I cannot come up with a feasible solution for this.

@nalimilan
Member

If I use the recursive call transform!(y', t, x')', a StackOverflowError occurs: since t.dims is immutable, the recursion never ends.
I cannot come up with a feasible solution for this.

You should be able to avoid this by creating a new transformation object with the expected dim.
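One way to read this suggestion is sketched below. ToyZScore, toy_fit and toy_transform! are hypothetical names invented for illustration; they are not StatsBase's ZScoreTransform or its methods, and the merged implementation may differ. The dims == 2 branch builds a new transformation object with dims set to 1 and calls itself on lazy transposes, so the recursion stops after one step.

```julia
using Statistics

# Hypothetical immutable transformation type that stores dims.
struct ToyZScore
    dims::Int
    mean::Vector{Float64}
    scale::Vector{Float64}
end

# Hypothetical fit: statistics are taken along the chosen dimension.
function toy_fit(X::AbstractMatrix{<:Real}; dims::Integer)
    dims in (1, 2) || throw(DomainError(dims, "dims should be either 1 or 2"))
    return ToyZScore(dims, vec(mean(X, dims=dims)), vec(std(X, dims=dims)))
end

# The result is stored in y, the first argument, following the in-place convention.
function toy_transform!(y::AbstractMatrix, t::ToyZScore, x::AbstractMatrix)
    if t.dims == 1
        size(x, 2) == length(t.mean) || throw(DimensionMismatch("inconsistent dimensions"))
        y .= (x .- t.mean') ./ t.scale'
    else
        # New object with dims = 1: the recursive call takes the first branch.
        t1 = ToyZScore(1, t.mean, t.scale)
        toy_transform!(transpose(y), t1, transpose(x))
    end
    return y
end

X = [1.0 2.0 3.0;
     4.0 5.0 9.0]
t = toy_fit(X, dims=2)
Y = toy_transform!(similar(X), t, X)
println(Y)
```

Since the struct is immutable, constructing t1 is cheap: it reuses the same mean and scale vectors and only changes the stored dimension.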

@yuehhua
Contributor Author

yuehhua commented Aug 19, 2019

@nalimilan I committed. Would you please review it? Thank you.

yuehhua and others added 2 commits August 25, 2019 22:00
Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>
Contributor

@wildart wildart left a comment


In-place transforms

Member

@nalimilan nalimilan left a comment


Thanks!

3 participants