Provide standardize API for 1D array #490

yuehhua · 2019-05-02T02:41:39Z

I found that the standardize API is only available for 2D arrays, not for 1D arrays.

I have just added support for 1D arrays.

codecov · 2019-05-02T02:57:45Z

Codecov Report

Merging #490 into master will decrease coverage by 0.03%.
The diff coverage is 94.11%.

@@            Coverage Diff             @@
##           master     #490      +/-   ##
==========================================
- Coverage   83.99%   83.95%   -0.04%     
==========================================
  Files          21       21              
  Lines        2162     2163       +1     
==========================================
  Hits         1816     1816              
- Misses        346      347       +1

Impacted Files	Coverage Δ
src/transformations.jl	`91.39% <94.11%> (-1%)`	⬇️

codecov · 2019-05-02T02:57:45Z

Codecov Report

Merging #490 into master will increase coverage by 0.32%.
The diff coverage is 94.91%.

@@            Coverage Diff             @@
##           master     #490      +/-   ##
==========================================
+ Coverage   90.15%   90.47%   +0.32%     
==========================================
  Files          21       21              
  Lines        2031     2100      +69     
==========================================
+ Hits         1831     1900      +69     
  Misses        200      200

Impacted Files	Coverage Δ
src/transformations.jl	`95.45% <94.91%> (+2.04%)`	⬆️
src/weights.jl	`86.53% <0%> (+2.09%)`	⬆️

nalimilan

Thanks. @wildart, is this OK?

src/transformations.jl

yuehhua · 2019-05-04T13:08:05Z

Correct the implementation, add tests and update docs.

nalimilan · 2019-05-04T14:18:57Z

Can you also adjust fit, and revert all unrelated whitespace changes?

src/transformations.jl

yuehhua · 2019-05-04T14:40:59Z

> Can you also adjust fit, and revert all unrelated whitespace changes?

Do you mean implementing a fit method which accepts AbstractArray{<:Real,1}?

Sorry for adding unrelated whitespace. My editor does that when I save.

And could I merge

transform!(t::AbstractDataTransform, x::AbstractArray{<:Real,1}) = transform!(x, t, x)
transform!(t::AbstractDataTransform, x::AbstractArray{<:Real,2}) = transform!(x, t, x)

into

transform!(t::AbstractDataTransform, x::AbstractVecOrMat{<:Real}) = transform!(x, t, x)

also?
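As a minimal sketch of why the merge is safe: `AbstractVecOrMat{T}` is Base's union alias covering both `AbstractVector{T}` and `AbstractMatrix{T}`, so a single method handles both cases. `ToyShift` and `shift!` below are hypothetical stand-ins used only for illustration; they are not StatsBase names.

```julia
# Hypothetical stand-in for a fitted transform; not part of StatsBase.
struct ToyShift
    offset::Float64
end

# Three-argument form: write the shifted values of `x` into `y`.
function shift!(y::AbstractVecOrMat{<:Real}, t::ToyShift, x::AbstractVecOrMat{<:Real})
    y .= x .- t.offset
    return y
end

# One two-argument method replaces the separate 1D and 2D definitions.
shift!(t::ToyShift, x::AbstractVecOrMat{<:Real}) = shift!(x, t, x)

v = [1.0, 2.0, 3.0]
m = [1.0 2.0; 3.0 4.0]
shift!(ToyShift(1.0), v)   # v is now [0.0, 1.0, 2.0]
shift!(ToyShift(1.0), m)   # m is now [0.0 1.0; 2.0 3.0]
```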

Co-Authored-By: yuehhua <a504082002@gmail.com>

wildart · 2019-05-13T21:51:09Z

Yes, you should do that.

nalimilan · 2019-05-26T16:30:44Z

Sorry, I hadn't realized that this is inconsistent with the fact that vectors are treated as column vectors in Julia. Can you change this?

I think this also illustrates a problem that I hadn't spotted when reviewing the original PR which added transformations: we shouldn't apply standardization to rows by default, as it's inconsistent with what e.g. cov does in Statistics. I think we should add a dims keyword argument and deprecate the current behavior, as I did in JuliaStats/Distances.jl#121.

yuehhua · 2019-05-27T02:30:03Z

Great! OK, let's check our plan.

1. Add a dims keyword argument, as cov does in Statistics.

The dims behavior of cov is different from what you did in JuliaStats/Distances.jl#121.

Suppose we have an m-by-n matrix X.

cov(X, dims=1) calculates along the first dimension, i.e. in column-wise fashion, and gives an n-by-n covariance matrix.

cov(X, dims=2) calculates along the second dimension, i.e. in row-wise fashion, and gives an m-by-m covariance matrix.

Similar to cov, we may have a standardization API as follows:

standardize(::Type{DT}, X::AbstractMatrix{<:Real}; kwargs...) with the dims=1 keyword argument standardizes X along the first dimension and gives a column-standardized matrix.

standardize(::Type{DT}, X::AbstractMatrix{<:Real}; kwargs...) with the dims=2 keyword argument standardizes X along the second dimension and gives a row-standardized matrix.

2. Change the calculation to column-wise fashion for efficiency.

We could separate the API from the implementation. The API design would follow the design above. The implementation works column-wise for efficiency.

Thus, we provide an efficient implementation with a user-friendly API.

The details of the implementation:

For dims=1, the column-standardized matrix can be calculated straightforwardly in column-wise fashion.

For dims=2, the matrix may be transposed before being processed in column-wise fashion, which gives the row-standardized matrix.

Adding the dims keyword argument requires only minimal changes from users while the current behavior is deprecated.

3. Provide the API and implementation for Vector, fulfilling the purpose of this PR.

Could I change/redesign the behavior of the whole transformations.jl to achieve what is described above?

nalimilan · 2019-05-27T16:42:52Z

> The dims behavior of cov is different from what you did in JuliaStats/Distances.jl#121.

> Suppose we have an m-by-n matrix X.

> cov(X, dims=1) calculates along the first dimension, i.e. in column-wise fashion, and gives an n-by-n covariance matrix.

> cov(X, dims=2) calculates along the second dimension, i.e. in row-wise fashion, and gives an m-by-m covariance matrix.

Yes, cov and pairwise are different since the former computes pairwise correlations between variables, but the latter computes pairwise distances between observations. So when you have an observations×variables matrix, you use cov(X, dims=1) and pairwise(X, dims=1). (I guess a counter-argument to that decision is that in cov and sum, the dims argument indicates the dimension which will be collapsed and will therefore not appear in the resulting matrix, which isn't the case for pairwise. Ah, well... not sure which behavior is the most intuitive.)
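For readers less familiar with the convention, here is a small sketch of how `dims` behaves for `cov` and `sum` in the Statistics standard library: the dimension named by `dims` is the one that gets collapsed. The matrix `X` is arbitrary example data.

```julia
using Statistics

X = rand(5, 3)          # 5 observations (rows) × 3 variables (columns)

C1 = cov(X, dims=1)     # collapses dimension 1: covariance between columns
C2 = cov(X, dims=2)     # collapses dimension 2: covariance between rows

size(C1)                # (3, 3)
size(C2)                # (5, 5)
size(sum(X, dims=1))    # (1, 3): the summed dimension is reduced to length 1
```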

> Similar to cov, we may have a standardization API as follows:

> standardize(::Type{DT}, X::AbstractMatrix{<:Real}; kwargs...) with the dims=1 keyword argument standardizes X along the first dimension and gives a column-standardized matrix.

> standardize(::Type{DT}, X::AbstractMatrix{<:Real}; kwargs...) with the dims=2 keyword argument standardizes X along the second dimension and gives a row-standardized matrix.

Yes, your proposal sounds consistent with how cov and pairwise work: use dims=1 when observations are stored as rows, and dims=2 when they are stored as columns. Standardizing with dims=i will then be equivalent to (x .- mean(x, dims=i)) ./ std(x, dims=i).
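The equivalence stated above can be checked with a short sketch. `zscore_along` is a hypothetical helper written only to illustrate the proposed `dims` semantics; it is not the StatsBase implementation.

```julia
using Statistics

# Hypothetical helper illustrating the proposed `dims` semantics;
# not the StatsBase implementation.
zscore_along(x::AbstractMatrix{<:Real}; dims::Integer) =
    (x .- mean(x, dims=dims)) ./ std(x, dims=dims)

X = [1.0  2.0  3.0;
     4.0  6.0  8.0;
     7.0 10.0 16.0]

Zc = zscore_along(X, dims=1)   # each column has mean 0 and std 1
Zr = zscore_along(X, dims=2)   # each row has mean 0 and std 1
```

With an observations×variables matrix, `dims=1` is the call that standardizes each variable.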

> Change the calculation to column-wise fashion for efficiency.

> We could separate the API from the implementation. The API design would follow the design above. The implementation works column-wise for efficiency.

> Thus, we provide an efficient implementation with a user-friendly API.

> The details of the implementation:

> For dims=1, the column-standardized matrix can be calculated straightforwardly in column-wise fashion.

> For dims=2, the matrix may be transposed before being processed in column-wise fashion, which gives the row-standardized matrix.

Yes, you can use transpose as I did at JuliaStats/Distances.jl#121. But note that's a lazy operation: no transposed copy is made, which means the memory is still not accessed in the most efficient pattern. But making a copy would probably not be much faster since that would add some overhead (and use memory, which the user may not want).
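A brief sketch of the laziness mentioned here, using only Base and LinearAlgebra: `transpose` returns a wrapper that shares storage with its parent, whereas `copy` of that wrapper materializes a new matrix.

```julia
using LinearAlgebra

X = [1.0 2.0 3.0;
     4.0 5.0 6.0]

Xt = transpose(X)        # lazy wrapper, no data copied
Xt isa Transpose         # true
parent(Xt) === X         # true: same underlying storage

Xt[3, 1] = 30.0          # writing through the wrapper...
X[1, 3]                  # ...changes the original: 30.0

Xc = copy(Xt)            # materializes a new 3×2 Matrix
Xc isa Matrix            # true
Xc[1, 1] = -1.0
X[1, 1]                  # still 1.0: the copy is independent
```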

Otherwise, the plan sounds good.

yuehhua · 2019-05-27T16:56:30Z

> Yes, cov and pairwise are different since the former computes pairwise correlations between variables, but the latter computes pairwise distances between observations. So when you have an observations×variables matrix, you use cov(X, dims=1) and pairwise(X, dims=1). (I guess a counter-argument to that decision is that in cov and sum, the dims argument indicates the dimension which will be collapsed and will therefore not appear in the resulting matrix, which isn't the case for pairwise. Ah, well... not sure which behavior is the most intuitive.)

An intuitive and user-friendly design is necessary; however, I am also considering consistency between APIs. So far, I have seen the same counter-arguments come up in Flux.jl. I have no further ideas about this. Do you have any suggestions?

> Yes, you can use transpose as I did at JuliaStats/Distances.jl#121. But note that's a lazy operation: no transposed copy is made, which means the memory is still not accessed in the most efficient pattern. But making a copy would probably not be much faster since that would add some overhead (and use memory, which the user may not want).

> Otherwise, the plan sounds good.

I also considered the scenarios you mentioned above. Transpose doesn't copy the matrix, while a full copy may not be any more efficient. So I have decided to implement it in the straightforward way.

nalimilan · 2019-05-27T17:01:01Z

> An intuitive and user-friendly design is necessary; however, I am also considering consistency between APIs. So far, I have seen the same counter-arguments come up in Flux.jl. I have no further ideas about this. Do you have any suggestions?

What do you mean about Flux.jl? AFAIK the difference in conventions is only about defaults, which can be tackled separately from the meaning we assign to dims=1 and dims=2, right? It's very hard to decide what's the best default since that's mostly an opposition between stats (observations×variables) and machine learning (variables×observations) people. A possible rule is to be consistent with Statistics since that's in the stdlib, but that's just one possible choice.

yuehhua · 2019-05-28T00:39:15Z

For your reference, FluxML/Flux.jl#563

yuehhua · 2019-05-28T01:00:54Z

I think that the design of the arguments is not related to the domain (statistics/machine learning).
Z-score in MATLAB has the same behavior as in Julia.
The reason behind the design might be the memory layout, since Julia and MATLAB are both column-major languages.

nalimilan · 2019-05-28T08:54:14Z

> I think that the design of the arguments is not related to the domain (statistics/machine learning).
> Z-score in MATLAB has the same behavior as in Julia.
> The reason behind the design might be the memory layout, since Julia and MATLAB are both column-major languages.

That's fine with me, but the fact that the current method chose a different default indicates that not everybody feels that way. Anyway as a first step we should add a dims argument, then we can decide later whether to change the default. Indeed in column-major languages like Julia it's faster to standardize columns.

src/transformations.jl

yuehhua · 2019-06-06T14:20:48Z

Something like this?

wildart

dims argument handling

src/transformations.jl

src/StatsBase.jl

src/transformations.jl

nalimilan · 2019-06-12T11:49:07Z

src/transformations.jl

+    if dims == 1
+        return transform!(x, t, x)
+    elseif dims == 2
+        return transform!(x', t, x')'


Return x instead of calling ', which would return an Adjoint object wrapping it instead.

This still applies:

Suggested change

-        return transform!(x', t, x')'
+        transform!(x', t, x')
+        return x

Same below. Tests should be adapted of course.

src/transformations.jl

nalimilan · 2019-06-12T11:52:25Z

src/transformations.jl

    T = eltype(X)
-    m, s = mean_and_std(X, 2)
-
    return ZScoreTransform(d, (center ? vec(m) : zeros(T, 0)),


dims must be stored in ZScoreTransform as @wildart suggested.

Still an issue.

src/transformations.jl

nalimilan · 2019-06-17T13:16:08Z

src/transformations.jl

    T = eltype(X)
-    m, s = mean_and_std(X, 2)
-
    return ZScoreTransform(d, (center ? vec(m) : zeros(T, 0)),


Still an issue.

src/transformations.jl

new transform API

new reconstruct API

Update src/transformations.jl Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

fix

fix

Update src/transformations.jl Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Update src/transformations.jl Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

fix

Update src/transformations.jl Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Update src/transformations.jl Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Update src/transformations.jl Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Update src/transformations.jl Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Fix

fix

yuehhua · 2019-06-17T13:54:45Z

If we keep dims in the type, it could be confused with the existing dim field.
Is there a good name for it?

nalimilan · 2019-06-17T14:10:17Z

I guess dim could be renamed to len.
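To make the naming concrete, here is a sketch of a transform type that stores both pieces of information under distinct names. The type `ToyZScore`, the function `toy_fit`, and the exact field layout are assumptions for illustration only; they are not the merged StatsBase definition.

```julia
using Statistics

# Hypothetical layout, not the merged StatsBase definition.
struct ToyZScore{T<:Real}
    len::Int            # number of variables (what `dim` used to mean)
    dims::Int           # dimension along which statistics were computed
    mean::Vector{T}
    scale::Vector{T}
end

# Hypothetical fit: compute one mean and one std per variable along `dims`.
function toy_fit(X::AbstractMatrix{T}; dims::Integer=1) where {T<:Real}
    dims in (1, 2) || throw(DomainError(dims, "dims should be 1 or 2"))
    m = vec(mean(X, dims=dims))
    s = vec(std(X, dims=dims))
    return ToyZScore{float(T)}(length(m), dims, m, s)
end

X = [1.0 2.0 3.0;
     4.0 6.0 8.0]
t1 = toy_fit(X, dims=1)   # t1.len == 3: one entry per column
t2 = toy_fit(X, dims=2)   # t2.len == 2: one entry per row
```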

src/transformations.jl

test/transformations.jl

src/transformations.jl

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Update src/transformations.jl Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Update src/transformations.jl Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Update src/transformations.jl Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

Refactor

yuehhua · 2019-07-12T01:03:36Z

@nalimilan Is it good?

wildart

Round 2

src/transformations.jl

wildart · 2019-08-01T18:17:24Z

It is common for in-place update functions to take, as their first parameter, the variable where the result of the operation will be stored. So, for transform!(y, t, x), the y argument is that variable. I do not see the point of creating some "private" functions when transform! & reconstruct! already exist. You just need to add proper dimension checks related to the t.dim parameter.

yuehhua · 2019-08-02T14:46:48Z

> It is common for in-place update functions to take, as their first parameter, the variable where the result of the operation will be stored. So, for transform!(y, t, x), the y argument is that variable. I do not see the point of creating some "private" functions when transform! & reconstruct! already exist.

Yeah... I got your point. I know it is the common way to go.

> You just need to add proper dimension checks related to the t.dim parameter.

If I move the dimension checks from transform to transform! like this:

function transform!(y, t, x)
    if t.dims == 1
        (original part of code)
    elseif t.dims == 2
        (How can I implement here?)
    end
end

Implementing a similar algorithm row-wise is not feasible.
If I use a recursive call transform!(y', t, x')', a StackOverflowError occurs: since t.dims is immutable, the recursion never ends.
I cannot come up with a feasible solution for this.

nalimilan · 2019-08-02T16:22:04Z

> If I use a recursive call transform!(y', t, x')', a StackOverflowError occurs: since t.dims is immutable, the recursion never ends.
> I cannot come up with a feasible solution for this.

You should be able to avoid this by creating a new transformation object with the expected dim.
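One way to read this suggestion, as a sketch under stated assumptions: the `dims == 2` branch builds a new transform object with `dims` set to 1 and calls itself on the lazily transposed arrays, so the recursion reaches the column-wise base case after a single step. `ToyZScore` and `toy_transform!` are hypothetical names, and the code is not taken from the PR.

```julia
using Statistics

# Hypothetical type and function names; a sketch of the idea, not StatsBase code.
struct ToyZScore
    dims::Int
    mean::Vector{Float64}
    scale::Vector{Float64}
end

function toy_transform!(y::AbstractMatrix, t::ToyZScore, x::AbstractMatrix)
    if t.dims == 1
        # Base case: one mean/scale per column.
        size(x, 2) == length(t.mean) ||
            throw(DimensionMismatch("inconsistent dimensions"))
        for j in axes(x, 2), i in axes(x, 1)
            y[i, j] = (x[i, j] - t.mean[j]) / t.scale[j]
        end
    elseif t.dims == 2
        # Re-dispatch on the lazily transposed arrays with a NEW object whose
        # dims is 1, so the recursion reaches the base case after one step.
        t1 = ToyZScore(1, t.mean, t.scale)
        toy_transform!(transpose(y), t1, transpose(x))
    else
        throw(DomainError(t.dims, "dims should be 1 or 2"))
    end
    return y   # return the destination itself, not a transposed wrapper
end

X = [1.0 2.0 3.0;
     4.0 6.0 8.0]
t = ToyZScore(2, vec(mean(X, dims=2)), vec(std(X, dims=2)))
Y = toy_transform!(similar(X), t, X)   # rows standardized: [-1 0 1; -1 0 1]
```

Because the writes go through the transposed view into `y`, returning `y` at the end hands back the caller's own array, which also matches the earlier review request to return x rather than an Adjoint wrapper.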

yuehhua · 2019-08-19T03:41:38Z

@nalimilan I committed. Would you please review it? Thank you.

src/transformations.jl

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

wildart

In-place transforms

src/transformations.jl

Fix bug

test/transformations.jl

nalimilan

Thanks!

Provide standardize API for 1D array

8beb43f

nalimilan reviewed May 2, 2019

src/transformations.jl Outdated

Add test and update doc

526900e

nalimilan reviewed May 4, 2019

src/transformations.jl Outdated

Update src/transformations.jl

c325d68

Co-Authored-By: yuehhua <a504082002@gmail.com>

yuehhua and others added 2 commits May 14, 2019 21:45

Fix

957e366

Revert addition of spaces

97acf9a

nalimilan reviewed Jun 2, 2019

wildart reviewed Jun 6, 2019

src/transformations.jl Outdated

src/transformations.jl Outdated

src/transformations.jl Outdated

nalimilan reviewed Jun 12, 2019

nalimilan reviewed Jun 17, 2019

yuehhua force-pushed the master branch from a98099c to de27c44 Compare June 17, 2019 13:49

yuehhua force-pushed the master branch from cd9783c to cf356a0 Compare July 6, 2019 01:29

call reshape on vector for reuse of matrix method

b3cbf8a

nalimilan reviewed Jul 6, 2019

src/transformations.jl Outdated

src/transformations.jl Outdated

test/transformations.jl Outdated

src/transformations.jl

Remove redundant method and add tests

fa4d81b

yuehhua force-pushed the master branch from 93d362d to fa4d81b Compare July 7, 2019 02:12

Remove redundant transpose

9fe549c

nalimilan reviewed Jul 7, 2019

yuehhua force-pushed the master branch from c3e4e3c to 434c556 Compare July 8, 2019 03:21

wildart reviewed Jul 15, 2019

src/transformations.jl

Dimension check in transform! and reconstruct!

4b4ff92

yuehhua force-pushed the master branch from 5b20b40 to 4b4ff92 Compare August 3, 2019 15:24

nalimilan reviewed Aug 20, 2019

yuehhua and others added 2 commits August 25, 2019 22:00

Refactor

48b3448

Update src/transformations.jl

4d298ec

Co-Authored-By: Milan Bouchet-Valat <nalimilan@club.fr>

wildart reviewed Aug 28, 2019

src/transformations.jl Outdated

Remove redundant warning

7491ca0

Fix bug

yuehhua force-pushed the master branch from ed3b25d to 7491ca0 Compare September 9, 2019 14:46

yuehhua commented Sep 9, 2019

test/transformations.jl

Break lines at 92 chars

e9860b1

nalimilan approved these changes Sep 19, 2019

nalimilan merged commit 2fd192a into JuliaStats:master Sep 19, 2019

This was referenced Mar 1, 2020

0.32.1 introduces a breaking change #556

Closed

fix breaking zscore changes (backport to 0.32 branch) #565

Merged

minor version #566

Closed
