WIP feature scalings #1
src/fixedrange.jl (outdated)

```julia
function StatsBase.fit{T<:Real}(::Type{FixedRangeScaler}, X::AbstractArray{T}; obsdim=LearnBase.default_obsdim(X))
    FixedRangeScaler(X, obsdim=obsdim)
end
```
If you use `FixedRangeScaler(X, convert(ObsDimension, obsdim))` here, the function will be type stable (without any downside I can think of). In general, a good rule of thumb is to never call a keyword-based method internally if it can be avoided; keyword methods are just a convenience for the end user.
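A minimal sketch of that pattern (hypothetical names, and modern Julia `Val` standing in for LearnBase's `ObsDim.Constant`): the keyword method is a thin wrapper for the end user, while all internal calls go through the positional method, which is type stable because the dimension lives in the argument's type.

```julia
# Hypothetical example, not the PR's actual code.
struct Scaler{N}
    obsdim::Val{N}
end

# Positional core method: type stable, meant for internal use.
build_scaler(X::AbstractMatrix, obsdim::Val) = Scaler(obsdim)

# Keyword convenience wrapper: immediately forwards to the positional method.
build_scaler(X::AbstractMatrix; obsdim::Int=1) = build_scaler(X, Val(obsdim))
```

Calling `build_scaler(X)` dispatches through `Val(1)`, so downstream code can specialize on `Scaler{1}`.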
src/standardize.jl (outdated)

```julia
function transform!{T<:Real}(cs::StandardScaler, X::AbstractMatrix{T})
    @assert length(cs.offset) == size(X, 1)
```
Should `size(X, 1)` depend on the obsdim?
Very cool! Thanks for working on this. My main feedback at this point is to try and make sure that the types are concrete and inferrable, i.e. all member variables of an immutable type should be concrete:

```julia
immutable Foo
    x::Vector                  # not concrete
    obsdim::ObsDim.Constant{}  # not concrete
end

immutable Foo2{T,N}
    x::Vector{T}               # better
    obsdim::ObsDim.Constant{N} # better
end
```
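To see why the parameterized version is preferable, one can inspect the field types directly. A small illustration with hypothetical names (written in modern `struct` syntax rather than the 0.6-era `immutable`):

```julia
struct Box           # field type is the abstract `Vector` (eltype unspecified)
    x::Vector
end

struct TypedBox{T}   # field type is fully concrete once T is known
    x::Vector{T}
end

# Box's field type is a UnionAll, so the compiler cannot infer the element
# type of `b.x[1]`; for TypedBox{Float64} the field type is concrete and
# everything downstream infers cleanly.
```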
src/standardize.jl (outdated)

```julia
function transform{T<:Real}(cs::StandardScaler, X::AbstractMatrix{T})
    X = convert(AbstractMatrix{AbstractFloat}, X)
```
Avoid using abstract types as storage, especially when working with arrays. The resulting array will be painful to work with and really slow:

```julia
julia> convert(AbstractMatrix{AbstractFloat}, rand(2,2))
2×2 Array{AbstractFloat,2}:
 0.468758  0.413961
 0.559098  0.174936
```

This video explains pretty well the difference between abstract and concrete element types (I tagged the beginning of the relevant section). The audio is a bit out of sync with the video though.
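The difference is visible in the element type alone; a quick check (my illustration, not from the PR):

```julia
A = rand(2, 2)                                 # Matrix{Float64}: concrete eltype,
                                               # elements stored inline
B = convert(AbstractMatrix{AbstractFloat}, A)  # Matrix{AbstractFloat}: abstract
                                               # eltype, every element is boxed
```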
Thanks for all the input, I'll keep on learning :) But the line in question above is actually copied from your feature scaling code ;) Now I know why not to do it :)
Ah, good catch. Yes, this should be changed as well. Looks like it was introduced in JuliaML/MLDataUtils.jl#22 not too long ago.
Question on how to handle DataFrames: for the lower-level functions such as `center!` and `standardize!` the user can specify column names on which the transformation is applied. The question is now about the implementation of `StandardScaler` and `UnitRangeScaler`, which for Arrays assume that every column is scaled.
Maybe like so:

```julia
immutable StandardScaler{T,U,M,I}
    offset::Vector{T}
    scale::Vector{U}
    obsdim::ObsDim.Constant{M}
    opt_in::Vector{I}
end
```

This could allow for selective scaling even for arrays.
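A minimal sketch of what such selective scaling could look like for matrices (a hypothetical helper, not the PR's implementation; here `operate_on` holds the column indices to scale):

```julia
# Standardize only the columns listed in `operate_on`, in place.
# offset[i] and scale[i] correspond to column operate_on[i].
function scale_columns!(X::AbstractMatrix, offset::AbstractVector,
                        scale::AbstractVector, operate_on::Vector{Int})
    for (i, j) in enumerate(operate_on)
        @views X[:, j] .= (X[:, j] .- offset[i]) ./ scale[i]
    end
    return X
end
```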
Yes, I was thinking somewhere along those lines too, but with indices instead of a bool, which would be more convenient for larger data sets with only a few columns to scale. Agree?
Sure, I trust your instincts on that one.
Finally had some time to get back to this. I think functionality-wise it is about where I want it to be. What do you think?

```julia
scaler = fit(StandardScaler, X[, mu, sigma; obsdim, operate_on])
scaler = fit(FixedRangeScaler, X[, lower, upper; obsdim, operate_on])
transform(X, scaler)
transform!(X, scaler)
```

And it works for Matrix and DataFrames. Further they have a The name
Hi! Very cool. I will review this as soon as I find some spare time.
Just a quick comment. There is StatsModels.jl, which deals with things like contrast coding (e.g. one-hot for columns). I suspect we might be better off leaving that task to the statisticians and instead focusing on the ML-specific parts when it comes to data frames.
Beautiful work. This will be a great step forward.
Just a few discussion points here and there.
src/center.jl (outdated)

```julia
function center!(D::AbstractDataFrame)
    μ_vec = Float64[]

function center!{T,M}(x::AbstractVector{T}, ::ObsDim.Constant{M}, operate_on::AbstractVector)
```
No need to define `T` or `M` in this case.
src/fixedrange.jl (outdated)

```julia
    FixedRangeScaler(lower, upper, xmin, xmax, ObsDim.Constant{1}(), colnames)
end

function StatsBase.fit{T<:Real,N}(X::AbstractArray{T,N}, ::Type{FixedRangeScaler}; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim)))
```
The convention for `StatsBase.fit` is `StatsBase.fit(::Type{MyType}, data, ...)`.
Yes. I changed it because of the use of `transform!()`, where I think it is a Julia convention to have the variable to be modified in the first place, so `transform!(X, scaler)`. In order to have `fit`, `transform`, and `fit_transform` consistent, I switched the other two to this argument order. No strong feelings about flipping all three back if that is considered the overall more concise solution.
Let's leave `transform` but make `fit` consistent with StatsBase.
```julia
function default_scaleselection(X::AbstractMatrix, ::ObsDim.Constant{1})
    collect(1:size(X, 2))
```
Why the `collect`?
`operate_on` in `FixedRangeScaler` and `StandardScaler` is of type `Vector`.
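For context, `collect` is what turns the lazy range into an allocated `Vector{Int}` matching that field type (a small illustration, not from the PR):

```julia
r = 1:4           # UnitRange{Int}: lazy, allocates nothing
v = collect(1:4)  # Vector{Int}: materialized, matches a Vector-typed field
```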
```julia
function standardize!(X::AbstractMatrix, μ::AbstractVector, σ::AbstractVector, ::ObsDim.Constant{2}, operate_on)
    σ[σ .== 0] = 1
```
what is this line for?
It is meant to catch the case of constant variables with std(var) = 0 and prevent the division by zero. Could also do an `if sigma != 0` before rescaling.
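A self-contained sketch of what that guard buys (a hypothetical helper, not the PR's `standardize!`): replacing zero standard deviations with 1 leaves constant columns centered at zero instead of producing NaN.

```julia
using Statistics

# Column-wise standardization with the zero-variance guard.
function safe_standardize(X::AbstractMatrix)
    μ = mean(X, dims=1)
    σ = std(X, dims=1)
    σ = map(s -> s == 0 ? one(s) : s, σ)  # guard: avoid division by zero
    return (X .- μ) ./ σ
end
```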
```julia
tests = [
    "tst_expand.jl"
    "tst_center.jl"
    "tst_rescale.jl"
```
Since this file isn't executed anymore, do you mean to delete it?
Will do.

Regarding one-hot encoding: do you suggest leaving it out of MLPreprocessing completely, or implementing their methods to be used here?
I suggest leaving it out completely. In general it's fair to assume that users convert their DataFrame to an Array themselves. I do like that we provide feature scaling for DataFrames, but in general we should focus on arrays.
Getting back to the one-hot encoding topic:
I know, but still. I want to avoid duplicated effort on the table front, especially because I am not a statistician and I am only somewhat aware of all the nuances that come with that. For example, one-hot encoding is rarely used for categorical variables, since it introduces a redundant feature that is better off being omitted, as the "bias" is enough to catch that information (so instead the more widely used form of encoding is "dummy encoding").

I think that is rather untypical. Coming from
If you feel strongly about it we can obviously still consider including such functionality. But in that case it would be easier to review/discuss in its own dedicated PR.
Alright. Removed the encoding part and switched back the argument order for
Looking great! One last thing I noticed: the obsolete file

One discussion point we should consider in the future is if we should switch from the corrected
Is the corrected/uncorrected std() related to 1/N vs. 1/(N-1)? If so, I don't have a strong opinion on it. Can you explain your thoughts on this?
Removed
Yes. My thoughts are mostly motivated by consistency with what other frameworks seem to do.
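For reference, `Statistics.std` exposes exactly this choice via its `corrected` keyword: corrected divides the summed squared deviations by N-1, uncorrected by N.

```julia
using Statistics

x = [1.0, 2.0, 3.0, 4.0]
s_corrected   = std(x)                   # sqrt(sum((x .- mean(x)).^2) / (N - 1))
s_uncorrected = std(x, corrected=false)  # sqrt(sum((x .- mean(x)).^2) / N)
```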
Examples:

```julia
Xtrain = rand(100, 4)
```
Maybe use a julia code block explicitly so that the README has syntax highlighting.
Alright. This works now in v0.5 and v0.6. Ready to merge from my side.
Awesome! This is a really big improvement!
Got started on the feature scaling. A few changes I suggest:

- Rename `rescale` to `standardize`
- Rename `FeatureNormalizer` to `StandardScaler`
- Use `transform` instead of `predict` for rescaling data
- Add `fixedrange` as a scaling function which scales the data to a fixed range (lower:upper), which defaults to (0:1), with a corresponding `FixedRangeScaler`

This all is still not fully functional and I'll keep working on it in the next few days.
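A minimal sketch of what such a fixed-range scaling computes (a hypothetical function, not the PR's implementation; it assumes a non-constant input so the denominator is nonzero):

```julia
# Linearly rescale x from [minimum(x), maximum(x)] to [lower, upper];
# the target range defaults to (0, 1).
function fixedrange_sketch(x::AbstractVector, lower=0.0, upper=1.0)
    xmin, xmax = extrema(x)
    return lower .+ (x .- xmin) .* ((upper - lower) / (xmax - xmin))
end
```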