
Conversation

abieler
Contributor

@abieler abieler commented Apr 30, 2017

Got started on the feature scaling. A few changes I suggest:

  1. rename rescale to standardize
  2. rename FeatureNormalizer to StandardScaler
  3. use transform instead of predict for rescaling data
  4. Add fixedrange as a scaling function that scales the data to a fixed range
    (lower:upper), which defaults to (0:1).
  5. Add FixedRangeScaler

This is all still not fully functional; I'll keep working on it over the next few days.

end

function StatsBase.fit{T<:Real}(::Type{FixedRangeScaler}, X::AbstractArray{T}; obsdim=LearnBase.default_obsdim(X))
    FixedRangeScaler(X, obsdim=obsdim)
Member

If you use FixedRangeScaler(X, convert(ObsDimension, obsdim)) here, the function will be type stable (without any downside I can think of). In general, a good rule of thumb is to never call a keyword-based method internally if it can be avoided; keyword methods are just a convenience for the end user.
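
To illustrate the pattern (a generic, hedged sketch with hypothetical names in the 0.6-era syntax used here, not this PR's code): keep the keyword method as a thin user-facing convenience that immediately forwards to a positional method dispatching on a concrete type, so everything downstream stays inferrable.

immutable AlongDim{N} end   # stand-in for something like ObsDim.Constant{N}

# positional methods: dispatch on the concrete AlongDim type, fully inferrable
centered(X::AbstractMatrix, ::AlongDim{1}) = X .- mean(X, 1)
centered(X::AbstractMatrix, ::AlongDim{2}) = X .- mean(X, 2)

# keyword method: convenience for the end user only; internal code should
# call the positional methods above directly
centered(X::AbstractMatrix; obsdim::Int=1) = centered(X, AlongDim{obsdim}())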

end

function transform!{T<:Real}(cs::StandardScaler, X::AbstractMatrix{T})
    @assert length(cs.offset) == size(X, 1)
Member

Should size(X, 1) depend on the obsdim?

@Evizero
Member

Evizero commented Apr 30, 2017

Very cool! Thanks for working on this. My main feedback at this point is to try and make sure that the types are concrete and inferrable, i.e. all member variables of an immutable type should be concrete:

immutable Foo
  x::Vector # not concrete
  obsdim::ObsDim.Constant # not concrete
end
immutable Foo2{T,N}
  x::Vector{T} # better
  obsdim::ObsDim.Constant{N} # better
end
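
(A quick way to check this, as a side note: in the 0.6-era Julia used here, isleaftype reports whether a type is concrete; on current Julia the same check is isconcretetype.)

isleaftype(Vector)           # false: element type left unspecified
isleaftype(Vector{Float64})  # true: concrete, the compiler can specialize on it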

end

function transform{T<:Real}(cs::StandardScaler, X::AbstractMatrix{T})
    X = convert(AbstractMatrix{AbstractFloat}, X)
Member

Avoid using abstract types as storage, especially when working with arrays.

The resulting array will be painful to work with and really slow:

julia> convert(AbstractMatrix{AbstractFloat}, rand(2,2))
2×2 Array{AbstractFloat,2}:
 0.468758  0.413961
 0.559098  0.174936

This video explains the difference between abstract and concrete element types pretty well (I tagged the beginning of the relevant section). The audio is a bit out of sync with the video, though.

https://youtu.be/ag_NtJRmYg8?t=12m36s
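
(For contrast, a hedged sketch of concrete alternatives, not taken from this PR:)

X  = rand(1:10, 2, 2)
X1 = convert(Matrix{Float64}, X)   # Array{Float64,2}: concrete element type, fast
X2 = float(X)                      # generic: picks a concrete float element type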

Contributor Author

@abieler abieler May 1, 2017

Thanks for all the input. I'll keep on learning :)
The line in question above is actually copied from your featurescaling code though ;) But now I know why not to do it :)

Member

Ah, good catch. Yes, this should be changed as well. Looks like it was introduced in JuliaML/MLDataUtils.jl#22 not too long ago.

@abieler
Contributor Author

abieler commented May 8, 2017

Question on how to handle DataFrames: for the lower-level functions such as center! and standardize!, the user can specify the column names on which the transformation is applied. The question is now about the implementation of StandardScaler and UnitRangeScaler, such as:

immutable StandardScaler{T,U,M}
    offset::Vector{T}
    scale::Vector{U}
    obsdim::ObsDim.Constant{M}
end

which for Arrays assumes that every column is scaled. Some options:

  1. Add an additional Vector containing the column names, which is ignored for Arrays?
  2. Compute the parameters such that the transformation has no effect on the columns that
    are not picked, e.g. offset = 0 and scale = 1 (which leads to unnecessary computations...).
  3. Other ideas?

@Evizero
Member

Evizero commented May 8, 2017

Maybe opt_in, which for arrays could be a Vector{Bool} indicating whether a feature is to be scaled; for a DataFrame it could be a Vector{Symbol}.

so

immutable StandardScaler{T,U,M,I}
    offset::Vector{T}
    scale::Vector{U}
    obsdim::ObsDim.Constant{M}
    opt_in::Vector{I}
end

This could allow for selective scaling even for arrays.

@abieler
Contributor Author

abieler commented May 9, 2017

Yes, I was thinking along those lines too, but with indices instead of a Bool vector, which would be more convenient for larger data sets with only a few columns to scale. Agree?
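
(For illustration, a small sketch of the two selection styles being discussed, with hypothetical data and the 0.6-era find; on current Julia the index form would be findall(mask):)

X    = rand(5, 3)
mask = [true, false, true]   # Vector{Bool}: scale features 1 and 3
idx  = find(mask)            # equivalent index form: [1, 3]
X[:, idx]                    # the columns that would be scaled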

@Evizero
Member

Evizero commented May 9, 2017

Sure, I trust your instincts on that one

@abieler
Contributor Author

abieler commented May 31, 2017

Finally had some time to get back to this. I think functionality-wise it is about where I want it to be. What do you think?
Basic usage would be:

scaler = fit(StandardScaler, X[, mu, sigma; obsdim, operate_on])
scaler = fit(FixedRangeScaler, X[, lower, upper; obsdim, operate_on])

transform(X, scaler)
transform!(X, scaler)

This works for Matrix and DataFrames.
The Vector operate_on picks (by index or column name) the "columns" on which the scaling occurs; it defaults to all columns for a Matrix and to all numeric-type columns for a DataFrame.

StandardScaler, fit and transform are in line with scikit-learn naming. FixedRangeScaler is called MinMaxScaler in scikit-learn, which I don't really like as a name.

Further, they have a fit_transform function that does both steps in one call, which is neat and easy to implement.

The name operate_on is not very satisfying to me, but acceptable for now or until someone comes up with a better name :)
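
(A minimal usage sketch based on the signatures above; the package name and exact keyword arguments follow this PR's description and are not otherwise verified:)

using MLPreprocessing

X = rand(100, 4)

scaler = fit(StandardScaler, X, obsdim=1, operate_on=[1, 2, 3])
Xs = transform(X, scaler)        # scaled copy
transform!(X, scaler)            # in-place

rscaler = fit(FixedRangeScaler, X, -1, 1, obsdim=1)
Xr = transform(X, rscaler)       # selected columns rescaled to [-1, 1]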

@Evizero
Member

Evizero commented May 31, 2017

Hi! Very cool. I will review this as soon as I find some spare time.

@Evizero Evizero self-assigned this May 31, 2017
@Evizero
Member

Evizero commented Jun 1, 2017

Just a quick comment: there is StatsModels.jl, which deals with things like contrast coding (e.g. one-hot for columns). I suspect we might be better off leaving that task to the statisticians and instead focusing on the ML-specific parts when it comes to data frames.

Member

@Evizero Evizero left a comment

Beautiful work. This will be a great step forward.

Just a few discussion points here and there.

src/center.jl Outdated

function center!(D::AbstractDataFrame)
    μ_vec = Float64[]
function center!{T,M}(x::AbstractVector{T}, ::ObsDim.Constant{M}, operate_on::AbstractVector)
Member

No need to define T or M in this case; the signature works without them (x::AbstractVector, ::ObsDim.Constant).

    FixedRangeScaler(lower, upper, xmin, xmax, ObsDim.Constant{1}(), colnames)
end

function StatsBase.fit{T<:Real,N}(X::AbstractArray{T,N}, ::Type{FixedRangeScaler}; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim)))
Member

The convention for StatsBase.fit is StatsBase.fit(::Type{MyType}, data, ...)

Contributor Author

Yes. I changed it because of the use of transform!(), where I think it is a Julia convention to have the variable being modified as the first argument? So transform!(X, scaler). In order to have fit, transform and fit_transform consistent I switched the other two to this argument order. No strong feelings about flipping all three back if that is considered the overall more concise solution.

Member

Let's leave transform as it is, but make fit consistent with StatsBase.

@@ -0,0 +1,69 @@
function default_scaleselection(X::AbstractMatrix, ::ObsDim.Constant{1})
    collect(1:size(X, 2))
Member

Why the collect?

Contributor Author

@abieler abieler Jun 2, 2017

operate_on in FixedRangeScaler and StandardScaler is declared as a Vector, so the 1:size(X, 2) range has to be collected into one.

end

function standardize!(X::AbstractMatrix, μ::AbstractVector, σ::AbstractVector, ::ObsDim.Constant{2}, operate_on)
    σ[σ .== 0] = 1
Member

What is this line for?

Contributor Author

@abieler abieler Jun 2, 2017

It is meant to catch the case of constant variables with std(var) = 0 and prevent the division by 0.
Could also do an if sigma != 0 check and only rescale in that case.
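
(A hedged sketch of that alternative guard, with hypothetical names rather than the PR's actual code; here observations run along the second dimension, and μ[i], σ[i] correspond to the i-th entry of operate_on:)

function standardize_guarded!(X::AbstractMatrix, μ::AbstractVector,
                              σ::AbstractVector, operate_on::AbstractVector{Int})
    for (i, iVar) in enumerate(operate_on)
        σ[i] == 0 && continue          # constant variable: leave it untouched
        for j in 1:size(X, 2)          # loop over observations (dim 2)
            X[iVar, j] = (X[iVar, j] - μ[i]) / σ[i]
        end
    end
    X
end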

tests = [
"tst_expand.jl"
"tst_center.jl"
"tst_rescale.jl"
Member

Since this file isn't executed anymore, do you mean to delete it?

Contributor Author

Will do.

@Evizero Evizero removed their assignment Jun 1, 2017
@abieler
Contributor Author

abieler commented Jun 2, 2017

Regarding one-hot encoding: do you suggest leaving it out of MLPreprocessing completely, or implementing their methods to be used here?

@Evizero
Member

Evizero commented Jun 2, 2017

Regarding one-hot encoding: do you suggest leaving it out of MLPreprocessing completely

I suggest leaving it out completely. In general it's fair to assume that users convert their DataFrame to an Array themselves. I do like that we provide feature scaling for DataFrames, but in general we should focus on arrays.

@abieler
Contributor Author

abieler commented Jun 4, 2017

Getting back to the one-hot encoding topic: StatsModels currently only works with DataTables, not DataFrames, so it is not compatible with MLPreprocessing at this point. Further, it achieves something slightly different by going from a DataTable to a Matrix. I personally prefer doing the encoding (as much preprocessing as possible, really) at the DataFrame level and then doing a straightforward convert(Matrix, DataFrame). So the onehot!() implementation here is different from what is in StatsModels. I feel like this functionality should be present somewhere in the JuliaML universe, or maybe in DataFrames or DataTables directly. What do you think?

@Evizero
Member

Evizero commented Jun 4, 2017

currently only works with DataTables

I know, but still, I want to avoid duplicated effort on the table front, especially because I am not a statistician and am only somewhat aware of all the nuances that come with that. For example, one-hot encoding is rarely used for categorical variables, since it introduces a redundant feature that is better off omitted; the "bias" is enough to capture that information (so the more widely used form of encoding is "dummy encoding").

I personally prefer doing the encoding (as much preprocessing as possible really) on the DataFrame level and then do a straightforward convert(Matrix, DataFrame)

I think that is rather atypical. Coming from R, what one usually does there is:

  1. somehow end up with a dataframe with the variables one needs
  2. specify a "formula" for how to translate that dataframe into a feature matrix and targets
  3. do the ML-specific preprocessing on the feature matrix

@Evizero
Member

Evizero commented Jun 4, 2017

If you feel strongly about it we can obviously still consider including such functionality, but in that case it would be easier to review/discuss in its own dedicated PR.

@abieler
Contributor Author

abieler commented Jun 14, 2017

Alright. Removed the encoding part and switched back the argument order for fit and fit_transform.

@Evizero Evizero self-assigned this Jun 14, 2017
@coveralls

Coverage Status

Coverage decreased (-30.5%) to 65.588% when pulling c9a396a on fixedrange into 25f6d60 on master.

@Evizero
Member

Evizero commented Jun 15, 2017

Looking great! One last thing I noticed: the obsolete file featurenormalizer.jl still needs to be deleted, I think. Other than that this looks awesome and ready to merge.

One discussion point we should consider in the future is whether we should switch from the corrected std to the uncorrected one (i.e. std(x, corrected=false)).

@abieler
Contributor Author

abieler commented Jun 16, 2017

Is the corrected/uncorrected std() related to 1/N vs. 1/(N-1)? If so, I don't have a strong opinion on it. Can you explain your thoughts on this?
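
(For reference, a small illustration of the two estimators; on Julia 0.7+ this additionally needs using Statistics:)

x = [1.0, 2.0, 3.0, 4.0]
std(x)                   # corrected:   sqrt(sum(abs2, x .- mean(x)) / (length(x) - 1))
std(x, corrected=false)  # uncorrected: sqrt(sum(abs2, x .- mean(x)) / length(x))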

@abieler
Contributor Author

abieler commented Jun 17, 2017

Removed feature_normalizer. I'll start on the README next.

@Evizero
Member

Evizero commented Jun 17, 2017

Is the corrected/uncorrected std() related to 1/N vs. 1/(N-1)? If so, I don't have a strong opinion on it. Can you explain your thoughts on this?

Yes. My thoughts are mostly motivated by consistency with what other frameworks seem to do

Examples:


Xtrain = rand(100, 4)
Member

Maybe use a julia code block explicitly so that the README gets syntax highlighting.

@abieler
Contributor Author

abieler commented Jun 23, 2017

Alright. This works now in v0.5 and v0.6. Ready to merge from my side.

@coveralls

coveralls commented Jun 23, 2017

Coverage Status

Coverage decreased (-28.3%) to 67.781% when pulling 4789187 on fixedrange into 25f6d60 on master.

@coveralls

coveralls commented Jun 23, 2017

Coverage Status

Coverage decreased (-28.3%) to 67.781% when pulling 686a308 on fixedrange into 25f6d60 on master.

@Evizero Evizero merged commit 487e601 into master Jun 24, 2017
@Evizero
Member

Evizero commented Jun 24, 2017

Awesome! This is a really big improvement!

@Evizero Evizero deleted the fixedrange branch June 26, 2017 10:21