
Conversation

abieler
Contributor

@abieler abieler commented Apr 30, 2017

Got started on the feature scaling. A few changes I suggest:

  1. rename rescale to standardize
  2. rename FeatureNormalizer to StandardScaler
  3. use transform instead of predict for rescaling data
  4. Add fixedrange as a scaling function that scales the data to a fixed range
    (lower:upper), which defaults to (0:1).
  5. Add FixedRangeScaler

This is all still not fully functional; I'll keep working on it over the next few days.

end

function StatsBase.fit{T<:Real}(::Type{FixedRangeScaler}, X::AbstractArray{T}; obsdim=LearnBase.default_obsdim(X))
    FixedRangeScaler(X, obsdim=obsdim)
Member

If you use FixedRangeScaler(X, convert(ObsDimension, obsdim)) here, the function will be type stable (without any downside I can think of). In general, a good rule of thumb is to never call a keyword-based method internally if it can be avoided; keyword methods are just a convenience for the end user.
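
To illustrate the pattern (a generic, hedged sketch with hypothetical names in the 0.6-era syntax used here, not this PR's code): keep the keyword method as a thin user-facing convenience that immediately forwards to a positional method dispatching on a concrete type, so everything downstream stays inferrable.

immutable AlongDim{N} end   # stand-in for something like ObsDim.Constant{N}

# positional methods: dispatch on the concrete AlongDim type, fully inferrable
centered(X::AbstractMatrix, ::AlongDim{1}) = X .- mean(X, 1)
centered(X::AbstractMatrix, ::AlongDim{2}) = X .- mean(X, 2)

# keyword method: convenience for the end user only; internal code should
# call the positional methods above directly
centered(X::AbstractMatrix; obsdim::Int=1) = centered(X, AlongDim{obsdim}())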

end

function transform!{T<:Real}(cs::StandardScaler, X::AbstractMatrix{T})
    @assert length(cs.offset) == size(X, 1)
Member

Should size(X, 1) depend on the obsdim?

@Evizero
Member

Evizero commented Apr 30, 2017

Very cool! Thanks for working on this. My main feedback at this point is to try and make sure that the types are concrete and inferrable, i.e. all member variables of an immutable type should be concrete:

immutable Foo
  x::Vector # not concrete
  obsdim::ObsDim.Constant # not concrete
end
immutable Foo2{T,N}
  x::Vector{T} # better
  obsdim::ObsDim.Constant{N} # better
end
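
(A quick way to check this, as a side note: in the 0.6-era Julia used here, isleaftype reports whether a type is concrete; on current Julia the same check is isconcretetype.)

isleaftype(Vector)           # false: element type left unspecified
isleaftype(Vector{Float64})  # true: concrete, the compiler can specialize on it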

end

function transform{T<:Real}(cs::StandardScaler, X::AbstractMatrix{T})
    X = convert(AbstractMatrix{AbstractFloat}, X)
Member

Avoid using abstract types as storage, especially when working with arrays.

The resulting array will be painful to work with and really slow:

julia> convert(AbstractMatrix{AbstractFloat}, rand(2,2))
2×2 Array{AbstractFloat,2}:
 0.468758  0.413961
 0.559098  0.174936

This video explains the difference between abstract and concrete element types pretty well (I tagged the beginning of the relevant section). The audio is a bit out of sync with the video, though.

https://youtu.be/ag_NtJRmYg8?t=12m36s
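
(For contrast, a hedged sketch of concrete alternatives, not taken from this PR:)

X  = rand(1:10, 2, 2)
X1 = convert(Matrix{Float64}, X)   # Array{Float64,2}: concrete element type, fast
X2 = float(X)                      # generic: picks a concrete float element type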

Contributor Author

@abieler abieler May 1, 2017

Thanks for all the input. I'll keep on learning :)
The line in question above is actually copied from your featurescaling code though ;) But now I know why not to do it :)

Member

Ah, good catch. Yes, this should be changed as well. Looks like it was introduced in JuliaML/MLDataUtils.jl#22 not too long ago.

@abieler
Contributor Author

abieler commented May 8, 2017

Question on how to handle DataFrames: for the lower-level functions such as center! and standardize!, the user can specify the column names on which the transformation is applied. The question is now about the implementation of StandardScaler and UnitRangeScaler, such as:

immutable StandardScaler{T,U,M}
    offset::Vector{T}
    scale::Vector{U}
    obsdim::ObsDim.Constant{M}
end

which for Arrays assumes that every column is scaled. Some options:

  1. Add an additional Vector containing the column names, which is ignored for Arrays?
  2. Compute the parameters such that the transformation has no effect on the columns that
    are not picked, e.g. offset = 0 and scale = 1 (which leads to unnecessary computations...).
  3. Other ideas?

@Evizero
Member

Evizero commented May 8, 2017

Maybe opt_in, which for arrays could be a Vector{Bool} indicating whether a feature is to be scaled; for a DataFrame it could be a Vector{Symbol}.

so

immutable StandardScaler{T,U,M,I}
    offset::Vector{T}
    scale::Vector{U}
    obsdim::ObsDim.Constant{M}
    opt_in::Vector{I}
end

This could allow for selective scaling even for arrays.

@abieler
Contributor Author

abieler commented May 9, 2017

Yes, I was thinking along those lines too, but with indices instead of a Bool vector, which would be more convenient for larger data sets with only a few columns to scale. Agree?
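
(For illustration, a small sketch of the two selection styles being discussed, with hypothetical data and the 0.6-era find; on current Julia the index form would be findall(mask):)

X    = rand(5, 3)
mask = [true, false, true]   # Vector{Bool}: scale features 1 and 3
idx  = find(mask)            # equivalent index form: [1, 3]
X[:, idx]                    # the columns that would be scaled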

@Evizero
Member

Evizero commented May 9, 2017

Sure, I trust your instincts on that one

@abieler
Contributor Author

abieler commented May 31, 2017

Finally had some time to get back to this. I think functionality-wise it is about where I want it to be. What do you think?
Basic usage would be:

scaler = fit(StandardScaler, X[, mu, sigma; obsdim, operate_on])
scaler = fit(FixedRangeScaler, X[, lower, upper; obsdim, operate_on])

transform(X, scaler)
transform!(X, scaler)

This works for Matrix and DataFrames.
The Vector operate_on picks (by index or column name) the "columns" on which the scaling occurs; it defaults to all columns for a Matrix and to all numeric-type columns for a DataFrame.

StandardScaler, fit and transform are in line with scikit-learn naming. FixedRangeScaler is called MinMaxScaler in scikit-learn, which I don't really like as a name.

Further, they have a fit_transform function that does both steps in one call, which is neat and easy to implement.

The name operate_on is not very satisfying to me, but acceptable for now or until someone comes up with a better name :)
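
(A minimal usage sketch based on the signatures above; the package name and exact keyword arguments follow this PR's description and are not otherwise verified:)

using MLPreprocessing

X = rand(100, 4)

scaler = fit(StandardScaler, X, obsdim=1, operate_on=[1, 2, 3])
Xs = transform(X, scaler)        # scaled copy
transform!(X, scaler)            # in-place

rscaler = fit(FixedRangeScaler, X, -1, 1, obsdim=1)
Xr = transform(X, rscaler)       # selected columns rescaled to [-1, 1]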

@Evizero
Member

Evizero commented May 31, 2017

Hi! Very cool. I will review this as soon as I find some spare time.

@Evizero Evizero self-assigned this May 31, 2017
@Evizero
Member

Evizero commented Jun 1, 2017

Just a quick comment: there is StatsModels.jl, which deals with things like contrast coding (e.g. one-hot for columns). I suspect we might be better off leaving that task to the statisticians and instead focusing on the ML-specific parts when it comes to data frames.

Member

@Evizero Evizero left a comment

Beautiful work. This will be a great step forward.

Just a few discussion points here and there.

src/center.jl Outdated

function center!(D::AbstractDataFrame)
    μ_vec = Float64[]
function center!{T,M}(x::AbstractVector{T}, ::ObsDim.Constant{M}, operate_on::AbstractVector)
Member

No need to define T or M in this case; the signature works without them (x::AbstractVector, ::ObsDim.Constant).

    FixedRangeScaler(lower, upper, xmin, xmax, ObsDim.Constant{1}(), colnames)
end

function StatsBase.fit{T<:Real,N}(X::AbstractArray{T,N}, ::Type{FixedRangeScaler}; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim)))
Member

The convention for StatsBase.fit is StatsBase.fit(::Type{MyType}, data, ...)

Contributor Author

Yes. I changed it because of the use of transform!(), where I think it is a Julia convention to have the variable being modified as the first argument? So transform!(X, scaler). In order to have fit, transform and fit_transform consistent I switched the other two to this argument order. No strong feelings about flipping all three back if that is considered the overall more concise solution.

Member

Let's leave transform as it is, but make fit consistent with StatsBase.

@@ -0,0 +1,69 @@
function default_scaleselection(X::AbstractMatrix, ::ObsDim.Constant{1})
    collect(1:size(X, 2))
Member

Why the collect?

Contributor Author

@abieler abieler Jun 2, 2017

operate_on in FixedRangeScaler and StandardScaler is declared as a Vector, so the 1:size(X, 2) range has to be collected into one.

end

function standardize!(X::AbstractMatrix, μ::AbstractVector, σ::AbstractVector, ::ObsDim.Constant{2}, operate_on)
    σ[σ .== 0] = 1
Member

What is this line for?

Contributor Author

@abieler abieler Jun 2, 2017

It is meant to catch the case of constant variables with std(var) = 0 and prevent the division by 0.
Could also do an if sigma != 0 check and only rescale in that case.
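
(A hedged sketch of that alternative guard, with hypothetical names rather than the PR's actual code; here observations run along the second dimension, and μ[i], σ[i] correspond to the i-th entry of operate_on:)

function standardize_guarded!(X::AbstractMatrix, μ::AbstractVector,
                              σ::AbstractVector, operate_on::AbstractVector{Int})
    for (i, iVar) in enumerate(operate_on)
        σ[i] == 0 && continue          # constant variable: leave it untouched
        for j in 1:size(X, 2)          # loop over observations (dim 2)
            X[iVar, j] = (X[iVar, j] - μ[i]) / σ[i]
        end
    end
    X
end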

tests = [
"tst_expand.jl"
"tst_center.jl"
"tst_rescale.jl"
Member

Since this file isn't executed anymore, do you mean to delete it?

Contributor Author

Will do.

@Evizero Evizero removed their assignment Jun 1, 2017
@abieler
Contributor Author

abieler commented Jun 2, 2017

Regarding one-hot encoding: do you suggest leaving it out of MLPreprocessing completely, or implementing their methods to be used here?

@Evizero
Member

Evizero commented Jun 2, 2017

Regarding one-hot encoding: do you suggest leaving it out of MLPreprocessing completely

I suggest leaving it out completely. In general it's fair to assume that users convert their DataFrame to an Array themselves. I do like that we provide feature scaling for DataFrames, but in general we should focus on arrays.

@abieler
Contributor Author

abieler commented Jun 4, 2017

Getting back to the one-hot encoding topic: StatsModels currently only works with DataTables, not DataFrames, so it is not compatible with MLPreprocessing at this point. Further, it achieves something slightly different by going from a DataTable to a Matrix. I personally prefer doing the encoding (as much preprocessing as possible, really) at the DataFrame level and then doing a straightforward convert(Matrix, DataFrame). So the onehot!() implementation here is different from what is in StatsModels. I feel like this functionality should be present somewhere in the JuliaML universe, or maybe in DataFrames or DataTables directly. What do you think?

@Evizero
Member

Evizero commented Jun 4, 2017

currently only works with DataTables

I know, but still, I want to avoid duplicated effort on the table front, especially because I am not a statistician and am only somewhat aware of all the nuances that come with that. For example, one-hot encoding is rarely used for categorical variables, since it introduces a redundant feature that is better off omitted; the "bias" is enough to capture that information (so the more widely used form of encoding is "dummy encoding").

I personally prefer doing the encoding (as much preprocessing as possible really) on the DataFrame level and then do a straightforward convert(Matrix, DataFrame)

I think that is rather atypical. Coming from R, what one usually does there is:

  1. somehow end up with a dataframe with the variables one needs
  2. specify a "formula" for how to translate that dataframe into a feature matrix and targets
  3. do the ML-specific preprocessing on the feature matrix

@Evizero
Member

Evizero commented Jun 4, 2017

If you feel strongly about it we can obviously still consider including such functionality, but in that case it would be easier to review/discuss in its own dedicated PR.

@abieler
Contributor Author

abieler commented Jun 14, 2017

Alright. Removed the encoding part and switched back the argument order for fit and fit_transform.

@Evizero Evizero self-assigned this Jun 14, 2017
@coveralls

Coverage Status

Coverage decreased (-30.5%) to 65.588% when pulling c9a396a on fixedrange into 25f6d60 on master.

@Evizero
Member

Evizero commented Jun 15, 2017

Looking great! One last thing I noticed: the obsolete file featurenormalizer.jl still needs to be deleted, I think. Other than that this looks awesome and ready to merge.

One discussion point we should consider in the future is whether we should switch from the corrected std to the uncorrected one (i.e. std(x, corrected=false)).

@abieler
Contributor Author

abieler commented Jun 16, 2017

Is the corrected/uncorrected std() related to 1/N vs. 1/(N-1)? If so, I don't have a strong opinion on it. Can you explain your thoughts on this?
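
(For reference, a small illustration of the two estimators; on Julia 0.7+ this additionally needs using Statistics:)

x = [1.0, 2.0, 3.0, 4.0]
std(x)                   # corrected:   sqrt(sum(abs2, x .- mean(x)) / (length(x) - 1))
std(x, corrected=false)  # uncorrected: sqrt(sum(abs2, x .- mean(x)) / length(x))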

@abieler
Contributor Author

abieler commented Jun 17, 2017

Removed feature_normalizer. I'll start on the README next.

@Evizero
Member

Evizero commented Jun 17, 2017

Is the corrected/uncorrected std() related to 1/N vs. 1/(N-1)? If so, I don't have a strong opinion on it. Can you explain your thoughts on this?

Yes. My thoughts are mostly motivated by consistency with what other frameworks seem to do

Examples:


Xtrain = rand(100, 4)
Member

Maybe use a julia code block explicitly so that the README gets syntax highlighting.

@abieler
Contributor Author

abieler commented Jun 23, 2017

Alright. This works now in v0.5 and v0.6. Ready to merge from my side.

@coveralls

coveralls commented Jun 23, 2017

Coverage Status

Coverage decreased (-28.3%) to 67.781% when pulling 4789187 on fixedrange into 25f6d60 on master.

@coveralls

coveralls commented Jun 23, 2017

Coverage Status

Coverage decreased (-28.3%) to 67.781% when pulling 686a308 on fixedrange into 25f6d60 on master.

@Evizero Evizero merged commit 487e601 into master Jun 24, 2017
@Evizero
Member

Evizero commented Jun 24, 2017

Awesome! This is a really big improvement!

@Evizero Evizero deleted the fixedrange branch June 26, 2017 10:21