diff --git a/README.md b/README.md index a6d25ff..f9978eb 100644 --- a/README.md +++ b/README.md @@ -6,4 +6,180 @@ ## Overview -UNDER CONSTRUCTION +Utility package that provides end-user-friendly methods for feature scaling and polynomial +basis expansion. + +### StandardScaler +Standardization of a data set results in variables with a mean of 0 and a variance of 1. +A common use case would be to fit a `StandardScaler` to the training data and later +apply the same transformation to the test data. `StandardScaler` is used with the +functions `fit()`, `transform()` and `fit_transform()` as shown below. + +```julia + + fit(StandardScaler, X[, μ, σ; obsdim, operate_on]) + + fit_transform(StandardScaler, X[, μ, σ; obsdim, operate_on]) +``` + +`X` : Data of type Matrix or `DataFrame`. + +`μ` : Vector or scalar describing the translation. + Defaults to mean(X, obsdim) + +`σ` : Vector or scalar describing the scale. + Defaults to std(X, obsdim) + +`obsdim` : Specify which axis corresponds to observations. + Defaults to obsdim=2 (observations are columns of matrix) + For DataFrames `obsdim` is obsolete and rescaling occurs + column wise. + +`operate_on`: Specify the indices of columns or rows to be standardized. + Defaults to all columns/rows. + For DataFrames this must be a vector of symbols, not indices. + E.g. `operate_on`=[1,3] will perform standardization on columns + with index 1 and 3 only (if obsdim=1, else rows 1 and 3) + +Note on DataFrames: +Columns containing `NA` values are skipped. +Columns containing non-numeric elements are skipped. 
+ +Examples: + +```julia + Xtrain = rand(100, 4) + Xtest = rand(10, 4) + x = rand(4) + Dtrain = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10]) + Dtest = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10]) + + scaler = fit(StandardScaler, Xtrain) + scaler = fit(StandardScaler, Xtrain, obsdim=1) + scaler = fit(StandardScaler, Xtrain, obsdim=1, operate_on=[1,3]) + transform(Xtest, scaler) + transform!(Xtest, scaler) + transform(x, scaler) + transform!(x, scaler) + + scaler = fit(StandardScaler, Dtrain) + scaler = fit(StandardScaler, Dtrain, operate_on=[:A,:B]) + transform(Dtest, scaler) + transform!(Dtest, scaler) + + Xscaled, scaler = fit_transform(StandardScaler, X, obsdim=1, operate_on=[1,2,4]) + scaler = fit_transform!(StandardScaler, X, obsdim=1, operate_on=[1,2,4]) +``` + +Note that for `transform!` the data matrix `X` has to be of type <: AbstractFloat +as the scaling occurs inplace. (E.g. cannot be of type Matrix{Int64}). This is not +the case for `transform` however. +For `DataFrames` `transform!` can be used on columns of type <: Integer. + + +### FixedRangeScaler +`FixedRangeScaler` is used with the functions `fit()`, `transform()` and `fit_transform()` +to scale data in a Matrix `X` or DataFrame to a fixed range [lower:upper]. +After fitting a `FixedRangeScaler` to one data set, it can be used to perform the same +transformation to a new set of data. E.g. fit the `FixedRangeScaler` to your training +data and then apply the scaling to the test data at a later stage. (See examples below). + +```julia + fit(FixedRangeScaler, X[, lower, upper; obsdim, operate_on]) + + fit_transform(FixedRangeScaler, X[, lower, upper; obsdim, operate_on]) +``` + +`X` : Data of type Matrix or `DataFrame`. + +`lower` : (Scalar) Lower limit of new range. + Defaults to 0. + +`upper` : (Scalar) Upper limit of new range. + Defaults to 1. + +`obsdim` : Specify which axis corresponds to observations. 
+ Defaults to obsdim=2 (observations are columns of matrix) + For DataFrames `obsdim` is obsolete and rescaling occurs + column wise. + +`operate_on`: Specify the indices of columns or rows to be rescaled. + Defaults to all columns/rows. + For DataFrames this must be a vector of symbols, not indices. + E.g. `operate_on`=[1,3] will perform rescaling on columns + with index 1 and 3 only (if obsdim=1, else rows 1 and 3) + +Note on DataFrames: +Columns containing `NA` values are skipped. +Columns containing non-numeric elements are skipped. + +Examples: + +```julia + Xtrain = rand(100, 4) + Xtest = rand(10, 4) + x = rand(10) + D = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10]) + + scaler = fit(FixedRangeScaler, Xtrain) + scaler = fit(FixedRangeScaler, Xtrain, -1, 1) + scaler = fit(FixedRangeScaler, Xtrain, -1, 1, obsdim=1) + scaler = fit(FixedRangeScaler, Xtrain, -1, 1, obsdim=1, operate_on=[1,3]) + scaler = fit(FixedRangeScaler, D, -1, 1, operate_on=[:A,:B]) + + Xscaled = transform(Xtest, scaler) + transform!(Xtest, scaler) + + Xscaled, scaler = fit_transform(FixedRangeScaler, Xtrain, -1, 1, obsdim=1, operate_on=[1,2,4]) + scaler = fit_transform!(FixedRangeScaler, Xtrain, -1, 1, obsdim=1, operate_on=[1,2,4]) +``` + +### Lower Level Functions +The lower level functions on which `StandardScaler` and `FixedRangeScaler` are built can also +be used separately. + +#### center!() +```julia + μ = center!(X[, μ; obsdim, operate_on]) +``` +Shift `X` along `obsdim` by `μ` according to X = X - μ +where `X` is of type Matrix or Vector and `D` of type DataFrame. + +#### fixedrange!() +```julia + lower, upper, xmin, xmax = fixedrange!(X[, lower, upper, xmin, xmax; obsdim, operate_on]) +``` +Normalize `X` or `D` along `obsdim` to the interval [lower:upper] +where `X` is of type Matrix or Vector and `D` of type DataFrame. +If `lower` and `upper` are omitted the default range is [0:1]. 
+ +#### standardize!() +```julia + μ, σ = standardize!(X[, μ, σ; obsdim, operate_on]) +``` +Standardize `X` along `obsdim` according to X = (X - μ) / σ. +If μ and σ are omitted they are computed such that variables have a mean of zero and a variance of one. + +### Polynomial Basis Expansion +```julia + M = expand_poly(x[, degree=5, obsdim]) +``` +Perform a polynomial basis expansion of the given `degree` for the vector `x`. + +```julia +julia> expand_poly(1:5, degree=3) +3×5 Array{Float64,2}: + 1.0 2.0 3.0 4.0 5.0 + 1.0 4.0 9.0 16.0 25.0 + 1.0 8.0 27.0 64.0 125.0 + +julia> expand_poly(1:5, degree=3, obsdim=1) +5×3 Array{Float64,2}: + 1.0 1.0 1.0 + 2.0 4.0 8.0 + 3.0 9.0 27.0 + 4.0 16.0 64.0 + 5.0 25.0 125.0 + +julia> expand_poly(1:5, 3, ObsDim.First()); # same but type-stable +``` diff --git a/src/MLPreprocessing.jl b/src/MLPreprocessing.jl index 5ed2d8a..798e5ee 100644 --- a/src/MLPreprocessing.jl +++ b/src/MLPreprocessing.jl @@ -13,16 +13,22 @@ export expand_poly, center!, - rescale!, + standardize!, + fixedrange!, - FeatureNormalizer, + StandardScaler, + FixedRangeScaler, fit, - predict, - predict! + fit_transform, + fit_transform!, + transform, + transform! + +include("scaleselection.jl") include("basis_expansion.jl") include("center.jl") -include("rescale.jl") -include("featurenormalizer.jl") +include("standardize.jl") +include("fixedrange.jl") end # module diff --git a/src/center.jl b/src/center.jl index 12c7de9..ec9a41b 100644 --- a/src/center.jl +++ b/src/center.jl @@ -1,146 +1,184 @@ """ - μ = center!(X[, μ, obsdim]) + μ = center!(X[, μ; obsdim, operate_on]) or - μ = center!(D[, colnames, μ]) + μ = center!(D[, μ; operate_on]) where `X` is of type Matrix or Vector and `D` of type DataFrame. -Center `X` along `obsdim` around the corresponding entry in the -vector `μ`. If `μ` is not specified then it defaults to the -feature specific means. +Shift `X` along `obsdim` by `μ` according to X = X - μ -For DataFrames, `obsdim` is obsolete and centering is done column wise. 
-Instead the vector `colnames` allows to specify which columns to center. -If `colnames` is not provided all columns of type T<:Real are centered. -Example: +`μ` : Vector or value describing the translation. + Defaults to mean(X, 2) + +`obsdim` : Specify which axis corresponds to observations. + Defaults to obsdim=2 (observations are columns of matrix) + For DataFrames `obsdim` is obsolete and centering occurs + column wise. + +`operate_on`: Specify the indices of columns or rows to be centered. + Defaults to all columns/rows. + For DataFrames this must be a vector of symbols, not indices. + E.g. `operate_on`=[1,3] will perform centering on columns + with index 1 and 3 only (if obsdim=1, else rows 1 and 3) + + +Note on DataFrames: +Columns containing `NA` values are skipped. +Columns containing non-numeric elements are skipped. + +Examples: X = rand(4, 100) + x = rand(10) D = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10]) μ = center!(X, obsdim=2) μ = center!(X, ObsDim.First()) + μ = center!(X, obsdim=1, operate_on=[1,3]) + μ = center!(X, [7.0, 8.0], obsdim=1, operate_on=[1,3]) μ = center!(D) - μ = center!(D, [:A, :B]) - + μ = center!(D, operate_on=[:A, :B]) + μ = center!(D, [-1,-1], operate_on=[:A, :B]) """ -function center!(X, μ; obsdim=LearnBase.default_obsdim(X)) - center!(X, μ, convert(ObsDimension, obsdim)) + +function center!(X, μ; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + center!(X, μ, convert(ObsDimension, obsdim), operate_on) end -function center!(X; obsdim=LearnBase.default_obsdim(X)) - center!(X, convert(ObsDimension, obsdim)) +function center!(X; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + center!(X, convert(ObsDimension, obsdim), operate_on) end -function center!{T,N}(X::AbstractArray{T,N}, μ::AbstractVector, ::ObsDim.Last) - center!(X, μ, ObsDim.Constant{N}()) +function center!{T,N,M}(X::AbstractArray{T,N}, 
obsdim::ObsDim.Constant{M}; operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + center!(X, ObsDim.Constant{M}(), operate_on) end -function center!{T,N}(X::AbstractArray{T,N}, ::ObsDim.Last) - center!(X, ObsDim.Constant{N}()) +function center!{T,N}(X::AbstractArray{T,N}, obsdim::ObsDim.Last; operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + center!(X, ObsDim.Constant{N}(), operate_on) end -function center!{T,N,M}(X::AbstractArray{T,N}, obsdim::ObsDim.Constant{M}) - μ = vec(mean(X, M)) - center!(X, μ, obsdim) +function center!{T,N}(X::AbstractArray{T,N}, obsdim::ObsDim.Last, operate_on::AbstractVector) + center!(X, ObsDim.Constant{N}(), operate_on) end -function center!{T}(X::AbstractVector{T}, ::ObsDim.Constant{1}) - μ = mean(X) - for i in 1:length(X) - X[i] = X[i] - μ - end - μ +function center!{T,N,M}(X::AbstractArray{T,N}, obsdim::ObsDim.Constant{M}, operate_on::AbstractVector) + μ = vec(mean(X, M))[operate_on] + center!(X, μ, obsdim, operate_on) end -function center!(X::AbstractVector, μ::AbstractVector, ::ObsDim.Constant{1}) - @inbounds for i in 1:length(X) - X[i] = X[i] - μ[i] - end - μ +function center!{T,N,M}(X::AbstractArray{T,N}, μ::AbstractVector, obsdim::ObsDim.Constant{M}; operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + center!(X, μ, ObsDim.Constant{M}(), operate_on) end -function center!(X::AbstractVector, μ::AbstractFloat, ::ObsDim.Constant{1}) - @inbounds for i in 1:length(X) - X[i] = X[i] - μ - end - μ +function center!{T,N}(X::AbstractArray{T,N}, μ::AbstractVector, obsdim::ObsDim.Last; operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + center!(X, μ, ObsDim.Constant{N}(), operate_on) end -function center!(X::AbstractMatrix, μ::AbstractVector, ::ObsDim.Constant{1}) +function center!(X::AbstractMatrix, μ::AbstractVector, ::ObsDim.Constant{1}, operate_on) + @assert length(μ) == length(operate_on) nObs, nVars = size(X) - for iVar in 1:nVars + for (i, iVar) in 
enumerate(operate_on) @inbounds for iObs in 1:nObs - X[iObs, iVar] = X[iObs, iVar] - μ[iVar] + X[iObs, iVar] = X[iObs, iVar] - μ[i] end end μ end -function center!(X::AbstractMatrix, μ::AbstractVector, ::ObsDim.Constant{2}) +function center!(X::AbstractMatrix, μ::AbstractVector, ::ObsDim.Constant{2}, operate_on) + @assert length(μ) == length(operate_on) nVars, nObs = size(X) for iObs in 1:nObs - @inbounds for iVar in 1:nVars - X[iVar, iObs] = X[iVar, iObs] - μ[iVar] + @inbounds for (i, iVar) in enumerate(operate_on) + X[iVar, iObs] = X[iVar, iObs] - μ[i] end end μ end -# -------------------------------------------------------------------- +function center!(x::AbstractVector; obsdim=LearnBase.default_obsdim(x), operate_on=default_scaleselection(x)) + center!(x, convert(ObsDimension, obsdim), operate_on) +end -function center!(D::AbstractDataFrame) - μ_vec = Float64[] +function center!(x::AbstractVector, ::ObsDim.Constant, operate_on::AbstractVector) + μ = mean(x) + for iVar in operate_on + x[iVar] = x[iVar] - μ + end + μ +end - flt = Bool[T <: Real for T in eltypes(D)] - for colname in names(D)[flt] - μ = mean(D[colname]) - center!(D, colname, μ) - push!(μ_vec, μ) +function center!(x::AbstractVector, μ::AbstractVector, ::ObsDim.Constant{1}, operate_on::AbstractVector) + @assert length(μ) == length(operate_on) + @inbounds for (i, iVar) in enumerate(operate_on) + x[iVar] = x[iVar] - μ[i] end - μ_vec + μ +end + +function center!(x::AbstractVector, μ::AbstractVector, ::ObsDim.Last, operate_on::AbstractVector) + center!(x, μ, ObsDim.Constant{1}(), operate_on) +end + +function center!(x::AbstractVector, μ::Real, ::ObsDim.Constant{1}, operate_on) + @inbounds for i in operate_on + x[i] = x[i] - μ + end + μ end -function center!(D::AbstractDataFrame, colnames::AbstractVector{Symbol}) +function center!(x::AbstractVector, μ::Real, ::ObsDim.Last, operate_on) + center!(x, μ, ObsDim.Constant{1}(), operate_on) +end + +# 
-------------------------------------------------------------------- + +function center!(D::AbstractDataFrame; operate_on=default_scaleselection(D)) + center!(D, operate_on) +end + +function center!(D::AbstractDataFrame, operate_on::AbstractVector{Symbol}) μ_vec = Float64[] - for colname in colnames + for colname in operate_on if eltype(D[colname]) <: Real μ = mean(D[colname]) if isna(μ) - warn("Column \"$colname\" contains NA values, skipping rescaling of this column!") + warn("Skipping \"$colname\" because it contains NA values") continue end - center!(D, colname, μ) + center!(D, μ, colname) push!(μ_vec, μ) else - warn("Skipping \"$colname\", centering only valid for columns of type T <: Real.") + warn("Skipping \"$colname\" because data is not of type T <: Real.") end end μ_vec end -function center!(D::AbstractDataFrame, colnames::AbstractVector{Symbol}, μ::AbstractVector) - for (icol, colname) in enumerate(colnames) +function center!(D::AbstractDataFrame, μ::AbstractVector; operate_on=default_scaleselection(D)) + center!(D, μ, operate_on) +end + +function center!(D::AbstractDataFrame, μ::AbstractVector, operate_on::AbstractVector{Symbol}) + for (icol, colname) in enumerate(operate_on) if eltype(D[colname]) <: Real - center!(D, colname, μ[icol]) + center!(D, μ[icol], colname) else - warn("Skipping \"$colname\", centering only valid for columns of type T <: Real.") + warn("Skipping \"$colname\" because data is not of type T <: Real.") end end μ end -function center!(D::AbstractDataFrame, colname::Symbol, μ) - if sum(isna(D[colname])) > 0 - warn("Column \"$colname\" contains NA values, skipping centering on this column!") +function center!(D::AbstractDataFrame, μ::Real, colname::Symbol) + if sum([isna(value) for value in D[colname]]) > 0 + warn("Skipping \"$colname\" because it contains NA values") else newcol::Vector{Float64} = convert(Vector{Float64}, D[colname]) - nobs = length(newcol) - @inbounds for i in eachindex(newcol) - newcol[i] -= μ - end + 
center!(newcol, μ) D[colname] = newcol end μ diff --git a/src/featurenormalizer.jl b/src/featurenormalizer.jl deleted file mode 100644 index d99f296..0000000 --- a/src/featurenormalizer.jl +++ /dev/null @@ -1,33 +0,0 @@ -immutable FeatureNormalizer - offset::Vector{Float64} - scale::Vector{Float64} - - function FeatureNormalizer(offset::Vector{Float64}, scale::Vector{Float64}) - @assert length(offset) == length(scale) - new(offset, scale) - end -end - -function FeatureNormalizer{T<:Real}(X::AbstractMatrix{T}) - FeatureNormalizer(vec(mean(X, 2)), vec(std(X, 2))) -end - -function StatsBase.fit{T<:Real}(::Type{FeatureNormalizer}, X::AbstractMatrix{T}) - FeatureNormalizer(X) -end - -function StatsBase.predict!{T<:Real}(cs::FeatureNormalizer, X::AbstractMatrix{T}) - @assert length(cs.offset) == size(X, 1) - rescale!(X, cs.offset, cs.scale) - X -end - -function StatsBase.predict{T<:AbstractFloat}(cs::FeatureNormalizer, X::AbstractMatrix{T}) - Xnew = copy(X) - StatsBase.predict!(cs, Xnew) -end - -function StatsBase.predict{T<:Real}(cs::FeatureNormalizer, X::AbstractMatrix{T}) - X = convert(AbstractMatrix{AbstractFloat}, X) - StatsBase.predict!(cs, X) -end diff --git a/src/fixedrange.jl b/src/fixedrange.jl new file mode 100644 index 0000000..9a3f112 --- /dev/null +++ b/src/fixedrange.jl @@ -0,0 +1,416 @@ +""" + lower, upper, xmin, xmax = fixedrange!(X[, lower, upper, xmin, xmax; obsdim, operate_on]) + +or + + lower, upper, xmin, xmax = fixedrange!(D[, lower, upper, xmin, xmax; operate_on]) + + +where `X` is of type Matrix or Vector and `D` of type DataFrame. +Normalize `X` or `D` along `obsdim` to the interval [lower:upper]. +If `lower` and `upper` are omitted the default range is [0:1]. + + +`lower` : (Scalar) Lower limit of new range. + Defaults to 0. + +`upper` : (Scalar) Upper limit of new range. + Defaults to 1. + +`xmin` : (Vector) Minimum values of data before normalization. `xmin` will + correspond to `lower` after transformation. 
+ Defaults to `minimum(X, obsdim)`. + +`xmax` : (Vector) Maximum values of data before normalization. `xmax` will + correspond to `upper` after transformation. + Defaults to `maximum(X, obsdim)`. + +`obsdim` : Specify which axis corresponds to observations. + Defaults to obsdim=2 (observations are columns of matrix) + For DataFrames `obsdim` is obsolete and rescaling occurs + column wise. + +`operate_on`: Specify the indices of columns or rows to be rescaled. + Defaults to all columns/rows. + For DataFrames this must be a vector of symbols, not indices. + E.g. `operate_on`=[1,3] will perform rescaling on columns + with index 1 and 3 only (if obsdim=1, else rows 1 and 3) + + +Note on DataFrames: +Columns containing `NA` values are skipped. +Columns containing non-numeric elements are skipped. + +Examples: + + X = rand(4, 100) + x = rand(10) + D = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10]) + + lower, upper, xmin, xmax = fixedrange!(X) + lower, upper, xmin, xmax = fixedrange!(X, -1, 1) + lower, upper, xmin, xmax = fixedrange!(X, -1, 1, obsdim=1) + lower, upper, xmin, xmax = fixedrange!(X, -1, 1, obsdim=1, operate_on=[1,4]) + + + lower, upper, xmin, xmax = fixedrange!(D) + lower, upper, xmin, xmax = fixedrange!(D, -1, 1) + lower, upper, xmin, xmax = fixedrange!(D, -1, 1, operate_on=[:A,:B]) +""" +function fixedrange!(X; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + fixedrange!(X, convert(ObsDimension, obsdim), operate_on) +end + +function fixedrange!{T,N}(X::AbstractArray{T,N}, ::ObsDim.Last, operate_on) + fixedrange!(X, ObsDim.Constant{N}(), operate_on) +end + +function fixedrange!{M}(X, obsdim::ObsDim.Constant{M}, operate_on) + lower = 0 + upper = 1 + xmin = minimum(X, M)[operate_on] + xmax = maximum(X, M)[operate_on] + fixedrange!(X, lower, upper, xmin, xmax, obsdim, operate_on) +end + +function fixedrange!(X, lower, upper; obsdim=LearnBase.default_obsdim(X), 
operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + fixedrange!(X, lower, upper, convert(ObsDimension, obsdim), operate_on) +end + +function fixedrange!{M}(X, lower, upper, obsdim::ObsDim.Constant{M}, operate_on) + xmin = minimum(X, M)[operate_on] + xmax = maximum(X, M)[operate_on] + fixedrange!(X, lower, upper, xmin, xmax, obsdim, operate_on) +end + +function fixedrange!{T,M}(X::AbstractArray{T,M}, lower, upper, obsdim::ObsDim.Last, operate_on) + fixedrange!(X, lower, upper, ObsDim.Constant{M}(), operate_on) +end + +function fixedrange!(X, lower, upper, xmin, xmax; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + fixedrange!(X, lower, upper, xmin, xmax, convert(ObsDimension, obsdim), operate_on) +end + +function fixedrange!(X::AbstractMatrix, lower::Real, upper::Real, xmin::AbstractVector, xmax::AbstractVector, ::ObsDim.Constant{1}, operate_on::AbstractVector) + @assert length(xmin) == length(xmax) == length(operate_on) + xrange = xmax .- xmin + scale = upper - lower + nObs, nVars = size(X) + + for (i, iVar) in enumerate(operate_on) + @inbounds for iObs in 1:nObs + X[iObs, iVar] = lower + (X[iObs, iVar] - xmin[i]) / xrange[i] * scale + end + end + lower, upper, xmin, xmax +end + +function fixedrange!(X::AbstractMatrix, lower::Real, upper::Real, xmin::AbstractVector, xmax::AbstractVector, ::ObsDim.Constant{2}, operate_on::AbstractVector) + @assert length(xmin) == length(xmax) == length(operate_on) + xrange = xmax .- xmin + scale = upper - lower + nVars, nObs = size(X) + + for iObs in 1:nObs + @inbounds for (i, iVar) in enumerate(operate_on) + X[iVar, iObs] = lower + (X[iVar, iObs] - xmin[i]) / xrange[i] * scale + end + end + lower, upper, xmin, xmax +end + +function fixedrange!{T,M}(X::AbstractArray{T,M}, lower::Real, upper::Real, xmin::Real, xmax::Real, ::ObsDim.Last, operate_on::AbstractVector) + fixedrange!(X, lower, upper, xmin, xmax, ObsDim.Constant{M}(), operate_on) +end + +function 
fixedrange!{M}(x::AbstractVector, lower::Real, upper::Real, xmin::AbstractVector, xmax::AbstractVector, ::ObsDim.Constant{M}, operate_on::AbstractVector) + @assert length(xmin) == length(xmax) == length(operate_on) + xrange = xmax .- xmin + scale = upper - lower + nVars = length(x) + @inbounds for (i, iVar) in enumerate(operate_on) + x[iVar] = lower + (x[iVar] - xmin[i]) / xrange[i] * scale + end + lower, upper, xmin, xmax +end + +function fixedrange!(x::AbstractVector, lower::Real, upper::Real, xmin::AbstractVector, xmax::AbstractVector, ::ObsDim.Last, operate_on::AbstractVector) + fixedrange!(x, lower, upper, xmin, xmax, ObsDim.Constant{1}(), operate_on) +end + +function fixedrange!(x::AbstractVector, lower::Real, upper::Real, xmin::Real, xmax::Real) + xrange = xmax - xmin + scale = upper - lower + n = length(x) + @inbounds for i in 1:n + x[i] = lower + (x[i] - xmin) / xrange * scale + end + lower, upper, xmin, xmax +end + +# -------------------------------------------------------------------- + +function fixedrange!(D::AbstractDataFrame; operate_on=default_scaleselection(D)) + fixedrange!(D, 0, 1, operate_on) +end + +function fixedrange!(D::AbstractDataFrame, lower, upper; operate_on=default_scaleselection(D)) + fixedrange!(D, lower, upper, operate_on) +end + +function fixedrange!(D::AbstractDataFrame, lower::Real, upper::Real, operate_on::AbstractArray) + xmin = Float64[] + xmax = Float64[] + + for colname in operate_on + if eltype(D[colname]) <: Real + minval = minimum(D[colname]) + maxval = maximum(D[colname]) + if isna(minval) + warn("Skipping \"$colname\" because it contains NA values") + continue + end + fixedrange!(D, lower, upper, minval, maxval, colname) + push!(xmin, minval) + push!(xmax, maxval) + else + warn("Skipping \"$colname\" because data is not of type T <: Real.") + end + end + lower, upper, xmin, xmax +end + +function fixedrange!(D::AbstractDataFrame, lower, upper, xmin, xmax; operate_on=default_scaleselection(D)) + fixedrange!(D, lower, 
upper, xmin, xmax, operate_on) +end + +function fixedrange!(D::AbstractDataFrame, lower::Real, upper::Real, xmin::AbstractArray, xmax::AbstractArray, operate_on::AbstractVector) + @assert length(xmin) == length(xmax) == length(operate_on) + for (iVar, colname) in enumerate(operate_on) + fixedrange!(D, lower, upper, xmin[iVar], xmax[iVar], colname) + end + lower, upper, xmin, xmax, operate_on +end + +function fixedrange!(D::AbstractDataFrame, lower::Real, upper::Real, xmin::Real, xmax::Real, colname::Symbol) + if any(isna, D[colname]) | !(eltype(D[colname]) <: Real) + warn("Skipping \"$colname\" because it contains NA values or is not of type <: Real") + else + newcol::Vector{Float64} = convert(Vector{Float64}, D[colname]) + fixedrange!(newcol, lower, upper, xmin, xmax) + D[colname] = newcol + end + lower, upper, xmin, xmax, colname +end + + +""" +`FixedRangeScaler` is used with the functions `fit()`, `transform()` and `fit_transform()` +to scale data in a Matrix `X` or DataFrame to a fixed range [lower:upper]. +After fitting a `FixedRangeScaler` to one data set, it can be used to perform the same +transformation to a new set of data. E.g. fit the `FixedRangeScaler` to your training +data and then apply the scaling to the test data at a later stage. (See examples below). + + fit(FixedRangeScaler, X[, lower, upper; obsdim, operate_on]) + + fit_transform(FixedRangeScaler, X[, lower, upper; obsdim, operate_on]) + +`X` : Data of type Matrix or `DataFrame`. + +`lower` : (Scalar) Lower limit of new range. + Defaults to 0. + +`upper` : (Scalar) Upper limit of new range. + Defaults to 1. + +`obsdim` : Specify which axis corresponds to observations. + Defaults to obsdim=2 (observations are columns of matrix) + For DataFrames `obsdim` is obsolete and rescaling occurs + column wise. + +`operate_on`: Specify the indices of columns or rows to be centered. + Defaults to all columns/rows. + For DataFrames this must be a vector of symbols, not indices. + E.g. 
`operate_on`=[1,3] will perform centering on columns + with index 1 and 3 only (if obsdim=1, else rows 1 and 3) + +Note on DataFrames: +Columns containing `NA` values are skipped. +Columns containing non numeric elements are skipped. + +Examples: + + + Xtrain = rand(100, 4) + Xtest = rand(10, 4) + x = rand(10) + D = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10]) + + scaler = fit(FixedRangeScaler, Xtrain) + scaler = fit(FixedRangeScaler, Xtrain, -1, 1) + scaler = fit(FixedRangeScaler, Xtrain, -1, 1, obsdim=1) + scaler = fit(FixedRangeScaler, Xtrain, -1, 1, obsdim=1, operate_on=[1,3]) + scaler = fit(FixedRangeScaler, D, -1, 1, operate_on=[:A,:B]) + + Xscaled = transform(Xtest, scaler) + transform!(Xtest, scaler) + + Xscaled, scaler = fit_transform(FixedRangeScaler, X, -1, 1, obsdim=1, operate_on=[1,2,4]) + scaler = fit_transform!(FixedRangeScaler, X, -1, 1, obsdim=1, operate_on=[1,2,4]) + + + +Note that for `transform!` the data matrix `X` has to be of type <: AbstractFloat +as the scaling occurs inplace. (E.g. cannot be of type Matrix{Int64}). This is not +the case for `transform` however. +For `DataFrames` `transform!` can be used on columns of type <: Integer. 
+""" +immutable FixedRangeScaler{T<:Real,U<:Real,V<:Real,W<:Real,M,I} + lower::T + upper::U + xmin::Vector{V} + xmax::Vector{W} + obsdim::ObsDim.Constant{M} + operate_on::Vector{I} +end + +function FixedRangeScaler{T<:Real,N}(X::AbstractArray{T,N}; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + FixedRangeScaler(X, convert(ObsDimension, obsdim), operate_on) +end + +function FixedRangeScaler{T<:Real,N,M}(X::AbstractArray{T,N}, obsdim::ObsDim.Constant{M}, operate_on) + xmin = vec(minimum(X, M))[operate_on] + xmax = vec(maximum(X, M))[operate_on] + FixedRangeScaler(0, 1, xmin, xmax, obsdim, operate_on) +end + +function FixedRangeScaler{T<:Real,N}(X::AbstractArray{T,N}, ::ObsDim.Last, operate_on) + FixedRangeScaler(X, ObsDim.Constant{N}(), operate_on) +end + +function FixedRangeScaler{T<:Real,N}(X::AbstractArray{T,N}, lower, upper; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + FixedRangeScaler(X, lower, upper, convert(ObsDimension, obsdim), operate_on) +end + +function FixedRangeScaler{T<:Real,N,M}(X::AbstractArray{T,N}, lower, upper, obsdim::ObsDim.Constant{M}, operate_on) + xmin = vec(minimum(X, M))[operate_on] + xmax = vec(maximum(X, M))[operate_on] + FixedRangeScaler(lower, upper, xmin, xmax, obsdim, operate_on) +end + +function FixedRangeScaler{T<:Real,N}(X::AbstractArray{T,N}, lower, upper, ::ObsDim.Last, operate_on) + FixedRangeScaler(X, lower, upper, ObsDim.Constant{N}(), operate_on) +end + +function FixedRangeScaler(D::AbstractDataFrame; operate_on=default_scaleselection(D)) + FixedRangeScaler(D, 0, 1, operate_on) +end + +function FixedRangeScaler(D::AbstractDataFrame, lower::Real, upper::Real; operate_on=default_scaleselection(D)) + FixedRangeScaler(D, lower, upper, operate_on) +end + +function FixedRangeScaler(D::AbstractDataFrame, lower::Real, upper::Real, operate_on::AbstractVector{Symbol}) + xmin = Float64[] + xmax = Float64[] + colnames 
= valid_columns(D, operate_on) + for colname in colnames + push!(xmin, minimum(D[colname])) + push!(xmax, maximum(D[colname])) + end + FixedRangeScaler(lower, upper, xmin, xmax, ObsDim.Constant{1}(), colnames) +end + +function StatsBase.fit{T<:Real,N}(::Type{FixedRangeScaler}, X::AbstractArray{T,N}; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + FixedRangeScaler(X, convert(ObsDimension, obsdim), operate_on) +end + +function fit_transform{T<:Real,N}(::Type{FixedRangeScaler}, X::AbstractArray{T,N}; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + scaler = FixedRangeScaler(X, convert(ObsDimension, obsdim), operate_on) + Xnew = transform(X, scaler) + return Xnew, scaler +end + +function fit_transform!{T<:Real,N}(::Type{FixedRangeScaler}, X::AbstractArray{T,N}; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + scaler = FixedRangeScaler(X, convert(ObsDimension, obsdim), operate_on) + transform!(X, scaler) + return scaler +end + +function StatsBase.fit{T<:Real,N}(::Type{FixedRangeScaler}, X::AbstractArray{T,N}, lower, upper; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + FixedRangeScaler(X, lower, upper, convert(ObsDimension, obsdim), operate_on) +end + +function fit_transform{T<:Real,N}(::Type{FixedRangeScaler}, X::AbstractArray{T,N}, lower, upper; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + scaler = FixedRangeScaler(X, lower, upper, convert(ObsDimension, obsdim), operate_on) + Xnew = transform(X, scaler) + return Xnew, scaler +end + +function fit_transform!{T<:Real,N}(::Type{FixedRangeScaler}, X::AbstractArray{T,N}, lower, upper; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + scaler = FixedRangeScaler(X, lower, upper, 
convert(ObsDimension, obsdim), operate_on) + transform!(X, scaler) + return scaler +end + +function StatsBase.fit(::Type{FixedRangeScaler}, D::AbstractDataFrame; operate_on=default_scaleselection(D)) + FixedRangeScaler(D, 0, 1, operate_on) +end + +function fit_transform(::Type{FixedRangeScaler}, D::AbstractDataFrame; operate_on=default_scaleselection(D)) + scaler = FixedRangeScaler(D, 0, 1, operate_on) + Dnew = transform(D, scaler) + return Dnew, scaler +end + +function fit_transform!(::Type{FixedRangeScaler}, D::AbstractDataFrame; operate_on=default_scaleselection(D)) + scaler = FixedRangeScaler(D, 0, 1, operate_on) + transform!(D, scaler) + return scaler +end + +function StatsBase.fit(::Type{FixedRangeScaler}, D::AbstractDataFrame, lower, upper; operate_on=default_scaleselection(D)) + FixedRangeScaler(D, lower, upper, operate_on) +end + +function fit_transform(::Type{FixedRangeScaler}, D::AbstractDataFrame, lower, upper; operate_on=default_scaleselection(D)) + scaler = FixedRangeScaler(D, lower, upper, operate_on) + Dnew = transform(D, scaler) + return Dnew, scaler +end + +function fit_transform!(::Type{FixedRangeScaler}, D::AbstractDataFrame, lower, upper; operate_on=default_scaleselection(D)) + scaler = FixedRangeScaler(D, lower, upper, operate_on) + transform!(D, scaler) + return scaler +end + +function transform!{T<:AbstractFloat,N}(X::AbstractArray{T,N}, cs::FixedRangeScaler) + fixedrange!(X, cs.lower, cs.upper, cs.xmin, cs.xmax, cs.obsdim, cs.operate_on) +end + +function transform!{T<:AbstractFloat}(x::AbstractVector{T}, cs::FixedRangeScaler) + fixedrange!(x, cs.lower, cs.upper, cs.xmin, cs.xmax, cs.obsdim, cs.operate_on) +end + +function transform!(D::AbstractDataFrame, cs::FixedRangeScaler) + fixedrange!(D, cs.lower, cs.upper, cs.xmin, cs.xmax, cs.operate_on) +end + +function transform{T<:AbstractFloat,N}(X::AbstractArray{T,N}, cs::FixedRangeScaler) + Xnew = copy(X) + fixedrange!(Xnew, cs.lower, cs.upper, cs.xmin, cs.xmax, cs.obsdim, cs.operate_on) + Xnew 
+end + +function transform{T<:Real,N}(X::AbstractArray{T,N}, cs::FixedRangeScaler) + Xnew = convert(AbstractArray{Float64, N}, X) + fixedrange!(Xnew, cs.lower, cs.upper, cs.xmin, cs.xmax, cs.obsdim, cs.operate_on) + Xnew +end + +function transform(D::AbstractDataFrame, cs::FixedRangeScaler) + Dnew = deepcopy(D) + fixedrange!(Dnew, cs.lower, cs.upper, cs.xmin, cs.xmax, cs.operate_on) + Dnew +end diff --git a/src/rescale.jl b/src/rescale.jl deleted file mode 100644 index 476489a..0000000 --- a/src/rescale.jl +++ /dev/null @@ -1,158 +0,0 @@ -""" - μ, σ = rescale!(X[, μ, σ, obsdim]) - -or - - μ, σ = rescale!(D[, colnames, μ, σ]) - -where `X` is of type Matrix or Vector and `D` of type DataFrame. - -Center `X` along `obsdim` around the corresponding entry in the -vector `μ` and then rescale each feature using the corresponding -entry in the vector `σ`. - -For DataFrames, `obsdim` is obsolete and centering is done column wise. -The vector `colnames` allows to specify which columns to center. -If `colnames` is not provided all columns of type T<:Real are centered. 
- -Example: - - X = rand(4, 100) - D = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10]) - - μ, σ = rescale!(X, obsdim=2) - μ, σ = rescale!(X, ObsDim.First()) - μ, σ = rescale!(D) - μ, σ = rescale!(D, [:A, :B]) - -""" -function rescale!(X, μ, σ; obsdim=LearnBase.default_obsdim(X)) - rescale!(X, μ, σ, convert(ObsDimension, obsdim)) -end - -function rescale!{T,N}(X::AbstractArray{T,N}, μ, σ, ::ObsDim.Last) - rescale!(X, μ, σ, ObsDim.Constant{N}()) -end - -function rescale!(X; obsdim=LearnBase.default_obsdim(X)) - rescale!(X, convert(ObsDimension, obsdim)) -end - -function rescale!{T,N}(X::AbstractArray{T,N}, ::ObsDim.Last) - rescale!(X, ObsDim.Constant{N}()) -end - -function rescale!{T,N,M}(X::AbstractArray{T,N}, obsdim::ObsDim.Constant{M}) - μ = vec(mean(X, M)) - σ = vec(std(X, M)) - rescale!(X, μ, σ, obsdim) -end - -function rescale!(X::AbstractVector, ::ObsDim.Constant{1}) - μ = mean(X) - σ = std(X) - @inbounds for i in 1:length(X) - X[i] = (X[i] - μ) / σ - end - μ, σ -end - -function rescale!(X::AbstractMatrix, μ::AbstractVector, σ::AbstractVector, ::ObsDim.Constant{2}) - σ[σ .== 0] = 1 - nVars, nObs = size(X) - for iObs in 1:nObs - @inbounds for iVar in 1:nVars - X[iVar, iObs] = (X[iVar, iObs] - μ[iVar]) / σ[iVar] - end - end - μ, σ -end - -function rescale!(X::AbstractMatrix, μ::AbstractVector, σ::AbstractVector, ::ObsDim.Constant{1}) - σ[σ .== 0] = 1 - nObs, nVars = size(X) - for iVar in 1:nVars - @inbounds for iObs in 1:nObs - X[iObs, iVar] = (X[iObs, iVar] - μ[iVar]) / σ[iVar] - end - end - μ, σ -end - -function rescale!(X::AbstractVector, μ::AbstractVector, σ::AbstractVector, ::ObsDim.Constant{1}) - @inbounds for i in 1:length(X) - X[i] = (X[i] - μ[i]) / σ[i] - end - μ, σ -end - -function rescale!(X::AbstractVector, μ::AbstractFloat, σ::AbstractFloat, ::ObsDim.Constant{1}) - @inbounds for i in 1:length(X) - X[i] = (X[i] - μ) / σ - end - μ, σ -end - -# -------------------------------------------------------------------- - -function 
rescale!(D::AbstractDataFrame) - μ_vec = Float64[] - σ_vec = Float64[] - - flt = Bool[T <: Real for T in eltypes(D)] - for colname in names(D)[flt] - μ = mean(D[colname]) - σ = std(D[colname]) - rescale!(D, colname, μ, σ) - push!(μ_vec, μ) - push!(σ_vec, σ) - end - μ_vec, σ_vec -end - -function rescale!(D::AbstractDataFrame, colnames::Vector{Symbol}) - μ_vec = Float64[] - σ_vec = Float64[] - for colname in colnames - if eltype(D[colname]) <: Real - μ = mean(D[colname]) - σ = std(D[colname]) - if isna(μ) - warn("Column \"$colname\" contains NA values, skipping rescaling of this column!") - continue - end - rescale!(D, colname, μ, σ) - push!(μ_vec, μ) - push!(σ_vec, σ) - else - warn("Skipping \"$colname\", rescaling only valid for columns of type T <: Real.") - end - end - μ_vec, σ_vec -end - -function rescale!(D::AbstractDataFrame, colnames::Vector{Symbol}, μ::AbstractVector, σ::AbstractVector) - for (icol, colname) in enumerate(colnames) - if eltype(D[colname]) <: Real - rescale!(D, colname, μ[icol], σ[icol]) - else - warn("Skipping \"$colname\", rescaling only valid for columns of type T <: Real.") - end - end - μ, σ -end - -function rescale!(D::AbstractDataFrame, colname::Symbol, μ, σ) - if sum(isna(D[colname])) > 0 - warn("Column \"$colname\" contains NA values, skipping rescaling of this column!") - else - σ_div = σ == 0 ? 
one(σ) : σ - newcol::Vector{Float64} = convert(Vector{Float64}, D[colname]) - nobs = length(newcol) - @inbounds for i in eachindex(newcol) - newcol[i] = (newcol[i] - μ) / σ_div - end - D[colname] = newcol - end - μ, σ -end diff --git a/src/scaleselection.jl b/src/scaleselection.jl new file mode 100644 index 0000000..8544d7d --- /dev/null +++ b/src/scaleselection.jl @@ -0,0 +1,69 @@ +function default_scaleselection(X::AbstractMatrix, ::ObsDim.Constant{1}) + collect(1:size(X, 2)) +end + +function default_scaleselection(X::AbstractMatrix, ::ObsDim.Constant{2}) + collect(1:size(X, 1)) +end + +function default_scaleselection(X::AbstractMatrix, ::ObsDim.Last) + collect(1:size(X, 1)) +end + +function default_scaleselection(x::AbstractVector) + collect(1:length(x)) +end + +function default_scaleselection(x::AbstractVector, ::ObsDim.Last) + collect(1:length(x)) +end + +function default_scaleselection{M}(x::AbstractVector, ::ObsDim.Constant{M}) + collect(1:length(x)) +end + +function default_categoricalselection(D::AbstractDataFrame) + valid_columns_categorical(D::AbstractDataFrame) +end + +function default_scaleselection(D::AbstractDataFrame) + valid_columns(D) +end + +function valid_columns(D::AbstractDataFrame) + valid_colnames = Symbol[] + for colname in names(D) + if (eltype(D[colname]) <: Real) & !any(isna, D[colname]) + push!(valid_colnames, colname) + else + warn("Skipping \"$colname\" because it either contains NA or is not of type <: Real") + end + end + valid_colnames +end + +function valid_columns(D::AbstractDataFrame, colnames) + valid_colnames = Symbol[] + for colname in colnames + if (eltype(D[colname]) <: Real) & !(any(isna, D[colname])) + push!(valid_colnames, colname) + else + warn("Skipping \"$colname\" because it either contains NA or is not of type <: Real") + end + end + valid_colnames +end + +function valid_columns_categorical(D::AbstractDataFrame) + valid_colnames = Symbol[] + for colname in names(D) + if !(eltype(D[colname]) <: Real) + if !(any(isna, 
D[colname])) + push!(valid_colnames, colname) + else + warn("Skipping \"$colname\" because it contains NA") + end + end + end + valid_colnames +end diff --git a/src/standardize.jl b/src/standardize.jl new file mode 100644 index 0000000..381df40 --- /dev/null +++ b/src/standardize.jl @@ -0,0 +1,326 @@ +""" + μ, σ = standardize!(X[, μ, σ; obsdim, operate_on]) + +or + + μ, σ = standardize!(D[, μ, σ; operate_on]) + +Standardize `X` along `obsdim` according to X = (X - μ) / σ. +If μ and σ are omitted, they are computed such that the variables have a mean of zero and a standard deviation of one. + + + +`μ` : Vector or value describing the translation. + Defaults to mean(X, 2) + +`σ` : Vector or value describing the scale. + Defaults to std(X, 2) + +`obsdim` : Specify which axis corresponds to observations. + Defaults to obsdim=2 (observations are columns of matrix) + For DataFrames `obsdim` is obsolete and centering occurs + column wise. + +`operate_on`: Specify the indices of columns or rows to be centered. + Defaults to all columns/rows. + For DataFrames this must be a vector of symbols, not indices. + E.g. `operate_on`=[1,3] will perform centering on columns + with index 1 and 3 only (if obsdim=1, else rows 1 and 3) + + +Note on DataFrames: +Columns containing `NA` values are skipped. +Columns containing non numeric elements are skipped. 
+ +Examples: + + X = rand(4, 100) + x = rand(10) + D = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10]) + + μ, σ = standardize!(X, obsdim=2) + μ, σ = standardize!(X, ObsDim.First()) + μ, σ = standardize!(X, obsdim=1, operate_on=[1,3]) + μ, σ = standardize!(X, [7.0,8.0], [1,1], obsdim=1, operate_on=[1,3]) + μ, σ = standardize!(D) + μ, σ = standardize!(D, operate_on=[:A,:B]) + μ, σ = standardize!(D, [-1,-1], [2,2], operate_on=[:A,:B]) +""" +function standardize!(X, μ, σ; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + standardize!(X, μ, σ, convert(ObsDimension, obsdim), operate_on) +end + +function standardize!{T,N}(X::AbstractArray{T,N}, μ, σ, ::ObsDim.Last, operate_on) + standardize!(X, μ, σ, ObsDim.Constant{N}(), operate_on) +end + +function standardize!(X; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + standardize!(X, convert(ObsDimension, obsdim), operate_on) +end + +function standardize!{T,N}(X::AbstractArray{T,N}, ::ObsDim.Last, operate_on) + standardize!(X, ObsDim.Constant{N}(), operate_on) +end + +function standardize!{T,N,M}(X::AbstractArray{T,N}, obsdim::ObsDim.Constant{M}, operate_on) + μ = vec(mean(X, M))[operate_on] + σ = vec(std(X, M))[operate_on] + standardize!(X, μ, σ, obsdim, operate_on) +end + +function standardize!{M}(X::AbstractVector, ::ObsDim.Constant{M}, operate_on) + μ = mean(X) + σ = std(X) + for i in operate_on + X[i] = (X[i] - μ) / σ + end + μ, σ +end + +function standardize!(X::AbstractMatrix, μ::AbstractVector, σ::AbstractVector, ::ObsDim.Constant{2}, operate_on) + σ[σ .== 0] = 1 + nVars, nObs = size(X) + for iObs in 1:nObs + @inbounds for (i, iVar) in enumerate(operate_on) + X[iVar, iObs] = (X[iVar, iObs] - μ[i]) / σ[i] + end + end + μ, σ +end + +function standardize!(X::AbstractMatrix, μ::AbstractVector, σ::AbstractVector, ::ObsDim.Constant{1}, operate_on) + σ[σ .== 0] = 1 + nObs, nVars = size(X) + for 
(i, iVar) in enumerate(operate_on) + @inbounds for iObs in 1:nObs + X[iObs, iVar] = (X[iObs, iVar] - μ[i]) / σ[i] + end + end + μ, σ +end + +function standardize!{M}(X::AbstractVector, μ::AbstractVector, σ::AbstractVector, ::ObsDim.Constant{M}, operate_on) + @inbounds for (i, iVar) in enumerate(operate_on) + X[iVar] = (X[iVar] - μ[i]) / σ[i] + end + μ, σ +end + +function standardize!{M}(X::AbstractVector, μ::AbstractFloat, σ::AbstractFloat, ::ObsDim.Constant{M}, operate_on) + @inbounds for i in 1:length(X) + X[i] = (X[i] - μ) / σ + end + μ, σ +end + +# -------------------------------------------------------------------- +function standardize!(D::AbstractDataFrame; operate_on=default_scaleselection(D)) + standardize!(D, operate_on) +end + +function standardize!(D::AbstractDataFrame, colnames::AbstractVector{Symbol}) + μ_vec = Float64[] + σ_vec = Float64[] + + for colname in colnames + if eltype(D[colname]) <: Real + μ = mean(D[colname]) + σ = std(D[colname]) + if isna(μ) + warn("Skipping \"$colname\" because it contains NA values") + continue + end + standardize!(D, μ, σ, colname) + push!(μ_vec, μ) + push!(σ_vec, σ) + else + warn("Skipping \"$colname\" because data is not of type T <: Real.") + end + end + μ_vec, σ_vec +end + +function standardize!(D::AbstractDataFrame, μ::AbstractVector, σ::AbstractVector; operate_on=default_scaleselection(D)) + standardize!(D, μ, σ, operate_on) +end + +function standardize!(D::AbstractDataFrame, μ::AbstractVector, σ::AbstractVector, colnames::AbstractVector{Symbol}) + for (icol, colname) in enumerate(colnames) + standardize!(D, μ[icol], σ[icol], colname) + end + μ, σ +end + +function standardize!(D::AbstractDataFrame, μ::Real, σ::Real, colname::Symbol) + if any(isna, D[colname]) | !(eltype(D[colname]) <: Real) + warn("Skipping \"$colname\" because it contains NA values or is not of type <: Real") + else + newcol::Vector{Float64} = convert(Vector{Float64}, D[colname]) + nobs = length(newcol) + @inbounds for i in eachindex(newcol) + 
newcol[i] = (newcol[i] - μ) / σ + end + D[colname] = newcol + end + μ, σ +end + +""" +`StandardScaler` is used with the functions `fit()`, `transform()` and `fit_transform()` +to standardize data in a Matrix `X` or DataFrame according to Xnew = (X - μ) / σ. +After fitting a `StandardScaler` to one data set, it can be used to perform the same +transformation to a new set of data. E.g. fit the `StandardScaler` to your training +data and then apply the scaling to the test data at a later stage. (See examples below). + + fit(StandardScaler, X[, μ, σ; obsdim, operate_on]) + + fit_transform(StandardScaler, X[, μ, σ; obsdim, operate_on]) + +`X` : Data of type Matrix or `DataFrame`. + +`μ` : Vector or scalar describing the translation. + Defaults to mean(X, obsdim) + +`σ` : Vector or scalar describing the scale. + Defaults to std(X, obsdim) + +`obsdim` : Specify which axis corresponds to observations. + Defaults to obsdim=2 (observations are columns of matrix) + For DataFrames `obsdim` is obsolete and rescaling occurs + column wise. + +`operate_on`: Specify the indices of columns or rows to be centered. + Defaults to all columns/rows. + For DataFrames this must be a vector of symbols, not indices. + E.g. `operate_on`=[1,3] will perform centering on columns + with index 1 and 3 only (if obsdim=1, else rows 1 and 3) + +Note on DataFrames: +Columns containing `NA` values are skipped. +Columns containing non numeric elements are skipped. 
+ +Examples: + + + Xtrain = rand(100, 4) + Xtest = rand(10, 4) + x = rand(4) + Dtrain = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10]) + Dtest = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10]) + + scaler = fit(StandardScaler, Xtrain) + scaler = fit(StandardScaler, Xtrain, obsdim=1) + scaler = fit(StandardScaler, Xtrain, obsdim=1, operate_on=[1,3]) + transform(Xtest, scaler) + transform!(Xtest, scaler) + transform(x, scaler) + transform!(x, scaler) + + scaler = fit(StandardScaler, Dtrain) + scaler = fit(StandardScaler, Dtrain, operate_on=[:A,:B]) + transform(Dtest, scaler) + transform!(Dtest, scaler) + + Xscaled, scaler = fit_transform(StandardScaler, X, obsdim=1, operate_on=[1,2,4]) + scaler = fit_transform!(StandardScaler, X, obsdim=1, operate_on=[1,2,4]) + +Note that for `transform!` the data matrix `X` has to be of type <: AbstractFloat +as the scaling occurs inplace. (E.g. cannot be of type Matrix{Int64}). This is not +the case for `transform` however. +For `DataFrames` `transform!` can be used on columns of type <: Integer. 
+""" +immutable StandardScaler{T<:Real,U<:Real,I,M} + offset::Vector{T} + scale::Vector{U} + obsdim::ObsDim.Constant{M} + operate_on::Vector{I} +end + +function StandardScaler{T<:Real,M}(X::AbstractArray{T,M}; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + StandardScaler(X, convert(ObsDimension, obsdim), operate_on) +end + +function StandardScaler{T<:Real,M}(X::AbstractArray{T,M}, ::ObsDim.Last, operate_on) + StandardScaler(X, ObsDim.Constant{M}(), operate_on) +end + +function StandardScaler{T<:Real,N,M}(X::AbstractArray{T,N}, obsdim::ObsDim.Constant{M}, operate_on::AbstractVector) + offset = vec(mean(X,M))[operate_on] + scale = vec(std(X, M))[operate_on] + StandardScaler(offset, scale, obsdim, operate_on) +end + +function StandardScaler(D::AbstractDataFrame; operate_on=default_scaleselection(D)) + StandardScaler(D, operate_on) +end + +function StandardScaler(D::AbstractDataFrame, operate_on::Vector{Symbol}) + colnames = valid_columns(D, operate_on) + offset = Float64[mean(D[colname]) for colname in colnames] + scale = Float64[std(D[colname]) for colname in colnames] + StandardScaler(offset, scale, ObsDim.Constant{1}(), colnames) +end + +function StandardScaler(D::AbstractDataFrame, offset, scale; operate_on=default_scaleselection(D)) + colnames = valid_columns(D) + StandardScaler(offset, scale, ObsDim.Constant{1}(), colnames) +end + +function StatsBase.fit{T<:Real}(::Type{StandardScaler}, X::AbstractMatrix{T}; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + StandardScaler(X, convert(ObsDimension, obsdim), operate_on) +end + +function fit_transform{T<:Real}(::Type{StandardScaler}, X::AbstractMatrix{T}; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + scaler = StandardScaler(X, convert(ObsDimension, obsdim), operate_on) + Xnew = transform(X, scaler) + return Xnew, scaler +end + +function 
fit_transform!{T<:Real}(::Type{StandardScaler}, X::AbstractMatrix{T}; obsdim=LearnBase.default_obsdim(X), operate_on=default_scaleselection(X, convert(ObsDimension, obsdim))) + scaler = StandardScaler(X, convert(ObsDimension, obsdim), operate_on) + transform!(X, scaler) + return scaler +end + +function StatsBase.fit(::Type{StandardScaler}, D::AbstractDataFrame; operate_on=default_scaleselection(D)) + StandardScaler(D, operate_on) +end + +function fit_transform(::Type{StandardScaler}, D::AbstractDataFrame; operate_on=default_scaleselection(D)) + scaler = StandardScaler(D, operate_on) + Dnew = transform(D, scaler) + return Dnew, scaler +end + +function fit_transform!(::Type{StandardScaler}, D::AbstractDataFrame; operate_on=default_scaleselection(D)) + scaler = StandardScaler(D, operate_on) + transform!(D, scaler) + return scaler +end + +function transform!{T<:AbstractFloat,N}(X::AbstractArray{T,N}, cs::StandardScaler) + standardize!(X, cs.offset, cs.scale, cs.obsdim, cs.operate_on) + X +end + +function transform!(D::AbstractDataFrame, cs::StandardScaler) + standardize!(D, cs.offset, cs.scale, cs.operate_on) + D +end + +function transform{T<:AbstractFloat,N}(X::AbstractArray{T,N}, cs::StandardScaler) + Xnew = deepcopy(X) + transform!(Xnew, cs) +end + +function transform{T<:Real,N}(X::AbstractArray{T,N}, cs::StandardScaler) + Xnew = convert(AbstractArray{Float64, N}, X) + transform!(Xnew, cs) + Xnew +end + +function transform(D::AbstractDataFrame, cs::StandardScaler) + Dnew = deepcopy(D) + transform!(Dnew, cs) + Dnew +end diff --git a/test/runtests.jl b/test/runtests.jl index 00c53a1..3f52c72 100644 --- a/test/runtests.jl +++ b/test/runtests.jl @@ -5,8 +5,8 @@ using Base.Test tests = [ "tst_expand.jl" "tst_center.jl" - "tst_rescale.jl" - "tst_featurenormalizer.jl" + "tst_standardize.jl" + "tst_fixedrangescaler.jl" ] for t in tests diff --git a/test/tst_center.jl b/test/tst_center.jl index 2f513c0..922eb53 100644 --- a/test/tst_center.jl +++ b/test/tst_center.jl @@ 
-1,93 +1,126 @@ -e_x = collect(-2:0.5:10) -e_X = expand_poly(e_x, 5) -df = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10]) -df_na = deepcopy(df) -df_na[1, :A] = NA +X = collect(Float64, reshape(1:40, 10, 4)) +x = rand(10) * 10 + +D = DataFrame(A=rand(10), B=collect(1:10), C=[hex(x) for x in 11:20]) +D_NA = deepcopy(D) +D_NA[1, :A] = NA @testset "Array" begin - # Center Vectors - xa = copy(e_x) - @test center!(xa) ≈ mean(e_x) - @test abs(mean(xa)) <= 10e-10 - - xa = copy(e_x) - mu = mean(xa) - center!(xa, mu, obsdim=1) - @test abs(mean(xa)) <= 10e-10 - - xa = copy(e_x) - mu = vec(ones(xa)) - center!(xa, mu, obsdim=1) - @test sum(e_x .- mean(xa)) ≈ length(mu) - - # Center Matrix w/o mu - Xa = copy(e_X) - center!(Xa) - @test abs(sum(mean(Xa, 2))) <= 10e-10 - - Xa = copy(e_X) - center!(Xa, obsdim=1) - @test abs(sum(mean(Xa, 1))) <= 10e-10 - - Xa = copy(e_X) - center!(Xa, ObsDim.First()) - @test abs(sum(mean(Xa, 1))) <= 10e-10 - - Xa = copy(e_X) - center!(Xa, obsdim=2) - @test abs(sum(mean(Xa, 2))) <= 10e-10 - - Xa = copy(e_X) - center!(Xa, ObsDim.Last()) - @test abs(sum(mean(Xa, 2))) <= 10e-10 - - - # Center Matrix with mu as input - Xa = copy(e_X) - mu = vec(mean(Xa, 1)) - center!(Xa, mu, obsdim=1) - @test abs(sum(mean(Xa, 1))) <= 10e-10 - - Xa = copy(e_X) - mu = vec(mean(Xa, 2)) - center!(Xa, mu, obsdim=2) - @test abs(sum(mean(Xa, 2))) <= 10e-10 - - Xa = copy(e_X) - mu = vec(mean(Xa, 2)) - center!(Xa, mu, ObsDim.Last()) - @test abs(sum(mean(Xa, 2))) <= 10e-10 + XX = deepcopy(X) + mu = center!(XX, obsdim=1) + @test sum(abs.(mean(XX, 1))) == 0 + @test all(std(XX, 1) .== std(X, 1)) + @test all(mu .== vec(mean(X, 1))) + + XX = deepcopy(X) + mu = center!(XX, ObsDim.First()) + @test sum(abs.(mean(XX, 1))) == 0 + @test all(std(XX, 1) .== std(X, 1)) + @test all(mu .== vec(mean(X, 1))) + + XX = deepcopy(X) + mu = center!(XX, ObsDim.Last()) + @test sum(abs.(mean(XX, 2))) == 0 + @test all(std(XX, 2) .== std(X, 2)) + @test all(mu .== vec(mean(X, 2))) + + XX = 
deepcopy(X) + mu = center!(XX) + @test sum(abs.(mean(XX, 2))) == 0 + @test all(std(XX, 2) .== std(X, 2)) + @test all(mu .== vec(mean(X, 2))) + + XX = deepcopy(X) + mu = vec(mean(X, 1)) + center!(XX, mu, obsdim=1) + @test sum(abs.(mean(XX, 1))) == 0 + @test all(std(XX, 1) .== std(X, 1)) + + XX = deepcopy(X) + mu = vec(mean(X, 1)) + center!(XX, mu, ObsDim.First()) + @test sum(abs.(mean(XX, 1))) == 0 + @test all(std(XX, 1) .== std(X, 1)) + + XX = deepcopy(X) + mu = vec(mean(XX, 2)) + center!(XX, mu, obsdim=2) + @test sum(abs.(mean(XX, 2))) == 0 + @test all(std(XX, 2) .== std(X, 2)) + + XX = deepcopy(X) + mu = vec(mean(XX, 2)) + center!(XX, mu, ObsDim.Last()) + @test sum(abs.(mean(XX, 2))) == 0 + @test all(std(XX, 2) .== std(X, 2)) + + XX = deepcopy(X) + mu = vec(mean(X[:,[1,3]], 1)) + center!(XX, mu, obsdim=1, operate_on=[1, 3]) + @test sum(abs.(mean(XX[:,[1,3]], 1))) == 0 + @test all(XX[:,2] .== X[:,2]) + @test all(std(XX, 1) .== std(X, 1)) + + XX = deepcopy(X) + mu = vec(mean(X[[1,3],:], 2)) + center!(XX, mu, obsdim=2, operate_on=[1, 3]) + @test sum(abs.(mean(XX[[1,3],:], 2))) == 0 + @test all(XX[2,:] .== X[2,:]) + @test all(std(XX, 2) .== std(X, 2)) + println() + + xx = deepcopy(x) + center!(xx) + @test mean(xx) <= 10e-10 + + xx = deepcopy(x) + mu = mean(xx) + center!(xx, mu) + @test mean(xx) <= 10e-10 + + xx = deepcopy(x) + mu = ones(xx) + center!(xx, mu) + @test mean(xx) - mean(x) ≈ -1 + + xx = deepcopy(x) + mu = ones(xx) + center!(xx, mu) + @test mean(xx) - mean(x) ≈ -1 end @testset "DataFrame" begin # Center DataFrame - D = copy(df) - mu_check = [mean(D[colname]) for colname in names(D)[1:2]] - mu = center!(D) - @test length(mu) == 2 - @test abs(sum(mu .- mu_check)) <= 10e-10 - - D = copy(df) - mu_check = [mean(D[colname]) for colname in names(D)[1:2]] - mu = center!(D, [:A, :B]) - @test abs(sum(mu .- mu_check)) <= 10e-10 - - D = copy(df) - mu_check = [mean(D[colname]) for colname in names(D)[1:2]] - mu = center!(D, [:A, :B], mu_check) - @test 
abs(sum([mean(D[colname]) for colname in names(D)[1:2]])) <= 10e-10 - - # skip columns that contain NA values - D = copy(df_na) - mu = center!(D, [:A, :B]) - @test isna(D[1, :A]) - @test all(D[2:end, :A] .== df_na[2:end, :A]) - @test abs(mean(D[:B])) < 10e-10 - - D = copy(df_na) - mu_check = [mean(D[colname]) for colname in names(D)[1:2]] - mu = center!(D, [:A, :B], mu_check) - @test isna(D[1, :A]) - @test all(D[2:end, :A] .== df_na[2:end, :A]) - @test abs(mean(D[:B])) < 10e-10 + DD = deepcopy(D) + center!(DD) + @test abs.(mean(DD[:A])) <= 10e-10 + @test abs.(mean(DD[:B])) <= 10e-10 + @test all(DD[:C] .== D[:C]) + + DD = deepcopy(D) + center!(DD, operate_on=[:B]) + @test all(DD[:A] .== D[:A]) + @test abs.(mean(DD[:B])) <= 10e-10 + @test all(DD[:C] .== D[:C]) + + DD = deepcopy(D) + mu = center!(DD, operate_on=[:A, :B]) + @test abs.(mean(DD[:A])) <= 10e-10 + @test abs.(mean(DD[:B])) <= 10e-10 + @test all(DD[:C] .== D[:C]) + @test all(mu .== [mean(D[:A]), mean(D[:B])]) + + DD = deepcopy(D) + mu = [mean(D[:A]), mean(D[:B])] + @test all(center!(DD, mu, operate_on=[:A, :B]) .== mu) + @test abs.(mean(DD[:A])) <= 10e-10 + @test abs.(mean(DD[:B])) <= 10e-10 + @test all(DD[:C] .== D[:C]) + + DD = deepcopy(D_NA) + center!(DD) + @test all(DD[2:end, :A] .== D[2:end, :A]) + @test abs.(mean(DD[:B])) <= 10e-10 + @test all(DD[:C] .== D[:C]) + @test isna(DD[1, :A]) end diff --git a/test/tst_featurenormalizer.jl b/test/tst_featurenormalizer.jl deleted file mode 100644 index 05906f2..0000000 --- a/test/tst_featurenormalizer.jl +++ /dev/null @@ -1,13 +0,0 @@ -@testset "Test FeatureNormalizer model" begin - e_x = collect(-5:.1:5) - e_X = [e_x e_x.^2 e_x.^3]' - - cs = fit(FeatureNormalizer, e_X) - @test vec(mean(e_X, 2)) ≈ cs.offset - @test vec(std(e_X, 2)) ≈ cs.scale - - Xa = predict(cs, e_X) - @test Xa != e_X - @test abs(sum(mean(Xa, 2))) <= 10e-10 - @test std(Xa, 2) ≈ [1, 1, 1] -end diff --git a/test/tst_fixedrangescaler.jl b/test/tst_fixedrangescaler.jl new file mode 100644 index 
0000000..7c3b50a --- /dev/null +++ b/test/tst_fixedrangescaler.jl @@ -0,0 +1,108 @@ +X = collect(Float64, reshape(1:40, 10, 4)) +x = rand(10) * 10 + +D = DataFrame(A=rand(10), B=collect(1:10), C=[hex(x) for x in 11:20]) +D_NA = deepcopy(D) +D_NA[1, :A] = NA +@testset "Array" begin + scaler = fit(FixedRangeScaler, X) + XX = transform(X, scaler) + @test mean(XX[:,end]) ≈ 1 + @test mean(XX[:,1]) ≈ 0 + @test maximum(XX) == 1 + @test minimum(XX) == 0 + + scaler = fit(FixedRangeScaler, X, obsdim=1) + XX = transform(X, scaler) + @test mean(XX[1,:]) ≈ 0 + @test mean(XX[end,:]) ≈ 1 + @test maximum(XX) == 1 + @test minimum(XX) == 0 + + scaler = fit(FixedRangeScaler, X, -2, 2) + XX = transform(X, scaler) + @test mean(XX[:,end]) ≈ 2 + @test mean(XX[:,1]) ≈ -2 + @test maximum(XX) == 2 + @test minimum(XX) == -2 + + scaler = fit(FixedRangeScaler, X, -2, 2, obsdim=1) + XX = transform(X, scaler) + @test mean(XX[1,:]) ≈ -2 + @test mean(XX[end,:]) ≈ 2 + @test maximum(XX) == 2 + @test minimum(XX) == -2 + + scaler = fit(FixedRangeScaler, X, -2, 2, obsdim=2) + XX = transform(X, scaler) + @test mean(minimum(XX, 2)) ≈ -2 + @test mean(maximum(XX, 2)) ≈ 2 + @test maximum(XX) == 2 + @test minimum(XX) == -2 + + scaler = fit(FixedRangeScaler, X, -2, 2, obsdim=1, operate_on=[1,2]) + XX = transform(X, scaler) + @test mean(minimum(XX[:,[1,2]], 1)) ≈ -2 + @test mean(maximum(XX[:,[1,2]], 1)) ≈ 2 + + scaler = fit(FixedRangeScaler, X, -2, 2, obsdim=2, operate_on=[1,2]) + XX = transform(X, scaler) + @test mean(minimum(XX[[1,2],:], 2)) ≈ -2 + @test mean(maximum(XX[[1,2],:], 2)) ≈ 2 + + scaler = fit(FixedRangeScaler, X, -2, 2, obsdim=2, operate_on=[1,2]) + XX = deepcopy(X) + transform!(XX, scaler) + @test mean(minimum(XX[[1,2],:], 2)) ≈ -2 + @test mean(maximum(XX[[1,2],:], 2)) ≈ 2 +end + +@testset "DataFrame" begin + scaler = fit(FixedRangeScaler, D) + DD = transform(D, scaler) + @test minimum(DD[:A]) == 0 + @test maximum(DD[:A]) == 1 + + scaler = fit(FixedRangeScaler, D , -1, 1) + DD = transform(D, 
scaler) + @test minimum(DD[:A]) == -1 + @test maximum(DD[:A]) == 1 + @test minimum(DD[:B]) == -1 + @test maximum(DD[:B]) == 1 + + scaler = fit(FixedRangeScaler, D, -1, 1, operate_on=[:A]) + DD = transform(D, scaler) + @test minimum(DD[:A]) == -1 + @test maximum(DD[:A]) == 1 + @test minimum(DD[:B]) == minimum(D[:B]) + @test maximum(DD[:B]) == maximum(D[:B]) + + scaler = fit(FixedRangeScaler, D, -1, 1) + DD = transform(D_NA, scaler) + @test isna(DD[1,:A]) + @test DD[end,:A] == D_NA[end,:A] + @test minimum(DD[:B]) == -1 + @test maximum(DD[:B]) == 1 + + scaler = fit(FixedRangeScaler, D, -1, 1, operate_on=[:A, :B]) + DD = transform(D_NA, scaler) + @test isna(DD[1,:A]) + @test DD[end,:A] == D_NA[end,:A] + @test minimum(DD[:B]) == -1 + @test maximum(DD[:B]) == 1 + + scaler = fit(FixedRangeScaler, D, -1, 1, operate_on=[:A, :B, :C]) + DD = transform(D_NA, scaler) + @test isna(DD[1,:A]) + @test DD[end,:A] == D_NA[end,:A] + @test minimum(DD[:B]) == -1 + @test maximum(DD[:B]) == 1 + + DD = deepcopy(D) + scaler = fit(FixedRangeScaler, D, -1, 1, operate_on=[:A, :B]) + transform!(DD, scaler) + @test minimum(DD[:A]) == -1 + @test maximum(DD[:A]) == 1 + @test minimum(DD[:B]) == -1 + @test maximum(DD[:B]) == 1 +end diff --git a/test/tst_rescale.jl b/test/tst_rescale.jl deleted file mode 100644 index ef05bc8..0000000 --- a/test/tst_rescale.jl +++ /dev/null @@ -1,93 +0,0 @@ -e_x = collect(-2:0.5:10) -e_X = expand_poly(e_x, 5) -df = DataFrame(A=rand(10), B=collect(1:10), C=[string(x) for x in 1:10]) -df_na = deepcopy(df) -df_na[1, :A] = NA - -@testset "Array" begin - # Rescale Vector - xa = copy(e_x) - mu, sigma = rescale!(xa) - @test mu ≈ mean(e_x) - @test sigma ≈ std(e_x) - @test abs(mean(xa)) <= 10e-10 - @test std(xa) ≈ 1 - - xa = copy(e_x) - mu, sigma = rescale!(xa, mu, sigma) - @test abs(mean(xa)) <= 10e-10 - @test std(xa) ≈ 1 - - xa = copy(e_x) - mu, sigma = rescale!(xa, mu, sigma, obsdim=1) - @test abs(mean(xa)) <= 10e-10 - @test std(xa) ≈ 1 - - xa = copy(e_x) - mu = copy(e_x) 
.- 1
-    sigma = ones(e_x)
-    mu, sigma = rescale!(xa, mu, sigma, obsdim=1)
-    @test mean(xa) ≈ 1
-
-    Xa = copy(e_X)
-    rescale!(Xa)
-    @test abs(sum(mean(Xa, 2))) <= 10e-10
-    @test std(Xa, 2) ≈ [1, 1, 1, 1, 1]
-
-    Xa = copy(e_X)
-    rescale!(Xa, obsdim=2)
-    @test abs(sum(mean(Xa, 2))) <= 10e-10
-    @test std(Xa, 2) ≈ [1, 1, 1, 1, 1]
-
-    Xa = copy(e_X)
-    rescale!(Xa, obsdim=1)
-    @test abs(sum(mean(Xa, 1))) <= 10e-10
-
-    Xa = copy(e_X)
-    mu = vec(mean(Xa, 1))
-    sigma = vec(std(Xa, 1))
-    rescale!(Xa, mu, sigma, obsdim=1)
-    @test abs(sum(mean(Xa, 1))) <= 10e-10
-
-    Xa = copy(e_X)
-    mu = vec(mean(Xa, 2))
-    sigma = vec(std(Xa, 2))
-    rescale!(Xa, mu, sigma, obsdim=2)
-    @test abs(sum(mean(Xa, 2))) <= 10e-10
-end
-
-@testset "DataFrame" begin
-    D = copy(df)
-    mu, sigma = rescale!(D)
-    @test abs(sum([mean(D[colname]) for colname in names(D)[1:2]])) <= 10e-10
-    @test mean([std(D[colname]) for colname in names(D)[1:2]]) - 1 <= 10e-10
-
-    D = copy(df)
-    mu, sigma = rescale!(D, [:A, :B])
-    @test abs(sum([mean(D[colname]) for colname in names(D)[1:2]])) <= 10e-10
-    @test mean([std(D[colname]) for colname in names(D)[1:2]]) - 1 <= 10e-10
-
-    D = copy(df)
-    mu_check = [mean(D[colname]) for colname in names(D)[1:2]]
-    sigma_check = [std(D[colname]) for colname in names(D)[1:2]]
-    mu, sigma = rescale!(D, [:A, :B], mu_check, sigma_check)
-    @test abs(sum([mean(D[colname]) for colname in names(D)[1:2]])) <= 10e-10
-    @test mean([std(D[colname]) for colname in names(D)[1:2]]) - 1 <= 10e-10
-
-    # skip columns that contain NA values
-    D = copy(df_na)
-    mu, sigma = rescale!(D, [:A, :B])
-    @test isna(D[1, :A])
-    @test all(D[2:end, :A] .== df_na[2:end, :A])
-    @test abs(mean(D[:B])) < 10e-10
-    @test abs(std(D[:B])) - 1 < 10e-10
-
-    D = copy(df_na)
-    mu_check = [mean(D[colname]) for colname in names(D)[1:2]]
-    sigma_check = [std(D[colname]) for colname in names(D)[1:2]]
-    mu, sigma = rescale!(D, [:A, :B], mu_check, sigma_check)
-    #= @test isna(D[1, :A]) =#
-    #= @test all(D[2:end, :A] .== df_na[2:end, :A]) =#
-    #= @test abs(mean(D[:B])) < 10e-10 =#
-    #= @test (abs(std(D[:B])) - 1) < 10e-10 =#
-end
diff --git a/test/tst_standardize.jl b/test/tst_standardize.jl
new file mode 100644
index 0000000..44a99a3
--- /dev/null
+++ b/test/tst_standardize.jl
@@ -0,0 +1,157 @@
+X = collect(Float64, reshape(1:40, 10, 4))
+x = rand(10) * 10
+
+D = DataFrame(A=rand(10), B=collect(1:10), C=[hex(x) for x in 11:20])
+D_NA = deepcopy(D)
+D_NA[1, :A] = NA
+
+@testset "Array" begin
+    # Rescale Vector
+    xx = deepcopy(x)
+    mu, sigma = standardize!(xx)
+    @test mu ≈ mean(x)
+    @test sigma ≈ std(x)
+    @test abs(mean(xx)) <= 10e-10
+    @test std(xx) ≈ 1
+
+    xx = deepcopy(x)
+    mu, sigma = standardize!(xx, mu, sigma)
+    @test abs(mean(xx)) <= 10e-10
+    @test std(xx) ≈ 1
+
+    xx = deepcopy(x)
+    mu, sigma = standardize!(xx, mu, sigma, obsdim=1)
+    @test abs(mean(xx)) <= 10e-10
+    @test std(xx) ≈ 1
+
+    xx = deepcopy(x)
+    mu = deepcopy(x) .- 1
+    sigma = ones(x)
+    mu, sigma = standardize!(xx, mu, sigma, obsdim=1)
+    @test mean(xx) ≈ 1
+
+    # Rescale Matrix
+    XX = deepcopy(X)
+    standardize!(XX)
+    @test abs(sum(mean(XX, 2))) <= 10e-10
+    @test std(XX, 2) ≈ ones(size(X, 1))
+
+    XX = deepcopy(X)
+    standardize!(XX, obsdim=2)
+    @test abs(sum(mean(XX, 2))) <= 10e-10
+    @test std(XX, 2) ≈ ones(size(X, 1))
+
+    XX = deepcopy(X)
+    standardize!(XX, obsdim=1)
+    @test abs(sum(mean(XX, 1))) <= 10e-10
+    @test vec(std(XX, 1)) ≈ ones(size(X, 2))
+
+    XX = deepcopy(X)
+    mu = vec(mean(XX, 1))
+    sigma = vec(std(XX, 1))
+    standardize!(XX, mu, sigma, obsdim=1)
+    @test abs(sum(mean(XX, 1))) <= 10e-10
+
+    XX = deepcopy(X)
+    mu = vec(mean(XX, 2))
+    sigma = vec(std(XX, 2))
+    standardize!(XX, mu, sigma, obsdim=2)
+    @test abs(sum(mean(XX, 2))) <= 10e-10
+
+    XX = deepcopy(X)
+    flt = [1,2]
+    standardize!(XX, obsdim=1, operate_on=flt)
+    @test abs(sum(mean(XX[:,flt], 1))) <= 10e-10
+    @test vec(std(XX[:,flt], 1)) ≈ ones(2)
+    @test all(X[:,[3,4]] .== XX[:,[3,4]])
+
+    XX = deepcopy(X)
+    flt = [2,8]
+    mu = vec(mean(XX, 2))
+    sigma = vec(std(XX, 2))
+    standardize!(XX, mu[flt], sigma[flt], obsdim=2, operate_on=flt)
+    @test abs(sum(mean(XX[flt,:], 2))) <= 10e-10
+
+    scaler = fit(StandardScaler, X)
+    XX = transform(X, scaler)
+    @test abs(sum(mean(XX, 2))) <= 10e-10
+    @test std(XX, 2) ≈ ones(size(X, 1))
+
+    scaler = fit(StandardScaler, X, obsdim=2)
+    XX = transform(X, scaler)
+    @test abs(sum(mean(XX, 2))) <= 10e-10
+    @test std(XX, 2) ≈ ones(size(X, 1))
+
+    scaler = fit(StandardScaler, X, obsdim=1)
+    XX = transform(X, scaler)
+    @test abs(sum(mean(XX, 1))) <= 10e-10
+    @test vec(std(XX, 1)) ≈ ones(size(X, 2))
+
+    flt = [1,4]
+    scaler = fit(StandardScaler, X, obsdim=1, operate_on=flt)
+    XX = transform(X, scaler)
+    xx = transform(vec(X[1,:]), scaler)
+    @test abs(sum(mean(XX[:,flt], 1))) <= 10e-10
+    @test vec(std(XX[:,flt], 1)) ≈ ones(size(X[:,flt], 2))
+    @test all(xx .== XX[1,:])
+
+    XX = deepcopy(X)
+    xx = vec(X[1,:])
+    flt = [1,4]
+    scaler = fit(StandardScaler, X, obsdim=1, operate_on=flt)
+    transform!(XX, scaler)
+    transform!(xx, scaler)
+    @test abs(sum(mean(XX[:,flt], 1))) <= 10e-10
+    @test vec(std(XX[:,flt], 1)) ≈ ones(size(X[:,flt], 2))
+    @test all(xx .== XX[1,:])
+end
+
+@testset "DataFrame" begin
+    DD = deepcopy(D)
+    mu, sigma = standardize!(DD)
+    @test abs(sum([mean(DD[colname]) for colname in names(DD)[1:2]])) <= 10e-10
+    @test mean([std(DD[colname]) for colname in names(DD)[1:2]]) - 1 <= 10e-10
+
+    DD = deepcopy(D)
+    mu, sigma = standardize!(DD, operate_on=[:A,:B,:C])
+    @test abs(sum([mean(DD[colname]) for colname in names(DD)[1:2]])) <= 10e-10
+    @test mean([std(DD[colname]) for colname in names(DD)[1:2]]) - 1 <= 10e-10
+
+    DD = deepcopy(D)
+    mu, sigma = standardize!(DD, mu, sigma, operate_on=[:A,:B])
+    @test abs(sum([mean(DD[colname]) for colname in names(DD)[1:2]])) <= 10e-10
+    @test mean([std(DD[colname]) for colname in names(DD)[1:2]]) - 1 <= 10e-10
+
+    # skip columns that contain NA values
+    DD = deepcopy(D_NA)
+    mu, sigma = standardize!(DD)
+    @test isna(DD[1, :A])
+    @test all(DD[2:end, :A] .== D_NA[2:end, :A])
+    @test abs(mean(DD[:B])) < 10e-10
+    @test abs(std(DD[:B])) - 1 < 10e-10
+
+    scaler = fit(StandardScaler, D)
+    DD = transform(D, scaler)
+    @test mean(DD[:A]) <= 10e-10
+    @test std(DD[:A]) - 1 <= 10e-10
+    @test mean(DD[:B]) <= 10e-10
+    @test std(DD[:B]) - 1 <= 10e-10
+    @test all(DD[:C] .== D[:C])
+
+    scaler = fit(StandardScaler, D, operate_on=[:A, :C])
+    DD = transform(D, scaler)
+    @test mean(DD[:A]) <= 10e-10
+    @test std(DD[:A]) - 1 <= 10e-10
+    @test all(DD[:B] .== D[:B])
+    @test all(DD[:C] .== D[:C])
+    @test mean(D[:A]) != mean(DD[:A])
+
+    DD = deepcopy(D)
+    scaler = fit(StandardScaler, DD, operate_on=[:A, :C])
+    transform!(DD, scaler)
+    @test mean(DD[:A]) <= 10e-10
+    @test std(DD[:A]) - 1 <= 10e-10
+    @test all(DD[:B] .== D[:B])
+    @test all(DD[:C] .== D[:C])
+    @test mean(D[:A]) != mean(DD[:A])
+end