Add fit_mle methods #1670

Open · wants to merge 19 commits into base: master
Changes from 11 commits
1 change: 1 addition & 0 deletions docs/Project.toml
@@ -1,4 +1,5 @@
[deps]
Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
GR = "28b8d3ca-fb5f-59d9-8090-bfdbd6d07a71"

17 changes: 17 additions & 0 deletions docs/src/fit.md
@@ -42,6 +42,7 @@ The `fit_mle` method has been implemented for the following distributions:
- [`Binomial`](@ref)
- [`Categorical`](@ref)
- [`DiscreteUniform`](@ref)
- [`Dirac`](@ref)
- [`Exponential`](@ref)
- [`LogNormal`](@ref)
- [`Normal`](@ref)
@@ -60,6 +61,7 @@ The `fit_mle` method has been implemented for the following distributions:
- [`Multinomial`](@ref)
- [`MvNormal`](@ref)
- [`Dirichlet`](@ref)
- [`Product`](@ref)

For most of these distributions, the usage is as described above. For a few special distributions that require additional information for estimation, we have to use a modified interface:

@@ -74,6 +76,21 @@ fit_mle(Categorical, x) # equivalent to fit_mle(Categorical, max(x), x)
fit_mle(Categorical, x, w)
```

It is also possible to pass a distribution instance directly, as `fit_mle(d::Distribution, x[, w])`. This form avoids the extra arguments:
```julia
fit_mle(Binomial(n, 0.1), x)
# equivalent to fit_mle(Binomial, ntrials(Binomial(n, 0.1)), x); the parameter 0.1 is not used

fit_mle(Categorical(p), x)
# equivalent to fit_mle(Categorical, ncategories(Categorical(p)), x); only the length of p is used, not its values

d = product_distribution([Exponential(0.5), Normal(11.3, 3.2)])
fit_mle(d, x)
# equivalent to product_distribution([fit_mle(Exponential, x[1, :]), fit_mle(Normal, x[2, :])]). Again, the parameters of d are not used.
```

Comment on lines +80 to +88

Member: IMO this different behaviour of the distribution instances is very confusing and exactly shows the problem I had in mind in https://github.com/JuliaStats/Distributions.jl/pull/1670/files#r1092571161. There's no clear and obvious way to tell which properties of the distribution will be fixed when fitting. And even in the case of the product distribution the resulting code is not (significantly) shorter than the current code.

Author: For Categorical, the current fit_mle(Categorical, x) = fit_mle(Categorical, max(x), x) could also lead to confusion (and to an error in the really unlucky case where there are 0 samples of the last category), and is not so different from fit_mle(Categorical, ncategories(Categorical(p)), x).

It is not about code length. See my comment below.
Note that for standard distributions, the values of the parameters of `d` are not used in `fit_mle`; only the “structure” of `d` is passed into `fit_mle`.
However, for complex maximum likelihood estimation requiring optimization, e.g., the EM algorithm, one could use `d` as an initial guess.
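The instance form makes generic code straightforward to write. A minimal sketch (hedged: the distributions and sample sizes are illustrative, and the instance methods are the ones introduced in this PR):

```julia
using Distributions

dists = [Binomial(10, 0.5), Categorical([0.2, 0.3, 0.5]), Exponential(2.0)]
samples = [rand(d, 100) for d in dists]
# one call per distribution, no per-type extra arguments (ntrials, ncategories, ...)
fits = [fit_mle(d, x) for (d, x) in zip(dists, samples)]
```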
Member: This is yet another behaviour and IMO adds to the confusion.

## Sufficient Statistics

For many distributions, the estimation can be based on (sums of) sufficient statistics computed from a dataset. To simplify implementation, for such distributions, we implement the `suffstats` method instead of `fit_mle` directly:
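The two-step pattern can be sketched as follows (a minimal illustration with `Normal`, for which `suffstats` is already implemented):

```julia
using Distributions

x = randn(1000)
ss = suffstats(Normal, x)   # accumulates n, the sum, and the sum of squares
d = fit_mle(Normal, ss)     # closed-form MLE computed from the sufficient statistics
# same result as fit_mle(Normal, x)
```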
2 changes: 1 addition & 1 deletion docs/src/mixture.md
@@ -106,4 +106,4 @@ rand!(::AbstractMixtureModel, ::AbstractArray)
## Estimation

There are several methods for the estimation of mixture models from data, and this problem remains an open research topic.
This package does not provide facilities for estimating mixture models. One can resort to other packages, *e.g.* [*GaussianMixtures.jl*](https://github.com/davidavdav/GaussianMixtures.jl), for this purpose.
This package does not provide facilities for estimating mixture models. One can resort to other packages, *e.g.* [*GaussianMixtures.jl*](https://github.com/davidavdav/GaussianMixtures.jl) or [*ExpectationMaximization.jl*](https://github.com/dmetivie/ExpectationMaximization.jl), for this purpose.
4 changes: 3 additions & 1 deletion src/genericfit.jl
@@ -1,6 +1,6 @@
# generic functions for distribution fitting

function suffstats(dt::Type{D}, xs...) where D<:Distribution
function suffstats(dt::Type{D}, xs...) where {D<:Distribution}
argtypes = tuple(D, map(typeof, xs)...)
error("suffstats is not implemented for $argtypes.")
end
@@ -30,5 +30,7 @@ fit_mle(dt::Type{D}, x::AbstractArray, w::AbstractArray) where {D<:UnivariateDis
fit_mle(dt::Type{D}, x::AbstractMatrix) where {D<:MultivariateDistribution} = fit_mle(D, suffstats(D, x))
fit_mle(dt::Type{D}, x::AbstractMatrix, w::AbstractArray) where {D<:MultivariateDistribution} = fit_mle(D, suffstats(D, x, w))

fit_mle(dt::D, args...) where {D<:Distribution} = fit_mle(typeof(dt), args...)

fit(dt::Type{D}, x) where {D<:Distribution} = fit_mle(D, x)
fit(dt::Type{D}, args...) where {D<:Distribution} = fit_mle(D, args...)
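The new fallback above simply forwards an instance to its type. A hedged sketch of what this enables for distributions that need no extra structural arguments:

```julia
using Distributions

x = rand(Exponential(0.5), 200)
d1 = fit_mle(Exponential(0.5), x)  # instance form, forwarded to fit_mle(typeof(d), x)
d2 = fit_mle(Exponential, x)       # type form
# d1 and d2 carry the same fitted rate; the 0.5 in the instance is ignored
```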
20 changes: 18 additions & 2 deletions src/multivariate/product.jl
@@ -27,15 +27,17 @@ function Product(v::V) where {S<:ValueSupport,T<:UnivariateDistribution{S},V<:Ab
"`Product(v)` is deprecated, please use `product_distribution(v)`",
:Product,
)
return Product{S, T, V}(v)
return Product{S,T,V}(v)
end

length(d::Product) = length(d.v)
function Base.eltype(::Type{<:Product{S,T}}) where {S<:ValueSupport,
T<:UnivariateDistribution{S}}
T<:UnivariateDistribution{S}}
return eltype(T)
end

params(g::Product) = params.(g.v)

Comment on lines +39 to +40

Member: This seems unrelated:

Suggested change: remove the line `params(g::Product) = params.(g.v)`
_rand!(rng::AbstractRNG, d::Product, x::AbstractVector{<:Real}) =
map!(Base.Fix1(rand, rng), x, d.v)
function _logpdf(d::Product, x::AbstractVector{<:Real})
@@ -60,3 +62,17 @@ maximum(d::Product) = map(maximum, d.v)
function product_distribution(dists::V) where {S<:ValueSupport,T<:UnivariateDistribution{S},V<:AbstractVector{T}}
return Product{S,T,V}(dists)
end

#### Fitting

"""
fit_mle(g::Product, x::AbstractMatrix)
fit_mle(g::Product, x::AbstractMatrix, γ::AbstractVector)

The `fit_mle` of a multivariate `Product` distribution `g` is the `product_distribution` of the `fit_mle` of each component of `g`.
"""
function fit_mle(g::Product, x::AbstractMatrix, args...)
d = size(x, 1)
length(g) == d || throw(DimensionMismatch("The dimensions of g and x are inconsistent."))
return product_distribution([fit_mle(g.v[s], y, args...) for (s, y) in enumerate(eachrow(x))])
end

Member: In addition to what I mentioned above, I think there's another issue here: it is not possible to pass different arguments/options to the fits of the different components, which makes this approach much less flexible than just fitting each component manually.

Author: I wondered about that: is there a way to add kwargs = [kwargs[i] for i in 1:d] and apply them to each distribution?
About the arguments, I believe in most situations you have a sample and fit the whole ProductDistribution at once.
Hence, IMO the code above corresponds to the most typical situation. If you think about a situation where some distributions have weighted samples and some others not, one could either try to define a method for this case or, as you said, do it manually.
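A usage sketch of the `Product` fit above (hedged: the components and sample are illustrative):

```julia
using Distributions

g = product_distribution([Exponential(1.0), Normal(0.0, 1.0)])
x = rand(g, 1000)   # 2×1000 matrix, one column per observation
ĝ = fit_mle(g, x)   # fits Exponential to x[1, :] and Normal to x[2, :]
```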
12 changes: 12 additions & 0 deletions src/univariate/continuous/laplace.jl
@@ -136,3 +136,15 @@ function fit_mle(::Type{<:Laplace}, x::AbstractArray{<:Real})
xc .= abs.(x .- m)
return Laplace(m, mean(xc))
end

function fit_mle(::Type{<:Laplace}, x::AbstractArray{<:Real}, w::AbstractArray{<:Real})
n = length(x)
if n != length(w)
throw(DimensionMismatch("Inconsistent array lengths."))
end
xc = similar(x)
copyto!(xc, x)
m = median(xc, weights(w)) # https://github.com/JuliaStats/StatsBase.jl/blob/0b64a9c8a16355da16469d0fe5016e0fdf099af5/src/weights.jl#L788
xc .= abs.(x .- m)
return Laplace(m, mean(xc, weights(w)))
end
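A sketch of how the weighted Laplace fit could be exercised (assuming the method from this diff; the data and weights are made up):

```julia
using Distributions, StatsBase

x = [1.0, 2.0, 2.0, 10.0]
w = [1.0, 1.0, 1.0, 0.0]   # zero weight effectively drops the outlier
d = fit_mle(Laplace, x, w)
# location is the weighted median; scale is the weighted mean absolute deviation
```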
1 change: 1 addition & 0 deletions src/univariate/discrete/binomial.jl
@@ -196,6 +196,7 @@ suffstats(::Type{T}, data::BinomData, w::AbstractArray{<:Real}) where {T<:Binomi

fit_mle(::Type{T}, ss::BinomialStats) where {T<:Binomial} = T(ss.n, ss.ns / (ss.ne * ss.n))

fit_mle(d::T, x::AbstractArray{<:Integer}) where {T<:Binomial} = fit_mle(T, suffstats(T, ntrials(d), x))
Member: I'm generally not a big fan of supporting both instances and types if it is not really needed (https://docs.julialang.org/en/v1/manual/style-guide/#Avoid-confusion-about-whether-something-is-an-instance-or-a-type). In this line here, I'm not sure there is a very strong argument for adding the method based on the instance, since the same functionality already exists. More generally, I think it could be confusing to know which parameters of d are fixed and which ones are optimized with MLE.

Author (@dmetivie, Feb 1, 2023): The arguments are the ones I made before.
Mostly, it allows using fit_mle(dist, x) in very generic code without having to care about extra arguments for some distributions like ProductDistribution, Binomial and Categorical.

For me, the point is not really adding functionality to Binomial but all that comes with the form fit_mle(dist, x). The only cost is some possible minor confusion.
The docs should be clear on the difference between instances and types; I tried to write something in that sense.

fit_mle(::Type{T}, n::Integer, x::AbstractArray{<:Integer}) where {T<:Binomial} = fit_mle(T, suffstats(T, n, x))
fit_mle(::Type{T}, n::Integer, x::AbstractArray{<:Integer}, w::AbstractArray{<:Real}) where {T<:Binomial} = fit_mle(T, suffstats(T, n, x, w))
fit_mle(::Type{T}, data::BinomData) where {T<:Binomial} = fit_mle(T, suffstats(T, data))
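A sketch of the instance form for `Binomial` (assuming this diff's method; `n` and the data are illustrative):

```julia
using Distributions

x = rand(Binomial(10, 0.3), 500)
d = fit_mle(Binomial(10, 0.5), x)  # n = 10 is read from the instance; 0.5 is ignored
# equivalent to fit_mle(Binomial, 10, x)
```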
37 changes: 19 additions & 18 deletions src/univariate/discrete/categorical.jl
@@ -26,7 +26,7 @@ External links:
"""
const Categorical{P<:Real,Ps<:AbstractVector{P}} = DiscreteNonParametric{Int,P,Base.OneTo{Int},Ps}

function Categorical{P,Ps}(p::Ps; check_args::Bool=true) where {P<:Real, Ps<:AbstractVector{P}}
function Categorical{P,Ps}(p::Ps; check_args::Bool=true) where {P<:Real,Ps<:AbstractVector{P}}
@check_args Categorical (p, isprobvec(p), "vector p is not a probability vector")
return Categorical{P,Ps}(Base.OneTo(length(p)), p; check_args=check_args)
end
Expand All @@ -36,7 +36,7 @@ Categorical(p::AbstractVector{P}; check_args::Bool=true) where {P<:Real} =

function Categorical(k::Integer; check_args::Bool=true)
@check_args Categorical (k, k >= 1, "at least one category is required")
return Categorical{Float64,Vector{Float64}}(Base.OneTo(k), fill(1/k, k); check_args=false)
return Categorical{Float64,Vector{Float64}}(Base.OneTo(k), fill(1 / k, k); check_args=false)
end

Categorical(probabilities::Real...; check_args::Bool=true) = Categorical([probabilities...]; check_args=check_args)
@@ -49,7 +49,7 @@ convert(::Type{Categorical{P,Ps}}, x::AbstractVector{<:Real}) where {
### Parameters

ncategories(d::Categorical) = support(d).stop
params(d::Categorical{P,Ps}) where {P<:Real, Ps<:AbstractVector{P}} = (probs(d),)
params(d::Categorical{P,Ps}) where {P<:Real,Ps<:AbstractVector{P}} = (probs(d),)
partype(::Categorical{T}) where {T<:Real} = T

### Statistics
Expand All @@ -59,7 +59,7 @@ function median(d::Categorical{T}) where {T<:Real}
p = probs(d)
cp = zero(T)
i = 0
while cp < 1/2 && i <= k
while cp < 1 / 2 && i <= k
i += 1
@inbounds cp += p[i]
end
@@ -87,16 +87,16 @@ function _pdf!(r::AbstractArray, d::Categorical{T}, rgn::UnitRange) where {T<:Re
vr = min(vlast, ncategories(d))
p = probs(d)
if vl > vfirst
for i = 1:(vl - vfirst)
for i = 1:(vl-vfirst)
r[i] = zero(T)
end
end
fm1 = vfirst - 1
for v = vl:vr
r[v - fm1] = p[v]
r[v-fm1] = p[v]
end
if vr < vlast
for i = (vr - vfirst + 2):length(rgn)
for i = (vr-vfirst+2):length(rgn)
r[i] = zero(T)
end
end
@@ -107,7 +107,7 @@ end
# sampling

sampler(d::Categorical{P,Ps}) where {P<:Real,Ps<:AbstractVector{P}} =
AliasTable(probs(d))
AliasTable(probs(d))


### sufficient statistics
Expand All @@ -116,36 +116,36 @@ struct CategoricalStats <: SufficientStats
h::Vector{Float64}
end

function add_categorical_counts!(h::Vector{Float64}, x::AbstractArray{T}) where T<:Integer
for i = 1 : length(x)
function add_categorical_counts!(h::Vector{Float64}, x::AbstractArray{T}) where {T<:Integer}
for i = 1:length(x)
@inbounds xi = x[i]
h[xi] += 1. # cannot use @inbounds, as no guarantee that x[i] is in bound
h[xi] += 1.0 # cannot use @inbounds, as no guarantee that x[i] is in bound
end
h
end

function add_categorical_counts!(h::Vector{Float64}, x::AbstractArray{T}, w::AbstractArray{Float64}) where T<:Integer
function add_categorical_counts!(h::Vector{Float64}, x::AbstractArray{T}, w::AbstractArray{Float64}) where {T<:Integer}
n = length(x)
if n != length(w)
throw(DimensionMismatch("Inconsistent array lengths."))
end
for i = 1 : n
for i = 1:n
@inbounds xi = x[i]
@inbounds wi = w[i]
h[xi] += wi # cannot use @inbounds, as no guarantee that x[i] is in bound
end
h
end

function suffstats(::Type{<:Categorical}, k::Int, x::AbstractArray{T}) where T<:Integer
function suffstats(::Type{<:Categorical}, k::Int, x::AbstractArray{T}) where {T<:Integer}
CategoricalStats(add_categorical_counts!(zeros(k), x))
end

function suffstats(::Type{<:Categorical}, k::Int, x::AbstractArray{T}, w::AbstractArray{Float64}) where T<:Integer
function suffstats(::Type{<:Categorical}, k::Int, x::AbstractArray{T}, w::AbstractArray{Float64}) where {T<:Integer}
CategoricalStats(add_categorical_counts!(zeros(k), x, w))
end

const CategoricalData = Tuple{Int, AbstractArray}
const CategoricalData = Tuple{Int,AbstractArray}

suffstats(::Type{<:Categorical}, data::CategoricalData) = suffstats(Categorical, data...)
suffstats(::Type{<:Categorical}, data::CategoricalData, w::AbstractArray{Float64}) = suffstats(Categorical, data..., w)
@@ -156,11 +156,12 @@ function fit_mle(::Type{<:Categorical}, ss::CategoricalStats)
Categorical(normalize!(ss.h, 1))
end

function fit_mle(::Type{<:Categorical}, k::Integer, x::AbstractArray{T}) where T<:Integer
fit_mle(d::T, x::AbstractArray{<:Integer}) where {T<:Categorical} = fit_mle(T, ncategories(d), x)
function fit_mle(::Type{<:Categorical}, k::Integer, x::AbstractArray{T}) where {T<:Integer}
Categorical(normalize!(add_categorical_counts!(zeros(k), x), 1), check_args=false)
end

function fit_mle(::Type{<:Categorical}, k::Integer, x::AbstractArray{T}, w::AbstractArray{Float64}) where T<:Integer
function fit_mle(::Type{<:Categorical}, k::Integer, x::AbstractArray{T}, w::AbstractArray{Float64}) where {T<:Integer}
Categorical(normalize!(add_categorical_counts!(zeros(k), x, w), 1), check_args=false)
end
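A sketch of the instance form added above (hedged; the data are illustrative):

```julia
using Distributions

x = [1, 2, 2, 3, 3, 3]
d = fit_mle(Categorical(4), x)  # keeps 4 categories even though category 4 is unobserved
# equivalent to fit_mle(Categorical, 4, x);
# with fit_mle(Categorical, x), the unobserved 4th category would be lost
```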

14 changes: 13 additions & 1 deletion src/univariate/discrete/dirac.jl
@@ -50,9 +50,21 @@ logccdf(d::Dirac, x::Real) = x < d.value ? 0.0 : isnan(x) ? NaN : -Inf
quantile(d::Dirac{T}, p::Real) where {T} = 0 <= p <= 1 ? d.value : T(NaN)

mgf(d::Dirac, t) = exp(t * d.value)
cgf(d::Dirac, t) = t*d.value
cgf(d::Dirac, t) = t * d.value
cf(d::Dirac, t) = cis(t * d.value)

#### Sampling

rand(rng::AbstractRNG, d::Dirac) = d.value

#### Fitting

fit_mle(::Type{<:Dirac}, x::AbstractArray{T}) where {T<:Real} = length(unique(x)) == 1 ? Dirac(first(x)) : Dirac(NaN)

function fit_mle(::Type{<:Dirac}, x::AbstractArray{T}, w::AbstractArray{Float64}) where {T<:Real}
    n = length(x)
    if n != length(w)
        throw(DimensionMismatch("Inconsistent array lengths."))
    end
    # only samples with nonzero weight matter; take the value from those, not from x[1],
    # otherwise a zero-weight first entry would be returned
    x_nz = x[findall(!iszero, w)]
    return length(unique(x_nz)) == 1 ? Dirac(first(x_nz)) : Dirac(NaN)
end
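A usage sketch of the `Dirac` fits above (the data are illustrative; zero-weight samples are ignored in the weighted form):

```julia
using Distributions

fit_mle(Dirac, [2, 2, 2])                    # degenerate sample: Dirac(2)
fit_mle(Dirac, [1, 2, 2])                    # not degenerate: Dirac(NaN)
fit_mle(Dirac, [2, 2, 5], [1.0, 1.0, 0.0])   # Dirac(2): the zero-weight 5 is ignored
```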