Add fit_mle methods #1670

Status: Open. Wants to merge 19 commits into base: master.
Changes from 16 commits
1 change: 1 addition & 0 deletions docs/Project.toml
@@ -1,4 +1,5 @@
[deps]
Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
GR = "28b8d3ca-fb5f-59d9-8090-bfdbd6d07a71"

18 changes: 18 additions & 0 deletions docs/src/fit.md
@@ -60,6 +60,7 @@ The `fit_mle` method has been implemented for the following distributions:
- [`Multinomial`](@ref)
- [`MvNormal`](@ref)
- [`Dirichlet`](@ref)
- [`ProductDistribution`](@ref)
dmetivie marked this conversation as resolved.

For most of these distributions, the usage is as described above. For a few special distributions that require additional information for estimation, we have to use a modified interface:

@@ -74,6 +75,23 @@ fit_mle(Categorical, x) # equivalent to fit_mle(Categorical, max(x), x)
fit_mle(Categorical, x, w)
```

It is also possible to pass a distribution instance directly, as in `fit_mle(d::Distribution, x[, w])`. This form avoids the extra arguments:

```julia
fit_mle(Binomial(n, 0.1), x)
# equivalent to fit_mle(Binomial, ntrials(Binomial(n, 0.1)), x); here the parameter 0.1 is not used

fit_mle(Categorical(p), x)
# equivalent to fit_mle(Categorical, ncategories(Categorical(p)), x); here only the length of p is used, not its values

d = product_distribution([Exponential(0.5), Normal(11.3, 3.2)])
fit_mle(d, x)
# equivalent to product_distribution([fit_mle(Exponential, x[1, :]), fit_mle(Normal, x[2, :])]); again, the parameters of d are not used
```

Member commented on lines +80 to +88:

IMO this different behaviour of the distribution instances is very confusing and exactly shows the problem I had in mind in https://github.com/JuliaStats/Distributions.jl/pull/1670/files#r1092571161. There is no clear and obvious way to tell which properties of the distribution will be fixed when fitting. And even in the case of the product distribution, the resulting code is not significantly shorter than the current code.

Author replied:

For `Categorical`, the current `fit_mle(Categorical, x) = fit_mle(Categorical, max(x), x)` can also lead to confusion (and to an error in the really unlucky case where there are zero samples of the last category), and it is not so different from `fit_mle(Categorical, ncategories(Categorical(p)), x)`.

It is not about code length. See my comment below.
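The `max(x)` pitfall the author mentions can be sketched concretely with the existing type-based API (a minimal example, not taken from the diff):

```julia
using Distributions

# Sample from a 4-category distribution in which category 4 happens never to appear.
x = [1, 2, 2, 3, 1, 3]

d1 = fit_mle(Categorical, x)     # infers k = max(x) = 3 categories
d2 = fit_mle(Categorical, 4, x)  # explicit k = 4 keeps the empty fourth category

ncategories(d1)  # 3
ncategories(d2)  # 4
```

With the instance form proposed in this PR, `fit_mle(Categorical(p), x)` would behave like `d2`, taking `k` from `length(p)` rather than from the sample.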

Note that for standard distributions, the parameter values of `d` are not used by `fit_mle`; only the “structure” of `d` is passed in.
However, for complex maximum likelihood estimation requiring optimization, e.g., an EM algorithm, one could use `d` as an initial guess.
Member commented:

This is yet another behaviour and IMO adds to the confusion.


## Sufficient Statistics

For many distributions, the estimation can be based on (sums of) sufficient statistics computed from a dataset. To simplify implementation, for such distributions we implement a `suffstats` method instead of `fit_mle` directly:
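As a sketch of this pattern for `Normal` (a minimal example, not taken from the diff), the sufficient statistics are computed in a single pass and the MLE is then derived from them alone:

```julia
using Distributions

x = randn(10_000) .* 2.0 .+ 3.0

ss = suffstats(Normal, x)  # one pass over the data
d  = fit_mle(Normal, ss)   # MLE computed from the sufficient statistics alone

# equivalent to fit_mle(Normal, x)
```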
2 changes: 1 addition & 1 deletion docs/src/mixture.md
@@ -106,4 +106,4 @@ rand!(::AbstractMixtureModel, ::AbstractArray)
## Estimation

There are several methods for the estimation of mixture models from data, and this problem remains an open research topic.
This package does not provide facilities for estimating mixture models. One can resort to other packages, *e.g.* [*GaussianMixtures.jl*](https://github.com/davidavdav/GaussianMixtures.jl), for this purpose.
This package does not provide facilities for estimating mixture models. One can resort to other packages, *e.g.* [*GaussianMixtures.jl*](https://github.com/davidavdav/GaussianMixtures.jl) or [*ExpectationMaximization.jl*](https://github.com/dmetivie/ExpectationMaximization.jl), for this purpose.
2 changes: 2 additions & 0 deletions src/genericfit.jl
@@ -30,5 +30,7 @@ fit_mle(dt::Type{D}, x::AbstractArray, w::AbstractArray) where {D<:UnivariateDis
fit_mle(dt::Type{D}, x::AbstractMatrix) where {D<:MultivariateDistribution} = fit_mle(D, suffstats(D, x))
fit_mle(dt::Type{D}, x::AbstractMatrix, w::AbstractArray) where {D<:MultivariateDistribution} = fit_mle(D, suffstats(D, x, w))

fit_mle(dt::D, args...) where {D<:Distribution} = fit_mle(typeof(dt), args...)
dmetivie marked this conversation as resolved.

fit(dt::Type{D}, x) where {D<:Distribution} = fit_mle(D, x)
fit(dt::Type{D}, args...) where {D<:Distribution} = fit_mle(D, args...)
18 changes: 17 additions & 1 deletion src/multivariate/product.jl
@@ -27,7 +27,7 @@ function Product(v::V) where {S<:ValueSupport,T<:UnivariateDistribution{S},V<:Ab
"`Product(v)` is deprecated, please use `product_distribution(v)`",
:Product,
)
return Product{S, T, V}(v)
return Product{S, T ,V}(v)
dmetivie marked this conversation as resolved.
end

length(d::Product) = length(d.v)
@@ -36,6 +36,8 @@ function Base.eltype(::Type{<:Product{S,T}}) where {S<:ValueSupport,
return eltype(T)
end

params(g::Product) = params.(g.v)

Member commented on lines +39 to +40:

This seems unrelated. Suggested change: remove the line `params(g::Product) = params.(g.v)`.

_rand!(rng::AbstractRNG, d::Product, x::AbstractVector{<:Real}) =
map!(Base.Fix1(rand, rng), x, d.v)
function _logpdf(d::Product, x::AbstractVector{<:Real})
@@ -60,3 +62,17 @@ maximum(d::Product) = map(maximum, d.v)
function product_distribution(dists::V) where {S<:ValueSupport,T<:UnivariateDistribution{S},V<:AbstractVector{T}}
return Product{S,T,V}(dists)
end

#### Fitting

"""
fit_mle(g::Product, x::AbstractMatrix)
fit_mle(g::Product, x::AbstractMatrix, γ::AbstractVector)

`fit_mle` for a multivariate `Product` distribution `g` is the `product_distribution` of the `fit_mle` of each component of `g`.
"""
function fit_mle(g::Product, x::AbstractMatrix, args...)
d = size(x, 1)
length(g) == d || throw(DimensionMismatch("The dimensions of g and x are inconsistent."))
return product_distribution([fit_mle(g.v[s], y, args...) for (s, y) in enumerate(eachrow(x))])
Member commented:

In addition to what I mentioned above, I think there's another issue here: it is not possible to pass different arguments/options to the fits of the different components, which makes this approach much less flexible than just fitting each component manually.

Author replied:

I wondered about that: is there a way to accept `kwargs = [kwargs[i] for i in 1:d]` and apply them to each distribution?
Regarding the arguments, I believe that in most situations you have a sample and fit the whole ProductDistribution at once. Hence, IMO the code above corresponds to the most typical situation. For a situation where some distributions have weighted samples and others do not, one could either define a method for that case or, as you said, do it manually.

end
dmetivie marked this conversation as resolved.
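The manual per-component alternative the reviewer alludes to might look like the sketch below (a hypothetical scenario in which only the first component uses sample weights, something the generic method cannot express):

```julia
using Distributions

# 2×N sample matrix from a product distribution, plus hypothetical weights
x = rand(product_distribution([Exponential(0.5), Normal(11.3, 3.2)]), 1_000)
w = rand(1_000)

d = product_distribution([
    fit_mle(Exponential, x[1, :], w),  # weighted fit for the first component
    fit_mle(Normal, x[2, :]),          # unweighted fit for the second
])
```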
21 changes: 21 additions & 0 deletions src/product.jl
@@ -74,6 +74,8 @@ end

size(d::ProductDistribution) = d.size

params(d::ArrayOfUnivariateDistribution) = params.(d.dists)

Member commented on lines +77 to +78:

Suggested change: remove the line `params(d::ArrayOfUnivariateDistribution) = params.(d.dists)`.

mean(d::ProductDistribution) = reshape(mapreduce(vec ∘ mean, vcat, d.dists), size(d))
var(d::ProductDistribution) = reshape(mapreduce(vec ∘ var, vcat, d.dists), size(d))
cov(d::ProductDistribution) = Diagonal(vec(var(d)))
@@ -244,3 +246,22 @@ function product_distribution(dists::AbstractVector{<:Normal})
σ2 = map(var, dists)
return MvNormal(µ, Diagonal(σ2))
end

#### Fitting
promote_sample(::Type{dT}, x::AbstractArray{T}) where {T<:Real, dT<:Real} = T <: dT ? x : convert.(dT, x)

"""
fit_mle(dists::ArrayOfUnivariateDistribution, x::AbstractArray)
fit_mle(dists::ArrayOfUnivariateDistribution, x::AbstractArray, γ::AbstractVector)

`fit_mle` for an `ArrayOfUnivariateDistribution` `dists` is the `product_distribution` of the `fit_mle` of each component of `dists`.
"""
function fit_mle(dists::VectorOfUnivariateDistribution, x::AbstractMatrix{<:Real}, args...)
length(dists) == size(x, 1) || throw(DimensionMismatch("The dimensions of dists and x are inconsistent."))
return product_distribution([fit_mle(d, promote_sample(eltype(d), x[s, :]), args...) for (s, d) in enumerate(dists.dists)])
end

function fit_mle(dists::ArrayOfUnivariateDistribution, x::AbstractArray, args...)
size(dists) == size(first(x)) || throw(DimensionMismatch("The dimensions of dists and x are inconsistent."))
return product_distribution([fit_mle(d, promote_sample(eltype(d), [x[i][s] for i in eachindex(x)]), args...) for (s, d) in enumerate(dists.dists)])
end
1 change: 1 addition & 0 deletions src/univariate/discrete/binomial.jl
@@ -196,6 +196,7 @@ suffstats(::Type{T}, data::BinomData, w::AbstractArray{<:Real}) where {T<:Binomi

fit_mle(::Type{T}, ss::BinomialStats) where {T<:Binomial} = T(ss.n, ss.ns / (ss.ne * ss.n))

fit_mle(d::T, x::AbstractArray{<:Integer}) where {T<:Binomial} = fit_mle(T, suffstats(T, ntrials(d), x))
Member commented:

I'm generally not a big fan of supporting both instances and types if it is not really needed (https://docs.julialang.org/en/v1/manual/style-guide/#Avoid-confusion-about-whether-something-is-an-instance-or-a-type). In this line, I'm not sure there is a very strong argument for adding the method based on the instance, since the same functionality already exists. More generally, I think it could be confusing to know which parameters of d are fixed and which ones are optimized with MLE.

Author (@dmetivie, Feb 1, 2023) replied:

The arguments are the ones I made before: mostly, this allows using fit_mle(dist, x) in very generic code without having to care about extra arguments for some distributions like ProductDistribution, Binomial, and Categorical.

For me, the point is not really adding functionality to Binomial but everything that comes with the form fit_mle(dist, x). The only cost is some possible minor confusion.
The docs should clearly explain the difference between the instance and type forms; I tried to write something in that sense.
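The generic usage the author describes can be sketched without this PR by writing the instance-form fallbacks by hand; `my_fit_mle` is a hypothetical stand-in for the methods the PR adds:

```julia
using Distributions

# Hand-written versions of the PR's instance methods: the distribution instance
# supplies the "structural" argument (n for Binomial, k for Categorical), and a
# generic fallback reduces everything else to the existing type-based methods.
my_fit_mle(d::Binomial, x) = fit_mle(Binomial, ntrials(d), x)
my_fit_mle(d::Categorical, x) = fit_mle(Categorical, ncategories(d), x)
my_fit_mle(d::Distribution, x) = fit_mle(typeof(d), x)

# One call shape covers distributions that otherwise need extra arguments.
my_fit_mle(Binomial(100, 0.5), rand(Binomial(100, 0.3), 10_000))
my_fit_mle(Categorical([0.2, 0.3, 0.5]), rand(Categorical([0.1, 0.4, 0.5]), 10_000))
```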

fit_mle(::Type{T}, n::Integer, x::AbstractArray{<:Integer}) where {T<:Binomial} = fit_mle(T, suffstats(T, n, x))
fit_mle(::Type{T}, n::Integer, x::AbstractArray{<:Integer}, w::AbstractArray{<:Real}) where {T<:Binomial} = fit_mle(T, suffstats(T, n, x, w))
fit_mle(::Type{T}, data::BinomData) where {T<:Binomial} = fit_mle(T, suffstats(T, data))
1 change: 1 addition & 0 deletions src/univariate/discrete/categorical.jl
@@ -156,6 +156,7 @@ function fit_mle(::Type{<:Categorical}, ss::CategoricalStats)
Categorical(normalize!(ss.h, 1))
end

fit_mle(d::T, x::AbstractArray{<:Integer}) where {T<:Categorical} = fit_mle(T, ncategories(d), x)
function fit_mle(::Type{<:Categorical}, k::Integer, x::AbstractArray{T}) where T<:Integer
Categorical(normalize!(add_categorical_counts!(zeros(k), x), 1), check_args=false)
end
26 changes: 26 additions & 0 deletions test/fit.jl
@@ -115,6 +115,11 @@ end
@test d isa D
@test ntrials(d) == 100
@test succprob(d) ≈ 0.3 atol=0.01

d2 = @inferred fit_mle(D(100, 0.5), x)
@test d2 isa D
@test ntrials(d2) == 100
@test succprob(d2) ≈ 0.3 atol = 0.01
end
end
end
@@ -141,6 +146,10 @@ end
@test isa(d2, Categorical)
@test probs(d2) == probs(d)

d3 = fit_mle(Categorical(p), x)
@test isa(d3, Categorical)
@test probs(d3) == probs(d)

ss = suffstats(Categorical, (3, x), w)
h = Float64[sum(w[x .== i]) for i = 1 : 3]
@test isa(ss, Distributions.CategoricalStats)
@@ -414,6 +423,23 @@
end
end

@testset "Testing fit_mle for ProductDistribution" begin
dists = [product_distribution([Exponential(0.5), Normal(11.3, 3.2)]), product_distribution([Exponential(0.5) Normal(11.3, 3.2)
Bernoulli(0.2) Normal{Float32}(0f0,0f0)])]
for func in funcs, dist in dists
x = rand(dist, N)
w = func[1](N)

d = fit_mle(dist, x)
@test isa(d, typeof(dist))
@test isapprox(collect.(params(d)), collect.(params(dist)), atol=0.1)

d = fit_mle(dist, x, w)
@test isa(d, typeof(dist))
@test isapprox(collect.(params(d)), collect.(params(dist)), atol=0.1)
end
end

@testset "Testing fit for InverseGaussian" begin
for func in funcs, dist in (InverseGaussian, InverseGaussian{Float64})
x = rand(dist(3.9, 2.1), n0)