New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add fit_mle methods #1670
base: master
Are you sure you want to change the base?
Add fit_mle methods #1670
Changes from all commits
a059f39
bb4bf69
168efa2
2fdb322
c198e7f
81fcffe
12722ca
a96582e
c310bc3
95f13d5
f933504
fe9c3c4
07433f2
22bb43b
dd4629d
d892d98
2f8139f
da8929f
1a57ad4
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -74,6 +74,23 @@ fit_mle(Categorical, x) # equivalent to fit_mle(Categorical, max(x), x) | |
fit_mle(Categorical, x, w) | ||
``` | ||
|
||
It is also possible to directly input a distribution `fit_mle(d::Distribution, x[, w])`. This form avoids the extra arguments: | ||
|
||
```julia | ||
fit_mle(Binomial(n, 0.1), x) | ||
# equivalent to fit_mle(Binomial, ntrials(Binomial(n, 0.1)), x), here the parameter 0.1 is not used | ||
|
||
fit_mle(Categorical(p), x) | ||
# equivalent to fit_mle(Categorical, ncategories(Categorical(p)), x), here the only the length of p is used not its values | ||
|
||
d = product_distribution([Exponential(0.5), Normal(11.3, 3.2)]) | ||
fit_mle(d, x) | ||
# equivalent to product_distribution([fit_mle(Exponential, x[1,:]), fit_mle(Normal, x[2, :])]). Again parameters of d are not used. | ||
``` | ||
|
||
Note that for standard distributions, the values of the distribution parameters `d` are not used in `fit_mle` only the “structure” of `d` is passed into `fit_mle`. | ||
However, for complex Maximum Likelihood estimation requiring optimization, e.g., EM algorithm, one could use `D` as an initial guess. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is yet another behaviour and IMO adds to the confusion. |
||
|
||
## Sufficient Statistics | ||
|
||
For many distributions, the estimation can be based on (sum of) sufficient statistics computed from a dataset. To simplify implementation, for such distributions, we implement `suffstats` method instead of `fit_mle` directly: | ||
|
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -36,6 +36,8 @@ function Base.eltype(::Type{<:Product{S,T}}) where {S<:ValueSupport, | |||
return eltype(T) | ||||
end | ||||
|
||||
params(g::Product) = params.(g.v) | ||||
|
||||
Comment on lines
+39
to
+40
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems unrelated:
Suggested change
|
||||
_rand!(rng::AbstractRNG, d::Product, x::AbstractVector{<:Real}) = | ||||
map!(Base.Fix1(rand, rng), x, d.v) | ||||
function _logpdf(d::Product, x::AbstractVector{<:Real}) | ||||
|
@@ -60,3 +62,17 @@ maximum(d::Product) = map(maximum, d.v) | |||
function product_distribution(dists::V) where {S<:ValueSupport,T<:UnivariateDistribution{S},V<:AbstractVector{T}} | ||||
return Product{S,T,V}(dists) | ||||
end | ||||
|
||||
#### Fitting | ||||
|
||||
""" | ||||
fit_mle(g::Product, x::AbstractMatrix) | ||||
fit_mle(g::Product, x::AbstractMatrix, γ::AbstractVector) | ||||
|
||||
The `fit_mle` for a multivariate Product distributions `g` is the `product_distribution` of `fit_mle` of each components of `g`. | ||||
""" | ||||
function fit_mle(g::Product, x::AbstractMatrix, args...) | ||||
d = size(x, 1) | ||||
length(g) == d || throw(DimensionMismatch("The dimensions of g and x are inconsistent.")) | ||||
return product_distribution([fit_mle(g.v[s], y, args...) for (s, y) in enumerate(eachrow(x))]) | ||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In addition to what I mentioned above, I think there's another issue here: It is not possible to pass different arguments/options to the fits of the different components, which makes this approach much less flexible than just fitting each component manually. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wondered about that: is there a way to add |
||||
end | ||||
dmetivie marked this conversation as resolved.
Show resolved
Hide resolved
|
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -74,6 +74,8 @@ end | |||
|
||||
size(d::ProductDistribution) = d.size | ||||
|
||||
params(d::ArrayOfUnivariateDistribution) = params.(d.dists) | ||||
|
||||
Comment on lines
+77
to
+78
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||
mean(d::ProductDistribution) = reshape(mapreduce(vec ∘ mean, vcat, d.dists), size(d)) | ||||
var(d::ProductDistribution) = reshape(mapreduce(vec ∘ var, vcat, d.dists), size(d)) | ||||
cov(d::ProductDistribution) = Diagonal(vec(var(d))) | ||||
|
@@ -244,3 +246,22 @@ function product_distribution(dists::AbstractVector{<:Normal}) | |||
σ2 = map(var, dists) | ||||
return MvNormal(µ, Diagonal(σ2)) | ||||
end | ||||
|
||||
#### Fitting | ||||
promote_sample(::Type{dT}, x::AbstractArray{T}) where {T<:Real, dT<:Real} = T <: dT ? x : convert.(dT, x) | ||||
|
||||
""" | ||||
fit_mle(dists::ArrayOfUnivariateDistribution, x::AbstractArray) | ||||
fit_mle(dists::ArrayOfUnivariateDistribution, x::AbstractArray, γ::AbstractVector) | ||||
|
||||
The `fit_mle` for a `ArrayOfUnivariateDistribution` distributions `dists` is the `product_distribution` of `fit_mle` of each components of `dists`. | ||||
""" | ||||
function fit_mle(dists::VectorOfUnivariateDistribution, x::AbstractMatrix{<:Real}, args...) | ||||
length(dists) == size(x, 1) || throw(DimensionMismatch("The dimensions of dists and x are inconsistent.")) | ||||
return product_distribution([fit_mle(d, promote_sample(eltype(d), x[s, :]), args...) for (s, d) in enumerate(dists.dists)]) | ||||
end | ||||
|
||||
function fit_mle(dists::ArrayOfUnivariateDistribution, x::AbstractArray, args...) | ||||
size(dists) == size(first(x)) || throw(DimensionMismatch("The dimensions of dists and x are inconsistent.")) | ||||
return product_distribution([fit_mle(d, promote_sample(eltype(d), [x[i][s] for i in eachindex(x)]), args...) for (s, d) in enumerate(dists.dists)]) | ||||
end |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -196,6 +196,7 @@ suffstats(::Type{T}, data::BinomData, w::AbstractArray{<:Real}) where {T<:Binomi | |
|
||
fit_mle(::Type{T}, ss::BinomialStats) where {T<:Binomial} = T(ss.n, ss.ns / (ss.ne * ss.n)) | ||
|
||
fit_mle(d::T, x::AbstractArray{<:Integer}) where {T<:Binomial} = fit_mle(T, suffstats(T, ntrials(d), x)) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm generally not a big fan of supporting both instances and types if it is not really needed (https://docs.julialang.org/en/v1/manual/style-guide/#Avoid-confusion-about-whether-something-is-an-instance-or-a-type). In this line here, I'm not sure if there is a very strong argument for adding the method based on the instance since the same functionality already exists. More generally, I think it could be confusing to know which parameters of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The arguments are the ones I made before. For me, the point is not really adding a functionality to |
||
fit_mle(::Type{T}, n::Integer, x::AbstractArray{<:Integer}) where {T<:Binomial}= fit_mle(T, suffstats(T, n, x)) | ||
fit_mle(::Type{T}, n::Integer, x::AbstractArray{<:Integer}, w::AbstractArray{<:Real}) where {T<:Binomial} = fit_mle(T, suffstats(T, n, x, w)) | ||
fit_mle(::Type{T}, data::BinomData) where {T<:Binomial} = fit_mle(T, suffstats(T, data)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this different behaviour of the distribution instances is very confusing and exactly shows the problem I had in mind in https://github.com/JuliaStats/Distributions.jl/pull/1670/files#r1092571161. There's no clear and obvious way to tell which properties of the distribution will be fixed when fitting. And even in the case of the product distribution the resulting code is not (significantly) shorter than the current code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For
Categorical
the currentfit_mle(Categorical, x) = fit_mle(Categorical, max(x), x)
could also lead to confusion (and error in really unlucky case where there are 0 samples of the last category) and not so different offit_mle(Categorical, ncategories(Categorical(p)), x)
.It is not about code length. See my comment below.