Hello, this issue is
- not only related to the multi-level varying-slopes model with two clusters (cross-classification) that I posted on the Julia discourse (Multi-level varying slopes with two clusters, cross classification),
- but also related to other similar issues I found, for example
I have spent many hours tweaking more than 10 variants of the model, reading many of Stan's and PyMC3's docs, and posting the issue on the Julia discourse. I believe the main problem that I, and many other people (see the issues linked above), face is that LKJ is numerically unstable. As per 24.1 LKJ Correlation Distribution | Stan Functions Reference, with reference to the LKJ distribution (without Cholesky factorisation):
The log of the LKJ density for the correlation matrix y given nonnegative shape eta. The only reason to use this density function is if you want the code to run slower and consume more memory with more risk of numerical errors. Use its Cholesky factor as described in the next section.
...However, it is much better computationally to work directly with the Cholesky factor of Σ, so this distribution should never be explicitly used in practice.
Many people in Turing, including me, have been using LKJ directly without the Cholesky decomposition, which results in those issues. I followed some of the workarounds from those issues, such as using Symmetric, but they don't solve the problem for more complex models.
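For a concrete illustration of the kind of numerical failure involved (the matrix below is constructed by hand, not produced by LKJ itself): a correlation matrix that is only barely positive definite can pick up a tiny negative pivot under floating-point error, at which point cholesky, or the sqrt inside it under AD, throws, which is the same sqrt DomainError shown further below.
using LinearAlgebra
a = 1.0 - 1e-12                 # just below 1, so det([1 1; 1 a]) ≈ -1e-12 < 0
R = [1.0 1.0; 1.0 a]            # symmetric, but not positive definite
cholesky(Symmetric(R))          # throws PosDefException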
We need a Cholesky-factor version of LKJ, something called LKJCholesky or a similar name. It is already implemented in Stan, PyMC3, and Soss.jl:
- Stan: lkj_corr_cholesky (24.2 Cholesky LKJ Correlation Distribution | Stan Functions Reference)
- PyMC3: pymc3.LKJCholeskyCov (LKJ Cholesky Covariance Priors for Multivariate Normal Models — PyMC3 3.10.0 documentation)
- Soss.jl: LKJL (implemented in MeasureTheory.jl, cscherrer/MeasureTheory.jl)
LKJ Cholesky
For example, Stan's implementation of lkj_corr_cholesky
As per 24 Correlation Matrix Distributions | Stan Functions Reference
The correlation matrix distributions have support on the (Cholesky factors of) correlation matrices. A Cholesky factor L for a K × K correlation matrix Σ of dimension K has rows of unit length so that the diagonal of L L⊤ is the unit K-vector. Even though models are usually conceptualized in terms of correlation matrices, it is better to operationalize them in terms of their Cholesky factors.
As per 24.1 LKJ Correlation Distribution | Stan Functions Reference, with reference to the LKJ distribution (without Cholesky factorisation):
However, it is much better computationally to work directly with the Cholesky factor of Σ, so this distribution should never be explicitly used in practice.
As per 24.2 Cholesky LKJ Correlation Distribution | Stan Functions Reference
Stan provides an implicit parameterization of the LKJ correlation matrix density in terms of its Cholesky factor, which you should use rather than the explicit parameterization in the previous section. For example, if L is a Cholesky factor of a correlation matrix, then
L ~ lkj_corr_cholesky(2.0); # implies L * L' ~ lkj_corr(2.0);
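A quick Julia check of the same relationship, using the LKJ and cholesky that are already available (the values below are illustrative only):
# If L is the Cholesky factor of a correlation matrix R, then L * L' recovers R
# and the diagonal of L * L' is the unit vector, as quoted above.
using Distributions, LinearAlgebra
R = rand(LKJ(4, 2.0))            # sample a 4×4 correlation matrix
L = cholesky(Symmetric(R)).L     # its lower-triangular Cholesky factor
R ≈ L * L'                       # true
diag(L * L') ≈ ones(4)           # true: unit diagonal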
So instead of the "raw version"
@model function gdemo(...)
n_coef, n_cluster = 4, 7 # For example
σ ~ filldist(Exponential(1), n_coef)
ρ ~ LKJ(n_coef,2)
Σ = Diagonal(σ) * ρ * Diagonal(σ)
α ~ MvNormal(..., Σ)
......
end
or, a slightly better version using the code by Seth Axen, which mirrors the quad_form_diag function implemented in Stan:
quad_form_diag(M, v) = Symmetric((v .* v') .* (M .+ M') ./ 2)
@model function gdemo(...)
n_coef, n_cluster = 4, 7 # For example
σ ~ filldist(Exponential(1), n_coef)
ρ ~ LKJ(n_coef,2)
# Use of quad_form_diag implemented in Stan
Σ = quad_form_diag(ρ, σ)
α ~ MvNormal(..., Σ)
......
end
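As a sanity check (illustrative values; assumes the quad_form_diag helper defined above), the helper reproduces the "raw" construction Diagonal(σ) * ρ * Diagonal(σ) whenever ρ is symmetric:
using Distributions, LinearAlgebra
σ = rand(Exponential(1), 4)
ρ = rand(LKJ(4, 2.0))
quad_form_diag(ρ, σ) ≈ Diagonal(σ) * ρ * Diagonal(σ)   # true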
Stan/PyMC3's version uses LKJ Cholesky with reparameterisation (LKJ Cholesky can also be used without reparameterisation); for example, Stan's implementation:
parameters{
.............
cholesky_factor_corr[4] L_Rho_actor;
}
transformed parameters{
............
alpha = (diag_pre_multiply(sigma_actor, L_Rho_actor) * z_actor)';
}
model{
...............
L_Rho_actor ~ lkj_corr_cholesky( 2 );
}
If Turing implements LKJCholesky, it should look something like this:
@model function gdemo(...)
n_coef, n_cluster = 4, 7 # For example
σ ~ filldist(Exponential(1), n_coef)
# standardised prior for reparameterisation
z_α ~ filldist(Normal(0, 1), n_coef, n_cluster) # 4 x 7 matrix
# Stan's Cholesky LKJ: L_Rho_actor ~ lkj_corr_cholesky( 2 );
L_α ~ LKJCholesky(n_coef,2)
# Stan's alpha = (diag_pre_multiply(sigma_actor, L_Rho_actor) * z_actor)';
α = σ .* L_α * z_α # matrix 4 x 7
......
end
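To see why this construction works, here is a quick Monte Carlo sanity check (names and values are local to this snippet) that σ .* (L * z), with z standard normal and L the Cholesky factor of a correlation matrix R, has covariance Diagonal(σ) * R * Diagonal(σ), i.e. the same α-covariance the raw LKJ model targets:
using Distributions, LinearAlgebra, Statistics
σ = [0.5, 1.0, 1.5, 2.0]
R = rand(LKJ(4, 2.0))
L = cholesky(Symmetric(R)).L
z = randn(4, 200_000)                    # i.i.d. standard normal draws, one column per draw
draws = σ .* (L * z)                     # non-centred construction
isapprox(cov(draws; dims=2), Diagonal(σ) * R * Diagonal(σ); rtol=0.05)  # true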
Alternatively, a Soss.jl implementation; the full example below was coded up by Seth Axen using Soss.jl:
# using Pkg
# Pkg.add(["Soss", "MeasureTheory", "TuringModels", "CSV", "DataFrames", "DynamicHMC", "LogDensityProblems"])
using Soss, MeasureTheory, LinearAlgebra, Random
using Soss: TransformVariables
# work around LKJL issues, see https://github.com/cscherrer/MeasureTheory.jl/issues/100
# note that we're making LKJL behave like an LKJU
Soss.xform(d::LKJL, _data::NamedTuple=NamedTuple()) = TransformVariables.as(d);
function Base.rand(rng::AbstractRNG, ::Type, d::LKJL{k}) where {k}
return cholesky(rand(rng, MeasureTheory.Dists.LKJ(k, d.η))).U
end
# get data
using TuringModels, CSV, DataFrames
data_path = joinpath(TuringModels.project_root, "data", "chimpanzees.csv");
df = CSV.read(data_path, DataFrame; delim=';');
df.block_id = df.block;
df.treatment = 1 .+ df.prosoc_left .+ 2 * df.condition;
# build the model
model = @model actor, block, treatment, n_treatment, n_actor, n_block begin
σ_block ~ Exponential(1) |> iid(n_treatment)
σ_actor ~ Exponential(1) |> iid(n_treatment)
U_ρ_block ~ LKJL(n_treatment, 2)
U_ρ_actor ~ LKJL(n_treatment, 2)
g ~ Normal(0, 1) |> iid(n_treatment)
z_block ~ Normal(0, 1) |> iid(n_treatment, n_block)
z_actor ~ Normal(0, 1) |> iid(n_treatment, n_actor)
β = σ_block .* (U_ρ_block' * z_block)
α = σ_actor .* (U_ρ_actor' * z_actor)
p = broadcast(treatment, actor, block) do t, a, b
return @inbounds logistic(g[t] + α[t, a] + β[t, b])
end
pulled_left .~ Binomial.(1, p)
end
Performance
Comparing Stan's implementation with the current use of LKJ in Turing (this is part of the issue I posted here).
The Stan model, sampling 4 chains, finished in about 6 seconds per chain (total = Warm-up + sampling). Summarized elapsed time
Chain 1: Elapsed Time: 3.27789 seconds (Warm-up)
Chain 1: 2.63135 seconds (Sampling)
Chain 1: 5.90924 seconds (Total)
Chain 2: Elapsed Time: 3.27674 seconds (Warm-up)
Chain 2: 2.77611 seconds (Sampling)
Chain 2: 6.05285 seconds (Total)
Chain 3: Elapsed Time: 3.44 seconds (Warm-up)
Chain 3: 2.66887 seconds (Sampling)
Chain 3: 6.10887 seconds (Total)
Chain 4: Elapsed Time: 3.4264 seconds (Warm-up)
Chain 4: 2.63844 seconds (Sampling)
Chain 4: 6.06484 seconds (Total)
Stan model
Stan code saved in “nonCentered.stan”
data{
int L[504];
int block_id[504];
int actor[504];
int tid[504];
}
parameters{
matrix[4,7] z_actor;
matrix[4,6] z_block;
vector[4] g;
cholesky_factor_corr[4] L_Rho_actor;
cholesky_factor_corr[4] L_Rho_block;
vector<lower=0>[4] sigma_actor;
vector<lower=0>[4] sigma_block;
}
transformed parameters{
matrix[7,4] alpha;
matrix[6,4] beta;
beta = (diag_pre_multiply(sigma_block, L_Rho_block) * z_block)';
alpha = (diag_pre_multiply(sigma_actor, L_Rho_actor) * z_actor)';
}
model{
vector[504] p;
sigma_block ~ exponential( 1 );
sigma_actor ~ exponential( 1 );
L_Rho_block ~ lkj_corr_cholesky( 2 );
L_Rho_actor ~ lkj_corr_cholesky( 2 );
g ~ normal( 0 , 1 );
to_vector( z_block ) ~ normal( 0 , 1 );
to_vector( z_actor ) ~ normal( 0 , 1 );
for ( i in 1:504 ) {
p[i] = g[tid[i]] + alpha[actor[i], tid[i]] + beta[block_id[i], tid[i]];
p[i] = inv_logit(p[i]);
}
L ~ binomial( 1 , p );
}
generated quantities{
vector[504] log_lik;
vector[504] p;
matrix[4,4] Rho_actor;
matrix[4,4] Rho_block;
Rho_block = multiply_lower_tri_self_transpose(L_Rho_block);
Rho_actor = multiply_lower_tri_self_transpose(L_Rho_actor);
for ( i in 1:504 ) {
p[i] = g[tid[i]] + alpha[actor[i], tid[i]] + beta[block_id[i], tid[i]];
p[i] = inv_logit(p[i]);
}
for ( i in 1:504 ) log_lik[i] = binomial_lpmf( L[i] | 1 , p[i] );
}
Sampling Stan model
library(rethinking)
data(chimpanzees)
d = chimpanzees
d$block_id = d$block
d$treatment = 1L + d$prosoc_left + 2L*d$condition
dat = list(
L = d$pulled_left,
tid = d$treatment,
actor = d$actor,
block_id = as.integer(d$block_id) )
nonCentered = stan_model("nonCentered.stan")
fit = sampling(nonCentered, data = dat, seed = 803214053, control = list(adapt_delta = 0.9))
SAMPLING FOR MODEL 'nonCentered' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.000133 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.33 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 3.27789 seconds (Warm-up)
Chain 1: 2.63135 seconds (Sampling)
Chain 1: 5.90924 seconds (Total)
Chain 1:
SAMPLING FOR MODEL 'nonCentered' NOW (CHAIN 2).
Chain 2:
Chain 2: Gradient evaluation took 9.7e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.97 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2:
Chain 2:
Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 2:
Chain 2: Elapsed Time: 3.27674 seconds (Warm-up)
Chain 2: 2.77611 seconds (Sampling)
Chain 2: 6.05285 seconds (Total)
Chain 2:
SAMPLING FOR MODEL 'nonCentered' NOW (CHAIN 3).
Chain 3:
Chain 3: Gradient evaluation took 9.7e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.97 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3:
Chain 3:
Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 3:
Chain 3: Elapsed Time: 3.44 seconds (Warm-up)
Chain 3: 2.66887 seconds (Sampling)
Chain 3: 6.10887 seconds (Total)
Chain 3:
SAMPLING FOR MODEL 'nonCentered' NOW (CHAIN 4).
Chain 4:
Chain 4: Gradient evaluation took 0.000111 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 1.11 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4:
Chain 4:
Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 4:
Chain 4: Elapsed Time: 3.4264 seconds (Warm-up)
Chain 4: 2.63844 seconds (Sampling)
Chain 4: 6.06484 seconds (Total)
Chain 4:
Turing's reference model
To implement it in Turing, I found a similar model here StatisticalRethinkingJulia/TuringModels.jl - non-centered chimpanzees.md.
using TuringModels
using CSV, DataFrames, Turing   # needed for CSV.read, DataFrame, @model and sample below
using LinearAlgebra
# This script requires latest LKJ bijectors support.
# `] add Bijectors#master` to get latest Bijectors.
data_path = joinpath(@__DIR__, "..", "..", "data", "chimpanzees.csv")
delim = ";"
d = CSV.read(data_path, DataFrame; delim)
d.block_id = d.block
# m13.6nc1 is equivalent to m13.6nc in following Turing model
@model m13_6_nc(actor, block_id, condition, prosoc_left, pulled_left) = begin
# fixed priors
Rho_block ~ LKJ(3, 4.)
Rho_actor ~ LKJ(3, 4.)
sigma_block ~ filldist(truncated(Cauchy(0, 2), 0, Inf), 3)
sigma_actor ~ filldist(truncated(Cauchy(0, 2), 0, Inf), 3)
a ~ Normal(0, 1)
bp ~ Normal(0, 1)
bpc ~ Normal(0, 1)
# adaptive NON-CENTERED priors
Rho_block = (Rho_block' + Rho_block) / 2
Rho_actor = (Rho_actor' + Rho_actor) / 2
L_Rho_block = cholesky(Rho_block).L
L_Rho_actor = cholesky(Rho_actor).L
z_N_block ~ filldist(Normal(0, 1), 3, 6)
z_N_actor ~ filldist(Normal(0, 1), 3, 7)
# @show size(L_Rho_block) size(sigma_block) size(z_N_block_id)
a_block_bp_block_bpc_block = sigma_block .* L_Rho_block * z_N_block
a_actor_bp_actor_bpc_actor = sigma_actor .* L_Rho_actor * z_N_actor
a_block = a_block_bp_block_bpc_block[1, :]
bp_block = a_block_bp_block_bpc_block[2, :]
bpc_block = a_block_bp_block_bpc_block[3, :]
a_actor = a_actor_bp_actor_bpc_actor[1, :]
bp_actor = a_actor_bp_actor_bpc_actor[2, :]
bpc_actor = a_actor_bp_actor_bpc_actor[3, :]
# linear models
BPC = bpc .+ bpc_actor[actor] + bpc_block[block_id]
BP = bp .+ bp_actor[actor] + bp_block[block_id]
A = a .+ a_actor[actor] + a_block[block_id]
logit_p = A + (BP + BPC .* condition) .* prosoc_left
# likelihood
pulled_left .~ BinomialLogit.(1, logit_p)
end
chns = sample(
m13_6_nc(d.actor, d.block_id, d.condition, d.prosoc_left, d.pulled_left),
Turing.NUTS(0.95),
1000
)
The model I implemented to replicate Stan's model above
To implement the Stan model, I also referred to @sethaxen's implementation above and the TuringModels.jl model. Again, I coded up more than 10 variants, but hit the same issues.
using TuringModels, CSV, DataFrames, Turing, LinearAlgebra, Random, ReverseDiff, Memoization
Turing.setadbackend(:reversediff)
Turing.setrdcache(true)
data_path = joinpath(TuringModels.project_root, "data", "chimpanzees.csv");
df = CSV.read(data_path, DataFrame; delim=';');
df.block_id = df.block;
df.treatment = 1 .+ df.prosoc_left .+ 2*df.condition;
@model function nonCentered(actor, block, treatment, pulled_left)
n_treatment, n_actor, n_block = 4, 7, 6
# fixed priors
g ~ MvNormal(n_treatment, 1)
ρ_block ~ LKJ(n_treatment, 2)
ρ_actor ~ LKJ(n_treatment, 2)
σ_block ~ filldist(Exponential(1), n_treatment)
σ_actor ~ filldist(Exponential(1), n_treatment)
# adaptive NON-CENTERED priors
# I have no idea what these two lines do. I just copied them from the TuringModels.jl model above
ρ_block = (ρ_block' + ρ_block) / 2
ρ_actor = (ρ_actor' + ρ_actor) / 2
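# (Note: most likely these two lines enforce exact symmetry. Matrices coming back from
#  the sampler/bijector can be asymmetric by a few ulps due to floating-point error, and
#  `cholesky` below requires an exactly symmetric positive-definite input, so averaging
#  ρ with ρ' is essentially the `Symmetric` workaround mentioned earlier.)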
# Cholesky factor
L_block = cholesky(ρ_block).L
L_actor = cholesky(ρ_actor).L
# Standardised prior
z_block ~ filldist(Normal(0, 1), n_treatment, n_block)
z_actor ~ filldist(Normal(0, 1), n_treatment, n_actor)
# Again, copied from the model of TuringModels.jl above
β = σ_block .* L_block * z_block # matrix 4 x 6
α = σ_actor .* L_actor * z_actor # matrix 4 x 7
# Use the code from @sethaxen
logit_p = broadcast(treatment, actor, block) do t, a, b
return @inbounds g[t] + α[t,a] + β[t,b]
end
pulled_left .~ BinomialLogit.(1, logit_p)
end
rng = MersenneTwister(3702)
@time chns = sample(
rng,
nonCentered(df.actor, df.block_id, df.treatment, df.pulled_left),
NUTS(), # Use Turing's default settings
1_000
);
julia> @time chns = sample(
rng,
nonCentered(df.actor, df.block_id, df.treatment, df.pulled_left),
NUTS(), # Use Turing's default settings
1_000
);
┌ Info: Found initial step size
└ ϵ = 0.2
┌ Warning: The current proposal will be rejected due to numerical error(s).
│ isfinite.((θ, r, ℓπ, ℓκ)) = (true, false, false, false)
└ @ AdvancedHMC ~/.julia/packages/AdvancedHMC/MIxdK/src/hamiltonian.jl:47
┌ Warning: The current proposal will be rejected due to numerical error(s).
│ isfinite.((θ, r, ℓπ, ℓκ)) = (true, false, false, false)
└ @ AdvancedHMC ~/.julia/packages/AdvancedHMC/MIxdK/src/hamiltonian.jl:47
Sampling 100%|██████████████████████████████████| Time: 0:01:30
ERROR: DomainError with -5.831779503751022e-11:
sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
Stacktrace:
[1] sqrt
@ ./math.jl:582 [inlined]
[2] sqrt
@ ~/.julia/packages/ForwardDiff/QOqCN/src/dual.jl:203 [inlined]
[3] derivative!
@ ~/.julia/packages/ForwardDiff/QOqCN/src/derivative.jl:46 [inlined]
[4] unary_scalar_forward_exec!(f::typeof(sqrt), output::ReverseDiff.TrackedReal{Float64, Float64, Nothing}, input::ReverseDiff.TrackedReal{Float64, Float64, Nothing}, cache::Base.RefValue{Float64})
@ ReverseDiff ~/.julia/packages/ReverseDiff/E4Tzn/src/derivatives/scalars.jl:91
[5] scalar_forward_exec!(instruction::ReverseDiff.ScalarInstruction{typeof(sqrt), ReverseDiff.TrackedReal{Float64, Float64, Nothing}, ReverseDiff.TrackedReal{Float64, Float64, Nothing}, Base.RefValue{Float64}})
@ ReverseDiff ~/.julia/packages/ReverseDiff/E4Tzn/src/derivatives/scalars.jl:81
[6] forward_exec!(instruction::ReverseDiff.ScalarInstruction{typeof(sqrt), ReverseDiff.TrackedReal{Float64, Float64, Nothing}, ReverseDiff.TrackedReal{Float64, Float64, Nothing}, Base.RefValue{Float64}})
@ ReverseDiff ~/.julia/packages/ReverseDiff/E4Tzn/src/tape.jl:82
[7] ForwardExecutor
@ ~/.julia/packages/ReverseDiff/E4Tzn/src/api/tape.jl:76 [inlined]
[8] (::FunctionWrappers.CallWrapper{Nothing})(f::ReverseDiff.ForwardExecutor{ReverseDiff.ScalarInstruction{typeof(sqrt), ReverseDiff.TrackedReal{Float64, Float64, Nothing}, ReverseDiff.TrackedReal{Float64, Float64, Nothing}, Base.RefValue{Float64}}})
@ FunctionWrappers ~/.julia/packages/FunctionWrappers/8xdVB/src/FunctionWrappers.jl:58
[9] macro expansion
@ ~/.julia/packages/FunctionWrappers/8xdVB/src/FunctionWrappers.jl:130 [inlined]
[10] do_ccall
@ ~/.julia/packages/FunctionWrappers/8xdVB/src/FunctionWrappers.jl:118 [inlined]
[11] FunctionWrapper
@ ~/.julia/packages/FunctionWrappers/8xdVB/src/FunctionWrappers.jl:137 [inlined]
[12] forward_pass!(compiled_tape::ReverseDiff.CompiledTape{ReverseDiff.GradientTape{Turing.Core.var"#f#26"{DynamicPPL.TypedVarInfo{NamedTuple{(:g, :ρ_block, :ρ_actor, :σ_block, :σ_actor, :z_block, :z_actor), Tuple{DynamicPPL.Metadata{Dict{AbstractPPL.VarName{:g, Tuple{}}, Int64}, Vector{
I am looking into defining a custom distribution in Turing as per Advanced Usage - How to Define a Customized Distribution. With reference to Stan's implementation of lkj_corr_cholesky, I found these:
stan-dev/math - lkj_corr_cholesky_log.hpp
stan-dev/math - lkj_corr_cholesky_rng.hpp
stan-dev/math - lkj_corr_cholesky_lpdf.hpp
But everything is written in C/C++, which I have not learned at all, so it seems I am stuck.
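In the meantime, here is a rough Julia sketch (my own, with hypothetical names, not an existing API) of what a custom LKJCholesky-style distribution could look like, mirroring the unnormalised log density used in stan-dev/math's lkj_corr_cholesky_lpdf, i.e. log p(L) proportional to the sum over k = 2..K of (K - k + 2η - 2) * log L[k,k]. It is not yet usable with Turing's HMC because it lacks a bijector from unconstrained parameters to Cholesky factors.
using Distributions, LinearAlgebra, Random

# Hypothetical name; not part of Distributions.jl or Turing.
struct LKJCholeskySketch{T<:Real} <: ContinuousMatrixDistribution
    K::Int   # dimension of the correlation matrix
    η::T     # LKJ shape parameter
end

Base.size(d::LKJCholeskySketch) = (d.K, d.K)

# Draw a lower-triangular Cholesky factor by factorising a sample from LKJ.
function Distributions._rand!(rng::AbstractRNG, d::LKJCholeskySketch, A::AbstractMatrix{<:Real})
    R = rand(rng, LKJ(d.K, d.η))
    copyto!(A, cholesky(Symmetric(R)).L)
    return A
end

# Unnormalised log density of the Cholesky factor (normalising constant omitted),
# matching the kernel in Stan's lkj_corr_cholesky_lpdf.
function Distributions.logpdf(d::LKJCholeskySketch, L::AbstractMatrix{<:Real})
    return sum((d.K - k + 2 * d.η - 2) * log(L[k, k]) for k in 2:d.K)
end
With this sketch, rand(LKJCholeskySketch(4, 2.0)) returns a 4×4 lower-triangular factor and logpdf evaluates the same kernel Stan uses; the missing pieces for use inside a Turing model are the Bijectors.jl transform to unconstrained space and its log-abs-det-Jacobian.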
An implementation of LKJCholesky in Turing would help solve many of these issues. With that implementation, I can help write a tutorial for Turing at Tutorials. It is one of the most commonly used distributions for modelling the covariance matrix of a multivariate Gaussian, and it seems many Turing users are currently using LKJ directly. It would also be good practice to let other Turing and Julia PPL users know that LKJCholesky is the much better option, as stated by Stan with reference to LKJ (without Cholesky factorisation):
The only reason to use this density function is if you want the code to run slower and consume more memory with more risk of numerical errors. Use its Cholesky factor as described in the next section.