sigma.lv * theta (again) #201

AlainZuur · 2024-12-01T11:35:11Z

AlainZuur
Dec 1, 2024

Hello Bert,

Last year (23 November 2023), you helped me to write some code to simulate sigma.lv * theta a large number of times, and use these to get an (approximate) 95% CI. It worked ok up to at least May 2024. I've now updated gllvm to the latest development version (and updated all packages), but the simulated results don't match the original 'sigma.lv * theta' coming out of the model anymore.

So..something must have changed. Is that a bug in gllvm, or do I need to update the R code below?

#' Function for simulating theta and sigma.lv a large
#' number of times for a GLLVM.
MySimGLLVMthetas <- function(MyModel = MyModel,
Np = Np) {

#MyModel = M1.nb
#Np = ncol(SpecParasit)

#' Input:
#' MyModel: Name of the gllvm model.
#' Np: Number of species

#' Output of function: MyData.theta
#' - The MyData.theta contains Np rows (1 for each species).
#' - The gllvm software estimates a theta and a sigma.lv, but
#' we need the cross-product, and a 95% CI for this cross-product.
#' - The theta.Sigmalv is the cross product.
#' - The lower and upper bounds define a 95% CI.

#' Calculate a 95% confidence interval for the loadings.
#' Our first attempt:
#' MyData.theta$lowerWrong <- MyData.theta$theta.Sigmalv - 1.96 * MyModel$sd$theta
#' MyData.theta$upperWrong <- MyData.theta$theta.Sigmalv + 1.96 * MyModel$sd$theta

#' Technically, this is not correct. The MyModel$params$theta are
#' parameters, and MyModel$params$sigma.lv is also a (strictly positive)
#' parameter. Hence, we have a cross-product of two parameters,
#' and one of them is positive. Getting a mathematical expression for the
#' SE of this cross-product is tricky. When something is tricky, to
#' bootstrap/simulate.

#' Plan:
#' 1. Simulate 10,000 new sets of theta and sigma.lv.
#' This is achieved by simulating from a multivariate normal
#' distribution with mean and covariance matrix coming from
#' the estimated model.
#' 2. Extract the 10,000 simulated theta and sigma.lv values.
#' 3. Calculate 10,000 times: theta * sigma.lv
#' 4. Access the 2.5% and 97.5% values of this cross-product.
#' 5. Plot the whole thing.

#' Let's implement this.
#' 1. Simulate 10,000 new sets of all parameter. We need to get
#' the estimated values and the corresponding covariance matrix.
#' That is slightly ugly:
#' - mean values: MyModel$TMBfn$par[MyModel$Hess$incl]
#' - covariance matrix: MyModel$Hess$cov.mat.mod

#' And not all parameters are on the real scale; some use a different
#' internal parametrisation. For example, for the simulated
#' sigma.lv we need to take abs(sigma.lv).
SimPar <- MASS::mvrnorm(n = 10000,
mu = MyModel$TMBfn$par[MyModel$Hess$incl],
Sigma = MyModel$Hess$cov.mat.mod)

#' dim(SimPar)
#' This object contains 10,000 rows and ** parameters.
#' Each row contains a set of simulated parameters from the model.
#' SimPar[1,] #' First simulated data set.
#' SimPar[2,] #' Second simulated data set.
#' SimPar[3,] #' Third simulated data set.

#' Explain the number of parameters
#' Answer:
#' -We have Np species, 6 covariates and the sample size = N
#' -The total number of parameters =
#' Np intercepts +
#' 6 * Np slopes +
#' Np thetas (factor loadings)
#' Np + 6 * Np + Np

#' Why are those names so funny?
#' These are the names in that Hessian matrix
#'names(MyModel$TMBfn$par[MyModel$Hess$incl])

#' Those are rather cumbersome names.
#' - The lambdas are the thetas (factor loadings).
#' - The bs are the b0 (intercepts) and betas (slopes).
#' - And there is also a sigmaLV.
#' - Really? This is TMB?

#' Anyway...let us stop complaining and go on with the
#' simulation.

#' Grab the 10,000 simulated sigmaLV:
sigma.lv.10000 <- SimPar[,colnames(SimPar) == "sigmaLV"]
length(sigma.lv.10000)

#' Apply the abs() function (see text above).
sigma.lv.10000 <- abs(sigma.lv.10000)

#' Grab the simulated theta values.
#' There are only Np-1 of them as the first theta is set to 1 as otherwise
#' the algorithm cannot estimate the thetas.
theta10000 <- SimPar[,colnames(SimPar) == "lambda"]
#' dim(theta10000)

#' CATCH UP:
#' All we have done is:
#' -Simulate 10,000 sets of parameters.
#' -Extract the 10,000 sigma.lv.
#' -Extract the 10,000 theta

#' We can now 10,000 times calculate theta * sigma.lv

#' Some fancy code from dr. Bert van Veen to calculate
#' theta * sigma.lv for each simulated value, where theta
#' is a vector and sigma.lv is a scalar value.
Perm <- matrix(0,
nrow = MyModel$num.lv,
ncol = ncol(theta10000))
for(i in 1:MyModel$num.lv){
Perm[i,(sum(Perm)+1):(sum(Perm)+ncol(MyModel$y)-i)] <- 1
}

#' Calculate 10,000 times sigma.lv * theta.
SigmalvTimestheta <- (sigma.lv.10000 %*% Perm) * theta10000

#' RECAP:
#' - We used the estimated model parameters and covariance matrix,
#' and simulated 10,000 sets of theta and sigma.lv.
#' - A little issue was that gllvm does not estimate the first
#' factor loading, hence it not simulated neither.
#' - We can calculate 10,000 times sigma.lv * theta,
#' where sigma.lv is a scalar and theta is a vector (in this example).

#' And for each column determine the 2.5% and 97.5% values
loadings.CI <- apply(SigmalvTimestheta,
MARGIN = 2,
FUN = quantile,
probs = c(0.025,0.975))

#' Each column contains the lower and upper bounds for a factor loading
#' for a specific species.
#' head(loadings.CI)

#' Add the results to MyData.theta, but recall that the first
#' loading was set to 1. Hence, it was never simulated.
MyData.theta$lower <- 0
MyData.theta$lower[2:Np] <- loadings.CI["2.5%", ]

MyData.theta$upper <- 0
MyData.theta$upper[2:Np] <- loadings.CI["97.5%", ]
MyData.theta
}

MyData.theta <- MySimGLLVMthetas(MyModel = M1.nb,
Np = ncol(SpecParasit))
head(MyData.theta)

#' This was the outline of the simulation study.
#' 1. Simulate 10,000 new sets of theta and sigma.lv. This is
#' achieved by simulating from a multivariate normal distribution
#' with mean given by the estimated values and covariance matrix
#' also coming from the estimated model.
#' 2. Extract the 10,000 theta and sigma.lv values.
#' 3. Calculate 10,000 times theta * sigma.lv
#' 4. Access the 2.5% and 97.5% values of this cross-product.
#' 5. Plot the whole thing.

#' Creating a nice plot.
#' Reorder the y-axis based on theta.Sigmalv values
MyData.theta$Species <- factor(MyData.theta$Species,
levels = MyData.theta$Species[order(MyData.theta$theta.Sigmalv)])

#' Plot the results.
p2 <- ggplot(MyData.theta,
aes(x = theta.Sigmalv, y = Species)) +
geom_point() +
geom_errorbarh(aes(xmin = lower,
xmax = upper),
height = 0.2) +
geom_vline(xintercept = 0, lty = 2, col = "red") +
theme(text = element_text(size = 8),
legend.position = "none") +
xlab("Theta") + ylab("Species")
p2

AlainZuur · 2024-12-01T12:02:14Z

AlainZuur
Dec 1, 2024
Author

To show you the difference: Two graphs...same data...same R code....different gllvm versions (<> May 2024)

3 replies

BertvanderVeen Dec 1, 2024
Collaborator

Hey Alain. There have been a few changes that could impact this.

Did you refit the model under the new version? If the parameter estimates did not change between versions:
From the top of my head, the most likely I consider a change in how the asymptotic covariance matrix is calculated under near singularity. In the new version, the hessian is standardised by its norms before calculating the inverse. That tends to be more numerically stable if there are large differences in the magnitude of row/column norms of the hessian.
Under those conditions, the standard errors tended to be too optimistic for many parameters (overly close to zero), and should now be larger.

If this is the case, you should be able to see it by doing a quick comparison of the hessian of the two fits stored in the model object.

If this is not the case, either the parameter estimates have changed due to changes in the underlying fitting algorithms, or because something else is going on.

If something else is going on, I need to have a closer look during the week.

AlainZuur Dec 2, 2024
Author

Same gllvm version...(2.0)...different laptops. One crashes with an error:

Error in MASS::mvrnorm(n = 10000, mu = MyModel$TMBfn$par[MyModel$Hess$incl], :
'Sigma' is not positive definite

And the other produces odd simulation results. Estimated parameters differ slightly per laptop (different R versions I guess).

Happens for different data sets. I have 20 people going through it this week...so will be able to give you some feedback during the week.

Alain

BertvanderVeen Dec 3, 2024
Collaborator

Can you turn this into a reproducible example? When I try to use your function I get the error "MyData.theta is missing", because the data is never created.

AlainZuur · 2024-12-03T18:18:15Z

AlainZuur
Dec 3, 2024
Author

Yes....I will do that on Friday. Will use your online spider data. By the way...20 people...20 laptops..1/3th with no problems for this specific simulation, 1/3th with an error, 1/3th with simulated CIs that don't make sense. So...something going on there. Kind regards, Alain

…

On 3 Dec 2024 at 16:54 +0000, Bert ***@***.***>, wrote: Can you turn this into a reproducible example? When I try to use your function I get the error "MyData.theta is missing", because the data is never created. — Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: ***@***.***>

0 replies

BertvanderVeen · 2024-12-19T11:08:53Z

BertvanderVeen
Dec 19, 2024
Collaborator

Thanks for sending my the details on this per e-mail, Alain. Your code seems fine, and there is no bug.

This is a classical case of a situation where the model struggles to converge, and there are multiple similar solutions in close vicinity of each other. Some of those solutions provide more numerical issues than others, which is why the confidence intervals for some of your participants are much larger than for others.

This is somewhat related to a change in the calculation of the standard errors, in an attempt to retain some numerical accuracy for models that poorly converge, and to flag them better. Previously, these could sometimes result in considerably too optimistic standard errors (read: close to zero).

Here are a few tips/workarounds for this:

Fit with gradient.check = TRUE: this will flag that there are parameters that have excessively large values in the gradient (i.e., model not converged). Does tend to give some false positives. In your case, also which.max(abs(M1.nb$TMBfn$gr())) immediately flags the culprit (the "sigmaLV" is close to zero and poorly converged, a general sign of a model that has not well converged)
Change optimizer/settings: unfortunately, gllvm still provides limited access to optimisation settings for users. Here, changing to the "L-BFGS-B" method in "optim" and increasing the "factr" tolerance would have lead to a better convergence (or switching to the minic package for optimisation instead, which gets it right on default settings). gllvm provides all the tools to put the model in an optimizer by hand, but it is a bit messy to get all the bits and pieces in the right places afterwards - specifically the parameter estimates. Note to self: write a function for this.
Try different starting value settings (as the per usual recommendation)
set "diag.iter = 0" for (E)VA(defalts to 1) or change the approximation (via "method") to LA or EVA
Move the IID "row.eff" random effect into "formula" (combine with Ab.struct = "diagonal" to keep it speedy)
Reorder the species in the data (only makes sense with LVs). Due to the constraints on the ordination + the parameterization the model can be sensitive to the order. Place species in front that can be expected to be somewhere around the middle of the ordination (usually generalist species, or species with sufficient observations either way)

There are a bunch of other tips I have collected over time, but you get the idea: this particularly model struggles to converge, and that is causing the issues. This will be different for different participants, just because their computer architecture is different.

1 reply

AlainZuur Dec 31, 2024
Author

Hello Bert,
I tried changing all the optimisation options, but that did not really help.

But randomly changing the order of the columns seems to solve the issue.

Np <- ncol(SpecData)
I <- sample(1:Np, replace = FALSE)
SpecDataOrdered <- SpecData[,I]

And then run the gllvm and do the posterior simulation.

By doing this a large number of times, I ended up with three types of outcomes: (i) sensible results, (ii) a singular $Hess$cov.mat.mod matrix, or (iii) simulated results with very large SEs (as described in the first message). But the nice thing is that when things go wrong in (iii), it was always the same species who had the large SEs.

Hence, it may be wise to advise people to routinely change the order of the columns and see whether results change (not sure how you define change numerically)...as part of a sensitivity analysis.

Alain

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sigma.lv * theta (again) #201

Uh oh!

{{title}}

Uh oh!

Replies: 3 comments 4 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

sigma.lv * theta (again) #201

Uh oh!

AlainZuur Dec 1, 2024

Replies: 3 comments · 4 replies

Uh oh!

AlainZuur Dec 1, 2024 Author

Uh oh!

Uh oh!

BertvanderVeen Dec 1, 2024 Collaborator

Uh oh!

AlainZuur Dec 2, 2024 Author

Uh oh!

BertvanderVeen Dec 3, 2024 Collaborator

Uh oh!

AlainZuur Dec 3, 2024 Author

Uh oh!

BertvanderVeen Dec 19, 2024 Collaborator

Uh oh!

AlainZuur Dec 31, 2024 Author

AlainZuur
Dec 1, 2024

Replies: 3 comments 4 replies

AlainZuur
Dec 1, 2024
Author

BertvanderVeen Dec 1, 2024
Collaborator

AlainZuur Dec 2, 2024
Author

BertvanderVeen Dec 3, 2024
Collaborator

AlainZuur
Dec 3, 2024
Author

BertvanderVeen
Dec 19, 2024
Collaborator

AlainZuur Dec 31, 2024
Author