The Poisson lognormal model and its variants can be used to analyze multivariate count data. This package implements efficient algorithms that extract meaningful information from complex, hard-to-interpret multivariate count data. It is built to scale to large datasets, although it remains subject to memory limitations. Possible fields of application include:
- Genomics (number of times a gene is expressed in a cell)
- Ecology (species abundances)
One main functionality is to normalize the count data to obtain more valuable data. The package also analyzes the significance of each variable and the correlations between variables, as well as the weight of covariates (if available).
A notebook to get started can be found here. If you need just a quick view of the package, see the quickstart next.
pyPLNmodels is available on PyPI. The development version is available on GitHub and GitLab.
pip install pyPLNmodels
For those unfamiliar with the concepts of Poisson or Gaussian random variables, it is not necessary to delve into these statistical descriptions. The key takeaway is as follows: This package is designed to analyze multi-dimensional count data. It effectively extracts significant information, such as the mean, the relationships with covariates, and the correlation between count variables, in a manner appropriate for count data.
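Consider $\mathbf Y$ a count matrix (denoted as endog in the package) consisting of $n$ rows and $p$ columns. It is assumed that each individual $\mathbf Y_i$, that is, the $i^{\text{th}}$ row of $\mathbf Y$, is independent from the others and follows a Poisson lognormal distribution:
$$\mathbf Y_i \sim \mathcal P(\exp(\mathbf Z_i)), \quad \mathbf Z_i \sim \mathcal N(\mathbf o_i + \mathbf B^{\top} \mathbf x_i, \mathbf \Sigma),$$
where $\mathbf x_i \in \mathbb R^d$ (exog) and $\mathbf o_i \in \mathbb R^p$ (offsets) are user-specified covariates and offsets. The matrix $\mathbf B$ is a $d \times p$ matrix of regression coefficients and $\mathbf \Sigma$ is a $p \times p$ covariance matrix. The goal is to estimate the parameters $\mathbf B$ and $\mathbf \Sigma$, denoted as coef and covariance in the package, respectively. A normalization procedure adequate to count data can be applied by extracting the latent_variables $\mathbf Z_i$ once the parameters are learned.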
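To make the two-layer structure concrete, here is a minimal NumPy sketch that simulates data from a Poisson lognormal model: a Gaussian latent vector per sample, then Poisson counts whose rates are the exponential of that vector. It does not use pyPLNmodels. The sizes, the random parameters, and the variable names (which mirror the package vocabulary only for readability) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, d = 200, 5, 2  # samples, count variables, covariates (toy sizes)

# Hypothetical inputs: an intercept plus one covariate, and zero offsets.
exog = np.column_stack([np.ones(n), rng.normal(size=n)])  # shape (n, d)
offsets = np.zeros((n, p))                                 # shape (n, p)

# Hypothetical model parameters (random, for illustration only).
coef = rng.normal(scale=0.5, size=(d, p))                  # shape (d, p)
factor = rng.normal(scale=0.3, size=(p, p))
covariance = factor @ factor.T + 0.1 * np.eye(p)           # symmetric positive definite

# Latent Gaussian layer: each row has mean offsets + exog @ coef.
mean = offsets + exog @ coef
latent = mean + rng.multivariate_normal(np.zeros(p), covariance, size=n)

# Observed layer: Poisson counts with rate exp(latent).
endog = rng.poisson(np.exp(latent))

print(endog.shape)  # (200, 5)
```

Fitting the model goes in the opposite direction: starting from the counts and covariates only, it estimates the coefficients and the covariance, and recovers an approximation of the latent layer.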
The package comes with an ecological data set to present the functionality:
import pyPLNmodels
from pyPLNmodels.models import PlnPCAcollection, Pln, ZIPln
from pyPLNmodels.oaks import load_oaks
oaks = load_oaks()
Each model can be specified in two distinct manners:
- by formula (similar to R), where a data frame is passed and the formula is specified using the from_formula initialization:
model = Model.from_formula("endog ~ 1 + covariate_name ", data = oaks)# not run
We rely on the patsy package for the formula parsing.
- by specifying the endog, exog, and offsets matrices directly:
model = Model(endog = oaks["endog"], exog = oaks[["covariate_name"]], offsets = oaks[["offset_name"]])# not run
The parameters exog and offsets are optional. By default, exog is set to represent an intercept, which is a vector of ones. Similarly, offsets defaults to a matrix of zeros.
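The defaults described above can be pictured with a small NumPy sketch. The helper fill_defaults is hypothetical and is not part of pyPLNmodels. It only illustrates the shapes involved: a column of ones for the intercept and a zero matrix with the same shape as the counts.

```python
import numpy as np

def fill_defaults(endog, exog=None, offsets=None):
    """Hypothetical helper (not a pyPLNmodels function) mimicking the described defaults."""
    endog = np.asarray(endog)
    n, p = endog.shape
    if exog is None:
        exog = np.ones((n, 1))       # intercept only: a vector of ones
    if offsets is None:
        offsets = np.zeros((n, p))   # no offset: a matrix of zeros
    return endog, np.asarray(exog), np.asarray(offsets)

counts = np.array([[3, 0, 1], [0, 2, 5]])
_, exog, offsets = fill_defaults(counts)
print(exog.shape, offsets.shape)  # (2, 1) (2, 3)
```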
The Pln model is the building block of the models implemented in this package. It fits a Poisson lognormal model to the data:
pln = Pln.from_formula("endog ~ 1 + tree ", data = oaks)
pln.fit()
print(pln)
transformed_data = pln.transform()
pln.show()
Rank Constrained Poisson lognormal for Poisson Principal Component Analysis (aka PlnPCA and PlnPCAcollection)
This model excels in dimension reduction and is capable of scaling to high-dimensional count data. Users can set the rank through the rank keyword of the PlnPCA object. Furthermore, they can specify multiple ranks simultaneously within a single object (PlnPCAcollection), and then select the optimal model based on either the AIC (default) or BIC criterion:
pca_col = PlnPCAcollection.from_formula("endog ~ 1 + tree ", data = oaks, ranks = [3,4,5])
pca_col.fit()
print(pca_col)
pca_col.show()
best_pca = pca_col.best_model()
best_pca.show()
transformed_data = best_pca.transform(project = True)
print('Original data shape: ', oaks["endog"].shape)
print('Transformed data shape: ', transformed_data.shape)
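As a rough illustration of what a rank constraint means, the sketch below builds a covariance matrix of the form C C^T, where C has only a few columns. This factorization is an assumption used for illustration, not a description of the package internals, and the names components and factors are hypothetical. With such a covariance, each latent vector is driven by a small number of factors, which is what makes the reduced representation possible.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, rank = 100, 10, 3  # toy sizes: 10 count variables, 3 components

components = rng.normal(size=(p, rank))   # hypothetical loading matrix C
covariance = components @ components.T    # p x p matrix of rank at most `rank`

factors = rng.normal(size=(n, rank))      # low-dimensional standard Gaussian factors
latent = factors @ components.T           # each row has covariance C C^T

print(np.linalg.matrix_rank(covariance))
print(factors.shape, latent.shape)
```

In this toy setting the constrained covariance is described by p * rank = 30 numbers instead of the p * (p + 1) / 2 = 55 free entries of an unconstrained symmetric matrix, and the gap widens quickly as p grows.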
A correlation circle may be employed to graphically represent the relationship between the variables and the components:
best_pca.plot_pca_correlation_circle(["var_1","var_2"], indices_of_variables = [0,1])
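For readers unfamiliar with correlation circles, the following NumPy sketch computes the coordinates such a plot displays: the correlation of each variable with the first two principal components. It runs on toy Gaussian data rather than on a fitted model, and it is a generic PCA computation, not the package's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 150, 4
data = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # toy correlated data

centered = data - data.mean(axis=0)
# PCA via SVD: u * s gives the component scores of each sample.
u, s, vt = np.linalg.svd(centered, full_matrices=False)
scores = u[:, :2] * s[:2]

# Correlation between each variable and each of the first two components.
circle = np.array([
    [np.corrcoef(centered[:, j], scores[:, k])[0, 1] for k in range(2)]
    for j in range(p)
])

# Distance of each variable from the origin of the circle.
radii = np.sqrt((circle ** 2).sum(axis=1))
print(circle.shape)  # (4, 2)
```

Each row of circle is one point of the plot. Because the components are uncorrelated, every point lies inside the unit circle, and a variable close to the circle is well represented by the two components.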
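The ZIPln model, a variant of the PLN model, is designed to handle zero inflation in the data. It is defined as follows:
$$\mathbf Z_i \sim \mathcal N(\mathbf o_i + \mathbf B^{\top} \mathbf x_i, \mathbf \Sigma), \quad W_{ij} \sim \mathcal B(\sigma(\mathbf x_i^{0\top} \mathbf B^0_j)), \quad Y_{ij} \mid Z_{ij}, W_{ij} \sim (1 - W_{ij}) \mathcal P(\exp(Z_{ij})),$$
where $\sigma$ is the logistic function, $\mathbf x_i^0$ are the covariates of the zero-inflation part, and $\mathbf B^0$ the associated coefficients.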
This model is particularly beneficial when the data contains a significant number of zeros. It incorporates additional covariates for the zero inflation coefficient, which are specified following the pipe | symbol in the formula or via the exog_inflation keyword. If not specified, it is set to the covariates for the Poisson part.
zi = ZIPln.from_formula("endog ~ 1 + tree | 1 + tree", data = oaks)
zi.fit()
print(zi)
print("Transformed data shape: ", zi.transform().shape)
z_latent_variables, w_latent_variables = zi.transform(return_latent_prob = True)
print(r'$Z$ latent variables shape', z_latent_variables.shape)
print(r'$W$ latent variables shape', w_latent_variables.shape)
By default, the transformation of the data returns only the $Z$ latent variable. However, if the return_latent_prob parameter is set to True, the transformed data will include both the $Z$ and $W$ latent variables.
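The roles of the two latent variables can be illustrated with a short simulation in plain NumPy. Here a Bernoulli indicator, standing in for W, forces some counts to zero, while a Gaussian variable, standing in for Z, sets the Poisson rate of the remaining counts. The constant inflation probability of 0.3 is a simplifying assumption: in the model itself this probability can depend on covariates.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 4

latent = rng.normal(loc=1.0, scale=0.5, size=(n, p))  # toy Gaussian latent variable (Z)
inflation_prob = 0.3                                   # assumed constant for simplicity
inflated = rng.binomial(1, inflation_prob, size=(n, p))  # Bernoulli indicator (W)

poisson_counts = rng.poisson(np.exp(latent))
endog = (1 - inflated) * poisson_counts                # count is zero whenever W equals 1

print("zero share without inflation:", (poisson_counts == 0).mean())
print("zero share with inflation:   ", (endog == 0).mean())
```

The second share is never smaller than the first, which is the excess of zeros that a plain Poisson lognormal model does not account for.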
The package is equipped with a set of visualization functions designed to help the user interpret the data. The viz function conducts Principal Component Analysis (PCA) on the latent variables, while the viz_positions function carries out PCA on the latent variables, adjusted for covariates. Additionally, the viz_prob function provides a visual representation of the zero-inflation probability.
best_pca.viz(colors = oaks["tree"])
best_pca.viz_positions(colors = oaks["dist2ground"])
pln.viz(colors = oaks["tree"])
pln.viz_positions(colors = oaks["dist2ground"])
zi.viz(colors = oaks["tree"])
zi.viz_positions(colors = oaks["dist2ground"])
zi.viz_prob(colors = oaks["tree"])
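The difference between a PCA on the latent variables and a PCA on the latent variables adjusted for covariates can be sketched as follows. This is an interpretation, not the package code: it assumes that "adjusted for covariates" means subtracting the covariate effect from the latent variables before the projection. The helper pca_2d and the simulated stand-in for fitted latent variables are hypothetical.

```python
import numpy as np

def pca_2d(matrix):
    """Project the rows of `matrix` onto its first two principal components."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :2] * s[:2]

rng = np.random.default_rng(0)
n, p = 120, 6
group = rng.integers(0, 2, size=n)               # toy binary covariate
exog = np.column_stack([np.ones(n), group])      # intercept + covariate
coef = rng.normal(scale=2.0, size=(2, p))        # hypothetical coefficients
latent = exog @ coef + rng.normal(size=(n, p))   # stand-in for fitted latent variables

raw_projection = pca_2d(latent)                     # covariate effect included
adjusted_projection = pca_2d(latent - exog @ coef)  # covariate effect removed

print(raw_projection.shape, adjusted_projection.shape)  # (120, 2) (120, 2)
```

Coloring the first projection by the covariate would typically show the groups it induces, while the second projection shows the structure that remains once that effect is removed.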
Feel free to contribute, but read the CONTRIBUTING.md first. A public roadmap will be available soon.
Please cite our work using the following references: