A Workflow Software for Analyzing Toxicologically Relevant Proteomic and Transcriptomic Data, From Data Preprocessing to Toxicological Mechanism Elucidation.
Exploration of toxicological mechanisms is imperative for the assessment of potential adverse reactions to chemicals and pharmaceutical agents, the engineering of safer compounds, and the preservation of public health.
High-throughput proteomics and transcriptomics can accurately capture the body's response to toxins and have become key tools for revealing complex toxicological mechanisms. Recently, a vast amount of omics data related to toxicological mechanisms has been accumulated. However, analyzing and utilizing this data remains a major challenge for researchers.
To address this, the current study has developed ToxDAR, a workflow-oriented R package for preprocessing and analyzing toxicological multi-omics data. ToxDAR integrates packages like NormExpression, DESeq2, and igraph, and utilizes R functions such as prcomp and phyper. It supports data preprocessing, quality control, differential expression analysis, functional analysis, and network analysis. ToxDAR’s architecture also includes five major categories of mechanism-related biological entities and details fifteen types of interactions among them, providing comprehensive knowledge annotation for omics data analysis results.
ToxDAR, designed to enhance the capabilities of data analysis and annotation, thereby improving the efficiency and quality of toxicological mechanism elucidation.
- ToxDAR is consisted of four main modules:
- Data Preparation
- Quality Control
- Data Analysis
- Data Interpretation
The ToxDAR software package extracts associations and supporting evidence between toxicants and biological entities such as genes, pathways, and phenotypes from multiple knowledge bases.
Source | Version | URL |
---|---|---|
CTD | v2021-10 | http://ctdbase.org/ |
AOP-Wiki | v2022-12 | https://aopwiki.org/ |
KEGG | v0.7.2 | https://www.kegg.jp/ |
DrugBank | v2020-12-15 | https://go.drugbank.com/ |
DisGeNet | v7.0 | https://www.disgenet.org/ |
Disease Ontology | v2021-10-11 | https://disease-ontology.org/ |
Human Phenotype Ontology | v2021-10-10 | https://hpo.jax.org/app/ |
PhosphoSitePlus | v6.6 | https://www.phosphosite.org/ |
UbiBrowser | v2.0 | http://ubibrowser.ncpsb.org.cn |
ENCODE | v120 | https://www.encodeproject.org/ |
STRINGdb | v11.0 | https://string-db.org/ |
Hardware requirements
- PC with 8G RAM or above is recommended
- Passed test on Windows 10, Ubuntu 18.04.6, macOS 14.6
Software requirements
- Depends: R (>= 3.6.1)
- Imports: limma, gprofiler2, ComplexHeatmap, igraph, NormExpression
Run in R session:
install.packages("devtools")
install.packages("remotes")
devtools::install_github("TMCjp/ToxDAR")
If there's any problem during installation, please refer to FAQ.
To assist users in reproducing the results, we have integrated the sample data into the R package. This includes public transcriptome data from the L02 cell line following exposure to Triphenyl Phosphate (TPP). Users can load the data with a single command.
library(ToxDAR)
load(file = paste0(system.file(package = "ToxDAR"), '/extdata/genexp.rdata'))
We utilize the ToxDAR software package to normalize the transcriptome data post-TPP exposure using ten different methods, subsequently generating a report on the effects of normalization. These normalization methods are evaluated based on two metrics: AUCVC and mSCC, to select the appropriate normalization method.
### Apply ten different preprocessing methods
# Gather factor data for each preprocessing method
HG7.norfac <- gatherFactors(genexp, methods = "HG7")[, 1]
TC.norfac <- gatherFactors(genexp, method = "TC")[, 1]
ERCC.norfac <- gatherFactors(genexp, method = "ERCC")[, 1]
TN.norfac <- gatherFactors(genexp, method = "TN")[, 1]
CR.norfac <- gatherFactors(genexp, method = "CR")[, 1]
UQ.norfac <- gatherFactors(genexp, method = "UQ")[, 1]
NR.norfac <- gatherFactors(genexp, method = "NR")[, 1]
TMM.norfac <- gatherFactors(genexp, method = "TMM")[, 1]
DESeq.norfac <- gatherFactors(genexp, method = "DESeq")[, 1]
TU.norfac <- gatherFactors(genexp, method = "TU")[, 1]
### Compute AUCVC(Area Under the normalized Coefficient of Variation threshold Curve) for different preprocessing methods
cv_uniform1 <- gatherCVs(data = genexp, nonzeroRatio = 0.5,
NR = NR.norfac,
DESeq = DESeq.norfac,
TMM= TMM.norfac,
HG7 = HG7.norfac,
TC = TC.norfac,
ERCC = ERCC.norfac,
TN = TN.norfac,
CR = CR.norfac,
UQ = UQ.norfac,
TU= TU.norfac)
cv_uniform2 <- gatherCVs4Matrices(None= genexp, raw_matrix =genexp, nonzeroRatio=1);
cv_uniform <- rbind(cv_uniform1, cv_uniform2);
cv_uniform <- as.data.frame(cv_uniform)
cv_uniform$Cutoff <- as.numeric(cv_uniform$Cutoff)
cv_uniform$Counts <- as.numeric(cv_uniform$Counts)
### Plot Index of AUCVC
library(ggplot2)
png(file = "cv.png", res=300, width=(1200*4.17), height=(960*4.17))
plotCVs(cv_uniform, methods=c("None", "NR", "DESeq", "TMM", "HG7", "TC", "ERCC", "TN", "CR", "UQ", "TU"),
legend.position=c(1.12, .48))
dev.off()
### Calculate Spearman’s Rank Correlation Coefficient for different preprocessing methods
spearman_corr1 <- gatherCors(data= genexp, cor_method="spearman",
NR = NR.norfac,
DESeq = DESeq.norfac,
TMM= TMM.norfac,
HG7 = HG7.norfac,
TC = TC.norfac,
ERCC = ERCC.norfac,
TN = TN.norfac,
CR = CR.norfac,
UQ = UQ.norfac,
TU = TU.norfac,
pre_ratio=1, lower_trim=0.2, upper_trim=0.6, rounds=10000)
spearman_corr2 <- gatherCors4Matrices(None= genexp, raw_matrix=genexp, cor_method="spearman",
pre_ratio=1, lower_trim=0.2, upper_trim=0.6, rounds=10000)
spearman_corr <- rbind(spearman_corr1, spearman_corr2);
spearman_corr <- as.data.frame(spearman_corr)
spearman_corr$Value <- as.numeric(spearman_corr$Value)
### Calculate mSCC(median Spearman’s Rank Correlation Coefficient)
cor.medians <- getCorMedians(spearman_corr)
### Plot mSCC(median Spearman’s Rank Correlation Coefficient) results
# Save the plot as a high-resolution PNG file
png(file = "cor3.png", res=300, width=(1200*4.17), height=(960*4.17))
plotCors(spearman_corr, methods=c("None", "NR", "DESeq", "TMM", "HG7", "TC", "ERCC", "TN", "CR", "UQ", "TU"),
legend.position=c(1.12, .48));
dev.off()
### Preprocess expression profiles using the normalized factor matrix
# Apply normalization using the specified factor matrix from the TU method
genexp <- getNormMatrix(data = genexp, norm.factors = TU.norfac)
We generated PCA plots and violin plots of gene expression distribution for these datasets, allowing us to visually observe the variability both between different experimental groups and within the same group across samples.
### Plot the PCA to illustrate the variability between different experimental groups and individual samples within each group.
gi <- c(rep("control",5), rep("55mg/kg",3), rep("110mg/kg",3),
rep("220mg/kg",3), rep("441mg/kg",3), rep("881mg/kg",3))
pcaresult <- toxPCA(genexp, stype = gi, outpng = T)
### Create a violin plot to compare data distribution differences between groups
png(file = "toxVio.png", res=300, width=(1200*3), height=(960*3))
plotVio(genexp, stype = gi)
dev.off()
We utilized the differential analysis module of the ToxDAR software package to identify differentially expressed genes associated with TPP exposure at various concentrations and time points. Additionally, the integrated knowledge base within ToxDAR provided clear associations between differentially expressed genes and TPP exposure.
genexp <- genexp[, c(1:5, 18,19,20)]
### Perform differential expression analysis, considering the top 10,000 genes
gi <- c(rep("control",5), rep("881mg/kg",3))
diffmat <- diffexp(genexp, groupinfo = gi, ntop = 10000)
### Annotate differential expression results with toxin information
annomat <- diffanno(diffmat, org = "rnorvegicus", toxid = 'C005445')
### Generate an interactive table from the annotated results
IntTab(diffres = annomat, outhtml = T)
### Create a volcano plot to visualize genes associated with the analyzed toxin
library(dplyr)
png(file = "volanno.png", res=300, width=(960*3), height=(1000*3))
volanno(diffres = annomat, nodesize = 1.5, annosize = 5, themsize = 24, legend.position = c(0.82, 0.15), legend.size = 12)
dev.off()
To gain a comprehensive understanding of the functions of these genes, we proceeded to perform enrichment analysis on the differentially expressed genes following TPP exposure using multiple annotation datasets including GO, DO, KEGG, and others collected within the software package.
library(clusterProfiler)
# load data
gsy <- annomat$Symbol[1:30]
gis <- gconvert(query = gsy, organism = "hsapiens",
target="ENTREZGENE_ACC", mthreshold = Inf, filter_na = TRUE)
gis <- gis$target
### Perform functional enrichment analysis using GO terms
# The 'slim' parameter specifies the type of ontology (GO/Kegg/DO)
gomat <- gofact(genes = gis, slim = 'go')
### Generate a bar plot of the top 10 functional categories
png(file = "funcbar.png", res=300, width=(1200*3), height=(960*3))
p <- plotfact(gomat, ntop = 10, fillby = "Pval", tn = 30, collimit = c(0, 0.05), xtextsize = 20, stripsize = 18)
p + theme(axis.text = element_text(size=20), axis.title = element_text(size=30),
legend.text = element_text(size=20), legend.title = element_text(size=30), legend.key.size = unit(20, "pt"),
axis.ticks = element_blank())
dev.off()
# Filter for Biological Process (BP) terms from the enrichment results
tys <- unlist(lapply(gomat$hie, function(x) strsplit(x, '.', fixed = T)[[1]][1]))
bpmat <- gomat[which(tys == 'BP'), ]
# Generate a chord diagram for the top 5 biological processes
png(file = "funcchord.png", res=300, width=(2200*3), height=(2000*3))
plotexpfunc(bpmat, ntop = 5, diffresult = annomat, plottype = 'chord', chord.space = 0.05, gene.space = 0.25, gene.size = 9, process.title = 30, process.label = 28, termcol = 3)
dev.off()
# Generate a cluster plot for the top 5 biological processes
png(file = "funccluster.png", res=300, width=(1200*3), height=(1200*3))
plotexpfunc(bpmat, ntop = 5, diffresult = annomat, plottype = 'cluster')
dev.off()
The ssGSEA method quantifies changes in gene expression profiles as a result of toxic exposure into the activation or inhibition states of biological pathways, revealing the mechanisms by which the organism responds to exposure at a molecular level. The changes are presented visually through enrichment plots and heatmaps for intuitive representation.
# Remove rows with missing gene symbols
nn <- which(is.na(annomat$Symbol))
gseamat <- annomat[-nn, ]
geneList <- gseamat$logFC
genename <- gseamat$Symbol
library(org.Hs.eg.db)
gid <- mapIds(org.Hs.eg.db, keys=genename, column="ENTREZID", keytype="SYMBOL", multiVals = "first")
names(geneList) <- gid
geneList <- geneList[order(geneList, decreasing = T)]
# Perform GSEA analysis using KEGG terms
gsearesult <- gseafc(geneList, term = "KEGG", minGSSize = 5, maxGSSize = 500)
# Plot GSEA results for 'Apoptosis' term
gsid <- gsearesult$gs_id[which(gsearesult$gs_description == 'Apoptosis')]
png(file = "gsea_up2.png", res=300, width=(1200*3), height=(960*3))
plotgsea(gsid, geneList, term = "KEGG")
dev.off()
# Extract Enrichment Scores (ES) and convert to matrix
esmicro <- as.matrix(gsearesult[, 'ES'])
rownames(esmicro) <- gsearesult$gs_exact_source
# Plot the ES heatmap without normalization
ESheatmap(esmicro, Normalization = F)
We employed the ToxDAR network analysis and annotation module to map the associations between the toxin and key molecules, biological pathways, and phenotypes. Within this network, TPP is linked with biological pathways such as apoptosis and inflammatory responses, indicating that triphenyl phosphate (TPP) may act as a potential endocrine disruptor. It exerts its effects by activating the thyroid hormone receptor β (THRB) molecule, thereby influencing signaling pathways related to apoptosis and inflammatory responses, ultimately adversely affecting thyroid function.
### Perform network analysis and annotation to map associations between the toxin and key molecules, biological pathways, and phenotypes.
netmat <- netmoa(toxid = 'C005445', ngid = gis, toppathway = 30, ppiscore = 500)
### Uncomment the following lines to generate and save a static PNG plot of the network
# png(file = "netmoa.png", res=600, width=(1600*5), height=(1200*5))
# netplot(netmoa = netmat)
# dev.off()
### Create and Save Interactive Network Visualization
nethtml <- netplothtml(netmoa = netmat, outtype = 'pdf')
savehtml(network = nethtml, htmlfile = 'netmoa.html')
If devtools::install_github()
raise Installation failed: Problem with the SSL CA cert (path? access rights?)
error, try:
install.packages(c("curl", "httr"))
During installation there may be some configuration error (lack of libraries):
------------------------- ANTICONF ERROR ---------------------------
Configuration failed because libcurl was not found. Try installing:
* deb: libcurl4-openssl-dev (Debian, Ubuntu, etc)
* rpm: libcurl-devel (Fedora, CentOS, RHEL)
* csw: libcurl_dev (Solaris)
If libcurl is already installed, check that 'pkg-config' is in your
PATH and PKG_CONFIG_PATH contains a libcurl.pc file. If pkg-config
is unavailable you can set INCLUDE_DIR and LIB_DIR manually via:
R CMD INSTALL --configure-vars='INCLUDE_DIR=... LIB_DIR=...'
--------------------------------------------------------------------
Just follow the instruction to satisfy the dependencies. For instance, you can run sudo apt-get install libcurl4-openssl-dev
in Ubuntu to fix the problem above.
ToxDAR is developed using the R programming language and integrates numerous commonly used packages provided by Bioconductor, so if you get
'BiocInstaller' must be installed to install Bioconductor packages.
, please choose1 (Yes)
. Since then you may seeInstallation failed: cannot open the connection to 'https://bioconductor.org/biocLite.R'
, runsource('http://bioconductor.org/biocLite.R')
, finally try the installation commands above again.
Please wait for minutes then try again if solving some dependencies from GitHub fails with
Connection timed out after 100001 milliseconds
.
This package is free and open source software, licensed under GPL v3.0.