A tutorial which includes Arboretum & findTransitionGenesets with an example dataset.
- 0. About this tutorial
- 1. Prepare input tree
- 2. Prepare input order
- 3. Prepare input OGID
- 4. Prepare input value
- 5. Run initialization clustering / prepare input config
- 6. Run Arboretum
- 7. Run findTransitioningGenesets
- 8. Result visualization
This tutorial explains how to run the Arboretum and findTransitionGeneset. Arboretum is a multi-task clustering framework which uses hierarchical relationship of samples simultaneously while grouping the genes into a finite number of expression states. In this tutorial, we will do the clustering of genes based on the pseudo-bulk expression of each dataset, i.e. average values of gene expressions within each dataset (cell cluster, in single cell setting). Once we have defined these gene expression states for all datasets, we can identify genes with interesting patterns of expression on the hierarchy and group the genes based on the similarity of their patterns to identify the differential expressing (DE) genesets.
input_files
folder contains tutorial dataset which consists of following files :
Dataset | Description | File name |
---|---|---|
c1_matrix | 155 cells x 20840 genes, with row & column headers | input_files/c1_matrix.txt |
c2_matrix | 166 cells x 20840 genes, with row & column headers | input_files/c2_matrix.txt |
c3_matrix | 126 cells x 20840 genes, with row & column headers | input_files/c3_matrix.txt |
c4_matrix | 265 cells x 20840 genes, with row & column headers | input_files/c4_matrix.txt |
c5_matrix | 155 cells x 20840 genes, with row & column headers | input_files/c5_matrix.txt |
Gene ID | Gene names of 20840 genes | input_files/allgenenames.txt |
Tree | newick format tree text | input_files/newick_tree.txt |
- Note: the codes and scripts of this tutorial were fully tested and checked as working with the format above. So please firslty prepare your data as like above if you want to work with your own data.
code
folder contains the main programs of this tutorial, which could be also downloaded from its own site:
Program | Description | Original site |
---|---|---|
learnMoE | Initialization GMM clustering | https://github.com/Roy-lab/learnMoE |
Arboretum | Multi-task gene clustering | https://github.com/Roy-lab/Arboretum2.0 |
findTransitionGenesets | Hierarchical clustering of patterns identified from Arboretum | https://github.com/Roy-lab/clade-specific_gene_sets |
reformatSpeciesTree | Format newick tree to Arboretum input | - |
- Note: All programs are written in C++ and the compling could be fulfilled by "Makefile" in each directory with the "make" command:
script
folder contains several shell and PERL scripts which is used for the specific formatting of the data while running the wrapper scripts.
The key steps of clustering are performed by the following wrapper scripts:
Wrapper script name | Running program | Output |
---|---|---|
step1_run_GMM_and_prep_config.sh | learnMoE | arb_input/merged_gmm_k#/, arb_input/config_k#.txt |
step2_run_arboretum.sh | Arboretum | arb_output/ |
step3_run_findTransitionGenesets.sh | findTransitionGenesets | transition_genesets_output/ |
Arboretum requires a relationship tree structure between samples for the modeling. The format of the input tree is a text file consist of 3 columns (tab delimited). Each row is explaining the relationship of a parent node and a child node. One parental node always have 2 children nodes (left and right). The left/right children nodes will be dealt as equivalent, i.e. there is no order between children nodes. An ancestral node could be a child of another superordinate ancestral node. Following example is showing a tree consists of 3 nodes (c1, c2, c3) where c1 and c2 are more close. "Anc" means ancestral node.
c1 (TAB) left (TAB) Anc2
c2 (TAB) right (TAB) Anc2
Anc2 (TAB) left (TAB) Anc1
c3 (TAB) right (TAB) Anc1
To deal with a tree easier, especially for a complex tree with many leaves, we can use a program reformatSpeciesTree
in the code
folder. This program takes newick-style text file as a input and prints out the tree file for the Arboretum as the output.
Running command:
- USAGE: reformatSpeciesTree [newick_tree.txt]
./code/reformatSpeciesTree input_files/newick_tree.txt > arb_input/tree.txt
Result files:
input_files/newick_tree.txt
shows the newick tree of relationship among 5 sampples in this tutorial. The output arb_input/tree.txt
is a formatted tree for the Arboretum.
This text files is a simple list of sample names WITHOUT Ancestral nodes ("Anc#") with format like below:
c2
c4
c3
c1
c5
Running command:
cut -f1 arb_input/input_tree.txt |grep -v Anc > arb_input/orders.txt
This script used the tree file prepared in the previous step.
Result files:
arb_input/orders.txt
is the order file of our 5 samples example dataset, an input file of the Arboretum.
OGID stands for "OrthoGroup ID" and it is a list of orthogroups(OGs) with a profiled list of corresponding gene IDs per species (wihch is a legacy of the original Arboretum). Regardless of the its naming, this file is used for the mapping of genes' identities from different datasets. The format of this file is
OGID (TAB) dataset1_GeneID,dataset2_GeneID,dataset3_GeneID,...
(tab delimited between OGID and gene IDs, comma delimited among genes)
Note that the order of dataset-specific geneID should be same to the order file from Step 2. Also, from now on, the gene IDs of each dataset is named as "dataset_GeneID" (e.g. c1_AAA).
Running command:
- USAGE: generating_OGID.sh [orders.txt] [allgenenames.txt]
sh script/generating_OGID.sh arb_input/orders.txt input_files/allgenenames.txt > arb_input/OGID.txt
This script generates OGID based on the order of genes in "allgenenames.txt" file and shapes the gene IDs into "dataset#_geneID" as well as ordering it matched to the order file.
Result files:
The output arb_input/OGID.txt
is an input file of the Arboretum.
We are using pseudo-bulk expression vector of genes per samples, which is the average values of gene expressions within each dataset matrix so that each dataset has 1 vector values. The format of the file is a tab-delimited text file of 2 columns, the 1st column as "dataset_geneID" and the 2nd column is the mean expression values of all cells within the dataset.
c1_A1BG (TAB) 0.673579
c1_A1BG-AS1 (TAB) 0.246118
c1_A1CF (TAB) 0.000000
c1_A2M (TAB) 0.899409
...
Running command:
- USAGE: generating_meanvals.pl [genes-by-cells_matrix.txt] ["dataset-name"]
perl script/generating_meanvals.pl input_files/c1_matrix.txt c1
mv c1_meanval.txt arb_input/
perl script/generating_meanvals.pl input_files/c2_matrix.txt c2
mv c2_meanval.txt arb_input/
... (Do this for all datset matrices.)
Result files: arb_input/c#_meanval.txt
files are the values used as the source values for the Arboretum clustering.
Transposing of the data matrix: The script expects the matrix text file as a tab-delimited [genes x cells] matrix WITH row and column headers. If the user's dataset has given as [cells x genes], one can transpose it by using "script/transpose_matrix.pl" like:
- USAGE: transpose_matrix.pl [(tab-delimited) matrix.txt]
perl script/transpose_matrix.pl c1_matrix.txt > c1_matrix_transposed.txt
Identification of initializaiton clusters of genes is the first step of Arboretum analysis. This clustering is usually done by performing Gaussian Mixture Model (GMM) and this clustering assignments are assigned to the genes equally across all the datasets. Once after the initial clustering results are prepared, a speficific file for the configuration should be written for an input of Arboretum. The config file is a tab-delimited text file which has 3 columns:
DATASET_NAME (TAB) PATH/TO/INIT_CLUSTER.txt (TAB) PATH/TO/INPUT_VALUES.txt
Wrapper script step1_run_GMM_and_prep_config.sh
is written to perform the initialization GMM smoothly as well as to turn the results as Arboretum input form automatically. Jobs done by this script are:
- Performing GMM clustering based on the merged matrix of input values in Step 4
- Formatting the clustering result as dataset-sepcific initial cluster files
- Writing the config file automatically
Running command:
- USAGE: step1_run_GMM_and_prep_config.sh [#_samples] [order.txt] [#_cluster]
sh step1_run_GMM_and_prep_config.sh 5 arb_input/orders.txt 3
Result files: arb_input/merged_gmm_k#/
contains the dataset-specific cluster initialization cluster assignment files. arb_input/config.k#.txt
is the config file which is an input of Arboretum.
About selection of the number of clusters (k): The k number of GMM is usually matched to the k number of following Arboretum clustering, e.g. if we decided to use k=5 for the initialization, Arboretum should be also performed with k=5. However, if you specifically want to do the evaluation in GMM clusstering step, you can use the likelihood.txt
in the GMM result folder (arb_input/merged_gmm_k#/
).
Arboretum identifies the expression states of genes which could be vary across different datasets. These states are driven by the expression amount of genes, which means higher number of cluster ID corresponds to the higher expression of the gene. The increase of the number of states (i.e. number of clusters = k number) enables more precise patterning (e.g. 3 level differences in k=3 while 5-level differences in k=5), but also increases the complexity of the results as well.
Wrapper script step2_run_arboretum.sh
is written to perform the Arboretum more easily by providing the proper input files. However, there are bunch of other parameters for the adjusted run of Arboretum. For the detailed usages of the parameters, please refer to the original Arboretum github page.
Running command:
- USAGE: step2_run_arboretum.sh [#_cluster] [order.txt] [OGID.txt] [tree.txt] [config.txt] ["representative_dataset_name"] [output_folder_name]
sh step2_run_arboretum.sh 3 arb_input/orders.txt arb_input/OGID.txt arb_input/input_tree.txt arb_input/config.txt c1 arb_output/
- Note: "representative_dataset_name" is for the specification of a specific dataset name for printing out the result text with the gene ID of that dataset. You can do not mind this much and just use any dataset name because or OGIDs cannot have "missing" gene IDs, unlike the multi-species analysis cases.
Result files: arb_output/
folder contains all the results generated by the Arboretum run. Among the result file, allspecies_clusterassign_lca_brk.txt
is the most summarized result table, which is a tab-delimited text file. The rows are gene IDs of the "representative_dataset_name" which was specified by user
and columns are all nodes (leaves and branches) in the tree. The table is sorted by the cluster assignment of Ancestor1 (root ancestor of the tree; positioned at the right-most column of the table). Dummy lines are the separators of different cluster assignments between two group of Ancestor1 clusters.
About selection of the number of clusters (k): The specification of k number of the Arboretum could be done in several ways, such as by using the calculaed "penalized likelihood" or "BIC/AIC" values in the likelihood.txt
file in the result output folder arb_output/
, or by calculating the "clustering evaluation metrics (e.g. SI, DBI, etc.)" using the clustering results and input values (as the merged matrix by horizontal concatenation of gene values of datasets.
About random initialization: The wrapper script runs Arboretum based on the random seeding setting (parameter swithch: "-r yes"), which means different runs of Arboretum will result in slightly differen results even with same usage of k number. Running several different times with same k setting and find the best result based on a certain criterion (e.g. clustering evaluation metric) could be helpful for overcoming local optima problem of the Arboretum's E-M algorithm.
Arboretum stratifies the semi-continuous expression levels of genes of the datasets into finite numbers of expression states. This results in a gene's profile of expression states per datasets. The assessment based on these profiles is beneficial because it can catch the nuanced expressional changes which is not allowed in the DE analysis based on pairwise comparisons in usual.
We are interesting to identify the groups of genes that are sharing a common expression states profile because those gene sets may be working together as a pathway and they will generate a specific pattern across different datasets or biological contexts. To identify this type of genesets systematically, we do a hierarchical clustering based on the expression state profiles.
Wrapper script step3_run_findTransitionGenesets.sh
is developed for performing the hierarchical clustering of profiles. For the detailed usage of the program findTransitionGenesets
, please refer to the clustering code github page.
Running command:
- USAGE: step3_run_findTransitionGenesets.sh [Arboretum_output_folder] [order.txt] [OGID.txt] ["representative_dataset_name"] [output_folder_name]
sh step3_run_findTransitionGenesets.sh arb_output/ arb_input/orders.txt arb_input/OGID.txt c1 transition_genesets_output
Practically, the jobs which are done by this wrapper script are:
- Reordering the cluster IDs given by Arboretum into expression amount-related orders
- Re-format the result table of Arboretum
allspecies_clusterassign_lca_brk.txt
to the input form of the hierarchical clustering (allcelltypes_clusterassign_brk.txt
) - Run the hierarchical clustering with given input parameters
Result files:
- all_assign.txt : a formatted result file for plotting the all genes (rows) with expression state profiles and average expression values. The file is written as the input format for Heatmap.awk.
- ordered_clusterset_means.txt : a Heatmap.awk-formatted result file for plotting the pattern of identified clustersets as rows.
- clusterset##.txt : a Heatmap.awk-formatted result files for plotting cluster-specific heatmaps with genes within the cluster as rows.
- all_genes_clusterassignment_matrix.txt : a text table of the hierarchical clustering results in plain text. Rows are genes and columns are nodes of the relationship tree. Each clusters are delimited by "Dummy" rows which are starting with "Dummy###".
- ordered_mean_clusterassign_matrix.txt : a text table of the hierarchical clustering results in plain text but the rows are the clusterset IDs.
- all_genesets.txt : a text file of the hierarchical clustering result in different format. The 2 columns of this files are "cluster ID" and "lists of genes" (delimited as #).
Step 7's following result files are specifically formatted for the heatmap plotting using the script Heatmap.awk.
all_assign.txt
ordered_clusterset_means.txt
clusterset##.txt
Following script will draw the heatmap and save as *.svg
files based on the default color scheme. Please check the svg files in transition_genesets_output/
folder to check how they look.
all_assign.svg
ordered_clusterset_means.svg
clusterset##.svg
Running command:
- USAGE: draw_heatmap.sh [formatted_input_file] [max_value_of_cluster] [max_value_for_depicting_expreesions]
sh script/draw_heatmap.sh transition_genesets_output/ordered_clusterset_means.txt 3 2
sh script/draw_heatmap.sh transition_genesets_output/clusterset101.txt 3 2