Tree-Guided Integrated Factorization (TGIF) to identify conserved & differential genome organization
TGIF applies multi-task non-negative matrix factorization (NMF) to multiple input Hi-C matrices to identify conserved or dynamic compartment and TAD boundaries.
- Install
- Run TGIF-DB for differential TAD boundary identification
- Run TGIF-DC for differential compartment identification
Installation instructions below were tested in Linux Centos 7 distribution. GSL (GNU Scientific Library) 2.7 is used to handle matrix- and vector-related operations. Older versions of GSL (even 2.6) will not work since we use newly added functions.
- If you already have GSL 2.7 installed, edit the first few lines of the Makefile to point to the correct include and shared library directory, then jump to step 3.
#CHANGE PATHS AS NEEDED:
INCLUDE_PATH = ${CONDA_PREFIX}/include
LIBRARY_PATH = ${CONDA_PREFIX}/lib
- If you do not have GSL 2.7 installed, or you are not sure, one way to get it installed is to use conda:
conda install -c conda-forge gsl
- Make sure to add the location of the installed shared library to where the compiler/linker will be looking. If you used conda to install GSL to the default location in step 2, run the following command after activating the appropriate environment (or add the appropriate path if you already have GSL installed):
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${CONDA_PREFIX}/lib
- And let's install! In the same directory you downloaded the code/Makefile (either by cloning the repository or by downloading a release), run:
make tgif
- If all went well, you won't get any alarming messages, and you will see executables named
tgif-db
andtgif-dc
created in the same directory. A quick test below will print the manual for running TGIF:
./tgif-db -h
Note: in order to implement NNDSVD initialization of factors, a fast randomized SVD algorithm from RSVDPACK was used. A minor modification to allow random seed specification was made to the original code from RSVDPACK. This updated code is included under modules/random_svd directory. Compilation of this code is part of the included Makefile; no additional step is necessary for installation.
TGIF-DB identifies conserved and dynamic TAD boundary regions.
./tgif-db input/tgif-db/chr19_tree.txt input/tgif-db/chr19_25kb.bed -o output/tgif-db/
input/tgif-db/chr19_tree.txt
specifies the tree file, which contains file locations to individual task matrices (paths are relative to location of run_tgif executable location; more details below).input/tgif-db/chr19_25kb.bed
is a bed file specifying the start and end coordinates of each bin whose index is referred to in the sparse-format matrix files (more details below).- Optional:
-o output/tgif-db/
will put all output files to output/tgif-db/ directory. By default the output files will be saved to the current directory. Check out the example output directory in the repo.
- This file specifies the task hierarchy, where each task represents some biological condition and has its own input matrix.
- See example in input/tgif-db/chr19_tree.txt.
1 5 ES input/tgif-db/ES_chr19_25kb.txt
2 4 NPC input/tgif-db/NPC_chr19_25kb.txt
3 4 CN input/tgif-db/CN_chr19_25kb.txt
4 5 branch1 N/A
5 -1 root N/A
All 4 columns are required. If you want a flat tree or do not care about the task hierarchy, point all nodes to have the root node as the parent. If you'd like a tree structure suggested based on the structural similarity of the input matrices, see suggest_tree/.
- Column 1: node ID; start from 1 and move up.
- Column 2: parent node ID; right now the implementation will only work correctly if you ID all children nodes before a parent node (so start from the lowest depth of tree, move to next level, till you hit the root, which should be the last node ID.)
- Column 3: node alias, used to refer to each input matrix/task in the output files.
- Column 4: location of input matrix file for leaf nodes, relative to where the run_tgif executable is. Set as N/A for non-leaf nodes.
- Each of the matrix files referred to in the tree file.
- 0-indexed, tab-delimited, sparse-matrix format, no header
- Column 1: index of region or bin A
- Column 2: index of region or bin B
- Column 3: interaction count between bin A and B
- See example in input/tgif-db/ES_chr19_25kb.txt.
0 0 4454.112629
0 1 1280.765686
1 1 4216.131160
...
- This file specifies the starting and ending coordinates of each bin whose index is referred to in the matrix files.
- The matrices are 0-indexed (i.e. first entry index is [0,0]) and they must share a common coordinates file.
- Tab-delimited
- Column 1: chromosome
- Column 2: bin start coordinate
- Column 3: bin end coordinate
- Column 4: bin index (starting at 0)
- See exmaple in input/tgif-db/chr19_25kb.bed.
19 3075000 3100000 0
19 3100000 3125000 1
19 3125000 3150000 2
19 3150000 3175000 3
...
Position | Parameter | Description | Default Value/Behavior |
---|---|---|---|
1 | This file specifies the task hierarchy, where each task represents some biological condition and has its own input matrix. See documentation above for more details, format, and example. | N/A | |
2 | This is a mapping each index in the input matrix file to a chromosomal coordinate. See documentation above for more details, format, and example. | N/A | |
optional | -o | Output file path and prefix. Note: will NOT create a directory if the specified directory does not already exist. | Output files saved to current directory. |
optional | -l | Generate a file called parameters.log under the specified output path (or current directory by default) that lists the parameter settings used. | No log file of parameters saved. |
optional | -b | Generate files boundary_score.txt and background_score.txt under the specified output path (or current directory by default). | No boundary-score related files saved. |
optional | -p | Generate files boundary_pval.txt and boundary_adj_pval.txt under the specified output path (or current directory by default). | No p-value related files saved. |
optional | -w | Specify, in number of basepairs (e.g. 2000000), the size of the "window" or the dimension of the submatrices where factorization will be applied. | 2000000 (i.e., 2MB) |
optional | -s | Specify, in number of basepairs (e.g. 1000000), the size of the step size for sliding down the block diagonal of the input matrix to generate each set of submatrices. | 1000000 (i.e., 1MB) |
optional | -a | Specify an integer value used for the strength of the tree regularization. | 1000000 |
optional | -t | Used to filter out sparse regions with very few interaction counts. If the threshold value is 0.5 and the window size is 2000000 (i.e. 2MB), regions that has fewer than half of its neighbors within 2MB radius are filtered out before applying TGIF-DB. If the input matrices are sparse (e.g. total read depth is less than a billion in any of the matrices) AND you want to apply TGIF-DB to a relatively high resolution (e.g. 10kb), we recommend lowering the threshold to 0.2. | 0.5 |
The program will print its execution progress while it runs and print the total time (in seconds) and max memory (in MB) consumption at the end.
Reading in input files...
Performing TGIF...
[> ] 1%
[===> ] 5%
[======> ] 9%
[==========> ] 15%
[=====================> ] 31%
[================================> ] 47%
[============================================> ] 63%
[========================================================> ] 80%
[===================================================================> ] 96%
Testing for significant boundaries...
Total time elapsed: 213 seconds
Memory usage: 241MB
- A bed file called
significant_boundaries_summit_only.txt
denotes the boundary status of each region. - Each row = a genomic region.
- Each column = each of the input matrix/condition.
- 0 = non-boundary;
- 1 = significant boundary;
- N/A = given region was filted out due to sparsity.
- See example in output/tgif-db/significant_boundaries_summit_only.txt.
- NOTE:
significant_boundaries_summit_only.txt
marks only the region with the highest boundary score if a consecutive stretch of regions is flagged as significant boundaries. For the full set of such regions, seesignificant_boundaries.txt
file which is also generated by default.
#chro start end ES NPC CN
19 3075000 3100000 N/A N/A N/A
19 3100000 3125000 N/A N/A N/A
19 3125000 3150000 0 0 0
19 3150000 3175000 0 0 0
...
19 3575000 3600000 1 1 1
...
In the example output above, chr19 3575000-3600000 is a significant boundary in all three conditions (ES, NPC, CN).
- For each pair of input matrices or conditions, a bed file
A_vs_B_significantly_differential_boundary_regions.txt
is generated, listing only regions that are sigDB in the pairwise comparison of A and B. - Column 1: chromosome
- Column 2: bin start coordinate
- Column 3: bin end coordinate
- Column 4: absolute difference in boundary score between conditions A and B being compared
- Column 5: p-value of the absolute difference
- Column 6: adjusted p-value after FDR correction
- See example in output/tgif-db/ES_vs_CN_significantly_differential_boundary_regions.txt.
#chro start end |diff| pval padj
19 4250000 4275000 0.019795 0.000351258 0.00195699
19 4750000 4775000 0.0195557 0.000418252 0.00202151
19 6350000 6375000 0.0432202 2.64661e-16 0.000430108
In the example shown above, the genomic region chr19:4250000-4275000 contains a sigDB between ES and NPC condition.
Using the -b
flag, the user can print the boundary score for each region and for each task to boundary_score.txt
file. The format is the same as significant_boundaries_summit_only.txt
file, but each entry is a boundary score instead. Also, the scores used as the null distribution to test each boundary score's significance will be saved to background_score.txt
.
Using the -p
flag, the user can print the p-values of each boundary score to boundary_pval.txt
file. The adjusted p-values after FDR correction is printed to boundary_adj_pval.txt
. The format is the same as significant_boundaries_summit_only.txt
file, but each entry is a p-value instead.
Using the -l
flag, the user can save the list of parameter values used in this run of TGIF to parameters.log
file.
TGIF-DC identifies conserved and dynamic compartment regions. Usage for TGIF-DC is very similar to TGIF-DB, with differences in the optional parameters and the output files.
./tgif-dc input/tgif-dc/chr19_tree.txt input/tgif-dc/chr19_100kb.bed -o output/tgif-dc/
./tgif-dc input/tgif-dc/chr19_tree.txt input/tgif-dc/chr19_100kb.bed -k 5 -o output/tgif-dc/
input/tgif-dc/chr19_tree.txt
specifies the tree file, which contains file locations to individual task matrices (paths are relative to location of run_tgif executable location; more details below).input/tgif-dc/chr19_100kb.bed
is a bed file specifying the start and end coordinates of each bin whose index is referred to in the sparse-format matrix files (more details below).- Optional:
-k 5
By default, TGIF-DC operates with k=2 and finds 2 clusters of genomic regions that correspond to A or B compartments. To find subcompartments, i.e., more granular clusters, specify a higher k. - Optional:
-o output/tgif-dc/
will put all output files to output/tgif-dc/ directory. By default the output files will be saved to the current directory. Check out the example output directory in the repo.
- This file specifies the task hierarchy, where each task represents some biological condition and has its own input matrix.
- See example in input/tgif-dc/chr19_tree.txt.
1 5 ES input/tgif-dc/ES_chr19_100kb.txt
2 4 NPC input/tgif-dc/NPC_chr19_100kb.txt
3 4 CN input/tgif-dc/CN_chr19_100kb.txt
4 5 neural N/A
5 -1 root N/A
All 4 columns are required. If you want a flat tree or do not care about the task hierarchy, point all nodes to have the root node as the parent.
- Column 1: node ID; start from 1 and move up.
- Column 2: parent node ID; right now the implementation will only work correctly if you ID all children nodes before a parent node (so start from the lowest depth of tree, move to next level, till you hit the root, which should be the last node ID.)
- Column 3: node alias, used to refer to each input matrix/task in the output files.
- Column 4: location of input matrix file for leaf nodes, relative to where the run_tgif executable is. Set as N/A for non-leaf nodes.
- Each of the matrix files referred to in the tree file.
- 0-indexed, tab-delimited, sparse-matrix format, no header
- Column 1: index of region or bin A
- Column 2: index of region or bin B
- Column 3: interaction count between bin A and B
- See example in input/tgif-dc/ES_chr19_100kb.txt.
- Note that TGIF-DC assumes these are raw counts. If the input is already normalized O/E counts, use the
-e
flag (refer to the optional arguments section below).
0 0 121
0 1 176
1 1 8231
0 2 47
1 2 2442
2 2 31199
0 3 27
...
- This file specifies the starting and ending coordinates of each bin whose index is referred to in the matrix files.
- The matrices are 0-indexed (i.e. first entry index is [0,0]) and they must share a common coordinates file.
- Tab-delimited
- Column 1: chromosome
- Column 2: bin start coordinate
- Column 3: bin end coordinate
- Column 4: bin index (starting at 0)
- See exmaple in input/tgif-dc/chr19_100kb.bed.
19 3000000 3100000 0
19 3100000 3200000 1
19 3200000 3300000 2
19 3300000 3400000 3
...
Position | Parameter | Description | Default Value/Behavior |
---|---|---|---|
1 | This file specifies the task hierarchy, where each task represents some biological condition and has its own input matrix. See documentation above for more details, format, and example. | N/A | |
2 | This is a mapping each index in the input matrix file to a chromosomal coordinate. See documentation above for more details, format, and example. | N/A | |
optional | -k | By default, TGIF-DC operates with k=2 and finds 2 clusters of genomic regions that correspond to A or B compartments. To find subcompartments, i.e., more granular clusters, specify a higher k. | 2 (i.e. find 2 compartments) |
optional | -o | Output file path and prefix. Note: will NOT create a directory if the specified directory does not already exist. | Output files saved to current directory. |
optional | -l | Generate a file called parameters.log under the specified output path (or current directory by default) that lists the parameter settings used. | No log file of parameters saved. |
optional | -a | Specify an integer value used for the strength of the tree regularization. | 1000000 |
optional | -e | Use this flag if the input matrices for TGIF-DC are O/E count matrices. | TGIF-DC assumes input matrices are raw count matrices. |
optional | -u | Generate output files that contain the factors Us and Vs. | No factor files are saved. |
- See example in output/tgif-dc/cluster_assignment.txt and output/tgif-dc/k5_cluster_assignment.txt.
- For all settings of k, TGIF-DC returns a
cluster_assignment.txt
file, listing the cluster assignments of each region. - Each row = genomic region/bin
- Each input matrix/condition has a column denoting compartment A or B.
- In the example below, TGIF-DC was run with
-k 5
to find subcompartments or more than 2 clusters of regions; chr19 3300000-3400000 belongs to cluster 4 in ES and CN states.
#chro start�nd ES NPC CN
19 3000000 3100000 0 0 0
19 3100000 3200000 0 0 0
19 3200000 3300000 4 0 0
19 3300000 3400000 4 0 4
19 3400000 3500000 4 4 4
19 3500000 3600000 1 4 1
...
- See example in output/tgif-dc/compartment_assignment.txt.
- Under the default k=2, TGIF-DC returns a
compartment_assignment.txt
file, listing the compartment assignments of each region based on their mean O/E counts. - Each row = genomic region/bin
- Each input matrix/condition has a column denoting compartment A or B.
- In the example below, chr19 region 3000000-3100000 belongs to compartment B in ES, NPC, and CN cell states; chr19 region 3200000-3300000 belongs to compartment A in ES and B in NPC, and CN cell states.
#chro start end ES NPC CN
19 3000000 3100000 B B B
19 3100000 3200000 B B B
19 3200000 3300000 A B B
19 3300000 3400000 A B A
19 3400000 3500000 A A A
19 3500000 3600000 A A A
...
- For each pair of input matrices or conditions, a bed file
A_vs_B_significantly_differential_compartmental_regions.txt
is generated, listing only regions that are sigDC in the pairwise comparison of A and B. - Column 1: chromosome
- Column 2: bin start coordinate
- Column 3: bin end coordinate
- Column 4: cosine distance between conditions A and B's latent feature from factor
- Column 5: p-value of the distance
- Column 6: adjusted p-value after FDR correction
- See example in output/tgif-db/ES_vs_CN_significantly_differential_compartmental_regions.txt.
#chro start end |diff| pval padj
19 11200000 11300000 0.179182 0.00327075 0.00660377
19 11800000 11900000 0.230168 0.000107882 0.00463122
19 14000000 14100000 0.246547 2.97501e-05 0.00394511
In the example shown above, the genomic region chr19:11200000-11300000 is a sigDC between ES and CN condition.
Each node in the tree will have a V factor (of size n x 2), and each leaf node in the tree will have a U factor (of size n x 2). Using the -u
flag, each of these factors will be printed to file. For instance, the U factor for the ES node will be saved to output file ES_U.txt
; the V factor for the root node will be saved to output file root_V.txt
.
Using the -l
flag, the user can save the list of parameter values used in this run of TGIF to parameters.log
file.