Tissue-enrichment Test for Transcriptome Datasets
This repository contains scripts for testing a transcriptome dataset (RNA-seq or microarray) for enrichment of transcripts that are specific to particular tissue types. By default, the test is set up for Arabidopsis thaliana seeds, using the GEO SuperSeries GSE12404 as the reference gene expression atlas.
The scripts included here enable the user to:
- generate-tissue-enriched-genes.R: Use a reference gene expression atlas to identify genes that are spatially and/or temporally expressed in particular tissues or cell types.
- tissue-enrichment-test.R: Perform a statistical enrichment test to determine which tissue types from the atlas are represented in a given transcriptome dataset.
These scripts require the statistical software R version 3.1.0 or higher to be installed on your computer and added to your PATH. Generating tissue-enrichment heatmaps additionally requires the R packages pheatmap and RColorBrewer, which are installed automatically to a local folder the first time tissue-enrichment-test.R is run.
(CAUTION: This installation takes some time, and you may need administrator privileges. If this step causes errors, either set the variable 'make_heatmaps' in 'tissue-enrichment-test.R' to FALSE, or install the packages yourself and modify lines 92-96 appropriately.)
To download this repository, run this command in your desired destination folder:
git clone https://github.com/Gregor-Mendel-Institute/tissue-enrichment-test.git
After cloning the repository, run these commands to make sure everything is set up properly:
cd tissue-enrichment-test/
Rscript generate-tissue-enriched-genes.R
Rscript tissue-enrichment-test.R
This will create a collection of files in your local 'results' folder. These are results for the RNA-seq data from Nodine & Bartel 2012, including both unwashed and washed embryos from early stages of development.
The test defines tissues and timepoints in two separate two-column files:
- 'tissues.txt': column 1 gives the abbreviated label of each tissue, and column 2 gives the full description of what that label represents.
- 'timepoints.txt': column 1 gives the order of timepoints in a reference atlas with temporal data; these values must be numeric and start from 1. Column 2 gives a description of each timepoint, e.g. 'mature green' stage of development or '4 hours after induction'.
These two files are necessary for interpreting the reference atlas, so if you change the reference atlas you will need to change these files to match.
To test a tissue-specific transcriptome from Arabidopsis seeds, you will just need two files:
- a tab-delimited file (ending in '.tsv') of gene expression values, where each row is a unique gene and each column is one sample (see 'nodine_2012_embryos.tsv' for an example).
- a '.description' file of the same name, where each line contains a sample name and the developmental timepoint of that sample (see 'nodine_2012_embryos.description'). Each sample must be assigned one of the timepoints defined in 'timepoints.txt'.
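Before running the test, it can help to confirm that the two files agree with each other. The following is a minimal sketch, not part of the repository: the helper name check_dataset is hypothetical, and it assumes the expression table has a header row of sample names with gene names in the first column, and that the '.description' file is tab-delimited with no header. Check both assumptions against the example files before relying on it.

```r
# Hypothetical helper (not part of this repository): reports mismatches between
# an expression table and its description table.
# Assumed layout: `expr` has one column per sample; `desc` has the columns
# `sample` and `timepoint`.
check_dataset <- function(expr, desc, valid_timepoints) {
  problems <- character(0)
  # Every sample column needs a matching line in the description file
  undescribed <- setdiff(colnames(expr), desc$sample)
  if (length(undescribed) > 0) {
    problems <- c(problems, paste("no description for sample:", undescribed))
  }
  # Every sample must use a timepoint that is defined in 'timepoints.txt'
  bad_tp <- desc$sample[!(desc$timepoint %in% valid_timepoints)]
  if (length(bad_tp) > 0) {
    problems <- c(problems, paste("undefined timepoint for sample:", bad_tp))
  }
  problems
}

# Small in-memory stand-ins for the two files (illustrative values only)
expr <- read.delim(
  text = "gene\tsampleA\tsampleB\nAT1G01010\t5.2\t0.0\nAT1G01020\t1.1\t3.4",
  row.names = 1, check.names = FALSE
)
desc <- read.delim(
  text = "sampleA\t1\nsampleB\t9", header = FALSE,
  col.names = c("sample", "timepoint"), stringsAsFactors = FALSE
)

# Suppose 'timepoints.txt' defines timepoints 1 to 6
print(check_dataset(expr, desc, valid_timepoints = 1:6))
```

An empty result means no problems were found under these assumptions.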
If you are interested in testing datasets from other tissues or other species, you will need to find an appropriate reference gene expression atlas (see the section below).
generate-tissue-enriched-genes.R generates the file 'enriched_genes.txt', which contains lists of genes specifically expressed in one cell or tissue type at a given timepoint. You can change the statistical cutoffs by modifying the variables defined at the top of the script.
To generate 'enriched_genes.txt', the script requires a high-quality reference gene expression atlas that includes the tissue of interest and all nearby tissue types. In our experience, transcriptomes generated by laser capture microdissection (LCM) give the highest specificity, and we recommend using LCM datasets as the reference whenever they are available.
As mentioned above, generate-tissue-enriched-genes.R uses two description tables to interpret the reference atlas; together with the atlas itself, that makes three input files:
- 'tissues.txt': All tissue types represented in the atlas (left column: abbreviated names; right column: full names)
- 'timepoints.txt': A numbered table of developmental timepoints represented in the atlas
- 'reference_atlas.tsv': A tab-separated table of transcriptomes from different tissues and timepoints
To use your own reference atlas, save the gene expression data in the working directory as 'reference_atlas.tsv'. The name of each sample must be formatted as: tissue_timepoint.replicate (e.g. EP_1.1 or GSC_6.2). If you only have spatial data, set all timepoints to '1'. Timepoints must be numeric to allow the script to identify adjacent timepoints.
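The naming rule above can be checked before running the script. This is a minimal sketch under stated assumptions: parse_sample_names is a hypothetical helper, not a function from generate-tissue-enriched-genes.R, and the regular expression is just one way to express the tissue_timepoint.replicate pattern.

```r
# Hypothetical helper (not taken from generate-tissue-enriched-genes.R):
# splits names of the form tissue_timepoint.replicate into their parts
# and stops with an error if any name does not fit the pattern.
parse_sample_names <- function(sample_names) {
  pattern <- "^(.+)_([0-9]+)\\.([0-9]+)$"
  bad <- !grepl(pattern, sample_names)
  if (any(bad)) {
    stop("Malformed sample name(s): ", paste(sample_names[bad], collapse = ", "))
  }
  data.frame(
    sample    = sample_names,
    tissue    = sub(pattern, "\\1", sample_names),
    timepoint = as.integer(sub(pattern, "\\2", sample_names)),
    replicate = as.integer(sub(pattern, "\\3", sample_names)),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_sample_names(c("EP_1.1", "EP_1.2", "GSC_6.2"))
print(parsed)
```

Because the timepoint is converted to an integer, a non-numeric timepoint such as 'EP_early.1' is rejected rather than silently accepted.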
Then replace the contents of the two description tables ('tissues.txt' and 'timepoints.txt') with the appropriate information. If any sample types in 'reference_atlas.tsv' should be ignored, add '//exclude' to the description of that timepoint or tissue in the corresponding table. In our example, whole seed samples were excluded because they do not represent a specific tissue within the seed.
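To illustrate the effect of the '//exclude' marker, here is a small sketch of how flagged tissues could be filtered out of a list of atlas samples. It is an illustration only, not the script's own code: the label 'WHOLE' for whole seed samples is a made-up placeholder, and the column names label and description are assumptions.

```r
# Illustrative tissue table; 'WHOLE' is a hypothetical label, and the column
# names are assumptions rather than the layout used by the script.
tissues <- data.frame(
  label       = c("EP", "SUS", "WHOLE"),
  description = c("embryo proper", "suspensor", "whole seed //exclude"),
  stringsAsFactors = FALSE
)

# Labels whose description carries the '//exclude' marker
excluded <- tissues$label[grepl("//exclude", tissues$description, fixed = TRUE)]

# Drop every atlas sample that belongs to an excluded tissue
sample_names  <- c("EP_1.1", "SUS_1.1", "WHOLE_1.1")
sample_tissue <- sub("_[0-9]+\\.[0-9]+$", "", sample_names)
kept <- sample_names[!(sample_tissue %in% excluded)]
print(kept)
```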
By default, adjacent timepoints are merged together to increase statistical power and reduce false positives from sharp temporal changes in gene expression. This can be disabled by changing the option 'flanking' to FALSE under the 'STATISTICAL THRESHOLDS' heading in the script.
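As a rough illustration of what merging adjacent timepoints means, the sketch below pools each timepoint with its immediate numeric neighbours. This is an assumed reading of "adjacent" for illustration; the helper pooled_timepoints is hypothetical and the script may define the flanking window differently.

```r
# Hypothetical helper: which timepoints are pooled when testing timepoint `t`.
# Assumes "adjacent" means t - 1 and t + 1; this is not code from the script.
pooled_timepoints <- function(t, all_timepoints, flanking = TRUE) {
  if (!flanking) {
    return(t)
  }
  candidates <- c(t - 1, t, t + 1)
  # Keep only the neighbours that actually exist in the atlas
  candidates[candidates %in% all_timepoints]
}

timepoints <- 1:6
print(pooled_timepoints(1, timepoints))                   # first timepoint has one neighbour
print(pooled_timepoints(3, timepoints))                   # interior timepoint has two
print(pooled_timepoints(3, timepoints, flanking = FALSE)) # no merging
```

This also shows why timepoints must be numeric: neighbours are found by simple arithmetic on the timepoint index.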
At least two biological replicates are strongly recommended for each tissue in the reference atlas.
tissue-enrichment-test.R uses the tissue-enriched gene lists produced by generate-tissue-enriched-genes.R to check for enrichment of each tissue type in the transcriptomes you provide. Two files are required to test your samples:
- [data table].tsv
- [data table].description
To run the script on your datasets, open 'tissue-enrichment-test.R' and under the heading 'USER OPTIONS', change:
filename <- '[data table]'
Alternatively, you can provide the name of the data table on the command line in any of the following forms:
Rscript tissue-enrichment-test.R nodine_2012_embryos
Rscript tissue-enrichment-test.R nodine_2012_embryos.tsv
Rscript tissue-enrichment-test.R datasets/nodine_2012_embryos.tsv
This will override whatever default name is stored in the script as 'filename'. Note that any data table must be in the datasets/ subfolder of this repository, it must have the file extension .tsv, and it must have an associated .description file in order for the test to run.
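The accepted argument forms all resolve to the same data table name. The sketch below shows one way such normalization could work; it is not the script's actual implementation, and the helper names normalize_dataset_name and dataset_paths are hypothetical.

```r
# Hypothetical helpers (not the script's own code): reduce any accepted
# command-line form to the bare data table name, then build both file paths.
normalize_dataset_name <- function(arg) {
  # Drop any leading folder, then a trailing '.tsv' extension if present
  sub("\\.tsv$", "", basename(arg))
}

dataset_paths <- function(arg, folder = "datasets") {
  name <- normalize_dataset_name(arg)
  list(
    table       = file.path(folder, paste0(name, ".tsv")),
    description = file.path(folder, paste0(name, ".description"))
  )
}

# A command-line argument, if given, would replace the default name
filename <- "nodine_2012_embryos"
args <- commandArgs(trailingOnly = TRUE)
if (length(args) > 0) {
  filename <- normalize_dataset_name(args[1])
}

print(dataset_paths("datasets/nodine_2012_embryos.tsv"))
```

Calling file.exists() on both paths before starting is a cheap way to catch a missing '.description' file early.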
Other user options are listed at the top of the script to change the appearance of the output plots.
After running tissue-enrichment-test.R, you should see a collection of new files in the local folder 'results':
- [sample name].pdf: A plot of the cumulative distribution function (CDF) of each tissue type in the given sample.
- [data table]_p_values.txt: A table of p-values representing the enrichment of each tissue type in each dataset.
- [data table]_heatmap.pdf: A heatmap showing the enrichment score (-log10 p-value) for each tissue type in each dataset.
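The relationship between the p-value table and the heatmap can be reproduced by hand. The values below are made up for illustration and are not results from any dataset.

```r
# Illustrative p-values only (not real results), one per tissue type
p_values <- c(EP = 1e-12, SUS = 1e-4, PEN = 0.5)

# Enrichment score as described above: -log10 of the p-value
scores <- -log10(p_values)
print(round(scores, 2))
```

Smaller p-values give larger scores, so strongly enriched tissues stand out in the heatmap. Note that a p-value of exactly 0 yields an infinite score in R.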
Using the example data table 'nodine_2012_embryos', we can see that the eight samples of washed embryos show enrichment only of 'EP' (embryo proper) and 'SUS' (suspensor) transcripts, whereas the two unwashed samples are also enriched in the transcript populations of 'PEN' (peripheral endosperm) and 'GSC' (general seed coat).