Merge pull request #115 from AlexsLemonade/jaclyn-taroni/109-and-114
Changes to docs to account for RNA-seq sample compendia
jaclyn-taroni committed Sep 26, 2019
2 parents f68009e + 1df6ac7 commit 6cb8748
Showing 2 changed files with 36 additions and 15 deletions.
18 changes: 9 additions & 9 deletions docs/getting_started.md
@@ -78,26 +78,26 @@ For example R workflows, such as clustering of gene expression data, please see

If you identify issues with your download, please [file an issue on GitHub](https://github.com/AlexsLemonade/refinebio/issues). If you would prefer to report issues via e-mail, you can also email [ccdl@alexslemonade.org](mailto:ccdl@alexslemonade.org).

## Getting Started with Species Compendia
## Getting Started with Normalized Compendia

A species compendium includes a gene expression matrix and experiment and sample metadata for all samples from a given organism that are fit for inclusion in the species compendium.
Species compendia provide a snapshot of the most complete collection of gene expression that refine.bio can produce for each supported organism.
A normalized compendium includes a gene expression matrix and experiment and sample metadata for all samples from a given organism that are fit to be aggregated and normalized as part of the compendium.
Normalized compendia provide a snapshot of the most complete collection of gene expression that refine.bio can produce for each supported organism.

You can read more about how we process species compendia in [our documentation](http://docs.refine.bio/en/latest/main_text.html#species-compendia).
You can read more about how we process normalized compendia in [our documentation](http://docs.refine.bio/en/latest/main_text.html#species-compendia).

### Structure

**The download folder structure for a species compendium**
**The download folder structure for a normalized compendium**

![docs-downloads-species-compendia](https://user-images.githubusercontent.com/15315514/56142320-74ab4980-5f6c-11e9-8847-9f7d178cd080.png)
![docs-downloads-normalized-compendia](https://user-images.githubusercontent.com/15315514/56142320-74ab4980-5f6c-11e9-8847-9f7d178cd080.png)

### Contents

* The `aggregated_metadata.json` file contains experiment metadata and information about the transformation applied to the data.
Specifically, the `scale_by` field notes any row-wise transformation that was performed on the gene expression data. For species compendia, this value should always be `NONE`.
Specifically, the `scale_by` field notes any row-wise transformation that was performed on the gene expression data. For normalized compendia, this value should always be `NONE`.

* The gene expression matrix is the tab-separated value (TSV) file that bears the species name.
For example, if you have downloaded the zebrafish species compendium, you would find the gene expression matrix in the file `DANIO_RERIO/DANIO_RERIO.tsv`.
For example, if you have downloaded the zebrafish normalized compendium, you would find the gene expression matrix in the file `DANIO_RERIO/DANIO_RERIO.tsv`.
Note that samples are _columns_ and rows are _genes_ or _features_.
This pattern is consistent with the input for many programs specifically designed for working with high-throughput gene expression data but may be transposed from what other machine learning libraries are expecting.
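
If a library expects samples as rows, the matrix can be transposed after loading. The following is a minimal sketch rather than an official refine.bio workflow: it uses only the Python standard library, assumes the first column of the TSV holds gene identifiers, and treats `NA` or empty cells as missing values. The function names and the toy identifiers are hypothetical.

```python
import csv
import io
import math


def parse_value(text):
    """Convert a matrix cell to float; treat NA or empty cells as missing."""
    return math.nan if text in ("", "NA") else float(text)


def read_expression_matrix(handle):
    """Read a genes-by-samples TSV and return (gene_ids, sample_ids, rows)."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_ids = header[1:]  # assumption: first column holds gene identifiers
    gene_ids, rows = [], []
    for record in reader:
        gene_ids.append(record[0])
        rows.append([parse_value(cell) for cell in record[1:]])
    return gene_ids, sample_ids, rows


def transpose(rows):
    """Turn a genes-by-samples list of rows into samples-by-genes."""
    return [list(column) for column in zip(*rows)]


# Toy stand-in for a file such as DANIO_RERIO/DANIO_RERIO.tsv (made-up values).
toy_tsv = "Gene\tsample_1\tsample_2\ngene_a\t1.5\t2.5\ngene_b\t0.0\t4.0\n"
gene_ids, sample_ids, rows = read_expression_matrix(io.StringIO(toy_tsv))
samples_by_genes = transpose(rows)  # one row per sample, one column per gene
```

To read a real download, pass `open("DANIO_RERIO/DANIO_RERIO.tsv", newline="")` in place of the `io.StringIO` object.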

@@ -117,7 +117,7 @@ We strongly encourage you to consider using methods or models that can account f

#### Methods evaluation and exploratory data analysis

To identify appropriate methods for processing the initial releases of species compendia (described [here](http://docs.refine.bio/en/latest/main_text.html#species-compendia)), we performed a series of evaluations in a small zebrafish test compendium.
To identify appropriate methods for processing the initial releases of normalized compendia (described [here](http://docs.refine.bio/en/latest/main_text.html#species-compendia)), we performed a series of evaluations in a small zebrafish test compendium.
We've made these evaluations available and have documented our rationale on GitHub [here](https://github.com/AlexsLemonade/compendium-processing/tree/94089d2de170f0ca7b87e9e5c32239a8591faaa7/select_imputation_method).

We have also performed exploratory analyses in a larger zebrafish test compendium ([GitHub](https://github.com/AlexsLemonade/compendium-processing/tree/94089d2de170f0ca7b87e9e5c32239a8591faaa7/quality_check)).
33 changes: 27 additions & 6 deletions docs/main_text.md
@@ -404,12 +404,17 @@ Specifically, the `aggregate_by` and `scale_by` fields note how the samples are
The `quantile_normalized` field notes whether or not quantile normalization was performed.
Currently, we only support skipping quantile normalization for RNA-seq experiments when aggregating by experiment on the web interface.

# Species Compendia
# refine.bio Compendia

We periodically release compendia comprised of all the samples from a species that we were able to process.
We refer to these as **species compendia**.
We refer to these as **refine.bio compendia**.
We offer two kinds of refine.bio compendia: [normalized compendia](#normalized-compendia) and [RNA-seq sample compendia](#rna-seq-sample-compendia).

## Normalized compendia

refine.bio normalized compendia are comprised of all the samples from a species that we were able to process, aggregate, and normalize.
Normalized compendia provide a snapshot of the most complete collection of gene expression that refine.bio can produce for each supported organism.
We process these compendia in a manner that is different from the options that are available via the web user interface.
These species compendia provide a snapshot of the most complete collection of gene expression that refine.bio can produce for each supported organism.

The refine.bio web interface does an inner join when datasets are combined, so only genes present in all datasets are included in the final matrix.
For compendia, we take the union of all genes, filling in any missing values with `NA`.
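
The difference between the two joins can be illustrated with a small sketch. This is not the refine.bio implementation: it is a standard-library illustration in which each dataset is a hypothetical dictionary mapping a gene identifier to that gene's values across the dataset's samples, and `combine` is a made-up helper name.

```python
import math

NA = math.nan  # stand-in for the NA fill value


def combine(datasets, how):
    """Combine gene -> values dictionaries with an inner or outer join on genes."""
    gene_sets = [set(dataset) for dataset in datasets]
    if how == "inner":
        genes = set.intersection(*gene_sets)  # genes present in every dataset
    elif how == "outer":
        genes = set.union(*gene_sets)  # genes present in any dataset
    else:
        raise ValueError("how must be 'inner' or 'outer'")
    # Number of samples in each dataset, used to pad missing genes with NA.
    widths = [len(next(iter(dataset.values()))) for dataset in datasets]
    combined = {}
    for gene in sorted(genes):
        row = []
        for dataset, width in zip(datasets, widths):
            row.extend(dataset.get(gene, [NA] * width))
        combined[gene] = row
    return combined


# Two toy datasets that share only gene_b (made-up values).
dataset_1 = {"gene_a": [1.0, 2.0], "gene_b": [3.0, 4.0]}
dataset_2 = {"gene_b": [5.0], "gene_c": [6.0]}

inner = combine([dataset_1, dataset_2], how="inner")  # keeps gene_b only
outer = combine([dataset_1, dataset_2], how="outer")  # keeps all three genes
```

In the outer result, `gene_a` has a missing value for the single sample of `dataset_2`, and `gene_c` has missing values for both samples of `dataset_1`.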
@@ -418,9 +423,9 @@ We use a full outer join because it allows us to retain more genes in a compendi

![outer join](https://user-images.githubusercontent.com/15315514/44534241-4dde9100-a6c5-11e8-8a9c-aa147e294e81.png)

We perform an outer join each time samples are combined in the process of building species compendia.
We perform an outer join each time samples are combined in the process of building normalized compendia.

![docs-species-compendia](https://user-images.githubusercontent.com/15315514/48498088-72995f00-e803-11e8-8832-3a9024748431.png)
![docs-normalized-compendia](https://user-images.githubusercontent.com/15315514/65698014-ddcdd780-e049-11e9-8ed6-f1d2f8ac2ee7.png)

Samples from each technology—microarray and RNA-seq—are combined separately.
In RNA-seq samples, we filter out genes with low total counts and then `log2(x + 1)` the data.
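
The RNA-seq step can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the text does not give the cutoff for "low total counts", so `min_total` is a hypothetical parameter, and `filter_and_log2` is a made-up helper name.

```python
import math


def filter_and_log2(counts, min_total):
    """Drop genes whose summed counts fall below min_total, then apply log2(x + 1).

    counts maps a gene identifier to its per-sample counts.
    min_total is a hypothetical cutoff; the actual threshold is not stated here.
    """
    return {
        gene: [math.log2(value + 1) for value in values]
        for gene, values in counts.items()
        if sum(values) >= min_total
    }


# Toy counts: gene_a totals 1 and is filtered out; gene_b totals 25 and is kept.
toy_counts = {"gene_a": [0, 1, 0], "gene_b": [3, 7, 15]}
transformed = filter_and_log2(toy_counts, min_total=10)
```

Adding 1 before taking the logarithm keeps zero counts defined, since `log2(0 + 1)` is 0.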
@@ -432,12 +437,28 @@ We then quantile normalize all samples as described above.

We've made our analyses underlying processing choices and exploring test compendia available at our <a href = "https://github.com/AlexsLemonade/compendium-processing" target = "blank">`compendium-processing`</a> repository.

## Download Folder
### Download Folder

Users will receive a zipped folder with a gene expression matrix aggregated by species, along with associated metadata.
Below is the detailed folder structure:

![docs-downloads-species-compendia](https://user-images.githubusercontent.com/15315514/65180873-ddbb4f80-da2b-11e9-97e9-127c68106182.png)

## RNA-Seq Sample Compendia

refine.bio RNA-seq sample compendia are comprised of the Salmon output for the collection of RNA-seq samples from an organism that we have processed with refine.bio.
Each individual sample has its own `quant.sf` file; the samples have not been aggregated and normalized.
RNA-seq sample compendia are designed to allow users that are comfortable handling these files to generate output that is most useful for their downstream applications.
Please see the [Salmon documentation on the `quant.sf` output format](https://salmon.readthedocs.io/en/latest/file_formats.html#quantification-file) for more information.
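
As a starting point for handling these files, here is a minimal sketch that reads one `quant.sf` file with the Python standard library. It relies on the tab-separated columns described in the Salmon documentation (`Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`); the function name and the toy transcript identifiers are hypothetical.

```python
import csv
import io


def read_quant_sf(handle):
    """Return a dictionary mapping transcript name to its TPM and NumReads."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["Name"]: {
            "tpm": float(row["TPM"]),
            "num_reads": float(row["NumReads"]),
        }
        for row in reader
    }


# Toy stand-in for a quant.sf file (made-up transcripts and values).
toy_quant = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "transcript_1\t1000\t850.0\t12.5\t40.0\n"
    "transcript_2\t500\t350.0\t0.0\t0.0\n"
)
quant = read_quant_sf(io.StringIO(toy_quant))
```

For a real file, pass an open file handle, for example `open(path, newline="")`, instead of the `io.StringIO` object.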

### Download Folder

Users will receive a zipped folder with individual `quant.sf` files for each sample that we were able to process with Salmon, grouped into folders based on the experiment those samples come from, along with any associated metadata in refine.bio.
Please note that our RNA-seq sample metadata is limited at this time and in some cases, we could not successfully run Salmon on every sample within an experiment (e.g., our processing infrastructure encountered an error with the sample, the sequencing files were malformed).
In addition, we use the terms "sample" and "experiment" to be consistent with the rest of refine.bio, but files will use run identifiers (e.g., SRR, ERR, DRR) and project identifiers (e.g., SRP, ERP, DRP), respectively.
Below is the detailed folder structure:

![docs-downloads-quantpendia](https://user-images.githubusercontent.com/15315514/65271488-4289af00-daeb-11e9-9006-2d7536c4a103.png)
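
Because the files are grouped into one folder per experiment, they can be collected by walking the unzipped download. The sketch below is an assumption-laden illustration: it guesses that each file name ends in `quant.sf` and that the parent folder name is the project identifier, so check both against your own download. The function name and the identifiers in the toy folder tree are made up.

```python
import tempfile
from collections import defaultdict
from pathlib import Path


def group_quant_files(root):
    """Map each experiment folder name to the quant.sf file names inside it."""
    groups = defaultdict(list)
    # Assumption: file names end in "quant.sf"; adjust the pattern if they do not.
    for path in sorted(Path(root).rglob("*quant.sf")):
        groups[path.parent.name].append(path.name)
    return dict(groups)


# Build a throwaway folder tree with made-up identifiers to exercise the function.
toy_layout = [
    ("SRP000001", "SRR000001"),
    ("SRP000001", "SRR000002"),
    ("ERP000001", "ERR000001"),
]
with tempfile.TemporaryDirectory() as tmp:
    for project, run in toy_layout:
        folder = Path(tmp) / project
        folder.mkdir(exist_ok=True)
        (folder / f"{run}_quant.sf").write_text(
            "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
        )
    grouped = group_quant_files(tmp)
```

Experiments in which Salmon could not be run on every sample will simply list fewer files than the experiment has runs.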

# API
