diff --git a/docs/getting_started.md b/docs/getting_started.md index e253ee9..b17cad9 100644 --- a/docs/getting_started.md +++ b/docs/getting_started.md @@ -78,26 +78,26 @@ For example R workflows, such as clustering of gene expression data, please see If you identify issues with your download, please [file an issue on GitHub](https://github.com/AlexsLemonade/refinebio/issues). If you would prefer to report issues via e-mail, you can also email [ccdl@alexslemonade.org](mailto:ccdl@alexslemonade.org). -## Getting Started with Species Compendia +## Getting Started with Normalized Compendia -A species compendium includes a gene expression matrix and experiment and sample metadata for all samples from a given organism that are fit for inclusion in the species compendium. -Species compendia provide a snapshot of the most complete collection of gene expression that refine.bio can produce for each supported organism. +A normalized compendium includes a gene expression matrix and experiment and sample metadata for all samples from a given organism that are fit to be aggregated and normalized as part of the compendium. +Normalized compendia provide a snapshot of the most complete collection of gene expression that refine.bio can produce for each supported organism. -You can read more about how we process species compendia in [our documentation](http://docs.refine.bio/en/latest/main_text.html#species-compendia). +You can read more about how we process normalized compendia in [our documentation](http://docs.refine.bio/en/latest/main_text.html#species-compendia).
### Structure -**The download folder structure for a species compendium** +**The download folder structure for a normalized compendium** -![docs-downloads-species-compendia](https://user-images.githubusercontent.com/15315514/56142320-74ab4980-5f6c-11e9-8847-9f7d178cd080.png) +![docs-downloads-normalized-compendia](https://user-images.githubusercontent.com/15315514/56142320-74ab4980-5f6c-11e9-8847-9f7d178cd080.png) ### Contents * The `aggregated_metadata.json` file contains experiment metadata and information about the transformation applied to the data. -Specifically, the `scale_by` field notes any row-wise transformation that was performed on the gene expression data. For species compendia, this value should always be `NONE`. +Specifically, the `scale_by` field notes any row-wise transformation that was performed on the gene expression data. For normalized compendia, this value should always be `NONE`. * The gene expression matrix is the tab-separated value (TSV) file that bears the species name. -For example, if you have downloaded the zebrafish species compendium, you would find the gene expression matrix in the file `DANIO_RERIO/DANIO_RERIO.tsv`. +For example, if you have downloaded the zebrafish normalized compendium, you would find the gene expression matrix in the file `DANIO_RERIO/DANIO_RERIO.tsv`. Note that samples are _columns_ and rows are _genes_ or _features_. This pattern is consistent with the input for many programs specifically designed for working with high-throughput gene expression data but may be transposed from what other machine learning libraries are expecting. 
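If a downstream library expects samples as rows, the matrix can be transposed after loading. The following is a minimal sketch using only the Python standard library; the gene and sample identifiers are invented, and it assumes that the first column holds gene identifiers and that every value is numeric.

```python
import csv
import io

# Tiny stand-in for a compendium matrix such as DANIO_RERIO/DANIO_RERIO.tsv.
# The gene and sample identifiers are invented for illustration.
tsv_text = (
    "Gene\tSAMPLE_A\tSAMPLE_B\tSAMPLE_C\n"
    "GENE_1\t0.1\t0.2\t0.3\n"
    "GENE_2\t1.5\t1.0\t0.5\n"
)

# Read the matrix as downloaded: one row per gene, one column per sample.
rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
header, gene_rows = rows[0], rows[1:]
sample_ids = header[1:]  # assumption: the first column holds gene identifiers
gene_ids = [row[0] for row in gene_rows]
genes_by_samples = [[float(value) for value in row[1:]] for row in gene_rows]

# Transpose so that each row is a sample and each column is a gene.
samples_by_genes = [list(column) for column in zip(*genes_by_samples)]

for sample_id, values in zip(sample_ids, samples_by_genes):
    print(sample_id, dict(zip(gene_ids, values)))
```

To read a real download, replace `io.StringIO(tsv_text)` with an open file handle for the TSV file.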
@@ -117,7 +117,7 @@ We strongly encourage you to consider using methods or models that can account f #### Methods evaluation and exploratory data analysis -To identify appropriate methods for processing the initial releases of species compendia (described [here](http://docs.refine.bio/en/latest/main_text.html#species-compendia)), we performed a series of evaluations in a small zebrafish test compendium. +To identify appropriate methods for processing the initial releases of normalized compendia (described [here](http://docs.refine.bio/en/latest/main_text.html#species-compendia)), we performed a series of evaluations in a small zebrafish test compendium. We've made these evaluations available and have documented our rationale on GitHub [here](https://github.com/AlexsLemonade/compendium-processing/tree/94089d2de170f0ca7b87e9e5c32239a8591faaa7/select_imputation_method). We have also performed exploratory analyses in a larger zebrafish test compendium ([GitHub](https://github.com/AlexsLemonade/compendium-processing/tree/94089d2de170f0ca7b87e9e5c32239a8591faaa7/quality_check)). diff --git a/docs/main_text.md b/docs/main_text.md index 38cc7fd..67d4379 100644 --- a/docs/main_text.md +++ b/docs/main_text.md @@ -404,12 +404,17 @@ Specifically, the `aggregate_by` and `scale_by` fields note how the samples are The `quantile_normalized` fields notes whether or not quantile normalization was performed. Currently, we only support skipping quantile normalization for RNA-seq experiments when aggregating by experiment on the web interface. -# Species Compendia +# refine.bio Compendia We periodically release compendia comprised of all the samples from a species that we were able to process. -We refer to these as **species compendia**. +We refer to these as **refine.bio compendia**. +We offer two kinds of refine.bio compendia: [normalized compendia](#normalized-compendia) and [RNA-seq sample compendia](#rna-seq-sample-compendia). 
+ +## Normalized compendia + +refine.bio normalized compendia are comprised of all the samples from a species that we were able to process, aggregate, and normalize. +Normalized compendia provide a snapshot of the most complete collection of gene expression that refine.bio can produce for each supported organism. We process these compendia in a manner that is different from the options that are available via the web user interface. -These species compendia provide a snapshot of the most complete collection of gene expression that refine.bio can produce for each supported organism. The refine.bio web interface does an inner join when datasets are combined, so only genes present in all datasets are included in the final matrix. For compendia, we take the union of all genes, filling in any missing values with `NA`. @@ -418,9 +423,9 @@ We use a full outer join because it allows us to retain more genes in a compendi ![outer join](https://user-images.githubusercontent.com/15315514/44534241-4dde9100-a6c5-11e8-8a9c-aa147e294e81.png) -We perform an outer join each time samples are combined in the process of building species compendia. +We perform an outer join each time samples are combined in the process of building normalized compendia. -![docs-species-compendia](https://user-images.githubusercontent.com/15315514/48498088-72995f00-e803-11e8-8832-3a9024748431.png) +![docs-normalized-compendia](https://user-images.githubusercontent.com/15315514/65698014-ddcdd780-e049-11e9-8ed6-f1d2f8ac2ee7.png) Samples from each technology—microarray and RNA-seq—are combined separately. In RNA-seq samples, we filter out genes with low total counts and then `log2(x + 1)` the data. @@ -432,12 +437,28 @@ We then quantile normalize all samples as described above. We've made our analyses underlying processing choices and exploring test compendia available at our `compendium-processing` repository. 
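The difference between the inner join used by the web interface and the outer join used for compendia can be illustrated with a small sketch. This is not the refine.bio implementation; the `join_genes` helper, the dataset names, and the values are hypothetical, and `None` stands in for `NA`.

```python
# Two hypothetical datasets mapping gene identifier -> expression value,
# one sample each. All names and values are invented for illustration.
dataset_a = {"GENE_1": 2.0, "GENE_2": 3.5, "GENE_3": 1.2}
dataset_b = {"GENE_2": 4.1, "GENE_3": 0.7, "GENE_4": 5.0}


def join_genes(datasets, how):
    """Combine datasets on gene identifier; missing values become None (NA)."""
    gene_sets = [set(dataset) for dataset in datasets]
    if how == "inner":
        # Keep only genes present in every dataset.
        genes = set.intersection(*gene_sets)
    elif how == "outer":
        # Keep the union of all genes.
        genes = set.union(*gene_sets)
    else:
        raise ValueError(f"unsupported join type: {how!r}")
    return {gene: [d.get(gene) for d in datasets] for gene in sorted(genes)}


inner = join_genes([dataset_a, dataset_b], how="inner")
outer = join_genes([dataset_a, dataset_b], how="outer")

print("inner:", inner)
print("outer:", outer)
```

The inner join drops `GENE_1` and `GENE_4` because each is missing from one dataset, while the outer join keeps both and records the missing value.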
-## Download Folder +### Download Folder + Users will receive a zipped folder with a gene expression matrix aggregated by species, along with associated metadata. Below is the detailed folder structure: ![docs-downloads-species-compendia](https://user-images.githubusercontent.com/15315514/65180873-ddbb4f80-da2b-11e9-97e9-127c68106182.png) +## RNA-Seq Sample Compendia + +refine.bio RNA-seq sample compendia are comprised of the Salmon output for the collection of RNA-seq samples from an organism that we have processed with refine.bio. +Each individual sample has its own `quant.sf` file; the samples have not been aggregated or normalized. +RNA-seq sample compendia are designed to allow users who are comfortable handling these files to generate output that is most useful for their downstream applications. +Please see the [Salmon documentation on the `quant.sf` output format](https://salmon.readthedocs.io/en/latest/file_formats.html#quantification-file) for more information. + +### Download Folder + +Users will receive a zipped folder with individual `quant.sf` files for each sample that we were able to process with Salmon, grouped into folders based on the experiment those samples come from, along with any associated metadata in refine.bio. +Please note that our RNA-seq sample metadata is limited at this time, and in some cases we could not run Salmon successfully on every sample within an experiment (e.g., our processing infrastructure encountered an error with the sample or the sequencing files were malformed). +In addition, we use the terms "sample" and "experiment" to be consistent with the rest of refine.bio, but files will use run identifiers (e.g., SRR, ERR, DRR) and project identifiers (e.g., SRP, ERP, DRP), respectively. +Below is the detailed folder structure: + +![docs-downloads-quantpendia](https://user-images.githubusercontent.com/15315514/65271488-4289af00-daeb-11e9-9006-2d7536c4a103.png) # API