SEQC Data

Download, aggregate and give SEQC samples better names.

Many thanks to SRA-Explorer, written by Phil Ewels, from which the "raw" metadata TSV files were downloaded.

Usage

Run snakemake --snakefile meta.snk to generate the clean metadata file.
Run snakemake download -j <n_jobs> to download FastQs (concurrently, via Snakemake).
Run snakemake by_fc -j <n_jobs> to aggregate gzipped FastQs by unique flowcell per site.
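The aggregation step can be pictured with a small sketch. This is not the Snakefile's actual rule; it only illustrates the idea under stated assumptions: the function names `group_by_site_flowcell` and `concat_gzipped` are hypothetical, and the input is assumed to be (site, flowcell, path) records taken from the clean metadata. It relies on the fact that gzip files concatenated byte-for-byte form a valid multi-member gzip stream. For paired-end data, mates would need to be kept in separate groups.

```python
import shutil
from collections import defaultdict
from pathlib import Path


def group_by_site_flowcell(records):
    """Group FastQ paths by (site, flowcell).

    `records` is an iterable of (site, flowcell, path) tuples; this record
    layout is an assumption for illustration, not the repo's schema.
    """
    groups = defaultdict(list)
    for site, flowcell, path in records:
        groups[(site, flowcell)].append(Path(path))
    # Sort paths so the concatenation order is reproducible.
    return {key: sorted(paths) for key, paths in groups.items()}


def concat_gzipped(paths, out_path):
    """Append gzipped files byte-for-byte into one multi-member gzip file."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

Copying the compressed bytes directly avoids decompressing and recompressing each file; standard gzip readers decode the result as the concatenation of the inputs.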

File name format from SRA

Sample title: GSM1156797: SEQC_ILM_BGI_A_1_L01_ATCACG_AC0AYTACXX; Homo sapiens; RNA-Seq

Format: (geo): SEQC_(technology)_(location)_(sample)_(replicate #)_(lane)_(sample_tag?)_(flowcell ID)
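As a rough illustration of the title format above, the fields can be pulled apart with a regular expression. This is a sketch, not the code in process_metadata.py: the names `SeqcSample`, `TITLE_RE`, and `parse_title` are hypothetical, and it assumes every field is present and underscore-separated, as in the example title.

```python
import re
from typing import NamedTuple, Optional


class SeqcSample(NamedTuple):
    geo: str
    technology: str
    location: str
    sample: str
    replicate: int
    lane: str
    sample_tag: str
    flowcell: str


# Assumes the layout "<geo>: SEQC_<tech>_<site>_<sample>_<rep>_<lane>_<tag>_<flowcell>; ..."
TITLE_RE = re.compile(
    r"^(?P<geo>GSM\d+): SEQC_(?P<technology>[^_]+)_(?P<location>[^_]+)_"
    r"(?P<sample>[^_]+)_(?P<replicate>\d+)_(?P<lane>L\d+)_"
    r"(?P<sample_tag>[^_]+)_(?P<flowcell>[^_;]+)"
)


def parse_title(title: str) -> Optional[SeqcSample]:
    """Return the parsed fields, or None if the title does not match."""
    match = TITLE_RE.match(title)
    if match is None:
        return None
    fields = match.groupdict()
    fields["replicate"] = int(fields["replicate"])
    return SeqcSample(**fields)
```

Returning None for non-matching titles lets a caller collect and inspect irregular names instead of failing on the first one.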

SEQC/MAQC-III Consortium - Scientific Data

Cross-platform ultradeep transcriptomic profiling of human reference RNA samples by RNA-Seq.

These descriptions on RNA-Seq sequencing sites are expanded from descriptions in the related research manuscript [13]. Each sequencing site was assigned a three-letter code and each platform vendor designated three ‘official sites’ (superscripted by *) before samples were distributed. Illumina HiSeq 2000 data were provided by 7 sites (ordered alphabetically by the site code): (1) Australian Genome Research Facility (AGR); (2) Beijing Genomics Institute (BGI); (3) Weill Cornell Medical College (CNL); (4) City of Hope (COH); (5) Mayo Clinic (MAY); (6) Novartis (NVS); and (7) the New York Genome Center (NYG), generating 100+100 nt read-pairs.

Life Technologies SOLiD 5500 data were provided by 4 sites: (1) the University of Liverpool (LIV); (2) Northwestern University (NWU); (3) the Pennsylvania State University (PSU); and (4) SeqWright Inc. (SQW)*, generating 51+36 nt read-pairs, except for Liverpool which applied a protocol variant giving single 76 nt reads.

Notes:

As of 11/09/20, MAY samples are not searchable by All Fields in SRA.
NYG sample lanes start with L02 instead of L01, so process_metadata.py reindexes the lanes to start at 1, i.e. L02 -> lane==1, etc.
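The lane reindexing can be sketched as a simple shift. This is not the code in process_metadata.py; `reindex_lanes` is a hypothetical helper, and it assumes lane labels look like "L02" and that reindexing means shifting all lanes at a site so the lowest becomes 1.

```python
def reindex_lanes(labels):
    """Map lane labels such as 'L02' to indices that start at 1.

    Assumes each label is 'L' followed by an integer; the lowest lane
    number seen becomes 1 and the others shift by the same amount.
    """
    lane_numbers = {label: int(label[1:]) for label in labels}
    if not lane_numbers:
        return {}
    shift = min(lane_numbers.values()) - 1
    return {label: number - shift for label, number in lane_numbers.items()}
```

With this shift, sites whose lanes already start at L01 are left unchanged, while a site starting at L02 maps L02 to 1, L03 to 2, and so on.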

theJasonFan/SEQC-data

Folders and files

Latest commit

History

Repository files navigation

SEQC Data

Usage

File name format from SRA

SEQC/MAQC-III Consortium - Scientific Data

Notes:

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages