TCGA_dbGaP

The repository contains scripts to automatically fetch related dbGaP studies and subsequently the specific sequence files for given TCGA data.

To use the scripts, download the entire contents of the "bin" folder.

Descriptions of the required scripts are provided below.

Platform requirements:

Python 2.7 -> For installing and configuring Python, refer to https://www.python.org/download/releases/2.7/

SRA Toolkit -> For details follow the 'Downloading and installing the SRA Toolkit' instructions at https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=std

Required python packages

To install a package, type pip install <package_name> at the command line. See:

https://packaging.python.org/installing/

  • requests
  • pandas

Program descriptions

"fetch_dbGaP_with_TCGA.py"

INPUT:

This script takes a TCGA project ID, file ID, or case ID as input.

Alternatively, the input can be a TCGA disease type or experiment strategy.

Allowable arguments are:

'-i', '--idSearch', type=str, default=None, help=' project/case/file id'
'-r', '--returnType', type=str, default='case', help='View TCGA results by project/file/case'
'-s', '--searchType', required = True, type=str, default='case', help='Search type for search by id project/file/case'
'-d', '--disease', type=str, default=None, help='disease param'
'-n', '--studyType', type=str, default=None, help='study type param wgs/wxs/rnaseq/etc'
'-l', '--stringencyLevel', type=str, default="high", help='stringency level of dbGaP term match'
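
The argument list above reads like a set of argparse declarations. As a rough sketch of how such a parser behaves (this is not the script's actual source; the `build_parser` helper, the description string, and the sample values `project` and `TCGA-BRCA` are assumptions made for illustration), the options could be declared and exercised like this:

```python
import argparse


def build_parser():
    # Hypothetical helper: mirrors the options listed in this README.
    parser = argparse.ArgumentParser(
        description="Sketch of the fetch_dbGaP_with_TCGA.py options")
    parser.add_argument('-i', '--idSearch', type=str, default=None,
                        help='project/case/file id')
    parser.add_argument('-r', '--returnType', type=str, default='case',
                        help='View TCGA results by project/file/case')
    parser.add_argument('-s', '--searchType', required=True, type=str,
                        default='case',
                        help='Search type for search by id project/file/case')
    parser.add_argument('-d', '--disease', type=str, default=None,
                        help='disease param')
    parser.add_argument('-n', '--studyType', type=str, default=None,
                        help='study type param wgs/wxs/rnaseq/etc')
    parser.add_argument('-l', '--stringencyLevel', type=str, default="high",
                        help='stringency level of dbGaP term match')
    return parser


# Roughly what "-s project -i TCGA-BRCA" on the command line would produce.
# The values are illustrative only.
args = build_parser().parse_args(['-s', 'project', '-i', 'TCGA-BRCA'])
```

Because `--searchType` is marked as required, it must always be given explicitly; the other options fall back to their listed defaults when omitted.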

OUTPUT:

The output is two .csv files. The first contains the IDs, URLs, and other TCGA information. The second contains the dbGaP accession numbers, associated URLs, and other study information for the related studies. The default name of the TCGA file is "tcga_output.csv" and the default name of the dbGaP file is "dbGAP_output.csv".

"fetch_SRRs.py"

INPUT:

Allowable arguments are:

'-f', '--file', type=str, default=None, help=' <path to file containing dbGap Ids>'
'-id', '--dbGapIds', type=str, default=None, help='<comma separated list of phs>'

The file input can directly be the output file "dbGAP_output.csv" from "fetch_dbGaP_with_TCGA.py"
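
If you prefer the `-id` option over `-f`, the comma-separated list can be assembled from a CSV file. The sketch below is only an illustration: the column names and rows in `sample_lines` are invented, the actual layout of "dbGAP_output.csv" is not documented here, and whether `fetch_SRRs.py` accepts versioned accessions (such as `phs000178.v1.p1`) is an assumption. It scans every cell for strings shaped like dbGaP study accessions (`phs` followed by six digits, optionally with `.vN.pN`).

```python
import csv
import re

# dbGaP study accessions: "phs" + six digits, optional version/participant set.
PHS_PATTERN = re.compile(r'phs\d{6}(?:\.v\d+\.p\d+)?')


def collect_phs_ids(lines):
    """Return unique phs accessions found in any cell, in first-seen order."""
    found = []
    for row in csv.reader(lines):
        for cell in row:
            for accession in PHS_PATTERN.findall(cell):
                if accession not in found:
                    found.append(accession)
    return found


# Invented sample rows; a real file would be opened and passed in instead.
sample_lines = [
    "accession,study_name",
    "phs000178.v1.p1,Example study A",
    "phs000218,Example study B",
    "phs000178.v1.p1,Example study A (duplicate row)",
]

ids = collect_phs_ids(sample_lines)
id_argument = ",".join(ids)  # value that could be passed to -id
```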

OUTPUT:

List of SRRs found for the queried dbGaP study (accession) numbers.

"sra_query_tool.sh"

INPUT:

File containing list of SRRs, path to output directory, and genomic region of interest to extract from SRRs.

OUTPUT:

SAM files, one per SRR, each containing reads from your genomic region of interest.

USE EXAMPLE:

  1. Install the SRA Toolkit and add the directory containing the toolkit executables to your path (e.g., PATH=$PATH:[download_location]/sratoolkit[version]/bin).
  2. Type sh /path/to/sra_query_tool.sh /path/to/SRRlist.txt /path/to/output/ 4:1723150-1810650

NOTE: You may need to have permission to access certain SRR files. For testing purposes, SRR390728 is publicly available, and can be used as an example SRR for the sra_query_tool.
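
When calling the shell script from Python, it can help to check the region string before starting a long download. The following is a minimal sketch, assuming the region always follows the `chromosome:start-end` form shown in the example above; `parse_region` and `build_command` are hypothetical helpers, not part of this repository.

```python
import re

# Assumed region form: <chromosome>:<start>-<end>, e.g. 4:1723150-1810650
REGION_PATTERN = re.compile(r'^([^:\s]+):(\d+)-(\d+)$')


def parse_region(region):
    """Split a region string into (chromosome, start, end)."""
    match = REGION_PATTERN.match(region)
    if match is None:
        raise ValueError("expected chromosome:start-end, got %r" % region)
    chrom = match.group(1)
    start, end = int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError("region start is greater than end: %r" % region)
    return chrom, start, end


def build_command(script, srr_list, out_dir, region):
    """Validate the region, then return the argument list for the script."""
    parse_region(region)
    return ['sh', script, srr_list, out_dir, region]


command = build_command('/path/to/sra_query_tool.sh',
                        '/path/to/SRRlist.txt',
                        '/path/to/output/',
                        '4:1723150-1810650')
```

The resulting list can be handed to `subprocess.call(command)` once the placeholder paths are replaced with real ones.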