# Example using scidat: Download and Annotate TCGA Data

scidat was designed to be a simple interface for downloading and annotating data from TCGA.


## Step 1:
Download metadata from TCGA.
For this program to work, you first need to go to https://portal.gdc.cancer.gov/ and select the files to download.

In this example, I chose my data by using the following steps:
    1. Navigated to the `Exploration` tab
    2. Selected `kidney` as the Primary Site
    3. Selected `solid tissue normal` and `primary tumor` for my Sample Type (823 cases)
    4. Clicked `View Files in Repository`
    
Since I'm only interested in RNA-seq count data and methylation beta values, I filtered my files to include only these two types.
    5. Selected `RNA-seq` as my Experimental Strategy
    6. Selected `HTSeq - Counts` in Workflow Type
    7. Clicked `Add all files to cart` (945 files, 815 cases)
    8. Deselected `HTSeq - Counts` and `RNA-seq`
    9. Selected `Methylation array` as my Experimental Strategy
    10. Selected `illumina human methylation 450` in Platform
    11. Clicked `Add all files to cart` (832 files, 630 cases)
    
Next, I navigated to my cart, which had 1777 files in it. Since it is recommended to use the GDC Data Transfer Tool (which scidat uses), I only downloaded the `manifest` and the `metadata` for my files of interest.
Before proceeding, check how much space the files will need (for me this was 117 GB) and make sure the computer you are downloading to has that much space available.
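
The free-space check can also be done from Python. The sketch below uses only the standard library's `shutil.disk_usage`; the helper name `has_enough_space` is hypothetical (not part of scidat), and the 117 GB figure is simply the size of my cart, so substitute your own.

```python
import shutil
from pathlib import Path


def has_enough_space(target_dir, required_gb):
    """Return True if the disk holding target_dir has at least required_gb free."""
    path = Path(target_dir).expanduser()
    # Walk up to the nearest existing directory so the check also works
    # before the download folder has been created.
    while not path.exists():
        path = path.parent
    free_gb = shutil.disk_usage(path).free / 1e9  # bytes -> gigabytes
    return free_gb >= required_gb


# 117 is the size (in GB) reported for the cart in this example.
print(has_enough_space("~/Documents/TCGA_data_download_scidat/", 117))
```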

### Download: 
    12. `Biospecimen`
    13. `Clinical`
    14. `Sample Sheet`
    15. `Metadata`
    16. Click the `Download` button and select `Manifest`

Lastly, move all these files to a new, empty directory, `~/Documents/TCGA_data_download_scidat/`.
Unzip the `Clinical` folder and delete the zipped version.
Make a new directory called `downloads` in `~/Documents/TCGA_data_download_scidat/` (we'll use this below).
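
If you prefer to script part of this setup, the sketch below creates the `downloads` directory and totals the file sizes listed in the manifest. It assumes the manifest is a tab-separated file with a header row and a `size` column in bytes (the layout GDC manifests normally use). Both helper names are hypothetical and not part of scidat.

```python
import csv
from pathlib import Path


def prepare_download_dir(base_dir):
    """Create base_dir/downloads if it does not exist and return its path."""
    downloads = Path(base_dir).expanduser() / "downloads"
    downloads.mkdir(parents=True, exist_ok=True)
    return downloads


def summarise_manifest(manifest_path):
    """Return (number of files, total size in GB) for a tab-separated manifest."""
    with open(Path(manifest_path).expanduser(), newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    # Assumption: the manifest has a `size` column holding bytes per file.
    total_bytes = sum(int(row["size"]) for row in rows)
    return len(rows), total_bytes / 1e9


# Example usage (assumes the manifest was saved as manifest.tsv):
# base = "~/Documents/TCGA_data_download_scidat/"
# prepare_download_dir(base)
# n_files, total_gb = summarise_manifest(base + "manifest.tsv")
```

Comparing `n_files` with the number of files shown in your cart is a cheap way to confirm that the manifest you downloaded is the one you intended.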

## Get GDC Data Transfer Tool
If you don't already have the GDC Data Transfer Tool, you'll need to download it. Follow the official instructions:
https://gdc.cancer.gov/access-data/gdc-data-transfer-tool

Move the downloaded GDC transfer tool to the same directory as your manifest file, i.e. `~/Documents/TCGA_data_download_scidat/` from before.
Now we're ready to download the files and annotate them with the data we downloaded.
    

## Step 2

The script below assumes you have put the files from above in `~/Documents/TCGA_data_download_scidat/` and that your `gdc-client` is executable.
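
If you are unsure whether the client is executable, a check along these lines may help. It is a sketch for Unix-like systems (macOS, Linux) using only the standard library; `ensure_executable` is a hypothetical helper, not part of scidat, and the path in the usage comment assumes the file name used in this example.

```python
import os
import stat
from pathlib import Path


def ensure_executable(tool_path):
    """Add the owner-execute bit to tool_path if needed; return True if executable."""
    path = Path(tool_path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No file found at {path}")
    if not os.access(path, os.X_OK):
        # Equivalent to `chmod u+x` on the command line.
        path.chmod(path.stat().st_mode | stat.S_IXUSR)
    return os.access(path, os.X_OK)


# Example usage (adjust the file name to match your download):
# ensure_executable("~/Documents/TCGA_data_download_scidat/gdc_client")
```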

If you are using a Windows machine (or put your files in a different folder), you'll need to change that directory path.
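
One way to avoid hard-coding a platform-specific path is to build it with `pathlib`. The sketch below produces absolute paths, so nothing depends on how `~` is interpreted. The folder and file names are the ones used in this example and are assumptions about your setup; change them to match where you placed your files.

```python
from pathlib import Path

# Path.home() resolves to the current user's home directory on
# Windows, macOS and Linux.
base_dir = Path.home() / "Documents" / "TCGA_data_download_scidat"

# Plain strings, using the same variable names as the script below.
manifest_file = str(base_dir / "manifest.tsv")
gdc_client = str(base_dir / "gdc_client")
download_dir = str(base_dir / "downloads")

print(manifest_file)
```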

Before running the script below, make sure you have edited the paths to the manifest file and the gdc_client so that they point to where you placed them on your computer, with their proper names
(e.g. 

## Download 

from scidat import Download
manifest_file = '~/Documents/TCGA_data_download_scidat/manifest.tsv'
gdc_client = '~/Documents/TCGA_data_download_scidat/./gdc_client'
download_dir = '~/Documents/TCGA_data_download_scidat/downloads/'

# scidat splits the manifest file into sub-manifests so that we can make multiple calls to TCGA simultaneously.
# This speeds up the download process. Note that it will use more of your computer's processing power, so if you are
# worried, just set max_cnt to be more than the number of files in your manifest, e.g. 100000000.
download = Download(manifest_file, download_dir, download_dir, gdc_client, max_cnt=100000000)
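
To illustrate the idea behind `max_cnt`, here is a minimal sketch of splitting a manifest into sub-manifests. It is not scidat's actual implementation: `split_manifest` is a hypothetical helper, and it assumes the first line of the manifest is a header that every sub-manifest should repeat.

```python
def split_manifest(lines, max_cnt):
    """Split manifest lines (header first) into chunks of at most max_cnt files."""
    if max_cnt < 1:
        raise ValueError("max_cnt must be at least 1")
    header, rows = lines[0], lines[1:]
    # Each chunk repeats the header so it is a valid manifest on its own.
    return [[header] + rows[i:i + max_cnt] for i in range(0, len(rows), max_cnt)]


manifest_lines = [
    "id\tfilename",
    "1\ta.txt",
    "2\tb.txt",
    "3\tc.txt",
    "4\td.txt",
    "5\te.txt",
]
print(len(split_manifest(manifest_lines, 2)))    # 3 sub-manifests
print(len(split_manifest(manifest_lines, 100)))  # 1 sub-manifest, no splitting
```

When `max_cnt` is larger than the number of files, everything stays in a single manifest, which is why a very large value effectively turns the parallel download off.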

## Step 3

## Annotate

from scidat import Annotate
