This notebook contains shell scripts that check and prepare the TCGA BED files.

In [12]:
# preview results/tcga-capture_kit-info.tsv
TSV='results/tcga-capture_kit-info.tsv'
head -4 $TSV | column -t -s $'\t'

filename                                           kit_name                                 kit_url
C494.TCGA-DU-5855-10A-01D-1705-08.5_gdc_realn.bam  Custom V2 Exome Bait, 48 RXN X 16 tubes  https://bitbucket.org/cghub/cghub-capture-kit-info/raw/c38c4b9cb500b724de46546fd52f8d532fd9eba9/BI/vendor/Agilent/whole_exome_agilent_1.1_refseq_plus_3_boosters.targetIntervals.bed
C494.TCGA-DU-5847-10A-01D-1705-08.5_gdc_realn.bam  Custom V2 Exome Bait, 48 RXN X 16 tubes  https://bitbucket.org/cghub/cghub-capture-kit-info/raw/c38c4b9cb500b724de46546fd52f8d532fd9eba9/BI/vendor/Agilent/whole_exome_agilent_1.1_refseq_plus_3_boosters.targetIntervals.bed
C494.TCGA-HT-7681-10C-01D-2396-08.1_gdc_realn.bam  Custom V2 Exome Bait, 48 RXN X 16 tubes  https://bitbucket.org/cghub/cghub-capture-kit-info/raw/c38c4b9cb500b724de46546fd52f8d532fd9eba9/BI/vendor/Agilent/whole_exome_agilent_1.1_refseq_plus_3_boosters.targetIntervals.bed


In [13]:
# list BAM with mixed/multiple capture kits
grep "|" $TSV | cut -f1,2

C282.TCGA-06-0745-10A-01W.3_gdc_realn.bam	Custom V2 Exome Bait, 48 RXN X 16 tubes|Custom V2 Exome Bait, 48 RXN X 16 tubes
C282.TCGA-14-0817-10A-01W.4_gdc_realn.bam	Custom V2 Exome Bait, 48 RXN X 16 tubes|Custom V2 Exome Bait, 48 RXN X 16 tubes
C282.TCGA-14-0871-10A-01W.7_gdc_realn.bam	Custom V2 Exome Bait, 48 RXN X 16 tubes|Custom V2 Exome Bait, 48 RXN X 16 tubes
C282.TCGA-16-0850-10A-01W.4_gdc_realn.bam	Custom V2 Exome Bait, 48 RXN X 16 tubes|Custom V2 Exome Bait, 48 RXN X 16 tubes


The BAMs listed by the `grep` command above each have more than one capture kit recorded: neither the GDC nor the originating data center could determine which capture kit was actually applied. For those samples, we should generate an intersected BED from the candidate kits and use that for our analysis.

BAMs with this issue are:
```
C282.TCGA-06-0745-10A-01W.3_gdc_realn.bam
C282.TCGA-14-0817-10A-01W.4_gdc_realn.bam
C282.TCGA-14-0871-10A-01W.7_gdc_realn.bam
C282.TCGA-16-0850-10A-01W.4_gdc_realn.bam
```
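
To build the intersected BED we first need to know which BED files belong to each of these BAMs. The following is a minimal sketch, assuming the three-column layout of the TSV previewed above (`filename`, `kit_name`, `kit_url`, tab-separated, multiple kits joined by `|`). The function names `list_multi_kit_beds` and `intersect_two_beds` are hypothetical, and the second one assumes `bedtools` is installed and on `PATH`.

```shell
# List, for each BAM with more than one capture kit, the BED file behind each kit.
# Output: one "<bam filename><TAB><bed basename>" line per kit.
list_multi_kit_beds() {
    local tsv="$1"
    awk -F'\t' 'NR > 1 && index($3, "|") > 0 {
        n = split($3, urls, "[|]")          # one URL per capture kit
        for (i = 1; i <= n; i++) {
            m = split(urls[i], parts, "/")  # last path component is the BED name
            print $1 "\t" parts[m]
        }
    }' "$tsv"
}

# Intersect two BED files into a third (assumption: bedtools is available).
# Usage: intersect_two_beds a.bed b.bed out.bed
intersect_two_beds() {
    bedtools intersect -a "$1" -b "$2" > "$3"
}

# Example:
#   list_multi_kit_beds results/tcga-capture_kit-info.tsv
```

It is probably simplest to run the intersection on the lifted-over GRCh38 BED files, so that the result is already in the coordinate system used for the analysis.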

In [14]:
# get and count uniq bed url
sed 1d $TSV \
| cut -f3 | tr "\|" "\n" \
| sort | uniq -c

   4 https://bitbucket.org/cghub/cghub-capture-kit-info/raw/c38c4b9cb500b724de46546fd52f8d532fd9eba9/BI/vendor/Agilent/tcga_6k_genes.targetIntervals.bed
 154 https://bitbucket.org/cghub/cghub-capture-kit-info/raw/c38c4b9cb500b724de46546fd52f8d532fd9eba9/BI/vendor/Agilent/whole_exome_agilent_1.1_refseq_plus_3_boosters.targetIntervals.bed
   2 https://bitbucket.org/cghub/cghub-capture-kit-info/raw/c38c4b9cb500b724de46546fd52f8d532fd9eba9/BI/vendor/Agilent/whole_exome_agilent_designed_120.targetIntervals.bed
   4 https://bitbucket.org/cghub/cghub-capture-kit-info/raw/c38c4b9cb500b724de46546fd52f8d532fd9eba9/BI/vendor/Agilent/whole_exome_agilent_plus_tcga_6k.targetIntervals.bed


In [17]:
# download bed to scratch/
sed 1d $TSV \
| cut -f3 | tr "\|" "\n" | sort -u \
| xargs -I {} curl -sJO {} && mv *.bed ../../scratch/

In [19]:
# preview bed
head -3 ../../scratch/*.bed

==> ../../scratch/tcga_6k_genes.targetIntervals.bed <==
1	934438	934812	target_1	+
1	934905	934993	target_2	+
1	935071	935353	target_3	+

==> ../../scratch/whole_exome_agilent_1.1_refseq_plus_3_boosters.targetIntervals.bed <==
1	30365	30503	target_1	+
1	69088	70010	target_2	+
1	367656	368599	target_3	+

==> ../../scratch/whole_exome_agilent_designed_120.targetIntervals.bed <==
1	30365	30503	target_1	+
1	69088	70010	target_2	+
1	367656	368599	target_3	+

==> ../../scratch/whole_exome_agilent_plus_tcga_6k.targetIntervals.bed <==
MT	647	1601	tcga6k_target_72201	+
MT	3306	4263	tcga6k_target_72202	+
MT	4469	5511	tcga6k_target_72203	+


Those are all hg19-based coordinates. Use the [UCSC online LiftOver](https://genome.ucsc.edu/cgi-bin/hgLiftOver) to convert them to GRCh38-based BED files. Add a `chr` prefix to the chromosome names in the TCGA BED files so that their format is LiftOver-compatible.

In [20]:
# adding chr for liftover
sed 1d $TSV \
| cut -f3 | tr "\|" "\n" | sort -u \
| while read -r url
do
    filename=$(basename "$url")
    curl -s "$url" | awk '{print "chr"$0}' > "../../scratch/$filename"
done
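
One caveat with the cell above: it prepends `chr` to every line, so the mitochondrial contig `MT` (visible in the preview of `whole_exome_agilent_plus_tcga_6k.targetIntervals.bed`) becomes `chrMT`, whereas UCSC names that contig `chrM`. Below is a hedged sketch of a variant that renames `MT` to `chrM`; the function name `add_chr_prefix` is hypothetical. To my knowledge the UCSC hg19 `chrM` sequence is not the same mitochondrial reference as the `MT` contig in GRCh37-style files, so lifted mitochondrial intervals should be treated with caution either way.

```shell
# Prefix chromosome names with "chr", renaming MT to chrM.
# Reads tab-separated BED lines on stdin and writes them to stdout;
# blank lines are skipped, all other columns are left untouched.
add_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
         NF { $1 = ($1 == "MT") ? "chrM" : "chr" $1; print }'
}

# Example, as a drop-in for the awk step above:
#   curl -s "$url" | add_chr_prefix > "../../scratch/$filename"
```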

Upload all the BED files from `scratch/` to UCSC LiftOver for conversion, then download the lifted-over BED files to `results/`.
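
LiftOver can drop intervals that it cannot map, so it may be worth comparing interval counts before and after the conversion. The sketch below assumes the lifted-over files were saved to `results/` under the same basenames as the inputs in `scratch/` (the web tool's download names would need renaming first); `liftover_report` is a hypothetical helper, not part of any tool.

```shell
# For each BED in the input directory, print:
#   <basename><TAB><intervals before><TAB><intervals after, or NA if no lifted file>
# Non-empty lines are counted as intervals.
liftover_report() {
    local in_dir="$1" out_dir="$2" f name before after
    for f in "$in_dir"/*.bed; do
        [ -e "$f" ] || continue            # no BED files matched the glob
        name=$(basename "$f")
        before=$(grep -c . "$f" || true)
        if [ -f "$out_dir/$name" ]; then
            after=$(grep -c . "$out_dir/$name" || true)
        else
            after=NA
        fi
        printf '%s\t%s\t%s\n' "$name" "$before" "$after"
    done
}

# Example:
#   liftover_report ../../scratch results | column -t
```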