Preparing BAM files
Each starting file must contain a single lane of sequencing, in either BAM (unaligned BAM) or FASTQ format.
If you start with merged BAMs, split them first using one of the biobambam tools included in the PCAP-core installation:
$ bamtofastq exclude=SECONDARY,SUPPLEMENTARY,QCFAIL outputperreadgroup=1 outputdir=some_out_folder filename=your_input.bam tryoq=1
This will generate a pair of FASTQ files for every readgroup found in the BAM. Once all FASTQ files are generated, proceed to step b.
NOTE: The generated FASTQ files may be very large. To avoid losing reads, ensure there is enough disk space available before starting bamtofastq.
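The disk space check in the note above could be scripted roughly as follows. This is a minimal sketch: `has_enough_space` is a hypothetical helper, and the default factor of 5 (free space relative to the size of the input BAM) is an assumption, not a biobambam recommendation.

```shell
# Hypothetical helper: succeed only if the filesystem holding OUT_DIR has at
# least FACTOR times the size of the input BAM available.
# The default FACTOR of 5 is a guess; adjust it to your data.
has_enough_space() {
    local bam="$1" out_dir="$2" factor="${3:-5}"
    local bam_kb free_kb
    # Size of the input BAM in KiB, rounded up.
    bam_kb=$(( ( $(wc -c < "$bam") + 1023 ) / 1024 ))
    # Available KiB on the target filesystem (4th column of POSIX df output).
    free_kb=$(df -Pk "$out_dir" | awk 'NR == 2 { print $4 }')
    [ "$free_kb" -ge $(( bam_kb * factor )) ]
}
```

For example, `has_enough_space your_input.bam some_out_folder || echo "not enough space"` can be run before starting bamtofastq.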
All BAM files ready for submission must contain the metadata needed for uploading to GNOS and for downstream analysis. This information is kept in the read group and comment sections of the BAM header: @RG and @CO.
The table below describes how tags in @RG should be populated:
|ID||Read group identifier||Unique within site: <centre_name>:<unique_text>|
|PL||Platform/technology used to produce the reads||CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, PACBIO|
|SM||Sample||The sample UUID, generated by running `uuidgen`|
A valid @RG line may look like:
@RG ID:WTSI:9399_7 CN:WTSI PL:ILLUMINA PM:Illumina HiSeq 2000 LB:WGS:WTSI:28085 PI:453 SM:f393ba16-9361-5df4-e040-11ac0d4844e8 PU:WTSI:9399_7 DT:2013-03-18T00:00:00+00:00
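An @RG line can be checked against the table above before it goes into a header. The sketch below covers only the three tags listed in the table (ID, PL, SM). `check_rg_line` is a hypothetical helper, and the input is assumed to be tab-separated, as the SAM format requires for header fields.

```shell
# Hypothetical helper: check one tab-separated @RG line against the table above.
# Prints one message per problem and returns non-zero if any problem is found.
check_rg_line() {
    printf '%s\n' "$1" | awk -F '\t' '
        BEGIN {
            n = split("CAPILLARY LS454 ILLUMINA SOLID HELICOS IONTORRENT PACBIO", names, " ")
            for (i = 1; i <= n; i++) platforms[names[i]] = 1
        }
        {
            if ($1 != "@RG") { print "not an @RG line"; bad = 1; next }
            # Store each TAG:value field under its two-letter tag.
            for (i = 2; i <= NF; i++) tag[substr($i, 1, 2)] = substr($i, 4)
            if (tag["ID"] !~ /^[^:]+:.+$/) { print "ID should look like <centre_name>:<unique_text>"; bad = 1 }
            if (!(tag["PL"] in platforms)) { print "PL is not an allowed platform: " tag["PL"]; bad = 1 }
            # A UUID printed by uuidgen is 36 characters long.
            if (length(tag["SM"]) != 36) { print "SM does not look like a uuidgen UUID"; bad = 1 }
        }
        END { exit (bad ? 1 : 0) }
    '
}
```

The check is deliberately loose: it does not verify that ID is unique within the site or that SM is a well-formed UUID beyond its length.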
The table below shows how additional information for sample tracking is kept in @CO lines:
|dcc_specimen_type||See the CV terms table in the Appendix.||This field defines whether the sample is a tumour or normal control.|
|use_cntl||If the dcc_specimen_type is not normal/control, this field will need to be populated with the UUID in the current matched (same donor) normal/control sample's @RG SM field.||If dcc_specimen_type is a normal/control, populate this field with N/A.|
An example of valid @CO lines may look like:
@CO dcc_project_code:BRCA-UK
@CO submitter_donor_id:CGP_donor_1199131
@CO submitter_specimen_id:CGP_specimen_1142534
@CO submitter_sample_id:PD3851a
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
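The relationship between dcc_specimen_type and use_cntl described in the table can be checked with a few lines of shell. This is a sketch under two assumptions: normal/control types are recognised by the "Normal - " prefix used in the controlled vocabulary, and a UUID is recognised only by its length of 36 characters. `check_use_cntl` is a hypothetical name.

```shell
# Hypothetical helper: check that use_cntl is consistent with dcc_specimen_type.
# Normal/control samples must carry N/A; all others must carry a UUID.
check_use_cntl() {
    local specimen_type="$1" use_cntl="$2"
    case "$specimen_type" in
        "Normal - "*)
            [ "$use_cntl" = "N/A" ] ;;
        *)
            # Only the length is checked, not the full UUID layout.
            [ "$use_cntl" != "N/A" ] && [ "${#use_cntl}" -eq 36 ] ;;
    esac
}
```

For example, `check_use_cntl "Primary tumour - solid tissue" "85098796-a2c1-11e3-a743-6c6c38d06053"` succeeds, while the same specimen type with `N/A` fails.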
a) Follow this if you start from single lane BAM files
BAM file re-header
For each BAM file, create a new text file and populate it with values as indicated in Table 1 and Table 2 above.
Make sure the first line is always: @HD VN:1.4
An example BAM header file is shown as below (let's name the file: header.sam):
@HD VN:1.4
@RG ID:WTSI:9399_7 CN:WTSI PL:ILLUMINA PM:Illumina HiSeq 2000 LB:WGS:WTSI:28085 PI:453 SM:f393ba16-9361-5df4-e040-11ac0d4844e8 PU:WTSI:9399_7 DT:2013-03-18T00:00:00+00:00
@CO dcc_project_code:BRCA-UK
@CO submitter_donor_id:CGP_donor_1199131
@CO submitter_specimen_id:CGP_specimen_1142534
@CO submitter_sample_id:PD3851a
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
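Writing the header file by hand makes it easy to lose the tab characters that the SAM format requires between header fields. The sketch below generates the file with `printf` instead. `write_header` is a hypothetical helper, and it assumes the caller passes the @RG fields already joined by tabs.

```shell
# Hypothetical helper: write a header file with @HD first, then one @RG line,
# then one @CO line for each remaining "key:value" argument.
write_header() {
    local out="$1" rg_fields="$2" kv
    shift 2
    {
        printf '@HD\tVN:1.4\n'
        printf '@RG\t%s\n' "$rg_fields"
        for kv in "$@"; do
            printf '@CO\t%s\n' "$kv"
        done
    } > "$out"
}

# Example call (values taken from the example header above):
#   rg=$(printf 'ID:WTSI:9399_7\tCN:WTSI\tPL:ILLUMINA\tPM:Illumina HiSeq 2000')
#   write_header header.sam "$rg" "dcc_project_code:BRCA-UK" "submitter_sample_id:PD3851a"
```

Building the file this way keeps the @HD line first and puts each @CO entry on its own line.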
Generate unaligned BAM with new header from initial BAM
Generate a BAM file without alignment information that incorporates the new header (the header.sam file created above) as follows:
Using biobambam 0.0.120+:
$ cat initial.bam | bamreset exclude=QCFAIL,SECONDARY,SUPPLEMENTARY resetheadertext=header.sam md5=1 md5filename=cleaned.bam.md5 > cleaned.bam
NOTE: You will need to do this for all of the BAM files you plan to submit for a donor. Once done with all BAMs and all donors, go to step 6.
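Because the command above writes a checksum alongside each BAM, the pair can be re-checked before upload. The sketch below assumes that the md5 file holds the hex digest as its first whitespace-separated field and that GNU `md5sum` is available; neither is confirmed by this page. `verify_md5` is a hypothetical helper.

```shell
# Hypothetical helper: compare the digest stored in MD5_FILE with a freshly
# computed digest of FILE. Assumes the stored digest is the first field on
# the first line of MD5_FILE.
verify_md5() {
    local file="$1" md5_file="$2" expected actual
    expected=$(awk 'NR == 1 { print $1 }' "$md5_file")
    actual=$(md5sum "$file" | awk '{ print $1 }')
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}
```

For example, `verify_md5 cleaned.bam cleaned.bam.md5 || echo "checksum mismatch"`.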
b) Follow this if you start from FASTQ files
Convert FASTQ file to BAM with @RG header added
Using biobambam 0.0.117+:
$ fastqtobam I=initial_1.fq I=initial_2.fq md5=1 md5filename=cleaned.bam.md5 RGID=<> RGCN=<> RGPL=<> RGLB=<> RGPI=<> RGSM=<> RGPU=<> RGDT=<> > cleaned.bam
NOTE: You will need to replace <> with proper values in the above command. Review Table 1 carefully for how to populate these read group fields. Also, as fastqtobam is not able to populate RGPM, you will need to specify this in the .info file described below.
Create .info file with sample tracking data for later use
Create a cleaned.bam.info file, named after your output BAM file above, containing the @CO values specified in Table 2 above. Place it in the same directory as the BAM file it describes. bam_to_sra_xml.pl uses this info file to complete the submission XML files.
Example (the @CO prefix is removed in *.info files):
dcc_project_code:BRCA-UK
submitter_donor_id:CGP_donor_1199131
submitter_specimen_id:CGP_specimen_1142534
submitter_sample_id:PD3851a
dcc_specimen_type:Primary tumour - solid tissue
use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
PM:Illumina HiSeq 2000
NOTE: You will need to do this for all of the FASTQ files you plan to submit for a donor.
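With one .info file per BAM, a missing key is easy to overlook. The sketch below checks that a file contains every key from the example above, including PM. It assumes one `key:value` entry per line and treats the seven keys in the example as the required set, which this page does not state explicitly. `check_info_file` is a hypothetical helper.

```shell
# Hypothetical helper: report any key from the example .info file that is
# missing or has an empty value. Assumes one key:value entry per line.
check_info_file() {
    local info="$1" key missing=0
    for key in dcc_project_code submitter_donor_id submitter_specimen_id \
               submitter_sample_id dcc_specimen_type use_cntl PM; do
        if ! grep -q "^${key}:." "$info"; then
            echo "missing or empty: ${key}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_info_file cleaned.bam.info` returns non-zero and names the offending keys on standard error when an entry is absent.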
Once you have your BAM files prepared, you are ready to upload to your BAM file repository (such as Amazon S3 buckets).
| Controlled vocabulary for dcc_specimen_type |
| Normal - solid tissue |
| Normal - blood derived |
| Normal - bone marrow |
| Normal - tissue adjacent to primary |
| Normal - buccal cell |
| Normal - EBV immortalized |
| Normal - lymph node |
| Normal - other |
| Primary tumour - solid tissue |
| Primary tumour - blood derived (peripheral blood) |
| Primary tumour - blood derived (bone marrow) |
| Primary tumour - additional new primary |
| Primary tumour - other |
| Recurrent tumour - solid tissue |
| Recurrent tumour - blood derived (peripheral blood) |
| Recurrent tumour - blood derived (bone marrow) |
| Recurrent tumour - other |
| Metastatic tumour - NOS |
| Metastatic tumour - lymph node |
| Metastatic tumour - metastasis local to lymph node |
| Metastatic tumour - metastasis to distant location |
| Metastatic tumour - additional metastatic |
| Xenograft - derived from primary tumour |
| Xenograft - derived from tumour cell line |
| Cell line - derived from tumour |
| Primary tumour - lymph node |
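Since dcc_specimen_type must match one of these terms exactly, a small lookup can catch spelling variants such as "tumor" before submission. The sketch below embeds the terms from the table above. `is_valid_specimen_type` is a hypothetical helper, and the exact, case-sensitive comparison is an assumption about how strictly the terms are matched downstream.

```shell
# Hypothetical helper: succeed only if the argument is exactly one of the
# controlled vocabulary terms (whole-line, case-sensitive match).
is_valid_specimen_type() {
    grep -Fxq -- "$1" <<'EOF'
Normal - solid tissue
Normal - blood derived
Normal - bone marrow
Normal - tissue adjacent to primary
Normal - buccal cell
Normal - EBV immortalized
Normal - lymph node
Normal - other
Primary tumour - solid tissue
Primary tumour - blood derived (peripheral blood)
Primary tumour - blood derived (bone marrow)
Primary tumour - additional new primary
Primary tumour - other
Recurrent tumour - solid tissue
Recurrent tumour - blood derived (peripheral blood)
Recurrent tumour - blood derived (bone marrow)
Recurrent tumour - other
Metastatic tumour - NOS
Metastatic tumour - lymph node
Metastatic tumour - metastasis local to lymph node
Metastatic tumour - metastasis to distant location
Metastatic tumour - additional metastatic
Xenograft - derived from primary tumour
Xenograft - derived from tumour cell line
Cell line - derived from tumour
Primary tumour - lymph node
EOF
}
```

For example, `is_valid_specimen_type "Primary tumour - solid tissue"` succeeds, while the American spelling "Primary tumor - solid tissue" is rejected.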