Preparing BAM files

Solomon Shorser edited this page Nov 18, 2015 · 5 revisions

Each starting file must contain a single lane of sequencing, in either BAM (unaligned BAM) or FASTQ format.

If you start with merged BAMs, split them first with bamtofastq, one of the biobambam tools included in the PCAP-core installation:

$ bamtofastq exclude=SECONDARY,SUPPLEMENTARY,QCFAIL outputperreadgroup=1 outputdir=some_out_folder filename=your_input.bam tryoq=1

This will generate a pair of FASTQ files for every readgroup found in the BAM. Once all FASTQ files are generated, proceed to step b.

NOTE: The generated FASTQ files may be very large. To avoid losing reads, ensure there is enough disk space available before starting bamtofastq.
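The disk space check in the note above can be sketched as a small pre-flight test. This is a minimal sketch, not part of biobambam: the function names are hypothetical, and the expansion factor of 5 is only an assumed placeholder that should be tuned to your own data.

```shell
# Estimate the KiB needed for FASTQ output as BAM size times a factor.
# The default factor of 5 is an assumption, not a measured value.
needed_kb() {
    bam=$1
    factor=${2:-5}
    bytes=$(wc -c < "$bam") || return 1
    echo $(( bytes * factor / 1024 + 1 ))
}

# Free space in KiB on the filesystem that holds the given directory.
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Return 0 if the output directory appears to have enough room.
enough_space() {
    need=$(needed_kb "$1" "$3") || return 1
    have=$(avail_kb "$2")
    [ "$have" -ge "$need" ]
}

# Usage (not run here):
#   enough_space your_input.bam some_out_folder 5 || echo "not enough space" >&2
```

Running the check before bamtofastq, rather than inspecting the output afterwards, avoids having to detect truncated FASTQ files.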

All BAM files ready for submission must contain the metadata needed for uploading to GNOS and for downstream analysis. This information is kept in the read group (@RG) and comment (@CO) sections of the BAM header.

Table 1 below describes how tags in @RG should be populated:

| Tag | Description | Details |
|-----|-------------|---------|
| ID | Read group identifier | Unique within site: <centre_name>:<unique_text> |
| PL | Platform/technology used to produce the reads | CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, PACBIO |
| SM | Sample | The sample UUID (see the command below). All BAM lane files to be merged into a single mapped sample (DCC definition) BAM should share the same UUID. |

Generate the sample UUID at the shell with: uuidgen | tr 'A-Z' 'a-z'

A valid @RG line may look like this (fields are separated by tabs):

@RG ID:WTSI:9399_7  CN:WTSI PL:ILLUMINA PM:Illumina HiSeq 2000  LB:WGS:WTSI:28085   PI:453    SM:f393ba16-9361-5df4-e040-11ac0d4844e8   PU:WTSI:9399_7  DT:2013-03-18T00:00:00+00:00
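Because the SAM format separates header fields with tabs, and a value such as the PM field above contains spaces, it can help to assemble the line programmatically. The sketch below is one possible way to do that; make_rg_line and new_sample_uuid are hypothetical helper names, and the uuidgen pipeline is the one given for the SM tag.

```shell
# Join the given TAG:value fields into one tab-delimited @RG line.
make_rg_line() {
    printf '@RG'
    for field in "$@"; do
        printf '\t%s' "$field"
    done
    printf '\n'
}

# Generate a lowercase sample UUID for the SM tag.
new_sample_uuid() {
    uuidgen | tr 'A-Z' 'a-z'
}

# Usage (not run here):
#   sm=$(new_sample_uuid)
#   make_rg_line "ID:WTSI:9399_7" "CN:WTSI" "PL:ILLUMINA" \
#       "PM:Illumina HiSeq 2000" "SM:$sm"
```

Generate the UUID once per sample and reuse it for every lane of that sample, since all lane BAMs to be merged must share the same SM value.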

Table 2 below shows how additional information for sample tracking is kept in @CO lines:

| Key | Value | Notes |
|-----|-------|-------|
| dcc_specimen_type | See the CV terms table in the Appendix. | This field defines whether the sample is a tumour or normal control. |
| use_cntl | If the dcc_specimen_type is not normal/control, populate this field with the UUID in the @RG SM field of the current matched (same donor) normal/control sample. If dcc_specimen_type is a normal/control, populate this field with N/A. | |

Valid @CO lines may look like:

@CO dcc_project_code:BRCA-UK
@CO submitter_donor_id:CGP_donor_1199131
@CO submitter_specimen_id:CGP_specimen_1142534
@CO submitter_sample_id:PD3851a
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
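The use_cntl rule can be checked mechanically before submission. The following is a minimal sketch under two assumptions that are not stated in the source: a specimen type counts as normal/control when it starts with "Normal - ", and a UUID is five hyphen-separated groups of lowercase hexadecimal digits. The function name check_use_cntl is hypothetical.

```shell
# Read header (or .info) text on stdin and return 0 if use_cntl is
# consistent with dcc_specimen_type, 1 otherwise.
check_use_cntl() {
    awk '
        { sub(/^@CO[ \t]+/, "") }   # tolerate lines with or without @CO
        /^dcc_specimen_type:/ { type = substr($0, length("dcc_specimen_type:") + 1) }
        /^use_cntl:/          { cntl = substr($0, length("use_cntl:") + 1) }
        END {
            if (type == "" || cntl == "") exit 1
            # Assumption: normal/control types start with "Normal - ".
            is_normal = (type ~ /^Normal - /)
            is_uuid = (cntl ~ /^[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+$/)
            if (is_normal && cntl == "N/A") exit 0
            if (!is_normal && is_uuid) exit 0
            exit 1
        }'
}

# Usage (not run here):
#   check_use_cntl < header.sam || echo "use_cntl does not match specimen type" >&2
```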

a) Follow this if you start from single-lane BAM files

  • BAM file re-header

    For each BAM file, create a new text file and populate it with values as indicated in Table 1 and Table 2 above.

    Make sure the first line is always: @HD VN:1.4

    An example BAM header file is shown below (let's name the file header.sam):

@HD VN:1.4
@RG ID:WTSI:9399_7  CN:WTSI PL:ILLUMINA PM:Illumina HiSeq 2000  LB:WGS:WTSI:28085   PI:453  SM:f393ba16-9361-5df4-e040-11ac0d4844e8 PU:WTSI:9399_7  DT:2013-03-18T00:00:00+00:00
@CO dcc_project_code:BRCA-UK
@CO submitter_donor_id:CGP_donor_1199131
@CO submitter_specimen_id:CGP_specimen_1142534
@CO submitter_sample_id:PD3851a
@CO dcc_specimen_type:Primary tumour - solid tissue
@CO use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
  • Generate unaligned BAM with new header from initial BAM

    Generate a BAM file without alignment information that incorporates the new header (the header.sam file created above) as follows:

    Using biobambam 0.0.120+:

$ cat initial.bam | bamreset exclude=QCFAIL,SECONDARY,SUPPLEMENTARY resetheadertext=header.sam md5=1 md5filename=cleaned.bam.md5 > cleaned.bam

NOTE: You will need to do this for all of the BAM files you plan to submit for a donor. Once done with all BAMs and all donors, go to step 6.
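Since the re-header step has to be repeated for every BAM of a donor, a dry-run planner can help catch a missing or malformed header file before any data is rewritten. This is a sketch only: it prints the bamreset commands instead of running them, and it assumes a hypothetical naming convention in which each <name>.bam has its header in <name>.header.sam and its output goes to <name>.cleaned.bam.

```shell
# Return 0 if the first line of a header file is exactly "@HD<TAB>VN:1.4".
header_starts_with_hd() {
    first=$(head -n 1 "$1" 2>/dev/null) || return 1
    [ "$first" = "$(printf '@HD\tVN:1.4')" ]
}

# Print (do not run) the bamreset command for one BAM/header pair,
# using the same options as the command shown above.
print_bamreset_cmd() {
    in_bam=$1
    hdr=$2
    out_bam=$3
    printf 'cat %s | bamreset exclude=QCFAIL,SECONDARY,SUPPLEMENTARY resetheadertext=%s md5=1 md5filename=%s.md5 > %s\n' \
        "$in_bam" "$hdr" "$out_bam" "$out_bam"
}

# Plan the re-header step for every *.bam in a directory.
plan_donor() {
    dir=$1
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue
        case $bam in *.cleaned.bam) continue ;; esac   # skip earlier outputs
        header="${bam%.bam}.header.sam"
        if header_starts_with_hd "$header"; then
            print_bamreset_cmd "$bam" "$header" "${bam%.bam}.cleaned.bam"
        else
            echo "skipping $bam: header missing or first line is not @HD VN:1.4" >&2
        fi
    done
}

# Usage (not run here):
#   plan_donor /path/to/donor_dir > commands.sh
```

Reviewing the printed commands before executing them keeps the original BAM files untouched until every header has passed the first-line check.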

b) Follow this if you start from FASTQ files

  • Convert FASTQ file to BAM with @RG header added

    Using biobambam 0.0.117+:

$ fastqtobam I=initial_1.fq I=initial_2.fq md5=1 md5filename=cleaned.bam.md5 RGID=<> RGCN=<> RGPL=<> RGLB=<> RGPI=<> RGSM=<> RGPU=<> RGDT=<> > cleaned.bam

NOTE: Replace each <> in the above command with the proper value. Review Table 1 carefully for how to populate these read group fields. Because fastqtobam cannot populate RGPM, specify the platform model (PM) in the .info file described below.

  • Create .info file with sample tracking data for later use

    Create an .info file named after your output BAM file (for example, cleaned.bam.info for cleaned.bam) that contains the @CO values specified in Table 2 above. Place it in the same directory as the BAM file it describes. bam_to_sra_xml.pl uses this file to complete the submission XML files. For example:

my_input.bam
my_input.bam.info

Example contents of a *.info file (the @CO prefix is removed):

dcc_project_code:BRCA-UK
submitter_donor_id:CGP_donor_1199131
submitter_specimen_id:CGP_specimen_1142534
submitter_sample_id:PD3851a
dcc_specimen_type:Primary tumour - solid tissue
use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
PM:Illumina HiSeq 2000

NOTE: You will need to do this for all of the FASTQ files you plan to submit for a donor.
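The two steps of section b can be sketched together as follows. This is a hedged illustration, not a PCAP-core tool: print_fastqtobam_cmd only prints the command (with the options shown above) so it can be reviewed first, and co_to_info assumes you already keep the @CO lines in a header-style text file. Both function names are hypothetical, and the RG* shell variables are expected to be set by the caller.

```shell
# Print (do not run) a fastqtobam command built from RG* shell variables.
print_fastqtobam_cmd() {
    fq1=$1
    fq2=$2
    out_bam=$3
    printf 'fastqtobam I=%s I=%s md5=1 md5filename=%s.md5' "$fq1" "$fq2" "$out_bam"
    printf ' RGID=%s RGCN=%s RGPL=%s RGLB=%s RGPI=%s RGSM=%s RGPU=%s RGDT=%s' \
        "$RGID" "$RGCN" "$RGPL" "$RGLB" "$RGPI" "$RGSM" "$RGPU" "$RGDT"
    printf ' > %s\n' "$out_bam"
}

# Convert @CO header lines on stdin into .info lines (prefix removed)
# and append the PM entry that fastqtobam cannot set.
co_to_info() {
    pm=$1
    sed -n 's/^@CO[[:space:]]\{1,\}//p'
    printf 'PM:%s\n' "$pm"
}

# Usage (not run here):
#   RGID=WTSI:9399_7 RGCN=WTSI RGPL=ILLUMINA   # ...and the remaining RG* values
#   print_fastqtobam_cmd initial_1.fq initial_2.fq cleaned.bam
#   co_to_info 'Illumina HiSeq 2000' < header.sam > cleaned.bam.info
```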

Once you have your BAM files prepared, you are ready to upload to your BAM file repository (such as Amazon S3 buckets).

Appendix

Controlled vocabulary for dcc_specimen_type
Normal - solid tissue
Normal - blood derived
Normal - bone marrow
Normal - tissue adjacent to primary
Normal - buccal cell
Normal - EBV immortalized
Normal - lymph node
Normal - other
Primary tumour - solid tissue
Primary tumour - blood derived (peripheral blood)
Primary tumour - blood derived (bone marrow)
Primary tumour - additional new primary
Primary tumour - other
Recurrent tumour - solid tissue
Recurrent tumour - blood derived (peripheral blood)
Recurrent tumour - blood derived (bone marrow)
Recurrent tumour - other
Metastatic tumour - NOS
Metastatic tumour - lymph node
Metastatic tumour - metastasis local to lymph node
Metastatic tumour - metastasis to distant location
Metastatic tumour - additional metastatic
Xenograft - derived from primary tumour
Xenograft - derived from tumour cell line
Cell line - derived from tumour
Primary tumour - lymph node
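
Because dcc_specimen_type must match one of the terms above exactly, a small lookup can catch spelling variants such as "tumor" before upload. The sketch below embeds the list from this appendix; the function name is_valid_specimen_type is hypothetical.

```shell
# Return 0 if the argument exactly matches one controlled vocabulary term.
is_valid_specimen_type() {
    grep -Fxq -- "$1" <<'EOF'
Normal - solid tissue
Normal - blood derived
Normal - bone marrow
Normal - tissue adjacent to primary
Normal - buccal cell
Normal - EBV immortalized
Normal - lymph node
Normal - other
Primary tumour - solid tissue
Primary tumour - blood derived (peripheral blood)
Primary tumour - blood derived (bone marrow)
Primary tumour - additional new primary
Primary tumour - other
Recurrent tumour - solid tissue
Recurrent tumour - blood derived (peripheral blood)
Recurrent tumour - blood derived (bone marrow)
Recurrent tumour - other
Metastatic tumour - NOS
Metastatic tumour - lymph node
Metastatic tumour - metastasis local to lymph node
Metastatic tumour - metastasis to distant location
Metastatic tumour - additional metastatic
Xenograft - derived from primary tumour
Xenograft - derived from tumour cell line
Cell line - derived from tumour
Primary tumour - lymph node
EOF
}

# Usage (not run here):
#   is_valid_specimen_type "Primary tumour - solid tissue" || echo "unknown term" >&2
```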