Preparing BAM files

Solomon Shorser edited this page Nov 18, 2015 · 5 revisions

Each starting file must hold a single lane of sequencing, in either BAM (unaligned BAM) or FASTQ format.

If you start with merged BAMs, split them first with bamtofastq, one of the biobambam tools included in the PCAP-core installation:

$ bamtofastq exclude=SECONDARY,SUPPLEMENTARY,QCFAIL outputperreadgroup=1 outputdir=some_out_folder filename=your_input.bam tryoq=1

This will generate a pair of FASTQ files for every readgroup found in the BAM. Once all FASTQ files are generated, proceed to step b.

NOTE: The generated FASTQ files may be very large. To avoid losing reads, ensure there is enough disk space available before starting bamtofastq.
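The disk-space check can be scripted before launching bamtofastq. The following is a minimal sketch, assuming Python 3 is available on the machine. The function name `enough_space_for_fastq` and the default `expansion_factor` of 5 are illustrative assumptions, not values taken from biobambam; adjust the factor to what you observe for your own data.

```python
import os
import shutil


def enough_space_for_fastq(bam_path, out_dir, expansion_factor=5.0):
    """Return True if out_dir has room for about expansion_factor x the BAM size.

    expansion_factor is an assumed safety margin, not a figure from biobambam.
    """
    needed = os.path.getsize(bam_path) * expansion_factor
    free = shutil.disk_usage(out_dir).free
    return free >= needed
```

For example, `enough_space_for_fastq("your_input.bam", "some_out_folder")` could be called in a wrapper script, which would refuse to start the conversion when it returns `False`.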

Every BAM file ready for submission must contain the metadata needed for proper uploading to GNOS and for handling in downstream analysis. This information is kept in the read group and comment sections of the BAM header: @RG and @CO.

The table below (Table 1) describes how tags in @RG should be populated:

| Tag | Description | Details |
| --- | --- | --- |
| ID | Read group identifier | Unique within site: <centre_name>:<unique_text> |
| PL | Platform/technology used to produce the reads | CAPILLARY, LS454, ILLUMINA, SOLID, HELICOS, IONTORRENT, PACBIO |
| SM | Sample | The sample UUID, generated by running `uuidgen` |

A valid @RG line may look like:

@RG	ID:WTSI:9399_7	CN:WTSI	PL:ILLUMINA	PM:Illumina HiSeq 2000	LB:WGS:WTSI:28085	PI:453	SM:f393ba16-9361-5df4-e040-11ac0d4844e8	PU:WTSI:9399_7	DT:2013-03-18T00:00:00+00:00
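The rules for ID, PL and SM can be checked mechanically before a header is applied. The following is a minimal sketch, assuming Python 3; the function names `parse_rg_line` and `check_rg` are hypothetical helpers, not part of biobambam or PCAP-core. It only covers the three tags described in the table above.

```python
import uuid

ALLOWED_PLATFORMS = {
    "CAPILLARY", "LS454", "ILLUMINA", "SOLID",
    "HELICOS", "IONTORRENT", "PACBIO",
}


def parse_rg_line(line):
    """Split a tab-separated @RG header line into a {tag: value} dict."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG line")
    tags = {}
    for field in fields[1:]:
        # Split on the first colon only: values such as DT contain colons.
        tag, sep, value = field.partition(":")
        if not sep:
            raise ValueError(f"malformed field: {field!r}")
        tags[tag] = value
    return tags


def check_rg(tags):
    """Return a list of problems found for the ID, PL and SM tags."""
    problems = []
    centre, sep, unique_text = tags.get("ID", "").partition(":")
    if not (centre and sep and unique_text):
        problems.append("ID should look like <centre_name>:<unique_text>")
    if tags.get("PL") not in ALLOWED_PLATFORMS:
        problems.append("PL is not one of the listed platforms")
    try:
        uuid.UUID(tags.get("SM", ""))
    except ValueError:
        problems.append("SM is not a UUID")
    return problems
```

Because fields are split on tabs only, a line in which two fields are separated by spaces is reported as having a missing or invalid tag, which is a common copy-and-paste error in hand-written headers.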

The table below (Table 2) shows how additional information for sample tracking is kept in @CO lines:

| Key | Value | Notes |
| --- | --- | --- |
| dcc_specimen_type | See the CV terms table in the Appendix. | This field defines whether the sample is a tumour or normal control. |
| use_cntl | UUID of the matched normal/control sample, or N/A | If the dcc_specimen_type is not normal/control, populate this field with the UUID in the @RG SM field of the current matched (same donor) normal/control sample. If dcc_specimen_type is a normal/control, populate this field with N/A. |

Valid @CO lines may look like:

@CO	dcc_project_code:BRCA-UK
@CO	submitter_donor_id:CGP_donor_1199131
@CO	submitter_specimen_id:CGP_specimen_1142534
@CO	submitter_sample_id:PD3851a
@CO	dcc_specimen_type:Primary tumour - solid tissue
@CO	use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
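The dependency between dcc_specimen_type and use_cntl is easy to get wrong, so it can help to check it in a script. The following is a minimal sketch, assuming Python 3; `parse_co_lines` and `check_use_cntl` are hypothetical helper names. It relies on the observation that every normal/control term in the Appendix starts with "Normal - ", and it does not verify that the specimen type itself is a valid CV term.

```python
import uuid

# Normal/control terms in the Appendix all begin with this prefix.
NORMAL_PREFIX = "Normal - "


def parse_co_lines(lines):
    """Collect key:value pairs from @CO header lines into a dict."""
    pairs = {}
    for line in lines:
        marker, sep, body = line.rstrip("\n").partition("\t")
        if marker != "@CO" or not sep:
            continue  # skip anything that is not a tab-separated @CO line
        key, colon, value = body.partition(":")
        if colon:
            pairs[key] = value
    return pairs


def check_use_cntl(pairs):
    """Return None if use_cntl agrees with dcc_specimen_type, else a message."""
    specimen_type = pairs.get("dcc_specimen_type", "")
    use_cntl = pairs.get("use_cntl", "")
    if specimen_type.startswith(NORMAL_PREFIX):
        if use_cntl == "N/A":
            return None
        return "normal/control samples need use_cntl:N/A"
    try:
        uuid.UUID(use_cntl)
    except ValueError:
        return "non-normal samples need the matched normal's SM UUID in use_cntl"
    return None
```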

a) Follow this if you start from single lane BAM files

  • BAM file re-header

    For each BAM file, create a new text file and populate it with values as indicated in Table 1 and Table 2 above.

    Make sure the first line is always: @HD VN:1.4

    An example BAM header file is shown below (named header.sam here):

@HD	VN:1.4
@RG	ID:WTSI:9399_7	CN:WTSI	PL:ILLUMINA	PM:Illumina HiSeq 2000	LB:WGS:WTSI:28085	PI:453	SM:f393ba16-9361-5df4-e040-11ac0d4844e8	PU:WTSI:9399_7	DT:2013-03-18T00:00:00+00:00
@CO	dcc_project_code:BRCA-UK
@CO	submitter_donor_id:CGP_donor_1199131
@CO	submitter_specimen_id:CGP_specimen_1142534
@CO	submitter_sample_id:PD3851a
@CO	dcc_specimen_type:Primary tumour - solid tissue
@CO	use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053

  • Generate unaligned BAM with new header from initial BAM

    Generate a BAM file without alignment information that incorporates the new header (the header.sam file created above) as follows:

    Using biobambam 0.0.120+:

$ cat initial.bam | bamreset exclude=QCFAIL,SECONDARY,SUPPLEMENTARY resetheadertext=header.sam md5=1 md5filename=cleaned.bam.md5 > cleaned.bam

NOTE: You will need to do this for all of the BAM files you plan to submit for a donor. Once done with all BAMs and all donors, go to step 6.
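Since the re-header step has to be repeated for every BAM file, it may be worth scripting. The following is a minimal sketch, assuming Python 3.7+ (so that dict order is preserved) and that bamreset is on the PATH; `build_header_text` and `run_bamreset` are hypothetical helper names. The bamreset options are exactly those in the command above, nothing more.

```python
import subprocess


def build_header_text(rg_tags, co_pairs):
    """Assemble header.sam content: @HD first, then one @RG line and the @CO lines."""
    lines = ["@HD\tVN:1.4"]
    rg_fields = [f"{tag}:{value}" for tag, value in rg_tags.items()]
    lines.append("\t".join(["@RG"] + rg_fields))
    lines.extend(f"@CO\t{key}:{value}" for key, value in co_pairs.items())
    return "\n".join(lines) + "\n"


def run_bamreset(initial_bam, header_path, cleaned_bam):
    """Run the bamreset command from this section, feeding the BAM via stdin."""
    cmd = [
        "bamreset",
        "exclude=QCFAIL,SECONDARY,SUPPLEMENTARY",
        f"resetheadertext={header_path}",
        "md5=1",
        f"md5filename={cleaned_bam}.md5",
    ]
    with open(initial_bam, "rb") as src, open(cleaned_bam, "wb") as dst:
        subprocess.run(cmd, stdin=src, stdout=dst, check=True)
```

A driver script would write the result of `build_header_text(...)` to header.sam and then call `run_bamreset("initial.bam", "header.sam", "cleaned.bam")` for each lane. Passing `check=True` makes a failed bamreset run raise an exception instead of silently leaving a truncated output file.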

b) Follow this if you start from FASTQ files

  • Convert FASTQ file to BAM with @RG header added

    Using biobambam 0.0.117+:

$ fastqtobam I=initial_1.fq I=initial_2.fq md5=1 md5filename=cleaned.bam.md5 RGID=<> RGCN=<> RGPL=<> RGLB=<> RGPI=<> RGSM=<> RGPU=<> RGDT=<> > cleaned.bam

NOTE: You will need to replace <> with proper values in the above command. Review Table 1 carefully for how to populate these read group fields. Also, as fastqtobam is not able to populate RGPM, you will need to specify this in the .info file described below.
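To avoid leaving a <> placeholder unfilled, the argument list can be generated from a dictionary of read group values. The following is a minimal sketch, assuming Python 3; `fastqtobam_args` is a hypothetical helper name, and the option names are only those that appear in the command above.

```python
# Read group options used in the fastqtobam command shown above.
RG_OPTIONS = ["RGID", "RGCN", "RGPL", "RGLB", "RGPI", "RGSM", "RGPU", "RGDT"]


def fastqtobam_args(fastq_1, fastq_2, md5_path, read_group):
    """Build the fastqtobam argument list, refusing empty read group values."""
    missing = [opt for opt in RG_OPTIONS if not read_group.get(opt)]
    if missing:
        raise ValueError("missing read group values: " + ", ".join(missing))
    args = [
        "fastqtobam",
        f"I={fastq_1}",
        f"I={fastq_2}",
        "md5=1",
        f"md5filename={md5_path}",
    ]
    args += [f"{opt}={read_group[opt]}" for opt in RG_OPTIONS]
    return args
```

The returned list can be passed to `subprocess.run(args, stdout=out_handle, check=True)` with `out_handle` opened on cleaned.bam in binary write mode. Passing a list rather than a shell string sidesteps quoting issues with values that contain spaces or colons.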

  • Create .info file with sample tracking data for later use

    Create a .info file named after your output BAM file (cleaned.bam.info for the command above) containing the @CO values specified in Table 2 above. Place it in the same directory as the BAM file it describes; bam_to_sra_xml.pl uses this file to complete the submission XML files. For example:

my_input.bam
my_input.bam.info

Example contents (note that the @CO prefix is omitted in the *.info file):

dcc_project_code:BRCA-UK
submitter_donor_id:CGP_donor_1199131
submitter_specimen_id:CGP_specimen_1142534
submitter_sample_id:PD3851a
dcc_specimen_type:Primary tumour - solid tissue
use_cntl:85098796-a2c1-11e3-a743-6c6c38d06053
PM:Illumina HiSeq 2000

NOTE: You will need to do this for all of the FASTQ files you plan to submit for a donor.
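Writing the .info file can be automated alongside the conversion. The following is a minimal sketch, assuming Python 3 and the key:value layout of the example above; `write_info_file` is a hypothetical helper name and is not part of bam_to_sra_xml.pl.

```python
def write_info_file(bam_path, co_pairs, platform_model):
    """Write <bam_path>.info next to the BAM: key:value lines, no @CO prefix."""
    info_path = bam_path + ".info"
    with open(info_path, "w") as handle:
        for key, value in co_pairs.items():
            handle.write(f"{key}:{value}\n")
        # fastqtobam cannot set RGPM, so the platform model is recorded here.
        handle.write(f"PM:{platform_model}\n")
    return info_path
```

Calling `write_info_file("cleaned.bam", pairs, "Illumina HiSeq 2000")` with the six sample-tracking keys in `pairs` would produce a file with the same lines as the example above.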

Once you have your BAM files prepared, you are ready to upload to your BAM file repository (such as Amazon S3 buckets).

Appendix

| Controlled vocabulary for dcc_specimen_type |
| --- |
| Normal - solid tissue |
| Normal - blood derived |
| Normal - bone marrow |
| Normal - tissue adjacent to primary |
| Normal - buccal cell |
| Normal - EBV immortalized |
| Normal - lymph node |
| Normal - other |
| Primary tumour - solid tissue |
| Primary tumour - blood derived (peripheral blood) |
| Primary tumour - blood derived (bone marrow) |
| Primary tumour - additional new primary |
| Primary tumour - other |
| Recurrent tumour - solid tissue |
| Recurrent tumour - blood derived (peripheral blood) |
| Recurrent tumour - blood derived (bone marrow) |
| Recurrent tumour - other |
| Metastatic tumour - NOS |
| Metastatic tumour - lymph node |
| Metastatic tumour - metastasis local to lymph node |
| Metastatic tumour - metastasis to distant location |
| Metastatic tumour - additional metastatic |
| Xenograft - derived from primary tumour |
| Xenograft - derived from tumour cell line |
| Cell line - derived from tumour |
| Primary tumour - lymph node |
