## Test ability of STAR to align unaligned bam files

Normally, STAR aligns fastq files. Alex Dobin has graciously agreed to add unaligned bam (ubam) alignment functionality. In this notebook, we confirm that this functionality works.

In [1]:
# for workflow management
import json
import os
from google.cloud import storage
import cromwell_manager as cwm

with open(os.path.expanduser('~/.ssh/mint_cromwell_config.json')) as f:
    cromwell_server = cwm.Cromwell(**json.load(f))

storage_client = storage.Client(project='broad-dsde-mint-dev')

Because this is a new version of STAR, it is not compatible with our previously generated references, so we build a new index with it.

In [8]:
inputs_json = {
    "star_mkref.fasta_file": "gs://broad-dsde-mint-dev-teststorage/demo/chr21.fa",
    "star_mkref.annotation_file": "gs://broad-dsde-mint-dev-teststorage/demo/gencodev19_chr21.gtf",
}

wdl = '../pipelines/accessories/star_mkref.wdl'

dependencies = {
    'StarMkref.wdl': '../pipelines/tasks/StarMkref.wdl',
}

In [9]:
os.environ['wdltool'] = os.path.expanduser('~/google_drive/software/wdltool-0.14.jar')
make_index = cwm.Workflow.validate(
    wdl=wdl, inputs_json=inputs_json, cromwell_server=cromwell_server, storage_client=storage_client,
    workflow_dependencies=dependencies)

CWM:2017-10-26 17:04:16.003290:creating temporary directory
CWM:2017-10-26 17:04:16.008332:writing dependencies
CWM:2017-10-26 17:04:16.019737:writing wdl
CWM:2017-10-26 17:04:16.021304:running wdltool validate
CWM:2017-10-26 17:04:17.375423:validation successful
CWM:2017-10-26 17:04:17.618765:checking docker image ${star_docker_image}... not found. Is image private?


In [10]:
make_index = cwm.Workflow.from_submission(
    wdl=wdl, inputs_json=inputs_json, cromwell_server=cromwell_server, storage_client=storage_client,
    workflow_dependencies=dependencies)

In [17]:
make_index.status

{'id': '554b87ef-3d0c-4b38-a062-4cbf90f40dde', 'status': 'Succeeded'}

In [None]:
%%bash
# copy over index
gsutil cp gs://broad-dsde-mint-dev-cromwell-execution/cromwell-executions/star_mkref/554b87ef-3d0c-4b38-a062-4cbf90f40dde/call-StarMkref/genome.tar \
gs://broad-dsde-mint-dev-teststorage/reference/hg19-chr21-star-2.5.3a-index.tar

This test does not need a large fastq; it only checks that the functionality works. We test that:

1. The sorted bam outputs of STAR from fastq and STAR from ubam are identical.
2. No tags are lost.
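The second criterion can be expressed as a per-read set comparison: every tag present on a read before alignment should still be present afterwards, while tags added by the aligner are allowed. The sketch below is a minimal illustration, not part of the workflow; the helper name `lost_tags` and the example tag names and values are hypothetical. It works on lists of `(tag, value)` pairs, the shape returned by pysam's `AlignedSegment.get_tags()`.

```python
def lost_tags(input_tags, output_tags):
    """Return the sorted tag names present on the input read but absent from the output read.

    Both arguments are lists of (tag, value) pairs, the shape returned by
    pysam's AlignedSegment.get_tags(). Tags that only appear on the output
    (for example, tags added by the aligner) are ignored.
    """
    before = {tag for tag, _ in input_tags}
    after = {tag for tag, _ in output_tags}
    return sorted(before - after)


# Hypothetical example: barcode tags on an unaligned read, and the same read
# after alignment with one tag missing and two aligner tags added.
unaligned = [("CR", "AAACCTGAGAAACCAT"), ("CY", "FFFFFFFFFFFFFFFF"), ("UR", "ACGTACGTAC")]
aligned = [("NH", 1), ("HI", 1), ("CR", "AAACCTGAGAAACCAT"), ("UR", "ACGTACGTAC")]

print(lost_tags(unaligned, aligned))  # prints ['CY']
```

With real data, one would call `lost_tags(u.get_tags(), a.get_tags())` for reads `u` and `a` paired by query name from the ubam and from the aligned bam.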

In [15]:
inputs_json = {
    "test_ubam_alignment.r1": "gs://broad-dsde-mint-dev-teststorage/10x/benchmark/1e6/pbmc8k_S1_L007_R1_001.fastq.gz",
    "test_ubam_alignment.r2": "gs://broad-dsde-mint-dev-teststorage/10x/benchmark/1e6/pbmc8k_S1_L007_R2_001.fastq.gz",
    "test_ubam_alignment.i7": "gs://broad-dsde-mint-dev-teststorage/10x/benchmark/1e6/pbmc8k_S1_L007_I1_001.fastq.gz",    
    "test_ubam_alignment.sample_name": "test_chr19_1e6",
    "test_ubam_alignment.tar_star_reference": "gs://broad-dsde-mint-dev-teststorage/reference/hg19-chr21-star-2.5.3a-index.tar"
}

wdl = '../analysis/test_star_ubam_alignment.wdl'

dependencies = {
    'StarAlignFastqSingleEnd.wdl': '../pipelines/tasks/StarAlignFastqSingleEnd.wdl',
    'StarAlignBamSingleEnd.wdl': '../pipelines/tasks/StarAlignBamSingleEnd.wdl',
    'FastqToUBam.wdl': '../pipelines/tasks/FastqToUBam.wdl',
    'Attach10xBarcodes.wdl': '../pipelines/tasks/Attach10xBarcodes.wdl'
}

In [23]:
workflow = cwm.Workflow.validate(
    wdl=wdl, inputs_json=inputs_json, cromwell_server=cromwell_server, storage_client=storage_client,
    workflow_dependencies=dependencies)

CWM:2017-10-26 17:17:12.014899:creating temporary directory
CWM:2017-10-26 17:17:12.019486:writing dependencies
CWM:2017-10-26 17:17:12.029171:writing wdl
CWM:2017-10-26 17:17:12.031048:running wdltool validate
CWM:2017-10-26 17:17:12.911353:validation successful
CWM:2017-10-26 17:17:13.264883:checking docker image humancellatlas/python3-scientific:0.1.0... OK.
CWM:2017-10-26 17:17:13.540806:checking docker image humancellatlas/picard:latest... OK.
CWM:2017-10-26 17:17:13.749258:checking docker image humancellatlas/star:2.5.3a-40ead6e... OK.


In [24]:
workflow = cwm.Workflow.from_submission(
    wdl=wdl, inputs_json=inputs_json, cromwell_server=cromwell_server, storage_client=storage_client,
    workflow_dependencies=dependencies)

In [26]:
workflow.outputs

{'id': '726317de-d349-4aac-b4b3-c7644276b9aa',
 'outputs': {'test_ubam_alignment.bam_alignment_log': 'gs://broad-dsde-mint-dev-cromwell-execution/cromwell-executions/test_ubam_alignment/726317de-d349-4aac-b4b3-c7644276b9aa/call-StarAlignBamSingleEnd/Log.final.out',
  'test_ubam_alignment.bam_to_bam': 'gs://broad-dsde-mint-dev-cromwell-execution/cromwell-executions/test_ubam_alignment/726317de-d349-4aac-b4b3-c7644276b9aa/call-StarAlignBamSingleEnd/Aligned.out.bam',
  'test_ubam_alignment.fastq_alignment_log': 'gs://broad-dsde-mint-dev-cromwell-execution/cromwell-executions/test_ubam_alignment/726317de-d349-4aac-b4b3-c7644276b9aa/call-StarAlignFastqSingleEnd/Log.final.out',
  'test_ubam_alignment.fastq_to_bam': 'gs://broad-dsde-mint-dev-cromwell-execution/cromwell-executions/test_ubam_alignment/726317de-d349-4aac-b4b3-c7644276b9aa/call-StarAlignFastqSingleEnd/Aligned.out.bam'}}

Download the alignment logs and compare them.

In [27]:
bam_results = cwm.io_util.GSObject(workflow.outputs['outputs']['test_ubam_alignment.bam_alignment_log'], storage_client).download_as_string()
fastq_results = cwm.io_util.GSObject(workflow.outputs['outputs']['test_ubam_alignment.fastq_alignment_log'], storage_client).download_as_string()

# Compare the alignment logs after dropping the first four lines (time stamps and mapping speed).
# The remaining lines hold the alignment statistics: # uniquely mapped, # multimapped, indel rates, etc.
''.join(bam_results.split('\n')[4:]) == ''.join(fastq_results.split('\n')[4:])

True

After trimming the time stamps and mapping speed (the first four lines of each log), the results are identical. In addition, bam alignment is slightly faster (although the number of reads is small)!
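A single boolean is enough when the logs match, but a per-metric diff is easier to read when they do not. The sketch below assumes that the log is made of `key | value` lines, with section headers and blank lines containing no `|`, and that the first four lines are run-specific. The function names and the toy log text are hypothetical, not output from this run.

```python
def parse_star_log(text, skip_lines=4):
    """Parse 'key | value' lines of a log into a dict, skipping the first skip_lines lines."""
    metrics = {}
    for line in text.split('\n')[skip_lines:]:
        if '|' not in line:
            continue  # blank lines and section headers carry no metric
        key, value = line.split('|', 1)
        metrics[key.strip()] = value.strip()
    return metrics


def differing_metrics(log_a, log_b):
    """Return {metric: (value_a, value_b)} for every metric whose value differs."""
    a, b = parse_star_log(log_a), parse_star_log(log_b)
    return {k: (a.get(k), b.get(k)) for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}


# Toy logs (hypothetical values) that differ only in the four skipped lines.
template = (
    "Started job on |\t{start}\n"
    "Started mapping on |\t{start}\n"
    "Finished on |\t{end}\n"
    "Mapping speed, Million of reads per hour |\t{speed}\n"
    "\n"
    "Number of input reads |\t{n_reads}\n"
    "UNIQUE READS:\n"
    "Uniquely mapped reads number |\t{n_unique}\n"
)
toy_fastq = template.format(start="t0", end="t1", speed="1.0", n_reads=100, n_unique=90)
toy_ubam = template.format(start="t2", end="t3", speed="2.0", n_reads=100, n_unique=90)

print(differing_metrics(toy_fastq, toy_ubam))  # prints {}
```

Applied to the two downloaded logs, an empty dict would correspond to the `True` result above, and any non-empty result would name the metrics that disagree.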

Download the two aligned bam files.

In [None]:
%%bash
# download files
gsutil cp gs://broad-dsde-mint-dev-cromwell-execution/cromwell-executions/test_ubam_alignment/726317de-d349-4aac-b4b3-c7644276b9aa/\
call-StarAlignBamSingleEnd/Aligned.out.bam aligned_ubam.bam

gsutil cp gs://broad-dsde-mint-dev-cromwell-execution/cromwell-executions/test_ubam_alignment/726317de-d349-4aac-b4b3-c7644276b9aa/\
call-StarAlignFastqSingleEnd/Aligned.out.bam aligned_fastq.bam

The downloaded bam contains properly tagged reads. Looks like things are working! Sort the files by read name so we can check whether each read is identical.

In [185]:
%%bash
# sort files
samtools sort -n -o aligned_ubam_sorted.bam aligned_ubam.bam
samtools sort -n -o aligned_fastq_sorted.bam aligned_fastq.bam

In [182]:
import pysam

fq = pysam.AlignmentFile('aligned_fastq_sorted.bam', 'rb')
ub = pysam.AlignmentFile('aligned_ubam_sorted.bam', 'rb')
for fqr, ubr in zip(fq, ub):
    # check that query_name (read name), reference_start (0-based alignment position)
    # and reference_id (chromosome index) are identical; these replace the deprecated
    # qname, pos and rname attributes
    if any([fqr.query_name != ubr.query_name,
            fqr.reference_start != ubr.reference_start,
            fqr.reference_id != ubr.reference_id]):
        print(fqr)
        print(ubr)
        break  # print the reads and stop at the first non-identical pair

Using a samfile iterator based on htslib, we checked that for each read the query name, chromosome, and alignment position in the two sorted bam files are identical. The loop printed nothing, so no mismatching read was found.
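One caveat of a `zip`-based comparison is that it stops at the shorter input, so a difference in read counts between the two files would go unnoticed. A minimal sketch of a stricter check follows, using `itertools.zip_longest`. The helper name `first_mismatch` and the toy records are hypothetical, and each record is assumed to be a tuple such as `(query_name, reference_id, reference_start)`.

```python
from itertools import zip_longest


def first_mismatch(records_a, records_b):
    """Return (index, a, b) for the first differing pair, or None if the streams match.

    A stream that ends early contributes None for its side, so inputs of
    unequal length are reported as a mismatch instead of being truncated.
    """
    for i, (a, b) in enumerate(zip_longest(records_a, records_b)):
        if a != b:
            return i, a, b
    return None


# Toy records: the second stream has one extra read at the end.
from_fastq = [("read1", 0, 100), ("read2", 0, 250)]
from_ubam = [("read1", 0, 100), ("read2", 0, 250), ("read3", 0, 400)]

print(first_mismatch(from_fastq, from_fastq))  # prints None
print(first_mismatch(from_fastq, from_ubam))   # prints (2, None, ('read3', 0, 400))
```

With pysam, the inputs could be generators such as `((r.query_name, r.reference_id, r.reference_start) for r in fq)`, which keeps memory use constant for large files.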