Get the HG002 genome assembly and the hs37d5 reference (`hs37d5.fa.gz`):

The original URL: `s3://assembly-results/hg002-asm-r4-pg0.1.5.0/4-cns/cns-merge/p_ctg_cns.fa`

In [1]:
%%bash
dx download -f /GIAB_BAM_files/hg002_asm/p_ctg_cns.fa
wget --quiet ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz

In [2]:
%%bash
mkdir -p /home/dnanexus/asm_map
cp p_ctg_cns.fa hs37d5.fa.gz /home/dnanexus/asm_map

The following script builds a SHIMMER index for each sequence set and uses the indexes to map the contigs to the reference:

In [3]:
%%writefile /home/dnanexus/asm_map/match.sh
. /opt/conda/bin/activate
conda activate peregrine
echo $1 > seq1.lst
echo $2 > seq2.lst
shmr_mkseqdb -d seq1.lst -p seq1
shmr_mkseqdb -d seq2.lst -p seq2
shmr_index -m 0 -p seq1 -o seq1-shmr
shmr_index -m 0 -p seq2 -o seq2-shmr
shmr_map -r seq1 -m seq1-shmr-L2 -p seq2 -l seq2-shmr-L2 > seq1-seq2-map

Writing /home/dnanexus/asm_map/match.sh


Pull the Peregrine Docker image so we can execute the script:

In [4]:
%%bash
docker pull cschin/peregrine:0.1.5.2

0.1.5.2: Pulling from cschin/peregrine
05d1a5232b46: Pulling fs layer
d974dd5eb235: Pulling fs layer
2de22c73730e: Pulling fs layer
444f639f1b28: Pulling fs layer
c1b600cb48ba: Pulling fs layer
6905bc91a310: Pulling fs layer
c6873591a651: Pulling fs layer
bd033af558ff: Pulling fs layer
515d53cbabf4: Pulling fs layer
7d46c9b091da: Pulling fs layer
26bd8ecb1bef: Pulling fs layer
250e7fe096ba: Pulling fs layer
0f1c1b919b7e: Pulling fs layer
97752d4e05f4: Pulling fs layer
c69e15245db4: Pulling fs layer
c1b600cb48ba: Waiting
cfceb9845f0b: Pulling fs layer
7abb4c0f3629: Pulling fs layer
bd033af558ff: Waiting
6905bc91a310: Waiting
5fe5fe02dc03: Pulling fs layer
99a2aa6e39ec: Pulling fs layer
c6873591a651: Waiting
515d53cbabf4: Waiting
1cf4752fcf36: Pulling fs layer
7d46c9b091da: Waiting
dc534230c8d9: Pulling fs layer
26bd8ecb1bef: Waiting
0f1c1b919b7e: Waiting
97752d4e05f4: Waiting
250e7fe096ba: Waiting
cfceb9845f0b: Waiting
99a2aa6e39ec: Waiting
5fe5fe02dc03: Waiting
7abb4c0f3629: Waiting
44

Execute `match.sh` inside the Docker image environment:

In [5]:
%%bash
docker run -v $HOME:$HOME --workdir=/home/dnanexus/asm_map --entrypoint=/bin/bash cschin/peregrine:0.1.5.2 match.sh hs37d5.fa.gz p_ctg_cns.fa 

input sequence dataset file list: 'seq1.lst'
output index file: seq1.idx
output seqdb file: seq1.idx
input sequence dataset file list: 'seq2.lst'
output index file: seq2.idx
output seqdb file: seq2.idx
output data file: seq1-shmr-L2-MC-01-of-01.dat
output data file: seq2-shmr-L2-MC-01-of-01.dat


activate does not accept more than one argument:
('hs37d5.fa.gz', 'p_ctg_cns.fa')

reduction factor= 6
using index file: seq1.idx
using seqdb file: seq1.seqdb
output data file: seq1-shmr-L2-01-of-01.dat
reduction factor= 6
using index file: seq2.idx
using seqdb file: seq2.seqdb
output data file: seq2-shmr-L2-01-of-01.dat
using ref index file: seq1.idx
using ref seqdb file: seq1.seqdb
using ref shimmer data file: seq1-shmr-L2-01-of-01.dat
number of shimmers load: 7370440
using index file: seq2.idx
using seqdb file: seq2.seqdb
using shimmer data file: seq2-shmr-L2-01-of-01.dat
number of shimmers load: 7388875
using shimmer count file: seq2-shmr-L2-MC-01-of-01.dat


Find the internal ID of chromosome 6 in the reference sequence index (`seq1.idx`):

In [6]:
%%bash 
cat /home/dnanexus/asm_map/seq1.idx | awk '$2 == 6'

000000005 6 171115067 1062541960
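
The same lookup can be sketched in Python. This assumes each `.idx` row holds four whitespace-separated fields (internal ID, sequence name, length, byte offset), which is how `fetch_reads.py` later in this notebook parses it. The helper name `find_seq_id` is hypothetical and is not part of Peregrine.

```python
def find_seq_id(idx_lines, name):
    """Return (internal_id, length) for the sequence called `name`, or None."""
    for line in idx_lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed rows
        seq_id, seq_name, seq_len, _offset = fields
        if seq_name == name:
            return int(seq_id), int(seq_len)
    return None


# The chromosome 6 row printed above, used here as sample input.
sample = ["000000005 6 171115067 1062541960"]
print(find_seq_id(sample, "6"))
```

In practice `idx_lines` would be an open file handle on `seq1.idx`; the internal ID 5 is the value used to filter the map output in the next cells.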


See which contigs are mapped to the MHC region, chr6:28477797-33448354:

In [7]:
%%bash
cd /home/dnanexus/asm_map
less seq1-seq2-map | awk '$1 == 5 && $2 > 28477797 && $3 < 33448354' | \
sort -k 2 -g | \
awk '{print $4" "$7}' | sort -g | uniq -c | sort -g -r | head -5

   6387 287 1
     10 2340 1
     10 0 1
      6 286 1
      6 192 1
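
For readers less familiar with `awk`, the counting step can be sketched in Python. The column meanings are assumptions inferred from the commands in this notebook, not from SHIMMER documentation: column 1 is the reference ID, columns 2 and 3 are the reference start and end, column 4 is the contig ID, and column 7 is the orientation flag. The function name `count_hits` is hypothetical.

```python
from collections import Counter


def count_hits(map_lines, ref_id, ref_start, ref_end):
    """Count hits per (contig id, orientation) lying inside the reference window."""
    counts = Counter()
    for line in map_lines:
        f = line.split()
        if len(f) < 7:
            continue  # skip blank or truncated rows
        # Same filter as: awk '$1 == 5 && $2 > start && $3 < end'
        if int(f[0]) == ref_id and int(f[1]) > ref_start and int(f[2]) < ref_end:
            counts[(int(f[3]), int(f[6]))] += 1
    return counts


# Two rows taken from the map output shown later in this notebook.
sample = [
    "5 28477908 28478170 287 1527469 1527731 1 1 1",
    "5 33445658 33446169 287 6541138 6541649 1 4 216",
]
hits = count_hits(sample, 5, 28477797, 33448354)
for (contig, orientation), n in hits.most_common(5):
    print(n, contig, orientation)
```

`Counter.most_common(5)` plays the role of `sort | uniq -c | sort -g -r | head -5` in the shell pipeline.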


The contig with internal ID 287 accounts for almost all of the hits and is mapped to chromosome 6 in the reverse direction. We can find the start and end coordinates of the part of contig 287 that maps to the MHC region. We also query the sequence database index to get the length of contig 287.

In [8]:
%%bash
cd /home/dnanexus/asm_map
less seq1-seq2-map | awk '$1 == 5 && $2 > 28477797 && $3 < 33448354' | sort -k 2 -g | head -1
less seq1-seq2-map | awk '$1 == 5 && $2 > 28477797 && $3 < 33448354' | sort -k 2 -g | tail -1
cat seq2.idx | awk '$1 == 287'


5 28477908 28478170 287 1527469 1527731 1 1 1
5 33445658 33446169 287 6541138 6541649 1 4 216
000000287 000028F 30459935 607429861


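The results indicate that the MHC region corresponds to 000028F:(30459935-6541649)-(30459935-1527469) = 000028F:23918286-28932466. Because the contig is mapped in the reverse direction, the hit coordinates are subtracted from the contig length (30459935). We want to find all reads that map to this region. We could use an off-the-shelf aligner; the following instead shows how to use the SHIMMER index to find the mapped reads.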
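
As a check on the coordinate arithmetic above, here is a minimal Python sketch. It assumes that a position `p` on a reverse-mapped contig of length `L` corresponds to `L - p` on the other strand, and it ignores any off-by-one convention the map file may use. The function name `flip_interval` is hypothetical.

```python
def flip_interval(start, end, seq_len):
    """Map an interval on one strand onto the opposite strand's coordinates."""
    # The interval end becomes the new start, and vice versa.
    return seq_len - end, seq_len - start


contig_len = 30459935      # length of contig 287 (000028F) from seq2.idx
first_hit_start = 1527469  # contig start of the first hit in the MHC window
last_hit_end = 6541649     # contig end of the last hit in the MHC window

fwd_start, fwd_end = flip_interval(first_hit_start, last_hit_end, contig_len)
print("000028F:{}-{}".format(fwd_start, fwd_end))
```

This prints `000028F:23918286-28932466`, the same bounds that the `awk` filter on `map_to_287` uses later.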

First, we have to download the sequences and build the sequence database. Here we start with a pre-built sequence database (~80 GB).

In [9]:
%%bash
cd /home/dnanexus/asm_map
dx download /GIAB_BAM_files/CCS_reads/reads.idx /GIAB_BAM_files/CCS_reads/reads.seqdb /GIAB_BAM_files/CCS_reads/reads.lst
mkdir -p read_db/
mv reads.idx reads.seqdb reads.lst read_db/

INFO:dxpy:[Sat Jun 15 22:37:49 2019] GET http://10.0.3.1:8090/F/D2PRJ/file-FYXgbf80Xv3J0f3BJ3zV2FXb/project-FXGVFb00Xv39f6K624b11bZ1: Recovered after 1 retries


In [10]:
%%writefile /home/dnanexus/asm_map/build_read_index_and_map.sh
#!/bin/bash
. /opt/conda/bin/activate
conda activate peregrine
mkdir -p read_index
for i in `seq 1 24`; do echo shmr_index -p read_db/reads -o read_index/read-shmr -t 24 -c $i; done | \
parallel -j 12
for i in `seq 1 24`; do echo "shmr_map -r ./seq2 -m ./seq2-shmr-L2 -p read_db/reads -l read_index/read-shmr-L2 -t 24 -c $i > map.$i"; done | \
parallel -j 12

Writing /home/dnanexus/asm_map/build_read_index_and_map.sh


In [11]:
%%bash
docker run -v $HOME:$HOME --workdir=/home/dnanexus/asm_map --entrypoint=/bin/bash cschin/peregrine:0.1.5.2 build_read_index_and_map.sh

output data file: read_index/read-shmr-L0-MC-11-of-24.dat
output data file: read_index/read-shmr-L2-MC-11-of-24.dat
output data file: read_index/read-shmr-L0-MC-12-of-24.dat
output data file: read_index/read-shmr-L2-MC-12-of-24.dat
output data file: read_index/read-shmr-L0-MC-01-of-24.dat
output data file: read_index/read-shmr-L2-MC-01-of-24.dat
output data file: read_index/read-shmr-L0-MC-02-of-24.dat
output data file: read_index/read-shmr-L2-MC-02-of-24.dat
output data file: read_index/read-shmr-L0-MC-05-of-24.dat
output data file: read_index/read-shmr-L2-MC-05-of-24.dat
output data file: read_index/read-shmr-L0-MC-09-of-24.dat
output data file: read_index/read-shmr-L2-MC-09-of-24.dat
output data file: read_index/read-shmr-L0-MC-03-of-24.dat
output data file: read_index/read-shmr-L2-MC-03-of-24.dat
output data file: read_index/read-shmr-L0-MC-10-of-24.dat
output data file: read_index/read-shmr-L2-MC-10-of-24.dat
output data file: read_index/read-shmr-L0-MC-04-of-24.dat
output data fi

reduction factor= 6
using index file: read_db/reads.idx
using seqdb file: read_db/reads.seqdb
output data file: read_index/read-shmr-L0-11-of-24.dat
output data file: read_index/read-shmr-L2-11-of-24.dat
reduction factor= 6
using index file: read_db/reads.idx
using seqdb file: read_db/reads.seqdb
output data file: read_index/read-shmr-L0-12-of-24.dat
output data file: read_index/read-shmr-L2-12-of-24.dat
reduction factor= 6
using index file: read_db/reads.idx
using seqdb file: read_db/reads.seqdb
output data file: read_index/read-shmr-L0-01-of-24.dat
output data file: read_index/read-shmr-L2-01-of-24.dat
reduction factor= 6
using index file: read_db/reads.idx
using seqdb file: read_db/reads.seqdb
output data file: read_index/read-shmr-L0-02-of-24.dat
output data file: read_index/read-shmr-L2-02-of-24.dat
reduction factor= 6
using index file: read_db/reads.idx
using seqdb file: read_db/reads.seqdb
output data file: read_index/read-shmr-L0-05-of-24.dat
output data file: read_index/read-s

Now, let's sort the mapping output and collect the identifiers of the reads that have more than two hits overlapping the region on contig 287.

In [12]:
%%bash
cd /home/dnanexus/asm_map
cat map.* | sort -g -k1g -k2g --parallel 24 -T /tmp  > map.sorted
cat map.sorted | awk '$1 == 287' > map_to_287
cat map_to_287 | awk '$2 < 30459935- 1527469 && $3 > 30459935- 6541649 {print $4}' | \
sort -g  | uniq -c  | awk '$1 > 2 {print $2}' > mapped.readids
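
The selection logic in the cell above can be sketched in Python to make the overlap test and the hit threshold explicit. The column layout is an assumption inferred from the `awk` commands: column 1 is the contig ID, columns 2 and 3 are the hit start and end on the contig, and column 4 is the read ID. The function name `select_read_ids` and the sample rows are made up for illustration.

```python
from collections import Counter


def select_read_ids(map_lines, contig_id, region_start, region_end, min_hits=3):
    """Return sorted read ids with at least `min_hits` hits overlapping the region."""
    hits = Counter()
    for line in map_lines:
        f = line.split()
        if len(f) < 4:
            continue  # skip blank or truncated rows
        target, start, end, read_id = int(f[0]), int(f[1]), int(f[2]), int(f[3])
        # Two intervals overlap when each one starts before the other ends.
        if target == contig_id and start < region_end and end > region_start:
            hits[read_id] += 1
    # awk '$1 > 2' keeps reads with more than two hits, i.e. at least three.
    return sorted(r for r, n in hits.items() if n >= min_hits)


contig_len = 30459935
region = (contig_len - 6541649, contig_len - 1527469)

# Synthetic rows: read 11 has three hits in the region, the others do not qualify.
sample = [
    "287 24000000 24000300 11 0 300 0",
    "287 24000400 24000700 11 400 700 0",
    "287 24000800 24001100 11 800 1100 0",
    "287 24000000 24000300 12 0 300 0",
    "287 100 400 13 0 300 0",
    "288 24000000 24000300 14 0 300 0",
]
print(select_read_ids(sample, 287, *region))
```

Requiring several hits per read is a simple way to drop reads that share only one or two SHIMMER matches with the region by chance.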

In [13]:
%%writefile /home/dnanexus/asm_map/fetch_reads.py
import mmap
import sys
from peregrine._falcon4py import ffi
from peregrine._falcon4py import lib as falcon
from peregrine._shimmer4py import lib as shimmer
import numpy as np
from collections import OrderedDict

## No option parsing at the moment, perhaps later

read_db_prefix=sys.argv[1]
read_id_file=sys.argv[2]

f=open("{}.seqdb".format(read_db_prefix), "rb")
seqdb = mmap.mmap(f.fileno(), 0, flags=mmap.MAP_SHARED, prot=mmap.PROT_READ)

read_idx = {}
with open("{}.idx".format(read_db_prefix)) as f:
    for row in f:
        row = row.strip().split()
        rid, rname, rlen, offset = row
        rid = int(rid)
        rlen = int(rlen)
        offset = int(offset)
        read_idx.setdefault(rid, {})
        read_idx[rid]["name"] = rname
        read_idx[rid]["length"] = rlen
        read_idx[rid]["offset"] = offset

with open(read_id_file) as f:
    for row in f:
        row = row.strip().split()
        read_id = int(row[0])
        s = read_idx[read_id]["offset"]
        read_len = read_idx[read_id]["length"]
        if len(row) == 3:
            start, end = int(row[1]), int(row[2])
        else:
            start, end = 0, read_len
        read_name = read_idx[read_id]["name"]
        bseq1 = seqdb[s:s+read_len]
        read_seq = ffi.new("char[{}]".format(read_len))
        shimmer.decode_biseq(bseq1, read_seq, read_len, 0)
        # print(">{}_{}_{}".format(read_name, start, end))
        print(">{}".format(read_name))
        print(ffi.buffer(read_seq)[start:end].decode("ascii"))


Writing /home/dnanexus/asm_map/fetch_reads.py


In [14]:
%%writefile /home/dnanexus/asm_map/fetch_reads.sh
#!/bin/bash
. /opt/conda/bin/activate
conda activate peregrine
cd /home/dnanexus/asm_map
python fetch_reads.py read_db/reads mapped.readids > asm_mapped.fa

Writing /home/dnanexus/asm_map/fetch_reads.sh


In [15]:
%%bash
docker run -v $HOME:$HOME --workdir=/home/dnanexus/asm_map --entrypoint=/bin/bash cschin/peregrine:0.1.5.2 fetch_reads.sh

Peregrine Assembler & SHIMMER ASMKit(0.1.5.2+0.g7091609.dirty)


In [16]:
%%bash
cd /home/dnanexus/asm_map
dx upload asm_mapped.fa --path /phased_reads_2/

ID                  file-FZ2xf280Xv3GVZkP25B9XX0P
Class               file
Project             project-FXGVFb00Xv39f6K624b11bZ1
Folder              /phased_reads_2
Name                asm_mapped.fa
State               closing
Visibility          visible
Types               -
Properties          -
Tags                -
Outgoing links      -
Created             Sat Jun 15 23:00:41 2019
Created by          jchin_dx
 via the job        job-FZ2x37Q0Xv38gvPk8Pgx0v73
Last modified       Sat Jun 15 23:00:47 2019
Media type          
archivalState       "live"
