## `00_fetchreads`: This notebook shows how we fetch reads that may belong to the MHC region of the HG002 genome

We use the command `shmr_map` from the Peregrine Assembler Suite to recruit the CCS reads used for phasing and assembling the HG002 MHC region. The `shmr_map` tool uses SHIMMER (https://www.biorxiv.org/content/10.1101/705616v1) to map the reads to the GRCh37 MHC and to the (unphased) HG002 MHC region taken from an existing assembly (s3://human-pangenomics/HPRC/HG002_Assessment/assemblies/JC_20k_15k_asm/asm.fa.gz, contig 000028F:1700756-6708745). The two MHC sequences are combined in the file `MHC_all.fa` for recruiting the reads.

In [None]:
%%bash
dx download -f /jc_notebook/MHC_37.fa

In [None]:
%%capture
%%bash
conda install -y matplotlib bokeh

In [None]:
%%bash
dx download /20200316_asm_for_revision/data/MHC_37.fa
dx download /20200316_asm_for_revision/data/MHC_all.fa

In [None]:
!cat MHC_37.fa MHC_asm.fa >  /home/dnanexus/MHC_all.fa
!ls -l  /home/dnanexus/MHC_all.fa

MHC on GRCh37: chr6:28,477,797-33,448,354
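
As a quick sanity check on the two reference regions, the sketch below computes their lengths from the coordinate strings quoted in this notebook. The helper name `region_length` is ours, and it assumes the coordinates are 1-based and inclusive, which the notebook does not state explicitly.

```python
def region_length(region: str) -> int:
    """Length of a 'name:start-end' region.

    Assumption: coordinates are 1-based and inclusive.
    Thousands separators (commas) in the coordinates are ignored.
    """
    _, span = region.rsplit(":", 1)
    start, end = (int(x.replace(",", "")) for x in span.split("-"))
    return end - start + 1


# Coordinate strings as quoted in this notebook
grch37_mhc = "chr6:28,477,797-33,448,354"
hg002_mhc = "000028F:1700756-6708745"

for region in (grch37_mhc, hg002_mhc):
    print(region, region_length(region))
```

Under that assumption both regions are close to 5 Mb (4,970,558 bp and 5,007,990 bp), so the two sequences in `MHC_all.fa` should be of comparable size.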

In [None]:
%%writefile /home/dnanexus/build_read_index_and_map.sh
cd /home/dnanexus/

echo  MHC_all.fa > ref.lst
shmr_mkseqdb -d ref.lst -p ref
shmr_index -p ref -o ref-shmr -r 3

find /home/dnanexus/hg002_reads -name "*.fastq" > reads.lst
shmr_mkseqdb -d reads.lst -p reads 

mkdir -p read_index
for i in `seq 1 24`; do echo shmr_index -p reads -o read-shmr -r 3 -t 24 -c $i; done | \
parallel -j 12
for i in `seq 1 24`; do echo "shmr_map -r ref -m ./ref-shmr-L2 -p reads -l read-shmr-L2 -t 24 -c $i > map.$i"; done | \
parallel -j 6

In [None]:
%%bash
cd /home/dnanexus/
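# Merge the per-chunk mapping output and sort it numerically on the first two columns
cat map.* | sort -g -k1g -k2g --parallel 24 -T /tmp  > map.sorted
# Keep the read IDs (column 4) that appear in more than two mapping records
awk '{print $4}' map.sorted | \
sort -g  | uniq -c  | awk '$1 > 2 {print $2}' > mapped.readids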
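
The filtering step above can also be sketched in Python, which may be easier to adapt than the `awk`/`uniq` pipeline. This is a minimal sketch, not part of the original workflow: the function name `reads_with_min_hits` and the example records are made up for illustration, and the only thing taken from the pipeline is that the read ID is the fourth whitespace-separated column and that a read is kept when it has more than two records.

```python
from collections import Counter


def reads_with_min_hits(lines, min_hits=3, column=3):
    """Return sorted read IDs that occur in at least `min_hits` records.

    `column` is the 0-based index of the read ID field (the 4th column,
    matching `awk '{print $4}'`). `min_hits=3` mirrors `awk '$1 > 2'`.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) > column:  # skip blank or truncated lines
            counts[int(fields[column])] += 1
    return sorted(rid for rid, n in counts.items() if n >= min_hits)


# Made-up records: only the 4th column (read ID) matters here
example = ["0 10 20 7", "0 11 21 7", "0 12 22 7", "0 13 23 9", "0 14 24 9"]
print(reads_with_min_hits(example))
```

On the real data the same function could be applied to the open `map.sorted` file handle instead of the example list.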

In [None]:
!wc /home/dnanexus/mapped.readids

In [None]:
# Load the recruited read IDs (one integer per line) into a set
all_readids = {int(rid) for rid in open("/home/dnanexus/mapped.readids").read().split()}

In [None]:
from peregrine.utils import SequenceDatabase  # Peregrine Python package (module path assumed)
read_sdb = SequenceDatabase("/home/dnanexus/reads.idx", "/home/dnanexus/reads.seqdb")

In [None]:
with open("/home/dnanexus/mapped_reads.fa", "w") as f:
    for rid in all_readids:
        seq_name = read_sdb.index_data[rid].rname
        seq = read_sdb.get_subseq_by_rid(rid)
        print(f">{seq_name}", file=f)
        print(seq.decode(), file=f)

In [None]:
!grep -c ">" /home/dnanexus/mapped_reads.fa
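
The `grep -c` count above should equal the number of recruited read IDs. A minimal Python version of that check is sketched below; `count_fasta_records` is a hypothetical helper, and the demo runs on a small temporary file rather than on the real output. Unlike `grep -c ">"`, it only counts lines that start with `>`.

```python
import os
import tempfile


def count_fasta_records(path):
    """Count FASTA records by counting header lines (lines starting with '>')."""
    with open(path) as f:
        return sum(1 for line in f if line.startswith(">"))


# Demo on a throwaway two-record FASTA file
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">read_a\nACGT\n>read_b\nGGCC\n")

print(count_fasta_records(tmp.name))
os.remove(tmp.name)
```

In this notebook the check would be `count_fasta_records("/home/dnanexus/mapped_reads.fa") == len(all_readids)`.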