We need to get from a mapped BAM file to single fast5s:

Input is:
* Genome sequence Pgt21-0
* Fastq files to map with minimap2. These were combined with "find * -type f -name '*.gz'| xargs -I X cat X >> 20210406.PgtInfectedLeave.fastq.gz"
* Folder with the multi-read fast5 files of the fastqs mapped to the reference. These were combined with "find * -type f -name '*.fast5' | xargs -I X mv X $dir/." on the command line

Output:
* Folder with single fast5s of the mapped reads

What we need to do:
* map all reads
* get the IDs of all mapped reads
* extract them in batches with fast5_subset

In [1]:
import os

### INPUTS

### *****Define directories before running on different samples*****

In [2]:
###define input directories here 
RefFn = '../../../resources/genomic_resources/chr_A_B_unassigned.fasta'
FastqFn = '../../../analysis/infectedLeaves/20210406.PgtInfectedLeave.fastq.gz'
BamFn = '../../../analysis/infectedLeaves/20210406.Pgt210.PgtInfectedLeave.sorted.bam'
InFast5Dir = '/media/ssd-01/ben/projects/Pgt210Methylation/rawData/infectedLeave/fast5/combined'
OutDir = '../../../analysis/infectedLeaves/subsetFast5'
OutDirGuppy = '../../../analysis/infectedLeaves/subsetFast5/basecalled'

### *****Define directories before running on different samples*****

In [3]:
RefFn = os.path.abspath(RefFn)
FastqFn = os.path.abspath(FastqFn)
BamFn = os.path.abspath(BamFn)
InFast5Dir = os.path.abspath(InFast5Dir)
OutDir = os.path.abspath(OutDir)
OutDirGuppy = os.path.abspath(OutDirGuppy)
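
Since the paths above have to be redefined for every sample, a quick existence check before the long-running steps can catch a typo early. The following is a minimal sketch; `check_inputs` is a hypothetical helper and not part of the original notebook. It only looks at inputs and does not create any output directory.

```python
import os


def check_inputs(paths):
    """Return the given paths unchanged, or raise if any of them is missing.

    Hypothetical helper for illustration only.
    """
    missing = [p for p in paths if not os.path.exists(p)]
    if missing:
        raise FileNotFoundError('Missing input(s): ' + ', '.join(missing))
    return list(paths)


# Possible use in this notebook (inputs only; BamFn is produced later):
# check_inputs([RefFn, FastqFn, InFast5Dir])
```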

In [4]:
threads = 8

In [5]:
##map all the reads with minimap2 and sort the alignments; samtools sort gets the remainder of a 12-thread budget
!minimap2 -2 --sam-hit-only --secondary=no -t {threads} -ax map-ont {RefFn} {FastqFn} | samtools sort -@ {12-threads} -o {BamFn} -O bam -
!samtools index {BamFn}

[M::mm_idx_gen::3.005*1.43] collected minimizers
[M::mm_idx_gen::3.371*2.14] sorted minimizers
[M::main::3.371*2.14] loaded/built the index for 207 target sequence(s)
[M::mm_mapopt_update::3.637*2.05] mid_occ = 233
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 207
[M::mm_idx_stat::3.806*2.01] distinct minimizers: 13929385 (49.04% are singletons); average occurrences: 2.379; average spacing: 5.333
[M::worker_pipeline::19.721*4.95] mapped 81780 sequences
[M::worker_pipeline::28.730*6.21] mapped 82113 sequences
[M::worker_pipeline::37.558*6.86] mapped 82859 sequences
[M::worker_pipeline::46.450*7.26] mapped 81042 sequences
[M::worker_pipeline::55.885*7.49] mapped 80374 sequences
[M::worker_pipeline::65.461*7.70] mapped 81153 sequences
[M::worker_pipeline::74.617*7.82] mapped 82304 sequences
[M::worker_pipeline::83.699*7.92] mapped 82690 sequences
[M::worker_pipeline::92.850*8.01] mapped 80160 sequences
[M::worker_pipeline::102.979*8.08] mapped 79022 sequences
[M::worker_pipel

In [6]:
##generate the read ID list
MappedIdsFn = BamFn.replace('.bam', '.mappedids.txt')
SamtoolsStatsFn = BamFn.replace('.bam', '.stats.txt')
!samtools stats -@ {threads} {BamFn} > {SamtoolsStatsFn}
!samtools view -F 4 {BamFn} | cut -f 1 | sort -u > {MappedIdsFn}
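
For reference, the ID extraction above can be mirrored in plain Python. This is only a sketch of the same logic (drop records with the unmapped flag 0x4, keep the first column, deduplicate), operating on SAM text lines. `mapped_read_ids` and the example records are made up for illustration.

```python
def mapped_read_ids(sam_lines):
    """Return the sorted, unique names of mapped reads from SAM text lines."""
    ids = set()
    for line in sam_lines:
        # Skip empty lines and header lines, which start with '@'.
        if not line.strip() or line.startswith('@'):
            continue
        fields = line.rstrip('\n').split('\t')
        flag = int(fields[1])
        # Bit 0x4 marks an unmapped read; this matches `samtools view -F 4`.
        if not flag & 4:
            ids.add(fields[0])
    return sorted(ids)


# Made-up records: read-b has a primary and a supplementary alignment,
# read-c is unmapped.
example = [
    '@HD\tVN:1.6\tSO:coordinate',
    'read-b\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\t*',
    'read-a\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\t*',
    'read-b\t2048\tchr2\t50\t60\t4M\t*\t0\t0\tACGT\t*',
    'read-c\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*',
]
print(mapped_read_ids(example))  # ['read-a', 'read-b']
```

Deduplication matters because a read with supplementary alignments appears on more than one line of the BAM file.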

In [7]:
!wc {MappedIdsFn}

 1632867  1632867 60416079 /media/nvme-02/benWorking/projects/Pgt210Methylation/analysis/infectedLeaves/20210406.Pgt210.PgtInfectedLeave.sorted.mappedids.txt


In [8]:
#subset the reads
!fast5_subset -i {InFast5Dir} -s {OutDir} -l {MappedIdsFn} -t {threads + 4}

DEBUG:h5py._conv:Creating converter from 5 to 3             |  0% ETA:  --:--:--
DEBUG:h5py._conv:Creating converter from 5 to 3             |  0% ETA:  22:18:31
DEBUG:h5py._conv:Creating converter from 5 to 3              |  0% ETA:  5:48:15
DEBUG:h5py._conv:Creating converter from 5 to 3              |  0% ETA:  4:48:02
\ 1035790 of 1641832|#########################               | 63% ETA:  0:55:30



In [9]:
#basecall the mapped reads only
!guppy_basecaller -i {OutDir} -s {OutDirGuppy} --fast5_out -r --compress_fastq --device auto -c dna_r9.4.1_450bps_hac.cfg

ONT Guppy basecalling software version 4.4.1+1c81d62
config file:        /opt/ont-guppy/data/dna_r9.4.1_450bps_hac.cfg
model file:         /opt/ont-guppy/data/template_r9.4.1_450bps_hac.jsn
input path:         /media/nvme-02/benWorking/projects/Pgt210Methylation/analysis/infectedLeaves/subsetFast5
save path:          /media/nvme-02/benWorking/projects/Pgt210Methylation/analysis/infectedLeaves/subsetFast5/basecalled
chunk size:         2000
chunks per runner:  512
records per file:   4000
fastq compression:  ON
num basecallers:    4
gpu device:         auto
kernel path:        
runners per device: 4
Found 409 fast5 files to process.
Init time: 1866 ms

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Caller time: 14758752 ms, Samples called: 99526016034, samples/s: 6.74353e+06
Finishing up any open output files.
Basecalling completed successfully.
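
As a consistency check, the 409 fast5 files found by guppy can be compared with the 1632867 mapped IDs counted by `wc`. The sketch below assumes that fast5_subset wrote 4000 reads per output file. That batch size was not set explicitly in the call above, so treat it as an assumption. `expected_file_count` is a hypothetical helper.

```python
import math


def expected_file_count(n_reads, reads_per_file=4000):
    """Number of output files needed to hold n_reads at reads_per_file each."""
    return math.ceil(n_reads / reads_per_file)


# 1632867 mapped read IDs at an assumed 4000 reads per file.
print(expected_file_count(1632867))  # 409
```

Under that assumption, the file count guppy reports is what one would expect if every mapped ID had been extracted.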


