# Simulated reads from EC958
First, convert the GenBank files to FASTA.

In [39]:
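from os import listdir, path
from textwrap import wrap

from Bio import SeqIO

# Convert every GenBank file in ../ori to a FASTA file of the same name in ../seq.
for genbank in [path.join('../ori', x) for x in listdir('../ori') if x.endswith('.gbk')]:
    fasta = path.join('../seq', path.basename(genbank).replace('.gbk', '.fasta'))
    # Context managers close both files, even if parsing fails part-way through.
    with open(genbank) as input_handle, open(fasta, 'w') as output_handle:
        for seq_record in SeqIO.parse(input_handle, "genbank"):
            print("Dealing with GenBank record %s" % seq_record.id)
            # wrap() splits the sequence into lines of at most 70 characters (its default width).
            clean_seq = "\n".join(wrap(str(seq_record.seq)))
            output_handle.write(f">{seq_record.id} {seq_record.description}\n{clean_seq}\n")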


Dealing with GenBank record HG941720.1
Dealing with GenBank record HG941718.1
Dealing with GenBank record HG941719.1
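
As a quick sanity check on the conversion, the sequence length of each FASTA record can be tallied and compared with the length reported in the GenBank file. The sketch below uses only the standard library; the helper name `fasta_lengths` is hypothetical, and the `../seq` location is assumed from the cell above.

```python
from pathlib import Path


def fasta_lengths(fasta_path):
    """Return {record_id: sequence_length} for every record in a FASTA file."""
    lengths = {}
    current_id = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # The record ID is the first whitespace-delimited token of the header.
                tokens = line[1:].split()
                current_id = tokens[0] if tokens else ""
                lengths[current_id] = 0
            elif current_id is not None:
                lengths[current_id] += len(line)
    return lengths


# Assumed location of the converted files; skipped silently if it does not exist.
seq_dir = Path("../seq")
if seq_dir.is_dir():
    for fasta in sorted(seq_dir.glob("*.fasta")):
        print(fasta.name, fasta_lengths(fasta))
```

Each file here should contain a single record, so a dictionary with more than one key would suggest something went wrong upstream.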


## Create simulated short reads (Illumina)
We will use `art_illumina` to simulate Illumina short reads.

We can install it into our conda env as follows: 
```
conda activate mge2022 
mamba install -y -c bioconda art
```

We need to give it the input sequence (as a FASTA file) and choose options such as the read length (`-l`), the fold coverage (`-f`), the mean fragment size (`-m`), and the standard deviation of the fragment size (`-s`).
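
To get a feel for how these options interact, the arithmetic can be sketched in a few lines. This is a rough back-of-the-envelope estimate, not a description of ART's internal calculation; the function names and the 5,000,000 bp genome length are illustrative assumptions.

```python
def expected_read_pairs(genome_length, read_length=150, fold_coverage=40):
    """Approximate number of read pairs needed to reach the requested coverage."""
    # Each pair contributes two reads of read_length bases.
    return round(fold_coverage * genome_length / (2 * read_length))


def mate_overlap(read_length=150, mean_fragment=200):
    """Bases shared by the two mates of a pair, for a fragment of mean size."""
    # Mates overlap whenever the fragment is shorter than two read lengths.
    return max(0, 2 * read_length - mean_fragment)


# Hypothetical genome length, chosen only for illustration.
print(expected_read_pairs(5_000_000))
print(mate_overlap())
```

With 150 bp reads and a 200 bp mean fragment, the two mates of a typical pair overlap by about 100 bases, which is worth keeping in mind when interpreting downstream coverage.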


In [36]:
!ls ../seq
!art_illumina -ss HS25  -i ../seq/EC958-HG941718.fasta -p -l 150 -f 40 -m 200 -s 10 -o EC958-HG941718 --rndSeed 200 -na 
!art_illumina -ss HS25  -i ../seq/pEC958-HG941719.fasta -p -l 150 -f 40 -m 200 -s 10 -o pEC958-HG941719 --rndSeed 200  -na 
!art_illumina -ss HS25  -i ../seq/pEC958B-HG941720.fasta -p -l 150 -f 40 -m 200 -s 10 -o pEC958B-HG941720 --rndSeed 200  -na 


EC958-HG941718.fasta   pEC958-HG9417191.fq   pEC958B-HG941720.fasta
pEC958-HG941719.fasta  pEC958-HG9417192.aln
pEC958-HG9417191.aln   pEC958-HG9417192.fq

             ART_Illumina (2008-2016)          
          Q Version 2.5.8 (June 6, 2016)       
     Contact: Weichun Huang <whduke@gmail.com> 
    -------------------------------------------

                  Paired-end sequencing simulation

Total CPU time used: 47.7922

The random seed for the run: 200

Parameters used during run
	Read Length:	150
	Genome masking 'N' cutoff frequency: 	1 in 150
	Fold Coverage:            40X
	Mean Fragment Length:     200
	Standard Deviation:       10
	Profile Type:             Combined
	ID Tag:                   

Quality Profile(s)
	First Read:   HiSeq 2500 Length 150 R1 (built-in profile) 
	First Read:   HiSeq 2500 Length 150 R2 (built-in profile) 

Output files

  FASTQ Sequence Files:
	 the 1st reads: EC958-HG9417181.fq
	 the 2nd reads: EC958-HG9417182.fq


             ART_Illumina (2008-2

## Tidy up and merge
ART produces separate read sets for each reference, which we now merge and compress into two FASTQ files (one for the first read of each pair and one for the second).

In [37]:
!gzip -c EC958-HG9417181.fq pEC958-HG9417191.fq  pEC958B-HG9417201.fq  > ../seq/MGE-2022_1.fastq.gz 
!gzip -c EC958-HG9417182.fq pEC958-HG9417192.fq  pEC958B-HG9417202.fq  > ../seq/MGE-2022_2.fastq.gz 
!rm *.fq 
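
After merging, it may be worth confirming that the two files still hold the same number of reads, since downstream tools generally expect pairs to stay in step. The sketch below assumes the four-line FASTQ layout and the output paths used in the cell above; `count_fastq_records` is a hypothetical helper name. Python's `gzip` module reads files made of several concatenated gzip members, which is what `gzip -c a b c > out.gz` produces.

```python
import gzip
from pathlib import Path


def count_fastq_records(path):
    """Count records in a gzip-compressed FASTQ file with four lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4


# Paths assumed from the merge step; the check is skipped if they are missing.
r1 = Path("../seq/MGE-2022_1.fastq.gz")
r2 = Path("../seq/MGE-2022_2.fastq.gz")
if r1.exists() and r2.exists():
    n1, n2 = count_fastq_records(r1), count_fastq_records(r2)
    print(f"R1: {n1} reads, R2: {n2} reads, counts match: {n1 == n2}")
```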

## Create simulated reads (nanopore)
We will use Badread to simulate reads. We will install it in its own conda env and run it from there. Simulating reads for the chromosome (`EC958-HG941718.fasta`) can take some time.

```
mamba create -y -n badread badread
conda activate badread
badread  simulate --reference ../seq/pEC958-HG941719.fasta   --quantity 20x | gzip > MGE-2022_ONT.fastq.gz
badread  simulate --reference ../seq/pEC958B-HG941720.fasta   --quantity 20x | gzip >> MGE-2022_ONT.fastq.gz
badread  simulate --reference ../seq/EC958-HG941718.fasta   --quantity 20x | gzip >> MGE-2022_ONT.fastq.gz
```
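
Long-read sets are usually summarised by read count, total bases, and N50. A minimal sketch of such a summary follows, assuming the simulated reads are in four-line FASTQ format and that the output file is named as in the commands above; `read_lengths` and `n50` are hypothetical helper names.

```python
import gzip
from pathlib import Path


def read_lengths(fastq_gz):
    """Return the length of every read in a gzip-compressed four-line FASTQ file."""
    lengths = []
    with gzip.open(fastq_gz, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the sequence is the second line of each record
                lengths.append(len(line.rstrip("\n")))
    return lengths


def n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no reads


# File name assumed from the badread commands; skipped if it is missing.
ont = Path("MGE-2022_ONT.fastq.gz")
if ont.exists():
    lengths = read_lengths(ont)
    print(f"{len(lengths)} reads, {sum(lengths)} bases, N50 {n50(lengths)}")
```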


By the end of the process, the `seq` directory contains:

* A FASTA file for each of the original reference sequences
* Simulated short reads for all reference sequences, combined into one file of first reads
* Simulated short reads for all reference sequences, combined into one file of second reads
* Simulated ONT reads for all reference sequences, combined into one file


In [50]:
!ls  -ahl ../seq

total 196M
drwxrwxr-x 2 ubuntu ubuntu 4.0K Jun 16 11:02 .
drwxrwxr-x 8 ubuntu ubuntu 4.0K Jun 16 10:43 ..
-rw-rw-r-- 1 ubuntu ubuntu 5.0M Jun 16 10:28 EC958-HG941718.fasta
-rw-rw-r-- 1 ubuntu ubuntu  66M Jun 16 10:25 MGE-2022_1.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu  67M Jun 16 10:25 MGE-2022_2.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu  56M Jun 16 11:18 MGE-2022_ONT.fastq.gz
-rw-rw-r-- 1 ubuntu ubuntu 135K Jun 16 10:28 pEC958-HG941719.fasta
-rw-rw-r-- 1 ubuntu ubuntu 4.2K Jun 16 10:28 pEC958B-HG941720.fasta
-rw-rw-r-- 1 ubuntu ubuntu 2.8M Jun 16 10:59 reads.fastq.gz
