In [9]:
from IPython.display import Markdown, display
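# Render the commands as one '&&'-chained bash block; this only displays them, nothing is executed.
bash = lambda commands: display(Markdown("```bash\n" + ' && \n'.join(commands) + "\n```"))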

We start by converting the raw ASCII text of each book into a nucleotide sequence stored as a `.fasta` file. For this step we use Jasmine's script:

In [8]:
import encode as ec

data_src_dir = 'raw'
books = {
  "atotc": "a_tale_of_two_cities",
  "frank": "frankenstein",
  "je": "jane_eyre"
}

for book, title in books.items():
  ec.ascii_to_binary(f'{data_src_dir}/{book}.txt', f'{data_src_dir}/{book}_bin.txt')
  ec.binary_to_nt(f'{data_src_dir}/{book}_bin.txt', f'{data_src_dir}/{book}_nuc.txt')

  with open(f'{data_src_dir}/{book}_nuc.txt', "r") as infile:
    sequence = "".join(line.strip() for line in infile if line.strip())

  with open(f'{book}.fasta', "w") as outfile:
    # No space after '>': many FASTA parsers take the record name from the text right after it.
    outfile.write(f">{title}\n")
    for i in range(0, len(sequence), 60):
      outfile.write(sequence[i:i+60] + "\n")
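
The internals of the `encode` module are not shown here, so the following is only a minimal sketch of what an ASCII to binary to nucleotide conversion could look like. The 2-bit mapping (`00` to `A`, `01` to `C`, `10` to `G`, `11` to `T`) and the function names `text_to_bits`, `bits_to_nt`, and `to_fasta` are assumptions made for illustration; Jasmine's script may use a different scheme.

```python
# Hypothetical 2-bit mapping; the real `encode` module may use a different one.
BITS_TO_NT = {"00": "A", "01": "C", "10": "G", "11": "T"}


def text_to_bits(text: str) -> str:
    """Encode ASCII text as a string of '0'/'1' characters, 8 bits per character."""
    return "".join(f"{byte:08b}" for byte in text.encode("ascii"))


def bits_to_nt(bits: str) -> str:
    """Translate each pair of bits into one nucleotide."""
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    return "".join(BITS_TO_NT[bits[i:i + 2]] for i in range(0, len(bits), 2))


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record wrapped at `width` columns."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


print(to_fasta("demo", bits_to_nt(text_to_bits("Hi"))))
```

Under this mapping every input character becomes exactly four bases, so the sequence length is four times the number of characters in the text. Note that `text.encode("ascii")` raises an error on non-ASCII characters, so a real input file may need cleaning first.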

We evaluate models by running the basecalling pipeline on A Tale of Two Cities (ATOTC). We simulate the raw signal with squigulator and convert the resulting SLOW5 file to POD5 with blue-crab:

In [10]:
bash(["squigulator atotc.fasta -x dna-r10-prom -o atotc.slow5 --full-contigs --ont-friendly yes --seed 42",
      "blue-crab s2p atotc.slow5 -o atotc.pod5",
      "rm -f atotc.slow5"])

```bash
squigulator atotc.fasta -x dna-r10-prom -o atotc.slow5 --full-contigs --ont-friendly yes --seed 42 && 
blue-crab s2p atotc.slow5 -o atotc.pod5 && 
rm -f atotc.slow5
```
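
The `bash` helper only renders the command chain; it does not execute anything. If the same chain should run from inside the notebook, one option is a small wrapper around `subprocess`. This is a sketch, not the way the commands were run here: `run_bash` is a hypothetical helper, and it assumes a `bash` executable is available on the system. The demo uses `echo` so it runs without squigulator or blue-crab installed.

```python
import subprocess


def run_bash(commands):
    """Run the commands as one '&&'-chained bash line.

    The chain stops at the first failing command, and check=True turns
    that failure into a subprocess.CalledProcessError.
    """
    script = " && ".join(commands)
    return subprocess.run(
        ["bash", "-c", script],
        check=True,
        capture_output=True,
        text=True,
    )


# Harmless stand-ins for the real simulate and convert commands.
result = run_bash(["echo simulate", "echo convert"])
print(result.stdout)
```

Passing the same list that goes to `bash([...])` keeps the displayed and executed commands identical.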

We also want to evaluate performance on short sequences, so we simulate 100 reads (`-n 100`) with a read length of around 300 bases (`-r 300`):

In [11]:
bash(["squigulator atotc.fasta -x dna-r10-prom -o atotc.slow5 -r 300 -n 100 --ont-friendly yes --seed 1729",
      "blue-crab s2p atotc.slow5 -o atotc_short.pod5",
      "rm -f atotc.slow5"])

```bash
squigulator atotc.fasta -x dna-r10-prom -o atotc.slow5 -r 300 -n 100 --ont-friendly yes --seed 1729 && 
blue-crab s2p atotc.slow5 -o atotc_short.pod5 && 
rm -f atotc.slow5
```
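
To get a feel for how much of the reference the short-read set samples, the expected depth can be estimated as total simulated bases divided by reference length. The sketch below assumes `-n` is the number of reads and `-r` the mean read length; `fasta_length` and `expected_coverage` are hypothetical helper names, and the reference length in the demo is a placeholder rather than the real length of `atotc.fasta`.

```python
def fasta_length(path):
    """Total number of bases in a FASTA file, ignoring header lines."""
    with open(path) as handle:
        return sum(len(line.strip()) for line in handle if not line.startswith(">"))


def expected_coverage(n_reads, mean_read_len, ref_len):
    """Approximate mean depth: total simulated bases divided by reference length."""
    return n_reads * mean_read_len / ref_len


# Placeholder reference length of 30,000 bases, for illustration only.
print(expected_coverage(100, 300, 30_000))
```

For the real data, replace the placeholder with `fasta_length("atotc.fasta")`.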