# DNA Handling Examples

This notebook gives a brief overview of the utilities in `common.dna` and `common.fastq` for working with DNA sequences and FASTQ files.

In [1]:
import bootstrap

In [15]:
import gzip
from common import dna, fastq

---
## Load a FASTQ File

In [12]:
path = "/home/shared/prism-data/Nachusa Sequences/nachusa-2020-soil16S-sequences/Wesley001-WH-051220_S140_L001_R1_001.fastq.gz"

In [16]:
with gzip.open(path) as f:
    sample = fastq.read(f)

In [18]:
len(sample)

17164

In [17]:
sample[0]

FastqEntry:
  @MN01227:252:000H3FWK5:1:11101:5434:1073 1:N:0:TGCTACATCA
  GTGCCAGCAGCAGCGGTAATACGGGGGGAGCAAGCGTTGTTCGGATTTACTGGGCGTAAAGGGCGCGTAGGCGGTCAGCACAAGTCAGTTGTGAAATCTCCGAGCTNAACTCGGAANGGTCAACTGAAACTGTGCGACTAGAGTGCGGAAGGG
  +
  FFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFF/FF/FAAFFFA/A/FFFFAFFFFFFFF#/FFFFAFFF#FF/F//FFFAAAAFFFFFF/AFFFAFFFFAF/FFFA

---
## Sequence Encoding/Decoding

The following cells encode a DNA sequence as a vector of integers and decode it back to a string. Each base maps to one integer code:

- A = 0
- C = 1
- G = 2
- T = 3
- N = 4
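
The mapping above can be sketched in plain Python. This is a minimal illustration, not the implementation in `common.dna`: the names `BASES`, `encode_sketch`, and `decode_sketch` are hypothetical, and the sketch returns lists where `dna.encode_sequence` returns a NumPy array.

```python
# Hypothetical stand-ins for dna.encode_sequence / dna.decode_sequence.
# The position of each base in this string is its integer code:
# A=0, C=1, G=2, T=3, N=4.
BASES = "ACGTN"
BASE_TO_CODE = {base: code for code, base in enumerate(BASES)}


def encode_sketch(sequence):
    """Map each base character to its integer code."""
    return [BASE_TO_CODE[base] for base in sequence]


def decode_sketch(codes):
    """Map integer codes back to base characters."""
    return "".join(BASES[code] for code in codes)


codes = encode_sketch("GTGCCAN")
print(codes)                 # [2, 3, 2, 1, 1, 0, 4]
print(decode_sketch(codes))  # GTGCCAN
```

A character outside `ACGTN` raises a `KeyError` in this sketch; how the library handles such input is not shown in this notebook.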

In [19]:
sample[0].sequence

'GTGCCAGCAGCAGCGGTAATACGGGGGGAGCAAGCGTTGTTCGGATTTACTGGGCGTAAAGGGCGCGTAGGCGGTCAGCACAAGTCAGTTGTGAAATCTCCGAGCTNAACTCGGAANGGTCAACTGAAACTGTGCGACTAGAGTGCGGAAGGG'

In [31]:
len(sample[0].sequence)

153

In [21]:
encoded = dna.encode_sequence(sample[0].sequence)
encoded

array([2, 3, 2, 1, 1, 0, 2, 1, 0, 2, 1, 0, 2, 1, 2, 2, 3, 0, 0, 3, 0, 1,
       2, 2, 2, 2, 2, 2, 0, 2, 1, 0, 0, 2, 1, 2, 3, 3, 2, 3, 3, 1, 2, 2,
       0, 3, 3, 3, 0, 1, 3, 2, 2, 2, 1, 2, 3, 0, 0, 0, 2, 2, 2, 1, 2, 1,
       2, 3, 0, 2, 2, 1, 2, 2, 3, 1, 0, 2, 1, 0, 1, 0, 0, 2, 3, 1, 0, 2,
       3, 3, 2, 3, 2, 0, 0, 0, 3, 1, 3, 1, 1, 2, 0, 2, 1, 3, 4, 0, 0, 1,
       3, 1, 2, 2, 0, 0, 4, 2, 2, 3, 1, 0, 0, 1, 3, 2, 0, 0, 0, 1, 3, 2,
       3, 2, 1, 2, 0, 1, 3, 0, 2, 0, 2, 3, 2, 1, 2, 2, 0, 0, 2, 2, 2])

In [22]:
dna.decode_sequence(encoded)

'GTGCCAGCAGCAGCGGTAATACGGGGGGAGCAAGCGTTGTTCGGATTTACTGGGCGTAAAGGGCGCGTAGGCGGTCAGCACAAGTCAGTTGTGAAATCTCCGAGCTNAACTCGGAANGGTCAACTGAAACTGTGCGACTAGAGTGCGGAAGGG'

---
## k-mer Sequence Encoding/Decoding

A k-mer is a run of k consecutive bases taken from a sliding window over the sequence. 3-mer example:
`to_kmers('ACTCG') = ['ACT', 'CTC', 'TCG']`
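
A sequence of length $n$ yields $n - k + 1$ k-mers, which is why the 153-base read in this notebook produces 151 3-mers. The sketch below assumes each k-mer is packed into a single integer by reading its base codes as a base-5 number (five symbols: A, C, G, T, N). That scheme is consistent with the `dna.encode_kmers` output in this notebook (GTG has codes 2, 3, 2, and $2 \cdot 25 + 3 \cdot 5 + 2 = 67$), but it is inferred from that output rather than taken from the library source. All function names here are hypothetical, and the sketch uses plain lists instead of NumPy arrays.

```python
def to_kmers(sequence, k):
    """Return every window of k consecutive characters."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def encode_kmers_sketch(codes, k, alphabet_size=5):
    """Pack each window of k base codes into one integer (assumed base-5)."""
    kmers = []
    for i in range(len(codes) - k + 1):
        value = 0
        for code in codes[i:i + k]:
            value = value * alphabet_size + code
        kmers.append(value)
    return kmers


def decode_kmers_sketch(kmers, k, alphabet_size=5):
    """Recover base codes from packed k-mers.

    Takes the leading digit of every k-mer, then appends the
    remaining k - 1 digits of the final k-mer.
    """
    if not kmers:
        return []
    high = alphabet_size ** (k - 1)
    codes = [value // high for value in kmers]
    last = kmers[-1]
    tail = []
    for _ in range(k - 1):
        tail.append(last % alphabet_size)
        last //= alphabet_size
    codes.extend(reversed(tail))
    return codes


print(to_kmers("ACTCG", 3))                      # ['ACT', 'CTC', 'TCG']
print(encode_kmers_sketch([2, 3, 2, 1, 1], 3))   # [67, 86, 56]
print(decode_kmers_sketch([67, 86, 56], 3))      # [2, 3, 2, 1, 1]
```

Note that decoding returns integer base codes, not a string, matching what `dna.decode_kmers` returns in the cells of this section.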

In [32]:
kmer_sequence = dna.encode_kmers(encoded, kmer=3)
kmer_sequence

array([ 67,  86,  56,  30,  27,  11,  55,  27,  11,  55,  27,  11,  57,
        37,  63,  65,  75,   3,  15,  76,   7,  37,  62,  62,  62,  62,
        60,  52,  11,  55,  25,   2,  11,  57,  38,  68,  92,  88,  68,
        91,  82,  37,  60,  53,  18,  93,  90,  76,   8,  42,  87,  62,
        61,  57,  38,  65,  75,   0,   2,  12,  62,  61,  57,  36,  57,
        38,  65,  77,  12,  61,  57,  37,  63,  66,  80,  27,  11,  55,
        26,   5,  25,   2,  13,  66,  80,  27,  13,  68,  92,  88,  67,
        85,  50,   0,   3,  16,  83,  41,  81,  32,  35,  52,  11,  58,
        44,  95, 100,   1,   8,  41,  82,  37,  60,  50,   4,  22, 112,
        63,  66,  80,  25,   1,   8,  42,  85,  50,   0,   1,   8,  42,
        88,  67,  86,  57,  35,  51,   8,  40,  77,  10,  52,  13,  67,
        86,  57,  37,  60,  50,   2,  12,  62])

In [30]:
len(kmer_sequence)

151

In [34]:
dna.decode_kmers(kmer_sequence, kmer=3)

array([2, 3, 2, 1, 1, 0, 2, 1, 0, 2, 1, 0, 2, 1, 2, 2, 3, 0, 0, 3, 0, 1,
       2, 2, 2, 2, 2, 2, 0, 2, 1, 0, 0, 2, 1, 2, 3, 3, 2, 3, 3, 1, 2, 2,
       0, 3, 3, 3, 0, 1, 3, 2, 2, 2, 1, 2, 3, 0, 0, 0, 2, 2, 2, 1, 2, 1,
       2, 3, 0, 2, 2, 1, 2, 2, 3, 1, 0, 2, 1, 0, 1, 0, 0, 2, 3, 1, 0, 2,
       3, 3, 2, 3, 2, 0, 0, 0, 3, 1, 3, 1, 1, 2, 0, 2, 1, 3, 4, 0, 0, 1,
       3, 1, 2, 2, 0, 0, 4, 2, 2, 3, 1, 0, 0, 1, 3, 2, 0, 0, 0, 1, 3, 2,
       3, 2, 1, 2, 0, 1, 3, 0, 2, 0, 2, 3, 2, 1, 2, 2, 0, 0, 2, 2, 2])

---
## Quality Scores

The quality scores in a FASTQ file are typically Phred+33 encoded: each character's ASCII code minus 33 is the quality score $Q$. These scores are decoded to obtain the probability that each base call is incorrect.

The score can be computed from the error probability $P$ like so:

$Q = -10\log_{10}{P}$

Likewise, the error probability can be recovered by rearranging the equation:

$P = 10^{-Q/10}$
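
As a worked example, the character `F` has ASCII code 70, so $Q = 70 - 33 = 37$ and $P = 10^{-3.7} \approx 0.0002$. The sketch below implements both directions in plain Python. `decode_phred_sketch` and `encode_phred_sketch` are hypothetical names, not the `common.dna` functions, and rounding $Q$ to the nearest integer when encoding is an assumption.

```python
import math

PHRED_OFFSET = 33  # Phred+33: quality score Q = ASCII code - 33


def decode_phred_sketch(quality):
    """Convert a quality string into per-base error probabilities."""
    return [10 ** (-(ord(ch) - PHRED_OFFSET) / 10) for ch in quality]


def encode_phred_sketch(error_probs):
    """Convert error probabilities (each > 0) back into a quality string."""
    return "".join(
        chr(round(-10 * math.log10(p)) + PHRED_OFFSET) for p in error_probs
    )


probs = decode_phred_sketch("FA")
print([f"{p:.8f}" for p in probs])  # ['0.00019953', '0.00063096']
print(encode_phred_sketch(probs))   # FA
```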

In [54]:
scores = sample[0].quality_scores[:10]
scores

'FFAFFFFFFF'

In [55]:
error_probs = dna.decode_phred(scores)
error_probs

array([0.00019953, 0.00019953, 0.00063096, 0.00019953, 0.00019953,
       0.00019953, 0.00019953, 0.00019953, 0.00019953, 0.00019953])

In [56]:
dna.encode_phred(error_probs)

'FFAFFFFFFF'

---
## FASTQ Sequence ID

Example sequence ID: `@MN01227:252:000H3FWK5:1:11101:5434:1073 1:N:0:TGCTACATCA`

@\<instrument\>:\<run number\>:\<flowcell ID\>:\<lane\>:\<tile\>:\<pos_x\>:\<pos_y\> \<read type\>:\<is filtered (N|Y)\>:\<control number\>:\<sequence index\>
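
The layout above can be parsed by splitting on the space and then on colons. The sketch below is a minimal illustration built on a hypothetical `SequenceIdSketch` named tuple. It is not the object that `common.fastq` returns, though its field names mirror the attributes accessed in the cells of this section.

```python
from typing import NamedTuple, Tuple


class SequenceIdSketch(NamedTuple):
    """Hypothetical container for the fields of a FASTQ sequence ID."""
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    pos: Tuple[int, int]
    read_type: int
    is_filtered: bool
    control_number: int
    sequence_index: str


def parse_sequence_id(line):
    """Parse '@<instrument>:...:<pos_y> <read type>:...:<sequence index>'."""
    if line.startswith("@"):
        line = line[1:]
    left, right = line.split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = left.split(":")
    read_type, filtered, control, index = right.split(":")
    return SequenceIdSketch(
        instrument=instrument,
        run_number=int(run),
        flowcell_id=flowcell,
        lane=int(lane),
        tile=int(tile),
        pos=(int(x), int(y)),
        read_type=int(read_type),
        is_filtered=(filtered == "Y"),
        control_number=int(control),
        sequence_index=index,
    )


parsed = parse_sequence_id(
    "@MN01227:252:000H3FWK5:1:11101:5434:1073 1:N:0:TGCTACATCA"
)
print(parsed.instrument, parsed.pos, parsed.is_filtered)
# MN01227 (5434, 1073) False
```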


In [36]:
sequence_id = sample[0].sequence_id
sequence_id

@MN01227:252:000H3FWK5:1:11101:5434:1073 1:N:0:TGCTACATCA

In [37]:
sequence_id.instrument

'MN01227'

In [38]:
sequence_id.run_number

252

In [39]:
sequence_id.flowcell_id

'000H3FWK5'

In [40]:
sequence_id.lane

1

In [41]:
sequence_id.tile

11101

In [42]:
sequence_id.pos

(5434, 1073)

In [43]:
sequence_id.read_type

1

In [44]:
sequence_id.is_filtered

False

In [45]:
sequence_id.control_number

0

In [46]:
sequence_id.sequence_index

'TGCTACATCA'