#### I implemented two sequence matching algorithms:

##### 1. A naive exact matching algorithm which is strand-aware

The algorithm takes a DNA pattern (p) and a reference DNA sequence (t), and finds how many times p occurs in t, as well as how many times the reverse complement of p occurs in t.

In three steps:
- 1. Parse a fastq/fasta file to extract sequences 
- 2. Define a function to return the reverse-complement of a DNA sequence
- 3. Implement naive exact matching algorithm

##### 2. A naive matching algorithm which allows for up to n mismatches

##### 1. A naive exact matching algorithm which is strand-aware
Step 1. Function which parses fastq and fasta files

In [118]:
%%writefile functions/parse_genome.py
#!/usr/bin/python
# Fastq file has 4 lines for each sequence read
def parse_fastq(filename):
    seqs = []
    quals = []
    with open(filename) as f:
        while True:
            f.readline() # skip name line
            seq = f.readline().rstrip() # read sequence and strip trailing whitespace (including the newline)
            f.readline() # skip the 3rd line
            qual = f.readline().rstrip()

            if len(seq) == 0:
                break # reached end
                
            seqs.append(seq)
            quals.append(qual)
    
    return seqs, quals

# A FASTA record is a header line starting with '>' followed by one or more sequence lines
def parse_fasta(filename):
    genome = ''
    with open(filename) as f:
        for line in f:
            if not line.startswith('>'):  # skip header lines, keep sequence lines
                genome += line.rstrip()
    return genome

Writing functions/parse_genome.py
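
The four-line FASTQ layout that `parse_fastq` relies on can be checked without a file on disk. The sketch below is only an illustration: `parse_fastq_handle` is a hypothetical helper (not part of the notebook) that applies the same logic to an already-open handle, and the two records are made up.

```python
import io

def parse_fastq_handle(handle):
    """Hypothetical helper: same 4-line logic as parse_fastq, but for an open handle."""
    seqs, quals = [], []
    while True:
        handle.readline()                  # line 1: read name (starts with '@')
        seq = handle.readline().rstrip()   # line 2: base calls
        handle.readline()                  # line 3: '+' separator
        qual = handle.readline().rstrip()  # line 4: one quality character per base
        if len(seq) == 0:
            break                          # no sequence read means end of input
        seqs.append(seq)
        quals.append(qual)
    return seqs, quals

# Two made-up records, purely for illustration
sample = "@read1\nACGT\n+\nIIII\n@read2\nGGNA\n+\n##!I\n"
seqs, quals = parse_fastq_handle(io.StringIO(sample))
print(seqs, quals)
```

Each sequence should come back with a quality string of the same length, which is a quick way to spot a file whose records do not follow the four-line layout.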


In [117]:
%%writefile functions/__init__.py
# An (empty) __init__.py marks the 'functions' directory as an importable Python package

Writing functions/__init__.py


Step 2. Function to return the reverse complement of a DNA sequence

In [17]:
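def reverseComplement(s):
    # Map each base to its complement; 'N' (an unknown base) maps to itself
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N': 'N'}
    rc = ''
    for base in s[::-1]:  # walk the sequence from its last base to its first
        rc = rc + complement[base]
    return rc
    # Alternatively, iterate forward and prepend each complement:
    #     for base in s:
    #         rc = complement[base] + rc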
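
As a possible alternative to the character-by-character loop, the same mapping can be expressed with Python's built-in `str.maketrans` and `str.translate`. This is a sketch rather than the notebook's method; the names `COMPLEMENT_TABLE` and `reverse_complement_translate` are made up for illustration.

```python
# Translation table: each base maps to its complement, 'N' maps to itself
COMPLEMENT_TABLE = str.maketrans('ACGTN', 'TGCAN')

def reverse_complement_translate(s):
    """Complement every base in one pass, then reverse the result."""
    return s.translate(COMPLEMENT_TABLE)[::-1]

print(reverse_complement_translate('AGGT'))
print(reverse_complement_translate('CGCG'))
```

One behavioural difference to keep in mind: characters missing from the table (for example lowercase bases) pass through unchanged, whereas the dictionary lookup raises a `KeyError`.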


Step 3. Naive exact matching for both strands

In [82]:
# Naive match with one strand

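def naiveMatch(t,p):
    occurrences = []
    for i in range(len(t)-len(p)+1):  # try every alignment of p against t
        match = True
        for j in range(len(p)):  # compare character by character
            if t[i+j] != p[j]:
                match = False  # the first mismatch rules out this alignment
                break
        if match:
            occurrences.append(i)  # record the offset in t where p matched
    return occurrences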


In [99]:
# Naive match with both strands
def naive_with_rc(t,p):
    rc_p = reverseComplement(p) # get reverse complement sequence
    matches = naiveMatch(t,p)
    if rc_p == p:
        # p equals its own reverse complement; searching twice would count each match twice
        return matches
    else:
        rc_matches = naiveMatch(t, rc_p)
        return matches + rc_matches
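
Because `naive_with_rc` concatenates the two lists, the result neither records which strand a hit came from nor is guaranteed to be sorted by offset. A minimal sketch of a variant that tags each hit with its strand is shown below; `naive_with_strand` and the helper names are hypothetical, and the block redefines compact versions of the helpers so it runs on its own.

```python
def reverse_complement(s):
    complement = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A', 'N': 'N'}
    return ''.join(complement[base] for base in reversed(s))

def naive(t, p):
    """Offsets of exact occurrences of p in t (slice comparison instead of an inner loop)."""
    return [i for i in range(len(t) - len(p) + 1) if t[i:i + len(p)] == p]

def naive_with_strand(t, p):
    """Return sorted (offset, strand) pairs; '+' is p itself, '-' is its reverse complement."""
    hits = [(i, '+') for i in naive(t, p)]
    rc_p = reverse_complement(p)
    if rc_p != p:  # skip the second search when p is its own reverse complement
        hits += [(i, '-') for i in naive(t, rc_p)]
    return sorted(hits)

ten_as = 'A' * 10
t = ten_as + 'AGGT' + ten_as + 'ACCT' + ten_as
print(naive_with_strand(t, 'AGGT'))
```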
    

In [100]:
# Test case 1
p = 'AGGT'
ten_as = 'AAAAAAAAAA'
t = ten_as + 'AGGT' + ten_as + 'ACCT' + ten_as
occurrences = naive_with_rc(t,p)
print(occurrences)

[10, 24]


In [109]:
# Test case 2
p = 'CGCG'
t = ten_as + 'CGCG' + ten_as + 'CGCG' + ten_as
occurrences = naive_with_rc(t,p)
print(occurrences)

[10, 24]


In [74]:
# Test case 3
# Find matches in the lambda virus genome
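# Parse the FASTA data (parse_fasta lives in functions/parse_genome.py, written in Step 1)
from functions.parse_genome import parse_fasta
virus = parse_fasta('data/lambda_virus.fa')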

In [75]:
len(virus)

48502

In [76]:
virus[0:100]

'GGGCGGCGACCTCGCGGGTTTTCGCTATTTATGAAAATTTTCCGGTTTAAGGCGTTTCCGTTCTTCTTCGTCATAACTTAATGTTTTTATTTAAAATACC'

In [101]:
t = virus
p = 'AGGT'
matches = naive_with_rc(t,p)
print(len(matches)) # number of matches

306


In [107]:
t = virus
p = 'AGTCGA'
matches = naive_with_rc(t,p)
print(min(matches)) # Leftmost offset for the matches

450


##### 2. A naive matching algorithm which allows for up to n mismatches

In [53]:
# Only consider one strand here
def naive_n_mismatch(t,p,n):
    occurrences = []
    
    for i in range(len(t)-len(p)+1):
        match = True
        mismatch = 0 
        for j in range(len(p)):

            if (t[i+j] != p[j]): 
                mismatch += 1
                if mismatch > n:
                    match = False
                    break
        if match:
            occurrences.append(i)
    return occurrences
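
The mismatch count in the loop above is the Hamming distance between p and the window of t it is aligned to. The sketch below makes that explicit and also reports the distance for each hit; `hamming` and `naive_n_mismatch_scored` are hypothetical names introduced for illustration. Unlike `naive_n_mismatch`, it does not stop early once the limit is exceeded, so it trades some speed for simplicity.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def naive_n_mismatch_scored(t, p, n):
    """Return (offset, mismatches) for every alignment of p in t with at most n mismatches."""
    hits = []
    for i in range(len(t) - len(p) + 1):
        d = hamming(t[i:i + len(p)], p)
        if d <= n:
            hits.append((i, d))
    return hits

ten_as = 'A' * 10
t = ten_as + 'CTGT' + ten_as + 'CTTT' + ten_as + 'CGGG' + ten_as
print(naive_n_mismatch_scored(t, 'CTGT', 2))
```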


In [119]:
# Test case 1
p = 'CTGT'
ten_as = 'AAAAAAAAAA'
t = ten_as + 'CTGT' + ten_as + 'CTTT' + ten_as + 'CGGG' + ten_as
n = 2
occurrences = naive_n_mismatch(t,p,n)
print(occurrences)

[10, 24, 38]


In [56]:
# Test case 2 with virus genome
p = 'TTCAAGCC'
t = virus
n = 2
occurrences = naive_n_mismatch(t,p,n)
print(len(occurrences))

191


In [57]:
# Test case 3 with virus genome
p = 'AGGAGGTT'
t = virus
n = 2
occurrences = naive_n_mismatch(t,p,n)
print(occurrences[0])

49


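Lastly, an algorithm to identify bad sequencing cycles in FASTQ data. A cycle corresponds to one offset within the reads, so a bad cycle shows up as the same position being unreliable (here, called as 'N') across many reads.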

In [60]:
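# Count 'N' at every position across reads
# Use a human genome dataset

# Parse the FASTQ data (parse_fastq lives in functions/parse_genome.py, written in Step 1)
from functions.parse_genome import parse_fastq
human, _ = parse_fastq('data/ERR037900_1.first1000.fastq')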

In [61]:
len(human)

1000

In [62]:
human[0]

'TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCNAACCCTAACCCTAACCCTAACCCTAACCCTAAC'

In [65]:
from collections import Counter
counter = Counter()
for seq in human:
    counter.update(seq)
counter   # 'N' occurs 914 times in total, which suggests a bad cycle at some position

Counter({'T': 22476, 'A': 24057, 'C': 29665, 'N': 914, 'G': 22888})

In [66]:
# Find the positions (cycles) with many Ns
from collections import Counter
read_N = Counter()

for seq in human:
    for i, base in enumerate(seq):
        if base == 'N':
            read_N[i] += 1

In [69]:
read_N

Counter({66: 903, 91: 2, 52: 7, 67: 1, 92: 1})

In [68]:
read_N.most_common(1) # Position 66 has 903 Ns, indicating a bad sequencing cycle

[(66, 903)]
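
Raw counts depend on how many reads cover each position, so one way to generalise the check is to look at the fraction of reads with an 'N' at each offset and flag offsets above a cutoff. The sketch below assumes a cutoff of 0.5, which is an arbitrary illustrative choice; `n_fraction_by_cycle`, `bad_cycles`, and the toy reads are likewise made up.

```python
from collections import Counter

def n_fraction_by_cycle(reads):
    """Map each read offset to the fraction of reads that have 'N' there."""
    n_counts = Counter()
    totals = Counter()
    for read in reads:
        for i, base in enumerate(read):
            totals[i] += 1          # reads long enough to cover offset i
            if base == 'N':
                n_counts[i] += 1
    return {i: n_counts[i] / totals[i] for i in sorted(totals)}

def bad_cycles(reads, threshold=0.5):
    """Offsets whose 'N' fraction exceeds the (assumed) threshold."""
    return [i for i, frac in n_fraction_by_cycle(reads).items() if frac > threshold]

toy_reads = ['ACNT', 'GGNA', 'TTNC', 'ACGT']  # made-up reads; offset 2 is mostly 'N'
print(n_fraction_by_cycle(toy_reads))
print(bad_cycles(toy_reads))
```

Applied to the dataset above, offset 66 would give 903 / 1000 if all 1000 reads cover that position.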