Python for Bioinformatics
-----------------------------

![title](https://s3.amazonaws.com/py4bio/tapabiosmall.png)

This Jupyter notebook is intended to be used alongside the book [Python for Bioinformatics](http://py3.us/)



**Note:** The sample files must be accessible from this Jupyter notebook before they can be opened. The following commands download them from GitHub and extract them into a directory called samples.

Chapter 13: Regular Expressions
-----------------------------

In [None]:
!curl https://raw.githubusercontent.com/Serulab/Py4Bio/master/samples/samples.tar.bz2 -o samples.tar.bz2
!mkdir -p samples
!tar xvfj samples.tar.bz2 -C samples

In [None]:
import re

In [None]:
mo = re.search('hello', 'Hello world, hello Python!')

In [None]:
mo.group()

'hello'

In [None]:
mo.span()

(13, 18)

In [None]:
'Hello world, hello Python!'.index('hello')

13
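The two approaches above agree on the position, but they differ in how they report a missing pattern. The following is a minimal sketch of that difference; the search term `'goodbye'` is only an illustrative value that does not occur in the text.

```python
import re

text = 'Hello world, hello Python!'

# re.search returns None when the pattern is absent.
mo = re.search('goodbye', text)
print(mo)

# str.index raises ValueError instead, while str.find returns -1.
try:
    text.index('goodbye')
except ValueError as err:
    print('index failed:', err)
print(text.find('goodbye'))
```

Because of this, code based on `re.search` usually checks the result against `None` before calling `group()` or `span()`.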

In [None]:
import re

In [None]:
mo = re.search('[Hh]ello', 'Hello world, hello Python!')

In [None]:
mo.group()

'Hello'

In [None]:
re.findall("[Hh]ello","Hello world, hello Python,!")

['Hello', 'hello']

In [None]:
re.finditer("[Hh]ello", "Hello world, hello Python,!")

<callable_iterator at 0x7f90f00dc3c8>

In [None]:
mos = re.finditer("[Hh]ello", "Hello world, hello Python,!")

In [None]:
for x in mos:
    print(x.group())
    print(x.span())

Hello
(0, 5)
hello
(13, 18)


In [None]:
mo = re.match("hello", "Hello world, hello Python!")
print(mo)

None


In [None]:
mo = re.match("Hello", "Hello world, hello Python!")
mo

<_sre.SRE_Match object; span=(0, 5), match='Hello'>

In [None]:
mo.group()

'Hello'

In [None]:
mo.span()

(0, 5)
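The cells above suggest that `re.match` only tries the pattern at the beginning of the string, while `re.search` scans all of it. As a small side-by-side sketch, including `re.fullmatch` (not used elsewhere in this notebook), which requires the pattern to cover the whole string:

```python
import re

text = 'Hello world, hello Python!'

# match only tries position 0, so the later 'hello' is not found.
at_start = re.match('hello', text)
# search scans the whole string and finds the lowercase 'hello'.
anywhere = re.search('hello', text)
# fullmatch succeeds only if the pattern consumes the entire string.
partial = re.fullmatch('Hello', text)
whole = re.fullmatch('Hello.*', text)

print(at_start)
print(anywhere.span())
print(partial)
print(whole.group())
```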

In [None]:
re.findall("[Hh]ello","Hello world, hello Python,!")

['Hello', 'hello']

In [None]:
rgx = re.compile("[Hh]ello")
rgx.findall("Hello world, hello Python,!")

['Hello', 'hello']

In [None]:
rgx = re.compile("[Hh]ello")
rgx.search("Hello world, hello Python,!")

<_sre.SRE_Match object; span=(0, 5), match='Hello'>

In [None]:
rgx.match("Hello world, hello Python,!")

<_sre.SRE_Match object; span=(0, 5), match='Hello'>

In [None]:
rgx.findall("Hello world, hello Python,!")

['Hello', 'hello']
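The character class `[Hh]` handles one letter in two cases. When case should be ignored for the whole pattern, the `re.IGNORECASE` flag is a possible alternative. The sketch below uses a slightly extended test string (the trailing `HELLO` is added here for illustration) to show that the flag matches more variants than `[Hh]ello` does.

```python
import re

text = 'Hello world, hello Python, HELLO!'

# Case-insensitive matching for the whole pattern.
rgx_ci = re.compile('hello', re.IGNORECASE)
found_ci = rgx_ci.findall(text)

# The character class only covers the first letter.
found_class = re.findall('[Hh]ello', text)

print(found_ci)
print(found_class)
```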

**Listing 13.1:** findTAT.py: Find the first “TAT” repeat

In [None]:
import re
seq = "ATATAAGATGCGCGCGCTTATGCGCGCA"
rgx = re.compile("TAT")
i = 1
for mo in rgx.finditer(seq):
    print('Occurrence {0}: {1}'.format(i, mo.group()))
    print('Position: From {0} to {1}'.format(mo.start(),
                                            mo.end()))
    i += 1

Occurrence 1: TAT
Position: From 1 to 4
Occurrence 2: TAT
Position: From 18 to 21
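Note that `finditer` reports non-overlapping matches only. That makes no difference for the sequence above, but it does for tandem repeats. The sketch below uses a made-up sequence (`ATATATAT`, not from the book) and a zero-width lookahead, which is one common way to collect overlapping start positions.

```python
import re

seq = "ATATATAT"  # illustrative sequence, not from the book

# Plain pattern: each match consumes its bases, so overlaps are skipped.
plain = [mo.start() for mo in re.finditer("TAT", seq)]

# Lookahead: the match itself is empty, so the scan advances base by base.
overlapping = [mo.start() for mo in re.finditer("(?=TAT)", seq)]

print(plain)        # [1, 5]
print(overlapping)  # [1, 3, 5]
```

With the lookahead version `mo.group()` is an empty string, so only the start positions are meaningful.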


In [None]:
import re
seq = "ATATAAGATGCGCGCGCTTATGCGCGCA"
rgx = re.compile("(GC){3,}")
result = rgx.search(seq)
result.group()

'GCGCGCGC'

In [None]:
result.groups()

('GC',)

In [None]:
rgx = re.compile("((GC){3,})")
result = rgx.search(seq)
result.groups()

('GCGCGCGC', 'GC')

In [None]:
# Only the inner group is non-capturing
rgx = re.compile("((?:GC){3,})")
result = rgx.search(seq)
result.groups()

('GCGCGCGC',)

In [None]:
rgx = re.compile("TAT") # No group at all.
rgx.findall(seq) # This returns a list of matching strings.

['TAT', 'TAT']

In [None]:
rgx = re.compile("(GC){3,}") # One group. Return a list
rgx.findall(seq) # with the group for each match.

['GC', 'GC']

In [None]:
rgx = re.compile("((GC){3,})") # Two groups. Return a
rgx.findall(seq) # list with tuples for each match.

[('GCGCGCGC', 'GC'), ('GCGCGC', 'GC')]

In [None]:
rgx = re.compile("((?:GC){3,})") # Using a non-capturing
rgx.findall(seq) # group to get only the matches.

['GCGCGCGC', 'GCGCGC']
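`findall` returns only strings (or tuples of groups), so the positions are lost. If the location and the number of repeats are also needed, `finditer` with the same non-capturing pattern is one option. A minimal sketch, where the `islands` list and its tuple layout are choices made for this example:

```python
import re

seq = "ATATAAGATGCGCGCGCTTATGCGCGCA"
rgx = re.compile("(?:GC){3,}")

# One tuple per match: (start, end, number of GC units).
islands = [(mo.start(), mo.end(), len(mo.group()) // 2)
           for mo in rgx.finditer(seq)]
print(islands)  # [(9, 17, 4), (21, 27, 3)]
```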

**Listing 13.2:** subgroups.py: Find multiple sub-patterns

In [None]:
import re
rgx = re.compile("(?P<TBX>TATA..).*(?P<CGislands>(?:GC){3,})")
seq = "ATATAAGATGCGCGCGCTTATGCGCGCA"
result = rgx.search(seq)
print(result.group('CGislands'))
print(result.group('TBX'))

GCGCGC
TATAAG
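The result above shows the second, shorter GC run rather than the first one. This is because `.*` is greedy and consumes as much of the sequence as it can before the last group is tried. The sketch below reuses the pattern from the listing, inspects the named groups with `groupdict()` and `span()`, and compares it with a non-greedy variant (`.*?`), which is an addition made here for illustration.

```python
import re

seq = "ATATAAGATGCGCGCGCTTATGCGCGCA"

# Pattern from the listing: greedy .* pushes CGislands to the last GC run.
greedy = re.compile("(?P<TBX>TATA..).*(?P<CGislands>(?:GC){3,})")
result = greedy.search(seq)
print(result.groupdict())        # {'TBX': 'TATAAG', 'CGislands': 'GCGCGC'}
print(result.span('TBX'))        # (1, 7)
print(result.span('CGislands'))  # (21, 27)

# Non-greedy .*? stops at the first GC run instead.
lazy = re.compile("(?P<TBX>TATA..).*?(?P<CGislands>(?:GC){3,})")
first = lazy.search(seq)
print(first.group('CGislands'))  # GCGCGCGC
```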


**Listing 13.3:** regexsys1.py: Count the lines that contain a user-supplied pattern

In [None]:
import re, sys
myregex = re.compile(sys.argv[2])
counter = 0
with open(sys.argv[1]) as fh:
    for line in fh:
        if myregex.search(line):
            counter += 1
print(counter)

FileNotFoundError: [Errno 2] No such file or directory: '-f'

**Listing 13.4:** countinfile.py: Count the occurrences of a pattern in a file

In [None]:
import re, sys
myregex = re.compile(sys.argv[2])
i = 0
with open(sys.argv[1]) as fh:
    for line in fh:
        i += len(myregex.findall(line))
print(i)

FileNotFoundError: [Errno 2] No such file or directory: '-f'
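Listings 13.3 and 13.4 are written as command-line scripts that read the file name and the pattern from `sys.argv`. Inside a notebook, `sys.argv` most likely holds the arguments used to launch the kernel, which would explain why `'-f'` is taken as the file name in the errors above. One way to try the same logic from a cell is to wrap it in functions. The names `count_matching_lines` and `count_occurrences` below are hypothetical and not part of the book's code.

```python
import re

def count_matching_lines(path, pattern):
    """Count the lines in the file at path that contain pattern."""
    rgx = re.compile(pattern)
    with open(path) as fh:
        return sum(1 for line in fh if rgx.search(line))

def count_occurrences(path, pattern):
    """Count every non-overlapping match of pattern in the file at path."""
    rgx = re.compile(pattern)
    with open(path) as fh:
        return sum(len(rgx.findall(line)) for line in fh)
```

Assuming the sample files were extracted as described at the top of the notebook, a call such as `count_matching_lines('samples/Q5R5X8.fas', 'RLE')` can then be run directly in a cell.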

**Listing 13.5:** deletegc.py: Delete GC repeats (three or more GC in a row)

In [None]:
import re
regex = re.compile("(?:GC){3,}")
seq="ATGATCGTACTGCGCGCTTCATGTGATGCGCGCGCGCAGACTATAAG"
print("Before:", seq)
print("After:", regex.sub("", seq))

Before: ATGATCGTACTGCGCGCTTCATGTGATGCGCGCGCGCAGACTATAAG
After: ATGATCGTACTTTCATGTGATAGACTATAAG
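If it is also useful to know how many repeats were removed, `subn` returns both the new string and the number of substitutions. A short sketch with the same pattern and sequence (the variable names `cleaned` and `n_removed` are chosen for this example):

```python
import re

regex = re.compile("(?:GC){3,}")
seq = "ATGATCGTACTGCGCGCTTCATGTGATGCGCGCGCGCAGACTATAAG"

# subn returns a tuple: (new string, number of replacements made).
cleaned, n_removed = regex.subn("", seq)
print(cleaned)    # ATGATCGTACTTTCATGTGATAGACTATAAG
print(n_removed)  # 2
```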


**Listing 13.6:** searchinfasta.py: Search a pattern in a FASTA file

In [None]:
import re
pattern = "[LIVM]{2}.RL[DE].{4}RLE"
with open('samples/Q5R5X8.fas') as fh:
    fh.readline() # Discard the first line.
    seq = ""
    for line in fh:
        seq += line.strip()
rgx = re.compile(pattern)
result = rgx.search(seq)
patternfound = result.group()
span = result.span()
leftpos = span[0]-10
if leftpos<0:
    leftpos = 0
print(seq[leftpos:span[0]].lower() + patternfound +
      seq[span[1]:span[1]+10].lower())

manmqgLVERLERAVSRLEslsaeshrpp
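The listing above assumes that the pattern is present. If it is not, `search` returns `None` and `result.group()` raises `AttributeError`. The following sketch wraps the same idea in a hypothetical helper, `show_context`, that returns `None` in that case. The `demo` string is simply the fragment printed above, converted to upper case, not the full protein sequence.

```python
import re

def show_context(seq, pattern, flank=10):
    """Return the first match with lower-cased flanks, or None if absent."""
    mo = re.search(pattern, seq)
    if mo is None:
        return None
    start, end = mo.span()
    left = seq[max(0, start - flank):start].lower()
    right = seq[end:end + flank].lower()
    return left + mo.group() + right

pattern = "[LIVM]{2}.RL[DE].{4}RLE"
demo = "MANMQGLVERLERAVSRLESLSAESHRPP"  # fragment taken from the output above
print(show_context(demo, pattern))     # manmqgLVERLERAVSRLEslsaeshrpp
print(show_context("AAAA", pattern))   # None
```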


**Listing 13.7:** cleanseq.py: Cleans a DNA sequence

In [None]:
import re
# Raw string so that \d reaches the regex engine unchanged.
regex = re.compile(r' |\d|\n|\t')
seq = ''
with open('samples/pMOSBlue.txt') as fh:
    for line in fh:
        seq += regex.sub('', line)
print(seq)

ATGACCATGATTACGCCAAGCTCTAATACGACTCACTATAGGGAAAGCTTGCATGCCTGCAGGTCGACTCTAGAGGATCTACTAGTCATATGGATATCGGATCCCCGGGTACCGAGCTCGAATTCACTGGCCGTCGTTTT
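The alternation used in the listing can also be written as a single character class. The sketch below is an equivalent-in-spirit variant, not the book's code: `\s` covers spaces, tabs and newlines (and other whitespace), and `\d` covers digits. The `raw` string is a made-up input in a numbered, space-separated layout; the real layout of `pMOSBlue.txt` may differ.

```python
import re

# Hypothetical input: line numbers and blocks separated by spaces.
raw = "  1 ATGACC ATGATT\n 13 ACGCCA\n"

# Remove every run of whitespace or digits in one pass.
cleaner = re.compile(r'[\s\d]+')
clean_seq = cleaner.sub('', raw)
print(clean_seq)  # ATGACCATGATTACGCCA
```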
