# Finding a substring in a text

In [1]:
line = 'About cats and dogs; cats are fun'
print(line.index('cat'))

6


`012345678901234567890123456789012`  
`About cats and dogs; cats are fun`
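
`str.index` raises a `ValueError` when the substring is absent, whereas `str.find` returns -1. A minimal sketch of both behaviours on the same sentence (the variable names are only illustrative):

```python
line = 'About cats and dogs; cats are fun'

# str.find returns the index of the first hit, or -1 if there is none
first_cat = line.find('cat')
no_bird = line.find('bird')
print(first_cat, no_bird)

# str.index raises ValueError instead of returning -1
try:
    line.index('bird')
    bird_found = True
except ValueError:
    bird_found = False
print(bird_found)

# Both methods accept a start offset, which allows searching for the next hit
second_cat = line.find('cat', first_cat + 1)
print(second_cat)
```

Neither method accepts a pattern, which is where regular expressions come in.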

# Regular Expressions

A regular expression is a special sequence of characters that represents a text pattern

Regular expressions are widely used in text parsing and in sequence analysis

In Python, the module **re** is used for regular expressions

http://docs.python.org/3/library/re.html

http://docs.python.org/3/howto/regex.html

http://www.tutorialspoint.com/python/python_reg_expressions.htm


In [2]:
import re

# re.match() and re.search()
**re.match()** matches from the beginning of the string  
**re.search()** matches at any location in the string  


In [3]:
match_result = re.match(r'cat', line)
search_result = re.search(r'cat', line)

print(line)
print('match result: %s'%match_result)
print('search result: %s'%type(search_result))

About cats and dogs; cats are fun
match result: None
search result: <class '_sre.SRE_Match'>


# Match object
Both re.match() and re.search() return a **Match** object on success, **None** on failure  

For attributes and methods of a Match Object see:
https://docs.python.org/3/library/re.html#match-objects


In [4]:
print(search_result.group())

cat


In [5]:
print(search_result.start(),search_result.end())

6 9


In [6]:
match_result = re.match(r'Abo', line)
print(type(match_result))
print('-'*50)
print(match_result.group())
print('-'*50)
print(match_result.start())

<class '_sre.SRE_Match'>
--------------------------------------------------
Abo
--------------------------------------------------
0
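
Because a failed match returns None, calling a method such as `.group()` directly on the result raises an `AttributeError`. One possible way to guard against this, sketched with a hypothetical helper `first_hit` that is not part of the `re` module:

```python
import re

line = 'About cats and dogs; cats are fun'

def first_hit(pattern, text):
    """Hypothetical helper: return (matched text, start, end), or None if there is no hit."""
    m = re.search(pattern, text)
    if m is None:          # no match: there is no Match object to query
        return None
    return m.group(), m.start(), m.end()

cat_hit = first_hit(r'cat', line)
bird_hit = first_hit(r'bird', line)
print(cat_hit)
print(bird_hit)
```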


## Find multiple hits with findall and finditer

**findall** returns a list of strings  
**finditer** returns an iterator to Match objects

In [7]:
list_of_results = re.findall(r'cat', line)
print(list_of_results)
for i in list_of_results:
    print('-'*50)
    print(i, type(i))

['cat', 'cat']
--------------------------------------------------
cat <class 'str'>
--------------------------------------------------
cat <class 'str'>


In [8]:
result_iterator = re.finditer(r'cat', line)
print(result_iterator)
for i in result_iterator:
    print('-'*50)
    print('type: %s'%type(i))
    print('matched string: %s, start: %s, end: %s'%(i.group(), i.start(), i.end()))

<callable_iterator object at 0x7ff9c921e898>
--------------------------------------------------
type: <class '_sre.SRE_Match'>
matched string: cat, start: 6, end: 9
--------------------------------------------------
type: <class '_sre.SRE_Match'>
matched string: cat, start: 21, end: 24


# Substitute
**re.sub(pattern, repl, string)** returns a copy of the string in which every non-overlapping match of the pattern is replaced by repl

In [9]:
new_line = re.sub(r'cat', 'monkey', line)
print(line)
print(new_line)

About cats and dogs; cats are fun
About monkeys and dogs; monkeys are fun
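
`re.sub` also takes an optional `count` argument, and the replacement may be a function instead of a string. A short sketch of both, plus `re.subn`, which also reports the number of replacements (variable names are illustrative):

```python
import re

line = 'About cats and dogs; cats are fun'

# count limits the number of replacements (here: only the first hit)
first_only = re.sub(r'cat', 'monkey', line, count=1)
print(first_only)

# The replacement may be a function that receives each Match object
shouted = re.sub(r'cats|dogs', lambda m: m.group().upper(), line)
print(shouted)

# re.subn returns the new string together with the number of replacements
new_line, n_replaced = re.subn(r'cat', 'monkey', line)
print(n_replaced)
```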


# Compile regular expressions for speed, reuse, and readability

In [10]:
dna = 'gatgcaggctcgctagcggct'

# Does this string contain a start codon?

startcodon = re.compile(r'atg', re.I)
yesno = 'no'

if startcodon.search(dna):
    yesno = 'a'

print('we have %s start codon'%yesno)

we have a start codon
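
A compiled pattern object offers the same operations as the module-level functions, so one compiled pattern can be reused for several tasks. A minimal sketch on the same sequence (the replacement text `[ATG]` is just an arbitrary marker):

```python
import re

dna = 'gatgcaggctcgctagcggct'

# Compile once; the flag travels with the pattern object
startcodon = re.compile(r'atg', re.I)

# Reuse the same object for finditer, findall and sub
start_positions = [m.start() for m in startcodon.finditer(dna)]
upper_hits = startcodon.findall(dna.upper())
marked = startcodon.sub('[ATG]', dna)
print(start_positions, upper_hits, marked)
```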


# Modifiers

Modifiers (flags) change how a pattern is matched. The available flags include:

re.I, re.L, re.M, re.S, re.U, re.X

Example: **re.I** activates case-insensitive matching


In [11]:
test_line = 'I am testing this TestCase, see if the TEST works'
print(test_line)
mp = re.compile(r'test', re.I)
for i in mp.finditer(test_line):
    print('-'*50)
    print(i.group(), i.start(), i.end())

I am testing this TestCase, see if the TEST works
--------------------------------------------------
test 5 9
--------------------------------------------------
Test 18 22
--------------------------------------------------
TEST 39 43
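
The other flags work the same way. A small sketch of re.M, re.S and re.X on made-up input; the `(?:...)` non-capturing group used in the last pattern is not covered elsewhere in this notebook:

```python
import re

text = 'first line\nsecond line'

# re.M: ^ and $ also match at the start and end of every line
line_starts = re.findall(r'^\w+', text, re.M)

# re.S: the period also matches a newline
without_s = re.search(r'first.*second', text)
across_lines = re.search(r'first.*second', text, re.S)

# re.X: whitespace and comments inside the pattern are ignored;
# flags are combined with |
codon = re.compile(r'''
    atg              # start codon
    (?:[acgt]{3})*   # any number of codons (non-capturing group)
    tag              # stop codon
''', re.X | re.I)
orf = codon.search('ccATGaaaTAGcc')

print(line_starts)
print(without_s, repr(across_lines.group()))
print(orf.group())
```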


# Metacharacters

Metacharacters are not matched literally but have a special meaning in the regular expression.

## Example:
### Regular expression for Lysozyme: `C.{3}C.{2}[LMF].{3}[DEN][LI].{5}C`
http://prosite.expasy.org/PDOC00119

In [12]:
protein_sequence = 'GDRSTDYGIFQINSRYWCNDGKTPGAVNACHLSCSALLQDNIADAVACAKRVVRDPQGIRAWVAWRNRCQNRDVRQYVQGCGV'
lysozyme_pattern = r'C.{3}C.{2}[LMF].{3}[DEN][LI].{5}C'

lysozyme_hits = re.findall(lysozyme_pattern,protein_sequence)

print(lysozyme_hits)

['CHLSCSALLQDNIADAVAC']


### Character sets, or character classes [ ]

```
 [abc] will match any of the characters 'a', 'b', or 'c'
 [a-c] is the same as [abc]
 Match any lowercase letter: [a-z]
 Match any nucleotide: [acgt]
 Match any character that is not a nucleotide: [^acgt]
```
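
The character classes above can be tried out as follows; the sequence is made up for illustration:

```python
import re

dna = 'atgcnnatgxcg'

# [acgt]+ : runs of valid nucleotides
nucleotide_runs = re.findall(r'[acgt]+', dna)
# [^acgt] : every single character that is not a nucleotide
non_nucleotides = re.findall(r'[^acgt]', dna)
# [a-c] : a range, the same as [abc]
range_hits = re.findall(r'[a-c]', 'abcdef')

print(nucleotide_runs)
print(non_nucleotides)
print(range_hits)
```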

### period
```
  . matches any character except a newline (with the re.S flag it matches a newline as well)
```

### repetition
```
 a*     : match zero or more a's
 a+     : match one or more a's 
 a?     : match zero or one a's
 a{5}   : match exactly 5 a's
 a{,5}  : match up to 5 a's
 a{3,5} : match between 3 and 5 a's
```
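
A quick way to check the repetition operators is to test which strings fit a pattern completely. This sketch uses `re.fullmatch` (available since Python 3.4) and a hypothetical helper `fits`:

```python
import re

samples = ['', 'a', 'aaa', 'aaaaa', 'aaaaaaa']

def fits(pattern):
    """Hypothetical helper: list the samples that match the pattern from start to end."""
    return [s for s in samples if re.fullmatch(pattern, s)]

zero_or_more = fits(r'a*')
one_or_more = fits(r'a+')
zero_or_one = fits(r'a?')
exactly_five = fits(r'a{5}')
up_to_five = fits(r'a{,5}')
three_to_five = fits(r'a{3,5}')

for name, hits in [('a*', zero_or_more), ('a+', one_or_more), ('a?', zero_or_one),
                   ('a{5}', exactly_five), ('a{,5}', up_to_five), ('a{3,5}', three_to_five)]:
    print(name, hits)
```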

### other
```
 a|b    : match a or b
 ^a     : match a at the beginning of string
 a$     : match a at the end of the string
 (aa)   : group a sub-expression and remember the matched text
```

In [13]:
bla = 'blablablablabla'
print(re.search(r'[lba]+',bla).group())
print('-'*50)
print(re.search(r'bla{3}',bla))
print('-'*50)
print(re.search(r'(bla){3}',bla).group())
print('-'*50)
print(re.search(r'^lab',bla))

blablablablabla
--------------------------------------------------
None
--------------------------------------------------
blablabla
--------------------------------------------------
None
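
Alternation, the end-of-string anchor and the effect of a group on `findall` can be sketched in the same way (variable names are illustrative):

```python
import re

line = 'About cats and dogs; cats are fun'

# a|b : alternation
pets = re.findall(r'cats|dogs', line)

# ^ and $ anchor the pattern to the start and end of the string
starts_with_about = bool(re.search(r'^About', line))
ends_with_cats = bool(re.search(r'cats$', line))

# ( ) groups a sub-pattern; findall then returns only the text of the group
animals_singular = re.findall(r'(cat|dog)s', line)

print(pets)
print(starts_with_about, ends_with_cats)
print(animals_singular)
```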


# Backslash
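A backslash followed by certain characters signals a special sequence. The class equivalences listed here hold for bytes patterns or when the re.ASCII flag is set; in str patterns, `\d`, `\s` and `\w` also match the corresponding Unicode characters.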
```
\n      Newline character
\t      Tab
\d      Matches any decimal digit; 
        this is equivalent to the class [0-9].
\D      Matches any non-digit character; 
        this is equivalent to the class [^0-9].
\s      Matches any whitespace character; 
        this is equivalent to the class [ \t\n\r\f\v].
\S      Matches any non-whitespace character; 
        this is equivalent to the class [^ \t\n\r\f\v].
\w      Matches any alphanumeric character; 
        this is equivalent to the class [a-zA-Z0-9_].
\W      Matches any non-alphanumeric character; 
        this is equivalent to the class [^a-zA-Z0-9_].
```
It’s also used to escape a metacharacter so you can match the actual character. For example, if you need to match a [ or \, you can precede them with a backslash to remove their special meaning: ```\[``` or ```\\```

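Metacharacters are not active inside classes, e.g. `[ab$]` matches 'a', 'b', or '$'  

```
 Exception: ^ as the first character of a class, which means NOT
 [^5] will match any character that is not 5
 (- between two characters still denotes a range, and \ still escapes)
```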


In [14]:
line = 'About cats and dogs; cats are fun'
print(re.findall(r'\w+',line))

['About', 'cats', 'and', 'dogs', 'cats', 'are', 'fun']
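
A few more of the special sequences in action, including an escaped metacharacter. The record string is invented for illustration:

```python
import re

record = 'sample_7\t42.5 mg [batch A]'

# \d+ : runs of digits
digits = re.findall(r'\d+', record)
# \s+ : split on any run of whitespace (tab or space)
fields = re.split(r'\s+', record)
# \. matches a literal period instead of "any character"
decimal = re.search(r'\d+\.\d+', record).group()
# \[ and \] match literal square brackets
bracketed = re.search(r'\[(.+)\]', record).group(1)

print(digits)
print(fields)
print(decimal, bracketed)
```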


# Raw strings

In [15]:
print('a\tb')
print(r'a\tb')

a	b
a\tb
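
Raw strings matter for regular expressions because Python processes backslashes in a normal string before `re` ever sees the pattern. A sketch using `\b`, which `re` treats as a word boundary (a special sequence not listed above), while a normal Python string turns it into a backspace character:

```python
import re

text = 'a cat; concatenate'

# '\b' is a single backspace character; r'\b' is a backslash followed by 'b'
normal_len, raw_len = len('\b'), len(r'\b')

# Raw string: re receives \b and matches 'cat' only as a whole word
whole_word_raw = re.findall(r'\bcat\b', text)
# Normal string: re receives backspace characters, which do not occur in the text
whole_word_plain = re.findall('\bcat\b', text)

print(normal_len, raw_len)
print(whole_word_raw, whole_word_plain)
```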


# Grouping

In [16]:
group_line = 'more or less'

print(re.sub(r'(\w+) (\w+) (\w+)',r'\3 \2 \1',group_line))

less or more


In [17]:
m = re.search(r'(\w+) (\w+) (\w+) (\w+)',line)
print(m.group(1),m.group(4),m.group(3),m.group(2))
print(m.groups())

About dogs and cats
('About', 'cats', 'and', 'dogs')


# Greediness
Pattern matching is greedy by default: a quantifier such as `*`, `+` or `{m,n}` matches as many characters as possible. Appending **?** to the quantifier makes it non-greedy, so that it matches as few characters as possible

In [19]:
dna = 'atgcccgaatagtagtagtagtag'
print(dna)
print('-'*50)
# This will match the entire string
print(re.search(r'atg([acgt]{3})+tag', dna).group().upper())
print('-'*50)
# This will stop matching after the first 'tag':
print(re.search(r'atg([acgt]{3})+?tag', dna).group().upper())

atgcccgaatagtagtagtagtag
--------------------------------------------------
ATGCCCGAATAGTAGTAGTAGTAG
--------------------------------------------------
ATGCCCGAATAG
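
The same effect is easy to see with `findall` on a string containing several delimited items. A minimal sketch on a made-up string:

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the last '>' in the string, giving one long hit
greedy = re.findall(r'<.+>', html)
# Non-greedy: .+? stops at the first '>' it can, giving one hit per tag
lazy = re.findall(r'<.+?>', html)

print(greedy)
print(lazy)
```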


### Exercise
#### Write a regular expression to extract the different fields from the ID line of a SwissProt entry and print these fields separately.

This is the template for the ID line:  
````>db|UniqueIdentifier|EntryName ProteinName OS=OrganismName GN=GeneName PE=ProteinExistence SV=SequenceVersion````  
(as described in http://www.uniprot.org/help/fasta-headers; you can ignore the fact that the GeneName is optional)

You can use this website to develop the regex: https://regex101.com/#python

In [36]:
ID_line = '>sp|P02006|HBAD_PHRHI Hemoglobin subunit alpha-D OS=Phrynops hilarii GN=HBAD PE=1 SV=1'

labels = ['db', 'UniqueIdentifier', 'EntryName', 'ProteinName', 'OrganismName', 'GeneName', 'ProteinExistence', 'SequenceVersion']

# re.match already anchors at the start of the string; \d+ also accepts multi-digit versions
mo = re.match(r'>(\w+)\|(\w+)\|(\w+)\s(.+)\sOS=(.+)\sGN=(\w+)\sPE=(\d)\sSV=(\d+)',ID_line)

if mo:
    print('The regex worked, these are the fields:')
    fields = mo.groups()
    print(fields)
else:
    print('The regex did not work...')



The regex worked, these are the fields:
('sp', 'P02006', 'HBAD_PHRHI', 'Hemoglobin subunit alpha-D', 'Phrynops hilarii', 'HBAD', '1', '1')


#### If you manage to get the fields, you can use the following code to print each field with its appropriate label.

In [37]:
ID_dict = dict(zip(labels,fields))

for field, value in ID_dict.items():
    print('%s:\t%s'%(field,value))

db:	sp
UniqueIdentifier:	P02006
EntryName:	HBAD_PHRHI
ProteinName:	Hemoglobin subunit alpha-D
OrganismName:	Phrynops hilarii
GeneName:	HBAD
ProteinExistence:	1
SequenceVersion:	1
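
An alternative sketch, not the only possible solution: named groups `(?P<name>...)` attach the labels to the pattern itself, and `groupdict()` returns the dictionary directly. The re.X flag is used only to spread the pattern over several lines; the name `id_pattern` is illustrative.

```python
import re

ID_line = '>sp|P02006|HBAD_PHRHI Hemoglobin subunit alpha-D OS=Phrynops hilarii GN=HBAD PE=1 SV=1'

# Each (?P<name>...) group is stored under its name;
# with re.X, whitespace in the pattern is ignored, so spaces are written as \s
id_pattern = re.compile(r'''
    >(?P<db>\w+)
    \|(?P<UniqueIdentifier>\w+)
    \|(?P<EntryName>\w+)
    \s(?P<ProteinName>.+)
    \sOS=(?P<OrganismName>.+)
    \sGN=(?P<GeneName>\S+)
    \sPE=(?P<ProteinExistence>\d+)
    \sSV=(?P<SequenceVersion>\d+)
''', re.X)

mo = id_pattern.match(ID_line)
named_fields = mo.groupdict() if mo else {}

for label, value in named_fields.items():
    print('%s:\t%s' % (label, value))
```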
