# Regular Expressions
# 1. [Introduction to Regular Expressions](#1.-Regular-Expressions-Intro)
 1. [What are Regular Expressions](#1.-What-are-Regular-Expressions)
 2. [A Simple Python Example](#2.-A-Simple-Python-Example:)
 
# 2. [Regular Expression Patterns Shallow Dive:](#2.-A-Few-Regular-Expression-Patterns)
* Metacharacters and Special Sequences
* Search vs Match

# 1. Regular Expressions Intro

### 1. What are Regular Expressions
* A Description of a Text Pattern
Regular expressions are a way to define a text pattern of interest. Many people have already met similar special characters elsewhere, for example \* used as a wildcard in file searches. In a regular expression, \* is a quantifier meaning "zero or more of the preceding item".


* A tool for searching for a text pattern


* A tool for reusing a text pattern
Regular expressions allow for multipart and (limited) nested search patterns.
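
As a minimal sketch of the points above, the snippet below compiles one pattern and reuses it on several strings. The pattern and the sample strings are made up for illustration and are not part of the example files used later.

```python
import re

# Hypothetical pattern: the letters "ab" followed by zero or more "c" characters.
# Here * repeats the preceding item; it is not a stand-alone wildcard.
abc_pattern = re.compile(r'abc*')

# Hypothetical sample strings, all checked with the same compiled pattern
samples = ['ab', 'abccc', 'xab', 'ac']

results = [abc_pattern.search(s) is not None for s in samples]
print(results)
```

Only the last string fails, because it never contains "ab".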

### 2. A Simple Python Example:
#### The Task
_A collaborator sends you a list of ids of interest, but you only want those corresponding to GenBank accession numbers_

![title](./images/GBFormatsNCBI.png "our key")

### The Python Approach
 1. Import the regular expression module (re)
 2. Compile the regular expression
 3. Use the regular expression to find target pattern
 4. Optional --- Put the match object to work!

In [2]:
#First let's look at the text file
with open('./documents/AccessionExample1.txt') as f:
    ex1text=f.read()
print(ex1text)

adz22510.1
AEV67086.1
  CBL17440
EIM57503.2
AAC19169.1
afY52522.1  (this is the best one)
AAa23220.1
wp_005355457.1
AA123456.1
ADZ22510
zp054688.1
ZP_010248927.1
tr7892101
zp_010248927
abw39335.1 ***check out kcat here
2211254a ***dino DNA I think
 gh781556 ***possibly dino here as well
fgu98722.3
-One of my favorites is adz22510



In [3]:
#let's find all gb protein ids
import re   #1 - importing regular expression module (it is now imported for subsequent cells)
proteingb_regex=re.compile(r'[A-Za-z]{3}\d{5}')   #2 - compiling the regular expression (raw string keeps \d intact)
protaccs_=proteingb_regex.findall(ex1text)    #3 - using the regular expression
protaccs_

['adz22510',
 'AEV67086',
 'CBL17440',
 'EIM57503',
 'AAC19169',
 'afY52522',
 'AAa23220',
 'ADZ22510',
 'abw39335',
 'fgu98722',
 'adz22510']
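
For readers without the text file, here is a self-contained sketch that pastes the listing printed above into a string and repeats the search. The upper-casing and de-duplication at the end are an addition for illustration, not part of the original task; the names ex1text_inline, protaccs_inline, and unique_accs are hypothetical.

```python
import re

# The listing printed above, pasted inline so no file is needed
ex1text_inline = """adz22510.1
AEV67086.1
  CBL17440
EIM57503.2
AAC19169.1
afY52522.1  (this is the best one)
AAa23220.1
wp_005355457.1
AA123456.1
ADZ22510
zp054688.1
ZP_010248927.1
tr7892101
zp_010248927
abw39335.1 ***check out kcat here
2211254a ***dino DNA I think
 gh781556 ***possibly dino here as well
fgu98722.3
-One of my favorites is adz22510
"""

# Same pattern as above: 3 letters followed by 5 digits
proteingb_regex = re.compile(r'[A-Za-z]{3}\d{5}')
protaccs_inline = proteingb_regex.findall(ex1text_inline)

# Upper-case each hit and drop repeats while keeping first-seen order
unique_accs = list(dict.fromkeys(acc.upper() for acc in protaccs_inline))
print(len(protaccs_inline), len(unique_accs))
```

The same accession shows up three times in the listing (adz22510, ADZ22510, and once more inside the final comment line), which is why the de-duplicated list is shorter.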

### A new task
_You want to know how many of those accession codes reflect sequence record versions >1_
### GB Protein Format: 3 letters + 5 digits + . + version number
- AAC19169.1 = version 1 sequence record
- EIM57503.2 = version 2 sequence record

### Solution: using grouping

In [19]:
#Group 1 = accession code
#Group 2 = dot plus version digit (optional: the trailing ? means 0 or 1 occurrences)
proteingb_regex=re.compile(r'([A-Za-z]{3}\d{5})(\.\d)?')
for mobj in proteingb_regex.finditer(ex1text):  #NOW 4 - put the match object to work!
    print('match to {0}'.format(mobj.group(0)))
    if mobj.group(2) and int(mobj.group(2)[1])>1:
        print('{0} is version #{1}'.format(mobj.group(1),mobj.group(2)[1]))

match to adz22510.1
match to AEV67086.1
match to CBL17440
match to EIM57503.2
EIM57503 is version #2
match to AAC19169.1
match to afY52522.1
match to AAa23220.1
match to ADZ22510
match to abw39335.1
match to fgu98722.3
fgu98722 is version #3
match to adz22510
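
The task asks how many records have a version above 1, so a count is a natural last step. The sketch below is one possible way to get it, using a few lines copied from the listing. It assumes a slightly different pattern from the cell above: the dot sits outside group 2, and \d+ is used so that versions with more than one digit would also be captured. The names sample_text, versioned_regex, and newer are hypothetical.

```python
import re

# A few lines copied from the listing above
sample_text = """adz22510.1
EIM57503.2
AAC19169.1
ADZ22510
fgu98722.3
"""

# Group 1 = accession code, group 2 = version digits (dot required, kept outside the group).
# Accessions without a version are skipped, since they cannot be version > 1.
versioned_regex = re.compile(r'([A-Za-z]{3}\d{5})\.(\d+)')

newer = [(m.group(1), int(m.group(2)))
         for m in versioned_regex.finditer(sample_text)
         if int(m.group(2)) > 1]

print(len(newer), newer)
```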


# 2. A Few Regular Expression Patterns

### Metacharacters and Special Sequences
See the python regular expression [how-to link](https://docs.python.org/3.6/howto/regex.html#simple-patterns)

Regular expressions use a combination of characters, metacharacters, and special sequences: _([A-Za-z]{3}\d{5})(\.\d)?_

Most characters simply match themselves, e.g., a regular expression compiled from 'stuf' would match 'stuf', 'stuf8', etc.

Metacharacters are the exception, since each has a special meaning: . ^ $ * + ? { } [ ] \ | ( )
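
A short sketch of the difference between ordinary characters and metacharacters, using the 'stuf' example and the dot. The strings 'stu-f', 'abc', and 'a.c' are made up for illustration.

```python
import re

# Ordinary characters match themselves
stuf_re = re.compile(r'stuf')
found_plain = stuf_re.search('stuf8') is not None    # 'stuf' occurs inside 'stuf8'
found_broken = stuf_re.search('stu-f') is not None   # no contiguous 'stuf' here

# '.' is a metacharacter (any character except a newline);
# '\.' is an escaped, literal dot, as used in the accession pattern
any_char_re = re.compile(r'a.c')
literal_dot_re = re.compile(r'a\.c')

print(found_plain, found_broken)
print(any_char_re.search('abc') is not None, literal_dot_re.search('abc') is not None)
print(literal_dot_re.search('a.c') is not None)
```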

Special sequences: \d \D \s \S \w \W
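
To see what the six common special sequences (\d \D \s \S \w \W) pick out, the sketch below runs each one over a small made-up string. Each upper-case sequence is the complement of its lower-case partner; \w covers letters, digits, and the underscore.

```python
import re

sample = 'ID 42_x!'   # hypothetical sample string

# Collect every single-character hit for each special sequence
hits = {seq: re.findall(seq, sample)
        for seq in [r'\d', r'\D', r'\s', r'\S', r'\w', r'\W']}

for seq, found in hits.items():
    print(seq, found)
```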

#### Metacharacters:
* [ ] enclose a set of characters; the set matches any one of them
* ^ matches the beginning of a string

In [16]:
#an example: 'match' only checks the beginning of the string
str1='stuffaroo'
str2='bstuffaroo'
abcre=re.compile('[abc]')
print(abcre.match(str1))
print(abcre.match(str2))

None
<_sre.SRE_Match object; span=(0, 1), match='b'>


#### Search vs Match

In [14]:
#side note, 'search' looks for your pattern _anywhere_
print(abcre.search(str1))
print(abcre.search(str2))

<_sre.SRE_Match object; span=(5, 6), match='a'>
<_sre.SRE_Match object; span=(0, 1), match='b'>


In [15]:
#'match' = 'search' with prepended ^
abc_startre=re.compile('^[abc]')
print(abc_startre.search(str1))
print(abc_startre.search(str2))

None
<_sre.SRE_Match object; span=(0, 1), match='b'>
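
A small sketch to check the equivalence on the two strings above, plus one caveat: with the re.MULTILINE flag, ^ can also match right after a newline, so 'search' with ^ and 'match' are no longer the same thing. The two-line string and the name multiline_re are made up for illustration.

```python
import re

abcre = re.compile(r'[abc]')
abc_startre = re.compile(r'^[abc]')

# Default mode: match() and search() with a leading ^ agree
agree = [(abcre.match(s) is not None) == (abc_startre.search(s) is not None)
         for s in ['stuffaroo', 'bstuffaroo']]
print(agree)

# Hypothetical two-line string whose second line starts with 'b'
two_lines = 'stuffaroo\nbstuffaroo'
multiline_re = re.compile(r'^[abc]', re.MULTILINE)

at_start = abcre.match(two_lines)             # match() only tries position 0
any_line = multiline_re.search(two_lines)     # ^ may also match after the newline
print(at_start, any_line.span())
```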


#### More Metacharacters and Let's Throw in a Couple Special Sequences
* \* matches 0 or more repetitions of the preceding pattern
* \+ matches 1 or more repetitions of the preceding pattern
* ( ) used for grouping, as shown in the accession example
* \\s matches any whitespace character [ \t\n\r\f\v]
* \\d matches any decimal digit [0-9]

In [22]:
"""example: you want to find a mention of some number of beans,
   which may or may not be cool, in a string/file"""
cbre=re.compile(r'\d+\s*(cool)*\s*(beans)+')
str3='17 cool beans,thats a lot'
str4='17coolbeans'
str5='17      coolbeans'
str6='21 beans'
str7='something like 15 beans'
str8='something like 15 cool '
str9='no beans'
print(cbre.search(str3))
print(cbre.search(str4))
print(cbre.search(str5))
print(cbre.search(str6))
print(cbre.search(str7))
print(cbre.search(str8))
print(cbre.search(str9))

<_sre.SRE_Match object; span=(0, 13), match='17 cool beans'>
<_sre.SRE_Match object; span=(0, 11), match='17coolbeans'>
<_sre.SRE_Match object; span=(0, 17), match='17      coolbeans'>
<_sre.SRE_Match object; span=(0, 8), match='21 beans'>
<_sre.SRE_Match object; span=(15, 23), match='15 beans'>
None
None
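
Because (cool) is a group, the match object can also report whether the beans were cool. The sketch below reuses the same pattern on two of the strings above; the names texts and report are hypothetical.

```python
import re

cbre = re.compile(r'\d+\s*(cool)*\s*(beans)+')

texts = ['17 cool beans,thats a lot', '21 beans']
report = []
for text in texts:
    mobj = cbre.search(text)
    # group(1) is None when the optional (cool)* group matched zero times
    is_cool = mobj.group(1) is not None
    report.append((mobj.group(0), is_cool))

print(report)
```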
