In [None]:
%%HTML
<style>
.rendered_html table, .rendered_html th, .rendered_html tr, .rendered_html td {
     font-size: 100%;
}
</style>

## Regular Expressions: search text with wildcards

As text processing gets more complex, the tools above become increasingly difficult or inconvenient to use.

How would we extract information (parse) from an XML file?

e.g. BLAST results:

    <hitid>3</hitid>
    <hitscore>127</hitscore>

In [None]:
example = "<hitid>3</hitid>"


Regular expressions are a way of describing text patterns.

* (Very) complex matches can be created in a single line.

* The syntax is similar across languages.

* Some implementations (including Python) make extracting parts of the match easy.

We have already met regular expressions: the patterns we used in grep and awk, e.g.

    grep ^prot datafile

or

    awk '$1 ~ /prot$/ {print $0}' datafile

Here `^prot` and `prot$` are regular expressions.


In Python we can use regular expressions through the `re` module. To search for a regular expression we use `re.search`.
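
A minimal sketch of `re.search` in use. Because `datafile` is not available here, the lines are held in a hypothetical list called `lines`; the pattern `^prot` is the one from the grep example.

```python
import re

# Hypothetical stand-in for the lines of "datafile"
lines = ["protein kinase", "a protein", "protease", "lipid"]

# Keep only the lines that start with "prot", like: grep ^prot datafile
starts_with_prot = [line for line in lines if re.search("^prot", line)]
print(starts_with_prot)
```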

In [None]:
#!/usr/bin/env python

import re
data = open("datafile")


### Side note: Raw strings
 

In [None]:
a = "hello"


In [None]:
b = "hello\n"


In [None]:
c = r"hello\n"


Putting `r` in front of a string makes it a raw string. This stops Python from interpreting backslash sequences such as `\n`, which regular expressions also use for their own purposes.
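
One way to see the difference is to compare the lengths of the two strings. This is only a small sketch; the names `plain` and `raw` are chosen for illustration.

```python
plain = "hello\n"   # backslash-n becomes a single newline character
raw = r"hello\n"    # backslash and n are kept as two separate characters

print(len(plain))   # 6
print(len(raw))     # 7
print(raw)          # hello\n
```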

### Matches with wildcards

Does the sequence contain the motif `AAAA`?

In [None]:
dna_sequence = 'TGACCCGGTAAGAGCGATAGCGCATACGAGAAAAGCTCCTAGGGCAAAGAGCATA'
motif = "AAAA"


What about if the second base can be either `A` or `T` (so either `AAAA` or `ATAA`)?

In [None]:
dna_sequence = 'TGACCCGGTAAGAGCGATAGCGCATACGAGAAAAGCTCCTAGGGCAAAGAGCATA'
motif = "AAAA"
motif2 = "ATAA"
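
Without regular expressions, each variant has to be tested separately. A possible sketch using `in` (the variable names `has_motif`, `has_motif2` and `has_either` are made up for this example):

```python
dna_sequence = 'TGACCCGGTAAGAGCGATAGCGCATACGAGAAAAGCTCCTAGGGCAAAGAGCATA'
motif = "AAAA"
motif2 = "ATAA"

has_motif = motif in dna_sequence      # True: AAAA is present
has_motif2 = motif2 in dna_sequence    # False: ATAA is not in this sequence
has_either = has_motif or has_motif2

print(has_motif, has_motif2, has_either)
```

Every extra variable position doubles the number of strings to test this way, which is where wildcards start to pay off.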


What about if the third base can also be `A` or `C`? (so one of `AAAA`, `ATAA`, `AACA` or `ATCA`)

In [None]:
import re
dna_sequence = 'TGACCCGGTAAGAGCGATAGCGCATACGAGATAAGCTCCTAGGGCAAAGAGCATA'


In a regular expression we can allow a set of characters at a position by putting them between `[]`
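
As a sketch, the four motifs `AAAA`, `ATAA`, `AACA` and `ATCA` can be written as one pattern with two character sets. The `.group()` method used here returns the text that was matched.

```python
import re

dna_sequence = 'TGACCCGGTAAGAGCGATAGCGCATACGAGATAAGCTCCTAGGGCAAAGAGCATA'

# A, then A or T, then A or C, then A
match = re.search("A[AT][AC]A", dna_sequence)

print(match.group())   # ATAA
print(match.start())   # 30 (position of the first matched base, counting from 0)
```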

We can also search for *any character* using a `.` (full stop):

In [None]:
dna_sequence = 'TGACCCGGTAAGAGCGATAGCGCATACGAGATAAGCTCCTAGGGCAAAGAGCATA'



And we can match one-or-more of a particular character using `+`:

In [None]:
dna_sequence = 'TGACCCGGTAAGAGCGATAGCGCATACGAGATAAGCTCCTAGGGCAAAGAGCATA'


(`*` means zero-or-more.)
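
A small sketch comparing `.`, `+` and `*` on the same sequence. The three patterns are chosen purely for illustration.

```python
import re

dna_sequence = 'TGACCCGGTAAGAGCGATAGCGCATACGAGATAAGCTCCTAGGGCAAAGAGCATA'

any_base = re.search("A.A", dna_sequence)       # A, any one character, A
one_or_more = re.search("CA+G", dna_sequence)   # C, at least one A, G
zero_or_more = re.search("CA*G", dna_sequence)  # C, any number of A (even none), G

print(any_base.group())       # AGA
print(one_or_more.group())    # CAAAG
print(zero_or_more.group())   # CG
```

Note that `CA*G` finds `CG` early in the sequence, because zero `A`s is allowed, whereas `CA+G` has to wait for `CAAAG`.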

Going back to parsing BLAST results: what pattern would match these lines?

    <hitid>3</hitid>
    <hitscore>127</hitscore>

In [None]:
pattern = "<hit.+>[0-9]+</hit.+>"
xml_line = "<hitid>3</hitid>"
# non-case sensitive match
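
A sketch of how this pattern behaves, with and without case sensitivity. The upper-case line is a made-up example, and `re.IGNORECASE` is the flag in the `re` module that switches off case sensitivity.

```python
import re

pattern = "<hit.+>[0-9]+</hit.+>"
xml_lines = ["<hitid>3</hitid>", "<hitscore>127</hitscore>", "<HITID>3</HITID>"]

# Case-sensitive search: the upper-case line does not match
sensitive = [bool(re.search(pattern, line)) for line in xml_lines]

# With re.IGNORECASE, letters match regardless of case
insensitive = [bool(re.search(pattern, line, re.IGNORECASE)) for line in xml_lines]

print(sensitive)     # [True, True, False]
print(insensitive)   # [True, True, True]
```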


We can specify a range of characters, rather than listing each one, using a dash:

`[0-9]` is the digits zero to nine <br>
`[A-Z]` is the upper-case letters A through Z <br>
`[A-Za-z0-9]` is any normal letter or number (but not symbols) <br>
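
A short sketch of the three ranges in action. It uses `re.findall`, which returns every non-overlapping match as a list; that function is not otherwise covered in this section, and the input strings are made up.

```python
import re

digits = re.findall("[0-9]+", "<hitscore>127</hitscore>")
capitals = re.findall("[A-Z]+", "SPLUNC14")
alphanumeric = re.findall("[A-Za-z0-9]+", "hit_id: 3!")

print(digits)         # ['127']
print(capitals)       # ['SPLUNC']
print(alphanumeric)   # ['hit', 'id', '3']
```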

### Capture groups

To extract parts of the match for later use, we put round brackets `()` around the parts we want:

In [None]:
pattern = "<hit(.+)>([0-9]+)</.+>"
xml_line = "<hitid>3</hitid>"



The return value of `re.search` is a "match object":

![](matchobj.PNG)

It contains details about the match that was found.

Like strings and filehandles, match objects have methods, one of which is `.groups()`.

`.groups()` returns the parts of the match that we wanted to retrieve.

In [None]:
pattern = "<hit(.+)>([0-9]+)</.+>"
xml_line = "<hitid>3</hitid>"
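
A sketch of what `.groups()` gives back for this line. The names `name` and `value` are chosen for illustration; note that both captured parts are strings, even the number.

```python
import re

pattern = "<hit(.+)>([0-9]+)</.+>"
xml_line = "<hitid>3</hitid>"

match = re.search(pattern, xml_line)
print(match.groups())           # ('id', '3')

# The tuple can be unpacked into separate variables
name, value = match.groups()
print(name, int(value))
```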


![](captured_groups.PNG)

### Aside 2: Truthiness

In [None]:
pattern = "<hit(.+)>([0-9]+)</.+>"
xml_line = "<hitid>bob</hitid>"



Why did that work? Isn't `re.search` supposed to return a match object, not True or False?

In [None]:
m_obj = re.search(pattern, xml_line)


If a match is found, `re.search` returns a match object. If not, it returns `None`.

As far as `if` is concerned, `None` means `False`.


`0`, `""`, `[]`, `{}` are all equivalent to `False` as far as `if` is concerned.

Anything else is considered `True` (e.g. a match object)
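
These rules can be checked with the built-in `bool()`, which applies the same test that `if` uses. This is only a sketch; the list contents are arbitrary examples.

```python
import re

pattern = "<hit(.+)>([0-9]+)</.+>"

found = re.search(pattern, "<hitid>3</hitid>")       # a match object
missing = re.search(pattern, "<hitid>bob</hitid>")   # None: "bob" is not digits

falsy = [bool(v) for v in [missing, 0, "", [], {}]]
truthy = [bool(v) for v in [found, 1, "a", [0]]]

print(falsy)    # [False, False, False, False, False]
print(truthy)   # [True, True, True, True]
```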

In [None]:
def truthiness(x):
    ''' This function returns True when if would
    regard x as true and False when if would regard
    x as false'''
    if x:
        return True
    else:
        return False


`None` has no useful methods (definitely not `.groups()`), so we must check that a match was found before calling `.groups()`.
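
A sketch of what goes wrong without that check. The `try`/`except` is only here so that the example runs to the end and shows the error message.

```python
import re

pattern = "<hit(.+)>([0-9]+)</.+>"
search_result = re.search(pattern, "<hitid>bob</hitid>")   # no match, so None

try:
    search_result.groups()
except AttributeError as error:
    message = str(error)
    print("Calling .groups() failed:", message)
```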

In [None]:
import re
def parse_xml(xml_line):
    '''Parse an XML line and print the title of the tag
    and its value if the line is valid. Otherwise
    print a warning.'''
    pattern = r"<hit(.+)>([0-9]+)</.+>"
    search_result = re.search(pattern, xml_line)
    # re.search returns None (treated as False) when nothing matches
    if search_result:
        captured_groups = search_result.groups()
        print("The value of %s is %s" % captured_groups)
    else:
        print("WARNING: %s is not a valid line" % xml_line)

### Negative matches

Putting a range of characters between `[]` means "any of these chars".

Putting `^` at the start of the set means "anything other than these chars".

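e.g. the pattern `prot[^s]` matches `protein` and `deproteinate`, but not `prots`. It does not match a bare `prot` either, because `[^s]` must match one character. Adding `?` does not fix this: `prot[^s]?` matches `prots` as well, since the set is then allowed to match nothing.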
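
A quick sketch to check how a negated set behaves with and without a trailing `?`. The word list and the names `strict` and `optional` are made up for illustration.

```python
import re

words = ["prot", "protein", "deproteinate", "prots"]

# [^s] must match exactly one character that is not "s"
strict = [w for w in words if re.search("prot[^s]", w)]

# With ?, the set may match nothing, so "prot" alone is enough
optional = [w for w in words if re.search("prot[^s]?", w)]

print(strict)     # ['protein', 'deproteinate']
print(optional)   # ['prot', 'protein', 'deproteinate', 'prots']
```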

### Greedy matching

Consider the pattern

    <hitid>(.*)</hitid>
    
and the line:

    <hitid>1</hitid>abcdefghedg<hitid>2</hitid>


There are three possible ways that the group could be captured.

    <hitid>1</hitid>abcdefghedg<hitid>2</hitid>
    1:     ^
    2:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    3:                                ^

In [None]:
line = "<hitid>1</hitid>abcdefghedg<hitid>2</hitid>"
pattern = "<hitid>(.*)</hitid>"


By default, regular expression matches are **greedy**. This means that the matches will be as long as possible.

Non-greedy matching is possible:

|  Match         |  Greedy  | Non-greedy |
| -------------- | -------- | ---------- |
|  one or more   |   `.+`   |    `.+?`   |
|  zero or more  |   `.*`   |    `.*?`   |
|  zero or one   |   `.?`   |    `.??`   |

In [None]:
line = "<hitid>1</hitid>abcdefghedg<hitid>2</hitid>"
pattern = "<hitid>(.*?)</hitid>"
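
A sketch comparing the two patterns on the same line. `re.findall` (not otherwise covered here) is used at the end to collect every non-greedy match.

```python
import re

line = "<hitid>1</hitid>abcdefghedg<hitid>2</hitid>"

greedy = re.search("<hitid>(.*)</hitid>", line).groups()
non_greedy = re.search("<hitid>(.*?)</hitid>", line).groups()
all_ids = re.findall("<hitid>(.*?)</hitid>", line)

print(greedy)       # ('1</hitid>abcdefghedg<hitid>2',)
print(non_greedy)   # ('1',)
print(all_ids)      # ['1', '2']
```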


### Substitution

`re.sub` replaces each match of a regex with a replacement string:

In [None]:
import re
dna_sequence = 'TGACCCGGTAAGAGCGATAGCGCATACGAGATAAGCTCCTAGGGCAAAGAGCATA'
motif = 'A[TA][GA]A'
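
A sketch of one possible use: masking every occurrence of the motif with `NNNN`. The replacement text and the variable names are assumptions made for this example; `count=1` limits the substitution to the first match.

```python
import re

dna_sequence = 'TGACCCGGTAAGAGCGATAGCGCATACGAGATAAGCTCCTAGGGCAAAGAGCATA'
motif = 'A[TA][GA]A'

matches = re.findall(motif, dna_sequence)              # what will be replaced
masked = re.sub(motif, "NNNN", dna_sequence)           # replace every match
first_only = re.sub(motif, "NNNN", dna_sequence, count=1)

print(matches)   # ['AAGA', 'ATAA', 'AAGA']
print(masked)
print(first_only)
```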


## Exercise

Given a file containing a list of protein names:

<SAMP>
    BPI23
    SPLUNC14
    SPLUNC3
    LBPBPI
    CETP174
</SAMP>

Write a program to parse these names into a root or "family name" (e.g. "BPIL") and a number (e.g. 2).

## Advice on regular expressions

* Almost anything you can do with a regular expression, you can do with `in`, `split()`, `index()` and slicing (e.g. `a[3:6]`)

* Sometimes a regular expression is a more obvious way of expressing what you are trying to do.

* Regular expressions get very complicated very quickly, and they can be hard to debug. <br><br>
  Remember the following rules for writing good python:<br>
     * Explicit is better than implicit. <br>
     * Simple is better than complex. <br>
     * Complex is better than complicated. <br>
     * Readability counts. <br>
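
As a sketch of the first point in the list, here is the BLAST line from earlier parsed with `index()` and slicing instead of a regular expression. The variable names are chosen for illustration, and the code assumes the line is well formed.

```python
xml_line = "<hitid>3</hitid>"

start = xml_line.index(">") + 1   # first character after the opening tag
end = xml_line.index("</")        # where the closing tag begins

tag = xml_line[1:start - 1]       # text between "<" and ">"
value = xml_line[start:end]       # text between the two tags

print(tag, value)                 # hitid 3
```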

Examples of difficult regexes:

`^(?:(?:(?:0?[13578]|1[02])(\/|-|\.)31)\1|(?:(?:0?[13-9]|1[0-2])(\/|-|\.)(?:29|30)\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:0?2(\/|-|\.)29\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:0[48]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:(?:0?[1-9])|(?:1[0-2]))(\/|-|\.)(?:0?[1-9]|1\d|2[0-8])\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$`
    

This checks if a date is valid!

Check for a valid email:
```
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ 
\t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
 \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
)*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
)+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
[ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
.\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] 
\000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
)*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
"()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
?:\r\n)?[ \t])*))*)?;\s*)
```


* When you write output, choose one of two ways:

     1. Write output in an easily parsed format
         * Keywords at the start of lines can help
         * Keep all things about one entry together on one line.
         * Columns are good for separating things - use commas or `<tab>` to separate columns rather than spaces <br><br>
         
     2. Write output in an existing common format for which a parsing module exists.
         * Many file formats already exist
         * Often someone will have written a module to both read and write that filetype
         * In particular, if you need to use `FASTA`, `BED`, `GTF` or `JSON` files, ask.
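
A minimal sketch of the first option: one entry per line, a keyword at the start, and tab-separated columns. The `hits` data and the `HIT` keyword are invented for this example.

```python
# Hypothetical results: (hit id, score) pairs
hits = [("3", 127), ("7", 98)]

# One entry per line, keyword first, columns separated by tabs
output_lines = ["HIT\t%s\t%d" % (hit_id, score) for hit_id, score in hits]

# Reading it back needs no regular expression: split each line on tabs
parsed = [line.split("\t") for line in output_lines]

print(output_lines)
print(parsed)   # [['HIT', '3', '127'], ['HIT', '7', '98']]
```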

## FASTA format

`FASTA` format is a format for storing DNA, RNA and protein sequences.

It consists of a title line marked with a `>`, followed by any number of sequence lines. For example, the following is the sequence of a single gene/genome/protein:

    >gene1
    AATAGACCGCGATAATAGCGAA
    ATTTTCAGGGCAAAGGCCCCAT

Each file can contain more than one sequence.

    >gene1
    AATAGACCGCGATAATAGCGAA
    ATTTTCAGGGCAAAGGCCCCAT
    >gene2
    AATAGACCGCGATAATAGCGAA
    ATTTTCAGGGCAAAGGCCCCAT
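
To make the structure concrete, here is a minimal sketch that reads records from a list of lines. It ignores edge cases such as blank lines or descriptions after the name, so it illustrates the format rather than replacing a real parser. The function name `read_fasta` is hypothetical.

```python
def read_fasta(lines):
    """Return a dict mapping each title to its full sequence."""
    sequences = {}
    title = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            title = line[1:]            # drop the ">" marker
            sequences[title] = ""
        elif title is not None:
            sequences[title] += line    # join the sequence lines together
    return sequences

fasta_lines = [
    ">gene1",
    "AATAGACCGCGATAATAGCGAA",
    "ATTTTCAGGGCAAAGGCCCCAT",
    ">gene2",
    "AATAGACCGCGATAATAGCGAA",
    "ATTTTCAGGGCAAAGGCCCCAT",
]

records = read_fasta(fasta_lines)
print(list(records))   # ['gene1', 'gene2']
```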

`FASTA` is a pain to parse, but luckily someone has done it for us.

In [None]:
import Bio.SeqIO


The `Bio` module (Biopython) is not included with the standard installation of Python; it must be installed separately. Ask if you need it and we will help you install it.