# Introduction to Regular Expressions

## ⭐️ Why are regular expressions important?
Do you often work with loads of data files on the computer? Do you often try to spot particular files or lines of text in them that are important for you?

**Regular expressions** (or regex/REs) can help you save loads of time and frustration!


![regex](./assets/regex1.jpg)

## ⁇ What is a regular expression?
A regular expression (regex) is a string of characters defining a pattern that should be searched for in a block of text. A regex can be constructed to allow multiple possible matches, while restricting other possibilities, making the approach considerably more powerful and precise than the simple `CTRL+F` or `Find/Replace` operations that you might be familiar with from word processing/web browser software. It's also possible to use regular expressions to specify parts of the search pattern that will be kept in the replacement string of a substitution operation.

Let's see an illustrated example of a regex:
![](./assets/regex.jpg)

The above regex will look for:
- A character `B` followed by **at least** one of the following characters:
    - Characters between `a` and `z` 
    - Characters between `A` and `Z`
    - Characters between `0` and `9`
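
As a rough sketch, the description above can be tried out with Python's built-in `re` module. The image itself is not reproduced here, so the pattern `B[a-zA-Z0-9]+` is an assumption about what it shows, and the sample strings are made up for illustration.

```python
import re

# Assumed pattern: a literal 'B', then one or more letters or digits
pattern = r'B[a-zA-Z0-9]+'

# Made-up sample strings, not taken from the course files
samples = ['B12', 'Bob', 'B', 'B-7', 'aB9z']

# Keep only the strings that contain a match somewhere
matched = [s for s in samples if re.search(pattern, s)]
print(matched)  # ['B12', 'Bob', 'aB9z']
```

`'B'` and `'B-7'` are rejected because the `B` is not followed by at least one letter or digit.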


## 📑 Illustrative Example
Imagine that you have a long text document. The document contains a lot of numbers - years (2014, 2016, 1998, etc), other short numbers (9, 4, 58, etc), and a single phone number (i.e. eleven digits). You need to find the phone number, but it's buried in this large body of text, amongst many other numbers. You can't remember much about the phone number (like what digit(s) it starts with), so you can't just randomly search for strings of digits with `CTRL+F`/`Find/Replace`: there are far too many possible combinations (one hundred *billion* eleven-digit numbers!), and simply performing ten individual searches for the digits from `0` to `9` finds you a lot of unwanted results.

**What are your options?**

1. You *could* run one of these searches for an individual digit, then manually click through each result, until you eventually find the phone number. 
2. You could simply scroll through the document, scanning, by eye, for something that looks like a phone number. But that is likely to take a *long* time, is really boring, and anyway it's 4:55 on Friday afternoon and you need to find that phone number and get upstairs before all the pizza is gone!

3. **This is where a regular expression search will make things much easier for you!**

You know that you need to find eleven digits in succession, so you need to write an expression that represents eleven digits. Good news! With regular expressions, you can search for any digit using the 'token' `\d`. So, to find the phone number, you only need to provide the expression `\d\d\d\d\d\d\d\d\d\d\d`* to your regex search tool. The phone number will pop right out, leaving you with time to head upstairs and wait for the pizza to arrive!

\* the regular expression language actually allows you to avoid such a long string of the same token/character, but we will come to that later on.
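
Here is a minimal sketch of this search in Python, using the `re` module that is introduced properly later in this notebook. The sample text and the phone number in it are invented for illustration.

```python
import re

# Invented sample text: years, small numbers, and one eleven-digit number
text = 'In 2014 we had 58 calls, in 2016 only 9. Ring 01234567890 before 1998 ends.'

# The \d token written eleven times in a row; building it with * 11 is just
# a Python shortcut to avoid miscounting, the regex itself is unchanged
pattern = r'\d' * 11

# re.search returns the first match, or None if nothing matches
match = re.search(pattern, text)
phone = match.group() if match else None
print(phone)  # 01234567890
```

The shorter numbers are skipped because none of them offers eleven digits in succession.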

![xkcd](./assets/xkcd_regex.png)



## ⭐️ Regular Expression Engines
In order to perform a regex search, you need to provide the expression to a program that can interpret it. Such tools are known as regular expression engines. Unfortunately, there are multiple different 'flavours' of regex engines, which each work slightly differently. Fortunately, the core rules, which will probably address 99.9% of your text-searching needs, are pretty much common to all of them. It's these that we will cover in this course.

## How can I use Regular Expressions?
### Text Editors
As described above, you need a program that provides an engine to interpret regular expressions. Most text editors provide regex search/replace functionality. If you use Windows, we highly recommend the freely available editor 'Notepad++', available at [https://notepad-plus-plus.org](https://notepad-plus-plus.org). For Mac, TextWrangler is a good generic text editor, free from [http://www.barebones.com/products/TextWrangler/](http://www.barebones.com/products/TextWrangler/) or the App Store. For Windows, Mac, and Linux, Atom is a good option: [https://atom.io](https://atom.io). Finally, Sublime Text is another good option, available from [https://www.sublimetext.com](https://www.sublimetext.com).

Usually, you will need to specify that you want to use regular expression searching instead of the normal plain text searching that you are probably already familiar with. When you open up the `Find/Replace` interface in your editor, you should look for an option with a label like "mode", "grep-like", "regular expression" or "regex", and use this to control the type of searching that you want to do.


### Command Line
In addition, several Unix command line tools are commonly available that use regular expressions, for searching (e.g. `grep`/`egrep`) and replacement/substitution (e.g. `sed`, `awk`, `perl -e`).

### Online tools
A number of online tools can help you get started with regex. One example is:
- [https://regex101.com](https://regex101.com)


## ⬇️ How can I get the files to use in the exercises?
Download a Zip of the exercises [here](****)

---

## 🐍 Regex in Python
Since searching for string patterns is a very common task, Python has its own *regular expressions* library, `re`, that handles regex very elegantly. You can find the documentation for this library [here](https://docs.python.org/2/library/re.html).

Now let's get started!

In [1]:
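import re  # import the regular expressions library

# Open the basic examples file; this handle is read again in a later cell
examples = open('./example_files/basic_examples.txt', 'r')

# Read the same file line by line, closing this second handle when done
with open('./example_files/basic_examples.txt', 'r') as f:
    lines = [line.strip() for line in f]
lines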

['abcdefghijklmnopqrstuvwxyz',
 'aat bat cat dat eat fat gat hat iat jat kat lat mat nat oat pat qat rat sat tat uat vat wat xat yat zat',
 'AGCCGCTTCGAATCCGGGATCTAAGTCAATACTATTGACGCTACGGCTTAATAGATAGATAGCCCTACGATAGAT',
 'agtccgaatcgatcccatcgatttaagtggccatccccagcttacagtccaaatacatgtcaagtgatagtacct',
 'GTTAGCCGCTAGTTCGAAcattcggATACAGATCCTGACtacaTGACAGTCCAACTTTGGGCATCAGATACGggg',
 'MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGF',
 'the quick brown fox jumps over the lazy dog',
 'all these moments will be lost in time, like tears in the rain',
 'steve.jones@embl.de ann.carradine@embl.de jim.hutchins@google.com billyidol@rebel83.net ljg3112@newwave.co.uk',
 '#python #research #Mozilla',
 '2018-01-02',
 '2017-06-06',
 '2017-07-07']

---
### 📝 The basics

#### Search for a literal string

In [2]:
# Split the whole file on whitespace to get all the words into a list
words = examples.read().split()

In [3]:
for word in words:
    if re.search('cat', word):
        print(word)

cat
agtccgaatcgatcccatcgatttaagtggccatccccagcttacagtccaaatacatgtcaagtgatagtacct
GTTAGCCGCTAGTTCGAAcattcggATACAGATCCTGACtacaTGACAGTCCAACTTTGGGCATCAGATACGggg


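**Note how we are writing patterns for strings as strings.**

The next pattern uses the `|` character, which means 'or': `cat|dat` matches either `cat` or `dat`.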

In [4]:
for word in words:
    if re.search('cat|dat', word):
        print(word)

cat
dat
agtccgaatcgatcccatcgatttaagtggccatccccagcttacagtccaaatacatgtcaagtgatagtacct
GTTAGCCGCTAGTTCGAAcattcggATACAGATCCTGACtacaTGACAGTCCAACTTTGGGCATCAGATACGggg


---
#### Sets and ranges


---
#### ⚡️ In the spirit of efficiency
We are going to write a lot of patterns against our data, so let's write a function that will tell us which records match a particular pattern. Our function `show_matches` takes a pattern and a list of strings and, for each of those strings, prints the string if the pattern matches it.


In [5]:
def show_matches(pattern, strings):
    for i in strings:
        if re.search(pattern, i):
            print(i)

In [6]:
show_matches('cat|dat', words)

cat
dat
agtccgaatcgatcccatcgatttaagtggccatccccagcttacagtccaaatacatgtcaagtgatagtacct
GTTAGCCGCTAGTTCGAAcattcggATACAGATCCTGACtacaTGACAGTCCAACTTTGGGCATCAGATACGggg


👏🏼 Now we can go back to the ranges... a group of characters that can be matched in a certain position is specified by surrounding them with `[ ]`.

In [7]:
show_matches('[bcrmk]at', words)

bat
cat
kat
mat
rat
agtccgaatcgatcccatcgatttaagtggccatccccagcttacagtccaaatacatgtcaagtgatagtacct
GTTAGCCGCTAGTTCGAAcattcggATACAGATCCTGACtacaTGACAGTCCAACTTTGGGCATCAGATACGggg


**What if we want any upper case letter followed by 'AT'?**

In [8]:
show_matches('[A-Z]AT', words)

AGCCGCTTCGAATCCGGGATCTAAGTCAATACTATTGACGCTACGGCTTAATAGATAGATAGCCCTACGATAGAT
GTTAGCCGCTAGTTCGAAcattcggATACAGATCCTGACtacaTGACAGTCCAACTTTGGGCATCAGATACGggg
MEFSSPSREECPKPLSRVSIMAGSLTGLLLLQAVSWASGARPCIPKSFGYSSVVCVCNATYCDSFDPPTFPALGTFSRYESTRSGRRMELSMGPIQANHTGTGLLLTLQPEQKFQKVKGF


> #### Exercise
> a) Which of the following expressions could you use to match any four-letter word beginning with an uppercase letter, followed by two vowels, and ending with 'd'?  
> 
> ```	
> 	i) [upper][vowel][vowel]d  
> 
> 	ii) [A-Z][a-u][a-u]d
> 
> 	iii) [A-Z][aeiou][aeiou]d
> 
> 	iv) [A-Z][aeiou]*2;d
> ``` 
> b) Try playing around with the character ranges that you have learned above. What does `[A-z]` evaluate to? What about `[a-Z]`? Can you create a range that covers all letter and number characters?

Ranges don't have to include the whole alphabet or every digit - for example, we can match only the second half of the lowercase alphabet with `[n-z]`.
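
As a small sketch of partial ranges, the snippet below uses `[n-z]` for the second half of the lowercase alphabet and `[0-4]` for the lower digits. It relies on `re.findall`, which returns every non-overlapping match as a list. The input strings are chosen only for illustration.

```python
import re

# Letters from the second half of the lowercase alphabet only
late_letters = re.findall('[n-z]', 'regex')
print(late_letters)  # ['r', 'x']

# Digits from 0 to 4 only, taken from one of the example dates
low_digits = re.findall('[0-4]', '2017-06-06')
print(low_digits)  # ['2', '0', '1', '0', '0']
```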


#### Inverted sets
We can also specify a set or range of characters to _not_ match in a position, by placing `^` directly after the opening `[`.

In [9]:
show_matches('[^cf]at', words)

aat
bat
dat
eat
gat
hat
iat
jat
kat
lat
mat
nat
oat
pat
qat
rat
sat
tat
uat
vat
wat
xat
yat
zat
agtccgaatcgatcccatcgatttaagtggccatccccagcttacagtccaaatacatgtcaagtgatagtacct
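
The lowercase DNA line still appears in this output even though it contains `cat`. The sketch below, which uses only the first twenty characters of that line, suggests why: `re.search` succeeds as soon as *any* part of the string matches, and the line contains other triplets ending in `at`. It also shows that an inverted set still has to match a character.

```python
import re

# First twenty characters of the lowercase DNA example line
dna = 'agtccgaatcgatcccatcg'

# Every match where 'at' follows something other than 'c' or 'f'
hits = re.findall('[^cf]at', dna)
print(hits)  # ['aat', 'gat']

# 'cat' on its own is rejected, and so is a bare 'at':
# [^cf] must match one character, it cannot match nothing
print(re.search('[^cf]at', 'cat'))  # None
print(re.search('[^cf]at', 'at'))   # None
```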


### 📝 Tokens and wildcards
####  Referencing multiple characters

In the introductory example we introduced the `\d` token to represent any digit. So the following two expressions are equivalent:
```
[0-9]
\d
```
Tokens in general are shorthand for standard, commonly used character sets/classes. The table below describes commonly used tokens available in regex.

Token | Matches                                                 | Set Equivalent |
------|---------------------------------------------------------|----------------|
`\d`  | Any digit                                               | `[0-9]`        |
`\s`  | Any whitespace character (space, tab, newlines)         | `[ \t\n\r\f]`  |
`\w`  | Any 'word character' - letters, numbers, and underscore | `[A-Za-z0-9_]` |
`\b`  | A word boundary (a position rather than a character)    | n/a            |
`\D`  | The inverse of `\d` i.e. any character except a digit   | `[^0-9]`       |
`\S`  | Any non-whitespace character                            | `[^ \t\n\r\f]` |
`\W`  | Any non-word character                                  | `[^A-Za-z0-9_]`|


Because the backslash is what introduces a token, it needs special treatment itself - you can specify that you want to match a literal backslash by preceding that backslash character with - you guessed it! - a backslash, i.e. with `\\`.
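
The tokens can be tried out in Python with a short sketch like the one below. The sample string is invented for illustration. The patterns are written as raw strings (`r'...'`) so that Python passes the backslashes through to the regex engine unchanged.

```python
import re

# Invented sample: contains a tab (\t) and one literal backslash (\\)
sample = 'run_7 took 42s\tC:\\tmp'

digits = re.findall(r'\d', sample)       # any digit
whitespace = re.findall(r'\s', sample)   # spaces and the tab
non_word = re.findall(r'\W', sample)     # anything that is not a word character
backslashes = re.findall(r'\\', sample)  # a literal backslash

print(digits)       # ['7', '4', '2']
print(whitespace)   # [' ', ' ', '\t']
print(non_word)     # [' ', ' ', '\t', ':', '\\']
print(backslashes)  # ['\\']
```

Note that the underscore in `run_7` counts as a word character, so it does not show up in the `\W` results.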

> #### Exercise 
> Match dates of the form 31.01.2017 (DAY.MONTH.YEAR) in the example file `person_info.csv`. Take care not to match phone numbers by mistake. How many matches do you find?


#### Word Boundaries

The `\d`, `\w`, and `\s` tokens are reasonably easy to understand - each one represents a clear set of characters. The `\b` token is more interesting - it is used to match what is referred to as a 'word boundary', and can be used to ensure matching of whole words only. For example, suppose we want to find every instance of 'chr1' and 'chr2' in the file `example.gff`. Using what we've already learned, we can design the regex

```
chr[12]
```

which will match either of the two target strings. However, this regex will also match all but the last character of 'chr13' and 'chr22', which is not what we want. How can we be sure that we will only match the two chromosome identifiers that we want, without additional digits on the end? We could add a space character to the end of the regex. But what if the target string appears at the end of a line? Or before a symbol/delimiter such as ';' or '.'? These strings will be missed by our regex ending with a space. 

This is where the `\b` token comes in handy. A word boundary is not a character itself but a position: the point between a word character (`\w`) and a non-word character (`\W`), or between a word character and the beginning or end of the string. This covers all of the options described above - spaces, symbols that might be used as field delimiters, periods and commas, newline characters, plus the beginning and end of a string, which have their own special regex characters `^` and `$` (more on these in a moment). So, by using the regex

```
\bchr[12]\b
```

we ensure that we will only get matches to 'chr1' and 'chr2' as whole words, regardless of whether they are flanked by spaces, symbols, or the beginning or end of a line.
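
The difference can be seen in a short Python sketch. The line of text below is invented in the spirit of `example.gff` rather than taken from the real file. One Python-specific caveat: the pattern should be a raw string (`r'...'`), because in an ordinary Python string `'\b'` means a backspace character, not a word boundary.

```python
import re

# Invented line for illustration; not taken from example.gff
line = 'chr1 gene;chr13 exon,chr2.chr22 chr2'

# Without boundaries, the start of 'chr13' and 'chr22' also matches
loose = re.findall(r'chr[12]', line)
print(loose)   # ['chr1', 'chr1', 'chr2', 'chr2', 'chr2']

# With \b on both sides, only the whole words match
strict = re.findall(r'\bchr[12]\b', line)
print(strict)  # ['chr1', 'chr2', 'chr2']

# Pitfall: without the r prefix, '\b' is a backspace, so nothing matches
no_raw = re.findall('\bchr[12]\b', line)
print(no_raw)  # []
```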
