## <font color='#B31B1B'> Regex </font>
Regular expressions (called regex for short) are patterns that match characters in strings. They are a mix of "ordinary" characters (like substrings you wish to match exactly) and "special" characters that allow for repetitions, combinations, and other interesting features.

Regular expressions are supported by several languages and command-line tools. For example, the `grep` utility in UNIX allows you to probe files for patterns using regular expression syntax, the `sed` utility allows you to perform substitutions using regular expressions, and so on. Python has a library called `re` for working with regular expressions as well. During this part of the course, we will be putting on our UNIX hat and working with command-line tools, but feel free to use Python to practice them in your spare time.

#### <font color='#B31B1B'> Set-up and Basics </font>

The most common use of regular expressions is filtering a collection of strings trying to find matches to a given pattern. Writing correct and unambiguous patterns is the essence of writing regular expressions.

Consider the simplest task of filtering through a set of strings, returning all those that contain the sequence "connor". For example:

In [1]:
%%bash
cat examples/example.txt

My name is connor and I like to write code.
My name is also Connor but I don't like to write code.
I spoke to Vasilis and he told me about Python.
I spoke to connol and he taught me about regular expressions.
asdfasdfasdfasdfconnorasdfasdf.


This file contains a collection of sentences (one per line), and we wish to output each line that contains the sequence "connor". To do so, we can write a simple grep command as follows:

In [4]:
%%bash
egrep "connor" examples/example.txt

My name is connor and I like to write code.
asdfasdfasdfasdfconnorasdfasdf.


**Why is this result a little unsatisfying?**

The `grep` utility works as follows: it treats the first argument as the pattern and the second argument (typically) as the input file. It applies the pattern to each line in the file, and prints all the lines that match. Note that patterns are case-sensitive; for example, we ignored the second line that contains the word "Connor" because the leading "c" must be lower case to match the pattern. It is good practice to enclose the pattern in double-quotes when using `grep` in a script.

**Note:** We are using `egrep` (equivalent to `grep -E`, which enables extended regular expressions) for reasons that will be clarified later; namely, to make sure that metacharacters are treated as expected.
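
As a small, self-contained sketch of the case-sensitivity point above, the cell below feeds two made-up lines to `grep` with `printf` instead of reading `examples/example.txt`, and uses `grep -E` as the modern spelling of `egrep`. The `-i` flag shown here is one way to ignore case; it is not used elsewhere in this section.

```bash
# Sample input (made up for illustration): two lines differing only in letter case.
lines=$(printf '%s\n' "My name is connor." "My name is also Connor.")

# Default behavior: patterns are case-sensitive, so only the first line matches.
sensitive=$(printf '%s\n' "$lines" | grep -E "connor")

# -i asks grep to ignore case, so both lines match.
insensitive=$(printf '%s\n' "$lines" | grep -E -i "connor")

echo "$sensitive"
echo "---"
echo "$insensitive"
```

Ignoring case for the whole pattern is a blunt instrument; the next section shows how to relax only a single position.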

#### <font color='#B31B1B'> Character Ranges </font>

To get around the above problem (matching "connor" but not "Connor"), we introduce a fundamental construct: character ranges. If there is a position in your pattern where more than one character is acceptable, you can enclose the set of acceptable characters in square brackets:

In [6]:
%%bash
egrep "[Cc]onnor" examples/example.txt

My name is connor and I like to write code.
My name is also Connor but I don't like to write code.
asdfasdfasdfasdfconnorasdfasdf.


When using a character range, there are some tricks to simplify the resulting pattern. For example, if you want to match any digit between 0 and 9, you can write `[0123456789]` or `[0-9]` - the two are equivalent. The same is true for `[abcdefghijklmnopqrstuvwxyz]` and `[a-z]`. If you want to be case-insensitive, you can also mix the two: `[a-zA-Z]` will match any letter between "a" and "z" as well as their capital versions.
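
The shorthand above can be tried on a few made-up tokens. This is only a sketch: the input is generated with `printf` rather than taken from the course files, and the variable names are arbitrary.

```bash
# Hypothetical sample input, one token per line.
tokens=$(printf '%s\n' "abc" "ABC" "123" "a1" "---")

# [0-9] matches any line containing at least one digit.
with_digit=$(printf '%s\n' "$tokens" | grep -E "[0-9]")

# [a-zA-Z] matches any line containing at least one ASCII letter of either case.
with_letter=$(printf '%s\n' "$tokens" | grep -E "[a-zA-Z]")

echo "$with_digit"    # 123 and a1
echo "---"
echo "$with_letter"   # abc, ABC and a1
```

Note that a bracket expression always stands for exactly one character of the input, no matter how many characters are listed inside it.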

What's going on here?

In [8]:
%%bash
egrep "[CcONnNnOoRr]" examples/example.txt

My name is connor and I like to write code.
My name is also Connor but I don't like to write code.
I spoke to Vasilis and he told me about Python.
I spoke to connol and he taught me about regular expressions.
asdfasdfasdfasdfconnorasdfasdf.


What if we want to exclude a set of characters from our pattern? In this case we can use the caret (^) inside the square brackets. For example:

In [10]:
%%bash
egrep "conno[^r]" examples/example.txt

I spoke to connol and he taught me about regular expressions.


Here, we match all strings containing the characters "conno" immediately followed by any character other than "r". If there is more than one character you wish to exclude, you can add them all inside the same brackets following the caret.

In [15]:
%%bash
grep "conno[^lmnop]" examples/example.txt

My name is connor and I like to write code.
asdfasdfasdfasdfconnorasdfasdf.
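
One detail of negated brackets is easy to miss: `[^r]` still has to match some character, so a line that simply ends after "conno" does not match. The sketch below illustrates this on made-up input (the names are hypothetical, not taken from `examples/example.txt`).

```bash
# Hypothetical sample input.
names=$(printf '%s\n' "connor" "connol" "connom" "conno")

# [^r] requires one more character after "conno", and it must not be "r".
# The bare "conno" line has nothing left to match, so it is not printed.
not_r=$(printf '%s\n' "$names" | grep -E "conno[^r]")

# Several characters can be excluded at once.
not_lmnop=$(printf '%s\n' "$names" | grep -E "conno[^lmnop]")

echo "$not_r"       # connol and connom
echo "---"
echo "$not_lmnop"   # connor
```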


#### <font color='#B31B1B'> Metacharacters </font>

In the above, the brackets as well as the caret are so-called metacharacters, i.e., characters that take on a special function and meaning inside regexes. If we want to match the metacharacter itself, we typically add a backslash in front of it (something referred to as "escaping" the character). Note the difference between the following two:

In [35]:
%%bash
echo $(egrep "\[connor\]" examples/example.txt) # ignore echo - it's only here to avoid a CalledProcessError, since egrep exits with status 1 when no line matches




In [28]:
%%bash
egrep "[connor]" examples/example.txt

My name is connor and I like to write code.
My name is also Connor but I don't like to write code.
I spoke to Vasilis and he told me about Python.
I spoke to connol and he taught me about regular expressions.
asdfasdfasdfasdfconnorasdfasdf.


In the first example, we escape `[` and `]` in order to indicate that we want to treat them as ordinary characters and match the substring `"[connor]"`. No line in the file contains that substring, so nothing is printed. In the second example, we are not escaping them and instead end up with a character range that will match any character from the set `{c,n,o,r}`.

**Note:** forgetting to escape a metacharacter is one of the most common mistakes for newcomers to regular expressions. Make sure you remember the ones you learn!
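
To make the effect of escaping concrete, here is a minimal sketch on made-up input. It uses the dot, a metacharacter that matches any single character and that appears escaped as `\.` later in this section. The patterns are written in single quotes so that the shell passes the backslashes to `grep` untouched.

```bash
# Hypothetical sample input.
lines=$(printf '%s\n' "file.txt" "fileXtxt" "[connor]" "connor")

# Unescaped dot: matches any single character, so both "file" lines match.
any_char=$(printf '%s\n' "$lines" | grep -E 'file.txt')

# Escaped dot: matches only a literal ".".
literal_dot=$(printf '%s\n' "$lines" | grep -E 'file\.txt')

# Escaped brackets: match the literal text "[connor]".
literal_brackets=$(printf '%s\n' "$lines" | grep -E '\[connor\]')

echo "$any_char"
echo "---"
echo "$literal_dot"
echo "---"
echo "$literal_brackets"
```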

Here is another metacharacter: the so-called **Kleene star** (*). The star operator indicates that the preceding character can be "matched" as many ($\geq 0$) times as necessary. Consider, for example, trying to match all strings of the form "hello", "helllo", "hellllo" etc. Here, the words we are looking for start with "he", followed by at least 2 "l" characters and the character "o" last. A first attempt is to write "he", then "l" followed by a star, then "o":

In [36]:
%%bash
cat examples/example_star.txt

heo
hello
helllo
hellllo
helo

In [43]:
%%bash
egrep "hel*o" examples/example_star.txt

heo
hello
helllo
hellllo
helo


Here, we are telling grep to match any string containing "he" followed by any number of occurrences of "l" (including none), followed by "o". That is why "heo" and "helo" are printed too, even though they are not among the words we were after. A similar operator to the Kleene star is the Kleene plus (+), which matches at least one occurrence of the preceding character (recall that * can match as few as zero of them). For example:

In [41]:
%%bash
egrep "hel+o" examples/example_star.txt

hello
helllo
hellllo
helo
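
Neither `hel*o` nor `hel+o` expresses the original goal of "at least 2 "l" characters". One possible way to write it with the operators seen so far is to spell out the first "l" and put the plus on the second; the sketch below compares the three patterns on made-up input.

```bash
# Hypothetical sample input.
words=$(printf '%s\n' "heo" "helo" "hello" "helllo")

# l* : zero or more "l", so every line matches.
star=$(printf '%s\n' "$words" | grep -E "hel*o")

# l+ : one or more "l", so "heo" is dropped.
plus=$(printf '%s\n' "$words" | grep -E "hel+o")

# One literal "l" followed by l+ : two or more "l" in total.
two_or_more=$(printf '%s\n' "$words" | grep -E "hell+o")

echo "$star"
echo "---"
echo "$plus"
echo "---"
echo "$two_or_more"
```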


Another useful construct is specifying the number of occurrences explicitly. The general syntax for that is `<character>{lower_bound,upper_bound}`. For example:

In [44]:
%%bash
egrep "hel{2,3}o" examples/example_star.txt

hello
helllo


The above matched all strings containing "he" followed by 2 or 3 "l"s, followed by "o". You can also omit the upper bound to mean "at least this many" (GNU grep additionally accepts an omitted lower bound, as in `{,3}`, as an extension):

In [45]:
%%bash
egrep "hel{2,}o" examples/example_star.txt

hello
helllo
hellllo


We can also specify an 'optional' character using `?` - it's the equivalent of using `{0,1}`!

In [46]:
%%bash
egrep "hel?o" examples/example_star.txt

heo
helo
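
The repetition operators in this section are related: `?` behaves like `{0,1}`, and a single number in braces, such as `{2}`, asks for exactly that many repetitions. The following sketch checks this on made-up input; `match` is a throwaway helper defined only for this example.

```bash
# Hypothetical sample input.
words=$(printf '%s\n' "heo" "helo" "hello" "helllo" "hellllo")

# Throwaway helper: filter the sample words with the given extended regex.
match() { printf '%s\n' "$words" | grep -E "$1"; }

optional=$(match "hel?o")          # zero or one "l"
zero_or_one=$(match "hel{0,1}o")   # the same thing, written as an interval
two_to_three=$(match "hel{2,3}o")  # two or three "l"
exactly_two=$(match "hel{2}o")     # exactly two "l"

echo "$optional"
echo "---"
echo "$two_to_three"
echo "---"
echo "$exactly_two"
```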


#### <font color='#B31B1B'> Quick knowledge check! </font>
Fill in the regex here to match US-style phone numbers (i.e. `XXX-XXXXXXX`). Can you adapt it to allow an optional space between the 4-digit and 3-digit groups of the latter half (as in `123-4567 890`), and a space instead of a dash?

In [47]:
%%bash
cat examples/phone_numbers.txt

123-4567890
234-2319487
12312341123213
234
123-
904-3280983
123-4567 890
432 4328 293



In [58]:
%%bash
#Your code here


#### <font color='#B31B1B'> Conditional Matching </font>
This is another useful construct: suppose you are parsing a file containing paths to other files and want to list all image files that end in `.jpeg` or `.png`. Naively, you can write one regular expression that matches all `".jpeg"` substrings and another that matches all `".png"` substrings, and append the output of each to a file:

In [67]:
%%bash
cat examples/paths.txt

my_pic.jpeg
pic2.jpeg
beautiful_image.png
shot_by_iphone.png
not_an_image.txt

In [77]:
%%bash
egrep "\.jpeg" examples/paths.txt >>outputs/output.txt
egrep "\.png" examples/paths.txt  >>outputs/output.txt
cat outputs/output.txt

my_pic.jpeg
pic2.jpeg
beautiful_image.png
shot_by_iphone.png


A slightly more elegant solution is to use a conditional match: inside parentheses, the `|` operator matches either of the alternatives it separates.

In [79]:
%%bash
egrep "\.(jpeg|png)" examples/paths.txt > outputs/output.txt
cat outputs/output.txt

my_pic.jpeg
pic2.jpeg
beautiful_image.png
shot_by_iphone.png
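
The parentheses in `\.(jpeg|png)` matter. Without them, `|` splits the whole pattern in two, so `\.jpeg|png` means "either `.jpeg` or `png` anywhere on the line". The sketch below shows the difference on made-up paths; `png_notes.txt` is a hypothetical file name chosen to trigger the unwanted match.

```bash
# Hypothetical sample input.
paths=$(printf '%s\n' "my_pic.jpeg" "beautiful_image.png" "not_an_image.txt" "png_notes.txt")

# Grouped: a literal dot followed by either "jpeg" or "png".
grouped=$(printf '%s\n' "$paths" | grep -E '\.(jpeg|png)')

# Ungrouped: ".jpeg", or the bare text "png" anywhere on the line.
ungrouped=$(printf '%s\n' "$paths" | grep -E '\.jpeg|png')

echo "$grouped"
echo "---"
echo "$ungrouped"   # also prints png_notes.txt
```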


#### <font color='#B31B1B'> Handy (and fun!) tools for Regex </font>
To get some practice with regex, <a href='https://regexcrossword.com/'>Regex Crossword</a> is a fun site. <a href='https://regex101.com/'> Regex 101 </a> is also a great tool for debugging regular expressions. Here's a <a href='https://docs.trifacta.com/display/DP/Supported+Special+Regular+Expression+Characters'> handy</a> list of special characters that can make your regex more compact as well.