# Regular Expressions

***

https://docs.python.org/3/library/re.html

https://realpython.com/regex-python/

https://realpython.com/regex-python-part-2/

https://developers.google.com/edu/python/regular-expressions

https://librarycarpentry.org/lc-data-intro/01-regular-expressions/



## Python's re module
The `re` module is part of Python's standard library and has lengthy documentation.
It provides lots of different functions and methods; `re.split` is particularly handy.

Regular expressions find patterns in text. 

In [1]:
import re

In [2]:
# A string to be manipulated
original = "Words, words, words."

# The pattern/regular expression to use on the above string
pattern_1 = r'\W+'

# Splits a string into substrings using a regular expression. 
result_1 = re.split(pattern_1, original)

# Print the result
print(result_1)

['Words', 'words', 'words', '']


In an ordinary Python string, a backslash usually means 'escape the following character', for example:
`\r` carriage return
`\n` new line

An `r` in front of the string, as in `pattern_1` above, marks it as 'raw'. Python then leaves the backslashes alone, so they reach the `re` module unchanged.
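A minimal sketch of the difference between a normal and a raw string literal. The variable names are illustrative only.

```python
# A normal string literal: the escape \n becomes a single newline character.
normal = '\n'

# A raw string literal: the backslash and the n are kept as two characters.
raw = r'\n'

print(len(normal))  # 1
print(len(raw))     # 2

# Without the r prefix, the backslash itself has to be escaped
# to produce the same regex text.
print('\\W+' == r'\W+')  # True
```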

`re.split` takes a pattern and a string and returns a list of strings.

`\W` matches any single character that is not a word character. That includes whitespace, but also punctuation such as `,` and `.`.

`\W` = ``[^a-zA-Z0-9_]`` matches one character that is not a lower-case letter, an upper-case letter, a digit or an underscore. (This equivalence holds for ASCII text; with Python 3 string patterns, word characters also include Unicode letters and digits.)

At the start of a character class, `^` means NOT.

`\w` = ``[a-zA-Z0-9_]`` matches one character that is a lower-case letter, an upper-case letter, a digit or an underscore.

`+` is a quantifier: it says how many of the preceding item to match, in this case one or more. It is greedy, so it matches as many as it can.
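A short sketch of what 'greedy' means in practice. The lazy form `+?` is not used elsewhere in these notes; it is shown here only for contrast.

```python
import re

text = 'aaaa'

# Greedy: + takes as many characters as it can.
greedy = re.search(r'a+', text).group()

# Lazy: a ? after the quantifier makes it take as few as it can.
lazy = re.search(r'a+?', text).group()

print(greedy)  # aaaa
print(lazy)    # a
```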

The final substring in the list is empty because the trailing `.` matches `\W+`. The split happens there, but nothing follows the `.`, so the last piece is an empty string.
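If the trailing empty string is unwanted, there are two simple ways around it. This is a sketch rather than the only approach; `parts` and `words` are illustrative names.

```python
import re

original = "Words, words, words."

# Option 1: drop empty strings after splitting.
parts = [part for part in re.split(r'\W+', original) if part]

# Option 2: match the words themselves instead of the separators.
words = re.findall(r'\w+', original)

print(parts)  # ['Words', 'words', 'words']
print(words)  # ['Words', 'words', 'words']
```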

In [3]:
original = "Words, words, words."

# The pattern/regular expression to use on the above string
pattern_2 = r'(\w+)'

# Searches the string for the first match of the regular expression.
result_2 = re.search(pattern_2, original)
print(result_2)

<re.Match object; span=(0, 5), match='Words'>


In [4]:
re.split(r'\W+', 'Words, words, words.', 1)

['Words', 'words, words.']

In [5]:
# Split on runs of the letters a-f (in either case); the digits between them are returned as strings.
re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)

['0', '3', '9']

## Real Python
Following this article: https://realpython.com/regex-python/
`re.search(<regex>, <string>)`
Saying that a regular expression 'matches' a string can be ambiguous.

Sometimes you want the whole string to match the regular expression.

Sometimes you only want a substring to match.

A set of characters specified in square brackets (`[]`) makes up a character class. A character class matches any single character that is in the class.
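A few character classes in action, as a minimal sketch. The sample strings are made up for illustration.

```python
import re

# [artz] matches exactly one character: 'a', 'r', 't' or 'z'.
m1 = re.search(r'ba[artz]', 'foobarqux')
print(m1.group())  # bar

# A range such as [0-9] is shorthand for listing every digit.
m2 = re.search(r'[0-9]', 'foo7bar')
print(m2.group())  # 7

# A leading ^ inside the brackets negates the class.
m3 = re.search(r'[^0-9]', '12345foo')
print(m3.group())  # f
```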


In [6]:
'abc' == 'abc'

True

In [7]:
'cda' == 'pda'

False

In [8]:
'abc' in 'cab'

False

In [9]:
'abc' in 'cababc'

True

In [10]:
'cbabac'.index('a')

2

In [11]:
'cbabac'[2]

'a'

In [12]:
'cbaabc'.find('aa')

2

In [13]:
s = 'foo123bar'

re.search('123', s)

<re.Match object; span=(3, 6), match='123'>

In [14]:
s[3:6]

'123'

In [15]:
s = 'foo123bar'
re.search(r'[0-9][0-9][0-9]', s)

<re.Match object; span=(3, 6), match='123'>

In [16]:
re.search(r'[0-9][0-9][0-9]', 'foo456bar')

<re.Match object; span=(3, 6), match='456'>

In [17]:
re.search(r'[0-9][0-9][0-9]', '234baz')

<re.Match object; span=(0, 3), match='234'>

In [18]:
re.search(r'[0-9][0-9][0-9]', 'qux678')

<re.Match object; span=(3, 6), match='678'>

In [19]:
print(re.search(r'[0-9][0-9][0-9]', '12foo34'))

None


In [20]:
s = 'foo123bar'
re.search('1.3', s)


<re.Match object; span=(3, 6), match='123'>

In [21]:
s = 'foo13bar'
print(re.search('1.3', s))

None


In [22]:
re.search(r'[0-9]{3}', 'qux678')

<re.Match object; span=(3, 6), match='678'>

## Google Education
***
https://developers.google.com/edu/python/regular-expressions


In [23]:
import re

# Named text rather than str, to avoid shadowing the built-in str type
text = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', text)
# If-statement after search() tests if it succeeded
if match:
    print('found', match.group())  ## 'found word:cat'
else:
    print('did not find')

found word:cat


In [24]:
string = 'aaaaaa'
pattern = r'a+'

re.search(pattern, string)

<re.Match object; span=(0, 6), match='aaaaaa'>

In [25]:
string = 'aaaabaa'
pattern = r'a+'

re.search(pattern, string)

<re.Match object; span=(0, 4), match='aaaa'>

### Repetitions

In [26]:
# i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig') # found, match.group() == "piii"
match

<re.Match object; span=(0, 4), match='piii'>

In [27]:
# Finds the first/leftmost solution, and within it drives the +
# as far as possible (aka 'leftmost and largest').
# In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii') # found, match.group() == "ii"
match

<re.Match object; span=(1, 3), match='ii'>

In [28]:
# \s* = zero or more whitespace chars
# Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx') # found, match.group() == "1 2   3"
match

<re.Match object; span=(2, 9), match='1 2   3'>

In [29]:
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx') # found, match.group() == "12  3"
match

<re.Match object; span=(2, 7), match='12  3'>

In [30]:
match = re.search(r'\d\s*\d\s*\d', 'xx123xx') # found, match.group() == "123"
match

<re.Match object; span=(2, 5), match='123'>

In [31]:
# ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar') # not found, match == None
match

In [32]:
# but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar') # found, match.group() == "bar"
match

<re.Match object; span=(3, 6), match='bar'>

#### Email example

In [33]:
# Common use of regex to search for email addresses
# (named email_text rather than str, to avoid shadowing the built-in str type)
email_text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', email_text)
if match:
    print(match.group())  ## 'b@google'

b@google


In [34]:
# Square brackets define a set of characters to match.
# Inside the brackets, a dot (.) is a literal dot rather than 'any character'.
match = re.search(r'[\w.-]+@[\w.-]+', email_text)
if match:
    print(match.group())  ## 'alice-b@google.com'

alice-b@google.com


### Regex resources 
Regex golf 
https://alf.nu/RegexGolf?world=regex&level=r00

Regex 101 explains the regular expression you enter
https://regex101.com/

Regexper draws a diagram of the regular expression you enter
https://regexper.com/

### Exercise 1 
***
Write a Python function to remove all non-alphanumeric characters from a string

In [35]:
# Adapted from https://bobbyhadz.com/blog/python-remove-non-alphanumeric-characters-from-string
# For a given string
my_str = 'Coding_ like- poetry| should{be}short#and; concise'

# Remove every non-alphanumeric character (including whitespace and underscores) from the string
new_str = re.sub(r'[\W_]', '', my_str)

print(new_str)

Codinglikepoetryshouldbeshortandconcise
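The exercise asks for a function, so the same substitution can be wrapped in one. A minimal sketch; the name `remove_non_alphanumeric` is just an illustrative choice.

```python
import re

def remove_non_alphanumeric(text):
    """Return text with every character that is not a letter or digit removed."""
    # \W matches any non-word character; _ is listed separately
    # because \w counts the underscore as a word character.
    return re.sub(r'[\W_]', '', text)

cleaned = remove_non_alphanumeric('Coding_ like- poetry| should{be}short#and; concise')
print(cleaned)  # Codinglikepoetryshouldbeshortandconcise
```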


## Second Part of Regex Real Python

A regex is like a small program. For example, `[0-9]{3}` tells the search function what to look for: three digits in a row.
Regex syntax is a small language for describing patterns; the `re` module turns each pattern into a little program/algorithm in the background.

A regex is usually far more concise than the equivalent hand-written Python code.
The same syntax also works, with minor differences, in many languages other than Python.

If you reuse the same regex many times, you can compile it once with the `re` module.

Put `r` in front of the regex string to make it raw.


### re Module

#### re.search()

`re.search(<regex>, <string>, flags=0)`
`flags` modify how the regex is matched, for example by ignoring case.
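A small sketch of how flags change the result, including how two flags are combined. The sample text is made up for illustration.

```python
import re

text = 'first line\nSecond line'

# Without flags, [a-z]+ is case-sensitive and ^ only matches
# at the very start of the string.
no_flags = re.findall(r'^[a-z]+', text)
print(no_flags)  # ['first']

# Flags are combined with the bitwise OR operator |.
# re.IGNORECASE ignores case; re.MULTILINE lets ^ match at the start of every line.
both_flags = re.findall(r'^[a-z]+', text, flags=re.IGNORECASE | re.MULTILINE)
print(both_flags)  # ['first', 'Second']
```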


In [36]:
# \d matches any digit; + means one or more of the previous item, so \d+ means one or more digits
# search returns the first match, so it does not pick up 456
# the () create a capturing group but do not change what is matched
re.search(r'(\d+)', 'foo123bar456')


<re.Match object; span=(3, 6), match='123'>

In [37]:
# the re.IGNORECASE flag makes [a-z] match upper-case letters as well
re.search(r'[a-z]+', '123FOO456', flags=re.IGNORECASE)

<re.Match object; span=(3, 6), match='FOO'>

In [38]:
# Wrapping the call in print() shows None in Jupyter. Otherwise it would look like nothing happened.
print(re.search(r'\d+', 'foo.bar'))

None


In [39]:
re.search(r'\d+', '123foobar')

<re.Match object; span=(0, 3), match='123'>

In [40]:
re.search(r'\d+', 'foo123bar')

<re.Match object; span=(3, 6), match='123'>

In [41]:
re.match(r'\d+', '123foobar')

<re.Match object; span=(0, 3), match='123'>

#### re.match()

`re.match(<regex>, <string>, flags=0)` matches at the start of a string

In [42]:
# re.match will not match here because 123 is in the middle of the string and match will 
# only match at the start of a string
print(re.match(r'\d+', 'foo123bar'))

None


#### re.fullmatch()

`re.fullmatch(<regex>, <string>, flags=0)`

Looks for a regex match on an entire string.

In [43]:
print(re.fullmatch(r'\d+', '123foo'))

None


In [44]:
print(re.fullmatch(r'\d+', 'foo123'))

None


In [45]:
print(re.fullmatch(r'\d+', 'foo123bar'))


None


In [46]:
re.fullmatch(r'\d+', '123')

<re.Match object; span=(0, 3), match='123'>

In [47]:
re.search(r'^\d+$', '123')

<re.Match object; span=(0, 3), match='123'>

You can get `re.search` to act like `re.match` by using the `^` anchor at the beginning of the regex, which makes the match start at the beginning of the string. Adding the `$` anchor at the end as well makes it behave much like `re.fullmatch`.
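A minimal sketch comparing the anchored searches with `re.match` and `re.fullmatch`. It also shows one edge case where `$` and `re.fullmatch` differ.

```python
import re

# ^ alone makes search behave like match: the pattern must start at position 0.
search_start = re.search(r'^\d+', 'foo123')
match_start = re.match(r'\d+', 'foo123')
print(search_start, match_start)  # None None

# ^ and $ together behave much like fullmatch...
print(re.search(r'^\d+$', '123').group())  # 123

# ...except that $ also matches just before a trailing newline,
# which fullmatch does not accept.
with_dollar = re.search(r'^\d+$', '123\n')
with_fullmatch = re.fullmatch(r'\d+', '123\n')
print(with_dollar is not None)     # True
print(with_fullmatch is not None)  # False
```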

#### re.findall()

`re.findall(<regex>, <string>, flags=0)`

Returns a list of all matches of a regex in a string.

In [48]:
re.findall(r'\d+', '123foo456bar789')

['123', '456', '789']

#### re.finditer()

`re.finditer(<regex>, <string>, flags=0)`

Returns an iterator that yields regex matches.
The matches are produced one at a time, like a generator, rather than being collected into a list.

In [49]:
matches = re.finditer(r'\d+', '123foo456bar789')
matches

<callable_iterator at 0x15a0a87e6d0>

In [50]:
next(matches)

<re.Match object; span=(0, 3), match='123'>

In [51]:
next(matches)

<re.Match object; span=(6, 9), match='456'>

In [52]:
next(matches)

<re.Match object; span=(12, 15), match='789'>

In [53]:
try:
    next(matches)
except StopIteration:
    # The iterator is exhausted, so there are no more matches
    print(None)

None


In [54]:
matches = re.finditer(r'\d+', '123foo456bar789')
for match in matches:
    print(match)

<re.Match object; span=(0, 3), match='123'>
<re.Match object; span=(6, 9), match='456'>
<re.Match object; span=(12, 15), match='789'>
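Because each item is a Match object, the position of every match is available as well as its text. A short sketch; `found` is an illustrative name.

```python
import re

# Collect the matched text and its (start, end) span for every match.
found = [(m.group(), m.span()) for m in re.finditer(r'\d+', '123foo456bar789')]

for text, span in found:
    print(text, span)

# 123 (0, 3)
# 456 (6, 9)
# 789 (12, 15)
```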


#### re.sub()

`re.sub(<regex>, <repl>, <string>, count=0, flags=0)`
Really powerful. Finds the leftmost non-overlapping occurrences of `<regex>` in `<string>`, replaces each match as indicated by `<repl>`, and returns the result as a new string.

`re.subn(<regex>, <repl>, <string>, count=0, flags=0)`
Does the same, but also returns the number of substitutions made.

In both cases the original string remains unchanged.
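A minimal sketch of the `count` parameter and of `re.subn`, reusing the sample string from the cells that follow.

```python
import re

s = 'foo.123.bar.789.baz'

# count limits how many matches are replaced (0, the default, means all of them).
first_only = re.sub(r'\d+', '#', s, count=1)
print(first_only)  # foo.#.bar.789.baz

# subn returns a tuple: (new string, number of substitutions made).
new_s, n = re.subn(r'\d+', '#', s)
print(new_s, n)  # foo.#.bar.#.baz 2

# The original string is untouched.
print(s)  # foo.123.bar.789.baz
```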

In [55]:
s = 'foo.123.bar.789.baz'

re.sub(r'\d+', '#', s)

'foo.#.bar.#.baz'

In [56]:
re.sub('[a-z]+', '(*)', s)

'(*).123.(*).789.(*)'

In [57]:
# round brackets capture whatever that part of the regex matches
# in the replacement string the groups are numbered: \1 is the first group, \2 the second
# you could swap columns in a csv file with a regex like this if you wished
re.sub(r'(\w+),bar,baz,(\w+)', r'\2,bar,baz,\1', 'foo,bar,baz,qux')

'qux,bar,baz,foo'

In [58]:
re.sub(r'([a-z]+)([0-9]+)', r'\2\1', 'foo123bar456')


'123foo456bar'

### Compiling

The general idea is that when you use a regex multiple times, you create and compile it once ahead of time. This avoids re-processing the pattern on every call, although the `re` module also caches recently used patterns, so the gain is often small.

`re.compile(<regex>, flags=0)`

Compiles a regex into a regular expression object.

In [59]:
my_regex = re.compile(r'([0-9]+)')

In [60]:
my_regex

re.compile(r'([0-9]+)', re.UNICODE)

In [61]:
my_regex.search('foo123bar456')

<re.Match object; span=(3, 6), match='123'>

In [62]:
my_regex.findall('foo123bar456')

['123', '456']

In [63]:
my_regex.sub(r'...', 'foo123bar456')

'foo...bar...'
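A sketch of a compiled regex being reused across several strings, with the flag supplied once at compile time. The names `word_regex` and `lines` are illustrative.

```python
import re

# Flags are given once, at compile time, rather than on every call.
word_regex = re.compile(r'[a-z]+', flags=re.IGNORECASE)

lines = ['123FOO456', 'foo.bar', '789']

# The same compiled object is reused for every string in the list.
results = [word_regex.findall(line) for line in lines]
print(results)  # [['FOO'], ['foo', 'bar'], []]

# The original pattern text is kept on the compiled object.
print(word_regex.pattern)  # [a-z]+
```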

### Regular Expressions on Iris

In [64]:
# https://stackoverflow.com/a/1393367

import urllib.request

url = r'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

# list comprehension. like a 'for' loop for lists
# .strip() takes the new line out of every line. Every line is now an item in a list. 
iris = [line.decode('utf-8').strip() for line in urllib.request.urlopen(url)]

# each string is a line from the IRIS dataset
iris

['5.1,3.5,1.4,0.2,Iris-setosa',
 '4.9,3.0,1.4,0.2,Iris-setosa',
 '4.7,3.2,1.3,0.2,Iris-setosa',
 '4.6,3.1,1.5,0.2,Iris-setosa',
 '5.0,3.6,1.4,0.2,Iris-setosa',
 '5.4,3.9,1.7,0.4,Iris-setosa',
 '4.6,3.4,1.4,0.3,Iris-setosa',
 '5.0,3.4,1.5,0.2,Iris-setosa',
 '4.4,2.9,1.4,0.2,Iris-setosa',
 '4.9,3.1,1.5,0.1,Iris-setosa',
 '5.4,3.7,1.5,0.2,Iris-setosa',
 '4.8,3.4,1.6,0.2,Iris-setosa',
 '4.8,3.0,1.4,0.1,Iris-setosa',
 '4.3,3.0,1.1,0.1,Iris-setosa',
 '5.8,4.0,1.2,0.2,Iris-setosa',
 '5.7,4.4,1.5,0.4,Iris-setosa',
 '5.4,3.9,1.3,0.4,Iris-setosa',
 '5.1,3.5,1.4,0.3,Iris-setosa',
 '5.7,3.8,1.7,0.3,Iris-setosa',
 '5.1,3.8,1.5,0.3,Iris-setosa',
 '5.4,3.4,1.7,0.2,Iris-setosa',
 '5.1,3.7,1.5,0.4,Iris-setosa',
 '4.6,3.6,1.0,0.2,Iris-setosa',
 '5.1,3.3,1.7,0.5,Iris-setosa',
 '4.8,3.4,1.9,0.2,Iris-setosa',
 '5.0,3.0,1.6,0.2,Iris-setosa',
 '5.0,3.4,1.6,0.4,Iris-setosa',
 '5.2,3.5,1.5,0.2,Iris-setosa',
 '5.2,3.4,1.4,0.2,Iris-setosa',
 '4.7,3.2,1.6,0.2,Iris-setosa',
 '4.8,3.1,1.6,0.2,Iris-setosa',
 '5.4,3.

In [65]:
strip_iris = re.compile(r'Iris-([a-z]+)')

In [66]:
[strip_iris.sub(r'\1', line) for line in iris]

['5.1,3.5,1.4,0.2,setosa',
 '4.9,3.0,1.4,0.2,setosa',
 '4.7,3.2,1.3,0.2,setosa',
 '4.6,3.1,1.5,0.2,setosa',
 '5.0,3.6,1.4,0.2,setosa',
 '5.4,3.9,1.7,0.4,setosa',
 '4.6,3.4,1.4,0.3,setosa',
 '5.0,3.4,1.5,0.2,setosa',
 '4.4,2.9,1.4,0.2,setosa',
 '4.9,3.1,1.5,0.1,setosa',
 '5.4,3.7,1.5,0.2,setosa',
 '4.8,3.4,1.6,0.2,setosa',
 '4.8,3.0,1.4,0.1,setosa',
 '4.3,3.0,1.1,0.1,setosa',
 '5.8,4.0,1.2,0.2,setosa',
 '5.7,4.4,1.5,0.4,setosa',
 '5.4,3.9,1.3,0.4,setosa',
 '5.1,3.5,1.4,0.3,setosa',
 '5.7,3.8,1.7,0.3,setosa',
 '5.1,3.8,1.5,0.3,setosa',
 '5.4,3.4,1.7,0.2,setosa',
 '5.1,3.7,1.5,0.4,setosa',
 '4.6,3.6,1.0,0.2,setosa',
 '5.1,3.3,1.7,0.5,setosa',
 '4.8,3.4,1.9,0.2,setosa',
 '5.0,3.0,1.6,0.2,setosa',
 '5.0,3.4,1.6,0.4,setosa',
 '5.2,3.5,1.5,0.2,setosa',
 '5.2,3.4,1.4,0.2,setosa',
 '4.7,3.2,1.6,0.2,setosa',
 '4.8,3.1,1.6,0.2,setosa',
 '5.4,3.4,1.5,0.4,setosa',
 '5.2,4.1,1.5,0.1,setosa',
 '5.5,4.2,1.4,0.2,setosa',
 '4.9,3.1,1.5,0.1,setosa',
 '5.0,3.2,1.2,0.2,setosa',
 '5.5,3.5,1.3,0.2,setosa',
 

In [67]:
reverse_iris = re.compile(r'([0-9]\.[0-9]),([0-9]\.[0-9]),([0-9]\.[0-9]),([0-9]\.[0-9]),Iris-([a-z]+)')

In [68]:
[reverse_iris.sub(r'\5,\4,\3,\2,\1', line) for line in iris if line]

['setosa,0.2,1.4,3.5,5.1',
 'setosa,0.2,1.4,3.0,4.9',
 'setosa,0.2,1.3,3.2,4.7',
 'setosa,0.2,1.5,3.1,4.6',
 'setosa,0.2,1.4,3.6,5.0',
 'setosa,0.4,1.7,3.9,5.4',
 'setosa,0.3,1.4,3.4,4.6',
 'setosa,0.2,1.5,3.4,5.0',
 'setosa,0.2,1.4,2.9,4.4',
 'setosa,0.1,1.5,3.1,4.9',
 'setosa,0.2,1.5,3.7,5.4',
 'setosa,0.2,1.6,3.4,4.8',
 'setosa,0.1,1.4,3.0,4.8',
 'setosa,0.1,1.1,3.0,4.3',
 'setosa,0.2,1.2,4.0,5.8',
 'setosa,0.4,1.5,4.4,5.7',
 'setosa,0.4,1.3,3.9,5.4',
 'setosa,0.3,1.4,3.5,5.1',
 'setosa,0.3,1.7,3.8,5.7',
 'setosa,0.3,1.5,3.8,5.1',
 'setosa,0.2,1.7,3.4,5.4',
 'setosa,0.4,1.5,3.7,5.1',
 'setosa,0.2,1.0,3.6,4.6',
 'setosa,0.5,1.7,3.3,5.1',
 'setosa,0.2,1.9,3.4,4.8',
 'setosa,0.2,1.6,3.0,5.0',
 'setosa,0.4,1.6,3.4,5.0',
 'setosa,0.2,1.5,3.5,5.2',
 'setosa,0.2,1.4,3.4,5.2',
 'setosa,0.2,1.6,3.2,4.7',
 'setosa,0.2,1.6,3.1,4.8',
 'setosa,0.4,1.5,3.4,5.4',
 'setosa,0.1,1.5,4.1,5.2',
 'setosa,0.2,1.4,4.2,5.5',
 'setosa,0.1,1.5,3.1,4.9',
 'setosa,0.2,1.2,3.2,5.0',
 'setosa,0.2,1.3,3.5,5.5',
 

## Exercise 2

***

**Adapt the above code to capitalise the first letter of the iris species, using regular expressions.**
https://www.geeksforgeeks.org/string-capitalize-python/
https://stackoverflow.com/questions/4145451/using-a-regular-expression-to-replace-upper-case-repeated-letters-in-python-with

In [69]:
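# Compile a regex that captures the species name after the 'Iris-' prefix
strip_iris = re.compile(r'Iris-([a-z]+)')

# Replace 'Iris-setosa' with 'setosa', skipping empty lines
stripped_iris = [strip_iris.sub(r'\1', line) for line in iris if line]

def capitalize_match(match):
    # Called by re.sub for each match; returns the replacement text
    return match.group().capitalize()

# \b[a-z]+ matches the lower-case species name at the end of each line
[re.sub(r'\b([a-z]+)', capitalize_match, line) for line in stripped_iris]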

['5.1,3.5,1.4,0.2,Setosa',
 '4.9,3.0,1.4,0.2,Setosa',
 '4.7,3.2,1.3,0.2,Setosa',
 '4.6,3.1,1.5,0.2,Setosa',
 '5.0,3.6,1.4,0.2,Setosa',
 '5.4,3.9,1.7,0.4,Setosa',
 '4.6,3.4,1.4,0.3,Setosa',
 '5.0,3.4,1.5,0.2,Setosa',
 '4.4,2.9,1.4,0.2,Setosa',
 '4.9,3.1,1.5,0.1,Setosa',
 '5.4,3.7,1.5,0.2,Setosa',
 '4.8,3.4,1.6,0.2,Setosa',
 '4.8,3.0,1.4,0.1,Setosa',
 '4.3,3.0,1.1,0.1,Setosa',
 '5.8,4.0,1.2,0.2,Setosa',
 '5.7,4.4,1.5,0.4,Setosa',
 '5.4,3.9,1.3,0.4,Setosa',
 '5.1,3.5,1.4,0.3,Setosa',
 '5.7,3.8,1.7,0.3,Setosa',
 '5.1,3.8,1.5,0.3,Setosa',
 '5.4,3.4,1.7,0.2,Setosa',
 '5.1,3.7,1.5,0.4,Setosa',
 '4.6,3.6,1.0,0.2,Setosa',
 '5.1,3.3,1.7,0.5,Setosa',
 '4.8,3.4,1.9,0.2,Setosa',
 '5.0,3.0,1.6,0.2,Setosa',
 '5.0,3.4,1.6,0.4,Setosa',
 '5.2,3.5,1.5,0.2,Setosa',
 '5.2,3.4,1.4,0.2,Setosa',
 '4.7,3.2,1.6,0.2,Setosa',
 '4.8,3.1,1.6,0.2,Setosa',
 '5.4,3.4,1.5,0.4,Setosa',
 '5.2,4.1,1.5,0.1,Setosa',
 '5.5,4.2,1.4,0.2,Setosa',
 '4.9,3.1,1.5,0.1,Setosa',
 '5.0,3.2,1.2,0.2,Setosa',
 '5.5,3.5,1.3,0.2,Setosa',
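The same result can be reached in a single substitution. This is an alternative sketch, not the approach used above; `sample` stands in for the first two lines of the `iris` list so that the block runs on its own.

```python
import re

# Stand-in for the first two lines of the iris list loaded earlier.
sample = ['5.1,3.5,1.4,0.2,Iris-setosa', '4.9,3.0,1.4,0.2,Iris-setosa']

# Match 'Iris-' plus the first letter of the species, capturing only the letter.
capitalise = re.compile(r'Iris-([a-z])')

# The callback returns the captured letter in upper case. The 'Iris-' prefix
# disappears because it is part of the match but not of the replacement.
result = [capitalise.sub(lambda m: m.group(1).upper(), line) for line in sample]
print(result)
# ['5.1,3.5,1.4,0.2,Setosa', '4.9,3.0,1.4,0.2,Setosa']
```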
 