# <center>Data Cleansing</center>
---

Python's 're' module can be used to search for patterns in text. A <b>regular expression</b>, or regex, is a sequence of characters that defines a pattern for the re module to search for in a text.

<b>Metacharacters</b>, also known as special characters, include:  . ^ $ * + ? { } [ ] \ | ( )  They have a special meaning in regex, but they can be 'escaped' with a preceding backslash, which makes regex treat them as ordinary characters. The 're.escape()' function can also be used to escape every special character in a string automatically.
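
As a minimal sketch of both escaping approaches (the sample text and variable names are illustrative assumptions, not part of the original notebook), the backslashes can be written by hand or added by `re.escape()`:

```python
import re

text = "Total cost: $5.00 (approx.)"  # illustrative sample text

# Escape the metacharacters '$' and '.' by hand with backslashes:
manual = re.search(r'\$5\.00', text)

# Let re.escape() add the backslashes needed to match a literal string:
escaped_pattern = re.escape("(approx.)")
automatic = re.search(escaped_pattern, text)

print(manual.group())
print(automatic.group())
```

Without the escaping, `$` would mean "end of string" and the parentheses would be read as a capture group, so neither literal would be found as intended.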

The letter 'r' is often placed in front of a regular expression to make it a raw string. In a raw string, Python does not interpret backslash sequences such as \n (newline) or \t (tab) but takes them as they are, so the backslashes reach the regex engine unchanged.
For example:

In [39]:
string = r"\n\tstring" 
print(string)

\n\tstring


[^xyz]           -  matches any character <em>except</em> x, y or z<br>
[a-z]            -  matches any lowercase letter from a to z<br>
\d or [0-9]      -  matches any digit from zero to nine<br>
\D or [^0-9]     -  matches any character except the digits zero to nine<br>
\w or [a-zA-Z_0-9]  -  matches lowercase and uppercase letters, an underscore, and the digits zero to nine (on non-ASCII text, \w also matches other Unicode letters and digits unless re.ASCII is set)<br>
\W  -  matches everything except what \w matches<br>
asterisk *   -  matches zero or more repetitions of the preceding element<br>
plus +  -  matches one or more repetitions of the preceding element<br>
?  -  matches zero or one repetition of the preceding element<br>
{5}  -  matches the preceding element exactly five times<br>
{5,7}  -  matches the preceding element between five and seven times
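
The character classes and quantifiers listed above can be tried out with `re.findall()`. This is a small sketch; the sample strings and variable names are made up for illustration:

```python
import re

sample = "Order 12345 shipped to flat 7b on 2023-01-15"  # illustrative text

five_digits = re.findall(r'\d{5}', sample)     # exactly five digits in a row
digit_runs = re.findall(r'\d{2,4}', sample)    # between two and four digits
lowercase = re.findall(r'[a-z]+', sample)      # one or more lowercase letters

optional_u = re.findall(r'colou?r', "color and colour")  # 'u' zero or one time
zero_or_more = re.findall(r'ab*', "a ab abbb")           # 'a' then zero or more 'b'
one_or_more = re.findall(r'ab+', "a ab abbb")            # 'a' then one or more 'b'

for name, found in [("five_digits", five_digits), ("digit_runs", digit_runs),
                    ("lowercase", lowercase), ("optional_u", optional_u),
                    ("zero_or_more", zero_or_more), ("one_or_more", one_or_more)]:
    print(name, found)
```

Quantifiers are greedy by default, which is why `\d{2,4}` takes '1234' from '12345' and leaves the single '5' unmatched.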

Example search for an email:

In [4]:
# Importing Python's regex module 're':
import re

In [24]:
# An example regex pattern to search for email addresses in a text:
emails = "john37@outlook.com, lenora44@eire.ie, con&!o'neill@atu.ie, brian_murphy@hotmail.com, jon137-a.@outlook.com"
pattern = re.compile(r'[a-zA-Z_0-9.\-\!&\']+@[a-zA-Z]+\.(com|net|ie)')
matches = pattern.finditer(emails)
for x in matches:
    print(x)

<re.Match object; span=(0, 18), match='john37@outlook.com'>
<re.Match object; span=(20, 36), match='lenora44@eire.ie'>
<re.Match object; span=(38, 57), match="con&!o'neill@atu.ie">
<re.Match object; span=(59, 83), match='brian_murphy@hotmail.com'>
<re.Match object; span=(85, 106), match='jon137-a.@outlook.com'>


In [20]:
# The shorthand '\w' saves a programmer from listing every letter, digit and underscore inside the square brackets.
# Characters outside the set (such as & ! ') are not matched, so the third address is only partly captured:
found = re.findall(r'[\w\.-]+@[\w\.-]+', emails)
found

['john37@outlook.com',
 'lenora44@eire.ie',
 'neill@atu.ie',
 'brian_murphy@hotmail.com',
 'jon137-a.@outlook.com']
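
The third result is cut down to 'neill@atu.ie' because `&`, `!` and `'` are not covered by `\w`. One possible fix, sketched here as an assumption rather than the notebook's own solution, is to add those characters to the set alongside `\w` (the name `full_matches` is illustrative):

```python
import re

emails = "john37@outlook.com, lenora44@eire.ie, con&!o'neill@atu.ie, brian_murphy@hotmail.com, jon137-a.@outlook.com"

# \w covers letters, digits and underscore; '.', '-', '&', '!' and "'" are added explicitly.
full_matches = re.findall(r"[\w.\-&!']+@[\w.\-]+", emails)
print(full_matches)
```

This keeps the convenience of the shorthand while still matching the unusual local part. It is not a full email validator, only a pattern that fits this sample text.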

### Common regex functions:

The re.search() function scans the whole string and returns the first match it finds, wherever it occurs:

In [32]:
x = "Hello and goodbye to you!"
res = re.search(r'[el]', x)
res

<re.Match object; span=(1, 2), match='e'>

The re.match() function only matches at the start of a string (index 0). Because x begins with 'H', which is not in the set [el], nothing is matched:

In [36]:
res = re.match(r'[el]', x)
print(res)

None


The re.compile() function turns a regular expression into a pattern object that can be stored in a variable, which saves writing the same pattern over and over again in a program:

In [38]:
# Compile the pattern once, then call its methods on any string:
regex = re.compile(r'[el]')
result = regex.search("Hello and goodbye to you!")
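
A compiled pattern object offers the same operations as the module-level functions. The following sketch reuses one compiled object for several of them; the pattern and variable names are chosen for illustration only:

```python
import re

# Compile once, then reuse the compiled object's methods:
vowels = re.compile(r'[aeiou]')

text = "Hello and goodbye to you!"
first = vowels.search(text)        # first match anywhere in the string
at_start = vowels.match(text)      # None, because the string starts with 'H'
all_found = vowels.findall(text)   # every non-overlapping match, as a list
replaced = vowels.sub('*', text)   # replace each match with '*'

print(first.group(), at_start, all_found, replaced)
```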

### Exercise 1: 

Write a Python function to remove all non-alphanumeric characters from a string.

In [27]:
import re
string = "Will this function remove NON Alphanumeric Characters:£€€# #,..??.=@=~#£ and keep numbers 78247892"
def remove_chars(string):
    # re.sub() replaces every non-word character (anything other than a letter, digit or underscore) with a space:
    x = re.sub(r'\W', ' ', string)
    return x

In [28]:
remove_chars(string)

'Will this function remove NON Alphanumeric Characters                    and keep numbers 78247892'
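
The solution above swaps each non-word character for a space and keeps underscores, since `\W` treats `_` as a word character. If the goal is to delete the characters outright, a stricter variant could look like the sketch below. The function names are hypothetical, and restricting the set to ASCII letters and digits is an assumption about what "alphanumeric" should mean here:

```python
import re

def strip_non_alphanumeric(text):
    # Delete everything that is not an ASCII letter or digit (spaces and underscores included).
    return re.sub(r'[^a-zA-Z0-9]', '', text)

def strip_keep_spaces(text):
    # Same idea, but spaces are kept so that words stay separated.
    return re.sub(r'[^a-zA-Z0-9 ]', '', text)

print(strip_non_alphanumeric("a_b c!1"))
print(strip_keep_spaces("Keep numbers: 78, ok?"))
```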

In [4]:
import urllib.request
import re

url = r'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

iris = [line.decode('utf-8').strip() for line in urllib.request.urlopen(url)]

iris

['5.1,3.5,1.4,0.2,Iris-setosa',
 '4.9,3.0,1.4,0.2,Iris-setosa',
 '4.7,3.2,1.3,0.2,Iris-setosa',
 '4.6,3.1,1.5,0.2,Iris-setosa',
 '5.0,3.6,1.4,0.2,Iris-setosa',
 '5.4,3.9,1.7,0.4,Iris-setosa',
 '4.6,3.4,1.4,0.3,Iris-setosa',
 '5.0,3.4,1.5,0.2,Iris-setosa',
 '4.4,2.9,1.4,0.2,Iris-setosa',
 '4.9,3.1,1.5,0.1,Iris-setosa',
 '5.4,3.7,1.5,0.2,Iris-setosa',
 '4.8,3.4,1.6,0.2,Iris-setosa',
 '4.8,3.0,1.4,0.1,Iris-setosa',
 '4.3,3.0,1.1,0.1,Iris-setosa',
 '5.8,4.0,1.2,0.2,Iris-setosa',
 '5.7,4.4,1.5,0.4,Iris-setosa',
 '5.4,3.9,1.3,0.4,Iris-setosa',
 '5.1,3.5,1.4,0.3,Iris-setosa',
 '5.7,3.8,1.7,0.3,Iris-setosa',
 '5.1,3.8,1.5,0.3,Iris-setosa',
 '5.4,3.4,1.7,0.2,Iris-setosa',
 '5.1,3.7,1.5,0.4,Iris-setosa',
 '4.6,3.6,1.0,0.2,Iris-setosa',
 '5.1,3.3,1.7,0.5,Iris-setosa',
 '4.8,3.4,1.9,0.2,Iris-setosa',
 '5.0,3.0,1.6,0.2,Iris-setosa',
 '5.0,3.4,1.6,0.4,Iris-setosa',
 '5.2,3.5,1.5,0.2,Iris-setosa',
 '5.2,3.4,1.4,0.2,Iris-setosa',
 '4.7,3.2,1.6,0.2,Iris-setosa',
 '4.8,3.1,1.6,0.2,Iris-setosa',
 '5.4,3.

In [5]:
strip_iris = re.compile(r'([0-9]\.[0-9]),([0-9]\.[0-9]),([0-9]\.[0-9]),([0-9]\.[0-9]),Iris-([a-z]+)')

In [6]:
[strip_iris.sub(r'\5,\4,\3,\2,\1', line) for line in iris if line]

['setosa,0.2,1.4,3.5,5.1',
 'setosa,0.2,1.4,3.0,4.9',
 'setosa,0.2,1.3,3.2,4.7',
 'setosa,0.2,1.5,3.1,4.6',
 'setosa,0.2,1.4,3.6,5.0',
 'setosa,0.4,1.7,3.9,5.4',
 'setosa,0.3,1.4,3.4,4.6',
 'setosa,0.2,1.5,3.4,5.0',
 'setosa,0.2,1.4,2.9,4.4',
 'setosa,0.1,1.5,3.1,4.9',
 'setosa,0.2,1.5,3.7,5.4',
 'setosa,0.2,1.6,3.4,4.8',
 'setosa,0.1,1.4,3.0,4.8',
 'setosa,0.1,1.1,3.0,4.3',
 'setosa,0.2,1.2,4.0,5.8',
 'setosa,0.4,1.5,4.4,5.7',
 'setosa,0.4,1.3,3.9,5.4',
 'setosa,0.3,1.4,3.5,5.1',
 'setosa,0.3,1.7,3.8,5.7',
 'setosa,0.3,1.5,3.8,5.1',
 'setosa,0.2,1.7,3.4,5.4',
 'setosa,0.4,1.5,3.7,5.1',
 'setosa,0.2,1.0,3.6,4.6',
 'setosa,0.5,1.7,3.3,5.1',
 'setosa,0.2,1.9,3.4,4.8',
 'setosa,0.2,1.6,3.0,5.0',
 'setosa,0.4,1.6,3.4,5.0',
 'setosa,0.2,1.5,3.5,5.2',
 'setosa,0.2,1.4,3.4,5.2',
 'setosa,0.2,1.6,3.2,4.7',
 'setosa,0.2,1.6,3.1,4.8',
 'setosa,0.4,1.5,3.4,5.4',
 'setosa,0.1,1.5,4.1,5.2',
 'setosa,0.2,1.4,4.2,5.5',
 'setosa,0.1,1.5,3.1,4.9',
 'setosa,0.2,1.2,3.2,5.0',
 'setosa,0.2,1.3,3.5,5.5',
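
The reordering above relies on capture groups: each pair of parentheses in the pattern stores the text it matched, and `\1` to `\5` in the replacement string refer back to those groups by position. A minimal sketch on a single line of the data, with no download needed (the variable names `row`, `line` and `m` are illustrative):

```python
import re

row = re.compile(r'([0-9]\.[0-9]),([0-9]\.[0-9]),([0-9]\.[0-9]),([0-9]\.[0-9]),Iris-([a-z]+)')
line = '5.1,3.5,1.4,0.2,Iris-setosa'

m = row.match(line)
groups = m.groups()                          # the five captured pieces, in order
reordered = row.sub(r'\5,\4,\3,\2,\1', line)  # groups written back in reverse order

print(groups)
print(reordered)
```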
 

### Exercise 2:

Adapt the above code to capitalise the first letter of the iris species, using regular expressions.

In [33]:
strip_iris = re.compile(r'([0-9]\.[0-9]),([0-9]\.[0-9]),([0-9]\.[0-9]),([0-9]\.[0-9]),Iris-([a-z]+)')

In [34]:
[strip_iris.sub(r'\5,\4,\3,\2,\1', line)[0].upper() + strip_iris.sub(r'\5,\4,\3,\2,\1', line)[1:] for line in iris if line]

['Setosa,0.2,1.4,3.5,5.1',
 'Setosa,0.2,1.4,3.0,4.9',
 'Setosa,0.2,1.3,3.2,4.7',
 'Setosa,0.2,1.5,3.1,4.6',
 'Setosa,0.2,1.4,3.6,5.0',
 'Setosa,0.4,1.7,3.9,5.4',
 'Setosa,0.3,1.4,3.4,4.6',
 'Setosa,0.2,1.5,3.4,5.0',
 'Setosa,0.2,1.4,2.9,4.4',
 'Setosa,0.1,1.5,3.1,4.9',
 'Setosa,0.2,1.5,3.7,5.4',
 'Setosa,0.2,1.6,3.4,4.8',
 'Setosa,0.1,1.4,3.0,4.8',
 'Setosa,0.1,1.1,3.0,4.3',
 'Setosa,0.2,1.2,4.0,5.8',
 'Setosa,0.4,1.5,4.4,5.7',
 'Setosa,0.4,1.3,3.9,5.4',
 'Setosa,0.3,1.4,3.5,5.1',
 'Setosa,0.3,1.7,3.8,5.7',
 'Setosa,0.3,1.5,3.8,5.1',
 'Setosa,0.2,1.7,3.4,5.4',
 'Setosa,0.4,1.5,3.7,5.1',
 'Setosa,0.2,1.0,3.6,4.6',
 'Setosa,0.5,1.7,3.3,5.1',
 'Setosa,0.2,1.9,3.4,4.8',
 'Setosa,0.2,1.6,3.0,5.0',
 'Setosa,0.4,1.6,3.4,5.0',
 'Setosa,0.2,1.5,3.5,5.2',
 'Setosa,0.2,1.4,3.4,5.2',
 'Setosa,0.2,1.6,3.2,4.7',
 'Setosa,0.2,1.6,3.1,4.8',
 'Setosa,0.4,1.5,3.4,5.4',
 'Setosa,0.1,1.5,4.1,5.2',
 'Setosa,0.2,1.4,4.2,5.5',
 'Setosa,0.1,1.5,3.1,4.9',
 'Setosa,0.2,1.2,3.2,5.0',
 'Setosa,0.2,1.3,3.5,5.5',
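
The solution above runs the substitution twice per line and capitalises with string slicing. An alternative that keeps the work inside the regex call is to pass a function as the replacement argument of `sub()`, which receives each match object. This is a sketch under that assumption; the helper name `reorder_and_capitalise` and the two sample lines are illustrative:

```python
import re

strip_iris = re.compile(r'([0-9]\.[0-9]),([0-9]\.[0-9]),([0-9]\.[0-9]),([0-9]\.[0-9]),Iris-([a-z]+)')

def reorder_and_capitalise(match):
    # Group 5 is the species name; capitalise it and reverse the field order.
    species = match.group(5).capitalize()
    return ','.join([species, match.group(4), match.group(3), match.group(2), match.group(1)])

sample_lines = ['5.1,3.5,1.4,0.2,Iris-setosa', '4.9,3.0,1.4,0.2,Iris-setosa']
result = [strip_iris.sub(reorder_and_capitalise, line) for line in sample_lines]
print(result)
```

Each line is then matched and rewritten in a single pass.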
 