# Tidying up text using Regular Expressions

The first step in processing text is to tidy up any obvious errors and remove unwanted spaces, characters and so on. In Python this is done with regular expressions.

A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. The `re` module provides full support for regular expressions in Python.

Python offers two different primitive operations based on regular expressions: `match` checks for a match only at the beginning of the string, while `search` checks for a match anywhere in the string. See [here](https://www.tutorialspoint.com/python/python_reg_expressions.htm) for details.

```re.match(pattern, string, flags=0)```
```re.search(pattern, string, flags=0)```
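
To make the difference concrete, here is a small sketch. The sample string and variable names are made up for illustration. Both functions return a match object on success and `None` otherwise.

```python
import re

sample = "tidy this text"  # illustrative string, not from the tutorial data

# match only looks at the beginning of the string
at_start = re.match(r"tidy", sample)
not_at_start = re.match(r"text", sample)

# search scans the whole string for the first match
anywhere = re.search(r"text", sample)

print(at_start.group())   # tidy
print(not_at_start)       # None
print(anywhere.group())   # text
print(anywhere.start())   # 10, the index where the match begins
```

Because a failed match returns `None`, it is usually worth checking the result before calling `.group()` on it.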

There are a number of regular expression patterns to help match specific parts of text. For example: 

'.' -> matches any single character except newline

'*' -> matches 0 or more occurrences of the preceding expression

'?' -> matches 0 or 1 occurrence of the preceding expression

'+' -> matches 1 or more occurrences of the preceding expression

'^' -> matches the beginning of a line

'$' -> matches the end of a line

'\d' -> matches a digit

'\D' -> matches a non-digit

'[a-z]' -> matches any lower case ASCII letter
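
As a rough illustration of how these pieces combine, the sketch below runs a few patterns with `re.search`. The strings and the `examples` list are invented for this demonstration.

```python
import re

# (pattern, string) pairs chosen purely for illustration
examples = [
    (r"colou?r", "my favorite color"),  # '?' makes the 'u' optional
    (r"[a-z]+", "ABC def GHI"),         # '+' needs at least one lower case letter
    (r"^\D*", "abc123"),                # '^' anchors at the start, '\D*' takes the non-digits
    (r"ab*", "a abbb"),                 # '*' also allows zero b's, so the first 'a' matches alone
    (r".$", "end!"),                    # '.' is any character, '$' pins it to the end
]

results = [re.search(pattern, string).group() for pattern, string in examples]

for (pattern, string), result in zip(examples, results):
    print(f"{pattern!r} on {string!r} -> {result!r}")
```

Note that `re.search` returns the first match it finds, which is why `ab*` yields just `a` here rather than the longer `abbb` later in the string.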

For example, to find a Python-style comment:

'#.*$' -> matches a '#' followed by 0 or more occurrences of any character, up to the end of the line

or to find a single digit:

'\d'

In [21]:
import re

t = 'some code 367 # here is a comment'
matched = re.search(r'#.*$', t)
print(matched.group())

matched = re.search(r'\d', t)
print(matched.group())


# here is a comment
3


#### Challenge: change the regex so we return all the digits:

In [29]:
t = 'some code 367 # here is a comment'
# Change the regex below to get the 367 from the above text
matched = re.search(r'\d', t)
print(matched.group())

3
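
One possible answer, offered as a sketch rather than the only solution: adding `+` after `\d` asks for one or more consecutive digits instead of a single digit.

```python
import re

t = 'some code 367 # here is a comment'

# '\d+' matches a run of one or more digits
matched = re.search(r'\d+', t)
print(matched.group())  # 367
```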


You can also substitute text matched by a regular expression. This is useful, for example, if you want to remove excess whitespace from a text string. The basic syntax of `sub` is as follows:

```re.sub(pattern, repl, string, count=0, flags=0)```

We can use the same regex patterns as before. For example:

In [30]:
phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num : ", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)    
print("Phone Num : ", num)

Phone Num :  2004-959-559 
Phone Num :  2004959559
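
The optional `count` argument of `re.sub` limits how many replacements are made. The default of 0 replaces every match. A small sketch reusing the phone number from above (the variable names are illustrative):

```python
import re

phone_number = "2004-959-559"

# Replace every hyphen (count defaults to 0, meaning no limit)
all_replaced = re.sub(r'-', " ", phone_number)

# Replace only the first hyphen
first_replaced = re.sub(r'-', " ", phone_number, count=1)

print(all_replaced)    # 2004 959 559
print(first_replaced)  # 2004 959-559
```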


#### Challenge: remove the excess whitespace, remove any full stops and remove the words in brackets from the following text:

In [48]:
text = """
I had a soul transplant operation ,
Cause my bypass didn't    function,
So  now I'm keeping  on my toes,
Till I see your one man show.
Show me  magic,
Show me magic,
Show  me magic,
Show me magic.
Wouldn't    it be nice  to know,
What the paper doesn't show,
What  the TV doesn't say,
And what my hamster's ate   today.
(Show me magic)
"""

In [52]:
# Collapse runs of two or more spaces into a single space
t = re.sub(r'  +', " ", text)
# Remove full stops
t2 = re.sub(r'\.', "", t)
# Remove the stray space before a comma
t3 = re.sub(r' ,', ",", t2)
# Remove anything in brackets, including the brackets themselves
t4 = re.sub(r'\([^)]*\)', "", t3)
print(t4)


I had a soul transplant operation,
Cause my bypass didn't function,
So now I'm keeping on my toes,
Till I see your one man show
Show me magic,
Show me magic,
Show me magic,
Show me magic
Wouldn't it be nice to know,
What the paper doesn't show,
What the TV doesn't say,
And what my hamster's ate today




Regex can be quite powerful - but also a bit tricky and difficult to read at times!
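
One way to ease the readability problem is the `re.VERBOSE` flag, which ignores whitespace inside the pattern and allows comments. The sketch below rewrites the bracket-removal pattern from the challenge in that style. The `brackets` and `line` names are illustrative only.

```python
import re

# Same pattern as r'\([^)]*\)', spread over several commented lines
brackets = re.compile(r"""
    \(        # an opening bracket
    [^)]*     # any run of characters that are not a closing bracket
    \)        # the closing bracket
""", re.VERBOSE)

line = "Show me magic (Show me magic)"  # illustrative input
cleaned = brackets.sub("", line)
print(repr(cleaned))  # 'Show me magic ' (the space before the bracket remains)
```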