# Tidying up text using Regular Expressions

The first step in processing text is to tidy up any obvious errors and remove any unwanted spaces, characters, etc. This can be accomplished in Python **using regular expressions**.

A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. The module [**re**](https://docs.python.org/3/library/re.html#module-re) provides full support for regular expressions in Python. 

Python offers two different primitive operations based on regular expressions. Each returns a match object on success and `None` if nothing matches:
1. **match** ⇒ find something at the beginning of the string
2. **search** ⇒ find something anywhere in the string

```re.match(pattern, string, flags=0)```

```re.search(pattern, string, flags=0)```

See [here](https://www.tutorialspoint.com/python/python_reg_expressions.htm) for more details.
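As a minimal sketch of the difference between the two calls, the example below runs both on a made-up sample string (the variable names are illustrative only). It also shows why it is worth checking for `None` before calling `.group()` on the result.

```python
import re

line = 'some code 367 # here is a comment'

# match only succeeds if the pattern matches at the very start of the string
at_start = re.match(r'some', line)      # a match object: 'some' is at position 0
not_at_start = re.match(r'code', line)  # None: 'code' is not at the start
print(at_start, not_at_start)

# search scans the whole string and returns the first location that matches
found = re.search(r'code', line)
print(found.group(), found.start())     # code 5

# Both functions return None on failure, so check before calling .group()
missing = re.search(r'xyz', line)
if missing is None:
    print('no match')
```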

There are a number of regular expression patterns to help match specific parts of text. For example: 

```'.'``` -> Matches any single character except newline

```'*'```  -> Matches 0 or more occurrences of preceding expression

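```'?'``` -> Matches 0 or 1 occurrence of preceding expression

```'+'``` -> Matches 1 or more occurrences of preceding expression

```'^'``` -> Matches the beginning of the string (and of each line if the `re.MULTILINE` flag is set)

```'$'``` -> Matches the end of the string (and of each line if the `re.MULTILINE` flag is set)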

```'\d'``` or ```[0-9]``` -> matches digits

```'\D'``` -> matches non digits

```'[a-z]'``` -> Matches any lowercase ASCII letter

For example, to find a Python-style comment:

``` '#.*$' ``` will match a '#' followed by 0 or more occurrences of any character, up to the end of the line.
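The patterns listed above can be tried out quickly in code. The sketch below uses a small helper, `first_match`, which is a hypothetical name introduced here only to keep the demo compact; the sample strings are also made up.

```python
import re

def first_match(pattern, text):
    """Hypothetical helper: return the first matched text, or None."""
    m = re.search(pattern, text)
    return m.group() if m else None

print(first_match(r'c.t', 'cat cot'))            # cat    ('.' is any one character)
print(first_match(r'ab*', 'ac abbb'))            # a      ('*' allows zero b's)
print(first_match(r'ab+', 'ac abbb'))            # abbb   ('+' needs at least one b)
print(first_match(r'colou?r', 'color colour'))   # color  ('?' makes the u optional)
print(first_match(r'^[a-z]+', 'hello World'))    # hello  (anchored at the start)
print(first_match(r'[a-z]+$', 'hello World'))    # orld   (anchored at the end)
print(repr(first_match(r'\D', '42 is')))         # ' '    (first non-digit is the space)
print(first_match(r'\d', 'no digits'))           # None   (nothing matched)
```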

A helpful interactive website for building regular expressions is [regexr](https://regexr.com/), which also includes a cheatsheet should you forget any of the above. Because it is easy to forget all of the special characters and patterns, it's handy to have cheat sheets around: see [here](https://www.dataquest.io/blog/regex-cheatsheet/) and [here](https://www.debuggex.com/cheatsheet/regex/python) for examples. Regex can be quite powerful, but also a bit tricky and difficult to read at times!

## Matching Text

In [None]:
import re

t = 'some code 367 # here is a comment'

# Let's find any Python comment text
matched = re.search(r'#.*$', t)
print(matched.group())


# Let's return the first digit
matched = re.search(r'\d', t)
print(matched.group())


### Challenge: change the regex so we return all the digits:

In [None]:
t = 'some code 367 # here is a comment'
# Change the regex below to get the 367 from the above text
matched = re.search(r'\d', t)
print(matched.group())

## Substituting Text

You can also replace the text matched by a regular expression. This can be useful, for example, if you want to remove excess whitespace from a text string. The basic syntax of `re.sub` is as follows:

```re.sub(pattern, repl, string, count=0, flags=0)```
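In the `re` module, the optional argument that limits the number of replacements is named `count`. Its default of 0 means every match is replaced. The short sketch below, using a made-up dashed number, shows the effect of setting it.

```python
import re

dashed = '2004-959-559'

# With count left at its default of 0, every match is replaced
all_replaced = re.sub(r'-', ' ', dashed)
print(all_replaced)   # 2004 959 559

# With count=1, only the first match is replaced
first_only = re.sub(r'-', ' ', dashed, count=1)
print(first_only)     # 2004 959-559
```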

We can use the same regex patterns as before. For example:

In [None]:
phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num: ", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)    
print("Phone Num: ", num)

### Challenge: remove the excess whitespace, remove any full stops and remove the words in brackets from the following text:

In [None]:
text = """
I had a soul transplant operation ,
Cause my bypass didn't    function,
So  now I'm keeping   on my toes,
Till I see your one man show.
Show me  magic,
Show me magic,
Show  me magic,
Show me magic.
Wouldn't    it be nice  to know,
What the paper doesn't show,
What  the TV doesn't say,
And what my hamster's ate   today.
(Show me magic)
"""

In [None]:
# Enter your code here