## Regular Expressions

Regular expressions are text-matching patterns described with a formal syntax. You'll often hear them called 'regex' or 'regexp' in conversation (and yes, you have regexp in ES). Regular expressions can include a variety of rules, from finding repetition to matching text, and much more. As you advance in Python, you'll see that many of your parsing problems can be solved with regular expressions.

### Basic Patterns

       
Pattern    | Match
:----------|:-----------
a, W, 9, < | ordinary characters match themselves exactly
.          | a period matches any single character except newline
\w | matches a "word" character: a letter, digit, or underscore [a-zA-Z0-9_]
\W | matches any non-word character
\b | boundary between word and non-word
\s | a single whitespace character: space, newline, return, tab, form feed [ \n\r\t\f]
\S | matches any non-whitespace character
\t | tab
\n | newline
\r | return
\d | decimal digit [0-9]
\D | non-digit character
^ | circumflex (top hat) matches the start of a string
$ | dollar matches the end of a string
\ | inhibits the "specialness" of a character. So, for example, use `\.` to match a period
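
A few of the entries in the table can be tried out directly. The sketch below uses a made-up sample string (not from the original notebook) to show `\d`, `\b`, `^`, and an escaped period in action:

```python
import re

# Hypothetical sample text, chosen only for illustration
sample = "Order 66 shipped on day 7.\nTab\there"

# \d matches one decimal digit; \d+ matches a run of them
digits = re.findall(r"\d+", sample)

# \b marks a word boundary, so "day" is matched only as a whole word
whole_word = re.search(r"\bday\b", sample)

# ^ anchors the pattern to the start of the string
starts_with_order = re.search(r"^Order", sample)

# \. matches a literal period rather than "any character"
periods = re.findall(r"\.", sample)

print(digits)                     # ['66', '7']
print(whole_word.group())         # day
print(starts_with_order.group())  # Order
print(periods)                    # ['.']
```

Writing patterns as raw strings (`r"..."`) keeps Python from interpreting the backslashes before the `re` module sees them.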

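```python
# These are some of the most commonly used functions.
# The optional `flags` argument takes constants such as re.IGNORECASE.
#
#   re.search(pattern, text, flags)      -> first match object, or None
#   re.findall(pattern, text, flags)     -> list of all non-overlapping matches
#   re.sub(pattern, replace_with, text)  -> text with every match replaced
#   re.compile(pattern)                  -> reusable pattern object, especially
#                                           useful if the pattern is used multiple times
```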
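
As a rough illustration of how these functions fit together, the sketch below compiles one pattern and reuses it for searching, listing, and replacing. The sample text and variable names are assumptions made up for this example:

```python
import re

# Hypothetical sample text
text = "Python is fun. python is also widely used."

# Compile once, reuse many times; re.IGNORECASE is passed as the flags argument
word = re.compile(r"python", re.IGNORECASE)

first = word.search(text)            # first match only, or None
everything = word.findall(text)      # list of all non-overlapping matches
replaced = word.sub("Regex", text)   # new string with every match replaced

print(first.group())   # Python
print(everything)      # ['Python', 'python']
print(replaced)        # Regex is fun. Regex is also widely used.
```

A compiled pattern exposes the same `search`, `findall`, and `sub` methods as the module-level functions, minus the pattern argument.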


In [5]:
my_sentence = r"Python is fun to learn because it takes {}hrs to learn "

In [6]:
my_sentence.format(60)

'Python is fun to learn because it takes 60hrs to learn '

In [4]:
new_sentence = f"Python is fun to learn because it takes {60}hrs to learn "
new_sentence

'Python is fun to learn because it takes 60hrs to learn '

In [7]:
new_sentence.replace(" ", ".")

'Python.is.fun.to.learn.because.it.takes.60hrs.to.learn.'

In [8]:
new_sentence.strip()

'Python is fun to learn because it takes 60hrs to learn'

In [9]:
import re

In [14]:
string = "@@!abcd!!"

#searching for direct value
match = re.search("abcd!!", string)

if match:
    print(f"Match found: {match.group()}")
else:
    print("no match")



Match found: abcd!!


In [15]:
match.group()

'abcd!!'

In [16]:
string = "@@!cghs!!"

#search for 4 word characters followed by 2 non-word characters in the string
match = re.search(r"\w\w\w\w\W\W", string)

if match:
    print(f"Match found: {match.group()}")
else:
    print("no match")

Match found: cghs!!


In [18]:
string = "@@!cghs9!!"

#search for 4 word characters (letters, digits or underscore) followed by 2 non-word characters
match = re.search(r"\w{4}\W{2}", string)

if match:
    print(f"Match found: {match.group()}")
else:
    print("no match")

Match found: ghs9!!


In [21]:
string = "Hey there budd!. You are doing good!!"

#search for 4 word characters followed by 2 non-word characters in the string
match = re.search(r"\w{4}\W{2}", string)

if match:
    print(f"Match found: {match.group()}")
else:
    print("no match")

Match found: budd!.


In [23]:
string = "Hey there budd!. You are doing good!!"

#find every run of 4 word characters followed by 2 non-word characters
match = re.findall(r"\w{4}\W{2}", string)

for idx,value in enumerate(match):
    print(f"{idx} Match found: {value}")

0 Match found: budd!.
1 Match found: good!!


In [28]:
# * => Matches 0 or more items (GREEDY)
# + => Matches 1 or more items (GREEDY)
# ? => Matches 0 or 1 item

#search for piiiiiii
match = re.search(r"pi*","piiiiiigiiiii")

if match:
    print(f"Match found: {match.group()}")
else:
    print("no match")

Match found: piiiiii


In [25]:
# * => Matches 0 or more items (GREEDY)
# + => Matches 1 or more items (GREEDY)
# ? => Matches 0 or 1 item

#search for piiiiiii
match = re.search(r"pi+","piiiiiigiiiii")

if match:
    print(f"Match found: {match.group()}")
else:
    print("no match")

Match found: piiiiii


In [26]:
#search for piiiiiii
match = re.search(r"pi+","p iiiiiigiiiii")

if match:
    print(f"Match found: {match.group()}")
else:
    print("no match")

no match
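
The comments above flag `*` and `+` as greedy. As a small sketch of what that means, appending `?` to a quantifier makes it lazy, so it takes as few characters as possible instead of as many. The variable names here are made up for illustration:

```python
import re

text = "piiiiiigiiiii"

greedy_star = re.search(r"pi*", text).group()   # takes every "i" it can
lazy_star = re.search(r"pi*?", text).group()    # takes as few as possible: zero
lazy_plus = re.search(r"pi+?", text).group()    # still needs at least one "i"

print(greedy_star)  # piiiiii
print(lazy_star)    # p
print(lazy_plus)    # pi
```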


In [29]:
string = "CEO's phone number is ceo12 3456xx"

#look for 12 3456xx
match = re.search(r"\d\d\s+\d\d\d\d\w\w",string)

if match:
    print(f"Match found: {match.group()}")
else:
    print("no match")

Match found: 12 3456xx


In [30]:
string = "CEO's phone number is ceo12 3456xx"

#look for 12 3456xx
match = re.search(r"\d{2}\s+\d{4}\w+",string)

if match:
    print(f"Match found: {match.group()}")
else:
    print("no match")

Match found: 12 3456xx


In [33]:
string = "CEO's phone number is ceo12 3456xx"

#look for 12 3456xx
match = re.search(r"\d*\s+\d+\w+",string)

if match:
    print(f"Match found: {match.group()}")
else:
    print("no match")

Match found: 12 3456xx


In [34]:
# look for emails
email = "elon_musk-07@tesla.com is from an alien planet"

match = re.search(r"\w*@\w*",email)

if match:
    print(f"Match found: {match.group()}")
else:
    print("no match")



Match found: 07@tesla


### Square Brackets

Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. The codes \w, \s etc. work inside square brackets too with the one exception that dot (.) just means a literal dot. For the emails problem, the square brackets are an easy way to add '.' and '-' to the set of chars which can appear around the @ with the pattern r'[\w.-]+@[\w.-]+' to get the whole email address:

```python
email = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'[\w.-]+@[\w.-]+', email)

## 'alice-b@google.com'
if match:                      
    print('Found:', match.group()) 
else:
    print('Did not find')
```

(More square-bracket features) You can also use a dash to indicate a range, so [a-z] matches all lowercase letters. To use a dash without indicating a range, put the dash last, e.g. [abc-]. An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.
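
The sketch below tries out three square-bracket features on a made-up string: a range, a literal dash placed last in the set, and a set negated with a leading `^`. The sample text and variable names are assumptions for illustration only:

```python
import re

# Hypothetical sample text
text = "room-42b, floor_3, A-7"

# A range: runs of lowercase letters only
lowercase_runs = re.findall(r"[a-z]+", text)

# Dash placed last in the set is a literal dash, not a range
with_dash = re.findall(r"[a-z0-9-]+", text)

# A leading ^ inverts the set: runs of anything except digits
not_digits = re.findall(r"[^0-9]+", text)

print(lowercase_runs)  # ['room', 'b', 'floor']
print(with_dash)       # ['room-42b', 'floor', '3', '-7']
print(not_digits)      # ['room-', 'b, floor_', ', A-']
```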

In [46]:
email = "elon_musk-07@tesla.com is from an alien planet"

match = re.search(r"[\w.-]*@[\w.-]*",email)

if match:
    print(f"Match found: {match.group()}")
#     print(f"Username: {match.group(1)}")
#     print(f"Domain: {match.group(2)}")
else:
    print("no match")


Match found: elon_musk-07@tesla.com


In [43]:
# using parenthesis ( ) to find groups of items
email = "elon_musk-07@tesla.com is from an alien planet"

match = re.search(r"([\w.-]*)@([\w.-]*)",email)

if match:
    print(f"Match found: {match.group()}")
    print(f"Username: {match.group(1)}")
    print(f"Domain: {match.group(2)}")
else:
    print("no match")


Match found: elon_musk-07@tesla.com
Username: elon_musk-07
Domain: tesla.com


In [44]:
email = "elon_musk-07@tesla.com is from an alien planet but stevie.wozniak@dads.com is from Mars"

match = re.findall(r"([\w.-]*)@([\w.-]*)", email)

for username, domain in match:
    print(f"{username} has a domain of {domain}")

elon_musk-07 has a domain of tesla.com
stevie.wozniak has a domain of dads.com


## Exercise 1

Given the following paragraph, find the network IP addresses and replace them with 127.0.0.1.

> On most computer systems, localhost resolves to the IP address 10.100.11.121, which is the most commonly used IPv4 loopback address, and to the IPv6 loopback address. The localhost IP address is 192.168.11.10.

In [47]:
#Method 1: using a repeated pattern group with re.sub
text = "On most computer systems, localhost resolves to the IP address 10.100.11.121, which is the most commonly used IPv4 loopback address, and to the IPv6 loopback address. The localhost IP address is 192.168.11.10."

pattern = r"(\d{1,3}[.]){3}(\d{1,3})"

re.sub(pattern,"127.0.0.1",text)

'On most computer systems, localhost resolves to the IP address 127.0.0.1, which is the most commonly used IPv4 loopback address, and to the IPv6 loopback address. The localhost IP address is 127.0.0.1.'

In [48]:
#Method 2: explicitly specifying the IP pattern (hardcode)

text = "On most computer systems, localhost resolves to the IP address 10.100.11.121, which is the most commonly used IPv4 loopback address, and to the IPv6 loopback address. The localhost IP address is 192.168.11.10."

pattern = r"\d+[.]\d+[.]\d+[.]\d+"

re.sub(pattern,"127.0.0.1",text)


'On most computer systems, localhost resolves to the IP address 127.0.0.1, which is the most commonly used IPv4 loopback address, and to the IPv6 loopback address. The localhost IP address is 127.0.0.1.'
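
Both patterns above accept any run of digits, so a look-alike such as 999.1.1.1 would be replaced too. One possible refinement, sketched here under the assumption that only octets in the range 0-255 should count, is to pass a function to `re.sub` that checks each candidate before replacing it. The names `IP_CANDIDATE`, `replace_valid_ips`, and `swap` are hypothetical and not part of the original exercise:

```python
import re

# Candidate dotted quads; \b keeps the match from starting or ending mid-number
IP_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def replace_valid_ips(text, replacement="127.0.0.1"):
    """Replace dotted quads whose four octets are all in the range 0-255."""
    def swap(match):
        octets = match.group().split(".")
        if all(int(octet) <= 255 for octet in octets):
            return replacement
        return match.group()  # leave look-alikes such as 999.1.1.1 untouched
    return IP_CANDIDATE.sub(swap, text)

# Hypothetical sample text
sample = "Hosts 10.100.11.121 and 192.168.11.10 are up, but 999.1.1.1 is not an address."
print(replace_valid_ips(sample))
# Hosts 127.0.0.1 and 127.0.0.1 are up, but 999.1.1.1 is not an address.
```

When `re.sub` receives a function instead of a string, it calls that function with each match object and uses the return value as the replacement.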

## Exercise 2
In an oil and gas environment, you will often work with technical documents that contain timestamps in recurring patterns. Say you are dealing with a PDF document that has many time fields, such as start time and end time, but you want to know when the document was generated. Get the generated date from the strings below:

    Daily Report
    Start-time: 2020-08-20 06:00:00
    End-time: 2020-08-20 20:00:00
    Report Generated On: 12ULIM-SCVjdm
    Generated Time : 2020-08-21 06:03:01

***Tip***: Use Python's **Positive Look Behind Assertion** `(?<=...)` for this. Refer to [Official Documentation Here](https://docs.python.org/2/library/re.html)



In [49]:
sentence = "Daily Report\nStart-time: 2020-08-20 06:00:00\nEnd-time: 2020-08-20 20:00:00\nReport Generated On: 12ULIM-SCVjdm\nGenerated Time : 2020-08-21 06:03:01"
print(sentence)

Daily Report
Start-time: 2020-08-20 06:00:00
End-time: 2020-08-20 20:00:00
Report Generated On: 12ULIM-SCVjdm
Generated Time : 2020-08-21 06:03:01


In [53]:
pattern = r"(?<=Generated\sTime)\s?:\s{0,3}\d{4}(\W\d{2}){2}"

match = re.search(pattern, sentence)

if match:
    print(f"Match found: {match.group()}")
else:
    print("No match")

Match found:  : 2020-08-21
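
The match above still carries the leading ` : ` separator and stops at the date. As an alternative sketch, the separator can be moved inside the lookbehind so that only the timestamp is returned. This assumes the label is always written exactly as `Generated Time : `, because Python's `re` module requires lookbehind patterns to have a fixed width. A capturing-group variant that tolerates variable spacing is shown alongside it. The variable names are made up for this example:

```python
import re

sentence = (
    "Daily Report\n"
    "Start-time: 2020-08-20 06:00:00\n"
    "End-time: 2020-08-20 20:00:00\n"
    "Report Generated On: 12ULIM-SCVjdm\n"
    "Generated Time : 2020-08-21 06:03:01"
)

# Fixed-width lookbehind: the label and separator are spelled out exactly
lookbehind = re.search(
    r"(?<=Generated Time : )\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", sentence
)

# Capturing groups: flexible spacing around the colon, date and time split out
grouped = re.search(
    r"Generated Time\s*:\s*(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2})", sentence
)

print(lookbehind.group())  # 2020-08-21 06:03:01
print(grouped.groups())    # ('2020-08-21', '06:03:01')
```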


## Exercise 3

Here are the combinations of possible phone numbers to be parsed. 

We should be able to get the area code 415, the trunk 867, and the rest of the phone number 5309. 

* 415-867-5309
* 415 657 5039
* 415.567.5467

Use findall to get the required data.

In [59]:
def phone_parser(numbers):
    '''
    Uses python regex to parse phone numbers to area code, trunk and phone number
    '''
    for i,v in enumerate(numbers):
        print(f"Parsing number {v} from index {i}...")
        match = re.search(r'(\d{2,4})\W?(\d{2,4})\W?(\d{2,4})',v)
        if match:
            print(f"Area: {match.group(1)}")
            print(f"Trunk: {match.group(2)}")
            print(f"Phone: {match.group(3)}")
        else:
            print("No match")

In [61]:
list_num = ['4154-867-530','415 657 5039','415.567.5467']
phone_parser(list_num)

Parsing number 4154-867-530 from index 0...
Area: 4154
Trunk: 867
Phone: 530
Parsing number 415 657 5039 from index 1...
Area: 415
Trunk: 657
Phone: 5039
Parsing number 415.567.5467 from index 2...
Area: 415
Trunk: 567
Phone: 5467
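
The exercise asks for `findall`, so here is one possible sketch that uses it. It assumes every number is written as 3-3-4 digits separated by a dash, space, or period, which is stricter than the `\d{2,4}` pattern above. The function name `parse_phones` is hypothetical:

```python
import re

def parse_phones(text):
    """Return (area, trunk, line) tuples for numbers written as 3-3-4 digits."""
    # With several groups, findall returns one tuple of groups per match
    return re.findall(r"\b(\d{3})[-. ](\d{3})[-. ](\d{4})\b", text)

numbers = "415-867-5309, 415 657 5039, 415.567.5467"
parsed = parse_phones(numbers)

for area, trunk, line in parsed:
    print(f"Area: {area}, Trunk: {trunk}, Phone: {line}")
# Area: 415, Trunk: 867, Phone: 5309
# Area: 415, Trunk: 657, Phone: 5039
# Area: 415, Trunk: 567, Phone: 5467
```

With this stricter pattern, a malformed entry such as `4154-867-530` produces no match instead of being split into groups of the wrong length.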
