#### **Regular Expression [RegEx]**

A regular expression is a sequence of characters that specifies a search pattern in text.

In Python, the **`re`** module provides full support for regular expressions. Keep in mind that many characters, such as `.`, `*`, and `\`, have a special meaning when they appear in a regular expression.

To keep Python from interpreting backslashes before the regex engine sees them, we write patterns as raw strings: **`r'expression'`**.
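
As a small illustration of why raw strings matter, the sketch below compares the same pattern written with and without the `r` prefix. The sample text is made up for this example.

```python
import re

text = "a cat sat"

# In a normal string literal, "\b" is the backspace character (0x08),
# so this pattern looks for backspace + "cat" + backspace and finds nothing.
plain = re.search("\bcat\b", text)

# In a raw string the backslash reaches the regex engine untouched,
# where \b means "word boundary".
raw = re.search(r"\bcat\b", text)

print(plain)        # None
print(raw.group())  # cat
```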

The **`re`** module provides several functions for querying an input string. The most commonly used are:
```
re.match()
re.search()
re.split()
re.sub()
re.findall()
re.compile()
```
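
Of the functions listed, `re.compile()` is the one that does not search by itself: it returns a pattern object whose methods mirror the module-level functions. A minimal sketch, with an illustrative pattern and sample strings that are not part of the original notebook:

```python
import re

# Compile once, then reuse the pattern object for several searches.
word_colon = re.compile(r"(\w+):(\w+)")

samples = ["word:cat", "no separator here", "key:value pairs"]

for s in samples:
    m = word_colon.search(s)
    print(s, "->", m.groups() if m else None)
```

Compiling is mostly a readability and reuse convenience when the same pattern is applied many times.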

#### **Pattern Table**

| Pattern | Description |
|:--------|:------------|
| `.` | Matches any single character except a newline |
| `^` | Matches the start of the string (with `re.MULTILINE`, the start of each line) |
| `$` | Matches the end of the string (with `re.MULTILINE`, the end of each line) |
| `[...]` | Matches any single character listed in the brackets |
| `[^...]` | Matches any single character not listed in the brackets |
| `*` | Matches 0 or more occurrences of the preceding expression |
| `+` | Matches 1 or more occurrences of the preceding expression |
| `?` | Matches 0 or 1 occurrence of the preceding expression |
| `{n}` | Matches exactly `n` occurrences of the preceding expression |
| `{n,}` | Matches `n` or more occurrences of the preceding expression |
| `{n,m}` | Matches at least `n` and at most `m` occurrences of the preceding expression |
| `x\|y` | Matches either `x` or `y` |
| `\d` | Matches digits; equivalent to `[0-9]` for ASCII text |
| `\w` | Matches word characters (letters, digits, and underscore) |
| `\W` | Matches non-word characters |
| `\s` | Matches whitespace characters |
| `(re)` | Groups a sub-expression and remembers the matched text |
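
To make a few rows of the table concrete, the sketch below runs several patterns against made-up sample strings and prints what each one matches. The patterns and samples are illustrative only.

```python
import re

# Each entry pairs a pattern from the table with a sample string.
checks = [
    (r"^\d{3}$",     "123"),      # exactly three digits, whole string
    (r"^\d{2,}$",    "7"),        # two or more digits: "7" is too short
    (r"^a{1,3}b$",   "aab"),      # one to three "a", then "b"
    (r"cat|dog",     "hot dog"),  # alternation
    (r"[^aeiou\s]+", "rhythm"),   # characters not in the bracket
    (r"colou?r",     "color"),    # optional "u"
]

for pattern, sample in checks:
    m = re.search(pattern, sample)
    print(f"{pattern!r:18} {sample!r:10} -> {m.group() if m else None}")
```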


#### **The `match()` Function**

In [None]:
import re

line = "Learn to Analyze Data with Scientific Python"

m = re.match(r'(.*) to (.*?) .*', line)

if m:
   print("m.groups() : ", m.groups())
   print("m.group(0) : ", m.group())
   print("m.group(1) : ", m.group(1))
   print("m.group(2) : ", m.group(2))
else:
   print("No match!!")

m.groups() :  ('Learn', 'Analyze')
m.group(0) :  Learn to Analyze Data with Scientific Python
m.group(1) :  Learn
m.group(2) :  Analyze
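
The pattern above uses `(.*?)` for its second group. The `?` after `*` makes the quantifier lazy, so it matches as little as possible. The following sketch contrasts it with the greedy form on the same line; only the variable names `lazy` and `greedy` are new.

```python
import re

line = "Learn to Analyze Data with Scientific Python"

# Lazy second group: stops at the first space after "to ".
lazy = re.match(r"(.*) to (.*?) .*", line)

# Greedy second group: extends as far as it can while still leaving
# a space and the trailing ".*" something to match.
greedy = re.match(r"(.*) to (.*) .*", line)

print(lazy.group(2))    # Analyze
print(greedy.group(2))  # Analyze Data with Scientific
```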


In [None]:
import re

line = "Learn Data, Python"
match = re.match(r'(\w+)\s(\w+)', line)

if match:
   print("match.groups() : ", match.groups())
   print("match.group (1,2):", match.group(1, 2))
   print("match.group():", match.group())
   print("match.group(1):", match.group(1))
   print("match.group(2):", match.group(2))
else:
   print("No match!!")

match.groups() :  ('Learn', 'Data')
match.group (1,2): ('Learn', 'Data')
match.group(): Learn Data
match.group(1): Learn
match.group(2): Data


In [None]:
import re

number = "124.13"

match = re.match(r'(?P<Integer>\d+)\.(?P<Fraction>\d+)', number)

if match:
   print("match.groupdict(): ", match.groupdict())
   print('match.groups(): ', match.groups())
   print('match.group(0): ', match.group(0))
   print('match.group(1): ', match.group(1))
   print('match.group(2): ', match.group(2))
else:
   print("No match!!")

match.groupdict():  {'Integer': '124', 'Fraction': '13'}
match.groups():  ('124', '13')
match.group(0):  124.13
match.group(1):  124
match.group(2):  13


In [None]:
import re

values = ["Learn", "Live", "Python"]

for value in values:
    # \A anchors the pattern to the start of the string.
    result = re.match(r"\AL.+", value)
    if result:
        print("START MATCH [L]:", value)

    # \Z anchors the pattern to the end of the string.
    result2 = re.match(r".+n\Z", value)
    if result2:
        print("END MATCH [n]:", value)

START MATCH [L]: Learn
END MATCH [n]: Learn
START MATCH [L]: Live
END MATCH [n]: Python


In [None]:
import re

line = "Learn to Analyze Data with Scientific Python"
match = re.match(r'(.*) to (.*?) .*', line)

if match:
   # groupdict() is empty because the pattern has no named groups.
   print("match.groupdict(): ", match.groupdict())
   print('match.groups(): ', match.groups())
   print('match.group(0): ', match.group(0))
   print('match.group(1): ', match.group(1))
   print('match.group(2): ', match.group(2))
else:
   print("No match!!")

match.groupdict():  {}
match.groups():  ('Learn', 'Analyze')
match.group(0):  Learn to Analyze Data with Scientific Python
match.group(1):  Learn
match.group(2):  Analyze


#### **The `search()` Function**

In [None]:
import re

text = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', text)

if match:
  print('found', match.group())
else:
  print('did not find')

found word:cat


In [None]:
import re

text = 'piiig'
match = re.search(r'iii', text)

if match:
  print('found', match.group())
else:
  print('did not find')

found iii


In [None]:
import re

text = 'piiig'
match = re.search(r'ing', text)

if match:
  print('found', match.group())
else:
  print('did not find')

did not find


In [None]:
import re

text = 'piiig'
match = re.search(r'..g', text)

if match:
  print('found', match.group())
else:
  print('did not find')

found iig


In [None]:
import re

text = 'p123g'
match = re.search(r'\d\d\d', text)

if match:
  print('found', match.group())
else:
  print('did not find')

found 123


In [None]:
import re

text = '@@@abcd!!!'
match = re.search(r'\w\w\w', text)

if match:
  print('found', match.group())
else:
  print('did not find')

found abc


In [None]:
import re

text = 'piiig'
match = re.search(r'i+', text) # one or more i

if match:
  print('found', match.group())
else:
  print('did not find')

found iii


In [None]:
import re

text = 'piiigiiiiiii'
# One or more i. search() returns only the first (leftmost) match,
# so the second, longer run of i is not reported.
match = re.search(r'i+', text)

if match:
  print('found', match.group())
else:
  print('did not find')

found iii


In [None]:
import re

text = 'xx1 2   3xx'
match = re.search(r'\d\s*\d\s*\d', text)

if match:
  print('found', match.group())
else:
  print('did not find')

found 1 2   3


In [None]:
import re

text = 'xx12   3xx'
match = re.search(r'\d\s*\d\s*\d', text)

if match:
  print('found', match.group())
else:
  print('did not find')

found 12   3


In [None]:
import re

text = 'xx123xx'
match = re.search(r'\d\s*\d\s*\d', text)

if match:
  print('found', match.group())
else:
  print('did not find')

found 123


In [None]:
import re

text = 'foobar'
match = re.search(r'^b\w+', text)

if match:
  print('found', match.group())
else:
  print('did not find')

did not find


In [None]:
import re

text = 'foobar'
match = re.search(r'^f\w+', text)

if match:
  print('found', match.group())
else:
  print('did not find')

found foobar


In [None]:
import re

text = 'foobar'
match = re.search(r'b\w+', text)

if match:
  print('found', match.group())
else:
  print('did not find')

found bar


**Email Example**

In [None]:
import re

text = 'My name is Mokarbeen Ansari and my email is mokarbeen.ansari@gmail.com'

match = re.search(r'[\w.-]+@[\w.-]+', text)

if match:
  print(match.groups())
  print(f'Extracted email is: {match.group()}')

()
Extracted email is: mokarbeen.ansari@gmail.com


**Group Extraction**

The "group" feature of a regular expression allows you to pick out parts of the matching text.

In [None]:
import re

text = 'My name is Mokarbeen Ansari and my email is mokarbeen.ansari@gmail.com'

match = re.search(r'([\w.-]+)@([\w.-]+)', text)

if match:
  print(match.groups())
  print(f'Extracted email is: {match.group()}')
  print(f'Extracted email is: {match.group(0)}')
  print(f'Extracted email first group is: {match.group(1)}')
  print(f'Extracted email second group is: {match.group(2)}')
  # print(f'Extracted email is: {match.group(3)}')  # raises IndexError: the pattern has only two groups

('mokarbeen.ansari', 'gmail.com')
Extracted email is: mokarbeen.ansari@gmail.com
Extracted email is: mokarbeen.ansari@gmail.com
Extracted email first group is: mokarbeen.ansari
Extracted email second group is: gmail.com


In [None]:
import re

line = "Learn to Analyze Data with Scientific Python"

match = re.search(r'(.*) to (.*?) .*', line)

if match:
   print("m.groups(): ", match.groups())
   print("m.group() : ", match.group())
   print("m.group(1): ", match.group(1))
   print("m.group(2): ", match.group(2))
else:
   print("No match!!")

m.groups():  ('Learn', 'Analyze')
m.group() :  Learn to Analyze Data with Scientific Python
m.group(1):  Learn
m.group(2):  Analyze


In [None]:
import re

line = "Learn Data, Python"

match = re.search(r'(\w+) (\w+)', line)

if match:
   print("m.groups(): ", match.groups())
   print("m.group() : ", match.group())
   print("m.group(1): ", match.group(1))
   print("m.group(2): ", match.group(2))
else:
   print("No match!!")

m.groups():  ('Learn', 'Data')
m.group() :  Learn Data
m.group(1):  Learn
m.group(2):  Data


In [None]:
import re

number = "124.13"

m = re.search(r'(?P<Integer>\d+)\.(?P<Fraction>\d+)', number)

if m:
   print("m.groupdict() : ", m.groupdict())
else:
   print("No match!!")

m.groupdict() :  {'Integer': '124', 'Fraction': '13'}


In [None]:
import re

email = "hello@leremove_thisarntoanalayzedata.com "

# re.match("remove_this", email) would return None here,
# because the pattern does not occur at the start of the string.
m = re.search("remove_this", email)

if m:
   print("m.group(): ", m.group())
   print("m.start(): ", m.start())
   print("m.end()  : ", m.end())
   print("email address : ", email[:m.start()] + email[m.end():])
else:
   print("No match!!")

m.group():  remove_this
m.start():  8
m.end()  :  19
email address :  hello@learntoanalayzedata.com 


**`match()`** checks for a match only at the beginning of the string, while **`search()`** checks for a match anywhere in the string.

In [None]:
import re

line = "Learn to Analyze Data with Scientific Python"
m = re.search(r'(python)', line, re.I)

if m:
  print("m.group() : ", m.group())
else:
  print("No match by re.search")

m = re.match(r'(python)', line, re.I)

if m:
  print("m.group() : ", m.group())
else:
  print("No match by re.match")

m.group() :  Python
No match by re.match


#### **The `group()` Function**

In [None]:
import re

# A string.
name = "Learn Scientific"

# Match with named groups (raw string so \w and \W reach the regex engine unchanged).
match = re.match(r"(?P<first>\w+)\W+(?P<last>\w+)", name)

# Print groups using names as id.
if match:
    print(match.groupdict())
    print(match.groups())
    print(match.group())
    print(match.group(0))
    print(match.group(1))
    print(match.group("first"))
    print(match.group(2))
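    print(match.group("last"))

{'first': 'Learn', 'last': 'Scientific'}
('Learn', 'Scientific')
Learn Scientific
Learn Scientific
Learn
Learn
Scientific
Scientific


In [13]: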
import re

name = "Scientific Python"

# Match names.
match = re.match(r"(?P<first>\w+)\W+(?P<last>\w+)", name)

if match:
    # Get dict.
    d = match.groupdict()

    # Loop over dictionary with for-loop.
    for t in d:
        print("key  :", t)
        print("value:", d[t])
        print("-----:")

key  : first
value: Scientific
-----:
key  : last
value: Python
-----:


#### **The `sub()` Function**
The `sub()` function replaces **every occurrence** of a pattern with a string or the result of a function.

In [16]:
import re

phone = "Please call the phone # 8447535444"

# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("The raw phone number is : ", num)

The raw phone number is :  8447535444


In [23]:
import re
import random

def repl(m):
    # Keep the first and last letters; shuffle the letters in between.
    inner_word = list(m.group(2))
    random.shuffle(inner_word)
    return m.group(1) + "".join(inner_word) + m.group(3)

line = "Learn Scientific Python with Regex"
# The pattern matches every word of three or more characters.
munged = re.sub(r"(\w)(\w+)(\w)", repl, line)

# sub() always returns a string; the shuffle is random, so output varies per run.
print("munged : ", munged)

munged :  Laern Siticfeinc Phyton with Rgeex
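
Two further features of `re.sub()` are worth a quick sketch: the replacement string can refer back to captured groups with `\1`, `\2`, and so on, and the optional `count` argument limits how many occurrences are replaced. The date string below is made up for this example.

```python
import re

date = "2024-01-31 and 2025-12-01"
pattern = r"(\d{4})-(\d{2})-(\d{2})"

# \3, \2 and \1 in the replacement refer to the captured groups.
swapped = re.sub(pattern, r"\3/\2/\1", date)
print(swapped)     # 31/01/2024 and 01/12/2025

# count=1 limits the substitution to the first occurrence.
first_only = re.sub(pattern, r"\3/\2/\1", date, count=1)
print(first_only)  # 31/01/2024 and 2025-12-01
```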


#### **The `split()` Function**
The `split()` function splits a string at each occurrence of a pattern.

In [24]:
import re

line = "Learn, Scientific, Python"

m = re.split(r'\W+', line)

if m:
  print(m)
else:
  print("No match!")


['Learn', 'Scientific', 'Python']


In [26]:
import re

line = "+61Lean7489Scientific324234"

m = re.split('[A-Za-z]+', line)    

if m:
  print(m)
else:
  print("No match!")

['+61', '7489', '324234']
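
As a hedged extension of the cells above, `re.split()` also accepts a `maxsplit` argument, and a capturing group in the pattern keeps the separators in the result. A minimal sketch reusing the earlier sample line:

```python
import re

line = "Learn, Scientific, Python"

# maxsplit=1 stops after the first separator.
limited = re.split(r"\W+", line, maxsplit=1)
print(limited)    # ['Learn', 'Scientific, Python']

# A capturing group around the separator keeps it in the output list.
with_seps = re.split(r"(\W+)", line)
print(with_seps)  # ['Learn', ', ', 'Scientific', ', ', 'Python']
```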


#### **The `findall()` Function**
`findall()` returns all non-overlapping matches of a pattern as a list. If the pattern has no groups, each item is the matched string; if it has two or more groups, each item is a tuple of the group strings.

In [27]:
import re

line = 'your alpha@scientificprograming.io, blah beta@scientificprogramming.io blah user'

emails = re.findall(r'[\w\.-]+@[\w\.-]+', line)    

if emails:
  print(emails)
else:
  print("No match!")

['alpha@scientificprograming.io', 'beta@scientificprogramming.io']


In [28]:
import re

line = 'your alpha@scientificprograming.io, blah beta@scientificprogramming.me blah user'

tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', line)    

if tuples:
  print(tuples)
else:
  print("No match!")

[('alpha', 'scientificprograming.io'), ('beta', 'scientificprogramming.me')]
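
The shape of the list returned by `findall()` depends on how many groups the pattern contains. The sketch below shows the three cases side by side; the addresses are placeholders invented for this example.

```python
import re

line = "alpha@example.io, beta@example.me"

# No groups: each item is the whole match.
no_group = re.findall(r"[\w.-]+@[\w.-]+", line)

# One group: each item is the text of that group only.
one_group = re.findall(r"[\w.-]+@([\w.-]+)", line)

# Two groups: each item is a tuple with one string per group.
two_groups = re.findall(r"([\w.-]+)@([\w.-]+)", line)

print(no_group)    # ['alpha@example.io', 'beta@example.me']
print(one_group)   # ['example.io', 'example.me']
print(two_groups)  # [('alpha', 'example.io'), ('beta', 'example.me')]
```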
