
# <center> Regex

### References

* https://www.regextester.com/97916
* https://dojo.bylearn.com.br/python/dominando-o-regex-com-python/

In [1]:
# Imports

import warnings
warnings.filterwarnings('ignore')

import re

___

## Regular Expression (RegEx)

* [] - Square brackets: Square brackets specify a set of characters you wish to match.
* .  - Period: A period matches any single character (except newline '\n').
* ^  - Caret: The caret symbol ^ checks whether a string starts with a certain character.
* '*' - Star: The star symbol * matches zero or more occurrences of the pattern to its left.
* '+' - Plus: The plus symbol + matches one or more occurrences of the pattern to its left.
* ?  - Question mark: The question mark symbol ? matches zero or one occurrence of the pattern to its left.
* {} - Braces: Consider {n,m}. This means at least n and at most m repetitions of the pattern to its left.
* |  - Alternation: The vertical bar | is used for alternation (the "or" operator).
* () - Group: Parentheses () are used to group sub-patterns. For example, (a|b|c)xz matches any string that contains a, b or c followed by xz.
* \  - Backslash: The backslash \ is used to escape various characters, including all metacharacters. For example, \$a matches if a string contains $ followed by a. Here, $ is not interpreted by the RegEx engine in a special way. If you are unsure whether a character has a special meaning, you can put \ in front of it so that it is treated literally.
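
The metacharacters above can be tried out with a short sketch. The sample strings and variable names are made up for illustration; only `re.findall()`, `re.search()` and `re.finditer()` from the standard `re` module are used.

```python
import re

# Sample strings below are made up for this illustration.
char_set = re.findall(r'[abc]', 'a1b2c3')        # any one of a, b, c
any_char = re.findall(r'a.c', 'abc a-c ac')      # . needs exactly one character
starts_with_a = bool(re.search(r'^a', 'abc'))    # ^ anchors at the start

text = 'mn man maaan main'
star = re.findall(r'ma*n', text)                 # zero or more a
plus = re.findall(r'ma+n', text)                 # one or more a
optional = re.findall(r'ma?n', text)             # zero or one a
braces = re.findall(r'ma{2,3}n', text)           # two to three a

# finditer yields match objects; group() returns the full matched text
grouped = [m.group() for m in re.finditer(r'(a|b|c)xz', 'axz cabxz')]
escaped = re.findall(r'\$a', 'cost $a and a')    # \$ is a literal dollar sign

print(char_set, any_char, starts_with_a)
print(star, plus, optional, braces)
print(grouped, escaped)
```

Note that 'main' never matches any of the `ma…n` patterns, because the i breaks the run of a characters.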

## Special Sequences

* \A - Matches if the specified characters are at the start of a string.
* \b - Matches if the specified characters are at the beginning or end of a word.
* \B - Opposite of \b. Matches if the specified characters are not at the beginning or end of a word.
* \d - Matches any decimal digit. For ASCII text, equivalent to [0-9].
* \D - Matches any character that is not a decimal digit. For ASCII text, equivalent to [^0-9].
* \s - Matches any whitespace character. For ASCII text, equivalent to [ \t\n\r\f\v].
* \S - Matches any non-whitespace character. For ASCII text, equivalent to [^ \t\n\r\f\v].
* \w - Matches any word character (letters, digits and the underscore _). For ASCII text, equivalent to [a-zA-Z0-9_].
* \W - Matches any non-word character. For ASCII text, equivalent to [^a-zA-Z0-9_].
* \Z - Matches if the specified characters are at the end of a string.
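
As a rough illustration of the special sequences listed above, the sketch below runs each one against small made-up strings. The sample text and variable names are assumptions for this example only.

```python
import re

text = 'Order 66 shipped on day_3'   # made-up sample text

digits = re.findall(r'\d+', text)          # runs of decimal digits
non_digits = re.findall(r'\D+', 'a1b22c')  # runs of anything but digits
words = re.findall(r'\w+', text)           # letters, digits and underscore
non_words = re.findall(r'\W', 'a_b-c!')    # underscore counts as a word character
spaces = re.findall(r'\s', text)           # each single whitespace character
chunks = re.findall(r'\S+', 'a b\tc')      # runs of non-whitespace

at_word_start = re.findall(r'\bs\w*', 'sun rises')  # s at a word boundary
inside_word = re.findall(r'\Bs\w*', 'sun rises')    # s not at a word boundary

starts = bool(re.search(r'\AOrder', text))  # anchored at the start of the string
ends = bool(re.search(r'day_3\Z', text))    # anchored at the end of the string

print(digits, non_digits, words, non_words)
print(len(spaces), chunks, at_word_start, inside_word, starts, ends)
```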

## re functions
* re.sub(): Returns a string where matched occurrences are replaced with the content of the replace variable.
* re.subn(): Similar to re.sub(), except that it returns a tuple of 2 items containing the new string and the number of substitutions made.
* re.search(): Takes two arguments: a pattern and a string. The method looks for the first location where the RegEx pattern produces a match with the string. If the search is successful, re.search() returns a match object; if not, it returns None.
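
A minimal sketch of these functions, assuming a made-up sample string. It also contrasts `re.search()` with `re.match()` (used in a later cell), which only tries to match at the beginning of the string.

```python
import re

text = 'abc 123 def 456'   # made-up sample text

found = re.search(r'\d+', text)     # scans the whole string for the first match
at_start = re.match(r'\d+', text)   # only tries position 0, so no match here
missing = re.search(r'xyz', text)   # no match anywhere, so the result is None

replaced = re.sub(r'\d+', '#', text)           # new string only
replaced_counted = re.subn(r'\d+', '#', text)  # (new string, number of substitutions)

print(found.group(), at_start, missing)
print(replaced, replaced_counted)
```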

## Match object
* match.group(): The group() method returns the part of the string where there is a match.
* match.start(), match.end() and match.span(): The start() method returns the index of the start of the matched substring. Similarly, end() returns the end index of the matched substring, and span() returns both as a tuple.
* match.re and match.string: The re attribute of a match object returns the regular expression object. Similarly, the string attribute returns the passed string.

In [2]:
# Define a RegEx pattern: any five-letter string starting with a and ending with s.

pattern = '^a...s$'

test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
    print("Search successful.")
else:
    print("Search unsuccessful.")

Search successful.


In [3]:
# Extract numbers from a string

string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'

result = re.findall(pattern, string)
print(result)

# Output: ['12', '89', '34']

['12', '89', '34']


In [4]:
# If the pattern is not found, re.split() returns a list containing the original string.

string = 'Twelve:12 Eighty nine:89. Word before numbers 123 and words after numbers'
pattern = r'\d+'

result = re.split(pattern, string) 
print(result)

['Twelve:', ' Eighty nine:', '. Word before numbers ', ' and words after numbers']


In [5]:
# maxsplit=1 splits only at the first occurrence.
# You can pass the maxsplit argument to re.split(). It is the maximum number of splits that will occur.
# The default is 0, which means no limit.

string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = r'\d+'

result = re.split(pattern, string, maxsplit = 1) # test 0, 1, 2 .. 
print(result)

['Twelve:', ' Eighty nine:89 Nine:9.']
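
To compare the values suggested by the "test 0, 1, 2" comment, and to see the case where the pattern is not found, a small sketch along these lines may help. The variable names and the 'no digits here' string are made up for the example.

```python
import re

string = 'Twelve:12 Eighty nine:89 Nine:9.'
pattern = r'\d+'

no_limit = re.split(pattern, string, maxsplit=0)    # 0 means split at every match
first_only = re.split(pattern, string, maxsplit=1)  # split once
first_two = re.split(pattern, string, maxsplit=2)   # split twice

# Pattern not found: the result is a one-element list with the original string.
no_match = re.split(pattern, 'no digits here')

print(no_limit)
print(first_only)
print(first_two)
print(no_match)
```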


In [6]:
# Collapse each run of whitespace into a single space
# If the pattern is not found, re.sub() returns the original string.

# string containing runs of spaces and a newline
string = 'abc    12\
de    23 \n f45   6'

# matches one or more whitespace characters
pattern = r'\s+'

# single space
replace = ' '

new_string = re.sub(pattern, replace, string) 
print(new_string)

abc 12de 23 f45 6


In [7]:
# Program to collapse runs of whitespace into single spaces
import re

# string with repeated spaces
string = 'matches   all       whitespace characters'

# matches one or more whitespace characters
pattern = r'\s+'

# single space
replace = ' '

new_string = re.sub(pattern, replace, string)
print(new_string)

# Output: matches all whitespace characters
# If the pattern is not found, re.sub() returns the original string.

matches all whitespace characters


In [8]:
# You can pass count as a fourth parameter to the re.sub() method.
# If omitted, it defaults to 0, which replaces all occurrences.

# string containing a newline
string = 'abc 12\
de 23 \n   f45    6'

# matches one or more whitespace characters
pattern = r'\s+'
replace = ' '

# replace only the first two matches
new_string = re.sub(pattern, replace, string, count=2)
print(new_string)

# Output (only the first two whitespace runs are replaced):
# abc 12de 23
#    f45    6

abc 12de 23 
   f45    6


In [9]:
# Program to collapse runs of whitespace and count the replacements

# string with repeated spaces
string = 'matches     all       whitespace  characters'

# matches one or more whitespace characters
pattern = r'\s+'

# single space
replace = ' '

new_string = re.subn(pattern, replace, string) 

print(new_string)

('matches all whitespace characters', 3)
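
The cells above replace whitespace with a single space. For comparison, here is a minimal sketch of what an empty replacement string does, and of what re.sub() and re.subn() return when the pattern is not found. The sample strings and variable names are made up for the example.

```python
import re

string = 'abc    12 de \n f45'   # made-up sample text

removed = re.sub(r'\s+', '', string)     # empty replacement deletes all whitespace
collapsed = re.sub(r'\s+', ' ', string)  # single space collapses each run

# Pattern not found: the original string comes back, with a count of 0 for subn().
unchanged = re.sub(r'\d+', '#', 'no digits')
unchanged_counted = re.subn(r'\d+', '#', 'no digits')

print(removed)
print(collapsed)
print(unchanged, unchanged_counted)
```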


In [10]:
# check if 'Python' is at the beginning

string = "Python is fun"

re.search(r'\APython is', string)

<re.Match object; span=(0, 9), match='Python is'>

In [11]:
string = '39801 356, 2102 1111'

# Match a three-digit number followed by a space and a two-digit number
pattern = r'(\d{3}) (\d{2})'

# match variable contains a Match object.
match = re.search(pattern, string) 

if match:
    print(match.group())
else:
    print("pattern not found")

801 35


In [12]:
# Here, match variable contains a match object.
# Our pattern (\d{3}) (\d{2}) has two subgroups (\d{3}) and (\d{2}). 
# You can get the part of the string of these parenthesized subgroups. Here's how:

print(match.group(1))
print(match.group(2))
print(match.group(1, 2))

print(match.groups())

801
35
('801', '35')
('801', '35')


In [13]:
print(match.start())
print(match.end())
print(match.span())

2
8
(2, 8)


In [14]:
match.re
re.compile('(\\d{3}) (\\d{2})')

match.string

'39801 356, 2102 1111'

## Using r prefix before RegEx

When the r or R prefix is used before a string literal, it marks a raw string. For example, '\n' is a newline, whereas r'\n' is two characters: a backslash \ followed by n.

The backslash \ is used to escape various characters, including all metacharacters. With the r prefix, Python treats \ as a normal character, so the backslash reaches the RegEx engine unchanged.
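
The difference can be checked directly with a short sketch. The Windows-style path is a hypothetical value chosen only because it contains backslashes.

```python
import re

normal = '\n'    # one character: a newline
raw = r'\n'      # two characters: a backslash followed by n

# Without the r prefix, every backslash meant for the RegEx engine must be doubled.
same_pattern = '\\d+' == r'\d+'

path = r'C:\temp\file'                  # hypothetical path containing two backslashes
backslashes = re.findall(r'\\', path)   # the pattern \\ matches one literal backslash

print(len(normal), len(raw), same_pattern, backslashes)
```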

In [15]:
import re

string = '\n and \r are escape sequences.'

result = re.findall(r'[\n\r]', string) 
print(result)

# Output: ['\n', '\r']

['\n', '\r']


___