# RegEx

## Importing Python's re module

In [2]:
import re

test_string = 'E-mail: xiangxuan64@gmail.com, xiang_xuan@qq.com, Null'
# Escape the dot so it matches a literal "." instead of any character
pattern = re.compile(r"(?P<name>\w+?)@(?P<service_provider>\w+?)\.com")

## Methods

### match(pattern, string, flags=0)

Matches only at the beginning of the string; returns None if the pattern does not match there.

In [3]:
print(pattern.match(test_string))

None
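
As a small sketch of why the call above returns None: the sample string starts with "E-mail: ", so no address begins at index 0. A compiled pattern's match() also accepts a start position, which is used here to begin at index 8. The names at_start and from_offset are illustrative only.

```python
import re

# Same sample data as above; the dot before "com" is escaped.
test_string = 'E-mail: xiangxuan64@gmail.com, xiang_xuan@qq.com, Null'
pattern = re.compile(r"(?P<name>\w+?)@(?P<service_provider>\w+?)\.com")

# Fails: the string begins with "E-mail: ", not with an address.
at_start = pattern.match(test_string)

# Succeeds: matching is anchored at index 8, where the first address begins.
from_offset = pattern.match(test_string, 8)

print(at_start)             # None
print(from_offset.group())  # xiangxuan64@gmail.com
```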


### search(pattern, string, flags=0)

Scans the whole string and returns only the first match.


In [4]:
pattern.search(test_string)

<re.Match object; span=(8, 29), match='xiangxuan64@gmail.com'>

### findall(pattern, string, flags=0)

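Returns a list of all non-overlapping matches. Because this pattern contains two groups, each item is the groups() tuple of one match.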


In [5]:
pattern.findall(test_string)

[('xiangxuan64', 'gmail'), ('xiang_xuan', 'qq')]
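
The shape of the findall result depends on how many capturing groups the pattern has. The sketch below uses two simplified patterns (not the compiled pattern above) to show the zero-group and one-group cases.

```python
import re

test_string = 'E-mail: xiangxuan64@gmail.com, xiang_xuan@qq.com, Null'

# No capturing group: each item is the whole match.
whole = re.findall(r'\w+@\w+\.com', test_string)

# One capturing group: each item is the text of that group.
names = re.findall(r'(\w+)@\w+\.com', test_string)

print(whole)  # ['xiangxuan64@gmail.com', 'xiang_xuan@qq.com']
print(names)  # ['xiangxuan64', 'xiang_xuan']
```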

### finditer(pattern, string, flags=0)

Returns an iterator that yields one re.Match object per match.


In [6]:
result = pattern.finditer(test_string)
print([i for i in result])

# The iterator is exhausted after one pass, so create a new one
result = pattern.finditer(test_string)
print([i.group() for i in result])

[<re.Match object; span=(8, 29), match='xiangxuan64@gmail.com'>, <re.Match object; span=(31, 48), match='xiang_xuan@qq.com'>]
['xiangxuan64@gmail.com', 'xiang_xuan@qq.com']
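
Since each item is a full re.Match object, the named groups are available per match. A minimal sketch, where records is an illustrative name:

```python
import re

test_string = 'E-mail: xiangxuan64@gmail.com, xiang_xuan@qq.com, Null'
pattern = re.compile(r"(?P<name>\w+?)@(?P<service_provider>\w+?)\.com")

# Build one dict per match from the named groups.
records = [m.groupdict() for m in pattern.finditer(test_string)]

for record in records:
    print(record['name'], record['service_provider'])
```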


### sub(pattern, repl, string, count=0, flags=0)

Returns a new string in which each match is replaced by repl; the original string is not modified.

In [15]:
print(re.sub(r'@\w+?\.com', '@ipm.edu.mo', test_string))
# \2 refers to the second group (service_provider)
print(pattern.sub(r'\2', test_string))

E-mail: xiangxuan64@ipm.edu.mo, xiang_xuan@ipm.edu.mo, Null
E-mail: gmail, qq, Null
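
A few other forms of repl and count, sketched under the same sample data. The helper mask and the placeholder text '<hidden>' are made up for illustration.

```python
import re

test_string = 'E-mail: xiangxuan64@gmail.com, xiang_xuan@qq.com, Null'
pattern = re.compile(r"(?P<name>\w+?)@(?P<service_provider>\w+?)\.com")

# \g<name> refers to a named group inside the replacement template.
by_name = pattern.sub(r'\g<name>', test_string)

# count=1 replaces only the first match.
first_only = pattern.sub('<hidden>', test_string, count=1)

# repl may be a callable: it receives each re.Match and returns the new text.
def mask(match):
    stars = '*' * len(match.group('name'))
    return stars + '@' + match.group('service_provider') + '.com'

masked = pattern.sub(mask, test_string)

print(by_name)
print(first_only)
print(masked)
```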


### subn(pattern, repl, string, count=0, flags=0)

Same as sub, but returns a 2-tuple (new_string, number_of_substitutions).

In [8]:
re.subn(r'@\w+?\.com', '@ipm.edu.mo', test_string)

('E-mail: xiangxuan64@ipm.edu.mo, xiang_xuan@ipm.edu.mo, Null', 2)

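### split(pattern, string, maxsplit=0, flags=0)

Splits the string at each match and returns a list. If the pattern contains capturing groups, the captured text is also included in the result.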

In [21]:
print(re.split(r'[:,] ', test_string))
pattern.split(test_string)

['E-mail', 'xiangxuan64@gmail.com', 'xiang_xuan@qq.com', 'Null']


['E-mail: ', 'xiangxuan64', 'gmail', ', ', 'xiang_xuan', 'qq', ', Null']
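
The second output above contains the group text because the compiled pattern has capturing groups. A short sketch of that behavior and of maxsplit, using a simpler separator pattern:

```python
import re

test_string = 'E-mail: xiangxuan64@gmail.com, xiang_xuan@qq.com, Null'

# maxsplit=1 stops after the first split; the remainder stays in one piece.
head_tail = re.split(r'[:,] ', test_string, maxsplit=1)

# A capturing group around the separator keeps the separators in the result.
with_separators = re.split(r'([:,]) ', test_string)

print(head_tail)
print(with_separators)
```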

## re.Match object

In [10]:
# icecream is a third-party debugging helper (pip install icecream)
from icecream import ic
test_string

'E-mail: xiangxuan64@gmail.com, xiang_xuan@qq.com, Null'

In [11]:
result = pattern.search(test_string)
result

<re.Match object; span=(8, 29), match='xiangxuan64@gmail.com'>

In [12]:
ic(result.groups())
ic(result.groupdict())
ic(result.group())

ic| result.groups(): ('xiangxuan64', 'gmail')
ic| result.groupdict(): {'name': 'xiangxuan64', 'service_provider': 'gmail'}
ic| result.group(): 'xiangxuan64@gmail.com'


'xiangxuan64@gmail.com'

In [13]:
ic(result.span())
ic(result.start(2))
ic(result.end(1))
test_string[result.start('name'): result.end('service_provider')]

ic| result.span(): (8, 29)
ic| result.start(2): 20
ic| result.end(1): 19


'xiangxuan64@gmail'

In [14]:
result.expand(r'Name: \g<name> and the mail service provider is \2')

'Name: xiangxuan64 and the mail service provider is gmail'
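
Two more ways to read groups from a match, sketched with plain print instead of ic so that only the standard library is needed. Subscript access on a match requires Python 3.6 or later.

```python
import re

test_string = 'E-mail: xiangxuan64@gmail.com, xiang_xuan@qq.com, Null'
pattern = re.compile(r"(?P<name>\w+?)@(?P<service_provider>\w+?)\.com")
result = pattern.search(test_string)

# Subscript access is equivalent to result.group(...).
name = result['name']
provider = result[2]

# group() with several arguments returns a tuple.
pair = result.group('name', 'service_provider')

# span() accepts a group number or name, like start() and end().
name_span = result.span('name')

print(name, provider, pair, name_span)
```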

## Appendix

### flags

[Regular expression - Python doc](https://docs.python.org/3/library/re.html)

**re.A / re.ASCII**
Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).

Note that for backward compatibility, the re.U flag still exists (as well as its synonym re.UNICODE and its embedded counterpart (?u)), but these are redundant in Python 3 since matches are Unicode by default for strings (and Unicode matching isn’t allowed for bytes).
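
A minimal sketch of the difference, using made-up sample text written with escape sequences so the exact code points are unambiguous:

```python
import re

text = 'caf\u00e9 123'                  # "café 123"
digits = '\u0661\u0662\u0663 123'       # Arabic-Indic digits, then ASCII digits

# Default (Unicode) matching: é is a word character, Arabic-Indic digits are digits.
unicode_words = re.findall(r'\w+', text)
unicode_digits = re.findall(r'\d+', digits)

# With re.ASCII, only [a-zA-Z0-9_] and [0-9] qualify.
ascii_words = re.findall(r'\w+', text, re.ASCII)
ascii_digits = re.findall(r'\d+', digits, re.ASCII)

print(unicode_words, ascii_words)
print(unicode_digits, ascii_digits)
```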

**re.I / re.IGNORECASE**
Perform case-insensitive matching; expressions like [A-Z] will also match lowercase letters. Full Unicode matching (such as Ü matching ü) also works unless the re.ASCII flag is used to disable non-ASCII matches. The current locale does not change the effect of this flag unless the re.LOCALE flag is also used. Corresponds to the inline flag (?i).

Note that when the Unicode patterns [a-z] or [A-Z] are used in combination with the IGNORECASE flag, they will match the 52 ASCII letters and 4 additional non-ASCII letters: ‘İ’ (U+0130, Latin capital letter I with dot above), ‘ı’ (U+0131, Latin small letter dotless i), ‘ſ’ (U+017F, Latin small letter long s) and ‘K’ (U+212A, Kelvin sign). If the ASCII flag is used, only letters ‘a’ to ‘z’ and ‘A’ to ‘Z’ are matched.
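
The following sketch illustrates both points: ordinary case-insensitive matching, and the Kelvin sign matching [a-z] unless re.ASCII is also set. The sample strings are made up.

```python
import re

# Case-insensitive literal matching.
hits = re.findall(r'gmail', 'Gmail GMAIL gmail', re.I)

# U+212A (Kelvin sign) lowercases to "k", so it falls inside [a-z] under re.I.
kelvin = '\u212a'
unicode_hit = re.fullmatch(r'[a-z]', kelvin, re.I)

# Adding re.ASCII restricts the match to the 52 ASCII letters.
ascii_hit = re.fullmatch(r'[a-z]', kelvin, re.I | re.A)

print(hits)
print(unicode_hit is not None, ascii_hit is not None)
```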

**re.L / re.LOCALE**
Make \w, \W, \b, \B and case-insensitive matching dependent on the current locale. This flag can be used only with bytes patterns. The use of this flag is discouraged as the locale mechanism is very unreliable, it only handles one “culture” at a time, and it only works with 8-bit locales. Unicode matching is already enabled by default in Python 3 for Unicode (str) patterns, and it is able to handle different locales/languages. Corresponds to the inline flag (?L).

Changed in version 3.6: re.LOCALE can be used only with bytes patterns and is not compatible with re.ASCII.

Changed in version 3.7: Compiled regular expression objects with the re.LOCALE flag no longer depend on the locale at compile time. Only the locale at matching time affects the result of matching.

**re.M / re.MULTILINE**
When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '\$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '\$' only at the end of the string and immediately before the newline (if any) at the end of the string. Corresponds to the inline flag (?m).
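
A short sketch of the effect on '^' and '$', using a made-up three-line string:

```python
import re

text = 'first line\nsecond line\nthird line'

# Default: ^ matches only at the very start, $ only at the very end.
first_default = re.findall(r'^\w+', text)
last_default = re.findall(r'\w+$', text)

# re.M: ^ and $ also match at the start and end of every line.
first_multi = re.findall(r'^\w+', text, re.M)
last_multi = re.findall(r'\w+$', text, re.M)

print(first_default, first_multi)
print(last_default, last_multi)
```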

**re.S / re.DOTALL**
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline. Corresponds to the inline flag (?s).
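
For example, with a made-up two-line string:

```python
import re

text = 'start\nend'

# Default: "." does not match the newline, so the search fails.
without_flag = re.search(r'start.end', text)

# re.S: "." matches the newline as well.
with_flag = re.search(r'start.end', text, re.S)

print(without_flag)       # None
print(with_flag.span())   # (0, 9)
```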

**re.X / re.VERBOSE**
This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>. When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

This means that the two following regular expression objects that match a decimal number are functionally equal:
```python
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")
```
Corresponds to the inline flag (?x).