# Regular Expressions

Regular expressions (regex) are a pattern syntax supported by many programming languages. They are used to search text by defining patterns.

In [2]:
import re

# Search

search() takes the pattern and text to scan, and returns a `Match object` when the pattern is found.<br/>
If the pattern is not found, search() returns None.

In [47]:
pattern = 'this'
text = 'Does this text match the pattern?'

print(re.search(pattern,text))
print(type(re.search(pattern,text)))

<re.Match object; span=(5, 9), match='this'>
<class 're.Match'>


In [18]:
if re.search(pattern,text):
    print ('A match was found')
else:
    print ('No match was found')

A match was found


### Match object properties
The `Match object` holds information about the nature of the match, including the original input string, <br/>
the regular expression used, and the location within the original string where the pattern occurs.<br/>
The `group` method returns subgroup(s) of the match by indices or names.<br/>
0 is the default and will return the entire match.

In [79]:
match_object = re.search(pattern,text)
print(match_object.re)
print(f'Original text: {match_object.string}')
print(f'Matched string starts at: {match_object.start()}')
print(f'Matched string ends at: {match_object.end()}')
print(f'The string that was found: {match_object.group()}')

re.compile('this')
Original text: Does this text match the pattern?
Matched string starts at: 5
Matched string ends at: 9
The string that was found: this


### Searching for multiple patterns within the text

In [72]:
patterns = [ 'this', 'that' ]
text = 'Does this text match the pattern?'

for pattern in patterns:
    print('Looking for "{}" in "{}" ->'.format(pattern, text))

    if re.search(pattern,  text):
        print('found a match!')
    else:
        print('no match')

Looking for "this" in "Does this text match the pattern?" ->
found a match!
Looking for "that" in "Does this text match the pattern?" ->
no match


### findall
Many times our patterns may have multiple matches throughout the text. How do we find them all?

In [81]:
pattern = 'this'
text = 'Does this text match this pattern?'

re.findall(pattern, text)

['this', 'this']

`findall` returns a list of all the matches in the string. <br/>By counting the items in that list we can get the number of matches.

In [84]:
print (f'The string "{pattern}" was found {len(re.findall(pattern, text))} times')

The string "this" was found 2 times


In [85]:
for match in re.findall(pattern, text):
    print(match.upper())

THIS
THIS


Use `finditer()` to iterate over the match objects found in a given string.

In [88]:
re.finditer(pattern, text)

for match in re.finditer(pattern, text):
    s = match.start()
    e = match.end()
    print (f'Found "{pattern}" at location {s}-{e}')

Found "this" at location 5-9
Found "this" at location 21-25


# Patterns with special characters (wildcards)

* `.`       - Any Character Except New Line<br/><br/>
* `\d`      - Digit (0-9)<br/><br/>
* `\D`      - Not a Digit (0-9)<br/><br/>
* `\w`      - Word Character (a-z, A-Z, 0-9, _)<br/><br/>
* `\W`      - Not a Word Character<br/><br/>
* `\s`      - Whitespace (space, tab, newline)<br/><br/>
* `\S`      - Not Whitespace (space, tab, newline)<br/><br/>
* `\b`      - Word Boundary<br/><br/>
* `\B`      - Not a Word Boundary<br/><br/>
* `^`       - Beginning of a String<br/><br/>
* `$`       - End of a String

In [None]:
def search_patterns(pattern, text):
    """Print every match of pattern in text, with its start and end index."""
    found = False
    for match in re.finditer(pattern, text):
        found = True
        s = match.start()
        e = match.end()
        print('Found "{}" at {}:{}'.format(text[s:e], s, e))
    if not found:
        print('No match was found')


text = 'abbaaabbbbaaaaa'

The letters "ab"

In [None]:
search_patterns('ab', text)

The letters "zz"

In [None]:
search_patterns('zz', text)

a followed by zero or more b

In [None]:
search_patterns('ab*', text)

a followed by one or more b

In [None]:
search_patterns('ab+', text)

a followed by zero or one b

In [None]:
search_patterns('ab?', text)

## Quantifiers
* `*`       - 0 or More<br/><br/>
* `+`       - 1 or More<br/><br/>
* `?`       - 0 or One<br/><br/>
* `{3}`     - Exact Number<br/><br/>
* `{3,4}`   - Range of Numbers (Minimum, Maximum)

a followed by three b

In [None]:
search_patterns('ab{3}', text)

a followed by two to three b

In [None]:
search_patterns('ab{2,3}', text)
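
The cells above print their results one match at a time. As a compact comparison, the sketch below collects the matches of each quantifier pattern with `re.findall` on the same text. The dictionary name `quantifier_results` is introduced here only for illustration.

```python
import re

text = 'abbaaabbbbaaaaa'

# Map each quantifier pattern to the list of substrings it matches in `text`.
quantifier_results = {
    pattern: re.findall(pattern, text)
    for pattern in ['ab*', 'ab+', 'ab?', 'ab{3}', 'ab{2,3}']
}

for pattern, matches in quantifier_results.items():
    print(f'{pattern:8} -> {matches}')
```

Note that `ab*` and `ab?` also match a lone `a`, because both allow zero occurrences of `b`.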

Demo 6
Turning off Greedy-Behavior

By default, a repetition instruction consumes as much of the input
as possible while matching the pattern. This so-called greedy behavior can be turned off by
following the repetition instruction with `?`.

In [None]:
string = 'abbaaabbbbaaaaa'

search_patterns('ab*?', string) # a followed by zero or more b, as few as possible

search_patterns('ab+?', string) # a followed by one or more b, as few as possible

search_patterns('ab??', string) # a followed by zero or one b, preferring zero

search_patterns('ab{3}?', string) # a followed by exactly three b (same as greedy)

search_patterns('ab{2,3}?', string) # a followed by two to three b, preferring two
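
To make the difference visible, the following sketch runs each greedy pattern next to its non-greedy counterpart on the same text. The variable names (`pairs`, `greedy_first`, `lazy_first`) are chosen for this illustration only.

```python
import re

text = 'abbaaabbbbaaaaa'

# Each pair is (greedy pattern, non-greedy pattern).
pairs = [('ab*', 'ab*?'), ('ab+', 'ab+?'), ('ab{2,3}', 'ab{2,3}?')]

for greedy, lazy in pairs:
    print(f'{greedy!r}: {re.findall(greedy, text)}')
    print(f'{lazy!r}: {re.findall(lazy, text)}')

# The first match shows the contrast most directly.
greedy_first = re.search('ab*', text).group()   # takes every available b
lazy_first = re.search('ab*?', text).group()    # takes no b at all
print(greedy_first, lazy_first)
```

The non-greedy form still has to satisfy the minimum count, which is why `ab+?` returns `ab` and not just `a`.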

## Sets & Groups

* `[]`      - Matches Characters in brackets<br/><br/>
* `[^ ]`    - Matches Characters NOT in brackets<br/><br/>
* `|`       - Either Or<br/><br/>
* `( )`     - Group<br/><br/>
* `[a-z]`   - all lowercase letters<br/><br/>
* `[A-Z]`   - all uppercase letters<br/><br/>
* `[0-9]`   - all digits

Demo 7
Character Sets

In [None]:
search_patterns('[ab]', string)    # either a or b

search_patterns('a[ab]+', string)  # a followed by one or more a or b

Demo 8
Except

A character set can also be used to exclude specific characters.
The special marker ^ means to look for characters not in the set following.
This pattern finds all of the substrings that do not contain the characters `-`, `.`, `!`, `?`, or a space.

In [None]:
string = 'This is some text -- with punctuation. Can we remove it? Yes, we can!'

search_patterns('[^-.!? ]+', string) # sequences without -,.,!,? or space
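
One possible use of a negated set is to answer the question in the sample text. The sketch below collects the matching runs with `re.findall` and joins them back together. The names `sentence`, `words` and `cleaned` are introduced here for illustration.

```python
import re

sentence = 'This is some text -- with punctuation. Can we remove it? Yes, we can!'

# Each match is a run of characters that are not -, ., !, ? or a space.
words = re.findall(r'[^-.!? ]+', sentence)

# Re-join the runs with single spaces.
cleaned = ' '.join(words)
print(cleaned)
```

The comma after "Yes" survives because `,` is not listed in the set. Adding it inside the brackets would exclude it as well.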

Demo 9
Ranges

As character sets grow larger, typing every character that should (or should not)
match becomes tedious. A more compact format using character ranges lets you define
a character set to include all of the contiguous characters between a start and
stop point.

In [None]:
string = 'This is some text -- with punctuation.'

search_patterns('[a-z]+',string)  # sequences of lower case letters

search_patterns('[A-Z]+',string)  # sequences of upper case letters

search_patterns('[a-zA-Z]+',string)  # sequences of lower or upper case letters

search_patterns('[A-Z][a-z]+',string)  # one upper case letter followed by lower case letters

Demo 10
Dot Metacharacter

As a special case of a character set the metacharacter dot, or period (.),
indicates that the pattern should match any single character in that position.

In [None]:
string = 'abbaaabbbbaaaaa'

search_patterns('a.',string)  # a followed by any one character

search_patterns('b.',string)  # b followed by any one character

search_patterns('a.*b',string)  # a followed by anything, ending in b

Demo 11
Escape Codes

Escape Codes
\d  a digit
\D  a non-digit
\s  whitespace (tab, space, newline, etc.)
\S  non-whitespace
\w  alphanumeric
\W  non-alphanumeric

In [None]:
string = 'This is a prime #1 example!'

search_patterns(r'\d+', string)  # sequence of digits

search_patterns(r'\D+', string)  # sequence of non-digits

search_patterns(r'\s+', string)  # sequence of whitespace

search_patterns(r'\S+', string)  # sequence of non-whitespace

search_patterns(r'\w+', string)  # alphanumeric characters

search_patterns(r'\W+', string)  # non-alphanumeric

Demo 13
Anchoring

Anchors
^   start of string, or line
$   end of string, or line
\A  start of string
\Z  end of string
\b  empty string at the beginning or end of a word
\B  empty string not at the beginning or end of a word

In [None]:
string = 'This is some text -- with punctuation.'

search_patterns(r'^\w+',string)  # word at start of string

search_patterns(r'\A\w+',string)  # word at start of string

search_patterns(r'\w+\S*$',string)  # word at end of string, with optional punctuation

search_patterns(r'\w+\S*\Z',string)  # word at end of string, with optional punctuation

search_patterns(r'\w*t\w*',string)  # word containing 't'

search_patterns(r'\bt\w+',string)  # 't' at start of word

search_patterns(r'\w+t\b', string)  # 't' at end of word

Demo 14
Dissecting Matches with Groups

In [None]:
text = 'This is some text -- with punctuation.'

In [None]:
print(text)

In [None]:
for pattern in [ r'^(\w+)',           # word at start of string
                 r'(\w+)\S*$',        # word at end of string, with optional punctuation
                 r'(\bt\w+)\W+(\w+)', # word starting with 't' then another word
                 r'(\w+t)\b',         # word ending with 't'
                 ]:
    regex = re.compile(pattern)
    match = regex.search(text)
    print ('Matching "{}"'.format(pattern))
    print ('  ', match.groups())
    

Demo 15
Dissecting Matches with Groups

In [None]:
text = 'This is some text -- with punctuation.'

In [None]:
print ('Input text : "{}"'.format(text))

word starting with 't' then another word

In [None]:
regex = re.compile(r'(\bt\w+)\W+(\w+)')

In [None]:
print ('Pattern : "{}"'.format(regex.pattern))

In [None]:
match = regex.search(text)

In [None]:
print ('Entire match          :', match.group(0))

In [None]:
print ('Word starting with "t":', match.group(1))

In [None]:
print ('Word after "t" word   :', match.group(2))

In [None]:
print(match.groups())
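
Groups can also be retrieved by name instead of by index. The sketch below rewrites the same pattern with named groups, using the `(?P<name>...)` syntax of the `re` module. The group names `t_word` and `next_word` are chosen here for illustration and are not part of the original pattern.

```python
import re

text = 'This is some text -- with punctuation.'

# (?P<name>...) defines a named group; the names below are illustrative.
regex = re.compile(r'(?P<t_word>\bt\w+)\W+(?P<next_word>\w+)')
match = regex.search(text)

print(match.group('t_word'))      # same as match.group(1)
print(match.group('next_word'))   # same as match.group(2)
print(match.groupdict())          # all named groups as a dictionary
```

Named groups tend to keep longer patterns readable, since the calling code does not depend on the position of each group.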

In [20]:
text = 'this is some text!'

In [36]:
print(re.search('sOme', text, re.I))

<re.Match object; span=(8, 12), match='some'>


In [38]:
mo = re.search('some', text)

In [44]:
mo.group(1)

IndexError: no such group

# Finditer

In [18]:
urls = ['www.ynet.co.il', 'www.google.com', 'abc.edf', 'http://www.bing.com', 'https://learnpython.edu']
pat = re.compile(r'[a-z]+\.?[a-z]+\.(com|co\.il|edu)')  # raw string keeps \. intact
for match in pat.finditer('-'.join(urls)):
    print(match)

<re.Match object; span=(0, 14), match='www.ynet.co.il'>
<re.Match object; span=(15, 29), match='www.google.com'>
<re.Match object; span=(45, 57), match='www.bing.com'>
<re.Match object; span=(66, 81), match='learnpython.edu'>


In [23]:
pat.findall(''.join(urls))

['co.il', 'com', 'com', 'edu']
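
`findall` returned only the suffixes here because the pattern contains one capturing group: in that case `findall` returns the text of the group, not the whole match. The sketch below contrasts this with a non-capturing group, `(?:...)`. The variable names are introduced for illustration.

```python
import re

urls = ['www.ynet.co.il', 'www.google.com', 'abc.edf',
        'http://www.bing.com', 'https://learnpython.edu']
joined = '-'.join(urls)

# One capturing group: findall returns only the text captured by that group.
capturing = re.compile(r'[a-z]+\.?[a-z]+\.(com|co\.il|edu)')
suffixes = capturing.findall(joined)

# (?:...) groups without capturing, so findall returns the whole match.
non_capturing = re.compile(r'[a-z]+\.?[a-z]+\.(?:com|co\.il|edu)')
domains = non_capturing.findall(joined)

print(suffixes)
print(domains)
```

`finditer` is not affected by this, since each match object always exposes the whole match through `group(0)`.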

In [25]:
'-'.join(urls)

'www.ynet.co.il-www.google.com-abc.edf-http://www.bing.com-https://learnpython.edu'

In [26]:
def sqr_gen(num):
    for i in range(num):
        yield i**2

In [35]:
sqr_count = sqr_gen(10)

## Flags

* `re.IGNORECASE` makes the pattern case insensitive so that it matches strings of different capitalizations
* `re.MULTILINE` matters when the input string contains newline characters (\n) and you want to match line by line:<br/> with this flag the start and end metacharacters (^ and $ respectively) match<br/> at the beginning and end of each line, not only at the beginning and end of the whole input string
* `re.DOTALL` allows the dot (.) metacharacter to match all characters, including the newline character (\n)
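
A minimal sketch of the three flags on a two-line string follows. The sample text and variable names are made up for this illustration.

```python
import re

text = 'First line\nsecond line'

# IGNORECASE: 'FIRST' matches 'First'.
ignorecase_match = re.search('FIRST', text, re.IGNORECASE)

# Without MULTILINE, ^ only matches at the very start of the string.
starts_default = re.findall(r'^\w+', text)
# With MULTILINE, ^ also matches right after each newline.
starts_multiline = re.findall(r'^\w+', text, re.MULTILINE)

# Without DOTALL, the dot stops at the newline.
dot_default = re.search(r'First.*', text).group()
# With DOTALL, the dot also matches the newline.
dot_dotall = re.search(r'First.*', text, re.DOTALL).group()

print(ignorecase_match.group())
print(starts_default, starts_multiline)
print(repr(dot_default), repr(dot_dotall))
```

Flags can be combined with the `|` operator, for example `re.IGNORECASE | re.MULTILINE`.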