# regex 
https://developers.google.com/edu/python/regular-expressions

| sign | explanation |
|----------------|-----------------|
|{ }|curly braces { } are used by a quantifier with specific limits|
|[ ]| Square brackets [ ] define a character class|
|( )| Parentheses ( ) can be used for grouping|
|\\(| A literal parenthesis has to be escaped|
|.| matches any single character except newline '\n'|
|\w| Matches a word character: a letter, digit or underscore [a-zA-Z0-9_]|
|\W| Matches a non-word character: anything other than a letter, digit or underscore|
|\b | boundary between word and non-word |
|\t, \n, \r| Tab, newline, carriage return|
|*| Zero or more occurrences of the preceding element|
|+| One or more occurrences of the preceding element|
|?| Zero or one occurrence of the preceding element. Used after a * or a +, it makes the match lazy: it matches as few characters as possible while still fitting the regex|
| [a-z] or [A-Z] | Matches any character between, and including, [a to z] or [A to Z]|
| [0-9]|Matches any digit between, and including [0-9]| 
| \D| Matches a Non-Digit character|
|\d|Matches a Digit Character|
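|$| End of the string (or just before its trailing newline); end of each line with re.MULTILINE|
|\Z| End of the string only|
|\s| Whitespace character (space, tab, newline, ...)|
|^| Beginning of the string; beginning of each line with re.MULTILINE|
|{m,n}| Between m and n occurrences of the preceding element|
|[^...]| Matches any character other than the ones inside the square brackets|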
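
A few of the table entries are easier to read in action. The snippet below is only a small sketch using Python's `re` module; the sample strings and variable names are made up for illustration.

```python
import re

# \w matches letters, digits and underscore; \W matches everything else.
word_chars = re.findall(r'\w', 'a_1-b!')       # ['a', '_', '1', 'b']
non_word_chars = re.findall(r'\W', 'a_1-b!')   # ['-', '!']

# {m,n} quantifier: between 2 and 3 digits, taking as many as possible.
digits = re.search(r'\d{2,3}', 'id 12345').group()        # '123'

# [^...] negated character class: everything up to the first comma.
first_field = re.search(r'[^,]+', 'alpha,beta').group()   # 'alpha'

# \b word boundary: 'cat' only as a whole word, not inside 'concat'.
whole_word_span = re.search(r'\bcat\b', 'concat cat').span()  # (7, 10)

print(word_chars, non_word_chars, digits, first_field, whole_word_span)
```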

## Basic Examples

The basic rules of regular expression search for a pattern within a string are:
- The search proceeds through the string from start to end, stopping at the first match found
- All of the pattern must be matched, but not all of the string
- If match = re.search(pat, str) is successful, match is not None and in particular match.group() is the matching text 
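
Because `re.search` returns `None` when nothing matches, the result is usually tested before `group()` is called. A minimal sketch of that idiom follows; the helper name `find_word` and the sample strings are hypothetical.

```python
import re

def find_word(text):
    """Return the text matched by 'word:' plus three word chars, or None."""
    match = re.search(r'word:\w\w\w', text)
    if match:                 # a match object is truthy, None is falsy
        return match.group()  # the matching text
    return None

print(find_word('an example word:cat!!'))  # word:cat
print(find_word('nothing to see here'))    # None
```

Calling `match.group()` on a failed search would raise an `AttributeError`, which is why the `if match:` check comes first.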

In [3]:
## Search for pattern 'iii' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.
import re
re.search(r'igs', 'piiig') # not found, match == None
re.search(r'iii', 'piiig') # found, match.group() == "iii"

<re.Match object; span=(1, 4), match='iii'>

In [6]:
re.search(r'..g', 'piiig')  # . finds any char except newline \n

<re.Match object; span=(2, 5), match='iig'>

In [5]:
re.search(r'\d\d\d', 'p123g')  # \d = digit
re.search(r'\w\w\w', '@@abcd!!')  # \w = letter, digit or underscore char

<re.Match object; span=(2, 5), match='abc'>

## Repetition

Things get more interesting when you use +, * and ? to specify repetition in the pattern:

    + -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
    * -- 0 or more occurrences of the pattern to its left
    ? -- match 0 or 1 occurrences of the pattern to its left 

Leftmost & Largest<br>
First the search finds the leftmost match for the pattern, and second it tries to use up as much of the <br>
string as possible -- i.e. + and * go as far as possible (the + and * are said to be "greedy").

In [10]:
match1 = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx') # three digits \d, each pair separated by optional whitespace \s*
match2= re.search(r'\d\s*\d\s*\d', 'xx12  3xx') 
match3= re.search(r'\d\s*\d\s*\d', 'xx123xx') 
print(f"{match1[0]} \n{match2[0]} \n{match3[0]}")

1 2   3 
12  3 
123


In [11]:
# one or more instances of a char
import re
match = re.search(r"(l+)","Hello") # finds one or more consecutive "l", as many as possible
print(match[1])

ll


In [None]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig') # found, match.group() == "piii"

## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii') # found, match.group() == "ii"
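
The greedy behaviour can be switched off by putting a ? after the quantifier, which gives the lazy match mentioned in the table. A short sketch comparing the two follows; the sample HTML-like string is made up for illustration.

```python
import re

text = '<b>foo</b> and <i>bar</i>'

# Greedy: .* runs as far as possible, up to the last '>'.
greedy = re.search(r'<.*>', text).group()   # '<b>foo</b> and <i>bar</i>'

# Lazy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.search(r'<.*?>', text).group()    # '<b>'

# The same applies to +: i+? takes only one i instead of all three.
lazy_plus = re.search(r'pi+?', 'piiig').group()  # 'pi'

print(greedy)
print(lazy)
print(lazy_plus)
```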

## Capturing Parentheses

In [None]:
name = 'Okke Bartholomäo'
ma = re.search(r"(\D+) (\D+)", name) # \D+ = one or more non-digits; each () group can be retrieved separately
print(ma[0])
print(ma[1])
print(ma[2])

Okke Bartholomäo
Okke
Bartholomäo


In [None]:
m = re.search(r"can('t)", "We can't do it!")
print(m[0])
print(m[1])

can't
't


In [None]:
# \((\d{3})\) matches the literal "(303)" and captures the digits 303 separately
m = re.search( r"\((\d{3})\)\d{3}-\d{4}", "(303)555-1212" )
print(m[0])
print(m[1])

(303)555-1212
303


## Non-capturing Parentheses
(?:...) creates a non-capturing group: the enclosed pattern is grouped, but not stored as a separate group in the match.<br>
The final ? makes the preceding group optional, so the pattern finds "can" as well as "can't".<br>

In [None]:
import re
match1 = re.search(r"can(?:'t)?","We can do it" )
match2 = re.search(r"can(?:'t)?","We can't do it" )
print(match1[0])
print(match2[0])

can
can't
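
The difference between the two kinds of parentheses shows up in `groups()`. The sketch below runs the same sentence through a capturing and a non-capturing version of the pattern; the variable names are chosen for illustration only.

```python
import re

sentence = "We can't do it"

capturing = re.search(r"can('t)?", sentence)        # ('t) is stored as group 1
non_capturing = re.search(r"can(?:'t)?", sentence)  # (?:'t) is not stored

print(capturing[0], non_capturing[0])  # both overall matches are: can't
print(capturing.groups())              # ("'t",)
print(non_capturing.groups())          # ()
```

Both patterns match the same text; only the number of stored groups changes.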


In [None]:
# any char
import re
matchObj = re.search(r"(..o)","Hello") # any two characters followed by "o"
print (matchObj.group(1))

llo


In [None]:
# zero or more of that char
import re
matchObj = re.search(r"e(l*)","Hello")
print (matchObj.group(1))

ll


In [None]:
# (?:...) is non-capturing, the 's' is optional, and the whole first group is optional;
# ([^/]*) then captures everything up to the next '/'
testString="http://www.thehindu.com/features/education/issues"
matchObj = re.search(r"(?:https?://)?([^/]*)",testString)
print (matchObj.group(0))
print (matchObj.group(1))

http://www.thehindu.com
www.thehindu.com


In [None]:
## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar') # not found, match == None
## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar') # found, match.group() == "bar"
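
The anchors $ and \Z from the table differ only when the string ends in a newline, and ^ changes meaning with the `re.MULTILINE` flag. A small sketch, assuming a made-up two-line string:

```python
import re

text = 'foo\nbar\n'

# $ also matches just before a trailing newline; \Z only at the very end.
dollar = re.search(r'bar$', text)     # matches 'bar'
strict_end = re.search(r'bar\Z', text)  # None, because of the final '\n'

# By default ^ matches only at the start of the string;
# with re.MULTILINE it matches at the start of every line.
first_only = re.findall(r'^\w+', text)                   # ['foo']
every_line = re.findall(r'^\w+', text, re.MULTILINE)     # ['foo', 'bar']

print(dollar.group(), strict_end, first_only, every_line)
```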