# re

> [Main Table of Contents](../../README.md)

## In This Notebook
- Create regex string (prefer raw)
- Special sequences
- Special characters
   - Lookahead/lookbehind
- re functions
  - match object
- Capture groups
- Non-capture groups
- References
  - Numbered references
  - Named references
- Greedy vs Lazy
  - Gotcha cases
  - Convert greedy to lazy mode

## Create regex string (prefer raw)

In [1]:
# Named raw_str rather than str, to avoid shadowing the built-in str type
raw_str = r'This is a raw string'
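
A small sketch of why raw strings are preferred for patterns: in a normal string literal Python itself interprets `\b` as a backspace character, so the regex engine never sees the word-boundary token. The sample text and variable names below are illustrative only.

```python
import re

# Normal literal: '\b' is a single backspace character.
normal = '\bcat\b'
# Raw literal: the backslash is kept, so re sees the word-boundary token \b.
raw = r'\bcat\b'

print(len(normal), len(raw))              # 5 7
print(re.findall(normal, 'cat concat'))   # []
print(re.findall(raw, 'cat concat'))      # ['cat']
```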

## Special sequences

  Sequence | Match Description
  --- | ---
  \d | digit<br>[0-9]
  \D | non-digit<br>[^0-9]
  \s | whitespace<br>[ \t\n\r\f\v]<br>*Prefer this over literal spaces in a regex string
  \S | non-whitespace<br>[^\s]
  \w | word char<br>[a-zA-Z0-9_]
  \W | non-word char<br>[^\w]
  \\\\ | backslash
  \\# | contents of the group with the same number
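
The sequences in the table can be tried on a short string. This is a minimal sketch; the sample text is made up for illustration.

```python
import re

text = 'user_42 paid $3.50\ton 2024-01-05'   # '\t' here is a real tab

digits = re.findall(r'\d+', text)
words = re.findall(r'\w+', text)             # note that \w includes the underscore
fields = re.split(r'\s+', text)              # splits on the space and the tab alike
backslashes = re.findall(r'\\', r'C:\temp\file')

print(digits)            # ['42', '3', '50', '2024', '01', '05']
print(words)             # ['user_42', 'paid', '3', '50', 'on', '2024', '01', '05']
print(fields)            # ['user_42', 'paid', '$3.50', 'on', '2024-01-05']
print(len(backslashes))  # 2
```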



## Special characters
- Quantifiers (`?`, `+`, `*`, `{}`) apply to whatever is immediately to their left
- For non-greedy (lazy) versions, add `?` immediately after the greedy version
  - e.g. `*?`, `??`
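
A quick sketch of both bullets: a quantifier repeats only the single element to its left unless that element is a group, and a trailing `?` makes it lazy. The sample strings are illustrative.

```python
import re

# + repeats only the 'b' immediately to its left.
single = re.findall(r'ab+', 'abbb abab')
# Grouping 'ab' makes + repeat the whole group.
grouped = re.findall(r'(?:ab)+', 'abbb abab')
# {2,3} is greedy and takes up to three b's; {2,3}? is lazy and stops at two.
greedy = re.findall(r'ab{2,3}', 'ab abb abbbb')
lazy = re.findall(r'ab{2,3}?', 'ab abb abbbb')

print(single)   # ['abbb', 'ab', 'ab']
print(grouped)  # ['ab', 'abab']
print(greedy)   # ['abb', 'abbb']
print(lazy)     # ['abb', 'abb']
```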

  Character | Match Description
  --- | ---
  ^|Beginning of the string<br>Exception to the general rule: it goes at the very start of the pattern
  $|End of the string, or just before a trailing newline
  . | Wildcard<br>Any character except newline
  ? | 0 or 1 time (greedy)
  +| 1 or more repetitions (greedy)
  +? | 1 or more repetitions (lazy)
  *| 0 or more repetitions (greedy)
  *?| 0 or more repetitions (lazy)
  {#} | Exactly # repetitions (greedy)
  {min,} | At least min repetitions (greedy)
  {min,max} | Both inclusive (greedy)
  a\|b |  or<br>a or b
  [] | or<br>one character in the set
  [^] | `^` as the first char inside the brackets<br>Complement of the set: one character not in the set
  (...) | Captured group
  (?:...) | Non-captured group<br>Groups part of the pattern (for repetition or alternation) without capturing it; it is not numbered and cannot be referenced or extracted
  \\# | Reference a previous capture group<br>Capture groups are numbered by position; non-capture groups are skipped
  (?P<name>...) | Named group
  (?P=name) | Reference previous named group
  (?=...) | Lookahead: matches if ... matches next, without consuming any of the string
  (?!...) | Negative lookahead: matches if ... doesn't match next
  (?<=...)| Matches if preceded by ... (must be fixed length)
  (?<!...)| Matches if not preceded by ... (must be fixed length)
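
The four lookaround forms in the last rows of the table can be sketched as follows. They check what is next to the current position but are not part of the returned match. The sample strings are made up for illustration.

```python
import re

prices = 'cost: $40, tax: $5, qty: 12'
files = 'a.txt b.py c.md'

# Lookahead: words that are followed by ':' (the ':' is not returned).
labels = re.findall(r'\w+(?=:)', prices)
# Negative lookahead: file names whose extension is not 'txt'.
non_txt = re.findall(r'\b\w+\.(?!txt\b)\w+', files)
# Lookbehind: digits preceded by '$' (the '$' is not returned).
dollars = re.findall(r'(?<=\$)\d+', prices)
# Negative lookbehind: digits not preceded by '$' or by another digit.
plain = re.findall(r'(?<![\$\d])\d+', prices)

print(labels)   # ['cost', 'tax', 'qty']
print(non_txt)  # ['b.py', 'c.md']
print(dollars)  # ['40', '5']
print(plain)    # ['12']
```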

## re functions

Function | Description
--- | ---
re.findall() | Find all non-overlapping occurrences of the regex<br>Returns list of matches
re.sub() | Replace all matches of the regex with a replacement<br>Returns string
re.split() | Split on regex<br>Returns list
re.match() | Match only at the beginning of the string<br>Returns match object or None
re.search() | First match found anywhere in the string<br>Returns match object or None

In [2]:
import re
re.findall(r'yo', 'yo  yo yo')

['yo', 'yo', 'yo']

In [3]:
re.sub(r'green', r'yellow', 'Some bananas are green green!')

'Some bananas are yellow yellow!'

In [4]:
re.split(r'\d+', 'Return56list78without 45 numbers')

['Return', 'list', 'without ', ' numbers']

### match object
- Indexing with 0 (`m[0]`) returns the matched string

  Method | Description
  --- | ---
  .group() | Returns the matched string
  .span() | (start, end) index range of the match
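
A minimal sketch of working with a match object. Because `re.search()` and `re.match()` return `None` when nothing matches, the result is checked before any method is called. The variable names are illustrative.

```python
import re

m = re.search(r'matchme', '   matchme returns match object')

if m is not None:
    print(m.group())           # 'matchme'
    print(m[0])                # same as m.group()
    print(m.span())            # (3, 10)
    print(m.start(), m.end())  # 3 10

# No match: the result is None, so calling .group() on it would raise AttributeError.
missing = re.search(r'matchme', 'nothing here')
print(missing)                 # None
```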

In [5]:
# Will only match the start of a string
re.match(r'matchme', '   matchme returns none b/c string starts with space')  # returns None

m = re.match(r'matchme', 'matchme yo yo ho yo')  
m[0]   # 'matchme'

'matchme'

In [6]:
re.search(r'matchme', '   matchme returns match object')

<re.Match object; span=(3, 10), match='matchme'>

## Capture groups

In [7]:
# With a single capture group, findall returns the text captured by that group
regex = r'(\d+)'
re.findall(regex, '12-34-12-45234-1')

['12', '34', '12', '45234', '1']
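
With more than one capture group, `re.findall()` returns a tuple per match, and a match object exposes each group by number. A short sketch reusing the string above:

```python
import re

text = '12-34-12-45234-1'

# Two capture groups: findall returns one (group 1, group 2) tuple per match.
pairs = re.findall(r'(\d+)-(\d+)', text)
print(pairs)        # [('12', '34'), ('12', '45234')]

m = re.search(r'(\d+)-(\d+)', text)
print(m.group(0))   # '12-34'  (the whole match)
print(m.group(1))   # '12'
print(m.group(2))   # '34'
print(m.groups())   # ('12', '34')
```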

## Non-capture groups
- Group part of a pattern (to repeat it or to scope alternation) without capturing it
- Not numbered, so they cannot be referenced later in the regex or extracted from the match
- Useful when a pattern has many repetitive groups and only some need to be extracted
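
A short sketch of the difference, using made-up sample strings: with a capture group, `re.findall()` returns only the group (its last repetition), while a non-capture group leaves the whole match as the result and does not take a group number.

```python
import re

text = 'abab cd abababab'

captured = re.findall(r'(ab)+', text)       # returns the last repetition of group 1
uncaptured = re.findall(r'(?:ab)+', text)   # returns the whole match

print(captured)    # ['ab', 'ab']
print(uncaptured)  # ['abab', 'abababab']

# The non-capture group is skipped when numbering, so (\d+) is group 1.
m = re.search(r'(?:\d+)-(\d+)', '12-34')
print(m.groups())  # ('34',)
```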

## References
- Numbered references
- Named references

In [8]:
# Numbered references
regex = r'(\d)-(\d)-\1-\2'  # \1 refers to 1st group, \2 second group
re.search(regex, '1-234-3-4-3-4-6788')

<re.Match object; span=(4, 11), match='4-3-4-3'>

In [9]:
# Named references
regex = r'(?P<bigbird>\d),\s\d,\s\d,\s\d,\s\d,\s(?P=bigbird)'  # the group named bigbird captures '2' here, so (?P=bigbird) must also match '2'
re.search(regex, '2, 1, 4, 2, 3, 2')

<re.Match object; span=(0, 16), match='2, 1, 4, 2, 3, 2'>
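
Two further uses of references, as a sketch: a numbered reference can detect repeated words and can also be used in the replacement string of `re.sub()`, and named groups can be read back from the match object by name. The group names `year` and `month` and the sample strings are illustrative only.

```python
import re

text = 'this is is a test test case'

# \1 must repeat the exact text captured by group 1, so this finds doubled words.
doubled = re.findall(r'\b(\w+)\s+\1\b', text)
# In the replacement string, \1 inserts the captured word once.
fixed = re.sub(r'\b(\w+)\s+\1\b', r'\1', text)
print(doubled)  # ['is', 'test']
print(fixed)    # 'this is a test case'

# Named groups can be read by name or collected into a dict.
m = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})', 'due 2024-01-05')
print(m.group('year'))  # '2024'
print(m.groupdict())    # {'year': '2024', 'month': '01'}
```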

## Greedy vs lazy
- Greedy quantifiers match as much as possible while still letting the overall pattern match
- Lazy quantifiers match as little as possible while still letting the overall pattern match
- Gotcha cases
  - Finding content within enclosing structures, e.g. brackets, HTML tags
- Convert greedy quantifiers to non-greedy (lazy) by suffixing `?`

In [10]:
# Greedy pitfall: trying to get the individual tags
regex = r'<.+>'
re.findall(regex, '<h1>My-blog-title</h1>')

['<h1>My-blog-title</h1>']

In [11]:
# Fix with lazy version
regex = r'<.+?>'
re.findall(regex, '<h1>My-blog-title</h1>')

['<h1>', '</h1>']
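
As an alternative to the lazy quantifier, a negated character class can stop the match at the first `>`. Adding a capture group then returns just the tag names. This is a sketch for simple, well-formed snippets like the one above, not a general HTML parser.

```python
import re

html = '<h1>My-blog-title</h1>'

# [^>]+ cannot run past a '>', so greediness is no longer a problem.
tags = re.findall(r'<[^>]+>', html)
# Capture only the name; /? allows the optional slash of a closing tag.
names = re.findall(r'</?(\w+)>', html)
# \1 requires the closing tag to repeat the opening tag's name.
contents = re.findall(r'<(\w+)>(.*?)</\1>', html)

print(tags)      # ['<h1>', '</h1>']
print(names)     # ['h1', 'h1']
print(contents)  # [('h1', 'My-blog-title')]
```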