# Regex

Regular expressions (regex) are patterns that we define and ask the code to look for.

A **Regex** is composed of 2 basic types of characters:
1. Metacharacters
    Classes, Quantifiers, Position
2. Literals

If metacharacters are the grammar, then literals are the words.

## Classes
Used to match a single character; they all start with a backslash.

**\w** = any word character: a letter (a-z, A-Z), a digit (0-9) or the underscore  
**\W** = any non-word character

**\d** = any digit like 0-9  
**\D** = any non-digit

**\s** = whitespace characters like space, tab, newline  
**\S** = non-whitespace characters
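
As a small illustration of the classes above, the sketch below runs each one through `re.findall`. The sample strings are made up for this example.

```python
import re

sample = "Room 42, Floor 3"

digits = re.findall(r'\d', sample)        # each digit is a separate match
non_digits = re.findall(r'\D', "a1b2")    # everything except digits
word_chars = re.findall(r'\w', "a_1 !")   # letters, digits and the underscore
non_word = re.findall(r'\W', "a_1 !")     # here: the space and '!'
spaces = re.findall(r'\s', sample)        # the three spaces in the sample

print(digits)      # ['4', '2', '3']
print(non_digits)  # ['a', 'b']
print(word_chars)  # ['a', '_', '1']
print(non_word)    # [' ', '!']
print(spaces)      # [' ', ' ', ' ']
```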

## Quantifiers

Used to indicate the number of occurrences of a character in the pattern. A quantifier applies to the character (or group) right before it.

**?** = Character appears 0 or 1 time  
$\large *$ = Character appears 0 or more times  
**+** = Character appears 1 or more times  
**{min,max}** = Character appears between min and max times (in Python, write it without a space after the comma)  
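
A quick sketch of each quantifier in action. The words and numbers in the test strings are invented for this example.

```python
import re

# ? : the 'u' is optional, so both spellings match
optional = re.findall(r'colou?r', "color colour")

# * : zero or more 'o' between 'g' and 'd'
zero_or_more = re.findall(r'go*d', "gd god good")

# + : at least one 'o' is required, so 'gd' is skipped
one_or_more = re.findall(r'go+d', "gd god good")

# {2,3} : runs of 2 to 3 digits; a run of 4 yields only its first 3
ranged = re.findall(r'\d{2,3}', "1 22 333 4444")

print(optional)      # ['color', 'colour']
print(zero_or_more)  # ['gd', 'god', 'good']
print(one_or_more)   # ['god', 'good']
print(ranged)        # ['22', '333', '444']
```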

## Position
Used to indicate the location of a character

**^** = Start of a line  
**$** = End of a line  
**\<** = Start of a word  
**\>** = End of a word  

**\<** and **\>** come from tools such as grep and Vim. Python's re module does not support them as anchors and uses **\b** (word boundary) instead.
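
A minimal sketch of the position anchors in Python. Two points here are not stated in the notes above: Python's `re` uses `\b` for a word boundary rather than `\<` and `\>`, and `^` / `$` only work line by line when the `re.M` flag is passed. The sample text is made up.

```python
import re

lines = "cat scatter\nconcat cat"

# With re.M, ^ and $ anchor at every line, not just the whole string
line_starts = re.findall(r'^cat', lines, flags=re.M)
line_ends = re.findall(r'cat$', lines, flags=re.M)

# \b on both sides keeps only 'cat' as a standalone word
whole_words = re.findall(r'\bcat\b', lines)

# No anchors: 'cat' is also found inside 'scatter' and 'concat'
anywhere = re.findall(r'cat', lines)

print(len(line_starts))  # 1
print(len(line_ends))    # 1
print(len(whole_words))  # 2
print(len(anywhere))     # 4
```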

## Extras

**$\left[\space\right]$** = Set of characters  
**$\huge .$** = Any 1 character  
**|** = Alternation: matches any one of 2 or more options  
**$\left(\space\right)$** = Groups part of a pattern, so that a quantifier applies to the whole group  
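
The four extras can be tried out with a short sketch like the one below. The test strings are invented for illustration.

```python
import re

# [ ] : any one character from the set
vowels = re.findall(r'[aeiou]', "regex")

# .  : any single character (except a newline by default)
any_char = re.findall(r'c.t', "cat cot c-t ct")

# |  : either option
either = re.findall(r'cat|dog', "cat dog bird")

# ( ): the + applies to the whole group 'ab', not just to 'b'
repeated = re.search(r'(ab)+', "xababy").group()

print(vowels)    # ['e', 'e']
print(any_char)  # ['cat', 'cot', 'c-t']
print(either)    # ['cat', 'dog']
print(repeated)  # abab
```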

# Python Regex

## Compile
```python
import re
re.compile(pattern, flags=0)
```

## Match
```python
import re
re.match(pattern, string, flags=0)
```

## Search
```python
import re
re.search(pattern, string, flags=0)
```
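
The practical difference between `re.match` and `re.search` is where they look: `match` only tries at the beginning of the string, while `search` scans the whole string. A small sketch with a made-up string:

```python
import re

text = "regex in Python"

at_start = re.match(r'regex', text)       # pattern is at the beginning
not_at_start = re.match(r'Python', text)  # pattern exists, but not at the start
found = re.search(r'Python', text)        # search looks everywhere

print(at_start is not None)  # True
print(not_at_start)          # None
print(found.span())          # (9, 15)
```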

## Findall
```python
import re
re.findall(pattern, string, flags=0)
```
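
What `re.findall` returns depends on the groups in the pattern: whole matches with no group, the group's text with one group, and tuples with several groups. A short sketch on a made-up string:

```python
import re

text = "x=1, y=22"

no_group = re.findall(r'\w=\d+', text)        # whole matches
one_group = re.findall(r'\w=(\d+)', text)     # only the captured digits
two_groups = re.findall(r'(\w)=(\d+)', text)  # one tuple per match

print(no_group)    # ['x=1', 'y=22']
print(one_group)   # ['1', '22']
print(two_groups)  # [('x', '1'), ('y', '22')]
```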

## Sub
```python
import re
re.sub(pattern, repl, string, count=0, flags=0)
re.subn(pattern, repl, string, count=0, flags=0)
```
**Flags** are used to fine-tune the searching and matching of patterns. They are passed as re.I, re.S, re.M and re.X.

I = Allows case-insensitive matches  
S = Allows $\Huge .$ to match any character, including a newline character  
M = Allows ^ and $ to match at the start and end of every line, not only of the whole string  
X = Allows whitespace and comments inside an expression  
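
A minimal sketch of the four flags, using made-up strings. The verbose pattern at the end is only an illustration of how `re.X` lets a pattern be spread over several commented lines.

```python
import re

# re.I: case is ignored
ignore_case = re.findall(r'python', "Python PYTHON python", flags=re.I)

# re.S: without it, '.' stops at a newline
dot_default = re.findall(r'a.b', "a\nb")
dot_all = re.findall(r'a.b', "a\nb", flags=re.S)

# re.M: ^ matches at the start of every line
first_words = re.findall(r'^\w+', "one two\nthree four", flags=re.M)

# re.X: whitespace and # comments inside the pattern are ignored
number = re.compile(r"""
    \d{3}   # first three digits
    -       # separator
    \d{4}   # last four digits
""", re.X)
verbose = number.findall("call 555-0180")

print(ignore_case)  # ['Python', 'PYTHON', 'python']
print(dot_default)  # []
print(dot_all)      # ['a\nb']
print(first_words)  # ['one', 'three']
print(verbose)      # ['555-0180']
```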

## Examples

In [19]:
import re

### Create a Regex that matches any word with any length that starts with an uppercase letter

In [6]:
text = "The sidebar includes a Cheatsheet, full Reference, and Help. You can also save & Share with the Community, and view patterns you create or favorite in My Patterns."

[A-Z] = matches uppercase letter  

\w = any word character (letter, digit or underscore)  

+ = 1 or more times  

In [7]:
# Compiling regex pattern
pattern = re.compile(r'[A-Z]\w+')

# Scan the text and find all matches
results = re.findall(pattern, text)

# Print the results
print(results)

['The', 'Cheatsheet', 'Reference', 'Help', 'You', 'Share', 'Community', 'My', 'Patterns']
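
One caveat worth checking: `[A-Z]\w+` needs at least two characters and can start in the middle of a word. A possible variant, not part of the original exercise, adds a word boundary `\b` and uses `\w*` so that one-letter words also count. The sample sentence is made up.

```python
import re

sample = "I like my iPhone and Python"

# Original pattern: misses 'I' and picks up 'Phone' from inside 'iPhone'
original = re.findall(r'[A-Z]\w+', sample)

# Variant: the capital must begin the word; following characters are optional
variant = re.findall(r'\b[A-Z]\w*', sample)

print(original)  # ['Phone', 'Python']
print(variant)   # ['I', 'Python']
```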


### Create a Regex to match phone numbers in the format xxx-xxx-xxxx

In [20]:
text = "Yesterday at the office party, I met the manager of the east coast branch, her phone number is 202-555-0180. I also exchanged my number 202-555-0195 with a recruiter."
# We will match 202-555-0180 and 202-555-0195
# Note: the separators must be plain hyphens (-); an en dash (–) is a different character and would not match

\d = match digits  
\d{3} = match 3 digits  
\d{4} = match 4 digits  

In [21]:
# Compiling the pattern
pattern = re.compile(r'\d{3}-\d{3}-\d{4}')

# Find all matches
results = re.findall(pattern, text)

# Print results
print(results)

['202-555-0180', '202-555-0195']


### Create a Regex to match any date in the format DD.MM.YYYY, DD-MM-YYYY or DD/MM/YYYY, with years only in the 1900s or 2000s

In [22]:
text = "Python 3.0 was released on 03-12-2008. It was a significant revision of the language that is not completely backward-compatible. Many of its major features were backported to Python 2.6. and 2.7 versions that were released on 03.10.2008 and 03/07/2010, respectively."

**Divide the day regex into 3 parts**  
* For days from 01-09 => 0[1-9]  
* For days from 10-29 => [12][0-9]  
* For days 30 and 31 => 3[01]  

**Matching the separators**  
* For . or - or / => (\.|-|/) (the dot must be escaped, otherwise it matches any character)  

**Matching the months**  
* From 01-09 => 0[1-9]  
* From 10-12 => 1[012]  

**Match the year**  
* For year starting with 19 or 20 => ((?:19|20)\d\d), where (?: ) groups the two options without capturing them, so the outer group captures the full 4-digit year

In [23]:
# Compiling pattern
pattern = re.compile(r'(0[1-9]|[12][0-9]|3[01])(\.|-|/)(0[1-9]|1[012])(\.|-|/)((?:19|20)\d\d)')

# Find all matches (one tuple of captured groups per date)
results = re.findall(pattern, text)

# Combine the groups of each match into one string
dates = ["".join(groups) for groups in results]

# Print results
print(dates)

['03-12-2008', '03.10.2008', '03/07/2010']
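
An alternative sketch that avoids joining groups: when every group is non-capturing, `re.findall` returns the whole match directly. The name `date_re` and the shortened sample sentence are made up for this example, and the character set `[./-]` stands in for the three separators.

```python
import re

# (?: ) groups without capturing, so findall yields full matches
date_re = re.compile(
    r'(?:0[1-9]|[12][0-9]|3[01])'  # day 01-31
    r'[./-]'                       # separator: dot, slash or hyphen
    r'(?:0[1-9]|1[012])'           # month 01-12
    r'[./-]'                       # separator
    r'(?:19|20)\d\d'               # year 1900-2099
)

sample = "Released on 03.10.2008 and 03/07/2010, not on 3.0 or 2.6."
dates = date_re.findall(sample)

print(dates)  # ['03.10.2008', '03/07/2010']
```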


## re.search

In [24]:
sentence = "this is a sample string"

if re.search(r'ring', sentence):
    print('mission success')

mission success


In [25]:
if not re.search(r'xyz', sentence):
    print('mission failed')

mission failed


In [26]:
words = ['cat', 'attempt', 'tattle']

[w for w in words if re.search(r'tt', w)]

['attempt', 'tattle']

In [27]:
all(re.search(r'at', w) for w in words)

True

In [28]:
any(re.search(r'stat', w) for w in words)

False
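
The checks above only use the truthiness of the result. `re.search` actually returns a match object (or `None`), which also reports what was matched and where. A short sketch using the same sentence:

```python
import re

sentence = "this is a sample string"

m = re.search(r'ring', sentence)
matched_text = m.group()   # the text that matched
position = m.span()        # (start, end) indexes of the match

missing = re.search(r'xyz', sentence)  # no match gives None

print(matched_text)  # ring
print(position)      # (19, 23)
print(missing)       # None
```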

## re.sub

Replaces every portion of a string that matches the pattern with another string.
```python
re.sub(pattern, repl, string, count=0, flags=0)
```
**pattern** = what we should find to replace  
**repl** = what the pattern should be replaced with  
**string** = main string

In [32]:
greeting = "Have a nice weekend"

# replacing all "e" with "Y"
re.sub(r"e", "Y", greeting)

'HavY a nicY wYYkYnd'

In [33]:
# replacing first 2 occurrences of "e" with "E"
re.sub(r'e','E', greeting, count=2)

'HavE a nicE weekend'
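
`re.subn` works like `re.sub` but also reports how many replacements were made. A small sketch on the same greeting:

```python
import re

greeting = "Have a nice weekend"

# subn returns a tuple: (new string, number of replacements)
new_text, n_replaced = re.subn(r'e', 'E', greeting)

print(new_text)    # HavE a nicE wEEkEnd
print(n_replaced)  # 5
```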

## re.compile

Compiling a regex is useful if it has to be used in multiple places or called multiple times inside a loop.

In [34]:
pet = re.compile(r'dog')
type(pet)

re.Pattern

In [41]:
# Since pet is a pattern object, we can call search and sub methods directly on it
bool(pet.search('They bought a dog'))

True

In [42]:
bool(pet.search('They had a cat'))

False

In [43]:
pet.sub('cat', 'They bought a white dog')

'They bought a white cat'

In [44]:
# Pattern methods take no flags argument (flags are set in re.compile),
# but search accepts optional start and end positions
sentence = 'This is a sample string'

word = re.compile(r'is')

# search for 'is' starting from 5th character of sentence variable
bool(word.search(sentence, 4))

True

In [45]:
# search for 'is' from index 2 up to (not including) index 4, i.e. the 3rd and 4th characters
bool(word.search(sentence, 2, 4))

True
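
To see why the start and end positions change the result, it can help to list where the pattern occurs. The sketch below reuses `word` and `sentence` from the cells above and adds `finditer`, which is not used elsewhere in these notes.

```python
import re

sentence = 'This is a sample string'
word = re.compile(r'is')

# 'is' occurs twice: inside 'This' and as the separate word 'is'
starts = [m.start() for m in word.finditer(sentence)]

# Starting at index 3 skips the first occurrence
second = word.search(sentence, 3).start()

# Ending before index 3 cuts the first occurrence in half, so nothing matches
cut_off = word.search(sentence, 0, 3)

print(starts)   # [2, 5]
print(second)   # 5
print(cut_off)  # None
```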

# Exercises

References:
1. https://towardsdatascience.com/programmers-guide-to-master-regular-expression-f892c814f878
2. https://learnbyexample.github.io/py_regular_expressions/re-introduction.html 