# Module 02 Advanced Python - 03 Regular expression in Python

# Regular Expressions
A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Regular expressions (also called REs, regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the `re` module. Using this language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, e-mail addresses, and so on. You can then ask questions such as “Does this string match the pattern?” or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways. Regular expression patterns are compiled into a series of bytecodes, which are then executed by a matching engine written in C.

## Matching Characters
Since regular expressions are used to operate on strings, we’ll begin with the most common task: matching characters.

Most letters and characters will simply match themselves. For example, the regular expression `test` will match the string `test` exactly.

However, some characters, called metacharacters, don’t match themselves. Instead, they signal that something out of the ordinary should be matched, or they affect other portions of the RE by repeating them or changing their meaning.

Here’s a complete list of the metacharacters:

```
. ^ $ * + ? { } [ ] \ | ( )
```

### Basic patterns that match single characters
- **`a`, `X`, `9`, `<`**: ordinary characters just match themselves exactly.
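- **`.`**: (a period) matches any single character except newline `'\n'`.
- **`\w`**: matches a "word" character: a letter, digit, or underscore. For ASCII text this is `[a-zA-Z0-9_]`; Python 3 `str` patterns also accept non-ASCII letters and digits unless `re.ASCII` is set.
- **`\W`**: matches any non-word character.
- **`\b`**: matches the (zero-width) boundary between a word and a non-word character.
- **`\s`**: matches a single whitespace character: space, newline, return, tab, form feed, or vertical tab.
- **`\S`**: matches any non-whitespace character.
- **`\t`, `\n`, `\r`**: match tab, newline, and carriage return.
- **`\d`**: matches a decimal digit. For ASCII text this is `[0-9]`.
- **`^`**: matches the start of the string.
- **`$`**: matches the end of the string (or just before a newline at the very end).
- **`\`**: inhibits the "specialness" of the character that follows, so `\.` matches a literal period.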
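
As a quick illustration of the single-character patterns listed above, here is a minimal sketch. The sample text and the variable names are made up for this example.

```python
import re

# Sample text invented for this illustration.
text = "Order 66 shipped.\nTotal: $9.50"

# '.' matches any character except a newline, so this stops at the end of line 1.
first_line = re.match(r".*", text).group()

# '\d' matches a digit; '\w+' matches a run of word characters.
digits = re.findall(r"\d", text)
words = re.findall(r"\w+", text)

# '\b' is a zero-width boundary: 'ship' inside 'shipped' is not a whole word.
whole_word = re.search(r"\bship\b", text)

# A backslash removes the special meaning of '$' and '.'.
price = re.search(r"\$\d\.\d\d", text).group()

print(first_line)   # Order 66 shipped.
print(digits)       # ['6', '6', '9', '5', '0']
print(words)        # ['Order', '66', 'shipped', 'Total', '9', '50']
print(whole_word)   # None
print(price)        # $9.50
```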

## Regular Expression Modifiers: Option Flags
The functions and methods of the `re` module accept an optional `flags` argument that controls various aspects of matching. You can combine multiple flags using bitwise OR (`|`).

- `re.I`: Performs case-insensitive matching.
- `re.L`: Interprets words according to the current locale. This interpretation affects the alphabetic group (`\w` and `\W`), as well as word boundary behavior (`\b` and `\B`). In Python 3 this flag can only be used with bytes patterns.
- `re.M`: Makes `$` match the end of a line (not just the end of the string) and makes `^` match the start of any line (not just the start of the string).
- `re.S`: Makes a period (dot) match any character, including a newline.
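- `re.U`: Interprets letters according to the Unicode character set. This flag affects the behavior of `\w`, `\W`, `\b`, `\B`. In Python 3 this is already the default for `str` patterns, so the flag is redundant there; use `re.A` to restrict these classes to ASCII.
- `re.X`: Permits verbose, more readable regular expression syntax. Whitespace in the pattern is ignored (except inside a set `[]` or when escaped by a backslash) and an unescaped `#` starts a comment.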
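
The flags above can be tried out with a short sketch like the following. The sample text and the names `date_pattern` and `year_month` are illustrative only.

```python
import re

# Sample text invented for this illustration.
text = "first line\nSecond LINE"

# re.I ignores case; re.M lets '^' and '$' match at every line boundary.
matching_lines = re.findall(r"^\w+ line$", text, re.I | re.M)

# Without re.S the dot stops at the newline; with it, the dot crosses lines.
without_dotall = re.match(r".+", text).group()
with_dotall = re.match(r".+", text, re.S).group()

# re.X (verbose mode) ignores whitespace in the pattern and allows '#' comments.
date_pattern = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
""", re.X)
year_month = date_pattern.match("2024-05").groups()

print(matching_lines)   # ['first line', 'Second LINE']
print(without_dotall)   # first line
print(year_month)       # ('2024', '05')
```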

## Regular Expression Patterns
Except for the metacharacters, **`+ ? . * ^ $ ( ) [ ] { } | \`**, all characters match themselves. You can escape a metacharacter by preceding it with a backslash.

- `^`: Matches the beginning of the string; with `re.M`, also the beginning of each line.
- `[...]`: Matches any single character in brackets.
- `[^...]`: Matches any single character not in brackets
- `re*`: Matches `0` or more occurrences of preceding expression.
- `re+`: Matches `1` or more occurrences of preceding expression.
- `re?`: Matches `0` or `1` occurrence of preceding expression.
- `re{n}`: Matches exactly `n` occurrences of preceding expression.
- `re{n,}`: Matches `n` or more occurrences of preceding expression. Write it without spaces; otherwise Python treats the braces as literal text.
- `re{n,m}`: Matches at least `n` and at most `m` occurrences of preceding expression.
- `a|b`: Matches either `a` or `b`.
- `(re)`: Groups regular expressions and remembers matched text.
- `(?imx)`: Turns on the `i`, `m` or `x` options for the whole regular expression. In Python it must appear at the very start of the pattern (placing it elsewhere is an error since Python 3.11).
- `(?-imx)`: Not supported on its own by Python's `re` module; options can only be turned off for a group, using `(?-imx:re)`.
- `(?:re)`: Groups regular expressions without remembering matched text.
- `(?imx:re)`: Turns on the `i`, `m`, or `x` options within the parentheses only (Python 3.6 and later).
- `(?-imx:re)`: Turns off the `i`, `m`, or `x` options within the parentheses only (Python 3.6 and later).
- `(?#...)`: Comment.
- `(?=re)`: Positive lookahead. Matches if `re` matches at this position, without consuming any of the string.
- `(?!re)`: Negative lookahead. Matches if `re` does not match at this position, without consuming any of the string.
- `(?>re)`: Atomic group. Matches an independent pattern without backtracking into it (Python 3.11 and later).
- `\A`: Matches only at the beginning of the string.
- `\Z`: Matches only at the end of the string. Unlike `$`, it does not match just before a trailing newline.
- `\z`: Accepted as a synonym for `\Z` in Python 3.14 and later; earlier versions reject it, so prefer `\Z`.
- `\G`: Not supported by Python's `re` module. In other regex flavors it matches the point where the last match finished.
- `\b`: Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.
- `\B`: Matches nonword boundaries.
- `\n`: Matches newlines
- `\t`: Matches tabs
- `\1...\9`: Matches nth grouped subexpression.
- `\10`: Matches nth grouped subexpression if it matched already. Otherwise refers to the octal representation of a character code.
- `$`: Matches the end of the string or just before a trailing newline; with `re.M`, also the end of each line.
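
A few of the constructs from this list are easier to follow with concrete inputs. The following is a minimal sketch; the sample strings and variable names are invented for illustration.

```python
import re

# Quantifier with a range: two or three consecutive 'a' characters.
runs = re.findall(r"a{2,3}", "a aa aaa aaaa")

# Alternation inside a non-capturing group: only the outer group is remembered.
pet = re.search(r"((?:cat|dog)s?)", "two dogs").group(1)

# Backreference: \1 must repeat whatever group 1 matched.
doubled = re.search(r"\b(\w+) \1\b", "it was the the best").group(1)

# Lookahead is zero-width: the digits are returned, 'kg' is only checked.
kilos = re.findall(r"\d+(?=kg)", "5kg 7lb 12kg")

print(runs)      # ['aa', 'aaa', 'aaa']
print(pet)       # dogs
print(doubled)   # the
print(kilos)     # ['5', '12']
```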

### The `[ ]` metacharacters
The square brackets are used for specifying a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them by a `-`.

For example, `[abc]` will match any of the characters `a`, `b` or `c`; this is the same as `[a-c]`, which uses a range to express the same set of characters.

If the user wants to match only lowercase letters, then the RE would be `[a-z]`.

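**Note:** Most metacharacters are not active inside classes. For example, `[akm$]` will match any of the characters `a`, `k`, `m`, or `$`; `$` is usually a metacharacter, but inside a character class it’s stripped of its special nature. The exceptions are `\`, a leading `^` (which negates the class), `-` between two characters (which forms a range), and `]`.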
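
The behavior of character classes can be checked with a small sketch such as this one. The input strings and variable names are made up for the example.

```python
import re

# '$' is literal inside a class, so this finds the characters a, k, m and $.
found = re.findall(r"[akm$]", "make $5")

# A range: [a-c] is the same set as [abc].
lower = re.findall(r"[a-c]", "abcdef")

# A leading '^' negates the class: remove everything that is not a digit.
digits_only = re.sub(r"[^0-9]", "", "tel: 555-0100")

print(found)        # ['m', 'a', 'k', '$']
print(lower)        # ['a', 'b', 'c']
print(digits_only)  # 5550100
```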

## Regular Expression Examples
- `python`: Match `"python"`
- `[Pp]ython`: Match `"Python"` or `"python"`
- `rub[ye]`: Match `"ruby"` or `"rube"`
- `[aeiou]`: Match any one lowercase vowel
- `[0-9]`: Match any digit; same as `[0123456789]`
- `[a-z]`: Match any lowercase ASCII letter
- `[A-Z]`: Match any uppercase ASCII letter
- `[a-zA-Z0-9]`: Match any of the above
- `[^aeiou]`: Match anything other than a lowercase vowel
- `[^0-9]`: Match anything other than a digit

## Special Character Classes
- `.`: Match any character except newline
- `\d`: Match a digit: `[0-9]`
- `\D`: Match a nondigit: `[^0-9]`
- `\s`: Match a whitespace character: `[ \t\r\n\f\v]`
- `\S`: Match nonwhitespace: `[^ \t\r\n\f\v]`
- `\w`: Match a single word character: `[A-Za-z0-9_]`
- `\W`: Match a nonword character: `[^A-Za-z0-9_]`
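
The bracket equivalents in this list describe ASCII matching. As a hedged sketch of the difference in Python 3, where `str` patterns are Unicode-aware by default, the example below compares the default behavior with the `re.ASCII` flag. The sample text is invented and uses escapes for the non-ASCII characters.

```python
import re

# 'caf\u00e9' ends in a non-ASCII letter; '\u0663' is a non-ASCII decimal digit.
text = "caf\u00e9 \u0663"

# Default (Unicode) matching for str patterns in Python 3.
unicode_words = re.findall(r"\w+", text)
unicode_digits = re.findall(r"\d", text)

# With re.ASCII, \w and \d fall back to [A-Za-z0-9_] and [0-9].
ascii_words = re.findall(r"\w+", text, re.ASCII)
ascii_digits = re.findall(r"\d", text, re.ASCII)

print(ascii_words)   # ['caf']
print(ascii_digits)  # []
```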

### The `compile()` method

**Syntax:**
```python
re.compile(pattern, flags=0)
```

In [1]:
```python
# import the 're' module
import re

# compile the regular expression
regex = re.compile(r'[a-z]+')

# display the compiled pattern object
regex
```

```
re.compile(r'[a-z]+', re.UNICODE)
```
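
Compiling is mostly useful when the same pattern is applied to many inputs. The sketch below passes a flag to `re.compile()` and reuses the resulting object; the names `word`, `inputs` and `first_words` are illustrative.

```python
import re

# Compile once with a flag, then reuse the pattern object on several inputs.
word = re.compile(r"[a-z]+", re.IGNORECASE)

inputs = ["Python", "123", "regex 101"]
first_words = []
for s in inputs:
    m = word.match(s)
    # match() returns None when the string does not start with a letter.
    first_words.append(m.group() if m else None)

print(word.pattern)   # [a-z]+
print(first_words)    # ['Python', None, 'regex']
```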

### The `match()` method
**Syntax:**
```python
regex.match(string)
```

In [2]:
```python
# matching the regular expression with an empty string
matchObj = regex.match("")

print("matchObj:", matchObj)
```

```
matchObj: None
```


In [3]:
```python
# matching the regular expression with a string
matchObj = regex.match("python regex")

print("matchObj:", matchObj)
```

```
matchObj: <_sre.SRE_Match object; span=(0, 6), match='python'>
```


In [4]:
```python
# match() fails here because the string starts with an uppercase 'P'
matchObj = regex.match("Python regex")

print("matchObj:", matchObj)
```

```
matchObj: None
```


### The `search()` method

In [5]:
```python
# search() scans the whole string, so it finds 'ython' after the 'P'
matchObj = regex.search('Python regex')

print("matchObj:", matchObj)
```

```
matchObj: <_sre.SRE_Match object; span=(1, 6), match='ython'>
```


#### Typical usage: check for `None` before using the match

In [6]:
```python
regex = re.compile(r'\w+ \w+ \w+')

matchObj = regex.match('string goes here')

# match() returns None on failure, so test the result before calling group()
if matchObj:
    print('Match found:', matchObj.group())
else:
    print('No match')
```

```
Match found: string goes here
```


### The `findall()` method

In [7]:
```python
import re

regex = re.compile(r'\d+')

# findall() returns every non-overlapping match as a list of strings
matchObj = regex.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')

print(matchObj)
```

```
['12', '11', '10']
```
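
When the pattern contains groups, `findall()` returns the group contents rather than the whole match. A small sketch, reusing the sample sentence from the cell above (the name `pairs` is illustrative):

```python
import re

text = "12 drummers drumming, 11 pipers piping, 10 lords a-leaping"

# With two groups, each result is a (count, noun) tuple instead of a string.
pairs = re.findall(r"(\d+) (\w+)", text)

print(pairs)  # [('12', 'drummers'), ('11', 'pipers'), ('10', 'lords')]
```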


### The `finditer()` method

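In [8]:
```python
# finditer() returns an iterator of match objects instead of a list of strings
iterator = regex.finditer('12 drummers drumming, 11 ... 10 ...')
```

In [9]:
```python
for matchObj in iterator:
    print(matchObj.group())
```

```
12
11
10
```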
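
Because `finditer()` yields match objects, each result also carries its position in the string. The following sketch collects the matched text together with its span; the sample text and the name `positions` are made up for this example.

```python
import re

regex = re.compile(r"\d+")
text = "12 drummers drumming, 11 pipers piping"

# span() gives the (start, end) indexes of each match in the original string.
positions = [(m.group(), m.span()) for m in regex.finditer(text)]

print(positions)  # [('12', (0, 2)), ('11', (22, 24))]
```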


### The `re.match()` function
This module-level function attempts to match the RE pattern at the beginning of the string, with optional flags.

**Syntax:**
```python
re.match(pattern, string, flags=0)
```

- `pattern`: This is the regular expression to be matched.
- `string`: This is the string to be searched; the pattern is matched only at the beginning of the string.
- `flags`: You can combine different flags using bitwise OR (`|`). These are the modifiers listed in the Option Flags section above.

The `re.match()` function returns a match object on success and `None` on failure. The `group(num)` and `groups()` methods of the match object are used to get the matched text.
- `group(num=0)`: This method returns the entire match (or the specific subgroup `num`).
- `groups()`: This method returns all matching subgroups in a tuple (empty if there weren't any).

In [10]:
```python
import re

pattern = r"(.*) are (.*)"

string = "Cats are smarter than dogs"

matchObj = re.match(pattern, string, re.M|re.I)

if matchObj:
    print("matchObj.group():", matchObj.group())
    print("matchObj.group(1):", matchObj.group(1))
    print("matchObj.group(2):", matchObj.group(2))
    print("matchObj.groups():", matchObj.groups())
else:
    print("No match!!")
```

```
matchObj.group(): Cats are smarter than dogs
matchObj.group(1): Cats
matchObj.group(2): smarter than dogs
matchObj.groups(): ('Cats', 'smarter than dogs')
```
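
Groups can also be given names with the `(?P<name>...)` syntax, which often reads better than numeric indexes. This is a minimal sketch on the same sentence; the group names `subject` and `claim` are chosen only for illustration.

```python
import re

m = re.match(r"(?P<subject>.*) are (?P<claim>.*)", "Cats are smarter than dogs")

# Named groups are still available by number, and also by name.
subject = m.group("subject")
claim = m.group(2)
as_dict = m.groupdict()

print(subject)  # Cats
print(claim)    # smarter than dogs
print(as_dict)  # {'subject': 'Cats', 'claim': 'smarter than dogs'}
```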


### The `re.search()` function
The `re.search()` function searches for the first occurrence of the RE pattern anywhere within the string, with optional flags.

```python
re.search(pattern, string, flags=0)
```

In [11]:
```python
import re

line = "Cats are smarter than dogs"

searchObj = re.search(r'(.*) are (.*)', line, re.M|re.I)

if searchObj:
    print("searchObj.group():", searchObj.group())
    print("searchObj.group(1):", searchObj.group(1))
    print("searchObj.group(2):", searchObj.group(2))
    print("searchObj.groups():", searchObj.groups())
else:
    print("Nothing found!!")
```

```
searchObj.group(): Cats are smarter than dogs
searchObj.group(1): Cats
searchObj.group(2): smarter than dogs
searchObj.groups(): ('Cats', 'smarter than dogs')
```


### Matching versus Searching
Python offers two different primitive operations based on regular expressions: 
- `match()`: checks for a match only at the beginning of the string.
- `search()`: checks for a match anywhere in the string.

In [12]:
```python
import re

line = "Cats are smarter than dogs"

# Matching "dogs": fails because the string does not start with "dogs"
matchObj = re.match(r'dogs', line, re.M|re.I)

if matchObj:
    print("match -> matchObj.group():", matchObj.group())
else:
    print("No match!!!")

# Searching "dogs": succeeds because search() scans the whole string
searchObj = re.search(r'dogs', line, re.M|re.I)

if searchObj:
    print("search -> searchObj.group():", searchObj.group())
else:
    print("Nothing found!!!")
```

```
No match!!!
search -> searchObj.group(): dogs
```
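
Two related points are worth a short sketch: `re.fullmatch()` requires the whole string to match, and `re.M` changes what `^` means for `search()` but not where `match()` starts. The variable names and the two-line sample string are illustrative.

```python
import re

line = "Cats are smarter than dogs"

# An anchored search behaves like match() on a single-line string.
anchored_search = re.search(r"^dogs", line)

# match() accepts a matching prefix; fullmatch() needs the whole string to match.
prefix_match = re.match(r"Cats", line)
whole_match = re.fullmatch(r"Cats", line)

# With re.M, '^' also matches after a newline, but match() still only tries
# the very beginning of the string.
two_lines = "Cats\ndogs"
multiline_search = re.search(r"^dogs", two_lines, re.M)
multiline_match = re.match(r"dogs", two_lines, re.M)

print(anchored_search)           # None
print(prefix_match.group())      # Cats
print(whole_match)               # None
print(multiline_search.group())  # dogs
print(multiline_match)           # None
```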


### The `sub()` method

**Syntax:**
```python
re.sub(pattern, repl, string, count=0, flags=0)
```
This function replaces occurrences of the RE pattern in `string` with `repl`, substituting all occurrences unless `count` is provided, and returns the modified string.

In [13]:
```python
import re

phone = "2004-959-559 # This is Phone Number"

# Delete the Python-style comment
num = re.sub(r'#.*', "", phone)

print("Phone Number:", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)

print("Phone Number:", num)
```

```
Phone Number: 2004-959-559 
Phone Number: 2004959559
```
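
Beyond plain string replacement, `re.sub()` accepts a `count` limit, group references in the replacement, and a function as the replacement; `re.subn()` additionally reports how many substitutions were made. A minimal sketch with illustrative variable names:

```python
import re

phone = "2004-959-559"

# count=1 replaces only the first hyphen.
first_only = re.sub(r"-", " ", phone, count=1)

# \1, \2 and \3 in the replacement refer to the captured groups.
reordered = re.sub(r"(\d+)-(\d+)-(\d+)", r"\3-\2-\1", phone)

# A function receives each match object and returns the replacement text.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), phone)

# subn() returns the new string together with the number of replacements.
digits, replaced = re.subn(r"\D", "", phone)

print(first_only)        # 2004 959-559
print(reordered)         # 559-959-2004
print(doubled)           # 4008-1918-1118
print(digits, replaced)  # 2004959559 2
```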


# Example

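In [14]:
```python
import os, re

def print_pdf(files):
    """Print every file name that ends with '.pdf'."""
    for file in files:
        # '\.' is a literal dot and '$' anchors the extension at the end of the name
        if re.search(r"\.pdf$", file):
            print(file)

# walk the current directory tree and check the files in each directory
for root, dirs, files in os.walk('.'):
    print_pdf(files)
```

```
Module 02 Advanced Python - 01 Files and Directories in Python.pdf
Module 02 Advanced Python - 02 Building Modules and Packages.pdf
Module 02 Advanced Python - 03 Regular expression in Python.pdf
```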
