# Regular Expression

- In this tutorial, you will learn about regular expressions (RegEx), and use Python's re module to work with RegEx.
- A Regular Expression (RegEx) is a sequence of characters that defines a search pattern. For example,
    ```
    ^a...e$
    ```
- The above code defines a RegEx pattern. The pattern is: any five-character string starting with a and ending with e.
- A pattern defined using RegEx can be used to match against a string.

|Expression|String|Matched?|
|----------|:-------------:|------:|
| ^a...e$ | abe | No match |
| - | abcde | Match |
| - | abbbe | Match |
| - | abbcde | No match |


- Python has a module named re to work with RegEx. Here's an example:

In [1]:
import re

In [2]:
pattern = '^c...i$'
test_string = 'chen-pei'
result = re.match(pattern, test_string)

if result:
  print("Search successful.")
else:
  print("Search unsuccessful.")

Search unsuccessful.


- Here, we used the re.match() function to check whether pattern matches test_string. Note that re.match() only looks for a match at the beginning of the string.
- The function returns a match object if the match succeeds. If not, it returns None.

In [3]:
print(result)

None


In [4]:
pattern = '^c...i$'
test_string = 'cheni'
result = re.match(pattern, test_string)

if result:
  print("Search successful.")
else:
  print("Search unsuccessful.") 

Search successful.


In [5]:
print(result)

<re.Match object; span=(0, 5), match='cheni'>


In [6]:
result[0]

'cheni'

- There are several other functions defined in the re module to work with RegEx. 
- Before we explore them, let's learn about regular expressions themselves.

- Specify Pattern Using RegEx
    - To specify regular expressions, metacharacters are used. 
    - In the above example, ^ and $ are metacharacters.

- **MetaCharacters**
  
    - Metacharacters are characters that are interpreted in a special way by a RegEx engine. 
    - Here's a list of metacharacters:
```
[] . ^ $ * + ? {} () \ |
```

- Square brackets: [] 

    - Square brackets specify a set of characters you wish to match.


|Expression|String|	Matched?|
|----------|:-------------:|------:|
|[abc]|	a| 1 match|
 |- |ac|2 matches|
 |- |Hey Joe|	No match|
 |- |abc de ca|5 matches|
 
 
- Here, [abc] will match if the string you are trying to match contains any of a, b, or c.

- You can also specify a range of characters using - inside square brackets.

    - [a-e] is the same as [abcde].
    - [1-4] is the same as [1234].
    - [0-39] is the same as [01239].
    
- You can complement (invert) the character set by placing the caret symbol ^ at the start of the square brackets.

    - [^abc] means any character except a or b or c.
    - [^0-9] means any non-digit character.
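
The character-set rules above can be checked with a short sketch. It uses re.findall(), which is covered later in this tutorial; the sample strings and variable names are illustrative assumptions.

```python
import re

# [abc] matches each single a, b, or c found in the string.
abc_hits = re.findall(r'[abc]', 'abc de ca')
print(abc_hits)        # ['a', 'b', 'c', 'c', 'a']

# [0-39] is the range 0-3 plus the literal 9, not "0 to 39".
digit_hits = re.findall(r'[0-39]', '0123456789')
print(digit_hits)      # ['0', '1', '2', '3', '9']

# [^0-9] matches every character that is not a digit.
non_digit_hits = re.findall(r'[^0-9]', 'a1b2')
print(non_digit_hits)  # ['a', 'b']
```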

- Period: . 

    - A period matches any single character (except newline '\n').

|Expression	|String	|Matched?|
|----------|:-------------:|------:|
|..|	a|	No match|
|..| ac	|1 match|
|.. |acd|	1 match (non-overlapping)|
|..|acde	|2 matches (contains 4 characters)|

-  Caret: ^ 

    - The caret symbol ^ is used to check if a string starts with a certain character.

|Expression|	String|	Matched?|
|----------|:-------------:|------:|
|^a	|a	|1 match|
|^a	|abc	|1 match|
|^a	|bac|	No match|
|^ab |	abc	|1 match|
|^ab |acb|	No match (starts with a but not followed by b)|

- Dollar: \$
    - The dollar symbol $ is used to check if a string ends with a certain character.

|Expression|	String	|Matched?|
|----------|:-------------:|------:|
|a\$	|a	 | 1 match|
|a\$	|formula	 |1 match|
|a\$	|cab	|No match|
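
As a rough check of the period, caret, and dollar tables, the sketch below runs a few of the rows through Python. It relies on re.findall() and re.search(), both covered later in this tutorial; the variable names are illustrative.

```python
import re

# '..' consumes two characters per match, so 'acde' yields two matches.
pairs = re.findall(r'..', 'acde')
print(pairs)  # ['ac', 'de']

# '^ab' matches only when the string starts with 'ab'.
starts_abc = bool(re.search(r'^ab', 'abc'))
starts_acb = bool(re.search(r'^ab', 'acb'))
print(starts_abc, starts_acb)  # True False

# 'a$' matches only when the string ends with 'a'.
ends_formula = bool(re.search(r'a$', 'formula'))
ends_cab = bool(re.search(r'a$', 'cab'))
print(ends_formula, ends_cab)  # True False
```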

- Star: \* 

    - The star symbol * matches zero or more occurrences of the pattern to its left.

|Expression	|String	|Matched?|
|----------|:-------------:|------:|
|ma*n|	mn	|1 match|
|ma*n|man	|1 match|
|ma*n|maaan	|1 match|
|ma*n|main	|No match (a is not followed by n)|
|ma*n|woman	|1 match|

- Plus: \+ 

    - The plus symbol + matches one or more occurrences of the pattern to its left.

|Expression	|String	|Matched?|
|----------|:-------------:|------:|
|ma+n|	mn	|No match (no a character)|
|ma+n|man	|1 match|
|ma+n|maaan	|1 match|
|ma+n|main	|No match (a is not followed by n)|
|ma+n|woman	|1 match|

- Question Mark: ? 

    - The question mark symbol ? matches zero or one occurrence of the pattern to its left.

|Expression	|String	|Matched?|
|----------|:-------------:|------:|
|ma?n	|mn	|1 match|
|ma?n|man|	1 match|
|ma?n|maaan|	No match (more than one a character)|
|ma?n|main	|No match (a is not followed by n)|
|ma?n|woman	|1 match|
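
To compare the three quantifiers side by side, here is a small sketch that tests the same words from the tables against each pattern. The `words`, `patterns`, and `results` names are illustrative, and re.search() is covered later in this tutorial.

```python
import re

words = ['mn', 'man', 'maaan', 'main', 'woman']
patterns = [r'ma*n', r'ma+n', r'ma?n']

# For each quantifier, collect the words in which the pattern is found.
results = {p: [w for w in words if re.search(p, w)] for p in patterns}

for pattern, matched in results.items():
    print(pattern, matched)
# ma*n ['mn', 'man', 'maaan', 'woman']
# ma+n ['man', 'maaan', 'woman']
# ma?n ['mn', 'man', 'woman']
```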

- Braces: {}

    - Consider this code: {n,m}. This means at least n, and at most m repetitions of the pattern to its left.

|Expression	|String	|Matched?|
|----------|:-------------:|------:|
|a{2,3}	|abc dat	|No match|
|a{2,3}|abc daat	|1 match (at daat)|
|a{2,3}|aabc daaat	|2 matches (at aabc and daaat)|
|a{2,3}|aabc daaaat	|2 matches (at aabc and daaaat)|


    - Let's try one more example. The RegEx [0-9]{2,4} matches at least 2 but not more than 4 consecutive digits. Do not put a space inside the braces: Python does not treat {2, 4} as a repetition count.

|Expression	|String	|Matched?|
|----------|:-------------:|------:|
|[0-9]{2,4}	|ab123csde	|1 match (match at ab123csde)|
|[0-9]{2,4}	|12 and 345673	|3 matches (12, 3456, 73)|
|[0-9]{2,4}	|1 and  2	|No match|
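
The brace examples can be reproduced with re.findall() (covered later in this tutorial). This is a minimal sketch using strings from the tables above; the variable names are illustrative.

```python
import re

# a{2,3} matches runs of two or three a's; a lone a is skipped.
runs_of_a = re.findall(r'a{2,3}', 'aabc daaaat')
print(runs_of_a)   # ['aa', 'aaa']

# [0-9]{2,4} is greedy: it takes up to four digits before starting a new match.
digit_runs = re.findall(r'[0-9]{2,4}', '12 and 345673')
print(digit_runs)  # ['12', '3456', '73']

# Single digits do not satisfy the minimum of two repetitions.
too_short = re.findall(r'[0-9]{2,4}', '1 and  2')
print(too_short)   # []
```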

- Alternation: \|
   - Vertical bar \| is used for alternation (OR operator).

|Expression	|String	|Matched?|
|----------|:-------------:|------:|
|a\|b|	cde|	No match|
|a\|b| ade	|1 match (match at ade)|
|a\|b| acdbea|	3 matches (at acdbea)|

   - Here, a\|b matches any string that contains either a or b.

- Group: ()
    - Parentheses () are used to group sub-patterns. 
    - For example, (a|b|c)xz matches any string that contains a, b, or c followed by xz.

|Expression	|String	|Matched?|
|----------|:-------------:|------:|
|(a\|b\|c)xz |xz     |	No match|
|(a\|b\|c)xz |abxz	|1 match (match at abxz)|
|(a\|b\|c)xz |axz cabxz|	2 matches (axz and bxz)|
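
One detail worth knowing when trying these patterns in Python: when a pattern contains a capturing group, re.findall() returns the group's content rather than the whole match. The sketch below uses re.finditer() to recover the full matches; both functions belong to the re module, and the variable names are illustrative.

```python
import re

# Alternation: each a or b is a separate match.
either = re.findall(r'a|b', 'acdbea')
print(either)        # ['a', 'b', 'a']

text = 'axz cabxz'

# With a capturing group, findall returns only the captured part.
captured = re.findall(r'(a|b|c)xz', text)
print(captured)      # ['a', 'b']

# finditer yields match objects; group(0) is the full matched text.
full_matches = [m.group(0) for m in re.finditer(r'(a|b|c)xz', text)]
print(full_matches)  # ['axz', 'bxz']
```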


- Backslash: \\ 

    - Backslash \\ is used to escape various characters including all metacharacters. For example,
   
    ```
    \$a
    ```
    - matches if a string contains \$ followed by a. 
    - Here, \$ is not interpreted by a RegEx engine in a special way.

    - If you are unsure if a character has special meaning or not, you can put \\ in front of it. This makes sure the character is not treated in a special way.
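
A short sketch of escaping in practice. The sample strings are made up for illustration; re.escape() is a standard re helper that adds the backslashes for you when the text to match is not known in advance.

```python
import re

price_text = 'cost: $a and $b'

# '\$a' looks for a literal dollar sign followed by a.
literal_hits = re.findall(r'\$a', price_text)
print(literal_hits)   # ['$a']

# Unescaped, '$' is the end-of-string anchor, so '$a' can never match here.
anchor_hits = re.findall(r'$a', price_text)
print(anchor_hits)    # []

# re.escape() escapes the '+' so that '1+1' is matched literally.
escaped_hits = re.findall(re.escape('1+1'), '1+1=2')
print(escaped_hits)   # ['1+1']
```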


- Special Sequences

    - Special sequences make commonly used patterns easier to write. Here's a list of special sequences:
        - \\A - Matches if the specified characters are at the start of a string.
        - \\b - Matches if the specified characters are at the beginning or end of a word.
        - \\B - Opposite of \b. Matches if the specified characters are not at the beginning or end of a word.
        - **\\d - Matches any decimal digit. Equivalent to [0-9]**
        - \\D - Matches any character that is not a decimal digit. Equivalent to [^0-9]
        - \\s - Matches where a string contains any whitespace character. Equivalent to [ \\t\\n\\r\\f\\v].
        - \\S - Matches where a string contains any non-whitespace character. Equivalent to [^ \\t\\n\\r\\f\\v].
        - ...
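
A few of the special sequences above, tried on a made-up sample string. This is only a sketch; re.findall() and re.split() are covered later in this tutorial, and the variable names are illustrative.

```python
import re

text = 'Room 42, floor 7'

# \d+ collects runs of digits; \D+ collects runs of non-digits.
digits = re.findall(r'\d+', text)
non_digits = re.findall(r'\D+', text)
print(digits)      # ['42', '7']
print(non_digits)  # ['Room ', ', floor ']

# \s+ matches runs of whitespace, which makes it handy for splitting.
tokens = re.split(r'\s+', text)
print(tokens)      # ['Room', '42,', 'floor', '7']

# \b marks a word boundary, so 'floor' inside 'flooring' is not matched.
whole_word = re.findall(r'\bfloor\b', 'floor flooring')
print(whole_word)  # ['floor']
```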

## RegEx in Python

- Python has a module named re to work with regular expressions. To use it, we need to import the module.
    ```python
    import re
    ```   
- The module defines several functions and constants to work with RegEx.

### re.findall()

   - The re.findall() function returns a list of strings containing all non-overlapping matches of the pattern.
   
    ```python
    re.findall(pattern, string)
    ```


In [7]:
# Program to extract numbers from a string
import re

string = 'hello 2020, I am 22 years old'
pattern = r'\d+'  # raw string, so \d reaches the RegEx engine unchanged

result = re.findall(pattern, string) 
print(result)

['2020', '22']


 ### re.split()
 - The re.split() function splits the string where there is a match and returns a list of the resulting substrings.

In [8]:
string = 'Height:12; Price:89000HKD.'
pattern = r'\d+'

result = re.split(pattern, string) 
print(result)

['Height:', '; Price:', 'HKD.']


In [9]:
# You can pass the maxsplit argument to re.split().
# It's the maximum number of splits that will occur.

string = 'Height:12; Price:89000HKD.'
pattern = r'\d+'

# maxsplit=1: split only at the first occurrence
result = re.split(pattern, string, maxsplit=1)
print(result)

['Height:', '; Price:89000HKD.']


### re.search() 
- The re.search() function takes two arguments, a pattern and a string, and looks for the first location where the RegEx pattern produces a match in the string.

- If the search is successful, re.search() returns a match object; if not, it returns None.

```python
match = re.search(pattern, string)
```

In [11]:
string = "Python fun"

# check if 'Python' is at the beginning
match = re.search(r'\APython', string)

if match:
  print("pattern found inside the string")
else:
  print("pattern not found")  

# Output: pattern found inside the string

pattern found inside the string
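
The match object returned by re.search() carries more than a yes/no answer. The sketch below reads the matched text and its position with group() and span(), and contrasts re.search() with re.match(), which only tries the beginning of the string. The variable names are illustrative.

```python
import re

string = 'Python fun'

# re.search scans the whole string for the first match.
found = re.search(r'fun', string)
matched_text = found.group()  # the text that matched
position = found.span()       # (start, end) indices of the match
print(matched_text, position)  # fun (7, 10)

# re.match only tries at position 0, so 'fun' is not found.
at_start = re.match(r'fun', string)
print(at_start)  # None
```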


### Using r prefix before RegEx

- When the r or R prefix is used before a string literal, it marks a raw string. 
- For example, '\\n' is a newline, whereas r'\\n' is two characters: a backslash \\ followed by n.
- Backslash \\ is used to escape various characters, including all metacharacters. 
- With the r prefix, Python treats \\ as a normal character, so the backslash reaches the RegEx engine unchanged.

In [7]:
string = '\n and \r are escape sequences.'

result = re.findall(r'[\n\r]', string) 
print(result)
# Output: ['\n', '\r']

['\n', '\r']
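
To make the difference concrete, this small sketch compares the length of a normal and a raw string literal, then shows that a doubled backslash and a raw string spell the same pattern. The variable names are illustrative.

```python
import re

normal = '\n'   # one character: a newline
raw = r'\n'     # two characters: a backslash and the letter n
print(len(normal), len(raw))  # 1 2

# Without the r prefix, the backslash has to be doubled to get the same pattern.
same_pattern = '\\d+' == r'\d+'
print(same_pattern)  # True

numbers = re.findall(r'\d+', 'hello 2020, I am 22 years old')
print(numbers)  # ['2020', '22']
```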


### Look-ahead and Look-behind
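Look-ahead and look-behind are useful when we want to match a pattern by its relative position, that is, by what comes before or after it, without including that surrounding text in the match.

`(?<=...)`: look behind, the pattern that must come before the pattern we want

`(?=...)`: look ahead, the pattern that must come after the pattern we want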

In [15]:
logs = '''
146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622
197.109.77.178 - kertzmann3129 [21/Jun/2019:15:45:25 -0700] "DELETE /virtual/solutions/target/web+services HTTP/2.0" 203 26554
156.127.178.177 - okuneva5222 [21/Jun/2019:15:45:27 -0700] "DELETE /interactive/transparent/niches/revolutionize HTTP/1.1" 416 14701
100.32.205.59 - ortiz8891 [21/Jun/2019:15:45:28 -0700] "PATCH /architectures HTTP/1.0" 204 6048
168.95.156.240 - stark2413 [21/Jun/2019:15:45:31 -0700] "GET /engage HTTP/2.0" 201 9645
71.172.239.195 - dooley1853 [21/Jun/2019:15:45:32 -0700] "PUT /cutting-edge HTTP/2.0" 406 24498
'''

In [16]:
# find all the user names
re.findall(r'(?<=-\s).+(?= \[)', logs)

['feest6811',
 'kertzmann3129',
 'okuneva5222',
 'ortiz8891',
 'stark2413',
 'dooley1853']
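
A smaller sketch of the same idea, on a made-up string, may make the two constructs easier to see in isolation. Note that in Python's re module the look-behind pattern must have a fixed width.

```python
import re

text = 'Height:12cm; Price:89000HKD'

# Look-behind: digits that come right after 'Price:' (the label is not returned).
price = re.findall(r'(?<=Price:)\d+', text)
print(price)   # ['89000']

# Look-ahead: digits that are immediately followed by 'cm' (the unit is not returned).
height = re.findall(r'\d+(?=cm)', text)
print(height)  # ['12']
```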

## Web Scraping with RegEx

In [23]:
import requests
from bs4 import BeautifulSoup

URL = 'http://www.pythonscraping.com/pages/page3.html'
page = requests.get(URL)

bs = BeautifulSoup(page.content, 'html.parser')

images = bs.find_all('img',
    {'src': re.compile(r'\.\./img/gifts/img\d*\.jpg')})  # escape the dots so they match literally
for image in images: 
    print(image)

<img src="../img/gifts/img1.jpg"/>
<img src="../img/gifts/img2.jpg"/>
<img src="../img/gifts/img3.jpg"/>
<img src="../img/gifts/img4.jpg"/>
<img src="../img/gifts/img6.jpg"/>


In [25]:
images[0]

<img src="../img/gifts/img1.jpg"/>

## Tips

- To build and test regular expressions, you can use RegEx tester tools such as regex101: https://regex101.com/.
- Learn more at: https://docs.python.org/3/library/re.html 
- Ask ChatGPT to write it for you!
 