## RegEx

A regular expression (RegEx) is a sequence of characters that defines a search pattern for matching text in string data.

For example, `^s..e*...h$` matches a string that starts with `s`, continues with any two characters, then zero or more `e` characters (`*` means zero or more occurrences of the preceding character), then any three characters, and ends with `h`.

* Match: `shreeyash`
* Match: `shreyanh`
* Match: `sreyanh`
* Not match: `Shreyash` (starts with an uppercase `S`)
* Not match: `Shre yash`
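The list above can be checked directly with Python's `re` module. This is a minimal sketch; the names `candidates` and `results` are only illustrative.

```python
import re

pattern = r'^s..e*...h$'
candidates = ['shreeyash', 'shreyanh', 'sreyanh', 'Shreyash', 'Shre yash']

# Map each candidate string to True/False depending on whether the pattern matches
results = {s: re.match(pattern, s) is not None for s in candidates}

for s, matched in results.items():
    print(f'{s!r}: {"match" if matched else "no match"}')
```

The first three strings report a match and the last two do not, in line with the list.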

**re module**

Python has a built-in package called `re`, which can be used to work with regular expressions.

**syntax**

```python
import re
```

## Match


```python
import re

pattern = '^s..e*...h$'
string_data = 'shreeyash'

# re.match() returns a match object if the pattern matches at the start of the string; otherwise it returns None
match_object_5 = re.match(pattern, string_data)
print(match_object_5)
```

```
<re.Match object; span=(0, 9), match='shreeyash'>
```


## Search

```python
string_data = 'shreyanh'

# re.search() returns a match object for the first match found anywhere in the string; otherwise it returns None
match_object_6 = re.search(pattern, string_data)
print(match_object_6)
```

```
<re.Match object; span=(0, 8), match='shreyanh'>
```


```python
string_data = 'Shreyash'

# The pattern requires a lowercase 's' at the start, so re.match() returns None here
match_object_6 = re.match(pattern, string_data)
print(match_object_6)
```

```
None
```


## Types of Regular Expression Characters

Regular expressions are made up of both ordinary and special characters. Ordinary characters simply match themselves, while special characters have a more specific meaning.

| Type | Description | Example |
|---|---|---|
| **Ordinary character** | Matches itself | `A` matches the letter `A` |
| **Special character** | Has a specific meaning | `.` matches any character except a newline |
| **Quantifier** | Indicates how many times a character or group of characters can be matched | `*` matches zero or more occurrences of the preceding character |
| **Grouping** | Allows you to group together characters or expressions | `(A\|B)` matches either `A` or `B` |
| **Escape character** | Escapes the special meaning of a character | `\.` matches the literal character `.` |

For example, the regular expression `\d{3}-\d{3}-\d{4}` matches any string that contains three digits, a hyphen, three more digits, and a hyphen, followed by four digits. This regular expression could be used to match a phone number.
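As a minimal sketch of the phone-number pattern in use, the snippet below pulls every matching substring out of a made-up sentence. The sample text and variable names are assumptions for illustration only.

```python
import re

phone_pattern = r'\d{3}-\d{3}-\d{4}'
text = 'Call 555-123-4567 or 555-987-6543 before 5 pm.'  # made-up sample text

# findall() returns every non-overlapping substring that matches the pattern
phone_numbers = re.findall(phone_pattern, text)
print(phone_numbers)

# fullmatch() succeeds only if the entire string has the phone-number shape
is_phone = re.fullmatch(phone_pattern, '555-123-4567') is not None
is_not_phone = re.fullmatch(phone_pattern, 'Call 555-123-4567') is not None
```

Note that `re.findall()` and `re.search()` look for the pattern anywhere inside the string, while `re.fullmatch()` is the stricter check for validating a whole value.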

The special characters can be used to create more complex regular expressions that can match a wider variety of strings.


## Square Brackets in Regular Expressions

Square brackets (`[]`) specify a set of characters, any one of which may match at that position. The characters can be listed individually, or a range can be given as two characters separated by a `-`. Most special characters lose their special meaning inside sets. For example, `[(+*)]` will match any of the literal characters `(`, `+`, `*`, or `)`.

Here are some examples of how square brackets can be used in regular expressions:

* `[123]` will match any of the digits 1, 2, or 3.
* `[a-z]` will match any of the lowercase letters from a to z.
* `[0-9a-zA-Z]` will match any of the digits from 0 to 9, or any of the lowercase or uppercase letters from a to z.
* `[^5]` will match any character except '5'.

Square brackets are a powerful tool for matching patterns in text. They can match a wide variety of characters, and a leading `^` inside the brackets complements the set, so that it matches any character not listed.
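The bracket examples above can be tried on a short made-up string. This is only a sketch; the variable names are illustrative.

```python
import re

sample = 'a1B2c5(+)'  # made-up sample string

one_two_three = re.findall(r'[123]', sample)       # digits 1, 2 or 3
lowercase = re.findall(r'[a-z]', sample)           # lowercase letters
alphanumeric = re.findall(r'[0-9a-zA-Z]', sample)  # digits and letters
not_five = re.findall(r'[^5]', sample)             # anything except '5'
literals = re.findall(r'[(+*)]', sample)           # literal '(', '+', '*' or ')'

print(one_two_three)  # ['1', '2']
print(lowercase)      # ['a', 'c']
print(alphanumeric)   # ['a', '1', 'B', '2', 'c', '5']
print(not_five)       # ['a', '1', 'B', '2', 'c', '(', '+', ')']
print(literals)       # ['(', '+', ')']
```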

Here are some additional resources that you may find helpful:

* [Regular Expressions Tutorial](https://www.w3schools.com/python/python_regex.asp)
* [Python Regular Expressions HOWTO](https://docs.python.org/3/howto/regex.html)
* [Regular Expressions Cookbook](https://www.regular-expressions.info/)



## Other Special Sequences

The following special sequences can be used in regular expressions:

* `\d` - Matches any Unicode decimal digit (similar to `[0-9]`).
* `\D` - Matches any character that is not a decimal digit (similar to `[^0-9]`).
* `\s` - Matches any whitespace character (similar to `[ \t\n\r\f\v]`).
* `\S` - Matches any non-whitespace character (similar to `[^ \t\n\r\f\v]`).
* `\w` - Matches Unicode word characters: any character that can be part of a word in any language, as well as digits and the underscore (similar to `[a-zA-Z0-9_]`).
* `\W` - Matches any character that is not a word character (similar to `[^a-zA-Z0-9_]`).
* `\Z` - Matches only at the end of the string.

For example, the regular expression `\d{3}-\d{3}-\d{4}` matches any string that contains three digits, a hyphen, three more digits, and a hyphen, followed by four digits. This regular expression could be used to match a phone number.

The special sequences can be used to create more complex regular expressions that can match a wider variety of strings.
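The sequences listed above can be sketched on a made-up string as follows. The sample text and variable names are assumptions for illustration. The last two lines show why `\d` is only "similar to" `[0-9]`: for `str` patterns it also matches non-ASCII decimal digits unless the `re.ASCII` flag is given.

```python
import re

sample = 'Order_7 shipped\ton 2024-01-15'  # made-up sample text

digits = re.findall(r'\d+', sample)    # runs of decimal digits
words = re.findall(r'\w+', sample)     # runs of word characters
spaces = re.findall(r'\s', sample)     # single whitespace characters
non_word = re.findall(r'\W', sample)   # single non-word characters
ends_with_15 = re.search(r'15\Z', sample) is not None  # \Z anchors at the end

# U+0663 is the Arabic-Indic digit three, which counts as a Unicode decimal digit
mixed = '\u0663' + '4'
unicode_digits = re.findall(r'\d', mixed)
ascii_digits = re.findall(r'\d', mixed, flags=re.ASCII)

print(digits)    # ['7', '2024', '01', '15']
print(words)     # ['Order_7', 'shipped', 'on', '2024', '01', '15']
print(spaces)    # [' ', '\t', ' ']
print(non_word)  # [' ', '\t', ' ', '-', '-']
```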


## Sub

```python
pattern = '[0-9]'
string_data = 'My phone is 123456734'
masked_data = re.sub(pattern, '*', string_data)
print(masked_data)
```

```
My phone is *********
```


```python
# . - Dot (period)
# In the default mode, this matches any character except a newline (\n).
# If the DOTALL flag has been specified, this matches any character including a newline.
pattern = '.'
string_data = 'Shreyash phone is +123456734 and Shreyash Adhar card no. is 34554 5446 78435 Shreyash'
masked_data = re.sub(pattern, '*', string_data)
print(masked_data)
```

```
*************************************************************************************
```
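To illustrate the `DOTALL` behaviour mentioned in the comment above, here is a small sketch on a made-up two-line string. The variable names are illustrative only.

```python
import re

two_lines = 'ab\ncd'  # made-up sample containing a newline

# Default mode: '.' skips the newline, so it survives the substitution
default_mask = re.sub('.', '*', two_lines)

# DOTALL mode: '.' also matches the newline, so every character is replaced
dotall_mask = re.sub('.', '*', two_lines, flags=re.DOTALL)

print(repr(default_mask))  # '**\n**'
print(repr(dotall_mask))   # '*****'
```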


```python
# ^ - Caret
# Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
pattern = '^Shreyash'
masked_data = re.sub(pattern, '*', string_data)
print(masked_data)
```

```
* phone is +123456734 and Shreyash Adhar card no. is 34554 5446 78435 Shreyash
```


```python
# '^' followed by eight dots matches the first eight characters of the string
pattern = '^........'
masked_data = re.sub(pattern, '*', string_data)
print(masked_data)
```

```
* phone is +123456734 and Shreyash Adhar card no. is 34554 5446 78435 Shreyash
```


```python
# $ - Dollar
# Matches the end of the string or just before the newline at the end of the string,
# and in MULTILINE mode also matches before a newline.
pattern = '.........$'
redact_data = re.sub(pattern, '*', string_data)
print(redact_data)
```

```
Shreyash phone is +123456734 and Shreyash Adhar card no. is 34554 5446 78435*
```


```python
# The string does not end with '35', so nothing is replaced
pattern = '35$'
redact_data = re.sub(pattern, '*', string_data)
print(redact_data)
```

```
Shreyash phone is +123456734 and Shreyash Adhar card no. is 34554 5446 78435 Shreyash
```
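The caret and dollar comments above both mention `MULTILINE` mode. A minimal sketch of the difference, using a made-up two-line string and illustrative variable names, might look like this:

```python
import re

lines = 'Shreyash one\nShreyash two'  # made-up two-line sample

# Default mode: '^' matches only at the very start of the string
caret_default = re.sub(r'^Shreyash', '*', lines)
# MULTILINE mode: '^' also matches right after each newline
caret_multiline = re.sub(r'^Shreyash', '*', lines, flags=re.MULTILINE)

# Default mode: '$' matches only at the end of the string, so 'one' is not at an end
dollar_default = re.sub(r'one$', '*', lines)
# MULTILINE mode: '$' also matches just before each newline
dollar_multiline = re.sub(r'one$', '*', lines, flags=re.MULTILINE)

print(repr(caret_default))     # '* one\nShreyash two'
print(repr(caret_multiline))   # '* one\n* two'
print(repr(dollar_default))    # 'Shreyash one\nShreyash two'
print(repr(dollar_multiline))  # 'Shreyash *\nShreyash two'
```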


```python
# * - Star
# Matches zero or more occurrences of the pattern to its left, as many repetitions as possible.
# ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.
pattern = '34*5'
redact_data = re.sub(pattern, '*', string_data)
print(redact_data)
```

```
Shreyash phone is +12*6734 and Shreyash Adhar card no. is *54 5446 784* Shreyash
```


```python
# + - Plus
# Matches one or more occurrences of the pattern to its left, as many repetitions as possible.
# ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.
pattern = '34+5'
redact_data = re.sub(pattern, '*', string_data)
print(redact_data)
```

```
Shreyash phone is +12*6734 and Shreyash Adhar card no. is *54 5446 78435 Shreyash
```


```python
# ? - Question mark
# Matches zero or one occurrence of the pattern to its left, preferring one when possible.
# ab? will match either 'a' or 'ab'.
pattern = '34?5'
redact_data = re.sub(pattern, '*', string_data)
print(redact_data)
```

```
Shreyash phone is +12*6734 and Shreyash Adhar card no. is *54 5446 784* Shreyash
```


```python
# {} - Braces
# a{m,n} matches from m to n repetitions (at least m, at most n) of the preceding RE,
# attempting to match as many repetitions as possible.
# For example, a{3,5} will match 'aaa', 'aaaa', or 'aaaaa'.
pattern = '4{2,3}'
redact_data = re.sub(pattern, '*', string_data)
print(redact_data)
```

```
Shreyash phone is +123456734 and Shreyash Adhar card no. is 34554 5*6 78435 Shreyash
```


```python
# | - Or operator
# A|B matches either A or B; here every '3' and every '5' is replaced
pattern = '3|5'
redact_data = re.sub(pattern, '*', string_data)
print(redact_data)
```

```
Shreyash phone is +12*4*67*4 and Shreyash Adhar card no. is *4**4 *446 784** Shreyash
```


```python
# () - Group
# Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group.
# (3|4)5 matches '35' or '45'.
pattern = '(3|4)5'
redact_data = re.sub(pattern, '*', string_data)
print(redact_data)
```

```
Shreyash phone is +123*6734 and Shreyash Adhar card no. is 3*54 5446 784* Shreyash
```


```python
# \ - Backslash
# Either escapes special characters (permitting you to match characters like '*', '?', and so forth),
# or signals a special sequence.
# Use a raw string (r'...') so that Python does not treat the backslash as a string escape.
pattern = r'\+3'  # matches a literal '+' followed by '3'; the backslash stops '+' acting as a quantifier
redact_data = re.sub(pattern, '*', string_data)
print(redact_data)  # unchanged: in this string '+' is followed by '1', not '3'
```

```
Shreyash phone is +123456734 and Shreyash Adhar card no. is 34554 5446 78435 Shreyash
```


```python
# \. matches a literal dot rather than "any character"
pattern = r'\.'
redact_data = re.sub(pattern, '*', string_data)
print(redact_data)
```

```
Shreyash phone is +123456734 and Shreyash Adhar card no* is 34554 5446 78435 Shreyash
```


```python
# Special sequences
# \A - Matches only at the start of the string.
pattern = r'\AShre'
redact_data = re.sub(pattern, '*', string_data)
print(redact_data)
```

```
*yash phone is +123456734 and Shreyash Adhar card no. is 34554 5446 78435 Shreyash
```


```python
# \b - Matches if the specified characters are at the beginning or end of a word.
pattern = r'\bShr'
redact_data = re.sub(pattern, '*', string_data)
print(redact_data)
```

```
*eyash phone is +123456734 and *eyash Adhar card no. is 34554 5446 78435 *eyash
```


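```python
# New sample string: 'andShreyash' has no space, so this 'Shreyash' starts in the middle of a word
string_data = '''Shreyash phone is +123456734 andShreyash Adhar card no. is 34554 5446 78435'''
```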

```python
# \B - The opposite of \b: matches if the specified characters are not at the beginning or end of a word.
pattern = r'\BShre'
redact_data = re.sub(pattern, '*', string_data)
print(redact_data)
```

```
Shreyash phone is +123456734 and*yash Adhar card no. is 34554 5446 78435
```


## Recall

## RegEx module methods
* **re.findall()**: Returns all non-overlapping matches of the pattern in the string, as a list of strings or tuples. The string is scanned left to right, and matches are returned in the order found. Empty matches are included in the result. The result depends on the number of capturing groups in the pattern. If there are no groups, it returns a list of strings matching the whole pattern. If there is exactly one group, it returns a list of strings matching that group. If multiple groups are present, it returns a list of tuples of strings matching the groups. Non-capturing groups do not affect the form of the result.

* **re.split()**: Splits the string at each occurrence of the pattern. If capturing parentheses are used in the pattern, the text of all groups in the pattern is also returned as part of the resulting list. You can also pass a `maxsplit` argument: if `maxsplit` is nonzero, at most `maxsplit` splits occur, and the remainder of the string is returned as the final element of the list.

* **re.sub()**: Returns the string obtained by replacing the leftmost non-overlapping occurrences of the pattern in the string with the replacement `repl`. If the pattern isn't found, the string is returned unchanged. The optional `count` argument is the maximum number of pattern occurrences to be replaced.

* **re.search()**: Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

* **re.match(pattern, string, flags=0)**: If zero or more characters at the beginning of the string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match. Even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line. If you want to locate a match anywhere in the string, use search().
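The `findall()`, `split()`, and `sub()` descriptions above can be sketched on a small made-up string. The sample text and variable names are assumptions for illustration.

```python
import re

text = 'a1b22c333'  # made-up sample string

# findall(): no groups gives whole matches; multiple groups give tuples
whole_matches = re.findall(r'\d+', text)
group_tuples = re.findall(r'([a-z])(\d+)', text)

# split(): pieces between matches; a capturing group keeps the separators too
pieces = re.split(r'\d+', text)
pieces_with_separators = re.split(r'(\d+)', text)
first_split_only = re.split(r'\d+', text, maxsplit=1)

# sub(): count limits how many occurrences are replaced
first_two_replaced = re.sub(r'\d+', '#', text, count=2)

print(whole_matches)           # ['1', '22', '333']
print(group_tuples)            # [('a', '1'), ('b', '22'), ('c', '333')]
print(pieces)                  # ['a', 'b', 'c', '']
print(pieces_with_separators)  # ['a', '1', 'b', '22', 'c', '333', '']
print(first_split_only)        # ['a', 'b22c333']
print(first_two_replaced)      # a#b#c333
```

The trailing empty string in the `split()` results appears because the sample ends with a match, so there is nothing left after the last separator.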

**Match objects**: Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple `if` statement.

**Match.start() and Match.end()**: Return the indices of the start and end of the substring matched by the group; the group defaults to zero (meaning the whole matched substring). Return -1 if the group exists but did not contribute to the match.
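A short sketch of working with a match object, reusing the phone string from the `Sub` section. The variable names are illustrative, and the second pattern is a made-up case chosen to show the `-1` return value.

```python
import re

text = 'My phone is 123456734'
match = re.search(r'\d+', text)

# A match object is truthy and None is falsy, so a plain `if` is enough
if match:
    start, end = match.start(), match.end()
    print(start, end, match.group())  # 12 21 123456734
    print(text[start:end])            # slicing with start/end recovers the match

# Group 1 exists in the pattern but does not take part in this match
alt = re.search(r'(a)|(b)', 'b')
unused_group_start = alt.start(1)  # -1
used_group_start = alt.start(2)    # 0
```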

**re.search() vs re.match()**: Both functions look for a pattern in a string and return a match object if it is found; otherwise they return None.

**Difference between re.search() and re.match()**: Both functions return only the first match, but they differ in where they look for it.

re.match() checks only at the beginning of the string and returns a match object if the pattern matches there. If the pattern occurs only somewhere in the middle of the string, it returns None.

re.search() scans the whole string, including every line of a multi-line string, and returns the first location where the pattern matches.
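The difference can be seen on a made-up two-line string. This is a minimal sketch with illustrative variable names.

```python
import re

text = 'first line\nsecond line'  # made-up two-line sample

# match() only tries the very beginning of the string
match_result = re.match(r'second', text)
# search() scans forward and finds the word on the second line
search_result = re.search(r'second', text)

# Even with MULTILINE, match() does not try the start of each line
match_multiline = re.match(r'^second', text, flags=re.MULTILINE)
search_multiline = re.search(r'^second', text, flags=re.MULTILINE)

print(match_result)          # None
print(search_result.span())  # (11, 17)
print(match_multiline)       # None
print(search_multiline.span())  # (11, 17)
```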