## RegEx - Regular Expression
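A regular expression (regex) is a sequence of characters that defines a search pattern. It is used to search for and match text that fits the pattern.

Python's built-in <b>"re"</b> module provides regex support.

For more details, refer to: https://docs.python.org/3/library/re.html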

### Patterns

<br>
<div style="display:flex; justify-content: flex-end;"> 
    <img src="./patterns.png" style="margin-left:0 !important;">
</div>

### Sets

<b>[ ]</b>  |   Defines a set of characters. The set matches exactly one character, which may be any of the characters listed inside the brackets.<br>
&emsp;   <i>e.g.,  Pattern: [abc] <br>
&emsp;          Expression: [abc]<br>
&emsp;          Input string: abcde<br>
&emsp;          Matches: 3 (matches "a", "b", and "c" as separate single-character matches)</i><br>

<b>[amk]</b> | Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.

<b>[a-z]</b> | Matches any single lowercase letter from a to z.<br>
&emsp;   <i>e.g., Pattern: [a-e]<br>
&emsp;          Expression: [a-e]<br>
&emsp;          Input string: Hai<br>
&emsp;          Matches: 1 (matches "a")</i><br>

<b>[a\-z]</b> | Matches a, -, or z. It matches - because \ escapes it.

<b>[a-]</b> | Matches a or -, because - is not being used to indicate a series of characters.

<b>[-a]</b> | As above, matches a or -.

<b>[a-z0-9]</b> | Matches any single character from a to z or from 0 to 9.

<b>[0-5][0-9]</b> | Matches any two-digit number from 00 to 59.

<b>[(+*)]</b> | Special characters become literal inside a set, so this matches (, +, *, and ).

<b>[^ab5]</b> | A ^ at the start of a set negates it. Here, it matches any character that is not a, b, or 5.

<b>" * "</b>  Asterisk | Matches zero or more occurrences of the expression to its left. <br>

<b>" + "</b>  Plus | Matches one or more occurrences of the expression to its left.
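The set and quantifier rules above can be checked with a few short calls. This is a minimal sketch; the input strings and variable names are made up for illustration.

```python
import re

# A set matches exactly one character, so "abcde" yields three separate matches.
single_chars = re.findall(r"[abc]", "abcde")

# A range inside a set: lowercase letters a through e only.
in_range = re.findall(r"[a-e]", "Hai")

# A leading ^ negates the set: anything except a, b, or 5.
negated = re.findall(r"[^ab5]", "ab5xy")

# Two sets side by side: two-digit values from 00 to 59.
two_digits = re.findall(r"[0-5][0-9]", "07 45 61 99")

# Quantifiers apply to the expression immediately to their left (here, "b").
zero_or_more = re.findall(r"ab*", "a ab abbb")
one_or_more = re.findall(r"ab+", "a ab abbb")

print(single_chars)   # ['a', 'b', 'c']
print(in_range)       # ['a']
print(negated)        # ['x', 'y']
print(two_digits)     # ['07', '45']
print(zero_or_more)   # ['a', 'ab', 'abbb']
print(one_or_more)    # ['ab', 'abbb']
```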

### Special characters in RegEx

^ | Matches the expression to its right at the start of the string. In multi-line mode (re.M), it also matches immediately after each \n.

$ | Matches the expression to its left at the end of the string, or just before a \n at the very end of the string. In multi-line mode (re.M), it also matches just before each \n.

. | Matches any character except the newline \n. With the s flag (re.S), it matches \n as well.

\ | Escapes special characters or denotes character classes. Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning.

A|B | Matches expression A or B. If A is matched first, B is left untried.

? | Matches the expression to its left 0 or 1 times. 

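{m} | Matches the expression to its left exactly m times.

{m,n} | Matches the expression to its left from m to n times, taking as many repetitions as possible (greedy). Write it without a space after the comma; Python treats {m, n} as literal text.

{m,n}? | Non-greedy form of {m,n}. Matches the expression to its left from m to n times, taking as few repetitions as possible.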
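A short sketch of how the anchors and repetition counts above behave in Python. The sample strings and variable names are illustrative assumptions.

```python
import re

text = "first line\nsecond line"

# Without re.MULTILINE, ^ only matches at the very start of the string.
starts_default = re.findall(r"^\w+", text)
# With re.MULTILINE, ^ also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

digits = "123456"
exact = re.match(r"\d{3}", digits).group()     # exactly three digits
greedy = re.match(r"\d{2,4}", digits).group()  # as many as allowed
lazy = re.match(r"\d{2,4}?", digits).group()   # as few as allowed

# ? makes the expression to its left optional.
optional = re.findall(r"colou?r", "color colour")

print(starts_default)    # ['first']
print(starts_multiline)  # ['first', 'second']
print(exact, greedy, lazy)  # 123 1234 12
print(optional)          # ['color', 'colour']
```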

### Character classes

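\w | Matches word characters, which means a-z, A-Z, 0-9, and the underscore, _. For str patterns in Python 3 it also matches Unicode letters and digits unless the re.ASCII flag is used.

\d | Matches decimal digits, which means [0-9] for ASCII text.

\D | Matches any character that is not a decimal digit, which means [^0-9] for ASCII text.

\s | Matches whitespace characters, which include the space, \t, \n, \r, \f, and \v characters.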

\b | Matches the boundary (or empty string) at the start and end of a word, that is, between \w and \W.

\B | Matches the empty string, but only when it is not at the beginning or end of a word. <br>
> - e.g., r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'

\A | Matches the expression to its right only at the absolute start of the string, in both single-line and multi-line mode. It behaves like ^ without the multi-line flag.

\Z | Matches the expression to its left only at the absolute end of the string, in both single-line and multi-line mode. Unlike $, it does not match before a trailing \n.
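The character classes above can be compared side by side. The following is a minimal sketch with made-up sample strings.

```python
import re

sample = "py3 python py. spy"

# \b requires a word boundary on that side; \B requires that there is none.
whole_word = re.findall(r"\bpy\b", sample)
inside_word = re.findall(r"py\B", sample)

mixed = "a1 b22\tc"
digit_runs = re.findall(r"\d+", mixed)
non_digit_runs = re.findall(r"\D+", mixed)
whitespace = re.findall(r"\s", mixed)

lines = "one\ntwo"
# In multi-line mode ^ matches on every line, while \A and \Z stay tied
# to the absolute start and end of the whole string.
caret = re.findall(r"^\w+", lines, flags=re.M)
abs_start = re.findall(r"\A\w+", lines, flags=re.M)
abs_end = re.findall(r"\w+\Z", lines, flags=re.M)

print(whole_word)      # ['py']
print(inside_word)     # ['py', 'py']
print(digit_runs)      # ['1', '22']
print(non_digit_runs)  # ['a', ' b', '\tc']
print(whitespace)      # [' ', '\t']
print(caret, abs_start, abs_end)  # ['one', 'two'] ['one'] ['two']
```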


### Groups

( ) | Matches the expression inside the parentheses and groups it.

(? ) | Inside parentheses like this, ? acts as an extension notation. Its meaning depends on the character immediately to its right. (?) alone doesn't make sense. This is used to include inline flags like (?a), (?x), (?m) etc.

(?P<name>AB) | Matches the expression AB and captures it as a named group, so the matched text can be accessed by the group name. <br>
   > - e.g., (?P<username>.*) The text captured by this group can be accessed using its name "username". <br>
   > - Method of accessing - matchObj.group('username')

(?:A) | Non-capturing group. Matches the expression represented by A, but unlike ( ) or (?P<name>AB), the matched text is not captured and cannot be retrieved afterwards.
    
(?=...) | Matches if ... matches next, but doesn’t consume any of the string. 

> - e.g., Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov' 
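The group types above can be tried out as follows. This is an illustrative sketch; the e-mail address and the other input strings are invented for the example.

```python
import re

# Named groups: retrieve captured text by name.
m = re.search(r"(?P<username>\w+)@(?P<domain>[\w.]+)", "contact: alice@example.com")
username = m.group("username")
domain = m.group("domain")

# With a capturing group, findall returns the group's text (the last repetition).
capturing = re.findall(r"(ab)+", "ababab ab")
# With a non-capturing group, findall returns the whole match.
non_capturing = re.findall(r"(?:ab)+", "ababab ab")

# Lookahead: 'Isaac ' is matched only when 'Asimov' follows, and 'Asimov'
# itself is not part of the match.
lookahead = re.findall(r"Isaac (?=Asimov)", "Isaac Asimov, Isaac Newton")

print(username, domain)  # alice example.com
print(capturing)         # ['ab', 'ab']
print(non_capturing)     # ['ababab', 'ab']
print(lookahead)         # ['Isaac ']
```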

### Flags 

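i | Ignore case (case-insensitive matching) <br>
m | Multi-line matching (^ and $ also match at the start and end of each line)<br>
s | Dot matches all (. also matches the newline \n)<br>
x | Verbose (whitespace and # comments inside the pattern are ignored)<br>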
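A minimal sketch of the four flags in use, passing them both as inline flags and through the `flags` argument. The sample text and variable names are assumptions made for the example.

```python
import re

text = "Python\npython"

# i: the inline form (?i) and the re.I argument behave the same way.
with_argument = re.findall(r"python", text, flags=re.I)
with_inline = re.findall(r"(?i)python", text)

# m: ^ matches at the start of every line.
line_starts = re.findall(r"^p\w+", text, flags=re.I | re.M)

# s: by default "." stops at a newline; with re.S it matches it too.
without_dotall = re.search(r"Python.python", text)
with_dotall = re.search(r"Python.python", text, flags=re.S)

# x: whitespace and comments inside the pattern are ignored.
decimal = re.compile(r"""
    \d+   # integer part
    \.    # literal dot
    \d*   # optional fractional part
""", re.X)
verbose_result = decimal.search("pi is 3.14").group()

print(with_argument, with_inline)  # ['Python', 'python'] ['Python', 'python']
print(line_starts)                 # ['Python', 'python']
print(without_dotall is None, with_dotall is not None)  # True True
print(verbose_result)              # 3.14
```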

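### Operations using RegExp

Some regex engines provide set operators inside brackets (for example, && for intersection in Java). Python's "re" module does not: sequences such as ||, && and nested brackets inside a set are not operators there and may raise a FutureWarning. The same results can be expressed with standard syntax.

##### 1. Union Operation:

Matches if any one of the given patterns occurs in the string. Use the OR operator (|) between patterns, or list several characters and ranges in a single set.

e.g., 0|[^\W\d]+ -> matches either the digit 0 or a word that contains no digits

##### 2. Intersection operation

Matches only if all the given conditions hold. Use a lookahead (?=...) to assert one condition before matching the other, or chain the conditions in one pattern.

e.g., ^abc.*xyz$ -> matches a string that starts with abc and ends with xyz

##### 3. Difference operation

Matches characters that belong to one set but not to another. Use a negative lookahead (?!...) in front of the set, or write out the remaining ranges explicitly.

e.g., (?![aeiou])[a-z] -> used to match all lowercase consonants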
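Python's "re" module has no set operators inside brackets, so union, intersection, and difference are usually emulated with alternation and lookaheads. The following is one possible sketch of that approach; the patterns and inputs are illustrative.

```python
import re

# Union: combine ranges in one set, or use | between alternatives.
union_set = re.findall(r"[0-9a-c]", "a1z9")
union_alt = re.findall(r"cat|dog", "cat, bird, dog")

# Intersection: the lookahead checks membership in [a-m] without consuming
# the character, then [h-z] must match the same character.
intersection = re.findall(r"(?=[a-m])[h-z]", "ahmz")

# Difference: lowercase letters minus vowels, via a negative lookahead.
consonants = re.findall(r"(?![aeiou])[a-z]", "regex")

# Two conditions on the whole string: starts with abc and ends with xyz.
both_ends = re.fullmatch(r"abc.*xyz", "abc-123-xyz") is not None

print(union_set)     # ['a', '1', '9']
print(union_alt)     # ['cat', 'dog']
print(intersection)  # ['h', 'm']
print(consonants)    # ['r', 'g', 'x']
print(both_ends)     # True
```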

### Methods in "re"

In [3]:
import re

<h3><u>re.compile</u>()</h3> 

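- generates a pattern object (i.e., regex object)
- reusable: the compiled object can be applied to many strings through its own methods, such as search(), match(), and findall()
- Syntax: patternObject = re.compile(pattern, flags=0)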
-
    Flags: <br>
    &emsp;&emsp; <b>re.I, re.IGNORECASE:</b> Corresponds to the inline flag (?i)<br>
    &emsp;&emsp; <b>re.M, re.MULTILINE:</b> Corresponds to the inline flag (?m)<br>
    &emsp;&emsp; <b>re.S, re.DOTALL:</b> Corresponds to the inline flag (?s) <br>
    &emsp;&emsp; <b>re.X, re.VERBOSE:</b> Allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Corresponds to the inline flag (?x). <br>  

In [2]:
# re.compile returns a pattern object, not a match object
patternObject = re.compile(r"\d+\.\d*")
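To show why a compiled pattern is reusable, the sketch below applies the same pattern to several strings. The name `decimal_pattern` and the sample inputs are illustrative assumptions.

```python
import re

# Compile once, then reuse the pattern object's own methods.
decimal_pattern = re.compile(r"\d+\.\d*")

samples = ["cost 12.50", "free", "about 3. units"]
matches = [decimal_pattern.search(s) for s in samples]
values = [m.group() if m else None for m in matches]

# The same object also offers findall, match, fullmatch, split, and sub.
all_decimals = decimal_pattern.findall("1.5 and 2.25")

print(values)        # ['12.50', None, '3.']
print(all_decimals)  # ['1.5', '2.25']
```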

<h3><u>re.search</u>()</h3> 

- Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object.
- Returns None if no position in the string matches the pattern
- Scans the whole input string, including every line, not only its beginning
- Syntax: re.search(pattern, string, flags)

In [3]:
re.search("ari", "arizona") # match exists

<re.Match object; span=(0, 3), match='ari'>

In [4]:
re.search("ari", "python") # no match exists --> returns None

In [5]:
# Actual usage
if re.search("ari", "python"):
    print("Match exists")
else:
    print("No match")

No match


<h3><u>re.match</u>()</h3> 

- checks for the pattern only at the beginning of the string and returns a match object if it is found there.

- unlike re.search, it does not scan the rest of the string. Even with the re.M flag, it matches only at the start of the string, not at the start of each line.

- returns None if the pattern occurs only at a later position or on another line

- Syntax: re.match(pattern, string, flags)

In [6]:
re.match("ari", "carry") # does not match: "ari" is not at the start of the string

In [7]:
re.match("ari", "arizona") # matches since the string starts with "ari"

<re.Match object; span=(0, 3), match='ari'>

In [8]:
re.match("ari", "ARIzona", flags=re.I) # used case insensitive flag

<re.Match object; span=(0, 3), match='ARI'>

In [9]:
# Actual usage
if not re.match("ari", "python"):
    print("String doesn't start with given pattern")
else:
    print("Match found")

String doesn't start with given pattern
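The difference between re.match and re.search is easiest to see on a multi-line string. This is a small sketch; the sample text is made up.

```python
import re

text = "first\nari second"

# re.match only looks at position 0 of the string.
at_start = re.match(r"ari", text)
# re.search scans forward until it finds a match.
anywhere = re.search(r"ari", text)

# re.M does not change re.match: it still checks only the start of the string.
match_multiline = re.match(r"ari", text, flags=re.M)
# To test the start of any line, use re.search with ^ and re.M.
search_line_start = re.search(r"^ari", text, flags=re.M)

print(at_start)                   # None
print(anywhere.span())            # (6, 9)
print(match_multiline)            # None
print(search_line_start.start())  # 6
```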


<h3><u>re.fullmatch</u>()</h3>

- returns match object if whole string matches the pattern
- returns None otherwise
- Syntax: re.fullmatch(pattern, string, flags)

In [10]:
re.fullmatch("py", "python")

In [11]:
re.fullmatch("py", "py")  # pattern matches whole input string

<re.Match object; span=(0, 2), match='py'>
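A common use of re.fullmatch is input validation, where trailing text must cause a failure. The helper `is_valid_time` below is hypothetical and only sketches the idea for 24-hour HH:MM values.

```python
import re

TIME_PATTERN = r"(?:[01][0-9]|2[0-3]):[0-5][0-9]"

# Hypothetical helper: accepts only 24-hour times written as HH:MM.
def is_valid_time(value):
    return re.fullmatch(TIME_PATTERN, value) is not None

valid = is_valid_time("09:45")
out_of_range = is_valid_time("24:00")
trailing_text = is_valid_time("09:45pm")

# re.match only anchors the start, so the trailing "pm" is not rejected.
match_accepts_trailing = re.match(TIME_PATTERN, "09:45pm") is not None

print(valid, out_of_range, trailing_text)  # True False False
print(match_accepts_trailing)              # True
```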

<h3><u>re.split</u>()</h3>

- Split string by the occurrences of pattern. e.g., re.split(",", "house name, city name")
-  If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. e.g., re.split("(name)", "house name, city name")
- Syntax: re.split(pattern, string, maxsplit=0, flags)

In [12]:
re.split(",", "house name, city name") #splits at ","

['house name', ' city name']

In [13]:
re.split("(name)", "house name, city name") #splits at "name" and keeps it in the result because of the capturing group

['house ', 'name', ', city ', 'name', '']
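Two further uses of re.split are sketched below: splitting on several delimiters at once, and limiting the number of splits with maxsplit. The input strings are illustrative.

```python
import re

# Split on a comma or semicolon, swallowing any whitespace that follows.
fields = re.split(r"[,;]\s*", "a, b;c,  d")

# maxsplit=1 stops after the first split; the rest of the string stays intact.
first_only = re.split(r",\s*", "a, b, c", maxsplit=1)

print(fields)      # ['a', 'b', 'c', 'd']
print(first_only)  # ['a', 'b, c']
```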

<h3><u>re.findall</u>()</h3>

- returns all non-overlapping matches of the pattern in the string as a list
- string is scanned left-to-right, and matches are returned in the order found
- Syntax: re.findall(pattern, string, flags)

In [14]:
re.findall("ri", "richie rich")

['ri', 'ri']
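When the pattern contains groups, re.findall returns the group contents instead of the whole match, and re.finditer yields match objects when positions are needed. A minimal sketch with made-up input:

```python
import re

# Two capturing groups: each result is a tuple of the groups.
pairs = re.findall(r"(\w+)=(\d+)", "x=1, y=22")

# finditer yields match objects, so positions are available too.
positions = [(m.group(), m.start()) for m in re.finditer(r"ri", "richie rich")]

print(pairs)      # [('x', '1'), ('y', '22')]
print(positions)  # [('ri', 0), ('ri', 7)]
```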

<h3><u>re.sub</u>()</h3>

- replaces each occurrence of the pattern in the string with the replacement and returns the resulting string
- if the pattern isn't found, string is returned unchanged
- Syntax: re.sub(pattern, replacement, string, count=0, flags=0)

In [15]:
re.sub("d", " REPLACEMENT ", "Made in Country")

'Ma REPLACEMENT e in Country'

In [16]:
re.sub("d", " REPLACEMENT ", "Made in Downtown", flags=re.I) # flag applied

'Ma REPLACEMENT e in  REPLACEMENT owntown'

In [17]:
re.sub("d", " REPLACEMENT ", "Made in Downtown", count=1, flags=re.I) # replace only the first match

'Ma REPLACEMENT e in Downtown'

In [18]:
re.sub("x", " REPLACEMENT ", "Made in Country") # no match

'Made in Country'
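The replacement passed to re.sub can also refer to captured groups or be a function. The sketch below shows both, along with re.subn, which additionally reports the number of replacements. The inputs are illustrative.

```python
import re

# Backreferences \1 and \2 reuse the captured groups in the replacement.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

# A function receives each match object and returns the replacement text.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), "3 apples and 10 pears")

# subn returns a tuple: (new string, number of replacements made).
result, count = re.subn(r"d", "D", "Made in Downtown")

print(swapped)        # world hello
print(doubled)        # 6 apples and 20 pears
print(result, count)  # MaDe in Downtown 1
```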

### Match Object

Match object can be processed in the following ways:

In [4]:
matchObject = re.search("pattern", "patterns are amazing")

- match.group() : returns portion of the string that matches the pattern

In [5]:
matchObject.group()

'pattern'

- match.start(): returns the starting index of matched pattern

In [6]:
matchObject.start()

0

- match.end(): returns the index just past the last character of the matched pattern

In [7]:
matchObject.end()

7

- match.span(): returns starting and ending index of the matched pattern

In [8]:
matchObject.span()

(0, 7)
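When the pattern contains groups, the same match object methods accept a group number or name. The following sketch extends the example above with an assumed pattern that has one named and one numbered group.

```python
import re

text = "patterns are amazing"
m = re.search(r"(?P<word>pattern)s? are (\w+)", text)

whole = m.group(0)          # the entire match
by_name = m.group("word")   # named group
by_index = m.group(2)       # second group, by number
all_groups = m.groups()     # all captured groups as a tuple
named = m.groupdict()       # named groups as a dict
second_span = m.span(2)     # start and end index of group 2

# start() and end() can be used to slice the original string.
sliced = text[m.start():m.end()]

print(whole)        # patterns are amazing
print(by_name)      # pattern
print(by_index)     # amazing
print(all_groups)   # ('pattern', 'amazing')
print(named)        # {'word': 'pattern'}
print(second_span)  # (13, 20)
```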