## Introduction to Regular Expressions

Regex (short for regular expression) is a mini-language that can look cryptic and mysterious at first.
A regular expression is a pattern that describes a set of strings, which lets you search through text and match or extract the parts that fit. If you frequently find yourself manually scanning documents or parsing substrings just to identify text patterns, you might find regular expressions particularly useful. In data science and data engineering especially, they can assist in a wide spectrum of tasks, from wrangling data to qualifying and categorizing it.

Regex is extremely useful for extracting information from text such as code, files, logs, spreadsheets, or even documents.

When working with regular expressions, the first thing to recognize is that everything is essentially a character, and that we write patterns to match a specific sequence of characters (also referred to as a string).

For instance, a regular expression could tell a program to search a string for specific text and then print the result. Expressions can include:

- Text matching
- Repetition
- Branching
- Pattern composition
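As a rough illustration of these building blocks, the sketch below uses Python's `re` module (imported in the next section). The sample string and the variable names are made up for this example.

```python
import re

sample = 'cat, caat, dog, bird'

# Text matching: a literal pattern matches itself.
literal = re.findall(r'dog', sample)

# Repetition: 'a+' means one or more 'a' characters.
repetition = re.findall(r'ca+t', sample)

# Branching: '|' matches either alternative.
branching = re.findall(r'cat|bird', sample)

# Composition: smaller patterns are combined into a larger one.
animal = r'(?:ca+t|dog)'
composed = re.findall(animal + r', ' + animal, sample)

print(literal)     # ['dog']
print(repetition)  # ['cat', 'caat']
print(branching)   # ['cat', 'bird']
print(composed)    # ['cat, caat']
```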

## Regex in Python

### Importing the Library

In [1]:
import re

### Literal Characters in Regex

In [2]:
txt = 'Cats and dogs'

In [5]:
re.match('Cats', txt)  # match() returns a match object if the pattern matches at the start of the string; otherwise it returns None

<_sre.SRE_Match object; span=(0, 4), match='Cats'>

### Special Characters in Regex

The eleven special characters and escape sequences below are among the most commonly used in Regex. <br>
1. Period (.): Matches any single character except a newline. <br>
2. Lowercase w (\w): Matches any single letter, digit or underscore. <br>
3. Uppercase w (\W): Matches any single character that \w does not match. <br>
4. Lowercase s (\s): Matches a single whitespace character (space, tab, newline and so on). <br>
5. Uppercase s (\S): Matches any single character that \s does not match. <br>
6. Lowercase t (\t): Matches a single tab character. <br>
7. Lowercase n (\n): Matches a single newline character. <br>
8. Lowercase d (\d): Matches a single decimal digit (0-9). <br>
9. Caret (^): Matches a pattern at the start of the string. <br>
10. Dollar ($): Matches a pattern at the end of the string. <br>
11. Square brackets ([]): Matches any single character from the set listed inside the brackets, for example [A-Za-z] for any ASCII letter. <br>



In [6]:
txt = 'We are learning Regular Expressions (Regex)!'

In [7]:
re.search(r'\w+', txt)  # raw string, so Python does not treat \w as a string escape

<_sre.SRE_Match object; span=(0, 2), match='We'>

In [8]:
re.findall(r'\W', txt)

[' ', ' ', ' ', ' ', ' ', '(', ')', '!']

In [11]:
re.findall('Regular',txt)

['Regular']

In [13]:
re.findall('^Regex',txt)

[]

In [14]:
re.findall(r'[A-Za-z]+', txt)  # [A-z] would also match the punctuation characters between 'Z' and 'a'

['We', 'are', 'learning', 'Regular', 'Expressions', 'Regex']
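The cells above do not exercise every item in the list, so here is a small sketch of the remaining ones (`\d`, `\s`, `\S`, `^`, `$` and a character set). The sample strings are invented for illustration.

```python
import re

line = 'Order 66 shipped\ton 2021-03-04\n'

digits = re.findall(r'\d+', line)       # runs of decimal digits
whitespace = re.findall(r'\s', line)    # space, tab and newline all count
non_space = re.findall(r'\S+', line)    # runs of non-whitespace characters
at_start = re.findall(r'^Order', line)  # only matches at the start
at_end = re.findall(r'Regex$', 'Learning Regex')  # only matches at the end
vowels = re.findall(r'[aeiou]', 'Regex')          # any one character from the set

print(digits)      # ['66', '2021', '03', '04']
print(whitespace)  # [' ', ' ', '\t', ' ', '\n']
print(non_space)   # ['Order', '66', 'shipped', 'on', '2021-03-04']
print(at_start)    # ['Order']
print(at_end)      # ['Regex']
print(vowels)      # ['e', 'e']
```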

### Quantifiers in Regex

1. (+): Matches one or more occurrences of the character or group to its left. <br>
2. (*): Matches zero or more occurrences of the character or group to its left. <br>
3. (?): Matches zero or one occurrence of the character or group to its left. <br>
4. {x}: Repeats exactly x times. <br>
5. {x,}: Repeats at least x times. <br>
6. {x,y}: Repeats at least x times but not more than y times. <br>

In [15]:
txt2 = 'Phone Number:+917564861236, Fax: A8955PA'

In [16]:
re.findall(r'\+91[0-9]{10}', txt2)  # \+ matches a literal plus sign

['+917564861236']

In [19]:
re.findall(r'([+0-9]+)?,\s*Fax', txt2)

['+917564861236']

In [22]:
re.findall(r'([A-Za-z\s]+)?Regular', txt)  # inside [], * is a literal asterisk, so it is dropped here

['We are learning ']
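To compare the repetition operators side by side, the sketch below runs each one against the same made-up string. The `\b` word boundaries are an addition for this example: they force each pattern to cover a whole word, so the differences are easy to see.

```python
import re

codes = 'a ab abb abbb abbbb'

zero_or_more = re.findall(r'\bab*\b', codes)       # 'a' followed by any number of 'b'
one_or_more = re.findall(r'\bab+\b', codes)        # at least one 'b'
optional = re.findall(r'\bab?\b', codes)           # zero or one 'b'
exactly_two = re.findall(r'\bab{2}\b', codes)      # exactly two 'b'
at_least_three = re.findall(r'\bab{3,}\b', codes)  # three or more 'b'
one_to_two = re.findall(r'\bab{1,2}\b', codes)     # between one and two 'b'

print(zero_or_more)    # ['a', 'ab', 'abb', 'abbb', 'abbbb']
print(one_or_more)     # ['ab', 'abb', 'abbb', 'abbbb']
print(optional)        # ['a', 'ab']
print(exactly_two)     # ['abb']
print(at_least_three)  # ['abbb', 'abbbb']
print(one_to_two)      # ['ab', 'abb']
```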

### Functions in Regex

1. re.match(pattern, string, flags=0) <br>
2. re.search(pattern, string, flags=0) <br>
3. re.findall(pattern, string, flags=0) <br>
4. re.sub(pattern, repl, string, count=0, flags=0) <br>
5. re.compile(pattern, flags=0) <br>


###### 1. re.match(pattern, string, flags=0)

Returns a match object only if the pattern matches at the beginning of the string. If no match is found, it returns None.

In [23]:
pattern = 'D'
sequence = 'PenDrive'

In [24]:
re.match(pattern, sequence)  # 'D' is not at the start of 'PenDrive', so this returns None and nothing is displayed

In [25]:
re.match('Pen',sequence)

<_sre.SRE_Match object; span=(0, 3), match='Pen'>

###### 2. re.search(pattern, string, flags=0)
Scans through the string looking for the first location where the regular expression pattern produces a match, and returns the corresponding match object. Returns None if no position in the string matches the pattern.


In [26]:
pattern = 'Regular'
sequence = 'Regular expression on Regular days'

In [31]:
re.search(pattern,sequence).group()

'Regular'
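Because both `re.match()` and `re.search()` return None when nothing matches, calling `.group()` directly on the result raises an AttributeError in that case. The sketch below contrasts the two functions on the earlier 'PenDrive' string and wraps the None check in a small helper; `first_match` is a hypothetical name used only for this example.

```python
import re

sequence = 'PenDrive'

at_start = re.match(r'Drive', sequence)   # None: 'Drive' does not begin at index 0
anywhere = re.search(r'Drive', sequence)  # match object: search() scans the whole string


def first_match(pattern, text):
    """Return the first matched text, or None when nothing matches."""
    found = re.search(pattern, text)
    return found.group() if found else None


print(at_start)                        # None
print(anywhere.span())                 # (3, 8)
print(first_match(r'Drive', sequence)) # Drive
print(first_match(r'USB', sequence))   # None
```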

###### 3. re.findall(pattern, string, flags=0)

Returns all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, it returns a list of groups; this will be a list of tuples if the pattern has more than one group.

In [28]:
pattern = 'Regular'
sequence = 'Regular expression on Regular days'

In [29]:
re.findall(pattern,sequence)

['Regular', 'Regular']
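The effect of groups on the return value is easy to miss, so here is a short sketch with zero, one and two groups. The addresses are placeholder data.

```python
import re

text = 'alice@example.com, bob@example.org'

no_groups = re.findall(r'[\w.-]+@[\w.-]+', text)       # whole matches
one_group = re.findall(r'([\w.-]+)@[\w.-]+', text)     # only the group's text
two_groups = re.findall(r'([\w.-]+)@([\w.-]+)', text)  # one tuple per match

print(no_groups)   # ['alice@example.com', 'bob@example.org']
print(one_group)   # ['alice', 'bob']
print(two_groups)  # [('alice', 'example.com'), ('bob', 'example.org')]
```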

###### 4. re.sub(pattern, repl, string, count=0, flags=0)

This function replaces substrings in a string. It returns the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string with the replacement repl.

In [32]:
email_address = "Please contact us at: xyz@yahoo.com"
email_address

'Please contact us at: xyz@yahoo.com'

In [33]:
new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'support@yahoo.com', email_address)
new_email_address

'Please contact us at: support@yahoo.com'
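Two features of `re.sub()` are worth a quick sketch: the `count` argument from the signature, and backreferences such as `\1` in the replacement, which reuse the text captured by a group. The addresses and the `example.com` domain are placeholders.

```python
import re

text = 'a@x.com, b@y.com, c@z.com'

# count=1 replaces only the leftmost match; the default count=0 replaces all.
first_only = re.sub(r'[\w.-]+@[\w.-]+', '<hidden>', text, count=1)

# \1 in the replacement refers back to group 1 (the username).
same_domain = re.sub(r'([\w.-]+)@[\w.-]+', r'\1@example.com', text)

print(first_only)   # <hidden>, b@y.com, c@z.com
print(same_domain)  # a@example.com, b@example.com, c@example.com
```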

###### 5. re.compile(pattern, flags = 0)
If you want to use the same regular expression more than once, you can compile it into a regular expression object, which can be used for matching using its match() and search() methods.

In [34]:
pattern = re.compile(r"wheels")
sequence = "cars and wheels"

In [35]:
pattern.search(sequence).group()

'wheels'
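Compiling pays off when the same pattern is applied many times, for example across a list of lines. The sketch below also passes a flag at compile time; the sample lines are made up.

```python
import re

# Compile once, with a flag, and reuse the pattern object.
word = re.compile(r'wheels', flags=re.IGNORECASE)

lines = ['cars and wheels', 'Wheels of steel', 'no match here']

hits = [line for line in lines if word.search(line)]
found = word.findall('wheels, WHEELS, Wheels')

print(hits)   # ['cars and wheels', 'Wheels of steel']
print(found)  # ['wheels', 'WHEELS', 'Wheels']
```

The compiled object exposes the same operations as the module-level functions (`search()`, `match()`, `findall()`, `sub()`), minus the pattern argument.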

### Groups and Groupings using Regular Expressions

Grouping lets you extract parts of the matched text. Parts of a regular expression pattern enclosed in parentheses () are called groups. The parentheses do not change what the expression matches; they form groups within the matched sequence.

In [36]:
email_address = 'Please contact us at: support@gmail.com'
match = re.search(r'([\w\.-]+)@([\w\.-]+)', email_address)
match

<_sre.SRE_Match object; span=(22, 39), match='support@gmail.com'>

In [37]:
if match:
    print(match.group())   # The whole matched text
    print(match.group(1))  # The username (group 1)
    print(match.group(2))  # The host (group 2)

support@gmail.com
support
gmail.com
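To check the claim that parentheses do not change what is matched, the sketch below runs the same pattern with and without groups, then reads all groups at once with `groups()` and locates one of them with `span()`.

```python
import re

text = 'Please contact us at: support@gmail.com'

plain = re.search(r'[\w.-]+@[\w.-]+', text)
grouped = re.search(r'([\w.-]+)@([\w.-]+)', text)

same_match = plain.group() == grouped.group()  # the overall match is identical
parts = grouped.groups()                       # all groups as a tuple
user_span = grouped.span(1)                    # where group 1 sits in the string

print(same_match)  # True
print(parts)       # ('support', 'gmail.com')
print(user_span)   # (22, 29)
```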


### Greedy vs Non-Greedy matching

When a quantifier matches as much of the search sequence (string) as possible, it is said to be a "greedy match". Adding ? after the quantifier makes it match in a non-greedy or minimal fashion; that is, as few characters as possible will be matched.

In [38]:
txt3 = r'<h1>heading<\h2>'

In [39]:
re.match(r'<.*>',txt3).group()

'<h1>heading<\\h2>'

In [40]:
re.match(r'<.*?>',txt3).group()

'<h1>'
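The difference is even more visible with `re.findall()` on a string that contains several tags. This is a sketch on an invented snippet, not a recommendation to parse HTML with regular expressions.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

greedy = re.findall(r'<.*>', html)  # .* runs to the last '>' in the string
lazy = re.findall(r'<.*?>', html)   # .*? stops at the first '>' each time

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```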