* A Regular Expression (RegEx) is a sequence of characters that defines a search pattern. 
    * For example, "^a...s$"
    * The above code defines a RegEx pattern: any five-character string that starts with a and ends with s.


In [None]:
import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
  print("Search successful")
else:
  print("Search unsuccessful")

In [None]:
# MetaCharacters
# Metacharacters are characters that are interpreted in a special way by a RegEx engine. 
# Here's a list of metacharacters:

# [] . ^ $ * + ? {} () \ |

In [None]:
[] - Square brackets

Square brackets specify a set of characters you wish to match.

Expression	String	Matched?
[abc]	a	1 match
[abc]	ac	2 matches
[abc]	Hey Jude	No match
[abc]	abc de ca	5 matches

Here, [abc] will match if the string you are trying to match contains any of a, b, or c.

You can also specify a range of characters using - inside square brackets.

[a-e] is the same as [abcde].
[1-4] is the same as [1234].
[0-39] is the same as [01239].

You can complement (invert) the character set by using the caret symbol ^ at the start of the square brackets.

[^abc] means any character except a or b or c.
[^0-9] means any non-digit character.
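
The table and the range rules above can be checked with a short sketch. It assumes Python's built-in re module; the names bracket_tests, bracket_counts, same_range and non_digits are illustrative only.

```python
import re

# Count how many characters in each test string belong to the set [abc].
bracket_tests = ["a", "ac", "Hey Jude", "abc de ca"]
bracket_counts = {s: len(re.findall(r"[abc]", s)) for s in bracket_tests}
print(bracket_counts)

# A range and its spelled-out form select the same characters.
all_digits = "0123456789"
same_range = re.findall(r"[0-39]", all_digits) == re.findall(r"[01239]", all_digits)
print(same_range)

# A leading caret inverts the set: only the non-digits are returned.
non_digits = re.findall(r"[^0-9]", "a1b2")
print(non_digits)
```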

In [None]:
+ - Plus

The plus symbol + matches one or more occurrences of the pattern to its left.

Expression	String	Matched?

ma+n	mn	No match (no a character)
ma+n	man	1 match
ma+n	maaan	1 match
ma+n	main	No match (a is not followed by n)
ma+n	woman	1 match
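
As a rough check of the table above, the following sketch runs ma+n against the same strings. The names plus_tests and plus_matches are made up for this example.

```python
import re

# Test strings taken from the table above.
plus_tests = ["mn", "man", "maaan", "main", "woman"]

# An empty list means "No match"; otherwise the matched text is listed.
plus_matches = {s: re.findall(r"ma+n", s) for s in plus_tests}
print(plus_matches)
```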

In [None]:
^ - Caret

The caret symbol ^ is used to check if a string starts with a certain character.

Expression	String	Matched?

^a	a	1 match
^a	abc	1 match
^a	bac	No match
^ab	abc	1 match
^ab	acb	No match (starts with a but not followed by b)
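
The rows above can be reproduced with re.search, which returns a match object or None. This is only a sketch; caret_cases and caret_results are hypothetical names.

```python
import re

# Each pair is (pattern, test string), in the same order as the table.
caret_cases = [
    (r"^a", "a"),
    (r"^a", "abc"),
    (r"^a", "bac"),
    (r"^ab", "abc"),
    (r"^ab", "acb"),
]

# True means the pattern matched at the start of the string.
caret_results = [bool(re.search(p, s)) for p, s in caret_cases]
print(caret_results)
```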

In [None]:
$ - Dollar

The dollar symbol $ is used to check if a string ends with a certain character.

Expression	String	Matched?

a$	a	1 match
a$	formula	1 match
a$	cab	No match
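
A small sketch of the same checks in Python follows. It also shows one detail of Python's re module that the table does not cover: without flags, $ also matches just before a single trailing newline. The names dollar_results and trailing_newline are illustrative.

```python
import re

# True means the string ends with "a".
dollar_results = {s: bool(re.search(r"a$", s)) for s in ["a", "formula", "cab"]}
print(dollar_results)

# $ still matches when the string ends with one newline after the "a".
trailing_newline = bool(re.search(r"a$", "formula\n"))
print(trailing_newline)
```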

In [None]:
? - Question Mark

The question mark symbol ? matches zero or one occurrence of the pattern to its left.

Expression	String	Matched?

ma?n	mn	1 match
ma?n	man	1 match
ma?n	maaan	No match (more than one a character)
ma?n	main	No match (a is not followed by n)
ma?n	woman	1 match
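
The same strings can be run through ma?n as a quick sanity check. The names question_tests and question_matches are assumptions made for this sketch.

```python
import re

# Test strings taken from the table above.
question_tests = ["mn", "man", "maaan", "main", "woman"]

# An empty list means "No match"; otherwise the matched text is listed.
question_matches = {s: re.findall(r"ma?n", s) for s in question_tests}
print(question_matches)
```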

In [None]:
{} - Braces

Consider the quantifier {n,m}. It matches at least n and at most m repetitions of the pattern to its left.

Expression	String	Matched?

a{2,3}	abc dat	No match
a{2,3}	abc daat	1 match (aa in daat)
a{2,3}	aabc daaat	2 matches (aa in aabc and aaa in daaat)
a{2,3}	aabc daaaat	2 matches (aa in aabc and aaa in daaaat)

Let's try one more example. The RegEx [0-9]{2,4} (written without a space after the comma) matches at least 2 but not more than 4 digits.

Expression	String	Matched?

[0-9]{2,4}	ab123csde	1 match (123 in ab123csde)
[0-9]{2,4}	12 and 345673	3 matches (12, 3456, 73)
[0-9]{2,4}	1 and 2	No match
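
Both tables above can be verified with re.findall. This is a sketch under the assumption that Python's re module is used; a_runs and digit_runs are illustrative names.

```python
import re

# a{2,3}: runs of two or three consecutive "a" characters.
a_tests = ["abc dat", "abc daat", "aabc daaat", "aabc daaaat"]
a_runs = {s: re.findall(r"a{2,3}", s) for s in a_tests}
print(a_runs)

# [0-9]{2,4}: runs of two to four digits, matched greedily from the left.
digit_tests = ["ab123csde", "12 and 345673", "1 and 2"]
digit_runs = {s: re.findall(r"[0-9]{2,4}", s) for s in digit_tests}
print(digit_runs)
```

In the last a{2,3} case the engine takes three of the four a characters, and the single a left over is too short to match again.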

In [None]:
| - Alternation

The vertical bar | is used for alternation (the "or" operator).

Expression	String	Matched?

a|b	cde	No match
a|b	ade	1 match (a)
a|b	acdbea	3 matches (a, b, a)

Here, a|b matches any string that contains either a or b.
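
A brief sketch of the alternation table in Python; alt_tests and alt_matches are hypothetical names used only here.

```python
import re

# Test strings taken from the table above.
alt_tests = ["cde", "ade", "acdbea"]

# Each list holds every "a" or "b" found, in left-to-right order.
alt_matches = {s: re.findall(r"a|b", s) for s in alt_tests}
print(alt_matches)
```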

In [None]:
() - Group

Parentheses () are used to group sub-patterns. For example, 
(a|b|c)xz matches any string that contains a, b, or c followed by xz.

Expression	String	Matched?

(a|b|c)xz	ab xz	No match
(a|b|c)xz	abxz	1 match (bxz in abxz)
(a|b|c)xz	axz cabxz	2 matches (axz, and bxz in cabxz)
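
The sketch below checks the table with re.finditer, and also shows a point worth knowing about Python's re.findall: when the pattern contains a group, findall returns the group's text rather than the whole match. The names group_pattern, full_matches and captured are illustrative.

```python
import re

group_pattern = r"(a|b|c)xz"
group_tests = ["ab xz", "abxz", "axz cabxz"]

# group(0) is the whole matched text, including the trailing "xz".
full_matches = {
    s: [m.group(0) for m in re.finditer(group_pattern, s)] for s in group_tests
}
print(full_matches)

# With one capturing group, findall returns only what the group captured.
captured = re.findall(group_pattern, "axz cabxz")
print(captured)
```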

In [None]:
\ - Backslash

Backslash \ is used to escape various characters including all metacharacters. 

For example,

\$a matches if a string contains $ followed by a. Here, $ is not interpreted by a RegEx engine in a special way.

If you are unsure whether a punctuation character has a special meaning, you can put \ in front of it.
This makes sure the character is treated literally. (A backslash followed by a letter or digit can instead form a special sequence such as \d.)
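
A minimal sketch of escaping in Python follows. It uses re.escape from the standard library to add the backslashes automatically; the sample string and the variable names are assumptions made for illustration.

```python
import re

text = "price: $a"

# Escaped: the pattern looks for a literal "$" followed by "a".
escaped_match = re.search(r"\$a", text)
print(escaped_match.group())

# Unescaped: "$" means end of string, so nothing can follow it and match.
unescaped_match = re.search(r"$a", text)
print(unescaped_match)

# re.escape inserts the backslash for us.
escaped_literal = re.escape("$a")
print(escaped_literal)
```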

# Special Sequences

Special sequences make commonly used patterns easier to write. Here's a list of special sequences:

In [None]:
\A - Matches if the specified characters are at the start of a string.

Expression	String	Matched?

\Athe	the sun	Match
\Athe	In the sun	No match
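
The sketch below reproduces the two rows and then contrasts \A with ^ under the re.MULTILINE flag, where ^ also matches at the start of each line but \A does not. The two-line sample string and all variable names are made up for this example.

```python
import re

# True means the string starts with "the".
start_results = {s: bool(re.search(r"\Athe", s)) for s in ["the sun", "In the sun"]}
print(start_results)

# In MULTILINE mode ^ matches after each newline; \A only at the very start.
two_lines = "In\nthe sun"
caret_multiline = re.findall(r"^the", two_lines, flags=re.MULTILINE)
a_multiline = re.findall(r"\Athe", two_lines, flags=re.MULTILINE)
print(caret_multiline, a_multiline)
```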

In [None]:
\b - Matches if the specified characters are at the beginning or end of a word.

Expression	String	Matched?

\bfoo	football	Match
\bfoo	a football	Match
\bfoo	afootball	No match
foo\b	the foo	Match
foo\b	the afoo test	Match
foo\b	the afootest	No match
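
The following sketch checks the table and highlights why raw strings matter here: in an ordinary Python string literal, "\b" is the backspace character, not a word boundary. The names word_start, word_end and no_raw are illustrative.

```python
import re

# \bfoo: "foo" must begin at a word boundary.
word_start = {
    s: bool(re.search(r"\bfoo", s)) for s in ["football", "a football", "afootball"]
}
print(word_start)

# foo\b: "foo" must end at a word boundary.
word_end = {
    s: bool(re.search(r"foo\b", s)) for s in ["the foo", "the afoo test", "the afootest"]
}
print(word_end)

# Without the r prefix the pattern starts with a backspace, so nothing matches.
no_raw = re.search("\bfoo", "football")
print(no_raw)
```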

In [None]:
\B - Opposite of \b. Matches if the specified characters are not at the beginning or end of a word.

Expression	String	Matched?

\Bfoo	football	No match
\Bfoo	a football	No match
\Bfoo	afootball	Match
foo\B	the foo	No match
foo\B	the afoo test	No match
foo\B	the afootest	Match
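
A short sketch of the \B rows, using the same test strings; inside_start and inside_end are hypothetical names.

```python
import re

# \Bfoo: "foo" must NOT begin at a word boundary.
inside_start = {
    s: bool(re.search(r"\Bfoo", s)) for s in ["football", "a football", "afootball"]
}
print(inside_start)

# foo\B: "foo" must NOT end at a word boundary.
inside_end = {
    s: bool(re.search(r"foo\B", s)) for s in ["the foo", "the afoo test", "the afootest"]
}
print(inside_end)
```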

In [None]:
\d - Matches any decimal digit. Equivalent to [0-9]

Expression	String	Matched?

\d	12abc3	3 matches (1, 2, 3)
\d	Python	No match

In [None]:
\D - Matches any character that is not a decimal digit. Equivalent to [^0-9]

Expression	String	Matched?

\D	1ab34"50	3 matches (a, b, ")
\D	1345	No match
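
The sketch below checks \d and \D against the table strings. It also illustrates a caveat for Python 3 string patterns: \d matches Unicode decimal digits as well, so the equivalence with [0-9] holds exactly only when the re.ASCII flag is set. The Arabic-Indic digit used here is just a sample character, and the variable names are illustrative.

```python
import re

digits = re.findall(r"\d", "12abc3")
print(digits)

non_digits = re.findall(r"\D", '1ab34"50')
print(non_digits)

# U+0663 is the Arabic-Indic digit three.
unicode_digit = re.findall(r"\d", "\u0663")
ascii_only = re.findall(r"\d", "\u0663", flags=re.ASCII)
print(len(unicode_digit), len(ascii_only))
```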

In [None]:
\s - Matches where a string contains any whitespace character. Equivalent to [ \t\n\r\f\v].

Expression	String	Matched?

\s	Python RegEx	1 match
\s	PythonRegEx	No match
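
As a quick illustration, the sketch below counts whitespace matches for the two table strings and shows that tabs and newlines count as whitespace too. The string "a\tb\nc d" and the variable names are assumptions for this example.

```python
import re

# Number of whitespace characters found in each table string.
space_counts = {s: len(re.findall(r"\s", s)) for s in ["Python RegEx", "PythonRegEx"]}
print(space_counts)

# Tab, newline and space are all matched by \s.
mixed = re.findall(r"\s", "a\tb\nc d")
print(mixed)
```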

In [None]:
\S - Matches where a string contains any non-whitespace character. Equivalent to [^ \t\n\r\f\v].

Expression	String	Matched?

\S	a b	2 matches (a, b)
\S	   	No match

In [None]:
\w - Matches any alphanumeric character (letters and digits) or an underscore _. Equivalent to [a-zA-Z0-9_].

Expression	String	Matched?

\w	12&": ;c 	3 matches (1, 2, c)
\w	%"> !	No match

In [None]:
\W - Matches any non-alphanumeric character. Equivalent to [^a-zA-Z0-9_]

Expression	String	Matched?

\W	1a2%c	1 match (%)
\W	Python	No match
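
The \w and \W tables can be checked together with a small sketch. The string "snake_case" is an extra sample added to show that the underscore counts as a word character; all variable names are illustrative.

```python
import re

word_chars = re.findall(r"\w", '12&": ;c ')
print(word_chars)

no_word_chars = re.findall(r"\w", '%"> !')
print(no_word_chars)

non_word_chars = re.findall(r"\W", "1a2%c")
print(non_word_chars)

# Every character, including the underscore, is matched by \w.
underscore = "".join(re.findall(r"\w", "snake_case"))
print(underscore)
```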

In [None]:
\Z - Matches if the specified characters are at the end of a string.

Expression	String	Matched?

Python\Z	I like Python	1 match
Python\Z	I like Python Programming	No match
Python\Z	Python is fun.	No match
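
The sketch below reproduces the three rows and then compares \Z with $ in Python: $ tolerates a single trailing newline, while \Z matches only at the very end of the string. The sample with a trailing newline and the variable names are assumptions for illustration.

```python
import re

end_tests = ["I like Python", "I like Python Programming", "Python is fun."]
end_results = {s: bool(re.search(r"Python\Z", s)) for s in end_tests}
print(end_results)

# With one trailing newline, $ still matches but \Z does not.
with_newline = "I like Python\n"
dollar_match = bool(re.search(r"Python$", with_newline))
z_match = bool(re.search(r"Python\Z", with_newline))
print(dollar_match, z_match)
```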

### re.split()
* The re.split method splits the string where there is a match and returns a list of strings where the splits have occurred.

In [None]:
txt = "The rain in Spain"
x = re.split(r"\s", txt)  # raw string keeps the backslash for the RegEx engine
print(x)

In [None]:
# initializing string 
data = "GeeksforGeeks, is_an-awesome ! website"
  
# printing original string   
print("The original string is : " + data)  
  
# Using re.split()  
# Splitting characters in String  
res = re.split(', |_|-|!', data) 
  
# printing result   
print("The list after performing split functionality : " + str(res))

In [None]:
import re

string = 'Twelve:12 Eighty nine:89.'
pattern = r'\d+'

result = re.split(pattern, string) 
print(result)

In [None]:
# maxsplit=1: split only at the first occurrence
result = re.split(pattern, string, maxsplit=1)
print(result)
# The default value of maxsplit is 0, meaning all possible splits.

In [None]:
# r'\W+' matches one or more non-alphanumeric characters
# The string is split at each such run (here ',', ' ' and "'")
print(re.split(r'\W+', 'Words, words , Words'))
print(re.split(r'\W+', "Word's words Words"))

In [None]:
# Here ':', ' ' and ',' are not alphanumeric, so splitting occurs at them
print(re.split(r'\W+', 'On 12th Jan 2016, at 11:02 AM'))

# r'\d+' matches one or more digits
# Splitting occurs at '12', '2016', '11', '02' only
print(re.split(r'\d+', 'On 12th Jan 2016, at 11:02 AM'))

### re.findall()
* The re.findall() method returns all non-overlapping matches of the pattern in the string, as a list of strings.
* The string is scanned left to right, and matches are returned in the order found.

In [None]:
string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'

result = re.findall(pattern, string) 
print(result)

In [None]:
string  = """Hello my Number is 123456789 and 
             my friend's number is 987654321"""
  
# A sample regular expression to find digits. 
regex = r'\d+'
  
match = re.findall(regex, string) 
print(match) 

In [None]:
# initializing string
data = "This, is - another : example?!"

# printing original string
print("The original string is : " + data)

# Using re.findall()
# Extracting the words from the string
res = re.findall(r"\w+", data)

# printing result
print("The list of words found : " + str(res))

### re.sub()
* re.sub(pattern, replace, string)
* The method returns a string where matched occurrences are replaced with the content of the replace variable.

In [None]:
# Program to remove all whitespace
import re

# multiline string
string = 'abc 12 de 23 \n f45 6'

# matches one or more whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.sub(pattern, replace, string) 
print(new_string)

In [None]:
# single-line string
string = 'abc 12 de 23 f45 6'

# matches one or more whitespace characters
pattern = r'\s+'
replace = ''

# count=1 replaces only the first match
new_string = re.sub(pattern, replace, string, count=1)
print(new_string)

### re.subn()
* re.subn() is similar to re.sub(), except that it returns a tuple of two items: the new string and the number of substitutions made.

In [None]:
# multiline string
string = 'abc 12\
de 23 \n f45 6'

# matches one or more whitespace characters
pattern = r'\s+'

# empty string
replace = ''

new_string = re.subn(pattern, replace, string) 
print(new_string)

### re.search()
* The re.search() method takes two arguments: a pattern and a string. The method looks for the first location where the RegEx pattern produces a match with the string.

* If the search is successful, re.search() returns a match object; if not, it returns None.

In [None]:
string = "Python is fun"

# check if 'Python' is at the beginning
match = re.search(r'\APython', string)

if match:
  print("pattern found inside the string")
else:
  print("pattern not found")  

In [None]:
txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)

In [None]:
txt = "The rain in Spain"
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())

In [None]:
txt = "The rain in Spain"
x = re.search(r"\s", txt)

# re.search() finds only the first match; end() is the index just past it
print("The first white-space character ends at position:", x.end())