___

<a href='https://www.udemy.com/user/joseportilla/'><img src='../Pierian_Data_Logo.png'/></a>
___
<center><em>Content Copyright by Pierian Data</em></center>

In this tutorial, you’ll learn about RegEx and understand various regular expressions.

1. Regular Expressions
2. Why Regular Expressions
3. Basic Regular Expressions
4. More Regular Expressions
5. Compiled Regular Expressions

A RegEx is a powerful tool for matching text against a predefined pattern. It can detect the presence or absence of a piece of text by matching it with a particular pattern, and it can also break a match into one or more sub-patterns. The Python standard library provides the re module for regular expressions. Its primary function is re.search(), which takes a regular expression and a string and returns either the first match or None.

# Overview of Regular Expressions

Regular expressions (sometimes called regex for short) allow a user to search for strings using almost any sort of rule they can come up with. For example, finding all capital letters in a string, or finding a phone number in a document. 

Regular expressions are notorious for their seemingly strange syntax. This strange syntax is a byproduct of their flexibility. Regular expressions have to be able to describe any string pattern you can imagine, which is why they have a complex pattern format.

Let's begin by explaining how to search for basic patterns in a string!

## Searching for Basic Patterns

Let's imagine that we have the following string:

In [1]:
text = "The person's phone number is 408-555-1234. Call soon!"

We'll start off by trying to find out if the string "phone" is inside the text string. Now we could quickly do this with:

In [2]:
'phone' in text

True

But let's show the format for regular expressions, because later on we will be searching for patterns that won't have such a simple solution.

In [3]:
import re

In [4]:
pattern = 'phone'

In [5]:
re.search(pattern,text)

<re.Match object; span=(13, 18), match='phone'>

In [6]:
pattern = "NOT IN TEXT"

In [7]:
re.search(pattern,text)

Now we've seen that re.search() takes the pattern, scans the text, and returns a Match object. If the pattern is not found, None is returned (in Jupyter Notebook this just means that nothing is output below the cell).
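Because re.search() returns None when nothing matches, calling a method such as .span() directly on the result can raise an AttributeError. A minimal sketch of guarding against that follows; the helper name find_span is hypothetical and introduced only for illustration.

```python
import re

def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None if there is none."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.span()

text = "The person's phone number is 408-555-1234. Call soon!"
print(find_span("phone", text))        # (13, 18)
print(find_span("NOT IN TEXT", text))  # None
```

Checking for None before using the Match object is the usual pattern whenever the input text is not under your control.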

Let's take a closer look at this Match object.

In [10]:
pattern = 'phone'

In [13]:
match = re.search(pattern,text)

In [14]:
match

<re.Match object; span=(13, 18), match='phone'>

Notice the span. The Match object also provides the start and end index separately.

In [15]:
match.span()

(13, 18)

In [16]:
match.start()

13

In [17]:
match.end()

18

But what if the pattern occurs more than once?

In [18]:
text = "my phone is a new phone"

In [19]:
match = re.search("phone",text)

In [20]:
match.span()

(3, 8)

Notice it only matches the first instance. If we want a list of all matches, we can use the re.findall() function:

In [21]:
matches = re.findall("phone",text)

In [27]:
matches

['phone', 'phone']

In [23]:
len(matches)

2

To get actual match objects, use the iterator:

In [29]:
for match in re.finditer("phone",text):
    print(match.span())

(3, 8)
(18, 23)


If you want the actual text that matched, you can use the .group() method.

In [30]:
match.group()

'phone'

In [34]:
import re


match = re.search('portal', 'GeeksforGeeks: A computer science \
				portal for geeks')
print(match)
print(match.group())

print('Start Index:', match.start())
print('End Index:', match.end())


<re.Match object; span=(38, 44), match='portal'>
portal
Start Index: 38
End Index: 44


Regex patterns are usually written as raw strings, for example r'portal'. The r prefix stands for raw, not RegEx. A raw string differs from a regular string in that Python does not interpret the \ character as an escape character. This matters because the regular expression engine uses the \ character for its own escaping purposes.
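The difference between a raw string and a regular string is easiest to see with \b, which Python reads as a backspace character but the regex engine reads as a word boundary. The sample sentence below is made up for illustration.

```python
import re

# In a regular string, '\b' is a single backspace character.
# In a raw string, r'\b' is a backslash followed by 'b' (two characters),
# which the regex engine interprets as a word boundary.
print(len('\b'), len(r'\b'))  # 1 2

sentence = 'the cat sat'
print(re.search('\bcat\b', sentence))   # None: the pattern contains literal backspaces
print(re.search(r'\bcat\b', sentence))  # matches 'cat' at span (4, 7)
```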


Before going further with the Python re module, let’s see how to actually write a RegEx using metacharacters and special sequences. 

**MetaCharacters**

Metacharacters are characters that carry a special meaning inside a pattern, and they are used throughout the functions of the re module. Below is the list of metacharacters.

MetaCharacters	Description

    \     Drops the special meaning of the character following it
    []    Represents a character class
    ^     Matches the beginning of the string
    $     Matches the end of the string
    .     Matches any character except newline
    |     Means OR (matches any of the alternatives separated by it)
    ?     Matches zero or one occurrence
    *     Matches any number of occurrences (including 0 occurrences)
    +     Matches one or more occurrences
    {}    Indicates the number of occurrences of the preceding RegEx to match
    ()    Encloses a group of RegEx
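To make the table concrete, here is a small sketch that tries several of the metacharacters against one made-up sample string. The string and the choice of patterns are assumptions for illustration only.

```python
import re

samples = 'cat cot coat ct c.t'

print(re.findall(r'c.t', samples))      # . matches any one character: ['cat', 'cot', 'c.t']
print(re.findall(r'c\.t', samples))     # \. matches a literal dot: ['c.t']
print(re.findall(r'c[ao]t', samples))   # [] picks one character from the set: ['cat', 'cot']
print(re.findall(r'co?a?t', samples))   # ? makes the preceding character optional
print(re.findall(r'cat|cot', samples))  # | matches either alternative: ['cat', 'cot']
print(re.findall(r'^cat', samples))     # ^ anchors at the start of the string: ['cat']
print(re.findall(r'c\.t$', samples))    # $ anchors at the end of the string: ['c.t']
```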

The group() method returns the matching string, and the start() and end() methods provide the starting and ending index of the match. The Match object has other methods as well, which we will discuss later.

**Why RegEx?**

Let’s take a moment to understand why we should use regular expressions.

1. Data Mining: Regular expressions are well suited to data mining. They identify a piece of text in a heap of text by checking it against a predefined pattern. Some common scenarios are identifying an email address, URL, or phone number in a pile of text.
2. Data Validation: Regular expressions can validate data against a required format. Different patterns can cover a wide array of validation checks. A few examples are validating phone numbers, emails, etc.
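The two use cases above can be sketched as follows. The patterns are deliberately simplified assumptions, not production-grade validators, and the names EMAIL_PATTERN, PHONE_PATTERN, and is_valid_phone are hypothetical.

```python
import re

# Deliberately simplified patterns, assumed for illustration only;
# real email addresses and phone numbers come in many more formats.
EMAIL_PATTERN = r'[\w.+-]+@[\w-]+\.[\w.-]+'
PHONE_PATTERN = r'\d{3}-\d{3}-\d{4}'

document = "Contact jane@example.com or call 408-555-1234 before Friday."

# Data mining: pull every match out of a larger text.
emails = re.findall(EMAIL_PATTERN, document)
phones = re.findall(PHONE_PATTERN, document)
print(emails)  # ['jane@example.com']
print(phones)  # ['408-555-1234']

# Data validation: the whole input must fit the pattern, so use fullmatch.
def is_valid_phone(value):
    return re.fullmatch(PHONE_PATTERN, value) is not None

print(is_valid_phone('408-555-1234'))       # True
print(is_valid_phone('call 408-555-1234'))  # False
```

Mining uses re.findall() because matches may sit anywhere in the text, while validation uses re.fullmatch() so that extra characters around the value cause a rejection.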
 
**Basic RegEx**

Let’s understand some of the basic regular expressions. They are as follows:

1. Character Classes
2. Ranges
3. Negation
4. Shortcuts
5. Beginning and End of String
6. Any Character

**Character Classes**

A character class matches a single character against a set of possible characters. You specify a character class within square brackets. Let’s consider an example with case-sensitive words. 



In [43]:
import re


print(re.findall(r'[DHF]OG', 'GeeksforGeeks:A computer Teeks DOG science portal heeks for geeks'))    

['DOG']


**Ranges**

A range provides the flexibility to match text against a span of characters, such as a range of numbers (0 to 9), a range of letters (A to Z), and so on. The hyphen character within the character class represents a range.

In [48]:
import re


print('Range',re.search(r'[a-zA-Z]', 'q'))


Range <re.Match object; span=(0, 1), match='q'>


**Negation**

Negation inverts a character class. Placing ^ as the first character inside the brackets makes the class match any character except the characters or ranges listed in it.

In [50]:
import re

print(re.search(r'[^a-z]', 'd'))


None


In the above case, we have inverted the character class that ranges from a to z. If we try to match a character within the mentioned range, the regular expression engine returns None.

Let’s consider another example:

In [53]:
import re

print(re.search(r'G[^e]', 'Gpeks'))


<re.Match object; span=(0, 2), match='Gp'>


Here it accepts any other character that follows G, other than e.

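List of special sequences 

<table ><tr><th>Special Sequence</th><th>Description</th><th>Example Pattern</th><th >Example Matches</th></tr>

<tr ><td><span >\A</span></td><td>Matches only at the beginning of the string</td><td>\Afor</td><td>for geeks, for the world</td></tr>

<tr ><td><span >\b</span></td><td>Matches at a word boundary. \b(string) checks for the beginning of a word and (string)\b checks for the end of a word.</td><td>\bge</td><td>geeks, get</td></tr>

<tr ><td><span >\B</span></td><td>The opposite of \b: matches only where there is no word boundary</td><td>\Bge</td><td>together, forge</td></tr>

<tr ><td><span >\d</span></td><td>Matches any decimal digit; for ASCII text this is equivalent to the class [0-9]</td><td>\d</td><td>123, gee1</td></tr>

<tr ><td><span >\D</span></td><td>Matches any non-digit character; for ASCII text this is equivalent to the class [^0-9]</td><td>\D</td><td>geeks, geek1</td></tr>

<tr ><td><span >\s</span></td><td>Matches any whitespace character</td><td>\s</td><td>gee ks, a bc a</td></tr>

<tr ><td><span >\S</span></td><td>Matches any non-whitespace character</td><td>\S</td><td>a bd, abcd</td></tr>

<tr ><td><span >\w</span></td><td>Matches any alphanumeric character or underscore; for ASCII text this is equivalent to the class [a-zA-Z0-9_]</td><td>\w</td><td>123, geeKs4</td></tr>

<tr ><td><span >\W</span></td><td>Matches any character that is not alphanumeric or an underscore</td><td>\W</td><td>&gt;$, gee&lt;&gt;</td></tr>

<tr ><td><span >\Z</span></td><td>Matches only at the end of the string</td><td>ab\Z</td><td>abcdab, abababab</td></tr></table>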



In [58]:
for x in re.finditer(r'\AGe','Geeks  Gegdgd'):
    print(x.span())

(0, 2)
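A few more of the special sequences from the table can be tried the same way. This is a small sketch on a made-up string, not an exhaustive tour.

```python
import re

text = 'geeks together forge 42'

print(re.findall(r'\bge', text))  # 'ge' at the start of a word: ['ge'] (from 'geeks')
print(re.findall(r'\Bge', text))  # 'ge' not at the start of a word: ['ge', 'ge']
print(re.findall(r'\d', text))    # digits: ['4', '2']
print(re.findall(r'\s', text))    # whitespace: [' ', ' ', ' ']
print(re.findall(r'\w+', text))   # runs of word characters: ['geeks', 'together', 'forge', '42']
```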


**Shortcuts**

Let’s discuss some of the shortcuts provided by the regular expression engine.

    \w – matches a word character
    \d – matches digit character
    \s – matches whitespace character (space, tab, newline, etc.)
    \b – matches a word boundary (a zero-length position, not a character)

In [63]:
import re


print('Geeks:', re.search(r'\bGeeks', 'dfdgdg ceeks  '))
print('GeeksforGeeks:', re.search(r'\bGeeks\b', 'Geeks forGeeks'))


Geeks: None
GeeksforGeeks: <re.Match object; span=(0, 5), match='Geeks'>


**Beginning and End of String**

The ^ character anchors a match to the beginning of a string and the $ character anchors it to the end of a string.

In [66]:
import re


# Beginning of String
match = re.search(r'^Geek', 'This is Geek of the month')
print('Beg. of String:', match)

match = re.search(r'^Geek', 'Geek of the month')
print('Beg. of String:', match)

# End of String
match = re.search(r'Geeks$', 'Compute science portal-GeeksforGeeks')
print('End of String:', match)


Beg. of String: None
Beg. of String: <re.Match object; span=(0, 4), match='Geek'>
End of String: <re.Match object; span=(31, 36), match='Geeks'>


**Any Character**

The . character matches any single character except a newline. Inside a bracketed character class it loses this special meaning and matches a literal period.

In [70]:
import re

print('Any Character', re.search(r'p..th.n', 'pbfthon'))


Any Character <re.Match object; span=(0, 7), match='pbfthon'>


**More RegEx**
   
Some of the other regular expressions are as follows:

1. Optional Characters
2. Repetition
3. Shorthand
4. Grouping
5. Lookahead
6. Substitution

**Optional Characters**

The regular expression engine lets you mark a character as optional using the ? character. The preceding character or character class may then appear once or not at all. Let’s consider the example of a word with an alternative spelling: color or colour.

In [72]:
import re


print('Color',re.search(r'col?ou?r', 'coor'))
print('Colour',re.search(r'colou?r', 'colour'))


Color <re.Match object; span=(0, 4), match='coor'>
Colour <re.Match object; span=(0, 6), match='colour'>


**Repetition**

Repetition enables you to repeat the same character or character class a given number of times. Consider an example of a date that consists of day, month, and year. Let’s use a regular expression to identify a date written as dd-mm-yyyy, where the month may have one or two digits.

In [74]:
import re


print('Date{dd-mm-yyyy}:', re.search(r'[\d]{2}-[\d]{1,2}-[\d]{4}',
                                     ' gfhfhff  19-9-2020'))


Date{dd-mm-yyyy}: <re.Match object; span=(10, 19), match='19-9-2020'>


Here, the regular expression engine checks for two consecutive digits. Upon finding them, it moves to the hyphen character. After that, it checks for the next one or two digits, another hyphen, and finally four consecutive digits.  

Let’s discuss three other regular expressions under repetition.

**Repetition ranges**

A repetition range is useful when more than one length is acceptable. Consider a scenario where both three-digit and four-digit numbers are accepted, which {3,4} expresses. Let’s have a look at the regular expression.

In [85]:
import re


print('Four to Six Digits:', re.search(r'[\d]{4,6}', '123489'))
print('Three to Four Digits:', re.search(r'[\d]{3,4}', '2145'))


Four to Six Digits: <re.Match object; span=(0, 6), match='123489'>
Three to Four Digits: <re.Match object; span=(0, 4), match='2145'>


**Open-Ended Ranges**

There are scenarios where there is no upper limit on how often a character repeats. In such scenarios, you can leave the upper limit open, as in {1,}. A common example is matching the number in a street address. Let’s have a look:

In [87]:
import re


print(re.search(r'[\d]{1,}','5345353535666666666666666666th Floor, A-118,\
Sector-136, Noida, Uttar Pradesh - 201305'))


<re.Match object; span=(0, 28), match='5345353535666666666666666666'>


**Shorthand**

Shorthand characters allow you to use the + character to specify one or more ({1,}) and the * character to specify zero or more ({0,}).

In [89]:
import re

print(re.search(r'[\d]+', '5456666645675675th Floor, A-118,\
Sector-136, Noida, Uttar Pradesh - 201305'))


<re.Match object; span=(0, 16), match='5456666645675675'>
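The practical difference between + and * shows up when the text contains no digits at all. A minimal sketch, using made-up strings:

```python
import re

# '+' requires at least one digit; '*' also accepts zero digits.
print(re.search(r'\d+', 'Floor A'))   # None: there is no digit to match
print(re.search(r'\d*', 'Floor A'))   # an empty match at span (0, 0)
print(re.search(r'\d+', 'Floor 12'))  # matches '12' at span (6, 8)
```

Because \d* can match the empty string, it succeeds at position 0 of any input, which is rarely what you want when searching.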


**Grouping**

Grouping is the process of separating an expression into groups by using parentheses, and it allows you to fetch each individual matching group.  

In [90]:
import re


grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})', 'This is todays date 26-08-2020')
print(grp)


<re.Match object; span=(20, 30), match='26-08-2020'>


Let’s see some of its functionality.

**Return the entire match**

The re module allows you to return the entire match using the group() method.

In [69]:
import re


grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','This is todays date 26-08-2020 and yesterday was 25-08-2020')
print(grp.group())


26-08-2020


**Return a tuple of matched groups**

You can use the groups() method to return a tuple that holds the individual matched groups.

In [91]:
import re


grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020')
print(grp.groups())


('26', '08', '2020')
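Since groups() returns a plain tuple, it can be unpacked straight into variables. The sketch below assumes the date is written day-month-year and rearranges it into year-month-day order; the variable names are illustrative.

```python
import re

match = re.search(r'(\d{2})-(\d{2})-(\d{4})', 'This is todays date 26-08-2020')
if match:
    # Assumption: the three groups are day, month, and year, in that order.
    day, month, year = match.groups()
    iso_date = f'{year}-{month}-{day}'
    print(iso_date)  # 2020-08-26
```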


**Retrieve a single group**

Passing an index to the group() method retrieves just a single group.

In [92]:
import re


grp = re.search(r'([\d]{2})-([\d]{2})-([\d]{4})','26-08-2020')
print(grp.group(2))


08
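The list of topics earlier also mentions lookahead, which has no example of its own here. As a hedged sketch: a lookahead, written (?=...), checks that some text follows the current position without making that text part of the match. The sentence below is made up for illustration.

```python
import re

sentence = 'I paid 30 dollars for 2 tickets'

# Match a number only when it is followed by ' dollars';
# the lookahead text itself is not included in the result.
amounts = re.findall(r'\d+(?= dollars)', sentence)
print(amounts)  # ['30']
```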


# Patterns

So far we've learned how to search for a basic string. What about more complex examples, such as trying to find a telephone number in a large string of text? Or an email address?

We could just use the search method if we know the exact phone number or email, but what if we don't know it? We may know the general format, and we can use that along with regular expressions to search the document for strings that match a particular pattern.

This is where the syntax may appear strange at first, but take your time with this; often it's just a matter of looking up the pattern code.

Let's begin!

## Identifiers for Characters in Patterns

Character types such as a digit or a whitespace character have different codes that represent them. You can use these to build up a pattern string. Notice how these make heavy use of the backslash \ . Because of this, when defining a pattern string for a regular expression we use the format:

    r'mypattern'
    
Placing the r in front of the string tells Python that the \ characters in the pattern string are not meant to be escape slashes.

Below you can find a table of the most common identifiers:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>

For example:

In [93]:
text = "My telephone number is 408-555-1234"

In [94]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)

In [95]:
phone.group()

'408-555-1234'

Notice the repetition of \d. That is a bit of an annoyance, especially if we are looking for very long strings of numbers. Let's explore the possible quantifiers.

## Quantifiers

Now that we know the special character designations, we can use them along with quantifiers to define how many we expect.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >\*</span></td><td>Occurs zero or more times</td><td>A\*B\*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>

Let's rewrite our pattern using these quantifiers:

In [96]:
re.search(r'\d{3}-\d{3}-\d{4}',text)

<re.Match object; span=(23, 35), match='408-555-1234'>

## Groups

What if we wanted to do two tasks: find phone numbers, and also quickly extract their area code (the first three digits)? We can use groups for any general task that involves grouping together regular expressions (so that we can later break them down). 

Using the phone number example, we can separate groups of regular expressions using parentheses:

In [98]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [99]:
results = re.search(phone_pattern,text)

In [100]:
# The entire result
results.group()

'408-555-1234'

In [101]:
# Can then also call by group position.
# Remember, groups were separated by parentheses ()
# Something to note is that group ordering starts at 1. Passing in 0 returns everything
results.group(3)

'1234'

In [13]:
results.group(1)

'408'

In [102]:
# We only had three groups of parentheses
results.group(4)

IndexError: no such group
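The object returned by re.compile() has its own search(), findall(), and finditer() methods, so a compiled pattern can be reused without passing it back into the re module functions. A minimal sketch, using a made-up text with two phone numbers:

```python
import re

phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
contacts = "Office: 408-555-1234, mobile: 415-555-9876"

# Call the methods directly on the compiled pattern.
first = phone_pattern.search(contacts)
print(first.group(1))  # 408

# Collect the area code (group 1) of every phone number in the text.
area_codes = [m.group(1) for m in phone_pattern.finditer(contacts)]
print(area_codes)  # ['408', '415']

# With groups in the pattern, findall returns one tuple per match.
print(phone_pattern.findall(contacts))
```

Compiling once is mainly a convenience when the same pattern is applied many times.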

## Additional Regex Syntax

### Or operator |

Use the pipe operator to write an **or** statement. For example:

In [103]:
re.search(r"man|woman","This man was here.")

<re.Match object; span=(5, 8), match='man'>

In [104]:
re.search(r"man|woman","This woman was here.")

<re.Match object; span=(5, 10), match='woman'>

### The Wildcard Character

Use a "wildcard" as a placeholder that will match any character placed there. You can use a simple period **.** for this. For example:

In [105]:
re.findall(r".at","The cat in the hat sat here.")

['cat', 'hat', 'sat']

In [36]:
re.findall(r".at","The bat went splat")

['bat', 'lat']

Notice how we only matched three characters of "splat". That is because we need a **.** for each wildcard letter, or we can use the quantifiers described above to set our own rules.

In [77]:
re.findall(r"...at","The bat went splat")

['e bat', 'splat']

However, this still has the problem of grabbing extra characters before the word. Really we only want words that end with "at".

In [38]:
# One or more non-whitespace that ends with 'at'
re.findall(r'\S+at',"The bat went splat")

['bat', 'splat']

### Starts with and Ends With

We can use the **^** to signal starts with, and the **$** to signal ends with:

In [78]:
# Ends with a number
re.findall(r'\d$','This 123 ends with a number 2')

['2']

In [106]:
# Starts with a number
re.findall(r'^[\d]+','3233 is the loneliest number.')

['3233']

Note that this is for the entire string, not individual words!
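A related detail, offered here as a hedged aside: if the string spans several lines, the re.MULTILINE flag makes ^ and $ apply to each line rather than only to the whole string. The sample text is made up for illustration.

```python
import re

lines = "3 apples\nno digits here\n7 pears"

# By default ^ matches only at the very start of the string.
print(re.findall(r'^\d', lines))                # ['3']

# With re.MULTILINE, ^ also matches at the start of every line.
print(re.findall(r'^\d', lines, re.MULTILINE))  # ['3', '7']
```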

### Exclusion

To exclude characters, we can use the **^** symbol inside a set of brackets **[]**, placed directly after the opening bracket. Any character listed inside the brackets is excluded. For example:

In [108]:
phrase = "there are 3 numbers 34 inside 5 this sentence."

In [109]:
re.findall(r'[^\d]',phrase)

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 '.']

To get the words back together, use a + sign:

In [110]:
re.findall(r'[^\d]+',phrase)

['there are ', ' numbers ', ' inside ', ' this sentence.']

We can use this to remove punctuation from a sentence.

In [112]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

In [113]:
re.findall('[^!.? ]+',test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In [114]:
clean = ' '.join(re.findall('[^!.? ]+',test_phrase))

In [115]:
clean

'This is a string But it has punctuation How can we remove it'
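Substitution, listed among the topics earlier, offers a more direct route to the same result. As a sketch, re.sub() replaces every match of a pattern with a replacement string, here the empty string:

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Replace each '!', '.', or '?' with nothing.
clean = re.sub(r'[!.?]', '', test_phrase)
print(clean)  # This is a string But it has punctuation How can we remove it
```

Unlike the findall-and-join approach, re.sub() leaves the original spacing untouched, which matters if the text contains runs of several spaces.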

## Brackets for Grouping

As we showed above, we can use brackets to group together options. For example, if we wanted to find hyphenated words:

In [34]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'

In [116]:
re.findall(r'[\w]+-[\w]+',text)

['hyphen-words', 'long-ish']

## Parentheses for Multiple Options

If we have multiple options for matching, we can use parentheses to list out these options. For example:

In [37]:
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"
url = "http\\:www.google.com"

In [38]:
re.search(r'cat(fish|nap|claw)',text)

<re.Match object; span=(27, 34), match='catfish'>

In [39]:
re.search(r'cat(fish|nap|claw)',texttwo)

<re.Match object; span=(32, 38), match='catnap'>

In [40]:
# None returned
re.search(r'cat(fish|nap|claw)',textthree)
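Because the options sit inside parentheses, they also form a group, so group(1) reports which option matched. A minimal sketch that runs the same pattern over all three sentences and handles the no-match case:

```python
import re

pattern = r'cat(fish|nap|claw)'
sentences = ['Hello, would you like some catfish?',
             'Hello, would you like to take a catnap?',
             'Hello, have you seen this caterpillar?']

suffixes = []
for sentence in sentences:
    match = re.search(pattern, sentence)
    # group(1) is the option that matched; None when there was no match at all.
    suffixes.append(match.group(1) if match else None)

print(suffixes)  # ['fish', 'nap', None]
```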

### Conclusion

Excellent work! For full information on all possible patterns, check out: https://docs.python.org/3/howto/regex.html

____