In [1]:
import re

### 1. Compiling Regular Expressions
Regular expressions are compiled into Pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.

re.compile(pattern, flags=0)
Compile a regular expression pattern, returning a pattern object.

The regular expression is passed to re.compile() as a string.
Regular expressions are handled as strings because regular expressions aren’t part of the core Python language, and no special syntax was created for expressing them.

Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C.

In [2]:
pattern = re.compile("hello")
pattern

re.compile(r'hello', re.UNICODE)

re.compile() also accepts an optional flags argument, used to enable various special features and syntax variations.

In the example below, we use the flag re.I (short for re.IGNORECASE) to ignore letter case in the regex pattern.

In [3]:
pattern = re.compile("hello",flags=re.I)

pattern

re.compile(r'hello', re.IGNORECASE|re.UNICODE)

In [4]:
# Here flags=re.I makes the pattern match "hello" in any letter case (HELLO, Hello, etc.)
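
To make the effect of the flag concrete, here is a minimal sketch that compares a case-sensitive pattern with the same pattern compiled using re.I. The names `strict`, `relaxed` and `samples` are illustrative and not part of the original notebook.

```python
import re

# Case-sensitive pattern versus the same pattern compiled with re.I
strict = re.compile("hello")
relaxed = re.compile("hello", flags=re.I)

samples = ["hello", "Hello", "HELLO"]  # illustrative inputs

# match() returns a Match object or None, so bool() gives a simple yes/no
strict_hits = [bool(strict.match(s)) for s in samples]
relaxed_hits = [bool(relaxed.match(s)) for s in samples]

print(strict_hits)   # [True, False, False]
print(relaxed_hits)  # [True, True, True]
```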

### 2. Performing Matches
So far, we have created a Pattern object representing a compiled regular expression using the re.compile() function.

Pattern objects have several methods and attributes.

Here is the list of different methods used for performing matches:

| Method/Attribute | Purpose |
|---|---|
| match() | Determine if the RE matches at the beginning of the string. |
| search() | Scan through a string, looking for any location where this RE matches. |
| findall() | Find all substrings where the RE matches, and return them as a list. |
| finditer() | Find all substrings where the RE matches, and return them as an iterator. |

Let us go through them one by one:

### 2.1 : match(string[, pos[, endpos]])

* A match is checked only at the beginning (by default).

* Checking starts from pos index of the string. (default is 0)

* Checking is done until endpos index of string. endpos is set as a very large integer (by default).

* Returns None if no match found.

* If a match is found, a Match object is returned, containing information about the match: where it starts and ends, the substring it matched, and more.

In [5]:
pattern = re.compile("hello")
match = pattern.match("hello world hello")

In [6]:
match.span()

(0, 5)

In [7]:
pattern1 = re.compile("good morning")
match1 = pattern1.match("hi good morning")


In [8]:
match1 is None


True

In [9]:
pattern2 = re.compile("good evening")
match2 = pattern2.match("hi good evening",pos=3)

In [10]:
match2.span()

(3, 15)

In [11]:
pattern3 = re.compile("good bye")
match3 = pattern3.match("good bye",pos=3,endpos=9)
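
The cell above shows no output, so here is a small sketch of how pos and endpos affect match(), assuming the same "good bye" pattern. The variable names `no_match`, `truncated` and `full` are illustrative.

```python
import re

pattern = re.compile("good bye")
s = "good bye"

# Starting at index 3 the remaining text is "d bye", so there is no match
no_match = pattern.match(s, pos=3, endpos=9)
print(no_match)  # None

# endpos limits how far the engine may look: s[:7] is "good by"
truncated = pattern.match(s, 0, 7)
print(truncated)  # None

# With the default pos and endpos the whole string is available
full = pattern.match(s)
print(full.span())  # (0, 8)
```

An endpos larger than the string length, as in the first call, is simply treated as the end of the string.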

### search(string[, pos[, endpos]])

* A match is checked throughout the string.

* Same behaviour of pos and endpos as the match() method.

* Returns None if no match found.

* If a match is found, a Match object is returned.

In [12]:
pattern.search("say hello")

<re.Match object; span=(4, 9), match='hello'>

In [13]:
pattern3.search("HI bye good bye,ok good bye")

<re.Match object; span=(7, 15), match='good bye'>

In [14]:
pattern2.search("good evening,bye good evening")

<re.Match object; span=(0, 12), match='good evening'>
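
The difference between match() and search() is easiest to see on the same input. The following is a minimal sketch using the "hello" pattern from earlier; the names `found` and `anchored` are illustrative.

```python
import re

pattern = re.compile("hello")
s = "say hello"

# match() only tries the start of the string
print(pattern.match(s))  # None

# search() scans forward until it finds the first match
found = pattern.search(s)
print(found.span())  # (4, 9)

# Passing pos=4 makes match() start where the word begins
anchored = pattern.match(s, 4)
print(anchored.span())  # (4, 9)
```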

### findall(string[, pos[, endpos]])
* Finds all non-overlapping substrings where the match is found, and returns them as a list.

* Same behaviour of pos and endpos as the match() and search() methods.

In [15]:
pattern_findall=pattern.findall("hi hello ,my dear friends hello")

In [16]:
pattern_findall

['hello', 'hello']

In [17]:
len(pattern_findall)

2

In [18]:
pattern4 = re.compile(r'\d')  # raw string, so \d reaches the regex engine unchanged

pattern4.findall("hi im jayesh 22 years old 4")

['2', '2', '4']
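
Since findall() accepts pos and endpos as well, here is a short sketch of how they narrow the scanned region, reusing the sample sentence above. The index 15 (the position just after "22") and the names `after_age` and `up_to_age` are chosen for illustration.

```python
import re

digits = re.compile(r"\d")
s = "hi im jayesh 22 years old 4"

# Whole string: every single digit is returned separately
print(digits.findall(s))  # ['2', '2', '4']

# Start scanning at index 15, just past "22"
after_age = digits.findall(s, 15)
print(after_age)  # ['4']

# Stop scanning at index 15, before " years old 4"
up_to_age = digits.findall(s, 0, 15)
print(up_to_age)  # ['2', '2']
```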

### finditer(string[, pos[, endpos]])

* Finds all non-overlapping substrings where the match is found, and returns them as an iterator of Match objects.

* Same behaviour of pos and endpos as the match(), search() and findall() methods.

In [19]:
pattern6 = re.compile("good")

match6 = pattern6.finditer("hi all im good ,i wish that you all are good,toys are very good")

In [20]:
# Collect every Match object yielded by the iterator
lenof = []

for i in match6:
    lenof.append(i)

In [21]:
print("Number of iterations:", len(lenof))

Number of iterations: 3


In [22]:
lenof


[<re.Match object; span=(10, 14), match='good'>,
 <re.Match object; span=(40, 44), match='good'>,
 <re.Match object; span=(59, 63), match='good'>]
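
Two details of finditer() are worth a quick sketch: each yielded item is a Match object with its own span(), and the iterator is exhausted after one pass. The names `spans`, `first_pass` and `second_pass` are illustrative.

```python
import re

pattern = re.compile("good")
s = "hi all im good ,i wish that you all are good,toys are very good"

# Each item yielded by finditer() is a Match object
spans = [m.span() for m in pattern.finditer(s)]
print(spans)  # [(10, 14), (40, 44), (59, 63)]

# An iterator can be consumed only once
it = pattern.finditer(s)
first_pass = len(list(it))
second_pass = len(list(it))
print(first_pass, second_pass)  # 3 0
```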

### Note:
It is not mandatory to create a Pattern object explicitly with re.compile() in order to perform a regex operation.

You can directly use the module-level functions such as:

* re.match(pattern, string, flags=0)

* re.search(pattern, string, flags=0)

* re.findall(pattern, string, flags=0)

* re.finditer(pattern, string, flags=0)

and so on.

In a module-level function, you can simply pass a string as your regex pattern, as shown in the examples below.

In [23]:
re.match("hello","hello")

<re.Match object; span=(0, 5), match='hello'>

In [24]:
re.search('bye',"hi ok bye, good bye")

<re.Match object; span=(6, 9), match='bye'>

In [25]:
re.findall('you',"you are very good,you need to bring")

['you', 'you']

In [26]:
re.finditer('good','good boy ,good girl')

<callable_iterator at 0x28ca18ce1c0>
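
The output above is just the iterator object. The sketch below consumes it and also passes a flag to a module-level function; the second sample sentence and the names `found` and `mixed` are illustrative.

```python
import re

# re.finditer() is lazy: wrap it in list() to see the Match objects
matches = list(re.finditer("good", "good boy ,good girl"))
found = [(m.group(), m.start()) for m in matches]
print(found)  # [('good', 0), ('good', 10)]

# Flags are passed directly, with no explicit re.compile() call
mixed = re.findall("you", "You are good, you know", flags=re.I)
print(mixed)  # ['You', 'you']
```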

In [27]:
text = " this book costs $20"

# Unescaped, $ is a metacharacter (end-of-string anchor), so no match is found
pattern = re.compile("$20")
pattern.search(text) is None


True

In [28]:
text = """ this book costs $20"""

# Escaping $ with a backslash (in a raw string) matches a literal dollar sign
pattern = re.compile(r"\$20")

pattern.search(text) is None

False
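
The two cells above differ only in how $ is written. As a brief sketch of why: an unescaped $ anchors to the end of the string, while \$ stands for the character itself. The name `m` is illustrative.

```python
import re

text = "this book costs $20"

# Unescaped, $ is an end-of-string anchor, so "$20" can never match here
print(re.search("$20", text))  # None

# Escaped, \$ stands for a literal dollar sign
m = re.search(r"\$20", text)
print(m.group(), m.span())  # $20 (16, 19)
```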

### The Backslash Plague
Let's start with an example.

Consider a text containing some Windows-style directory paths, in which we want to find a substring such as c:\windows.

In [29]:
import re

In [30]:
text = r"""
c:\windows
c:\python
c:\windows\system32
"""

In [31]:
pattern = re.compile("c:\windows")

pattern.search(text) is not None

False

### Why are no matches found for above pattern?
The regex engine treats \ as a metacharacter (here it reads \w as "any word character"), whereas we intend it to be a literal backslash.

### Solution???
We need to escape the metacharacters. A metacharacter can be escaped by putting a \ before it.

In [32]:
pattern = re.compile("c:\\windows")

pattern.search(text) is None

True

### Still no match found. Why???
\ is used as an escape at two different levels.

* First, the Python interpreter itself performs substitutions for \ before the re module ever sees the pattern string. For instance, \n is converted to a newline character, \t is converted to a tab character, etc.

* Finally, re reads the substituted pattern string and applies its own substitutions for the \ character.

Hence, to use \ as a literal, we escape it as \\ for the regex engine, and then escape each of those two backslashes again for the Python interpreter, which gives \\\\.
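
To see the two levels separately, this sketch prints what the interpreter hands to re for each spelling and then tries both as patterns. The names `two`, `four` and `path` are illustrative.

```python
import re

# What the Python interpreter hands to re for each spelling
two = "c:\\windows"      # 10 characters: c:\windows
four = "c:\\\\windows"   # 11 characters: c:\\windows

print(two, len(two))
print(four, len(four))

path = "c:\\windows"  # the text we want to find, with one real backslash

# The regex c:\windows reads \w as "any word character"
print(re.search(two, path))  # None

# The regex c:\\windows reads \\ as one literal backslash
print(re.search(four, path).group())  # c:\windows
```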

In [33]:
pattern = re.compile("c:\\\\windows")

pattern.search(text) is None

False

### Can we use 2 backslashes instead of 4 here?
Yes. By using raw strings, we can skip the first level of escaping (the one for the Python interpreter).

Python raw strings are written as *r"your string"*. In raw strings, backslashes need no extra escaping because escape sequences like \n, \t, etc. are not processed.

In [34]:
pattern = re.compile(r"c:\\windows")

pattern.search(text) is None

False
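
A quick way to check that the raw string and the four-backslash version are the same pattern is to compare them directly. This is a minimal sketch; the names `raw` and `plain` are illustrative.

```python
import re

raw = r"c:\\windows"
plain = "c:\\\\windows"

# Both spellings produce the same pattern string
print(raw == plain)  # True

# In a raw string \n stays as two characters instead of becoming a newline
print(len(r"\n"), len("\n"))  # 2 1

# The raw pattern finds the literal path
print(re.search(raw, r"c:\windows\system32").group())  # c:\windows
```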

### Do we really need to use 2 backslashes?
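If the whole pattern should be treated as literal text (that is, you are not using any metacharacters of your own), you can let the re.escape() function add the escapes for you. Since Python 3.7 it escapes only characters that can have special meaning in a regular expression; earlier versions escaped every character except ASCII letters, numbers and '_'.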

In [35]:
re.search(re.escape(r'c:\windows'),text)

<re.Match object; span=(1, 11), match='c:\\windows'>
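
As a short sketch of what re.escape() does in practice: the exact escaped string varies slightly between Python versions, so the example checks behaviour rather than the precise text. The "1.5" sample and the names `path` and `escaped` are illustrative.

```python
import re

path = r"c:\windows"  # raw string: one real backslash

# re.escape() adds whatever backslashes the regex engine needs
escaped = re.escape(path)
print(escaped)

# The escaped pattern matches the original text literally
print(re.fullmatch(escaped, path) is not None)  # True

# Without escaping, "." would match any character
print(re.search("1.5", "1x5") is not None)             # True
print(re.search(re.escape("1.5"), "1x5") is not None)  # False
```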