# Regular Expression Modifiers and Regular Expression Patterns
***

## Introduction to Regular Expressions
- A regular expression is a pattern that describes a set of strings; the name is usually abbreviated to "regex" or "regexp".
- A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.
- Regular expressions are extremely useful in extracting information from text such as code, log files, spreadsheets, or even documents.
- RegEx can be used to check if a string contains the specified search pattern.
- Python has a built-in package called **re**, which can be used to work with Regular Expressions.


<img src="https://www.python.org/static/community_logos/python-logo-master-v3-TM.png" title="Python Logo"/>

In [None]:
import re

## Match function in RegEx
A “match” is the piece of text, or sequence of bytes or characters, that the regex engine found to correspond to the pattern. In Python, `re.match()` checks for a match only at the beginning of the string and returns a Match object (or `None` if there is none), while `re.findall()` returns every match in the string as a list.


In Python, checking whether a pattern occurs anywhere in a string can be done with `findall()` as follows:


In [None]:
txt = "The rain in Spain"

#Check if "Portugal" is in the string:

x = re.findall("Portugal", txt)
print(x)

if (x):
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match
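
The difference between `re.match()` and `re.search()` is easy to miss, so here is a minimal sketch on the same sample string. The variable names are illustrative only.

```python
import re

txt = "The rain in Spain"

# re.match only succeeds if the pattern matches at the start of the string.
at_start = re.match("The", txt)
not_at_start = re.match("rain", txt)

# re.search scans the whole string and returns the first match.
anywhere = re.search("rain", txt)

print(at_start is not None)   # True: txt starts with "The"
print(not_at_start)           # None: "rain" is not at position 0
print(anywhere.span())        # (4, 8)
```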


Also it is very easy to print all the matched text. See the example below:

In [None]:
txt = "The rain in Spain"
x = re.findall("ai", txt)
for item in x:
	print(item)

ai
ai


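## Search function with RegEx
The search() method scans the string for the first location where the pattern matches and returns a Match object, or `None` if there is no match. The Match object's start() method gives the position of the match.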

In [None]:
txt = "The rain in Spain"
x = re.search(r"\s", txt)  # raw string, so \s reaches the regex engine unchanged

print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 3


In [None]:
txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)
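
Because `re.search()` returns `None` when nothing matches, calling `.start()` on the result without a check would raise an `AttributeError`. A small sketch of a guarded lookup follows; `describe` is a hypothetical helper introduced here for illustration, not part of the `re` module.

```python
import re

txt = "The rain in Spain"

def describe(pattern, text):
    """Return (matched text, start, end), or None if there is no match."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return m.group(), m.start(), m.end()

print(describe("Spain", txt))     # ('Spain', 12, 17)
print(describe("Portugal", txt))  # None
```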

## Search and Replace with RegEx

The sub() function replaces the matches with the text of your choice:

In [None]:
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

You can control the number of replacements by specifying the count parameter:

In [None]:
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)
print(x)
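
For reference, here is a short sketch of what the two `sub()` calls above produce, together with `re.subn()` (which also reports how many replacements were made) and a backreference in the replacement text. The variable names are illustrative.

```python
import re

txt = "The rain in Spain"

all_replaced = re.sub(r"\s", "9", txt)
first_two = re.sub(r"\s", "9", txt, count=2)

# subn returns the new string together with the number of replacements.
replaced, n = re.subn(r"\s", "9", txt)

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r"(\w+) in (\w+)", r"\2 in \1", txt)

print(all_replaced)  # The9rain9in9Spain
print(first_two)     # The9rain9in Spain
print(n)             # 3
print(swapped)       # The Spain in rain
```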

## Regular Expression Patterns
## Metacharacters and Sequences

`.      - Any Character Except New Line`<br>
`\d     - Digit (0-9)`<br>
`\D     - Not a Digit (0-9)`<br>
`\w     - Word Character (a-z, A-Z, 0-9, _)`<br>
`\W     - Not a Word Character`<br>
`\s     - Whitespace (space, tab, newline)`<br>
`\S     - Not Whitespace (space, tab, newline)`<br>

`\b     - Word Boundary`<br>
`\B     - Not a Word Boundary`<br>
`^      - Beginning of a String`<br>
`$      - End of a String`<br>

`[]     - Matches Characters in brackets`<br>
`[^ ]   - Matches Characters NOT in brackets`<br>
`|      - Either Or`<br>
`( )    - Group`<br>

`re*          - Matches 0 or more occurrences of the preceding expression`<br>
`re+          - Matches 1 or more occurrences of the preceding expression`<br>
`re?          - Matches 0 or 1 occurrence of the preceding expression`<br>
`re{n}        - Matches exactly n occurrences of the preceding expression`<br>
`re{n,}       - Matches n or more occurrences of the preceding expression`<br>
`re{n,m}      - Matches at least n and at most m occurrences of the preceding expression`<br>
`a|b          - Matches either a or b`<br>
`(re)         - Groups regular expressions and remembers the matched text`<br>
`(?imx)       - Turns on the i, m, or x options for the whole pattern; should appear at the start of the pattern`<br>
`(?-imx)      - Turns off the i, m, or x options in some other regex flavors; Python's re accepts only the scoped form (?-imx:re)`<br>
`(?:re)       - Groups regular expressions without remembering the matched text`<br>
`(?imx:re)    - Temporarily turns on the i, m, or x options within the parentheses`<br>
`(?-imx:re)   - Temporarily turns off the i, m, or x options within the parentheses`<br>
`(?#...)      - Comment`<br>
`(?=re)       - Lookahead: asserts that re matches at this position without consuming any text`<br>
`(?!re)       - Negative lookahead: asserts that re does not match at this position, without consuming any text`<br>

`Quantifiers:`<br>
`*      - 0 or More`<br>
`+      - 1 or More`<br>
`?      - 0 or One`<br>
`{3}    - Exact Number`<br>
`{3,4}  - Range of Numbers (Minimum, Maximum)`<br>
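
A few of the entries above in action, as a minimal sketch. The sample sentence is made up for illustration.

```python
import re

text = "Order 66 shipped on 2021-03-04 to cat, concat and category."

digits = re.findall(r"\d+", text)                  # runs of digits
whole_word = re.findall(r"\bcat\b", text)          # "cat" as a whole word only
optional = re.findall(r"(?:con)?cat", text)        # non-capturing optional prefix
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)     # exact repetition counts
before_dash = re.findall(r"\d+(?=-)", text)        # lookahead: digits followed by "-"

print(digits)       # ['66', '2021', '03', '04']
print(whole_word)   # ['cat']
print(optional)     # ['cat', 'concat', 'cat']
print(dates)        # ['2021-03-04']
print(before_dash)  # ['2021', '03']
```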

#### Sample RegExps ####

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+
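
The sample pattern above looks for e-mail-like strings. The sketch below applies it to a made-up sentence; note that this pattern is a rough approximation and is not a full validator for e-mail addresses.

```python
import re

# The sample pattern from above, compiled once for reuse.
EMAIL = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+")

text = "Contact user@company.com or first.last+tag@mail.example.org today"

found = EMAIL.findall(text)
print(found)  # ['user@company.com', 'first.last+tag@mail.example.org']

# fullmatch requires the whole string to match the pattern.
print(EMAIL.fullmatch("not-an-email"))  # None
```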

## Regular Expression Modifiers
- Functions in the `re` module accept an optional `flags` argument (a modifier) that controls various aspects of matching.
- You can combine multiple modifiers with the bitwise OR operator `|`.
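
As a minimal sketch of combining modifiers, the example below passes two flags joined with `|`, and then uses the equivalent inline form `(?im)`. The sample text is made up for illustration.

```python
import re

text = "First line\nsecond LINE\n"

# Without flags: ^ and $ only anchor to the whole string, and case matters.
no_flags = re.findall(r"^\w+ line$", text)

# Two flags combined with bitwise OR.
both = re.findall(r"^\w+ line$", text, re.IGNORECASE | re.MULTILINE)

# The same flags written inline at the start of the pattern.
inline = re.findall(r"(?im)^\w+ line$", text)

print(no_flags)  # []
print(both)      # ['First line', 'second LINE']
print(inline)    # ['First line', 'second LINE']
```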




### 1. re.A or re.ASCII

Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns. Corresponds to the inline flag (?a).

Note that for backward compatibility, the re.U flag still exists (as well as its synonym re.UNICODE and its embedded counterpart (?u)), but these are redundant in Python 3 since matches are Unicode by default for strings (and Unicode matching isn’t allowed for bytes).
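
A small sketch of the effect of `re.ASCII`, using a made-up string that contains an accented letter and an Arabic-Indic digit (written as escapes so the source stays ASCII):

```python
import re

text = "caf\u00e9 \u0663"   # "café" followed by ARABIC-INDIC DIGIT THREE

unicode_words = re.findall(r"\w+", text)
ascii_words = re.findall(r"\w+", text, re.ASCII)

unicode_digits = re.findall(r"\d", text)
ascii_digits = re.findall(r"\d", text, re.ASCII)

print(unicode_words)   # ['café', '٣']
print(ascii_words)     # ['caf']
print(unicode_digits)  # ['٣']
print(ascii_digits)    # []
```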

### 2. re.I or re.IGNORECASE

In [None]:
email = "user@company.com"

m = re.search("User", email)
if m:
    print("Match successful")
else:
    print("Match unsuccessful")

Match unsuccessful


The snippet above searches for a given string in the variable `email`. The drawback of this approach is that the pattern must use exactly the same letter case as the text: `User` does not occur in `user@company.com`, so the search fails.

In [None]:
email = "user@company.com"

m = re.search("User", email, re.IGNORECASE)  # re.I can also be used
if m:
    print("Match successful")
else:
    print("Match unsuccessful")

Match successful


Perform `case-insensitive matching`; expressions like `[A-Z]` will also match `lowercase letters`. Full Unicode matching (such as Ü matching ü) also works unless the re.ASCII flag is used to disable non-ASCII matches. The current locale does not change the effect of this flag unless the re.LOCALE flag is also used. Corresponds to the inline flag (?i).

Note that when the Unicode patterns `[a-z]` or `[A-Z]` are used in combination with the IGNORECASE flag, they will match the 52 ASCII letters and 4 additional non-ASCII letters: ‘İ’ (U+0130, Latin capital letter I with dot above), ‘ı’ (U+0131, Latin small letter dotless i), ‘ſ’ (U+017F, Latin small letter long s) and ‘K’ (U+212A, Kelvin sign). If the ASCII flag is used, only letters ‘a’ to ‘z’ and ‘A’ to ‘Z’ are matched.
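
The notes above can be checked with a short sketch. It uses the Kelvin sign (U+212A) as the non-ASCII letter and also shows the inline `(?i)` form; the variable names are illustrative.

```python
import re

# Full Unicode case-insensitive matching: "ü" matches "Ü".
umlaut = re.search("ü", "Ü", re.IGNORECASE)

kelvin = "\u212a"  # KELVIN SIGN, one of the four extra letters matched by [a-z]
unicode_match = re.fullmatch(r"[a-z]", kelvin, re.IGNORECASE)
ascii_match = re.fullmatch(r"[a-z]", kelvin, re.IGNORECASE | re.ASCII)

# The inline flag (?i) has the same effect as passing re.IGNORECASE.
inline = re.search(r"(?i)user", "USER@company.com")

print(umlaut is not None)         # True
print(unicode_match is not None)  # True
print(ascii_match)                # None
print(inline.group())             # USER
```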

### 3. re.L or re.LOCALE

Make \w, \W, \b, \B and case-insensitive matching dependent on the current locale. This flag can be used only with bytes patterns. The use of this flag is discouraged as the locale mechanism is very unreliable, it only handles one “culture” at a time, and it only works with 8-bit locales. Unicode matching is already enabled by default in Python 3 for Unicode (str) patterns, and it is able to handle different locales/languages. Corresponds to the inline flag (?L).

### 4. re.S or re.DOTALL

In [None]:
s = "abcd\nefgh\nijkl\nmnop\n"

m = re.search("cd.ef", s)
if m:
    print("Match successful")
else:
    print("Match unsuccessful")

Match unsuccessful


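The search pattern `cd.ef` means `cd`, followed by any single character, followed by `ef`. By default `.` matches any character except a newline, and here `cd` and `ef` are separated by `\n`, so the search fails.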

In [None]:
s = "abcd\nefgh\nijkl\nmnop\n"

m = re.search("cd.ef", s, re.DOTALL)  # re.S can also be used
if m:
    print("Match successful")
else:
    print("Match unsuccessful")

Match successful


Make the `.` special character match any character at all, including a newline; without this flag, `.` will match anything except a newline. Corresponds to the inline flag (?s).
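
A compact sketch of the same behaviour, also showing the inline `(?s)` form and the text that was actually matched:

```python
import re

s = "abcd\nefgh\n"

default = re.search(r"cd.ef", s)             # "." stops at the newline
dotall = re.search(r"cd.ef", s, re.DOTALL)   # "." may match the newline
inline = re.search(r"(?s)cd.ef", s)          # same as passing re.DOTALL

print(default)               # None
print(repr(dotall.group()))  # 'cd\nef'
print(repr(inline.group()))  # 'cd\nef'
```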

### 5. re.M or re.MULTILINE

In [None]:
with open('words') as f:  # close the file once it has been read
    words = f.read()


In [None]:
print(words)

anthropocentrically
anthropomorphically
antiferromagnetisms
antivivisectionists
antiferromagnetically
anthropocentricities
astrophotographers
anthropocentricity
antiferromagnetism
antivivisectionist
astrophotographies
anthropologically
astrophotographer
accountablenesses
acquisitivenesses
acrimoniousnesses
adventurousnesses
allegoricalnesses
alternativenesses
amphitheatrically
anticlimactically
antiferromagnetic
appropriatenesses
ariboflavinosises
atrabiliousnesses



In [None]:
re.findall(r"^a\w+s$", words)

[]

To find the words that start with `a` and end with `s`, pass `re.MULTILINE` so that `^` and `$` match at the start and end of every line:

In [None]:
re.findall(r"^a\w+s$", words, re.MULTILINE)

['antiferromagnetisms',
 'antivivisectionists',
 'anthropocentricities',
 'astrophotographers',
 'astrophotographies',
 'accountablenesses',
 'acquisitivenesses',
 'acrimoniousnesses',
 'adventurousnesses',
 'allegoricalnesses',
 'alternativenesses',
 'appropriatenesses',
 'ariboflavinosises',
 'atrabiliousnesses']

When specified, the pattern character `^` matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character `$` matches at the end of the string and at the end of each line (immediately preceding each newline). By default, `^` matches only at the beginning of the string, and `$` only at the end of the string and immediately before the newline (if any) at the end of the string. Corresponds to the inline flag (?m).
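
Since the `words` file may not be available, here is a self-contained sketch of the same idea on a small made-up string. It also shows that, by default, `$` still matches just before a newline at the very end of the string.

```python
import re

text = "apples\nbanana\navocados\n"

# Default: ^ matches only at the start of the string.
default = re.findall(r"^a\w+s$", text)

# MULTILINE: ^ and $ match at the start and end of every line.
per_line = re.findall(r"^a\w+s$", text, re.MULTILINE)

# Default $ matches at the end of the string and before a trailing newline.
last_s = re.findall(r"s$", text)
every_s = re.findall(r"s$", text, re.MULTILINE)

print(default)   # []
print(per_line)  # ['apples', 'avocados']
print(last_s)    # ['s']
print(every_s)   # ['s', 's']
```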

### 6. re.X or re.VERBOSE

In [None]:
re.findall(r"^a\w+s$",
           words, 
           re.MULTILINE)


In [None]:
re.findall(r'''
^a  #string must start with a
\w+ #find some letters, numbers or underscores in the middle
s$  #string must end with s
''', 
           words, 
           re.MULTILINE )

[]

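We can break the call across lines and comment each argument, but without `re.VERBOSE` the whitespace and `#` comments inside the pattern are treated as literal text, so nothing matches. To annotate each part of the regex itself, add the `re.VERBOSE` flag: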

In [None]:
re.findall(r'''
^a  #string must start with a
\w+ #find some letters, numbers or underscores in the middle
s$  #string must end with s
''', 
           words, 
           re.MULTILINE | re.VERBOSE)


This flag allows you to write regular expressions that look nicer and are more readable by allowing you to visually separate logical sections of the pattern and add comments. Whitespace within the pattern is ignored, except when in a character class, or when preceded by an unescaped backslash, or within tokens like *?, (?: or (?P<...>. When a line contains a `#` that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.
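
A self-contained sketch of a commented pattern follows. The date and item patterns are made up for illustration; the second one shows one way to match a literal space in verbose mode, where plain whitespace is ignored.

```python
import re

date_re = re.compile(r"""
    (\d{4})   # year
    -         # separator
    (\d{2})   # month
    -         # separator
    (\d{2})   # day
""", re.VERBOSE)

item_re = re.compile(r"""
    (\d+)     # quantity
    [ ]       # a literal space, written as a character class
    (\w+)     # item name
""", re.VERBOSE)

dates = date_re.findall("2021-03-04 and 1999-12-31")
item = item_re.search("buy 3 apples").groups()

print(dates)  # [('2021', '03', '04'), ('1999', '12', '31')]
print(item)   # ('3', 'apples')
```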

**character classes**


`[Pp]ython       Match "Python" or "python"`<br>
`rub[ye]         Match "ruby" or "rube"`<br>
`[aeiou]         Match any one lowercase vowel`<br>
`[0-9]           Match any digit; same as [0123456789]`<br>
`[a-z]           Match any lowercase ASCII letter`<br>
`[A-Z]           Match any uppercase ASCII letter`<br>
`[a-zA-Z0-9]     Match any of the above`<br>
`[^aeiou]        Match anything other than a lowercase vowel`<br>
`[^0-9]          Match anything other than a digit`<br>
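
Some of the character classes above, tried on made-up strings as a quick sketch:

```python
import re

text = "Python and python, ruby and rube"

pythons = re.findall(r"[Pp]ython", text)
rubies = re.findall(r"rub[ye]", text)
vowels = re.findall(r"[aeiou]", "regex")
non_vowels = re.findall(r"[^aeiou]", "regex")
non_digits = re.findall(r"[^0-9]", "a1b2")

print(pythons)     # ['Python', 'python']
print(rubies)      # ['ruby', 'rube']
print(vowels)      # ['e', 'e']
print(non_vowels)  # ['r', 'g', 'x']
print(non_digits)  # ['a', 'b']
```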

**special character classes** 

`.        -Match any character except newline`<br>
`\d       -Match a digit: [0-9]`<br>
`\D       -Match a nondigit: [^0-9]`<br>
`\s       -Match a whitespace character: [ \t\r\n\f\v]`<br>
`\S       -Match nonwhitespace: [^ \t\r\n\f\v]`<br>
`\w       -Match a single word character: [A-Za-z0-9_]`<br>
`\W       -Match a nonword character: [^A-Za-z0-9_]`<br>



It is also possible to combine two or more flags using `|`, as in `re.MULTILINE | re.VERBOSE`.

**repetition cases**
	

`ruby?    -Match "rub" or "ruby": the y is optional`

`ruby*    -Match "rub" plus 0 or more ys`

`ruby+    -Match "rub" plus 1 or more ys`

`\d{3}    -Match exactly 3 digits`

`\d{3,}   -Match 3 or more digits`

`\d{3,5}  -Match 3, 4, or 5 digits`
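
The repetition cases above can be checked with a short sketch on made-up strings. Note that the quantifiers are greedy, so each match takes as many repetitions as it can.

```python
import re

text = "rub ruby rubyyy"

optional_y = re.findall(r"ruby?", text)
any_ys = re.findall(r"ruby*", text)
some_ys = re.findall(r"ruby+", text)

numbers = "12 123 1234567"
exactly_three = re.findall(r"\d{3}", numbers)
three_or_more = re.findall(r"\d{3,}", numbers)
three_to_five = re.findall(r"\d{3,5}", numbers)

print(optional_y)     # ['rub', 'ruby', 'ruby']
print(any_ys)         # ['rub', 'ruby', 'rubyyy']
print(some_ys)        # ['ruby', 'rubyyy']
print(exactly_three)  # ['123', '123', '456']
print(three_or_more)  # ['123', '1234567']
print(three_to_five)  # ['123', '12345']
```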



# compile() method
• The built-in compile() function returns the specified source as a code object, ready to be executed. It compiles Python source code and is distinct from re.compile(), which compiles a regular expression pattern into a reusable pattern object.
Syntax : compile(source, filename, mode, flags, dont_inherit, optimize)
## Parameters:
`-source        :   Required. The source to compile, can be a String, a Bytes object, or an AST object`<br>
`-filename      :   Required. The name of the file that the source comes from. If the source does not come from a file, you can write whatever you like`<br>
`-mode          :   Required. Legal values:`<br>
`    eval       :   if the source is a single expression`<br>
`    exec       :   if the source is a block of statements`<br>
`    single     :   if the source is a single interactive statement`<br>
`-flags         :   Optional. How to compile the source. Default 0`<br>
`-dont_inherit  :   Optional. If true, the future statements and compiler options in effect in the calling code are ignored. Default False`<br>
`-optimize      :   Optional. Defines the optimization level of the compiler. Default -1`<br>


In [None]:
x = compile('print(55)', 'test', 'eval')
exec(x)                   #Compile text as code, and then execute it

55


In [None]:
x = compile('print(55)\nprint(88)', 'test', 'exec')
exec(x)                   #Compile more than one statement, and then execute them

55
88
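
For regular expressions specifically, the `re` module has its own `re.compile(pattern, flags=0)`, which is unrelated to the built-in `compile()` shown above. It returns a pattern object whose methods mirror the module-level functions, which is convenient when the same pattern is used repeatedly. A minimal sketch, with illustrative variable names:

```python
import re

txt = "The rain in Spain"

pair = re.compile(r"ai")        # compile once, reuse many times
whitespace = re.compile(r"\s")

found = pair.findall(txt)
first = pair.search(txt).start()
upper = pair.sub("AI", txt)
parts = whitespace.split(txt)

print(found)  # ['ai', 'ai']
print(first)  # 5
print(upper)  # The rAIn in SpAIn
print(parts)  # ['The', 'rain', 'in', 'Spain']
```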


## findall() method

**The findall() function returns a list containing all matches**<br>
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.


In [None]:
import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)


['ai', 'ai']
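
The description above mentions groups and empty matches; the sketch below illustrates both on made-up strings.

```python
import re

pairs = "a=1, b=22"

no_groups = re.findall(r"\w+=\d+", pairs)       # whole matches
one_group = re.findall(r"(\w+)=\d+", pairs)     # only the group's text
two_groups = re.findall(r"(\w+)=(\d+)", pairs)  # a tuple per match

# A pattern that can match the empty string yields empty matches too.
empties = re.findall(r"x*", "ax")

print(no_groups)   # ['a=1', 'b=22']
print(one_group)   # ['a', 'b']
print(two_groups)  # [('a', '1'), ('b', '22')]
print(empties)     # ['', 'x', '']
```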


In [None]:

txt = "The rain in Spain"
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())

The first white-space character is located in position: 3
