# Regular Expressions
- Also called REs, regexes, or regex patterns
- A regular expression is an algebraic notation for characterizing a set of strings
- Useful for searching text
- Formally, a search function based on regular expressions searches through the ***corpus*** and returns all matching strings
- A corpus can be a string, a document, or a collection of documents (web pages, product reviews, a collection of news articles)

### Meta Characters
- Metacharacters are special characters that do not match themselves literally but have a special meaning when used in a regular expression. Some of them are:

| Wild Card | Description         
| :-:       |:-------------
| **^**     |Caret symbol specifies that the match must start at the beginning of the string, and in MULTILINE mode also matches immediately after each newline<br>- `^b` will check if the string starts with 'b' such as baba, boss, basic, b, by, etc.<br>- `^si` will check if the string starts with 'si' such as simple, sister, si, etc.
| **$**     |Specifies that the match must occur at the end of the string <br> - `s$` will check for the string that ends with 's' such as geeks, ends, s, etc.<br>- `ing$` will check for the string that ends with 'ing' such as going, seeing, ing, etc.
| **.**     |Represents a single occurrence of any character except a newline <br> - `a.b` will check for a string that contains any character in place of the dot, such as acb, acbd, abbb, etc.<br> - `..` will check if the string contains at least 2 characters
| **\\**    |Used to drop the special meaning of the character following it, or to introduce a special sequence. <br> - Since the dot `(.)` is a metacharacter, to search for a literal dot in a string you must put a backslash `(\)` just before the dot `(.)` so that it loses its special meaning.
| **[...]** |Matches a single character in the listed set. If a caret is the first character inside the brackets, it negates the set<br>- `[abc]` means match any single character out of this set<br>- `[123]` means match any single digit out of this set<br>- `[a-z]` means match any single lowercase letter<br>- `[0-9]` means match any single digit<br>- `[^0-3]` means any character except 0, 1, 2, or 3<br>- `[^a-c]` means any character except a, b, or c<br>- `[0-5][0-9]` will match all the two-digit numbers from 00 to 59<br>- `[0-9A-Fa-f]` will match any hexadecimal digit.<br>- Special characters lose their special meaning inside sets, so `[(+*)]` will match any of the literal characters '(', '+', '*', or ')'.<br>- To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set, so `[()[\]{}]` and `[]()[{}]` will both match any parenthesis, square bracket, or brace.
| **^[...]**|Matches any character in the set at the beginning of the string
| **[^...]**|Matches any single character that is NOT in the listed set (negation)
| **\|**    |Or symbol works as the OR operator meaning it checks whether the pattern before or after the or symbol is present in the string or not<br>- `a\|b` will match any string that contains a or b such as acd, bcd, abcd, etc.<br>- To match a literal '\|', use `\|`, or enclose it inside a character class, as in `[\|]`.
| **( )**   |Used to capture and group 
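
The metacharacters in the table above can be tried out with a short sketch like the following. The word list and the helper name `matching` are made up for illustration; `re.search` looks for the pattern anywhere in each word.

```python
import re

# Hypothetical sample words, chosen to exercise each metacharacter
words = ["baba", "boss", "cab", "geeks", "acb", "a.b"]

def matching(pattern):
    """Return the words in which the pattern is found anywhere."""
    return [w for w in words if re.search(pattern, w)]

starts_with_b = matching(r"^b")     # anchored at the start of the string
ends_with_s = matching(r"s$")       # anchored at the end of the string
any_middle = matching(r"a.b")       # 'a', any one character, then 'b'
literal_dot = matching(r"a\.b")     # the backslash makes the dot literal
outside_abc = matching(r"[^a-c]")   # contains a character other than a, b, c
ba_or_ge = matching(r"^(ba|ge)")    # grouping combined with alternation

print(starts_with_b)   # ['baba', 'boss']
print(ends_with_s)     # ['boss', 'geeks']
print(any_middle)      # ['acb', 'a.b']
print(literal_dot)     # ['a.b']
print(outside_abc)     # ['boss', 'geeks', 'a.b']
print(ba_or_ge)        # ['baba', 'geeks']
```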

### Quantifiers
- A quantifier metacharacter immediately follows a portion of a regex and indicates how many times that portion must occur for the match to succeed. The quantifiers are `*`, `+`, `?`, `{n}`, and `{n,m}`. When used alone, the quantifier metacharacters `*`, `+`, and `?` are all greedy, meaning they produce the longest possible match.

| Wild Card | Description         
| :-:       |:-------------
| **\***    |The preceding character/expression is repeated zero or more times
| **+**     |The preceding character/expression is repeated one or more times. <br>- `ab+c` will match the strings abc, abbc, and dabc, but not ac or abdc, because there is no b in ac and the b is not followed by c in abdc.
| **?**     |The preceding character/expression is optional (zero or one occurrence). <br>- `ab?c` will match the strings ac, abc, acb, dabc, and dac, but not abbc, because there are two b's. Similarly, it will not match abdc because the b is not followed by c.
| **{n,m}** |The preceding character/expression is repeated from n to m times (both inclusive). <br> - `a{2,4}` will match the strings aaab, baaaac, and gaad, but not abc or bc, because abc has only one a and bc has none.
| **{n}**   |The preceding character/expression is repeated exactly n times.<br>- `a{6}` will match exactly six 'a' characters, but not five.
| **{n,}**  |The preceding character/expression is repeated at least n times
| **{,m}**  |The preceding character/expression is repeated up to m times
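
As a rough illustration of the quantifiers above, the sketch below checks a few candidate strings against each form. The candidate list and the helper `full_matches` are assumptions made for this example; `re.fullmatch` is used so that the whole string has to fit the pattern.

```python
import re

# Hypothetical candidates that differ only in the number of 'b' characters
candidates = ["ac", "abc", "abbc", "abbbc"]

def full_matches(pattern):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

star = full_matches(r"ab*c")            # zero or more 'b'
plus = full_matches(r"ab+c")            # one or more 'b'
optional = full_matches(r"ab?c")        # zero or one 'b'
exactly_two = full_matches(r"ab{2}c")   # exactly two 'b'
two_or_more = full_matches(r"ab{2,}c")  # at least two 'b'
up_to_two = full_matches(r"ab{,2}c")    # at most two 'b'

# Greedy behaviour: .* takes as much as it can while the match still succeeds,
# so the match runs from the first '<' to the last '>'
greedy = re.search(r"<.*>", "<b>bold</b>").group()

print(star, plus, optional, exactly_two, two_or_more, up_to_two, sep="\n")
print(greedy)   # <b>bold</b>
```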

### Escape Codes
- You can use special escape codes to find specific types of patterns in your data, such as digits, non-digits, whitespace, and more.
- The following list of special sequences isn’t complete.

| Code | Description         
| :-:  |:-------------
| **\d** |Matches any decimal digit. For ASCII text this is equivalent to `[0-9]`
| **\D** |Matches any non-digit character. This is equivalent to `[^0-9]` or `[^\d]`
| **\s** |Matches any whitespace character. For ASCII text this is equivalent to `[ \t\n\r\f\v]`
| **\S** |Matches any non-whitespace character. This is equivalent to `[^ \t\n\r\f\v]` or `[^\s]`
| **\w** |Matches any alphanumeric character or underscore. For ASCII text this is equivalent to `[a-zA-Z0-9_]`
| **\W** |Matches any non-alphanumeric character. This is equivalent to `[^a-zA-Z0-9_]` or `[^\w]`
| **\b** |Matches at a word boundary, so the specified characters must be at the beginning or at the end of a word, e.g. `r"\bain"` or `r"ain\b"`
| **\B** |Matches where the specified characters are present, but NOT at the beginning or at the end of a word, e.g. `r"\Bain"` or `r"ain\B"`
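
The escape codes above can be explored with a small sketch such as this one. The sample strings are invented for illustration; the start indices show where `ain` is found relative to word boundaries.

```python
import re

# \d+ : runs of decimal digits
digits = re.findall(r"\d+", "Order 66 shipped on 2024-01-15")

# \w+ : runs of letters, digits, and underscores
word_runs = re.findall(r"\w+", "user_1@mail.com")

# \s+ : split on any run of whitespace (spaces, tab, newline)
pieces = re.split(r"\s+", "a  b\tc\nd")

# Word boundaries, using a hypothetical three-word sample
sample = "ain rain mainly"
at_word_start = [m.start() for m in re.finditer(r"\bain", sample)]
at_word_end = [m.start() for m in re.finditer(r"ain\b", sample)]
inside_word = [m.start() for m in re.finditer(r"\Bain\B", sample)]

print(digits)         # ['66', '2024', '01', '15']
print(word_runs)      # ['user_1', 'mail', 'com']
print(pieces)         # ['a', 'b', 'c', 'd']
print(at_word_start)  # [0]      only the standalone 'ain'
print(at_word_end)    # [0, 5]   'ain' and the end of 'rain'
print(inside_word)    # [10]     the 'ain' inside 'mainly'
```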

### The Python `re` module

In [1]:
import re
print(dir(re))

['A', 'ASCII', 'DEBUG', 'DOTALL', 'I', 'IGNORECASE', 'L', 'LOCALE', 'M', 'MULTILINE', 'Match', 'Pattern', 'RegexFlag', 'S', 'Scanner', 'T', 'TEMPLATE', 'U', 'UNICODE', 'VERBOSE', 'X', '_MAXCACHE', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '__version__', '_cache', '_compile', '_compile_repl', '_expand', '_locale', '_pickle', '_special_chars_map', '_subx', 'compile', 'copyreg', 'enum', 'error', 'escape', 'findall', 'finditer', 'fullmatch', 'functools', 'match', 'purge', 'search', 'split', 'sre_compile', 'sre_parse', 'sub', 'subn', 'template']


### a. The `re.compile()` Function
- The `re.compile(pattern, flags=0)` function compiles a regular expression and returns a `Pattern` object.
- Where,
    - `pattern` is the regular expression that you want to compile in order to search or modify a string or a corpus of documents.
    - `flags` can take different values, which can be combined with bitwise OR (`|`) to change how the pattern matches, like:
        - `IGNORECASE` or `I` to do a case-insensitive search
        - `LOCALE` or `L` to perform a locale-aware match
        - `MULTILINE` or `M` to do multiline matching, affecting `^` and `$`
- Once you have a `Pattern` object representing a compiled regular expression, you can use its methods to perform various operations on a string or a corpus of documents:
    - `p.match()`: Determines if the RE matches at the beginning of the string.
    - `p.search()`: Scans through a string, looking for the first location where this RE matches.
    - `p.findall()`: Finds all substrings where the RE matches, and returns them as a list.
    - `p.finditer()`: Finds all substrings where the RE matches, and returns them as an iterator.
    - `p.split()`: Splits the string at the occurrences of the pattern.
    - `p.sub()`: Finds matches of the pattern and replaces them.
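
To see the effect of combining flags, here is a minimal sketch. The three-line sample text is made up; the same pattern is compiled once without flags and once with `IGNORECASE` and `MULTILINE` ORed together.

```python
import re

# Hypothetical three-line sample
text = "apple pie\nApricot jam\nbanana split"

# No flags: ^ anchors only at the very start, and 'a' does not match 'A'
plain = re.compile(r"^a\w+")

# IGNORECASE | MULTILINE: ^ also matches right after each newline,
# and 'a' matches both 'a' and 'A'
flagged = re.compile(r"^a\w+", flags=re.IGNORECASE | re.MULTILINE)

plain_hits = plain.findall(text)
flagged_hits = flagged.findall(text)

print(plain_hits)     # ['apple']
print(flagged_hits)   # ['apple', 'Apricot']
```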

In [3]:
import re
p = re.compile(r"[A]+[a-z]*", flags=re.M)
# Matches one or more 'A' followed by zero or more lowercase letters a to z
print(p)
print(type(p))

# Once you have a pattern object, you can call its methods to search a string

re.compile('[A]+[a-z]*', re.MULTILINE)
<class 're.Pattern'>


### 1) The `re.Pattern.search()` Method
- The `re.Pattern.search(string, pos=0, endpos=9223372036854775807)` method scans through the string looking for the first location where the pattern matches, and returns a corresponding match object. It returns `None` if no position in the string matches.

In [11]:
import re
str1 = "AAAAAa, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
str2 = "Mr. Araaif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
p = re.compile(r'[A]+[a-z]+') 
# rv = p.search(str1)
rv = p.search(str2)
print(rv)
print(type(rv))

<re.Match object; span=(4, 10), match='Araaif'>
<class 're.Match'>
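
The optional `pos` and `endpos` arguments, and the `None` return value, can be sketched as follows. The shortened sample string is an assumption made for this example.

```python
import re

p = re.compile(r"[A]+[a-z]+")
text = "Mr. Araaif, Jamil and Ahmad"

first = p.search(text)
# Resume scanning where the first match ended to find the next one
second = p.search(text, first.end())
# Limit the scan to the first four characters; nothing in "Mr. " matches
limited = p.search(text, 0, 4)

for m in (first, second, limited):
    # search() returns None on failure, so check before calling group()
    print(m.group() if m is not None else "no match")
```

Checking for `None` before calling `group()` avoids an `AttributeError` when the pattern is not found.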


### 2) The `re.Pattern.match()` Method
- The `re.Pattern.match(string, pos=0, endpos=9223372036854775807)` method tries to match zero or more characters at the beginning of the string, and returns a corresponding match object. It returns `None` if the beginning of the string does not match the pattern.

In [5]:
import re
str1 = "Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
str2 = "Mr. Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
p = re.compile(r'[A]+[a-z]*')
# rv = p.match(str1)   # would match 'Arif', which is at the beginning of str1
rv = p.match(str2)     # returns None because the pattern does not match at the beginning of str2
print(rv)
print(type(rv))

None
<class 'NoneType'>
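
For comparison, the sketch below runs `match()`, `search()`, and `fullmatch()` with the same pattern. `fullmatch()` appears in the `dir(re)` listing above and succeeds only when the entire string fits the pattern. The short input strings are made up for illustration.

```python
import re

p = re.compile(r"[A]+[a-z]*")

at_start = p.match("Arif Butt")        # matches: the string begins with 'Arif'
not_at_start = p.match("Mr. Arif")     # None: the string begins with 'Mr.'
anywhere = p.search("Mr. Arif")        # matches: search() scans the whole string
whole_fails = p.fullmatch("Arif Butt") # None: ' Butt' is left over
whole_ok = p.fullmatch("Arif")         # matches: the entire string fits

print(at_start.group())   # Arif
print(not_at_start)       # None
print(anywhere.group())   # Arif
print(whole_fails)        # None
print(whole_ok.group())   # Arif
```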


### 3) The `re.Pattern.findall()` Method
- The `re.Pattern.findall(string, pos=0, endpos=9223372036854775807)` method returns a list of all non-overlapping matches of the pattern in the string.
- It scans the string from left to right and returns the matches in the order found.
- If the pattern is not found, it returns an empty list.

In [14]:
import re

str1 = "Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
str2 = "Mr. Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."

p = re.compile(r'[A]+[a-z]*')

# rv = p.findall(str1)
rv = p.findall(str2)
print(rv)
# for i in rv:
#     print(i)
# print(type(rv))

['Arif', 'Ahmad', 'AAA', 'As', 'Arif']
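
One detail worth knowing is that the shape of the list returned by `findall()` depends on the capturing groups in the pattern. The sketch below shows this on a made-up two-date string.

```python
import re

# Hypothetical sample containing two dates
text = "01-04-1975 and 20/02/1969"

# No groups: the list holds the full matches
no_groups = re.findall(r"\d{2}[-/]\d{2}[-/]\d{4}", text)

# One group: the list holds only what that group captured
one_group = re.findall(r"\d{2}[-/]\d{2}[-/](\d{4})", text)

# Several groups: the list holds one tuple per match
many_groups = re.findall(r"(\d{2})[-/](\d{2})[-/](\d{4})", text)

# No match at all: an empty list
nothing = re.findall(r"xyz", text)

print(no_groups)     # ['01-04-1975', '20/02/1969']
print(one_group)     # ['1975', '1969']
print(many_groups)   # [('01', '04', '1975'), ('20', '02', '1969')]
print(nothing)       # []
```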


### 4) The `re.Pattern.finditer()` Method
- The `re.Pattern.finditer(string, pos=0, endpos=9223372036854775807)` method returns an iterator over all non-overlapping matches of the RE pattern in the string. For each match, the iterator yields a match object.

In [21]:
import re

str1 = "Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
str2 = "Mr. Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."

p = re.compile(r'[A]+[a-z]*')
rv = p.finditer(str1)
# print(rv)
# print(type(rv))
for i in rv:
    print(i)

<re.Match object; span=(0, 4), match='Arif'>
<re.Match object; span=(16, 21), match='Ahmad'>
<re.Match object; span=(60, 63), match='AAA'>
<re.Match object; span=(74, 76), match='As'>
<re.Match object; span=(78, 82), match='Arif'>


>- **Once we have the iterator of `Match` objects, we can iterate over it using a `for` loop.**
>- **Let us see how many match objects there are in this iterator named `rv`.**
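
One way to count the match objects is sketched below. The names `matches`, `first_pass`, and `second_pass` are chosen for this example only. Note that an iterator can be consumed just once, so the matches are copied into a list when they are needed more than once.

```python
import re

str1 = "Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
p = re.compile(r'[A]+[a-z]*')

# Copy the match objects into a list so they can be counted and reused
matches = list(p.finditer(str1))
count = len(matches)

# A bare iterator is exhausted after one pass
it = p.finditer(str1)
first_pass = sum(1 for _ in it)
second_pass = sum(1 for _ in it)

print(count)         # 5
print(first_pass)    # 5
print(second_pass)   # 0
```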

>- **Every match object has many associated methods.**
>- **Let us see different attributes of each match object using these methods.**

***a.*** The **`group()`** method of the match object returns one or more subgroups of the match (a str or a tuple), selected by index or name. With no argument, or with 0, it returns the entire match.

In [23]:
import re

str1 = "Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
str2 = "Mr. Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."

p = re.compile(r'[A]+[a-z]*')
rv = p.finditer(str1)
for i in rv:
    print(i.group())

Arif
Ahmad
AAA
As
Arif


***b.*** The **`span(group=0)`** method of the match object returns a 2-tuple containing the start and end index of the match (the end index is not inclusive).

In [24]:
import re

str1 = "Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
str2 = "Mr. Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."

p = re.compile(r'[A]+[a-z]*')
rv = p.finditer(str1)
for i in rv:
    print(i.span())

(0, 4)
(16, 21)
(60, 63)
(74, 76)
(78, 82)


***c.*** The **`start(group=0)`** method of the match object returns the index of the start of the substring matched by the group.

In [25]:
import re

str1 = "Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
str2 = "Mr. Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."

p = re.compile(r'[A]+[a-z]*')
rv = p.finditer(str1)
for i in rv:
    print(i.start())

0
16
60
74
78


***d.*** The **`end(group=0)`** method of the match object returns the index of the end of the substring matched by the group (one past the last matched character).

In [26]:
import re

str1 = "Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."
str2 = "Mr. Arif, Jamil and Ahmad are good at playing acrobatic games.  AAA is triple As. Arif Butt."

p = re.compile(r'[A]+[a-z]*')
rv = p.finditer(str1)
for i in rv:
    print(i.end())

4
21
63
76
82
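
The match object methods above also accept a group index or a group name. The sketch below assumes a date written as day-month-year; the group names `day`, `month`, and `year` and the sample sentence are made up for illustration. Named groups use the `(?P<name>...)` syntax.

```python
import re

# Hypothetical pattern with three named groups
p = re.compile(r"(?P<day>\d{2})-(?P<month>\d{2})-(?P<year>\d{4})")
m = p.search("Born on 01-04-1975 in Lahore")

whole = m.group()            # the entire match
by_index = m.group(1)        # first group, selected by index
by_name = m.group("year")    # a group selected by name
several = m.group(1, 3)      # several groups at once give a tuple
all_groups = m.groups()      # every group as a tuple
as_dict = m.groupdict()      # named groups as a dictionary
year_span = m.span("year")   # span() also accepts a group

print(whole)        # 01-04-1975
print(by_index)     # 01
print(by_name)      # 1975
print(several)      # ('01', '1975')
print(all_groups)   # ('01', '04', '1975')
print(as_dict)      # {'day': '01', 'month': '04', 'year': '1975'}
print(year_span)    # (14, 18)
```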


## Practical Examples

In [30]:
with open('datasets/names_addresses.txt', 'r') as df:
    print(df.read())

Mr. Arif Butt
615-555-7164
131 Model Town, Lahore
01-04-1975
arifpucit@gmail.com
http://www.arifbutt.me


Mrs. Nasira Jadoon
317.615.9124
33 Garden Town, Lahore
20/02/1969
nasira-123@gmail.com
http://nasira.pu.edu.pk



Mr. Khurram Shahzad
321#521#9254
69, A Wapda Town, Lahore
12.09.1985
khurram3@yahoo.com
https://www.khurram.pu.edu.pk


Ms Aqsa
123.555.1997
56 Joher Town, Lahore
12/08/2001
aqsa_007@gmail.com
http://youtube.com

Mr. B
321-555-4321
19 Township, Lahore
05-07-2002
mrB@yahoo.com
http://facebook.com


In [31]:
with open('datasets/names_addresses.txt', 'r') as df:
    teststring = df.read()

In [33]:
teststring

'Mr. Arif Butt\n615-555-7164\n131 Model Town, Lahore\n01-04-1975\narifpucit@gmail.com\nhttp://www.arifbutt.me\n\n\nMrs. Nasira Jadoon\n317.615.9124\n33 Garden Town, Lahore\n20/02/1969\nnasira-123@gmail.com\nhttp://nasira.pu.edu.pk\n\n\n\nMr. Khurram Shahzad\n321#521#9254\n69, A Wapda Town, Lahore\n12.09.1985\nkhurram3@yahoo.com\nhttps://www.khurram.pu.edu.pk\n\n\nMs Aqsa\n123.555.1997\n56 Joher Town, Lahore\n12/08/2001\naqsa_007@gmail.com\nhttp://youtube.com\n\nMr. B\n321-555-4321\n19 Township, Lahore\n05-07-2002\nmrB@yahoo.com\nhttp://facebook.com'

#### Searching Names

In [53]:
import re
p = re.compile(r'(Mr|Ms|Mrs)\.?\s\D+')
rv = p.finditer(teststring)
for i in rv:
    print(i.group())

Mr. Arif Butt

Mrs. Nasira Jadoon

Mr. Khurram Shahzad

Ms Aqsa

Mr. B
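
The blank lines in the output above appear because `\D+` also matches the newline after each name. A possible tightening, sketched under the assumption that names consist only of ASCII letters and spaces, replaces `\D+` with `[A-Za-z ]+`. The `sample` string is a shortened copy of the data shown earlier.

```python
import re

# Shortened copy of the test data, embedded so the example is self-contained
sample = (
    "Mr. Arif Butt\n615-555-7164\n"
    "Mrs. Nasira Jadoon\n317.615.9124\n"
    "Ms Aqsa\n123.555.1997\n"
    "Mr. B\n321-555-4321\n"
)

# [A-Za-z ]+ stops at the end of the line because it excludes the newline
p = re.compile(r"(Mr|Ms|Mrs)\.?\s[A-Za-z ]+")
names = [m.group() for m in p.finditer(sample)]

for name in names:
    print(name)
```

This prints one name per line without the trailing blank lines. Names containing hyphens, apostrophes, or non-ASCII letters would need a wider character class.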



#### Searching DoB

In [59]:
import re
p = re.compile(r'(\d{2}).(\d{2}).(\d{4})')

rv = p.finditer(teststring)
for i in rv:
#     print(i.group())
#     print(i.group(1))
#     print(i.group(2))
    print(i.group(3))

1975
1969
1985
2001
2002
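
In the pattern above, the unescaped `.` matches any character, so a plain run of ten digits would also be accepted as a date. A stricter variant, sketched here as one possible option, limits the separator to `-`, `/`, or `.` and adds word boundaries. The names `loose` and `strict` and the sample strings are assumptions for illustration.

```python
import re

loose = re.compile(r"(\d{2}).(\d{2}).(\d{4})")
strict = re.compile(r"\b(\d{2})[-/.](\d{2})[-/.](\d{4})\b")

# A ten-digit run is not a date, but the loose pattern accepts it
loose_hit = loose.search("Ref 1234567890")
strict_hit = strict.search("Ref 1234567890")

# Shortened copy of the test data: three dates and two phone numbers
sample = "615-555-7164\n01-04-1975\n20/02/1969\n12.09.1985\n321#521#9254\n"
years = [m.group(3) for m in strict.finditer(sample)]

print(loose_hit.group())   # 1234567890
print(strict_hit)          # None
print(years)               # ['1975', '1969', '1985']
```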


#### Searching emails and phone numbers

In [68]:
import re
p = re.compile(r'(\w+)@([a-z]+)(\.[a-zA-Z]{2,3})')
rv = p.finditer(teststring)
for i in rv:
#     print(i.group())
#     print(i.group(1))
#     print(i.group(2))
    print(i.group(3))

.com
.com
.com
.com
.com
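
Printing `group(1)` instead of `group(3)` would show that `\w+` cannot cross the hyphen in `nasira-123`. The sketch below compares the original pattern with a wider character class for the part before the `@`. The name `wider` and the character class `[\w.-]` are assumptions made for illustration, not a complete email validator.

```python
import re

# Shortened copy of the email addresses in the test data
sample = (
    "arifpucit@gmail.com\n"
    "nasira-123@gmail.com\n"
    "aqsa_007@gmail.com\n"
    "mrB@yahoo.com\n"
)

original = re.compile(r"(\w+)@([a-z]+)(\.[a-zA-Z]{2,3})")
# Also allow dots and hyphens before the @
wider = re.compile(r"([\w.-]+)@([a-z]+)(\.[a-zA-Z]{2,3})")

original_users = [m.group(1) for m in original.finditer(sample)]
wider_users = [m.group(1) for m in wider.finditer(sample)]

print(original_users)   # ['arifpucit', '123', 'aqsa_007', 'mrB']
print(wider_users)      # ['arifpucit', 'nasira-123', 'aqsa_007', 'mrB']
```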


In [71]:
import re
p = re.compile(r'(\d{3}).(\d{3}).(\d{4})')
rv = p.finditer(teststring)
for i in rv:
    print(i.group())
#     print(i.group(1))
#     print(i.group(2))
#     print(i.group(3))

615-555-7164
317.615.9124
321#521#9254
123.555.1997
321-555-4321
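
The `sub()` and `split()` methods listed earlier can be applied to the same phone numbers. The sketch below rewrites every number with `-` as the separator and splits one number into its parts. The shortened `sample` string is an assumption; `\1`, `\2`, and `\3` in the replacement refer to the three captured groups.

```python
import re

# Shortened copy of the phone numbers in the test data
sample = "615-555-7164\n317.615.9124\n321#521#9254\n"

p = re.compile(r"(\d{3}).(\d{3}).(\d{4})")

# Rebuild each number from its captured groups with a uniform separator
normalized = p.sub(r"\1-\2-\3", sample)

# Split a single number on any of the separators seen in the data
parts = re.split(r"[-.#]", "321#521#9254")

print(normalized)
print(parts)   # ['321', '521', '9254']
```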


#### Searching URLs

In [73]:
import re
p = re.compile(r'https?\://(www\.)?\w+\.\w+')
rv = p.finditer(teststring)
for i in rv:
    print(i.group())
#     print(i.group(1))
#     print(i.group(2))
#     print(i.group(3))

http://www.arifbutt.me
http://nasira.pu
https://www.khurram.pu
http://youtube.com
http://facebook.com
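
The output above cuts off hosts with more than two labels, such as `nasira.pu.edu.pk`, because the pattern allows only one dot after the host name. One possible fix, sketched here, repeats the dot-plus-label part with `+`. The `(?:...)` form is a non-capturing group, and the shortened `sample` string is an assumption for illustration.

```python
import re

# Shortened copy of the URLs in the test data
sample = (
    "http://www.arifbutt.me\n"
    "http://nasira.pu.edu.pk\n"
    "https://www.khurram.pu.edu.pk\n"
    "http://youtube.com\n"
)

# Scheme, optional 'www.', a first label, then one or more '.label' parts
p = re.compile(r"https?://(?:www\.)?[\w-]+(?:\.[\w-]+)+")
urls = [m.group() for m in p.finditer(sample)]

for url in urls:
    print(url)
```

This prints each URL in full. Paths, ports, and query strings are not covered by this sketch.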
