# Regular Expression
1. The languages accepted by finite automata can be described by simple expressions called regular expressions. They are a compact way to describe this class of languages.
2. The languages described by regular expressions are referred to as regular languages.
3. A regular expression can also be described as a sequence of characters that defines a pattern for matching strings.
4. Regular expressions are used to match character combinations in strings. String-searching algorithms use these patterns for find and find-and-replace operations on strings.

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.

RegEx can be used to check if a string contains the specified search pattern.
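
As a minimal sketch of that check (the sample text and patterns below are illustrative, not taken from the notebook), `re.search` returns a match object when the pattern occurs anywhere in the string and `None` otherwise:

```python
import re

text = "The rain in Spain"  # illustrative sample string

# re.search returns a Match object if the pattern occurs anywhere, else None
contains_ain = re.search(r"ain", text) is not None
contains_xyz = re.search(r"xyz", text) is not None

print(contains_ain)
print(contains_xyz)
```

Comparing against `None` turns the result into a plain boolean, which is convenient in `if` statements.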

## Python RegEx Library | re

**$**: Represents the end of the string </br>
**[]**: Defines a set of allowed characters inside the brackets </br>
**+**: At least one occurrence of the preceding character

In [2]:
pip install regex

Note: you may need to restart the kernel to use updated packages.


In [3]:
import re

`re` is part of Python's standard library, so it needs no installation; the `regex` package installed above is a separate third-party module that these examples do not use.

# Meta character
**Meta character: Description** </br>
`\`: Marks the next character as either a special character or a literal. For example, `n` matches the character n, whereas `\n` matches a newline character. The sequence `\\` matches \ and `\(` matches (. </br>
`^`: Matches the beginning of input. </br>
`$`: Matches the end of input. </br>
`*`: Matches the preceding character zero or more times. For example, `zo*` matches either z or zoo. </br>
`+`: Matches the preceding character one or more times. For example, `zo+` matches zoo but not z. </br>
`?`: Matches the preceding character zero or one time. For example, `a?ve?` matches the ve in never. </br>
`.`: Matches any single character except a newline character. </br>
`(pattern)`: Matches a pattern and remembers the match. In Python, the remembered substring can be retrieved from the match object with `group(1)`, `group(2)`, and so on. To match literal parentheses, use `\(` or `\)`. </br>
`x|y`: Matches either x or y. For example, `z|wood` matches z or wood, and `(z|w)oo` matches zoo or the woo in wood. </br>
`{n}`: n is a non-negative integer. Matches exactly n times. For example, `o{2}` does not match the o in Bob, but matches the first two os in foooood. </br>
`{n,}`: n is a non-negative integer. Matches the preceding character at least n times. For example, `o{2,}` does not match the o in Bob and matches all the os in foooood. The `o{1,}` expression is equivalent to `o+` and `o{0,}` is equivalent to `o*`. </br>
`{n,m}`: m and n are non-negative integers. Matches the preceding character at least n and at most m times. For example, `o{1,3}` matches the first three os in fooooood. The `o{0,1}` expression is equivalent to `o?`. </br>
`[xyz]`: A character set. Matches any one of the enclosed characters. For example, `[abc]` matches the a in plain. </br>
`[^xyz]`: A negative character set. Matches any character that is not enclosed. For example, `[^abc]` matches the p in plain. </br>
`[a-z]`: A range of characters. Matches any character in the specified range. For example, `[a-z]` matches any lowercase alphabetic character in the English alphabet. </br>
`[^m-z]`: A negative range of characters. Matches any character that is not in the specified range. For example, `[^m-z]` matches any character that is not in the range m through z. </br>
`\A`: Matches only at the beginning of a string. </br>
`\b`: Matches a word boundary, that is, the position between a word character and a non-word character. For example, `er\b` matches the er in never but not the er in verb. </br>
`\B`: Matches a non-word boundary. The `ea*r\B` expression matches the ear in never early. </br>
`\d`: Matches a digit character. </br>
`\D`: Matches a non-digit character. </br>
`\f`: Matches a form-feed character. </br>
`\n`: Matches a newline character. </br>
`\r`: Matches a carriage return character. </br>
`\s`: Matches any white space including spaces, tabs, form-feed characters, and so on. </br>
`\S`: Matches any non-white space character. </br>
`\t`: Matches a tab character. </br>
`\v`: Matches a vertical tab character. </br>
`\w`: Matches any word character including underscore. For ASCII text this is equivalent to `[A-Za-z0-9_]`; in Python 3 it also matches Unicode letters and digits unless `re.ASCII` is set. </br>
`\W`: Matches any non-word character. For ASCII text this is equivalent to `[^A-Za-z0-9_]`. </br>
`\z`: Matches only at the end of a string. Python's `re` uses `\Z` for this; lowercase `\z` is not accepted by older Python releases. </br>
`\Z`: In the source table, matches at the end of a string or before a newline character at the end. In Python's `re`, `\Z` matches only at the very end of the string; `$` is the one that also matches before a trailing newline. </br>

https://www.ibm.com/docs/en/rational-clearquest/9.0.1?topic=tags-meta-characters-in-regular-expressions
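
Several of the table's own examples can be checked directly in Python. The sketch below reuses the sample words from the table (`zoo`, `foooood`, `plain`); the variable names and the combined test string `"z zo zoo"` are only illustrative.

```python
import re

# "*" allows zero or more of the preceding character
star = re.findall(r"zo*", "z zo zoo")
# "+" requires at least one of the preceding character
plus = re.findall(r"zo+", "z zo zoo")

# "{2}" matches exactly two o's; "{2,}" greedily matches two or more
exactly_two = re.search(r"o{2}", "foooood").group()
two_or_more = re.search(r"o{2,}", "foooood").group()

# "[^abc]" matches the first character that is not a, b, or c
negated = re.search(r"[^abc]", "plain").group()

print(star, plus)
print(exactly_two, two_or_more, negated)
```

Quantifiers are greedy by default, which is why `o{2,}` consumes every consecutive `o` rather than stopping at two.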

In [63]:
a = "charlie and the chocolate factory"
b = "ayushi.hain@gmail.com"
c = "hello"
d = "xxy,yz,xyzz,xyyz,xxzzy,zyz,xxyyyzz,xyzxyz"

In [64]:
# search() --> scans the string for the pattern; returns None if no match is found

In [65]:
match = re.search(r".", b)
print(match)

# Output is 'a': an unescaped "." is a metacharacter that matches any character; escape it with a backslash to match a literal dot

<re.Match object; span=(0, 1), match='a'>


In [66]:
match = re.search(r"\.", b)
print(match)

# span=(6, 7) shows the start index and the (exclusive) end index of the match

<re.Match object; span=(6, 7), match='.'>


In [67]:
# Square bracket

In [68]:
match = re.search(r"[@]", b)
print(match)

<re.Match object; span=(11, 12), match='@'>


In [69]:
match = re.search(r"l",c )
print(match)

# There are two "l" characters in c, but search() returns only the first; use findall() to get all of them

<re.Match object; span=(2, 3), match='l'>


In [70]:
# findall() - returns a list of all matching substrings

In [71]:
match = re.findall(r"l", c)
print(match)

['l', 'l']


In [72]:
match = re.findall(r"a", b)
print(match)

['a', 'a', 'a']


In [73]:
match = re.search(r"^a",b )
print(match)



<re.Match object; span=(0, 1), match='a'>


In [74]:
match = re.search(r"^c",b )
print(match)


None


In [75]:
match = re.search(r"com$",b )
print(match)

<re.Match object; span=(18, 21), match='com'>


In [82]:
a = "charlie chaplin coa and the chocolate factory"
b = "ayushi.hain@gmail.com"
c = "hello"
d = "xxy,yz,xyzz,xyyz,xxzzy,zyz,xxyyyzz,xyzxyz"

In [83]:
match = re.findall(r"cha|fac", a)
print(match)

['cha', 'cha', 'fac']


In [84]:
match = re.findall(r"ch?a", a)
print(match)

['cha', 'cha']


In [85]:
e = "charlie chachchaplin coa and the chocolate factory"
match = re.findall(r"ch*a", e)
print(match)

['cha', 'cha', 'cha']


In [87]:
match = re.findall(r"xy+z", d)
print(match)

['xyz', 'xyyz', 'xyyyz', 'xyz', 'xyz']


In [91]:
match = re.findall(r"x{2,4}", d)
print(match)

['xx', 'xx', 'xx']


In [92]:
match = re.findall(r"y{2,4}", d)
print(match)

['yy', 'yyy']


In [94]:
match = re.findall(r"(x|y)yz", d)
print(match)

['x', 'y', 'y', 'x', 'x']
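
The output above contains single letters rather than three-character matches because, when a pattern has one capturing group, `findall` returns only the text captured by that group. A short sketch of the difference, using a non-capturing group `(?:...)` (the variable names are illustrative):

```python
import re

d = "xxy,yz,xyzz,xyyz,xxzzy,zyz,xxyyyzz,xyzxyz"

# With one capturing group, findall returns only the group's text
groups_only = re.findall(r"(x|y)yz", d)
# A non-capturing group (?:...) makes findall return the whole match
whole_matches = re.findall(r"(?:x|y)yz", d)

print(groups_only)
print(whole_matches)
```

Use the non-capturing form when the parentheses are needed only for grouping the alternation and the full matched text is what matters.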


# Special Sequence in RegEx

**A special sequence is a backslash followed by a letter. Some (`\A`, `\b`, `\B`, `\Z`) match a position in the string rather than a character; others (`\d`, `\w`, `\s` and their uppercase negations) are shorthand for common character classes.** </br>
They make commonly used patterns easier to write.

`\A`: Matches only at beginning of a string. </br>
`\b`: Matches a word boundary, that is, the position between a word and a space. For example, er\b matches the er in never but not the er in verb. </br>
`\B`: Matches a nonword boundary. The ea*r\B expression matches the ear in never early. </br>
`\d`: Matches a digit character. </br>
`\D`: Matches a non-digit character. </br>
`\f`: Matches a form-feed character. </br>
`\n`: Matches a newline character. </br>
`\r`: Matches a carriage return character. </br>
`\s`: Matches any white space including spaces, tabs, form-feed characters, and so on. </br>
`\S`: Matches any non-white space character. </br>
`\t`: Matches a tab character. </br>
`\v`: Matches a vertical tab character. </br>
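`\w`: Matches any word character including underscore. For ASCII text this is equivalent to `[A-Za-z0-9_]`; in Python 3 it also matches Unicode letters and digits unless `re.ASCII` is set. </br>
`\W`: Matches any non-word character. For ASCII text this is equivalent to `[^A-Za-z0-9_]`. </br>
`\z`: Matches only at the end of a string. Python's `re` uses `\Z` for this; lowercase `\z` is not accepted by older Python releases. </br>
`\Z`: In Python's `re`, matches only at the very end of the string. (`$` is the one that also matches before a trailing newline.) </br>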
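
The `er\b` example from the list can be tried out as follows. This is a small sketch; the sample string `"never verb er"` and the variable names are illustrative.

```python
import re

text = "never verb er"  # illustrative sample string

# er\b: "er" must be followed by a word boundary (space or end of string)
at_word_end = [m.span() for m in re.finditer(r"er\b", text)]
# er\B: "er" must be followed by another word character
inside_word = [m.span() for m in re.finditer(r"er\B", text)]

print(at_word_end)
print(inside_word)
```

`\b` and `\B` consume no characters, so the reported spans cover only the two letters `er`.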

In [95]:
a = "harry potter"

In [97]:
match = re.search(r"\Ahar", a)
print(match)

<re.Match object; span=(0, 3), match='har'>


In [98]:
match = re.search(r"\bh", a)
print(match)

<re.Match object; span=(0, 1), match='h'>


In [100]:
match = re.search(r"ha\B", a)
print(match)

<re.Match object; span=(0, 2), match='ha'>


In [102]:
f = "harry1 potter2"
match = re.search(r"\d", f)
print(match)

<re.Match object; span=(5, 6), match='1'>


In [103]:
f = "harry1 potter2"
match = re.findall(r"\d", f)
print(match)

['1', '2']


In [104]:
f = "harry1 potter2"
match = re.findall(r"\D", f)
print(match)

['h', 'a', 'r', 'r', 'y', ' ', 'p', 'o', 't', 't', 'e', 'r']


In [105]:
f = "harry1 potter2"
match = re.findall(r"\w", f)
print(match)

['h', 'a', 'r', 'r', 'y', '1', 'p', 'o', 't', 't', 'e', 'r', '2']


In [106]:
f = "harry1 potter2"
match = re.findall(r"\W", f)
print(match)

[' ']


In [116]:
f = "harry1 potter2"
match = re.findall(r"\s", f)
print(match)

[' ']


In [108]:
f = "harry1 potter2"
match = re.findall(r"\S", f)
print(match)

['h', 'a', 'r', 'r', 'y', '1', 'p', 'o', 't', 't', 'e', 'r', '2']


In [117]:
f = "harry1 potter2"
match = re.findall(r"2\Z", f)
print(match)

['2']


In [118]:
f = "harry1 potter2"
match = re.search(r"2\Z", f)
print(match)

<re.Match object; span=(13, 14), match='2'>
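
In Python's `re` module, `\Z` and `$` differ when the string ends with a newline: `$` can also match just before that final newline, while `\Z` matches only at the very end. A minimal sketch with an illustrative string:

```python
import re

text = "first line\n"  # illustrative string ending in a newline

# "$" can match just before a trailing newline
dollar = re.search(r"line$", text)
# "\Z" matches only at the very end of the string, after the newline
strict = re.search(r"line\Z", text)

print(dollar is not None)
print(strict is not None)
```

This matters when validating lines read from a file, which often still carry their trailing newline.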


# RegEx Set

**A set is a group of characters inside a pair of square brackets [ ] that matches any one of those characters**

`[atx]`: Returns a match where one of the specified characters (a, t, or x) is present. </br> </br>
`[a-h]`: Returns a match for any lowercase character alphabetically between a and h. </br> </br>
`[^atx]`: Returns a match for any character EXCEPT a, t, and x. </br> </br>
`[0123]`: Returns a match where any of the specified digits (0, 1, 2, or 3) is present. </br> </br>
`[0-9]`: Returns a match for any digit between 0 and 9. </br> </br>
`[0-7][0-9]`: Returns a match for any two-digit number from 00 to 79. </br> </br>
`[a-zA-Z]`: Returns a match for any letter between a and z, lowercase or uppercase. </br> </br>
`[+]`: In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string
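
The last point can be illustrated with a short sketch (the arithmetic string is an illustrative input): inside a set the characters `+`, `*` and `.` are literal, whereas outside a set they must be escaped.

```python
import re

expr = "3+4*2.5"  # illustrative arithmetic string

# Inside [...] the characters + * . are literal; no escaping is needed
symbols = re.findall(r"[+*.]", expr)
# Outside a set, the same characters must be escaped to be literal
escaped = re.findall(r"\+|\*|\.", expr)

print(symbols)
print(escaped)
```

A few characters do keep a special role inside a set: `^` at the start (negation), `-` between two characters (range), `]` and the backslash.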

In [119]:
a = "charlie chaplin coa and the chocolate factory"
b = "ayushi.hain@gmail.com"
c = "hello"
d = "xxy,yz,xyzz,xyyz,xxzzy,zyz,xxyyyzz,xyzxyz"

In [120]:
match = re.findall(r"[atx]", a)
print(match)

['a', 'a', 'a', 'a', 't', 'a', 't', 'a', 't']


In [121]:
match = re.findall(r"[^atx]", a)
print(match)

['c', 'h', 'r', 'l', 'i', 'e', ' ', 'c', 'h', 'p', 'l', 'i', 'n', ' ', 'c', 'o', ' ', 'n', 'd', ' ', 'h', 'e', ' ', 'c', 'h', 'o', 'c', 'o', 'l', 'e', ' ', 'f', 'c', 'o', 'r', 'y']


In [124]:
match = re.findall(r"[a-t]", a)
print(match)

['c', 'h', 'a', 'r', 'l', 'i', 'e', 'c', 'h', 'a', 'p', 'l', 'i', 'n', 'c', 'o', 'a', 'a', 'n', 'd', 't', 'h', 'e', 'c', 'h', 'o', 'c', 'o', 'l', 'a', 't', 'e', 'f', 'a', 'c', 't', 'o', 'r']


In [127]:
n = "123@john#$@6773"
match = re.findall(r"[123]", n)
print(match)

['1', '2', '3', '3']


In [128]:
n = "123@john#$@6773"
match = re.findall(r"[0-9]", n)
print(match)

['1', '2', '3', '6', '7', '7', '3']


In [129]:
n = "123@john#$@6773"
match = re.findall(r"[0-7][0-9]", n)
print(match)

['12', '67', '73']


In [130]:
n = "123@john#$@HELLO6773"
match = re.findall(r"[a-zA-Z]", n)
print(match)

['j', 'o', 'h', 'n', 'H', 'E', 'L', 'L', 'O']


# Functions in RegEx

#### The `findall()` Function:
Returns all non-overlapping matches of the pattern in the string as a list of strings. The string is scanned left to right, and matches are returned in the order found.

#### The `compile()` Function:
Compiles a regular expression into a pattern object, which has methods for operations such as searching for matches or performing string substitutions.

#### The `search()` Function:
The `search()` function searches the string for a match and returns a Match object if there is one. If there is more than one match, only the first occurrence is returned. If no match is found, `None` is returned.

#### The `split()` Function:
The `split()` function returns a list where the string has been split at each match.
You can control the number of splits by specifying the `maxsplit` parameter.

#### The `sub()` Function:
The `sub()` function replaces the matches with the text of your choice. You can control the number of replacements by specifying the `count` parameter.

#### The `subn()` Function:
`subn()` is similar to `sub()` in all ways except its output. It returns a tuple containing the new string and the number of replacements made, rather than just the string.

#### The `escape()` Function:
Returns the string with regular expression special characters escaped with backslashes. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.


**Implementation of all Functions**

In [135]:
a = """John has scored 89 marks
Lisa has scored 90 marks
David has scored 70 marks"""

**Find all the marks**

In [139]:
print(re.findall(r"\d+", a))
# \d matches a digit
# + means one or more of the preceding element, so \d+ matches a whole number

['89', '90', '70']


**Find all the names**

In [142]:
print(re.findall("[A-Z][a-z]*", a))

['John', 'Lisa', 'David']


**Compile**

In [143]:
p = re.compile("[a-z]")
print(re.findall(p, a))

['o', 'h', 'n', 'h', 'a', 's', 's', 'c', 'o', 'r', 'e', 'd', 'm', 'a', 'r', 'k', 's', 'i', 's', 'a', 'h', 'a', 's', 's', 'c', 'o', 'r', 'e', 'd', 'm', 'a', 'r', 'k', 's', 'a', 'v', 'i', 'd', 'h', 'a', 's', 's', 'c', 'o', 'r', 'e', 'd', 'm', 'a', 'r', 'k', 's']


In [144]:
p = re.compile("[0-9]")
print(re.findall(p, a))

['8', '9', '9', '0', '7', '0']


In [145]:
p = re.compile(r"\d")
print(re.findall(p, a))

['8', '9', '9', '0', '7', '0']


In [147]:
# One or more consecutive digits

In [148]:
p = re.compile(r"\d+")
print(re.findall(p, a))

['89', '90', '70']


**split**

In [149]:
print(re.split(r"\d+", a))

['John has scored ', ' marks\nLisa has scored ', ' marks\nDavid has scored ', ' marks']


In [150]:
g = "Fantastic 5 turtles"
print(re.split(r"\d+", g))

['Fantastic ', ' turtles']
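
The `maxsplit` parameter mentioned in the description of `split()` limits how many splits are performed. A minimal sketch using the same multi-line string as above:

```python
import re

a = """John has scored 89 marks
Lisa has scored 90 marks
David has scored 70 marks"""

# maxsplit=1 stops after the first split; the remainder stays in one piece
parts = re.split(r"\d+", a, maxsplit=1)

print(parts)
```

With `maxsplit=1` the result has exactly two elements: the text before the first number and everything after it.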


**Sub Function**

In [151]:
print(re.sub(r"\s+", "", g))

Fantastic5turtles


**subn**
1. Same as `sub()`, but returns the output as a tuple
2. The tuple also gives the number of replacements made

In [152]:
print(re.subn(r"\s+", "", g))

('Fantastic5turtles', 2)


In [153]:
print(re.sub(r"\s+", "", a))

Johnhasscored89marksLisahasscored90marksDavidhasscored70marks
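
The `count` parameter of `sub()` limits the number of replacements, and `subn()` reports how many were made. A short sketch on the same string `g` (the replacement text `"-"` is an illustrative choice):

```python
import re

g = "Fantastic 5 turtles"

# count=1 replaces only the first run of whitespace
first_only = re.sub(r"\s+", "", g, count=1)
# subn returns (new_string, number_of_replacements)
result, n = re.subn(r"\s+", "-", g)

print(first_only)
print(result, n)
```

Unpacking the tuple from `subn()` is a convenient way to check whether anything was replaced at all.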


**escape**

In [154]:
print(re.escape(a))

John\ has\ scored\ 89\ marks\
Lisa\ has\ scored\ 90\ marks\
David\ has\ scored\ 70\ marks
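
`escape()` is most useful when the text to search for comes from data rather than from a hand-written pattern. In the sketch below, the price tag and the surrounding sentence are illustrative inputs:

```python
import re

price_tag = "$9.99 (sale)"  # illustrative literal containing metacharacters
text = "Now only $9.99 (sale) while stocks last"

# Without escaping, "$", ".", "(" and ")" would be read as regex syntax
pattern = re.escape(price_tag)
match = re.search(pattern, text)

print(match.group())
print(match.span())
```

The escaped pattern matches the tag character for character, so the result is the same as a plain substring search but can be combined with other regex pieces.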


**search**
1. Returns the first match only

In [155]:
print(re.search(r"\d+", a))

<re.Match object; span=(16, 18), match='89'>


# Match Object In RegEx

**A `Match` object contains information about the search and its result. If no match is found, `None` is returned instead.**
</br></br>
The Match object has attributes and methods used to retrieve information about the search and the result:
</br></br>
`.span()` returns a tuple containing the start and end positions of the match.</br>
`.string` returns the string passed into the function.</br>
`.group()` returns the part of the string where there was a match.</br>

In [159]:
a = "John has scored 98 marks"

In [160]:
match = re.search(r"\d+", a)
print(match)

<re.Match object; span=(16, 18), match='98'>


In [161]:
print(match.re)

re.compile('\\d+')


In [162]:
# The original string that was searched
print(match.string)

John has scored 98 marks


In [164]:
# Getting starting index
print(match.start())
# Getting ending index
print(match.end())

16
18


In [165]:
# Getting starting and ending index
print(match.span())

(16, 18)


In [166]:
# Return the substring that was matched by search()
print(match.group())

98


In [168]:
match = re.search(r"\w{2} s", a)
print(match)

<re.Match object; span=(6, 10), match='as s'>


In [169]:
# Return the substring that was matched by search()
print(match.group())

as s
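
When the pattern contains capturing groups, the Match object also gives access to each group separately. A minimal sketch on the same sentence (the two-group pattern is an illustrative addition, not part of the cells above):

```python
import re

a = "John has scored 98 marks"

# Two capturing groups: the name and the score
match = re.search(r"([A-Z][a-z]+) has scored (\d+)", a)

print(match.group())   # the whole match
print(match.group(1))  # text of the first group
print(match.group(2))  # text of the second group
print(match.groups())  # all groups as a tuple
print(match.span(2))   # start and end index of the second group
```

`group(0)` is the same as `group()`, and `span()` accepts a group number in the same way.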


# Phone Number and Email Verification and Web Scraping Using RegEx

**Phone Number**

In [171]:
phone = "0316-2538566"
if re.search(r"\d{4}-\d{7}", phone):
    print("Verified Number")
else:
    print("Invalid Number")

Verified Number
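
One point worth keeping in mind when validating input: `re.search` succeeds if the pattern occurs anywhere in the string, whereas `re.fullmatch` requires the whole string to match. A small sketch of the difference (the helper name `is_valid_phone` and the second sample input are hypothetical):

```python
import re

PHONE_PATTERN = re.compile(r"\d{4}-\d{7}")

def is_valid_phone(text):
    """Return True only if the entire string has the form NNNN-NNNNNNN."""
    return PHONE_PATTERN.fullmatch(text) is not None

print(is_valid_phone("0316-2538566"))
print(is_valid_phone("call 0316-2538566"))
# search() still finds the number embedded in the longer string
print(PHONE_PATTERN.search("call 0316-2538566") is not None)
```

For verification, the stricter `fullmatch` is usually the safer choice; `search` is better suited to extracting numbers from longer text.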


**Email**

In [197]:
# emails = """john78@gmail.com
# john@.com
# david.989@yahoo.com"""


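emails = "john78@gmail.com    john@.com    david.989@yahoo.com"
# No space is allowed inside {m,n}; the domain needs "+" and the dot must be escaped
email_pattern = r"[\w._%]{1,20}@[\w-]+\.[A-Za-z]{2,3}"
found = re.findall(email_pattern, emails)
print(found)
print(len(found))

['john78@gmail.com', 'david.989@yahoo.com']
2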


# Web Scraping

In [199]:
import urllib.request
from re import findall

**URL for Scraping**

summet.com/dmsi/html/codesamples/addresses.html
</br>
The goal is to extract the `phone numbers` from the above `url`.

In [201]:
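url = "https://www.summet.com/dmsi/html/codesamples/addresses.html"
# Download the page and decode the bytes into text
with urllib.request.urlopen(url) as response:
    htmlstr = response.read().decode()
# Match numbers of the form (257) 563-7401
phn = findall(r"\(\d{3}\) \d{3}-\d{4}", htmlstr)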

for i in phn:
    print(i)

(257) 563-7401
(372) 587-2335
(786) 713-8616
(793) 151-6230
(492) 709-6392
(654) 393-5734
(404) 960-3807
(314) 244-6306
(947) 278-5929
(684) 579-1879
(389) 737-2852
(660) 663-4518
(608) 265-2215
(959) 119-8364
(468) 353-2641
(248) 675-4007
(939) 353-1107
(570) 873-7090
(302) 259-2375
(717) 450-4729
(453) 391-4650
(559) 104-5475
(387) 142-9434
(516) 745-4496
(326) 677-3419
(746) 679-2470
(455) 430-0989
(490) 936-4694
(985) 834-8285
(662) 661-1446
(802) 668-8240
(477) 768-9247
(791) 239-9057
(832) 109-0213
(837) 196-3274
(268) 442-2428
(850) 676-5117
(861) 546-5032
(176) 805-4108
(715) 912-6931
(993) 554-0563
(357) 616-5411
(121) 347-0086
(304) 506-6314
(425) 288-2332
(145) 987-4962
(187) 582-9707
(750) 558-3965
(492) 467-3131
(774) 914-2510
(888) 106-8550
(539) 567-3573
(693) 337-2849
(545) 604-9386
(221) 156-5026
(414) 876-0865
(932) 726-8645
(726) 710-9826
(622) 594-1662
(948) 600-8503
(605) 900-7508
(716) 977-5775
(368) 239-8275
(725) 342-0650
(711) 993-5187
(882) 399-5084
(287) 755-
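
To experiment with the same extraction without a network connection, a small hard-coded string can stand in for the downloaded page. In this sketch the HTML snippet and the names in it are made up for illustration, and capturing groups are added so the area code can be read separately:

```python
import re

# Illustrative stand-in for the downloaded page
html_snippet = "<li>Ann (257) 563-7401</li><li>Bob (372) 587-2335</li>"

# Groups capture the area code, exchange, and line number separately
phone_re = re.compile(r"\((\d{3})\) (\d{3})-(\d{4})")

numbers = [m.group() for m in phone_re.finditer(html_snippet)]
area_codes = [m.group(1) for m in phone_re.finditer(html_snippet)]

print(numbers)
print(area_codes)
```

`finditer` is used here instead of `findall` because it yields Match objects, which give access to both the full match and the individual groups.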

`Syed Bakhtawar Fahim` </br>
**bakhtawarfahim10@gmail.com**