# Regular Expression

**What is a Regular Expression?**

[Documentation](https://www.w3schools.com/python/python_regex.asp)

Python provides a built-in module called `re` for working with regular expressions. It is used to process large amounts of text and extract data from it.

In [28]:
import re

### RegEx Functions
**The `re` module offers a set of functions that allow us to search a string for a match:**

<table>
  <tr>
    <th>Function</th>
    <th>Description</th>
  </tr>
  <tr>
    <td><a href="https://www.w3schools.com/python/python_regex.asp#findall">findall</a></td>
    <td>Returns a list containing all matches</td>
  </tr>
  <tr>
    <td><a href="https://www.w3schools.com/python/python_regex.asp#search">search</a></td>
    <td>Returns a Match object if there is a match anywhere in the string</td>
  </tr>
  <tr>
    <td><a href="https://www.w3schools.com/python/python_regex.asp#split">split</a></td>
    <td>Returns a list where the string has been split at each match</td>
  </tr>
  <tr>
    <td><a href="https://www.w3schools.com/python/python_regex.asp#sub">sub</a></td>
    <td>Replaces one or many matches with a string</td>
  </tr>
</table>


### Metacharacters
Metacharacters are characters with a special meaning:
<table>
  <tr>
    <th>Character</th>
    <th>Description</th>
    <th>Example</th>
  </tr>
  <tr>
    <td>[]</td>
    <td>A set of characters</td>
    <td>"[a-m]"</td>
  </tr>
  <tr>
    <td>\</td>
    <td>Signals a special sequence (can also be used to escape special characters)</td>
    <td>"\d"</td>
  </tr>
  <tr>
    <td>.</td>
    <td>Any character (except newline character)</td>
    <td>"he..o"</td>
  </tr>
  <tr>
    <td>^</td>
    <td>Starts with</td>
    <td>"^hello"</td>
  </tr>
  <tr>
    <td>$</td>
    <td>Ends with</td>
    <td>"planet$"</td>
  </tr>
  <tr>
    <td>*</td>
    <td>Zero or more occurrences</td>
    <td>"he.*o"</td>
  </tr>
  <tr>
    <td>+</td>
    <td>One or more occurrences</td>
    <td>"he.+o"</td>
  </tr>
  <tr>
    <td>?</td>
    <td>Zero or one occurrences</td>
    <td>"he.?o"</td>
  </tr>
  <tr>
    <td>{}</td>
    <td>Exactly the specified number of occurrences</td>
    <td>"he.{2}o"</td>
  </tr>
  <tr>
    <td>|</td>
    <td>Either or</td>
    <td>"falls|stays"</td>
  </tr>
  <tr>
    <td>()</td>
    <td>Capture and group</td>
    <td></td>
  </tr>
</table>


### Special Sequences
**A special sequence is a `\` followed by one of the characters in the list below, and has a special meaning:**
<table>
  <tr>
    <th>Character</th>
    <th>Description</th>
    <th>Example</th>
  </tr>
  <tr>
    <td>\A</td>
    <td>Returns a match if the specified characters are at the beginning of the string</td>
    <td>"\AThe"</td>
  </tr>
  <tr>
    <td>\b</td>
    <td>Returns a match where the specified characters are at the beginning or at the end of a word
(the "r" in the beginning is making sure that the string is being treated as a "raw string")</td>
    <td>r"\bain"
    r"ain\b"</td>
  </tr>
  <tr>
    <td>\B</td>
    <td>Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word
(the "r" in the beginning is making sure that the string is being treated as a "raw string")</td>
    <td>r"\Bain"
    r"ain\B"</td>
  </tr>
  <tr>
    <td>\d</td>
    <td>Returns a match where the string contains digits (numbers from 0-9)</td>
    <td>"\d"</td>
  </tr>
  <tr>
    <td>\D</td>
    <td>Returns a match where the string DOES NOT contain digits</td>
    <td>"\D"</td>
  </tr>
  <tr>
    <td>\s</td>
    <td>Returns a match where the string contains a white space character</td>
    <td>"\s"</td>
  </tr>
  <tr>
    <td>\S</td>
    <td>Returns a match where the string DOES NOT contain a white space character</td>
    <td>"\S"</td>
  </tr>
  <tr>
    <td>\w</td>
    <td>Returns a match where the string contains any word characters (characters from a to Z, digits from 0-9, and the underscore _ character)</td>
    <td>"\w"</td>
  </tr>
  <tr>
    <td>\W</td>
    <td>Returns a match where the string DOES NOT contain any word characters</td>
    <td>"\W"</td>
  </tr>
  <tr>
    <td>\Z</td>
    <td>Returns a match if the specified characters are at the end of the string</td>
    <td>"Spain\Z"</td>
  </tr>
</table>


### Sets
**A set is a set of characters inside a pair of square brackets `[]` with a special meaning:**
<table>
  <tr>
    <th>Set</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>[arn]</td>
    <td>Returns a match where one of the specified characters (a, r, or n) is present</td>
  </tr>
  <tr>
    <td>[a-n]</td>
    <td>Returns a match for any lower case character, alphabetically between a and n</td>
  </tr>
  <tr>
    <td>[^arn]</td>
    <td>Returns a match for any character EXCEPT a, r, and n</td>
  </tr>
  <tr>
    <td>[0123]</td>
    <td>Returns a match where any of the specified digits (0, 1, 2, or 3) are present</td>
  </tr>
  <tr>
    <td>[0-9]</td>
    <td>Returns a match for any digit between 0 and 9</td>
  </tr>
  <tr>
    <td>[0-5][0-9]</td>
    <td>Returns a match for any two-digit number from 00 to 59</td>
  </tr>
  <tr>
    <td>[a-zA-Z]</td>
    <td>Returns a match for any character alphabetically between a and z, lower case OR upper case</td>
  </tr>
  <tr>
    <td>[+]</td>
    <td>In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string</td>
  </tr>
</table>


### The `findall()` Function
The `findall()` function returns a `list` containing all matches.


In [29]:
import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

['ai', 'ai']


The list contains the matches in the order they are found.

If no matches are found, an `empty list` is returned:

In [30]:
txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)

[]


### The `search()` Function
The `search()` function searches the string for a match, and returns a [Match object](https://www.w3schools.com/python/python_regex.asp#matchobject) if there is a match.

If there is more than one match, only the `first occurrence` of the match will be returned:

In [31]:
txt = "The rain in Spain"
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())
x

The first white-space character is located in position: 3


<re.Match object; span=(3, 4), match=' '>

If no matches are found, the value `None` is returned:

In [32]:
txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)

None
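
Because `search()` can return `None`, calling a method such as `.start()` on the result without checking it first raises an `AttributeError`. A minimal sketch of one way to guard against that is shown here; the helper name `first_whitespace_position` and the `-1` return value are illustrative choices, not part of the `re` module.

```python
import re

def first_whitespace_position(text):
    """Return the index of the first whitespace character, or -1 if there is none."""
    match = re.search(r"\s", text)
    if match is None:  # no match: match.start() would raise AttributeError here
        return -1
    return match.start()

print(first_whitespace_position("The rain in Spain"))  # 3
print(first_whitespace_position("Spain"))              # -1
```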


### The `split()` Function
The `split()` function returns a list where the string has been split at each match:



In [33]:
txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Spain']


You can control the number of occurrences by specifying the `maxsplit` parameter:

In [34]:
txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)

['The', 'rain in Spain']
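
Splitting on `\s` splits at every single white-space character, so repeated spaces produce empty strings in the result. The sketch below, using a made-up sample string with repeated spaces, compares `\s` with `\s+` (one or more white-space characters).

```python
import re

txt = "The  rain   in Spain"  # sample text with repeated spaces (illustrative)

single = re.split(r"\s", txt)   # split at every single whitespace character
runs = re.split(r"\s+", txt)    # split at every run of whitespace

print(single)  # ['The', '', 'rain', '', '', 'in', 'Spain']
print(runs)    # ['The', 'rain', 'in', 'Spain']
```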


### The `sub()` Function
The `sub()` function `replaces the matches` with the text of your choice:

In [35]:
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

The9rain9in9Spain


You can control the number of replacements by specifying the `count` parameter:

In [36]:
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)
print(x)

The9rain9in Spain


### Match Object
A Match Object is an object containing `information` about the search and the result.

*Note: If there is no match, the value `None` will be returned, instead of the Match Object.*

In [37]:
txt = "The rain in Spain"
x = re.search("ai", txt)
print(x) #this will print an object

<re.Match object; span=(5, 7), match='ai'>


The Match object has properties and methods used to retrieve information about the search, and the result:

`.span()` returns a tuple containing the start and end positions of the match.

`.string` returns the string passed into the function

`.group()` returns the part of the string where there was a match

In [38]:
txt = "The rain in Spain" 
x = re.search(r"\bS\w+", txt)
print(x.span()) # The second value of the tuple is exclusive
print(x.string)
print(x.group())

(12, 17)
The rain in Spain
Spain
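
The exclusive end position means the span can be used directly as a slice of the original string. A small sketch of that relationship, which also uses the `.start()` and `.end()` methods of the Match object:

```python
import re

txt = "The rain in Spain"
m = re.search(r"\bS\w+", txt)

start, end = m.span()
print(m.start(), m.end())            # 12 17
print(txt[start:end])                # Spain
print(txt[start:end] == m.group())   # True
```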


### Metacharacters

**Examples**

`[]` -->	A set of characters	 --> `"[a-m]"`

In [39]:
txt = "The rain in Spain"

In [40]:
x = re.findall("[a-m]", txt)
x

['h', 'e', 'a', 'i', 'i', 'a', 'i']

`\` -->	Signals a special sequence (can also be used to escape special characters)	--> `"\d"`

In [41]:
txt = "That will be 59 dollars"

#Find all digit characters:

x = re.findall(r"\d", txt)
x

['5', '9']

`.`	--> Any character (except newline character) --> `"he..o"`

In [42]:
txt = "hello planet"

#Search for a sequence that starts with "he",
# followed by two (any) characters, and an "o":

x = re.findall("he..o", txt)
x

['hello']

`^` -->	Starts with -->	`"^hello"`

In [43]:
txt = "hello planet"

#Check if the string starts with 'hello':

x = re.findall("^hello", txt)
if x:
  print("Yes, the string starts with 'hello'")
else:
  print("No match")

Yes, the string starts with 'hello'


`$` -->	Ends with -->	`"planet$"`

In [116]:
txt = "hello planet"

#Check if the string ends with 'planet':

x = re.findall("planet$", txt)
if x:
  print("Yes, the string ends with 'planet'")
else:
  print("No match")

Yes, the string ends with 'planet'


`*` -->	Zero or more occurrences --> `"he.*o"`

In [117]:
txt = "hello planet"

#Search for a sequence that starts with "he",
# followed by 0 or more (any) characters, and an "o":

x = re.findall("he.*o", txt)
x

['hello']
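
By default `*` is greedy: it matches as many characters as it can. Adding `?` after a quantifier makes it lazy, so it matches as few as possible. The sketch below uses a different sample string (an assumption made only for illustration) where the two behaviours differ.

```python
import re

txt = "hello world"  # sample text containing two "o" characters

greedy = re.findall(r"he.*o", txt)    # .* runs up to the LAST "o"
lazy = re.findall(r"he.*?o", txt)     # .*? stops at the FIRST "o"

print(greedy)  # ['hello wo']
print(lazy)    # ['hello']
```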

`+` --> One or more occurrences --> `"he.+o"`

In [46]:
txt = "hello planet"

# Search for a sequence that starts with "he",
# followed by 1 or more  (any) characters, and an "o":

x = re.findall("he.+o", txt)
x

['hello']

`?` -->	Zero or one occurrences -->	`"he.?o"`

In [47]:
txt = "hello planet"

#Search for a sequence that starts with "he", followed by 0 or 1
# (any) character, and an "o":

x = re.findall("he.?o", txt)

x
#This time we got no match, because there were not zero,
# not one, but two characters between "he" and the "o"

[]
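
For contrast, here is a sketch with a made-up sample string in which `"he.?o"` does match, because zero or one character sits between "he" and "o":

```python
import re

txt = "heo helo hello"  # illustrative sample text

x = re.findall(r"he.?o", txt)
print(x)  # ['heo', 'helo']  ("hello" has two characters between "he" and "o")
```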

`{}` --> Exactly the specified number of occurrences --> `"he.{2}o"`

In [48]:
txt = "hello planet"

#Search for a sequence that starts with "he",
# followed by exactly 2 (any) characters, and an "o":

x = re.findall("he.{2}o", txt)
x

['hello']
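
The braces also accept a range: `{m,n}` means between m and n occurrences, and `{m,}` means m or more. A short sketch on an illustrative string of digit groups:

```python
import re

txt = "1 22 333 4444"  # illustrative sample text

two_to_three = re.findall(r"\d{2,3}", txt)   # 2 or 3 digits
two_or_more = re.findall(r"\d{2,}", txt)     # 2 or more digits

print(two_to_three)  # ['22', '333', '444']
print(two_or_more)   # ['22', '333', '4444']
```

Note that `\d{2,3}` takes only the first three digits of `4444`; the single remaining digit is too short to match.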

`|` --> Either or -->	`"falls|stays"`

In [54]:
txt = "The rain in Spain falls mainly in the plain!"

#Check if the string contains either "falls" or "stays":

x = re.findall("falls|stays", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['falls']
Yes, there is at least one match!
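
The `()` metacharacter from the table has no example in this section. The following is a minimal sketch of capturing groups, using a pattern chosen only for illustration: each pair of parentheses captures part of the match, which can then be read with `.group(n)`.

```python
import re

txt = "The rain in Spain"

# Two groups: the word character before "ain", and "ain" itself
m = re.search(r"(\w)(ain)", txt)
print(m.group())    # rain  (the whole match)
print(m.group(1))   # r     (first group)
print(m.group(2))   # ain   (second group)

# With groups in the pattern, findall() returns a list of tuples, one per match
pairs = re.findall(r"(\w)(ain)", txt)
print(pairs)        # [('r', 'ain'), ('p', 'ain')]
```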


### Special Sequences
`\A`

Returns a match if the specified characters are at the beginning of the string.

`"\AThe"`

In [55]:
txt = "The rain in Spain"

#Check if the string starts with "The":

x = re.findall(r"\AThe", txt)

print(x)

if x:
  print("Yes, there is a match!")
else:
  print("No match")

['The']
Yes, there is a match!


`\b`

Returns a match where the specified characters are at the beginning or at the end of a word
(the "r" in the beginning is making sure that the string is being treated as a "raw string")

`r"\bain"`
`r"ain\b"`

In [118]:
txt = "The rain in Spain"

#Check if "ain" is present at the beginning of a WORD:

x = re.findall(r"\bain", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

#Check if "ain" is present at the end of a WORD:

x = re.findall(r"ain\b", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match
['ain', 'ain']
Yes, there is at least one match!


`\B`

Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word
(the "r" in the beginning is making sure that the string is being treated as a "raw string")

`r"\Bain"  r"ain\B"`

In [62]:
txt = "The rain in Spain"

#Check if "ain" is present, but NOT at the beginning of a word:

x = re.findall(r"\Bain", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


#Check if "ain" is present, but NOT at the end of a word:

x = re.findall(r"ain\B", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['ain', 'ain']
Yes, there is at least one match!
[]
No match


`\d`

Returns a match where the string contains digits (numbers from 0-9)	

`"\d"`

In [119]:
txt = "The rain in Spain"

#Check if the string contains any digits (numbers from 0-9):

x = re.findall(r"\d", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match


`\D`

Returns a match where the string DOES NOT contain digits

`"\D"`

In [70]:
txt = "The Spain 123 ABC"

#Return a match at every non-digit character:

x = re.findall(r"\D", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['T', 'h', 'e', ' ', 'S', 'p', 'a', 'i', 'n', ' ', ' ', 'A', 'B', 'C']
Yes, there is at least one match!


`\s`

Returns a match where the string contains a white space character	

`"\s"`

In [71]:
txt = "The rain in Spain"

#Return a match at every white-space character:

x = re.findall(r"\s", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[' ', ' ', ' ']
Yes, there is at least one match!


`\S`	

Returns a match where the string DOES NOT contain a white space character

`"\S"`

In [104]:
txt = "The rain in Spain"

#Return a match at every NON white-space character:

x = re.findall(r"\S", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['T', 'h', 'e', 'r', 'a', 'i', 'n', 'i', 'n', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


`\w`

Returns a match where the string contains any word characters (characters from a to Z, digits from 0-9, and the underscore _ character)

`"\w"`

In [109]:
txt = " Spain 123 !@#$%^&*()__--"

#Return a match at every word character 
# (characters from a to Z, digits from 0-9, and the 
# underscore _ character):

x = re.findall(r"\w", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['S', 'p', 'a', 'i', 'n', '1', '2', '3', '_', '_']
Yes, there is at least one match!


`\W`

Returns a match where the string DOES NOT contain any word characters

`"\W"`

In [111]:
txt = "The rain in Spain !@#$%^&*()_-"

#Return a match at every NON word character 
# (characters NOT between a and Z. Like "!", "?" white-space etc.):

x = re.findall(r"\W", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[' ', ' ', ' ', ' ', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '-']
Yes, there is at least one match!


`\Z`

Returns a match if the specified characters are at the end of the string

`"Spain\Z"`

In [115]:
txt = "The rain in Spain"

#Check if the string ends with "Spain":

x = re.findall(r"Spain\Z", txt)

print(x)

if x:
  print("Yes, there is a match!")
else:
  print("No match")

['Spain']
Yes, there is a match!


### Sets
`[arn]`	Returns a match where one of the specified characters `(a, r, or n)` is present

In [120]:
txt = "The rain in Spain"

#Check if the string has any a, r, or n characters:

x = re.findall("[arn]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['r', 'a', 'n', 'n', 'a', 'n']
Yes, there is at least one match!


`[a-n]`	Returns a match for any lower case character, alphabetically between `a and n`

In [123]:
txt = "abc opqrs ABC OPQRS"

#Check if the string has any characters between a and n:

x = re.findall("[a-n]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['a', 'b', 'c']
Yes, there is at least one match!


`[^arn]`	Returns a match for any character EXCEPT `a, r, and n`

In [125]:
txt = "arn ARN bcd"

#Check if the string has other characters than a, r, or n:

x = re.findall("[^arn]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


[' ', 'A', 'R', 'N', ' ', 'b', 'c', 'd']
Yes, there is at least one match!


`[0123]`	Returns a match where any of the specified digits `(0, 1, 2, or 3)` are present

In [126]:
text = "0783 132"

x = re.findall("[0123]", text)
x

['0', '3', '1', '3', '2']

`[0-9]`	Returns a match for any digit between `0 and 9`

In [128]:
text = "ABC 01234 DEF 56789"

x = re.findall("[0-9]", text)
x

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

`[0-5][0-9]`	Returns a match for any two-digit number from `00 to 59`

In [130]:
text = "ABC 45 5555 60 456 DEF"
x = re.findall("[0-5][0-9]", text)
x

['45', '55', '55', '45']
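
The output above shows that `[0-5][0-9]` also matches inside longer numbers such as `5555` and `456`. If only standalone two-digit numbers are wanted, one option is to combine the set with the `\b` word boundary described earlier; the sketch below reuses the same text.

```python
import re

text = "ABC 45 5555 60 456 DEF"

# \b on both sides requires the two digits to form a whole "word"
x = re.findall(r"\b[0-5][0-9]\b", text)
print(x)  # ['45']
```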

`[a-zA-Z]`	Returns a match for any character alphabetically between `a and z, lower case OR upper case`

In [132]:
text = "ABC abc 123465 !@#$$"
x = re.findall("[a-zA-Z]", text)
x

['A', 'B', 'C', 'a', 'b', 'c']

`[+]`	In sets, `+, *, ., |, (), $, {}` have no special meaning, so `[+]` means: return a match for any + character in the string

In [136]:
txt = "8 times before 11:45 AM, my number is +910000000"

#Check if the string has any + characters:

x = re.findall("[+]", txt)

print(x)

if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['+']
Yes, there is at least one match!
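
The same rule applies to `.`: outside a set it matches any character, while inside a set it matches only a literal dot. A brief sketch on an illustrative string:

```python
import re

txt = "1.5 125 1x5"  # illustrative sample text

any_char = re.findall(r"1.5", txt)       # "." matches any character
literal_dot = re.findall(r"1[.]5", txt)  # "[.]" matches only a real dot

print(any_char)     # ['1.5', '125', '1x5']
print(literal_dot)  # ['1.5']
```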


# Email
In this section we validate email addresses and extract email addresses from a file.

Email validation using a regular expression

In [142]:
import re

def is_valid_email(email):
    # A bare '.' is a metacharacter (any character), so the literal dot before the TLD is escaped as '\.'
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return bool(re.match(pattern, email))

# Example usage:
email1 = "user.+%@example.com"
email2 = "user@example"
email3 = "user.example.com"
print(is_valid_email(email1))  # True
print(is_valid_email(email2))  # False
print(is_valid_email(email3))  # False

True
False
False


In this example, the `is_valid_email()` function takes an email address as input and returns `True` if it is valid and `False` otherwise. The regular expression pattern `r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'` matches a string that starts with one or more alphanumeric characters, dots, underscores, percentage signs, plus signs, or hyphens, followed by an at sign (`@`), followed by one or more alphanumeric characters, dots, or hyphens, followed by a dot (.), followed by two or more alphabetical characters.

Note that this regular expression is a simplified version of the actual email address specification, which is much more complex. However, it should be sufficient for most practical purposes. Also note that this regular expression does not check whether the email address actually exists or is currently in use; it only checks whether it is well-formed according to the syntax rules.
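
One subtlety of anchoring with `$`: it also matches just before a trailing newline, so a string such as `"user@example.com\n"` passes the check. A possible alternative, sketched here under the assumption that a trailing newline should be rejected, is `re.fullmatch()`, which requires the whole string to match. The names `EMAIL_RE` and `is_valid_email_strict` are hypothetical.

```python
import re

# Same character classes as the pattern described above, without ^ and $
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_valid_email_strict(email):
    """Return True only if the entire string is an email-like address."""
    return EMAIL_RE.fullmatch(email) is not None

anchored = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"

print(is_valid_email_strict("user@example.com"))          # True
print(is_valid_email_strict("user@example.com\n"))        # False
print(bool(re.match(anchored, "user@example.com\n")))     # True: $ matches before the final newline
```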

Read the file `email_list.txt` and extract the email addresses from it.

In [146]:
# open the text file for reading
with open('files/email_list.txt', 'r') as f:
    # Read the contents of the file
    text = f.read()

# Define a regular expression pattern to match email addresses
pattern = r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b'

emails = re.findall(pattern, text)
print(emails)

['john@example.com', 'jane.doe1234@gmail.com', 'info@company.com', 'support@company.com', 'sales@company.com']
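
For readers who do not have `files/email_list.txt`, the same extraction can be tried on an in-memory string. The sketch below also removes duplicates while keeping the order of first appearance; the sample text and the names `EMAIL_PATTERN` and `extract_emails` are made up for illustration.

```python
import re

EMAIL_PATTERN = re.compile(r'\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b')

def extract_emails(text):
    """Return the unique email-like strings in text, in order of first appearance."""
    seen = []
    for email in EMAIL_PATTERN.findall(text):
        if email not in seen:
            seen.append(email)
    return seen

# Hypothetical sample text standing in for the file contents
sample = "Contact john@example.com or sales@company.com. Again: john@example.com"
emails = extract_emails(sample)
print(emails)  # ['john@example.com', 'sales@company.com']
```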


Detect phone numbers in the same file:

In [153]:
with open("files/email_list.txt", "r") as f:
    text = f.read()

# An optional leading '+' followed by at least 10 digits
pattern = r'\+?[0-9]{10,}'
x = re.findall(pattern, text)
x

['+914561237895',
 '8974561230',
 '4578912563',
 '1234578945',
 '+14578964859',
 '+1434857695482']

In [154]:
import re

# Define a regular expression pattern to match phone numbers
pattern = r'\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{10}'

# Sample text to search for phone numbers
text = "My phone number is 123-456-7890. Call me at (123) 456-7890 or 1234567890"

# Use the findall() function from the re module to extract all phone numbers from the text
phone_numbers = re.findall(pattern, text)

# Print the extracted phone numbers
print(phone_numbers)


['123-456-7890', '(123) 456-7890', '1234567890']
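
The three results are the same number written in different formats. If a single canonical form is needed, one simple approach (a sketch, not the only option) is to strip every non-digit character with `re.sub()` and the `\D` sequence covered earlier:

```python
import re

phone_numbers = ['123-456-7890', '(123) 456-7890', '1234567890']

# \D matches any non-digit character; replacing each with "" keeps only the digits
normalized = [re.sub(r"\D", "", number) for number in phone_numbers]
print(normalized)  # ['1234567890', '1234567890', '1234567890']
```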
