<center><h1>𝕽𝖊𝖌𝖚𝖑𝖆𝖗 𝖊𝖝𝖕𝖗𝖊𝖘𝖘𝖎𝖔𝖓</h1></center>

Regular expressions are `text patterns` that define the form a `text string` should have. 
Among other uses, they make it possible to do the following:
* Check whether an input honors a given pattern; for example, we can check whether 
a value entered in an HTML form is a valid e-mail address
* Look for a pattern in a piece of text; for example, check whether 
either the word "color" or the word "colour" appears in a document 
with just one scan
* Extract specific portions of a text; for example, extract the postal code 
of an address
* Replace portions of text; for example, change any appearance of "color" 
or "colour" to "red"
* Split a larger text into smaller pieces; for example, split a text at any 
appearance of the dot, comma, or newline characters
* Extract specific items, such as e-mail addresses, postal addresses, or phone numbers, from a large corpus
* Please note that matching is `case sensitive` by default
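
To make the list above concrete, here is a minimal sketch of each activity using Python's `re` module. The sample texts and the e-mail pattern are assumptions made up for illustration; the e-mail pattern in particular is deliberately simplified and is not a full validator.

```python
import re

text = "The colour of the sky and the color of the sea."

# Check: does the whole input follow a (very simplified) e-mail shape?
looks_like_email = re.fullmatch(r"[^@\s]+@[^@\s]+\.[a-zA-Z]+", "user@example.com") is not None

# Search: find "color" or "colour" in a single scan.
found = re.findall(r"colou?r", text)

# Extract: pull a five-digit postal code out of an address.
postal = re.search(r"\d{5}", "12 Main Street, 90210 Beverly Hills").group()

# Replace: swap both spellings for "red".
replaced = re.sub(r"colou?r", "red", text)

# Split: break a text on dots, commas, or newlines.
pieces = re.split(r"[.,\n]", "a,b.c\nd")

# Matching is case sensitive unless a flag such as re.IGNORECASE is passed.
case_sensitive = re.findall(r"abc", "abc ABC")
case_insensitive = re.findall(r"abc", "abc ABC", flags=re.IGNORECASE)

print(looks_like_email, found, postal)
print(replaced)
print(pieces, case_sensitive, case_insensitive)
```

Each of these functions is covered in more detail in the rest of this notebook.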

Before going into detail, let's recall the names of the symbols first! 

* Backslash \
* Caret ^
* Dollar sign $
* Dot .
* Pipe symbol |
* Question mark ?
* Asterisk *
* Plus sign +
* Opening parenthesis (
* Closing parenthesis )
* Opening square bracket [
* Opening curly brace {

We are going to use these symbols throughout our regular expressions. 

**Let's start our first regular expression**

Finding a target word in a large text: if you want to search for a specific word in a big corpus, you can use the pattern syntax shown in the first code cell below. 


In a pattern such as `file?.xml` or `file*.xml`, we can find two kinds of components: **literals** (`file` and `.xml`) 
and **metacharacters** (`?` or `*`).

A regular expression is a pattern of text that consists of ordinary characters 
(for example, letters a through z or numbers 0 through 9) and special characters 
known as metacharacters. 

In [24]:
import re  # regular expression 

test_string = '123abc456789abc123ABC'

# pattern 
pattern = re.compile(r'abc')  # why the r prefix? See below

matches = pattern.finditer(test_string)  # finditer returns an iterator over all matches in the string

for match in matches:  # iterate over the matches
    print(match.group())  # group() returns the matched string

abc
abc


In [23]:
# alternative way to run the same search!

matches = re.finditer(r'abc', test_string)  # specifying pattern inside the finditer function! 
for match in matches: 
    print(match)
    
print("\nUsing group function:\n", match.group())

<re.Match object; span=(3, 6), match='abc'>
<re.Match object; span=(12, 15), match='abc'>

Using group function:
 abc


**Why r?**

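`r` marks a raw string. In a raw string, Python leaves backslashes untouched instead of interpreting them as escape sequences (such as `\t` for a tab), so the regex engine receives the pattern exactly as written. 

Look at the example below: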

In [18]:
a = '\t learny' 
b = r'\t learny'

print(a)  # without r: \t is interpreted as a tab
print(b, 'is printed exactly as written inside the quotes')  # with r: the backslash is kept

	 learny
\t learny is printed exactly as written inside the quotes
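
The same effect matters inside patterns. The following is a small sketch (the sample text is made up) of what can go wrong without the `r` prefix: in a normal string, `'\b'` is a backspace character, while the regex engine expects `\b` to mean a word boundary.

```python
import re

text = "a cat sat"

# Without r, Python turns '\b' into a backspace character (\x08) before
# the regex engine ever sees it, so the pattern looks for backspaces.
plain = re.search('\bcat\b', text)

# With r, the backslash reaches the regex engine, where \b means "word boundary".
raw = re.search(r'\bcat\b', text)

print(plain)                   # None
print(raw.group())             # cat
print(len('\t'), len(r'\t'))   # 1 2
```

For this reason it is a good habit to write every pattern as a raw string.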


**Performing matches with compiled objects**

Once we have compiled our pattern, we can search for it in the text / string that we want to look up.

* `match()`: Determine if the RE matches at the beginning of the string.

* `search()`: Scan through a string, looking for any location where this RE matches.

* `findall()`: Find all substrings where the RE matches, and return them as a list.

* `finditer()`: Find all substrings where the RE matches, and return them as an iterator.

In [25]:
# finditer() 

my_string = 'I love NLP and I love to code'
pattern = re.compile(r'code')
matches = pattern.finditer(my_string)
for match in matches:
    print(match)
    print(match.span(), match.start(), match.end())
    print(match.group()) # returns the string

<re.Match object; span=(25, 29), match='code'>
(25, 29) 25 29
code


Code Explanation: 
```Python 
my_string = 'I love NLP and I love to code'  # input text
pattern = re.compile(r'code')                # pattern to find in text 
matches = pattern.finditer(my_string)        # finditer returns an iterator over the matches
for match in matches:                        # loop over the matches
    print(match)                                     # the Match object itself
    print(match.span(), match.start(), match.end())  # span() -> (start, end) tuple, start() -> start index, end() -> end index
    print(match.group())                             # returns the matched string
```

In [32]:
# findall() 
pattern = re.compile(r'code')
matches = pattern.findall(my_string)  # findall returns a list of all extracted values! 
matches  # returns a list 

['code']

In [34]:
# match()
pattern = re.compile(r'code')
match = pattern.match(my_string)  # match() only succeeds if the RE matches at the beginning of the string
print(match)  

None


In [36]:
# search() 
match = pattern.search(my_string)  # Scan through a string, looking for any location where this RE matches.
print(match)

<re.Match object; span=(25, 29), match='code'>
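
Because `match()` and `search()` return `None` when nothing is found, it is safer to test the result before calling methods on it. Here is a minimal sketch; the helper name `describe` and the sample strings are assumptions for illustration only.

```python
import re

pattern = re.compile(r'code')

def describe(text):
    """Return the spans found by match() and search(), or None for each miss."""
    at_start = pattern.match(text)    # only looks at the beginning of the string
    anywhere = pattern.search(text)   # scans the whole string
    return (
        at_start.span() if at_start else None,
        anywhere.span() if anywhere else None,
    )

starts_with_code = describe('code is fun')
ends_with_code = describe('I love to code')
no_code = describe('no match here')

print(starts_with_code)  # ((0, 4), (0, 4))
print(ends_with_code)    # (None, (10, 14))
print(no_code)           # (None, None)
```

Calling `.group()` directly on a `None` result would raise an `AttributeError`, which is why the guard is there.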


**Methods on a Match object**

* `group():` Return the string matched by the RE
* `start():` Return the starting position of the match
* `end():` Return the ending position of the match
* `span():` Return a tuple containing the (start, end) positions of the match

In [38]:
# Example for methods on a match object! 
my_string = 'I love NLP and I love to code'
pattern = re.compile(r'code')
matches = pattern.finditer(my_string)
for match in matches:
    print(match)
    print(match.span(), match.start(), match.end())
    print(match.group()) # returns the string

<re.Match object; span=(25, 29), match='code'>
(25, 29) 25 29
code


Let's look at an important concept: metacharacters! 

### **Meta characters**
* Metacharacters are characters with a special meaning.
* All metacharacters: `. ^ $ * + ? { } [ ] \ | ( )`
* Metacharacters need to be escaped (with `\`) if we actually want to search for the character itself.

* `.` Any character (except newline character), e.g. `he..o`
* `^` Starts with, e.g. `^hello`
* `$` Ends with, e.g. `world$`
* `*` Zero or more occurrences, e.g. `aix*`
* `+` One or more occurrences, e.g. `aix+`
* `{ }` Exactly the specified number of occurrences, e.g. `al{2}`
* `[]` A set of characters, e.g. `[a-m]`
* `\` Signals a special sequence (can also be used to escape special characters), e.g. `\d`
* `|` Either or, e.g. `falls|stays`
* `( )` Capture and group
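
Here is a short sketch that tries a few of the example patterns from the list above and shows why escaping matters. The sample strings are made up; `re.escape` is the standard-library helper that escapes metacharacters for you.

```python
import re

# . matches any single character (except newline)
dot_matches = re.findall(r'he..o', 'hello helio hero')

# ^ anchors at the start, $ anchors at the end
starts = re.search(r'^hello', 'hello world') is not None
starts_elsewhere = re.search(r'^hello', 'say hello') is not None
ends = re.search(r'world$', 'hello world') is not None

# | means either/or
either = re.findall(r'falls|stays', 'The rain falls; the sun stays')

# Escaping: + is a metacharacter, so it must be escaped to match a literal "+"
unescaped = re.search(r'1+1=2', '1+1=2')             # reads as "one or more 1, then 1=2"
escaped = re.search(r'1\+1=2', '1+1=2')              # literal plus sign
auto_escaped = re.search(re.escape('1+1=2'), '1+1=2')

print(dot_matches, starts, starts_elsewhere, ends, either)
print(unescaped, escaped.group(), auto_escaped.group())
```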

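Examples of asterisk and question mark as file wildcards: 

Outside of regular expressions, it's not uncommon to find the asterisk `(*)` or the question mark `(?)` used as wildcards to find files. Note that these shell-style wildcards do not behave like the regex metacharacters `*` and `?`. 

As a file wildcard, the question mark will match a single character with any value in a filename. 
For example, a pattern such as file`?`.xml will match file`1`.xml, file`2`.xml, 
and file`3`.xml, but it won't match file`99`.xml, as the pattern expresses that 
anything that starts with file, followed by just one character of any value, 
and ends with .xml, will be matched.

A similar meaning is defined for the asterisk `(*)`. When the asterisk is used, any number of 
characters with any value is accepted. In the case of file*.xml, anything that starts 
with file, followed by any number of characters of any value, and finishes with 
.xml, will be matched, like file99.xml, file1234.xml etc. 

In a regular expression, the equivalent patterns are `file.\.xml` (exactly one character) and `file.*\.xml` (any number of characters), because `.` stands for "any character" while `?` and `*` only quantify the element before them.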
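
The file patterns above are shell-style wildcards rather than regular expressions. As a rough sketch of the difference, the snippet below compares them using the standard-library `fnmatch` module and `re.fullmatch`; the file names are made up for illustration.

```python
import fnmatch
import re

names = ['file1.xml', 'file2.xml', 'file99.xml', 'file.xml']

# Shell-style wildcards: ? is exactly one character, * is any number of characters
glob_single = [n for n in names if fnmatch.fnmatchcase(n, 'file?.xml')]
glob_many = [n for n in names if fnmatch.fnmatchcase(n, 'file*.xml')]

# Regex equivalents: . is one character, .* is any number, \. is a literal dot
regex_single = [n for n in names if re.fullmatch(r'file.\.xml', n)]
regex_many = [n for n in names if re.fullmatch(r'file.*\.xml', n)]

# Read as a regex, file?.xml means: "fil", an optional "e", any character, then "xml"
as_regex_file1 = re.fullmatch(r'file?.xml', 'file1.xml')
as_regex_odd = re.fullmatch(r'file?.xml', 'filxxml')

print(glob_single, regex_single)
print(glob_many, regex_many)
print(as_regex_file1, as_regex_odd is not None)
```

So the same characters give different results depending on whether they are read as a wildcard or as a regular expression.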

**More Meta Characters**

A special sequence is a `\` followed by one of the characters in the list below, and has a special meaning:

* `\d` : Matches any `decimal digit`; for ASCII text this is equivalent to the class `[0-9]`.
* `\D` : Matches any `non-digit character`; for ASCII text this is equivalent to the class `[^0-9]`.
* `\s` : Matches any `whitespace character`.
* `\S` : Matches any `non-whitespace character`.
* `\w` : Matches any `alphanumeric` (word) character; for ASCII text this is equivalent to the class `[a-zA-Z0-9_]`.
* `\W` : Matches any `non-alphanumeric` character; for ASCII text this is equivalent to the class `[^a-zA-Z0-9_]`.
* `\b` : Returns a match where the specified characters are at the `beginning` or at the `end` of a word, e.g. `r"\bain"`, `r"ain\b"`
* `\B` : Returns a match where the specified characters are `present`, but NOT at the beginning (or at the end) of a word, e.g. `r"\Bain"`, `r"ain\B"`
* `\A` : Returns a match if the specified characters are at the `beginning` of the string, e.g. `r"\AThe"`
* `\Z` : Returns a match if the specified characters are at the `end` of the string, e.g. `r"Spain\Z"`
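
Here is a small sketch of the boundary and negated sequences from the list, plus the difference the `re.ASCII` flag makes for `\w`. The sample strings are assumptions chosen for illustration.

```python
import re

text = 'The rain in Spain'

word_start = re.findall(r'\bain', text)   # "ain" at the start of a word: none here
word_end = re.findall(r'ain\b', text)     # "ain" at the end of a word
not_at_start = re.findall(r'\Bain', text) # "ain" NOT at the start of a word

chunks = re.findall(r'\S+', 'a b  c')     # runs of non-whitespace characters
non_word = re.findall(r'\W', 'hi, you!')  # everything that is not a word character

# By default \w also matches non-ASCII letters; re.ASCII restricts it to [a-zA-Z0-9_]
unicode_words = re.findall(r'\w+', 'caf\u00e9')
ascii_words = re.findall(r'\w+', 'caf\u00e9', flags=re.ASCII)

print(word_start, word_end, not_at_start)
print(chunks, non_word)
print(unicode_words, ascii_words)
```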

Let's see some examples: 

In [43]:
# Examples! 
test_string = 'www.wikipedia.com'
pattern = re.compile(r'\.')  # \. -> the backslash escapes the dot, so it matches a literal "." instead of any character
matches = pattern.finditer(test_string)

for match in matches:
    print(match.group())

.
.


In [46]:
# Examples!
test_string = 'hello 123_ heyho hohey'
pattern = re.compile(r'\d')  # matches any decimal digit
matches = pattern.finditer(test_string)
for match in matches:
    print(match.group())
    
# (or)
matches = pattern.findall(test_string) 
print('\nfindall:\n', matches)

1
2
3

findall:
 ['1', '2', '3']


In [52]:
# Examples! 
test_string = "I love to code along with song" 
pattern = re.compile(r'\s')  # matches any whitespace character 
matches = pattern.findall(test_string) 
print('All matches:', matches)

# If you want to find the start and end index of each whitespace! 
matches = pattern.finditer(test_string) 
for match in matches: 
    print('\ncharacter:', match) 
    print('start index:', match.start()) 
    print('end index:', match.end())

All matches: [' ', ' ', ' ', ' ', ' ', ' ']

character: <re.Match object; span=(1, 2), match=' '>
start index: 1
end index: 2

character: <re.Match object; span=(6, 7), match=' '>
start index: 6
end index: 7

character: <re.Match object; span=(9, 10), match=' '>
start index: 9
end index: 10

character: <re.Match object; span=(14, 15), match=' '>
start index: 14
end index: 15

character: <re.Match object; span=(20, 21), match=' '>
start index: 20
end index: 21

character: <re.Match object; span=(25, 26), match=' '>
start index: 25
end index: 26


In [56]:
# Examples!
test_string = 'hello 123_ heyho hohey'
pattern = re.compile(r'\D')  # matches any non digit character 
matches = pattern.findall(test_string)
# matches

In [59]:
# Examples! 
test_string = 'I want to get 1st rank'
pattern = re.compile(r'\w')  # matches any alphanumeric (word) character 
matches = pattern.findall(test_string)
# matches

In [64]:
# Examples! 
test_string = 'I love to walk alone I' 
pattern = re.compile(r'\bI')  # \b returns a match where the specified characters are at the beginning or at the end of a word
matches = pattern.finditer(test_string) 
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='I'>
<re.Match object; span=(21, 22), match='I'>


In [67]:
# Examples! 
test_string = "Always rocking"
pattern = re.compile(r'\AAlways')  # \A ->  Returns a match if the specified characters are at the beginning of the string
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(0, 6), match='Always'>


In [74]:
# Examples! 
test_string = 'This is the meaning of nothing _'
pattern = re.compile(r'_\Z')
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(31, 32), match='_'>


### **Sets**

* A set is a group of characters inside a pair of square brackets `[]`; it matches any single character from that group. Multiple characters and ranges can be placed back to back, e.g. `[aA-Z]`.
* A `^` (caret) at the start of a set negates it.
* A `-` (dash) in a set specifies a range if it sits between two characters; otherwise it stands for the dash itself.
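
The position of the dash and the caret inside the brackets changes their meaning. Here is a minimal sketch of the three rules above, using made-up sample strings.

```python
import re

text = 'a-b-c'

range_set = re.findall(r'[a-c]', text)       # dash between two characters: the range a to c
dash_literal = re.findall(r'[ac-]', text)    # dash at the end: a, c, or a literal dash
negated = re.findall(r'[^a-c]', text)        # caret first: anything except a to c
caret_literal = re.findall(r'[a^]', 'a^b')   # caret not first: a or a literal caret

print(range_set)      # ['a', 'b', 'c']
print(dash_literal)   # ['a', '-', '-', 'c']
print(negated)        # ['-', '-']
print(caret_literal)  # ['a', '^']
```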

In [75]:
# Examples! 
test_string = 'hello 123_'
pattern = re.compile(r'[a-z]')  # matches each lowercase letter from a to z; capital letters are not matched
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='h'>
<re.Match object; span=(1, 2), match='e'>
<re.Match object; span=(2, 3), match='l'>
<re.Match object; span=(3, 4), match='l'>
<re.Match object; span=(4, 5), match='o'>


In [76]:
# Examples! 
test_string = 'I love to Read Books'
pattern = re.compile(r'[A-Z]')  # matches only the capital letters from A to Z
matches = pattern.finditer(test_string)
for match in matches:
    print(match)

<re.Match object; span=(0, 1), match='I'>
<re.Match object; span=(10, 11), match='R'>
<re.Match object; span=(15, 16), match='B'>


In [77]:
# Examples! 
test_string = 'I read lots of NLP books between 4am to 4pm' 
pattern = re.compile(r'[0-9]')  # matches any single digit from 0 to 9 
matches = pattern.findall(test_string) 
matches

['4', '4']

In [80]:
#Examples! 
test_string = 'I love to Read Books'
pattern = re.compile(r'[a-zA-Z]')  # matches all letters from a to z in both cases
matches = pattern.finditer(test_string)
for match in matches:
    # print(match)
    pass

In [83]:
# Examples! 
test_string = "I love to cycling" 
pattern = re.compile(r'[^ing]')  # ^ inside a set matches any character except the listed ones (here i, n, g)
matches = pattern.finditer(test_string) 
# for match in matches: print(match)   ## uncomment this! 

In [85]:
# Examples! 
test_string = 'you are rocking' 
pattern = re.compile(r'[rocking]')  # matches any one of the listed characters (r, o, c, k, i, n, g)
matches = pattern.finditer(test_string)
for match in matches: print(match)

<re.Match object; span=(1, 2), match='o'>
<re.Match object; span=(5, 6), match='r'>
<re.Match object; span=(8, 9), match='r'>
<re.Match object; span=(9, 10), match='o'>
<re.Match object; span=(10, 11), match='c'>
<re.Match object; span=(11, 12), match='k'>
<re.Match object; span=(12, 13), match='i'>
<re.Match object; span=(13, 14), match='n'>
<re.Match object; span=(14, 15), match='g'>


`Examples overview:`

- `[arn]` Returns a match where one of the specified characters (a, r, or n) is present
- `[a-n]` Returns a match for any lower case character, alphabetically between a and n
- `[^arn]` Returns a match for any character EXCEPT a, r, and n
- `[0123]` Returns a match where any of the specified digits (0, 1, 2, or 3) is present
- `[0-9]` Returns a match for any digit between 0 and 9
- `[0-5][0-9]` Returns a match for any two-digit number from 00 to 59
- `[a-zA-Z]` Returns a match for any character alphabetically between a and z, lower case OR upper case
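
Two sets placed side by side match two consecutive characters. Here is a small sketch of the `[0-5][0-9]` entry, with sample numbers made up for illustration. Adding `\b` on both sides is one possible way to avoid matching part of a longer number.

```python
import re

text = '07 59 60 99 123'

# Two consecutive sets: first digit 0-5, second digit 0-9
two_digits = re.findall(r'[0-5][0-9]', text)

# Same pattern, but only when it forms a whole "word" on its own
whole_numbers = re.findall(r'\b[0-5][0-9]\b', text)

print(two_digits)      # ['07', '59', '12']  ("12" is part of "123")
print(whole_numbers)   # ['07', '59']
```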

Let's look at some nice examples and extract dates now!

In [86]:
dates = '01.04.2020'  

pattern = re.compile(r'\d\d\.\d\d\.\d\d\d\d')  # \d matches a digit; \. matches a literal dot (a bare . would match any character)
matches = re.findall(pattern, dates) 
matches

['01.04.2020']

Similarly, you can retrieve dates in other formats:

In [89]:
dates = '2020_04_04' 

pattern = re.compile(r'\d\d\d\d[_]\d\d[_]\d\d')  # \d matches a digit; [_] matches only the underscore
matches = re.findall(pattern, dates) 
matches               

['2020_04_04']

In [91]:
dates = '2020/04/02' 

pattern = re.compile(r'\d\d\d\d[/]\d\d[/]\d\d')  # [/] matches the / character
matches = re.findall(pattern, dates) 
matches

['2020/04/02']

In [102]:
# Final Examples! 
dates = '''
2020.04.01

2020-04-01
2020-05-23
2020-06-11
2020-07-11
2020-08-11

2020/04/02

2020_04_04
2020_04_04
'''

# If you want to match all of these date formats! 
pattern = re.compile(r'\d\d\d\d[-./_]\d\d[-./_]\d\d') # [-./_] matches any one of the separators -, ., / or _
matches = pattern.findall(dates)
matches

['2020.04.01',
 '2020-04-01',
 '2020-05-23',
 '2020-06-11',
 '2020-07-11',
 '2020-08-11',
 '2020/04/02',
 '2020_04_04',
 '2020_04_04']

### Quantifiers

* `*` : 0 or more
* `+` : 1 or more
* `?` : 0 or 1, used when a character can be optional
* `{4}` : exact number
* `{4,6}` : range of numbers (min, max)
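
To compare the quantifiers side by side, here is a minimal sketch that applies each one to the same made-up string. In every pattern the quantifier only affects the `x` directly before it.

```python
import re

text = 'ai aix aixx'

star = re.findall(r'aix*', text)        # x zero or more times
plus = re.findall(r'aix+', text)        # x one or more times
optional = re.findall(r'aix?', text)    # x zero or one time
exact = re.findall(r'aix{2}', text)     # x exactly two times
ranged = re.findall(r'aix{1,2}', text)  # x one or two times

print(star)      # ['ai', 'aix', 'aixx']
print(plus)      # ['aix', 'aixx']
print(optional)  # ['ai', 'aix', 'aix']
print(exact)     # ['aixx']
print(ranged)    # ['aix', 'aixx']
```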

In [109]:
# Examples! 
test_string = 'one of the good thing is sleeping daily 12 hr'
pattern = re.compile(r'\d+')
matches = pattern.findall(test_string)
matches

['12']

In [111]:
# Examples! 
my_string = 'hello_1_2-3'
pattern = re.compile(r'_?\d')  
matches = pattern.findall(my_string)
matches

['_1', '_2', '3']

In [115]:
# Examples!
my_string = '2020-04-01'
pattern = re.compile(r'\d{4}') 
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(0, 4), match='2020'>


In [116]:
my_string = '2020-04-01'
pattern = re.compile(r'\d{2,5}')   # \d{2,5} -> between 2 and 5 consecutive digits
matches = pattern.finditer(my_string)
for match in matches:
    print(match)

<re.Match object; span=(0, 4), match='2020'>
<re.Match object; span=(5, 7), match='04'>
<re.Match object; span=(8, 10), match='01'>


In [120]:
# Let's use these techniques to extract the dates! 
dates = '''
2020.04.01

2020-04-01
2020-05-23
2020-06-11
2020-07-11
2020-08-11

2020/04/02

2020_04_04
2020_04_04
'''
pattern = re.compile(r'\d{4}.\d{2}.\d{2}')  # the unescaped . matches any separator character
matches = pattern.findall(dates)
print(matches)
pattern = re.compile(r'\d+.\d+.\d+')  # + -> one or more digits 
matches = pattern.findall(dates)
print('\n', matches)

['2020.04.01', '2020-04-01', '2020-05-23', '2020-06-11', '2020-07-11', '2020-08-11', '2020/04/02', '2020_04_04', '2020_04_04']

 ['2020.04.01', '2020-04-01', '2020-05-23', '2020-06-11', '2020-07-11', '2020-08-11', '2020/04/02', '2020_04_04', '2020_04_04']
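
Both patterns above give the same result on this clean input, but a bare `.` matches any character, so they can pick up more than dates on messier text. Here is a rough sketch of that trade-off, using a made-up sentence.

```python
import re

text = 'order 12345 shipped on 2020-04-01, build 2020x04y01'

# Loose: any run of digits, any character, digits, any character, digits
loose = re.findall(r'\d+.\d+.\d+', text)

# Stricter: fixed digit counts and an explicit set of allowed separators
strict = re.findall(r'\d{4}[-./_]\d{2}[-./_]\d{2}', text)

print(loose)   # ['12345', '2020-04-01', '2020x04y01']
print(strict)  # ['2020-04-01']
```

The loose pattern even matches `12345`, because the `.` is free to match a digit as well.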


### Conditions
Use `|` for an either/or condition.

In [136]:
# Examples! 
my_string = ''' 
Iron Man
Strong Man
Super Man 
Duper Man 
''' 

pattern = re.compile(r'Iron\.?\s\w+')
matches = re.findall(pattern, my_string) 
matches

['Iron Man']

In [142]:
# Examples! 
pattern = re.compile(r'(Iron|Strong|Super|Duper)\s\w*') 
matches = re.finditer(pattern, my_string) 

for match in matches: 
    print(match)

<re.Match object; span=(2, 10), match='Iron Man'>
<re.Match object; span=(11, 21), match='Strong Man'>
<re.Match object; span=(22, 31), match='Super Man'>
<re.Match object; span=(33, 42), match='Duper Man'>
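
One thing to watch out for: the example above uses `finditer`, which gives the whole match. With `findall`, a pattern that contains a capturing group returns only the group. The sketch below (sample text made up) shows this, along with the non-capturing form `(?:...)` and what happens when the parentheses are left out.

```python
import re

text = 'Iron Man and Super Man'

# Capturing group: findall returns only what the group captured
captured = re.findall(r'(Iron|Super)\s\w+', text)

# Non-capturing group (?:...): findall returns the whole match
non_capturing = re.findall(r'(?:Iron|Super)\s\w+', text)

# No group: | splits the WHOLE pattern into "Iron" or "Super\s\w+"
ungrouped = re.findall(r'Iron|Super\s\w+', text)

print(captured)       # ['Iron', 'Super']
print(non_capturing)  # ['Iron Man', 'Super Man']
print(ungrouped)      # ['Iron', 'Super Man']
```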


### Grouping 

`()` is used to group substrings in the matches

In [160]:
# Examples! (let's try to extract emails) 

emails = """
nothing@gmail.com
apple-company@gmx.de
micro-service@my-domain.org
"""
# pattern = re.compile(r'[a-zA-Z0-9-]+@[a-zA-Z-]+\.[a-zA-Z]+')  # matches all three addresses, without groups
# pattern = re.compile(r'[a-zA-Z0-9-]+@[a-zA-Z-]+\.(com|de)')   # only matches addresses ending in .com or .de
pattern = re.compile(r'([a-zA-Z0-9-]+)@([a-zA-Z-]+)\.([a-zA-Z]+)')   # the groups capture user name, domain, and top-level domain
matches = pattern.finditer(emails)
for match in matches:
    print(match)
    print(match.group())
    print(match.group(0))
    print(match.group(1))
    print(match.group(2))
    print(match.group(3))
    print('-'*100)



<re.Match object; span=(1, 18), match='nothing@gmail.com'>
nothing@gmail.com
nothing@gmail.com
nothing
gmail
com
----------------------------------------------------------------------------------------------------
<re.Match object; span=(19, 39), match='apple-company@gmx.de'>
apple-company@gmx.de
apple-company@gmx.de
apple-company
gmx
de
----------------------------------------------------------------------------------------------------
<re.Match object; span=(40, 67), match='micro-service@my-domain.org'>
micro-service@my-domain.org
micro-service@my-domain.org
micro-service
my-domain
org
----------------------------------------------------------------------------------------------------
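
Besides `group(1)`, `group(2)`, and so on, a match object can return all groups at once, and groups can be given names. Here is a minimal sketch using the same simplified e-mail pattern; the group names `user`, `domain`, and `tld` and the sample text are assumptions for illustration.

```python
import re

# (?P<name>...) gives a group a name in addition to its number
pattern = re.compile(r'(?P<user>[a-zA-Z0-9-]+)@(?P<domain>[a-zA-Z-]+)\.(?P<tld>[a-zA-Z]+)')

match = pattern.search('contact: apple-company@gmx.de')

parts = match.groups()      # all captured groups as a tuple
named = match.groupdict()   # the same groups as a dict keyed by name
user = match.group('user')  # a single group, looked up by name

print(parts)  # ('apple-company', 'gmx', 'de')
print(named)  # {'user': 'apple-company', 'domain': 'gmx', 'tld': 'de'}
print(user)   # apple-company
```

Named groups can make longer patterns easier to read than numbered ones.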


**`sub()`** -> Replaces every match of the pattern in the text with a replacement string. 

In [7]:
# Example! 
my_string = "hello world, you are the best world"
pattern = re.compile(r'world')
subbed_string = pattern.sub(r'planet', my_string)
print(subbed_string)

hello planet, you are the best planet
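
`sub()` has a few more options that are handy in practice. Here is a short sketch, assuming made-up sample strings: limiting the number of replacements with `count`, reusing captured groups in the replacement with `\1`, `\2`, ..., and counting replacements with `subn()`.

```python
import re

my_string = "hello world, you are the best world"
pattern = re.compile(r'world')

# count=1 replaces only the first match
first_only = pattern.sub('planet', my_string, count=1)

# \1, \2, \3 in the replacement refer to the captured groups
reordered = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\3.\2.\1', 'Released on 2020-04-01')

# subn() returns the new string together with the number of replacements
new_text, n_replaced = re.subn(r'colou?r', 'red', 'color and colour')

print(first_only)            # hello planet, you are the best world
print(reordered)             # Released on 01.04.2020
print(new_text, n_replaced)  # red and red 2
```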


Want to practice questions related to this? [**click here**](https://www.kaggle.com/code/albeffe/regex-exercises-solutions/notebook)

### Reference: 
* [**Regular Expression from Basic**](https://www.python-engineer.com/posts/regular-expressions/)
* [**Regular Expression Book(Packt)**](https://www.packtpub.com/product/mastering-python-regular-expressions/9781783283156)