# Learning Python's `re` module for pattern matching in strings

#### Note
- These notes are based on the video tutorial by Corey Schafer.
- You can watch the [video](https://www.youtube.com/watch?v=K8L6KVGG-7o) alongside these notes.

In [1]:
import re

> Patterns passed to the regular expression module `re` should be written as raw strings, such as r'This is a string', so that Python does not interpret the backslashes before the regex engine sees them.

In [2]:
print('\tTab')
print(r'\tTab')

	Tab
\tTab


> We will use the `re.compile()` function to store patterns in variables. This lets us reuse a pattern and give it a meaningful name, which makes the code more readable.
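
A minimal sketch of reusing a compiled pattern. The sample strings are assumptions made up for illustration, and `findall()` (covered later in these notes) is used only to keep the output short.

```python
import re

# pattern_abc is compiled once; the sample strings are illustrative assumptions
pattern_abc = re.compile(r'abc')

# the same compiled pattern is reused on two different strings
hits_first = pattern_abc.findall('abc xyz abc')
hits_second = pattern_abc.findall('ABC only')

print(hits_first)   # ['abc', 'abc']
print(hits_second)  # []
```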

In [3]:
# dummy text containing letters, digits, special characters,
# a URL, phone numbers and names in specific formats.
# raw string, so the lone backslash in the metacharacter line is kept as-is
text_to_search = r'''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )
coreyms.com
321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
'''

In [4]:
# let's search for 'abc' in the string
pattern_abc = re.compile('abc')

matches = pattern_abc.finditer(text_to_search)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 4), match='abc'>


> As the output above shows, `finditer()` yields one match object per match; each one reports the span of the match and the matched text.

> The span holds the start and end indices, which can be used to slice the string and get that particular match.

> Also, **ABC** was not matched here, so regex matching is case-sensitive by default.
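
The slicing idea can be sketched as follows. The sample text is an assumption chosen to keep the indices easy to follow; it is not the tutorial's `text_to_search`.

```python
import re

# sample text is an illustrative assumption
text = 'xxabcxx ABC'
pattern_abc = re.compile(r'abc')

matches = list(pattern_abc.finditer(text))
first = matches[0]

start, end = first.span()   # start and end indices of the match
sliced = text[start:end]    # slicing with the span gives the matched text

print(first.span())    # (2, 5)
print(sliced)          # abc
print(first.group())   # abc, the same text without manual slicing
print(len(matches))    # 1, because 'ABC' does not match 'abc'
```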

In [5]:
# let's search for '.' in the string
pattern_dot = re.compile('.')

matches = pattern_dot.finditer(text_to_search)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 2), match='a'>
<_sre.SRE_Match object; span=(2, 3), match='b'>
<_sre.SRE_Match object; span=(3, 4), match='c'>
<_sre.SRE_Match object; span=(4, 5), match='d'>
<_sre.SRE_Match object; span=(5, 6), match='e'>
<_sre.SRE_Match object; span=(6, 7), match='f'>
<_sre.SRE_Match object; span=(7, 8), match='g'>
<_sre.SRE_Match object; span=(8, 9), match='h'>
<_sre.SRE_Match object; span=(9, 10), match='i'>
<_sre.SRE_Match object; span=(10, 11), match='j'>
<_sre.SRE_Match object; span=(11, 12), match='k'>
<_sre.SRE_Match object; span=(12, 13), match='l'>
<_sre.SRE_Match object; span=(13, 14), match='m'>
<_sre.SRE_Match object; span=(14, 15), match='n'>
<_sre.SRE_Match object; span=(15, 16), match='o'>
<_sre.SRE_Match object; span=(16, 17), match='p'>
<_sre.SRE_Match object; span=(17, 18), match='q'>
<_sre.SRE_Match object; span=(18, 19), match='u'>
<_sre.SRE_Match object; span=(19, 20), match='r'>
<_sre.SRE_Match object; span=(20, 21), match='t'>
<_sre.SRE_Match o

> Here we can see that *every character* in the string (except newlines) was matched, not just the *period*.

> **Dot** is a special character in regex, so we <mark>need to escape</mark> it to match it literally rather than as a special symbol.

In [6]:
# escaping the character so the regex takes it literally
# (raw string, so Python passes the backslash through unchanged)
pattern_dot = re.compile(r'\.')

matches = pattern_dot.finditer(text_to_search)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(111, 112), match='.'>
<_sre.SRE_Match object; span=(146, 147), match='.'>
<_sre.SRE_Match object; span=(167, 168), match='.'>
<_sre.SRE_Match object; span=(171, 172), match='.'>
<_sre.SRE_Match object; span=(218, 219), match='.'>
<_sre.SRE_Match object; span=(249, 250), match='.'>
<_sre.SRE_Match object; span=(262, 263), match='.'>


In [7]:
# let's match the url 'coreyms.com'

# here too we need to escape the dot so it is taken literally
pattern_url = re.compile(r'coreyms\.com')

matches = pattern_url.finditer(text_to_search)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(139, 150), match='coreyms.com'>


### The special characters are:


- .       - Any Character Except New Line
- \d      - Digit (0-9)
- \D      - Not a Digit (0-9)
- \w      - Word Character (a-z, A-Z, 0-9, \_)
- \W      - Not a Word Character
- \s      - Whitespace (space, tab, newline)
- \S      - Not Whitespace (space, tab, newline)

- \b      - Word Boundary (Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of word characters.)
- \B      - Not a Word Boundary
- ^       - Beginning of a String
- $      - End of a String

- \[\]      - Matches Characters in brackets
- \[^ \]    - Matches Characters NOT in brackets
- |       - Either Or
- ( )     - Group
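
A small sketch of a few of these character classes side by side. The sample string is an assumption chosen so that each class picks out something different.

```python
import re

# sample string is an illustrative assumption: letters, digits,
# a space, a tab, an underscore and an exclamation mark
sample = 'a1 b2\tc_3!'

digits = re.findall(r'\d', sample)
word_chars = re.findall(r'\w', sample)
non_word = re.findall(r'\W', sample)
whitespace = re.findall(r'\s', sample)
non_space = re.findall(r'\S', sample)

print(digits)      # ['1', '2', '3']
print(word_chars)  # ['a', '1', 'b', '2', 'c', '_', '3']
print(non_word)    # [' ', '\t', '!']
print(whitespace)  # [' ', '\t']
print(non_space)   # ['a', '1', 'b', '2', 'c', '_', '3', '!']
```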

In [8]:
# get all the digits in the text

pattern_digit = re.compile(r'\d')

matches = pattern_digit.finditer(text_to_search)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(55, 56), match='1'>
<_sre.SRE_Match object; span=(56, 57), match='2'>
<_sre.SRE_Match object; span=(57, 58), match='3'>
<_sre.SRE_Match object; span=(58, 59), match='4'>
<_sre.SRE_Match object; span=(59, 60), match='5'>
<_sre.SRE_Match object; span=(60, 61), match='6'>
<_sre.SRE_Match object; span=(61, 62), match='7'>
<_sre.SRE_Match object; span=(62, 63), match='8'>
<_sre.SRE_Match object; span=(63, 64), match='9'>
<_sre.SRE_Match object; span=(64, 65), match='0'>
<_sre.SRE_Match object; span=(151, 152), match='3'>
<_sre.SRE_Match object; span=(152, 153), match='2'>
<_sre.SRE_Match object; span=(153, 154), match='1'>
<_sre.SRE_Match object; span=(155, 156), match='5'>
<_sre.SRE_Match object; span=(156, 157), match='5'>
<_sre.SRE_Match object; span=(157, 158), match='5'>
<_sre.SRE_Match object; span=(159, 160), match='4'>
<_sre.SRE_Match object; span=(160, 161), match='3'>
<_sre.SRE_Match object; span=(161, 162), match='2'>
<_sre.SRE_Match object; span=(16

In [9]:
# get all the non-word characters in the text

pattern_non_word_char = re.compile(r'\W')

matches = pattern_non_word_char.finditer(text_to_search)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 1), match='\n'>
<_sre.SRE_Match object; span=(27, 28), match='\n'>
<_sre.SRE_Match object; span=(54, 55), match='\n'>
<_sre.SRE_Match object; span=(65, 66), match='\n'>
<_sre.SRE_Match object; span=(68, 69), match=' '>
<_sre.SRE_Match object; span=(73, 74), match='\n'>
<_sre.SRE_Match object; span=(88, 89), match=' '>
<_sre.SRE_Match object; span=(89, 90), match='('>
<_sre.SRE_Match object; span=(94, 95), match=' '>
<_sre.SRE_Match object; span=(97, 98), match=' '>
<_sre.SRE_Match object; span=(100, 101), match=' '>
<_sre.SRE_Match object; span=(108, 109), match=')'>
<_sre.SRE_Match object; span=(109, 110), match=':'>
<_sre.SRE_Match object; span=(110, 111), match='\n'>
<_sre.SRE_Match object; span=(111, 112), match='.'>
<_sre.SRE_Match object; span=(112, 113), match=' '>
<_sre.SRE_Match object; span=(113, 114), match='^'>
<_sre.SRE_Match object; span=(114, 115), match=' '>
<_sre.SRE_Match object; span=(115, 116), match='$'>
<_sre.SRE_Match object; span

In [10]:
# get 'Ha' when it starts at a word boundary (a word boundary is the zero-width position
# between a word character and a non-word character such as a space, tab or newline,
# or the start/end of the string next to a word character)

pattern_ha = re.compile(r'\bHa')

matches = pattern_ha.finditer(text_to_search)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(66, 68), match='Ha'>
<_sre.SRE_Match object; span=(69, 71), match='Ha'>


In [11]:
# getting the second 'Ha' in 'HaHa' (\B matches only where there is no word boundary)
pattern_ha = re.compile(r'\BHa')

matches = pattern_ha.finditer(text_to_search)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(71, 73), match='Ha'>
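
The two cells above can be compared on the `Ha HaHa` line alone. This is a sketch that uses the module-level `re.finditer()` instead of a compiled pattern, purely for brevity.

```python
import re

text = 'Ha HaHa'  # the same line that appears in text_to_search

# 'Ha' preceded by a word boundary: start of string, or just after the space
at_boundary = [m.span() for m in re.finditer(r'\bHa', text)]

# 'Ha' NOT preceded by a word boundary: the second half of 'HaHa'
not_at_boundary = [m.span() for m in re.finditer(r'\BHa', text)]

print(at_boundary)      # [(0, 2), (3, 5)]
print(not_at_boundary)  # [(5, 7)]
```

The first span starts at index 0, which shows that the start of the string also counts as a word boundary when a word character follows.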


In [12]:
# match if the sentence starts with the word 'Start'
sentence = 'Start of the sentence and then bring it to an end'
pattern_start = re.compile(r'^Start')
matches = pattern_start.finditer(sentence)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 5), match='Start'>


In [13]:
# match if the sentence starts with the letter 'a'
sentence = 'Start of the sentence and then bring it to an end'
pattern_start = re.compile(r'^a')
matches = pattern_start.finditer(sentence)
for match in matches:
    print(match)

> Here nothing is printed because the sentence does not start with *'a'*.
*'a'* does occur in the sentence, but `^` restricts the match to the start of the string.

In [14]:
# match the word 'end' at the end of the string
sentence = 'Start of the sentence and then bring it to an end'
pattern_end = re.compile(r'end$')
matches = pattern_end.finditer(sentence)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(46, 49), match='end'>


In [15]:
# match if the string ends with 'b'
sentence = 'Start of the sentence and then bring it to an end'
pattern_end = re.compile(r'b$')
matches = pattern_end.finditer(sentence)
for match in matches:
    print(match)

> Here nothing is printed because the string does not end with *'b'*.
*'b'* does occur in the sentence, but `$` restricts the match to the end of the string.

In [16]:
# let's match phone numbers in text_to_search that look like 123-123-1234 or 123*123*1234
pattern_ph_no = re.compile(r'\d\d\d.\d\d\d.\d\d\d\d')
matches = pattern_ph_no.finditer(text_to_search)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(151, 163), match='321-555-4321'>
<_sre.SRE_Match object; span=(164, 176), match='123.555.1234'>
<_sre.SRE_Match object; span=(177, 189), match='123*555*1234'>
<_sre.SRE_Match object; span=(190, 202), match='800-555-1234'>
<_sre.SRE_Match object; span=(203, 215), match='900-555-1234'>


> Here we get every phone number regardless of its separator (dash, dot, \*, ...), because an unescaped `.` matches any character.

In [17]:
# get all phone numbers whose separators are either a dash (-) or a dot (.)
pattern_ph_no = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d')
matches = pattern_ph_no.finditer(text_to_search)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(151, 163), match='321-555-4321'>
<_sre.SRE_Match object; span=(164, 176), match='123.555.1234'>
<_sre.SRE_Match object; span=(190, 202), match='800-555-1234'>
<_sre.SRE_Match object; span=(203, 215), match='900-555-1234'>


### NOTE
- In the character set `[-.]` we did not have to escape the dot, because slightly different rules apply inside a character set; escaping it is still allowed.
- Even though the character set lists two characters, it matches exactly one character from the set at that position.
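
Both points in the note can be checked with a short sketch. The shortened pattern `\d[-.]\d` and the sample strings are assumptions used only to keep the example small.

```python
import re

# the dot inside a character set is literal, escaped or not
unescaped = re.compile(r'\d[-.]\d')
escaped = re.compile(r'\d[-\.]\d')

# illustrative samples: one dash, one dot, two dashes, a different separator
samples = ['1-2', '1.2', '1--2', '1x2']

results_unescaped = [bool(unescaped.search(s)) for s in samples]
results_escaped = [bool(escaped.search(s)) for s in samples]

print(results_unescaped)  # [True, True, False, False]
print(results_escaped)    # [True, True, False, False]
```

`'1--2'` fails because the set consumes exactly one separator, and `'1x2'` fails because the dot in the set is literal rather than "any character".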

In [18]:
# example for above note
sentence = """123-123-1234
354.354.1234
256--433.7413
134-.356.4354
344.-541-3651"""

pattern_ph_no = re.compile(r'\d\d\d[-.]\d\d\d[-.]\d\d\d\d')
matches = pattern_ph_no.finditer(sentence)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 12), match='123-123-1234'>
<_sre.SRE_Match object; span=(13, 25), match='354.354.1234'>


In [19]:
# get all the phone numbers that start with 800 or 900
sentence = """123-123-1234
354.354.1234
800-433.7413
134-356.4354
900.541-3651"""

pattern_ph_no = re.compile(r'[89]00[-.]\d\d\d[-.]\d\d\d\d')
matches = pattern_ph_no.finditer(sentence)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(26, 38), match='800-433.7413'>
<_sre.SRE_Match object; span=(52, 64), match='900.541-3651'>


> `[89]` denotes that the number must start with either 8 or 9 (***not both***).<br>
`00` denotes that after matching the 8 or 9 we look for exactly two zeros.<br>
`[-.]` denotes that the separator must be a dash or a dot (***not both***).<br>
`\d\d\d` denotes that we need exactly 3 digits after the separator.

### Note
- The dash has a special function inside character sets `[]`.
- When placed at the start or end of the character set, it simply matches a literal dash.
- When placed between two characters in a character set, it specifies a range.
- e.g. `[A-J]` matches any single character in the range 'A' to 'J'.
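
A short sketch of the difference the dash position makes. The sample text `'a-b-c'` is an assumption chosen so that a range and a literal dash give visibly different results.

```python
import re

text = 'a-b-c'  # illustrative sample

# dash between two characters: the range a..c, so the dashes are skipped
as_range = re.findall(r'[a-c]', text)

# dash at the end or the start: a literal dash, so only 'a', 'c' and '-' match
dash_at_end = re.findall(r'[ac-]', text)
dash_at_start = re.findall(r'[-ac]', text)

print(as_range)       # ['a', 'b', 'c']
print(dash_at_end)    # ['a', '-', '-', 'c']
print(dash_at_start)  # ['a', '-', '-', 'c']
```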

In [21]:
# match only digits from range 1 to 5
pattern_digits = re.compile(r'[1-5]')
matches = pattern_digits.finditer(text_to_search)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(55, 56), match='1'>
<_sre.SRE_Match object; span=(56, 57), match='2'>
<_sre.SRE_Match object; span=(57, 58), match='3'>
<_sre.SRE_Match object; span=(58, 59), match='4'>
<_sre.SRE_Match object; span=(59, 60), match='5'>
<_sre.SRE_Match object; span=(151, 152), match='3'>
<_sre.SRE_Match object; span=(152, 153), match='2'>
<_sre.SRE_Match object; span=(153, 154), match='1'>
<_sre.SRE_Match object; span=(155, 156), match='5'>
<_sre.SRE_Match object; span=(156, 157), match='5'>
<_sre.SRE_Match object; span=(157, 158), match='5'>
<_sre.SRE_Match object; span=(159, 160), match='4'>
<_sre.SRE_Match object; span=(160, 161), match='3'>
<_sre.SRE_Match object; span=(161, 162), match='2'>
<_sre.SRE_Match object; span=(162, 163), match='1'>
<_sre.SRE_Match object; span=(164, 165), match='1'>
<_sre.SRE_Match object; span=(165, 166), match='2'>
<_sre.SRE_Match object; span=(166, 167), match='3'>
<_sre.SRE_Match object; span=(168, 169), match='5'>
<_sre.SRE_Match object

In [22]:
# match only lower case a to j
pattern_letters = re.compile(r'[a-j]')
matches = pattern_letters.finditer(text_to_search)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 2), match='a'>
<_sre.SRE_Match object; span=(2, 3), match='b'>
<_sre.SRE_Match object; span=(3, 4), match='c'>
<_sre.SRE_Match object; span=(4, 5), match='d'>
<_sre.SRE_Match object; span=(5, 6), match='e'>
<_sre.SRE_Match object; span=(6, 7), match='f'>
<_sre.SRE_Match object; span=(7, 8), match='g'>
<_sre.SRE_Match object; span=(8, 9), match='h'>
<_sre.SRE_Match object; span=(9, 10), match='i'>
<_sre.SRE_Match object; span=(10, 11), match='j'>
<_sre.SRE_Match object; span=(67, 68), match='a'>
<_sre.SRE_Match object; span=(70, 71), match='a'>
<_sre.SRE_Match object; span=(72, 73), match='a'>
<_sre.SRE_Match object; span=(75, 76), match='e'>
<_sre.SRE_Match object; span=(77, 78), match='a'>
<_sre.SRE_Match object; span=(79, 80), match='h'>
<_sre.SRE_Match object; span=(80, 81), match='a'>
<_sre.SRE_Match object; span=(82, 83), match='a'>
<_sre.SRE_Match object; span=(83, 84), match='c'>
<_sre.SRE_Match object; span=(85, 86), match='e'>
<_sre.SRE_Match o

In [23]:
# match characters from lower case 'a' to 'j' and uppercase 'A' to 'G'

pattern_letters = re.compile(r'[a-jA-G]')
matches = pattern_letters.finditer(text_to_search)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 2), match='a'>
<_sre.SRE_Match object; span=(2, 3), match='b'>
<_sre.SRE_Match object; span=(3, 4), match='c'>
<_sre.SRE_Match object; span=(4, 5), match='d'>
<_sre.SRE_Match object; span=(5, 6), match='e'>
<_sre.SRE_Match object; span=(6, 7), match='f'>
<_sre.SRE_Match object; span=(7, 8), match='g'>
<_sre.SRE_Match object; span=(8, 9), match='h'>
<_sre.SRE_Match object; span=(9, 10), match='i'>
<_sre.SRE_Match object; span=(10, 11), match='j'>
<_sre.SRE_Match object; span=(28, 29), match='A'>
<_sre.SRE_Match object; span=(29, 30), match='B'>
<_sre.SRE_Match object; span=(30, 31), match='C'>
<_sre.SRE_Match object; span=(31, 32), match='D'>
<_sre.SRE_Match object; span=(32, 33), match='E'>
<_sre.SRE_Match object; span=(33, 34), match='F'>
<_sre.SRE_Match object; span=(34, 35), match='G'>
<_sre.SRE_Match object; span=(67, 68), match='a'>
<_sre.SRE_Match object; span=(70, 71), match='a'>
<_sre.SRE_Match object; span=(72, 73), match='a'>
<_sre.SRE_Match o

### Note
- The caret `^` outside a character set denotes the start of the string.
- Inside a character set, as its first character, it denotes **NOT**.
- e.g. `[^A-G]` matches any character that is not in the range 'A' to 'G'.

In [24]:
# match everything that is not a lowercase letter 
pattern_not_lower = re.compile(r'[^a-z]')
matches = pattern_not_lower.finditer(text_to_search)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 1), match='\n'>
<_sre.SRE_Match object; span=(27, 28), match='\n'>
<_sre.SRE_Match object; span=(28, 29), match='A'>
<_sre.SRE_Match object; span=(29, 30), match='B'>
<_sre.SRE_Match object; span=(30, 31), match='C'>
<_sre.SRE_Match object; span=(31, 32), match='D'>
<_sre.SRE_Match object; span=(32, 33), match='E'>
<_sre.SRE_Match object; span=(33, 34), match='F'>
<_sre.SRE_Match object; span=(34, 35), match='G'>
<_sre.SRE_Match object; span=(35, 36), match='H'>
<_sre.SRE_Match object; span=(36, 37), match='I'>
<_sre.SRE_Match object; span=(37, 38), match='J'>
<_sre.SRE_Match object; span=(38, 39), match='K'>
<_sre.SRE_Match object; span=(39, 40), match='L'>
<_sre.SRE_Match object; span=(40, 41), match='M'>
<_sre.SRE_Match object; span=(41, 42), match='N'>
<_sre.SRE_Match object; span=(42, 43), match='O'>
<_sre.SRE_Match object; span=(43, 44), match='P'>
<_sre.SRE_Match object; span=(44, 45), match='Q'>
<_sre.SRE_Match object; span=(45, 46), match='R'>


In [25]:
# in a sentence, match all the three-letter words that end with 'at' but don't match 'bat'
sentence = 'mat pat rat bat cat'
pattern_not_bat = re.compile(r'[^b]at')
matches = pattern_not_bat.finditer(sentence)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(0, 3), match='mat'>
<_sre.SRE_Match object; span=(4, 7), match='pat'>
<_sre.SRE_Match object; span=(8, 11), match='rat'>
<_sre.SRE_Match object; span=(16, 19), match='cat'>


---
### Quantifiers:
- \*       - 0 or More
- \+       - 1 or More
- \?       - 0 or One
- \{3\}     - Exact Number
- \{3,4\}   - Range of Numbers (Minimum, Maximum)
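
Each quantifier in the list can be tried on a tiny input. The sample strings below are assumptions made up for illustration.

```python
import re

numbers = '1 12 123 1234 12345'  # illustrative sample

# {3}: exactly three digits between word boundaries
exactly_three = re.findall(r'\b\d{3}\b', numbers)

# {3,4}: at least three and at most four digits
three_or_four = re.findall(r'\b\d{3,4}\b', numbers)

# +: one or more digits
one_or_more = re.findall(r'\d+', numbers)

# ?: the 'u' may appear zero times or once
optional_u = re.findall(r'colou?r', 'color colour')

# *: 'a' followed by zero or more 'b'
zero_or_more_b = re.findall(r'ab*', 'a ab abbb')

print(exactly_three)   # ['123']
print(three_or_four)   # ['123', '1234']
print(one_or_more)     # ['1', '12', '123', '1234', '12345']
print(optional_u)      # ['color', 'colour']
print(zero_or_more_b)  # ['a', 'ab', 'abbb']
```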

In [26]:
# let's use quantifiers to match phone numbers
pattern_ph_no = re.compile(r'\d{3}.\d{3}.\d{4}')
matches = pattern_ph_no.finditer(text_to_search)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(151, 163), match='321-555-4321'>
<_sre.SRE_Match object; span=(164, 176), match='123.555.1234'>
<_sre.SRE_Match object; span=(177, 189), match='123*555*1234'>
<_sre.SRE_Match object; span=(190, 202), match='800-555-1234'>
<_sre.SRE_Match object; span=(203, 215), match='900-555-1234'>


> Thus `\d{3}` denotes exactly 3 digit characters.<br>
This is better than writing `\d\d\d`, since a 10-digit number would otherwise need `\d` typed 10 times.

In [27]:
# let's match names from a string that start with Mr
sentence = """
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
"""
# here we can see that some titles end with a dot and some don't
# note: the search below runs on text_to_search, which contains the same names
pattern_names = re.compile(r'Mr\.?\s[A-Z]\w*')
matches = pattern_names.finditer(text_to_search)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(216, 227), match='Mr. Schafer'>
<_sre.SRE_Match object; span=(228, 236), match='Mr Smith'>
<_sre.SRE_Match object; span=(260, 265), match='Mr. T'>


> Here we got all the names that start with Mr <br>
- `\.?` denotes that dot is optional
- `\s` denotes there is a space after the Mr or Mr.
- `[A-Z]` denotes that after the space we expect an uppercase letter.
- `\w*` denotes that after the uppercase letter we can have any number (0 or more) of word characters.

In [28]:
# let's get all the names
sentence = """
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
"""
# here we can see that some titles end with a dot and some don't
# note: the search below runs on text_to_search, which contains the same names
pattern_names = re.compile(r'(Mr|Ms|Mrs)\.?\s[A-Z]\w*')
matches = pattern_names.finditer(text_to_search)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(216, 227), match='Mr. Schafer'>
<_sre.SRE_Match object; span=(228, 236), match='Mr Smith'>
<_sre.SRE_Match object; span=(237, 245), match='Ms Davis'>
<_sre.SRE_Match object; span=(246, 259), match='Mrs. Robinson'>
<_sre.SRE_Match object; span=(260, 265), match='Mr. T'>


- `(Mr|Ms|Mrs)` denotes a group, where the pipe \| denotes **or**. Thus the title must be Mr, Ms or Mrs.
- The rest is the same as above.

In [34]:
# now let's write a pattern for detecting emails
sentence = """
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
"""
pattern_email = re.compile(r'[a-zA-Z.0-9-]+@[a-zA-Z-]+\.(com|edu|net)')
matches = pattern_email.finditer(sentence)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 24), match='CoreyMSchafer@gmail.com'>
<_sre.SRE_Match object; span=(25, 53), match='corey.schafer@university.edu'>
<_sre.SRE_Match object; span=(54, 83), match='corey-321-schafer@my-work.net'>


- `[a-zA-Z.0-9-]+` the part before the @ must contain at least one character, each being a lowercase letter, uppercase letter, digit, dot or dash
- `@` after this we look for @
- `[a-zA-Z-]+` after the @, we need at least one lowercase letter, uppercase letter or dash
- `\.` then we look for a literal dot
- `(com|edu|net)` after the dot we can have com, edu or net

### Decoding other people's regex
- The pattern is `[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+`

- Here, `[a-zA-Z0-9_.+-]+` denotes one or more lowercase letters, uppercase letters, digits, underscores, dots, pluses or dashes.

- Then we match `@`.

- Then `[a-zA-Z0-9-]+` denotes one or more lowercase letters, uppercase letters, digits or dashes.

- Then we look for a literal dot, `\.`

- Then `[a-zA-Z0-9-.]+` denotes one or more lowercase letters, uppercase letters, digits, dashes or dots.
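
One way to check a reading like this is to run the pattern on a few inputs. In the sketch below the first three addresses come from the email example earlier in these notes; the fourth address and the last line are made-up assumptions.

```python
import re

pattern_email = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')

# the last two lines are hypothetical inputs added for illustration
emails = '''
CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net
first.last+tag@mail.example.org
not an email
'''

# no groups in the pattern, so findall returns each whole match
found = pattern_email.findall(emails)
for address in found:
    print(address)
```

The plus sign and the multi-part domain in the fourth address are accepted because `+` is in the first character set and `.` is in the last one; the final line has no `@`, so it is skipped.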

In [35]:
# now let's write a pattern that recognizes a URL

sentence = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''
pattern_url = re.compile(r'https?://(www\.)?\w+\.\w+')
matches = pattern_url.finditer(sentence)
for match in matches:
    print(match)

<_sre.SRE_Match object; span=(1, 23), match='https://www.google.com'>
<_sre.SRE_Match object; span=(24, 42), match='http://coreyms.com'>
<_sre.SRE_Match object; span=(43, 62), match='https://youtube.com'>
<_sre.SRE_Match object; span=(63, 83), match='https://www.nasa.gov'>


- `http` the match starts with http
- `s?` zero or one 's'
- `://` we match this after http or https
- `(www\.)?` since 'www.' is optional we put it inside a group so that the '?' applies to the whole 'www.'; the dot is escaped so that it matches only a literal dot
- `\w+` the name of the website can have any word character, 1 or more times.
- `\.` after the name we want a dot.
- `\w+` for the top-level domain name.

In [43]:
# now let's capture some info from the url

sentence = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''
pattern_url = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern_url.finditer(sentence)

for match in matches:
    print('the entire url')
    print(match.group(0))
    
    print('the optional part "www."')
    print(match.group(1))
    
    print('the name of the website')
    print(match.group(2))
    
    print('the top level domain name of the website')
    print(match.group(3))
    
    print('\n\n\n\n')

the entire url
https://www.google.com
the optional part "www."
www.
the name of the website
google
the top level domain name of the website
.com





the entire url
http://coreyms.com
the optional part "www."
None
the name of the website
coreyms
the top level domain name of the website
.com





the entire url
https://youtube.com
the optional part "www."
None
the name of the website
youtube
the top level domain name of the website
.com





the entire url
https://www.nasa.gov
the optional part "www."
www.
the name of the website
nasa
the top level domain name of the website
.gov







> Here we access parts of each match to extract the relevant information from it.<br><br>
To do this we put the parts we want into groups: the optional www., the domain name and the top-level domain (.com/.gov) each get their own group.<br><br>
To access these parts we use the `.group()` method, passing the index of the group.<br><br>
0 is the entire match<br>
1 is the first group (the optional www.)<br>
2 is the second group (the domain name)<br>
3 is the third group (the top-level domain)<br>
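
As a complement to calling `.group()` once per index, the sketch below uses `.groups()` and a multi-index `.group()` call, both standard match-object methods, on two of the URLs from the cell above.

```python
import re

pattern_url = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

with_www = pattern_url.search('https://www.nasa.gov')
without_www = pattern_url.search('http://coreyms.com')

# groups() returns all groups (1, 2, 3, ...) as one tuple
print(with_www.groups())        # ('www.', 'nasa', '.gov')

# group() accepts several indices and returns a tuple of those groups
print(with_www.group(2, 3))     # ('nasa', '.gov')

# an optional group that did not take part in the match is None
print(without_www.groups())     # (None, 'coreyms', '.com')
```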

### Sub method

In [44]:
# now let's replace each url found by the pattern with
# just the name of the website and the top-level domain name.
# e.g. https://www.google.com becomes google.com

sentence = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''
pattern_url = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
subbed_url = pattern_url.sub(r'\2\3',sentence)
print(subbed_url)


google.com
coreyms.com
youtube.com
nasa.gov



> Here we use the `sub()` method, which replaces each match found in the sentence with the 2nd and 3rd groups of that same match.

> The first parameter of `sub()` is the replacement text.

> The second parameter is the string in which to search for the pattern.
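
A small sketch of `sub()` on a single line of text. The sample line is an assumption, and the optional `count` keyword (a standard argument of `sub()` that the cell above does not use) is shown only to illustrate limiting the number of replacements.

```python
import re

pattern_url = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')

# illustrative sample line containing two URLs
line = 'Visit https://www.google.com or http://coreyms.com today'

# replace every match with group 2 followed by group 3
all_replaced = pattern_url.sub(r'\2\3', line)

# count=1 replaces only the first match
first_only = pattern_url.sub(r'\2\3', line, count=1)

print(all_replaced)  # Visit google.com or coreyms.com today
print(first_only)    # Visit google.com or http://coreyms.com today
```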

### Findall Method

In [45]:
# let's try the findall() method
# when the pattern contains a group, findall returns only the text of that group
sentence = """
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T
"""
# here we can see that some titles end with a dot and some don't
# note: the search below runs on text_to_search, which contains the same names
pattern_names = re.compile(r'(Mr|Ms|Mrs)\.?\s[A-Z]\w*')
matches = pattern_names.findall(text_to_search)
for match in matches:
    print(match)

Mr
Mr
Ms
Mrs
Mr


### Note
- If there is more than one group, findall returns a list of tuples, each holding the text of every group. A group that did not take part in the match appears as an empty string.

In [46]:
sentence = '''
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
'''
pattern_url = re.compile(r'https?://(www\.)?(\w+)(\.\w+)')
matches = pattern_url.findall(sentence)
for match in matches:
    print(match)

('www.', 'google', '.com')
('', 'coreyms', '.com')
('', 'youtube', '.com')
('www.', 'nasa', '.gov')


#### Note
- If the pattern has no group, findall returns each whole match as a string.

In [49]:
sentence = """123-123-1234
354.354.1234
256-433.7413
134-356.4354
344.541-3651"""

pattern_ph_no = re.compile(r'\d{3}[-.]\d{3}[-.]\d{4}')
matches = pattern_ph_no.findall(sentence)
for match in matches:
    print(match)

123-123-1234
354.354.1234
256-433.7413
134-356.4354
344.541-3651


### Match Method

In [50]:
sentence = "Start a sentence and then bring it to an end"
pattern = re.compile(r'Start')
match = pattern.match(sentence)
print(match)

<_sre.SRE_Match object; span=(0, 5), match='Start'>


In [51]:
sentence = "Start a sentence and then bring it to an end"
pattern = re.compile(r'then')
match = pattern.match(sentence)
print(match)

None


> The `match` method returns a single match object (the first match), not an iterator like `finditer`.

> `match` only matches at the start of the string; if the start of the string does not match, the method returns None.

### Search Method

In [52]:
sentence = "Start a sentence and then bring it to an end"
pattern = re.compile(r'Start')
match = pattern.search(sentence)
print(match)

<_sre.SRE_Match object; span=(0, 5), match='Start'>


In [53]:
sentence = "Start a sentence and then bring it to an end"
pattern = re.compile(r'then')
match = pattern.search(sentence)
print(match)

<_sre.SRE_Match object; span=(21, 25), match='then'>


In [54]:
sentence = "Start a sentence then and then bring then it to an end"
pattern = re.compile(r'then')
match = pattern.search(sentence)
print(match)

<_sre.SRE_Match object; span=(17, 21), match='then'>


> The `search` method returns only the first match in the string.

> It returns None if there are no matches.
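
Because both `match` and `search` can return None, it is usually worth checking the result before calling methods on it. The sketch below contrasts the two methods on the sentence used above; `first_match_text` is a hypothetical helper written for this example, not part of the `re` module.

```python
import re

sentence = 'Start a sentence and then bring it to an end'
pattern = re.compile(r'then')


def first_match_text(compiled, text):
    """Hypothetical helper: return the first matched text, or None."""
    found = compiled.search(text)
    if found is None:
        return None
    return found.group()


at_start = pattern.match(sentence)    # 'then' is not at the start
anywhere = pattern.search(sentence)   # first 'then' anywhere in the string

print(at_start)                                # None
print(anywhere.span())                         # (21, 25)
print(first_match_text(pattern, sentence))     # then
print(first_match_text(pattern, 'no hit'))     # None
```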

### Flags in Regular Expression

In [55]:
# we want to match the word 'start' in the string,
# no matter which of its characters are lowercase or uppercase.
# from what we have learned so far we would write something like
# [Ss][Tt][Aa][Rr][Tt]
# which is a pain to write for longer words or patterns
# instead we use the flag re.IGNORECASE, or re.I for short

sentence = "Start of the sentence"

pattern = re.compile(r'start',re.IGNORECASE)
match = pattern.search(sentence)
print(match)

<_sre.SRE_Match object; span=(0, 5), match='Start'>


In [58]:
sentence = "this is so AmaZiNG"

pattern = re.compile(r'amazing',re.IGNORECASE)
match = pattern.search(sentence)
print(match)

<_sre.SRE_Match object; span=(11, 18), match='AmaZiNG'>
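
A final sketch comparing a search with and without the flag, and confirming that `re.I` is the short name for `re.IGNORECASE`. The sample text is an assumption made up for illustration.

```python
import re

text = 'Start START start'  # illustrative sample

# without the flag only the exact lowercase spelling matches
case_sensitive = re.findall(r'start', text)

# with the flag every casing of the word matches
ignore_case = re.findall(r'start', text, re.IGNORECASE)
short_flag = re.findall(r'start', text, re.I)

print(case_sensitive)               # ['start']
print(ignore_case)                  # ['Start', 'START', 'start']
print(short_flag == ignore_case)    # True
print(re.I == re.IGNORECASE)        # True
```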
