# Regular Expressions

Regular expressions let us search for and match specific patterns of text.

You can write a regex for just about any text pattern you can think of.

**Raw String**

A raw string is just a string prefixed with an r, which tells Python not to handle backslashes in any special way.

Normally, backslashes (\) are used to specify tabs, newlines, and other escape sequences.

In [6]:
print("\tTab") # Python replaced our \t with an actual tab. 

	Tab


A raw string, on the other hand, interprets the string literally.
So if we put an r in front of the string:

In [7]:
print(r"\tTab")#we see that backslashes are no longer handled in any special way

\tTab


That matters because we want the regex engine to interpret the patterns we pass in, without Python processing them first.
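As a small sketch of the difference, the snippet below compares the lengths of a regular and a raw string. The variable names are made up for illustration.

```python
# Compare how Python stores a regular string and a raw string.
regular = "\tTab"   # \t becomes one tab character
raw = r"\tTab"      # the backslash and the t stay as two separate characters

print(len(regular))  # 4
print(len(raw))      # 5

# "\b" in a regular string is a backspace character, while r"\b" stays as
# the two characters that the regex engine reads as a word boundary.
backspace = "\b"
boundary = r"\b"
print(len(backspace), len(boundary))  # 1 2
```

This is why regex patterns are usually written as raw strings: sequences such as \b would otherwise be changed by Python before the regex engine ever sees them.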

In [1]:
import re 

In [2]:
text_to_search = '''
abcdefghijklmnopqurtuvwxyz
ABCDEFGHIJKLMNOPQRSTUVWXYZ
1234567890
Ha HaHa
MetaCharacters (Need to be escaped):
. ^ $ * + ? { } [ ] \ | ( )
coreyms.com
321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234
Mr. Schafer
Mr Smith
Ms Davis
Mrs. Robinson
Mr. T

cat 
mat
pat

bat



'''

# re.compile() Method

Compile a regular expression pattern, returning a Pattern object.

In [None]:
# Lets us separate our patterns out into a variable
# and makes it easier to reuse that variable for multiple searches.
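A minimal sketch of that reuse, assuming a made-up list of sample lines: the pattern is compiled once and the same Pattern object is applied to each string.

```python
import re

# Compile once, then reuse the same Pattern object for several searches.
digits = re.compile(r"\d+")

lines = ["order 66", "no numbers here", "room 101, floor 3"]  # made-up sample data
results = [digits.findall(line) for line in lines]
print(results)  # [['66'], [], ['101', '3']]
```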

In [18]:
pattern = re.compile(r'abc')  # the pattern goes inside the method call

In [19]:
matches = pattern.finditer(text_to_search)  # store the result in a variable; finditer returns an iterator over all of the matches

In [20]:
for match in matches:
    print(match)

<re.Match object; span=(1, 4), match='abc'>


In [14]:
print(text_to_search[1:4])

abc


In [22]:
# The dot/period (.) is a special char in regex. To search for any of . ^ $ * + ? { } [ ] \ | ( ) literally, you have to escape it with a backslash
# e.g. re.compile(r'\.')

In [4]:
pattern = re.compile(r'\.')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(111, 112), match='.'>
<re.Match object; span=(146, 147), match='.'>
<re.Match object; span=(167, 168), match='.'>
<re.Match object; span=(171, 172), match='.'>
<re.Match object; span=(218, 219), match='.'>
<re.Match object; span=(249, 250), match='.'>
<re.Match object; span=(262, 263), match='.'>


In [5]:
pattern = re.compile(r'coreyms\.com')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(139, 150), match='coreyms.com'>


In [6]:
pattern = re.compile(r'coreyms.com')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(139, 150), match='coreyms.com'>


In [7]:
pattern = re.compile('coreyms.com')
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(139, 150), match='coreyms.com'>
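The last two cells give the same result as the escaped version only because the text happens to contain a real dot at that position. A short sketch with a made-up sample string shows where the unescaped pattern goes wrong, and how re.escape can build the escaped form of a literal for you.

```python
import re

sample = "coreyms.com coreymsXcom"  # made-up string for illustration

unescaped = re.findall(r"coreyms.com", sample)   # "." matches any character
escaped = re.findall(r"coreyms\.com", sample)    # "\." matches only a literal dot

print(unescaped)  # ['coreyms.com', 'coreymsXcom']
print(escaped)    # ['coreyms.com']

# re.escape returns the literal string with its special chars escaped.
auto_escaped = re.compile(re.escape("coreyms.com"))
print(auto_escaped.findall(sample))  # ['coreyms.com']
```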


### **Snippets**

In [2]:
"""
#Searching for patterns is where regex becomes meaningful; literal text could be searched with plain Python
#These are the metacharacters that we were just escaping
.       - Any Character Except New Line
\d      - Digit (0-9)
\D      - Not a Digit (0-9) #capital letters basically negate whatever the lowercase version means
\w      - Word Character (a-z, A-Z, 0-9, _)
\W      - Not a Word Character
\s      - Whitespace (space, tab, newline)
\S      - Not Whitespace (space, tab, newline)

#Anchors: 
#they don't actually match any chars but rather invisible positions before or after chars
#They are used in conjunction with other patterns for searching

\b      - Word Boundary: indicated by whitespace or a non-alphanumeric char
\B      - Not a Word Boundary
^       - Beginning of a String #(caret sign), will match a position that is the beginning of the string
$       - End of a String, will match a position that is the end of the string

[]      - Matches Characters in brackets
[^ ]    - Matches Characters NOT in brackets
|       - Either Or
( )     - Group

Quantifiers:
*       - 0 or More
+       - 1 or More
?       - 0 or One
{3}     - Exact Number
{3,4}   - Range of Numbers (Minimum, Maximum)


#### Sample Regexs ####

[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+
"""

In [12]:
pattern = re.compile(r'\d')  # \d matches any single digit
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
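To make a few of the character classes from the cheat sheet concrete, here is a small sketch on a made-up sample string that contains a space, a tab, and a newline.

```python
import re

sample = "Room 42\tis-open\n"  # made-up sample text

digits = re.findall(r"\d", sample)        # single digits
words = re.findall(r"\w+", sample)        # runs of word chars; "-" is not one
non_space = re.findall(r"\S+", sample)    # runs of anything but whitespace
whitespace = re.findall(r"\s", sample)    # space, tab, and newline

print(digits)      # ['4', '2']
print(words)       # ['Room', '42', 'is', 'open']
print(non_space)   # ['Room', '42', 'is-open']
print(whitespace)  # [' ', '\t', '\n']
```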

In [13]:
pattern = re.compile(r'\bHa')  # the third Ha does not match: it sits in the middle of a word, so there is no boundary before it
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(66, 68), match='Ha'>
<re.Match object; span=(69, 71), match='Ha'>


In [15]:
pattern = re.compile(r'\BHa')  # this time only the one Ha that is not preceded by a word boundary matches
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(71, 73), match='Ha'>
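The same behaviour can be checked on just the "Ha HaHa" line. The sketch below also adds a trailing \b, which is not used in the cells above, to show how a pattern can be pinned to a whole word.

```python
import re

text = "Ha HaHa"

# Positions where "Ha" starts a word.
starts = [m.span() for m in re.finditer(r"\bHa", text)]
# Positions where "Ha" is NOT at the start of a word.
inside = [m.span() for m in re.finditer(r"\BHa", text)]
# "Ha" as a whole word: a boundary on both sides.
whole = [m.span() for m in re.finditer(r"\bHa\b", text)]

print(starts)  # [(0, 2), (3, 5)]
print(inside)  # [(5, 7)]
print(whole)   # [(0, 2)]
```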


**^ caret sign**

In [21]:
sentence = 'Start a sentence and then bring it to an end'

In [23]:
pattern = re.compile(r"^Start")
matches = pattern.finditer(sentence)
for match in matches:
    print(match)

<re.Match object; span=(0, 5), match='Start'>


In [None]:
# Outside of a charSet, the caret matches the beginning of a string
# Within a charSet, it negates the set and matches everything that is not in that charSet

In [18]:
pattern = re.compile(r"[^b]at")  # matches any single char except b, followed by "at"
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(267, 270), match='cat'>
<re.Match object; span=(272, 275), match='mat'>
<re.Match object; span=(276, 279), match='pat'>


**$ sign**

In [24]:
pattern = re.compile(r"end$")
matches = pattern.finditer(sentence)
for match in matches:
    print(match)

<re.Match object; span=(41, 44), match='end'>


**Matching phone numbers with regexes**

In [None]:
"""
We cannot search for a literal string because the numbers differ,
but they share a pattern with different digits,
so we need to use metacharacters instead of literal chars.
Pattern: 3 digits, a dash or period, 3 more digits, a dash or period, and 4 more digits
321-555-4321
123.555.1234
Creating a pattern to match this:
\d matches any single digit
1. Matching 3 digits: \d\d\d
2. Matching any char as the separator: .
""" 

In [32]:
pattern = re.compile(r"\d\d\d.\d\d\d.\d\d\d\d")
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(151, 163), match='321-555-4321'>
<re.Match object; span=(164, 176), match='123.555.1234'>
<re.Match object; span=(177, 189), match='123*555*1234'>
<re.Match object; span=(190, 202), match='800-555-1234'>
<re.Match object; span=(203, 215), match='900-555-1234'>


**charSets**

In [41]:
# To grab only a dash or a period, we use a charSet
# A charSet uses square brackets around the chars that we want to match
# [-.] however many chars a charSet lists, it matches exactly one char: here either a dash (-) or a period (.)

In [38]:
pattern = re.compile(r"\d\d\d[-.]\d\d\d[-.]\d\d\d\d")  # the period (.) does not need escaping inside a charSet
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(151, 163), match='321-555-4321'>
<re.Match object; span=(164, 176), match='123.555.1234'>
<re.Match object; span=(190, 202), match='800-555-1234'>
<re.Match object; span=(203, 215), match='900-555-1234'>
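A short sketch of the point that a charSet consumes exactly one character. The sample strings are made up, and fullmatch is used here only to test a whole string against the pattern.

```python
import re

separator = re.compile(r"[-.]")
found = separator.findall("a-b.c*d")
print(found)  # ['-', '.']

# A charSet matches one char, so a doubled separator does not fit the pattern.
single_ok = bool(re.fullmatch(r"\d[-.]\d", "1-2"))
double_ok = bool(re.fullmatch(r"\d[-.]\d", "1--2"))
print(single_ok, double_ok)  # True False
```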


In [42]:
pattern = re.compile(r"[89]00[-.]\d\d\d[-.]\d\d\d\d")  # [89] matches either 8 or 9, so only 800 and 900 numbers match
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(190, 202), match='800-555-1234'>
<re.Match object; span=(203, 215), match='900-555-1234'>


**dash (-) as a range**

In [44]:
pattern = re.compile(r"[1-5]")
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(55, 56), match='1'>
<re.Match object; span=(56, 57), match='2'>
<re.Match object; span=(57, 58), match='3'>
<re.Match object; span=(58, 59), match='4'>
<re.Match object; span=(59, 60), match='5'>
<re.Match object; span=(151, 152), match='3'>
<re.Match object; span=(152, 153), match='2'>
<re.Match object; span=(153, 154), match='1'>
<re.Match object; span=(155, 156), match='5'>
<re.Match object; span=(156, 157), match='5'>
<re.Match object; span=(157, 158), match='5'>
<re.Match object; span=(159, 160), match='4'>
<re.Match object; span=(160, 161), match='3'>
<re.Match object; span=(161, 162), match='2'>
<re.Match object; span=(162, 163), match='1'>
<re.Match object; span=(164, 165), match='1'>
<re.Match object; span=(165, 166), match='2'>
<re.Match object; span=(166, 167), match='3'>
<re.Match object; span=(168, 169), match='5'>
<re.Match object; span=(169, 170), match='5'>
<re.Match object; span=(170, 171), match='5'>
<re.Match object; span=(172, 173), match='1'

In [46]:
pattern = re.compile(r"[a-zA-Z]")  # ranges can be placed back to back: any lowercase or uppercase letter
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)
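The dash only means a range when it sits between two chars inside the charSet. A small sketch on made-up strings: ranges, combined ranges, and a dash placed first so that it is treated as a literal dash.

```python
import re

one_to_five = re.findall(r"[1-5]", "0123456789")
mixed_ranges = re.findall(r"[a-cX-Z]", "abcdWXYZ")
literal_dash = re.findall(r"[-15]", "1-5 3")  # dash at the start is a plain dash

print(one_to_five)   # ['1', '2', '3', '4', '5']
print(mixed_ranges)  # ['a', 'b', 'c', 'X', 'Y', 'Z']
print(literal_dash)  # ['1', '-', '5']
```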

**Quantifiers**

In [None]:
# Quantifiers are used to match more than one char at once

In [None]:
# The example below matches any char as the separator

In [19]:
pattern = re.compile(r"\d\d\d.\d\d\d.\d\d\d\d")
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(151, 163), match='321-555-4321'>
<re.Match object; span=(164, 176), match='123.555.1234'>
<re.Match object; span=(177, 189), match='123*555*1234'>
<re.Match object; span=(190, 202), match='800-555-1234'>
<re.Match object; span=(203, 215), match='900-555-1234'>


In [None]:
# Such a long pattern is error-prone
# so we use quantifiers to match multiple chars at a time

In [20]:
pattern = re.compile(r"\d{3}.\d{3}.\d{4}")
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(151, 163), match='321-555-4321'>
<re.Match object; span=(164, 176), match='123.555.1234'>
<re.Match object; span=(177, 189), match='123*555*1234'>
<re.Match object; span=(190, 202), match='800-555-1234'>
<re.Match object; span=(203, 215), match='900-555-1234'>
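A hedged sketch of the exact and ranged quantifiers on a made-up string of numbers. The \b anchors are added here so that longer runs of digits are not matched in part.

```python
import re

text = "12 123 1234 12345"  # made-up sample

exact = re.findall(r"\b\d{3}\b", text)     # exactly three digits
ranged = re.findall(r"\b\d{3,4}\b", text)  # three or four digits

print(exact)   # ['123']
print(ranged)  # ['123', '1234']

# Without anchors, a count that is too small matches only part of the number.
truncated = re.findall(r"\d{3}", "1234")
print(truncated)  # ['123']
```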


In [23]:
pattern = re.compile(r"Mr\.")
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(216, 219), match='Mr.'>
<re.Match object; span=(260, 263), match='Mr.'>


In [25]:
# To also grab the Mr entries without a period, we need to say that the period after the prefix is optional
# We use ? to match either zero or one of that char

In [26]:
pattern = re.compile(r"Mr\.?")
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(216, 219), match='Mr.'>
<re.Match object; span=(228, 230), match='Mr'>
<re.Match object; span=(246, 248), match='Mr'>
<re.Match object; span=(260, 263), match='Mr.'>


In [None]:
# Next comes a space (\s) followed by an uppercase letter; to match the letter we can use a charSet

In [27]:
pattern = re.compile(r"Mr\.?\s[A-Z]")
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(216, 221), match='Mr. S'>
<re.Match object; span=(228, 232), match='Mr S'>
<re.Match object; span=(260, 265), match='Mr. T'>


In [None]:
# But we still need to match the rest of each name.
# We could match any word char after that uppercase letter,
# and we have to decide which quantifier to use for the word chars.
# We could use the plus sign (+) quantifier, which matches one or more of these word chars

In [31]:
pattern = re.compile(r"Mr\.?\s[A-Z]\w+")
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(216, 227), match='Mr. Schafer'>
<re.Match object; span=(228, 236), match='Mr Smith'>


In [32]:
# Mr. T is left out. A better solution is the asterisk (*) quantifier, which matches zero or more word chars following that uppercase letter

In [33]:
pattern = re.compile(r"Mr\.?\s[A-Z]\w*")
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(216, 227), match='Mr. Schafer'>
<re.Match object; span=(228, 236), match='Mr Smith'>
<re.Match object; span=(260, 265), match='Mr. T'>
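The difference between + and * can be checked directly on the two names involved. This is a small sketch; fullmatch is used only to test each name as a whole.

```python
import re

names = ["Mr. T", "Mr. Schafer"]

plus = re.compile(r"Mr\.?\s[A-Z]\w+")  # needs at least one word char after the capital
star = re.compile(r"Mr\.?\s[A-Z]\w*")  # accepts zero word chars after the capital

plus_results = [bool(plus.fullmatch(name)) for name in names]
star_results = [bool(star.fullmatch(name)) for name in names]

print(plus_results)  # [False, True]
print(star_results)  # [True, True]
```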


In [None]:
# We still miss Ms and Mrs.
# We could use a charSet that matches either an r or an s after M, but that alone would not cover Mrs
# It is better to use groups

**Groups**

In [None]:
# Groups allow us to match several different patterns
# To create a group we use parentheses

In [36]:
pattern = re.compile(r"M(r|s|rs)\.?\s[A-Z]\w*")  # match if r, s, or rs follows the capital M
#pattern = re.compile(r"(Mr|Ms|Mrs)\.?\s[A-Z]\w*")
matches = pattern.finditer(text_to_search)
for match in matches:
    print(match)

<re.Match object; span=(216, 227), match='Mr. Schafer'>
<re.Match object; span=(228, 236), match='Mr Smith'>
<re.Match object; span=(237, 245), match='Ms Davis'>
<re.Match object; span=(246, 259), match='Mrs. Robinson'>
<re.Match object; span=(260, 265), match='Mr. T'>
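Each group can also be read back from a match object with the group method. In the sketch below the second pair of parentheses around the name is an addition for illustration, and the sample sentence is made up.

```python
import re

pattern = re.compile(r"(Mr|Ms|Mrs)\.?\s([A-Z]\w*)")
match = pattern.search("Dear Mrs. Robinson,")

whole = match.group(0)  # the entire match
title = match.group(1)  # first group: the prefix
name = match.group(2)   # second group: the name

print(whole)  # Mrs. Robinson
print(title)  # Mrs
print(name)   # Robinson
```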


**findall method**

In [None]:
# finditer: returns match objects with extra info such as the span
# findall: just returns the matches as a list of strings
# If the pattern has one group, it returns only that group
# If there are multiple groups, it returns a list of tuples containing all of the groups
# If there are no groups, it returns all the full matches as a list of strings
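The three cases above can be sketched on a made-up string. The last line uses a non-capturing group (?:...), which is not covered in these notes, as one way to group alternatives while still getting the full matches back.

```python
import re

text = "Mr. Schafer, Ms Davis"  # made-up sample

no_groups = re.findall(r"M[rs]\.?\s[A-Z]\w*", text)
one_group = re.findall(r"(Mr|Ms)\.?\s[A-Z]\w*", text)
two_groups = re.findall(r"(Mr|Ms)\.?\s([A-Z]\w*)", text)
non_capturing = re.findall(r"(?:Mr|Ms)\.?\s[A-Z]\w*", text)

print(no_groups)      # ['Mr. Schafer', 'Ms Davis']
print(one_group)      # ['Mr', 'Ms']
print(two_groups)     # [('Mr', 'Schafer'), ('Ms', 'Davis')]
print(non_capturing)  # ['Mr. Schafer', 'Ms Davis']
```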

In [42]:
pattern = re.compile(r"(Mr|Ms|Mrs)\.?\s[A-Z]\w*")
matches = pattern.findall(text_to_search)
for match in matches:
    print(match)

Mr
Mr
Ms
Mrs
Mr


In [44]:
pattern = re.compile(r"\d{3}.\d{3}.\d{4}")
matches = pattern.findall(text_to_search)
for match in matches:
    print(match)

321-555-4321
123.555.1234
123*555*1234
800-555-1234
900-555-1234


**match method**

In [None]:
# match determines whether the regex matches at the beginning of the string

In [46]:
sentence = "Start a sentence and then bring it to an end"
pattern = re.compile(r"Start")
matches = pattern.match(sentence)
for match in matches:
    print(match)

TypeError: 're.Match' object is not iterable

In [None]:
# We got an error because match doesn't return an iterable like finditer or findall
# It just returns the first match, and only if it is at the start of the string
# If there is no match, it returns None
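Because match can return None, it is usually worth checking the result before calling methods on it. A minimal sketch, with made-up sentences:

```python
import re

pattern = re.compile(r"Start")
sentences = ["Start a sentence", "Do not Start here"]  # made-up samples

results = []
for sentence in sentences:
    m = pattern.match(sentence)
    if m is not None:
        results.append(m.group())  # the matched text
    else:
        results.append(None)       # "Start" is not at the beginning

print(results)  # ['Start', None]
```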

In [None]:
#so, instead of looping through our result we can just print out that matches variable

In [49]:
sentence = "Start a sentence and then bring it to an end"
pattern = re.compile(r"Start")
matches = pattern.match(sentence)  # match only finds things at the beginning of the string
print(matches)

<re.Match object; span=(0, 5), match='Start'>


In [50]:
sentence = "Start a sentence and then bring it to an end"
pattern = re.compile(r"sentence")
matches = pattern.match(sentence)
print(matches)

None


**search method**

In [51]:
sentence = "Start a sentence and then bring it to an end"
pattern = re.compile(r"sentence")
matches = pattern.search(sentence)
print(matches)

<re.Match object; span=(8, 16), match='sentence'>
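A short sketch contrasting the three methods on the same sentence, using the made-up pattern "en": search returns only the first occurrence anywhere in the string, finditer yields all of them, and match returns None because the sentence does not begin with "en".

```python
import re

sentence = "Start a sentence and then bring it to an end"
pattern = re.compile(r"en")

first = pattern.search(sentence)
all_spans = [m.span() for m in pattern.finditer(sentence)]
at_start = pattern.match(sentence)

print(first.span())  # (9, 11)
print(all_spans)     # [(9, 11), (12, 14), (23, 25), (41, 43)]
print(at_start)      # None
```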


**flags**

In [53]:
sentence = "Start a sentence and then bring it to an end"
#pattern = re.compile(r"start", re.IGNORECASE)
pattern = re.compile(r"start", re.I)
matches = pattern.search(sentence)
print(matches)

<re.Match object; span=(0, 5), match='Start'>


**Recap Example**

In [4]:
emails = """CoreyMSchafer@gmail.com
corey.schafer@university.edu
corey-321-schafer@my-work.net"""

Write a regex that matches all these e-mails

In [22]:
#pattern = re.compile(r"^[A-Z]\w{4}[A-Z]{2}\w{6}[@]\w{5}\.\w{3}")#1st email
#pattern = re.compile(r"[a-zA-Z]+@[a-z]+\.com")#1st email
pattern = re.compile(r"[a-zA-Z0-9.-]+@[a-z-]+\.(com|edu|net)")
matches = pattern.finditer(emails)
for match in matches:
    print(match)

<re.Match object; span=(0, 23), match='CoreyMSchafer@gmail.com'>
<re.Match object; span=(24, 52), match='corey.schafer@university.edu'>
<re.Match object; span=(53, 82), match='corey-321-schafer@my-work.net'>
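The pattern is tailored to these three addresses, so it is worth probing its limits. The sketch below runs it on a few made-up candidates; fullmatch tests a whole string, while search may pick up only part of one.

```python
import re

pattern = re.compile(r"[a-zA-Z0-9.-]+@[a-z-]+\.(com|edu|net)")

candidates = [
    "CoreyMSchafer@gmail.com",
    "user@example.org",   # .org is not in the group
    "no-at-sign.com",     # no @ at all
    "a_b@site.net",       # underscore is not in the first charSet
]
accepted = [c for c in candidates if pattern.fullmatch(c)]
print(accepted)  # ['CoreyMSchafer@gmail.com']

# search still finds a partial match after the underscore.
partial = pattern.search("a_b@site.net").group()
print(partial)  # b@site.net
```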


**url example**

In [23]:
urls = """
https://www.google.com
http://coreyms.com
https://youtube.com
https://www.nasa.gov
"""

In [36]:
#pattern = re.compile(r"https?://(www\.)?\w+\.\w+")
pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")  # the parentheses create groups, which give finer-grained access to parts of each match in the output
matches=pattern.finditer(urls)
for match in matches:
    print(match.group(2))

google
coreyms
youtube
nasa


**back reference**

In [None]:
# A back reference refers to a captured group
# It is shorthand for accessing a group by its index
# The re module has a sub method, used to perform substitution

In [None]:
# We can use these back references in the substitution to refer to the groups

In [40]:
pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")
subbed_urls= pattern.sub(r"\2\3", urls)
print(subbed_urls)


google.com
coreyms.com
youtube.com
nasa.gov



In [None]:
# The substitution is passed as the first argument to sub. The substitution we want here is made of back references to the groups

In [None]:
# We want to replace each URL with the domain name (group 2) and the top-level domain (group 3)
# A back reference is written as a backslash \ followed by the group number
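As another sketch of back references in a substitution, the snippet below reorders the parts of a date. The date string is made up, and the \g<1> form shown at the end is the alternative spelling for when a group number is followed directly by a digit.

```python
import re

iso_date = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# \3, \2, \1 refer to the day, month, and year groups.
reordered = iso_date.sub(r"\3/\2/\1", "Released on 2024-01-31.")
print(reordered)  # Released on 31/01/2024.

# r"\10" would be read as group 10, so \g<1> is used before a literal 0.
padded = re.sub(r"(\d+)", r"\g<1>0", "7 apples")
print(padded)  # 70 apples
```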

**Example: how it becomes powerful for parsing information from data**

In [13]:
pattern = re.compile(r"\d\d\d.\d\d\d.\d\d\d\d")
with open("data.txt", "r", encoding="utf-8") as f:
    contents = f.read()
    
    matches = pattern.finditer(contents)
    
    for match in matches:
        print(match)

<re.Match object; span=(12, 24), match='615-555-7164'>
<re.Match object; span=(102, 114), match='800-555-5669'>
<re.Match object; span=(191, 203), match='560-555-5153'>
<re.Match object; span=(281, 293), match='900-555-9340'>
<re.Match object; span=(378, 390), match='714-555-7405'>
<re.Match object; span=(467, 479), match='800-555-6771'>
<re.Match object; span=(557, 569), match='783-555-4799'>
<re.Match object; span=(647, 659), match='516-555-4615'>
<re.Match object; span=(740, 752), match='127-555-1867'>
<re.Match object; span=(829, 841), match='608-555-4938'>
<re.Match object; span=(915, 927), match='568-555-6051'>
<re.Match object; span=(1003, 1015), match='292-555-1875'>
<re.Match object; span=(1091, 1103), match='900-555-3205'>
<re.Match object; span=(1180, 1192), match='614-555-1166'>
<re.Match object; span=(1269, 1281), match='530-555-2676'>
<re.Match object; span=(1355, 1367), match='470-555-2750'>
<re.Match object; span=(1439, 1451), match='800-555-6089'>
<re.Match object; spa
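When only the matched text is needed, match.group() can be collected instead of printing the match objects. The sketch below is self-contained: it uses io.StringIO with made-up contents as a stand-in for data.txt, whose real contents are not shown here, and it restricts the separator to a dash or period.

```python
import io
import re

# Stand-in for open("data.txt"); the records below are invented.
fake_file = io.StringIO(
    "Dave Martin\n615-555-7164\n173 Main St.\n\n"
    "Charles Harris\n800-555-5669\n969 High St.\n"
)

pattern = re.compile(r"\d{3}[-.]\d{3}[-.]\d{4}")
contents = fake_file.read()

# Keep just the matched strings rather than the full match objects.
phone_numbers = [m.group() for m in pattern.finditer(contents)]
print(phone_numbers)  # ['615-555-7164', '800-555-5669']
```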

**Python Tutorial: re Module - How to Write and Match Regular Expressions (Regex)**

Corey Schafer

In [None]:
# https://www.youtube.com/watch?v=K8L6KVGG-7o