# Regular Expressions

# Tasks today:
1) <b>Importing</b> <br>
2) <b>Using Regular Expressions</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) re.compile() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) re.match() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) re.findall() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) re.search() <br>
3) <b>Sets</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Integer Ranges <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Character Ranges <br>
4) <b>Counting Occurrences</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) {x} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) {x,y} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) ? <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) * <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) + <br>
5) <b>In-Class Exercise #1</b> <br>
6) <b>Escaping Characters</b> <br>
7) <b>Grouping</b> <br>
8) <b>In-Class Exercise #2</b> <br>
9) <b>Opening a File</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) open() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) with open() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) re.match() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) re.search() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) re.findall() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; f) Store the String to a Variable <br>
10) <b>Regex Project</b> <br>

### Importing <br>
<p>Regular expressions are available in most programming languages. In Python, they are provided by the standard library module 're', which must be imported before use.</p>

In [2]:
# import re
import re

### Using Regular Expressions <br>
<p>Regular expressions give us the ability to search for patterns within text, strings, files, etc. They serve several uses, such as security measures, searching, filtering, pattern recognition, and more.</p>

##### RegEx Cheatsheet

In [None]:
########################
# DO NOT RUN THIS CELL #
########################

a, X, 9, < -- ordinary characters just match themselves exactly.
. (a period) -- matches any single character except newline '\n'
\w -- matches a "word" character: a letter, digit, or underscore [a-zA-Z0-9_].
\W -- matches any non-word character.
\b -- matches word boundary (in between a word character and a non word character)
\s -- matches a single whitespace character -- space, newline, return, tab
\S -- matches any non-whitespace character.
\t, \n, \r -- tab, newline, return
\d -- matches any numeric digit [0-9]
\D -- matches any non-digit character.
^ -- matches the beginning of the string; inside a set, as in [^abc], it excludes the listed characters
$ -- matches the end of the string
\ -- escapes special character.
(x|y|z) matches exactly one of x, y or z.
(x) in general is a remembered group. We can get the value of what matched by using the groups() method of the object returned by re.search.
x? matches an optional x character (in other words, it matches an x zero or one times).
x* matches x zero or more times.
x+ matches x one or more times.
x{m,n} matches an x character at least m times, but not more than n times.
(?:x) -- matches x without capturing it. Non-capturing group.
(?=x) -- matches if x comes next, without including it in the match. Positive lookahead.
a(?=b) will match the "a" in "ab", but not the "a" in "ac"
In other words, a(?=b) matches the "a" which is followed by the string 'b', without consuming what follows the a.
(?!x) -- matches if x does not come next. Negative lookahead.
a(?!b) will match the "a" in "ac", but not the "a" in "ab"
(?<=x) -- matches if x comes immediately before. Positive lookbehind.
(?<!x) -- matches if x does not come immediately before. Negative lookbehind.
[] -- matches any single character from the set listed inside the brackets, e.g. [abc] or [a-z]

########################
# DO NOT RUN THIS CELL #
########################
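The lookaround entries in the cheatsheet are easier to follow with a concrete run. The sketch below is illustrative only; the sample strings and variable names are made up for this example.

```python
import re

# Positive lookahead: match 'a' only when it is followed by 'b' (the 'b' is not consumed)
ahead = re.findall(r'a(?=b)', 'ab ac ab')

# Negative lookahead: match 'a' only when it is NOT followed by 'b'
not_ahead = re.findall(r'a(?!b)', 'ab ac ab')

# Positive lookbehind: match digits only when a '$' comes right before them
behind = re.findall(r'(?<=\$)\d+', 'cost $15 or 20 euros')

# Negative lookbehind: match whole numbers that do NOT have a '$' right before them
not_behind = re.findall(r'(?<!\$)\b\d+', 'cost $15 or 20 euros')

print(ahead)       # ['a', 'a']
print(not_ahead)   # ['a']
print(behind)      # ['15']
print(not_behind)  # ['20']
```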

##### re.compile()

In [19]:
# re.compile() builds a reusable pattern object
# that can be used with the regular expression methods below

pattern = re.compile('abcdefg')
pattern

re.compile(r'abcdefg', re.UNICODE)
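Compiling is optional. As a rough sketch (the variable names here are only for illustration), the module-level functions accept the pattern as their first argument and give the same result as the methods on a compiled pattern:

```python
import re

text = 'abcdefg123'

# Compiled pattern object: handy when the same pattern is reused many times
compiled = re.compile(r'abcdefg')
from_compiled = compiled.match(text)

# Module-level function: the pattern is passed as the first argument
from_module = re.match(r'abcdefg', text)

print(from_compiled.group(), from_module.group())
print(from_compiled.span() == from_module.span())
```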

##### re.match()

In [20]:
# only looks at beginning of string
match = pattern.match('abcdefg123')
print(match)

# accessing span attribute of the returned Match object
span_match = match.span()
'abcdefg123'[span_match[0]:span_match[1]]

<re.Match object; span=(0, 7), match='abcdefg'>


'abcdefg'

##### re.findall()

In [25]:
pattern = re.compile('abc')
random_string = "123 codingtemple abcd abcdef abc"

found = pattern.findall(random_string)
print(found)

['abc', 'abc', 'abc']


##### re.search()

In [28]:
search = pattern.search(random_string)

print(search)

search_span = search.span()

random_string[search_span[0]:search_span[1]]

<re.Match object; span=(17, 20), match='abc'>


'abc'
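The difference between these methods is where they are allowed to match. A minimal side-by-side sketch, using a made-up sample string (fullmatch() is not covered above but is part of the same module):

```python
import re

pattern = re.compile(r'abc')
text = '123 abc'

at_start = pattern.match(text)        # None: the string does not start with 'abc'
anywhere = pattern.search(text)       # Match object: 'abc' is found at positions 4 to 7
whole = pattern.fullmatch('abc')      # Match object: the entire string is 'abc'
not_whole = pattern.fullmatch(text)   # None: the string contains more than 'abc'

print(at_start)
print(anywhere.span())
print(whole.group())
print(not_whole)
```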

### Sets <br>
<p>The following cells will allow you to use regular expressions to search for certain values within a range such as numbers 1 through 4.</p>

##### [a-z] or [A-Z] - any lowercase/uppercase letters from a to z<br/>[^2] - anything that's not 2

##### Integer Ranges

In [47]:
pattern_int = re.compile('[0-6][7-9]')

random_nums = "18 22 67 100 98 12"

match = pattern_int.match(random_nums)
print(match)

found_all = pattern_int.findall(random_nums)
print(found_all)

<re.Match object; span=(0, 2), match='18'>
['18', '67']


##### Character Ranges

In [49]:
char_patt = re.compile('[A-Z][a-z]')
found_char = char_patt.findall('Hello There Mr. Anderson...')
print(found_char)

['He', 'Th', 'Mr', 'An']
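The negated set [^2] mentioned at the top of this section works the same way as a range, only inverted. A short sketch with a made-up string of digits:

```python
import re

digits = '1223 4252'

# [^2] matches any single character that is not '2' (the space counts too)
not_two = re.findall(r'[^2]', digits)

# [^2\s]+ matches runs of characters that are neither '2' nor whitespace
runs = re.findall(r'[^2\s]+', digits)

print(not_two)  # ['1', '3', ' ', '4', '5']
print(runs)     # ['1', '3', '4', '5']
```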


### Counting Occurrences

##### {x} - something that occurs exactly x times

In [53]:
pattern_count = re.compile('[A-Z][a-z][0-3]{2}') #[A-Z][a-z][0-3][0-3]
found_count = pattern_count.findall('Hello There Mr22. An33derson... CT2020')
found_count

['Mr22', 'An33']

##### {x,y} - something that occurs between x and y times (inclusive, no space after the comma)

In [54]:
# includes 5 in range
random_pattern = re.compile('m{1,5}')
found_range = random_pattern.findall("This is an example of a regular expression trying to find one m, more than one mmm or 5 mmmmms")
print(found_range)

['m', 'm', 'm', 'mmm', 'mmmmm']
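The upper bound can also be left out, and an exact count behaves differently with and without word boundaries. A small sketch on a made-up string:

```python
import re

text = 'm mm mmm mmmm'

# {2,} means "two or more" (no upper limit)
two_or_more = re.findall(r'm{2,}', text)

# {3} means "exactly three", but it can still match inside a longer run
three_anywhere = re.findall(r'm{3}', text)

# Adding \b on both sides restricts the match to whole words of exactly three m's
three_whole_word = re.findall(r'\bm{3}\b', text)

print(two_or_more)       # ['mm', 'mmm', 'mmmm']
print(three_anywhere)    # ['mmm', 'mmm']
print(three_whole_word)  # ['mmm']
```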


##### ? - something that occurs 0 or 1 time

In [59]:
pattern = re.compile('Mrs?')
found_pattern = pattern.findall('Hello Mr. Anderson. How is Mrsssss. Anderson?')
found_pattern

['Mr', 'Mrs']

##### * - something that occurs at least 0 times

In [73]:
pattern = re.compile('Ms*')
found_ms = pattern.findall('My name is Ms. Smith. Here is a weird word MsssssMs')
found_ms

#example to grab all capital letters or title cased words
# char_pattern = re.compile('[A-Z][a-z]*')
# found_again = char_pattern.findall('Hello There hello Mr. Anderson')
# found_again

['M', 'Ms', 'Msssss', 'Ms']


##### + - something that occurs at least once

In [86]:
pattern = re.compile('[A-Z][a-z]+')
found_it = pattern.findall('Hello There hello Mr. Anderson')
print(found_it)

['Hello', 'There', 'Mr', 'Anderson']


##### In-class exercise 1: 

Use a regular expression to find every number in the given string

In [84]:
my_string = "This string has 10909090 numbers, but it is only 1 string. I hope you solve this 2day."

pattern = re.compile('[0-9]+')
numbers = pattern.findall(my_string)
print(numbers)

['10909090', '1', '2']


### Escaping Characters

##### \w - look for any word character (letters, digits, and underscore, including Unicode letters)<br/>\W - look for anything that isn't a word character

[History on Unicode](http://unicode.org/standard/WhatIsUnicode.html)

[More on Unicode Characters](https://en.wikipedia.org/wiki/List_of_Unicode_characters)

In [94]:
pattern_1 = re.compile(r'[\w]+') # [\w] is the same as [a-zA-Z0-9_] for ASCII text
pattern_2 = re.compile(r'[\W]+') # spaces and punctuation

found_1 = pattern_1.findall(my_string)
print(found_1)
found_2 = pattern_2.findall(my_string)
# print(found_2)

['This', 'string', 'has', '10909090', 'numbers', 'but', 'it', 'is', 'only', '1', 'string', 'I', 'hope', 'you', 'solve', 'this', '2day']


##### \d - look for any digit 0-9<br/>\D - look for anything that isn't a digit

In [97]:
num_pattern = re.compile(r'\d{1,2}[a-z]{2}')
found_date = num_pattern.findall('Today is the 4th, tomorrow is the 5th, halloween is the 31st')
print(found_date)

non_pattern = re.compile(r'\D+')
found_non = non_pattern.findall('Today is the 4th, tomorrow is the 5th, halloween is the 31st')
found_non

['4th', '5th', '31st']


['Today is the ', 'th, tomorrow is the ', 'th, halloween is the ', 'st']

##### \s - look for any whitespace<br/>\S - look for anything that isn't whitespace

In [106]:
pattern = re.compile(r'\s[a-z]+')
pattern_2 = re.compile(r'\S[a-z]+')

found_1 = pattern.findall('Are you afraid of the dark?')
found_2 = pattern_2.findall('Are you afraid of the dark?')

print(found_1)
print(found_2)

[' you', ' afraid', ' of', ' the', ' dark']
['Are', 'you', 'afraid', 'of', 'the', 'dark']


##### \b - look for boundaries or edges of a word<br/>\B - look for anything that isn't a boundary

In [130]:
# must use r string to create a raw string (literal interpretation of the string) 
#because '\b' is a python escape character for backspace

pattern = re.compile(r'\bTheCodingTemple\b')
pattern_nobound = re.compile(r'\BTheCodingTemple\B')

found_bound = pattern.findall('I Work at TheCodingTemple OK')
print(found_bound)

found_none = pattern_nobound.findall('IWorkAtTheCodingTempleOK')
print(found_none)

['TheCodingTemple']
['TheCodingTemple']
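To see why the raw string matters here, the sketch below compares the same pattern with and without the r prefix. The sample sentence is reused from the cell above; the variable names are illustrative.

```python
import re

# Without the r prefix, Python turns '\b' into a single backspace character
plain_length = len('\b')   # 1
raw_length = len(r'\b')    # 2: a backslash followed by the letter b

text = 'I Work at TheCodingTemple OK'

# The non-raw pattern searches for literal backspace characters, so nothing is found
without_raw = re.findall('\bTheCodingTemple\b', text)

# The raw pattern reaches the regex engine intact and is read as word boundaries
with_raw = re.findall(r'\bTheCodingTemple\b', text)

print(plain_length, raw_length)
print(without_raw)  # []
print(with_raw)     # ['TheCodingTemple']
```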


#### Allows searches for special characters

In [137]:
# a backslash lets us search for a special character literally instead of using its RegEx functionality
# \?
q_pattern = re.compile(r'[\w]+[\?]')
sentence = "Where do you live?"
q_found = q_pattern.findall(sentence)
print(q_found)

['live?']
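The period is the special character that most often needs escaping, because an unescaped . matches any character. A brief sketch with a made-up string:

```python
import re

text = 'file.txt filextxt'

# Unescaped: . matches any single character, so both words are found
any_char = re.findall(r'file.txt', text)

# Escaped: \. matches only a literal period
literal_dot = re.findall(r'file\.txt', text)

print(any_char)     # ['file.txt', 'filextxt']
print(literal_dot)  # ['file.txt']
```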


### Grouping

In [363]:
# Either or capital/lowercase [a-zA-Z]
my_string_again = "Max Smith, aaron rodgers, Sam Darnold, LeBron James, Michael Jordan, Kevin Durant, Patrick McCormick"

# Group of names RegEx Compiler
pattern_name = re.compile('([A-Z][A-Za-z]+) ([A-Z][A-Za-z]+)')
found_name = pattern_name.findall(my_string_again)
print(found_name)

for names in found_name:
    print(f"This athlete's first name is {names[0]}. Their last name is {names[1]}")

[('Max', 'Smith'), ('Sam', 'Darnold'), ('LeBron', 'James'), ('Michael', 'Jordan'), ('Kevin', 'Durant'), ('Patrick', 'McCormick')]
This athlete's first name is Max. Their last name is Smith
This athlete's first name is Sam. Their last name is Darnold
This athlete's first name is LeBron. Their last name is James
This athlete's first name is Michael. Their last name is Jordan
This athlete's first name is Kevin. Their last name is Durant
This athlete's first name is Patrick. Their last name is McCormick
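findall() returns the groups as tuples. With search() or match(), the same groups are read from the Match object instead, as the cheatsheet notes for groups(). A minimal sketch using a shortened version of the string above:

```python
import re

pattern_name = re.compile(r'([A-Z][A-Za-z]+) ([A-Z][A-Za-z]+)')
match = pattern_name.search('aaron rodgers, Sam Darnold, LeBron James')

whole = match.group(0)    # the entire matched text
first = match.group(1)    # the first parenthesized group
last = match.group(2)     # the second parenthesized group
both = match.groups()     # all groups as a tuple

print(whole)  # Sam Darnold
print(first)  # Sam
print(last)   # Darnold
print(both)   # ('Sam', 'Darnold')
```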


##### In-class Exercise 2:

Write a function using regular expressions to find the domain name in the given email addresses (and return None for the invalid email addresses)<br><b>HINT: Use '|' for either or</b>

In [191]:
my_emails = ["jordanw@codingtemple.orgcom", "pocohontas1776@gmail.com", "helloworld@aol..com",
             "yourfavoriteband@g6.org", "@codingtemple.com"]

# Ending the pattern with $ anchors the match to the end of the string,
# so nothing may follow the .com or .org

# solution 1
# compile once, outside the loop; \. matches a literal dot
pattern = re.compile(r'[\w]+@[\w]+\.(com|org)$')
for email in my_emails:
    match = pattern.match(email)
    if match:
        print(email)
    else:
        print('None')

# solution 2
# def address(i):
#     email = re.compile(r'([\w]+)@([\w]+)\.(com|org)$')
#     if email.match(i):
#         return i
#     else:
#         return None

# for i in my_emails:
#     print(address(i))

None
pocohontas1776@gmail.com
None
yourfavoriteband@g6.org
None
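The exercise asks for the domain name, while the cell above prints the whole address. One possible variation is sketched below; the function name find_domain and the constant EMAIL_PATTERN are hypothetical, and the pattern is only as strict as the one used above.

```python
import re

# The capturing group holds the domain; (?:com|org) groups the alternatives without capturing
EMAIL_PATTERN = re.compile(r'\w+@(\w+\.(?:com|org))$')

def find_domain(email):
    """Return the domain of a valid address, or None if the address is invalid."""
    match = EMAIL_PATTERN.match(email)
    return match.group(1) if match else None

my_emails = ["jordanw@codingtemple.orgcom", "pocohontas1776@gmail.com", "helloworld@aol..com",
             "yourfavoriteband@g6.org", "@codingtemple.com"]

results = [find_domain(email) for email in my_emails]
print(results)  # [None, 'gmail.com', None, 'g6.org', None]
```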


### Opening a File <br>
<p>Python gives us a couple of ways to open and read files. Below are the two used most often.</p>

##### open()

In [270]:
f = open('names.txt')
data = f.read()
print(data)
f.close()

Hawkins, Derek	derek@codingtemple.com	(555) 555-5555	Teacher, Coding Temple	@derekhawkins
Zhai, Mo	mozhai@codingtemple.com	(555) 555-5554	Teacher, Coding Temple
Johnson, Joe	joejohnson@codingtemple.com		Johson, Joe
Osterberg, Sven-Erik	governor@norrbotten.co.se		Governor, Norrbotten	@sverik
, Tim	tim@killerrabbit.com		Enchanter, Killer Rabbit Cave
Butz, Ryan	ryanb@codingtemple.com	(555) 555-5543	CEO, Coding Temple	@ryanbutz
Doctor, The	doctor+companion@tardis.co.uk		Time Lord, Gallifrey
Exampleson, Example	me@example.com	555-555-5552	Example, Example Co.	@example
Pael, Ripal	ripalp@codingtemple.com	(555) 555-5553	Teacher, Coding Temple	@ripalp
Vader, Darth	darth-vader@empire.gov	(555) 555-4444	Sith Lord, Galactic Empire	@darthvader
Fernandez de la Vega Sanz, Maria Teresa	mtfvs@spain.gov		First Deputy Prime Minister, Spanish Gov



##### with open()

In [280]:
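# no need to call .close(); the with block closes the file automatically
# stored under a separate name so that data still holds the full string from f.read()
with open('names.txt') as f:
    lines = f.readlines()
lines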

['Hawkins, Derek\tderek@codingtemple.com\t(555) 555-5555\tTeacher, Coding Temple\t@derekhawkins\n',
 'Zhai, Mo\tmozhai@codingtemple.com\t(555) 555-5554\tTeacher, Coding Temple\n',
 'Johnson, Joe\tjoejohnson@codingtemple.com\t\tJohson, Joe\n',
 'Osterberg, Sven-Erik\tgovernor@norrbotten.co.se\t\tGovernor, Norrbotten\t@sverik\n',
 ', Tim\ttim@killerrabbit.com\t\tEnchanter, Killer Rabbit Cave\n',
 'Butz, Ryan\tryanb@codingtemple.com\t(555) 555-5543\tCEO, Coding Temple\t@ryanbutz\n',
 'Doctor, The\tdoctor+companion@tardis.co.uk\t\tTime Lord, Gallifrey\n',
 'Exampleson, Example\tme@example.com\t555-555-5552\tExample, Example Co.\t@example\n',
 'Pael, Ripal\tripalp@codingtemple.com\t(555) 555-5553\tTeacher, Coding Temple\t@ripalp\n',
 'Vader, Darth\tdarth-vader@empire.gov\t(555) 555-4444\tSith Lord, Galactic Empire\t@darthvader\n',
 'Fernandez de la Vega Sanz, Maria Teresa\tmtfvs@spain.gov\t\tFirst Deputy Prime Minister, Spanish Gov\n']
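The two approaches return different types: read() gives one string, while readlines() gives a list of lines. The re functions expect a single string, so the distinction matters for the cells that follow. A self-contained sketch, using a temporary file and made-up names rather than names.txt:

```python
import os
import re
import tempfile

sample = 'Hawkins, Derek\tderek@codingtemple.com\nZhai, Mo\tmozhai@codingtemple.com\n'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'sample_names.txt')  # hypothetical file name
    with open(path, 'w') as f:
        f.write(sample)

    with open(path) as f:
        as_string = f.read()        # the whole file as one str
    with open(path) as f:
        as_lines = f.readlines()    # a list with one str per line

print(type(as_string).__name__, type(as_lines).__name__)

# A regex function can take the full string directly...
found_in_string = re.search(r'Zhai, Mo', as_string)

# ...or be applied to each line of the list in turn
found_in_lines = [line for line in as_lines if re.match(r'Zhai, Mo', line)]

print(found_in_string.group())
print(found_in_lines)
```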

##### re.match()

In [227]:
re.match('Hawkins, Derek', data)

<re.Match object; span=(0, 14), match='Hawkins, Derek'>

##### re.search()

In [257]:
re.search('Pael, Ripal', data)

<re.Match object; span=(570, 581), match='Pael, Ripal'>

##### re.findall()

In [258]:
re.findall(r'[\w]+@[\w]+\.com', data)

['derek@codingtemple.com',
 'mozhai@codingtemple.com',
 'joejohnson@codingtemple.com',
 'tim@killerrabbit.com',
 'ryanb@codingtemple.com',
 'me@example.com',
 'ripalp@codingtemple.com']

##### Store the String to a Variable

In [260]:
res = input('Who would you like to search for?')

answer = re.findall(res, data)

if answer == []:
    print("No data found")
else:
    print(f"Data is here: {answer}")

Who would you like to search for?ripal
Data is here: ['ripal', 'ripal']
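Because the user's text is used directly as a pattern, input such as (555) would be read as regex syntax rather than literal characters. One way to guard against that, sketched here with a hypothetical helper search_text and a single sample line, is re.escape(), optionally combined with re.IGNORECASE:

```python
import re

def search_text(term, text):
    """Find every literal, case-insensitive occurrence of term in text."""
    # re.escape() neutralizes special characters such as ( ) . + ?
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return pattern.findall(text)

sample = 'Pael, Ripal\tripalp@codingtemple.com\t(555) 555-5553\tTeacher, Coding Temple\t@ripalp\n'

by_name = search_text('ripal', sample)
by_area_code = search_text('(555)', sample)

print(by_name)       # ['Ripal', 'ripal', 'ripal']
print(by_area_code)  # ['(555)']
```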


In [264]:
# additional example
re.findall(r'([A-Z][a-z]+, [A-Z][a-z]+)[\s]([\w]+@[\w]+\.com)', data)

[('Hawkins, Derek', 'derek@codingtemple.com'),
 ('Zhai, Mo', 'mozhai@codingtemple.com'),
 ('Johnson, Joe', 'joejohnson@codingtemple.com'),
 ('Butz, Ryan', 'ryanb@codingtemple.com'),
 ('Exampleson, Example', 'me@example.com'),
 ('Pael, Ripal', 'ripalp@codingtemple.com')]

### HW Exercise #1<br>
<p>Print each person's name and Twitter handle using groups. The output should look like:</p>
<p>==============<br>
   Full Name / Twitter<br>
   ==============</p>
Derek Hawkins / @derekhawkins

 Erik Sven-Osterberg / @sverik

 Ryan Butz / @ryanbutz

 Example Exampleson / @example

 Ripal Pael / @ripalp

 Darth Vader / @darthvader

In [332]:
with open("names.txt") as f:
    data = f.readlines()
    print(data)

['Hawkins, Derek\tderek@codingtemple.com\t(555) 555-5555\tTeacher, Coding Temple\t@derekhawkins\n', 'Zhai, Mo\tmozhai@codingtemple.com\t(555) 555-5554\tTeacher, Coding Temple\n', 'Johnson, Joe\tjoejohnson@codingtemple.com\t\tJohson, Joe\n', 'Osterberg, Sven-Erik\tgovernor@norrbotten.co.se\t\tGovernor, Norrbotten\t@sverik\n', ', Tim\ttim@killerrabbit.com\t\tEnchanter, Killer Rabbit Cave\n', 'Butz, Ryan\tryanb@codingtemple.com\t(555) 555-5543\tCEO, Coding Temple\t@ryanbutz\n', 'Doctor, The\tdoctor+companion@tardis.co.uk\t\tTime Lord, Gallifrey\n', 'Exampleson, Example\tme@example.com\t555-555-5552\tExample, Example Co.\t@example\n', 'Pael, Ripal\tripalp@codingtemple.com\t(555) 555-5553\tTeacher, Coding Temple\t@ripalp\n', 'Vader, Darth\tdarth-vader@empire.gov\t(555) 555-4444\tSith Lord, Galactic Empire\t@darthvader\n', 'Fernandez de la Vega Sanz, Maria Teresa\tmtfvs@spain.gov\t\tFirst Deputy Prime Minister, Spanish Gov\n']


In [16]:
pattern = re.compile(r"([A-Z][a-z]+), ([A-Za-z-]*)([A-Z][a-z]+).*\s(@[\w]+)")

for i in data:
    names = pattern.search(i)
    if names:
        print(f'{names.group(3)} {names.group(2)}{names.group(1)} / {names.group(4)}')
        

Derek Hawkins / @derekhawkins
Erik Sven-Osterberg / @sverik
Ryan Butz / @ryanbutz
Example Exampleson / @example
Ripal Pael / @ripalp
Darth Vader / @darthvader


### Regex project (HW #2)

Use Python to read the file regex-test.txt and print the last name on each line using regular expressions and groups (return None for names with no first and last name, or names that aren't properly capitalized)
##### Hint: use with open() and readlines()

In [342]:
with open('regex-test.txt') as f:
    properNames = f.readlines()
    print(properNames)


['Abraham Lincoln\n', 'Andrew P Garfield\n', 'Connor Milliken\n', 'Jordan Alexander Williams\n', 'Madonna\n', 'programming is cool']


In [355]:
"""
-- Expected Output --
Abraham Lincoln
Andrew P Garfield
Connor Milliken
Jordan Alexander Williams
None
None
"""

# a capitalized first name followed by whitespace and another capital letter
pattern = re.compile(r"([A-Z][a-z]+)(\s[A-Z])")

for i in properNames:
    name = pattern.search(i)
    if name:
        print(i)
    else:
        print('None\n')

Abraham Lincoln

Andrew P Garfield

Connor Milliken

Jordan Alexander Williams

None

None
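The task statement asks for the last name, while the cell above prints the whole line. A possible variation that returns only the last name is sketched below. The names NAME_PATTERN and last_name are hypothetical, and the list of lines is copied from the output above so the example runs without the file.

```python
import re

# First name, any number of middle names or initials, then a captured last name
NAME_PATTERN = re.compile(r'[A-Z][a-z]+(?: [A-Z][a-z]*)* ([A-Z][a-z]+)')

def last_name(line):
    """Return the last name if the whole line is a properly capitalized full name."""
    match = NAME_PATTERN.fullmatch(line.strip())
    return match.group(1) if match else None

lines = ['Abraham Lincoln\n', 'Andrew P Garfield\n', 'Connor Milliken\n',
         'Jordan Alexander Williams\n', 'Madonna\n', 'programming is cool']

results = [last_name(line) for line in lines]
print(results)  # ['Lincoln', 'Garfield', 'Milliken', 'Williams', None, None]
```

Using fullmatch() on the stripped line means trailing text or a lowercase word anywhere in the name causes the line to be rejected.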

