# Regular Expressions
<h4>Go to <a>regex101.com</a> for more help</h4>

# Tasks today:
1) <b>Importing</b> <br>
2) <b>Using Regular Expressions</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) re.compile() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) re.match() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) re.findall() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) re.search() <br>
3) <b>Sets</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Integer Ranges <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Character Ranges <br>
4) <b>Counting Occurrences</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) {x} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) {x,y} <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) ? <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) * <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) + <br>
5) <b>In-Class Exercise #1</b> <br>
6) <b>Escaping Characters</b> <br>
7) <b>Grouping</b> <br>
8) <b>In-Class Exercise #2</b> <br>
9) <b>Opening a File</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) open() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) with open() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) re.match() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) re.search() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Store the String in a Variable <br>
10) <b>Regex Project</b> <br>

### Importing <br>
<p>Regular expressions are available in most programming languages. In Python, they are provided by the built-in 're' module, which must be imported before use.</p>

In [145]:
# import re
import re

### Using Regular Expressions <br>
<p>Regular expressions give us the ability to search for patterns within text, strings, files, etc. They serve several uses, such as security measures, searching, filtering, pattern recognition, and more.</p>

##### re.compile()

In [146]:
# compile() builds a reusable pattern object from the pattern string; its methods (match, findall, search) are used below

pattern=re.compile('abcd')

##### re.match()

In [147]:
match=pattern.match('abcd123')
print(match)

# Accessing the span of the match
print(type(match.span()))

<re.Match object; span=(0, 4), match='abcd'>
<class 'tuple'>
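
Note that match() only succeeds when the pattern is found at the very start of the string. A minimal sketch of that behavior (the variable names `at_start` and `not_at_start` are only illustrative):

```python
import re

pattern = re.compile('abcd')

# match() checks for the pattern at index 0 only
at_start = pattern.match('abcd123')
not_at_start = pattern.match('123abcd')

print(at_start.group())  # the text that was matched
print(not_at_start)      # None, because 'abcd' is not at the start
```

Because a failed match returns None, it is worth checking the result before calling methods such as span() or group() on it.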


##### re.findall()

In [148]:
finders=pattern.findall('123abcd abcd123 abcd abcabc acb')
print(finders)
help(re.findall)

['abcd', 'abcd', 'abcd']
Help on function findall in module re:

findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.
    
    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.
    
    Empty matches are included in the result.
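
The help text above says the return value changes when the pattern contains capturing groups. A small sketch of the three cases, using a made-up sample string:

```python
import re

# Hypothetical sample text, chosen only for illustration
text = 'x=1, y=22, z=333'

# No capturing group: a list of whole matches
whole = re.findall(r'[a-z]=\d+', text)

# One capturing group: a list containing just that group
values = re.findall(r'[a-z]=(\d+)', text)

# Two capturing groups: a list of tuples, one tuple per match
pairs = re.findall(r'([a-z])=(\d+)', text)

print(whole)
print(values)
print(pairs)
```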



##### re.search()

In [149]:
random_string='123 123 234 abcd abc'

search = pattern.search(random_string)
print(search)
span=search.span()
print(span)
print(random_string[span[0]:span[1]])

<re.Match object; span=(12, 16), match='abcd'>
(12, 16)
abcd
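
search() returns None when the pattern is not found anywhere, so calling span() directly on the result can raise an AttributeError. One way to guard against that, sketched with a hypothetical helper named `find_span`:

```python
import re

pattern = re.compile('abcd')

def find_span(text):
    """Return the (start, end) span of the first match, or None if there is no match."""
    result = pattern.search(text)
    if result is None:
        return None
    return result.span()

print(find_span('123 123 234 abcd abc'))
print(find_span('no match in here'))
```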


### Sets <br>
<p>The following cells will allow you to use regular expressions to search for certain values within a range such as numbers 1 through 4.</p>

##### [a-z] or [A-Z] - any lowercase/uppercase letters from a to z<br/>[^2] - anything that's not 2

##### Integer Ranges

In [150]:
pattern_int=re.compile('[0-7][7-9][0-3]')

random_numbers=pattern_int.search('67383')
print(random_numbers)
span=random_numbers.span()
print(random_numbers.group())  # group() returns the matched text


<re.Match object; span=(0, 3), match='673'>
673


##### Character Ranges

In [151]:
char_pattern= re.compile('[A-Z][a-z]')

found=char_pattern.findall('Hello there Mr. Anderson')
print(found)

['He', 'Mr', 'An']
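
The [^2] form mentioned at the top of this section is a negated set: it matches any single character that is not listed. A short sketch with made-up input strings:

```python
import re

# [^2] matches any one character except '2'
found_not_two = re.findall('[^2]', '12322')

# [^0-9]+ matches runs of characters that are not digits
found_non_digits = re.findall('[^0-9]+', 'a1bc22d')

print(found_not_two)
print(found_non_digits)
```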


### Counting Occurrences

##### {x} - something that occurs exactly x times

In [152]:
char_pattern_count = re.compile('[A-Z][a-z][0-3]{2}')

found_count=char_pattern_count.findall('Hello Mr. An33derson')
print(found_count)

['An33']


##### {x,y} - something that occurs between x and y times

In [153]:
random_pattern=re.compile('m{1,5}')
random_statement= random_pattern.findall('This mes  mommy more an example of a regular')
print(random_statement)

['m', 'm', 'mm', 'm', 'm']
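
The upper bound can also be left out, and the quantifier is greedy, meaning it takes as many repetitions as it is allowed. A brief sketch on a made-up string:

```python
import re

text = 'm mm mmm mmmm'

# {2,} means "two or more"
at_least_two = re.findall('m{2,}', text)

# {1,2} takes two where it can, then continues with what is left
one_or_two = re.findall('m{1,2}', text)

print(at_least_two)
print(one_or_two)
```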


##### ? - something that occurs 0 or 1 time

In [154]:
pattern=re.compile(r'\.?')  # raw string, so the backslash reaches the regex engine unchanged

found_pat=pattern.findall('Hello M there Mr. Mrssanderson, Mid how is Mrs.Anderson')
print(found_pat)

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '.', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '.', '', '', '', '', '', '', '', '', '']
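
A pattern made only of an optional item matches the empty string at nearly every position, which is why the output above is mostly empty strings. The ? is usually more useful after a required part. A sketch, assuming we want titles such as Mr, Mr., Mrs and Mrs.:

```python
import re

# 'Mr' is required; the 's' and the '.' are each optional
title_pattern = re.compile(r'Mrs?\.?')

found_titles = title_pattern.findall('Hello Mr. Anderson, how is Mrs.Anderson and Mr Smith')
print(found_titles)
```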


##### * - something that occurs at least 0 times

In [155]:
pattern_m=re.compile('M*s')

found_m=pattern_m.findall('MMMs name is ms. Smith')
print(found_m)

['MMMs', 's', 's']


##### + - something that occurs at least once

In [156]:
pattern_again=re.compile('M+s')

found_patt=pattern_again.findall('MMMs name is Ms.Smith. this is sss')
print(found_patt)

['MMMs', 'Ms']


##### In-class exercise 1: 

Use a regular expression to find every number in the given string

In [157]:
my_string = "This string has 10909090 numbers, but it is only 1 string. I hope you solve this 2day."
#OUTPUT SHOULD EQUAL : ['10909090', '1', '2']
pattern = re.compile(r'\d+')
found_pattern=pattern.findall(my_string)
print(found_pattern)

['10909090', '1', '2']


### Escaping Characters

##### \w - look for any word character (letters, digits, and the underscore, including non-ASCII Unicode letters)<br/>\W - look for anything that isn't a word character

[History on Unicode](http://unicode.org/standard/WhatIsUnicode.html)

[More on Unicode Characters](https://en.wikipedia.org/wiki/List_of_Unicode_characters)

In [158]:
pattern_1=re.compile(r'[\w]+')
pattern_2=re.compile(r'[\W]+')

found_1=pattern_1.findall('This is a sentence. With, exclamation at the end!')
found_2=pattern_2.findall('This is a sentence. With, exclamation at the end!')
print(found_1)
print(found_2)

['This', 'is', 'a', 'sentence', 'With', 'exclamation', 'at', 'the', 'end']
[' ', ' ', ' ', '. ', ', ', ' ', ' ', ' ', '!']
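
To see what \w covers, here is a small sketch with a made-up string. By default, str patterns in Python 3 treat Unicode letters as word characters; the re.ASCII flag restricts \w to [a-zA-Z0-9_].

```python
import re

# '\u00e9' is the letter e with an acute accent
sample = 'user_name42 caf\u00e9!'

# Digits and underscores count as word characters, punctuation does not
unicode_words = re.findall(r'\w+', sample)

# With re.ASCII, the accented letter is no longer a word character
ascii_words = re.findall(r'\w+', sample, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```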


##### \d - look for any digit 0-9<br/>\D - look for anything that isn't a digit

In [159]:
pattern_nums = re.compile(r'\d{1,2}[a-z]{2}')

found_date=pattern_nums.findall('Today is the 7th, in 20days it will be the 27th')
print(found_date)

['7th', '20da', '27th']


##### \s - look for any whitespace<br/>\S - look for anything that isn't whitespace

In [160]:
pattern_no_space=re.compile(r'\S[a-z]+')
pattern_space=re.compile(r'\s+')

found_space=pattern_space.findall("Are you afraid of the dark?")
print(found_space)

found_dark=pattern_no_space.findall('Are you afraid of the dark?')
print(found_dark)



[' ', ' ', ' ', ' ', ' ']
['Are', 'you', 'afraid', 'of', 'the', 'dark']


##### \b - look for boundaries or edges of a word<br/>\B - look for anything that isn't a boundary

In [161]:
pattern_bound=re.compile(r'\bTheCodingTemple\b')
pattern_bound_none=re.compile(r'\BTheCodingTemple\B')

no_found_bound=pattern_bound_none.findall('TheCodingTemple')
print(f'No found bound: {no_found_bound}') # Output: []

found_bound=pattern_bound.findall('TheCodingTemple')
print(f'found bound: {found_bound}') # Output: ['TheCodingTemple']

No found bound: []
found bound: ['TheCodingTemple']
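
Boundaries are easier to see when the word also appears inside other words. A sketch with a made-up sentence:

```python
import re

text = 'cat concat cats cat.'

# \bcat\b: 'cat' as a whole word only
whole_words = re.findall(r'\bcat\b', text)

# \Bcat: 'cat' that does NOT start at a word boundary (the one inside 'concat')
inside_words = re.findall(r'\Bcat', text)

print(len(whole_words))
print(len(inside_words))
```

Here the whole-word pattern finds the first and last 'cat', while 'concat' and 'cats' are skipped.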


### Grouping

In [162]:
my_string_again='Michelle Obama, sebastien Dupont , Bill Gates, Jim Carrey'

#Group of names regex compiler
pattern_name=re.compile('([A-Z][a-zA-Z]+) ([A-Z][A-Za-z]+)')

found_names=pattern_name.findall(my_string_again)
print(found_names)

for name in my_string_again.split(','):
    match=pattern_name.search(name)
    if match:
        print(match.groups())  # tuple of all captured groups
    else:
        print('not a name')

[('Michelle', 'Obama'), ('Bill', 'Gates'), ('Jim', 'Carrey')]
('Michelle', 'Obama')
not a name
('Bill', 'Gates')
('Jim', 'Carrey')
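
Individual groups can also be read one at a time, either by position or by name. A minimal sketch, where the group names `first` and `last` are just labels chosen for this example:

```python
import re

# (?P<name>...) gives a group a name in addition to its number
named_pattern = re.compile(r'(?P<first>[A-Z][a-zA-Z]+) (?P<last>[A-Z][a-zA-Z]+)')

name_match = named_pattern.search('Bill Gates')

print(name_match.group(0))       # the whole match
print(name_match.group(1))       # first group, by position
print(name_match.group('last'))  # second group, by name
print(name_match.groupdict())    # all named groups as a dict
```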


##### In-class Exercise 2:

Write a function using regular expressions to find the domain name in the given email addresses (and return None for the invalid email addresses)<br><b>HINT: Use '|' for either or</b>

In [163]:
my_emails = ["jordanw@codingtemple..orgcom", "pocohontas1776@gmail.com", "helloworld@aol..com",
             "yourfavoriteband@g6.org", "@codingtemple.com"]

# You can also use $ at the end of your pattern: it anchors the match to the end of the string

#.com OR .org => com|org

#Expected output:
#None
#pocohontas1776@gmail.com
#None
#yourfavoriteband@g6.org
#None
pattern_for_emails=re.compile(r'[\w+-]+(?:\.[\w+-]+)*@(?:[a-z0-9-]+\.)+[a-z]{3}')
for email in my_emails:
    found_email=pattern_for_emails.findall(email)
    print(found_email)


[]
['pocohontas1776@gmail.com']
[]
['yourfavoriteband@g6.org']
[]
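
The exercise asks for None on invalid addresses, while findall() gives an empty list. One possible variation is sketched below. It assumes a deliberately simple definition of "valid" (a non-empty local part, one domain label, then .com or .org), and the names `simple_email` and `validate_email` are hypothetical.

```python
import re

# Assumption: valid means local-part@label.com or local-part@label.org
simple_email = re.compile(r'[\w.+-]+@[a-z0-9-]+\.(?:com|org)')

def validate_email(email):
    """Return the email when the WHOLE string matches the pattern, otherwise None."""
    return email if simple_email.fullmatch(email) else None

my_emails = ["jordanw@codingtemple..orgcom", "pocohontas1776@gmail.com", "helloworld@aol..com",
             "yourfavoriteband@g6.org", "@codingtemple.com"]

results = [validate_email(email) for email in my_emails]
for result in results:
    print(result)
```

fullmatch() requires the pattern to cover the entire string, which has the same effect as anchoring it at both ends.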


### Opening a File <br>
<p>Python gives us a couple of ways to open files; below are the two used most often.</p>

##### open()

In [164]:
f=open("names.txt")
data=f.read()
print(data)
f.close()

Hawkins, Derek	derek@codingtemple.com	(555) 555-5555	Teacher, Coding Temple	@derekhawkins
Zhai, Mo	mozhai@codingtemple.com	(555) 555-5554	Teacher, Coding Temple
Johnson, Joe	joejohnson@codingtemple.com		Johson, Joe
Osterberg, Sven-Erik	governor@norrbotten.co.se		Governor, Norrbotten	@sverik
, Tim	tim@killerrabbit.com		Enchanter, Killer Rabbit Cave
Butz, Ryan	ryanb@codingtemple.com	(555) 555-5543	CEO, Coding Temple	@ryanbutz
Doctor, The	doctor+companion@tardis.co.uk		Time Lord, Gallifrey
Exampleson, Example	me@example.com	555-555-5552	Example, Example Co.	@example
Pael, Ripal	ripalp@codingtemple.com	(555) 555-5553	Teacher, Coding Temple	@ripalp
Vader, Darth	darth-vader@empire.gov	(555) 555-4444	Sith Lord, Galactic Empire	@darthvader
Fernandez de la Vega Sanz, Maria Teresa	mtfvs@spain.gov		First Deputy Prime Minister, Spanish Gov



##### with open()

In [165]:
with open("names.txt") as f:
    data=f.read()
    print(data)

Hawkins, Derek	derek@codingtemple.com	(555) 555-5555	Teacher, Coding Temple	@derekhawkins
Zhai, Mo	mozhai@codingtemple.com	(555) 555-5554	Teacher, Coding Temple
Johnson, Joe	joejohnson@codingtemple.com		Johson, Joe
Osterberg, Sven-Erik	governor@norrbotten.co.se		Governor, Norrbotten	@sverik
, Tim	tim@killerrabbit.com		Enchanter, Killer Rabbit Cave
Butz, Ryan	ryanb@codingtemple.com	(555) 555-5543	CEO, Coding Temple	@ryanbutz
Doctor, The	doctor+companion@tardis.co.uk		Time Lord, Gallifrey
Exampleson, Example	me@example.com	555-555-5552	Example, Example Co.	@example
Pael, Ripal	ripalp@codingtemple.com	(555) 555-5553	Teacher, Coding Temple	@ripalp
Vader, Darth	darth-vader@empire.gov	(555) 555-4444	Sith Lord, Galactic Empire	@darthvader
Fernandez de la Vega Sanz, Maria Teresa	mtfvs@spain.gov		First Deputy Prime Minister, Spanish Gov
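
A file object can also be looped over one line at a time, which pairs well with regular expressions. The sketch below writes two made-up rows to a temporary file so it can run without names.txt; the email pattern is a rough assumption, not a full validator.

```python
import os
import re
import tempfile

# Hypothetical rows in the same tab-separated layout as names.txt
sample = 'Hawkins, Derek\tderek@codingtemple.com\nZhai, Mo\tmozhai@codingtemple.com\n'

# Write the sample to a temporary file so there is something to open
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

emails = []
with open(path) as f:
    for line in f:  # iterate one line at a time
        found_email = re.search(r'[\w.+-]+@[\w.-]+', line)
        if found_email:
            emails.append(found_email.group())

os.remove(path)  # clean up the temporary file
print(emails)
```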



##### re.match()

In [166]:
print(re.match(r"Hawkins, Derek",data))

<re.Match object; span=(0, 14), match='Hawkins, Derek'>
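
re.match() works here only because 'Hawkins, Derek' is the very first text in the file. For a name that starts a later line, one option is re.search() with ^ and the re.MULTILINE flag, sketched below on a shortened, made-up copy of the data:

```python
import re

# Shortened stand-in for the contents of names.txt
sample_data = 'Hawkins, Derek\nZhai, Mo\n'

# match() only looks at the start of the whole string
first_try = re.match(r'Zhai', sample_data)

# With re.MULTILINE, ^ also matches at the start of every line
second_try = re.search(r'^Zhai', sample_data, re.MULTILINE)

print(first_try)
print(second_try.group(), second_try.start())
```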


##### re.search()

In [167]:
print(re.search(r"ripalp@codingtemple\.com",data))  # \. matches a literal dot; a bare . matches any character

<re.Match object; span=(582, 605), match='ripalp@codingtemple.com'>


##### Store the String to a Variable

In [168]:
answer=input("what would you like to search for ?")
found=re.findall(answer,data)
if found:
    print(f'I found your data : {found}')

I found your data : ['Derek']
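
Text typed by a user is treated as a regular expression, so characters such as . or ( keep their special meaning. If the input should be searched for literally, re.escape() can help. A sketch with made-up data:

```python
import re

# Made-up data: the second entry has an 'X' where the dot should be
sample_data = 'derek@codingtemple.com derek@codingtempleXcom'
user_text = 'codingtemple.com'

# Unescaped, the '.' matches any character, so both entries are found
loose = re.findall(user_text, sample_data)

# re.escape() turns the '.' into a literal dot
strict = re.findall(re.escape(user_text), sample_data)

print(loose)
print(strict)
```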


### In-Class Exercise #3 <br>
<p>Print each person's name and Twitter handle using groups. The output should look like:</p>
<p>==============<br>
   Full Name / Twitter<br>
   ==============</p>
Derek Hawkins / @derekhawkins

 Erik Sven-Osterberg / @sverik

 Ryan Butz / @ryanbutz

 Example Exampleson / @example

 Ripal Pael / @ripalp

 Darth Vader / @darthvader

In [169]:
def extract_names_and_twitter_handles(file_path):
    with open(file_path, 'r') as file:
        data = file.read()

    # Define the regex pattern. Each row is "Last, First", so the word before the comma is the last name
    pattern = re.compile(r'(?P<last_name>\w+),\s*(?P<first_name>\w+(-\w+)?)(?:\t.*?\t|\s{2,})(?:(?P<twitter>@\w+)|.*$)')

    # Find all matches in the data
    matches = pattern.finditer(data)

    # Print the formatted results
    for match in matches:
        first_name = match.group('first_name')
        last_name = match.group('last_name')
        twitter_handle = match.group('twitter') or ''
        full_name = f'{first_name} {last_name}'.strip()
        print(f"{full_name} / {twitter_handle}")

# Provide the path to your names.txt file
file_path = 'names.txt'

# Call the function with the file path
extract_names_and_twitter_handles(file_path)

Derek Hawkins / @derekhawkins
Sven-Erik Osterberg / @sverik
Ryan Butz / @ryanbutz
Example Exampleson / @example
Ripal Pael / @ripalp
Darth Vader / @darthvader


### Regex project

Use Python to read the file regex_test.txt and print the full name on each line using regular expressions and groups (print None for lines without both a first and a last name, or with names that aren't properly capitalized)
##### Hint: use with open() and readlines()

In [170]:
"""
Expected Output
Abraham Lincoln
Andrew P Garfield
Connor Milliken
Jordan Alexander Williams
None
None
"""

'\nExpected Output\nAbraham Lincoln\nAndrew P Garfield\nConnor Milliken\nJordan Alexander Williams\nNone\nNone\n'

In [171]:
def extract_last_names(file_path):
    #STEP 1 OPEN/READ FILE :)
    with open(file_path, 'r') as file:
        data = file.readlines()

    #STEP 2 REGEX PATTERN FOR NAMES !
    pattern = re.compile(r'^(?:(?P<first>\w+)(?:\s+(?P<middle>[A-Z]\w*))?)?\s+(?P<last>[A-Z]\w*)$')

    #STEP 3 PRINT NAMES !
    for line in data:
        match = pattern.match(line)
        if match:
            first_name = match.group('first')
            middle_name = match.group('middle')
            last_name = match.group('last')

            #IS THE LAST NAME CAPITALIZED ?
            if last_name and not last_name.istitle():
                last_name = None

            #IS THE MIDDLE NAME CAPITALIZED
            if middle_name and not middle_name.istitle():
                middle_name = None

            name_parts = [part for part in [first_name, middle_name, last_name] if part]
            full_name = ' '.join(name_parts).strip()
            print(full_name)
        else:
            print(None)

file_path = 'regex_test.txt'
extract_last_names(file_path)


Abraham Lincoln
Andrew P Garfield
Connor Milliken
Jordan Alexander Williams
None
None
