# How to use Regular Expressions

***

# Reading Files

Documentation
- __open()__ docs
- __file_object.close()__ docs
- __file_object.read()__ docs
- Regular Expressions: https://docs.python.org/3/library/re.html

New Terms
- __open()__ - Opens a file in Python. The file object it returns doesn't contain the content of the file; it is a handle that points to the file.
- __.read()__ - Reads the entire contents of the file object it's called on.
- __.close()__ - Closes the file object it's called on. This releases the file handle that Python was holding.
- __r'string'__ - A raw string. Backslashes in it are not treated as escape characters, which makes writing regular expressions easier.
- __re.match(pattern, text, flags)__ - Tries to match a pattern against the beginning of the text.
- __re.search(pattern, text, flags)__ - Tries to match a pattern anywhere in the text. Returns the first match.

A better way to read files

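If you don't know the size of a file, it's better to let Python close it automatically and, for very large files, to read it a chunk at a time. The following snippet handles the automatic closing. Note that .read() with no argument still reads the whole file at once: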

```python
with open("some_file.txt") as open_file:
    data = open_file.read()
```    
The with causes the file to automatically close once the action inside of it finishes. And the action inside, the .read(), will finish when there are no more bytes to read from the file.
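
For a file that is too large to read in one go, a chunked read could look like the sketch below. The temporary file, its contents, and the chunk size of 8 characters are assumptions made only so the example runs on its own.

```python
import os
import tempfile

# Create a small throwaway file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("abcdefghij" * 3)  # 30 characters
    path = tmp.name

chunks = []
with open(path, encoding="utf-8") as open_file:
    while True:
        chunk = open_file.read(8)  # read at most 8 characters per call
        if not chunk:              # an empty string means end of file
            break
        chunks.append(chunk)

os.remove(path)  # clean up the throwaway file
chunk_sizes = [len(c) for c in chunks]
print(chunk_sizes)  # [8, 8, 8, 6]
```

Passing a size to .read() caps how much is held in memory at once, and the with block still closes the file when the loop ends.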

In [1]:
# address_book.py

names_file = open("names.txt", encoding="utf-8")  # This line creates a pointer to the specified file.
data = names_file.read()
names_file.close()

print(data)

Love, Kenneth	kenneth@teamtreehouse.com	(555) 555-5555	Teacher, Treehouse	@kennethlove
McFarland, Dave	dave@teamtreehouse.com	(555) 555-5554	Teacher, Treehouse
Arthur, King	king_arthur@camelot.co.uk		King, Camelot
Österberg, Sven-Erik	governor@norrbotten.co.se		Governor, Norrbotten	@sverik
, Tim	tim@killerrabbit.com		Enchanter, Killer Rabbit Cave
Carson, Ryan	ryan@teamtreehouse.com	(555) 555-5543	CEO, Treehouse	@ryancarson
Doctor, The	doctor+companion@tardis.co.uk		Time Lord, Gallifrey
Exampleson, Example	me@example.com	555-555-5552	Example, Example Co.	@example
Obama, Barack	president.44@us.gov	555 555-5551	President, United States of America	@potus44
Chalkley, Andrew	andrew@teamtreehouse.com	(555) 555-5553	Teacher, Treehouse	@chalkers
Vader, Darth	darth-vader@empire.gov	(555) 555-4444	Sith Lord, Galactic Empire	@darthvader
Fernández de la Vega Sanz, María Teresa	mtfvs@spain.gov		First Deputy Prime Minister, Spanish Govt.


__Using re.match to look for a match at the very beginning of the text :__

In [2]:
import re

print(re.match(r'Love', data))  # r'' marks a raw string: backslashes are kept as-is instead of starting escapes like \n or \t.
issue = " <----- We get None for this second one because 're.match' only matches at the very beginning of the text."
print(re.match(r'Kenneth', data), issue)

<_sre.SRE_Match object; span=(0, 4), match='Love'>
None  <----- We get None for this second one because 're.match' only matches at the very beginning of the text.


__Using re.search to search the entire text for a match :__

In [3]:
# This is how to remedy the above problem. re.search scans the entire text, not just its beginning.
print(re.search(r'Kenneth', data))

<_sre.SRE_Match object; span=(6, 13), match='Kenneth'>
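
The difference between the two functions is easiest to see on text with more than one line. The sketch below uses a short two-line sample string (an assumption standing in for names.txt) so it runs without the data file.

```python
import re

sample = "Love, Kenneth\nMcFarland, Dave"  # two lines, mirroring the layout of names.txt

at_start = re.match(r'Love', sample)          # 'Love' sits at index 0, so this matches
second_line = re.match(r'McFarland', sample)  # not at the start of the string, so None
anywhere = re.search(r'McFarland', sample)    # search scans the whole string

print(at_start.span())   # (0, 4)
print(second_line)       # None
print(anywhere.span())   # (14, 23)
```

Even though 'McFarland' starts a line, re.match does not find it, because re.match is anchored to the start of the whole string rather than to the start of each line.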


***

# Escape Hatches

New terms
- __\w__ - matches a Unicode word character. That's any letter, uppercase or lowercase, any digit, and the underscore character. In "new-releases-204", \w would match each of the letters in "new" and "releases" and the digits 2, 0, and 4. It wouldn't match the hyphens.
- __\W__ - is the opposite of \w and matches anything that isn't a Unicode word character. In "new-releases-204", \W would only match the hyphens.
- __\s__ - matches whitespace, so spaces, tabs, newlines, etc.
- __\S__ - matches everything that isn't whitespace.
- __\d__ - is how we match any digit from 0 to 9.
- __\D__ - matches anything that isn't a digit.
- __\b__ - matches word boundaries. What's a word boundary? It's the edge of a word: the position between a word character and a non-word character (whitespace, punctuation, and so on) or the edge of the string. It matches a position, not a character.
- __\B__ - matches any position that isn't the edge of a word.

Two other escape characters that we didn't cover in the video are \A and \Z. These match the beginning and the end of the string, respectively. As we'll learn later, though, ^ and $ are more commonly used and usually what you actually want.
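
A quick way to check the escapes listed above is to run them against the "new-releases-204" example from the definitions. This is only an illustrative sketch; the variable names are made up for it.

```python
import re

slug = "new-releases-204"

non_word = re.findall(r'\W', slug)                 # everything that is not a word character
digits = re.findall(r'\d', slug)                   # single digits
words = re.findall(r'\b\w+\b', slug)               # runs of word characters between boundaries
whitespace = re.findall(r'\s', "a b\tc\n")         # space, tab, and newline all count

print(non_word)    # ['-', '-']
print(digits)      # ['2', '0', '4']
print(words)       # ['new', 'releases', '204']
print(whitespace)  # [' ', '\t', '\n']
```

The third result shows that a hyphen creates a word boundary even though there is no whitespace around it.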

__Using \d to find a phone number :__

In [4]:
print(re.search(r'\d\d\d-\d\d\d\d', data))

<_sre.SRE_Match object; span=(46, 54), match='555-5555'>


__Using the escape character '\' to match literal parentheses in order to find the area code (555) :__

In [5]:
print(re.search(r'\(\d\d\d\) \d\d\d-\d\d\d\d', data))

<_sre.SRE_Match object; span=(40, 54), match='(555) 555-5555'>


***

# Counts
New terms
- __\w{3}__ - matches any three word characters in a row.
- __\w{,3}__ - matches 0, 1, 2, or 3 word characters in a row.
- __\w{3,}__ - matches 3 or more word characters in a row. There's no upper limit.
- __\w{3,5}__ - matches 3, 4, or 5 word characters in a row. Don't put a space inside the braces: Python treats \w{3, 5} as literal text rather than a count.
- __\w?__ - matches 0 or 1 word characters.
- __\w*__ - matches 0 or more word characters. Since there is no upper limit, this is, effectively, infinite word characters.
- __\w+__ - matches 1 or more word characters. Like *, it has no upper limit, but it has to occur at least once.
- __.findall(pattern, text, flags)__ - Finds all *non-overlapping* occurrences of the __pattern__ in the __text__.
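
The counts above can be compared side by side on a small test string. The sketch below is illustrative only; the sample text and variable names are assumptions. The last line shows what happens when a space slips into the braces.

```python
import re

text = "a ab abc abcd abcdef"

exactly_three = re.findall(r'\b\w{3}\b', text)     # words of exactly 3 characters
three_or_more = re.findall(r'\b\w{3,}\b', text)    # 3 or more, no upper limit
three_to_five = re.findall(r'\b\w{3,5}\b', text)   # between 3 and 5
optional_u = re.findall(r'colou?r', "color colour")  # ? makes the 'u' optional
with_space = re.findall(r'\w{3, 5}', text)         # the space turns {3, 5} into literal text

print(exactly_three)  # ['abc']
print(three_or_more)  # ['abc', 'abcd', 'abcdef']
print(three_to_five)  # ['abc', 'abcd']
print(optional_u)     # ['color', 'colour']
print(with_space)     # []
```

The \b anchors are there so that each count applies to a whole word; without them, \w{3} would also match the first three letters of longer words.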

__Using counts to find a word with any amount of letters :__

In [6]:
print(re.search(r'\w+, \w+', data))

<_sre.SRE_Match object; span=(0, 13), match='Love, Kenneth'>


__Using curly brackets to specify occurrences of a repeated character type :__

In [7]:
print(re.search(r'\(\d{3}\) \d{3}-\d{4}', data))

<_sre.SRE_Match object; span=(40, 54), match='(555) 555-5555'>


__Using ? to specify that a character type is optional (either zero or one):__

In [8]:
print(re.search(r'\(?\d{3}\)? \d{3}-\d{4}', data))

<_sre.SRE_Match object; span=(40, 54), match='(555) 555-5555'>


__Using .findall() to find all occurrences of the specified regex in data :__

In [9]:
print(re.findall(r'\(?\d{3}\)? \d{3}-\d{4}', data))

['(555) 555-5555', '(555) 555-5554', '(555) 555-5543', '555 555-5551', '(555) 555-5553', '(555) 555-4444']


In [10]:
print(re.findall(r'\(?\d{3}\)?-?\s?\d{3}-\d{4}', data))  # included optional spaces, and hyphens

['(555) 555-5555', '(555) 555-5554', '(555) 555-5543', '555-555-5552', '555 555-5551', '(555) 555-5553', '(555) 555-4444']


__Using * to specify that a character type may appear any number of times (zero or more, with no upper limit):__

In [11]:
print(re.findall(r'\w*, \w+', data))

['Love, Kenneth', 'Teacher, Treehouse', 'McFarland, Dave', 'Teacher, Treehouse', 'Arthur, King', 'King, Camelot', 'Österberg, Sven', 'Governor, Norrbotten', ', Tim', 'Enchanter, Killer', 'Carson, Ryan', 'CEO, Treehouse', 'Doctor, The', 'Lord, Gallifrey', 'Exampleson, Example', 'Example, Example', 'Obama, Barack', 'President, United', 'Chalkley, Andrew', 'Teacher, Treehouse', 'Vader, Darth', 'Lord, Galactic', 'Sanz, María', 'Minister, Spanish']


## Code Challenge

Create a function named find_words that takes a count and a string. Return a list of all of the words in the string that are count word characters long or longer.

In [12]:
import re

# EXAMPLE:
# >>> find_words(4, "dog, cat, baby, balloon, me")
# ['baby', 'balloon']

def find_words(count, string):
    regex = r'\w{' + str(count) + r',}'
    return re.findall(regex, string)

find_words(4, "dog, cat, baby, balloon, me")

['baby', 'balloon']

***

# Sets
New terms
- __[abc]__ - this is a set of the characters 'a', 'b', and 'c'. It matches exactly one character that is any of those; the order inside the brackets doesn't matter. Add a count after the set to match more than one character.
- __[a-z], [A-Z], or [a-zA-Z]__ - ranges that'll match any/all letters in the English alphabet in lowercase, uppercase, or both upper and lowercases.
- __[0-9]__ - range that'll match any digit from 0 to 9. You can change the ends to restrict the set.
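
The following sketch tries the sets and ranges above on a few made-up strings. It also shows why the address book patterns lean on \w rather than [a-zA-Z]: the English-alphabet range does not cover letters such as 'Ö'.

```python
import re

single = re.findall(r'[abc]', "cabbage")         # one character per match
runs = re.findall(r'[abc]+', "cabbage")          # a count after the set matches runs
low_digits = re.findall(r'[0-4]+', "2019-04-55") # a restricted digit range
ascii_only = re.findall(r'[a-zA-Z]+', "Österberg")
unicode_word = re.findall(r'\w+', "Österberg")

print(single)        # ['c', 'a', 'b', 'b', 'a']
print(runs)          # ['cabba']
print(low_digits)    # ['201', '04']
print(ascii_only)    # ['sterberg']
print(unicode_word)  # ['Österberg']
```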

__Using sets (brackets [ ] ) to define which types of character we're looking for (order inside the set does not matter) :__

In [13]:
print(re.findall(r'[-\w\d+.]+@[-\w\d.]+', data))

['kenneth@teamtreehouse.com', 'dave@teamtreehouse.com', 'king_arthur@camelot.co.uk', 'governor@norrbotten.co.se', 'tim@killerrabbit.com', 'ryan@teamtreehouse.com', 'doctor+companion@tardis.co.uk', 'me@example.com', 'president.44@us.gov', 'andrew@teamtreehouse.com', 'darth-vader@empire.gov', 'mtfvs@spain.gov']


__Similar to the example above, but this time we're searching specifically for '@teamtreehouse.com' :__

In [14]:
print(re.findall(r'[-\w\d+.]+@teamtreehouse\.com', data))  # \. matches a literal dot; a bare . would match any character

['kenneth@teamtreehouse.com', 'dave@teamtreehouse.com', 'ryan@teamtreehouse.com', 'andrew@teamtreehouse.com']


__Searching for a specific word ('Treehouse' in this case) regardless of case :__

In [15]:
print("CODE: re.findall(r'\\b[trehous]{9}\\b', data, re.IGNORECASE)")
print('* \\b indicates a word boundary')
print("* [trehous]{9} indicates we're looking for words that are exactly 9 letters long and use only those letters.\n")
print('RETURNED: ', re.findall(r'\b[trehous]{9}\b', data, re.IGNORECASE))
print('\nor you can use the re.I shorthand & get the same result:'.upper())
print(re.findall(r'\b[trehous]{9}\b', data, re.I))

CODE: re.findall(r'\b[trehous]{9}\b', data, re.IGNORECASE)
* \b indicates a word boundary
* [trehous]{9} indicates we're looking for words that are exactly 9 letters long and use only those letters.

RETURNED:  ['Treehouse', 'Treehouse', 'Treehouse', 'Treehouse']

OR YOU CAN USE THE RE.I SHORTHAND & GET THE SAME RESULT:
['Treehouse', 'Treehouse', 'Treehouse', 'Treehouse']


***

# Negation
New terms
- __[^abc]__ - a negated set. It matches any single character except the letters 'a', 'b', and 'c'.
- __re.IGNORECASE or re.I__ - flag to make a search case-insensitive. re.match('A', 'apple', re.I) would find the 'a' in 'apple'.
- __re.VERBOSE or re.X__ - flag that allows regular expressions to span multiple lines and contain (ignored) whitespace and comments.
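
As a small illustration of the terms above, the sketch below runs a negated set on a sample phrase and then writes a phone pattern across several lines with re.VERBOSE. The sample strings are made up for the example.

```python
import re

# Negated set: runs of characters that are neither vowels nor whitespace.
consonant_runs = re.findall(r'[^aeiou\s]+', "regular expressions")
print(consonant_runs)  # ['r', 'g', 'l', 'r', 'xpr', 'ss', 'ns']

# re.X lets the pattern span lines; whitespace and comments in it are ignored.
phones = re.findall(r'''
    \(?\d{3}\)?   # area code, parentheses optional
    [-\s]?        # optional hyphen or whitespace
    \d{3}-\d{4}   # local number
''', "Call (555) 555-5555 or 555-555-5552", re.X)
print(phones)  # ['(555) 555-5555', '555-555-5552']
```

Because re.X ignores whitespace in the pattern, a literal space has to be written as \s or placed inside a set such as [ ].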

__Grab the domain of every email address, cutting off any .gov ending :__

In [16]:
print(re.findall(r'''
    \b@[-\w\d.]*  # word boundary, @ symbol, any num of char.
    [^gov\t]+ # Match one or more characters that are not 'g', 'o', 'v', or a tab.
    \b # Match another word boundary.
''', data, re.VERBOSE|re.I))

['@teamtreehouse.com', '@teamtreehouse.com', '@camelot.co.uk', '@norrbotten.co.se', '@killerrabbit.com', '@teamtreehouse.com', '@tardis.co.uk', '@example.com', '@us.', '@teamtreehouse.com', '@empire.', '@spain.']


__Grab anything matching the following pattern "_word_, _word_" :__

In [17]:
print(re.findall(r'''
    \b[-\w]*,  # Find a word boundary, 0+ hyphens or word characters, and a comma
    \s  # Find 1 whitespace character
    [-\w ]+  # 1+ hyphens, word characters, and explicit spaces
    [^\t\n]  # One more character that is not a tab or a newline
''', data, re.X))

['Love, Kenneth', 'Teacher, Treehouse', 'McFarland, Dave', 'Teacher, Treehouse', 'Arthur, King', 'King, Camelot', 'Österberg, Sven-Erik', 'Governor, Norrbotten', 'Enchanter, Killer Rabbit Cave', 'Carson, Ryan', 'CEO, Treehouse', 'Doctor, The', 'Lord, Gallifrey', 'Exampleson, Example', 'Example, Example Co.', 'Obama, Barack', 'President, United States of America', 'Chalkley, Andrew', 'Teacher, Treehouse', 'Vader, Darth', 'Lord, Galactic Empire', 'Sanz, María Teresa', 'Minister, Spanish Govt.']


***

# Groups
New terms

- __([abc])__ - creates a group that contains a set for the letters 'a', 'b', and 'c'. This could be later accessed from the Match object as .group(1)
- __(?P<name>[abc])__ - creates a named group that contains a set for the letters 'a', 'b', and 'c'. This could later be accessed from the Match object as .group('name').
- __.groups()__ - method to show all of the groups on a Match object.
- __re.MULTILINE or re.M__ - flag that makes ^ and $ match at the beginning and end of every line in your text, not only at the beginning and end of the whole string.
- __^__ - specifies, in a pattern, the beginning of the string (or of each line when re.MULTILINE is set).
- __$__ - specifies, in a pattern, the end of the string (or of each line when re.MULTILINE is set).
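
The sketch below pulls the terms above together on a two-line sample string (an assumption standing in for names.txt): numbered and named group access, then the effect of re.MULTILINE on ^ and $, with \A for contrast.

```python
import re

text = "Love, Kenneth\nChalkley, Andrew"

m = re.search(r'(?P<last>\w+), (?P<first>\w+)', text)
print(m.group(0))        # Love, Kenneth  (the entire match)
print(m.group(1))        # Love           (first group, by number)
print(m.group('first'))  # Kenneth        (second group, by name)
print(m.groups())        # ('Love', 'Kenneth')

starts_default = re.findall(r'^\w+', text)           # ^ is the start of the string only
starts_multiline = re.findall(r'^\w+', text, re.M)   # ^ is the start of every line
starts_absolute = re.findall(r'\A\w+', text, re.M)   # \A ignores re.M
ends_multiline = re.findall(r'\w+$', text, re.M)     # $ is the end of every line

print(starts_default)    # ['Love']
print(starts_multiline)  # ['Love', 'Chalkley']
print(starts_absolute)   # ['Love']
print(ends_multiline)    # ['Kenneth', 'Andrew']
```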

__Using groups to collect specific pieces of information in a single regex :__

In [18]:
print(re.findall(r'''
    ^(?P<name>[-\w ]*,\s[-\w ]+)\t  # first and last name.
    (?P<email>[-\w\d.+]+@[-\w\d.]+)\t  # Email
    (?P<phone>\(?\d{3}\)?-?\s?\d{3}-\d{4})?\t  # Phone
    (?P<job>[\w\s]+,\s[\w\s.]+)\t?  # Job and company
    (?P<twitter>@[\w\d]+)?$  # Twitter
''', data, re.X|re.M))

[('Love, Kenneth', 'kenneth@teamtreehouse.com', '(555) 555-5555', 'Teacher, Treehouse\t', '@kennethlove'), ('McFarland, Dave', 'dave@teamtreehouse.com', '(555) 555-5554', 'Teacher, Treehouse', ''), ('Arthur, King', 'king_arthur@camelot.co.uk', '', 'King, Camelot', ''), ('Österberg, Sven-Erik', 'governor@norrbotten.co.se', '', 'Governor, Norrbotten\t', '@sverik'), (', Tim', 'tim@killerrabbit.com', '', 'Enchanter, Killer Rabbit Cave', ''), ('Carson, Ryan', 'ryan@teamtreehouse.com', '(555) 555-5543', 'CEO, Treehouse\t', '@ryancarson'), ('Doctor, The', 'doctor+companion@tardis.co.uk', '', 'Time Lord, Gallifrey', ''), ('Exampleson, Example', 'me@example.com', '555-555-5552', 'Example, Example Co.\t', '@example'), ('Obama, Barack', 'president.44@us.gov', '555 555-5551', 'President, United States of America\t', '@potus44'), ('Chalkley, Andrew', 'andrew@teamtreehouse.com', '(555) 555-5553', 'Teacher, Treehouse\t', '@chalkers'), ('Vader, Darth', 'darth-vader@empire.gov', '(555) 555-4444', 'Sith

__Creating a Dict with groups :__

In [19]:
line = re.search(r'''
    ^(?P<name>[-\w ]*,\s[-\w ]+)\t  # first and last name.
    (?P<email>[-\w\d.+]+@[-\w\d.]+)\t  # Email
    (?P<phone>\(?\d{3}\)?-?\s?\d{3}-\d{4})?\t  # Phone
    (?P<job>[\w\s]+,\s[\w\s.]+)\t?  # Job and company
    (?P<twitter>@[\w\d]+)?$  # Twitter
''', data, re.X|re.M)

print(line,'\n')
print(line.groupdict())

<_sre.SRE_Match object; span=(0, 86), match='Love, Kenneth\tkenneth@teamtreehouse.com\t(555) 5> 

{'name': 'Love, Kenneth', 'email': 'kenneth@teamtreehouse.com', 'phone': '(555) 555-5555', 'job': 'Teacher, Treehouse\t', 'twitter': '@kennethlove'}
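
Notice the trailing '\t' in the 'job' value above. It appears because \s inside the job set also matches a tab. One possible variant, sketched below on two made-up records in the same layout, uses a plain space inside the set instead. This is an alternative to the pattern in these notes, not the course's version.

```python
import re

# Sample tab-separated records in the names.txt layout (assumed for this sketch).
kenneth_line = "Love, Kenneth\tkenneth@teamtreehouse.com\t(555) 555-5555\tTeacher, Treehouse\t@kennethlove"
arthur_line = "Arthur, King\tking_arthur@camelot.co.uk\t\tKing, Camelot"

PATTERN = r'''
    ^(?P<name>[-\w ]*,\s[-\w ]+)\t             # last and first name
    (?P<email>[-\w\d.+]+@[-\w\d.]+)\t          # email
    (?P<phone>\(?\d{3}\)?-?\s?\d{3}-\d{4})?\t  # phone (optional)
    (?P<job>[\w ]+,\s[\w .]+)\t?               # job and company, spaces allowed but no tabs
    (?P<twitter>@[\w\d]+)?$                    # Twitter handle (optional)
'''

kenneth = re.search(PATTERN, kenneth_line, re.X | re.M).groupdict()
arthur = re.search(PATTERN, arthur_line, re.X | re.M).groupdict()

print(repr(kenneth['job']))               # 'Teacher, Treehouse'
print(arthur['phone'], arthur['twitter']) # None None
```

Optional groups that did not take part in the match come back as None from .groupdict(), so code that formats these values should be ready for that.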


## Code Challenge
Create a variable named names that is an re.match() against string. The pattern should provide two groups, one for a last name match and one for a first name match. The name parts are separated by a comma and a space.

In [20]:
import re

string = 'Perotto, Pier Giorgio'

names = re.match(r'''
        ([\w]+),\s
        ([\s\w]+)
        ''', string, re.X)

print(names.groups())

('Perotto', 'Pier Giorgio')


## Code Challenge
Challenge Task 1 of 2

Create a new variable named contacts that is an re.search() where the pattern catches the email address and phone number from string. Name the email pattern email and the phone number pattern phone. The comma and spaces *should not* be part of the groups.

In [21]:
import re

string = '''Love, Kenneth, kenneth+challenge@teamtreehouse.com, 555-555-5555, @kennethlove
Chalkley, Andrew, andrew@teamtreehouse.co.uk, 555-555-5556, @chalkers
McFarland, Dave, dave.mcfarland@teamtreehouse.com, 555-555-5557, @davemcfarland
Kesten, Joy, joy@teamtreehouse.com, 555-555-5558, @joykesten'''

contacts = re.search(r'''
        (?P<email>[\w.+]*@[\w.]+),\s
        (?P<phone>\d{3}-\d{3}-\d{4})
''', string, re.X)
print(contacts.groups())

('kenneth+challenge@teamtreehouse.com', '555-555-5555')


Challenge Task 2 of 2

Great! Now, make a new variable, twitters, that is an re.search() where the pattern catches the Twitter handle for a person. Remember to mark it as being at the end of the string. You'll also want to use the re.MULTILINE flag.

In [31]:
twitters = re.search(r'''
    ([@\w]+)$
''', string, re.X|re.M)

print(twitters)

<_sre.SRE_Match object; span=(66, 78), match='@kennethlove'>


***

# Compiling and Loops
New terms

- __re.compile(pattern, flags)__ - method to pre-compile and save a regular expression pattern, and any associated flags, for later use.
- __.groupdict()__ - method to generate a dictionary from a Match object's groups. The keys will be the group names. The values will be the results of the patterns in the group.
- __re.finditer()__ - method to generate an iterable from the non-overlapping matches of a regular expression. Very handy for for loops.
    - Gives us an iterable full of match objects essentially.
- __.group()__ - method to access the content of a group. 0 or no argument returns the entire match. 1 through however many groups you have will get that group. Or use a group's name to get it if you're using named groups.

Why would you want to compile a pattern?
- It is going to be used multiple times
- I want to be able to pass it to functions
- I want to be able to use it directly
- I want to provide multiple patterns as part of a library
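
To make the reasons above concrete, here is a minimal sketch of a pattern compiled once and then passed to functions. The names PHONE, count_phones, and first_phone are hypothetical helpers invented for this example.

```python
import re

# Compiled once, reused below. PHONE is a hypothetical name for this sketch.
PHONE = re.compile(r'\(?\d{3}\)?-?\s?\d{3}-\d{4}')

def count_phones(pattern, text):
    """Count the matches of an already compiled pattern in text."""
    return sum(1 for _ in pattern.finditer(text))

def first_phone(pattern, text):
    """Return the first match as a string, or None if there is none."""
    match = pattern.search(text)
    return match.group() if match else None

lines = ["(555) 555-5555 and 555-555-5552", "no phone here"]
counts = [count_phones(PHONE, line) for line in lines]

print(counts)                        # [2, 0]
print(first_phone(PHONE, lines[0]))  # (555) 555-5555
print(first_phone(PHONE, lines[1]))  # None
```

The compiled object carries its own .search(), .findall(), and .finditer() methods, so the functions never need to see the pattern string or its flags.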

__Using re.compile to preserve a regex to be used later with different inputs :__

In [38]:
line = re.compile(r'''
    ^(?P<name>(?P<last>[-\w ]*),\s(?P<first>[-\w ]+))\t  # first and last name.
    (?P<email>[-\w\d.+]+@[-\w\d.]+)\t  # Email
    (?P<phone>\(?\d{3}\)?-?\s?\d{3}-\d{4})?\t  # Phone
    (?P<job>[\w\s]+,\s[\w\s.]+)\t?  # Job and company
    (?P<twitter>@[\w\d]+)?$  # Twitter
''', re.X|re.M)

print(re.search(line, data).groupdict(), '\n')
print(line.search(data).groupdict())

{'name': 'Love, Kenneth', 'last': 'Love', 'first': 'Kenneth', 'email': 'kenneth@teamtreehouse.com', 'phone': '(555) 555-5555', 'job': 'Teacher, Treehouse\t', 'twitter': '@kennethlove'} 

{'name': 'Love, Kenneth', 'last': 'Love', 'first': 'Kenneth', 'email': 'kenneth@teamtreehouse.com', 'phone': '(555) 555-5555', 'job': 'Teacher, Treehouse\t', 'twitter': '@kennethlove'}


__Using finditer to loop over every match and read its 'name' group (the subgroups themselves are defined in the previous example) :__

In [37]:
for match in line.finditer(data):
    print(match.group('name'))

Love, Kenneth
McFarland, Dave
Arthur, King
Österberg, Sven-Erik
, Tim
Carson, Ryan
Doctor, The
Exampleson, Example
Obama, Barack
Chalkley, Andrew
Vader, Darth
Fernández de la Vega Sanz, María Teresa


__Using groupdict alongside subgroups :__

In [40]:
for match in line.finditer(data):
    print('{first} {last} <{email}>'.format(**match.groupdict()))

Kenneth Love <kenneth@teamtreehouse.com>
Dave McFarland <dave@teamtreehouse.com>
King Arthur <king_arthur@camelot.co.uk>
Sven-Erik Österberg <governor@norrbotten.co.se>
Tim  <tim@killerrabbit.com>
Ryan Carson <ryan@teamtreehouse.com>
The Doctor <doctor+companion@tardis.co.uk>
Example Exampleson <me@example.com>
Barack Obama <president.44@us.gov>
Andrew Chalkley <andrew@teamtreehouse.com>
Darth Vader <darth-vader@empire.gov>
María Teresa Fernández de la Vega Sanz <mtfvs@spain.gov>


## Code Challenge
Challenge Task 1 of 2

Create a variable named players that is an re.search() or re.match() to capture three groups: last_name, first_name, and score. It should include re.MULTILINE.

In [48]:
import re

string = '''
Love, Kenneth: 20
Chalkley, Andrew: 25
McFarland, Dave: 10
Kesten, Joy: 22
Stewart Pinchback, Pinckney Benton: 18'''

players = re.search(r'''
        ^(?P<last_name>[\w]+\s?[\w]*),
        \s(?P<first_name>[\w]+\s?[\w]*):
        \s(?P<score>[\d]+)$
''', string, re.X|re.M)

print(players)

<_sre.SRE_Match object; span=(1, 18), match='Love, Kenneth: 20'>


Challenge Task 2 of 2

Wow! OK, now, create a class named Player that has those same three attributes, last_name, first_name, and score. I should be able to set them through `__init__`.

In [49]:
class Player:
    def __init__(self, last_name, first_name, score):
        self.last_name = last_name
        self.first_name = first_name
        self.score = score
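
The two challenge tasks fit together naturally: each match's .groupdict() has exactly the keys that Player expects. The sketch below is one way to combine them and is not part of the challenge itself. Converting the score to int is an assumption made for the example.

```python
import re

class Player:
    def __init__(self, last_name, first_name, score):
        self.last_name = last_name
        self.first_name = first_name
        self.score = score

string = '''
Love, Kenneth: 20
Chalkley, Andrew: 25
McFarland, Dave: 10
Kesten, Joy: 22
Stewart Pinchback, Pinckney Benton: 18'''

pattern = re.compile(r'''
    ^(?P<last_name>[\w]+\s?[\w]*),
    \s(?P<first_name>[\w]+\s?[\w]*):
    \s(?P<score>[\d]+)$
''', re.X | re.M)

players = []
for match in pattern.finditer(string):
    fields = match.groupdict()
    fields['score'] = int(fields['score'])  # groups are strings; int is a choice for this sketch
    players.append(Player(**fields))        # group names line up with the __init__ parameters

print(len(players))                    # 5
print(players[-1].last_name)           # Stewart Pinchback
print(max(p.score for p in players))   # 25
```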

***

Extra Credit

Write a class to represent a person based on the information in the text file. They should have names, email addresses, phone numbers, jobs, and Twitter accounts. Remember, some of these can be blank, though!

To go even further, make a class to act as an address book to hold all of the people class instances created above. Can you make it searchable?