# Denison CS-181/DA-210 Homework

---

## Regular Expression Exercises

*Execute the prolog cell*

In [1]:
import re

import os
import io
import sys

from contextlib import redirect_stdout
from IPython.core.debugger import set_trace

def add_modules():
    """
    Starting at the current directory and proceeding up the file system
    tree, search for a directory named `modules`.  If found, and if not
    already there, add to the Python module search path.
    
    Params: None
    
    Return: None
    """
    directory = "."
    levels = 0
    while not os.path.isdir(os.path.join(directory, "modules")) and \
          levels < 5:
        directory = os.path.join(directory, "..")
        levels += 1
    module_path = os.path.abspath(os.path.join(directory, "modules"))
    if os.path.isdir(module_path):
        if module_path not in sys.path:
            sys.path.append(module_path)

add_modules()
import util

ModuleNotFoundError: No module named 'util'

### Instructions

For these exercises, the focus is on writing **restrictive** regular expressions that solve the stated problem.  The focus is **not** on regular expression programming with the `re` module in Python.

For each question, you will, as usual, delete the two-line sequence (the comment and the `raise` statement) and provide your solution.  Your solution will simply consist of your regular expression, written as a raw Python string and assigned to `pattern`.

For both our own testing and your testing and debugging, we provide, in the `util` module, two new functions that take care of the Python programming side and return a list with the results of exercising the matching:

- `util.assembleMatches(pattern, text, flag=0)`: find all matches of `pattern` in `text` (subject to the behavior of `flag`) and return a list containing, for each match, a tuple with the string of the match and the index in `text` where the match begins.  Use this if there are no capture groups.
- `util.assembleCaptures(pattern, text, flag=0)`: find all matches of `pattern` in `text` (subject to the behavior of `flag`) and return a list containing, for each match, a **list** whose first entry describes the entire match and whose remaining entries describe the capture groups within the match.  Each entry is a tuple with the captured string and the index in `text` where it begins.
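The source of the `util` module is not shown here, so the following is only a rough sketch of how two such helpers could be built on `re.finditer`. The names `assemble_matches_sketch` and `assemble_captures_sketch` are hypothetical stand-ins, and the real functions may differ in detail; the return shapes simply mirror the outputs shown in the cells below.

```python
import re

def assemble_matches_sketch(pattern, text, flag=0):
    """Hypothetical stand-in: one (matched string, start index) tuple per match."""
    return [(m.group(0), m.start()) for m in re.finditer(pattern, text, flag)]

def assemble_captures_sketch(pattern, text, flag=0):
    """Hypothetical stand-in: per match, the whole match followed by each group."""
    results = []
    for m in re.finditer(pattern, text, flag):
        # Group 0 is the entire match; groups 1..n are the capture groups.
        results.append([(m.group(i), m.start(i)) for i in range(m.re.groups + 1)])
    return results

sample = "cat 12, dog 7"
matches = assemble_matches_sketch(r"\d+", sample)
captures = assemble_captures_sketch(r"([a-z]+) (\d+)", sample)
print(matches)
print(captures)
```

A stand-in like this can be handy for trying patterns outside the course environment, for instance when `util` is not on the module search path.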

The testing cells will have minimal testing and asserts.  You should try other variations and boundary cases for yourself to make sure your regular expression pattern works per the specification.

**Q1:** Write a regular expression that matches complete words that begin with `t`, then `h`, followed by exactly two more letters (i.e., it should match only words that are four letters long).

In [2]:
text1 = "Does this text match that pattern?"

pattern = r"\bth[a-zA-Z]{2}\b"
util.assembleMatches(pattern, text1)

[('this', 5), ('that', 21)]

In [3]:
text1 = "Does this text match that pattern?"

assert util.assembleMatches(pattern, text1) == [('this', 5), ('that', 21)]
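The instructions suggest trying boundary cases beyond the provided assert. As a small sketch using plain `re.findall` (the probe string is made up for illustration), the comparison below shows how a pattern without word boundaries picks up fragments of longer words, while a boundary-anchored pattern restricted to letters does not.

```python
import re

# Made-up probe text: "other" and "things" contain th + two word characters,
# but neither is a complete four-letter word starting with "th".
probe = "other things then this"

loose = re.findall(r"th\w\w", probe)                # no word boundaries
strict = re.findall(r"\bth[a-zA-Z]{2}\b", probe)    # complete words only

print(loose)
print(strict)
```

The loose pattern reports pieces of `other` and `things`, which is the kind of over-matching a restrictive pattern is meant to rule out.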

**Q2:** Hexadecimal numbers are written with 16 symbols: the digits 0 through 9 and the first six letters of the alphabet, A through F.  The letters may be in upper case, lower case, or a mix.  Numbers are written the same way as integers: one or more of the 16 symbols in sequence, with no intervening spaces or punctuation.

Write a regular expression pattern that, if applied to a target, completely matches all hexadecimal numbers within that target.

In [46]:
text = "The hex number may include 43ac21 or feed beef, but not fred"

pattern = r"\b[0-9a-fA-F]+\b"
util.assembleMatches(pattern, text)

[('43ac21', 27), ('feed', 37), ('beef', 42)]

In [47]:
text = "The hex number may include 43ac21 or feed beef, but not fred"

assert util.assembleMatches(pattern, text) == [('43ac21', 27), ('feed', 37), ('beef', 42)]
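To see why the word boundaries matter for "completely matches", here is a brief sketch with plain `re.findall` on a made-up probe string. Without `\b`, the character class happily matches hex-looking fragments inside ordinary words and inside tokens that contain a non-hex character.

```python
import re

# Made-up probe: "fred" and "12g4" each contain hex symbols but are not hex numbers.
probe = "fred, DeadBEEF, 12g4"

loose = re.findall(r"[0-9a-fA-F]+", probe)       # fragments allowed
strict = re.findall(r"\b[0-9a-fA-F]+\b", probe)  # whole tokens only

print(loose)
print(strict)
```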

**Q3:** Outside the US, it is common to write dates in the form `year.month.day`, e.g., `2020.01.05` for `January 05, 2020`. Write a regular expression that matches a date written in this form.  Note that single-digit months and days use a leading zero.

In [40]:
text1 = "2021.02.15 is the first day of week 3; but 123.45.6789 might be mistaken for a social security number"

pattern = r"\b\d{4}\.\d{2}\.\d{2}\b"
util.assembleMatches(pattern, text1)

[('2021.02.15', 0)]

In [41]:
text1 = "2021.02.15 is the first day of week 3; but 123.45.6789 might be mistaken for a social security number"

assert util.assembleMatches(pattern, text1) == [('2021.02.15', 0)]
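The problem statement only asks for the `year.month.day` shape, but a more restrictive variant could also limit the month to 01 through 12 and the day to 01 through 31. The sketch below is one such variant, offered as an optional extension rather than the expected answer; it still does not check the number of days in a particular month, and the probe string is made up.

```python
import re

# Stricter variant (an assumption beyond the exercise): month 01-12, day 01-31.
date_pattern = r"\b\d{4}\.(?:0[1-9]|1[0-2])\.(?:0[1-9]|[12]\d|3[01])\b"

probe = "2020.01.05 and 2021.12.31 are plausible; 2021.13.01 and 2021.02.32 are not"
found = [(m.group(0), m.start()) for m in re.finditer(date_pattern, probe)]
print(found)
```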

**Q4:** Write a regular expression pattern that matches 10-digit telephone numbers. The numbers will be formatted as `dddsdddsdddd` or `(ddd)sdddsdddd`, where `d` is a digit and `s` is a separator (space, period, or dash), e.g., `555 555.5555` or `(555) 555-5555`. Besides matching valid phone numbers, you should **capture** all three portions of the phone number: the area code, the three-digit exchange prefix, and the four-digit line number.

In [49]:
text1 = "Looking to match 555-123-4567 and (800) 721-6432 but 123.45.6789 might be mistaken for a social security number"

pattern = r"\(?\b(\d{3})\)?[ .-](\d{3})[ .-](\d{4})\b"
util.assembleCaptures(pattern, text1)

[[('555-123-4567', 17), ('555', 17), ('123', 21), ('4567', 25)],
 [('(800) 721-6432', 34), ('800', 35), ('721', 40), ('6432', 44)]]

In [50]:
text1 = "Looking to match 555-123-4567 and (800) 721-6432 but 123.45.6789 might be mistaken for a social security number"

assert util.assembleCaptures(pattern, text1) == [[('555-123-4567', 17), ('555', 17), ('123', 21), ('4567', 25)],
 [('(800) 721-6432', 34), ('800', 35), ('721', 40), ('6432', 44)]]
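The test text above exercises only the dash and the parenthesized form, so it is worth probing the other separators and an unbalanced parenthesis as well. The sketch below uses plain `re.finditer` and a stricter, purely illustrative pattern that accepts the area code either fully wrapped in parentheses or not wrapped at all, while keeping exactly three capture groups. The probe string is made up.

```python
import re

# Illustrative stricter pattern: "(" is accepted only when "ddd)" follows,
# and a bare area code is accepted only when it is not followed by ")".
phone = r"(?:\((?=\d{3}\))|\b(?!\d{3}\)))(\d{3})\)?[ .-](\d{3})[ .-](\d{4})\b"

probe = "Try 555 123.4567, (800) 721-6432, and 555) 555-5555 or 123.45.6789"
parts = [m.groups() for m in re.finditer(phone, probe)]
print(parts)
```

The lookaheads decide up front which of the two layouts applies, so the optional `\)?` can only consume a closing parenthesis that has a matching opening one.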


**Q5:** Given the table below, reproduced as a single string within the solution cell, find a single regular expression that matches all the items in the first column (the entirety of the match, but stopping before the trailing spaces) but none of those in the second column.  You **must** use matching of true patterns, looking at the similarities and differences between the two columns.  A disjunction of the literals in the first column will not be awarded any points.
```python
"""
Match     | No Match
--------------------
pit       | pt
spot      | Pot
spate     | peat
slap two  | part
respite   | top it
"""
```

In [74]:
target = """
Match     | No Match
--------------------
pit       | pt
spot      | Pot
spate     | peat
slap two  | part
respite   | top it
"""

pattern = r"s\w+\s\w+|s\w+|re\w+|pi\w+"
matches = util.assembleMatches(pattern, target)
matches

[('pit', 43), ('spot', 58), ('spate', 74), ('slap two', 91), ('respite', 108)]

In [75]:
target = """
Match     | No Match
--------------------
pit       | pt
spot      | Pot
spate     | peat
slap two  | part
respite   | top it
"""

matches = util.assembleMatches(pattern, target)
assert len(matches) == 5
assert matches == [('pit', 43), ('spot', 58), ('spate', 74), ('slap two', 91), ('respite', 108)]
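One structural reading of the table, offered as a sketch rather than the intended answer: every item in the first column contains a lowercase `p`, then exactly one character, then `t` (`pit`, `pot`, `pat`, `p t`, `pit`), while no item in the second column does (`pt` has nothing in between, `Pot` starts with a capital, and `peat`, `part`, and `p it` have two characters in between). Surrounding that core with optional lowercase letters extends the match to the whole item.

```python
import re

target = """
Match     | No Match
--------------------
pit       | pt
spot      | Pot
spate     | peat
slap two  | part
respite   | top it
"""

# Core idea: lowercase p, any single character, then t, padded by lowercase letters.
structural = r"[a-z]*p.t[a-z]*"
found = [(m.group(0), m.start()) for m in re.finditer(structural, target)]
print(found)
```

Because `.` does not match a newline by default, the `pt` at the end of a line cannot pair with text on the following line.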