# Regular Expressions Exercise
***
prepared by: Van Julius Leander G. Lopez, MSDS
#### Instructions:

Each answer should be the result of a single call to a `re` function. Use the asserts to check that your code works as instructed; all of them must pass for full marks. The tasks increase in difficulty, so BEST OF LUCK!

For checking, the notebook must run from top to bottom without raising an error. It is best practice to restart the kernel and run all cells before you submit.

In [1]:
import re
from numpy.testing import assert_equal

## Task 1
***
Develop a function named `is_date` that uses Python regular expressions to check whether the provided string follows any of these date formats:

* yyyy-mm-dd
* yyyy/mm/dd
* mm-dd-yyyy
* mm/dd/yyyy
* dd-mm-yyyy
* dd/mm/yyyy

The function should return `True` if the string matches one of these formats, and `False` otherwise. There is no need to validate the date itself.

In [2]:
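def is_date(date_str):
    # Year-first or year-last layout; fullmatch rejects any extra characters,
    # including a trailing newline that `$` alone would tolerate.
    date = re.compile(r'\d{4}[-/]\d{2}[-/]\d{2}|\d{2}[-/]\d{2}[-/]\d{4}')
    return bool(date.fullmatch(date_str))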

In [3]:
assert_equal(is_date('2020-05-18'), True)
assert_equal(is_date('2020/05/18'), True)
assert_equal(is_date('05-18-2020'), True)
assert_equal(is_date('05/18/2020'), True)
assert_equal(is_date('18-05-2020'), True)
assert_equal(is_date('18/05/2020'), True)
assert_equal(is_date('5-18-2020'), False)
assert_equal(is_date('5/18/2020'), False)
assert_equal(is_date('18-5-2020'), False)
assert_equal(is_date('18/5/2020'), False)
assert_equal(is_date('2020-054-18'), False)
assert_equal(is_date('2020-05-181'), False)
assert_equal(is_date('202-05-18'), False)
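
The pattern above accepts either separator at each position, so a mixed string such as `2020-05/18` also passes. The asserts do not cover that case, so whether it should be rejected is an assumption. If the two separators are meant to agree, a backreference is one way to enforce it. The sketch below uses a hypothetical name, `is_date_strict`, that is not part of the exercise.

```python
import re

def is_date_strict(date_str):
    # Hypothetical stricter variant. \1 repeats the separator captured in the
    # year-first branch and \2 the one captured in the year-last branch, so
    # both separators in a date must be the same character.
    pattern = r'\d{4}([-/])\d{2}\1\d{2}|\d{2}([-/])\d{2}\2\d{4}'
    # fullmatch requires the whole string to match, so no ^ or $ is needed.
    return re.fullmatch(pattern, date_str) is not None

print(is_date_strict('2020-05-18'))  # True
print(is_date_strict('2020-05/18'))  # False
```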

## Task 2
***
Develop a function named `find_data` which retrieves all instances of `data set` or `dataset` within the provided `text`, returning them as a list of strings in the sequence they appear in `text`.

In [4]:
def find_data(text):
    data = re.compile(r'\bdata set\b|\bdataset\b')
    return data.findall(text)

In [5]:
text = """
My dataset is bigger than your data set. Your data set is so tiny, ten of your
data sets can fit inside my dataset. Ten of your data sets inside one of my
data set! Your data set doesn't stand a chance.
"""
assert_equal(
    find_data(text), 
    ['dataset', 'data set', 'data set', 'dataset', 'data set', 'data set'])
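
The two alternatives above differ only by one optional space, so they could be collapsed with `?`. A minimal sketch, where `find_data_compact` and the sample sentence are made up for illustration:

```python
import re

def find_data_compact(text):
    # ' ?' makes the space optional; \b keeps plurals such as "data sets" out.
    return re.findall(r'\bdata ?set\b', text)

sample = "My dataset beats your data set, and ten data sets fit in one dataset."
print(find_data_compact(sample))  # ['dataset', 'data set', 'dataset']
```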

## Task 3
***
Develop a function named `find_lamb` which retrieves all phrases in `text` that start with the word `little` and conclude with the word `lamb`, returning them as a list of strings in the order they appear within `text`.

In [6]:
def find_lamb(text):
    lamb = re.compile(r'\blittle\b.*?\blamb\b')
    return lamb.findall(text)

In [7]:
text = """
Mary had a little lamb, little lamb, little lamb
Mary had a little brown lamb, little brown lamb, little brown lamb
Whose wool is as brown as wood
"""
assert_equal(
    find_lamb(text), 
    ['little lamb',
     'little lamb',
     'little lamb',
     'little brown lamb',
     'little brown lamb',
     'little brown lamb'])
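
The `?` after `.*` in the pattern above is what keeps each match short. As a small illustration on a single line of the rhyme, here is how the greedy and lazy forms differ:

```python
import re

line = "Mary had a little lamb, little lamb, little lamb"

# Greedy: .* runs to the last "lamb" on the line, giving one long match.
greedy = re.findall(r'\blittle\b.*\blamb\b', line)
# Lazy: .*? stops at the first "lamb" after each "little".
lazy = re.findall(r'\blittle\b.*?\blamb\b', line)

print(greedy)  # ['little lamb, little lamb, little lamb']
print(lazy)    # ['little lamb', 'little lamb', 'little lamb']
```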

## Task 4
***
Develop a function named `repeat_alternate` which returns a string where every other word of `text`, starting with the first, is duplicated.

In [8]:
def repeat_alternate(text):
    # Group 1 is the word to duplicate; optional group 2 is the following
    # word with its leading whitespace, which is copied through unchanged.
    # This also duplicates a final unpaired word when the word count is odd.
    return re.sub(r'(\S+)(\s+\S+)?', r'\1 \1\2', text)

In [9]:
text = ("Peter Piper picked a peck of pickled peppers "
        "A peck of pickled peppers Peter Piper picked "
        "If Peter Piper picked a peck of pickled peppers "
        "Where's the peck of pickled peppers Peter Piper picked ")
assert_equal(
    repeat_alternate(text),
    ("Peter Peter Piper picked picked a peck peck of pickled pickled peppers "
     "A A peck of of pickled peppers peppers Peter Piper Piper picked If If "
     "Peter Piper Piper picked a a peck of of pickled peppers peppers "
     "Where's the the peck of of pickled peppers peppers Peter Piper Piper "
     "picked ")
)

## Task 5
***
Develop a function named `whats_on_the_bus` that retrieves the distinct items present on the bus based on the provided `text`.

In [10]:
def whats_on_the_bus(text):
    bus = re.compile(r'\b(wheels|wipers|horn|babies)\b')
    return set(bus.findall(text))

In [11]:
text = """
The wheels on the bus go round and round
Round and round, round and round
The wheels on the bus go round and round
All day long
The wipers on the bus go swish, swish, swish
Swish, swish, swish, swish, swish, swish
The wipers on the bus go swish, swish, swish
All day long
The horn on the bus goes beep, beep, beep
Beep, beep, beep, beep, beep, beep
The horn on the bus goes beep, beep, beep
All day long
The babies on the bus go wah, wah, wah
Wah, wah, wah, wah, wah, wah
The babies on the bus go wah, wah, wah
All day long
The wheels on the bus go round and round
Round and round, round and round
The wheels on the bus go round and round
All day long
"""
items = whats_on_the_bus(text)
assert_equal(len(items), 4)
assert_equal(set(items), set(['babies', 'horn', 'wheels', 'wipers']))
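
The alternation above lists the four items by name, which works for this text but would miss any item it does not already know about. One possible alternative is to capture whatever word sits between `The` and `on the bus`. This assumes every item is introduced with exactly that phrase; `whats_on_the_bus_by_pattern` is a hypothetical name.

```python
import re

def whats_on_the_bus_by_pattern(text):
    # Assumes every item is introduced as "The <item> on the bus".
    return set(re.findall(r'\bThe (\w+) on the bus\b', text))

verse = """
The wheels on the bus go round and round
The wipers on the bus go swish, swish, swish
The wheels on the bus go round and round
"""
print(sorted(whats_on_the_bus_by_pattern(verse)))  # ['wheels', 'wipers']
```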

## Task 6
***
Develop a function called `to_list` that returns a list of the items in `text`, split on the delimiters `,`, `+`, or `and`.

In [12]:
def to_list(text):
    # Split on ',', '+', or the literal 'and', absorbing surrounding spaces
    # so that no item needs a separate strip().
    return re.split(r'\s*(?:,|\+|and)\s*', text)

In [13]:
text = "a,b,candfoo bar+bazandd e+fee fi fo"
assert_equal(
    to_list(text), 
    ['a', 'b', 'c', 'foo bar', 'baz', 'd e', 'fee fi fo'])

## Task 7
***
Develop a function named `march_product` that returns a list with the product of each `m by n` pair found in `text`, in order of appearance.

In [14]:
def march_product(text):
    march = re.compile(r'(\d+)\sby\s(\d+)')
    products = [int(m[0]) * int(m[1]) for m in march.findall(text)]
    return products

In [15]:
text = """
The ants go marching 1 by 1, hurrah, hurrah
The ants go marching 2 by 13, hurrah, hurrah
The ants go marching 42 by 8,
The little one stops to suck his thumb
And they all go marching down to the ground
To get out of the rain, BOOM! BOOM! BOOM!

The ants go marching 9 by 16, hurrah, hurrah
The ants go marching 54 by 7, hurrah, hurrah
The ants go marching 8 by 42,
The little one stops to tie his shoe
And they all go marching down to the ground
To get out of the rain, BOOM! BOOM! BOOM!
"""
assert_equal(march_product(text), [1, 26, 336, 144, 378, 336])
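
As an optional variation, named groups can make the two operands easier to read than positional indexes. The sketch below is only an illustration; `march_product_named` and the group names `m` and `n` are assumptions chosen to mirror the `m by n` wording of the task.

```python
import re

def march_product_named(text):
    # (?P<m>...) and (?P<n>...) label the two numbers around "by".
    pattern = re.compile(r'\b(?P<m>\d+)\s+by\s+(?P<n>\d+)\b')
    # finditer yields match objects, so the groups can be read by name.
    return [int(mo['m']) * int(mo['n']) for mo in pattern.finditer(text)]

print(march_product_named("marching 2 by 13, hurrah, then 42 by 8,"))  # [26, 336]
```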

## Task 8
***
Develop a function named `get_big` that accepts `items` and returns a list of the `ITEM`s whose first word is `Big` and whose `SKU` is not entirely numeric.

In [16]:
def get_big(items):
    big_items = []
    
    for line in items.split('\n')[1:]:
        parts = line.split()
        if len(parts) > 1 and parts[1] == 'Big':
            sku = parts[0]
            if not re.match(r'^\d+$', sku): 
                big_items.append(' '.join(parts[1:]))
    return big_items

In [17]:
items = """SKU ITEM
1A Big Red Box
A0 Big Bad Wolf
A1 Bigrams and Trigrams
02 Big Big World
BC Ain't Big Shoes
3C Bigger not Big"""

assert_equal(get_big(items), ['Big Red Box', 'Big Bad Wolf'])
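
Since the instructions ask for a single call to a `re` function, here is one possible way to fold the line loop above into a single `re.findall`. It is a sketch that assumes the SKU and the item are separated by spaces or tabs; `get_big_single_call` is a hypothetical name.

```python
import re

def get_big_single_call(items):
    # (?!\d+[ \t])  -> the SKU must not consist only of digits
    # \S+[ \t]+     -> skip the SKU and the gap after it
    # (Big\b.*)     -> capture an item whose first word is exactly "Big"
    return re.findall(r'^(?!\d+[ \t])\S+[ \t]+(Big\b.*)$', items, flags=re.M)

items = """SKU ITEM
1A Big Red Box
A0 Big Bad Wolf
A1 Bigrams and Trigrams
02 Big Big World
BC Ain't Big Shoes
3C Bigger not Big"""
print(get_big_single_call(items))  # ['Big Red Box', 'Big Bad Wolf']
```

With `re.M`, `^` and `$` match at the start and end of every line, so the header row and the non-matching rows are simply skipped.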

## Task 9
***
Develop a function called `find_van` which takes `names` as input and produces a list containing the first names from `names` that contain the case-insensitive substring `Van`, but where the corresponding last name doesn't start with either `B` or `M`.

In [18]:
def find_van(names):
    van_first_names = []
    
    for line in names.split('\n')[1:]:
        parts = line.split()
        if len(parts) > 2 and re.search(r'van', parts[1], re.IGNORECASE):
            if not re.match(r'^[BM]', parts[2], re.IGNORECASE):
                van_first_names.append(parts[1])
    return van_first_names

In [19]:
names = '''ID FIRST_NAME LAST_NAME
1 Van Lopez
A Vance Batumbakal
A1 Vanora Rant
02 Vanadium Chemistry
BC Vanity Leon
3C Christopher de Leon
4F Vanessa Yu'''

res = find_van(names)
assert_equal(type(res), list)
assert_equal(set(res), 
             set(['Van', 'Vanora', 'Vanadium', 'Vanity', 'Vanessa']))
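
The same filter can be sketched as a single `re.findall`, in line with the instructions. Two assumptions are made here: the columns are separated by spaces or tabs, and the check on the last name is limited to uppercase `B` and `M`, which is why only `van` carries the case-insensitive flag. `find_van_single_call` is a hypothetical name.

```python
import re

def find_van_single_call(names):
    # \S+[ \t]+          -> skip the ID column
    # (\S*(?i:van)\S*)   -> first name containing "van" in any letter case
    # [ \t]+(?![BM])\S   -> last name must not start with uppercase B or M
    pattern = r'^\S+[ \t]+(\S*(?i:van)\S*)[ \t]+(?![BM])\S'
    return re.findall(pattern, names, flags=re.M)

names = '''ID FIRST_NAME LAST_NAME
1 Van Lopez
A Vance Batumbakal
A1 Vanora Rant
02 Vanadium Chemistry
BC Vanity Leon
3C Christopher de Leon
4F Vanessa Yu'''
print(find_van_single_call(names))
# ['Van', 'Vanora', 'Vanadium', 'Vanity', 'Vanessa']
```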

## Task 10
***
Develop a function called `get_client` that takes `server_log` as input and generates a list of tuples containing the client IP, date/time of server access, and status code. The `server_log` is provided below, with relevant information highlighted in red.

<pre>
<font color="red">66.249.65.159</font> - - [<font color="red">06/May/2019:19:10:38 +0800</font>] "GET /news/53f8d72920ba2744fe873ebc.html HTTP/1.1" <font color="red">404</font> 177 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
<font color="red">66.249.65.3</font> - - [<font color="red">06/May/2019:19:11:24 +0800</font>] "GET /?q=%E0%A6%AB%E0%A6%BE%E0%A7%9F%E0%A6%BE%E0%A6%B0 HTTP/1.1" <font color="red">200</font> 4223 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
<font color="red">127.0.0.1</font> - - [<font color="red">06/May/2019:19:12:14 +0800</font>] "GET /?q=%E0%A6%A6%E0%A7%8B%E0%A7%9F%E0%A6%BE HTTP/1.1" <font color="red">200</font> 4356 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
</pre>

In [20]:
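def get_client(server_log):
    # One log entry per line: the IP, then the bracketed timestamp, then the
    # three-digit status code that follows the quoted request.
    client = re.compile(r'^(\d+(?:\.\d+){3}) .*?\[(.*?)\] ".*?" (\d{3}) ', re.M)
    # findall already returns one (ip, timestamp, status) tuple per match.
    return client.findall(server_log)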

In [21]:
server_log = '''66.249.65.159 - - [06/May/2019:19:10:38 +0800] "GET /news/53f8d72920ba2744fe873ebc.html HTTP/1.1" 404 177 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.65.3 - - [06/May/2019:19:11:24 +0800] "GET /?q=%E0%A6%AB%E0%A6%BE%E0%A7%9F%E0%A6%BE%E0%A6%B0 HTTP/1.1" 200 4223 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
127.0.0.1 - - [06/May/2019:19:12:14 +0800] "GET /?q=%E0%A6%A6%E0%A7%8B%E0%A7%9F%E0%A6%BE HTTP/1.1" 200 4356 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'''

assert_equal(get_client(server_log),
             [('66.249.65.159', '06/May/2019:19:10:38 +0800', '404'),
              ('66.249.65.3', '06/May/2019:19:11:24 +0800', '200'),
              ('127.0.0.1', '06/May/2019:19:12:14 +0800', '200')])
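
The date/time field comes back as a plain string. If it needs to be compared or sorted later, one option is to parse it with `datetime.strptime`. This is a sketch outside the scope of the task; `parse_log_time` and `LOG_TIME_FORMAT` are hypothetical names, and the format string assumes every timestamp follows the layout seen in the log above.

```python
from datetime import datetime, timedelta

# Assumed layout: day/abbreviated month/year:hour:minute:second, UTC offset.
LOG_TIME_FORMAT = '%d/%b/%Y:%H:%M:%S %z'

def parse_log_time(stamp):
    # %b matches English month abbreviations under the default C locale.
    return datetime.strptime(stamp, LOG_TIME_FORMAT)

accessed = parse_log_time('06/May/2019:19:10:38 +0800')
print(accessed.isoformat())  # 2019-05-06T19:10:38+08:00
```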