# Regular Expressions


## Table of Contents

* [Turn 0](#Turn-0)
* [Turn 1](#Turn-1)
* [Turn 2](#Turn-2)
* [References](#References)



## Task

Our task is to develop a chatbot that can talk about smartphones.
Here is an example dialogue (`S`: system, `U`: user):

```
S0: are you using a smartphone?
U0: yes, i have an iphone.
S1: how long have you been using iphone?
U1: about 2 years.
S2: oh, are you using iphone 10s or 10s max?
U2: no, i'm using iphone 6s plus.
S3: iphone 6s plus is about 5 years old.
```

## Turn 0

Given the following question initiated by the system:

> S0: are you using a smartphone?

We expect either `Yes`, `No`, or `None` as the user response.

### Response: `Yes`

The following code defines a group `(yes|yeah)` in the regular expression that matches the literals:

In [1]:
import re

re_yn = re.compile(r'(yes|yeah)')
m = re_yn.match('yeah, i am')
print(m)

<re.Match object; span=(0, 4), match='yeah'>


If there is a match, we can retrieve the literal:

In [2]:
if m:
    yes = m.group()
    print(yes)

yeah


If no match is found, `match` returns `None`:

In [3]:
m = re_yn.match('sure, i am')
print(m)

None


### Issue 1

`re_yn` can overmatch:

In [4]:
m = re_yn.match('yesterday was my birthday')
if m: print(m.group())

yes


Match only if the literals are followed by a whitespace character (`\s`), a comma (`,`), a period (`\.`), or the end of the string (`$`):

In [5]:
re_yn = re.compile(r'(yes|yeah)(\s|,|\.|$)')

m = re_yn.match('yesterday was my birthday')
print(m)

m = re_yn.match('yes, i am')
print(m.groups())
for i in range(len(m.groups())): print(i, m.group(i))

None
('yes', ',')
0 yes,
1 yes


Exclude the second group from capturing with `?:`:

In [6]:
re_yn = re.compile(r'(yes|yeah)(?:\s|,|\.|$)')

m = re_yn.match('yes, i am')
for i in range(len(m.groups())): print(i, m.group(i))

0 yes,
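
The effect of `?:` can also be checked on the compiled pattern itself. The following is a minimal sketch; the names `capturing` and `non_capturing` are introduced here only for illustration:

```python
import re

# Same pattern twice; only the delimiter group differs.
capturing = re.compile(r'(yes|yeah)(\s|,|\.|$)')
non_capturing = re.compile(r'(yes|yeah)(?:\s|,|\.|$)')

# Pattern.groups is the number of capturing groups in the pattern.
print(capturing.groups, non_capturing.groups)   # 2 1

# groups() returns only the captured groups; group(0) is always the whole match.
m = non_capturing.match('yes, i am')
print(m.groups(), repr(m.group(0)))             # ('yes',) 'yes,'
```

The delimiter still takes part in the match (it shows up in `group(0)`), but it no longer occupies a group slot.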


### Issue 2

`re_yn.match` matches only at the beginning of the string:

In [7]:
m = re_yn.match('well, yes I am')
print(m)

None


Allow zero or more preceding characters (`.*`), and require the literals to be preceded by a whitespace character (`\s`) or the beginning of the string (`^`):

In [8]:
re_yn = re.compile(r'(?:.*)(?:\s|^)(yes|yeah)(?:\s|,|\.|$)')

m = re_yn.match('yeah, I am')
if m: print(m.group(1))

m = re_yn.match('well yes I am')
if m: print(m.group(1))

yeah
yes
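
As an alternative sketch, the `search` method scans the string for the first position where the pattern matches, so the leading `.*` is not needed. The name `re_yn_search` is hypothetical and used only in this example:

```python
import re

# No leading `.*` here: `search` looks for a match anywhere in the string.
re_yn_search = re.compile(r'(?:\s|^)(yes|yeah)(?:\s|,|\.|$)')

for text in ['yeah, I am', 'well yes I am', 'yesterday was my birthday']:
    m = re_yn_search.search(text)
    print(m.group(1) if m else None)
# yeah
# yes
# None
```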


### Response: `No`

Define another group `(no|not really)` in the same regular expression that matches the literals:

In [9]:
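# group the two alternatives so that both boundary checks apply to either literal;
# the lazy .*? makes match() stop at the earliest literal in the string
re_yn = re.compile(r'(?:.*?)(?:\s|^)(?:(yes|yeah)|(no|not really))(?:\s|,|\.|$)')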

m = re_yn.match('yes, I am')
if m: print(m.groups())
    
m = re_yn.match('no I am not')
if m: print(m.groups())

('yes', None)
(None, 'no')
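
When two groups are combined with `|`, keep in mind that alternation has the lowest precedence in a regular expression: everything to the left of `|` is one alternative and everything to the right is the other. The sketch below uses two hypothetical patterns, `ungrouped` and `grouped`, to show why wrapping the alternatives in a non-capturing group matters when the boundary checks should apply to both:

```python
import re

# Hypothetical patterns for illustration only.
# Ungrouped: parsed as  [(?:\s|^)(yes)]  OR  [(no)(?:\s|,|\.|$)].
ungrouped = re.compile(r'(?:\s|^)(yes)|(no)(?:\s|,|\.|$)')
# Grouped: both boundary checks apply to either literal.
grouped = re.compile(r'(?:\s|^)(?:(yes)|(no))(?:\s|,|\.|$)')

text = 'yesterday i played the piano'
print(ungrouped.findall(text))   # [('yes', ''), ('', 'no')]
print(grouped.findall(text))     # []
```

The ungrouped pattern picks up `yes` from "yesterday" and `no` from "piano", because each literal is guarded on one side only.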


### Issue

The matching stops after the first match:

In [10]:
m = re_yn.match('yes or no')
if m: print(m.groups())

('yes', None)


Use the `findall` method instead of `match`:

In [11]:
# the alternatives are grouped so that both boundary checks apply to either literal
re_yn = re.compile(r'(?:\s|^)(?:(yes|yeah)|(no|not really))(?:\s|,|\.|$)')
ts = re_yn.findall('yes or no')
for t in ts: print(t)

('yes', '')
('', 'no')


### Regex Matcher

Write a function that returns, for each group in the regular expression, the first literal matched in the input string (or `None` if the group never matches):

In [12]:
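from typing import List, Optional
from re import Pattern

def regex_matcher(regex: Pattern, instr: str) -> List[Optional[str]]:
    # one slot per capturing group; None means the group never matched
    ts = [None] * regex.groups

    for t in regex.findall(instr):
        # findall returns plain strings (not tuples) when the pattern has one group
        if isinstance(t, str): t = [t]
        for i, literal in enumerate(t):
            # keep only the first non-empty literal found for each group
            if ts[i] is None and literal:
                ts[i] = literal

    return ts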

In [13]:
print(regex_matcher(re_yn, 'yes, I am'))
print(regex_matcher(re_yn, 'no I am not'))
print(regex_matcher(re_yn, 'yes or no'))
print(regex_matcher(re_yn, 'I have an iphone'))

['yes', None]
[None, 'no']
['yes', 'no']
[None, None]
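
The `isinstance(t, str)` check exists because `findall` returns plain strings instead of tuples when the pattern has a single group. A minimal sketch with a hypothetical one-group pattern `re_greet` (the function is repeated so that the example runs on its own):

```python
import re
from typing import List, Optional

def regex_matcher(regex: re.Pattern, instr: str) -> List[Optional[str]]:
    # Same logic as above, repeated so this sketch is self-contained.
    ts: List[Optional[str]] = [None] * regex.groups
    for t in regex.findall(instr):
        if isinstance(t, str): t = [t]   # one group: findall yields plain strings
        for i, literal in enumerate(t):
            if ts[i] is None and literal:
                ts[i] = literal
    return ts

# Hypothetical single-group pattern.
re_greet = re.compile(r'(?:\s|^)(hi|hello)(?:\s|,|\.|$)')

print(re_greet.findall('hello there'))            # ['hello']
print(regex_matcher(re_greet, 'hello there'))     # ['hello']
print(regex_matcher(re_greet, 'hi and hello'))    # ['hi']
print(regex_matcher(re_greet, 'good morning'))    # [None]
```

Only the first literal per group is kept, so `'hi and hello'` yields `['hi']`.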


### Response: Phone Model

The `yes/no` response can be followed by the user's specific phone model.

In [53]:
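# the alternatives are grouped so that both boundary checks apply to either literal;
# (?=...) is a lookahead: it checks the delimiter without consuming it,
# so two adjacent words such as 'google pixel' can both match
re_phone = re.compile(r'(?:\s|^)(?:(apple|google|samsung)|(iphone|pixel|galaxy|android))(?=\s|,|\.|$)')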

print(regex_matcher(re_phone, 'yes I have an iphone'))
print(regex_matcher(re_phone, 'yes I have google pixel'))
print(regex_matcher(re_phone, 'yes I have a galaxy phone'))

[None, 'iphone']
['google', 'pixel']
[None, 'galaxy']


### Put Together

Create the [SimpleNamespace](https://docs.python.org/3/library/types.html#types.SimpleNamespace) `res` containing all regular expressions:

In [15]:
from types import SimpleNamespace

res = SimpleNamespace()
res.re_yn = re.compile(r'(?:\s|^)(?:(yes|yeah)|(no|not really))(?:\s|,|\.|$)')
# the lookahead (?=...) leaves the delimiter unconsumed so that adjacent words can both match
res.re_phone = re.compile(r'(?:\s|^)(?:(apple|google|samsung)|(iphone|pixel|galaxy|android))(?=\s|,|\.|$)')

Write a function to handle `Turn 0`:

In [16]:
from typing import Any

def turn_0(res: SimpleNamespace):
    s = 'S: are you using a smartphone?'
    u = input(s + '\nU: ')

    yn = regex_matcher(res.re_yn, u)
    phone = regex_matcher(res.re_phone, u)
    res.in_phone_company = phone[0]
    res.in_phone_name = phone[1]

    if any(phone):
        turn_1a(res)
    elif yn[0]:
        turn_1b(res)
    elif yn[1]:
        turn_1c(res)

    print('S: good bye!')

def turn_1a(res: SimpleNamespace):
    p = res.in_phone_name if res.in_phone_name else res.in_phone_company
    s = 'S: how long have you been using {}?'.format(p)
    u = input(s + '\nU: ')

def turn_1b(res: SimpleNamespace):
    s = 'S: what kind of smartphone do you have?'
    u = input(s + '\nU: ')
    # TODO: to be filled

def turn_1c(res: SimpleNamespace):
    s = 'S: really? you should consider getting one.'
    print(s)

In [17]:
turn_0(res)

S: are you using a smartphone?
U: yes, i have an iphone
S: how long have you been using iphone?
U: about 2 years
S: good bye!
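
Because `turn_0` reads from `input`, its branching is awkward to check automatically. One possible way to test it, sketched below, is to isolate the decision in a pure function. The helper `next_turn` is hypothetical and not part of the code above; it only mirrors the branch order of `turn_0`:

```python
from typing import List, Optional

def next_turn(yn: List[Optional[str]], phone: List[Optional[str]]) -> Optional[str]:
    """Return the name of the follow-up turn, mirroring the branches in turn_0."""
    if any(phone):
        return 'turn_1a'   # a company or a phone name was mentioned
    if yn[0]:
        return 'turn_1b'   # "yes" without a phone
    if yn[1]:
        return 'turn_1c'   # "no"
    return None            # nothing recognised: the dialogue ends

print(next_turn(['yes', None], [None, 'iphone']))   # turn_1a
print(next_turn(['yes', None], [None, None]))       # turn_1b
print(next_turn([None, 'no'], [None, None]))        # turn_1c
print(next_turn([None, None], [None, None]))        # None
```

Note that a phone mention takes priority over the yes/no answer.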


## Turn 1

Given the following question:

> S1: how long have you been using iphone?

We expect date information from the user.

### Response: Duration

Write a regular expression that captures the duration:

In [18]:
res.re_duration = re.compile(r'(?:\s|^)(\d+)(?:\s|-)+(month|year)')

print(regex_matcher(res.re_duration, 'about 2 years'))
print(regex_matcher(res.re_duration, 'over 6-months'))
print(regex_matcher(res.re_duration, 'almost 1 - year'))

['2', 'year']
['6', 'month']
['1', 'year']


Infer the year and the month in which the user got the phone from the duration:

In [41]:
duration = regex_matcher(res.re_duration, '6 months')
res.in_phone_year, res.in_phone_month = None, 1
curr_year, curr_month = 2020, 1

if all(duration):
    d = int(duration[0])
    m = duration[1]
    if m == 'year':
        res.in_phone_year = curr_year - d
    elif m == 'month':
        res.in_phone_year = curr_year - int(d / 12)
        res.in_phone_month = curr_month - (d % 12)
        if res.in_phone_month <= 0:
            month_diff = abs(res.in_phone_month)
            res.in_phone_month = 12 - month_diff
            res.in_phone_year -= 1

print(res.in_phone_year, res.in_phone_month)

2019 7
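
The month arithmetic can also be written without the wrap-around branch by counting months from year 0. This is an alternative sketch, not the code used in this notebook; `subtract_months` is a hypothetical helper:

```python
from typing import Tuple

def subtract_months(year: int, month: int, n: int) -> Tuple[int, int]:
    """Return the (year, month) that lies n months before the given one."""
    total = year * 12 + (month - 1) - n   # months since year 0, zero-based month
    y, m = divmod(total, 12)
    return y, m + 1

print(subtract_months(2020, 1, 6))    # (2019, 7)
print(subtract_months(2020, 1, 12))   # (2019, 1)
print(subtract_months(2020, 3, 2))    # (2020, 1)
```

For 6 months before January 2020 this gives the same result as the cell above.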


### Response: From Date

Alternatively, the user may respond with a start date:

In [52]:
res.re_from_date = re.compile(r'(?:\s|^)(?:since|from)\s(?:(january|february|march|april|may|june|july|august|[sS]eptember|october|november|december)\s)?(\d{2,4})')

print(regex_matcher(res.re_from_date, 'since September 2018'))
print(regex_matcher(res.re_from_date, 'from 2017'))

['September', '2018']
[None, '2017']


Create a dictionary that maps month names to their corresponding numbers:

In [44]:
res.d_month_to_number = {
    month: i for i, month in enumerate(
        ['january','february','march','april','may','june',
         'july','august','september','october','november','december'], start=1)}
res.d_month_to_number

{'january': 1,
 'february': 2,
 'march': 3,
 'april': 4,
 'may': 5,
 'june': 6,
 'july': 7,
 'august': 8,
 'september': 9,
 'october': 10,
 'november': 11,
 'december': 12}

Store the year and the month in which the user got the phone:

In [46]:
from_date = regex_matcher(res.re_from_date, 'since september 2018')
res.in_phone_year, res.in_phone_month = None, 1

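if any(from_date):
    res.in_phone_year = int(from_date[1])
    # treat a two-digit year as 20xx
    if res.in_phone_year <= 20: res.in_phone_year += 2000
    # lowercase the month so that 'September' also finds its key
    res.in_phone_month = res.d_month_to_number[from_date[0].lower()] if from_date[0] else 1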

print(res.in_phone_year, res.in_phone_month)

2018 9
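
The two steps (matching the start date and converting it to numbers) can be combined in one helper. The sketch below is one possible way to do this; `MONTHS`, `MONTH_TO_NUMBER`, `RE_FROM`, and `parse_from_date` are hypothetical names, and the use of `re.IGNORECASE` to accept any capitalisation is an assumption, not something the notebook does:

```python
import re
from typing import Optional, Tuple

MONTHS = ['january', 'february', 'march', 'april', 'may', 'june',
          'july', 'august', 'september', 'october', 'november', 'december']
MONTH_TO_NUMBER = {name: i for i, name in enumerate(MONTHS, start=1)}

# re.IGNORECASE lets "September" and "september" both match.
RE_FROM = re.compile(
    r'(?:\s|^)(?:since|from)\s(?:(' + '|'.join(MONTHS) + r')\s)?(\d{2,4})',
    re.IGNORECASE)

def parse_from_date(text: str) -> Optional[Tuple[int, int]]:
    m = RE_FROM.search(text)
    if not m:
        return None
    year = int(m.group(2))
    if year <= 20:            # two-digit year, same cutoff as the cell above
        year += 2000
    # the month is optional; default to January when it is missing
    month = MONTH_TO_NUMBER[m.group(1).lower()] if m.group(1) else 1
    return year, month

print(parse_from_date('since September 2018'))   # (2018, 9)
print(parse_from_date('from 18'))                # (2018, 1)
print(parse_from_date('about 2 years'))          # None
```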


Finally, predict the iPhone model from the year and the month:

In [23]:
res.d_iphone = {
    2019: [(9, ['11', '11 pro', '11 pro max'])], 
    2018: [(9, ['10s', '10s max'])], 
    2017: [(11, ['10']), (9, ['8', '8 plus'])], 
    2016: [(9, ['7', '7 plus'])], 
    2015: [(9, ['6s', '6s plus'])], 
    2014: [(9, ['6', '6 plus'])]}

r = res.d_iphone.get(res.in_phone_year, None)
if r:
    # fall back to the first release listed for that year if none precedes the month
    v = next((models for month, models in r if month <= res.in_phone_month), r[0][1])
    res.in_phone_version = ' or '.join(v)
    s = 'S: oh, are you using iphone {}?'.format(res.in_phone_version)
    u = input(s + '\nU: ')

S: oh, are you using iphone 10s or 10s max?
U: no, i'm using iphone 6s plus
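
Calling `next` on a generator that yields nothing raises `StopIteration`, which happens here whenever no release in the given year precedes the given month. Passing a second argument to `next` supplies a default instead. The sketch below uses a trimmed copy of the table and a hypothetical helper `guess_models`; falling back to the first release listed for the year is an assumption made for this example:

```python
from typing import Dict, List, Optional, Tuple

# Trimmed copy of the release table: year -> [(release month, models)], newest first.
RELEASES: Dict[int, List[Tuple[int, List[str]]]] = {
    2018: [(9, ['10s', '10s max'])],
    2017: [(11, ['10']), (9, ['8', '8 plus'])],
}

def guess_models(year: int, month: int) -> Optional[List[str]]:
    releases = RELEASES.get(year)
    if not releases:
        return None
    # The second argument to next() is returned when the generator is empty,
    # so no StopIteration is raised.
    return next((models for m, models in releases if m <= month), releases[0][1])

print(guess_models(2018, 9))    # ['10s', '10s max']
print(guess_models(2017, 10))   # ['8', '8 plus']
print(guess_models(2018, 1))    # ['10s', '10s max']
print(guess_models(2013, 1))    # None
```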


### Put Together

Write a function to handle `Turn 1a`:

In [47]:
def turn_1a(res: SimpleNamespace):
    p = res.in_phone_name if res.in_phone_name else res.in_phone_company
    s = 'S: how long have you been using {}?'.format(p)
    u = input(s + '\nU: ')

    duration = regex_matcher(res.re_duration, u)
    from_date = regex_matcher(res.re_from_date, u)

    res.in_phone_year, res.in_phone_month = None, 1
    curr_year, curr_month = 2020, 1

    if all(duration):
        d = int(duration[0])
        m = duration[1]
        if m == 'year':
            res.in_phone_year = curr_year - d
        elif m == 'month':
            res.in_phone_year = curr_year - int(d / 12)
            res.in_phone_month = curr_month - (d % 12)
            if res.in_phone_month <= 0:
                month_diff = abs(res.in_phone_month)
                res.in_phone_month = 12 - month_diff
                res.in_phone_year -= 1
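    elif any(from_date):
        res.in_phone_year = int(from_date[1])
        # treat a two-digit year as 20xx, as in the cell above
        if res.in_phone_year <= 20: res.in_phone_year += 2000
        res.in_phone_month = res.d_month_to_number[from_date[0].lower()] if from_date[0] else 1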

    if res.in_phone_year:
        if res.in_phone_name == 'iphone' or res.in_phone_company == 'apple':
            turn_2a(res)

def turn_2a(res: SimpleNamespace):
    r = res.d_iphone.get(res.in_phone_year, None)
    res.in_phone_version = None
    
    if r:
        # fall back to the first release listed for that year if none precedes the month
        v = next((models for month, models in r if month <= res.in_phone_month), r[0][1])
        res.in_phone_version = v
        s = 'S: oh, are you using iphone {}?'.format(' or '.join(v))
        u = input(s + '\nU: ')
    else:
        s = 'S: which version of iphone is your model?'
        u = input(s + '\nU: ')

In [48]:
turn_0(res)

S: are you using a smartphone?
U: yes i have an iphone
S: how long have you been using iphone?
U: about 2 years
S: oh, are you using iphone 10s or 10s max?
U: no i'm using iphone 6s plus
S: good bye!


## Turn 2

Given the following question:

> S2: oh, are you using iphone 10s or 10s max?

We expect the user to respond with the specific version of the phone.

### Response: Phone Version

Write a regular expression to extract the version of iPhone:

In [49]:
res.re_iphone_version = re.compile(r'(?:\s|^)(?:iphone|version)\s(\d+s?(?: (?:plus|max))?)(?:\s|,|\.|$)')

print(regex_matcher(res.re_iphone_version, 'iphone 6s plus'))
print(regex_matcher(res.re_iphone_version, 'version 12'))

['6s plus']
['12']
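
The pattern above does not cover suffixes such as `pro` or `pro max`. A possible extension is sketched below; `re_version_ext` is a hypothetical name, and the ordering of the alternatives is the point of the example:

```python
import re

# Hypothetical extension: also accept "pro" and "pro max".
# "pro max" must be listed before "pro"; otherwise "11 pro" followed by a space
# would already satisfy the pattern and "max" would be dropped.
re_version_ext = re.compile(
    r'(?:\s|^)(?:iphone|version)\s(\d+s?(?: (?:plus|pro max|pro|max))?)(?:\s|,|\.|$)')

for text in ['iphone 11 pro max', 'iphone 11 pro',
             'i have an iphone 10s max.', 'version 12']:
    m = re_version_ext.search(text)
    print(m.group(1) if m else None)
# 11 pro max
# 11 pro
# 10s max
# 12
```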


Update `turn_2a` to read the version from the user response, and add a placeholder `turn_3a`:

In [50]:
def turn_2a(res: SimpleNamespace):
    r = res.d_iphone.get(res.in_phone_year, None)
    res.in_phone_version = None

    if r:
        # fall back to the first release listed for that year if none precedes the month
        v = next((models for month, models in r if month <= res.in_phone_month), r[0][1])
        res.in_phone_version = v[0]
        s = 'S: oh, are you using iphone {}?'.format(' or '.join(v))
        u = input(s + '\nU: ')

        yn = regex_matcher(res.re_yn, u)
        if yn[1]: res.in_phone_version = None
    else:
        s = 'S: which version of iphone is your model?'
        u = input(s + '\nU: ')

    version = regex_matcher(res.re_iphone_version, u)
    if version[0]: res.in_phone_version = version[0]
    if res.in_phone_version: turn_3a(res)

def turn_3a(res: SimpleNamespace):
    # TODO: to be filled
    old = 5
    s = 'S: iphone {} is about {} years old'.format(res.in_phone_version, old)
    print(s)

In [51]:
turn_0(res)

S: are you using a smartphone?
U: yes i have an iphone
S: how long have you been using iphone?
U: about 2 years
S: oh, are you using iphone 10s or 10s max?
U: no, i'm using iphone 6s plus
S: iphone 6s plus is about 5 years old
S: good bye!
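
The age in `turn_3a` is hard-coded for now. One way the `TODO` might be filled in, sketched here under the assumption that the age is simply the current year minus the release year, is to look the version up in the release table. `RELEASES`, `release_year`, and `phone_age` are hypothetical names, and the table is a trimmed copy:

```python
from typing import Dict, List, Optional, Tuple

# Trimmed copy of the release table: year -> [(release month, models)].
RELEASES: Dict[int, List[Tuple[int, List[str]]]] = {
    2018: [(9, ['10s', '10s max'])],
    2015: [(9, ['6s', '6s plus'])],
}

def release_year(version: str) -> Optional[int]:
    """Return the year in which the given version appears in RELEASES, if any."""
    for year, releases in RELEASES.items():
        if any(version in models for _, models in releases):
            return year
    return None

def phone_age(version: str, curr_year: int = 2020) -> Optional[int]:
    # curr_year defaults to 2020 to match the value used in turn_1a
    year = release_year(version)
    return curr_year - year if year is not None else None

print(phone_age('6s plus'))   # 5
print(phone_age('10s'))       # 2
print(phone_age('3g'))        # None
```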


## References

* https://www.regular-expressions.info/tutorial.html
* https://regex101.com