# Exercises for regular expressions in Python

In [1]:
import re

## usage of `re` function `finditer` and `findall`, notice the differences

In [118]:
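p = r'(\w+@(\w+\.)?\w+\.com)'      # raw string, so the backslashes reach the regex engine unchanged
m = re.finditer(p, 'seafood@hotmal.com foo@www.baidu.com abc@learning.co')
for item in m:                     # finditer returns an iterator of match objects, never None
    print(item.group(1))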

seafood@hotmal.com
foo@www.baidu.com


In [117]:
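p = r'(\w+@(?:\w+\.)?\w+\.com)'    # raw string, so the backslashes reach the regex engine unchanged
m = re.findall(p, 'seafood@hotmal.com foo@www.baidu.com abc@learning.co')
print(m)                           # findall returns a list (empty when nothing matches), never None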

['seafood@hotmal.com', 'foo@www.baidu.com']
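
The difference between the two functions is easier to see when the pattern has more than one capturing group. The sketch below reuses the sample string from the cells above; the two-group pattern and the variable names are only illustrative.

```python
import re

text = 'seafood@hotmal.com foo@www.baidu.com abc@learning.co'
pattern = r'(\w+)@((?:\w+\.)?\w+\.com)'   # two capturing groups: login and domain

# findall returns plain data: a list with one tuple of groups per match
found = re.findall(pattern, text)
print(found)

# finditer yields match objects lazily, which also expose the whole match and its position
whole = []
spans = []
for m in re.finditer(pattern, text):
    whole.append(m.group(0))
    spans.append(m.span())
print(whole, spans)
```

`findall` is convenient when only the captured text is needed; `finditer` is the better fit when positions or named groups matter, or when the input is large.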


## usage of `re` function `sub()` 
- use a raw string by adding `r` before the string
- use `\N` to reuse the subgroups defined by `()`

**Raw strings are used because the backslash escapes of ordinary strings (such as `\b` or `\1`) can conflict with the special sequences of regexes.**

In [194]:
m = re.sub(r'(\d{1,2})/(\d{1,2})/(\d{2,4})', r'\3/\2/\1', '17/8/999')
print(m)

999/8/17
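
A small sketch of the conflict that raw strings avoid: in an ordinary string `'\b'` is the backspace character, so the regex engine never sees the word-boundary escape. The sample sentence is made up for illustration.

```python
import re

# '\b' in an ordinary string is one character (backspace); r'\b' is a backslash plus 'b'
lengths = (len('\b'), len(r'\b'))
print(lengths)

plain = re.findall('\bthe\b', 'bite the dog')    # searches for backspace + 'the' + backspace
raw = re.findall(r'\bthe\b', 'bite the dog')     # searches for the whole word 'the'
print(plain, raw)
```

The same applies to the replacement string of `sub()`: without the `r` prefix, `'\1'` would be the character with code 1 instead of a reference to the first group.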


## Usage of `re` function `split()`
- similar to the string method `str.split()` but more powerful, because the separator can be a pattern

In [196]:
re.split(' ', "I don't want't a fight, I want to talk, ok?")

['I', "don't", "want't", 'a', 'fight,', 'I', 'want', 'to', 'talk,', 'ok?']
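
Splitting on a single space gives the same result as `str.split(' ')`. The extra power shows up once the separator is a pattern; the following sketch (sample strings are made up) splits on several delimiters at once, keeps a delimiter by capturing it, and limits the number of splits.

```python
import re

sentence = "I don't want a fight, I want to talk, ok?"

# split on any run of whitespace, commas or question marks; drop the empty trailing piece
words = [w for w in re.split(r'[\s,?]+', sentence) if w]
print(words)

# a capturing group in the separator keeps the separator in the result
kept = re.split(r'(,) ', 'a, b')
print(kept)

# maxsplit limits how many splits are performed
limited = re.split(r',\s*', 'a, b, c', maxsplit=1)
print(limited)
```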

If we have data of addresses and ZIP codes, we can use `split()` to extract exactly the pieces we want.

In [197]:
data = (
    'Mountain View, CA 94040',
    'Sunnyvale, CA',
    'Los Altos, 94023',
    'Cupertino 95014',
    'Palo Alto CA',
)

In [206]:
real_data = list()
for item in data:
    # split on ', ' or on a space that is followed by a 5-digit ZIP code or a 2-letter state
    real_data.append(re.split(r', |(?= (?:\d{5}|[A-Z]{2})) ', item))

In [207]:
print(real_data)

[['Mountain View', 'CA', '94040'], ['Sunnyvale', 'CA'], ['Los Altos', '94023'], ['Cupertino', '95014'], ['Palo Alto', 'CA']]


## usage of Extension Notation `(?...)`
### usage of `re` function flags with `(?imxsuL)`
- use `(?i)` for the flag `re.I`/`re.IGNORECASE`
- use `(?m)` for the flag `re.M`/`re.MULTILINE`
- use `(?x)` for the flag `re.X`/`re.VERBOSE`
- use `(?s)` for the flag `re.S`/`re.DOTALL`: the special char `.` normally matches any single char except `\n`; with this flag it matches `\n` as well
- use `(?L)` for the flag `re.L`/`re.LOCALE`: matching via `\w`, `\W`, `\b`, `\B`, `\s`, `\S` then depends on the locale

In [120]:
re.findall(r'(?i)th\w+', 'The fastest way to get through there is the best way')

['The', 'through', 'there', 'the']
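
An inline flag behaves like the corresponding `flags` argument. The sketch below uses a made-up two-line string to compare both spellings of `re.I` and to show what `(?s)` and `(?m)` change.

```python
import re

text = 'The fastest way\nto get through there'

# (?i) inside the pattern and flags=re.IGNORECASE give the same result
inline = re.findall(r'(?i)th\w+', text)
by_arg = re.findall(r'th\w+', text, flags=re.IGNORECASE)
print(inline, by_arg)

# without (?s) the dot cannot cross the newline between 'way' and 'to'
no_dotall = re.search(r'way.to', text)
dotall = re.search(r'(?s)way.to', text)
print(no_dotall, dotall.group())

# (?m) makes ^ match at the start of every line, not only at the start of the string
line_starts = re.findall(r'(?m)^\w+', text)
print(line_starts)
```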

In [121]:
re.findall(r'(?m)^\s+((?![a-c])(?:\w+)@(?:\w+\.\w+))',
            """
                a@b.com
                b@c.com
                test@c.com
                feaset@d.com
                come@b.com
            """)

['test@c.com', 'feaset@d.com']

In [122]:
re.findall(r"""(?mx)^\s+(
            # reject emails that start with a char in a-c
            (?![a-c])
            # match a normal email without storing its parts in subgroups
            (?:\w+)@(?:\w+\.\w+))""",
            """
                a@b.com
                b@c.com
                test@c.com
                feaset@d.com
                come@b.com
            """)

['test@c.com', 'feaset@d.com']

In [123]:
['%s@aw.com' % e.group(1) for e in re.finditer(r'(?m)^\s+(?!noreply|postmaster)(\w+)','''
    sales@a.com
    postmaster@a.com
    noreply@a.com
    admin@a.com
''')]

['sales@aw.com', 'admin@aw.com']

### usage of `(?:...)`
- this notation groups parts of a regex without capturing them
- because the matched subgroup is not saved, it is useful whenever we only care about the whole match, e.g. `findall` with a pattern that contains capturing groups returns the subgroups instead of the whole match

In [124]:
re.findall(r'http://(\w+\.)*(\w+\.com)', 'http://www.baidu.com http://www.yahoo.com http:www.leicom.com')

[('www.', 'baidu.com'), ('www.', 'yahoo.com')]

We don't want the matches of the subgroup `(\w+\.)` stored in the result list, so we use `(?:...)` instead.

In [125]:
re.findall(r'http://(?:\w+\.)*(\w+\.com)', 'http://www.baidu.com http://www.yahoo.com http:www.leicom.com')

['baidu.com', 'yahoo.com']

If we just want the whole URL, we drop the remaining capturing group as well:

In [136]:
re.findall(r'http://(?:\w+\.)*\w+\.com', 'http://www.baidu.com http://www.yahoo.com http:www.leicom.com')

['http://www.baidu.com', 'http://www.yahoo.com']

### usage of `(?P<name>...)` and `(?P=name)`
- `name` is the identifier of the subgroup; it can be referenced as `\g<name>` in a replacement string and as `(?P=name)` inside the pattern itself

In [131]:
re.sub(r'(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{2,4})', r'\g<year>/\g<month>/\g<day>', '17/8/2018')

'2018/8/17'
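
The cell above only uses `\g<name>` in the replacement string. As a small sketch of the other two uses, `(?P=name)` refers back to a named group inside the pattern, and `groupdict()` returns all named groups of a match at once. The sample sentence is made up.

```python
import re

# (?P=word) must match the same text that the group named "word" matched
doubled = re.search(r'\b(?P<word>\w+) (?P=word)\b', 'Paris in the the spring')
print(doubled.group('word'), doubled.span())

# groupdict() maps every group name to the text it matched
m = re.match(r'(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{2,4})', '17/8/2018')
parts = m.groupdict()
print(parts)
```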

### usage of `(?=)` and `(?!)`
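- `(?=...)` is a positive lookahead: it succeeds only if `...` matches at the current position
- `(?!...)` is a negative lookahead: it succeeds only if `...` does not match at the current position
- neither notation consumes characters or stores what it matched

Find the first names of all persons whose last name is West: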

In [139]:
re.findall(r'\w+(?= West)', 
           '''
           Jaylen West
           Bob Blue
           Cat Meat
           ''')

['Jaylen']

Find the first names of all persons whose last name is not West:

In [173]:
re.findall(r'(?m)(\w+) (?!West)', 
           '''
           Jaylen West
           Bob Blue
           Cat Meat
           ''')

['Bob', 'Cat']

In [192]:
m = re.match(r'(?:(x))(?(1)y|x)', 'xy')
m.group(1)

'x'

In [193]:
re.search(r'(?:(x)|y)(?(1)x)', 'xx')

<_sre.SRE_Match object; span=(0, 2), match='xx'>
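
The two cells above use the conditional form `(?(id)yes|no)`: if group `id` took part in the match, the `yes` branch must match, otherwise the `no` branch (which may be omitted). A slightly more practical sketch is an email address that may be wrapped in angle brackets; the addresses are made up for illustration.

```python
import re

# if group 1 (the opening '<') matched, require a closing '>', otherwise require the end of the string
email = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

results = {}
for candidate in ('<user@host.com>', 'user@host.com', '<user@host.com', 'user@host.com>'):
    m = email.match(candidate)
    results[candidate] = m.group(2) if m else None
print(results)
```

Only the balanced forms are accepted: a lone opening or closing bracket makes the whole match fail.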

## An Example Using Regular Expressions
In this example, we will process the output of tasklist.exe on Windows. The output can be saved to a file with
> tasklist.exe > tasklist

In [1]:
import os
import re

We can use `open()` to read the file tasklist generated by tasklist.exe, but we can also use `os.popen()` to take the output of tasklist.exe as the input directly.
> **the `/nh` option suppresses the column headers**
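
Since tasklist.exe exists only on Windows, the pattern from the next cell can also be tried against a few hand-written lines. The sample lines below are hypothetical: their column layout (image name, PID, session name, session number, memory usage) is an assumption modeled on the results shown in this section.

```python
import re

# hypothetical lines shaped like `tasklist /nh` output
sample = '''
System Idle Process              0 Services                   0          8 K
smss.exe                       444 Services                   0        912 K
csrss.exe                      624 Services                   0      3,956 K
'''

# image name (words separated by single spaces), PID, then memory usage
row = re.compile(r'([\w.]+(?: [\w.]+)*)\s\s+(\d+) \w+\s\s+\d+\s\s+([\d,]+ K)')

rows = []
for line in sample.splitlines():
    m = row.search(line)
    if m:                          # the blank first line does not match
        rows.append(m.groups())
print(rows)
```

Requiring at least two whitespace characters (`\s\s+`) between columns is what lets an image name such as `System Idle Process` contain single spaces.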

In [10]:
with os.popen('tasklist /nh', 'r') as f:
    for eachline in f:
        print(re.findall(r'([\w.]+(?: [\w.]+)*)\s\s+(\d+) \w+\s\s+\d+\s\s+([\d,]+ K)',eachline.rstrip()))

[]
[('System Idle Process', '0', '8 K')]
[('System', '4', '13,712 K')]
[('Registry', '120', '6,916 K')]
[('smss.exe', '444', '912 K')]
[('csrss.exe', '624', '3,956 K')]
[('wininit.exe', '732', '4,708 K')]
[('csrss.exe', '744', '4,428 K')]
[('services.exe', '812', '8,668 K')]
[('lsass.exe', '832', '15,236 K')]
[('svchost.exe', '944', '2,988 K')]
[('svchost.exe', '972', '30,372 K')]
[('fontdrvhost.exe', '992', '2,496 K')]
[('WUDFHost.exe', '1004', '5,104 K')]
[('svchost.exe', '584', '15,128 K')]
[('winlogon.exe', '552', '6,808 K')]
[('svchost.exe', '1052', '6,496 K')]
[('fontdrvhost.exe', '1108', '5,968 K')]
[('svchost.exe', '1200', '4,896 K')]
[('svchost.exe', '1276', '11,788 K')]
[('svchost.exe', '1288', '6,676 K')]
[('svchost.exe', '1304', '8,724 K')]
[('svchost.exe', '1392', '4,924 K')]
[('svchost.exe', '1452', '5,624 K')]
[('svchost.exe', '1516', '6,480 K')]
[('svchost.exe', '1600', '14,632 K')]
[('WUDFHost.exe', '1708', '5,840 K')]
[('svchost.exe', '1760', '10,680 K')]
[('dwm.exe',

### Second Example: Data Generator

In [11]:
from random import randrange, choice
from string import ascii_lowercase as lc
from time import ctime

In [170]:
tlds = ('com', 'edu', 'net', 'org', 'gov')

with open('data', 'a') as f:                 # open file data in append mode
    for i in range(randrange(5, 11)):        # write between 5 and 10 lines
        dtint = randrange(2**32)             # pick a random timestamp in seconds, below 2**32
        dtstr = ctime(dtint)                 # convert dtint (in seconds) to a local time string
        llen = randrange(4, 8)               # login name is shorter
        login = ''.join(choice(lc) for j in range(llen))
        dlen = randrange(llen, 13)           # domain is longer
        dom = ''.join(choice(lc) for j in range(dlen))
        print('%s::%s@%s.%s::%d-%d-%d' % (dtstr, login, dom, choice(tlds), dtint, llen, dlen), file=f)

Now let's play with the generated data using more fun and useful regexes.
- `+`, `*` and `?` are all greedy, which means they match as much as possible; adding `?` after any of them (`+?`, `*?`, `??`) means "don't be greedy", that is, match as little as possible
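
A minimal sketch of the difference, using a hypothetical line tail in the generator's format (the trailing numbers are taken from the output shown below, the email part is made up):

```python
import re

line = 'abcd@efghij.com::3076032614-6-7'

# greedy: .+ grabs as much as it can and gives back only what the rest of the pattern needs
greedy = re.match(r'.+(\d+-\d+-\d+)', line).group(1)

# non-greedy: .+? grabs as little as it can, so the group gets the whole first number
lazy = re.match(r'.+?(\d+-\d+-\d+)', line).group(1)
print(greedy, lazy)

# the same effect on markup-like text
tags_greedy = re.findall(r'<.+>', '<b>bold</b>')
tags_lazy = re.findall(r'<.+?>', '<b>bold</b>')
print(tags_greedy, tags_lazy)
```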

In [77]:
with open('data', 'r') as f:
    strict_pattern = r'^(Mon|Tue|Wed|Thu|Fri|Sat|Sun)'       # match only valid day names
    loose_pattern = r'^(\w{3})'                              # match any three word chars
    digits_filter = r'\d+-\d+-\d+'                           # the three numbers at the end of each line
    mid_digit_filter = r'-(\d+)-'                            # the middle number only
    for eachline in f:
        print(re.match(loose_pattern, eachline).group())
        print(re.search(digits_filter, eachline).group())
        print(re.match(r'.+(\d+-\d+-\d+)', eachline).group(1))       # greedy .+ leaves a single digit for the first \d+
        print(re.match(r'.+?(\d+-\d+-\d+)', eachline).group(1))      # non-greedy .+? leaves the whole first number
        print(re.search(mid_digit_filter, eachline).group(1))

Thu
3076032614-6-7
4-6-7
3076032614-6-7
6
Mon
2468546424-5-8
4-5-8
2468546424-5-8
5
Tue
1323747946-6-6
6-6-6
1323747946-6-6
6
Sun
4014340941-7-11
1-7-11
4014340941-7-11
7
Wed
1724840313-4-8
3-4-8
1724840313-4-8
4
Mon
958973307-7-11
7-7-11
958973307-7-11
7


### Processing Credit Card Number
Verify credit card numbers with 15 or 16 digits, written with or without hyphens.

In [165]:
def verify_cc_num(cc_num):
    """
    Verify a credit card number
    
    Arguments:
    cc_num -- a string representing a 15- or 16-digit credit card number, with or without hyphens
    
    Returns:
    flag -- a boolean which is True when the credit card number is verified, otherwise False
    """
    
    pattern = r'''(?x)
          ([0-9]{15,16})|                 # digits only, without hyphens
          ([0-9]{5}-[0-9]{6}-[0-9]{4})|   # 15 digits grouped by hyphens
          ((?:[0-9]{4}-){3}[0-9]{4})      # 16 digits grouped by hyphens
              '''
    
    # fullmatch rejects trailing characters that match() would silently accept
    if re.fullmatch(pattern, cc_num) is None:
        return False
    
    return validate_cc_num(cc_num.replace('-', ''))   # remove hyphens, then run the checksum


def validate_cc_num(cc_num):
    """
    Validate a credit card number using the modulus 10 algorithm
    
    Arguments:
    cc_num -- a string representing the credit card number with digits only
    
    Returns:
    flag -- a boolean which is True when the credit card number is validated
    """
    
    # modulus 10 algorithm: starting from the rightmost digit of the card number, double every
    # second digit; if the product is greater than 9, add its two digits together (the same as
    # subtracting 9 from the product); then sum all digits. The number is valid when the sum
    # modulo 10 equals zero.
    digits_sum = 0
    for i, char in enumerate(reversed(cc_num)):
        digit = int(char)
        if i % 2 == 1:               # every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        digits_sum += digit
    
    return digits_sum % 10 == 0

In [169]:
c = '5105105105105100'
verify_cc_num(c)

True
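
To see the arithmetic of the modulus 10 check, the sketch below lists what each digit of the number tested above contributes to the sum. The helper name `luhn_contributions` is hypothetical, and the second number is simply the first one with its last digit changed.

```python
def luhn_contributions(number):
    """Return the contribution of each digit to the modulus 10 sum, rightmost digit first."""
    contributions = []
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 1:        # every second digit, counting from the right
            digit *= 2
            if digit > 9:            # e.g. 5 * 2 = 10 -> 1 + 0 = 1 = 10 - 9
                digit -= 9
        contributions.append(digit)
    return contributions


steps = luhn_contributions('5105105105105100')
print(steps, sum(steps), sum(steps) % 10 == 0)

# changing the last digit shifts the sum by one, so the check fails
altered = luhn_contributions('5105105105105101')
print(sum(altered), sum(altered) % 10 == 0)
```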

### Play with the Data Generated by the Data Generator

In [216]:
with open('data', 'r') as f:           # open the file data in read mode
    day_filter = r'^\w{3}'
    mon_filter = r' (\w{3})'
    year_filter = r' (\d+)::'
    date_filter = r'(^\w{3}.+\d)::'
    time_filter = r' (\d+:\d+:\d+) '
    email_filter = r'\w+@\w+\.\w+'
    year_mon_day_filter = r'(?P<day>^\w+) (?P<mon>\w+) .+ (?P<year>\d{4})::'
    login_domain_filter = r'::(?P<login>\w+)@(?P<domain>[\w.]+)::'
    day_counts = dict()
    mon_counts = dict()
    for eachline in f:
#         print(re.match(date_filter, eachline).group(1)) # extract the time stamp in each line
#         print(re.search(email_filter, eachline).group())  # extract the email address
#         print(re.search(mon_filter, eachline).group(1))    # extract the months
#         print(re.search(year_filter, eachline).group(1))   # extract the years
#         print(re.search(time_filter, eachline).group(1))   # extract the time only
#         print(re.search(login_domain_filter, eachline).group('login', 'domain'))
#         print(re.sub(email_filter, 'jaylen_west@foxmail.com', eachline))
        print(re.search(year_mon_day_filter, eachline).group('year', 'mon', 'day'))
        key = re.search(mon_filter, eachline).group(1)       # group(1) leaves out the leading space
        mon_counts[key] = mon_counts.get(key, 0) + 1
    print(mon_counts)

('2067', 'Jun', 'Thu')
('2048', 'Mar', 'Mon')
('2011', 'Dec', 'Tue')
('2097', 'Mar', 'Sun')
('2024', 'Aug', 'Wed')
('2000', 'May', 'Mon')
('2029', 'Mar', 'Wed')
('2043', 'Jun', 'Fri')
('2007', 'Dec', 'Sat')
('2062', 'Aug', 'Wed')
('2048', 'Mar', 'Tue')
('2011', 'Feb', 'Tue')
('2043', 'Aug', 'Wed')
('2075', 'Dec', 'Mon')
{'Jun': 2, 'Mar': 4, 'Dec': 3, 'Aug': 3, 'May': 1, 'Feb': 1}
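
Instead of one small filter per field, a whole line can also be parsed with a single pattern of named groups. This is only a sketch: the three lines are hypothetical samples in the generator's `ctime::login@domain.tld::timestamp-llen-dlen` layout, and `collections.Counter` stands in for the `dict.get` counting used above.

```python
import re
from collections import Counter

# hypothetical lines in the layout written by the data generator
lines = [
    'Thu Jul 22 19:21:19 2004::izsp@dicqdhytvhv.edu::1090549279-4-11',
    'Sun Jul 13 22:42:11 2008::zqeu@sudcqfoklffi.com::1216014131-4-12',
    'Sat May  5 16:36:23 1990::fclihw@alwdbzpsdeg.edu::641950583-6-11',
]

record = re.compile(
    r'^(?P<day>\w{3}) (?P<mon>\w{3})\s+(?P<date>\d{1,2}) (?P<time>[\d:]+) (?P<year>\d{4})'
    r'::(?P<login>\w+)@(?P<domain>[\w.]+)'
    r'::(?P<stamp>\d+)-\d+-\d+$'
)

first = record.match(lines[0]).groupdict()
print(first)

# count how many lines fall into each month
mon_counts = Counter(record.match(line).group('mon') for line in lines)
print(mon_counts)
```

The `\s+` before the day of the month matters because `ctime` pads single-digit days with an extra space.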


### Processing the Phone Numbers

In [230]:
def verify_phone_num(phone_num):
    """
    Verify a phone number with or without a 3-digit area code
    
    Arguments:
    phone_num -- a string representing a phone number
    
    Returns:
    flag -- a boolean which is True when phone_num is verified, otherwise False
    """
    
    # optional area code written as 900- or (900), followed by the 7-digit number
    phone_num_filter = r'(?:\d{3}-|\(\d{3}\))?\d{3}-\d{4}'
    # fullmatch also rejects trailing characters after the number
    flag = re.fullmatch(phone_num_filter, phone_num) is not None
    
    return flag

In [231]:
verify_phone_num('(900)555-5121')

True

### HTML Generation
This part takes a list of links (each with an optional short description) and generates a Web page (.html) that includes all links as hypertext anchors. If a short description is provided, it is used as the hypertext instead of the URL.

In [315]:
def generate_html(links, name):
    """
    Generate a Web page with hypertext anchors extracted from links; when a short description of a 
    link is provided, use that as the hypertext.
    
    Arguments:
    links -- a list of links (with optional short description)
    name -- the file name under which the Web page (.html) is stored
    
    Returns:
    flag -- a boolean of whether at least one link was written
    """
    
    url_filter = r'(?P<url>^((https?://)|(www))[.\w/]+)(::)?(?P<dsrp>(.+))?'
    flag = False
    
    with open(name, 'w', encoding='UTF-8') as f:
        f.write('<html>\n')
        for item in links:
            match = re.match(url_filter, item)
            if match is None:                            # skip lines that do not start with a URL
                continue
            url, dsrp = match.group('url', 'dsrp')
            flag = True
            text = dsrp if dsrp is not None else url     # fall back to the URL as the hypertext
            f.write('<a href="' + url + '">' + text + '</a><br/><hr/>\n')
        f.write('</html>\n')
                
    return flag

In [316]:
links = list()
with open('url_data.txt', 'r', encoding='UTF-8') as f:
    for eachline in f:
        links.append(eachline)

print(links)
generate_html(links, 'links.html')

['https://www.baidu.com::百度\n', 'http://xueshu.baidu.com/::百度学术\n', 'http://www.w3school.com.cn/::W3C school\n', 'http://www.dilidili.wang/::DiliDili 动漫\n', 'https://www.163.com/::网易主页\n', 'https://www.qq.com\n', 'www.baidu.com']


True

In [330]:
d = dict(zip(range(5),range(5,10)))
print(d.keys())
c = dict(one=5, two=6, three=7, four=8, five=9)

dict_keys([0, 1, 2, 3, 4])


### Books Tracker
Track the rankings of favorite books at online booksellers.

In [20]:
from bs4 import BeautifulSoup as bs
import urllib3
import certifi
import lxml
import re

In [42]:
def get_html(url):
    """
    Acquire html from url address
    
    Arguments:
    url -- string
    
    Returns:
    html -- binary data of html src code
    """
    
    # send a User-Agent field in the HTTP request header to avoid being blocked by websites
    header = {
       'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                     'Chrome/58.0.3029.96 Safari/537.36'
    }
    http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())  # verify certificates with the certifi CA bundle
    resp = http.request('GET', url, headers=header)
    html = resp.data
    
    return html



In [56]:
url = 'https://www.douban.com/tag/%E5%B0%8F%E8%AF%B4/book'
html = get_html(url)

soup = bs(html, 'lxml' )
bookdiv = soup.find(attrs={'class': 'mod book-list'})
books = bookdiv.find_all(attrs={'class': 'title'})
ratings = bookdiv.find_all(attrs={'class': 'rating_nums'})

book_rating = zip(books, ratings)
with open('book_list.txt', 'w', encoding='UTF-8') as f:
    for book, rating in book_rating:
        print('Title: {}, Rating: {}'.format(book.string, rating.string), file=f)

In [55]:
with open('douban_books.html', 'w', encoding='UTF-8') as f:
    print(soup.prettify(), file=f)   # pretty-print with indentation and save to an HTML file

In [51]:
soup.find_all(attrs={'class':'title'})

[<a class="title" href="https://book.douban.com/subject/6518605/?from=tag_all" target="_blank">三体全集</a>,
 <a class="title" href="https://book.douban.com/subject/4913064/?from=tag_all" target="_blank">活着</a>,
 <a class="title" href="https://book.douban.com/subject/26954760/?from=tag_all" target="_blank">月亮与六便士</a>,
 <a class="title" href="https://book.douban.com/subject/1770782/?from=tag_all" target="_blank">追风筝的人</a>,
 <a class="title" href="https://book.douban.com/subject/25778492/?from=tag_all" target="_blank">彷徨</a>,
 <a class="title" href="https://book.douban.com/subject/26877597/?from=tag_all" target="_blank">一九八四（纪念版）</a>,
 <a class="title" href="https://book.douban.com/subject/27614904/?from=tag_all" target="_blank">房思琪的初恋乐园</a>,
 <a class="title" href="https://book.douban.com/subject/6720777/?from=tag_all" target="_blank">海明威短篇小说全集（上）</a>,
 <a class="title" href="https://book.douban.com/subject/26879778/?from=tag_all" target="_blank">杀死一只知更鸟</a>,
 <a class="title" href="https:/