## Regex Exercises

In [1]:
import re
import pandas as pd

### Write a function named `is_vowel`. It should accept a string as input and use a regular expression to determine if the passed string is a vowel. While not explicitly mentioned in the lesson, you can treat the result of `re.search` as a boolean value that indicates whether or not the regular expression matches the given string.



In [2]:
def is_vowel(s: str) -> bool:
    '''Takes a string and determines whether it is a single vowel
    (case-insensitive)
    '''
    # fullmatch requires the whole string to be exactly one vowel
    return bool(re.fullmatch(r'[aeiou]', s, re.IGNORECASE))
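
As a quick illustration of the boolean behaviour mentioned in the exercise: `re.search` returns a `Match` object or `None`, so wrapping the result in `bool()` gives `True` or `False`. The sketch below contrasts it with `re.fullmatch`, which only succeeds when the entire string matches. The helper names `contains_vowel` and `is_single_vowel` are hypothetical and exist only for this example.

```python
import re

def contains_vowel(s: str) -> bool:
    # re.search scans the whole string and returns a Match object or None
    return bool(re.search(r'[aeiou]', s, re.IGNORECASE))

def is_single_vowel(s: str) -> bool:
    # re.fullmatch succeeds only if the entire string is exactly one vowel
    return bool(re.fullmatch(r'[aeiou]', s, re.IGNORECASE))

print(contains_vowel('banana'))   # True
print(is_single_vowel('banana'))  # False
print(is_single_vowel('E'))       # True
```

The difference matters for this exercise: a string that merely contains a vowel is not itself a vowel.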

### Write a function named `is_valid_username` that accepts a string as input. A valid username starts with a lowercase letter, and only consists of lowercase letters, numbers, or the _ character. It should also be no longer than 32 characters. The function should return either `True` or `False` depending on whether the passed string is a valid username.

In [3]:
def is_valid_username(s: str) -> bool:
    '''Takes a string and checks that it starts with a lowercase letter,
    consists only of lowercase letters, numbers, or the _ character,
    and is no longer than 32 characters. Returns True or False.
    '''
    # one leading lowercase letter, then up to 31 more allowed characters
    regex = r'^[a-z][a-z0-9_]{0,31}$'
    return bool(re.search(regex, s))
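
One subtlety when anchoring with `$`: in Python it also matches just before a trailing newline, so a string such as `'codeup\n'` can pass a `^...$` check. The sketch below uses `re.fullmatch` to avoid that. `is_valid_username_strict` and `USERNAME_RE` are hypothetical names, and the pattern is one possible reading of the rules in the exercise.

```python
import re

# One lowercase letter followed by at most 31 lowercase letters, digits, or underscores
USERNAME_RE = re.compile(r'[a-z][a-z0-9_]{0,31}')

def is_valid_username_strict(s: str) -> bool:
    # fullmatch must consume the whole string, including any trailing newline
    return USERNAME_RE.fullmatch(s) is not None

print(is_valid_username_strict('codeup_123'))  # True
print(is_valid_username_strict('1codeup'))     # False
print(is_valid_username_strict('a' * 33))      # False
print(is_valid_username_strict('codeup\n'))    # False

# For comparison, `$` tolerates the trailing newline:
print(bool(re.search(r'^[a-z]+$', 'codeup\n')))  # True
```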

### Write a regular expression to capture phone numbers. It should match all of the following:
```
(210) 867 5309
+1 210.867.5309
867-5309
210-867-5309
```

In [4]:
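def capture_phone_number(s: str):
    '''Finds phone numbers in a string and returns a list of
    (country_code, area_code, phone1, phone2) tuples
    '''
    
    regex = r'''
    (?P<country_code>\+?1)?      # optional country code, with or without +
    .?                           # optional separator or opening parenthesis
    (?P<area_code>\d{3})?[\)]?   # optional area code and closing parenthesis
    .                            # any single separator character
    (?P<phone1>\d{3})            # first three digits
    .                            # any single separator character
    (?P<phone2>\d{4})            # last four digits
    '''
    
    pattern = re.compile(regex, re.VERBOSE)
    return pattern.findall(s)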

p1 = '(210) 867 5309'
p2 = '+1 210.867.5309'
p3 = '867-5309'
p4 = '210-867-5309'
test1 = f'''
This is some crazy test string that is
to capture a phone number {p1} {p1} {p2} {p3} {p4}
'''

match = capture_phone_number(test1)
match
df = pd.DataFrame([{'country_code': m[0],
            'area_code': m[1],
           'phone1': m[2],
           'phone2': m[3]} for m in match])
df

Unnamed: 0,country_code,area_code,phone1,phone2
0,,210.0,867,5309
1,,210.0,867,5309
2,1.0,210.0,867,5309
3,,,867,5309
4,,210.0,867,5309
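
Because the groups are named, the matches can also be turned into dictionaries directly with `finditer` and `groupdict`, which avoids indexing tuples by position. The sketch below is a stricter variant written for illustration, not the pattern used above: it assumes separators are limited to whitespace, dots, and hyphens. `PHONE_RE` and `parse_phone_numbers` are hypothetical names.

```python
import re

# Assumed, stricter variant: separators are limited to whitespace, dots,
# and hyphens instead of "any character".
PHONE_RE = re.compile(r'''
    (?:(?P<country_code>\+?1)[\s.-]?)?       # optional country code
    (?:\(?(?P<area_code>\d{3})\)?[\s.-]?)?   # optional area code, parentheses allowed
    (?P<phone1>\d{3})[\s.-]?                 # first three digits
    (?P<phone2>\d{4})                        # last four digits
''', re.VERBOSE)

def parse_phone_numbers(text: str) -> list:
    # finditer yields Match objects; groupdict maps group names to captured text
    return [m.groupdict() for m in PHONE_RE.finditer(text)]

samples = '(210) 867 5309, +1 210.867.5309, 867-5309, 210-867-5309'
for row in parse_phone_numbers(samples):
    print(row)
```

Groups that did not take part in a match come back as `None`, and a list of such dictionaries can be passed straight to `pd.DataFrame`.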


### Use regular expressions to convert the dates below to the standardized year-month-day format.

```
02/04/19
02/05/19
02/06/19
02/07/19
02/08/19
02/09/19
02/10/19
```

In [5]:
def capture_date(s: str):
    '''Captures each date and returns a list of (month, day, year) tuples
    '''
    
    regex = r'''
    (?P<month>\d{2})
    .
    (?P<day>\d{2})
    .
    (?P<year>\d{2})'''
    
    verb_item_pat = re.compile(regex, re.VERBOSE)
    return verb_item_pat.findall(s)

test1 = '''
02/04/19
02/05/19
02/06/19
02/07/19
02/08/19
02/09/19
02/10/19
'''
match = capture_date(test1)
match
df = pd.DataFrame([{'month': m[0],
            'day': m[1],
           'year': m[2]} for m in match])
df

Unnamed: 0,month,day,year
0,2,4,19
1,2,5,19
2,2,6,19
3,2,7,19
4,2,8,19
5,2,9,19
6,2,10,19
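
The exercise asks for the dates in year-month-day format, which the captured groups make possible with `re.sub`. This is a minimal sketch under one loud assumption: every two-digit year belongs to the 2000s, so `19` becomes `2019`. `DATE_RE` and `converted` are names introduced only for this example.

```python
import re

dates = '''02/04/19
02/05/19
02/06/19
02/07/19
02/08/19
02/09/19
02/10/19'''

DATE_RE = re.compile(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{2})')

# Assumption: two-digit years are in the 2000s, so "19" becomes "2019".
# \g<name> refers to a named group inside the replacement string.
converted = DATE_RE.sub(r'20\g<year>-\g<month>-\g<day>', dates)
print(converted)
```

With the input above, the first line becomes `2019-02-04` and the last becomes `2019-02-10`.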


### Write a regex to extract the various parts of these logfile lines:

```
GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
```

In [6]:
logfile = """GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58"""


In [7]:
def parse_logfile(s: str):
    '''Parses logfile text and returns a pandas DataFrame with one row per line
    '''
    # groups: method, path, [time], protocol, {status}, size, "user agent", ip
    regex = r'^(\w+)\s(.*?)\s\[(.*?)\]\s(.*?)\s\{(\d+)\}\s(\d+)\s"(.*?)"\s(.*)$'
    
    # re.MULTILINE lets ^ and $ anchor at every line of the logfile
    match = re.findall(regex, s, re.MULTILINE)
    return pd.DataFrame([{'method': m[0], 'path': m[1], 'time': m[2],
                    'protocol': m[3], 'status': m[4], 'size': m[5],
                    'user_agent': m[6], 'ip': m[7],
                        } for m in match])

df = parse_logfile(logfile)
df

Unnamed: 0,method,path,time,protocol,status,size,user_agent,ip
0,GET,/api/v1/sales?page=86,16/Apr/2019:193452+0000,HTTP/1.1,200,510348,python-requests/2.21.0,97.105.19.58
1,POST,/users_accounts/file-upload,16/Apr/2019:193452+0000,HTTP/1.1,201,42,User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; ...,97.105.19.58
2,GET,/api/v1/items?page=3,16/Apr/2019:193453+0000,HTTP/1.1,429,3561,python-requests/2.21.0,97.105.19.58
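
All captured values come back as strings. A possible next step, sketched below with the standard library only, is to name the groups and convert the numeric and time fields. The group names, `LOG_RE`, and `parse_line` are hypothetical, and the timestamp layout (day/month/year, then `HHMMSS` and a UTC offset) is an assumption read off the sample lines.

```python
import re
from datetime import datetime

LOG_RE = re.compile(
    r'^(?P<method>\w+)\s(?P<path>\S+)\s\[(?P<time>[^\]]+)\]\s(?P<protocol>\S+)\s'
    r'\{(?P<status>\d+)\}\s(?P<size>\d+)\s"(?P<user_agent>[^"]*)"\s(?P<ip>\S+)$'
)

def parse_line(line: str) -> dict:
    m = LOG_RE.match(line)
    if m is None:
        raise ValueError(f'unrecognised log line: {line!r}')
    row = m.groupdict()
    row['status'] = int(row['status'])
    row['size'] = int(row['size'])
    # Assumption: timestamps look like 16/Apr/2019:193452+0000.
    # %b relies on English month abbreviations in the active locale.
    row['time'] = datetime.strptime(row['time'], '%d/%b/%Y:%H%M%S%z')
    return row

line = ('GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} '
        '510348 "python-requests/2.21.0" 97.105.19.58')
row = parse_line(line)
print(row['method'], row['status'], row['size'], row['time'].isoformat())
```

Raising on lines that do not match, rather than silently skipping them, makes malformed log entries visible early.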


## You can find a list of words on your Mac at `/usr/share/dict/words`. Use this file to answer the following questions:

- How many words have at least 3 vowels?
    * 191,365
- How many words have at least 3 vowels in a row?
    * 6,182
- How many words have at least 4 consonants in a row?
    * 19,240
- How many words start and end with the same letter?
    * 9,917
- How many words start and end with a vowel?
    * 2,466
- How many words contain the same letter 3 times in a row?
    * 3
- What other interesting patterns in words can you find?
    * There are 161 palindromes in the word list

In [8]:
with open('words') as f:
    lines = f.readlines()
df = pd.DataFrame({'words': pd.Series(lines)})
v = 'aeiou'
V = v.upper()
c = 'bcdfghjklmnpqrstvwxyz'
C = c.upper()
# each column holds the number of non-overlapping matches per word
df['vowel_count'] = df.words.str.count(fr'[{v+V}]')
df['vowels_3'] = df.words.str.count(fr'[{v+V}]{{3}}')
df['cons_4'] = df.words.str.count(fr'[{c+C}]{{4}}')
# \1 refers back to the captured first letter (case-sensitive)
df['start_end'] = df.words.str.count(r'^([a-zA-Z]).*\1$')
# note: the backreference requires the same vowel at both ends
df['start_end_vowel'] = df.words.str.count(fr'^([{v+V}]).*\1$')
df['same_3'] = df.words.str.count(r'([a-zA-Z])\1\1')
df.head()

Unnamed: 0,words,vowel_count,vowels_3,cons_4,start_end,start_end_vowel,same_3
0,A\n,1,0,0,0,0,0
1,a\n,1,0,0,0,0,0
2,aa\n,2,0,0,1,1,0
3,aal\n,2,0,0,0,0,0
4,aalii\n,4,0,0,0,0,0
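
The columns above hold per-word match counts, while the questions ask how many words satisfy each condition. The sketch below shows one way to get such totals with plain `re`, using a tiny made-up sample rather than the real word list. Note two deliberate differences from the cell above, both assumptions of this example: words are lowercased first, and `vowel_first_and_last` accepts any vowel at each end, whereas the `start_end_vowel` pattern requires the same vowel at both ends. `PATTERNS` and `count_words` are hypothetical names.

```python
import re

VOWELS = 'aeiou'
CONSONANTS = 'bcdfghjklmnpqrstvwxyz'

# Hypothetical pattern table: each regex answers one yes/no question about a word.
PATTERNS = {
    'three_vowels': rf'(?:[{VOWELS}].*){{3}}',
    'three_vowels_in_a_row': rf'[{VOWELS}]{{3}}',
    'four_consonants_in_a_row': rf'[{CONSONANTS}]{{4}}',
    'same_first_and_last': r'^(.).*\1$',
    'vowel_first_and_last': rf'^[{VOWELS}].*[{VOWELS}]$',
    'same_letter_three_times': r'(.)\1\1',
}

def count_words(words):
    # Strip the trailing newline and lowercase so the patterns can stay lowercase-only
    cleaned = [w.strip().lower() for w in words]
    # A word counts once per question, no matter how many times the pattern occurs in it
    return {name: sum(1 for w in cleaned if re.search(pattern, w))
            for name, pattern in PATTERNS.items()}

sample = ['aalii\n', 'queue\n', 'strength\n', 'area\n', 'level\n', 'sky\n', 'brrr\n']
print(count_words(sample))
```

As in the cell above, the two start-and-end patterns need at least two characters, so single-letter words are not counted.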


In [9]:
def check_palindrome(string):
    '''Checks if the string is a palindrome (same forwards as backwards)
    '''
    return string == string[::-1]

In [10]:
df['palindrome'] = df.words.apply(lambda x: check_palindrome(x[:-1]))
df[df.palindrome]

Unnamed: 0,words,vowel_count,vowels_3,cons_4,start_end,start_end_vowel,same_3,palindrome
0,A\n,1,0,0,0,0,0,True
1,a\n,1,0,0,0,0,0,True
2,aa\n,2,0,0,1,1,0,True
16,aba\n,2,0,0,1,1,0,True
840,acca\n,2,0,0,1,1,0,True
...,...,...,...,...,...,...,...,...
234267,y\n,0,0,0,0,0,0,True
234413,yaray\n,2,0,0,1,0,0,True
234854,yoy\n,1,0,0,1,0,0,True
234937,Z\n,0,0,0,0,0,0,True
