## Regular Expressions Exercises - NLP module

In [1]:
import pandas as pd
import re

1. Write a function named is_vowel. It should accept a string as input and use a regular expression to determine if the passed string is a vowel. While not explicitly mentioned in the lesson, you can treat the result of re.search as a boolean value that indicates whether or not the regular expression matches the given string.

In [24]:
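def is_vowel(string):
    # Anchor the pattern so the whole string must be exactly one vowel,
    # not merely contain one somewhere.
    regexp = r'^[aeiouAEIOU]$'
    # re.search returns a match object or None; bool() turns that into True/False.
    return bool(re.search(regexp, string))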

In [25]:
is_vowel('a')

True


In [26]:
is_vowel('b')

False
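
The prompt's note about treating the result of re.search as a boolean can be made explicit. The following is a minimal sketch, not the lesson's solution: the helper name `is_vowel_strict` is hypothetical, and it swaps in `re.fullmatch` so that the entire string has to be a single vowel.

```python
import re

def is_vowel_strict(string):
    """Return True only when the entire string is a single vowel."""
    # re.fullmatch returns a Match object (truthy) or None (falsy);
    # bool() converts that into an explicit True or False.
    return bool(re.fullmatch(r'[aeiou]', string, re.IGNORECASE))

# An unanchored re.search accepts any string that merely contains a vowel.
print(bool(re.search(r'[aeiouAEIOU]', 'banana')))  # True
print(is_vowel_strict('banana'))                    # False
print(is_vowel_strict('E'))                         # True
```

Returning the boolean instead of printing it lets the function be used in conditions and tests.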


---

2. Write a function named is_valid_username that accepts a string as input. A valid username starts with a lowercase letter, and only consists of lowercase letters, numbers, or the _ character. It should also be no longer than 32 characters. The function should return either True or False depending on whether the passed string is a valid username.

In [45]:
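def is_valid_username(username):
    # First character: a lowercase letter. Then 0 to 31 more lowercase letters,
    # digits, or underscores, for a total length of 1 to 32 characters.
    regexp = r'^[a-z][a-z0-9_]{0,31}$'
    # Return the boolean, as the exercise asks, instead of printing it.
    return bool(re.search(regexp, username))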

In [46]:
is_valid_username('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')

False


In [47]:
is_valid_username('codeup')

True


In [48]:
is_valid_username('Codeup')

False


In [49]:
is_valid_username('codeup123')

True


In [50]:
is_valid_username('1codeup')

False
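
The length limit is the easiest part of this exercise to get wrong, so it helps to check the boundaries directly. The sketch below is an illustration under assumptions: `USERNAME_RE` and `check_username` are hypothetical names, and it reads the rules as allowing usernames from 1 to 32 characters long.

```python
import re

# One leading lowercase letter plus up to 31 more allowed characters:
# 1 to 32 characters in total.
USERNAME_RE = re.compile(r'[a-z][a-z0-9_]{0,31}')

def check_username(username):
    """Return True when the whole string is a valid username."""
    return bool(USERNAME_RE.fullmatch(username))

# Boundary cases: shortest valid, longest valid, one too long, and so on.
cases = ['a', 'a' * 32, 'a' * 33, 'code_up', 'Codeup', '1codeup', '']
for case in cases:
    print(repr(case), check_username(case))
```

With `re.fullmatch` the `^` and `$` anchors are not needed, because the pattern must consume the entire string.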


---

3. Write a regular expression to capture phone numbers. It should match all of the following:


- (210) 867 5309
- +1 210.867.5309
- 867-5309
- 210-867-5309

In [83]:
# Not the best way to do it: extract digit groups with pandas, then concatenate them.
def capture_number(number):
    subject = pd.Series(number)
    df = subject.str.extract(r'\W?(\d+)?\W?\s?(\d+)?\W?\s?(\d+)\W?\s?(\d+)')
    df.rename(columns={0:'part1',1:'part2',2:'part3',3:'part4'}, inplace=True)
    df['phone_number'] = df.part1 + df.part2 + df.part3 + df.part4
    return list(df.phone_number)

In [84]:
capture_number('+1 210.867.5309')

['12108675309']

In [78]:
capture_number('(210) 867 5309')

['2108675309']

In [79]:
capture_number('867-5309')

['8675309']

In [80]:
capture_number('210-867-5309')

['2108675309']

In [87]:
# Better: just remove every non-digit character.
def capture_number(number):
    regexp = r'\D'
    return re.sub(regexp, '', number)


In [88]:
capture_number('+1 210.867.5309')

'12108675309'

In [89]:
capture_number('(210) 867 5309')

'2108675309'

In [90]:
capture_number('867-5309')

'8675309'

In [91]:
capture_number('210-867-5309')

'2108675309'

In [101]:
regexp = r'(\+\d+)?\D*(\d{3})?\D*(\d{3})\D*(\d{4})$'
subject = '+1 210.867.5309'
re.match(regexp, subject).groups()

('+1', '210', '867', '5309')

In [102]:
#from the exercise review
phone_re = r'''
(?P<country_code>\+\d+)?
\D*
(?P<area_code>\d{3})?
\D*
(?P<exchange_code>\d{3})
\D*
(?P<last_four>\d{4})$
'''

numbers = pd.Series([
    '(210) 867 5309',
    '+1 210.867.5309',
    '867-5309',
    '210-867-5309',
], name='original')

pd.concat([numbers, numbers.str.extract(phone_re, re.VERBOSE)], axis=1)

          original  country_code  area_code  exchange_code  last_four
0   (210) 867 5309           NaN        210            867       5309
1  +1 210.867.5309            +1        210            867       5309
2         867-5309           NaN        NaN            867       5309
3     210-867-5309           NaN        210            867       5309
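
The same named-group pattern also works without pandas. The following is a rough sketch using only the re module; `PHONE_RE` and `parse_phone` are hypothetical names introduced for illustration.

```python
import re

# Same structure as the review pattern, compiled once with re.VERBOSE
# so the pattern can span several lines and carry comments.
PHONE_RE = re.compile(r'''
    (?P<country_code>\+\d+)?   # optional country code such as +1
    \D*
    (?P<area_code>\d{3})?      # optional three-digit area code
    \D*
    (?P<exchange_code>\d{3})
    \D*
    (?P<last_four>\d{4})$
''', re.VERBOSE)

def parse_phone(number):
    """Return a dict of the named parts, or None when nothing matches."""
    match = PHONE_RE.match(number)
    return match.groupdict() if match else None

for number in ['(210) 867 5309', '+1 210.867.5309', '867-5309', '210-867-5309']:
    print(number, parse_phone(number))
```

Groups that did not take part in the match come back as None in the dictionary, which plays the role of the NaN values in the pandas output.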


---

4. Use regular expressions to convert the dates below to the standardized year-month-day format.


- 02/04/19
- 02/05/19
- 02/06/19
- 02/07/19
- 02/08/19
- 02/09/19
- 02/10/19

In [92]:
dates = pd.Series(['02/04/19', '02/05/19', '02/06/19', '02/07/19', '02/08/19', '02/09/19', '02/10/19'])
dates.str.replace(r'(\d{2})/(\d{2})/(\d{2})', r'20\3-\1-\2', regex=True)

0    2019-02-04
1    2019-02-05
2    2019-02-06
3    2019-02-07
4    2019-02-08
5    2019-02-09
6    2019-02-10
dtype: object
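
For comparison, here is a sketch of the same conversion with plain re.sub and named groups, cross-checked against datetime parsing. The names `DATE_RE` and `standardize` are hypothetical, and the code assumes every two-digit year belongs to the 2000s, as the replacement above does.

```python
import re
from datetime import datetime

# Named groups make the reordering in the replacement easier to read.
DATE_RE = re.compile(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{2})')

def standardize(date):
    """Convert mm/dd/yy to yyyy-mm-dd, assuming a year in the 2000s."""
    return DATE_RE.sub(r'20\g<year>-\g<month>-\g<day>', date)

for raw in ['02/04/19', '02/10/19']:
    converted = standardize(raw)
    # Cross-check the regex result against datetime's own parsing.
    parsed = datetime.strptime(raw, '%m/%d/%y').strftime('%Y-%m-%d')
    assert converted == parsed
    print(converted)
```

The regex only rearranges text; unlike datetime.strptime, it does not reject impossible dates such as 13/45/19.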

---

5. Write a regex to extract the various parts of these logfile lines:


- GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
- POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
- GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58


In [93]:
df = pd.DataFrame()
df['text'] = pd.Series(['GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58',
'POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58',
'GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58'])
df


                                                text
0  GET /api/v1/sales?page=86 [16/Apr/2019:193452+...
1  POST /users_accounts/file-upload [16/Apr/2019:...
2  GET /api/v1/items?page=3 [16/Apr/2019:193453+0...


In [98]:
df.text.str.split()

0    [GET, /api/v1/sales?page=86, [16/Apr/2019:1934...
1    [POST, /users_accounts/file-upload, [16/Apr/20...
2    [GET, /api/v1/items?page=3, [16/Apr/2019:19345...
Name: text, dtype: object

In [103]:
#from the exercise review
logfile_re = r'''
^(?P<method>GET|POST)
\s+
(?P<path>.*?)
\s+
\[(?P<timestamp>.*?)\]
\s+
(?P<http_version>.*?)
\s+
\{(?P<status>\d+)\}
\s+
(?P<bytes_sent>\d+)
\s+
"(?P<user_agent>.*)$
'''

lines = pd.Series([
    'GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58',
    'POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58',
    'GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58',
])
lines.str.extract(logfile_re, re.VERBOSE)

   method                         path                timestamp  http_version  status  bytes_sent                                         user_agent
0     GET        /api/v1/sales?page=86  16/Apr/2019:193452+0000      HTTP/1.1     200      510348               python-requests/2.21.0" 97.105.19.58
1    POST  /users_accounts/file-upload  16/Apr/2019:193452+0000      HTTP/1.1     201          42  User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; ...
2     GET         /api/v1/items?page=3  16/Apr/2019:193453+0000      HTTP/1.1     429        3561               python-requests/2.21.0" 97.105.19.58
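
In the output above, the user_agent column still carries the closing quote and the IP address, because the final group runs to the end of the line. One possible variant, sketched here with the re module only, stops the user agent at its closing quote and captures the address separately. `LOG_RE`, `parse_line`, and the `ip` group name are assumptions made for this illustration and are not part of the exercise review.

```python
import re

LOG_RE = re.compile(r'''
    ^(?P<method>GET|POST)
    \s+
    (?P<path>\S+)
    \s+
    \[(?P<timestamp>[^\]]+)\]
    \s+
    (?P<http_version>\S+)
    \s+
    \{(?P<status>\d+)\}
    \s+
    (?P<bytes_sent>\d+)
    \s+
    "(?P<user_agent>[^"]*)"            # stop at the closing quote
    \s+
    (?P<ip>\d{1,3}(?:\.\d{1,3}){3})$   # dotted IPv4 address at the end
''', re.VERBOSE)

def parse_line(line):
    """Return a dict of the named fields, or None when the line does not match."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None

line = ('GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 '
        '{200} 510348 "python-requests/2.21.0" 97.105.19.58')
print(parse_line(line))
```

The ip group only checks the shape of the address (four dot-separated runs of one to three digits); it does not validate that each part is in the range 0 to 255.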
