## Regex Exercises

In [1]:
import pandas as pd
import re

1. Write a function named is_vowel. It should accept a string as input and use a regular expression to determine if the passed string is a vowel. While not explicitly mentioned in the lesson, you can treat the result of re.search as a boolean value that indicates whether or not the regular expression matches the given string.

In [2]:
regexp = r'^[aeiouAEIOU]$'
subject = 'ab'

re.findall(regexp, subject)

[]

In [3]:
def is_vowel(subject):
    '''
    This function will take in a string and look for an exact match for a single character vowel. 
    It will return a boolean value.
    '''
    regexp = r'^[aeiouAEIOU]$'
    
    vowel = re.search(regexp, subject)
    
    return bool(vowel)
    


In [4]:
is_vowel('aa')

False
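
The `^` and `$` anchors are what reject multi-character input such as `'aa'`. One caveat: in Python, `$` also matches just before a trailing newline, so the anchored pattern accepts `'a\n'`. A minimal sketch of an alternative, assuming a hypothetical helper name `is_vowel_fullmatch`, uses `re.fullmatch`, which requires the entire string to match:

```python
import re

def is_vowel_fullmatch(subject):
    # Hypothetical variant of is_vowel: fullmatch only succeeds
    # when the pattern consumes the whole string, newline included.
    return bool(re.fullmatch(r'[aeiou]', subject, flags=re.IGNORECASE))

print(is_vowel_fullmatch('a'))    # True
print(is_vowel_fullmatch('E'))    # True
print(is_vowel_fullmatch('aa'))   # False
print(is_vowel_fullmatch('a\n'))  # False

# The anchored search pattern accepts the trailing newline.
print(bool(re.search(r'^[aeiouAEIOU]$', 'a\n')))  # True
```

Whether the trailing-newline case matters depends on where the input comes from; for strings read from a file line by line, it may be worth guarding against.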

2. Write a function named is_valid_username that accepts a string as input. A valid username starts with a lowercase letter, and only consists of lowercase letters, numbers, or the _ character. It should also be no longer than 32 characters. The function should return either True or False depending on whether the passed string is a valid username.

In [5]:
regexp = r'^[a-z][a-z0-9_]{,31}$'
subject = 'ab12312adfsa12312_asdasd123123_0'

re.search(regexp, subject)

<re.Match object; span=(0, 32), match='ab12312adfsa12312_asdasd123123_0'>

In [6]:
def is_valid_username(subject):
    '''
    This function accepts a username as a string and returns a boolean value based on whether or not the username meets the following requirements: 
    - starts with a lowercase letter
    - only consists of lowercase letters, numbers, or the '_' character
    - no longer than 32 characters
    '''
    
    regexp = r'^[a-z][a-z0-9_]{,31}$'
    
    username = re.search(regexp, subject)
    
    return bool(username)
    
    

In [7]:
is_valid_username('a12312_123123_0')

True
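
A single passing example does not exercise the limits in the requirements. The sketch below checks the boundary cases (length 1, 32, and 33, plus disallowed characters) against the same pattern; the helper name `check_username` and the test cases are illustrative assumptions, not part of the exercise.

```python
import re

USERNAME_RE = re.compile(r'^[a-z][a-z0-9_]{,31}$')

def check_username(subject):
    # Same logic as is_valid_username, repeated so the block is self-contained.
    return bool(USERNAME_RE.search(subject))

# Map each candidate to the result the requirements call for.
cases = {
    'a': True,            # shortest valid username
    'a' * 32: True,       # exactly 32 characters
    'a' * 33: False,      # one character too long
    '1abc': False,        # starts with a digit
    '_abc': False,        # starts with an underscore
    'aBc': False,         # contains an uppercase letter
    'user name': False,   # contains a space
    '': False,            # empty string
}

for candidate, expected in cases.items():
    assert check_username(candidate) == expected, candidate

print('all username cases pass')
```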

3. Write a regular expression to capture phone numbers. It should match all of the following:
- (210) 867 5309
- +1 210.867.5309
- 867-5309
- 210-867-5309

In [8]:
# Madeleine
df = pd.DataFrame()
df['number'] = [
    '(210) 867 5309',
    '+1 210.867.5309',
    '867-5309',
    '2108675309',
]


In [9]:
# raw string so that \+, \d, and \D reach the regex engine without invalid-escape warnings
phone_regex = re.compile(
r'''
^ 
(?P<country_code>\+\d+)?
\D*?
(?P<area_code>\d{3})?
\D*?
(?P<exchange_code>\d{3})
\D*?
(?P<line_number>\d{4})
\D*
$
''', re.VERBOSE)

In [10]:
df.number.str.extract(phone_regex)

  country_code area_code exchange_code line_number
0          NaN     210.0           867        5309
1          1.0     210.0           867        5309
2          NaN       NaN           867        5309
3          NaN     210.0           867        5309


In [11]:
def capture_phone_numbers(target):
    '''
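    This function takes in a list of phone number strings and returns a DataFrame with one row per input and the captured parts of each number as columns:
    - country_code: optional, a '+' followed by digits
    - area_code: optional, three digits
    - exchange_code: three digits
    - line_number: four digits
    Any non-digit characters (spaces, parentheses, '.', '-') may separate the parts.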
    '''
    
    # Create a blank dataframe
    df = pd.DataFrame()
    
    # assign the target variable list to a column in the df
    df['input_number'] = target
    
    # create the regexp to compile the sections of the phone numbers
    phone_regex = re.compile(
                            r'''
                            ^ 
                            (?P<country_code>\+\d+)?
                            \D*?
                            (?P<area_code>\d{3})?
                            \D*?
                            (?P<exchange_code>\d{3})
                            \D*?
                            (?P<line_number>\d{4})
                            \D*
                            $
                            ''', re.VERBOSE)
    
    # Output results to the dataframe
    df = df['input_number'].str.extract(phone_regex)
    
    return df
    
    

In [12]:
test_list = [
    '(210) 867 5309',
    '+1 210.867.5309',
    '867-5309',
    '2108675309',
]

In [13]:
capture_phone_numbers(test_list)

  country_code area_code exchange_code line_number
0          NaN     210.0           867        5309
1          1.0     210.0           867        5309
2          NaN       NaN           867        5309
3          NaN     210.0           867        5309
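
The same pattern can also be used without pandas when only one number needs to be parsed. The following is a minimal sketch, assuming a hypothetical helper `parse_phone` that returns the named groups as a dictionary, or `None` when the string does not match. It also covers the `210-867-5309` form from the exercise list, which the test list above does not include.

```python
import re

PHONE_RE = re.compile(r'''
    ^
    (?P<country_code>\+\d+)?    # optional '+' and digits
    \D*?
    (?P<area_code>\d{3})?       # optional three-digit area code
    \D*?
    (?P<exchange_code>\d{3})    # three-digit exchange
    \D*?
    (?P<line_number>\d{4})      # four-digit line number
    \D*
    $
''', re.VERBOSE)

def parse_phone(number):
    # Hypothetical helper: dict of captured parts, or None if no match.
    match = PHONE_RE.search(number)
    return match.groupdict() if match else None

parts = parse_phone('+1 210.867.5309')
print(parts['country_code'], parts['area_code'])  # +1 210

parts = parse_phone('210-867-5309')
print(parts['area_code'], parts['exchange_code'], parts['line_number'])  # 210 867 5309

print(parse_phone('867-5309')['area_code'])  # None
print(parse_phone('not a number'))           # None
```

Unmatched optional groups come back as `None` here, which corresponds to the missing values in the DataFrame output.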


4. Use regular expressions to convert the dates below to the standardized year-month-day format.
- 02/04/19
- 02/05/19
- 02/06/19
- 02/07/19
- 02/08/19
- 02/09/19
- 02/10/19

In [14]:
# currently in MM/DD/YY
# need to convert to YYYY-MM-DD
# 3 capture groups separated by '/', each two digit, although should be built to accept 1 digit month and day

In [35]:
# define the list of dates
dates_list = [
    '02/04/19',
    '02/05/19',
    '02/06/19',
    '02/07/19',
    '02/08/19',
    '02/09/19',
    '02/10/19']

In [36]:
# create our three capture groups, separated by '/'
# date_reg = r'(\d{1,2})/(\d{1,2})/(\d{2})'
date_reg = r'(\d+)/(\d+)/(\d+)'


In [37]:
re.sub(date_reg, r'20\3-\1-\2', dates_list[0])

'2019-02-04'

In [45]:
def convert_date_format(target):
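    '''
    This function accepts a list of date strings in MM/DD/YY format and returns a DataFrame with the original strings and the dates converted to YYYY-MM-DD as datetime values.
    Two-digit years are assumed to fall in the 2000s.
    '''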
    
    # Create a blank dataframe
    df = pd.DataFrame()
    
    # assign the target variable list to a column in the df
    df['input_date'] = target
        
    # create the regexp with three capture groups: month, day, year
    date_regexp = r'(\d+)/(\d+)/(\d+)'

    # create output format
    output = r'20\3-\1-\2'
        
    # create new column of converted dates
    df['converted_date'] = [re.sub(date_regexp, output, i) for i in target]
    
    # convert to datetime
    df['converted_date'] = pd.to_datetime(df['converted_date'])
    
    return df

In [46]:
new_df = convert_date_format(dates_list)

In [47]:
new_df

  input_date converted_date
0   02/04/19     2019-02-04
1   02/05/19     2019-02-05
2   02/06/19     2019-02-06
3   02/07/19     2019-02-07
4   02/08/19     2019-02-08
5   02/09/19     2019-02-09
6   02/10/19     2019-02-10


In [48]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   input_date      7 non-null      object        
 1   converted_date  7 non-null      datetime64[ns]
dtypes: datetime64[ns](1), object(1)
memory usage: 240.0+ bytes
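
The notes for this exercise mention accepting one-digit months and days. The `\d+` groups do accept them, but a plain backreference template would then produce strings like `2019-2-4`. A minimal sketch of one way to zero-pad the result is to pass a function as the replacement to `re.sub`. The helper name `to_iso_date` is hypothetical, and, as in the code above, every two-digit year is assumed to belong to the 2000s.

```python
import re

DATE_RE = re.compile(r'^(\d{1,2})/(\d{1,2})/(\d{2})$')

def to_iso_date(date_string):
    # Hypothetical helper: MM/DD/YY (or M/D/YY) -> YYYY-MM-DD.
    def _reformat(match):
        month, day, year = match.groups()
        # zfill pads one-digit months and days with a leading zero
        return f'20{year}-{month.zfill(2)}-{day.zfill(2)}'
    # Strings that do not match the pattern are returned unchanged.
    return DATE_RE.sub(_reformat, date_string)

print(to_iso_date('02/04/19'))  # 2019-02-04
print(to_iso_date('2/4/19'))    # 2019-02-04
print(to_iso_date('12/25/19'))  # 2019-12-25
```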


5. Write a regex to extract the various parts of these logfile lines:
- GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
- POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; -  - Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
- GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
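
One possible approach, sketched under assumptions, is a verbose pattern with one named group per field. The group names (`method`, `path`, `timestamp`, `http_version`, `status`, `size`, `user_agent`, `ip`) are guesses at what each field represents, since the exercise does not label them, and `parse_log_line` is a hypothetical helper name.

```python
import re

LOG_RE = re.compile(r'''
    ^(?P<method>[A-Z]+)\s+              # request method, e.g. GET
    (?P<path>\S+)\s+                    # request path with query string
    \[(?P<timestamp>[^\]]+)\]\s+        # everything between [ and ]
    (?P<http_version>HTTP/[\d.]+)\s+    # protocol version
    \{(?P<status>\d{3})\}\s+            # status code inside braces
    (?P<size>\d+)\s+                    # assumed to be a response size
    "(?P<user_agent>[^"]*)"\s+          # quoted user agent
    (?P<ip>\d{1,3}(?:\.\d{1,3}){3})$    # dotted IPv4 address
''', re.VERBOSE)

def parse_log_line(line):
    # Hypothetical helper: dict of fields, or None if the line does not match.
    match = LOG_RE.search(line)
    return match.groupdict() if match else None

line = ('GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 '
        '{200} 510348 "python-requests/2.21.0" 97.105.19.58')

fields = parse_log_line(line)
print(fields['method'])      # GET
print(fields['path'])        # /api/v1/sales?page=86
print(fields['status'])      # 200
print(fields['user_agent'])  # python-requests/2.21.0
print(fields['ip'])          # 97.105.19.58
```

Because the user agent is captured as "anything except a double quote", the longer browser-style user agent in the POST line should be handled by the same group. The same compiled pattern could also be passed to `Series.str.extract` to get one column per named group, as in the phone number exercise.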