In [1]:
import re
import pandas as pd

### Character Classes

|metacharacter | matches |
| ------------ | ------- |
|  . |  any character except a newline |
| \w |  any letter, digit, or underscore |
| \W |  anything that's not a letter, digit, or underscore |
| \d |  any digit |
| \D |  anything that's not a digit |
| \s |  any whitespace character |
| [xyz] | any one of the enclosed characters |
| [^xyz] | any character that is not enclosed |
| x\|y | either x or y |
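
As a quick illustration of the table above, the sketch below runs a few character classes through `re.findall`. The sample strings are made up for this example.

```python
import re

# Hypothetical sample text, chosen only to exercise the character classes.
sample = "Room 42, floor_3!"

digits = re.findall(r"\d", sample)             # every single digit
words = re.findall(r"\w+", sample)             # runs of letters, digits, or underscores
non_words = re.findall(r"\W", sample)          # everything \w does not match
vowels = re.findall(r"[aeiou]", sample)        # any one of the enclosed characters
pets = re.findall(r"cat|dog", "cat and dog")   # alternation: either side of the |

for name, result in [("digits", digits), ("words", words),
                     ("non_words", non_words), ("vowels", vowels),
                     ("pets", pets)]:
    print(name, result)
```

Note that `\w` treats the underscore as a word character, which is why `floor_3` comes back as a single match.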

### Repeating

Each metacharacter in the table below applies to the preceding character (or group) and controls how many times it may repeat.

|metacharacter | matches |
| ------------ | ------- |
| * | zero or more |
| + | one or more |
| {n} | exactly n repetitions |
| {n,} | n or more repetitions |
| {n,m} | between n and m repetitions |
| ? | an optional character |
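
The quantifiers above can be compared side by side on one small input. This is a minimal sketch; the sample strings are invented for illustration.

```python
import re

text = "ll lol lool"

zero_or_more = re.findall(r"lo*l", text)   # the o may be absent or repeated
one_or_more = re.findall(r"lo+l", text)    # at least one o is required
optional = re.findall(r"lo?l", text)       # zero or one o

numbers = "12 123 1234"
exactly_three = re.findall(r"\b\d{3}\b", numbers)     # whole numbers of exactly 3 digits
two_or_more = re.findall(r"\b\d{2,}\b", numbers)      # whole numbers of 2 or more digits
two_to_three = re.findall(r"\b\d{2,3}\b", numbers)    # whole numbers of 2 to 3 digits

print(zero_or_more, one_or_more, optional)
print(exactly_three, two_or_more, two_to_three)
```

The `\b` anchors keep `\d{3}` from matching the first three digits of `1234`.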

### Anchors

There are several special metacharacters that don't match any individual characters, but serve as an "anchor" for the rest of the regular expression.

|metacharacter | matches | example |
| ------------ | ------- | ------- |
| ^ | The start of the string/line | ^[ab] matches the a in `apple` and the b in `banana`|
| $ | The end of the string/line | [ae]$ matches the e in `apple` and the final a in `banana` |
| \b | A word boundary | er\b matches the er in `never` but not the er in `verb` |
| \B| Not word boundary | ear\B matches the ear in `early` but not the ear in `fear` |
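
The examples in the anchor table can be checked directly with `re.search`, which returns `None` when the pattern does not match. The sketch below reuses the words from the table, plus `cab` as an extra non-matching case.

```python
import re

# ^ anchors the match at the start of the string.
starts_apple = re.search(r"^[ab]", "apple")
starts_banana = re.search(r"^[ab]", "banana")
starts_cab = re.search(r"^[ab]", "cab")      # c is first, so no match

# $ anchors the match at the end of the string.
ends_banana = re.search(r"na$", "banana")

# \b requires a word boundary; \B requires that there is none.
er_never = re.search(r"er\b", "never")       # er is followed by the end of the word
er_verb = re.search(r"er\b", "verb")         # er is followed by b, so no boundary
ear_early = re.search(r"ear\B", "early")     # ear is followed by l, so no boundary
ear_fear = re.search(r"ear\B", "fear")       # ear ends the word, so \B fails

print(bool(starts_apple), bool(starts_banana), bool(starts_cab))
print(bool(er_never), bool(er_verb), bool(ear_early), bool(ear_fear))
```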

1. Write a function named is_vowel. It should accept a string as input and use a regular expression to determine if the passed string is a vowel. While not explicitly mentioned in the lesson, you can treat the result of re.search as a boolean value that indicates whether or not the regular expression matches the given string.

In [6]:
re.findall(r"[aeiou]", "cowpatty", re.IGNORECASE)

['o', 'a']

In [5]:
re.findall(r"a|e|i|o|u", "cowpatty", re.IGNORECASE)


['o', 'a']

In [7]:
def is_vowel(string):
    vowel = re.findall(r"a|e|i|o|u", string, re.IGNORECASE)
    return vowel

In [9]:
is_vowel("cowpatty")

['o', 'a']

While not explicitly mentioned in the lesson, you can treat the result of re.search as a boolean value that indicates whether or not the regular expression matches the given string.

In [10]:
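def is_vowel(string):
    # Anchor the pattern so the whole string must be exactly one vowel,
    # not merely contain one somewhere.
    vowel = re.search(r"^[aeiou]$", string, re.IGNORECASE)
    return bool(vowel)

In [11]:
is_vowel("cowpatty")

False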

2. Write a function named is_valid_username that accepts a string as input.
    - A valid username starts with a lowercase letter, and 
    - only consists of lowercase letters, numbers, or the _ character. 
    - It should also be no longer than 32 characters. 
    - The function should return either True or False depending on whether the passed string is a valid username.

In [None]:
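def is_valid_username(string):
    # ^ (caret) anchors the match at the start; the first character must be a lowercase letter
    # the remaining characters may be lowercase letters, digits, or underscores
    # {,31} allows up to 31 more characters, for a maximum length of 32
    username_pattern = r"^[a-z][a-z0-9_]{,31}$"
    return bool(re.search(username_pattern, string))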
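
The username rules above can be spot-checked with a handful of cases. This is a minimal sketch that restates the rules so the block runs on its own; it uses `re.fullmatch` instead of explicit anchors, and the sample usernames are invented for illustration.

```python
import re

# One lowercase letter, then up to 31 lowercase letters, digits, or underscores.
USERNAME_PATTERN = re.compile(r"[a-z][a-z0-9_]{0,31}")


def is_valid_username(candidate):
    # fullmatch requires the entire string to fit the pattern.
    return USERNAME_PATTERN.fullmatch(candidate) is not None


# Hypothetical test cases mapped to the expected result.
checks = {
    "codeup": True,
    "code_up123": True,
    "Codeup": False,     # starts with an uppercase letter
    "1codeup": False,    # starts with a digit
    "a" * 32: True,      # exactly 32 characters
    "a" * 33: False,     # one character too long
    "": False,           # empty string has no first letter
}

results = {name: is_valid_username(name) for name in checks}
print(all(results[name] == expected for name, expected in checks.items()))
```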

3. Write a regular expression to capture phone numbers. It should match all of the following:

(210) 867 5309
+1 210.867.5309
867-5309
210-867-5309


In [12]:
phone_number = '210-867-5309'
# Character class of digits plus the separators ( ) + . space and -
# The hyphen goes last so it is a literal rather than forming the range "+-."
regexp = r'[0-9()+. -]+'
re.match(regexp, phone_number)

<re.Match object; span=(0, 12), match='210-867-5309'>

In [13]:
re.match(r'[0-9()+. -]+','210-867-5309')

<re.Match object; span=(0, 12), match='210-867-5309'>

In [14]:
re.match(r'[0-9()+. -]+','+1 210.867.5309')

<re.Match object; span=(0, 15), match='+1 210.867.5309'>
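
A character class like the one above accepts any mix of digits and separators, so it would also match strings that are not phone numbers. A stricter alternative is sketched below; it is one possible pattern, not the only answer, and it assumes a country code is always written with a leading `+`. The group names and the `PHONE_PATTERN` name are chosen for this example.

```python
import re

PHONE_PATTERN = re.compile(r"""
    (?:\+(?P<country>\d+)[ .-])?        # optional country code, e.g. "+1 "
    (?:\(?(?P<area>\d{3})\)?[ .-])?     # optional area code, with or without parentheses
    (?P<exchange>\d{3})[ .-]            # three-digit exchange and a separator
    (?P<line>\d{4})                     # four-digit line number
""", re.VERBOSE)

# The four numbers from the exercise.
samples = [
    "(210) 867 5309",
    "+1 210.867.5309",
    "867-5309",
    "210-867-5309",
]

parsed = [PHONE_PATTERN.fullmatch(s) for s in samples]
for sample, match in zip(samples, parsed):
    print(sample, match.groupdict() if match else None)
```

Because the groups are named, `groupdict()` exposes each part of the number; parts that are absent, such as the area code in `867-5309`, come back as `None`.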

4. Use regular expressions to convert the dates below to the standardized year-month-day format.

02/04/19
02/05/19
02/06/19
02/07/19
02/08/19
02/09/19
02/10/19
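
One way to approach this conversion is with capture groups and `re.sub`. The sketch below assumes the dates are written month/day/year and that the two-digit years fall in the 2000s; neither is stated in the exercise, so adjust the replacement string if that assumption does not hold.

```python
import re

dates = [
    "02/04/19",
    "02/05/19",
    "02/06/19",
    "02/07/19",
    "02/08/19",
    "02/09/19",
    "02/10/19",
]

# Assumed layout: month/day/two-digit year.
date_pattern = re.compile(r"(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{2})")

# \g<name> refers back to a named group; "20" is prepended on the
# assumption that every year is in the 2000s.
standardized = [
    date_pattern.sub(r"20\g<year>-\g<month>-\g<day>", date) for date in dates
]

print(standardized)
```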


5. Write a regex to extract the various parts of these logfile lines:

GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58



In [15]:
log_file = r'''
^(?P<method>GET|POST)
\s+
(?P<path>.*?)
\s+
\[(?P<timestamp>.*?)\]
\s+
(?P<http_version>.*?)
\s+
\{(?P<status>\d+)\}
\s+
(?P<bytes_sent>\d+)
\s+
"(?P<user_agent>.*)"
\s
(?P<ip>.*)$
'''

In [16]:
lines = pd.Series(['GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58',
                                'POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58',
                                'GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58'
                               ])

In [17]:
lines

0    GET /api/v1/sales?page=86 [16/Apr/2019:193452+...
1    POST /users_accounts/file-upload [16/Apr/2019:...
2    GET /api/v1/items?page=3 [16/Apr/2019:193453+0...
dtype: object

In [18]:
# extract each named group into its own column of a new DataFrame:
df = lines.str.extract(log_file, flags=re.VERBOSE)

In [19]:
df

| | method | path | timestamp | http_version | status | bytes_sent | user_agent | ip |
| - | ------ | ---- | --------- | ------------ | ------ | ---------- | ---------- | -- |
| 0 | GET | /api/v1/sales?page=86 | 16/Apr/2019:193452+0000 | HTTP/1.1 | 200 | 510348 | python-requests/2.21.0 | 97.105.19.58 |
| 1 | POST | /users_accounts/file-upload | 16/Apr/2019:193452+0000 | HTTP/1.1 | 201 | 42 | User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; ... | 97.105.19.58 |
| 2 | GET | /api/v1/items?page=3 | 16/Apr/2019:193453+0000 | HTTP/1.1 | 429 | 3561 | python-requests/2.21.0 | 97.105.19.58 |
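
The same named-group pattern also works on a single line without pandas. The sketch below is a standalone variant: `parse_log_line` is a hypothetical helper name, and converting `status` and `bytes_sent` to integers is an added convenience rather than part of the exercise.

```python
import re

LOG_PATTERN = re.compile(r"""
    ^(?P<method>GET|POST)
    \s+
    (?P<path>.*?)
    \s+
    \[(?P<timestamp>.*?)\]
    \s+
    (?P<http_version>.*?)
    \s+
    \{(?P<status>\d+)\}
    \s+
    (?P<bytes_sent>\d+)
    \s+
    "(?P<user_agent>.*)"
    \s+
    (?P<ip>.*)$
""", re.VERBOSE)


def parse_log_line(line):
    """Return a dict of the line's parts, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    parts = match.groupdict()
    # Every captured group is a string; convert the numeric fields.
    parts["status"] = int(parts["status"])
    parts["bytes_sent"] = int(parts["bytes_sent"])
    return parts


first_line = ('GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 '
              '{200} 510348 "python-requests/2.21.0" 97.105.19.58')
parsed_line = parse_log_line(first_line)
print(parsed_line)
```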
