# Regex Exercises
Using the repo setup directions, set up a new local and remote repository named ```natural-language-processing-exercises```. The local version of your repo should live inside of ```~/codeup-data-science```.

Save this work in your ```natural-language-processing-exercises``` repo. Then add, commit, and push your changes.

Unless a specific file extension is specified, you may do your work either in a Python script (```.py```) or a Jupyter notebook (```.ipynb```).

Do your work for this exercise in a file named ```regex_exercises```.

In [3]:
import numpy as np
import pandas as pd
import re

## 1. Write a function named ```is_vowel```. It should accept a string as input and use a regular expression to determine if the passed string is a vowel. While not explicitly mentioned in the lesson, you can treat the result of ```re.search``` as a boolean value that indicates whether or not the regular expression matches the given string.

In [4]:
def is_vowel(string):
    # ^ anchors the start, [aeiouAEIOU] matches exactly one vowel of either case, $ anchors the end
    vowel_re = r'^[aeiouAEIOU]$'
    # re.search returns a match object or None; bool() converts that to True or False
    return bool(re.search(vowel_re, string))

In [5]:
is_vowel('a')

True

In [6]:
is_vowel('A')

True

In [9]:
is_vowel('Aa')

False

In [10]:
is_vowel('AA')

False

In [7]:
is_vowel('b')

False

In [8]:
is_vowel('Dog')

False
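
The prompt notes that the result of ```re.search``` can be treated as a boolean. A minimal sketch of why that works, reusing the pattern from ```is_vowel``` (the variable names ```hit``` and ```miss``` are only for illustration): ```re.search``` returns a match object, which is truthy, when the pattern matches and ```None```, which is falsy, when it does not.

```python
import re

vowel_re = r'^[aeiouAEIOU]$'

hit = re.search(vowel_re, 'a')    # a match object, which is truthy
miss = re.search(vowel_re, 'b')   # None, which is falsy

print(hit is not None, bool(hit))   # True True
print(miss is None, bool(miss))     # True False
```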

## 2. Write a function named ```is_valid_username``` that accepts a string as input. A valid username:
* starts with a lowercase letter,
* consists only of lowercase letters, numbers, or the ```_``` character, and
* is no longer than 32 characters.

The function should return either ```True``` or ```False``` depending on whether the passed string is a valid username.

```
>>> is_valid_username('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')
False
>>> is_valid_username('codeup')
True
>>> is_valid_username('Codeup')
False
>>> is_valid_username('codeup123')
True
>>> is_valid_username('1codeup')
False
```

In [45]:
def is_valid_username(string):
    # ^[a-z] requires the first character to be a lowercase letter
    # [a-z0-9_]{,31} allows up to 31 more lowercase letters, digits, or underscores,
    # so the whole username is at most 32 characters; $ anchors the end
    username_re = r'^[a-z][a-z0-9_]{,31}$'
    # re.search returns a match object or None; bool() converts that to True or False
    return bool(re.search(username_re, string))

In [46]:
# check 32 characters starting with number
is_valid_username('12345678901234567890123456789012')

False

In [47]:
# check 33 characters starting with lowercase letter
is_valid_username('a12345678901234567890123456789012')

False

In [48]:
# check 32 characters starting with lowercase letter
is_valid_username('a1234567890123456789012345678901')

True

In [49]:
# check 33 characters starting with lowercase letter
is_valid_username('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')

False

In [55]:
# check 32 characters or less starting with lowercase letter
is_valid_username('codeup')

True

In [51]:
# check 32 characters or less starting with uppercase letter
is_valid_username('Codeup')

False

In [52]:
# check 32 characters or less starting with lowercase letter including numbers
is_valid_username('codeup123')

True

In [53]:
# check 32 characters or less starting with number
is_valid_username('1codeup')

False

In [56]:
# check 32 characters or less starting with lowercase letter including numbers containing UPPERCASE
is_valid_username('codeUp123')

False
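
One subtlety worth knowing: in Python, ```$``` also matches just before a trailing newline, so the anchored ```re.search``` version accepts ```'codeup\n'```. If that matters, one possible alternative is ```re.fullmatch```, which requires the entire string to match. The sketch below uses a hypothetical function name, ```is_valid_username_strict```, and is not part of the exercise solution.

```python
import re

USERNAME_RE = r'[a-z][a-z0-9_]{,31}'

def is_valid_username_strict(string):
    # Hypothetical variant: re.fullmatch must consume the whole string,
    # so no ^ or $ anchors are needed and a trailing newline is rejected
    return re.fullmatch(USERNAME_RE, string) is not None

print(is_valid_username_strict('codeup'))     # True
print(is_valid_username_strict('codeup\n'))   # False

# The anchored re.search version accepts the trailing newline
print(bool(re.search(r'^[a-z][a-z0-9_]{,31}$', 'codeup\n')))   # True
```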

## 3. Write a regular expression to capture phone numbers. It should match all of the following:

```
(210) 867 5309
+1 210.867.5309
867-5309
210-867-5309
```

In [57]:
def is_phone_number(string):
    # The pattern is built from three pieces, read left to right:
    #
    # (\(?\d{3}\)?)?   optional group: 3 digits, optionally wrapped in literal parentheses
    # .?               optionally followed by any single character (a separator)
    #
    # (\+?\d+)?        optional group: an optional literal "+" followed by one or more digits
    # .?               optionally followed by any single character (a separator)
    #
    # \d{3}.?\d{4}     3 digits, an optional single separator character, then 4 digits
    #
    # Nothing is anchored with ^ or $, so re.search matches a phone number anywhere in the string.
    phone_re = r"(\(?\d{3}\)?)?.?(\+?\d+)?.?\d{3}.?\d{4}"
    # re.search returns a match object or None; bool() converts that to True or False
    return bool(re.search(phone_re, string))

In [59]:
# Test Curriculum Assertion
is_phone_number('(210) 867 5309')

True

In [60]:
# Test Curriculum Assertion
is_phone_number('+1 210.867.5309')

True

In [61]:
# Test Curriculum Assertion
is_phone_number('867-5309')

True

In [62]:
# Test Curriculum Assertion
is_phone_number('210-867-5309')

True

In [63]:
# Test my own Assertion
is_phone_number('2108675309')

True

In [64]:
# Test my own Assertion
is_phone_number('867.5309')

True

In [65]:
# Test my own Assertion
is_phone_number('210.867.5309')

True

In [66]:
# Test my own Assertion
is_phone_number('cell: (+1) (210) 867-5309')

True

In [68]:
# Test my own Assertion
is_phone_number("I don't have a phone")

False

In [70]:
# Test my own Assertion
is_phone_number("Don't call me @ (555)555-5555")

True
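
Because the pattern above is unanchored and uses ```.``` as the separator, it also accepts any string that merely contains seven digits in a row, such as ```'12345678901234'```. If stricter validation is wanted, one possible sketch anchors the pattern and limits separators to a space, dot, or hyphen. The function name ```is_phone_number_strict``` and the one-or-two-digit country code are assumptions made for illustration, not requirements from the exercise.

```python
import re

# Hypothetical stricter pattern; the country code length is an assumption
STRICT_PHONE_RE = re.compile(r'''
    ^
    (?:\+?\d{1,2}[\s.-]?)?     # optional country code, e.g. "+1 "
    (?:\(?\d{3}\)?[\s.-]?)?    # optional area code, e.g. "(210) " or "210-"
    \d{3}[\s.-]?\d{4}          # 7-digit local number, e.g. "867-5309"
    $
''', re.VERBOSE)

def is_phone_number_strict(string):
    return STRICT_PHONE_RE.search(string) is not None

for number in ['(210) 867 5309', '+1 210.867.5309', '867-5309', '210-867-5309']:
    print(number, is_phone_number_strict(number))   # True for all four

print(is_phone_number_strict('12345678901234'))              # False
print(is_phone_number_strict('cell: (+1) (210) 867-5309'))   # False
```

The trade-off is that the anchored version validates a whole string, so it no longer finds a number embedded in a longer sentence.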

## 4. Use regular expressions to convert the dates below to the standardized year-month-day format.

```
02/04/19
02/05/19
02/06/19
02/07/19
02/08/19
02/09/19
02/10/19
```

In [74]:
# Create List
text_string = ['02/04/19',
               '02/05/19',
               '02/06/19',
               '02/07/19',
               '02/08/19',
               '02/09/19',
               '02/10/19']

text_string

['02/04/19',
 '02/05/19',
 '02/06/19',
 '02/07/19',
 '02/08/19',
 '02/09/19',
 '02/10/19']

In [75]:
# Turn list into Series
text_series = pd.Series(text_string)
text_series

0    02/04/19
1    02/05/19
2    02/06/19
3    02/07/19
4    02/08/19
5    02/09/19
6    02/10/19
dtype: object

In [76]:
# Replace Groups (2digits)/(2digits)/(2digits) with 20(Group3)-(Group1)-(Group2)
text_series.str.replace(r'(\d{2})/(\d{2})/(\d{2})', r'20\3-\1-\2', regex=True)

0    2019-02-04
1    2019-02-05
2    2019-02-06
3    2019-02-07
4    2019-02-08
5    2019-02-09
6    2019-02-10
dtype: object
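
The same substitution can be sketched without pandas using ```re.sub``` on plain strings. This assumes, as the exercise data suggests, that the input is in MM/DD/YY order and that every year falls in the 2000s. The ```\g<3>``` form is used for the backreferences so the group number cannot be confused with the literal ```20``` in front of it.

```python
import re

dates = ['02/04/19', '02/05/19', '02/10/19']

# Assumes MM/DD/YY input and years in the 2000s, as in the exercise data
iso_dates = [
    re.sub(r'(\d{2})/(\d{2})/(\d{2})', r'20\g<3>-\g<1>-\g<2>', date)
    for date in dates
]

print(iso_dates)   # ['2019-02-04', '2019-02-05', '2019-02-10']
```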

## 5. Write a regex to extract the various parts of these logfile lines:

```
GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
```

In [78]:
# Create Series
text_Series = pd.Series([
    'GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58',
    'POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58',
    'GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58',
])

text_Series

0    GET /api/v1/sales?page=86 [16/Apr/2019:193452+...
1    POST /users_accounts/file-upload [16/Apr/2019:...
2    GET /api/v1/items?page=3 [16/Apr/2019:193453+0...
dtype: object

```(?P<name>...)``` is a named capturing group: ```?P``` marks the group as named, and the text between ```<``` and ```>``` is the name under which the captured text is stored.



#### Start of string ```^```, start group ```(```, name the group ```?P<method>```, match either ```GET``` or ```POST``` with ```GET|POST```, end group ```)```
* ```^(?P<method>GET|POST)```

#### matches any whitespace ```\s``` one or more times ```+```
* ```\s+```
    
#### start group ```(```, name the group ```?P<path>```, match any character ```.``` zero or more times ```*```, as few times as possible (lazy) ```?```, end group ```)```
* ```(?P<path>.*?)```

#### matches any whitespace ```\s``` one or more times ```+```
* ```\s+```

#### literal open bracket ```\[```, start group ```(```, name the group ```?P<timestamp>```, match any character ```.``` zero or more times ```*```, as few times as possible (lazy) ```?```, end group ```)```, literal close bracket ```\]```
* ```\[(?P<timestamp>.*?)\]```

#### matches any whitespace ```\s``` one or more times ```+```
* ```\s+```

#### start group ```(```, name the group ```?P<http_version>```, match any character ```.``` zero or more times ```*```, as few times as possible (lazy) ```?```, end group ```)```
* ```(?P<http_version>.*?)```

#### matches any whitespace ```\s``` one or more times ```+```
* ```\s+```

#### literal open curly brace ```\{```, start group ```(```, name the group ```?P<status>```, match a digit ```\d``` one or more times ```+```, end group ```)```, literal close curly brace ```\}```
* ```\{(?P<status>\d+)\}```

#### matches any whitespace ```\s``` one or more times ```+```
* ```\s+```

#### start group ```(```, name the group ```?P<bytes>```, match a digit ```\d``` one or more times ```+```, end group ```)```
* ```(?P<bytes>\d+)```

#### matches any whitespace ```\s``` one or more times ```+```
* ```\s+```

#### literal double quote ```"```, start group ```(```, name the group ```?P<user_agent>```, match any character ```.``` zero or more times ```*``` (greedy), end group ```)```, literal double quote ```"```
* ```"(?P<user_agent>.*)"```

#### matches any whitespace ```\s``` one or more times ```+```
* ```\s+```

#### start group ```(```, name the group ```?P<ip>```, match any character ```.``` zero or more times ```*```, end group ```)```, end of string ```$```
* ```(?P<ip>.*)$```


In [79]:
logfile_re = r'''
^(?P<method>GET|POST)
\s+
(?P<path>.*?)
\s+
\[(?P<timestamp>.*?)\]
\s+
(?P<http_version>.*?)
\s+
\{(?P<status>\d+)\}
\s+
(?P<bytes>\d+)
\s+
"(?P<user_agent>.*)"
\s+
(?P<ip>.*)$
'''

In [80]:
# Extract each named group into its own column; re.VERBOSE lets the pattern span multiple lines
text_Series.str.extract(logfile_re, flags=re.VERBOSE)

| | method | path | timestamp | http_version | status | bytes | user_agent | ip |
|---|---|---|---|---|---|---|---|---|
| 0 | GET | /api/v1/sales?page=86 | 16/Apr/2019:193452+0000 | HTTP/1.1 | 200 | 510348 | python-requests/2.21.0 | 97.105.19.58 |
| 1 | POST | /users_accounts/file-upload | 16/Apr/2019:193452+0000 | HTTP/1.1 | 201 | 42 | User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; ... | 97.105.19.58 |
| 2 | GET | /api/v1/items?page=3 | 16/Apr/2019:193453+0000 | HTTP/1.1 | 429 | 3561 | python-requests/2.21.0 | 97.105.19.58 |
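
For a single log line, the same verbose pattern can be sketched with the standard library alone. Compiling with ```re.VERBOSE``` and calling ```groupdict()``` on the match returns the named groups as a dictionary. The variable names ```line``` and ```parts``` are only for illustration.

```python
import re

logfile_re = re.compile(r'''
^(?P<method>GET|POST)
\s+
(?P<path>.*?)
\s+
\[(?P<timestamp>.*?)\]
\s+
(?P<http_version>.*?)
\s+
\{(?P<status>\d+)\}
\s+
(?P<bytes>\d+)
\s+
"(?P<user_agent>.*)"
\s+
(?P<ip>.*)$
''', re.VERBOSE)

line = 'GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58'

# groupdict() maps each group name to the text it captured
parts = logfile_re.search(line).groupdict()
print(parts['method'], parts['status'], parts['ip'])   # GET 200 97.105.19.58
```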


<div class="alert alert-danger"> 


# Bonus Exercise
You can find a list of words on your Mac at ```/usr/share/dict/words```.  

Use this file to answer the following questions:
- How many words have at least 3 vowels?
- How many words have at least 3 vowels in a row?
- How many words have at least 4 consonants in a row?
- How many words start and end with the same letter?
- How many words start and end with a vowel?
- How many words contain the same letter 3 times in a row?
- What other interesting patterns in words can you find?
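
One possible starting point for these questions is sketched below. It is not a reference solution, and it makes several assumptions that should be adjusted to taste: ```y``` is treated as a consonant, matching ignores case, and a one-letter word counts as starting and ending with the same letter. The helper names ```load_words``` and ```count_matches``` are hypothetical, and the small ```sample_words``` list stands in for the real word file so the sketch runs anywhere.

```python
import re

def load_words(path='/usr/share/dict/words'):
    # Read one word per line, skipping blank lines
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def count_matches(pattern, words):
    # Count the words that contain at least one match of pattern, ignoring case
    regex = re.compile(pattern, re.IGNORECASE)
    return sum(1 for word in words if regex.search(word))

patterns = {
    'at least 3 vowels': r'[aeiou].*[aeiou].*[aeiou]',
    'at least 3 vowels in a row': r'[aeiou]{3}',
    'at least 4 consonants in a row': r'[b-df-hj-np-tv-z]{4}',
    'start and end with the same letter': r'^(\w)(.*\1)?$',
    'start and end with a vowel': r'^[aeiou](.*[aeiou])?$',
    'same letter 3 times in a row': r'(\w)\1{2}',
}

# Stand-in data; on a machine with the file, use words = load_words() instead
sample_words = ['queue', 'strength', 'area', 'bookkeeper', 'aaa', 'dog', 'a']

for description, pattern in patterns.items():
    print(description, count_matches(pattern, sample_words))
```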