# Regex Exercises

<hr style="border:2px solid gray">

In [1]:
#regex import
import re

#standard ds import
import pandas as pd
import numpy as np

Do your work for this exercise in a file named `regex_exercises`.

1. Write a function named `is_vowel`. It should accept a string as input and use
   a regular expression to determine if the passed string is a vowel. While not
   explicitly mentioned in the lesson, you can treat the result of `re.search` as
   a boolean value that indicates whether or not the regular expression matches
   the given string.

2. Write a function named `is_valid_username` that accepts a string as input. A
   valid username starts with a lowercase letter, and only consists of lowercase
   letters, numbers, or the `_` character. It should also be no longer than 32
   characters. The function should return either `True` or `False` depending on
   whether the passed string is a valid username.

        >>> is_valid_username('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')
        False
        >>> is_valid_username('codeup')
        True
        >>> is_valid_username('Codeup')
        False
        >>> is_valid_username('codeup123')
        True
        >>> is_valid_username('1codeup')
        False

3. Write a regular expression to capture phone numbers. It should match all of
   the following:

        (210) 867 5309
        +1 210.867.5309
        867-5309
        210-867-5309

4. Use regular expressions to convert the dates below to the standardized year-month-day format.

        02/04/19
        02/05/19
        02/06/19
        02/07/19
        02/08/19
        02/09/19
        02/10/19

5. Write a regex to extract the various parts of these logfile lines:

        GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
        POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
        GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58


**Bonus Exercise**

You can find a list of words on your Mac at `/usr/share/dict/words`. Use this
file to answer the following questions:

    - How many words have at least 3 vowels?
    - How many words have at least 3 vowels in a row?
    - How many words have at least 4 consonants in a row?
    - How many words start and end with the same letter?
    - How many words start and end with a vowel?
    - How many words contain the same letter 3 times in a row?
    - What other interesting patterns in words can you find?

<div class="alert alert-block alert-info">
<b>Remember:</b> 
<br>
    
- <b>^</b>: anchors the match to the beginning of the string
- <b>$</b>: anchors the match to the end of the string
</div>

<hr style="border:2px solid gray">

### #1. Write a function named ```is_vowel```

<div class="alert alert-block alert-info">
<b>Note:</b> 
<br>
    
- <b>re.search</b>: scans the string and returns a match object for the <u>first</u> location where the pattern matches, or <b>None</b> if there is no match

- <b>re.findall</b>: returns a list of <u>every</u> non-overlapping match, in the order found while scanning left to right

</div>
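
A minimal sketch of the difference between the two calls, using a made-up sample string (the variable names are illustrative, not from the lesson):

```python
import re

text = 'regular expressions'  # hypothetical sample input

# re.search stops at the first match and returns a match object (or None).
first = re.search(r'[aeiou]', text)
print(first.group(), first.start())   # e 1

# re.findall keeps scanning and returns every non-overlapping match as a list.
every = re.findall(r'[aeiou]', text)
print(every)   # ['e', 'u', 'a', 'e', 'e', 'i', 'o']

# When nothing matches, re.search returns None, which is falsy.
print(re.search(r'[aeiou]', 'rhythm'))   # None
```

Because a match object is truthy and `None` is falsy, the result of `re.search` can be used directly in an `if` statement or passed to `bool`.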

<b>Break it down</b>:
    
- The <b>r</b> at the beginning of the expression tells Python to interpret the string as a "raw" string, so backslashes are passed to the regex engine unchanged.

- The <b>^</b> is an anchor that matches the beginning of the string (or of each line when <b>re.MULTILINE</b> is set).

- The <b>[aeiou]</b> part of the expression is a character class that matches any one character that is either an "a", "e", "i", "o", or "u".

- The <b>$</b> at the end of the expression is an anchor that matches the end of the string, or just before a trailing newline.
<br>

- Putting all of these pieces together, the regular expression <b>r'^[aeiou]$'</b> will match only a string that consists of a single vowel character (either "a", "e", "i", "o", or "u") and nothing else.
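
To see why the anchors matter, here is a small sketch comparing an anchored and an unanchored version of the pattern. The constant names are made up for illustration, and `re.fullmatch` is shown only as an alternative to explicit anchors:

```python
import re

UNANCHORED = r'[aeiou]'     # hypothetical name: matches a vowel anywhere
ANCHORED = r'^[aeiou]$'     # hypothetical name: the whole string is one vowel

# Without anchors, re.search succeeds as soon as any vowel appears anywhere.
print(bool(re.search(UNANCHORED, 'codeup')))   # True

# With anchors, the whole string must be exactly one vowel.
print(bool(re.search(ANCHORED, 'codeup')))     # False
print(bool(re.search(ANCHORED, 'e')))          # True

# re.fullmatch applies the pattern to the entire string without explicit anchors.
print(bool(re.fullmatch(UNANCHORED, 'e')))       # True
print(bool(re.fullmatch(UNANCHORED, 'codeup')))  # False
```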

In [2]:
def is_vowel(x):
    '''
    This function takes in a string
    returns true if a vowel is entered
    returns false otherwise
    '''
    regexp = r'^[aeiou]$'
    
    if re.search(regexp, x, re.IGNORECASE):
        return True
    else:
        return False

In [3]:
def is_vowel2(x):
    '''
    This function takes in a string
    returns true if a vowel is entered
    returns false otherwise
    '''
    if re.search(r'^[aeiouAEIOU]$', x):
        return True
    else:
        return False

<div class="alert alert-block alert-info">
<b>Note:</b> 
<br>
    
There are two options here. We either use <b>re.IGNORECASE</b> and include only [aeiou] in our expression OR we use [AEIOUaeiou] as our expression. Using [aeiou] without <b>re.IGNORECASE</b> will only match lowercase vowels. 
</div>
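
The two options can be compared side by side. This is a small sketch with made-up variable names, assuming single-character inputs:

```python
import re

with_flag = re.compile(r'^[aeiou]$', re.IGNORECASE)   # option 1: flag
explicit = re.compile(r'^[AEIOUaeiou]$')              # option 2: both cases listed
lower_only = re.compile(r'^[aeiou]$')                 # neither: lowercase only

for letter in ['a', 'E', 'b']:
    print(letter,
          bool(with_flag.search(letter)),
          bool(explicit.search(letter)),
          bool(lower_only.search(letter)))
# a True True True
# E True True False
# b False False False
```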

In [4]:
#another way to do the same thing (no ignorecase)
#in this version we are assigning vowel_re first, then putting it into our return statement
def is_vowel3(string):
    vowel_re = r'^[AEIOUaeiou]$'
    
    return bool(re.search(vowel_re, string))

In [5]:
#test the function
is_vowel('A')

True

In [6]:
#test the function
is_vowel('a')

True

In [7]:
#test the function
is_vowel('B')

False

In [8]:
#test the function
is_vowel('Codeup')

False

In [9]:
#test the function
is_vowel('apple')

False

In [10]:
#test the function
is_vowel('aaa')

False

<hr style="border:1px solid black">

### #2. Write a function named ```is_valid_username```

In [11]:
def is_valid_username(x):
    '''
    This function takes in a username 
    returns true if the username begins with a lowercase letter, 
    is no longer than 32 characters (this pattern also requires at least 2) 
    and only contains numbers, lowercase letters or underscores
    returns false otherwise
    '''
    username_re = r'^[a-z][a-z0-9_]{1,31}$'
    if re.search(username_re, x):
        return True
    else:
        return False

In [12]:
def is_valid_username2(string):
    username_re = r'^[a-z][a-z0-9_]{,31}$'
    
    return bool(re.search(username_re, string))

In [13]:
#test the function
assert is_valid_username2('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa') == False
assert is_valid_username2('codeup') == True
assert is_valid_username2('Codeup') == False
assert is_valid_username2('codeup123') == True
assert is_valid_username2('1codeup') == False
assert is_valid_username2('code_up') == True

<b>Break it down</b>:
- As before, the <b>r</b> tells Python it's a "raw" string and the <b>^</b> is an anchor.

- The [a-z] part of the expression is a character class that matches any lowercase letter from "a" to "z".

- The [a-z0-9_] part of the expression is another character class that matches any lowercase letter from "a" to "z", any digit from "0" to "9", or an underscore character "_".

- The {1,31} part of the expression is a quantifier that matches the previous character class at least once, but no more than 31 times. This means that the string must start with exactly one character from the [a-z] character class, followed by 1 to 31 characters from the [a-z0-9_] character class, so the entire string must be between 2 and 32 characters long. The {,31} quantifier used in is_valid_username2 allows zero repetitions, so it also accepts a single lowercase letter, which the exercise's rules permit.

- The $ at the end of the expression is an anchor that matches the end of the string.
<br>

- Putting all of these pieces together, the regular expression r'^[a-z][a-z0-9_]{1,31}$' will match any string of lowercase letters, digits, and underscores between 2 and 32 characters long. The string must start with a lowercase letter and cannot contain any whitespace characters or other special characters.

<div class="alert alert-block alert-info">
<b>Note:</b> 
<br>
    
The reason we have [a-z] <b>and</b> [a-z0-9_] is because we are saying the username must start with a letter.
</div>
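
The quantifier bounds decide which lengths are accepted. The sketch below compares `{1,31}` with `{,31}` on a few boundary cases; the constant names are made up for illustration:

```python
import re

AT_LEAST_TWO = re.compile(r'^[a-z][a-z0-9_]{1,31}$')   # 2 to 32 characters
ONE_OR_MORE = re.compile(r'^[a-z][a-z0-9_]{,31}$')     # 1 to 32 characters

for name in ['a', 'ab', 'a' * 32, 'a' * 33]:
    print(len(name),
          bool(AT_LEAST_TWO.search(name)),
          bool(ONE_OR_MORE.search(name)))
# 1 False True
# 2 True True
# 32 True True
# 33 False False
```

The only disagreement is the one-character username, which `{1,31}` rejects and `{,31}` accepts.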

In [14]:
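#a looser variant: \w also matches uppercase letters (and non-ASCII word characters),
#so this accepts names such as 'cODEUP' that the exercise's rules reject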
def is_valid_username3(x):
    username_re = r'^[a-z]\w{,31}$'
    
    if re.search(username_re, x):
        return True
    else:
        return False
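
Note that `\w` is broader than `[a-z0-9_]`. A short sketch of where the two character classes disagree (the constant names and test inputs are made up for illustration):

```python
import re

STRICT = re.compile(r'^[a-z][a-z0-9_]{,31}$')   # lowercase letters, digits, underscore
LOOSE = re.compile(r'^[a-z]\w{,31}$')           # \w also allows uppercase and non-ASCII

for name in ['code_up', 'cODEUP', 'cod\u00e9']:
    print(name, bool(STRICT.search(name)), bool(LOOSE.search(name)))
# code_up True True
# cODEUP False True
# codé False True
```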

```is_valid_username```('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')
<br>
output: False

In [15]:
#what's going on here
len('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')

33

In [16]:
#test function
is_valid_username('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')

#has more than 32 characters, so it's invalid

False

```is_valid_username```('codeup')
<br>
output: True

In [17]:
#test function
is_valid_username('codeup')

True

```is_valid_username```('Codeup')
<br>
output: False

In [18]:
#test function
is_valid_username('Codeup')

#starts with a capital letter, which is invalid

False

```is_valid_username```('codeup123')
<br>
output: True

In [19]:
#test function
is_valid_username('codeup123')

True

```is_valid_username```('1codeup')
<br>
output: False

In [20]:
#test function
is_valid_username('1codeup')

#starts with 1 (which is not a letter), so it'd be False

False

```is_valid_username```('code_up')
<br>
output: True

In [21]:
#test function
is_valid_username('code_up')

True

<hr style="border:1px solid black">

### #3. Write a regular expression to capture phone numbers.

In [22]:
#all the numbers
regexp = r'.*\d{,3}.*\d{3}.*\d{4}'
subject = '(210) 867 5309'

re.findall(regexp, subject)

['(210) 867 5309']

In [23]:
def is_phone_number(x):
    '''
    This function takes in a string and looks for a loose phone number pattern:
    up to 3 digits (optional), then 3 digits, then 4 digits,
    with any characters allowed before and between the groups.
    If the pattern is found, returns true
    otherwise returns false.
    '''
    if re.search(r'.*\d{,3}.*\d{3}.*\d{4}', x):
        return True
    else:
        return False

In [24]:
def is_phone_number2(string):
    phone_number_re = r"(\+?\d+)?.?(\(?\d{3}\)?)?.?\d{3}.?\d{4}"
    
    return bool(re.search(phone_number_re, string))


In [25]:
#test the function
assert is_phone_number2('(210) 867 5309') == True
assert is_phone_number2('+1 210.867.5309') == True
assert is_phone_number2('867-5309') == True
assert is_phone_number2('210-867-5309') == True

In [26]:
numbers = ['(210) 867 5309',
'+1 210.867.5309',
'867-5309',
'210-867-5309',
'abc']

regexp = r'.*?\d+\D*?\d+.\d+$'

for number in numbers:
    print(re.findall(regexp, number))

['(210) 867 5309']
['+1 210.867.5309']
['867-5309']
['210-867-5309']
[]
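
The patterns above are deliberately permissive, so they also accept strings that merely contain seven digits. As one possible tightening, here is a sketch of a stricter pattern. `PHONE_RE` and `is_strict_phone_number` are hypothetical names, and the set of allowed separators (space, dot, hyphen) is an assumption based on the four sample numbers:

```python
import re

# Hypothetical stricter pattern; separators are assumed to be space, dot or hyphen.
PHONE_RE = re.compile(r'''
    ^
    (?:\+?\d{1,3}[\s.-]?)?     # optional country code, e.g. +1
    (?:\(\d{3}\)|\d{3})?       # optional area code, with or without parentheses
    [\s.-]?                    # optional separator
    \d{3}                      # first three digits of the local number
    [\s.-]?                    # optional separator
    \d{4}                      # last four digits
    $
''', re.VERBOSE)

def is_strict_phone_number(string):
    return bool(PHONE_RE.search(string))

samples = ['(210) 867 5309', '+1 210.867.5309', '867-5309', '210-867-5309']
print([is_strict_phone_number(s) for s in samples])   # [True, True, True, True]

# The loose pattern accepts surrounding junk; the stricter one does not.
loose = r'.*\d{,3}.*\d{3}.*\d{4}'
print(bool(re.search(loose, 'x1234567y')))    # True
print(is_strict_phone_number('x1234567y'))    # False
```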


<b>(210) 867 5309</b>

In [27]:
#test the function
is_phone_number('(210) 867 5309')

True

In [28]:
#test the function
is_phone_number2('(210) 867 5309')

True

<b>+1 210.867.5309</b>

In [29]:
#test the function
is_phone_number('+1 210.867.5309')

True

<b>867-5309</b>

In [30]:
#test the function
is_phone_number('867-5309')

True

<b>210-867-5309</b>

In [31]:
#test the function
is_phone_number('210-867-5309')

True

<hr style="border:1px solid black">

### #4. Use regular expressions to convert the dates below to the standardized year-month-day format.

        02/04/19
        02/05/19
        02/06/19
        02/07/19
        02/08/19
        02/09/19
        02/10/19

In [32]:
date_reg = r'(\d+)/(\d+)/(\d+)'
dates = pd.Series(['02/04/19', '02/05/19', '02/06/19', '02/07/19', '02/08/19', '02/09/19', '02/10/19'])

#1st value is now 2nd
#2nd value is now 3rd 
#3rd value is now 1st
[re.sub(date_reg, r'\3/\1/\2' ,date) for date in dates]

['19/02/04',
 '19/02/05',
 '19/02/06',
 '19/02/07',
 '19/02/08',
 '19/02/09',
 '19/02/10']

In [33]:
#or another option is replace
dates = pd.Series(['02/04/19', '02/05/19', '02/06/19', '02/07/19', '02/08/19', '02/09/19', '02/10/19'])

#1st value is now 2nd
#2nd value is now 3rd 
#3rd value is now 1st
#add '20' to the beginning
#regex=True is required in pandas 2.0+, where str.replace defaults to literal matching
dates.str.replace(r'(\d+)/(\d+)/(\d+)', r'20\3-\1-\2', regex=True)


0    2019-02-04
1    2019-02-05
2    2019-02-06
3    2019-02-07
4    2019-02-08
5    2019-02-09
6    2019-02-10
dtype: object

In [34]:
dates = ["02/04/19",
             "02/05/19",
             "02/06/19",
             "02/07/19",
             "02/08/19",
             "02/09/19",
             "02/10/19",
            ]

dates

['02/04/19',
 '02/05/19',
 '02/06/19',
 '02/07/19',
 '02/08/19',
 '02/09/19',
 '02/10/19']

In [35]:
dates = pd.Series(dates)
dates

0    02/04/19
1    02/05/19
2    02/06/19
3    02/07/19
4    02/08/19
5    02/09/19
6    02/10/19
dtype: object

In [36]:
dates.str.replace(r'(\d{2})/(\d{2})/(\d{2})', r'20\3-\1-\2', regex=True)

0    2019-02-04
1    2019-02-05
2    2019-02-06
3    2019-02-07
4    2019-02-08
5    2019-02-09
6    2019-02-10
dtype: object

In [37]:
#Option #3
regexp = r'(\d{2})/(\d{2})/(\d{2})'

for date in dates:
    print(re.sub(regexp, r'20\3-\1-\2' ,date))

2019-02-04
2019-02-05
2019-02-06
2019-02-07
2019-02-08
2019-02-09
2019-02-10
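
The numbered backreferences can be hard to read. As an alternative sketch, named groups make the reordering explicit, and `datetime.strptime` can do the same conversion while also rejecting impossible dates. The function names below are made up for illustration, and prefixing `20` assumes every date falls in the 2000s:

```python
import re
from datetime import datetime

DATE_RE = re.compile(r'(?P<month>\d{2})/(?P<day>\d{2})/(?P<year>\d{2})')

def to_iso_regex(date):
    # Reorder the named groups; assumes all years are 20xx.
    return DATE_RE.sub(r'20\g<year>-\g<month>-\g<day>', date)

def to_iso_strptime(date):
    # Parse as month/day/two-digit-year, then format as year-month-day.
    return datetime.strptime(date, '%m/%d/%y').strftime('%Y-%m-%d')

print(to_iso_regex('02/04/19'))      # 2019-02-04
print(to_iso_strptime('02/04/19'))   # 2019-02-04
```

The regex version only rearranges text, so it would happily convert `13/45/19`, while `strptime` raises a `ValueError` for that input.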


<hr style="border:1px solid black">

### #5. Write a regex to extract the various parts of these logfile lines:

        GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
        
        POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
        
        GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58

<div class="alert alert-block alert-info">
<b>Note:</b> 
<br>

There are multiple parts to this:

1. <b>method</b> --> GET 
2. <b>path</b> --> /api/v1/sales?page=86
3. <b>timestamp</b> --> [16/Apr/2019:193452+0000]
4. <b>http_version</b> --> HTTP/1.1 
5. <b>status</b> --> {200}
6. <b>bytes</b> --> 510348
7. <b>user_agent</b> --> "python-requests/2.21.0"
8. <b>ip</b> --> 97.105.19.58
<br>

We will want to call out each component for this exercise
</div>

In [38]:
logfile_re = r'''
^(?P<method>GET|POST)
\s+
(?P<path>.*?)
\s+
\[(?P<timestamp>.*?)\]
\s+
(?P<http_version>.*?)
\s+
\{(?P<status>\d+)\}
\s+
(?P<bytes>\d+)
\s+
"(?P<user_agent>.*)"
\s+
(?P<ip>.*)$
'''

lines = pd.Series([
    'GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58',
    'POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58',
    'GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58',
])

lines.str.extract(logfile_re, re.VERBOSE)

|   | method | path | timestamp | http_version | status | bytes | user_agent | ip |
|---|---|---|---|---|---|---|---|---|
| 0 | GET | /api/v1/sales?page=86 | 16/Apr/2019:193452+0000 | HTTP/1.1 | 200 | 510348 | python-requests/2.21.0 | 97.105.19.58 |
| 1 | POST | /users_accounts/file-upload | 16/Apr/2019:193452+0000 | HTTP/1.1 | 201 | 42 | User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; ... | 97.105.19.58 |
| 2 | GET | /api/v1/items?page=3 | 16/Apr/2019:193453+0000 | HTTP/1.1 | 429 | 3561 | python-requests/2.21.0 | 97.105.19.58 |


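<b>```re.VERBOSE```</b>: Ignores whitespace in the regular expression (except inside a character class or when escaped with a backslash) and treats everything from a `#` to the end of the line as a comment. This makes long regular expressions more readable, because each piece can sit on its own line with a short explanation.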
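
As an illustration of verbose mode with inline comments, here is a plain `re` sketch that parses the first logfile line into a dictionary. `LOG_RE` is a made-up name, and the tighter sub-patterns (`\S+`, `[^"]*`, the IPv4 shape) are assumptions about the log format rather than part of the original solution:

```python
import re

LOG_RE = re.compile(r'''
    ^(?P<method>GET|POST)\s+          # HTTP method
    (?P<path>\S+)\s+                  # request path
    \[(?P<timestamp>[^\]]+)\]\s+      # timestamp, brackets dropped
    (?P<http_version>\S+)\s+          # protocol version
    \{(?P<status>\d+)\}\s+            # status code, braces dropped
    (?P<bytes>\d+)\s+                 # response size in bytes
    "(?P<user_agent>[^"]*)"\s+        # user agent, quotes dropped
    (?P<ip>\d{1,3}(?:\.\d{1,3}){3})$  # IPv4 address
''', re.VERBOSE)

line = ('GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} '
        '510348 "python-requests/2.21.0" 97.105.19.58')

parts = LOG_RE.search(line).groupdict()
print(parts['method'], parts['status'], parts['ip'])   # GET 200 97.105.19.58
```

`groupdict()` returns the named groups as a dictionary, which mirrors the columns that `str.extract` produces in pandas.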

<hr style="border: 2px solid gray">

### Bonus: You can find a list of words on your mac at `/usr/share/dict/words`. Use this file to answer the following questions:

In [39]:
df = pd.read_csv("/usr/share/dict/words", header=None)
df.head(10)

|   | 0 |
|---|---|
| 0 | A |
| 1 | a |
| 2 | aa |
| 3 | aal |
| 4 | aalii |
| 5 | aam |
| 6 | Aani |
| 7 | aardvark |
| 8 | aardwolf |
| 9 | Aaron |


In [40]:
#rename the only column as 'words'
df.columns = ["words"]

In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 235976 entries, 0 to 235975
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   words   235974 non-null  object
dtypes: object(1)
memory usage: 1.8+ MB


<div class="alert alert-block alert-success">
<b>Takeaways:</b>
<br>

We have a dataframe with one column [words] that contains 235,976 rows, of which 235,974 are non-null.
</div>
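
The gap between the row count and the non-null count is probably a side effect of `read_csv`, which by default parses strings such as `NA` and `null` as missing values (passing `keep_default_na=False` turns that off). A plain-Python read avoids the issue entirely. The sketch below uses a small in-memory sample as a stand-in for the real file, and `load_words` is a hypothetical helper:

```python
import io

def load_words(handle):
    """Return one word per non-empty line, with no missing-value parsing."""
    return [line.strip() for line in handle if line.strip()]

# Hypothetical stand-in for open('/usr/share/dict/words')
sample = io.StringIO("A\na\nnull\nNA\naardvark\n")

words = load_words(sample)
print(words)        # ['A', 'a', 'null', 'NA', 'aardvark']
print(len(words))   # 5
```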

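<b>A. How many words have at least 3 vowels?</b>
- answer: 130,955 (this is the count for the `> 3` comparison below, i.e. words with 4 or more vowels; `>= 3` is needed for "at least 3")

In [42]:
#count the vowels in each word
#note: `> 3` keeps words with 4 or more vowels; use `>= 3` for "at least 3"
three_or_more_vowels = df.words.str.count(r"[aeiouAEIOU]") > 3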

In [43]:
df[df.words.str.count(r"[aeiouAEIOU]") >0].sample(5)

|   | words |
|---|---|
| 186796 | spindletail |
| 10535 | antiplanet |
| 216008 | unescapableness |
| 112548 | mawp |
| 158450 | Prussianizer |


In [44]:
#count the words selected by the mask above (more than 3 vowels)
len(df[three_or_more_vowels])

130955
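
Note that `> 3` selects words with four or more vowels, while "at least 3" corresponds to `>= 3`. A tiny sketch with a made-up word list shows the difference:

```python
import re

words = ['banana', 'queue', 'sky', 'idea', 'audio']   # hypothetical sample

# Count the vowels in each word.
vowel_counts = {w: len(re.findall(r'[aeiou]', w, re.IGNORECASE)) for w in words}
print(vowel_counts)
# {'banana': 3, 'queue': 4, 'sky': 0, 'idea': 3, 'audio': 4}

more_than_three = [w for w, n in vowel_counts.items() if n > 3]
at_least_three = [w for w, n in vowel_counts.items() if n >= 3]

print(more_than_three)   # ['queue', 'audio']
print(at_least_three)    # ['banana', 'queue', 'idea', 'audio']
```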

<b>B. How many words have at least 3 vowels in a row?</b>
- answer: 6,183

In [45]:
#character class of vowels repeated 3 times: at least 3 vowels in a row
df[df.words.str.count(r"[aeiouAEIOU]{3}") >0].sample(5)

|   | words |
|---|---|
| 12367 | arboraceous |
| 101503 | kehoeite |
| 125121 | nonabstemious |
| 131607 | oppositifolious |
| 151387 | portmanteau |


In [46]:
#how many are there
len(df[df.words.str.count(r"[aeiouAEIOU]{3}") >0])

6183

<b>C. How many words have at least 4 consonants in a row?</b>
- Answer: 19,242

In [47]:
#4 non-vowel characters in a row (the negated class also matches non-letters such as hyphens)
df[df.words.str.count(r"[^aeiouAEIOU]{4}") >0].sample(5)

|   | words |
|---|---|
| 127744 | nontypicalness |
| 161214 | Pythonomorpha |
| 187647 | sportsmanliness |
| 121706 | myopachynsis |
| 159637 | ptychopariid |


In [48]:
#how many are there
len(df[df.words.str.count(r"[^aeiouAEIOU]{4}") >0])

19242
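
The negated class `[^aeiouAEIOU]` means "anything that is not a vowel", which includes hyphens, digits and apostrophes as well as consonants. If only letters should count, one option is to spell out the consonant ranges. This is a sketch with made-up names, and it treats `y` as a consonant:

```python
import re

NOT_VOWEL = re.compile(r'[^aeiouAEIOU]{4}')
# Hypothetical explicit consonant class: every letter except a, e, i, o, u.
CONSONANT = re.compile(r'[b-df-hj-np-tv-z]{4}', re.IGNORECASE)

for word in ['strength', 'a---b', 'banana']:
    print(word, bool(NOT_VOWEL.search(word)), bool(CONSONANT.search(word)))
# strength True True
# a---b True False
# banana False False
```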

<b>D. How many words start and end with the same letter?</b>
- Answer: 9,946

In [49]:
df[df.words.str.count(r'^[a-z]$|^([a-z]).*\1$') >0].sample(5)

|   | words |
|---|---|
| 43718 | cosmologic |
| 170791 | romper |
| 196789 | syncladous |
| 78080 | glowing |
| 116694 | minium |


In [50]:
len(df[df.words.str.count(r'^[a-z]$|^([a-z]).*\1$') >0])

9946

In [51]:
#let's create a function to do this for any word
def same_letter(string):
    regex = r'^[a-z]$|^([a-z]).*\1$'
    
    return bool(re.search(regex, string))

In [52]:
same_letter('racecar')

True

In [53]:
same_letter('apple')

False
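
The `[a-z]` class only covers lowercase letters, so capitalized words such as proper nouns are never counted by this pattern. If they should be, a case-insensitive variant is one option. `same_letter_ci` is a hypothetical name, and whether `A` and `a` count as "the same letter" is a judgment call:

```python
import re

def same_letter_lower(string):
    # Same pattern as above: lowercase only.
    return bool(re.search(r'^[a-z]$|^([a-z]).*\1$', string))

def same_letter_ci(string):
    # With re.IGNORECASE the backreference \1 also matches the other case.
    return bool(re.search(r'^[a-z]$|^([a-z]).*\1$', string, re.IGNORECASE))

for word in ['racecar', 'Anna', 'A', 'apple']:
    print(word, same_letter_lower(word), same_letter_ci(word))
# racecar True True
# Anna False True
# A False True
# apple False False
```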

<b>E. How many words start and end with a vowel?</b>
- Answer: 12,356

In [54]:
df[df.words.str.count(r'^[aeiou].*[aeiou]$') >0].sample(5)

|   | words |
|---|---|
| 64973 | esteemable |
| 3379 | aevia |
| 984 | accompletive |
| 93084 | inconfutable |
| 97267 | intranslatable |


In [55]:
len(df[df.words.str.count(r'^[aeiou].*[aeiou]$') >0])

12356

In [56]:
#let's create a function to do this for any word
def vowel_start_end(string):
    regex = r'^[aeiou].*[aeiou]$'
    
    return bool(re.search(regex, string))

In [57]:
vowel_start_end('apple')

True

In [58]:
vowel_start_end('car')

False
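
This pattern needs at least two characters and only recognizes lowercase vowels, so the one-letter word `a` and capitalized words are excluded. A possible variant that covers both cases is sketched below. `vowel_start_end_ci` is a hypothetical name, and counting a single vowel as "starts and ends with a vowel" is an assumption:

```python
import re

def vowel_start_end_ci(string):
    # The optional group lets a single vowel match on its own.
    regex = r'^[aeiou](?:.*[aeiou])?$'
    return bool(re.search(regex, string, re.IGNORECASE))

for word in ['apple', 'a', 'Ohio', 'at', 'car']:
    print(word, vowel_start_end_ci(word))
# apple True
# a True
# Ohio True
# at False
# car False
```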

<b>F. How many words contain the same letter 3 times in a row?</b>
- Answer: 7

In [59]:
df[df.words.str.count(r'\b\w*(\w)\1\1\w*') >0].sample(5)

|   | words |
|---|---|
| 231775 | whenceeer |
| 83037 | headmistressship |
| 78535 | goddessship |
| 25003 | bossship |
| 50660 | demigoddessship |


In [60]:
len(df[df.words.str.count(r'\b\w*(\w)\1\1\w*') >0])

7

In [61]:
#let's create a function to do this for any word
def three_in_row(string):
    regex = r'\b\w*(\w)\1\1\w*'
    
    return bool(re.search(regex, string))

In [62]:
three_in_row('string')

False

In [63]:
three_in_row('crosssection')

True
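
For a yes/no check, the leading `\b\w*` and trailing `\w*` are not needed, because `re.search` already looks anywhere in the string. A shorter sketch follows, with made-up function names. Since `\w` also matches digits and underscores, a letters-only version is shown as well:

```python
import re

def three_in_row_short(string):
    # Any word character followed by two copies of itself.
    return bool(re.search(r'(\w)\1\1', string))

def three_letters_in_row(string):
    # Letters only: a digit run such as '000' does not count.
    return bool(re.search(r'([A-Za-z])\1{2}', string))

for word in ['crosssection', 'bossship', 'string', '1000']:
    print(word, three_in_row_short(word), three_letters_in_row(word))
# crosssection True True
# bossship True True
# string False False
# 1000 True False
```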