<h2 id="exercises">Exercises</h2>
<p>Within your <code>codeup-data-science</code> directory, create a new repo named <code>natural-language-processing-exercises</code>. This will be where you do your work for this module. Create a repository on GitHub with the same name, and link your local repository to GitHub.</p>
<p>Save this work in your <code>natural-language-processing-exercises</code> repo. Then add, commit, and push your changes.</p>
<p>Unless a specific file extension is specified, you may do your work either in a
Python script (<code>.py</code>) or a Jupyter notebook (<code>.ipynb</code>).</p>
<p>Do your work for this exercise in a file named <code>regex</code>.</p>


In [1]:
import re
import pandas as pd

1. Write a function named <code>is_vowel</code>. It should accept a string as input and use
   a regular expression to determine if the passed string is a vowel. While not
   explicitly mentioned in the lesson, you can treat the result of <code>re.search</code> as
   a boolean value that indicates whether or not the regular expression matches
   the given string.</p>


In [2]:
re.search(r"^(a|e|i|o|u)$", "a", re.IGNORECASE)


<re.Match object; span=(0, 1), match='a'>

In [3]:
def is_vowel(string):
    return bool(re.search(r"^[aeiou]$", string, re.IGNORECASE))

In [4]:
is_vowel('k')

False

In [5]:
is_vowel('e')

True

In [6]:
is_vowel('eee')

False
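One edge case is worth knowing about: `$` also matches just before a trailing newline, so the `re.search` version accepts `"e\n"`. The sketch below is a possible variation, not the lesson's approach. It uses `re.fullmatch`, which requires the whole string to match; the name `is_vowel_fullmatch` is hypothetical.

```python
import re

def is_vowel_fullmatch(string):
    # fullmatch succeeds only when the entire string matches, so no ^ or $ is needed
    return re.fullmatch(r"[aeiou]", string, re.IGNORECASE) is not None

# "$" also matches just before a trailing newline, so the two versions differ here
search_result = bool(re.search(r"^[aeiou]$", "e\n", re.IGNORECASE))
fullmatch_result = is_vowel_fullmatch("e\n")
print(search_result, fullmatch_result)
```

Whether the stricter behaviour matters depends on where the input comes from; for strings read line by line from a file it may be the safer choice.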

2. Write a function named <code>is_valid_username</code> that accepts a string as input. A
   valid username starts with a lowercase letter, and only consists of lowercase
   letters, numbers, or the <code>_</code> character. It should also be no longer than 32
   characters. The function should return either <code>True</code> or <code>False</code> depending on
   whether the passed string is a valid username.</p>
<pre><code>&gt;&gt;&gt; is_valid_username('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')
False
&gt;&gt;&gt; is_valid_username('codeup')
True
&gt;&gt;&gt; is_valid_username('Codeup')
False
&gt;&gt;&gt; is_valid_username('codeup123')
True
&gt;&gt;&gt; is_valid_username('1codeup')
False
</code></pre>



In [8]:
re.search(r"^[a-z][a-z0-9_]{,31}", "string")


<re.Match object; span=(0, 6), match='string'>

In [13]:
def is_valid_username(string):
    pattern = r"^[a-z][a-z0-9_]{,31}$"
    return bool(re.search(pattern, string))

In [14]:
is_valid_username('codeup')

True

In [15]:
is_valid_username('codeup123')

True

In [16]:
is_valid_username('CodeupCodeup!')

False

In [17]:
is_valid_username('aaaCODEUPCODEUPaaaaaaaaaaaaaaaaaaaaaaaaaa')

False

In [18]:
is_valid_username('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')

False
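The examples from the exercise statement can be checked in one pass instead of cell by cell. The sketch below repeats the function so that it runs on its own, and adds a 32-character name to probe the length boundary. The `expected` dictionary is an illustrative helper, and `"a" * 33` stands in for the long example string.

```python
import re

def is_valid_username(string):
    # one lowercase letter, then up to 31 more lowercase letters, digits, or underscores
    pattern = r"^[a-z][a-z0-9_]{,31}$"
    return bool(re.search(pattern, string))

# username -> expected result; the first five mirror the exercise statement
expected = {
    "a" * 33: False,    # one character too long
    "codeup": True,
    "Codeup": False,    # starts with an uppercase letter
    "codeup123": True,
    "1codeup": False,   # starts with a digit
    "a" * 32: True,     # exactly at the length limit
}

for username, want in expected.items():
    assert is_valid_username(username) == want, username
print("all username checks passed")
```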

3. Write a regular expression to capture phone numbers. It should match all of
   the following:</p>
<pre><code>(210) 867 5309
+1 210.867.5309
867-5309
210-867-5309
</code></pre>



In [21]:
phone = ['(210) 867 5309', '+1 210.867.5309', '867-5309', '210-867-5309']

In [25]:
#phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

In [37]:
phonePattern = re.compile(r'''
                # don't match beginning of string, number can start anywhere
    (\d{3})?    # area code is optional; 3 digits when present (e.g. '800')
    \D*         # optional separator is any number of non-digits
    (\d{3})     # trunk is 3 digits (e.g. '555')
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits (e.g. '1212')
    \D*         # optional separator
    (\d*)       # extension is optional and can be any number of digits
    $           # end of string
    ''', re.VERBOSE)

In [38]:
phonePattern.search(phone[0]).groups()

('210', '867', '5309', '')

In [39]:
phonePattern.search(phone[1]).groups()

('210', '867', '5309', '')

In [40]:
phonePattern.search(phone[2]).groups()

(None, '867', '5309', '')

In [35]:
phonePattern.search(phone[3]).groups()

('210', '867', '5309', '')

Data frame approach from exercise review

In [42]:
# Another approach.
# \D*? matches zero or more characters that are not digits (including parentheses),
# which makes separators and literal characters such as "(" or ")" optional.
# The pattern is a raw string so that \+, \d and \D reach the regex engine unchanged.
phone_regex = re.compile(
r"""
^
(?P<country_code>\+\d+)?
\D*?
(?P<area_code>\d{3})?
\D*?
(?P<exchange_code>\d{3})
\D*?
(?P<line_number>\d{4})
""", re.VERBOSE)

In [43]:
df = pd.DataFrame()
df['number'] = [
    '(210) 867 5309',
    '+1 210.867.5309',
    '867-5309',
    '210-867-5309',
    '2108675309',
]

In [44]:
# extract turns named capture groups into dataframe columns
# NaNs for no match
df.number.str.extract(phone_regex)

,country_code,area_code,exchange_code,line_number
0,NaN,210,867,5309
1,+1,210,867,5309
2,NaN,NaN,867,5309
3,NaN,210,867,5309
4,NaN,210,867,5309


In [45]:
df = pd.concat([df, df.number.str.extract(phone_regex)], axis=1)
df

,number,country_code,area_code,exchange_code,line_number
0,(210) 867 5309,NaN,210,867,5309
1,+1 210.867.5309,+1,210,867,5309
2,867-5309,NaN,NaN,867,5309
3,210-867-5309,NaN,210,867,5309
4,2108675309,NaN,210,867,5309
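Once the parts are captured, they can be reassembled into a single format. The following is a minimal sketch using only `re`. It repeats the named-group pattern so that it runs on its own; `normalize_phone` is a hypothetical helper, and the dash-separated output format is an assumption, not something the exercise asks for.

```python
import re

phone_regex = re.compile(r"""
    ^
    (?P<country_code>\+\d+)?
    \D*?
    (?P<area_code>\d{3})?
    \D*?
    (?P<exchange_code>\d{3})
    \D*?
    (?P<line_number>\d{4})
    """, re.VERBOSE)

def normalize_phone(number):
    # hypothetical helper: returns e.g. '210-867-5309', or None when nothing matches
    match = phone_regex.search(number)
    if match is None:
        return None
    parts = [match[name] for name in ("area_code", "exchange_code", "line_number")]
    # groups that did not participate in the match are None and are skipped
    return "-".join(part for part in parts if part)

for number in ['(210) 867 5309', '+1 210.867.5309', '867-5309', '210-867-5309', '2108675309']:
    print(number, '->', normalize_phone(number))
```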


4. Use regular expressions to convert the dates below to the standardized year-month-day format.</p>
<pre><code>02/04/19
02/05/19
02/06/19
02/07/19
02/08/19
02/09/19
02/10/19
</code></pre>



In [55]:
dates = ['02/04/19', '02/05/19', '02/06/19', '02/07/19', '02/08/19', '02/09/19', '02/10/19']

df=pd.DataFrame({'old_format': dates})

In [56]:
df

,old_format
0,02/04/19
1,02/05/19
2,02/06/19
3,02/07/19
4,02/08/19
5,02/09/19
6,02/10/19


In [53]:
date_convert = re.compile(r'''
(?P<month>\d{1,2})/
(?P<day>\d{1,2})/
(?P<year>\d{2,4})
''', re.VERBOSE)

In [57]:
df = pd.concat([df, df.old_format.str.extract(date_convert)], axis=1)
df

,old_format,month,day,year
0,02/04/19,02,04,19
1,02/05/19,02,05,19
2,02/06/19,02,06,19
3,02/07/19,02,07,19
4,02/08/19,02,08,19
5,02/09/19,02,09,19
6,02/10/19,02,10,19


In [58]:
df['new_format'] = df.year + "/" + df.month + "/" + df.day 
df

,old_format,month,day,year,new_format
0,02/04/19,02,04,19,19/02/04
1,02/05/19,02,05,19,19/02/05
2,02/06/19,02,06,19,19/02/06
3,02/07/19,02,07,19,19/02/07
4,02/08/19,02,08,19,19/02/08
5,02/09/19,02,09,19,19/02/09
6,02/10/19,02,10,19,19/02/10
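The result above keeps the two-digit year and the slashes. If "standardized year-month-day" is read as the ISO 8601 form `YYYY-MM-DD`, a single `re.sub` with backreferences can do the conversion. This sketch assumes the input is month/day/year and that every year falls in the 2000s; neither is stated in the exercise.

```python
import re

dates = ['02/04/19', '02/05/19', '02/06/19', '02/07/19', '02/08/19', '02/09/19', '02/10/19']

# groups: 1 = month, 2 = day, 3 = two-digit year
# assumption: all years are 20xx, so "20" is prepended to the captured year
iso_dates = [re.sub(r'(\d{2})/(\d{2})/(\d{2})', r'20\3-\1-\2', date) for date in dates]
print(iso_dates)
```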


5. Write a regex to extract the various parts of these logfile lines:</p>
<pre><code>GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58
</code></pre>



Solution from exercise review

In [63]:
log_files = [
    '''GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58''',
    '''POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58''',
    '''GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58'''
]
log_files

['GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58',
 'POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58',
 'GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58']

In [64]:
log_pattern = re.compile(r'''
(?P<action>GET|POST) 
\s
(?P<path>/[/\w\-\?=]+)
\s
\[(?P<time>.+)\]
\s
(?P<http>HTTP/\d+\.\d+)
\s
\{(?P<code>\d+)\}
\s
(?P<bytes>\d+)
\s
"(?P<user>.+)"
\s
(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})
$
''', re.VERBOSE)

In [65]:
rows = [re.search(log_pattern, line).groupdict() for line in log_files]
rows

[{'action': 'GET',
  'path': '/api/v1/sales?page=86',
  'time': '16/Apr/2019:193452+0000',
  'http': 'HTTP/1.1',
  'code': '200',
  'bytes': '510348',
  'user': 'python-requests/2.21.0',
  'ip': '97.105.19.58'},
 {'action': 'POST',
  'path': '/users_accounts/file-upload',
  'time': '16/Apr/2019:193452+0000',
  'http': 'HTTP/1.1',
  'code': '201',
  'bytes': '42',
  'user': 'User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
  'ip': '97.105.19.58'},
 {'action': 'GET',
  'path': '/api/v1/items?page=3',
  'time': '16/Apr/2019:193453+0000',
  'http': 'HTTP/1.1',
  'code': '429',
  'bytes': '3561',
  'user': 'python-requests/2.21.0',
  'ip': '97.105.19.58'}]

In [66]:
df = pd.DataFrame(rows)
df

,action,path,time,http,code,bytes,user,ip
0,GET,/api/v1/sales?page=86,16/Apr/2019:193452+0000,HTTP/1.1,200,510348,python-requests/2.21.0,97.105.19.58
1,POST,/users_accounts/file-upload,16/Apr/2019:193452+0000,HTTP/1.1,201,42,User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; ...,97.105.19.58
2,GET,/api/v1/items?page=3,16/Apr/2019:193453+0000,HTTP/1.1,429,3561,python-requests/2.21.0,97.105.19.58
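Every value captured by the regex is a string. A possible next step, sketched here with only the standard library, is to convert the numeric fields and the timestamp. `convert_types` is a hypothetical helper, and the `strptime` format string is an assumption inferred from the three sample lines.

```python
from datetime import datetime, timezone

# one row copied from the groupdict() output above
row = {
    'action': 'GET',
    'path': '/api/v1/sales?page=86',
    'time': '16/Apr/2019:193452+0000',
    'http': 'HTTP/1.1',
    'code': '200',
    'bytes': '510348',
    'user': 'python-requests/2.21.0',
    'ip': '97.105.19.58',
}

def convert_types(row):
    # hypothetical helper: returns a copy with numeric and datetime fields converted
    converted = dict(row)
    converted['code'] = int(row['code'])
    converted['bytes'] = int(row['bytes'])
    # assumed layout: day/month abbreviation/year:HHMMSS followed by a UTC offset
    converted['time'] = datetime.strptime(row['time'], '%d/%b/%Y:%H%M%S%z')
    return converted

parsed = convert_types(row)
print(parsed['time'].isoformat(), parsed['code'], parsed['bytes'])
```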


<p><strong>Bonus Exercise</strong></p>
<p>You can find a list of words on your Mac at <code>/usr/share/dict/words</code>. Use this
   file to answer the following questions:</p>


In [68]:
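# keep_default_na=False stops pandas from reading words such as "null" or "nan" as missing values
df = pd.read_csv("/usr/share/dict/words", header=None, keep_default_na=False)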
df.columns = ["word"]
df.head()

,word
0,A
1,a
2,aa
3,aal
4,aalii


- How many words have at least 3 vowels?


In [82]:
df[df.word.str.count(r'[aeiou]', re.IGNORECASE) >= 3]


,word
4,aalii
6,Aani
7,aardvark
8,aardwolf
9,Aaron
...,...
235874,zymotically
235875,zymotize
235876,zymotoxic
235878,Zyrenian


- How many words have at least 3 vowels in a row?


In [85]:
df[df.word.str.count(r'[aeiou]{3}', re.IGNORECASE) > 0]


,word
234,Abietineae
235,abietineous
301,ablatitious
434,abranchious
507,absenteeism
...,...
235800,Zygophyceae
235801,zygophyceous
235802,Zygophyllaceae
235803,zygophyllaceous


- How many words have at least 4 consonants in a row?


In [84]:
# [a-z] matches vowels as well, so the consonant ranges are spelled out (y counts as a consonant)
df[df.word.str.count(r'[b-df-hj-np-tv-z]{4}', re.IGNORECASE) > 0]


- How many words start and end with the same letter?
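One way to approach this is a backreference: capture the first character and require the same character at the end. The sketch below uses a short, hypothetical word list in place of the dictionary file, and the optional group is a design choice so that one-letter words count too.

```python
import re

# hypothetical sample standing in for the lines of /usr/share/dict/words
words = ["area", "Anna", "level", "a", "zebra", "stats"]

# (\w) captures the first character; \1 requires that same character at the end.
# The optional group lets one-letter words match as well.
same_ends = re.compile(r"^(\w)(.*\1)?$", re.IGNORECASE)

matches = [word for word in words if same_ends.search(word)]
print(len(matches), matches)
```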


- How many words start and end with a vowel?
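A possible pattern anchors a vowel at each end of the word. As a sketch, again on a hypothetical sample list in place of the dictionary file:

```python
import re

# hypothetical sample standing in for the lines of /usr/share/dict/words
words = ["area", "Idaho", "apple", "I", "zebra", "tree", "echo"]

# a vowel at the start and, unless the word is a single vowel, a vowel at the end
vowel_ends = re.compile(r"^[aeiou](.*[aeiou])?$", re.IGNORECASE)

matches = [word for word in words if vowel_ends.search(word)]
print(len(matches), matches)
```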


- How many words contain the same letter 3 times in a row?
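Repeating a backreference expresses "the same letter again". The sketch below assumes that case should be ignored and uses a hypothetical sample list in place of the dictionary file.

```python
import re

# hypothetical sample standing in for the lines of /usr/share/dict/words
words = ["bossship", "goddessship", "wallless", "Aaa", "book", "banana"]

# ([a-z]) captures one letter; \1\1 requires two more copies immediately after it
triple_letter = re.compile(r"([a-z])\1\1", re.IGNORECASE)

matches = [word for word in words if triple_letter.search(word)]
print(len(matches), matches)
```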


- What other interesting patterns in words can you find?
