# Regular Expressions

A regular expression is a sort of meta-language that can be used to describe
patterns in text.

Regexes are most commonly used in one of two ways:

- To find/extract text that matches a pattern.
- To replace/substitute text that matches a pattern.

## The `re` module

To demonstrate regular expressions, we'll be using the `re` module from the
Python standard library, and its `findall` function. Many other libraries also
work with regular expressions.

This function accepts a string that is a regular expression, the `pattern`,
and another string that is the string to be searched. `findall` returns a
list of all the non-overlapping matches of the regular expression in the string.

!!!note "Raw Strings"
    Any string in Python prefixed with an `r` is a **raw string**. This means
    that backslashes will be included in the string verbatim, and don't carry
    special meaning. It is very common to use raw strings when creating a
    regular expression.
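
As a quick illustration of the note above, the sketch below compares a regular string with its raw counterpart. The variable names and sample text are made up for this example, and `\b` is the word-boundary anchor covered later in this lesson.

```python
import re

# In a regular string, '\n' is a single newline character.
regular = '\n'
# In a raw string, r'\n' is two characters: a backslash and an 'n'.
raw = r'\n'

print(len(regular), len(raw))  # 1 2

# Without the r prefix, the backslash has to be doubled to reach the regex engine.
print(r'\d+' == '\\d+')  # True

# In a regular string '\b' is a backspace character, so the word-boundary
# anchor is lost unless the pattern is written as a raw string.
without_raw = re.findall('\bcat\b', 'cat concat')
with_raw = re.findall(r'\bcat\b', 'cat concat')
print(without_raw)  # []
print(with_raw)     # ['cat']
```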

## Basic Regexes

At its most basic, any alphanumeric character is a valid regular expression.

In [1]:
import re

re.findall(r'b', 'abcd')

['b']

We'll define a function here to simplify the process of showing many results from regular expressions.

In [2]:
def show_all_matches(regexes, subject, re_length=6):
    print('Sentence:')
    print()
    print('    {}'.format(subject))
    print()
    print(' regexp{} | matches'.format(' ' * (re_length - 6)))
    print(' ------{} | -------'.format(' ' * (re_length - 6)))
    for regexp in regexes:
        fmt = ' {:<%d} | {!r}' % re_length
        matches = re.findall(regexp, subject)
        if len(matches) > 8:
            matches = matches[:8] + ['...']
        print(fmt.format(regexp, matches))

In [3]:
sentence = 'Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one.'

show_all_matches([
    r'a',
    r'm',
    r'M',
    r'Mary',
    r'little',
    r'1',
    r'10',
    r'22'
], sentence)

Sentence:

    Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one.

 regexp | matches
 ------ | -------
 a      | ['a', 'a', 'a', 'a', 'a']
 m      | ['m', 'm']
 M      | ['M']
 Mary   | ['Mary']
 little | ['little', 'little']
 1      | ['1', '1', '1']
 10     | ['10']
 22     | ['22']


## Metacharacters and Character Classes

In addition to letters and numbers, there are special **metacharacters** in
regular expressions. Rather than matching themselves literally, these
characters carry a special meaning, such as matching a whole class of
characters. A metacharacter must be **escaped** with a backslash to match the
character itself.
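
To make escaping concrete, here is a minimal sketch using a made-up sample string. It also uses `re.escape` from the standard library, which isn't otherwise covered in this lesson, to build an escaped pattern from a literal string.

```python
import re

text = 'v3.50 and v3x50'

# Unescaped, '.' is a metacharacter and matches any character, including 'x'.
unescaped = re.findall(r'3.50', text)

# Escaped with a backslash, '.' only matches a literal period.
escaped = re.findall(r'3\.50', text)

# re.escape adds the backslashes for us.
auto_escaped = re.findall(re.escape('3.50'), text)

print(unescaped)     # ['3.50', '3x50']
print(escaped)       # ['3.50']
print(auto_escaped)  # ['3.50']
```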

Here are several metacharacters that represent various **character classes**.

| metacharacter | matches                                            |
| ------------- | -------                                            |
| `.`           | any character except a newline                     |
| `\w`          | any letter, digit, or underscore                   |
| `\W`          | anything that's not a letter, digit, or underscore |
| `\d`          | any digit                                          |
| `\D`          | anything that's not a digit                        |
| `\s`          | any whitespace character                           |


In [4]:
res = [
    r'\w',
    r'\d',
    r'\s',
    r'.', # matches any character except a newline
    r'\.', # a literal period
]
show_all_matches(res, sentence)

Sentence:

    Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one.

 regexp | matches
 ------ | -------
 \w     | ['M', 'a', 'r', 'y', 'h', 'a', 'd', 'a', '...']
 \d     | ['1', '1', '0', '1', '2', '2', '2']
 \s     | [' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', '...']
 .      | ['M', 'a', 'r', 'y', ' ', 'h', 'a', 'd', '...']
 \.     | ['.', '.', '.']


These can be combined together.

In [5]:
show_all_matches([r'l\w\w\w\W', r'\d\d'], sentence, re_length=9)

Sentence:

    Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one.

 regexp    | matches
 ------    | -------
 l\w\w\w\W | ['lamb.', 'lamb.']
 \d\d      | ['10', '12', '22']


## Repeating

Each of the metacharacters in the table below applies to the character (or
group) immediately before it, and specifies how many times it may repeat.

| metacharacter | matches                                     |
| ------------- | -------                                     |
| `*`           | zero or more repetitions                    |
| `+`           | one or more repetitions                     |
| `{n}`         | exactly `n` repetitions                     |
| `{n,}`        | `n` or more repetitions                     |
| `{n,m}`       | between `n` and `m` repetitions (inclusive) |
| `?`           | zero or one (an optional character)         |


In [6]:
show_all_matches([
    r'\d+'
], sentence)

print('\n---\n')

show_all_matches([
    r'a{2,}',
    r'a{2}',
    r'a{3,4}'
], 'aabbaaaa')

Sentence:

    Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one.

 regexp | matches
 ------ | -------
 \d+    | ['1', '10', '12', '22']

---

Sentence:

    aabbaaaa

 regexp | matches
 ------ | -------
 a{2,}  | ['aa', 'aaaa']
 a{2}   | ['aa', 'aa', 'aa']
 a{3,4} | ['aaaa']
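
The `*`, `+`, and `?` metacharacters from the table can be compared side by side. This is a small sketch with a made-up sample string, calling `re.findall` directly rather than the `show_all_matches` helper.

```python
import re

text = 'color colour colouur'

# 'u?' makes the 'u' optional: zero or one occurrence.
optional_u = re.findall(r'colou?r', text)

# 'u*' allows zero or more occurrences of 'u'.
zero_or_more_u = re.findall(r'colou*r', text)

# 'u+' requires at least one 'u'.
one_or_more_u = re.findall(r'colou+r', text)

print(optional_u)      # ['color', 'colour']
print(zero_or_more_u)  # ['color', 'colour', 'colouur']
print(one_or_more_u)   # ['colour', 'colouur']
```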


## Any of or None of

The square brackets in a regular expression represent a single character that
will match any of the values within the square brackets. For example, `[ab]`
will match either an 'a' or a 'b'.

If the first character inside of the square brackets is a caret, `^`, then
anything that is *not* inside of the square brackets will be matched. For
example, `[^ab]` will match any character that is neither 'a' nor 'b'.

Inside of square brackets, ranges of letters and numbers can be abbreviated with
a hyphen.

In [7]:
show_all_matches([
    r'[lt]',
    r'[lt]+',
    r'[^aeiou\s\.]', # anything that's not a lowercase vowel, whitespace, or a period
    r'[a-d]'
], sentence, re_length=12)

Sentence:

    Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one.

 regexp       | matches
 ------       | -------
 [lt]         | ['l', 't', 't', 'l', 'l', 'l', 't', 't', '...']
 [lt]+        | ['l', 'ttl', 'l', 'l', 'ttl', 'l', 't', 't', '...']
 [^aeiou\s\.] | ['M', 'r', 'y', 'h', 'd', 'l', 't', 't', '...']
 [a-d]        | ['a', 'a', 'd', 'a', 'a', 'b', 'a', 'b']


## Anchors

There are several special metacharacters that don't match any individual
characters, but serve as an "anchor" for the rest of the regular expression.

| metacharacter | matches                      |
| ------------- | -------                      |
| `^`           | The start of the string/line |
| `$`           | The end of the string/line   |
| `\b`          | A word boundary              |

In [8]:
show_all_matches([
    r'\bo\w+', # any word that starts with an 'o'
    r'^\s', # starts with a space
    r'^M', # starts with 'M'
    r'\.$', # ends with a period
], sentence)

Sentence:

    Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one.

 regexp | matches
 ------ | -------
 \bo\w+ | ['one']
 ^\s    | []
 ^M     | ['M']
 \.$    | ['.']


## Other Common Functions

- `match`: Match the regular expression only at the start of the string.
- `search`: Find the first match of the regular expression anywhere in the string.
- `sub`: Make substitutions with a regular expression.
- `compile`: Prepare a regular expression for use ahead of time.

Now we'll take a look at using `search` and `sub` in more detail.
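
Before that, here is a brief sketch of how `match` differs from `search`, and of what `compile` gives back. The sample text and variable names are only for illustration.

```python
import re

text = 'Mary had a little lamb.'

# match only succeeds if the pattern matches at the very start of the string.
no_match = re.match(r'little', text)
start_match = re.match(r'Mary', text)

# search scans the whole string and returns the first match it finds.
first_found = re.search(r'little', text)

print(no_match)        # None
print(start_match[0])  # Mary
print(first_found[0])  # little

# compile returns a pattern object whose methods mirror the module-level functions.
word_re = re.compile(r'\w+')
words = word_re.findall(text)
print(words)  # ['Mary', 'had', 'a', 'little', 'lamb']
```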

## Capture Groups

We can define groups in our regular expressions called **capture groups**. Groups allow us to reference the captured text later on in the regular expression, or apply repetition to the group as a whole.
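
Both uses can be sketched briefly. Inside a pattern, `\1` refers back to the text captured by the first group; the sample strings here are made up for the example.

```python
import re

# \1 refers back to whatever the first group captured, so this finds
# any word character that is immediately repeated.
doubled = re.search(r'(\w)\1', 'Mary had a little lamb.')
print(doubled[0])  # tt

# A quantifier after a group repeats the whole group, not just one character.
repeated = re.search(r'(ab){2}', 'xxababxx')
print(repeated[0])  # abab
```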

Note that when we include capture groups in our regular expressions, `findall` will return only the matched groups, not the entire text that was matched.
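
A short sketch of that `findall` behavior, using a fragment of the earlier sentence:

```python
import re

text = 'Not 10, not 12, not 22'

# No groups: findall returns the full text of each match.
whole_matches = re.findall(r'[Nn]ot \d+', text)

# One group: findall returns only what the group captured.
one_group = re.findall(r'[Nn]ot (\d+)', text)

# Several groups: findall returns one tuple per match.
many_groups = re.findall(r'([Nn]ot) (\d)(\d)', text)

print(whole_matches)  # ['Not 10', 'not 12', 'not 22']
print(one_group)      # ['10', '12', '22']
print(many_groups)    # [('Not', '1', '0'), ('not', '1', '2'), ('not', '2', '2')]
```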

In [9]:
sentence = '''
You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
'''.strip()

In [10]:
ip_re = r'\d+(\.\d+){3}'

match = re.search(ip_re, sentence)
match[0]

'123.123.123.123'

In [11]:
# simplified for demonstration, a real regex to parse urls would be much more
# complex
url_re = r'(https?)://(\w+)\.(\w+)'

protocol, domain, tld = re.search(url_re, sentence).groups()

print(f'''
protocol: {protocol}
domain:   {domain}
tld:      {tld}
''')


protocol: https
domain:   codeup
tld:      com



You can create *non-capturing* (aka shy) groups by starting the group with `?:`, and groups can be named by starting the group with `?P<name>`.

In [12]:
url_re = r'(?P<protocol>https?)://(?:\w+)\.(?P<tld>\w+)'

match = re.search(url_re, sentence)

print(f'''
groups: {match.groups()}
referencing a group by name: {match.group('tld')}
group dictionary: {match.groupdict()}
''')


groups: ('https', 'com')
referencing a group by name: com
group dictionary: {'protocol': 'https', 'tld': 'com'}



## Substitution

We can use a regular expression to replace or remove parts of a string. In addition, if the supplied regular expression has capture groups in it, the text captured can be referenced when making the substitution.

In [13]:
# remove anything that's not a digit
re.sub(r'\D', '', 'abc 123')

'123'

In [14]:
# remove anything that's not a lowercase letter
re.sub(r'[^a-z]', '', 'abc 123')

'abc'

In [15]:
re.sub(r'.(.).', r'\1', 'abc')

'b'

In [16]:
re.sub(r'(.)(.)(.)', r'\3\2\1', 'abc')

'cba'

In [17]:
re.sub(r'.{2}$', 'X', 'abc')

'aX'
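
Named groups can be referenced in a substitution as well, using `\g<name>` in the replacement string. The sketch below uses a made-up name and hypothetical group names `last` and `first`.

```python
import re

name = 'Lovelace, Ada'

# \g<first> and \g<last> insert the text captured by the named groups.
swapped = re.sub(r'(?P<last>\w+), (?P<first>\w+)', r'\g<first> \g<last>', name)

print(swapped)  # Ada Lovelace
```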

## Regex Flags

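Pass any of the flags below as the `flags` argument to any of the regular
expression functions mentioned in this lesson, and that behavior will be
enabled for that use of the regular expression. For `findall`, `search`, and
`match`, `flags` is the last positional argument; for `sub`, pass it by keyword
(`flags=...`), since the positional argument after the subject string is
`count`.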

- `re.MULTILINE`: The `^` and `$` anchors will apply line by line, instead of
  applying to start and end of the string.

- `re.IGNORECASE`: Ignore character casing when matching.

- `re.VERBOSE`: Ignore whitespace in the regular expression, except inside
  square brackets or when escaped. This can be useful to make more readable
  regular expressions, especially when combined with `(?#...)` comment groups.

    ```python
    regexp = r'''
    [aeiou] (?# any vowel)
    [^aeiou] (?# followed by a non-vowel)
    '''
    ```

    With the `VERBOSE` flag set, the above is equivalent to the following.

    ```python
    regexp = r'[aeiou][^aeiou]'
    ```
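
The flags above can be tried out with a short sketch. The two-line sample text and the variable names are assumptions made for this example; flags are combined with the `|` operator.

```python
import re

text = 'Mary had a little lamb.\nmary had another.'

# Without flags, ^ only anchors at the very start of the string.
no_flags = re.findall(r'^mary', text)

# IGNORECASE lets 'mary' match 'Mary'.
ignore_case = re.findall(r'^mary', text, re.IGNORECASE)

# MULTILINE makes ^ match at the start of every line.
multiline = re.findall(r'^mary', text, re.MULTILINE)

# Flags can be combined with the | operator.
both = re.findall(r'^mary', text, re.IGNORECASE | re.MULTILINE)

print(no_flags)     # []
print(ignore_case)  # ['Mary']
print(multiline)    # ['mary']
print(both)         # ['Mary', 'mary']

# For sub, flags are passed by keyword.
replaced = re.sub(r'^mary', 'Jane', text, flags=re.IGNORECASE | re.MULTILINE)
print(repr(replaced))  # 'Jane had a little lamb.\nJane had another.'

# VERBOSE ignores the whitespace and (?#...) comments in the pattern.
vowel_then_non_vowel = r'''
[aeiou]   (?# any vowel)
[^aeiou]  (?# followed by a non-vowel)
'''
pairs = re.findall(vowel_then_non_vowel, 'regex', re.VERBOSE)
print(pairs)  # ['eg', 'ex']
```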

## Further Reading

- [RegExr: test out regexes in your browser](https://regexr.com/)
- [RegexOne: learn regular expressions](https://regexone.com/)
- [Python docs on the `re` module](https://docs.python.org/3/library/re.html)

## Exercises

Using the [repo setup directions](https://ds.codeup.com/fundamentals/git/), set up a new local and remote repository named `natural-language-processing-exercises`. The local version of your repo should live inside of `~/codeup-data-science`.

Save this work in your `natural-language-processing-exercises` repo. Then add, commit, and push your changes.

Unless a specific file extension is specified, you may do your work either in a
Python script (`.py`) or a Jupyter notebook (`.ipynb`).

Do your work for this exercise in a file named `regex_exercises`.

1. Write a function named `is_vowel`. It should accept a string as input and use
   a regular expression to determine if the passed string is a vowel. While not
   explicitly mentioned in the lesson, you can treat the result of `re.search` as
   a boolean value that indicates whether or not the regular expression matches
   the given string.

1. Write a function named `is_valid_username` that accepts a string as input. A
   valid username starts with a lowercase letter, and only consists of lowercase
   letters, numbers, or the `_` character. It should also be no longer than 32
   characters. The function should return either `True` or `False` depending on
   whether the passed string is a valid username.

        >>> is_valid_username('aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa')
        False
        >>> is_valid_username('codeup')
        True
        >>> is_valid_username('Codeup')
        False
        >>> is_valid_username('codeup123')
        True
        >>> is_valid_username('1codeup')
        False

1. Write a regular expression to capture phone numbers. It should match all of
   the following:

        (210) 867 5309
        +1 210.867.5309
        867-5309
        210-867-5309

1. Use regular expressions to convert the dates below to the standardized year-month-day format.

        02/04/19
        02/05/19
        02/06/19
        02/07/19
        02/08/19
        02/09/19
        02/10/19

1. Write a regex to extract the various parts of these logfile lines:

        GET /api/v1/sales?page=86 [16/Apr/2019:193452+0000] HTTP/1.1 {200} 510348 "python-requests/2.21.0" 97.105.19.58
        POST /users_accounts/file-upload [16/Apr/2019:193452+0000] HTTP/1.1 {201} 42 "User-Agent: Mozilla/5.0 (X11; Fedora; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36" 97.105.19.58
        GET /api/v1/items?page=3 [16/Apr/2019:193453+0000] HTTP/1.1 {429} 3561 "python-requests/2.21.0" 97.105.19.58


**Bonus Exercise**

You can find a list of words on your Mac at `/usr/share/dict/words`. Use this
file to answer the following questions:

- How many words have at least 3 vowels?
- How many words have at least 3 vowels in a row?
- How many words have at least 4 consonants in a row?
- How many words start and end with the same letter?
- How many words start and end with a vowel?
- How many words contain the same letter 3 times in a row?
- What other interesting patterns in words can you find?