# Regex

## What?
Regular expressions (also called regexes or regex patterns) are a small language for describing patterns in text.
With regex patterns we can answer questions about strings and transform them:
- Does this string match a pattern?
- Is there a match for the pattern anywhere in the string?
- Modify and split strings in various ways
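
As a quick illustration of the three uses listed above, here is a minimal sketch. The sample string and variable names are made up for this example and are not part of the lesson data.

```python
import re

# Hypothetical sample text, only for illustration
text = 'order 66 shipped, order 67 pending'

# Does the whole string match a pattern?
whole = re.fullmatch(r'order \d+ \w+', 'order 66 shipped')

# Is there a match for the pattern anywhere in the string?
anywhere = re.search(r'\d+', text)

# Modify and split strings
masked = re.sub(r'\d+', '#', text)   # replace every run of digits
parts = re.split(r',\s*', text)      # split on a comma plus optional spaces

print(whole is not None, anywhere.group(), masked, parts)
```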

In [None]:
import pandas as pd
import re # part of the python stdlib

In [None]:
log_file_lines = '''
76.185.131.226 - - [11/May/2020:14:25:53 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:46 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET / HTTP/1.1" 200 42 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
76.185.131.226 - - [11/May/2020:16:25:58 +0000] "GET /favicon.ico HTTP/1.1" 200 162 "https://python.zach.lol/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36"
104.5.217.57 - - [11/May/2020:16:26:27 +0000] "GET / HTTP/1.1" 200 42 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:46 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:26:54 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
104.5.217.57 - - [11/May/2020:16:27:04 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:05 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
76.185.131.226 - - [11/May/2020:16:27:10 +0000] "GET /documentation HTTP/1.1" 200 348 "-" "python-requests/2.23.0"
'''

In [None]:
lines = pd.Series(log_file_lines.strip().split('\n'))
lines

In [None]:
regex = r'''
    (?P<ip>.*?)\s.*?\[(?P<timestamp>.*?)\]\s+"(?P<method>[A-Z]+)\s(?P<path>.*?)\sHTTP/1.1"
    \s(?P<status>\d+)\s(?P<bytes_sent>\d+)\s"(?P<referrer>.*?)"\s"(?P<user_agent>.*?)"
    '''

regex = re.compile(regex, re.VERBOSE)
regex

In [None]:
lines.str.extract(regex)

## re library functions

### `re.findall` 

 - finds all substrings where the RE matches; returns a list

### Literals - start simple

In [None]:
subject = 'abc'

#### find the letter a

#### find the letter c

#### find the letter d
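One possible way to work the three literal lookups above with `re.findall`. A literal character matches only itself, so a character that is absent gives an empty list.

```python
import re

subject = 'abc'

found_a = re.findall(r'a', subject)  # 'a' occurs once
found_c = re.findall(r'c', subject)  # 'c' occurs once
found_d = re.findall(r'd', subject)  # 'd' is not in the string

print(found_a, found_c, found_d)
```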

### Literals - make it more complex

In [None]:
subject = 'Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one'

#### find mary

##### regex flag: re.IGNORECASE

#### find little

#### find the number 1
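A sketch of the lookups above, assuming `re.findall` is used throughout. Matching is case-sensitive by default, so `mary` finds nothing until `re.IGNORECASE` is passed. A literal `1` also matches the `1` inside `10` and `12`.

```python
import re

subject = 'Mary had a little lamb. 1 little lamb. Not 10, not 12, not 22, just one'

mary_strict = re.findall(r'mary', subject)                # case-sensitive: no match
mary_loose = re.findall(r'mary', subject, re.IGNORECASE)  # ignores case
little = re.findall(r'little', subject)                   # every occurrence
ones = re.findall(r'1', subject)                          # also hits 10 and 12

print(mary_strict, mary_loose, little, ones)
```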

### Metacharacters

- `.`: any character except a newline
- `\w`: any letter, digit, or underscore
- `\W`: anything that is *not* a letter, digit, or underscore
- `\d`: any digit
- `\D`: anything that is *not* a digit
- `\s`: any whitespace

In [None]:
subject = 'abc. 123'

#### try all the metacharacters

#### what does \w\w bring back?

#### match the period
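The metacharacter exercises above could be explored like this. Each metacharacter matches a single character, and the period needs a backslash to be treated as a literal.

```python
import re

subject = 'abc. 123'

# Try each metacharacter on its own
results = {pattern: re.findall(pattern, subject)
           for pattern in [r'.', r'\w', r'\W', r'\d', r'\D', r'\s']}

# Two word characters in a row; 'c' and '3' are left without a partner
pairs = re.findall(r'\w\w', subject)

# An escaped dot matches only a literal period
period = re.findall(r'\.', subject)

for pattern, found in results.items():
    print(pattern, found)
print(pairs, period)
```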

#### match the string 'c 1' using only metacharacters

In [None]:
subject = 'c 1'
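
One way to match `'c 1'` with metacharacters only: a word character, a whitespace character, then a digit.

```python
import re

subject = 'c 1'

# \w matches 'c', \s matches the space, \d matches '1'
matched = re.findall(r'\w\s\d', subject)

print(matched)
```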

### Repeating

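- `{}`: custom number of repetitions
    - `{x}`: exactly x repetitions
    - `{x,}`: x or more repetitions
    - `{x,y}`: between x and y repetitions (inclusive)
- `*`: zero or more
- `+`: one or more
- `?`: optional (zero or one)
- `?` after another quantifier (`*?`, `+?`, `{x,y}?`): non-greedy; match as little as possible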
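
A small sketch of the quantifiers listed above. The sample strings are made up for this example. Quantifiers are greedy by default, and adding `?` after one makes it non-greedy (as few characters as possible).

```python
import re

# Hypothetical sample strings, only for illustration
subject = 'x1 x22 x333 x4444'

exactly_two = re.findall(r'x\d{2}', subject)    # x plus exactly two digits
three_plus = re.findall(r'x\d{3,}', subject)    # x plus three or more digits
two_to_three = re.findall(r'x\d{2,3}', subject) # x plus two or three digits
one_plus = re.findall(r'x\d+', subject)         # x plus one or more digits

# ? makes the preceding character optional
optional = re.findall(r'colou?r', 'color colour')

# Greedy vs. non-greedy
greedy = re.findall(r'<.+>', '<b>bold</b>')     # grabs as much as possible
lazy = re.findall(r'<.+?>', '<b>bold</b>')      # grabs as little as possible

print(exactly_two, three_plus, two_to_three, one_plus, optional, greedy, lazy)
```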

#### what will be returned?

In [None]:
subject = 'abc 123'

In [None]:
regexp = r'\w+\s?\d'

re.findall(regexp, subject)
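
To check the prediction, the same call is repeated here with a walkthrough in the comments. The first match consumes `abc`, the optional space, and one digit. The scan then resumes at `23`, where `\w+` has to give back the `3` so that `\d` can match it.

```python
import re

subject = 'abc 123'
regexp = r'\w+\s?\d'

# Match 1: \w+ -> 'abc', \s? -> ' ', \d -> '1'
# Match 2: \w+ -> '2' (after backtracking), \s? -> '', \d -> '3'
matches = re.findall(regexp, subject)

print(matches)
```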

### Find the matches

In [None]:
subject = """
    Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. 
    You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.
    """

#### match all the numbers

#### match a 5 digit number, but not a number with fewer digits

#### match a 4 or more digit number

#### match 3 to 4 digit number

#### match `http://` or `https://`
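The exercises above could be approached as follows. Note that `\d{3,4}` also picks up the first four digits of the five-digit zip code, because nothing in the pattern says where a number has to start or stop.

```python
import re

subject = (
    'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, '
    'San Antonio, TX 78230.\n'
    'You can find us online at http://codeup.com and our alumni portal '
    'is located at https://alumni.codeup.com.'
)

all_numbers = re.findall(r'\d+', subject)        # every run of digits
five_digits = re.findall(r'\d{5}', subject)      # exactly five digits in a row
four_or_more = re.findall(r'\d{4,}', subject)    # four or more digits
three_to_four = re.findall(r'\d{3,4}', subject)  # three or four digits
protocols = re.findall(r'https?://', subject)    # the 's' is optional

print(all_numbers, five_digits, four_or_more, three_to_four, protocols)
```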

### Any of / None of

- `[]`: matches any single character listed inside the brackets
- `[^]`: matches any single character *not* listed inside the brackets
- `[-]`: matches any single character in a range, for example `[a-z]` or `[0-9]`

In [None]:
subject = 'abc 12345'

#### match using brackets

#### match using caret

#### match using range

#### match using range and caret
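A sketch of the four bracket exercises above. The specific characters and ranges are arbitrary choices for illustration.

```python
import re

subject = 'abc 12345'

listed = re.findall(r'[a1]', subject)         # either 'a' or '1'
not_listed = re.findall(r'[^a1]', subject)    # anything except 'a' or '1'
in_range = re.findall(r'[2-4]', subject)      # digits 2 through 4
not_in_range = re.findall(r'[^a-c]', subject) # anything except a, b, or c

print(listed, not_listed, in_range, not_in_range)
```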

### Anchors

- `^`: start of the string (or of each line with `re.MULTILINE`)
- `$`: end of the string (or of each line with `re.MULTILINE`)
- `\b`: word boundary

In [None]:
subject = """
    Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. 
    You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.
    """

#### match 3 or 4 digit number
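With word boundaries on both sides, a three-or-four digit pattern no longer matches part of the five-digit zip code. A minimal sketch:

```python
import re

subject = (
    'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, '
    'San Antonio, TX 78230.\n'
    'You can find us online at http://codeup.com and our alumni portal '
    'is located at https://alumni.codeup.com.'
)

# \b on each side requires the digits to form a whole "word"
three_or_four = re.findall(r'\b\d{3,4}\b', subject)

print(three_or_four)
```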

In [None]:
subject = 'kiwi aardvark banana codeup data science academy extra'

#### match all words that start with a vowel

In [None]:
# using a boundary


In [None]:
#using an anchor


In [None]:
#split subjects to use anchor


#### match all words that end with a vowel
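One possible solution to the vowel exercises above. With `^` on the full string only the very first word is tested, which is why the anchor version splits the string into words first.

```python
import re

subject = 'kiwi aardvark banana codeup data science academy extra'

# Using a boundary: a vowel right after a word boundary, then the rest of the word
starts_boundary = re.findall(r'\b[aeiou]\w*', subject)

# Using an anchor on the whole string: only the first word ('kiwi') is checked
starts_anchor = re.findall(r'^[aeiou]\w*', subject)

# Splitting first, then anchoring each word
starts_split = [word for word in subject.split() if re.search(r'^[aeiou]', word)]

# Words that end with a vowel: the vowel must sit right before a word boundary
ends_vowel = re.findall(r'\b\w*[aeiou]\b', subject)

print(starts_boundary, starts_anchor, starts_split, ends_vowel)
```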

### Capture Groups

- `()`: grab what's contained in parentheses 

In [None]:
subject = '''
You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
'''

#### find the domain only

#### find everything after the first sentence

#### find the ip address

In [None]:
# don't be greedy


#### find domain and ip address
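A sketch of the capture-group exercises above; the exact patterns are one choice among many. When a pattern has one group, `re.findall` returns just that group. With several groups it returns tuples.

```python
import re

subject = '''
You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
'''

# Domain only: the protocol is matched but not captured
domain = re.findall(r'https?://(\w+\.\w+)', subject)

# Everything after the first sentence (first period followed by whitespace)
rest = re.findall(r'\.\s(.*)', subject)

# IP address: four runs of digits separated by literal dots
ip = re.findall(r'(\d+\.\d+\.\d+\.\d+)', subject)

# Greedy vs. non-greedy capture after the word 'is'
greedy = re.findall(r'is\s(.*)\s', subject)
lazy = re.findall(r'is\s(.*?)\s', subject)

# Domain and IP address together
domain_and_ip = re.findall(r'://(\w+\.\w+).*\s(\d+\.\d+\.\d+\.\d+)', subject)

print(domain, rest, ip, greedy, lazy, domain_and_ip)
```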

#### find the protocol, domain, and tld

In [None]:
regexp = r'(https?)://(\w+)\.(\w+)'

re.findall(regexp, subject)

### Non Capture Group

- `(?:...)`: groups part of a pattern without capturing it; also called a shy group

#### find the ip address

In [None]:
regexp = r'((?:\d{3}\.){3}\d+)'

re.findall(regexp, subject)

#### find the protocol and tld
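To see what the shy group changes, this sketch compares the IP pattern with and without `?:`, then uses a non-capturing group to skip the domain. The patterns are illustrative rather than the only valid answers.

```python
import re

subject = '''
You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
'''

# Without ?: the inner group is captured too (its last repetition, '123.')
with_inner_capture = re.findall(r'((\d{3}\.){3}\d+)', subject)

# With ?: only the outer group is returned
without_inner_capture = re.findall(r'((?:\d{3}\.){3}\d+)', subject)

# Protocol and tld: the domain is grouped but not captured
protocol_tld = re.findall(r'(https?)://(?:\w+)\.(\w+)', subject)

print(with_inner_capture, without_inner_capture, protocol_tld)
```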

## more re library functions

- `re.findall` finds all substrings where the RE matches; returns a list
- `re.search` scans through a string, looking for any location where the RE matches; returns match object
- `re.sub` allows us to match a regex and substitute in a new substring for the match; returns string

### `re.search`

- scans through a string, looking for any location where the RE matches; returns match object

In [None]:
subject = """
    Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. 
    You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.com.
    """

#### find the word located

In [None]:
#results type


#### find numbers

#### find navarro
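A sketch of the `re.search` exercises above. Unlike `re.findall`, `re.search` stops at the first match and returns a match object, or `None` when nothing matches.

```python
import re

subject = (
    'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, '
    'San Antonio, TX 78230.\n'
    'You can find us online at http://codeup.com and our alumni portal '
    'is located at https://alumni.codeup.com.'
)

located = re.search(r'located', subject)
print(type(located), located.group())   # a match object and the matched text

first_number = re.search(r'\d+', subject)  # only the first run of digits

navarro_strict = re.search(r'navarro', subject)               # case-sensitive
navarro_loose = re.search(r'navarro', subject, re.IGNORECASE) # ignores case

print(first_number.group(), navarro_strict, navarro_loose.group())
```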

In [None]:
#results type


### Name Capture Group

- `(?P<name>...)`: names a capture group so it can be retrieved by name

#### find the protocol, domain, and tld and name them
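One way to name the three parts, assuming `re.search` on the Codeup text from above so that only the first URL is returned. `groups()` gives the captures as a tuple and `groupdict()` gives them keyed by group name.

```python
import re

subject = (
    'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, '
    'San Antonio, TX 78230.\n'
    'You can find us online at http://codeup.com and our alumni portal '
    'is located at https://alumni.codeup.com.'
)

regexp = r'(?P<protocol>https?)://(?P<domain>\w+)\.(?P<tld>\w+)'
match = re.search(regexp, subject)

as_tuple = match.groups()     # positional captures
as_dict = match.groupdict()   # captures keyed by name

print(as_tuple, as_dict, match.group('domain'))
```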

In [None]:
#groups()


In [None]:
#groupdict()


##### regex flag:  `re.VERBOSE`
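With `re.VERBOSE`, whitespace inside the pattern is ignored and `#` starts a comment, so a long pattern can be spread over several documented lines. A minimal sketch, using a shortened sample sentence as an assumption:

```python
import re

# Shortened sample text, only for illustration
subject = 'You can find us online at http://codeup.com'

verbose_pattern = re.compile(r'''
    (?P<protocol>https?)   # http or https
    ://                    # literal separator
    (?P<domain>\w+)        # word characters before the dot
    \.                     # literal dot
    (?P<tld>\w+)           # word characters after the dot
''', re.VERBOSE)

# The same pattern written on one line
compact_pattern = re.compile(r'(?P<protocol>https?)://(?P<domain>\w+)\.(?P<tld>\w+)')

verbose_result = verbose_pattern.search(subject).groupdict()
compact_result = compact_pattern.search(subject).groupdict()

print(verbose_result)
```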

### `re.sub`

- allows us to match a regex and substitute in a new substring for the match; returns string

In [None]:
subject = 'abc 12345xyz'

#### remove all the digits

#### replace all digits with an o 

#### replace all the digits with a single o
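A sketch of the three substitutions above. The difference between the last two is whether each digit is matched on its own (`\d`) or the whole run is matched at once (`\d+`).

```python
import re

subject = 'abc 12345xyz'

removed = re.sub(r'\d', '', subject)     # delete every digit
each_o = re.sub(r'\d', 'o', subject)     # one 'o' per digit
single_o = re.sub(r'\d+', 'o', subject)  # one 'o' for the whole run

print(removed, each_o, single_o)
```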

### `re.compile`

- compiles a regular expression ahead of time so it can be reused; returns a compiled pattern object

In [None]:
emails = [
    "jane@company.com",
    "bob@company.com",
    "jane.janeway@company.com",
    "jane.janeway@dogood.org",
    "jane.janet.janeway@dogood.org", # bonus for the 3 part address
]
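
The notebook does not say what to extract from these addresses, so the following is only a sketch under an assumption: split each email into a username, a domain, and a tld. The group names and the pattern are hypothetical choices. Allowing dots in the username covers the two- and three-part addresses.

```python
import re

emails = [
    "jane@company.com",
    "bob@company.com",
    "jane.janeway@company.com",
    "jane.janeway@dogood.org",
    "jane.janet.janeway@dogood.org",
]

# Compile once, reuse for every address (group names are assumptions)
email_pattern = re.compile(r'^(?P<username>[\w.]+)@(?P<domain>\w+)\.(?P<tld>\w+)$')

parsed = [email_pattern.search(email).groupdict() for email in emails]

for row in parsed:
    print(row)
```

Because the result is a list of dictionaries, it could also be passed to `pd.DataFrame(parsed)` to get one column per named group.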