# RegEx

**Regex**, short for **reg**ular **ex**pressions, is a pattern-matching language for finding and manipulating text, which makes it incredibly useful for cleaning data to conform to a standard. In Python, regular expressions are available through the built-in `re` module. RegEx can be **VERY** challenging to learn at first, so do not be surprised if you need to continually look up how to use it. It takes professionals YEARS to become comfortable (if they ever do)!

## RegEx Metacharacters
Metacharacters are characters with special meanings. Some of the most commonly used metacharacters can be found below. 

**.** | Any character (except a new line character). Think of this as a wildcard character. <br> Example: "l...s" would find any 5-character words that start with l and end with s. 

**\*** | Zero or more occurrences of the previous pattern. <br> Example: "l.*s" would find any words of 2 or more characters that start with l and end with s.

**\+** | One or more occurrences of the previous pattern. <br> Example: "l.+s" would find any words of 3 or more characters that start with l and end with s.

**?** | Zero or one occurrence of the previous pattern. <br> Example: "l.?.?.?s" would find any 2, 3, 4, or 5-character words that start with l and end with s.

**{#}** | Exactly # many occurrences of the previous pattern. <br> Example: "l.{3}s" is equivalent to "l...s"

**\^** | Starts with whatever pattern follows. <br> Example: "^Thus" would search for strings that start with the word Thus.

**\$** | Ends with whatever pattern came before it. <br> Example: "\\(.*\\)$" would search for strings that end with a pair of parentheses, where the text between them can be of any length (including empty).
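
The metacharacter examples above can be checked with a short sketch. The word list and the helper `matching_words` are made up for this illustration, and `re.fullmatch` is used so that each pattern has to cover the whole word rather than just part of it.

```python
import re

# Hypothetical word list, made up for this illustration
words = ["ls", "las", "lips", "leaks", "lions", "letters", "apples"]

def matching_words(pattern):
    # re.fullmatch succeeds only if the entire word fits the pattern
    return [word for word in words if re.fullmatch(pattern, word)]

five_letters = matching_words(r"l...s")     # ['leaks', 'lions']
zero_or_more = matching_words(r"l.*s")      # every word except 'apples'
one_or_more = matching_words(r"l.+s")       # same, but 'ls' is too short
up_to_three = matching_words(r"l.?.?.?s")   # 2 to 5 characters, so no 'letters'
exactly_three = matching_words(r"l.{3}s")   # same result as "l...s"

# ^ and $ anchor a pattern to the start or end of the string
sentence = "Thus it ends (for now)"
starts_with_thus = bool(re.search(r"^Thus", sentence))    # True
ends_with_parens = bool(re.search(r"\(.*\)$", sentence))  # True

print(five_letters, one_or_more, up_to_three)
```

Note that `re.search` and `re.findall` look for the pattern anywhere inside a string, so without `re.fullmatch` (or the `^` and `$` anchors) a pattern like "l...s" would also match part of a longer word.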

## RegEx Special Sequences
Special sequences are used to match some group of characters. Some of the most commonly used special sequences can be found below.
- [] - Some set of characters. Example: [amk] will match 'a', 'm', or 'k'. We can also use ranges, such as [a-m] matching letters between a and m individually.
- \d - Matches any decimal digit; this is equivalent to the class [0-9].
- \D - Matches any non-digit character; this is equivalent to the class [^0-9].
- \s - Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
- \S - Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
- \w - Matches any alphanumeric character or underscore; for ASCII text this is equivalent to the class [a-zA-Z0-9_].
- \W - Matches any character that is not alphanumeric or an underscore; for ASCII text this is equivalent to the class [^a-zA-Z0-9_].

See https://docs.python.org/3/library/re.html#re-syntax for the entire list.
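
As a rough illustration of the special sequences listed above, the sketch below runs a few of them against a sample string. The string itself is made up for this example.

```python
import re

# Hypothetical sample text, made up for this illustration (\t is a tab)
text = "Room 42, floor 3:\tkey_A7"

digits = re.findall(r"\d", text)          # ['4', '2', '3', '7']
digit_runs = re.findall(r"\d+", text)     # ['42', '3', '7']
word_runs = re.findall(r"\w+", text)      # ['Room', '42', 'floor', '3', 'key_A7']
whitespace = re.findall(r"\s", text)      # three spaces and one tab
set_matches = re.findall(r"[amk]", text)  # ['m', 'k'] (uppercase 'A' is not in the set)
in_range = re.findall(r"[a-m]", "key_A7") # ['k', 'e']

# \D matches every non-digit, so removing those leaves only the digits
only_digits = re.sub(r"\D", "", text)     # '4237'

print(digit_runs, word_runs, only_digits)
```

Combining a special sequence with a metacharacter, as in `\d+`, is what turns single-character matches into whole numbers or words.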

## RegEx Functions

**re.findall(pattern, text_to_search)** - Returns a list of all matches, in the order they are found. If no matches are found, returns an empty list.

**re.search(pattern, text_to_search)** - Returns a match object for the *first* match found, ignoring all others. If no match is found, returns `None`.

**re.split(pattern, text_to_search)** - Returns a list where the string has been split at each matching pattern. 

**re.sub(pattern_to_find, new_pattern, text_to_search)** - Returns a new string where every match of "pattern_to_find" is replaced with "new_pattern" (the replacement text). You can limit the number of replacements with the count argument: **re.sub(pattern_to_find, new_pattern, text_to_search, count=#)**.
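
Here is a minimal sketch of the four functions side by side. The sample string is made up for this illustration.

```python
import re

# Hypothetical sample string, made up for this illustration
text = "apples: 3, pears: 12, plums: 7"

# findall returns every match as a list of strings
all_numbers = re.findall(r"\d+", text)            # ['3', '12', '7']

# search returns a match object for the first match, or None if there is none
first_match = re.search(r"\d+", text)
first_number = first_match.group()                # '3'
no_match = re.search(r"z+", text)                 # None

# split cuts the string wherever the pattern matches
pieces = re.split(r",\s*", text)                  # ['apples: 3', 'pears: 12', 'plums: 7']

# sub replaces every match; count limits how many replacements are made
masked = re.sub(r"\d+", "#", text)                # 'apples: #, pears: #, plums: #'
masked_once = re.sub(r"\d+", "#", text, count=1)  # 'apples: #, pears: 12, plums: 7'

print(all_numbers, first_number, pieces, masked_once)
```

Because `re.search` can return `None`, it is usually worth checking the result before calling `.group()` on it.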

## Pattern matching with re.search()

Let's take our previous example and, rather than split the text, use matching.

In [1]:
from numpy import random # needed for random. functions

# Lists of month names, grouped by the number of days in each month
day_months_31 = ['January', 'March', 'May', 'July', 
                 'August', 'October', 'December']
day_months_30 = ['April', 'June', 'September', 'November']
day_months_28 = ['February'] # Ignoring leap years
# All 12 month names in one plain list (despite the name, it holds no dicts)
list_of_dict_months = day_months_31 + day_months_30 + day_months_28

# Define a function that creates a random date
def random_date():
    # numpy's randint excludes the upper bound, so this picks an index from 0 to 11
    month = list_of_dict_months[random.randint(0, 12)]
    # Ensures the month has the correct number of days
    if month in day_months_31:
        day = random.randint(1, 32)  # 1 to 31
    elif month in day_months_30:
        day = random.randint(1, 31)  # 1 to 30
    else:
        day = random.randint(1, 29)  # 1 to 28
    year = random.randint(1900, 2024)  # 1900 to 2023
    return f'{month} {day}, {year}'
    
# initialize empty list
list_of_dates = []

# Create a list of 1000 dates
while len(list_of_dates) < 1000:
    list_of_dates.append(random_date())

# Create dictionary of months and associated number
months_in_order = ['January', 'February', 'March', 'April', 
                   'May', 'June', 'July', 'August', 
                   'September', 'October', 'November', 'December']
# Dictionaries map each key to an associated value.
# {} creates an empty dictionary (not a set) that we will populate.
dict_of_months = {}
    
for index in range(1, 13): # remember range does not include the last value
    # This adds a key (Month name) and an associated value (Month number)
    dict_of_months[months_in_order[index-1]] = index

In [4]:
import re

converted_list_of_dates = []

# Wrapping part of a pattern in parentheses captures that part as a "group"
    # This is what defines group(1), group(2), group(3)
# .+ means one or more occurrences of any character
# , is just a comma and is not special here
# This effectively searches for the following groups
    # (group 1) (group 2), (group 3)
    # Month Day, Year
# The r prefix makes this a raw string, the usual convention for regex patterns
pattern_to_match = r'(.+) (.+), (.+)'

def regex_convert(date):
    temp_match_object = re.search(pattern_to_match, date) # We know there will be only 1 match, so there is no issue using re.search() here.
    
    raw_month = temp_match_object.group(1) 
    day = temp_match_object.group(2)
    year = temp_match_object.group(3)
    
    month_number = dict_of_months[raw_month] # Converting from word to number
    
    short_date = f'{month_number}/{day}/{year}'
    return short_date

In [1]:
for date in list_of_dates:
    short_date = regex_convert(date)
    converted_list_of_dates.append(short_date)

print(converted_list_of_dates)
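
The pattern `(.+) (.+), (.+)` works here because every generated date has the same shape. As a hedged sketch of a stricter alternative (not the approach this notebook uses), the version below spells out what each group should contain and returns `None` for anything unexpected. The names `month_numbers`, `strict_pattern`, and `strict_convert` are hypothetical, and the lookup table is shortened to three months to keep the example small.

```python
import re

# Small hypothetical lookup table; the notebook builds the full dict_of_months
month_numbers = {"January": 1, "February": 2, "March": 3}

# Letters for the month, 1-2 digits for the day, exactly 4 digits for the year
strict_pattern = r"([A-Za-z]+) (\d{1,2}), (\d{4})"

def strict_convert(date):
    # fullmatch requires the entire string to fit the pattern
    match = re.fullmatch(strict_pattern, date)
    if match is None:
        return None  # the text does not look like "Month Day, Year"
    month, day, year = match.groups()
    if month not in month_numbers:
        return None  # the word is not a month in the lookup table
    return f"{month_numbers[month]}/{day}/{year}"

good = strict_convert("March 5, 1987")  # '3/5/1987'
bad = strict_convert("5 March 1987")    # None
print(good, bad)
```

The trade-off is that a stricter pattern rejects inputs the loose pattern would accept, which is helpful when cleaning messy data but requires deciding what to do with the rejected values.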

Regex can be unintuitive, but it is an incredible tool for advanced data cleaning. In this course, you will not _need_ to use regex, but it can make life a little easier. 