# Using RegEx in Python

## Raw Python Strings
When writing regular expressions in Python, it is recommended to use raw strings. *Raw strings* are string literals with an **r** prefix (written the same way as the **f** prefix of f-strings).
<br>
<br>The r prefix tells Python not to treat backslashes as the start of an escape sequence, so every backslash reaches the regex engine unchanged.
<br>
<br>This means that a pattern such as \n\w can be written as r"\n\w" instead of "\\n\\w", which is what a normal string literal would require, as in many other languages.
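
As a small illustration of the difference, the sketch below compares a normal string with a raw string and then uses the raw pattern in a search. The variable names and the sample text are made up for this example.

```python
import re

# A normal string turns "\n" into a single newline character,
# while a raw string keeps the backslash and the letter n.
normal_length = len("\n")
raw_length = len(r"\n")
print(normal_length, raw_length)  # 1 2

# The doubly escaped normal string and the raw string are the same pattern.
escaped_pattern = "\\n\\w"
raw_pattern = r"\n\w"
print(escaped_pattern == raw_pattern)  # True

# The pattern matches a newline followed by one word character.
found = re.search(raw_pattern, "first line\nsecond line")
print(repr(found.group(0)))  # '\ns'
```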

## Matching a String
To test whether a regular expression matches a specific string in Python, you can use **re.search()**. This function returns None if the pattern doesn't match, or a **re.Match** object with additional information about which part of the string was matched.
<br>
<br>Note that this function stops after the first match, so it is better suited to testing a regular expression than to extracting data.
<br>
<br> Example:
```
matchObject = re.search(pattern, input_str, flags=0)
```

In [9]:
# Import regex library
import re

In [10]:
# Using a regular expression to match a 
# date string. Ignore the output as this 
# is just to test if the regex matches.

regex = r"([a-zA-Z]+) (\d+)"
if re.search(regex, "June 24"):
    
    # Indeed, the expression matches our date string
    
    # If we want, we can use the MatchObject's start()
    # and end() methods to retrieve where the pattern
    # matches in the input string, and the group() method
    # to get all the matches and captured groups.
    
    match = re.search(regex, "June 24")
    
    # This will print "Match at index 0, 7": the match starts at
    # index 0 and ends just before index 7, so it spans the whole string.
    
    print("Match at index %s, %s" % (match.start(), match.end()))
    
    # The groups contain the matched values. In particular: 
    #   match.group(0) always returns the fully matched string.
    #   match.group(1), match.group(2), ... will return the capture
    #      groups in order from left to right in the pattern
    #   match.group() is equivalent to match.group(0)
    
    # So this will print "June 24"
    print("Full match: %s" % (match.group(0)))
    
    # This will print 'June'
    print("Month: %s" % (match.group(1)))
    
    # This will print '24'
    print("Day: %s" % (match.group(2)))
    
else:
    # If re.search() does not match, then None is returned
    print("The regex pattern does not match.")

Match at index 0, 7
Full match: June 24
Month: June
Day: 24
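
To make the "stops after the first match" behaviour concrete, here is a short sketch. The input strings are invented for illustration; `groups()` and `span()` are standard methods of the match object.

```python
import re

regex = r"([a-zA-Z]+) (\d+)"

# Only the first date is reported, even though the input holds two.
first = re.search(regex, "June 24, August 9")
print(first.group(0))   # June 24
print(first.groups())   # ('June', '24')
print(first.span())     # (0, 7)

# No match: the result is None, so check it before calling methods on it.
missing = re.search(regex, "no dates here")
print(missing is None)  # True
```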


## Capturing Groups

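We can use **re.findall()** to perform a global search over the whole input string. What it returns depends on the capture groups in the pattern: with no groups, it returns a list of the full matches; with one group, a list of the text captured by that group; with several groups, a list of tuples with one entry per group. If nothing matches, it returns an empty list. **re.finditer()** performs the same search but yields a match object for each match, which is useful when you also need positions.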
<br> 
<br>Method examples:
```
matchList = re.findall(pattern, input_str, flags=0)
```
```
matchList = re.finditer(pattern, input_str, flags=0)
```

In [11]:
# Let's use a regular expression to match a few date strings.
regex = r"[a-zA-Z]+ \d+"
matches = re.findall(regex, "June 24, August 9, Dec 12")
for match in matches:
    # This will print:
    #   June 24
    #   August 9
    #   Dec 12
    print("Full match: %s" % (match))

Full match: June 24
Full match: August 9
Full match: Dec 12


In [12]:
# To capture the specific months of each date we can use the following pattern
regex = r"([a-zA-Z]+) \d+"
matches = re.findall(regex, "June 24, August 9, Dec 12")
for match in matches:
    # This will now print:
    #   June
    #   August
    #   Dec
    print("Match month: %s" % (match))

Match month: June
Match month: August
Match month: Dec


In [13]:
# If we need the exact positions of each match
regex = r"([a-zA-Z]+) \d+"
matches = re.finditer(regex, "June 24, August 9, Dec 12")
for match in matches:
    # This will now print:
    #   Match at index: 0, 7
    #   Match at index: 9, 17
    #   Match at index: 19, 25
    # which corresponds to the start and end of each match in the input string
    print("Match at index: %s, %s" % (match.start(), match.end()))

Match at index: 0, 7
Match at index: 9, 17
Match at index: 19, 25
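
The examples above use zero or one capture group. As a sketch of the remaining case, a pattern with two groups makes `re.findall()` return tuples; the variable names here are illustrative only.

```python
import re

# Two capture groups: findall returns one (month, day) tuple per match.
date_pattern = r"([a-zA-Z]+) (\d+)"
pairs = re.findall(date_pattern, "June 24, August 9, Dec 12")
print(pairs)  # [('June', '24'), ('August', '9'), ('Dec', '12')]

# Nothing matches here: findall returns an empty list, not None.
no_matches = re.findall(date_pattern, "no dates here")
print(no_matches)  # []
```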


## Finding and Replacing Strings
Another common task is to find and replace part of a string using regex, for example to replace all instances of an old email domain, or to swap the order of some text. You can do this in Python with the **re.sub()** function.
<br>
The optional *count* argument is the maximum number of replacements to make in the input string. If it is omitted or zero, every match in the string is replaced.
<br>
<br> Method Example:
```
replacedString = re.sub(pattern, replacement_pattern, input_str, count=0, flags=0)
```

In [14]:
# Let's try to reverse the order of the day and month in a date 
# string. Notice how the replacement string also contains metacharacters
# (the back references to the captured groups) so we use a raw 
# string for that as well.
regex = r"([a-zA-Z]+) (\d+)"

# This will reorder the string and print:
#   24 of June, 9 of August, 12 of Dec
print(re.sub(regex, r"\2 of \1", "June 24, August 9, Dec 12"))

24 of June, 9 of August, 12 of Dec
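
The effect of the *count* argument can be sketched with the same date string. Passing it by keyword, as done here, is a stylistic choice for readability, and the variable names are only for illustration.

```python
import re

regex = r"([a-zA-Z]+) (\d+)"
dates = "June 24, August 9, Dec 12"

# count=1 replaces only the first match and leaves the rest untouched.
first_only = re.sub(regex, r"\2 of \1", dates, count=1)
print(first_only)  # 24 of June, August 9, Dec 12

# count=0 (the default) replaces every match.
all_dates = re.sub(regex, r"\2 of \1", dates, count=0)
print(all_dates)  # 24 of June, 9 of August, 12 of Dec
```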


## re Flags
The Python regex functions also take an optional *flags* argument. Most flags are a convenience and can be written into the regex directly, but some are particularly useful:
* **re.IGNORECASE** makes the pattern case insensitive.
* **re.MULTILINE** is useful if your input string has newline characters (\n). This flag allows the start and end metacharacters (^ and $ respectively) to match at the beginning and end of each line, instead of only at the beginning and end of the whole input string.
* **re.DOTALL** allows the dot (.) metacharacter to match all characters, including newline (\n).
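
A minimal sketch of the three flags in action follows. The two-line sample text and the variable names are made up for this example; the last lines show the inline form `(?i)`, which is one way of writing a flag into the regex itself.

```python
import re

text = "first line\nSecond line"

# re.IGNORECASE: the lowercase pattern also matches "Second".
ignore_case = re.findall(r"second", text, flags=re.IGNORECASE)
print(ignore_case)  # ['Second']

# Without re.MULTILINE, ^ only matches at the start of the whole string.
single_line = re.findall(r"^\w+", text)
print(single_line)  # ['first']

# With re.MULTILINE, ^ also matches right after each newline.
multi_line = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(multi_line)  # ['first', 'Second']

# Without re.DOTALL, the dot stops at the newline.
dot_default = re.search(r"line.*", text).group(0)
print(repr(dot_default))  # 'line'

# With re.DOTALL, the dot matches the newline as well.
dot_all = re.search(r"line.*", text, flags=re.DOTALL).group(0)
print(repr(dot_all))  # 'line\nSecond line'

# Inline form: (?i) at the start of the pattern acts like re.IGNORECASE.
inline_flag = re.findall(r"(?i)second", text)
print(inline_flag)  # ['Second']
```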

## Compiling a pattern for performance
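In Python, a pattern has to be compiled before it can be matched, and repeating that work for many input strings can be slow. If you need to test or extract information from many input strings using the same expression, it is recommended that you compile the pattern once with **re.compile()**. The re module also caches recently used patterns, so the gain is often modest when only a few expressions are in use. This function returns a **re.Pattern** object.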
```
regexObject = re.compile(pattern, flags=0)
```
The returned object has the same methods as the module-level functions above, except that they take only the input string and no longer need the pattern or flags on each call.



In [15]:
# Let's create a pattern and extract some information with it
regex = re.compile(r"(\w+) World")
result = regex.search("Hello World is the easiest")
if result:
    # This will print:
    #   0 11
    # for the start and end of the match
    print(result.start(), result.end())

# This will print:
#   Hello
#   Bonjour
# for each of the captured groups that matched
for result in regex.findall("Hello World, Bonjour World"):
    print(result)

# This will substitute "World" with "Earth" and print:
#   Hello Earth
print(regex.sub(r"\1 Earth", "Hello World"))

0 11
Hello
Bonjour
Hello Earth
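
Flags can be attached when the pattern is compiled, so they do not need to be repeated on each call. The sketch below reuses one compiled pattern across several strings; the input lines and variable names are hypothetical and chosen only for illustration.

```python
import re

# Compile once, with the flag stored on the pattern object.
date_regex = re.compile(r"([a-z]+) (\d+)", flags=re.IGNORECASE)

# Hypothetical input lines, invented for this example.
lines = ["Due June 24", "no date here", "moved to DEC 12"]

months = []
for line in lines:
    match = date_regex.search(line)
    # search() returns None for lines without a date, so check first.
    if match:
        months.append(match.group(1))

print(months)  # ['June', 'DEC']
```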
