# Regular expressions 

Searching for and extracting text is so common that Python ships with a powerful
module for regular expressions, re, which handles many of these tasks quite elegantly.

In [1]:
import re

In [2]:
quote = "I am a Data Analyst. I love data. My goal is to help you learn everything Data."

## Search

In [3]:
re.search('goal', quote)

<re.Match object; span=(37, 41), match='goal'>

In [4]:
re.search('Data', quote).group()

'Data'

## Find 

In [5]:
# Returns a list
re.findall('Data', quote)

['Data', 'Data']

In [6]:
len(re.findall('Data', quote))

2

In [7]:
s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
# Raw string (r prefix) so that \S reaches re without Python treating it as an escape
lst = re.findall(r'\S+@\S+', s)
print(lst)

['csev@umich.edu', 'cwen@iupui.edu']


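The findall() method searches the string in the second argument and returns a
list of all of the substrings that look like email addresses. The two-character
sequence \S matches a single non-whitespace character, so \S+@\S+ matches one or more
non-whitespace characters, an at-sign, and one or more non-whitespace characters.
'@2PM' is skipped because no non-whitespace character precedes its at-sign.

To match a special character literally, such as a period, prefix that character with a backslash.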
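As a small illustration of escaping, the sketch below compares an unescaped dot with an escaped one. The sample string `text` and the variable names are made up for this example; the patterns are written as raw strings so the backslash is passed to re unchanged.

```python
import re

# Hypothetical sample string, chosen only for illustration
text = "a.b c"

# Unescaped dot: a wildcard that matches any character except a newline
any_char = re.findall(r'.', text)

# Escaped dot: matches only a literal period
literal_dot = re.findall(r'\.', text)

print(any_char)     # ['a', '.', 'b', ' ', 'c']
print(literal_dot)  # ['.']
```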

## Split

In [8]:
re.split(r'\.', quote)

['I am a Data Analyst',
 ' I love data',
 ' My goal is to help you learn everything Data',
 '']

## Sub 

In [9]:
re.sub('goal', 'task', quote)

'I am a Data Analyst. I love data. My task is to help you learn everything Data.'

In [10]:
re.sub('I', 'You', quote, count=1)

'You am a Data Analyst. I love data. My goal is to help you learn everything Data.'

## Pattern matching 

re.search() scans the whole string and returns the first match found anywhere in it.
re.match() succeeds only if the pattern matches at the very beginning of the string.
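The difference can be sketched with the same quote used earlier. The variable names here are illustrative only and are not part of the notebook.

```python
import re

quote = "I am a Data Analyst. I love data. My goal is to help you learn everything Data."

# search() scans the whole string, so it finds 'goal' in the middle
found_by_search = re.search('goal', quote)

# match() only tries position 0, where the text is 'I am', so it returns None
found_by_match = re.match('goal', quote)

# 'I am' sits at the very start, so match() succeeds here
start_match = re.match('I am', quote)

print(found_by_search.span())  # (37, 41)
print(found_by_match)          # None
print(start_match.group())     # I am
```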

In [11]:
# Match strings made up only of uppercase letters: ^ anchors the start and $ the end
pattern = re.compile("^[A-Z]+$")

In [12]:
pattern.search('Hello World')

In [13]:
pattern.search('HELLO WORLD')

In [14]:
pattern.search('HELLOWORLD')

<re.Match object; span=(0, 10), match='HELLOWORLD'>

In [15]:
# Letters and whitespace only; raw string so that \s is passed to re unchanged
pattern = re.compile(r"^[a-zA-Z\s]+$")

In [16]:
print(pattern.search('Hello World'))

<re.Match object; span=(0, 11), match='Hello World'>


In [17]:
print(pattern.search('Hello world'))

<re.Match object; span=(0, 11), match='Hello world'>


In [18]:
print(pattern.search('helloworld'))

<re.Match object; span=(0, 10), match='helloworld'>


## Special characters and sequences

- ^ Matches the beginning of the line.
- $ Matches the end of the line.
- . Matches any character except a newline (a wildcard).
- \s Matches a whitespace character.
- \S Matches a non-whitespace character (opposite of \s).
- * Applies to the immediately preceding character or group and matches it zero or more times.
- *? Applies to the immediately preceding character or group and matches it zero or more times in “non-greedy mode”, that is, as few times as possible.
- + Applies to the immediately preceding character or group and matches it one or more times.
- +? Applies to the immediately preceding character or group and matches it one or more times in “non-greedy mode”, that is, as few times as possible.
- [aeiou] Matches a single character as long as that character is in the specified set. In this example, it would match “a”, “e”, “i”, “o”, or “u”, but no other characters.
- [a-z0-9] You can specify ranges of characters using the minus sign. This example matches a single character that must be a lowercase letter or a digit.
- [^A-Za-z] When the first character in the set notation is a caret, it inverts the logic. This example matches a single character that is anything other than an uppercase or lowercase letter.
- ( ) Parentheses do not change what the expression matches, but they let you extract a particular part of the matched string rather than the whole string when using findall().
- \b Matches the empty string, but only at the start or end of a word.
- \B Matches the empty string, but not at the start or end of a word.
- \d Matches any decimal digit; equivalent to the set [0-9].
- \D Matches any non-digit character; equivalent to the set [^0-9].
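
A few of the entries above are easier to see in action. The following is a minimal sketch; the sample strings `html`, `words`, and `note` are invented for illustration, while `s` is the email example from the Find section.

```python
import re

# Hypothetical sample strings, invented for this illustration
html = "<b>bold</b> and <i>italic</i>"
words = "data database metadata"
note = "Meet at 2PM in room 101"
s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'

# Greedy: .+ takes as much as it can, from the first '<' to the last '>'
greedy = re.findall(r'<.+>', html)

# Non-greedy: .+? stops at the first '>' that completes a match
lazy = re.findall(r'<.+?>', html)

# Parentheses: findall() returns only the part captured by the group
domains = re.findall(r'\S+@(\S+)', s)

# \b: match 'data' only as a whole word
whole_word = re.findall(r'\bdata\b', words)

# \d+: runs of one or more digits
numbers = re.findall(r'\d+', note)

print(greedy)      # ['<b>bold</b> and <i>italic</i>']
print(lazy)        # ['<b>', '</b>', '<i>', '</i>']
print(domains)     # ['umich.edu', 'iupui.edu']
print(whole_word)  # ['data']
print(numbers)     # ['2', '101']
```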