# 2.4 Regular Expressions

Regular expressions, or "regex" for short, are a special syntax for searching for strings that match a specified pattern. They are a useful tool for filtering and sorting through text when you want to match patterns rather than one or more hard-coded strings.

The syntax has many options, so it's best to jump in and get started with some examples.

In [1]:
import re

## Raw Strings

Python gives certain character sequences a special meaning. For example, \n in Python indicates a new line. Sometimes, however, these sequences appear in our strings and we want to tell Python that a \n in our text is a literal backslash followed by the letter n, rather than a new line.

We can put the 'r' character before a string's opening quote to tell Python that our text is what is known as a "raw string". Raw strings are especially handy for regex patterns, which often contain backslashes.

In [4]:
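# print text without using raw string indicator: "\n" is read as a new line
# (recent Python versions also warn that "\D" is an invalid escape sequence)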
my_folder = "C:\Desktop\notes"
print(my_folder)

C:\Desktop
otes


In [5]:
# print text with raw string indicator
my_folder = r"C:\Desktop\notes"
print(my_folder)

C:\Desktop\notes
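
To see what the r prefix changes, it can help to compare the two strings directly. This is a minimal sketch; the variable names `escaped` and `raw` are illustrative only.

```python
# "\n" is a single character (a new line);
# r"\n" is two characters (a backslash followed by the letter n)
escaped = "\n"
raw = r"\n"

print(len(escaped))  # 1
print(len(raw))      # 2

# A raw string is equivalent to doubling each backslash by hand
print(raw == "\\n")  # True
```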


## re.search

re.search is a function that checks whether a certain pattern occurs anywhere in a string. It uses the logic re.search("pattern to find", "string to find it in"). If the pattern is found, it returns a Match object describing the first match (calling .group() on it gives the matched text); if the pattern is not found, it returns None.

In [10]:
result_search = re.search("pattern", r"string to contain the pattern ")
print(result_search)
print(result_search.group())

<re.Match object; span=(22, 29), match='pattern'>
pattern


In [9]:
result_search_2 = re.search("pattern", r"the phrase to find isn't in this string")
print(result_search_2)

None
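
Because re.search returns None when nothing matches, calling .group() on the result without checking first raises an AttributeError. One way to guard against that is sketched below; `find_pattern` is a hypothetical helper written for this example, not part of the re module.

```python
import re

def find_pattern(pattern, text):
    """Return the matched text, or None when the pattern is absent."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group()

print(find_pattern("pattern", "string to contain the pattern "))           # pattern
print(find_pattern("pattern", "the phrase to find isn't in this string"))  # None

# The Match object also records where the match was found
match = re.search("pattern", "string to contain the pattern ")
print(match.span())   # (22, 29)
print(match.start())  # 22
```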


## re.sub

re.sub allows us to find certain text and replace it. It uses the logic re.sub("pattern to find", "replacement text", "string") and returns a new string with every match replaced; the original string is left unchanged.

In [11]:
string = r"sara was able to help me find the items i needed quickly"

In [12]:
new_string = re.sub("sara", "sarah", string)
print(new_string)

sarah was able to help me find the items i needed quickly
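
One thing to watch for: a plain pattern such as "sara" also matches inside longer words, so a string that already contains "sarah" would end up with a doubled "h". The sketch below uses the word-boundary token \b to avoid that. The sample sentence is made up for illustration.

```python
import re

text = "sarah and sara both helped"

# Plain pattern: also matches the "sara" inside "sarah"
print(re.sub("sara", "sarah", text))       # sarahh and sarah both helped

# \b marks a word boundary, so only the standalone word "sara" is replaced
print(re.sub(r"\bsara\b", "sarah", text))  # sarah and sarah both helped
```

Note that the second pattern is written as a raw string. Without the r prefix, Python would read "\b" as a backspace character rather than passing the backslash through to the regex engine.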


## Regex Syntax

The real power of regex is being able to leverage the syntax to create more complex searches/replacements.

In [13]:
customer_reviews = ['sam was a great help to me in the store', 
                    'the cashier was very rude to me, I think her name was eleanor', 
                    'amazing work from sadeen!', 
                    'sarah was able to help me find the items i needed quickly', 
                    'lucy is such a great addition to the team', 
                    'great service from sara she found me what i wanted'
                   ]

**Find sarah's reviews, accounting for the alternative spelling sara**

In [16]:
sarahs_reviews = []

In [18]:
# "?" makes the preceding "h" optional, so this matches both "sara" and "sarah"
pattern_to_find = r"sarah?"

In [19]:
for string in customer_reviews:
    if re.search(pattern_to_find, string):
        sarahs_reviews.append(string)

In [20]:
print(sarahs_reviews)

['sarah was able to help me find the items i needed quickly', 'great service from sara she found me what i wanted']
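
The same filtering step can also be written as a list comprehension. The sketch below applies the pattern to a short, made-up list of names (not the full review list) to show which strings the optional "h" lets through.

```python
import re

# Small illustrative sample, not the full review list
names = ["sara", "sarah", "sam", "sadeen"]

# "h?" means zero or one "h", so both spellings match
matches = [name for name in names if re.search(r"sarah?", name)]
print(matches)  # ['sara', 'sarah']
```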


**Find reviews that start with the letter a**

In [21]:
a_reviews = []
# "^" anchors the match to the start of the string
pattern_to_find = r"^a"

In [23]:
for string in customer_reviews:
    if re.search(pattern_to_find, string):
        a_reviews.append(string)

In [24]:
print(a_reviews)

['amazing work from sadeen!']
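
Regex matching is case-sensitive by default, so the pattern above would miss a review that starts with a capital "A". A possible way to handle that, assuming mixed-case input (the sample strings here are made up), is the re.IGNORECASE flag:

```python
import re

# Illustrative sample with mixed capitalisation
reviews = ["Amazing work from sadeen!", "amazing work", "sam was a great help"]

# re.IGNORECASE lets "^a" match both "a" and "A" at the start of the string
starts_with_a = [r for r in reviews if re.search(r"^a", r, flags=re.IGNORECASE)]
print(starts_with_a)  # ['Amazing work from sadeen!', 'amazing work']
```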


**Find reviews that end with the letter y**

In [25]:
y_reviews = []
# "$" anchors the match to the end of the string
pattern_to_find = r"y$"

In [28]:
for string in customer_reviews:
    if re.search(pattern_to_find, string):
        y_reviews.append(string)

In [29]:
print(y_reviews)

['sarah was able to help me find the items i needed quickly']
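
A review that ends in "y" followed by punctuation, such as "quickly!", would not match this pattern, because "$" requires the "y" to be the very last character. If that matters, one option is to allow trailing non-word characters with \W*. The strings below are made up for illustration.

```python
import re

# Illustrative sample strings, not taken from the review list
samples = ["needed quickly", "needed quickly!", "lucy is great"]

# "y$": the string must end with "y" exactly
strict = [s for s in samples if re.search(r"y$", s)]

# "y\W*$": "y" followed by zero or more non-word characters, then the end
lenient = [s for s in samples if re.search(r"y\W*$", s)]

print(strict)   # ['needed quickly']
print(lenient)  # ['needed quickly', 'needed quickly!']
```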


**Find reviews that contain the words needed or wanted**

In [35]:
needwant_reviews = []

In [46]:
# "|" means "or": match "need" or "want", followed by "ed"
pattern_to_find = r"(need|want)ed"

In [47]:
for string in customer_reviews:
    if re.search(pattern_to_find, string):
        needwant_reviews.append(string)

In [48]:
print(needwant_reviews)

['sarah was able to help me find the items i needed quickly', 'great service from sara she found me what i wanted']
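
The parentheses in this pattern do two jobs: they limit the "or" to the two word stems, and they capture whichever stem matched so it can be read back with .group(1). A short sketch of both effects (the sentence "i need help" is made up for the comparison):

```python
import re

text = "great service from sara she found me what i wanted"
match = re.search(r"(need|want)ed", text)

print(match.group())   # wanted  (the whole match)
print(match.group(1))  # want    (the part captured by the parentheses)

# Without the parentheses, "|" splits the whole pattern into "need" or "wanted"
print(re.search(r"need|wanted", "i need help").group())  # need
print(re.search(r"(need|want)ed", "i need help"))        # None
```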


**Remove anything from the review that isn't a word character or whitespace (i.e. remove punctuation)**

In [50]:
no_punct_list = []
pattern_to_find = r"[^\w\s]"

# [^...] means "any character not listed here", \w matches a word character (letter, digit or underscore) and \s matches whitespace: so find any character that is neither a word character nor whitespace

In [52]:
for string in customer_reviews:
    no_punct_string = re.sub(pattern_to_find, "", string)
    no_punct_list.append(no_punct_string)

In [53]:
print(no_punct_list)

['sam was a great help to me in the store', 'the cashier was very rude to me I think her name was eleanor', 'amazing work from sadeen', 'sarah was able to help me find the items i needed quickly', 'lucy is such a great addition to the team', 'great service from sara she found me what i wanted']
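
Since \w counts the underscore as a word character, this pattern leaves underscores in place. If they should go too, one possible variation adds "_" as an alternative. The sketch also uses re.compile, which builds a reusable pattern object when the same pattern is applied many times. The name `punct` and the sample string are illustrative only.

```python
import re

# Compile once, then reuse the pattern object's .sub method
punct = re.compile(r"[^\w\s]")

sample = "amazing_work from sadeen!"

# The underscore survives because \w treats it as a word character
print(punct.sub("", sample))                 # amazing_work from sadeen

# Adding "|_" removes underscores as well
print(re.sub(r"[^\w\s]|_", "", sample))      # amazingwork from sadeen
```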
