# What is a Regular Expression (Regex)?
A regular expression, commonly known as regex, is a compact notation for describing patterns of text that can be searched for, matched, and manipulated within strings. It gives developers a concise and flexible way to handle validation, extraction, and substitution of text data.

In [1]:
# Python regex library
import re

In [2]:
# Searches for "fox"
text = "The quick brown fox jumps over the lazy dog."
# Using raw string because it tells Python to treat backslashes (\) as literal characters rather than escape characters
pattern = r'fox'
match = re.search(pattern, text)
if match:
    print("Found:", match.group())
else:
    print("Pattern not found.")

Found: fox
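
The raw-string comment in the cell above matters most once a pattern contains backslashes. As a minimal sketch (the variable names `plain_pattern` and `raw_pattern` are only for illustration), compare how Python reads `\b` with and without the `r` prefix:

```python
import re

text = "The quick brown fox jumps over the lazy dog."

# Without the r prefix, Python turns "\b" into a backspace character
# before the regex engine ever sees the pattern.
plain_pattern = "\bfox\b"   # backspace + "fox" + backspace
raw_pattern = r"\bfox\b"    # word boundary + "fox" + word boundary

print(len(plain_pattern), len(raw_pattern))
print(re.search(plain_pattern, text))
print(re.search(raw_pattern, text).group())
```

The plain string is five characters long and finds nothing, because the text contains no backspace characters. The raw string is seven characters long and matches "fox" as a whole word.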


In [3]:
# Checks if "fox" is at the beginning of the string
pattern = r'fox'
match = re.match(pattern, text)
if match:
    print("Found at the beginning:", match.group())
else:
    print("Pattern not found at the beginning.")

text = "fox is with the lazy dog."
match = re.match(pattern, text)
if match:
    print("Found at the beginning:", match.group())
else:
    print("Pattern not found at the beginning.")

Pattern not found at the beginning.
Found at the beginning: fox
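
The difference between `re.search` and `re.match` is easier to see side by side. The sketch below also includes `re.fullmatch`, which the cells above do not use; it only succeeds when the entire string matches the pattern. The sample strings are made up for illustration.

```python
import re

pattern = r'fox'
samples = ["fox", "fox is with the lazy dog.", "The quick brown fox"]

results = {}
for s in samples:
    results[s] = (
        bool(re.match(pattern, s)),      # pattern must start at position 0
        bool(re.search(pattern, s)),     # pattern may start anywhere
        bool(re.fullmatch(pattern, s)),  # pattern must cover the whole string
    )
    print(repr(s), "| match, search, fullmatch:", results[s])
```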


In [4]:
text = "The quick brown fox jumps over the lazy dog."
# Finds all words with 3 characters
pattern = r'\b\w{3}\b'
matches = re.findall(pattern, text)
print("Words with 3 characters:", matches)

Words with 3 characters: ['The', 'fox', 'the', 'dog']


In [5]:
# Replaces words with 3 characters with "XXX"
pattern = r'\b\w{3}\b'
replacement = "XXX"
new_text = re.sub(pattern, replacement, text)
print("New text:", new_text)

New text: XXX quick brown XXX jumps over XXX lazy XXX.
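
`re.sub` also accepts a function in place of a fixed replacement string, which helps when the replacement depends on what was matched. A small sketch, assuming a hypothetical helper named `shout`; `re.subn` is used here only because it also reports how many replacements were made:

```python
import re

text = "The quick brown fox jumps over the lazy dog."
pattern = r'\b\w{3}\b'

def shout(match):
    # Hypothetical helper: return the matched word in upper case
    return match.group().upper()

# subn returns a (new_string, number_of_replacements) tuple
new_text, count = re.subn(pattern, shout, text)
print(new_text)
print("Replacements made:", count)
```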


# Practical Applications of Regular Expressions
Regular expressions find wide application in various fields such as text processing, data validation, and pattern matching. From extracting data from unstructured text to validating user inputs, regular expressions provide a versatile and efficient solution for handling complex string manipulation tasks in programming and data analysis.
Some examples are:
- Data Cleaning
- Data Extraction
- Data Validation
- Text Mining and Natural Language Processing
- Log Analysis
- Data Transformation
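
Log analysis is not demonstrated in the cells that follow, so here is a rough sketch of what it might look like. The log format, the sample lines, and the group names (`date`, `time`, `level`, `message`) are all assumptions made up for this example.

```python
import re

# Assumed log format: "YYYY-MM-DD HH:MM:SS LEVEL message"
LOG_LINE = re.compile(
    r'^(?P<date>\d{4}-\d{2}-\d{2}) '
    r'(?P<time>\d{2}:\d{2}:\d{2}) '
    r'(?P<level>[A-Z]+) '
    r'(?P<message>.*)$'
)

log_lines = [
    "2024-01-15 08:30:01 INFO Server started",
    "2024-01-15 08:31:47 ERROR Database connection failed",
    "not a log line",
]

parsed = []
for line in log_lines:
    m = LOG_LINE.match(line)
    if m:  # skip lines that do not follow the assumed format
        parsed.append(m.groupdict())

errors = [entry["message"] for entry in parsed if entry["level"] == "ERROR"]
print(errors)
```

Named groups (`(?P<name>...)`) keep the extraction readable: each parsed line becomes a dictionary keyed by field name.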

In [6]:
# Whitespace Removal:
text = "   Remove    extra   spaces \n and\ttabs."
cleaned_text = re.sub(r'\s+', ' ', text)
print(cleaned_text.strip())

Remove extra spaces and tabs.


Pattern: `r'\s+'`
- `\s`: Matches any whitespace character, including spaces, tabs, and newlines.
- `+`: Matches one or more occurrences of the preceding token, so `\s+` matches a whole run of whitespace.
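
To see why the `+` matters, the sketch below compares `\s` with `\s+` on the same text. The last line shows a regex-free alternative using `str.split` and `str.join`, which gives the same result for this particular input.

```python
import re

text = "   Remove    extra   spaces \n and\ttabs."

# \s alone swaps each whitespace character for a space, one for one,
# so runs of whitespace stay just as long as before.
one_for_one = re.sub(r'\s', ' ', text)

# \s+ treats each run of whitespace as a single match.
collapsed = re.sub(r'\s+', ' ', text).strip()

# Splitting on whitespace and re-joining gives the same result here.
split_join = " ".join(text.split())

print(repr(one_for_one))
print(repr(collapsed))
print(collapsed == split_join)
```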

In [7]:
# Email Validation
def is_valid_email(email):
    pattern = r'^[\w\.-]+@[a-zA-Z\d\.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None

email = "example@email.com"
print(is_valid_email(email))
email = "example.email.com"
print(is_valid_email(email))

True
False


Email pattern: `r'^[\w\.-]+@[a-zA-Z\d\.-]+\.[a-zA-Z]{2,}$'`
- `^`: Asserts the start of the string.
- `[\w\.-]+`: Matches one or more occurrences of any word character (\w), dot (.), or hyphen (-).
- `@`: Matches the literal "@" symbol.
- `[a-zA-Z\d\.-]+`: Matches one or more occurrences of any letter (uppercase or lowercase), digit, dot (.), or hyphen (-) after the "@" symbol.
- `\.`: Matches a literal dot (.).
- `[a-zA-Z]{2,}`: Matches two or more occurrences of any letter (uppercase or lowercase) after the dot, indicating the top-level domain.
- `$`: Asserts the end of the string.
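
The same breakdown can live inside the pattern itself by using the `re.VERBOSE` flag, which ignores whitespace and allows `#` comments within the pattern. This is a sketch of the pattern above rewritten that way; the extra test addresses are made up, and the last one shows that this simple pattern is quite permissive.

```python
import re

EMAIL_RE = re.compile(r"""
    ^                 # start of string
    [\w.-]+           # local part: word characters, dots, hyphens
    @                 # literal at sign
    [a-zA-Z\d.-]+     # domain: letters, digits, dots, hyphens
    \.                # literal dot before the top-level domain
    [a-zA-Z]{2,}      # top-level domain: two or more letters
    $                 # end of string
""", re.VERBOSE)

def is_valid_email(email):
    return EMAIL_RE.match(email) is not None

for candidate in ["example@email.com", "example.email.com", "a@b.c", "a..b@-.com"]:
    print(candidate, is_valid_email(candidate))
```

The address `a..b@-.com` passes even though it is not a sensible email address, so a pattern like this is better treated as a quick sanity check than as full validation.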

In [8]:
# URL Extraction
text = "Visit our website at https://www.example.com for more information. Visit https://example.com/about for more details."
urls = re.findall(r'https?://(?:[-\w./]|(?:%[\da-fA-F]{2}))+', text)
print(urls)

['https://www.example.com', 'https://example.com/about']


URL Pattern: `https?://(?:[-\w./]|(?:%[\da-fA-F]{2}))+`
- `https?://`: Matches the literal characters "http://" or "https://", where the s is optional due to the ? quantifier.
- `(?:...)`: Non-capturing group, used for grouping without capturing the matched text.
- `[-\w./]`: Matches any word character (letter, digit, or underscore), hyphen, period, or forward slash.
- `|`: Alternation, matches either the pattern on the left or the pattern on the right.
- `(?:%[\da-fA-F]{2})`: Non-capturing group that matches a percent sign `%` followed by two hexadecimal digits (`[\da-fA-F]` matches any hexadecimal digit: 0-9, a-f, A-F).
- `+`: Quantifier, matches one or more occurrences of the preceding group or character.
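
The choice of non-capturing groups is not just cosmetic: `re.findall` returns the captured groups instead of the full match as soon as the pattern contains a capturing group. A short sketch with a made-up sample text, where the only change between the two patterns is a capturing group around the scheme:

```python
import re

text = "Visit https://www.example.com/a%20b and http://example.org/about today"

# All groups non-capturing: findall returns the full matched URLs
full_urls = re.findall(r'https?://(?:[-\w./]|(?:%[\da-fA-F]{2}))+', text)

# One capturing group around the scheme: findall now returns only that group
schemes = re.findall(r'(https?)://(?:[-\w./]|(?:%[\da-fA-F]{2}))+', text)

# finditer exposes both the whole match (group 0) and the captured group
pairs = [
    (m.group(1), m.group(0))
    for m in re.finditer(r'(https?)://(?:[-\w./]|(?:%[\da-fA-F]{2}))+', text)
]

print(full_urls)
print(schemes)
print(pairs)
```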

# Regex in McDonald's Reviews
Let's use regex to pull some insights out of a CSV file of complaints sent to McDonald's.

In [9]:
import pandas as pd
import re

# Load the CSV file into a pandas DataFrame
df = pd.read_csv('./McDonalds-Yelp-Sentiment-DFE.csv', encoding='latin1')

In [10]:
# Define a list of negative connotations for regex use
negative_connotations = ['unhealthy', 'dirty', 'unsanitary', 'rude', 'slow', 'poor', 'bad', 'gross', 'disgusting', 'nasty', 'vile', 'unpleasant', 'unappetizing', 'unfriendly', 'unprofessional', 'unwelcoming', 'unhelpful', 'unaccommodating', 'uncooperative', 'unresponsive', 'unreasonable', 'unfair']

# Create a dictionary to store counts of negative connotations and the most common negative connotation for each city
city_complaint_counts = {}
city_highest_complaint = {}

# Define a regex pattern for negative connotations with word boundaries
pattern = re.compile(r'\b(?:' + '|'.join(negative_connotations) + r')\b', flags=re.IGNORECASE)

# Iterate over each row in the DataFrame
for index, row in df.iterrows():
    city = row['city']
    review = row['review']
    
    # Initialize counts for each city if not already present
    if city not in city_complaint_counts:
        city_complaint_counts[city] = {connotation: 0 for connotation in negative_connotations}
    
    # Find negative connotations in the review using the compiled pattern
    matches = pattern.findall(review)
    for connotation in matches:
        connotation = connotation.lower()
        city_complaint_counts[city][connotation] += 1

        # Update the most frequent negative connotation for the city
        # (compare occurrence counts, not the words themselves)
        current = city_highest_complaint.get(city)
        if current is None or city_complaint_counts[city][connotation] > city_complaint_counts[city][current]:
            city_highest_complaint[city] = connotation
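
The same tally can be sketched more compactly with `collections.Counter`. The example below does not use the real CSV: it runs on a few invented rows and a shortened word list so that it is self-contained, and it skips rows whose review is not a string.

```python
import re
from collections import Counter, defaultdict

# Shortened, hypothetical word list and sample rows for illustration only
negative_words = ['dirty', 'rude', 'slow', 'bad']
pattern = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, negative_words)) + r')\b',
    flags=re.IGNORECASE,
)

sample_rows = [
    ("Atlanta", "Slow service and a dirty table. So slow!"),
    ("Atlanta", "Rude staff, slow line."),
    ("Dallas", "Bad fries, bad coffee."),
    ("Dallas", None),  # a missing review
]

counts = defaultdict(Counter)
for city, review in sample_rows:
    if not isinstance(review, str):
        continue  # skip missing or non-text reviews
    counts[city].update(word.lower() for word in pattern.findall(review))

for city, counter in counts.items():
    word, n = counter.most_common(1)[0]
    print(f"{city}: {word} ({n} occurrences)")
```

`re.escape` is not strictly needed for plain words like these, but it keeps the pattern safe if a term containing regex metacharacters is ever added to the list.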

In [11]:
# Find the city with the most negative connotations
most_negative_city = max(city_complaint_counts, key=lambda k: sum(city_complaint_counts[k].values()))
print(f"The city with the most negative connotations is: {most_negative_city}")

The city with the most negative connotations is: Las Vegas


In [12]:
# Print each city's biggest negative connotation
for city in city_complaint_counts.keys():
    key_with_highest_value = max(city_complaint_counts[city], key=lambda k: city_complaint_counts[city][k])
    print(f"The biggest negative connotation for {city} is: {key_with_highest_value} ({city_complaint_counts[city][key_with_highest_value]} occurrences)")

The biggest negative connotation for Atlanta is: slow (26 occurrences)
The biggest negative connotation for Las Vegas is: bad (53 occurrences)
The biggest negative connotation for Dallas is: bad (8 occurrences)
The biggest negative connotation for Portland is: rude (12 occurrences)
The biggest negative connotation for Chicago is: bad (32 occurrences)
The biggest negative connotation for Cleveland is: slow (10 occurrences)
The biggest negative connotation for Houston is: slow (14 occurrences)
The biggest negative connotation for Los Angeles is: bad (23 occurrences)
The biggest negative connotation for New York is: bad (25 occurrences)
The biggest negative connotation for nan is: bad (10 occurrences)


In [13]:
# Match "bad" only when "dirty" appears somewhere after it on the same line (positive lookahead)
bad_dirty_pattern = re.compile(r'\bbad\b(?=.*\bdirty\b)', flags=re.IGNORECASE)

# Iterate over each row in the DataFrame
for index, row in df.iterrows():
    review = row['review']
    
    # Search for the pattern in the review
    match = re.search(bad_dirty_pattern, review)
    
    # If the pattern is found, print the review
    if match:
        print(f"Review Found:\n{review}\n")

Review Found:
I've only been to this McDonald's twice and both times were bad experiences. Not only does the place smell of dirty rags and cleaner solutions, but the service is bad as well. I guess the convenient location is what keeps this place open but for me, I'll never come back.

Review Found:
We always see this place as we go into Super Walmart to shop and its always busy. So we decided to try them out after we shopped.Bad!! The floors were dirty, the customer service was nothing to talk about and the fries tasted like they had been sitting under the heat lamp for a long while.FYI just because they are always busy doesn't mean they are good. McDonalds of all places!!!!

Review Found:
This has to be the worst McDonald's location I have ever been to. It is too bad since it is only a couple of miles away from the corporate headquarters you would think they would be out to impress. We walked in with nobody in front of us and waited for someone to even come to the front counter.
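
A few small strings make the behaviour of the lookahead easier to check than full reviews do. In the sketch below the sample sentences are invented. It shows that the order of the two words matters, that the match itself is only "bad" because the lookahead consumes no text, and that `.` does not cross line breaks unless `re.DOTALL` is set.

```python
import re

bad_then_dirty = re.compile(r'\bbad\b(?=.*\bdirty\b)', flags=re.IGNORECASE)
bad_then_dirty_dotall = re.compile(
    r'\bbad\b(?=.*\bdirty\b)', flags=re.IGNORECASE | re.DOTALL
)

samples = [
    "The service was bad and the floor was dirty.",  # "bad" before "dirty"
    "The floor was dirty and the service was bad.",  # "dirty" before "bad"
    "Bad food.\nDirty tables.",                      # words on different lines
]

for s in samples:
    m = bad_then_dirty.search(s)
    print(repr(s), "->", m.group() if m else None)

# With DOTALL, "." also matches newlines, so the third sample now matches
m = bad_then_dirty_dotall.search(samples[2])
print(m.group() if m else None)
```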

# Building Your Own Regular Expressions
As you've seen, regex syntax can be confusing and daunting. Luckily, there are multiple tools available to help you build and test your regular expressions. I like to use [Regex101](https://regex101.com/) to build and test mine. *AI models are very useful and produce pretty good regex suggestions, but you need to validate that what they give you actually works.*
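
One lightweight way to do that validation is to keep a short list of strings that should match and strings that should not, then run every candidate pattern against both. The helper `check_pattern` and the phone-number-style test cases below are hypothetical, meant only as a sketch of the idea.

```python
import re

def check_pattern(pattern, should_match, should_not_match):
    """Return a list of (problem, string) pairs; an empty list means all cases passed."""
    compiled = re.compile(pattern)
    failures = []
    for s in should_match:
        if compiled.fullmatch(s) is None:
            failures.append(("expected match", s))
    for s in should_not_match:
        if compiled.fullmatch(s) is not None:
            failures.append(("unexpected match", s))
    return failures

good_cases = ["555-1234"]
bad_cases = ["5551234", "555-12345", "abc-defg"]

# A pattern that does what we want: three digits, a hyphen, four digits
print(check_pattern(r'\d{3}-\d{4}', good_cases, bad_cases))

# A subtly wrong pattern: it allows any number of digits after the hyphen
print(check_pattern(r'\d{3}-\d+', good_cases, bad_cases))
```

The first call returns an empty list, while the second reports `555-12345` as an unexpected match, which is exactly the kind of mistake that is easy to miss by eye.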