# Regular Expressions (Regex) for Text Preprocessing

## What are Regular Expressions (Regex)?

Regular expressions (often abbreviated as **regex**) are a compact pattern language for searching and transforming text.

A regex describes a search pattern that can be used to find, extract, and manipulate data in a string.

Regex is commonly used in text preprocessing tasks such as cleaning text, extracting information, and transforming text into a format suitable for further analysis.

## Why Use Regex in Text Processing?

When working with natural language processing (NLP), a large part of the process involves extracting, cleaning, and transforming raw text data. Regex provides an efficient and flexible way to:

- Search for specific patterns (e.g., email addresses, phone numbers, dates).
- Replace parts of text (e.g., remove unwanted characters, standardize formats).
- Split or tokenize text based on specific rules.
- Extract and manipulate data from structured or unstructured text sources.

In Python, the built-in `re` module provides the tools to work with regular expressions. Functions such as `re.search`, `re.findall`, and `re.sub` make it easy to find, extract, and replace patterns within a string.
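
As a quick orientation, the sketch below runs four commonly used `re` functions on one short string. The string and the variable names are made up for illustration and do not come from the sample text used later.

```python
import re

text = "Order 66 shipped on 2024-01-15; order 67 is pending."

# re.search returns a match object for the first match anywhere in the string, or None
first = re.search(r"\d+", text)
first_number = first.group() if first else None

# re.findall returns every non-overlapping match as a list of strings
all_numbers = re.findall(r"\d+", text)

# re.sub replaces every match with the replacement string
masked = re.sub(r"\d", "#", text)

# re.split cuts the string wherever the pattern matches
parts = re.split(r";\s*", text)

print(first_number)
print(all_numbers)
print(masked)
print(parts)
```

Note that `re.search` can return `None`, so it is worth checking the result before calling `.group()`.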

In [76]:
sample_text = """ Hello! My name is Anonymous. I am learning Natural Language Processing (NLP) with Python.
anonymous.123@example.com for more details or call me at 134-122-3440. Yesterday, I was reading about how **Pythonista**
is a term used for enthusiasts who love Python."""

print(sample_text)

 Hello! My name is Anonymous. I am learning Natural Language Processing (NLP) with Python.
anonymous.123@example.com for more details or call me at 134-122-3440. Yesterday, I was reading about how **Pythonista** 
is a term used for enthusiasts who love Python.


In [39]:
import re

In [40]:
# Example 1: Matching a Literal String

pattern = r"Python"
match = re.search(pattern, sample_text)

if match:
    print("Found:", match.group())
else:
    print("No match found.")

print('=============================')

# A pattern that does not occur in the text makes re.search return None
pattern = r"Demo"
match = re.search(pattern, sample_text)

if match:
    print("Found:", match.group())
else:
    print("No match found.")

Found: Python
=============================
No match found.


In [31]:
# Example 2: Matching Digits
# Pattern to match any digit (0-9)
pattern = r"\d"
digits = re.findall(pattern, sample_text)

print("Digits found:", digits)

Digits found: ['1', '2', '3', '1', '3', '4', '1', '2', '2', '3', '4', '4', '0']


In [41]:
# Example 3: Extracting Email Addresses
# Pattern to match email addresses
pattern = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
emails = re.findall(pattern, sample_text)

print("Emails found:", emails)


Emails found: ['anonymous.123@example.com']


In [79]:
# Example 4: Extracting Phone Numbers
# Pattern to match phone numbers (e.g., 123-456-7890)
pattern = r"\d{3}-\d{3}-\d{4}"
phone_numbers = re.findall(pattern, sample_text)

print("Phone numbers found:", phone_numbers)


Phone numbers found: ['134-122-3440']
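
When the individual parts of a match matter, named groups can label them. The following is a minimal sketch; the group names `area`, `prefix`, and `line` are chosen here for illustration only.

```python
import re

# Same 3-3-4 digit layout as above, with each part captured under a name
phone_re = re.compile(r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})")

text = "Call me at 134-122-3440 after 5 pm."
match = phone_re.search(text)

# groupdict() maps each group name to the text it captured
parts = match.groupdict() if match else {}
print(parts)
```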


In [43]:
# Example 5: Removing Punctuation
# Pattern to remove punctuation (anything that's not a word character or space)
pattern = r"[^\w\s]"
cleaned_text = re.sub(pattern, "", sample_text)

print("Text after removing punctuation:")
print(cleaned_text)


Text after removing punctuation:
 Hello My name is Anonymous I am learning Natural Language Processing NLP with Python
anonymous123examplecom for more details or call me at 1341223440 Yesterday I was reading about how Pythonista 
is a term used for enthusiasts who love Python
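
The output above shows a side effect: removing punctuation also mangles the email address and the phone number. One possible workaround, sketched here under the assumption that such entities should survive as placeholders, is to mask them before stripping punctuation. The placeholder strings `EMAILTOKEN` and `PHONETOKEN` are arbitrary.

```python
import re

EMAIL_RE = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
PHONE_RE = r"\d{3}-\d{3}-\d{4}"

text = "Mail anonymous.123@example.com or call 134-122-3440!"

# Step 1: replace structured entities with punctuation-free placeholders
masked = re.sub(EMAIL_RE, "EMAILTOKEN", text)
masked = re.sub(PHONE_RE, "PHONETOKEN", masked)

# Step 2: strip the remaining punctuation
cleaned = re.sub(r"[^\w\s]", "", masked)
print(cleaned)
```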


In [44]:
# Example 6: Replacing Text
pattern = r"NLP"
replacement = "Natural Language Processing"
replaced_text = re.sub(pattern, replacement, sample_text)

print("Text after replacement:")
print(replaced_text)


Text after replacement:
 Hello! My name is Anonymous. I am learning Natural Language Processing (Natural Language Processing) with Python.
anonymous.123@example.com for more details or call me at 134-122-3440. Yesterday, I was reading about how **Pythonista** 
is a term used for enthusiasts who love Python.


In [45]:
# Example 7: Matching Multiple Words/Phrases
# Pattern to match either 'Python' or 'NLP'
pattern = r"Python|NLP"
matches = re.findall(pattern, sample_text)

print("Matches for 'Python' or 'NLP':", matches)


Matches for 'Python' or 'NLP': ['NLP', 'Python', 'Python', 'Python']


In [46]:
# Example 8: Matching Word Boundaries
# Pattern to match the word 'Python' as a whole word (not part of another word)
pattern = r"\bPython\b"
matches = re.findall(pattern, sample_text)

print("Matches for the word 'Python':", matches)

Matches for the word 'Python': ['Python', 'Python']


In [47]:
# Example 9: Matching Specific Character Classes
# Pattern to match words made of one capital letter followed only by lowercase letters
# (all-caps words such as 'NLP' are skipped)
pattern = r"\b[A-Z][a-z]*\b"
capitalized_words = re.findall(pattern, sample_text)

print("Capitalized words found:", capitalized_words)


Capitalized words found: ['Hello', 'My', 'Anonymous', 'I', 'Natural', 'Language', 'Processing', 'Python', 'Yesterday', 'I', 'Pythonista', 'Python']


In [48]:
# Example 10: Counting Occurrences of a Pattern
pattern = r"Python"
count = len(re.findall(pattern, sample_text))

print("Number of occurrences of 'Python':", count)


Number of occurrences of 'Python': 3
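
A plain substring pattern also counts matches inside longer words such as "Pythonista". As a small sketch on a made-up sentence, `re.finditer` can show where each match sits, and a `\b`-anchored pattern counts whole words only.

```python
import re

text = "Python and Pythonista love Python"

# finditer yields match objects, which expose the matched text and its position
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r"Python", text)]

# Word boundaries exclude the occurrence inside "Pythonista"
whole_word_count = len(re.findall(r"\bPython\b", text))

print(spans)
print(whole_word_count)
```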


In [70]:
# Retrieve the order number

sample_text = 'My order 412889912 is having an issue, I was charged 300$ when online it says 280$'
# Skip any non-digit characters after 'order', then capture the digits that follow
pattern = r'order[^\d]*(\d+)'
matches = re.findall(pattern, sample_text)

print('order information/number: ', matches)

order information/number:  ['412889912']


In [80]:
# Extract specific information
sample_text='''
Born	Elon Reeve Musk
June 28, 1971 (age 50)
Pretoria, Transvaal, South Africa
Citizenship
South Africa (1971–present)
Canada (1971–present)
United States (2002–present)
Education	University of Pennsylvania (BS, BA)
Title
Founder, CEO and Chief Engineer of SpaceX
CEO and product architect of Tesla, Inc.
Founder of The Boring Company and X.com (now part of PayPal)
Co-founder of Neuralink, OpenAI, and Zip2
Spouse(s)
Justine Wilson
(m. 2000; div. 2008)
Talulah Riley
(m. 2010; div. 2012)
(m. 2013; div. 2016)
'''

In [84]:
# Raw strings keep backslashes such as \d and \( intact for the regex engine
age = re.findall(r'age (\d+)', sample_text)[0]
full_name = re.findall(r'Born(.*)\n', sample_text)[0].strip()
birth_date = re.findall(r'Born.*\n(.*)\(age', sample_text)[0].strip()
birth_place = re.findall(r'\(age.*\n(.*)', sample_text)[0].strip()

print({
    'age': age,
    'name': full_name,
    'birth_date': birth_date,
    'birth_place': birth_place
})

{'age': '50', 'name': 'Elon Reeve Musk', 'birth_date': 'June 28, 1971', 'birth_place': 'Pretoria, Transvaal, South Africa'}
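
Indexing the result of `re.findall` with `[0]` raises an `IndexError` when a field is missing from the text. A hedged alternative is a small helper that returns a default instead. The helper name `extract_first` and the shortened `bio` string are hypothetical and used only for this sketch.

```python
import re

def extract_first(pattern, text, default=None):
    """Return the first captured group of the first match, stripped, or `default`."""
    match = re.search(pattern, text)
    if match is None:
        return default
    return match.group(1).strip()

bio = "Born\tElon Reeve Musk\nJune 28, 1971 (age 50)\nPretoria, Transvaal, South Africa\n"

name = extract_first(r"Born(.*)\n", bio)
age = extract_first(r"age (\d+)", bio)
# No "Died" line exists, so the default is returned instead of an exception
died = extract_first(r"Died(.*)\n", bio, default="not listed")

print(name, age, died)
```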


In [86]:
import re

text = "Hello, world! I'm learning Python. #Excited"

# Regex pattern to tokenize the text (split by any non-word characters)
pattern = r"\w+"
tokens = re.findall(pattern, text)

# Displaying the tokens
print("Tokens:", tokens)


Tokens: ['Hello', 'world', 'I', 'm', 'learning', 'Python', 'Excited']
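
The `\w+` pattern splits the contraction "I'm" into two tokens. If contractions should stay intact, one option is to allow an apostrophe followed by more word characters inside a token, as in this minimal sketch.

```python
import re

text = "Hello, world! I'm learning Python. #Excited"

# A token is a run of word characters, optionally continued by 'xxx parts
tokens = re.findall(r"\w+(?:'\w+)*", text)
print(tokens)
```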


In [85]:
tweet = "loving the new features of #Python @anonymous https://python.org 😊🚀"

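# Regex pattern to match URLs, hashtags, mentions, words, and emojis.
# The alternation is not wrapped in \b: in this tweet there is no word boundary
# before '#' or '@' or next to the emojis, so an outer \b would strip those
# symbols and drop the emojis entirely.
pattern = r"https?://\S+|www\.\S+|\#\w+|\@\w+|[a-zA-Z0-9]+(?:[-']\w+)*|\w+|[^\w\s]"
tokens = re.findall(pattern, tweet)

# Displaying the tokens
print("Tokens:", tokens)


Tokens: ['loving', 'the', 'new', 'features', 'of', '#Python', '@anonymous', 'https://python.org', '😊', '🚀']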


* `https?://\S+`: Matches URLs starting with http:// or https://.
* `www\.\S+`: Matches URLs starting with www.
* `\#\w+`: Matches hashtags (e.g., #Python).
* `\@\w+`: Matches mentions (e.g., @john_doe).
* `[a-zA-Z0-9]+(?:[-']\w+)*`: Matches words that may contain hyphens or apostrophes.
* `\w+`: Matches any remaining run of word characters.
* `[^\w\s]`: Matches a single punctuation mark or other special character.
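
Long alternations like this are easier to maintain when each branch carries its own comment. The sketch below writes the same alternatives with the `re.VERBOSE` flag, which ignores whitespace in the pattern and treats `#` as a comment marker (so the literal `#` must stay escaped). The name `TOKEN_RE` and the example sentence are illustrative assumptions.

```python
import re

TOKEN_RE = re.compile(
    r"""
    https?://\S+               # URLs starting with http:// or https://
  | www\.\S+                   # URLs starting with www.
  | \#\w+                      # hashtags such as #Python
  | @\w+                       # mentions such as @john_doe
  | [a-zA-Z0-9]+(?:[-']\w+)*   # words with inner hyphens or apostrophes
  | \w+                        # any other run of word characters
  | [^\w\s]                    # a single punctuation mark or symbol
    """,
    re.VERBOSE,
)

example = "Don't miss #Python news from @john_doe at https://python.org today!"
tokens = TOKEN_RE.findall(example)
print(tokens)
```

The order of the branches matters: the regex engine tries them left to right, so the URL, hashtag, and mention branches must come before the generic word branches.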

In [54]:
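# Regex to remove emojis and other symbols
# Matches any character that is not a word character, whitespace, or one of , . ' ! & -
# (so '@' and '#' are removed along with the emojis)
emoji_pattern = r"[^\w\s,.'!&-]"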
text_with_emojis = "I love Python! 🚀🔥 @PythonDev #Python"

# Removing emojis
cleaned_text = re.sub(emoji_pattern, "", text_with_emojis)

print("Text without emojis:", cleaned_text)

Text without emojis: I love Python!  PythonDev Python
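
If mentions and hashtags should survive, an alternative is to target emoji code points directly instead of "everything that is not allowed". The ranges below are an approximation assumed for this sketch; they cover many common emojis but are not an exhaustive definition.

```python
import re

# Approximate emoji ranges (an assumption for illustration, not a complete list)
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]+")

text = "I love Python! 🚀🔥 @PythonDev #Python"

no_emoji = EMOJI_RE.sub("", text)
# Collapse the double space left behind by the removed emojis
no_emoji = re.sub(r"\s+", " ", no_emoji).strip()
print(no_emoji)
```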


In [60]:
text = "I have a few cats. The dogs were playing."

# Crude plural-to-singular rule: drop a trailing "s" (e.g., cats -> cat, dogs -> dog).
# It also truncates words such as "is" or "this", so it only suits simple demo text.
pattern_plural = r"(\w+)s\b"
text_singular = re.sub(pattern_plural, r"\1", text)

# Map forms of "to be" (is, am, are, was, were, been, being) to "be"
pattern_verb = r"\b(?:were|is|am|are|was|been|being)\b"
text_lemmatized = re.sub(pattern_verb, "be", text_singular)

print("Text after basic lemmatization:", text_lemmatized)

Text after basic lemmatization: I have a few cat. The dog be playing.
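
`re.sub` also accepts a function as the replacement, which allows the result to depend on the matched text. The sketch below uses that to keep the capitalization of the original word when mapping forms of "to be" to "be". The names `BE_FORMS` and `to_lemma` and the example sentence are assumptions made for illustration, and this remains a toy rule rather than real lemmatization.

```python
import re

BE_FORMS = {"am", "is", "are", "was", "were", "been", "being"}
be_re = re.compile(r"\b(?:" + "|".join(sorted(BE_FORMS)) + r")\b", re.IGNORECASE)

def to_lemma(match):
    """Return 'be', capitalized when the matched word was capitalized."""
    word = match.group(0)
    return "Be" if word[0].isupper() else "be"

text = "Is it true that the dogs were playing while the cat was asleep?"
lemmatized = be_re.sub(to_lemma, text)
print(lemmatized)
```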


In [55]:
tweet = "Check out my blog: https://example.com! @john_doe #Python #AI"

# Regex patterns to extract hashtags, mentions, and URLs
hashtag_pattern = r"\#\w+"
mention_pattern = r"\@\w+"
# The final character class keeps trailing punctuation such as '!' out of the URL
url_pattern = r"https?://\S*[^\s.,;:!?]"

# Extracting hashtags, mentions, and URLs
hashtags = re.findall(hashtag_pattern, tweet)
mentions = re.findall(mention_pattern, tweet)
urls = re.findall(url_pattern, tweet)

print("Hashtags:", hashtags)
print("Mentions:", mentions)
print("URLs:", urls)


Hashtags: ['#Python', '#AI']
Mentions: ['@john_doe']
URLs: ['https://example.com']


In [56]:
# Removing stopwords
# List of common stopwords (lowercase; the matching below is case-sensitive)
stopwords = ["the", "and", "is", "in", "to", "for", "on"]

text = "Python is great for data analysis and machine learning."

# Regex pattern for stopwords
stopword_pattern = r"\b(?:{})\b".format("|".join(stopwords))
cleaned_text = re.sub(stopword_pattern, "", text)

cleaned_text = re.sub(r"\s+", " ", cleaned_text).strip()
print("Text without stopwords:", cleaned_text)


Text without stopwords: Python great data analysis machine learning.
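
The pattern above is case-sensitive, so a sentence-initial "The" would be kept. A minimal sketch of a case-insensitive variant follows; it also passes each word through `re.escape`, which is a precaution in case a stopword ever contains a regex metacharacter. The example sentence is made up.

```python
import re

stopwords = ["the", "and", "is", "in", "to", "for", "on"]

stopword_re = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in stopwords) + r")\b",
    flags=re.IGNORECASE,
)

text = "The cat is on the mat."
without_stopwords = stopword_re.sub("", text)
# Collapse the gaps left by the removed words
without_stopwords = re.sub(r"\s+", " ", without_stopwords).strip()
print(without_stopwords)
```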


In [66]:
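# Custom tokenization: preserve punctuation and emojis as tokens
# The emojis in this tweet lie outside \u2000-\u3300, so they are caught by [^\w\s]+,
# which also groups adjacent symbols into a single token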
pattern = r"([^\w\s]+|[\u2000-\u3300]+|\w+)"
tweet = "I love Python! 🚀🔥 #Python"

tokens = re.findall(pattern, tweet)
print("Custom tokens:", tokens)


Custom tokens: ['I', 'love', 'Python', '!', '🚀🔥', '#', 'Python']
