# Regular Expressions in Python

This notebook demonstrates the usage of the `re` module in Python, which provides regular expression matching operations.

# Introduction

A regular expression (RegEx) is a sequence of characters that defines a search pattern, which is used to find a string or a set of strings.

It can detect the presence or absence of text by matching it against a pattern, and a pattern can also be split into one or more sub-patterns.

In [1]:
import re

## Basic Search Operations

Here we search for a pattern in a string and get its position.

In [2]:
# The pattern here is a plain literal string; we only look for where it occurs inside s
s = "AbbbbAbbbbAbbb:A computer science portal for aaaaa"
# Search for the word "portal" in the string, then print the start and end indices of the match.

match = re.search('portal', s)
print('Start Index: ', match.start())
print('End Index: ', match.end())

Start Index:  34
End Index:  40
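
`re.search()` returns `None` when the pattern is not found, so calling `.start()` on the result would raise an `AttributeError`. A minimal sketch of a guard follows; the helper name `find_span` and the word "gateway" are hypothetical and only serve the illustration.

```python
import re

def find_span(pattern, text):
    """Return (start, end) of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.span()

s = "AbbbbAbbbbAbbb:A computer science portal for aaaaa"
print(find_span('portal', s))   # (34, 40)
print(find_span('gateway', s))  # None
```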


## Common Regular Expression Functions

Here are the commonly used regex functions:

- `re.findall()`: Finds all non-overlapping matches (matches that share no characters) and returns them as a list
- `re.compile()`: Compiles a regular expression into a pattern object that can be reused
- `re.split()`: Splits the string at each occurrence of a character or a pattern
- `re.sub()`: Replaces all occurrences of a pattern with a string
- `re.escape()`: Escapes special characters
- `re.search()`: Finds the first occurrence of a character or a pattern
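
As a rough illustration of how these functions fit together, the sketch below compiles one pattern and reuses the resulting pattern object for searching, finding, splitting, and substituting. The sample sentence and variable names are made up for the example.

```python
import re

# Compile once, then reuse the pattern object's methods.
digits = re.compile(r'\d+')
text = "Order 66 shipped 3 items on day 12"

first = digits.search(text)
print(first.group())          # 66
print(digits.findall(text))   # ['66', '3', '12']
print(digits.split(text))     # ['Order ', ' shipped ', ' items on day ', '']
print(digits.sub('#', text))  # Order # shipped # items on day #
```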

In [3]:
# \ escape
# \d digit
# \d+ 1 or more digits
# \d* 0 or more digits
# This code uses a regular expression (\d+) to find all the sequences of one or more digits in the given string. 
# It searches for numeric values and stores them in a list.

string = """Hello my Number is 123456789 and
            my friend's number is 987654321"""
regex = r'\d+'
match = re.findall(regex, string)
print(match)

['123456789', '987654321']


## Character Classes

Regular expressions provide a way to match specific sets of characters.

In [4]:
p = re.compile('[a-e]')
# Matches any single lowercase letter from a to e (the uppercase 'A' is not matched)
print(p.findall("Aye, said Mr. Gibeeeenson Startk"))
# findall scans from left to right and returns the matches in the order found

['e', 'a', 'd', 'b', 'e', 'e', 'e', 'e', 'a']


In [5]:
p = re.compile(r'\d')
# Find every single digit
print(p.findall("I went to him at 11 A.M. on 4th july 1886"))
p = re.compile(r'\d+')
# Find every run of consecutive digits
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))

['1', '1', '4', '1', '8', '8', '6']
['11', '4', '1886']


## Word Characters

`\w` matches any word character (letters, digits, and the underscore).
`\W` matches any character that is not a word character.

In [6]:
# The r prefix marks a raw string, so backslashes such as \w reach the regex engine unchanged
p = re.compile(r'\w')
# Single word character (letters, digits, underscore: what is allowed in a variable name)

print(p.findall("He said * in some_lang."))
p = re.compile(r'\w+')
# Runs of consecutive word characters
print(p.findall('I went to him at 11 A.M., he \
                said *** in some_language'))
p = re.compile(r'\W')
# Single non-word character (anything that cannot be used in a variable name)
print(p.findall("he said *** in some_language."))

['H', 'e', 's', 'a', 'i', 'd', 'i', 'n', 's', 'o', 'm', 'e', '_', 'l', 'a', 'n', 'g']
['I', 'went', 'to', 'him', 'at', '11', 'A', 'M', 'he', 'said', 'in', 'some_language']
[' ', ' ', '*', '*', '*', ' ', ' ', '.']


In [7]:
p = re.compile('ab*')
print(p.findall("ababbaabbb"))
# 'ab*' matches 'a' followed by zero or more 'b's

['ab', 'abb', 'a', 'abbb']
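
For comparison, here is a small sketch (not one of the original cells) of how the `*`, `+`, and `?` quantifiers behave on the same input. The variable names are chosen only for the example.

```python
import re

text = "ababbaabbb"
star = re.findall('ab*', text)      # 'a' followed by zero or more 'b's
plus = re.findall('ab+', text)      # 'a' followed by one or more 'b's
optional = re.findall('ab?', text)  # 'a' followed by at most one 'b'

print(star)      # ['ab', 'abb', 'a', 'abbb']
print(plus)      # ['ab', 'abb', 'abbb']
print(optional)  # ['ab', 'ab', 'a', 'ab']
```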


## Split Function

The `re.split()` function splits a string by the occurrences of a pattern.

Syntax:
```python
re.split(pattern, string, maxsplit=0, flags=0)
```

- `maxsplit` is the maximum number of splits to perform; the default `0` means no limit, so the string is split at every occurrence of the pattern
- `flags` (optional) modifies how the pattern is matched (e.g., `re.IGNORECASE` makes the match case-insensitive)

In [8]:
print(re.split(r'\W+','Words, words , Words'))
# \W+ matches runs of non-word characters, so the string is split at ", " and then at " , "
print(re.split(r'\W+',"Words's words Words"))
# Same idea, but the apostrophe (') is also a non-word character
print(re.split(r'\W+','On 12th Jan 2016, at 11:02 AM'))
# Digits are word characters, so \W+ does not split on them ('12th' and '2016' stay intact).
# To split on either, alternatives can be combined with |, as in \W+|\d+
print(re.split(r'\d+','On 12th Jan 2016, at 11:02 AM'))
# Splits at every run of digits

['Words', 'words', 'Words']
['Words', 's', 'words', 'Words']
['On', '12th', 'Jan', '2016', 'at', '11', '02', 'AM']
['On ', 'th Jan ', ', at ', ':', ' AM']


In [9]:
# If we limit the number of splits:
print(re.split(r'\d+','On 12th Jan 2016, at 11:02 AM',maxsplit=1))
# Splits only once, at the first run of digits
print(re.split('[a-f]+','Aey, Bou oh boy, come here',flags=re.IGNORECASE))
# [a-f]+ is equivalent to (a|b|c|d|e|f)+; with IGNORECASE, uppercase A-F also match
print(re.split('[a-f]+','Aey, Boy oh boy, come here'))
# Without the flag, only lowercase a-f are split points, so 'A' and 'B' are kept

['On ', 'th Jan 2016, at 11:02 AM']
['', 'y, ', 'ou oh ', 'oy, ', 'om', ' h', 'r', '']
['A', 'y, Boy oh ', 'oy, ', 'om', ' h', 'r', '']


## Sub Function

The `re.sub()` function replaces occurrences of a pattern with a provided replacement.

Syntax:
```python
re.sub(pattern, repl, string, count=0, flags=0)
```

`re.sub()` searches for the pattern in the string and replaces each match with `repl`.
`count` is the maximum number of replacements to make; the default `0` replaces every match.

In [10]:
# Replace every match of 'ub' with '~*', ignoring case
print(re.sub('ub','~*', 'Subject has Uber booked already',flags=re.IGNORECASE))
# With IGNORECASE, a pattern of n letters matches all 2^n case variants: ub, Ub, uB, UB (n = 2)
print(re.sub('ub','~*', 'Subject has Uber booked already'))
# Case-sensitive: only lowercase 'ub' is replaced
print(re.sub('ub','~*', 'Subject has Uber booked already',count=1,flags=re.IGNORECASE))
# count=1 replaces only the first match
print(re.sub('ub','~*', 'Subject has Uber booked already uBik',count=3,flags=re.IGNORECASE))
# count=3 replaces at most three matches
print(re.sub(r'\sAND\s','&',"Baked Beans And Spam",flags=re.IGNORECASE))
# The r prefix makes the string raw, so \s (any whitespace character) reaches the regex engine as written; this matches " AND " with its surrounding whitespace
print(re.sub(r'\'AND\'','&',"Baked Beans 'And' Spam",flags=re.IGNORECASE))
# Here the quotes around AND are matched literally, so the surrounding spaces are kept

S~*ject has ~*er booked already
S~*ject has Uber booked already
S~*ject has Uber booked already
S~*ject has ~*er booked already ~*ik
Baked Beans&Spam
Baked Beans & Spam


## Subn Function

The `re.subn()` function is just like `sub()` but it returns a tuple with the new string and the count of replacements.

Syntax:
```python
re.subn(pattern, repl, string, count=0, flags=0)
```

The first element of the returned tuple is the modified string, and the second is the number of replacements made.

In [11]:
print(re.subn('ub','~*', 'Subject has Uber booked already'))
t = re.subn('ub','~*', 'Subject has Uber booked already',flags=re.IGNORECASE)
print(t)
print(len(t))
print(t[0])

('S~*ject has Uber booked already', 1)
('S~*ject has ~*er booked already', 2)
2
S~*ject has ~*er booked already


## Escape Function

The `re.escape()` function returns the string with regex metacharacters backslash-escaped, which is useful when you want to match an arbitrary literal string that may contain them. Since Python 3.7, only characters that can have a special meaning in a regular expression are escaped; earlier versions escaped every non-alphanumeric character.

Passing a string through `re.escape()` makes it safe to use as a pattern, because any character with a special meaning in RegEx is then treated as a literal character.

In [12]:
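# \W+ matches runs of non-word characters (not a letter, digit, or underscore); here each run is replaced with a backslash and a space
# re.escape(string) backslash-escapes the regex metacharacters in string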
print(re.sub(r'\W+',r'\ ', 'This is Awesome even 1 AM'))
print(re.escape("I Asked what is this [a-9], he said \t ^WoW"))

This\ is\ Awesome\ even\ 1\ AM
I\ Asked\ what\ is\ this\ \[a\-9\],\ he\ said\ \	\ \^WoW
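
A typical use of `re.escape()` is building a pattern from text that should be matched literally. The sketch below is an illustration with a made-up sentence: used as-is, `[a-9]` would be parsed as a character class (and an invalid one, since the range runs backwards), while the escaped version matches the five literal characters.

```python
import re

text = "The token [a-9] appears once here, while a-9 alone does not count."
literal = "[a-9]"

# Escape the literal so that '[', '-', and ']' lose their special meaning.
safe_pattern = re.escape(literal)
found = re.findall(safe_pattern, text)

print(safe_pattern)  # \[a\-9\]
print(found)         # ['[a-9]']
```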


# Exercises and Solutions

## Exercise

This example searches for a pattern that consists of a month (letters) followed by a day (digits) in the input string "I was born in June 24". If a match is found, it prints the match position, the full match, the month, and the day.

In [13]:
regex = r"([a-zA-Z]+) (\d+)"
match = re.search(regex, "I was born in June 24")
if match is not None:
    print("Match at index %s, %s" % (match.start(), match.end()))
    print("Full match: %s" % (match.group(0)))
    print("Month: %s" % (match.group(1)))
    print("Day: %s" % (match.group(2)))
else:
    print("The regex pattern does not match.")

Match at index 14, 21
Full match: June 24
Month: June
Day: 24
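
The same groups can also be given names, which may read more clearly than numeric indices. This is a variant sketch of the cell above, using the standard `(?P<name>...)` syntax; the group names `month` and `day` are chosen here for illustration.

```python
import re

regex = r"(?P<month>[a-zA-Z]+) (?P<day>\d+)"
match = re.search(regex, "I was born in June 24")
if match is not None:
    print(match.group("month"))  # June
    print(match.group("day"))    # 24
    print(match.groupdict())     # {'month': 'June', 'day': '24'}
```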


## 1. Extract All Email Addresses

**Exercise:** Write a Python program to extract all email addresses from a given string:
"Contact us at support@example.com or sales@example.com."

In [14]:
import re

text = "Contact us at support@example.com or sales@example.com."
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
print(emails)

['support@example.com', 'sales@example.com']


**Explanation:**
- `[A-Za-z0-9._%+-]+` matches one or more letters, digits, or the characters `.`, `_`, `%`, `+`, `-` before the @
- `@[A-Za-z0-9.-]+` matches the @ followed by the domain name
- `\.` matches a literal dot
- `[A-Za-z]{2,}` matches the TLD (like com, org, net) with at least 2 letters
- `\b` defines a word boundary, ensuring that we only match whole words

## 2. Validate Phone Number

**Exercise:** Check if a given phone number is in the format (XXX) XXX-XXXX.

In [15]:
import re

phone = input("Enter your phone number: ")
pattern = r'^\(\d{3}\) \d{3}-\d{4}$'
match = re.match(pattern, phone)
print(bool(match))

Enter your phone number:  (111) 222-333


False


**Explanation:**
- `^` asserts the start of the string
- `\(` and `\)` match the literal parentheses (escaped because parentheses are special in regex)
- `\d{3}` matches exactly 3 digits, and `\d{4}` matches exactly 4
- ` ` matches a space, and `-` matches a literal hyphen
- `$` asserts the end of the string
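
The same check can be wrapped in a function so it can be tried on several inputs without typing them in. This is a minimal sketch: the names `PHONE_RE` and `is_valid_phone` are hypothetical, and `fullmatch()` is used in place of the explicit `^` and `$` anchors.

```python
import re

# fullmatch() requires the whole string to match, so no ^ or $ is needed.
PHONE_RE = re.compile(r'\(\d{3}\) \d{3}-\d{4}')

def is_valid_phone(phone):
    """Return True if phone is exactly in the form (XXX) XXX-XXXX."""
    return PHONE_RE.fullmatch(phone) is not None

print(is_valid_phone("(111) 222-3333"))  # True
print(is_valid_phone("(111) 222-333"))   # False
print(is_valid_phone("111-222-3333"))    # False
```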

## 3. Find All Words Starting With "a"

**Exercise:** Extract all words that start with the letter "a" from a given text.

In [16]:
import re

text = input("Enter your text: ")
words = re.findall(r'\ba\w*', text)
print(words)

Enter your text:  hello I am angelo how are you


['am', 'angelo', 'are']


## 4. Replace Multiple Spaces with a Single Space

**Exercise:** Replace multiple consecutive spaces with a single space.

In [17]:
import re

text = input("Enter a text with multiple spaces: ")
cleaned_text = re.sub(r'\s+', ' ', text)
print(cleaned_text)

Enter a text with multiple spaces:  hellaoioeab   a ifhoae  alefnpeia a  a;lke  


hellaoioeab a ifhoae alefnpeia a a;lke 


**Explanation:**
- `\s+` matches one or more whitespace characters (spaces, tabs, and newlines), so each run is collapsed into a single space
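
Because `\s` also covers tabs and newlines, the pattern does slightly more than the exercise title suggests. The sketch below, with a made-up input string, contrasts it with a pattern that collapses only runs of plain spaces.

```python
import re

text = "one   two\t\tthree  \n four"

# \s+ collapses any run of whitespace, including tabs and newlines.
collapsed_ws = re.sub(r'\s+', ' ', text)
# ' {2,}' collapses only runs of two or more space characters.
collapsed_spaces = re.sub(r' {2,}', ' ', text)

print(repr(collapsed_ws))      # 'one two three four'
print(repr(collapsed_spaces))  # 'one two\t\tthree \n four'
```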

## 5. Validate a Password

**Exercise:** Check if a password meets the criteria: at least one uppercase letter, at least one lowercase letter, at least one digit, and at least 8 characters.

In [18]:
import re

password = input("Enter your password: ")
pattern = r'^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}$'
valid = bool(re.match(pattern, password))
print(valid)

Enter your password:  hello


False


**Explanation:**
- `(?=.*[a-z])`: Positive lookahead assertion that ensures at least one lowercase letter
- `(?=.*[A-Z])`: Positive lookahead assertion that ensures at least one uppercase letter
- `(?=.*\d)`: Positive lookahead assertion that ensures at least one digit
- `.{8,}`: Ensures the password is at least 8 characters long
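
To see each lookahead reject a different kind of input, the pattern can be tried on a few fixed candidates. This is a sketch only; `PASSWORD_RE`, `is_valid_password`, and the sample passwords are hypothetical.

```python
import re

# Same lookaheads as above; fullmatch() replaces the ^ and $ anchors.
PASSWORD_RE = re.compile(r'(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}')

def is_valid_password(password):
    """Return True if the password satisfies all four criteria."""
    return PASSWORD_RE.fullmatch(password) is not None

print(is_valid_password("hello"))        # False: too short, no uppercase, no digit
print(is_valid_password("helloworld1"))  # False: no uppercase letter
print(is_valid_password("HelloWorld"))   # False: no digit
print(is_valid_password("HelloWorld1"))  # True
```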

## 6. Extract Dates from a String

**Exercise:** Extract all dates in the format DD/MM/YYYY from a given text.

In [19]:
import re

text = input("Enter your text: ")
dates = re.findall(r'\b\d{2}/\d{2}/\d{4}\b', text)
print(dates)

Enter your text:  25/04/2005


['25/04/2005']
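
The pattern only checks the shape of the text, so a string such as `31/02/2024` would also be extracted. One possible follow-up, sketched here with the hypothetical helper `extract_valid_dates`, is to pass each candidate through `datetime.strptime` and keep only real calendar dates.

```python
import re
from datetime import datetime

def extract_valid_dates(text):
    """Find DD/MM/YYYY candidates and keep only real calendar dates."""
    candidates = re.findall(r'\b\d{2}/\d{2}/\d{4}\b', text)
    valid = []
    for candidate in candidates:
        try:
            datetime.strptime(candidate, '%d/%m/%Y')
        except ValueError:
            continue  # e.g. 31 February, or month 13
        valid.append(candidate)
    return valid

sample = "Born 25/04/2005, invoice due 31/02/2024, renewed 01/13/2021."
print(extract_valid_dates(sample))  # ['25/04/2005']
```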


## 7. Validate an IPv4 Address

**Exercise:** Check if a given IP address is valid.

In [20]:
import re

ip = input("Enter an IPv4 address: ")
pattern = r'^(\d{1,3}\.){3}\d{1,3}$'
valid = bool(re.match(pattern, ip))
print(valid)

Enter an IPv4 address:  192.168.100.5


True


**Explanation:**
- `(\d{1,3}\.){3}` matches 3 occurrences of 1-3 digits followed by a dot
- `\d{1,3}` matches 1-3 digits for the last part of the IP address

**Note:** This basic validation checks only the format. For complete IP validation, you should also check that each number is between 0 and 255.

## Enhanced IPv4 Validation (Bonus)

A more complete IPv4 validation that also checks that each number is between 0 and 255 and has no leading zeros:

In [21]:
import re

def validate_ipv4(ip):
    # First check the pattern
    pattern = r'^(\d{1,3}\.){3}\d{1,3}$'
    if not re.match(pattern, ip):
        return False
    
    # Check each number is between 0 and 255 and has no leading zeros
    numbers = ip.split('.')
    for num in numbers:
        if int(num) > 255 or (num[0] == '0' and len(num) > 1):
            return False
    
    return True

# Test it
test_ip = input("Enter an IPv4 address for enhanced validation: ")
print(validate_ipv4(test_ip))

Enter an IPv4 address for enhanced validation:  300.100.23.3


False
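
The range check can also be expressed in the pattern itself. The following is one possible regex-only sketch, not the approach used in the cell above; `OCTET`, `IPV4_RE`, and `is_ipv4` are hypothetical names. Like the function above, it rejects leading zeros.

```python
import re

# One octet: 250-255, 200-249, 100-199, or 0-99 without a leading zero.
OCTET = r'(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
# Four octets separated by dots; {{3}} is a literal {3} inside the f-string.
IPV4_RE = re.compile(rf'{OCTET}(?:\.{OCTET}){{3}}')

def is_ipv4(ip):
    """Return True if ip is a dotted quad with every octet in 0-255."""
    return IPV4_RE.fullmatch(ip) is not None

print(is_ipv4("192.168.100.5"))  # True
print(is_ipv4("300.100.23.3"))   # False
print(is_ipv4("01.2.3.4"))       # False
```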


## 8. Normalize Phone Numbers

**Exercise:** You have a messy dataset containing phone numbers with different formats. Write a Python script to extract and normalize them into the format XXX-XXX-XXXX.

In [None]:
import re
text = "Call me at (123) 456-7890 or 987-654-3210."
phones = re.findall(r'\(?(\d{3})\)?[-\s]?(\d{3})[-\s]?(\d{4})', text)
normalized_phones = ["-".join(phone) for phone in phones]
print(normalized_phones)
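
If the goal is to rewrite the numbers in place rather than collect them in a list, the same pattern can be combined with `re.sub()` and backreferences to the three captured groups. This is a sketch under that assumption; `PHONE_RE` and `normalized_text` are names introduced for the example.

```python
import re

PHONE_RE = re.compile(r'\(?(\d{3})\)?[-\s]?(\d{3})[-\s]?(\d{4})')

text = "Call me at (123) 456-7890 or 987-654-3210."
# \1, \2, and \3 refer to the three captured digit groups.
normalized_text = PHONE_RE.sub(r'\1-\2-\3', text)
print(normalized_text)  # Call me at 123-456-7890 or 987-654-3210.
```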

## 9. Extract Stock Prices

**Exercise:** You have a financial report containing stock prices in different formats. Extract all stock prices in the format $XXX.XX, ensuring you don't mistakenly pick other numbers.

Example Input: `"The stock prices are as follows: Apple $132.45, Tesla $899.99, and Amazon $3050.89. Earnings for Q2 were 5,000,000 dollars."`

Expected Output: `['$132.45', '$899.99', '$3050.89']`

In [None]:
import re
text = "The stock prices are as follows: Apple $132.45, Tesla $899.99, and Amazon $3050.89. Earnings for Q2 were 5,000,000 dollars."
stock_prices = re.findall(r'\$\d{1,4}\.\d{2}', text)
print(stock_prices)
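
The pattern above would still accept the first part of a longer number, for example `$3.99` out of `$3.999`. A possible refinement, sketched below, adds a negative lookahead so that a price must not be followed by another digit, and then converts the matches to floats. `PRICE_RE`, `price_strings`, and `prices` are names introduced for the example.

```python
import re

text = ("The stock prices are as follows: Apple $132.45, Tesla $899.99, "
        "and Amazon $3050.89. Earnings for Q2 were 5,000,000 dollars.")

# (?!\d) rejects a match that is immediately followed by another digit.
PRICE_RE = re.compile(r'\$\d{1,4}\.\d{2}(?!\d)')

price_strings = PRICE_RE.findall(text)
prices = [float(p.lstrip('$')) for p in price_strings]

print(price_strings)  # ['$132.45', '$899.99', '$3050.89']
print(prices)         # [132.45, 899.99, 3050.89]
```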