# Regular Expressions in Python



<div style="background-color: #192332; 
            border-radius: 8px; 
            padding: 15px; 
            margin: 10px 0;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);">

A regular expression is a sequence of characters that specifies a search pattern in text. String-searching algorithms use such patterns for "find" or "find and replace" operations on strings, or for input validation.

</div>
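As a quick illustration of the two uses named above, here is a minimal sketch of "find and replace" with `re.sub` and of input validation with `re.fullmatch`. The sample strings and the helper name `is_all_digits` are made up for this example.

```python
import re

# Find and replace: swap every run of digits for a placeholder.
masked = re.sub(r"\d+", "#", "Order 66 shipped on day 12")
print(masked)  # Order # shipped on day #


def is_all_digits(value):
    """Hypothetical helper: True when value is one or more digits and nothing else."""
    return re.fullmatch(r"\d+", value) is not None


# Input validation: the whole string has to match the pattern.
print(is_all_digits("2024"))   # True
print(is_all_digits("20x24"))  # False
```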

In [1]:
import re

In [4]:
# search function

text = "hassane is a good data scientist"
if re.search("hassane", text):
    print("hello hassane")
else:
    print("not found")

hello hassane


In [8]:
# split function 

text = "hassane is a good student. hassane is a good data scientist. hassane gets a good grade"

re.split("hassane",text)

['',
 ' is a good student. ',
 ' is a good data scientist. ',
 ' gets a good grade']

In [9]:
# findall function
re.findall("hassane",text) 

['hassane', 'hassane', 'hassane']

In [55]:
# compile function: build a reusable pattern object
phone_num_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

phone_num_regex.findall('Cell: 415-555-9999 Work: 212-555-0000')

['415-555-9999', '212-555-0000']

In [12]:
# Example 1
text = "Hello, hassane is here"
result1 = re.search("^hassane", text)
print(result1)  # None - because "hassane" is not at start

# Example 2 - Will match
text2 = "hassane is here"
result2 = re.search("^hassane", text2)
print(result2)  # <re.Match object; span=(0, 7), match='hassane'>

# To match "hassane" anywhere in text:
result3 = re.search("hassane", text)
print(result3)  # Will match "hassane" anywhere in the string

None
<re.Match object; span=(0, 7), match='hassane'>
<re.Match object; span=(7, 14), match='hassane'>
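A related point that may be worth seeing side by side: `re.match` only tries the pattern at the start of the string, so it behaves much like `re.search` with a leading `^`. The sketch below reuses the sample sentence from the cell above.

```python
import re

text = "Hello, hassane is here"

# re.match only tries the pattern at position 0.
at_start = re.match("hassane", text)

# re.search scans forward until it finds the first match.
anywhere = re.search("hassane", text)

print(at_start)         # None
print(anywhere.span())  # (7, 14)
```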


## Matching regex objects

In [23]:
text = "My phone number is 231-123-568-7875, and my friend's number is 679 987-654-3210"

phone_regex = re.compile(r'\d{3}-\d{3}-\d{4}')

phone_number = phone_regex.search(text)

print(f'Phone number found: {phone_number.group()}')

Phone number found: 123-568-7875


## Grouping with parentheses

In [31]:
phone_regex = re.compile(r"(\d{3})-(\d{3})-(\d{4})")
phone_number = phone_regex.search(text)
print(f'Phone number found part 1: {phone_number.group(1)}')
print(f'Phone number found part 2: {phone_number.group(2)}')
print(f'Phone number found part 3: {phone_number.group(3)}')
print(f'Phone number found: {phone_number.group(0)}')

Phone number found part 1: 123
Phone number found part 2: 568
Phone number found part 3: 7875
Phone number found: 123-568-7875
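When all groups are needed at once, the match object's `groups()` method returns them as a tuple. This is a small sketch; the phone number and the variable names `area_code`, `prefix`, and `line_number` are invented for illustration.

```python
import re

phone_regex = re.compile(r"(\d{3})-(\d{3})-(\d{4})")
mo = phone_regex.search("My number is 415-555-4242.")

# groups() returns every captured group as a tuple of strings.
print(mo.groups())  # ('415', '555', '4242')

# The tuple can be unpacked into separate names.
area_code, prefix, line_number = mo.groups()
print(area_code)    # 415
```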


## Multiple groups with Pipe
The | character is called a pipe. Use it wherever you want to match one of several alternatives; inside parentheses, the alternatives can share a common prefix such as Bat.

In [43]:
bat_regex = re.compile(r'Bat(man|mobile|copter|bat)')
word = bat_regex.search('Batman lost a wheel')

print(word.group())

print(word.group(1))

Batman
man
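The pipe also works without parentheses. A short sketch, assuming two made-up alternatives: when both occur in the text, `search()` returns whichever one appears first.

```python
import re

hero_regex = re.compile(r"Batman|Tina Fey")

# Both names occur; the match that starts earliest in the text wins.
mo1 = hero_regex.search("Batman and Tina Fey")
print(mo1.group())  # Batman

mo2 = hero_regex.search("Tina Fey and Batman")
print(mo2.group())  # Tina Fey
```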


## Optional matching with the Question Mark

The ? character flags the group that precedes it as an optional part of the pattern.

In [46]:
regex_pattern = re.compile(r'Bat(wo)?man')
word = regex_pattern.search('the adventures of Batman')
print(word.group())

word = regex_pattern.search('the adventures of Batwoman')
print(word.group())

Batman
Batwoman
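One detail that is easy to trip over: when the optional group does not take part in the match, `group(1)` returns `None` rather than an empty string. A minimal sketch with the same pattern:

```python
import re

regex_pattern = re.compile(r"Bat(wo)?man")

with_group = regex_pattern.search("the adventures of Batwoman")
without_group = regex_pattern.search("the adventures of Batman")

print(with_group.group(1))     # wo
print(without_group.group(1))  # None
```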


## Matching zero or more with the Star
The * (star or asterisk) means “match zero or more”. The group that precedes the star can occur any number of times in the text.

In [50]:
regex_pattern = re.compile(r"Bat(wo)*man")
mo = regex_pattern.search('The Adventures of Batman')
print(mo.group())

mo = regex_pattern.search('The Adventures of Batwoman')
print(mo.group())

mo = regex_pattern.search('The Adventures of Batwowowowoman')
print(mo.group())

Batman
Batwoman
Batwowowowoman


## Matching one or more with the Plus
The + (or plus) means match one or more. The group preceding a plus must appear at least once:

In [53]:
regex_pattern = re.compile(r"Bat(wo)+man")
mo = regex_pattern.search('The Adventures of Batman')
print(mo)

mo = regex_pattern.search('The Adventures of Batwoman')
print(mo.group())

mo = regex_pattern.search('The Adventures of Batwowowowoman')
print(mo.group())

None
Batwoman
Batwowowowoman
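The star and the plus are shorthand for open-ended repetition counts. Curly brackets state the count explicitly: `{3}` means exactly three repetitions and `{2,}` means two or more. The following sketch uses an invented `(Ha)` group to show both forms.

```python
import re

# Exactly three repetitions of the group.
exactly_three = re.compile(r"(Ha){3}")
three = exactly_three.search("HaHaHa")
too_short = exactly_three.search("Ha")

print(three.group())  # HaHaHa
print(too_short)      # None

# Two or more repetitions, with no upper bound.
at_least_two = re.compile(r"(Ha){2,}")
many = at_least_two.search("HaHaHaHa")
print(many.group())   # HaHaHaHa
```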


## Greedy and non-greedy matching
Python’s regular expressions are greedy by default: in ambiguous situations they will match the longest string possible. The non-greedy version of the curly brackets, which matches the shortest string possible, has the closing curly bracket followed by a question mark.

In [54]:
greedy_ha_regex = re.compile(r'(Ha){3,5}') # 3 to 5 times
mo1 = greedy_ha_regex.search('HaHaHaHaHa')
print(mo1.group())

non_greedy_ha_regex = re.compile(r"(Ha){3,5}?")
mo2 = non_greedy_ha_regex.search('HaHaHaHaHa')
print(mo2.group())

HaHaHaHaHa
HaHaHa
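The same question mark can follow the star or the plus. A small sketch with a made-up sentence, comparing the greedy `.*` with the non-greedy `.*?`:

```python
import re

text = "<To serve man> for dinner.>"

greedy = re.compile(r"<.*>")
lazy = re.compile(r"<.*?>")

# Greedy: runs to the last closing bracket.
greedy_match = greedy.search(text).group()
print(greedy_match)  # <To serve man> for dinner.>

# Non-greedy: stops at the first closing bracket.
lazy_match = lazy.search(text).group()
print(lazy_match)    # <To serve man>
```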


## Making your own character classes
You can define your own character class using square brackets. For example, the character class [aeiouAEIOU] will match any vowel, both lowercase and uppercase.

In [59]:
vowel_regex = re.compile(r'[aeiouAEIOU]')
print(vowel_regex.findall("Robocop eats baby food. BABY FOOD."))

# negative class: every character that is not a vowel
consonant_regex = re.compile(r"[^aeiouAEIOU]")
print(consonant_regex.findall('Robocop eats baby food. BABY FOOD.'))

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o', 'A', 'O', 'O']
['R', 'b', 'c', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.', ' ', 'B', 'B', 'Y', ' ', 'F', 'D', '.']


As the second pattern above shows, placing a caret character (^) just after the character class’s opening bracket creates a negative character class, which matches every character that is not in the class.
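A negative class matches anything outside the listed characters, so spaces and punctuation are matched too. If only consonant letters are wanted, one option is to list them positively with ranges. The sketch below is one way to do that; the name `consonants_only` is made up here, and `re.IGNORECASE` is used to cover both cases.

```python
import re

# Ranges b-d, f-h, j-n, p-t and v-z skip the five vowels.
consonants_only = re.compile(r"[b-df-hj-np-tv-z]", re.IGNORECASE)

found = consonants_only.findall("Robocop eats baby food.")
print(found)
# ['R', 'b', 'c', 'p', 't', 's', 'b', 'b', 'y', 'f', 'd']
```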

## The Caret and Dollar sign characters
You can also use the caret symbol ^ at the start of a regex to indicate that a match must occur at the beginning of the searched text.

Likewise, you can put a dollar sign $ at the end of the regex to indicate the string must end with this regex pattern.

And you can use the ^ and $ together to indicate that the entire string must match the regex.

In [62]:
begin_with_hello = re.compile(r"^Hello")
hello = begin_with_hello.search("Hello world!")
not_hello = begin_with_hello.search("He said hello.")
print(hello.group())
print(not_hello)

Hello
None


In [67]:
whole_string_is_num = re.compile(r'^\d+$')
print(whole_string_is_num.search('1234567890'))
print(whole_string_is_num.search('12345xyz67890'))
print(whole_string_is_num.search('12 34567890'))

<re.Match object; span=(0, 10), match='1234567890'>
None
None
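One caveat with `$`: it also matches just before a newline that ends the string. When the entire string must match with nothing trailing, `re.fullmatch` is a stricter alternative. A minimal sketch with invented inputs:

```python
import re

whole_string_is_num = re.compile(r"^\d+$")

# $ tolerates a single trailing newline, so this still matches.
with_newline = whole_string_is_num.search("12345\n")
print(with_newline.group())  # 12345

# fullmatch requires the pattern to cover the whole string.
strict_newline = re.fullmatch(r"\d+", "12345\n")
strict_clean = re.fullmatch(r"\d+", "12345")
print(strict_newline)        # None
print(strict_clean.group())  # 12345
```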


## The Wildcard character
The . (or dot) character in a regular expression will match any character except for a newline:

In [68]:
at_regex = re.compile(r'.at')

at_regex.findall('The cat in the hat sat on the flat mat.')

['cat', 'hat', 'sat', 'lat', 'mat']
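Because the dot is a wildcard, a literal period has to be escaped as `\.`. The sample text below is made up to show the difference.

```python
import re

text = "version 3.10 and build 3x10"

# Unescaped dot: any character may sit between the digits.
loose = re.findall(r"3.10", text)
print(loose)   # ['3.10', '3x10']

# Escaped dot: only a literal period matches.
strict = re.findall(r"3\.10", text)
print(strict)  # ['3.10']
```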

## Matching everything with Dot-Star
The dot-star (.*) stands for zero or more of any character except a newline, so it can stand in for text you cannot predict, such as a name between two fixed labels.

In [83]:
name_regex = re.compile(r'First Name: (.*) Last Name: (.*)')

mo = name_regex.search('First Name: Al Last Name: Sweigart with all of the best')
print(mo.group(1))
print(mo.group(2))

Al
Sweigart with all of the best
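In the output above, the second `(.*)` swallows the rest of the line. If only the last name is wanted, one possible fix is to replace that group with a narrower pattern such as `\w+`. This is a sketch of that idea, not the only option.

```python
import re

# \w+ stops at the first character that is not a letter, digit, or underscore.
name_regex = re.compile(r"First Name: (.*) Last Name: (\w+)")

mo = name_regex.search("First Name: Al Last Name: Sweigart with all of the best")
print(mo.group(1))  # Al
print(mo.group(2))  # Sweigart
```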


## Matching newlines with the Dot character
The dot-star will match everything except a newline. By passing re.DOTALL as the second argument to re.compile(), you can make the dot character match all characters, including the newline character:

In [85]:
no_newline_regex = re.compile('.*')
no_newline_regex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

'Serve the public trust.'

In [86]:

newline_regex = re.compile('.*', re.DOTALL)
newline_regex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

'Serve the public trust.\nProtect the innocent.\nUphold the law.'
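Flags can be combined with the `|` operator when more than one is needed. The sketch below pairs `re.DOTALL` with `re.IGNORECASE` on a made-up two-line string.

```python
import re

text = "serve the public trust.\nPROTECT THE INNOCENT."

# Both flags: the dot crosses the newline and letter case is ignored.
both_flags = re.compile(r"serve.*innocent", re.IGNORECASE | re.DOTALL)
mo = both_flags.search(text)
print(mo is not None)  # True

# IGNORECASE alone: the dot stops at the newline, so there is no match.
case_only = re.search(r"serve.*innocent", text, re.IGNORECASE)
print(case_only)       # None
```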

## Managing complex Regexes

To make re.compile() ignore whitespace and comments inside the regular expression string, enable “verbose mode” by passing the flag re.VERBOSE as the second argument.

Now instead of a hard-to-read regular expression like this:
`phone_regex = re.compile(r'((\d{3}|\(\d{3}\))?(\s|-|\.)?(\d{3})(\s|-|\.)(\d{4})(\s*(ext|x|ext\.)\s*\d{2,5})?)')`

you can spread the regular expression over multiple lines with comments like this:

In [101]:
phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?             # area code
    (\s|-|\.)?                     # separator
    (\d{3})                        # first 3 digits
    (\s|-|\.)                      # separator
    (\d{4})                        # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)

# example: search once, then print groups 1 through 7
mo = phone_regex.search('Call me at 415-555-1011 ext 22 tomorrow.')
for i in range(1, 8):
    print(mo.group(i))

415-555-1011 ext 22
415
-
555
-
1011
 ext 22
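A verbose pattern like this can also be used with `findall()`. When a pattern contains groups, `findall()` returns one tuple of groups per match, so the full number is the first item of each tuple. The sketch below repeats the pattern so it runs on its own; the sample text is invented.

```python
import re

phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?             # area code
    (\s|-|\.)?                     # separator
    (\d{3})                        # first 3 digits
    (\s|-|\.)                      # separator
    (\d{4})                        # last 4 digits
    (\s*(ext|x|ext\.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)

text = "Cell: 415-555-9999 Work: 212-555-0000 ext 22"

# Item 0 of each tuple is the outer group, i.e. the whole phone number.
numbers = [groups[0] for groups in phone_regex.findall(text)]
print(numbers)  # ['415-555-9999', '212-555-0000 ext 22']
```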


## Example

In [113]:
# Sample text
text = """
Harvard University is located in Cambridge, Massachusetts.
Stanford University can be found in Stanford, California.
University of Texas is situated in Austin, Texas.
MIT is based in Cambridge, Massachusetts.
"""

pattern = re.compile(r"""
                     (?P<university>(?:University\s+of\s+[A-Za-z]+|[A-Za-z]+\s+University|MIT)) # university name
                     .*?
                     in\s+
                     (?P<city>[A-Za-z]+) # city name
                     ,\s*
                     (?P<state>[A-Za-z]+) # state name
                     """, re.VERBOSE)          

# Find all matches
matches = pattern.finditer(text)  
for match in matches:
    print(match.groupdict())


{'university': 'Harvard University', 'city': 'Cambridge', 'state': 'Massachusetts'}
{'university': 'Stanford University', 'city': 'Stanford', 'state': 'California'}
{'university': 'University of Texas', 'city': 'Austin', 'state': 'Texas'}
{'university': 'MIT', 'city': 'Cambridge', 'state': 'Massachusetts'}
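Besides `groupdict()`, a named group can be read individually by name, either with `group("name")` or by indexing the match object. The shortened pattern below is a simplified stand-in for the one above, kept small for illustration.

```python
import re

pattern = re.compile(r"(?P<city>[A-Za-z]+),\s*(?P<state>[A-Za-z]+)")
match = pattern.search("MIT is based in Cambridge, Massachusetts.")

# Access by name with group() ...
print(match.group("city"))  # Cambridge

# ... or by indexing the match object.
print(match["state"])       # Massachusetts

# Named groups still keep their positional numbers.
print(match.group(1))       # Cambridge
```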
