1. What is the name of the feature responsible for generating Regex objects?

Regex objects in Python are created by the re.compile() function, which lives in the built-in "re" module. Passing a pattern string to re.compile() returns a Regex (pattern) object whose methods, such as search(), findall(), sub(), and split(), perform pattern matching, replacing, and splitting on strings.
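
As a minimal sketch of creating a Regex object, the snippet below compiles a pattern once and then reuses it. The name phone_regex and the sample text are illustrative assumptions, not part of the original question.

```python
import re

# re.compile() turns a pattern string into a reusable Regex object.
phone_regex = re.compile(r"\d{3}-\d{4}")

# The compiled object exposes search(), findall(), sub(), and so on.
match = phone_regex.search("Call 555-1234 today.")
print(match.group())  # 555-1234
```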

2. Why do raw strings often appear in Regex objects?


Raw strings are often used in Regex objects to avoid unwanted interpretation of escape sequences.

In a raw string, escape sequences like \n or \t are treated as literal characters instead of their special meanings. This is particularly useful in regular expressions because they often contain backslashes and special characters that have special meanings in both Python string literals and regular expressions.

For example, to match a literal backslash, the regular expression engine must receive two backslashes (\\). In a regular string literal, Python processes backslashes before the pattern reaches the engine, so you would have to write "\\\\" to deliver those two backslashes. With a raw string, r"\\" is enough.

By using a raw string, denoted by adding an r prefix before the string (r"..."), you can ensure that the regular expression pattern is interpreted as-is without any unintended escape character interpretation. This simplifies and improves the readability of regular expressions.

In summary, raw strings are commonly used in Regex objects to avoid conflicts between Python's string literal escape sequences and the escape sequences used in regular expressions.
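
The difference can be sketched in a few lines. The sample path and variable names below are assumptions chosen for illustration.

```python
import re

text = r"C:\new\table"  # raw literal: both backslashes stay in the string

# Without the r prefix, "\n" is one newline character; with it, two characters.
print(len("\n"), len(r"\n"))  # 1 2

# The regex engine needs \\ to match one literal backslash.
raw_pattern = r"\\"      # two characters: backslash, backslash
plain_pattern = "\\\\"   # the same two characters, written without r
print(raw_pattern == plain_pattern)         # True
print(len(re.findall(raw_pattern, text)))   # 2
```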

3. What is the return value of the search() method?

The search() method from the re module in Python returns a match object if a match is found, or None if no match is found.

When you call the search() method on a regular expression object, it scans the given string from left to right and returns the first occurrence of a match. The match object contains information about the match, such as the matched substring, its start and end indices, and more.

In [4]:
import re

pattern = r"apple"
text = "I like to eat an apple every day."

match = re.search(pattern, text)

if match:
    print("Match found!")
else:
    print("No match found.")


Match found!
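
As a small follow-up sketch (reusing the same sample sentence), the match object also reports where the match was found, and a failed search returns None:

```python
import re

text = "I like to eat an apple every day."
match = re.search(r"apple", text)

print(match.group())               # apple
print(match.start(), match.end())  # 17 22
print(match.span())                # (17, 22)

# A pattern that does not occur in the text yields None.
missing = re.search(r"banana", text)
print(missing)                     # None
```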


4. From a Match object, how do you get the actual strings that match the pattern?

From a Match object in Python, you can retrieve the actual strings that match the pattern using the group() method.

The group() method of a Match object returns the substring that was matched by the regular expression pattern. By default, calling group() without any arguments returns the entire match.

Here's an example that demonstrates how to use group() to obtain the matching string:

In [5]:
import re

pattern = r"apple"
text = "I like to eat an apple every day."

match = re.search(pattern, text)

if match:
    matched_string = match.group()
    print("Match found:", matched_string)
else:
    print("No match found.")


Match found: apple


5. In the regex created from r'(\d\d\d)-(\d\d\d-\d\d\d\d)', what does group zero cover? Group 2? Group 1?

In the regular expression r'(\d\d\d)-(\d\d\d-\d\d\d\d)', the groups are denoted by the parentheses.

Group 0 (zero) refers to the entire match of the regular expression pattern. It represents the portion of the string that matches the entire pattern.

Group 1 refers to the first group captured by the regular expression, which is (\d\d\d) in this case. It matches three consecutive digits.

Group 2 refers to the second group captured by the regular expression, which is (\d\d\d-\d\d\d\d) in this case. It matches three digits followed by a hyphen, and then four more digits.

Here's an example that demonstrates how to access the groups using the group() method of a Match object:

In [7]:
import re

pattern = r'(\d\d\d)-(\d\d\d-\d\d\d\d)'
text = 'Phone number: 123-456-7890'

match = re.search(pattern, text)

if match:
    group_0 = match.group(0)
    group_1 = match.group(1)
    group_2 = match.group(2)

    print("Group 0:", group_0)
    print("Group 1:", group_1)
    print("Group 2:", group_2)
else:
    print("No match found.")


Group 0: 123-456-7890
Group 1: 123
Group 2: 456-7890


6. In regular expression syntax, parentheses and periods have special meanings. How can you tell a regex that you want it to match literal parentheses and periods?

In regular expression syntax, parentheses and periods have special meanings and are used to define groups and match any character, respectively. To tell a regex that you want it to match literal parentheses and periods, you can use the backslash (\) to escape those characters.

To match a literal parenthesis, use \( for an opening parenthesis and \) for a closing parenthesis.

To match a literal period, use \. (a backslash followed by a period).

Here's an example that demonstrates how to match literal parentheses and periods using escape characters:

In [9]:
import re

pattern = r'\(Hello\)\.'
text = "(Hello). World."

match = re.search(pattern, text)

if match:
    matched_string = match.group()
    print("Match found:", matched_string)
else:
    print("No match found.")


Match found: (Hello).
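
When the literal text is not known in advance, one option is to let re.escape() add the backslashes. This is a sketch of that alternative, not something the question requires; the variable names are illustrative.

```python
import re

literal = "(Hello)."
escaped = re.escape(literal)  # backslash-escapes the regex metacharacters

match = re.search(escaped, "(Hello). World.")
print(match.group())  # (Hello).

# Unescaped, the parentheses form a group and the period matches any character.
loose = re.search(literal, "Hello!")
print(loose.group())  # Hello!
```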


7. The findall() method returns a string list or a list of string tuples. What causes it to return one of the two options?

The findall() method from the re module in Python returns a list of strings or a list of string tuples depending on the number of capturing groups in the regular expression pattern.

When the pattern contains two or more capturing groups (defined by parentheses), findall() returns a list of string tuples. Each tuple corresponds to one match, and each element of the tuple holds the substring captured by one group. With exactly one capturing group, findall() returns a list of strings containing only that group's text.

When the pattern does not contain any capturing groups, findall() returns a list of strings. Each string in the list is a non-overlapping match of the entire pattern.

Here's an example that demonstrates the difference:

In [11]:
import re

pattern_with_groups = r'(\d{2})-(\d{2})'
pattern_without_groups = r'\d{2}-\d{2}'

text = 'Birthdays: 01-25, 12-07, 08-15'

matches_with_groups = re.findall(pattern_with_groups, text)
matches_without_groups = re.findall(pattern_without_groups, text)

print("Matches with groups:", matches_with_groups)
print("Matches without groups:", matches_without_groups)


Matches with groups: [('01', '25'), ('12', '07'), ('08', '15')]
Matches without groups: ['01-25', '12-07', '08-15']
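
The single-group and non-capturing cases are easy to overlook, so here is a short sketch using the same sample text. The variable names are assumptions for illustration.

```python
import re

text = 'Birthdays: 01-25, 12-07, 08-15'

no_groups = re.findall(r'\d{2}-\d{2}', text)
one_group = re.findall(r'(\d{2})-\d{2}', text)
two_groups = re.findall(r'(\d{2})-(\d{2})', text)
# (?:...) groups without capturing, so findall() returns whole matches again.
non_capturing = re.findall(r'(?:\d{2})-(?:\d{2})', text)

print(no_groups)      # ['01-25', '12-07', '08-15']
print(one_group)      # ['01', '12', '08']
print(two_groups)     # [('01', '25'), ('12', '07'), ('08', '15')]
print(non_capturing)  # ['01-25', '12-07', '08-15']
```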


8. In regular expressions, what does the | character mean?


In regular expressions, the | character, called a pipe or vertical bar, represents the logical OR operator. It is used to specify multiple alternative patterns within a regular expression.

The | character lets a pattern match either the expression on its left side or the expression on its right side. It works much like the logical OR operator in programming languages.

Here's an example that demonstrates the usage of the | character:

In [15]:
import re

pattern = r"cat|dog"
text = "I have a cat and a dog."

matches = re.findall(pattern, text)

print(matches)


['cat', 'dog']


9. In regular expressions, what does the character \r stand for?

In regular expressions, the character \r represents a carriage return.

A carriage return is a control character that originated from typewriters and is used to move the cursor to the beginning of a line. In the context of regular expressions, \r is used to match the carriage return character itself.

It's worth noting that the interpretation of \r can vary depending on the platform or text file format being used. In some environments, such as Windows, a newline is represented by the combination of carriage return (\r) followed by a line feed (\n), whereas in other environments, such as Unix-like systems (Linux, macOS), only a line feed (\n) is used.

Here's an example that demonstrates the usage of \r in a regular expression:

In [17]:
import re

pattern = r"Hello\rWorld"
text = "Hello\rWorld"

match = re.search(pattern, text)

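if match:
    # repr() shows the carriage return as \r instead of letting it
    # move the cursor back and overwrite the start of the printed line
    print("Match found:", repr(match.group()))
else:
    print("No match found.")


Match found: 'Hello\rWorld'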


10. In regular expressions, what is the difference between the + and * characters?

In regular expressions, the + and * characters are quantifiers that specify how many times the preceding pattern may occur.

The + quantifier means "one or more" occurrences of the preceding pattern. The preceding pattern must occur at least once and may repeat any number of times. If it does not occur at least once, the match fails.

For example, when the whole string must match (as with re.fullmatch()):

Pattern: a+
Matches: "a", "aa", "aaa", ...
Does not match: "", "b", "ab"

The * quantifier means "zero or more" occurrences of the preceding pattern. The preceding pattern may be absent or may repeat any number of times, so the match still succeeds when the pattern does not occur at all.

For example, again when the whole string must match:

Pattern: a*
Matches: "", "a", "aa", "aaa", ...
Does not match: "b", "ab"

In summary:

+ matches one or more occurrences of the preceding pattern.
* matches zero or more occurrences of the preceding pattern.

Here's an example that demonstrates the difference between + and * in Python:

In [19]:
import re

pattern_plus = r"ab+"
pattern_star = r"ab*"

text = "ab abb a"

matches_plus = re.findall(pattern_plus, text)
matches_star = re.findall(pattern_star, text)

print("Matches with + quantifier:", matches_plus)
print("Matches with * quantifier:", matches_star)


Matches with + quantifier: ['ab', 'abb']
Matches with * quantifier: ['ab', 'abb', 'a']


11. What is the difference between {4} and {4,5} in regular expressions?

In regular expressions, {4} and {4,5} are quantifiers that specify how many times the preceding pattern must occur.

The quantifier {4} specifies that the preceding pattern must occur exactly four times.

For example, when the whole string must match (as with re.fullmatch()):

Pattern: a{4}
Matches: "aaaa"
Does not match: "", "a", "aa", "aaaaa"

On the other hand, the quantifier {4,5} specifies that the preceding pattern must occur at least four and at most five times, so a{4,5} matches "aaaa" and "aaaaa" but not "aaa" or "aaaaaa".
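
A quick way to compare the two quantifiers is to test whole strings with fullmatch(). The sketch below assumes that whole-string interpretation; the names exactly_four, four_or_five, and results are illustrative.

```python
import re

exactly_four = re.compile(r"a{4}")
four_or_five = re.compile(r"a{4,5}")

results = {}
for s in ["aaa", "aaaa", "aaaaa", "aaaaaa"]:
    # fullmatch() succeeds only if the entire string fits the pattern.
    results[s] = (bool(exactly_four.fullmatch(s)), bool(four_or_five.fullmatch(s)))
    print(s, results[s])

# With search(), the greedy {4,5} takes five characters when they are available.
longest = four_or_five.search("aaaaaa").group()
print(longest)  # aaaaa
```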

12. What do the \d, \w, and \s shorthand character classes signify in regular expressions?

\d represents the digit character class. It matches any decimal digit. For ASCII text it is equivalent to the character range [0-9]; with Python 3 string patterns it also matches other Unicode decimal digits unless the re.ASCII flag is set.

\w represents the word character class. It matches any alphanumeric character or underscore. For ASCII text it is equivalent to [a-zA-Z0-9_]; with Python 3 string patterns it also matches Unicode letters and digits unless the re.ASCII flag is set.

\s represents the whitespace character class. It matches any whitespace character, including spaces, tabs, and line breaks.

In [21]:
import re

text = "abc 123 !@#"

digits = re.findall(r"\d", text)
word_chars = re.findall(r"\w", text)
whitespace_chars = re.findall(r"\s", text)

print("Digits:", digits)
print("Word Characters:", word_chars)
print("Whitespace Characters:", whitespace_chars)


Digits: ['1', '2', '3']
Word Characters: ['a', 'b', 'c', '1', '2', '3']
Whitespace Characters: [' ', ' ']
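
The Unicode behaviour of \d and \w in Python 3 can be sketched as follows. The sample characters are assumptions chosen for illustration and are written as escapes to keep the source plain ASCII.

```python
import re

# "\u0663" is ARABIC-INDIC DIGIT THREE; "\u00e9" is the letter e with an acute accent.
text = "abc 123 \u0663 \u00e9"

unicode_digits = re.findall(r"\d", text)
ascii_digits = re.findall(r"\d", text, re.ASCII)  # re.ASCII limits \d to [0-9]

unicode_words = re.findall(r"\w", "\u00e9_1")
ascii_words = re.findall(r"\w", "\u00e9_1", re.ASCII)

print(len(unicode_digits), len(ascii_digits))  # 4 3
print(len(unicode_words), len(ascii_words))    # 3 2
```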


13. What do the \D, \W, and \S shorthand character classes signify in regular expressions?

\D represents the negation of the digit character class \d. It matches any character that is not a decimal digit from 0 to 9.

\W represents the negation of the word character class \w. It matches any character that is not an alphanumeric character (letter, digit, or underscore).

\S represents the negation of the whitespace character class \s. It matches any character that is not a whitespace character (space, tab, line break).

In [25]:
import re

text = "abc 123 !@#"

non_digits = re.findall(r"\D", text)
non_word_chars = re.findall(r"\W", text)
non_whitespace_chars = re.findall(r"\S", text)

print("Non-Digits:", non_digits)
print("Non-Word Characters:", non_word_chars)
print("Non-Whitespace Characters:", non_whitespace_chars)


Non-Digits: ['a', 'b', 'c', ' ', ' ', '!', '@', '#']
Non-Word Characters: [' ', ' ', '!', '@', '#']
Non-Whitespace Characters: ['a', 'b', 'c', '1', '2', '3', '!', '@', '#']


14. What is the difference between .* and .*? in regular expressions?

.*? is a non-greedy or lazy quantifier. It matches the shortest possible sequence of characters that satisfies the pattern.

.* is a greedy quantifier. It matches the longest possible sequence of characters that satisfies the pattern.
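
The contrast is easiest to see on text with repeated delimiters. The HTML-like sample string below is an assumption for illustration only.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last ">" in the string.
greedy = re.search(r"<.*>", text).group()
# Lazy: .*? stops at the first ">" it can.
lazy = re.search(r"<.*?>", text).group()

print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>

all_tags = re.findall(r"<.*?>", text)
print(all_tags)  # ['<b>', '</b>', '<i>', '</i>']
```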

15. What is the syntax for matching both numbers and lowercase letters with a character class?

In [None]:
[0-9a-z]

The character class [0-9a-z] combines two ranges of characters. It matches any single character that is either a digit from 0 to 9 or a lowercase letter from a to z. The order of the ranges does not matter, so [a-z0-9] is equivalent.

Pattern: [0-9a-z]
Matches: "0", "a", "7", "x"
Does not match: "A", ".", "#", "!"

Pattern: [a-z0-9]
Matches: "a", "5", "b", "2"
Does not match: "X", "*", "!", "."
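
A short sketch confirms the behaviour listed above. The sample string is an illustrative assumption.

```python
import re

digits_then_letters = re.compile(r"[0-9a-z]")
letters_then_digits = re.compile(r"[a-z0-9]")

sample = "aB3-x!Z9"
found = digits_then_letters.findall(sample)
print(found)  # ['a', '3', 'x', '9']

# Both orderings of the ranges describe the same set of characters.
print(found == letters_then_digits.findall(sample))  # True
```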

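16. What is the procedure for making a regular expression case-insensitive?

Pass the re.IGNORECASE flag (or its short form, re.I) as the second argument to re.compile():

In [30]:
import re

pattern = re.compile(r"pattern", re.IGNORECASE)

# The compiled regex now ignores letter case when matching
print(pattern.search("A PATTERN here").group())  # PATTERN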


17. What does the . character normally match? What does it match if re.DOTALL is passed as 2nd argument in re.compile()?

In regular expressions, the . (dot) character is a special metacharacter that normally matches any character except a newline character (\n). It represents a wildcard and can match any single character in the input text.

However, if the re.DOTALL flag (or its short form, re.S) is passed as the second argument to re.compile(), the dot matches any character, including newline characters. The flag turns the dot into a true wildcard that can span multiple lines.

In [32]:
import re

text = "Line 1\nLine 2\nLine 3"

pattern = re.compile(r".*", re.DOTALL)
matches = pattern.findall(text)

print(matches)


['Line 1\nLine 2\nLine 3', '']
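
For contrast, this sketch runs the same pattern with and without the flag on the same text. The variable names are illustrative.

```python
import re

text = "Line 1\nLine 2\nLine 3"

# Default: the dot stops at the first newline.
default_match = re.search(r".*", text).group()
# With re.DOTALL: the dot crosses newlines and consumes the whole text.
dotall_match = re.search(r".*", text, re.DOTALL).group()

print(repr(default_match))  # 'Line 1'
print(repr(dotall_match))   # 'Line 1\nLine 2\nLine 3'
```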


18. If numRegex = re.compile(r'\d+'), what will numRegex.sub('X', '11 drummers, 10 pipers, five rings, 4 hen') return?

The sub() method replaces every match of \d+ (each run of one or more digits) with 'X', so the return value will be:

'X drummers, X pipers, five rings, X hen'
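
The result can be checked directly. The second call, which uses the optional count argument of sub(), is an extra illustration beyond what the question asks.

```python
import re

numRegex = re.compile(r'\d+')
sentence = '11 drummers, 10 pipers, five rings, 4 hen'

result = numRegex.sub('X', sentence)
print(result)  # X drummers, X pipers, five rings, X hen

# count limits how many matches are replaced, starting from the left.
first_only = numRegex.sub('X', sentence, count=1)
print(first_only)  # X drummers, 10 pipers, five rings, 4 hen
```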

19. What does passing re.VERBOSE as the 2nd argument to re.compile() allow you to do?

Passing re.VERBOSE as the second argument to re.compile() allows you to add whitespace and comments to the regular expression pattern for improved readability. It enables you to write more organized and understandable regular expressions.

When the re.VERBOSE flag is used, the following rules apply to the pattern:

Whitespace outside of character classes is ignored: Whitespace characters (spaces, tabs, and line breaks) outside of character classes are ignored. This allows you to format the regular expression pattern with indentation, line breaks, and comments to make it more readable.

Whitespace inside character classes is preserved: Whitespace characters inside character classes (enclosed within square brackets) are treated as literal characters and not ignored.

Comments are allowed: You can include comments in the pattern by starting them with the # character. Comments can provide explanations, clarify intent, or make the pattern easier to understand.

In [34]:
import re

pattern = re.compile(r"""
    \d{3}  # Match three digits
    -      # Match a hyphen
    \d{4}  # Match four digits
""", re.VERBOSE)

text = "Phone number: 123-4567"

match = pattern.search(text)
if match:
    print("Phone number found:", match.group())


Phone number found: 123-4567


20. How would you write a regex that matches a number with a comma for every three digits? It must match the following:
'42'
'1,234'
'6,368,745'
but not the following:
'12,34,567' (which has only two digits between the commas)
'1234' (which lacks commas)


In [35]:
import re

pattern = re.compile(r'^\d{1,3}(,\d{3})*$')

# Valid matches
print(pattern.match('42'))  # Matches
print(pattern.match('1,234'))  # Matches
print(pattern.match('6,368,745'))  # Matches

# Invalid matches
print(pattern.match('12,34,567'))  # Does not match
print(pattern.match('1234'))  # Does not match


<re.Match object; span=(0, 2), match='42'>
<re.Match object; span=(0, 5), match='1,234'>
<re.Match object; span=(0, 9), match='6,368,745'>
None
None


21. How would you write a regex that matches the full name of someone whose last name is Watanabe? You can assume that the first name that comes before it will always be one word that begins with a capital letter. The regex must match the following:
'Haruto Watanabe'
'Alice Watanabe'
'RoboCop Watanabe'
but not the following:
'haruto Watanabe' (where the first name is not capitalized)
'Mr. Watanabe' (where the preceding word has a nonletter character)
'Watanabe' (which has no first name)
'Haruto watanabe' (where Watanabe is not capitalized)


In [39]:
import re

pattern = re.compile(r'^[A-Z][a-zA-Z]*\sWatanabe$')

# Valid matches
print(pattern.match('Haruto Watanabe'))  # Matches
print(pattern.match('Alice Watanabe'))  # Matches
print(pattern.match('RoboCop Watanabe'))  # Matches

# Invalid matches
print(pattern.match('haruto Watanabe'))  # Does not match
print(pattern.match('Mr. Watanabe'))  # Does not match
print(pattern.match('Watanabe'))  # Does not match
print(pattern.match('Haruto watanabe'))  # Does not match


<re.Match object; span=(0, 15), match='Haruto Watanabe'>
<re.Match object; span=(0, 14), match='Alice Watanabe'>
<re.Match object; span=(0, 16), match='RoboCop Watanabe'>
None
None
None
None


22. How would you write a regex that matches a sentence where the first word is either Alice, Bob, or Carol; the second word is either eats, pets, or throws; the third word is apples, cats, or baseballs; and the sentence ends with a period? This regex should be case-insensitive. It must match the following:
'Alice eats apples.'
'Bob pets cats.'
'Carol throws baseballs.'
'Alice throws Apples.'
'BOB EATS CATS.'
but not the following:
'RoboCop eats apples.'
'ALICE THROWS FOOTBALLS.'
'Carol eats 7 cats.'



In [41]:
import re

pattern = re.compile(r'^(Alice|Bob|Carol)\s+(eats|pets|throws)\s+(apples|cats|baseballs)\.$', re.IGNORECASE)

# Valid matches
print(pattern.match('Alice eats apples.'))  # Matches
print(pattern.match('Bob pets cats.'))  # Matches
print(pattern.match('Carol throws baseballs.'))  # Matches
print(pattern.match('Alice throws Apples.'))  # Matches
print(pattern.match('BOB EATS CATS.'))  # Matches

# Invalid matches
print(pattern.match('RoboCop eats apples.'))  # Does not match
print(pattern.match('ALICE THROWS FOOTBALLS.'))  # Does not match
print(pattern.match('Carol eats 7 cats.'))  # Does not match


<re.Match object; span=(0, 18), match='Alice eats apples.'>
<re.Match object; span=(0, 14), match='Bob pets cats.'>
<re.Match object; span=(0, 23), match='Carol throws baseballs.'>
<re.Match object; span=(0, 20), match='Alice throws Apples.'>
<re.Match object; span=(0, 14), match='BOB EATS CATS.'>
None
None
None
