#### 1. What is the name of the feature responsible for generating Regex objects?

In Python, Regex objects are created by the re.compile() function of the built-in "re" module. You pass it a string containing the regular expression, and it returns a compiled pattern object whose methods, such as match(), search(), and findall(), perform the actual matching.
Here's a simple example that uses re.compile() to create a Regex object:

In [1]:
import re

# Compile a regular expression pattern
pattern = re.compile(r'\d{2}-\d{2}-\d{4}')  # Matches date in the format "DD-MM-YYYY"

# Use the compiled pattern for matching
result = pattern.match('31-12-2022')
if result:
    print("Valid date format")
else:
    print("Invalid date format")


Valid date format


In the above example, the re.compile() function is used to create a Regex object that matches a date in the format "DD-MM-YYYY". The pattern.match() method is then used to check if the input string '31-12-2022' matches the pattern.

#### 2. Why do raw strings often appear in Regex objects?

In Python, a raw string is created by prefixing the string literal with the letter 'r'.
Raw strings are often used in regular expressions to avoid the need for excessive escaping of special characters.


To simplify the handling of regular expressions, raw strings are commonly used because they treat backslashes as literal characters. This means that backslashes within a raw string are not used for escaping, and the backslash retains its literal meaning. This behavior is particularly useful in regular expressions, where backslashes are frequently used to escape metacharacters or introduce special sequences.

In the code below, the 'r' prefix creates a raw string literal, so both backslashes in r'\\(\d)' reach the regex engine unchanged. There, \\ matches a single literal backslash and \d matches a digit. Without the raw string prefix, every backslash would first have to be doubled for Python's own string parser, so the same pattern would be written '\\\\(\\d)', with four backslashes for each literal backslash.

In [1]:
import re

pattern = re.compile(r'\\(\d)')
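
To make the equivalence concrete, here is a small sketch that compares the raw-string spelling with the regular-string spelling of the same pattern. The sample text 'path\7' is an assumption chosen only for illustration.

```python
import re

# The same pattern written as a raw string and as a regular string.
raw_pattern = r'\\(\d)'       # the regex engine receives: \\(\d)
plain_pattern = '\\\\(\\d)'   # every backslash doubled for Python first

# Both literals produce identical pattern text.
same_text = raw_pattern == plain_pattern

# The pattern matches one literal backslash followed by one captured digit.
sample = 'path\\7'            # the actual text is: path\7
match = re.search(raw_pattern, sample)
digit = match.group(1)

print(same_text)  # True
print(digit)      # 7
```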


#### 3. What is the return value of the search() method?

The <b>search()</b> method of a regular expression object in Python returns a match object if a match is found, or <b>None</b> (the Python null value, not a string) if no match is found.

The match object represents the first occurrence of a pattern within a string. It provides information about the match, such as the matched substring, the starting and ending positions of the match, and other details.

In [2]:
import re

pattern = re.compile(r'apple')
result = pattern.search('I like apples and oranges.')

if result:
    print("Match found:", result.group())  # prints "Match found: apple"
    print("Starting position:", result.start())  # prints "Starting position: 7"
    print("Ending position:", result.end())  # prints "Ending position: 12"
    print("Matched substring:", result.group(0))  # prints "Matched substring: apple"
else:
    print("No match found.")


Match found: apple
Starting position: 7
Ending position: 12
Matched substring: apple


In the above example, the <b>search()</b> method is used to find the first occurrence of the pattern 'apple' in the input string 'I like apples and oranges.' Since the pattern is found at index 7, a match object is returned. We can then call methods of the match object, such as <b>group()</b>, <b>start()</b>, and <b>end()</b>, to extract information about the match.

#### 4. From a Match item, how do you get the actual strings that match the pattern?

To obtain the actual strings that match the pattern from a match object, you can use the <b>group()</b> method. The group() method returns the substring that was matched by the regular expression.

By default, calling <b>group()</b> without any arguments returns the entire match. You can also provide an optional group number or a group name as an argument to group() to retrieve a specific matched subgroup if your regular expression contains capturing groups.

In [3]:
import re

pattern = re.compile(r'(\d+)-(\d+)-(\d+)')
result = pattern.search('Date: 31-12-2022')

if result:
    print("Match found:", result.group())  # prints "Match found: 31-12-2022"
    print("Day:", result.group(1))  # prints "Day: 31"
    print("Month:", result.group(2))  # prints "Month: 12"
    print("Year:", result.group(3))  # prints "Year: 2022"
else:
    print("No match found.")


Match found: 31-12-2022
Day: 31
Month: 12
Year: 2022


Using <b>result.group()</b>, we retrieve the entire matched substring "31-12-2022". Additionally, we can use <b>result.group(1)</b>, <b>result.group(2)</b>, and <b>result.group(3)</b> to access the specific captured groups, which correspond to the day, month, and year components of the date respectively.
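
As a complement to numbered groups, the following sketch shows retrieving all groups at once and looking groups up by name. The group names day, month, and year are assumptions introduced for this illustration.

```python
import re

# Named groups use the (?P<name>...) syntax; the names are illustrative.
date_pattern = re.compile(r'(?P<day>\d+)-(?P<month>\d+)-(?P<year>\d+)')
result = date_pattern.search('Date: 31-12-2022')

all_groups = result.groups()    # every captured group, as a tuple
year = result.group('year')     # one group, looked up by name
by_name = result.groupdict()    # mapping of group name to captured text

print(all_groups)  # ('31', '12', '2022')
print(year)        # 2022
print(by_name)     # {'day': '31', 'month': '12', 'year': '2022'}
```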

#### 5. In the regex created from r'(\d\d\d)-(\d\d\d-\d\d\d\d)', what does group zero cover? Group 2? Group 1?

In [None]:
import re

pattern = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
result = pattern.search('Phone: 123-456-7890')

if result:
    print("Match found:", result.group())  # prints "Match found: 123-456-7890"
    print("Group 0 (entire match):", result.group(0))  # prints "Group 0 (entire match): 123-456-7890"
    print("Group 1:", result.group(1))  # prints "Group 1: 123"
    print("Group 2:", result.group(2))  # prints "Group 2: 456-7890"
else:
    print("No match found.")


The regular expression r'(\d\d\d)-(\d\d\d-\d\d\d\d)' contains two capturing groups. Capturing groups are numbered from 1 in the order their opening parentheses appear, and group 0 always refers to the entire match. Here's what each group covers:

>Group 0: The entire match. It represents the entire substring that matches the entire regular expression pattern.

>Group 1: This corresponds to the first capturing group (\d\d\d). It captures a sequence of three digits.

>Group 2: This corresponds to the second capturing group (\d\d\d-\d\d\d\d). It captures a sequence of three digits followed by a hyphen and four digits.

#### 6. In regular expression syntax, parentheses and periods have special meanings. How can you tell a regex that you want it to match literal parentheses and periods?

In regular expressions, parentheses and periods have special meanings, as they are used to define groups and match any character (except newline), respectively. To match literal parentheses and periods in a regular expression, you can use the backslash \ to escape these characters. By preceding the parentheses or periods with a backslash, you can instruct the regular expression engine to interpret them as literal characters instead of special metacharacters.

In [4]:
import re

pattern = re.compile(r'\(Hello\.\)')
result = pattern.search('Say (Hello.) to the world.')

if result:
    print("Match found:", result.group())  # prints "Match found: (Hello.)"
else:
    print("No match found.")


Match found: (Hello.)


In this example, the regular expression pattern \(Hello\.\) is used to match the string "(Hello.)". The parentheses and period in the pattern are escaped using backslashes \, indicating that they should be treated as literal characters.
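
When the literal text comes from a variable, escaping it by hand is error-prone. A minimal sketch, assuming the same sample sentence as above, uses the standard re.escape() function to add the backslashes automatically and contrasts it with the unescaped pattern.

```python
import re

literal_text = '(Hello.)'
escaped = re.escape(literal_text)   # backslash-escapes the metacharacters

found = re.search(escaped, 'Say (Hello.) to the world.')

# Without escaping, "." is a wildcard and "(...)" is a capturing group,
# so the pattern also matches text that contains no parentheses at all.
unescaped = re.search(literal_text, 'Say Hello! to the world.')

print(found.group())      # (Hello.)
print(unescaped.group())  # Hello!
```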

#### 7. The findall() method returns a string list or a list of string tuples. What causes it to return one of the two options?

The findall() method of a regular expression object in Python returns either a list of strings or a list of string tuples, depending on how many capturing groups the regular expression pattern contains.

If the pattern contains two or more capturing groups, findall() returns a list of string tuples. Each tuple represents one match, and each element of the tuple holds the text captured by the corresponding group.

If the pattern contains no capturing groups, findall() returns a list of strings, each one a complete match of the pattern. A pattern with exactly one capturing group also returns a list of strings, but each string is the text captured by that group rather than the whole match.

To illustrate the behavior, let's consider an example:

In [5]:
import re

pattern = re.compile(r'(\d+)-(\w+)')
result = pattern.findall('123-abc 456-def')

print(result)


[('123', 'abc'), ('456', 'def')]
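
For comparison, this short sketch runs three variants of the pattern against the same text so that the effect of zero, one, and two capturing groups can be seen side by side. The variable names are illustrative.

```python
import re

text = '123-abc 456-def'

no_groups = re.findall(r'\d+-\w+', text)        # whole matches
one_group = re.findall(r'(\d+)-\w+', text)      # text of the single group
two_groups = re.findall(r'(\d+)-(\w+)', text)   # one tuple per match

print(no_groups)   # ['123-abc', '456-def']
print(one_group)   # ['123', '456']
print(two_groups)  # [('123', 'abc'), ('456', 'def')]
```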


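#### 8. In regular expressions, what does the | character mean?

In regular expressions, the <b>"|"</b> character is called a vertical bar or pipe symbol. It denotes alternation: the regex matches either the expression on its left or the expression on its right.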

#### 9. In regular expressions, what does the character stand for?

The vertical bar or pipe symbol <b> "|"</b> is used as a logical OR operator. It allows you to specify multiple alternative patterns or options within a regular expression.
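
A brief sketch of alternation in practice follows. The sample sentences and pattern names are assumptions made for illustration.

```python
import re

pet_pattern = re.compile(r'cat|dog')
sentence = 'I have a dog and a cat.'

first = pet_pattern.search(sentence)       # leftmost alternative in the text wins
all_pets = pet_pattern.findall(sentence)

# Parentheses limit how far the alternation reaches.
bat_pattern = re.compile(r'Bat(man|mobile|copter)')
vehicle = bat_pattern.search('The Batmobile lost a wheel.')

print(first.group())     # dog
print(all_pets)          # ['dog', 'cat']
print(vehicle.group())   # Batmobile
print(vehicle.group(1))  # mobile
```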

#### 10. In regular expressions, what is the difference between the + and * characters?

In regular expressions, the + and * characters are quantifiers used to specify the repetition of the preceding element in the pattern. They have different meanings and behaviors:

>+ (Plus):
- The + character matches one or more occurrences of the preceding element.
- It requires that the preceding element must appear at least once in the input string, but it can repeat more than once.
- For example, the pattern cat+ will match "cat", "catt", "cattt", and so on, but it will not match "ca" because the t is required to appear one or more times.

>* (Asterisk):
- The * character matches zero or more occurrences of the preceding element.
- It allows for optional matching of the preceding element; it can appear zero times or any number of times.
- For example, the pattern ab*c will match "ac", "abc", "abbc", "abbbc", and so on, including cases where the b is repeated zero times.
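
The two examples above can be checked with a short sketch. It uses re.fullmatch() so that each sample string has to match the pattern from start to end.

```python
import re

# "+" needs at least one "t"; "*" allows zero or more "b".
plus_hits = [bool(re.fullmatch(r'cat+', s)) for s in ['ca', 'cat', 'cattt']]
star_hits = [bool(re.fullmatch(r'ab*c', s)) for s in ['ac', 'abc', 'abbbc']]

print(plus_hits)  # [False, True, True]
print(star_hits)  # [True, True, True]
```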

#### 11. What is the difference between {4} and {4,5} in regular expression?

In regular expressions, curly braces {} are quantifiers that specify how many times the preceding element must repeat: a single number gives an exact count, and two comma-separated numbers give a minimum and a maximum. The difference between {4} and {4,5} is as follows:

>{4}: This specifies an exact repetition of the preceding element. It matches exactly four occurrences of the preceding element.
For example, the pattern a{4} will match "aaaa" but will not match "aaa" or "aaaaa".

>{4,5}: This specifies a range of repetition for the preceding element. It matches a minimum of four occurrences and a maximum of five occurrences of the preceding element.
For example, the pattern a{4,5} fully matches "aaaa" and "aaaaa", but not "aaa" or "aaaaaaaa" (a search inside "aaaaaaaa" would still find a run of five).

It's worth noting that if you omit the upper bound but keep the comma, as in {4,}, the quantifier means "at least" that many occurrences. For example, {4,} matches four or more occurrences of the preceding element.
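
The following sketch compares the three quantifier forms on the same list of sample strings, using re.fullmatch() so that the whole string has to fit the pattern. The sample strings are chosen only for illustration.

```python
import re

samples = ['aaa', 'aaaa', 'aaaaa', 'aaaaaa']

exactly_four = [s for s in samples if re.fullmatch(r'a{4}', s)]
four_or_five = [s for s in samples if re.fullmatch(r'a{4,5}', s)]
four_or_more = [s for s in samples if re.fullmatch(r'a{4,}', s)]

# search() looks inside a longer string; the greedy quantifier takes five.
inside = re.search(r'a{4,5}', 'aaaaaaaa').group()

print(exactly_four)  # ['aaaa']
print(four_or_five)  # ['aaaa', 'aaaaa']
print(four_or_more)  # ['aaaa', 'aaaaa', 'aaaaaa']
print(inside)        # aaaaa
```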

#### 12. What do the \d, \w, and \s shorthand character classes signify in regular expressions?

- \d: This shorthand represents the digit character class. It matches any digit from 0 to 9 (for Python 3 str patterns it also matches other Unicode decimal digits unless the re.ASCII flag is set). For example, the pattern \d will match any single digit character.

- \w: This shorthand represents the word character class. It matches any alphanumeric character (letters A-Z, a-z, and digits 0-9) as well as the underscore character (_); for str patterns without re.ASCII, letters and digits from other scripts count as well. For example, the pattern \w will match any single alphanumeric character or underscore.

- \s: This shorthand represents the whitespace character class. It matches any whitespace character, including spaces, tabs, and line breaks. For example, the pattern \s will match a single whitespace character.

>These shorthand character classes can be used in combination with other regular expression elements to form more complex patterns. For instance, \d+ will match one or more consecutive digits, \w* will match zero or more alphanumeric characters or underscores, and \s? will match an optional whitespace character.
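
A small sketch of the three shorthand classes applied to one sample string follows. The text is made up for illustration and contains a tab character.

```python
import re

text = 'Room 42, floor_3\tEast'

digits = re.findall(r'\d+', text)   # runs of digits
words = re.findall(r'\w+', text)    # runs of letters, digits, underscores
spaces = re.findall(r'\s', text)    # individual whitespace characters

print(digits)  # ['42', '3']
print(words)   # ['Room', '42', 'floor_3', 'East']
print(spaces)  # [' ', ' ', '\t']
```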

#### 13. What do the \D, \W, and \S shorthand character classes signify in regular expressions?

The uppercase counterparts \D, \W, and \S are negated versions of the shorthand character classes \d, \w, and \s, respectively. They represent the complement sets of characters that do not match the corresponding lowercase shorthand classes. Here's what each negated shorthand character class signifies:

- \D: This shorthand represents the negation of the digit character class. It matches any character that is not a digit. For example, the pattern \D will match any character that is not a digit, such as letters, symbols, or whitespace.

- \W: This shorthand represents the negation of the word character class. It matches any character that is not an alphanumeric character or underscore. For example, the pattern \W will match symbols or whitespace.

- \S: This shorthand represents the negation of the whitespace character class. It matches any character that is not a whitespace character. For example, the pattern \S will match any character that is not a whitespace character, such as letters, digits, or symbols.

>These negated shorthand character classes are useful when you want to match characters that do not belong to a specific character class. For example, \D+ will match one or more consecutive non-digit characters, \W* will match zero or more characters that are not alphanumeric or underscores, and \S? will match an optional non-whitespace character.
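
The negated classes can be tried out the same way. This sketch uses a made-up sample string and collects the runs that each negated class matches.

```python
import re

text = 'ID: 42 ok!'

non_digits = re.findall(r'\D+', text)   # runs without digits
non_words = re.findall(r'\W+', text)    # runs of punctuation and whitespace
non_spaces = re.findall(r'\S+', text)   # runs without whitespace

print(non_digits)  # ['ID: ', ' ok!']
print(non_words)   # [': ', ' ', '!']
print(non_spaces)  # ['ID:', '42', 'ok!']
```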

#### 14. What is the difference between '.*?' and '.*'?

Both <b>.*?</b> and <b>.*</b> match any character (except a newline) zero or more times. They differ in how much text they try to consume:

> 1 <b>.*? (Reluctant / Lazy Quantifier):</b>

- The .*? pattern combines the .* expression with the ? modifier. It matches any character (except newline characters) zero or more times in a non-greedy, or reluctant, manner.
- The ? modifier makes the preceding * or + quantifier lazy, meaning it matches as few characters as possible while still letting the overall pattern succeed.
- For example, with the input string "abcabcabc" and the pattern a.*?c, the overall match is "abc" and the .*? part matches only "b", because it stops at the first "c" it can reach.

> 2 <b>.* (Greedy Quantifier):</b>

- The .* pattern, without the ? modifier, matches any character (except newline characters) zero or more times in a greedy manner.
- The greedy behavior means it matches as many characters as possible while still letting the overall pattern succeed.
- For example, with the input string "abcabcabc" and the pattern a.*c, the overall match is the whole string "abcabcabc" and the .* part matches "bcabcab", because it extends to the last "c" in the string.
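
To see the difference directly, the sketch below wraps the quantified part in a capturing group so that both the overall match and the text consumed by the quantifier can be inspected.

```python
import re

text = 'abcabcabc'

lazy = re.search(r'a(.*?)c', text)    # stops at the first possible "c"
greedy = re.search(r'a(.*)c', text)   # runs on to the last "c"

print(lazy.group(), lazy.group(1))      # abc b
print(greedy.group(), greedy.group(1))  # abcabcabc bcabcab
```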

#### 15. What is the syntax for matching both numbers and lowercase letters with a character class?

To match both digits and lowercase letters, place both ranges inside one pair of square brackets: [0-9a-z]. In this syntax:

- 0-9 represents the range of numbers from 0 to 9.
- a-z represents the range of lowercase letters from "a" to "z".

By combining these ranges within square brackets, [0-9a-z], you create a character class that matches any character that is either a number or a lowercase letter.

For example, the pattern [0-9a-z]+ will match one or more consecutive occurrences of either a number or a lowercase letter.

Here are a few examples of how this character class can be used:

In [None]:
[0-9a-z]       # Matches a single digit or lowercase letter
[0-9a-z]+      # Matches one or more consecutive digits or lowercase letters
[0-9a-z]{3}    # Matches exactly three digits or lowercase letters
[0-9a-z]{2,4}  # Matches two to four consecutive digits or lowercase letters
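
As a runnable counterpart to the patterns listed above, this sketch applies the character class to a made-up sample string. Uppercase letters are not in the class, so they break up the runs.

```python
import re

text = 'user42 logged IN at 9am'

runs = re.findall(r'[0-9a-z]+', text)   # uppercase "IN" is skipped

three_ok = bool(re.fullmatch(r'[0-9a-z]{3}', 'a1b'))
three_upper = bool(re.fullmatch(r'[0-9a-z]{3}', 'A1b'))

print(runs)         # ['user42', 'logged', 'at', '9am']
print(three_ok)     # True
print(three_upper)  # False
```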

#### 16. What is the procedure for making a regular expression case insensitive?

In Python, you can make a regular expression case insensitive by using the <b>"re.IGNORECASE"</b> flag or the <b>"re.I"</b> flag. Here's the procedure to create a case-insensitive regular expression:

In [2]:
import re  # (1) Import the regex module

# (2) Choose the text to search; here we want to find "apple" regardless of case
text = "I have an Apple"

# (3) Compile the pattern, passing re.IGNORECASE as the second argument
pattern = re.compile("apple", re.IGNORECASE)

# (4) Use the compiled pattern with a regex method such as match(), search(), or findall()
match = pattern.search(text)

if match:
    print("Match found!")
else:
    print("No match found.")


Match found!


#### 17. What does the . character normally match? What does it match if re.DOTALL is passed as 2nd argument in re.compile()?

In regular expressions, the . (dot) character normally matches any character except a newline character (\n). It matches a single occurrence of any character, including letters, digits, symbols, and whitespace, but it does not match newline characters.
However, if the <b>re.DOTALL</b> flag (or <b>re.S</b> flag) is passed as the second argument when compiling a regular expression using <b>re.compile()</b>, the behavior of the <b>'.'</b> character changes. With <b>re.DOTALL</b> enabled, the <b>"."</b> character will match any character, including newline characters.

In [3]:
import re

text = "Hello\nWorld"

pattern1 = re.compile("Hello.World")
match1 = pattern1.search(text)
print(match1)  # No match because newline is not matched by '.'

pattern2 = re.compile("Hello.World", re.DOTALL)
match2 = pattern2.search(text)
print(match2)  # Match because newline is matched by '.'

None
<re.Match object; span=(0, 11), match='Hello\nWorld'>


In the above example, the input text contains a newline character between "Hello" and "World". The first pattern, <b>"Hello.World"</b>, is compiled without the <b>re.DOTALL</b> flag, so the dot does not match the newline character. Therefore, the <b>search()</b> function using this pattern does not find a match.

On the other hand, the second pattern, <b>"Hello.World"</b>, is compiled with the <b>re.DOTALL</b> flag. This enables the dot to match any character, including newline characters. Therefore, the <b>search()</b> function using this pattern finds a match, considering the newline character between "Hello" and "World" as part of the match.

#### 18. If numRegex = re.compile(r'\d+'), what will numRegex.sub('X', '11 drummers, 10 pipers, five rings, 4 hens') return?

In [5]:
import re

numRegex = re.compile(r'\d+')
result = numRegex.sub('X', '11 drummers, 10 pipers, five rings, 4 hens')
print(result)

X drummers, X pipers, five rings, X hens


<b>Explanation</b>:

- The regular expression pattern r'\d+' matches one or more digits.
- The numRegex.sub('X', '11 drummers, 10 pipers, five rings, 4 hens') statement substitutes all matches of the pattern in the second argument ('11 drummers, 10 pipers, five rings, 4 hens') with the replacement string 'X'.
- In the input string, the pattern matches three times: "11", "10", and "4". The word "five" contains no digit characters, so \d+ does not match it and it is left unchanged.
- Each match is replaced with the letter "X", resulting in the output string 'X drummers, X pipers, five rings, X hens'.

#### 19. What does passing re.VERBOSE as the 2nd argument to re.compile() allow you to do?

Passing re.VERBOSE as the second argument to re.compile() allows you to write more readable and organized regular expressions by enabling verbose mode. Here's what it allows you to do:

- Ignore Whitespace: The re.VERBOSE flag allows you to ignore whitespace and comments within the regular expression pattern. Whitespace characters (except those within character classes) and comments starting with # are ignored, allowing you to add spaces, line breaks, and indentation to improve readability.

- Add Comments: You can add comments within the regular expression pattern using #. These comments are ignored by the regex engine and serve as a way to document and explain the pattern.

- Enable Multiline Patterns: When using re.VERBOSE, a multiline regular expression pattern can be written across multiple lines, making it easier to structure and organize complex patterns.

In [6]:
import re

pattern = re.compile(r"""
    \d{3}     # Match three digits
    \s*       # Match optional whitespace
    -         # Match a hyphen
    \s*       # Match optional whitespace
    \d{4}     # Match four digits
""", re.VERBOSE)

text = "Phone numbers: 123-4567, 987-6543, 555-1234"
matches = pattern.findall(text)
print(matches)

['123-4567', '987-6543', '555-1234']


In the above example, the regular expression pattern is written across multiple lines using re.VERBOSE. Whitespace and comments are used to improve readability and understanding of the pattern. The pattern matches phone numbers in the format "###-####". The resulting matches will be a list of phone numbers found in the input text.

Using re.VERBOSE is optional. It exists to make patterns more readable and maintainable, but it does change how the pattern text is parsed: unescaped whitespace and # characters outside character classes are no longer matched literally, so a literal space or # must be escaped with a backslash or placed inside a character class.
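
One consequence of verbose mode is easy to overlook: a space typed into the pattern stops matching a space in the text. The sketch below illustrates this with a made-up "### ####" number format. The pattern names are assumptions for this example.

```python
import re

plain = re.compile(r'\d{3} \d{4}')                     # the space is literal
verbose = re.compile(r'\d{3} \d{4}', re.VERBOSE)       # the space is ignored
in_class = re.compile(r'\d{3} [ ] \d{4}', re.VERBOSE)  # [ ] keeps one literal space

plain_hit = bool(plain.search('555 1234'))
verbose_hit = bool(verbose.search('555 1234'))
verbose_joined = bool(verbose.search('5551234'))
class_hit = bool(in_class.search('555 1234'))

print(plain_hit)       # True
print(verbose_hit)     # False
print(verbose_joined)  # True
print(class_hit)       # True
```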

#### 20. How would you write a regex that matches a number with a comma for every three digits? It must match the following: '42' '1,234' '6,368,745'
#### but not the following: '12,34,567' (which has only two digits between the commas) '1234' (which lacks commas)

To match a number with commas for every three digits, you can use the following regular expression pattern:

>^\d{1,3}(,\d{3})*$


In [7]:
import re

pattern = re.compile(r'^\d{1,3}(,\d{3})*$')

numbers = ['42', '1,234', '6,368,745', '12,34,567', '1234']

for number in numbers:
    match = pattern.match(number)
    if match:
        print(f"{number} is a valid number.")
    else:
        print(f"{number} is not a valid number.")


42 is a valid number.
1,234 is a valid number.
6,368,745 is a valid number.
12,34,567 is not a valid number.
1234 is not a valid number.


<b> Explanation of the pattern:</b>
- ^ and $ are anchors to match the start and end of the string, ensuring that the entire string is matched.
- \d{1,3} matches one to three digits at the beginning of the string.
- (,\d{3})* matches zero or more occurrences of a comma followed by exactly three digits.
- Combining the above parts, the pattern matches a number with commas for every three digits.
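
An alternative sketch, not the approach used in the cell above, drops the anchors and relies on the pattern object's fullmatch() method instead. It also uses a non-capturing group (?:...) since the captured text is not needed. One practical difference is that $ also matches just before a trailing newline, whereas fullmatch() does not accept one.

```python
import re

comma_number = re.compile(r'\d{1,3}(?:,\d{3})*')

candidates = ['42', '1,234', '6,368,745', '12,34,567', '1234']
valid = [c for c in candidates if comma_number.fullmatch(c)]

# "$" tolerates a single trailing newline; fullmatch() does not.
dollar_newline = bool(re.match(r'^\d{1,3}(,\d{3})*$', '42\n'))
full_newline = bool(comma_number.fullmatch('42\n'))

print(valid)           # ['42', '1,234', '6,368,745']
print(dollar_newline)  # True
print(full_newline)    # False
```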

#### 21. How would you write a regex that matches the full name of someone whose last name is Watanabe? You can assume that the first name that comes before it will always be one word that begins with a capital letter. The regex must match the following: 'Haruto Watanabe'; 'Alice Watanabe'; 'RoboCop Watanabe'
<b>but not the following: 'haruto Watanabe' (where the first name is not capitalized); 'Mr. Watanabe' (where the preceding word has a nonletter character); 'Watanabe' (which has no first name); 'Haruto watanabe' (where Watanabe is not capitalized)</b>

To match the full name of someone whose last name is "Watanabe" with the conditions specified, you can use the following regular expression pattern:  
>^[A-Z][a-zA-Z]*\sWatanabe$

In [8]:
import re

pattern = re.compile(r'^[A-Z][a-zA-Z]*\sWatanabe$')

names = ['Haruto Watanabe', 'Alice Watanabe', 'RoboCop Watanabe', 'haruto Watanabe', 'Mr. Watanabe', 'Watanabe', 'Haruto watanabe']

for name in names:
    match = pattern.match(name)
    if match:
        print(f"{name} is a valid name.")
    else:
        print(f"{name} is not a valid name.")


Haruto Watanabe is a valid name.
Alice Watanabe is a valid name.
RoboCop Watanabe is a valid name.
haruto Watanabe is not a valid name.
Mr. Watanabe is not a valid name.
Watanabe is not a valid name.
Haruto watanabe is not a valid name.


<b>Explanation of the pattern:</b>
- ^ and $ are anchors to match the start and end of the string, ensuring that the entire string is matched.
- [A-Z] matches an uppercase letter at the beginning of the string (first name's first letter).
- [a-zA-Z]* matches zero or more lowercase or uppercase letters for the rest of the first name.
- \s matches a whitespace character.
- Watanabe matches the last name exactly as "Watanabe".

#### 22. How would you write a regex that matches a sentence where the first word is either Alice, Bob, or Carol; the second word is either eats, pets, or throws; the third word is apples, cats, or baseballs; and the sentence ends with a period? This regex should be case-insensitive. It must match the following:

<b>'Alice eats apples.'
'Bob pets cats.'
'Carol throws baseballs.'
'Alice throws Apples.'
'BOB EATS CATS.'
but not the following:
'RoboCop eats apples.'
'ALICE THROWS FOOTBALLS.'
'Carol eats 7 cats.'</b>

To match a sentence with specific patterns as described, you can use the following regular expression with the re.IGNORECASE flag:
>^(Alice|Bob|Carol)\s(eats|pets|throws)\s(apples|cats|baseballs)\.$


In [9]:
import re

pattern = re.compile(r'^(Alice|Bob|Carol)\s(eats|pets|throws)\s(apples|cats|baseballs)\.$', re.IGNORECASE)

sentences = ['Alice eats apples.', 'Bob pets cats.', 'Carol throws baseballs.', 'Alice throws Apples.', 'BOB EATS CATS.',
             'RoboCop eats apples.', 'ALICE THROWS FOOTBALLS.', 'Carol eats 7 cats.']

for sentence in sentences:
    match = pattern.match(sentence)
    if match:
        print(f"{sentence} is a valid sentence.")
    else:
        print(f"{sentence} is not a valid sentence.")


Alice eats apples. is a valid sentence.
Bob pets cats. is a valid sentence.
Carol throws baseballs. is a valid sentence.
Alice throws Apples. is a valid sentence.
BOB EATS CATS. is a valid sentence.
RoboCop eats apples. is not a valid sentence.
ALICE THROWS FOOTBALLS. is not a valid sentence.
Carol eats 7 cats. is not a valid sentence.


<b>Explanation of the pattern:</b>
- ^ and $ are anchors to match the start and end of the string, ensuring that the entire string is matched.
- (Alice|Bob|Carol) matches either "Alice", "Bob", or "Carol" as the first word, using a group with the | alternation operator.
- \s matches a whitespace character between the words.
- (eats|pets|throws) matches either "eats", "pets", or "throws" as the second word.
- (apples|cats|baseballs) matches either "apples", "cats", or "baseballs" as the third word.
- \. matches a period at the end of the sentence.

In this example, the sentences are tested against the regular expression pattern. The sentences that match the specified format (following the given words and ending with a period) are considered valid, while sentences that don't meet the conditions are considered invalid. The re.IGNORECASE flag is used to make the pattern case-insensitive, allowing matches regardless of the letter case.