**Miguel Ramirez**

**Week 2 Assignment**

**Professor Cohen**

<u>Data Acquisition and Management</u>


# Difficulties of Using Regular Expressions in Python (with Examples and Output)
While regular expressions are a powerful tool for pattern matching in Python, they also present certain challenges:
### 1. Complexity and Readability:
Example: A complex regular expression to validate email addresses might be difficult to understand and maintain:

In [38]:
import re

email_regex = r"^[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9]+$"

email_address = "john.doe@example.com"
if re.match(email_regex, email_address):
    print("Valid email address")
else:
    print("Invalid email address")

Valid email address


Solution: Break down complex expressions into smaller, more readable parts:

In [40]:
import re

username_regex = r"^[a-zA-Z0-9._-]+"
domain_regex = r"[a-zA-Z0-9._-]+\.[a-zA-Z0-9]+"
email_regex = rf"{username_regex}@{domain_regex}$"

email_address = "john.doe@example.com"
if re.match(email_regex, email_address):
    print("Valid email address")
else:
    print("Invalid email address")


Valid email address
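
Another way to keep a long pattern readable is Python's `re.VERBOSE` flag, which lets the pattern span several lines with inline comments. The sketch below applies it to the same simplified email pattern; the names `EMAIL_PATTERN` and `is_valid_email` are made up for this illustration, and the pattern is still only a rough check, not a complete email validator.

```python
import re

# Same simplified pattern as above, written with re.VERBOSE so that
# whitespace is ignored and each part can carry its own comment.
EMAIL_PATTERN = re.compile(
    r"""
    ^[a-zA-Z0-9._-]+     # user name: letters, digits, dot, underscore, hyphen
    @                    # literal at sign
    [a-zA-Z0-9._-]+      # domain name
    \.[a-zA-Z0-9]+$      # final dot followed by the last domain part
    """,
    re.VERBOSE,
)


def is_valid_email(address):
    """Return True if the address fits the simplified pattern (hypothetical helper)."""
    return EMAIL_PATTERN.match(address) is not None


for candidate in ["john.doe@example.com", "john.doe", "john@example"]:
    print(candidate, is_valid_email(candidate))
```

Only the first address has both an at sign and a dotted domain, so it is the only one the pattern accepts.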


### 2. Performance Overhead:
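Example: A regular expression whose quantifiers can split the same text in many different ways may backtrack excessively and slow matching down considerably. The pattern below stacks three `*` quantifiers; on a short input like "abc" it finishes instantly, but because every part is optional it also matches the empty string, so it reports a match for any text: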

In [45]:
import re

pattern = r"a*b*c*"
text = "abc"

match = re.search(pattern, text)
if match:
    print("Match found")
else:
    print("No match found")

Match found


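Solution: Write the pattern so that each part of the text can be matched in only one way. The version below requires at least one of each letter; be aware that the lazy `+?` quantifiers change what the pattern accepts (it no longer matches the empty string) and that laziness alone does not prevent backtracking: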

In [46]:
import re

pattern = r"a+?b+?c+?"
text = "abc"

match = re.search(pattern, text)
if match:
    print("Match found")
else:
    print("No match found")

Match found
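
The patterns above are too small to show a real slowdown. A commonly cited case of heavy backtracking is a nested quantifier such as `(a+)+b`, which accepts the same strings as `a+b` but can split a run of a's in many ways before giving up. The sketch below compares the two on an input that cannot match; the names `NESTED`, `FLAT`, and `time_match` are chosen for this illustration, and the measured times will vary by machine, so none are quoted here.

```python
import re
import time

NESTED = re.compile(r"(a+)+b")  # nested quantifiers: many ways to split a run of a's
FLAT = re.compile(r"a+b")       # accepts the same strings, but only one way to match


def time_match(pattern, text):
    """Return the match result and the elapsed seconds (hypothetical helper)."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


# A run of a's with no trailing "b": both patterns must fail, but the
# nested one tries many groupings of the a's before it can give up.
text = "a" * 18

for name, pattern in (("nested", NESTED), ("flat", FLAT)):
    result, seconds = time_match(pattern, text)
    print(f"{name}: matched={result is not None}, seconds={seconds:.6f}")
```

The input is kept short on purpose: the work done by the nested pattern grows quickly with each extra "a", so a longer string can make the cell appear to hang.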


### 3. Ambiguity and Edge Cases:
Example: Compare the results of the following two pieces of code and note the potential pitfall in the second one.
Pattern: r"\d+" matches one or more digits. In the text "001423", it matches the entire run of digits, including the leading zeros:

In [50]:
import re

pattern = r"\d+"
text = "001423"

match = re.search(pattern, text)
if match:
    print("Match found:", match.group())
else:
    print("No match found")


Match found: 001423


### Code 2 (Second Example):
Solution: Use anchors or a negative lookahead to handle specific cases. Pattern: r"^(?!0)\d+$" matches a string made up entirely of digits (\d+), but not if the string starts with a 0: the (?!0) right after ^ is a negative lookahead asserting that the first character is not a zero.
Text: The string "00123" starts with a zero, so the pattern does not match it:

In [52]:
import re

pattern = r"^(?!0)\d+$"
text = "00123"

match = re.search(pattern, text)
if match:
    print("Match found:", match.group())
else:
    print("No match found")

No match found
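
To separate the effect of the anchors from the effect of the lookahead, the following sketch runs three patterns over a few sample strings. The intermediate pattern `^\d+$` and the helper `describe` are additions for this illustration and are not part of the two examples above.

```python
import re

unanchored = re.compile(r"\d+")             # any run of digits, anywhere in the text
anchored = re.compile(r"^\d+$")             # the whole string must be digits
no_leading_zero = re.compile(r"^(?!0)\d+$")  # all digits, and the first is not 0


def describe(text):
    """Report which of the three patterns find a match in the text (hypothetical helper)."""
    return {
        "unanchored": bool(unanchored.search(text)),
        "anchored": bool(anchored.search(text)),
        "no_leading_zero": bool(no_leading_zero.search(text)),
    }


for sample in ["00123", "123", "0", "12a"]:
    print(repr(sample), describe(sample))
```

The string "12a" shows what the anchors add: the unanchored pattern still finds "12" inside it, while both anchored patterns reject it. The strings "00123" and "0" show what the lookahead adds: they pass the plain anchored pattern but fail once a leading zero is disallowed.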


## Error Explanation in Code 2:
The potential "error" is not a bug in the code but a mismatch between the pattern and the data. The pattern r"^(?!0)\d+$" is designed to reject any string that starts with a zero. In this example, the string "00123" starts with 0, so the negative lookahead (?!0) fails and the regular expression correctly finds no match.

### Summary of Differences:
First Code (r"\d+"):

- This pattern matches any sequence of digits, including those starting with 0.
- It finds a match in the string "001423" because the pattern allows leading zeros.

Second Code (r"^(?!0)\d+$"):

- This pattern matches only strings made up entirely of digits that do not start with 0.
- In the case of "00123", no match is found because the string starts with a 0.

### Solution:
If you want to accept only digit strings that do not start with 0, the second pattern is correct; note that it also rejects the single digit "0". If you want to match all sequences of digits, including those with leading zeros, the first pattern (r"\d+") is more appropriate.
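
If a lone "0" should count as valid while longer numbers with leading zeros should not, one possible variant is an alternation checked with `re.fullmatch`. This is only a sketch under that assumption; the name `is_plain_integer` is hypothetical and the rule itself is not part of the assignment.

```python
import re

# Either a single "0", or a non-zero digit followed by any further digits.
# re.fullmatch requires the whole string to match, so no ^ or $ is needed.
PLAIN_INTEGER = re.compile(r"0|[1-9]\d*")


def is_plain_integer(text):
    """Return True for "0" or for digit strings without a leading zero (hypothetical helper)."""
    return PLAIN_INTEGER.fullmatch(text) is not None


for sample in ["0", "123", "00123", "10", ""]:
    print(repr(sample), is_plain_integer(sample))
```

Compared with the lookahead version, this states the allowed forms directly, which can be easier to read when the rule has more than one case.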