
# **Google IT Automation with Python Specialization**

## **Course 2: Using Python to Interact with the Operating System**

### **Module 3: Regular Expressions**

#### **Theme 2: Basic Regular Expressions**

##### **Repetition Qualifiers**

**Course 2:** Using Python to Interact with the Operating System. |
**Module 3:** Regular Expressions | **Theme:** Basic Regular Expressions

---
**Sub-theme:** Repetition Qualifiers

Repetition qualifiers in regex let a pattern match a character or group several times.
The `*` qualifier matches zero or more occurrences of the preceding element, so `.*` matches any character repeated zero or more times. It is greedy: it matches as much as possible. For example, `p.*n` matches from the first "p" to the last "n" in a string.

The `+` qualifier matches one or more occurrences of the preceding character, while `?` matches zero or one occurrence. For example, `o+l+` matches one or more o's followed by one or more l's, and `p?each` matches both "each" and "peach".

These qualifiers help build more complex patterns for tasks such as finding the longest word in a string or hostnames in log files.
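
As a rough illustration of the "longest word" use case, the sketch below combines `+` with a character class. It uses `re.findall`, which these notes have not introduced yet, and the sample sentence is made up for the example.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sentence = "Regular expressions reward deliberate practice"

# [a-zA-Z]+ matches each run of one or more letters, i.e. each word.
words = re.findall(r"[a-zA-Z]+", sentence)

# Pick the longest run of letters found.
longest = max(words, key=len)

print(words)
print(longest)
```

Treating a word as a run of ASCII letters is an assumption; text with digits, apostrophes, or accented letters would need a different character class.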

In [None]:
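import re
print(re.search(r"Py.*n", "Pygmalion"))               # matches 'Pygmalion'
print(re.search(r"Py.*n", "Python Programming"))      # greedy: matches 'Python Programmin'
print(re.search(r"Py[a-z]*n", "Python Programming"))  # letters only: matches 'Python'
print(re.search(r"Py[a-z]*n", "Pyn"))                 # zero repetitions: matches 'Pyn'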

In [None]:
import re
print(re.search(r"o+l+", "goldfish"))  # matches 'ol'
print(re.search(r"o+l+", "woolly"))    # matches 'ooll'
print(re.search(r"o+l+", "boil"))      # None: the 'i' separates 'o' from 'l'

In [None]:
import re
print(re.search(r"p?each", "To each their own"))  # matches 'each'
print(re.search(r"p?each", "I like peaches"))     # matches 'peach'

**Greedy vs. Non-Greedy in Regular Expressions**

**Greedy:** matches as much as possible.
Example: `a.*a`
→ Captures from the first a to the last a.

**Non-Greedy:** matches as little as possible.
Example: `a.*?a`
→ Captures from the first a to the next closest a.
___

Explaining the expression `.*?`:

`.` means any character (except newlines).

`*` means zero or more repetitions of the preceding item (in this case, any character).

`?` makes the `*` non-greedy, so it matches the smallest possible number of characters between the two letters.

→ So `.*?` means: "the fewest possible characters of any kind."
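
The difference is easiest to see by running both forms on the same text. This is a minimal sketch; the sample word is chosen only for illustration.

```python
import re

text = "abracadabra"

# Greedy: .* grabs as much as it can, so the match ends at the last "a".
greedy = re.search(r"a.*a", text)

# Non-greedy: .*? grabs as little as it can, so the match ends at the next "a".
lazy = re.search(r"a.*?a", text)

print(greedy.group())  # abracadabra
print(lazy.group())    # abra
```

Both patterns start at the first "a"; only the end of the match changes.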

##### **Escaping Characters**


---
**Sub-theme:** Escaping Characters


Escaping special characters in regex requires a backslash (`\`). For example, to match a literal dot rather than any character, use `\.`, as in `\.com`, which matches ".com" specifically.

Python uses raw strings (`r""`) to avoid confusion with string escape sequences like `\n` or `\t`: in a raw string, backslashes are not interpreted when Python builds the string, so they reach the regex engine unchanged.
Regular expressions also provide special character sequences:

* \w matches alphanumeric characters (letters, numbers, underscores)
* \d matches digits
* \s matches whitespace (space, tab, newline)
* \b matches word boundaries
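
The sequences listed above can be tried out on a short string. The sketch below uses `re.findall` and `re.split` alongside `re.search`; the sample log line is invented for the example.

```python
import re

# Hypothetical log line, used only to exercise the special sequences.
line = "Backup 42 finished at 03:15 on host web01"

digits = re.findall(r"\d+", line)        # every run of digits
first_word = re.search(r"\w+", line)     # first run of word characters
fields = re.split(r"\s+", "a  b\tc\nd")  # split on any run of whitespace

# \b matches only at a word boundary, so "cat" inside a longer word is skipped.
inside = re.search(r"\bcat\b", "concatenate")
standalone = re.search(r"\bcat\b", "the cat sat")

print(digits)              # ['42', '03', '15', '01']
print(first_word.group())  # Backup
print(fields)              # ['a', 'b', 'c', 'd']
print(inside)              # None
print(standalone.span())   # (4, 7)
```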

For regex testing and analysis, regex101.com is a helpful resource.

In [None]:
import re
print(re.search(r".com", "welcome"))
print(re.search(r"\.com", "welcome"))
print(re.search(r"\.com", "mydomain.com"))

<re.Match object; span=(2, 6), match='lcom'>
None
<re.Match object; span=(8, 12), match='.com'>


In [None]:
import re
print(re.search(r"\w*", "This is an example"))
print(re.search(r"\w*", "And_this_is_another"))

<re.Match object; span=(0, 4), match='This'>
<re.Match object; span=(0, 19), match='And_this_is_another'>


##### **Regular Expressions in Action**


---
**Sub-theme:** Regular Expressions in Action


Regular expressions can be combined to create powerful pattern matching. To match countries that start with "A" and end with "a", use `^A.*a$`, where `^` marks the beginning and `$` marks the end of the string.

To validate Python variable names (which start with a letter or underscore and contain only letters, numbers, or underscores), build the pattern in steps:

1. Start with `^[a-zA-Z_]` to match the first character
2. Add `[a-zA-Z0-9_]*$` to match the remaining characters
3. The complete pattern is `^[a-zA-Z_][a-zA-Z0-9_]*$`

This pattern correctly validates variable names like "my_var1" while rejecting invalid ones like "123var" (starts with number) or "my var" (contains space).

Practice with regex builds comfort with this powerful tool for text processing.

In [None]:
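import re
print(re.search(r"A.*a", "Argentina"))     # matches 'Argentina'
print(re.search(r"A.*a", "Azerbaijan"))    # matches 'Azerbaija' (stops at the last 'a')
print(re.search(r"^A.*a$", "Azerbaijan"))  # None: the string ends with 'n'
print(re.search(r"^A.*a$", "Argentina"))   # matches 'Argentina'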

In [None]:
import re
pattern = r"^[a-zA-Z_][a-zA-Z0-9_]*$"
print(re.search(pattern, "_this_is_a_valid_variable_name"))  # match
print(re.search(pattern, "this isn't a valid variable"))     # None: spaces and apostrophe
print(re.search(pattern, "my_variable1"))                    # match
print(re.search(pattern, "2my_variable1"))                   # None: starts with a digit

Explanation of the regex `r"^[A-Z][a-z\s]+[.?!]$"`:

`^`: Start of string

`[A-Z]`: First character must be an uppercase letter

`[a-z\s]+`: One or more lowercase letters or whitespace characters

`[.?!]`: Ends with a period, question mark, or exclamation mark

`$`: End of string
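
To see how the pieces of that pattern behave together, here is a small sketch. The helper name `looks_like_sentence` and the sample strings are made up for this example.

```python
import re

def looks_like_sentence(text):
    # Hypothetical helper: uppercase first letter, then lowercase letters or
    # whitespace, then exactly one closing punctuation mark at the end.
    pattern = r"^[A-Z][a-z\s]+[.?!]$"
    return re.search(pattern, text) is not None

print(looks_like_sentence("Is this is a sentence?"))  # True
print(looks_like_sentence("is this is a sentence?"))  # False: lowercase start
print(looks_like_sentence("Hello"))                   # False: no end punctuation
print(looks_like_sentence("Welcome to Python."))      # False: uppercase "P" mid-sentence
```

The last case shows how strict the pattern is: any uppercase letter, digit, or comma after the first character makes the match fail.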

##### **Practice of the Theme: Basic Regular Expressions**


---
**Sub-theme:** Practice


Question 1
The check_web_address() function checks if the text passed qualifies as a top-level web address, meaning that it contains alphanumeric characters (which includes letters, numbers, and underscores), as well as periods, dashes, and a plus sign, followed by a period and a character-only top-level domain such as ".com", ".info", ".edu", etc. Fill in the regular expression to do that, using escape characters, wildcards, repetition qualifiers, beginning and end-of-line characters, and character classes.

In [None]:
import re
def check_web_address(text):
  # Word characters, dots, dashes, or plus signs, then a dot and a letters-only TLD
  pattern = r"^[\w\.\-\+]+\.([a-zA-Z]+)$"
  result = re.search(pattern, text)
  return result is not None

print(check_web_address("gmail.com")) # True
print(check_web_address("www@google")) # False
print(check_web_address("www.Coursera.org")) # True
print(check_web_address("web-address.com/homepage")) # False
print(check_web_address("My_Favorite-Blog.US")) # True


Question 2
The check_time() function checks for the time format of a 12-hour clock, as follows: the hour is between 1 and 12, with no leading zero, followed by a colon, then minutes between 00 and 59, then an optional space, and then AM or PM, in upper or lower case. Fill in the regular expression to do that. How many of the concepts that you just learned can you use here?

In [None]:
import re
def check_time(text):
  # Hour 1-12 without a leading zero, minutes 00-59, optional space, then AM/PM
  pattern = r"^(1[0-2]|[1-9]):[0-5][0-9]\s?(am|pm|AM|PM)$"
  result = re.search(pattern, text)
  return result is not None

print(check_time("12:45pm")) # True
print(check_time("9:59 AM")) # True
print(check_time("6:60am")) # False
print(check_time("five o'clock")) # False
print(check_time("6:02 am")) # True
print(check_time("6:02km")) # False


Question 3
The contains_acronym() function checks the text for the presence of 2 or more characters or digits surrounded by parentheses, with at least the first character in uppercase (if it's a letter), returning True if the condition is met, or False otherwise. For example, "Instant messaging (IM) is a set of communication technologies used for text-based communication" should return True since (IM) satisfies the match conditions. Fill in the regular expression in this function:

In [None]:
import re
def contains_acronym(text):
  # First character is an uppercase letter or a digit, so "(4GL)" also matches
  pattern = r"\(([A-Z0-9][a-zA-Z0-9]+)\)"
  result = re.search(pattern, text)
  return result is not None

print(contains_acronym("Instant messaging (IM) is a set of communication technologies used for text-based communication")) # True
print(contains_acronym("American Standard Code for Information Interchange (ASCII) is a character encoding standard for electronic communication")) # True
print(contains_acronym("Please do NOT enter without permission!")) # False
print(contains_acronym("PostScript is a fourth-generation programming language (4GL)")) # True
print(contains_acronym("Have fun using a self-contained underwater breathing apparatus (Scuba)!")) # True


What does the r before the pattern string in re.search(r"Py.*n", sample.txt) indicate?

It marks the pattern as a raw string, so Python passes the backslashes to the regex engine unchanged.
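
A short sketch of why the `r` prefix matters, using `\b` as the example: in a normal string literal Python turns `\b` into a backspace character before the regex engine ever sees it. The sample text is made up.

```python
import re

plain = "\bcat\b"   # normal string: each \b becomes a single backspace character
raw = r"\bcat\b"    # raw string: the backslashes are kept as written

print(len(plain), len(raw))  # 5 7

# The plain pattern looks for literal backspace characters, so it finds nothing.
no_match = re.search(plain, "the cat sat")
# The raw pattern reaches the regex engine as \bcat\b (word boundaries).
match = re.search(raw, "the cat sat")

print(no_match)       # None
print(match.group())  # cat
```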

What does the plus character [+] do in regex?

Matches one or more occurrences of the character before it.



An intern implemented a zip code checker, but it works only with five-digit zip codes. Your task is to update the checker so that it includes all nine digits of the zip code; the leading five digits and the optional four after the hyphen. The zip code needs to be preceded by at least one space, and cannot be at the start of the text. Update the regular expression.

In [None]:
import re

def check_zip_code(text):
  # Whitespace, five digits, then an optional hyphen followed by four more digits
  result = re.search(r"\s\d{5}(-\d{4})?", text)
  return result is not None

# Call the check_zip_code function with test cases
print(check_zip_code("The zip codes for New York are 10001 thru 11104."))  # True
print(check_zip_code("90210 is a TV show"))  # False (no space before 90210)
print(check_zip_code("Their address is: 123 Main Street, Anytown, AZ 85258-0001."))  # True
print(check_zip_code("The Parliament of Canada is at 111 Wellington St, Ottawa, ON K1A0A9."))  # False
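
The zip code pattern relies on counted repetition, `{n}`, which repeats the preceding element exactly n times, and on a parenthesized group made optional with `?`. The sketch below inspects what the pattern actually captures; the sample strings are shortened versions of the test cases, and the `{2,3}` range form is added only for comparison.

```python
import re

pattern = r"\s\d{5}(-\d{4})?"

five_digit = re.search(pattern, "New York 10001 thru 11104")
zip_plus_four = re.search(pattern, "Anytown, AZ 85258-0001.")

# group() is the whole match, including the leading whitespace character.
print(repr(five_digit.group()))     # ' 10001'
print(five_digit.group(1))          # None: the optional group did not take part
print(repr(zip_plus_four.group()))  # ' 85258-0001'
print(zip_plus_four.group(1))       # -0001

# {n,m} gives a range: between two and three "a" characters, greedily.
ranged = re.search(r"a{2,3}", "caaaat")
print(ranged.group())               # aaa
```

Note that the pattern puts no constraint on what follows the digits, so it would also match the first five digits of a longer number.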
