<a href="https://colab.research.google.com/github/Pakpako95/GooglexPython/blob/main/Google_IT_Automation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **🐍 Google IT Automation with Python Specialization**

## 🟥 **Course 2: Using Python to Interact with the Operating System**

### 🟢 **Module 3: Regular Expressions**

#### ◽**Theme 2: Basic Regular Expressions**

##### 💠**Simple Matching in Python**

In Python, the `re` module provides functions for working with regular expressions. Prefix a pattern with `r` to create a raw string, which prevents Python from interpreting backslashes as string escape sequences.

The search function checks if a pattern exists in a string, returning a match object with information like position (span) and the matching substring.

If no match is found, `search` returns `None`. Special regex characters include `^` to match the beginning of a line and `.` to match any character. Additional options like `re.IGNORECASE` can modify matching behavior.
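
As a small sketch of what the match object exposes, the snippet below reads the position and the matched text from a successful search, and shows the `None` result of a failed one. The variable names (`span`, `text`, `missing`) are only illustrative.

```python
import re

# Search for "aza" anywhere in the string.
result = re.search(r"aza", "plaza")

# A successful search returns a match object.
if result is not None:
    span = result.span()    # (start, end) indices of the match
    text = result.group()   # the matched substring
    print(span, text)       # (2, 5) aza

# A failed search returns None, so check before using the result.
missing = re.search(r"aza", "maze")
print(missing)              # None
```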

In [None]:
import re
result = re.search(r"aza", "plaza")
print(result)

In [None]:
import re
result = re.search(r"aza", "bazaar")
print(result)

In [None]:
import re
result = re.search(r"aza", "maze")
print(result)

print(re.search(r"^x", "xenon"))

In [None]:
import re
print(re.search(r"p.ng", "penguin"))

In [None]:
import re
print(re.search(r"p.ng", "clapping"))
print(re.search(r"p.ng", "sponge"))

In [None]:
import re
print(re.search(r"p.ng", "Pangaea", re.IGNORECASE))

##### 💠**Wildcards and Character Classes**

Character classes in regex allow matching specific character groups using square brackets.

To match Python with either uppercase or lowercase p, use `[Pp]ython`. Ranges are defined with dashes: `[a-z]` for lowercase letters, `[A-Z]` for uppercase, `[0-9]` for digits.

The `^` symbol inside brackets negates the class, matching anything not in the specified set. For example, `[^a-zA-Z]` matches characters that aren't letters.

The pipe symbol `|` creates alternatives, matching either expression. For instance, `cat|dog` matches either "cat" or "dog".

While `search` finds only the first match, `findall` returns all matches in a string.

In [None]:
import re
print(re.search(r"[Pp]ython", "Python"))

In [None]:
import re
print(re.search(r"[a-z]way", "The end of the highway"))
print(re.search(r"[a-z]way", "What a way to go"))
print(re.search("cloud[a-zA-Z0-9]", "cloudy"))
print(re.search("cloud[a-zA-Z0-9]", "cloud9"))

In [None]:
import re
print(re.search(r"[^a-zA-Z]", "This is a sentence with spaces."))
print(re.search(r"[^a-zA-Z ]", "This is a sentence with spaces."))

print(re.search(r"cat|dog", "I like cats."))
print(re.search(r"cat|dog", "I love dogs!"))
print(re.search(r"cat|dog", "I like both dogs and cats."))
print(re.findall(r"cat|dog", "I like both dogs and cats."))

##### 💠**Repetition Qualifiers**

Repetition qualifiers in regex allow matching characters multiple times.

The `*` qualifier matches the preceding item zero or more times, so `.*` matches any run of characters. It behaves greedily, matching as much as possible. For example, `p.*n` matches from the first "p" to the last "n" in a string.

The `+` qualifier matches one or more occurrences of the preceding character, while `?` matches zero or one occurrence. For example, `o+l+` matches one or more o's followed by one or more l's, and `p?each` matches both "each" and "peach".

These qualifiers help build more complex patterns for finding specific text patterns like the longest word in a string or hostnames in log files.

In [None]:
import re
print(re.search(r"Py.*n", "Pygmalion"))
print(re.search(r"Py.*n", "Python Programming"))
print(re.search(r"Py.*?n", "Python Programming"))
print(re.search(r"Py[a-z]*n", "Python Programming"))
print(re.search(r"Py[a-z]*n", "Pyn"))

In [None]:
import re
print(re.search(r"o+l+", "goldfish"))
print(re.search(r"o+l+", "woolly"))
print(re.search(r"o+l+", "boil"))

In [None]:
import re
print(re.search(r"p?each", "To each their own"))
print(re.search(r"p?each", "I like peaches"))

**Greedy vs Non-Greedy in Regular Expressions**

**Greedy:** Matches as much as possible.
Example: `a.*a`
→ Captures from the first a to the last a.

**Non-Greedy:** Matches as little as possible.
Example: `a.*?a`
→ Captures from the first a to the next closest a.

Explaining the expression `.*?`

`.` means any character (except newlines).

`*` means zero or more repetitions of the preceding item (in this case, any character).

`?` makes the `*` non-greedy, so it matches the smallest possible number of characters between the two letters.

→ So `.*?` means: "the fewest possible characters of any kind."
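
To make the difference concrete, here is a minimal sketch that runs the greedy and the non-greedy pattern against the same string. The sample word "banana" is an assumption chosen only because it contains three a's.

```python
import re

text = "banana"

# Greedy: from the first "a" to the last "a".
greedy = re.search(r"a.*a", text)

# Non-greedy: from the first "a" to the next closest "a".
lazy = re.search(r"a.*?a", text)

print(greedy.group())  # anana
print(lazy.group())    # ana
```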

##### 💠**Escaping Characters**

Escaping special characters in regex requires a backslash (`\`). For example, to match a literal dot rather than any character, use `\.`, as in `\.com`, which matches ".com" specifically.

Python uses raw strings (`r""`) to avoid confusion with string escape sequences like `\n` or `\t`, because backslashes in a raw string are not interpreted when the string is created.

Python provides special character sequences:

* `\w` matches alphanumeric characters (letters, numbers, underscores)
* `\d` matches digits
* `\s` matches whitespace (space, tab, newline)
* `\b` matches word boundaries
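
The sequences listed above can be tried out with a short sketch like the following. The sample strings are made up for illustration.

```python
import re

line = "Order 66 shipped to dock_7"

digits = re.findall(r"\d+", line)   # runs of digits: ['66', '7']
words = re.findall(r"\w+", line)    # runs of letters, digits, and underscores
parts = re.split(r"\s+", line)      # pieces separated by whitespace

# \b only matches at the edge of a word, so "cat" inside a longer word is skipped.
inside = re.search(r"\bcat\b", "concatenate")   # None
alone = re.search(r"\bcat\b", "the cat sat")    # match with span (4, 7)

print(digits, words, parts, inside, alone)
```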

For regex testing and analysis, regex101.com is a helpful resource.

In [None]:
import re
print(re.search(r".com", "welcome"))
print(re.search(r"\.com", "welcome"))
print(re.search(r"\.com", "mydomain.com"))

<re.Match object; span=(2, 6), match='lcom'>
None
<re.Match object; span=(8, 12), match='.com'>


In [None]:
import re
print(re.search(r"\w*", "This is an example"))
print(re.search(r"\w*", "And_this_is_another"))

<re.Match object; span=(0, 4), match='This'>
<re.Match object; span=(0, 19), match='And_this_is_another'>


##### 💠**Regular Expressions in Action**

Regular expressions can be combined to create powerful pattern matching. To match countries that start with "A" and end with "a", use `^A.*a$`, where `^` marks the beginning and `$` marks the end of the string.

To validate Python variable names (starting with a letter or underscore, then containing only letters, numbers, or underscores), build the pattern in steps:

1. `^[a-zA-Z_]` matches the first character
2. `[a-zA-Z0-9_]*$` matches the remaining characters
3. The complete pattern is `^[a-zA-Z_][a-zA-Z0-9_]*$`

This pattern correctly validates variable names like "my_var1" while rejecting invalid ones like "123var" (starts with number) or "my var" (contains space).

Practice with regex builds comfort with this powerful tool for text processing.

In [None]:
import re
print(re.search(r"A.*a", "Argentina"))
print(re.search(r"A.*a", "Azerbaijan"))
print(re.search(r"^A.*a$", "Azerbaijan"))
print(re.search(r"^A.*a$", "Argentina"))

<re.Match object; span=(0, 9), match='Argentina'>
<re.Match object; span=(0, 9), match='Azerbaija'>
None
<re.Match object; span=(0, 9), match='Argentina'>


In [None]:
import re
pattern = r"^[a-zA-Z_][a-zA-Z0-9_]*$"
print(re.search(pattern, "_this_is_a_valid_variable_name"))
print(re.search(pattern, "this isn't a valid variable"))
print(re.search(pattern, "my_variable1"))
print(re.search(pattern, "2my_variable1"))

Explanation of the regex `r"^[A-Z][a-z\s]+[.?!]$"`:

`^` — Start of string

`[A-Z]` — First character must be an uppercase letter

`[a-z\s]+` — At least one lowercase letter or space

`[.?!]` — Ends with a period, question mark, or exclamation mark

`$` — End of string
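
A minimal sketch of this pattern in use is shown below. The function name `check_sentence` and the sample inputs are assumptions for illustration; only the pattern itself comes from the explanation above.

```python
import re

def check_sentence(text):
    """Return True if text is an uppercase letter, then lowercase letters
    or whitespace, then a single closing punctuation mark."""
    result = re.search(r"^[A-Z][a-z\s]+[.?!]$", text)
    return result is not None

print(check_sentence("Is this is a sentence?"))  # True
print(check_sentence("is this is a sentence?"))  # False (lowercase start)
print(check_sentence("Hello"))                   # False (no end punctuation)
print(check_sentence("1-2-3-GO!"))               # False (starts with a digit)
print(check_sentence("A star is born."))         # True
```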

##### 💠**Practice of the Theme: Basic Regular Expressions**

Question 1
The check_web_address() function checks if the text passed qualifies as a top-level web address, meaning that it contains alphanumeric characters (which includes letters, numbers, and underscores), as well as periods, dashes, and a plus sign, followed by a period and a character-only top-level domain such as ".com", ".info", ".edu", etc. Fill in the regular expression to do that, using escape characters, wildcards, repetition qualifiers, beginning and end-of-line characters, and character classes.

In [None]:
import re
def check_web_address(text):
  pattern = r"^[\w\.\-\+]+\.([a-zA-Z]+)$"
  result = re.search(pattern, text)
  return result is not None

print(check_web_address("gmail.com")) # True
print(check_web_address("www@google")) # False
print(check_web_address("www.Coursera.org")) # True
print(check_web_address("web-address.com/homepage")) # False
print(check_web_address("My_Favorite-Blog.US")) # True


Question 2
The check_time() function checks for the time format of a 12-hour clock, as follows: the hour is between 1 and 12, with no leading zero, followed by a colon, then minutes between 00 and 59, then an optional space, and then AM or PM, in upper or lower case. Fill in the regular expression to do that. How many of the concepts that you just learned can you use here?

In [None]:
import re
def check_time(text):
  pattern = r"^(1[0-2]|[1-9]):[0-5][0-9]\s?(am|pm|AM|PM)$"
  result = re.search(pattern, text)
  return result is not None

print(check_time("12:45pm")) # True
print(check_time("9:59 AM")) # True
print(check_time("6:60am")) # False
print(check_time("five o'clock")) # False
print(check_time("6:02 am")) # True
print(check_time("6:02km")) # False


Question 3
The contains_acronym() function checks the text for the presence of 2 or more letters or digits surrounded by parentheses, with at least the first character in uppercase (if it's a letter), returning True if the condition is met, or False otherwise. For example, "Instant messaging (IM) is a set of communication technologies used for text-based communication" should return True since (IM) satisfies the match conditions. Fill in the regular expression in this function:

In [None]:
import re
def contains_acronym(text):
  pattern = r"\(([A-Z0-9][a-zA-Z0-9]+)\)"  # first character may be an uppercase letter or a digit, as in "(4GL)"
  result = re.search(pattern, text)
  return result is not None

print(contains_acronym("Instant messaging (IM) is a set of communication technologies used for text-based communication")) # True
print(contains_acronym("American Standard Code for Information Interchange (ASCII) is a character encoding standard for electronic communication")) # True
print(contains_acronym("Please do NOT enter without permission!")) # False
print(contains_acronym("PostScript is a fourth-generation programming language (4GL)")) # True
print(contains_acronym("Have fun using a self-contained underwater breathing apparatus (Scuba)!")) # True


What does the r before the pattern string in re.search(r"Py.*n", sample.txt) indicate?

Raw Strings

What does the plus character [+] do in regex?

Matches one or more occurrences of the character before it.



An intern implemented a zip code checker, but it works only with five-digit zip codes. Your task is to update the checker so that it includes all nine digits of the zip code; the leading five digits and the optional four after the hyphen. The zip code needs to be preceded by at least one space, and cannot be at the start of the text. Update the regular expression.

In [None]:
import re

def correct_function(text):
  result = re.search(r"\s\d{5}(-\d{4})?", text)  # Corrected regex pattern with space
  return result is not None

def check_zip_code(text):
  return correct_function(text)  # Call the correct_function

# Call the check_zip_code function with test cases
print(check_zip_code("The zip codes for New York are 10001 thru 11104."))  # True
print(check_zip_code("90210 is a TV show"))  # False (no space before 90210)
print(check_zip_code("Their address is: 123 Main Street, Anytown, AZ 85258-0001."))  # True
print(check_zip_code("The Parliament of Canada is at 111 Wellington St, Ottawa, ON K1A0A9."))  # False
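
One limitation of the pattern above is that it also accepts the first five digits of a longer number. The sketch below is a hypothetical stricter variant (the name `check_zip_code_strict` and the extra test sentence are assumptions, not part of the exercise) that adds a trailing `\b` so the match must end at a word boundary.

```python
import re

def check_zip_code_strict(text):
    """Hypothetical variant: the zip code must end at a word boundary."""
    # \b after the optional four-digit suffix rejects digit runs longer than a zip code.
    pattern = r"\s\d{5}(-\d{4})?\b"
    return re.search(pattern, text) is not None

print(check_zip_code_strict("The zip codes for New York are 10001 thru 11104."))  # True
print(check_zip_code_strict("90210 is a TV show"))                                # False
print(check_zip_code_strict("Their address is: 123 Main Street, Anytown, AZ 85258-0001."))  # True
print(check_zip_code_strict("Order number 123456 shipped"))                       # False
```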


#### ◽**Theme 3: Advanced Regular Expressions**

##### 💠**Capturing Groups**

Capturing groups in regex allow extracting matched portions for further processing. Created by enclosing patterns in parentheses, they store matched text that can be accessed using the `groups()` method or index notation.

When working with names in "lastname, firstname" format, use `(\w+), (\w+)` to capture both parts separately. The complete match is accessed at index 0, while captured groups start at index 1.

To handle more complex names with spaces, dots, or dashes, expand the character class: `([a-zA-Z .-]+), ([a-zA-Z .-]+)`

This pattern can be implemented in a rearrange_name function that returns "firstname lastname" when the pattern matches, or the original string if no match is found.

In [None]:
import re
result = re.search(r"^(\w*), (\w*)$", "Lovelace, Ada")
print(result)
print(result.groups())
print(result[0])
print(result[1])
print(result[2])
"{} {}".format(result[2], result[1])

<re.Match object; span=(0, 13), match='Lovelace, Ada'>
('Lovelace', 'Ada')
Lovelace, Ada
Lovelace
Ada


'Ada Lovelace'

In [None]:
import re
def rearrange_name(name):
    result = re.search(r"^(\w*), (\w*)$", name)
    if result is None:
        return name
    return "{} {}".format(result[2], result[1])
rearrange_name("Lovelace, Ada")

'Ada Lovelace'

In [None]:
import re
def rearrange_name(name):
    result = re.search(r"^(\w*), (\w*)$", name)
    if result is None:
        return name
    return "{} {}".format(result[2], result[1])
rearrange_name("Ritchie, Dennis")

'Dennis Ritchie'

In [None]:
import re
def rearrange_name(name):
    result = re.search(r"^([\w \.-]*), ([\w \.-]*)$", name)
    if result is None:
        return name
    return "{} {}".format(result[2], result[1])
rearrange_name("Hopper, Grace M.")

'Grace M. Hopper'

##### 💠**More on Repetition Qualifiers**

Numeric repetition qualifiers in regex allow matching patterns a specific number of times using curly brackets:

* `{n}`: Exactly n repetitions
* `{n,m}`: Between n and m repetitions
* `{n,}`: At least n repetitions
* `{,m}`: Up to m repetitions (from zero)

For example, `[a-zA-Z]{5}` matches exactly 5 letters. To match complete words of exactly 5 letters, use `\b[a-zA-Z]{5}\b` where `\b` marks word boundaries.

The findall function returns all matches rather than just the first one. For instance, `[a-zA-Z0-9]{5,10}` finds all alphanumeric sequences between 5-10 characters long, while `s[a-zA-Z0-9]{,20}` matches "s" followed by up to 20 alphanumeric characters.

In [None]:
import re
print(re.search(r"[a-zA-Z]{5}", "a ghost"))

<re.Match object; span=(2, 7), match='ghost'>


In [None]:
import re
print(re.search(r"[a-zA-Z]{5}", "a scary ghost appeared"))

<re.Match object; span=(2, 7), match='scary'>


In [None]:
import re
print(re.findall(r"[a-zA-Z]{5}", "a scary ghost appeared"))

['scary', 'ghost', 'appea']


In [None]:
import re
re.findall(r"\b[a-zA-Z]{5}\b", "A scary ghost appeared")

['scary', 'ghost']

In [None]:
import re
print(re.findall(r"\w{5,10}", "I really like strawberries"))

['really', 'strawberri']


In [None]:
import re
print(re.findall(r"\w{5,}", "I really like strawberries"))

['really', 'strawberries']


In [None]:
import re
print(re.search(r"s\w{,20}", "I really like strawberries"))

<re.Match object; span=(14, 26), match='strawberries'>


**Reflection:**

In [None]:
import re

def long_words(text):
  pattern = r"\b\w{7,}\b"
  result = re.findall(pattern, text)
  return result

# Test cases
print(long_words("I like to drink coffee in the morning."))  # ['morning']
print(long_words("I also have a taste for hot chocolate in the afternoon."))  # ['chocolate', 'afternoon']
print(long_words("I never drink tea late at night."))  # []

['morning']
['chocolate', 'afternoon']
[]


##### 💠**Extracting a PID Using regexes in Python**

The process ID extraction example uses capturing groups to get numbers between square brackets:

`\[(\d+)\]` matches a pattern where:

* `\[` is an escaped opening square bracket
* `(\d+)` is a capturing group matching one or more digits
* `\]` is an escaped closing square bracket

To safely extract PIDs from log lines, the extract_pid function:

1. Searches for the pattern in the string
2. Checks if the result exists (not None)
3. Returns the first capturing group if found
4. Returns an empty string if no match exists

This approach prevents errors when processing lines without PIDs. For example, given "[12345]" it returns "12345", while for strings without this pattern it returns an empty string.

In [None]:
import re
log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
regex = r"\[(\d+)\]"
result = re.search(regex, log)
print(result[1])

12345


In [None]:
import re
log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
regex = r"\[(\d+)\]"
result = re.search(regex, log)
result = re.search(regex, "A completely different string that also has numbers [34567]")
print(result[1])

34567


In [None]:
import re
log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
regex = r"\[(\d+)\]"
result = re.search(regex, log)
result = re.search(regex, "A completely different string that also has numbers [34567]")
result = re.search(regex, "99 elephants in a [cage]")
print(result[1])
#Note that this print command results in an error as shown in the video.

TypeError: 'NoneType' object is not subscriptable

In [None]:
import re
log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
regex = r"\[(\d+)\]"
result = re.search(regex, log)
result = re.search(regex, "A completely different string that also has numbers [34567]")
result = re.search(regex, "99 elephants in a [cage]")
def extract_pid(log_line):
    regex = r"\[(\d+)\]"
    result = re.search(regex, log_line)
    if result is None:
        return ""
    return result[1]
print(extract_pid(log))

12345


In [None]:
import re
log = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
regex = r"\[(\d+)\]"
result = re.search(regex, log)
result = re.search(regex, "A completely different string that also has numbers [34567]")
result = re.search(regex, "99 elephants in a [cage]")
def extract_pid(log_line):
    regex = r"\[(\d+)\]"
    result = re.search(regex, log_line)
    if result is None:
        return ""
    return result[1]
print(extract_pid("99 elephants in a [cage]"))




**Reflection:** Add to the regular expression used in the extract_pid function, to return the uppercase message in parentheses, after the process id.

In [None]:
import re
def extract_pid(log_line):
    regex = r"\[(\d+)\]:\s+([A-Z]+)"
    result = re.search(regex, log_line)
    if result is None:
        return None
    return "{} ({})".format(result.group(1), result.group(2))

print(extract_pid("July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade")) # 12345 (ERROR)
print(extract_pid("99 elephants in a [cage]")) # None
print(extract_pid("A string that also has numbers [34567] but no uppercase message")) # None
print(extract_pid("July 31 08:08:08 mycomputer new_process[67890]: RUNNING Performing backup")) # 67890 (RUNNING)

12345 (ERROR)
None
None
67890 (RUNNING)


In [None]:
# Regular Expression Explanation:
    # r"\[(\d+)\]:\s+([A-Z]+)"
    #
    # \[         => Match a literal '[' character. The backslash escapes the special meaning of '['.
    # (\d+)      => Match one or more digits and capture them in group 1 (this is the PID).
    # \]         => Match a literal ']' character.
    # :          => Match a literal colon ':'.
    # \s+        => Match one or more whitespace characters (space, tab, etc.).
    # ([A-Z]+)   => Match one or more uppercase letters and capture them in group 2 (the status message).

##### 💠**Splitting and Replacing**

* **The `split()` Function**

Divides text using regex patterns as separators.

Example: `re.split(r"[.?!]", text)` splits text into sentences.

Wrapping the pattern in capturing parentheses, as in `re.split(r"([.?!])", text)`, keeps the separators in the result list.

* **The `sub()` Function**

Substitutes matching patterns with replacement text.

For anonymizing emails: `re.sub(r"[\w.%+-]+@[\w.-]+", "[REDACTED]", text)`

Captured groups can be referenced in replacements using `\1`, `\2`, etc.

Example for name rearrangement: `re.sub(r"([a-zA-Z \.-]+), ([a-zA-Z \.-]+)", r"\2 \1", name)`

These functions expand regex capabilities for text processing, making them powerful tools for data manipulation despite their complexity.

In [None]:
import re
re.split(r"[.?!]", "One sentence. Another one? And the last one!")

['One sentence', ' Another one', ' And the last one', '']

In [None]:
import re
re.split(r"([.?!])", "One sentence. Another one? And the last one!")

['One sentence', '.', ' Another one', '?', ' And the last one', '!', '']

In [None]:
import re
re.sub(r"[\w.%+-]+@[\w.-]+", "[REDACTED]", "Received an email for go_nuts95@my.example.com")

'Received an email for [REDACTED]'

In [None]:
import re
re.sub(r"^([\w .-]*), ([\w .-]*)$", r"\2 \1", "Lovelace, Ada")

'Ada Lovelace'

In [None]:
import re
re.split(r"the|a", "One sentence. Another one? And the last one!")

['One sentence. Ano', 'r one? And ', ' l', 'st one!']

##### 💠**Practice**

**Advanced regex techniques expand pattern matching capabilities:**

Alterations (more commonly called alternation) use the pipe symbol (`|`) to match any one of multiple options:

`r"location.*(London|Berlin|Madrid)"` matches `"location"` followed by any of the three cities

Position anchors limit where matches can occur:

* `^` at the beginning matches only at the start of the string
* `$` at the end matches only at the end of the string
* Example: `r"^My name is (\w+)"` only matches if the string begins with `"My name is"`

**Character ranges match single characters from defined sets:**

* `r"[A-Z]"` matches any uppercase letter
* `r"[0-9$-,.]"` matches digits or specific symbols (inside the brackets, `$-,` is itself a range covering `$` through `,`)
* Combined with quantities: `r"([0-9]{3}-[0-9]{3}-[0-9]{4})"` matches phone numbers like "888-123-7612"

Backreferences in substitutions refer to captured groups:

`re.sub(r"([A-Z])\.\s+(\w+)", r"Ms. \2", text)` changes "A. Weber" to "Ms. Weber"

Lookahead matches patterns only when followed by another pattern:

`r"(Test\d)-(?=Passed)"` matches Test numbers only if followed by "Passed"
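
The techniques summarized above can be tried together in one short sketch. The sample strings are invented for illustration, and the dot in the backreference example is escaped so that it matches a literal period.

```python
import re

# Alternation: "location" followed by any one of the three cities.
city = re.search(r"location.*(London|Berlin|Madrid)", "The location is Berlin")

# Anchor: ^ only allows a match at the start of the string.
name = re.search(r"^My name is (\w+)", "My name is Ada")
not_at_start = re.search(r"^My name is (\w+)", "Hello, My name is Ada")

# Character ranges with quantities: a phone number like 888-123-7612.
phone = re.search(r"[0-9]{3}-[0-9]{3}-[0-9]{4}", "Call 888-123-7612 today")

# Backreference: \2 reuses the second captured group in the replacement.
renamed = re.sub(r"([A-Z])\.\s+(\w+)", r"Ms. \2", "A. Weber")

# Lookahead: capture the test name only when "Passed" follows the dash.
tests = re.findall(r"(Test\d)-(?=Passed)", "Test1-Passed, Test2-Failed, Test3-Passed")

print(city[1])         # Berlin
print(name[1])         # Ada
print(not_at_start)    # None
print(phone.group())   # 888-123-7612
print(renamed)         # Ms. Weber
print(tests)           # ['Test1', 'Test3']
```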

Question 1
You’re working with a CSV file that contains employee information. Each record has a name field, followed by a phone number field, and a role field. The phone number field contains U.S. phone numbers and needs to be modified to the international format, with +1- in front of the phone number. The rest of the phone number should not change. Fill in the regular expression, using groups, to use the transform_record() function to do that.

In [None]:
import re
def transform_record(record):
  new_record = re.sub(r"(\d{3}-\d{3}-\d{4}|\d{3}-\d{7})", r"+1-\1", record)
  return new_record

print(transform_record("Sabrina Green,802-867-5309,System Administrator"))
# Sabrina Green,+1-802-867-5309,System Administrator

print(transform_record("Eli Jones,684-3481127,IT specialist"))
# Eli Jones,+1-684-3481127,IT specialist

print(transform_record("Melody Daniels,846-687-7436,Programmer"))
# Melody Daniels,+1-846-687-7436,Programmer

print(transform_record("Charlie Rivera,698-746-3357,Web Developer"))
# Charlie Rivera,+1-698-746-3357,Web Developer

The multi_vowel_words() function returns all words with 3 or more consecutive vowels (a, e, i, o, u). Fill in the regular expression to do that.

In [None]:
import re
def multi_vowel_words(text):
  pattern = r'\b\w*[aeiou]{3,}\w*\b'
  result = re.findall(pattern, text)
  return result

print(multi_vowel_words("Life is beautiful"))
# ['beautiful']

print(multi_vowel_words("Obviously, the queen is courageous and gracious."))
# ['Obviously', 'queen', 'courageous', 'gracious']

print(multi_vowel_words("The rambunctious children had to sit quietly and await their delicious dinner."))
# ['rambunctious', 'quietly', 'delicious']

print(multi_vowel_words("The order of a data queue is First In First Out (FIFO)"))
# ['queue']

print(multi_vowel_words("Hello world!"))
# []

When capturing regex groups, what datatype does the groups method return?

A tuple
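
A quick way to confirm this is to inspect the return value directly. The sketch below reuses the "Lovelace, Ada" pattern from the capturing groups section; the variable names are illustrative.

```python
import re

result = re.search(r"^(\w*), (\w*)$", "Lovelace, Ada")
groups = result.groups()

print(type(groups))   # <class 'tuple'>
print(groups)         # ('Lovelace', 'Ada')

# Because it is a tuple, it can be unpacked like any other tuple.
last, first = groups
print(first, last)    # Ada Lovelace
```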

The transform_comments() function converts comments in a Python script into those usable by a C compiler. This means looking for text that begins with a hash mark (#) and replacing it with double slashes (//), which is the C single-line comment indicator. For the purpose of this exercise, we'll ignore the possibility of a hash mark embedded inside of a Python command, and assume that it's only used to indicate a comment. We also want to treat repetitive hash marks (##), (###), etc., as a single comment indicator, to be replaced with just (//) and not (#//) or (//#). Fill in the parameters of the substitution method to complete this function:

In [None]:
import re
def transform_comments(line_of_code):
  result = re.sub(r'#+', '//', line_of_code)
  return result

print(transform_comments("### Start of program"))
# Should be "// Start of program"
print(transform_comments("  number = 0   ## Initialize the variable"))
# Should be "  number = 0   // Initialize the variable"
print(transform_comments("  number += 1   # Increment the variable"))
# Should be "  number += 1   // Increment the variable"
print(transform_comments("  return(number)"))
# Should be "  return(number)"

The convert_phone_number() function checks for a U.S. phone number format: XXX-XXX-XXXX (3 digits followed by a dash, 3 more digits followed by a dash, and 4 digits), and converts it to a more formal format that looks like this: (XXX) XXX-XXXX. Fill in the regular expression to complete this function.

In [None]:
import re
def convert_phone_number(phone):
  result = re.sub(r'(\b\d{3})-(\d{3})-(\d{4}\b)', r'(\1) \2-\3', phone)
  return result

print(convert_phone_number("My number is 212-345-9999.")) # My number is (212) 345-9999.
print(convert_phone_number("Please call 888-555-1234")) # Please call (888) 555-1234
print(convert_phone_number("123-123-12345")) # 123-123-12345
print(convert_phone_number("Phone number of Buckingham Palace is +44 303 123 7300"))
# Phone number of Buckingham Palace is +44 303 123 7300

#### ◽**Theme 4: Module Review**

##### 💠**Glossary terms from course 2, module 3**

**Terms and definitions from course 2, module 3**

**Alteration:** RegEx that matches any one of the alternatives separated by the pipe symbol

**Backreference:** This is applied when using re.sub( ) to substitute the value of a capture group into the output

**Character classes:** These are written inside square brackets and let us list the characters we want to match inside of those brackets

**Character ranges:** Ranges used to match a single character against a set of possibilities

**grep:** An especially easy to use yet extremely powerful tool for applying RegExes

**Lookahead:** RegEx that matches a pattern only if it’s followed by another pattern

**Regular expression:** A search query for text that's expressed by string pattern, also known as RegEx or RegExp

**Wildcard:** A character that can match more than one character

##### 💠**Qwiklabs assessment: Work with regular expressions**

**Introduction:**
It's time to put your new skills to the test! In this lab, you'll have to find the users using an old email domain in a big list using regular expressions.

**What you'll do**
* Replacing the old domain name (abc.edu) with a new domain name (xyz.edu).

* Storing all domain names, including the updated ones, in a new file.

Please note that there is a graded quiz that follows this lab. You must complete the lab before attempting the quiz. The quiz will assess your comprehension of the key concepts and procedures covered in the lab.

Here's what you can do to prepare:

* Pay close attention to the instructions and explanations provided during the lab session.

* Actively participate in the lab activities and take notes throughout.

* Review your lab notes before taking the quiz.

**Pro tip:**
You are allowed to refer to your completed lab notes during the quiz.

The data is in the `data` directory: run `cd data`, then `ls` to see `user_emails.csv`, and `cat user_emails.csv` to view it. The Python script is in `~/scripts`: run `cd ~/scripts`, then `ls` to see `script.py`. Use this file to find, with regex, all the instances of the old domain (`abc.edu`) in `user_emails.csv` and replace them with the new domain (`xyz.edu`). The file already contains the functions; complete them to make the script work.

Update the permissions with `sudo chmod 777 script.py`, then open the file with `nano script.py`. The script imports `csv` and `re`.

**IDENTIFY THE OLD DOMAIN**
```
def contains_domain(address, domain):
  domain_pattern = r'[\w\.-]+@'+domain+'$'
  if re.match(domain_pattern, address):
    return True
  return False
```
**REPLACE THE DOMAIN NAME**
```
def replace_domain(address, old_domain, new_domain):
  old_domain_pattern = r'' + old_domain + '$'
  address = re.sub(old_domain_pattern, new_domain, address)
  return address
```
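
One caveat with building a pattern by string concatenation: the dot in `abc.edu` is a regex wildcard, so the pattern also accepts any other character in that position. The sketch below is a hypothetical variant (the name `contains_domain_escaped` is an assumption, not part of the lab script) that passes the domain through `re.escape()` first.

```python
import re

def contains_domain_escaped(address, domain):
    """Hypothetical variant of contains_domain that treats the domain literally."""
    # re.escape() turns "abc.edu" into a pattern where the dot matches only a dot.
    domain_pattern = r"[\w\.-]+@" + re.escape(domain) + "$"
    return re.match(domain_pattern, address) is not None

print(contains_domain_escaped("blossom@abc.edu", "abc.edu"))  # True
print(contains_domain_escaped("blossom@abcxedu", "abc.edu"))  # False
```

Without the escape, the second address would be accepted, because the unescaped `.` matches the "x".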
**Write a CSV file with replaced domain from main**
```
#!/usr/bin/env python3

import re
import csv

def contains_domain(address, domain):
    """Returns True if the email address contains the given domain in the domain position, False if not."""
    domain_pattern = r'[\w\.-]+@' + domain + '$'
    if re.match(domain_pattern, address):
        return True
    return False

def replace_domain(address, old_domain, new_domain):
    """Replaces the old domain with the new domain in the received address."""
    old_domain_pattern = r'' + old_domain + '$'
    address = re.sub(old_domain_pattern, new_domain, address)
    return address

def main():
    """Processes the list of emails, replacing any instances of the old domain with the new domain."""
    old_domain, new_domain = 'abc.edu', 'xyz.edu'
    csv_file_location = '/home/student/data/user_emails.csv'
    report_file = '/home/student/data/updated_user_emails.csv'

    user_email_list = []
    old_domain_email_list = []
    new_domain_email_list = []

    with open(csv_file_location, 'r') as f:
        user_data_list = list(csv.reader(f))
        user_email_list = [data[1].strip() for data in user_data_list[1:]]

        for email_address in user_email_list:
            if contains_domain(email_address, old_domain):
                old_domain_email_list.append(email_address)
                replaced_email = replace_domain(email_address, old_domain, new_domain)
                new_domain_email_list.append(replaced_email)

        email_index = user_data_list[0].index('Email Address')

        for user in user_data_list[1:]:
            for old_email, new_email in zip(old_domain_email_list, new_domain_email_list):
                if user[email_index].strip() == old_email:
                    user[email_index] = new_email

    with open(report_file, 'w+') as output_file:
        writer = csv.writer(output_file)
        writer.writerows(user_data_list)

main()

```
Run `./script.py`
`ls ~/data`
`cat ~/data/updated_user_emails.csv`

Expected result:

```
Full Name, Email Address
Blossom Gill, blossom@xyz.edu
Hayes Delgado, nonummy@utnisia.com
Petra Jones, ac@xyz.edu
Oleg Noel, noel@liberomauris.ca
Ahmed Miller, ahmed.miller@nequenonquam.co.uk
Macaulay Douglas, mdouglas@xyz.edu
Aurora Grant, enim.non@xyz.edu
Madison Mcintosh, mcintosh@nisiaenean.net
Montana Powell, montanap@semmagna.org
Rogan Robinson, rr.robinson@xyz.edu
Simon Rivera, sri@xyz.edu
Benedict Pacheco, bpacheco@xyz.edu
Maisie Hendrix, mai.hendrix@xyz.edu
Xaviera Gould, xlg@utnisia.net
Oren Rollins, oren@semmagna.com
Flavia Santiago, flavia@utnisia.net
Jackson Owens, jackowens@xyz.edu
Britanni Humphrey, britanni@ut.net
Kirk Nixon, kirknixon@xyz.edu
Bree Campbell, breee@utnisia.net

```



##### 💠**Module 3 challenge: Work with Regular Expressions**

✅ Question 1

What is a regular expression?

Answer: A sequence of characters that forms a search pattern

✅ Question 2

What is a characteristic of a CSV file?

Answer: Data in each row is separated by a special character

✅ Question 3

Complete the function to extract full URLs.

You are reading an article that contains website URLs in the format:

https://www.website-domain.com

You'd like to extract the complete URLs from the text automatically, instead of copying and pasting them by hand. Complete the function find_url() to extract all encrypted websites that begin with https:// and end with any top-level domain, such as .org, .com, or .co, from the text.

In [None]:
import re

def find_url(website):
    # "https://", a host of letters, digits, dots, or hyphens, then a dot and a TLD of 2+ letters
    pattern = r"https://[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
    result = re.findall(pattern, website)
    return result


print(find_url("Go to the website https://www.coursera.com find more information about Google Certificate Programs. Then, visit https://www.python.org/ to learn more about Python. ")) # Should return ['https://www.coursera.com', 'https://www.python.org']
print(find_url("You can find anything on https://www.google.com!")) # Should return ['https://www.google.com']
print(find_url("You can find anything on http://www.google.com!")) # Should return []
print(find_url("Check out python.org!")) # Should return []


✅ Question 4

Split sentences keeping punctuation with the word.

You are exploring the punctuation at the end of sentences and want to split sentences so that each word is separate and any punctuation is included in the word next to it. Complete the parse_sentences() function to accomplish this task.

In [None]:
import re

def parse_sentences(sentence):
    # A word with an optional trailing !, ?, or . attached, or any other single non-space symbol
    pattern = r'\w+[!?.]?|[^\w\s]'
    result = re.findall(pattern, sentence)
    return result

print(parse_sentences("Hello! How are you doing?")) # should return ['Hello!', 'How', 'are', 'you', 'doing?']
print(parse_sentences("what a beautiful day it is")) # should return ['what', 'a', 'beautiful', 'day', 'it', 'is']
print(parse_sentences("2 + 2 is definitely 4!")) # should return ['2', '+', '2', 'is', 'definitely', '4!']


✅ Question 5

Extract employee IDs. A company uses unique 9- or 10-character codes as employee ID numbers. Each code begins with a capital letter, followed by a hyphen (-), followed by 7 or 8 digits, in the format:

A-1234567 or A-12345678

Project reports submitted to the company include the employee’s ID number and a summary of the work they completed on the project. A data analyst wants to pull all of the employee ID numbers out of these projects. Complete the find_eid() function to extract these employee ID numbers from the reports.

In [None]:
import re

def find_eid(report):
    # One capital letter, a hyphen, then 7 or 8 digits; \b stops shorter or longer codes from matching
    pattern = r'\b[A-Z]-\d{7,8}\b'
    result = re.findall(pattern, report)
    return result


print(find_eid("Employees B-1234567 and C-12345678 worked with products X-123456 and Z-123456789")) # Should return ['B-1234567', 'C-12345678']
print(find_eid("Employees B-1234567 and C-12345678, not employees b-1234567 and c-12345678")) #Should return ['B-1234567', 'C-12345678']


✅ Question 6

Role of replace_domain() in the lab.

Answer: To replace the old domain with the new domain in an email address



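In [None]:
import re

def replace_domain(address, old_domain, new_domain):
    # re.escape() makes the dots in old_domain match literally; $ anchors the match to the end
    old_domain_pattern = re.escape(old_domain) + r'$'
    address = re.sub(old_domain_pattern, new_domain, address)
    return address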
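
One detail worth noting when a pattern is built from a plain string such as old_domain: a `.` is a regex metacharacter that matches any character. The sketch below compares a concatenated pattern with one passed through `re.escape()`. The function names, the domain `abc.edu`, and the lookalike address are assumptions made for this example.

```python
import re

def replace_domain_unescaped(address, old_domain, new_domain):
    # Pattern built by plain concatenation: "." in old_domain matches any character.
    return re.sub(old_domain + r"$", new_domain, address)

def replace_domain_escaped(address, old_domain, new_domain):
    # re.escape() makes every character of old_domain match literally.
    return re.sub(re.escape(old_domain) + r"$", new_domain, address)

# Hypothetical address whose domain only resembles "abc.edu" ("x" instead of ".").
lookalike = "user@abcxedu"
print(replace_domain_unescaped(lookalike, "abc.edu", "xyz.edu"))  # domain gets replaced
print(replace_domain_escaped(lookalike, "abc.edu", "xyz.edu"))    # address is left unchanged
```

For real addresses on the old domain both versions behave the same; the escaped version simply avoids accidental matches.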

✅ Question 7

Purpose of old_domain_email_list.

Answer: To store email addresses with the old domain that match the regex pattern

✅ Question 8

Why write the updated list to an output file?

Answer: To save the changes made to the email addresses and create a new CSV file with updated data

✅ Question 9

Why replace domains and generate a new file?

Answer: To enhance data security measures

✅ Question 10

How contains_domain() and replace_domain() work together.

Answer: contains_domain() uses a regular expression to identify emails with a specific domain, and replace_domain() replaces these domains with new ones.
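
A minimal sketch of how the two functions might cooperate is shown below. The body of contains_domain() here is an assumption written for illustration, not the lab's exact code, and the addresses and the old domain `abc.edu` are made up.

```python
import re

def contains_domain(address, domain):
    """Return True if address ends with @domain (illustrative sketch)."""
    pattern = r"[\w.-]+@" + re.escape(domain) + r"$"
    return re.match(pattern, address) is not None

def replace_domain(address, old_domain, new_domain):
    """Swap old_domain for new_domain at the end of address."""
    return re.sub(re.escape(old_domain) + r"$", new_domain, address)

# Made-up addresses; only those on the old domain should change.
emails = ["blossom@abc.edu", "nonummy@utnisia.com", "ac@abc.edu"]
updated = [
    replace_domain(email, "abc.edu", "xyz.edu") if contains_domain(email, "abc.edu") else email
    for email in emails
]
print(updated)
```

The check comes first so that addresses on other domains pass through untouched.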

### 🟢 **Module 4: Managing Data and Processes**


#### ◽**Theme 1: Data Streams**

##### 💠**Reading data interactively**

Python's `input()` function allows interactive user input during script execution. It always returns data as a string, so conversion to other types is necessary when needed.

In [2]:
#!/usr/bin/env python3
name = input("Please enter your name: ")
print("Hello, " + name)

Please enter your name: Paco
Hello, Paco


When collecting numeric input, convert the string to the appropriate type:

In [3]:
def to_seconds(hours, minutes, seconds):
    return hours * 3600 + minutes * 60 + seconds

print("Welcome to this time converter")

cont = "y"
while cont.lower() == "y":
    hours = int(input("Enter the number of hours: "))
    minutes = int(input("Enter the number of minutes: "))
    seconds = int(input("Enter the number of seconds: "))

    print("That's {} seconds".format(to_seconds(hours, minutes, seconds)))
    print()
    cont = input("Do you want to do another conversion? [y to continue] ")

print("Goodbye!")

Welcome to this time converter
Enter the number of hours: 2
Enter the number of minutes: 30
Enter the number of seconds: 0
That's 9000 seconds

Do you want to do another conversion? [y to continue] n
Goodbye!


While interactive input isn't always the best solution for automation tasks, it's valuable for scenarios requiring user-specific information that can't be predetermined.
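
Because `input()` always returns a string, a non-numeric entry makes `int()` raise a `ValueError`. One possible way to guard against that is sketched below. The helper names `parse_non_negative_int` and `ask_for_int` are hypothetical, and rejecting negative numbers is an assumption that suits the time converter.

```python
def parse_non_negative_int(text):
    """Convert user-typed text to a non-negative int, or return None if it is invalid."""
    try:
        value = int(text.strip())
    except ValueError:
        return None
    return value if value >= 0 else None

def ask_for_int(prompt):
    """Keep prompting until the user types a valid non-negative whole number."""
    while True:
        value = parse_non_negative_int(input(prompt))
        if value is not None:
            return value
        print("Please enter a whole number such as 0, 5, or 30.")

print(parse_non_negative_int("30"))
print(parse_non_negative_int("thirty"))
```

In the converter, `ask_for_int("Enter the number of hours: ")` could then stand in for `int(input(...))`.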

##### 💠**Standard Streams**

I/O streams are pathways for programs to receive input and send output. Python uses three default streams:

* STDIN (Standard Input): Channel for receiving input data, typically from keyboard. Used when calling the input() function.
* STDOUT (Standard Output): Channel for sending normal program output, usually displayed on screen. Used when calling the print() function.
* STDERR (Standard Error): Channel specifically for error messages and diagnostics. Python error messages like TypeError appear here.

In [6]:
#!/usr/bin/env python3

data = input("This will come from STDIN: ")
print("Now we write it to STDOUT: " + data)

This will come from STDIN: Paco
Now we write it to STDOUT: Paco


These streams aren't exclusive to Python - all system commands use them. For instance, when using commands like cat or ls, their output goes to STDOUT and their errors to STDERR, though both typically display on screen.
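
To see that STDOUT and STDERR really are separate channels, the sketch below writes one message to each and captures them in memory. The `report` helper and the messages are made up for this example.

```python
import contextlib
import io
import sys

def report(message, error=False):
    """Write normal output to STDOUT and error messages to STDERR."""
    stream = sys.stderr if error else sys.stdout
    print(message, file=stream)

# Capture both streams in memory to show that they do not mix.
captured_out, captured_err = io.StringIO(), io.StringIO()
with contextlib.redirect_stdout(captured_out), contextlib.redirect_stderr(captured_err):
    report("backup finished")
    report("disk almost full", error=True)

print("STDOUT got:", captured_out.getvalue().strip())
print("STDERR got:", captured_err.getvalue().strip())
```

From a shell, the same separation lets a command such as `./script.py 2> errors.log` send only the error stream to a file (the script and file names here are placeholders).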

##### 💠**Environment variables**

The shell is a command-line interface that executes commands on Linux systems. Common shells include Bash (the most common), Zsh, and Fish. Environment variables are values set in the shell environment that programs can access. They store information such as:

* PATH: Lists directories the shell searches for executable files
* HOME: User's home directory location
* SHELL: Current shell being used

To view environment variables:

* Use the `env` command to see all variables
* Use `echo $VARIABLE_NAME` to see a specific variable

In [10]:
!echo $PATH

/opt/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin

In Python, access environment variables using the os.environ dictionary from the os module:

In [7]:
#!/usr/bin/env python3
import os
print("HOME: " + os.environ.get("HOME", ""))
print("SHELL: " + os.environ.get("SHELL", ""))
print("FRUIT: " + os.environ.get("FRUIT", ""))

HOME: /root
SHELL: /bin/bash
FRUIT: 


Because `get()` is called with a default value (here an empty string), it returns that default when the variable doesn't exist instead of raising a `KeyError`, as seen with FRUIT above.
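
The difference between the two ways of reading `os.environ` can be sketched as follows. The variable name `DEMO_UNSET_VARIABLE` is made up and is removed first so that it is certain to be missing.

```python
import os

# "DEMO_UNSET_VARIABLE" is a hypothetical name used only for this demonstration.
name = "DEMO_UNSET_VARIABLE"
os.environ.pop(name, None)  # make sure it really is unset

# get() falls back to the default passed as its second argument.
fallback = os.environ.get(name, "")
print(repr(fallback))

# Square-bracket access raises KeyError for a missing variable.
try:
    os.environ[name]
    raised = False
except KeyError:
    raised = True
print("KeyError raised:", raised)
```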

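To set an environment variable for a Python script, run `export FRUIT=Strawberry` in the shell before launching the script. In a notebook, each `!` command runs in its own temporary subshell, so the `export` below does not persist into later cells; assigning to `os.environ` inside Python does: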

In [30]:
!export FRUIT=Strawberry

In [31]:
#!/usr/bin/env python3
import os
os.environ["FRUIT"] = "Strawberry"

print("HOME: " + os.environ.get("HOME", ""))
print("SHELL: " + os.environ.get("SHELL", ""))
print("FRUIT: " + os.environ.get("FRUIT", ""))

HOME: /root
SHELL: /bin/bash
FRUIT: Strawberry
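
A value assigned through `os.environ` is also passed on to programs the script starts. The sketch below checks this by launching a second Python process with `subprocess.run`. Using `sys.executable` as the child program is a choice made for this example.

```python
import os
import subprocess
import sys

os.environ["FRUIT"] = "Strawberry"

# Launch a separate Python process; by default it inherits a copy of this environment.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('FRUIT', ''))"],
    capture_output=True,
    text=True,
    check=True,
)
child_fruit = child.stdout.strip()
print("Child process sees FRUIT =", child_fruit)
```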

