
# **ICE-2: Regular Expressions**

Python includes a built-in module called `re` that provides regular expression matching operations (see the [official module documentation](https://docs.python.org/3/library/re.html)). Once the module is imported, you can use its functions to match or search text with regular expressions.

In [None]:
import re

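def apply_regex(data, pattern):
  """Print whether each string in data matches pattern in its entirety."""
  for text in data:
    # re.fullmatch succeeds only if the whole string matches the pattern
    if re.fullmatch(pattern, text):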
      print(f"Test string {text} accepted.")
    else:
      print(f"Test string {text} failed!")
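
The helper above relies on `re.fullmatch()`, which succeeds only when the entire string matches the pattern. As a minimal sketch (the `pattern` name and the sample strings are illustrative and not part of the exercise), the following contrasts it with `re.match()` and `re.search()`:

```python
import re

pattern = r'[01]+'  # illustrative pattern: one or more binary digits

# re.fullmatch: the whole string must match
print(bool(re.fullmatch(pattern, "1012")))  # False: the trailing '2' is not allowed
# re.match: only the beginning of the string must match
print(bool(re.match(pattern, "1012")))      # True: '101' matches at the start
# re.search: a match anywhere in the string is enough
print(bool(re.search(pattern, "ab10")))     # True: '10' is found inside the string
print(bool(re.match(pattern, "ab10")))      # False: the string does not start with 0 or 1
```

For accept/reject tests like the ones in this notebook, `re.fullmatch()` is the natural choice because a partial match should count as a failure.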

Let's write a simple regular expression for matching binary strings.

In [None]:
# find binary strings
test_strings = ["0", "1", "dog", "hello, world", "123", "00", "10101010111"]
binary_pattern = r'[0-1]+'
apply_regex(test_strings, binary_pattern)

Test string 0 accepted.
Test string 1 accepted.
Test string dog failed!
Test string hello, world failed!
Test string 123 failed!
Test string 00 accepted.
Test string 10101010111 accepted.


Now, how about matching 24-bit hexadecimal color codes?

In [None]:
# find 24-bit hexadecimal color codes
test_strings = ["#F0F8FF", "#FFF", "#00FFFFF", "#2980BD", "#FAEBD7"]
hexcode_pattern = r'#[0-9A-F]{6}'  # 24 bits = exactly six hexadecimal digits
apply_regex(test_strings, hexcode_pattern)

Test string #F0F8FF accepted.
Test string #FFF failed!
Test string #00FFFFF failed!
Test string #2980BD accepted.
Test string #FAEBD7 accepted.
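
A 24-bit color code has exactly six hexadecimal digits, so the choice of quantifier matters. The sketch below compares a bounded quantifier `{4,6}` with an exact quantifier `{6}`; the names `loose` and `strict` and the sample codes are hypothetical and chosen only for illustration.

```python
import re

loose = r'#[0-9A-F]{4,6}'   # between four and six hex digits
strict = r'#[0-9A-F]{6}'    # exactly six hex digits (24 bits)

for code in ["#FFFF", "#FFFFF", "#FFFFFF"]:
    # Print the code, then whether each pattern accepts it
    print(code, bool(re.fullmatch(loose, code)), bool(re.fullmatch(strict, code)))
```

Only `#FFFFFF` is accepted by both patterns; the four- and five-digit strings pass the bounded version but not the exact one.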


#### **Question 1.**

*What types of strings does the regular expression in the code block below accept?*

In [None]:
# All you need to do is run this code block, analyze the output and answer the associated question.
re_pattern = r'(\([0-9][0-9][0-9]\)\s)?[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]' # ? marks the preceding group as optional
test_strings = ["00", "999-999-9999", "(111) 111-1111", "(111)111-1111", "989-1830", "241/131/103", "(182).1903.1021", "(101).101.1001"]
apply_regex(test_strings, re_pattern)

Test string 00 failed!
Test string 999-999-9999 failed!
Test string (111) 111-1111 accepted.
Test string (111)111-1111 failed!
Test string 989-1830 accepted.
Test string 241/131/103 failed!
Test string (182).1903.1021 failed!
Test string (101).101.1001 failed!


Answer:

The pattern accepts strings of the form `xxx-xxxx` (for example, 989-1830), optionally preceded by `(xxx)` and a space (for example, (111) 111-1111), where each `x` is a digit from 0 to 9. The `(xxx)` part is optional because the `?` applies to the whole preceding group. When that part is present, it must be followed by exactly one whitespace character, because `\s` sits inside the group right after the closing parenthesis.
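
As a small illustration of the answer above, the repeated `[0-9][0-9][0-9]` can be written with a counted quantifier. This is a sketch only: the names `verbose` and `compact` are hypothetical, and the check covers just the listed samples.

```python
import re

verbose = r'(\([0-9][0-9][0-9]\)\s)?[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
compact = r'(\([0-9]{3}\)\s)?[0-9]{3}-[0-9]{4}'  # same idea, shorter notation

samples = ["989-1830", "(111) 111-1111", "(111)111-1111", "999-999-9999"]
for s in samples:
    accepted = bool(re.fullmatch(compact, s))
    # 'same' confirms both spellings agree on this sample
    same = accepted == bool(re.fullmatch(verbose, s))
    print(s, accepted, same)
```

The first two samples are accepted and the last two are rejected by both spellings, which matches the behavior described in the answer.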

#### **Question 2.**

*Modify the regular expression (used in Q1.) to also accept strings that follow the format `xxx-xxx-xxxx`, where `x` is a digit from 0 to 9.*

In [None]:
# Replace the modified regular expression in the code snippet used in Question 1
# Add your code below this comment and execute your code
re_pattern = r'(\([0-9][0-9][0-9]\)\s|[0-9][0-9][0-9]-)?[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]'
test_strings = ["00", "999-999-9999", "(111) 111-1111", "(111)111-1111", "989-1830", "241/131/103", "(182).1903.1021", "(101).101.1001","214-432-4321"]
apply_regex(test_strings, re_pattern)



Test string 00 failed!
Test string 999-999-9999 accepted.
Test string (111) 111-1111 accepted.
Test string (111)111-1111 failed!
Test string 989-1830 accepted.
Test string 241/131/103 failed!
Test string (182).1903.1021 failed!
Test string (101).101.1001 failed!
Test string 214-432-4321 accepted.


#### **Question 3.**

*Modify the regular expression (used in Q2.) to also accept strings of the form `(xxx) xxx-xxxx` and `(xxx)xxx-xxxx`, where `x` is a digit from 0 to 9.*

In [None]:
# Replace the modified regular expression in the code snippet used in Question 2
# Add your code below this comment and execute your code
re_pattern = r'(\([0-9][0-9][0-9]\)\s?|[0-9][0-9][0-9]-)?[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]' # \s? makes the space after the closing parenthesis optional; | (or) switches between the accepted prefix forms
test_strings = ["00", "999-999-9999", "(111) 111-1111", "(111)111-1111", "989-1830", "241/131/103", "(182).1903.1021", "(101).101.1001","214-432-4321"]
apply_regex(test_strings, re_pattern)


Test string 00 failed!
Test string 999-999-9999 accepted.
Test string (111) 111-1111 accepted.
Test string (111)111-1111 accepted.
Test string 989-1830 accepted.
Test string 241/131/103 failed!
Test string (182).1903.1021 failed!
Test string (101).101.1001 failed!
Test string 214-432-4321 accepted.


#### **Question 4.**

*Modify the regular expression (used in Q3.) to accept strings that have the format `(xxx)-xxx-xxxx` and `(xxx).xxx.xxxx` but reject strings of the format `(xxx)/xxx.xxxx`, where `x` is a digit from 0 to 9.*

In [None]:
# Replace the modified regular expression in the code snippet used in Question 3
# Add your code below this comment and execute your code
re_pattern = r'(\([0-9][0-9][0-9]\)[\s\-\.]?|[0-9][0-9][0-9][\-\.])?[0-9][0-9][0-9][\-\.][0-9][0-9][0-9][0-9]' # Building on Q3: [\s\-\.]? allows an optional space, hyphen, or period after the closing parenthesis, and [\-\.] allows a hyphen or period between the digit groups
test_strings = ["00", "999-999-9999", "(111) 111-1111", "(111)111-1111", "989-1830", "241/131/103", "(182).1903.1021", "(101).101.1001","214-432-4321","(312)/333.3432"]
apply_regex(test_strings, re_pattern)




Test string 00 failed!
Test string 999-999-9999 accepted.
Test string (111) 111-1111 accepted.
Test string (111)111-1111 accepted.
Test string 989-1830 accepted.
Test string 241/131/103 failed!
Test string (182).1903.1021 failed!
Test string (101).101.1001 accepted.
Test string 214-432-4321 accepted.
Test string (312)/333.3432 failed!
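
The pattern above treats each separator independently, so a mixed form such as `(101)-101.1001` would also be accepted. If consistent separators were wanted (an assumption that goes beyond what the question asks), a backreference is one option. The following is a minimal sketch; `consistent` is a hypothetical name and the pattern covers only the parenthesized forms.

```python
import re

# ([-.]) captures the first separator; \1 requires the same character again
consistent = r'\([0-9]{3}\)([-.])[0-9]{3}\1[0-9]{4}'

for s in ["(101).101.1001", "(101)-101-1001", "(101)-101.1001", "(312)/333.3432"]:
    print(s, bool(re.fullmatch(consistent, s)))
```

The first two strings are accepted, while the mixed-separator and slash forms are rejected.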


---

## **Using regular expressions based pattern matching on real world text**

For demonstration purposes, here is some dummy text. A few observations:
* The text has multiple paragraphs, and each paragraph has more than one sentence.
* Some of the words are capitalized (the first letter is uppercase and the remaining letters are lowercase).

In [None]:
text = """Here is the First Paragraph and this is the First Sentence. here is the Second Sentence. now is the Third Sentence. this is the Fourth Sentence of the first paragaraph. this paragraph is ending now with a Fifth Sentence.
Now, it is the Second Paragraph and its First Sentence. here is the Second Sentence. now is the Third Sentence. this is the Fourth Sentence of the second paragraph. this paragraph is ending now with a Fifth Sentence.
Finally, this is the Third Paragraph and is the First Sentence of this paragraph. here is the Second Sentence. now is the Third Sentence. this is the Fourth Sentence of the third paragaraph. this paragraph is ending now with a Fifth Sentence.
4th paragraph is not going to be detected by either of the regex patterns below.
"""

print(text)

Here is the First Paragraph and this is the First Sentence. here is the Second Sentence. now is the Third Sentence. this is the Fourth Sentence of the first paragaraph. this paragraph is ending now with a Fifth Sentence.
Now, it is the Second Paragraph and its First Sentence. here is the Second Sentence. now is the Third Sentence. this is the Fourth Sentence of the second paragraph. this paragraph is ending now with a Fifth Sentence.
Finally, this is the Third Paragraph and is the First Sentence of this paragraph. here is the Second Sentence. now is the Third Sentence. this is the Fourth Sentence of the third paragaraph. this paragraph is ending now with a Fifth Sentence.
4th paragraph is not going to be detected by either of the regex patterns below.



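The following code block shows a regular expression that matches the leading run of characters on each line, provided that the run:
1. begins at the start of a line, and
2. contains no digits or spaces, so a line that starts with a digit or a space yields no match.

`re.findall()` finds all matches of the pattern in the text under consideration and returns them as a list of strings. The `re.MULTILINE` flag makes `^` match at the start of every line rather than only at the start of the text.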

In [None]:
re_pattern1 = r'^[^0-9 ]+'
print(re.findall(re_pattern1, text, re.MULTILINE))

['Here', 'Now,', 'Finally,']
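
To see what `re.MULTILINE` contributes, the sketch below runs the same pattern with and without the flag on a short three-line string. The `sample` string is hypothetical and used only for illustration.

```python
import re

sample = "Alpha one\nBeta two\n3rd line"  # hypothetical three-line string
pattern = r'^[^0-9 ]+'

# Without the flag, ^ matches only at the very start of the string
print(re.findall(pattern, sample))                # ['Alpha']
# With re.MULTILINE, ^ also matches right after each newline
print(re.findall(pattern, sample, re.MULTILINE))  # ['Alpha', 'Beta']
```

The third line starts with a digit, so it produces no match in either case.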


Further, the regular expression defined below matches two consecutive capitalized words separated by a single space.

In [None]:
re_pattern2 = r'[A-Z][a-z]+ [A-Z][a-z]+'
print(re.findall(re_pattern2, text))

['First Paragraph', 'First Sentence', 'Second Sentence', 'Third Sentence', 'Fourth Sentence', 'Fifth Sentence', 'Second Paragraph', 'First Sentence', 'Second Sentence', 'Third Sentence', 'Fourth Sentence', 'Fifth Sentence', 'Third Paragraph', 'First Sentence', 'Second Sentence', 'Third Sentence', 'Fourth Sentence', 'Fifth Sentence']
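
One detail of `re.findall()` is worth keeping in mind: when the pattern contains a capturing group, the function returns the text of the group rather than the whole match. The sketch below contrasts a capturing group with a non-capturing group `(?:...)`; the `sample` sentence and the two pattern names are hypothetical.

```python
import re

sample = "She flew from New York to San Francisco."  # hypothetical sentence

capturing = r'[A-Z][a-z]+( [A-Z][a-z]+)'        # group is captured
non_capturing = r'[A-Z][a-z]+(?: [A-Z][a-z]+)'  # group is only used for grouping

# With a capturing group, findall returns only the group's contents
print(re.findall(capturing, sample))      # [' York', ' Francisco']
# With a non-capturing group, findall returns the whole match
print(re.findall(non_capturing, sample))  # ['New York', 'San Francisco']
```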




The following is an excerpt on the "Inaugural Address" taken from the website of the [Joint Congressional Committee on Inaugural Ceremonies](https://www.inaugural.senate.gov/inaugural-address/):

In [None]:
inau_text="""The custom of delivering an address on Inauguration Day started with the very first Inauguration—George Washington’s—on April 30, 1789. After taking his oath of office on the balcony of Federal Hall in New York City, Washington proceeded to the Senate chamber where he read a speech before members of Congress and other dignitaries. His second Inauguration took place in Philadelphia on March 4, 1793, in the Senate chamber of Congress Hall. There, Washington gave the shortest Inaugural address on record—just 135 words —before repeating the oath of office.
Every President since Washington has delivered an Inaugural address. While many of the early Presidents read their addresses before taking the oath, current custom dictates that the Chief Justice of the Supreme Court administer the oath first, followed by the President’s speech.
William Henry Harrison delivered the longest Inaugural address, at 8,445 words, on March 4, 1841—a bitterly cold, wet day. He died one month later of pneumonia, believed to have been brought on by prolonged exposure to the elements on his Inauguration Day. John Adams’ Inaugural address, which totaled 2,308 words, contained the longest sentence, at 737 words. After Washington’s second Inaugural address, the next shortest was Franklin D. Roosevelt’s fourth address on January 20, 1945, at just 559 words. Roosevelt had chosen to have a simple Inauguration at the White House in light of the nation’s involvement in World War II.
In 1921, Warren G. Harding became the first President to take his oath and deliver his Inaugural address through loud speakers. In 1925, Calvin Coolidge’s Inaugural address was the first to be broadcast nationally by radio. And in 1949, Harry S. Truman became the first President to deliver his Inaugural address over television airwaves.
Most Presidents use their Inaugural address to present their vision of America and to set forth their goals for the nation. Some of the most eloquent and powerful speeches are still quoted today. In 1865, in the waning days of the Civil War, Abraham Lincoln stated, “With malice toward none, with charity for all, with firmness in the right as God gives us to see the right, let us strive on to finish the work we are in, to bind up the nation’s wounds, to care for him who shall have borne the battle and for his widow and his orphan, to do all which may achieve and cherish a just and lasting peace among ourselves and with all nations.” In 1933, Franklin D. Roosevelt avowed, “we have nothing to fear but fear itself.” And in 1961, John F. Kennedy declared, “And so my fellow Americans: ask not what your country can do for you—ask what you can do for your country.”
Today, Presidents deliver their Inaugural address on the West Front of the Capitol, but this has not always been the case. Until Andrew Jackson’s first Inauguration in 1829, most Presidents spoke in either the House or Senate chambers. Jackson became the first President to take his oath of office and deliver his address on the East Front Portico of the U.S. Capitol in 1829. With few exceptions, the next 37 Inaugurations took place there, until 1981, when Ronald Reagan’s Swearing-In Ceremony and Inaugural address occurred on the West Front Terrace of the Capitol. The West Front has been used ever since."""



#### **Question 5a.**

*Identify all the capitalized words in the "Inaugural Address" excerpt and write a regular expression that finds all occurrences of such words in the text. Then, run the Python code snippet to automatically display the matched strings according to the pattern.*

NOTE: You can use the *re.findall()* method as demonstrated in the example before this exercise.

In [None]:
# Write your code below these comments and execute your code
# HINT: You may need to tweak the use of capture groups to NOT capture partial matches
# (e.g., 'New York' instead of 'New', 'York')
# [A-Z][a-z]+ matches a capitalized word. (?:...) is a non-capturing group, so findall returns the whole match.
# Inside the group, \s[A-Z][a-z]+ matches whitespace followed by another capitalized word; the trailing * allows zero or more such words.
re_pattern2 = r'[A-Z][a-z]+(?:\s[A-Z][a-z]+)*'
print(re.findall(re_pattern2, inau_text))




['The', 'Inauguration Day', 'Inauguration', 'George Washington', 'April', 'After', 'Federal Hall', 'New York City', 'Washington', 'Senate', 'Congress', 'His', 'Inauguration', 'Philadelphia', 'March', 'Senate', 'Congress Hall', 'There', 'Washington', 'Inaugural', 'Every President', 'Washington', 'Inaugural', 'While', 'Presidents', 'Chief Justice', 'Supreme Court', 'President', 'William Henry Harrison', 'Inaugural', 'March', 'He', 'Inauguration Day', 'John Adams', 'Inaugural', 'After Washington', 'Inaugural', 'Franklin', 'Roosevelt', 'January', 'Roosevelt', 'Inauguration', 'White House', 'World War', 'In', 'Warren', 'Harding', 'President', 'Inaugural', 'In', 'Calvin Coolidge', 'Inaugural', 'And', 'Harry', 'Truman', 'President', 'Inaugural', 'Most Presidents', 'Inaugural', 'America', 'Some', 'In', 'Civil War', 'Abraham Lincoln', 'With', 'God', 'In', 'Franklin', 'Roosevelt', 'And', 'John', 'Kennedy', 'And', 'Americans', 'Today', 'Presidents', 'Inaugural', 'West Front', 'Capitol', 'Until An

#### **Question 5b.**

*Identify all the dates in the "Inaugural Address" excerpt and write a regular expression that finds all occurrences of the dates in the text. Then, run the Python code snippet to automatically display a list of all such dates identified.*

NOTE: You can use the *re.findall()* method as demonstrated in the example before this exercise.

In [None]:
# Write your code below these comments and execute your code
# HINT: You may need to tweak the use of capture groups to NOT capture partial matches
# (e.g., 'April 20, 1945' instead of 'April', '20')
re_pattern2 = r'(?:January|February|March|April|May|June|July|August|September|October|November|December)?(?:\s\d{1,2},\s)?(?:\d{4})'
# The first group lists the month names; the ? after it makes the month optional.
# The second group matches the day, \s\d{1,2},\s (one or two digits followed by a comma); its ? makes the day optional as well.
# Finally, \d{4} requires a four-digit year.
print(re.findall(re_pattern2, inau_text))




['April 30, 1789', 'March 4, 1793', 'March 4, 1841', 'January 20, 1945', '1921', '1925', '1949', '1865', '1933', '1961', '1829', '1829', '1981']
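
A bare `\d{4}` also matches four digits inside a longer number, which does not happen in this excerpt but could in other text. One possible tightening, offered as a sketch rather than as the expected answer, is to require either a full "Month D, YYYY" date or a standalone four-digit year delimited by word boundaries (`\b`). The names `months` and `date_pattern` and the `sample` sentence are hypothetical.

```python
import re

months = r'(?:January|February|March|April|May|June|July|August|September|October|November|December)'
# Either a full "Month D, YYYY" date or a standalone four-digit year
date_pattern = rf'{months}\s\d{{1,2}},\s\d{{4}}|\b\d{{4}}\b'

# Hypothetical sentence with a full date, a comma-grouped number, a year, and a five-digit number
sample = "Held on March 4, 1841; 8,445 words; aired in 1925; ticket 12345."
print(re.findall(date_pattern, sample))  # ['March 4, 1841', '1925']
```

The word boundaries keep `12345` from contributing a partial match, and `8,445` never contains four consecutive digits.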


---