## Homework: Regular Expressions


### University of Virginia
### Foundations of Computer Science
### Last Updated: November 13, 2021


#### Alanna Hazlett
#### uwa6xv
#### November 9th, 2024
---

### Objectives: 
- Practice writing and testing regular expressions

### Executive Summary


There are two short text documents in this notebook. You will write regular expressions to find certain patterns.  

Note: This website is a helpful resource for writing and testing regexes: [regex101](https://regex101.com/)

### Instructions

Answer the questions, showing all code and results.  
When the file is completed, submit the notebook through Canvas.

**Notes:**  
1) When instructions ask for a case insensitive match on a word or phrase, any mix of uppercase and lowercase characters is a match.  
2) The regexes do not need to be robust generally. They simply need to find all the matches in the documents. For example, when matching dollar amounts,  
   the regex does not need to guard against matching invalid forms such as $61,0 as they are not in the documents. 

**TOTAL POINTS: 12**

---


In [1]:
import re

#### DOCUMENTS FOR SEARCH

In [2]:
doc1 = "(CNN) This is an article about America's Workers. Getting family health insurance on the job now costs workers and their employers more than $22,000 a year, on average. And companies have not been able to do much to make coverage more affordable, even though the coronavirus pandemic has reinforced the importance of health benefits.\
Employees foot about $6,000 of the tab, while companies pick up the rest, according to the 2021 Kaiser Family Foundation Employer Health Benefits Survey. The report, released Wednesday, found that the average annual premium rose 4% this year to $22,221.\
The average annual premium for a single staffer in 2021 hit $7,739, also up 4%. Workers pay about $1,300, and employers cover the remaining tab.\
About 155 million Americans rely on employer-sponsored coverage -- and they are paying a lot more for that benefit than they were a decade ago. The average family premium has increased 47%, more than wages or inflation, which rose 31% and 19%, respectively, Kaiser found.\
The average count is 21,000."

In [3]:
doc2 = "Curry reacts in the second half against the Chicago Bulls. (CNN)It seems every week NBA superstar Steph Curry is making history.\
Earlier this week, he overtook Wilt Chamberlain to become the oldest player to record 50 points and 10 assists in a game.\
And on Friday night, the 33-year-old passed basketball great Ray Allen for the most three-pointers scored in all NBA games, including playoffs, in NBA history.\
Curry connected with nine of his 17 three-point attempts in the Golden State Warriors' 119-93 win over the Chicago Bulls, taking his tally in regular season and playoff games to 3,366, surpassing Allen's total of 3,358.\
He had come into the game just one behind two-time NBA champion Allen and equaled his record within the first few minutes of the game.\
And he became the all-time lead just minutes later, drilling a long-range effort over Alex Caruso."

---

#### 1) (1 POINT) Search *doc1* for the word 'family', print the matches, and print the number of matches.

Keeping these just for my own personal notes:

In [4]:
#re.search will return the first instance of a match
# match = re.search("family",doc1)
# if match:
#     print(match)

In [5]:
# print(match.group())
# print(match.start())
# print(match.end())
# print(match.span())

In [6]:
#finditer returns an iterator object
# all_matches = re.finditer("family",doc1)
# for match in all_matches:   
#     print(match.group(), match.span())
# all_matches

The actual answer to the question:

In [7]:
# re.findall returns every non-overlapping match as a list of strings (the matched text, since this pattern has no capture groups).
family_matches = re.findall("family",doc1)
print("The matches are", family_matches, "of which, there are", len(family_matches), "matches.")

The matches are ['family', 'family'] of which, there are 2 matches.
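As a side-by-side reminder of the three functions used in the notes above, here is a minimal sketch. The `sample` sentence is made up for illustration and is not one of the assignment documents.

```python
import re

# Hypothetical sample sentence (an assumption, not doc1 or doc2).
sample = "A family plan costs more; each family pays a share."

# re.search: the first match only, or None when nothing matches.
first = re.search(r"family", sample)

# re.finditer: an iterator of Match objects, one per occurrence.
spans = [m.span() for m in re.finditer(r"family", sample)]

# re.findall: a plain list of the matched strings.
words = re.findall(r"family", sample)

print(first.span(), spans, words)
```

`findall` is enough when only the matched text and the count are needed; `finditer` is the better fit when positions are needed as well.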


#### 2) (2 POINTS) Search *doc1* for the first occurrence of the word "workers" (case insensitive).  
####    If it finds a match, use the start() and end() methods to extract the match from the document, printing the result.

In [8]:
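# "First occurrence" means we want re.search(), which stops at the first match.
# (?i) makes the whole pattern case insensitive. Python's re does not accept a bare (?-i);
# a scoped group such as (?-i:...) turns case insensitivity off for part of a pattern.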
match = re.search("(?i)workers",doc1)
if match:
    print(match.group(), match.start(), match.end())

Workers 41 48
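The question asks for the match to be extracted from the document with `start()` and `end()`. One way to read that is slicing the string with those two indices, sketched below on a made-up `sample` string (an assumption, not doc1).

```python
import re

# Hypothetical sample text (an assumption, not doc1).
sample = "America's Workers and other workers"

match = re.search(r"(?i)workers", sample)
if match:
    # start() is the index of the first matched character and end() is one
    # past the last, so they work directly as slice bounds.
    extracted = sample[match.start():match.end()]
    print(extracted)
```

The slice yields the same text as `match.group()`, which is a quick way to confirm the indices are right.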


#### 3) (1 POINT) Search *doc1* for the word 'family' (case insensitive), print the matches, and print the number of matches.

In [9]:
family_matches = re.findall("(?i)family",doc1)
print("The matches are", family_matches, "of which, there are", len(family_matches), "matches.")

The matches are ['family', 'Family', 'family'] of which, there are 3 matches.
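The inline `(?i)` modifier has a flag equivalent, `re.IGNORECASE`, which some readers find easier to spot. A small sketch on a made-up `sample` string (an assumption, not doc1) suggests the two give the same result:

```python
import re

# Hypothetical sample text (an assumption, not doc1).
sample = "Family, family, and FAMILY"

# Inline modifier at the start of the pattern.
inline = re.findall(r"(?i)family", sample)

# Same search with the flag passed as an argument instead.
flagged = re.findall(r"family", sample, flags=re.IGNORECASE)

print(inline, flagged)
```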


#### 4) (1 POINT) Search *doc1* for dollar amounts, print the matches, and print the number of matches. Dollar amounts start with "$" followed by digits and possibly commas.

Note: "$" will have different meanings in a regex, so take care to use it properly in this context.

In [10]:
# "Possibly commas" means a comma may or may not appear between the digits, so the comma is made optional with ?.
# A raw string (r"...") keeps Python from treating \$ and \d as string escape sequences.
dollar_matches = re.findall(r"\$\d+[,]?\d+",doc1)
print("The matches are", dollar_matches, "of which, there are", len(dollar_matches), "matches.")

The matches are ['$22,000', '$6,000', '$22,221', '$7,739', '$1,300'] of which, there are 5 matches.
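The pattern above is sufficient for doc1, as the assignment notes allow. For comparison, here is a sketch of a slightly stricter variant that treats commas as thousands separators and also accepts single-digit amounts. The `sample` string is made up for illustration, and the stricter pattern is one possible choice, not a required answer.

```python
import re

# Hypothetical sample text (an assumption, not doc1).
sample = "Fees were $5, $1,300, and $22,221."

# Pattern from the answer above: needs at least two digits, at most one comma.
simple = re.findall(r"\$\d+[,]?\d+", sample)

# Stricter sketch: digits, then zero or more groups of a comma plus 3 digits.
# (?:...) is non-capturing, so findall still returns the whole match.
strict = re.findall(r"\$\d+(?:,\d{3})*", sample)

print(simple)
print(strict)
```

The escaped `\$` matches a literal dollar sign; an unescaped `$` would anchor the match at the end of the string instead.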


#### 5) (2 POINTS) Search *doc1* for numbers that are neither percentages nor dollar amounts. Print the matches, and print the number of matches.


Examples:  
55 is a match, and 55,000 is a match, and 55. is a match (the last could occur at the end of a sentence, for example.)  
$55,000 is not a match, and 55% is not a match


In [11]:
# This also matches the digits of dollar amounts and percentages, so it does not answer the question.
# number_matches = re.findall(r"\b\d+[,]?\d+\b",doc1)
# print("The matches are", number_matches, "of which, there are", len(number_matches), "matches.")

In [12]:
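# Requiring whitespace before the number excludes dollar amounts; requiring whitespace or a period after it excludes percentages.
# The surrounding whitespace or period is consumed, so it appears in each match.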
number_matches = re.findall(r"\s\d+[,]?\d+[.\s]",doc1)
print("The matches are", number_matches, "of which, there are", len(number_matches), "matches.")

The matches are [' 2021 ', ' 2021 ', ' 155 ', ' 21,000.'] of which, there are 4 matches.
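The matches above carry the surrounding whitespace or period because those characters are part of the pattern. One alternative, sketched here under the assumption that lookarounds are acceptable for this exercise, checks the neighbors without consuming them. The `sample` string is made up for illustration.

```python
import re

# Hypothetical sample text (an assumption, not doc1).
sample = "In 2021 about 155 million paid $6,000, up 4%; the count is 21,000."

# (?<![$\d,])  : not preceded by "$", a digit, or a comma (rules out dollar amounts)
# \d+(?:,\d{3})* : digits with optional thousands groups
# (?![\d%])    : not followed by a digit or "%" (rules out percentages)
pattern = r"(?<![$\d,])\d+(?:,\d{3})*(?![\d%])"

numbers = re.findall(pattern, sample)
print(numbers, len(numbers))
```

Because lookbehind and lookahead are zero-width, each match is the bare number with no padding to strip afterwards.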


---

#### The following questions ask you to search doc2.

#### 6) (2 POINTS) Search *doc2* for two or more words (consisting of only letters) joined by dashes. Print the matches, and print the number of matches.

Examples: "twenty-year-old" and "all-star"  
Non-examples: '22-year' and '110-90' are not matches as they contain numbers


In [13]:
# This doesn't work because \w matches any word character (a-z, A-Z, 0-9, and _), so digits are allowed too.
# hyphen_matches = re.findall(r"\w+-\w+",doc2)
# print("The matches are", hyphen_matches, "of which, there are", len(hyphen_matches), "matches.")

In [14]:
hyphen_matches = re.findall(r"\b[a-zA-Z]+-[a-zA-Z]+\b",doc2)
print("The matches are", hyphen_matches, "of which, there are", len(hyphen_matches), "matches.")

The matches are ['year-old', 'three-pointers', 'three-point', 'two-time', 'all-time', 'long-range'] of which, there are 6 matches.
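Two details of the pattern above depend on how the question is read: it returns 'year-old' out of '33-year-old', and it would stop after the second word of a three-word chain. If the intended reading is that any chain touching a number should be skipped entirely, a sketch like the following may help. That reading is an assumption, and the `sample` string is made up for illustration.

```python
import re

# Hypothetical sample text (an assumption, not doc2).
sample = "The 33-year-old made a three-pointer in a 119-93 win, an all-time-great night."

# Pattern from the answer above: exactly two letter-only words around one dash.
two_words = re.findall(r"\b[a-zA-Z]+-[a-zA-Z]+\b", sample)

# Sketch: one or more "-word" groups, and the chain must not be glued to
# another word character or dash on either side.
chains = re.findall(r"(?<![\w-])[A-Za-z]+(?:-[A-Za-z]+)+(?![\w-])", sample)

print(two_words)
print(chains)
```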


#### 7) (1 POINT) Search *doc2* for all words starting with an uppercase letter.  Print the matches, and print the number of matches. 

In [15]:
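# The * quantifier (zero or more) is needed here: with + (one or more), a one-letter word such as "I" would be missed.
# Only lowercase letters may follow the first letter, so all-caps words such as "CNN" and "NBA" are not matched.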
upper_matches = re.findall(r"\b[A-Z][a-z]*\b",doc2)
print("The matches are", upper_matches, "of which, there are", len(upper_matches), "matches.")

The matches are ['Curry', 'Chicago', 'Bulls', 'It', 'Steph', 'Curry', 'Earlier', 'Wilt', 'Chamberlain', 'And', 'Friday', 'Ray', 'Allen', 'Curry', 'Golden', 'State', 'Warriors', 'Chicago', 'Bulls', 'Allen', 'He', 'Allen', 'And', 'Alex', 'Caruso'] of which, there are 25 matches.
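Whether all-caps words such as "CNN" and "NBA" count as "words starting with an uppercase letter" is a matter of interpretation. If they should count, allowing any letter after the first one is one option. A sketch on a made-up `sample` string (an assumption, not doc2):

```python
import re

# Hypothetical sample text (an assumption, not doc2).
sample = "CNN said Steph Curry and I watched the NBA game."

# Pattern from the answer above: uppercase letter, then lowercase letters only.
title_case = re.findall(r"\b[A-Z][a-z]*\b", sample)

# Sketch: uppercase letter, then letters of either case.
any_case = re.findall(r"\b[A-Z][a-zA-Z]*\b", sample)

print(title_case)
print(any_case)
```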


#### 8) (1 POINT) Search *doc2* for the word "in." Print the matches, and print the number of matches. 

Example: "Jordan is *in* the house"  
Non-example: Jordan is ready to win (careful not to match on the substring "in" in "win")

In [16]:
in_matches = re.findall(r"\bin\b",doc2)
print("The matches are", in_matches, "of which, there are", len(in_matches), "matches.")

The matches are ['in', 'in', 'in', 'in', 'in', 'in'] of which, there are 6 matches.
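To see what `\b` contributes, the sketch below runs the pattern with and without word boundaries on a made-up `sample` sentence (an assumption, not doc2).

```python
import re

# Hypothetical sample text (an assumption, not doc2).
sample = "Jordan is in the house, ready to win in overtime."

# Without boundaries, "in" is also found inside "win".
substring = re.findall(r"in", sample)

# \b asserts a position between a word character and a non-word character,
# so only the standalone word is matched.
whole_word = re.findall(r"\bin\b", sample)

print(len(substring), len(whole_word))
```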


#### 9) (1 POINT) Search *doc2* for a number followed by the word "points."  
####    Include capture groups in the regex to extract the number of points, and print the number.  
####    Credit is only given if you use capture groups in this exercise.
Hint: use the search() function.


In [17]:
# Raw string so \d reaches the regex engine unchanged; group 1 captures the number.
points_match = re.search(r"(\d+) (points)",doc2)
if points_match:
    print(points_match.group(1))

50
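Capture groups can also be named, which may read more clearly than a numeric index. A minimal sketch follows; the group name `count` and the `sample` string are illustrative assumptions.

```python
import re

# Hypothetical sample text (an assumption, not doc2).
sample = "He recorded 50 points and 10 assists."

# (?P<count>...) is a named capture group; it is still reachable as group 1.
m = re.search(r"(?P<count>\d+) points", sample)
if m:
    points = int(m.group("count"))  # captured text is a string, so convert it
    print(points, m.group(0))       # group(0) is the entire match
```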


---  

<div class="alert alert-block alert-info">
<b>I pledge that I have neither given nor received help on this assignment. : Alanna Hazlett </b>
</div>