<a href="https://colab.research.google.com/github/ProfessorPatrickSlatraigh/CST2312/blob/main/CST2312_OL80_Sets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CST2312 Complex Data Structures: Sets**



---



A set is a data structure in which all elements are unique. Although sets are similar to lists, there is **no order** among the items in a set, and sets do **not** support indexing. Sets are ideal for membership queries, i.e., checking whether an item appears in the set or not.

 **Reading** 

This lesson is not a part of the Python for Everybody (py4e.com) lesson set.  You can read about **Sets** at the following link from the book "Think Python" (2nd edition): 

[Sets](https://greenteapress.com/thinkpython2/html/thinkpython2020.html#sec227) 



---



## Creating Sets

Sets are specified by curly braces, `{ }`, containing one or more comma-separated values. To specify an empty set, you must use the alternative construct, `set()`, because empty curly braces create an empty dictionary. Note that creating sets using curly braces differs from creating dictionaries with curly braces in that the elements separated by commas are not key:value pairs.

In [None]:
# creating sets
one_set = {4, 2, 1, 3}
print(one_set)

In [None]:
two_set = {6, 5, 4, -2}
print(two_set)

In [None]:
three_set = {5, 0, -3, 4, 4, 4, 4}
print(three_set)

In [None]:
# creating an empty set; notice that we do *not* write "empty_set = {}",
# as one might expect from the way we create an empty list with []
empty_set = set()
print(empty_set)
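
A quick check with `type()` shows what empty curly braces actually create. This is a small sketch; the variable names `empty_braces` and `non_empty` are made up for illustration.

```python
# Empty curly braces create a dictionary, not a set
empty_braces = {}
empty_set = set()

print(type(empty_braces))  # <class 'dict'>
print(type(empty_set))     # <class 'set'>

# Curly braces holding at least one plain value (no key:value pair) create a set
non_empty = {1}
print(type(non_empty))     # <class 'set'>
```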

We can also create a set from a list:

In [None]:
my_list = [1, 2, 3, 0, 5, 10, 11, 1, 5]
my_set = set(my_list)
print(my_list)
print(my_set)

**As expected, the set created from a list does not contain any duplicate elements.**

This is a fast, easy, and reliable way to prepare lists for membership checks, and to find the elements that two lists share (their intersection) or the elements in which they differ. 

In [None]:
print("Length of list:", len(my_list))
print("Length of set:", len(my_set))

As seen above, we can use **len()** to get the number of elements in a set, just as we do for a list. Similarly, we can use the other functions that we used for lists, such as **min()**, **max()**, **sum()**, and **sorted()**.
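
As a small illustration of these functions applied to a set (the set `sample_set` is invented for this example), note that `sorted()` returns a list, since a set itself has no order:

```python
sample_set = {7, 3, 9, 1}

print(len(sample_set))     # 4
print(min(sample_set))     # 1
print(max(sample_set))     # 9
print(sum(sample_set))     # 20
print(sorted(sample_set))  # [1, 3, 7, 9]
```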

**Exercise**

What is the number of _distinct_ words in the `nyt` variable?

Text from: [New York Times](https://www.nytimes.com/2020/09/06/us/colleges-coronavirus-students.html)

In [None]:
nyt = """Last month, facing a budget shortfall of at least $75 million because of the pandemic, the University of Iowa welcomed thousands of students back to its campus — and into the surrounding community.
Iowa City braced, cautious optimism mixing with rising panic. The university had taken precautions, and only about a quarter of classes would be delivered in person. But each fresh face in town could also carry the virus, and more than 26,000 area residents were university employees.
"Covid has a way of coming in," said Bruce Teague, the city's mayor, "even when youre doing all the right things."
Within days, students were complaining that they couldn’t get coronavirus tests or were bumping into people who were supposed to be in isolation. Undergraduates were jamming sidewalks and downtown bars, masks hanging below their chins, never mind the city’s mask mandate.
Now, Iowa City is a full-blown pandemic hot spot — one of about 100 college communities around the country where infections have spiked in recent weeks as students have returned for the fall semester. Though the rate of infection has bent downward in the Northeast, where the virus first peaked in the U.S., it continues to remain high across many states in the Midwest and South — and evidence suggests that students returning to big campuses are a major factor.
In a New York Times review of 203 counties in the country where students comprise at least 10 percent of the population, about half experienced their worst weeks of the pandemic since Aug. 1. In about half of those, figures showed the number of new infections is peaking right now.
Despite the surge in cases, there has been no uptick in deaths in college communities, data shows. This suggests that most of the infections are stemming from campuses, since young people who contract the virus are far less likely to die than older people. However, leaders fear that young people who are infected will contribute to a spread of the virus throughout the community.
The surge in infections reported by county health departments comes as many college administrations are also disclosing clusters on their campuses.
Brazos County, Tex., home to Texas A&M University, added 742 new coronavirus cases during the last week of August, the county’s worst week so far, as the university reported hundreds of new cases.
Pitt County, N.C., site of East Carolina University, saw its coronavirus cases rise above 800 in a single week at the end of August. The Times has identified at least 846 infections involving students, faculty and staff since mid-August.
In South Dakota’s Clay and Brookings counties, ballooning infections in the past two weeks have reflected outbreaks at the state’s major universities. In McLean County, Ill., the virus has been spreading as more than 1,200 people have contracted the virus at Illinois State University.
At Washington State University and the University of Idaho, about eight miles apart, combined coronavirus cases have risen since early July to more than 300 infections. In the surrounding communities — rural Whitman County, Wash., and Latah County, Idaho — cases per week have climbed from low single-digits in the first three months of the pandemic, to double-digits in July, to more than 300 cases in the last week of August.
The Times has collected infection data from both state and local health departments and individual colleges. Academic institutions generally report cases involving students, faculty and staff, while the countywide data includes infections for all residents of the county.
It's unclear precisely how the figures overlap and how many infections in a community outside of campus are definitively tied to campus outbreaks. But epidemiologists have warned that, even with exceptional contact tracing, it would be difficult to completely contain the virus on a campus when students shop, eat and drink in town, and local residents work at the college.
The potential spread of the virus beyond campus greens has deeply affected the workplaces, schools, governments and other institutions of local communities. The result often is an exacerbation of traditional town-and-gown tensions as college towns have tried to balance economic dependence on universities with visceral public health fears.
In Story County, Iowa, a local outcry following a burst of new Iowa State University cases pressured the university on Wednesday to reverse plans to welcome 25,000 football fans for its Sept. 12 opener against the University of Louisiana at Lafayette. In Monroe County, Indiana, the health department quarantined 30 Indiana University fraternity and sorority houses, prompting the university to publicly recommend that members shut them down and move elsewhere.
In Johnson County, where the University of Iowa is located, cases have more than doubled since the start of August, to more than 4,000. Over the past two weeks, Iowa City’s metro area added the fourth-most cases per capita in the country. The university has recorded more than 1,400 cases for the semester.
With a population of roughly 75,000, Iowa City relies on the university as an economic engine. The University of Iowa is by far the community’s largest employer, and its approximately 30,000 students are a critical market. Hawkeye football alone brings $120 million a year into the community, said Nancy Bird, executive director of the Iowa City Downtown District.
When the pandemic first hit in March, the university sent students home and pivoted to remote instruction, like most of the country’s approximately 5,000 colleges and universities. That exodus, heightened by health restrictions, has been an existential challenge for many downtown businesses, Ms. Bird said.
Jim Rinella, who owns The Airliner bar and restaurant, said the 76-year-old landmark across the street from campus “had zero revenue the whole month of April.” May was almost as scant, he said, and in June, he shut down after a couple of employees became infected.
By the time he reopened after July 4, too few students were in town to come close to making up the losses. He and his wife, Sherry, had hoped the campus reopening in August might be a lifeline.
But the photos taken by the local press from outside his establishment and others were damning. In an open letter, the university president lashed out, saying he was “exceedingly disappointed” in the failure of local businesses to keep students masked and socially distanced. Days later, the governor cited high infection rates among young people as she closed bars and restricted restaurants in Johnson County and five other counties with high concentrations of students.
Now The Airliner — where a booth is named for the University of Iowa's most famous dropout, Tom Brokaw, and a modeling scout is said to have discovered Ashton Kutcher — has to close at 10 p.m. as well as require customers to buy a meal and sit far apart if they want to drink there.
"I'm at a pain point," Mr. Rinella said. "If my grandfather hadn't started the place, I'd question whether I want to be in the restaurant business." A recent lunch hour visit found one customer at the bar drinking a beer."""

#### Solution

In [None]:
# your code here

## Two-step approach 
# nyt_list = nyt.split()
# nyt_set = set(nyt_list)

## One-step approach
nyt_set = set(nyt.split())

print("Length of NYT set:", len(nyt_set))

We can print the **nyt_set** to scan it for any redundant words...

In [None]:
print(sorted(nyt_set))

How would we resolve the different treatment of capitalized words vs. the same word in all lower case?
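
One possible approach, sketched here on a short made-up sentence rather than the full article, is to convert the text to lower case before splitting it, and optionally to strip punctuation from the ends of each word. The names `sample_text`, `raw_words`, and `clean_words` are illustrative. Stripping with `string.punctuation` is only one reasonable choice, and it will not remove typographic characters such as curly quotes or long dashes.

```python
import string

sample_text = "The virus spread. The university said the virus was contained."

# Without normalization, "The" and "the" are counted as different words
raw_words = set(sample_text.split())

# Lower-case the text, then strip punctuation from both ends of each word
clean_words = {word.strip(string.punctuation) for word in sample_text.lower().split()}
clean_words.discard("")  # drop any token that consisted only of punctuation

print(len(raw_words))       # 8
print(len(clean_words))     # 7
print(sorted(clean_words))
```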

### Checking for membership in a set

For sets, we can only check if an item appears within the set or not. We achieve this using the `in` operator:

In [None]:
my_set = {1, 2, 3, 4}

In [None]:
val = 1
if val in my_set:
    print("The value", val ,"appears in the set", my_set)
else: 
    print("The value", val ,"does not appear in the set", my_set)

The value 1 appears in the set {1, 2, 3, 4}


In [None]:
val = 0
if val in my_set:
    print("The value", val ,"appears in the set", my_set)
else: 
    print("The value", val ,"does not appear in the set", my_set)

We also have the `not in` operator, which behaves as expected:

In [None]:
if val not in my_set:
    print("The value", val ,"does not appear in the set", my_set)
else: 
    print("The value", val ,"appears in the set", my_set)
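
Membership checks are what make sets handy for filtering. As a small sketch (the `words` list and the `stop_words` set are invented for this example), we can keep only the words of a list that do not appear in a set:

```python
stop_words = {"the", "a", "of", "in"}
words = ["the", "rate", "of", "infection", "in", "the", "midwest"]

# Keep each word only if it is not in the stop-word set
kept = []
for word in words:
    if word not in stop_words:
        kept.append(word)

print(kept)  # ['rate', 'infection', 'midwest']
```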

### Set operators: **union**, **intersection**, **difference**, **subset**. Plus, Jaccard Similarity

See this Wikipedia entry for a definition of Jaccard Similarity:
https://en.wikipedia.org/wiki/Jaccard_index

Sets also support operations that allow us to quickly compute the difference, intersection, and union of two sets. For example:

+ `set_a - set_b`: elements in a but not in b. Equivalent to `set_a.difference(set_b)`
+ `set_a | set_b`: elements in a or b. Equivalent to `set_a.union(set_b)`
+ `set_a & set_b`: elements in both a and b. Equivalent to `set_a.intersection(set_b)`
+ `set_a ^ set_b`: elements in a or b but not both. Equivalent to `set_a.symmetric_difference(set_b)` 
+ `set_a <= set_b`: tests whether every element in set_a is in set_b. Equivalent to `set_a.issubset(set_b)`
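
A brief illustration of the operator and method forms side by side, using two small sets invented for this example (`evens` and `small`). Each comparison should print `True` unless noted otherwise:

```python
evens = {2, 4, 6, 8}
small = {1, 2, 3, 4}

# The operator form and the method form give the same result
print((evens - small) == evens.difference(small))            # True
print((evens | small) == evens.union(small))                 # True
print((evens & small) == evens.intersection(small))          # True
print((evens ^ small) == evens.symmetric_difference(small))  # True

# Subset test: every element of {2, 4} is in evens
print({2, 4} <= evens)         # True
print({2, 4}.issubset(evens))  # True
print(small <= evens)          # False, since 1 and 3 are not in evens
```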

**Exercise**

Try the above yourself using the `set_A` and `set_B` variables: compute the difference, union, intersection, and symmetric difference of the two sets.

In [None]:
# Your code here
set_A = {1, 2, 3, 4, 5}
set_B = {4, 5, 6, 7}

# Set A
# Set B
# Difference
# Union
# Intersection
# Symmetric Difference


#### Solution

In [None]:
set_A = {1, 2, 3, 4, 5}
set_B = {4, 5, 6, 7}

# print("Set A: ", set_A)
# print("Set B: ", set_B)
# print("Difference between Set A and Set B: ", set_A - set_B)
# print("Union of Set A and Set B: ", set_A | set_B)
# print("Intersection of Set A and Set B: ", set_A & set_B)
# print("Symmetric Difference between Set A and Set B: ", set_A ^ set_B)

#### Exercise cont.

Now, let's use the [Jaccard index](https://en.wikipedia.org/wiki/Jaccard_index) to compute the similarity of the two sets. The Jaccard coefficient is defined as the size of the intersection of the two sets divided by the size of their union.
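
In symbols, for two sets A and B the coefficient is |A ∩ B| / |A ∪ B|. A minimal sketch of this as a function follows. The name `jaccard`, the example sets, and the choice to return 0.0 when both sets are empty (to avoid dividing by zero) are assumptions made for this example, not part of the lesson.

```python
def jaccard(set_x, set_y):
    """Return the size of the intersection divided by the size of the union."""
    union = set_x | set_y
    if not union:  # both sets are empty: assumed convention, return 0.0
        return 0.0
    return len(set_x & set_y) / len(union)

colors_1 = {"red", "green", "blue"}
colors_2 = {"green", "blue", "yellow", "black"}

# 2 shared elements out of 5 distinct elements overall
print(jaccard(colors_1, colors_2))  # 0.4
```

A value of 1.0 means the two sets are identical, and a value of 0.0 means they share no elements.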

#### Solution

In [None]:
set_A = {1, 2, 3, 4, 5}
set_B = {4, 5, 6, 7}

# Your code here

# print("Jaccard Similarity of Set A and Set B: ", len(set_A & set_B) / len(set_A | set_B))

#### Exercise

Below we have two news articles discussing a security breach at Yahoo. We want to compute the similarity of these articles using the Jaccard similarity. (For the sake of simplicity, we have removed all punctuation from the text.)



In [None]:
wsj = """
Yahoo Inc  disclosed a massive security breach by a  state sponsored actor  affecting at least 500 million users  potentially the largest such data breach on record and the latest hurdle for the beaten down internet company as it works through the sale of its core business 
Yahoo said certain user account information including names  email addresses  telephone numbers  dates of birth  hashed passwords and  in some cases  encrypted or unencrypted security questions and answers was stolen from the company s network in late 2014 by what it believes is a state sponsored actor 
Yahoo said it is notifying potentially affected users and has taken steps to secure their accounts by invalidating unencrypted security questions and answers so they can t be used to access an account and asking potentially affected users to change their passwords 
Yahoo recommended users who haven t changed their passwords since 2014 do so  It also encouraged users change their passwords as well as security questions and answers for any other accounts on which they use the same or similar information used for their Yahoo account 
The company  which is working with law enforcement  said the continuing investigation indicates that stolen information didn t include unprotected passwords  payment card data or bank account information 
With 500 million user accounts affected  this is the largest ever publicly disclosed data breach  according to Paul Stephens  director of policy and advocacy with Privacy Rights Clearing House  a not for profit group that compiles information on data breaches 
No evidence has been found to suggest the state sponsored actor is currently in Yahoo s network  and Yahoo didn t name the country it suspected was involved  In August  a hacker called  Peace  appeared in online forums  offering to sell 200 million of the company s usernames and passwords for about  1 900 in total  Peace had previously sold data taken from breaches at Myspace and LinkedIn Corp 
"""

ust = """
SAN FRANCISCO   Information from at least 500 million Yahoo accounts was stolen from the company in 2014  and the  company said Thursday it believes that a state sponsored actor was behind the hack 
The information may have included names  email addresses  telephone numbers  dates of birth  and  in some cases  encrypted or unencrypted security questions and answers  Yahoo said 
Claims surfaced in early August that a hacker using the name  Peace  was trying to sell the usernames  passwords and dates of birth of Yahoo account users on the dark web   a black market of thousands of secret websites 
The FBI said it was aware of the matter  The compromise of public and private sector systems is something the agency takes very seriously and it said it will continue to investigate and hold accountable all who pose a threat in cyberspace  the agency said in an emailed statement 
Yahoo recommends that users who haven t changed their passwords since 2014 do so  The company said it was notifying potentially affected users and taking steps to secure their accounts  That included invalidating unencrypted security questions and answers and asking users to change their passwords 
The announcement comes as Yahoo looks to complete its  4 8  billion sale of its core Internet business to media giant Verizon Communications  which said it was notified of the Yahoo breach  within the last two days  
 We understand that Yahoo is conducting an active investigation of this matter  but we otherwise have limited information and understanding of the impact   Verizon said 
Given the unsettled nature of Yahoo s ownership just now   regulators should be concerned with who will take responsibility for the response to this compromise  It can be easy for the  right thing to do  to slip through the cracks in a multi billion dollar transition   said Tim Erlin  senior director of IT security and risk strategy at Tripwire  a computer security firm 
Yahoo Chief Executive Officer Marissa Mayer has pledged to stay on with the company through the close of the merger  which is being overseen by Verizon s Marni Walden and AOL CEO Tim Armstrong  Yahoo shares  YHOO  were flat Thursday  Verizon  VZ  shares were up 1  at  52 39 
"""

#### Solution

In [None]:
## your code here

wsj_set = set(wsj.split())
ust_set = set(ust.split())

# print("Jaccard Similarity of WSJ article and UST article: ", len(wsj_set & ust_set) / len(wsj_set | ust_set))



---



*This lesson's notebook files are available on GitHub and on Blackboard.*