# Expansion Unit: Sets

By now you know several types of objects in Python:

1. strings (`"This is a string"`),
1. integers (`5`, `18300`, `-523`),
1. floats (`5.0`, `18300.3`, `-523.21223519`),
1. Booleans (`True`, `False`),
1. lists (`["a", "b", "c"]`, `[]`, `[10, 10, 5, 8]`),
1. Counters (`{"a": 6, "b": 4, "c": 2}`).

Python provides many other data types, each one of which serves a specific purpose.
The data types above cover almost all general use cases, but sometimes a specific data structure is more convenient or more efficient.
One of those more specialized data structures is the *set*.

A set is essentially an impoverished list.
Sets cannot contain an element more than once, and they are unordered.

In [None]:
# a list can contain the same element multiple times
example_list = ["the", "boy", "likes", "the", "girl"]
# converting the list to a set removes all duplicates
example_set = set(example_list)

print(example_list)
print(example_set)

In [None]:
# a list has a specific order
example_list = ["first", "second", "third", "fourth"]
# converting the list to a set destroys the order
example_set = set(example_list)

print("Printing list items")
for item in example_list:
    print(item)
    
print("\nPrinting set items")
for item in example_set:
    print(item)

The code above may print the items in the same order in both cases, but there is no guarantee that this always happens.
In principle, Python can pull the items from the set in any order it wants.
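
If a reproducible order is needed, one option is to sort the set's items before looping over them. The sketch below is only a minimal illustration that rebuilds `example_set` from above; the variable name `ordered_items` is made up for this example. Note that `sorted` returns a new list and leaves the set itself unchanged.

```python
example_set = set(["first", "second", "third", "fourth"])

# sorted() builds a new list with the items in alphabetical order
ordered_items = sorted(example_set)

for item in ordered_items:
    print(item)

# two sets with the same items are equal, no matter in which order the items were given
print(example_set == set(["fourth", "third", "second", "first"]))
```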

Alright, so sets are a variant of lists that lacks two useful properties: order and the ability to contain multiple tokens of the same type.
Why would anybody want such an impoverished data structure?

Well, sometimes the removal of duplicate entries is a boon rather than a shortcoming.
Suppose we want to write a function that tells us for any two strings whether they contain the same characters.
This is very easy with sets.

In [None]:
def char_equivalent(string1, string2):
    # convert strings to sets of characters
    string1 = set(string1)
    string2 = set(string2)
    # the comparison itself already evaluates to True or False
    return string1 == string2
    
# let's run some tests:

# the comparison is case sensitive
print(char_equivalent("Tokyo", "Kyoto"))

# but order of characters does not matter, as desired
print(char_equivalent("tokyo", "kyoto"))

# and repetition is also fine
print(char_equivalent("New York", "New New York"))
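
If differences in capitalization should be ignored, one possible variation lowercases both strings before converting them to sets. This is only a sketch; the function name `char_equivalent_nocase` is made up for this example and is not part of the course code.

```python
def char_equivalent_nocase(string1, string2):
    # lowercase both strings so that "T" and "t" count as the same character
    string1 = set(str.lower(string1))
    string2 = set(str.lower(string2))
    return string1 == string2

# the difference in capitalization no longer matters
print(char_equivalent_nocase("Tokyo", "Kyoto"))

# strings with different characters are still told apart
print(char_equivalent_nocase("Tokyo", "Osaka"))
```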

The other advantage of sets is that they are much faster with the `in` operator.
This means that a statement like `if x in y` is computed much faster if `y` is a set rather than a list.
For our stop word removal function, for example, we would have been better off using a set of stop words rather than a list.
We can verify this by timing how long the code takes to run with a list of stopwords in comparison to a set of stopwords.

First we have to define all the custom functions and variables in the familiar fashion.
Just run the cell below to take care of that.

In [None]:
%run wordcounts.py

Next we load a list of stopwords.

In [None]:
import re
import urllib.request

# define stop words
urllib.request.urlretrieve("https://raw.githubusercontent.com/Alir3z4/stop-words/master/english.txt", "stopwords.txt")
with open("stopwords.txt", "r", encoding="utf-8") as stopwords_file:
    stopwords_list = re.findall(r"[^\n]+", stopwords_file.read())
    stopwords_set = set(stopwords_list)

The code above defines

1. a list `stopwords_list`, and
1. a **set** `stopwords_set`.

Now we can test how long the code takes to execute with `stopwords_list`.

In [None]:
def test_list():
    # empty list of words
    words = []

    # start for-loop
    for token in hamlet:
        if token not in stopwords_list:
            # add token to words
            list.append(words, token)
        
# tell Jupyter to time how long it takes to run the function
%time test_list()

Execution time depends a lot on the hardware you are running this notebook on.
On my computer, the code takes around 20 milliseconds.

In [None]:
def test_set():
    # empty list of words
    words = []

    # start for-loop
    for token in hamlet:
        if token not in stopwords_set:
            # add token to words
            list.append(words, token)
        
# tell Jupyter to time how long it takes to run the function
%time test_set()

The code with the set, on the other hand, takes about 2 milliseconds.
That's a 10-fold speed increase!
Admittedly, saving 18 milliseconds does not seem all that impressive; either way it's only a fraction of a second.
But that's only because the list of stopwords isn't all that long.
Let's try this test again with a longer list, our dictionary of 500,000 English words.

We'll now check for each word of *Hamlet* whether it occurs in the dictionary.
This takes quite a while if the dictionary is a list, but is almost instantaneous if the dictionary is a set.

In [None]:
import re
import urllib.request

url = "https://raw.githubusercontent.com/dwyl/english-words/master/words.txt"
urllib.request.urlretrieve(url, "words.txt")
with open("words.txt", "r", encoding="utf-8") as dictionary:
    dict_list = re.findall(r"[^\n]+", dictionary.read())
    dict_set = set(dict_list)

In [None]:
def test_list():
    # empty list of words
    words = []

    # start for-loop
    for token in hamlet:
        if token not in dict_list:
            # add token to words
            list.append(words, token)
        
# tell Jupyter to time how long it takes to run the function
%time test_list()

In [None]:
def test_set():
    # empty list of words
    words = []

    # start for-loop
    for token in hamlet:
        if token not in dict_set:
            # add token to words
            list.append(words, token)
        
# tell Jupyter to time how long it takes to run the function
%time test_set()

On my computer, `test_set` takes 2 milliseconds to check all of *Hamlet*, whereas `test_list` takes almost 30 seconds.
That is an enormous speed difference, and it is one that is noticeable in practice.

In general, you should not worry too much about efficiency, in particular in this class.
But if you are working on a project on your own and you notice that your program is taking quite a bit longer to run than you'd like, take a closer look: maybe there are some lists you want to convert to sets for faster `in` tests?
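
To try the comparison outside of this notebook, where `%time` and the *Hamlet* data are not available, something like the following sketch might do. It relies on Python's built-in `timeit` module; the test data and the names `numbers_list`, `numbers_set`, and `missing` are made up for illustration. The exact timings will vary from machine to machine.

```python
import timeit

# made-up test data: 100,000 integers, once as a list and once as a set
numbers_list = list(range(100000))
numbers_set = set(numbers_list)

# -1 is not in the collection, so the list has to be searched to the very end
missing = -1

# run each membership test 100 times and record the total time in seconds
list_time = timeit.timeit(lambda: missing in numbers_list, number=100)
set_time = timeit.timeit(lambda: missing in numbers_set, number=100)

print("list:", list_time, "seconds")
print("set: ", set_time, "seconds")
```

Building the set takes some time as well, so the conversion pays off mainly when the same collection is searched many times, as in the loops over *Hamlet* above.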