<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Python-Sets" data-toc-modified-id="Python-Sets-1">Python Sets</a></span></li><li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes-2">Learning Outcomes</a></span></li><li><span><a href="#Sets" data-toc-modified-id="Sets-3">Sets</a></span></li><li><span><a href="#Set-methods" data-toc-modified-id="Set-methods-4">Set methods</a></span></li><li><span><a href="#Translating-business-requirements-into-code-(or-math)" data-toc-modified-id="Translating-business-requirements-into-code-(or-math)-5">Translating business requirements into code (or math)</a></span></li><li><span><a href="#2-best-uses-of-set-" data-toc-modified-id="2-best-uses-of-set--6">2 best uses of set </a></span><ul class="toc-item"><li><span><a href="#Eliminating-duplicate-entries" data-toc-modified-id="Eliminating-duplicate-entries-6.1">Eliminating duplicate entries</a></span></li><li><span><a href="#Membership-testing" data-toc-modified-id="Membership-testing-6.2">Membership testing</a></span></li></ul></li><li><span><a href="#Takeaways" data-toc-modified-id="Takeaways-7">Takeaways</a></span></li><li><span><a href="#Bonus-Material" data-toc-modified-id="Bonus-Material-8">Bonus Material</a></span></li></ul></div>

<center><h2>Python Sets</h2></center>


<center><h2>Learning Outcomes</h2></center>

__By the end of this session, you should be able to__:

- Explain the features and benefits of sets.
- Use Python's `set` methods to manipulate sets.
- Explain the most common use cases for sets:
    1. Eliminating duplicates
    2. Membership testing

Sets
-----
 
Sets have 3 important features. A set is a collection that:

1. contains only unique elements
1. is mutable
1. is unordered

In [6]:
# Let's explore sets

s = set()
s.add(42) # Mutable
print(s)

s.add(42) # Unique elements only
print(s)

s.add(1)
print(s) # Unordered by insertion

{42}
{42}
{1, 42}


Set methods
-----

In [8]:
# Let's explore set methods
# In an interactive session, type `set.` and press <tab> to list them.
# The same names are available programmatically:
[name for name in dir(set) if not name.startswith('_')]

<center><img src="../images/sets_operration.png" width="75%"/></center>

Python's set operations mirror those of mathematical sets.

You need to know each method's name and its plain-English definition. I find the images help me remember.

Source: DataCamp

[Read the docs on sets](https://docs.python.org/3/tutorial/datastructures.html#sets)
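
As a minimal sketch of the core operations pictured above, the following uses two small hypothetical sets, `a` and `b` (names and values chosen only for illustration), to show each operator next to its method equivalent.

```python
# Two small example sets (hypothetical values chosen for illustration)
a = {1, 2, 3}
b = {3, 4, 5}

# Union: items in a, in b, or in both
union = a | b                     # same as a.union(b)

# Intersection: items in both a and b
common = a & b                    # same as a.intersection(b)

# Difference: items in a that are not in b
only_a = a - b                    # same as a.difference(b)

# Symmetric difference: items in exactly one of a or b
either = a ^ b                    # same as a.symmetric_difference(b)

print(union, common, only_a, either)
```

None of these operations modify `a` or `b`; each returns a new set.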

__Set Intersection Example__
<br>
<center><img src="../images/reals.png" width="75%"/></center>

I find jokes are easier to remember than facts.

Translating business requirements into code (or math)
--------

The difficult part of data science is translating business requests into math or code.

The new CMO (Chief Marketing Officer) wants you to find the marketing channels that appear in both the new and old marketing channels.



In [None]:
old = ['tv', 'referral', 'organic', 'adwords', ]
new = ['organic', 'adwords', 'instagram', 'referral', 'tiktok']

set(new) & set(old)

In [None]:
set(new).intersection(set(old))

The new CMO (Chief Marketing Officer) wants you to find the new marketing channels that do not overlap with the old marketing channels.

In [None]:
old = ['tv', 'referral', 'organic', 'adwords', ]
new = ['organic', 'adwords', 'instagram', 'referral', 'tiktok']

set(new) - set(old)

In [9]:
set(new).difference(set(old))


The new CMO (Chief Marketing Officer) wants you to find the marketing channels that are either new or old, but not both new and old.

In other words, channels that appear in only one of the two lists (old or new).

In [None]:
old = ['tv', 'referral', 'organic', 'adwords', ]
new = ['organic', 'adwords', 'instagram', 'referral', 'tiktok']

set(new) ^ set(old)


In [None]:
set(new).symmetric_difference(set(old))
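
One detail worth knowing, sketched below with the same `old` and `new` lists: the named methods accept any iterable, so the second `set(...)` call is optional, whereas the operators require sets on both sides. The variable names `added`, `kept`, `changed`, and `operator_accepts_list` are illustrative and not part of the original notebook.

```python
old = ['tv', 'referral', 'organic', 'adwords']
new = ['organic', 'adwords', 'instagram', 'referral', 'tiktok']

# Methods accept any iterable, so `old` can stay a list
added = set(new).difference(old)
kept = set(new).intersection(old)
changed = set(new).symmetric_difference(old)

# Operators require sets on both sides; mixing in a list raises TypeError
try:
    set(new) - old
    operator_accepts_list = True
except TypeError:
    operator_accepts_list = False

# sorted() gives a stable, readable order for display
print(sorted(added))
print(sorted(kept))
print(sorted(changed))
print(operator_accepts_list)
```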

2 best uses of set 
-----

1. Eliminating duplicate entries

1. Membership testing

### Eliminating duplicate entries

Sets only allow unique elements, so duplicate entries are removed.

__Cardinality__

[Cardinality](https://en.wikipedia.org/wiki/Cardinality) is the number of unique elements in a collection.

Common examples of cardinality include the number of unique categories or unique words.

In [10]:
# One Fish, Two Fish, Red Fish, Blue Fish
# by Dr. Seuss 
text = """
One fish
Two fish
Red fish
Blue fish
Black fish
Blue fish
Old fish
New fish
This one has a little star
This one has a little car
"""

normalized_words = text.lower().split()
normalized_words

['one',
 'fish',
 'two',
 'fish',
 'red',
 'fish',
 'blue',
 'fish',
 'black',
 'fish',
 'blue',
 'fish',
 'old',
 'fish',
 'new',
 'fish',
 'this',
 'one',
 'has',
 'a',
 'little',
 'star',
 'this',
 'one',
 'has',
 'a',
 'little',
 'car']

In [11]:
# "Vocabulary size" is the number of unique words used in a collection of text.
len(set(normalized_words)) 

14

In [12]:
# In contrast to the total number of words
len(normalized_words)  

28
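
Converting to a set discards the original word order. If first-seen order matters, one common alternative (not part of the original notebook) is `dict.fromkeys`, which relies on dictionaries preserving insertion order in Python 3.7+. A minimal sketch with a shortened word list in the style of `normalized_words`:

```python
# A shortened, hypothetical word list in the style of normalized_words
words = ['one', 'fish', 'two', 'fish', 'red', 'fish', 'blue', 'fish']

# set() removes duplicates but does not guarantee any particular order
unique_words = set(words)

# dict.fromkeys() keeps the first occurrence of each word, in order
# (dicts preserve insertion order in Python 3.7+)
unique_in_order = list(dict.fromkeys(words))

print(len(unique_words))   # number of unique words (cardinality)
print(unique_in_order)
```

Both approaches require the items to be hashable, which holds for strings and numbers.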

### Membership testing

Membership testing asks: is a value in a collection?

In [13]:
from random import random

In [14]:
nums_list = [random() for _ in range(1_000_000)]

In [15]:
# Membership check for a list
# Slow because it has to check the list item by item (linear search)
%timeit -n 10 (.75 in nums_list)

34.3 ms ± 7.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [16]:
nums_set = set(random() for _ in range(1_000_000))

In [17]:
# Membership check for a set
# Fast because it hashes the item and looks it up directly (constant time on average)
 
%timeit -n 10 (.75 in nums_set)

118 ns ± 50.8 ns per loop (mean ± std. dev. of 7 runs, 10 loops each)
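
`%timeit` is an IPython magic and is unavailable in a plain Python script. As a rough sketch of the same comparison using the standard-library `timeit` module (the smaller collection size, the `target` value, and the `number=20` repeat count are arbitrary choices for a quick run, and absolute timings will vary by machine):

```python
from random import random
from timeit import timeit

n = 100_000  # smaller than the notebook's 1_000_000 to keep the run short
nums_list = [random() for _ in range(n)]
nums_set = set(nums_list)  # same values, stored in a set

target = 2.0  # random() returns values in [0.0, 1.0), so this is never found

# A miss forces the list to scan every element (worst case for linear search)
list_seconds = timeit(lambda: target in nums_list, number=20)

# The set hashes the target and looks it up in its hash table
set_seconds = timeit(lambda: target in nums_set, number=20)

print(f"list: {list_seconds:.6f} s, set: {set_seconds:.6f} s")
```

Searching for a value that is guaranteed to be absent makes the list's cost easy to reason about, since every element must be compared before the answer is known.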


__Membership testing is MUCH FASTER in sets than lists.__
<br>
<br>
<center><img src="https://izquotes.com/quotes-pictures/quote-the-difference-between-the-almost-right-word-the-right-word-is-really-a-large-matter-it-s-the-mark-twain-389372.jpg" width="75%"/></center>

Use `set` by default. Only use `list` if you need order or duplicate items.

You'll need order far less of the time than you think.

<center><h2>Takeaways</h2></center>

- Python implements the most important parts of mathematical sets.
- Sets are unordered collections of unique items.
- Python's sets have the attributes you would expect of sets.
- Sets should be preferred over lists. Only use lists if you need order or duplicate items.


Bonus Material
-----

Learn more about sets:
    
- https://realpython.com/python-sets/
- https://stackabuse.com/sets-in-python/