# Sets and Pandas

This notebook explores what a `set` is in Python, and how to work with sets in Pandas. There are an awful lot of subtleties here, so this notebook is absolutely not a full and in-depth guide to everything set-related, just a brief introduction to some of the most likely use-cases.

## Imports

In [1]:
import pandas as pd

## Sets

A **set** is a data structure, similar to a list or a tuple: a container that holds a collection of values.

A set has the following properties:

1. Unordered - a set contains information, but doesn't structure it in a particular order
2. Unindexed - because a set is unordered, you can't ask for the third item, etc.
3. Mutable - a set can change, with information being added or removed
4. Only contains unique items - a set cannot contain duplicate values; if you try to include them, it will ignore you
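
The four properties above can be sketched in a few lines. This is only an illustration; the set `colours` and its contents are made up for the example.

```python
# Hypothetical example set; the name and values are illustrative only
colours = {"red", "green", "blue", "red"}

# Unique: the duplicate "red" is silently dropped
print(len(colours))  # 3

# Unindexed: asking for an item by position raises a TypeError
try:
    colours[0]
except TypeError as err:
    indexing_error = str(err)
    print("Cannot index a set:", indexing_error)

# Mutable: items can be added after the set is created
colours.add("yellow")
print("yellow" in colours)  # True
```

Unordered is the one property that's hard to show directly: the order in which a set prints is not something you should rely on.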

In [2]:
# Create a set using curly brackets

my_set = {1, 2, 3}

In [3]:
# Display a set

print(my_set)

{1, 2, 3}


In [4]:
# Attempt to create a set with non-unique values

my_set = {1, 2, 3, 4, 4, 5, 5, 6}

In [5]:
# Display the set

print(my_set)

{1, 2, 3, 4, 5, 6}
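
One wrinkle with the curly-bracket syntax is worth a quick sketch: empty curly brackets give you a dictionary, not a set, so an empty set has to come from `set()`. The variable names here are just for illustration.

```python
empty_dict = {}     # curly brackets with nothing inside make a dict
empty_set = set()   # an empty set needs the set() constructor

print(type(empty_dict).__name__)  # dict
print(type(empty_set).__name__)   # set
```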


## Basic operations

Sets support a bunch of basic operations that will feel familiar if you've worked with lists.

In [6]:
# Add a value

my_set.add(7)

# Show it

my_set

{1, 2, 3, 4, 5, 6, 7}

In [7]:
# Remove a value

my_set.discard(7)

# .remove will also work, but throws an error if the set doesn't contain the thing to remove

# Show it

my_set

{1, 2, 3, 4, 5, 6}
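
To make the difference between `.discard` and `.remove` concrete, here is a small sketch using a made-up set called `numbers`.

```python
# Hypothetical example set
numbers = {1, 2, 3}

# discard quietly does nothing if the value is missing
numbers.discard(99)

# remove raises a KeyError if the value is missing
try:
    numbers.remove(99)
    removed = True
except KeyError:
    removed = False
    print("99 was not in the set")

print(numbers)
```

Which one you want depends on whether a missing value is a mistake worth hearing about.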

Although sets aren't ordered or indexed, you can loop through them.

In [8]:
# Loop through a set

for val in my_set:
    print(val + 10)

11
12
13
14
15
16


Converting a list into a set is a quick way to remove duplicates. You can always call `list()` to change the set back into a probably-more-familiar data structure.

In [9]:
vals = [45, 0, 23, 0, 192, 192, 17]

print(vals)

vals = set(vals)

print(vals)

[45, 0, 23, 0, 192, 192, 17]
{0, 192, 45, 17, 23}
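
Going back to a list might look like the sketch below. The name `raw_vals` is made up so that it doesn't clash with `vals` above; `sorted()` and `dict.fromkeys()` are standard Python, shown here as two possible ways to control the order of the result.

```python
raw_vals = [45, 0, 23, 0, 192, 192, 17]

# Back to a list: duplicates are gone, but the order is not guaranteed
unique_vals = list(set(raw_vals))

# sorted() accepts a set and always gives back a list, in ascending order
sorted_vals = sorted(set(raw_vals))
print(sorted_vals)  # [0, 17, 23, 45, 192]

# If the original order matters, dict.fromkeys keeps the first occurrence of each value
ordered_unique = list(dict.fromkeys(raw_vals))
print(ordered_unique)  # [45, 0, 23, 192, 17]
```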


You can check if a set contains a particular value using `in`.

In [10]:
if 17 in vals:
    print("Found!")

Found!


## More exciting sets

There's a whole [complex mathematical theory](https://en.wikipedia.org/wiki/Set_theory) behind sets and the way they work, which you don't have to understand and I definitely don't. The upshot of all this fancy stuff is that sets can do some really useful things in really efficient ways.

To demonstrate this, we'll use three sets - `a`, `b`, and `c` - which have some items in common and some items not.

In [11]:
a = {0, 1, 2, 3}

b = {5, 4, 3, 2}

c = {0, 1}

In [12]:
# Join the sets together, keeping all unique elements

a.union(b)

{0, 1, 2, 3, 4, 5}

In [13]:
# Join the sets together, only keeping elements in common

a.intersection(b)

{2, 3}

In [14]:
# Join the sets together, only keeping elements that aren't shared

a.symmetric_difference(b)

{0, 1, 4, 5}

In [15]:
# Get just the elements of a that aren't in b (use b.difference(a) for the reverse)

a.difference(b)

{0, 1}

In [16]:
# Does a contain all the items in b?

a.issuperset(b)

False

In [17]:
# Is everything in c also contained in a?

c.issubset(a)

True
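
Each of the methods above also has an operator form, which you may run into in other people's code. The sketch below repeats the same `a`, `b`, and `c` so that it runs on its own.

```python
a = {0, 1, 2, 3}
b = {5, 4, 3, 2}
c = {0, 1}

print(a | b)   # union
print(a & b)   # intersection
print(a ^ b)   # symmetric difference
print(a - b)   # difference: in a but not in b
print(b - a)   # difference the other way round: in b but not in a
print(a >= b)  # issuperset
print(c <= a)  # issubset
```

One difference to be aware of: the methods accept any iterable (a list, for example), while the operators need a set on both sides.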

## Sets in Pandas

A column in a Pandas dataframe can contain sets; these get recorded as the `object` type.

Sets are relatively rare in dataframes - some models (particularly association rules ones, for some reason) might result in sets, but it's generally not a common data structure to work with in Pandas. Pandas can handle them, but doesn't like doing it, and whenever you have a column in Pandas containing a container of some kind, you'll need to work around some fiddly bits. Working with sets inside Pandas is possible, but rarely fun or easily intuitive.

The most likely thing you'll want to do with a column containing sets is filter based on it, returning only rows in which the set in a particular column contains a specified value. There are essentially (as far as I am aware) two ways to get this functionality:

1. Pretend that the sets are strings
2. Use lambda functions to call set methods

To demonstrate these two approaches, we'll use an example dataframe with two columns - one containing just a number, and one containing a set.

In [18]:
# Make the dataframe

df = pd.DataFrame(columns=["id", "set"],
                  data=[[1, {"wyvern", "dragon", "harpy"}],
                        [2, {"fairy", "harpy", "nixie"}],
                        [3, {"kelpie", "fairy", "redcap"}],
                        [4, {"basilisk", "wyvern", "dragon"}],
                        [5, {"basilisk", "kelpie", "harpie"}]])

In [19]:
# View it

df.head()

| | id | set |
|---|---|---|
| 0 | 1 | {wyvern, harpy, dragon} |
| 1 | 2 | {fairy, harpy, nixie} |
| 2 | 3 | {fairy, redcap, kelpie} |
| 3 | 4 | {wyvern, basilisk, dragon} |
| 4 | 5 | {harpie, basilisk, kelpie} |


In [20]:
# Check the types

df.dtypes

id      int64
set    object
dtype: object

### Treating sets as strings

Pandas doesn't like working with sets, but it has a whole bunch of methods to work with strings. If you convert your set column into strings, you can take advantage of those methods at the minor expense of not being able to use set methods (which are harder to use in Pandas).

Converting a set to a string gives you a string that looks like a set, with all the values visible. When you're dealing with simple elements (integers, strings, etc.), that's all you need.

The set `{0, 1, 2, 3}` becomes the string `"{0, 1, 2, 3}"`.

In [21]:
# View a as a string

str(a)

'{0, 1, 2, 3}'

To convert a column of sets in Pandas, you can use `.astype(str)`. You can then chain that together with `.str.` methods such as `.str.contains` to check if a value is inside any of the sets.

Remember, you have to do the `.astype(str)` first - `.str.contains` doesn't understand how to work with sets, so it just returns `NaN` for everything.
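
A minimal sketch of that failure mode, using a made-up two-row series called `creatures` rather than the dataframe above. This is how I'd expect current Pandas versions to behave; treat it as an illustration rather than a guarantee.

```python
import pandas as pd

# Hypothetical two-row series of sets, standing in for a set column
creatures = pd.Series([{"wyvern", "harpy"}, {"fairy", "nixie"}])

# Without astype(str), the elements are not strings, so every result is missing
without_cast = creatures.str.contains("harpy")
print(without_cast.isna().all())

# With astype(str), each set is turned into its string form first
with_cast = creatures.astype(str).str.contains("harpy")
print(with_cast.tolist())
```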

In [22]:
# Check if a set-column contains a particular value

df["set"].astype(str).str.contains("harpy")

0     True
1     True
2    False
3    False
4    False
Name: set, dtype: bool

You can then take that boolean filter and use it to filter your dataframe as you normally would.

In [23]:
df[df["set"].astype(str).str.contains("harpy")]

| | id | set |
|---|---|---|
| 0 | 1 | {wyvern, harpy, dragon} |
| 1 | 2 | {fairy, harpy, nixie} |


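The `.astype(str)` trick isn't just good for checking if a set contains a value - you could use `.str.replace()`, for example, to change values in a set. Bear in mind that the result is a column of strings that look like sets, not a column of actual sets.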

It's really important to remember that this is a hack - you're not working with the data as it actually is, but converting it to appear differently and then treating the appearance as the data. It will work, but it's messy, and could cause problems when used in more complex ways. I'm not saying not to use it - hacks exist for a reason, and this is often the fastest route to the result that you want - but just be very careful with it, and make sure you are clear on what is happening at every stage.
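
One concrete way the hack can bite is that `.str.contains` matches text anywhere in the string, so part of a word counts as a hit. The sketch below uses a made-up series called `creatures` and searches for the partial word "harp"; the plain `in` check against the real sets is included only for comparison.

```python
import pandas as pd

# Hypothetical series of sets; "harpie" mirrors the spelling in the example dataframe
creatures = pd.Series([{"wyvern", "harpy"}, {"kelpie", "harpie"}, {"fairy", "redcap"}])

# String approach: "harp" matches any set whose text contains those letters
as_text = creatures.astype(str).str.contains("harp")
print(as_text.tolist())  # [True, True, False]

# Checking the real sets: only an exact element counts as a match
as_sets = ["harp" in s for s in creatures]
print(as_sets)  # [False, False, False]
```

Neither set actually contains "harp", but the string version reports two matches.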

### Lambda functions

The more stable method, albeit conceptually harder, is to use lambda functions on your dataframe, and then use set methods inside those functions.

For the purposes of this notebook, I'm going to only briefly touch on lambda functions themselves; [here's a more in-depth tutorial on them](https://github.com/Peritract/data-snippets/blob/master/Lambda%20functions%20and%20Apply.ipynb) if they're not too familiar to you.

A lambda function, matched with a Pandas `.apply()`, lets you transform each value in a column in the same way. Lambda functions let you write custom behaviour, so that you can work with the sets directly, as sets, taking advantage of their built-in functions.

In [24]:
# Example lambda with an apply
# For each value in the id column, give back that value + 7

df["id"].apply(lambda val: val + 7)

0     8
1     9
2    10
3    11
4    12
Name: id, dtype: int64

`.apply()` lets you use the same function on every row of a column.

`lambda val:` takes the value given it by `.apply()` (once for each row) and transforms it in some way.

`val + 7` takes the value passed in and returns that value + 7. 1 becomes 8, 2 becomes 9, etc.

The same basic process can be applied to sets, not just integer values.

In [25]:
# Lambda with an apply for a set
# For each set in the set column, check if it contains a particular value

df["set"].apply(lambda x: "harpy" in x)

0     True
1     True
2    False
3    False
4    False
Name: set, dtype: bool

`lambda x:` takes the set given it by `.apply()` (once for each row).

`"harpy" in x` checks if the value "harpy" is in the set, and returns `True` or `False`.

Just as with the `.astype(str)` method, you can use the result of this `.apply()` and lambda to filter the dataframe.

In [26]:
df[df["set"].apply(lambda x: "harpy" in x)]

| | id | set |
|---|---|---|
| 0 | 1 | {wyvern, harpy, dragon} |
| 1 | 2 | {fairy, harpy, nixie} |


Lambda lets you write arbitrary code, so using this method, you can do anything with the sets in the column that you could do with a set on its own. This method is more stable than the string method, because you're treating sets as sets, rather than using the string representation. It might take a little longer to get your head round, but if you're planning on working with sets in Pandas, it's worth the effort.
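
As a rough sketch of "anything you could do with a set on its own", here are two other set methods used inside a lambda. The names `example_df` and `wanted` are made up for this example, and the dataframe is a trimmed copy of the one above so that the block runs by itself.

```python
import pandas as pd

# Hypothetical dataframe: the first four rows of the example above
example_df = pd.DataFrame(
    columns=["id", "set"],
    data=[[1, {"wyvern", "dragon", "harpy"}],
          [2, {"fairy", "harpy", "nixie"}],
          [3, {"kelpie", "fairy", "redcap"}],
          [4, {"basilisk", "wyvern", "dragon"}]],
)

# Hypothetical set of values to look for
wanted = {"harpy", "dragon"}

# Rows whose set contains every wanted value
has_all = example_df["set"].apply(lambda s: wanted.issubset(s))
print(has_all.tolist())  # [True, False, False, False]

# Rows whose set contains at least one wanted value
has_any = example_df["set"].apply(lambda s: not wanted.isdisjoint(s))
print(has_any.tolist())  # [True, True, False, True]

# Either result can be used as a filter, just like before
print(example_df[has_all]["id"].tolist())  # [1]
```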