# Logical Operators

Sometimes it is useful to query data using conditions on multiple columns at once. Combining those conditions with logical operators requires some care in pandas, and this notebook covers the details.

In [44]:
import numpy as np
import pandas as pd

In [45]:
# Here is a DataFrame showing the population of 3 states at two different
# times: 2000 and 2010.

pop_data = [
        ['California', 2000, 33871648],
        ['California', 2010, 37253956],
        ['New York', 2000, 18976457],
        ['New York', 2010,  19378102],
        ['Texas', 2000, 20851820],
        ['Texas', 2010,  25145561]
]

pop_df = pd.DataFrame(pop_data, columns=['State', 'Year', 'Population'])
pop_df

,State,Year,Population
0,California,2000,33871648
1,California,2010,37253956
2,New York,2000,18976457
3,New York,2010,19378102
4,Texas,2000,20851820
5,Texas,2010,25145561


In [46]:
# The fact that each state appears more than once means that some questions we
# can ask are kind of problematic. For example, we can't just compute the mean
# of the Population column, since that would average across different states
# and also across multiple years:

pop_df['Population'].mean()

25912924.0

In [47]:
# It would be much more reasonable for us to compute the mean over a specific year:
pop_df.loc[pop_df['Year'] == 2000, 'Population'].mean()

24566641.666666668

In [48]:
# But what if we want to access a specific value? For example, what if I wanted
# to know the population of California in 2010? That is more complex. We can
# filter in two steps with an intermediate DataFrame, but this is not the best
# way to do it:
temp_df = pop_df[pop_df['State'] == 'California']
temp_df[temp_df['Year'] == 2010]

,State,Year,Population
1,California,2010,37253956


In [49]:
# What we really want is to be able to query our data based on logical conditions.
# What I am really asking is for rows where two conditions are true at the same time:
# That the state equals California, and the year equals 2010.
# This seems like something we should be able to do - after all, we can perform
# relational operations and get Boolean Series objects:
print(pop_df['State'] == 'California')
print()
print(pop_df['Year'] == 2010)

0     True
1     True
2    False
3    False
4    False
5    False
Name: State, dtype: bool

0    False
1     True
2    False
3     True
4    False
5     True
Name: Year, dtype: bool


In [50]:
# Unfortunately, if I try to use the `and` operator, it fails.
# This is because the "and" operator in Python expects to work by converting
# objects on either side to a Boolean value. How do we convert a Series object
# to a boolean?
pop_df['State'] == 'California' and pop_df['Year'] == 2010

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

In [None]:
# For example, this Series contains a combination of True and False values.
# There is no way we can take the Series as it exists here and distill it to a
# single Boolean value.
pop_df['Year'] == 2010

0    False
1     True
2    False
3     True
4    False
5     True
Name: Year, dtype: bool
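
The failure above can be reproduced without `and` at all, since `and` simply asks its left operand for a single truth value. The following is a minimal sketch that rebuilds the population data and calls `bool()` directly; the names `is_2010` and `raised` are illustrative and not part of the original notebook.

```python
import pandas as pd

pop_df = pd.DataFrame({
    'State': ['California', 'California', 'New York', 'New York', 'Texas', 'Texas'],
    'Year': [2000, 2010, 2000, 2010, 2000, 2010],
    'Population': [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
})

is_2010 = pop_df['Year'] == 2010

# `a and b` first evaluates bool(a). A Series with several elements has no
# single truth value, so pandas raises instead of guessing.
try:
    bool(is_2010)
    raised = False
except ValueError as err:
    raised = True
    print(err)
```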

In [51]:
# There are some methods that can do this - for example, any() returns True if
# any of the contents of the Series are True, and False if none are True:
(pop_df['Year'] == 2010).any()

True

In [52]:
# And the all() method returns True if all of the contents of the Series are True,
# and False if any are False:
(pop_df['Year'] == 2010).all()

False

In [53]:
# But these don't actually help us - we aren't interested in summaries of the
# contents of each boolean series, we want to identify specific rows where
# both conditions are true. This is where the bitwise logical operators help us.

In [54]:
# The following bitwise logical operators are available:
# and: &
# or: |
# not: ~
# xor: ^
# Unlike the more commonly used logical operators, they are designed to work on
# the contents of objects rather than the objects themselves.
#
# We can see if we use & to compare rows where the state is California and where
# the year is 2010, it produces a new Series object with the and operation applied
# element-wise.
(pop_df['State'] == 'California') & (pop_df['Year'] == 2010)

0    False
1     True
2    False
3    False
4    False
5    False
dtype: bool
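
The other operators in the list work the same way. As a small sketch (the variable names here are illustrative), each one combines the two Boolean Series element by element:

```python
import pandas as pd

pop_df = pd.DataFrame({
    'State': ['California', 'California', 'New York', 'New York', 'Texas', 'Texas'],
    'Year': [2000, 2010, 2000, 2010, 2000, 2010],
    'Population': [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
})

is_ca = pop_df['State'] == 'California'
is_2010 = pop_df['Year'] == 2010

either = is_ca | is_2010       # California, or 2010, or both
not_ca = ~is_ca                # every row that is not California
exactly_one = is_ca ^ is_2010  # California or 2010, but not both

print(pop_df[either])
print(pop_df[not_ca])
print(pop_df[exactly_one])
```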

In [55]:
# Note that due to Python operator precedence rules (& binds more tightly than ==),
# we must surround each relational operation with parentheses. If we don't do this,
# then Python will interpret the expression like this, which will fail:
pop_df['State'] == ('California' & pop_df['Year']) == 2010

TypeError: Cannot perform 'rand_' with a dtyped [int64] array and scalar of type [bool]
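
The precedence rule is easier to see with plain integers, where nothing raises and only the result changes. This sketch uses scalar stand-ins first, then shows one way to sidestep the problem by giving each condition a name (the names `is_ca`, `is_2010`, and `both` are illustrative):

```python
import pandas as pd

# & binds more tightly than ==, so the first line is parsed as the chained
# comparison 1 == (1 & 0) == 0, which means (1 == 0) and (0 == 0).
unparenthesized = 1 == 1 & 0 == 0
parenthesized = (1 == 1) & (0 == 0)
print(unparenthesized)  # False
print(parenthesized)    # True

pop_df = pd.DataFrame({
    'State': ['California', 'California', 'New York', 'New York', 'Texas', 'Texas'],
    'Year': [2000, 2010, 2000, 2010, 2000, 2010],
    'Population': [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
})

# Naming each comparison evaluates it on its own line, so no parentheses
# are needed when the masks are combined.
is_ca = pop_df['State'] == 'California'
is_2010 = pop_df['Year'] == 2010
both = is_ca & is_2010
```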

In [None]:
# To apply this to our DataFrame, we can use masking to get all rows that meet our
# original condition:
pop_df[(pop_df['State'] == 'California') & (pop_df['Year'] == 2010)]

,State,Year,Population
1,California,2010,37253956
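
For comparison, pandas also offers `DataFrame.query()`, which accepts the condition as a string and lets you write `and` inside it. This is an alternative to the masking approach shown in this notebook, not a replacement for it; the sketch below assumes the column names are valid Python identifiers, as they are here.

```python
import pandas as pd

pop_df = pd.DataFrame({
    'State': ['California', 'California', 'New York', 'New York', 'Texas', 'Texas'],
    'Year': [2000, 2010, 2000, 2010, 2000, 2010],
    'Population': [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
})

# Masking with bitwise & (the approach used above).
masked = pop_df[(pop_df['State'] == 'California') & (pop_df['Year'] == 2010)]

# The same condition written as a query string.
queried = pop_df.query("State == 'California' and Year == 2010")

print(queried)
```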


# Hierarchical Indexing

Sometimes, the structure of data suggests that multiple columns can index data at once. This section identifies a situation where this is true, shows how to create a multi-level index, and shows techniques for querying it.

In [57]:
# Reviewing our state population DataFrame again, we can ask the following question:
# What values in the DataFrame uniquely identify a population?
# It isn't State, because each state appears more than once.
# It isn't Year either, because each year appears more than once.
# But State and Year together - each combination of state and year uniquely identifies
# a population.
#
# This implies that both state and year together might make a good index.
# But how do we do that?
pop_df

,State,Year,Population
0,California,2000,33871648
1,California,2010,37253956
2,New York,2000,18976457
3,New York,2010,19378102
4,Texas,2000,20851820
5,Texas,2010,25145561


In [58]:
# We'll demonstrate with an example DataFrame. It is a simple DataFrame with
# 4 rows, 3 columns, and custom row indices.
pd.DataFrame(
    {'a': [1, 5, 76, 99], 'b': [3, 6, 2, 1], 'c': [9, 2, 4, 6]},
    index= [1, 5, 76, 99]
)

,a,b,c
1,1,3,9
5,5,6,2
76,76,2,4
99,99,1,6


In [59]:
# For any DataFrame, we can change the index to one of its columns by calling
# set_index(). If we call set_index() with the name of a column in the DataFrame,
# then its original index is discarded and replaced by the contents of the column.
pd.DataFrame(
    {'a': [1, 5, 76, 99], 'b': [3, 6, 2, 1], 'c': [9, 2, 4, 6]},
    index= [1, 5, 76, 99]
).set_index('a')

,b,c
a,,
1,3,9
5,6,2
76,2,4
99,1,6


In [60]:
# We can also specify a list of columns. If we do, both columns will become the
# index, and we will have created a multi-level hierarchical index:
pd.DataFrame(
    {'a': [1, 5, 76, 99], 'b': [3, 6, 2, 1], 'c': [9, 2, 4, 6]},
    index= [1, 5, 76, 99]
).set_index(['a', 'b'])

,,c
a,b,
1,3,9
5,6,2
76,2,4
99,1,6


In [61]:
# We can apply this to our population DataFrame:
pop_df_multi = pop_df.set_index(['State', 'Year'])
pop_df_multi

,,Population
State,Year,
California,2000,33871648
California,2010,37253956
New York,2000,18976457
New York,2010,19378102
Texas,2000,20851820
Texas,2010,25145561
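
One way to check the reasoning that State and Year together identify a row is to ask pandas whether the values are unique. The sketch below does that for each column and for the combined index, and also shows that `reset_index()` turns the index levels back into ordinary columns. The name `restored` is illustrative.

```python
import pandas as pd

pop_df = pd.DataFrame({
    'State': ['California', 'California', 'New York', 'New York', 'Texas', 'Texas'],
    'Year': [2000, 2010, 2000, 2010, 2000, 2010],
    'Population': [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
})

# Neither column is unique on its own.
print(pop_df['State'].is_unique)
print(pop_df['Year'].is_unique)

# Together they are, so the two-level index has no duplicate entries.
pop_df_multi = pop_df.set_index(['State', 'Year'])
print(pop_df_multi.index.is_unique)
print(pop_df_multi.index.nlevels)

# reset_index() undoes set_index() and restores the original columns.
restored = pop_df_multi.reset_index()
```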


In [62]:
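# What are the advantages of doing this? It makes it easier to look for specific
# values or ranges of values. For example, I might want all populations from 2010.
# As written, the query below uses a bare slice in every position, so it selects
# every row and returns the mean over all states and both years, not 2010 alone:
pop_df_multi.loc[:, :, :].mean()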

Population    25912924.0
dtype: float64

In [63]:
# I want all populations from California:
pop_df_multi.loc['California', :]

,Population
Year,
2000,33871648
2010,37253956


In [64]:
# I want California in 2010:
pop_df_multi.loc['California', 2010]

Population    37253956
Name: (California, 2010), dtype: int64
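
The result above is a one-element Series. If a single number is wanted instead, one option is to pass the row key as a tuple and name the column as well. This is a small sketch; `value` is an illustrative name.

```python
import pandas as pd

pop_df = pd.DataFrame({
    'State': ['California', 'California', 'New York', 'New York', 'Texas', 'Texas'],
    'Year': [2000, 2010, 2000, 2010, 2000, 2010],
    'Population': [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
})
pop_df_multi = pop_df.set_index(['State', 'Year'])

# The tuple addresses both index levels; the second argument picks the column.
value = pop_df_multi.loc[('California', 2010), 'Population']
print(value)  # 37253956
```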

In [65]:
# Unfortunately, the syntax as shown above can get messy and ambiguous.
# For example, when finding California's population, if you didn't know
# the DataFrame had two levels of index, how would you interpret this?
#
#    pop_df_multi.loc['California', 2010]
#
# Am I asking for California (at level 1 of the index) and 2010 (at level 2 of
# the index), or am I asking for a row indexed by 'California' and a column
# indexed at 2010?

In [68]:
# This ambiguity gets worse if we remove the extra colon and use a two-part key
# to ask for all populations in 2010. Am I asking for all rows in column 2010, or
# all values in level 1 of the index (state) and 2010 in level 2 of the index (year)?
# Pandas reads 2010 as a column label, finds no such column, and raises a KeyError:
# pop_df_multi.loc[:, 2010]
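
The commented-out query can be run safely inside a `try` block. This sketch assumes a reasonably recent pandas release, where the second position of a two-part `.loc` key is treated as a column label and a missing label produces a `KeyError`.

```python
import pandas as pd

pop_df = pd.DataFrame({
    'State': ['California', 'California', 'New York', 'New York', 'Texas', 'Texas'],
    'Year': [2000, 2010, 2000, 2010, 2000, 2010],
    'Population': [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
})
pop_df_multi = pop_df.set_index(['State', 'Year'])

# The only column is 'Population', so looking up a column named 2010 fails.
try:
    pop_df_multi.loc[:, 2010]
    raised = False
except KeyError as err:
    raised = True
    print('KeyError:', err)
```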

In [69]:
# A less ambiguous way of signaling your intentions to Pandas is to group all
# index-specific queries in an IndexSlice object. By convention, it is often
# assigned to a variable `idx` like this:
idx = pd.IndexSlice

In [70]:
# Here is how you can use it. This is nice because it allows you to maintain
# the structure of .loc[(row index), (column index)] by grouping all row indices
# inside one object.
#
# Here, I am asking for rows with state=California and year=2010, then all columns:
pop_df_multi.loc[idx['California', 2010], :]

Population    37253956
Name: (California, 2010), dtype: int64

In [71]:
# Here, I am asking for rows with state=all, year=2010, then all columns:
pop_df_multi.loc[idx[:, 2010], :]

,,Population
State,Year,
California,2010,37253956
New York,2010,19378102
Texas,2010,25145561


In [81]:
# Here, I am asking for rows with state=Texas, year=all, then all columns:
pop_df_multi.loc[idx['Texas', :], :]

,,Population
State,Year,
Texas,2000,20851820
Texas,2010,25145561
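
It may help to see what `IndexSlice` actually produces. Indexing it simply returns a tuple of labels and `slice` objects, so the same query can be written by hand with `slice(None)`. The sketch below also shows `DataFrame.xs()`, a different pandas method that selects by a named level and, by default, drops that level from the result. The variable names are illustrative.

```python
import pandas as pd

pop_df = pd.DataFrame({
    'State': ['California', 'California', 'New York', 'New York', 'Texas', 'Texas'],
    'Year': [2000, 2010, 2000, 2010, 2000, 2010],
    'Population': [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
})
pop_df_multi = pop_df.set_index(['State', 'Year'])
idx = pd.IndexSlice

# idx[:, 2010] is shorthand for the tuple (slice(None), 2010).
key = idx[:, 2010]
by_idx = pop_df_multi.loc[key, :]
by_tuple = pop_df_multi.loc[(slice(None), 2010), :]

# xs() selects on one named level; the Year level is dropped from the result.
by_xs = pop_df_multi.xs(2010, level='Year')
print(by_xs)
```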


In [82]:
pop_df_multi

,,Population
State,Year,
California,2000,33871648
California,2010,37253956
New York,2000,18976457
New York,2010,19378102
Texas,2000,20851820
Texas,2010,25145561


In [87]:
# All rows for California (the column indexer is omitted here):
pop_df_multi.loc[idx['California', :]]

,Population
Year,
2000,33871648
2010,37253956


In [88]:
# Mean population of California across both years:
pop_df_multi.loc[idx['California', :]].mean()

Population    35562802.0
dtype: float64
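
The per-state mean above can be extended to every state at once. One possible approach, sketched here, is to group on the State level of the index with `groupby(level=...)`; this goes slightly beyond what the notebook demonstrates, and `state_means` is an illustrative name.

```python
import pandas as pd

pop_df = pd.DataFrame({
    'State': ['California', 'California', 'New York', 'New York', 'Texas', 'Texas'],
    'Year': [2000, 2010, 2000, 2010, 2000, 2010],
    'Population': [33871648, 37253956, 18976457, 19378102, 20851820, 25145561],
})
pop_df_multi = pop_df.set_index(['State', 'Year'])

# Group rows that share a State label, then average Population over the years.
state_means = pop_df_multi.groupby(level='State')['Population'].mean()
print(state_means)
```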