# Indexing and selecting data

This section focuses on slicing subsets of Series and DataFrame objects. Python's built-in indexing tools can do this, but they are not the best fit for the job. Pandas provides optimized data access methods, which are recommended for production code.

# Different choices for indexing

Pandas supports the following types of multi-axis indexing:
    .loc is mainly for label based operations, but boolean arrays are also accepted. A KeyError is raised when an item isn't found. Possible inputs include:

            A single label (e.g. 'avocado' or 3, which is interpreted as an index label and not as a position)

            A list or array of labels (e.g. ['avocado', 'banana'])

            A slice object with labels (e.g. 'avocado':'banana'). Unlike normal Python slicing, both the start and the stop are included.

            A boolean array

            A callable function with a single argument that yields a valid indexing output from the above list.

    .iloc is primarily integer position based (from 0 to length-1 of the axis) but also accepts boolean arrays. When an indexer is out of bounds .iloc raises an IndexError, except for slice indexers, which allow out-of-bounds values. Valid inputs include:

            An integer (e.g. 9)

            A list or array of integers (e.g. [3, 6, 2])

            A slice object with ints (e.g. 0:3)

            A boolean array

            A callable function with a single argument that yields a valid indexing output from the above list.

    .loc, .iloc, and [] all accept callable functions as indexers.
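
The input kinds listed above can be sketched on a small Series. This is only an illustration: the `prices` data and its labels are made up, and the variable names are not part of the original notebook.

```python
import pandas as pd

# Hypothetical example data; labels and values are invented for illustration.
prices = pd.Series([1.5, 0.25, 3.0, 2.0],
                   index=['avocado', 'banana', 'cherry', 'date'])

# .loc: label based
single = prices.loc['banana']                   # a single label
several = prices.loc[['avocado', 'cherry']]     # a list of labels
label_slice = prices.loc['avocado':'cherry']    # label slice, stop label included
by_mask = prices.loc[prices > 1.0]              # a boolean array
by_callable = prices.loc[lambda s: s < 2.0]     # a callable returning a boolean array

# .iloc: position based
first = prices.iloc[0]                          # a single integer
picked = prices.iloc[[3, 1]]                    # a list of integers
pos_slice = prices.iloc[0:2]                    # integer slice, stop position excluded
```

Note the difference between the two slices: the label slice returns three rows, while the positional slice `0:2` returns two.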

When working with multiple axes, the following notation applies. Null slices (':') can be used for any accessor but can also be left out (e.g. df.loc['b'] is equivalent to df.loc['b', :]).

For a series object the format is s.loc[indexer].
For a DataFrame object the format is df.loc[row_indexer, column_indexer]
For a Panel object the format is p.loc[item_indexer, major_indexer, minor_indexer]

# Basics

The primary function of indexing with [] notation is to select lower-dimensional slices.
    For a Series, series[label] returns a scalar value.

    For a DataFrame, df[colname] returns the Series matching colname.

    For a Panel, panel[itemname] returns the DataFrame matching itemname.

In [1]:
#importing modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
#constructing a simple dataframe to work with
index = list('abcde')
df = pd.DataFrame(np.random.randn(5, 3), index = index, columns = ['happy', 'sad', 'angry'])
df

Unnamed: 0,happy,sad,angry
a,0.959104,-0.053295,0.218944
b,-0.830617,-1.34374,-0.588519
c,-0.831723,0.670823,-1.21519
d,1.673634,-0.268827,0.19293
e,-1.251227,-0.050444,0.021572


In [3]:
#constructing a panel (pd.Panel was removed in pandas 0.25, so this cell needs an older version)
panel = pd.Panel({'alpha': df, 'beta' : df - df['angry'].mean()})
panel

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 3 (minor_axis)
Items axis: alpha to beta
Major_axis axis: a to e
Minor_axis axis: happy to angry

In [4]:
#selecting a slice of a series taken from a dataframe taken from a panel
df1 = panel['beta']
s = df1['happy']
s[2:4]

c   -0.557671
d    1.947686
Name: happy, dtype: float64

In [5]:
#we can also select multiple columns by passing in a list using [[]].
s = df[['angry', 'sad']]
s[0:3]

Unnamed: 0,angry,sad
a,0.218944,-0.053295
b,-0.588519,-1.34374
c,-1.21519,0.670823


This same process can also be used to set multiple columns.

In [6]:
df[['sad', 'angry']] = df1[['sad', 'angry']]
df

Unnamed: 0,happy,sad,angry
a,0.959104,0.220758,0.492997
b,-0.830617,-1.069688,-0.314467
c,-0.831723,0.944876,-0.941137
d,1.673634,0.005226,0.466982
e,-1.251227,0.223609,0.295625


This might be useful for applying in-place transformations to a subset of columns. However, it is important to note that pandas aligns all axes when setting Series and DataFrame objects using .loc and .iloc.

The following fails to modify df because column alignment precedes value assignment.

In [7]:
#incorrect method
df.loc[:, ['sad', 'angry']] = df[['angry', 'sad']]
df

Unnamed: 0,happy,sad,angry
a,0.959104,0.220758,0.492997
b,-0.830617,-1.069688,-0.314467
c,-0.831723,0.944876,-0.941137
d,1.673634,0.005226,0.466982
e,-1.251227,0.223609,0.295625


The correct method uses the raw values as follows

In [8]:
df.loc[:, ['sad', 'angry']] = df[['angry', 'sad']].values
df

Unnamed: 0,happy,sad,angry
a,0.959104,0.492997,0.220758
b,-0.830617,-0.314467,-1.069688
c,-0.831723,-0.941137,0.944876
d,1.673634,0.466982,0.005226
e,-1.251227,0.295625,0.223609
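
The difference between the aligned and the raw-value assignment can be checked on a tiny frame. This is a minimal sketch with made-up column names `A` and `B`; it uses `.to_numpy()`, which returns the same raw array as the `.values` attribute used in the cell above.

```python
import pandas as pd

original = pd.DataFrame({'A': [1.0, 2.0], 'B': [10.0, 20.0]})

# Aligned assignment: the right-hand side is matched to the target by column
# label first, so A receives A and B receives B, and nothing changes.
aligned = original.copy()
aligned.loc[:, ['B', 'A']] = aligned[['A', 'B']]

# Raw values carry no labels, so they are assigned purely by position
# and the two columns end up swapped.
swapped = original.copy()
swapped.loc[:, ['B', 'A']] = swapped[['A', 'B']].to_numpy()
```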


# Attribute Access

Directly accessible attributes include an index label of a Series, a DataFrame column, and a Panel item. In the IPython environment you can also use tab completion to access these values.

In [9]:
#accessing a series attribute
s = df['happy']
s.b

-0.8306172652622016

In [10]:
#on a dataframe
df.sad

a    0.492997
b   -0.314467
c   -0.941137
d    0.466982
e    0.295625
Name: sad, dtype: float64

In [11]:
#on a panel
panel.beta

Unnamed: 0,happy,sad,angry
a,1.233157,0.220758,0.492997
b,-0.556565,-1.069688,-0.314467
c,-0.557671,0.944876,-0.941137
d,1.947686,0.005226,0.466982
e,-0.977174,0.223609,0.295625


We can also use this process to modify existing objects.

In [12]:
#modifying a series value
s.b = 2
s.b

2.0

In [13]:
#modifying column values
df.happy = list(range(len(df.index)))
df

Unnamed: 0,happy,sad,angry
a,0,0.492997,0.220758
b,1,-0.314467,-1.069688
c,2,-0.941137,0.944876
d,3,0.466982,0.005226
e,4,0.295625,0.223609


In [14]:
#a new column must be created with [] notation; attribute assignment does not create a column
df['glad'] = df.happy - df.sad
df

Unnamed: 0,happy,sad,angry,glad
a,0,0.492997,0.220758,-0.492997
b,1,-0.314467,-1.069688,1.314467
c,2,-0.941137,0.944876,2.941137
d,3,0.466982,0.005226,2.533018
e,4,0.295625,0.223609,3.704375


Some caveats:
    This access only works when the index element is a valid Python identifier.
    The attribute is not available if it conflicts with an existing method name like min or max.
    It is also unavailable if it conflicts with any of the following names: *index, major_axis, minor_axis, items, labels.*

When these cases occur, standard indexing is still valid.
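
A short sketch of these caveats, assuming a hypothetical frame `df_attr` whose column names were chosen to trigger them:

```python
import pandas as pd

# Hypothetical frame: one column name clashes with a method,
# one is not a valid Python identifier, and one is unproblematic.
df_attr = pd.DataFrame({'min': [1, 2, 3], 'my col': [4, 5, 6], 'ok': [7, 8, 9]})

ok_column = df_attr.ok                   # valid identifier, no clash: returns the column
min_is_method = callable(df_attr.min)    # df_attr.min is the DataFrame.min method, not the column
min_column = df_attr['min']              # standard indexing still reaches the column
spaced_column = df_attr['my col']        # "df_attr.my col" would be a syntax error
```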

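We can also try to assign a dict to a row of a DataFrame. Note in the output below that row d was filled with the dict's keys rather than its values, which also changed the affected columns to object dtype: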

In [15]:
df.iloc[3] = {'happy':12, 'sad':13, 'angry':14, 'glad':6}
df

Unnamed: 0,happy,sad,angry,glad
a,0,0.492997,0.220758,-0.492997
b,1,-0.314467,-1.06969,1.31447
c,2,-0.941137,0.944876,2.94114
d,happy,sad,angry,glad
e,4,0.295625,0.223609,3.70438
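
In the output above, row d holds the dict's keys, not its values. One way to sidestep this, shown here as a sketch on a small hypothetical frame, is to pass the values as a list in column order:

```python
import pandas as pd

frame = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]}, index=list('abc'))
new_row = {'x': 9, 'y': 99}

# Build the values in the frame's column order so each one lands in the right column.
frame.iloc[1] = [new_row[col] for col in frame.columns]
```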


# Slicing ranges

This section focuses on the [] operator.

With a Series, the [] operator uses the same syntax as when working with an ndarray.

In [16]:
# slicing out a series from df
s = df['angry']
#slicing a range
s[:3]

a    0.220758
b    -1.06969
c    0.944876
Name: angry, dtype: object

In [17]:
s[::3]

a    0.220758
d       angry
Name: angry, dtype: object

In [18]:
s[::-2]

e    0.223609
c    0.944876
a    0.220758
Name: angry, dtype: object

Setting works the same way as well.

In [19]:
s2 = s.copy()
s2[3] = 6
s2

a    0.220758
b    -1.06969
c    0.944876
d           6
e    0.223609
Name: angry, dtype: object

Slicing for a dataframe using the [] operator slices rows.

In [20]:
#slicing rows in a dataframe
df[2:4]

Unnamed: 0,happy,sad,angry,glad
c,2,-0.941137,0.944876,2.94114
d,happy,sad,angry,glad


# Selection by label

This section concerns the .loc accessor and other purely label based methods.

A few notes:
    Chained assignment should be avoided.
    Slicers must be compatible with, or convertible to, the index type or they will raise a TypeError (e.g. trying to slice a datetime index with integers raises this error).

To reiterate a few points: purely label based indexing in pandas is a strict inclusion protocol. Every label asked for must be in the index, or a KeyError is raised. Slices include both the start bound and the stop bound when they are present in the index. Integers, in this case, refer to labels and not positions.
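
Two of these points are easy to confuse, so here is a minimal sketch with made-up data: integers passed to .loc are labels, and a slicer that does not match the index type is rejected.

```python
import pandas as pd

# Integer labels deliberately out of positional order (invented data).
s_int = pd.Series(['x', 'y', 'z'], index=[2, 0, 1])
by_label = s_int.loc[0]       # the element whose label is 0
by_position = s_int.iloc[0]   # the element at position 0

# Slicing a DatetimeIndex with integers is rejected by .loc.
s_dates = pd.Series([10, 20, 30], index=pd.date_range('2013-01-01', periods=3))
try:
    s_dates.loc[0:2]
    raised_type_error = False
except TypeError:
    raised_type_error = True

# A slice of date-like labels is compatible with the index and includes both bounds.
date_slice = s_dates.loc['2013-01-01':'2013-01-02']
```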

The .loc attribute is the primary access method. Valid inputs include the following:
    a single label
    
    a list or array of labels
    
    a slice object with labels 'start':'finish'
    
    a boolean array
    
    a callable function

In [21]:
# generating a new dataframe to work with
df = pd.DataFrame(np.random.randn(6, 6), index = list('abcdef'), columns = [1, 2, 3, 4, 5, 6])
df

Unnamed: 0,1,2,3,4,5,6
a,-1.389047,1.482557,0.431618,1.35243,-0.741461,-0.19307
b,-0.194087,1.400704,-0.833427,-2.034406,0.608704,0.703152
c,0.283637,0.222736,-0.688027,-0.156686,-0.087464,0.620577
d,0.558529,-0.604913,1.037787,0.029727,-0.905349,-2.026783
e,0.054147,-0.502664,0.591121,-1.060757,1.377003,-0.618515
f,1.627963,1.209759,1.813845,-1.407924,-2.462283,0.492717


In [22]:
#selecting rows based on a series label within a dataframe
df[1].loc['a':'c']

a   -1.389047
b   -0.194087
c    0.283637
Name: 1, dtype: float64

In [23]:
#setting a value by label; a single .loc call avoids the chained assignment df[1].loc['a'] = ...
df.loc['a', 1] = np.nan
df

Unnamed: 0,1,2,3,4,5,6
a,,1.482557,0.431618,1.35243,-0.741461,-0.19307
b,-0.194087,1.400704,-0.833427,-2.034406,0.608704,0.703152
c,0.283637,0.222736,-0.688027,-0.156686,-0.087464,0.620577
d,0.558529,-0.604913,1.037787,0.029727,-0.905349,-2.026783
e,0.054147,-0.502664,0.591121,-1.060757,1.377003,-0.618515
f,1.627963,1.209759,1.813845,-1.407924,-2.462283,0.492717


In [24]:
#using selected rows and columns
df.loc[['b', 'd', 'f'], 1:3]

Unnamed: 0,1,2,3
b,-0.194087,1.400704,-0.833427
d,0.558529,-0.604913,1.037787
f,1.627963,1.209759,1.813845


In [25]:
#using label slices
df.loc['c':'f', 3:6]

Unnamed: 0,3,4,5,6
c,-0.688027,-0.156686,-0.087464,0.620577
d,1.037787,0.029727,-0.905349,-2.026783
e,0.591121,-1.060757,1.377003,-0.618515
f,1.813845,-1.407924,-2.462283,0.492717


In [26]:
#cross section with a label
df.loc['e']

1    0.054147
2   -0.502664
3    0.591121
4   -1.060757
5    1.377003
6   -0.618515
Name: e, dtype: float64

In [27]:
#comparing a label slice against a value produces a boolean frame
df.loc['a':'c', 1:4] < 1

Unnamed: 0,1,2,3,4
a,False,False,True,False
b,True,False,True,True
c,True,True,True,True


In [28]:
#for grabbing a value explicitly, equivalent to df.at['b', 3]
df.loc['b', 3]

-0.8334272699491675

# Slicing with labels

Slicing using the .loc accessor returns the elements between and including the start and stop labels when they are both present in the index.

In [29]:
s = df[1]
s

a         NaN
b   -0.194087
c    0.283637
d    0.558529
e    0.054147
f    1.627963
Name: 1, dtype: float64

In [30]:
#slicing a series
s.loc['b':'e']

b   -0.194087
c    0.283637
d    0.558529
e    0.054147
Name: 1, dtype: float64

When one of the two labels is missing but the index is sorted, slicing still works by selecting the labels that rank between the two.

In [31]:
s.sort_index().loc['d':'g']

d    0.558529
e    0.054147
f    1.627963
Name: 1, dtype: float64

In the same case, when the index is not sorted, an error is raised instead, so sort the index before slicing with labels that may be missing.
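
Both behaviours can be seen side by side in a small sketch. The Series here are invented for illustration, and the exception type caught (KeyError) is what current pandas raises for a missing slice bound on an unsorted index.

```python
import pandas as pd

s_sorted = pd.Series([1, 2, 3], index=['b', 'd', 'f'])
# 'c' and 'g' are not labels, but the index is sorted, so the slice is well defined.
between = s_sorted.loc['c':'g']

s_unsorted = pd.Series([1, 2, 3], index=['f', 'b', 'd'])
try:
    s_unsorted.loc['c':'g']
    raised = False
except KeyError:
    raised = True

# Sorting first makes the same slice valid again.
after_sort = s_unsorted.sort_index().loc['c':'g']
```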

# Selecting by Position

Chained assignment should be avoided.

Purely integer based indexing is available in pandas through a number of methods. All of them are 0 based; when slicing, the start bound is included and the stop bound is excluded. Using anything but an integer will raise an IndexError.

.iloc is the primary access method, with the following valid inputs:
    An integer
    A list or array of integers
    A slice object with ints

In [32]:
# creating a new series to work with
s = pd.Series(np.random.randn(6), index = list(range(0, 18, 3)))
s

0    -0.159950
3     0.713671
6     2.968385
9    -0.488876
12   -1.953149
15    1.860303
dtype: float64

In [33]:
#grabbing the third value (integer position 2 in this case)
s.iloc[2]

2.9683854497635727

In [34]:
#grabbing a central slice
s.iloc[2:4]

6    2.968385
9   -0.488876
dtype: float64

In [35]:
#setting the value of integer position 2
s.iloc[2] = 3
s

0    -0.159950
3     0.713671
6     3.000000
9    -0.488876
12   -1.953149
15    1.860303
dtype: float64

In [36]:
#generating a new dataframe to work with
df = pd.DataFrame(np.random.randn(10, 5), index = list(range(0, 40, 4)), columns = list(range(0, 10, 2)))
df

Unnamed: 0,0,2,4,6,8
0,-1.032341,0.570455,0.599838,0.607759,-0.032532
4,-0.239279,-1.391194,0.211967,0.100847,-1.166451
8,-0.721131,2.47598,-0.797685,-0.160964,1.794013
12,-0.963105,0.174867,-0.197381,-0.171538,-1.203575
16,1.179574,1.716814,-0.17502,0.462596,-0.879724
20,-1.537079,0.294939,-0.454594,-0.28948,1.779657
24,-0.920551,0.294642,-0.483766,-1.731495,1.002644
28,0.206399,-1.137371,-1.287931,-0.238863,-1.639294
32,0.132309,-0.884868,0.354093,-0.717782,0.310802
36,-0.340178,0.131652,0.039102,-0.308562,-1.186703


In [37]:
#using integer slicing, specifying rows
df.iloc[:4]

Unnamed: 0,0,2,4,6,8
0,-1.032341,0.570455,0.599838,0.607759,-0.032532
4,-0.239279,-1.391194,0.211967,0.100847,-1.166451
8,-0.721131,2.47598,-0.797685,-0.160964,1.794013
12,-0.963105,0.174867,-0.197381,-0.171538,-1.203575


In [38]:
#using integer slicing specifying rows and columns
df.iloc[3:5, 3:5]

Unnamed: 0,6,8
12,-0.171538,-1.203575
16,0.462596,-0.879724


In [39]:
#using a list of integers
df.iloc[[3, 5, 6], [3, 4]]

Unnamed: 0,6,8
12,-0.171538,-1.203575
20,-0.28948,1.779657
24,-1.731495,1.002644


The basic form for DataFrames is:
df.iloc[rows, columns]

When a cross section is desired:

In [40]:
df.iloc[3]

0   -0.963105
2    0.174867
4   -0.197381
6   -0.171538
8   -1.203575
Name: 12, dtype: float64

In [41]:
#out of bounds should be handled as well
df.iloc[7:15, 3:10]

Unnamed: 0,6,8
28,-0.238863,-1.639294
32,-0.717782,0.310802
36,-0.308562,-1.186703


When slices go out of bounds they can result in an empty DataFrame.

When a single indexer is out of bounds an IndexError is raised. Similarly, a list of indexers in which any single element is out of bounds also raises an IndexError.
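
The three out-of-bounds cases can be compared in one sketch. The frame `small` and the helper `raises_index_error` are hypothetical names introduced only for this illustration.

```python
import numpy as np
import pandas as pd

small = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('abc'))

partly_out = small.iloc[2:10]     # slice runs past the end: truncated to rows 2 and 3
fully_out = small.iloc[10:20]     # slice entirely out of bounds: empty frame


def raises_index_error(indexer):
    """Return True if small.iloc[indexer] raises IndexError."""
    try:
        small.iloc[indexer]
    except IndexError:
        return True
    return False


single_out = raises_index_error(10)       # a single integer out of bounds
list_out = raises_index_error([1, 10])    # one bad element in a list is enough
```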

# Selection by a callable

The primary indexers (.loc, .iloc, and []) can all accept a callable function as an indexer. The callable must be a function with one argument (the calling Series or DataFrame) that returns valid output for indexing.

In [42]:
df.loc[lambda df: df[0] > -1]

Unnamed: 0,0,2,4,6,8
4,-0.239279,-1.391194,0.211967,0.100847,-1.166451
8,-0.721131,2.47598,-0.797685,-0.160964,1.794013
12,-0.963105,0.174867,-0.197381,-0.171538,-1.203575
16,1.179574,1.716814,-0.17502,0.462596,-0.879724
24,-0.920551,0.294642,-0.483766,-1.731495,1.002644
28,0.206399,-1.137371,-1.287931,-0.238863,-1.639294
32,0.132309,-0.884868,0.354093,-0.717782,0.310802
36,-0.340178,0.131652,0.039102,-0.308562,-1.186703


In [43]:
df.iloc[:, lambda df: [2, 3] ]

Unnamed: 0,4,6
0,0.599838,0.607759
4,0.211967,0.100847
8,-0.797685,-0.160964
12,-0.197381,-0.171538
16,-0.17502,0.462596
20,-0.454594,-0.28948
24,-0.483766,-1.731495
28,-1.287931,-0.238863
32,0.354093,-0.717782
36,0.039102,-0.308562


In [44]:
df[lambda df: df.columns[:3]]

Unnamed: 0,0,2,4
0,-1.032341,0.570455,0.599838
4,-0.239279,-1.391194,0.211967
8,-0.721131,2.47598,-0.797685
12,-0.963105,0.174867,-0.197381
16,1.179574,1.716814,-0.17502
20,-1.537079,0.294939,-0.454594
24,-0.920551,0.294642,-0.483766
28,0.206399,-1.137371,-1.287931
32,0.132309,-0.884868,0.354093
36,-0.340178,0.131652,0.039102


In [45]:
#callable indexing can also be used in a series
s.loc[lambda s: s>1]

6     3.000000
15    1.860303
dtype: float64

With callables, you can chain data selection operations and avoid the use of a temporary variable.
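
As a sketch of what chaining saves, the same selection is written below with and without a temporary variable. The `sales` data is made up for the example.

```python
import pandas as pd

sales = pd.DataFrame({'store': ['n', 'n', 's', 's', 'e'],
                      'units': [5, 7, 1, 2, 9]})

# With a temporary variable:
totals = sales.groupby('store').sum()
large_with_temp = totals.loc[totals['units'] > 5]

# Chained: the callable receives the intermediate frame, so it needs no name.
large_chained = sales.groupby('store').sum().loc[lambda d: d['units'] > 5]
```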


The following are deprecated:
    The .ix indexer, in favor of .loc and .iloc
    Using .loc or [] with a list containing one or more missing labels, in favor of .reindex()

# Reindexing

Using .reindex() is the idiomatic way to select elements that may not be present in the index.

In [46]:
s.reindex([0, 3, 8])

0   -0.159950
3    0.713671
8         NaN
dtype: float64

Another option, which returns only the valid keys and preserves the dtype, is the following:

In [47]:
labels = [0, 3, 6]
s.loc[s.index.intersection(labels)]

0   -0.159950
3    0.713671
6    3.000000
dtype: float64

A duplicated index will raise an error for .reindex().

In [48]:
#generating a new series; its index has no duplicates, so this does not trigger the duplicated axis error
s = pd.Series(np.arange(5), index = ['a', 'b', 'c', 'd', 'e'])
labels = ['a', 'b']
s.reindex(labels)

a    0
b    1
dtype: int32

The duplication error can often be circumvented by first intersecting the desired labels with the index and then reindexing, but this will still raise an error if the resulting index is duplicated.
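
Since the cell above does not actually hit the error, here is a sketch that does. The Series `s_dup` and the helper `reindex_fails` are hypothetical; the exception caught is the ValueError that pandas raises when reindexing an axis with duplicate labels.

```python
import numpy as np
import pandas as pd

s_dup = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])


def reindex_fails(series, labels):
    """Return True if series.reindex(labels) raises ValueError."""
    try:
        series.reindex(labels)
    except ValueError:
        return True
    return False


# Reindexing directly on the duplicated index raises.
direct_fails = reindex_fails(s_dup, ['c', 'd'])

# Intersecting first keeps only 'c', which is unique, so the reindex succeeds.
labels = ['c', 'd']
safe = s_dup.loc[s_dup.index.intersection(labels)].reindex(labels)

# If the intersection selects a duplicated label ('a'), the reindex still raises.
other = ['a', 'd']
still_fails = reindex_fails(s_dup.loc[s_dup.index.intersection(other)], other)
```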

# Selecting random samples

Random selection is done using the sample() method on a Series, DataFrame, or Panel. By default it samples rows, and it accepts either a specific number of rows/columns to return or a fraction of rows.

In [49]:
#generating a new series to work with
s = pd.Series(np.arange(15), index = list('abcdefghijklmno'))
print(s)

a     0
b     1
c     2
d     3
e     4
f     5
g     6
h     7
i     8
j     9
k    10
l    11
m    12
n    13
o    14
dtype: int32


In [50]:
#without passing an argument only one row is returned
s.sample()

j    9
dtype: int32

In [51]:
#specifying a number of rows
s.sample(n=5)

f    5
d    3
e    4
g    6
b    1
dtype: int32

In [52]:
#sampling a fraction of rows
s.sample(frac = 0.66)

d     3
m    12
f     5
h     7
b     1
o    14
c     2
e     4
a     0
l    11
dtype: int32

You can sample with replacement using the replace option; otherwise sample() returns each row at most once.

In [53]:
#sampling without replacement
s.sample(n = 5, replace=False)

n    13
c     2
a     0
e     4
m    12
dtype: int32

In [54]:
#with replacement
s.sample(n=6, replace = True)

c     2
j     9
n    13
c     2
j     9
c     2
dtype: int32

By default, sample() gives each row an equal probability of being selected. To change this we can pass the weights argument. The weights can be a list, a NumPy array, or a Series, as long as they are the same length as the object being sampled. Missing values are assigned a weight of 0, and infinite values are not allowed. If the weights do not sum to one, they are normalized by dividing each weight by the sum of the weights.

In [55]:
#creating a list of weights with sum 105
weights = pd.Series(np.arange(15))
#sampling with weights with re-normalizing
s.sample(n= 5, weights = weights.values)

l    11
d     3
k    10
m    12
o    14
dtype: int32
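
The normalization step can be reproduced by hand. In this sketch (with made-up weights) only one row has a nonzero weight, which makes the draw deterministic and easy to check.

```python
import numpy as np
import pandas as pd

letters = pd.Series(np.arange(4), index=list('abcd'))

# These weights sum to 10, not 1; dividing by the sum gives the
# selection probabilities described above.
raw_weights = np.array([0, 0, 10, 0])
probabilities = raw_weights / raw_weights.sum()

# Only 'c' has a nonzero weight, so it is the only row that can be drawn.
drawn = letters.sample(n=1, weights=raw_weights)
```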

For DataFrames, a column within the df can be used as sampling weights (only when sampling rows, not columns) by passing the column name as a string.

In [56]:
#modifying our existing dataframe
df['weights'] = df[8]
del df[8]

In [57]:
#keeping only the rows with non-negative weights, since sample() does not accept negative weights
df = df[df['weights'] >=0]

In [58]:
#sampling
df.sample(n = 4, weights = 'weights')

Unnamed: 0,0,2,4,6,weights
8,-0.721131,2.47598,-0.797685,-0.160964,1.794013
20,-1.537079,0.294939,-0.454594,-0.28948,1.779657
24,-0.920551,0.294642,-0.483766,-1.731495,1.002644
32,0.132309,-0.884868,0.354093,-0.717782,0.310802


In [59]:
#we can also sample columns
df.sample(n = 2, axis = 1)

Unnamed: 0,0,2
8,-0.721131,2.47598
20,-1.537079,0.294939
24,-0.920551,0.294642
32,0.132309,-0.884868


As a final note, we can also set a seed for sample()'s random number generator through the random_state argument, using either an int or a NumPy RandomState object.

In [60]:
#the sample will always draw the same rows when given a seed (i.e. an int)
df.sample(n=3, random_state= 5)

Unnamed: 0,0,2,4,6,weights
8,-0.721131,2.47598,-0.797685,-0.160964,1.794013
20,-1.537079,0.294939,-0.454594,-0.28948,1.779657
24,-0.920551,0.294642,-0.483766,-1.731495,1.002644
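
A quick way to convince yourself of the reproducibility is to draw twice with the same seed. The frame below is a made-up stand-in for the one used above.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'value': np.arange(20)})

first_draw = frame.sample(n=5, random_state=5)
second_draw = frame.sample(n=5, random_state=5)    # same int seed, same rows
# A RandomState object built from the same seed gives the same draw.
from_state = frame.sample(n=5, random_state=np.random.RandomState(5))
```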


# Setting with Enlargement
Enlargement can be performed using either the .loc or [] operations when setting a non-existent key for that axis.

In the case of a Series, this is effectively an append operation.

In [61]:
# setting by enlargement a value for p
s['p'] = 15
s

a     0
b     1
c     2
d     3
e     4
f     5
g     6
h     7
i     8
j     9
k    10
l    11
m    12
n    13
o    14
p    15
dtype: int64

In the case of a DataFrame, either axis can be enlarged by using .loc

In [62]:
#creating a new column via enlargement using the .loc accessor
df.loc[:, 'alpha'] = 15
df
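#the warning below appears because df was produced by boolean filtering in In [57], so pandas treats it as a possible copy of a slice; filtering with .copy() avoids it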

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item_labels[indexer[info_axis]]] = value


Unnamed: 0,0,2,4,6,weights,alpha
8,-0.721131,2.47598,-0.797685,-0.160964,1.794013,15
20,-1.537079,0.294939,-0.454594,-0.28948,1.779657,15
24,-0.920551,0.294642,-0.483766,-1.731495,1.002644,15
32,0.132309,-0.884868,0.354093,-0.717782,0.310802,15


In [63]:
#label 8 already exists in the index, so this overwrites that row; a label not in the index would append a new row
df.loc[8, :] = 7
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s


Unnamed: 0,0,2,4,6,weights,alpha
8,7.0,7.0,7.0,7.0,7.0,7
20,-1.537079,0.294939,-0.454594,-0.28948,1.779657,15
24,-0.920551,0.294642,-0.483766,-1.731495,1.002644,15
32,0.132309,-0.884868,0.354093,-0.717782,0.310802,15
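
The sketch below separates the cases: enlarging a Series, enlarging each axis of a DataFrame, and overwriting an existing label. The data is invented, and `.copy()` is used on the filtered frame so that no copy-of-a-slice warning like the ones above is triggered.

```python
import pandas as pd

s_grow = pd.Series([1, 2, 3], index=list('abc'))
s_grow['d'] = 4                    # 'd' is not in the index, so it is appended

# .copy() makes the filtered frame independent of its source.
source = pd.DataFrame({'x': [1, -2, 3], 'y': [4, 5, 6]})
df_grow = source[source['x'] > 0].copy()

df_grow.loc[:, 'z'] = 15           # new column label: the column axis is enlarged
df_grow.loc[9] = 7                 # new row label 9: a row is appended
df_grow.loc[0] = 0                 # label 0 already exists: the row is overwritten
```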


# Fast scalar value getting and setting

handled in alternate version. will merge later.

# Boolean indexing

Boolean vectors can be used to filter data. The operators are | for or, & for and, and ~ for not. Each condition must be wrapped in parentheses, because Python's operator precedence would otherwise evaluate an expression such as s > 2 & s < 6 as s > (2 & s) < 6.

In [65]:
#finding s greater than 3
s[s> 3]

e     4
f     5
g     6
h     7
i     8
j     9
k    10
l    11
m    12
n    13
o    14
p    15
dtype: int64

In [71]:
#finding s less than 2 or greater than 6, remember to use parentheses
s[(s<2) | (s> 6)]

a     0
b     1
h     7
i     8
j     9
k    10
l    11
m    12
n    13
o    14
p    15
dtype: int64

In [72]:
# selecting s not less than 6, use ()
s[~(s<6)]

g     6
h     7
i     8
j     9
k    10
l    11
m    12
n    13
o    14
p    15
dtype: int64
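
To see why the parentheses matter, this sketch runs the three operators on a small made-up Series and then tries one unparenthesized expression.

```python
import pandas as pd

nums = pd.Series([1, 4, 6, 9], index=list('abcd'))

either = nums[(nums < 2) | (nums > 6)]    # or
both = nums[(nums > 2) & (nums < 9)]      # and
negated = nums[~(nums > 2)]               # not

# Without parentheses, & binds more tightly than the comparisons, so
# nums > 2 & nums < 9 is read as nums > (2 & nums) < 9, a chained comparison
# that needs the truth value of a whole Series and therefore fails.
try:
    nums[nums > 2 & nums < 9]
    ambiguous = False
except ValueError:
    ambiguous = True
```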

In the case of DataFrames, you can select rows using a boolean vector of the same length as the index (e.g. one derived from one of the DataFrame's columns).

In [74]:
#selecting all rows where column 0 is greater than 0
df[df[0]> 0]

Unnamed: 0,0,2,4,6,weights,alpha
8,7.0,7.0,7.0,7.0,7.0,7
32,0.132309,-0.884868,0.354093,-0.717782,0.310802,15


More complex criteria can be built using list comprehensions and the map method.

In [77]:
df = pd.DataFrame({'alpha': ['blue', 'green', 'butterfly', 'bombshell'],
                  'beta' :[1, 2, 3, 4],
                  'gamma':['brocolli', 'cauliflower', 'asparagus', 'coconut']})
df

Unnamed: 0,alpha,beta,gamma
0,blue,1,brocolli
1,green,2,cauliflower
2,butterfly,3,asparagus
3,bombshell,4,coconut


In [78]:
#using a map method, selecting df where column alpha observations start with b
selection_criteria = df['alpha'].map(lambda x: x.startswith('b'))
df[selection_criteria]

Unnamed: 0,alpha,beta,gamma
0,blue,1,brocolli
2,butterfly,3,asparagus
3,bombshell,4,coconut
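
The cell above uses map; a list comprehension gives the same mask. This sketch rebuilds part of the frame under the hypothetical name `foods` so that it runs on its own.

```python
import pandas as pd

foods = pd.DataFrame({'alpha': ['blue', 'green', 'butterfly', 'bombshell'],
                      'beta': [1, 2, 3, 4]})

# map() version, as in the cell above
by_map = foods['alpha'].map(lambda x: x.startswith('b'))

# equivalent list comprehension; it yields a plain list of booleans
by_listcomp = [x.startswith('b') for x in foods['alpha']]

# criteria can be combined before selecting
combined = foods[by_map & (foods['beta'] > 1)]
```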


It is possible to combine boolean vectors with other indexing expressions when using selection by label (.loc), selection by position (.iloc), and advanced indexing.
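
A minimal sketch of that combination, using a made-up frame `table`. One assumption worth flagging: .iloc is position based, so the mask is passed as a plain boolean array via `.to_numpy()` instead of as a labelled Series.

```python
import pandas as pd

table = pd.DataFrame({'alpha': ['blue', 'green', 'butterfly', 'bombshell'],
                      'beta': [1, 2, 3, 4],
                      'gamma': ['w', 'x', 'y', 'z']})
starts_with_b = table['alpha'].str.startswith('b')

# .loc: boolean Series for the rows, label slice for the columns
with_loc = table.loc[starts_with_b, 'beta':'gamma']

# .iloc: boolean array for the rows, integer positions for the columns
with_iloc = table.iloc[starts_with_b.to_numpy(), [0, 1]]
```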

# Indexing with isin
The isin() method of a Series accepts a list and returns a boolean vector that is True wherever the Series elements exist in the passed list. This allows you to select rows where one or more columns contain values of interest.

In [80]:
# selecting rows containing 4, 6, or 8
s.isin([4, 6, 8])

a    False
b    False
c    False
d    False
e     True
f    False
g     True
h    False
i     True
j    False
k    False
l    False
m    False
n    False
o    False
p    False
dtype: bool

In [82]:
#pulling out rows where the previous condition is true
s[s.isin([4, 6, 8])]

e    4
g    6
i    8
dtype: int64

In [86]:
#we can use the same method on index objects in cases where we don't know if the labels we need are present
s[s.index.isin(['f', 'j', 'm'])]

f     5
j     9
m    12
dtype: int64

In [91]:
# we can also utilize this method with multiindex to check membership
s_mi = pd.Series(np.arange(12), index =  pd.MultiIndex.from_product([[0, 1, 2], ['a', 'b', 'c', 'd']]))
#checking membership
s_mi.iloc[s_mi.index.isin([(0, 'a'), (2, 'd')])]
#using the level argument
s_mi.iloc[s_mi.index.isin(['b', 'c'], level = 1)]

0  b     1
   c     2
1  b     5
   c     6
2  b     9
   c    10
dtype: int32

The isin() method is also available on DataFrames, where it can be passed either an array or a dict of values. When isin() is passed an array, it returns a DataFrame of booleans with the same shape as the original, True wherever the element is in the sequence of values.

In [93]:
# the values we are looking for are 'butterfly', 2, and 'cauliflower'
values = ['butterfly', 2, 'cauliflower']
df.isin(values)

Unnamed: 0,alpha,beta,gamma
0,False,False,False
1,False,True,True
2,True,False,False
3,False,False,False


To match certain values with certain columns you will want to make values a dict where the key is the column and the value is a list of items to check for.

In [94]:
df

Unnamed: 0,alpha,beta,gamma
0,blue,1,brocolli
1,green,2,cauliflower
2,butterfly,3,asparagus
3,bombshell,4,coconut


In [100]:
values = {'alpha': ['blue', 'green'],'beta':[2, 3], 'gamma': ['cauliflower','asparagus', 'coconut']}
df.isin(values)

Unnamed: 0,alpha,beta,gamma
0,True,False,False
1,True,True,True
2,False,True,True
3,False,False,True


In [103]:
#combining isin() with the any() and all() methods lets us quickly subset data that matches our criteria
#using our previous values, this returns only the rows where all criteria are met
row_mask = df.isin(values).all(axis=1)
df[row_mask]

Unnamed: 0,alpha,beta,gamma
1,green,2,cauliflower
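
The cell above uses all(); the sketch below contrasts it with any() on a smaller, made-up frame so the difference is visible row by row.

```python
import pandas as pd

items = pd.DataFrame({'alpha': ['blue', 'green', 'butterfly'],
                      'beta': [1, 2, 3]})
wanted = {'alpha': ['blue', 'green'], 'beta': [2, 3]}

membership = items.isin(wanted)               # boolean frame, same shape as items
all_match = items[membership.all(axis=1)]     # every listed column matched
any_match = items[membership.any(axis=1)]     # at least one column matched
```

Here only row 1 matches on both columns, while every row matches on at least one.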


# The where() method and masking