# Indexing and selecting data

This section focuses on slicing and selecting subsets of Series and DataFrame objects. Python's built-in indexing tools can do this but, while capable, are not the best fit for the job. Pandas provides optimized data access methods that are recommended for production code.

# Different choices for indexing

Pandas supports the following types of multi-axis indexing:
    .loc is mainly label based but also accepts boolean arrays. A KeyError is raised when an item isn't found. Valid inputs include:

            A single label (e.g. 'avocado' or 3, which is interpreted as an index label and not as an integer position)

            A list or array of labels (e.g. ['avocado', 'banana'])

            A slice object with labels (e.g. 'avocado':'banana'). Unlike normal Python slices, both the start and the stop are included.

            A boolean array

            A callable function with a single argument that yields a valid indexing input from the above list.

    .iloc is primarily integer position based (from 0 to length-1 of the axis) but also accepts boolean arrays. When an indexer is out of bounds, .iloc raises an IndexError, except for slice indexers, which may go out of bounds. Valid inputs include:

            An integer (e.g. 9)

            A list or array of integers (e.g. [3, 6, 2])

            A slice object with ints (e.g. 0:3)

            A boolean array

            A callable function with a single argument that yields a valid indexing input from the above list.

    .loc, .iloc, and [] all accept callable functions as indexers.

When working with multiple axes, the following notation applies. Null slices (:) can be used for any accessor, and trailing ones can be left out (e.g. df.loc['b'] is equivalent to df.loc['b', :]).

For a Series object the format is s.loc[indexer].
For a DataFrame object the format is df.loc[row_indexer, column_indexer].
For a Panel object the format is p.loc[item_indexer, major_indexer, minor_indexer].
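
The difference between label based and position based access is easiest to see when integer labels do not match integer positions. The following is a minimal sketch under that assumption; the Series and its values are hypothetical and not part of the notebook's data.

```python
import pandas as pd

# Hypothetical data: integer labels that deliberately differ from positions.
s = pd.Series([10, 20, 30, 40], index=[3, 2, 1, 0])

by_label = s.loc[3]       # label 3 belongs to the first element
by_position = s.iloc[3]   # position 3 is the last element

# Label slices include both endpoints; position slices exclude the stop.
label_slice = s.loc[3:1]       # labels 3, 2 and 1
position_slice = s.iloc[0:2]   # positions 0 and 1 only

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```

Keeping labels and positions visibly different in small experiments like this makes it harder to confuse the two accessors.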

# Basics

The primary function of indexing with [] notation is to select lower dimensional slices.
    For a Series, series[label] returns a scalar value.

    For a DataFrame, df[colname] returns the Series matching colname.

    For a Panel, panel[itemname] returns the DataFrame matching itemname.

In [2]:
#importing modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [6]:
#constructing a simple dataframe to work with
index = list('abcde')
df = pd.DataFrame(np.random.randn(5, 3), index = index, columns = ['happy', 'sad', 'angry'])
df

Unnamed: 0,happy,sad,angry
a,1.467524,-1.399923,0.060172
b,0.427773,0.292764,-0.038432
c,1.352089,1.518014,-0.424001
d,-0.650626,0.773246,0.495996
e,-0.83173,-0.416825,0.397891


In [7]:
#constructing a panel
panel = pd.Panel({'alpha': df, 'beta' : df - df['angry'].mean()})
panel

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 3 (minor_axis)
Items axis: alpha to beta
Major_axis axis: a to e
Minor_axis axis: happy to angry

In [8]:
#selecting a dataframe from a panel, a series from that dataframe, then a slice of the series
df1 = panel['beta']
s = df1['happy']
s[2:4]

c    1.253764
d   -0.748951
Name: happy, dtype: float64

In [9]:
#we can also select multiple columns by passing in a list using [[]].
s = df[['angry', 'sad']]
s[0:3]

Unnamed: 0,angry,sad
a,0.060172,-1.399923
b,-0.038432,0.292764
c,-0.424001,1.518014


This same process can also be used to set multiple columns.

In [10]:
df[['sad', 'angry']] = df1[['sad', 'angry']]
df

Unnamed: 0,happy,sad,angry
a,1.467524,-1.498248,-0.038153
b,0.427773,0.194439,-0.136757
c,1.352089,1.419689,-0.522326
d,-0.650626,0.674921,0.397671
e,-0.83173,-0.51515,0.299566


This can be useful for applying in-place transformations to a subset of columns. However, it is important to note that pandas aligns all axes when setting Series and DataFrame objects using .loc and .iloc.

The following fails to modify df because column alignment precedes value assignment.

In [11]:
#incorrect method
df.loc[:, ['sad', 'angry']] = df[['angry', 'sad']]
df

Unnamed: 0,happy,sad,angry
a,1.467524,-1.498248,-0.038153
b,0.427773,0.194439,-0.136757
c,1.352089,1.419689,-0.522326
d,-0.650626,0.674921,0.397671
e,-0.83173,-0.51515,0.299566


The correct method uses the raw values, as follows:

In [12]:
df.loc[:, ['sad', 'angry']] = df[['angry', 'sad']].values
df

Unnamed: 0,happy,sad,angry
a,1.467524,-0.038153,-1.498248
b,0.427773,-0.136757,0.194439
c,1.352089,-0.522326,1.419689
d,-0.650626,0.397671,0.674921
e,-0.83173,0.299566,-0.51515
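
The alignment behaviour can be checked on a small, deterministic frame. This is a sketch with hypothetical column names A and B; it uses .to_numpy(), which returns the same kind of raw array as the .values attribute used above.

```python
import pandas as pd

# Hypothetical two-column frame with easily recognisable values.
df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]})

# Assigning a DataFrame: columns are aligned by label first,
# so A is written back to A and B to B, and nothing changes.
aligned = df.copy()
aligned.loc[:, ['B', 'A']] = aligned[['A', 'B']]

# Assigning the raw array: no labels are attached, so values are
# written by position and the two columns are swapped.
swapped = df.copy()
swapped.loc[:, ['B', 'A']] = swapped[['A', 'B']].to_numpy()

print(aligned)
print(swapped)
```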


# Attribute Access

Directly accessible attributes include an index label of a Series, a column of a DataFrame, and an item of a Panel. In the IPython environment you can also use tab completion to access these values.

In [15]:
#accessing a series index label as an attribute
s = df['happy']
s.b

0.42777283518234194

In [16]:
#on a dataframe
df.sad

a   -0.038153
b   -0.136757
c   -0.522326
d    0.397671
e    0.299566
Name: sad, dtype: float64

In [17]:
#on a panel
panel.beta

Unnamed: 0,happy,sad,angry
a,1.369199,-1.498248,-0.038153
b,0.329448,0.194439,-0.136757
c,1.253764,1.419689,-0.522326
d,-0.748951,0.674921,0.397671
e,-0.930055,-0.51515,0.299566


We can also use this process to modify existing objects.

In [18]:
#modifying a series value
s.b = 2
s.b

2.0

In [19]:
#modifying column values
df.happy = list(range(len(df.index)))
df

Unnamed: 0,happy,sad,angry
a,0,-0.038153,-1.498248
b,1,-0.136757,0.194439
c,2,-0.522326,1.419689
d,3,0.397671,0.674921
e,4,0.299566,-0.51515


In [20]:
#to create a new column the notation is as follows
df['glad'] = df.happy - df.sad
df

Unnamed: 0,happy,sad,angry,glad
a,0,-0.038153,-1.498248,0.038153
b,1,-0.136757,0.194439,1.136757
c,2,-0.522326,1.419689,2.522326
d,3,0.397671,0.674921,2.602329
e,4,0.299566,-0.51515,3.700434


Some caveats:
    This access only works when the index element is a valid Python identifier.
    The attribute is not available if it conflicts with an existing method name such as min or max.
    It is also unavailable if it conflicts with any of the following names: index, major_axis, minor_axis, items, labels.

When these cases occur, standard indexing is still valid.
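
The caveats can be illustrated with a short sketch. The column names below are hypothetical and were chosen only to trigger each case.

```python
import pandas as pd

# Hypothetical columns: 'min' collides with a method name and
# 'total sales' is not a valid Python identifier.
df = pd.DataFrame({
    'min': [3, 1, 2],
    'total sales': [10, 20, 30],
    'price': [1.5, 2.5, 3.5],
})

price = df.price                   # valid identifier, no conflict: the column
min_is_method = callable(df.min)   # df.min is still the DataFrame.min method
min_column = df['min']             # standard indexing reaches the column
sales = df['total sales']          # only [] works for this name

print(min_is_method)
print(min_column.tolist(), sales.tolist())
```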

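We can also assign a dict to a row of a DataFrame. Note that in the output below the dict's keys, rather than its values, ended up in row d, so this form is worth verifying on your pandas version before relying on it: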

In [21]:
df.iloc[3] = {'happy':12, 'sad':13, 'angry':14, 'glad':6}
df

Unnamed: 0,happy,sad,angry,glad
a,0,-0.0381534,-1.49825,0.0381534
b,1,-0.136757,0.194439,1.13676
c,2,-0.522326,1.41969,2.52233
d,happy,sad,angry,glad
e,4,0.299566,-0.51515,3.70043


# Slicing ranges

This section focuses on the [] operator.

With a Series, the [] operator uses the same slicing syntax as an ndarray.

In [22]:
# slicing out a series from df
s = df['angry']
#slicing a range
s[:3]

a    -1.49825
b    0.194439
c     1.41969
Name: angry, dtype: object

In [23]:
s[::3]

a   -1.49825
d      angry
Name: angry, dtype: object

In [24]:
s[::-2]

e   -0.51515
c    1.41969
a   -1.49825
Name: angry, dtype: object

Setting works the same way as well.

In [29]:
s2 = s.copy()
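#assigning through a positional slice (a bare s2[3] = 6 relies on integer keys falling back to positions)
s2[3:4] = 6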
s2

a    -1.49825
b    0.194439
c     1.41969
d           6
e    -0.51515
Name: angry, dtype: object

Slicing a DataFrame with the [] operator slices rows.

In [31]:
#slicing rows in a dataframe
df[2:4]

Unnamed: 0,happy,sad,angry,glad
c,2,-0.522326,1.41969,2.52233
d,happy,sad,angry,glad
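
As a compact recap of this section, the sketch below applies the same [] slices to a small frame with hypothetical columns x and y, so the results can be checked by eye.

```python
import pandas as pd

# Hypothetical frame with a string index, as in the cells above.
df = pd.DataFrame(
    {'x': [1, 2, 3, 4, 5], 'y': [10, 20, 30, 40, 50]},
    index=list('abcde'),
)
s = df['x']

first_three = s[:3]           # positions 0, 1, 2 (labels a, b, c)
reversed_step = s[::-2]       # every second element, from the end
middle_rows = df[2:4]         # on a DataFrame, [] with a slice selects rows

print(first_three.tolist())
print(reversed_step.index.tolist())
print(middle_rows.index.tolist())
```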


# Selection by label

This section concerns the .loc accessor and other purely label based methods.

A few notes:
    Chained assignment should be avoided.
    Slicers must be compatible with, or convertible to, the index type, or a TypeError is raised (e.g. trying to slice a datetime index with integers raises this error).

To reiterate a few points: purely label based indexing in pandas is a strict inclusion protocol. Every label asked for must be in the index, and slices include both the start bound and the stop bound when they are present in the index. Integers, in this case, refer to labels and not positions.

The .loc attribute is the primary access method. Valid inputs include the following:
    A single label

    A list or array of labels

    A slice object with labels ('start':'finish')

    A boolean array

    A callable function

In [44]:
# generating a new dataframe to work with
df = pd.DataFrame(np.random.randn(6, 6), index = list('abcdef'), columns = [1, 2, 3, 4, 5, 6])
df

Unnamed: 0,1,2,3,4,5,6
a,0.622585,-0.583078,0.041826,-0.364183,0.063753,-0.460983
b,0.153781,0.232202,-0.201869,-0.26698,-0.006292,-1.996154
c,0.131929,0.515535,0.260445,1.100241,0.495811,0.104446
d,-0.542233,0.449143,0.774571,0.496205,0.78326,-0.134026
e,-1.208442,-0.067473,0.615567,-0.960694,0.673328,-0.067205
f,0.155598,-0.495639,0.340753,-0.248018,1.448017,-2.790511


In [45]:
#selecting rows based on a series label within a dataframe
df[1].loc['a':'c']

a    0.622585
b    0.153781
c    0.131929
Name: 1, dtype: float64

In [46]:
#setting a value by row and column label in a single .loc call (avoids chained assignment)
df.loc['a', 1] = np.nan
df

Unnamed: 0,1,2,3,4,5,6
a,,-0.583078,0.041826,-0.364183,0.063753,-0.460983
b,0.153781,0.232202,-0.201869,-0.26698,-0.006292,-1.996154
c,0.131929,0.515535,0.260445,1.100241,0.495811,0.104446
d,-0.542233,0.449143,0.774571,0.496205,0.78326,-0.134026
e,-1.208442,-0.067473,0.615567,-0.960694,0.673328,-0.067205
f,0.155598,-0.495639,0.340753,-0.248018,1.448017,-2.790511


In [50]:
#using selected rows and columns
df.loc[['b', 'd', 'f'], 1:3]

Unnamed: 0,1,2,3
b,0.153781,0.232202,-0.201869
d,-0.542233,0.449143,0.774571
f,0.155598,-0.495639,0.340753


In [51]:
#using label slices
df.loc['c':'f', 3:6]

Unnamed: 0,3,4,5,6
c,0.260445,1.100241,0.495811,0.104446
d,0.774571,0.496205,0.78326,-0.134026
e,0.615567,-0.960694,0.673328,-0.067205
f,0.340753,-0.248018,1.448017,-2.790511


In [52]:
#cross section with a label
df.loc['e']

1   -1.208442
2   -0.067473
3    0.615567
4   -0.960694
5    0.673328
6   -0.067205
Name: e, dtype: float64

In [56]:
#comparing a selection with a scalar yields a boolean dataframe
df.loc['a':'c', 1:4] < 1

Unnamed: 0,1,2,3,4
a,False,True,True,True
b,True,True,True,True
c,True,True,True,False


In [57]:
#for grabbing a single value explicitly; equivalent to df.at['b', 3]
df.loc['b', 3]

-0.2018694749875987
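
The cells above build a boolean frame but do not pass a boolean array back into .loc. The following sketch shows that step on a small hypothetical frame with fixed values, together with the scalar accessor .at.

```python
import pandas as pd

# Hypothetical frame with integer column labels, mirroring the layout above.
df = pd.DataFrame(
    [[1.0, -2.0, 3.0], [-4.0, 5.0, -6.0], [7.0, 8.0, -9.0]],
    index=list('abc'),
    columns=[1, 2, 3],
)

# Boolean Series built from row 'a'; it is indexed by the column labels.
col_mask = df.loc['a'] > 0
positive_in_a = df.loc[:, col_mask]        # keeps columns 1 and 3

# Boolean Series built from column 2; it is indexed by the row labels.
rows_with_positive_2 = df.loc[df[2] > 0]   # keeps rows 'b' and 'c'

# .at is the scalar-only counterpart of .loc.
value = df.at['b', 3]

print(positive_in_a.columns.tolist())
print(rows_with_positive_2.index.tolist())
print(value)
```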

# Slicing with labels

Slicing with the .loc accessor returns the elements in between, and including, the start and stop labels when both are present in the index.

In [58]:
s = df[1]
s

a         NaN
b    0.153781
c    0.131929
d   -0.542233
e   -1.208442
f    0.155598
Name: 1, dtype: float64

In [59]:
#slicing a series
s.loc['b':'e']

b    0.153781
c    0.131929
d   -0.542233
e   -1.208442
Name: 1, dtype: float64

When one of the two labels is missing but the index is sorted, slicing still works by selecting the labels that rank between the two.

In [61]:
s.sort_index().loc['d':'g']

d   -0.542233
e   -1.208442
f    0.155598
Name: 1, dtype: float64

If the index is not sorted in this same case, an error is raised instead, so sort the index before slicing with labels that may be missing.
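
A minimal sketch of both situations, using a hypothetical Series whose index is deliberately out of order. The bounds 'c' and 'g' are both absent from the index.

```python
import pandas as pd

# Hypothetical Series with an unsorted string index.
s = pd.Series([1, 2, 3, 4], index=['d', 'a', 'f', 'b'])

# Sorted index: absent bounds are placed by rank, so the
# labels falling between 'c' and 'g' are returned.
sorted_slice = s.sort_index().loc['c':'g']

# Unsorted index: the absent bound cannot be placed, so pandas raises.
try:
    s.loc['c':'g']
    unsorted_raised = False
except KeyError:
    unsorted_raised = True

print(sorted_slice.index.tolist())
print(unsorted_raised)
```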

# Selecting by Position

Chained assignment should be avoided.

Purely integer based indexing is available in pandas through a number of methods. All of them are 0-based: when slicing, the start bound is included and the stop bound is excluded. Using anything other than an integer position, even a valid label, raises an error.

.iloc is the primary method, with the following valid inputs:
    An integer
    A list or array of integers
    A slice object with ints

In [3]:
# creating a new series to work with
s = pd.Series(np.random.randn(6), index = list(range(0, 18, 3)))
s

0     0.366016
3    -1.222416
6     0.098190
9     1.189658
12    0.382180
15   -0.222200
dtype: float64

In [5]:
#grabbing the third value (integer position 2 in this case)
s.iloc[2]

0.09819001339857099

In [6]:
#grabbing a central slice
s.iloc[2:4]

6    0.098190
9    1.189658
dtype: float64

In [9]:
#setting the value of integer position 2
s.iloc[2] = 3
s

0     0.366016
3    -1.222416
6     3.000000
9     1.189658
12    0.382180
15   -0.222200
dtype: float64

In [10]:
#generating a new dataframe to work with
df = pd.DataFrame(np.random.randn(10, 5), index = list(range(0, 40, 4)), columns = list(range(0, 10, 2)))
df

Unnamed: 0,0,2,4,6,8
0,1.389885,0.563322,-0.059504,-0.41894,-1.397282
4,-0.496035,-0.964505,0.286087,-0.887837,-0.079963
8,-0.02464,0.466169,0.64808,-1.022141,-0.604358
12,2.056765,0.918313,-0.386721,-0.080141,-0.527467
16,0.096921,0.574549,-0.367309,-2.919086,-1.905532
20,-0.00574,-0.911236,0.64834,-0.530138,-0.60311
24,-0.939329,1.673403,-1.819467,-0.082868,1.697949
28,0.171628,0.170172,0.804764,0.79886,0.039293
32,-1.53196,-1.803041,-0.354322,0.992184,1.368617
36,0.414384,-0.176184,-0.0502,1.521509,-0.250517


In [11]:
#using integer slicing, specifying rows
df.iloc[:4]

Unnamed: 0,0,2,4,6,8
0,1.389885,0.563322,-0.059504,-0.41894,-1.397282
4,-0.496035,-0.964505,0.286087,-0.887837,-0.079963
8,-0.02464,0.466169,0.64808,-1.022141,-0.604358
12,2.056765,0.918313,-0.386721,-0.080141,-0.527467


In [12]:
#using integer slicing specifying rows and columns
df.iloc[3:5, 3:5]

Unnamed: 0,6,8
12,-0.080141,-0.527467
16,-2.919086,-1.905532


In [14]:
#using a list of integers
df.iloc[[3, 5, 6], [3, 4]]

Unnamed: 0,6,8
12,-0.080141,-0.527467
20,-0.530138,-0.60311
24,-0.082868,1.697949


The basic form for DataFrames is df.iloc[rows, columns].

When a cross section is desired:

In [15]:
df.iloc[3]

0    2.056765
2    0.918313
4   -0.386721
6   -0.080141
8   -0.527467
Name: 12, dtype: float64

In [16]:
#out-of-bounds slice indexers are handled gracefully
df.iloc[7:15, 3:10]

Unnamed: 0,6,8
28,0.79886,0.039293
32,0.992184,1.368617
36,1.521509,-0.250517


When slices go out of bounds, they can result in an empty DataFrame.

When a single indexer is out of bounds, an IndexError is raised. Similarly, a list of indexers in which any single element is out of bounds also raises an IndexError.
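
The three out-of-bounds cases can be sketched as follows. The frame is hypothetical, and raises_index_error is a helper introduced only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical 4 x 3 frame.
df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list('xyz'))

clipped = df.iloc[2:10, 1:8]   # slice bounds are clipped to the data
empty = df.iloc[10:20]         # entirely out of bounds: empty result


def raises_index_error(indexer):
    """Return True if df.iloc[indexer] raises an IndexError."""
    try:
        df.iloc[indexer]
    except IndexError:
        return True
    return False


single_oob = raises_index_error(10)       # a single integer out of bounds
list_oob = raises_index_error([0, 10])    # one bad element is enough

print(clipped.shape, empty.empty, single_oob, list_oob)
```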

# Selection by a callable

The primary indexers (.loc, .iloc, and []) can all accept a callable function as an indexer. The callable must be a function with one argument (the calling Series or DataFrame) that returns a valid output for indexing.

In [27]:
df.loc[lambda df: df[0] > -1]

Unnamed: 0,0,2,4,6,8
0,1.389885,0.563322,-0.059504,-0.41894,-1.397282
4,-0.496035,-0.964505,0.286087,-0.887837,-0.079963
8,-0.02464,0.466169,0.64808,-1.022141,-0.604358
12,2.056765,0.918313,-0.386721,-0.080141,-0.527467
16,0.096921,0.574549,-0.367309,-2.919086,-1.905532
20,-0.00574,-0.911236,0.64834,-0.530138,-0.60311
24,-0.939329,1.673403,-1.819467,-0.082868,1.697949
28,0.171628,0.170172,0.804764,0.79886,0.039293
36,0.414384,-0.176184,-0.0502,1.521509,-0.250517


In [31]:
df.iloc[:, lambda df: [2, 3] ]

Unnamed: 0,4,6
0,-0.059504,-0.41894
4,0.286087,-0.887837
8,0.64808,-1.022141
12,-0.386721,-0.080141
16,-0.367309,-2.919086
20,0.64834,-0.530138
24,-1.819467,-0.082868
28,0.804764,0.79886
32,-0.354322,0.992184
36,-0.0502,1.521509


In [32]:
df[lambda df: df.columns[:3]]

Unnamed: 0,0,2,4
0,1.389885,0.563322,-0.059504
4,-0.496035,-0.964505,0.286087
8,-0.02464,0.466169,0.64808
12,2.056765,0.918313,-0.386721
16,0.096921,0.574549,-0.367309
20,-0.00574,-0.911236,0.64834
24,-0.939329,1.673403,-1.819467
28,0.171628,0.170172,0.804764
32,-1.53196,-1.803041,-0.354322
36,0.414384,-0.176184,-0.0502


In [34]:
#callable indexing can also be used in a series
s.loc[lambda s: s>1]

6    3.000000
9    1.189658
dtype: float64

Chaining data selection operations with callables lets you avoid the use of temporary variables.
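
As a sketch of such a chain, the example below aggregates and then filters in one expression. The team and score columns are hypothetical data made up for the illustration.

```python
import pandas as pd

# Hypothetical scores per team.
df = pd.DataFrame({
    'team': ['red', 'blue', 'red', 'blue', 'green'],
    'score': [40, 15, 70, 20, 120],
})

# The lambda receives the intermediate Series of totals, so the
# aggregated result never has to be bound to a temporary name.
high_totals = (
    df.groupby('team')['score']
      .sum()
      .loc[lambda totals: totals > 100]
)

print(high_totals)
```

Without the callable, the totals would first have to be stored in a variable so that they could be referenced inside the boolean condition.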


The following are deprecated:
    the .ix indexer, in favor of .loc and .iloc
    using .loc or [] with a list containing one or more missing labels, in favor of .reindex()

# Reindexing

Using .reindex() is the idiomatic way to select potentially not-found elements.

In [39]:
s.reindex([0, 3, 8])

0    0.366016
3   -1.222416
8         NaN
dtype: float64

Another option, which returns only the valid keys and preserves the dtype, is the following:

In [40]:
labels = [0, 3, 6]
s.loc[s.index.intersection(labels)]

0    0.366016
3   -1.222416
6    3.000000
dtype: float64
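
The dtype difference between the two approaches is easier to see with integer data. This is a minimal sketch with hypothetical values; label 8 is deliberately missing from the index.

```python
import pandas as pd

# Hypothetical integer Series.
s = pd.Series([10, 20, 30], index=[0, 3, 6])
labels = [0, 3, 8]

# reindex keeps every requested label and fills the missing one with NaN,
# which forces the result to a floating point dtype.
reindexed = s.reindex(labels)

# Intersecting first keeps only the labels that exist, so no NaN is
# introduced and the original integer dtype survives.
present = s.loc[s.index.intersection(labels)]

print(reindexed)
print(present)
```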

A duplicated index will raise an error for .reindex().

In [45]:
#generating a new series; its index has no duplicate labels, so .reindex() succeeds here without the duplicate-axis error
s = pd.Series(np.arange(5), index = ['a', 'b', 'c', 'd', 'e'])
labels = ['a', 'b']
s.reindex(labels)

a    0
b    1
dtype: int32

The duplication error can be circumvented by first intersecting the desired labels with the index and then reindexing. However, this will still raise an error if the resulting index is duplicated.
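
Since the cell above does not actually contain a duplicated label, here is a sketch that does. The Series is hypothetical; label 'a' appears twice and label 'd' is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a duplicated index label.
s = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])
labels = ['c', 'd']

# Reindexing directly fails because the source index has duplicates.
try:
    s.reindex(labels)
    reindex_failed = False
except ValueError:
    reindex_failed = True

# Intersecting first drops the duplicated labels that were not requested,
# leaving a unique index that can be reindexed safely.
safe = s.loc[s.index.intersection(labels)].reindex(labels)

print(reindex_failed)
print(safe)
```

The workaround only helps here because none of the requested labels is itself duplicated; asking for 'a' would leave a duplicated index after the intersection step.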

# Selecting random samples