# Indexing and selecting data

This section focuses on slicing subsets of Series and DataFrame objects. Python has built-in tools to do this that, while capable, are not the best for the job. Pandas provides optimized data access methods that are recommended for production code.

# Different choices for indexing

Pandas supports 3 types of multi-axis indexing:
    .loc is mainly for label-based operations, but it also accepts boolean arrays. A KeyError is raised when an item isn't found. Possible inputs include:

            A single label (e.g. 'avocado' or 3, which is interpreted as an index label, not a position)

            A list or array of labels (e.g. ['avocado', 'banana'])

            A slice object with labels (e.g. 'avocado':'banana'); unlike normal Python slicing, both the start and the stop are included

            A boolean array

            A callable function with a single argument that yields a valid indexing output from the above list

    .iloc is primarily integer-position based (from 0 to length-1 of the axis), but it also accepts boolean arrays. When an indexer is out of bounds .iloc raises an IndexError, except for slice indexers, which allow out-of-bounds values. Valid inputs include:

            An integer (e.g. 9)

            A list or array of integers (e.g. [3, 6, 2])

            A slice object with ints (e.g. 0:3)

            A boolean array

            A callable function with a single argument that yields a valid indexing output from the above list

    .loc, .iloc, and [] all accept callable functions as indexers.
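
As a quick illustration of the two accessors described above, here is a minimal sketch on a small made-up Series (the fruit labels and values are invented for the example). It shows that a .loc slice includes both endpoints while an .iloc slice excludes the stop position, and which exception each accessor raises for a key it cannot find.

```python
import pandas as pd

# hypothetical data, invented for illustration
fruit = pd.Series([10, 20, 30, 40], index=['avocado', 'banana', 'cherry', 'date'])

# label slice: both 'avocado' and 'cherry' are included
by_label = fruit.loc['avocado':'cherry']

# position slice: position 2 is excluded, as in normal Python slicing
by_position = fruit.iloc[0:2]

# a missing label raises KeyError with .loc
try:
    fruit.loc['elderberry']
    loc_error = None
except KeyError as exc:
    loc_error = type(exc).__name__

# an out-of-bounds integer raises IndexError with .iloc
try:
    fruit.iloc[10]
    iloc_error = None
except IndexError as exc:
    iloc_error = type(exc).__name__

print(list(by_label.index), list(by_position.index), loc_error, iloc_error)
```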

When working with multiple axes, the following notation applies. A null slice (':') can be used for any axis accessor, and axes left out of the specification are assumed to be ':' (e.g. p.loc['a'] is equivalent to p.loc['a', :, :]).

For a Series object the format is s.loc[indexer].
For a DataFrame object the format is df.loc[row_indexer, column_indexer].
For a Panel object the format is p.loc[item_indexer, major_indexer, minor_indexer].

# Basics

The primary function of indexing with [] notation is to select lower-dimensional slices.
    for a Series, series[label] returns a scalar value

    for a DataFrame, df[colname] returns the Series matching the colname

    for a Panel, panel[itemname] returns the DataFrame matching the itemname

In [1]:
#importing modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
#constructing a simple dataframe to work with
index = list('abcde')
df = pd.DataFrame(np.random.randn(5, 3), index = index, columns = ['happy', 'sad', 'angry'])
df

Unnamed: 0,happy,sad,angry
a,-0.899504,-0.99609,-0.975984
b,-1.465181,-0.303762,-1.330598
c,-1.380999,-0.456404,-1.725321
d,1.666017,0.505573,-1.512308
e,-0.150476,1.064429,-0.221653


In [3]:
#constructing a panel (pd.Panel was removed in pandas 0.25, so this cell needs an older version)
panel = pd.Panel({'alpha': df, 'beta' : df - df['angry'].mean()})
panel

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 5 (major_axis) x 3 (minor_axis)
Items axis: alpha to beta
Major_axis axis: a to e
Minor_axis axis: happy to angry

In [4]:
#selecting a slice of a series taken from a dataframe taken from a panel
df1 = panel['beta']
s = df1['happy']
s[2:4]

c   -0.227826
d    2.819190
Name: happy, dtype: float64

In [5]:
#we can also select multiple columns by passing in a list using [[]].
s = df[['angry', 'sad']]
s[0:3]

Unnamed: 0,angry,sad
a,-0.975984,-0.99609
b,-1.330598,-0.303762
c,-1.725321,-0.456404


This same process can also be used to set multiple columns.

In [6]:
df[['sad', 'angry']] = df1[['sad', 'angry']]
df

Unnamed: 0,happy,sad,angry
a,-0.899504,0.157083,0.177188
b,-1.465181,0.849411,-0.177425
c,-1.380999,0.696769,-0.572148
d,1.666017,1.658746,-0.359135
e,-0.150476,2.217601,0.93152


This might be useful for applying in-place transformations to a subset of columns. However, it is important to note that pandas aligns all axes when setting Series and DataFrame objects using .loc and .iloc.

The following fails to modify the df because column alignment precedes value assignment, so each column is simply assigned to itself.

In [7]:
#incorrect method
df.loc[:, ['sad', 'angry']] = df[['angry', 'sad']]
df

Unnamed: 0,happy,sad,angry
a,-0.899504,0.157083,0.177188
b,-1.465181,0.849411,-0.177425
c,-1.380999,0.696769,-0.572148
d,1.666017,1.658746,-0.359135
e,-0.150476,2.217601,0.93152


The correct method uses the raw values, which carry no labels to align on, as follows:

In [8]:
df.loc[:, ['sad', 'angry']] = df[['angry', 'sad']].values
df

Unnamed: 0,happy,sad,angry
a,-0.899504,0.177188,0.157083
b,-1.465181,-0.177425,0.849411
c,-1.380999,-0.572148,0.696769
d,1.666017,-0.359135,1.658746
e,-0.150476,0.93152,2.217601


# Attribute Access

An index label of a Series, a column of a DataFrame, and an item of a Panel can each be accessed directly as an attribute. In the IPython environment you can also use tab completion to access these values.

In [9]:
#accessing a series attribute
s = df['happy']
s.b

-1.4651806771382072

In [10]:
#on a dataframe
df.sad

a    0.177188
b   -0.177425
c   -0.572148
d   -0.359135
e    0.931520
Name: sad, dtype: float64

In [11]:
#on a panel
panel.beta

Unnamed: 0,happy,sad,angry
a,0.253669,0.157083,0.177188
b,-0.312008,0.849411,-0.177425
c,-0.227826,0.696769,-0.572148
d,2.81919,1.658746,-0.359135
e,1.002696,2.217601,0.93152


We can also use this process to modify existing objects.

In [12]:
#modifying a series value
s.b = 2
s.b

2.0

In [13]:
#modifying column values
df.happy = list(range(len(df.index)))
df

Unnamed: 0,happy,sad,angry
a,0,0.177188,0.157083
b,1,-0.177425,0.849411
c,2,-0.572148,0.696769
d,3,-0.359135,1.658746
e,4,0.93152,2.217601


In [14]:
#to create a new column the notation is as follows
df['glad'] = df.happy - df.sad
df

Unnamed: 0,happy,sad,angry,glad
a,0,0.177188,0.157083,-0.177188
b,1,-0.177425,0.849411,1.177425
c,2,-0.572148,0.696769,2.572148
d,3,-0.359135,1.658746,3.359135
e,4,0.93152,2.217601,3.06848


Some caveats:
    This access only works when the index element is a valid Python identifier.
    The attribute is not available if it conflicts with an existing method name like min or max.
    It is also unavailable if it conflicts with any of the following names: *index, major_axis, minor_axis, items, labels.*

When any of these cases occurs, standard indexing is still valid.
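
The caveats above can be seen on a small made-up frame. This is only a sketch: the column names 'min', 'my col', and 'ok' are invented to trigger each case.

```python
import pandas as pd

# hypothetical frame: one column shadows a method, one is not a valid identifier
frame = pd.DataFrame({'min': [1, 2], 'my col': [3, 4], 'ok': [5, 6]})

# 'ok' is a valid identifier with no conflict, so attribute access works
ok_values = frame.ok.tolist()

# frame.min is still the DataFrame.min method, not the 'min' column
min_is_method = callable(frame.min)

# standard indexing reaches both awkward columns
min_values = frame['min'].tolist()
spaced_values = frame['my col'].tolist()

print(ok_values, min_is_method, min_values, spaced_values)
```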

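We can also try to assign a dict to a row of a DataFrame. Note that in the output below, row d ends up holding the dict's keys rather than its values, which also turns the columns into object dtype: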

In [15]:
df.iloc[3] = {'happy':12, 'sad':13, 'angry':14, 'glad':6}
df

Unnamed: 0,happy,sad,angry,glad
a,0,0.177188,0.157083,-0.177188
b,1,-0.177425,0.849411,1.17742
c,2,-0.572148,0.696769,2.57215
d,happy,sad,angry,glad
e,4,0.93152,2.2176,3.06848
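
One way to get the values into the row, regardless of how a given pandas version treats a dict on the right-hand side, is to pass a plain list in column order. The sketch below uses a small invented frame (the name new_row and the numbers are assumptions for illustration, not part of the notebook).

```python
import pandas as pd

# hypothetical frame, invented for illustration
frame = pd.DataFrame(
    {'happy': [0, 1, 2], 'sad': [0.5, 1.5, 2.5], 'angry': [1.0, 2.0, 3.0]},
    index=list('abc'),
)
new_row = {'happy': 12, 'sad': 13, 'angry': 14}

# build the values in the frame's own column order, then assign by position
frame.iloc[1] = [new_row[col] for col in frame.columns]

print(frame)
```

Because the list is built from frame.columns, the order of the keys in the dict does not matter.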


# Slicing ranges

This section will focus on the [] operator

With a Series the [] operator uses the same syntax as when working with an ndarray.

In [16]:
# slicing out a series from df
s = df['angry']
#slicing a range
s[:3]

a    0.157083
b    0.849411
c    0.696769
Name: angry, dtype: object

In [17]:
s[::3]

a    0.157083
d       angry
Name: angry, dtype: object

In [18]:
s[::-2]

e      2.2176
c    0.696769
a    0.157083
Name: angry, dtype: object

Setting works the same way as well.

In [19]:
s2 = s.copy()
s2[3] = 6
s2

a    0.157083
b    0.849411
c    0.696769
d           6
e      2.2176
Name: angry, dtype: object

Slicing for a dataframe using the [] operator slices rows.

In [20]:
#slicing rows in a dataframe
df[2:4]

Unnamed: 0,happy,sad,angry,glad
c,2,-0.572148,0.696769,2.57215
d,happy,sad,angry,glad


# Selection by label

This section concerns the .loc accessor and other purely label based methods.

A few notes:
    Chained assignment should be avoided.
    Slicers must be compatible with, or convertible to, the index type, or they will raise a TypeError (e.g. trying to slice a datetime index with integers raises this error).

To reiterate a few points: purely label-based indexing in pandas follows a strict inclusion protocol. Every label asked for must be in the index, or a KeyError is raised. When slicing, both the start bound and the stop bound are included if they are present in the index. Integers, in this case, refer to labels and not positions.

The .loc attribute is the primary access method for this. Valid inputs include the following:
    a single label
    
    a list or array of labels
    
    a slice object with labels 'start':'finish'
    
    a boolean array
    
    a callable function
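
To make the "integers are labels, not positions" point concrete, here is a small sketch on an invented Series whose index happens to be made of integers.

```python
import pandas as pd

# hypothetical Series whose index labels are the integers 0, 3, 6, 9
s = pd.Series(['w', 'x', 'y', 'z'], index=[0, 3, 6, 9])

by_label = s.loc[3]           # the row labelled 3
by_position = s.iloc[3]       # the fourth row

label_slice = s.loc[3:9]      # labels 3 through 9, stop label included
position_slice = s.iloc[1:3]  # positions 1 and 2 only, stop excluded

print(by_label, by_position, label_slice.tolist(), position_slice.tolist())
```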

In [221]:
# generating a new dataframe to work with
df = pd.DataFrame(np.random.randn(6, 6), index = list('abcdef'), columns = [1, 2, 3, 4, 5, 6])
df

Unnamed: 0,1,2,3,4,5,6
a,0.913293,-0.781307,-0.199023,0.416864,-0.060411,0.525741
b,0.437896,0.077776,-0.103921,0.509881,0.313029,-1.069061
c,0.500598,-0.626932,-2.023572,-0.018421,0.103889,0.467637
d,0.569298,0.200945,0.968974,1.530719,-0.662057,1.042125
e,2.313123,-1.408673,-0.785873,0.063202,0.168122,1.295151
f,0.839945,-0.365172,-0.10965,-0.358384,-0.53236,-1.849353


In [222]:
#selecting rows based on a series label within a dataframe
df[1].loc['a':'c']

a    0.913293
b    0.437896
c    0.500598
Name: 1, dtype: float64

In [223]:
#setting a value by label; a single .loc call avoids the chained assignment df[1].loc['a'] = np.nan
df.loc['a', 1] = np.nan
df

Unnamed: 0,1,2,3,4,5,6
a,,-0.781307,-0.199023,0.416864,-0.060411,0.525741
b,0.437896,0.077776,-0.103921,0.509881,0.313029,-1.069061
c,0.500598,-0.626932,-2.023572,-0.018421,0.103889,0.467637
d,0.569298,0.200945,0.968974,1.530719,-0.662057,1.042125
e,2.313123,-1.408673,-0.785873,0.063202,0.168122,1.295151
f,0.839945,-0.365172,-0.10965,-0.358384,-0.53236,-1.849353


In [224]:
#using selected rows and columns
df.loc[['b', 'd', 'f'], 1:3]

Unnamed: 0,1,2,3
b,0.437896,0.077776,-0.103921
d,0.569298,0.200945,0.968974
f,0.839945,-0.365172,-0.10965


In [225]:
#using label slices
df.loc['c':'f', 3:6]

Unnamed: 0,3,4,5,6
c,-2.023572,-0.018421,0.103889,0.467637
d,0.968974,1.530719,-0.662057,1.042125
e,-0.785873,0.063202,0.168122,1.295151
f,-0.10965,-0.358384,-0.53236,-1.849353


In [226]:
#cross section with a label
df.loc['e']

1    2.313123
2   -1.408673
3   -0.785873
4    0.063202
5    0.168122
6    1.295151
Name: e, dtype: float64

In [227]:
#comparing a label slice to a scalar yields a boolean DataFrame
df.loc['a':'c', 1:4] <1

Unnamed: 0,1,2,3,4
a,False,True,True,True
b,True,True,True,True
c,True,True,True,True


In [228]:
#for grabbing a value explicitly, equivalent to df.at['b', 3]
df.loc['b', 3]

-0.10392131216093962

# Slicing with labels

Slicing with the .loc accessor returns the elements between the start and stop labels, including both, when both are present in the index.

In [229]:
s = df[1]
s

a         NaN
b    0.437896
c    0.500598
d    0.569298
e    2.313123
f    0.839945
Name: 1, dtype: float64

In [230]:
#slicing a series
s.loc['b':'e']

b    0.437896
c    0.500598
d    0.569298
e    2.313123
Name: 1, dtype: float64

When one of the two labels is missing but the index is sorted, slicing still works by selecting the labels that rank between the two.

In [231]:
s.sort_index().loc['d':'g']

d    0.569298
e    2.313123
f    0.839945
Name: 1, dtype: float64

In this same case, when the index is not sorted, a KeyError is raised instead. In short: don't do the thing.
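
A minimal sketch of that failure, using an invented Series with an unsorted index and no 'g' label:

```python
import pandas as pd

# hypothetical Series with an unsorted index and no 'g' label
unsorted = pd.Series([1, 2, 3, 4], index=['d', 'b', 'e', 'a'])

# slicing to a missing label on an unsorted index raises KeyError
try:
    unsorted.loc['b':'g']
    error_name = None
except KeyError as exc:
    error_name = type(exc).__name__

# sorting first makes the same slice valid
sorted_slice = unsorted.sort_index().loc['b':'g']

print(error_name, sorted_slice.index.tolist())
```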

# Selecting by Position

Chained assignment should be avoided.

Purely integer-based indexing is available in pandas through a number of methods. All of them are 0-based; when slicing, the start bound is included and the stop bound is excluded. Using anything but an integer, even a valid label, will raise an error (an IndexError, according to the pandas documentation).

.iloc is the primary method, with the following valid inputs:
    an integer
    a list or array of integers
    a slice object with ints

In [232]:
# creating a new series to work with
s = pd.Series(np.random.randn(6), index = list(range(0, 18, 3)))
s

0    -0.105696
3     0.342546
6    -0.253182
9    -0.343195
12    0.586655
15   -0.410247
dtype: float64

In [233]:
#grabbing the third value (integer position 2 in this case)
s.iloc[2]

-0.25318198756080884

In [234]:
#grabbing a central slice
s.iloc[2:4]

6   -0.253182
9   -0.343195
dtype: float64

In [235]:
#setting the value of integer position 2
s.iloc[2] = 3
s

0    -0.105696
3     0.342546
6     3.000000
9    -0.343195
12    0.586655
15   -0.410247
dtype: float64

In [236]:
#generating a new dataframe to work with
df = pd.DataFrame(np.random.randn(10, 5), index = list(range(0, 40, 4)), columns = list(range(0, 10, 2)))
df

Unnamed: 0,0,2,4,6,8
0,1.530334,-0.76553,-0.454886,-0.025518,-0.382077
4,0.687364,0.155899,-0.665378,0.610816,0.90495
8,1.444345,0.587427,0.296277,1.097772,-1.358637
12,-0.79031,0.681152,-1.016647,0.330489,-0.89228
16,-0.502022,0.198242,0.354392,-0.838369,-1.710354
20,-1.330508,0.517547,0.083811,0.760611,-0.477276
24,0.818817,-1.898499,-1.154984,0.192518,0.32149
28,-1.025761,-1.169875,-0.000948,-0.405525,-0.117234
32,-0.24268,0.672276,0.422261,-0.290302,-1.001469
36,0.688537,0.125309,-1.007323,-0.203927,0.43339


In [237]:
#using integer slicing, specifying rows
df.iloc[:4]

Unnamed: 0,0,2,4,6,8
0,1.530334,-0.76553,-0.454886,-0.025518,-0.382077
4,0.687364,0.155899,-0.665378,0.610816,0.90495
8,1.444345,0.587427,0.296277,1.097772,-1.358637
12,-0.79031,0.681152,-1.016647,0.330489,-0.89228


In [238]:
#using integer slicing specifying rows and columns
df.iloc[3:5, 3:5]

Unnamed: 0,6,8
12,0.330489,-0.89228
16,-0.838369,-1.710354


In [239]:
#using a list of integers
df.iloc[[3, 5, 6], [3, 4]]

Unnamed: 0,6,8
12,0.330489,-0.89228
20,0.760611,-0.477276
24,0.192518,0.32149


The basic form for DataFrames is:
df.iloc[rows, columns]

When a cross section is desired:

In [240]:
df.iloc[3]

0   -0.790310
2    0.681152
4   -1.016647
6    0.330489
8   -0.892280
Name: 12, dtype: float64

In [241]:
#out-of-bounds slices are handled gracefully: only the rows and columns that exist are returned
df.iloc[7:15, 3:10]

Unnamed: 0,6,8
28,-0.405525,-0.117234
32,-0.290302,-1.001469
36,-0.203927,0.43339


When slices go entirely out of bounds they can result in an empty DataFrame.

When a single indexer is out of bounds an IndexError will be raised. Similarly, a list of indexers where any single element is out of bounds will also raise an IndexError.
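
The three out-of-bounds cases can be sketched on a small invented 4 x 3 frame:

```python
import numpy as np
import pandas as pd

# hypothetical 4 x 3 frame, invented for illustration
frame = pd.DataFrame(np.arange(12).reshape(4, 3))

# a slice running past the end is simply truncated
truncated = frame.iloc[2:10]

# a slice entirely out of bounds gives an empty frame
empty = frame.iloc[10:20]

# a single out-of-bounds integer, or a list containing one, raises IndexError
errors = []
for indexer in (10, [0, 10]):
    try:
        frame.iloc[indexer]
    except IndexError:
        errors.append(indexer)

print(truncated.shape, empty.shape, errors)
```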

# Selection by a callable

The primary indexers (.loc, .iloc, and []) can all accept a callable function as an indexer. However, the callable MUST be a function with one argument (the calling Series or DataFrame) that returns a valid output for indexing.

In [242]:
df.loc[lambda df: df[0] > -1]

Unnamed: 0,0,2,4,6,8
0,1.530334,-0.76553,-0.454886,-0.025518,-0.382077
4,0.687364,0.155899,-0.665378,0.610816,0.90495
8,1.444345,0.587427,0.296277,1.097772,-1.358637
12,-0.79031,0.681152,-1.016647,0.330489,-0.89228
16,-0.502022,0.198242,0.354392,-0.838369,-1.710354
24,0.818817,-1.898499,-1.154984,0.192518,0.32149
32,-0.24268,0.672276,0.422261,-0.290302,-1.001469
36,0.688537,0.125309,-1.007323,-0.203927,0.43339


In [243]:
df.iloc[:, lambda df: [2, 3] ]

Unnamed: 0,4,6
0,-0.454886,-0.025518
4,-0.665378,0.610816
8,0.296277,1.097772
12,-1.016647,0.330489
16,0.354392,-0.838369
20,0.083811,0.760611
24,-1.154984,0.192518
28,-0.000948,-0.405525
32,0.422261,-0.290302
36,-1.007323,-0.203927


In [244]:
df[lambda df: df.columns[:3]]

Unnamed: 0,0,2,4
0,1.530334,-0.76553,-0.454886
4,0.687364,0.155899,-0.665378
8,1.444345,0.587427,0.296277
12,-0.79031,0.681152,-1.016647
16,-0.502022,0.198242,0.354392
20,-1.330508,0.517547,0.083811
24,0.818817,-1.898499,-1.154984
28,-1.025761,-1.169875,-0.000948
32,-0.24268,0.672276,0.422261
36,0.688537,0.125309,-1.007323


In [245]:
#callable indexing can also be used in a series
s.loc[lambda s: s>1]

6    3.0
dtype: float64

You can avoid the use of a temporary variable by chaining data selection operations.
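
As a sketch of what such a chain can look like (the scores frame and its column names are invented for this example), a callable passed to .loc receives the intermediate result of the previous step:

```python
import pandas as pd

# hypothetical data, invented for illustration
scores = pd.DataFrame(
    {'team': ['x', 'x', 'y', 'y', 'z'], 'points': [3, 5, 2, 8, 7]}
)

# the lambda receives the grouped totals directly,
# so no temporary variable is needed to filter them
top_teams = (
    scores.groupby('team')['points']
    .sum()
    .loc[lambda totals: totals > 7]
)

print(top_teams)
```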


The following are deprecated:
    the .ix indexer, in favor of .loc and .iloc
    using .loc or [] with a list containing one or more missing labels, in favor of .reindex()

# Reindexing

This is the idiomatic way to select potentially not-found elements.

In [246]:
s.reindex([0, 3, 8])

0   -0.105696
3    0.342546
8         NaN
dtype: float64

Another option, which returns only the valid keys and preserves the dtype, is the following:

In [247]:
labels = [0, 3, 6]
s.loc[s.index.intersection(labels)]

0   -0.105696
3    0.342546
6    3.000000
dtype: float64

A duplicated index will raise an error for .reindex().

In [248]:
#this index has no duplicated labels, so reindex succeeds here; the error only appears when the index contains duplicates
s = pd.Series(np.arange(5), index = ['a', 'b', 'c', 'd', 'e'])
labels = ['a', 'b']
s.reindex(labels)

a    0
b    1
dtype: int32

The duplication error can be circumvented by first intersecting the desired labels with the index and then reindexing, but this will still raise an error if the resulting index is duplicated.
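
Since the cell above does not actually trigger the error, here is a minimal sketch that does, assuming an invented Series with a duplicated 'a' label. It also shows the intersect-then-reindex workaround and the case where that workaround still fails.

```python
import numpy as np
import pandas as pd

# hypothetical Series with a duplicated 'a' label
dup = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])

# reindexing directly on a duplicated index raises ValueError
try:
    dup.reindex(['c', 'd'])
    direct_error = None
except ValueError as exc:
    direct_error = type(exc).__name__

# intersecting first works when the selected labels are unique...
labels = ['c', 'd']
safe = dup.loc[dup.index.intersection(labels)].reindex(labels)

# ...but still fails when the selection keeps a duplicated label
labels = ['a', 'd']
try:
    dup.loc[dup.index.intersection(labels)].reindex(labels)
    second_error = None
except ValueError as exc:
    second_error = type(exc).__name__

print(direct_error, safe.index.tolist(), second_error)
```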

# Selecting random samples

This is done using the sample() method on a Series, DataFrame or Panel. By default it samples rows, and it accepts either a specific number of rows/columns to return or a fraction of rows.

In [249]:
#generating a new series to work with
s = pd.Series(np.arange(15), index = list('abcdefghijklmno'))
print(s)

a     0
b     1
c     2
d     3
e     4
f     5
g     6
h     7
i     8
j     9
k    10
l    11
m    12
n    13
o    14
dtype: int32


In [250]:
#without passing an argument only one row is returned
s.sample()

d    3
dtype: int32

In [251]:
#specifying a number of rows
s.sample(n=5)

o    14
c     2
g     6
b     1
j     9
dtype: int32

In [252]:
#sampling a fraction of rows
s.sample(frac = 0.66)

j     9
m    12
f     5
l    11
d     3
c     2
o    14
e     4
k    10
n    13
dtype: int32

You can sample with replacement using the replace option; otherwise sample() will return each row at most once.

In [253]:
#sampling without replacement
s.sample(n = 5, replace=False)

k    10
c     2
g     6
e     4
d     3
dtype: int32

In [254]:
#with replacement
s.sample(n=6, replace = True)

c     2
h     7
b     1
j     9
m    12
o    14
dtype: int32

By default, sample() gives each row an equal probability of being selected. To change this we can pass the weights argument. The weights can be a list, a NumPy array, or a Series, as long as they are the same length as the object being sampled. Missing values are assigned a weight of 0, and infinite values are not allowed. If the weights do not sum to one, they are normalized by dividing each weight by the sum of the weights.

In [255]:
#creating a list of weights with sum 105
weights = pd.Series(np.arange(15))
#sampling with weights with re-normalizing
s.sample(n= 5, weights = weights.values)

o    14
l    11
f     5
b     1
j     9
dtype: int32
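
The effect of zero and missing weights is easier to see on a smaller invented Series. In this sketch (values and weights are made up), the weights sum to 6 rather than 1, and only three labels carry a non-zero weight, so a draw of three without replacement can only return those three.

```python
import numpy as np
import pandas as pd

# hypothetical data, invented for illustration
values = pd.Series(np.arange(5), index=list('abcde'))

# weights sum to 6, not 1, so sample() rescales them to 0, 0, 1/6, 2/6, 3/6;
# the NaN for 'b' is treated as a weight of zero
weights = [0, np.nan, 1, 2, 3]

# with three draws and only three non-zero weights, 'a' and 'b' can never appear
drawn = values.sample(n=3, weights=weights, random_state=0)

print(sorted(drawn.index))
```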

For DataFrames, a column within the df can be used as sampling weights (only when sampling rows, not when sampling columns) by passing the column name as a string.

In [256]:
#modifying our existing dataframe
df['weights'] = df[8]
del df[8]

In [257]:
#keeping only the rows where weights are non-negative, since sample() does not accept negative weights
df = df[df['weights'] >=0]

In [258]:
#sampling; only 3 rows survive the filter above, so asking for 4 without replacement raises a ValueError
df.sample(n = 4, weights = 'weights')

ValueError: Cannot take a larger sample than population when 'replace=False'
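
The error above comes from requesting more rows than the frame holds. Two possible ways around it are sketched below on an invented three-row frame standing in for the filtered df (the name small and its values are assumptions for illustration).

```python
import pandas as pd

# hypothetical three-row frame standing in for the filtered df
small = pd.DataFrame({'value': [1, 2, 3], 'weights': [0.9, 0.3, 0.4]})

# option 1: cap n at the number of available rows
capped = small.sample(n=min(4, len(small)), weights='weights', random_state=1)

# option 2: allow rows to repeat
repeated = small.sample(n=4, weights='weights', replace=True, random_state=1)

print(len(capped), len(repeated))
```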

In [259]:
#we can also sample columns
df.sample(n = 2, axis = 1)

Unnamed: 0,weights,6
4,0.90495,0.610816
24,0.32149,0.192518
36,0.43339,-0.203927


As a final note, we can also seed sample()'s random number generator through the random_state argument, using either an int or a NumPy RandomState object.

In [260]:
#the sample will always draw the same rows when given the same seed (an int here)
df.sample(n=3, random_state= 5)

Unnamed: 0,0,2,4,6,weights
4,0.687364,0.155899,-0.665378,0.610816,0.90495
24,0.818817,-1.898499,-1.154984,0.192518,0.32149
36,0.688537,0.125309,-1.007323,-0.203927,0.43339
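
Reproducibility can be checked directly by drawing twice with the same seed. This sketch uses an invented single-column frame. The last draw assumes that an int seed and a fresh RandomState built from that int behave identically.

```python
import numpy as np
import pandas as pd

# hypothetical data, invented for illustration
data = pd.DataFrame({'x': np.arange(20)})

first = data.sample(n=3, random_state=5)
second = data.sample(n=3, random_state=5)

# a RandomState object built from the same seed should give the same draw
third = data.sample(n=3, random_state=np.random.RandomState(5))

print(first.equals(second), first.equals(third))
```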


# Setting with Enlargement
Enlargement can be performed using either the .loc or [] operations when setting a non-existent key for that axis.

In the case of a Series this is effectively an append operation.

In [261]:
# setting by enlargement a value for p
s['p'] = 15
s

a     0
b     1
c     2
d     3
e     4
f     5
g     6
h     7
i     8
j     9
k    10
l    11
m    12
n    13
o    14
p    15
dtype: int64

In the case of a DataFrame, either axis can be enlarged by using .loc

In [262]:
#creating a new column via enlargement using the .loc accessor
df.loc[:, 'alpha'] = 15
df
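#the SettingWithCopyWarning below appears because df was created by the boolean filter in In [257]; appending .copy() to that filter avoids it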

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item_labels[indexer[info_axis]]] = value


Unnamed: 0,0,2,4,6,weights,alpha
4,0.687364,0.155899,-0.665378,0.610816,0.90495,15
24,0.818817,-1.898499,-1.154984,0.192518,0.32149,15
36,0.688537,0.125309,-1.007323,-0.203927,0.43339,15


In [263]:
#the following is an append operation
df.loc[8, :] = 7
df

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Unnamed: 0,0,2,4,6,weights,alpha
4,0.687364,0.155899,-0.665378,0.610816,0.90495,15.0
24,0.818817,-1.898499,-1.154984,0.192518,0.32149,15.0
36,0.688537,0.125309,-1.007323,-0.203927,0.43339,15.0
8,7.0,7.0,7.0,7.0,7.0,7.0
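
A sketch of the same two enlargement steps on an explicit copy, which should avoid the warnings shown above. The frame original and its values are invented here. The key assumption is that calling .copy() on the filtered result makes it independent of the frame it was taken from.

```python
import pandas as pd

# hypothetical frame, invented for illustration
original = pd.DataFrame(
    {'value': [1.0, 2.0, 3.0], 'weights': [0.9, -0.3, 0.4]}, index=[4, 24, 36]
)

# .copy() makes the filtered frame independent of the original
filtered = original[original['weights'] >= 0].copy()

filtered.loc[:, 'alpha'] = 15   # new column via enlargement
filtered.loc[8, :] = 7          # new row via enlargement

print(filtered)
```

The original frame is left untouched by both assignments.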


# Fast scalar value getting and setting