In [44]:
import numpy as np
import pandas as pd

## Different choices for indexing

pandas provides two main accessors for multi-axis indexing, .loc and .iloc:

.loc - primarily label based, but may also be used with a boolean array. .loc raises a KeyError when the items are not found. Allowed inputs are:

+ A single label, e.g. 5 or 'a' (Note that 5 is interpreted as a label of the index. This use is not an integer position along the index.).

+ A list or array of labels ['a', 'b', 'c'].

+ A slice object with labels 'a':'f' (Note that, contrary to usual Python slices, both the start and the stop are included when present in the index!)

+ A boolean array (any NA values will be treated as False).

+ A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).

.iloc - primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array. .iloc raises an IndexError if a requested indexer is out of bounds, except for slice indexers, which allow out-of-bounds indexing (this conforms with Python/NumPy slice semantics). Allowed inputs are:

+ An integer e.g. 5.

+ A list or array of integers, e.g. [4, 3, 0].

+ A slice object with ints e.g. 1:7.

+ A boolean array (any NA values will be treated as False).

+ A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).

The indexer syntax is:

+ For a Series: s.loc[indexer]

+ For a DataFrame: df.loc[row_indexer, column_indexer]

The same syntax applies to .iloc.
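
To make the label/position distinction concrete, here is a minimal sketch. The Series `s_demo` and its integer labels are made up for illustration and are not part of the examples in this notebook.

```python
import pandas as pd

# Hypothetical Series whose integer labels differ from the positions 0, 1, 2
s_demo = pd.Series(['x', 'y', 'z'], index=[10, 20, 30])

by_label = s_demo.loc[10]          # 10 is looked up as a label
by_position = s_demo.iloc[0]       # 0 is looked up as a position

label_slice = s_demo.loc[10:20]    # label slice: both endpoints are included
position_slice = s_demo.iloc[0:1]  # position slice: the stop is excluded

# .loc treats 0 as a label, and the index has no label 0
try:
    s_demo.loc[0]
    loc_error = None
except KeyError as exc:
    loc_error = type(exc).__name__

# .iloc treats 5 as a position, which is past the end of the Series
try:
    s_demo.iloc[5]
    iloc_error = None
except IndexError as exc:
    iloc_error = type(exc).__name__

print(by_label, by_position, len(label_slice), len(position_slice), loc_error, iloc_error)
```

Both lookups return the same element here, but the two slices differ in length because only the label slice includes its stop.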

## Basics

In [45]:
dates = pd.date_range('1/1/2020',periods=8)
df = pd.DataFrame(np.random.randn(8,4),index=dates,columns=['A','B','C','D'])
df

Unnamed: 0,A,B,C,D
2020-01-01,2.184166,1.303329,-0.359858,1.531668
2020-01-02,-0.430679,-0.521759,0.323465,-0.39735
2020-01-03,-0.001052,0.157756,2.105661,-0.172643
2020-01-04,-0.869618,0.831339,-1.115317,0.506023
2020-01-05,1.469401,0.525151,0.533935,-1.520247
2020-01-06,-0.437331,0.253235,0.64782,-0.506071
2020-01-07,-1.371116,0.748481,0.61802,1.183946
2020-01-08,0.379136,-1.597653,0.183668,0.046636


In [46]:
df['A'] #indexing a dataframe by column name returns a Series object corresponding to that column name

2020-01-01    2.184166
2020-01-02   -0.430679
2020-01-03   -0.001052
2020-01-04   -0.869618
2020-01-05    1.469401
2020-01-06   -0.437331
2020-01-07   -1.371116
2020-01-08    0.379136
Freq: D, Name: A, dtype: float64

In [47]:
df.loc['2020-01-01'] 

A    2.184166
B    1.303329
C   -0.359858
D    1.531668
Name: 2020-01-01 00:00:00, dtype: float64

In [48]:
df[['A','B']]

Unnamed: 0,A,B
2020-01-01,2.184166,1.303329
2020-01-02,-0.430679,-0.521759
2020-01-03,-0.001052,0.157756
2020-01-04,-0.869618,0.831339
2020-01-05,1.469401,0.525151
2020-01-06,-0.437331,0.253235
2020-01-07,-1.371116,0.748481
2020-01-08,0.379136,-1.597653


In [49]:
df.loc[:,['B','A']]= df[['A','B']].to_numpy() # swap column values; .to_numpy() drops the labels so .loc cannot realign the columns
df[['A','B']]

Unnamed: 0,A,B
2020-01-01,1.303329,2.184166
2020-01-02,-0.521759,-0.430679
2020-01-03,0.157756,-0.001052
2020-01-04,0.831339,-0.869618
2020-01-05,0.525151,1.469401
2020-01-06,0.253235,-0.437331
2020-01-07,0.748481,-1.371116
2020-01-08,-1.597653,0.379136
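
The reason for the `.to_numpy()` call can be sketched on a tiny frame. The names `df_swap`, `aligned` and `swapped` are illustrative only; the behavior shown is the label alignment that .loc applies when a DataFrame is assigned.

```python
import pandas as pd

df_swap = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# Assigning a DataFrame: pandas aligns on column labels first, so nothing moves
aligned = df_swap.copy()
aligned.loc[:, ['B', 'A']] = aligned[['A', 'B']]

# Assigning a raw array: values are written by position, so the columns swap
swapped = df_swap.copy()
swapped.loc[:, ['B', 'A']] = swapped[['A', 'B']].to_numpy()

print(aligned)
print(swapped)
```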


## Attribute access

+ You may access an index label on a Series, or a column on a DataFrame, directly as an attribute:

In [50]:
s = pd.Series([1,2,3],index=['a','b','c'])
s

a    1
b    2
c    3
dtype: int64

In [51]:
s.a

1

In [52]:
s.a=10
s

a    10
b     2
c     3
dtype: int64

In [53]:
df

Unnamed: 0,A,B,C,D
2020-01-01,1.303329,2.184166,-0.359858,1.531668
2020-01-02,-0.521759,-0.430679,0.323465,-0.39735
2020-01-03,0.157756,-0.001052,2.105661,-0.172643
2020-01-04,0.831339,-0.869618,-1.115317,0.506023
2020-01-05,0.525151,1.469401,0.533935,-1.520247
2020-01-06,0.253235,-0.437331,0.64782,-0.506071
2020-01-07,0.748481,-1.371116,0.61802,1.183946
2020-01-08,-1.597653,0.379136,0.183668,0.046636


In [54]:
df.A

2020-01-01    1.303329
2020-01-02   -0.521759
2020-01-03    0.157756
2020-01-04    0.831339
2020-01-05    0.525151
2020-01-06    0.253235
2020-01-07    0.748481
2020-01-08   -1.597653
Freq: D, Name: A, dtype: float64

In [55]:
df.A = list(range(0,len(df.index))) # works only because column A already exists
df

Unnamed: 0,A,B,C,D
2020-01-01,0,2.184166,-0.359858,1.531668
2020-01-02,1,-0.430679,0.323465,-0.39735
2020-01-03,2,-0.001052,2.105661,-0.172643
2020-01-04,3,-0.869618,-1.115317,0.506023
2020-01-05,4,1.469401,0.533935,-1.520247
2020-01-06,5,-0.437331,0.64782,-0.506071
2020-01-07,6,-1.371116,0.61802,1.183946
2020-01-08,7,0.379136,0.183668,0.046636


Attribute access comes with the following restrictions:

+ It works only if the index element is a valid Python identifier, e.g. s.1 is not allowed.

+ The attribute is not available if it conflicts with an existing method name, e.g. s.min refers to the method, but s['min'] still works.

+ Similarly, the attribute is not available if it conflicts with any of the following: index, major_axis, minor_axis, items.

+ In any of these cases, standard indexing still works, e.g. s['1'], s['min'], and s['index'] access the corresponding element or column.
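
A short sketch of these restrictions follows. The objects `s_attr` and `df_attr` are hypothetical, and the warning filter is there only because pandas emits a warning when a list-like is assigned to a new attribute.

```python
import warnings
import pandas as pd

s_attr = pd.Series([1, 2, 3], index=['a', 'min', 'index'])

via_attribute = s_attr.a              # 'a' is a valid identifier with no conflict
min_is_method = callable(s_attr.min)  # the Series.min method shadows the label 'min'
via_brackets = s_attr['min']          # [] still reaches the label

df_attr = pd.DataFrame({'x': [1, 2]})
with warnings.catch_warnings():
    warnings.simplefilter('ignore')   # silence the pandas warning about this pattern
    df_attr.y = [10, 20]              # sets a plain Python attribute, not a column
has_column_y = 'y' in df_attr.columns

print(via_attribute, min_is_method, via_brackets, has_column_y)
```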

In [56]:
# use [] to create a new column; attribute assignment (df.A = ...) only updates a column that already exists
df['A'] = list(range(0,len(df.index)))
df

Unnamed: 0,A,B,C,D
2020-01-01,0,2.184166,-0.359858,1.531668
2020-01-02,1,-0.430679,0.323465,-0.39735
2020-01-03,2,-0.001052,2.105661,-0.172643
2020-01-04,3,-0.869618,-1.115317,0.506023
2020-01-05,4,1.469401,0.533935,-1.520247
2020-01-06,5,-0.437331,0.64782,-0.506071
2020-01-07,6,-1.371116,0.61802,1.183946
2020-01-08,7,0.379136,0.183668,0.046636


In [57]:
df.iloc[1]= list(range(0,len(df.columns))) #assigning values to a row of a dataframe
df

Unnamed: 0,A,B,C,D
2020-01-01,0,2.184166,-0.359858,1.531668
2020-01-02,0,1.0,2.0,3.0
2020-01-03,2,-0.001052,2.105661,-0.172643
2020-01-04,3,-0.869618,-1.115317,0.506023
2020-01-05,4,1.469401,0.533935,-1.520247
2020-01-06,5,-0.437331,0.64782,-0.506071
2020-01-07,6,-1.371116,0.61802,1.183946
2020-01-08,7,0.379136,0.183668,0.046636


## Slicing ranges

Slicing using the [] operator

+ With Series, the syntax works exactly as with an ndarray, returning a slice of the values and the corresponding labels:

In [58]:
s = pd.Series([45,34,87,223,756],index=['a','b','c','d','e'])
s

a     45
b     34
c     87
d    223
e    756
dtype: int64

In [59]:
s[:4]

a     45
b     34
c     87
d    223
dtype: int64

In [60]:
s[::2]

a     45
c     87
e    756
dtype: int64

In [61]:
s[::-1]

e    756
d    223
c     87
b     34
a     45
dtype: int64

+ Note that setting works as well:

In [62]:
s2 = s.copy()
s2[:2]=0 # single-step assignment on an explicit copy, so s is unchanged
# Chained assignment (two indexing steps in a row, e.g. df['A'][:2] = 0) is different:
# whether it acts on a copy or a reference may depend on the context, so it should be avoided.
s2

a      0
b      0
c     87
d    223
e    756
dtype: int64

+ With DataFrame, slicing inside of [] slices the rows. This is provided largely as a convenience since it is such a common operation.

In [63]:
df

Unnamed: 0,A,B,C,D
2020-01-01,0,2.184166,-0.359858,1.531668
2020-01-02,0,1.0,2.0,3.0
2020-01-03,2,-0.001052,2.105661,-0.172643
2020-01-04,3,-0.869618,-1.115317,0.506023
2020-01-05,4,1.469401,0.533935,-1.520247
2020-01-06,5,-0.437331,0.64782,-0.506071
2020-01-07,6,-1.371116,0.61802,1.183946
2020-01-08,7,0.379136,0.183668,0.046636


In [64]:
df[:3]

Unnamed: 0,A,B,C,D
2020-01-01,0,2.184166,-0.359858,1.531668
2020-01-02,0,1.0,2.0,3.0
2020-01-03,2,-0.001052,2.105661,-0.172643


In [65]:
df[::-1]

Unnamed: 0,A,B,C,D
2020-01-08,7,0.379136,0.183668,0.046636
2020-01-07,6,-1.371116,0.61802,1.183946
2020-01-06,5,-0.437331,0.64782,-0.506071
2020-01-05,4,1.469401,0.533935,-1.520247
2020-01-04,3,-0.869618,-1.115317,0.506023
2020-01-03,2,-0.001052,2.105661,-0.172643
2020-01-02,0,1.0,2.0,3.0
2020-01-01,0,2.184166,-0.359858,1.531668


## Selection by label

+ .loc is strict when you present slicers that are not compatible (or convertible) with the index type, for example integers with a DatetimeIndex. These raise a TypeError.

In [66]:
# df.loc[2:3] would raise a TypeError
df.loc['20200104':'20200107']

Unnamed: 0,A,B,C,D
2020-01-04,3,-0.869618,-1.115317,0.506023
2020-01-05,4,1.469401,0.533935,-1.520247
2020-01-06,5,-0.437331,0.64782,-0.506071
2020-01-07,6,-1.371116,0.61802,1.183946
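
The strictness of .loc with a DatetimeIndex can be checked with a small sketch. The frame `df_dt` is an assumption made for this example, not the `df` used above.

```python
import pandas as pd

idx = pd.date_range('2020-01-01', periods=4)
df_dt = pd.DataFrame({'A': range(4)}, index=idx)

# String labels are converted to timestamps; both endpoints are included
inclusive = df_dt.loc['2020-01-02':'2020-01-03']

# Integers are not compatible with a DatetimeIndex
try:
    df_dt.loc[2:3]
    slice_error = None
except TypeError as exc:
    slice_error = type(exc).__name__

print(len(inclusive), slice_error)
```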


+ pandas provides a suite of methods in order to have purely label based indexing. This is a strict inclusion based protocol. Every label asked for must be in the index, or a KeyError will be raised. When slicing, both the start bound AND the stop bound are included, if present in the index. Integers are valid labels, but they refer to the label and not the position.

In [67]:
s.loc['c':]

c     87
d    223
e    756
dtype: int64

In [68]:
s.loc['d']=58 #Setting works as well
s

a     45
b     34
c     87
d     58
e    756
dtype: int64

In [69]:
df = pd.DataFrame(np.random.randn(6,4),index=list('abcdef'),columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
a,-0.190363,0.280675,0.383951,-1.720787
b,-1.27701,0.512831,0.767672,-1.68346
c,-0.135246,0.39842,-0.560162,-0.361949
d,0.499396,0.250124,-0.417902,1.496494
e,0.000508,-0.168682,1.429844,0.176611
f,0.788097,2.500007,-1.726756,0.283547


In [70]:
df.loc[['a','b','d'],:]

Unnamed: 0,A,B,C,D
a,-0.190363,0.280675,0.383951,-1.720787
b,-1.27701,0.512831,0.767672,-1.68346
d,0.499396,0.250124,-0.417902,1.496494


In [71]:
df.loc['d':,['A','C']]

Unnamed: 0,A,C
d,0.499396,-0.417902
e,0.000508,1.429844
f,0.788097,-1.726756


+ For getting a cross section using a label (equivalent to df.xs('a')):

In [72]:
df.loc['a']

A   -0.190363
B    0.280675
C    0.383951
D   -1.720787
Name: a, dtype: float64

+ For getting values with a boolean array (NA values in a boolean array propagate as False):

In [73]:
df.loc['a']>0

A    False
B     True
C     True
D    False
Name: a, dtype: bool

In [74]:
df.loc[:,df.loc['a']>0]

Unnamed: 0,B,C
a,0.280675,0.383951
b,0.512831,0.767672
c,0.39842,-0.560162
d,0.250124,-0.417902
e,-0.168682,1.429844
f,2.500007,-1.726756


+ For getting a value explicitly:

In [75]:
df.loc['a','D'] # This is also equivalent to df.at['a','D']

-1.7207869639385933

## Slicing with labels

+ When using .loc with slices, if both the start and the stop labels are present in the index, then elements located between the two (including them) are returned:

In [76]:
s = pd.Series(['a','b','c','d','e'],index=[0,3,2,4,5])
s

0    a
3    b
2    c
4    d
5    e
dtype: object

In [77]:
s.loc[3:5]

3    b
2    c
4    d
5    e
dtype: object

+ If at least one of the two is absent, but the index is sorted, and can be compared against start and stop labels, then slicing will still work as expected, by selecting labels which rank between the two:

In [78]:
s.sort_index()

0    a
2    c
3    b
4    d
5    e
dtype: object

In [79]:
s.sort_index().loc[1:6]

2    c
3    b
4    d
5    e
dtype: object

+ However, if at least one of the two is absent and the index is not sorted, an error will be raised (since doing otherwise would be computationally expensive, as well as potentially ambiguous for mixed type indexes). For instance, in the above example, s.loc[1:6] would raise KeyError.
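
The three cases above can be put side by side in one sketch. `s_unsorted` mirrors the Series used in this section but is rebuilt here so the block runs on its own.

```python
import pandas as pd

s_unsorted = pd.Series(list('abcde'), index=[0, 3, 2, 4, 5])

# Both labels are present: everything located between them is returned
present = s_unsorted.loc[3:5]

# Labels 1 and 6 are absent, but a sorted index can still be compared against them
sorted_slice = s_unsorted.sort_index().loc[1:6]

# Labels absent and index unsorted: pandas refuses to guess
try:
    s_unsorted.loc[1:6]
    unsorted_error = None
except KeyError as exc:
    unsorted_error = type(exc).__name__

print(list(present.index), list(sorted_slice.index), unsorted_error)
```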

## Selection by position

+ pandas provides a suite of methods for purely integer based indexing. The semantics closely follow Python and NumPy slicing, with 0-based positions. When slicing, the start bound is included, while the upper bound is excluded. Trying to use a non-integer, even a valid label, will raise an IndexError.

In [80]:
s

0    a
3    b
2    c
4    d
5    e
dtype: object

In [81]:
s.iloc[1:6]

3    b
2    c
4    d
5    e
dtype: object

In [82]:
s.iloc[:3]='f' #setting works as well
s

0    f
3    f
2    f
4    d
5    e
dtype: object

In [83]:
df

Unnamed: 0,A,B,C,D
a,-0.190363,0.280675,0.383951,-1.720787
b,-1.27701,0.512831,0.767672,-1.68346
c,-0.135246,0.39842,-0.560162,-0.361949
d,0.499396,0.250124,-0.417902,1.496494
e,0.000508,-0.168682,1.429844,0.176611
f,0.788097,2.500007,-1.726756,0.283547


In [84]:
df.iloc[:3]

Unnamed: 0,A,B,C,D
a,-0.190363,0.280675,0.383951,-1.720787
b,-1.27701,0.512831,0.767672,-1.68346
c,-0.135246,0.39842,-0.560162,-0.361949


In [85]:
df.iloc[:3,2:]

Unnamed: 0,C,D
a,0.383951,-1.720787
b,0.767672,-1.68346
c,-0.560162,-0.361949


In [86]:
df.iloc[[1,3],[2,3]] #Select via integer list

Unnamed: 0,C,D
b,0.767672,-1.68346
d,-0.417902,1.496494


+ A single indexer that is out of bounds will raise an IndexError. A list of indexers where any element is out of bounds will raise an IndexError.

In [87]:
# df.iloc[[5,6]] would raise an IndexError: position 6 is out of bounds for a frame with 6 rows
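
The difference between out-of-bounds slices and out-of-bounds integers can be sketched as follows. `df_pos` is a stand-in frame with six rows, assumed only for this example.

```python
import numpy as np
import pandas as pd

df_pos = pd.DataFrame(np.arange(24).reshape(6, 4), columns=list('ABCD'))

clipped = df_pos.iloc[4:10]  # slice runs past the end: silently clipped to rows 4 and 5
empty = df_pos.iloc[10:]     # slice starts past the end: empty result, no error

# A list with any out-of-bounds position raises
try:
    df_pos.iloc[[5, 6]]
    list_error = None
except IndexError as exc:
    list_error = type(exc).__name__

# A single out-of-bounds position raises as well
try:
    df_pos.iloc[6]
    scalar_error = None
except IndexError as exc:
    scalar_error = type(exc).__name__

print(len(clipped), len(empty), list_error, scalar_error)
```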

## Selection by callable

+ .loc, .iloc, and also [] indexing can accept a callable as indexer. The callable must be a function with one argument (the calling Series or DataFrame) that returns valid output for indexing.

In [88]:
df

Unnamed: 0,A,B,C,D
a,-0.190363,0.280675,0.383951,-1.720787
b,-1.27701,0.512831,0.767672,-1.68346
c,-0.135246,0.39842,-0.560162,-0.361949
d,0.499396,0.250124,-0.417902,1.496494
e,0.000508,-0.168682,1.429844,0.176611
f,0.788097,2.500007,-1.726756,0.283547


In [89]:
df.loc[lambda df: df.A>0,:]

Unnamed: 0,A,B,C,D
d,0.499396,0.250124,-0.417902,1.496494
e,0.000508,-0.168682,1.429844,0.176611
f,0.788097,2.500007,-1.726756,0.283547


In [90]:
df.loc[:,lambda df:['A','B']]

Unnamed: 0,A,B
a,-0.190363,0.280675
b,-1.27701,0.512831
c,-0.135246,0.39842
d,0.499396,0.250124
e,0.000508,-0.168682
f,0.788097,2.500007


In [91]:
df.iloc[:,lambda df: [0,1]]

Unnamed: 0,A,B
a,-0.190363,0.280675
b,-1.27701,0.512831
c,-0.135246,0.39842
d,0.499396,0.250124
e,0.000508,-0.168682
f,0.788097,2.500007


+ Callable indexing also works with a Series.

In [92]:
s

0    f
3    f
2    f
4    d
5    e
dtype: object

In [93]:
s[:3]= ['a','b','c']
s

0    a
3    b
2    c
4    d
5    e
dtype: object

In [94]:
df.A.loc[lambda s: s>0] # the lambda receives the Series df.A and keeps only the entries greater than 0

d    0.499396
e    0.000508
f    0.788097
Name: A, dtype: float64

In [95]:
s

0    a
3    b
2    c
4    d
5    e
dtype: object

## Indexing with list with missing labels is deprecated

+ In prior versions, using .loc[list-of-labels] would work as long as at least one of the keys was found (otherwise it would raise a KeyError). This behavior is deprecated: the pandas version used for this notebook emits the warning shown below, and pandas 1.0 and later raise a KeyError as soon as any label is missing. The recommended alternative is to use .reindex().


In [96]:
s = pd.Series([1,2,3])
s

0    1
1    2
2    3
dtype: int64

In [97]:
s.loc[[1,2,3]]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike


1    2.0
2    3.0
3    NaN
dtype: float64

In [98]:
s.reindex([1,2,3])

1    2.0
2    3.0
3    NaN
dtype: float64

+ Calling .reindex() on an object with a duplicated index raises a ValueError:

In [99]:
s= pd.Series(np.arange(4),index=['a','a','b','c'])
s

a    0
a    1
b    2
c    3
dtype: int32

In [100]:
labels = ['c','d']
# s.reindex(labels) would raise a ValueError because the index contains duplicates

+ Generally, you can intersect the desired labels with the current axis, and then reindex.

In [101]:
s.loc[s.index.intersection(labels)].reindex(labels)

c    3.0
d    NaN
dtype: float64

+ However, this would still raise if your resulting index is duplicated.

In [102]:
labels = ['a','d']
# s.loc[s.index.intersection(labels)].reindex(labels) would raise a ValueError, since 'a' is duplicated in the intermediate result
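
The commented-out failures above can be exercised safely with try/except. The sketch below rebuilds the Series as `s_dup`; the final de-duplication step is one possible workaround and an assumption of this example, not something the section prescribes.

```python
import numpy as np
import pandas as pd

s_dup = pd.Series(np.arange(4), index=['a', 'a', 'b', 'c'])
wanted = ['c', 'd']

# reindex refuses to work on an index with duplicates
try:
    s_dup.reindex(wanted)
    reindex_error = None
except ValueError as exc:
    reindex_error = type(exc).__name__

# Intersect first so that only unique, existing labels remain, then reindex
safe = s_dup.loc[s_dup.index.intersection(wanted)].reindex(wanted)

# Possible workaround when a wanted label is duplicated: keep its first occurrence
deduped = s_dup[~s_dup.index.duplicated(keep='first')].reindex(['a', 'd'])

print(reindex_error)
print(safe)
print(deduped)
```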

## Selecting random samples

A random selection of rows or columns from a Series or DataFrame can be drawn with the sample() method. The method samples rows by default, and accepts either a specific number of rows/columns to return or a fraction of rows.

In [103]:
s = pd.Series([0,1,2,3,4,5])
s.sample() #when no arguments are passed, it returns one row

4    4
dtype: int64

In [104]:
#One may specify either a number of rows
s.sample(n=3)

5    5
1    1
2    2
dtype: int64

In [105]:
#or a fraction of the rows
s.sample(frac=0.5)

3    3
2    2
1    1
dtype: int64

+ By default, sample will return each row at most once, but one can also sample with replacement using the replace option:

In [106]:
#Without replacement by default
s.sample(n=6,replace=False)

5    5
0    0
3    3
1    1
4    4
2    2
dtype: int64

In [107]:
#With replacement
s.sample(n=6,replace=True)

2    2
4    4
0    0
1    1
3    3
5    5
dtype: int64

+ By default, each row has an equal probability of being selected, but if you want rows to have different probabilities, you can pass the sample function sampling weights as weights. These weights can be a list, a NumPy array, or a Series, but they must be of the same length as the object you are sampling. Missing values will be treated as a weight of zero, and inf values are not allowed. If weights do not sum to 1, they will be re-normalized by dividing all weights by the sum of the weights. For example:

In [108]:
example_weights = [0,0,0.2,0.4,0.3,0.1]
s.sample(n=3,weights=example_weights)

5    5
2    2
3    3
dtype: int64

In [109]:
example_weights=[0.5,0,0,0,0,0] #weights will be re-normalized automatically
s.sample(n=1,weights=example_weights)

0    0
dtype: int64

+ When applied to a DataFrame, you can use a column of the DataFrame as sampling weights (provided you are sampling rows and not columns) by simply passing the name of the column as a string.

In [110]:
df2 = pd.DataFrame({'col1':[9,8,7,6],'weight_col':[0.4,0.2,0.3,0.1]})
df2

Unnamed: 0,col1,weight_col
0,9,0.4
1,8,0.2
2,7,0.3
3,6,0.1


In [111]:
df2.sample(n=3,weights='weight_col')

Unnamed: 0,col1,weight_col
2,7,0.3
0,9,0.4
1,8,0.2


sample also allows users to sample columns instead of rows using the axis argument.

In [112]:
df2.sample(n=1,axis=1)

Unnamed: 0,col1
0,9
1,8
2,7
3,6


+ Finally, one can also set a seed for sample’s random number generator using the random_state argument, which will accept either an integer (as a seed) or a NumPy RandomState object.

In [113]:
#With a given seed, the sample will always draw the same rows
df2.sample(n=3,weights='weight_col',random_state=4)

Unnamed: 0,col1,weight_col
3,6,0.1
1,8,0.2
2,7,0.3


In [114]:
df2.sample(n=3,weights='weight_col',random_state=4)

Unnamed: 0,col1,weight_col
3,6,0.1
1,8,0.2
2,7,0.3
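
As a compact check of the sampling rules described in this section, the sketch below uses a hypothetical Series `s_pop` and an arbitrary seed of 42. It verifies properties of the draw rather than specific values, since the exact rows depend on the random number generator.

```python
import pandas as pd

s_pop = pd.Series(range(6))
weights = [0, 0, 0.2, 0.4, 0.3, 0.1]

# The same seed gives the same draw
first = s_pop.sample(n=3, weights=weights, random_state=42)
second = s_pop.sample(n=3, weights=weights, random_state=42)
same_draw = first.equals(second)

# Rows with weight 0 (labels 0 and 1) can never be selected
zero_weight_drawn = bool(first.index.isin([0, 1]).any())

# frac=0.5 on six rows returns three rows
half = s_pop.sample(frac=0.5, random_state=42)

print(same_draw, zero_weight_drawn, len(half))
```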


## Setting with enlargement

The .loc/[] operations can perform enlargement when setting a non-existent key for that axis.

+ In the Series case this is effectively an appending operation.

In [115]:
s

0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64

In [116]:
s[6]=18
s

0     0
1     1
2     2
3     3
4     4
5     5
6    18
dtype: int64

+ A DataFrame can be enlarged on either axis via .loc.

In [117]:
df=pd.DataFrame(np.random.randn(6,4),columns=['A','B','C','D'])
df

Unnamed: 0,A,B,C,D
0,1.509854,-0.020006,0.337706,0.581523
1,0.415265,-0.838403,-0.127927,-0.78314
2,1.289532,-0.464113,1.185411,-1.04391
3,2.16904,-0.800722,-1.250889,-1.031277
4,1.815094,0.921846,-0.790781,1.418297
5,-1.91829,0.32964,0.493297,-0.285926


In [118]:
df.loc[:,'E']=df.loc[:,'A']*2
df

Unnamed: 0,A,B,C,D,E
0,1.509854,-0.020006,0.337706,0.581523,3.019707
1,0.415265,-0.838403,-0.127927,-0.78314,0.830531
2,1.289532,-0.464113,1.185411,-1.04391,2.579064
3,2.16904,-0.800722,-1.250889,-1.031277,4.33808
4,1.815094,0.921846,-0.790781,1.418297,3.630188
5,-1.91829,0.32964,0.493297,-0.285926,-3.83658


In [119]:
df.loc[6]=[9,18,27,36,45]
df

Unnamed: 0,A,B,C,D,E
0,1.509854,-0.020006,0.337706,0.581523,3.019707
1,0.415265,-0.838403,-0.127927,-0.78314,0.830531
2,1.289532,-0.464113,1.185411,-1.04391,2.579064
3,2.16904,-0.800722,-1.250889,-1.031277,4.33808
4,1.815094,0.921846,-0.790781,1.418297,3.630188
5,-1.91829,0.32964,0.493297,-0.285926,-3.83658
6,9.0,18.0,27.0,36.0,45.0
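
Enlargement works through labels only. The sketch below, using the made-up objects `s_grow` and `df_grow`, repeats the pattern on a smaller scale and shows that .iloc does not enlarge.

```python
import pandas as pd

s_grow = pd.Series([1, 2, 3])
s_grow.loc[3] = 4                       # label 3 does not exist yet, so it is appended

df_grow = pd.DataFrame({'A': [1.0, 2.0]})
df_grow.loc[:, 'B'] = df_grow['A'] * 2  # new column
df_grow.loc[2] = [3.0, 6.0]             # new row

# .iloc is purely positional and cannot create new positions
try:
    s_grow.iloc[10] = 99
    iloc_error = None
except IndexError as exc:
    iloc_error = type(exc).__name__

print(len(s_grow), df_grow.shape, iloc_error)
```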


## Fast scalar value getting and setting

Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you’re asking for. If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures.

+ Similarly to loc, at provides label based scalar lookups, while iat provides integer based lookups analogously to iloc.

In [120]:
s

0     0
1     1
2     2
3     3
4     4
5     5
6    18
dtype: int64

In [121]:
s.iat[4]

4

In [122]:
df

Unnamed: 0,A,B,C,D,E
0,1.509854,-0.020006,0.337706,0.581523,3.019707
1,0.415265,-0.838403,-0.127927,-0.78314,0.830531
2,1.289532,-0.464113,1.185411,-1.04391,2.579064
3,2.16904,-0.800722,-1.250889,-1.031277,4.33808
4,1.815094,0.921846,-0.790781,1.418297,3.630188
5,-1.91829,0.32964,0.493297,-0.285926,-3.83658
6,9.0,18.0,27.0,36.0,45.0


In [123]:
df.iat[3,0]

2.1690400101860363

In [124]:
df.at[4,'B']

0.9218463885362582
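
Since this section covers setting as well as getting, here is a small sketch of both. `df_sc` is a hypothetical frame with string row labels so that labels and positions cannot be confused.

```python
import pandas as pd

df_sc = pd.DataFrame({'A': [1.5, 2.5], 'B': [3.5, 4.5]}, index=['x', 'y'])

by_label = df_sc.at['y', 'A']   # row label 'y', column label 'A'
by_position = df_sc.iat[1, 0]   # second row, first column: the same cell

df_sc.at['x', 'B'] = 10.0       # scalar setting uses the same syntax

print(by_label, by_position, df_sc.loc['x', 'B'])
```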

## Boolean indexing

+ Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order is (df['A'] > 2) & (df['B'] < 3).

+ Using a boolean vector to index a Series works exactly as in a NumPy ndarray:

In [125]:
s= pd.Series(np.arange(-3,4))
s

0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int32

In [126]:
s[(s>-1) & (s<3)]

3    0
4    1
5    2
dtype: int32

In [128]:
s[s>0]

4    1
5    2
6    3
dtype: int32
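
The need for parentheses can be seen in a short sketch. `df_bool` is an illustrative frame with integer columns; with other dtypes the unparenthesized form may fail with a different exception, which is why both TypeError and ValueError are caught.

```python
import pandas as pd

df_bool = pd.DataFrame({'A': [1, 3, 5], 'B': [4, 2, 0]})

# Parenthesized: each comparison is evaluated first, then combined with &
mask = (df_bool['A'] > 2) & (df_bool['B'] < 3)
filtered = df_bool[mask]

# Unparenthesized: & binds tighter than > and <, so Python evaluates 2 & df_bool['B'] first
try:
    df_bool[df_bool['A'] > 2 & df_bool['B'] < 3]
    precedence_error = None
except (TypeError, ValueError) as exc:
    precedence_error = type(exc).__name__

print(list(filtered.index), precedence_error)
```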

## The where() method and masking

+ Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that selection output has the same shape as the original data, you can use the where method in Series and DataFrame.

In [130]:
s

0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int32

+ To return only the selected rows:

In [131]:
s[s>0]

4    1
5    2
6    3
dtype: int32

+ To return a Series of the same shape as the original:

In [132]:
s.where(s>0)

0    NaN
1    NaN
2    NaN
3    NaN
4    1.0
5    2.0
6    3.0
dtype: float64

In [133]:
df

Unnamed: 0,A,B,C,D,E
0,1.509854,-0.020006,0.337706,0.581523,3.019707
1,0.415265,-0.838403,-0.127927,-0.78314,0.830531
2,1.289532,-0.464113,1.185411,-1.04391,2.579064
3,2.16904,-0.800722,-1.250889,-1.031277,4.33808
4,1.815094,0.921846,-0.790781,1.418297,3.630188
5,-1.91829,0.32964,0.493297,-0.285926,-3.83658
6,9.0,18.0,27.0,36.0,45.0


+ Selecting values from a DataFrame with a boolean criterion also preserves the shape of the input data. where is used under the hood as the implementation. The code below is equivalent to df.where(df > 0).

In [134]:
df[df>0]

Unnamed: 0,A,B,C,D,E
0,1.509854,,0.337706,0.581523,3.019707
1,0.415265,,,,0.830531
2,1.289532,,1.185411,,2.579064
3,2.16904,,,,4.33808
4,1.815094,0.921846,,1.418297,3.630188
5,,0.32964,0.493297,,
6,9.0,18.0,27.0,36.0,45.0


+ In addition, where takes an optional other argument for replacement of values where the condition is False, in the returned copy.

In [135]:
df.where(df>0,-df)

Unnamed: 0,A,B,C,D,E
0,1.509854,0.020006,0.337706,0.581523,3.019707
1,0.415265,0.838403,0.127927,0.78314,0.830531
2,1.289532,0.464113,1.185411,1.04391,2.579064
3,2.16904,0.800722,1.250889,1.031277,4.33808
4,1.815094,0.921846,0.790781,1.418297,3.630188
5,1.91829,0.32964,0.493297,0.285926,3.83658
6,9.0,18.0,27.0,36.0,45.0


+ By default, where returns a modified copy of the data. There is an optional parameter inplace so that the original data can be modified without creating a copy.

## Mask

+ mask() is the inverse boolean operation of where().

In [136]:
s.mask(s>=0)

0   -3.0
1   -2.0
2   -1.0
3    NaN
4    NaN
5    NaN
6    NaN
dtype: float64
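
The relationship between boolean selection, where() and mask() can be summarized in one sketch. The Series `s_w` and the replacement value 0 are chosen only for illustration.

```python
import pandas as pd

s_w = pd.Series([-2, -1, 0, 1, 2])
cond = s_w > 0

subset = s_w[cond]             # only the rows where cond is True
same_shape = s_w.where(cond)   # original shape, NaN where cond is False
replaced = s_w.where(cond, 0)  # original shape, 0 where cond is False
masked = s_w.mask(cond)        # the inverse: NaN where cond is True

# mask(cond) selects the same values as where(~cond)
mask_is_inverse = masked.equals(s_w.where(~cond))

print(len(subset), len(same_shape), replaced.tolist(), mask_is_inverse)
```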

## The query() method

+ DataFrame objects have a query() method that allows selection using an expression.

In [138]:
df = pd.DataFrame(np.random.randn(10,3),columns=list('abc'))
df

Unnamed: 0,a,b,c
0,1.085738,0.366208,0.048932
1,-0.224596,-0.133775,0.2133
2,-0.093923,-1.069468,0.064011
3,-2.191454,-0.905193,-1.946169
4,-0.076633,0.453772,0.674598
5,1.849557,1.127593,0.441625
6,0.803561,-0.603508,1.133493
7,0.107406,0.642705,0.492959
8,-1.268797,0.294226,-0.145001
9,1.776646,-2.20929,0.693986


In [139]:
# select the rows where the value in column b lies between the values in columns a and c
df.query('(a<b) & (b<c)')

Unnamed: 0,a,b,c
1,-0.224596,-0.133775,0.2133
4,-0.076633,0.453772,0.674598


+ We can also use index in our query expression:

In [141]:
df.query('index>b<c') # chained comparison: rows where index > b and b < c

Unnamed: 0,a,b,c
1,-0.224596,-0.133775,0.2133
2,-0.093923,-1.069468,0.064011
4,-0.076633,0.453772,0.674598
6,0.803561,-0.603508,1.133493
9,1.776646,-2.20929,0.693986
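
Each query expression has an equivalent boolean mask, which can be a useful way to check what an expression selects. The sketch below assumes a seeded random frame `df_q` so that it is reproducible; the seed is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_q = pd.DataFrame(rng.standard_normal((10, 3)), columns=list('abc'))

# Column names are used directly inside the expression
via_query = df_q.query('(a < b) & (b < c)')
via_mask = df_q[(df_q['a'] < df_q['b']) & (df_q['b'] < df_q['c'])]

# 'index > b < c' is a chained comparison: index > b and b < c
via_index = df_q.query('index > b < c')
via_index_mask = df_q[(df_q['b'] < df_q.index) & (df_q['b'] < df_q['c'])]

print(via_query.equals(via_mask), via_index.equals(via_index_mask))
```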


## Duplicate data

If you want to identify and remove duplicate rows in a DataFrame, there are two methods that will help: duplicated and drop_duplicates. Each takes as an argument the columns to use to identify duplicated rows.

+ duplicated returns a boolean vector whose length is the number of rows, and which indicates whether a row is duplicated.

+ drop_duplicates removes duplicate rows.

By default, the first observed row of a duplicate set is considered unique, but each method has a keep parameter to specify targets to be kept.

+ keep='first' (default): mark / drop duplicates except for the first occurrence.

+ keep='last': mark / drop duplicates except for the last occurrence.

+ keep=False: mark / drop all duplicates.


In [143]:
df2 = pd.DataFrame({'a':['one','two','one','one','one','two','three'],
                  'b':['x','y','y','y','x','x','z'],
                  'c':np.random.randn(7)})
df2

Unnamed: 0,a,b,c
0,one,x,-1.043426
1,two,y,-0.322323
2,one,y,1.377269
3,one,y,0.305929
4,one,x,0.30517
5,two,x,1.219772
6,three,z,-0.484239


In [144]:
df2.duplicated('a')

0    False
1    False
2     True
3     True
4     True
5     True
6    False
dtype: bool

In [145]:
df2.duplicated('a',keep='last')

0     True
1     True
2     True
3     True
4    False
5    False
6    False
dtype: bool

In [146]:
df2.duplicated('a',keep=False)

0     True
1     True
2     True
3     True
4     True
5     True
6    False
dtype: bool

In [147]:
df2.drop_duplicates('a')

Unnamed: 0,a,b,c
0,one,x,-1.043426
1,two,y,-0.322323
6,three,z,-0.484239


In [148]:
df2.drop_duplicates('a',keep='last')

Unnamed: 0,a,b,c
4,one,x,0.30517
5,two,x,1.219772
6,three,z,-0.484239


In [149]:
df2.drop_duplicates('a',keep=False)

Unnamed: 0,a,b,c
6,three,z,-0.484239


In [150]:
df2.duplicated(['a','b'])

0    False
1    False
2    False
3     True
4     True
5    False
6    False
dtype: bool

In [151]:
df2.drop_duplicates(['a','b'])

Unnamed: 0,a,b,c
0,one,x,-1.043426
1,two,y,-0.322323
2,one,y,1.377269
5,two,x,1.219772
6,three,z,-0.484239
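
The two methods are closely related: dropping duplicates gives the same result as filtering with the negated duplicated() vector. A minimal sketch on a hypothetical frame `df_dup`:

```python
import pandas as pd

df_dup = pd.DataFrame({'a': ['one', 'two', 'one', 'two'],
                       'b': ['x', 'y', 'x', 'z']})

flags = df_dup.duplicated(['a', 'b'])            # True for repeats after the first occurrence
via_mask = df_dup[~flags]                        # keep the rows that are not flagged
via_method = df_dup.drop_duplicates(['a', 'b'])  # same rows

# keep=False flags every member of a duplicate set, which helps when inspecting them
all_copies = df_dup[df_dup.duplicated(['a', 'b'], keep=False)]

print(flags.tolist(), via_mask.equals(via_method), list(all_copies.index))
```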


## Set/reset index

DataFrame has a set_index() method which takes a column name (for a regular Index) or a list of column names (for a MultiIndex). To create a new, re-indexed DataFrame:

In [152]:
df2.set_index('b')

b,a,c
x,one,-1.043426
y,two,-0.322323
y,one,1.377269
y,one,0.305929
x,one,0.30517
x,two,1.219772
z,three,-0.484239


As a convenience, DataFrame has a reset_index() method, which transfers the index values into the DataFrame’s columns and sets a simple integer index. This is the inverse operation of set_index().

In [156]:
df2.reset_index()

Unnamed: 0,index,a,b,c
0,0,one,x,-1.043426
1,1,two,y,-0.322323
2,2,one,y,1.377269
3,3,one,y,0.305929
4,4,one,x,0.30517
5,5,two,x,1.219772
6,6,three,z,-0.484239


+ reset_index takes an optional parameter drop which, if True, simply discards the index instead of putting the index values in the DataFrame’s columns.

In [157]:
df2.reset_index(drop=True)

Unnamed: 0,a,b,c
0,one,x,-1.043426
1,two,y,-0.322323
2,one,y,1.377269
3,one,y,0.305929
4,one,x,0.30517
5,two,x,1.219772
6,three,z,-0.484239
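
Both methods return a new DataFrame and leave the original untouched, which is why df2 still has its default index above. The round trip can be sketched on a small hypothetical frame `df_idx`:

```python
import pandas as pd

df_idx = pd.DataFrame({'a': ['one', 'two', 'three'],
                       'b': ['x', 'y', 'z'],
                       'c': [1.0, 2.0, 3.0]})

indexed = df_idx.set_index('b')           # regular Index built from column b
multi = df_idx.set_index(['a', 'b'])      # list of columns gives a MultiIndex

restored = indexed.reset_index()          # b comes back as the first column
dropped = indexed.reset_index(drop=True)  # b is discarded entirely

print(list(indexed.columns), multi.index.nlevels,
      list(restored.columns), list(dropped.columns), list(df_idx.columns))
```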


## Adding an ad hoc index

+ If you create an index yourself, you can just assign it to the index attribute:

In [159]:
index= pd.Index(np.arange(8,15))
index

Int64Index([8, 9, 10, 11, 12, 13, 14], dtype='int64')

In [162]:
df2.index=index
df2

Unnamed: 0,a,b,c
8,one,x,-1.043426
9,two,y,-0.322323
10,one,y,1.377269
11,one,y,0.305929
12,one,x,0.30517
13,two,x,1.219772
14,three,z,-0.484239
