In [193]:
import pandas as pd
import numpy as np

In [194]:
from pandas import Series, DataFrame

In [195]:
kakao = Series([100, 200])
print(kakao)

0    100
1    200
dtype: int64


In [196]:
raw_data = {'col0' : [1, 2, 3, 4],
            'col1' : [10, 20, 30, 40],
            'col2' : [100, 200, 300, 400]}
data = DataFrame(raw_data)
print(data)

   col0  col1  col2
0     1    10   100
1     2    20   200
2     3    30   300
3     4    40   400


In [197]:
date = ['16.02.29', '16.02.26', '16.02.23', '16.02.27']
data1 = DataFrame(raw_data, index=date)
print(data1)

          col0  col1  col2
16.02.29     1    10   100
16.02.26     2    20   200
16.02.23     3    30   300
16.02.27     4    40   400


In [198]:
day_data1 = data1.loc['16.02.29']
print(day_data1)

col0      1
col1     10
col2    100
Name: 16.02.29, dtype: int64


In [199]:
col1 = data1['col1']
print(col1)

16.02.29    10
16.02.26    20
16.02.23    30
16.02.27    40
Name: col1, dtype: int64


In [200]:
print(data1.columns)
print(data1.index)

Index(['col0', 'col1', 'col2'], dtype='object')
Index(['16.02.29', '16.02.26', '16.02.23', '16.02.27'], dtype='object')


In [201]:
data2 = DataFrame(data1, columns = ['Samung', 'Lg', 'Lotte'])
print(data2)

          Samung  Lg  Lotte
16.02.29     NaN NaN    NaN
16.02.26     NaN NaN    NaN
16.02.23     NaN NaN    NaN
16.02.27     NaN NaN    NaN


## Indexing and selecting data
[User guide](https://pandas.pydata.org/docs/user_guide/indexing.html)

## Indexing operators

### 1. Label indexing: `.loc[]` 
* `df.loc[row_label, col_label]`
* Valid inputs:
 * A single label, e.g. `5` or `'a'` (Note that `5` is interpreted as a *label* of the index. This use is **not** an integer position along the index.).
 * A list or array of labels `['a', 'b', 'c']`.
 * A slice object with labels `'a':'f'` (note that, contrary to usual Python slices, **both** the start and the stop are included when present in the index; see [Slicing with labels](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-slicing-with-labels)).
 * A boolean array.
 * A `callable`, see [Selection By Callable](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-callable).

### 2. Positional indexing: `.iloc[]`
* `df.iloc[row_pos, col_pos]`
* Valid inputs:
 * An integer e.g. `5`.
 * A list or array of integers `[4, 3, 0]`.
 * A slice object with ints `1:7`.
 * A boolean array.
 * A `callable`, see [Selection By Callable](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-callable).

### 3. General indexing `[]`
* `df[col_label]`
* `df[list of col_labels]`
 * `df[['A', 'B']]`
* Slice
 * slices the rows; `df[row_pos1:row_pos2]`
 * `df[1:3]`
 
#### Others to consider
* The `.loc/[]` operations can perform enlargement when setting a non-existent key for that axis.

Getting values from an object with multi-axes selection uses the following notation (using `.loc` as an example, but the following applies to `.iloc` as well). Any of the axes accessors may be the null slice `:`. Axes left out of the specification are assumed to be `:`, e.g. `p.loc['a']` is equivalent to `p.loc['a', :]`.

| Object Type | Indexers                             |
| :---------- | :----------------------------------- |
| Series      | `s.loc[indexer]`                     |
| DataFrame   | `df.loc[row_indexer,column_indexer]` |
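
The three access styles summarized above can be compared side by side. This is a minimal sketch: the frame `demo` and its labels are invented for illustration, with integer row labels chosen to differ from the row positions.

```python
import pandas as pd

# Hypothetical frame: integer row labels 5..8 that differ from positions 0..3.
demo = pd.DataFrame(
    {"A": [10, 20, 30, 40], "B": [1, 2, 3, 4]},
    index=[5, 6, 7, 8],
)

by_label = demo.loc[5, "A"]        # label 5 is the first row
by_position = demo.iloc[0, 0]      # position 0 is the same cell
label_slice = demo.loc[5:7, "A"]   # label slice includes both endpoints
pos_slice = demo.iloc[0:2, 0]      # positional slice excludes the stop
column = demo["A"]                 # [] with a label selects a column
rows = demo[1:3]                   # [] with a slice selects rows by position
```

Because the labels and positions differ here, mixing up `.loc` and `.iloc` shows up immediately as a `KeyError` or a different row rather than a silent mistake.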



As mentioned when introducing the data structures in the [last section](https://pandas.pydata.org/docs/user_guide/basics.html#basics), the primary function of indexing with `[]` (a.k.a. `__getitem__` for those familiar with implementing class behavior in Python) is selecting out lower-dimensional slices. The following table shows return type values when indexing pandas objects with `[]`:

| Object Type | Selection        | Return Value Type                 |
| :---------- | :--------------- | :-------------------------------- |
| Series      | `series[label]`  | scalar value                      |
| DataFrame   | `frame[colname]` | `Series` corresponding to colname |

In [202]:
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
2000-01-01,-0.501628,-0.027247,1.221096,-1.011678
2000-01-02,0.143116,-0.871463,0.354171,1.532565
2000-01-03,-0.193109,-1.615947,-0.21391,0.127459
2000-01-04,-1.089993,1.200929,-0.430479,0.53619
2000-01-05,0.333507,-0.643188,0.945588,-2.058857
2000-01-06,0.931516,1.345539,0.65152,0.217818
2000-01-07,0.17717,-0.315621,-0.736386,-0.298658
2000-01-08,0.61027,-1.17743,-0.325076,1.448778


In [203]:
s = df['A']
s[dates[5]]

0.9315162982225246

In [204]:
df[['B', 'A']] = df[['A', 'B']]
df

Unnamed: 0,A,B,C,D
2000-01-01,-0.027247,-0.501628,1.221096,-1.011678
2000-01-02,-0.871463,0.143116,0.354171,1.532565
2000-01-03,-1.615947,-0.193109,-0.21391,0.127459
2000-01-04,1.200929,-1.089993,-0.430479,0.53619
2000-01-05,-0.643188,0.333507,0.945588,-2.058857
2000-01-06,1.345539,0.931516,0.65152,0.217818
2000-01-07,-0.315621,0.17717,-0.736386,-0.298658
2000-01-08,-1.17743,0.61027,-0.325076,1.448778


In [205]:
df[['A', 'B']]

Unnamed: 0,A,B
2000-01-01,-0.027247,-0.501628
2000-01-02,-0.871463,0.143116
2000-01-03,-1.615947,-0.193109
2000-01-04,1.200929,-1.089993
2000-01-05,-0.643188,0.333507
2000-01-06,1.345539,0.931516
2000-01-07,-0.315621,0.17717
2000-01-08,-1.17743,0.61027


In [206]:
df.loc[:, ['B', 'A']] = df[['A', 'B']]
df[['A', 'B']]

Unnamed: 0,A,B
2000-01-01,-0.027247,-0.501628
2000-01-02,-0.871463,0.143116
2000-01-03,-1.615947,-0.193109
2000-01-04,1.200929,-1.089993
2000-01-05,-0.643188,0.333507
2000-01-06,1.345539,0.931516
2000-01-07,-0.315621,0.17717
2000-01-08,-1.17743,0.61027


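pandas aligns all axes when setting a `Series` or `DataFrame` through `.loc` and `.iloc`.

The `.loc` assignment above therefore does not modify `df`: the columns on the right-hand side are aligned to the target labels before the values are assigned, so each column is assigned back to itself.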
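
One way to swap column values through `.loc` is to hand over raw values, so that no label alignment takes place. The sketch below uses a small made-up frame named `swap`; `.to_numpy()` is the standard pandas method for extracting the underlying array.

```python
import pandas as pd

# Hypothetical two-column frame used only for this illustration.
swap = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# .to_numpy() strips the column labels, so the values are assigned
# by position and the two columns really trade places.
swap.loc[:, ["B", "A"]] = swap[["A", "B"]].to_numpy()
```

After this assignment, column `A` holds the old `B` values and vice versa.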

### Attribute access
You may access an index on a `Series` or column on a `DataFrame` directly as an attribute:
* `df.A` cannot create a new column if there is no `"A"` column, but `df["A"]` can

In [207]:
sa = pd.Series([1, 2, 3], index=list('abc'))
dfa = df.copy()

In [208]:
sa.b

2

In [209]:
dfa.A

2000-01-01   -0.027247
2000-01-02   -0.871463
2000-01-03   -1.615947
2000-01-04    1.200929
2000-01-05   -0.643188
2000-01-06    1.345539
2000-01-07   -0.315621
2000-01-08   -1.177430
Freq: D, Name: A, dtype: float64

In [210]:
sa.a = 5
sa

a    5
b    2
c    3
dtype: int64

In [211]:
dfa.A = list(range(len(dfa.index)))   # ok if A already exists
dfa

Unnamed: 0,A,B,C,D
2000-01-01,0,-0.501628,1.221096,-1.011678
2000-01-02,1,0.143116,0.354171,1.532565
2000-01-03,2,-0.193109,-0.21391,0.127459
2000-01-04,3,-1.089993,-0.430479,0.53619
2000-01-05,4,0.333507,0.945588,-2.058857
2000-01-06,5,0.931516,0.65152,0.217818
2000-01-07,6,0.17717,-0.736386,-0.298658
2000-01-08,7,0.61027,-0.325076,1.448778


In [212]:
dfa['F'] = list(range(len(dfa.index)))  # use this form to create a new column
dfa

Unnamed: 0,A,B,C,D,F
2000-01-01,0,-0.501628,1.221096,-1.011678,0
2000-01-02,1,0.143116,0.354171,1.532565,1
2000-01-03,2,-0.193109,-0.21391,0.127459,2
2000-01-04,3,-1.089993,-0.430479,0.53619,3
2000-01-05,4,0.333507,0.945588,-2.058857,4
2000-01-06,5,0.931516,0.65152,0.217818,5
2000-01-07,6,0.17717,-0.736386,-0.298658,6
2000-01-08,7,0.61027,-0.325076,1.448778,7


You can also assign a `dict` to a row of a `DataFrame`:

In [213]:
x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})
x.iloc[1] = {'x': 9, 'y': 99}
x

Unnamed: 0,x,y
0,1,3
1,9,99
2,3,5


You can use attribute access to modify an existing element of a `Series` or column of a `DataFrame`, but be careful; if you try to use attribute access to create a new column, it creates a new attribute rather than a new column. In 0.21.0 and later, this will raise a UserWarning:

In [214]:
dfsimple = pd.DataFrame({'one': [1., 2., 3.]})
dfsimple.two = [4, 5, 6]

  dfsimple.two = [4, 5, 6]


In [215]:
dfsimple

Unnamed: 0,one
0,1.0
1,2.0
2,3.0


In [216]:
dfsimple.loc[:, 'two'] = [4, 5, 6]
dfsimple

Unnamed: 0,one,two
0,1.0,4
1,2.0,5
2,3.0,6
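
The difference between attribute assignment and item assignment can be checked directly. In this sketch the name `frame` is arbitrary, and the warning is silenced only to keep the output clean.

```python
import warnings

import pandas as pd

frame = pd.DataFrame({"one": [1.0, 2.0, 3.0]})

with warnings.catch_warnings():
    # Silence the UserWarning described above for this demonstration.
    warnings.simplefilter("ignore", UserWarning)
    frame.two = [4, 5, 6]                     # sets a plain Python attribute

has_two_after_attr = "two" in frame.columns   # False: no column was created

frame["two"] = [4, 5, 6]                      # item assignment creates the column
has_two_after_item = "two" in frame.columns   # True
```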


### Slicing ranges
The most robust and consistent way of slicing ranges along arbitrary axes is described in the Selection by Position section detailing the `.iloc` method. For now, we explain the semantics of slicing using the `[]` operator.

With Series, the syntax works exactly as with an `ndarray`, returning a slice of the values and the corresponding labels:

In [217]:
s[:5]

2000-01-01   -0.027247
2000-01-02   -0.871463
2000-01-03   -1.615947
2000-01-04    1.200929
2000-01-05   -0.643188
Freq: D, Name: A, dtype: float64

In [218]:
s[::2]

2000-01-01   -0.027247
2000-01-03   -1.615947
2000-01-05   -0.643188
2000-01-07   -0.315621
Freq: 2D, Name: A, dtype: float64

In [219]:
s[::-1]

2000-01-08   -1.177430
2000-01-07   -0.315621
2000-01-06    1.345539
2000-01-05   -0.643188
2000-01-04    1.200929
2000-01-03   -1.615947
2000-01-02   -0.871463
2000-01-01   -0.027247
Freq: -1D, Name: A, dtype: float64

In [220]:
s2 = s.copy()
s2[:5] = 0
s2

2000-01-01    0.000000
2000-01-02    0.000000
2000-01-03    0.000000
2000-01-04    0.000000
2000-01-05    0.000000
2000-01-06    1.345539
2000-01-07   -0.315621
2000-01-08   -1.177430
Freq: D, Name: A, dtype: float64

In [221]:
s

2000-01-01   -0.027247
2000-01-02   -0.871463
2000-01-03   -1.615947
2000-01-04    1.200929
2000-01-05   -0.643188
2000-01-06    1.345539
2000-01-07   -0.315621
2000-01-08   -1.177430
Freq: D, Name: A, dtype: float64

With DataFrame, slicing inside of `[]` **slices the rows**. This is provided largely as a convenience since it is such a common operation.

In [222]:
df[1:3] # same as df.iloc[1:3]; df.loc[1:3] raises a TypeError on this DatetimeIndex

Unnamed: 0,A,B,C,D
2000-01-02,-0.871463,0.143116,0.354171,1.532565
2000-01-03,-1.615947,-0.193109,-0.21391,0.127459


In [223]:
df.loc[:, 'A':'B'] # df[:, 'A':'B'] raises an error

Unnamed: 0,A,B
2000-01-01,-0.027247,-0.501628
2000-01-02,-0.871463,0.143116
2000-01-03,-1.615947,-0.193109
2000-01-04,1.200929,-1.089993
2000-01-05,-0.643188,0.333507
2000-01-06,1.345539,0.931516
2000-01-07,-0.315621,0.17717
2000-01-08,-1.17743,0.61027


### Selection by label

Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called `chained assignment` and should be avoided. See [Returning a View versus Copy](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-view-versus-copy).

`.loc` is strict when you present slicers that are not compatible (or convertible) with the index type, for example integers on a `DatetimeIndex`. Such slicers raise a `TypeError`.

In [224]:
df1 = pd.DataFrame(np.random.randn(5, 4), columns=list('ABCD'), index=pd.date_range('20130101', periods=5))
df1

Unnamed: 0,A,B,C,D
2013-01-01,0.696762,0.26015,-0.019224,0.631167
2013-01-02,-1.549131,-2.293421,-0.45383,0.966619
2013-01-03,0.534738,-0.934513,0.044854,0.196777
2013-01-04,1.15202,0.164309,-0.761664,-1.246693
2013-01-05,0.194607,-1.808899,-0.077691,-0.298423


In [225]:
# df1.loc[2:3]  # raises TypeError: integers are not valid labels for a DatetimeIndex
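
The commented-out call can be exercised safely inside a `try` block. This sketch rebuilds a frame of the same shape under the made-up name `ts`, with deterministic values instead of random ones.

```python
import numpy as np
import pandas as pd

ts = pd.DataFrame(
    np.arange(20).reshape(5, 4),
    columns=list("ABCD"),
    index=pd.date_range("20130101", periods=5),
)

try:
    ts.loc[2:3]                  # integers are not valid labels here
    raised = False
except TypeError:
    raised = True

by_position = ts.iloc[2:3]       # positional slicing belongs to .iloc
by_label = ts.loc["20130102":"20130104"]  # strings convert to timestamps
```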

In [226]:
df1.loc['20130102':'20130104']

Unnamed: 0,A,B,C,D
2013-01-02,-1.549131,-2.293421,-0.45383,0.966619
2013-01-03,0.534738,-0.934513,0.044854,0.196777
2013-01-04,1.15202,0.164309,-0.761664,-1.246693


String-like values in a slice can be converted to the type of the index, which leads to natural slicing, as in the example above.

pandas provides a suite of methods in order to have **purely label based indexing**. This is a strict inclusion based protocol. Every label asked for must be in the index, or a `KeyError` will be raised. When slicing, both the start bound **AND** the stop bound are *included*, if present in the index. Integers are valid labels, but they refer to the label **and not the position**.

The `.loc` attribute is the primary access method. The following are valid inputs:

- A single label, e.g. `5` or `'a'` (Note that `5` is interpreted as a *label* of the index. This use is **not** an integer position along the index.).
- A list or array of labels `['a', 'b', 'c']`.
- A slice object with labels `'a':'f'` (note that, contrary to usual Python slices, **both** the start and the stop are included when present in the index; see [Slicing with labels](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-slicing-with-labels)).
- A boolean array.
- A `callable`, see [Selection By Callable](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-callable).

#### Series

In [227]:
s1 = pd.Series(np.random.randn(6), index=list('abcdef'))
s1

a    1.951408
b   -0.249293
c   -1.263203
d   -1.235176
e    0.406092
f    0.896819
dtype: float64

In [228]:
s1.loc['c':]

c   -1.263203
d   -1.235176
e    0.406092
f    0.896819
dtype: float64

In [229]:
s1.loc['b']

-0.24929345723991522

In [230]:
s1.loc['c':] = 0 # setting works as well
s1

a    1.951408
b   -0.249293
c    0.000000
d    0.000000
e    0.000000
f    0.000000
dtype: float64

#### DataFrame

In [231]:
df1 = pd.DataFrame(np.random.randn(6, 4), index=list('abcdef'), columns=list('ABCD'))
df1

Unnamed: 0,A,B,C,D
a,0.032759,0.056776,0.92607,0.556041
b,1.091456,-1.529164,0.108102,1.431675
c,-0.75138,0.755986,-1.758644,-0.875825
d,-1.538838,-1.142484,0.16691,0.933001
e,1.230151,-0.059232,-0.759146,0.010696
f,-1.140682,-1.956142,-2.185962,-0.086471


In [232]:
df1.loc[['a', 'b', 'd'], :]

Unnamed: 0,A,B,C,D
a,0.032759,0.056776,0.92607,0.556041
b,1.091456,-1.529164,0.108102,1.431675
d,-1.538838,-1.142484,0.16691,0.933001


#### Accessing via label slices

In [233]:
df1.loc['d':, 'A':'C']

Unnamed: 0,A,B,C
d,-1.538838,-1.142484,0.16691
e,1.230151,-0.059232,-0.759146
f,-1.140682,-1.956142,-2.185962


##### For getting a cross section using a label (equivalent to `df.xs('a')`):

In [234]:
df1.loc['a']

A    0.032759
B    0.056776
C    0.926070
D    0.556041
Name: a, dtype: float64

##### For getting values with a boolean array:

In [235]:
df1.loc['a'] > 0

A    True
B    True
C    True
D    True
Name: a, dtype: bool

In [236]:
df1.loc[:, df1.loc['a'] > 0]

Unnamed: 0,A,B,C,D
a,0.032759,0.056776,0.92607,0.556041
b,1.091456,-1.529164,0.108102,1.431675
c,-0.75138,0.755986,-1.758644,-0.875825
d,-1.538838,-1.142484,0.16691,0.933001
e,1.230151,-0.059232,-0.759146,0.010696
f,-1.140682,-1.956142,-2.185962,-0.086471


#### Slicing with labels

In [237]:
s = pd.Series(list('abcde'), index=[0, 3, 2, 5, 4])
s.loc[3:5] # elements located between the two (including them)

3    b
2    c
5    d
dtype: object

If at least one of the two is absent, but the index is sorted, and can be compared against start and stop labels, then slicing will still work as expected, by selecting labels which rank between the two:

In [238]:
s.sort_index()

0    a
2    c
3    b
4    e
5    d
dtype: object

In [239]:
s.sort_index().loc[1:6]

2    c
3    b
4    e
5    d
dtype: object

However, if at least one of the two is absent *and* the index is not sorted, an error will be raised (since doing otherwise would be computationally expensive, as well as potentially ambiguous for mixed type indexes). For instance, in the above example, `s.loc[1:6]` would raise `KeyError`.
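
Both cases can be reproduced with the same Series as in the example above. The variable names `letters`, `outcome`, and `in_range` are chosen here for illustration.

```python
import pandas as pd

letters = pd.Series(list("abcde"), index=[0, 3, 2, 5, 4])

try:
    letters.loc[1:6]             # 1 and 6 are absent and the index is unsorted
    outcome = "no error"
except KeyError:
    outcome = "KeyError"

# After sorting, the absent bounds are compared against the labels instead.
in_range = letters.sort_index().loc[1:6]
```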

### Selection by position
pandas provides a suite of methods in order to get **purely integer based indexing**. The semantics closely follow Python and NumPy slicing, and indexing is 0-based. When slicing, the start bound is *included*, while the upper bound is *excluded*. Trying to use a non-integer, even a **valid** label, will raise an `IndexError`.

The `.iloc` attribute is the primary access method. The following are valid inputs:

- An integer e.g. `5`.
- A list or array of integers `[4, 3, 0]`.
- A slice object with ints `1:7`.
- A boolean array.
- A `callable`, see [Selection By Callable](https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-callable).

In [240]:
s1 = pd.Series(np.random.randn(5), index=list(range(0, 10, 2)))
s1

0    1.573731
2   -0.159886
4   -1.089443
6    0.032045
8   -0.886351
dtype: float64

In [241]:
s1.iloc[:3]

0    1.573731
2   -0.159886
4   -1.089443
dtype: float64

In [242]:
s1.iloc[3]

0.032044882649583564

In [243]:
df1 = pd.DataFrame(np.random.randn(6, 4), index=list(range(0, 12, 2)), columns=list(range(0, 8, 2)))
df1

Unnamed: 0,0,2,4,6
0,0.052635,-0.680668,0.864208,-0.318706
2,-0.482836,0.182784,1.771977,-1.169878
4,0.109942,-0.960959,0.918863,1.757854
6,-1.251424,1.789567,0.724871,0.503765
8,2.085524,2.204217,1.300644,-0.083999
10,0.127551,-1.225366,2.154979,0.133515


In [244]:
df1.iloc[:3] # select via integer slicing

Unnamed: 0,0,2,4,6
0,0.052635,-0.680668,0.864208,-0.318706
2,-0.482836,0.182784,1.771977,-1.169878
4,0.109942,-0.960959,0.918863,1.757854


In [245]:
df1.iloc[1:5, 2:4] # select via integer slicing

Unnamed: 0,4,6
2,1.771977,-1.169878
4,0.918863,1.757854
6,0.724871,0.503765
8,1.300644,-0.083999


In [246]:
df1.iloc[[1, 3, 5], [1, 3]] # select via integer list

Unnamed: 0,2,6
2,0.182784,-1.169878
6,1.789567,0.503765
10,-1.225366,0.133515


In [247]:
df1.iloc[1, 1] # this is also equivalent to df1.iat[1,1]

0.18278429888313244

In [248]:
df1.iloc[1] # row at position 1, whose label is 2; same row as df1.xs(2)

0   -0.482836
2    0.182784
4    1.771977
6   -1.169878
Name: 2, dtype: float64

Out of range slice indexes are handled gracefully just as in Python/Numpy.
...
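
A short sketch of the out-of-range behavior, using a made-up Series `chars`: slices are truncated silently, while a single out-of-bounds position raises an `IndexError`.

```python
import pandas as pd

chars = pd.Series(list("abcdef"))

tail = chars.iloc[4:10]     # stop is past the end: returns what exists
empty = chars.iloc[8:10]    # entirely out of range: empty Series

try:
    chars.iloc[10]          # a single out-of-bounds position is an error
    single_ok = True
except IndexError:
    single_ok = False
```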

## Selection by callable
`.loc`, `.iloc`, and also `[]` indexing can accept a `callable` as indexer. The `callable` must be a function with one argument (the calling Series or DataFrame) that returns valid output for indexing.

In [249]:
df1 = pd.DataFrame(np.random.randn(6, 4), index=list('abcdef'), columns=list('ABCD'))
df1

Unnamed: 0,A,B,C,D
a,0.047023,-0.867095,1.064774,1.788035
b,-0.077078,-0.195172,0.128761,-0.487579
c,2.633808,-1.408664,0.627334,-0.576732
d,0.267644,-0.932889,-1.188578,-0.520702
e,-0.00078,0.831552,1.14988,-0.59192
f,-0.574381,-0.563365,0.471458,-0.966883


In [250]:
df1.loc[lambda df: df['A'] > 0, :]

Unnamed: 0,A,B,C,D
a,0.047023,-0.867095,1.064774,1.788035
c,2.633808,-1.408664,0.627334,-0.576732
d,0.267644,-0.932889,-1.188578,-0.520702


In [251]:
df1.loc[:, lambda df: ['A', 'B']]

Unnamed: 0,A,B
a,0.047023,-0.867095
b,-0.077078,-0.195172
c,2.633808,-1.408664
d,0.267644,-0.932889
e,-0.00078,0.831552
f,-0.574381,-0.563365


In [252]:
df1.iloc[:, lambda df: [0, 1]]

Unnamed: 0,A,B
a,0.047023,-0.867095
b,-0.077078,-0.195172
c,2.633808,-1.408664
d,0.267644,-0.932889
e,-0.00078,0.831552
f,-0.574381,-0.563365


In [253]:
df1[lambda df: df.columns[0]]

a    0.047023
b   -0.077078
c    2.633808
d    0.267644
e   -0.000780
f   -0.574381
Name: A, dtype: float64
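
Callable indexers are most useful in method chains, where the intermediate result has no variable name to refer to. The sketch below assumes a small invented frame `raw`; the column names are hypothetical.

```python
import pandas as pd

raw = pd.DataFrame({"team": ["x", "y", "x", "y"], "score": [3, 8, 5, 1]})

# Each lambda receives the intermediate frame produced by the previous step.
high = (
    raw.assign(double=lambda d: d["score"] * 2)
       .loc[lambda d: d["double"] > 5, ["team", "double"]]
)
```

Without the callable, the chain would have to be broken up so that the result of `assign` could be stored in a temporary variable first.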

## Indexing with list with missing labels is deprecated

### Reindexing
The idiomatic way to achieve selecting potentially not-found elements is via `.reindex()`.
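
A minimal sketch of the two usual options, with invented names `ser` and `wanted`: `.reindex()` keeps every requested label and fills the missing ones with `NaN`, while intersecting the labels with the index first keeps only those that exist.

```python
import pandas as pd

ser = pd.Series([10, 20, 30], index=["a", "b", "c"])
wanted = ["a", "c", "z"]            # "z" is not in the index

with_gaps = ser.reindex(wanted)     # missing labels become NaN
present_only = ser.loc[ser.index.intersection(wanted)]  # drops "z"
```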

## Selecting random samples

In [254]:
s1.sample() # return 1 row

4   -1.089443
dtype: float64

In [255]:
s1.sample(n=3) # number of rows

4   -1.089443
0    1.573731
6    0.032045
dtype: float64

In [256]:
s1.sample(frac=0.5) # fraction of the rows

0    1.573731
8   -0.886351
dtype: float64

In [257]:
s1.sample(n=6, replace=True) # sample with replacement: a row may be drawn more than once

6    0.032045
6    0.032045
0    1.573731
6    0.032045
8   -0.886351
4   -1.089443
dtype: float64

In [258]:
s = pd.Series([0, 1, 2, 3, 4, 5])
example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]
s.sample(n=3, weights=example_weights) # per-row sampling weights; rows with weight 0 are never drawn

3    3
5    5
4    4
dtype: int64

In [259]:
df2 = pd.DataFrame({'col1': [9, 8, 7, 6], 'weight_column': [0.5, 0.4, 0.1, 0]})
df2.sample(n=3, weights='weight_column')

Unnamed: 0,col1,weight_column
1,8,0.4
0,9,0.5
2,7,0.1


In [260]:
df3 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
df3.sample(n=1, axis=1) # sample columns instead of rows

Unnamed: 0,col2
0,2
1,3
2,4


In [261]:
df3.sample(n=2, random_state=2) # With a given seed, the sample will always draw the same rows.

Unnamed: 0,col1,col2
2,3,4
1,2,3


In [262]:
df3.sample(n=2, random_state=2)

Unnamed: 0,col1,col2
2,3,4
1,2,3


## Setting with enlargement
The `.loc/[]` operations can perform enlargement when setting a non-existent key for that axis.

In [263]:
se = pd.Series([1, 2, 3])
se

0    1
1    2
2    3
dtype: int64

In [264]:
se[5] = 5
se

0    1
1    2
2    3
5    5
dtype: int64

In [265]:
dfi = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['A', 'B'])

In [266]:
dfi

Unnamed: 0,A,B
0,0,1
1,2,3
2,4,5


##### A DataFrame can be enlarged on either axis via `.loc`.

In [267]:
dfi.loc[:, 'C'] = dfi.loc[:, 'A']
dfi

Unnamed: 0,A,B,C
0,0,1,0
1,2,3,2
2,4,5,4


## Fast scalar value getting and setting
Since indexing with `[]` must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you’re asking for. If you only want to access a scalar value, the fastest way is to use the `at` and `iat` methods, which are implemented on all of the data structures.

Similarly to `loc`, `at` provides **label** based scalar lookups, while `iat` provides **integer** based lookups analogously to `iloc`.

In [268]:
s

0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64

In [269]:
s.iat[5]

5

In [270]:
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
2000-01-01,0.181186,0.092791,-1.316499,0.296481
2000-01-02,0.079356,-0.436367,-0.157179,0.281315
2000-01-03,-0.127117,0.003107,0.108214,-0.030207
2000-01-04,0.029655,-0.078306,0.393248,0.352226
2000-01-05,-0.898349,-0.641281,0.418785,1.030004
2000-01-06,0.462188,-1.323855,1.209187,1.191948
2000-01-07,0.026445,0.510044,0.766912,0.312119
2000-01-08,-0.267886,0.055204,-0.262761,-0.180388


In [271]:
df.at[dates[5], 'A']

0.4621876749867321

In [272]:
df.at[dates[5], 'E'] = 7
df.iat[3, 0] = 7

`at` may enlarge the object in-place as above if the indexer is missing.

In [273]:
df.at[dates[-1] + pd.Timedelta('1 day'), 0] = 7
df

Unnamed: 0,A,B,C,D,E,0
2000-01-01,0.181186,0.092791,-1.316499,0.296481,,
2000-01-02,0.079356,-0.436367,-0.157179,0.281315,,
2000-01-03,-0.127117,0.003107,0.108214,-0.030207,,
2000-01-04,7.0,-0.078306,0.393248,0.352226,,
2000-01-05,-0.898349,-0.641281,0.418785,1.030004,,
2000-01-06,0.462188,-1.323855,1.209187,1.191948,7.0,
2000-01-07,0.026445,0.510044,0.766912,0.312119,,
2000-01-08,-0.267886,0.055204,-0.262761,-0.180388,,
2000-01-09,,,,,,7.0


## Boolean indexing
Another common operation is the use of boolean vectors to filter the data. The operators are: `|` for `or`, `&` for `and`, and `~` for `not`. These **must** be grouped by using parentheses, since by default Python will evaluate an expression such as `df['A'] > 2 & df['B'] < 3` as `df['A'] > (2 & df['B']) < 3`, while the desired evaluation order is `(df['A'] > 2) & (df['B'] < 3)`.
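
The precedence problem is easy to reproduce. In this sketch the frame `nums` is invented and uses integer columns so that `2 & nums["B"]` is itself a valid operation; the unparenthesized form then fails on the chained comparison.

```python
import pandas as pd

nums = pd.DataFrame({"A": [1, 3, 5], "B": [1, 2, 4]})

try:
    # Parsed as nums["A"] > (2 & nums["B"]) < 3, a chained comparison
    # that needs the truth value of a whole Series.
    nums[nums["A"] > 2 & nums["B"] < 3]
    unparenthesized = "ok"
except ValueError:
    unparenthesized = "ValueError"

filtered = nums[(nums["A"] > 2) & (nums["B"] < 3)]   # intended grouping
```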

In [274]:
s = pd.Series(range(-3, 4))
s

0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64

In [275]:
s[s > 0]

4    1
5    2
6    3
dtype: int64

In [276]:
s[(s < -1) | (s > 0.5)]

0   -3
1   -2
4    1
5    2
6    3
dtype: int64

In [277]:
s[~(s < 0)]

3    0
4    1
5    2
6    3
dtype: int64

In [278]:
df[df['A'] > 0]

Unnamed: 0,A,B,C,D,E,0
2000-01-01,0.181186,0.092791,-1.316499,0.296481,,
2000-01-02,0.079356,-0.436367,-0.157179,0.281315,,
2000-01-04,7.0,-0.078306,0.393248,0.352226,,
2000-01-06,0.462188,-1.323855,1.209187,1.191948,7.0,
2000-01-07,0.026445,0.510044,0.766912,0.312119,,


In [279]:
df2 = pd.DataFrame({
    'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
    'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'], 
    'c': np.random.randn(7)
    })
df2

Unnamed: 0,a,b,c
0,one,x,-0.828673
1,one,y,0.229465
2,two,y,0.730438
3,three,x,-1.424645
4,two,y,-0.050984
5,one,x,0.379044
6,six,x,0.035305


##### `map`

In [280]:
criterion = df2['a'].map(lambda x: x.startswith('t'))
df2[criterion]

Unnamed: 0,a,b,c
2,two,y,0.730438
3,three,x,-1.424645
4,two,y,-0.050984


In [281]:
df2[[x.startswith('t') for x in df2['a']]] # equivalent but slower

Unnamed: 0,a,b,c
2,two,y,0.730438
3,three,x,-1.424645
4,two,y,-0.050984


In [282]:
df2[criterion & (df2['b'] == 'x')]

Unnamed: 0,a,b,c
3,three,x,-1.424645


In [283]:
df2.loc[criterion & (df2['b'] == 'x'), 'b':'c']

Unnamed: 0,b,c
3,x,-1.424645


## Indexing with isin

### Series
Consider the [`isin()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html#pandas.Series.isin) method of `Series`, which returns a boolean vector that is true wherever the `Series` elements exist in the passed list. This allows you to select rows where one or more columns have the values you want.

In [284]:
s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
s

4    0
3    1
2    2
1    3
0    4
dtype: int64

In [285]:
s.isin([2, 4, 6])

4    False
3    False
2     True
1    False
0     True
dtype: bool

In [286]:
s[s.isin([2, 4, 6])]

2    2
0    4
dtype: int64

The same method is available for `Index` objects and is useful for the cases when you don’t know which of the sought labels are in fact present:

In [287]:
s[s.index.isin([2, 4, 6])]

4    0
2    2
dtype: int64

In [288]:
s.reindex([2, 4, 6]) # compare

2    2.0
4    0.0
6    NaN
dtype: float64

In addition to that, `MultiIndex` allows selecting a separate level to use in the membership check:

In [289]:
s_mi = pd.Series(np.arange(6), index=pd.MultiIndex.from_product([[0, 1], ['a', 'b', 'c']]))
s_mi

0  a    0
   b    1
   c    2
1  a    3
   b    4
   c    5
dtype: int32

In [290]:
s_mi.iloc[s_mi.index.isin([(1, 'a'), (2, 'b'), (0, 'c')])]

0  c    2
1  a    3
dtype: int32

In [291]:
s_mi.iloc[s_mi.index.isin(['a', 'c', 'e'], level=1)]

0  a    0
   c    2
1  a    3
   c    5
dtype: int32

### DataFrame
DataFrame also has an [`isin()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isin.html#pandas.DataFrame.isin) method. When calling `isin`, pass a set of values as either an **array** or **dict**. If values is an array, `isin` returns a DataFrame of booleans that is the same shape as the original DataFrame, with True wherever the element is in the sequence of values.

In [292]:
df = pd.DataFrame({'vals': [1, 2, 3, 4], 
                   'ids': ['a', 'b', 'f', 'n'], 
                   'ids2': ['a', 'n', 'c', 'n']})
df

Unnamed: 0,vals,ids,ids2
0,1,a,a
1,2,b,n
2,3,f,c
3,4,n,n


In [293]:
values = ['a', 'b', 1, 3]
df.isin(values)

Unnamed: 0,vals,ids,ids2
0,True,True,True
1,False,True,False
2,True,False,False
3,False,False,False


In [294]:
values = {'ids': ['a', 'b'], 'vals': [1, 3]}
df.isin(values)

Unnamed: 0,vals,ids,ids2
0,True,True,False
1,False,True,False
2,True,False,False
3,False,False,False


Combine DataFrame’s `isin` with the `any()` and `all()` methods to quickly select subsets of your data that meet a given criterion. To select a row where each column meets its own criterion:

In [295]:
values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}
pre_row_mask = df.isin(values)
df[pre_row_mask]

Unnamed: 0,vals,ids,ids2
0,1.0,a,a
1,,b,
2,3.0,,c
3,,,


In [296]:
row_mask = pre_row_mask.all(axis=1) # axis=1 reduces across the columns, giving one boolean per row
row_mask

0     True
1    False
2    False
3    False
dtype: bool

In [297]:
df[row_mask]

Unnamed: 0,vals,ids,ids2
0,1,a,a
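
The text mentions `any()` as well as `all()`, but only `all()` is shown. A sketch of both on the same data follows; the names `items`, `allowed`, and `hits` are chosen here for illustration.

```python
import pandas as pd

items = pd.DataFrame({"vals": [1, 2, 3, 4],
                      "ids": ["a", "b", "f", "n"],
                      "ids2": ["a", "n", "c", "n"]})
allowed = {"ids": ["a", "b"], "ids2": ["a", "c"], "vals": [1, 3]}

hits = items.isin(allowed)              # per-cell membership test
any_match = items[hits.any(axis=1)]     # at least one column matched
all_match = items[hits.all(axis=1)]     # every column matched
```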


## The `where()` Method and Masking

Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that selection output has the same shape as the original data, you can use the `where` method in `Series` and `DataFrame`.

To return only the selected rows:

In [298]:
s

4    0
3    1
2    2
1    3
0    4
dtype: int64

In [299]:
s[s > 0]

3    1
2    2
1    3
0    4
dtype: int64

In [300]:
s.where(s > 0)

4    NaN
3    1.0
2    2.0
1    3.0
0    4.0
dtype: float64

In [301]:
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
2000-01-01,-1.624134,0.148801,-0.589106,1.745429
2000-01-02,1.317259,0.981145,-0.097421,-2.649369
2000-01-03,-1.554583,-0.349321,-0.554042,1.310447
2000-01-04,-1.254769,1.033617,-1.825814,0.808718
2000-01-05,0.268206,0.513326,-1.218002,0.724987
2000-01-06,-0.897538,1.257497,0.764211,0.323889
2000-01-07,-1.566366,0.592235,-1.215264,-0.418668
2000-01-08,0.934434,-2.400253,-1.174835,1.741538


`df.where(df < 0)` is equivalent to `df[df < 0]`:

In [302]:
df[df < 0]

Unnamed: 0,A,B,C,D
2000-01-01,-1.624134,,-0.589106,
2000-01-02,,,-0.097421,-2.649369
2000-01-03,-1.554583,-0.349321,-0.554042,
2000-01-04,-1.254769,,-1.825814,
2000-01-05,,,-1.218002,
2000-01-06,-0.897538,,,
2000-01-07,-1.566366,,-1.215264,-0.418668
2000-01-08,,-2.400253,-1.174835,


In addition, `where` takes an optional `other` argument for replacement of values where the condition is False, in the returned copy.

In [303]:
df.where(df < 0, -df)

Unnamed: 0,A,B,C,D
2000-01-01,-1.624134,-0.148801,-0.589106,-1.745429
2000-01-02,-1.317259,-0.981145,-0.097421,-2.649369
2000-01-03,-1.554583,-0.349321,-0.554042,-1.310447
2000-01-04,-1.254769,-1.033617,-1.825814,-0.808718
2000-01-05,-0.268206,-0.513326,-1.218002,-0.724987
2000-01-06,-0.897538,-1.257497,-0.764211,-0.323889
2000-01-07,-1.566366,-0.592235,-1.215264,-0.418668
2000-01-08,-0.934434,-2.400253,-1.174835,-1.741538


You may wish to set values based on some boolean criteria. This can be done intuitively like so:

In [304]:
s2 = s.copy()
s2[s2 < 3] = 0
s2

4    0
3    0
2    0
1    3
0    4
dtype: int64

In [305]:
df2 = df.copy()
df2[df2 < 0] = 0
df2

Unnamed: 0,A,B,C,D
2000-01-01,0.0,0.148801,0.0,1.745429
2000-01-02,1.317259,0.981145,0.0,0.0
2000-01-03,0.0,0.0,0.0,1.310447
2000-01-04,0.0,1.033617,0.0,0.808718
2000-01-05,0.268206,0.513326,0.0,0.724987
2000-01-06,0.0,1.257497,0.764211,0.323889
2000-01-07,0.0,0.592235,0.0,0.0
2000-01-08,0.934434,0.0,0.0,1.741538


By default, `where` returns a modified copy of the data. There is an optional parameter `inplace` so that the original data can be modified without creating a copy:

In [306]:
df2 = df.copy()
df2.where(df > 0, -df, inplace=True)
df2

Unnamed: 0,A,B,C,D
2000-01-01,1.624134,0.148801,0.589106,1.745429
2000-01-02,1.317259,0.981145,0.097421,2.649369
2000-01-03,1.554583,0.349321,0.554042,1.310447
2000-01-04,1.254769,1.033617,1.825814,0.808718
2000-01-05,0.268206,0.513326,1.218002,0.724987
2000-01-06,0.897538,1.257497,0.764211,0.323889
2000-01-07,1.566366,0.592235,1.215264,0.418668
2000-01-08,0.934434,2.400253,1.174835,1.741538


The signature for [`DataFrame.where()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.where.html#pandas.DataFrame.where) differs from [`numpy.where()`](https://numpy.org/doc/stable/reference/generated/numpy.where.html#numpy.where). Roughly `df1.where(m, df2)` is equivalent to `np.where(m, df1, df2)`.

In [307]:
df.where(df < 0, -df) == np.where(df < 0, df, -df)

Unnamed: 0,A,B,C,D
2000-01-01,True,True,True,True
2000-01-02,True,True,True,True
2000-01-03,True,True,True,True
2000-01-04,True,True,True,True
2000-01-05,True,True,True,True
2000-01-06,True,True,True,True
2000-01-07,True,True,True,True
2000-01-08,True,True,True,True


#### Alignment
Furthermore, `where` aligns the input boolean condition (ndarray or DataFrame), such that partial selection with setting is possible. This is analogous to partial setting via `.loc` (but on the contents rather than the axis labels).

In [308]:
df2 = df.copy()
df2[df2[1:4] > 0] = 3
df2

Unnamed: 0,A,B,C,D
2000-01-01,-1.624134,0.148801,-0.589106,1.745429
2000-01-02,3.0,3.0,-0.097421,-2.649369
2000-01-03,-1.554583,-0.349321,-0.554042,3.0
2000-01-04,-1.254769,3.0,-1.825814,3.0
2000-01-05,0.268206,0.513326,-1.218002,0.724987
2000-01-06,-0.897538,1.257497,0.764211,0.323889
2000-01-07,-1.566366,0.592235,-1.215264,-0.418668
2000-01-08,0.934434,-2.400253,-1.174835,1.741538


`where` can also accept `axis` and `level` parameters to align the input when performing the `where`.

In [309]:
df2 = df.copy()
df2.where(df2 > 0, df2['A'], axis='index')

Unnamed: 0,A,B,C,D
2000-01-01,-1.624134,0.148801,-1.624134,1.745429
2000-01-02,1.317259,0.981145,1.317259,1.317259
2000-01-03,-1.554583,-1.554583,-1.554583,1.310447
2000-01-04,-1.254769,1.033617,-1.254769,0.808718
2000-01-05,0.268206,0.513326,0.268206,0.724987
2000-01-06,-0.897538,1.257497,0.764211,0.323889
2000-01-07,-1.566366,0.592235,-1.566366,-1.566366
2000-01-08,0.934434,0.934434,0.934434,1.741538


This is equivalent to (but faster than) the following.

In [310]:
df2 = df.copy()
df2.apply(lambda x, y: x.where(x > 0, y), y=df2['A'])

Unnamed: 0,A,B,C,D
2000-01-01,-1.624134,0.148801,-1.624134,1.745429
2000-01-02,1.317259,0.981145,1.317259,1.317259
2000-01-03,-1.554583,-1.554583,-1.554583,1.310447
2000-01-04,-1.254769,1.033617,-1.254769,0.808718
2000-01-05,0.268206,0.513326,0.268206,0.724987
2000-01-06,-0.897538,1.257497,0.764211,0.323889
2000-01-07,-1.566366,0.592235,-1.566366,-1.566366
2000-01-08,0.934434,0.934434,0.934434,1.741538


`where` can accept a callable as the condition and as the `other` argument. Each callable must take one argument (the calling Series or DataFrame) and return valid output for the condition or the `other` argument, respectively.

In [311]:
df3 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df3.where(lambda x: x > 4, lambda x: x + 10)

Unnamed: 0,A,B,C
0,11,14,7
1,12,5,8
2,13,6,9


## The [`query()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query) Method
[`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) objects have a [`query()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query) method that allows selection using an expression.



In [312]:
df = pd.DataFrame(np.random.rand(10, 3), columns=list('abc'))
df

Unnamed: 0,a,b,c
0,0.162244,0.618093,0.275828
1,0.995627,0.312349,0.414071
2,0.583457,0.299553,0.547863
3,0.327829,0.224183,0.012086
4,0.991408,0.815178,0.329658
5,0.460976,0.34258,0.24963
6,0.294134,0.605087,0.091292
7,0.204702,0.940988,0.129326
8,0.741616,0.352322,0.217248
9,0.495896,0.267719,0.14201


In [313]:
# pure python
df[(df['a'] < df['b']) & (df['b'] < df['c'])]

Unnamed: 0,a,b,c


In [314]:
# query
df.query('(a < b) & (b < c)')

Unnamed: 0,a,b,c


Do the same thing but fall back on a named index if there is no column with the name `a`.

In [315]:
df = pd.DataFrame(np.random.randint(10 // 2, size=(10, 2)), columns=list('bc'))
df.index.name = 'a'
df

Unnamed: 0_level_0,b,c
a,Unnamed: 1_level_1,Unnamed: 2_level_1
0,3,3
1,0,3
2,2,0
3,3,2
4,3,1
5,4,1
6,1,4
7,2,3
8,2,3
9,2,0


In [316]:
df.query('a < b and b < c')

Unnamed: 0_level_0,b,c
a,Unnamed: 1_level_1,Unnamed: 2_level_1


If instead you don’t want to or cannot name your index, you can use the name `index` in your query expression:

In [317]:
df.query('index < b < c')

Unnamed: 0_level_0,b,c
a,Unnamed: 1_level_1,Unnamed: 2_level_1


If the name of your index overlaps with a column name, the column name is given precedence.
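A minimal sketch of that precedence rule, using a small hand-written frame (the name `frame` and its values are illustrative and not part of the original notebook):

```python
import pandas as pd

# The column and the index deliberately share the name 'a'.
frame = pd.DataFrame({'a': [4, 0, 3, 1, 2]})
frame.index.name = 'a'

# 'a' resolves to the column, not the index.
by_column = frame.query('a > 2')

# The special name 'index' still reaches the index itself.
by_index = frame.query('index > 2')

print(by_column)
print(by_index)
```

Here `by_column` keeps the rows whose column value exceeds 2, while `by_index` keeps the rows whose index label exceeds 2.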

### [`MultiIndex`](https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html#pandas.MultiIndex) [`query()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query) Syntax
You can also use the levels of a `DataFrame` with a [`MultiIndex`](https://pandas.pydata.org/docs/reference/api/pandas.MultiIndex.html#pandas.MultiIndex) as if they were columns in the frame:

In [318]:
n = 10
colors = np.random.choice(['red', 'green'], size=n)
foods = np.random.choice(['eggs', 'ham'], size=n)
index = pd.MultiIndex.from_arrays([colors, foods], names=['color', 'food'])
df = pd.DataFrame(np.random.randn(n, 2), index=index)
df

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1
color,food,Unnamed: 2_level_1,Unnamed: 3_level_1
red,ham,-0.976417,-0.340411
red,eggs,0.566949,-0.550655
green,eggs,-0.221818,-1.73618
green,eggs,2.06512,-0.936475
green,ham,0.24274,1.649544
red,ham,-1.266554,-0.788354
green,ham,0.700053,-0.433579
green,eggs,0.096898,-0.772778
green,ham,-0.610445,-0.427649
green,eggs,0.555521,-0.875439


In [319]:
df.query('color == "red"')

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1
color,food,Unnamed: 2_level_1,Unnamed: 3_level_1
red,ham,-0.976417,-0.340411
red,eggs,0.566949,-0.550655
red,ham,-1.266554,-0.788354


If the levels of the `MultiIndex` are unnamed, you can refer to them using special names:

In [320]:
df.index.names = [None, None]
df

Unnamed: 0,Unnamed: 1,0,1
red,ham,-0.976417,-0.340411
red,eggs,0.566949,-0.550655
green,eggs,-0.221818,-1.73618
green,eggs,2.06512,-0.936475
green,ham,0.24274,1.649544
red,ham,-1.266554,-0.788354
green,ham,0.700053,-0.433579
green,eggs,0.096898,-0.772778
green,ham,-0.610445,-0.427649
green,eggs,0.555521,-0.875439


In [321]:
df.query('ilevel_0 == "red"')

Unnamed: 0,Unnamed: 1,0,1
red,ham,-0.976417,-0.340411
red,eggs,0.566949,-0.550655
red,ham,-1.266554,-0.788354


### [`query()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query) Use Cases

A use case for [`query()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query) is when you have a collection of [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame) objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames *without* having to specify which frame you’re interested in querying:

In [322]:
n = 3
df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
df2 = pd.DataFrame(np.random.rand(n + 2, 3), columns=df.columns)
expr = '0.0 <= a <= c <= 0.5'
map(lambda frame: frame.query(expr), [df, df2])

<map at 0x55a6580>
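In Python 3, `map` returns a lazy iterator, which is why the cell above displays a `map` object rather than the filtered frames. A list comprehension evaluates the query immediately. The sketch below uses small hand-written frames (`frame1` and `frame2` are illustrative names) so the result is reproducible:

```python
import pandas as pd

frame1 = pd.DataFrame({'a': [0.1, 0.6, 0.2],
                       'b': [0.5, 0.5, 0.5],
                       'c': [0.3, 0.9, 0.1]})
frame2 = pd.DataFrame({'a': [0.0, 0.4],
                       'b': [0.1, 0.2],
                       'c': [0.5, 0.45]})

expr = '0.0 <= a <= c <= 0.5'

# The same expression is applied to every frame; each result is a DataFrame.
results = [frame.query(expr) for frame in (frame1, frame2)]

for result in results:
    print(result)
```

Wrapping the original call as `list(map(...))` would have the same effect.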

### [`query()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query) Python versus pandas Syntax Comparison

Full numpy-like syntax:

In [323]:
df.query('(a < b) & (b < c)')

Unnamed: 0,a,b,c


In [324]:
df[(df['a'] < df['b']) & (df['b'] < df['c'])]

Unnamed: 0,a,b,c


### The `in` and `not in` operators

[`query()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query) also supports special use of Python’s `in` and `not in` comparison operators, providing a succinct syntax for calling the `isin` method of a `Series` or `DataFrame`.

In [325]:
df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
                   'c': np.random.randint(5, size=12),
                   'd': np.random.randint(9, size=12)})
df

Unnamed: 0,a,b,c,d
0,a,a,3,7
1,a,a,3,1
2,b,a,1,0
3,b,a,4,8
4,c,b,4,5
5,c,b,0,0
6,d,b,3,2
7,d,b,0,1
8,e,c,1,3
9,e,c,3,8
10,f,c,0,4
11,f,c,2,2


In [326]:
df.query('a in b') # same as df[df['a'].isin(df['b'])]

Unnamed: 0,a,b,c,d
0,a,a,3,7
1,a,a,3,1
2,b,a,1,0
3,b,a,4,8
4,c,b,4,5
5,c,b,0,0


In [327]:
df.query('a not in b') # same as df[~df['a'].isin(df['b'])]

Unnamed: 0,a,b,c,d
6,d,b,3,2
7,d,b,0,1
8,e,c,1,3
9,e,c,3,8
10,f,c,0,4
11,f,c,2,2


### Special use of the `==` operator with `list` objects

Comparing a `list` of values to a column using `==`/`!=` works similarly to `in`/`not in`.

In [328]:
df.query('b == ["a", "b", "c"]') # same as df[df['b'].isin(["a", "b", "c"])]

Unnamed: 0,a,b,c,d
0,a,a,3,7
1,a,a,3,1
2,b,a,1,0
3,b,a,4,8
4,c,b,4,5
5,c,b,0,0
6,d,b,3,2
7,d,b,0,1
8,e,c,1,3
9,e,c,3,8
10,f,c,0,4
11,f,c,2,2


In [329]:
df.query('c == [1, 2]')  # same as df.query('[1, 2] in c') or df.query('c in [1, 2]')

Unnamed: 0,a,b,c,d
2,b,a,1,0
8,e,c,1,3
11,f,c,2,2


In [330]:
df.query('c != [1, 2]')

Unnamed: 0,a,b,c,d
0,a,a,3,7
1,a,a,3,1
3,b,a,4,8
4,c,b,4,5
5,c,b,0,0
6,d,b,3,2
7,d,b,0,1
9,e,c,3,8
10,f,c,0,4


In [331]:
df.query('c in [1, 2]')

Unnamed: 0,a,b,c,d
2,b,a,1,0
8,e,c,1,3
11,f,c,2,2


In [332]:
df.query('[1, 2] in c')

Unnamed: 0,a,b,c,d
2,b,a,1,0
8,e,c,1,3
11,f,c,2,2


### Boolean operators

You can negate boolean expressions with the word `not` or the `~` operator.

In [333]:
n = 10
df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))
df['bools'] = np.random.rand(len(df)) > 0.5
df.query('~bools') # same as df.query('not bools') and df[~df['bools']]

Unnamed: 0,a,b,c,bools
0,0.338916,0.422961,0.157816,False
3,0.129101,0.0853,0.486881,False


In [334]:
shorter = df.query('a < b < c and (not bools) or bools > 2')
shorter

Unnamed: 0,a,b,c,bools


In [335]:
longer = df[(df['a'] < df['b'])
            & (df['b'] < df['c'])
            & (~df['bools'])
            | (df['bools'] > 2)]
longer

Unnamed: 0,a,b,c,bools


`DataFrame.query()` using `numexpr` is slightly faster than Python for large frames.

## Duplicate data

If you want to identify and remove duplicate rows in a DataFrame, there are two methods that will help: `duplicated` and `drop_duplicates`. Each takes as an argument the columns to use to identify duplicated rows.

- `duplicated` returns a boolean vector whose length is the number of rows, and which indicates whether a row is duplicated.
- `drop_duplicates` removes duplicate rows.

By default, the first observed row of a duplicate set is considered unique, but each method has a `keep` parameter to specify which rows to keep.

- `keep='first'` (default): mark / drop duplicates except for the first occurrence.
- `keep='last'`: mark / drop duplicates except for the last occurrence.
- `keep=False`: mark / drop all duplicates.

In [336]:
df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
                    'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
                    'c': np.random.randn(7)})
df2

Unnamed: 0,a,b,c
0,one,x,-1.054577
1,one,y,-0.222925
2,two,x,0.877194
3,two,y,0.214998
4,two,x,-0.101185
5,three,x,-0.727296
6,four,x,0.388022


In [337]:
df2.duplicated('a')

0    False
1     True
2    False
3     True
4     True
5    False
6    False
dtype: bool

In [338]:
df2.drop_duplicates('a')

Unnamed: 0,a,b,c
0,one,x,-1.054577
2,two,x,0.877194
5,three,x,-0.727296
6,four,x,0.388022


In [339]:
df2.drop_duplicates('a', keep=False)

Unnamed: 0,a,b,c
5,three,x,-0.727296
6,four,x,0.388022


In [340]:
df2.drop_duplicates(['a', 'b']) # you can pass a list of columns

Unnamed: 0,a,b,c
0,one,x,-1.054577
1,one,y,-0.222925
2,two,x,0.877194
3,two,y,0.214998
5,three,x,-0.727296
6,four,x,0.388022
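The cells above exercise the default `keep='first'` and `keep=False`. As a complementary sketch, `keep='last'` retains the final occurrence in each duplicate set. The frame below mirrors the `a` and `b` columns of `df2` under the illustrative name `frame`:

```python
import pandas as pd

frame = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
                      'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x']})

# True marks every duplicate of 'a' except the last occurrence.
mask_last = frame.duplicated('a', keep='last')

# Drop those rows, keeping only the last row for each value of 'a'.
kept_last = frame.drop_duplicates('a', keep='last')

print(mask_last)
print(kept_last)
```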


To drop duplicates by index value, use `Index.duplicated` then perform slicing. The same set of options is available for the `keep` parameter.

In [341]:
df3 = pd.DataFrame({'a': np.arange(6),
                    'b': np.random.randn(6)},
                   index=['a', 'a', 'b', 'c', 'b', 'a'])
df3

Unnamed: 0,a,b
a,0,0.199639
a,1,-0.338407
b,2,0.98035
c,3,0.255357
b,4,-0.771023
a,5,-0.253611


In [342]:
df3[~df3.index.duplicated()]

Unnamed: 0,a,b
a,0,0.199639
b,2,0.98035
c,3,0.255357


## Dictionary-like [`get()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.get.html#pandas.DataFrame.get) method

Both Series and DataFrame have a `get` method that can return a default value when the requested key is missing.

In [343]:
s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s.get('a')  # equivalent to s['a']

1

In [344]:
s.get('x', default=-1)

-1
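The same method is available on a DataFrame, where the key is a column label. A short sketch, with an illustrative frame named `frame`:

```python
import pandas as pd

frame = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# An existing column is returned as a Series, like frame['a'].
col_a = frame.get('a')

# A missing column returns None instead of raising KeyError.
missing = frame.get('z')

# A custom default can be supplied for missing columns.
fallback = frame.get('z', default='no such column')

print(col_a)
print(missing)
print(fallback)
```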

## The [`lookup()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.lookup.html#pandas.DataFrame.lookup) method

Sometimes you want to extract a set of values given a sequence of row labels and column labels, and the `lookup` method allows for this and returns a NumPy array. For instance:

In [345]:
dflookup = pd.DataFrame(np.random.rand(20, 4), columns = ['A', 'B', 'C', 'D'])
dflookup.lookup(list(range(0, 10, 2)), ['B', 'C', 'A', 'B', 'D'])

array([0.30763619, 0.51238417, 0.92366556, 0.27009083, 0.66835886])
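Recent pandas releases no longer provide `DataFrame.lookup` (it was deprecated in 1.2 and is absent from 2.x). One possible substitute, sketched here as an assumption and not as the library's official replacement, translates the labels to positions with `Index.get_indexer` and then uses NumPy fancy indexing. The frame and label lists are illustrative:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(20).reshape(5, 4), columns=['A', 'B', 'C', 'D'])

row_labels = [0, 2, 4]
col_labels = ['B', 'C', 'A']

# Convert row and column labels to integer positions.
row_pos = frame.index.get_indexer(row_labels)
col_pos = frame.columns.get_indexer(col_labels)

# Pick one value per (row, column) pair.
values = frame.to_numpy()[row_pos, col_pos]
print(values)
```

Note that `get_indexer` returns `-1` for labels it cannot find, so a production version would want to check for that before indexing.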

## Index objects

The pandas [`Index`](https://pandas.pydata.org/docs/reference/api/pandas.Index.html#pandas.Index) class and its subclasses can be viewed as implementing an *ordered multiset*. Duplicates are allowed. However, if you try to convert an [`Index`](https://pandas.pydata.org/docs/reference/api/pandas.Index.html#pandas.Index) object with duplicate entries into a `set`, an exception will be raised.

[`Index`](https://pandas.pydata.org/docs/reference/api/pandas.Index.html#pandas.Index) also provides the infrastructure necessary for lookups, data alignment, and reindexing. The easiest way to create an [`Index`](https://pandas.pydata.org/docs/reference/api/pandas.Index.html#pandas.Index) directly is to pass a `list` or other sequence to [`Index`](https://pandas.pydata.org/docs/reference/api/pandas.Index.html#pandas.Index):

In [346]:
index = pd.Index(['e', 'd', 'a', 'b'])
index

Index(['e', 'd', 'a', 'b'], dtype='object')

In [347]:
index = pd.Index(['e', 'd', 'a', 'b'], name='something')
index.name

'something'

In [348]:
index = pd.Index(list(range(5)), name='rows')
columns = pd.Index(['A', 'B', 'C'], name='cols')
df = pd.DataFrame(np.random.randn(5, 3), index=index, columns=columns)
df

cols,A,B,C
rows,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0.207167,-0.763919,0.245915
1,-0.774261,0.216112,-1.267829
2,-1.675563,0.125447,1.599144
3,-1.004635,0.663763,0.708422
4,1.73611,0.803509,-0.142926


### Setting metadata

Indexes are "mostly immutable", but it is possible to set and change their `name` attribute. You can use `rename` or `set_names` to set it; both return a copy by default:

In [349]:
ind = pd.Index([1, 2, 3])
ind.rename("apple")

Int64Index([1, 2, 3], dtype='int64', name='apple')

In [350]:
ind

Int64Index([1, 2, 3], dtype='int64')

In [351]:
ind.set_names(["apple"], inplace=True)
ind

Int64Index([1, 2, 3], dtype='int64', name='apple')

In [352]:
ind.name = "bob" # same as inplace=True
ind

Int64Index([1, 2, 3], dtype='int64', name='bob')

`set_names`, `set_levels`, and `set_codes` also take an optional `level` argument:

In [353]:
index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])
index

MultiIndex([(0, 'one'),
            (0, 'two'),
            (1, 'one'),
            (1, 'two'),
            (2, 'one'),
            (2, 'two')],
           names=['first', 'second'])

In [354]:
index.levels[1]

Index(['one', 'two'], dtype='object', name='second')

In [355]:
index.set_levels(["a", "b"], level=1)

MultiIndex([(0, 'a'),
            (0, 'b'),
            (1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b')],
           names=['first', 'second'])
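The cell above demonstrates `level` with `set_levels`. As a companion sketch, `set_names` accepts the same argument to rename a single level (the new name `'position'` is illustrative):

```python
import pandas as pd

index = pd.MultiIndex.from_product([range(3), ['one', 'two']],
                                   names=['first', 'second'])

# Rename only the level currently called 'second'; a new object is returned.
renamed = index.set_names('position', level='second')

print(list(renamed.names))
print(list(index.names))  # the original index is unchanged
```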

### Set operations on Index objects

The two main operations are `union (|)` and `intersection (&)`. These can be directly called as instance methods or used via overloaded operators. Difference is provided via the `.difference()` method.

In [356]:
a = pd.Index(['c', 'b', 'a'])
b = pd.Index(['c', 'e', 'd'])
a | b

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [357]:
a & b

Index(['c'], dtype='object')

In [358]:
a.difference(b)

Index(['a', 'b'], dtype='object')

Also available is the `symmetric_difference (^)` operation, which returns elements that appear in either `idx1` or `idx2`, but not in both. This is equivalent to the Index created by `idx1.difference(idx2).union(idx2.difference(idx1))`, with duplicates dropped.



In [359]:
a ^ b # same as a.symmetric_difference(b)

Index(['a', 'b', 'd', 'e'], dtype='object')
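The operator forms shown above reflect the pandas version used for this notebook; newer releases moved away from treating `|`, `&`, and `^` as set operations on an `Index`. The method forms are therefore the more portable spelling. A sketch with the same two indexes:

```python
import pandas as pd

a = pd.Index(['c', 'b', 'a'])
b = pd.Index(['c', 'e', 'd'])

union = a.union(b)                   # labels in a or b
common = a.intersection(b)           # labels in both
only_a = a.difference(b)             # labels in a but not in b
either = a.symmetric_difference(b)   # labels in exactly one of a, b

print(union, common, only_a, either, sep='\n')
```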

### Missing values

Even though `Index` can hold missing values (`NaN`), it should be avoided if you do not want any unexpected results. For example, some operations exclude missing values implicitly.

`Index.fillna` fills missing values with a specified scalar value.

In [360]:
idx1 = pd.Index([1, np.nan, 3, 4])
idx1

Float64Index([1.0, nan, 3.0, 4.0], dtype='float64')

In [361]:
idx1.fillna(2)

Float64Index([1.0, 2.0, 3.0, 4.0], dtype='float64')

## Set / reset index

Occasionally you will load or create a data set into a DataFrame and want to add an index after you’ve already done so. There are a couple of different ways.



### Set an index

DataFrame has a [`set_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html#pandas.DataFrame.set_index) method which takes a column name (for a regular `Index`) or a list of column names (for a `MultiIndex`). To create a new, re-indexed DataFrame:

In [362]:
data = pd.DataFrame({'a': ['bar', 'bar', 'foo', 'foo'],
                     'b': ['one', 'two', 'one', 'two'],
                     'c': ['z', 'y', 'x', 'w'],
                     'd': list(range(1,5))})
data

Unnamed: 0,a,b,c,d
0,bar,one,z,1
1,bar,two,y,2
2,foo,one,x,3
3,foo,two,w,4


In [363]:
data2 = pd.DataFrame([['bar', 'one', 'z', 1],
                     ['bar', 'two', 'y', 2],
                     ['foo', 'one', 'x', 3],
                     ['foo', 'two', 'w', 4]],
                   columns=list('abcd'))
data2

Unnamed: 0,a,b,c,d
0,bar,one,z,1
1,bar,two,y,2
2,foo,one,x,3
3,foo,two,w,4


In [364]:
indexed1 = data.set_index('c')
indexed1

Unnamed: 0_level_0,a,b,d
c,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
z,bar,one,1
y,bar,two,2
x,foo,one,3
w,foo,two,4


In [365]:
indexed2 = data.set_index(['a', 'b'])
indexed2

Unnamed: 0_level_0,Unnamed: 1_level_0,c,d
a,b,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,z,1
bar,two,y,2
foo,one,x,3
foo,two,w,4


In [366]:
frame = data.set_index('c', drop=False)
frame

Unnamed: 0_level_0,a,b,c,d
c,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
z,bar,one,z,1
y,bar,two,y,2
x,foo,one,x,3
w,foo,two,w,4


The `append` keyword option allows you to keep the existing index and append the given columns to a MultiIndex:

In [367]:
frame = frame.set_index(['a', 'b'], append=True)
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,c,d
c,a,b,Unnamed: 3_level_1,Unnamed: 4_level_1
z,bar,one,z,1
y,bar,two,y,2
x,foo,one,x,3
w,foo,two,w,4


Other options in `set_index` allow you to keep the index columns in the frame (`drop=False`) or to set the index in place, without creating a new object (`inplace=True`):

In [368]:
data.set_index('c', drop=False)

Unnamed: 0_level_0,a,b,c,d
c,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
z,bar,one,z,1
y,bar,two,y,2
x,foo,one,x,3
w,foo,two,w,4


In [369]:
data.set_index(['a', 'b'], inplace=True)
data

Unnamed: 0_level_0,Unnamed: 1_level_0,c,d
a,b,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,z,1
bar,two,y,2
foo,one,x,3
foo,two,w,4


### Reset the index

As a convenience, DataFrame has a method called [`reset_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html#pandas.DataFrame.reset_index) which transfers the index values into the DataFrame’s columns and sets a simple integer index. This is the inverse operation of [`set_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html#pandas.DataFrame.set_index).

In [371]:
data.reset_index()

Unnamed: 0,a,b,c,d
0,bar,one,z,1
1,bar,two,y,2
2,foo,one,x,3
3,foo,two,w,4


The output is more similar to a SQL table or a record array. The names for the columns derived from the index are the ones stored in the `names` attribute.

You can use the `level` keyword to remove only a portion of the index:

In [372]:
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,c,d
c,a,b,Unnamed: 3_level_1,Unnamed: 4_level_1
z,bar,one,z,1
y,bar,two,y,2
x,foo,one,x,3
w,foo,two,w,4


In [373]:
frame.reset_index(level=1)

Unnamed: 0_level_0,Unnamed: 1_level_0,a,c,d
c,b,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
z,one,bar,z,1
y,two,bar,y,2
x,one,foo,x,3
w,two,foo,w,4


`reset_index` takes an optional parameter `drop` which if true simply discards the index, instead of putting index values in the DataFrame’s columns.
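A short sketch contrasting the two behaviours, using an illustrative frame with a named index:

```python
import pandas as pd

frame = pd.DataFrame({'c': ['z', 'y', 'x', 'w'], 'd': [1, 2, 3, 4]},
                     index=pd.Index(['bar', 'bar', 'foo', 'foo'], name='a'))

# Default: the index values become a regular column named 'a'.
kept = frame.reset_index()

# drop=True: the index values are discarded entirely.
dropped = frame.reset_index(drop=True)

print(kept)
print(dropped)
```

In both cases the result carries a fresh integer index starting at 0.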