<h1>Indexing and selecting data</h1>

<p>The axis labeling information in pandas objects serves many purposes:</p>

<ul>
<li><p>Identifies data (i.e. provides <em>metadata</em>) using known indicators,
important for analysis, visualization, and interactive console display.</p></li>
<li><p>Enables automatic and explicit data alignment.</p></li>
<li><p>Allows intuitive getting and setting of subsets of the data set.</p></li>
</ul>

<p>In this section, we will focus on the final point: namely, how to slice, dice, and generally get and set subsets of pandas objects.
The primary focus will be on Series and DataFrame as they have received more development attention in this area.</p>

<p>Note</p>

<p>The Python and NumPy indexing operators <code>[]</code> and attribute operator <code>.</code> provide quick and easy access to pandas data structures across a wide range of use cases.
This makes interactive work intuitive, as there’s little new to learn if you already know how to deal with Python dictionaries and NumPy arrays.
However, since the type of the data to be accessed isn’t known in advance, directly using standard operators has some optimization limits.
For production code, we recommend that you take advantage of the optimized
pandas data access methods exposed in this chapter.</p>

<p>Warning</p>

<p>Whether a setting operation returns a copy or a reference may
depend on the context. This situation typically arises with <em>chained assignment</em>, which
should be avoided. See <a href="https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-view-versus-copy">Returning a View versus Copy</a>.</p>
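<p>As a minimal sketch of the alternative to chained assignment (the frame and its column names here are illustrative assumptions, not taken from this guide), a single <code>.loc</code> call resolves the row mask and the column label in one step:</p>

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Chained form (avoid): df[df["A"] > 1]["B"] = 0
# The first [] may return a copy, so the assignment may not reach df.

# Single .loc call: rows and column are selected together,
# so the assignment is applied to df itself.
df.loc[df["A"] > 1, "B"] = 0

print(df["B"].tolist())
```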

<p>See the <a href="https://pandas.pydata.org/docs/user_guide/advanced.html#advanced">MultiIndex / Advanced Indexing</a> for <code>MultiIndex</code> and more advanced indexing documentation.</p>

<p>See the <a href="https://pandas.pydata.org/docs/user_guide/cookbook.html#cookbook-selection">cookbook</a> for some advanced strategies.</p>

<h2>Different choices for indexing</h2>

<p>Object selection has had a number of user-requested additions in order to
support more explicit location-based indexing. pandas now supports three types
of multi-axis indexing.</p>

<ul>
<li><p><code>.loc</code> is primarily label based, but may also be used with a boolean array. <code>.loc</code> will raise <code>KeyError</code> when the items are not found. Allowed inputs are:</p>

<blockquote>
<div><ul>

<li><p>A single label, e.g. <code>5</code> or <code>'a'</code> (Note that <code>5</code> is interpreted as a <em>label</em> of the index. This use is <strong>not</strong> an integer position along the index.).</p></li>

<li><p>A list or array of labels <code>['a', 'b', 'c']</code>.</p></li>

<li><p>A slice object with labels <code>'a':'f'</code> (Note that contrary to usual Python slices, <strong>both</strong> the start and the stop are included, when present in the index! See <a href="https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-slicing-with-labels">Slicing with labels</a>
and <a href="https://pandas.pydata.org/docs/user_guide/advanced.html#advanced-endpoints-are-inclusive">Endpoints are inclusive</a>.)</p></li>

<li><p>A boolean array (any <code>NA</code> values will be treated as <code>False</code>).</p></li>

<li><p>A <code>callable</code> function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).</p></li>

<li><p>A tuple of row (and column) indices whose elements are one of the
above inputs.</p></li>

</ul>
</div></blockquote>

<p>See more at <a href="https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-label">Selection by Label</a>.</p>
</li>

<li><p><code>.iloc</code> is primarily integer position based (from <code>0</code> to <code>length-1</code> of the axis), but may also be used with a boolean
array.
<code>.iloc</code> will raise <code>IndexError</code> if a requested
indexer is out-of-bounds, except <em>slice</em> indexers, which allow
out-of-bounds indexing (this conforms with Python/NumPy <em>slice</em>
semantics). Allowed inputs are:</p>

<blockquote>
<div><ul>

<li><p>An integer e.g. <code>5</code>.</p></li>

<li><p>A list or array of integers <code>[4, 3, 0]</code>.</p></li>

<li><p>A slice object with ints <code>1:7</code>.</p></li>

<li><p>A boolean array (any <code>NA</code> values will be treated as <code>False</code>).</p></li>

<li><p>A <code>callable</code> function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).</p></li>

<li><p>A tuple of row (and column) indices whose elements are one of the
above inputs.</p></li>

</ul>
</div></blockquote>

<p>See more at <a href="https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-integer">Selection by Position</a>, <a href="https://pandas.pydata.org/docs/user_guide/advanced.html#advanced">Advanced Indexing</a> and <a href="https://pandas.pydata.org/docs/user_guide/advanced.html#advanced-advanced-hierarchical">Advanced Hierarchical</a>.</p>
</li>

<li><p><code>.loc</code>, <code>.iloc</code>, and also <code>[]</code> indexing can accept a <code>callable</code> as indexer. See more at <a href="https://pandas.pydata.org/docs/user_guide/indexing.html#indexing-callable">Selection By Callable</a>.</p>

<div>
<p>Note</p>

<p>Destructuring tuple keys into row (and column) indexes occurs
<em>before</em> callables are applied, so you cannot return a tuple from
a callable to index both rows and columns.</p>
</div>
</li>
</ul>
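<p>The differences between <code>.loc</code> and <code>.iloc</code> listed above can be sketched as follows. The Series and its integer labels are assumptions chosen so that labels and positions do not coincide:</p>

```python
import pandas as pd

# Integer labels deliberately differ from positions (illustrative data).
s = pd.Series(["x", "y", "z"], index=[5, 6, 7])

by_label = s.loc[5]       # 5 is a label, not a position
by_position = s.iloc[2]   # third element by position

# Label slices include both endpoints; positional slices exclude the stop.
label_slice = s.loc[5:6]
position_slice = s.iloc[0:1]

# Out-of-bounds positional slices are truncated rather than raising.
truncated = s.iloc[1:100]

# A missing label raises KeyError with .loc.
try:
    s.loc[0]
    loc_error = None
except KeyError:
    loc_error = "KeyError"

# An out-of-bounds integer raises IndexError with .iloc.
try:
    s.iloc[10]
    iloc_error = None
except IndexError:
    iloc_error = "IndexError"
```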

<p>Getting values from an object with multi-axes selection uses the following
notation (using <code>.loc</code> as an example, but the following applies to <code>.iloc</code> as well).
Any of the axes accessors may be the null slice <code>:</code>.
Axes left out of the specification are assumed to be <code>:</code>, e.g. <code>p.loc['a']</code> is equivalent to <code>p.loc['a', :]</code>.</p>
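<p>A small sketch of this equivalence, assuming a hypothetical two-row frame named <code>p</code> to match the notation above:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the labels are illustrative only.
p = pd.DataFrame(
    np.arange(6).reshape(2, 3), index=["a", "b"], columns=["x", "y", "z"]
)

row_implicit = p.loc["a"]      # column axis omitted, assumed to be ':'
row_explicit = p.loc["a", :]   # same selection, spelled out

# The null slice on the row axis selects a whole column.
col = p.loc[:, "y"]
```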

In [6]:
import pandas as pd
import numpy as np

In [3]:
ser = pd.Series(range(5), index=list("abcde"))

In [4]:
ser.loc[["a", "c", "e"]]

Unnamed: 0,0
a,0
c,2
e,4


In [7]:
df = pd.DataFrame(np.arange(25).reshape(5, 5), index=list("abcde"), columns=list("abcde"))

In [8]:
df.loc[["a", "c", "e"], ["b", "d"]]

Unnamed: 0,b,d
a,1,3
c,11,13
e,21,23
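
<p>The callable form of indexing described in the list above can be sketched in the same way. This rebuilds the same 5x5 frame so the example stands alone; the lambdas are illustrative choices. Because tuple keys are destructured before callables are applied, each axis gets its own callable:</p>

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(25).reshape(5, 5), index=list("abcde"), columns=list("abcde")
)

# The callable receives the calling DataFrame and returns a boolean
# Series, which .loc then uses to select rows.
rows = df.loc[lambda d: d["a"] > 5]

# One callable per axis: a row mask and a list of column labels.
sub = df.loc[lambda d: d["a"] > 5, lambda d: ["b", "d"]]
```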


<h2>Basics</h2>

<p>As mentioned when introducing the data structures in the last section, the primary function of indexing with <code>[]</code> (a.k.a. <code>__getitem__</code> for those familiar with implementing class behavior in Python) is selecting out lower-dimensional slices.
The following table shows return type values when indexing pandas objects with <code>[]</code>:</p>

<table>
<colgroup>
<col style="width: 25.0%">
<col style="width: 25.0%">
<col style="width: 50.0%">
</colgroup>
<thead>
<tr><th><p>Object Type</p></th>
<th><p>Selection</p></th>
<th><p>Return Value Type</p></th>
</tr>
</thead>
<tbody>
<tr><td><p>Series</p></td>
<td><p><code>series[label]</code></p></td>
<td><p>scalar value</p></td>
</tr>
<tr><td><p>DataFrame</p></td>
<td><p><code>frame[colname]</code></p></td>
<td><p><code>Series</code> corresponding to colname</p></td>
</tr>
</tbody>
</table>
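
<p>The return types in the table can be checked with a short sketch. The frame, column names, and row labels below are assumptions made for illustration:</p>

```python
import pandas as pd

# Hypothetical frame with string row labels.
frame = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]}, index=["r1", "r2"])

column = frame["A"]    # frame[colname] returns a Series
value = column["r1"]   # series[label] returns a scalar

print(type(column).__name__, type(value).__name__)
```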

<p>Here we construct a simple time series data set to use for illustrating the
indexing functionality:</p>

In [9]:
dates = pd.date_range('1/1/2000', periods=8)

In [11]:
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

In [12]:
df

Unnamed: 0,A,B,C,D
2000-01-01,-1.62725,-0.015844,0.052282,0.320231
2000-01-02,-0.574977,-0.352076,1.282027,1.860257
2000-01-03,1.921966,0.333497,-0.799281,0.176419
2000-01-04,1.052191,1.375357,0.337722,-1.372481
2000-01-05,2.371898,-0.521351,0.034824,0.308666
2000-01-06,-0.304609,-2.02946,1.115783,-0.449982
2000-01-07,-0.083503,-0.818043,-0.175667,1.220953
2000-01-08,0.278752,-1.003495,0.895828,-0.261049


<p>Note</p>

<p>None of the indexing functionality is time series specific unless
specifically stated.</p>


<p>Thus, as per above, we have the most basic indexing using <code>[]</code>:</p>

In [13]:
s = df['A']

In [14]:
s[dates[5]]

np.float64(-0.30460922930712253)

<p>You can pass a list of columns to <code>[]</code> to select columns in that order.
If a column is not contained in the DataFrame, an exception will be
raised. Multiple columns can also be set in this manner:</p>

In [15]:
df

Unnamed: 0,A,B,C,D
2000-01-01,-1.62725,-0.015844,0.052282,0.320231
2000-01-02,-0.574977,-0.352076,1.282027,1.860257
2000-01-03,1.921966,0.333497,-0.799281,0.176419
2000-01-04,1.052191,1.375357,0.337722,-1.372481
2000-01-05,2.371898,-0.521351,0.034824,0.308666
2000-01-06,-0.304609,-2.02946,1.115783,-0.449982
2000-01-07,-0.083503,-0.818043,-0.175667,1.220953
2000-01-08,0.278752,-1.003495,0.895828,-0.261049


In [17]:
df[['B', 'A']] = df[['A', 'B']]

In [18]:
df

Unnamed: 0,A,B,C,D
2000-01-01,-0.015844,-1.62725,0.052282,0.320231
2000-01-02,-0.352076,-0.574977,1.282027,1.860257
2000-01-03,0.333497,1.921966,-0.799281,0.176419
2000-01-04,1.375357,1.052191,0.337722,-1.372481
2000-01-05,-0.521351,2.371898,0.034824,0.308666
2000-01-06,-2.02946,-0.304609,1.115783,-0.449982
2000-01-07,-0.818043,-0.083503,-0.175667,1.220953
2000-01-08,-1.003495,0.278752,0.895828,-0.261049


<p>You may find this useful for applying a transform (in-place) to a subset of the columns.</p>
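
<p>As a sketch of such a transform, assuming a small hypothetical frame and a scaling factor chosen only for illustration, a list of columns can appear on both sides of the assignment. The same example also shows the exception raised when a requested column is missing:</p>

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0], "C": [5.0, 6.0]})

# Scale only columns A and B; column C is left untouched.
df[["A", "B"]] = df[["A", "B"]] * 10

# Selecting a column that does not exist raises KeyError.
try:
    df[["A", "Z"]]
    missing_error = None
except KeyError:
    missing_error = "KeyError"
```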

<div class="admonition warning">
<p class="admonition-title">Warning</p>
<p>pandas aligns all AXES when setting <code class="docutils literal notranslate"><span class="pre">Series</span></code> and <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code> from <code class="docutils literal notranslate"><span class="pre">.loc</span></code>.</p>
<p>This will <strong>not</strong> modify <code class="docutils literal notranslate"><span class="pre">df</span></code> because the columns are aligned before the values are assigned.</p>

In [19]:
df[['A', 'B']]

Unnamed: 0,A,B
2000-01-01,-0.015844,-1.62725
2000-01-02,-0.352076,-0.574977
2000-01-03,0.333497,1.921966
2000-01-04,1.375357,1.052191
2000-01-05,-0.521351,2.371898
2000-01-06,-2.02946,-0.304609
2000-01-07,-0.818043,-0.083503
2000-01-08,-1.003495,0.278752


In [20]:
df.loc[:, ['B', 'A']] = df[['A', 'B']]

In [21]:
df[['A', 'B']]

Unnamed: 0,A,B
2000-01-01,-0.015844,-1.62725
2000-01-02,-0.352076,-0.574977
2000-01-03,0.333497,1.921966
2000-01-04,1.375357,1.052191
2000-01-05,-0.521351,2.371898
2000-01-06,-2.02946,-0.304609
2000-01-07,-0.818043,-0.083503
2000-01-08,-1.003495,0.278752


<p>The correct way to swap column values is by using raw values:</p>

In [22]:
df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()

In [23]:
df[['A', 'B']]

Unnamed: 0,A,B
2000-01-01,-1.62725,-0.015844
2000-01-02,-0.574977,-0.352076
2000-01-03,1.921966,0.333497
2000-01-04,1.052191,1.375357
2000-01-05,2.371898,-0.521351
2000-01-06,-0.304609,-2.02946
2000-01-07,-0.083503,-0.818043
2000-01-08,0.278752,-1.003495


<p>However, pandas does not align AXES when setting <code class="docutils literal notranslate"><span class="pre">Series</span></code> and <code class="docutils literal notranslate"><span class="pre">DataFrame</span></code> from <code class="docutils literal notranslate"><span class="pre">.iloc</span></code>
because <code class="docutils literal notranslate"><span class="pre">.iloc</span></code> operates by position.</p>
<p>This will modify <code class="docutils literal notranslate"><span class="pre">df</span></code> because the columns are not aligned before the values are assigned.</p>

In [24]:
df[['A', 'B']]

Unnamed: 0,A,B
2000-01-01,-1.62725,-0.015844
2000-01-02,-0.574977,-0.352076
2000-01-03,1.921966,0.333497
2000-01-04,1.052191,1.375357
2000-01-05,2.371898,-0.521351
2000-01-06,-0.304609,-2.02946
2000-01-07,-0.083503,-0.818043
2000-01-08,0.278752,-1.003495


In [25]:
df.iloc[:, [1, 0]] = df[['A', 'B']]

In [26]:
df[['A', 'B']]

Unnamed: 0,A,B
2000-01-01,-0.015844,-1.62725
2000-01-02,-0.352076,-0.574977
2000-01-03,0.333497,1.921966
2000-01-04,1.375357,1.052191
2000-01-05,-0.521351,2.371898
2000-01-06,-2.02946,-0.304609
2000-01-07,-0.818043,-0.083503
2000-01-08,-1.003495,0.278752
