### Selecting Data

In [1]:
import numpy as np
import pandas as pd

Let's build a data frame:

In [2]:
arr = np.arange(9).reshape(3, 3)
df = pd.DataFrame(
    arr, 
    columns=['c1', 'c2', 'c3'], 
    index=['r1', 'r2', 'r3'])
df

Unnamed: 0,c1,c2,c3
r1,0,1,2
r2,3,4,5
r3,6,7,8


We can think of this as a Series of Series objects (`c1`, `c2`, `c3`).

And the index for `df` (the row index, which all the columns share) is:

In [3]:
df.index

Index(['r1', 'r2', 'r3'], dtype='object')
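
To make the two axes concrete, here is a minimal sketch that rebuilds the same frame and looks at both sets of labels. The names `row_labels`, `col_labels` and `shares_row_index` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    columns=['c1', 'c2', 'c3'],
    index=['r1', 'r2', 'r3'])

# df.index holds the row labels, df.columns holds the column labels
row_labels = list(df.index)
col_labels = list(df.columns)

# each column is a Series that carries the same row index as the frame
shares_row_index = all(df[c].index.equals(df.index) for c in df.columns)
print(row_labels, col_labels, shares_row_index)
```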

We know we can retrieve elements from a Series object using the explicit index:

In [4]:
df['c2']

r1    1
r2    4
r3    7
Name: c2, dtype: int64

As you can see, we get the second column back, and the row index is preserved.

Note that `[]` will not use the implicit index:

In [5]:
try:
    df[0]
except KeyError as ex:
    print('KeyError:', ex)

KeyError: 0
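
If a column really has to be selected by position, one option is to go through `iloc`, or to look the label up in `df.columns` first. The sketch below assumes the same frame as above; `first_col` and `first_label` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    columns=['c1', 'c2', 'c3'],
    index=['r1', 'r2', 'r3'])

# select the first column by position: all rows, column position 0
first_col = df.iloc[:, 0]

# alternatively, translate the position into a label, then use []
first_label = df.columns[0]
same_col = df[first_label]
print(first_label, first_col.tolist())
```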


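Now that we have a single column, essentially a `Series`, we can easily get to a specific element of the column by using either the explicit or the implicit index. (With a non-integer index such as this one, recent pandas versions emit a `FutureWarning` for positional access through `[]`, which is one more reason to prefer `iloc`.)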

In [6]:
df['c2'][1]

np.int64(4)

In [7]:
df['c2']['r2']

np.int64(4)

But just like we saw with `Series` objects, the preferred way to access data in a `DataFrame` is by using the `loc` and `iloc` attributes.

The difference is that we are now using NumPy-style array access (in a sense), and recall that with NumPy 2-D arrays we access data using `[row, column]`:

In [8]:
df.values[1, 2]

np.int64(5)

When we use `iloc` on a data frame, we are essentially following the same `row, column` pattern:

In [9]:
df.iloc[1, 2]

np.int64(5)
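
As a quick sanity check of that correspondence, the sketch below compares `iloc` against plain NumPy indexing for every cell. It uses `to_numpy()` in place of `values`; the name `matches` is just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    columns=['c1', 'c2', 'c3'],
    index=['r1', 'r2', 'r3'])

arr = df.to_numpy()  # the underlying 2-D NumPy array

# iloc[i, j] and arr[i, j] address the same cell for every position
matches = all(
    df.iloc[i, j] == arr[i, j]
    for i in range(arr.shape[0])
    for j in range(arr.shape[1]))
print(matches)
```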

And, in fact, the same holds even if we use the explicit index:

In [10]:
df

Unnamed: 0,c1,c2,c3
r1,0,1,2
r2,3,4,5
r3,6,7,8


In [11]:
df.loc['r2', 'c3']

np.int64(5)

As you can see, this is very different from using `[]`, where we treat the data frame as a series of series rather than as a NumPy 2-D array. Just as I did with `Series` objects, I recommend that you stay away from the `[]` notation and rely on `loc` and `iloc` instead.
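
To see the two styles side by side, the following sketch fetches the same element with chained `[]`, with `loc`, and with `iloc`. It also uses the scalar accessors `at` and `iat`, which are not covered in the text above and are shown here only as a related option.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    columns=['c1', 'c2', 'c3'],
    index=['r1', 'r2', 'r3'])

chained = df['c2']['r2']        # column first, then row (series of series)
by_label = df.loc['r2', 'c2']   # row label, column label
by_position = df.iloc[1, 1]     # row position, column position

# scalar-only accessors: same [row, column] order as loc / iloc
scalar_label = df.at['r2', 'c2']
scalar_position = df.iat[1, 1]
print(chained, by_label, by_position, scalar_label, scalar_position)
```

Note that the chained form lists the column before the row, while `loc`, `iloc`, `at` and `iat` all use the row first.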

Slicing and fancy indexing work the same way using `loc` and `iloc`:

In [12]:
print(df)
df.loc['r1': 'r2', 'c2': 'c3']

    c1  c2  c3
r1   0   1   2
r2   3   4   5
r3   6   7   8


Unnamed: 0,c2,c3
r1,1,2
r2,4,5


And just like `Series` slicing, note that with `loc` the endpoint of the slice is **included** in the result, unlike slicing with the implicit (positional) index using `iloc`:

In [13]:
print(df)
df.iloc[0:1, 1:2]

    c1  c2  c3
r1   0   1   2
r2   3   4   5
r3   6   7   8


Unnamed: 0,c2
r1,1
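
The endpoint difference is easiest to see by comparing shapes. In this sketch (same frame as above, illustrative variable names), the label slice and the positional slice use "matching" endpoints but return blocks of different sizes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    columns=['c1', 'c2', 'c3'],
    index=['r1', 'r2', 'r3'])

# label slices include both endpoints: rows r1..r2, columns c2..c3
by_label = df.loc['r1':'r2', 'c2':'c3']

# positional slices exclude the stop position: row 0 only, column 1 only
by_position = df.iloc[0:1, 1:2]

# to get the same block by position, the stop must be one past the end
same_block = df.iloc[0:2, 1:3]
print(by_label.shape, by_position.shape, same_block.shape)
```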


If we want to slice the columns and include all the rows, we just specify `:` for the row slice:

In [14]:
df.iloc[:, 1:2]

Unnamed: 0,c2
r1,1
r2,4
r3,7


If we want all the columns for a specific slice of rows we can use `:` for the column slice:

In [15]:
df.iloc[0:2, :]

Unnamed: 0,c1,c2,c3
r1,0,1,2
r2,3,4,5


But in this case, we can actually omit the column slice altogether:

In [16]:
df.iloc[0:2]

Unnamed: 0,c1,c2,c3
r1,0,1,2
r2,3,4,5
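
The same shortcut applies to `loc`. The sketch below checks that the long and short forms agree, and also shows what happens when a single row label is used instead of a slice. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    columns=['c1', 'c2', 'c3'],
    index=['r1', 'r2', 'r3'])

with_column_slice = df.iloc[0:2, :]
without_column_slice = df.iloc[0:2]
by_label = df.loc['r1':'r2']     # label slice, endpoint included

# a single label (not a slice) returns that row as a Series,
# indexed by the column labels
row = df.loc['r2']
print(row.tolist(), list(row.index))
```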


Fancy indexing works as expected:

In [17]:
df.loc[:, ['c1', 'c3']]

Unnamed: 0,c1,c3
r1,0,2
r2,3,5
r3,6,8


And with the implicit index:

In [18]:
df.iloc[:, [0, 2]]

Unnamed: 0,c1,c3
r1,0,2
r2,3,5
r3,6,8
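
Two details of fancy indexing are worth a small sketch: the order of the list determines the order of the result, and passing a list on both axes to `iloc` selects the full block of those rows and columns, whereas NumPy pairs the two lists element by element. The variable names here are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    columns=['c1', 'c2', 'c3'],
    index=['r1', 'r2', 'r3'])

# the list order controls the column order of the result
reordered = df.loc[:, ['c3', 'c1']]

# lists on both axes: rows r3 and r1, crossed with columns c1 and c3
picked = df.iloc[[2, 0], [0, 2]]

# NumPy instead pairs the lists up: elements (2, 0) and (0, 2)
np_pairs = df.to_numpy()[[2, 0], [0, 2]]
print(picked.to_numpy().tolist(), np_pairs.tolist())
```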


So, what if you want to index/slice using an implicit index in one axis and an explicit index in the other?

You can use a two-step process.

For example, suppose we want the first two rows, with columns `c1` and `c3`:

In [19]:
print(df)
tmp = df.iloc[0:2, :]
tmp

    c1  c2  c3
r1   0   1   2
r2   3   4   5
r3   6   7   8


Unnamed: 0,c1,c2,c3
r1,0,1,2
r2,3,4,5


In [20]:
tmp.loc[:, ['c1', 'c3']]

Unnamed: 0,c1,c3
r1,0,2
r2,3,5


But of course, we could do all this in one step:

In [21]:
df.iloc[0:2, :].loc[:, ['c1', 'c3']]

Unnamed: 0,c1,c3
r1,0,2
r2,3,5
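
Another possible approach, sketched here as an alternative to chaining, is to translate one kind of index into the other and then use a single `loc` or `iloc` call. This relies on `Index.get_indexer`, which returns the positions of the given labels; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    columns=['c1', 'c2', 'c3'],
    index=['r1', 'r2', 'r3'])

# option 1: turn the column labels into positions, then use iloc
col_positions = df.columns.get_indexer(['c1', 'c3'])
via_iloc = df.iloc[0:2, col_positions]

# option 2: turn the row positions into labels, then use loc
row_labels = df.index[0:2]
via_loc = df.loc[row_labels, ['c1', 'c3']]

# both agree with the chained version
chained = df.iloc[0:2, :].loc[:, ['c1', 'c3']]
print(via_iloc.equals(chained), via_loc.equals(chained))
```

A single indexer call like this also tends to be the safer form when the selection is later used as the target of an assignment.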


Of course, you can replace values in the data frame using an assignment operation, just like we saw with `Series` and NumPy arrays:

In [22]:
df

Unnamed: 0,c1,c2,c3
r1,0,1,2
r2,3,4,5
r3,6,7,8


In [23]:
df.iloc[0, 0] = -10
df

Unnamed: 0,c1,c2,c3
r1,-10,1,2
r2,3,4,5
r3,6,7,8


We can even assign to a slice, as long as the replacement is an array (or data frame) of the same shape, or one that can be broadcast to that shape.

In [24]:
df.loc['r1': 'r2', 'c1': 'c2']

Unnamed: 0,c1,c2
r1,-10,1
r2,3,4


In [25]:
df.loc['r1': 'r2', 'c1': 'c2'] = np.array([10, 20, 30, 40]).reshape(2, 2)
df

Unnamed: 0,c1,c2,c3
r1,10,20,2
r2,30,40,5
r3,6,7,8


With broadcasting we could assign a scalar value:

In [26]:
df.loc['r1': 'r2', 'c1': 'c2'] = -100
df

Unnamed: 0,c1,c2,c3
r1,-100,-100,2
r2,-100,-100,5
r3,6,7,8


Or even broadcasting from a 1-D array with 2 elements (or even just a Python list):

In [27]:
df.loc['r1': 'r2', 'c1': 'c2'] = [100, 200]
df

Unnamed: 0,c1,c2,c3
r1,100,200,2
r2,100,200,5
r3,6,7,8
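
To see why `[100, 200]` ended up repeated on every row, the sketch below uses NumPy's `broadcast_to` to expand the values to the shape of the 2 x 2 block. It also shows one way to fill the block down the columns instead, by building the full 2 x 2 array explicitly before assigning it. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(9).reshape(3, 3),
    columns=['c1', 'c2', 'c3'],
    index=['r1', 'r2', 'r3'])

block_shape = (2, 2)

# a 1-D array of length 2 is repeated for each row of the block
across_rows = np.broadcast_to(np.array([100, 200]), block_shape)

# a (2, 1) array is repeated for each column of the block
down_columns = np.broadcast_to(np.array([[100], [200]]), block_shape)

# assign an explicit 2 x 2 copy so no implicit broadcasting is needed
df.loc['r1':'r2', 'c1':'c2'] = down_columns.copy()
print(across_rows.tolist())
print(df)
```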


We can also replace with another Pandas `DataFrame` or `Series`, but when we do we have to be careful because of the explicit indexes!

Consider this series:

In [28]:
ser = pd.Series([-10, -20], index=['n1', 'n2'])
ser

n1   -10
n2   -20
dtype: int64

Now let's use it to replace a 2 x 2 slice of `df`:

In [29]:
df.iloc[0:2, 0:2]

Unnamed: 0,c1,c2
r1,100,200
r2,100,200


In [30]:
df.iloc[0:2, 0:2] = ser
df

Unnamed: 0,c1,c2,c3
r1,-10,-20,2
r2,-10,-20,5
r3,6,7,8

Here the values were broadcast by position, and the labels `n1` and `n2` played no part. To make that intent explicit, and to rule out any label alignment, we can assign the underlying NumPy array instead:


In [31]:
df.iloc[0:2, 0:2] = ser.values
df

Unnamed: 0,c1,c2,c3
r1,-10,-20,2
r2,-10,-20,5
r3,6,7,8


We can also use boolean masking to select elements, but we'll come back to that later.

Pandas data selection can get more complicated.

If you're interested in reading up more on it, you can look at the Pandas docs:

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html